Optimization Methods for Query Federation on Linked Data

Télécharger le fichier original (Mémoire de fin d’études)

Federated Query Processing on Linked Data

Federated query processing, which is also called query federation, is based on divid-ing a query into its subqueries and distributing the query execution of them over the SPARQL endpoints of the selected data sources. The intermediate results from the data sources are aggregated and the final results are generated. These processes are performed with a federated query engine. The engine performs three main steps which are data source selection, query optimization, and query execution. Data source se-lection selects the relevant data sources for each triple pattern of a query. Query optimization is responsible for grouping the triple patterns, deciding the join method, and ordering the triple patterns. Query optimization is substantially important in query federation, because the subqueries and the intermediate results are transmitted over the Web of Data. The last step, query execution, is dedicated to the execution of the query operators defined by the optimizer and preparation of the result set.
In the following subsections, we synthesize data source selection methods, join meth-ods, and query optimization in query federation by surveying the promising engines in the literature.

Data Source Selection

We classify the data source selection methods as follows: (i) predicate-based selection, (ii) type-based selection, (iii) rule-based selection, and (iv) SPARQL ASK queries. All these methods except the last method need metadata catalogs. For this reason, we first discuss the metadata catalogs in query federation and then propose our classification.
Metadata catalogs can be defined as SPARQL endpoint descriptions that describe various properties about the data source belonging to this endpoint. The existing query federation engines use three types of metadata catalogs: (i) service descriptions (Quilitz and Leser, 2008), (ii) VoID (Vocabulary of Interlinked Datasets) descriptions (Alexander and Hausenblas, 2009), and (iii) list of predicates. We want to remark that dataset and data source are used interchangeably.
• Service descriptions: A service description provides metadata about the RDF data and cover some statistical information such as total triples and number of triples with a predicate. In other words, a service description specifies the information about the data source, which means a set of RDF triples, that is published by a single provider.
• VoID descriptions: VoID descriptions are similar to service descriptions that are used to provide metadata about the RDF data and cover some statistics about it. Furthermore, there is another concept in VoID descriptions which is called linkset. A linkset describes a set of RDF triples where all subjects refer to one dataset and all objects belong to another dataset (Alexander and Hausenblas, 2009). Thus, VoID descriptions can be used to describe the metadata of RDF datasets with the interlinking to other datasets. Moreover, statistics about the datasets can be defined in VoID descriptions as in service descriptions. Number of triples and number of instances of a class or property are some examples of the statistics here. Cyganiak et al. (2011) proposed a VoID guide to data publishers and consumers. Besides, Charalambidis et al. (2015a) proposed an extension of VoID descriptions which introduces new concepts in order to provide more detailed descriptions.
• List of predicates: List of predicates is also used as a metadata catalog. Although they are useful in order to decide the relevant data sources of a query, they do not include statistical information about the data.
As mentioned in the beginning of this section, according to our classification, the data source selection methods in query federation are divided into four basic categories: predicate-based selection, type-based selection, rule-based selection, and SPARQL ASK queries.
• Predicate-based selection: It is based on selecting the relevant data sources of a triple pattern by matching its predicate with the covered predicates in the metadata catalog.
• Type-based selection: This type of data source selection uses the type definitions (rdf:type) in the metadata catalogs in order to select the relevant data sources.
• Rule-based selection: This method selects the relevant data sources according to defined rules which are generated by analyzing the relations between the triple patterns of a query. First two categories and this category are not mutually dis-joint, as rule-based selection includes predicate-based and type-based selections.
• SPARQL ASK queries: A SPARQL ASK query returns a boolean indicating whether a query pattern matches or not4. Thus, data sources of a query can be selected by sending SPARQL ASK queries to the candidate endpoints. If the result of the query is TRUE, this data source is selected as a relevant data source.
DARQ (Quilitz and Leser, 2008) uses service descriptions as metadata catalogs which must be generated before the query execution. The engine employs predicate-based data source selection. Hence it compares the predicate of a triple pattern with the defined predicates of each service description. Therefore, the engine cannot support unbound predicate triple patterns.
SPLENDID (Görlitz and Staab, 2011b) uses VoID descriptions as metadata catalogs in data source selection. Data sources are indexed for every predicate and type by using VoID statistics. However, the statistics in VoID descriptions can be insuﬃcient to select a triple pattern’s relevant data source or data sources. This situation exists especially for the triples with common predicates such as rdfs:label. Since almost all datasets use this predicate, SPLENDID sends SPARQL ASK queries for the triple patterns with bound variables which are not covered in VoID descriptions. All data sources are selected for the triple patterns which have unbound predicates. Semagrow (Charalambidis et al., 2015b) uses VoID descriptions and SPARQL ASK queries in data source selection. The authors stated that Semagrow’s data source selection is pattern-wise like SPLENDID without detailed explanation. For this reason, we accept that its data source selection method is the same with SPLENDID’s.
LHD (Wang et al., 2013) is another query engine that uses VoID descriptions to-gether with SPARQL ASK queries in data source selection. It first uses VoID descrip-tions and then sends SPARQL ASK queries to refine the selected data sources. Its data source selection is based on predicates as DARQ. However, it can support unbound predicates without eliminating irrelevant data sources as SPLENDID.
WoDQA (Akar et al., 2012; Yönyül, 2014) also uses VoID descriptions and SPARQL ASK queries in data source selection. Akar et al. (2012) proposed diﬀerent rules based on query pattern analysis, because they think that predicate-based and type-based se-lections are not enough in order to eliminate all irrelevant data sources due to having common predicates or types. These rules include three perspectives which are IRI-based analysis, linking analysis, and shared variable analysis. IRI-based analysis selects the relevant data sources by matching the IRIs in the triple pattern with the void:uriSpace and void:vocabulary properties of VoID descriptions. Therefore, IRI-based analysis in-cludes the predicate-based and type-based selection methods. By this means, WoDQA does not only selects the data sources according to the predicates or types involved in a query, it considers all the IRIs in a query. In linking analysis, WoDQA takes into consideration the linkset definitions in the VoID descriptions. Lastly, in the shared variables analysis, WoDQA considers that triple patterns with shared variables can aﬀect their related data sources. In other words, shared variables analysis aims to eliminate the irrelevant data sources.
List of predicates can also be used as a metadata catalog, as stated previously. ANAPSID (Acosta et al., 2011) keeps a list of predicates, the execution timeout prop-erty of the endpoint, and the statistics as a metadata catalog. The endpoints’ execution timeouts and statistics are collected by an adaptive sampling technique (Blanco et al., 2012; Vidal et al., 2010) or they can be collected during the query execution. ANAPSID uses predicate-based selection and chooses the endpoints whose timeouts are longer than the estimated execution time of triple patterns. However, the details are not given in their publication so it is not clear how ANAPSID estimates the triple patterns’ execution times. ADERIS (Lynden et al., 2010, 2011) also uses list of pred-icates as metadata catalogs and employs predicate-based selection. It sends SPARQL SELECT queries with DISTINCT keyword to each endpoint to find out the unique predicates. Besides, ADERIS adds data sources manually when it is impossible to do that automatically5.
FedX (Schwarte et al., 2011) sends SPARQL ASK queries for each triple pattern of a query in order to decide if it can be answered by the endpoint or not. It also caches the relevance of each triple pattern with each data source in order to minimize the SPARQL ASK queries.
Table 2.1 shows the data source selection methods in the existing federated query engines. FedX (Schwarte et al., 2011) just sends SPARQL ASK queries in order to select the data sources. Although SPLENDID (Görlitz and Staab, 2011b) uses VoID descriptions to select the data sources based on predicates and types, it sends SPARQL ASK queries when the descriptions cannot help to select the relevant data sources. DARQ (Quilitz and Leser, 2008) selects the data sources based on predicates by using service descriptions. However, it does not send SPARQL ASK queries when the service description fail to select the related data sources. LHD (Wang et al., 2013) employs predicate-based selection via VoID descriptions and sends SPARQL ASK queries in order to eliminate the irrelevant data sources. ANAPSID (Acosta et al., 2011) and ADERIS (Lynden et al., 2010, 2011) use predicate-based selection as DARQ. However, ANAPSID also considers the execution timeout information of endpoints as well. Dif-ferent from other engines, WoDQA (Akar et al., 2012; Yönyül, 2014) aims to eliminate all irrelevant data sources and employs rule-based selection which includes predicate-based and type-based selections. It also uses SPARQL ASK queries.
In conclusion, data source selection is a diﬃcult task without metadata catalogs. In this case, SPARQL ASK queries are used in order to select the relevant data sources. In addition, Saleem et al. (2016) stated that caching the results of the SPARQL ASK queries greatly reduces the data source selection time.

Join Methods

The second step in federated query processing is query optimization which covers sub-query building, join method selection, and join ordering. In order to improve the coherence of the thesis, we present the join methods in this subsection and we will discuss the query optimization in the following subsection.
Join methods in the existing engines can be categorized as follows: (i) bind join, (ii) nested loop join, (iii) merge join, (iv) hash join, (v) symmetric hash join, and (vi) multiple hash join.
Bind Join
Bind join (Haas et al., 1997) passes the bindings of the intermediate results of the outer relation to the inner relation in order to filter the result set. It is substantially eﬃcient when the intermediate results are small.
Bind join, which is also called bound join, is commonly used by federated query engines. Schwarte et al. (2011) proposed a bind join technique for FedX which uses SPARQL UNION6 constructs to group a set of mappings in a subquery to be sent to the relevant data sources in a single remote request. WoDQA (Akar et al., 2012; Yönyül, 2014) uses bind join as well. Diﬀerent from FedX, WoDQA employs bind join method with SPARQL FILTER7 expression. In addition, SPARQL 1.1 Query Language8 proposes SERVICE9 keyword to explicitly execute certain subqueries on diﬀerent SPARQL endpoints, and WoDQA takes the advantage of SERVICE keyword in its bind join. Charalambidis et al. (2015b) tested bind join with both UNION and VALUES10 expressions for Semagrow. Although bind join with UNION expression requires additional processing in order to map the binding variables and their origi-nal names, the authors stated that it provides faster completion time than VALUES expression with the query they tested. Semagrow employs UNION expressions in a parallel fashion.
DARQ (Quilitz and Leser, 2008), ANAPSID (Acosta et al., 2011), ADERIS (Lynden et al., 2011), SPLENDID (Görlitz and Staab, 2011b), and LHD (Wang et al., 2013) use bind join as well. We will discuss their usage later. Diﬀerent from others, DARQ employs bind join when the data sources have limitations on access patterns (Florescu et al., 1999). Data sources with limited access patterns need some variables in a query to be bound in order to answer the query (Quilitz and Leser, 2008). For this reason, DARQ keeps the definition of limitations on access patterns in service descriptions.
Nested Loop Join
Nested loop join, as understood from its name, performs two nested loops over the relations. The inner relation is scanned for every binding in the outer relation while the bindings which provide the join condition are included in the result. Nested loop join is used by DARQ (Quilitz and Leser, 2008) when there is no limitation on access patterns. ADERIS (Lynden et al., 2010, 2011) applies index nested loop join method in query execution which uses an index on join attributes. Hence it provides an eﬃcient access path for the inner relation.
As mentioned previously, WoDQA (Akar et al., 2012; Yönyül, 2014) uses bind join. However, it employs nested loop join in order to join the intermediate results of the relations locally. In other words, WoDQA uses nested loop join as a complementary part of bind join.
Merge Join
Merge join is based on merging two sorted relations on the join attribute. Hence this method needs both relations sorted and the join type should be an equi-join that uses only equality comparisons on the join attribute. Consider two relations with n1 and n1 tuples, respectively. The cost of nested loop join is proportional to n1 ∗ n2, while the cost of merge join is proportional to n1 + n2. Besides, the cost of sorting n pages is proportional to n log n. As a result, merge join is useful when there is an equi-join and when the relations are previously sorted. In general, sorting the relations and employing merge join is eﬃcient when the cardinalities of relations are high (Ozsu and Valduriez, 2011). Semagrow (Charalambidis et al., 2015b) can employ merge join method besides bind join. It calculates the costs of both join methods and chooses the method with the lower cost.
Hash Join
Hash join is another join method used in federated query processing. It consists two phases. A hash table of one of the relations, generally the relation with the lower cardinality, is created in the first phase. In the second phase, the other relation’s tuples are read, hashed and compared with the values in the hash table. These phases are also referred to as build phase and probe phase, respectively. A result tuple is generated when a match is found.
SPLENDID (Görlitz and Staab, 2011b) and LHD (Wang et al., 2013) use hash join which requests the results of the join argument in parallel and joins them locally. Although hash join is a symmetric join method conceptually, it is asymmetric in its operands (Wilschut and Apers, 1991).
Symmetric Hash Join
Symmetric hash join (Wilschut and Apers, 1991) is one of the earliest symmetric join algorithms. It supports pipelining in parallel database systems by maintaining a hash table for each relation. In other words, symmetric hash join creates two hash tables instead of generating single hash table as in hash join method. Thus, symmetric hash join is a non-blocking join method which produces the output of tuples as early as possible. When a tuple arrives from a relation, it is probed in the other relation’s hash table. Besides, the tuple is added to its own hash table to be used later in the process.
Double pipelined hash join (Ives et al., 1999) and XJoin (Urhan and Franklin, 2000) are the extended versions of symmetric hash join. Diﬀerent from symmetric hash join, double pipelined hash join adapts its execution when the memory is insuﬃcient and XJoin moves some parts of hash tables to the secondary storage when the memory is full.
Acosta et al. (2011) proposed a non-blocking join method, called adaptive group join (agjoin), which is based on symmetric hash join and XJoin. By this means, ANAPSID can produce results even when an endpoint becomes blocked and can hide delays from users. The authors also proposed another join method called adaptive dependent join (adjoin) which is an extended version of dependent join (Florescu et al., 1999). It sends requests to the data sources in an asynchronous fashion and hides delays from the user. In other words, it sends the request to the second data source when tuples from the first source are received. Therefore, adjoin can be accepted as bind join, because it needs the bindings in order to answer the query. Both agjoin and adjoin flush to the secondary memory when the memory is full as XJoin does.

Table des matières

1 Introduction
1.1 Context
1.2 Query Federation
1.3 Problem Position
1.4 Contributions
1.5 Thesis Organization
2 State of the Art
2.1 Introduction
2.2 Federated Query Processing on Linked Data
2.2.1 Data Source Selection
2.2.2 Join Methods
2.2.3 Query Optimization
2.2.4 Discussion on Query Federation on Linked Data
2.2.5 Challenges of Query Federation on Linked Data
2.3 Adaptive Query Optimization
2.3.1 Adaptive Query Optimization for Relational Databases
2.3.2 Adaptive Query Optimization for Query Federation on Linked Data
2.4 Conclusion
3 Optimization Methods for Query Federation on Linked Data
3.1 Introduction
3.2 Adaptive Join Operator for Federated Queries
3.2.1 Adaptive Join Operator for Single Join Queries
3.2.2 Adaptive Join Operator for Multi-Join Queries
3.3 Extended Adaptive Join Operator for Federated Queries
3.3.1 Background
3.3.2 Extended Adaptive Join Operator for Single Join Queries
3.3.3 Extended Adaptive Join Operator for Multi-Join Queries
3.4 Conclusion
4 Performance Evaluation
4.1 Introduction
4.2 Performance Evaluation of Adaptive Join Operator
4.2.1 Performance Evaluation for Single Join Queries
4.2.2 Performance Evaluation for Multi-Join Queries
4.3 Performance Evaluation of Extended Adaptive Join Operator
4.3.1 Performance Evaluation for Single Join Queries
4.3.2 Performance Evaluation for Multi-Join Queries
4.4 Conclusion
5 Conclusion and Future Work
5.1 Thesis Review
5.2 Future Work
Bibliography