Graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. Edit distance is commonly used to measure the similarity between graphs. To accelerate similarity joins based on graph edit distance, we use a preprocessing strategy that removes mismatching graph pairs with significant differences. We then propose a novel indexing method, the k-hop-tree based index, which is built for each graph by grouping, for each key node, the nodes reachable within k hops while conserving structure. Experiments on real and synthetic graph databases confirm that our method achieves good join quality in graph similarity join. In addition, the join process finishes in polynomial time.
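As one illustration of the preprocessing idea above, a pair can be discarded when the two graphs differ too much in size. The sketch below uses the standard count-based lower bound on graph edit distance, the difference in node counts plus the difference in edge counts. It assumes unit-cost edit operations that each change one node or one edge. The abstract does not say which filter the paper actually uses, so `size_lower_bound`, `prefilter`, the dict-based graph layout, and the threshold `tau` are all hypothetical.

```python
def size_lower_bound(g1, g2):
    """Lower bound on edit distance from node and edge counts alone.

    Assumption: every edit operation inserts, deletes, or relabels a single
    node or edge, so closing a size gap of d needs at least d operations.
    """
    node_gap = abs(len(g1["nodes"]) - len(g2["nodes"]))
    edge_gap = abs(len(g1["edges"]) - len(g2["edges"]))
    return node_gap + edge_gap


def prefilter(left, right, tau):
    """Keep only index pairs whose size lower bound does not exceed tau."""
    return [
        (i, j)
        for i, g1 in enumerate(left)
        for j, g2 in enumerate(right)
        if size_lower_bound(g1, g2) <= tau
    ]


# Hypothetical toy graphs: a single edge and a 5-node path.
small = {"nodes": {1, 2}, "edges": {(1, 2)}}
path5 = {"nodes": {1, 2, 3, 4, 5}, "edges": {(1, 2), (2, 3), (3, 4), (4, 5)}}

survivors = prefilter([small], [small, path5], tau=2)
```

Pairs that survive the filter would still need the exact or bounded similarity computation; the filter only rules out pairs that cannot be within the threshold.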
This paper proposes a semi-greedy framework for optimizing multi-join queries in shared-nothing systems. The plan generated by the framework comprises several pipelines, each performing several joins. The framework determines the 'optimal' number of joins to be performed in each pipeline. The decisions are based on a cost estimate of the entire processing plan. Two existing optimization algorithms are extended under the framework. An analytical model is presented and used to compare the quality of the plans produced by each optimization algorithm. Our study shows that the new algorithms outperform their unextended counterparts.
Graphs have been widely used for complex data representation in many real applications, such as social networks, bioinformatics, and computer vision. Graph similarity join has therefore become imperative for integrating noisy and inconsistent data from multiple data sources. Edit distance is commonly used to measure the similarity between graphs, and the graph similarity join problem studied in this paper is based on graph edit distance constraints. To accelerate similarity joins based on graph edit distance, we use a preprocessing strategy that removes mismatching graph pairs with significant differences. We then propose a novel indexing method, the k-hop tree based index, which is built for each graph by grouping, for each key node, the nodes reachable within k hops while conserving structure. For each candidate pair, we propose a similarity computation algorithm with boundary filtering that is both efficient and effective. Experiments on real and synthetic graph databases confirm that our method achieves good join quality in graph similarity join. In addition, the join process finishes in polynomial time.
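The k-hop grouping described above can be pictured as a breadth-first search truncated at depth k. The following is a minimal sketch under stated assumptions: the graph is undirected and stored as an adjacency dict, and each reached node records its parent and depth so that tree structure is kept. How the paper selects key nodes and encodes the tree is not given in the abstract, so `k_hop_tree` and its output shape are hypothetical.

```python
from collections import deque


def k_hop_tree(adj, key_node, k):
    """Return {node: (parent, depth)} for all nodes within k hops of key_node."""
    tree = {key_node: (None, 0)}
    queue = deque([key_node])
    while queue:
        node = queue.popleft()
        depth = tree[node][1]
        if depth == k:
            continue  # do not expand beyond k hops
        for nbr in sorted(adj.get(node, ())):
            if nbr not in tree:  # first visit gives the shortest hop count
                tree[nbr] = (node, depth + 1)
                queue.append(nbr)
    return tree


# Hypothetical toy graph: a-b, a-c, b-d, d-e.
adj = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}
two_hop = k_hop_tree(adj, "a", 2)
```

Here node `e` is three hops from `a`, so it is left out of the 2-hop tree.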
The performance of online analytical processing (OLAP) is critical for meeting the increasing requirements of massive-volume analytical applications. Typical techniques, such as in-memory processing, column storage, and join indexes, focus on high-performance storage media, efficient storage models, and reduced query processing. While these techniques serve OLAP applications effectively, they share a vital limitation: main-memory database based OLAP (MMOLAP) cannot provide high performance for large data sets. In this paper, we propose a novel memory dimension table model in which the primary keys of the dimension table can be mapped directly to dimensional tuple addresses. To achieve higher performance of dimensional tuple access, we optimize our storage model for dimension tables based on OLAP query workload features. We present the directly dimensional tuple accessing (DDTA) based join (DDTA-JOIN), a technique that optimizes query processing on the memory dimension table through direct dimensional tuple access. We also propose an optimization of the predicate tree that shortens predicate operation length by pruning useless predicate processing. Our experimental results show that the DDTA-JOIN algorithm is superior to both simulated row-store main-memory query processing and the open-source column-store main-memory database MonetDB, thanks to the reduced join cost and simple yet efficient query processing.
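The central idea above, mapping a dimension table's primary key directly to a tuple address, can be sketched as plain array indexing in place of a hash lookup. This is an illustration only, not the DDTA-JOIN algorithm itself: it assumes dense integer surrogate keys starting at 1, and the table layout, the function name `direct_access_join`, and the sample data are hypothetical.

```python
def direct_access_join(fact_rows, dim_table, predicate):
    """Join fact rows to a dimension stored as a list indexed by (key - 1).

    Assumption: dimension keys are dense integers 1..n, so the key itself
    gives the tuple's position and no hash table or index probe is needed.
    """
    result = []
    for foreign_key, measure in fact_rows:
        dim_tuple = dim_table[foreign_key - 1]  # direct address computation
        if predicate(dim_tuple):
            result.append((dim_tuple, measure))
    return result


# Hypothetical dimension table (region only) and fact rows (key, amount).
regions = [("Asia",), ("Europe",), ("Asia",)]
sales = [(1, 10), (2, 20), (3, 30), (1, 5)]

asia_rows = direct_access_join(sales, regions, lambda d: d[0] == "Asia")
asia_total = sum(amount for _, amount in asia_rows)
```

The design choice being illustrated is that each fact row costs one index computation per dimension, which is why the approach depends on keys that translate cleanly into positions.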
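This paper discusses cost models for R-Tree based spatial joins, focusing on the spatial join cost model proposed by HUANG Y W. A best/worst selection strategy is used to reduce the time complexity of the algorithm, and an improved estimation formula is proposed for the buffer-based cost model. Experiments verify that the improved model estimates more accurately than the original model.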
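In recent years, many practical applications have needed to support not only spatial join queries but also keyword search, to help users find combinations of spatial objects that both satisfy the spatial join condition and contain the specified keywords. Driven by this need, this paper defines a spatial join query with keyword search (Spatial Join with Keyword Search, abbreviated SJKS) and proposes an IR2-Tree based SJKS query processing algorithm (the IR2-TreeSJKS algorithm), which aims to combine keyword search and spatial join queries efficiently. Experiments show that the algorithm effectively supports spatial join query processing with keyword search.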