Graphs have been widely used for complex data representation in many real applications, such as social network, bioinformatics, and computer vision. Therefore, graph similarity join has become imperative for integrati...Graphs have been widely used for complex data representation in many real applications, such as social network, bioinformatics, and computer vision. Therefore, graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. The edit distance is commonly used to measure the similarity between graphs. The graph similarity join problem studied in this paper is based on graph edit distance constraints. To accelerate the similarity join based on graph edit distance, in the paper, we make use of a preprocessing strategy to remove the mismatching graph pairs with significant differences. Then a novel method of building indexes for each graph is proposed by grouping the nodes which can be reached in k hops for each key node with structure conservation, which is the k-hop tree based indexing method. As for each candidate pair, we propose a similarity computation algorithm with boundary filtering, which can be applied with good efficiency and effectiveness. Experiments on real and synthetic graph databases also confirm that our method can achieve good join quality in graph similarity join. Besides, the join process can be finished in polynomial time.展开更多
String similarity join is an essential operation of many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to comput...String similarity join is an essential operation of many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity based on a certain similarity function. The string pairs with similarity above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a unique threshold value. There are, however, several scenarios that require the support of multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate much more typos than shorter ones. Therefore, we proposed a so- lution for string similarity joins that supports different simi- larity thresholds in a single operator. In order to support dif- ferent thresholds, we devised two novel indexing techniques: partition based indexing and similarity aware indexing. To utilize the new indices and improve the join performance, we proposed new filtering methods and index probing tech- niques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while pro- viding a more flexible threshold specification.展开更多
String similarity join(SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-and...String similarity join(SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics.The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.展开更多
With the widespread deployment of indoor positioning systems, an unprecedented scale of indoor trajectories is being produced. By considering the inherent uncertainties and the text information contained in such an in...With the widespread deployment of indoor positioning systems, an unprecedented scale of indoor trajectories is being produced. By considering the inherent uncertainties and the text information contained in such an indoor trajectory, a new definition named Indoor Uncertain Semantic Trajectory is defined in this paper. In this paper, we focus on a new primitive, yet quite essential query named Indoor Uncertain Semantic Trajectory Similarity Join (IUST-Join for short), which is to match all similar pairs of indoor uncertain semantic trajectories from two sets. IUST-Join targets a number of essential indoor applications. With these applications in mind, we provide a purposeful definition of an indoor uncertain semantic trajectory similarity metric named IUS. To process IUST-Join more efficiently, both an inverted index on indoor uncertain semantic trajectories named 3IST and the first acceleration strategy are proposed to form a filtering-and-verification framework, where most invalid pairs of indoor uncertain semantic trajectories are pruned at quite low computation cost. And based on this filtering-and-verification framework, we present a highly-efficient algorithm named Indoor Uncertain Semantic Trajectory Similarity Join Processing (USP for short). In addition, lots of novel and effective acceleration strategies are proposed and embedded in the USP algorithm. Thanks to these techniques, both the time complexity and the time overhead of the USP algorithm are further reduced. The results of extensive experiments demonstrate the superior performance of the proposed work.展开更多
Refill friction stir spot welding(RFSSW)provides a novel method to join similar and/or dissimilar metallic materials without a key-hole in the center of the joint.Having the key-hole free characterization,the similar/...Refill friction stir spot welding(RFSSW)provides a novel method to join similar and/or dissimilar metallic materials without a key-hole in the center of the joint.Having the key-hole free characterization,the similar/dissimilar RFSSW joint exhibits remarkable and endurable characteristics,including high shear strength,long fatigue life,and strong corrosion resistance.In the meanwhile,as the key-hole free joint has different microstructures compared with conventional friction stir spot welding,thus the RFSSW joint shall possess different shear and fatigue fracture mechanisms,which needs further investigation.To explore the underlying failure mechanism,the similar/dissimilar metallic material joining parameters and pre-treatment,mechanical properties,as well as fracture mechanisms under this novel technology will be discussed.In details,the welding tool design,welding parameters setting,and the influence of processing on the lap shear and fatigue properties,as well as the corrosion resistance will be mainly discussed.Moreover,the roadmap of RFFSW is also discussed.展开更多
String similarity search and join are two impor- tant operations in data cleaning and integration, which ex- tend traditional exact search and exact join operations in databases by tolerating the errors and inconsiste...String similarity search and join are two impor- tant operations in data cleaning and integration, which ex- tend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been exten- sively studied in the recent decade, there is no thorough sur- vey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for siring similarity search and join. We also dis- cuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Fi- nally, we provide some open datasets and summarize some research challenges and open problems.展开更多
文摘Graphs have been widely used for complex data representation in many real applications, such as social network, bioinformatics, and computer vision. Therefore, graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. The edit distance is commonly used to measure the similarity between graphs. The graph similarity join problem studied in this paper is based on graph edit distance constraints. To accelerate the similarity join based on graph edit distance, in the paper, we make use of a preprocessing strategy to remove the mismatching graph pairs with significant differences. Then a novel method of building indexes for each graph is proposed by grouping the nodes which can be reached in k hops for each key node with structure conservation, which is the k-hop tree based indexing method. As for each candidate pair, we propose a similarity computation algorithm with boundary filtering, which can be applied with good efficiency and effectiveness. Experiments on real and synthetic graph databases also confirm that our method can achieve good join quality in graph similarity join. Besides, the join process can be finished in polynomial time.
基金This work was supported by China Scholarship Council and the National Natural Science Foundation of China (Grant Nos. 61402329 and 51378350).
文摘String similarity join is an essential operation of many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity based on a certain similarity function. The string pairs with similarity above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a unique threshold value. There are, however, several scenarios that require the support of multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate much more typos than shorter ones. Therefore, we proposed a so- lution for string similarity joins that supports different simi- larity thresholds in a single operator. In order to support dif- ferent thresholds, we devised two novel indexing techniques: partition based indexing and similarity aware indexing. To utilize the new indices and improve the join performance, we proposed new filtering methods and index probing tech- niques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while pro- viding a more flexible threshold specification.
文摘String similarity join(SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics.The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
基金supported by the National Natural Science Foundation of China under Grant Nos.U22A2025,U19A2059,61732003,61832003,U1811461,and 62102119the Key Research and Development Projects of the Ministry of Science and Technology of China under Grant No.2019YFB2101902.
文摘With the widespread deployment of indoor positioning systems, an unprecedented scale of indoor trajectories is being produced. By considering the inherent uncertainties and the text information contained in such an indoor trajectory, a new definition named Indoor Uncertain Semantic Trajectory is defined in this paper. In this paper, we focus on a new primitive, yet quite essential query named Indoor Uncertain Semantic Trajectory Similarity Join (IUST-Join for short), which is to match all similar pairs of indoor uncertain semantic trajectories from two sets. IUST-Join targets a number of essential indoor applications. With these applications in mind, we provide a purposeful definition of an indoor uncertain semantic trajectory similarity metric named IUS. To process IUST-Join more efficiently, both an inverted index on indoor uncertain semantic trajectories named 3IST and the first acceleration strategy are proposed to form a filtering-and-verification framework, where most invalid pairs of indoor uncertain semantic trajectories are pruned at quite low computation cost. And based on this filtering-and-verification framework, we present a highly-efficient algorithm named Indoor Uncertain Semantic Trajectory Similarity Join Processing (USP for short). In addition, lots of novel and effective acceleration strategies are proposed and embedded in the USP algorithm. Thanks to these techniques, both the time complexity and the time overhead of the USP algorithm are further reduced. The results of extensive experiments demonstrate the superior performance of the proposed work.
基金This work was supported by International Science and Technology Cooperation Project of Guangdong Province(Grant No.2022A0505050054)Innovation and Technology Fund(ITF)(Grant No.ITP/021/19AP)National Natural Science Foundation of China(Grant No.51905112).
文摘Refill friction stir spot welding(RFSSW)provides a novel method to join similar and/or dissimilar metallic materials without a key-hole in the center of the joint.Having the key-hole free characterization,the similar/dissimilar RFSSW joint exhibits remarkable and endurable characteristics,including high shear strength,long fatigue life,and strong corrosion resistance.In the meanwhile,as the key-hole free joint has different microstructures compared with conventional friction stir spot welding,thus the RFSSW joint shall possess different shear and fatigue fracture mechanisms,which needs further investigation.To explore the underlying failure mechanism,the similar/dissimilar metallic material joining parameters and pre-treatment,mechanical properties,as well as fracture mechanisms under this novel technology will be discussed.In details,the welding tool design,welding parameters setting,and the influence of processing on the lap shear and fatigue properties,as well as the corrosion resistance will be mainly discussed.Moreover,the roadmap of RFFSW is also discussed.
基金This work was partly supported by the National Grand Fundamental Research 973 Program of China (2015CB358700), the National Natural Science Foundation of China (Grant Nos. 61422205, 61472198), Beijing Higher Education Young Elite Teacher Project(YETP0105), Tsinghua-Tencent Joint Laboratory for Internet In- novation Technology, "NEXT Research Center", Singapore (WBS:R-252- 300-001-490), Huawei, Shenzhou, FDCT/ll6/2013/A3, MYRG105(Y1- L3)-FST13-GZ, National High-Tech R&D (863) Program of China (2012AA012600), and the Chinese Special Project of Science and Tech- nology (2013zx01039-002-002).
文摘String similarity search and join are two impor- tant operations in data cleaning and integration, which ex- tend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been exten- sively studied in the recent decade, there is no thorough sur- vey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for siring similarity search and join. We also dis- cuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Fi- nally, we provide some open datasets and summarize some research challenges and open problems.