Funding: the Special Research Fund for the China Postdoctoral Science Foundation (No. 2015M582832), the Major National Science and Technology Program (No. 2015ZX01040201), and the National Natural Science Foundation of China (No. 61371196).
Abstract: To address the problem that existing cross-modal entity resolution methods tend to ignore the high-level semantic correlations between cross-modal data, we propose a novel cross-modal entity resolution method for image and text that integrates global and fine-grained joint attention mechanisms. First, we map the cross-modal data into a common embedding space with a feature extraction network. Then, we combine a global joint attention mechanism with a fine-grained joint attention mechanism, so that the model can learn both the global semantic characteristics and the local fine-grained semantic characteristics of the cross-modal data, fully exploiting the cross-modal semantic correlation and boosting the performance of cross-modal entity resolution. Experiments on the Flickr-30K and MS-COCO datasets show that the overall R@sum score exceeds that of five state-of-the-art methods by 4.30% and 4.54%, respectively, which demonstrates the superiority of the proposed method.
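The abstract above gives no implementation detail; as a rough, dependency-free sketch of the fusion idea — a global whole-image/whole-sentence similarity blended with a token-over-region attention similarity — the following may help (all function names, the fusion weight `alpha`, and the vector shapes are hypothetical, not the authors' design):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pool(query, keys):
    """Scaled dot-product attention weights of one query over a key list."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])

def joint_similarity(global_img, global_txt, regions, tokens, alpha=0.5):
    # Global branch: cosine similarity between the whole-image and
    # whole-sentence embeddings.
    g = dot(global_img, global_txt) / (
        math.sqrt(dot(global_img, global_img)) *
        math.sqrt(dot(global_txt, global_txt)))
    # Fine-grained branch: each text token attends over image regions;
    # the attended similarity is averaged over tokens.
    f = 0.0
    for t in tokens:
        w = attention_pool(t, regions)
        attended = [sum(wi * r[d] for wi, r in zip(w, regions))
                    for d in range(len(t))]
        f += dot(t, attended)
    f /= len(tokens)
    # Weighted fusion of the two branches (alpha is a hypothetical knob).
    return alpha * g + (1 - alpha) * f
```

In a real model the embeddings would come from a trained feature extraction network; here they are plain lists of floats so the fusion arithmetic can be read in isolation.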
Funding: the National Natural Science Foundation of China (No. 61402100) and the Fundamental Research Funds for the Central Universities of China (No. 17D111205).
Abstract: The multi-signature method can improve the accuracy of entity resolution, but it introduces redundant computation in a parallel processing framework. In this paper, a multi-signature based parallel entity resolution method called multi-sig-er is proposed. The method is implemented in a MapReduce-based framework that first tags multiple signatures for each input object and uses these signatures to generate key-value pairs, then shuffles the pairs to the reduce tasks responsible for similarity computation. To improve performance, two strategies are adopted: one prunes the candidate pairs produced by the blocking technique, and the other eliminates redundancy according to the transitive property. Both strategies reduce the number of similarity computations without affecting resolution accuracy. Experimental results on real-world datasets show that the method is better suited to large datasets than to small ones, and that it is more suitable for complex similarity computation than for simple similarity matching.
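The key-value flow described above can be sketched in plain Python: the map phase emits one pair per signature per record, and the reduce phase computes similarity within each signature block while skipping pairs already compared in another block (the signature function, the threshold, and the `sim` callback are hypothetical stand-ins; the paper's transitive-property pruning is not reproduced here):

```python
from collections import defaultdict
from itertools import combinations

def signatures(record):
    """Hypothetical multi-signature function: the first token plus a
    3-character prefix of every token in the name field."""
    tokens = record["name"].lower().split()
    return {tokens[0]} | {t[:3] for t in tokens}

def map_phase(records):
    # Emit (signature, record-id) key-value pairs, as a mapper would.
    pairs = []
    for rid, rec in enumerate(records):
        for sig in signatures(rec):
            pairs.append((sig, rid))
    return pairs

def reduce_phase(kv_pairs, sim, threshold=0.5):
    # Group by signature, mimicking the shuffle to reduce tasks.
    blocks = defaultdict(list)
    for sig, rid in kv_pairs:
        blocks[sig].append(rid)
    seen = set()      # avoids recomparing a pair that shares >1 signature
    matches = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(set(ids)), 2):
            if (a, b) in seen:
                continue  # redundant comparison eliminated
            seen.add((a, b))
            if sim(a, b) >= threshold:
                matches.add((a, b))
    return matches
```

In an actual MapReduce job the `seen` set could not be shared across reducers; a common alternative is to let only the lexicographically smallest shared signature own each pair.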
Abstract: Record matching and clustering are two essential steps in entity resolution, and single-text similarity clustering based on the tf-idf (term frequency-inverse document frequency) feature often yields poor precision when resolving travel-spot entities. This paper outlines a mixed-attribute two-stage-clustering entity resolution framework (abbreviated as MATC-ER) and designs an approach that measures similarity by mixing the spot name with the spot introduction, making good use of the record information at different stages. The paper then demonstrates the framework's efficiency through comparative experiments on real travel-spot data.
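A minimal, stdlib-only illustration of the mixed-attribute idea follows — a tf-idf cosine over spot names blended with one over spot introductions (the weight `w_name` and the smoothed idf are assumptions for illustration, not the paper's exact formulation):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed idf
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    num = sum(u[t] * v.get(t, 0.0) for t in u)
    den = (math.sqrt(sum(x * x for x in u.values())) *
           math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def mixed_similarity(name_vecs, intro_vecs, i, j, w_name=0.7):
    # Weighted mix of spot-name similarity and spot-introduction
    # similarity; the weight is a hypothetical tuning knob.
    return (w_name * cosine(name_vecs[i], name_vecs[j]) +
            (1 - w_name) * cosine(intro_vecs[i], intro_vecs[j]))
```

A two-stage pipeline would cluster coarsely on one attribute first and refine with the mixed score; only the scoring function is sketched here.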
Funding: supported by the National Key R&D Program of China (2021YFB3500700), the National Natural Science Foundation of China (Grant 62172026 and Grant 62202033), the National Social Science Fund of China (22&ZD153), the Fundamental Research Funds for the Central Universities, and the State Key Laboratory of Complex & Critical Software Environment (SKLCCSE).
Abstract: This paper focuses on visual analytics support for extracting domain-specific entities from extensive scientific literature, a task for which traditional named entity resolution methods have inherent limitations. With the advent of large language models (LLMs) such as GPT-4, significant improvements over conventional machine learning approaches have been achieved, thanks to LLMs' ability to bring capabilities such as understanding multiple types of text to entity resolution. This research introduces a new visual analysis pipeline that integrates these advanced LLMs with versatile visualization and interaction designs to support batch entity resolution. Specifically, we focus on the materials science field of Metal-Organic Frameworks (MOFs) and a large data collection, CSD-MOFs. Through collaboration with domain experts in materials science, we obtain well-labeled synthesis paragraphs. We propose human-in-the-loop refinement of the entity resolution process using visual analytics techniques, which allows domain experts to interactively feed insights back into the LLM, including error analysis and interpretation of the retrieval-augmented generation (RAG) algorithm. Our evaluation, a case study of example selection for RAG, demonstrates that this visual analysis approach effectively improves the accuracy of single-document entity resolution.
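The RAG example-selection step can be sketched as nearest-neighbor retrieval of labeled paragraphs to place in the LLM prompt as few-shot examples (a bag-of-words cosine stands in here for whatever retriever the actual system uses; all names and the toy corpus are hypothetical):

```python
import math
from collections import Counter

def cosine_counts(a, b):
    """Cosine similarity between two token lists via term counts."""
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[t] * cb[t] for t in ca)
    den = (math.sqrt(sum(v * v for v in ca.values())) *
           math.sqrt(sum(v * v for v in cb.values())))
    return num / den if den else 0.0

def select_rag_examples(query_tokens, labeled_corpus, k=2):
    """Pick the k labeled paragraphs most similar to the query paragraph;
    these would be inserted into the prompt as in-context examples."""
    scored = sorted(labeled_corpus,
                    key=lambda ex: cosine_counts(query_tokens, ex["tokens"]),
                    reverse=True)
    return scored[:k]
```

A human-in-the-loop tool would let the expert inspect and override this ranking; the retrieval itself is the only piece sketched here.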
Funding: This work was partially supported by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH10F01, the National Natural Science Foundation of China under Grant Nos. U1509216, 61472099, and 61133002, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province of China under Grant No. LC2016026, and the Ministry of Education (MOE)-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology.
Abstract: Data quality is important in many data-driven applications, such as decision making, data analysis, and data mining. Recent studies focus on data cleaning techniques that delete or repair the dirty data, which may cause information loss and introduce new inconsistencies. To avoid these problems, we propose EntityManager, a general system that manages dirty data without data cleaning. The system takes the real-world entity as the basic storage unit and retrieves query results according to the quality requirements of users. It is able to handle all kinds of inconsistencies recognized by entity resolution. We elaborate on the EntityManager system, covering its architecture, data model, and query processing techniques. To process queries efficiently, our system adopts novel indices, similarity operators, and query optimization techniques. Finally, we verify the efficiency and effectiveness of the system and present future research challenges.
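As a toy sketch of the entity-as-storage-unit idea: conflicting records are kept together under one resolved entity, and a query answers from them subject to a user-supplied quality requirement (the class, the support-ratio threshold, and the attribute model are illustrative assumptions, not EntityManager's actual design):

```python
from collections import Counter, defaultdict

class EntityStore:
    """Records are grouped by resolved entity id instead of being cleaned;
    queries pick an attribute value only if its support is high enough."""

    def __init__(self):
        self.entities = defaultdict(list)

    def insert(self, entity_id, record):
        self.entities[entity_id].append(record)

    def query(self, entity_id, attr, min_support=0.5):
        recs = self.entities[entity_id]
        counts = Counter(r[attr] for r in recs if attr in r)
        if not counts:
            return None
        value, cnt = counts.most_common(1)[0]
        # Honor the user's quality requirement: answer only when the
        # dominant value covers at least min_support of the records.
        return value if cnt / len(recs) >= min_support else None
```

Keeping the dirty records intact means a later, stricter quality requirement can still be answered from the same store, which deletion or repair would have made impossible.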
Funding: The authors thank the anonymous reviewers for their inspiring doubts and helpful suggestions during the reviewing process. This work was supported by the National Basic Research Program of China (973 Program) (2012CB316201), the Fundamental Research Funds for the Central Universities (N120816001), and the National Natural Science Foundation of China (Grant Nos. 61472070, 61402213).
Abstract: Entity resolution is a key aspect of data quality and data integration, identifying which records correspond to the same real-world entity in data sources. Many existing approaches require manually designed match rules to solve the problem, which demands domain knowledge and is time consuming. We propose a novel genetic-algorithm-based entity resolution approach via active learning. It is able to learn effective match rules by logically combining comparisons of several different attributes with proper thresholds. We use active learning to reduce the amount of manually labeled data and to speed up the learning process. An extensive evaluation shows that the proposed approach outperforms state-of-the-art entity resolution approaches in accuracy.
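A match rule of the kind described — a conjunction of per-attribute similarity comparisons against learned thresholds — can be sketched as follows, together with the fitness and mutation operators a genetic algorithm would need (the Jaccard similarity, the fitness definition, and the mutation step size are illustrative assumptions):

```python
import random

def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

# A rule is a list of (attribute, threshold) conjuncts: a record pair
# matches only if every attribute similarity clears its threshold.
def rule_matches(rule, r1, r2):
    return all(jaccard(r1[attr], r2[attr]) >= th for attr, th in rule)

def fitness(rule, labeled_pairs):
    """Accuracy of a rule on (record1, record2, is_match) examples."""
    correct = sum(rule_matches(rule, r1, r2) == y
                  for r1, r2, y in labeled_pairs)
    return correct / len(labeled_pairs)

def mutate(rule, step=0.1):
    # One GA-style mutation: nudge a random conjunct's threshold,
    # clipped to [0, 1].
    i = random.randrange(len(rule))
    attr, th = rule[i]
    new = rule[:]
    new[i] = (attr, min(1.0, max(0.0, th + random.uniform(-step, step))))
    return new
```

Active learning would enter by choosing which unlabeled pairs to send to a human, growing `labeled_pairs` where the current rule population disagrees most; only the rule representation is sketched here.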
Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61502121, 61472099, and 61602129.
Abstract: There is an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data before the long wait for the final accurate query result, has drawn significant research attention. However, running OLA directly on duplicate data leads to incorrect query answers, since sampling from duplicate records over-represents the duplicated data in the sample. This violates the uniform-distribution prerequisite of most statistical theories. In this paper, we propose CrowdOLA, a novel framework that integrates online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA continuously retrieves block-level samples from the dataset and employs a crowd-based entity resolution approach to detect duplicates in each sample in a pay-as-you-go fashion. After cleaning the sample, an unbiased estimator corrects the error bias introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA strikes a good balance between efficiency and accuracy.
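As a heavily simplified sketch of the correction idea: once entity resolution has grouped the duplicates in a sample, aggregate one value per entity so duplicates no longer skew the estimate (this is not CrowdOLA's actual estimator; `resolve` stands in for the crowd-based ER step, and the per-entity averaging is an assumption):

```python
from collections import defaultdict

def dedup_sample_mean(sample, resolve):
    """Estimate the mean of `value` over distinct entities in a sample
    of possibly-duplicated records. `resolve` maps a record to an
    entity key, standing in for crowd-based entity resolution."""
    groups = defaultdict(list)
    for rec in sample:
        groups[resolve(rec)].append(rec["value"])
    # One value per entity: averaging within a group removes the
    # over-representation that duplicates cause in the raw sample mean.
    per_entity = [sum(vs) / len(vs) for vs in groups.values()]
    return sum(per_entity) / len(per_entity)
```

With three copies of a value-10 entity and one value-20 entity, the raw sample mean is 12.5 while the deduplicated mean is the intended 15.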
Abstract: Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Algorithmic approaches have been developed, with most tasks performing best under supervised learning. However, the prohibitive cost of labeling training data remains a huge obstacle for detecting duplicate query records from online sources. Furthermore, the unique combinations of noisy data with missing elements make ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share the learned common structure of similarity scoring problems between multiple sources. Although such techniques reduce the labeling cost to linear in the number of sources, their random sampling strategy does not adequately handle the common sample imbalance problem. In this paper, we present a novel multi-source active transfer learning framework that jointly selects fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the learned classifiers can be both label-economical and flexible, even for imbalanced or quality-diverse sources. We compare our method with state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm achieves impressive performance with far fewer labeled samples for record matching with numerous and varied sources.
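The active-selection step can be illustrated with per-source uncertainty sampling — from each source, request labels for the instances whose classifier score sits closest to the decision boundary (a common heuristic standing in for the paper's actual strategy; the names and budget are hypothetical):

```python
from collections import defaultdict

def select_informative(unlabeled, score, per_source_budget=1):
    """Uncertainty sampling per source: from each source, pick the
    instances whose score is closest to the 0.5 decision boundary,
    so no single large source dominates the labeling budget."""
    by_source = defaultdict(list)
    for src, x in unlabeled:
        by_source[src].append(x)
    picked = []
    for src, xs in by_source.items():
        xs.sort(key=lambda x: abs(score(x) - 0.5))
        picked.extend((src, x) for x in xs[:per_source_budget])
    return picked
```

Splitting the budget across sources is one simple way to address the imbalance the abstract mentions: a source with few or low-quality instances still gets labeled examples instead of being drowned out by random sampling.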