Funding: Supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the Fundamental Research Funds for the Central Universities of China under Grant No. N120816001, and the National Natural Science Foundation of China under Grant Nos. 61472070 and 61402213.
Abstract: This work proposes an unsupervised entity disambiguation solution based on topological features. Most existing studies leverage semantic information to resolve ambiguous references. However, semantic information is not always accessible, either because of privacy concerns or because it is too expensive to obtain. We consider the problem in a setting where only the relationships between references are available. A structural similarity algorithm based on random walk with restarts is proposed to measure the similarity of references. Disambiguation is treated as a clustering problem, and a family of graph-walk-based clustering algorithms is introduced to group ambiguous references. We evaluate our solution extensively on two real datasets and show its advantage in accuracy over two state-of-the-art approaches.
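As an illustration of the random-walk-with-restarts idea mentioned in this abstract, the following is a minimal sketch and not the authors' algorithm. It assumes an undirected, unweighted graph given as an adjacency dictionary; the function name `rwr_scores`, the `restart` probability of 0.15, and the toy graph are all hypothetical choices made for the example. The score of a node can be read as its structural closeness to the start node.

```python
def rwr_scores(adj, start, restart=0.15, iters=100):
    """Random walk with restart by power iteration.

    adj: dict mapping each node to a list of its neighbours (assumed format).
    Returns a dict of stationary visiting probabilities relative to `start`.
    """
    nodes = sorted(adj)
    scores = {n: 1.0 if n == start else 0.0 for n in nodes}
    for _ in range(iters):
        # With probability `restart` the walker jumps back to the start node.
        nxt = {n: (restart if n == start else 0.0) for n in nodes}
        for u in nodes:
            nbrs = adj[u]
            if not nbrs:
                # Dangling node: send its remaining mass back to the start.
                nxt[start] += (1 - restart) * scores[u]
                continue
            share = (1 - restart) * scores[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        scores = nxt
    return scores


# Toy graph (hypothetical): two triangles joined by the edge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
scores_from_a = rwr_scores(graph, "a")
```

In this toy graph, nodes in the same triangle as `a` receive higher scores than nodes in the other triangle, which is the kind of signal a clustering step could then use.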
Funding: The work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB1003404 and the National Natural Science Foundation of China under Grant Nos. 61672142, U1435216 and 61602103.
Abstract: Community discovery is an important task in social network analysis. However, most existing methods for community discovery rely on the topological structure alone and ignore the rich information available in the content data. To address this issue, we present a community discovery method based on heterogeneous information network decomposition and embedding. Unlike traditional methods, our method takes into account topology, node content, and edge content, which together supply abundant evidence for community discovery. First, an embedding-based similarity evaluation method is proposed, which decomposes the heterogeneous information network into several subnetworks and extracts their latent deep representations to evaluate the similarities between nodes. Second, a bottom-up community discovery algorithm is proposed. Through leader node selection, initial community generation, and community expansion, communities can be found more efficiently. Third, incremental maintenance strategies for handling changes to the network are proposed. We conduct experimental studies on three real-world social networks. The experiments demonstrate the effectiveness and efficiency of the proposed method. Compared with traditional methods, our method improves normalized mutual information (NMI) and modularity by an average of 12% and 37%, respectively.
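For readers unfamiliar with the modularity metric reported in this abstract, here is a minimal sketch of the standard Newman modularity for an undirected, unweighted graph, Q = sum over communities c of (e_c / m - (d_c / 2m)^2), where m is the number of edges, e_c the number of edges inside c, and d_c the total degree of c. The function name `modularity`, the input format, and the toy graph are assumptions for this example; it does not reproduce the paper's evaluation setup.

```python
def modularity(edges, community):
    """Newman modularity of a partition.

    edges: list of (u, v) pairs of an undirected graph (assumed format).
    community: dict mapping each node to its community label.
    """
    m = len(edges)
    inside = {}   # edges with both endpoints in the same community
    degree = {}   # total degree of each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Toy graph (hypothetical): two triangles joined by the edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {n: 0 for n in "abcdef"}
q_split = modularity(edges, two_groups)   # 5/14 for this graph
q_single = modularity(edges, one_group)   # 0.0: one big community
```

Splitting the toy graph along the bridge gives a positive score, while putting every node in one community gives zero, which is why higher modularity is read as a better partition.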
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 62002262, 61672142, 61602103, 62072086 and 62072084, and the National Key Research and Development Project of China under Grant No. 2018YFB1003404.
Abstract: Entity matching is a fundamental problem in data integration. It groups records according to the underlying real-world entities. There is a growing trend toward entity matching via deep learning techniques. We design mixed hierarchical deep neural networks (MHN) for entity matching, exploiting semantics from different levels of abstraction in the internal hierarchy of a record. A family of attention mechanisms is used at different stages of entity matching: self-attention focuses on internal dependencies, inter-attention targets alignments, and multi-perspective weight attention is devoted to discriminating importance. In particular, hybrid soft token alignment is proposed to address corrupted data. Attribute order is considered for the first time in deep entity matching. Then, to reduce the need for labeled training data, we propose an adversarial domain adaptation approach (DA-MHN) that transfers matching knowledge between different entity matching tasks by maximizing classifier discrepancy. Finally, we conduct comprehensive experimental evaluations on 10 datasets (seven for MHN and three for DA-MHN), which illustrate the superiority of our two proposed approaches. MHN clearly outperforms previous studies in accuracy, and each component of MHN is also tested. DA-MHN greatly surpasses existing studies in transferability.
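To make the self-attention term in this abstract concrete, the following is a generic scaled dot-product self-attention sketch. It is not the MHN architecture: it assumes no learned projections (queries, keys, and values are all the raw token vectors), and the function names and toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    tokens: list of equal-length float vectors.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of all token vectors.
        outputs.append([sum(w[j] * tokens[j][i] for j in range(len(tokens)))
                        for i in range(d)])
    return outputs, weights


# Toy input (hypothetical): tokens 0 and 2 are identical, token 1 is orthogonal.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_outputs, toy_weights = self_attention(toy_tokens)
```

Each row of `toy_weights` sums to one, and token 0 attends more to itself and to the identical token 2 than to the orthogonal token 1.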
Funding: This research was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N150408001-3.
Abstract: Conditional functional dependencies (CFDs) are a critical technique for detecting inconsistencies, but they may miss some potential inconsistencies because they do not consider the content relationships in the data. Content-related conditional functional dependencies (CCFDs) are a special type of CFD that combine content-related CFDs and detect potential inconsistencies by putting content-related data together. In the process of cleaning inconsistencies, detection and repairing interact: 1) detection catches inconsistencies, and 2) repairing corrects the caught inconsistencies but may introduce new ones. In addition, data are often fragmented and distributed across multiple sites, so cleaning inconsistencies incurs expensive data shipment. In this paper, our aim is to repair inconsistencies in distributed content-related data. We propose a framework consisting of an inconsistency detection method and an inconsistency repairing method, which work iteratively. The detection method marks the violated CCFDs in order to compute which inconsistencies should be repaired first. Based on the repairing-cost model presented in this paper, we prove that minimum-cost repairing using CCFDs is NP-complete. Therefore, the repairing method repairs inconsistencies heuristically, aiming at minimum cost. To improve the efficiency and accuracy of repairing, we propose distinct values and rule sequences. Distinct values require less data shipment than the real data for communication. Rule sequences determine appropriate repairing orders to avoid some incorrect repairs. Empirical evaluation on two real-life datasets shows that our solution is more effective than CFDs.
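The detection step can be illustrated with a minimal sketch of violation detection for an ordinary CFD on a single site. It does not cover CCFDs, distribution, or the repairing-cost model from the paper. The function name `cfd_violations`, the `"_"` wildcard convention, and the sample rows are assumptions made for this example.

```python
from collections import defaultdict


def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows violating the CFD (lhs -> rhs, pattern).

    rows: list of dicts (one per tuple).
    lhs: list of left-hand-side attribute names; rhs: one attribute name.
    pattern: dict attr -> constant, or "_" (wildcard) when any value is allowed.
    """
    groups = defaultdict(list)
    violations = set()
    for i, row in enumerate(rows):
        # The CFD only applies to rows matching the LHS constants.
        if any(pattern.get(a, "_") != "_" and row[a] != pattern[a] for a in lhs):
            continue
        # Constant on the RHS: a single row can violate on its own.
        wanted = pattern.get(rhs, "_")
        if wanted != "_" and row[rhs] != wanted:
            violations.add(i)
        groups[tuple(row[a] for a in lhs)].append(i)
    # Rows that agree on the LHS but disagree on the RHS violate together.
    for idxs in groups.values():
        if len({rows[i][rhs] for i in idxs}) > 1:
            violations.update(idxs)
    return sorted(violations)


# Hypothetical data: within country "44", zip should determine street.
sample_rows = [
    {"country": "44", "zip": "EH4", "street": "Mayfield"},
    {"country": "44", "zip": "EH4", "street": "Crichton"},
    {"country": "01", "zip": "07974", "street": "Mtn Ave"},
    {"country": "01", "zip": "07974", "street": "Main St"},
]
sample_pattern = {"country": "44", "zip": "_", "street": "_"}
bad_rows = cfd_violations(sample_rows, ["country", "zip"], "street", sample_pattern)
```

Rows 0 and 1 are flagged because they share a zip code within country "44" but disagree on the street. Rows 2 and 3 are not flagged because the pattern restricts the rule to country "44".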
Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 62072084, 62172082 and 62072086, the Science Research Foundation of Liaoning Province of China under Grant No. LJKZ0094, the Natural Science Foundation of Liaoning Province of China under Grant No. 2022-MS-171, the Science and Technology Plan Major Project of Liaoning Province of China under Grant No. 2022JH1/10400009, and the Fundamental Research Funds for the Central Universities of China under Grant No. N2116008.
Abstract: Identifying accounts across different online social networks that belong to the same user has attracted extensive attention. However, existing techniques rely on given user seeds and ignore the dynamic changes of online social networks, and therefore fail to generate high-quality identification results. To solve this problem, we propose an incremental user identification method based on a user-guider similarity index (called CURIOUS), which efficiently identifies users and captures the changes of user features over time. Specifically, we first construct a novel user-guider similarity index (called USI) to speed up the matching between users. Second, we propose a two-phase user identification strategy consisting of USI-based bidirectional user matching and seed-based user matching, which is effective even for incomplete networks. Finally, we propose incremental maintenance for both the USI and the identification results, which dynamically captures the current state of the social networks. We conduct experimental studies on three real-world social networks. The experiments demonstrate the effectiveness and efficiency of the proposed method in comparison with traditional methods: it improves precision, recall, and rank score by an average of 0.19, 0.16, and 0.09, respectively, and reduces the time cost by an average of 81%.
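The precision and recall figures reported here follow the usual definitions over matched account pairs. A minimal sketch is shown below. It assumes that results are represented as sets of (account in network A, account in network B) pairs; the function name and the sample pairs are hypothetical, and the rank score metric is not covered because the abstract does not define it.

```python
def precision_recall(predicted, truth):
    """Precision and recall of predicted account pairs against ground truth.

    predicted, truth: iterables of (account_a, account_b) pairs (assumed format).
    """
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: 3 predicted pairs, 2 of them correct, 4 true pairs.
true_pairs = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted_pairs = {("a1", "b1"), ("a2", "b9"), ("a3", "b3")}
p, r = precision_recall(predicted_pairs, true_pairs)
```

In this example two of the three predicted pairs are correct, so precision is 2/3, and two of the four true pairs are recovered, so recall is 0.5.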