Funding: This research was funded by the National Key R&D Program of China (No. SQ2018YFB100002), the National Natural Science Foundation of China (Nos. 61761136020, 61672308), Microsoft Research Asia, the Fraunhofer Cluster of Excellence on "Cognitive Internet Technologies", the EU through project Track&Know (grant agreement 780754), NSFC (61761136020), the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1609217), the Zhejiang Provincial Natural Science Foundation (LR18F020001), NSFC Grant 61602306, and the Fundamental Research Funds for the Central Universities.
Abstract: Data quality management, especially data cleansing, has been extensively studied for many years in the areas of data management and visual analytics. In this paper, we first review the relevant work from the research areas of data management, visual analytics, and human-computer interaction. Then, for different types of data such as multimedia data, textual data, trajectory data, and graph data, we summarize the common methods for improving data quality by applying data cleansing techniques at different analysis stages. Based on this analysis, we propose a general visual analytics framework for interactively cleansing data. Finally, we discuss the challenges and opportunities in the context of data and humans.
Funding: the National Natural Science Foundation of China under Grant Nos. 61802414, 61632016, 61521002, and 61661166012; the National Basic Research 973 Program of China under Grant No. 2015CB358700; the Social Science Foundation of Beijing under Grant No. 18XCC011; the Humanities and Social Sciences Base Foundation of Ministry of Education of China under Grant No. 16JJD860008; Huawei; and TAL (Tomorrow Advancing Life) education.
Abstract: The rapid development of social networks has resulted in a proliferation of user-generated content (UGC), which can benefit many applications. In this paper, we study the problem of identifying a user's locations from microblogs, to facilitate effective location-based advertisement and recommendation. Since the location information in a microblog is incomplete, we cannot obtain an accurate location from a single (local) microblog. We therefore propose a global location identification method, Glitter. Glitter combines multiple microblogs of a user and uses them to identify the user's locations. It not only improves the quality of identifying a user's locations but also supplements the location information of each microblog so that an accurate location can be obtained for it. To facilitate location identification, Glitter organizes points of interest (POIs) into a tree structure where leaf nodes are POIs and non-leaf nodes are segments of POIs, e.g., countries, cities, and streets. Using the tree structure, Glitter first extracts candidate locations from each microblog of a user, each of which corresponds to a tree node. Glitter then aggregates these candidate locations and identifies the top-κ locations of the user. Using the identified top-κ user locations, Glitter refines the candidate locations and computes the top-κ locations of each microblog. To achieve high recall, we enable fuzzy matching between locations and microblogs. We propose an incremental algorithm to support dynamic updates of microblogs. We also study how to identify users' trajectories based on the extracted locations, and propose an effective algorithm to extract high-quality trajectories. Experimental results on real-world datasets show that our method achieves high quality and good performance, and scales well.
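The pipeline in this abstract (POI tree, per-microblog candidates, user-level aggregation, per-microblog refinement) can be illustrated with a small Python sketch. This is not the authors' algorithm: the abstract gives no scoring formula, so the vote counting and the `support` ranking below are assumptions, and exact substring matching stands in for the fuzzy matching the paper describes. The `Node` class, all function names, and the toy tree and microblogs are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False, repr=False)
class Node:
    """A node of the POI tree: leaves are POIs, inner nodes are segments."""
    name: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, name: str) -> "Node":
        child = Node(name, parent=self)
        self.children.append(child)
        return child

    def lineage(self) -> List["Node"]:
        """This node followed by its ancestors, excluding the root."""
        out, node = [], self
        while node.parent is not None:
            out.append(node)
            node = node.parent
        return out

    def path(self) -> str:
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return "/".join(reversed(names))


def walk(node: Node) -> Iterator[Node]:
    yield node
    for child in node.children:
        yield from walk(child)


def extract_candidates(root: Node, text: str) -> List[Node]:
    # Assumption: exact, case-insensitive substring match (the paper uses fuzzy matching).
    lowered = text.lower()
    return [n for n in walk(root) if n.name.lower() in lowered]


def user_scores(root: Node, microblogs: List[str]) -> Counter:
    # Assumption: each microblog casts one vote for every candidate and its ancestors.
    votes: Counter = Counter()
    for text in microblogs:
        credited = set()
        for cand in extract_candidates(root, text):
            credited.update(n.path() for n in cand.lineage())
        votes.update(credited)
    return votes


def top_k_user_locations(votes: Counter, k: int) -> List[str]:
    return [p for p, _ in votes.most_common(k)]


def refine_microblog(root: Node, text: str, votes: Counter, k: int) -> List[str]:
    # Rank a microblog's candidates by how strongly the user's top-k locations support them.
    top = set(top_k_user_locations(votes, k))

    def support(cand: Node) -> int:
        return sum(votes[n.path()] for n in cand.lineage() if n.path() in top)

    ranked = sorted(extract_candidates(root, text), key=support, reverse=True)
    return [c.path() for c in ranked[:k]]


# Toy data (hypothetical): the same POI name appears under two cities.
china = Node("China")
china.add("Beijing").add("Zhongguancun Street").add("Starbucks")
china.add("Shanghai").add("Nanjing Road").add("Starbucks")

blogs = [
    "Back in Beijing after the trip",
    "Traffic on Zhongguancun Street again",
    "Coffee at Starbucks",
]
votes = user_scores(china, blogs)
top2 = top_k_user_locations(votes, 2)
refined = refine_microblog(china, blogs[2], votes, 2)
print(top2)
print(refined)
```

In this toy run, the third microblog alone is ambiguous between the two "Starbucks" leaves; the user's other microblogs favor the Beijing branch, so the refinement step ranks that leaf first. This mirrors the idea of supplementing a local microblog with global, user-level evidence.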