Although there exist a few good schemes to protect the kernel hooks of operating systems, attackers are still able to circumvent existing defense mechanisms with spurious context information. To address this challenge, this paper proposes a framework, called HookIMA, to detect compromised kernel hooks by using hardware debugging features. The key contribution of the work is that context information is captured from hardware instead of from relatively vulnerable kernel data. Using commodity hardware, a proof-of-concept prototype system of HookIMA has been developed. This prototype handles 3,082 dynamic control-flow transfers with related hooks in the kernel space. Experiments show that HookIMA is capable of detecting compromised kernel hooks caused by kernel rootkits. Performance evaluations with UnixBench indicate that the runtime overhead introduced by HookIMA is about 21.5%.
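The core check the abstract describes, validating observed control-flow transfers against a trusted baseline rather than against kernel data, can be sketched as follows. This is an illustrative sketch only, not HookIMA's implementation; all addresses and the whitelist structure are hypothetical.

```python
# Sketch: validate dynamic control-flow transfers against a whitelist of
# legitimate hook targets recorded from a trusted baseline. Hypothetical data.

TRUSTED_HOOKS = {
    # hook site (branch source) -> set of legitimate target addresses
    0xFFFF0010: {0xC0101000},                 # e.g., a system-call table entry
    0xFFFF0020: {0xC0102000, 0xC0102040},
}

def check_transfer(source: int, target: int) -> bool:
    """Return True if the observed transfer matches the trusted baseline."""
    legitimate = TRUSTED_HOOKS.get(source)
    if legitimate is None:
        return True                            # not a monitored hook site
    return target in legitimate

# A rootkit that rewrites the hook at 0xFFFF0010 would be flagged:
assert check_transfer(0xFFFF0010, 0xC0101000)      # benign transfer
assert not check_transfer(0xFFFF0010, 0xDEADBEEF)  # compromised hook
```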
Keyword Search Over Relational Databases (KSORD) enables casual or Web users to easily access databases through free-form keyword queries. Improving the performance of KSORD systems is a critical issue in this area. In this paper, a new approach, CLASCN (Classification, Learning And Selection of Candidate Network), is developed to efficiently perform top-k keyword queries in schema-graph-based online KSORD systems. In this approach, the Candidate Networks (CNs) from trained keyword queries or executed user queries are classified and stored in the databases, and top-k results from the CNs are learned for constructing CN Language Models (CNLMs). The CNLMs are used to compute similarity scores between a new user query and the CNs derived from the query. The CNs with relatively large similarity scores, which are the most promising ones to produce top-k results, will be selected and executed. Currently, CLASCN is only applicable to past queries and New All-keyword-Used (NAU) queries, which are frequently submitted queries. Extensive experiments also show the efficiency and effectiveness of our CLASCN approach.
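The CN-selection idea, score each candidate network's language model against the new query and execute only the top scorers, can be sketched with a smoothed unigram model. This is a minimal illustration of the mechanism, not the paper's exact CNLM; the vocabulary size, smoothing, and CN names are invented.

```python
# Sketch: unigram CN language models (CNLMs) scored against a new query;
# only the highest-scoring candidate networks would be executed.

import math
from collections import Counter

def build_cnlm(keyword_history, vocab_size, alpha=1.0):
    """Laplace-smoothed unigram model over past query keywords for one CN."""
    counts = Counter(keyword_history)
    total = sum(counts.values())
    return lambda w: (counts[w] + alpha) / (total + alpha * vocab_size)

def score(query_keywords, cnlm):
    """Log-likelihood of the query under a CN's language model."""
    return sum(math.log(cnlm(w)) for w in query_keywords)

vocab = 10_000
cnlms = {
    "CN_author_paper": build_cnlm(["author", "paper", "title"], vocab),
    "CN_conf_paper":   build_cnlm(["conference", "paper", "year"], vocab),
}
query = ["author", "paper"]
ranked = sorted(cnlms, key=lambda cn: score(query, cnlms[cn]), reverse=True)
print(ranked[0])   # the most promising CN to execute -> "CN_author_paper"
```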
Data partitioning techniques are pivotal for optimal data placement across storage devices, thereby enhancing resource utilization and overall system throughput. However, the design of effective partition schemes faces multiple challenges, including considerations of the cluster environment, storage device characteristics, optimization objectives, and the balance between partition quality and computational efficiency. Furthermore, dynamic environments necessitate robust partition detection mechanisms. This paper presents a comprehensive survey structured around partition deployment environments, outlining the distinguishing features and applicability of various partitioning strategies while delving into how these challenges are addressed. We discuss partitioning features pertaining to database schema, table data, workload, and runtime metrics. We then delve into the partition generation process, segmenting it into initialization and optimization stages. A comparative analysis of partition generation and update algorithms is provided, emphasizing their suitability for different scenarios and optimization objectives. Additionally, we illustrate the applications of partitioning in prevalent database products and suggest potential future research directions and solutions. This survey aims to foster the implementation, deployment, and updating of high-quality partitions for specific system scenarios.
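For readers new to the area, the two elementary schemes that most partitioning strategies build on can be sketched in a few lines. The boundaries and partition counts below are illustrative placeholders, not recommendations from the survey.

```python
# Sketch: the two basic building blocks of data partitioning.
import bisect
import zlib

def hash_partition(key: str, num_partitions: int) -> int:
    """Spread keys uniformly; good for point lookups, poor for range scans."""
    return zlib.crc32(key.encode()) % num_partitions

RANGE_BOUNDS = [100, 1000, 10_000]   # partition i holds keys < RANGE_BOUNDS[i]

def range_partition(key: int) -> int:
    """Keep adjacent keys together; good for range scans, but prone to skew."""
    return bisect.bisect_right(RANGE_BOUNDS, key)

print(hash_partition("user42", 4), range_partition(250))  # partition ids
```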
Both computer science and archival science are concerned with archiving large-scale data, but they have different focuses. Large-scale data archiving in computer science focuses on technical aspects that can reduce the cost of data storage and improve the reliability and efficiency of Big Data management. Its weaknesses lie in inadequate and non-standardized management. Archiving in archival science focuses on the management aspects and neglects the necessary technical considerations, resulting in high storage and retention costs and poor ability to manage Big Data. Therefore, the integration of large-scale data archiving and archival theory can balance the existing research limitations of the two fields and propose two topics for related research: archival management of Big Data and large-scale management of archived Big Data.
Update management is very important for data integration systems, so update management in peer data management systems (PDMSs) is a hot research area. This paper studies view maintenance in PDMSs. First, the definition of view is extended, and the peer view, local view and global view are proposed according to the requirements of applications. There are two main factors that influence materialized views in PDMSs: one is that schema mappings between peers are changed, and the other is that peers update their data. Based on these requirements, this paper proposes an algorithm called 2DCMA, which includes two sub-algorithms, a data consistency maintenance algorithm and a definition consistency maintenance algorithm, to effectively maintain views. For data consistency maintenance, Mork's rules are extended to govern the use of updategrams and boosters; the new rule system can be used to optimize the execution plan, and the data consistency maintenance algorithm is based on it. Furthermore, an ECA rule is adopted for definition consistency maintenance. Finally, extensive simulation experiments are conducted in SPDMS. The simulation results show that the 2DCMA algorithm outperforms Mork's when maintaining data consistency, and outperforms a centralized view maintenance algorithm when maintaining definition consistency.
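Since definition consistency here hinges on event-condition-action (ECA) rules, a toy sketch may help fix the idea: when a schema mapping changes (event), views depending on it (condition) are marked for re-derivation (action). The classes and the staleness flag are invented for illustration, not 2DCMA's actual design.

```python
# Sketch: an ECA rule marking views stale when a schema mapping changes.
from dataclasses import dataclass

@dataclass
class View:
    name: str
    mappings: set            # names of schema mappings the view depends on
    stale: bool = False

@dataclass
class ECARule:
    event: str               # e.g., "mapping_changed"
    condition: object        # predicate over (view, event payload)
    action: object           # effect applied to the view

def on_mapping_changed(views, rules, mapping_name):
    for view in views:
        for rule in rules:
            if rule.event == "mapping_changed" and rule.condition(view, mapping_name):
                rule.action(view)

views = [View("global_sales", {"m_peerA_peerB"}), View("local_stock", {"m_peerC"})]
rules = [ECARule("mapping_changed",
                 condition=lambda v, m: m in v.mappings,
                 action=lambda v: setattr(v, "stale", True))]  # re-derive later

on_mapping_changed(views, rules, "m_peerA_peerB")
print([(v.name, v.stale) for v in views])   # only global_sales is marked stale
```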
Latent Dirichlet allocation (LDA) is a topic model widely used for discovering hidden semantics in massive text corpora. Collapsed Gibbs sampling (CGS), as a widely-used algorithm for learning the parameters of LDA, has the risk of privacy leakage. Specifically, word count statistics and updates of latent topics in CGS, which are essential for parameter estimation, could be employed by adversaries to conduct effective membership inference attacks (MIAs). Till now, there are two kinds of methods exploited in CGS to defend against MIAs: adding noise to word count statistics and utilizing inherent privacy. These two kinds of methods have their respective limitations. Noise sampled from the Laplacian distribution sometimes produces negative word count statistics, which render terrible parameter estimation in CGS. Utilizing inherent privacy could only provide weak guaranteed privacy when defending against MIAs. It is promising to propose an effective framework to obtain accurate parameter estimations with guaranteed differential privacy. The key issue in obtaining accurate parameter estimations when introducing differential privacy in CGS is making good use of the privacy budget such that a precise noise scale is derived. This is the first time that Rényi differential privacy (RDP) has been introduced into CGS, and we propose RDP-LDA, an effective framework for analyzing the privacy loss of any differentially private CGS. RDP-LDA can be used to derive a tighter upper bound of privacy loss than the overestimated results of existing differentially private CGS obtained by ε-DP. In RDP-LDA, we propose a novel truncated-Gaussian mechanism that keeps word count statistics non-negative. And we propose distribution perturbation, which could provide more rigorous guaranteed privacy than utilizing inherent privacy. Experiments validate that our proposed methods produce more accurate parameter estimation under the JS-divergence metric and obtain lower precision and recall when defending against MIAs.
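The non-negativity idea behind a truncated-Gaussian mechanism can be illustrated directly: draw Gaussian noise from the distribution truncated so the noisy count never drops below zero, which is exactly the failure mode of Laplacian noise described above. The sigma and counts below are illustrative; this sketch does not reproduce the paper's noise calibration or privacy accounting.

```python
# Sketch: Gaussian noise on word counts, rejection-sampled so that the
# perturbed count stays non-negative (truncation at zero).
import random

def truncated_gaussian_count(count: float, sigma: float) -> float:
    """Resample noise until count + noise >= 0."""
    while True:
        noisy = count + random.gauss(0.0, sigma)
        if noisy >= 0.0:
            return noisy

word_counts = {"topic0/word7": 3.0, "topic1/word7": 0.0}
private_counts = {k: truncated_gaussian_count(v, sigma=1.5)
                  for k, v in word_counts.items()}
print(private_counts)   # all values are guaranteed non-negative
```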
Link-based similarity measures play a significant role in many graph-based applications. Consequently, measuring node similarity in a graph is a fundamental problem of graph data mining. Personalized PageRank (PPR) and SimRank (SR) have emerged as the most popular and influential link-based similarity measures. Recently, a novel link-based similarity measure, penetrating rank (P-Rank), which enriches SR, was proposed. In practice, PPR, SR and P-Rank scores are calculated by iterative methods. As the number of iterations increases, so does the overhead of the calculation. The ideal solution is that computing similarity within the minimum number of iterations is sufficient to guarantee a desired accuracy. However, the existing upper bounds are too coarse to be useful in general. Therefore, in this paper we focus on designing accurate and tight upper bounds for PPR, SR, and P-Rank. Our upper bounds are designed based on the following intuition: the smaller the difference between two consecutive iteration steps is, the smaller the difference between the theoretical and iterative similarity scores becomes. Furthermore, we demonstrate the effectiveness of our upper bounds in the scenario of top-k similar node queries, where our upper bounds help accelerate the query. We also run a comprehensive set of experiments on real-world data sets to verify the effectiveness and efficiency of our upper bounds.
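The intuition stated above is easy to see in code: iterate PPR and stop once consecutive iterates barely change. This minimal sketch shows only the stopping heuristic; the paper's contribution is principled bounds relating that step-to-step difference to the true error, and the toy graph and tolerance here are made up.

```python
# Sketch: PPR power iteration with a consecutive-iteration stopping test.
def ppr(adj, source, c=0.15, tol=1e-8, max_iters=1000):
    """Assumes every node has at least one out-edge."""
    nodes = list(adj)
    scores = {v: (1.0 if v == source else 0.0) for v in nodes}
    for _ in range(max_iters):
        nxt = {v: (c if v == source else 0.0) for v in nodes}
        for u in nodes:
            share = (1 - c) * scores[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        diff = max(abs(nxt[v] - scores[v]) for v in nodes)
        scores = nxt
        if diff < tol:     # consecutive iterates barely changed -> stop early
            break
    return scores

adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(ppr(adj, "a"))
```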
Monitoring data streams is an efficient method of acquiring the characteristics of a data stream. However, the resources available for each data stream are limited, so how to use limited resources to process an infinite data stream is an open and challenging problem. In this paper, we adopt wavelet and sliding window methods to design a multi-resolution summarization data structure, the Multi-Resolution Summarization Tree (MRST), which can be updated incrementally with incoming data and can support point queries, range queries, and multi-point queries while maintaining query precision. We use both synthetic data and real-world data to evaluate our algorithm. The experimental results indicate that the query efficiency and adaptability of MRST exceed those of current algorithms, while its implementation is simpler.
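The wavelet building block such summaries rely on can be shown concretely: a Haar decomposition of the current window keeps a handful of coarse coefficients that still answer approximate aggregate queries. This is a generic Haar sketch, not MRST itself; the window contents are illustrative.

```python
# Sketch: full Haar wavelet decomposition of a sliding window; keeping only
# the leading (coarse) coefficients yields a multi-resolution summary.
def haar(values):
    """len(values) must be a power of two; returns coarse-to-fine coefficients."""
    coeffs = []
    while len(values) > 1:
        avgs = [(a + b) / 2 for a, b in zip(values[::2], values[1::2])]
        details = [(a - b) / 2 for a, b in zip(values[::2], values[1::2])]
        coeffs = details + coeffs        # coarser details end up in front
        values = avgs
    return values + coeffs               # [overall average, details...]

window = [2, 2, 0, 2, 3, 5, 4, 4]
summary = haar(window)
# Even a single retained coefficient answers an approximate sum query:
print(summary[0] * len(window))          # 22.0, the exact window sum here
```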
The entry into the big data era gives rise to a novel discipline called Data Science. Data Science is interdisciplinary in nature, and existing relevant studies can be categorized into domain-independent studies and domain-dependent studies. The domain-independent studies and domain-dependent ones are evolving into Domain-general Data Science (DGDS) and Domain-specific Data Science (DSDS), respectively. Domain-general Data Science emphasizes Data Science in a general sense, involving concepts, theories, methods, technologies, and tools. Domain-specific Data Science is a variant of Domain-general Data Science and varies from one domain to another. The most popular Domain-specific Data Science includes data journalism, Industrial Data Science, Business Data Science, Health Data Science, Biological Data Science, Social Data Science, and Agile Data Science. The difference between Domain-general Data Science and Domain-specific Data Science is rooted in their thinking paradigms: DGDS conforms to data-centered thinking, while DSDS is in line with knowledge-centered thinking. As a result, DGDS focuses on theoretical studies, while DSDS is centered on applied ones. However, DSDS and DGDS possess complementary advantages. Theoretical Data Science (TDS) is a new branch of Data Science that employs mathematical models and abstractions of data objects and systems to rationalize, explain and predict big data phenomena. TDS will bridge the gap between DGDS and DSDS. TDS contrasts with DSDS, which uses causal analysis, as well as DGDS, which employs data-centered thinking to deal with big data problems, in that it balances the usability and interpretability of Data Science practices. The main concerns of TDS are integrating data-centered thinking with knowledge-centered thinking and transforming correlation analysis into causal analysis. Hence, TDS can bridge the gaps between DGDS and DSDS, and balance the usability and interpretability of big data solutions. Studies of TDS should focus on the following research purposes: to develop theoretical studies of TDS, to take advantage of the active property of big data, to embrace design of experiments, to enhance causality analysis, and to develop data products.
Deep learning has shown significant improvements on various machine learning tasks by introducing a wide spectrum of neural network models. Yet, for these neural network models, it is necessary to label a tremendous amount of training data, which is prohibitively expensive in reality. In this paper, we propose the OnLine Machine Learning (OLML) database, which stores trained models and reuses these models in a new training task to achieve a better training effect with a small amount of training data. An efficient model reuse algorithm, AdaReuse, is developed in the OLML database. Specifically, AdaReuse first estimates the reuse potential of trained models from domain relatedness and model quality, through which a group of trained models with high reuse potential for the training task can be selected efficiently. Then, multiple selected models are trained iteratively to encourage diverse models, with which a better training effect can be achieved by ensemble. We evaluate AdaReuse on two types of natural language processing (NLP) tasks, and the results show that AdaReuse can improve the training effect significantly compared with models trained from scratch when the training data is limited. Based on AdaReuse, we implement an OLML database prototype system which can accept a training task as an SQL-like query and automatically generate a training plan by selecting and reusing trained models. Usability studies are conducted to illustrate that the OLML database can properly store trained models and reuse them efficiently in new training tasks.
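The selection-then-ensemble skeleton described here can be sketched in a few lines: rank stored models by a reuse-potential score combining domain relatedness and quality, keep the top ones, and combine their predictions. The weighting, the model zoo, and the scores below are invented placeholders, not AdaReuse's actual estimator.

```python
# Sketch: select trained models by reuse potential, then ensemble by vote.
from collections import Counter

def reuse_potential(relatedness: float, quality: float, w: float = 0.6) -> float:
    """Toy score: weighted mix of domain relatedness and model quality."""
    return w * relatedness + (1 - w) * quality

model_zoo = {                # model -> (domain relatedness, held-out quality)
    "news_sentiment":   (0.9, 0.82),
    "review_sentiment": (0.7, 0.88),
    "spam_filter":      (0.2, 0.95),
}
top = sorted(model_zoo, key=lambda m: reuse_potential(*model_zoo[m]),
             reverse=True)[:2]

def ensemble_predict(predictions_per_model):
    """Majority vote over the selected models' label predictions."""
    return Counter(predictions_per_model).most_common(1)[0][0]

print(top)                                        # the two most reusable models
print(ensemble_predict(["positive", "positive"])) # vote from the selected models
```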
Data science is a rapidly growing academic field with significant implications for all conventional scientific studies. However, most relevant studies have been limited to one or several facets of data science from a specific application domain perspective, and few discuss its theoretical framework. Data science is unique in that its research goals, perspectives, and body of knowledge are distinct from those of other sciences. The core theories of data science are the DIKW pyramid, data-intensive scientific discovery, the data science life cycle, data wrangling or munging, big data analytics, data management and governance, data products DevOps, and big data visualization. Six main trends characterize recent theoretical studies on data science: (1) the growing significance of DataOps, (2) the rise of citizen data scientists, (3) enabling augmented data science, (4) integrating the data warehouse with the data lake, (5) the diversity of domain-specific data science, and (6) implementing data stories as data products. Further development of data science should prioritize four ways to turn challenges into opportunities: (1) accelerating theoretical studies of data science, (2) the trade-off between explainability and performance, (3) achieving data ethics, privacy and trust, and (4) aligning academic curricula with industrial needs.
There is a trend that virtually everyone, ranging from big Web companies to traditional enterprises to physical science researchers to social scientists, is either already experiencing or anticipating unprecedented growth in the amount of data available in their world, as well as new opportunities and great untapped value. This paper reviews big data challenges from a data management perspective. In particular, we discuss big data diversity, big data reduction, big data integration and cleaning, big data indexing and query, and finally big data analysis and mining. Our survey gives a brief overview of big-data-oriented research and problems.
Advances in wireless sensor networks and positioning technologies enable new applications that monitor moving objects. Some of these applications, such as traffic management, require the ability to query the future trajectories of the objects. In this paper, we propose an original data access method, the ANR-tree, which supports predictive queries. We focus on real-life environments, where the objects move within constrained networks, such as vehicles on roads. We introduce a simulation-based prediction model based on graphs of cellular automata, which makes full use of the network constraints and the stochastic traffic behavior. Our technique differs strongly from the linear prediction model, which has low prediction accuracy and requires frequent updates when applied to real traffic with frequently changing velocities. The data structure extends the R-tree with adaptive units which group neighboring objects moving in similar patterns. The predicted movement of an adaptive unit is given not by a single trajectory, but by two trajectory bounds based on different assumptions on the traffic conditions and obtained from the simulation. Our experiments, carried out on two different datasets, show that the ANR-tree is essentially one order of magnitude more efficient than the TPR-tree, and is much more scalable.
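The kind of cellular-automaton traffic step such a simulation-based predictor builds on can be sketched in the spirit of the classic Nagel-Schreckenberg model. Lane length, maximum speed, and the slowdown probability below are illustrative, not the paper's parameters.

```python
# Sketch: one update step of a circular single-lane traffic cellular automaton
# (accelerate, avoid collision, random slowdown, move).
import random

def ca_step(positions, speeds, length=100, v_max=5, p_slow=0.3):
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    new_pos, new_spd = positions[:], speeds[:]
    for idx, i in enumerate(order):
        ahead = order[(idx + 1) % len(order)]
        gap = (positions[ahead] - positions[i] - 1) % length
        v = min(speeds[i] + 1, v_max, gap)        # accelerate but don't collide
        if v > 0 and random.random() < p_slow:    # stochastic slowdown
            v -= 1
        new_spd[i] = v
        new_pos[i] = (positions[i] + v) % length
    return new_pos, new_spd

pos, spd = [0, 10, 30], [0, 0, 0]
for _ in range(5):
    pos, spd = ca_step(pos, spd)
print(pos, spd)   # simulated states like these yield trajectory bounds
```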
Schema summarization on large-scale databases is a challenge. In a typical large database schema, a great proportion of the tables are closely connected through a few high-degree tables, making it difficult to separate these tables into clusters that represent different topics. Moreover, as a schema can be very big, the schema summary needs to be structured into multiple levels to further improve usability. In this paper, we introduce a new schema summarization approach utilizing techniques of community detection in social networks. Our approach contains three steps. First, we use a community detection algorithm to divide a database schema into subject groups, each representing a specific subject. Second, we cluster the subject groups into abstract domains to form a multi-level navigation structure. Third, we discover representative tables in each cluster to label the schema summary. We evaluate our approach on Freebase, a real-world large-scale database. The results show that our approach can identify subject groups precisely, and the generated abstract schema layers are very helpful for users exploring the database.
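Steps one and three of the pipeline can be sketched on a toy schema graph, using networkx for convenience (the paper does not prescribe this library or algorithm): detect communities as subject groups, then label each group with its highest-degree table. The toy schema is invented.

```python
# Sketch: tables as nodes, foreign-key links as edges; communities become
# subject groups, and the highest-degree table in each group labels it.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

schema = nx.Graph()
schema.add_edges_from([
    ("customer", "orders"), ("orders", "order_item"), ("order_item", "product"),
    ("employee", "department"), ("employee", "salary_history"),
])

for group in greedy_modularity_communities(schema):
    representative = max(group, key=schema.degree)   # step 3: label the group
    print(f"subject group {sorted(group)} -> representative: {representative}")
```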
Variable influence duration (VID) join is a novel spatio-temporal join operation between a set T of trajectories and a set P of spatial points. Here, trajectories are traveling histories of moving objects (e.g., travelers), and spatial points are points of interest (POIs, e.g., restaurants). VID join returns all pairs (τs, p) such that τs is spatially close to p for a long period of time, where τs is a segment of trajectory τ ∈ T and p ∈ P. Each returned pair (τs, p) implies that the moving object associated with τs stayed at p (e.g., having dinner at a restaurant). Such information is useful in many areas, such as targeted advertising, social security, and social activity analysis. The concepts of influence and influence duration are introduced to measure the spatial closeness between τ and p, and the time spanned, respectively. Compared to the conventional spatio-temporal join, the VID join is more challenging since the join condition varies for different POIs, and the additional temporal requirement cannot be indexed effectively. To process the VID join efficiently, three algorithms are developed and several optimization techniques are applied, including spatial duplication reuse and time-duration-based pruning. The performance of the developed algorithms is verified by extensive experiments on real spatial data.
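The per-pair predicate at the heart of the join can be illustrated for a single (trajectory, POI) pair: sum the time spanned by consecutive samples that lie inside the POI's influence region and test it against the POI's threshold. The radius, threshold, and distance model below are simplified placeholders for the paper's influence definitions.

```python
# Sketch: influence duration of one trajectory around one POI.
import math

def influence_duration(samples, poi, radius):
    """samples: list of (t, x, y) tuples sorted by time t."""
    total = 0.0
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        if (math.dist((x0, y0), poi) <= radius and
                math.dist((x1, y1), poi) <= radius):
            total += t1 - t0          # both endpoints inside the influence region
    return total

trajectory = [(0, 0.0, 0.0), (10, 0.1, 0.0), (20, 0.1, 0.1), (30, 5.0, 5.0)]
restaurant = (0.0, 0.0)
if influence_duration(trajectory, restaurant, radius=0.5) >= 15:
    print("object stayed at the POI")   # duration 20 >= threshold 15
```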
Deduplication has been commonly used in both enterprise storage systems and cloud storage. To overcome the performance challenge for the selective restore operations of deduplication systems, a solid-state-drive-based (i.e., SSD-based) read cache can be deployed for speedup by dynamically caching popular restore contents. Unfortunately, frequent data updates induced by classical cache schemes (e.g., LRU and LFU) significantly shorten SSDs' lifetime while slowing down I/O processes in SSDs. To address this problem, we propose a new solution, LOP-Cache, to greatly improve the write durability of SSDs as well as I/O performance by enlarging the proportion of long-term popular (LOP) data among the data written into the SSD-based cache. LOP-Cache keeps LOP data in the SSD cache for a long time period to decrease the number of cache replacements. Furthermore, it prevents unpopular or unnecessary data in deduplication containers from being written into the SSD cache. We implemented LOP-Cache in a prototype deduplication system to evaluate its performance. Our experimental results indicate that LOP-Cache shortens the latency of selective restore by an average of 37.3% at the cost of a small SSD-based cache with only 5.56% of the capacity of the deduplicated data. Importantly, LOP-Cache improves SSDs' lifetime by a factor of 9.77. The evidence shows that LOP-Cache offers a cost-efficient SSD-based read cache solution to boost the performance of selective restore for deduplication systems.
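The admission idea, write a container into the SSD cache only once its long-term popularity is established, so one-off accesses never cause SSD writes, can be sketched as a popularity-gated cache. The threshold, capacity, and eviction rule below are invented for illustration and are not LOP-Cache's actual policy.

```python
# Sketch: cache admission gated by a long-term access counter that, unlike
# LRU/LFU cache state, survives evictions.
from collections import Counter

class PopularityGatedCache:
    def __init__(self, capacity=2, admit_after=3):
        self.capacity, self.admit_after = capacity, admit_after
        self.long_term = Counter()       # long-term popularity per container
        self.cache = {}

    def get(self, container_id, fetch):
        self.long_term[container_id] += 1
        if container_id in self.cache:
            return self.cache[container_id]
        data = fetch(container_id)
        if self.long_term[container_id] >= self.admit_after:   # LOP gate
            if len(self.cache) >= self.capacity:   # evict the least popular
                victim = min(self.cache, key=self.long_term.__getitem__)
                del self.cache[victim]
            self.cache[container_id] = data        # SSD write happens only here
        return data

cache = PopularityGatedCache()
for cid in ["a", "b", "a", "a", "c"]:
    cache.get(cid, fetch=lambda c: f"data({c})")
print(sorted(cache.cache))   # only the long-term popular container "a" is cached
```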
Cloud-native data warehouses have revolutionized data analysis by enabling elasticity, high availability and lower costs. The increasing popularity of artificial intelligence (AI) also drives data warehouses to provide predictive analytics besides the existing descriptive analytics. Consequently, more vendors are starting to support training and inference of AI models in data warehouses, exploiting the benefits of near-data processing for fast model development and deployment. However, most of the existing solutions are limited by a complex syntax or slow data transportation across engines. In this paper, we present GaussDB-AISQL, a composable SQL system with AI capabilities. GaussDB-AISQL adopts a composable system design that decouples computing, storage, caching, the DB engine and the AI engine. Our system offers all the functionality needed for end-to-end model training and inference during the model lifecycle. It also achieves simplicity and efficiency by providing a SQL-like syntax and removing the burden of manual model management. When training an AI model, GaussDB-AISQL benefits from highly parallel data transportation through concurrent data pulling from the distributed shared memory. The feature selection algorithms in GaussDB-AISQL make the training more data-efficient. When running model inference, GaussDB-AISQL registers the trained model object in the local data warehouse as a user-defined function, which avoids moving inference data out of the data warehouse to an external AI engine. Experiments show that GaussDB-AISQL is up to 19× faster than baseline approaches.
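The model-as-UDF pattern described for inference can be demonstrated with SQLite standing in for GaussDB (whose actual syntax is not reproduced here): a trained model is wrapped as a SQL function, so inference runs inside the database and the data never leaves the engine. The "model" below is a trivial stand-in for a real trained object.

```python
# Sketch: registering a "trained model" as a user-defined function so that
# inference runs next to the data, inside the database engine.
import sqlite3

def predict_churn(monthly_spend: float) -> int:
    """Stand-in model: flag customers spending under 10 as churn risks."""
    return 1 if monthly_spend < 10.0 else 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, monthly_spend REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, 5.0), (2, 42.0)])
conn.create_function("predict_churn", 1, predict_churn)  # model becomes a UDF

for row in conn.execute("SELECT id, predict_churn(monthly_spend) FROM customers"):
    print(row)   # (1, 1) and (2, 0) -- inference data never left the database
```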
Tables, typically two-dimensional and structured to store large amounts of data, are essential in daily activities like database queries, spreadsheet manipulations, Web table question answering, and image table information extraction. Automating these table-centric tasks with Large Language Models (LLMs) or Visual Language Models (VLMs) offers significant public benefits, garnering interest from academia and industry. This survey provides a comprehensive overview of table-related tasks, examining both user scenarios and technical aspects. It covers traditional tasks like table question answering as well as emerging fields such as spreadsheet manipulation and table data analysis. We summarize the training techniques for LLMs and VLMs tailored for table processing. Additionally, we discuss prompt engineering, particularly the use of LLM-powered agents, for various table-related tasks. Finally, we highlight several challenges, including diverse user input when serving and slow thinking using chain-of-thought.
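A small sketch of the prompt-engineering side of table question answering: the table is serialized (here as Markdown, one common choice among several) and concatenated with the question. The LLM call itself is left abstract, since the survey does not tie the pattern to any particular model API.

```python
# Sketch: building a table-QA prompt by serializing a table to Markdown.
def table_to_markdown(header, rows):
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

def build_prompt(header, rows, question):
    return (f"Answer the question using only the table below.\n\n"
            f"{table_to_markdown(header, rows)}\n\n"
            f"Question: {question}\nAnswer:")

prompt = build_prompt(["city", "population"],
                      [["Beijing", 21_893_000], ["Oslo", 709_000]],
                      "Which city has the larger population?")
print(prompt)   # pass this string to whichever LLM backend is in use
```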
In this paper, we consider skyline queries in a mobile and distributed environment, where data objects are distributed across sites (database servers) interconnected through a high-speed wired network, and queries are issued by mobile units (laptops, cell phones, etc.) which access the data objects of the database servers over wireless channels. The inherent properties of the mobile computing environment, such as mobility, limited wireless bandwidth, and frequent disconnection, make skyline queries more complicated. We show how to efficiently perform distributed skyline queries in a mobile environment and propose a skyline query processing approach called efficient distributed skyline based on mobile computing (EDS-MC). In EDS-MC, a distributed skyline query is decomposed into five processing phases, and each phase is elaborately designed to reduce network communication, network delay and query response time. We conduct extensive experiments in a simulated mobile database system, and the experimental results demonstrate the superiority of EDS-MC over other skyline query processing techniques on mobile computing.
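For reference, the centralized core that any distributed skyline approach coordinates is the dominance test: point p dominates q if p is no worse in every dimension and strictly better in at least one (smaller is better here). EDS-MC's contribution lies in distributing this computation; the sketch below shows only the sequential baseline on invented data.

```python
# Sketch: naive skyline computation via pairwise dominance tests.
def dominates(p, q):
    return (all(a <= b for a, b in zip(p, q)) and
            any(a < b for a, b in zip(p, q)))

def skyline(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

hotels = [(50, 3.0), (80, 1.0), (60, 2.5), (90, 4.0)]   # (price, distance)
print(skyline(hotels))   # (90, 4.0) is dominated by (50, 3.0); the rest remain
```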
Web services are commonly perceived as an environment offering both opportunities and threats. In this environment, one way to minimize threats is to use reputation evaluation, which can be computed, for example, through transaction feedback. However, the current feedback-based approach is inaccurate and ineffective because of its inner limitations (e.g., the feedback quality problem). As the main source of feedback, existing online reviews often vary greatly in quality from low to high, mainly because: (1) they have no standard expression formats, and (2) dishonest comments may exist among these reviews due to malicious attacking. To date, the quality problem of reviews has not been well solved, which greatly degrades their usefulness for service reputation evaluation. Therefore, we first present a novel evaluation approach for review quality in terms of multiple metrics. Then, we further improve service reputation evaluation based on the filtered reviews. Experimental results show the effectiveness and efficiency of our proposed approach compared with naive feedback-based approaches.
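The overall pipeline, score each review's quality on several metrics, filter out low-quality (possibly dishonest) reviews, and compute reputation as a quality-weighted average of the survivors, can be sketched as follows. The metrics, weights, and threshold are invented placeholders, not the paper's actual formulas.

```python
# Sketch: quality-filtered, quality-weighted service reputation.
def review_quality(length_score, consistency_score, reviewer_credibility):
    """Toy multi-metric quality score in [0, 1]."""
    return (length_score + consistency_score + reviewer_credibility) / 3

def reputation(reviews, quality_threshold=0.5):
    kept = [(rating, q) for rating, q in reviews if q >= quality_threshold]
    if not kept:
        return None
    return sum(rating * q for rating, q in kept) / sum(q for _, q in kept)

reviews = [
    (5.0, review_quality(0.9, 0.8, 0.9)),   # detailed, consistent review
    (1.0, review_quality(0.1, 0.2, 0.1)),   # likely spam: filtered out
    (4.0, review_quality(0.7, 0.6, 0.8)),
]
print(round(reputation(reviews), 2))   # quality-weighted service reputation
```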