Funding: Projects (61363021, 61540061, 61663047) supported by the National Natural Science Foundation of China; Project (2017SE206) supported by the Open Foundation of the Key Laboratory in Software Engineering of Yunnan Province, China.
Abstract: Due to the increasing number of cloud applications, the amount of data in the cloud is growing faster than ever before. The nature of cloud computing requires cloud data processing systems that can handle huge volumes of data with high performance. However, most current cloud storage systems adopt a hash-like approach to retrieving data that supports only simple keyword-based queries and lacks richer forms of search. A scalable and efficient indexing scheme is therefore clearly required. In this paper, we present SLC-index, a novel, scalable skip list-based index for cloud data processing. The SLC-index offers a two-layered architecture that extends the indexing scope and facilitates better throughput. Dynamic load balancing is achieved by online migration of index nodes between servers. Furthermore, the system is flexible, supporting dynamic addition and removal of servers. The SLC-index is efficient for both point and range queries. Experimental results show the efficiency of the SLC-index and its usefulness as an alternative approach for cloud-suitable data structures.
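The SLC-index internals are not given in the abstract; what makes a skip list suitable here is that its bottom level is a sorted linked list, so a single descent serves point lookups and range scans alike. A minimal single-machine sketch (parameter choices and names are illustrative, not the paper's):

```python
import random

class Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level  # forward[i] = next node at level i

class SkipList:
    MAX_LEVEL = 16
    P = 0.5  # probability of promoting a node one level up

    def __init__(self):
        self.head = Node(None, None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        lvl = 1
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key, value):
        update = [self.head] * self.MAX_LEVEL  # rightmost node per level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, value, lvl)
        for i in range(lvl):  # splice the new node in at each of its levels
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def search(self, key):  # point query, O(log n) expected
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node.value if node and node.key == key else None

    def range_query(self, lo, hi):  # locate lo, then walk the sorted level 0
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < lo:
                node = node.forward[i]
        node = node.forward[0]
        out = []
        while node and node.key <= hi:
            out.append((node.key, node.value))
            node = node.forward[0]
        return out
```

The two-layered architecture described in the abstract would place a structure like this on each server, with an upper routing layer mapping key ranges to servers; that layer is not sketched here.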
Funding: National Basic Research Program of China (973 Program) (2005CD312904).
Abstract: To address the problem of workflow data consistency in a distributed environment, an invalidation strategy based on a timely updated record list is put forward. The strategy improves on the classical invalidation strategy by updating the record list and adding a recovery mechanism for update messages. When the request cycle of a replica is too long, the strategy records this in the list and pauses the sending of update messages to that replica; when a long-cycle replica is requested again, the recovery mechanism resumes its update messages. This strategy not only ensures the consistency of workflow data but also reduces unnecessary network traffic. Theoretical comparison with common strategies shows that the unnecessary network traffic of this strategy is lower and more stable, and the simulation results validate this conclusion.
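One way to realize the pause/resume behaviour described above is a record list keyed by replica and last request time. Everything below (names, timing policy) is an illustrative reading of the abstract, not the paper's protocol:

```python
import time

class UpdateRecordList:
    """Tracks when each replica last requested the data. Replicas idle
    longer than `max_idle` are marked dormant and stop receiving update
    messages until they request again (the recovery mechanism)."""

    def __init__(self, max_idle_seconds):
        self.max_idle = max_idle_seconds
        self.last_request = {}   # replica_id -> timestamp of last request
        self.dormant = set()     # replicas whose updates are paused

    def on_replica_request(self, replica_id):
        self.last_request[replica_id] = time.time()
        if replica_id in self.dormant:        # recovery: resume updates and
            self.dormant.discard(replica_id)  # refresh the stale replica
            return "resend_latest_version"
        return "up_to_date"

    def on_data_update(self):
        now = time.time()
        targets = []
        for rid, ts in self.last_request.items():
            if now - ts > self.max_idle:
                self.dormant.add(rid)  # long request cycle: pause messages
            elif rid not in self.dormant:
                targets.append(rid)    # still active: send update message
        return targets  # only these replicas receive the update
```

The traffic saving comes from `on_data_update` skipping dormant replicas; consistency is preserved because a dormant replica is forced through `resend_latest_version` before it is used again.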
Abstract: A new data-access method is introduced that effectively solves the problem of high-speed, real-time reading of nuclear instrument data within a small storage space. The method applies the "linked list" data storage mode to a Micro Control Unit (MCU) system and realizes pointer-based access to nuclear data within the MCU's small storage space. Experimental results show that this method overcomes several problems of traditional data storage methods, offering simple program design, stable performance, accurate data, strong repeatability, and savings in storage space.
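The paper's MCU code is not shown. On a microcontroller, a "linked list" is typically laid over preallocated arrays, with indices playing the role of pointers so that no dynamic allocation is needed. A sketch of that idea (in Python for brevity; a real MCU port would be the same layout in C over fixed arrays):

```python
class StaticLinkedList:
    """Linked list over a preallocated buffer, as used on MCUs with no
    heap: 'pointers' are indices into fixed arrays, so the storage cost
    is constant and insertion/removal never moves data."""

    EMPTY = -1

    def __init__(self, capacity):
        self.data = [0] * capacity
        self.next = list(range(1, capacity)) + [self.EMPTY]  # free chain
        self.free = 0           # index of the first free slot
        self.head = self.EMPTY  # index of the oldest stored datum
        self.tail = self.EMPTY

    def append(self, value):
        if self.free == self.EMPTY:
            return False  # buffer full
        slot, self.free = self.free, self.next[self.free]
        self.data[slot] = value
        self.next[slot] = self.EMPTY
        if self.head == self.EMPTY:
            self.head = slot
        else:
            self.next[self.tail] = slot
        self.tail = slot
        return True

    def pop_front(self):
        if self.head == self.EMPTY:
            return None  # empty
        slot = self.head
        self.head = self.next[slot]
        if self.head == self.EMPTY:
            self.tail = self.EMPTY
        self.next[slot] = self.free  # return the slot to the free chain
        self.free = slot
        return self.data[slot]
```

Because a freed slot is immediately reusable and reads follow the `next` chain, acquisition can keep writing at full speed while the host drains the list, which matches the real-time-reading motivation in the abstract.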
Abstract: Several studies have shown that species lists generated from distribution occurrence records in the Global Biodiversity Information Facility (GBIF) are generally very incomplete, and thus inappropriate for ecological and biogeographic studies that require high sampling completeness. Nevertheless, Suissa et al. (2021) generated fern species lists from GBIF data for 100 km×100 km grid cells across the world and used them to determine fern diversity hotspots and species richness-climate relationships. We evaluate the completeness of fern species lists derived from GBIF at the grid-cell scale and at a larger spatial scale, and determine whether fern data derived from GBIF are appropriate for studies of the relations of species composition and richness with climatic variables. We show that the species sampling completeness of GBIF is low (<40%) for most of the grid cells examined, and that such low sampling completeness can substantially bias the investigation of geographic and ecological patterns of species diversity and the identification of diversity hotspots. We conclude that fern species lists derived from GBIF are generally very incomplete across a wide range of spatial scales and are not appropriate for studies that require highly complete species lists. We present a map showing global patterns of fern species diversity based on complete or nearly complete regional fern species lists.
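The abstract does not say how sampling completeness was computed. A common convention is observed richness divided by a nonparametric asymptotic richness estimate such as Chao2, which extrapolates total richness from the species seen in exactly one or two sampling units. A sketch under that assumption (not necessarily the paper's estimator):

```python
def chao2_completeness(incidence_counts, n_units):
    """Sampling completeness = observed richness / Chao2 asymptotic
    richness. incidence_counts[s] is the number of sampling units in
    which species s was detected; n_units is the number of units."""
    s_obs = sum(1 for c in incidence_counts if c > 0)
    q1 = sum(1 for c in incidence_counts if c == 1)  # uniques
    q2 = sum(1 for c in incidence_counts if c == 2)  # duplicates
    if q2 > 0:
        s_est = s_obs + ((n_units - 1) / n_units) * q1 * q1 / (2 * q2)
    else:  # bias-corrected form when no species occurs exactly twice
        s_est = s_obs + ((n_units - 1) / n_units) * q1 * (q1 - 1) / 2
    return s_obs / s_est  # fraction of estimated richness observed

# A grid cell dominated by singletons scores as poorly sampled:
print(chao2_completeness([1, 1, 1, 1, 2, 3, 5, 4], n_units=20))
```

Many singletons signal that further sampling would keep adding species, which is exactly why GBIF-derived cell lists with sparse records fall below the <40% completeness threshold the abstract reports.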
Abstract: Often in longitudinal studies, some subjects complete their follow-up visits, but others miss visits for various reasons. Among those who miss follow-up visits, some may learn that the event of interest has already happened by the time they return. In this case, not only are their event times interval-censored, but their time-dependent measurements are also incomplete. This problem was motivated by data from a national longitudinal survey of youth. Maximum likelihood estimation (MLE) based on the expectation-maximization (EM) algorithm is used for parameter estimation. The missing information principle is then applied to estimate the variance-covariance matrix of the MLEs. Simulation studies demonstrate that the proposed method works well in terms of bias, standard error, and power for samples of moderate size. The National Longitudinal Survey of Youth 1997 (NLSY97) data are analyzed for illustration.
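As a toy illustration of the EM idea for interval-censored event times, consider exponentially distributed times where each subject's event is only known to fall between two visits (or after the last one). The paper's model, with time-dependent measurements, is considerably richer; this sketch covers only the censoring mechanics:

```python
import math

def em_exponential_interval_censored(intervals, n_iter=200, lam=1.0):
    """EM for the rate of an exponential event-time distribution when
    each time is only known to lie in (L, R]; R = math.inf means
    right-censored (event not yet observed at the last visit)."""
    n = len(intervals)
    for _ in range(n_iter):
        total = 0.0
        for L, R in intervals:
            if math.isinf(R):            # E-step, right-censored:
                total += L + 1.0 / lam   # E[T | T > L] = L + 1/lam
            else:                        # E-step, interval-censored:
                num = ((L + 1 / lam) * math.exp(-lam * L)
                       - (R + 1 / lam) * math.exp(-lam * R))
                den = math.exp(-lam * L) - math.exp(-lam * R)
                total += num / den       # E[T | L < T <= R]
        lam = n / total                  # M-step: complete-data MLE
    return lam

# Visits at times 2 and 5: event found between visits, or not yet by 5.
print(em_exponential_interval_censored([(0, 2), (2, 5), (5, math.inf)]))
```

The E-step replaces each unknown event time with its conditional expectation given the observation window, and the M-step is the ordinary complete-data MLE; iterating the two converges to the interval-censored MLE.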
基金supported in part by the National Natural Science Foundation of China under Grant No.61440014&&No.61300196the Liaoning Province Doctor Startup Fundunder Grant No.20141012+2 种基金the Liaoning Province Science and Technology Projects under Grant No.2013217004the Shenyang Province Science and Technology Projects under Grant Nothe Fundamental Research Funds for the Central Universities under Grant No.N130317002 and No.N130317003
Abstract: With the growing trend toward using cloud storage, the problem of efficiently checking and proving data integrity needs more consideration. Many cryptography and security schemes, such as PDP (Provable Data Possession) and POR (Proofs of Retrievability), were proposed for this problem. Although many efficient schemes for static data have been constructed, only a few dynamic schemes exist, such as DPDP (Dynamic Provable Data Possession). But the DPDP scheme falls short when updates are not proportional to a fixed block size. FlexList-based Dynamic Provable Data Possession (FlexDPDP) is an optimized variant of DPDP. However, the update operations (insertion, removal, modification) in the FlexDPDP scheme apply to only a single node at a time, while operations on multiple consecutive nodes are more common in practice. To solve this problem, we propose optimized algorithms for multiple consecutive nodes, including MultiNodes Insert and Verification, MultiNodes Remove and Verification, and MultiNodes Modify and Verification, and we analyze their cost. For m consecutive nodes, an insertion takes O(m) + O(log N) + O(log m), where N is the number of leaf nodes of the FlexList; a removal takes O(log N); and a modification costs the same as the original algorithm. Finally, we compare the optimized algorithms with the original FlexList in experiments, and the results show that our scheme achieves higher time and space efficiency.
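A back-of-the-envelope model shows why the batched operations pay off: m single-node inserts each repeat an O(log N) root-to-leaf search, while a MultiNodes insert pays one search, O(m) linking, and an O(log m) local rebuild. The constant factors below are assumed equal to 1, which the paper does not claim; this is a cost intuition, not its analysis:

```python
import math

def single_insert_cost(m, N):
    """m repeated single-node inserts: each pays a root-to-leaf search
    and authentication-path recomputation, O(log N) apiece."""
    return sum(math.log2(N + i) for i in range(m))

def multi_insert_cost(m, N):
    """Batched insert per the abstract's bound: one O(log N) search to
    the insertion point, O(m) linking of the new nodes, and an O(log m)
    rebuild of the local authenticated structure."""
    return math.log2(N) + m + math.log2(m)

N, m = 1_000_000, 1024
print(single_insert_cost(m, N))  # ~ m * log2(N): about 20,000 units
print(multi_insert_cost(m, N))   # ~ log2(N) + m + log2(m): about 1,050
```

The gap grows with m because the batched variant amortizes the tree traversal across the whole run of consecutive nodes.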
基金supported by the National Natural Science Foundation of China(Nos.41725007,42250104,41830323,42002015,and 42302001)the Fundamental Research Funds for the Central Universities(Nos.020614380168,JZ2023HGQA0144 and JZ2023HGTA0175)。
Abstract: A knowledge graph (KG) is a knowledge base that integrates and represents data based on a graph-structured data model or topology. Geoscientists have made efforts to construct geoscience-related KGs to overcome semantic heterogeneity and to facilitate knowledge representation, data integration, and text analysis. However, there is currently no comprehensive paleontology KG, nor data-driven discovery built on one. In this study, we constructed a two-layer model to represent the ordinal hierarchical structure of the paleontology KG, following a top-down construction process. An ontology containing 19,365 concepts had been defined as of 2023. On this basis, we derived a synonymy list from the paleontology KG and designed corresponding online functions in the OneStratigraphy database to showcase the use of the KG in paleontological research.
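The abstract does not describe how the synonymy list is derived from the KG; one plausible reading is grouping all labels that the graph attaches to the same canonical concept. A sketch under that assumption (the relation names and edge layout are hypothetical, not from the OneStratigraphy schema):

```python
from collections import defaultdict

# Hypothetical KG edges: (label, relation, canonical_concept).
edges = [
    ("Tyrannosaurus rex",     "preferred_label_of", "concept:trex"),
    ("Manospondylus gigas",   "synonym_of",         "concept:trex"),
    ("Apatosaurus excelsus",  "preferred_label_of", "concept:apatosaurus"),
    ("Brontosaurus excelsus", "synonym_of",         "concept:apatosaurus"),
]

def derive_synonymy_list(edges):
    """Group every label attached to the same canonical concept; any
    group with more than one label yields a synonymy entry."""
    groups = defaultdict(list)
    for label, _relation, concept in edges:
        groups[concept].append(label)
    return {c: sorted(ls) for c, ls in groups.items() if len(ls) > 1}

for concept, labels in derive_synonymy_list(edges).items():
    print(concept, "->", labels)
```

Grouping by canonical concept rather than by pairwise links keeps the synonymy list transitive by construction, which is what an online lookup function needs.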
Abstract: From the aspects of profitability, debt-paying ability, operational capacity, cash flow capacity, and innovation capacity, a sustainable development evaluation system for new energy listed companies in China was established, and an empirical analysis was then conducted. Finally, some policy suggestions were put forward. The empirical analysis shows that there are many problems in the sustainable development of new energy listed companies in China.
基金supported by Grant-in-Aid for Scientific Research(A)(#24240015A)
Abstract: Uncertain data are common due to the increasing usage of sensors, radio frequency identification (RFID), GPS, and similar devices for data collection. The causes of uncertainty include limitations of measurements, inclusion of noise, inconsistent supply voltage, and delay or loss of data in transfer. In order to manage, query, or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by the probability density function of a Gaussian distribution. The naive approach to distance-based outlier detection uses a nested loop and is very costly due to the expensive distance function between two uncertain objects. Therefore, a populated-cells list (PC-list) approach to outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for the top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase the efficiency of the top-k outlier detection algorithm. An extensive empirical study on synthetic and real datasets is also presented to demonstrate the accuracy, efficiency, and scalability of the proposed algorithms.
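The PC-list structure itself is not specified in the abstract, but the general cell-based pruning idea is standard: hash objects into grid cells sized so that all r-neighbors of a point lie in the surrounding 3×3 block of populated cells, so most pairwise distance computations are never performed. A sketch with exact points for brevity (the paper's objects are Gaussian-distributed, making each distance test far more expensive and the pruning correspondingly more valuable):

```python
import heapq
from collections import defaultdict
from math import dist

def topk_distance_outliers(points, k, r):
    """Top-k distance-based outliers in 2D: score = number of neighbors
    within radius r (fewer neighbors = stronger outlier). With cell
    width r, every r-neighbor of a point lies in the 3x3 block of cells
    around it, so only populated nearby cells are ever scanned."""
    grid = defaultdict(list)  # (cell_x, cell_y) -> points in that cell
    for p in points:
        grid[(int(p[0] // r), int(p[1] // r))].append(p)

    def neighbor_count(p):
        cx, cy = int(p[0] // r), int(p[1] // r)
        count = 0
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in grid.get((cx + dx, cy + dy), ()):
                    if q is not p and dist(p, q) <= r:
                        count += 1
        return count

    # The k points with the fewest r-neighbors are the top-k outliers.
    return heapq.nsmallest(k, points, key=neighbor_count)

pts = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5.0, 5.0), (0.9, 0.8)]
print(topk_distance_outliers(pts, k=2, r=1.0))  # (5.0, 5.0) is isolated
```

Compared with the naive nested loop's O(n²) distance tests, the grid restricts each point's candidate set to its populated neighborhood, which is the same effect the PC-list achieves for uncertain objects.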