Machine learning (ML) has recently enabled many modeling tasks in design, manufacturing, and condition monitoring due to its unparalleled learning ability using existing data. Data have become the limiting factor when implementing ML in industry. However, there has been no systematic investigation of how data quality can be assessed and improved for ML-based design and manufacturing. The aim of this survey is to uncover the data challenges in this domain and review the techniques used to resolve them. To establish the background for the subsequent analysis, crucial data terminologies in ML-based modeling are reviewed and categorized into data acquisition, management, analysis, and utilization. Thereafter, the concepts and frameworks established to evaluate data quality and imbalance, including data quality assessment, data readiness, information quality, data biases, fairness, and diversity, are further investigated. The root causes and types of data challenges, including human factors, complex systems, complicated relationships, lack of data quality, data heterogeneity, data imbalance, and data scarcity, are identified and summarized. Methods to improve data quality and mitigate data imbalance, and their applications in this domain, are reviewed. This literature review focuses on two promising methods: data augmentation and active learning. The strengths, limitations, and applicability of the surveyed techniques are illustrated. The trends of data augmentation and active learning are discussed with respect to their applications, data types, and approaches. Based on this discussion, future directions for data quality improvement and data imbalance mitigation in this domain are identified.
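Of the two methods the survey highlights, active learning is the more algorithmic: instead of labeling data at random, the model asks for labels where it is least confident. A minimal sketch of pool-based uncertainty sampling, one common variant, with synthetic data and scikit-learn as assumed dependencies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_pool, n_queries=10):
    """Return indices of the pool samples the model is least sure about."""
    proba = model.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)        # least-confidence score
    return np.argsort(uncertainty)[-n_queries:]

# Illustrative loop: label a small seed set, then grow it where the model
# is most uncertain instead of labeling data at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
labeled = list(range(20))                        # seed set
pool = [i for i in range(len(X)) if i not in labeled]
for _ in range(5):
    model = LogisticRegression().fit(X[labeled], y[labeled])
    picks = uncertainty_sampling(model, X[pool])
    labeled += [pool[i] for i in picks]          # the "oracle" labels these
    pool = [i for i in pool if i not in labeled]
```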
With the globalization of the economy, maritime trade has surged, posing challenges for the supervision of marine vessel activities. The automatic identification system (AIS) is an effective means of shipping traffic service, but many uncertainties exist regarding its data quality. In this study, AIS data from the Haiyang (HY) series of satellites were used to assess data quality and to analyze the global distribution and update frequencies of ship trajectories from 2019 to 2023. Through the analysis of maritime mobile service identity (MMSI) numbers, we identified 340,185 unique vessels, 80.1% of which adhered to the International Telecommunication Union standards. Approximately 49.7% of ships exhibit significant data gaps, and 1.1% show inconsistencies in their AIS data sources. In the central Pacific Ocean at low latitudes and along the coast of South America (30°-60°S), a heightened incidence of abnormal ship trajectories has been consistently observed, particularly in areas associated with fishing activities. According to the spatial distribution of ship trajectories, AIS data exhibit numerous deficiencies, particularly in high-traffic regions such as the East China Sea and South China Sea. In contrast, ship trajectories in the high-latitude polar regions are relatively complete. With the increased number of HY satellites equipped with AIS receivers, the quantity of trajectory points shows a growing trend, leading to increasingly complete trajectories. This trend highlights the significant potential of AIS data acquired from HY satellites to increase the accuracy of vessel tracking.
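As an illustration of the standards-adherence check described above, the sketch below tests whether an MMSI looks like a valid ship-station identifier: nine digits whose leading three-digit maritime identification digits (MID) code falls in the range assigned to ship stations. The record layout and field names are our assumptions, not the study's actual pipeline:

```python
def is_standard_ship_mmsi(mmsi: str) -> bool:
    """Plausibility check for a ship-station MMSI: nine digits whose
    leading three-digit MID code falls in the 201-775 range used for
    ship stations. Coast stations, SAR aircraft, etc. use other
    prefixes and are deliberately rejected here."""
    if not (mmsi.isdigit() and len(mmsi) == 9):
        return False
    return 201 <= int(mmsi[:3]) <= 775

# Filtering decoded AIS records (record layout is illustrative).
records = [{"mmsi": "412345678"}, {"mmsi": "99"}, {"mmsi": "003669999"}]
valid = [r for r in records if is_standard_ship_mmsi(r["mmsi"])]
print(len(valid), "of", len(records), "records pass the check")
```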
The Belt and Road global navigation satellite system (B&R GNSS) network is the first large-scale deployment of Chinese GNSS equipment in a seismic system. Prior to this, there had been few systematic assessments of the data quality of Chinese GNSS equipment. In this study, data from four representative GNSS sites in different regions of China were analyzed using the G-Nut/Anubis software package. Four main indicators (data integrity rate, data validity ratio, multipath error, and cycle slip ratio) were used to systematically analyze data quality, while the seismic monitoring capability of the network was evaluated based on earthquake magnitudes estimated from high-frequency GNSS data. The results indicate that the quality of the data produced by the three types of Chinese receivers used in the network meets the needs of earthquake monitoring and the new seismic industry standards, providing a reference for equipment selection in future projects. After the B&R GNSS network was established, the monitoring capability for earthquakes with magnitudes greater than M_W 6.5 in most parts of the Sichuan-Yunnan region improved by approximately 20%. In key areas such as the Sichuan-Yunnan Rhomboid Block, the monitoring capability increased by more than 25%, which has greatly improved the effectiveness of regional comprehensive earthquake management.
Sign language datasets are essential for sign language recognition and translation (SLRT). Current public sign language datasets are small and lack diversity, which does not meet the practical application requirements of SLRT. However, building a large-scale and diverse sign language dataset is difficult because sign language data on the Internet are scarce, and the quality of some of the collected data is not up to standard. This paper proposes a two-information-stream transformer (TIST) model to judge whether the quality of sign language data is qualified. To verify that TIST effectively improves sign language recognition (SLR), we build two datasets, a screened dataset and an unscreened dataset. In the experiments, visual alignment constraint (VAC) is used as the baseline model. The experimental results show that the screened dataset achieves a better word error rate (WER) than the unscreened dataset.
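For reference, the WER used to compare the two datasets is the word-level edit distance normalized by the reference length; a minimal, self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with the standard edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / max(len(ref), 1)

print(word_error_rate("my name is anna", "my name anna is"))  # 0.5
```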
Volunteered geographic information (VGI) has entered a phase where there is both a substantial amount of crowdsourced information available and strong interest from organizations in using it. But the issue of assessing the quality of VGI without resorting to a comparison with authoritative data remains an open challenge. This article first formulates the problem of quality assessment of VGI data, and then presents a model to measure the trustworthiness of information and the reputation of contributors by analyzing geometric, qualitative, and semantic aspects of edits over time. An implementation of the model was run on a small dataset for a preliminary empirical validation. The results indicate that the computed trustworthiness provides a valid approximation of VGI quality.
Since the British National Archives put forward the concept of digital continuity in 2007, several developed countries have worked out digital continuity action plans. However, technologies for guaranteeing digital continuity are still lacking. This paper first analyzes the requirements of the digital continuity guarantee for electronic records based on data quality theory, and then points out the necessity of a data quality guarantee for electronic records. Moreover, we reformulate the digital continuity guarantee of electronic records as ensuring their consistency, completeness, and timeliness, and construct the first technology framework for the digital continuity guarantee of electronic records. Finally, temporal functional dependency technology is utilized to build the first integration method to ensure the consistency, completeness, and timeliness of electronic records.
The real-time energy flow data obtained in industrial production processes are usually of low quality. It is difficult to accurately predict the short-term energy flow profile using these field data, which diminishes the effect of industrial big data and artificial intelligence in industrial energy systems. The real-time data on blast furnace gas (BFG) generation collected at iron and steel sites are also of low quality. To tackle this problem, a three-stage data quality improvement strategy was proposed for predicting BFG generation. In the first stage, the correlation principle was used to test the sample set. In the second stage, the original sample set was rectified and updated. In the third stage, a Kalman filter was employed to eliminate the noise in the updated sample set. The method was verified using an autoregressive integrated moving average model, a back-propagation neural network model, and a long short-term memory (LSTM) model. The results show that prediction models based on the proposed three-stage data quality improvement method perform well. The LSTM model has the best prediction performance, with a mean absolute error of 17.85 m³/min, a mean absolute percentage error of 0.21%, and an R² of 95.17%.
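The denoising idea behind the third stage can be illustrated with a minimal scalar Kalman filter; the random-walk state model, the noise variances, and the synthetic BFG series below are our assumptions for illustration, not the paper's tuned configuration:

```python
import numpy as np

def kalman_denoise(z, q=1.0, r=25.0):
    """Scalar random-walk Kalman filter: state x_k = x_{k-1} + w (var q),
    measurement z_k = x_k + v (var r). Returns the filtered series."""
    x, p = z[0], r                      # initialize from the first measurement
    out = np.empty_like(z, dtype=float)
    for k, zk in enumerate(z):
        p = p + q                       # predict: uncertainty grows
        gain = p / (p + r)              # Kalman gain
        x = x + gain * (zk - x)         # update with the new measurement
        p = (1 - gain) * p
        out[k] = x
    return out

# Smoothing a noisy generation series (units m^3/min; values synthetic).
t = np.arange(200)
true = 8000 + 300 * np.sin(t / 20)
noisy = true + np.random.default_rng(1).normal(0, 50, size=t.size)
smooth = kalman_denoise(noisy, q=4.0, r=2500.0)
```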
OpenStreetMap (OSM) data are widely used, but their reliability is still variable. Many contributors to OSM have not been trained in geography or surveying, and consequently their contributions, including geometry and attribute data inserts, deletions, and updates, can be inaccurate, incomplete, inconsistent, or vague. There are some mechanisms and applications dedicated to discovering bugs and errors in OSM data. Such systems can remove errors through user checks and predefined rules, but they need an extra control process to check the real-world validity of suspected errors and bugs. This paper focuses on finding bugs and errors based on patterns and rules extracted from the tracking data of users. The underlying idea is that certain characteristics of user trajectories are directly linked to the type of feature. Using such rules, sets of potential bugs and errors can be identified and stored for further investigation.
Nowadays, several research projects show interest in employing volunteered geographic information (VGI) to improve their systems with up-to-date and detailed data. The European project CAP4Access is one of the successful examples of such international research projects; it aims to improve the accessibility of people with restricted mobility using crowdsourced data. In this project, OpenStreetMap (OSM) is used to extend OpenRouteService, a well-known routing platform. However, a basic challenge the project tackled was the incompleteness of OSM data with regard to certain information required for wheelchair accessibility (e.g. sidewalk information, kerb data, etc.). In this article, we present the results of an initial assessment of sidewalk data in OSM at the beginning of the project, as well as our approach to raising awareness and using tools for tagging accessibility data in the OSM database to enrich sidewalk data completeness. Several experiments were carried out in different European cities, and a discussion of the experimental results and the lessons learned is provided. The lessons learned yield recommendations that help in organizing better mapping party events in the future. We conclude by reporting on how and to what extent OSM sidewalk data completeness in these study areas benefited from the mapping parties by the end of the project.
This paper presents a methodology to determine three data quality (DQ) risk characteristics: accuracy, comprehensiveness, and nonmembership. The methodology provides a set of quantitative models to confirm the information quality risks for the database of a geographical information system (GIS). Four quantitative measures are introduced to examine how the quality risks of source information affect the quality of information outputs produced using the relational algebra operations Selection, Projection, and Cubic Product. The methodology can be used to determine how quality risks associated with diverse data sources affect the derived data. In the construction business, the GIS is the prime source of information on the location of cables, and detection time strongly depends on whether maps indicate the presence of cables. Poor data quality in the GIS can contribute to increased risk or higher risk-avoidance costs. A case study provides a numerical example of the calculation of the trade-offs between risk and detection costs, and of the costs of data quality. We conclude that the model contributes valuable new insight.
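One simplified way to read the propagation idea (our illustration, not the paper's exact quantitative models): treat accuracy as the probability that a tuple is error-free. Selection and Projection then pass that probability through essentially unchanged, while a tuple produced by the Cubic (Cartesian) Product is correct only if both source tuples are, so accuracies multiply:

```python
# Illustrative propagation of a per-tuple accuracy probability through
# relational operators (a simplification, not the paper's exact models).
def accuracy_select(a_in: float) -> float:
    return a_in                  # selection keeps tuples unchanged

def accuracy_project(a_in: float) -> float:
    return a_in                  # assumes dropped attributes carried no
                                 # errors of their own

def accuracy_product(a_left: float, a_right: float) -> float:
    return a_left * a_right      # combined tuple correct only if both are

print(accuracy_product(0.95, 0.90))  # 0.855: quality degrades when combining
```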
Wired drill pipe (WDP) technology is one of the most promising data acquisition technologies in today's oil and gas industry. For the first time, it allows sensors to be positioned along the drill string, enabling valuable data to be collected and transmitted not only from the bottom hole assembly (BHA) but also along the entire length of the wellbore to the drill floor. The technology has received industry acceptance as a viable alternative to the typical logging while drilling (LWD) method. Recently, more and more WDP applications can be found in challenging drilling environments around the world, bringing many innovations to the industry. Nevertheless, much of the data acquired from WDP can be noisy and, in some circumstances, of very poor quality. Diverse factors contribute to the poor data quality; the most common sources include mis-calibrated sensors, sensor drift, errors during data transmission, and abnormal conditions in the well. The challenge of improving data quality has attracted increasing attention from researchers during the past decade. This paper proposes a promising solution that corrects the raw WDP data and estimates unmeasurable parameters to reveal downhole behaviors. An advanced data processing method, data validation and reconciliation (DVR), is employed, which makes use of the redundant data from multiple WDP sensors to filter out measurement noise and ensure the coherence of all sensors and models. Moreover, it has the ability to distinguish accurate measurements from inaccurate ones. In addition, the data with improved quality can be used to estimate crucial parameters in the drilling process that are unmeasurable in the first place, hence providing better model calibration for integrated well planning and real-time operations.
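At its core, linear DVR adjusts redundant measurements as little as possible, weighted by each sensor's variance, so that they satisfy physical balance constraints. A minimal sketch under these textbook assumptions; the flow values and the single mass-balance constraint are illustrative, not from the paper:

```python
import numpy as np

def reconcile(y, sigma2, A):
    """Classical linear data reconciliation: find x closest to the
    measurements y (weighted by 1/variance) subject to A @ x = 0.
    Closed form: x = y - S A^T (A S A^T)^-1 A y, with S = diag(sigma2)."""
    S = np.diag(sigma2)
    correction = S @ A.T @ np.linalg.solve(A @ S @ A.T, A @ y)
    return y - correction

# Toy example: two branch-flow sensors and a total-flow sensor must
# balance, i.e. y[0] + y[1] - y[2] = 0 (values are illustrative).
y = np.array([10.2, 5.1, 14.8])
sigma2 = np.array([0.04, 0.04, 0.09])   # the total-flow sensor is noisier
A = np.array([[1.0, 1.0, -1.0]])
print(reconcile(y, sigma2, A))          # adjusted values satisfy the balance
```

The noisier sensor absorbs the larger share of the correction, which is exactly how redundancy lets DVR separate trustworthy measurements from suspect ones.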
One of the goals of data collection is to prepare for decision-making, so high data quality requirements must be satisfied. Rational evaluation of data quality is an effective way to identify data problems in time and to ensure that the evaluated data satisfy the requirements of the decision maker. A data quality evaluation method based on a fuzzy neural network is proposed. First, the criteria for the evaluation of data quality are selected to construct fuzzy sets of evaluation grades; then, using the learning ability of the neural network, an objective evaluation of membership is carried out, which can be used for the effective evaluation of data quality. This research has been applied in the platform 'data report of national compulsory education outlay guarantee' of the Chinese Ministry of Education. The method can be used for the effective evaluation of data quality in general, allowing the data quality situation to be assessed more completely, objectively, and in a more timely manner.
Most GIS databases contain data errors. The quality of the data sources, such as traditional paper maps or more recent remote sensing data, determines spatial data quality. In the past several decades, different statistical measures have been developed to evaluate data quality for different types of data, such as nominal categorical data, ordinal categorical data, and numerical data. Although these methods were originally proposed for medical or psychological research, they have been widely used to evaluate spatial data quality. In this paper, we first review statistical methods for evaluating data quality and discuss under what conditions they should be used and how to interpret the results, followed by a brief discussion of statistical software and packages that can be used to compute these data quality measures.
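As one example of the agreement statistics such reviews cover for nominal categorical data, Cohen's kappa corrects the raw agreement between two ratings for chance agreement. A small self-contained sketch; the land-cover labels are made up:

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two nominal ratings:
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(labels_a)
    cats = set(labels_a) | set(labels_b)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_exp = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                for c in cats)
    return (p_obs - p_exp) / (1 - p_exp)

# e.g. a classified land-cover map vs. ground truth at sample points
truth  = ["water", "urban", "forest", "urban", "forest", "water"]
mapped = ["water", "urban", "forest", "forest", "forest", "water"]
print(round(cohens_kappa(truth, mapped), 3))   # 0.75
```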
The dairy herd improvement (DHI) data from Henan Province were analyzed statistically to establish screening criteria for relevant data, thereby laying a foundation for the genetic evaluation of dairy cows. Using the 2,152,451 test-day records of 155,893 Chinese Holstein dairy cows collected by the Henan Dairy Herd Improvement Center from January 2008 to April 2016, the dynamics of test times during a complete lactation, test interval during a complete lactation, days in milk (DIM) of the first test-day record, daughter descendant number and herd number per bull, age at first calving, and pedigree integrity rate across different years and herd sizes were analyzed with the MEANS procedure of SAS 9.4. In addition, the data applicable to genetic evaluation were screened by an SQL program. The results showed that during 2008-2015, the number of cows participating in DHI in Henan Province increased from 7,379 to 93,706; the test-day milk yield increased from 19.91 to 24.05 kg; the somatic cell count decreased from 411.09×10³ to 277.08×10³ cells/ml; the percentage of cows with DIM ranging from 5 to 305 d reached 70.92%; the average number of tests increased from 3.20 to 6.31; the test interval decreased from 70.22 to 33.83 d; cows with an age at first calving of 25 months were dominant, accounting for 12.57%; bulls with 20 or more daughter descendants distributed across 10 or more farms accounted for 6.05%; the one-generation pedigree integrity rate was 82.54%; and the percentage of data usable for genetic evaluation was 20.67%, which was lower than the results of other similar studies.
Several organizations have migrated to the cloud for better quality in business engagements and security. Data quality is crucial in present-day activities. Information is generated and collected from data representing real-time facts and activities. Poor data quality affects organizational decision-making policy and customer satisfaction, and negatively influences the organization's scheme of execution. Data quality also has a massive influence on the accuracy, complexity, and efficiency of machine and deep learning tasks. There are several methods and tools to evaluate data quality so as to ensure its smooth incorporation in model development. The bulk of data quality tools permit the assessment of data sources only at a certain point in time, so their arrangement and automation are consequently an obligation of the user. In ensuring automatic data quality, several steps are involved in gathering data from different sources and monitoring data quality, and any problems with the data quality must be adequately addressed. There was a gap in the literature, as no previous attempt had been made to collate the advances in the different dimensions of automatic data quality. This limited narrative review of the existing literature sought to address this gap by correlating the different steps and advancements related to automatic data quality systems. The six crucial data quality dimensions in organizations are discussed, and big data are compared and classified. This review highlights existing data quality models and strategies that can contribute to the development of automatic data quality systems.
In contrast with research on new models, little attention has been paid to the impact of low- or high-quality data feeding a dialogue system. The present paper makes the first attempt to fill this gap by extending our previous work on question-answering (QA) systems, investigating the effect of misspellings on QA agents and how context changes can enhance the responses. Instead of using large language models trained on huge datasets, we propose a method that enhances the model's score by modifying only the quality and structure of the data fed to the model. It is important to identify the features that modify agent performance, because a high rate of wrong answers can make students lose interest in using the QA agent as an additional tool for distance learning. The results demonstrate that the accuracy of the proposed context simplification exceeds 85%. These findings shed light on the importance of question data quality and context complexity as key dimensions of a QA system. In conclusion, the experimental results on questions and contexts showed that controlling and improving the various aspects of data quality around the QA system can significantly enhance its robustness and performance.
The basic task of a geomagnetic observatory is to produce accurate, reliable, continuous, and complete observational data. The aim of examination is to judge the quality status of the data. Based on the operating principles of geomagnetic instruments and the operating status they should achieve, together with geomagnetic activity and its distribution characteristics in the time and spatial domains, the authors propose a complete set of data quality examinations. The paper discusses the physical basis, examination method, and results for scale values, baseline values, monthly mean values, daily mean values, maximum and minimum values of the daily range, magnetic storms, and the K index. Practice has proved that this set of examinations is feasible and useful for raising and guaranteeing the quality of observational data.
Geographical studies of outdoor activities have increased in recent years with the rise in popularity of these activities worldwide, including in Japan. Volunteered geographic information (VGI) is a key tool for organizing outdoor activities, as it offers a means to determine the locational information and names of places. To evaluate the quality of VGI, geospatial data generated by land survey agencies and other VGI are often utilized as reference data. However, since such reference data may not be available, other methods are necessary to assure the quality of VGI. In this study, we examined five trust indicators based on the inherent characteristics of VGI through an empirical case study. We used mountain names extracted from OpenStreetMap in Japan as data because there was almost no other VGI in the vicinity. As a result, we isolated three trust indicators, namely versions, users, and tag corrections, for examining the thematic accuracy of VGI, because these were the only statistically significant indicators. However, we found that the prediction rate of thematic accuracy was very low. To improve thematic accuracy, this study recommends using the most accurate versions, applying correctly given tags, and considering the motivations and characteristics of the VGI contributors.
In recent years, with rapid increases in the number of vehicles in China, the contribution of vehicle exhaust emissions to air pollution has become increasingly prominent. To achieve precise control of emissions, on-road remote sensing (RS) technology has been developed and applied for law enforcement and supervision. However, data quality remains an issue affecting the development and application of RS. In this study, RS data from a cross-road RS system used at a single site (from 2012 to 2015) were collected, the data screening process was reviewed, the issues with data quality were summarized, a new method of data screening and calibration was proposed, and the effectiveness of the improved data quality control methods was evaluated. The results showed that this method reduces the skewness and kurtosis of the data distribution by up to nearly 67%, which restores the actual characteristics of exhaust diffusion and is conducive to the identification of actual clean and high-emission vehicles. The annual variability of nitric oxide emission factors decreases by 60% on average, eliminating the annual drift of fleet emissions and improving data reliability.
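The skewness and kurtosis used above to evaluate the screening can be computed directly. In the sketch below, the simple 3-sigma screen and the synthetic readings are illustrative stand-ins, not the paper's calibration method:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def distribution_shape(x):
    """Skewness and excess kurtosis, the two shape statistics the
    screening method is evaluated against."""
    x = np.asarray(x, dtype=float)
    return skew(x), kurtosis(x)   # kurtosis() returns excess kurtosis

# Compare raw vs. screened emission readings (synthetic, for illustration):
rng = np.random.default_rng(7)
raw = np.concatenate([rng.normal(0.5, 0.2, 950),   # fleet background
                      rng.normal(5.0, 1.0, 50)])   # outlier-like readings
screened = raw[raw < raw.mean() + 3 * raw.std()]   # a simple 3-sigma screen
print(distribution_shape(raw), distribution_shape(screened))
```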
At present, water pollution has become an important factor affecting and restricting national and regional economic development. Total phosphorus is one of the main sources of water pollution and eutrophication, so the prediction of total phosphorus in water quality is of real research significance. This paper selects total phosphorus and turbidity data for analysis by crawling the data of a water quality monitoring platform. By constructing an attribute-object mapping relationship, the correlation between the two indicators was analyzed and used to predict future data. First, after cleaning outliers, the monthly and daily mean concentrations of total phosphorus and turbidity were calculated, and the correlation between them was analyzed. Second, the correlation coefficients at different times and frequencies were used to predict the values for the next five days, and the data trend was visualized with Python. Finally, the real values were compared with the predicted values, and the results showed that the correlation between total phosphorus and turbidity is useful for predicting water quality.
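The core of the approach, correlating the two indicators and projecting forward, can be sketched as follows; the readings and the simple linear fit are illustrative assumptions, not the platform's actual data or the paper's exact prediction procedure:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two aligned series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.corrcoef(x, y)[0, 1]

# If total phosphorus (TP) and turbidity correlate strongly, a linear
# fit on turbidity gives a crude TP estimate (values are synthetic).
turbidity = np.array([12.1, 15.3, 9.8, 20.4, 17.6, 14.2])
tp        = np.array([0.08, 0.11, 0.06, 0.15, 0.12, 0.10])
r = pearson_r(turbidity, tp)
slope, intercept = np.polyfit(turbidity, tp, deg=1)
tp_forecast = slope * 16.0 + intercept   # predicted TP at turbidity 16.0
print(r, tp_forecast)
```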