Machine learning (ML) has recently enabled many modeling tasks in design, manufacturing, and condition monitoring due to its unparalleled ability to learn from existing data, and data have become the limiting factor when implementing ML in industry. However, there has been no systematic investigation of how data quality can be assessed and improved for ML-based design and manufacturing. The aim of this survey is to uncover the data challenges in this domain and review the techniques used to resolve them. To establish the background for the subsequent analysis, crucial data terminologies in ML-based modeling are reviewed and categorized into data acquisition, management, analysis, and utilization. Thereafter, the concepts and frameworks established to evaluate data quality and imbalance, including data quality assessment, data readiness, information quality, data biases, fairness, and diversity, are investigated. The root causes and types of data challenges, including human factors, complex systems, complicated relationships, lack of data quality, data heterogeneity, data imbalance, and data scarcity, are identified and summarized. Methods to improve data quality and mitigate data imbalance, together with their applications in this domain, are reviewed. This literature review focuses on two promising methods: data augmentation and active learning. The strengths, limitations, and applicability of the surveyed techniques are illustrated, and the trends of data augmentation and active learning are discussed with respect to their applications, data types, and approaches. Based on this discussion, future directions for data quality improvement and data imbalance mitigation in this domain are identified.
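Since the survey singles out active learning as one of its two focal methods, a minimal uncertainty-sampling loop may make the idea concrete; the synthetic dataset, logistic-regression model, and query budget below are illustrative assumptions, not details taken from the survey.

```python
# Minimal uncertainty-sampling active-learning loop (illustrative sketch only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labeled = list(range(20))                        # small initial labeled pool
unlabeled = [i for i in range(len(y)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(10):                              # 10 query rounds
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])
    margins = np.abs(proba[:, 1] - proba[:, 0])  # small margin = uncertain
    query = unlabeled[int(np.argmin(margins))]   # most uncertain point
    labeled.append(query)                        # an "oracle" supplies its label
    unlabeled.remove(query)
```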
[Objective] In response to the insufficient integrity of hourly routine meteorological element data files, this paper aims to improve the availability and reliability of the data files and to provide high-quality data file support for meteorological forecasting and services. [Method] An efficient and accurate method for data file quality control and fusion processing is developed. By locating the times of missing measurements, data are extracted from the "AWZ.db" database and the minute routine meteorological element data file, then merged into the hourly routine meteorological element data file. [Result] Data processing efficiency and accuracy are significantly improved, and the problem of incomplete hourly routine meteorological element data files is solved. The study also emphasizes the importance of ensuring the accuracy of the files used and of carefully checking and verifying the fusion results, and it proposes strategies for improving data quality. [Conclusion] This method offers convenience for observation personnel and effectively improves the integrity and accuracy of data files. In the future, it is expected to provide more reliable data support for meteorological forecasting and services.
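A hedged sketch of the gap-filling workflow the abstract describes: locate missing hourly timestamps and patch them from minute-resolution records. The file names, column names, and pandas-based approach are hypothetical stand-ins for the paper's "AWZ.db" processing.

```python
# Locate missing hourly timestamps and fill them from minute-level data.
import pandas as pd

hourly = pd.read_csv("hourly.csv", parse_dates=["time"], index_col="time")
minute = pd.read_csv("minute.csv", parse_dates=["time"], index_col="time")

full_index = pd.date_range(hourly.index.min(), hourly.index.max(), freq="h")
missing = full_index.difference(hourly.index)      # locate missing hours

fallback = minute.resample("h").mean()             # hourly values from minutes
patched = hourly.reindex(full_index)
patched.loc[missing] = fallback.reindex(missing)   # merge into the hourly file
```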
In the new era, the impact of emerging productive forces has permeated every sector of industry. As the core production factor of these forces, data plays a pivotal role in industrial transformation and social development. Consequently, many domestic universities have introduced majors or courses related to big data. Among these, the Big Data Management and Applications major stands out for its interdisciplinary approach and emphasis on practical skills. However, as an emerging field, it has not yet accumulated a robust foundation in teaching theory and practice. Current instructional practices face issues such as unclear training objectives, inconsistent teaching methods and course content, insufficient integration of practical components, and a shortage of qualified faculty, all of which hinder both the development of the major and the overall quality of education. Taking the statistics course within the Big Data Management and Applications major as an example, this paper examines the challenges faced by statistics education in the context of emerging productive forces and proposes corresponding improvement measures. By introducing innovative teaching concepts and strategies, the teaching system for professional courses is optimized, and authentic classroom scenarios are recreated through illustrative examples. Questionnaire surveys and statistical analyses of data collected before and after the teaching reforms indicate that the curriculum changes effectively enhance instructional outcomes, promote the development of the major, and improve the quality of talent cultivation.
To further enhance the use of weather radar radial velocity in numerical applications, this paper proposes a quality control scheme for weather radar radial velocity from the perspective of data assimilation. The proposed scheme is based on the WRFDA (Weather Research and Forecasting Data Assimilation) system and uses the biweight algorithm to perform quality control on weather radar radial velocity data. A series of quality control tests conducted over the course of one month demonstrate that the scheme can be seamlessly integrated into the data assimilation process. The scheme is characterized by its simplicity, fast implementation, and ease of maintenance. By determining an appropriate threshold for quality control, the percentage of outliers identified by the scheme remains highly stable over time. Moreover, the mean errors and standard deviations of the O-B (observation-minus-background) values are significantly reduced, improving the overall data quality, while the main information and spatial distribution features of the data are preserved effectively. After quality control, the distribution of the O-B probability density function is adjusted in a manner that brings it closer to a Gaussian distribution, which is beneficial for the subsequent data assimilation process and contributes to more accurate numerical weather predictions. Thus, the proposed quality control scheme provides a valuable tool for improving weather radar data quality and enhancing numerical forecasting performance.
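The biweight algorithm named in the abstract is a standard robust estimator; the sketch below applies the textbook biweight mean and standard deviation to synthetic observation-minus-background values. The tuning constant and 3-sigma rejection threshold are assumptions, not the WRFDA scheme's actual settings.

```python
# Biweight-based outlier screening in O-B (observation minus background) space.
import numpy as np

def biweight_stats(x, c=7.5):
    m = np.median(x)
    mad = np.median(np.abs(x - m)) or 1e-9         # guard against zero MAD
    u = (x - m) / (c * mad)
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
    mean = m + np.sum(w * (x - m)) / np.sum(w)
    var = (len(x) * np.sum(((x - m) ** 2) * w**2)
           / np.sum((1 - u**2) * (1 - 5 * u**2) * (np.abs(u) < 1)) ** 2)
    return mean, np.sqrt(var)

omb = np.random.normal(0.0, 1.5, 5000)             # stand-in O-B radial velocities
omb[:50] += 25.0                                   # inject gross outliers
bmean, bstd = biweight_stats(omb)
keep = np.abs(omb - bmean) <= 3.0 * bstd           # reject beyond 3 sigma
print(len(omb) - keep.sum(), "values rejected")
```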
The Belt and Road global navigation satellite system (B&R GNSS) network is the first large-scale deployment of Chinese GNSS equipment in a seismic system. Prior to this, there had been few systematic assessments of the data quality of Chinese GNSS equipment. In this study, data from four representative GNSS sites in different regions of China were analyzed using the G-Nut/Anubis software package. Four main indicators (data integrity rate, data validity ratio, multipath error, and cycle slip ratio) were used to systematically analyze data quality, while the seismic monitoring capability of the network was evaluated based on earthquake magnitudes estimated from high-frequency GNSS data. The results indicate that the quality of the data produced by the three types of Chinese receivers used in the network meets the needs of earthquake monitoring and the new seismic industry standards, providing a reference for equipment selection in future projects. After the B&R GNSS network was established, the seismic monitoring capability for earthquakes with magnitudes greater than Mw 6.5 in most parts of the Sichuan-Yunnan region improved by approximately 20%. In key areas such as the Sichuan-Yunnan Rhomboid Block, the monitoring capability increased by more than 25%, which has greatly improved the effectiveness of regional comprehensive earthquake management.
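Two of the four indicators reduce to simple ratios once epoch and slip counts are known; the functions below illustrate that arithmetic. A real assessment would parse RINEX observation files (as G-Nut/Anubis does), which is beyond this sketch.

```python
# Two of the four GNSS data-quality indicators, computed from simple counts.
def integrity_rate(observed_epochs: int, expected_epochs: int) -> float:
    """Fraction of expected observation epochs actually recorded."""
    return observed_epochs / expected_epochs

def cycle_slip_ratio(total_obs: int, cycle_slips: int) -> float:
    """Observations per detected cycle slip (higher is better)."""
    return total_obs / max(cycle_slips, 1)

print(integrity_rate(2780, 2880))     # e.g., 30 s sampling over 24 h = 2880 epochs
print(cycle_slip_ratio(250000, 40))
```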
This study demonstrates the complexity and importance of water quality as a measure of the health and sustainability of ecosystems that directly influence biodiversity, human health, and the world economy. The predictability of water quality thus plays a crucial role in managing ecosystems, making informed decisions, and hence achieving proper environmental management. This study addresses these challenges by proposing an effective machine learning methodology applied to the public "Water Quality" dataset. The methodology models the dataset so as to provide classification predictions with high values of the evaluation parameters, such as accuracy, sensitivity, and specificity. The proposed methodology is based on two approaches: (a) the SMOTE method to deal with unbalanced data and (b) the skillful use of classical machine learning models. This paper uses Random Forests, Decision Trees, XGBoost, and Support Vector Machines because they can handle large datasets, can be trained on skewed datasets, and provide high accuracy in water quality classification. A key contribution of this work is the use of custom sampling strategies within the SMOTE approach, which significantly enhanced performance metrics and improved the handling of class imbalance. The results demonstrate significant improvements in predictive performance, achieving the highest reported metrics: accuracy (98.92% vs. 96.06%), sensitivity (98.3% vs. 71.26%), and F1 score (98.37% vs. 79.74%) using the XGBoost model. These improvements underscore the effectiveness of the custom SMOTE sampling strategies in addressing class imbalance. The findings contribute to environmental management by enabling ecology specialists to develop more accurate strategies for monitoring, assessing, and managing drinking water quality, ensuring better ecosystem and public health outcomes.
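The custom-sampling idea can be expressed directly with imbalanced-learn's SMOTE, which accepts per-class target counts; the counts, synthetic data, and XGBoost settings below are made-up placeholders, not the paper's tuned configuration.

```python
# SMOTE with a custom per-class sampling strategy, followed by XGBoost.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Custom strategy: oversample the minority class to a chosen target count.
target = {0: Counter(y_tr)[0], 1: 1500}
X_res, y_res = SMOTE(sampling_strategy=target, random_state=0).fit_resample(X_tr, y_tr)

clf = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(X_res, y_res)
print(clf.score(X_te, y_te))
```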
With the globalization of the economy, maritime trade has surged, posing challenges for the supervision of marine vessel activities. The automatic identification system (AIS) is an effective means of shipping traffic service, but many uncertainties exist regarding its data quality. In this study, AIS data from the Haiyang (HY) series of satellites were used to assess data quality and to analyze the global distribution and update frequencies of ship trajectories from 2019 to 2023. Through analysis of maritime mobile service identity numbers, we identified 340,185 unique vessels, 80.1% of which adhered to the International Telecommunication Union standards. Approximately 49.7% of ships exhibit significant data gaps, and 1.1% show inconsistencies in their AIS data sources. In the central Pacific Ocean at low latitudes and along the coast of South America (30°-60°S), a heightened incidence of abnormal ship trajectories has been consistently observed, particularly in areas associated with fishing activities. According to the spatial distribution of ship trajectories, AIS data exhibit numerous deficiencies, particularly in high-traffic regions such as the East China Sea and South China Sea. In contrast, ship trajectories in the high-latitude polar regions are relatively complete. With the increased number of HY satellites equipped with AIS receivers, the quantity of trajectory points displays a growing trend, leading to increasingly complete trajectories. This trend highlights the significant potential of AIS data acquired from HY satellites to increase the accuracy of vessel tracking.
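A much-simplified version of the ITU-conformance screening might look as follows, assuming a ship-station MMSI is nine digits whose leading three form a maritime identification digit in the 201-775 range; the paper's actual screening rules are richer than this.

```python
# Simplified MMSI sanity check (ship stations only; other station types differ).
def mmsi_looks_valid(mmsi: str) -> bool:
    if len(mmsi) != 9 or not mmsi.isdigit():
        return False
    mid = int(mmsi[:3])
    return 201 <= mid <= 775      # MID range allocated to countries for ships

print(mmsi_looks_valid("413456789"))   # True: 413 is a Chinese MID
print(mmsi_looks_valid("99123"))       # False: wrong length
```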
Sign language datasets are essential for sign language recognition and translation (SLRT). Current public sign language datasets are small and lack diversity, which does not meet the practical application requirements of SLRT. However, building a large-scale and diverse sign language dataset is difficult because sign language data on the Internet are scarce, and in assembling such a dataset, the quality of some sign language data is not up to standard. This paper proposes a two-information-streams transformer (TIST) model to judge whether the quality of sign language data is qualified. To verify that TIST effectively improves sign language recognition (SLR), we built two datasets: a screened dataset and an unscreened dataset. In the experiments, visual alignment constraint (VAC) is used as the baseline model. The experimental results show that the screened dataset achieves a better word error rate (WER) than the unscreened dataset.
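The word error rate used to compare the two datasets is the standard Levenshtein-distance formulation; the function below is a generic implementation, not code from the paper.

```python
# Word error rate: edit distance between word sequences over reference length.
def wer(reference: list[str], hypothesis: list[str]) -> float:
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m] / max(n, 1)

print(wer("my name is john".split(), "my name john".split()))  # 0.25
```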
Offshore waters provide resources for human beings but, at the same time, threaten them through marine disasters. Ocean stations are part of offshore observation networks, and the quality of their data is of great significance for exploiting and protecting the ocean. We used hourly mean wave height, temperature, and pressure real-time observation data taken at the Xiaomaidao station (Qingdao, China) from June 1, 2017, to May 31, 2018, to explore data quality using eight quality control methods and to identify the most effective methods for the station. After applying the eight quality control methods, the percentages of the mean wave height, temperature, and pressure data that passed the tests were 89.6%, 88.3%, and 98.6%, respectively. Compared against marine disaster (wave alarm report) data, the values that failed the tests were mainly attributable to aging observation equipment and missing data transmissions. Because the mean wave height is often affected by dynamic marine disasters, the continuity test method is not effective for it; a correlation test with other related parameters is more useful for the mean wave height.
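Suites like the eight methods mentioned usually include a plausible-range (extreme-value) test; the sketch below shows that test with placeholder bounds, which are not the station's operational limits.

```python
# Range (extreme-value) quality-control test for station observations.
import numpy as np

def range_test(values, lower, upper):
    """Return True where an observation passes the plausible-range check."""
    v = np.asarray(values, dtype=float)
    return (v >= lower) & (v <= upper)

wave_height = [0.4, 0.6, 9.8, 0.5, -0.1]     # metres; 9.8 and -0.1 are suspect
print(range_test(wave_height, 0.0, 8.0))     # [ True  True False  True False]
```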
Multisensor data fusion (MDF) is an emerging technology that fuses data from multiple sensors in order to make a more accurate estimation of the environment through measurement and detection. Applications of MDF span a wide spectrum of military and civilian areas. With the rapid evolution of computers and the proliferation of micro-mechanical/electrical systems sensors, the use of MDF is becoming popular in research and applications. This paper focuses on the application of MDF for high-quality data analysis and processing in measurement and instrumentation. A practical, general data fusion scheme was established on the basis of feature extraction and the merging of data from multiple sensors. The scheme integrates artificial neural networks for high-performance pattern recognition. A number of successful applications in the areas of NDI (non-destructive inspection) corrosion detection, food quality and safety characterization, and precision agriculture are described and discussed in order to motivate new applications in these or other areas. The paper gives an overall picture of using the MDF method to increase the accuracy of data analysis and processing in measurement and instrumentation across different areas of application.
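The described scheme, per-sensor feature extraction followed by feature merging and a neural-network classifier, can be miniaturized as below; the toy features, synthetic signals, and scikit-learn MLP are stand-ins for the paper's actual pipeline.

```python
# Feature-level fusion in miniature: extract features per sensor, concatenate,
# then train a neural-network classifier on the fused feature vectors.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
sensor_a = rng.normal(size=(500, 64))        # raw signals from two sensors
sensor_b = rng.normal(size=(500, 64))
labels = rng.integers(0, 2, size=500)

def extract_features(sig):
    # toy features: mean, std, min, max per signal window
    return np.stack([sig.mean(1), sig.std(1), sig.min(1), sig.max(1)], axis=1)

fused = np.hstack([extract_features(sensor_a), extract_features(sensor_b)])
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500).fit(fused, labels)
```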
Sea surface temperature (SST) data obtained from coastal stations in Jiangsu, China during 2010–2014 are quality controlled before analysis of their characteristic semidiurnal and seasonal cycles, including the correlation with tidal variation. Quality control of the data includes the validation of extreme values and the checking of hourly values against temporally adjacent data points, with 0.15°C/h considered a suitable threshold for detecting abnormal values. The diurnal variation amplitude of the SST data is greater in spring and summer than in autumn and winter. The diurnal variation of SST has a bimodal structure on most days, i.e., SST has a significant semidiurnal cycle. Moreover, the semidiurnal cycle of SST is negatively correlated with the tidal data from March to August, but positively correlated with the tidal data from October to January. Little correlation is detected in the remaining months because of the weak coastal offshore SST gradients. The quality control and understanding of coastal SST data are particularly relevant to the validation of indirect measurements such as satellite-derived data.
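The 0.15°C/h criterion translates directly into a check of adjacent hourly values, as in this sketch; note that a single spike also makes the following step exceed the threshold, so both points are flagged here, whereas an operational procedure would treat that case more carefully.

```python
# Flag hourly SST values whose change from the previous hour exceeds 0.15 degC/h.
import numpy as np

def hourly_step_check(sst, threshold=0.15):
    sst = np.asarray(sst, dtype=float)
    bad = np.zeros_like(sst, dtype=bool)
    bad[1:] = np.abs(np.diff(sst)) > threshold   # compare adjacent hours
    return bad

series = [12.30, 12.32, 12.35, 13.10, 12.38]     # 13.10 jumps by 0.75 degC
print(hourly_step_check(series))                  # flags indices 3 and 4
```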
This study proposes a method to derive the climatological limit thresholds that can be used in an operational/historical quality control procedure for Chinese high-vertical-resolution (5–10 m) radiosonde temperature and wind speed data. The whole atmosphere is divided into 64 vertical bins, and profiles are constructed from the percentiles of the values in each vertical bin. Based on these percentile profiles (PPs), objective criteria are developed to obtain the thresholds. Tibetan Plateau field data are used to validate the effectiveness of the method when applied to experimental data. The results show that the derived thresholds for 120 operational stations and 3 experimental stations are effective in detecting gross errors, and the PPs can clearly and instantly illustrate the characteristics of a radiosonde variable and reveal the distribution of errors.
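The percentile-profile construction can be illustrated with NumPy: divide heights into vertical bins, take low/high percentiles of temperature per bin, and flag values outside them. The bin count matches the abstract, but the percentile levels and synthetic sounding data are assumptions.

```python
# Percentile-profile thresholds per vertical bin, with gross-error flagging.
import numpy as np

def bin_thresholds(height, temp, n_bins=64, lo=0.05, hi=99.95):
    edges = np.linspace(height.min(), height.max(), n_bins + 1)
    idx = np.clip(np.digitize(height, edges) - 1, 0, n_bins - 1)
    lower = np.array([np.percentile(temp[idx == b], lo) for b in range(n_bins)])
    upper = np.array([np.percentile(temp[idx == b], hi) for b in range(n_bins)])
    return lower, upper, idx

h = np.random.uniform(0, 30000, 100000)                  # synthetic soundings (m)
t = 288.0 - 0.0065 * np.clip(h, 0, 11000) + np.random.normal(0, 2, h.size)
lower, upper, idx = bin_thresholds(h, t)
gross = (t < lower[idx]) | (t > upper[idx])              # gross-error flags
```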
We first analyzed the GPS precipitable water vapor (GPS/PWV) data available from a ground-based GPS observation network in Guangdong from 1 August 2009 to 27 August 2012 and then developed a quality control method applied before GPS/PWV data are assimilated into the GRAPES 3DVAR system. This method rejects outliers effectively. After establishing the quality control criterion, we conducted three numerical experiments to investigate the impact on precipitation forecasts of assimilating the GPS/PWV data with and without quality control. In the numerical experiments, two precipitation cases (6–7 May 2010 and 27–28 April 2012) that occurred in the annual first rainy season of Guangdong were selected. The results indicate that, after quality control, only GPS/PWV data that deviate little from the NCEP/PWV data are assimilated into the system, which yields a reasonable adjustment of the initial water vapor over Guangdong and ultimately improves the intensity and location of the 24-h precipitation forecast significantly.
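The rejection criterion might be sketched as follows, comparing GPS/PWV against the NCEP reference and discarding large departures; the robust median/MAD form and the 3-sigma factor are assumptions standing in for the paper's exact rule.

```python
# Reject GPS/PWV observations that depart strongly from a reference PWV.
import numpy as np

def pwv_quality_control(gps_pwv, ref_pwv, k=3.0):
    diff = np.asarray(gps_pwv) - np.asarray(ref_pwv)
    bias = np.median(diff)
    sigma = 1.4826 * np.median(np.abs(diff - bias))   # robust std via MAD
    return np.abs(diff - bias) <= k * sigma           # True = keep

gps  = np.array([45.0, 43.7, 46.2, 80.0, 44.9, 46.1, 43.8, 45.6])  # mm
ncep = np.array([44.5, 44.5, 45.0, 45.2, 45.2, 45.2, 44.9, 45.2])
print(pwv_quality_control(gps, ncep))   # rejects the 80.0 mm outlier
```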
In this study, an analysis framework based on regular monitoring data was proposed for investigating annual/inter-annual air quality variation and the contributions from different factors (i.e., seasons, pollution periods, and airflow directions), through a case study in Beijing from 2013 to 2016. The results showed that the annual mean concentrations (MC) of PM2.5, SO2, NO2, and CO decreased at annual mean rates of 7.5%, 28.6%, 4.6%, and 15.5% from 2013 to 2016, respectively. Among the seasons, the MC in winter contributed the largest fractions (25.8%–46.4%) to the annual MC, and the change of MC in summer contributed most to the inter-annual MC variation (IMCV) of PM2.5 and NO2. For the different pollution periods, a gradual increase in the frequency of S-1 (PM2.5 of 0–75 μg/m³) made S-1 the largest contributor (28.8%) to the MC of PM2.5 in 2016, while it had a negative contribution (−13.1%) to the IMCV of PM2.5; marked decreases in the frequencies of heavily polluted and severely polluted periods dominated (44.7% and 39.5%) the IMCV of PM2.5. For the different airflow directions, the MC of pollutants under the south airflow showed the most significant decrease (22.5%–62.5%), and this decrease contributed most to the IMCV of PM2.5 (143.3%), SO2 (72.0%), NO2 (55.5%), and CO (190.3%); the west airflow had negative influences on the IMCV of PM2.5, NO2, and CO. The framework is helpful for further analysis and utilization of large amounts of monitoring data, and the analysis results can provide scientific support for the formulation or adjustment of further air pollution mitigation policies.
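The seasonal contribution bookkeeping, each season's share of the annual mean concentration, is a short pandas computation; the daily PM2.5 series below is synthetic, and the DJF/MAM/JJA/SON split is a conventional assumption made for illustration.

```python
# Each season's share of the annual mean concentration, from a daily series.
import numpy as np
import pandas as pd

days = pd.date_range("2016-01-01", "2016-12-31", freq="D")
pm25 = pd.Series(np.random.gamma(4, 20, len(days)), index=days)  # synthetic

season = days.month.map({12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM",
                         5: "MAM", 6: "JJA", 7: "JJA", 8: "JJA",
                         9: "SON", 10: "SON", 11: "SON"})
contrib = pm25.groupby(list(season)).sum() / pm25.sum()   # share of annual MC
print(contrib.round(3))
```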
Volunteered geographic information (VGI) has entered a phase in which there is both a substantial amount of crowdsourced information available and strong interest from organizations in using it. However, deciding the quality of VGI without resorting to a comparison with authoritative data remains an open challenge. This article first formulates the problem of quality assessment of VGI data. It then presents a model to measure the trustworthiness of information and the reputation of contributors by analyzing geometric, qualitative, and semantic aspects of edits over time. An implementation of the model is run on a small dataset for a preliminary empirical validation. The results indicate that the computed trustworthiness provides a valid approximation of VGI quality.
Since the British National Archives put forward the concept of digital continuity in 2007, several developed countries have worked out digital continuity action plans. However, technologies for guaranteeing digital continuity are still lacking. This paper first analyzes the requirements of the digital continuity guarantee for electronic records based on data quality theory, and then points out the necessity of a data quality guarantee for electronic records. Moreover, we reduce the digital continuity guarantee of electronic records to ensuring their consistency, completeness, and timeliness, and construct the first technology framework for the digital continuity guarantee of electronic records. Finally, temporal functional dependency technology is utilized to build the first integration method for ensuring the consistency, completeness, and timeliness of electronic records.
Water is one of the basic resources for human survival, and water pollution monitoring and protection have become major problems for many countries all over the world. Most traditional water quality monitoring systems, however, generally focus only on water quality data collection, ignoring data analysis and data mining. In addition, dirty data and data loss may occur due to power or transmission failures, further affecting data analysis and its applications. To meet these needs, using Internet of Things, cloud computing, and big data technologies, we designed and implemented a water quality monitoring data intelligent service platform in C# and PHP. The platform includes modules for monitoring point addition, monitoring point map labeling, monitoring data uploading, monitoring data processing, and early warning when monitoring indicators exceed standards, among other functions. Using this platform, we can realize the automatic collection of water quality monitoring data, data cleaning, data analysis, intelligent early warning, and early warning information push. For better security and convenience, we deployed the system in the Tencent Cloud and tested it. The testing results showed that the data analysis platform runs well and will provide decision support for water resource protection.
Wired drill pipe (WDP) technology is one of the most promising data acquisition technologies in today's oil and gas industry. For the first time, it allows sensors to be positioned along the drill string, which enables collecting and transmitting valuable data not only from the bottom hole assembly (BHA) but also along the entire length of the wellbore to the drill floor. The technology has received industry acceptance as a viable alternative to the typical logging while drilling (LWD) method, and more and more WDP applications can be found in challenging drilling environments around the world, bringing many innovations to the industry. Nevertheless, much of the data acquired from WDP can be noisy and, in some circumstances, of very poor quality. Diverse factors contribute to poor data quality; the most common sources include mis-calibrated sensors, sensor drift, errors during data transmission, and abnormal conditions in the well. The challenge of improving data quality has attracted increasing attention from researchers during the past decade. This paper proposes a promising solution to this challenge by correcting the raw WDP data and estimating unmeasurable parameters to reveal downhole behaviors. An advanced data processing method, data validation and reconciliation (DVR), is employed, which makes use of the redundant data from multiple WDP sensors to filter and remove noise from the measurements and ensures the coherence of all sensors and models. Moreover, it can distinguish accurate measurements from inaccurate ones. In addition, the data with improved quality can be used to estimate crucial parameters in the drilling process that are not directly measurable, hence providing better model calibration for integrated well planning and real-time operations.
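DVR has a textbook linear form: adjust redundant measurements minimally, weighted by their variances, so that they satisfy conservation constraints; the three-sensor flow balance below is an invented example, not the WDP model from the paper.

```python
# Linear data validation and reconciliation: minimize the weighted adjustment
# (x - y)' W (x - y) subject to A x = 0, with W = diag(1 / sigma^2).
import numpy as np

def reconcile(y, sigma, A):
    S = np.diag(sigma**2)
    K = S @ A.T @ np.linalg.inv(A @ S @ A.T)
    return y - K @ (A @ y)

# Toy redundancy: three flow sensors, conservation says f1 - f2 - f3 = 0.
y = np.array([10.3, 6.1, 4.4])           # raw (noisy) measurements
sigma = np.array([0.2, 0.1, 0.3])        # sensor standard deviations
A = np.array([[1.0, -1.0, -1.0]])
x = reconcile(y, sigma, A)
print(x, A @ x)                           # reconciled values satisfy A x ~ 0
```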
OpenStreetMap (OSM) data are widely used, but their reliability is still variable. Many contributors to OSM have not been trained in geography or surveying, and consequently their contributions, including inserts, deletions, and updates of geometry and attribute data, can be inaccurate, incomplete, inconsistent, or vague. Some mechanisms and applications are dedicated to discovering bugs and errors in OSM data. Such systems can remove errors through user checks and the application of predefined rules, but they need an extra control process to check the real-world validity of suspected errors and bugs. This paper focuses on finding bugs and errors based on patterns and rules extracted from the tracking data of users. The underlying idea is that certain characteristics of user trajectories are directly linked to the type of feature. Using such rules, sets of potential bugs and errors can be identified and stored for further investigation.
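One rule of the kind described might infer a way's likely class from the median speed of trajectories matched to it and flag a mismatch with its OSM tag; the speed bands below are invented placeholders.

```python
# Trajectory-derived rule: infer a likely highway class from observed speeds
# and flag disagreement with the tagged feature type.
import statistics

def infer_highway_class(speeds_kmh):
    median = statistics.median(speeds_kmh)
    if median < 15:
        return "footway"
    if median < 60:
        return "residential"
    return "motorway"

tagged = "footway"
observed_speeds = [72, 65, 80, 77, 69]    # km/h from matched trajectories
if infer_highway_class(observed_speeds) != tagged:
    print("potential tagging error: store for further investigation")
```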
This paper presents a methodology to determine three data quality (DQ) risk characteristics: accuracy, comprehensiveness, and nonmembership. The methodology provides a set of quantitative models to quantify the information quality risks for the database of a geographical information system (GIS). Four quantitative measures are introduced to examine how the quality risks of source information affect the quality of information outputs produced using the relational algebra operations Selection, Projection, and Cubic Product, and they can be used to determine how quality risks associated with diverse data sources affect derived data. In the construction business, the GIS is the prime source of information on the location of cables, and detection time strongly depends on whether maps indicate the presence of cables. Poor data quality in the GIS can contribute to increased risk or higher risk-avoidance costs. A case study provides a numerical example of the calculation of the trade-offs between risk and detection costs, and of the costs of data quality. We conclude that the model contributes valuable new insight.
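On one plausible reading of such propagation models, per-record accuracy is preserved by Selection and Projection but multiplies under a Cartesian (Cubic) product when source records are independently accurate; the numbers below are illustrative, and the independence assumption is an editorial simplification, not necessarily the paper's model.

```python
# Illustrative accuracy propagation through relational algebra operations,
# assuming each source record is accurate independently with probability p.
p_gis_layer = 0.95        # accuracy of cable-location records
p_parcel_layer = 0.90     # accuracy of land-parcel records

p_selection = p_gis_layer                    # Selection: subset of rows
p_projection = p_gis_layer                   # Projection: subset of columns
p_product = p_gis_layer * p_parcel_layer     # Cubic Product: pairs records

print(p_selection, p_projection, round(p_product, 3))   # 0.95 0.95 0.855
```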