1,091 articles found
Data cleaning method for the process of acid production with flue gas based on improved random forest (cited by: 3)
1
Authors: Xiaoli Li, Minghua Liu, Kang Wang, Zhiqiang Liu, Guihai Li. 《Chinese Journal of Chemical Engineering》 (SCIE, EI, CAS, CSCD), 2023, No. 7, pp. 72-84 (13 pages).
Acid production with flue gas is a complex nonlinear process with multiple variables and strong coupling. The operation data is an important basis for state monitoring, optimal control, and fault diagnosis. However, the operating environment of acid production with flue gas is complex and involves a large amount of equipment. The data obtained by the detection equipment is seriously polluted and prone to abnormal phenomena such as data loss and outliers. Therefore, to solve the problem of abnormal data in the process of acid production with flue gas, a data cleaning method based on improved random forest is proposed. Firstly, an outlier recognition model based on isolation forest is designed to identify and eliminate the outliers in the dataset. Secondly, an improved random forest regression model is established, in which a genetic algorithm is used to optimize the hyperparameters of the random forest regression model. The optimal parameter combination is then found in the search space and the trend of the data is predicted. Finally, the improved random forest data cleaning method is used to compensate for the data that are missing after the abnormal data are eliminated, which completes the data cleaning. Results show that the proposed method can accurately eliminate and compensate for the abnormal data in the process of acid production with flue gas, and that it improves the accuracy of compensation for missing data. With the cleaned data, a more accurate model can be established, which is significant for the subsequent temperature control. The conversion rate of SO₂ can be further improved, thereby improving the yield of sulfuric acid and economic benefits.
Keywords: acid production; data cleaning; isolation forest; random forest; data compensation
A Review of Data Cleaning Methods for Web Information System (cited by: 1)
2
Authors: Jinlin Wang, Xing Wang, Yuchen Yang, Hongli Zhang, Binxing Fang. 《Computers, Materials & Continua》 (SCIE, EI), 2020, No. 3, pp. 1053-1075 (23 pages).
Web information systems (WIS) are frequently used and indispensable in daily social life. WIS provide information services in many scenarios, such as electronic commerce, communities, and edutainment. Data cleaning plays an essential role in various WIS scenarios to improve the quality of data service. In this paper, we present a review of the state-of-the-art methods for data cleaning in WIS. According to the characteristics of data cleaning, we extract the critical elements of WIS, such as interactive objects, application scenarios, and core technology, to classify the existing works. Then, after elaborating and analyzing each category, we summarize the descriptions and challenges of data cleaning methods with sub-elements such as data and user interaction, data quality rule, model, crowdsourcing, and privacy preservation. Finally, we analyze various types of problems and provide suggestions for future research on data cleaning in WIS from the technology and interaction perspectives.
Keywords: data cleaning; web information system; data quality rule; crowdsourcing; privacy preservation
Data Cleaning Based on Stacked Denoising Autoencoders and Multi-Sensor Collaborations (cited by: 1)
3
Authors: Xiangmao Chang, Yuan Qiu, Shangting Su, Deliang Yang. 《Computers, Materials & Continua》 (SCIE, EI), 2020, No. 5, pp. 691-703 (13 pages).
Wireless sensor networks are increasingly used in sensitive event monitoring. However, various abnormal data generated by sensors greatly decrease the accuracy of event detection. Although many methods have been proposed to deal with abnormal data, they generally detect and/or repair all abnormal data without further differentiation. In fact, besides the abnormal data caused by events, it is well known that sensor nodes are prone to generating abnormal data due to factors such as sensor hardware drawbacks and random effects of external sources. Dealing with all abnormal data without differentiation will result in false detection or missed detection of events. In this paper, we propose a data cleaning approach based on Stacked Denoising Autoencoders (SDAE) and multi-sensor collaborations. We detect all abnormal data by SDAE, then differentiate the abnormal data by multi-sensor collaborations. The abnormal data caused by events are left unchanged, while the abnormal data caused by other factors are repaired. Simulations based on real data show the efficiency of the proposed approach.
Keywords: data cleaning; wireless sensor networks; stacked denoising autoencoders; multi-sensor collaborations
An Improvement of Data Cleaning Method for Grain Big Data Processing Using Task Merging (cited by: 1)
4
Authors: Feiyu Lian, Maixia Fu, Xingang Ju. 《Journal of Computer and Communications》, 2020, No. 3, pp. 1-19 (19 pages).
Data quality has an important influence on the application of grain big data, so data cleaning is necessary and important work. In the MapReduce framework, parallel techniques are often used to execute data cleaning in a highly scalable mode, but because of the lack of effective design, there is a large amount of redundant computation in the data cleaning process, which lowers performance. In this research, we found that some tasks are often carried out multiple times on the same input files, or require the same operation results, during data cleaning. To address this problem, we propose a new optimization technique based on task merging. By merging simple or redundant computations on the same input files, the number of loop computations in MapReduce can be reduced greatly. Experiments show that the overall system runtime is significantly reduced in this way, which indicates that the data cleaning process is optimized. In this paper, we optimize several data cleaning modules, such as entity identification, inconsistent data restoration, and missing value filling. Experimental results show that the proposed method can increase the efficiency of grain big data cleaning.
Keywords: grain big data; data cleaning; task merging; Hadoop; MapReduce
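The task-merging idea in the abstract above can be sketched in a few lines. This is a minimal plain-Python illustration, not the paper's Hadoop implementation: the function names `run_separately` and `run_merged` and the sample cleaning steps are hypothetical, and a "pass" here simply stands for one scan over the same input.

```python
def run_separately(records, tasks):
    """Apply each cleaning task in its own pass over the records."""
    passes = 0
    for task in tasks:
        records = [task(record) for record in records]  # one full scan per task
        passes += 1
    return records, passes


def run_merged(records, tasks):
    """Merge all tasks into one function and scan the records once."""
    def merged(record):
        for task in tasks:
            record = task(record)
        return record

    return [merged(record) for record in records], 1


# Hypothetical per-record cleaning steps that share the same input.
tasks = [str.strip, str.lower]
raw = ["  Wheat ", "CORN", " Rice"]
separate_result, separate_passes = run_separately(raw, tasks)
merged_result, merged_passes = run_merged(raw, tasks)
```

Both variants produce the same cleaned records; the merged variant reads the input once instead of once per task, which is the redundancy the abstract describes removing.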
A Rule Management System for Knowledge Based Data Cleaning
5
Authors: Louardi BRADJI, Mahmoud BOUFAIDA. 《Intelligent Information Management》, 2011, No. 6, pp. 230-239 (10 pages).
In this paper, we propose a rule management system for data cleaning that is based on knowledge. This system combines features of both rule-based systems and rule-based data cleaning frameworks. The important advantages of our system are threefold. First, it aims at proposing a strong and unified rule form based on first-order structure that permits the representation and management of all types of rules and their quality via some characteristics. Second, it increases the quality of rules, which conditions the quality of data cleaning. Third, it uses an appropriate knowledge acquisition process, which is the weakest task in current rule-based and knowledge-based systems. As several research works have shown that data cleaning is driven by domain knowledge rather than by data, we have identified and analyzed the properties that distinguish knowledge and rules from data to better determine most components of the proposed system. In order to illustrate our system, we also present a first experiment with a case study in the health sector, where we demonstrate how the system is useful for the improvement of data quality. The autonomy, extensibility, and platform independence of the proposed rule management system facilitate its incorporation in any system that is interested in data quality management.
Keywords: rule; data quality; data cleaning; knowledge; rule management system; rule-based system structure
Big Data Cleaning Based on Improved CLOF and Random Forest for Distribution Networks (cited by: 1)
6
Authors: Jie Liu, Yijia Cao, Yong Li, Yixiu Guo, Wei Deng. 《CSEE Journal of Power and Energy Systems》 (SCIE, EI, CSCD), 2024, No. 6, pp. 2528-2538 (11 pages).
In order to improve data quality, a big data cleaning method for distribution networks is studied in this paper. First, the Local Outlier Factor (LOF) algorithm based on DBSCAN clustering is used to detect outliers. However, because the LOF threshold is difficult to determine, a method of dynamically calculating the threshold based on the transformer districts and time is proposed. In addition, the LOF algorithm is combined with the statistical distribution method to reduce the misjudgment rate. To address the diversity and complexity of missing-data patterns in power big data, this paper improves the Random Forest imputation algorithm so that it can be applied to various forms of missing data, especially block-missing data and even some completely missing horizontal or vertical data. The data in this paper are real data from 44 transformer districts of a 10 kV line in a distribution network. Experimental results show that the outlier detection is accurate and suitable for multidimensional power big data of any shape. The improved Random Forest imputation algorithm is suitable for all missing-data patterns, with higher imputation accuracy and better model stability. Comparing network loss prediction on data cleaned with this method against data from which outliers and missing values were simply removed shows that the accuracy of network loss prediction improves by nearly 4% with the data cleaning method of this paper. Additionally, as the proportion of bad data increases, the difference between the prediction accuracy of cleaned data and that of uncleaned data becomes more significant.
Keywords: data cleaning; DBSCAN; LOF; missing data imputation; outlier detection; Random Forest
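As a rough illustration of the Local Outlier Factor idea named in the abstract above, the sketch below computes textbook LOF scores for one-dimensional readings using only the Python standard library. It is not the paper's improved CLOF method: the DBSCAN pre-clustering and the dynamic threshold are omitted, and the function name `lof_scores`, the neighbourhood size `k`, and the sample loads are assumptions made for illustration.

```python
def lof_scores(points, k=3):
    """Return the Local Outlier Factor of each 1-D point (brute force).

    Simplifications: exactly k neighbours are used even when distances tie,
    and duplicate points (zero distances) are not treated specially.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")

    # k nearest neighbours of each point (excluding itself) and its k-distance.
    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((abs(points[i] - points[j]), j) for j in range(n) if j != i)
        neighbours.append([j for _, j in order[:k]])
        k_dist.append(order[k - 1][0])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], abs(points[i] - points[j])) for j in neighbours[i]]
        mean_reach = sum(reach) / k
        lrd.append(float("inf") if mean_reach == 0 else 1.0 / mean_reach)

    # LOF: average ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] / lrd[i] for j in neighbours[i]) / k for i in range(n)]


# Hypothetical transformer loads; the last reading is far from the rest.
loads = [10.0, 10.2, 9.9, 10.1, 10.3, 25.0]
scores = lof_scores(loads, k=2)
suspect = max(range(len(loads)), key=scores.__getitem__)
```

Scores close to 1 indicate a point whose local density matches its neighbours; a score far above 1 marks a candidate outlier. Where to draw the line is exactly the threshold problem the abstract says the authors address dynamically.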
IoT data cleaning techniques: A survey (cited by: 1)
7
Authors: Xiaoou Ding, Hongzhi Wang, Genglong Li, Haoxuan Li, Yingze Li, Yida Liu. 《Intelligent and Converged Networks》 (EI), 2022, No. 4, pp. 325-339 (15 pages).
Data cleaning is considered an effective approach to improving data quality, helping practitioners and researchers devote themselves to downstream analysis and decision-making without worrying about data trustworthiness. This paper provides a systematic summary of the two main stages of data cleaning for Internet of Things (IoT) data with time series characteristics: error data detection and data repairing. With respect to error data detection techniques, it gives a categorized overview of quantitative data error detection methods for detecting single-point errors, continuous errors, and multidimensional time series data errors, and of qualitative data error detection methods for detecting rule-violating errors. It also provides a detailed description of error data repairing techniques, involving statistics-based repairing, rule-based repairing, and human-involved repairing. We review the strengths and limitations of the current data cleaning techniques in IoT data applications and conclude with an outlook on the future of IoT data cleaning.
Keywords: Internet of Things (IoT); data quality; data cleaning; error detection; data repairing
A method for cleaning wind power anomaly data by combining image processing with community detection algorithms
8
Authors: Qiaoling Yang, Kai Chen, Jianzhang Man, Jiaheng Duan, Zuoqi Jin. 《Global Energy Interconnection》 (EI, CSCD), 2024, No. 3, pp. 293-312 (20 pages).
Current methodologies for cleaning wind power anomaly data exhibit limited capabilities in identifying abnormal data within extensive datasets and struggle to accommodate the considerable variability and intricacy of wind farm data. Consequently, a method for cleaning wind power anomaly data by combining image processing with community detection algorithms (CWPAD-IPCDA) is proposed. To precisely identify and initially clean anomalous data, wind power curve (WPC) images are converted into graph structures, which employ the Louvain community recognition algorithm and graph-theoretic methods for community detection and segmentation. Furthermore, the mathematical morphology operation (MMO) determines the main part of the initially cleaned wind power curve images and maps them back to the normal wind power points to complete the final cleaning. The CWPAD-IPCDA method was applied to clean datasets from 25 wind turbines (WTs) in two wind farms in northwest China to validate its feasibility. A comparison was conducted using the density-based spatial clustering of applications with noise (DBSCAN) algorithm, an improved isolation forest algorithm, and an image-based (IB) algorithm. The experimental results demonstrate that the CWPAD-IPCDA method surpasses the other three algorithms, achieving an approximately 7.23% higher average data cleaning rate. The mean value of the sum of the squared errors (SSE) of the dataset after cleaning is approximately 6.887 lower than that of the other algorithms. Moreover, the mean of overall accuracy, as measured by the F1-score, exceeds that of the other methods by approximately 10.49%; this indicates that the CWPAD-IPCDA method is more conducive to improving the accuracy and reliability of wind power curve modeling and wind farm power forecasting.
Keywords: wind turbine power curve; abnormal data cleaning; community detection; Louvain algorithm; mathematical morphology operation
Cleaning of Multi-Source Uncertain Time Series Data Based on PageRank
9
Authors: 高嘉伟, 孙纪舟. 《Journal of Donghua University (English Edition)》 (CAS), 2023, No. 6, pp. 695-700 (6 pages).
There are errors in multi-source uncertain time series data. Truth discovery methods for time series data are effective in finding more accurate values, but some have limitations in their usability. To tackle this challenge, we propose a new and convenient truth discovery method to handle time series data. A more accurate sample is closer to the truth and, consequently, to other accurate samples. Because the mutual-confirmation relationship between sensors is very similar to the mutual-citation relationship between web pages, we evaluate sensor reliability based on PageRank and then estimate the truth from sensor reliability. Therefore, this method does not rely on smoothness assumptions or prior knowledge of the data. Finally, we validate the effectiveness and efficiency of the proposed method on real-world and synthetic data sets, respectively.
Keywords: big data; data cleaning; time series; truth discovery; PageRank
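The idea in the abstract above, that sensors confirming one another resemble web pages citing one another, can be sketched as follows. The abstract gives no formulas, so everything specific here is an assumption rather than the authors' method: the Gaussian similarity kernel, the `scale` parameter, the damping factor, the reliability-weighted average used as the estimate, and the function name `pagerank_truth`.

```python
import math


def pagerank_truth(readings, damping=0.85, scale=1.0, iterations=100):
    """Estimate one true value from several sensors' readings of it.

    readings: one value per sensor for the same timestamp.
    Returns (estimate, reliabilities), where reliabilities sum to 1.
    """
    n = len(readings)
    if n == 0:
        raise ValueError("readings must not be empty")

    # Edge weight i -> j: how strongly sensor i "confirms" sensor j.
    # Close readings confirm each other strongly (assumed Gaussian kernel).
    weights = [
        [0.0 if i == j else math.exp(-((readings[i] - readings[j]) / scale) ** 2)
         for j in range(n)]
        for i in range(n)
    ]

    # Row-normalise into transition probabilities; a sensor that confirms
    # nobody (all-zero row) spreads its vote uniformly, as in standard PageRank.
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total for w in row] if total > 0 else [1.0 / n] * n)

    # Power iteration with damping.
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1.0 - damping) / n
            + damping * sum(rank[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]

    estimate = sum(r * x for r, x in zip(rank, readings))
    return estimate, rank


# Three sensors agree near 20; the fourth is faulty.
estimate, reliability = pagerank_truth([20.1, 19.9, 20.0, 35.0])
```

The faulty sensor receives almost no confirmation and ends up with the lowest reliability, so the estimate stays much closer to 20 than the plain mean of the four readings (23.75). The damping term still leaves it a small residual weight.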
Data Cleaning About Student Information Based on Massive Open Online Course System
10
Authors: Shengjun Yin, Yaling Yi, Hongzhi Wang. 《国际计算机前沿大会会议论文集》, 2020, No. 1, pp. 33-43 (11 pages).
Recently, Massive Open Online Courses (MOOCs) have become a major way of online learning for millions of people around the world, generating a large amount of data in the meantime. However, due to errors arising from collection, the system, and so on, these data have various inconsistencies and missing values. In order to support accurate analysis, this paper studies data cleaning technology for online open curriculum systems, including missing value-time filling for time series and rule-based input error correction. The data cleaning algorithm designed in this paper is divided into six parts: pre-processing, missing data processing, format and content error processing, logical error processing, irrelevant data processing, and correlation analysis. This paper designs and implements a missing-value-filling algorithm based on time series in the missing data processing part. According to the large number of descriptive variables existing in the format and content error processing module, it proposed one-based and separability-based criteria Hot+J3+PCA. The online course data cleaning algorithm was analyzed in detail in terms of algorithm design, implementation, and testing. After extensive testing, each module functions normally and the cleaning performance of the algorithm meets expectations.
Keywords: MOOC; data cleaning; time series; intermittent missing; dimension reduction
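Development Status and Prospects of Intelligent Ship Energy Efficiency Management Systems
11
Authors: 尹奇志, 黄佳期, 胡浩帆, 苏开文, 徐超凡, 欧阳武. 《船舶工程》 (PKU Core), 2026, No. 3, pp. I0001-I0027 (19 pages).
[Objective] To examine the current state of research and the development trends of intelligent ship energy efficiency management systems in China and abroad. [Methods] This paper systematically reviews research progress and practical applications of key technologies for ship energy efficiency, including data acquisition, cleaning, analysis, modeling, and optimization, and compares the functions and characteristics of typical intelligent ship energy efficiency management systems in China and abroad. [Results] Intelligent ship energy efficiency management systems have formed a complete technology chain from data sensing to optimization decision-making, and their functions have gradually evolved from monitoring toward analysis, evaluation, and optimization. Domestic and foreign systems each have distinctive strengths in dynamic optimization for ocean-going vessels, multi-terminal coordination on inland waterways, and energy efficiency evaluation, and have achieved energy savings and efficiency gains in real ship applications. [Conclusion] In the future, intelligent ship energy efficiency management systems will develop toward a "ship-cloud-shore" collaborative architecture. Breakthroughs are needed in key technologies such as multi-source information fusion, hybrid mechanism-data modeling, and digital twins, and an open, shared energy efficiency data ecosystem should be built to achieve autonomous, intelligent energy efficiency improvement across whole fleets and whole voyages.
Keywords: intelligent ship; energy efficiency management; data acquisition; data cleaning; energy efficiency analysis; energy efficiency modeling; energy efficiency optimization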
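A Survey of Data Debugging
12
Authors: 李晨阳, 马超红, 孟小峰. 《计算机研究与发展》 (PKU Core), 2026, No. 1, pp. 41-65 (25 pages).
The rapid development of artificial intelligence (AI) has profoundly affected fields such as healthcare, bioinformatics, and financial services. The main paradigm of AI applications is to build machine learning models that explore rules and patterns in data for inference and decision-making. The effectiveness and efficiency of an AI system depend on two key aspects. The first is the model (model-centric), including enhanced network structures, such as the shift from RNN to LSTM, and model hyperparameter tuning. The second is the data (data-centric), such as standardizing data formats, increasing data volume, and reducing data noise. Debugging AI systems has long focused on optimizing the model. However, the digital era, represented by social networks and e-commerce, produces massive and diverse data, so model-centric debugging can no longer meet the demands placed on AI systems. Academia and industry have therefore shifted their attention from models to data to close this gap, and "data debugging" has emerged. Unlike model optimization, data debugging focuses on inspecting data: understanding how erroneous data at each stage of the machine learning pipeline affects downstream tasks, and then debugging the corresponding errors to improve model performance. Based on a comprehensive survey of work on data debugging, this paper first proposes a research framework for data debugging that divides existing methods into three categories according to how they interact with the machine learning pipeline: closed, immersive, and hybrid data debugging. It then reviews related work in the field in detail, experimentally evaluates data debugging methods, and summarizes the datasets and evaluation metrics commonly used in the field. Finally, it identifies the challenges facing data debugging and directions for future work.
Keywords: data debugging; machine learning; data quality; data cleaning; artificial intelligence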
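Improved Reconstruction of Hydro-Generator Temperature Data Based on Grey Relational Analysis and Batch Regression
13
Authors: 李飞霏, 曾云, 那泓, 曹瀚天. 《水电站机电技术》, 2026, No. 1, pp. 1-6, 157 (7 pages).
To meet the engineering need for anomaly detection in hydro-generator temperature data, this study proposes a BPOD-IGRA-MLR-BIR cleaning framework based on multimodal data fusion. Traditional threshold and statistical methods suffer from insufficient real-time performance and weak coordination across multi-source data. The framework makes three contributions: (1) a box-plot outlier detection (BPOD) mechanism that quickly identifies obvious outliers through dynamic threshold calculation; (2) an improved grey relational interpolation algorithm (IGRA) that introduces a time-weighting factor to optimize the relational degree calculation and improve the reconstruction accuracy of time series data; (3) a multiple linear regression interpolation model (MLR-BIR) that builds a nonlinear correlation model of temperatures across multiple measuring points for collaborative imputation. Validated on measured data from a hydropower station, the framework raises anomaly coverage to 98.7% with a reconstruction error of ±1.2 °C, improving both sensitivity and reconstruction accuracy over single methods. Engineering practice shows that the method effectively addresses missing temperature data and anomaly interference in unit operation monitoring, providing highly reliable data support for health assessment and fault warning.
Keywords: hydro-generator temperature data; data cleaning; data imputation; grey relational analysis; regression interpolation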
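The box-plot outlier detection step mentioned in the abstract above can be illustrated with the textbook interquartile-range rule. This is a minimal sketch, not the paper's BPOD mechanism: the abstract describes a dynamic threshold whose details are not given, so the fixed `whisker` factor of 1.5, the function name `boxplot_outliers`, and the sample temperatures are assumptions.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return the indices of values outside the box-plot fences.

    Fences are Q1 - whisker*IQR and Q3 + whisker*IQR, with quartiles from
    statistics.quantiles (default "exclusive" method). Needs >= 2 values.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical stator temperatures in degrees Celsius with one spike.
temperatures = [61.2, 61.5, 61.3, 61.4, 95.0, 61.6, 61.1, 61.3]
flagged = boxplot_outliers(temperatures)
```

Flagged readings would then be treated as missing and handed to an interpolation step, which is the role the abstract assigns to the grey relational and regression models.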
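Research Progress on Automated Detection and Processing Methods for Anomalous Marine Multibeam Bathymetric Data
14
Authors: 罗东旭, 陈江欣, 徐华宁, 吴能友, 陆凯, 欧文佳, 李海龙, 傅钰, 韩同刚, 杨添贵. 《海洋地质与第四纪地质》 (PKU Core), 2026, No. 1, pp. 99-113 (15 pages).
Addressing the automated detection and processing of anomalous marine multibeam bathymetric data, and drawing on research progress in China and abroad, this paper divides the methods into three categories by processing target: methods operating in the sounding-point data domain, methods operating in the single-ping data domain, and methods operating in the constructed surface-model domain. By reviewing methods such as median filtering, clustering algorithms, cloth simulation, the CUBE algorithm, trend surface methods, and robust estimation, it summarizes how the methods differ in processing procedure, application object, application criteria, applicable field, and result evaluation, and classifies and compares them in tables to identify the emphasis of each of the three categories and the types of anomalous data each suits. It analyzes the strengths and weaknesses of the three categories of automated detection and processing methods, summarizes the problems encountered in past processing and practice, and offers corresponding recommendations.
Keywords: multibeam bathymetry; anomalous data; automation; detection and processing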
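Research on a Standardized Construction Method for Building Geographic Entities
15
Authors: 刘玥, 朱立珍, 李广伟. 《测绘与空间地理信息》, 2026, No. 2, pp. 9-12 (4 pages).
To resolve spatial and attribute conflicts between building features in fundamental geographic information feature data and natural-building (自然幢) data from real estate registration, this paper proposes an entity construction method based on multi-rule cleaning and hierarchical merging. First, classification and conversion yield building entity primitives and building ancillary facility primitives. Second, spatial filtering is used to classify and clean the real estate registration natural-building data. Finally, based on spatial relationships, differentiated fusion is designed for building entity primitives and ancillary facility primitives in areas covered by real estate registration natural-building data and in uncovered areas, respectively. Experiments show that the method effectively resolves data redundancy, misalignment, and mixed attributes, generating logically consistent integrated building geographic entity primitives and providing technical support for building geographic entity construction.
Keywords: building geographic entity; real estate registration natural-building data; data cleaning; spatial relationship; primitive fusion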
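Product Life-Cycle Database Construction and Carbon Footprint Calculation Technology
16
Authors: 刘玺, 尚楠, 朱浩骏, 潘珍. 《广东电力》 (PKU Core), 2026, No. 1, pp. 1-9 (9 pages).
With the introduction of carbon emission restriction policies such as carbon quotas and carbon tariffs, reducing the carbon emissions of industrial products plays an increasingly significant role in improving international competitiveness and economic returns, and demand for accurate carbon emission accounting of various products has grown sharply. However, the wide variety of industrial products, the complexity and diversity of process steps, and problems such as low data monitoring accuracy and inconsistent strategies pose major challenges to accurate carbon emission accounting. Focusing on typical carbon-emitting industrial products, this paper builds a carbon emission database covering the full product life cycle, develops a data quality evaluation system, and proposes data sensitivity analysis and data cleaning mechanisms to ensure the reliability of product life-cycle data and lay the foundation for accurate accounting. Using data on the product's production process, it then accurately calculates carbon emissions over the full product life cycle.
Keywords: industrial products; full life cycle; carbon footprint database; data cleaning; carbon emission accounting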
Can Automatic Classification Help to Increase Accuracy in Data Collection?
17
Authors: Frederique Lang, Diego Chavarro, Yuxian Liu. 《Journal of Data and Information Science》, 2016, No. 3, pp. 42-58 (17 pages).
Purpose: The authors aim at testing the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets. Design/methodology/approach: The paper is centered on cleaning datasets gathered from publishers and online resources by the use of specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested here in comparison with manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually, but also their combinations through a voting scheme. We also tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms. Findings: We found that the performance of the algorithms used varies with the size of the training sample. However, for the classification exercise in this paper the best performing algorithms were SVM and Boosting. The combination of these two algorithms achieved a high agreement on coverage and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks. Research limitations: The dataset gathered has significantly more records related to the topic of interest compared to unrelated topics. This may affect the performance of some algorithms, especially in their identification of unrelated papers. Practical implications: Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is big. With the help of accuracy, recall, and coverage measures, it is possible to estimate the error involved in this classification, which could open the possibility of incorporating the use of these algorithms in software specifically designed for data cleaning and classification.
Keywords: disambiguation; machine learning; data cleaning; classification; accuracy; recall; coverage
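The voting scheme and coverage indicator described in the abstract above might look roughly like the sketch below. The abstract does not define either precisely, so this assumes a simple majority vote and takes coverage to be the share of records on which all classifiers agree; the function name `vote` and the sample predictions are hypothetical.

```python
from collections import Counter


def vote(predictions):
    """Combine per-classifier label lists by majority vote.

    predictions: list of label lists, one per classifier, all the same length.
    Returns (labels, coverage), where coverage is the fraction of records on
    which every classifier agrees (an assumed definition). On a tied vote the
    label that appears first among the classifiers wins.
    """
    if not predictions or not predictions[0]:
        raise ValueError("need at least one classifier and one record")

    labels, agreed = [], 0
    for votes in zip(*predictions):
        label, count = Counter(votes).most_common(1)[0]
        labels.append(label)
        if count == len(votes):
            agreed += 1
    return labels, agreed / len(labels)


# Hypothetical outputs of three classifiers on four records.
svm = ["related", "related", "unrelated", "related"]
boosting = ["related", "unrelated", "unrelated", "related"]
forest = ["related", "related", "unrelated", "unrelated"]
combined, coverage = vote([svm, boosting, forest])
```

Records where the classifiers disagree are natural candidates for manual coding, which matches the abstract's point that automatic classification reduces rather than eliminates manual work.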
Intelligent Data Pre-processing Model in Integrated Ocean Observing Network System
18
Authors: 韩华, 丁永生, 刘凤鸣. 《Journal of Donghua University (English Edition)》 (EI, CAS), 2009, No. 5, pp. 499-502 (4 pages).
Observation data sets derived from an integrated ocean observing network system contain a considerable amount of dirty data. Thus, the data must be carefully and reasonably processed before they are used for forecasting or analysis. This paper proposes a data pre-processing model based on intelligent algorithms. Firstly, we introduce the integrated network platform of ocean observation. Next, the pre-processing model of data is presented, and an intelligent cleaning model of data is proposed. Based on fuzzy clustering, the Kohonen clustering network is improved to fulfill the parallel calculation of fuzzy c-means clustering. The proposed dynamic algorithm can automatically find the new clustering center with the updated sample data. The rapid and dynamic performance of the model makes it suitable for real-time calculation, and the efficiency and accuracy of the model are demonstrated by test results from observation data analysis.
Keywords: integrated ocean observing network; intelligent data pre-processing; data cleaning; fuzzy soft clustering
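An Early-Warning Method for Wind Turbine Gearbox Oil Sump Temperature Based on TPE-CatBoost
19
Authors: 郭浩宇, 周元贵, 王露春, 万罗强. 《分布式能源》, 2026, No. 1, pp. 27-33 (7 pages).
Because abnormal gearbox oil sump temperatures in wind turbines are difficult to detect early, a fault early-warning method based on supervisory control and data acquisition (SCADA) data is proposed to improve turbine operating reliability. First, using the wind speed-power distribution, abnormal power points are removed by the quartile method and by longitudinal filtering based on data dispersion. Next, a random forest algorithm selects the key input features that affect oil sump temperature, a temperature prediction model is built with the categorical boosting (CatBoost) algorithm, and its hyperparameters are optimized with the tree-structured Parzen estimator (TPE). Finally, a dynamic warning threshold is determined from the residual distribution using statistical process control. In an actual fault case at a wind farm, the model issued a valid warning about 5 h before the gearbox fault occurred, and the times at which residuals exceeded the control limits closely matched the development of the fault. The proposed method effectively identifies abnormal oil sump temperature conditions and has good early-warning capability and engineering value.
Keywords: wind turbine; fault early warning; gearbox; data cleaning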
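The last step in the abstract above, turning prediction residuals into a warning through statistical process control, can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: it uses static three-sigma control limits fitted on healthy-period residuals rather than the dynamic threshold the abstract mentions, and the names `control_limits` and `first_alarm`, the `run_length` rule, and the sample residuals are hypothetical.

```python
import statistics


def control_limits(healthy_residuals, sigmas=3.0):
    """Lower and upper control limits from residuals of a healthy period."""
    mean = statistics.fmean(healthy_residuals)
    std = statistics.stdev(healthy_residuals)
    return mean - sigmas * std, mean + sigmas * std


def first_alarm(residuals, limits, run_length=3):
    """Index where the first run of `run_length` consecutive out-of-limit
    residuals starts, or None if no such run occurs."""
    low, high = limits
    run = 0
    for i, r in enumerate(residuals):
        run = run + 1 if (r < low or r > high) else 0
        if run >= run_length:
            return i - run_length + 1
    return None


# Residual = measured minus predicted oil sump temperature (hypothetical, in °C).
healthy = [0.2, -0.1, 0.0, 0.3, -0.2, 0.1, -0.3, 0.0]
limits = control_limits(healthy)
monitored = [0.1, 0.7, 0.2, 0.8, 0.9, 1.1, 1.3]
alarm_index = first_alarm(monitored, limits)
```

Requiring several consecutive violations before raising an alarm is one common way to suppress isolated spikes; here the single excursion at index 1 is ignored and the sustained drift starting at index 3 triggers the warning.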
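Research on Key Technologies for Big-Data-Assisted Decision-Making in Railway Passenger Transport
20
Author: 严安. 《铁道运营技术》, 2026, No. 1, pp. 10-13 (4 pages).
To meet the need for big-data-assisted decision-making in railway passenger transport, an integrated technical solution covering data collection, cleaning, processing, and visualization is proposed. PostgreSQL and GreenPlum are used to build a hybrid database system, combined with Python scripts for data collection, cleaning, and intelligent processing, improving the timeliness of data processing. Echarts is used to build a data dashboard that provides near-real-time business monitoring and supports passenger flow monitoring and data analysis in railway passenger transport management. The solution has been deployed in the passenger transport department of China Railway Nanning Group Co., Ltd., where it has improved passenger transport management efficiency and provided a scientific basis for optimizing passenger flow organization and supporting marketing decisions.
Keywords: railway; passenger transport data; data collection; data cleaning; visualization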