A recommender system is a tool designed to suggest relevant items to users based on their preferences and behaviors. Collaborative filtering, a popular technique within recommender systems, predicts user interests by analyzing patterns in interactions and similarities between users, leveraging past behavior data to make personalized recommendations. Despite its popularity, collaborative filtering faces notable challenges, one of which is the issue of grey-sheep users, who have unusual tastes in the system. Surprisingly, existing research has not extensively explored outlier detection techniques to address the grey-sheep problem. To fill this research gap, this study conducts a comprehensive comparison of 12 outlier detection methods (such as LOF, ABOD, and HBOS) and introduces innovative user representations aimed at improving the identification of outliers within recommender systems. More specifically, we propose and examine three types of user representations: 1) the distribution statistics of user-user similarities, where similarities are calculated from users' rating vectors; 2) the distribution statistics of user-user similarities, with similarities derived from users represented by latent factors; and 3) latent-factor vector representations. Our experiments on the MovieLens and Yahoo! Movie datasets demonstrate that user representations based on latent-factor vectors consistently facilitate the identification of more grey-sheep users when applying outlier detection methods.
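To make representation type 1 concrete, the following is a minimal sketch (Python, scikit-learn) of feeding distribution statistics of user-user similarities into LOF. The dataset, the particular statistics, and the LOF settings are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of representation type 1: summarize each user by distribution
# statistics of their user-user similarities, then flag grey-sheep
# candidates with LOF. The statistics chosen here (mean/std/quartiles)
# are illustrative assumptions.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(200, 50)).astype(float)  # users x items, 0 = unrated

sim = cosine_similarity(ratings)                 # user-user similarity matrix
np.fill_diagonal(sim, np.nan)                    # ignore self-similarity

# Distribution statistics of each user's similarities to all other users.
stats = np.column_stack([
    np.nanmean(sim, axis=1),
    np.nanstd(sim, axis=1),
    np.nanpercentile(sim, 25, axis=1),
    np.nanpercentile(sim, 75, axis=1),
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(stats)                  # -1 marks outliers (grey-sheep candidates)
print("grey-sheep candidates:", np.where(labels == -1)[0])
```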
With the development of the global positioning system (GPS), wireless technology, and location-aware services, it is possible to collect a large quantity of trajectory data. In the field of data mining for moving objects, the problem of anomaly detection is a hot topic. Building on the development of anomalous trajectory detection for moving objects, this paper introduces the classical trajectory outlier detection (TRAOD) algorithm and then proposes a density-based trajectory outlier detection (DBTOD) algorithm, which compensates for a disadvantage of the TRAOD algorithm: its inability to detect anomalies when trajectories are local and dense. Results of applying the proposed algorithm to the Elk1993 and Deer1995 datasets are also presented, demonstrating the effectiveness of the algorithm.
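A hedged sketch of the density-based idea: measure how many trajectories lie within a distance threshold of each trajectory and flag the sparse ones. Fixed-length trajectories and mean point-to-point distance are simplifying assumptions; the paper's segment-based trajectory distance is richer:

```python
# Minimal density-based trajectory outlier sketch: count neighbors in
# trajectory-distance space and flag trajectories with few neighbors.
import numpy as np

rng = np.random.default_rng(1)
# 30 normal random-walk trajectories plus 1 that drifts away.
base = np.cumsum(rng.normal(0, 0.05, size=(31, 50, 2)), axis=1)
base[-1] += np.linspace(0, 5, 50)[None, :, None]  # make the last one drift

def traj_dist(a, b):
    """Mean Euclidean distance between corresponding points."""
    return np.linalg.norm(a - b, axis=1).mean()

eps = 1.0  # neighborhood radius (assumed; would be tuned in practice)
n = len(base)
density = np.array([
    sum(traj_dist(base[i], base[j]) < eps for j in range(n) if j != i)
    for i in range(n)
])
# Trajectories with few neighbors in distance space are density outliers.
print("outlier trajectories:", np.where(density < 3)[0])
```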
With the advent of the data age, data quality has become a problem of growing concern. As a field of data mining, outlier detection is closely related to data quality. The isolation forest algorithm is one of the more prominent numerical-data outlier detection algorithms of recent years. When the isolation forest algorithm constructs isolation trees, the differences among the trees gradually decrease as more are generated, and may even vanish, which wastes memory and reduces the efficiency of outlier detection. Moreover, some of the constructed isolation trees cannot detect outliers at all. In this paper, an improved iForest-based method, GA-iForest, is proposed. This method optimizes the isolation forest by selecting better isolation trees according to detection accuracy and inter-tree difference, thereby discarding duplicate, similar, and poorly performing isolation trees and improving the accuracy and stability of outlier detection. In the experiments, the Ubuntu system and the Spark platform are used to build the environment, and the outlier datasets provided by ODDS are used as test data. The performance of the proposed method is evaluated with indicators such as accuracy, recall rate, ROC curves, AUC, and execution time. Experimental results show that the proposed method not only improves the accuracy and stability of outlier detection but also reduces the number of isolation trees by 20%-40% compared with the original iForest method.
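The selection step can be sketched as follows, with a greedy accuracy-plus-diversity filter standing in for the genetic algorithm; the pool size, correlation cutoff, and dataset are assumptions:

```python
# Sketch of GA-iForest's core idea: keep only accurate, mutually
# different isolation trees (greedy selection replaces the GA here).
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(0, 1, size=(500, 4))
X[:20] += 6                                  # plant 20 outliers
y = np.r_[np.ones(20), np.zeros(480)]        # 1 = outlier (for evaluation only)

# Pool of single-tree "forests"; each score_samples is one tree's view.
pool = [IsolationForest(n_estimators=1, random_state=i).fit(X) for i in range(40)]
scores = np.array([-m.score_samples(X) for m in pool])   # higher = more anomalous
aucs = np.array([roc_auc_score(y, s) for s in scores])

selected = [int(np.argmax(aucs))]
for i in np.argsort(-aucs):
    # Keep a tree only if it is sufficiently different from those selected.
    if all(abs(np.corrcoef(scores[i], scores[j])[0, 1]) < 0.9 for j in selected):
        selected.append(int(i))
    if len(selected) == 10:
        break

ensemble_score = scores[selected].mean(axis=0)
print("pruned ensemble AUC:", roc_auc_score(y, ensemble_score))
```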
In this paper, we propose a Packet Cache-Forward (PCF) method based on improved Bayesian outlier detection to eliminate out-of-order packets caused by drastic transmission-path degradation during handover events in moving satellite networks, thereby improving TCP performance. The proposed method uses an access-node satellite to cache all received packets for a short time when handover occurs and to forward them out in order. To calculate the cache time accurately, this paper establishes a Bayesian mixture model for detecting delay outliers across the entire handover scheme. To correct misjudged outliers, an updated classification threshold and a sliding window are introduced to revise category collections and model parameters, so that the exact compensation delay can be identified quickly under varied network load conditions. Simulation shows that, compared with the average-processing-delay detection method, the average accuracy rate improves by about 4.0% while the error rate is cut by about 5.5%. The method also behaves well when tested on a large dataset. Benefiting from these performance advantages, and compared with conventional independent handover and network-controlled synchronized handover in simulated LEO satellite networks, the proposed independent handover with PCF eliminates the packet out-of-order issue and achieves a better congestion window. Ultimately, the average delay decreases by more than 70% and TCP performance improves by more than 300%.
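The delay-outlier component can be loosely sketched with a two-component Gaussian mixture (scikit-learn); the paper's updated classification threshold and sliding window are omitted, and all parameters are assumptions:

```python
# Sketch: model handover delays with a two-component Gaussian mixture and
# treat points assigned to the high-mean component as delay outliers.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
delays = np.r_[rng.normal(40, 5, 950), rng.normal(120, 15, 50)]  # ms

gm = GaussianMixture(n_components=2, random_state=0).fit(delays.reshape(-1, 1))
high = int(np.argmax(gm.means_.ravel()))        # component with the larger mean
labels = gm.predict(delays.reshape(-1, 1))

outliers = delays[labels == high]
print(f"{len(outliers)} delay outliers, mean normal delay "
      f"{delays[labels != high].mean():.1f} ms")
```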
Various uncertainties arising during the acquisition of geoscience data may result in anomalous data instances (i.e., outliers) that do not conform to the expected pattern of regular data instances. With sparse multivariate data obtained from geotechnical site investigation, it is impossible to identify outliers with certainty, owing to the distortion of the statistics of geotechnical parameters caused by outliers and the associated statistical uncertainty resulting from data sparsity. This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation. The proposed approach quantifies the outlying probability of each data instance based on Mahalanobis distance and declares as outliers those data instances with outlying probabilities greater than 0.5. It tackles the distortion of statistics estimated from a dataset containing outliers by a re-sampling technique and rationally accounts for the statistical uncertainty through Bayesian machine learning. Moreover, the proposed approach provides a dedicated method for determining the outlying components of each outlier. The approach is illustrated and verified using simulated and real-life datasets. The results show that it properly identifies outliers among sparse multivariate data, together with their outlying components, in a probabilistic manner. It can significantly reduce the masking effect (i.e., missing some actual outliers because the outliers themselves distort the statistics and inflate statistical uncertainty). It is also found that outliers among sparse multivariate data instances significantly affect the construction of the multivariate distribution of geotechnical parameters for uncertainty quantification, which emphasizes the necessity of a data cleaning process (e.g., outlier detection) for uncertainty quantification based on geoscience data.
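The core of the outlying-probability computation can be sketched with a plain bootstrap in place of the paper's Bayesian machinery; the chi-square cutoff and resample count are assumptions:

```python
# Sketch: re-sample the data, estimate mean/covariance on each bootstrap
# sample, and record how often each point's Mahalanobis distance exceeds
# a chi-square cutoff. Points flagged in more than half the resamples
# (outlying probability > 0.5) are declared outliers.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=60)
X[:3] += 5                                      # three planted outliers

cutoff = chi2.ppf(0.975, df=X.shape[1])
hits = np.zeros(len(X))
for _ in range(500):
    idx = rng.choice(len(X), size=len(X), replace=True)
    mu = X[idx].mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X[idx], rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu)  # squared Mahalanobis
    hits += d2 > cutoff

outlying_prob = hits / 500
print("outliers:", np.where(outlying_prob > 0.5)[0])
```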
Uncertain data are common owing to the increasing use of sensors, radio frequency identification (RFID), GPS, and similar devices for data collection. The causes of uncertainty include limitations of measurement, inclusion of noise, inconsistent supply voltage, and delay or loss of data in transfer. In order to manage, query, or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by the probability density function of a Gaussian distribution. The naive approach to distance-based outlier detection uses a nested loop and is very costly because of the expensive distance function between two uncertain objects. Therefore, a populated-cells list (PC-list) approach to outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for the top-k outliers. Two approximate top-k outlier detection algorithms are also presented to further increase efficiency. An extensive empirical study on synthetic and real datasets demonstrates the accuracy, efficiency, and scalability of the proposed algorithms.
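For orientation, here is the naive nested-loop baseline that the PC-list is designed to avoid, with Monte Carlo estimates of expected distances between Gaussian objects; sample counts and k values are assumptions:

```python
# Naive top-k distance-based outliers over uncertain (Gaussian) objects:
# score each object by its k-th nearest expected distance, computed with
# an expensive all-pairs loop.
import numpy as np

rng = np.random.default_rng(5)
means = rng.normal(0, 1, size=(40, 2))
means[0] += 8                                   # one clearly outlying object
sigmas = np.full(40, 0.2)

def expected_dist(i, j, n_samples=200):
    """Monte Carlo estimate of E[||Xi - Xj||] for Gaussian objects i, j."""
    xi = rng.normal(means[i], sigmas[i], size=(n_samples, 2))
    xj = rng.normal(means[j], sigmas[j], size=(n_samples, 2))
    return np.linalg.norm(xi - xj, axis=1).mean()

n, knn, topk = len(means), 5, 3
scores = np.empty(n)
for i in range(n):                               # the expensive nested loop
    d = sorted(expected_dist(i, j) for j in range(n) if j != i)
    scores[i] = d[knn - 1]                       # distance to k-th nearest object

print("top-k outliers:", np.argsort(-scores)[:topk])
```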
With the development of science and technology, the status of the water environment has received more and more attention. In this paper, we propose a deep learning model, named the Joint Auto-Encoder network, to solve the problem of outlier detection in water supply data. The Joint Auto-Encoder network first expands the size of the training data and extracts useful features from the input, and then reconstructs the input data effectively into an output. Outliers are detected based on the network's reconstruction errors, a larger reconstruction error indicating a higher likelihood of being an outlier. Water supply data mainly contain two types of outliers: outliers with large values and those with values close to zero. We therefore set two separate thresholds on the reconstruction errors to detect the two types of outliers respectively; data samples with reconstruction errors exceeding the thresholds are voted to be outliers. The two thresholds can be calculated from the classification confusion matrix and the receiver operating characteristic (ROC) curve. We also compare the Joint Auto-Encoder with the vanilla Auto-Encoder on both a synthetic dataset and the MNIST dataset. As a result, our model proves to outperform the vanilla Auto-Encoder and several other outlier detection approaches, achieving a recall rate of 98.94% on the water supply data.
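The two-threshold reconstruction-error rule can be sketched with a vanilla autoencoder in PyTorch standing in for the Joint Auto-Encoder; the architecture and threshold values are assumptions (the paper derives the thresholds from the confusion matrix and ROC curve):

```python
# Sketch: train an autoencoder on normal data, then flag large-value and
# near-zero samples whose reconstruction error exceeds its threshold.
import torch
import torch.nn as nn

torch.manual_seed(0)
normal = torch.rand(1000, 8) * 0.5 + 0.25       # normal readings in [0.25, 0.75]
test = torch.cat([normal[:50], torch.full((5, 8), 0.99), torch.zeros(5, 8)])

model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):                             # train on normal data only
    opt.zero_grad()
    loss = ((model(normal) - normal) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():
    err = ((model(test) - test) ** 2).mean(dim=1)
    mean_val = test.mean(dim=1)

thr_high, thr_low = 0.05, 0.05                   # would be tuned via ROC in the paper
is_outlier = ((err > thr_high) & (mean_val > 0.75)) | \
             ((err > thr_low) & (mean_val < 0.1))
print("flagged indices:", torch.nonzero(is_outlier).flatten().tolist())
```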
Data reconciliation technology can decrease the level of corruption in process data caused by measurement noise, but the presence of outliers caused by process peaks or unmeasured disturbances will smear the reconciled results. Based on an analysis of the limitations of conventional outlier detection algorithms, a modified outlier detection method for dynamic data reconciliation (DDR) is proposed in this paper. In the modified method, the outliers of each variable are distinguished individually and the corresponding weights are modified accordingly. The modified method can therefore use more of the information in normal data and efficiently decrease the effect of outliers. Simulation of a continuous stirred tank reactor (CSTR) process verifies the effectiveness of the proposed algorithm.
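The weight-modification idea can be sketched as an iteratively reweighted fit: flag per-variable residuals beyond 3 sigma and down-weight only those points. The smoothing model and cutoff are illustrative, not the paper's DDR formulation:

```python
# Sketch: reconcile a measured series against a smooth model by weighted
# least squares, flag residuals beyond 3 sigma, down-weight them, refit.
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0, 10, 200)
true = 2.0 + 0.3 * t                             # underlying process value
meas = true + rng.normal(0, 0.1, t.size)
meas[[50, 120]] += 2.0                           # process-peak outliers

w = np.ones_like(meas)
for _ in range(3):                               # iterative reweighting
    # Weighted linear fit as a stand-in for the reconciliation model.
    coef = np.polyfit(t, meas, deg=1, w=w)
    resid = meas - np.polyval(coef, t)
    sigma = np.std(resid[w > 0.5])
    w = np.where(np.abs(resid) > 3 * sigma, 0.05, 1.0)  # down-weight outliers

print("flagged points:", np.where(w < 0.5)[0])
```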
On the basis of wavelet theory, we propose an outlier-detection algorithm for satellite gravity gradiometry by applying a wavelet-shrinkage de-noising method to simulation data containing white noise and outliers. The results show that this novel algorithm has a 97% success rate in outlier identification and that it can be efficiently used for pre-processing real satellite gravity gradiometry data.
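A minimal sketch of wavelet-shrinkage outlier flagging, assuming the PyWavelets library; the wavelet, decomposition level, and residual cutoff are assumptions:

```python
# Sketch: de-noise by soft-thresholding detail coefficients, then flag
# points whose residual from the de-noised signal is large relative to a
# robust (MAD-based) scale estimate.
import numpy as np
import pywt

rng = np.random.default_rng(7)
signal = np.sin(np.linspace(0, 8 * np.pi, 1024)) + rng.normal(0, 0.1, 1024)
signal[[100, 600]] += 1.5                        # planted outliers

coeffs = pywt.wavedec(signal, 'db4', level=4)
# Universal threshold from the finest detail level (standard shrinkage rule).
sigma = np.median(np.abs(coeffs[-1])) / 0.6745
thr = sigma * np.sqrt(2 * np.log(signal.size))
coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode='soft') for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, 'db4')[:signal.size]

resid = signal - denoised
mad = np.median(np.abs(resid - np.median(resid)))
print("outliers:", np.where(np.abs(resid) > 4 * mad / 0.6745)[0])
```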
Purpose: The main aim of this study is to build a robust novel approach that is able to detect outliers in datasets accurately. To serve this purpose, a novel approach is introduced to determine the likelihood of an object being extremely different from the general behavior of the entire dataset. Design/methodology/approach: This paper proposes a novel two-level approach based on the integration of bagging and voting techniques for anomaly detection problems. The proposed approach, named Bagged and Voted Local Outlier Detection (BV-LOF), uses the Local Outlier Factor (LOF) as the base algorithm and improves its detection rate with ensemble methods. Findings: Several experiments were performed on ten benchmark outlier detection datasets to demonstrate the effectiveness of the BV-LOF method. According to the results, BV-LOF significantly outperformed LOF on 9 of the 10 datasets on average. Research limitations: In the BV-LOF approach, the base algorithm is applied to each data subset multiple times with a different neighborhood size (k) in each case and with different ensemble sizes (T). In our study, we chose k and T value ranges of [1, 100]; however, these ranges can be changed according to the dataset handled and the problem addressed. Practical implications: The proposed method can be applied to datasets from different domains (e.g., health, finance, manufacturing) without requiring any prior information. Since the BV-LOF method includes two-level ensemble operations, it may need more computational time than single-level ensemble methods; however, this drawback can be overcome by parallelization and by using a proper data structure such as an R*-tree or KD-tree. Originality/value: The proposed approach (BV-LOF) investigates multiple neighborhood sizes (k), which yields findings for instances with different local densities and in this way provides a greater likelihood of detecting outliers that LOF may neglect. It also brings many benefits such as easy implementation, improved capability, higher applicability, and interpretability.
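The two-level ensemble can be sketched directly with scikit-learn's LOF: bootstrap subsets (bagging), varied neighborhood sizes, and score averaging (voting). The ensemble size and k range are assumptions:

```python
# Sketch of BV-LOF: run LOF on bootstrap subsets with varying k, score
# all points with each fitted model, and average (vote over) the scores.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.uniform(-6, 6, (10, 2))])

T = 20                                           # ensemble size
scores = np.zeros(len(X))
for t in range(T):
    idx = rng.choice(len(X), size=len(X), replace=True)     # bagging
    k = rng.integers(10, 50)                                 # varied neighborhood size
    lof = LocalOutlierFactor(n_neighbors=int(k), novelty=True).fit(X[idx])
    scores += -lof.score_samples(X)              # higher = more anomalous
scores /= T                                      # voting by score averaging

top = np.argsort(-scores)[:10]
print("top-10 outlier candidates:", sorted(top.tolist()))
```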
Outlier detection is a key research area in data mining, as it can identify data that are inconsistent within a data set. Outlier detection aims to find abnormal data within large datasets and has been applied in many fields, including fraud detection, network intrusion detection, disaster prediction, medical diagnosis, public security, and image processing. While outlier detection has been widely applied in real systems, its effectiveness is challenged by high dimensionality and redundant data attributes, which lead to detection errors and complicated calculations. The prevalence of mixed data is a further issue for outlier detection algorithms. An outlier detection method for mixed data based on neighborhood combinatorial entropy is therefore studied, which improves detection performance by reducing the data dimension with an attribute reduction algorithm. The significance of attributes is determined, and the less influential attributes are removed based on neighborhood combinatorial entropy; outlier detection is then conducted using the local outlier factor algorithm. The proposed method can be applied effectively to numerical and mixed multidimensional data. In the experimental part of this paper, we compare outlier detection before and after attribute reduction and show that detection accuracy is enhanced by removing the less influential attributes in numerical and mixed multidimensional data.
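A sketch of the reduce-then-detect pipeline, with mutual information as a stand-in significance measure for neighborhood combinatorial entropy (the rough labels below are used only to rank attributes in this toy demonstration):

```python
# Sketch: rank attributes by a significance proxy, drop the weakest, and
# compare LOF detection before and after the reduction.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(9)
X = rng.normal(0, 1, size=(400, 10))
X[:15, :3] += 5                                  # outliers visible in 3 attributes only
X[:, 3:] = rng.normal(0, 1, size=(400, 7))       # redundant, uninformative attributes
y_rough = np.r_[np.ones(15), np.zeros(385)]      # rough labels, only to rank attributes

significance = mutual_info_classif(X, y_rough, random_state=0)
keep = np.argsort(-significance)[:3]             # keep the most significant attributes
print("kept attributes:", sorted(keep.tolist()))

for name, data in [("all attributes", X), ("reduced", X[:, keep])]:
    labels = LocalOutlierFactor(n_neighbors=20).fit_predict(data)
    hit = np.sum(labels[:15] == -1)
    print(f"{name}: detected {hit}/15 planted outliers")
```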
Flue temperature is one of the important indicators characterizing the combustion state of an ethylene cracking furnace, and outliers in the temperature data can lead to false alarms. Conventional outlier detection algorithms such as the Isolation Forest algorithm and the 3-sigma principle cannot detect these outliers accurately. In order to improve detection accuracy and reduce computational complexity, an outlier detection algorithm for flue temperature data based on the Clipping Local Outlier Factor (CLOF) algorithm is proposed. The algorithm preprocesses the normalized data using a cluster pruning algorithm and then performs highly accurate and efficient outlier detection on the resulting candidate outlier set. Using the flue temperature data of an ethylene cracking furnace in a petrochemical plant, the main parameters of the CLOF algorithm are selected according to experimental results, and the outlier detection performance of the Isolation Forest algorithm, the 3-sigma principle, the conventional LOF algorithm, and the CLOF algorithm is compared and analyzed. The results show that an appropriate clipping coefficient in the CLOF algorithm can significantly improve detection efficiency and accuracy. Compared with the results of the Isolation Forest algorithm and the 3-sigma principle, the accuracy of CLOF detection is higher and the amount of computation is significantly reduced.
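The clipping step can be sketched as cluster pruning followed by LOF on the surviving candidates; the cluster count and clipping coefficient are assumptions:

```python
# Sketch of CLOF: cluster the data, keep only the points farthest from
# their centroid (controlled by a clipping coefficient) as candidates,
# and run LOF on that reduced set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(10)
temps = np.vstack([rng.normal(820, 4, (480, 1)), rng.normal(860, 2, (8, 1))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(temps)
dist = np.linalg.norm(temps - km.cluster_centers_[km.labels_], axis=1)

clip = 0.2                                       # clipping coefficient: keep top 20%
cand = np.argsort(-dist)[: int(clip * len(temps))]

labels = LocalOutlierFactor(n_neighbors=10).fit_predict(temps[cand])
print("flagged readings:", sorted(cand[labels == -1].tolist()))
```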
The method of time series analysis, applied by establishing appropriate mathematical models for bridge health monitoring data and forecasting future structural behavior, stands out as a novel and viable research direction for bridge state assessment. However, outliers inevitably exist in the monitoring data due to various interventions, reducing the precision of model fitting and affecting the forecasting results. The identification of outliers is therefore crucial for the accurate interpretation of monitoring data. In this study, a time series model combined with outlier information for bridge health monitoring is established using intervention analysis theory, and forecasting of the structural responses is carried out. We focus on three techniques: (1) the modeling of the seasonal autoregressive integrated moving average (SARIMA) model; (2) the methodology for outlier identification and amendment, whether the occurrence time and type of outliers are known or unknown; (3) forecasting with the model that includes outlier effects. The method was tested in a case study using monitoring data from a real bridge. The establishment of the original SARIMA model without considering outliers is first discussed, including the stationarity, order determination, parameter estimation, and diagnostic checking of the model. Then the time-by-time iterative procedure for outlier detection, implemented through appropriate test statistics on the residuals, is performed, and the SARIMA-outlier model is subsequently built. Finally, a comparative analysis of forecasting performance between the original model and the SARIMA-outlier model is carried out. The results demonstrate that proper time series models are effective in mining the characteristic laws of bridge monitoring data. When the influence of outliers is taken into account, the fitted precision of the model is significantly improved and the accuracy and reliability of the forecast are strengthened.
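The residual-based outlier screening can be sketched with statsmodels; model orders and the 3-sigma test statistic are illustrative assumptions:

```python
# Sketch: fit a seasonal ARIMA model to a series with a planted additive
# outlier and flag standardized residuals beyond 3.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(11)
n = 240
t = np.arange(n)
y = 10 + 0.01 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, n)
y[150] += 4.0                                    # additive outlier (an "intervention")

model = SARIMAX(y, order=(1, 0, 0), seasonal_order=(1, 0, 0, 12), trend='ct')
res = model.fit(disp=False)

z = res.resid / np.std(res.resid)
# Indices adjacent to the planted outlier may also be flagged, since an
# additive outlier propagates briefly through the AR terms.
print("suspected outliers at:", np.where(np.abs(z) > 3)[0])
```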
A novel approach for outlier detection with iterative clustering (ICOD) in diverse subspaces is proposed. The methodology comprises two phases: iterative clustering and outlier factor computation. During the clustering phase, multiple clusterings are detected alternately based on an optimization procedure that incorporates terms for cluster quality and for novelty relative to the existing solutions. Once new clusters are detected, outlier factors can be estimated from a new definition of outliers (the cluster-based outlier), which gives weight to local data behavior. Experiments show that the proposed algorithm can effectively detect outliers that exist under different clusterings, even in high-dimensional data sets.
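A hedged sketch of a cluster-based outlier factor computed over multiple clusterings; the relative-deviation scoring rule below is an illustrative choice, not the paper's optimization procedure:

```python
# Sketch: run KMeans with several cluster counts, score each point by its
# distance to the assigned centroid relative to that cluster's typical
# spread, and average the score across clusterings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(12)
X = np.vstack([rng.normal(-3, 0.5, (150, 2)), rng.normal(3, 0.5, (150, 2)),
               [[0.0, 0.0]]])                    # one point between the clusters

factors = []
for k in (2, 3, 4):                              # multiple clusterings
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    spread = np.array([d[km.labels_ == c].mean() for c in range(k)])
    factors.append(d / spread[km.labels_])       # relative deviation in this clustering

outlier_factor = np.mean(factors, axis=0)
print("most outlying point:", int(np.argmax(outlier_factor)))
```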
Based on the multivariate mean-shift regression model, we propose a new sparse reduced-rank regression approach that achieves low-rank sparse estimation and outlier detection simultaneously. A sparse mean-shift matrix is introduced into the model to indicate outliers. The rank constraint and the group-lasso-type penalty on the coefficient matrix encourage a low-rank, row-sparse structure and help achieve dimension reduction and variable selection. An algorithm is developed for solving the resulting problem. In our simulations and a real-data application, the new method shows competitive performance compared with other methods.
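In standard notation, the model the abstract describes can be sketched as follows (the exact placement of penalties and constraints is our reading of the abstract, so treat it as an assumption):

```latex
% Y (n x q): responses; X (n x p): design; B (p x q): coefficient matrix;
% C (n x q): sparse mean-shift matrix whose nonzero rows flag outliers.
\min_{B,\,C}\;\; \tfrac{1}{2}\,\lVert Y - XB - C \rVert_F^2
  \;+\; \lambda_1 \sum_{j=1}^{p} \lVert b_j \rVert_2
  \;+\; \lambda_2 \sum_{i=1}^{n} \lVert c_i \rVert_2
\qquad \text{s.t.} \quad \operatorname{rank}(B) \le r
```

Here the group-lasso terms act on the rows of B (variable selection) and the rows of C, so a nonzero row of C marks the corresponding observation as an outlier.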
Electricity price is the first consideration for all participants in the electric power market, and its characteristics are related both to the market mechanism and to variation in the behavior of market participants. It is necessary to build a real-time price forecasting model with adaptive capability, and because there are outliers in the price data, they should be detected and filtered out when training the forecasting model by regression methods. In view of these points, this paper presents an electricity price forecasting method based on accurate online support vector regression (AOSVR) and outlier detection. Numerical testing results show that the method is effective in forecasting electricity prices in the electric power market.
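The detect-and-filter step can be sketched with batch SVR standing in for online AOSVR; kernel settings and the MAD-based cutoff are assumptions:

```python
# Sketch: fit SVR to price history, drop training points with unusually
# large residuals, and refit on the cleaned data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(13)
t = np.linspace(0, 4, 300).reshape(-1, 1)
price = 30 + 5 * np.sin(2 * np.pi * t.ravel()) + rng.normal(0, 0.5, 300)
price[[40, 200]] += 15                           # price-spike outliers

svr = SVR(kernel='rbf', C=10.0).fit(t, price)
resid = price - svr.predict(t)
mad = np.median(np.abs(resid - np.median(resid)))
keep = np.abs(resid) < 3 * mad / 0.6745          # filter outliers

svr_clean = SVR(kernel='rbf', C=10.0).fit(t[keep], price[keep])
print(f"dropped {np.sum(~keep)} outliers; refit on {keep.sum()} points")
```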
Air pollution is a major issue related to the national economy and people's livelihood. At present, research on air pollution mostly focuses on pollutant emissions in a specific industry or region as a whole, and there is a lack of attention to enterprise pollutant emissions at the micro level. Limited by the amount and time granularity of data from enterprises, enterprise pollutant emissions are still understudied. Driven by big data on the air pollution emissions of industrial enterprises monitored in Beijing-Tianjin-Hebei, this paper carries out data mining of enterprise pollution emissions, including association analysis between different features based on grey association, association mining between different data based on association rules, and outlier detection based on clustering. The results show that: (1) the industries affecting NOx and SO2 in Beijing-Tianjin-Hebei are mainly electric power, heat production and supply, and metal smelting and processing; (2) districts near Hengshui and Shijiazhuang in Hebei province form strong association rules; (3) the industrial enterprises in Beijing-Tianjin-Hebei fall into six clusters, of which three categories are outliers with excessive emissions of total VOCs, PM, and NH3, respectively.
K-nearest neighbor (KNN) is one of the most fundamental methods for unsupervised outlier detection because of its various advantages, e.g., ease of use and relatively high accuracy. Currently, most data analytic tasks need to deal with high-dimensional data, and KNN-based methods often fail due to "the curse of dimensionality". AutoEncoder-based methods have recently been introduced to use reconstruction errors for outlier detection on high-dimensional data, but the direct use of an AutoEncoder typically does not preserve the data proximity relationships well enough for outlier detection. In this study, we propose to combine KNN with an AutoEncoder for outlier detection. First, we propose the Nearest Neighbor AutoEncoder (NNAE), which preserves the original data proximity in a much lower dimension that is more suitable for performing KNN. Second, we propose K-nearest reconstruction neighbors (KNRNs), which incorporate the reconstruction errors of the NNAE with the K-distances of KNN to detect outliers. Third, we develop a method to automatically choose better parameters for optimizing the structure of the NNAE. Finally, using five real-world datasets, we experimentally show that our proposed approach, NNAE+KNRN, is much better than existing methods, i.e., KNN, Isolation Forest, a traditional AutoEncoder using reconstruction errors (AutoEncoder-RE), and Robust AutoEncoder.
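The combine-both-signals idea can be sketched with PCA standing in for the NNAE and an equal-weight blend of normalized scores; the paper's KNRN definition differs, so treat this purely as a conceptual sketch:

```python
# Sketch: embed the data in a lower dimension, then score each point by a
# blend of its reconstruction error and its k-distance in the embedding.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(14)
X = rng.normal(0, 1, size=(500, 20))
X[:8] += 4                                       # planted high-dimensional outliers

pca = PCA(n_components=3).fit(X)
Z = pca.transform(X)
recon_err = np.linalg.norm(X - pca.inverse_transform(Z), axis=1)

nn = NearestNeighbors(n_neighbors=11).fit(Z)     # 11 = self + 10 neighbors
dists, _ = nn.kneighbors(Z)
k_dist = dists[:, -1]                            # distance to the 10th neighbor

def norm01(v):
    return (v - v.min()) / (v.max() - v.min())

score = 0.5 * norm01(recon_err) + 0.5 * norm01(k_dist)
print("top-8 outliers:", sorted(np.argsort(-score)[:8].tolist()))
```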
Changepoint detection faces challenges when outlier data are present. This paper proposes a multivariate changepoint detection method, RWPCA-RFPOP, which is based on the robust WPCA projection direction and the robust RFPOP method. Our method is doubly robust and is suitable for detecting mean changepoints in multivariate normal data with high correlations between variables and with outliers. Simulation results demonstrate that our method provides strong guarantees on both the number and the location of changepoints in the presence of outliers. Finally, our method is successfully applied to an ACGH dataset.
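A hedged sketch of the project-then-detect pattern: a PCA projection (in place of robust WPCA) followed by a single L1-cost changepoint search (in place of RFPOP); both substitutions are assumptions:

```python
# Sketch: reduce the multivariate series to one dimension, then find a
# single mean changepoint robustly by minimizing L1 cost around segment
# medians, which is insensitive to the planted outliers.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(15)
n = 200
X = rng.multivariate_normal([0, 0, 0], 0.2 * np.eye(3) + 0.8, size=n)
X[120:] += 1.5                                   # mean shift at t = 120
X[[30, 90]] += 8                                 # outliers that fool non-robust methods

z = PCA(n_components=1).fit_transform(X).ravel()

def l1_cost(seg):
    return np.abs(seg - np.median(seg)).sum()    # robust segment cost

costs = [l1_cost(z[:t]) + l1_cost(z[t:]) for t in range(10, n - 10)]
print("estimated changepoint:", 10 + int(np.argmin(costs)))
```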
Load time series analysis is critical for resource management and optimization decisions, especially with automated analysis techniques. Existing research has insufficiently interpreted the overall characteristics of samples, leading to significant differences in load level detection conclusions for samples with different characteristics (trend, seasonality, cyclicality); achieving automated, feature-adaptive, and quantifiable analysis methods remains a challenge. This paper proposes a Threshold Recognition-based Load Level Detection Algorithm (TRLLD), which effectively identifies different load level regions in samples of arbitrary size and distribution type based on sample characteristics. By utilizing distribution density uniformity, the algorithm classifies data points and ultimately obtains normalized load values. In the feature recognition step, the algorithm employs the Density Uniformity Index Based on Differences (DUID), High Load Level Concentration (HLLC), and Low Load Level Concentration (LLLC) to assess sample characteristics; these indices are independent of specific load values, providing a standardized perspective on features and ensuring high efficiency and strong interpretability. Compared with traditional methods, the proposed approach demonstrates better adaptive and real-time analysis capabilities. Experimental results indicate that it can effectively identify high-load and low-load regions in 16 groups of time series samples with different load characteristics, yielding highly interpretable results. The correlation between the DUID and sample density distribution uniformity reaches 98.08%. When 10% MAD-intensity noise is introduced, the maximum relative error is 4.72%, showcasing high robustness. Notably, the method exhibits significant advantages in general and low-sample scenarios.
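Since the DUID/HLLC/LLLC formulas are not reproduced in the abstract, only the overall shape of the method can be sketched: normalize the load series and classify points into load-level regions by thresholds (the fixed quantiles below stand in for the algorithm's adaptive threshold recognition):

```python
# Generic sketch of threshold-based load level detection: normalize a load
# series to [0, 1] and classify points into low/mid/high regions.
import numpy as np

rng = np.random.default_rng(16)
t = np.arange(720)
load = 50 + 30 * np.sin(2 * np.pi * t / 144) + rng.normal(0, 3, t.size)

norm = (load - load.min()) / (load.max() - load.min())   # normalized load values
lo_thr, hi_thr = np.quantile(norm, [0.25, 0.75])         # stand-in thresholds

level = np.where(norm >= hi_thr, "high", np.where(norm <= lo_thr, "low", "mid"))
for name in ("low", "mid", "high"):
    print(f"{name:>4}-load points: {(level == name).sum()}")
```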