A recommender system is a tool designed to suggest relevant items to users based on their preferences and behaviors. Collaborative filtering, a popular technique within recommender systems, predicts user interests by analyzing patterns in interactions and similarities between users, leveraging past behavior data to make personalized recommendations. Despite its popularity, collaborative filtering faces notable challenges, one of which is the issue of grey-sheep users, who have unusual tastes in the system. Surprisingly, existing research has not extensively explored outlier detection techniques to address the grey-sheep problem. To fill this research gap, this study conducts a comprehensive comparison of 12 outlier detection methods (such as LOF, ABOD, and HBOS) and introduces innovative user representations aimed at improving the identification of outliers within recommender systems. More specifically, we propose and examine three types of user representations: 1) the distribution statistics of user-user similarities, where similarities are calculated from users' rating vectors; 2) the distribution statistics of user-user similarities, with similarities derived from users represented by latent factors; and 3) latent-factor vector representations. Our experiments on the MovieLens and Yahoo! Movie datasets demonstrate that user representations based on latent-factor vectors consistently facilitate the identification of more grey-sheep users when applying outlier detection methods.
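The first user representation described above (distribution statistics of user-user similarities computed from rating vectors) can be illustrated with a minimal sketch. The choice of cosine similarity, the particular statistics (mean, standard deviation, minimum, maximum), the function names, and the toy rating matrix are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import statistics

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_profiles(ratings):
    """For each user, summarize the distribution of similarities to all other users."""
    profiles = []
    for i, u in enumerate(ratings):
        sims = [cosine(u, v) for j, v in enumerate(ratings) if j != i]
        profiles.append({
            "mean": statistics.mean(sims),
            "stdev": statistics.pstdev(sims),
            "min": min(sims),
            "max": max(sims),
        })
    return profiles

# Hypothetical toy data: rows are users, columns are items, values are ratings.
ratings = [
    [5, 4, 1, 1],
    [4, 5, 1, 2],
    [5, 5, 2, 1],
    [1, 1, 5, 5],  # tastes opposite to everyone else
]
profiles = similarity_profiles(ratings)
# The user whose average similarity to others is lowest is a grey-sheep candidate.
least_similar_user = min(range(len(ratings)), key=lambda i: profiles[i]["mean"])
```

In a setup like this, each profile would be the feature vector passed to an outlier detector such as LOF or HBOS; the sketch stops at building the representation.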
Although quality assurance and quality control procedures are routinely applied in most air quality networks, outliers can still occur due to instrument malfunctions, the influence of harsh environments, and the limitations of measuring methods. Such outliers pose challenges for data-powered applications such as data assimilation, statistical analysis of pollution characteristics, and ensemble forecasting. Here, a fully automatic outlier detection method was developed based on the probability of residuals, which are the discrepancies between the observed and the estimated concentration values. The estimation can be conducted using filtering (or regressions when appropriate) to discriminate four types of outliers, characterized respectively by temporal and spatial inconsistency, instrument-induced low variances, periodic calibration exceptions, and PM10 concentrations lower than PM2.5 concentrations. This probabilistic method was applied to detect all four types of outliers in hourly surface measurements of six pollutants (PM2.5, PM10, SO2, NO2, CO, and O3) from 1436 stations of the China National Environmental Monitoring Network during 2014-2016. Among the measurements, 0.65%-5.68% are marked as outliers, with PM10 and CO more prone to outliers. Our method successfully identifies a trend of decreasing outliers from 2014 to 2016, which corresponds to known improvements in the quality assurance and quality control procedures of the China National Environmental Monitoring Network. The outliers can have a significant impact on the annual mean concentrations of PM2.5, with differences exceeding 10 μg m^-3 at 66 sites.
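The idea of flagging an observation when its residual (observed minus estimated value) is improbable can be sketched as follows. This is a simplified illustration, not the method of the paper: the estimate is a running median of neighboring hours, the residual scale is a robust MAD estimate, and the probability is a two-sided Gaussian tail. All of these choices, the function names, and the toy series are assumptions.

```python
import math
import statistics

def neighbor_median(series, i, half_width=2):
    """Estimate the value at index i from its neighbors (the point itself is excluded)."""
    lo = max(0, i - half_width)
    hi = min(len(series), i + half_width + 1)
    return statistics.median([series[j] for j in range(lo, hi) if j != i])

def flag_by_residual_probability(series, p_min=0.001, half_width=2):
    """Return indices whose residual has a two-sided Gaussian tail probability below p_min."""
    residuals = [x - neighbor_median(series, i, half_width) for i, x in enumerate(series)]
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad  # MAD rescaled to match a Gaussian standard deviation
    if scale == 0:
        # Degenerate case: more than half of the residuals are identical.
        return [i for i, r in enumerate(residuals) if r != center]
    flagged = []
    for i, r in enumerate(residuals):
        z = abs(r - center) / scale
        p = math.erfc(z / math.sqrt(2))  # P(|Z| >= z) for a standard normal Z
        if p < p_min:
            flagged.append(i)
    return flagged

# Hypothetical hourly concentrations with one spike at index 5.
hourly = [10, 11, 10, 12, 11, 95, 10, 11, 12, 10]
flagged = flag_by_residual_probability(hourly)
```

Excluding the point itself from its own estimate keeps a single spike from pulling the estimate toward itself.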
With the development of the Global Positioning System (GPS), wireless technology, and location-aware services, it has become possible to collect large quantities of trajectory data. In the field of data mining for moving objects, anomaly detection is a hot topic. Building on prior work in anomalous trajectory detection for moving objects, this paper introduces the classical trajectory outlier detection (TRAOD) algorithm and then proposes a density-based trajectory outlier detection (DBTOD) algorithm, which compensates for the TRAOD algorithm's inability to detect anomalies when trajectories are local and dense. The results of applying the proposed algorithm to the Elk1993 and Deer1995 datasets are also presented and show the effectiveness of the algorithm.
With the arrival of the data age, data quality has become a problem that attracts considerable attention. As a field of data mining, outlier detection is closely related to data quality. The isolation forest (iForest) algorithm is one of the more prominent outlier detection algorithms for numerical data in recent years. As the iForest algorithm continues to generate isolation trees, the differences among the trees gradually decrease or even vanish, which wastes memory and reduces the efficiency of outlier detection. Moreover, some of the constructed isolation trees cannot detect outliers. In this paper, an improved iForest-based method, GA-iForest, is proposed. The method optimizes the isolation forest by selecting better isolation trees according to their detection accuracy and their differences from one another, thereby removing duplicate, similar, and poorly performing isolation trees and improving the accuracy and stability of outlier detection. The experimental environment is built on an Ubuntu system and the Spark platform, and the outlier datasets provided by ODDS are used for testing. The performance of the proposed method is evaluated using accuracy, recall rate, ROC curves, AUC, and execution time. Experimental results show that the proposed method not only improves the accuracy and stability of outlier detection but also reduces the number of isolation trees by 20%-40% compared with the original iForest method.
The distance-based outlier detection method detects implied outliers by calculating the distances between points in the dataset, but its computational complexity is particularly high when processing multidimensional datasets. In addition, traditional outlier detection methods do not consider how frequently subsets occur, so the detected outliers do not fit the definition of outliers (i.e., rarely appearing). Pattern mining-based outlier detection approaches have solved this problem, but they do not take the importance of each pattern into account, so the detected outliers may not reflect the actual situation. To address these problems, a two-phase minimal weighted rare pattern mining-based outlier detection approach, called MWRPM-Outlier, is proposed to effectively detect outliers on weighted data streams. In particular, a method called MWRPM is proposed in the pattern mining phase to quickly mine the minimal weighted rare patterns, and two deviation factors are then defined in the outlier detection phase to measure the degree of abnormality of each transaction on the weighted data stream. Experimental results show that the proposed MWRPM-Outlier approach performs very well in outlier detection and that the MWRPM approach outperforms alternatives in weighted rare pattern mining.
In this paper, we propose a Packet Cache-Forward (PCF) method based on improved Bayesian outlier detection to eliminate the out-of-order packets caused by drastic degradation of the transmission path during handover events in moving satellite networks, thereby improving TCP performance. The proposed method uses an access node satellite to cache all packets received within a short time when a handover occurs and to forward them in order. To calculate the cache time accurately, this paper establishes a Bayesian mixture model for detecting delay outliers over the entire handover scheme. To handle misjudged outliers, an updated classification threshold and a sliding window are used to correct the category collections and model parameters, so that the exact compensation delay can be identified quickly under varying network loads. Simulation shows that, compared with the average processing delay detection method, the average accuracy rate increases by about 4.0% and the error rate decreases by about 5.5%. The method also behaves well when tested on a large dataset. Compared with conventional independent handover and network-controlled synchronized handover in simulated LEO satellite networks, the proposed independent handover with PCF eliminates the packet out-of-order issue and yields a larger improvement in the congestion window. As a result, the average delay decreases by more than 70% and TCP performance improves by more than 300%.
Uncertain data are common due to the increasing use of sensors, radio frequency identification (RFID), GPS, and similar devices for data collection. The causes of uncertainty include limitations of measurements, inclusion of noise, inconsistent supply voltage, and delay or loss of data in transfer. To manage, query, or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by the probability density function of a Gaussian distribution. The naive approach to distance-based outlier detection uses a nested loop, which is very costly because the distance function between two uncertain objects is expensive. Therefore, a populated-cells list (PC-list) approach to outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for the top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase efficiency. An extensive empirical study on synthetic and real datasets is also presented to demonstrate the accuracy, efficiency, and scalability of the proposed algorithms.
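The naive nested-loop baseline mentioned above can be sketched for ordinary (certain) points. The scoring rule used here, the distance to the k-th nearest neighbor, is one common definition of a distance-based outlier score and is an assumption; the paper works with Gaussian uncertain objects and a more expensive distance function, which this sketch replaces with plain Euclidean distance. Function names and data are hypothetical. In this sketch `k` is the neighbor rank and `n` is the number of outliers returned (the "top-k" of the abstract).

```python
import math

def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (inner loop over all points)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def top_n_outliers(points, k, n):
    """Nested-loop scoring: rank every point by its k-th nearest-neighbor distance."""
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    scores.sort(reverse=True)  # largest k-NN distance first
    return [i for _, i in scores[:n]]

# Hypothetical 2-D data: a tight cluster plus two isolated points (indices 5 and 6).
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.1), (1.1, 1.0),
          (5.0, 5.0), (-3.0, 4.0)]
top = top_n_outliers(points, k=2, n=2)
```

The quadratic number of distance evaluations in this baseline is the cost that a cell-based pruning structure such as the PC-list is meant to avoid.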
In this study, we propose a low-cost system that can detect outlier space utilization by residents in an indoor environment. We focus on users' app usage to analyze unusual behavior, especially in indoor spaces, reflecting the recent increase in smartphone use in personal spaces. Our system facilitates autonomous data collection from mobile app logs and Google app servers and generates a high-dimensional dataset from which outlier behaviors can be detected. The density-based spatial clustering of applications with noise (DBSCAN) algorithm was applied for effective analysis of singular movements. To analyze high-level mobile phone usage, the t-distributed stochastic neighbor embedding (t-SNE) algorithm was employed. These two algorithms can effectively detect outlier behaviors in terms of movement and app usage in indoor spaces. The experimental results showed that our system enables effective spatial behavioral analysis at a low cost when applied to logs collected in actual living spaces. Moreover, the large volumes of data required for outlier detection can be acquired easily. The system can automatically detect the unusual behavior of a user in an indoor space. In particular, this study aims to incorporate the recent trend of increasing smartphone use in indoor spaces into behavioral analysis.
Various uncertainties arising during the acquisition of geoscience data may result in anomalous data instances (i.e., outliers) that do not conform to the expected pattern of regular data instances. With sparse multivariate data obtained from geotechnical site investigation, it is impossible to identify outliers with certainty, because outliers distort the statistics of geotechnical parameters and data sparsity introduces statistical uncertainty. This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation. The proposed approach quantifies the outlying probability of each data instance based on the Mahalanobis distance and identifies as outliers those data instances with outlying probabilities greater than 0.5. It tackles the distortion of statistics estimated from a dataset containing outliers by a re-sampling technique and accounts, rationally, for the statistical uncertainty by Bayesian machine learning. Moreover, the proposed approach also suggests an exclusive method to determine the outlying components of each outlier. The proposed approach is illustrated and verified using simulated and real-life datasets. The results show that the proposed approach properly identifies outliers among sparse multivariate data, and their corresponding outlying components, in a probabilistic manner. It can significantly reduce the masking effect (i.e., missing some actual outliers due to the distortion of statistics by the outliers and statistical uncertainty). It is also found that outliers among sparse multivariate data instances significantly affect the construction of the multivariate distribution of geotechnical parameters for uncertainty quantification. This emphasizes the necessity of a data cleaning process (e.g., outlier detection) for uncertainty quantification based on geoscience data.
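As a point of reference for the Mahalanobis-distance basis of the method above, here is a minimal two-dimensional sketch. It is not the paper's probabilistic procedure: there is no re-sampling and no Bayesian treatment of statistical uncertainty, and the cutoff is a plain chi-square quantile rather than an outlying probability of 0.5. The function names, the `alpha` parameter, and the toy data are assumptions.

```python
import math

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean,
    using the sample covariance of the same points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy  # determinant of the 2x2 covariance matrix
    result = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Closed-form d^T S^{-1} d for a 2x2 covariance matrix S.
        result.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

def flag_outliers_2d(points, alpha=0.025):
    """Flag points whose squared distance exceeds the upper-alpha chi-square quantile.
    With 2 degrees of freedom that quantile has the closed form -2 * ln(alpha)."""
    cutoff = -2.0 * math.log(alpha)
    return [i for i, d2 in enumerate(mahalanobis_sq_2d(points)) if d2 > cutoff]

# Hypothetical data: a 3x3 grid of regular points plus one distant point (index 9).
points = [(x, y) for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0, 2.0)] + [(10.0, -10.0)]
flagged = flag_outliers_2d(points)
```

Because the mean and covariance are estimated from data that include the outlier, the outlier inflates the covariance and shrinks its own distance; this is the distortion (masking) that the abstract says the re-sampling step is designed to counter.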
Outlier detection has very important applied value in the data mining literature. Outlier detection algorithms based on distinct theories have different definitions and mining processes. By analyzing the existing outlier detection algorithms in terms of criterion and theory, a three-dimensional space graph for constructing applied algorithms and an improved GridOf algorithm are proposed. Key words: outlier detection; three-dimensional space graph; data mining. CLC number: TP 311.13; TP 391. Foundation item: Supported by the National Natural Science Foundation of China (70371015). Biography: ZHANG Jing (1975-), female, Ph.D., lecturer, research direction: data mining and knowledge discovery.
With the development of science and technology, the status of the water environment has received more and more attention. In this paper, we propose a deep learning model, named the Joint Auto-Encoder network, to solve the problem of outlier detection in water supply data. The Joint Auto-Encoder network first expands the size of the training data and extracts useful features from the input data, and then reconstructs the input data into an output. Outliers are detected based on the network's reconstruction errors, with a larger reconstruction error indicating a higher likelihood of being an outlier. For water supply data, there are mainly two types of outliers: outliers with large values and outliers with values close to zero. We set two separate thresholds on the reconstruction errors to detect the two types of outliers, respectively. Data samples with reconstruction errors exceeding the thresholds are voted to be outliers. The two thresholds can be calculated from the classification confusion matrix and the receiver operating characteristic (ROC) curve. We have also compared the Joint Auto-Encoder with the vanilla Auto-Encoder on both a synthetic data set and the MNIST data set. Our model outperforms the vanilla Auto-Encoder and some other outlier detection approaches, with a recall rate of 98.94 percent on water supply data.
On the basis of wavelet theory, we propose an outlier-detection algorithm for satellite gravity gradiometry by applying a wavelet-shrinkage de-noising method to simulated data with white noise and outliers. The results show that this novel algorithm has a 97% success rate in outlier identification and that it can be used efficiently for pre-processing real satellite gravity gradiometry data.
The heterogeneous nodes in the Internet of Things (IoT) are relatively weak in computing power and storage capacity. Therefore, traditional network security algorithms are not suitable for the IoT. When these nodes alternate between normal and anomalous behavior, it is difficult for the network system to identify and isolate them in a short time, so the accuracy of data transmission and the integrity of network functions are negatively affected. Based on the characteristics of the IoT, a lightweight local outlier factor detection method is used for node detection. To further determine whether a node is anomalous, this research considers how the behavior of such nodes varies over time, and a time series method is used so that the system responds effectively, within a short period, to the randomness and selectiveness of anomalous nodes. Simulation results show that the proposed method can improve the accuracy of the data transmitted by the network and achieve better performance.
The detection of outliers and change points in time series has become a research focus in time series data mining, since it can be used for fraud detection, rare event discovery, event/trend change detection, and similar tasks. In most previous work, outlier detection and change point detection have not been related explicitly, and change point detection has not considered the influence of outliers. In this work, a unified detection framework is presented to deal with both. The framework is based on ALARCON-AQUINO and BARRIA's change point detection method and adopts two-stage detection to separate outliers from change points. Its advantages are as follows: first, the unified structure for change detection and outlier detection reduces computational complexity and simplifies the detection procedure; second, performing outlier detection before change point detection removes the influence of outliers on change point detection and thus improves its accuracy. Simulation experiments with the proposed method on both model data and actual application data achieved 100% detection accuracy. Comparisons between a traditional detection method and the proposed method further demonstrate that the unified detection structure is more accurate when the time series are contaminated by outliers.
Data reconciliation technology can decrease the level of corruption of process data due to measurement noise, but the presence of outliers caused by process peaks or unmeasured disturbances will smear the reconciled results. Based on an analysis of the limitations of conventional outlier detection algorithms, a modified outlier detection method for dynamic data reconciliation (DDR) is proposed in this paper. In the modified method, the outliers of each variable are distinguished individually and the weight is modified accordingly. Therefore, the modified method can use more information from normal data and can efficiently decrease the effect of outliers. Simulation of a continuous stirred tank reactor (CSTR) process verifies the effectiveness of the proposed algorithm.
Node localization is commonly employed in wireless networks. For example, it is used to improve routing and enhance security. Localization algorithms can be classified as range-free or range-based. Range-based algorithms use location metrics such as ToA, TDoA, RSS, and AoA to estimate the distance between two nodes. Proximity sensing between nodes is typically the basis for range-free algorithms. A tradeoff exists, since range-based algorithms are more accurate but also more complex. However, in applications such as target tracking, localization accuracy is very important. In this paper, we propose a new range-based algorithm which is based on the density-based outlier detection algorithm (DBOD) from data mining. It requires selection of the K-nearest neighbours (KNN). DBOD assigns a density value to each point used in the location estimation. The mean of these densities is calculated, and those points having a density larger than the mean are kept as candidate points. Different performance measures are used to compare our approach with the linear least squares (LLS) and weighted linear least squares based on singular value decomposition (WLS-SVD) algorithms. It is shown that the proposed algorithm performs better than these algorithms even when the anchor geometry about an unlocalized node is poor.
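The candidate-selection step described above (assign a density to each point, keep the points whose density exceeds the mean density) can be sketched as follows. The abstract does not define the density, so the inverse of the mean distance to the k nearest neighbours is used here as an assumption, as is the final step of averaging the kept points into a single position estimate. Function names and data are hypothetical, and the points are assumed to be distinct.

```python
import math

def knn_density(points, i, k):
    """Density of points[i], taken here as 1 / (mean distance to its k nearest neighbours)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return 1.0 / (sum(dists[:k]) / k)

def dense_point_estimate(points, k=2):
    """Keep points denser than the mean density and average them into one estimate."""
    densities = [knn_density(points, i, k) for i in range(len(points))]
    mean_density = sum(densities) / len(densities)
    kept = [p for p, d in zip(points, densities) if d > mean_density]
    if not kept:  # all densities equal: nothing strictly above the mean, keep everything
        kept = list(points)
    x = sum(p[0] for p in kept) / len(kept)
    y = sum(p[1] for p in kept) / len(kept)
    return (x, y), kept

# Hypothetical candidate position estimates: five consistent ones and two stray ones.
candidates = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.1), (1.1, 1.0),
              (5.0, 5.0), (-3.0, 4.0)]
estimate, kept = dense_point_estimate(candidates, k=2)
```

In this toy case the two stray candidates have far lower density than the cluster, so only the five clustered candidates contribute to the estimate.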
We introduce a new wavelet-based procedure for detecting outliers in financial discrete time series. The procedure focuses on the analysis of residuals obtained from a model fit; it is applied to Generalized Autoregressive Conditional Heteroskedasticity (GARCH)-like models but is not limited to them. We apply the Maximal-Overlap Discrete Wavelet Transform (MODWT) to the residuals and compare their wavelet coefficients against quantile thresholds to detect outliers. Our methodology has several advantages over existing methods that use the standard Discrete Wavelet Transform (DWT): the sample size of the series does not need to be a power of 2, and the transform can use any wavelet filter and be run up to the desired level. Simulated wavelet quantiles from a Normal and a Student t-distribution are used as thresholds for the maximum of the absolute value of the wavelet coefficients. The performance of the procedure is illustrated by applying it to two real series: the closing price of the Saudi stock market and the S&P 500 index. The efficiency of the proposed method is demonstrated, and it can be considered an important addition to the existing methods.
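The thresholding of MODWT coefficients of model residuals can be sketched at the first level with the Haar filter, for which the MODWT wavelet coefficient reduces to half the difference between consecutive values (with circular wrap-around). The Haar filter, the single level, and the fixed threshold are assumptions made for brevity; the paper allows any wavelet filter and obtains its thresholds from simulated quantiles. The function names and residual values are hypothetical.

```python
def haar_modwt_level1(x):
    """Level-1 Haar MODWT wavelet coefficients: W[t] = (x[t] - x[t-1]) / 2.
    For t = 0, x[t-1] is x[-1], i.e. the last sample (circular boundary)."""
    return [(x[t] - x[t - 1]) / 2.0 for t in range(len(x))]

def flag_large_coefficients(residuals, threshold):
    """Indices whose level-1 wavelet coefficient exceeds the threshold in absolute value."""
    coeffs = haar_modwt_level1(residuals)
    return [t for t, c in enumerate(coeffs) if abs(c) > threshold]

# Hypothetical model residuals with a single spike at index 5.
residuals = [0.1, -0.2, 0.05, 0.0, -0.1, 6.0, 0.1, -0.05, 0.2, 0.0]
# The threshold here is a placeholder, not a simulated quantile.
flagged = flag_large_coefficients(residuals, threshold=1.0)
```

An isolated spike at time t produces two large coefficients of opposite sign, at t and t + 1, so a pair of adjacent flags points to a single outlier at the first index. Note also that the series length here is 10, not a power of 2, which the MODWT permits.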
Assessing a machine's performance by comparing it with the same or similar machines is important for implementing intelligent maintenance of swarm machines. In this paper, an outlier mining-based abnormal machine detection algorithm is proposed for this purpose. First, outlier mining based on clustering is introduced and the definition of the cluster-based global outlier factor (CBGOF) is presented. Then, the modified swarm intelligence clustering (MSIC) algorithm is suggested and an outlier mining algorithm based on MSIC is proposed. The algorithm can not only cluster machines according to their performance but also detect possibly abnormal machines. Finally, a comparison of the performance of mobile soccer robots shows that the algorithm is feasible and effective.
We introduce and develop a novel approach to outlier detection based on an adaptation of random subspace learning. Our proposed method handles both high-dimension low-sample-size and traditional low-dimension high-sample-size datasets. Essentially, we avoid the computational bottleneck of techniques like the Minimum Covariance Determinant (MCD) by computing the needed determinants and associated measures in much lower-dimensional subspaces. Both the theoretical and the computational development of our approach reveal that it is computationally more efficient than regularized methods in the high-dimension low-sample-size setting, and that it often competes favorably with existing methods in terms of the percentage of correctly detected outliers.
Time series analysis, applied by establishing appropriate mathematical models for bridge health monitoring data and forecasting future structural behavior, stands out as a novel and viable research direction for bridge state assessment. However, outliers inevitably exist in the monitoring data due to various interventions; they reduce the precision of model fitting and affect the forecasting results. Therefore, the identification of outliers is crucial for the accurate interpretation of the monitoring data. In this study, a time series model combined with outlier information for bridge health monitoring is established using intervention analysis theory, and the structural responses are forecast. We focus on three techniques: (1) modeling with the seasonal autoregressive integrated moving average (SARIMA) model; (2) the methodology for outlier identification and amendment, both when the occurrence time and type of the outliers are known and when they are unknown; and (3) forecasting with the model that includes outlier effects. The method was tested in a case study using monitoring data from a real bridge. The establishment of the original SARIMA model without considering outliers is discussed first, including stationarity, order determination, parameter estimation, and diagnostic checking of the model. Then the time-by-time iterative procedure for outlier detection, which is implemented using appropriate test statistics of the residuals, is performed. The SARIMA-outlier model is subsequently built. Finally, a comparative analysis of the forecasting performance of the original model and the SARIMA-outlier model is carried out. The results demonstrate that proper time series models are effective in mining the characteristic patterns of bridge monitoring data. When the influence of outliers is taken into account, the fitting precision of the model is significantly improved and the accuracy and reliability of the forecast are strengthened.
Funding (entry on outlier detection in air quality measurements): supported by the National Natural Science Foundation (Grant Nos. 91644216 and 41575128), the CAS Information Technology Program (Grant No. XXH13506-302), and the Guangdong Provincial Science and Technology Development Special Fund (No. 2017B020216007).
Funding: Supported by the Aeronautical Science Foundation of China (20111052010), the Jiangsu Graduates Innovation Project (CXZZ120163), the "333" Project of Jiangsu Province, and the Qing Lan Project of Jiangsu Province.
Abstract: With the development of the Global Positioning System (GPS), wireless technology, and location-aware services, it is possible to collect a large quantity of trajectory data. In the field of data mining for moving objects, anomaly detection is a hot topic. Building on previous work on anomalous trajectory detection for moving objects, this paper introduces the classical trajectory outlier detection (TRAOD) algorithm and then proposes a density-based trajectory outlier detection (DBTOD) algorithm, which compensates for a weakness of the TRAOD algorithm: it is unable to detect anomalous defects when the trajectory is local and dense. The results of applying the proposed algorithm to the Elk1993 and Deer1995 datasets are also presented and show the effectiveness of the algorithm.
Funding: Supported by State Grid Liaoning Electric Power Supply Co., Ltd. through the project "Key Technology and Application Research of the Self-Service Grid Big Data Governance" (No. SGLNXT00YJJS1800110).
Abstract: In the data age, data quality has become a widely discussed problem. Outlier detection, a field of data mining, is closely related to data quality. The isolation forest (iForest) algorithm is one of the more prominent outlier detection algorithms for numerical data in recent years. As the algorithm keeps generating isolation trees, the differences between the trees gradually shrink or even vanish, which wastes memory and reduces the efficiency of outlier detection. In addition, some of the constructed isolation trees cannot detect outliers. In this paper, an improved iForest-based method, GA-iForest, is proposed. The method optimizes the isolation forest by selecting better isolation trees according to their detection accuracy and their differences from one another, thereby removing duplicate, similar, and poorly performing trees and improving the accuracy and stability of outlier detection. The experimental environment is built on the Ubuntu system and the Spark platform, and the outlier datasets provided by ODDS are used for testing. The performance of the proposed method is evaluated by accuracy, recall rate, ROC curves, AUC, and execution time. Experimental results show that the proposed method not only improves the accuracy and stability of outlier detection but also reduces the number of isolation trees by 20%-40% compared with the original iForest method.
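For readers unfamiliar with the isolation idea that GA-iForest builds on, the following is a minimal sketch of random-partition path lengths on one-dimensional data. It is not GA-iForest and omits the tree selection step described in the abstract; it also skips subsampling and the usual score normalization. Rather than storing trees, it draws fresh random splits for each point, which gives the same path-length distribution. The names `path_length` and `average_path_lengths` are hypothetical.

```python
import random

def path_length(x, data, rng, depth=0, max_depth=10):
    """Number of random splits needed to isolate value x within 1-D data."""
    lo, hi = min(data), max(data)
    if len(data) <= 1 or lo == hi or depth >= max_depth:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as x.
    same_side = [v for v in data if (v < split) == (x < split)]
    return path_length(x, same_side, rng, depth + 1, max_depth)

def average_path_lengths(data, n_trees=200, seed=0):
    """Average isolation depth per point; shorter means more anomalous."""
    rng = random.Random(seed)
    return [
        sum(path_length(x, data, rng) for _ in range(n_trees)) / n_trees
        for x in data
    ]

values = [1.0, 1.1, 0.9, 1.05, 0.95, 1.2, 0.8, 10.0]
depths = average_path_lengths(values)
print(depths.index(min(depths)))  # index of the most easily isolated point
```

The point far from the rest is usually cut off by the very first split, so its average depth stays close to 1 while the clustered points need several splits.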
Funding: Supported by the Fundamental Research Funds for the Central Universities (No. 2018XD004).
Abstract: The distance-based outlier detection method detects implied outliers by calculating the distances between points in the dataset, but its computational complexity is particularly high when processing multidimensional datasets. In addition, the traditional outlier detection method does not consider how frequently subsets occur, so the detected outliers do not fit the definition of outliers (i.e., rarely appearing). Pattern mining-based outlier detection approaches have solved this problem, but they do not take the importance of each pattern into account, so the detected outliers may not reflect the actual situation. To address these problems, a two-phase minimal weighted rare pattern mining-based outlier detection approach, called MWRPM-Outlier, is proposed to effectively detect outliers on the weight data stream. In particular, a method called MWRPM is proposed in the pattern mining phase to quickly mine the minimal weighted rare patterns, and two deviation factors are then defined in the outlier detection phase to measure the abnormal degree of each transaction on the weight data stream. Experimental results show that the proposed MWRPM-Outlier approach has excellent performance in outlier detection and that the MWRPM approach performs well in weighted rare pattern mining.
Funding: Supported by the National High Technology Research and Development Program of China (863 Program, No. 2014AA7011005) and the National Natural Science Foundation of China (No. 91438120).
Abstract: In this paper, we propose a Packet Cache-Forward (PCF) method based on improved Bayesian outlier detection to eliminate the out-of-order packets caused by drastic degradation of the transmission path during handover events in moving satellite networks, thereby improving TCP performance. The proposed method uses an access node satellite to cache all packets received within a short time when a handover occurs and to forward them in order. To calculate the cache time accurately, this paper establishes a Bayesian mixture model for detecting delay outliers over the entire handover scheme. To limit the misjudgment of outliers, an updated classification threshold and a sliding window are suggested to correct the category collections and model parameters, so that the exact compensation delay can be identified quickly under varied network load. Simulation shows that, compared with the average-processing-delay detection method, the average accuracy rate increases by about 4.0% while the error rate falls by about 5.5%. The method also behaves well when tested with a large dataset. Compared with conventional independent handover and network-controlled synchronized handover in simulated LEO satellite networks, the proposed independent handover with PCF eliminates the packet out-of-order issue and yields a better congestion window. As a result, the average delay decreases by more than 70% and TCP performance improves by more than 300%.
Funding: Supported by a Grant-in-Aid for Scientific Research (A) (#24240015A).
Abstract: Uncertain data are common due to the increasing use of sensors, radio frequency identification (RFID), GPS, and similar devices for data collection. The causes of uncertainty include limitations of measurements, inclusion of noise, inconsistent supply voltage, and delay or loss of data in transfer. In order to manage, query, or mine such data, data uncertainty needs to be considered. Hence, this paper studies the problem of top-k distance-based outlier detection from uncertain data objects. In this work, an uncertain object is modelled by a probability density function of a Gaussian distribution. The naive approach to distance-based outlier detection uses a nested loop, which is very costly because the distance function between two uncertain objects is expensive. Therefore, a populated-cells list (PC-list) approach to outlier detection is proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase the efficiency of the top-k outlier detection algorithm. An extensive empirical study on synthetic and real datasets is also presented to demonstrate the accuracy, efficiency, and scalability of the proposed algorithms.
Abstract: In this study, we propose a low-cost system that can detect outlier space utilization by residents in an indoor environment. We focus on users' app usage to analyze unusual behavior, especially in indoor spaces, reflecting the recent increase in smartphone use in personal spaces. Our system facilitates autonomous data collection from mobile app logs and Google app servers and generates a high-dimensional dataset from which outlier behaviors can be detected. The density-based spatial clustering of applications with noise (DBSCAN) algorithm was applied for effective singular movement analysis. To analyze high-level mobile phone usage, the t-distributed stochastic neighbor embedding (t-SNE) algorithm was employed. These two algorithms can effectively detect outlier behaviors in terms of movement and app usage in indoor spaces. The experimental results showed that our system enables effective spatial behavioral analysis at a low cost when applied to logs collected in actual living spaces. Moreover, the large volumes of data required for outlier detection can be easily acquired. The system can automatically detect the unusual behavior of a user in an indoor space.
Funding: Supported by the National Key R&D Program of China (Project No. 2016YFC0800200), the NRF-NSFC 3rd Joint Research Grant (Earth Science) (Project No. 41861144022), the National Natural Science Foundation of China (Project Nos. 51679174 and 51779189), and the Shenzhen Key Technology R&D Program (Project No. 20170324). The financial support is gratefully acknowledged.
Abstract: Various uncertainties arising during the acquisition of geoscience data may result in anomalous data instances (i.e., outliers) that do not conform to the expected pattern of regular data instances. With sparse multivariate data obtained from geotechnical site investigation, it is impossible to identify outliers with certainty, because outliers distort the statistics of geotechnical parameters and data sparsity adds statistical uncertainty. This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation. The proposed approach quantifies the outlying probability of each data instance based on the Mahalanobis distance and identifies as outliers those data instances with outlying probabilities greater than 0.5. It tackles the distortion of statistics estimated from a dataset containing outliers with a re-sampling technique and accounts, rationally, for the statistical uncertainty through Bayesian machine learning. Moreover, the proposed approach also suggests an exclusive method to determine the outlying components of each outlier. The proposed approach is illustrated and verified using simulated and real-life datasets. The results show that the proposed approach properly identifies outliers among sparse multivariate data, together with their corresponding outlying components, in a probabilistic manner. It can significantly reduce the masking effect (i.e., missing some actual outliers due to the distortion of statistics by the outliers and statistical uncertainty). It was also found that outliers among sparse multivariate data instances significantly affect the construction of the multivariate distribution of geotechnical parameters for uncertainty quantification. This emphasizes the necessity of a data cleaning process (e.g., outlier detection) for uncertainty quantification based on geoscience data.
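The Mahalanobis distance that this abstract builds on can be sketched for two-dimensional data as follows. This is only the plain distance computed from the sample mean and covariance; the re-sampling step, the Bayesian treatment, and the 0.5 outlying-probability rule from the abstract are not reproduced. The function name `mahalanobis_2d` and the sample data are hypothetical.

```python
import math

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Unbiased sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Closed-form inverse of the 2x2 covariance matrix.
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        distances.append(math.sqrt(dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy))
    return distances

samples = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.9), (1.1, 2.2),
           (1.0, 1.8), (0.8, 2.0), (5.0, -1.0)]
d = mahalanobis_2d(samples)
print(max(range(len(d)), key=d.__getitem__))  # index of the most distant sample
```

Because the mean and covariance here are estimated from data that already contains the outlier, the distances are compressed. This is the distortion and masking effect the abstract refers to, and it is why the authors add re-sampling.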
Abstract: Outlier detection has very important applied value in the data mining literature. Different outlier detection algorithms based on distinct theories have different definitions and mining processes. By analyzing the existing outlier detection algorithms in terms of criterion and theory, a three-dimensional space graph for constructing applied algorithms and an improved GridOf algorithm are proposed.
Key words: outlier; detection; three-dimensional space graph; data mining
CLC number: TP 311.13; TP 391
Funding: Supported by the National Natural Science Foundation of China (70371015).
Biography: ZHANG Jing (1975-), female, Ph.D., lecturer; research direction: data mining and knowledge discovery.
Funding: The work described in this paper was supported by the National Natural Science Foundation of China (NSFC) under Grant No. U1501253 and Grant No. U1713217.
Abstract: With the development of science and technology, the status of the water environment has received more and more attention. In this paper, we propose a deep learning model, named the Joint Auto-Encoder network, to solve the problem of outlier detection in water supply data. The Joint Auto-Encoder network first expands the size of the training data and extracts useful features from the input data, and then reconstructs the input data into an output. Outliers are detected based on the network's reconstruction errors, with a larger reconstruction error indicating a higher likelihood of being an outlier. For water supply data, there are mainly two types of outliers: outliers with large values and those with values close to zero. We set two separate thresholds on the reconstruction errors to detect the two types of outliers respectively. Data samples whose reconstruction errors exceed the thresholds are voted to be outliers. The two thresholds can be calculated from the classification confusion matrix and the receiver operating characteristic (ROC) curve. We have also compared the Joint Auto-Encoder with the vanilla Auto-Encoder on both a synthetic data set and the MNIST data set. Our model outperforms the vanilla Auto-Encoder and some other outlier detection approaches, with a recall rate of 98.94 percent on water supply data.
Funding: Supported by the Director Foundation of the Institute of Seismology, China Earthquake Administration (IS201126025), and the Basis Research Foundation of the Key Laboratory of Geospace Environment & Geodesy, Ministry of Education, China (10-01-09).
Abstract: On the basis of wavelet theory, we propose an outlier-detection algorithm for satellite gravity gradiometry by applying a wavelet-shrinkage de-noising method to simulation data with white noise and outliers. The result shows that this novel algorithm has a 97% success rate in outlier identification and that it can be efficiently used for pre-processing real satellite gravity gradiometry data.
Funding: This work is partially supported by the Ministry of Education of China (www.moe.gov.cn) under grant Nos. 201802123091 (received by F.W.) and 201802123068 (received by Z.W.), the Scientific Project of CAFUC (www.cafuc.edu.cn) under grant Nos. F2017KF02 and J2018-3 (both received by Z.W.), and the Teaching Reform Project of CAFUC (www.cafuc.edu.cn) under grant No. E2020044 (received by Z.W.).
Abstract: The heterogeneous nodes in the Internet of Things (IoT) are relatively weak in computing power and storage capacity. Therefore, traditional network security algorithms are not suitable for the IoT. When these nodes alternate between normal and anomalous behavior, it is difficult for the network system to identify and isolate them in a short time, so the accuracy of data transmission and the integrity of the network function are negatively affected. Based on the characteristics of the IoT, a lightweight local outlier factor detection method is used for node detection. To further determine whether a node is anomalous, this research considers how the behavior of those nodes varies over time and uses a time series method so that the system can respond effectively, within a short period of time, to the randomness and selectiveness of anomalous nodes. Simulation results show that the proposed method can improve the accuracy of the data transmitted by the network and achieve better performance.
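As background for the local outlier factor (LOF) mentioned in this abstract, the following is a minimal sketch of the standard LOF score. It is not the lightweight variant or the time series step the paper describes. As a simplification it takes exactly k neighbours per point and ignores ties at the k-distance, and it assumes no duplicate points. The function name `local_outlier_factor` is hypothetical.

```python
import math

def local_outlier_factor(points, k=2):
    """Standard LOF score per point; values well above 1 suggest an outlier."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbours, k_distance = [], []
    for i in range(n):
        # Indices of the other points, nearest first.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]

nodes = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = local_outlier_factor(nodes, k=2)
print(max(range(len(scores)), key=scores.__getitem__))  # index of the top score
```

Points inside the cluster score close to 1 because their density matches that of their neighbours, while the isolated point scores far higher.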
Funding: Project (2011AA040603) supported by the National High Technology Research & Development Program of China; Project (201202226) supported by the Natural Science Foundation of Liaoning Province, China.
Abstract: The detection of outliers and change points in time series has become a research focus in time series data mining because it can be used for fraud detection, rare event discovery, event/trend change detection, etc. In most previous work, outlier detection and change point detection have not been related explicitly, and change point detection did not consider the influence of outliers. In this work, a unified detection framework is presented to deal with both. The framework is based on ALARCON-AQUINO and BARRIA's change point detection method and adopts a two-stage detection to separate outliers from change points. Its advantages are twofold. First, the unified structure for change detection and outlier detection reduces the computational complexity and simplifies the detection procedure. Second, detecting outliers before change points avoids the influence of outliers on change point detection and thus improves its accuracy. Simulation experiments on both model data and actual application data achieved 100% detection accuracy. Comparisons between the traditional detection method and the proposed method further demonstrate that the unified detection structure is more accurate when the time series are contaminated by outliers.
Funding: Supported by the National Outstanding Youth Science Foundation of China (No. 60025308) and the Key Technologies R&D Program in the 10th Five-year Plan (No. 2001BA204B07).
Abstract: Data reconciliation technology can decrease the level of corruption of process data due to measurement noise, but the presence of outliers caused by process peaks or unmeasured disturbances will smear the reconciled results. Based on an analysis of the limitations of conventional outlier detection algorithms, a modified outlier detection method for dynamic data reconciliation (DDR) is proposed in this paper. In the modified method, the outliers of each variable are distinguished individually and the weight is modified accordingly. Therefore, the modified method can use more information from normal data and can efficiently decrease the effect of outliers. Simulation of a continuous stirred tank reactor (CSTR) process verifies the effectiveness of the proposed algorithm.
Abstract: Node localization is commonly employed in wireless networks. For example, it is used to improve routing and enhance security. Localization algorithms can be classified as range-free or range-based. Range-based algorithms use location metrics such as ToA, TDoA, RSS, and AoA to estimate the distance between two nodes. Proximity sensing between nodes is typically the basis for range-free algorithms. A tradeoff exists since range-based algorithms are more accurate but also more complex. However, in applications such as target tracking, localization accuracy is very important. In this paper, we propose a new range-based algorithm which is based on the density-based outlier detection algorithm (DBOD) from data mining. It requires selection of the K-nearest neighbours (KNN). DBOD assigns a density value to each point used in the location estimation. The mean of these densities is calculated, and those points having a density larger than the mean are kept as candidate points. Different performance measures are used to compare our approach with the linear least squares (LLS) and weighted linear least squares based on singular value decomposition (WLS-SVD) algorithms. It is shown that the proposed algorithm performs better than these algorithms even when the anchor geometry about an unlocalized node is poor.
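The density filtering step described in this abstract (assign a density to each point, then keep the points whose density exceeds the mean) can be sketched as below. The abstract does not say how the density is computed, so this sketch assumes it is the inverse of the mean distance to the K nearest neighbours. That definition, the function name `keep_dense_points`, and the sample coordinates are all hypothetical, and the sketch assumes no duplicate points.

```python
import math

def keep_dense_points(points, k=2):
    """Keep the points whose K-NN density exceeds the mean density."""
    densities = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Assumed density: inverse of the mean distance to the k nearest points.
        densities.append(k / sum(dists[:k]))
    mean_density = sum(densities) / len(densities)
    return [p for p, d in zip(points, densities) if d > mean_density]

estimates = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
print(keep_dense_points(estimates))  # the stray estimate at (5, 5) is dropped
```

Comparing against the mean density gives a threshold that adapts to the data, but a strict comparison also discards every point when all densities are equal.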
Abstract: We introduce a new wavelet-based procedure for detecting outliers in financial discrete time series. The procedure focuses on the analysis of residuals obtained from a model fit and is applied to Generalized Autoregressive Conditional Heteroskedasticity (GARCH)-like models, but is not limited to them. We apply the Maximal-Overlap Discrete Wavelet Transform (MODWT) to the residuals and compare their wavelet coefficients against quantile thresholds to detect outliers. Our methodology has several advantages over existing methods that make use of the standard Discrete Wavelet Transform (DWT). The series sample size does not need to be a power of 2, and the transform can use any wavelet filter and be run up to the desired level. Simulated wavelet quantiles from Normal and Student t-distributions are used as thresholds for the maximum of the absolute value of the wavelet coefficients. The performance of the procedure is illustrated by applying it to two real series: the closing price of the Saudi stock market and the S&P 500 index. The efficiency of the proposed method is demonstrated, and it can be considered a distinct and important addition to the existing methods.
Funding: Supported by the National Natural Science Foundation of China (No. 50705054).
Abstract: Assessing a machine's performance by comparing it with the same or similar machines is important for implementing intelligent maintenance of swarm machines. In this paper, an outlier mining-based abnormal machine detection algorithm is proposed for this purpose. First, outlier mining based on clustering is introduced and the definition of the cluster-based global outlier factor (CBGOF) is presented. Then the modified swarm intelligence clustering (MSIC) algorithm is suggested, and the outlier mining algorithm based on MSIC is proposed. The algorithm can not only cluster machines according to their performance but also detect possible abnormal machines. Finally, a comparison of the performance of mobile soccer robots shows that the algorithm is feasible and effective.
Abstract: We introduce and develop a novel approach to outlier detection based on an adaptation of random subspace learning. Our proposed method handles both high-dimension low-sample size and traditional low-dimensional high-sample size datasets. Essentially, we avoid the computational bottleneck of techniques like Minimum Covariance Determinant (MCD) by computing the needed determinants and associated measures in much lower dimensional subspaces. Both the theoretical and computational development of our approach reveal that it is computationally more efficient than the regularized methods in the high-dimensional low-sample size setting, and that it often competes favorably with existing methods as far as the percentage of correct outlier detection is concerned.
Funding: Funded by the Natural Science Foundation of Fujian Province (Grant No. 2020J05207), the Fujian University Engineering Research Center for Disaster Prevention and Mitigation of Engineering Structures along the Southeast Coast (Grant No. JDGC03), the Major Scientific Research Platform Project of Putian City (Grant No. 2021ZP03), and the Talent Introduction Project of Putian University (Grant No. 2018074).
Abstract: Time series analysis, applied by establishing appropriate mathematical models for bridge health monitoring data and forecasting future structural behavior, stands out as a novel and viable research direction for bridge state assessment. However, outliers inevitably exist in the monitoring data due to various interventions, which reduce the precision of model fitting and affect the forecasting results. Therefore, the identification of outliers is crucial for the accurate interpretation of the monitoring data. In this study, a time series model combined with outlier information for bridge health monitoring is established using intervention analysis theory, and the structural responses are forecast. We focus on three techniques: (1) the modeling of the seasonal autoregressive integrated moving average (SARIMA) model; (2) the methodology for outlier identification and amendment both when the occurrence time and type of outliers are known and when they are unknown; and (3) forecasting with the model including outlier effects. The method was tested in a case study using monitoring data from a real bridge. The establishment of the original SARIMA model without considering outliers is discussed first, including the stationarity, order determination, parameter estimation, and diagnostic checking of the model. Then the time-by-time iterative procedure for outlier detection, which is implemented with appropriate test statistics of the residuals, is performed. The SARIMA-outlier model is subsequently built. Finally, a comparative analysis of the forecasting performance of the original model and the SARIMA-outlier model is carried out. The results demonstrate that proper time series models are effective in mining the characteristic law of bridge monitoring data. When the influence of outliers is taken into account, the fitted precision of the model is significantly improved, and the accuracy and reliability of the forecast are strengthened.