期刊文献+

基于剪枝的海量数据离群点挖掘 被引量:7

Pruning-based Outlier Mining from Large Dataset
在线阅读 下载PDF
导出
摘要 基于距离的离群点挖掘通常需要O(N2)的时间进行大量的距离计算与比较,这限制了其在海量数据上的应用。针对此问题,提出了一个带剪枝功能的离群点挖掘算法。算法分为两步:在对数据集进行一遍扫描后,剪枝掉大量的非离群点;然后对余下的可疑数据实施一种改进的嵌套循环算法,以每个数据点与其k个最近邻点的平均距离作为离群度,确定前n个离群点。在真实数据和合成数据集上的实验结果均表明,该算法在获得高命中率的同时仍保持低误警率。与相关算法相比,其具有较低的时间复杂性。 Distance-based outlier detection approach typically requires O(N2) time of distance computation and compari-son.This quadratic scaling restricts the ability to apply this approach to large datasets.To overcome this limitation,a novel distance-based outlier mining approach with pruning rules was proposed.The approach consists of two phases.During the first phase,the original input data are scanned and the majority of non-outliers are pruned.During second phase,an improved nested loops approach is applied to compute the average K-nearest distance which measures the degree of being an outlier and finally reports the top-n outliers.Experiments on both synthetic data and real-life data show that the proposed approach achieves a high hit rate with a low false alarm rate.Compared with related approaches,the proposed approach has a lower time complexity.
出处 《计算机科学》 CSCD 北大核心 2012年第10期152-156,共5页 Computer Science
关键词 离群点 数据挖掘 基于距离 Outlier Data mining Distance-based
  • 相关文献

参考文献15

  • 1Barnett V, Lewis T. Outliers in Statistical Data (3rd Ed)[M]. New York(New York City):John Wiley & Sons, 1994:1-2.
  • 2Knott E M, Ng R T. Algorithms for mining distance-based out- liers in large datasets [C]///Proc of the 24th int ' 1 conf on VLDB. New York (New York City):ACM, 1998:392-403.
  • 3Knott E M, Ng R T, Tucakov V. Distance-based outliers: algo- rithms and applications [C] //Proc of 26th Int' 1 conf on VLDB.Egypt (Cairo) : ACM, 2000 : 237-253.
  • 4Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mi- ning outliers from large data sets [C]//Proc of ACM SIGMOD Int ' 1 conf on Management of Data. Texas (Dallas): ACM, 2000 : 427 438.
  • 5Angiulli F, Pizzuti C. Fast outlier detection in high-dimensional spaces [C] // Proc of Principles of Data Mining and Knowledge Discovery. 6th European Conf. Finland (Helsinki) :ACM, 2002: 15-26.
  • 6Breunig M M, Kriegel H P, Ng R T. LOF: Identifying density- based local outliers [C]//Proc of ACM SIGMOD Int'l conf on Management of Data. Texas (Dallas) : ACM, 2000 : 93-104.
  • 7Bentley J L. Multidimensional binary search trees used for asso- ciative searching [J]. Communications of the ACM, 1975, 18 (9) :509-517.
  • 8Guttmann R. A dynamic index structure for spatial searching [C]//Proc of ACM SIGMOD int'l conf on Management of Da- ta. New York (New York City):ACM, 1984:47-57.
  • 9Berchtold S, Keim D, Kreigel H P. The X-tree: an index struc-ture for high-dimensional data [C]//Proc of the 22nd i nt' 1 conf on VLDB. Mumbai (Bombay):Morgan Kaufmann, 1996:28-39.
  • 10Knorr E M, Ng R T. Finding intentional knowledge of distance- based ontliers [C]//Proc of 25th Int'l eonf on VLDB. Scotland (Edinburgh) : Morgan Kaufmann, 1999 : 211-222.

二级参考文献5

共引文献9

同被引文献75

  • 1张继国,谢平,龚艳冰,刘高峰.降雨信息空间插值研究评述与展望[J].水资源与水工程学报,2012,23(1):6-9. 被引量:17
  • 2闫伟,张浩,陆剑峰.一种离群数据挖掘新方法的研究与应用[J].控制与决策,2006,21(5):563-566. 被引量:5
  • 3肖辉,龚薇.基于可达邻域的异常检测算法[J].计算机工程,2007,33(17):74-76. 被引量:4
  • 4KNORR E, NG R T. Algorithms for mining distance-based oudiers in large datasets [ C]// Proceedings of the 24th Very Large Data Base Conference. New York: VLDB Press, 1998:392 -403.
  • 5RAMASWAMY S, RASTOGI R, SHIM K. Efficient algorithms for mining outliers from large data sets [ C]/! Proceedings of the ACM SIGMOD Conference on Management of Data. New York: ACM Press, 2000:427-438.
  • 6BAY D S, SCHWABACHER M. Mining distance-based outliers in near linear time with randomization and a simple pruning rule [ C]/! Proceedings of the Ninth ACM SIGKDD on Knowledge Discovery and Data Mining. New York: ACM Press, 2003:29 -38.
  • 7BHADURI K, MATI'HEWS B, GIANNELLA C R. Algorithms for speeding up distance-based outlier detection [ C]/! Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Dis- covery and Data Mining. New York: ACM Press, 2011:859 -867.
  • 8LOZANO E, ACUFIA E. Parallel algorithms for distance-based and density-based outliers [ C] // Proceedings of the 2005 IEEE Interna- tional Conference on Data Mining. Washington, DC: IEEE Comput- er Society, 2005:729 -732.
  • 9VU N H, GOPALKRISHNAN V. Efficient pruning schemes for dis- tance-based outlier detection [ C]//Proceedings of the 2009 Europe- an Conference on Machine Learning and Knowledge Discovery in Da- tabases. Berlin: Spring-Verlag, 2009:160-175.
  • 10OTEY M. E, GHOTING A, PARTHASARATHY S. Fast distribu- ted outlier detection in mixed-attribute data sets [ J]. Data Mining and Knowledge Discovery, 2006, 12(2/3): 203 -228.

引证文献7

二级引证文献38

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部