Discovering Redundancy-Aware Top-k Anomalies in High Dimensional Data (面向高维数据的低冗余top-k异常点发现方法). Cited by: 2


Abstract: Discovering anomalies is an important data mining task that has been studied in many applications. This paper addresses two problems: how to measure the degree of exception of high dimensional objects, and how to handle redundancy in the set of anomalies. It proposes an approach to discovering anomalies in high dimensional data. The dataset is represented as a bipartite graph, and the compression capability of each object is used to measure that object's degree of exception. With this measure, datasets containing different types of attributes, such as binary, categorical, and numeric attributes, are well supported. To solve the problem of redundancy in the set of top-k anomalies, the concept of redundancy-aware top-k anomalies is proposed. Because mining the exact set of redundancy-aware top-k anomalies is NP-hard, an algorithm based on greedy heuristics, named k-AnomaliesHD, is designed to discover an approximate set efficiently. Experiments on both real and synthetic datasets show that the algorithm scales linearly with the dimensionality of the dataset and quadratically with the size of the dataset. Further, compared with a redundancy-unaware method, the set of redundancy-aware top-k anomalies covers the abnormal patterns in the data much more effectively.
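The general idea of greedy, redundancy-aware top-k selection can be sketched as follows. This is only an illustration and not the k-AnomaliesHD algorithm itself, whose details this record does not give. Everything in it is an assumption: the precomputed anomaly scores, the use of Jaccard similarity over attribute-value sets as the redundancy measure, the `penalty` weight, and the names `jaccard` and `greedy_top_k`.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two attribute-value sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_top_k(
    scores: Dict[str, float],
    items: Dict[str, FrozenSet[str]],
    k: int,
    penalty: float = 1.0,
) -> List[str]:
    """Greedily pick up to k objects, trading anomaly score against redundancy.

    Hypothetical objective: score minus `penalty` times the highest similarity
    to any object already selected.
    """
    selected: List[str] = []
    remaining = set(scores)
    while remaining and len(selected) < k:

        def gain(obj: str) -> float:
            redundancy = max(
                (jaccard(items[obj], items[s]) for s in selected), default=0.0
            )
            return scores[obj] - penalty * redundancy

        # Highest gain first; ties are broken by name so the result is deterministic.
        best = min(remaining, key=lambda o: (-gain(o), o))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a" and "b" describe the same pattern, "c" a different one.
scores = {"a": 0.9, "b": 0.85, "c": 0.5}
items = {
    "a": frozenset({"x", "y"}),
    "b": frozenset({"x", "y"}),
    "c": frozenset({"z"}),
}
print(greedy_top_k(scores, items, 2))               # ['a', 'c']
print(greedy_top_k(scores, items, 2, penalty=0.0))  # ['a', 'b']
```

With the penalty switched off, the selection degenerates to a plain top-k by score and returns two objects that describe the same pattern. With the penalty on, the second slot goes to a lower-scoring object that covers a different pattern.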

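Authors: Chen Guanhua (陈冠华), Ma Xiuli (马秀莉), Yang Dongqing (杨冬青), Tang Shiwei (唐世渭), Shuai Meng (帅猛), Xie Kunqing (谢昆青)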

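Affiliations: School of Information Science and Technology, Peking University; Key Laboratory of Machine Perception, Ministry of Education (Peking University); Key Laboratory of High Confidence Software Technologies, Ministry of Education (Peking University)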

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2010, No. 5, pp. 788-795 (8 pages)

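Funding: National High Technology Research and Development Program of China ("863" Program) (2007AA120502); National Natural Science Foundation of China (60874082)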

Keywords: data mining; anomaly detection; high dimensional data; redundancy-aware; exception measure

Classification code: TP311.13 [Automation and Computer Technology: Computer Software and Theory]


