期刊文献+

面向高维数据的低冗余top-k异常点发现方法 被引量:2

Discovering Redundancy-Aware Top-k Anomalies in High Dimensional Data
在线阅读 下载PDF
导出
摘要 异常发现是数据挖掘领域的一类重要任务.针对高维对象的异常度量问题和异常点集合的冗余问题,提出了一种新的面向高维数据的异常点发现方法.该方法通过采用高维数据的二部图表示,以高维对象的压缩能力作为其异常程度的度量,能够有效支持包含不同类型属性的高维数据.为了解决top-k异常点集合中的冗余问题,提出了低冗余top-k异常点的概念.由于精确计算低冗余的top-k异常点是NP-hard问题,设计了计算近似低冗余的top-k异常点的启发式方法k-AnomaliesHD算法.从在真实和人工数据集上的实验结果可以看出,该方法具有较好的扩展性;而且与不考虑冗余的异常点发现方法相比较,能够更有效地概括数据中的异常模式. Discovering anomalies is an important data mining task which has been studied in many applications In this paper,by emphasizing the problems of exception measurement of high dimensional objects and redundancy in the set of anomalies,an approach is proposed to discover the anomalies in high dimensional data With a bipartite graph representation of the given high dimensional dataset,the capability of compression of each object is used to measure the degree of exception of the object Based on the exception measure,the dataset containing different types of attributes,such as binary attributes,categorical attributes and numeric attributes,are well supported To solve the problem of redundancy in the set of top-k anomalies,the concept of redundancy-aware top-k anomalies is proposed For the problem of mining the exact set of the redundancy-aware top-k anomalies is NP-hard,an algorithm based on greedy heuristics,named k-AnomaliesHD,is designed to discover an approximate set of the redundancy-aware top-k anomalies efficiently The experimental study both on real and synthetic datasets shows that the algorithm scales linearly with the dimensionality of the dataset and quadratic to the size of the dataset Further,compared with the redundancy-unaware method,the set of redundancy-aware top-k anomalies is much more effective to cover the abnormal patterns of data
出处 《计算机研究与发展》 EI CSCD 北大核心 2010年第5期788-795,共8页 Journal of Computer Research and Development
基金 国家"八六三"高技术研究发展计划基金项目(2007AA120502) 国家自然科学基金项目(60874082)
关键词 数据挖掘 异常检测 高维数据 低冗余 异常度量 data mining anomaly detection high dimensional data redundancy-aware exception measure
  • 相关文献

参考文献16

  • 1Bolton R J,Hand D J.Statistical fraud detection:A review (with discussion)[J].Statistical Science,2002,17(3):235-255.
  • 2Patcha A,Park J -M.An overview of anomaly detection techniques:Existing solutions and latest technological trends[J].Computer Networks,2007,51(12):3448-3470.
  • 3Eskin E,Arnold A,Prerau M,et al.A geometric framework for unsupervised anomaly detection:Detecting intrusions in unlabeled data[G].Data Mining for Security Applications.Norvell:Kluwer,2002:77-99.
  • 4Lane T,Brodley C E.Temporal sequence learning and data reduction for anomaly detection[J].ACM Trans on Information and System Security,1999,2(3):295-331.
  • 5Li X,Han J.Mining approximate top-k subspace anomalies in multidimensional time-series data[C] //Proc of the 33rd Int Conf on Very Large Data Bases.New York:ACM,2007:447-458.
  • 6Aggarwal R,Gehrke J,Gunopulos D,et al.Automatic subspace clustering of high dimensional data for data mining applications[C] //Proc of ACM SIGMOD Conf on Management of Data.New York:ACM,1998:94-105.
  • 7Barnett V,Lewis T.Outliers in Statistical Data[M].New York:John Wiley and Sons,1994.
  • 8Preparata F,Shamos M.Computational Geometry:An Introduction[M].Berlin:Springer,1985.
  • 9Ramaswamy S,Rastogi R,Kyuseok S.Efficient algorithms for mining outliers from large data sets[J].ACM SIGMOD Record,2000,29(2):427-438.
  • 10Breunig M M,Kriegel H P,Ng R T,et al.LOF:Identifying density-based local outliers[J].ACM SIGMOD Record,2000,29(2):93-104.

二级参考文献27

  • 1Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. Knowledge discovery and data mining: towards a unifying framework. In: Simoudis, E., Han, J., Fayyad, U.M., eds. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland, Oregon: AAAI Press, 1996. 82~88.
  • 2Ng, R. T., Han, J. Efficient and effective clustering methods for spatial data mining. In: Bocca, J.B., Jarke, M., Zaniolo, C., eds. Proceedings of the 20th International Conference on Very Large Data Bases. Santiago: Morgan Kaufmann, 1994. 144~155.
  • 3Ester, M., Kriegel, H.-p., Sander, J., et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M., eds. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland, Oregon: AAAI Press, 1996. 226~231.
  • 4Zhang, T., Ramakrishnan, R., Linvy, M. BIRCH: an efficient eata clustering method for very large databases. In: Jagadish, H.V., Mumick, I.S., eds. Proceedings of the ACM SIGMOD International Conference on Management of Data. Montreal: ACM Press, 1996. 103~114.
  • 5Wang, W., Yang, J., Muntz, R. STING: a statistical information grid approach to spatial data mining. In: Jarke, M., Carey, M.J., Dittrich, K.R., et al., eds. Proceedings of the 23rd International Conference on Very Large Data Bases. Athens, Greece: Morgan Kaufmann, 1997. 186~195.
  • 6Sheikholeslami, G., Chatterjee, S., Zhang, A. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Gupta, A., Shmueli, O., Widom, J., eds. Proceedings of the 24th International Conference on Very Large Data Bases. New York : Morgan Kaufmann, 1998. 428~439.
  • 7Hinneburg, A., Keim, D.A. An efficient approach to clustering in large multimedia databases with noise. In: Agrawal, R., Stolorz, P.E., Piatetsky-Shapiro, G. eds. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York: AAAI Press, 1998. 58~65.
  • 8Agrawal, R., Gehrke, J., Gunopulos, D., et al. Automatic subspace clustering of high dimensional data for data mining applications. In: Haas, L.M., Tiwary, A., eds. Proceedings of the ACM SIGMOD International Conference on Management of Data. Seattle, Washington, D C: ACM Press, 1998. 94~105.
  • 9Ruts, I., Rousseeuw, P. Computing depth contours of bivariate point clouds. Journal of Computational Statistics and Data Analysis, 1996,23:153~168.
  • 10Arning, A., Agrawal, R., Raghavan, P. A linear method for deviation detection in large databases. In: Simoudis, E., Han, J., Fayyad, U.M., eds. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland, Oregon: AAAI Press, 1996. 164~169.

共引文献43

同被引文献20

  • 1周晓云,孙志挥,张柏礼,杨宜东.高维类别属性数据流离群点快速检测算法[J].软件学报,2007,18(4):933-942. 被引量:21
  • 2Zou L,Chen L.Dominant graph:An efficient indexing structure to answer top-k queries[C]//Proc of the IEEE 24th Int Conf on Data Engineering.Washington,DC:IEEE Computer Society,2008:536-545.
  • 3Xin D,Chen C,Han J.Towards robust indexing for ranked queries[C]//Proc of the 32nd Int Conf on Very Large Data Bases.Trondheim,Norway:VLDB Endowment,2006:235-246.
  • 4Chang Y C,Bergman L D,Castelli V,et al.The onion technique:Indexing for linear optimization queries[J].ACM SIGMOD Record,2000,29(4):391-402.
  • 5Hristidis V,Koudas N,Papakonstantinou Y.Prefer:A system for the efficient execution of multi-parametric ranked queries[J].ACM SIGMOD Record,2001,30(2):259-270.
  • 6Das G,Gunopulos D,Koudas N,et al.Answering top-k queries using views[C]//Proc of the 32nd Int Conf on Very Large Data Bases.Trondheim,Norway:VLDB Endowment,2006:451-462.
  • 7Xin D,Han J,Cheng H,et al.Answering top-k queries with multi-dimensional selections:The ranking cube approach[C]//Proc of the 32nd Int Conf on Very Large Data Bases.Trondheim,Norway:VLDB Endowment,2006:463-474.
  • 8Mouratidis K,Bakiras S,Papadias D.Continuous monitoring of top-k queries over sliding windows[C]//Proc of the 2006 ACM SIGMOD Int Conf on Management of Data.New York:ACM,2006:635-646.
  • 9Borzsonyi S,Kossmann D,Stocker K.The skyline operator[C]//Proc of the 17th Int Conf on Data Engineering.Washington,DC:IEEE Computer Society,2001:421-430.
  • 10Fagin R,Lotem A,Naor M.Optimal aggregation algorithms for middleware[C]//Proc of the Twentieth ACM SIGMOD-SIGACT-SIGART Symp on Principles of Database Systems.New York:ACM,2001:102-113.

引证文献2

二级引证文献26

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部