A Semi-Random Multiple Decision-Tree Algorithm for Mining Data Streams (cited by: 5)

Abstract: Mining streaming data is an active topic in data mining. When classifying data streams, traditional decision-tree algorithms such as ID3 and C4.5 are relatively inefficient in both time and space because of the characteristics of streaming data. Random decision trees offer advantages in both time and space. This paper proposes SRMTDS (Semi-Random Multiple decision Trees for Data Streams), an incremental algorithm for mining data streams that is based on random decision trees. SRMTDS uses the Hoeffding bound inequality to choose the minimum number of split examples, a heuristic method to compute the information gain for obtaining the split thresholds of numerical attributes, and a Naive Bayes classifier to estimate the class labels at tree leaves. An extensive experimental study shows that SRMTDS improves on VFDTc, a state-of-the-art decision-tree algorithm for classifying data streams, in time, space, accuracy, and robustness to noise.
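
The role of the Hoeffding bound in choosing a minimum number of split examples can be illustrated with a short calculation. The sketch below uses the standard form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and solves it for n. It is an illustration only, not the SRMTDS procedure from the paper; the function names and the sample values of R, delta, and epsilon are assumptions chosen for the example.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within epsilon of the mean observed
    over n independent examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def min_examples(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the Hoeffding bound is at most `epsilon`
    (the bound above solved for n and rounded up)."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


# Hypothetical settings: a quantity with range 1, confidence 1 - 1e-7,
# and a tolerated error of 0.05.
n_min = min_examples(value_range=1.0, delta=1e-7, epsilon=0.05)
print(n_min, hoeffding_epsilon(1.0, 1e-7, n_min))
```

A smaller delta or a smaller epsilon raises the required number of examples, which is the trade-off a stream learner faces between splitting early and splitting reliably.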

Authors: Hu Xuegang (胡学钢), Li Peipei (李培培), Wu Xindong (吴信东), Wu Gongqing (吴共庆)

Affiliation: School of Computer Science and Information Engineering, Hefei University of Technology

Source: Journal of Computer Science & Technology (English edition of 计算机科学技术学报), 2007, Issue 5, pp. 711-724 (14 pages). Indexed in SCIE, EI, CSCD.

Funding: This research is supported by the National Natural Science Foundation of China (Grant No. 60573174) and the Natural Science Foundation of Anhui Province of China (Grant No. 050420207).

Keywords: data streams, Naive Bayes, random decision trees

Classification code: TP311.13 [Automation and Computer Technology - Computer Software and Theory]
