Abstract
Most existing methods for classifying structured data first mine frequent substructures, then rank and prune them to obtain structural rules associated with class labels, and finally classify with those rules. This paper proposes TSC, a data stream classification method for tree-structured data based on significant tree patterns. TSC first uses a correlation measure to find the k most discriminative tree patterns correlated with the class labels. During this search it uses branch and bound to improve efficiency, so the complete set of frequent patterns need not be mined, and it continually updates the pruning threshold, which avoids a post-pruning step and lets the resulting tree patterns be used directly for classification. Unlike earlier methods, TSC is non-heuristic: the user only sets the maximum size of the rule set. TSC then applies the classical ADWIN approach to handle local concept drift in evolving tree streams. Experiments show that, compared with previous methods, TSC generates a smaller set of effective rules, which greatly reduces testing time, and achieves a shorter total running time together with high predictive accuracy, making it simple and efficient.
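The k-best search with a rising threshold described in the abstract can be illustrated with a minimal sketch. It is not the TSC implementation: it assumes patterns are plain item sets rather than subtrees, uses the chi-square statistic as the correlation measure, and relies on the convexity of chi-square for the branch-and-bound upper bound. The names `chi2`, `upper_bound`, and `top_k_patterns` are hypothetical.

```python
import heapq
from itertools import count


def chi2(p, n, P, N):
    """Chi-square of the 2x2 table (pattern present/absent vs. class).

    p, n: positive / negative records containing the pattern.
    P, N: total positive / negative records (both assumed > 0).
    """
    total = P + N
    covered = p + n
    if covered == 0 or covered == total:
        return 0.0
    return total * (p * (N - n) - n * (P - p)) ** 2 / (
        covered * (total - covered) * P * N
    )


def upper_bound(p, n, P, N):
    # Any extension of a pattern covers a subset of its records. Because
    # chi-square is convex, the best an extension can do is keep only the
    # positives or only the negatives.
    return max(chi2(p, 0, P, N), chi2(0, n, P, N))


def top_k_patterns(records, k):
    """records: list of (set_of_items, label) with label 1 or 0."""
    P = sum(1 for _, y in records if y == 1)
    N = len(records) - P
    items = sorted({i for feats, _ in records for i in feats})
    heap = []      # min-heap of (score, tiebreak, pattern); root = k-th best
    tie = count()  # tiebreaker so equal scores never compare patterns

    def threshold():
        # The threshold rises as better patterns are found, so no
        # post-pruning pass is needed once the search ends.
        return heap[0][0] if len(heap) == k else 0.0

    def expand(pattern, covered, start):
        for idx in range(start, len(items)):
            item = items[idx]
            sub = [r for r in covered if item in r[0]]
            if not sub:
                continue
            p = sum(1 for _, y in sub if y == 1)
            n = len(sub) - p
            score = chi2(p, n, P, N)
            new = pattern + (item,)
            if len(heap) < k:
                heapq.heappush(heap, (score, next(tie), new))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, next(tie), new))
            # Branch and bound: descend only if some extension could
            # still beat the current k-th best score.
            if upper_bound(p, n, P, N) > threshold():
                expand(new, sub, idx + 1)

    expand((), records, 0)
    return [(pat, s) for s, _, pat in sorted(heap, key=lambda t: -t[0])]


records = [
    ({"a", "b"}, 1), ({"a", "c"}, 1), ({"a"}, 1),
    ({"b", "c"}, 0), ({"c"}, 0), ({"b"}, 0),
]
best = top_k_patterns(records, 1)
```

In this toy data the item `a` occurs in every positive record and in no negative record, so it is the single best pattern, and every other branch is cut off by the bound once `a` has been found. The only user-supplied parameter is `k`, mirroring the abstract's statement that only the maximum rule-set size is set by the user.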
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals (北大核心)
2013, Issue 6, pp. 1328-1333 (6 pages)
Journal of Chinese Computer Systems
Keywords
tree stream
classification
k-best tree pattern
correlation measures