Abstract
Suppose A is a training set and B is a subset of A, where B is generated by selecting representative samples from A. This paper draws the conclusion that, for an appropriately selected B, the generalization capability of a decision tree trained on B is better than that of a decision tree trained on A. Furthermore, an algorithm for generating B by selecting representative samples from A is designed and implemented, and its design principles are analyzed from the viewpoints of data distribution and information entropy.
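The abstract only summarizes the selection algorithm, so as a rough, hypothetical sketch (not the authors' method), the following Python snippet illustrates one way a representative subset B might be drawn from a training set A: for each class, keep the samples closest to the class centroid. The function name select_representative_subset and the keep_ratio parameter are illustrative assumptions.

    # Illustrative sketch only, not the algorithm described in the paper:
    # pick a subset B of training set A = (X, y) by keeping, per class,
    # the samples closest to the class centroid ("representative" samples).
    import numpy as np

    def select_representative_subset(X, y, keep_ratio=0.5):
        """Return indices of a representative subset B of the training set A = (X, y)."""
        selected = []
        for label in np.unique(y):
            idx = np.where(y == label)[0]
            center = X[idx].mean(axis=0)                  # class centroid
            dist = np.linalg.norm(X[idx] - center, axis=1)
            k = max(1, int(len(idx) * keep_ratio))
            selected.extend(idx[np.argsort(dist)[:k]])    # keep the k most central samples
        return np.array(sorted(selected))

A decision tree could then be trained on X[idx_B], y[idx_B] (with idx_B = select_representative_subset(X, y)) and compared against one trained on the full set A, mirroring the comparison described in the abstract.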
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journal (北大核心)
2006, No. 35, pp. 160-162, 187 (4 pages)
Funding
Supported by the National Natural Science Foundation of China (60473045, 60573069).
Keywords
sample selection
information entropy
fuzzy decision tree induction
generalization capability