Research on measure and selection of sampling methods in data mining (cited by: 3)
Abstract: Sampling is a general and effective approximation technique. In data mining, sampling can markedly reduce the size of the dataset being processed, which allows many data mining algorithms to be applied to large-scale datasets and to data streams. Building on a review of how statistics quantifies and measures the error of uniform random sampling, this paper examines criteria for measuring sampling methods in data mining and the factors that influence the choice of a sampling method. It proposes and quantifies the concepts of "representativeness of a sampling method" and "sample deviation", which give a better assessment of sampling quality, particularly for biased sampling methods. Finally, it outlines follow-up work and directions for further research on the measurement and selection of sampling methods for data mining.
Source: Journal of Fujian University of Technology (CAS), 2011, No. 4, pp. 351-356 (6 pages)
Funding: Science and Technology Project of the Fujian Provincial Department of Education (JA08161)
Keywords: data mining; uniform sampling; biased sampling; sample deviation; representativeness of a sample; measure and selection