期刊文献+

Outlier-DivideConquer:近似聚集查询中离群分治取样算法 被引量:1

An outliers divide-and-conquers sampling algorithm for approximate aggregation queries
在线阅读 下载PDF
导出
摘要 取样是一种通用有效的近似技术,利用取样技术进行近似聚集查询处理是决策支持系统和数据挖掘实现技术中的常用方法.如何正确有效地给出近似查询结果并最小化近似查询误差是近似查询处理的关键和目标.在深入研究近似聚集查询取样方法的基础上,本文提出了一个有误差确界且只需单遍扫描数据集的离群分治取样Outlier-DivideConquer算法,该算法在聚集属性内部存在高方差分布时能克服随机均匀取样局限,可显著降低近似查询误差,且执行效率优于同类算法.最后通过与传统均匀取样算法的实验比较验证了Outlier-DivideConquer算法的有效性和正确性. Sampling is an efficient and most widely-used approximation technique.The ability to approximately answer aggregation queries accurately and efficiently is of great benefit for decision support system and data mining tools.We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is high skewed.The distribution of high skewed data or the large variance in the aggregate column is primarily due to the presence of certain outliers or deviants in the data.To address this issue,we introduce an optimized uniform sampling technique called Outlier-DivideConquer that tries to overcome limitations of mere uniform sampling.The method presented by us belong to precomputed query processing scheme and it is worked out based on deep and extensive studies for sampling techniques applied to approximate aggregation queries.The central idea of Outlier-DivideConquer is adopting "divide and conquer" approach to deal with outliers separately.For this purpose,we identify the tuples with outlier values and store them in a separate sub-relation,and random uniform sample from the rest of the relation.In details,the scheme of Outlier-DivideConquer consists of two steps as follows: Separating outliers and Query processing.The Outlier-Divide algorithm included in separating outliers step is the very core of our scheme.The main characteristics of Outlier-Divide algorithm are briefly described as follows:(1) Outlier-Divide algorithm is a one pass and error guarantees algorithm,and(2) unlike outlier-indexes,which is a comparable algorithm with our Outlier-DivideConquer technique,in our algorithm,there is only a single scan of the data and is not necessary to sorting the aggregated column of entire tuples of the relation,so it has lower time complexity than outlier-indexes and can be naturally extended to approximate query processing of streaming data.Moreover,query processing step consists of three steps as follows: Aggregate outliers,Aggregate non-outliers,and Combine aggregates.A more detailed description of the query processing step is given later in section 2.2 of this paper.We demonstrated that Outlier-DivideConquer technique can be used to answer aggregation queries with significantly reduced approximation error compared to either the reservoir uniform sampling or outlier-indexes.Finally,a set of experiments on the modified TPC-H database demonstrates the correctness and effectiveness of the technique proposed.
出处 《南京大学学报(自然科学版)》 CAS CSCD 北大核心 2011年第5期524-531,共8页 Journal of Nanjing University(Natural Science)
基金 国家自然科学基金(60873176 61073059) 福建省教育厅科技项目(JA08161)
关键词 数据挖掘 决策支持 近似聚集查询 均匀取样 离群分治 data mining decision support approximate aggregation queries uniform sampling outliers divide-and-conquer
  • 相关文献

参考文献24

  • 1Vitter J S. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 1985, 11 (1) :37-57.
  • 2Brown P G, Haas P J. Techniques for warehousing of sample data. Proceedings of the 22^rd ICDE: IEEE Computer Society, Washington DC, USA, 2006, 6.
  • 3胡文瑜,孙志挥,吴英杰.数据挖掘取样方法研究[J].计算机研究与发展,2011,48(1):45-54. 被引量:54
  • 4Chaudhuri S, Das G, Narasayy A V. Optimized stratified sampling for approximate query processing. ACM Transactions on Datatmse Systems. New York: ACM, 2007, 32, 2(9): 50.
  • 5Hellerstein J M, Haas P J, Wang H J. Online aggregation. Proceedings of the ACM SIGMOD Conference, 1997, 26(2): 171- 182.
  • 6Acharya S, Gibbons P B, Poosala V, el al. Join Synopses for approximate query answering. Proceedings of the ACM SIGMOD Conference, New York: ACM,1999, 275-286.
  • 7Olken F. Random sampling from database. Ph. D thesis. U.C. Berkeley of California, 1993.
  • 8Hawkins D M. Identification of outliers. London: Chapman and Hall, 1980,188.
  • 9Chaudhuri S, Das G, Datar M, et al. Overcoruing limitations of sampling for aggregation queries. Proceedings of ICDE, Heidelberg, Germany, 2001, 534-542.
  • 10Haas P J, Hellerstein J M. Ripple joins for online aggregation. Proceedings of the ACM SIGMOD Conference, New York: ACM,1999, 287 -298.

二级参考文献61

  • 1金澈清,钱卫宁,周傲英.流数据分析与管理综述[J].软件学报,2004,15(8):1172-1181. 被引量:163
  • 2贾彩燕,陆汝钤.关联规则挖掘的取样误差量化模型和快速估计算法[J].计算机学报,2006,29(4):625-634. 被引量:7
  • 3杨雪梅,董逸生,徐宏炳,刘学军,钱江波,王永利.高维数据流的在线相关性分析[J].计算机研究与发展,2006,43(10):1744-1750. 被引量:9
  • 4Vitter J S. Random sampling with a reservoir [J]. ACM Trans on Mathematical Software, 1985, 11(1): 37-57.
  • 5Cochran W G. Sampling Techniques [M]. 3rd ed. New York: John Wiley & Sons, 1977.
  • 6Levy P S, Lemeshow S. Sampling of Populations" Methods and Applications [M]. New York: John Wiley & Sons, 1991.
  • 7Lohr S L. Sampling: Design and Analysis [M]. Pacific Grove, CA: Duxbury Press, 1999.
  • 8Olken F, Rotem D. Random sampling from B+trees[C] // Proc of the 15th Int Conf on VLDB. San Francisco: Morgan Kaufmann, 1989:269-277.
  • 9Olken F, Rotem D. Sampling from spatial databases [J]. Statistics and Computing, 1995, 5(1): 43-57.
  • 10Gibbons P B, Matias Y. New sampling-based summary statistics for improving approximate query answers [C] // Proc of ACM SIGMOD 1998. New York: ACM, 1998: 331- 342.

共引文献61

同被引文献14

  • 1Acharya S, Gibbons P B, V Poosala. Congressional samples for approximate answering of group-by queries[C] //Proc of the ACM SIGMOD on Management of Data, 2000: 487-498.
  • 2Cochran W G. Sampling Techniques [M]. Third edition. New York: John Wiley 8z Sons, 1977.
  • 3Kun-T Chuang, Hung-L Chen, Ming-S Chen. Feature-preserved sampling over streaming data[J]. ACM TKDD, 2009, 2(4): 1-45.
  • 4Braverman V, Ostrovsky R, Zaniolo C. Optimal sampling from sliding windows[C]//Proc of the 28th ACM SIGMOD-SIGACT-SIGART Symp on Principles of database systems, 2009, 147-156.
  • 5Mark Last. Improving data mining utility with projective sampling [C]//Proc of the 15th ACM SIGKDD intl conf on KDD, Paris, France, 2009, 487=496.
  • 6Ko|lios G, Gunopoulos D, Koudas N, et al. Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets [J]. IEEE Trans on Knowledge and Data Engineering, 2003; 15(5): 1170-1187.
  • 7Chaudhuri S, Das G, NARASAYY A V. Optimized stratified sampling for approximate query pro- cessing [J]. ACM Trans on Database Systems, 2007, 2(9): 32.
  • 8Chaudhuri S, Das G, Narasayya V, A Robust, Optimization-based approach for approximate an- swering of aggregate queries[C]//Proc of ACM SIGMOD, 2001, 295-306.
  • 9Acharya S, Gibbons P B, Poosala V, et al. Join synopses for approximate query answering[C]//In Procs of the ACM SIGMOD Conference, 1999: 275-286.
  • 10Hawkins D M. Identification of Outliers [M]. London: Chapman and Hall, 1980-188.

引证文献1

二级引证文献3

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部