Outlier-DivideConquer: An Outlier Divide-and-Conquer Sampling Algorithm for Approximate Aggregation Queries

Cited by: 1

Abstract: Sampling is an efficient and widely used approximation technique, and sampling-based processing of approximate aggregation queries is common in decision support systems and data mining tools. The key goal of approximate query processing is to return approximate answers correctly and efficiently while minimizing the approximation error. We observe that uniform sampling performs poorly when the distribution of the aggregated attribute is highly skewed. Such skew, or large variance in the aggregate column, is primarily due to the presence of outliers (deviants) in the data. To address this issue, we introduce an optimized uniform sampling technique called Outlier-DivideConquer that overcomes the limitations of plain uniform sampling. The method belongs to the class of precomputed query processing schemes and builds on an in-depth study of sampling techniques for approximate aggregation queries.

The central idea of Outlier-DivideConquer is a "divide and conquer" approach that handles outliers separately: we identify the tuples with outlier values and store them in a separate sub-relation, then draw a uniform random sample from the rest of the relation. The scheme consists of two steps, separating outliers and query processing. The Outlier-Divide algorithm used in the separation step is the core of the scheme, and it has two main characteristics: (1) it is a one-pass algorithm with error guarantees, and (2) unlike outlier-indexes, a comparable technique, it scans the data only once and does not need to sort the aggregated column over all tuples of the relation, so it has lower time complexity than outlier-indexes and extends naturally to approximate query processing over streaming data. The query processing step consists of three sub-steps: aggregate outliers, aggregate non-outliers, and combine aggregates. Section 2.2 of the paper describes this step in more detail.

We show that Outlier-DivideConquer answers aggregation queries with significantly lower approximation error than either reservoir uniform sampling or outlier-indexes. Finally, a set of experiments on a modified TPC-H database, including a comparison with traditional uniform sampling, demonstrates the correctness and effectiveness of the proposed technique.
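The two-step scheme in the abstract (separate outliers in one scan, then aggregate outliers exactly, aggregate non-outliers from a sample, and combine) can be illustrated with a minimal Python sketch. This is not the paper's Outlier-Divide algorithm, whose outlier criterion and error bound are not given in the abstract. As a loudly stated assumption, the sketch treats the k largest values as outliers, which suits right-skewed data only, and uses standard reservoir sampling (Vitter's Algorithm R) for the rest. The names `outlier_divide`, `approximate_sum`, `k`, and `sample_size` are hypothetical.

```python
import heapq
import random


def outlier_divide(values, k, sample_size, rng=None):
    """Single pass over `values`.

    ASSUMPTION (not from the paper): the k largest values are the outliers.
    Returns (outliers, reservoir, rest_count), where `reservoir` is a uniform
    random sample of the non-outlier values and `rest_count` is their number.
    """
    rng = rng or random.Random()
    outliers = []    # min-heap holding the k largest values seen so far
    reservoir = []   # uniform sample of the values evicted from the heap
    rest_count = 0   # how many non-outlier values have been seen
    for v in values:
        if len(outliers) < k:
            heapq.heappush(outliers, v)
            continue
        # Push v and pop the smallest candidate; the popped value is a non-outlier.
        evicted = heapq.heappushpop(outliers, v)
        rest_count += 1
        if len(reservoir) < sample_size:
            reservoir.append(evicted)
        else:
            # Algorithm R: keep the new item with probability sample_size / rest_count.
            j = rng.randrange(rest_count)
            if j < sample_size:
                reservoir[j] = evicted
    return outliers, reservoir, rest_count


def approximate_sum(outliers, reservoir, rest_count):
    """Combine the exact outlier sum with the scaled-up sample sum."""
    exact_part = float(sum(outliers))
    if not reservoir:
        return exact_part
    scaled_part = rest_count * sum(reservoir) / len(reservoir)
    return exact_part + scaled_part


if __name__ == "__main__":
    data = [1.0] * 1000 + [1e6, 2e6]   # two extreme values dominate the total
    outliers, sample, n_rest = outlier_divide(data, k=2, sample_size=50,
                                              rng=random.Random(0))
    print(approximate_sum(outliers, sample, n_rest))
```

In this toy input the two extreme values are aggregated exactly, so the estimate does not depend on whether the sample happens to include them. That dependence is the weakness of plain uniform sampling on skewed data that the abstract describes. The heap avoids sorting the whole column, at a cost of O(log k) per tuple.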

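Authors: Hu Wenyu (胡文瑜), Sun Zhihui (孙志挥), Zhang Baili (张柏礼)

Affiliations: Department of Computer and Information Science, Fujian University of Technology; School of Computer Science and Engineering, Southeast University

Source: Journal of Nanjing University (Natural Science) (《南京大学学报（自然科学版）》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2011, Issue 5, pp. 524-531 (8 pages)

Funding: National Natural Science Foundation of China (60873176, 61073059); Science and Technology Project of the Fujian Provincial Department of Education (JA08161)

Keywords: data mining; decision support; approximate aggregation queries; uniform sampling; outlier divide-and-conquer

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]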


