PAA: An Efficient Approximate Aggregation Algorithm on Massive Data

Cited by: 2
Abstract: Aggregation is a commonly used but time-consuming database operation. Compared with exact query processing, it is often preferable to return an approximate result that satisfies a required confidence interval in much less response time. Existing approximate query methods cannot efficiently process approximate aggregation with arbitrary accuracy on massive data. This paper proposes a new algorithm, PAA (partition-based approximate aggregation), to efficiently process approximate aggregation with an arbitrary confidence interval. The data space of the dimensional attributes is divided into hypercubes of equal size, and each partition maintains the tuples whose dimensional attributes fall into the corresponding hypercube. PAA also maintains a pre-constructed random sample RS of the table and runs in two stages. In stage 1, it tries to answer the query from RS. If the result obtained from RS does not satisfy the user's confidence interval, then in stage 2 PAA retrieves more random tuples from IPS, the set of partitions whose hypercubes overlap the query region. The novelty of PAA lies in (1) how to retrieve the required random tuples from IPS when the number of tuples satisfying the predicate in each partition is unknown, and (2) how to reduce the random I/O cost of stage 2 effectively. Experimental results show that PAA achieves up to two orders of magnitude speedup over existing methods.
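The two-stage flow described in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation. It assumes a single dimensional attribute `x`, an `AVG` query over a range `[lo, hi)`, and a normal-approximation (CLT) 95% confidence interval. Stage 2 uses plain acceptance-rejection sampling as a stand-in: pick a partition with probability proportional to its total size (which is known), pick a row uniformly inside it, and keep the row only if it satisfies the predicate. The abstract does not say how PAA itself draws these tuples or how it reduces random I/O, so none of that is modeled here. All function and parameter names (`build_partitions`, `half_width`, `approx_avg`, `max_err`, `max_draws`) are hypothetical.

```python
import itertools
import math
import random
import statistics

Z_95 = 1.96  # standard normal quantile for a 95% confidence level


def build_partitions(table, width):
    """Group rows (x, value) into equal-width cells of the dimension x."""
    parts = {}
    for x, value in table:
        parts.setdefault(int(x // width), []).append((x, value))
    return parts


def half_width(values):
    """CLT-based half-width of the 95% confidence interval of the mean."""
    if len(values) < 2:
        return math.inf
    return Z_95 * statistics.stdev(values) / math.sqrt(len(values))


def approx_avg(rs, parts, width, lo, hi, max_err, rng, max_draws=200_000):
    """Estimate AVG(value) WHERE lo <= x < hi.

    Returns (estimate, half_width, stage that produced the answer).
    """
    # Stage 1: answer from the pre-built random sample RS alone.
    vals = [value for x, value in rs if lo <= x < hi]
    if half_width(vals) <= max_err:
        return statistics.fmean(vals), half_width(vals), 1

    # Stage 2: IPS = partitions whose cell overlaps the query range.
    ips = [rows for cell, rows in parts.items()
           if cell * width < hi and (cell + 1) * width > lo]
    if not ips:
        return math.nan, math.inf, 2
    cum_sizes = list(itertools.accumulate(len(rows) for rows in ips))
    for draw in range(1, max_draws + 1):
        # Partition chosen proportionally to its size, then a uniform row:
        # together this is a uniform draw over all rows stored in IPS.
        rows = rng.choices(ips, cum_weights=cum_sizes)[0]
        x, value = rng.choice(rows)
        if lo <= x < hi:  # keep only rows that satisfy the predicate
            vals.append(value)
        # Re-check the interval periodically rather than after every draw.
        if draw % 100 == 0 and half_width(vals) <= max_err:
            break
    estimate = statistics.fmean(vals) if vals else math.nan
    return estimate, half_width(vals), 2
```

Rejection sampling sidesteps the need to know how many tuples in each partition satisfy the predicate, because only the total partition sizes are used as weights. The cost is wasted draws in partitions that the query region covers only partially, and every draw here is an independent random access, which is exactly the kind of overhead the abstract says PAA aims to reduce.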

Authors: Han Xixian (韩希先), Li Jianzhong (李建中), Gao Hong (高宏)

Affiliation: School of Computer Science and Technology, Harbin Institute of Technology

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2014, No. 1, pp. 41-53 (13 pages)

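Funding: National Basic Research Program of China (973 Program) (2012CB316200); National Natural Science Foundation of China (61190115, 61173022, 61033015, 60831160525, 61272046, 60903016); Natural Scientific Research Innovation Foundation of Harbin Institute of Technology (HIT.NSRIF.2014136)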

Keywords: massive data; PAA; approximate aggregation; partition; random sample

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
