PAA: An Efficient Approximate Aggregation Algorithm on Massive Data

Cited by: 2
Abstract: Aggregation is a commonly used but time-consuming operation in database systems. Compared with an exact query, it is often preferable to return an approximate result that satisfies a required confidence interval in a much shorter response time. Existing approximate query methods cannot efficiently process approximate aggregation with arbitrary accuracy on massive data. This paper proposes a new algorithm, PAA (partition-based approximate aggregation), to efficiently process approximate aggregation with an arbitrary confidence interval. The data space of the dimensional attributes is divided into hypercubes of equal size, and each partition maintains the tuples whose dimensional attributes fall into the corresponding hypercube. PAA maintains a pre-constructed random sample RS of the table and runs in two stages. In stage 1, PAA computes an approximate result from RS. If that result does not satisfy the user's confidence interval, then in stage 2 PAA retrieves more random tuples from IPS, the set of partitions whose hypercubes overlap the query region. The novelty of PAA lies in (1) how to retrieve the required random tuples from IPS when the number of tuples satisfying the predicate in each partition is unknown, and (2) how to reduce the random I/O cost in stage 2. Experimental results show that PAA achieves up to two orders of magnitude speedup over existing methods.
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2014, No. 1, pp. 41-53 (13 pages)
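Funding: National Basic Research Program of China (973 Program) (2012CB316200); National Natural Science Foundation of China (61190115, 61173022, 61033015, 60831160525, 61272046, 60903016); Harbin Institute of Technology Scientific Research Innovation Fund (HIT.NSRIF.2014136)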
Keywords: massive data; PAA; approximate aggregation; partition; random sample
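
The two-stage idea described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' PAA implementation: it assumes a single dimensional attribute x in [0, 1), an AVG aggregate, a 95% confidence level based on the normal approximation, and partition sizes that are known and used as sampling weights. Tuples that fall outside the query region are simply rejected, which is one standard way to sample when the number of qualifying tuples per partition is unknown; the paper's own technique and its random I/O optimizations are not reproduced here. All function and parameter names (build_partitions, half_width, approx_avg, eps, max_draws, check_every) are hypothetical.

```python
import math
import random
import statistics

Z95 = 1.96  # normal quantile for a 95% confidence level (assumption)


def build_partitions(table, num_parts):
    """Split tuples (x, y), x in [0, 1), into equal-width partitions on x."""
    parts = [[] for _ in range(num_parts)]
    for x, y in table:
        parts[min(int(x * num_parts), num_parts - 1)].append((x, y))
    return parts


def half_width(values):
    """Half-width of the normal-approximation confidence interval for the mean."""
    if len(values) < 2:
        return math.inf
    return Z95 * statistics.stdev(values) / math.sqrt(len(values))


def approx_avg(rs, parts, lo, hi, eps, rng, max_draws=100_000, check_every=50):
    """Estimate AVG(y) WHERE lo <= x < hi; returns (estimate, half_width, stage)."""
    # Stage 1: try to answer from the pre-built random sample RS alone.
    values = [y for x, y in rs if lo <= x < hi]
    if half_width(values) <= eps:
        return statistics.fmean(values), half_width(values), 1

    # Stage 2: draw more tuples from IPS, the partitions overlapping [lo, hi).
    n = len(parts)
    first = min(int(lo * n), n - 1)
    last = min(int(hi * n), n - 1)
    ips = [p for p in parts[first:last + 1] if p]
    if ips:
        weights = [len(p) for p in ips]  # size-weighted pick = uniform over IPS tuples
        for _ in range(max_draws):
            x, y = rng.choice(rng.choices(ips, weights=weights)[0])
            if lo <= x < hi:  # reject tuples that miss the predicate
                values.append(y)
                if len(values) % check_every == 0 and half_width(values) <= eps:
                    break
    if not values:
        raise ValueError("no tuple satisfies the predicate")
    return statistics.fmean(values), half_width(values), 2
```

A caller would build the partitions and the sample once, for example `parts = build_partitions(table, 10)` and `rs = random.Random(0).sample(table, 200)`, and then issue queries such as `approx_avg(rs, parts, 0.2, 0.4, eps=0.5, rng=random.Random(1))`. A loose error bound is typically satisfied in stage 1, while a tight one forces stage 2 to touch only the overlapping partitions rather than the whole table.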