期刊文献+

概率数据库中近似函数依赖挖掘算法 被引量:6

An Algorithm on Mining Approximate Functional Dependencies in Probabilistic Database
在线阅读 下载PDF
导出
摘要 一个近似函数依赖(approximate functional dependency,AFD)是一个几乎成立的函数依赖,目前大部分工作仅限于从一般数据上挖掘近似函数依赖.有时数据是被组织成概率数据的形式,为了从挖掘概率数据中挖掘出可用的近似函数依赖,定义了概率近似函数依赖,它不同于任何一种以往的定义,并给出了在不确定数据中,置信概率的动态规划求解算法,由于动态规划算法复杂度较高,导出了候选依赖的概率下界来进行剪枝,随后给出了基于字典序的挖掘方法以及相应的剪枝策略,最后,在真实和合成的数据集上进行充分的实验,说明了挖掘算法的可扩展性和剪枝策略的高效性,并展示了有趣的挖掘结果. An approximate functional dependency(AFD)is a functional dependency almost hold,and the most existing works are only able to mine AFDs from general data.Sometimes,data is stored in probabilistic database,in order to mine AFDs from such type of data,we define the probabilistic AFD,namely(λ,δ)-AFD which is different from the previous definition.We propose a dynamic programming to compute the confidence probability of a candidate AFD and check if the confidence probability is more than the probability threshold,however,as the high time complexity of dynamic programming,we derive the lower bound based on Chernoff bound to prune candidates as much as possible.Then,under help of the anti-monotone property,we propose a mining algorithm based on lexicographical order and some pruning criterions to speed up the mining process. At last,experiments are performed on the synthetic and the real-life data sets,and the results show the effectiveness of the pruning criterions and the scalability of our mining algorithm,and we show the interesting results mined from DBLP data set.
出处 《计算机研究与发展》 EI CSCD 北大核心 2015年第12期2857-2865,共9页 Journal of Computer Research and Development
基金 国家"九七三"重点基础研究发展计划基金项目(2012CB316200 2012CB316202) 国家自然科学基金项目(61402130)
关键词 近似函数依赖 数据挖掘 概率数据库 数据质量 不一致性 approximate functional dependency(AFD) data mining probabilistic database data quality inconsistency
  • 相关文献

参考文献14

  • 1Galiano F, Cubero J, Cuenca F, et al. Relational decomposition through partial functional dependencies [J]. Data & Knowledge Engineering, 2002, 43(2) : 207-234.
  • 2Wolf G, Khatri H, Chokshi B, et al. Query processing over incomplete autonomous databases [C] //Proc of the 33rd Int Conf on Very Large Data Bases. New York: ACM, 2007: 651-662.
  • 3Ilyas I, Markl V, Haas P, et al. Cords: Automatic discovery of correlations and soft functional dependencies [C] //Proc of the 2004 ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2004:647-658.
  • 4Nambiar U, Kambhampati S. Answering imprecise queries over autonomous Web databases [C] //Proc of the 22nd Int Conf on Data Engineering. Los Alamitos, CA: IEEE Computer Society, 2006:45-45.
  • 5Wolf G, Khatri H, Chen Yi, et al. Quic: A system for handling imprecision & incompleteness in autonomous databases (demo)[C] //Proc of the 3rd Biennial Conf on Innovative Data Systems Research. New York: ACM, 2007: 263-268.
  • 6Huhtala Y, Karkkainen J, Po::kka P, et al. TANE: An efficient algorithm for discovering functional and approximate dependencies [J]. The Compuler Journal, 1999, 42 (2): 100-111.
  • 7Giannella C, Robertson E. On approximation measures for functional dependencies [J]. Information Systems, 2004, 29 (6) : 483-507.
  • 8Yao Hong, Hamilton H. Mining functional dependencies from data [J]. Data Mining and Knowledge Discovery, 2008, 16(2) : 197-219.
  • 9苗东菁,石胜飞,李建中.一种局部相关不确定数据库快照集合上的概率频繁最近邻算法[J].计算机研究与发展,2011,48(10):1812-1822. 被引量:12
  • 10De S, Kambhampati S. Defining and mining functional dependencies in probabilistic databases EOI.:. [2014-05-20]. http://arxiv, org/abs/1005. 4714.

二级参考文献26

  • 1Jeffery S R, Franklin M J, Garofalakis M. An adaptive RFID middleware for supporting metaphysical data independence [J]. The International Journal on Very Large Data Bases, 2008, 17(2): 265-289.
  • 2Cheng R, Kalashnikov D V, Prabhakar S. Evaluating prubabilistic queries over imprecise data [C] //Proc of ACM SIGMOD'2003. New York: ACM, 2003: 551-562.
  • 3Faradjian A, Gehrke J, Bonnet P. Gadt: A probability space ADT for representing and querying the physical world [C] // Proc of the 18th Int Conf on Data Engineering. Los Alamitos, CA: IEEE Computer Society, 2002: 201-211.
  • 4Mokbel M F, Chow C Y, Aref W G. The new Casper: Query processing for location services without compromising privacy [C]//Proc of the 32nd Int Conf on Very Large Data Bases. NewYork: ACM, 2006:763-774.
  • 5Dong X L, Berti Equille L, Srivastava D. Integrating conflicting data: The role of source dependence[J]. PVLDB, 2009, 2(1): 550-561.
  • 6Bohm C, Pryakhin A, Schubert M. The Gauss-tree: Efficient object identification in databases of probabilistic feature vectors [C] //Proe of the 22nd Int Conf on Data Engineering. Los Alamitos, CA: IEEE Computer Society, 2006 : 9.
  • 7Bleiholder J, Naumann F. Data fusion [J]. ACM Computing: Surveys, 2008, 41(1): 1-41.
  • 8Lian Xiang, Chen Lei. A generic framework for handling uncertain data with local correlations [J]. Proc of the VLDB Endowment, 2010, 4(1): 12-21.
  • 9Kanagal B, Deshpande A. Indexing correlated probabilistic databases [C] //Proc of ACM SIGMOD 2009. New York: ACM, 2009:455-468.
  • 10Jordan M I. Graphical models [J]. In Statistical Science: Special Issue on Bayesian Statistics, 2004, 19(1): 140-155.

共引文献11

同被引文献45

引证文献6

二级引证文献80

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部