Abstract
This paper describes how the Monte Carlo database system (MCDB), a distributed database, can be used to implement Bayesian inference via Markov chain Monte Carlo (MCMC) over very large datasets. Bayesian linear regression, the LDA topic model, and Dirichlet process clustering serve as examples. To implement an MCMC simulation in MCDB, a programmer uses SQL to specify the dependencies among variables and how they parameterize one another. The paper thus explores a simple scheme for building large-scale statistical learning systems in concise SQL: MCDB handles parallelization and resource optimization automatically, yielding high-performance parallel processing.
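As a rough illustration of the kind of MCMC simulation the abstract refers to, the following is a minimal single-machine Gibbs sampler for a one-coefficient Bayesian linear regression. It is written in plain Python, not in MCDB's SQL, and does not reproduce the paper's implementation. The function name, the priors (normal on the weight, inverse-gamma on the noise variance), and all default parameter values are assumptions chosen for the sketch.

```python
import random


def gibbs_linear_regression(xs, ys, n_iter=2000, burn_in=500, seed=0,
                            prior_var=100.0, a0=2.0, b0=1.0):
    """Gibbs sampler for y = w * x + eps, with eps ~ N(0, sigma2).

    Assumed priors (illustrative only):
      w      ~ Normal(0, prior_var)
      sigma2 ~ InverseGamma(shape=a0, rate=b0)
    Returns the posterior means of (w, sigma2) after burn-in.
    """
    rng = random.Random(seed)
    n = len(xs)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))

    w, sigma2 = 0.0, 1.0
    w_sum = 0.0
    sigma2_sum = 0.0
    for it in range(n_iter):
        # Draw w from its conditional: a normal whose precision adds the
        # data precision (sxx / sigma2) to the prior precision.
        post_prec = sxx / sigma2 + 1.0 / prior_var
        post_mean = (sxy / sigma2) / post_prec
        w = rng.gauss(post_mean, (1.0 / post_prec) ** 0.5)

        # Draw sigma2 from its conditional: an inverse-gamma with
        # shape a0 + n/2 and rate b0 + SSE/2.
        sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
        shape = a0 + n / 2.0
        rate = b0 + sse / 2.0
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        sigma2 = 1.0 / rng.gammavariate(shape, 1.0 / rate)

        if it >= burn_in:
            w_sum += w
            sigma2_sum += sigma2

    kept = n_iter - burn_in
    return w_sum / kept, sigma2_sum / kept


if __name__ == "__main__":
    # Synthetic data with a true weight of 2.0 and noise std of 0.5.
    data_rng = random.Random(1)
    xs = [data_rng.uniform(-2, 2) for _ in range(500)]
    ys = [2.0 * x + data_rng.gauss(0, 0.5) for x in xs]
    print(gibbs_linear_regression(xs, ys))
```

Each iteration redraws one variable conditioned on the current value of the other. That is the dependency structure a programmer would declare among variables in an MCMC simulation. The sums over the data (`sxx`, `sxy`, `sse`) are the parts that a parallel system would distribute across nodes.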
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2013, No. 6, pp. 256-259, 287 (5 pages in total)
Computer Science
Funding
Supported by the National Natural Science Foundation of China (61272539)
Keywords
Bayesian inference
Parallel algorithms
SQL
Distributed system