Abstract
With the development of next-generation sequencing technology, massive amounts of 16S rRNA gene sequence data have been generated. How to effectively mine the genomic information hidden in these data is a current research focus and challenge. Sequence clustering studies how to group sequences that originate from the same species, and it forms the basis for studying the species diversity, structural diversity, and functional diversity of microbial communities. Targeting the characteristic sources of 454 sequencing errors, this work proposes a heuristic sequence clustering algorithm based on neighbor seed sequences, named NbHClust. Experimental results show that the algorithm is robust. Compared with traditional heuristic sequence clustering algorithms, it reduces the overestimation of operational taxonomic units (OTUs), improves clustering accuracy, and computes OTUs effectively.
Source
Journal of Software (《软件学报》)
EI
CSCD
Peking University Core Journals (北大核心)
2014, No. 5, pp. 929-938 (10 pages)
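Funding
National Natural Science Foundation of China (61170134, 61135001)
Aeronautical Science Foundation of China (20100853010)
Xi'an Science and Technology Program (CXY1350(2))
Doctorate Innovation Foundation of Northwestern Polytechnical University (cx201017)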
Keywords
second-generation sequencing technology
operational taxonomic unit
species diversity
16S rRNA gene
sequence clustering