Abstract
In bioinformatics, classifying a protein from its amino acid sequence into functional and structural families is a fundamental problem, and detecting subtle sequence similarity, or remote homology, is important for accurately predicting protein function and structure. This paper proposes a new remote homology detection method based on a semi-supervised support vector machine (SVM). The method defines probabilistic sequence profiles built from unlabeled data in large databases, constructs the SVM kernel function in parallel, and combines the SVM with a nearest-neighbor classifier so that all input data are covered. Experiments show that the method substantially improves both the accuracy and the computational efficiency of protein sequence classification. Parallel computation keeps the total running time within a bounded range, which makes semi-supervised SVM classifiers practical for wider use.
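The parallel kernel construction mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each sequence has already been mapped to a fixed-length feature vector derived from its profile, and it uses a plain dot product as a stand-in kernel. The names `linear_kernel`, `gram_row`, and `parallel_gram` are hypothetical. A thread pool is used only to show the row-wise task split; in CPython, real speedup for pure-Python arithmetic would need a process pool or a cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor


def linear_kernel(u, v):
    """Stand-in kernel (assumption): dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_row(i, features, kernel):
    """Compute row i of the Gram matrix: kernel(features[i], x) for every x."""
    return [kernel(features[i], x) for x in features]


def parallel_gram(features, kernel=linear_kernel, workers=4):
    """Build the full Gram matrix, submitting one row per task to the pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so row i lands at index i.
        rows = pool.map(lambda i: gram_row(i, features, kernel),
                        range(len(features)))
        return list(rows)


if __name__ == "__main__":
    # Toy feature vectors (hypothetical, not real profile data).
    toy = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
    for row in parallel_gram(toy):
        print(row)
```

Because each row depends only on the shared, read-only feature list, the rows can be computed independently. A matrix built this way could then be passed to any SVM solver that accepts a precomputed kernel.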
Source
Application Research of Computers (《计算机应用研究》)
Indexed in: CSCD
Indexed in: Peking University Core Journals (北大核心)
2009, No. 12, pp. 4624-4627 (4 pages)
Funding
Tianjin Key Science and Technology Support Project (09ZCKFGX00400)
Henan Province Higher Education Informatization Engineering Project (2008xxh011)
Keywords
semi-supervised learning
support vector machine
parallel computing
classifier