
Application of combinatorial clustering algorithm in protein data analysis using Transformer
Abstract: This study adapts the Transformer model to dimensionality reduction of protein features. Its self-attention mechanism lets the model capture long-range dependencies, and its multi-head attention design lets it capture interactions between features from different perspectives, which further improves the expressiveness and robustness of the reduced representation. The paper proposes a new combinatorial clustering algorithm, GRKM, which extends the original K-means algorithm in three ways: the Grey Wolf Optimization Algorithm determines the number of clusters K, the Random Walk algorithm determines the initial cluster centers, and the Mahalanobis distance (rendered as "Markov Distance" in the published English abstract) measures the similarity between samples. Experiments on five representative protein datasets show that the improved algorithm scores considerably better than the original on the silhouette coefficient and the Davies-Bouldin (DB) index. The final analysis takes the APP protein data, clusters the proteins into eight classes, and discusses the biological function of each class, which also gives a clear gain in interpretability. The proposed algorithm offers a reference tool for practical applications such as understanding protein function in depth, discovering potential biomarkers, and guiding drug design.
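To make the distance-based step concrete, the following is a minimal sketch of K-means assignment and update using a Mahalanobis distance, taking the abstract's 马氏距离 to mean the Mahalanobis distance. It is not the authors' GRKM implementation: the number of clusters and the initial centers are simply passed in (the abstract obtains them from Grey Wolf Optimization and Random Walk, which are not reproduced here), the data are assumed to be 2-D points such as features after dimensionality reduction, and a single covariance matrix estimated from the whole dataset is an assumption of this sketch. All function names are hypothetical.

```python
from math import sqrt


def covariance_2d(points):
    """Sample covariance matrix of 2-D points, as ((sxx, sxy), (sxy, syy))."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return ((sxx, sxy), (sxy, syy))


def inverse_2d(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return ((d / det, -b / det), (-c / det, a / det))


def mahalanobis(x, y, inv_cov):
    """sqrt((x - y)^T S^-1 (x - y)) for 2-D points x and y."""
    dx, dy = x[0] - y[0], x[1] - y[1]
    (p, q), (r, s) = inv_cov
    quad = dx * (p * dx + q * dy) + dy * (r * dx + s * dy)
    return sqrt(max(0.0, quad))  # guard against tiny negative rounding error


def assign(points, centers, inv_cov):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)),
            key=lambda k: mahalanobis(pt, centers[k], inv_cov))
        for pt in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of its members; keep empty clusters as-is."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centers.append((
                sum(p[0] for p in members) / len(members),
                sum(p[1] for p in members) / len(members),
            ))
        else:
            new_centers.append(old)
    return new_centers


def kmeans_mahalanobis(points, centers, iters=20):
    """K-means loop with Mahalanobis distance (assumption: global covariance)."""
    inv_cov = inverse_2d(covariance_2d(points))
    centers = list(centers)
    labels = assign(points, centers, inv_cov)
    for _ in range(iters):
        new_centers = update(points, labels, centers)
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
        labels = assign(points, centers, inv_cov)
    return labels, centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
    labels, centers = kmeans_mahalanobis(pts, [(0.0, 0.0), (10.0, 10.0)])
    print(labels, centers)
```

With an identity inverse covariance the distance reduces to the Euclidean one, which is a quick way to sanity-check the function. In practice the resulting labels would then be scored with the silhouette coefficient and the DB index that the abstract reports.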
Authors: CHEN Xianglong; LI Haijun; ZHAO Fujun; YUAN Yuan (School of Information and Intelligent Engineering, University of Sanya, Sanya 572022, China; Academician Guoliang Chen Team Innovation Center, University of Sanya, Sanya 572022, China)
Source: Wireless Internet Technology (《无线互联科技》), 2024, No. 14, pp. 74-81 (8 pages)
Funding: University of Sanya master's supervisor "Industry-Education Integration" research project, project No. USY23CJRH03.
Keywords: protein sequence; Transformer model; clustering algorithm; Mahalanobis distance; Random Walk; Grey Wolf Optimization Algorithm
