
Sequence-Based and Cross-Modal Alignment Model for Protein Function Prediction
Abstract Protein function prediction is one of the core tasks in bioinformatics. Although existing methods can fuse multimodal protein features, they still suffer from insufficient prediction accuracy and from a limited scope of application caused by their reliance on scarce experimental data. To address these problems, this study proposes SCMAGO (Sequence-based and Cross-Modal Alignment Model for Protein Function Prediction), which takes the protein sequence as its sole input. The mainstream tools AlphaFold2 and InterProScan are used to predict the tertiary structure and the family domain information, respectively. The protein large language model ESMC (Evolutionary Scale Model Cambrian) produces the sequence embeddings, a geometric vector perceptron graph neural network (GVP-GNN) extracts tertiary structure features, and a broadcast embedding method yields the family domain representations. SCMAGO uses a two-step cross-modal alignment approach: first, bidirectional cross-attention aligns sequence and structure features at the residue level; second, graph attention pooling further fuses the family domain features. Experimental results show that SCMAGO outperforms existing baseline methods on the Swiss-Prot dataset. Its Fmax values for biological process (BP), molecular function (MF), and cellular component (CC) are 0.487, 0.739, and 0.736, respectively, and the corresponding AUPR values are 0.507, 0.760, and 0.800. Furthermore, SCMAGO maintains stable prediction performance for proteins with sequence identity below 40%.
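The Fmax figures reported in the abstract can be illustrated with a minimal sketch. The abstract does not say which definition of Fmax the authors use, so this assumes the protein-centric definition common in function-prediction benchmarks: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all proteins, and Fmax is the best resulting F-score. The function name `fmax`, the term labels, and the toy scores are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def fmax(
    y_true: List[Set[str]],
    y_score: List[Dict[str, float]],
    thresholds: List[float],
) -> float:
    """Protein-centric Fmax (assumed definition, not confirmed by the source)."""
    best = 0.0
    n_proteins = len(y_true)
    for t in thresholds:
        precisions = []  # one entry per protein with >= 1 predicted term
        recall_sum = 0.0  # summed over all proteins
        for truth, scores in zip(y_true, y_score):
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:
                precisions.append(hits / len(predicted))
            if truth:
                recall_sum += hits / len(truth)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / n_proteins
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data: two proteins with made-up term labels and scores.
y_true = [{"term_a", "term_b"}, {"term_c"}]
y_score = [
    {"term_a": 0.9, "term_b": 0.4, "term_d": 0.6},
    {"term_c": 0.8, "term_a": 0.3},
]
thresholds = [i / 100 for i in range(1, 100)]

print(round(fmax(y_true, y_score, thresholds), 3))  # 0.909
```

In this toy case the best threshold lies just above 0.3, where every true term is recovered (recall 1.0) and only one false term remains (precision 5/6), giving F = 10/11.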
Authors XU Min; HU Chun-ling; HU Ting; ZHANG Fang-fang; DAI Xiang-long (School of Artificial Intelligence and Big Data, Hefei University, Hefei, Anhui 230031, China)
Source Acta Electronica Sinica (PKU Core Journal), 2025, No. 11, pp. 4022-4034 (13 pages)
Funding National Natural Science Foundation of China (No. 62306100).
Keywords protein function prediction; multimodal fusion; attention mechanism; Gene Ontology