Abstract
Current protein-protein interaction (PPI) identification systems rely mainly on single sentences as evidence, and their training sets are small because annotated data is scarce. To address these shortcomings, an automatic PPI identification method is proposed that uses large-scale text as evidence within a relational similarity framework. First, the set of sentences describing each protein pair is obtained by automatically searching the whole PubMed database. Then, features are extracted from three perspectives (words, phrase structure, and dependency relations) to build a vector space model that represents the relationship between a pair of proteins. Finally, the similarity between two vectors is measured to classify the relationship between the two proteins. The training data is taken directly from existing PPI databases, so no extra manual annotation is needed. Experimental results show that this approach achieves a high F-score (74.2%).
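The final step described in the abstract, comparing the vectors of two protein pairs, can be illustrated with a minimal sketch. The abstract does not say which similarity measure or decision rule the authors use, so the cosine similarity and the 1-nearest-neighbor rule below are assumptions. The function names (`relation_vector`, `cosine_similarity`, `classify_pair`), the `PROT1`/`PROT2` placeholders, and the toy sentences are hypothetical as well. Only word features are shown; phrase-structure and dependency features would need a syntactic parser.

```python
from collections import Counter
from math import sqrt


def relation_vector(sentences):
    """Build a bag-of-words vector from all sentences mentioning one protein pair.

    Assumption: lowercase whitespace tokenization stands in for the
    paper's feature extraction, which is not specified in the abstract.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_pair(candidate, labeled_vectors):
    """Return the label of the most similar labeled pair (1-nearest-neighbor, assumed)."""
    best_label, best_score = None, -1.0
    for vector, label in labeled_vectors:
        score = cosine_similarity(candidate, vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: labels would come from an existing PPI database, not manual annotation.
training = [
    (relation_vector(["PROT1 binds PROT2", "PROT1 interacts with PROT2"]), "interacting"),
    (relation_vector(["PROT1 and PROT2 were measured separately"]), "non-interacting"),
]
candidate = relation_vector(["PROT1 binds PROT2 directly"])
print(classify_pair(candidate, training))
```

Each protein pair is represented by a single vector pooled over all of its sentences, which is what lets the decision rest on large-scale text rather than on one sentence.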
Source
《计算机科学》
CSCD
Peking University Core Journal
2013, No. 6, pp. 229-232, 251 (5 pages)
Computer Science
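Funding
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20103218120024)
Supported by the National Natural Science Foundation of China (61170043) and the National Natural Science Foundation of China Young Scientists Fund (61202132)
University Youth Science and Technology Innovation Fund (NS2012073)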
Keywords
Protein-protein interaction
Relational similarity
Syntactic analysis
Vector space model