Abstract
To make better use of the text of academic papers, identifying and extracting the academic information they contain has become a pressing practical need, with broad application prospects in text mining, information retrieval, topic monitoring, informetrics, and other fields. Academic information can be divided into five kinds: title (bibliographic) information, section information, citation information, reference information, and other information. This paper reviews the main methods for extracting each kind of academic information from the full text of academic papers in two formats, PDF and HTML/XML, and indicates which text format each method mainly targets and which kinds of information it can extract. Finally, the paper lists commonly used tools for extracting academic information.
Source
《数字图书馆论坛》
CSSCI
2017, No. 10, pp. 39-47 (9 pages)
Digital Library Forum
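Funding
Supported by the National Natural Science Foundation of China project "Research on Full-Text Citation Analysis Methods and Applications in the Context of Open Access" (Grant No. 71503031)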
Keywords
Academic Information
Full Text
Information Extraction
Machine Learning