Abstract
[Purpose/significance] The knowledge value of local chronicle resources remains severely underused. This paper explores automatic term extraction for a knowledge graph of local chronicles, aiming to solve the cold-start problem caused by the lack of a large-scale annotated corpus. [Method/process] We build a model named TFT with a three-layer architecture: text representation, feature extraction, and sequence tagging. Through distant supervision, knowledge is transferred from an annotated source-domain corpus to the target domain of local chronicle texts, and a CRF model, a traditional machine learning method, serves as the baseline for comparison. [Result/conclusion] The experimental results show that, in the text representation layer, Random < Char2Vec < BERT: relative to the weighted average F1 of random character vectors, Char2Vec improves it by about 3% overall and BERT by about 30%. In the sequence tagging layer, Softmax < CRF, and the CRF algorithm markedly improves the recognition of long entities. BERT-BiLSTM-CRF shows the strongest stability and is suitable for term extraction from local chronicles combined with transfer learning. [Limitations] Only chronicle-of-events texts from the local chronicles of a single locality were used for the target-domain evaluation; given the diversity of local chronicle corpora, this may affect the experimental results to some extent.
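The Softmax < CRF result in the sequence tagging layer can be illustrated with a minimal sketch. It is not the authors' implementation: the tag set, emission scores, and transition scores below are invented for illustration, and only decoding is shown, not training. A Softmax layer picks each character's tag independently, while a CRF layer scores whole tag sequences with transition terms, which helps keep long entities in one piece.

```python
# Illustrative sketch only: tags, emission scores, and transition scores
# are invented values, not taken from the paper.
TAGS = ["O", "B", "I"]

# One row of per-tag scores (O, B, I) for each of four characters.
EMISSIONS = [
    [0.1, 2.0, 0.2],
    [0.3, 0.1, 1.5],
    [1.0, 0.1, 0.8],  # a weak character inside the entity: "O" scores highest
    [0.2, 0.1, 1.4],
]

# TRANSITIONS[prev][cur]: score for moving from tag prev to tag cur.
TRANSITIONS = [
    [0.5, 0.0, -10.0],  # from O: O -> I is strongly penalised
    [-0.5, -2.0, 1.0],  # from B
    [0.0, -1.0, 0.8],   # from I
]


def softmax_decode(emissions):
    """Choose the highest-scoring tag for each character independently."""
    return [TAGS[max(range(len(TAGS)), key=lambda t: row[t])] for row in emissions]


def viterbi_decode(emissions, transitions):
    """Choose the highest-scoring tag sequence (emission + transition scores)."""
    n_tags = len(TAGS)
    score = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n_tags):
            best_prev = max(range(n_tags), key=lambda p: score[p] + transitions[p][cur])
            new_score.append(score[best_prev] + transitions[best_prev][cur] + row[cur])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path back from the best final tag.
    best = max(range(n_tags), key=lambda t: score[t])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return [TAGS[t] for t in reversed(path)]


if __name__ == "__main__":
    print("Softmax:", softmax_decode(EMISSIONS))
    print("CRF    :", viterbi_decode(EMISSIONS, TRANSITIONS))
```

With these invented scores, independent decoding breaks the four-character entity at the weak third character and produces an invalid O-to-I step, whereas sequence-level decoding keeps the entity as a single B-I-I-I span.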
Source
《情报理论与实践》
CSSCI
Peking University Core Journal (北大核心)
2021, Issue 4, pp. 176-184 (9 pages)
Information Studies: Theory & Application
Funding
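Major Bidding Project of the National Social Science Fund of China, "Research on the Disciplinary Construction of Information Science and the Future Development Path of Information Work" (Grant No. 17ZDA291)
An outcome of the Nanjing University "Special Project for Young Interdisciplinary Teams in the Humanities and Social Sciences" project "Semantic Analysis and Knowledge Graph Research on Local Chronicle Texts for Humanities Computing"
Supported by talent development programs including "Jiangsu Young Social Science Talents" and "Nanjing University Zhongying Young Scholars".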