Abstract
Unstructured defect description texts for ordinary-speed railway bridges are underutilized. To mine the value of this data, this paper proposes a named entity recognition (NER) method for such texts. The method first uses a Lattice LSTM model to extract features from both the contextual information and the word information of the text, which addresses the lack of explicit delimiters in Chinese and the low recognition efficiency for specialized vocabulary in domain-specific NER tasks. A self-attention mechanism is then introduced to capture the key features of local information in the data, a fully connected layer integrates all of the features, and a CRF module predicts the label sequence. Experimental data were obtained by manually annotating defect description texts for ordinary-speed railway bridges selected from the bridge and tunnel subsystem. The proposed method was validated on this data and compared with six baseline models: HMM, CRF, BiLSTM, BiLSTM-CRF, Bert-BiLSTM-CRF, and Lattice LSTM-CRF. The results indicate that the method can effectively recognize entities in defect description texts for ordinary-speed railway bridges, providing a reference for exploiting unstructured text data in this field.
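The abstract gives no implementation details, so the following is only a minimal sketch of the lexicon-matching step that typically feeds a Lattice LSTM: every dictionary word that matches a span of characters is collected, so that word information can be used without committing to a single segmentation. The function name `match_lattice_words`, the `max_len` parameter, the lexicon, and the example sentence are all hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple


def match_lattice_words(
    sentence: str, lexicon: Set[str], max_len: int = 4
) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every lexicon word found in the sentence.

    `end` is exclusive. Single characters are skipped because the character
    sequence itself already covers them. `max_len` caps the candidate word
    length (an illustrative assumption, not a value from the paper).
    """
    matches = []
    for start in range(len(sentence)):
        # Try every span of 2..max_len characters beginning at `start`.
        for end in range(start + 2, min(start + max_len, len(sentence)) + 1):
            word = sentence[start:end]
            if word in lexicon:
                matches.append((start, end, word))
    return matches


# Hypothetical lexicon and defect description, for illustration only.
LEXICON = {"桥墩", "混凝土", "混凝", "裂缝"}
SENTENCE = "桥墩混凝土裂缝"
LATTICE = match_lattice_words(SENTENCE, LEXICON)

for start, end, word in LATTICE:
    print(start, end, word)
```

Note that overlapping candidates such as "混凝" and "混凝土" are both kept. In a lattice model, such competing word paths are passed to the network alongside the character sequence rather than resolved by a segmenter beforehand.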
Authors
GUO Xinquan (郭心全)
LI Junbo (李俊波)
SHEN Kun (沈鹍)
WU Xia (吴霞)
LI Lin (李林)
LIU Jing (刘静)
Affiliations: Institute of Electronic Computing Technologies, China Academy of Railway Sciences Co., Ltd., Beijing 100081, China; Beijing Jingwei Information Technology Co., Ltd., Beijing 100081, China
Source
Intelligent Computer and Applications (《智能计算机与应用》)
2025, No. 5, pp. 199-204 (6 pages)
Funding
Research project of China Academy of Railway Sciences Corporation Limited (2023YJ131).