Abstract
Word segmentation errors can propagate into part-of-speech tagging, and features at single-syllable granularity are insufficient to capture contextual information fully, which leads to imprecise entity boundary recognition. To address these problems, a joint method for Tibetan automatic word segmentation and part-of-speech tagging that integrates lexical information is proposed. A lexical vector information database is constructed, and the syllable encodings output by BERT are fused with the corresponding word-level vectors to obtain a more comprehensive feature input, thereby enhancing the model's understanding of lexical semantics. A BERT+Softlexicon(BiLSTM)+CRF model that integrates Tibetan syllable and lexical features was trained on a part-of-speech tagging dataset of 70,000 sentences. Experimental results show that the method achieves an F1-score of 92.74% on a test corpus of 7,000 sentences, an improvement of 1.8% and 1.9% over the baseline integrated model and the large language model, respectively.
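The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which this record does not give. It assumes a SoftLexicon-style scheme in which each syllable collects the lexicon words that contain it at the Begin, Middle, End, or Single position, pools their vectors, and concatenates the result with the syllable vector. All names (`bmes_word_sets`, `mean_vector`, `fuse`), the toy lexicon, and the one-dimensional stand-ins for BERT syllable encodings are hypothetical. Plain mean pooling is used here in place of any weighting the actual model may apply.

```python
from typing import Dict, List, Sequence, Tuple

Word = Tuple[str, ...]  # a lexicon word, stored as a tuple of syllables


def bmes_word_sets(syllables: Sequence[str], lexicon: Dict[Word, List[float]],
                   max_len: int = 4) -> List[Dict[str, List[Word]]]:
    """For each syllable, collect matched lexicon words by the syllable's position in them."""
    sets = [{"B": [], "M": [], "E": [], "S": []} for _ in syllables]
    n = len(syllables)
    for i in range(n):
        for j in range(i, min(n, i + max_len)):
            word = tuple(syllables[i:j + 1])
            if word not in lexicon:
                continue
            if i == j:
                sets[i]["S"].append(word)      # single-syllable word
            else:
                sets[i]["B"].append(word)      # syllable begins the word
                sets[j]["E"].append(word)      # syllable ends the word
                for k in range(i + 1, j):
                    sets[k]["M"].append(word)  # syllable is inside the word
    return sets


def mean_vector(words: List[Word], lexicon: Dict[Word, List[float]],
                dim: int) -> List[float]:
    """Mean-pool word vectors; an empty set maps to a zero vector."""
    if not words:
        return [0.0] * dim
    return [sum(lexicon[w][d] for w in words) / len(words) for d in range(dim)]


def fuse(syllable_vecs: Sequence[Sequence[float]], syllables: Sequence[str],
         lexicon: Dict[Word, List[float]], dim: int) -> List[List[float]]:
    """Concatenate each syllable vector with its pooled B, M, E, S word vectors."""
    fused = []
    for vec, groups in zip(syllable_vecs, bmes_word_sets(syllables, lexicon)):
        lexical: List[float] = []
        for tag in "BMES":
            lexical.extend(mean_vector(groups[tag], lexicon, dim))
        fused.append(list(vec) + lexical)
    return fused


# Toy data: placeholder syllables and 2-dimensional word vectors (assumptions).
syllables = ["a", "b", "c"]
lexicon = {("a", "b"): [1.0, 0.0], ("b",): [0.0, 1.0], ("a", "b", "c"): [0.5, 0.5]}
syllable_vecs = [[0.1], [0.2], [0.3]]  # stand-ins for BERT syllable encodings
fused = fuse(syllable_vecs, syllables, lexicon, dim=2)
```

In a full model of the kind named in the abstract, the fused vectors would then pass through a BiLSTM and a CRF layer that predicts joint segmentation and part-of-speech labels; those layers are omitted here.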
Authors
完么措
华却才让
白颖
环科尤
张瑞
WAN Me-cuo; HUAQUE Cai-rang; BAI Ying; HUAN Ke-you; ZHANG Rui (School of Computer Science, Qinghai Normal University, Xining 810008, China; The State Key Laboratory of Tibetan Intelligence, Qinghai Normal University, Xining 810008, China; Key Laboratory of Tibetan Information Processing, Ministry of Education, Qinghai Normal University, Xining 810008, China)
Source
Computer Engineering and Design (《计算机工程与设计》)
PKU Core Journal (北大核心)
2025, No. 12, pp. 3578-3585 (8 pages)
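Funding
National Natural Science Foundation of China (62166034)
Fund of the State Key Laboratory of Tibetan Intelligent Information Processing and Application (2020-ZJ-Y05)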
Keywords
Tibetan part-of-speech tagging
integrated tagging
lexical enhancement
large language models
BERT
Tibetan word segmentation
feature fusion