Abstract
Semantic errors in Chinese differ from simple spelling and grammatical errors, as they are more inconspicuous and complex. Chinese Semantic Error Recognition (CSER) aims to determine whether a Chinese sentence contains semantic errors. As a prerequisite task for semantic proofreading, the performance of the recognition model is crucial for semantic error correction. To address the issue of CSER models ignoring the differences between syntactic structure and contextual structure when integrating syntactic information, a Hierarchical Information Enhancement Graph Convolutional Network (HIE-GCN) model was proposed to embed the hierarchical information of nodes in the syntactic tree into the context encoder, thereby reducing the gap between syntactic structure and contextual structure. Firstly, a traversal algorithm was used to extract the hierarchical information of nodes in the syntactic tree. Secondly, the hierarchical information was embedded into the BERT (Bidirectional Encoder Representations from Transformers) model to generate character features, the Graph Convolutional Network (GCN) used these character features as node features in the graph, and the feature vector of the entire sentence was obtained after graph convolution. Finally, a fully connected layer was used for one-class or multi-class semantic error recognition. Results of semantic error recognition and correction experiments conducted on the FCGEC (Fine-grained corpus for Chinese Grammatical Error Correction) and NaCGEC (Native Chinese Grammatical Error Correction) datasets show that, in the recognition task on the FCGEC dataset, compared with the baseline models, HIE-GCN improves the accuracy by at least 0.10 percentage points and the F1 score by at least 0.13 percentage points in one-class error recognition; in multi-class error recognition, the accuracy is improved by at least 1.05 percentage points and the F1 score by at least 0.53 percentage points. Ablation experimental results verify the effectiveness of hierarchical information embedding. Compared with Large Language Models (LLMs) such as GPT and Qwen, the proposed model's overall recognition performance is higher. In the correction experiments, compared with a sequence-to-sequence direct error correction model, the recognition-correction two-stage pipeline improves the correction precision by 8.01 percentage points. It is also found that, in the correction process of the LLM GLM4, providing the model with hints on the sentence's error type increases the correction precision by 4.62 percentage points.
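The first step described in the abstract, extracting each node's hierarchical (depth) information from the syntactic tree via a traversal algorithm, can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the `heads` input format (one parent index per token, `-1` for the root) and the function name are assumptions, and in the proposed model these depth values would then be added as extra embeddings to the BERT character representations.

```python
from collections import deque

def node_depths(heads):
    """Hypothetical sketch: given dependency heads (heads[i] is the
    parent index of token i, -1 for the root), return each token's
    depth in the syntactic tree via breadth-first traversal."""
    n = len(heads)
    children = [[] for _ in range(n)]
    root = 0
    for i, h in enumerate(heads):
        if h == -1:
            root = i          # the root token has no head
        else:
            children[h].append(i)
    depths = [0] * n
    queue = deque([root])     # traverse level by level from the root
    while queue:
        u = queue.popleft()
        for v in children[u]:
            depths[v] = depths[u] + 1
            queue.append(v)
    return depths

# Toy tree: token 1 is the root; tokens 0 and 2 attach to it; token 3 attaches to 2
print(node_depths([1, -1, 1, 2]))  # -> [1, 0, 1, 2]
```

A breadth-first traversal is one natural choice here because it assigns depths level by level, so each node's hierarchy level is known before its children are visited.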
Authors
ZHANG Yuqi
SHA Ying
ZHANG Yuqi; SHA Ying (College of Informatics, Huazhong Agricultural University, Wuhan Hubei 430070, China; Key Laboratory of Smart Breeding Technology, Ministry of Agriculture and Rural Affairs (Huazhong Agricultural University), Wuhan Hubei 430070, China; Hubei Engineering Technology Research Center of Agricultural Big Data (Huazhong Agricultural University), Wuhan Hubei 430070, China; Engineering Research Center of Agricultural Intelligent Technology, Ministry of Education (Huazhong Agricultural University), Wuhan Hubei 430070, China)
Source
《计算机应用》
Peking University Core Journal (北大核心)
2025, Issue 12, pp. 3771-3778 (8 pages)
Journal of Computer Applications
Funding
Supported by the National Natural Science Foundation of China (62272188).
Keywords
Natural Language Processing(NLP)
Graph Convolutional Network(GCN)
Chinese Semantic Error Recognition(CSER)
Large Language Model(LLM)
dependency parsing