A Generalization Template-Based Data Augmentation Method for Neural Machine Translation
Abstract: The Transformer model relies entirely on self-attention to learn contextual relationships, which removes the long-distance dependency limitation that traditional recurrent neural networks face in translation. In practice, however, its predictions for long sentences still show high perplexity, incoherent links between contexts, and even missing semantics. In addition, long sentences are scarce in existing public training corpora, which further hinders the model from learning good long-sentence representations. To alleviate these problems, this study proposes a data augmentation method that targets long-sentence translation. First, corresponding generalized language templates are extracted from the syntactic analysis results of the source and target sentences. Then, the extracted mutually translated template sentences are concatenated with the original sentence pairs to construct the final pseudo-data. For pseudo sentence pairs built with different concatenation strategies, experiments on translation tasks over multiple public datasets show that, compared with the baseline system, the generalized-template-based data augmentation method improves the translation performance of the model and effectively improves its accuracy on long sentences.
Authors: XIE Jingtian; WANG Jun; ZHAO Zhongchao; LI Fuxue; YAN Hong (College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110142, China; Liaoning Key Laboratory of Intelligent Technology for Chemical Process Industry, Shenyang 110142, China; College of Electrical Engineering, Yingkou Institute of Technology, Yingkou 115014, China)
Source: Journal of Shenyang University of Chemical Technology, 2025, No. 4, pp. 436-442 (7 pages)
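Funding: Natural Science Foundation of Liaoning Province (2022-MS-291, 2021-YKLH-12, 2022-YKLH-18); Scientific Research Project of the Education Department of Liaoning Province (LJ2020024); China University Industry-University-Research Innovation Fund (2021LD06009); Basic Scientific Research Project of the Education Department of Liaoning Province (LJKMZ20220781).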
Keywords: data generalization; syntactic analysis; data augmentation
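
The two steps named in the abstract (extracting a generalized template from a parse, then concatenating it with the original sentence pair) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the span format `(start, end, label)`, the `<NP>`-style placeholders, the `<sep>` token, and the function names `generalize` and `augment` are all hypothetical, and the spans are supplied by hand rather than by a real syntactic parser.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end_exclusive, phrase_label).
Span = Tuple[int, int, str]


def generalize(tokens: List[str], spans: List[Span]) -> List[str]:
    """Replace each labelled, non-overlapping span with a placeholder such as <NP>."""
    by_start = {}
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span: {(start, end, label)}")
        by_start[start] = (end, label)

    template, i = [], 0
    while i < len(tokens):
        if i in by_start:
            end, label = by_start[i]
            template.append(f"<{label}>")  # the whole phrase collapses to one tag
            i = end
        else:
            template.append(tokens[i])
            i += 1
    return template


def augment(src: List[str], tgt: List[str],
            src_spans: List[Span], tgt_spans: List[Span],
            sep: str = "<sep>", template_first: bool = False):
    """Build one pseudo sentence pair by joining each side with its template.

    `template_first` stands in for one possible "concatenation strategy";
    the strategies actually compared in the paper are not given in the abstract.
    """
    src_tpl = generalize(src, src_spans)
    tgt_tpl = generalize(tgt, tgt_spans)
    if template_first:
        return src_tpl + [sep] + src, tgt_tpl + [sep] + tgt
    return src + [sep] + src_tpl, tgt + [sep] + tgt_tpl


if __name__ == "__main__":
    src = "le chat noir dormait sur le tapis".split()
    tgt = "the black cat slept on the mat".split()
    spans = [(0, 3, "NP"), (5, 7, "NP")]  # hand-written stand-in for parser output
    pseudo_src, pseudo_tgt = augment(src, tgt, spans, spans)
    print(" ".join(pseudo_src))
    print(" ".join(pseudo_tgt))
```

Because each pseudo pair is longer than the sentence pair it was built from, adding such pairs to the training set raises the share of long examples, which is the scarcity the abstract sets out to address.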