Chinese Word Segmentation Method for Domain-Special Machine Translation
(适用于特定领域机器翻译的汉语分词方法)

Cited by: 4

Abstract: In developing a domain-specific Chinese-English machine translation system, the many new words that appear reduce the accuracy of Chinese word segmentation, and the lack of domain-specific annotated corpora makes it hard to improve supervised learning approaches. This leads to many errors in the extracted translation knowledge and seriously degrades translation quality. To address the problem, we implement a domain-adaptive segmentation model that exploits n-gram statistical features from raw corpora, and bilingually motivated Chinese word segmentation that uses information from a parallel corpus. We further propose a lattice-based method that combines multiple segmentation results and uses a dynamic programming algorithm to obtain the best segmentation. For evaluation, we ran Chinese word segmentation and Chinese-English machine translation experiments on the data of the NTCIR-10 Chinese-English patent task. The results show that the proposed method improves both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English statistical machine translation system.
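The lattice-and-dynamic-programming step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how lattice edges are scored, so the scoring here is an assumption made only for illustration: each candidate word is weighted by the number of segmenters that proposed it, multiplied by its length in characters. The function names `build_lattice` and `best_path` are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (start, end) character offsets of a candidate word


def build_lattice(sentence: str, segmentations: List[List[str]]) -> Dict[Edge, int]:
    """Merge several segmentations of one sentence into a lattice.

    Each distinct word span becomes one edge; its value counts how many
    segmenters proposed that span.
    """
    edges: Counter = Counter()
    for words in segmentations:
        if "".join(words) != sentence:
            raise ValueError("segmentation does not cover the sentence")
        pos = 0
        for word in words:
            edges[(pos, pos + len(word))] += 1
            pos += len(word)
    return dict(edges)


def best_path(sentence: str, edges: Dict[Edge, int]) -> List[str]:
    """Find the highest-scoring path through the lattice by dynamic programming.

    Assumed edge score (not from the paper): votes * word length, so every
    complete path is compared over the same number of characters.
    """
    n = len(sentence)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)  # best[i]: best score of a path ending at offset i
    back = [-1] * (n + 1)       # back[i]: start offset of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            votes = edges.get((start, end))
            if votes is None or best[start] == neg_inf:
                continue
            score = best[start] + votes * (end - start)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == neg_inf:
        raise ValueError("no complete path through the lattice")
    # Follow the back-pointers from the end of the sentence to recover the words.
    words: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


sentence = "特定领域机器翻译"
candidates = [
    ["特定", "领域", "机器", "翻译"],  # e.g. output of a domain-adapted segmenter
    ["特定领域", "机器", "翻译"],      # e.g. output of a bilingually guided segmenter
    ["特定", "领域", "机器翻译"],      # e.g. output of a baseline segmenter
]
lattice = build_lattice(sentence, candidates)
print(best_path(sentence, lattice))
```

In a real system the vote count would likely be replaced by model-based scores for each edge; the dynamic programming recurrence would stay the same.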

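Authors: Su Chen (苏晨), Zhang Yujie (张玉洁), Guo Zhen (郭振), Xu Jin'an (徐金安)

Affiliation: School of Computer and Information Technology, Beijing Jiaotong University

Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2013, Issue 5, pp. 184-190 (7 pages)

Funding: Beijing Jiaotong University Talent Fund (KKRC11001532)

Keywords: Chinese word segmentation; domain adaptation; bilingual motivation; lattice; machine translation

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]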

