
适用于特定领域机器翻译的汉语分词方法 (cited by: 4)

Chinese Word Segmentation Method for Domain-Special Machine Translation
Abstract: When developing a domain-specific Chinese-English machine translation system, the large number of new (unknown) words lowers the accuracy of Chinese word segmentation, and the lack of domain-specific annotated corpora makes it hard to improve supervised learning approaches. As a result, the extracted translation knowledge contains many errors, which seriously degrades translation quality. To address this domain adaptation problem, we implemented two segmenters: a domain-adaptive segmentation model that exploits n-gram statistical features from raw corpora, and a bilingually motivated segmenter that uses information from a parallel corpus. We further propose a lattice-based method that combines multiple segmentation results and uses a dynamic programming algorithm to select the best segmentation. For evaluation, we conducted Chinese word segmentation and Chinese-English machine translation experiments on the NTCIR-10 Chinese-English patent task data. The results show that the proposed combination method improves both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English statistical machine translation system.
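The lattice-and-dynamic-programming combination described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's edge-scoring function, so the score used here (number of segmenters proposing a word, multiplied by the word's length in characters) is purely an assumption for illustration, and the function names `build_lattice` and `best_segmentation` are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def build_lattice(sentence: str, segmentations: List[List[str]]) -> Dict[Span, int]:
    """Collect every word span proposed by any segmenter, with its vote count."""
    edges: Dict[Span, int] = {}
    for words in segmentations:
        if "".join(words) != sentence:
            raise ValueError("segmentation does not reproduce the sentence")
        pos = 0
        for word in words:
            span = (pos, pos + len(word))
            edges[span] = edges.get(span, 0) + 1
            pos += len(word)
    return edges


def best_segmentation(sentence: str, segmentations: List[List[str]]) -> List[str]:
    """Pick the highest-scoring path through the lattice by dynamic programming."""
    if not segmentations:
        raise ValueError("at least one segmentation is required")
    edges = build_lattice(sentence, segmentations)
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i]: best score of a path ending at offset i
    back = [0] * (n + 1)              # back[i]: start offset of the last word on that path
    best[0] = 0.0
    # Visiting edges in order of end offset guarantees best[start] is final.
    for (start, end), votes in sorted(edges.items(), key=lambda kv: (kv[0][1], kv[0][0])):
        # ASSUMED score: votes weighted by word length (not the paper's formula).
        score = best[start] + votes * (end - start)
        if score > best[end]:
            best[end] = score
            back[end] = start
    words: List[str] = []
    pos = n
    while pos > 0:  # follow back-pointers from the end of the sentence
        words.append(sentence[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


if __name__ == "__main__":
    sentence = "领域自适应分词模型"
    candidates = [
        ["领域", "自", "适应", "分词", "模型"],
        ["领域", "自适应", "分词模型"],
        ["领", "域", "自适应", "分词", "模型"],
    ]
    print(best_segmentation(sentence, candidates))
```

Weighting votes by word length keeps paths with many short words from winning merely because they contain more edges; a real system would more plausibly use model probabilities or tuned weights on the lattice edges.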
Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, PKU Core (北大核心), 2013, No. 5, pp. 184-190 (7 pages)
Funding: Beijing Jiaotong University Talent Fund (KKRC11001532)
Keywords: Chinese word segmentation; domain adaptation; bilingual motivation; lattice; machine translation
