Parallel corpora are of great importance to machine translation, and automatic sentence alignment is the first step in processing them. This paper puts forward a bilingual-dictionary-based sentence alignment method for Chinese-English parallel corpora, which differs from previous length-based algorithms in its knowledge-rich approach. Experimental results show that the method achieves over 93% accuracy with ordinary English-Chinese dictionaries whose translations cover 31.88%~47.90% of the corpus.
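The abstract does not spell out the alignment procedure, so the following is only a minimal sketch of the general idea of dictionary-based alignment. It assumes a hypothetical scoring rule (the fraction of English tokens with a dictionary translation found in the Chinese sentence) and a simple best-match selection; the names `dictionary_match_score` and `align_one_to_one` and the toy dictionary are illustrative, not the authors' method.

```python
def dictionary_match_score(en_tokens, zh_sentence, dictionary):
    """Fraction of English tokens that have at least one dictionary
    translation occurring as a substring of the Chinese sentence.
    (Assumed scoring rule, for illustration only.)"""
    if not en_tokens:
        return 0.0
    hits = 0
    for token in en_tokens:
        translations = dictionary.get(token.lower(), [])
        if any(t in zh_sentence for t in translations):
            hits += 1
    return hits / len(en_tokens)


def align_one_to_one(en_sentences, zh_sentences, dictionary):
    """For each English sentence, pick the Chinese sentence with the
    highest dictionary score (ties go to the earliest candidate).
    Returns a list of (english_index, chinese_index, score)."""
    if not zh_sentences:
        return []
    alignment = []
    for i, en in enumerate(en_sentences):
        tokens = en.split()
        scores = [dictionary_match_score(tokens, zh, dictionary)
                  for zh in zh_sentences]
        best = max(range(len(zh_sentences)), key=lambda j: scores[j])
        alignment.append((i, best, scores[best]))
    return alignment


# Toy data (hypothetical): a tiny dictionary with partial coverage.
toy_dictionary = {
    "machine": ["机器"],
    "translation": ["翻译"],
    "corpus": ["语料库"],
    "sentence": ["句子"],
    "alignment": ["对齐"],
}
english = ["machine translation needs a parallel corpus",
           "sentence alignment comes first"]
chinese = ["句子对齐是第一步", "机器翻译需要平行语料库"]

for en_idx, zh_idx, score in align_one_to_one(english, chinese, toy_dictionary):
    print(en_idx, zh_idx, score)
```

Even with a dictionary that covers only some of the words, the matched entries are enough to separate the two candidates in this toy case, which is the intuition behind a knowledge-rich alternative to length-based alignment.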
To eliminate the mismatch between the words of relevant documents and those of the user's query, and its serious negative effect on information retrieval performance, a query expansion method based on a new term co-occurrence representation is put forward by analyzing the process by which a query is produced. Expansion terms are selected according to their correlation with the whole query, and the position information between terms is also considered. Experimental results on the Text REtrieval Conference (TREC) data collection show that the proposed method consistently improves on the language modeling method without expansion by 5%~19%. Compared with pseudo feedback, the popular approach to query expansion, the precision of the proposed method is competitive.
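The abstract does not define its co-occurrence representation, so the sketch below only illustrates the two ideas it names: scoring a candidate against the whole query rather than against single query terms, and using term positions. It assumes a hypothetical windowed co-occurrence count as the correlation measure; `cooccurrence_counts`, `expand_query`, and the `window` parameter are illustrative names, not the paper's.

```python
from collections import Counter


def cooccurrence_counts(documents, window=5):
    """Count how often two distinct terms occur within `window`
    positions of each other (a simple use of position information)."""
    counts = Counter()
    for doc in documents:
        for i, left in enumerate(doc):
            for right in doc[i + 1 : i + 1 + window]:
                if left != right:
                    counts[frozenset((left, right))] += 1
    return counts


def expand_query(query_terms, documents, top_k=2, window=5):
    """Rank candidate terms by their summed co-occurrence with ALL
    query terms, so a candidate is judged against the whole query."""
    counts = cooccurrence_counts(documents, window)
    vocabulary = {term for doc in documents for term in doc}
    scores = {}
    for candidate in vocabulary - set(query_terms):
        score = sum(counts[frozenset((candidate, q))] for q in query_terms)
        if score > 0:
            scores[candidate] = score
    # Highest score first; alphabetical order breaks ties deterministically.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_k]


# Toy collection (hypothetical), already tokenized.
docs = [
    ["machine", "translation", "system", "evaluation"],
    ["machine", "translation", "corpus"],
    ["query", "expansion", "retrieval"],
]
print(expand_query(["machine", "translation"], docs))
```

Terms from the third document never co-occur with the query and are therefore never proposed as expansions.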
String similarity measures, namely edit distance, cosine correlation and the Dice coefficient, are adopted to evaluate machine translation results. An experiment shows that the evaluation method distinguishes well between "good" and "bad" translations. Another experiment shows consistency between the human and automatic scores of 6 general-purpose MT systems. Equational analysis validates the experimental results. Although the data and graphs are very promising, correlation coefficients and significance tests at the 0.01 level are computed to ensure the reliability of the results. Linear regression is used to map the automatic scores to human scores.
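The three measures named above have standard textbook definitions, sketched here over token sequences. Comparing a system output against a single reference translation at the word level is an assumption made for illustration; the abstract does not say which units or references the authors used.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def dice_coefficient(a, b):
    """Dice coefficient over the sets of distinct items in a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def cosine_similarity(a, b):
    """Cosine of the angle between the term-frequency vectors of a and b."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[t] * count_b[t] for t in count_a)
    norm = (math.sqrt(sum(v * v for v in count_a.values()))
            * math.sqrt(sum(v * v for v in count_b.values())))
    return dot / norm if norm else 0.0


# Hypothetical reference and system output, tokenized on whitespace.
reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()
print(edit_distance(reference, hypothesis))
print(dice_coefficient(reference, hypothesis))
print(cosine_similarity(reference, hypothesis))
```

Edit distance is a cost (lower is better), while Dice and cosine are similarities in [0, 1], so the former would need to be normalized or inverted before being combined with the others or regressed against human scores.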
A novel model based on structure alignments is proposed for statistical machine translation in this paper. The meta-structure and the sequence of meta-structures of a parse tree are defined. During translation, a parse tree is decomposed to deal with structure divergence, and alignments can be constructed at different levels of recombination of meta-structure (RM). This method can perform structure mapping across sub-tree structures between languages. As a result, we obtain not only the translation in the target language but also the sequence of meta-structures of its parse tree. Experiments show that the model, within the framework of a log-linear model, has better generative ability and significantly outperforms Pharaoh, a phrase-based system.
In this paper, we present a modular incremental statistical model for English full parsing. Unlike other full parsing approaches, in which the analysis of the sentence is a uniform process, our model separates full parsing into shallow parsing and sentence skeleton parsing. In shallow parsing, we perform POS tagging, base NP identification, prepositional phrase attachment and subordinate clause identification. In skeleton parsing, we use a layered feature-oriented statistical method. Modularity has the advantage that different problems in parsing can be solved with corresponding mechanisms. Feature-oriented rules can express complex linguistic phenomena at key points where needed. Evaluated on the Penn Treebank corpus, the model obtains 89.2% precision and 89.8% recall.
The design and implementation of EATS, a machine translation system for e-mail, are presented. The paper first puts forward the notion of an "instant machine translation service" and illustrates how it is provided in client-server mode in EATS. It then gives a panoramic view of how the Chinese-English bi-directional translation module is realized through a multi-engine strategy. The prototype of the system has been successfully demonstrated on the campus network in PPP mode, with 70%~80% translation accuracy.
Translation lexicons are fundamental to natural language processing tasks such as machine translation and cross-language information retrieval. This paper presents a lexicon builder that can automatically extract word translations (or assist lexicographers in compiling them) from a Chinese-English parallel corpus. Key mechanisms of the builder are described, including the co-occurrence measure, indirect association resolution and multi-word unit translation. Experimental results indicate the effectiveness of the authors' method and the potential of the lexicon builder system.
The complex sentence structure of English is a bottleneck for our practical machine translation system. Simplifying English subordinate clauses greatly relieves the burden of parsing and of other grammatical or semantic analysis of a complex sentence, and thus improves the output quality of the MT system. To our knowledge, however, no satisfactory research results have been reported in this field so far. This paper reports the author's work on a corpus-based approach to English subordinate clause identification. The approach integrates rule-based and statistical methods to find the left and right boundaries of subordinate clauses. The Penn Treebank corpus is used as the training standard. The precision and recall of subordinate clause identification are tested on both closed and open corpora: the closed test yields 92.9% precision and 91.26% recall, and the open test yields 80.34% precision and 83.93% recall. The algorithm has been integrated into our machine translation system. The method can also be applied to the processing of any other language.
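As a reminder of how precision and recall for clause identification are typically computed, here is a minimal sketch. It assumes clauses are represented as (left, right) token-index spans and that a prediction counts as correct only when both boundaries match the gold span exactly; the abstract does not state the matching criterion, and the spans below are made up.

```python
def precision_recall(predicted_spans, gold_spans):
    """Precision and recall of predicted clause spans against gold spans.
    A span is a (left_boundary, right_boundary) pair; only exact matches
    on both boundaries count as correct (assumed criterion)."""
    predicted, gold = set(predicted_spans), set(gold_spans)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical spans for one sentence.
gold = [(2, 7), (10, 15), (20, 24)]
predicted = [(2, 7), (10, 14), (20, 24), (30, 33)]
p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here the span (10, 14) misses the right boundary by one token and so counts against both precision and recall, which shows why boundary detection on both sides matters for the reported figures.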
Funding: the High Technology Research and Development Program of China (No. 2006AA01Z150); the National Natural Science Foundation of China (No. 60435020)
Funding: the National High Technology Research and Development Program of China (No. 200606010108.2006AA01Z150)