Abstract: Reading and writing are the main ways of interacting with web content. Text simplification tools help people with cognitive impairments, new language learners, and children, who may find complex web content difficult to understand. Text simplification is the process of transforming complex text into more readable and understandable text. Recent approaches to text simplification adopt the machine translation paradigm, learning simplification rules from a parallel corpus of complex and simple sentences. In this paper, we propose two models based on the transformer, an encoder-decoder architecture that achieves state-of-the-art (SOTA) results in machine translation. Training proceeds in three steps: preprocessing the data with a subword tokenizer, training the model with the Adam optimizer, and then decoding the output with the trained model. The first model uses the transformer alone; the second integrates Bidirectional Encoder Representations from Transformers (BERT) as the encoder to improve training time and results. The transformer-only model was evaluated with the Bilingual Evaluation Understudy (BLEU) score and achieved 53.78 on the WikiSmall dataset. In the experiment on the second, BERT-integrated model, the validation loss decreased much faster than for the model without BERT. However, its BLEU score was lower (44.54), possibly because the dataset was too small, so the model overfitted and failed to generalize well. Future work could therefore evaluate the second model on a larger dataset such as WikiLarge. In addition, further analysis of the models' results and of the dataset was carried out with different evaluation metrics to better understand their performance.
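A minimal PyTorch sketch of the pipeline this abstract describes: subword token ids fed to an encoder-decoder transformer trained with Adam. The vocabulary size, layer counts, and learning rate below are assumptions rather than the paper's configuration, and the random ids stand in for a subword-tokenized (e.g., SentencePiece) batch from WikiSmall.

```python
# Sketch of the abstract's three-step pipeline: (1) subword-tokenized input ids,
# (2) transformer training with Adam, (3) the model is later used for decoding.
# All sizes below are assumed for illustration, not the authors' settings.
import torch
import torch.nn as nn

VOCAB, D_MODEL, PAD = 8000, 256, 0  # assumed subword vocab size / model width

class SimplifierTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL, padding_idx=PAD)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True)
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src_ids, tgt_ids):
        # causal mask so the decoder cannot attend to future target tokens
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.embed(src_ids), self.embed(tgt_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)

model = SimplifierTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98))
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

# one training step on toy ids standing in for tokenized sentence pairs
src = torch.randint(1, VOCAB, (2, 16))   # complex sentences
tgt = torch.randint(1, VOCAB, (2, 12))   # simple sentences
logits = model(src, tgt[:, :-1])         # teacher forcing: target shifted right
loss = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
loss.backward(); opt.step(); opt.zero_grad()
```

At inference, the same model would be decoded greedily or with beam search, and the outputs scored against reference simplifications with BLEU (e.g., via the sacrebleu package).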
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 62076217 and 61906060) and the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China (IRT17R32).
Abstract: Unsupervised text simplification has attracted much attention due to the scarcity of high-quality parallel text simplification corpora. Recently, an unsupervised statistical text simplification system based on phrase-based machine translation (UnsupPBMT) achieved good performance; it initializes its phrase tables with similar words obtained from word embedding models. Since word embedding models only capture relatedness between words, the phrase tables in UnsupPBMT contain many dissimilar words. In this paper, we propose an unsupervised statistical text simplification method that uses the pre-trained language model BERT for initialization. Specifically, we use BERT as a general linguistic knowledge base for predicting similar words. Experimental results show that our method outperforms state-of-the-art unsupervised text simplification methods on three benchmarks, and even outperforms some supervised baselines.
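To illustrate the core idea of querying BERT as a "general linguistic knowledge base" for context-aware similar words, the sketch below masks a word and reads off BERT's top candidates at that position. The model checkpoint and example sentence are assumptions for illustration, and the actual system builds phrase tables from such predictions in a way this fragment does not show.

```python
# Hedged sketch: use BERT's masked-LM head to propose in-context substitution
# candidates for a word, the kind of "similar word" prediction the abstract
# describes (illustrative only, not the authors' exact procedure).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

sentence = "The committee will scrutinize the proposal."  # assumed example
target = "scrutinize"
masked = sentence.replace(target, tok.mask_token)

inputs = tok(masked, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits

# score the vocabulary at the [MASK] position; top-k are candidate replacements
pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
topk = logits[0, pos].topk(5, dim=-1).indices[0]
print([tok.decode([int(i)]).strip() for i in topk])
```

Because the predictions are conditioned on the full sentence, they tend to be genuine in-context substitutes rather than merely related words, which is the advantage over word-embedding-based phrase-table initialization that the abstract points out.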
Funding: Supported by the Chunhui Collaborative Research Project funded by the Ministry of Education of China [Grant No. 202200490] and the Humanities and Social Sciences Research Project funded by the Ministry of Education of China [Grant No. 23YJAZH139].
Abstract: With the development of machine translation technology, automatic pre-editing has attracted increasing research attention for its important role in improving translation quality and efficiency. This study uses UAM Corpus Tool 3.0 to annotate and categorize 99 key publications from 1992 to 2024, tracing the research paths and technological evolution of automatic pre-translation editing. The study finds that current approaches fall into four categories: controlled-language-based approaches, text simplification approaches, interlingua-based approaches, and large-language-model-driven approaches. By critically examining their technical features and applicability in various contexts, this review aims to provide insights that can guide the future optimization and expansion of pre-translation editing systems.