Funding: This work was supported by the National Natural Science Foundation of China (NSFC) (Grant Nos. 62236004, 62206078, and 62476073).
Abstract: The primary objective of Chinese grammatical error correction (CGEC) is to detect and correct errors in Chinese sentences. Recent research shows that large language models (LLMs) have been applied to CGEC with significant results. For LLMs, selecting appropriate reference examples can help improve their performance. However, existing methods predominantly rely on text similarity for example retrieval, a strategy that frequently mismatches actual error patterns and retrieves lexically similar yet grammatically irrelevant sentences. To address this problem, we propose a method named RE², which retrieves appropriate examples with explanations of grammatical errors. Instead of relying on the text similarity of the input sentence, we use explanations of grammatical errors to select reference examples, which LLMs then use to improve CGEC performance. We conduct experiments on two CGEC datasets and create a high-quality grammatical error explanation (GEE) dataset, which is not only used in our research but also serves as a valuable resource for future studies in both CGEC and GEE. The experimental results on the two datasets indicate that our proposed method effectively improves the performance of CGEC.
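A minimal sketch of explanation-based example retrieval as the abstract describes it: stored examples are ranked by the similarity of their grammatical-error explanations to the explanation generated for the input, rather than by surface similarity of the sentences themselves. The bag-of-words cosine scorer, the example-bank format, and whitespace tokenization are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_explanation: str, example_bank, k: int = 2):
    """Return the k examples whose stored error explanations are most
    similar to the explanation produced for the input sentence.

    example_bank: list of (example, explanation) pairs.
    """
    q = Counter(query_explanation.split())
    scored = [(cosine(q, Counter(expl.split())), ex)
              for ex, expl in example_bank]
    scored.sort(key=lambda pair: -pair[0])
    return [ex for _, ex in scored[:k]]
```

In practice a dense sentence encoder would likely replace the bag-of-words scorer, but the key design choice is the same: the retrieval key is the error explanation, not the erroneous sentence.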
Funding: This work was supported by the funds of the Beijing Advanced Innovation Center for Language Resources (TYZ19005) and the Research Program of the State Language Commission (ZDI135-105, YB135-89).
Abstract: Due to the lack of parallel data for the grammatical error correction (GEC) task, models based on the sequence-to-sequence framework cannot be adequately trained to reach higher performance. We propose two data synthesis methods that can control the error rate and the ratio of error types in synthetic data. The first approach corrupts each word in a monolingual corpus with a fixed probability, using replacement, insertion, and deletion. The second approach trains error generation models and then filters their decoding results. Experiments on different synthetic datasets show that an error rate of 40% combined with an equal ratio of error types improves model performance the most. Finally, we synthesize about 100 million examples and achieve performance comparable to the state of the art, which uses twice as much data as we do.
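The first synthesis approach can be sketched as follows: each token in a clean sentence is corrupted with a fixed probability, and the corruption operation is drawn from replacement, insertion, and deletion. The vocabulary, the uniform choice among operations, and the default rate are illustrative assumptions; the paper controls the ratio of error types rather than necessarily sampling them uniformly.

```python
import random

def corrupt_sentence(tokens, vocab, error_rate=0.4, seed=None):
    """Corrupt each token with probability `error_rate` by randomly
    replacing it, inserting a random word after it, or deleting it.

    tokens: list of words from a clean monolingual sentence.
    vocab:  list of candidate words for replacement/insertion.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if rng.random() < error_rate:
            op = rng.choice(["replace", "insert", "delete"])
            if op == "replace":
                out.append(rng.choice(vocab))
            elif op == "insert":
                out.append(tok)
                out.append(rng.choice(vocab))
            # "delete": the token is simply dropped
        else:
            out.append(tok)
    return out
```

Pairing the corrupted output with the original sentence yields a synthetic (source, target) training pair; sweeping `error_rate` reproduces the kind of comparison the abstract reports, where 40% worked best.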