Abstract: Creating a parallel corpus for machine translation is a challenging and time-consuming task, especially in a linguistically diverse country like the Philippines, which has 185 languages. Although a wealth of text is available, annotated data is scarce, particularly for languages like Bikol. Bikol is one of the major languages of the Philippines; however, it is underrepresented in the digital sphere, which is attributed to the absence of annotated data. This study outlines the development of BFParCo, a proposed gold-standard Bikol and Filipino parallel corpus. The corpus was refined through manual phrase alignment, translation, and evaluation. Subsequently, T5 and mT5 transformer models were fine-tuned on the parallel corpus and evaluated using the Bilingual Evaluation Understudy (BLEU) metric. The results showed a notable improvement in BLEU score after fine-tuning, with an increase of 60.68 for BIK→FIL and 58.93 for FIL→BIK translation. Additionally, human evaluators comprehensively assessed the fine-tuned models' output using the Multidimensional Quality Metrics and Scalar Quality Metrics error taxonomies. The fine-tuned models were then made publicly accessible through Hugging Face. This study represents a significant stride in advancing machine translation tools for the Bikol and Filipino languages.