Abstract
Corpora play a vital role in linguistics and natural language processing. The BCC corpus, developed by Beijing Language and Culture University, is well regarded for its rich resources and efficient retrieval. However, the complexity of its search query language limits its accessibility and wider adoption. To address this, this paper proposes the TextToBCC model, which enables natural language retrieval over the BCC corpus. First, a balanced dataset of BCC search queries is constructed, and a large language model is used to generate a natural language description for each query. Second, a large language model is fine-tuned to convert natural language into BCC search queries. Experimental results demonstrate the strong performance of the TextToBCC model. This work lowers the barrier to using the BCC corpus and helps promote its dissemination and application in a wider range of fields, benefiting both linguistic research and natural language processing practice.
Authors
刘廷超
鲁鹿鸣
荀恩东
靳泽莹
杨兆勇
Tingchao Liu; Luming Lu; Endong Xun; Zeying Jin; Zhaoyong Yang
Source
《语料库语言学》
2025, No. 1, pp. 1-16 (16 pages)
Corpus Linguistics
Keywords
corpus
search query
large language model
fine-tuning