Abstract
Cihai (《辞海》) is an important asset of Chinese culture and has great research value. Word segmentation is the foundation for research on the digital Cihai, but the content of Cihai is complex: it covers many types of ancient text and a wide range of knowledge domains, which makes the segmentation task challenging. To address these characteristics, this paper proposes a word segmentation method based on deep learning. First, the content of Cihai is preprocessed to remove ancient-text content such as classical Chinese prose, poems, and songs. Second, the Xinhua Dictionary (《新华字典》) is chosen as the training corpus, and the CBOW model is used to train character vectors. Finally, a BI-LSTM-CRF model performs the Cihai word segmentation task. Experimental results show that the proposed method performs well, with precision, recall, and F1 reaching 94.18%, 94.09%, and 94.13%, respectively.
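The reported scores can be read against the usual word-level definitions of precision, recall, and F1 for segmentation. The sketch below is only an illustration under stated assumptions: it treats the first reported figure as precision (the reported F1 is consistent with that reading), it scores a predicted word as correct only when both of its boundaries match the gold segmentation, and the names `to_spans`, `prf`, and `f1_from` are hypothetical helpers, not taken from the paper.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(gold_sentences, pred_sentences):
    """Word-level precision, recall and F1 over parallel segmented sentences."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        correct += len(gold_spans & pred_spans)  # words with both boundaries right
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy example (invented for illustration, not data from the paper).
gold = [["辞海", "是", "工具书"]]
pred = [["辞海", "是", "工具", "书"]]
p, r, f1 = prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")

# Consistency check on the figures quoted in the abstract.
print(round(f1_from(94.18, 94.09), 2))
```

In the toy example two of the four predicted words match the three gold words, so precision is 2/4 and recall is 2/3. Applying `f1_from` to the abstract's 94.18% and 94.09% gives a value that rounds to the reported 94.13%.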
Authors
陈美
李顿伟
高洪美
吴小丽
CHEN Mei; LI Dun-wei; GAO Hong-mei; WU Xiao-li (Shanghai Development Center of Computer Software Technology, Shanghai 201112)
Source
《现代计算机》
2020, Issue 16, pp. 60-64, 82 (6 pages in total)
Modern Computer
Funding
Shanghai Science and Technology Talent Program Project (No. 18PJ1431600).