Abstract
Starting from the Sogou query log corpus and an analysis of its characteristics, this paper proposes an automatic phrase recognition method based on the SVM_HMM model. The features used are the word itself, part of speech, position, query-string frequency, and syllable count. Multiple comparative experiments are conducted on "V+N" and "V+V" phrases. The results show that adding context information improves the efficiency of phrase recognition, and that the syllable-count and position features have little influence on the experimental results. The work provides technical support for building phrase dictionaries for search engines and directional guidance for further research on phrase category recognition.
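The feature set named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the feature keys, the `window` parameter, and the "v"/"n" tag labels are all hypothetical, and syllable count is approximated as the number of characters in a word (an assumption that holds for typical Chinese words, where one character is one syllable). The record does not specify how the paper encodes features for SVM_HMM, so the sketch stops at building per-token feature dictionaries.

```python
from typing import Dict, List, Tuple, Union

Token = Tuple[str, str]  # (word, part-of-speech tag); tag names are assumed
FeatureValue = Union[str, int]


def token_features(tokens: List[Token], index: int, query_freq: int,
                   window: int = 1) -> Dict[str, FeatureValue]:
    """Build the features for one token of a segmented, POS-tagged query."""
    word, pos = tokens[index]
    feats: Dict[str, FeatureValue] = {
        "word": word,              # the word itself
        "pos": pos,                # part-of-speech tag
        "position": index,         # position of the word in the query
        "query_freq": query_freq,  # how often the whole query string was searched
        "syllables": len(word),    # assumption: one character = one syllable
    }
    # Context information: neighbouring words and tags inside the window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = index + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset:+d}]"] = tokens[j][0]
            feats[f"pos[{offset:+d}]"] = tokens[j][1]
    return feats


def query_features(tokens: List[Token], query_freq: int,
                   window: int = 1) -> List[Dict[str, FeatureValue]]:
    """Return one feature dictionary per token, in query order."""
    return [token_features(tokens, i, query_freq, window)
            for i in range(len(tokens))]


if __name__ == "__main__":
    # Hypothetical "V+N" query with a made-up frequency of 120.
    example = [("下载", "v"), ("音乐", "n")]
    for feats in query_features(example, query_freq=120):
        print(feats)
```

Widening `window` adds more neighbouring words and tags to each token, which is one simple way to vary the amount of context information between experiments.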
Source
Journal of Beijing Information Science and Technology University (《北京信息科技大学学报(自然科学版)》)
2012, No. 2, pp. 53-58 (6 pages)
Funding
National Social Science Fund of China (09CYY021)