
Compression Method of Language Model Based on Clustering Algorithm and Multistep Indexing
Abstract: Because the training corpus is large, the ARPA-format statistical language model files that the SRILM toolkit produces are too big, which slows lookups and consumes a great deal of storage. To address this problem, and drawing on the ideas of clustering and indexed lookup, this paper proposes a compression method that uses the K-means clustering algorithm to compress the transition probabilities and back-off probabilities in the language model, and uses multilevel indexing to speed up lookups. Theoretical analysis and experiments show that the method achieves a very good compression ratio while limiting the effect of compression-induced distortion on word selection. It also improves lookup efficiency on the language model file and noticeably improves the response speed of the input method.
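The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the codebook size, the index layout, or the initialization, so the following are all assumptions made for illustration: the one-byte codes, the two-level bigram index (first word, then binary search on the second word), the evenly spaced initial centroids, and the hypothetical names `kmeans_1d`, `nearest`, `build_index`, and `lookup`. Log-probabilities are clustered with one-dimensional K-means, each n-gram then stores only the index of its nearest centroid, and a lookup first narrows the search to one slice of the table.

```python
import bisect


def nearest(centroids, v):
    """Index of the centroid closest to v (centroids must be sorted)."""
    pos = bisect.bisect_left(centroids, v)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(centroids)]
    return min(candidates, key=lambda i: abs(centroids[i] - v))


def kmeans_1d(values, k, iterations=20):
    """Plain 1-D K-means (Lloyd's algorithm); returns sorted centroids."""
    uniq = sorted(set(values))
    k = min(k, len(uniq))
    # Assumption: deterministic start at evenly spaced sorted values.
    centroids = [uniq[i * (len(uniq) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        sums, counts = [0.0] * k, [0] * k
        for v in values:
            i = nearest(centroids, v)
            sums[i] += v
            counts[i] += 1
        # An empty cluster keeps its previous centroid.
        centroids = sorted(
            sums[i] / counts[i] if counts[i] else centroids[i] for i in range(k)
        )
    return centroids


def build_index(bigrams, codebook):
    """Build a two-level index from {(w1, w2): log10_prob}; needs <= 256 centroids."""
    first_level = {}     # level 1: w1 -> (start, end) slice of the flat arrays
    seconds = []         # level 2: w2 ids, sorted inside each w1 slice
    codes = bytearray()  # one codebook index (1 byte) per bigram
    for (w1, w2), logp in sorted(bigrams.items()):
        start, _ = first_level.get(w1, (len(seconds), 0))
        seconds.append(w2)
        codes.append(nearest(codebook, logp))
        first_level[w1] = (start, len(seconds))
    return first_level, seconds, codes


def lookup(index, codebook, w1, w2):
    """Return the quantized log-probability of (w1, w2), or None if absent."""
    first_level, seconds, codes = index
    if w1 not in first_level:
        return None
    start, end = first_level[w1]
    pos = bisect.bisect_left(seconds, w2, start, end)
    if pos < end and seconds[pos] == w2:
        return codebook[codes[pos]]
    return None


# Toy data: word ids and log10 probabilities are made up for illustration.
bigrams = {(1, 2): -0.30, (1, 3): -0.31, (2, 1): -1.50, (2, 4): -1.52, (3, 1): -2.90}
codebook = kmeans_1d(list(bigrams.values()), k=3)
index = build_index(bigrams, codebook)
approx = lookup(index, codebook, 2, 4)  # close to the original -1.52
```

In this sketch each probability costs one byte plus a share of the small codebook, instead of a full floating-point value. The price is a quantization error bounded by the distance to the nearest centroid, which is the distortion the abstract says should have little effect on word selection.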
Source: Computer Technology and Development (《计算机技术与发展》), 2012, No. 12, pp. 25-28 (4 pages)
Funding: National Key Basic Research Development Program of China ("973" Program), project 2011CB808300
Keywords: language model; compression method; K-means clustering algorithm; multilevel index

