Abstract
This paper first analyzes, in terms of information cost, how much information is needed to recognize a Chinese character. The analysis shows that single-character recognition is an equal-probability model and therefore requires the most information. Chinese text can instead be treated as a Markov model, in which the occurrence of the current character depends only on the preceding m characters. Statistics gathered from text yield a large body of linguistic statistical information, and on this basis a sentence-based automatic text recognition method that exploits linguistic knowledge is designed. During recognition, the character currently being recognized is matched only against the set of characters that can follow the previous character (its successor set); once a whole sentence has been recognized, it undergoes linguistic-knowledge processing before the result is output. As a result, both recognition speed and recognition rate are markedly higher than with single-character recognition.
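The successor-set restriction described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes a first-order model (m = 1), a greedy left-to-right decision, and a hypothetical `score(obs, ch)` matcher standing in for the shape-matching step. The names `build_successor_sets` and `recognize_sentence` and the toy corpus are illustrative only, and the sentence-level linguistic post-processing is omitted.

```python
from collections import defaultdict


def build_successor_sets(corpus):
    """Map each character to the set of characters observed directly after it."""
    successors = defaultdict(set)
    for sentence in corpus:
        for prev, cur in zip(sentence, sentence[1:]):
            successors[prev].add(cur)
    return successors


def recognize_sentence(observations, vocabulary, successors, score):
    """Greedy first-order recognition (an assumption, not the paper's exact method).

    `score(obs, ch)` is a hypothetical matcher: higher means a better match.
    """
    result = []
    for obs in observations:
        candidates = vocabulary
        if result:
            # Match only within the successor set of the previously recognized
            # character; fall back to the full vocabulary if that set is empty.
            candidates = successors.get(result[-1]) or vocabulary
        # sorted() makes tie-breaking deterministic.
        result.append(max(sorted(candidates), key=lambda ch: score(obs, ch)))
    return "".join(result)


# Toy data: each observation maps candidate characters to a similarity score.
corpus = ["汉字识别", "汉语文本", "文本识别"]
vocabulary = {ch for sentence in corpus for ch in sentence}
successors = build_successor_sets(corpus)


def toy_score(obs, ch):
    return obs.get(ch, 0.0)


observations = [
    {"汉": 0.9},
    {"字": 0.5, "本": 0.6},  # in isolation, "本" would win
    {"识": 0.8},
    {"别": 0.9},
]
recognized = recognize_sentence(observations, vocabulary, successors, toy_score)
print(recognized)
```

In this toy run the second observation matches "本" best in isolation, but "本" is not in the successor set of "汉", so the restricted search selects "字". Shrinking the candidate set in this way is also what reduces the number of comparisons per character.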
Source
《计算机研究与发展》
EI
CSCD
PKU Core Journal
1998, No. 7, pp. 668-672 (5 pages)
Journal of Computer Research and Development
Keywords
linguistic knowledge
Chinese text
Chinese character recognition
Chinese character information processing
Markov model