Abstract
Chinese characters are currently represented on computers by several code pages, and the coexistence of these code pages will persist for a long time. Using Chinese code pages such as GB2312, GBK, GB18030, BIG-5, HKSCS and ISO10646/Unicode side by side requires automatic code page recognition. The screen real-time paraphrase engine is the core technology behind many online dictionaries and teaching software, but existing engines cannot work across code pages, and their word capture is incomplete and sometimes incorrect. To address these problems, this paper describes the system architecture of a Chinese screen real-time paraphrase engine that combines automatic code page recognition based on Chinese character internal codes with an optimized technique for automatically capturing words from the screen. It also presents the design of the data dictionary and the key techniques used in that design. In a test on samples of about five million Chinese characters, an online dictionary built on this engine recognized the code page of meaningful short strings (excluding single characters) with an accuracy of more than 99%.
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core (北大核心)
2005, No. 5, pp. 90-96 (7 pages)
Funding
Supported by the Natural Science Foundation of Jiangsu Higher Education Institutions (01kjb520001, 04KKB320134)
Keywords
computer application
Chinese information processing
automatic recognition of Chinese character code pages
screen word capture
ISO10646