Abstract
Using two digitization projects for historic English newspapers at the British Library, Reshelp and Burney, the author runs an experiment measuring the OCR accuracy of text-based digital images. The results show that overall accuracy is not high and that accuracy decreases in the following order: characters, words, significant words, and significant words beginning with a capital letter. The OCR cycle is then divided into four stages: acquisition of the object to be scanned, production of the digital image, processing of the digital image, and text recognition. For each stage, the author analyzes the factors that affect accuracy and discusses specific measures for improving it.
Source
Documentation, Information & Knowledge (《图书情报知识》)
CSSCI
Peking University Core Journal
2010, No. 3, pp. 62-67 (6 pages)
Funding
Supported by the Program for Science and Technology Innovation Talents in Universities of Henan Province (2008-551)
Keywords
OCR recognition
Accuracy measurement
Information resource digitization