Abstract
To separate text from graphics correctly, this paper proposes a text/graphics separation algorithm based on the group feature of characters, namely that characters almost always occur in strings. Starting from the short line segments produced by line vectorization, the algorithm extracts the outer contours of the corresponding connected regions, subject to a contour-length limit. It then applies size and density criteria to the connected regions that satisfy the contour-length condition to pick out candidate characters. Finally, it uses the string-level group feature to absorb the characters and symbols that were missed, separating all kinds of text, including annotations, headings, and the characters in the title block and parts list, from the graphics bitmap. Experiments show that the algorithm identifies characters more reliably, especially characters and symbols that are hard to judge from the geometric features of a connected region alone, such as "I", "i", and "1". It also preserves the integrity of character strings, is highly adaptable, and gives good results.
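The last two stages described in the abstract (size and density screening, then string-based absorption) can be illustrated with a minimal Python sketch. It is not the authors' implementation: plain 8-connected component labeling stands in for the vectorization-driven, length-limited contour extraction, strings are assumed to run horizontally, and every name and threshold (`max_size`, `min_aspect`, `max_aspect`, `min_density`, `max_gap`) is a hypothetical value chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)
class Component:
    """Inclusive bounding box and pixel count of one connected region."""
    x0: int
    y0: int
    x1: int
    y1: int
    pixels: int = 0

    @property
    def width(self):
        return self.x1 - self.x0 + 1

    @property
    def height(self):
        return self.y1 - self.y0 + 1

    @property
    def density(self):
        return self.pixels / (self.width * self.height)


def connected_components(img):
    """Label 8-connected foreground regions of a binary image (rows of 0/1)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            c = Component(x, y, x, y)
            while queue:
                cx, cy = queue.popleft()
                c.pixels += 1
                c.x0, c.x1 = min(c.x0, cx), max(c.x1, cx)
                c.y0, c.y1 = min(c.y0, cy), max(c.y1, cy)
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            comps.append(c)
    return comps


def is_candidate(c, max_size=6, min_aspect=0.5, max_aspect=2.0, min_density=0.3):
    """Size and density criteria (hypothetical thresholds, in pixels)."""
    if max(c.width, c.height) > max_size:
        return False
    return min_aspect <= c.width / c.height <= max_aspect and c.density >= min_density


def same_string(a, b, max_gap):
    """Assumed group test: boxes share rows and are at most max_gap pixels apart."""
    h_gap = max(a.x0 - b.x1, b.x0 - a.x1) - 1
    v_overlap = min(a.y1, b.y1) - max(a.y0, b.y0) + 1
    return h_gap <= max_gap and v_overlap > 0


def separate(img, max_size=6, max_gap=2):
    """Return (text, graphics) lists of components."""
    comps = connected_components(img)
    text = [c for c in comps if is_candidate(c, max_size=max_size)]
    graphics = [c for c in comps if not is_candidate(c, max_size=max_size)]
    changed = True
    while changed:  # repeat until no further component joins a string
        changed = False
        for o in list(graphics):
            if max(o.width, o.height) > max_size:
                continue  # too large to be a character: stays graphics
            if any(same_string(o, t, max_gap) for t in text):
                graphics.remove(o)
                text.append(o)
                changed = True
    return text, graphics
```

In this sketch a thin stroke such as "I" fails the aspect-ratio test on its own but is absorbed because it sits inside a string of accepted candidates, while a long drawing line is never absorbed because it exceeds the size limit.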
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journal (北大核心)
2007, No. 8, pp. 242-245 (4 pages)
Funding
Supported by the National Key Promotion Program for Scientific and Technological Achievements (2004EC000096)
Keywords
text/graphics separation
engineering drawing
vectorization
group feature
contour extraction