Automatic Generation of Training Character Samples for OCR Systems

ㆍ Authors: Le, Ha; Kim, Soo-Hyung; Na, In-Seop; Do, Yen; Park, Sang-Cheol; Jeong, Sun-Hwa
ㆍ Journal: International Journal of Contents
ㆍ Volume/Issue: 2012 | Vol. 8, No. 3 | pp. 83-93 (11 pages)
ㆍ Publisher: The Korea Contents Association
ㆍ File information: Periodical | ENG | PDF text
ㆍ Subject area: Other

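This paper is a full text provided free of charge through a paper-linking arrangement with the Korea Institute of Science and Technology Information (KISTI).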

Abstract

In this paper, we propose a novel method that automatically generates real character images to familiarize existing OCR systems with new fonts. First, we generate synthetic character images using a simple degradation model. The synthetic data are used to train an OCR engine, and the trained engine is then used to recognize and label real character images segmented from ideal document images. Because the OCR engine cannot accurately recognize all real character images, a substring matching method is employed to correct wrongly labeled characters by comparing two strings: one is the string formed by grouping the recognized characters in an ideal document image, and the other is the ordered string of the characters that we intend to train and recognize. Based on this method, we build a system that automatically generates the 2,350 most common Korean characters and 117 alphanumeric characters from new fonts. The ideal document images used in the system are postal envelope images with characters printed in ascending order of their codes. The proposed system achieved a labeling accuracy of 99%. We therefore believe that our system is effective in facilitating the generation of numerous character samples to enhance the recognition rate of existing OCR systems for fonts on which they have never been trained.
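
The label-correction step can be illustrated with a minimal sketch. The abstract does not specify the matching algorithm, so the code below is an assumption: it substitutes a generic sequence alignment from Python's standard `difflib` module for the paper's substring matching method. The function name `correct_labels` and its arguments are hypothetical. The idea shown is only that the OCR output for one document is aligned against the known print order, and misread positions take their label from the known order.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def correct_labels(recognized: str, expected: str) -> List[Optional[str]]:
    """Return one label per recognized character, corrected against `expected`.

    recognized: OCR output for the segmented character images, in reading order.
    expected:   the characters in the order they are known to be printed
                (for example, ascending code order).
    Positions that cannot be aligned are labeled None so they can be discarded.
    NOTE: hypothetical helper; a stand-in for the paper's substring matching.
    """
    labels: List[Optional[str]] = [None] * len(recognized)
    matcher = SequenceMatcher(None, recognized, expected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # OCR output agrees with the expected order: keep these labels.
            labels[i1:i2] = list(recognized[i1:i2])
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of characters on both sides: treat them as misreads
            # and take the labels from the expected string.
            labels[i1:i2] = list(expected[j1:j2])
        # "delete": spurious segments keep the label None.
        # "insert": expected characters without a segment yield no label.
    return labels
```

For example, `correct_labels("ABCXEF", "ABCDEF")` returns `['A', 'B', 'C', 'D', 'E', 'F']`, because the misread `X` is relabeled from the expected order, while `correct_labels("ABQCD", "ABCD")` returns `['A', 'B', None, 'C', 'D']`, because the extra segment has no counterpart.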

Keywords

Character Sample Generation; Optical Character Recognition; Postal Envelope Images; Training Samples; Degradation Model; Substring Matching
