A Study on the Effect of Bigrams on Text Categorization Performance

ㆍ Authors: Lee, Chan-Do (이찬도); Choi, Joon-Young (최준영)
ㆍ Journal: Journal of information technology applications & management
ㆍ Volume/Issue: 2005 | Vol. 12, No. 2 | pp. 15-27 (13 pages)
ㆍ Publisher: Korea Database Society (한국데이타베이스학회)
ㆍ File information: Periodical | PDF text
ㆍ Subject area: Other

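The full text of this paper is provided free of charge through a paper-linking arrangement with the Korea Institute of Science and Technology Information (한국과학기술정보연구원).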

Abstract (other language)

Text categorization systems generally use single words (unigrams) as features. This paper investigates a deceptively simple algorithm for improving text categorization, based on an idea that earlier work reported as ineffective: identifying useful word pairs (bigrams) formed from adjacent unigrams. The bigrams the algorithm finds, though few in number, can substantially raise the quality of the feature set. The algorithm was tested on two pre-classified datasets, Reuters-21578 for English and Korea-web for Korean. The results show that it extracted high-quality bigrams and increased the overall quality of the features. To examine the role of bigrams, we trained Naïve Bayes classifiers using both unigrams and bigrams as features. Recall was higher than with unigrams alone. Break-even points and F1 values improved for most documents, especially when documents were classified into the large classes. On Reuters-21578, break-even points increased by 2.1%, with a maximum of 18.8%, and F1 improved by 1.5%, with a maximum of 3.2%. On Korea-web, break-even points increased by 1.0%, with a maximum of 4.5%, and F1 improved by 0.4%, with a maximum of 4.2%. We conclude that text classification using unigrams and bigrams together is more effective than using unigrams alone.
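
The general idea in the abstract (adjacent-word bigrams added to unigram features, then a Naïve Bayes classifier trained on both) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not state how useful bigrams are selected, so the document-frequency threshold `min_doc_freq` below is an assumption, and the names `select_bigrams`, `unigrams_and_bigrams`, and `NaiveBayes`, along with the toy documents, are hypothetical.

```python
import math
from collections import Counter, defaultdict


def select_bigrams(docs, min_doc_freq=2):
    """Keep adjacent-word pairs seen in at least min_doc_freq documents.

    Assumption: a plain document-frequency cutoff stands in for the
    paper's selection criteria, which the abstract does not describe.
    """
    doc_freq = Counter()
    for tokens in docs:
        # A set, so each bigram counts once per document.
        doc_freq.update({a + "_" + b for a, b in zip(tokens, tokens[1:])})
    return {bigram for bigram, n in doc_freq.items() if n >= min_doc_freq}


def unigrams_and_bigrams(tokens, allowed_bigrams=None):
    """Return the unigrams followed by the (optionally filtered) bigrams."""
    feats = list(tokens)
    for a, b in zip(tokens, tokens[1:]):
        bigram = a + "_" + b
        if allowed_bigrams is None or bigram in allowed_bigrams:
            feats.append(bigram)
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, feature_lists, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(feature_lists, labels):
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in feats:
                if f in self.vocab:  # ignore features unseen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only.
train = [
    ("interest rate rises again".split(), "finance"),
    ("bank raises interest rate".split(), "finance"),
    ("wheat crop yield falls".split(), "grain"),
    ("wheat crop damaged by frost".split(), "grain"),
]
docs = [tokens for tokens, _ in train]
labels = [label for _, label in train]

bigrams = select_bigrams(docs, min_doc_freq=2)
model = NaiveBayes().fit(
    [unigrams_and_bigrams(d, bigrams) for d in docs], labels
)
prediction = model.predict(
    unigrams_and_bigrams("interest rate falls".split(), bigrams)
)
```

In this toy setup only two bigrams pass the threshold, which mirrors the abstract's remark that the useful bigrams are few in number. A real experiment would also need the recall, F1, and break-even evaluation the abstract reports, which this sketch omits.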

Keywords

Automated Text Categorization, Text Classification, Machine Learning, Bigram Algorithm
