KR-WordRank: An Unsupervised Korean Word Extraction Method That Improves on WordRank

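ㆍ Authors: Hyun-Joong Kim, Sungzoon Cho, Pilsung Kang
ㆍ Journal: Journal of the Korean Institute of Industrial Engineers (대한산업공학회지)
ㆍ Volume/issue: 2014 | Vol. 40, No. 1 | pp. 18-33 (16 pages)
ㆍ Publisher: Korean Institute of Industrial Engineers (대한산업공학회)
ㆍ File information: Periodical | PDF text
ㆍ Subject area: Other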

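The full text of this paper is provided free of charge through an article-linking arrangement with the Korea Institute of Science and Technology Information (한국과학기술정보연구원).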

Abstract

A word is the smallest unit for text analysis, and the premise behind most text-mining algorithms is that the words in given documents can be perfectly recognized. However, newly coined words, spelling and spacing errors, and domain adaptation problems make it difficult to recognize words correctly. To make matters worse, obtaining a sufficient amount of training data that can be used in any situation is not only unrealistic but also inefficient. Therefore, an automatic word extraction method that does not require a training process is needed. WordRank, the most widely used unsupervised word extraction algorithm for Chinese and Japanese, performs poorly on Korean because of differences in language structure. In this paper, we first discuss why WordRank performs poorly on Korean, and then propose a customized WordRank algorithm for Korean, named KR-WordRank, that takes the linguistic characteristics of Korean into account and improves robustness to noise in text documents. Experimental results show that KR-WordRank performs significantly better than the original WordRank on Korean. In addition, the proposed algorithm not only extracts proper words but also identifies candidate keywords for effective document summarization.

Keywords

Word Extraction, Keyword Extraction, Text Mining, Unsupervised Learning, WordRank
