Syllable-Based Korean Morphological Analysis Using n-grams Extracted from a POS-Tagged Corpus

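ㆍ Author: Kwangseob Shim (심광섭)
ㆍ Journal: 정보과학회논문지 (Journal of KIISE): Software and Applications
ㆍ Volume/Issue: 2013 | Vol. 40, No. 12 | pp. 869-876 (8 pages)
ㆍ Publisher: 한국정보과학회 (Korean Institute of Information Scientists and Engineers, KIISE)
ㆍ File information: Periodical | PDF text
ㆍ Subject area: Other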

The full text of this paper is provided free of charge through a link with the Korea Institute of Science and Technology Information (KISTI).

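Abstract (translated from Korean)

This paper proposes a syllable-based Korean morphological analysis model that uses syllable n-gram information, syllable restoration information, and tag bigram information, all extracted automatically from a POS-tagged corpus. The proposed model assigns a POS tag to each syllable of a given eojeol before restoring base forms. Compared with existing probabilistic models, which restore base forms first, this makes the morphological analysis process much simpler and more efficient. As a result, the answer inclusion rate is 98.9%, not far from that of existing models, while processing speed rises from a few hundred eojeols per second to 320,000 eojeols per second.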

Abstract (English)

This paper presents a syllable-based Korean morphological analysis model that uses three types of information: syllable n-gram, syllable restoration, and tag bigram information. All three are automatically extracted from a POS-tagged corpus. In our model, syllable restoration is performed after POS tags are attached to each syllable. With this approach, the morphological analysis phase becomes much simpler and more efficient than in previous probabilistic models for Korean morphology. As a result, the analysis speed reaches up to 322K eojeols per second, while the answer inclusion rate (AIR) is maintained at 98.9%.
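The tag-first, restore-second order described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the probability tables, the Sejong-style tag names, the composite tag `VV+EP`, and the function names `tag_syllables` and `analyze` are all assumptions made for illustration, and a single syllable-to-tag table stands in for the syllable n-gram information. The sketch tags each syllable with Viterbi decoding over tag bigrams and only afterwards applies a restoration table.

```python
import math

# Toy statistics. In the paper these are extracted from a POS-tagged corpus;
# every value, tag name, and the composite tag "VV+EP" here is an assumption.
EMIT = {  # P(tag | syllable), a stand-in for syllable n-gram information
    "학": {"NNG": 0.9, "VV": 0.1},
    "교": {"NNG": 0.8, "JKB": 0.2},
    "에": {"JKB": 0.7, "NNG": 0.3},
    "갔": {"VV+EP": 0.9, "NNG": 0.1},
    "다": {"EF": 0.8, "NNG": 0.2},
}
TRANS = {  # P(tag | previous tag), the tag bigram information
    ("<s>", "NNG"): 0.6, ("<s>", "VV"): 0.2, ("<s>", "VV+EP"): 0.2,
    ("NNG", "NNG"): 0.5, ("NNG", "JKB"): 0.4,
    ("VV+EP", "EF"): 0.9,
}
RESTORE = {  # syllable restoration: (surface syllable, tag) -> base morphemes
    ("갔", "VV+EP"): [("가", "VV"), ("았", "EP")],
}
FLOOR = 1e-6  # probability assigned to unseen tag bigrams


def tag_syllables(eojeol):
    """Viterbi decoding: return the most probable tag for each syllable."""
    # best[tag] = (log score, tag path) of the best path ending in `tag`
    best = {"<s>": (0.0, [])}
    for syl in eojeol:  # assumes precomposed (NFC) Hangul syllables
        nxt = {}
        for tag, p_emit in EMIT.get(syl, {"NA": 1.0}).items():
            score, path = max(
                ((s + math.log(TRANS.get((prev, tag), FLOOR)), p)
                 for prev, (s, p) in best.items()),
                key=lambda cand: cand[0],
            )
            nxt[tag] = (score + math.log(p_emit), path + [tag])
        best = nxt
    return max(best.values(), key=lambda cand: cand[0])[1]


def analyze(eojeol):
    """Tag syllables first, then restore base forms and merge same-tag runs."""
    morphemes = []
    for syl, tag in zip(eojeol, tag_syllables(eojeol)):
        for form, t in RESTORE.get((syl, tag), [(syl, tag)]):
            if morphemes and morphemes[-1][1] == t:
                # Adjacent syllables with the same tag form one morpheme.
                morphemes[-1] = (morphemes[-1][0] + form, t)
            else:
                morphemes.append((form, t))
    return morphemes


print(analyze("학교에"))
print(analyze("갔다"))
```

Because restoration runs only on the single best tag sequence, the search never has to enumerate candidate base forms for every syllable, which is one plausible reading of why the abstract calls the tag-first order simpler and more efficient. Merging adjacent syllables that share a tag is a simplification of this sketch and would wrongly join two neighboring morphemes with the same tag.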

Keywords

Korean morphological analysis, POS-tagged corpus, syllable restoration, dictionary training