A Similarity Computation Technique for Sets of Deformed Strings Using Multiple Sequence Alignment

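ㆍ Authors: Sung-Hwan Kim (김성환), Hwan-Gue Cho (조환규)
ㆍ Journal: Journal of KIISE: Software and Applications (정보과학회논문지: 소프트웨어 및 응용)
ㆍ Volume/Issue: 2013 | Vol. 40, No. 1 | pp. 53-60 (8 pages)
ㆍ Publisher: Korean Institute of Information Scientists and Engineers (한국정보과학회)
ㆍ File information: Periodical | PDF text
ㆍ Subject area: Other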

The full text of this paper is provided free of charge through a link with the Korea Institute of Science and Technology Information (KISTI).

Abstract

On the Internet, users continually deform words. Given some examples of deformed strings derived from a single original string, determining whether another string derives from the same original is an important problem for efficient approximate string search and data collection. In this paper, for a string set containing deformed strings derived from one string, we define a representative string by multiple sequence alignment and use it to propose a method for computing the similarity between a string and a string set. The proposed method runs in constant time regardless of the size of the string set. Experiments show that it is more efficient than the existing method when the set contains 100 or more strings, and more than twice as fast for sets of 269 or more strings. The experiments also show that some parameter combinations achieve better classification performance, in terms of sensitivity and specificity, than an exhaustive search.
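
The contrast between exhaustive comparison and comparison against a single representative can be illustrated with a minimal Python sketch. This is not the paper's algorithm. The abstract does not give the alignment procedure or the scoring parameters, so the sketch assumes a normalized edit-distance similarity and uses the medoid of the set (the member with the smallest total edit distance to the others) as a stand-in for the representative string built by multiple sequence alignment. All function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def exhaustive_similarity(query: str, variants: list[str]) -> float:
    """Baseline: compare the query with every member; cost grows with the set size."""
    return sum(similarity(query, v) for v in variants) / len(variants)


def medoid(variants: list[str]) -> str:
    """Stand-in representative (assumption): the member closest to all others.

    The paper instead derives its representative from a multiple sequence alignment.
    """
    return min(variants, key=lambda c: sum(edit_distance(c, v) for v in variants))


def representative_similarity(query: str, representative: str) -> float:
    """One comparison per query, independent of how many variants were summarized."""
    return similarity(query, representative)


if __name__ == "__main__":
    variants = ["hello", "helo", "hallo", "hello", "hellp"]
    rep = medoid(variants)  # computed once, ahead of any query
    for query in ("helllo", "world"):
        print(query,
              round(exhaustive_similarity(query, variants), 3),
              round(representative_similarity(query, rep), 3))
```

The representative is computed once per set, so each later query needs a single string comparison however many variants the set holds. This is the property the abstract describes as running in constant time with respect to the set size.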

Keywords

Deformed Word, Sequence Alignment, Approximate String Search, String Classification