Top-${kappa}$ 유사도 조인을 위한 샘플링 기반 알고리즘

Top-${kappa}$ 유사도 조인을 위한 샘플링 기반 알고리즘

ㆍ 저자명: 박종수,Park. Jong Soo
ㆍ 간행물명: 정보과학회논문지. Journal of KIISE. 데이타베이스
ㆍ 권/호정보: 2014년|41권 4호|pp.256-261 (6 pages)
ㆍ 발행정보: 한국정보과학회
ㆍ 파일정보: 정기간행물|
PDF텍스트
ㆍ 주제분야: 기타

이 논문은 한국과학기술정보연구원과 논문 연계를 통해 무료로 제공되는 원문입니다.

서지반출

기타언어초록

Top-${kappa}$ 유사도 조인 문제는 두 개의 입력 레코드 집합들에서 유사도를 기준한 상위 ${kappa}$ 개의 레코드 쌍을 찾는 것이다. 샘플링 기법을 이용하여 상위 ${kappa}$ 개의 유사도 조인 쌍을 반환하는 효율적인 알고리즘을 제안한다. 입력 레코드들의 표본에서 집합 유사도 조인들의 히스토그램을 구성하고, 상위 ${kappa}$ 개의 조인 쌍을 위한 추정 유사도 한계치를 통계 추론으로 95% 신뢰 구간의 오차 한계 내에서 계산한다. 상위 ${kappa}$ 개의 유사도 조인을 얻기 위하여 최소-히프 구조를 사용하는 일반 유사도 조인 알고리즘에 이 추정 한계치를 적용한다. 대 용량의 실제 데이터집합에서의 실험결과는 제안된 알고리즘의 좋은 성능을 보여준다.

기타언어초록

The problem of top-${kappa}$ set similarity joins finds the top-${kappa}$ pairs of records ranked by their similarities between two sets of input records. We propose an efficient algorithm to return top-${kappa}$ similarity join pairs using a sampling technique. From a sample of the input records, we construct a histogram of set similarity joins, and then compute an estimated similarity threshold in the histogram for top-${kappa}$ join pairs within the error bound of 95% confidence level based on statistical inference. Finally, the estimated threshold is applied to the traditional similarity join algorithm which uses the min-heap structure to get top-${kappa}$ similarity joins. The experimental results show the good performance of the proposed algorithm on large real datasets.

키워드

상위-${kappa}$ 유사도 조인 샘플링 95% 신뢰도 수준 성능 평가 top-${kappa}$ similarity join sampling 95% confidence level performance evaluation

다운URL