Mining Maximal Frequent Contiguous Sequences in Biological Data Sequences

ㆍ Authors: Kang, Tae-Ho; Yoo, Jae-Soo; Kim, Hak-Yong; Lee, Byoung-Yup
ㆍ Journal: International Journal of Contents
ㆍ Volume/Issue: 2007, Vol. 3, No. 2, pp. 18-24 (7 pages)
ㆍ Publisher: The Korea Contents Association (한국콘텐츠학회)
ㆍ File info: Periodical | ENG | PDF text
ㆍ Subject area: Other

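The full text of this paper is provided free of charge through a linking arrangement with the Korea Institute of Science and Technology Information.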

Abstract

Biological sequences such as DNA and amino acid sequences typically contain a large number of items. They contain contiguous sequences that ordinarily consist of hundreds of frequent items or more. In biological sequence analysis (BSA), searching for frequent contiguous sequences is one of the most important operations. Many studies have addressed efficient sequential pattern mining, and most existing methods are based on the Apriori algorithm. In particular, the prefixSpan algorithm is one of the most efficient sequential pattern mining schemes based on the Apriori algorithm. However, because it grows sequential patterns starting from frequent patterns of length 1, it is not suitable for biological datasets with long frequent contiguous sequences. More recently, the MacosVSpan algorithm was proposed, building on the idea of prefixSpan, to significantly reduce the recursive process. However, that algorithm is still inefficient for mining frequent contiguous sequences from long biological data sequences. In this paper, we propose an efficient method for mining maximal frequent contiguous sequences in large biological data sequences by constructing a spanning tree with a fixed length. To verify the proposed method, we perform experiments in various environments. The experiments show that the proposed method is much more efficient than MacosVSpan in terms of retrieval performance.
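
To make the problem statement concrete, the following is a minimal sketch of what "maximal frequent contiguous sequences" means. It is a naive level-wise baseline, not the fixed-length spanning tree method proposed in the paper, and not prefixSpan or MacosVSpan. The function names are hypothetical, and it assumes that support is counted as the number of input sequences that contain a given contiguous subsequence at least once.

```python
from collections import defaultdict


def frequent_contiguous(sequences, min_support, length):
    """Return all substrings of the given length that occur in at least
    `min_support` of the input sequences (assumed support definition)."""
    counts = defaultdict(int)
    for seq in sequences:
        # Use a set so each sequence contributes at most once per substring.
        seen = {seq[i:i + length] for i in range(len(seq) - length + 1)}
        for sub in seen:
            counts[sub] += 1
    return {sub for sub, c in counts.items() if c >= min_support}


def maximal_frequent_contiguous(sequences, min_support):
    """Naive baseline: enumerate frequent substrings length by length,
    then keep only those with no frequent extension."""
    frequent = set()
    length = 1
    while True:
        level = frequent_contiguous(sequences, min_support, length)
        if not level:
            # Every substring of a frequent substring is itself frequent,
            # so an empty level means no longer pattern can be frequent.
            break
        frequent |= level
        length += 1
    # A frequent substring is maximal if no frequent substring that is
    # one item longer contains it; longer supersequences imply such a one.
    return {
        s for s in frequent
        if not any(len(t) == len(s) + 1 and s in t for t in frequent)
    }


if __name__ == "__main__":
    data = ["ACGTAC", "TACGTT", "GGACGT"]
    print(sorted(maximal_frequent_contiguous(data, min_support=2)))
```

This baseline starts from length-1 patterns and grows one item at a time, which is the behavior the abstract identifies as costly when the frequent contiguous sequences are hundreds of items long.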

Keywords

Sequence Pattern Mining, Sequence Analysis, Motif Finding Problem, Bioinformatics
