Protein Sequence Search based on N-gram Indexing

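ㆍ Authors: Hwang, Mi-Nyeong; Kim, Jin-Suk
ㆍ Journal: Bioinformatics and Biosystems
ㆍ Volume/Issue: 2006 | Vol. 1, No. 1 | pp. 46-50 (5 pages)
ㆍ Publisher: Korean Society for Bioinformatics and Systems Biology (한국생물정보시스템생물학회)
ㆍ File information: Periodical | ENG | PDF text
ㆍ Subject area: Other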

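The full text of this paper is provided free of charge through a paper-linking arrangement with the Korea Institute of Science and Technology Information.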

Abstract

As experimental techniques in molecular biology advance, genomic and protein sequence databases are growing exponentially in size, and mean sequence lengths are also increasing. As these databases grow, it becomes difficult to search them for sequences with significant homology to a query sequence. In this paper, we present an N-gram indexing method for retrieving similar sequences quickly and precisely, with results comparable to those of existing search tools. The method regards a protein sequence as text written in a language of 20 amino acid codes and adopts fixed-length N-gram tokens as its indexing scheme for sequence strings. Once such tokens have been indexed for all sequences in the database, sequences can be searched with information retrieval algorithms. Using this method, we have developed a protein sequence search system named ProSeS (PROtein Sequence Search). ProSeS is a protein sequence analysis system that provides overall analysis results, such as similar sequences with significant homology, predicted subcellular locations of the query sequence, and major keywords extracted from the annotations of similar sequences. We show experimentally that the N-gram indexing approach significantly reduces retrieval time and that it is as accurate as BLAST, the currently popular search tool.
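
The indexing idea described in the abstract can be illustrated with a minimal sketch. This is not the ProSeS implementation: the N-gram length of 3, the function names, the toy sequences, and the ranking rule (counting distinct shared N-grams) are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import defaultdict


def ngrams(sequence, n=3):
    """Return the overlapping fixed-length N-gram tokens of a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def build_index(sequences, n=3):
    """Map each N-gram token to the set of sequence IDs that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for token in ngrams(seq, n):
            index[token].add(seq_id)
    return index


def search(index, query, n=3):
    """Rank sequence IDs by how many distinct query N-grams they share.

    This simple count stands in for the information retrieval scoring
    mentioned in the abstract; the real scoring function is not given there.
    """
    scores = defaultdict(int)
    for token in set(ngrams(query, n)):
        for seq_id in index.get(token, ()):
            scores[seq_id] += 1
    # Highest score first; ties broken by sequence ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy database with made-up sequences (hypothetical, not from the paper).
database = {
    "seqA": "MKVLAAGIVG",
    "seqB": "MKVLSAGIVG",
    "seqC": "GGGGSGGGGS",
}
index = build_index(database)
results = search(index, "MKVLAAGI")
print(results)
```

Because the index is built once, a query only touches the posting sets of its own tokens rather than aligning against every sequence, which is the kind of saving in retrieval time the abstract refers to.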

Keywords

homology search, N-gram indexing, sequence retrieval, sequence search tool, ProSeS
