A Genome Sequence Assembler Using a Time- and Space-Efficient DNA Sequence Search Algorithm

Abstract

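Advances in ultra-high-throughput sequencing (NGS) technology are producing a flood of biological data, and the growth of the bio-industry is bringing closer an era of personalized medicine based on individual genome information. Analyzing such large numbers of sequences requires a great deal of storage, and therefore supercomputer-class servers and programs that can process large volumes of data quickly. These analyses include sequence match searching and the alignment and assembly analyses built on it. Existing algorithms for them treat nucleotide sequences as character strings and rely on techniques such as hash index tables, de Bruijn graphs, and the Burrows-Wheeler transform (BWT). For time- and space-efficient DNA searching, this paper proposes an algorithm that converts a nucleotide sequence not into a character string but into an integer array of k-mer groups and then searches it with unit and non-unit operators, reducing the storage space needed for searching by about 28% or more. We developed CalcGen, an assembly analysis program based on this algorithm, and verified the usefulness of the algorithm through experiments.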

Abstract (English)

The advent of ultra-high-throughput sequencing technology is producing enormous volumes of bio-sequence information, and advances in the bio-industry are ushering in an era of personalized medicine based on individual genome information. However, analyzing massive bio-sequence data requires large storage, so the analysis sometimes needs a supercomputer and novel software that can handle bulky sequence information. Such analyses include sequence match searching and the alignment and assembly built on it, which are fundamental to bio-sequence analysis. Existing algorithms treat nucleotide sequences as strings and compare them character by character. They use hash index tables, de Bruijn graphs, the Burrows-Wheeler transform, and so on. In this paper, for time- and space-efficient DNA searching, we propose an algorithm that transforms a base sequence into a k-mer integer array and analyzes that array with a unit search operator and a non-unit search operator, reducing storage space by about 28%. Furthermore, based on this algorithm we have developed the CalcGen assembler, a fundamental sequence analysis program, and we demonstrate its usefulness through experiments.
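The idea of searching integers rather than characters can be illustrated with a minimal sketch. It is not the CalcGen implementation: the abstract does not define the unit and non-unit search operators or the encoding, so the 2-bit-per-base code, the rolling-window comparison, and the names `CODE`, `pack_kmers`, and `find_kmer` are all assumptions made for illustration only.

```python
# Illustrative sketch only; the encoding and search strategy are assumptions,
# not the method described in the paper.

# Assumed 2-bit code per base (the abstract does not specify one).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_kmers(seq, k):
    """Pack consecutive, non-overlapping k-mers of seq into one integer each."""
    values = []
    for start in range(0, len(seq) - k + 1, k):
        value = 0
        for base in seq[start:start + k]:
            # Shift the accumulated bases left and append the new 2-bit code.
            value = (value << 2) | CODE[base]
        values.append(value)
    return values


def find_kmer(seq, query):
    """Return every offset at which query occurs in seq.

    Each window of len(query) bases is kept as a single integer, so a
    match test is one integer comparison instead of a character loop.
    """
    k = len(query)
    if k == 0 or len(seq) < k:
        return []
    mask = (1 << (2 * k)) - 1          # keeps only the last k bases
    target = pack_kmers(query, k)[0]   # the query as one integer
    hits = []
    window = 0
    for i, base in enumerate(seq):
        window = ((window << 2) | CODE[base]) & mask
        if i >= k - 1 and window == target:
            hits.append(i - k + 1)
    return hits
```

For example, `pack_kmers("ACGT", 2)` yields one integer per 2-mer, and `find_kmer("ACGTACGT", "ACG")` reports both occurrences of the query. The sketch handles only the four unambiguous bases; ambiguity codes such as `N` would need a separate convention.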

Keywords

Next-generation Sequencing (NGS); Sequence Search Algorithm; Assembly Analysis
