Design of a de Bruijn graph-based genome indexing structure

With the rapid development of sequencing technology and the steady reduction in its cost, individual genome sequencing has become the main approach to studying the genotypes of different species, genetic variation, and related diseases. However, because of massive repetitive sequences and highly complex genomic regions, ever-increasing sequencing data sizes, and the technical limitations of sequencing technology, mapping large numbers of reads to reference genomes effectively and efficiently remains a major challenge. This thesis introduces the hash table-based method for storing and indexing genomic data and the basic idea of the seed-and-extend scheme. It then proposes a de Bruijn graph-based indexing structure named DBG-index, together with its three-level storage mode, and puts forward several basic operations that follow from the characteristics of the index. It is demonstrated that this structure can effectively organize and index the repetitive sequences in a genome, so that the number of candidate seeds is reduced and mapping speed is greatly increased.
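To make the two indexing ideas named above concrete, the following is a minimal Python sketch, not the DBG-index itself. It assumes a toy reference string and a fixed k-mer length; the function names (`build_kmer_index`, `build_de_bruijn_graph`, `candidate_positions`) are hypothetical and chosen only for illustration. It shows a hash table that maps each k-mer to its positions in the reference, and a de Bruijn graph in which every distinct k-mer is a single node, so that copies of a repeated k-mer collapse into one node with several positions attached.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table index: map each k-mer to all of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def build_de_bruijn_graph(reference, k):
    """De Bruijn graph over k-mers: an edge u -> v exists when v directly
    follows u in the reference (they overlap by k - 1 bases)."""
    graph = defaultdict(set)
    for i in range(len(reference) - k):
        graph[reference[i:i + k]].add(reference[i + 1:i + k + 1])
    return graph


def candidate_positions(index, read, k):
    """Seed step of seed-and-extend: look up the first k-mer of the read
    (an illustrative choice of seed) and return its reference positions."""
    return index.get(read[:k], [])


# Toy reference (assumed for illustration); "ACGT" occurs twice.
reference = "ACGTACGTGA"
k = 3
index = build_kmer_index(reference, k)
graph = build_de_bruijn_graph(reference, k)
seeds = candidate_positions(index, "ACGTG", k)
```

In this toy example the repeated k-mer `ACG` is stored once with two positions, and the node `CGT` has two outgoing edges, one for each context in which the repeat continues. An extension step would then have to verify each candidate position against the rest of the read.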