The algorithm is the first full-fledged short-read alignment software to use learned indices to solve the exact match search problem for efficient seeding.
The human genome is a complete set of DNA, which is about 6.4 billion letters long. Because of its size, reading the whole genome sequence at once is challenging. So scientists use DNA sequencers to produce DNA sequence fragments, or short reads, up to 300 letters long. The short reads are then assembled like a giant jigsaw puzzle to reconstruct the entire genome sequence. Even with very fast computers, this job can take hours to complete.
A research team at KAIST has achieved up to 3.45x faster speeds by developing the first short-read alignment software that uses a recent advance in machine learning called a learned index.
The research team reported their findings on March 7, 2022, in the journal Bioinformatics. The software has been released as open source and can be found on GitHub (https://github.com/kaist-ina/BWA-MEME).
Next-generation sequencing (NGS) is a state-of-the-art DNA sequencing method. Projects are underway with the goal of producing genome sequencing at population scale. Modern NGS hardware is capable of generating billions of short reads in a single run. The short reads are then aligned with the reference DNA sequence. With next-generation sequencers running large-scale DNA sequencing operations, the need for an efficient short-read alignment tool is even more critical. Accelerating DNA sequence alignment would be a step toward the goal of population-scale sequencing. However, existing algorithms are limited in their performance because of their frequent memory access.
BWA-MEM2 is a popular short-read alignment software package currently used to align sequenced DNA. However, it has its limitations. State-of-the-art alignment has two phases: seeding and extending. During the seeding phase, searches find exact matches of short reads in the reference DNA sequence. During the extending phase, the matches from the seeding phase are extended. In the current process, bottlenecks occur in the seeding phase, because finding the exact matches slows the process.
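The two phases described above can be illustrated with a toy example. This is only a minimal sketch under simplifying assumptions, not how BWA-MEM2 works internally: a plain dictionary of fixed-length substrings stands in for the aligner's real index, extension is done without gaps by counting mismatches, and the names `find_seeds`, `extend_seed`, and `align` are hypothetical.

```python
def find_seeds(reference, read, seed_len):
    """Seeding: list (read_offset, ref_pos) pairs where a substring of the
    read of length seed_len occurs exactly in the reference."""
    # Toy index: every seed_len-letter substring of the reference -> positions.
    # (Assumption: a hash table stands in for the index a real aligner uses.)
    index = {}
    for pos in range(len(reference) - seed_len + 1):
        index.setdefault(reference[pos:pos + seed_len], []).append(pos)

    seeds = []
    for off in range(len(read) - seed_len + 1):
        for pos in index.get(read[off:off + seed_len], []):
            seeds.append((off, pos))
    return seeds


def extend_seed(reference, read, read_off, ref_pos):
    """Extending: lay the whole read over the reference at the position the
    seed implies and count mismatches (ungapped, for simplicity)."""
    start = ref_pos - read_off
    if start < 0 or start + len(read) > len(reference):
        return None  # the read would hang off the end of the reference
    window = reference[start:start + len(read)]
    mismatches = sum(1 for a, b in zip(window, read) if a != b)
    return start, mismatches


def align(reference, read, seed_len=4):
    """Return (start, mismatches) for the best placement, or None if no seed."""
    best = None
    for off, pos in find_seeds(reference, read, seed_len):
        result = extend_seed(reference, read, off, pos)
        if result is not None and (best is None or result[1] < best[1]):
            best = result
    return best


if __name__ == "__main__":
    reference = "ACGTTGCATGCCGATAGCTA"
    read = "TGCATGCTGA"  # reference[4:14] with one letter changed
    print(align(reference, read))
```

Even in this toy version, every read triggers many exact-match lookups during seeding, which gives a sense of why that phase tends to dominate the running time.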
The researchers set out to solve the problem of accelerating DNA sequence alignment. To speed up the process, they used machine learning techniques to create an algorithmic improvement. Their algorithm, BWA-MEME (BWA-MEM emulated), leverages learned indices to solve the exact match search problem. Where the original software performs an exact match search one character at a time, the team’s new algorithm achieves up to 3.45x faster speeds in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60x and memory accesses by 8.77x. “Through this study, it has been shown that full genome big data analysis can be performed faster and at lower cost than with conventional methods by applying machine learning technology,” said Professor Dongsu Han from the School of Electrical Engineering at KAIST.
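The general idea of a learned index can be sketched as follows. This is a hedged illustration and not BWA-MEME's actual implementation: it assumes a sorted array of fixed-length substrings of the reference, a single straight-line model fitted by least squares in place of whatever model the tool really uses, and a hypothetical class name `LearnedSeedIndex`. The model predicts where a key should sit in the sorted array, and the search only has to check a small window around that prediction.

```python
import bisect

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Turn a DNA string into an integer, two bits per letter."""
    value = 0
    for letter in kmer:
        value = value * 4 + BASE[letter]
    return value


class LearnedSeedIndex:
    """Toy learned index over the sorted k-letter substrings of a reference.
    Assumes len(reference) >= k and queries of exactly k letters."""

    def __init__(self, reference, k):
        self.k = k
        # Start positions sorted by the text that begins there (naive build).
        self.sa = sorted(range(len(reference) - k + 1),
                         key=lambda i: reference[i:])
        self.keys = [encode(reference[i:i + k]) for i in self.sa]

        # "Training": least-squares line mapping key -> rank in sorted order.
        n = len(self.keys)
        mean_x = sum(self.keys) / n
        mean_y = (n - 1) / 2
        var_x = sum((x - mean_x) ** 2 for x in self.keys)
        cov = sum((x - mean_x) * (y - mean_y)
                  for y, x in enumerate(self.keys))
        self.slope = cov / var_x if var_x else 0.0
        self.intercept = mean_y - self.slope * mean_x

        # Worst prediction error over the indexed keys bounds the search window.
        self.max_err = max(abs(self._predict(x) - y)
                           for y, x in enumerate(self.keys))

    def _predict(self, key):
        guess = int(round(self.slope * key + self.intercept))
        return min(max(guess, 0), len(self.keys) - 1)

    def lookup(self, kmer):
        """Return sorted reference positions where kmer occurs exactly."""
        key = encode(kmer)
        guess = self._predict(key)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Binary search only inside the window around the model's guess.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        hits = []
        while i < len(self.keys) and self.keys[i] == key:
            hits.append(self.sa[i])
            i += 1
        return sorted(hits)


if __name__ == "__main__":
    index = LearnedSeedIndex("ACGTACGTTACGT", k=3)
    print(index.lookup("ACG"), index.max_err)
```

The better the model fits the data, the smaller `max_err` becomes and the fewer array entries each lookup touches, which is the intuition behind the reduction in instructions and memory accesses.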
The researchers’ ultimate goal was to develop efficient software that would enable scientists and industry to analyze big data in genomics. “With recent advances in artificial intelligence and machine learning, we see so many opportunities for designing better software for genomic data analysis. The potential is there for accelerating existing analysis as well as enabling new types of analysis, and our goal is to develop such software,” added Han.
Whole genome sequencing has traditionally been used for discovering genomic mutations and identifying the root causes of diseases, which leads to the discovery and development of new drugs and cures. There could be many potential applications. Whole genome sequencing is used not only for research but also for clinical purposes. “The science and technology for analyzing genomic data is making rapid progress, making it more accessible for scientists and patients. This will improve our understanding of diseases and help develop better cures for many different diseases.”
The research was funded by the National Research Foundation of the Korean Government’s Ministry of Science and ICT.
Youngmok Jung and Dongsu Han, “BWA-MEME: BWA-MEM emulated with a machine learning approach,” Bioinformatics, Volume 38, Issue 9, May 2022 (https://doi.org/10.1093/bioinformatics/btac137)