Cenk Sahinalp
Genomic research has the potential to change the world, and handling the sheer amount of data that becomes available every day is a challenge for computer scientists.
It’s not just the sheer quantity of information. It’s also the size of the files that are stored and exchanged between researchers and clinicians. A single human genome, for example, amounts to roughly 250-300 GB of data even with first-level compression, which creates problems for both transmission and storage.
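To make the transmission problem concrete, here is a rough back-of-the-envelope sketch. The ~300 GB figure comes from the article; the link speeds are illustrative assumptions, not from the source:

```python
# Rough transfer-time estimate for one genome's data.
# The ~300 GB figure is from the article; link speeds are illustrative assumptions.

GENOME_BYTES = 300 * 10**9  # ~300 GB with first-level compression

def transfer_hours(size_bytes: float, link_mbps: float) -> float:
    """Hours needed to move size_bytes over a link of link_mbps megabits/second."""
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 10**6)
    return seconds / 3600

for mbps in (100, 1000):  # an assumed institutional link vs. a gigabit link
    print(f"{mbps:>5} Mbps: {transfer_hours(GENOME_BYTES, mbps):.1f} hours")
```

Even at an assumed 100 Mbps, moving a single compressed genome takes the better part of a working day, and uncompressed data would take several times longer.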
Finding a way to compress the information into a more manageable format has long been a goal of the field, and a group of researchers directed by Cenk Sahinalp, a professor of computer science and co-director of the Center for Bioinformatics at the School of Informatics and Computing, is working to find an answer.
Sahinalp’s group studied the available approaches to high-throughput sequencing data compression to determine which format had the potential to serve as an open standard for genomics data. The group, an international team of scientists, was formed as a working group of the International Organization for Standardization’s Moving Picture Experts Group (MPEG), which created the compression standards used throughout the digital media industry.
Their paper, “Comparison of high-throughput sequencing data compression tools,” was published in the November issue of Nature Methods, the premier journal in the field of scientific methods and one of the top journals across the scientific and medical disciplines.
“It may take up to a day to transmit an individual’s uncompressed human genome data from one location to another,” Sahinalp says. “In the Pan-Cancer Analysis of Whole Genomes Project (a joint venture between NIH’s The Cancer Genome Atlas and the International Cancer Genome Consortium), some 25,000 tumor samples and matching blood samples are being sequenced. The total memory footprint of this dataset will be roughly 15 petabytes. My research group, led by Ibrahim Numanagic, my Ph.D. student who is also the lead author of the Nature Methods paper, has developed some of the best compression methods available, which can reduce the memory footprint of genomic data by a factor of two or more.
“There are also other competing programs and formats; some need less time to perform the compression itself, while others can work on genomic data streams. Our working group aims to identify the single program and format that provides the best tradeoff across several parameters. The paper represents a major step toward setting a standard for genomic data representation. We have developed software to automatically benchmark any future method and compare it against existing methods.”
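The dataset scale Sahinalp quotes is easy to sanity-check with quick arithmetic. The per-sample size of ~300 GB is an assumption consistent with the per-genome figure given earlier in the article; the sample counts are from the quote:

```python
# Back-of-the-envelope check of the quoted ~15 PB footprint.
# Assumes ~300 GB per sequenced sample (consistent with the 250-300 GB
# per-genome figure above); sample counts are from the quote.

GB, PB = 10**9, 10**15

samples = 25_000 * 2                # 25,000 tumor samples + matching blood samples
bytes_per_sample = 300 * GB         # assumed per-sample footprint

total = samples * bytes_per_sample
print(f"Total footprint: {total / PB:.0f} PB")        # 15 PB, matching the quote

halved = total / 2                  # "a factor of two or more" compression
print(f"After 2x compression: {halved / PB:.1f} PB")  # 7.5 PB
```

Even after halving the footprint, the project would still need to store and move petabytes of data, which is why a single efficient standard matters.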
The study is also the first paper to be funded in part by the Precision Health Grand Challenges program at Indiana University.
“Cenk’s involvement in this critical effort is another example of the important work being done by our faculty,” says Raj Acharya, the dean of SoIC. “It also shows the ways in which our participation in the Precision Health Grand Challenges can have a positive impact on the world.”
Although the effort to establish a standard is in its early stages, Sahinalp is excited about the breakthrough such a standard would represent.
“MPEG has now issued a call for proposals to establish an efficient compression standard,” Sahinalp says. “Once the standard is set, the effect could be similar to what MPEG achieved in video compression, which created the standard for digital transmission and made HDTV practical. We’re excited about the possibilities.”