GENOMICS DATA MANAGEMENT: OPPORTUNITIES AND PERSPECTIVES
By 2025, genomics could represent the biggest big data generator

Genomics technologies like next-generation sequencing (NGS) reveal millions of genetic variants and offer a more comprehensive understanding of the mechanisms and genetic origins of both common and rare diseases. Applications have multiplied rapidly and sequencing costs have plummeted. To put things in perspective, in 2000 the cost of sequencing one individual’s genome was close to $100M[1]. Today, the cost is expected to fall under $1,000. Lower prices will inevitably lead to more testing and more data, especially as big data technology and improved analytics turn testing into a mass market for personalized medicine. This raises the question:

How can we store the vast quantities of genomic data we are producing?

Let us compare the projected growth of genomic data to three other sources considered among the most prolific data producers in the world: astronomy, Twitter, and YouTube. Scientists predict that by 2025, genomics could well represent the largest of the big data fields[2]. The first commercially available NGS system generated around 0.6 Gb of data, while a high-end, modern sequencing system generates approximately 6,000 Gb. That’s a 10,000-fold increase in data output in 15 years. Consequently, the cost of data storage becomes the issue. For example, Google Genomics charges 2.2 cents per GB per month[3], which seems fairly reasonable. But based on predictions, by 2025 storing a single year’s worth of human genome data for one year would cost $10,560,000,000.
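The arithmetic behind these two figures can be checked with a short sketch. The sequencer outputs and the price of 2.2 cents per gigabyte per month come from the text above; the yearly data volume does not. The value of 40 exabytes used below is an assumption, chosen because it reproduces the $10,560,000,000 figure, and the function and constant names are purely illustrative.

```python
from decimal import Decimal

# From the article: output of the first commercial NGS system versus a
# modern high-end system (same unit for both, so the ratio is unit-free).
FIRST_NGS_OUTPUT = Decimal("0.6")
MODERN_NGS_OUTPUT = Decimal("6000")

# From the article [3]: 2.2 cents per gigabyte per month.
PRICE_USD_PER_GB_MONTH = Decimal("0.022")

# ASSUMPTION (not stated in the article): 40 exabytes of human genome data
# produced in one year, in decimal units (1 EB = 10**9 GB).
ASSUMED_YEARLY_VOLUME_GB = Decimal(40) * 10**9


def fold_increase(old, new):
    """Return how many times larger `new` is than `old`."""
    return new / old


def storage_cost_usd(volume_gb, months, price_per_gb_month=PRICE_USD_PER_GB_MONTH):
    """Cost of keeping `volume_gb` gigabytes stored for `months` months."""
    return volume_gb * price_per_gb_month * months


fold = fold_increase(FIRST_NGS_OUTPUT, MODERN_NGS_OUTPUT)
cost = storage_cost_usd(ASSUMED_YEARLY_VOLUME_GB, months=12)

print(f"Output increase: {fold:,.0f}-fold")
print(f"Storing one year of data for one year: ${cost:,.0f}")
```

Decimal arithmetic is used instead of floats so that the cent-level price does not introduce rounding noise. Changing the assumed volume or the price shows how sensitive the total is to either input.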



Data security and integrity are another area of concern. Large collections of human DNA data could lead to very real advances in medicine. But they could also be misused, for a variety of reasons. Will laws and data regulation adapt quickly enough to protect consumers? A 2013 study demonstrated that it is possible to re-identify research participants using easily accessible “de-identified” genomic data alongside genealogical databases and public records.


Before data generation ramps up to the billion-plus human genomes that will likely be sequenced by 2025, it is imperative that institutions embrace informed consent policies that allow for large-scale data sharing and maximize the data’s utility.

Now, another question arises: How long should genomics data be stored? Nobody knows! There is not even agreement on what format genomics data should be stored in, or on what exactly should be stored.

 

Solutions for analysis, interpretation, compression and storage already exist but improvements are expected

While standard data compression tools (e.g. zip and rar) are being used to compress sequence data, this approach has been criticized as inefficient because general-purpose tools do not take full advantage of two properties of genomic data: sequences often contain repetitive content (e.g. microsatellite sequences), and many sequences are highly similar to one another (e.g. multiple genome sequences from the same species).
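To illustrate why similarity between genomes matters, here is a toy sketch that compares a general-purpose compressor (Python's `zlib`) with a simple reference-based encoding that stores only the positions where a sample differs from a shared reference. Everything in it is an assumption made for illustration: the sequences are random rather than real genomes, only single-base substitutions are handled, and the helper names are hypothetical. It is not the algorithm of any of the tools cited in this article.

```python
import random
import struct
import zlib

BASES = "ACGT"


def make_reference(length, rng):
    """Build a random stand-in for a reference genome (not realistic DNA)."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(reference, n_snps, rng):
    """Copy the reference and substitute a different base at n_snps positions."""
    seq = list(reference)
    for pos in rng.sample(range(len(seq)), n_snps):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


def encode_against_reference(reference, sample):
    """Store only (position, base) pairs where the sample differs.

    Each difference takes 5 bytes: a 4-byte position and a 1-byte base.
    Insertions and deletions are deliberately not supported in this sketch.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    out = bytearray()
    for pos, (ref_base, sample_base) in enumerate(zip(reference, sample)):
        if ref_base != sample_base:
            out += struct.pack("<Ic", pos, sample_base.encode("ascii"))
    return bytes(out)


def decode_against_reference(reference, encoded):
    """Rebuild the sample by applying the stored differences to the reference."""
    seq = list(reference)
    for pos, base in struct.iter_unpack("<Ic", encoded):
        seq[pos] = base.decode("ascii")
    return "".join(seq)


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = make_reference(100_000, rng)
sample = mutate(reference, n_snps=100, rng=rng)

generic = zlib.compress(sample.encode("ascii"), 9)
relative = encode_against_reference(reference, sample)
restored = decode_against_reference(reference, relative)

print(f"raw sample:        {len(sample)} bytes")
print(f"zlib:              {len(generic)} bytes")
print(f"reference + diffs: {len(relative)} bytes (reference stored once, separately)")
print(f"lossless:          {restored == sample}")
```

The reference-based encoding is only smaller because the reference itself is stored once and shared across many samples; for a single genome in isolation the saving would disappear.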


The compression ratios of eight currently available genomic data compression tools range between 65-fold and 1,200-fold for human genomes[4-11]. This does not seem to be sufficient considering the expected evolution of this market.

NGS data analysis and interpretation also remain very challenging because all genomic analysis tools were created by and for the research community. Data-crunching programs designed by engineers or scientists are often difficult for clinical biologists to use. With the projected explosion in genomic data, it’s critical to have tools that can be used as easily by consumers and physicians as by experts in genetics.

Fear not, the future is bright

From helium hard drives to DNA digital storage, the future holds smaller, higher-capacity, more efficient data storage and compression devices. Leaders in NGS data services such as Illumina and Integragen Genomics, as well as newly created start-ups (e.g., Enancio), are constantly working on improved compression tools. The main challenge will be to compress FASTQ files, images and text intelligently without losing any relevant information, so that rigorous re-analysis remains possible in the future.

IBM is developing Multi-Cloud Storage to help avoid service outages and maintain data accessibility[12]. To reduce expenses, companies like Verne Global are building huge new data centers in places like Iceland, where energy costs are low[13]. On the analysis and interpretation side, new platforms and other smart tools are in development. Bioinformaticians are also increasingly looking to improve data analyses in order to simplify interpretation while guaranteeing both the integrity of the data behind diagnoses and patient confidentiality. Solutions remain to be found, but the work is underway.

 

[1] https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/
[2] Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. (2015) Big Data: Astronomical or Genomical?. PLoS Biol 13(7):e1002195. doi:10.1371/journal.pbio.1002195.
[3] Google Genomics pricing data. Available at https://cloud.google.com/genomics/. (Accessed 6 June 2018).
[4] Brandon M.C, Wallace D.C, and Baldi P (2009). Data structures and compression algorithms for genomic sequence data. Bioinformatics 25(14):1731–1738.
[5] Deorowicz S, and Grabowski S (2011). Robust relative compression of genomes with random access. Bioinformatics 27(21):2979-2986.
[6] Wang C, and Zhang D (2011). A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res 39(7):e45.
[7] Pinho A.J, Pratas D and Garcia S.P (2012). GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res 40(4):e27.
[8] Tembe W, Lowey J and Suh E. (2010). G-SQZ: Compact encoding of genomic sequence and quality data. Bioinformatics 26(17):2192-2194.
[9] Christley S, Lu Y, Li C and Xie X (2009). Human genomes as email attachments. Bioinformatics 25(2):274-275.
[10] Pavlichin, D.S, Weissman T., and Yona G (2013). The human genome contracts again. Bioinformatics 29(17):2199-2202.
[11] Pratas D., Pinho A.J and Ferreira P.J.S. Efficient compression of genomic sequences. Data Compression Conference, Snowbird, Utah, 2016.
[12] IBM. Now’s the time for a multi-cloud strategy. Available at https://www.ibm.com/blogs/cloud-computing/2017/03/time-multi-cloud-strategy/ (Accessed 6 June 2018).
[13] Verne Global. Customer case study – Why Icelandic HPC is Bioinformatics’ Best Friend. Available at https://verneglobal.com/assets/uploads%2F2017%2F2%2Fa7677429-ea0c-3f6e-1bc1-66f32c5072b1%2F31+-+Verne+Global+-+Earlham+Institute+Case+Study+HIGH+RES.pdf.