University of British Columbia Identifies 130,000 New Viruses in 11 Days
Using AWS, Serratus can process over one million libraries of next-generation sequencing data per day
Biotechnology data is expanding at a rate exceeding Moore’s law, and experts are increasingly grappling with how to share this data effectively to accelerate scientific breakthroughs. Genomic data sharing leads to more accurate and deeper research insights, but the sheer volume of the data can present logistical challenges in areas like security, storage capacity, and global accessibility. These challenges can be addressed using Amazon Web Services (AWS).
To facilitate this move towards international genomic data sharing for research purposes, the National Center for Biotechnology Information (NCBI) mirrored the Sequence Read Archive (SRA) into AWS in February 2020 using Amazon Simple Storage Service (S3), an object storage service. The SRA is the world’s largest repository of high-throughput genetic sequencing information. It contains more than 50 petabases (5 × 10¹⁶ DNA letters) of raw data from thousands of species from all corners of the Earth, ranging from Antarctic penguins to peat bogs in British Columbia.
Artem Babaian, Ph.D., a researcher at the University of British Columbia, decided to take advantage of this open-access data to understand how the COVID-19 pandemic emerged. While billions of dollars have been invested in understanding the genome of SARS-CoV-2 – the coronavirus that causes COVID-19 – the scientific community still has a lot to learn about coronaviruses in general, such as their evolutionary history and how these viruses can undergo genetic recombination between different virus species.
The resulting research, published in the scientific journal Nature, should help doctors make connections faster when treating sick patients, improve diagnostic testing and vaccine development, and help policymakers direct research and monitoring more effectively.