Life sciences big data use cases
Post on 16-Apr-2017
Big data and Life Sciences
Wellcome Trust Sanger Institute
The Sanger Institute
Funded by Wellcome Trust.
2nd largest research charity in the world.
Based in Hinxton Genome Campus, Cambridge, UK.
Large scale genomic research.
Sequenced 1/3 of the human genome (largest single contributor).
Large scale sequencing with an impact on human and animal health.
Data is freely available.
Websites, ftp, direct database access, programmatic APIs.
Some restrictions for potentially identifiable data.
My team: Scientific computing systems architects.
250 million fragments × 75-108 bases each
~1 TByte / day / machine
Human genome (3 Gbases)
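The figures above imply a rough depth of coverage. A minimal sketch of that arithmetic, assuming the 250 million fragments come from a single batch of reads (the slide does not say whether this is per run or per lane); the function names are illustrative only:

```python
# Figures from the slide; grouping them into one "batch" is an assumption.
READS = 250_000_000            # 250 million fragments
READ_LENGTHS = (75, 108)       # bases per fragment (low, high)
GENOME_SIZE = 3_000_000_000    # human genome, ~3 Gbases


def yield_bases(reads, read_length):
    """Total number of bases produced by a batch of reads."""
    return reads * read_length


def coverage(total_bases, genome_size):
    """Average number of times each genome position is read."""
    return total_bases / genome_size


for length in READ_LENGTHS:
    total = yield_bases(READS, length)
    print(f"{length}-base reads: {total / 1e9:.2f} Gbases, "
          f"{coverage(total, GENOME_SIZE):.2f}x coverage")
```

With these numbers the batch yields 18.75 to 27 Gbases, i.e. roughly 6x to 9x coverage of a 3 Gbase genome.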
Cost of sequencing halves every 12 months.
Wrong side of Moore's Law.
The Human genome project: 13 years.
A human genome today: 3 days.
Trend will continue:
$1000 genome is probable within 2 years.
Informatics not included.
The scary graph
Peak Yearly capillary sequencing: 30 Gbase
Current weekly sequencing: 7-10 Tbases
Data doubling Time: 4-6 months.
Gen III Sequencers this year?
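To put the doubling time in perspective, here is a small sketch of the compound growth it implies. The helper names and the 24-month horizon are assumptions for illustration; only the 4-6 month doubling time and the throughput figures come from the slide.

```python
def annual_growth_factor(doubling_months):
    """How many times larger the data gets in 12 months."""
    return 2 ** (12 / doubling_months)


def project(current, months_ahead, doubling_months):
    """Project a quantity forward, assuming the doubling time holds."""
    return current * 2 ** (months_ahead / doubling_months)


WEEKLY_TBASES = (7, 10)        # current weekly output, from the slide
CAPILLARY_PEAK_GBASES = 30     # peak yearly capillary output, from the slide

for doubling in (4, 6):
    factor = annual_growth_factor(doubling)
    low = project(WEEKLY_TBASES[0], 24, doubling)   # 24 months: illustrative
    print(f"doubling every {doubling} months: x{factor:.0f} per year, "
          f"{low:.0f} Tbases/week after 2 years (from 7 today)")

# One week today versus the best capillary year.
ratio = WEEKLY_TBASES[0] * 1000 / CAPILLARY_PEAK_GBASES
print(f"one week now is about {ratio:.0f}x the peak capillary year")
```

A 4-6 month doubling time works out to 4x-8x growth per year, which is why storage planning cannot simply follow disk price trends.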
Pbytes!
Sequencing data flow.
Structured data (databases)
Unstructured data (flat files)
Raw data (10 TB)
Sequencing is the start of most analysis.
People = Unmanaged data
Data in wrong place
Duplicated
Nobody can find anything
Inc systems:
Backups/security
Capacity planning?
A Sequencing Centre Today
CPU: Generic x86_64 cluster (16,000 cores).
Storage: ~1 TB per day per sequencer (15 PB disk).
(Lustre + NFS)
Metadata driven data management
Only keep our important files.
Catalogue them, so we can find them!
Keep the number of copies we want, and no more.
(iRODS, in-house LIMS).
A solved problem; we know how to do this.
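The three rules above (keep, catalogue, control copies) can be sketched with a toy catalogue. This is not how iRODS or the in-house LIMS work; it is a minimal illustration using Python's standard `sqlite3` and `hashlib`, and every table name, function name, and path below is hypothetical.

```python
import hashlib
import sqlite3


def open_catalogue():
    """Create an in-memory catalogue (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE objects  (checksum TEXT PRIMARY KEY,
                               wanted_copies INTEGER NOT NULL);
        CREATE TABLE replicas (checksum TEXT, location TEXT,
                               UNIQUE (checksum, location));
        CREATE TABLE metadata (checksum TEXT, key TEXT, value TEXT,
                               UNIQUE (checksum, key, value));
    """)
    return db


def register(db, data, location, metadata, wanted_copies=2):
    """Record one copy of a file, identified by its content checksum."""
    checksum = hashlib.sha256(data).hexdigest()
    db.execute("INSERT OR IGNORE INTO objects VALUES (?, ?)",
               (checksum, wanted_copies))
    db.execute("INSERT OR IGNORE INTO replicas VALUES (?, ?)",
               (checksum, location))
    db.executemany("INSERT OR IGNORE INTO metadata VALUES (?, ?, ?)",
                   [(checksum, k, v) for k, v in metadata.items()])
    return checksum


def find(db, key, value):
    """Locations of every copy whose metadata matches key=value."""
    rows = db.execute(
        "SELECT DISTINCT r.location FROM metadata m "
        "JOIN replicas r ON r.checksum = m.checksum "
        "WHERE m.key = ? AND m.value = ? ORDER BY r.location",
        (key, value))
    return [row[0] for row in rows]


def surplus_copies(db):
    """(checksum, extra copies) for objects held more often than wanted."""
    rows = db.execute(
        "SELECT o.checksum, COUNT(r.location) - o.wanted_copies "
        "FROM objects o JOIN replicas r ON r.checksum = o.checksum "
        "GROUP BY o.checksum, o.wanted_copies "
        "HAVING COUNT(r.location) > o.wanted_copies")
    return [(row[0], row[1]) for row in rows]
```

Identifying files by checksum rather than by path is what makes duplicates visible: the same data registered from two places shows up as two replicas of one object.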
This is not big data
This is not big data either...
Proper Big Data
We want to compute across all the data.
Sequencing data (of course).
Patient records, treatment and outcomes.
Why?
Cancer: tie in genetics, patient outcomes and treatments.
Pharma: high failure rate due to genetic factors in drug response.
Infectious disease epidemiology.
Rare genetic diseases.
Many genetic effects are small.
Million member cohorts to get good signal:noise.
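Why small effects demand such large cohorts can be seen from a textbook power calculation. The sketch below uses the standard normal-approximation formula for comparing two group means, n per group = 2(z_alpha + z_beta)^2 / d^2. The effect sizes, the 80% power, and the 5e-8 significance threshold (a common convention in genome-wide studies) are assumptions for illustration, not figures from the talk.

```python
from math import ceil
from statistics import NormalDist


def cohort_size_per_group(effect_size, alpha=5e-8, power=0.8):
    """Samples needed per group to detect a standardised effect size.

    Normal approximation for a two-sided, two-sample comparison of means.
    alpha and power defaults are illustrative assumptions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


for d in (0.5, 0.1, 0.01):
    print(f"effect size {d}: about {cohort_size_per_group(d):,} per group")
```

Because the required size scales with 1/d^2, an effect ten times smaller needs a cohort a hundred times larger; at an effect size of 0.01 the requirement reaches several hundred thousand people per group.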
Translation: Genomics of drug sensitivity in Cancer
Pre-treatment
BRAF inhibitor: 15 weeks of treatment
BRAF mutation positive 70% response rate vs 10% for standard chemotherapy
BRAF inhibitors in malignant melanoma
Slide from Mathew Garnett (CGP)
Current Data Archives
EBI ERA / NCBI SRA store results of all sequencing experiments.
Public data availability: A good thing (tm).
1.6 Pbases
Problems
Archives are dark.
You can put data in, but you can't do anything with it.
In order to analyse the data, you need to download it all.
100s of Tbytes
Situation replicated at local institute level too.
e.g. How does CRI get hold of their data currently held at Sanger?
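A back-of-envelope estimate shows why downloading everything is impractical. The link speeds and the efficiency factor below are assumptions chosen for illustration; only the "100s of Tbytes" scale comes from the slide.

```python
def transfer_days(size_tbytes, link_gbit_per_s, efficiency=1.0):
    """Days needed to move size_tbytes over a link of the given speed.

    efficiency is the fraction of nominal bandwidth actually achieved
    (1.0 is the theoretical best case; real transfers are lower).
    """
    bits = size_tbytes * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 86_400


# Illustrative link speeds in Gbit/s (assumptions, not from the talk).
for gbit in (0.1, 1, 10):
    print(f"100 TB over {gbit} Gbit/s: {transfer_days(100, gbit):.1f} days")
```

Even at a fully used 1 Gbit/s, 100 TB takes over nine days, and every group that wants to analyse the archive has to repeat the transfer.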
Global Alliance for sharing genomic and clinical data
70 research institutes & hospitals (including Sanger, Broad, EBI, BGI, Cancer Research UK)
Million cancer genome warehouse (UC Berkeley)
To the Cloud!
How do we get there?
Code & Algorithms
Bioinformatics code:
Integer not FP heavy.
Large memory footprints.
Not a good fit for future computing architectures.
Expensive to run on public clouds.
Memory footprint leads to unused cores.
Out of scope for a data talk, but still an important point.
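The "unused cores" point is simple packing arithmetic: when each job needs more memory than its share of the node, memory runs out before cores do. A minimal sketch follows; the node shape and job size are hypothetical, not taken from any particular cloud offering.

```python
def node_utilisation(cores, node_mem_gb, job_mem_gb, cores_per_job=1):
    """Return (jobs that fit on the node, cores left idle)."""
    jobs_by_memory = int(node_mem_gb // job_mem_gb)
    jobs_by_cpu = cores // cores_per_job
    jobs = min(jobs_by_memory, jobs_by_cpu)
    return jobs, cores - jobs * cores_per_job


# Hypothetical node: 16 cores, 64 GB RAM; each job needs 16 GB and 1 core.
jobs, idle = node_utilisation(cores=16, node_mem_gb=64, job_mem_gb=16)
print(f"{jobs} jobs fit, {idle} of 16 cores sit idle")
```

In this example only 4 jobs fit and 12 cores are paid for but never used, which is where the cost on public clouds comes from.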
[Diagram labels: Global file system, CPU nodes, Object store, Fast network, Slow network, Static nodes]
A VM is just a VM, right?
Clouds are supposed to be programmable.
Nobody wants to re-write a pipeline when they move clouds.
Object:
Low level: AWS S3, OpenStack Swift, Ceph/RADOS
High level: Data management layer (eg iRODS)?
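One way to avoid rewriting a pipeline per cloud is to code each step against a narrow storage interface and swap the backend. The sketch below is an assumption about how that could look, not a description of iRODS or any vendor API: `Store`, `FileStore`, `InMemoryObjectStore`, and `count_bases` are all hypothetical names, and the in-memory class merely stands in for a real object store client.

```python
from pathlib import Path
from typing import Dict, Protocol


class Store(Protocol):
    """The only storage operations a pipeline step may rely on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class FileStore:
    """Backend for a POSIX file system (e.g. a cluster's shared disk)."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryObjectStore:
    """Stand-in for an object store; a real backend would call its client."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def count_bases(store: Store, key: str) -> int:
    """A toy pipeline step: count A/C/G/T bytes in a stored object."""
    return sum(1 for byte in store.get(key) if byte in b"ACGT")
```

The pipeline step never learns which backend it is talking to, so moving clouds means writing one new `Store` class rather than touching the analysis code.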
Cloud interoperability?
Do we need more standards?!
Pragmatic approach:
First person to make one that actually works, wins.
Data still has to get from our instruments to the Cloud.
Good news:
Lots of products out there for wide area data movement.
Bad news:
We are currently using all of them(!)
Network bandwidth still a problem.
Research institutes have fast data networks.
What about your GP's surgery?
UDT / UDR
rsync / ssh
Unlikely that data archives are going to allow anonymous access.
Who are you?
Federated identity providers.
Is everyone signed up to the same federation?
Does it include the right mix of cross-national co-operation?
Does your favourite bit of software support federated IDs?
Legal
Theory: anonymised data can be stored and accessed without jumping through hoops.
Practice: Risk of re-identification. Becomes easier the more data you have.
Medical records are hard to anonymise and still be useful.
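The claim that re-identification gets easier with more data can be illustrated by counting how many records become unique as attributes are combined. The records below are entirely synthetic and the field names are hypothetical; the measure is a simple uniqueness count, not a full privacy analysis.

```python
from collections import Counter

# Synthetic, made-up records for illustration only.
RECORDS = [
    {"birth_year": 1970, "postcode_area": "CB1", "diagnosis": "melanoma"},
    {"birth_year": 1970, "postcode_area": "CB1", "diagnosis": "asthma"},
    {"birth_year": 1970, "postcode_area": "CB2", "diagnosis": "melanoma"},
    {"birth_year": 1982, "postcode_area": "CB2", "diagnosis": "asthma"},
    {"birth_year": 1982, "postcode_area": "CB2", "diagnosis": "asthma"},
    {"birth_year": 1982, "postcode_area": "CB1", "diagnosis": "melanoma"},
]


def unique_fraction(records, fields):
    """Fraction of records singled out by their values for `fields`."""
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


fields = []
for name in ("birth_year", "postcode_area", "diagnosis"):
    fields.append(name)
    print(f"{', '.join(fields)}: "
          f"{unique_fraction(RECORDS, fields):.0%} of records are unique")
```

In this toy set nobody is unique by birth year alone, a third are unique once postcode area is added, and two thirds once diagnosis is included, even though no field is identifying on its own.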
Ethical
Medical consent process adds more restrictions above data-protection law.
Limits data use & access even if anonymised.
Controlled data access?
No ad-hoc analysis.
Access via restricted API only (trusted intermediary model).
Policy development ongoing.
Cross-jurisdiction for added fun.
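A trusted intermediary can be pictured as a service that holds the records and answers only a fixed set of aggregate questions. The sketch below is one possible shape under stated assumptions: the class, its single query, the record fields, and the minimum cohort size rule are all hypothetical and do not reflect any agreed policy.

```python
class TrustedIntermediary:
    """Holds patient-level records; exposes aggregate answers only."""

    def __init__(self, records, min_cohort=5):
        self._records = list(records)    # never returned to callers
        self._min_cohort = min_cohort    # assumed disclosure threshold

    def response_rate(self, mutation):
        """Fraction of patients with `mutation` who responded to treatment.

        Refuses to answer when the cohort is small enough that the
        aggregate could reveal something about individual patients.
        """
        cohort = [r for r in self._records if r["mutation"] == mutation]
        if len(cohort) < self._min_cohort:
            raise PermissionError("cohort too small to release a result")
        return sum(1 for r in cohort if r["responded"]) / len(cohort)
```

Callers can ask for a response rate but cannot list, filter, or download records, which is the "no ad-hoc analysis" trade-off: safer access at the cost of only being able to ask pre-approved questions.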
We know where we want to get to.
No shortage of vision.
There are lots of interesting tools and technologies out there.
Getting them to work coherently together will be a challenge.
Prototyping efforts are underway.
Need to leverage expertise and experience in other fields.
Not simply technical issues:
Significant policy issues need to be worked out.
We have to bring the public along.
Global Alliance whitepaper:
Million Cancer Genome Warehouse whitepaper