Challenges and Opportunities of Big Data Genomics


Page 1: Challenges and Opportunities of Big Data Genomics

Challenges and Opportunities of Big Data Genomics

Yasin Memari, Wellcome Trust Sanger Institute

January 2014

Page 2: Challenges and Opportunities of Big Data Genomics

Outline
•  Big data genomics: hype or reality?
•  Limitations of big data analysis
•  Hardware and software solutions
•  Bioinformatics using MapReduce
•  Hadoop Distributed File System
•  Cloud computing for genomics
•  Configuring VRPipe in the cloud
•  Lessons from cloud computing
•  A unified bioinformatics platform

Page 3: Challenges and Opportunities of Big Data Genomics

Big Data Genomics: Hype or Reality?

•  The bottleneck in sequencing has moved from data generation to data handling.
•  The world's sequencing capacity stood at ~15 PB in 2013 and is expected to double every year.
•  10 petabytes of storage are required for 100,000 human genomes (50X coverage, ~100 GB each).
•  Storing each genome costs around $100 per year.
•  A data deluge is inevitable in the interim as sequencing becomes cheaper.
•  In the long term, DNA itself is a better storage medium!
•  Throughput from metagenomic and single-cell sequencing will rapidly outpace hard-won gains in compression.
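The storage arithmetic above can be checked directly. A minimal sketch, using the slide's own per-genome size and cost estimates:

```python
# Back-of-the-envelope check of the slide's storage figures.
GENOMES = 100_000
GB_PER_GENOME = 100          # 50X human genome, per the slide
COST_PER_GENOME_YEAR = 100   # USD per genome per year, per the slide

total_pb = GENOMES * GB_PER_GENOME / 1_000_000   # GB -> PB (decimal units)
annual_cost = GENOMES * COST_PER_GENOME_YEAR

print(f"{total_pb:.0f} PB total")        # 10 PB
print(f"${annual_cost:,} per year")      # $10,000,000 per year
```

At $100 per genome per year, a 100,000-genome archive already implies a $10M annual storage bill, which is why the deck keeps returning to compression and data minimization.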

 

Page 4: Challenges and Opportunities of Big Data Genomics

Use case scenario: run lobSTR on the above datasets to understand variation at short tandem repeats on a genome-wide and population-wide scale, and how they contribute to phenotypic variation.

Page 5: Challenges and Opportunities of Big Data Genomics

"D and A" Model

Sanger's farm data flow

Download/transfer and Analyze:
•  I/O-intensive jobs can overload NAS fileservers.
•  High-performance file systems provide fast access to data for multiple clients.
•  Network performance is the limiting factor for big data.

(Figure: filesystem load)

Page 6: Challenges and Opportunities of Big Data Genomics

Compress the data! High coverage equals high redundancy!
•  Images/TIFF files no longer in use.
•  No intermediate fastq files: bcl and locs/clocs -> BAM (directly).
•  BAM is being replaced by CRAM (30% reduction in size).
•  Discard the read data every 5-10 years!
•  More compression? Smooth out sequencing errors, normalize the coverage, down-sample, etc.
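Down-sampling, mentioned above as a last-resort form of compression, can be done with a single streaming pass. A generic reservoir-sampling sketch (not any specific tool's method; `downsample` and the toy read names are illustrative):

```python
import random

def downsample(reads, target, seed=0):
    """Reservoir-sample `target` reads uniformly from a read stream.

    One pass, constant memory in the number of retained reads, so it
    works on streams too large to hold in RAM.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    reservoir = []
    for i, read in enumerate(reads):
        if i < target:
            reservoir.append(read)
        else:
            # Replace an existing entry with probability target/(i+1).
            j = rng.randrange(i + 1)
            if j < target:
                reservoir[j] = read
    return reservoir

reads = [f"read{i}" for i in range(1000)]
subset = downsample(reads, 100)
print(len(subset))  # 100
```

Trading 1000 reads for 100 here mirrors the slide's point: coverage beyond what the analysis needs is redundancy you pay to store.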

Page 7: Challenges and Opportunities of Big Data Genomics

How can we improve storage performance?
•  Scale-out architectures are still costly and impractical, e.g. scale-out NAS ($1000/TB) or SAN over Fibre Channel.
•  Solid-state drives (SSDs) are being used to enhance cache memory and IOPS performance.
•  Hybrid storage systems integrate SSDs into traditional HDD-based storage arrays as a first tier of storage.
•  Avere FXT and Nexsan NST store warm data in SSDs for storage acceleration and migrate cold data to powered-down drives.
•  Fast random access can be achieved by storing metadata in flash SSDs; the gain for sequential access is limited!
•  Alternatively, archive the data in cheap object stores in the cloud, but invest in bandwidth!

Page 8: Challenges and Opportunities of Big Data Genomics

What can be done about network latency?
•  Use high-performance network protocols (e.g. UDP-based UDT) to achieve higher speeds than can be achieved with TCP.
•  Aspera's fasp accelerates transfers in high-latency, high-loss networks where the transport protocol is a bottleneck.
•  Transmission rates can be enhanced using multiple concurrent transfers (multi-part downloads):
•  GeneTorrent is a file transfer client application based on BitTorrent technology (up to 200 MB/s over the internet).
•  GridFTP (implemented in the Globus toolkit) enables reliable, high-speed transmission of very large files (up to ~800 MB/s where scp achieves 17 MB/s).
•  High-speed internet connections (StarLight/Internet2)? Firewall and network security problems.
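The multi-part downloads mentioned above all rest on the same idea: split the file into byte ranges and fetch them concurrently. A sketch of the range computation (the `part_ranges` helper is illustrative, not part of any of the tools named above):

```python
def part_ranges(size, parts):
    """Split `size` bytes into `parts` contiguous (start, end) ranges,
    inclusive, as used in HTTP Range requests for multi-part downloads."""
    base, extra = divmod(size, parts)
    ranges, start = [], 0
    for i in range(parts):
        # Distribute the remainder over the first `extra` parts.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges

# A 1 GB file fetched in 4 concurrent parts:
for start, end in part_ranges(1_000_000_000, 4):
    print(f"Range: bytes={start}-{end}")
```

Each range would then go to its own connection (thread, process, or TCP stream), which is how clients keep the pipe full when a single stream is latency-bound.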

Page 9: Challenges and Opportunities of Big Data Genomics

Alternative Models
What types of analyses do we run in genomics?
•  Embarrassingly parallel algorithms: most sequence analysis software has distributed solutions, e.g. alignment, imputation. Use genome chunking and run in batches!
•  Tightly-coupled algorithms: some require message passing or shared memory, e.g. genome assembly, pathway analysis.

Forms of parallelism:
•  Task parallelism: distribute the execution threads across different nodes.
•  Data parallelism: distribute the data across different execution nodes.

Page 10: Challenges and Opportunities of Big Data Genomics

Healthcare  data  need  to  be  stored  and  analyzed  centrally!  

Page 11: Challenges and Opportunities of Big Data Genomics

Map-Reduce Framework
A distributed solution to a data-centric problem:
•  Map: divide the problem into smaller chunks and send each compute task to where the data resides.
•  Reduce: collect the answers to each sub-problem and combine the results.

Page 12: Challenges and Opportunities of Big Data Genomics

Example: K-mer Counting

Application developers focus on 2 (+1 internal) functions:
•  Map: input -> key:value pairs
•  Shuffle: group together pairs with the same key
•  Reduce: key, value-list -> output

Input reads: ATGAACCTTA, GAACAACTTA, TTTAGGCAAC

Map (emit each 3-mer with count 1): (ATG:1) (TGA:1) (GAA:1) (AAC:1) (ACC:1) (CCT:1) (CTT:1) (TTA:1) (GAA:1) (AAC:1) (ACA:1) (CAA:1) (AAC:1) (ACT:1) (CTT:1) (TTA:1) (TTT:1) (TTA:1) (TAG:1) (AGG:1) (GGC:1) (GCA:1) (CAA:1) (AAC:1)

Shuffle (group by k-mer): ACA -> 1; ATG -> 1; CAA -> 1,1; GCA -> 1; TGA -> 1; TTA -> 1,1,1; ACT -> 1; AGG -> 1; CCT -> 1; GGC -> 1; TTT -> 1; AAC -> 1,1,1,1; ACC -> 1; CTT -> 1,1; GAA -> 1,1; TAG -> 1

Reduce (sum each list): ACA:1, ATG:1, CAA:2, GCA:1, TGA:1, TTA:3, ACT:1, AGG:1, CCT:1, GGC:1, TTT:1, AAC:4, ACC:1, CTT:2, GAA:2, TAG:1

Map, shuffle and reduce all run in parallel.

(Slide: Michael Schatz)
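The same map/shuffle/reduce flow can be written out in plain Python. This is a toy single-machine sketch of the idea, not Hadoop itself; the function names are illustrative:

```python
from collections import defaultdict

def kmer_map(read, k=3):
    """Map: emit a (k-mer, 1) pair for every k-mer in a read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    """Reduce: sum the value list for each key."""
    return {kmer: sum(ones) for kmer, ones in groups.items()}

reads = ["ATGAACCTTA", "GAACAACTTA", "TTTAGGCAAC"]
pairs = [p for read in reads for p in kmer_map(read)]
counts = reduce_counts(shuffle(pairs))
print(counts["AAC"], counts["TTA"], counts["CTT"])  # 4 3 2
```

In a real MapReduce engine the map calls run on the nodes holding the reads, the shuffle moves pairs across the network grouped by key, and the reduces run in parallel per key; the counts here match the slide's worked example.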

Page 13: Challenges and Opportunities of Big Data Genomics

Hadoop Distributed File System (HDFS)
•  Apache Hadoop is an open-source implementation of Google's MapReduce and the Google File System (GFS).
•  A highly reliable and scalable solution for storing and processing massive data using cheap commodity hardware.
•  Optimised for high-throughput access to data. Data is replicated for fault tolerance.

Page 14: Challenges and Opportunities of Big Data Genomics

HDFS vs Lustre

Hadoop
•  Data is local: data nodes act as compute nodes.
•  I/O is not very relevant here, although it can be improved by concurrency.
•  Optimised for batch processing.
•  Single-node bottlenecks or name node failures.

Lustre
•  Data is shared: compute clients talk to object store servers.
•  High aggregate I/O can be achieved with striping.
•  Optimised for HPC. Used in the Top500!
•  The bottleneck is getting the data onto Lustre!

(Diagrams: Hadoop nodes each pair a CPU with direct-attached storage (DAS) over the network; Lustre clients talk over the network to object storage servers (OSS) fronting object storage targets (OST).)

Page 15: Challenges and Opportunities of Big Data Genomics

Bioinformatic Tools for Hadoop
Suites of tools actively under development:
•  SeqPig: a library which utilizes Apache Pig to translate sequence data analysis into a sequence of MapReduce jobs.
•  Seal: a collection of distributed applications for alignment and manipulation of short-read sequence data.
•  SeqWare: a toolkit for building high-throughput sequencing data analysis workflows in cloud-based environments.
•  And many algorithms for sequence mapping (CloudAligner), SNP calling (Crossbow), de novo assembly (Contrail), peak calling (PeakRanger) and RNA-Seq data analysis (Eoulsan, FX and Myrna).

Page 16: Challenges and Opportunities of Big Data Genomics

Virtualization increases utilization of costly hardware:
•  Entire workflows run as virtual machines residing in the SAN. VMs are sent to hypervisors for execution.

(Diagram: hardware virtualization. App/OS stacks run as VMs on a hypervisor (Xen, Hyper-V, etc.) over shared hardware (CPU, memory, etc.), connected over Fibre Channel to a storage-area network (SAN) and a management console.)

Page 17: Challenges and Opportunities of Big Data Genomics

Cloud Computing
What does the AWS cloud have to offer?
•  Networking: Direct Connect, Virtual Private Cloud (VPC), Route 53
•  Compute: Elastic Compute Cloud (EC2), Elastic MapReduce
•  Storage: Simple Storage Service (S3), Glacier, Storage Gateway, CloudFront
•  Database: Relational Database Service (RDS), DynamoDB, ElastiCache, Redshift
•  Management: Identity and Access Management (IAM), CloudWatch, CloudFormation, Elastic Beanstalk

Page 18: Challenges and Opportunities of Big Data Genomics

Network Performance within Amazon

•  Bandwidth within AWS is far too low for moving big genome data.
•  Experiments achieve maximum speeds of 70-80 MB/s between two EC2 instances and 10-20 MB/s between EC2 and S3.
•  Download from S3 to EC2 is unreliable and constrained, given data ingestion over HTTP.
•  Gigabit Ethernet in EC2 is only available with cluster instances.
•  Enhanced networking using network virtualization may provide higher I/O performance.
•  CloudFront, Amazon's content delivery service, provides streaming at HD rates only.
•  AWS Data Pipeline is not up to the task of big data workflows.

Page 19: Challenges and Opportunities of Big Data Genomics

VRPipe in the Cloud
To deploy VRPipe in the cloud one needs to satisfy the following requirements (Sendu Bala):
•  Set up a database management system (DBMS) for VRPipe in AWS RDS (or use SQLite or a locally installed MySQL database).
•  Create a distributed file system to provide shared access to software and data (adjust for speed or redundancy).
•  Configure VRPipe and provide the required permissions and security credentials.
•  Install and configure a job scheduling system supported by VRPipe, e.g. SGE or LSF.

https://github.com/VertebrateResequencing/vr-pipe/wiki

   

Page 20: Challenges and Opportunities of Big Data Genomics

Testing VRPipe in the AWS Cloud
Alignment and calling of 110 Phase 3 YRI exomes (~1.1 TB):

(sequence.index) ->
2. 1000genomes_illumina_mapping_with_improvement
27. bam_merge_lanes_and_fix_rgs
61. snp_calling_mpileup
59. snp_calling_gatk_unified_genotyper_and_annotate
89. vcf_gatk_filter
90. vcf_merge
93. vcf_vep_annotate

•  Set up a GlusterFS volume using EBS blocks attached to EC2 instances.
•  Enable Elastic Load Balancing within the VPC and grant r/w privileges to the DBMS.
•  Optionally use SGE job scheduling in conjunction with EC2 load balancing.

Page 21: Challenges and Opportunities of Big Data Genomics

Lessons from the AWS Cloud
•  The bulk of the cloud is made of general-purpose hardware suitable for enterprise computing.
•  Scientific applications require compute-optimised HPC platforms and high-speed I/O and storage.
•  On-demand services are expensive, but large organizations may benefit from the economy of scale!?
•  As a self-service environment, the user must handle sysadmin tasks including provisioning and configuration.
•  EC2 not being able to compute against S3 (high-I/O tasks) recalls the same "D and A" problem!
•  Elastic MapReduce (EMR) runs on EC2 instances, with ephemeral disks used to build HDFS, so data need to be streamed in and out of S3.
•  Virtualization imposes performance penalties as the available physical resources are shared among VMs.

Page 22: Challenges and Opportunities of Big Data Genomics

Bio-cloud Prototypes
•  The EBI has developed an in-house cloud for public sequence repositories such as the European Genome-phenome Archive (EGA).
•  The National Center for Biotechnology Information is working on cloud implementations for storing genomic data such as dbGaP.
•  The Beijing Genomics Institute has developed five bio-cloud computing centers in different locations that store and process genomes.
•  The US National Cancer Institute maintains the Cancer Genomics Hub (CGHub), a system for storing large genome data.
•  The Broad Institute has instantiated its analysis pipeline for germline and cancer somatic data on commercial cloud environments.
•  The AMP Lab at UC Berkeley has developed and is deploying its genome analysis pipeline on commercial cloud environments.
•  Illumina uploads data directly to the cloud, where it has created a platform for sequence analysis called BaseSpace.

Source: Global Alliance White Paper, 3 June 2013

Page 23: Challenges and Opportunities of Big Data Genomics

Data/Pipeline Sharing
•  Grid computing in the cloud enables sharing data and resources across virtualized servers.
•  Cloud APIs enable application interoperability and cross-platform compatibility.
•  Applications are able to launch and access distributed data irrespective of the underlying IT infrastructure.

(Diagram: BGI, Sanger, Broad, NCBI and EBI private clouds connected through a public cloud.)

Page 24: Challenges and Opportunities of Big Data Genomics

A Unified Platform?
An open-source platform for storing, organizing, processing, and sharing very large genomic and biomedical data on premise or in the cloud:
•  Data management: file and metadata storage, structured/unstructured data, provenance tracking, security and access control.
•  Content-addressable distributed file system: scalability and fault tolerance, block storage of data, high performance over low latency.
•  Computation and pipeline processing: pipeline creation tools, revision control system, MapReduce engine, etc.
•  APIs and SDKs: REST and native APIs, web-based user interface, command-line interface, programming languages and tools, etc.
•  Cloud OS and virtualization: networking, self-service provisioning, administration, block storage, user management, etc.

Page 25: Challenges and Opportunities of Big Data Genomics

Discussion
•  Compute is much cheaper. Algorithms run faster and more efficiently.
•  Transmission of big data will be a bottleneck. Network latency and storage I/O are the limiting factors.
•  Minimize the data flow!
•  Distributed file systems have reduced the costs; routine analytics of big data has been made possible using cheap commodity hardware.
•  We should feel lucky that sequence analysis is mainly embarrassingly parallel!
•  MapReduce engines may be deployed in genome data centres?
•  Cloud computing enables data and application sharing across consolidated IT infrastructures.