
Having it both ways: Bring Data to Computation & Computation to Data with iRODS

Nirav Merchant
The University of Arizona
nirav@email.arizona.edu

http://www.cyverse.org
Twitter: @CyVerseOrg


Topic Coverage:
• Motivation/Use case
• Constraints, challenges
• Technology options
• Our solution, early results
• Next steps

 

 

 


CyVerse: Platform Philosophy
• Strive to provide the CI Lego blocks
• Danish 'leg godt': 'play well'
• Also translates as 'I put together' in Latin
• If desired functionality is not available, the community can craft their own by using and extending CyVerse CI components (like Lego blocks)
• Through these extensible and customized platforms, create an ecosystem of interoperable tools that benefits the broad community (and not just a few lab groups)
• Provide the tools that allow the community to manage their digital assets (cloud, HPC, etc.)
• Improve computational productivity

 

6/3/16   3  


The CyVerse Technology Stack: A Blueprint for Cyberinfrastructure Design

[Figure: stack layers labeled Ready-to-use Platforms, Foundational Capabilities, Established CI Components, and Extensible Services; side labels Ease of Use and Flexibility. http://www.cyverse.org]


How is it being used?
• Users build and manage their own systems, powered by CyVerse components
• Share analysis methods, algorithms, data (reproducibility)
• Consume specific components (a la carte, Data Store, Atmosphere)
• Directly use applications (DE)
• Custom design appliances (Atmosphere)
• Publish their findings (PNAS, Nature)
• Advocate use and build "your" community
• Create new learning material and courses, special topics workshops

Licensed under CC BY 2015, http://iplantc.org


Cohesive Platform for Data Lifecycle



The eternal question...

Data to Compute or Compute to Data?

6/8/16


Toolchest
• iRODS
• Condor
• Docker
• Rethinking the role of a "resource server"



Motivation: Data to Compute
• Most of our use cases operated on ~100-200 GB of data at a time
• Many of the analysis steps needed few cores (~12) and reasonable RAM (~128 GB)
• Tasks were "naturally data parallel"
• Easier to provision, share, scale and maintain "shared nothing" (or not much) computing infrastructure



Our Solution: Data to Compute

[Figure, conceptual view. Components shown: Discovery Env., Condor Master (Docker), iRODS, a pool of Condor Workers, and Other Compute infrastructure (HPC, Cloud).]


Motivation: Compute to Data
• Moving data to compute is not feasible in many cases (100 TB+, large repositories)
• Availability of "fat nodes" (or choice for resource servers)
• Availability of specialized compute with storage systems (Wrangler)
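To put the two data scales from these slides side by side, here is a back-of-the-envelope sketch in Python. The 10 Gbit/s link speed and the `transfer_hours` helper are assumptions chosen for illustration; the slides do not state a bandwidth, and real transfers rarely sustain the full line rate.

```python
GB = 1e9   # decimal gigabyte, in bytes
TB = 1e12  # decimal terabyte, in bytes


def transfer_hours(size_bytes: float, link_gbit_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move size_bytes over a link.

    link_gbit_per_s is the nominal line rate (an assumed value, not from
    the slides); efficiency is the fraction of it actually sustained.
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be > 0 and efficiency in (0, 1]")
    bits = size_bytes * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Sizes taken from the slides: ~200 GB per analysis vs. 100 TB+ repositories.
    for label, size in [("200 GB", 200 * GB), ("100 TB", 100 * TB)]:
        print(f"{label}: {transfer_hours(size, 10):.2f} h at an assumed 10 Gbit/s")
```

Under that assumption, a 200 GB input moves in a few minutes while a 100 TB repository takes roughly a day per copy, which is the gap that makes sending the job to the data attractive.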



Our Solution: Compute to Data

[Figure, conceptual view. Components shown: Discovery Env., Condor Master (Docker), iRODS, a pool of Condor Workers, and Other Compute infrastructure (HPC, Cloud); some nodes carry the labels "R" and "Res".]


Steps
• Bring in data (choose your method)
• Register the data with iRODS
• Apply the metadata (ipc_data_set=IPCC-WG2)
• Let Condor announce it (ClassAds), and configure limits (number of concurrent jobs, cores, RAM, space to write output, etc.)
• Submit the job with a ClassAd and let the Condor scheduler match and manage it
• If you need more, create more copies (replicas) and profit
• If you need to send it elsewhere (HPC etc.), use glidein and BOSCO
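The register and tag steps above can be sketched with the standard iRODS icommands. This is a minimal illustration, not the presenter's tooling: the helper name `registration_commands` and the `/data/ipcc` path are hypothetical, and the `ireg -C` and `imeta add -C` flags should be checked against the installed iRODS version. The function only builds the command lines; it does not run them.

```python
from typing import Dict, List


def registration_commands(physical_path: str, collection: str,
                          metadata: Dict[str, str]) -> List[List[str]]:
    """Build argv lists that register a directory in place and tag it.

    Assumes the stock icommands: `ireg -C` to register an existing
    directory as a collection, and `imeta add -C` to attach one
    attribute/value pair to that collection.
    """
    commands = [["ireg", "-C", physical_path, collection]]
    for attribute, value in metadata.items():
        commands.append(["imeta", "add", "-C", collection, attribute, value])
    return commands


if __name__ == "__main__":
    # The metadata pair is the one shown on the slide; the paths are made up.
    for argv in registration_commands(
        "/data/ipcc",
        "/tempZone/home/user/ipcc",
        {"ipc_data_set": "IPCC-WG2"},
    ):
        print(" ".join(argv))
```

Returning argv lists keeps the sketch free of side effects; passing each list to `subprocess.run(argv, check=True)` would execute it on a host where the icommands are configured.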



ireme

ireme is a command-line utility that registers dataset(s) with iRODS and assigns metadata to those datasets. The metadata is then used with Condor's ClassAds mechanism to match jobs with machines. ireme is also responsible for orchestrating the process of advertising the metadata and datasets present on the Condor worker / resource worker, in the form of machine ClassAds.

Usage
  -p --path : Physical path of the resource to be registered with iRODS
  -c --coll : Collection name within the iRODS database where files are registered
  -m --meta : Comma-separated metadata tags (key:value pairs) associated with the collection

Example Syntax
  ireme -p /home/user/sample_folder -c /tempZone/home/user/sample_coll -m key1:value1,key2:value2
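The command-line surface described above can be mirrored with Python's `argparse`. This is a hypothetical re-creation for illustration only: the slides do not show how ireme is implemented, and `parse_meta` and `build_parser` are names invented here. Only the three options and the `key:value` list format come from the slide.

```python
import argparse
from typing import Dict


def parse_meta(text: str) -> Dict[str, str]:
    """Turn 'key1:value1,key2:value2' into a dict, rejecting malformed items."""
    pairs: Dict[str, str] = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"expected key:value, got {item!r}")
        pairs[key] = value
    return pairs


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing the -p/-c/-m options listed on the slide."""
    parser = argparse.ArgumentParser(prog="ireme")
    parser.add_argument("-p", "--path", required=True,
                        help="physical path of the resource to register")
    parser.add_argument("-c", "--coll", required=True,
                        help="collection name where files are registered")
    parser.add_argument("-m", "--meta", type=parse_meta, default={},
                        help="comma-separated key:value metadata tags")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.path, args.coll, args.meta)
```

Parsing the metadata at the argument boundary means a malformed tag is reported as a usage error before anything is registered.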


iRODS ClassAds

IRODS_RESOURCE is the custom ClassAd variable through which a job advertises the iRODS resource it requires, in the form of metadata tags or a dataset name (collection name). The Condor Negotiator matches the job ClassAd requirement (metadata or dataset) against the ClassAds advertised by the Condor Workers.

Sample Job ClassAd w/ iRODS requirement

  Executable = test2
  Log = test.log
  Output = test.out
  Error = test.error
  +IRODS_RESOURCE = "key=value"
  Requirements = TARGET.meta_available == true
  Queue

Condor Negotiator

Sample Machine ClassAd w/ iRODS Ads

  meta_available = isMetaAvaialbe(TARGET.IRODS_RESOURCE)
  STARTD_EXPRS = meta_available, $(STARTD_EXPRS)
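The matching idea behind these two ClassAds can be modeled in a few lines of Python. This is a toy model under stated assumptions, not HTCondor's negotiator and not the real ClassAd function from the slide: `Machine`, `meta_available` and `match` are hypothetical names, and treating a request containing "=" as a metadata tag and anything else as a dataset name is a simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Machine:
    """A worker that knows which iRODS metadata and datasets it holds locally."""
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    datasets: Set[str] = field(default_factory=set)

    def meta_available(self, irods_resource: str) -> bool:
        """Toy stand-in for the machine-side check on the job's IRODS_RESOURCE."""
        key, sep, value = irods_resource.partition("=")
        if sep:
            # "key=value" form: compare against locally advertised metadata.
            return self.metadata.get(key) == value
        # Otherwise treat the request as a dataset (collection) name.
        return irods_resource in self.datasets


def match(irods_resource: str, machines: List[Machine]) -> List[str]:
    """Names of machines whose advertised data satisfies the job's request."""
    return [m.name for m in machines if m.meta_available(irods_resource)]


if __name__ == "__main__":
    pool = [
        Machine("worker1", metadata={"ipc_data_set": "IPCC-WG2"}),
        Machine("worker2"),
        Machine("worker3", datasets={"/tempZone/home/user/sample_coll"}),
    ]
    print(match("ipc_data_set=IPCC-WG2", pool))
    print(match("/tempZone/home/user/sample_coll", pool))
```

In this model only the workers that already hold the requested data are eligible, so the job is routed to the data rather than the data being copied to the job.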

 


[Figure. Components shown: Condor Master / Negotiator / Collector; three Condor Worker / Resource Server nodes; ICAT server.]


Data Flow

[Figure. Components shown: a User Job with an iRODS Dataset ClassAd and an iRODS Meta Data ClassAd; the Condor Master / Negotiator; three Condor Worker / Resource Server nodes, one of them with the required iRODS resource; links labeled "ClassAds".]


Data Flow after ClassAds Matching

[Figure. Components shown: a User Job with an iRODS Dataset ClassAd and an iRODS Meta Data ClassAd; the Condor Master / Negotiator; three Condor Worker / Resource Server nodes, one of them with the required iRODS resource.]


Syndicate: Using CDN & beyond (Edge Computing)

[Figure. Labels shown: S3, DropBox, GenBank, CyVerse, Metadata Service, Shared Volume, and several SG nodes.]
