Towards Enabling Big Data and Federated Computing in the Cloud


Post on 04-Jun-2020






  • Towards Enabling Big Data and Federated Computing in the Cloud

    Yousef Kowsar¹, Enis Afgan¹,²

    ¹ VLSCI, University of Melbourne
    ² CIR, Ruđer Bošković Institute (RBI)

    BOSC 2013, Berlin

  • Lots of big data

    Distributed data

    ‘Cloud’ resources

  • Big Data and Hadoop

    •  Distributed computing solution
       – Map-Reduce paradigm
       – A platform for Big Data
       – Supports running of applications on large clusters

    •  Hadoop has been effectively applied to the Big Data problem
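The Map-Reduce paradigm named on this slide can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real Hadoop job would distribute each phase across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cloud"]))
```

Because the map and reduce steps only see one line or one key at a time, each can run on a different node, which is what lets the paradigm scale to large clusters.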

  • Federated Computing and HTCondor

    •  An approach toward federated computing
    •  HTCondor:
       – Since 1988 at University of Wisconsin-Madison
       – High Throughput Computing on large collections of distributed computing resources: cycle scavenging

    •  Gains from using HTCondor
       – Existing solution
       – Scalability
       – Reliability
       – Cost

  • Cloud Manager for orchestrating cloud resources

    •  Cluster-on-the-cloud, any cloud
    •  Ease the process of establishing a cloud environment for bioinformatics analysis
       – “Galaxy on the Cloud”

    •  Facilitate management of IaaS services

  • A path forward

    •  Have a central manager capturing all three functions at once:
       – CloudMan
          •  Easy & ready to use cluster environment for the cloud
       – Hadoop
          •  Platform for Big Data analysis
       – HTCondor
          •  Central manager able to handle versatile, heterogeneous compute environments

  • Our Approach

    •  Integrate HTCondor and Hadoop into CloudMan clusters

    •  Single management interface

    •  Multiple types of workloads and infrastructures

    •  Make it easier to deploy the necessary platform and enable
       1.  Tool development
       2.  Data analysis

    [Figure: SGE for batch jobs, Hadoop for big-data jobs, Condor for federated jobs]

  • Hadoop-on-demand platform

    •  Hadoop-over-SGE: dynamically set up at runtime
    •  Low and constant setup overhead
    •  Increase infrastructure flexibility
    •  Cost
    •  Workload type

  • Hadoop example

    •  Edit sge-integration script
    •  Submit your job into SGE
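The "submit your job into SGE" step could look roughly like the sketch below. It only assembles a `qsub` command line and does not run it. The script name `hadoop_sge_integration.sh`, the job name, and the log paths are hypothetical placeholders, since the slide does not show the actual integration script; `-N`, `-cwd`, `-o`, and `-e` are standard Grid Engine `qsub` options.

```python
import shlex


def build_qsub_command(script_path, job_name, out_path, err_path):
    """Assemble an SGE qsub command line for an edited integration script.

    -N sets the job name, -cwd runs the job from the current directory,
    and -o / -e choose the files for standard output and standard error.
    """
    return [
        "qsub",
        "-N", job_name,
        "-cwd",
        "-o", out_path,
        "-e", err_path,
        script_path,
    ]


if __name__ == "__main__":
    # All names below are placeholders, not taken from the slides.
    command = build_qsub_command(
        "hadoop_sge_integration.sh", "hadoop_wc", "hadoop_wc.out", "hadoop_wc.err"
    )
    print(shlex.join(command))
```

On a cluster head node, the returned list could be passed to `subprocess.run(command, check=True)` to actually submit the job.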

  • HTCondor integration

    •  Local jobs run via SGE
    •  Nodes pooled together via
       – Flocking
       – Gliding
       – Pool sharing

    [Figure: AWS ($$$) and NeCTAR (~private) clouds]
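Of the pooling mechanisms listed, flocking is driven by two HTCondor configuration macros: `FLOCK_TO` on the submitting side and `FLOCK_FROM` on the receiving pool. The sketch below just renders those two lines. The host names are hypothetical, the slides do not show the configuration CloudMan actually writes, and a real deployment also needs matching security settings, which are omitted here.

```python
def flock_to(remote_central_managers):
    """Config line for the submitting side: pools that idle jobs may flock to."""
    return "FLOCK_TO = " + ", ".join(remote_central_managers)


def flock_from(remote_submit_hosts):
    """Config line for the receiving pool: submit hosts allowed to flock in."""
    return "FLOCK_FROM = " + ", ".join(remote_submit_hosts)


if __name__ == "__main__":
    # Hypothetical hosts: one cluster on each cloud.
    print(flock_to(["cm.cluster2.example.org"]))
    print(flock_from(["submit.cluster1.example.org"]))
```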



  • HTCondor example

    [Figure: Cluster 1 (AWS) and Cluster 2 (NeCTAR) form a common resource pool; panels show the job submission script and the running jobs]
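The job submission script from this slide is not reproduced in the text, so the following is only a plausible stand-in: a small helper that renders an HTCondor submit description file. The executable, arguments, and file tag are hypothetical. The file-transfer lines are included on the assumption that two clusters on different clouds share no filesystem.

```python
def render_submit_description(executable, arguments, tag, count=1):
    """Render an HTCondor submit description for `count` vanilla-universe jobs."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        # $(Process) expands to 0, 1, 2, ... so each job gets its own files.
        f"output = {tag}.$(Process).out",
        f"error = {tag}.$(Process).err",
        f"log = {tag}.log",
        # Assumption: no shared filesystem between the pooled clusters.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder job: run /bin/hostname five times across the pool.
    print(render_submit_description("/bin/hostname", "-f", "where", count=5))
```

Saving the output to a file and passing it to `condor_submit` would queue the jobs; `condor_q` then lists them while they run.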

  • Conclusions

    •  Challenges
       – Data transfer & locality

    •  Future work
       – Streamline scaling of Condor hosts
       – Integration with Galaxy
       – Condor over Hadoop

    •  An architecture paper available from MIPRO 2013
       – “Support for data-intensive computing with CloudMan”

    A cloud environment for distributed computing: batch; Hadoop; HTCondor
    http://

