Page 1

Åke Edlund
KTH PDC-HPC Center for High Performance Computing
KTH HPCViz Data-Intensive Computing Group
KTH PDC-HPC Cloud

OpenNebula: Experiences at KTH

With a deeper dive into emerging data analytics stacks

Page 2

Outline of this talk

Cloud computing and data-intensive computing at PDC - a brief overview

OpenNebula at PDC - examples

Apache Spark at PDC - what I use our cloud for

Page 3

Cloud computing and data-intensive computing at PDC - a brief overview

Page 4

Cloud computing and data-intensive computing at PDC - a brief overview

• Cloud research since 2007
  – Cloud provider since 2009 – national and international users
• Spark user since May 2012 (more in the last section)
  – Version 0.6 released on October 15, 2012
• Research and Development
  – Distributed and federated clouds and data analytics stacks
  – Bioinformatics and life science applications
  – Scalable statistics
  – Self-improving systems
  – Strong and usable security factors to enable researchers to store sensitive data in the cloud
• Projects (many)
  – SNIC Cloud Infrastructure (co-Initiator and Coordinator) – the Swedish roll-out of cloud for eScience
  – NeIC Nordic Cloud (co-Initiator and Coordinator, Swedish part)
  – BioBankCloud (WP leader) – PaaS for biobanking
  – EGI Federated Cloud task force (development and resource provider)
  – VENUS-C (WP leader) (2010 – 2012)
  – …

Page 5

Cloud Resources at PDC

PDC Cloud has been in production (with external users) since 2010 and is today an installation of 364 cores.

- 12 nodes, each consisting of 32 cores, 2 x 1 TB disk and 64 GB RAM
- 20 TB shared (through InfiniBand) by the 12 nodes using Ceph (RBD block devices, S3 object storage); this is under reconstruction (from SAN to dedicated Ceph storage nodes -> 36 TB)
- Cloud middleware used over the years ranges from Eucalyptus to OpenNebula, and now a mix of OpenNebula and OpenStack
- Users access their resources using a web panel and/or CLI/API

Users (so far) are Nordic and European researchers. PDC Cloud is a leading partner in a number of Swedish, Nordic and European cloud projects, e.g. being one of the first certified cloud resource providers to EGI Federated Cloud.

Page 6

Data-Intensive Computing at PDC

HPCViz Data-Intensive Computing Group (started 2012) is a research group building on the experiences from PDC.

- 9 group members (7 researchers, 2 developers)
- Collaborating mainly with Uppsala University (bioinformatics) and KI (SciLifeLab) on applying, and further expanding, emerging techniques for iterative and interactive in-memory data analytics stacks (Spark, Stratosphere, H2O, …)
- Other areas of interest include anomaly detection in streaming data, with applications in performance improvement of distributed systems, and security (intrusion detection).

Page 7

Our Cloud Learning Curve

Timeline axis: 2001, 2004, 2007, 2010, 2011, 2012, 2013, 2014

PDC-HPC (since 1989)

Grid Computing projects (DataGrid, EGEE, EGI) – including EGI Federated Clouds TF

KTH PDC Cloud experimentation

Nordic cloud project, NEON (2010). Practical evaluation [1], testing public vs private cloud for eScience users (bioinformatics).

SNIC Cloud project (2011.6-2012.6+). Enabled cloud access (public and private) for SNIC users. 14 (some recurring) users of SNIC Cloud for Amazon (e.g. running Galaxy) and 54 on the private cloud (currently only PDC Cloud, partially from outside SNIC).

SNIC Galaxy project (2013.3-2014.3). The goal of the project is to deliver Galaxy as a service, using the Galaxy cloud management platform, CloudMan, on local cloud installations (private clouds).

SNIC Cloud Infrastructure (long-term, started Jan 2014). A (generic) IaaS on which communities/users can build their PaaS. Strong emphasis on user communities and their commitment.

Diagram labels: Public IaaS, Private IaaS, Private PaaS, Public PaaS

[1] "Practical Cloud Evaluation from a Nordic eScience User Perspective", VTDC'11, ACM conference, San Jose (2011), by Åke Edlund, Maarten Koopman, Zeeshan Ali Shah, Ilja Livenson, Frederik Orellana, Jukka Kommeri, Miika Tuisku, Pekka Lehtovuori, Klaus Marius Hansen, Helmut Neukirchen, Ebba Þóra Hvannberg

Page 8

Our Cloud Learning Curve

IaaS -> PaaS: Security concerns. Service to our users. Easier to manage larger user groups.

Public IaaS -> Private IaaS: Large amount of sensitive data, often too cumbersome for practical use of public clouds.

Page 9

Federated Cloud Projects

Current Cloud Projects

- SNIC Cloud (co-Initiator and Coordinator) – the Swedish roll-out of cloud for eScience
- NeIC Nordic Cloud (co-Initiator and Coordinator, Swedish part)
- BioBankCloud (WP leader) – PaaS for biobanking
- EGI Federated Cloud (development and resource provider)

Earlier Cloud Projects

- SNIC Galaxy (PaaS) (co-Initiator and Coordinator) (2013)
- SNIC Cloud (Initiator and Coordinator) (2011-2012)
- SICS Startup Accelerator (co-Initiator and Coordinator) (2011)
- VENUS-C (WP leader) (2010-2012)
- NEON – Northern Europe cloud project (Initiator and Coordinator) (2010)

Page 10

OpenNebula at PDC - examples

Main contribution to this section: from Zeeshan Ali Shah

Page 11

Started with Eucalyptus

• Back in 2009
• Federated between KTH centers across Stockholm.
• Then Eucalyptus chose a Red Hat-style licensing model.
• And we selected OpenNebula due to its openness and easy access to its core team, which was located in the EU.

Page 12

OpenNebula

• 2010 - Selected during the technical kick-off of the VENUS-C project
• Based in the EU, easy access to developers
• Fully open source
• Started with OpenNebula 2.0
• An OVF (Open Virtualization Format) interface was developed within VENUS-C
• Federated with other VENUS-C sites such as BSC (Spain) and ENGINEERING (Italy).

Page 13

User base

www.e-science.se

www.scilifelab.se

www.natmeg.se

Neurosciences, Karolinska Institute

And, yes, from EGI Fed cloud communities

Science for Life Laboratory (SciLifeLab) is a national center for molecular biosciences with focus on health and environmental research.

Page 14

OpenNebula User experience

• Served 100+ users, both Swedish and other EU researchers
• Interfaces:
  – OpenNebula CLI
  – Sunstone Dashboard
  – SDK (not so many users, but the option was there)
• Conducted hands-on workshops for users

Page 15

Federation with EGI

• Compute using OCCI (backend with OpenNebula)
• Auto injection of user keys from VOMS server
• Federated identity with VOMS and X.509
• Information system
• Accounting service

From “The EGI Federated Cloud, a production IaaS infrastructure for the EEA”, D. Wallom (EGI CF, 20.04.2014)

Page 16

Bio science users

Pre-configured apps with OpenNebula

• Galaxy - galaxyproject.org
• CloudBioLinux - cloudbiolinux.org

Figure labels: CloudBioLinux; Galaxy (AWS - for CloudMan)

Issue: PoC of CloudMan on OpenNebula (SARA, NL) - but moved to OpenStack

Page 17

Way forward

“Wish list” from Zeeshan Ali Shah

• Dedicated storage service, like S3 or Swift (OpenStack)
• Network service for versatile setups, like Neutron (OpenStack)
• Image caching on compute nodes.
  – To minimize VM launch time: we noticed that most of the launch time was spent copying the image to the designated host
  – A shared FS is an option, but it has its own limitations.
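The image-caching wish can be illustrated with a small sketch. This is not OpenNebula functionality: it assumes a hypothetical per-node cache (`ImageCache`, `launch` and the capacity are invented names) that keeps recently used images on the compute node and evicts the least recently used one, so that a repeat launch of the same image can skip the copy step.

```scala
import scala.collection.mutable

// Hypothetical per-node image cache (an assumption for illustration only):
// holds at most `capacity` images and evicts the least recently used one.
final class ImageCache(capacity: Int) {
  require(capacity > 0, "capacity must be positive")

  // Insertion order doubles as recency order: the oldest entry comes first.
  private val cached = mutable.LinkedHashMap.empty[String, Long]

  /** Returns true on a cache hit (no copy needed), false if the image had to be copied. */
  def launch(imageId: String, sizeBytes: Long): Boolean =
    cached.remove(imageId) match {
      case Some(size) =>
        cached.put(imageId, size) // re-insert to mark as most recently used
        true
      case None =>
        if (cached.size >= capacity) cached.remove(cached.head._1) // evict the LRU image
        cached.put(imageId, sizeBytes)
        false
    }

  /** Cached image ids, least recently used first. */
  def contents: List[String] = cached.keys.toList
}
```

With a capacity of two, launching "ubuntu", "galaxy", "ubuntu" and then "cloudbiolinux" would copy three images and reuse one; "galaxy" is the image that gets evicted.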

Page 18

Big Data analytics

• Apache Spark
• Hadoop
• Mesos -> YARN
• Orchestration of Spark clusters with OpenNebula

See next section …

Page 19

Apache Spark at PDC - what I use our cloud for

Page 20

Sources of Big Data

Timeline: 1990 to 2010

Probing extreme phenomena in scientific fields with mature theories

Increasingly exploratory research areas

Making meaning of human activity on the Internet

Sensing everything

Page 21

Sources of Big Data

Sthlm, May 2014

Page 22

Research at HPCViz Data-Intensive Computing Group

… building a DS curriculum for the group

Diagram labels: Applications, Technologies, Industry, Algorithms, Theory

- Brain images – Scabia project, MEG data
- PaaS for Life Science - BioBankCloud, Galaxy, ..
- Privacy preservation in the cloud - BioBankCloud
- Federated clouds - EGI, Nordic Cloud, CDMI proxy
- Cloud environments - environment launching, streaming capabilities, workflows (including graph data capabilities)
- Anomaly detection in performance data - intrusion detection, performance analysis, sensor data, IoT, …
- Chemoinformatics - MapReduce-based parallel virtual screening
- Cloud and industry – esp. startups
- Next: Scalable statistics

Page 23

Federated Cloud Services

Tier 1: Reliable Infrastructure Cloud

Tier 2: General-purpose platform services

Tier 3: Platform as a Service

Tier 4: Zero ICT Infrastructures

Diagram labels: Federated IaaS and STaaS Cloud; PaaS; DBaaS; Hadoop aaS; VRE; Virtual eLaboratory; Secure storage; Key mgmt; Encryption; ACL mgmt

From “The EGI Federated Cloud, a production IaaS infrastructure for the EEA”, D. Wallom (EGI CF, 20.04.2014)


Page 25

DAaaS - What do We Need?

• Interactive queries: enable faster decisions
• Queries on streaming data: enable decisions on real-time data
• Sophisticated data processing: enable “better” decisions
• Need of statistical principles (that scale): to justify the inferential leap from data to knowledge
  – Need estimates of uncertainty in the outputs of algorithms (“error bars”)
• Pipelines: ability to run mixed analysis under one framework – for efficiency and to be able to develop sophisticated algorithms

Support batch, streaming, and interactive computations… in a unified framework
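One common way to put “error bars” on an algorithm's output is the bootstrap. The sketch below is an illustration under stated assumptions, not a method taken from the talk: it uses plain Scala on a small in-memory sample, a percentile bootstrap for the mean, and invented names (`BootstrapErrorBars`, `meanInterval`). Scaling this to large data sets is exactly the open problem the slide points at.

```scala
import scala.util.Random

object BootstrapErrorBars {
  /** Mean of a non-empty sample. */
  def mean(xs: IndexedSeq[Double]): Double = xs.sum / xs.size

  /**
   * Percentile-bootstrap interval for the mean.
   * Returns (lower, upper) bounds at the given confidence level.
   */
  def meanInterval(sample: IndexedSeq[Double],
                   resamples: Int = 2000,
                   confidence: Double = 0.95,
                   seed: Long = 42L): (Double, Double) = {
    require(sample.nonEmpty, "sample must not be empty")
    require(resamples > 0, "resamples must be positive")
    val rng = new Random(seed)
    // Resample with replacement and record the mean of each resample.
    val means = IndexedSeq.fill(resamples) {
      mean(IndexedSeq.fill(sample.size)(sample(rng.nextInt(sample.size))))
    }.sortWith(_ < _)
    // Read the interval bounds off the sorted resample means.
    val alpha = (1.0 - confidence) / 2.0
    val lower = means((alpha * (resamples - 1)).toInt)
    val upper = means(((1.0 - alpha) * (resamples - 1)).toInt)
    (lower, upper)
  }
}
```

Calling `BootstrapErrorBars.meanInterval(IndexedSeq(1.0, 2.0, 3.0, 4.0, 5.0))` returns a pair of bounds around the sample mean; a narrower interval indicates a more stable estimate.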

Page 26

Berkeley Data Analytics Stack

Applications

Data Processing: Spark Streaming, GraphX, MLBase, BlinkDB, Shark, Pig, HIVE, Storm, MPI, …; Spark, Hadoop MR

Data Management: HDFS. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

Resource Management: Hadoop YARN, “Yet-Another-Resource-Negotiator”. A framework for job scheduling and cluster resource management.

Infrastructure: E.g. public and private clouds

Page 27

Apache Hadoop

• Hadoop Common: The common utilities that support the other Hadoop modules.

• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

• Hadoop YARN: A framework for job scheduling and cluster resource management.

• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

• Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

• Avro™: A data serialization system.

• Cassandra™: A scalable multi-master database with no single points of failure.

• Chukwa™: A data collection system for managing large distributed systems.

• HBase™: A scalable, distributed database that supports structured data storage for large tables.

• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

• Mahout™: A scalable machine learning and data mining library.

• Pig™: A high-level data-flow language and execution framework for parallel computation.

• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

• ZooKeeper™: A high-performance coordination service for distributed applications.


Page 28

Berkeley Data Analytics Stack

• Shark - Hive and SQL on top of Spark
• MLbase - Machine learning project on top of Spark
• BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark
• GraphX - a graph processing & analytics framework on top of Spark (GraphX has been merged into Spark 0.9)
• Apache Mesos - Cluster management system that supports running Spark
• Tachyon - In-memory storage system that supports running Spark
• Apache MRQL - A query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
• OpenDL - A deep learning algorithm library based on the Spark framework. Just kicked off.
• SparkR - R frontend for Spark
• Spark Job Server - REST interface for managing and submitting Spark jobs on the same cluster

Page 29

Berkeley Data Analytics Stack

• Unifies batch, streaming, interactive comp.
• Easy to build sophisticated applications
  – Support iterative, graph-parallel algorithms
  – Powerful APIs in Scala, Python, Java

Diagram annotations: Streaming; Interactive; Batch, Interactive; Sophisticated algorithms

spark.apache.org

Page 30

Turning Data into Value, Examples

• Unify real-time and historical data analysis
  – Easier to build and maintain
  – Cheaper to operate
  – Easier to get insights, faster decisions
• Unify streaming and machine learning
  – Faster diagnosis, decisions (e.g., better ad targeting)
• Unify graph processing and ETLs
  – Faster to get social network insights (e.g., improve user experience)

Page 31

What it Means for Users

Separate frameworks: HDFS read -> ETL -> HDFS write -> HDFS read -> train -> HDFS write -> HDFS read -> query -> HDFS write -> …

Spark: HDFS read -> ETL -> train -> query (interactive analysis)

Page 32

Advantage of a unified stack

• Explore data interactively to identify problems

    $ ./spark-shell
    scala> val file = sc.textFile("smallLogs")
    ...
    scala> val filtered = file.filter(_.contains("ERROR"))
    ...
    scala> val mapped = filtered.map(...)
    ...

• Use same code in Spark for processing large logs

    object ProcessProductionData {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(...)
        val file = sc.textFile("productionLogs")
        val filtered = file.filter(_.contains("ERROR"))
        val mapped = filtered.map(...)
        ...
      }
    }

• Use similar code in Spark Streaming for realtime processing

    object ProcessLiveStream {
      def main(args: Array[String]): Unit = {
        val sc = new StreamingContext(...)
        val stream = sc.kafkaStream(...)
        val filtered = stream.filter(_.contains("ERROR"))
        val mapped = filtered.map(...)
        ...
      }
    }
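The point of the slide, one piece of filtering logic reused for interactive, batch and streaming work, can be tried without a cluster. The sketch below is an assumption-laden stand-in: plain Scala collections and iterators play the role of Spark RDDs and streams, and the names (`UnifiedLogProcessing`, `errorMessages`, `processBatch`, `processStream`) are invented for illustration.

```scala
object UnifiedLogProcessing {
  private val Marker = "ERROR"

  /** Shared logic: keep ERROR lines and extract the text after the marker. */
  def errorMessages(lines: Iterator[String]): Iterator[String] =
    lines
      .filter(_.contains(Marker))
      .map(line => line.substring(line.indexOf(Marker) + Marker.length).trim)

  /** "Batch": process a complete log that is already available. */
  def processBatch(log: Seq[String]): List[String] =
    errorMessages(log.iterator).toList

  /** "Streaming": process lines lazily as they arrive, in micro-batches of `batchSize`. */
  def processStream(incoming: Iterator[String], batchSize: Int): Iterator[List[String]] =
    incoming.grouped(batchSize).map(batch => errorMessages(batch.iterator).toList)
}
```

Both entry points call the same `errorMessages` function, which mirrors how the slide reuses `filter(_.contains("ERROR"))` across the shell, the batch job and the streaming job.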

Page 33

Spark Integration

From Scala:

    val points = sc.runSql[Double, Double](
      "select latitude, longitude from historic_tweets")
    val model = KMeans.train(points, 10)
    sc.twitterStream(...)
      .map(t => (model.closestCenter(t.location), 1))
      .reduceByWindow("5s", _ + _)
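The streaming half of this pipeline, assigning each incoming location to its nearest cluster center and counting per window, can be sketched in plain Scala. This is an illustration only: it assumes the centers have already been trained elsewhere, treats one window as an in-memory sequence of points, and uses invented names (`ClosestCenterCounts`, `countsPerCenter`) rather than any Spark API.

```scala
object ClosestCenterCounts {
  type Point = (Double, Double) // (latitude, longitude)

  private def squaredDistance(a: Point, b: Point): Double = {
    val dLat = a._1 - b._1
    val dLon = a._2 - b._2
    dLat * dLat + dLon * dLon
  }

  /** Index of the center nearest to `p`; ties go to the lowest index. */
  def closestCenter(centers: IndexedSeq[Point], p: Point): Int = {
    require(centers.nonEmpty, "at least one center is required")
    centers.indices.reduceLeft { (best, i) =>
      if (squaredDistance(centers(i), p) < squaredDistance(centers(best), p)) i else best
    }
  }

  /** For one window of points, count how many fall nearest to each center. */
  def countsPerCenter(centers: IndexedSeq[Point], window: Seq[Point]): Map[Int, Int] =
    window
      .groupBy(p => closestCenter(centers, p))
      .map { case (center, points) => center -> points.size }
}
```

In the slide's pipeline the same two steps appear as `map(t => (model.closestCenter(t.location), 1))` followed by a windowed reduction; here the grouping and counting are done in one pass over the window.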

Page 34

Summary – challenges and opportunities arising

• Data processing: from special to general - and back?
• Data locality: from detailed, to general – and back? See e.g. Google’s Omega efforts
• Infrastructure: from public to private to hybrid cloud
• Disk vs in-memory: going back to earlier more complex environments? Not yet.
• Workflows/pipelines: unification crucial for performance and usability
• New areas evolving, both in computer science and in statistics
  – Quality: Need of “error bars” around outcomes
• Need of new solutions to make this possible, on large data sets
  – Algorithmic weakening for statistical inference
    • a new area in theoretical computer science?
    • a new area in statistics?

Page 35

Summary – Exciting times ahead!

“Use Clouds running Data Analytics processing Big Data to solve problems in X-Informatics (or e-X)”

Need to excel in many areas, at the same time!

Venn diagram labels: Computer Skills; Mathematics & Statistics Knowledge; Substantive Experience; Data Science; Machine Learning; Traditional Research; Danger Zone!

Page 36

References

• Geoffrey Fox, Indiana University
  – http://www.soic.indiana.edu/people/profiles/fox-geoffrey-charles.shtml - great visionary researcher in distributed computing and its usage
• Frontiers in Massive Data Analysis
  – http://www.nap.edu/catalog.php?record_id=18374 - foundation of the current state of the art
• The Fourth Paradigm: Data-Intensive Scientific Discovery
  – http://research.microsoft.com/en-us/collaboration/fourthparadigm/ - a good starting point, esp. visions from Jim Gray
• Spark-related slides from
  – Spark team
    • Matei Zaharia, MIT and Databricks
    • Ion Stoica, UC Berkeley and Databricks
    • Patrick Wendell, Databricks
    • Joseph Gonzalez (GraphX), UC Berkeley

Page 37

Thanks!

Åke Edlund

Q&A