
Åke Edlund, KTH PDC-HPC Center for High Performance Computing, KTH HPCViz Data-Intensive Computing Group, KTH PDC-HPC Cloud

OpenNebula: Experiences at KTH

With a deeper dive into emerging data analytics stacks

Outline of this talk

Cloud computing and data-intensive computing at PDC - a brief overview

OpenNebula at PDC - examples

Apache Spark at PDC - what I use our cloud for


Cloud computing and data-intensive computing at PDC - a brief overview

• Cloud research since 2007
  – Cloud provider since 2009
  – National and international users

• Spark user since May 2012 (more in the last section)
  – Version 0.6 released on October 15, 2012

• Research and development
  – Distributed and federated clouds and data analytics stacks
  – Bioinformatics and life science applications
  – Scalable statistics
  – Self-improving systems
  – Strong and usable security factors to enable researchers to store sensitive data in the cloud

• Projects (many)
  – SNIC Cloud Infrastructure (co-initiator and coordinator): the Swedish rollout of cloud for eScience
  – NeIC Nordic Cloud (co-initiator and coordinator of the Swedish part)
  – BioBankCloud (WP leader): PaaS for biobanking
  – EGI Federated Cloud task force (development and resource provider)
  – VENUS-C (WP leader) (2010-2012)
  – …

Cloud Resources at PDC

PDC Cloud has been in production (with external users) since 2010 and is today an installation of 364 cores:

- 12 nodes, each with 32 cores, 1 TB x 2 disk and 64 GB RAM
- 20 TB shared (over InfiniBand) by the 12 nodes using Ceph (RBD block devices, S3 object storage); this is under reconstruction (from SAN to dedicated Ceph storage nodes -> 36 TB)
- Cloud middleware used over the years ranges from Eucalyptus to OpenNebula, and now a mix of OpenNebula and OpenStack
- Users access their resources through a web panel and/or CLI/API

Users (so far) are Nordic and European researchers. PDC Cloud is a leading partner in a number of Swedish, Nordic and European cloud projects, e.g. being one of the first certified cloud resource providers to EGI Federated Cloud.

Data-Intensive Computing at PDC

HPCViz Data-Intensive Computing Group (started 2012) is a research group building on the experiences from PDC.

- 9 group members (7 researchers, 2 developers)
- Collaborating mainly with Uppsala University (bioinformatics) and KI (SciLifeLab) on applying, and further expanding, emerging techniques for iterative and interactive in-memory data analytics stacks (Spark, Stratosphere, H2O, …)
- Other areas of interest include anomaly detection in streaming data, with applications in performance improvement of distributed systems and in security (intrusion detection).

Our Cloud Learning Curve

Timeline: 2001, 2004, 2007, 2010, 2011, 2012, 2013, 2014

- Nordic cloud project, NEON (2010): practical evaluation [1], testing public vs. private cloud for eScience users (bioinformatics)
- SNIC Cloud project (2011.6-2012.6+): enabled cloud access (public and private) for SNIC users. 14 (some recurring) users of SNIC Cloud for Amazon (e.g. running Galaxy) and 54 on the private cloud (currently only PDC Cloud, partially from outside SNIC)
- SNIC Galaxy project (2013.3-2014.3): the goal of the project is to deliver Galaxy as a service, using the Galaxy cloud management platform, CloudMan, on local cloud installations (private clouds)
- SNIC Cloud Infrastructure (long-term, started Jan 2014): a (generic) IaaS on which communities/users can build their PaaS. Strong emphasis on user communities and their commitment.
- Grid computing projects (DataGrid, EGEE, EGI), including the EGI Federated Clouds task force
- KTH PDC Cloud experimentation
- PDC-HPC (since 1989)

Diagram quadrants: Public IaaS, Private IaaS, Private PaaS, Public PaaS

[1] "Practical Cloud Evaluation from a Nordic eScience User Perspective", VTDC'11, ACM conference San Jose (2011) by Åke Edlund, Maarten Koopman, Zeeshan Ali Shah, Ilja Livenson, Frederik Orellana, Jukka Kommeri, Miika Tuisku, Pekka Lehtovuori, Klaus Marius Hansen, Helmut Neukirchen, and Ebba Þóra Hvannberg

Our Cloud Learning Curve

- IaaS → PaaS: security concerns; service to our users; easier to manage larger user groups.
- Public IaaS → Private IaaS: large amounts of sensitive data, often too cumbersome for practical use of public clouds.

Federated Cloud Projects

Current Cloud Projects

- SNIC Cloud (co-initiator and coordinator): the Swedish rollout of cloud for eScience
- NeIC Nordic Cloud (co-initiator and coordinator of the Swedish part)
- BioBankCloud (WP leader): PaaS for biobanking
- EGI Federated Cloud (development and resource provider)

Earlier Cloud Projects

- SNIC Galaxy (PaaS) (co-initiator and coordinator) (2013)
- SNIC Cloud (initiator and coordinator) (2011-2012)
- SICS Startup Accelerator (co-initiator and coordinator) (2011)
- VENUS-C (WP leader) (2010-2012)
- NEON, the Northern Europe cloud project (initiator and coordinator) (2010)

OpenNebula at PDC - examples

Main contribution to this section: Zeeshan Ali Shah

Started with Eucalyptus

• Back in 2009

• Federated between KTH centers across Stockholm

• Then Eucalyptus chose a Red Hat-style licensing model

• We then selected OpenNebula for its openness and the easy access to its core team, which is located in the EU

OpenNebula

• 2010: selected during the technical kick-off of the VENUS-C project

• Based in the EU, easy access to developers

• Fully open source

• Started with OpenNebula 2.0

• An OVF (Open Virtualization Format) interface was developed within VENUS-C

• Federated with other VENUS-C sites such as BSC (Spain) and ENGINEERING (Italy)

User base

- www.e-science.se
- www.scilifelab.se: Science for Life Laboratory (SciLifeLab) is a national center for molecular biosciences with focus on health and environmental research.
- www.natmeg.se
- Neurosciences, Karolinska Institute
- And, yes, from EGI Federated Cloud communities

OpenNebula user experience

• Served 100+ users, both Swedish and other EU researchers

• Interfaces:

  – OpenNebula CLI

  – Sunstone dashboard

  – SDK (used by few, but the option was there)

• Conducted hands-on workshops for users

Federation with EGI

• Compute using OCCI (with OpenNebula as backend)

• Automatic injection of user keys from the VOMS server

• Federated identity with VOMS and X.509

• Information system

• Accounting service

From “The EGI Federated Cloud, a production IaaS infrastructure for the EEA”, D. Wallom (EGI CF, 20.04.2014)

Bioscience users

Pre-configured apps with OpenNebula:

• Galaxy - galaxyproject.org

• CloudBioLinux - cloudbiolinux.org

Images shown: CloudBioLinux; Galaxy (AWS - for CloudMan)

Issue: PoC of CloudMan on OpenNebula (SARA, NL), but moved to OpenStack

Way forward

“Wish list” from Zeeshan Ali Shah:

• Dedicated storage service, like S3 or Swift (OpenStack)

• Network service for versatile setups, like Neutron (OpenStack)

• Image caching on compute nodes

  – To minimize VM launch time: we noticed that most of the launch time was spent copying the image to the designated host

  – A shared file system is an option, but it has its own limitations
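The image-caching wish can be illustrated with a small sketch. It is plain Scala, not OpenNebula code: the `ImageCache` class, the image names, and the sizes are hypothetical, and the expensive copy to the host is only counted, not performed. It assumes a per-host cache with least-recently-used eviction, which is one possible policy among several.

```scala
import scala.collection.mutable

// Hypothetical per-host image cache: NOT an OpenNebula API, only an illustration.
// An image is copied to the host on a cache miss; later launches reuse the copy.
final class ImageCache(capacityGb: Int) {
  // Insertion-ordered map: the oldest entry comes first, which gives LRU eviction.
  private val cached = mutable.LinkedHashMap.empty[String, Int]
  private var copiedGb = 0

  /** Returns true if the launch was served from the local cache. */
  def launch(imageId: String, sizeGb: Int): Boolean =
    cached.remove(imageId) match {
      case Some(size) =>
        cached.put(imageId, size) // re-insert to mark as most recently used
        true
      case None =>
        // Evict least recently used images until the new one fits.
        while (cached.nonEmpty && cached.values.sum + sizeGb > capacityGb) {
          cached.remove(cached.head._1)
        }
        if (sizeGb <= capacityGb) cached.put(imageId, sizeGb)
        copiedGb += sizeGb // stands in for the expensive copy to the host
        false
    }

  def totalCopiedGb: Int = copiedGb
  def cachedImages: List[String] = cached.keys.toList
}

object ImageCacheDemo {
  def main(args: Array[String]): Unit = {
    val cache = new ImageCache(20)
    val launches = List("galaxy" -> 10, "cloudbiolinux" -> 8, "galaxy" -> 10, "galaxy" -> 10)
    val hits = launches.map { case (id, size) => cache.launch(id, size) }
    println(s"cache hits: $hits, data copied: ${cache.totalCopiedGb} GB")
  }
}
```

Repeated launches of the same image then cost one copy instead of one copy per VM, which is the effect the wish list item is after; a shared file system avoids the copy altogether but moves the load to the storage network.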

Big Data analytics

• Apache Spark

• Hadoop

• Mesos -> YARN

• Orchestration of Spark clusters with OpenNebula

See next section …




Sources of Big Data

Timeline from 1990 to 2010:

- Probing extreme phenomena in scientific fields with mature theories
- Increasingly exploratory research areas
- Making meaning of human activity on the Internet
- Sensing everything


Sthlm, May 2014

Research at HPCViz Data-Intensive Computing Group

… building a DS curriculum for the group

- Brain images: Scabia project, MEG data
- PaaS for life science: BioBankCloud, Galaxy, …
- Privacy preservation in the cloud: BioBankCloud
- Federated clouds: EGI, Nordic Cloud, CDMI proxy
- Cloud environments: environment launching, streaming capabilities, workflows (including graph data capabilities)
- Anomaly detection in performance data: intrusion detection, performance analysis, sensor data, IoT, …
- Next: scalable statistics
- Cloud and industry, esp. startups
- Chemoinformatics: MapReduce-based parallel virtual screening

Diagram layers: Applications, Technologies, Industry, Algorithms, Theory

Federated Cloud Services

Diagram tiers:

- Tier 1: Reliable Infrastructure Cloud
- Tier 2: General-purpose platform services
- Tier 3: Platform as a Service
- Tier 4: Zero ICT Infrastructures

Diagram components: Federated IaaS and STaaS Cloud; PaaS; DBaaS; Hadoop aaS; VRE; Secure storage; Key management; Encryption; ACL management; Virtual eLaboratory




DAaaS - What do We Need?

• Interactive queries: enable faster decisions
• Queries on streaming data: enable decisions on real-time data
• Sophisticated data processing: enable “better” decisions
• Statistical principles (that scale), to justify the inferential leap from data to knowledge
  – Need estimates of uncertainty in the outputs of algorithms (“error bars”)
• Pipelines: ability to run mixed analyses under one framework
  – For efficiency, and to be able to develop sophisticated algorithms

Support batch, streaming, and interactive computations… in a unified framework
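One common way to attach “error bars” to the output of an algorithm is the bootstrap. The sketch below is an illustration under assumptions, not a method taken from the talk: the object name `BootstrapErrorBars`, the synthetic latency data, and the choice of a percentile interval for the mean are all made up for the example, and it runs on a single machine rather than at scale.

```scala
import scala.util.Random

// Bootstrap "error bars": resample the data with replacement many times and
// report a percentile interval for the statistic (here: the mean).
object BootstrapErrorBars {
  def mean(xs: IndexedSeq[Double]): Double = xs.sum / xs.size

  /** Mean of one resample: data.size points drawn with replacement. */
  private def resampleMean(data: IndexedSeq[Double], rng: Random): Double =
    mean(Vector.fill(data.size)(data(rng.nextInt(data.size))))

  /** Percentile bootstrap interval for the mean at the given confidence level. */
  def meanInterval(data: IndexedSeq[Double], resamples: Int, level: Double,
                   rng: Random): (Double, Double) = {
    require(data.nonEmpty && resamples > 0 && level > 0 && level < 1)
    val estimates = Vector.fill(resamples)(resampleMean(data, rng)).sorted
    val alpha = (1.0 - level) / 2.0
    val lo = estimates(math.floor(alpha * (resamples - 1)).toInt)
    val hi = estimates(math.ceil((1.0 - alpha) * (resamples - 1)).toInt)
    (lo, hi)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42) // fixed seed so runs are repeatable
    // Synthetic data: 200 made-up latency measurements in milliseconds.
    val latenciesMs = Vector.fill(200)(100.0 + 15.0 * rng.nextGaussian())
    val (lo, hi) = meanInterval(latenciesMs, 1000, 0.95, rng)
    println(f"mean = ${mean(latenciesMs)}%.2f ms, 95%% interval = [$lo%.2f, $hi%.2f]")
  }
}
```

The cost is the point of the bullet above: each resample touches as many points as the original data, so a naive bootstrap over a very large data set is expensive, which is why statistical principles that scale are on the list.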


Berkeley Data Analytics Stack

- Applications
- Data processing: Spark Streaming, GraphX, MLbase, BlinkDB, Pig, Shark, Hive, Storm, MPI, …, running on Spark and Hadoop MR
- Data management: HDFS, the Hadoop Distributed File System, a distributed file system that provides high-throughput access to application data
- Resource management: Hadoop YARN (“Yet Another Resource Negotiator”), a framework for job scheduling and cluster resource management
- Infrastructure: e.g. public and private clouds

Apache Hadoop

• Hadoop Common: The common utilities that support the other Hadoop modules.

• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

• Hadoop YARN: A framework for job scheduling and cluster resource management.

• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

• Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

• Avro™: A data serialization system.

• Cassandra™: A scalable multi-master database with no single points of failure.

• Chukwa™: A data collection system for managing large distributed systems.

• HBase™: A scalable, distributed database that supports structured data storage for large tables.

• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

• Mahout™: A scalable machine learning and data mining library.

• Pig™: A high-level data-flow language and execution framework for parallel computation.

• Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

• Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

• ZooKeeper™: A high-performance coordination service for distributed applications.


http://incubator.apache.org/ambari/

http://avro.apache.org/

http://cassandra.apache.org/

http://incubator.apache.org/chukwa/

http://hbase.apache.org/

http://hive.apache.org/

http://mahout.apache.org/

http://pig.apache.org/

http://spark.incubator.apache.org/

http://tez.incubator.apache.org/

http://zookeeper.apache.org/


• Shark - Hive and SQL on top of Spark
• MLbase - machine learning project on top of Spark
• BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark
• GraphX - a graph processing and analytics framework on top of Spark (GraphX has been merged into Spark 0.9)
• Apache Mesos - cluster management system that supports running Spark
• Tachyon - in-memory storage system that supports running Spark
• Apache MRQL - a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
• OpenDL - a deep learning algorithm library based on the Spark framework. Just kicked off.
• SparkR - R frontend for Spark
• Spark Job Server - REST interface for managing and submitting Spark jobs on the same cluster

https://github.com/amplab/shark/wiki

http://mlbase.org/

http://blinkdb.org/

https://github.com/amplab/graphx

http://mesos.apache.org/

https://github.com/amplab/tachyon/wiki

https://wiki.apache.org/mrql/

https://github.com/guoding83128/OpenDL/

https://github.com/amplab-extras/SparkR-pkg

https://github.com/ooyala/spark-jobserver

Spark (spark.apache.org)

• Unifies batch, streaming, and interactive computation
• Easy to build sophisticated applications
  – Supports iterative, graph-parallel algorithms
  – Powerful APIs in Scala, Python, Java

Diagram: the stack (Applications; BlinkDB, Pig, Shark, Hive, Storm, MPI, …; Spark, Hadoop MR) annotated by workload type: streaming; interactive; batch, interactive; sophisticated algorithms

Turning Data into Value, Examples

• Unify real-time and historical data analysis

  – Easier to build and maintain

  – Cheaper to operate

  – Easier to get insights, faster decisions

• Unify streaming and machine learning

  – Faster diagnosis, decisions (e.g., better ad targeting)

• Unify graph processing and ETLs

  – Faster to get social network insights (e.g., improve user experience)

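What it Means for Users

Separate frameworks: each stage reads its input from HDFS and writes its output back to HDFS.

  HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → …

Spark: a single HDFS read, then all stages run in one framework, with interactive analysis.

  HDFS read → ETL → train → query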

Advantage of a unified stack

• Explore data interactively to identify problems

    $ ./spark-shell
    scala> val file = sc.textFile("smallLogs")   // log lines as an RDD of strings
    ...
    scala> val filtered = file.filter(_.contains("ERROR"))
    ...
    scala> val mapped = filtered.map(...)
    ...

• Use the same code in Spark for processing large logs

    object ProcessProductionData {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(...)
        val file = sc.textFile("productionLogs")
        val filtered = file.filter(_.contains("ERROR"))
        val mapped = filtered.map(...)
        ...
      }
    }

• Use similar code in Spark Streaming for real-time processing

    object ProcessLiveStream {
      def main(args: Array[String]): Unit = {
        val sc = new StreamingContext(...)
        val stream = sc.kafkaStream(...)
        val filtered = stream.filter(_.contains("ERROR"))
        val mapped = filtered.map(...)
        ...
      }
    }
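The reuse argument can be tried without a cluster. The following sketch uses only the Scala standard library as a stand-in: a `List` plays the small sample, a `Vector` the production logs, and an `Iterator` the live stream. The log format, the object name `UnifiedLogic`, and the helper names are assumptions made for the illustration; no Spark API is involved.

```scala
// Stand-in for the unified-stack idea: the same two functions are applied to
// an interactive sample, a full batch, and a lazily consumed stream.
object UnifiedLogic {
  // Shared step 1: the filter used in all three settings.
  def isError(line: String): Boolean = line.contains("ERROR")

  // Shared step 2: take the second whitespace-separated token as the component
  // (assumed log format: "<LEVEL> <component> <message>").
  def component(line: String): String =
    line.split("\\s+").drop(1).headOption.getOrElse("unknown")

  def main(args: Array[String]): Unit = {
    // Interactive exploration on a small sample.
    val smallLogs = List("INFO web started", "ERROR db timeout")
    println(smallLogs.filter(isError).map(component))

    // Batch job over the full data, plus an aggregation.
    val productionLogs =
      Vector("ERROR db timeout", "ERROR web 500", "INFO web ok", "ERROR db refused")
    val errorsPerComponent = productionLogs.filter(isError).map(component)
      .groupBy(identity).map { case (name, hits) => name -> hits.size }
    println(errorsPerComponent)

    // "Streaming": elements are processed one at a time as they arrive.
    val liveStream = Iterator("INFO cache hit", "ERROR cache evicted")
    liveStream.filter(isError).map(component).foreach(c => println(s"alert: $c"))
  }
}
```

Because `isError` and `component` are ordinary functions, what is debugged interactively is exactly what later runs in batch and on the stream; only the source of the lines changes.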


Spark Integration

From Scala:

    val points = sc.runSql[Double, Double](
      "select latitude, longitude from historic_tweets")
    val model = KMeans.train(points, 10)
    sc.twitterStream(...)
      .map(t => (model.closestCenter(t.location), 1))
      .reduceByWindow("5s", _ + _)
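The slide code combines a trained clustering model with a windowed count over a stream. A minimal stand-alone sketch of those two steps, in plain Scala and without Spark, could look as follows. It makes several assumptions: the cluster centers are given rather than trained, distance is plain Euclidean distance on the coordinates, and windows are count-based (a fixed number of events) instead of the 5-second time window on the slide. The object and function names are hypothetical.

```scala
// Two steps from the slide, without Spark:
// 1) assign each point to its nearest cluster center,
// 2) count assignments per window of consecutive events.
object NearestCenterCounts {
  type Point = (Double, Double) // (latitude, longitude)

  def squaredDistance(a: Point, b: Point): Double = {
    val dLat = a._1 - b._1
    val dLon = a._2 - b._2
    dLat * dLat + dLon * dLon
  }

  /** Index of the center closest to p; ties go to the lower index. */
  def closestCenter(centers: IndexedSeq[Point], p: Point): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), p))

  /** Per-window counts of events per center; a window is `windowSize` consecutive events. */
  def countsPerWindow(centers: IndexedSeq[Point], events: Iterator[Point],
                      windowSize: Int): Iterator[Map[Int, Int]] =
    events.grouped(windowSize).map { window =>
      window.map(p => closestCenter(centers, p))
        .groupBy(identity)
        .map { case (center, hits) => center -> hits.size }
    }

  def main(args: Array[String]): Unit = {
    // Made-up centers and events, for illustration only.
    val centers = Vector((59.3, 18.1), (57.7, 12.0))
    val events = Iterator((59.4, 18.0), (57.6, 11.9), (59.2, 18.2), (57.8, 12.1))
    countsPerWindow(centers, events, 2).foreach(println)
  }
}
```

In the slide version the same shape appears as `map` followed by `reduceByWindow`; the point there is that the SQL query, the model training, and the stream processing live in one program.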


Summary – challenges and opportunities arising

• Data processing: from special to general, and back?
• Data locality: from detailed to general, and back? See e.g. Google’s Omega efforts
• Infrastructure: from public to private to hybrid cloud
• Disk vs. in-memory: going back to earlier, more complex environments? Not yet.
• Workflows/pipelines: unification crucial for performance and usability
• New areas evolving, both in computer science and in statistics
  – Quality: need for “error bars” around outcomes
• Need for new solutions to make this possible on large data sets
  – Algorithmic weakening for statistical inference
    • a new area in theoretical computer science?
    • a new area in statistics?

Exciting times ahead!

“Use Clouds running Data Analytics processing Big Data to solve problems in X-Informatics (or e-X)”

Need to excel in many areas, at the same time!

Venn diagram labels: Computer Skills; Mathematics & Statistics Knowledge; Substantive Experience; Data Science; Machine Learning; Traditional Research; Danger Zone!

References

• Geoffrey Fox, Indiana University
  – http://www.soic.indiana.edu/people/profiles/fox-geoffrey-charles.shtml - great visionary researcher in distributed computing and its usage
• Frontiers in Massive Data Analysis
  – http://www.nap.edu/catalog.php?record_id=18374 - foundation of the current state of the art
• The Fourth Paradigm: Data-Intensive Scientific Discovery
  – http://research.microsoft.com/en-us/collaboration/fourthparadigm/ - a good starting point, esp. the visions from Jim Gray
• Spark-related slides from the Spark team
  – Matei Zaharia, MIT and Databricks
  – Ion Stoica, UC Berkeley and Databricks
  – Patrick Wendell, Databricks
  – Joseph Gonzalez (GraphX), UC Berkeley

Thanks!


Åke Edlund


Q&A
