A short introduction to Spark and its benefits

Big Data & Data Science, 20 March 2017

Big Data & Data Science: Agenda, 18:30 to 20:15

1/ The Apache Spark ecosystem. Johan Picard, Big Data Expert

2/ SQL on Hadoop at scale: Spark SQL 2.1 & Big SQL 4.3 on 100 TB Hadoop-DS. Victor Hatinguais, Big Data Architect

3/ Social Data: machine learning for a social-impact project. Samed Atouati & Abdellah Lamrani Alaoui, aspiring data scientists, students at Ecole Centrale Paris

4/ Data Science Experience. Zied Abidi, Data Scientist

5/ How do you make data talk to detect anomalies? Pauline Clavelloux, Data Scientist

Questions & Answers, Closing

IBM | Spark 3

Power of data. Simplicity of design. Speed of innovation.

Apache Spark in 15 minutes


Apache Spark

Apache Spark is a fast and general engine for large-scale data processing.

https://spark.apache.org/



Spark History: one of the most active open-source projects

2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo
2008 – Hadoop Summit
2010 – Spark paper
2013 – Spark 0.7 Apache Incubator
2014 – Apache Spark top-level
2014 – 1.2.0 released in December
2015 – 1.3.0 released in March
2015 – 1.4.0 released in June
2015 – 1.5.0 released in September
2016 – 1.6.0 released in January
2016 – 2.0.0 released in July
2016 – 2.1.0 released in December

Spark is HOT!!!
• Most active project in the Hadoop ecosystem
• One of the top 3 most active Apache projects
• Databricks founded by the creators of Spark from UC Berkeley’s AMPLab


Spark is the most active open source project in Big Data

Source: Syncsort – Hadoop Perspectives for 2016

[Figure: chart of Spark project activity for 2014, 2015 and 2016; surviving data label: 900]

Now 1039 contributors…


Why Spark? In-memory performance and code compactness


Why Spark? A distributed framework

Spark RDD: in-memory distribution

HDFS: on-disk distribution


Resilient Distributed Dataset

Create RDDs: parallelize, textFile

Transformations

Get results: Actions
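The create / transform / act cycle above can be illustrated with a toy, single-process model written in plain Python. This is only a sketch under stated assumptions: `ToyRDD` and this `parallelize` helper are hypothetical names invented for the illustration and are not Spark classes, although the method names mirror the real `SparkContext.parallelize`, `RDD.map`, `RDD.filter`, `RDD.collect`, `RDD.count` and `RDD.reduce`. The point it tries to show is that transformations only record work, and nothing runs until an action asks for a result.

```python
import functools


class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not Spark's class)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = partitions  # list of lists: the "distributed" data
        self._lineage = lineage        # recorded transformations, not yet run

    # Transformations: lazy, each one returns a new ToyRDD with a longer lineage.
    def map(self, fn):
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    # Helper: replay the recorded lineage on one partition.
    def _compute(self, partition):
        items = iter(partition)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: trigger the computation and return plain Python values.
    def collect(self):
        return [x for part in self._partitions for x in self._compute(part)]

    def count(self):
        return len(self.collect())

    def reduce(self, fn):
        return functools.reduce(fn, self.collect())


def parallelize(data, num_partitions=2):
    """Split a local collection into roughly equal partitions."""
    data = list(data)
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return ToyRDD([data[i:i + size] for i in range(0, len(data), size)])


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = parallelize(range(1, 7), num_partitions=3)
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)   # still 0: transformations are lazy
result = even_squares.collect()    # the action runs the whole lineage
print(calls_before_action, result)
```

In real Spark the partitions would live on different executors and the lineage would also be used to recompute lost partitions, which this toy does not attempt.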


Why Spark? A set of convenient APIs


Spark Programming Languages


Why Spark? A unified framework

Workloads: Distributed File System, Data Preparation, SQL Engine, Stream Processing, Graph Engine, Machine Learning, Distributed R

Spark components: Spark SQL, Spark Streaming, GraphX, MLlib, SparkR


Spark complements Hadoop (1/3): Hadoop Strengths

Unlimited Scale
• Multiple data sources
• Multiple applications
• Multiple users

Enterprise Platform
• Reliability
• Resiliency
• Security

Wide Range of Data Formats
• Files
• Semi-structured
• Databases


Spark complements Hadoop (2/3): MapReduce Weaknesses

Ease of Development
• Need deep Java skills
• Few abstractions available for analysts

In-Memory Performance
• No in-memory framework
• Application tasks write to disk with each cycle

Combine Workflows
• Only suitable for batch workloads
• Rigid processing model



Spark complements Hadoop (3/3): Spark Advantages

Ease of Development
• Easier APIs
• Python, Scala, Java

In-Memory Performance
• Resilient Distributed Datasets
• Unify processing

Combine Workflows
• Batch
• Interactive
• Iterative algorithms
• Micro-batch



The Flexibility of Spark on a Stable Hadoop Platform

Ease of Development

Combine Workflows

Unlimited Scale

Enterprise Platform

Wide Range of Data Formats


How to develop and run a Spark job?

• Spark Shell: interactive Scala
• PySpark: interactive Python
• Spark Submit: compiled
• Notebooks: Jupyter, Zeppelin


What Spark Is Not!

Not only for Hadoop – Spark can work with Hadoop (especially HDFS), but Spark is a standalone system

Not a data store – Spark attaches to other data stores but does not provide its own

Not only for machine learning – Spark includes machine learning and does it very well, but it can handle much broader tasks equally well

Not a replacement for Streams – Spark Streaming is micro-batching, not true streaming, and cannot handle real-time complex event processing

Not a language!!!
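To make the micro-batching point above concrete, here is a minimal sketch in plain Python, assuming events carry an integer timestamp in milliseconds. It is not Spark Streaming code: `micro_batches` and `processing_delay_ms` are hypothetical helpers written for this illustration. The idea it tries to show is that events are grouped into fixed batch intervals and can only be processed once their interval has closed, which adds a delay of up to one interval.

```python
from collections import defaultdict


def micro_batches(events, batch_interval_ms):
    """Group (timestamp_ms, value) events into fixed-length time windows.

    Returns a list of (window_start_ms, [values]) sorted by window start.
    """
    batches = defaultdict(list)
    for timestamp_ms, value in events:
        window_start = (timestamp_ms // batch_interval_ms) * batch_interval_ms
        batches[window_start].append(value)
    return sorted(batches.items())


def processing_delay_ms(timestamp_ms, batch_interval_ms):
    """Time an event waits until its batch closes and can be processed."""
    batch_end = (timestamp_ms // batch_interval_ms + 1) * batch_interval_ms
    return batch_end - timestamp_ms


# Hypothetical event stream: (arrival time in ms, payload).
events = [(200, "a"), (900, "b"), (1100, "c"), (2500, "d"), (2700, "e")]

batches = micro_batches(events, batch_interval_ms=1000)
for window_start, values in batches:
    print(window_start, values)

# Event "a" arrives at 200 ms but its batch only closes at 1000 ms.
print(processing_delay_ms(200, 1000))
```

A record-at-a-time streaming engine would handle each event as it arrives instead of waiting for the batch boundary, which is the distinction the slide draws.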


Spark and IBM


IBM has the largest investment in Spark of any company in the world

IBM Spark Technology Center

• One of the top committers/contributors
• 300+ inventors
• Commitment to educate 1 million data scientists
• Contributed SystemML
• Founding member of AMPLab
• Partnerships in the ecosystem

Visit www.spark.tc for more information.

http://www.spark.tc/
https://ibm.biz/hadoop-jira
https://ibm.biz/spark-jira
http://bit.do/ibm-spark


Leadership in Spark

The Spark Technology Center (STC) has contributed 829 code changes to Spark components since we started around the middle of 2015.

STC contributions break down as 52% to Spark SQL, 16% to PySpark, and 26% to ML and MLlib. For more details, see this dashboard: https://www.ibm.biz/spark-jira


Data Science Experience (DSX)


ALL YOUR TOOLS IN ONE PLACE

IBM Data Science Experience is an environment that brings together everything a data scientist needs. It includes the most popular open source tools and IBM's unique value-add functionality, with community and social features integrated as first-class citizens, to make data scientists more successful.

datascience.ibm.com



IBM PoT on Google

9 May: Handling massive data with Spark
10 May: Machine learning training using DSX
