Spark 1.0 and Beyond


Page 1: Spark 1.0 and Beyond

Patrick Wendell

Databricks

Spark.incubator.apache.org

Spark 1.0 and Beyond

Page 2: Spark 1.0 and Beyond

About me

Committer and PMC member of Apache Spark

“Former” PhD student at Berkeley

Release manager for Spark 1.0

Background in networking and distributed systems

Page 3: Spark 1.0 and Beyond

Today’s Talk

Spark background

About the Spark release process

The Spark 1.0 release

Looking forward to Spark 1.1

Page 4: Spark 1.0 and Beyond

What is Spark?

Fast and expressive cluster computing engine, compatible with Apache Hadoop

Efficient:

- General execution graphs
- In-memory storage

Usable:

- Rich APIs in Java, Scala, Python
- Interactive shell

2-5× less code; up to 10× faster on disk, 100× in memory (a short word-count sketch below illustrates the API)
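A minimal word-count sketch (assuming an existing SparkContext sc and a placeholder input file name) gives a feel for how compact the core RDD API is:

import org.apache.spark.SparkContext._  // pair-RDD operations such as reduceByKey (Spark 1.x)

// Count word occurrences in a text file with the core RDD API.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)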

Page 5: Spark 1.0 and Beyond
Page 6: Spark 1.0 and Beyond

30-Day Commit Activity: bar charts comparing patches submitted, lines added, and lines removed over 30 days for MapReduce, Storm, YARN, and Spark.

Page 7: Spark 1.0 and Beyond

Spark Philosophy

Make life easy and productive for data scientists:

- Well-documented, expressive APIs
- Powerful domain-specific libraries
- Easy integration with storage systems … and caching to avoid data movement
- Predictable releases, stable APIs

Page 8: Spark 1.0 and Beyond

Spark Release Process

Quarterly release cycle (3 months)

2 months of general development

1 month of polishing, QA and fixes

Spark 1.0: Feb 1 → April 8th → April 8th+

Spark 1.1: May 1 → July 8th → July 8th+

Page 9: Spark 1.0 and Beyond

Spark 1.0: By the numbers

- 3 months of development

- 639 patches

- 200+ JIRA issues

- 100+ contributors

Page 10: Spark 1.0 and Beyond

API Stability in 1.X

APIs are stable for all non-alpha projects

Spark 1.1, 1.2, … will be compatible

@DeveloperApi

Internal API that is unstable

@Experimental

User-facing API that might stabilize later
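The markers live in org.apache.spark.annotation; as a rough illustration (the class names below are hypothetical, not real Spark classes), they are attached to declarations like this:

import org.apache.spark.annotation.{DeveloperApi, Experimental}

// Hypothetical examples of declarations carrying the stability markers.
@DeveloperApi
class InternalShuffleHook      // internal API, may change between any two releases

@Experimental
class ColumnarCachePreview     // user-facing, may stabilize in a later release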

Page 11: Spark 1.0 and Beyond

Today’s Talk

About the Spark release process

The Spark 1.0 release

Looking forward to Spark 1.1

Page 12: Spark 1.0 and Beyond

Spark 1.0 Features

Core engine improvements

Spark Streaming

MLlib

Spark SQL

Page 13: Spark 1.0 and Beyond

Spark Core

History server for Spark UI

Integration with YARN security model

Unified job submission tool

Java 8 support

Internal engine improvements

Page 14: Spark 1.0 and Beyond

History Server

Configure with:

spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://XX

In Spark Standalone mode, the history server is embedded in the master.

In YARN/Mesos, run the history server as a separate daemon.
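As a sketch of running it as a daemon (assuming the start-history-server.sh script in sbin/ of the standard Spark distribution, and the event-log directory configured above):

# Launch the standalone history server process, reading logs from the event-log directory.
./sbin/start-history-server.sh hdfs://XX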

Page 15: Spark 1.0 and Beyond

Job Submission Tool

Apps don't need to hard-code the master:

val conf = new SparkConf().setAppName("My App")
val sc = new SparkContext(conf)

./bin/spark-submit <app-jar> \
  --class my.main.Class \
  --name myAppName \
  --master local[4]

(or --master spark://some-cluster to run the same app on a cluster)

Page 16: Spark 1.0 and Beyond

Java 8 Support

RDD operations can use lambda syntax.

Old:

class Split extends FlatMapFunction<String, String> {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
}
JavaRDD<String> words = lines.flatMap(new Split());

New:

JavaRDD<String> words = lines
  .flatMap(s -> Arrays.asList(s.split(" ")));

Page 17: Spark 1.0 and Beyond

Java 8 Support

NOTE: Minor API changes:

(a) If you are extending Function classes, use implements rather than extends.

(b) Return-type-sensitive operations get dedicated methods: mapToPair, mapToDouble

Page 18: Spark 1.0 and Beyond

Python API Coverage

RDD operators: intersection(), take(), top(), takeOrdered()

Meta-data: name(), id(), getStorageLevel()

Runtime configuration: setJobGroup(), setLocalProperty()

Page 19: Spark 1.0 and Beyond

Integration with YARN Security

Supports Kerberos authentication in YARN environments:

spark.authenticate = true

ACL support for user interfaces:

spark.ui.acls.enable = true

spark.ui.view.acls = patrick, matei
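A minimal sketch of setting the same properties programmatically (the property names are from the slide; the app name and user list are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: enable authentication and UI ACLs through SparkConf.
val conf = new SparkConf()
  .setAppName("SecureApp")
  .set("spark.authenticate", "true")
  .set("spark.ui.acls.enable", "true")
  .set("spark.ui.view.acls", "patrick,matei")
val sc = new SparkContext(conf)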

Page 20: Spark 1.0 and Beyond

Engine Improvements

Job cancellation directly from the UI

Garbage collection of shuffle and RDD data

Page 21: Spark 1.0 and Beyond

Documentation

Unified Scaladocs across modules

Expanded MLlib guide

Deployment and configuration specifics

Expanded API documentation

Page 22: Spark 1.0 and Beyond

The Spark stack: the Spark core provides RDDs, transformations, and actions. On top of it sit Spark Streaming (real-time; DStreams are streams of RDDs), Spark SQL (SchemaRDDs), and MLlib (machine learning; RDD-based matrices).

Page 23: Spark 1.0 and Beyond

Spark SQL

Page 24: Spark 1.0 and Beyond

Turning an RDD into a Relation

// Create a SQLContext (needed for the implicit conversions and sql()).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects, register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")

Page 25: Spark 1.0 and Beyond

Querying using SQL

// SQL statements can be run directly on RDDs.
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support normal RDD operations.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()

// Language-integrated queries (à la LINQ).
val teenagers = people.where('age >= 10).where('age <= 19).select('name)

Page 26: Spark 1.0 and Beyond

Import and Export

// Save SchemaRDDs directly to Parquet.
people.saveAsParquetFile("people.parquet")

// Load data stored in Hive.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

// Queries can be expressed in HiveQL.
hql("FROM src SELECT key, value")

Page 27: Spark 1.0 and Beyond

In-Memory Columnar Storage

Spark SQL can cache tables using an in-memory columnar format:

- Scan only required columns
- Fewer allocated objects (less GC)
- Automatically selects best compression
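A minimal sketch, assuming the sqlContext and the "people" table registered in the earlier example: caching the table stores it in this columnar format, and later queries scan only the referenced columns.

// Cache the registered table using the in-memory columnar format.
sqlContext.cacheTable("people")

// Subsequent queries against "people" read from the columnar cache.
val names = sql("SELECT name FROM people").collect()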

Page 28: Spark 1.0 and Beyond

Spark Streaming

Web UI for streaming

Graceful shutdown

User-defined input streams (see the receiver sketch below)

Support for creating them in Java

Refactored API
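A rough sketch of a user-defined input stream under the refactored receiver API (the class and the data it produces are hypothetical; org.apache.spark.streaming.receiver.Receiver is the extension point):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that pushes one line of text into Spark every second.
class DummySource extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = {
    new Thread("dummy-source") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("hello at " + System.currentTimeMillis())  // hand data to Spark
          Thread.sleep(1000)
        }
      }
    }.start()
  }
  def onStop(): Unit = { }  // the polling thread exits once isStopped() returns true
}

// Usage with a StreamingContext ssc: val lines = ssc.receiverStream(new DummySource)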

Page 29: Spark 1.0 and Beyond

MLlib

Sparse vector support (see the sketch below)

Decision trees

Linear algebra

SVD and PCA

Evaluation support

3 contributors in the last 6 months
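A minimal sketch of the new vector types in org.apache.spark.mllib.linalg (the values are illustrative):

import org.apache.spark.mllib.linalg.Vectors

// A dense vector, and a 10-dimensional sparse vector with
// non-zero entries 2.5 at index 1 and 1.0 at index 4.
val dense  = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(10, Array(1, 4), Array(2.5, 1.0))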

Page 30: Spark 1.0 and Beyond

MLlib

Note: Minor API change

Old:

val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(s => s.split('\t').map(_.toDouble).toArray)
val clusters = KMeans.train(parsedData, 4, 100)

New:

val data = sc.textFile("data/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val clusters = KMeans.train(parsedData, 4, 100)

Page 31: Spark 1.0 and Beyond

1.1 and Beyond

Data import/export leveraging Catalyst (HBase, Cassandra, etc.)

Shark-on-catalyst

Performance optimizations:

- External shuffle
- Pluggable storage strategies

Streaming: Reliable input from Flume and Kafka

Page 32: Spark 1.0 and Beyond

Unifying Experience

SchemaRDD represents a consistent integration point for data sources

spark-submit abstracts the environmental details (YARN, hosted cluster, etc).

API stability across versions of Spark

Page 33: Spark 1.0 and Beyond

Conclusion

Visit spark.apache.org for videos, tutorials, and hands-on exercises.

Help us test a release candidate!

Spark Summit on June 30th

spark-summit.org

Meetup group meetup.com/spark-users