Atlanta Spark User Meetup, 09/22/2016


TRANSCRIPT

After Dark 2.0
End-to-End, Real-time, Advanced Analytics and ML Big Data Reference Pipeline

Atlanta Spark User Group, Sept 22, 2016
Thanks, Emory Continuing Education!

Chris Fregly, Research Scientist @ PipelineIO
We're Hiring - Only Nice People!

pipeline.io | advancedspark.com

Who Am I?

Research Scientist @ PipelineIO (github.com/fluxcapacitor/pipeline)
Meetup Founder: Advanced Spark and TensorFlow Meetup
Book Author

Who Was I?

Streaming Data Engineer, Netflix Open Source Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Spark Technology Center

Advanced Spark and TensorFlow Meetup

Meetup Metrics
Top 5 most active Spark meetup!
4,000+ members in just 1 year!
6,000+ downloads of the Docker image, with many meetup demos! (advancedspark.com)

Meetup Goals
Code-dive deep into Spark and related open source code bases
Study integrations with Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
Surface and share patterns and idioms of well-designed, distributed, big data processing systems


Atlanta Hadoop User Meetup Last Night

http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016


MLconf ATL Tomorrow

Current PipelineIO Research

Model Deploying and Testing
Model Scaling and Serving
Online Model Training
Dynamic Model Optimizing

PipelineIO Deliverables: 100% Open Source!!

GitHub: https://github.com/fluxcapacitor/
DockerHub: https://hub.docker.com/r/fluxcapacitor
Workshop: http://pipeline.io

Topics of This Talk (20-30 mins each)
① Spark Streaming and Spark ML: Generating Real-time Recommendations
② Spark Core: Tuning and Profiling
③ Spark SQL: Tuning and Customizing

Live, Interactive Demo!
Kafka, Cassandra, ElasticSearch, Redis, Spark ML

Audience Participation Required!

Audience Instructions
① Navigate to demo.pipeline.io
② Swipe software used in Production Only!

You -> Data
Audience -> Scientist

This is Totally Anonymous!!

Topics of This Talk (15-20 mins each)
① Spark Streaming and Spark ML: Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core: Tuning and Profiling
③ Spark SQL: Tuning and Customizing

Mechanical Sympathy

"Hardware and software working together."
- Martin Thompson, http://mechanical-sympathy.blogspot.com

"Whatever your data structure, my array will win."
- Scott Meyers, every C++ book, basically

Spark and Mechanical Sympathy

Project Tungsten (Spark 1.4-1.6+): minimize memory & GC, maximize CPU cache
100TB GraySort Challenge (Spark 1.1-1.2): saturate network I/O, saturate disk I/O

CPU Cache Refresher

[Diagram: CPU cache hierarchy on my laptop; the last-level cache is aka "LLC"]

CPU Cache Sympathy (AlphaSort paper)

Pointer only (4 bytes): must dereference to compare the key
Key (10 bytes) + Pointer (4 bytes) = 14 bytes: pre-process & pull the key from the record
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes: CPU cache-line friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes: 2x CPU cache-line friendly! Dereference the full key only to resolve prefix duplicates
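Not from the talk, but a minimal Scala sketch of the key-prefix idea may help (the record type and the ASCII-only prefix packing are illustrative): sort compact (prefix, index) pairs, and dereference the full key only when two prefixes tie.

object KeyPrefixSortSketch {
  case class Record(key: String, payload: Array[Byte])

  // First 4 bytes of the key packed into an Int (assumes ASCII keys for correct ordering)
  def prefixOf(key: String): Int =
    key.take(4).padTo(4, '\u0000').foldLeft(0)((acc, c) => (acc << 8) | (c & 0xFF))

  def sortIndices(records: Array[Record]): Array[Int] = {
    // Compact, cache-friendly view: (key-prefix, index into records) instead of full records
    val view = records.indices.map(i => (prefixOf(records(i).key), i)).toArray
    view.sortWith { case ((p1, i1), (p2, i2)) =>
      if (p1 != p2) p1 < p2                  // cheap: compare 4-byte prefixes only
      else records(i1).key < records(i2).key // rare: dereference full keys on prefix ties
    }.map(_._2)
  }

  def main(args: Array[String]): Unit = {
    val recs = Array(Record("banana", Array.empty[Byte]),
                     Record("apple", Array.empty[Byte]),
                     Record("apricot", Array.empty[Byte]))
    println(sortIndices(recs).map(recs(_).key).mkString(", ")) // apple, apricot, banana
  }
}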

Sort Performance Comparison

Sequential vs Random Cache Misses

Demo! Sorting

Instrumenting and Monitoring CPU
Use the Linux perf command!

http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html

Results of Random vs. Sequential Sort

Naïve random-access pointer sort (Ptr -> Key: must dereference to compare each key)
vs. cache-friendly sequential key/pointer sort (Key + Ptr: pre-process & pull the key from the record)

% change in the measured counters: reductions of roughly 26% to 90%

perf stat --event \
  L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses

Demo! Matrix Multiplication

CPU Cache Naïve Matrix Multiplication

// Dot product of each row & column vector
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)

Bad: row-wise traversal (walking down matB's rows for each k), not using the full CPU cache line, ineffective pre-fetching

CPU Cache Friendly Matrix Multiplication

// Transpose B (matBT is numColsB x numRowsB)
for (i <- 0 until numColsB)
  for (j <- 0 until numRowsB)
    matBT(i)(j) = matB(j)(i)

// Modify the algorithm to use the transposed B
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)   // OLD: matB(k)(j)

Good: both matA and matBT are now traversed along rows; full CPU cache-line use, effective prefetching
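For experimentation, a self-contained sketch follows (plain Scala; the matrix size and timing harness are illustrative and not from the slides). It runs both versions and can be measured with the perf counters listed on the next slide.

object MatMulSketch {
  def main(args: Array[String]): Unit = {
    val n = 512                      // illustrative size, large enough to spill out of L1/L2
    val rnd = new scala.util.Random(42)
    val matA = Array.fill(n, n)(rnd.nextDouble())
    val matB = Array.fill(n, n)(rnd.nextDouble())

    def time[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime(); val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.3f s"); result
    }

    // Naive: matB is walked down its rows for each k (cache-unfriendly)
    val res1 = Array.ofDim[Double](n, n)
    time("naive") {
      for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
        res1(i)(j) += matA(i)(k) * matB(k)(j)
    }

    // Cache-friendly: transpose B once, then walk both operands along rows
    val res2 = Array.ofDim[Double](n, n)
    time("transposed") {
      val matBT = Array.ofDim[Double](n, n)
      for (i <- 0 until n; j <- 0 until n) matBT(i)(j) = matB(j)(i)
      for (i <- 0 until n; j <- 0 until n; k <- 0 until n)
        res2(i)(j) += matA(i)(k) * matBT(j)(k)
    }
  }
}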

Results of Matrix Multiplication

Cache-friendly matrix multiply vs. naïve matrix multiply
% change: most measured counters drop sharply (roughly -53% to -96%), with one counter showing an anomalous +8543%?

perf stat --event \
  L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, \
  LLC-prefetch-misses,cache-misses,stalled-cycles-frontend

Demo! Thread Synchronization

Thread and Context Switch Sympathy

Problem: atomically increment 2 counters (each by a different increment) from 1000's of simultaneous threads

Possible Solutions
① Synchronized Immutable
② Synchronized Mutable
③ AtomicReference CAS
④ Volatile?

Context switches are expen$ive!!

Synchronized Immutable Counters

case class Counters(left: Int, right: Int)

object SynchronizedImmutableCounters {
  var counters = new Counters(0, 0)

  def getCounters(): Counters = {
    this.synchronized { counters }
  }

  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized {
      counters = new Counters(counters.left + leftIncrement,
                              counters.right + rightIncrement)
    }
  }
}

Locks the whole outer object!!

Synchronized Mutable Counters

class MutableCounters(left: Int, right: Int) {
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized { … }
  }
  def getCountersTuple(): (Int, Int) = {
    this.synchronized { (counters.left, counters.right) }
  }
}

object SynchronizedMutableCounters {
  val counters = new MutableCounters(0, 0)
  …
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    counters.increment(leftIncrement, rightIncrement)
  }
}

Locks just MutableCounters

Lock-Free AtomicReference Counters

import java.util.concurrent.atomic.AtomicReference

case class Counters(left: Int, right: Int)

object LockFreeAtomicReferenceCounters {
  val counters = new AtomicReference[Counters](new Counters(0, 0))

  def getCounters(): Counters = counters.get()

  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var originalCounters: Counters = null
    var updatedCounters: Counters = null
    do {
      originalCounters = getCounters()
      updatedCounters = new Counters(originalCounters.left + leftIncrement,
                                     originalCounters.right + rightIncrement)
    } // Retry the lock-free, optimistic compareAndSet() until the AtomicReference updates
    while (!counters.compareAndSet(originalCounters, updatedCounters))
  }
}

Lock Free!!

Lock-Free AtomicLong Counters

import java.util.concurrent.atomic.AtomicLong

object LockFreeAtomicLongCounters {
  // a single Long (64 bits) will maintain 2 separate Ints (32 bits each)
  val counters = new AtomicLong()
  …
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var originalCounters = 0L
    var updatedCounters = 0L
    do {
      originalCounters = counters.get()
      …
      // Store two 32-bit Ints in one 64-bit Long
      // Use >>> 32 and << 32 to set and retrieve each Int from the Long
    }
    // Retry the lock-free, optimistic compareAndSet() until the AtomicLong updates
    while (!counters.compareAndSet(originalCounters, updatedCounters))
  }
}

Lock Free!!

Q: Why not just use a @volatile Long?
A: The JVM does not guarantee atomic updates of plain (non-volatile) 64-bit longs and doubles, and even a @volatile long only makes single reads and writes atomic; the read-modify-write increment still needs CAS or a lock.
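Not on the slides, but a self-contained sketch of the bit-packing trick may help (the helper names are made up; it assumes each counter fits in 32 bits):

import java.util.concurrent.atomic.AtomicLong

object LockFreeAtomicLongCountersSketch {
  val counters = new AtomicLong(0L)

  // Pack two 32-bit Ints into one 64-bit Long: left in the high bits, right in the low bits
  private def pack(left: Int, right: Int): Long = (left.toLong << 32) | (right & 0xFFFFFFFFL)
  private def leftOf(packed: Long): Int  = (packed >>> 32).toInt
  private def rightOf(packed: Long): Int = packed.toInt

  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var original = 0L
    var updated = 0L
    do {
      original = counters.get()
      updated = pack(leftOf(original) + leftIncrement, rightOf(original) + rightIncrement)
    } while (!counters.compareAndSet(original, updated)) // optimistic retry, no locks
  }

  def snapshot(): (Int, Int) = {
    val packed = counters.get()
    (leftOf(packed), rightOf(packed)) // both counters read atomically from the single Long
  }
}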

Results of Thread Synchronization

Synchronized immutable case class:
  case class Counters(left: Int, right: Int)
  ...
  this.synchronized {
    counters = new Counters(counters.left + leftIncrement,
                            counters.right + rightIncrement)
  }

vs. lock-free AtomicLong:
  val counters = new AtomicLong()
  …
  do { … } while (!counters.compareAndSet(originalCounters, updatedCounters))

% change (lock-free AtomicLong vs. synchronized immutable): reductions of roughly 17% to 64% across the measured counters

perf stat --event \
  context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, \
  LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend

Profile Visualizations: Flame Graphs
Example: Spark Word Count

Java stack traces are good! JDK 1.8+ (-XX:-Inline -XX:+PreserveFramePointer)
Plateaus are bad! I/O stalls, heavy CPU serialization, etc.

Project Tungsten: CPU and Memory

Create custom data structures & algorithms
  Operate on serialized and compressed ByteArrays!

Minimize garbage collection
  Reuse ByteArrays
  In-place updates for aggregations

Maximize CPU cache effectiveness
  8-byte alignment
  AlphaSort-based key-prefix

Utilize Catalyst dynamic code generation
  Dynamic optimizations using the entire query plan
  Developer implements genCode() to create source code (String)
  Project Janino compiles the source code into JVM bytecode

Why is CPU the Bottleneck?

CPU is used for serialization, hashing, & compression
Spark 1.2 updates saturated network and disk I/O
10x increase in I/O throughput relative to CPU
More partitioning, pruning, and pushdown support
Newer columnar file formats help reduce I/O

Custom Data Structs & Algos: Aggregations

UnsafeFixedWidthAggregationMap
  Uses BytesToBytesMap internally
  In-place updates of serialized aggregations
  No object creation on the hot path

TungstenAggregate & TungstenAggregationIterator
  Operates directly on serialized, binary UnsafeRow
  2 steps to avoid single-key OOMs:
  ① Hash-based (grouping) agg spills to disk if needed
  ② Sort-based agg performs an external merge sort on the spills

Custom Data Structures & Algorithms (o.a.s.util.collection.unsafe.sort.*)

UnsafeSortDataFormat: SortDataFormat<RecordPointerAndKeyPrefix, Long[]>
  Note: mixing multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining.
RecordPointerAndKeyPrefix: Ptr + Key-Prefix; AlphaSort-based, 8-byte aligned sort key (2x CPU cache-line friendly!)
UnsafeInMemorySorter: in-place sorting of BytesToBytesMap data
UnsafeExternalSorter: in-place external sorting of spilled BytesToBytes data
UnsafeShuffleWriter: supports merging compressed records (if the compression codec supports it, i.e. LZF)

Code Generation

Problem
  Boxing creates excessive objects
  Expression tree evaluations are costly
  The JVM can't inline polymorphic implementations
  Lack of polymorphism == poor code design

Solution
  Code generation enables inlining
  Rewrite and optimize code using the overall plan, 8-byte alignment
  Defer source code generation to each operator, UDF, UDAF
  Use Janino to compile the generated source code into bytecode
  (More IDE friendly than Scala quasiquotes)

Autoscaling Spark Workers (Spark 1.5+)

Scaling up is easy :)
  SparkContext.addExecutors() until the max is reached
Scaling down is hard :(
  SparkContext.removeExecutors()
  Lose the RDD cache inside the Executor JVM
  Must rebuild active RDD partitions in another Executor JVM

Uses the External Shuffle Service from Spark 1.1-1.2
If an Executor JVM dies/restarts, the shuffle keeps shufflin'!
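A minimal configuration sketch for this setup (standard Spark properties; the executor counts are illustrative, and the master is assumed to come from spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// Dynamic allocation needs the external shuffle service so that shuffle files
// survive executors being removed.
val conf = new SparkConf()
  .setAppName("autoscaling-sketch")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")

val sc = new SparkContext(conf)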

"Hidden" Spark Submit REST API
http://arturmkrtchyan.com/apache-spark-hidden-rest-api

Submit Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{
    "action" : "CreateSubmissionRequest",
    "mainClass" : "org.apache.spark.examples.SparkPi",
    "sparkProperties" : {
      "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar",
      "spark.app.name" : "SparkPi",
      …
    }
  }'

Get Spark Job Status
curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request>

Kill Spark Job
curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request>

(the snitch)

Outline
① Spark Streaming and Spark ML: Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core: Tuning and Profiling
③ Spark SQL: Tuning and Customizing

Parquet Columnar File Format

Based on the Google Dremel paper (~2010); a collaboration with Twitter and Cloudera
Columnar storage format for fast columnar aggregations
Supports evolving schema
Supports pushdowns
Supports nested partitions
Tight compression
Min/max heuristics enable file and chunk skipping

Partitions

Partition based on data access patterns:
/genders.parquet/gender=M/…
                /gender=F/…   <-- Use case: access users by gender
                /gender=U/…

Dynamic Partition Creation (Write)
Dynamically create partitions on write based on a column (i.e. gender)
SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
DF:  gendersDF.write.format("parquet").partitionBy("gender")
       .save("/genders.parquet")

Partition Discovery (Read)
Dynamically infer partitions on read based on paths (i.e. /gender=F/…)
SQL: SELECT id FROM genders WHERE gender = 'F'
DF:  sqlContext.read.format("parquet").load("/genders.parquet/")
       .select($"id").where("gender = 'F'")

Pruning

Partition Pruning
Filter out rows by partition
SELECT id, gender FROM genders WHERE gender = 'F'

Column Pruning
Filter out columns by column filter
Extremely useful for columnar storage formats (Parquet)
Skip entire blocks of columns
SELECT id, gender FROM genders

Pushdowns
aka Predicate or Filter Pushdowns

A predicate returns true or false for a given function
Filters rows deep inside the data source
Reduces the number of rows returned
The data source must implement PrunedFilteredScan:

def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

Demo! File Formats, Partitions, Pushdowns, and Joins

Predicate Pushdowns & Filter Collapsing

No pushdown: 2 extra passes through the data after retrieval
Filter combining: only 1 extra pass
Filter pushdown: no extra pass

Join Between Partitioned & Unpartitioned

Join Between Partitioned & Partitioned

Broadcast Join vs. Normal Shuffle Join
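For reference, a quick sketch of forcing a broadcast (map-side) join in Spark SQL, reusing the ratingsDF/gendersDF frames from the data-source examples later in the deck; the join column is hypothetical:

import org.apache.spark.sql.functions.broadcast

// broadcast() hints Spark to ship the small gendersDF to every executor,
// avoiding a shuffle of the large ratingsDF (similar to a MapReduce
// map-side join with the distributed cache).
val joinedDF = ratingsDF.join(broadcast(gendersDF), "userId")

// Without the hint, Spark broadcasts only when the small side is below
// spark.sql.autoBroadcastJoinThreshold (default 10 MB).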

Cartesian Join vs. Inner Join

Visualizing the Query Plan

Callouts on the plan:
Effectiveness of the filter
Cost-based join optimization (similar to a MapReduce map-side join & distributed cache)
Peak memory for joins and aggs: UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()

Data Source API

Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class): provides the schema of the data
  TableScan (impl): read all data from the source
  PrunedFilteredScan (impl): column pruning & predicate pushdowns
  InsertableRelation (impl): insert/overwrite data based on SaveMode
  RelationProvider (trait/interface): handles options, BaseRelation factory

Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class): handles all filters supported by this source
  EqualTo (impl)
  GreaterThan (impl)
  StringStartsWith (impl)

Native Spark SQL Data Sources

JSON Data Source

DataFrame
val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")

-- or, using the json() convenience method --
val ratingsDF = sqlContext.read.json(
  "file:/root/pipeline/datasets/dating/ratings.json.bz2")

SQL Code
CREATE TABLE genders USING json
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")

Parquet Data Source

Configuration
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=false (unless your schema is evolving)
spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable())
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]

DataFrames
val gendersDF = sqlContext.read.format("parquet")
  .load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
  .save("file:/root/pipeline/datasets/dating/genders.parquet")

SQL
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

ElasticSearch Data Source

GitHub: https://github.com/elastic/elasticsearch-hadoop
Maven:  org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code
val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
                   "es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
  .options(esConfig).save("<index>/<document-type>")
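Reading back through the same connector is symmetric; a brief sketch (placeholders as above, and the filter column is made up), where pushdown lets ElasticSearch apply the filter itself:

// Read from ElasticSearch as a DataFrame; with "pushdown" -> "true" the filter
// below is translated into an ES query instead of being applied in Spark.
val esRatingsDF = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .options(esConfig)
  .load("<index>/<document-type>")
  .filter("rating >= 4")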

Cassandra Data Source

GitHub: https://github.com/datastax/spark-cassandra-connector
Maven:  com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code
ratingsDF.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "<keyspace>", "table" -> "<table>"))
  .save(…)

Tips for Cassandra Analytics

Bypass the Cassandra CQL "front door"; CQL is optimized for transactions
Bulk read and write directly against SSTables
Check out the Netflix OSS project "Aegisthus"
Cassandra becomes a first-class analytics option
A replicated analytics cluster is no longer needed

Creating a Custom Data Source

① Study existing implementations
   o.a.s.sql.execution.datasources.jdbc.JDBCRelation
② Extend base traits & implement required methods
   o.a.s.sql.sources.{BaseRelation, PrunedFilteredScan}

Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
class JDBCRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation

DataStax Cassandra (o.a.s.sql.cassandra)
class CassandraSourceRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation
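As a rough, illustrative skeleton (not from the talk; the class, data, and column names are invented and error handling is omitted), a minimal PrunedFilteredScan relation could look like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

// Hypothetical in-memory "genders" source, purely for illustration
class InMemoryGendersRelation(val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("gender", StringType)))

  private val data = Seq((1, "F"), (2, "M"), (3, "U"))

  // Column pruning (requiredColumns) and predicate pushdown (filters) both arrive here
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val kept = data.filter { case (_, gender) =>
      filters.forall {
        case EqualTo("gender", value) => gender == value  // honor the pushed-down filter
        case _                        => true             // unhandled filters are re-checked by Spark
      }
    }
    val rows = kept.map { case (id, gender) =>
      Row.fromSeq(requiredColumns.map {
        case "id"     => id
        case "gender" => gender
      })
    }
    sqlContext.sparkContext.parallelize(rows)
  }
}

// RelationProvider so the source can be used via .format(...) or SQL's USING clause
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new InMemoryGendersRelation(sqlContext)
}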

Demo! Create a Custom Data Source

Publishing Custom Data Sources
spark-packages.org

Spark SQL UDF Code Generation

100+ UDFs now generating code
More to come in Spark 1.6+
Details in SPARK-8159, SPARK-9571

Every UDF must use Expressions and implement Expression.genCode() to participate in the fun
Lambdas (RDD or Dataset API) and sqlContext.udf.registerFunction() are not enough!!

Creating a Custom UDF with Code Gen

① Study existing implementations
   o.a.s.sql.catalyst.expressions.Substring
② Extend the base trait and implement
   o.a.s.sql.catalyst.expressions.Expression.genCode
③ Don't forget about Python!
   python/pyspark/sql/functions.py

Demo! Creating a Custom UDF participating in Code Generation

Spark 1.6 and 2.0 Improvements
Adaptiveness, Metrics, Datasets, and Streaming State

Adaptive Query Execution

Adapt query execution using data from previous stages
Dynamically choose spark.sql.shuffle.partitions (default 200)

Adaptive Hybrid Join: broadcast join for popular keys, shuffle join for not-so-popular keys
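For context, spark.sql.shuffle.partitions is the knob that otherwise has to be sized by hand; a one-line example (the value is illustrative):

// Manually sized today; adaptive execution aims to pick this per stage automatically
sqlContext.setConf("spark.sql.shuffle.partitions", "64")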

Adaptive Memory Management

Spark <1.6
  Manually configure two memory regions:
  Spark execution engine (shuffles, joins, sorts, aggs): spark.shuffle.memoryFraction
  RDD data cache: spark.storage.memoryFraction

Spark 1.6+
  Unified memory regions
  Dynamically expand/contract memory regions
  Supports a minimum for RDD storage (LRU cache)
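The corresponding 1.6+ properties, as a small sketch (real config keys; the values are illustrative):

val conf = new org.apache.spark.SparkConf()
  .set("spark.memory.fraction", "0.75")        // unified region shared by execution and storage
  .set("spark.memory.storageFraction", "0.5")  // portion of that region protected for the RDD cache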

Metrics
Shows exact memory usage per operator & node
Helps with debugging and identifying skew

Spark SQL API

Datasets: type-safe API (similar to RDDs) utilizing Tungsten
val ds  = sqlContext.read.text("ratings.csv").as[String]
val df  = ds.flatMap(_.split(",")).filter(_ != "").toDF()  // RDD-like API, convert to DF
val agg = df.groupBy($"rating").agg(count("*") as "ct").orderBy($"ct".desc)

Typed Aggregators used alongside UDFs and UDAFs
val simpleSum = new Aggregator[Int, Int, Int] with Serializable {
  def zero: Int = 0
  def reduce(b: Int, a: Int) = b + a
  def merge(b1: Int, b2: Int) = b1 + b2
  def finish(b: Int) = b
}.toColumn
val sum = Seq(1, 2, 3, 4).toDS().select(simpleSum)

Query files directly without registerTempTable()
%sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json`

Spark Streaming State Management

New trackStateByKey()
  Store deltas, compact later
  More efficient per-key state updates
  Session TTL

Integrated A/B Testing (?!)

Show Failed Output in the Admin UI
  Better debugging
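For orientation, a minimal sketch of per-key streaming state using the API as it shipped in Spark 1.6 (trackStateByKey was renamed mapWithState before release); the input stream and counting logic are invented for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stateful-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/stateful-sketch")   // checkpointing is required for stateful ops

    // Hypothetical input: "userId ..." lines from a socket, mapped to (userId, 1) pairs
    val events = ssc.socketTextStream("localhost", 9999).map(line => (line.split(" ")(0), 1))

    // Running count per key, updated incrementally as each micro-batch arrives
    def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newCount)                 // only this key's delta is touched per batch
      (key, newCount)
    }

    val spec = StateSpec.function(updateCount _).timeout(Seconds(3600)) // session TTL

    events.mapWithState(spec).print()

    ssc.start()
    ssc.awaitTermination()
  }
}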

Thank You!!

Chris Fregly, Research Scientist @ PipelineIO (http://pipeline.io)
San Francisco, California, USA

advancedspark.com
Sign up for the Meetup and Book
Contribute on Github!
Run All Demos in Docker; ~6000 Docker Downloads!!

Find me on LinkedIn, Twitter, Github, Email, Fax
