
Page 1: Spark meetup TCHUG

LARGE-SCALE ANALYTICS WITH APACHE SPARK THOMSON REUTERS R&D TWIN CITIES HADOOP USER GROUP

FRANK SCHILDER SEPTEMBER 22, 2014

Page 2: Spark meetup TCHUG

THOMSON REUTERS
•  The Thomson Reuters Corporation
–  50,000+ employees
–  2,000+ journalists at news desks worldwide
–  Offices in more than 100 countries
–  $12 billion revenue/year

•  Products: intelligent information for professionals and enterprises
–  Legal: WestlawNext legal search engine
–  Financial: Eikon financial platform; Datastream real-time share price data
–  News: REUTERS news
–  Science: EndNote, ISI journal impact factor, Derwent World Patent Index
–  Tax & Accounting: OneSource tax information

•  Corporate R&D
–  Around 40 researchers and developers (NLP, IR, ML)
–  Three R&D sites in the US and one in the UK: Eagan, MN; Rochester, NY; NYC; and London
–  We are hiring… email me at [email protected]

Page 3: Spark meetup TCHUG

OVERVIEW
•  Speed
–  Data locality, scalability, fault tolerance

•  Ease of Use
–  Scala, interactive shell

•  Generality
–  Spark SQL, MLlib

•  Comparing ML frameworks
–  Vowpal Wabbit (VW)
–  Sparkling Water

•  The Future

Page 4: Spark meetup TCHUG

WHAT IS SPARK?
Apache Spark is a fast and general engine for large-scale data processing.

•  Speed: runs iterative MapReduce-style jobs faster thanks to in-memory computation with Resilient Distributed Datasets (RDDs)

•  Ease of use: enables interactive data analysis in Scala, Python, or Java; interactive shell

•  Generality: offers libraries for SQL, streaming, and large-scale analytics (graph processing and machine learning)

•  Integrated with Hadoop: runs on Hadoop 2’s YARN cluster

Page 5: Spark meetup TCHUG

ACKNOWLEDGMENTS
•  Matei Zaharia and the AMPLab and Databricks teams for fantastic learning material and tutorials on Spark

•  Hiroko Bretz, Thomas Vacek, Dezhao Song, and Terry Heinze for Spark and Scala support and running experiments

•  Adam Glaser for his time as a TSAP intern

•  Mahadev Wudali and Mike Edwards for letting us play in the “sandbox” (cluster)

Page 6: Spark meetup TCHUG

SPEED

Page 7: Spark meetup TCHUG

PRIMARY GOALS OF SPARK
•  Extend the MapReduce model to better support two common classes of analytics apps:
–  Iterative algorithms (machine learning, graphs)
–  Interactive data mining (R, Python)

•  Enhance programmability:
–  Integrate into the Scala programming language
–  Allow interactive use from the Scala interpreter
–  Make Spark easily accessible from other languages (Python, Java)

Page 8: Spark meetup TCHUG

MOTIVATION

•  Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
–  Iterative algorithms (machine learning, graphs)
–  Interactive data mining tools (R, Python)

•  With current frameworks, apps reload data from stable storage on each query

Page 9: Spark meetup TCHUG

HADOOP MAPREDUCE VS SPARK

Page 10: Spark meetup TCHUG

SOLUTION: Resilient Distributed Datasets (RDDs)
•  Allow apps to keep working sets in memory for efficient reuse

•  Retain the attractive properties of MapReduce:
–  Fault tolerance, data locality, scalability

•  Support a wide range of applications

Page 11: Spark meetup TCHUG

PROGRAMMING MODEL
Resilient distributed datasets (RDDs)
–  Immutable, partitioned collections of objects
–  Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
–  Functions follow the same patterns as Scala operations on lists
–  Can be cached for efficient reuse

80+ actions on RDDs
–  count, reduce, save, take, first, …

Page 12: Spark meetup TCHUG

EXAMPLE: LOG MINING
Load error messages from a log into memory, then interactively search for various patterns.

val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()                  // cache for reuse

[Diagram: the driver ships tasks to three workers; each worker reads one block of the file, keeps its partition of cachedMsgs in memory (Cache 1–3), and returns results to the driver.]

cachedMsgs.filter(_.contains("timeout")).count   // action: triggers computation
cachedMsgs.filter(_.contains("license")).count   // reuses the in-memory cache

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 13: Spark meetup TCHUG

BEHAVIOR WITH NOT ENOUGH RAM

[Chart: iteration time (s) vs. % of the working set held in memory]

  Cache disabled:  68.8 s
  25% in memory:   58.1 s
  50% in memory:   40.7 s
  75% in memory:   29.7 s
  Fully cached:    11.5 s

Page 14: Spark meetup TCHUG

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions.

Ex:

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Diagram: lineage chain HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]

Page 15: Spark meetup TCHUG

Fault Recovery Results

[Chart: iteration time (s) over iterations 1–10, comparing a run with no failure to one with a failure in the 6th iteration. Recovered values: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59 s — the iteration with the failure takes 81 s while lost partitions are rebuilt from lineage, versus 56–59 s for normal iterations after the first.]

Page 16: Spark meetup TCHUG

EASE OF USE

Page 17: Spark meetup TCHUG

INTERACTIVE SHELL
•  Data analysis can be done in the interactive shell:
–  Start it on a local machine or a cluster
–  Use all cores of a multi-core processor with local[n]
–  The Spark context is already set up for you: SparkContext sc

•  Load data from anywhere (local file system, HDFS, Cassandra, Amazon S3, etc.)

•  Start analyzing your data: the slide's screenshot loads a local data file, and processing starts only at the first action (a sketch of the session follows below)
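A minimal sketch of such a session, assuming an illustrative log file and format (the path and the ERROR convention are not from the slide's screenshot):

// Launched with: bin/spark-shell --master local[4]
// The shell pre-creates the SparkContext as `sc`.
val logs = sc.textFile("data/app.log")          // hypothetical local data file
val errors = logs.filter(_.contains("ERROR"))   // lazy transformation, nothing runs yet
errors.take(5).foreach(println)                 // action: processing starts here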

Page 18: Spark meetup TCHUG

ANALYZE YOUR DATA
•  Word count in one line

•  List the word counts

•  Broadcast variables (e.g. a dictionary or stop-word list), because local variables need to be distributed to the workers

(The slide's screenshots are sketched below.)
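A minimal sketch of those three steps, assuming an illustrative input path and stop-word list:

// Word count in one line (wrapped here for readability)
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// List the word counts on the driver
counts.collect().foreach { case (word, n) => println(s"$word: $n") }

// Broadcast a read-only stop-word list once to every worker
// instead of shipping it with each closure
val stopWords = sc.broadcast(Set("the", "a", "of"))
val filtered = counts.filter { case (word, _) => !stopWords.value.contains(word) }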

Page 19: Spark meetup TCHUG

RUN A SPARK SCRIPT
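The slide shows a screenshot; here is a minimal standalone script of the same shape (class name, input file, and master URL are illustrative, not the slide's code):

// Submitted e.g. with:
//   bin/spark-submit --class SimpleApp --master local[4] simple-app.jar
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("README.md")   // hypothetical input
    println("Lines with 'Spark': " + lines.filter(_.contains("Spark")).count())
    sc.stop()
  }
}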

Page 20: Spark meetup TCHUG

PYTHON SHELL & IPYTHON •  The interactive shell can also be started as Python

shell called pySpark:

•  Start analyzing your data in python now:

•  Since it’s Python, you may want to use iPython –  (command shell for interactive programming in your

brower) :

Page 21: Spark meetup TCHUG

IPYTHON AND SPARK
•  The IPython notebook environment and pySpark let you:
–  Document data analysis results
–  Carry out machine learning experiments
–  Visualize results with matplotlib or other visualization libraries
–  Combine Spark with NLP libraries such as NLTK

•  PySpark does not offer the full functionality of the Spark shell in Scala (yet)

•  Some bugs remain (e.g. problems with Unicode)

Page 22: Spark meetup TCHUG
Page 23: Spark meetup TCHUG

PROJECTS AT R&D USING SPARK •  Entity linking

–  Alternative name extraction from Wikipedia, Freebase, free text, ClueWeb12; several TB large web collection (planned)

•  Large-scale text data analysis: –  creating fingerprints for entities/events –  Temporal slot filling: Assigning a begin and end time

stamp to a slot filler (e.g. A is employee of company B from BEGIN to END)

–  Large-Scale text classification of Reuters News Archive articles (10 years)

•  Language model computation used for search query analysis

Page 24: Spark meetup TCHUG

SPARK MODULES
•  Spark Streaming:
–  Processing real-time data streams

•  Spark SQL:
–  Support for structured data (JSON, Parquet) and relational queries (SQL)

•  MLlib:
–  Machine learning library

•  GraphX:
–  New graph processing API

Page 25: Spark meetup TCHUG

SPARK SQL

Page 26: Spark meetup TCHUG

SPARK SQL
•  Relational queries expressed in:
–  SQL
–  HiveQL
–  a Scala domain-specific language (DSL)

•  New type of RDD, SchemaRDD:
–  An RDD composed of Row objects
–  Schema defined explicitly or inferred from a Parquet file, a JSON data set, or data stored in Hive

•  Spark SQL is in alpha: the API may change in the future!

Page 27: Spark meetup TCHUG

DEFINING A SCHEMA
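The slide shows a screenshot; a minimal sketch of the schema-definition pattern it illustrates, using the Spark 1.x SQLContext API (the Person case class and file path are illustrative):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)        // the schema, as a case class

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                // implicit RDD -> SchemaRDD conversion

val people = sc.textFile("people.txt")           // lines like "Alice,30" (an assumption)
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")               // make it queryable from SQL
val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teens.map(t => "Name: " + t(0)).collect().foreach(println)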

Page 28: Spark meetup TCHUG

MLLIB

Page 29: Spark meetup TCHUG

MLLIB
•  A machine learning module that comes with Spark; shipped since Spark 0.8.0

•  Provides various machine learning algorithms for classification and clustering

•  Sparse vector representation since 1.0.0

•  New features in the recently released version 1.1.0:
–  A standard statistics library (e.g. correlation, hypothesis testing, sampling); see the sketch below
–  More algorithms ported to Java and Python
–  More feature engineering: TF-IDF, singular value decomposition (SVD)
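A minimal sketch of that statistics library, assuming the Spark 1.1 mllib.stat API (the data is a toy example):

import org.apache.spark.mllib.stat.Statistics

val x = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
val y = sc.parallelize(Array(2.0, 4.1, 6.2, 7.9))
// Pearson correlation between two RDD[Double] series
println(Statistics.corr(x, y, "pearson"))   // close to 1.0 for this toy data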

Page 30: Spark meetup TCHUG

MLLIB
•  Provides various machine learning algorithms:
–  Classification: logistic regression, support vector machines (SVM), naïve Bayes, decision trees
–  Regression: linear regression, regression trees
–  Collaborative filtering: alternating least squares (ALS)
–  Clustering: k-means
–  Decomposition: singular value decomposition (SVD), principal component analysis (PCA)

Page 31: Spark meetup TCHUG

OTHER ML FRAMEWORKS
•  Mahout
•  LIBLINEAR
•  MATLAB
•  scikit-learn
•  GraphLab
•  R
•  Weka
•  Vowpal Wabbit
•  BigML

Page 32: Spark meetup TCHUG

LARGE-SCALE ML INFRASTRUCTURE
•  More data implies bigger training sets and richer feature sets.

•  More data with a simple ML algorithm often beats less data with a complicated ML algorithm.

•  Large-scale ML requires big data infrastructure:
–  Faster processing: Hadoop, Spark
–  Feature engineering: principal component analysis, the hashing trick, Word2Vec (hashing trick sketched below)
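A minimal sketch of the hashing trick, assuming an illustrative dimension and whitespace tokenization: features are hashed straight into a fixed-size vector, so no dictionary has to be built or shipped to the workers.

// Map arbitrary string features into a fixed-size sparse vector
val numFeatures = 1 << 18                                // 262,144 buckets (an assumption)

def hashFeatures(tokens: Seq[String]): Map[Int, Double] =
  tokens
    .groupBy(t => t.hashCode & (numFeatures - 1))        // bucket index from the hash
    .map { case (idx, ts) => idx -> ts.size.toDouble }   // colliding features add up

// e.g. hashFeatures("the cat sat on the mat".split(" "))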

Page 33: Spark meetup TCHUG

PREDICTIVE ANALYTICS WITH MLLIB

Page 34: Spark meetup TCHUG

PREDICTIVE ANALYTICS WITH MLLIB

http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
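Both slides show screenshots; a minimal MLlib classification sketch in the same spirit (the data path, split ratio, and iteration count are illustrative, not the slides' code):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

// Load LabeledPoints from a LIBSVM-format file (path is an assumption)
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

val model = LogisticRegressionWithSGD.train(training, 100)   // 100 iterations

// 0/1 accuracy on the held-out set
val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
println(s"Test accuracy: $accuracy")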

Page 35: Spark meetup TCHUG

VW AND MLLIB COMPARISON
•  We compared Vowpal Wabbit and MLlib in December 2013 (work with Tom Vacek)

•  Vowpal Wabbit (VW) is a large-scale ML tool developed by John Langford (Microsoft)

•  Task: binary text classification on Reuters articles, compared on:
–  Ease of implementation
–  Feature extraction
–  Parameter tuning
–  Speed
–  Accessibility of programming languages

Page 36: Spark meetup TCHUG

VW VS. MLLIB
•  Ease of implementation
–  VW: a user tool designed for ML, not a programming language
–  MLlib: a programming library; some ML-specific support now (e.g. regularization)

•  Feature extraction
–  VW: specific capabilities for bi-grams, prefixes, etc.
–  MLlib: no limit in terms of creating features

•  Parameter tuning
–  VW: no parameter search capability, but multiple parameters can be hand-tuned
–  MLlib: offers cross-validation

•  Speed
–  VW: highly optimized, very fast even on a single machine with multiple cores
–  MLlib: fast with lots of machines

•  Accessibility of programming languages
–  VW: written in C++, a few wrappers (e.g. Python)
–  MLlib: Scala, Python, Java

•  Conclusion at the end of 2013: VW had a slight advantage, but MLlib has caught up in at least some areas (e.g. sparse feature representation)

Page 37: Spark meetup TCHUG

FINDINGS SO FAR
•  Large-scale extraction is a great fit for Spark when working with large data sets (> 1 GB)

•  Ease of use makes Spark an ideal framework for rapid prototyping

•  MLlib is a fast-growing ML library, but still “under development”

•  Vowpal Wabbit has been shown to crunch even large data sets with ease

[Chart: 0/1 loss and training time (0–250 s) for VW, LIBLINEAR, and Spark with local[4]]

Page 38: Spark meetup TCHUG

OTHER ML FRAMEWORKS
•  An internship by Adam Glaser compared various ML frameworks on five standard data sets (NIPS):
–  Mass-spectrometric data (cancer), handwritten digit detection, Reuters news classification, and synthetic data sets
–  The data sets were not very big, but had up to 1,000,000 features

•  Evaluated the accuracy of the generated models and the speed of training

•  H2O, GraphLab, and Microsoft Azure showed strong performance in terms of accuracy and training time

Page 39: Spark meetup TCHUG

ACCURACY

Page 40: Spark meetup TCHUG

SPEED

Page 41: Spark meetup TCHUG

WHAT IS NEXT?
•  0xdata plans to release Sparkling Water in October 2014

•  Microsoft Azure also offers a strong platform with multiple ML algorithms and an intuitive user interface

•  GraphLab has GraphLab Canvas™ for visualizing your data and plans to incorporate more ML algorithms

Page 42: Spark meetup TCHUG

CAN’T DECIDE?

Page 43: Spark meetup TCHUG

CONCLUSIONS

Page 44: Spark meetup TCHUG

CONCLUSIONS
•  Apache Spark is the most active project in the Hadoop ecosystem

•  Spark offers speed and ease of use because of:
–  RDDs
–  The interactive shell
–  Easy integration of Scala, Java, and Python scripts

•  Integrated in Spark are modules for:
–  Easy data access via Spark SQL
–  Large-scale analytics via MLlib

•  Other ML frameworks enable analytics as well

•  Evaluate which framework is the best fit for your data problem

Page 45: Spark meetup TCHUG

THE FUTURE?
•  Apache Spark will be a unified platform that runs various workloads:
–  Batch
–  Streaming
–  Interactive

•  And connects with different runtime systems:
–  Hadoop
–  Cassandra
–  Mesos
–  Cloud
–  …

Page 46: Spark meetup TCHUG

THE FUTURE?
•  Spark will extend its offering of large-scale algorithms for complex analytics:
–  Graph processing
–  Classification
–  Clustering
–  …

•  Other frameworks will continue to offer similar capabilities

•  If you can’t beat them, join them

Page 47: Spark meetup TCHUG

[email protected]

http://labs.thomsonreuters.com/about-rd-careers/

Page 48: Spark meetup TCHUG

EXTRA SLIDES

Page 49: Spark meetup TCHUG

Example: Logistic Regression
Goal: find the best line separating two sets of points

[Diagram: scattered + and – points, a random initial line, and the target separating line]

Page 50: Spark meetup TCHUG

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()   // load points once, keep in memory

var w = Vector.random(D)                                // random initial plane
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)

Page 51: Spark meetup TCHUG

Logistic Regression Performance

[Chart: running time (s, 0–4500) vs. number of iterations (1–30) for Hadoop and Spark. Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for each further iteration.]

Page 52: Spark meetup TCHUG

Spark Scheduler
•  Dryad-like DAGs
•  Pipelines functions within a stage
•  Cache-aware work reuse & locality
•  Partitioning-aware to avoid shuffles

[Diagram: a DAG of RDDs (A–G) combined by map, union, groupBy, and join, cut into Stages 1–3; cached data partitions let completed work be skipped]

Page 53: Spark meetup TCHUG

Spark Operations

Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey,
  flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to the driver program):
  collect, reduce, count, save, lookupKey