Seattle Spark Meetup
Next Sessions
• Currently planning
– New sessions
• Joint Seattle Spark Meetup and Seattle Mesos User Group meeting
• Joint Seattle Spark Meetup and Cassandra meeting
• Mobile marketing scenario with Tune
– Repeats requested for:
• Deep Dive into Spark, Tachyon, and Mesos Internals (Eastside)
• Spark at eBay: Troubleshooting the everyday issues (Seattle)
• This session (Eastside)
Unlocking your Hadoop data with Apache
Spark and CDH5
Denny Lee, Steven Hastings
Data Sciences Engineering
Purpose
• Showcasing Apache Spark scenarios within your Hadoop cluster
• Utilizing some Concur Data Sciences Engineering Scenarios
• Special Guests:
– Tableau Software: Showcasing Tableau connecting to Apache Spark
– Cloudera: How to configure and deploy Spark on YARN via Cloudera Manager
Agenda
• Configuring and Deploying Spark on YARN with Cloudera Manager
• Quick primer on our expense receipt scenario
• Connecting to Spark on your CDH5.1 cluster
• Quick demos
– Pig vs. Hive
– SparkSQL
• Tableau connecting to SparkSQL
• Deep Dive demo
– MLlib: SVD
Take Picture of Receipt
Sample OCR output of the receipt (note the recognition errors):
Chantanee Thai Restaurant & Bar’
601 108th Ave. NE Uel lcvue, WA 98004
www.chantanee.com
TABLE: B 6 - 2 Guests
Your Server was Jerry
4/14/2014 1;14:02 PM Sequence #0000052
ID #0281727
Subtotal $40-00
Tota] Taxes $380
Grand Tota1 $43-80
Credit Purchase Name
BC Type : Amex
00 Num : xxxx xxxx xxxx 2000
Approval : 544882
Server : Jerry
Ticket Name : B 6
15% $6.00
I agree to pay the amount shown above.
Payment Amount: 2.5% .15 1 O .00
Visit us in BotheH-Pen Thai
Help Choose Expense Type
Gateway Node
[Diagram: a gateway node connected to the Hadoop cluster]
A gateway node:
• Can connect to HDFS
• Can execute Hadoop jobs on the cluster
• Can execute Spark on the cluster
– OR locally, for ad hoc work
• Can set up multiple VMs
Connecting…
spark-shell --master spark://$masternode:7077 --executor-memory 1G --total-executor-cores 16
--master
specify the master node; OR, if using a gateway node, you can run locally to test
--executor-memory
limits the memory each executor uses; otherwise Spark will take as much as it can (defaults should be set)
--total-executor-cores
limits the total number of cores used; otherwise Spark will take as much as it can (defaults should be set)
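The defaults mentioned above can be set cluster-wide rather than per session. A minimal sketch of conf/spark-defaults.conf for a Spark 1.x standalone deployment (values are illustrative, not recommendations):

```
# Illustrative defaults; tune for your cluster
spark.master             spark://$masternode:7077
spark.executor.memory    1g
# spark.cores.max is the standalone-mode equivalent of --total-executor-cores
spark.cores.max          16
```

With these in place, a bare spark-shell on the gateway node picks up the same master, memory, and core limits as the command above.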
Connecting… and resources
RowCount: Pig
A = LOAD '/user/hive/warehouse/dennyl.db/sample_ocr/000000_0' USING TextLoader as (line:chararray);
B = group A all;
C = foreach B generate COUNT(A);
dump C;
RowCount: Spark
val ocrtxt = sc.textFile("/user/hive/warehouse/dennyl.db/sample_ocr/000000_0")
ocrtxt.count
RowCount: Pig vs. Spark
Query   Pig       Spark
1       0:00:41   0:00:02
2       0:00:42   0:00:00.5
Row count against 1+ million categorized receipts
RowCount: Spark Stages
WordCount: Pig
-- A is the relation loaded on the RowCount slide
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
E = group D all;
F = foreach E generate COUNT(D);
dump F;
WordCount: Spark
val wc = ocrtxt.flatMap(
  line => line.split(" ")
).map(
  word => (word, 1)
).reduceByKey(_ + _)
wc.count
WordCount: Pig vs. Spark
Query   Pig       Spark
1       0:02:09   0:00:38
2       0:02:07   0:00:02
Word count against 1+ million categorized receipts
SparkSQL: Querying
// Utilize SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
// Define the schema as a case class
case class ocrdata(company_code: String, user_id: Long, date: Long, category_desc: String, category_key: String, legacy_key: String, legacy_desc: String, vendor: String, amount: Double)
SparkSQL: Querying (2)
// Extract the data
val ocr = sc.textFile("/$HDFS_Location"
).map(_.split("\t")).map(
  m => ocrdata(m(0), m(1).toLong, m(2).toLong, m(3), m(4), m(5), m(6), m(7), m(8).toDouble)
)
// For Spark 1.0.2
ocr.registerAsTable("ocr")
// For Spark 1.1.0+
ocr.registerTempTable("ocr")
SparkSQL: Querying (3)
// Write a SQL statement
val blah = sqlContext.sql(
  "SELECT company_code, user_id FROM ocr"
)
// Show the first 10 rows
blah.map(
  a => a(0) + ", " + a(1)
).take(10).foreach(println)
Oops!
14/11/15 09:55:35 ERROR scheduler.TaskSetManager: Task 16.0:0 failed 4 times; aborting job
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Cancelling stage 16
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Stage 16 was cancelled
14/11/15 09:55:35 INFO scheduler.DAGScheduler: Failed to run collect at <console>:22
14/11/15 09:55:35 WARN scheduler.TaskSetManager: Task 136 was killed.
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 16.0:0 failed 4 times, most recent failure: Exception failure in TID 135 on host $server$: java.lang.NumberFormatException: For input string: "\N"
  sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
  java.lang.Double.parseDouble(Double.java:540)
  scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
Let's find the error
// The data is not what we expected — let's look for "\N"
val errors = ocrtxt.filter(line => line.contains("\\N"))
// error count (71053 lines)
errors.count
// look at some of the data
errors.take(10).foreach(println)
Solution
Issue: the [amount] field contains \N, the NULL marker that Hive writes for missing values.
// Case class (original)
case class ocrdata(company_code: String, user_id: Long, ... amount: Double)
// Case class (fixed)
case class ocrdata(company_code: String, user_id: Long, ... amount: String)
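Another option (a sketch, not from the original deck) is to keep the amount numeric and map Hive's \N marker explicitly; parseAmount here is a hypothetical helper:

```scala
// Hypothetical helper: treat Hive's "\N" null marker (and any other
// unparseable value) as None instead of throwing NumberFormatException
def parseAmount(s: String): Option[Double] =
  if (s == "\\N") None
  else scala.util.Try(s.toDouble).toOption
```

The case class can then declare amount: Option[Double], and downstream aggregates can filter out None or treat it as zero, rather than carrying the amount around as a String.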
Re-running the Query
14/11/16 18:43:32 INFO scheduler.DAGScheduler: Stage 10 (collect at
<console>:22) finished in 7.249 s
14/11/16 18:43:32 INFO spark.SparkContext: Job finished: collect at
<console>:22, took 7.268298566 s
-1978639384, 20156192
-1978639384, 20164613
542292324, 20131109
-598558319, 20128132
1369654093, 20130970
-1351048937, 20130846
SparkSQL: By Category (Count)
// Query
val blah = sqlContext.sql("SELECT category_desc, COUNT(1) FROM ocr GROUP BY category_desc")
blah.map(a => a(0) + ", " + a(1)).take(10).foreach(println)
// Results
14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose
tasks have all completed, from pool
14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at
<console>:22, took 4.275620339 s
Category 1, 25
Category 2, 97
Category 3, 37
SparkSQL: via Sum(Amount)
// Query
val blah = sqlContext.sql("SELECT category_desc, sum(amount) FROM ocr GROUP BY category_desc")
blah.map(a => a(0) + ", " + a(1)).take(10).foreach(println)
// Results
14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose
tasks have all completed, from pool
14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at
<console>:22, took 4.275620339 s
Category 1, 2000
Category 2, 10
Category 3, 1800
Connecting Tableau to SparkSQL
Diving into MLlib / SVD for the Expense Receipt Scenario
Overview
• Re-Intro to Expense Receipt Prediction
• SVD
– What is it?
– Why do I care?
• Demo
– Basics (get the data)
– Data wrangling
– Compute SVD
Expense Receipts (Problems)
• Want to guess the expense type based on the words on a receipt
• The Receipt × Word matrix is sparse
• Some words are likely to be found together
• Some words are actually the same word
SVD (Singular Value Decomposition)
• Been around a while
– But still popular and useful
• Matrix Factorization
– Intuition: rotate your view of the data
• Data can be well approximated by fewer features
– And you can measure how good the approximation is
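The standard formulation of the factorization described above:

```latex
% Full SVD of an m-by-n matrix A
A = U \Sigma V^{T}
% U: m-by-m orthogonal (left singular vectors)
% \Sigma: m-by-n diagonal, with singular values
%   \sigma_1 \ge \sigma_2 \ge \dots \ge 0
% V: n-by-n orthogonal (right singular vectors)

% Rank-k truncation: keep only the k largest singular values
A_k = U_k \Sigma_k V_k^{T}
% By the Eckart-Young theorem, A_k is the best rank-k approximation
% of A, and the approximation error \|A - A_k\|_2 equals \sigma_{k+1},
% which is how you can measure how good the approximation is.
```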
Demo: Overview
Raw Data → Tokenized Words → Grouped (Words, Records) → Matrix → SVD
Basics
val rawlines = sc.textFile("/user/stevenh/subocr/subset.dat")
val ocrRecords = rawlines map { rawline =>
  rawline.split("\t")
} filter { line =>
  line.length == 10 && line(8) != ""
} zipWithIndex() map { case (lineItems, lineIdx) =>
  OcrRecord(lineIdx, lineItems(0), lineItems(1).toLong, lineItems(4), lineItems(8).toDouble, lineItems(9))
}
zipWithIndex() lets you give each record in your RDD an incrementing integer index
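A minimal illustration of that behavior in the Spark shell (the RDD contents are made up for the example):

```scala
// zipWithIndex() pairs each element with its 0-based Long index,
// ordered by partition and by position within each partition
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.zipWithIndex().collect()
// Array((a,0), (b,1), (c,2))
```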
Tokenize Records
val splitRegex = new scala.util.matching.Regex("\\\\r\\\\n")
val wordRegex = new scala.util.matching.Regex("[a-zA-Z0-9_]+")
val recordWords = ocrRecords flatMap { rec =>
val s1 = splitRegex.replaceAllIn(rec.ocrText, "")
val s2 = wordRegex.findAllIn(s1)
for { S <- s2 } yield (rec.recIdx, S.toLowerCase)
}
Keep track of which record this came from
Group data by Record and Word
// Count occurrences of each (recIdx, word) pair
val wordCounts = recordWords groupBy { t => t }
val wordsByRecord = wordCounts map { gr =>
  // key by word: (word, (recIdx, word, count))
  (gr._1._2, (gr._1._1, gr._1._2, gr._2.size))
}
// Assign each distinct word an incrementing integer index
val uniqueWords = wordsByRecord groupBy { t =>
  t._2._2
} zipWithIndex() map { gr =>
  // (word, wordIdx)
  (gr._1._1, gr._2)
}
Join Record, Word Data
val preJoined = wordsByRecord join uniqueWords
val joined = preJoined map { pj =>
  // pj: (word, ((recIdx, word, count), wordIdx))
  RecordWord(pj._2._1._1, pj._2._2, pj._2._1._2, pj._2._1._3.toDouble)
}
join combines 2-tuple RDDs on the first value of the tuple
Now we have data for each non-zero word/record combo
Generate Word x Record Matrix
import org.apache.spark.mllib.linalg.{Vector, SparseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val ncols = ocrRecords.count().toInt
val nrows = uniqueWords.count().toLong
val vectors: RDD[Vector] = joined groupBy { t =>
  t.wordIdx
} map { gr =>
  val indices = for { x <- gr._2 } yield x.recIdx.toInt
  val data = for { x <- gr._2 } yield x.n
  new SparseVector(ncols, indices.toArray, data.toArray)
}
val rowMatrix = new RowMatrix(vectors, nrows, ncols)
This is a Spark Vector, not a scala Vector
Compute SVD
val svd = rowMatrix.computeSVD(5, computeU = true)
• Ironically, in Spark v1.0 computeSVD is limited by an operation
which must complete on a single node…
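To inspect the result of the call above (the SingularValueDecomposition returned by MLlib 1.x exposes U, s, and V):

```scala
// svd: SingularValueDecomposition[RowMatrix, Matrix]
val U = svd.U  // RowMatrix of left singular vectors (since computeU = true)
val s = svd.s  // Vector of the 5 singular values, in descending order
val V = svd.V  // local Matrix of right singular vectors
println(s)
```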
Spark / SVD References
• Distributing the Singular Value Decomposition with Spark
– Spark-SVD gist
– Twitter / Databricks blog post
• Spotting Topics with the Singular Value Decomposition
Now do something with the data!
Upcoming Conferences
• Strata + Hadoop World
– http://strataconf.com/big-data-conference-ca-2015/public/content/home
– San Jose, Feb 17-20, 2015
• Spark Summit East
– http://spark-summit.org/east
– New York, March 18-19, 2015
• Ask for a copy of “Learning Spark”
– http://www.oreilly.com/pub/get/spark