how conviva used spark to speed up video analytics by...
TRANSCRIPT
![Page 1: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/1.jpg)
How Conviva used Spark to Speed Up Video Analytics by 30x Dil ip Antony Joseph (@Dil ipAntony)
![Page 2: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/2.jpg)
Conviva monitors and optimizes online video for premium content providers
![Page 3: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/3.jpg)
What happens if you don’t use Conviva?
![Page 4: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/4.jpg)
65 million video streams a day
![Page 5: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/5.jpg)
Conviva data processing architecture
Video Player
Hadoop for historical data
Live data processing
Monitoring Dashboard
Spark
Reports
Ad-‐hoc analysis
![Page 6: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/6.jpg)
Group By queries dominate our Reporting workload
10s of metrics, 10s of group bys
SELECT videoName, COUNT(1) FROM summaries WHERE date='2012_08_22' AND customer='XYZ’ GROUP BY videoName;
![Page 7: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/7.jpg)
Group By query in Spark
val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable]( pathToSessionSummaryOnHdfs, classOf[SessionSummary], classOf[NullWritable]) .flatMap { case (key, val) => val.fieldsOfInterest } val cachedSessions = sessions.filter( whereCondifonToFilterSessionsForTheDesiredDay) .cache val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) } val reduceFn : (Long, Long) => Long = { (a,b) => a+b } val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
![Page 8: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/8.jpg)
Spark is 30x faster than Hive 45 minutes versus 24 hours
for weekly Conviva Geo Report
![Page 9: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/9.jpg)
How much data are we talking about?
• 150 GB / week of compressed summary data
• Compressed ~ 3:1
• Stored as Hadoop Sequence Files • Custom binary serializafon format
![Page 10: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/10.jpg)
Spark is faster because it avoids reading data from disk multiple times
![Page 11: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/11.jpg)
Read from HDFS Decompress Deserialize
Read from HDFS Decompress Deserialize
Read from HDFS Decompress Deserialize
Group By Country
Group By State
Group By Video
10s of Group Bys …
Group By Country Read from HDFS Decompress Deserialize Cache data in memory
Read data from memory
Read data from memory
Group By State
Group By Video
10s of Group Bys …
Spark
Hive/MapReduce startup overhead
Cache only columns of interest
Overhead of flushing
intermediate data to disk
![Page 12: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/12.jpg)
Why not <some other big data system>?
• Hive
• Mysql
• Oracle/SAP HANA
• Column oriented dbs
![Page 13: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/13.jpg)
Spark just worked for us
• 30x speed-‐up • Great fit for our report computafon model • Open-‐source • Flexible • Scalable
![Page 14: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/14.jpg)
30% of our reports use Spark
![Page 15: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/15.jpg)
We are working on more projects that use Spark
• Streaming Spark for unifying batch and live computafon
• SHadoop – Run exisfng Hadoop jobs on Spark
![Page 16: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/16.jpg)
Problems with Spark
Hadoop for historical data
Live data processing
Spark
Reports
Ad-‐hoc analysis
Video Player
Monitoring Dashboard
No Logo
![Page 17: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/17.jpg)
Hadoop for historical data
Live data processing
Spark
Reports
Ad-‐hoc analysis
Video Player
Monitoring Dashboard
![Page 18: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/18.jpg)
Spark queries are not very succinct SELECT videoName, COUNT(1) FROM summaries WHERE date='2012_08_22' AND customer='XYZ’ GROUP BY videoName;
val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable]( pathToSessionSummaryOnHdfs, classOf[SessionSummary], classOf[NullWritable]) .flatMap { case (key, val) => val.fieldsOfInterest } val cachedSessions = sessions.filter( whereCondifonToFilterSessionsForTheDesiredDay) .cache val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) } val reduceFn : (Long, Long) => Long = { (a,b) => a+b } val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
Spark
![Page 19: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/19.jpg)
There is a learning curve associated with Scala, but ...
• Type-‐safety offered by Scala is a great boon – Code complefon via Eclipse Scala plugin
• Complex queries are easier in Scala than in Hive – Cascading IF()s in Hive
• No need to write Scala in the future – Shark – Java, Python APIs
![Page 20: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/20.jpg)
Additional Hiccups
• Always on the bleeding edge – geung dependencies right
• More maintenance/debugging tools required
![Page 21: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/21.jpg)
Spark has been working great for Conviva for over a year
![Page 22: How Conviva used Spark to Speed Up Video Analytics by …ampcamp.berkeley.edu/wp-content/uploads/2012/06/dilip-joseph-amp... · How Conviva used Spark to Speed Up Video Analytics](https://reader031.vdocuments.net/reader031/viewer/2022022608/5b889da67f8b9a435b8e27f1/html5/thumbnails/22.jpg)
We are Hiring [email protected]
hwp://www.conviva.com/blog/engineering/using-‐spark-‐and-‐hive-‐to-‐process-‐bigdata-‐at-‐conviva