spark 2.0: what’s next keynote_ja.pdf · 2/8/2016 · next-gen streaming with dataframes 1....
TRANSCRIPT
![Page 1: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/1.jpg)
Spark 2.0: What’s Next
Reynold Xin @rxinSpark Conference JapanFeb 8, 2016
![Page 2: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/2.jpg)
Please put up your hand
if you know what Spark is?
![Page 3: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/3.jpg)
Put up your hand
if you think your significant otherknow what Spark is?
(girlfriend, boyfriend, wife, husband, …)
![Page 4: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/4.jpg)
This Talk
What is Spark?How are people using it?Spark 2.0
![Page 5: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/5.jpg)
open source data processing engine built around speed, ease of use, and sophisticated analytics�����������!#��� Ò�Á�Ñ´¶
������������"� �
![Page 6: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/6.jpg)
![Page 7: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/7.jpg)
![Page 8: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/8.jpg)
About Databricks
Founded by creators of Spark & behind Spark development
Cloud Enterprise Spark Platform• Cluster management, interactive notebooks,
dashboards, production jobs,data governance, security, …
Databricks Àº¥»
Spark�ht½Spark�hÒA>³Î�¶¸À˹»�l°Ï¶
ØďêđüĉÕçÝĉÖñ Spark üĉíðúÙđăĐÝĉæên` .�$ôđðûíÝ
ĐëíäĆÿđñ åĈûb<
ĐïđêÛöòďæ èÜĆĊîÔ
![Page 9: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/9.jpg)
2015: Great Year for Spark
Most active open source project in data (1000+ contributors)
New language: R
Widespread industry support & adoption
2015: SparkÀ½¹»'«¿3
ïđê� IÉYh¿Úđüďéđæüčå×Ýð (1000��Â�^t)
D²¥�� : R
24¥ReâĀđð½@c
![Page 10: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/10.jpg)
Meetup Groups: December 2014
source: meetup.com
![Page 11: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/11.jpg)
Meetup Groups: December 2015
source: meetup.com
TokyoSpark Meetup
![Page 12: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/12.jpg)
IBMÃApache Spark Â�6�ÅÂáĂíðĄďðÒÓòÖďæ TÂ103¼IÉ�~¿Úđüďéđæüčå×Ýð½¿Î�w;ÒkÈ»¥Î½¥¦
Spark Or Hadoop – ¾¸Ìªýæð¿øíÞïđêúČđăĎđÝē
Apache Spark ª�W:�G·½�PqOªj³
![Page 13: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/13.jpg)
“Spark is the Taylor Swiftof big data software.”
- Derrick Harris, Fortune¢Spark ÃøíÞïđêéúðÖ×ÓÂîÕĉđĐæÖÔúð�·£
- ëđĊíÝĐõĊæ úÙđìĆď
��X : FM¼ÃîČøfp¢îĉæõÖæ£Â��U¿¾¼K
![Page 14: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/14.jpg)
“Spark is the �1H(of big data software.”
(A Japanese engineer told me)
![Page 15: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/15.jpg)
How are people using Spark?
![Page 16: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/16.jpg)
Diverse Runtime EnvironmentsHOW RESPONDENTS ARE
RUNNING SPARK
51%on a public cloud
MOST COMMON SPARK DEPLOYMENTENVIRONMENTS (CLUSTER MANAGERS)
48% 40% 11%Standalone mode YARN Mesos
Cluster Managers
°Ç±Ç¿+{a%
![Page 17: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/17.jpg)
Industries Using Spark
Other
Software(SaaS, Web, Mobile)
Consulting (IT)Retail,
e-Commerce
Advertising,Marketing, PR
Banking, Finance
Health, Medical,Pharmacy, Biotech
Carriers,Telecommunications
Education
Computers, Hardware
29.4%
17.7%
14.0%
9.6%
6.7%
6.5%
4.4%
4.4%
3.9%
3.5%
SparkÒ�c²»¥ÎRe
éúðÖ×Ó
áďâċîÔďÞ (IT)
�{ �z
áďùĆđêđ õđñÖ×Ó
Bv
�7 �g y� öÕÚîÝôčåđ
ÜąĊÓ ��
4"
āđßîÔďÞ PR
/& eáāđæ
µÂ�
![Page 18: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/18.jpg)
Top Applications
29%
36%
40%
44%
52%
68%
Fraud Detection / Security
User-Facing Services
Log Processing
Recommendation
Data Warehousing
Business IntelligenceøåóæÕďîĊå×ďæ
ïđêÖ×ÓõÖåďÞ
ČáĄďïđäĈď
čÞ�`
ćđãđ!âđøæ
�VQ� / èÜĆĊîÔ
��ÂÓüĊßđäĈď
![Page 19: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/19.jpg)
Are we done?
No. Development is faster than ever!
ɦ)<ē
¥¥§¡�hÃ�Ǽ�ÀYhÀ¿¹»r¥»¥ÎĒ
![Page 20: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/20.jpg)
2012
started@
Berkeley
2010
researchpaper
2013
Databricksstarted
& donatedto ASF
2014
Spark 1.0 & libraries(SQL, ML, GraphX)
2015
DataFramesTungsten
ML Pipelines
2016
Spark 2.0
![Page 21: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/21.jpg)
SQL Streaming MLlib
Spark Core (RDD)
GraphX
Spark stack diagram SparkÂæêíÝ#
![Page 22: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/22.jpg)
Frontend(user facing APIs)
Backend(execution)
Spark stack diagram(a different take)
SparkÂæêíÝ#(�¦�E¼)
������
(ćđãđÀ�³ÎAPI)
öíÝØďñ
(+{)
![Page 23: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/23.jpg)
Frontend(RDD, DataFrame, ML pipelines, …)
Backend(scheduler, shuffle, operators, …)
Spark stack diagram(a different take)
SparkÂæêíÝ#(�¦�E¼)
������
(RDD, DataFrame, ML pipelines, …)
öíÝØďñ
(æßåĆđĉ äąíúċ [m(
…)
![Page 24: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/24.jpg)
FrontendAPI Foundation
Streaming DataFrame/Dataset
SQL
Backend10X Performance
Whole-stage CodegenVectorization
Spark 2.0
účďðØďñ API Â��
æðĊđĂďÞ
DataFrame/DatasetSQL
öíÝØďñ 10�Â÷úÙđāďæ
�æîđå áđñb<
ýÝðċ�
![Page 25: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/25.jpg)
Guiding Principles for API Foundation
1. Simple yet expressive
2. (Semantics) well-defined
3. Sufficiently abstracted to allow optimized backends
API Ò�ÎÀ¤¶¹»Â?�
äďüċ·ª|_�©À
(èāďîÔÝæª) ��*s°Ï»¥Î
öíÝØďñÂI��ª¼«Î˦��À=��°Ï»¥Î
![Page 26: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/26.jpg)
Java/Scalafrontend
JVMbackend
RDD
DataFramefrontend
Logical Plan
Physical execution
Catalystoptimizer
DataFrame
![Page 27: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/27.jpg)
Python Java/Scala SQL
DataFrameLogical Plan
JVM Tungsten …
![Page 28: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/28.jpg)
API Foundations in Spark 2.0
1. Streaming DataFrames
2. Maturing and merging DataFrame and Dataset
3. ANSI SQL• natural join, subquery, view support
Spark 2.0 À¨ÎAPIÂ��
æðĊđĂďÞ DataFrames
DataFrame ½ Dataset Â<]½āđå
x\q� âûÝØĊ øĆđÂâĀđð
![Page 29: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/29.jpg)
Challenges with Stream Processing
Stream processing is hard to reason about• Output over time• Late data• Failures• Distribution
And all this has to work across complex operations• Windows, sessions, aggregation, etc
æðĊđă�`À�³Î��
æðĊđă�`ª�²¥`dÃ
Đ�¥L�ÀZÎÓÖðüíð
Đ�Ï»¬Îïđê
�-
��
®Ï̳ƻª}�¿ÚþČđäĈďÀѶ¹»Sw²¿ÏÄ¿Ì¿¥
ĐÖÔďñÖ èíäĈď ÓÞĊàđäĈď ¿¾
![Page 30: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/30.jpg)
Next-gen Streaming with DataFrames
1. Easy-to-use APIs (batch, streaming, and interactive)
2. Well-defined semantics• Out-of-order data• Failures• Sources/sinks with exactly-once semantics
3. Leverages Tungsten backend
DataFramesÀËÎT�æðĊđĂďÞ
1. ¥Ê³¥API (öíì æðĊđĂďÞ ÕďêĉÝîÔû)2. ¦Ç¬*s°Ï¶èāďîÔÝæĐ�5�ͼ¿¥ïđê
�-
Đexactly-once èāďîÔÝæÒ>º source / sink
3. Tungsten öíÝØďñÂ�c
![Page 31: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/31.jpg)
Next-gen Streaming with DataFrames
1. Easy-to-use APIs (batch, streaming, and interactive)
2. Well-defined semantics• Out-of-order data• Failures• Sources/sinks with exactly-once semantics
3. Leverages Tungsten backend
DataFramesÀËÎT�æðĊđĂďÞ
1. ¥Ê³¥API (öíì æðĊđĂďÞ ÕďêĉÝîÔû)2. ¦Ç¬*s°Ï¶èāďîÔÝæĐ�5�ͼ¿¥ïđê
�-
Đexactly-once èāďîÔÝæÒ>º source / sink
3. Tungsten öíÝØďñÂ�c
More details next few weeksC��9ÀËÍ�oÒ
![Page 32: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/32.jpg)
Spark is already pretty fast.
Can we make it 10X faster in 2.0?
Spark ó¼À©¿Í�¥
2.0¼ 10���À¼«Î·Ц©ē
![Page 33: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/33.jpg)
Spark 1.6 13.95 millionrows/sec
Spark 2.0work-in-progress
125 millionrows/sec
High throughput�æċđüíð
Teaser: SQL/DataFrame Performance
come to my talk this afternoon to learn more�²¬iͶ¥Ë¦¼²¶Ì�9ÂѶ²Â�Òu«ÀN»¬·°¥
0²·,�: SQL/DataFrame��������
![Page 34: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/34.jpg)
Tungsten Execution
PythonSQL R Streaming
DataFrame (& Dataset)
AdvancedAnalytics
![Page 35: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/35.jpg)
Spark 2.0 Release Schedule
Under active development on GitHub
March – April: code freeze
April – May: official release
Spark 2.0 ÂĊĊđææßåĆđċ
GitHub�¼YhÀ�h�
3J-4J : áđñúĊđç
4J-5J : V8ĊĊđæ
![Page 36: Spark 2.0: What’s Next keynote_ja.pdf · 2/8/2016 · Next-gen Streaming with DataFrames 1. Easy-to-use APIs (batch, streaming, and interactive) 2. Well-defined semantics • Out-of-order](https://reader034.vdocuments.net/reader034/viewer/2022042219/5ec5556313b08355f20a98d0/html5/thumbnails/36.jpg)
¤Íª½¦¯±¥Ç²¶@rxin