Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017
![Page 1: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/1.jpg)
EFFECTIVE SPARK WITH ALLUXIO
Strata + Hadoop World San Jose 2017
Calvin Jia, Alluxio Inc.
![Page 2: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/2.jpg)
OUTLINE
• Alluxio Overview
• Alluxio + Spark Use Cases
• Using Alluxio with Spark
• Performance Evaluation
![Page 3: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/3.jpg)
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO
Unifying Data at Memory Speed
Interfaces exposed to applications:
• Native File System
• Hadoop Compatible File System
• Native Key-Value Interface
• FUSE Compatible File System
Interfaces to underlying storage:
• HDFS Interface
• Amazon S3 Interface
• Swift Interface
• GlusterFS Interface
![Page 4: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/4.jpg)
FASTEST GROWING BIG DATA PROJECTS
• Fastest growing open-source project in the big data ecosystem
• 400+ contributors from 100+ organizations
• Running in large production clusters
• Community members are welcome!
[Figure: Popular Open Source Projects’ Growth. x-axis: Months; y-axis: Number of Contributors]
![Page 5: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/5.jpg)
![Page 6: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/6.jpg)
ACCELERATE I/O TO/FROM REMOTE STORAGE
The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.
- Baidu
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB of RAM space
• With Alluxio, batch queries that usually took over 15 minutes became interactive queries taking less than 30 seconds
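As a rough cross-check of the figures quoted on this slide, a speedup is simply the old run time divided by the new one. The sketch below uses only the numbers stated above (100-150 seconds down to 10-15 seconds, and over 15 minutes down to under 30 seconds); the helper name `speedup` is illustrative and not part of Spark or Alluxio.

```python
def speedup(before_seconds, after_seconds):
    """Return how many times faster the 'after' run is than the 'before' run."""
    if after_seconds <= 0:
        raise ValueError("after_seconds must be positive")
    return before_seconds / after_seconds


# Spark SQL alone vs. with Alluxio, using both ends of the quoted ranges.
low_end = speedup(100, 10)
high_end = speedup(150, 15)

# Batch query (over 15 minutes) vs. interactive query (under 30 seconds).
batch_to_interactive = speedup(15 * 60, 30)

print(low_end, high_end, batch_to_interactive)
```

Both ends of the quoted range work out to about 10x, while the 15-minute-to-30-second case is consistent with the 30x figure listed under RESULTS.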
Baidu’s PMs and analysts run interactive queries to gain insights into their products and business
• 200+ node deployment
• 2+ petabytes of storage
• Mix of memory + HDD
[Diagram labels: ALLUXIO, Baidu File System]
![Page 7: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/7.jpg)
GUARANTEE PERFORMANCE TO MEET SLAS
RESULTS
• Introducing Alluxio provides the system with smooth performance in an environment with highly variable load
• Without Alluxio, critical jobs could take more than 4x the expected time
• With Alluxio’s guaranteed performance, business logic can be executed reliably, leading to an improved user experience
• Real-time machine learning
• 10+ TB of storage
• Mix of Memory + HDD
A leading online retailer uses Alluxio to compute conversion attribution
![Page 8: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/8.jpg)
SHARE DATA ACROSS JOBS @ MEMORY SPEED
Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.
- Barclays
RESULTS
• Barclays’ workflow iteration time decreased from hours to seconds
• Alluxio enabled workflows that were impossible before
• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
Barclays uses query and machine learning to train models for risk management
• 6-node deployment
• 1TB of storage
• Memory only
[Diagram labels: ALLUXIO, Relational Database]
![Page 9: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/9.jpg)
TRANSPARENTLY MANAGE DATA ACROSS STORAGE SYSTEMS
We’ve been running Alluxio in production for over 9 months. Alluxio’s unified namespace enables different applications and frameworks to easily interact with data from different storage systems.
- Qunar
RESULTS
• Spark Streaming, Spark batch and Flink jobs share data efficiently
• Improved the performance of their system with 15x – 300x speedups
• Tiered storage feature manages storage resources including memory, SSD and disk
• 200+ node deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
Qunar uses real-time machine learning for their website ads.
![Page 10: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/10.jpg)
![Page 11: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/11.jpg)
CONSOLIDATING MEMORY
Storage Engine & Execution Engine: Same Process
• Two copies of data in memory – double the memory used
• Inter-process Sharing Slowed Down by Network / Disk I/O
[Diagram: two Spark processes (Spark Compute + Spark Storage), each holding block 1 and block 3 in Spark Storage; HDFS / Amazon S3 holds blocks 1 to 4]
![Page 12: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/12.jpg)
CONSOLIDATING MEMORY
Storage Engine & Execution Engine: Different Process
• Half the memory used
• Inter-process Sharing Happens at Memory Speed
[Diagram: two Spark processes (Spark Compute + Spark Storage) share one Alluxio layer holding block 1, block 3, and block 4; HDFS / Amazon S3 holds blocks 1 to 4]
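The memory claim on these two slides can be sketched with simple arithmetic: when each Spark process caches its own copy, memory use grows with the number of processes, whereas a shared store keeps one copy. The dataset size (50 GB) and job count (2) below are illustrative assumptions, and `cached_memory_gb` is a hypothetical helper, not an Alluxio or Spark API.

```python
def cached_memory_gb(dataset_gb, num_jobs, shared_store):
    """Estimate total memory used to hold cached copies of one dataset.

    shared_store=False: every job caches its own copy inside its own process.
    shared_store=True: all jobs read a single copy held by an external store
    (the role Alluxio plays on the slide).
    """
    copies = 1 if shared_store else num_jobs
    return dataset_gb * copies


# Two Spark jobs working on the same (assumed) 50 GB dataset.
in_process = cached_memory_gb(dataset_gb=50, num_jobs=2, shared_store=False)
consolidated = cached_memory_gb(dataset_gb=50, num_jobs=2, shared_store=True)

print(in_process, consolidated)
```

With two jobs, the shared copy uses half the memory of the per-process copies, which matches the "half the memory used" bullet; the model ignores serialization overhead and eviction.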
![Page 13: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/13.jpg)
DATA RESILIENCE DURING CRASH
Storage Engine & Execution Engine: Same Process
[Diagram: Spark Compute and Spark Storage in one process; Spark Storage holds block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
![Page 14: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/14.jpg)
DATA RESILIENCE DURING CRASH
Storage Engine & Execution Engine: Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: the Spark process crashes (CRASH) while Spark Storage holds block 1 and block 3; HDFS / Amazon S3 holds blocks 1 to 4]
![Page 15: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/15.jpg)
DATA RESILIENCE DURING CRASH
Storage Engine & Execution Engine: Same Process
• Process Crash Requires Network and/or Disk I/O to Re-read Data
[Diagram: after the crash (CRASH), Spark Storage and its blocks are no longer shown; HDFS / Amazon S3 holds blocks 1 to 4]
![Page 16: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/16.jpg)
DATA RESILIENCE DURING CRASH
Storage Engine & Execution Engine: Different Process
[Diagram: Spark Compute and Spark Storage run in one process; Alluxio, a separate process, holds block 1, block 3, and block 4; HDFS / Amazon S3 holds blocks 1 to 4]
![Page 17: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/17.jpg)
DATA RESILIENCE DURING CRASH
Storage Engine & Execution Engine: Different Process
• Process Crash – Data is Re-read at Memory Speed
[Diagram: the Spark process crashes (CRASH); Alluxio still holds block 1, block 3, and block 4; HDFS / Amazon S3 holds blocks 1 to 4]
![Page 18: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/18.jpg)
ACCESSING ALLUXIO DATA FROM SPARK
Writing Data: Write to an Alluxio file
Reading Data: Read from an Alluxio file
![Page 19: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/19.jpg)
CODE EXAMPLE FOR SPARK RDDS
Writing RDD to Alluxio
rdd.saveAsTextFile(alluxioPath)
rdd.saveAsObjectFile(alluxioPath)
Reading RDD from Alluxio
rdd = sc.textFile(alluxioPath)
rdd = sc.objectFile(alluxioPath)
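The slide leaves `alluxioPath` undefined. A minimal sketch of how such a path might be built and used is shown below, assuming an Alluxio master reachable at `localhost` on port 19998 and the Alluxio client jar already on Spark's classpath. The helpers `alluxio_uri` and `save_and_reload_text` are hypothetical names introduced for illustration; only `saveAsTextFile` and `textFile` are Spark calls.

```python
def alluxio_uri(path, host="localhost", port=19998):
    """Build an alluxio:// URI for a file path.

    host and port are assumptions about the deployment: they should point
    at the Alluxio master (19998 is its default RPC port).
    """
    if not path.startswith("/"):
        path = "/" + path
    return f"alluxio://{host}:{port}{path}"


def save_and_reload_text(sc, rdd, uri):
    """Write an RDD as text to the given URI, then read it back.

    sc is expected to be a SparkContext and rdd a Spark RDD.
    """
    rdd.saveAsTextFile(uri)
    return sc.textFile(uri)


print(alluxio_uri("data/input.txt"))
```

Because Alluxio is accessed through a Hadoop-compatible file system URI, the Spark calls themselves are unchanged; only the path scheme differs from an `hdfs://` or `s3a://` path.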
![Page 20: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/20.jpg)
CODE EXAMPLE FOR SPARK DATAFRAMES
Writing to Alluxio
df.write.parquet(alluxioPath)
Reading from Alluxio
df = spark.read.parquet(alluxioPath)
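A DataFrame round trip can be sketched the same way. The helper below is hypothetical and assumes `spark` is a SparkSession (Spark 2.x), since that is the object exposing `.read`; the `overwrite` write mode is also an assumption, chosen so that re-running the sketch does not fail on an existing path.

```python
def save_and_reload_parquet(spark, df, alluxio_path, mode="overwrite"):
    """Write df as Parquet to alluxio_path, then read it back.

    spark is expected to be a SparkSession and df a Spark DataFrame;
    alluxio_path is an alluxio:// URI pointing at the Alluxio master.
    """
    df.write.mode(mode).parquet(alluxio_path)
    return spark.read.parquet(alluxio_path)
```

A call might look like `save_and_reload_parquet(spark, df, "alluxio://localhost:19998/tables/events")`, where the host, port, and path are placeholders for a real deployment.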
![Page 21: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/21.jpg)
![Page 22: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/22.jpg)
ENVIRONMENT
Spark 2.0.0 + Alluxio 1.2.0
Single worker: Amazon r3.2xlarge
Comparisons:
• Alluxio
• Spark Storage Level: MEMORY_ONLY
• Spark Storage Level: MEMORY_ONLY_SER
• Spark Storage Level: DISK_ONLY
![Page 23: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/23.jpg)
READING CACHED RDD
[Figure: x-axis: RDD Size [GB] (0 to 50); y-axis: Time [seconds] (0 to 250); series: Alluxio (textFile), Alluxio (objectFile), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_ONLY]
![Page 24: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/24.jpg)
NEW CONTEXT: READ 50 GB RDD (SSD)
[Figure: bar chart; x-axis: Time [seconds] (0 to 250); bars: Alluxio (textFile), Alluxio (objectFile), No Alluxio]
![Page 25: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/25.jpg)
NEW CONTEXT: READ 50 GB RDD (S3)
[Figure: bar chart; x-axis: Time [seconds] (0 to 800); bars: Alluxio (textFile), Alluxio (objectFile), No Alluxio; annotations: 7x speedup, 16x speedup]
![Page 26: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/26.jpg)
READING CACHED DATAFRAME (PARQUET)
[Figure: x-axis: DataFrame Size [GB] (0 to 50); y-axis: Time [seconds] (0 to 250); series: Alluxio (textFile), MEMORY_ONLY_SER, MEMORY_ONLY]
![Page 27: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/27.jpg)
NEW CONTEXT: READ 50 GB DATAFRAME (SSD)
[Figure: bar chart; x-axis: Time [seconds] (0 to 250); bars: Alluxio, No Alluxio]
![Page 28: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/28.jpg)
NEW CONTEXT: READ 50 GB DATAFRAME (S3)
[Figure: bar chart; x-axis: Time [seconds] (0 to 1750); bars: Alluxio, No Alluxio]
10x average speedup, 17x peak speedup
![Page 29: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/29.jpg)
CONCLUSION
• Easy to use with Spark
• Predictable and improved performance
• Easily connects to various storage systems
![Page 30: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/30.jpg)
Contact: [email protected]
Twitter: @jiacalvin
Websites: www.alluxio.com and www.alluxio.org
Thank you!
![Page 31: Effective Spark with Alluxio at Strata+Hadoop World San Jose 2017](https://reader031.vdocuments.net/reader031/viewer/2022030307/58e499731a28abf5428b4a8d/html5/thumbnails/31.jpg)
BENEFITS
Unification: New workflows across any data in any storage system
Performance: Orders of magnitude improvement in run time
Flexibility: Choice in compute and storage – grow each independently, buy only what is needed