Dancing Elephants: Efficiently Working with Object Stores from Apache Spark and Apache Hive
TRANSCRIPT
© Hortonworks Inc. 2011 – 2017 All Rights Reserved
Dancing Elephants: Working with Object Storage in Apache Spark and Hive
Steve Loughran
Sanjay Radia
April 2017
Why?
• No upfront hardware costs
• Data ingress for IoT, mobile apps
• Cost effective if sufficiently agile
Object Stores are a key part of agile cloud applications
⬢ The only persistent store for in-cloud clusters
⬢ Object stores are the source and final destination of work
⬢ Cross-application data storage
⬢ Asynchronous data exchange
⬢ External data sources
Also useful in physical clusters!
Cloud Storage Integration: Evolution for Agility
[Diagram: three stages of cloud storage integration:
1. Application reads input from and writes output to HDFS only
2. Application still works against HDFS, with cloud storage used for backup/restore and upload (Azure, AWS today)
3. Goal: cloud storage as the persistent Data Lake, the application reading input from and writing output to the store directly, with HDFS kept for tmp data]
[Diagram: Elastic ETL: inbound data is processed through HDFS and published as ORC datasets, alongside external datasets]

[Diagram: Notebooks: notebooks read library code and datasets, including external ones, directly from storage]
Streaming
Danger: Object stores are not filesystems∗
Cost & geo-distribution over consistency and performance
∗for more information, please re-read
A Filesystem: Directories, Files, Data
[Diagram: a directory tree: /work/pending/ holds part-00 and part-01, each file stored as replicated blocks, with an empty /work/complete/ directory alongside]
rename("/work/pending/part-01", "/work/complete")
Object Store: hash(name)⇒data
[Diagram: an object store maps hash(name) to a set of storage servers:
hash("/work/pending/part-00") ⇒ ["s01", "s02", "s04"]
hash("/work/pending/part-01") ⇒ ["s02", "s03", "s04"]
There is no rename; it must be emulated:
copy("/work/pending/part-01", "/work/complete/part-01")
delete("/work/pending/part-01")]
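The copy-then-delete emulation can be sketched with a toy store (a minimal model for illustration; the class and method names are invented, not any real client API):

```python
class ToyObjectStore:
    """Toy object store: a flat map of name -> bytes, no directories.

    rename() does not exist; clients emulate it with copy + delete,
    so the cost is proportional to the bytes copied and the operation
    is not atomic: both objects exist part-way through.
    """
    def __init__(self):
        self.objects = {}

    def put(self, name, data):
        self.objects[name] = data

    def copy(self, src, dst):
        self.objects[dst] = self.objects[src]  # server-side byte copy, O(data)

    def delete(self, name):
        del self.objects[name]

    def fake_rename(self, src, dst):
        self.copy(src, dst)
        # a failure here leaves BOTH src and dst visible
        self.delete(src)

store = ToyObjectStore()
store.put("/work/pending/part-01", b"results")
store.fake_rename("/work/pending/part-01", "/work/complete/part-01")
print(sorted(store.objects))  # ['/work/complete/part-01']
```

Contrast with a real filesystem, where rename() is a single O(1) metadata operation.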
Often: Eventually Consistent
[Diagram: eventual consistency in action: a DELETE of /work/pending/part-00 returns 200, yet subsequent GET /work/pending/part-00 requests can still return 200 with the old data until the change propagates to all servers]
The dangers of Eventual Consistency
⬢ Temporary data leftovers
⬢ List inconsistency means new data may not be visible
⬢ Lack of atomic rename() can leave output directories inconsistent
You can get bad data and not even notice
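A toy model of list inconsistency (illustrative only; no real store exposes a settle() call) shows how freshly written output can be invisible to a listing:

```python
class EventuallyConsistentStore:
    """Toy store whose LIST results lag behind writes: a listing taken
    before the store settles misses newly written objects, which is
    exactly how a downstream job can silently read incomplete output."""
    def __init__(self):
        self.objects = {}
        self.listing = set()       # stale view served to list() callers

    def put(self, name, data):
        self.objects[name] = data  # GET-after-PUT may work...

    def list(self, prefix):
        # ...but the listing has not caught up yet
        return [n for n in self.listing if n.startswith(prefix)]

    def settle(self):
        # propagation eventually completes
        self.listing = set(self.objects)

store = EventuallyConsistentStore()
store.put("/out/part-00", b"new data")
print(store.list("/out"))   # [] -- the new file is invisible to listings
store.settle()
print(store.list("/out"))   # ['/out/part-00']
```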
org.apache.hadoop.fs.FileSystem
hdfs s3a wasb adl swift gs
History of Object Storage Support (timeline, 2006–2017)
⬢ s3:// "inode on S3"
⬢ s3n:// "Native" S3
⬢ s3:// Amazon EMR S3 (proprietary)
⬢ swift:// OpenStack
⬢ wasb:// Azure WASB
⬢ s3a:// replaces s3n
⬢ oss:// Aliyun
⬢ gs:// Google Cloud
⬢ adl:// Azure Data Lake
Phase I: stabilize S3A. Phase II: speed & scale. Phase III: scale & consistency.
Make Apache Hadoop at home in the cloud
Step 1: Hadoop runs great on Azure ✔
Step 2: Beat EMR on EC2 ✔
Problem: S3 Analytics is too slow/broken
1. Analyze benchmarks and bug reports
2. Fix read path for columnar data
3. Fix write path
4. Improve query partitioning
5. The commitment problem
[Chart: LLAP (single node) on AWS, TPC-DS queries at 200 GB scale: time spent in getFileStatus(), read() and readFully(pos) calls]
HDP 2.6/Hadoop 2.8 transforms I/O performance!
// forward seek by skipping stream
fs.s3a.readahead.range=256K

// faster backward seek for columnar storage
fs.s3a.experimental.input.fadvise=random

// enhanced data upload
fs.s3a.fast.output.enabled=true
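For reference, the same options can be set cluster-wide in core-site.xml (property names as in Hadoop 2.8; the values are the ones suggested above):

```xml
<configuration>
  <!-- forward seek by skipping stream -->
  <property>
    <name>fs.s3a.readahead.range</name>
    <value>256K</value>
  </property>
  <!-- faster backward seek for columnar storage -->
  <property>
    <name>fs.s3a.experimental.input.fadvise</name>
    <value>random</value>
  </property>
  <!-- enhanced data upload -->
  <property>
    <name>fs.s3a.fast.output.enabled</name>
    <value>true</value>
  </property>
</configuration>
```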
—see HADOOP-11694 for lots more!
benchmarks != your queries, your data, your VMs, your directory tree
…but we think we've made a good start
S3 Data Source, 1 TB TPC-DS: LLAP vs Hive 1.x
[Chart: per-query runtimes, LLAP-1TB-TPCDS vs Hive-1-1TB-TPCDS, y-axis 0 to 2,500]
1 TB TPC-DS ORC dataset
3 × i2.4xlarge (16 CPU × 122 GB RAM × 4 SSD)
Apache Spark
⬢ Object store work applies
⬢ Needs tuning
⬢ Commits to S3: "trouble"
spark-defaults.conf
spark.hadoop.fs.s3a.readahead.range 256K
spark.hadoop.fs.s3a.experimental.input.fadvise random

spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true

spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
The S3 Commitment Problem
⬢ rename() is depended upon for atomic transactions
⬢ Time for copy() + delete() is proportional to data × files
⬢ Compared to Azure Storage, S3 copies are slow (6-10+ MB/s)
⬢ Intermediate data may be visible
⬢ Failures leave storage in unknown state
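The copy cost is easy to put rough numbers on (a back-of-the-envelope sketch; the 8 MB/s figure is just a point within the 6-10 MB/s range quoted above):

```python
def rename_commit_seconds(total_bytes, copy_mb_per_s=8.0):
    """Rough time for a rename-based commit on S3: every byte is
    copied server-side at roughly copy_mb_per_s, then the source is
    deleted (delete cost ignored here as comparatively small)."""
    return total_bytes / (copy_mb_per_s * 1024 * 1024)

# "Renaming" 10 GB of job output at ~8 MB/s: over 21 minutes,
# during which a failure leaves the output half-moved.
secs = rename_commit_seconds(10 * 1024**3)
print(round(secs))  # 1280 seconds
```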
Spark's Direct Output Committer? Risk of data corruption
S3Guard: fast, consistent S3 metadata (HADOOP-13445)
DynamoDB as fast, consistent metadata store
[Diagram: with DynamoDB holding the metadata, operations are consistent: PUT part-00 ⇒ 200, HEAD part-00 ⇒ 200, DELETE part-00 ⇒ 200, then HEAD part-00 ⇒ 404 rather than a stale 200]
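The scheme can be caricatured in a few lines (a toy model of the idea, not the actual S3Guard implementation):

```python
class GuardedStore:
    """Toy version of the S3Guard idea: a consistent metadata table
    (standing in for DynamoDB) is updated on every PUT and DELETE and
    consulted first, so deletes and creates are visible immediately
    even if the underlying blob store would still serve stale answers."""
    def __init__(self):
        self.blobs = {}      # eventually consistent data store
        self.metadata = {}   # consistent table: name -> exists?

    def put(self, name, data):
        self.blobs[name] = data
        self.metadata[name] = True

    def delete(self, name):
        # blob deletion may propagate slowly; the metadata is authoritative
        self.metadata[name] = False

    def head(self, name):
        return 200 if self.metadata.get(name) else 404

store = GuardedStore()
store.put("part-00", b"data")
store.delete("part-00")
print(store.head("part-00"))  # 404, even if the blob itself lingers
```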
Netflix Staging Committer
1. Saves output to local files (file://)
2. Task commit: upload to S3A as multipart PUT —but do not complete it
3. Job committer completes all uploads from successful tasks; cancels others.
Outcome:
⬢ No work visible until the job is committed
⬢ Task commit time = data/bandwidth
⬢ Job commit time = POST * #files
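The three steps above can be sketched as a toy two-phase committer (illustrative names throughout; the real committer deals with S3 multipart upload IDs, not Python dicts):

```python
class StagingCommitter:
    """Toy two-phase commit in the style described above: task commit
    stages an upload without completing it, and only job commit makes
    the output of successful tasks visible; everything else is aborted."""
    def __init__(self):
        self.pending = {}   # task id -> (name, data): uploaded, not completed
        self.visible = {}   # the store as readers see it

    def task_commit(self, task_id, name, data):
        # multipart upload initiated and data pushed, but NOT completed
        self.pending[task_id] = (name, data)

    def job_commit(self, successful_tasks):
        for task_id, (name, data) in self.pending.items():
            if task_id in successful_tasks:
                self.visible[name] = data   # complete the upload
            # else: abort the upload; nothing ever becomes visible
        self.pending.clear()

c = StagingCommitter()
c.task_commit("t1", "part-00", b"good")
c.task_commit("t2", "part-00.attempt2", b"speculative")
print(len(c.visible))     # 0 -- nothing visible before job commit
c.job_commit({"t1"})
print(sorted(c.visible))  # ['part-00']
```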
Availability
Read + write in HDP 2.6 and Apache Hadoop 2.8
S3Guard: preview of DynamoDB integration soon
Zero-rename commit: work in progress
Look to HDCloud for the latest work!
Big thanks to:
Rajesh Balamohan
Mingliang Liu
Chris Nauroth
Dominik Bialek
Ram Venkatesh
Everyone in QE, RE
+ everyone who reviewed and tested patches, added their own, filed bug reports, and measured performance
Questions?