Running Apache Spark & Apache Zeppelin in Production

Vinay Shukla, Director, Product Management | August 31, 2016 | Twitter: @neomythos


TRANSCRIPT

Page 1: Running Apache Spark & Apache Zeppelin in Production

Running Apache Spark & Apache Zeppelin in Production

Director, Product Management | August 31, 2016 | Twitter: @neomythos

Vinay Shukla

Page 2: Running Apache Spark & Apache Zeppelin in Production

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Thank You

Page 3: Running Apache Spark & Apache Zeppelin in Production

Who am I?

Vinay Shukla
• Product management: Spark for 2.5+ years, Hadoop for 3+ years
• Recovering programmer
• Blog at www.vinayshukla.com
• Twitter: @neomythos
• Addicted to yoga, hiking, & coffee
• Smallest contributor to Apache Zeppelin

Page 4: Running Apache Spark & Apache Zeppelin in Production

Security: Rings of Defense

• Perimeter-level security: network security (i.e., firewalls)
• OS security
• Authentication: Kerberos, Knox (and other gateways)
• Authorization: Apache Ranger/Sentry, HDFS permissions, HDFS ACLs, YARN ACLs
• Data protection: wire encryption, HDFS TDE/DARE, others

Page 5: Running Apache Spark & Apache Zeppelin in Production

Interacting with Spark

[Diagram: four ways of interacting with Spark on YARN — Zeppelin, the Spark shell, the Spark Thrift Server, and a REST server. Each front end has its own driver, which manages executors (Ex) in the cluster.]

Page 6: Running Apache Spark & Apache Zeppelin in Production

Context: Spark Deployment Modes

• Spark on YARN
  – yarn-cluster: the Spark driver (SparkContext) runs in the YARN Application Master
  – yarn-client: the Spark driver (SparkContext) runs locally, in the client process
• The Spark shell & Spark Thrift Server run in yarn-client mode only

[Diagram: YARN-Client vs. YARN-Cluster — in client mode the Spark driver sits in the client and the App Master only manages executors; in cluster mode the driver runs inside the App Master.]
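The two modes differ only in where the driver runs; a sketch of the corresponding submit commands (the application class and jar are hypothetical placeholders, and the commands are echoed rather than executed so the sketch stands alone without a cluster):

```shell
# Hypothetical application; placeholders, not a real class or jar.
APP="--class com.example.MyApp myApp.jar"

# yarn-cluster: the driver (SparkContext) runs inside the YARN App Master
echo "spark-submit --master yarn --deploy-mode cluster $APP"

# yarn-client: the driver (SparkContext) runs in the local client process
echo "spark-submit --master yarn --deploy-mode client $APP"
```

In Spark 1.x the same choice is often written as --master yarn-cluster or --master yarn-client, as the examples later in this deck do.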

Page 7: Running Apache Spark & Apache Zeppelin in Production

Spark on YARN

[Diagram: job submission flow on a Hadoop cluster — John Doe runs spark-submit; the YARN ResourceManager launches the Spark AM; Node Managers start executor containers; executors read from HDFS.]

Page 8: Running Apache Spark & Apache Zeppelin in Production

Spark – Security – Four Pillars

Authentication, Authorization, Audit, Encryption

Spark leverages Kerberos on YARN. Ensure the network is secure.

Page 9: Running Apache Spark & Apache Zeppelin in Production

Kerberos authentication within Spark

1. John Doe runs kinit to authenticate to the KDC.
2. The client gets a service ticket for Spark.
3. Using the Spark service ticket, the client submits the Spark job to the Hadoop cluster.
4. The Spark AM gets a NameNode (NN) service ticket.
5. YARN launches the Spark executors using John Doe's identity.
6. Each executor reads from HDFS using John Doe's delegation token.

Page 10: Running Apache Spark & Apache Zeppelin in Production

Spark – Kerberos – Example

kinit -kt /etc/security/keytabs/johndoe.keytab [email protected]

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  lib/spark-examples*.jar 10

Page 11: Running Apache Spark & Apache Zeppelin in Production

Spark – Authorization

[Diagram: John Doe authenticates to the KDC and the client gets a service ticket for Spark; he submits the Spark job to the YARN cluster with that ticket; Apache Ranger answers "Can John launch this job?" and "Can John read this file?"; the job gets a NameNode (NN) service ticket and executors read files A, B, C from HDFS.]

Page 12: Running Apache Spark & Apache Zeppelin in Production

Encryption: Spark – Communication Channels

[Diagram: the channels between Spark Submit, the RM, the AM/Driver, the NM, executors (Ex 1 to Ex N), the Shuffle Service, and the data source — Control/RPC, Shuffle Data, Shuffle Block Transfer, Read/Write Data, and FS (broadcast, file download).]

Page 13: Running Apache Spark & Apache Zeppelin in Production

Spark Communication Encryption Settings

• Control/RPC: spark.authenticate = true (leverage YARN to distribute keys)
• Shuffle data: spark.authenticate.enableSaslEncryption = true
• Shuffle block transfer (NM > Ex): leverages YARN-based SSL
• Read/write data: depends on the data source; for HDFS, RPC encryption (RC4 | 3DES), or SSL for WebHDFS
• FS (broadcast, file download): spark.ssl.enabled = true
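Put together in spark-defaults.conf, the Spark-side settings named above would look roughly like this (a sketch written to a temporary file; only the three properties this slide names are included):

```shell
# Sketch of a spark-defaults.conf fragment enabling the settings above.
cat > /tmp/spark-defaults-sketch.conf <<'EOF'
spark.authenticate                        true
spark.authenticate.enableSaslEncryption   true
spark.ssl.enabled                         true
EOF
grep -c '^spark\.' /tmp/spark-defaults-sketch.conf
```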

Page 14: Running Apache Spark & Apache Zeppelin in Production

Sharp Edges with Spark Security

• SparkSQL: only coarse-grained access control today
• Client > Spark Thrift Server > Spark executors: no identity propagation on the 2nd hop
  – Lowers security; forces the STS to run as the Hive user to read all data
  – Workaround: use SparkSQL via the shell or the programmatic API
  – https://issues.apache.org/jira/browse/SPARK-5159
• Spark Streaming + Kafka + Kerberos
  – Issues fixed in HDP 2.4.x
  – No SSL support yet
• Spark shuffle: only SASL, no SSL support; no encryption for spill-to-disk or intermediate data
• Fine-grained security for SparkSQL: http://bit.ly/2bLghGz, http://bit.ly/2bTX7Pm

Page 15: Running Apache Spark & Apache Zeppelin in Production


Apache Zeppelin Security

Page 16: Running Apache Spark & Apache Zeppelin in Production

Apache Zeppelin: Authentication + SSL

[Diagram: John Doe connects to Zeppelin over SSL, passing through the firewall (1); Zeppelin authenticates him against LDAP (2); Zeppelin then runs his code on Spark on YARN (3).]

Page 17: Running Apache Spark & Apache Zeppelin in Production

Zeppelin + Livy E2E Security

[Diagram: John Doe authenticates to Zeppelin against LDAP; the Spark group interpreter calls the Livy APIs using SPNego (Kerberos); Livy launches the Spark job on YARN over Kerberos/RPC, and the job runs as John Doe.]

Page 18: Running Apache Spark & Apache Zeppelin in Production

Apache Zeppelin: Authorization

Authorization in Zeppelin
• Note-level authorization: grant permissions (Owner, Reader, Writer) to users/groups on notes
• LDAP group integration
• Zeppelin UI authorization: allow only admins to configure interpreters; configured in shiro.ini:

[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]

Authorization at the data level
• For Spark, with Zeppelin > Livy > Spark: identity propagation, jobs run as the end user
• For Hive, with the Zeppelin JDBC interpreter: runs as the end user
• Shell interpreter: runs as the end user
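A minimal, self-contained shiro.ini combining the [urls] rules above with the standard Shiro [users] and [roles] sections might look like this (a sketch written to a temporary file; the admin user and password are placeholders, not recommendations):

```shell
# Sketch: a minimal shiro.ini for Zeppelin. User name and password are
# placeholders; real deployments would use an LDAP/AD realm instead.
cat > /tmp/shiro-sketch.ini <<'EOF'
[users]
admin = changeme, admin

[roles]
admin = *

[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
/** = authc
EOF
grep -c 'roles\[admin\]' /tmp/shiro-sketch.ini
```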

Page 19: Running Apache Spark & Apache Zeppelin in Production

Apache Zeppelin: Credentials

Credentials in Zeppelin
• LDAP/AD account: Zeppelin leverages the Hadoop Credential API
• Interpreter credentials: not solved yet; this is still an open issue

Page 20: Running Apache Spark & Apache Zeppelin in Production


Spark Performance

Page 21: Running Apache Spark & Apache Zeppelin in Production

#1 Big or Small Executor?

spark-submit --master yarn \
  --deploy-mode client \
  --num-executors ? \
  --executor-cores ? \
  --executor-memory ? \
  --class MySimpleApp \
  mySimpleApp.jar \
  arg1 arg2

Page 22: Running Apache Spark & Apache Zeppelin in Production

Show the Details

Cluster resources: 72 cores + 288 GB of memory in total, across three nodes, each with 24 cores and 96 GB of memory.

Page 23: Running Apache Spark & Apache Zeppelin in Production

Benchmark with Different Combinations

Cores per E    Memory per E (GB)    E per Node
     1                  6               18
     2                 12                9
     3                 18                6
     6                 36                3
     9                 54                2
    18                108                1

(E stands for executor; each node offers 18 usable cores and 108 GB of usable memory.)

Except for the too-small and too-big executors, performance is relatively close (in the benchmark, lower is better).
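The table rows follow directly from the per-node budget: executors per node = 18 cores / cores per executor, and memory per executor = 108 GB / executors per node. A small sketch deriving the same combinations:

```shell
# Derive the benchmark combinations for a node with 18 usable cores
# and 108 GB of usable memory (the figures from this slide).
node_cores=18
node_mem_gb=108
for cores_per_exec in 1 2 3 6 9 18; do
  execs_per_node=$((node_cores / cores_per_exec))
  mem_per_exec=$((node_mem_gb / execs_per_node))
  echo "$cores_per_exec cores -> $execs_per_node executors/node, ${mem_per_exec} GB each"
done
```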

Page 24: Running Apache Spark & Apache Zeppelin in Production

Big or Small Executor?

Avoid too-small or too-big executors:
– A small executor decreases CPU and memory efficiency.
– A big executor introduces heavy GC overhead.

Usually 3–6 cores and 10–40 GB of memory per executor is a preferable choice.

Page 25: Running Apache Spark & Apache Zeppelin in Production

Anything Else?

• Executor memory != container memory: container memory = executor memory + overhead memory (10% of executor memory)
• Leave some resources for other services and the system
• Enable CPU scheduling if you want to constrain CPU usage#

#CPU scheduling - http://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/
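The overhead rule above can be worked through for a concrete size; note that stock Spark on YARN also enforces a 384 MB floor on the overhead, which the sketch includes:

```shell
# Sketch: YARN container size for a 4 GB executor, using the 10% overhead
# rule from this slide plus Spark-on-YARN's 384 MB minimum overhead.
executor_mem_mb=4096
overhead_mb=$((executor_mem_mb / 10))
if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi
container_mb=$((executor_mem_mb + overhead_mb))
echo "YARN container: ${container_mb} MB"
```

So asking for 4096 MB of executor memory actually consumes a 4505 MB container; size your queues accordingly.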

Page 26: Running Apache Spark & Apache Zeppelin in Production

Multi-tenancy for Spark

• Leverage YARN queues
  – Set user quotas
  – Set the default YARN queue in spark-defaults
  – Users can override it per job
• Leverage dynamic resource allocation
  – Specify the range of executors a job uses
  – Requires the external shuffle service
• Improves cluster resource utilization
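A default queue plus dynamic allocation can be sketched in spark-defaults.conf like this (written to a temporary file; the queue name and executor bounds are placeholder values):

```shell
# Sketch of a spark-defaults.conf fragment: a default YARN queue plus
# dynamic allocation with the external shuffle service it requires.
cat > /tmp/spark-multitenancy-sketch.conf <<'EOF'
spark.yarn.queue                          analytics
spark.shuffle.service.enabled             true
spark.dynamicAllocation.enabled           true
spark.dynamicAllocation.minExecutors      1
spark.dynamicAllocation.maxExecutors      10
EOF
grep -c '^spark\.' /tmp/spark-multitenancy-sketch.conf
```

A user can still override the queue for one job with spark-submit's --queue flag.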

Page 27: Running Apache Spark & Apache Zeppelin in Production


Thank You

Vinay Shukla @neomythos