© Cloudera, Inc. All rights reserved.
Securing Spark Applications
Hadoop Summit 2016 - Dublin
Marcelo Vanzin
What is Security?
• Security has many facets
• This talk will focus on three areas:
  • Encryption
  • Authentication
  • Authorization
Why do I need security?
• User identification
• Application isolation
• Access control enforcement
• Compliance with government regulations
Before we go further...
• Set up Kerberos
• Use HDFS (or another secure filesystem)
• Use YARN
• Configure them for security (enable auth, encryption)

Kerberos, HDFS, and YARN provide the security backbone for Spark.
Encryption
• In a secure cluster, data should not be visible in the clear
  • On-the-wire data
  • At-rest data
• Very important to financial / government institutions
  • Or anyone who works with sensitive data
What a Spark app looks like
[Diagram: Driver, two Executors, Shuffle Service, UI, and Disks. Labeled channels: Control RPC, File Download, Shuffle / Cached Blocks, Shuffle Blocks, Shuffle Blocks / Metadata.]
Prior to Spark 1.6
Different channel, different method:
• Control plane: SSL
• File distribution: SSL
• Shuffle blocks: SASL
• User UI / REST API: nothing
• Spilled / shuffle blocks: use eCryptfs (or equivalent)
What is wrong with SSL?
Why not SSL?
• SSL can be hard to set up
• Need certificates readable on every node
• Sharing certificates not as secure
• Hard to have per-user certificates
Spark 1.6
Standardizes around a common transport library
• Replaces Akka RPC (SPARK-6028)
• Replaces HTTP File service (SPARK-11140)
• Uses SASL encryption

But...
• WebUI still has no encryption
• Shuffle / spilled blocks still require FS-level encryption
• SASL in the JVM is restricted to 3DES encryption, which is not very strong
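As a rough illustration of what "uses SASL encryption" means for someone submitting a job, the sketch below builds the `--conf` flags for `spark-submit`. The two property names are standard Spark settings; the helper functions and the application file `my_app.py` are hypothetical and only serve the example.

```python
# Sketch: render spark-submit flags that turn on authentication and
# SASL encryption for the common network transport library.

def sasl_encryption_conf():
    """Return Spark properties for authenticated, SASL-encrypted traffic."""
    return {
        # Require a shared secret between Spark processes. On YARN the
        # secret is generated and distributed per application.
        "spark.authenticate": "true",
        # Encrypt traffic that goes through the transport library.
        "spark.authenticate.enableSaslEncryption": "true",
    }


def to_submit_args(conf):
    """Flatten a property dict into spark-submit '--conf key=value' pairs."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # "my_app.py" is a hypothetical application file.
    cmd = ["spark-submit"] + to_submit_args(sasl_encryption_conf()) + ["my_app.py"]
    print(" ".join(cmd))
```

These settings do not cover the Web UI or blocks written to disk, which is exactly the gap the "But..." bullets describe.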
Spark 2.0
• REPL class distribution using transport lib (SPARK-11563)
• HTTPS support for WebUI and History Server (SPARK-2750)
• Encrypting shuffle blocks is almost in (SPARK-5682)
  • Depends on third-party Chimera library for encryption
  • Work is being done to add Chimera to Apache Commons

Future:
• Use Chimera to encrypt over-the-wire data
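For the HTTPS support mentioned above, a minimal sketch of the kind of properties involved might look like the following. It assumes the `spark.ssl.*` settings are what drives HTTPS for the UI in your Spark version; the keystore path, the password, and both helper functions are placeholders invented for this example.

```python
# Sketch: render SSL-related properties in spark-defaults.conf style
# (one "key value" pair per line).

def ssl_conf(keystore_path, keystore_password):
    """Return a minimal set of spark.ssl.* properties."""
    return {
        "spark.ssl.enabled": "true",
        "spark.ssl.keyStore": keystore_path,
        "spark.ssl.keyStorePassword": keystore_password,
    }


def render_properties(conf):
    """Render a property dict as whitespace-separated 'key value' lines."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


if __name__ == "__main__":
    # Placeholder path and password; do not use these values for real.
    print(render_properties(ssl_conf("/path/to/keystore.jks", "changeit")))
```

Note that the keystore still has to be readable on the node serving the UI, which ties back to the certificate distribution concerns raised earlier in the talk.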
Authentication
Who is reading my data?
• Spark relies on Kerberos
  • The necessary evil
• Ubiquitous in Hadoop
  • YARN, HDFS, Hive...
Who is reading my data?
Kerberos provides secure authentication.
[Diagram: User, Application, KDC]
User: Hi, I’m Bob.
KDC: Hello Bob. Here’s your TGT.
User: Here’s my TGT. I want to talk to HDFS.
KDC: Here’s your HDFS ticket.
Now with a distributed app...
[Diagram: eight Executors each contact the KDC]
Executor (x8): Hi, I’m Bob.

Something is wrong.
Kerberos in Hadoop / Spark
Hadoop services use delegation tokens to avoid KDC limitations.
[Diagram: Driver, NameNode, Executor, DataNode]
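To see why delegation tokens help, here is a toy model (not Spark or Hadoop code) that counts KDC requests in the two setups. It assumes each Kerberos login costs one TGT request plus one service-ticket request per service; the function names are made up for this sketch.

```python
# Toy model: how many requests reach the KDC with and without
# delegation tokens. Purely illustrative arithmetic.

def kdc_requests_without_tokens(num_executors, services=1):
    """Every executor logs in (1 TGT) and requests a ticket per service."""
    return num_executors * (1 + services)


def kdc_requests_with_tokens(num_executors, services=1):
    """Only the driver talks to the KDC.

    The driver uses its service tickets to obtain delegation tokens
    (for example from the NameNode) and ships them to the executors,
    so the count does not depend on num_executors.
    """
    return 1 + services


if __name__ == "__main__":
    for n in (8, 1000):
        print(n, kdc_requests_without_tokens(n), kdc_requests_with_tokens(n))
```

The first number grows linearly with the size of the application, while the second stays constant, which is the KDC limitation the slide refers to.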
Delegation Tokens
Like Kerberos tickets, they have a TTL.
• OK for most batch applications.
• Not OK for long-running applications
  • Streaming
  • Spark SQL Thrift Server

Since 1.4, Spark can manage delegation tokens, but support is very limited.
• Full support only for HDFS.
• Limited support for Hive, HBase.
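The TTL problem can be made concrete with a small scheduling sketch. It assumes a 24-hour renew interval and a 7-day maximum lifetime (values chosen for illustration; check your cluster's settings), and the function `token_schedule` is hypothetical, not part of Spark or Hadoop.

```python
from datetime import datetime, timedelta

# Assumed values for illustration only.
RENEW_INTERVAL = timedelta(hours=24)
MAX_LIFETIME = timedelta(days=7)


def token_schedule(start, run_for,
                   renew_interval=RENEW_INTERVAL,
                   max_lifetime=MAX_LIFETIME):
    """Return (renewals, reissues) for an app running from start for run_for.

    renewals: times at which the current token must be renewed.
    reissues: times at which the token hits its maximum lifetime and a
              brand-new token is needed (which requires Kerberos
              credentials, such as a keytab).
    """
    end = start + run_for
    renewals, reissues = [], []
    issued = start
    while True:
        hard_expiry = issued + max_lifetime
        next_renewal = issued + renew_interval
        # Renew periodically until the token can no longer be renewed.
        while next_renewal < hard_expiry and next_renewal < end:
            renewals.append(next_renewal)
            next_renewal += renew_interval
        if hard_expiry >= end:
            break
        reissues.append(hard_expiry)
        issued = hard_expiry
    return renewals, reissues


if __name__ == "__main__":
    start = datetime(2016, 1, 1)
    for label, duration in (("batch", timedelta(hours=2)),
                            ("streaming", timedelta(days=10))):
        renewals, reissues = token_schedule(start, duration)
        print(label, len(renewals), "renewals,", len(reissues), "reissues")
```

A two-hour batch job never needs to touch its token, while a ten-day streaming job needs both renewals and a fresh token, which is why long-running applications need Spark to manage tokens on their behalf.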
How about Secure Kafka?
Spark Streaming with Kafka
• Kafka 0.9 supports some security features
• Requires the use of a new consumer API (SPARK-12177)
• Kafka 0.9 does not support delegation tokens! (KAFKA-1696)
Authorization
How can I share my data?
Simplest form of authorization: file permissions.
• Unix-style user/group/other or ACLs
• Simple, but high maintenance.
  • umask
  • manually change new files
• Trusted entity (OS kernel) enforces access control
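The user/group/other check described above can be sketched in a few lines. This is an illustrative model of the POSIX-style decision, not the kernel's or HDFS's actual code; the function `can_read` and the sample users and groups are made up.

```python
import stat


def can_read(mode, owner, group, user, user_groups):
    """Decide read access from Unix-style permission bits.

    The first matching class wins: owner, then group, then other.
    """
    if user == owner:
        return bool(mode & stat.S_IRUSR)
    if group in user_groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


if __name__ == "__main__":
    mode = 0o640  # rw-r-----
    print(can_read(mode, "alice", "finance", "alice", {"staff"}))   # owner
    print(can_read(mode, "alice", "finance", "bob", {"finance"}))   # group
    print(can_read(mode, "alice", "finance", "eve", {"staff"}))     # other
```

The decision only sees a file, an owner, and a group. It has no notion of tables or columns, which is the limitation the following slides build on.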
More than just FS semantics
Not all applications operate on files...
• Tables, columns, partitions instead of files and directories
• Trusted service needs to understand app’s semantics
Trusted Service Example: Hive
[Diagram: Client, HiveServer2, HMS, two DataNodes]
Untrusted App Example: Spark
[Diagram: User Code, HMS, two DataNodes]
Apache Sentry
• Role-based access control to resources
• Integrates with HMS / HS2 to control access to data
• Fine-grained (up to column level) controls

HDFS plugin synchronizes file permissions.
• Permission to read table = permission to read table’s files
• Permission to create table = permission to write to database’s directory
Still restricted to FS view of the world!
• Files, directories, etc.
• Cannot provide column-level and row-level access control.
• Whole table or nothing.
Still, it goes a long way in allowing Spark applications to work well with Hive data in a shared, secure environment.
But...
A Simple Example
Assume we had a table “accounts”:

column_name   column_type
name          string
country       string
balance       int
Untrusted App Example: Spark
[Diagram: User Code, HMS, HDFS]
1. Where’s table “accounts”?
2. In path “/accounts”
3. Give me the files in “/accounts”
4. Here’s the file
[File contents: name, country, balance]
Future: RecordService
A distributed, scalable data access service for unified authorization in Hadoop.
• Drop-in replacement for Hive InputFormats
• Integration with Spark SQL Data Sources API
  • Predicate pushdown, projection
RecordService
Users can enforce row- and column-level permissions using views.
name country balance
Alice US 1000
Bob BR 1500
Eve US 2000
> create view customers as select name, country from accounts
> create view balances_us as select name, balance from accounts where country = 'US'
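The effect of the two views can be tried out with an in-memory SQLite database. This sketch assumes the views select the table's actual columns (`name`, `country`, `balance`). SQLite does not enforce any permissions here; the example only shows which rows and columns each view exposes, while the enforcement itself would come from the trusted service.

```python
import sqlite3

# Build the example "accounts" table in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, country TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("Alice", "US", 1000), ("Bob", "BR", 1500), ("Eve", "US", 2000)],
)

# Column-level restriction: the balance column is hidden.
conn.execute("CREATE VIEW customers AS SELECT name, country FROM accounts")

# Row-level restriction: only US accounts are visible.
conn.execute(
    "CREATE VIEW balances_us AS "
    "SELECT name, balance FROM accounts WHERE country = 'US'"
)

customers = conn.execute("SELECT * FROM customers ORDER BY name").fetchall()
balances_us = conn.execute("SELECT * FROM balances_us ORDER BY name").fetchall()

print(customers)
print(balances_us)
```

A user who is granted access to `customers` but not to `accounts` never sees a balance, and a user limited to `balances_us` never sees Bob's row.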
Untrusted App Example: Spark
[Diagram: User Code, RS Planner, RS Worker]
1. Where’s table “accounts”?
2. Sorry, you can’t read it.
3. Where’s table “customers”?
4. In Worker “X”
5. Give me table “customers”
6. Here’s a list of (name, country)
[Table “accounts”: name, country, balance. View “customers”: name, country]
Takeaways
• Spark can be made secure today!
• Builds on top of security features in Hadoop
• Still work to be done
  • Stronger encryption
  • Easier to use SSL
  • Better integration with Sentry / RecordService
References
• Encryption: SPARK-6017, SPARK-5682
• Delegation tokens: SPARK-5342
• Sentry: http://sentry.apache.org/
  • HDFS synchronization: SENTRY-432
• RecordService: http://cloudera.github.io/RecordServiceClient/
Thanks!
Questions?