Securing Spark Applications

Marcelo Vanzin, Hadoop Summit 2016 - Dublin. © Cloudera, Inc. All rights reserved.


Page 1: Securing Spark Applications


Securing Spark Applications

Hadoop Summit 2016 - Dublin

Marcelo Vanzin

Page 2: Securing Spark Applications

What is Security?

• Security has many facets
• This talk will focus on three areas:
  • Encryption
  • Authentication
  • Authorization

Page 3: Securing Spark Applications

Why do I need security?

• User identification
• Application isolation
• Access control enforcement
• Compliance with government regulations

Page 4: Securing Spark Applications

Before we go further...

• Set up Kerberos
• Use HDFS (or another secure filesystem)
• Use YARN
• Configure them for security (enable auth, encryption)

Kerberos, HDFS, and YARN provide the security backbone for Spark.
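The "configure them for security" step can be checked mechanically. The following is a minimal sketch, assuming the cluster's core-site.xml and hdfs-site.xml values have already been loaded into a plain dictionary. The property names are standard Hadoop settings; `check_secure` and the sample dictionary are hypothetical and exist only for illustration.

```python
# Sketch: flag Hadoop settings that leave a cluster insecure.
# REQUIRED maps real Hadoop property names to the value a secured cluster
# is expected to use; check_secure() itself is a hypothetical helper.
REQUIRED = {
    "hadoop.security.authentication": "kerberos",  # enable Kerberos authentication
    "hadoop.security.authorization": "true",       # enable service-level authorization
    "hadoop.rpc.protection": "privacy",            # encrypt Hadoop RPC traffic
    "dfs.encrypt.data.transfer": "true",           # encrypt HDFS block transfers
}


def check_secure(conf):
    """Return a sorted list of properties that are missing or not set as expected."""
    return sorted(key for key, want in REQUIRED.items() if conf.get(key) != want)


# Illustrative configuration: authentication is on, wire encryption is not.
sample = {
    "hadoop.security.authentication": "kerberos",
    "hadoop.security.authorization": "true",
    "hadoop.rpc.protection": "authentication",
}
problems = check_secure(sample)
print(problems)
```

A real deployment has more settings than these four (YARN, TLS for web endpoints, and so on), so an empty result here means only that this particular subset looks right.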

Page 5: Securing Spark Applications

Encryption

• In a secure cluster, data should not be visible in the clear
  • On-the-wire data
  • At-rest data
• Very important to financial / government institutions
  • Or anyone who works with sensitive data

Page 6: Securing Spark Applications

What a Spark app looks like

[Diagram: a Driver, two Executors, the Shuffle Service, the UI, and local Disks. Labeled channels: Control RPC, File Download, Shuffle / Cached Blocks, Shuffle Blocks, Shuffle Blocks / Metadata.]

Page 7: Securing Spark Applications

Prior to Spark 1.6

Different channel, different method:
- Control plane: SSL
- File distribution: SSL
- Shuffle Blocks: SASL
- User UI / REST API: Nothing
- Spilled/Shuffle Blocks: Use ecryptfs (or equivalent)

Page 8: Securing Spark Applications


What is wrong with SSL?

Page 9: Securing Spark Applications

Why not SSL?

• SSL can be hard to set up
• Need certificates readable on every node
• Sharing certificates is less secure
• Hard to have per-user certificates

Page 10: Securing Spark Applications

Spark 1.6

Standardizes around a common transport library
• Replaces Akka RPC (SPARK-6028)
• Replaces HTTP File service (SPARK-11140)
• Uses SASL encryption

But...
• WebUI still has no encryption
• Shuffle / Spilled blocks still require FS-level encryption
• SASL in the JVM is restricted to 3DES encryption, which is not very strong
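Turning on SASL encryption for the common transport library is a matter of configuration. The sketch below assembles a spark-submit command line from the three relevant properties documented for Spark 1.6. The helper `to_submit_args` and the application name `my_app.py` are placeholders, not part of Spark.

```python
# Sketch: turn SASL-related Spark settings into spark-submit arguments.
SASL_CONF = {
    "spark.authenticate": "true",                       # authenticate Spark's internal connections
    "spark.authenticate.enableSaslEncryption": "true",  # encrypt traffic when authentication is on
    "spark.network.sasl.serverAlwaysEncrypt": "true",   # disable unencrypted connections for SASL-capable services
}


def to_submit_args(conf):
    """Flatten a dict into ['--conf', 'key=value', ...] in a stable order."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# "my_app.py" is a placeholder application name.
command = ["spark-submit", "--master", "yarn"] + to_submit_args(SASL_CONF) + ["my_app.py"]
print(" ".join(command))
```

The same key/value pairs could go into spark-defaults.conf instead of the command line. The limitations listed on the slide still apply: the WebUI and blocks written to disk are not covered by these settings.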

Page 11: Securing Spark Applications

Spark 2.0

• REPL class distribution using transport lib (SPARK-11563)
• HTTPS support for WebUI and History Server (SPARK-2750)
• Encrypting shuffle blocks is almost in (SPARK-5682)
  • Depends on third-party Chimera library for encryption
  • Work is being done to add Chimera to Apache Commons

Future:
• Use Chimera to encrypt over-the-wire data

Page 12: Securing Spark Applications

Authentication

Who is reading my data?

• Spark relies on Kerberos
  • the necessary evil
• Ubiquitous in Hadoop
  • YARN, HDFS, Hive...

Page 13: Securing Spark Applications

Who is reading my data?

Kerberos provides secure authentication.

[Diagram: User, Application, and KDC, with this exchange:]

1. Bob: Hi I’m Bob.
2. KDC: Hello Bob. Here’s your TGT.
3. Bob: Here’s my TGT. I want to talk to HDFS.
4. KDC: Here’s your HDFS ticket.
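The two-step exchange with the KDC can be sketched as a toy model. This is not real Kerberos: there is no cryptography, tickets never expire, and `ToyKDC` with its methods is a hypothetical name chosen for the illustration. It shows only the order of operations: log in once to get a TGT, then trade the TGT for a per-service ticket.

```python
# Toy model of the TGT / service-ticket exchange. NOT real Kerberos:
# no cryptography, no expiry; ToyKDC and its methods are hypothetical.
import secrets


class ToyKDC:
    def __init__(self, passwords):
        self._passwords = passwords   # principal -> password
        self._tgts = {}               # issued TGT -> principal

    def login(self, principal, password):
        """'Hi I'm Bob.' -> 'Here's your TGT.'"""
        if self._passwords.get(principal) != password:
            raise PermissionError("unknown principal or bad password")
        tgt = secrets.token_hex(8)
        self._tgts[tgt] = principal
        return tgt

    def service_ticket(self, tgt, service):
        """'Here's my TGT. I want to talk to HDFS.' -> 'Here's your HDFS ticket.'"""
        principal = self._tgts.get(tgt)
        if principal is None:
            raise PermissionError("invalid TGT")
        return {"principal": principal, "service": service}


kdc = ToyKDC({"bob": "s3cret"})
tgt = kdc.login("bob", "s3cret")          # step 1-2: password is used only here
ticket = kdc.service_ticket(tgt, "hdfs")  # step 3-4: the TGT stands in for the password
print(ticket)
```

The point of the TGT is that the password is presented once; every later request to the KDC uses the TGT instead.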

Page 14: Securing Spark Applications

Now with a distributed app...

[Diagram: eight Executors, each telling the KDC “Hi I’m Bob.”]

Something is wrong.

Page 15: Securing Spark Applications

Kerberos in Hadoop / Spark

Hadoop services use delegation tokens to avoid KDC limitations.

[Diagram: Driver, NameNode, Executor, DataNode]
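The idea behind delegation tokens can be sketched with a toy model: the driver presents Kerberos credentials once, receives a token, and hands that token to the executors, which then talk to the service without going back to the KDC. The class and method names below are hypothetical and do not correspond to Hadoop's API.

```python
# Toy model: the driver authenticates once and shares a delegation token,
# so executors never contact the KDC. All names here are hypothetical.
import secrets


class ToyNameNode:
    def __init__(self):
        self.kerberos_logins = 0   # how many times Kerberos credentials were presented
        self._tokens = {}          # delegation token -> user

    def issue_delegation_token(self, kerberos_user):
        self.kerberos_logins += 1
        token = secrets.token_hex(8)
        self._tokens[token] = kerberos_user
        return token

    def open_file(self, token, path):
        user = self._tokens.get(token)
        if user is None:
            raise PermissionError("invalid delegation token")
        return f"{user} reads {path}"


namenode = ToyNameNode()
token = namenode.issue_delegation_token("bob")            # done once, by the driver
results = [namenode.open_file(token, f"/data/part-{i}")   # done by each executor
           for i in range(8)]
print(namenode.kerberos_logins, len(results))
```

Eight executors read eight files, yet Kerberos credentials were presented only once, which is what keeps a large application from overwhelming the KDC.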

Page 16: Securing Spark Applications

Delegation Tokens

Like Kerberos tickets, they have a TTL.
• OK for most batch applications.
• Not OK for long-running applications
  • Streaming
  • Spark SQL Thrift Server

Since 1.4, Spark can manage delegation tokens, but support is very limited.
• Full support only for HDFS.
• Limited support for Hive, HBase.
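Why the TTL matters for long-running applications can be shown with a small simulation. This is a toy model only: times are plain hour counts, the 24-hour lifetime is an arbitrary value picked for the example, and fetching a fresh token every few hours stands in for whatever mechanism the application uses to obtain new tokens. It does not reproduce Spark's actual implementation.

```python
# Toy model of a token with a TTL. Times are plain numbers (hours) and the
# 24-hour lifetime is an arbitrary value chosen for the example.
from dataclasses import dataclass


@dataclass
class ToyToken:
    issued_at: float
    ttl: float

    def valid_at(self, now):
        return self.issued_at <= now < self.issued_at + self.ttl


def hours_with_valid_token(job_hours, ttl, reissue_every=None):
    """Count the hours of a job during which it holds a valid token."""
    token = ToyToken(issued_at=0, ttl=ttl)
    ok = 0
    for hour in range(job_hours):
        if reissue_every and hour > 0 and hour % reissue_every == 0:
            token = ToyToken(issued_at=hour, ttl=ttl)  # fetch a fresh token
        ok += token.valid_at(hour)
    return ok


batch = hours_with_valid_token(job_hours=6, ttl=24)                        # short batch job
streaming = hours_with_valid_token(job_hours=72, ttl=24)                   # long job, token never replaced
managed = hours_with_valid_token(job_hours=72, ttl=24, reissue_every=12)   # long job, fresh token every 12 hours
print(batch, streaming, managed)
```

The batch job finishes well inside the token's lifetime. The 72-hour job without fresh tokens is covered for only its first 24 hours, while the one that periodically obtains new tokens stays covered throughout.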

Page 17: Securing Spark Applications


How about Secure Kafka?

Page 18: Securing Spark Applications

Spark Streaming with Kafka

• Kafka 0.9 supports some security features
• Requires the use of a new consumer API (SPARK-12177)
• Kafka 0.9 does not support delegation tokens! (KAFKA-1696)
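For orientation, here is a sketch of what the security-related settings for the new consumer API might look like against a Kerberos-secured Kafka 0.9 cluster. The keys are Kafka consumer configuration names; the broker address, port, and group id are placeholders, and `to_properties` is a hypothetical helper. Because there are no delegation tokens, each consumer process would typically need Kerberos credentials of its own, for example through a JAAS login configuration, which is not shown here.

```python
# Sketch: consumer settings for a Kerberos-secured Kafka 0.9 cluster.
# Host, port, and group id are placeholders.
consumer_props = {
    "bootstrap.servers": "broker1.example.com:9093",
    "group.id": "spark-streaming-demo",
    "security.protocol": "SASL_SSL",        # Kerberos (SASL) authentication over TLS
    "sasl.kerberos.service.name": "kafka",  # service part of the brokers' Kerberos principal
}


def to_properties(props):
    """Render the dict as Java .properties lines, sorted by key."""
    return "\n".join(f"{key}={props[key]}" for key in sorted(props))


rendered = to_properties(consumer_props)
print(rendered)
```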

Page 19: Securing Spark Applications

Authorization

How can I share my data?

Simplest form of authorization: file permissions.
• Unix-style user/group/other or ACLs
• Simple, but high maintenance.
  • umask
  • manually change new files
• Trusted entity (OS kernel) enforces access control
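The user/group/other model and the effect of umask can be sketched in a few lines using only the standard library. The user and group names are made up for the example, and the check below ignores ACLs and the superuser.

```python
# Sketch of Unix-style permission checks using only the standard library.
import stat


def mode_for_new_file(umask, requested=0o666):
    """Permission bits a new file gets: the requested bits with the umask bits cleared."""
    return requested & ~umask


def can_read(mode, file_owner, file_group, user, user_groups):
    """Pick the owner, group, or other class (first match wins) and test its read bit."""
    if user == file_owner:
        return bool(mode & stat.S_IRUSR)
    if file_group in user_groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


mode = mode_for_new_file(umask=0o027)   # 0o640, i.e. rw-r-----
owner_ok = can_read(mode, "bob", "analysts", "bob", {"bob"})
group_ok = can_read(mode, "bob", "analysts", "alice", {"analysts"})
other_ok = can_read(mode, "bob", "analysts", "eve", {"guests"})
print(oct(mode), owner_ok, group_ok, other_ok)
```

This is where the "high maintenance" comes from: sharing depends on every writer having the right umask, or on someone fixing the permissions of new files afterwards.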

Page 20: Securing Spark Applications

More than just FS semantics

Not all applications operate on files...
• Tables, columns, partitions instead of files and directories
• Trusted service needs to understand app’s semantics

Page 21: Securing Spark Applications

Trusted Service Example: Hive

[Diagram: Client, HiveServer2, HMS, and two DataNodes]

Page 22: Securing Spark Applications

Untrusted App Example: Spark

[Diagram: User Code, HMS, and two DataNodes]

Page 23: Securing Spark Applications

Apache Sentry

• Role-based access control to resources
• Integrates with HMS / HS2 to control access to data
• Fine-grained (up to column level) controls

HDFS plugin synchronizes file permissions.
• Permission to read table = permission to read table’s files
• Permission to create table = permission to write to database’s directory
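The two mapping rules can be sketched as a toy translation from table-level grants to file-level permissions. Everything in this example is hypothetical: the paths, the group names, the privilege strings, and the function itself. It is not Sentry's API, only an illustration of the rule being applied.

```python
# Toy model of the table-to-file permission mapping. Paths, group names, and
# privilege strings are hypothetical; this is not Sentry's API.
TABLE_PATHS = {"sales.accounts": "/warehouse/sales.db/accounts"}
DATABASE_PATHS = {"sales": "/warehouse/sales.db"}


def file_permissions(grants):
    """Translate (group, privilege, object) grants into (path, group, perms) entries."""
    entries = []
    for group, privilege, obj in grants:
        if privilege == "read_table":
            # Permission to read a table = permission to read the table's files.
            entries.append((TABLE_PATHS[obj], group, "r-x"))
        elif privilege == "create_table":
            # Permission to create a table = permission to write to the database's directory.
            entries.append((DATABASE_PATHS[obj], group, "rwx"))
    return entries


grants = [
    ("analysts", "read_table", "sales.accounts"),
    ("etl", "create_table", "sales"),
]
acl_entries = file_permissions(grants)
for entry in acl_entries:
    print(entry)
```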

Page 24: Securing Spark Applications

But...

Still restricted to FS view of the world!
• Files, directories, etc.
• Cannot provide column-level and row-level access control.
• Whole table or nothing.

Still, it goes a long way in allowing Spark applications to work well with Hive data in a shared, secure environment.

Page 25: Securing Spark Applications

A Simple Example

Assume we had a table “accounts”:

column_name   column_type
name          string
country       string
balance       int

Page 26: Securing Spark Applications

Untrusted App Example: Spark

[Diagram: User Code, HMS, and HDFS, with this exchange:]

1. User Code to HMS: Where’s table “accounts”?
2. HMS: In path “/accounts”
3. User Code to HDFS: Give me the files in “/accounts”
4. HDFS: Here’s the file (name, country, balance)

Page 27: Securing Spark Applications

Future: RecordService

A distributed, scalable, data access service for unified authorization in Hadoop.
• Drop-in replacement for Hive InputFormats
• Integration with Spark SQL Data Sources API
  • Predicate pushdown, projection

Page 28: Securing Spark Applications

RecordService

Users can enforce row- and column-level permissions using views.

name    country   balance
Alice   US        1000
Bob     BR        1500
Eve     US        2000

> create view customers as select name, country from accounts

> create view balances_us as select name, balance from accounts where country = 'US'
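What the two views expose can be tried out locally. The sketch below uses Python's built-in sqlite3 module purely as a stand-in SQL engine, with the table's columns (name, country, balance) and the three sample rows from the slide. SQLite does not enforce per-user permissions, so this shows only which columns and rows each view returns, not the access control that RecordService would add on top.

```python
# Sketch: which rows and columns the two views expose, using sqlite3 as a
# stand-in SQL engine. This demonstrates the views, not access control.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table accounts (name text, country text, balance integer)")
conn.executemany(
    "insert into accounts values (?, ?, ?)",
    [("Alice", "US", 1000), ("Bob", "BR", 1500), ("Eve", "US", 2000)],
)

# Column-level restriction: the balance column is left out.
conn.execute("create view customers as select name, country from accounts")
# Row-level restriction: only US accounts are visible.
conn.execute(
    "create view balances_us as "
    "select name, balance from accounts where country = 'US'"
)

customers = conn.execute("select * from customers order by name").fetchall()
balances_us = conn.execute("select * from balances_us order by name").fetchall()
print(customers)    # [('Alice', 'US'), ('Bob', 'BR'), ('Eve', 'US')]
print(balances_us)  # [('Alice', 1000), ('Eve', 2000)]
```

Granting a user access to `customers` or `balances_us` but not to `accounts` is what turns these views into column-level and row-level permissions.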

Page 29: Securing Spark Applications

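Untrusted App Example: Spark

[Diagram: User Code, RS Planner, and RS Worker, with this exchange:]

1. User Code to RS Planner: Where’s table “accounts”?
2. RS Planner: Sorry, you can’t read it.
3. User Code to RS Planner: Where’s table “customers”?
4. RS Planner: In Worker “X”
5. User Code to RS Worker: Give me table “customers”
6. RS Worker: Here’s a list of (name, country)

The stored data has columns (name, country, balance); user code receives only (name, country).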
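The exchange can be sketched as a toy model in which the permission check and the column projection both happen inside the trusted service. `ToyPlanner`, `ToyWorker`, the grant table, and the user name are hypothetical and are not RecordService's client API.

```python
# Toy model of the planner / worker exchange. ToyPlanner, ToyWorker, and the
# grant table are hypothetical; this is not RecordService's client API.
ROWS = [("Alice", "US", 1000), ("Bob", "BR", 1500), ("Eve", "US", 2000)]
VIEWS = {"customers": lambda row: (row[0], row[1])}   # name and country only
GRANTS = {"spark_user": {"customers"}}                # no grant on "accounts"


class ToyPlanner:
    def locate(self, user, table):
        """Steps 1-4: refuse tables the user cannot read, else point to a worker."""
        if table not in GRANTS.get(user, set()):
            raise PermissionError(f"Sorry, you can't read {table}.")
        return ToyWorker(table)


class ToyWorker:
    def __init__(self, table):
        self._project = VIEWS[table]

    def fetch(self):
        """Steps 5-6: the projection runs inside the trusted service,
        so the balance column never reaches user code."""
        return [self._project(row) for row in ROWS]


planner = ToyPlanner()
try:
    planner.locate("spark_user", "accounts")
    denied = False
except PermissionError:
    denied = True
records = planner.locate("spark_user", "customers").fetch()
print(denied, records)
```

The contrast with the HDFS-based flow is that user code never receives the underlying file, only the records the service decides to return.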

Page 30: Securing Spark Applications

Takeaways

• Spark can be made secure today!
• Builds on top of security features in Hadoop
• Still work to be done
  • Stronger encryption
  • Easier to use SSL
  • Better integration with Sentry / RecordService

Page 31: Securing Spark Applications

References

• Encryption: SPARK-6017, SPARK-5682
• Delegation tokens: SPARK-5342
• Sentry: http://sentry.apache.org/
  • HDFS synchronization: SENTRY-432
• RecordService: http://cloudera.github.io/RecordServiceClient/

Page 32: Securing Spark Applications


Thanks!

Questions?