big data everywhere chicago: the big data imperative -- discovering & protecting sensitive data...


Upload: bigdataeverywhere

Post on 11-Jul-2015

TRANSCRIPT

Page 1: Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protecting Sensitive Data in Hadoop (Dataguise)

© 2014 Dataguise Inc. All rights reserved.

[email protected]

Discovering & Protecting Sensitive Data in Hadoop

Page 2:

Goals For Today

•  Big Data for banking, healthcare, tech, govt, education, etc. needs data security (but few have workable approaches in production today)
•  Hadoop security approaches (what works and doesn’t work from the past, challenges in the present)
•  Real-world case studies (data-centric protection)
»  Credit card security
»  Healthcare data lake (Data-as-a-Service)
»  Product analytics in the cloud

Page 3:

Market Overview

Page 4:

Data Growth

•  100% growth and 80% unstructured data by 2015 …finding and classifying sensitive data will get harder

[Chart: data growth over time, vertical axis in exabytes]

Page 5:

Real-world unstructured data scenarios

Voice-to-text files in Hadoop for customer service optimization

Patient and doctor medical data in emails, PDFs, doctor’s notes

Web comment fields and customer surveys, CRM data

Log data from wellheads and oil drilling sensors

Web e-Commerce Pay System

Page 6:

From 2012 to 2020, enterprise Big Data will grow 7500% in next 6-8 yrs

IT headcount for Big Data will grow 1.5x

The Importance of Automation
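
The gap between the two growth figures can be made concrete with a little arithmetic. This is a rough sketch only: it assumes "grow 7500%" means an increase of 75 times the baseline (a 76x multiplier), and that the 1.5x headcount figure covers the same period.

```python
# Figures taken from the slide; the interpretation of "7500%" is an assumption.
data_growth_pct = 7500        # growth in enterprise Big Data
headcount_multiplier = 1.5    # growth in IT headcount for Big Data

# A 7500% increase means the final volume is 1 + 75 = 76 times the baseline.
data_multiplier = 1 + data_growth_pct / 100

# Data each staff member must cover, relative to the starting point.
data_per_head = data_multiplier / headcount_multiplier

print(f"Data volume multiplier: {data_multiplier:.0f}x")
print(f"Data per IT staff member: {data_per_head:.1f}x the starting level")
```

Under these assumptions each person ends up responsible for roughly fifty times more data, which is the case for automating discovery and protection.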

Page 7:

Why Security in Big Data

(Table columns: Vertical, Refine, Explore, Enrich)

Retail & Web
•  Log Analysis Site Optimization
•  Social Network Analysis
•  Dynamic Pricing
•  Session & Content Optimization

Retail
•  Loyalty Program Optimization
•  Brand & Sentiment Analysis
•  Dynamic Pricing/Targeted Offer

Intelligence
•  Threat Identification
•  Person of Interest Discovery
•  Cross Jurisdiction Queries

Finance
•  Risk Modeling & Fraud Identification
•  Trade Performance Analytics
•  Surveillance & Fraud Detection
•  Customer Risk Analysis
•  Real-time upsell, cross sales marketing offers

Energy
•  Smart Grid: Production Optimization
•  Grid Failure Prevention
•  Smart Meters
•  Individual Power Grid

Manufacturing
•  Supply Chain Optimization
•  Customer Churn Analysis
•  Dynamic Delivery
•  Replacement Parts

Healthcare & Payer
•  Electronic Medical Records (EMPI)
•  Clinical Trials Analysis
•  Insurance Premium Determination

Page 8:

Why Security in Big Data

[Same Vertical / Refine / Explore / Enrich use-case table as the previous slide, overlaid with sensitive data type labels: Privacy data, PCI or Financial, Personal Health (PHI)]

Page 9:

Three Critical Considerations

1.  Ensuring Compliance
•  The Big Ps (PCI, HIPAA, Privacy), data residency, FERPA, FISMA, FERC, etc.
•  1200 laws in 63 countries
2.  Reducing Breach Risk
3.  Quantifying both
»  1.  How much sensitive data? (“un-announced”)
»  2.  Who is adding? (ad hoc user directories)
»  3.  Who is accessing? (sharing, selling, re-purposing)

Page 10:

The Evolution of Hadoop Projects

Lab Project
•  Hadoop as R&D
•  Strictly data science
•  Zero $$$ or selection of Distribution
•  Zero recognition of sensitive data or exposure

Proof Stage
•  Achieving value
•  Data lake cost savings
•  Line of business ownership
•  Nodal expansion
•  Security elements? (unknown to InfoSec)

ROI Validity
•  ROI and TCO validity
•  Distribution selection and purchase
•  The Security ‘A-Ha’ moment
•  Solved with legacy or penalty box Hadoop

On Demand Hadoop
•  Full scale production
•  Ad hoc new uses
•  Go Faster: Spark, Kafka
•  Security sanctified

Page 11:

On-Demand Hadoop.

•  Without adequate sensitive data protection, customers left to “Penalty Boxing” Hadoop

»  “Security zones” imposed by InfoSec

»  Slows business, costly and cumbersome

•  Data-centric protection can set those assets free

Page 12:

Data Protection In Hadoop

Page 13:

Security in Hadoop In Summary

•  Like Cloud, Mobile, Virtualization… Big Data drives fundamental new rules in security

»  Ad hoc computing, wide open data sets
»  Extended users and usages, sharing and selling
»  3 Vs moving to 6 Vs (automation, non-blocking)

•  Problem #1 is compliance

»  Reporting/auditing/monitoring as/more important than data security

•  Data-centric protection can help

Page 14:

Hadoop Security Framework

•  The 4 approaches to address security within Hadoop (Perimeter, Data, Access, Visibility)
•  Dataguise discovers & protects at the data layer and provides visibility for audit reporting and data lineage

Perimeter: Guarding access to the cluster itself
Technical Concepts: Authentication, Network isolation

Data: Protecting data in the cluster from unauthorized visibility
Technical Concepts: Encryption, Tokenization, Data masking

Access: Defining what users and applications can do with data
Technical Concepts: Permissions, Authorization

Visibility: Reporting on where data came from and how it’s being used
Technical Concepts: Auditing, Lineage

Page 15:

Kerberos on Hadoop

•  Kerberos (developed at MIT) has been the de-facto standard for strong authentication/authz
»  Protection against user and service spoofing attacks, and allows for enforcement of user HDFS access permissions
•  What does Kerberos do?
»  Establishes identity for clients, hosts, and services
»  Prevents impersonation; passwords are never sent over the wire
»  Tickets grant cryptographic “permissions” to resources
•  Kerberos is core of authentication in native Apache Hadoop from 2010
»  Used for access to ecosystem services (HDFS, JT, Oozie), for server-to-server traffic auth, etc. BUT complex to manage!
»  Lots of steps, for example: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Security-Guide/cdh4sg_topic_3.html

[Framework layer highlighted on this slide: Access. Defining what users and applications can do with data. Technical Concepts: Permissions, Authorization]

Page 16:

MapR Improvements on Auth/Authz

•  Vastly simpler
»  But no requirements for Kerberos in core
»  Identity represented using a ticket which is issued by MapR CLDB servers (Container Location DataBase)
»  Core services secured by default
•  Easier integration
»  User identity independent of host or operating system
»  Local to MapR (no external Kerberos required)
•  Faster
»  Leverage Intel accelerated hardware crypto

Page 17:

Elements of Data Centric Protection

•  1. Identify which elements you want to protect via:
»  Delimiters (structured data), name-value pairs (semi-structured) or data discovery service (unstructured)
•  2. Automated Protection Options:
»  Automatically apply protection via:
»  Format preserving encryption (FPE)
»  Masking (replace, randomize, intellimask, static)
»  Redaction (nullify)
•  3. Audit Strategy
»  Sensitive data protection/access/lineage
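
For the structured (delimited) case, the identify-then-protect flow might look like the sketch below. It is illustrative only and is not the Dataguise implementation: the `POLICY` mapping, the column names, and the helper functions are all hypothetical, and only static masking and redaction are shown.

```python
import csv
import io

# Hypothetical policy: column name -> protection action to apply.
POLICY = {"ssn": "redact", "card": "mask"}


def mask(value: str) -> str:
    """Static mask: replace every digit with '*', keep separators."""
    return "".join("*" if ch.isdigit() else ch for ch in value)


def redact(value: str) -> str:
    """Redaction: nullify the field entirely."""
    return ""


ACTIONS = {"mask": mask, "redact": redact}


def protect_delimited(text: str, delimiter: str = ",") -> str:
    """Apply POLICY to delimited text whose first row is a header."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        for column, action in POLICY.items():
            if column in row:
                row[column] = ACTIONS[action](row[column])
        writer.writerow(row)
    return out.getvalue()


# Made-up sample record for illustration.
sample = "name,ssn,card\nAlice,123-45-6789,4111 1111 1111 1111\n"
protected = protect_delimited(sample)
print(protected)
```

Keeping the policy as data rather than code means the same loop can serve as the hook for the audit step: every action applied can be logged alongside the column it touched.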

Page 18:

Discovery

•  Within HDFS
»  Search for sensitive data per company policy – PII, PCI,…
»  Handle complex data types such as addresses
»  Process incrementally (default) to handle only the new content
•  In-flight
»  Processing data on the fly as they are ingested into Hadoop HDFS
»  Plug-in solution for FTP, Flume, Sqoop
»  Search for sensitive data per policy – PII, PCI, HIPAA…
»  NEXT UP: Kafka
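
Incremental processing can be pictured as a watermark over file modification times. The sketch below is one possible way to do it, not a description of the product: `select_new_files` and the `(path, modification_time)` listing shape are hypothetical, and a real job would obtain the listing from HDFS.

```python
def select_new_files(listing, watermark):
    """Return paths modified after the watermark, plus the advanced watermark.

    listing: iterable of (path, modification_time) pairs (hypothetical shape).
    """
    new = sorted((mtime, path) for path, mtime in listing if mtime > watermark)
    new_watermark = new[-1][0] if new else watermark
    return [path for _, path in new], new_watermark


# Made-up listing: only files newer than the stored watermark get scanned.
listing = [("/in/a.csv", 100), ("/in/b.csv", 205), ("/in/c.csv", 310)]
to_scan, watermark = select_new_files(listing, watermark=200)
print(to_scan, watermark)

# A second pass over the same listing finds nothing new.
again, watermark_after = select_new_files(listing, watermark)
print(again, watermark_after)
```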

Page 19:

How Discovery Works

•  MapReduce or Flume/FTP/Sqoop Agent
»  Root directories and drill downs
»  Can scan entire dataset or incrementally (watermarking)
•  Runs pattern, logic, context, algorithm, and ontology filters
•  Can utilize white/black lists and reference sets
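
A minimal sketch of how a pattern filter, a logic filter, and a whitelist can be chained for credit card discovery. This is an assumption about the general approach, not the product's actual filters: the regular expression, the `WHITELIST` contents, and the function names are hypothetical, and the Luhn checksum stands in for the "logic" stage.

```python
import re

# Pattern filter: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

# Hypothetical whitelist: known test numbers that should not be reported.
WHITELIST = {"4111111111111111"}


def luhn_valid(digits: str) -> bool:
    """Logic filter: Luhn checksum used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str):
    """Return (offset, digits) for candidates that pass every filter."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if digits in WHITELIST:
            continue  # whitelisted value, ignore
        if luhn_valid(digits):
            hits.append((match.start(), digits))
    return hits


# Made-up text: one random number, one valid test number, one whitelisted number.
text = (
    "order 1234 5678 9012 3456 paid with 5500-0000-0000-0004, "
    "test card 4111 1111 1111 1111"
)
hits = find_card_numbers(text)
print(hits)
```

The pattern stage alone would flag all three numbers; the checksum drops the random one and the whitelist drops the known test value, which is how false positives are kept down.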

Page 20:

Protection Measures

•  Protection plan should start with cutting
»  What data can we delete/cut?
»  What data can be redacted?
»  Masking choices
      •  Consistency
      •  Realistic looking data
      •  Partial reveal (Intellimask): Credit Card # 4541 **** **** 3241
•  What data needs reversibility
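
Two of the masking choices above, partial reveal and consistency, can be sketched as follows. This is illustrative only: the function names and the `SECRET` key are hypothetical, the middle digits of the sample card number are made up, and keyed hashing is just one way to get consistent masking (it does not produce realistic-looking values).

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would manage keys outside the code.
SECRET = b"demo-only-secret"


def partial_reveal(card: str, keep_first: int = 4, keep_last: int = 4) -> str:
    """Keep the first and last digits, star the rest, preserve separators."""
    total_digits = sum(ch.isdigit() for ch in card)
    seen = 0
    out = []
    for ch in card:
        if ch.isdigit():
            seen += 1
            visible = seen <= keep_first or seen > total_digits - keep_last
            out.append(ch if visible else "*")
        else:
            out.append(ch)
    return "".join(out)


def consistent_mask(value: str, length: int = 12) -> str:
    """Deterministic pseudonym: the same input always yields the same token."""
    digest = hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


revealed = partial_reveal("4541 1234 5678 3241")
print(revealed)  # 4541 **** **** 3241

# Consistency lets masked values still act as join keys across data sets.
print(consistent_mask("alice@example.com") == consistent_mask("alice@example.com"))
```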

Page 21:

Encryption “vs” Masking

•  Encryption:
+  Reversible
+  Trusted with security proofs
+  The first hammer
+  De-centralized architectures
-  Complex
-  Key management
-  Useless without robust authentication and authorization
-  Data value destruction
-  Needs both encrypt-decrypt tooling

•  Masking:
+  Highest security
+  Realistic data
+  Range and value preserving
+  Once and done
+  Scale-out and distributed
+  No performance impact on usage
+  Zero need for authentication and authorization and key management
-  Not as well marketed
-  Not reversible

Page 22:

Encryption “vs” Masking

•  Encryption:
+  Reversible
+  Trusted with security proofs
+  Format-preserving and partial reveals
+  Scale-out and distributed
+  The first hammer
+  De-centralized architectures
-  Complex
-  Key management
-  Useless without robust authentication and authorization
-  Data value destruction

•  Masking:
+  Highest security
+  Realistic data
+  Range and value preserving
+  Format-preserving and partial reveals
+  Scale-out and distributed
+  No performance impact on usage
+  Zero need for authentication and authorization and key management
-  Not as well marketed
-  Not reversible
-  Perceived to grow data

The fundamental decision between masking and encryption comes down to reversibility:
•  Some elements in analytics must resolve to original (e.g. 66.249.22.145 or $34,332.12)
•  Some elements ideal for pseudonyms: Social Security Numbers, Credit Card Numbers, Names
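
The reversibility trade-off can be illustrated with a toy example. Everything here is an assumption for teaching purposes: the Feistel construction below is NOT a standardized FPE mode (such as NIST FF1), is not the vendor's algorithm, and must not be used to protect real data; `KEY` and all function names are hypothetical. It only shows that a keyed, format-preserving transform can be undone with the key, while a pseudonym cannot.

```python
import hashlib
import hmac

KEY = b"demo-only-key"  # hypothetical key for illustration


def _round_value(key: bytes, round_no: int, half: str, width: int) -> int:
    """Keyed round function: HMAC of the half, reduced to `width` digits."""
    digest = hmac.new(key, f"{round_no}:{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (10 ** width)


def toy_fpe_encrypt(digits: str, key: bytes = KEY, rounds: int = 8) -> str:
    """Toy Feistel network over an even-length digit string."""
    if not digits.isdigit() or len(digits) % 2:
        raise ValueError("expects an even-length string of digits")
    width = len(digits) // 2
    left, right = digits[:width], digits[width:]
    for r in range(rounds):
        f = _round_value(key, r, right, width)
        new_right = (int(left) + f) % (10 ** width)
        left, right = right, str(new_right).zfill(width)
    return left + right


def toy_fpe_decrypt(digits: str, key: bytes = KEY, rounds: int = 8) -> str:
    """Run the rounds backwards to recover the original digits."""
    width = len(digits) // 2
    left, right = digits[:width], digits[width:]
    for r in reversed(range(rounds)):
        prev_right = left
        f = _round_value(key, r, prev_right, width)
        prev_left = (int(right) - f) % (10 ** width)
        left, right = str(prev_left).zfill(width), prev_right
    return left + right


def pseudonym(value: str, key: bytes = KEY, length: int = 10) -> str:
    """One-way, consistent pseudonym: there is no decrypt counterpart."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()[:length]


original = "4541123456783241"  # made-up card number
encrypted = toy_fpe_encrypt(original)
print(len(encrypted) == len(original) and encrypted.isdigit())  # format kept
print(toy_fpe_decrypt(encrypted) == original)                   # reversible
print(pseudonym(original))                                      # not reversible
```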

Page 23:

Real-World Performance

•  Leveraging the power of MapReduce to run distributed encryption or masking
•  Data volume: 2.2 TB
•  Run Time: 23 min
•  Sensitive Data %: 8/50 Columns in 2.2 Bn rows
•  Run on 360 node MapR system
•  In old-world database technology, this type of job would have taken days/week(s)
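
The quoted figures translate into throughput as follows. This is back-of-the-envelope arithmetic only; it assumes decimal units (1 TB = 1000 GB) and that all 360 nodes shared the work evenly, neither of which the slide states.

```python
# Figures from the slide.
volume_tb = 2.2
runtime_min = 23
rows = 2.2e9
nodes = 360

seconds = runtime_min * 60                                # 1380 s
aggregate_gb_per_s = volume_tb * 1000 / seconds           # cluster-wide rate
per_node_mb_per_s = aggregate_gb_per_s * 1000 / nodes     # assumes even spread
rows_per_s = rows / seconds

print(f"Aggregate: {aggregate_gb_per_s:.2f} GB/s")
print(f"Per node:  {per_node_mb_per_s:.1f} MB/s")
print(f"Rows:      {rows_per_s / 1e6:.2f} million rows/s")
```

The per-node rate is modest; the speed comes from spreading the job across many nodes, which is the point the slide makes about MapReduce.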

Page 24:

Audit Strategy

•  Essential to all goals: Compliance, breach protection, visibility and metrics
•  Avoids the “gotcha” moment
»  Show all sensitive elements (count, location)
»  Remediation applied
»  Dashboard for fast access to critical policies and drill-downs for file and user action
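
A sketch of the kind of roll-up an audit report needs: counts per element type, where each type was found, and what is still unprotected. The record layout, paths, and element names below are made up for illustration and do not reflect any product's report format.

```python
from collections import Counter, defaultdict

# Hypothetical discovery findings: (file path, element type, remediation applied).
findings = [
    ("/data/crm/2014/a.csv", "SSN", "masked"),
    ("/data/crm/2014/a.csv", "CREDIT_CARD", "encrypted"),
    ("/data/logs/web.log", "CREDIT_CARD", None),
    ("/data/logs/web.log", "CREDIT_CARD", None),
]


def audit_summary(records):
    """Summarize sensitive elements by count, location, and open exposure."""
    counts = Counter()
    locations = defaultdict(set)
    unprotected = Counter()
    for path, element, remediation in records:
        counts[element] += 1
        locations[element].add(path)
        if remediation is None:
            unprotected[path] += 1
    return counts, locations, unprotected


counts, locations, unprotected = audit_summary(findings)
print(dict(counts))
print({element: sorted(paths) for element, paths in locations.items()})
print(dict(unprotected))
```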

Page 25:

How It works: Detection and Protection In-flight or @Rest

[Architecture diagram. Recoverable labels:]

Sources: RDBMS Xaction, Data warehouse, Site, WEB, FTP

Target: Production Cluster (Hadoop API: Discover/Mask/Encrypt)

Components:
•  DGDiscover-Masker
»  1. In DB (Oracle, SQL.. SharePoint, Files)
»  2. Protect applying masking/encryption policies
•  DgFlume Agent (Flume Agent with DgFlume Plug-in)
»  1. Detect sensitive data
»  2. Protect applying masking/encryption policies
•  DgScoop Agent (Sqoop)
»  1. Detect sensitive data
»  2. Protect applying masking/encryption policies
•  DgHDFS Agent (Hadoop API)
»  1. Detect sensitive data
»  2. Protect applying masking/encryption policies
•  DGHive, HDFS bulk decryption/Java app (Hadoop API)
»  1. Selective decryption based on user/role and policy

Flow:
1.  Data Discovery and protection while loaded into HDFS
2.  Data masked or encrypted in HDFS with Map/Reduce job
3.  Users can now access data

Page 26:

Case Studies

Page 27:

Protecting sensitive data in top credit card firm

Objectives
•  Consolidate existing payment risk analysis inside high-scale, lower cost Hadoop
•  Provide tiered access & authorization for multiple business apps (fraud, risk, cross-sell)

Solution
•  MapR Hadoop for single, reliable, high performance data analysis platform
•  Dataguise consistent masking enables analysis and unique index key values for de-identified data
•  Unique ability to output protected data in adjacent column or appended with delimiter inside existing column to protect data while governing access via authorization rules

Results & Benefits
•  Continuous real-time protection (job runs every 5 mins on ingest)
•  Analytics draws on the secure purchasing data of 90 million credit card holders across 127 countries

[Diagram labels: Omniture Files; Credit Card Transactions; Source Data, Protection, Analysis; Incremental updates to HDFS automatically protected; Selective access to sensitive data based on role and app]

Page 28:

Protecting personal health info (PHI) in aggregate data lake

Objectives
•  Reduce costly and preventable readmission, decrease mortality rates, and improve the quality of life for patients
•  Internal data service model DAaaS (Data Architecture as a Service)

Solution
•  Solution needs to protect structured and unstructured source data in database, data warehouse, and flat file structures
•  Customer required customization of encryption and key management to fit into their existing corp infrastructure and security policies
•  Dataguise dashboard gives admins easy way to identify directories/files containing sensitive data
•  Seamless access to encrypted data from a variety of data access methods {Hive, Pig, Analytic tools}

Results & Benefits
•  Delivered a cost-effective and easy way to determine where sensitive data resides within the cluster, and how it’s been protected

[Diagram labels: Health Records; SQL Data; DG FTP Agent; FTP; HDFS Authorization controlled through group membership in Active Directory]

Page 29:

Global Tech Product Analytics

Objectives
•  Aggregate logging data (product, usage, user configuration) for all smartphones worldwide
•  De-identify personal user info to ensure privacy and compliance with European/US Privacy

Solution
•  Customer routes all device logging data into 7 Global AWS clouds
•  Uses Dataguise Flume agent to protect all sensitive data being written to Amazon S3
•  Runs Dataguise in AWS, also utilizes Dataguise EMR security agents to selectively decrypt for authorized analytics in AWS

Results & Benefits
•  On-demand Hadoop for product analytics, user behavior, supply chain optimization
•  High scale-out and high performance paramount
•  100% cloud based security

[Diagram labels: Virtualized DG Secure; Protected Data Amazon S3; Smartphone Device Log Collectors; Apache Flume; DG Flume Agent; AWS Clouds in Korea, Singapore, US (3), UK, and Ireland]

Page 30:

Hadoop Data-Protection Checklist

•  Discover sensitive data
•  Automate protective measures
•  Integrate into Hadoop authorization
•  With continuous real-time tracking
•  Dashboards, Reports & Auditing
•  Automated Risk Assessment/Scoring
•  Automated inference protection (roadmap)

Page 31:

Thank You

Jeremy Stieglitz VP Products [email protected]