Enterprise Data Classification and Provenance

Upload: hadoop-summit

Post on 07-Jan-2017

TRANSCRIPT

Page 1: Enterprise Data Classification and Provenance

1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Enterprise Data Classification and Provenance
Apache Atlas

Shwetha Shivalingamurthy, Suma Shivaprasad

Page 2: Enterprise Data Classification and Provenance

Disclaimer

This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.

Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.

This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.

Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.

Page 3: Enterprise Data Classification and Provenance

Agenda

• Demo
• Big Data Governance
• Overview of Atlas
• Atlas architecture
• Features and Roadmap

Page 4: Enterprise Data Classification and Provenance

Demo use case – Ad network

• Matches advertiser demand with ad space supply from publishers
• Billing based on ad impressions/ad engagement
• Enables targeting, tracking and reporting of ad impressions
• Typical reports/queries:
  – Mismatch of demand and supply
  – Country-wise and OS-wise reports
  – Top advertisers/publishers

Page 5: Enterprise Data Classification and Provenance

Data landscape

[Diagram labels: Ad servers; User, AdImpression, Click and Billing logs; Metadata; Summaries; Traditional warehouse]

Page 6: Enterprise Data Classification and Provenance

Data governance requirements

• Cross platform lineage – impact analysis, forensic, discovery
• Asset search
• Common business terms
• Compliance

Page 7: Enterprise Data Classification and Provenance

Demo

• Technical and business metadata
• Cross component lineage
• Creating views
• Create tags
• Entity deletes
• Search using tags, attributes
• Entity audit
• Business catalog – find assets
• Flexible model, external lineage ingest

HDP 2.5

Page 8: Enterprise Data Classification and Provenance

Data Governance Aspects

• Data Discovery and Tagging
• Metadata Management
• Data Lineage/Provenance
• Access Management
• Data Security & Privacy
• Data Quality
• Compliance and Audit
• Data Wrangling
• Data Lifecycle Management
• Data Integration

Data governance refers to processes, methods and tools used in an enterprise for effective control of availability, usability, integrity, and security of data

Page 9: Enterprise Data Classification and Provenance

Enterprise Data Governance: Apache Atlas

Data Management along the entire data lifecycle with integrated provenance and lineage capability
• Cross component lineage

Modeling with Metadata enables comprehensive business metadata vocabulary with enhanced tagging and attribute capabilities
• Common Business Language
• Hierarchically organized – no dupes!

Interoperable Solutions across the Hadoop ecosystem, through a common metadata store
• Combine and Exchange Metadata

[Diagram labels: Atlas Metadata; Hive, Sqoop, Falcon, Kafka, Storm, Ranger, Custom, Partners; Structured, Unstructured, Streaming, Traditional RDBMS, MPP Appliances, Metadata]

Page 10: Enterprise Data Classification and Provenance

Background: DGI Community becomes Apache Atlas

Timeline:
• Dec 2014 – DGI group kickoff
• May 2015 – Apache Atlas incubation
• Jul 2015 – HDP 2.3/Apache 0.5 Foundation Release
• Aug 2016 – HDP 2.5/Apache 0.7 Release

[Diagram label: Global Financial Company]

* DGI: Data Governance Initiative

Key Benefits:

• Co-Dev = Built for real customer use cases
• Faster & Safer = Customers know business + HWX knows Hadoop
• Code contributors – Hortonworks, IBM, Aetna, Merck, Target

Page 11: Enterprise Data Classification and Provenance

Architecture

Page 12: Enterprise Data Classification and Provenance

Architecture

Page 13: Enterprise Data Classification and Provenance

Atlas Type System

• Defines model – schema of metadata
• Flexible and powerful to define any model/custom types
• Supports inheritance
• Types
  – Primitive types – bool, integer types, string, date, enum
  – Collections – array, map
  – Struct – set of attributes
  – Class – identifiable struct, hierarchy
  – Trait – set of attributes, hierarchy
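The idea of class types with typed attributes and inheritance can be illustrated with a small in-memory sketch. This is not the Atlas API: `Attribute`, `ClassType` and `TypeRegistry` are hypothetical names invented for this example, and only the `DataSet` and `hive_table` type names come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    type_name: str
    required: bool = True


@dataclass
class ClassType:
    name: str
    attributes: list
    super_types: list = field(default_factory=list)


class TypeRegistry:
    """Hypothetical registry; stands in for a metadata type system."""

    def __init__(self):
        self.types = {}

    def register(self, class_type):
        self.types[class_type.name] = class_type

    def all_attributes(self, type_name):
        """Collect attributes from the type and, recursively, its supertypes."""
        class_type = self.types[type_name]
        attrs = {}
        for parent in class_type.super_types:
            attrs.update(self.all_attributes(parent))
        attrs.update({a.name: a for a in class_type.attributes})
        return attrs

    def validate(self, type_name, values):
        """Return the sorted names of required attributes missing from values."""
        return sorted(
            name
            for name, attr in self.all_attributes(type_name).items()
            if attr.required and name not in values
        )


registry = TypeRegistry()
registry.register(ClassType("DataSet", [Attribute("name", "string")]))
registry.register(
    ClassType("hive_table", [Attribute("createTime", "date")], super_types=["DataSet"])
)

# hive_table inherits the required "name" attribute from DataSet.
missing = registry.validate("hive_table", {"createTime": "2015-01-01"})
```

Resolving attributes through the supertype chain is what lets a subtype be validated against everything its parents require.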

Page 14: Enterprise Data Classification and Provenance

Hive Model

DataSet
  metaType: ClassType
  name: string, required

hive_db
  metaType: ClassType
  name: string, required
  createTime: date, required
  parameters: map<string,string>, optional

hive_table (extends DataSet)
  metaType: ClassType
  db: hive_db, required (references hive_db)
  createTime: date, required
  columns: array<hive_column>, required (references hive_column, 0..n)

hive_column
  metaType: ClassType
  name: string, required
  type: string, required

Page 15: Enterprise Data Classification and Provenance

Entities

Instances of types

• hive_db – name: rawlogs, Guid: 1, createTime: 2015-01-01 10:00
• hive_table – name: impressions, Guid: 2
• hive_column – name: adv_id, type: string, Guid: 3
• hive_column – name: user_id, type: string, Guid: 4

[Diagram: edges labeled db, column and column connect the entities; the traits shown are EXPIRES_ON (Time: March, 2016) and PII]
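A minimal sketch of entities as typed instances carrying traits, using the values from the slide. The `Entity` class and the `entities_with_trait` helper are hypothetical. Which entity carries which trait cannot be recovered from the transcript, so the attachments below are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    guid: str
    type_name: str
    attributes: dict
    traits: dict = field(default_factory=dict)  # trait name -> trait attributes


db = Entity("1", "hive_db", {"name": "rawlogs", "createTime": "2015-01-01 10:00"})
adv_id = Entity("3", "hive_column", {"name": "adv_id", "type": "string"})
user_id = Entity("4", "hive_column", {"name": "user_id", "type": "string"})
table = Entity(
    "2",
    "hive_table",
    # References to other entities are stored as GUIDs.
    {"name": "impressions", "db": db.guid, "columns": [adv_id.guid, user_id.guid]},
)

# Assumption: the slide does not say which entity each trait is attached to.
table.traits["EXPIRES_ON"] = {"Time": "March, 2016"}
user_id.traits["PII"] = {}


def entities_with_trait(entities, trait):
    """Return the names of entities that carry the given trait."""
    return [e.attributes["name"] for e in entities if trait in e.traits]


pii_assets = entities_with_trait([db, table, adv_id, user_id], "PII")
```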

Page 16: Enterprise Data Classification and Provenance

Graph Engine

• Graph Database
  – Titan with storage backed by HBase
• Types and Entities are translated to the Graph Model
  – Classes, Structs and Traits map to a vertex
  – Relationships are mapped as edges
  – Rich relationships between metadata objects
• Indexing and Search
  – Indexing based on type annotations
  – External indexing – Titan backed by Solr
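The translation of entities into vertices and edges can be sketched with plain Python structures. This is an illustration only, not how Atlas stores data in Titan; `to_graph` and the dictionary layout are hypothetical.

```python
def to_graph(entities, reference_attrs):
    """Map each entity to a vertex and each reference attribute to edges.

    Returns (vertices, edges) where vertices maps guid -> properties and
    edges is a list of (source_guid, label, target_guid) tuples.
    """
    vertices = {}
    edges = []
    for entity in entities:
        # Non-reference attributes become vertex properties.
        vertices[entity["guid"]] = {
            k: v for k, v in entity.items() if k not in reference_attrs
        }
        for attr in reference_attrs:
            targets = entity.get(attr, [])
            if not isinstance(targets, list):
                targets = [targets]
            for target in targets:
                edges.append((entity["guid"], attr, target))
    return vertices, edges


entities = [
    {"guid": "1", "type": "hive_db", "name": "rawlogs"},
    {"guid": "2", "type": "hive_table", "name": "impressions",
     "db": "1", "columns": ["3", "4"]},
    {"guid": "3", "type": "hive_column", "name": "adv_id"},
    {"guid": "4", "type": "hive_column", "name": "user_id"},
]

vertices, edges = to_graph(entities, reference_attrs=("db", "columns"))
```

Keeping references as edges rather than properties is what makes lineage and relationship queries a graph traversal instead of a join.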

Page 17: Enterprise Data Classification and Provenance

Titan property graph model

Graph Search with Gremlin

saturn = g.V.has('name','saturn').next()
hercules = saturn.as('x').in('father').loop('x'){it.loops < 3}.next()
hercules.outE('battled').has('time', T.gt, 1).inV.name

Result: cerberus, hydra
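For readers unfamiliar with Gremlin, the same traversal can be sketched over an in-memory edge list in Python. The edge data below is recalled from Titan's "Graph of the Gods" sample and should be treated as an assumption; the helper names are hypothetical.

```python
# (source, label, target, edge properties); assumed sample data.
EDGES = [
    ("jupiter", "father", "saturn", {}),
    ("hercules", "father", "jupiter", {}),
    ("hercules", "battled", "nemean", {"time": 1}),
    ("hercules", "battled", "hydra", {"time": 2}),
    ("hercules", "battled", "cerberus", {"time": 12}),
]


def in_step(vertices, label):
    """Follow incoming edges with the given label, like Gremlin's in(label)."""
    return [
        src
        for vertex in vertices
        for src, lbl, dst, _ in EDGES
        if dst == vertex and lbl == label
    ]


def grandchildren(vertex):
    """Apply the incoming 'father' step twice, as the looped traversal does."""
    return in_step(in_step([vertex], "father"), "father")


def battled_after(vertex, min_time):
    """Targets of 'battled' edges whose 'time' property exceeds min_time."""
    return sorted(
        dst
        for src, lbl, dst, props in EDGES
        if src == vertex and lbl == "battled" and props["time"] > min_time
    )


hercules = grandchildren("saturn")[0]
opponents = battled_after(hercules, 1)
```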

Page 18: Enterprise Data Classification and Provenance

Search

Find relevant assets based on their attributes and associations with business terms

DSL with SQL-like syntax based on the type system

from $type is $trait where $clause select|has $attributes, repeat

Examples

Select columns from a hive_table where its name is "impressions" and db name is "raw":
hive_column where table.name="impressions", table.db.name="raw"

Select all columns from Hive tables which are tagged as "PII":
hive_column is 'PII'

Full text search
'(rawlogs) AND hive'
'(rawlogs OR supply*) AND hive'
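The two DSL examples amount to filtering by type, by trait, and by attributes reached through references. A hedged sketch of that filtering over in-memory records follows; it does not parse the DSL or call Atlas, and the `search` function and record layout are hypothetical.

```python
def search(entities, type_name, trait=None, where=None):
    """Return names of entities matching a type, an optional trait,
    and an optional predicate over the entity."""
    results = []
    for entity in entities:
        if entity["type"] != type_name:
            continue
        if trait is not None and trait not in entity.get("traits", ()):
            continue
        if where is not None and not where(entity):
            continue
        results.append(entity["name"])
    return results


impressions = {"type": "hive_table", "name": "impressions", "db": {"name": "raw"}}
entities = [
    impressions,
    {"type": "hive_column", "name": "adv_id", "table": impressions},
    {"type": "hive_column", "name": "user_id", "table": impressions,
     "traits": {"PII"}},
]

# hive_column where table.name="impressions", table.db.name="raw"
columns = search(
    entities,
    "hive_column",
    where=lambda e: e["table"]["name"] == "impressions"
    and e["table"]["db"]["name"] == "raw",
)

# hive_column is 'PII'
pii_columns = search(entities, "hive_column", trait="PII")
```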

Page 19: Enterprise Data Classification and Provenance

Features and Roadmap

Page 20: Enterprise Data Classification and Provenance

Apache Atlas Component Integration & Lineage

• Cross-component dataset lineage. Centralized location for all metadata inside HDP
• Single interface point for metadata exchange with platforms outside of HDP

[Diagram: Apache Atlas integrations – Hive, Ranger, Falcon, Sqoop, Storm, Kafka, Spark, NiFi, HBase, Partner, Custom. Legend: HDP 2.3; HDP 2.5; Beyond HDP 2.5; HDP 2.5 External]

Page 21: Enterprise Data Classification and Provenance

Business Catalog for Ease of Use

Key Benefits:

Organize data assets along business terms
– Authoritative: Hierarchical taxonomy creation
– Agile modeling: Model conceptual, logical, physical assets
– Definition and assignment of tags like PII (Personally Identifiable Information)

Comprehensive features for compliance
– Multiple user profiles including Data Steward and Business Analysts
– Object auditing to track "who did it"
– Metadata versioning to track "what did they do"

Faster Insight (Roadmap)
– Data Quality tab for profiling and sampling
– User comments

Page 22: Enterprise Data Classification and Provenance

Apache Ranger: Introduction

Centralized authorization and auditing across Hadoop components
• HDFS, Hive, HBase, Knox, Storm, YARN, Kafka, Solr, ..
• Audit logs to: Solr, HDFS, RDBMS, Log4j, ..

Resource based security
• Policies for specific set of resources
• Requires revision of policies as resources get added/moved

Classification based security
• Policies for classifications and not for specific resources
• A single policy protects resources in multiple components
• As classifications for resources change, appropriate policies are automatically applied
• Enables separation of duties: resource classification and security policies
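The difference between the two policy styles can be sketched in a few lines. This is a simplified illustration, not Ranger's policy engine (which also handles deny rules, conditions and auditing); all resource names, groups and function names are hypothetical.

```python
# Resource based: one policy entry per (component, resource).
resource_policies = {("hive", "raw.impressions"): {"us_hr"}}

# Classification based: one policy entry per tag.
tag_policies = {"PII": {"us_hr"}}

# Tags attached to resources by data stewards.
tags = {
    ("hive", "raw.impressions"): {"PII"},
    ("hdfs", "/data/impressions"): {"PII"},
}


def allowed_by_resource(group, component, resource):
    return group in resource_policies.get((component, resource), set())


def allowed_by_tag(group, component, resource):
    return any(
        group in tag_policies.get(tag, set())
        for tag in tags.get((component, resource), ())
    )


# The HDFS path has no resource policy, but the single PII tag policy covers it.
hdfs_by_resource = allowed_by_resource("us_hr", "hdfs", "/data/impressions")
hdfs_by_tag = allowed_by_tag("us_hr", "hdfs", "/data/impressions")

# Tagging a new resource brings it under the existing policy with no policy edit.
tags[("kafka", "impressions_topic")] = {"PII"}
kafka_by_tag = allowed_by_tag("us_hr", "kafka", "impressions_topic")
```

The tag policy is written once; only the tag assignments change as resources are added or moved.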

Page 23: Enterprise Data Classification and Provenance

Scalable Access Control – Reusable Tag Policy

[Diagram comparing two ways to grant a user group (AD, Linux) access]
• Not scalable: policies on individual resources (Files, Tables, Topologies)
• Scalable: a policy on an Atlas tag (PII) that covers ANY asset tagged PII (Files, Tables, Topologies)
• Annotations: Single admin group assigns; Many stewards tag + single point of enforcement and audit; All future tagging is covered by existing policy

Page 24: Enterprise Data Classification and Provenance

Open: Governance Ready Certification Program

Choice: Customers choose features that they want to deploy, a la carte versus vendor lock-in

Curated & Fast: Selected group of vendor partners to provide rich, complementary and complete features ready to deploy

Agile: Low switching costs, faster deployment and innovation

Centralized: Common SLA & common open metadata store

Flexibility: Interoperability of products through Atlas metadata

Safe: HDP at core to provide stability and interoperability

Completed:
• Waterline
• Dataguise
• Attivio
• Trifacta

Pending:
• Collibra
• Alation
• Meta Integration (Miti)
• Paxata
• Syncsort
• Talend

Page 25: Enterprise Data Classification and Provenance

Roadmap…

• Multi-tenancy
• Titan 1.x migration
• Hive column level lineage

Page 26: Enterprise Data Classification and Provenance

Summary

• Designed for Hadoop at platform, not application level
• High confidence data in Hadoop for regulated verticals
• Compliance and business objectives aligned to data organization
• Faster discovery for analysts – reduce time to value
• Agile and adaptable – ensures information is current by native connectors
• Dynamic protection with Ranger in simple audited policies

Page 28: Enterprise Data Classification and Provenance

Questions

Page 29: Enterprise Data Classification and Provenance

Backup

Page 30: Enterprise Data Classification and Provenance

Dynamic Access Policy
Apache Ranger + Atlas Integration

Page 31: Enterprise Data Classification and Provenance

How does Atlas work with Ranger at scale?

Atlas provides: Metadata
• Business Classification (taxonomy): Company > HR > Driver
• Hierarchy with inheritance of attributes to child objects: the sensitive "PII" tag of department HR will be inherited by group HR > Driver
• Atlas will notify Ranger of changes via a Kafka topic

Atlas provides the metadata tag to create policies

Ranger provides: Access & Entitlements
• Ranger will cache tags and asset mapping for performance
• Ranger will have a policy based on tags instead of roles
• Example: PII = <group>. This can work for many assets.

[Diagram labels: Apache Atlas, Hive, Ranger, Falcon, Kafka, Storm]
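Tag inheritance down a business taxonomy can be sketched as a walk from a term up to the root, collecting tags along the way. The term names follow the slide's Company > HR > Driver example; the `parent` and `direct_tags` tables and the `effective_tags` function are hypothetical and do not reflect Atlas internals.

```python
# Each taxonomy term points to its parent; the root has no parent.
parent = {
    "Company": None,
    "Company > HR": "Company",
    "Company > HR > Driver": "Company > HR",
}

# Tags assigned directly to a term.
direct_tags = {"Company > HR": {"PII"}}


def effective_tags(term):
    """Union of the tags on the term and on all of its ancestors."""
    tags = set()
    while term is not None:
        tags |= direct_tags.get(term, set())
        term = parent[term]
    return tags


driver_tags = effective_tags("Company > HR > Driver")
company_tags = effective_tags("Company")
```

A tag policy written for PII therefore applies to the Driver group without anyone tagging it directly.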

Page 32: Enterprise Data Classification and Provenance

Automatic update of policies – active protection

[Diagram: notification flow from Atlas to Ranger]
• Atlas: Metastore (Tags, Assets, Entities)
• Notification Framework: Kafka Topics carrying notifications of metadata updates
• Ranger: Atlas Client (subscribes to topic, gets metadata updates), PDP, Resource Cache
• Annotations: Message durability; Optimized for speed; Event driven updates
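The event driven update of a tag cache can be sketched with a standard-library queue standing in for the Kafka topic. This is an assumption-laden illustration: the event shape, the `TAG_ADDED` and `TAG_REMOVED` operation names, and the `TagCache` class are all hypothetical, not the Atlas or Ranger message format.

```python
import queue


class TagCache:
    """Hypothetical stand-in for a cache of asset-to-tag mappings."""

    def __init__(self):
        self.tags_by_asset = {}

    def apply(self, event):
        tags = self.tags_by_asset.setdefault(event["asset"], set())
        if event["op"] == "TAG_ADDED":
            tags.add(event["tag"])
        elif event["op"] == "TAG_REMOVED":
            tags.discard(event["tag"])


def drain(topic, cache):
    """Apply every pending event to the cache; return how many were applied."""
    applied = 0
    while True:
        try:
            event = topic.get_nowait()
        except queue.Empty:
            return applied
        cache.apply(event)
        applied += 1


topic = queue.Queue()  # stands in for a Kafka topic
topic.put({"op": "TAG_ADDED", "asset": "hive:raw.impressions", "tag": "PII"})
topic.put({"op": "TAG_ADDED", "asset": "hive:raw.tax_2010", "tag": "EXPIRES_ON"})
topic.put({"op": "TAG_REMOVED", "asset": "hive:raw.impressions", "tag": "PII"})

cache = TagCache()
applied = drain(topic, cache)
```

Policy decisions then read from the local cache rather than querying the metadata store on every access check.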

Page 33: Enterprise Data Classification and Provenance

Apache Ranger: Authorization and Auditing

[Diagram labels: Enterprise Users; Ranger Administration Portal; Ranger Policy Store; Ranger Audit Store (Solr, HDFS, RDBMS, Log4j); Hadoop Components, each with a Ranger Plugin: HDFS, Hive Server2, HBase, Knox, Storm, YARN, Kafka, Solr]

Page 34: Enterprise Data Classification and Provenance

Big Data Governance

Current Landscape
• Opaque data in a variety of data stores – HDFS, S3, data warehouses
• Schema is hardly sufficient – Hive Metastore, Avro, Data Warehouse
• Platform tools like Ranger and Falcon solve parts of the problem

Need for Data Governance
Organizations need data governance to understand their information and to answer questions such as:
• What do we know about our information?
• Where did this data come from and how is it being used?
• Does this data adhere to company policies and rules?
• Need for effective control and consumption of data

Atlas helps customers discover information about data objects, their meaning, location, characteristics, and usage.

Page 35: Enterprise Data Classification and Provenance

Business Taxonomy

Business Taxonomy (Catalog)
The practice and science of classification of things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication.

Tags: Traits vs. Labels vs. Business Taxonomy
Atlas has tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy. A tag PII can be used in HR as well as Finance or Sales.

Benefits:
• A view of data assets organized by business language
• Compliance, acceptable use – dynamic metadata based access control
• Common taxonomy through Hadoop components

Page 36: Enterprise Data Classification and Provenance

Principal Roles & Activities in an Enterprise

• Data Steward – Curator, responsible for data classification – associate business taxonomy and tagging, access policies
• Data Scientist – Analyst, primary consumer of Business Taxonomy
• Administrator/Operations – Role management, data lifecycle management (archival, retention)
• Data Engineer – Data ingress and egress, semantic data quality

[Diagram annotations: 50% - 80%+ of time spent looking for data; Profit Center; Primary User of Atlas; Enables Scientist]

Goal: < 25% of time spent on finding data = empowering scientists to spend their time uncovering insights faster

Page 37: Enterprise Data Classification and Provenance

Data Governance Use Cases: Impact analysis

HortonAdNetwork – a large ad network with an international footprint, with multiple publishers and advertisers across several countries

• Complex ETL jobs and data pipelines processing real-time ad network data from several different sources and various data processing platforms
• No easy way to determine the root cause when something is off the charts
• Data analysts need effective data provenance tools for impact/root cause analysis
• Cross component lineage is a must

Data Lineage (Provenance)

Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.
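Impact analysis and root cause analysis are both reachability questions on the lineage graph: downstream for impact, upstream for root cause. A minimal sketch using breadth-first search follows; the dataset names and the `reachable` function are hypothetical examples, not taken from the demo.

```python
from collections import deque

# (source, target) data flows; hypothetical pipeline for an ad network.
FLOWS = [
    ("kafka:ad_events", "hive:raw.impressions"),
    ("hive:raw.impressions", "hive:summary.daily_impressions"),
    ("rdbms:advertisers", "hive:summary.daily_impressions"),
    ("hive:summary.daily_impressions", "report:top_advertisers"),
]


def reachable(start, forward=True):
    """Breadth-first search over lineage edges.

    forward=True returns everything downstream of start (impact analysis);
    forward=False returns everything upstream (root cause analysis).
    """
    adjacency = {}
    for src, dst in FLOWS:
        a, b = (src, dst) if forward else (dst, src)
        adjacency.setdefault(a, []).append(b)

    seen = set()
    pending = deque([start])
    while pending:
        node = pending.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                pending.append(nxt)
    return sorted(seen)


impacted = reachable("hive:raw.impressions")
suspects = reachable("report:top_advertisers", forward=False)
```

Because the edges span Kafka, Hive, an RDBMS and a report, the same search only works if lineage is captured across components, which is the point of the slide.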

Page 38: Enterprise Data Classification and Provenance

Data Governance Use Cases – Compliance

HortoniaBank – mid-size bank expanding from US to international markets

2 customer tables owned by BH: 50K customer records each with 38 fields (PII, PHI, PCI & non-sensitive data)
– us_customers: USA person data only
– ww_customers: multi-language, multi-country, localized person data

1 data set of prospects leased from a data broker
– tax_2010: Data lease expired already!

Page 39: Enterprise Data Classification and Provenance

User | Group | Access Privileges
joe_analyst | us_employee | US data only; non-sensitive data only, rest forbidden depending on sensitivity
kate_hr | us_hr | US data only; all sensitive data (PCI, PII, PHI)

Tag Based Policies
• US HR team members can see all original data (PCI, PII, ...)
• Analysts are prohibited from viewing PII data in any of the tables
• Everyone except operations/admin is prohibited from accessing tax_2010 after the specified date; the Expires_on policy turns off access on the configured expiry date
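The tag based policies above can be approximated with a small decision function. This is a simplified sketch under stated assumptions, not Ranger's evaluation logic: the group names `us_hr`, `us_employee` and `admin`, the expiry date, and the `can_access` function are hypothetical, and only the PII and expiry rules are modeled.

```python
from datetime import date


def can_access(user_groups, asset_tags, today):
    """Evaluate two tag based rules against an asset's tags.

    asset_tags maps a tag name to its attribute (an expiry date for
    EXPIRES_ON, None for tags without attributes).
    """
    expires = asset_tags.get("EXPIRES_ON")
    # Expiry rule: access is turned off on the expiry date, except for admin.
    if expires is not None and today >= expires and "admin" not in user_groups:
        return False
    # PII rule (assumption): only the US HR group may view PII data.
    if "PII" in asset_tags and "us_hr" not in user_groups:
        return False
    return True


pii_column = {"PII": None}
tax_2010 = {"EXPIRES_ON": date(2016, 3, 31)}  # hypothetical expiry date

joe = {"us_employee"}
kate = {"us_hr"}
ops = {"admin"}

joe_sees_pii = can_access(joe, pii_column, date(2016, 6, 1))
kate_sees_pii = can_access(kate, pii_column, date(2016, 6, 1))
kate_sees_expired = can_access(kate, tax_2010, date(2016, 6, 1))
ops_sees_expired = can_access(ops, tax_2010, date(2016, 6, 1))
```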

Page 40: Enterprise Data Classification and Provenance

Expanded Native Connector: Dataset Lineage

[Diagram labels: Sqoop, Teradata Connector, Apache Kafka, RDBMS, Custom Activity Reporter, Metadata Repository]

• Any process using Sqoop is covered
• No other tool tracks IOT of the box