
© 2017 IBM Corporation

How Information Governance is getting Analytics on Big Data's Best Friend

Albert Maier



How many astronauts are there in Argentina?


Traditional Governance: Ensures proper Management and Use of Information

Information Governance

[Diagram: Information Governance areas and topics]

Governance areas: Compliance, Standards, Protection, Lifecycle, Quality

Topics: Policy Administration, Policy Enforcement, Policy Monitoring, Policy Implementation, Information Values Quality, Information Dependencies, Information Requirements, Information Supply Chain Integrity, Information Identification, Information Retention, Information Usage, Information Privacy, Information Architecture, Information Disposal

Questions asked:

• Are People/Systems operating properly?

• Is data quality sufficient for use?

• Is data kept for appropriate length of time?

• Is data properly protected from loss or inappropriate use?

• Are systems built to appropriate standards?


A growing demand …



Business Teams want

• Self-service access to more information (“big data”)

• More powerful analysis and visualization tools

IT Teams are

• Concerned about cost.

• Concerned about governance and regulatory requirements.

Governance mitigates these concerns; it enables the self-service world

This is related to the “Rise of The CDO”

Chief Data Officers need

• To enable access to enterprise wide information assets

• Collaboration & Sharing of assets

• Enhanced compliance with regulations

• …


“Governance 2.0”: Drives the Self-Service World

Information is Accurate

Information is Secure

Information is Understood

Information is Current

Information is Holistic

Information is Findable

Creates confidence to both consume and share information

Governed data lakes are a good example scenario for discussing how governance helps to achieve these goals


Data Lake (IBM’s view)

Data Lake = Efficient Management, Governance, Protection and Access of Big Data

Data Lake

Information Management and Governance Fabric

Data Lake Services

Data Lake Repositories


All services integrated with governance


Governed Data Lake: Users and Subsystems

Data Lake (System of Insight)

Information Management and Governance Fabric

[Diagram: users and subsystems of the governed data lake]

Subsystems: Catalogue, Self-Service Access, Enterprise IT Data Exchange, Data Lake Repositories

Users: Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams; Data Lake Operations

Enterprise IT: Other Data Lakes, Systems of Engagement, Systems of Automation, Systems of Record, New Sources


Governance is important here


Governance & Data Lake Summary

No direct access to repositories

Business-led information governance

Catalog of data, ownership, meaning and permitted usage

Moderated, view-based self-service access to data and analytics for line of business.

Governed access to raw data to develop new production analytics. Shop for data.

Effective and governed interchange of data and insight with other systems.

Data-centric Security

Multiple repositories organized based on source and usage; hosted on appropriate data platforms for workload.

Curation of all data to define meaning and classifications



Selected challenges demanding innovation on the technology side

▪ A central metadata catalog is not realistic, independent of technology choice

• How to design and implement the „virtual“ metadata catalog of the future?

• How to keep governance services such as data lineage efficient in a distributed world?

=> How to design and implement efficient Open Metadata capabilities?

▪ Standard search for information and governance assets fails to deliver results that are good enough for business users

• How to build efficient contextual search capabilities?

• How to keep this extensible for all the asset types relevant for a specific enterprise?

=> How to design and implement open and efficient contextual search solutions?

▪ Business-level classification of information assets is still a costly manual process

• Current discovery technologies fail to propose good enough classification candidates

• Exploitation of machine learning for this domain is in its infancy

=> How to design and implement an efficient automated classification of information?


Open Metadata - What problem is it solving?

All industrial products for metadata and governance are built on top of a central metadata repository. This turns out not to be future-proof for various reasons, including

• Cloud platforms, open data and the API economy mean an organization no longer owns and manages all of the data it uses. Maintaining a single inventory of metadata is untenable since IT is no longer in control of all of the data.

• The data landscape is evolving too rapidly to maintain a metadata repository that is fed with snapshots of metadata from data platforms. The metadata repository can quickly get out of date and become untrusted. Metadata for big data needs to be local to this data.

• Metadata is typically locked in specific tools and platforms in proprietary formats. Supporting the ever-increasing variety of data platforms, data types and functions in a proprietary model is expensive and needs significant development bandwidth.


Open Metadata Management

▪ Peer-to-peer network of repositories

▪ Metadata stored and managed close to its source

▪ Open, extensible metadata structures for metadata exchange and federation – extending coverage of the types of resources that need to be described.

▪ Open source infrastructure sharing cost of development and maintenance between vendors

▪ Support for open standards where available

[Diagram: repositories in the peer-to-peer network] Collaboration Space Metadata, Analytics Platform Metadata, Application Metadata, Cloud SaaS Platform Metadata, Hadoop Platform Metadata
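The peer-to-peer idea above can be illustrated with a small sketch in which each repository keeps metadata for its own assets and forwards lookups it cannot answer to its peers. This is only an illustration under stated assumptions: the class `MetadataRepository`, its methods and the sample assets are hypothetical and are not part of Apache Atlas or any IBM product.

```python
class MetadataRepository:
    """One peer in the network; it holds metadata for the assets close to it.

    Hypothetical illustration only, not an Apache Atlas API.
    """

    def __init__(self, name):
        self.name = name
        self.assets = {}   # asset name -> metadata dict
        self.peers = []    # other MetadataRepository instances

    def register(self, asset_name, metadata):
        # Metadata is stored locally, next to the data it describes.
        self.assets[asset_name] = metadata

    def connect(self, other):
        # Peer links are symmetric in this sketch.
        if other not in self.peers:
            self.peers.append(other)
            other.peers.append(self)

    def find(self, asset_name, _visited=None):
        """Answer locally if possible, otherwise ask peers (federated lookup)."""
        visited = _visited if _visited is not None else set()
        visited.add(self.name)
        if asset_name in self.assets:
            return {"repository": self.name, **self.assets[asset_name]}
        for peer in self.peers:
            if peer.name not in visited:
                hit = peer.find(asset_name, visited)
                if hit is not None:
                    return hit
        return None


# Three peers, each describing its own platform (sample data is made up).
hadoop = MetadataRepository("hadoop")
saas = MetadataRepository("cloud-saas")
analytics = MetadataRepository("analytics")

hadoop.register("clickstream", {"owner": "web-team", "format": "parquet"})
saas.register("crm_contacts", {"owner": "sales-ops", "format": "rest-api"})

analytics.connect(hadoop)
hadoop.connect(saas)

# The analytics peer holds no metadata itself but can still resolve the asset.
print(analytics.find("crm_contacts"))
```

No peer needs a complete copy of the catalog; the trade-off is that every lookup may have to traverse the network, which is why efficient federation is listed as an open challenge.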


IBM is proposing and working towards an Open Metadata Ecosystem on top of Apache Atlas

• Significant enhancements of Apache Atlas are necessary to broaden its scope and to mature it

• IBM started to strongly engage in this community and to contribute code (e.g. a graph abstraction and capabilities to address HA)

• IBM has been working on additional componentry (Open Metadata Access Services, Open Connector Framework, Governance Action Framework, Open Discovery Framework) and intends to contribute significant parts of this work to the community

[Architecture diagram] Components: Open Connector Framework, Governance Action Framework, Open Discovery Framework, Connector Broker, Metadata Repository, Open Metadata Access Services, Metadata Connector, Connector, Engine, Functions, Databases, Applications, Files, Operational Logs

Key: Apache Atlas, IBM Value-add, Others Value-add

More details here: http://www.ibmbigdatahub.com/blog/insightout-role-apache-atlas-open-metadata-ecosystem


Background: What we learned from customers and user studies

Primary Focus: Business Analysts and Data Scientists

Common Challenges and What’s needed

1. Getting started is hard…

What’s needed: Contextual search (e.g., ‘Shop For Data’) to quickly find relevant information, assets and experts

2. …Teams are diverse, and adhoc sharing is vital.

What’s needed: Seamless conversation as the common denominator across tools

3. Context is critical to establish trustworthiness.

What’s needed: Provenance information captured automatically and transparently

=> LabBook project (IBM Research)


LabBook's heart is a graph that is both populated and consumed by third-party tools

[Diagram: LabBook overview]

Source systems: Data Integration Tools, Data Science Tools, Social Networking Tools, Business Analyst Tools

Contextual Usage Graph (example labels): COMMUNITY, COMMENT, WORKSTREAM, PERSON, DATASET, VISUALIZATION, APP, RESPONSE, INVOKES

Embeddable Widgets: Contextual Search, Social Widgets, Recommendations, Activity Streams, Contextual Graph Browser

User Interfaces: Business users, Business analysts, Data Scientists, IT staff


What context is currently captured in the graph?

▪ Schematic: How data is structured

▪ Semantic: What data means

▪ Collaborative: How people work together

▪ Usage: How data is used

[Diagram: graph schema]

Node types: ORGANIZATION, PERSON, COMMUNITY, DATASOURCE, DATASET, DATAFILE, DATABASE, SCHEMA, TABLE, COLUMN, ONTOLOGYREF, VISUALIZATION, APPLICATION, INVOCATION, QUERY, NOTE, COMMENT, RESPONSE

Relationship types: memberOf, follows, publishes, contains, visualize, is, similarTo, consumes, produces, derivedFrom, collaborates, createdBy, has, authorOf, replyTo, respondTo, outputs, downloads
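The node and relationship types named on this slide suggest a graph with typed nodes and labelled edges. The following is a minimal sketch of such a structure, assuming a plain in-memory adjacency list. The class `ContextGraph`, the edge directions and the sample data are assumptions made for illustration and do not reflect LabBook's actual implementation.

```python
from collections import defaultdict


class ContextGraph:
    """Typed nodes plus labelled, directed edges (hypothetical sketch)."""

    def __init__(self):
        self.node_types = {}             # node id -> type, e.g. "DATASET"
        self.edges = defaultdict(list)   # source id -> [(label, target id)]

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbours(self, node_id, label):
        # All targets reachable from node_id over one edge with this label.
        return [t for (l, t) in self.edges[node_id] if l == label]

    def lineage(self, dataset_id):
        """Follow derivedFrom edges transitively to list upstream datasets."""
        seen, stack = [], [dataset_id]
        while stack:
            current = stack.pop()
            for parent in self.neighbours(current, "derivedFrom"):
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


g = ContextGraph()
g.add_node("alice", "PERSON")
for name in ("sales_raw", "sales_clean", "sales_summary"):
    g.add_node(name, "DATASET")

# Assumed edge directions: author -> asset, derived dataset -> its source.
g.add_edge("alice", "authorOf", "sales_clean")
g.add_edge("sales_clean", "derivedFrom", "sales_raw")
g.add_edge("sales_summary", "derivedFrom", "sales_clean")

print(g.lineage("sales_summary"))  # ['sales_clean', 'sales_raw']
```

Because schematic, semantic, collaborative and usage context live in one graph, a single traversal can combine them, for example finding the authors of every dataset upstream of a report.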



Classification – situation technology side

▪ Classification is about tagging information assets (e.g. columns) with their semantic meaning (e.g. social security number, date of birth, account status, …)

• This is crucial for finding the right information

• This is crucial for managing information according to regulations and company policies

▪ Many existing capabilities and assets (within IBM products, competitor products, research, …)

• Typically focusing on either low-hanging classification based on simple syntactic analysis (regular expressions, simple code, …) or very specialized domains (e.g. finding address information)

• Typically only able to automatically classify a small percentage of the information assets

▪ No silver bullet on the technology side

• All existing algorithms are specialized to address specific scenarios, e.g.

• they work for specific data formats only (e.g. text data only)

• they assume certain metadata is available and useful (e.g. descriptions)

• many have no machine learning; for others, training sets and proper feedback have been an issue

▪ No common technology base: everybody has been re-inventing the wheel, and nothing brings diverse technologies together to play in concert
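The "simple syntactic analysis" mentioned above can be sketched as a regular-expression classifier that proposes a class when most values in a column match a pattern. The patterns, the threshold and the function name `classify_column` are assumptions for illustration and are not taken from any IBM product.

```python
import re

# Hypothetical patterns; real deployments would need locale-aware rules.
PATTERNS = {
    "us_social_security_number": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email_address": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def classify_column(values, threshold=0.8):
    """Return (class, match ratio) candidates whose pattern fits most values."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return []
    candidates = []
    for label, pattern in PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        ratio = matches / len(non_empty)
        if ratio >= threshold:
            candidates.append((label, ratio))
    # Best-fitting class first.
    return sorted(candidates, key=lambda c: c[1], reverse=True)


print(classify_column(["123-45-6789", "987-65-4321", "", "111-22-3333"]))
# [('us_social_security_number', 1.0)]

# A column such as account status has no distinctive syntax, so nothing is proposed.
print(classify_column(["active", "closed", "active"]))
# []
```

The second call shows the limitation described on the slide: purely syntactic rules only cover the small share of assets whose values have a recognizable shape.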

Similar issues exist for the broader area of automating data understanding

This motivated us to build an Open Discovery Framework (ODF)

Intention is to contribute this to Open Source soon


A closer look at Open Discovery Framework (ODF)

▪ Pluggable Framework to enrich metadata with discovery results

• Developers writing discovery and classification algorithms can easily plug in their code

▪ Built on open-source stack: Atlas, Kafka, Spark, Zookeeper

▪ Extension points to support other environments. IBM is using these e.g. for

• Information Server: Use XMeta instead of Atlas

• Bluemix: Message Hub service instead of plain Kafka, Cloudant as config store

▪ Jenkins based build pipeline for build and test automation

▪ The IBM Information Analyzer profiling and data quality analysis services are available as plugins for this framework

▪ IBM started to develop diverse new classification services, specifically

• A “term classification” service comparing information asset metadata against business glossary content

• A “fingerprint” based classification service comparing statistical fingerprints against fingerprints of already classified information assets
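The "fingerprint" idea above can be illustrated as follows: summarise each column as a small vector of statistics, then assign the class of the nearest already classified column if it is close enough. This is a sketch under assumptions. The chosen features, the distance threshold and the sample columns are hypothetical and do not describe the algorithm IBM actually uses.

```python
import math


def fingerprint(values):
    """Summarise a column as a small vector of statistical features."""
    values = [str(v) for v in values]
    if not values:
        raise ValueError("cannot fingerprint an empty column")
    total_chars = sum(len(v) for v in values) or 1
    digits = sum(ch.isdigit() for v in values for ch in v)
    letters = sum(ch.isalpha() for v in values for ch in v)
    mean_length = sum(len(v) for v in values) / len(values)
    return (
        digits / total_chars,             # share of digit characters
        letters / total_chars,            # share of alphabetic characters
        len(set(values)) / len(values),   # distinct-value ratio
        min(mean_length / 20.0, 1.0),     # mean length, scaled to [0, 1]
    )


def classify_by_fingerprint(values, known, max_distance=0.3):
    """Return the label of the nearest known fingerprint, or None if too far."""
    fp = fingerprint(values)
    label, reference = min(known.items(), key=lambda kv: math.dist(fp, kv[1]))
    return label if math.dist(fp, reference) <= max_distance else None


# Fingerprints of columns that were classified earlier (sample data is made up).
known = {
    "date_of_birth": fingerprint(["1990-01-31", "1985-07-12", "2001-12-01"]),
    "account_status": fingerprint(["active", "closed", "active", "active"]),
}

print(classify_by_fingerprint(["1979-03-05", "1992-11-23"], known))
print(classify_by_fingerprint(["open", "open", "closed"], known))
print(classify_by_fingerprint(
    ["Customer called about invoice", "Please call back tomorrow"], known))
```

Unlike pattern rules, this approach learns from assets that curators have already classified, so its coverage can grow as the catalog grows; the quality of the result depends heavily on the feature choice.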


Open Discovery Framework Architecture

[Architecture diagram]

APIs: ODF REST API, ODF Java API, ODF Event API

ODF Core: Service Choreography, Declarative Request Processor, Request Notifications, Config Store, Metadata Access, Annotation Store

Kafka: Spark runtime Queue, Java runtime Queue, REST service Queue, Notification Topic

Discovery services: Service 1, Service 2

Backing stores: Some Metadata Store, Some Config Store
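The interplay of pluggable services, request queues, an annotation store and notifications can be sketched in a few lines. This is a simplified, in-process illustration under assumptions: `DiscoveryFramework`, its methods and the two sample services are hypothetical names, a `deque` stands in for the Kafka queues, and none of this is the actual ODF API.

```python
from collections import deque


class DiscoveryFramework:
    """Toy request processor with pluggable discovery services (hypothetical)."""

    def __init__(self):
        self.services = {}           # service id -> callable(asset) -> annotations
        self.queue = deque()         # stand-in for the runtime queues
        self.annotation_store = {}   # asset name -> list of annotations
        self.notifications = []      # stand-in for the notification topic
        self._next_id = 0

    def register(self, service_id, func):
        # Plugging in a service only requires a callable with this signature.
        self.services[service_id] = func

    def submit(self, asset, service_ids):
        unknown = [s for s in service_ids if s not in self.services]
        if unknown:
            raise KeyError(f"unregistered services: {unknown}")
        self._next_id += 1
        self.queue.append((self._next_id, asset, list(service_ids)))
        return self._next_id

    def process_all(self):
        # Run each queued request through its services, in the order requested.
        while self.queue:
            request_id, asset, service_ids = self.queue.popleft()
            for service_id in service_ids:
                for annotation in self.services[service_id](asset):
                    self.annotation_store.setdefault(asset["name"], []).append(
                        {"service": service_id, **annotation}
                    )
            self.notifications.append((request_id, "FINISHED"))


def row_count_service(asset):
    return [{"type": "row_count", "value": len(asset["rows"])}]


def null_check_service(asset):
    counts = {}
    for row in asset["rows"]:
        for column, value in row.items():
            if value is None:
                counts[column] = counts.get(column, 0) + 1
    return [{"type": "null_count", "column": c, "value": n} for c, n in counts.items()]


odf = DiscoveryFramework()
odf.register("row-count", row_count_service)
odf.register("null-check", null_check_service)

customers = {"name": "customers",
             "rows": [{"id": 1, "email": None}, {"id": 2, "email": "a@b.c"}]}
request_id = odf.submit(customers, ["row-count", "null-check"])
odf.process_all()
print(odf.annotation_store["customers"])
```

Keeping services behind a uniform callable interface is what lets profiling, term classification and fingerprint classification run side by side and write to the same annotation store.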


Take Aways & Outlook

▪ Take Aways

• Governance is extending from „Governance for Compliance“ to „Governance for Insights“

• Data lakes are helping CDOs to implement a vision of a data-driven enterprise, but data lakes need to be fully governed to live up to this value proposition

• Governance and the underlying metadata and metadata discovery and exploitation

technologies are not mature enough for big data, there is a lot of (and vice versa big data systems

are not mature enough to be a player in a governed landscape)

▪ IBM Governance Strategic Directions:

• Huge focus on Governance for Insights (comprising topics like shop for data, recommendation-driven tools, machine learning, collaboration, ...)

• Moving to an open-source base (Kafka, Spark, Atlas, ...)

• Re-basing governance on an open, non-centralized metadata infrastructure

• Huge focus on “Unified Governance” to bring IBM's governance capabilities together (across structured and unstructured data, across cloud and on-premises, across all information governance domains)



Questions?


© Copyright IBM Corporation 2017. All rights reserved. The information contained in these materials is provided for informational purposes

only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use

of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any

warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement

governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in

all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole

discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any

way. IBM, the IBM logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United

States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
