sti summit 2011 - di@scale

31
Data Integration @ Scale 2011 STI Semantic Summit 7/8/11 Michael L. Brodie Michaelbrodie.com 1 ! Copyright 2011 STI - INTERNATIONAL www.sti2.org Michael L. Brodie STI Fellow Michaelbrodie.com The following is for Fortune 100 organizations

Upload: semantic-technology-institute-international

Post on 11-Nov-2014

368 views

Category:

Technology


1 download

DESCRIPTION

 

TRANSCRIPT

Page 1: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

1

! Copyright 2011 STI - INTERNATIONAL www.sti2.org

Michael L. Brodie STI Fellow

Michaelbrodie.com

The following is for Fortune 100 organizations

Page 2: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

2

Based on 2010 industry data verified by 4 analyst organizations 10+ Fortune 100 enterprises

Information Systems?

Did you know…

Page 3: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

3

Information Systems in a Fortune 100 company ~10 thousand 100s per business area (e.g., billing) 30% mission critical 30% modified significantly annually

Databases?

Did you know…

Page 4: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

4

Logical databases in a Fortune 100 … ~10,000 100s per business area (e.g., billing)

~90% relational 100s new / year

Database size … 20% large 1-10 TB 40% medium 10 GB-1 TB

40% small <10GB

Page 5: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

5

Database growth… 2X every 18 mo OLTP / transactional 2X every 12 mo OLAP / decision support

Logical database complexity … Tables/database: 100-200 Attributes/table: 50-200 Stored procedures/database: ~# tables Web services/database: < # tables but !! Views: ~3X # tables

Page 6: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

6

Database Design?

Did you know…

Database design level … ~100% table (SQL) ~0% Entity Relationship

Page 7: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

7

Relational Database Integration?

Did you know…

Data integration accounts for … ~40% software project costs

Page 8: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

8

Circa 1985 Tables: 2+ Dominant operation: join Design: schema-based

Circa 2010 Tables: 10s-1000s Dominant operations ! Extract, Translate, Load (ETL) ! File transfer ! Then … join

Design ! Evidence gathering (often manual) ! 60-75% schema-less

Page 9: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

9

Information Ecosystems?

Did you know…

Information systems that interact to support domain-specific end-to-end

processes, e.g., supply chain, billing, marketing, sales,

engineering, operations, finance, …

Page 10: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

10

Information Ecosystems involve … 100s-1,000s Information Systems 100s-1,000s Databases

Conclusion Information systems & databases live in complex, deeply inter-connected worlds

Page 11: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

11

Database Integration @ Scale?

Did you know…

Database integration in an Information Ecosystem … 100s-1,000s source databases 1,000s-millions function-specific views

Page 12: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

12

Every organization with a database depends critically on

Data +

Data Integration

Page 13: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

13

Boring huh?

Then how about …

Page 14: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

14

Deeply sophisticated analysis [data integration]

across …

28 intelligence databases connected across

17 US intelligence agencies over 5 years

totally missed …

Page 15: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

15

Data Integration

@ Scale

Page 16: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

16

What is Data Integration? Combine 2+ data items to form a meaningful result

Source1 … Sourcen

Mapping Target

Name Address Acct# … Bill 43 Main 123… Sally 12 Peel 489…

Acct# Charge Date … 123… $2.32 2/1/10 149 $1.17 1/7/10

Account Charge

Name Address Acct# Amount Bill 43 Main 123… $2.32 Sally 12 Peel 489… $0.00

Monthly Bill

TelcoSum (Charge) where Account.Acct# = Charge.Acct#

1980’s Data Integration

Entity Mapping DI computation

Page 17: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

17

What does DI mean? " Determined by application “semantics”

" Financial Close for a Large Enterprise •  What: Combine data from 100s-1,000s of systems •  Meaning: Defined by financial accounting practices, jurisdictional

regulations, industry standards, … •  Speed: 1-2 days after end of period •  Solutions: e.g., SAP Financial, PeopleSoft Financial, Hyperion •  Governance: laws and regulations •  Correctness:

o  US= Generally Accepted Accounting Principles (US GAAP) o  GB = Generally Accepted Accounting Practice (UK GAAP) o  Etc.

" Data Integration = •  Entity mapping •  Data integration computation

Definitions (cont.)

" Semantically homogeneous relational data descriptions •  Verified by an authoritative expert •  To describe the same entity (considering set inclusion) •  For which there is one and only one simple, complete entity mapping

between the relational schemas for the data descriptions •  The entity mapping can be defined in relational expressions

" Semantically heterogeneous relational data descriptions •  Verified by an authoritative expert •  To describe the same entity (considering set inclusion) •  For which there is not a complete entity mapping between the relational

schemas for the data descriptions •  Entity mappings cannot be defined in relational expressions

Page 18: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

18

Deeper Thoughts " Schemas: sufficient but not necessary

" Simple map: 1:N, … atomicity

" Degree of semantic heterogeneity

Complete mapping

No mapping

Semantic homogeneity

Semantic heterogeneity

Consequences " Semantically homogeneous relational data descriptions

•  Data independence + updatable => consistent views •  Provide basis for

o  Correct o  Efficient

•  Schema level design and integration

" Semantically heterogeneous relational data descriptions •  No guarantee of: Data independence + updatable views •  Provide no basis for

o  Correctness o  Efficiency

•  Schema level design and integration not always possible

" Data modelling and integration best practices •  Semantic homogeneity •  Radical simplification •  Standardization: ontologies, schemas, application solutions, …

Page 19: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

19

Semantic Heterogeneity … "  Is unavoidable @ scale

•  Autonomous design •  Ontologically heterogeneous, e.g., Enterprise Customer

o  Legal entity-based ontology o  Location-based ontology

" Dominates semantic homogeneity

Information Ecosystem Integration (3D)

Telco Service Provider

GTE

MCI

BA

Worldcom

VZW

Sales

Business Domains

Ordering Marketing Finance

Billing Repair Enterprise Customer

Wells Fargo

Wachovia

Bank of America Merril Lynch

Wells Fargo

Bank of America

Page 20: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

20

Correctness & Efficiency "  Legally Binding Requirements

•  Law governing communications, financial transactions, privacy, … o  E.g.,: Customer Proprietary Network Information (CPNI)

•  Regulation: rules defining operational details o  E.g.,: FCC CPNI rules –customer views require manual verification

•  Customer agreements: SLAs for all services •  Telco agreements: SLAs for all production systems

" Mission Critical: Operational & Information Correctness " Complex Views

•  Customer Defined Views •  Inconsistent & Conflicting Views

o  Business perspectives, manually defined, no ultimate truth or agreement o  Conflicting telephone #: Long-distance vs. local

Data Modelling: Enterprise Customer " Enterprise Customer data description

•  Enterprise customer identifier (ECID) •  Enterprise customer data

o  Telco information, e.g., billing, must be mapped (integrated into) Enterprise Customer organizational units

" Enterprise Customer Ontologies = Conceptualizations •  Legal entity •  Locations •  Telco contracts •  ~40 others

Level of abstraction

Perspective

Page 21: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

21

Enterprise A

Legal entity structure (partial) •  ~400 legal entities •  14 levels deep •  20,000 accounts (~50/entity)

Industry Standard Legal-entity Ontology Industry Standard Ontology

Radically Simplified Schema

Legal-entity based ECID instance (DUNS code)

Page 22: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

22

Location-based Ontology Industry Standard Ontology

Radically Simplified Schema

Location-based ECID instance

Contract-based Ontology

Radically Simplified Schema

Industry Standard Ontology

Contract-based ECID Instance

Page 23: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

23

Entity maps for semantically heterogeneous Enterprise Customer Entities are schema-less

Legal entity-based ECID schema

Contract-based ECID schema

Location-based ECID schema

Legal-L

Legal-C

Sales

Enterprise Customer: 3 Departments in 2 locations

Marketing

Legal-L

Marketing

Page 24: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

24

Location-Based Enterprise Customer IDs: 2 locations

Two Entities

Legal entity-based Enterprise Customer IDs: 3 organizations with 1, 2 and 3 departments

Four Entities

Legal-L

Legal-C Sales

Marketing

Page 25: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

25

www.sti2.org

Multiple Versions of Truth

5/12/2007 - Vienna

Male Female Married

Nov 2002

Age: 13 Age: 48

US Federal

Govt

Fund. LDS

Church

Gay Rights ACLU Anti-

Feminists Feminists LDS Church

State of

Utah

US Citizens

US Federal

Govt

US Citizens

1st Amendment Religious Freedom

Fundamental LDS Law 1838

Govt out of bedroom

Civil Liberties

Against Women’s

rights Women’s

rights 46% 54%

Bigamy LDS Law repealed 1990

Statutory rape

Data Integration Observations " Each view is unique

•  Entity map •  Data integration computation

" Views often conflict or mutually inconsistent

" Data integration for semantically heterogeneous relations •  Schema-less •  By evidence gathering •  At instance level

" Views are •  Dominant means of accessing data •  Lenses into Information Ecosystems

Page 26: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

26

Codd’s Rules Extended "  0: System must be relational, a

database, a management system

"  1: Information rule (tabular)

"  2: Guaranteed access rule (keys)

"  3: Systematic treatment of nulls

"  4: Active online catalog based on the relational model

"  5: Comprehensive data sublanguage rule

"  6: View updating rule

"  7: High-level insert, update, and delete

"  8: Physical data independence

"  9: Logical data independence

"  10: Integrity independence

"  11: Distribution independence extra

"  12: Nonsubversion rule

"  13: Relational power rule: The full power of relational technology applies to semantically homogeneous relational databases.

For Semantically Heterogeneous Databases

" Cannot be guaranteed •  Relational model •  Codd’s rules •  Relational properties

o  Updateable views •  Relational assumptions

o  Global schema o  Single version of truth

" No basis for correctness or efficiency

" Unfortunately, like the real world

Page 27: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

27

Shadow Approach " Database Approach: power and limits

•  Data structures •  Schemas •  Data models •  Logic

" Shadow Approach •  Meta-tagging

o  Equivalence (E-tags) for entity mapping after resolution o  What (W-tags) for “meaning” o  Projection (P-tags) for ETL over logical data structures

(Traditional) Data Integration

DB

Schema

DB

Schema

data

Metadata

DB

Schema

Enterprise portal

Schema

Ontology 1 Ontology 2 Ontology 3 Ontology 4

Business Rules

In SQL, Java, C/C++…

Evidence 1 Evidence 2 Evidence 3 Evidence 4 Evidence 5 Evidence N

Single Version of Truth Single Version of Truth Single Version of Truth Single Version of Truth Single Version of Truth Single Version of Truth

Ontology

Single Version of Truth ??

Billing view

Ordering view

Repair view

Sales view

Marketing view

Service view

Total Semantic Heterogeneity

Page 28: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

28

Shadow Theory-based Data Integration

DB

Schema

DB

Schema

data

Metadata

DB

Schema

Enterprise portal

Schema

Ontology 1 Ontology 2 Ontology 3 Ontology 4

Business Rules

Evidence 1 Evidence 2 Evidence 3 Evidence 4 Evidence 5 Evidence N

Ontologies

Explicitly represented w/ Multiple Version of Truth.

Need a new language to implement rules by tags

Shadows Shadows Shadows

Shadows Shadows Shadows Shadows

Billing view

Ordering view

Repair view

Sales view

Marketing view

Service view

Additional slides

Page 29: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

29

Data Integration Pragmatics

"  Codd’s Rules violated •  Un-natural fit •  Primary keys

o  Problems: technical, managerial o  Autonomous design o  Semantic heterogeneity

•  Few views updatable

"  Schemas largely unavailable •  10% not RDBMSs •  30% legal •  60% autonomy

"  Federation often prohibited •  Legal: privacy, CPNI, …

"  Single Source of Truth •  Makes sense in specific cases •  Usually not!

"  Views conflict •  For good reason!

o  Sales vs. marketing vs. repair

•  Maintain conflicts or lose value

"  Best practices rare •  Semantic homogeneity •  Radical simplification •  Standardization

"  Data Integration •  Table-level (Not ER-level) •  Instance-based •  Seldom schema-based •  Manual •  Evidence gathering

Current Data Integration Technologies "  Relational DBMS

"  Extract, Transform, & Load (ETL)

"  Federation aka Enterprise information integration (EII)

"  DI Platforms: 50+, e.g., "  IBM InfoSphere Information Server "  Informatica Platform "  Microsoft SQL Server Integration

Services "  Oracle Data Integrator, … "  SAP Data Integrator/Federator

"  Instance-level integration •  XML-based integration •  Custom tools: Verizon •  Dun & Bradstreet ECID maps

"  Limitations •  Schema-based •  Federation issues •  Limited (no!) support for

o  Schema-less integration o  Instance-level integration o  Evidence gathering o  Conflicting views o  Semantic heterogeneity

•  50+ ad hoc tools •  Expensive to migrate

Information Ecosystem onto data integration platform

Page 30: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

30

Emerging Data Integration Technologies

"  Data Discovery

"  Data Quality Services

"  Instance-level integration •  XML-based integration •  Custom tools •  Dun & Bradstreet ECID maps

"  Schema-less Tools •  Evidence gathering: Web, … •  Entity / Identity Resolution •  Social Networking

o  Collaboration o  Web 2.0 o  Social Search

•  Data & Process Mining •  Linguistics / Ontologies / Taxonomies

o  Derivation & Analysis •  Semantic Technologies •  Content Analytics

"  Shadow Theory •  Plato’s Cave •  Monadic 2nd Order Logic

"  DBMS Architecture •  In-Database Analytics •  Database Appliances

"  Distributed DBMS Architecture •  Change Data Capture (CDC)

•  Data Access •  Data Replication •  Distributed Data Caching

"  Service Orientation •  Data Services Platforms •  Information-As-A-Service •  Enterprise Service Buses (ESBs)

"  Master Data Management (MDM)

Compu

tatio

n – P

erfo

rman

ce (n

o se

man

tics)

Prob

lem S

olvin

g – m

anua

l; se

man

tics

Heterogeneous meanings of a telephone number

Telephone attribute in customer database.

Owner identified uniquely with additional attribute NPA+NXX+XXX+CustomerID

North America Numbering Plan (NANP) telephone number

(NPA) NXX-XXXX

NPA - Number Plan Area

NXX - Central office code

XXXX – telephone line #

Domestic long distance services, charges determined by NPA.

International long distance services, charges determined by country code.

Ported to cell or internet phone Routed to / from another phone number (e.g. 1 800 toll free)

? Telephone entity = a physical phone.

Page 31: STI Summit 2011 - di@scale

Data Integration @ Scale 2011 STI Semantic Summit

7/8/11 Michael L. Brodie Michaelbrodie.com

31

Limitations of Current Solutions "  Data Model Principles Don’t Apply

•  Unique IDs (keys) are limiting •  Modelling Choices: entities,

properties, or relationships

"  Schemas Are Seldom Available •  10% not RDBMSs •  30% legal •  60% owners maintain autonomy

o  Biz: impact operations o  Tech: impact schedule & cost

"  Schema-based Integration Challenges •  Inconsistent data models •  View Conflicts •  Cascaded Views

"  Federation Seldom Allowed (CPNI)

"  No Single Source of Truth

"  Tools Don’t Meet Requirements •  No single model of data •  Inconsistent / conflicting views

"  Meta-data and Multi-database Scale •  Information ecosystem (vs. database)

management •  Millions of views •  10,000s of cross database dependencies

"  Migrating Information Ecosystems to Vendor Solutions Is Prohibitive

"  Pragmatics •  Bootstrapping in a legacy environment

"  Conceptual Modelling Support •  Different models required for different

workers •  No round-trip engineering