Master Data In Minutes with Smart Mastering

Kasey Alderete, Senior Product Manager, MarkLogic (@kaseya)

Damon Feldman, Solutions Director, MarkLogic (@damonfeldman)

1 June 2018 © MARKLOGIC CORPORATION



SLIDE: 2

Data Integration Aspects

SOLVING THE SILOS PROBLEM

Match different inputs to a single 360° record

Merge duplicates

Increase data quality

Refining the system and correcting mistakes

DISCOVERY

FEEDBACK

ADJUSTMENT

SLIDE: 3

Why MDM?

• Valid reporting and analytics

• Find correct values for Data Quality

• More data beats a better algorithm

• So does correct data

Equipment | Location | Inspection Failures
PSV-11 | North Sea 1 | 2
PSV-NS-11 | North Sea 1 | 2
PSV-11 | N-Sea 1 | 3

Equipment | Location | Failures
PSV-11 | North Sea 1 | 7

Type 4 pressure sensor valves fail 3x more often in operating temperatures below 15 °C
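
The two tables on this slide can be read as a before and after of mastering: three source rows collapse into one record whose failure count is the sum of the three. A minimal Python sketch of that roll-up follows. The alias tables are hypothetical stand-ins for whatever matching actually links the variants; they are not part of any MarkLogic API.

```python
from collections import defaultdict

# Source rows as shown on the slide.
rows = [
    {"equipment": "PSV-11", "location": "North Sea 1", "failures": 2},
    {"equipment": "PSV-NS-11", "location": "North Sea 1", "failures": 2},
    {"equipment": "PSV-11", "location": "N-Sea 1", "failures": 3},
]

# Hypothetical alias tables (assumption): a real system would discover
# these links through matching rather than a hand-written lookup.
EQUIPMENT_ALIASES = {"PSV-NS-11": "PSV-11"}
LOCATION_ALIASES = {"N-Sea 1": "North Sea 1"}


def master_key(row):
    """Map a source row to the (equipment, location) pair of its master."""
    equipment = EQUIPMENT_ALIASES.get(row["equipment"], row["equipment"])
    location = LOCATION_ALIASES.get(row["location"], row["location"])
    return (equipment, location)


def total_failures(source_rows):
    """Sum failure counts per mastered equipment record."""
    totals = defaultdict(int)
    for row in source_rows:
        totals[master_key(row)] += row["failures"]
    return dict(totals)


totals = total_failures(rows)
```

Without the alias step, the same data would report three separate items with 2, 2, and 3 failures, and the analysis on the slide would miss the pattern.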

SLIDE: 4

MarkLogic Mastering – How?

{"Person": {"givenName": "Bob", "familyName": "McDougal", "phone": "415 826-3389", "city": "Laguna Beach", "zip": "92652", "favoritePet": "cat"}}

{"Person": {"givenName": "Robert", "familyName": "MacDougal", "phone": "415 826-3389", "city": "Laguna Beach", "zip": "92652", "favoritePet": "dog"}}

Weight | Incoming Data | Existing Master | Match Type
15 | Bob | Robert |
20 | McDougal | MacDougal |
4 | 92652 | 92652 |
2 | Laguna Beach | Laguna Beach |
45 | 415 826-3389 | 415 826-3389 |

SLIDE: 5

MarkLogic Mastering – How?

{"Person": {"givenName": "Bob", "familyName": "McDougal", "phone": "415 826-3389", "city": "Laguna Beach", "zip": "92652", "favoritePet": "cat"}}

{"Person": {"givenName": "Robert", "familyName": "MacDougal", "phone": "415 826-3389", "city": "Laguna Beach", "zip": "92652", "favoritePet": "dog"}}

Weight | Incoming Data | Existing Master | Match Type
5 (was 15) | Bob | Robert | Nickname
3 (was 20) | McDougal | MacDougal | Metaphone
4 | 92652 | 92652 | Exact
0 (was 2) | Laguna Beach | Laguna Beach | Exact but redundant
45 | 415 826-3389 | 415 826-3389 | Exact
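
The adjusted weights can be turned into a single match score by adding the weight of every field that matches under its rule. The Python sketch below illustrates the idea only and is not the Smart Mastering implementation. The nickname table is hypothetical, and `phonetic_key` is a crude simplification standing in for Metaphone.

```python
# Weights from the adjusted table on the slide.
WEIGHTS = {"givenName": 5, "familyName": 3, "zip": 4, "city": 0, "phone": 45}

# Hypothetical nickname table (assumption), standing in for a thesaurus.
NICKNAMES = {"bob": "robert", "bill": "william"}


def canonical_given_name(name):
    """Map a nickname to its formal name; leave other names unchanged."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def phonetic_key(name):
    """Crude sounds-alike key: first letter plus later consonants.

    Simplified stand-in for Metaphone, for illustration only.
    """
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c not in "AEIOUYHW" and c != key[-1]:
            key.append(c)
    return "".join(key)


# One comparison rule per field, mirroring the Match Type column.
COMPARATORS = {
    "givenName": lambda a, b: canonical_given_name(a) == canonical_given_name(b),
    "familyName": lambda a, b: phonetic_key(a) == phonetic_key(b),
    "zip": lambda a, b: a == b,
    "city": lambda a, b: a == b,
    "phone": lambda a, b: a == b,
}


def match_score(incoming, master, weights=WEIGHTS):
    """Add up the weights of all fields that match under their rule."""
    return sum(
        weights[field]
        for field, same in COMPARATORS.items()
        if field in incoming and field in master
        and same(incoming[field], master[field])
    )


incoming = {"givenName": "Bob", "familyName": "McDougal",
            "phone": "415 826-3389", "city": "Laguna Beach", "zip": "92652"}
master = {"givenName": "Robert", "familyName": "MacDougal",
          "phone": "415 826-3389", "city": "Laguna Beach", "zip": "92652"}

score = match_score(incoming, master)
```

The city weight drops to 0 because the matching zip code already carries that information, which is the "exact but redundant" case in the table.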

SLIDE: 6

MarkLogic Data Hubs

Ingest | Your Data Model | Harmonize | Customized Processes | Query, Search, Use

Agile Modeling

Harmonize each source

Ingest as-is
• Silos: busted

Operationalize

Clean, merge, enrich, refine

SLIDE: 7

MarkLogic and MDM

Ingest | Your Data Model | Harmonize | Match | Merge | Query, Search, Use

Discover, Adjust

Ingest as-is
• Silos: busted

• One 360° view
• Weighted score per field
• Additional smart rules

Merge, retaining context
• All values or best value
• Maintain lineage, history

Operationalize

Adjust and Refine
• Accuracy and Recall
• Evaluate merge results

Agile Modeling

Harmonize each source

SLIDE: 8

MarkLogic and MDM

Deduplication

Source Record 1

Source Record 2

Single 360° Record

BI and Reporting

Security Policy + Data Lenses

Data Hub

REST

SQL

SPARQL

Ingest | Your Data Model | Harmonize | Match | Merge | Query, Search, Use

Discover, Adjust

Search and Discovery

SLIDE: 9

Best Practices Have Emerged Over Time

DATA INTEGRATION | DATA MASTERING

Multiple US State Health & Human Services agencies

SLIDE: 10

Unfortunately, Traditional MDM Usually Fails

Traditional MDM | Why?
Slow to implement | Big Modeling Up Front (BMUF); map and ETL all the data
Occasional update, slow query | "Query and Scan" is inherently slow
Weak provenance | Timestamps on every table are difficult to model, and slow to query
Re-engineer the business | MDM + operational access + data warehouse; more silos
Security risk | Each additional system = more risk
Not trusted | Lineage and history lost

SLIDE: 11

Unfortunately, Traditional MDM Usually Fails

Traditional MDM has a 76% Failure Rate

A Better Alternative?

SLIDE: 13

NEW FEATURE

MarkLogic’s Smart Mastering

Fast integration into a Data Hub

Operational (thousands of TPS)

More data than traditional MDM

Track provenance and lineage

Maintain data security


SLIDE: 15

Use Case – HealthCare.gov Master Person Index

The problem

- Person records from multiple sources

- Reporting? Find people? Analytics?

- At massive scale, in real time

Temporal history tracked for all Person updates

Lessons learned

- Use indexes for speed

- Same algorithms for matching and search

SLIDE: 16

Use Case – Oil and Gas

Keeping everyone safe

- Equipment, inspections, repairs, risk

- Names and locations vary

- PSV-11 (Argentina, Rig 11)

- PS Valve11 (Ireland, Oil Platform 11)

- Analyze by equipment item

- Or model number, or oil rig

Design docs, inspections, work orders, reports

Lesson learned: Data cleansing + fuzzy match
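
The lesson "data cleansing + fuzzy match" can be sketched in Python as two steps: normalize each name to a canonical key, then compare keys with a similarity ratio. The abbreviation table and the 0.9 threshold below are assumptions made for illustration. The sketch compares names only and ignores location, which a real equipment match would also weigh.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table (assumption) applied during cleansing.
ABBREVIATIONS = {"VALVE": "V"}


def cleanse(name):
    """Uppercase, apply abbreviations, then drop spaces and punctuation."""
    text = name.upper()
    for long_form, short_form in ABBREVIATIONS.items():
        text = text.replace(long_form, short_form)
    return re.sub(r"[^A-Z0-9]", "", text)


def similarity(a, b):
    """Similarity in [0, 1] between the cleansed forms of two names."""
    return SequenceMatcher(None, cleanse(a), cleanse(b)).ratio()


def is_match(a, b, threshold=0.9):
    """Treat two names as the same item when similarity meets the threshold."""
    return similarity(a, b) >= threshold


same_item = is_match("PSV-11", "PS Valve11")
different_item = is_match("PSV-11", "PSV-12")
```

Cleansing does most of the work here: both spellings from the slide reduce to the same key, so the fuzzy step only has to absorb whatever variation the cleansing rules did not anticipate.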

SLIDE: 17

Use Case – Insurance with Existing MDM

Existing MDM system predating MarkLogic

[Diagram columns: Raw, Harmonized]

Slow updates as insured are added

Low query rate

MarkLogic holds the full data set

Claims, doctors, policies

MDM data is “just another” data source

Used to link all other items

Result: Operational Data Hub + External MDM

[Diagram: five demographic rows, each with a MasterID set]

MasterID | Sys1-ID | Sys2-ID
MasterID | Sys1-ID |
MasterID | Sys1-ID | Sys2-ID
MasterID | |
MasterID | | Sys2-ID

[Diagram labels: Raw Record (src 1), Raw Record (src 2), Harmonized Record, MasterID Set]
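
Treating the external MDM output as "just another" source means the MasterID sets act as a cross-reference table that links documents from different systems. The Python sketch below shows that linking step under stated assumptions: all IDs, field names, and documents are hypothetical and are not taken from the presentation.

```python
from collections import defaultdict

# Hypothetical MasterID sets exported from the external MDM system.
master_id_sets = [
    {"masterId": "M1", "sys1Id": "A-100", "sys2Id": "B-7"},
    {"masterId": "M2", "sys1Id": "A-101", "sys2Id": None},
    {"masterId": "M3", "sys1Id": None, "sys2Id": "B-9"},
]

# Hypothetical hub documents, each keyed by its source system's own ID.
documents = [
    {"type": "claim", "system": "sys1", "sourceId": "A-100"},
    {"type": "policy", "system": "sys2", "sourceId": "B-7"},
    {"type": "claim", "system": "sys2", "sourceId": "B-9"},
]


def build_lookup(id_sets):
    """Index (system, source id) -> master id for every populated slot."""
    lookup = {}
    for row in id_sets:
        for system, field in (("sys1", "sys1Id"), ("sys2", "sys2Id")):
            if row[field] is not None:
                lookup[(system, row[field])] = row["masterId"]
    return lookup


def link_by_master(docs, lookup):
    """Group document types under the master id they resolve to."""
    linked = defaultdict(list)
    for doc in docs:
        master_id = lookup.get((doc["system"], doc["sourceId"]))
        if master_id is not None:
            linked[master_id].append(doc["type"])
    return dict(linked)


linked = link_by_master(documents, build_lookup(master_id_sets))
```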

SLIDE: 18

MarkLogic Mastering – Process

Harmonize first

[Diagram labels: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3), Harmonized Record]

SLIDE: 19

MarkLogic Mastering – Process

Harmonize first

Fast matching during ingest
- Likely matches from the indexes
- Scan each "candidate" match using rules
  - Nickname
  - Sounds-alike
  - Typo
  - Wrong-data penalty

[Diagram labels: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3), Harmonized Record, Candidate Master, Candidate Master]

SLIDE: 20

MarkLogic Mastering – Process

Harmonize first

Fast query on ingest
- Get likely matches using the indexes
- Scan each "candidate" match using rules
  - Nickname
  - Sounds-alike
  - Typo
  - Wrong-data penalty

Thresholds for automatic merging
- With configurable merge rules

[Diagram labels: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3), Harmonized Record, Candidate Master, Candidate Master, New, Updated Master (with lineage and history)]


SLIDE: 22

MarkLogic Mastering – Process

Harmonize first

Fast query on ingest
- Get likely matches using the indexes
- Scan each "candidate" match using rules
  - Nickname
  - Sounds-alike
  - Typo
  - Wrong-data penalty

Thresholds for automatic merging
- With configurable merge rules

Human review of sensitive or low-score matches

[Diagram labels: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3), Harmonized Record, Candidate Master, Candidate Master, New, Updated Master]
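
The process on these slides has three stages: pull likely candidates from an index, score each candidate with rules, then route by threshold to automatic merge, human review, or no match. The Python sketch below walks through those stages. It is an illustration, not the Smart Mastering code: the thresholds, the penalty value, the 0.85 typo ratio, and the sample masters are all assumptions. Only the phone and zip weights are borrowed from the earlier weight table.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical thresholds (assumption): the slides only say they exist.
MERGE_THRESHOLD = 50
REVIEW_THRESHOLD = 25

# Hypothetical existing masters.
masters = {
    "m1": {"familyName": "MacDougal", "phone": "415 826-3389", "zip": "92652"},
    "m2": {"familyName": "Jones", "phone": "310 555-0100", "zip": "92652"},
}


def build_index(records, fields=("phone", "zip")):
    """Inverted index: (field, value) -> ids of records holding that value."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for field in fields:
            index[(field, rec[field])].add(rec_id)
    return index


def candidates(index, incoming, fields=("phone", "zip")):
    """Ids that share at least one indexed value with the incoming record."""
    found = set()
    for field in fields:
        found |= index.get((field, incoming[field]), set())
    return found


def score(incoming, master):
    """Rule-based score with a typo rule and a wrong-data penalty."""
    total = 0
    if incoming["phone"] == master["phone"]:
        total += 45
    if incoming["zip"] == master["zip"]:
        total += 4
    ratio = SequenceMatcher(
        None, incoming["familyName"].lower(), master["familyName"].lower()
    ).ratio()
    if ratio >= 0.85:
        total += 3   # typo rule: near-identical names still earn credit
    else:
        total -= 10  # wrong-data penalty (illustrative value)
    return total


def decide(total):
    """Route a score to automatic merge, human review, or no match."""
    if total >= MERGE_THRESHOLD:
        return "merge"
    if total >= REVIEW_THRESHOLD:
        return "review"
    return "no-match"


incoming = {"familyName": "McDougal", "phone": "415 826-3389", "zip": "92652"}
index = build_index(masters)
decisions = {
    rec_id: decide(score(incoming, masters[rec_id]))
    for rec_id in candidates(index, incoming)
}
```

The index lookup keeps the expensive rule scan limited to a handful of candidates, which is what makes matching during ingest practical.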

SLIDE: 23

MarkLogic Mastering – API First

Enterprise, Enterprise, Enterprise

APIs and Data Lenses to Use Operationally
- Beyond overnight batch processes – though we do that
- Real-time
- Shipping system
- Customer portal
- Analytic dashboards
- Contact preferences

[Diagram labels: Raw Record (src 1), Raw Record (src 2), Raw Record (src 3), Harmonized Record, Master, Master, Master; APIs: REST, SQL, SPARQL, Export]

Product Response & Demo

SLIDE: 25

Smart Mastering

WHAT IT INCLUDES

Extensible framework to match & merge – efficiently address duplicate, incomplete, and partial entities

Out of the box rule configurations, set of APIs and a visual interface to get started

MarkLogic

Smart Mastering APIs

Demo GUI

SLIDE: 26

Smart Mastering

NEW FEATURE

Smart

Enable – don’t eliminate – humans.

Trusted

Build confidence in curated data via a trust-based process.

Score based on configurable rules

Keep all data in its original form

Track all processing for auditability

Protect data from end to end

SLIDE: 27

THE GOAL

Determine Deal Eligibility for Each Customer

Create a 360° View to determine the best deal available for each car insurance customer…

Is this a preferred customer?

Are there any driving infractions?

Has this person applied for a car policy in the past, possibly with one of our affiliates?

Lillian

Mr. Pollak

SLIDE: 28

MarkLogic Operational Data Hub

VALIDATION

INDEXING

REFERENCE DATA DENORMALIZATION

HARMONIZATION

POLICY APPLICATION

SMART MASTERING

RELATIONAL VIEWS

SEMANTIC VIEWS

TEMPORAL TRACKING

PROVENANCE & LINEAGE

ACCESS PRIVILEGES & PERMISSIONS

AS IS DATA

CURATION

INGESTION

CURATED DATA

SOURCE 1 DATA

SOURCE N DATA

METADATA

ENVELOPE (ENTITY 1)

ENVELOPE (ENTITY 2)

ENTITY N

MESSAGE BUS

RDBMS

CONTENT FEED

TRANSACTIONAL APPS: OPERATIONAL, SEARCH, SEMANTIC APPS

ANALYTICS & BI: REAL-TIME TRENDS, BI TOOLS

DOWNSTREAM SYSTEMS: ERP PROCESSING, ARTIFICIAL INTELLIGENCE

BUSINESS PROCESSES: DATA SERVICES, MICROSERVICES

BUSINESS APIs

STANDARD INTERFACES

REST, SQL, SPARQL, OPTIC

EXPORT APIs
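
The diagram pairs as-is data with curated data inside an envelope per entity, alongside metadata. One way to picture that is a wrapper document holding the harmonized instance, the untouched source record, and header metadata. The Python sketch below assumes the key names `headers`, `instance`, and `attachments`, and a made-up harmonizer; the slide itself does not spell out the envelope's fields.

```python
from datetime import datetime, timezone


def harmonize_person(raw):
    """Hypothetical harmonizer: map one source's fields to a common model."""
    return {"givenName": raw["fname"], "familyName": raw["lname"]}


def to_envelope(raw, source, harmonize):
    """Wrap a raw record so curated and as-is data travel together."""
    return {
        "envelope": {
            # Metadata about where the record came from and when.
            "headers": {
                "source": source,
                "ingestedAt": datetime.now(timezone.utc).isoformat(),
            },
            # Curated, harmonized view of the entity.
            "instance": harmonize(raw),
            # Original record, kept unchanged for lineage.
            "attachments": [raw],
        }
    }


raw_record = {"fname": "Bob", "lname": "McDougal"}
envelope = to_envelope(raw_record, "source-1", harmonize_person)
```

Keeping the raw record next to the harmonized instance is what lets later steps re-run curation or audit a value back to its source.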

SLIDE: 29

Unpacking Smart Mastering

Admin Screens

Built-in APIs & Functions: Weighted Query, Merge Algorithms, Process Match, Merge, Lineage

Custom Functions

Harmonize Flow

Match & Merge Configuration: Match Scoring & Rules, Merge Policies & Thresholds

<scoring>
  <add property-name="last-name" weight="8"/>
  <add property-name="first-name" weight="6"/>
  <add property-name="city" weight="3"/>
  <expand property-name="first-name" algorithm-ref="thesaurus" weight="6">
    <thesaurus>/mdm/config/thesauri/first-name-synonyms.xml</thesaurus>
    <distance-threshold>50</distance-threshold>
  </expand>
</scoring>

Matching Issue | Feature | Weight
Bill, William – Nickname | Thesaurus expansion | 6
Chrissy, Krissy – Sounds like | Double metaphone | 5
Same last name & address – Household | Reduction | -5
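
To make the scoring configuration concrete, the Python sketch below parses a `<scoring>` block like the one on this slide and applies it to two records. It shows one plausible reading of the config: `add` rules score exact matches and the `expand` rule scores thesaurus matches. It is not how the Smart Mastering library actually evaluates it. The synonym table is a hypothetical stand-in for the thesaurus file, and the `distance-threshold` element is ignored.

```python
import xml.etree.ElementTree as ET

SCORING_XML = """
<scoring>
  <add property-name="last-name" weight="8"/>
  <add property-name="first-name" weight="6"/>
  <add property-name="city" weight="3"/>
  <expand property-name="first-name" algorithm-ref="thesaurus" weight="6">
    <thesaurus>/mdm/config/thesauri/first-name-synonyms.xml</thesaurus>
    <distance-threshold>50</distance-threshold>
  </expand>
</scoring>
"""

# Hypothetical synonym table (assumption) replacing the thesaurus file.
SYNONYMS = {"bill": {"william"}, "william": {"bill"}}


def load_rules(xml_text):
    """Return (exact-match weights, thesaurus-expansion weights)."""
    root = ET.fromstring(xml_text)
    exact = {e.get("property-name"): int(e.get("weight"))
             for e in root.findall("add")}
    expand = {e.get("property-name"): int(e.get("weight"))
              for e in root.findall("expand")}
    return exact, expand


def score(a, b, exact, expand):
    """Add exact-match weights, then weights for synonym matches."""
    total = 0
    for prop, weight in exact.items():
        if a.get(prop) is not None and a.get(prop) == b.get(prop):
            total += weight
    for prop, weight in expand.items():
        left = a.get(prop, "").lower()
        right = b.get(prop, "").lower()
        if left != right and right in SYNONYMS.get(left, set()):
            total += weight
    return total


exact_rules, expand_rules = load_rules(SCORING_XML)
first = {"first-name": "Bill", "last-name": "Smith", "city": "Austin"}
second = {"first-name": "William", "last-name": "Smith", "city": "Austin"}
total = score(first, second, exact_rules, expand_rules)
```

Here the last name and city match exactly (8 + 3) and the first names match through the synonym table (6), so the nickname pair scores the same as an exact first-name match would.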

<merge property-name="address" max-values="1">
  <postal-code prefer="zip+4" />
</merge>

Merge Algorithms

Length

Max Values

Source

Date Time (Recency)
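
The merge algorithms listed above each choose which values survive when several source documents disagree. The Python sketch below shows length, source, and recency strategies with a max-values cap, and keeps the origin of each surviving value as lineage. The function, its parameters, and the sample documents are illustrative assumptions, not the Smart Mastering API.

```python
from datetime import datetime

# Hypothetical source documents; "source" and "updated" are assumed names.
docs = [
    {"source": "crm", "updated": datetime(2018, 1, 5),
     "address": "12 Main St, 92652"},
    {"source": "billing", "updated": datetime(2018, 4, 2),
     "address": "12 Main Street, 92652-1234"},
]


def merge_property(source_docs, prop, max_values=1,
                   strategy="recency", source_order=()):
    """Pick up to max_values values for prop, recording where each came from."""
    values = [(d[prop], d) for d in source_docs if prop in d]
    if strategy == "length":
        # Prefer longer values, on the theory that they carry more detail.
        values.sort(key=lambda pair: len(pair[0]), reverse=True)
    elif strategy == "source":
        # Prefer values from sources earlier in source_order.
        rank = {name: i for i, name in enumerate(source_order)}
        values.sort(key=lambda pair: rank.get(pair[1]["source"], len(rank)))
    elif strategy == "recency":
        # Prefer the most recently updated value.
        values.sort(key=lambda pair: pair[1]["updated"], reverse=True)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [
        {"value": value, "source": doc["source"],
         "updated": doc["updated"].isoformat()}
        for value, doc in values[:max_values]
    ]


by_recency = merge_property(docs, "address")
by_source = merge_property(docs, "address", strategy="source",
                           source_order=("crm", "billing"))
both = merge_property(docs, "address", max_values=2, strategy="length")
```

Returning the source and timestamp with each value keeps the merged record auditable, in line with the "keep all data in its original form" point earlier in the deck.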

SLIDE: 30

SMART MASTERING

Faster MDM Without Buying MDM

Fast integration into a multi-model data hub

Lightweight matching and merging

Configuration-driven, with extension points

Track provenance and lineage

Maintain data security

https://github.com/marklogic-community/smart-mastering-core

Future updates (roadmap in progress): non-person entities, enhanced scoring intelligence, iterative mastering