
DATA DRIVEN PARSING Impact of Reference Data on Contact Data Parsing

Upload: magnus-ward

Post on 18-Jan-2016




UNDERSTANDING VS. KNOWLEDGE

The difference between an Algorithmic Approach and a Data Driven Approach


Definition

PARSING

Analyze (a string or text) into logical syntactic components.

* The process of detecting and extracting individual components of a string into their respective and specific Domains.


The Algorithmic Approach

Understanding Addresses

123 Main St, Los Angeles, CA 90210 USA

ADDRESSLINE: 123 Main St
CITY: Los Angeles
STATE: CA
ZIP: 90210
COUNTRY: USA
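The address breakdown on this slide can be sketched as a purely positional parser. This is a minimal illustration, not the presenter's implementation: the function name `parse_address_by_position` and the assumption that the input always has exactly two commas followed by "STATE ZIP COUNTRY" are hypothetical.

```python
def parse_address_by_position(text):
    """Split an address using only commas and token positions.

    Assumes the layout 'ADDRESSLINE, CITY, STATE ZIP COUNTRY'.
    Any other layout raises ValueError.
    """
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 3:
        raise ValueError("expected exactly two commas")
    address_line, city, tail = parts

    tokens = tail.split()
    if len(tokens) != 3:
        raise ValueError("expected STATE ZIP COUNTRY after the city")
    state, zip_code, country = tokens

    return {
        "address_line": address_line,
        "city": city,
        "state": state,
        "zip": zip_code,
        "country": country,
    }


parsed_address = parse_address_by_position("123 Main St, Los Angeles, CA 90210 USA")
```

The parser never checks what the tokens are, only where they sit, so it depends entirely on the delimiters and ordering being as expected.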


The Algorithmic Approach

Understanding Phone Numbers

1 (800) 800 - 6245

US COUNTRY CODE: 1
AREA CODE: 800
PREFIX: 800
SUFFIX: 6245
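A pattern-based reading of this phone number might look like the sketch below. The regular expression and the name `parse_phone_by_pattern` are assumptions made for illustration; the pattern accepts only the exact shape shown on the slide.

```python
import re

# Matches exactly: country digit, (area code), prefix, hyphen, suffix.
PHONE_PATTERN = re.compile(r"^\s*(\d)\s*\((\d{3})\)\s*(\d{3})\s*-\s*(\d{4})\s*$")


def parse_phone_by_pattern(text):
    """Return the phone components, or None if the text does not fit the pattern."""
    match = PHONE_PATTERN.match(text)
    if match is None:
        return None
    country_code, area_code, prefix, suffix = match.groups()
    return {
        "country_code": country_code,
        "area_code": area_code,
        "prefix": prefix,
        "suffix": suffix,
    }


parsed_phone = parse_phone_by_pattern("1 (800) 800 - 6245")
```

Because the pattern is rigid, a number without the leading country code, or one with trailing text, returns `None` instead of a partial result.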


The Algorithmic Approach

Understanding Names

Condition | Name | Form
2 words | John Doe | First Last
2 words with comma | Doe, John | Last First
3 words | John M. Doe | First Middle Last
4 words | John M. Doe Jr. | First Middle Last Suffix

TECHNIQUES

• Use of Word Counts
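The word-count technique in the table can be sketched as follows. The mapping `FORMS_BY_WORD_COUNT` and the function `parse_name_by_word_count` are hypothetical names that simply encode the four rows above.

```python
# Name form assumed for each word count, taken from the table above.
FORMS_BY_WORD_COUNT = {
    2: ("first", "last"),
    3: ("first", "middle", "last"),
    4: ("first", "middle", "last", "suffix"),
}


def parse_name_by_word_count(text):
    """Label each word of a name using only the number of words."""
    # Treat a comma as its own token so 'Doe, John' can be recognized.
    words = text.replace(",", " , ").split()
    if len(words) == 3 and words[1] == ",":
        return {"last": words[0], "first": words[2]}

    labels = FORMS_BY_WORD_COUNT.get(len(words))
    if labels is None:
        return None
    return dict(zip(labels, words))


three_word_name = parse_name_by_word_count("John M. Doe")
titled_name = parse_name_by_word_count("Dr. John Doe")
```

The second call shows the weakness of counting words: "Dr. John Doe" also has three words, so the title is labeled as the first name.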


The Algorithmic Approach

Disadvantages

• Logic is highly presumptuous about the input format

• Relies on the data being well formed

• There will always be exceptions


The Algorithmic Approach

Exception: No Delimiters

123 Main St Los Angeles CA 90210 USA

Labels: ADDRESSLINE, CITY, STATE, ZIP, COUNTRY


The Algorithmic Approach

Exceptions: Missing Elements

123 Main St, Los Angeles, CA 90210

Labels: ADDRESSLINE, CITY, STATE


The Algorithmic Approach

Exceptions: Missing Elements

123 Main St, Los Angeles, CA 90210

Labels: ADDRESSLINE, ZIP


The Algorithmic Approach

Exceptions: Unconventional Order

123 Main St, 90210 Los Angeles, CA USA

ADDRESSLINE: 123 Main St
ZIP: 90210
CITY: Los Angeles
STATE: CA
COUNTRY: USA


The Algorithmic Approach

Exceptions: Missing Elements

(800) 800 - 6245

AREA CODE: 800
PREFIX: 800
SUFFIX: 6245


The Algorithmic Approach

Exception: Unexpected Elements

1 (800) 800 – 6245 x236

US COUNTRY CODE: 1
AREA CODE: 800
PREFIX: 800
SUFFIX: 6245
EXTENSION: 236


The Algorithmic Approach

Exceptions: Name

Exception | Name | Form
Inverted Order | Doe John | Last First
Prefix | Dr. John Doe | Prefix First Last
Unknown Type | John | First
Dual Name | John and Jane Doe | First1 First2 Last


The Algorithmic Approach

Exception: Combination of Domains

22382 Avenida Empresa, 92688 1(800) 800 –6245 Melissa Data [email protected]


The Data Driven Approach

Advantages

• Bypasses many exceptions

• Does not rely on well formed data

• Having both an Understanding and Knowledge of Domains greatly improves Parsing Accuracy


The Data Driven Approach

Knowledge of State through Reference

123 Main St Los Angeles CA 90210 USA

State

AK

AZ

AR

CA

CO

CT

DE


The Data Driven Approach

Knowledge of Zip through Reference

123 Main St Los Angeles CA 90210 USA

Zip

90200

90207

90208

90210

90211

90215

90220
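The state and ZIP lookups from these two slides can be sketched as a token tagger. The sets below hold only the sample values shown on the slides, and the name `tag_tokens` is hypothetical; a real reference table would be far larger.

```python
# Sample reference data copied from the slides; not a complete list.
STATE_REFERENCE = {"AK", "AZ", "AR", "CA", "CO", "CT", "DE"}
ZIP_REFERENCE = {"90200", "90207", "90208", "90210", "90211", "90215", "90220"}


def tag_tokens(text):
    """Tag each whitespace-separated token by looking it up in the reference sets."""
    tagged = []
    for token in text.split():
        if token in ZIP_REFERENCE:
            tagged.append((token, "ZIP"))
        elif token in STATE_REFERENCE:
            tagged.append((token, "STATE"))
        else:
            tagged.append((token, "UNKNOWN"))
    return tagged


tagged_address = tag_tokens("123 Main St Los Angeles CA 90210 USA")
```

Because each token is recognized by what it is rather than where it sits, the same lookup works when commas are missing or when the ZIP appears before the city.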


The Data Driven Approach

Knowledge of Area Code and Prefix through Reference

(800) 800 - 6245

Area Codes

714

866

855

800

877

909

Prefixes

672

682

692

800

822

872
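A reference-driven phone parser could be sketched as below. It assumes the second list on this slide is a prefix list (the slide title mentions both area code and prefix), and it uses only the sample values shown; `parse_phone_with_reference` is a hypothetical name.

```python
import re

# Sample reference data copied from the slide; the second list is read as prefixes.
AREA_CODE_REFERENCE = {"714", "866", "855", "800", "877", "909"}
PREFIX_REFERENCE = {"672", "682", "692", "800", "822", "872"}


def parse_phone_with_reference(text):
    """Locate area code and prefix by lookup, ignoring punctuation and spacing."""
    digits = re.sub(r"\D", "", text)
    # Try the number with no leading country code, then with one leading digit.
    for offset in (0, 1):
        area_code = digits[offset:offset + 3]
        prefix = digits[offset + 3:offset + 6]
        suffix = digits[offset + 6:offset + 10]
        if (area_code in AREA_CODE_REFERENCE
                and prefix in PREFIX_REFERENCE
                and len(suffix) == 4):
            return {
                "country_code": digits[:offset],
                "area_code": area_code,
                "prefix": prefix,
                "suffix": suffix,
                "extension": digits[offset + 10:],
            }
    return None


without_country = parse_phone_with_reference("(800) 800 - 6245")
with_extension = parse_phone_with_reference("1 (800) 800 - 6245 x236")
```

Both the missing country code and the unexpected extension are handled by the same lookup, with no separate pattern for each case.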


The Data Driven Approach

Knowledge of First Names through Reference

Vertido Joseph

First Name

John

James

Jerry

Joseph

Jeffrey

Jeremy
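The first-name lookup on this slide can be sketched as follows. The set holds only the six sample names shown, and `parse_two_word_name` is a hypothetical name; the fallback to "First Last" order when the lookup is inconclusive is an assumption.

```python
# Sample reference data copied from the slide; not a complete list.
FIRST_NAME_REFERENCE = {"John", "James", "Jerry", "Joseph", "Jeffrey", "Jeremy"}


def parse_two_word_name(text):
    """Decide name order for a two-word name using a first-name reference list."""
    words = text.split()
    if len(words) != 2:
        return None
    left, right = words
    # Only the second word is a known first name: the order is inverted.
    if right in FIRST_NAME_REFERENCE and left not in FIRST_NAME_REFERENCE:
        return {"first": right, "last": left}
    # Otherwise fall back to the conventional First Last order.
    return {"first": left, "last": right}


inverted_name = parse_two_word_name("Vertido Joseph")
```

A word-count rule alone would label "Vertido" as the first name; the reference list is what reveals the inverted order.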


Problem

Data alone is not enough

123 Main Ct Hartford CT 06154 USA

State

AK

AZ

AR

CA

CO

CT

DE


Intelligent Parsing: The Combined Approach

By using both Logic and Data, we can develop a more Robust and Intelligent way to parse Contact Data and overcome exceptions that would otherwise cause problems.
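One way to combine logic and data can be sketched with the Hartford address from the previous slide, where a case-insensitive state lookup matches both "Ct" (Court) and "CT" (Connecticut). The positional rule used here, that a state is directly followed by a five-digit ZIP, is an illustrative assumption and not the Melissa Data method; `pick_state` is a hypothetical name.

```python
import re

# Sample reference data copied from the slides; not a complete list.
STATE_REFERENCE = {"AK", "AZ", "AR", "CA", "CO", "CT", "DE"}
ZIP_SHAPE = re.compile(r"^\d{5}$")


def find_state_candidates(tokens):
    """Data step: indexes of every token that matches a state abbreviation."""
    return [i for i, token in enumerate(tokens) if token.upper() in STATE_REFERENCE]


def pick_state(text):
    """Logic step: keep the candidate that is directly followed by a ZIP-shaped token."""
    tokens = text.split()
    for index in find_state_candidates(tokens):
        if index + 1 < len(tokens) and ZIP_SHAPE.match(tokens[index + 1]):
            return index, tokens[index]
    return None


address_tokens = "123 Main Ct Hartford CT 06154 USA".split()
candidates = find_state_candidates(address_tokens)
chosen_state = pick_state("123 Main Ct Hartford CT 06154 USA")
```

The reference data narrows the search to two candidates, and the positional rule chooses between them; neither step resolves the ambiguity alone.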


DEMO

Melissa Data Solution to Intelligent Parsing


Recap

Algorithmic Approach

Reference Data Approach

Intelligent Parsing using both Algorithms/Logic and Reference Data through the Melissa Data Components.


Joseph [email protected]

800 800 6245 x827

Download the Data Quality Components for SSIS

ASK ABOUT OUR MVP PROGRAM

Thank You!

View our Other Webinars