data driven parsing impact of reference data on contact data parsing

Post on 18-Jan-2016

DATA DRIVEN PARSING

Impact of Reference Data on Contact Data Parsing

UNDERSTANDING VS. KNOWLEDGE

The difference between an Algorithmic Approach and a Data Driven Approach

Definition

PARSING

Analyze (a string or text) into logical syntactic components.

* The process of detecting and extracting individual components of a string into their respective and specific Domains.

The Algorithmic Approach

Understanding Addresses

123 Main St, Los Angeles, CA 90210 USA

ADDRESSLINE: 123 Main St

CITY: Los Angeles

STATE: CA

ZIP: 90210

COUNTRY: USA
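A minimal sketch of what "understanding" an address by its delimiters could look like. The function name and the assumed layout (street, city, then state, ZIP, and country after the second comma) are illustrative only and are not taken from any Melissa Data component.

```python
def parse_address_by_delimiters(text):
    """Assign address parts purely by comma and whitespace position."""
    # Assumes exactly two commas: "street, city, STATE ZIP COUNTRY".
    street, city, tail = [part.strip() for part in text.split(",")]
    state, zip_code, country = tail.split()
    return {
        "address_line": street,
        "city": city,
        "state": state,
        "zip": zip_code,
        "country": country,
    }


parsed = parse_address_by_delimiters("123 Main St, Los Angeles, CA 90210 USA")
print(parsed)
```

The logic never checks what a token is, only where it sits, so input that deviates from the assumed layout raises a ValueError or gets mislabeled.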

The Algorithmic Approach

Understanding Phone Numbers

1 (800) 800 - 6245

US COUNTRY CODE: 1

AREA CODE: 800

PREFIX: 800

SUFFIX: 6245
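One way to express this positional reading of a phone number is a single regular expression. The pattern and function name below are assumptions made for illustration; they accept only the exact layout shown on the slide.

```python
import re

# Fixed layout: country code 1, area code in parentheses, prefix, hyphen, suffix.
PHONE_PATTERN = re.compile(
    r"^\s*(?P<country>1)\s*\((?P<area>\d{3})\)\s*"
    r"(?P<prefix>\d{3})\s*-\s*(?P<suffix>\d{4})\s*$"
)


def parse_phone_by_pattern(text):
    """Return the labeled parts, or None when the layout does not match."""
    match = PHONE_PATTERN.match(text)
    return match.groupdict() if match else None


print(parse_phone_by_pattern("1 (800) 800 - 6245"))
```

A number without the country code, or one with a trailing extension, does not fit this pattern and comes back as None.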

The Algorithmic Approach

Understanding Names

TECHNIQUES

• Use of Word Counts

Condition            Name               Form
2 words              John Doe           First Last
2 words with comma   Doe, John          Last First
3 words              John M. Doe        First Middle Last
4 words              John M. Doe Jr.    First Middle Last Suffix
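The word-count rules for names can be sketched as a small lookup from word count to form. The function name and the returned keys are hypothetical; only the four conditions listed on the slide are covered.

```python
def parse_name_by_word_count(text):
    """Label name parts using only the comma and the number of words."""
    if "," in text:
        # "Last, First"
        last, first = [part.strip() for part in text.split(",", 1)]
        return {"first": first, "last": last}
    words = text.split()
    forms = {
        2: ("first", "last"),
        3: ("first", "middle", "last"),
        4: ("first", "middle", "last", "suffix"),
    }
    labels = forms.get(len(words))
    if labels is None:
        return None  # word count not covered by the rules
    return dict(zip(labels, words))


print(parse_name_by_word_count("John M. Doe Jr."))
print(parse_name_by_word_count("Dr. John Doe"))
```

The second call shows how presumptuous the rule is: "Dr. John Doe" has three words, so "Dr." is labeled as the first name.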

The Algorithmic Approach

Disadvantages

• Logic is very Presumptuous

• Relies on the data being well formed

• There will always be exceptions

The Algorithmic Approach

Exception: No Delimiters

123 Main St Los Angeles CA 90210 USA

ADDRESSLINE

CITY

STATE ZIP

COUNTRY

The Algorithmic Approach

Exceptions: Missing Elements

123 Main St, Los Angeles, CA 90210

ADDRESSLINE

CITY

STATE

The Algorithmic Approach

Exceptions: Missing Elements

123 Main St, Los Angeles, CA 90210

ADDRESSLINE

ZIP

The Algorithmic Approach

Exceptions: Unconventional Order

123 Main St, 90210 Los Angeles, CA USA

ADDRESSLINE

ZIP

CITY STATE

COUNTRY

The Algorithmic Approach

Exceptions: Missing Elements

(800) 800 - 6245

AREA CODE: 800

PREFIX: 800

SUFFIX: 6245

The Algorithmic Approach

Exception: Unexpected Elements

1 (800) 800 – 6245 x236

US COUNTRY CODE: 1

AREA CODE: 800

PREFIX: 800

SUFFIX: 6245

EXTENSION: x236

The Algorithmic Approach

Exceptions: Name

Exception        Name                 Form
Inverted Order   Doe John             Last First
Prefix           Dr. John Doe         Prefix First Last
Unknown Type     John                 First
Dual Name        John and Jane Doe    First1 First2 Last

The Algorithmic Approach

Exception: Combination of Domains

22382 Avenida Empresa, 92688 1 (800) 800 – 6245 Melissa Data joseph@melissadata.com

The Data Driven Approach

Advantages

• Bypasses many exceptions

• Does not rely on well formed data

• Having both an Understanding and Knowledge of Domains greatly improves Parsing Accuracy

The Data Driven Approach

Knowledge of State through Reference

123 Main St Los Angeles CA 90210 USA

State

AK

AZ

AR

CA

CO

CT

DE
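A rough sketch of "knowledge of State through reference": each token is looked up in a reference set instead of being located by position. The set below holds only the excerpt shown on the slide, and the function name is an assumption.

```python
# Excerpt of a state reference table (only the codes shown on the slide).
STATE_REFERENCE = {"AK", "AZ", "AR", "CA", "CO", "CT", "DE"}


def find_states(text, reference=STATE_REFERENCE):
    """Return (index, token) pairs for every token found in the reference."""
    return [
        (index, token)
        for index, token in enumerate(text.split())
        if token.upper() in reference
    ]


print(find_states("123 Main St Los Angeles CA 90210 USA"))
```

Because the match depends on the token's value, the state is found even though the line has no delimiters.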

The Data Driven Approach

Knowledge of Zip through Reference

123 Main St Los Angeles CA 90210 USA

Zip

90200

90207

90208

90210

90211

90215

90220
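The same idea applied to ZIP codes, as a sketch: a token is treated as the ZIP only if it appears in the reference set, and the line is split around it. The reference values are the ones listed on the slide and are treated here as illustrative data; the function name and result keys are hypothetical.

```python
# Excerpt of a ZIP reference table (values as listed on the slide).
ZIP_REFERENCE = {"90200", "90207", "90208", "90210", "90211", "90215", "90220"}


def split_on_zip(text, reference=ZIP_REFERENCE):
    """Locate the first token found in the ZIP reference and split around it."""
    tokens = text.split()
    for index, token in enumerate(tokens):
        if token in reference:
            return {
                "before": " ".join(tokens[:index]),
                "zip": token,
                "after": " ".join(tokens[index + 1:]),
            }
    return None  # no known ZIP in the text


print(split_on_zip("123 Main St Los Angeles CA 90210 USA"))
```

The lookup does not care where the ZIP sits, so an unconventional order such as "123 Main St, 90210 Los Angeles, CA USA" still yields the same ZIP.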

The Data Driven Approach

Knowledge of Area Code and Prefix through Reference

(800) 800 - 6245

Area Codes

714

866

855

800

877

909

Prefixes

672

682

692

800

822

872
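A sketch of how area code and prefix reference lists might drive phone parsing. The two sets contain only the values shown on the slide, the flat prefix list is a simplification (real prefixes belong to specific area codes), and the function name is an assumption.

```python
import re

# Reference excerpts as shown on the slide.
AREA_CODES = {"714", "866", "855", "800", "877", "909"}
PREFIXES = {"672", "682", "692", "800", "822", "872"}


def parse_phone_with_reference(text):
    """Strip formatting, then accept a split only if the reference data confirms it."""
    digits = re.sub(r"\D", "", text)
    # Try without a country code first, then with a leading "1".
    offsets = [0, 1] if digits.startswith("1") else [0]
    for offset in offsets:
        area = digits[offset:offset + 3]
        prefix = digits[offset + 3:offset + 6]
        suffix = digits[offset + 6:offset + 10]
        if area in AREA_CODES and prefix in PREFIXES and len(suffix) == 4:
            return {
                "country": digits[:offset],
                "area": area,
                "prefix": prefix,
                "suffix": suffix,
                "extension": digits[offset + 10:],
            }
    return None


print(parse_phone_with_reference("(800) 800 - 6245"))
print(parse_phone_with_reference("1 (800) 800 – 6245 x236"))
```

Both the missing country code and the unexpected extension are handled, because the decision rests on known area codes and prefixes rather than on punctuation.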

The Data Driven Approach

Knowledge of First Names through Reference

Vertido Joseph

First Name

John

James

Jerry

Joseph

Jeffrey

Jeremy
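A sketch of using a first-name reference to resolve an inverted name such as "Vertido Joseph". The set holds only the names shown on the slide; the function name and the fallback to First Last order are assumptions.

```python
# Excerpt of a first-name reference table (names shown on the slide).
FIRST_NAMES = {"john", "james", "jerry", "joseph", "jeffrey", "jeremy"}


def parse_two_word_name(text, first_names=FIRST_NAMES):
    """Decide first/last order for a two-word name using the reference."""
    # Raises ValueError if the input is not exactly two words.
    a, b = text.replace(",", " ").split()
    a_known = a.lower() in first_names
    b_known = b.lower() in first_names
    if b_known and not a_known:
        return {"first": b, "last": a}  # inverted order detected
    # Fall back to First Last when the reference does not settle it.
    return {"first": a, "last": b}


print(parse_two_word_name("Vertido Joseph"))
```

When both words or neither word appear in the reference, the data alone cannot decide and the sketch falls back to the conventional order.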

Problem

Data alone is not enough

123 Main Ct Hartford CT 06154 USA

State

AK

AZ

AR

CA

CO

CT

DE

Intelligent Parsing: The Combined Approach

By using both Logic and Data, we can develop a more Robust and Intelligent way to parse Contact Data and overcome exceptions that would otherwise cause problems.
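A minimal sketch of combining the two, using the "123 Main Ct Hartford CT 06154 USA" case: the reference data proposes candidates ("Ct" and "CT" both match), and a simple positional rule picks between them. The rule used here (the state is the candidate directly followed by a five-digit token) is an assumption for illustration, not the Melissa Data implementation.

```python
import re

# Excerpt of a state reference table (only the codes shown on the slide).
STATE_REFERENCE = {"AK", "AZ", "AR", "CA", "CO", "CT", "DE"}


def find_state_index(tokens, reference=STATE_REFERENCE):
    """Data proposes candidates; logic picks the one next to a ZIP-shaped token."""
    candidates = [i for i, tok in enumerate(tokens) if tok.upper() in reference]
    for i in candidates:
        if i + 1 < len(tokens) and re.fullmatch(r"\d{5}", tokens[i + 1]):
            return i
    # No candidate sits before a ZIP: fall back to the last match, if any.
    return candidates[-1] if candidates else None


tokens = "123 Main Ct Hartford CT 06154 USA".split()
index = find_state_index(tokens)
print(index, tokens[index])
```

Here "Ct" (Court) at position 2 is rejected because "Hartford" follows it, while "CT" at position 4 is accepted because "06154" follows it.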

DEMO

Melissa Data Solution to Intelligent Parsing

Recap

Algorithmic Approach

Reference Data Approach

Intelligent Parsing using both Algorithms/Logic and Reference Data through the Melissa Data Components.

Joseph Vertido

joseph@melissadata.com

800 800 6245 x827

Download the Data Quality Components for SSIS

ASK ABOUT OUR MVP PROGRAM

Thank You!

View our Other Webinars
