DATA DRIVEN PARSING
Impact of Reference Data on Contact Data Parsing
UNDERSTANDING VS. KNOWLEDGE
The Difference Between an Algorithmic Approach and a Data Driven Approach
Definition
PARSING
Analyze (a string or text) into logical syntactic components.
* The process of detecting and extracting individual components of a string into their respective and specific Domains.
The Algorithmic Approach
Understanding Addresses
123 Main St, Los Angeles, CA 90210 USA
ADDRESSLINE
CITY
STATE ZIP
COUNTRY
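The slide above can be read as a purely positional rule: split on commas, then split the tail on spaces. The following is a minimal sketch of that idea, not the presenter's implementation; the function name and the returned keys are hypothetical.

```python
def parse_address_by_delimiters(text):
    """Positional parse of 'LINE, CITY, STATE ZIP COUNTRY'.

    Hypothetical helper: it assumes exactly two commas and exactly
    three space-separated tokens after the city. Returns None otherwise.
    """
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3:
        return None
    address_line, city, tail = parts
    tail_tokens = tail.split()
    if len(tail_tokens) != 3:
        return None
    state, zip_code, country = tail_tokens
    return {
        "address_line": address_line,
        "city": city,
        "state": state,
        "zip": zip_code,
        "country": country,
    }


result = parse_address_by_delimiters("123 Main St, Los Angeles, CA 90210 USA")
```

Because the rule depends entirely on the commas, the same address written without delimiters returns None.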
The Algorithmic Approach
Understanding Phone Numbers
1 (800) 800 - 6245
US COUNTRY CODE
AREA CODE
PREFIX
SUFFIX
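A pattern-based version of this slide might look like the sketch below. The regular expression is an assumption made for illustration: it expects a one-digit country code, a parenthesized area code, and a hyphen between prefix and suffix, exactly as in the example string.

```python
import re

# Hypothetical pattern: country code, (area code), prefix - suffix.
PHONE_PATTERN = re.compile(r"^\s*(\d)\s*\((\d{3})\)\s*(\d{3})\s*-\s*(\d{4})\s*$")


def parse_phone(text):
    """Return the four components, or None if the layout differs."""
    match = PHONE_PATTERN.match(text)
    if match is None:
        return None
    country_code, area_code, prefix, suffix = match.groups()
    return {
        "country_code": country_code,
        "area_code": area_code,
        "prefix": prefix,
        "suffix": suffix,
    }


parsed = parse_phone("1 (800) 800 - 6245")
```

A number without the leading country code, or with a trailing extension, does not fit this pattern and comes back as None.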
The Algorithmic Approach
Understanding Names
TECHNIQUES
• Use of Word Counts
Condition            Name              Form
2 words              John Doe          First Last
2 words with comma   Doe, John         Last First
3 words              John M. Doe       First Middle Last
4 words              John M. Doe Jr.   First Middle Last Suffix
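The word-count technique in the table can be sketched as a lookup from word count to name form. This is an illustrative reading of the slide; the function name and the lowercase field labels are hypothetical.

```python
# Name form chosen purely by the number of words, as in the table.
FORMS_BY_WORD_COUNT = {
    2: ("first", "last"),
    3: ("first", "middle", "last"),
    4: ("first", "middle", "last", "suffix"),
}


def parse_name_by_word_count(text):
    """Assign name parts by position; returns None for other word counts."""
    words = text.split()
    # "2 words with comma" case: the comma marks an inverted name.
    if len(words) == 2 and words[0].endswith(","):
        return {"first": words[1], "last": words[0].rstrip(",")}
    labels = FORMS_BY_WORD_COUNT.get(len(words))
    if labels is None:
        return None
    return dict(zip(labels, words))


full = parse_name_by_word_count("John M. Doe Jr.")
inverted = parse_name_by_word_count("Doe, John")
# A title is silently mislabeled because only the count is inspected.
mislabeled = parse_name_by_word_count("Dr. John Doe")
```

The last call shows how presumptuous the rule is: "Dr." lands in the first-name slot because the input has three words.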
The Algorithmic Approach
Disadvantages
• Logic is very presumptuous
• Relies on the data being well formed
• There will always be exceptions
The Algorithmic Approach
Exception: No Delimiters
123 Main St Los Angeles CA 90210 USA
ADDRESSLINE
CITY
STATE ZIP
COUNTRY
The Algorithmic Approach
Exceptions: Missing Elements
123 Main St, Los Angeles, CA 90210
ADDRESSLINE
CITY
STATE
The Algorithmic Approach
Exceptions: Missing Elements
123 Main St, Los Angeles, CA 90210
ADDRESSLINE
ZIP
The Algorithmic Approach
Exceptions: Unconventional Order
123 Main St, 90210 Los Angeles, CA USA
ADDRESSLINE
ZIP
CITY STATE
COUNTRY
The Algorithmic Approach
Exceptions: Missing Elements
(800) 800 - 6245
AREA CODE
PREFIX
SUFFIX
The Algorithmic Approach
Exception: Unexpected Elements
1 (800) 800 – 6245 x236
US COUNTRY CODE
AREA CODE
PREFIX
SUFFIX
EXTENSION
The Algorithmic Approach
Exceptions: Name
Exception        Name                Form
Inverted Order   Doe John            Last First
Prefix           Dr. John Doe        Prefix First Last
Unknown Type     John                First
Dual Name        John and Jane Doe   First1 First2 Last
The Algorithmic Approach
Exception: Combination of Domains
22382 Avenida Empresa, 92688 1(800) 800 –6245 Melissa Data [email protected]
The Data Driven Approach
Advantages
• Bypasses many exceptions
• Does not rely on well formed data
• Having both an Understanding and Knowledge of Domains greatly improves Parsing Accuracy
The Data Driven Approach
Knowledge of State through Reference
123 Main St Los Angeles CA 90210 USA
State
AK
AZ
AR
CA
CO
CT
DE
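Knowing the set of valid state codes lets a parser find the state without any delimiters. The sketch below assumes only the excerpt of codes shown on the slide; a real reference table would hold the full list, and the function name is hypothetical.

```python
# Excerpt of the state reference list shown on the slide (not complete).
STATE_CODES = {"AK", "AZ", "AR", "CA", "CO", "CT", "DE"}


def find_state_tokens(text):
    """Return (position, token) for every token found in the reference set."""
    tokens = text.replace(",", " ").split()
    return [(i, tok) for i, tok in enumerate(tokens) if tok.upper() in STATE_CODES]


matches = find_state_tokens("123 Main St Los Angeles CA 90210 USA")
```

Here the lookup singles out "CA" at token position 5 even though the input has no commas.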
The Data Driven Approach
Knowledge of Zip through Reference
123 Main St Los Angeles CA 90210 USA
Zip
90200
90207
90208
90210
90211
90215
90220
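With more than one reference list available, each token can be tagged with the domain it belongs to. This is a rough sketch using only the State and Zip excerpts from the slides; the table layout and the function name are assumptions.

```python
# Reference excerpts from the slides, keyed by domain (not complete lists).
REFERENCE = {
    "STATE": {"AK", "AZ", "AR", "CA", "CO", "CT", "DE"},
    "ZIP": {"90200", "90207", "90208", "90210", "90211", "90215", "90220"},
}


def tag_tokens(text):
    """Pair each token with the first domain whose reference set contains it."""
    tagged = []
    for token in text.replace(",", " ").split():
        domain = next(
            (name for name, values in REFERENCE.items() if token.upper() in values),
            None,  # token is unknown to every reference set
        )
        tagged.append((token, domain))
    return tagged


tagged = tag_tokens("123 Main St Los Angeles CA 90210 USA")
```

Tokens that match no reference set keep a domain of None, so they can be left for other rules to classify.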
The Data Driven Approach
Knowledge of Area Code and Prefix through Reference
(800) 800 - 6245
Area Codes
714
866
855
800
877
909
Prefixes
672
682
692
800
822
872
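Reference lists can also confirm that the digits in a phone number are what their position suggests. The sketch below treats the first list on the slide as area codes and assumes the second list holds prefixes; both are excerpts, and the function name is hypothetical.

```python
import re

# Excerpts from the slide. Treating the second list as prefixes is an assumption.
AREA_CODES = {"714", "866", "855", "800", "877", "909"}
PREFIXES = {"672", "682", "692", "800", "822", "872"}


def parse_phone_with_reference(text):
    """Split a 10 or 11 digit number and validate it against reference data."""
    digits = re.sub(r"\D", "", text)  # keep digits only
    country_code = None
    if len(digits) == 11 and digits[0] == "1":
        country_code, digits = "1", digits[1:]
    if len(digits) != 10:
        return None
    area_code, prefix, suffix = digits[:3], digits[3:6], digits[6:]
    if area_code not in AREA_CODES or prefix not in PREFIXES:
        return None
    return {
        "country_code": country_code,
        "area_code": area_code,
        "prefix": prefix,
        "suffix": suffix,
    }


without_country = parse_phone_with_reference("(800) 800 - 6245")
with_country = parse_phone_with_reference("1 (800) 800 - 6245")
```

The missing country code no longer breaks the parse, since the components are confirmed by lookup instead of by punctuation.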
The Data Driven Approach
Knowledge of First Names through Reference
Vertido Joseph
First Name
John
James
Jerry
Joseph
Jeffrey
Jeremy
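A first-name reference list makes it possible to spot an inverted name that word counts alone would misread. This is a minimal sketch using the excerpt from the slide; the function name and the fallback to First Last order are assumptions.

```python
# Excerpt of the first-name reference list shown on the slide.
FIRST_NAMES = {"john", "james", "jerry", "joseph", "jeffrey", "jeremy"}


def parse_two_word_name(text):
    """Use the reference list to decide which of two words is the first name."""
    words = text.replace(",", " ").split()
    if len(words) != 2:
        return None
    left, right = words
    left_known = left.lower() in FIRST_NAMES
    right_known = right.lower() in FIRST_NAMES
    if right_known and not left_known:
        # Only the second word is a known first name: treat as Last First.
        return {"first": right, "last": left}
    # Otherwise fall back to the conventional First Last order.
    return {"first": left, "last": right}


inverted = parse_two_word_name("Vertido Joseph")
regular = parse_two_word_name("John Doe")
```

"Vertido Joseph" is resolved as Last First because only "Joseph" appears in the reference list.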
Problem
Data alone is not enough
123 Main Ct Hartford CT 06154 USA
State
AK
AZ
AR
CA
CO
CT
DE
Intelligent Parsing: The Combined Approach
By using both Logic and Data, we can develop a more Robust and Intelligent way to parse Contact Data and overcome exceptions that would otherwise cause problems.
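The "Main Ct Hartford CT" example shows why both are needed: a state lookup alone returns two candidates. One way to combine the two, sketched here under the assumption that a state code normally sits directly before the ZIP, is to let a positional rule choose among the reference matches. The function name and the fallback are hypothetical, and this is not a description of the Melissa Data components.

```python
import re

# Excerpt of the state reference list shown on the slide.
STATE_CODES = {"AK", "AZ", "AR", "CA", "CO", "CT", "DE"}
ZIP_PATTERN = re.compile(r"^\d{5}$")  # logic: a ZIP looks like five digits


def resolve_state(text):
    """Pick the state candidate that directly precedes a ZIP-like token."""
    tokens = text.replace(",", " ").split()
    # Data: every token that matches the state reference list.
    candidates = [i for i, tok in enumerate(tokens) if tok.upper() in STATE_CODES]
    # Logic: positions of tokens shaped like a ZIP code.
    zip_positions = {i for i, tok in enumerate(tokens) if ZIP_PATTERN.match(tok)}
    for i in candidates:
        if i + 1 in zip_positions:
            return (i, tokens[i])
    # Fallback (an assumption): use the last candidate, if any.
    return (candidates[-1], tokens[candidates[-1]]) if candidates else None


state = resolve_state("123 Main Ct Hartford CT 06154 USA")
```

Both "Ct" and "CT" match the reference list, but only the second one is followed by a ZIP, so it is the one reported as the state.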
DEMO
Melissa Data Solution to Intelligent Parsing
Recap
Algorithmic Approach
Reference Data Approach
Intelligent Parsing using both Algorithms/Logic and Reference Data through the Melissa Data Components.
Joseph [email protected]
800 800 6245 x827
Download the Data Quality Components for SSIS
Thank You!