Fuzzy Duplicates Analysis with ACL

Prepared by: Kevin Legere

Date: April 3rd, 2013

© 2012 ACL Services Ltd. ACL | Transforming Audit and Risk

Agenda

Overview

Example

FUZZYDUP command

OMIT() Function

Script Editor and RECOFFSET

Q&A

Overview

What is a "Fuzzy Duplicate"?
– Match based on criteria where the values are not exact but very close
  » EX: "ACL Services" and "ACL Service"

Typically used for:
  » Keyword matching
  » Invoice Number matching
  » Vendor Name matching*
  » Employee Name matching

Can be simple or complex
  » Completely depends on your approach and desired accuracy

* focus for this presentation

Overview

Simple Match Examples:
– Exact or 100% match
  » "ACL" = "ACL"
– Force Upper or Lower case
  » "ACL" = UPPER("acl")
  » "acl" = LOWER("ACL")
– Removal of special characters
  » "ACL" = EXCLUDE("*ACL." "!@#$%^&*().")
– Only compare numbers or letters
  » "ACL" = INCLUDE(UPPER("ACL123") "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
  » "123" = INCLUDE("ACL123" "1234567890")
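The simple matches above all come down to character filtering. As a rough illustration outside ACL, the same filters can be sketched in Python. The helper names `include` and `exclude` are hypothetical stand-ins written for this sketch; they only imitate the behavior the slide shows and are not ACL code.

```python
import string


def include(text: str, allowed: str) -> str:
    """Keep only the characters of `text` that appear in `allowed`."""
    return "".join(ch for ch in text if ch in allowed)


def exclude(text: str, banned: str) -> str:
    """Drop every character of `text` that appears in `banned`."""
    return "".join(ch for ch in text if ch not in banned)


# Mirror the three filtering examples from the slide.
no_specials = exclude("*ACL.", "!@#$%^&*().")
letters_only = include("ACL123".upper(), string.ascii_uppercase)
digits_only = include("ACL123", string.digits)

print(no_specials, letters_only, digits_only)
```

Each helper walks the string once and tests membership character by character, which is enough for short fields such as vendor names.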

Overview

Complex Match Examples:
– Removal of company type indicators (LLC, INC, LTD, etc.)
  » "ACL Services Ltd." = "ACL Services"
– Percent of word match, AKA letter by letter
  » "ACL Services" vs. "ACL Service"
    • 11/12 character match, or about a 91.7% match
– Word by Word*
  » "ACL Services" vs. "ACL Champions"
    • "ACL" matches "ACL"
    • "Services" does not match "Champions"
    • = 50% match
– Levenshtein distance
– Sounds like
– NYSIIS

*Most used by ACL Consultants
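The letter-by-letter and word-by-word percentages above can be reproduced with a short Python sketch. It assumes a position-by-position comparison divided by the longer length, which matches the slide's arithmetic but may differ from how a given ACL script scores matches. Both function names are hypothetical.

```python
def char_match_pct(a: str, b: str) -> float:
    """Percent of positions holding the same character, relative to the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / longest


def word_match_pct(a: str, b: str) -> float:
    """Percent of word positions holding the same word (case-insensitive)."""
    words_a, words_b = a.upper().split(), b.upper().split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 100.0
    same = sum(1 for x, y in zip(words_a, words_b) if x == y)
    return 100.0 * same / longest


print(round(char_match_pct("ACL Services", "ACL Service"), 1))  # 11 of 12 characters
print(word_match_pct("ACL Services", "ACL Champions"))          # 1 of 2 words
```

A positional comparison is sensitive to a missing character near the start of a name, which is one reason the deck later moves to Levenshtein distance.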

Vendor Master Analysis

Fuzzy Duplicates on Vendor Name
– Possible Risk
  » Payments are being sent to the same vendor under more than one vendor record
– May not involve risk. The goal may simply be to normalize the vendor master list so that duplicates do not exist.
  » Ideally, each vendor should exist once in your vendor master list, with one or more address records in your vendor address table

Sample file contains 75 vendors
– Only Vendor Code and Vendor Name

Where do you start for Vendor Name matching?
– Look for exact duplicates
– Focus on Simple matching
– Sort or Summarize!

[Figure: Sample Vendor Master]

Step 1: Summarize your Vendor Master File
  » Choose Vendor Name as your key field
  » Add Vendor Code as the Other Fields for Summarizing
  » Be sure to check "Presort"
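For readers without ACL at hand, the idea behind the Summarize step (group on Vendor Name, sorted, with a count per name) can be sketched in Python. The sample rows are invented for illustration, and the sketch keeps every Vendor Code per name, which is a simplification and not a claim about what ACL's Summarize output contains.

```python
from collections import defaultdict


def summarize(records):
    """Group (vendor_code, vendor_name) rows by exact name, sorted by name."""
    groups = defaultdict(list)
    for code, name in records:
        groups[name].append(code)
    # Sorting the keys plays the role of the "Presort" option.
    return [(name, len(codes), codes) for name, codes in sorted(groups.items())]


# Hypothetical sample rows, not taken from the presentation's data file.
sample = [
    ("V001", "ACL Services"),
    ("V002", "Acme Supply"),
    ("V003", "ACL Services"),
]
summary = summarize(sample)
for name, count, codes in summary:
    print(name, count, codes)
```

Any name with a count above 1 is an exact duplicate and a natural starting point before moving on to fuzzier comparisons.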

Step 2: Quickly comb over the data to identify a common trend.
  » We will focus on this issue in the sample data:
  » Create a computed field that corrects the trend (or cleans the data).

Functions used in Default Value text box:

INCLUDE(UPPER(ALLTRIM(Vendor_Name)) 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')

Within ACL, the computed field will return the following:

Step 3: Perform a Duplicates Command on the computed field
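Steps 2 and 3 together amount to building a cleaned key and then looking for repeats of that key. A minimal Python sketch of that idea follows. The `clean_name` helper imitates the INCLUDE(UPPER(ALLTRIM(...))) expression from the computed field, and the sample rows are hypothetical; none of this is ACL code.

```python
import string
from collections import defaultdict


def clean_name(name: str) -> str:
    """Trim, upper-case, and keep only the letters A-Z."""
    return "".join(ch for ch in name.strip().upper() if ch in string.ascii_uppercase)


def duplicates_on_clean_name(records):
    """Return only the cleaned keys shared by more than one (code, name) row."""
    groups = defaultdict(list)
    for code, name in records:
        groups[clean_name(name)].append((code, name))
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Hypothetical sample rows for illustration only.
sample = [
    ("V001", "ACL Services Ltd."),
    ("V002", "acl services ltd"),
    ("V003", "Acme Supply"),
]
dups = duplicates_on_clean_name(sample)
for key, rows in dups.items():
    print(key, rows)
```

Because spaces, punctuation, and case are stripped before grouping, names that differ only in formatting collapse onto the same key.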

Results are as follows:

FUZZYDUP command

ACL 9.3 has new features that make Fuzzy Duplicate analysis easier
– FUZZYDUP command
– OMIT() function
– ISFUZZYDUP() function
– LEVDIST() function

Important parameters to understand
– Levenshtein Distance
– Difference Percentage

Syntax
– FUZZYDUP ON {key_field} <OTHER fields> {LEVDISTANCE value} <DIFFPCT value> <RESULTSIZE value> <EXACT> TO table_name

Example
– FUZZYDUP ON Vendor_Name OTHER ALL LEVDISTANCE 2 DIFFPCT 50 TO My_Results

Levenshtein Distance (LEVDISTANCE)
  » The number of single-character edits required to make the strings equal
    • EX: "Smith" and "Smythe" have a Levenshtein Distance of 2

Difference Percentage (DIFFPCT)
  » The threshold for percentage difference between two strings
    • EX: "Smith" and "Smythe" have a Percentage Difference of 40%
    • (2/5) * 100% = 40%, where 2 is the Levenshtein Distance and 5 is the length of "Smith"
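To make the two parameters concrete, here is a small Python sketch of Levenshtein distance and a difference percentage. Dividing by the length of the shorter string is an assumption chosen to match the slide's (2/5) arithmetic. The function names, including `is_fuzzy_dup`, are hypothetical and are not the ACL LEVDIST() or ISFUZZYDUP() functions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def difference_pct(a: str, b: str) -> float:
    """Distance as a percentage of the shorter string's length (assumed basis)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0 if a == b else float("inf")
    return 100.0 * levenshtein(a, b) / shorter


def is_fuzzy_dup(a: str, b: str, lev_distance: int, diff_pct: float) -> bool:
    """True when the pair falls within both thresholds."""
    return levenshtein(a, b) <= lev_distance and difference_pct(a, b) <= diff_pct


print(levenshtein("Smith", "Smythe"))             # one substitution plus one insertion
print(difference_pct("Smith", "Smythe"))
print(is_fuzzy_dup("Smith", "Smythe", 2, 50))
```

With the thresholds from the example command (distance 2, difference 50), this pair would qualify under the sketch's rules. Tightening either threshold narrows the matches.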

OMIT() Function

When Do I use OMIT()?
– When you want to refine fuzzy duplicate analysis
– Look for repeating strings you want to remove from your Vendor Name field

Syntax
– OMIT(string1, string2 <,case_sensitive>)
– Specify T to make substrings specified for removal case-sensitive, or F to ignore case

Example
– OMIT(Vendor_Name " Ltd, Inc, Corp, Corporation" F)
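The effect of stripping company-type indicators can be imitated in Python as a rough sketch. The `omit` helper below is hypothetical: it takes a Python list of substrings instead of ACL's single comma-separated string, and its default for `case_sensitive` is an assumption. Treat it as an illustration of the idea, not of OMIT() itself.

```python
import re


def omit(text: str, substrings, case_sensitive: bool = True) -> str:
    """Remove every occurrence of each substring, in the order given."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for sub in substrings:
        text = re.sub(re.escape(sub), "", text, flags=flags)
    return text


# Longer indicators are listed first so " Corp" does not cut " Corporation" short.
indicators = [" Corporation", " Corp", " Ltd", " Inc"]

print(omit("ACL Services Ltd", indicators, case_sensitive=False))
print(omit("Acme CORPORATION", indicators, case_sensitive=False))
```

In this sketch the order of the substrings matters: if " Corp" were removed first, "Acme Corporation" would be left as "Acmeoration".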

Script Editor and RECOFFSET

Contact Information

Kevin Legere
Implementation Consultant

ACL Services Ltd.
1550 Alberni Street, Vancouver, BC, Canada V6G 1A5
[email protected] | @aclkevin
www.acl.com/linkedin | www.acl.com/twitter | www.acl.com/facebook
