Data Quality Assurance in Telecommunications Databases
C.-M. Chen, M. Cochinwala
{chungmin, munir}@research.telcordia.com
Applied Research, Telcordia Technologies, Morristown, NJ 07960




Outline

– Operation Support Systems
– Data Quality Issues
– Record Matching Problem & Techniques
– Example Applications
– References


Operation Support Systems

Telecom carriers or service providers use multiple Operation Support Systems (OSSs) for:
– network configuration
– network engineering
– service provisioning
– network performance monitoring
– customer care & billing, etc.

Each OSS may maintain its own database using a DBMS


OSS Databases

The databases may overlap on the network entities they describe

The OSSs may use different models/schemas to describe the same entity


Data Quality Issues

Corrupted data: data that do not correctly reflect the properties/specifications of the modeled entities

Inaccurate data: discrepancy between the data and the real-world entity they model (e.g., outdated information)

Inconsistent data:
– records in different OSS databases referencing the same real-world entity do not agree on the attribute values of the entity
– an entity is represented by more than one record in a single OSS database


Example: DSL Deployment

Corrupted data: many of the Central Offices (COs) do not have correct location information (longitude, latitude)

Inaccurate data: the database says a port or bandwidth is available at the CO when in fact it is not (or vice versa)

Inconsistent data: the Service Provisioning database says a line is available while the Network Configuration database says otherwise (which one to believe?)


Why is data quality an issue?

Corrupted data disrupt/delay business deployment
Inaccurate data impede decision making
Inconsistent data degrade operational efficiency

Bottom line:
– increased cost
– decreased revenue
– customer dissatisfaction


Cost of Data Quality Assurance

To date, the practice of data quality assurance is mostly manual and labor-intensive

“ … it costs a typical telephone company in the mid-1990s $2 per line per year to keep the network database reasonably accurate.”

Now consider the scale and complexity of the networks brought about by IP and wireless technologies, and their effect on the cost of data quality

(Semi-) automatic tools are needed


Data Quality Assurance Techniques

Data correctness and accuracy:
– database validation and correction via autodiscovery tools

Data consistency (reconciliation):
– record matching (record linkage): identify records that correspond to the same real-world entity


Data Accuracy

To assess and correct the discrepancy between data in the database and the modeled entities in the real world

Autodiscovery tools:
– automatically probe network elements for configuration parameters
– reconstruct the network connection topology at different layers (from physical to application layers)

How to efficiently validate the data stored in the databases against the auto-discovered information?
– sampling
– graph algorithms


Record Matching Techniques [Cochinwala01b]

– Problem Definition
– Record Matching Phases
– Quality Metrics
– Searching Techniques (to reduce the search space)
– Matching Decision Rules (to determine matched pairs)


The Record Matching Problem

Goal: to identify records in the same or different databases that correspond to the same real-world entity

Two records that correspond to the same entity are called a matched pair

In a relational table, no two records can hold identical values on all attributes; a matched pair therefore consists of two different records that refer to the same entity

The same entity may be represented differently in different tables/databases, with no agreement on key value(s)


Three Phases of Reconciliation

Data Preparation: scrubbing and cleansing
Searching: reducing the search space
Matching: finding matched pairs


Data Preparation

Parsing
Data transformation
Standardization


Searching Problem

Consider reconciling tables A and B
Search space: P = A × B (Cartesian product)
M: matched pairs; U: unmatched pairs; P = M ∪ U

Problem: searching P requires |A|·|B| comparisons

Goal: reduce the search space to a smaller P′ ⊆ P such that M ⊆ P′


Searching Techniques

Heuristic: potentially matched records should fall into the same cluster

– Relational join operators: nested-loop join, merge-sort join, hash join, band join [Hernadez96]
– Blocking: Soundex code, NYSIIS [Newcombe88]
– Windowing (sorted neighborhood) [Hernadez98]
– Priority queue [MongeElkan97]
– Multi-pass merge-sort joins [Hernadez98]
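To make the windowing (sorted-neighborhood) idea concrete, here is a minimal Python sketch: sort the records on a blocking key and compare only records that fall inside a sliding window. The records, key function, and window size below are toy examples, not data from the paper.

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Sorted-neighborhood (windowing) search: sort records on a
    blocking key, then emit only pairs inside a sliding window of
    the given size, instead of all |A||B| Cartesian pairs."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        # compare each record with its (window - 1) following neighbors
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.append((rec, ordered[j]))
    return pairs

# Toy usage: sorting brings near-duplicate names next to each other.
names = ["Chen C.", "Chen C-M", "Cochinwala M.", "Smith J.", "Smyth J."]
candidates = sorted_neighborhood_pairs(names, key=str.lower, window=2)
```

With window=2 each record is compared only with its immediate neighbor, so 5 records yield 4 candidate pairs instead of 10; multi-pass variants re-run this with different sort keys to recover matches a single key would miss.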


String Matching Techniques

Edit distance: counts of insert, delete, replace [Manber89]

Smith-Waterman: dynamic programming to find minimum edit distance [SmithWaterman81]
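A minimal sketch of the dynamic-programming edit distance referenced above (counting insert, delete, and replace operations):

```python
def edit_distance(s, t):
    """Classic DP edit distance: minimum number of insert, delete,
    and replace operations needed to turn string s into string t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # replace cost
            d[i][j] = min(d[i - 1][j] + 1,            # delete
                          d[i][j - 1] + 1,            # insert
                          d[i - 1][j - 1] + cost)     # replace/keep
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```

The O(|s|·|t|) cost of this table is exactly why the pruning step later in the deck weighs edit-distance parameters by the product of average field lengths.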


Record Matching Techniques

Given a (reduced) space of pairs, how to determine whether a pair (x, y) is a match, i.e., records x and y refer to the same real-world entity?

Probabilistic approaches [Newcombe59]:
– Bayes test for minimum error
– Bayes test for minimum cost

Non-probabilistic approaches:
– Supervised learning (to generate decision rules) [Cochinwala01a]
– Equational theory model [Hernadez98]
– Distance-based techniques [DeySarkarDe98]


Bayes Test for Minimum Error

Cartesian product P = A × B = M ∪ U
For each record pair (a1, a2, …, an, b1, b2, …, bn) ∈ P, define
x = (x1, x2, …, xn), where
xi = 1 if ai = bi, and xi = 0 if ai ≠ bi

A priori probabilities and conditional density functions:
Class M: πM, p(x|M)
Class U: πU, p(x|U)

Unconditional density function:
p(x) = πM p(x|M) + πU p(x|U)


Bayes Test for Minimum Error (cont.)

Assume πM, πU, p(x|M), and p(x|U) are known

Bayes' theorem gives the posterior probability:
p(M|x) = πM p(x|M) / (πM p(x|M) + πU p(x|U))

Likelihood ratio decision rule: x is decided to belong to M iff
p(M|x) ≥ p(U|x) iff
πM p(x|M) ≥ πU p(x|U) iff
l(x) = p(x|M) / p(x|U) ≥ πU / πM

The test gives the minimum probability of error (misclassification)
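A numeric sketch of the minimum-error rule. The priors and per-field agreement probabilities below are invented for illustration, and the factored likelihood assumes the agreement indicators xi are conditionally independent, a common simplification not stated on the slide:

```python
def likelihood_ratio(x, p_given_M, p_given_U):
    """l(x) = p(x|M) / p(x|U), with conditionally independent
    agreement indicators x_i (illustrative simplifying assumption)."""
    num = den = 1.0
    for xi, mM, mU in zip(x, p_given_M, p_given_U):
        # mM = P(x_i = 1 | M), mU = P(x_i = 1 | U)
        num *= mM if xi == 1 else (1 - mM)
        den *= mU if xi == 1 else (1 - mU)
    return num / den

def decide_min_error(x, p_given_M, p_given_U, pi_M, pi_U):
    """Decide M iff l(x) >= pi_U / pi_M (Bayes test for minimum error)."""
    return "M" if likelihood_ratio(x, p_given_M, p_given_U) >= pi_U / pi_M else "U"

# Illustrative numbers, not from the slides:
p_M = [0.9, 0.8]   # P(field agrees | matched pair)
p_U = [0.1, 0.2]   # P(field agrees | unmatched pair)
print(decide_min_error([1, 1], p_M, p_U, pi_M=0.1, pi_U=0.9))  # M
```

Note how a small prior πM raises the threshold πU/πM: when matches are rare, the evidence must be correspondingly stronger.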


Bayes Test for Minimum Cost

Ci,j: cost of a class-j record being (mis)classified to class i
Conditional cost of x being classified to class M, given x:
cM(x) = cMM p(M|x) + cMU p(U|x)
Similarly,
cU(x) = cUM p(M|x) + cUU p(U|x)

Likelihood ratio decision rule: x is decided to belong to M iff
cM(x) ≤ cU(x) iff
l(x) = p(x|M) / p(x|U) ≥ (cMU − cUU) πU / ((cUM − cMM) πM)

The test gives the minimum cost


Supervised Learning

Take a small sample S from A × B
For every record pair s ∈ S, label it as M (matched) or U (unmatched)
Select a predictive model, along with associated decision rules, that best discriminates between classes M and U for the pairs in S
Apply the selected model to classify all pairs in A × B
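The workflow above can be sketched with a deliberately tiny "model": a single edit-distance threshold chosen to minimize training error on the labeled sample. This is a stand-in for the real model selection step (the paper tried CART, LDA, and vector quantization); the sample data are made up.

```python
def learn_threshold_rule(sample):
    """Pick the edit-distance threshold t that best separates the
    labeled 'M'/'U' pairs: classify a pair as matched iff d < t."""
    best_t, best_err = None, float("inf")
    for t in range(0, 11):                      # candidate thresholds
        # count pairs where the rule's verdict disagrees with the label
        err = sum((d < t) != (label == "M") for d, label in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Hand-labeled sample: (edit distance between the records, verdict)
sample = [(0, "M"), (1, "M"), (2, "M"), (5, "U"), (7, "U"), (9, "U")]
t = learn_threshold_rule(sample)   # then apply "d < t" to all of A x B
```

The point of the sketch: labeling is expensive, so it is done on a small S, while the cheap learned rule is what gets applied to the full Cartesian product.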


Arroyo: a data reconciliation tool [Cochinwala01a]

Matches customer records across two databases:
– wireline database: 860,000 records
– wireless database: 1,300,000 records

Methodology:
– pre-processing (application dependent)
– parameter space definition (application dependent)
– matching rule generation & pruning (learning, model selection)
– matching


Preprocessing (Application Dependent)

Elimination of stop-words:
– blanks, special characters, ...
– "Chung-Min Chen" becomes "ChungMin Chen"

Word re-ordering:
– "20 PO BOX" becomes "PO BOX 20"

Word substitution:
– "St." becomes "Street"
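The three preprocessing steps can be sketched as a small rule pipeline. The rule set here is an illustrative sample, not Arroyo's actual rules, and it additionally upper-cases everything for uniform comparison:

```python
import re

def preprocess(name):
    """Application-dependent cleanup in the spirit of the slide:
    substitution, stop-character removal, and word re-ordering."""
    s = name.upper()
    s = s.replace("ST.", "STREET")          # word substitution
    s = re.sub(r"[-.,]", "", s)             # drop special characters
    s = re.sub(r"\s+", " ", s).strip()      # collapse extra blanks
    m = re.match(r"(\d+) PO BOX$", s)       # word re-ordering
    if m:
        s = "PO BOX " + m.group(1)
    return s

print(preprocess("20 PO BOX"))       # PO BOX 20
print(preprocess("Chung-Min Chen"))  # CHUNGMIN CHEN
```

Ordering matters: the "St." substitution runs before punctuation stripping so the period is still there to match on.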


Parameter Space Definition (Application Dependent)

Six parameters used:
– edit distance between the Name fields
– edit distance between the Address fields
– length of Name field in Wireline
– length of Name field in Wireless
– length of Address field in Wireline
– length of Address field in Wireless


Matching Rule Generation

1. Learning Set Generation
Select records that contain the word "Falcon" in the Name field:
– Wireless: 241 records
– Wireline: 883 records
Choose the sample first from the database that is the image of the higher-degree "onto" mapping. For example, Wireline → Wireless has a higher "onto" degree than Wireless → Wireline.


Matching Rule Generation

2. Learning Set Labeling
Identify prospective matches:
– match if the edit distances on the Name and Address fields are both less than 3
– (30% false matches, but no misses of true matches)
Verdict assignment:
– each pair that is not a prospective match is assigned "unmatch"
– each prospective-match pair is manually examined and labeled as "match", "unmatch", or "ambiguous" (labor intensive)
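The coarse labeling filter is simple enough to sketch directly; the pair identifiers and distances below are made up for illustration:

```python
def label_learning_set(pairs, threshold=3):
    """Labeling filter from the slide: a pair whose Name and Address
    edit distances are both < threshold is a prospective match (sent
    to manual review); every other pair is labeled 'unmatch'."""
    prospective, unmatch = [], []
    for pair_id, name_d, addr_d in pairs:
        if name_d < threshold and addr_d < threshold:
            prospective.append(pair_id)
        else:
            unmatch.append(pair_id)
    return prospective, unmatch

# (pair id, Name edit distance, Address edit distance)
pairs = [("p1", 0, 1), ("p2", 2, 2), ("p3", 5, 0), ("p4", 4, 7)]
p, u = label_learning_set(pairs)   # p: prospective, u: unmatch
```

Deliberately loose thresholds (30% false matches, no missed true matches) keep the manual review set small without losing training signal.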


Matching Rule Generation

3. Model Selection
Input: 241 × 883 = 212,803 pairs with 7 fields (parameters):
– label (verdict)
– edit distance between Name fields
– edit distance between Address fields
– length of Name field in Wireless
– length of Name field in Wireline
– length of Address field in Wireless
– length of Address field in Wireline

Three models were tried, with cross-validation (half of the sample to build the model, half to estimate the error rate):

Model                          Error rate (average of 50 runs)
CART                           5.1%
Linear Discriminant Analysis   5.3%
Vector Quantization            9.4%


Matching Rule Generation

3. Model Selection (cont.)

LDA can produce rules that are inefficient to evaluate in the database, e.g.:
2*(Address edit distance) + 1.3*(Name length on Wireless) < 3 → match
– an index on Name is useless here

CART rules can be evaluated efficiently using existing database indices


Matching Rule Pruning

Determines which parameters (rules) to drop, to reduce tree complexity while retaining tree quality:
– maximize the delta of a cost function
– dynamic programming with a threshold

Complexity of tree T:
C(T) = Σp∈T C(p) / Σq∈S C(q),
where S is the full parameter set and C(p) is the complexity of computing the value of parameter p, e.g.
C(edit distance between fields A and B) = avg_length(A) * avg_length(B)
Note: 0 < C(T) ≤ 1


Matching Rule Pruning (cont.)

C(edit distance between fields A and B) = avg_length(A) * avg_length(B)

Parameter                    Avg length    Complexity
Address length (wireless)    6.88          6.88
Address length (wireline)    6.72          6.72
Name length (wireless)       10.87         10.87
Name length (wireline)       10.70         10.70
Edit distance on Addresses   –             6.88*6.72 = 46.23
Edit distance on Names       –             10.87*10.7 = 116.31
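The complexity table follows mechanically from the average field lengths; a short sketch (the key names are invented labels for the six parameters):

```python
def parameter_complexity(avg_len):
    """C(p) per the slide's rule: a length parameter costs its
    field's average length; an edit-distance parameter costs
    avg_length(A) * avg_length(B)."""
    c = {f"len_{field}": length for field, length in avg_len.items()}
    c["edit_dist_address"] = avg_len["addr_wireless"] * avg_len["addr_wireline"]
    c["edit_dist_name"] = avg_len["name_wireless"] * avg_len["name_wireline"]
    return c

avg_len = {"addr_wireless": 6.88, "addr_wireline": 6.72,
           "name_wireless": 10.87, "name_wireline": 10.70}
c = parameter_complexity(avg_len)
print(round(c["edit_dist_address"], 2))  # 46.23
print(round(c["edit_dist_name"], 2))     # 116.31
```

The asymmetry is the whole point: an edit distance is an order of magnitude more expensive than a length lookup, so pruning prefers to drop edit-distance parameters when quality permits.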


Matching Rule Pruning (cont.)

Misclassification rate of T:
M(T) = # misclassified pairs / # pairs in the test set
Note: 0 ≤ M(T) ≤ 1

Cost of tree T:
J(T) = w1*C(T) + w2*M(T),
where w1 and w2 are weights (we used w1 = w2 = 1)

Let T* = T − {all rules involving parameter p}, and define
ΔJ(T) = J(T) − J(T*) = (C(T) − C(T*)) + (M(T) − M(T*)) = ΔC(T) + ΔM(T)

Goal: find a set of parameters to drop that maximizes ΔJ(T)
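A greedy sketch of the pruning loop. For simplicity each parameter's ΔJ is treated as fixed here, whereas in the actual procedure ΔJ is re-evaluated after each drop, so this is an approximation of the behavior, not the exact algorithm:

```python
def greedy_prune(delta_j, threshold=0.01):
    """Repeatedly drop the parameter with the largest remaining
    delta-J gain, stopping once the best gain falls below the
    threshold (simplified: gains are not re-evaluated per drop)."""
    dropped = []
    remaining = dict(delta_j)
    while remaining:
        p, gain = max(remaining.items(), key=lambda kv: kv[1])
        if gain < threshold:
            break
        dropped.append(p)
        del remaining[p]
    return dropped

# delta-J values from the single-parameter pruning table
delta_j = {"edit_name": 0.5742, "edit_addr": 0.0498,
           "len_name_wireless": -0.0044, "len_name_wireline": 0.0381,
           "len_addr_wireless": 0.0266, "len_addr_wireline": 0.0269}
dropped = greedy_prune(delta_j)
```

Consistent with the table, the Name edit distance (the most expensive parameter with almost no accuracy contribution) is the first to go, while the Wireless Name length (negative ΔJ) is never dropped.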


Matching Rule Pruning (cont.)

Dynamic programming:
– find a parameter p1 that maximizes ΔJ(T)
– fix p1 and find a second parameter p2 such that {p1, p2} maximizes ΔJ(T)
– repeat this step until we reach a set of parameters {p1, p2, …, pk} such that ΔJ(T) is less than a threshold value

Single-parameter pruning:

Parameter                     ΔC(T)     ΔM(T)      ΔJ(T)
Edit dist. on Name            0.5882    -0.0140    0.5742
Edit dist. on Address         0.2338    -0.1840    0.0498
Len. of Name (Wireless)       0.0566    -0.0610    -0.0044
Len. of Name (Wireline)       0.0541    -0.0160    0.0381
Len. of Address (Wireless)    0.0347    -0.0081    0.0266
Len. of Address (Wireline)    0.0339    -0.0070    0.0269


Matching Rule Pruning (cont.)

Dual-parameter pruning:

2nd Parameter                 ΔC(T)     ΔM(T)      ΔJ(T)
Len. of Name (Wireline)       0.0541    -0.0088    0.0453
Len. of Address (Wireless)    0.0347    -0.0072    0.0275
Len. of Address (Wireline)    0.0339    -0.0091    0.0248
...


Matching Rule Pruning (cont.)

Reduced from 6 parameters (original tree) to 2 parameters (final tree)

An index on the Name field can be used to evaluate the tree

[Final tree: the root splits on Address edit distance (threshold 1.5); one branch leads directly to "matched", while the other splits on Wireless Name length (threshold 11.5) into "unmatched" and "ambiguous"]
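Evaluating a two-parameter tree of this shape is a pair of comparisons. The leaf assignment below is one plausible reading of the slide's figure, not taken verbatim from Arroyo:

```python
def classify(addr_edit_distance, wireless_name_length):
    """Two-parameter decision tree shaped like the final pruned tree.
    Thresholds (1.5, 11.5) are from the slide; which leaf gets which
    label is an illustrative assumption."""
    if addr_edit_distance < 1.5:
        return "matched"
    if wireless_name_length < 11.5:
        return "unmatched"
    return "ambiguous"
```

A rule of this shape is index-friendly: both tests are simple range predicates, and the Name-length test can be answered from an index on the Name field without touching the record.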


CART Matching Result

Prob(predicted as x | given y):

x \ y        Matched    Unmatched    Ambiguous
Matched      0.965      0.014        0.03
Unmatched    0.020      0.956        0.18
Ambiguous    0.015      0.030        0.79

– approximately 10% error rate
– reduced the computation cost of matching by 50%
– about 47% more correctly matched records than a leading commercial product


Major Telephone Operating Company

Problem:
– Mergers and acquisitions of wireless companies resulted in the RBOC's inability to determine common customers among its wireline and wireless businesses
– Customer databases for each business unit use different schemas and contain many quality problems
– The RBOC's experience with a commercial vendor's data reconciliation tool was unsatisfactory

Solution:
– Use small manually verified data samples (~100 records) to determine appropriate matching rules
– Use machine learning to prune the rules for efficient analysis of the large dataset
– Resulted in 30% more correct matches than the commercial tool


Large Media Conglomerate

Problem:
– The company provides magazines to wholesalers, who in turn provide magazines to retailers for distribution
– Reconciliation of wholesaler and retailer databases would make it easier to track where gaps in reporting are occurring
– Identify 'bad' retailers

Solution:
– Group by primary keys
– Match by secondary keys
– E.g., 3000 CVS pharmacies are grouped and compared by zip code and street address to identify 'bad' pharmacies
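The group-by-primary-key, match-by-secondary-key pattern can be sketched as follows; the record layout and store data are invented examples:

```python
from collections import defaultdict

def group_then_match(records, primary_key, secondary_keys):
    """Group records sharing a primary key (e.g., chain name), then
    compare only within each group on the secondary keys (e.g., zip
    code and street address) to find duplicate/matching entries."""
    groups = defaultdict(list)
    for r in records:
        groups[r[primary_key]].append(r)
    matches = []
    for members in groups.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                if all(a[k] == b[k] for k in secondary_keys):
                    matches.append((a["id"], b["id"]))
    return matches

stores = [
    {"id": 1, "chain": "CVS", "zip": "07960", "street": "10 Main St"},
    {"id": 2, "chain": "CVS", "zip": "07960", "street": "10 Main St"},
    {"id": 3, "chain": "CVS", "zip": "07001", "street": "5 Oak Ave"},
]
print(group_then_match(stores, "chain", ["zip", "street"]))  # [(1, 2)]
```

Grouping first bounds the quadratic comparison to each group, the same search-space reduction idea as blocking earlier in the deck.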


International Government

Problem:
– Reconcile vital taxpayer data from several different sources
– Known problems include record duplication, address mismatches, address obsolescence, and distributed responsibility for database accuracy and updates
– Identify the causes of mistakes

Solution:
– Improve the process flows and architecture to allow rapid modification of pre-processing rules and matching rules
– Detect and classify likely causes of duplication
– Analysis and improvements reduced the number of records that required manual verification


ILEC-CLEC Billing Reconciliation

Problem:
– ILECs charge CLECs for use of network resources
– Verification of actual usage vs. charges (e.g., a customer changing providers)
– Identify actual usage and send verification to the ILEC
– Resource identification differs between the ILEC and the CLEC

Solution:
– Check the charges in the bill against actual usage
– Establish a common identification of resources (matching table)
– The solution has so far been implemented only for access line charges


Future Directions

Reduction of manual work:
– type-based standardization, rules
– rules and parameter pruning?

Database and tool:
– rule execution plan (e.g., check string length before computing edit distance)
– sharing/maintenance of indexes across steps in the process
– extending matching to structures (trees, graphs – circuits)
– matching across layers in networks (OSI layers)
– how often to discover/audit the network? – sampling techniques


References

[Cochinwala01a] M. Cochinwala, V. Kurien, G. Lalk and D. Shasha, "Efficient data reconciliation", Information Sciences, 137 (1-4), 2001.

[Cochinwala01b] M. Cochinwala, S. Dalal, A. Elmagarmid and V. Verykios, "Record matching: past, present and future", submitted for publication.

[DeySarkarDe98] D. Dey, S. Sarkar and P. De, "Entity matching in heterogeneous databases: a distance-based decision model", 31st Hawaii Int. Conf. on System Sciences, 1998.

[Hernadez96] M. Hernandez, "A generalization of band joins and the merge/purge problem", Ph.D. thesis, CS Dept., Columbia University, 1996.

[Hernadez98] M. Hernandez and S. Stolfo, "Real-world data is dirty: data cleansing and the merge/purge problem", Data Mining and Knowledge Discovery, 2 (1), 1998.

[Manber89] U. Manber, "Introduction to Algorithms", Addison-Wesley, 1989.

[MongeElkan97] A. Monge and C. Elkan, "An efficient domain-independent algorithm for detecting approximately duplicate database records", SIGMOD DMKD Workshop, 1997.

[Newcombe59] H. Newcombe, J. Kennedy, S. Axford and A. James, "Automatic linkage of vital records", Science, 130 (3381), 1959.

[Newcombe88] H. Newcombe, "Handbook of Record Linkage", Oxford University Press, 1988.

[SmithWaterman81] T. Smith and M. Waterman, "Identification of common molecular subsequences", J. Mol. Biol., 147, 1981.