Tomer Sagi and Avigdor Gal Technion - Israel Institute of Technology Non-binary Evaluation for Schema Matching Presentation @ ER 2012 October 2012, Florence Italy



Tomer Sagi and Avigdor Gal
Technion - Israel Institute of Technology

Non-binary Evaluation for Schema Matching

Presentation @ ER 2012
October 2012, Florence, Italy


Presentation Outline

Background
• Schema Matching

Schema Matching Evaluation
• Current model: Set-based Precision and Recall
• Proposed model: Similarity Spaces, a vector-space model
• Non-binary measures

Usage example
• Tuning schema matchers using a non-binary measure


Background
Schema Matching

Schema matching is the task of providing correspondences between concepts describing the meaning of data

Schema matching is recognized to be a basic operation required by data integration and web-query interface integration


Background
Schema Matching: Schemas

Schemas contain attributes

Each attribute may have a name, label, type, domain (allowed values), instances, etc.

Structural links and relationships are defined between attributes

[Figure: a web-form schema (fields: First Name, Last Name, Gender, "What are your favourite hobbies", Requested Password, Password confirmation, a "Yes" option and a Register button) and a small business-document schema with 5 concepts]


Background
Schema Matching: First Line Matchers

First line matchers (a.k.a. similarity measures) compare two schemas, generating correspondences between them.

Each correspondence is assigned a confidence value in [0,1]. The result is often represented as a similarity matrix:

[Figure: example similarity matrix; visible entries include 0.84, 0.64, 0.62, 0.35 and 0.32]
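To make the idea of a first line matcher concrete, here is a minimal sketch that builds a similarity matrix from attribute names. It is an illustration only, not one of the matchers used by the authors: the helper names `name_similarity` and `similarity_matrix` are hypothetical, and the string measure is Python's `difflib.SequenceMatcher`, chosen simply because it returns a value in [0,1]. The attribute names are borrowed from the deck's own hotel-booking example.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two attribute names (illustrative measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity_matrix(schema_a, schema_b):
    """One row per attribute of schema_a, one column per attribute of schema_b."""
    return [[name_similarity(a, b) for b in schema_b] for a in schema_a]

schema_a = ["clientNum", "city", "checkInDay"]
schema_b = ["cardNum", "city", "arrivalDay", "checkInTime"]

matrix = similarity_matrix(schema_a, schema_b)
for name, row in zip(schema_a, matrix):
    # Print each row with its attribute name and two-decimal confidences.
    print(f"{name:<12}" + " ".join(f"{value:.2f}" for value in row))
```

Any other measure with values in [0,1] (domain overlap, instance overlap, structural similarity) could be dropped in for `name_similarity` without changing the shape of the result.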


Background
Schema Matching: Second Line Matchers

Second line matchers transform similarity matrices.

Filters transform a matrix by removing those values which do not satisfy the constraint function. Examples: Threshold, MaxDelta.

[Figure: a similarity matrix and the transformed similarity matrix produced by a filter]


Background
Schema Matching: Second Line Matchers

Second line matchers transform similarity matrices.

Filters transform a matrix by removing those values which do not satisfy the constraint function. Examples: Threshold, MaxDelta.

Decision makers transform a matrix by changing the values of some correspondences to 1 and the rest to 0.

[Figure: a similarity matrix and the binary similarity matrix produced by a decision maker]
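The two kinds of second line matchers can be sketched as plain functions over a matrix. This is a hedged illustration: "removing" a value is modelled here as setting it to 0, the MaxDelta rule is assumed to keep entries within `delta` of the row maximum, and the row-maximum decision maker is just one simple choice. All function names are hypothetical; the matrix values are the ones used in the deck's example.

```python
def threshold_filter(matrix, t):
    """Filter: zero out every entry below the threshold t."""
    return [[v if v >= t else 0.0 for v in row] for row in matrix]

def max_delta_filter(matrix, delta):
    """Filter (assumed semantics): per row, keep entries within delta of the row maximum."""
    return [[v if v >= max(row) - delta else 0.0 for v in row] for row in matrix]

def row_max_decision(matrix):
    """Decision maker: set the largest entry of each row to 1 and the rest to 0."""
    result = []
    for row in matrix:
        best = row.index(max(row))
        result.append([1 if j == best else 0 for j in range(len(row))])
    return result

matrix = [
    [0.84, 0.32, 0.32, 0.30],
    [0.29, 1.00, 0.33, 0.30],
    [0.34, 0.33, 0.35, 0.64],
]

print(threshold_filter(matrix, 0.5))   # non-binary output, low values removed
print(max_delta_filter(matrix, 0.05))  # non-binary output, near-maximal values kept
print(row_max_decision(matrix))        # binary output
```

Note that the filters return a non-binary matrix while the decision maker returns a binary one, which is the distinction the evaluation model later relies on.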


Background
Schema Matching: Systems

Schema matching systems employ various first and second line matchers; their results are composed, aggregated and combined.

[Figure: system pipeline. A schema pair is fed to the 1st line matchers (String Matcher, Domain Matcher, Parent-Child Matcher, Instance Matcher), whose results pass through the 2nd line matchers (Aggregation, Filter, Decision maker). Boxes labelled "Schema Matching System" and "Uncertain Schema Matching System" delimit the pipeline; outputs are labelled "Binary Similarity Matrix".]


Schema Matching Evaluation
Current Model

The current evaluation model provides measures for evaluating a complete system using set-based measures.

Major shortcoming: evaluation of individual components (e.g. first line matchers) and of non-binary results (uncertain schema matching systems) is undefined.

[Figure: the schema matching system pipeline (Schema Pair; String, Domain, Parent-Child and Instance Matchers; Aggregation; Filter; Decision maker) produces a binary similarity matrix, which is compared with the exact match; regions are labelled True Positive (TP), False Negative (FN) and False Positive (FP)]

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \]
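The set-based measures can be computed directly from two sets of correspondences. The sketch below is a minimal illustration; the function name `precision_recall` and the `found` set are hypothetical, and the `exact` set uses attribute pairs from the deck's hotel-booking example.

```python
def precision_recall(found, exact):
    """Set-based precision and recall of a binary matching result."""
    tp = len(found & exact)   # correspondences that are both found and correct
    fp = len(found - exact)   # found but not in the exact match
    fn = len(exact - found)   # in the exact match but not found
    precision = tp / (tp + fp) if found else 0.0
    recall = tp / (tp + fn) if exact else 0.0
    return precision, recall

exact = {("clientNum", "cardNum"), ("city", "city"), ("checkInDay", "checkInTime")}
# Hypothetical system output: two correct correspondences and one wrong one.
found = {("clientNum", "cardNum"), ("city", "city"), ("checkInDay", "arrivalDay")}

p, r = precision_recall(found, exact)
print(f"precision={p:.2f} recall={r:.2f}")
```

Because the inputs are sets, a correspondence is either in or out; there is no way to feed this function a confidence of 0.64, which is the limitation the vector-space model addresses.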


Schema Matching Evaluation
Similarity Spaces: A Vector Space Model

Begin with a similarity matrix between schemas S1 and S2:

               1 cardNum   2 city   3 arrivalDay   4 checkInTime
1 clientNum      0.84       0.32       0.32           0.30
2 city           0.29       1.00       0.33           0.30
3 checkInDay     0.34       0.33       0.35           0.64

Taking each entry as an element in a vector (here, column by column) transforms this matrix into a similarity vector:

(0.84, 0.29, 0.34, 0.32, 1.00, 0.33, 0.32, 0.33, 0.35, 0.30, 0.30, 0.64)
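The matrix-to-vector step is a plain flattening. The vector shown on the slide corresponds to reading the matrix column by column, which the following sketch reproduces; the function name `to_similarity_vector` is hypothetical.

```python
def to_similarity_vector(matrix):
    """Flatten a similarity matrix column by column into a similarity vector."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[i][j] for j in range(cols) for i in range(rows)]

matrix = [
    [0.84, 0.32, 0.32, 0.30],  # clientNum
    [0.29, 1.00, 0.33, 0.30],  # city
    [0.34, 0.33, 0.35, 0.64],  # checkInDay
]

vector = to_similarity_vector(matrix)
print(vector)
```

The ordering itself is arbitrary; what matters is that every matching result for the same schema pair is flattened the same way, so that each position always refers to the same attribute pair.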


Schema Matching Evaluation
Similarity Spaces: A Vector Space Model

We propose a vector space model for evaluation:
• Each dimension is a possible correspondence between a pair of attributes
• Vectors are matching results

Two matching results over the same attribute pairs of S1 and S2, one binary and one non-binary:

               1 cardNum   2 city   3 arrivalDay   4 checkInTime
1 clientNum        1          0          0              0
2 city             0          1          0              0
3 checkInDay       0          0          0              1

               1 cardNum   2 city   3 arrivalDay   4 checkInTime
1 clientNum      0.84       0.32       0.32           0.30
2 city           0.29       1.00       0.33           0.30
3 checkInDay     0.34       0.33       0.35           0.64


The Schema Matching Evaluation Problem

[Figure: a grid of evaluation settings organized by K (0, 1, 2, >2), by whether the evaluation is informed (Yes / No), and by result type (Binary / Non-Binary)]

The area in green marks where most research has focused to date. Areas in yellow designate limited work done.


Schema Matching Evaluation
Similarity Spaces: A Vector Space Model

Over this vector space, evaluation functions are defined.

For example, the well-known precision and recall functions are functions of two vectors (v = outcome of a decision maker, v_e = exact match). For v = (0, 1) and v_e = (1, 1):

\[ P(v, v_e) = \frac{v \cdot v_e}{\lVert v \rVert_1} = \frac{1}{1} = 1, \qquad R(v, v_e) = \frac{v \cdot v_e}{\lVert v_e \rVert_1} = \frac{1}{2} = 0.5 \]


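Schema Matching Evaluation
Non-binary measures

Accommodating non-binary evaluation is now trivial: allow v to be non-binary. For v = (0.64, 0.35) and v_e = (1, 1):

\[ P(v, v_e) = \frac{v \cdot v_e}{\lVert v \rVert_1} = \frac{0.99}{0.99} = 1, \qquad R(v, v_e) = \frac{v \cdot v_e}{\lVert v_e \rVert_1} = \frac{0.99}{2} \approx 0.49 \]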
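The vector formulation can be sketched in a few lines. The code below assumes that precision and recall divide the dot product of v and v_e by the sum of the entries of v and of v_e respectively; this normalization is an assumption, chosen because it reduces to the set-based measures for binary vectors and yields the values 0.5 and 0.49 that appear in the slide examples. The function names `nb_precision` and `nb_recall` are hypothetical.

```python
def dot(u, w):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, w))

def nb_precision(v, v_e):
    """Assumed definition: (v . v_e) divided by the sum of the entries of v."""
    total = sum(v)
    return dot(v, v_e) / total if total else 0.0

def nb_recall(v, v_e):
    """Assumed definition: (v . v_e) divided by the sum of the entries of v_e."""
    total = sum(v_e)
    return dot(v, v_e) / total if total else 0.0

v_e = [1, 1]               # exact match
binary = [0, 1]            # outcome of a decision maker
non_binary = [0.64, 0.35]  # outcome of a first line matcher or a filter

print(nb_precision(binary, v_e), nb_recall(binary, v_e))
print(nb_precision(non_binary, v_e), nb_recall(non_binary, v_e))
```

The same two functions accept both inputs unchanged, which is the point of the slide: once results are vectors, no decision maker is needed before a measure can be computed.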


Schema Matching Evaluation
Non-binary measures - Implications

[Figure: the schema matching system pipeline (Schema Pair; 1st line matchers: String Matcher, Domain Matcher, Parent-Child Matcher, Instance Matcher; 2nd line matchers: Aggregation, Filter, Decision maker), annotated with the implications below]

Implications:
• Can evaluate individual 1st line matchers
• Can evaluate interim results
• Can evaluate uncertain results


Match Distance
Sometimes you need a metric…

We define two complementary distance metrics:


Match Distance
Behavior vs. NBPrecision and NBRecall

Results of synthetic evaluation

[Figure panels: "Nonsense solution of increasing magnitude"; "Noisy solution with increasingly strict filter applied"]


Schema Matching Evaluation
Predictors

EXACT MATCH

What if you don’t have an exact match?

In most applications, this is the case…


Schema Matching Evaluation
Predictors

Predictors are a special class of schema matching evaluation methods that do not use an exact match as part of their input.

Predictors can be classified into two subclasses:
• Internalizers refer to the internal structure of a vector as an indication of match quality (e.g. max, stdev, average)
• Idealizers assume the existence of an ideal vector and compare the result with it
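Both subclasses of predictors can be sketched without an exact match. The internalizers below are the three statistics named on the slide. The idealizer is a purely hypothetical example: it assumes the ideal vector is obtained by rounding the result at 0.5 and compares the two by cosine similarity; neither choice is taken from the deck.

```python
from math import sqrt
from statistics import mean, pstdev

def internalizer_scores(vector):
    """Internalizers: statistics computed from the vector alone."""
    return {"max": max(vector), "average": mean(vector), "stdev": pstdev(vector)}

def idealizer_score(vector, ideal):
    """Idealizer (hypothetical): cosine similarity between the result and an ideal vector."""
    dot = sum(a * b for a, b in zip(vector, ideal))
    norms = sqrt(sum(a * a for a in vector)) * sqrt(sum(b * b for b in ideal))
    return dot / norms if norms else 0.0

vector = [0.84, 0.29, 0.34, 0.32, 1.00, 0.33, 0.32, 0.33, 0.35, 0.30, 0.30, 0.64]
# Assumed ideal vector: the result rounded to 0/1 at 0.5 (illustrative choice only).
ideal = [1.0 if x >= 0.5 else 0.0 for x in vector]

scores = internalizer_scores(vector)
closeness = idealizer_score(vector, ideal)
print(scores, round(closeness, 3))
```

A result whose high values stand out clearly from the rest scores well on both kinds of predictor, which is the intuition behind using them as a stand-in when no exact match is available.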


Schema Matching Evaluation
Idealizers

(Ideal Vector)


Schema Matching Evaluation
Desired Properties of Predictors

Desired design properties:
• Tunable: We should be able to tune predictors towards the desired quality in a specific scenario.
• Generalizable: Predictors should be based upon principles which are applicable at several levels of granularity and can be specialized to some levels of granularity.

Desired empirical properties:
• Correlated: Well correlated with the quality they are designed to evaluate
• Robust: Correlations are robust and statistically significant over varied matching systems and datasets


Using Prediction
Tunable Prediction Models

Loosely correlated predictors can be composed into a model. The weights of its participating predictors can be tuned. The model is constructed by (multiple) step-wise regression.
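One way to picture the construction is forward step-wise regression: start with no predictors, repeatedly add the one that reduces the squared error most, and stop when the gain becomes negligible. The sketch below is a simplified illustration under stated assumptions: it uses a fixed error-gain cutoff instead of a significance test, the predictor values and the quality values are made-up training data, and all names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(columns, y):
    """Least-squares weights (intercept first) via the normal equations."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    return solve(xtx, xty)

def sse(columns, y):
    """Sum of squared errors of the least-squares fit on the given columns."""
    w = fit(columns, y)
    return sum(
        (y[i] - w[0] - sum(wj * col[i] for wj, col in zip(w[1:], columns))) ** 2
        for i in range(len(y))
    )

def forward_stepwise(predictors, y, min_gain=1e-6):
    """Greedily add the predictor that lowers the error most; stop on negligible gain."""
    chosen, best_err = [], sse([], y)
    while len(chosen) < len(predictors):
        errs = {
            name: sse([predictors[n] for n in chosen + [name]], y)
            for name in predictors if name not in chosen
        }
        best = min(errs, key=errs.get)
        if best_err - errs[best] < min_gain:
            break
        chosen.append(best)
        best_err = errs[best]
    return chosen, fit([predictors[n] for n in chosen], y)

# Made-up predictor values for six matching results (illustrative data only).
predictors = {
    "max": [0.2, 0.4, 0.6, 0.8, 1.0, 0.5],
    "stdev": [0.3, 0.1, 0.4, 0.2, 0.5, 0.1],
    "average": [0.5, 0.1, 0.3, 0.6, 0.2, 0.4],
}
# Made-up quality values, generated as 0.1 + 0.6 * max + 0.4 * stdev.
quality = [0.34, 0.38, 0.62, 0.66, 0.90, 0.44]

chosen, weights = forward_stepwise(predictors, quality)
print(chosen, [round(w, 3) for w in weights])
```

The fitted weights are the tunable part of the model: refitting on data from a different scenario or a different target quality yields a different combination of the same predictors.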


Using Prediction
Tunable Prediction Models

Added bonus: Increased correlation


Schema Matching Evaluation
Why Granularity Matters

Consider the following example and how a matrix-level predictor and an attribute-level predictor would each fare in it.

[Figure: an exact match alongside matcher vectors]


Presentation Outline

Background
• Schema Matching

Schema Matching Evaluation
• Current model: Set-based Precision and Recall
• Proposed model: Similarity Spaces, a vector-space model
• Non-binary measures

Usage examples
• Tuning schema matchers using a non-binary measure
• Weighting ensembles using attribute-level prediction


Usage Examples
Tuning scenario

A first-line matcher named Term has a tunable parameter, the label score weight (α), which defines the relative weight of a term's label and its name:

\[ \alpha \cdot \text{labelScore} + (1 - \alpha) \cdot \text{nameScore} \]

[Figure: an example term with a Label and Name: leaveSlice]

Tuning can be done via machine learning methods or statistical methods.

All tuning methods benefit from:
• Smoothness: gradual changes in α yield gradual changes in the measure
• Robustness: observed behavior is robust w.r.t. the number of test cases
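The role of α can be illustrated with a small sketch. It assumes the combined score is α times the label score plus (1 - α) times the name score, and uses `difflib.SequenceMatcher` as a stand-in string measure; the actual scoring inside the Term matcher is not given in the deck, and the terms and function names below are hypothetical.

```python
from difflib import SequenceMatcher

def string_score(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (not the Term matcher's own measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def term_score(term_a, term_b, alpha):
    """Assumed combination: alpha * labelScore + (1 - alpha) * nameScore."""
    label_score = string_score(term_a["label"], term_b["label"])
    name_score = string_score(term_a["name"], term_b["name"])
    return alpha * label_score + (1 - alpha) * name_score

# Hypothetical terms, each with a human-readable label and a technical name.
a = {"label": "Check-in day", "name": "checkInDay"}
b = {"label": "Arrival day", "name": "arrivalDay"}

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"alpha={alpha:.2f} score={term_score(a, b, alpha):.3f}")
```

Tuning then amounts to choosing the α that maximizes an evaluation measure over a set of schema pairs, which is why the shape of that measure as a function of α matters.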


Usage Example
Smoothness

To use binary precision, a decision maker is required. Introducing a decision maker adds random noise caused by arbitrary thresholds.

[Figure: "Precision Binary Measures": Precision vs. label score weight (0 to 1) for Threshold(0.2), Threshold(0.25) and Stable Marriage]

[Figure: "Non Binary Precision": NBPrecision vs. label score weight (0 to 1)]
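The effect of a threshold on smoothness can be reproduced on a toy case. In this sketch one similarity value is varied gradually (as a proxy for what changing α does to a matcher's output), and precision is computed once after a Threshold(0.25) decision maker and once directly on the non-binary vector. The precision definition (dot product with the exact match divided by the sum of the entries of v) is an assumption, and the data are made up.

```python
def precision(v, v_e):
    """Assumed definition: (v . v_e) divided by the sum of the entries of v."""
    total = sum(v)
    return sum(a * b for a, b in zip(v, v_e)) / total if total else 0.0

def threshold_decision(v, t):
    """Decision maker: entries at or above t become 1, the rest 0."""
    return [1 if x >= t else 0 for x in v]

v_e = [1, 0]                         # only the first correspondence is correct
xs = [i / 20 for i in range(1, 10)]  # the wrong correspondence's score: 0.05 .. 0.45

binary_curve = [precision(threshold_decision([0.6, x], 0.25), v_e) for x in xs]
nb_curve = [precision([0.6, x], v_e) for x in xs]

for x, b, nb in zip(xs, binary_curve, nb_curve):
    print(f"x={x:.2f} binary={b:.2f} non-binary={nb:.3f}")
```

The binary curve stays flat and then drops from 1 to 0.5 in a single step when the value crosses the threshold, while the non-binary curve declines gradually; a tuning procedure following the binary curve sees no signal at all until the jump.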


Usage Example
Robustness – effect of sample size

An outlier in schema pair no. 2 causes binary precision (fig. (b)) to diverge greatly from the behavior eventually observed with 10 pairs.

Unperturbed by outliers, NBPrecision (fig. (a)) displays robust behavior.

[Figure: (a) NBPrecision and (b) binary precision, each plotted against the number of schema pairs]


Using Prediction
Dynamic Prediction Models - Results


Conclusions

Non-binary measures are a useful addition to the schema matching evaluation tool-kit.

Non-binary evaluation presents desired characteristics in tuning scenarios (smoothness and robustness)

Using the similarity vector space model we can generate additional measures, breaking from traditional measures (e.g. binary precision and recall) to measures more attuned to modern schema matching needs.


Thank You


Questions?