lemon: explainable entity matching

LEMON: Explainable Entity MatchingNils Barlaug∗

Norwegian University of Science and TechnologyNorway, Trondheim

ABSTRACTState-of-the-art entity matching (EM) methods are hard to interpret,and there is significant value in bringing explainable AI to EM. Un-fortunately, most popular explainability methods do not work wellout of the box for EM and need adaptation. In this paper, we iden-tify three challenges of applying local post hoc feature attributionmethods to entity matching: cross-record interaction effects, non-match explanations, and variation in sensitivity. We propose ournovel model-agnostic and schema-flexible method LEMON1 thataddresses all three challenges by (i) producing dual explanationsto avoid cross-record interaction effects, (ii) introducing the novelconcept of attribution potential to explain how two records couldhave matched, and (iii) automatically choosing explanation gran-ularity to match the sensitivity of the matcher and record pair inquestion. Experiments on public datasets demonstrate that the pro-posed method is more faithful to the matcher and does a better jobof helping users understand the decision boundary of the matcherthan previous work. Furthermore, user studies show that the rateat which human subjects can construct counterfactual examplesafter seeing an explanation from our proposed method increasesfrom 54% to 64% for matches and from 15% to 49% for non-matchescompared to explanations from a standard adaptation of LIME.

KEYWORDSentity matching, entity resolution, explainability

1 INTRODUCTIONEntity matching is an essential task in data integration [9]. It isthe task of identifying which records refer to the same real-worldentity. Table 1 shows an example. Machine learning has become astandard tool to tackle the variety of data and to avoid laborsomefeature engineering from experts while still achieving high accuracy[e.g., 16, 18, 22]. Unfortunately, this is often at the cost of reducedtransparency and interpretability. While it is possible to carefullyselect classical machine learning methods that are intrinsicallyinterpretable and combine them with classical string similaritymetrics, current state-of-the-art is large deep learning models [4,12, 18, 22], which offer very limited interpretability out of the box.The possible benefits of being able to explain black-box models arenumerous. To mention some: 1) Researchers can gain new insightinto their models and find ways to improve them 2) Practitionerswill have a valuable tool for verifying that the models work asexpected and debugging those which do not 3) Companies can getthe necessary transparency they need to trust such black-boxesfor mission-critical data integration efforts 4) End-users can bereassured that models and their results can be trusted (or discoverthemselves that they should not be).

∗Also with Cognite.1https://github.com/NilsBarlaug/lemon

Table 1: Example of two records that need to be classifiedas either a match or a non-match. In this case, from theWalmart-Amazon dataset.

title belkin shield micrafor ipod touch tint

category mp3 accessories

brand belkin

modelno f8z646ttc01

price 47.88

title belkin ipod touchshield micratint-royal purple

category cases

brand belkin

modelno f8z646ttc02

price 12.49

The challenge of explaining machine learning models and theirpotential benefits is, of course, not unique to entity matching. There-fore, explainable machine learning has in recent years received sig-nificant attention from the broader research community [1, 14, 15].The result is a multitude of techniques and methods with differ-ent strengths and weaknesses. But as previous work has discussed,applying these techniques to entity matching is non-trivial. It is nec-essary to adapt and evolve them to tackle the unique characteristicsof entity matching [8, 11, 29].

Local post hoc feature attribution methods are perhaps the mostpopular type of explainability method in general, and the moststudied so far for entity matching [2, 8, 11, 29]. Previous workon explainable entity matching base their work on LIME [25] —one of the most popular methods of that type. In this paper, wechoose to focus mainly on LIME to be consistent with and forease of comparison to earlier work. Our work is relevant beyondLIME, and we will reference and include other methods in ourexperiments, but we consider in-depth adaptation and treatmentof other methods outside of the scope of this paper and hope toaddress them in future work.

Challenges. Unfortunately, standard local post hoc attributionmethods do not work satisfactorily for entity matching out of thebox. We identify three challenges of applying them:

(1) Cross-record interaction effects: Since EM is a match-ing problem, features across a record pair will tend to havestrong interaction effects, but linear surrogate models suchas in LIME implicitly assumes independent features. Thiscan severely impair the accuracy of the surrogate model.

(2) Non-match explanations: In essence, most feature attri-bution methods analyses the effect of removing features todetermine their attribution. However, for record pairs forwhich the matcher is fairly confident that they do not match,so the match score is close to zero, it is unlikely that removalof features will make any significant difference on the matchscore. The result is that we have no significant attributionsto explain why the records do not match. This is especiallyimportant since most record pairs do not match.

arX

iv:2

110.

0051

6v1

[cs

.DB

] 1

Oct

202

1

https://orcid.org/0000-0003-4618-9702

https://github.com/NilsBarlaug/lemon

Nils Barlaug

(3) Variation in sensitivity:While some record pairs may onlyneed to perturb a few features to trigger a significant changein the output of the matcher, others may contain a lot ofredundant features, making it hard to substantially impactthe matching score and provide meaningfully sized attribu-tions. It can be hard to make the trade-off between token andattribute level feature granularity. You risk being either toofine-grained or unnecessarily course-grained, and it differsbetween specific record pairs in the same dataset, acrossdatasets, and across matchers.

As we will outline in Section 2, earlier work has only partiallyaddressed these challenges.

Proposedmethod. Our proposedmethod specifically addressesall three challenges jointly by: 1) Making dual explanations to avoidcross-record interaction effects. 2) Introducing the novel concept ofattribution potential, an improvement over the copy/append per-turbation from previous work that are schema-flexible and morerobust to dirty data and matchers sensitive to the order of tokens.3) Choosing an interpretable representation granularity that op-timizes the trade-off between counterfactual interpretation andthe finest granularity possible. Our proposed method provides oneunified frame of interpretation with the same single explanationformat for all record pairs, has no dataset-specific hyperparametersthat need tinkering, and does not require matched schemas.

Evaluation of explainability methods is still an open problem,and there are no standard ways of evaluating explainable entitymatching. Ideally, in broad terms, we would like to measure towhat degree an explanation helps users understand how the modelmakes a matching decision. Inspired by the motivations behindcounterfactual examples [31], we argue that a useful attributionexplanation should help the user understand where the decisionboundary is andwhat kind of difference in input would be necessaryto sway the matcher. To that end, we propose to ask users whatthey think would be a minimal change to a record pair to make thematcher change its prediction and then check if they are correct.We show how this can be done for simulated users as well as humansubjects.

Finally, our results show that our proposed method is state-of-the-art, both in terms of faithfulness and helping users understandthe matcher’s behavior. Additionally, our user study shows real-world improvement in understanding by human subjects.

Contributions. In summary, our main contributions are:• We propose a method that addresses three important chal-lenges of applying local post hoc attribution methods toentity matching: 1) Cross-record interaction effects, 2) non-match explanation, and 3) variation in sensitivity. We showthrough experiments that this is indeed effective.

• To evaluate entity matching attribution explanations, wepropose a novel evaluation method that aims to measureto what degree explanations help users understand the de-cision boundary of a matcher. We show how to performexperiments on both simulated users and human subjects.

• Through extensive experiments on public datasets we showthat our proposed method is state-of-the-art both in termsof faithfulness and helping users understand the matcher’s

behavior. We verify the real-world applicability of our pro-posed method by performing an extensive user study. Tothe best of our knowledge, we are the first to conduct a userstudy for explainable entity matching.

Outline. The rest of the paper is organized as follows. Section2 briefly covers related work, Section 3 covers LIME and its adapta-tion to entity matching, and Section 4 goes into the details of ourproposed method. We explain our experimental setup in Section 5,and then we walk through and discuss the experiments in Section6 before we make our conclusive remarks in Section 7.

2 RELATEDWORKMachine learning for EM. The immense variety in datasets

makes machine learning a natural solution for entity matching. Thetraditional approach has been to handcraft string similarity metricsto produce similarity feature vectors and then utilize a classical off-the-shelf machine learning model such as SVM or random forest toclassify them [5, 13, 16]. The two main drawbacks of this approachare the necessary manual tinkering and poor performance on dirtydata [22]. However, in the last few years, the research communityhas increasingly adopted deep learning [4, 12, 18, 22, 23, 32]. Whileearly work focused on custom architectures and trained modelsfrom scratch, the current state of the art focuses on fine-tuninglarge natural languagemodels such as BERT [7], which offers higheraccuracy and decreases the need for training examples [18]. Werefer to Barlaug and Gulla [3] for an extensive survey on deeplearning for EM.

ExplainableAI. There aremanyways to explainmachine learn-ingmodels. Generally, explanations can be either global or local [10],in other words, explaining the model’s behavior as a whole or ex-plaining a single prediction. Furthermore, we often distinguishbetween intrinsically interpretable models and post hoc interpreta-tion methods [21] (which can be model-agnostic or not). The formerare models that are interpretable on their own, like linear regres-sion or decision trees, while the latter are methods for explainingblack-box models. We refer the reader to one of many extensivesources on the topic [1, 10, 14, 15, 21].

A particularly prominent group of approaches are local post hocattribution methods [e.g., 20, 25, 28], which aim to explain a predic-tion by communicating to what degree different parts of the inputare to be attributed for the prediction. LIME [25] is one of the mostprominent among such methods. It is a model-agnostic method,and works by randomly removing features of an input example andtraining a (usually linear) interpretable surrogate model to predictthe model’s output for these perturbations. Among other popularlocal post hoc attribution methods are the game-theoretic-basedSHAP [20] and gradient-based methods [e.g., 26, 28].

Explainable EM. The use of explainability techniques for ma-chine learning-based entity matching is still a young subject, andthere has only been a limited amount of previous work. However,we note that rule-based methods have historically been used tomake systems that can be interpreted by experts [13], and is analternative way to make explainable matchers [24].

LEMON: Explainable Entity Matching

Ebaid et al. [11] demonstrate ExplainER, a tool for explainingentity matching that utilizes multiple prominent explainability tech-niques, while Thirumuruganathan et al. [29] discuss challenges andresearch opportunities. Further, there have been two significantadaptations of LIME for entity matching, which we will now de-scribe and contrast to our work.

Mojito [8] introduces two versions of LIME for entity matching:LIME_DROP and LIME_COPY. The former is a straightforwardapplication of LIME similar to how the original authors do textclassification using token level feature granularity, while in thelatter they use attribute level representation and perturb by copyingthe entire attribute value to the corresponding attribute in the otherrecord instead of removing tokens. LIME_COPY is an elegant way toaddress challenge (2), but leaves more to be wanted. Firstly, attributelevel granularity is too coarse-grained for most cases with longertextual attributes (the extreme case being a single textual attribute).Secondly, since it is separate from LIME_DROP, it requires the userto interpret two different kinds of explanations. Finally, it relies ona matched schema with one-to-one attribute correspondence.

Recently, Landmark [2] was proposed as a two-part improve-ment over Mojito. Firstly, it makes two explanations, one per record,and avoid perturbing both records simultaneously, which effec-tively solves challenge (1). Secondly, for record pairs labeled asnon-matches, instead of perturbing by randomly copying entireattributes, it appends every corresponding attribute value from theother record and performs regular token level exclusion pertur-bation (a technique named double-entity generation), effectivelycombining LIME_DROP and LIME_COPY. The authors demonstratethrough experiments that their techniques are indeed effective andthat Landmark is a substantial improvement over Mojito. The limi-tation of the approach is that tokens from the other record are onlyever considered to be appended at the end of the correspondingattribute, which is unfortunate if the matcher is sensitive on theorder of tokens (e.g., many products have the brand name first inthe title), the schemas are not matched one-to-one, or the data isdirty. Similar to Mojito, Landmark also makes two different kindsof explanations.

While making important contributions, neither Mojito nor Land-mark addresses all three challenges identified in Section 1. OnlyLandmark tackle challenge (1). Both propose a solution to challenge(2), but as we discussed above, with important limitations. Neitheraddresses challenge (3). Moreover, they do not provide a unifiedand coherent way of actually communicating or visualizing an ex-planation to the end-user in the same way the original authors ofLIME do — something we aim to do.

3 PRELIMINARIESIn this section, we introduce LIME and describe how it can beadapted for entity matching.

3.1 LIMEThe main idea of LIME [25] is to locally approximate a classifieraround one instance with an interpretable model over an inter-pretable representation in a way that balances faithfulness andcomplexity, and then use the interpretable model as an explanation.The intuition is that while our problem is too complex for classical

machine learning models that we regard as inherently interpretable(e.g., linear regression or decision trees) to be accurate enough, itmight be possible to faithfully approximate the decision boundaryfor a black-box model locally around one input instance. In otherwords, we can train an interpretable surrogate model using localdata points sampled by perturbing an input instance and use it as anexplanation of that particular instance. And while the input featuresof a black-box model might be unsuited for human interpretation(e.g., deep learning embeddings or convoluted string metrics), wecan define an alternative interpretable representation for the inputinstance we want to explain and use that to train our interpretablesurrogate model.

Formally, let 𝑓 (𝑥) : 𝐴 × 𝐵 → R be the classifier we want toexplain — in our case, an entity matching model that accepts tworecords, 𝑎 ∈ 𝐴 and 𝑏 ∈ 𝐵, and outputs a predictions score. For asingle instance 𝑥 = (𝑎, 𝑏) that we want to explain, let 𝐼𝑥 = {0, 1}𝑑𝑥be the interpretable domain and 𝑥 ′ ∈ 𝐼𝑥 the interpretable repre-sentation of 𝑥 . Its elements represent the presence (or absence) ofwhat Ribeiro et al. [25] call interpretable components, essentiallynon-overlapping parts of the input. E.g., for text, it could be thepresence of different words. For each 𝑥 there must exist a function𝑡𝑥 (𝑥 ′) : 𝐼 → 𝐴×𝐵 that can translate an interpretable representationto the input domain of the classifier 𝑓 .

With the goal of approximating 𝑓 local to 𝑥 , we sample a newdataset Z𝑥 = {(𝑧′, 𝑦) ∈ 𝐼 × R} where 𝑦 = 𝑓 (𝑡𝑥 (𝑧′)). Each 𝑧′ isdrawn by setting a uniformly random-sized subset of 𝑥 ′ to zero.Let 𝐺 be a class of interpretable models over 𝐼𝑥 . Furthermore, letL(𝑓 , 𝑔, 𝜋𝑥 ) be how unfaithful 𝑔 ∈ 𝐺 is to 𝑓 in the neighborhooddefined by the distance kernel 𝜋𝑥 (𝑧′). We want to find a 𝑔 ∈ 𝐺 thatis as faithful to 𝑓 as possible, but since many interpretable modelscan be made more accurate by increasing the complexity, we needto balance the faithfulness with model complexity so 𝑔 is simpleenough to actually be interpretable for humans. To that end, letΩ(𝑔) be a measure of complexity for 𝑔, and choose the followingexplanation for 𝑥 :

b (𝑥) = argmin𝑔∈𝐺

[L(𝑓 , 𝑔, 𝜋𝑥 ) + Ω(𝑔)

](1)

In practice, Ribeiro et al. [25] chose 𝐺 to be weighted sparselinear regression models and L to be mean squared error weightedby 𝜋𝑥 on Z𝑥 . Furthermore, Ω(𝑔) is chosen to be the number ofnon-zero coefficients of 𝑔, and the trade-off between L and Ω issimplified by constraining Ω(𝑔) not to be greater than a constant𝐾 known to be low enough. It is now simply a matter of fitting aregularized weighted least squares linear regression model onZ𝑥 .For a simplified overview of the whole process see Figure 1.

3.2 LIME for Entity MatchingBefore we describe our proposed method in the next section, wewill now go through the design decisions done within the LIMEframework. A setup we then build upon and use as a baseline forour proposed method.

Let a record 𝑟 = {(𝛼 𝑗 , 𝑣 𝑗 )} be an ordered set of pairs with at-tribute name and value. Inspired by how Ribeiro et al. does LIMEfor text classification, we define the interpretable representation 𝐼𝑥to be the absence of unique tokens in attribute names and values forboth records in 𝑥 = (𝑎, 𝑏) (i.e., 𝑡𝑥 (0) = 𝑥 ). Attribute values that are

Nils Barlaug

Fit weighted linearregression model

Sampleperturbations

Interpretablecomponents

Makeexplanation

Figure 1: Illustration of LIME [25] for a binary classification problem.

not strings are treated as single tokens, and their absence is theirnull/zero value. We sample Z𝑥 by setting random subsets of 𝑥 ′ toone, where the size of the subsets are sampled uniformly from theinterval [0, 𝐷max], and we use the neighborhood distance kernel

𝜋𝑥 (𝑧′) = exp(−2𝐷 (𝑥 ′, 𝑧′)𝐷max

) (2)

where 𝐷 is the Hamming distance. While the original authors sim-ply used 𝐷max = 𝑑𝑥 for text classification, we empirically find thisneighborhood too large due to entity matching generally beingmore sensitive to single tokens compared to standard text classifica-tion. This could, of course, be accounted for by narrowing 𝜋𝑥 , butit is more sample efficient to also reduce the neighborhood we aresampling from. We use 𝐷max =𝑚𝑎𝑥 (5, ⌊𝑑𝑥5 ⌋), and let the numberof samples |Z𝑥 | be clamp(30𝑑𝑥 , 500, 3000). From our experience,none of these parameters are especially sensitive.

Finally, we let𝐺 beweighted regressionmodels without intercept.The loss then being

L(𝑓 , 𝑔, 𝜋𝑥 ) =∑

(𝑧′,𝑦) ∈Z𝑥

𝜋𝑥 (𝑧′) (𝑦 − 𝑔(𝑧′))2 (3)

b (𝑥) is found using weighted least squares and forward selection(choosing 𝐾 coefficients).

4 METHODWe now describe how we address the three challenges describedin Section 1 with three distinct, but coherent, techniques that to-gether form our proposed method: Local explanations for Entitiesthat Match Or Not (LEMON). As discussed earlier, we use LIMEas the basis for our method, but the proposed ideas have widerapplicability. The three following subsections respectively addressand propose a solution to the three challenges (1) cross-record in-teraction effects, (2) non-match explanations, and (3) variation insensitivity.

4.1 Dual ExplanationsOne shortcoming of LIME, when applied directly to entity matching,is that it does not take into account the inherent duality of thematching. No distinction is made between the two input records.

This is problematic because our surrogate linear regression modelassumes independent features, but perturbations across two inputrecords will almost per definition have strong interaction effects.In essence, the surrogate model 𝑔 cannot sufficiently capture thebehaviour of our matcher 𝑓 even for small neighborhoods, whichseverely hurts the approximation accuracy.

The proposed solution is relatively straightforward but still ef-fective. Equivalently to Baraldi et al. [2], we make two explanations,one for each record. For each explanation, we let 𝐼𝑥 represent onlythe absence of tokens in one record. In effect, we approximate at-tributions from only one record at a time while keeping the otherconstant. That way, we avoid the strong interaction effects acrossthem. Intuitively, we explain why record 𝑎 matches 𝑏 or not andwhy record 𝑏 matches 𝑎 or not, separately. The two explanationscan still be presented together as one joint explanation to the user.

4.2 Attribution PotentialAttribution methods such as the LIME implementation describedabove tell us which part of the input is the most influential. This isusually achieved using some kind of exclusion analysis where onelooks at the difference between the absence and presence of inputfeatures [e.g., 20, 25, 28]. While that might be an effective approachfor many machine learning problems, it is inherently unsuited forentity matching. The issue lies in explaining non-matches. Recordpairs that a matcher classifies as a match can be explained subtrac-tively because removing or zeroing out essential parts of the recordswill result in lower matching scores from most well-behaved match-ers. But for record pairs where the matcher is convinced they donot match and provide a near-zero match score, it is unlikely thatremoval or zeroing out any part of the records will make a signifi-cant difference on the match score. Seemingly, nothing influencesthe match score, thereby providing no useful signal of the contri-bution from different features. Notice that standard gradient-basedmethods are not able to escape this problem because 𝑓 will, in thesecases, be in a flat area and the gradients be rather uninformative.Intuitively, we can not explain why two records do not match bywhat they contain. A natural solution to this problem is instead toexplain by what they do not contain.


Interpretable Representation. Let the interpretable represen-tation 𝐼𝑥 = {𝑃,𝐴,𝑀}𝑑𝑥 be categorical instead of binary, where thevalues represent whether the corresponding token is Present, Ab-sent, or Matched. Unsurprisingly, 𝑧′𝑖 = 𝐴 means 𝑡𝑥 (𝑧′) will excludetoken 𝑖 and if 𝑧′𝑖 = 𝑃 it will be kept — much like before. On theother hand, if 𝑧′𝑖 = 𝑀 , we will copy and inject the token in the otherrecord where it maximizes the match score 𝑓 (𝑡𝑥 (𝑧′)). For now, letus assume we have an accurate and efficient implementation of𝑡𝑥 (𝑧′).

For the linear surrogate model, we dummy code 𝐼𝑥 , using 𝑃 asreference value. When we do forward selection, we select bothdummy variables representing a categorical variable at once, sothat we either pick the entire categorical variable or not. We willthen have two estimated coefficients, 𝛽𝐴𝑖 and 𝛽𝑀𝑖 , for each of the 𝐾selected interpretable features. Finally, we define the attribution fortoken 𝑖 to be𝑤𝑖 = −𝛽𝐴𝑖 , and the attribution potential to be 𝑝𝑖 = 𝛽𝑀𝑖 .Intuitively,𝑤𝑖 is the contribution of token 𝑖 , and is the same as inLIME, while 𝑝𝑖 can be interpreted as the maximum additional con-tribution token 𝑖 could have had if the other record matched better.Note that it is important to model this new attribution potentialthrough a categorical variable instead of simply adding anotherbinary variable to 𝐼𝑥 , because exclusion and injection perturbationsof the same token strongly interact with each other and should bemutually exclusive.

Approximating 𝒕𝒙 . In contrast to plain LIME as described inSection 3.2, 𝑡𝑥 is now less straightforward to compute. The diffi-culty lies in where to inject tokens 𝑖 for which 𝑧′𝑖 = 𝑀 to maximize𝑓 (𝑡𝑥 (𝑧′𝑖 )). Since our method is model-agnostic, the best we can dois try all possible injections. That would, of course, be computa-tionally prohibitive, both because of the high number of possibleinjection targets but also because of the exponential growth ofcombinations when multiple tokens should be injected. Instead, wecan approximate it by sampling 𝐿 combinations of injection targetsand picking the one that give the highest match score.

The possible injection targets for a token in an attribute value areanywhere in the string attributes of the other record (but withoutsplitting tokens in the target attribute value). If the token is a non-string value, it can also overwrite attribute values of the same type— e.g., a number attribute can replace a number attribute in theother record. Tokens from attribute names can only be injectedto attribute names. When we sample injection targets, we firstpick a target attribute uniformly at random and then a randomposition within that attribute. In addition, we employ a heuristicto incorporate information about matched schemas if available. Incases where the schemas are matched, we boost the probabilityof choosing the corresponding attribute as the target to 50%. Thismakes efficient use of prior knowledge about the schemas whilestill preserving the robustness to dirty data. To pick the samplesize 𝐿, we first sum the possible injection targets per token to beinjected. We cap the number of targets to three per attribute and tenin total and then let 𝐿 be the maximum of all tokens to be injected.

Neighborhood sampling. We sample the neighborhood Z𝑥

much like before. Now, 𝑥 ′ will be a vector of only 𝑃 , and we let𝑧′𝑖 = 𝐴 instead of 1 for random subsets. But we additionally set ran-dom subsets of elements to𝑀 . The subset size is chosen uniformly

at random from [0,max(3, ⌊𝑑𝑥/3⌋)], except with a 50% chance ofpicking 0. The reason we sample 𝑀 less than 𝐴 is that injectionstend to have more dramatic effects on the match score than exclu-sions, and so we consider them to be larger perturbations and wantto avoid drowning the exclusion effects.

4.3 Counterfactual GranularityDepending on the dataset and matcher, influencing the match scoresignificantly can sometimes require perturbating large parts of theinput records. This is especially true for datasets where recordscontain many high-quality pieces of information because it pro-vides the matcher with multiple redundant strong signals aboutwhether they match or not. An example is the iTunes-Amazondataset, where attributes such as song name, artist name, albumname, and more might all agree or disagree at the same time. Theproblem is that we want to pick out 𝐾 important features for theuser to focus on, but in such cases, no single token is likely to besignificantly important. While we could use attribute-level features,that would be unnecessarily course-grained for many cases. Instead,we propose an adaptive strategy where we automatically choose anappropriate explanation granularity. The idea is to exponentiallydecrease the granularity of the interpretable features until the attri-butions and attribution potentials are large enough in magnitudeto explain the decision boundary.

Let 𝑒𝑥 = {(𝑤𝑖 , 𝑝𝑖 )}𝑖∈𝐸 be the 𝐾 pairs of attributions and attri-bution potentials for an explanation b (𝑥), where 𝐸 is the 𝐾 in-terpretable features chosen to be used in the regularized linearsurrogate model. Further, let inc𝑖 = max(−𝑤𝑖 , 𝑝𝑖 ) and dec𝑖 = 𝑤𝑖 .In other words, this is how much perturbation of token 𝑖 couldincrease or decrease the match score according to𝑤𝑖 and 𝑝𝑖 if youremoved the token or injected the token in the other record. Then,to represent greedy actions to increase the match score, let INC bea vector of the elements in 𝐸 with positive inc𝑖 and sorted by inc𝑖in descending order, and similarly for DEC. We define the predictedcounterfactual strength of 𝑘 steps to be

CFS𝑘𝑥 =

𝑝 − [

𝑓 (𝑥) −𝑠=𝑘∑𝑠=1

DEC𝑠], 𝑓 (𝑥) > 𝑝[

𝑓 (𝑥) +𝑠=𝑘∑𝑠=1

INC𝑠] − 𝑝, 𝑓 (𝑥) ≤ 𝑝

(4)

Intuitively, this is to what degree one would assume to surpassthe classification threshold 𝑝 if one performs 𝑘 greedy actionsto change the matcher prediction. Note that most matchers, aswell as those in our experiments, have a classification threshold 𝑝of 0.5 [16, 18]. Then let the greedy counterfactual strategy 𝑘𝑔 bethe smallest number of steps predicted to be necessary to get acounterfactual strength of at least 𝜖 :

𝑘𝑔 =

min{𝑘 : CFS𝑘𝑥 ≥ 𝜖}, if such 𝑘 exists|DEC |, otherwise if 𝑓 (𝑥) > 𝑝|INC |, otherwise if 𝑓 (𝑥) ≤ 𝑝

(5)

Nils Barlaug

Finally, we define the predicted counterfactual strength of the ex-planation b (𝑥) to simply be CFS𝑥 = CFS

𝑘𝑔𝑥 , and the actual counter-

factual strength to be:

CFS𝑥 =

𝑝 −

[𝑓 (𝑥) − 𝑓 (𝑡𝑥 (𝑥 ′𝑔)) ], 𝑓 (𝑥) > 𝑝[

𝑓 (𝑥) + 𝑓 (𝑡𝑥 (𝑥 ′𝑔)) ] − 𝑝, 𝑓 (𝑥) ≤ 𝑝(6)

where 𝑥 ′𝑔 is the perturbation of the interpretable representation 𝑥 ′corresponding to the greedy counterfactual strategy.

When an explanation’s interpretable features represent (up to) 𝑛consecutive tokens, we say that explanation has a granularity of 𝑛tokens. To find our desired granularity, we start with a granularity ofone token and then double until we find a granularity that satisfiesCFS𝑥 ≥ 𝜖 ∧ CFS𝑥 ≥ 𝜖 or no courser granularity is possible (i.e., allfeatures are whole attributes). When no granularity satisfies therequirement, we pick the granularity with the highest harmonicmean between CFS𝑥 and CFS𝑥 , which will favor them to be largeand similar.

We call the resulting approach for picking granularity counter-factual granularity. It will try to find explanations that explain thedecision boundary while balancing maximal granularity and faith-fulness. Note that the granularity is chosen independently for eachof the two dual explanations.

4.4 SummaryAll three extensions introduced above fit together in our proposedmethod. Finally, we choose𝐾 to be 5 and 𝜖 to be 0.1 for all examples.Figure 2 provides a simplified overview of the steps that make upLEMON. It is a model-agnostic and schema-flexible method withoutany hyperparameters that need tuning. One challenge comparedto LIME is that it is more computationally demanding. The mainreason is the increased number of matcher predictions made toestimate the attribution potential and finding the right granularity.However, it is still possible to generate an explanation within a fewseconds in most cases.

One key advantage of LEMON over previous work is that itprovides one type of explanation for all record pairs, whether it ismatch or non-match, with a clear and easy way to interpret andvisualize. The attributions𝑤𝑖 are equivalent to those in LIME andcan be interpreted the same way. Its interpretation is to what degreeinterpretable feature 𝑖 (some part of a record) contributes to thematch score. If the corresponding part of the record is removed, weexpect the match score to decrease by approximately 𝑤𝑖 . For thesame interpretable feature 𝑖 , the interpretation of the attributionpotential 𝑝𝑖 is how much higher the attribution𝑤𝑖 could be if theother record matched the feature better.

While the explanation can be visualized in many ways, we pro-pose a straightforward extension of the visualization proposed bythe original LIME authors. Figure 3 show an example. In additionto plotting a colored bar for each𝑤𝑖 , we also plot gray bars from𝑤𝑖 to𝑤𝑖 + 𝑝𝑖 , outlining feature 𝑖’s potential attribution.

5 EXPERIMENTAL SETUP5.1 DatasetsAll experiments are carried out on the 13 public datasets used inthe evaluation of DeepMatcher [22] — originally from Köpcke et al.

Table 2: The public DeepMatcher [22] benchmark datasetand 𝐹1 score for the Magellan and Bert-Mini matchers usedin the experiments.

Type Name #Cand. #MatchesMatcher 𝐹1

MG BM

Structured

Amazon-Google 11 460 1 167 0.52 0.67Beer 450 68 0.85 0.76DBLP-ACM 12 363 2 220 0.99 0.98DBLP-Scholar 28 707 5 347 0.94 0.93Fodors-Zagats 946 110 1.00 0.95iTunes-Amazon 539 132 0.90 0.93Walmart-Amazon 10 242 962 0.66 0.80

Dirty

DBLP-ACM 12 636 2 220 0.91 0.97DBLP-Scholar 28 707 5 347 0.83 0.94iTunes-Amazon 539 132 0.53 0.90Walmart-Amazon 10 242 962 0.41 0.79

Textual Abt-Buy 9 575 1 028 0.51 0.81Company 112 632 28 200 0.57 0.90

[17] and Das et al. [6]. Table 2 lists them together with their numberof candidates and number of matches. The datasets are divided intothree types: structured, dirty, and textual. Structured datasets havenicely separated attributes, dirty datasets are created from theirstructured counterpart by randomly injecting other attributes intothe title attribute [22], and textual datasets generally consist of longtextual attributes containing multiple pieces of information. Forthe company dataset, we truncate each record to max 256 space-separated words.

5.2 MatchersTo show that our proposed method is versatile and model-agnostic,we perform our experiments for each dataset on both a matcherthat uses classical machine learning with string metrics as featuresand on a deep learning based matcher. See Table 2 for their 𝐹1 scoreon the benchmark datasets.

Magellan. For the classical approach, we train a Magellan [16]random forest matcher. We use the automatically suggested simi-larity features and the default random forest settings provided bythe library. Furthermore, we do no downsampling and train on theentire training dataset.

Bert-Mini. For the deep learning approach, we train a baselineDitto [18] matcher using Bert-Mini [30]. While not quite achievingstate-of-the-art accuracy, Bert-Mini provides a decent accuracy vs.cost trade-off while maintaining the main characteristics of state-of-the-art deep learning matchers and still significantly outperformingthe classical matcher on dirty and textual data. We use batch size32, linearly decreasing learning rate from 3 · 10−5 with 50 warmupsteps, 16-bit precision optimization, and 1, 3, 5, 10, or 20 epochsdepending on the dataset size. The final model is the one from theepoch with the highest F1 score on the validation dataset.


Fit weighted linearregression model

Sampleperturbations

Interpretablecomponents

Non-match Match

Makeexplanation

Non-match Match

Exponentially decrease granularity

If

Figure 2: Illustration of how LEMON generates an explanation for a record pair 𝑥 = (𝑎, 𝑏).

Figure 3: Example of how a LEMON explanation can be visualized and what we show users in our user study. This particularexplanation is for the prediction of the Bert-Mini matcher used in our experiments on the record pair from Table 1.

5.3 BaselinesWe now introduce the baselines we use for comparison. For a faircomparison, we adopt dual explanations for all of them.

LIME. Since ourwork can be seen as a continuation of LIME [25],it is a natural baseline. We use LIME as described in Section 3.2.

SHAP. Another popular approach for producing input attri-butions is SHAP [20]. It is based on the game-theoretic Shapleyvalues [19, 27] and provides several methods for different types ofmodels. For a fair comparison, we use their model-agnostic method,

Kernel SHAP, which can be interpreted as using LIME to approxi-mate Shapley values. Note that Kernel SHAP does not limit 𝐾 andset Ω(𝑔) = 0. We use default settings from the SHAP library.

Integrated gradients. Our proposed method is a perturbation-based attribution method. To compare against a gradient-basedmethod, we use integrated gradients [28] as a baseline. This methodis not truly model-agnostic as it requires gradients, so we can onlyapply it to our deep learning matcher.

Nils Barlaug

Integrated gradients explain input 𝑥 in reference to some baselineinput 𝑥∗ (some neutral input that gives a score close to zero). Let 𝑥be the embedding vector of a record pair. As the original authorssuggest for textual input, we let 𝑥∗ be the zero embedding. Theattribution for the 𝑖th element of 𝑥 is then defined to be

𝐼𝐺𝑖 (𝑥𝑖 ) = (𝑥𝑖 − 𝑥∗𝑖 ) ×∫ 1

𝛼=0

𝜕𝑓 (𝑥∗ + 𝛼 × (𝑥 − 𝑥∗))𝜕𝑥𝑖

𝑑𝛼 (7)

The integral is approximated by averaging the gradient of evenlyspaced points from 𝑥∗ to 𝑥 . We use 50 points in our experiments.The raw attributions are for single elements of the embedded input,by no means interpretable for humans, so it is common to sum themfor each embedding. To get attributions on the same representationlevel as our method, we combine attributions of Bert subwordembeddings into whole words.

6 EXPERIMENTSWe will now go through several experiments to evaluate LEMONand compare it to other methods. When we evaluate post-hoc ex-plainability it is important to remember that we do not wish tomeasure the performance of the matchers but rather what the ex-plainability method can tell us about the matchers. Explanationsshould not be judged disconnected from the matcher on whetherthey provide the same rationale as users but to what degree theyreflect the actual (correct or wrong) behavior of the matchers andto what degree they are effective at communicating this to users.

6.1 Counterfactual InterpretationExplanations can sometimes provide enough information to theuser to understand how the prediction could be different. Baraldiet al. [2] call this the "interest" of an explanation. We argue similarlythat a useful explanation should implicitly reveal to the user somechanges to the records that would flip the prediction outcome. Butwe further argue that we should help the user understand a minimalnumber of such changes necessary since that would mean the userhas a greater understanding of where the decision boundary is.

To that end, we simulate users being shown an explanation fora record pair and then being asked what they think would be someminimal changes to the records that would flip the matcher’s pre-diction. The simulated users will greedily try to make the smallestnumber of perturbations necessary according to the explanation,as described in Section 4.3. Of the two dual explanations, they pickthe explanation with the lowest 𝑘𝑔 if CFS𝑥 ≥ 𝜖 or the one with thehighest CFS𝑥 otherwise. We extend the same greedy strategy toLandmark explanations but with their corresponding perturbations.Let the counterfactual recall of an attribution method be the fractionof explanations where at least one of the two dual explanationsindicate how the matching prediction could be flipped (CFS𝑥 ≥ 𝜖),and the counterfactual precision be the fraction of those where thegreedy counterfactual strategy is actually successful (CFS𝑥 > 0).To unify them into a single metric, we report the counterfactual 𝐹1score. We produce 500 explanations for both predicted matches andnon-matches (or all when there are less than 500 available) for eachexplanation method per dataset. Table 3 shows the counterfactual𝐹1 for the different explainability methods, matchers, and datasets.

We observe that LEMON performs the best overall, with the high-est or close to the highest 𝐹1 score for most cases. It significantlyoutperforms all three baselines, where non-matches, as expected,have the starkest difference. Since all the baselines are fundamen-tally analyzing the prediction by observing what happens whenfeatures are removed, they suffer from the same issue of explain-ing non-matches as discussed in Section 4.2. Importantly, the lowperformance of all the baselines backs up the claim that standardlocal post hoc attribution methods do not work satisfactorily outof the box for entity matching. Our proposed method generallyoutperforms Landmark, with the exception of three datasets for theMagellan matcher on non-matches. We note that LEMON has thebiggest advantage over Landmark on datasets that typically wouldrequire more substantial perturbations to flip the prediction, suchas DBLP-GoogleScholar and Company.

6.2 Explanation FaithfulnessIt is desirable that explanations are faithful to the matcher. Alluseful explanations provide some simplified view of a matcher’sbehavior, but we still want them to be indicative of how the matcheractually operates without being unnecessarily misleading. Inspiredby Ribeiro et al. [25], we make perturbations to a record pair andcompare the resulting match score with what we would expectfrom the attributions and attribution potentials. Specifically, if weremove feature 𝑖 , we expect the match score to decrease with 𝑤𝑖

(remember that𝑤𝑖 can be negative), and if we inject feature 𝑖 intothe other record, we expect the match score to increase with 𝑝𝑖 .The same applies for Landmark, but appending instead of injectingfeatures. Since the baselines do not estimate attribution poten-tials, we ignore that perturbation for them. We perform 1, 2, and3 random perturbations among the interpretable features for bothdual explanations. We repeat for 500 explanations of matches andnon-matches for each matcher and dataset (or all when there areless than 500 available). Let 𝛿𝑙 be the set of expected match scoreincreases and decreases for experiment 𝑙 out of 𝐿, and 𝑧𝑙 be theperturbed version of the original input 𝑥𝑙 . Let the mean absoluteerror be

MAE =1𝐿

∑𝑙

��𝑓 (𝑧) − [𝑓 (𝑥) +

∑𝑐∈𝛿𝑙

𝑐] �� (8)

This error measure alone will favor conservative explanation meth-ods that make small and insignificant claims, and punish methodslike Landmark and LEMON that provide higher impact explana-tions because of the inject/append features. Therefore, define theperturbation error to be the mean absolute error by dividing by theaverage magnitude of the predicted change:

PE =MAE

1𝐿

∑𝑙∑𝑐∈𝛿𝑙 |𝑐 |

(9)

Table 4 shows the perturbation error for all methods. No methodachieves truly low error levels, which is expected given the simpli-fied assumption of independent additative attributions. However,we observe that LEMON overall is the method with the smallesterrors, with LIME being very similar. Further, we see Landmarksometimes gets extremely high perturbation error, especially forMagellan on the dirty datasets. Upon closer inspection, we think thisstems from a combination of sampling a too large neighborhood


Table 3: Counterfactual 𝐹1 score for all the evaluated explainability methods across all datasets and the two matchers.

Model Type Method

DatasetStructured Dirty Textual

AG B DA DG FZ IA WA DA DG IA WA AB C Mean

Magellan

Match

LIME 0.96 0.83 1.00 0.87 0.77 0.98 0.82 0.50 0.79 0.90 0.86 0.95 0.47 0.82SHAP 0.95 1.00 1.00 1.00 0.95 1.00 0.83 0.65 0.96 0.68 0.81 0.98 0.68 0.88SHAP (w/ CFG) 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.99 1.00 0.91 0.99 0.99 0.87 0.98Landmark 0.92 0.89 1.00 0.96 0.73 0.87 0.95 0.75 0.74 0.88 0.90 0.95 0.28 0.83LEMON (w/o DE) 0.98 0.91 1.00 0.97 0.95 1.00 0.96 0.83 0.93 0.93 0.95 0.98 0.49 0.91LEMON (w/o AP) 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.98 0.83 0.98LEMON (w/o CFG) 0.96 0.89 1.00 0.88 0.91 0.98 0.81 0.52 0.81 0.85 0.83 0.92 0.41 0.83LEMON 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.99 0.80 0.98

Non-match


Bert-Mini

Match

LIME 0.95 0.65 0.97 0.85 0.93 0.65 0.81 0.97 0.69 0.62 0.80 0.78 0.18 0.76SHAP 0.90 0.81 0.79 0.65 0.91 0.69 0.71 0.79 0.62 0.63 0.75 0.77 0.25 0.71SHAP (w/ CFG) 0.95 0.96 0.92 0.98 0.86 0.89 0.97 0.98 0.99 1.00 0.99 1.00 0.39 0.91IG 0.90 0.43 0.71 0.83 0.50 0.70 0.67 0.69 0.89 0.81 0.68 0.79 0.33 0.69IG (w/ CFG) 0.88 0.52 0.66 0.94 0.68 0.70 0.77 0.86 0.95 0.94 0.87 0.93 0.28 0.77Landmark 0.98 0.93 1.00 0.94 0.86 0.94 0.84 0.99 0.83 0.90 0.88 0.83 0.08 0.85LEMON (w/o DE) 0.99 0.69 1.00 0.97 0.98 0.88 0.83 1.00 0.94 0.89 0.84 0.86 0.25 0.85LEMON (w/o AP) 1.00 1.00 1.00 1.00 1.00 0.94 0.99 1.00 1.00 1.00 1.00 0.98 0.39 0.95LEMON (w/o CFG) 0.95 0.65 0.98 0.86 0.98 0.58 0.81 0.99 0.70 0.59 0.81 0.79 0.15 0.76LEMON 1.00 1.00 1.00 1.00 1.00 0.94 0.98 1.00 0.99 1.00 1.00 0.97 0.37 0.94

Non-match


and the ineffectiveness of the double-entity generation strategywhen the data does not follow the matched schemas (i.e., is dirty).

6.3 Ablation StudyIncluded in Table 3 and Table 4 is also an ablation study. We ex-amine what effect it has when we remove each of the three maincomponents of LEMON: 1) Dual Explanations. Instead of dual ex-planations, we produce one joint explanation for both records. Weuse 𝐾 = 10 for a fair comparison. 2) Attribution Potential. We usethe interpretable representation of the LIME baseline and do notestimate any attribution potential. 3) Counterfactual Granularity.We fix the granularity to be one token. In addition, we examine theeffect of adding counterfactual granularity to the baselines SHAPand integrated gradients (there is no trivial way to do the same forattribution potential).

LEMON performs better across the board for matches with dualexplanations, but it is more varied for non-matches. This makessense since the problematic interaction effects mainly occur whentwo records match and have a lot of similar content. Unsurpris-ingly, since its primary goal is to explain how records could matchbetter, attribution potential only significantly improves non-matchexplanations. Nevertheless, the improvement for non-matches isdramatic, demonstrating how effective attribution potential is forexplaining record pairs that do not match. Finally, we observe thatthe effectiveness of counterfactual granularity varies hugely fromdataset to dataset. It makes the biggest difference on datasets wherewe consider the records to have multiple high-quality pieces ofinformation — either in the form of several high-quality attributesor long textual attributes with multiple high-quality keywords.

Nils Barlaug

Table 4: Perturbation error PE for all the evaluated explainability methods across all datasets and the two matchers.

Model Type Method

DatasetStructured Dirty Textual

AG B DA DG FZ IA WA DA DG IA WA AB C Mean

Magellan

Match


Non-match


Bert-Mini

Match


Non-match


6.4 StabilitySeveral of the benchmarked methods, including LEMON itself, relyon random sampling of the neighborhood of 𝑥 . Different initialrandom seeds will result in different explanations. However, withsufficient samples we would like a well-behaved method to gen-erate similar explanations — i.e. explanations to be stable and notchange much if a different random seeds are used. Stability is adesirable trait from a trust perspective but also especially usefulwhen examining or debugging a matcher. If we make changes to amatcher, we want to be confident that the differences we observein the explanations mostly reflect the matcher changes and notinstability of the explanation method. LEMON not only rely onsampling the neighborhood of 𝑥 in the interpretable domain 𝐼𝑥 ,but also rely on sampling to approximate 𝑡𝑥 when translating from

𝐼𝑥 . A natural question to ask is if this additional random samplinghurts stability.

Let 𝑒1𝑥 = {(𝑤𝑖1, 𝑝𝑖2)}𝑖∈𝐸1 and 𝑒2𝑥 = {(𝑤𝑖2, 𝑝𝑖2)}𝑖∈𝐸2 be two ex-planations for the same input 𝑥 and matcher 𝑓 with different ran-dom seed. Let the the similarity between the two explanations𝑠 (𝑒1, 𝑒2) = 𝑒1∩𝑒2

𝑒1∪𝑒2 be the weighted Jaccard coefficient such that theintersection is

𝑒1𝑥 ∩ 𝑒2𝑥 =∑

𝑖∈𝐸1∩𝐸2

[(𝑤𝑖1 ¤∩𝑤𝑖2) + (𝑝𝑖1 ¤∩𝑝𝑖2)

](10)

and the union is𝑒1𝑥 ∪ 𝑒2𝑥 =

∑𝑖∈𝐸1∩𝐸2

max( |𝑤𝑖1 |, |𝑤𝑖2 |) +max( |𝑝𝑖1 |, |𝑝𝑖2 |)

+∑

𝑖∈𝐸1\𝐸2

(𝑤𝑖1 + 𝑝𝑖1) +∑

𝑖∈𝐸2\𝐸1

(𝑤𝑖2 + 𝑝𝑖2)(11)


S-AG S-B S-DA S-DG S-FZ S-IA S-WA D-DA D-DG D-IA D-WA T-AB T-C0

0.20.40.60.81

Match

S-AG S-B S-DA S-DG S-FZ S-IA S-WA D-DA D-DG D-IA D-WA T-AB T-C0

0.20.40.60.81

Non

-match

LIME SHAP Landmark LEMON

Figure 4: Stability of explainability methods based on neighborhood sampling for Bert-Mini on all datasets.

where ¤∩ is a shorthand for

𝑟 ¤∩𝑞 = 𝐻 (𝑟𝑞)min( |𝑟 |, |𝑞 |) (12)

and 𝐻 is the unit step function. For LIME and SHAP 𝑝𝑖 is 0 for all𝑖 . To be able to compare explanations of different granularity, allexplanations are normalized to single token interpretable features— i.e. if feature 𝑖 is a 𝑛-token interpretable feature we split it into 𝑛features with attribution 𝑤𝑖

𝑛 and attribution potential 𝑝𝑖𝑛 . Finally,

we define the stability of an explanation method as the expectedsimilarity between two explanations E𝑥

[𝑠 (𝑒1𝑥 , 𝑒2𝑥 )

]. Note that this

definition slightly favors methods such as SHAP and Landmarkthat uses all interpretable features in its explanations instead ofonly the 𝐾 most important like LIME and LEMON. Picking the𝐾 most important interpretable features controls the explanationcomplexity at the cost of exposing the method to more instabilitybecause small changes in importance can change which featuresare within or outside top 𝐾 .

Figure 4 show the estimated stability of LIME, SHAP, Landmark,and LEMON for Bert-Mini on all datasets. For each dataset, wesample 100 predicted matches and non-matches uniformly at ran-dom, generate two explanations with different random seed foreach example, and average the similarities. We see that LEMON isrelatively stable and is similar to LIME in terms of stability despitethe sampling-based approximation of 𝑡𝑥 . SHAP is overall the moststable method, while Landmark is the least stable. To understandthese differences in stability it is important to also take into accountsample size.

Neighborhood sample size. From our experience, the mainconcern when choosing the neighborhood sample size |Z𝑥 | is sta-bility. One achieves satisfactory counterfactual 𝐹1 score and pertur-bation error with relatively few samples, but it takes considerablymore samples to get stable explanations. Thus, picking |Z𝑥 | mostlyboils down to a trade-off between stability and speed.

0 1,000 2,000 3,0000

0.2

0.4

0.6

0.8

1Match

Stability

0 1,000 2,000 3,000

Non-match

LIME SHAP Landmark LEMON

|ZG | |ZG |

Figure 5: Stability of explainability methods based on neigh-borhood sampling for Bert-Mini on the Abt-Buy dataset.

The different neighborhood sampling-based methods have dif-ferent strategies for picking a sample size2, so it is interesting toalso compare the stability at equal samples sizes. Figure 5 show thestability of the explainability methods for Bert-Mini on the Abt-Buydataset3 when we vary the neighborhood sample size. As before,we sample 100 predicted matches and non-matches, generate twoexplanations per examples, and estimate the stability to be the av-erage similarity between the explanation pairs. We observe thatLEMON is close to or equally sample efficient as LIME and SHAPfor matches, and slightly more for non-matches. This shows thatthe difference in stability between SHAP and LEMON is mainlya matter of difference in sample sizes. We deliberately use a lessaggressive sampling scheme for LEMON than SHAP because wefind the returns in terms of stability diminishing — especially given2SHAP defaults to 2𝑑𝑥 + 2048, Landmark to 500, and our LIME baseline and LEMONto clamp (30𝑑𝑥 , 500, 3000) .3The methods generally behave similarly on the other datasets, but we only includeone due to space constraints.

Nils Barlaug

Table 5: Counterfactual precision of users after being shownan explanation from LIME or LEMON.

Dataset

Method

LIME LEMON

Match Non-match Match Non-match

StructuredAmazon-Google 0.71 0.22 0.77 0.42Beer 0.47 0.16 0.50 0.63DBLP-ACM 0.82 0.06 0.79 0.23DBLP-GoogleScholar 0.53 0.08 0.62 0.27Fodors-Zagats 0.59 0.14 0.69 0.46iTunes-Amazon 0.45 0.18 0.60 0.60Walmart-Amazon 0.55 0.24 0.63 0.58

DirtyDBLP-ACM 0.71 0.02 0.81 0.40DBLP-GoogleScholar 0.45 0.08 0.58 0.48iTunes-Amazon 0.53 0.22 0.75 0.56Walmart-Amazon 0.53 0.24 0.62 0.71

TextualAbt-Buy 0.55 0.14 0.62 0.73Company 0.14 0.16 0.29 0.35

Mean 0.54 0.15 0.63 0.49

the higher computationally footprint of LEMON. Landmarks insta-bility, however, can not be attributed to the lower sampling size. Itis clear from Figure 5 that the method is significantly less samplingefficient than the others. We suspect this is mostly due to the largeneighborhood used when sampling.

6.5 User StudyTo examine if explanations from LEMON improve human subjectsunderstanding of a matcher compared to LIME, we adopt the exper-iment on counterfactual interpretation from Section 6.1 to humansubjects. We recruit random test users from the research surveyplatform Prolific. Note that these users are laypersons and do nothave any experience with entity matching or a background in com-puter science. A user is shown an explanation for a record pairand then asked what they think would be a minimal change to therecord pair that would make the matcher predict the opposite. Eachuser is shown one explanation for a match and a non-match foreach dataset, and we only use the Bert-Mini matcher. Afterward, wecheck what fraction of them successfully get the opposite prediction— i.e., the counterfactual precision. We conduct the experiment on50 users for LIME and 50 other users for LEMON, and report thecounterfactual precision in Table 5.

As expected, and in line with the experiments above, the greatestimprovement is for non-matches. We see an average improvementin the counterfactual precision of 0.09 for matches and 0.34 fornon-matches. The results are generally less pronounced than forthe simulated experiments. We suspect the lower maximum scoresare a reflection of the difficulty of the task for laypersons, and thatthe higher minimum scores are a reflection of humans ability to usecommon sense to "save" weak explanations. Note that we cannotcompare to Landmark [2] since the authors do not propose any

way of presenting an actual explanation to a user. The combinationof double-entity generation and not limiting Ω(𝑔) (i.e., explainingusing all features instead of limiting them to 𝐾) makes such apresentation non-trivial.

7 CONCLUSIONLocal post hoc feature attribution is a valuable and popular type ofexplainability method that can explain any classifier, but standardmethods leaves significant room for improvement when applied toentity matching. We have identified three challenges of applyingsuch methods to entity matching and proposed LEMON, a model-agnostic and schema-flexible method that addresses all three chal-lenges. Experiments and a novel evaluation method for explainableentity matching show that our proposed method is more faithfulto the matcher and more effective in explaining to the user wherethe decision boundary is — especially for non-matches. Lastly, userstudies verify a significant real-world improvement in understand-ing for laypersons seeing LEMON explanations compared to naiveLIME explanations.

There is still much to be done within explainable entity matching.A disadvantage of LEMON (and other perturbation-based meth-ods like LIME and SHAP) is their running time. Even though it ispossible to trade off significantly shorter running time for explana-tion stability and trivial to parallelize the computational bottleneck(running inference of matcher 𝑓 ), depending on the hardware andmatcher, it might still be infeasible in practice to do real-time expla-nation or generate explanations for all record pairs in large datasetswhile still achieving satisfactory stability. Therefore, more efficientsampling strategies should be explored. Furthermore, there is stillwork to be done on examining the adaptation of other explainabil-ity methods for entity matching in-depth and on how to evaluatethem. Our experiments and ablation study show dual explanationsand counterfactual granularity are easily applicable to SHAP andgradient based methods — and that they are indeed effective forother methods than LIME. It is less clear how one would adapt theideas of attribution potential to those methods, and we hope toaddress that in future work.

ACKNOWLEDGMENTSThis work is supported by Cognite and the Research Council ofNorway under Project 298998. We thank Hassan Abedi Firouzjaei,Jon Atle Gulla, Dhruv Gupta, Ludvig Killingberg, Kjetil Nørvåg,and Mateja Stojanović for valuable feedback.

REFERENCES[1] Amina Adadi and Mohammed Berrada. 2018. Peeking Inside the Black-Box: A

Survey on Explainable Artificial Intelligence (XAI). IEEE Access 6 (2018), 52138–52160. https://doi.org/10.1109/ACCESS.2018.2870052

[2] Andrea Baraldi, Francesco Del Buono, Matteo Paganelli, and Francesco Guerra.2021. Using Landmarks for Explaining Entity Matching Models. In EDBT. Open-Proceedings.org, 451–456.

[3] Nils Barlaug and Jon Atle Gulla. 2021. Neural Networks for Entity Matching: ASurvey. ACM Transactions on Knowledge Discovery from Data 15, 3 (April 2021),1–37. https://doi.org/10.1145/3442200

[4] Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Ar-chitectures - A Step Forward in Data Integration. In EDBT. OpenProceedings.org,463–473.

[5] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage,Entity Resolution, and Duplicate Detection. Springer-Verlag, Berlin Heidelberg.https://doi.org/10.1007/978-3-642-31164-2

https://doi.org/10.1109/ACCESS.2018.2870052

https://doi.org/10.1145/3442200

https://doi.org/10.1007/978-3-642-31164-2


[6] Sanjib Das, A Doan, C Gokhale Psgc, P Konda, Y Govind, and D Paulsen. 2015.The Magellan Data Repository. (2015).

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:Pre-Training of Deep Bidirectional Transformers for Language Understanding. InProceedings of the 2019 Conference of the North American Chapter of the Associationfor Computational Linguistics: Human Language Technologies, Volume 1 (Long andShort Papers). Association for Computational Linguistics, Minneapolis, Minnesota,4171–4186. https://doi.org/10.18653/v1/N19-1423

[8] Vincenzo Di Cicco, Donatella Firmani, Nick Koudas, Paolo Merialdo, and DiveshSrivastava. 2019. Interpreting Deep Learning Models for Entity Resolution:An Experience Report Using LIME. In Proceedings of the Second InternationalWorkshop on Exploiting Artificial Intelligence Techniques for Data Management -aiDM ’19. ACM Press, Amsterdam, Netherlands, 1–4. https://doi.org/10.1145/3329859.3329878

[9] AnHai Doan, Alon Halevy, and Zachary G. Ives. 2012. Principles of Data Integra-tion. Morgan Kaufmann, Waltham, MA.

[10] Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for InterpretableMachine Learning. Commun. ACM 63, 1 (Dec. 2019), 68–77. https://doi.org/10.1145/3359786

[11] Amr Ebaid, Saravanan Thirumuruganathan, Walid G. Aref, Ahmed Elmagarmid,and Mourad Ouzzani. 2019. EXPLAINER: Entity Resolution Explanations. In2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, Macao,Macao, 2000–2003. https://doi.org/10.1109/ICDE.2019.00224

[12] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, MouradOuzzani, and Nan Tang. 2018. Distributed Representations of Tuples for EntityResolution. Proceedings of the VLDB Endowment 11, 11 (July 2018), 1454–1467.https://doi.org/10.14778/3236187.3236198

[13] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007.Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and DataEngineering 19, 1 (Jan. 2007), 1–16. https://doi.org/10.1109/TKDE.2007.250581

[14] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, andLalana Kagal. 2018. Explaining Explanations: An Overview of Interpretability ofMachine Learning. In 2018 IEEE 5th International Conference on Data Science andAdvanced Analytics (DSAA). IEEE, Turin, Italy, 80–89. https://doi.org/10.1109/DSAA.2018.00018

[15] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, FoscaGiannotti, and Dino Pedreschi. 2019. A Survey of Methods for Explaining BlackBox Models. Comput. Surveys 51, 5 (Jan. 2019), 1–42. https://doi.org/10.1145/3236009

[16] Pradap Konda, Jeff Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, VijayRaghavendra, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan,Jeffrey R. Ballard, Han Li, Fatemah Panahi, and Haojun Zhang. 2016. Magellan:Toward Building Entity Matching Management Systems. Proceedings of the VLDBEndowment 9, 12 (Aug. 2016), 1197–1208. https://doi.org/10.14778/2994509.2994535

[17] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Res-olution Approaches on Real-World Match Problems. Proceedings of the VLDB En-dowment 3, 1-2 (Sept. 2010), 484–493. https://doi.org/10.14778/1920841.1920904

[18] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan.2020. Deep Entity Matching with Pre-Trained Language Models. Proceedings ofthe VLDB Endowment 14, 1 (Sept. 2020), 50–60. https://doi.org/10.14778/3421424.3421431 arXiv:2004.00584

[19] Stan Lipovetsky and Michael Conklin. 2001. Analysis of Regression in GameTheory Approach. Applied Stochastic Models in Business and Industry 17, 4 (Oct.2001), 319–330. https://doi.org/10.1002/asmb.446

[20] Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting ModelPredictions. In Advances in Neural Information Processing Systems, I. Guyon, U. V.Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.),Vol. 30. Curran Associates, Inc.

[21] Christoph Molnar. 2019. Interpretable Machine Learning - A Guide for MakingBlack Box Models Explainable.

[22] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park,Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018.Deep Learning for Entity Matching: A Design Space Exploration. In Proceedingsof the 2018 International Conference on Management of Data - SIGMOD ’18. ACMPress, Houston, TX, USA, 19–34. https://doi.org/10.1145/3183713.3196926

[23] Hao Nie, Xianpei Han, Ben He, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, and HaoKong. 2019. Deep Sequence-to-Sequence Entity Matching for HeterogeneousEntity Resolution. In Proceedings of the 28th ACM International Conference on Infor-mation and Knowledge Management (CIKM ’19). Association for Computing Ma-chinery, New York, NY, USA, 629–638. https://doi.org/10.1145/3357384.3358018

[24] Kun Qian, Lucian Popa, and Prithviraj Sen. 2019. SystemER: A Human-in-the-Loop System for Explainable Entity Resolution. Proceedings of the VLDB Endow-ment 12, 12 (Aug. 2019), 1794–1797. https://doi.org/10.14778/3352063.3352068

[25] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should ITrust You?": Explaining the Predictions of Any Classifier. In Proceedings of the22nd ACM SIGKDD International Conference on Knowledge Discovery and DataMining. ACM, San Francisco California USA, 1135–1144. https://doi.org/10.1145/2939672.2939778

[26] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning Im-portant Features through Propagating Activation Differences. In Proceedingsof the 34th International Conference on Machine Learning (Proceedings of Ma-chine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR,3145–3153.

[27] Erik Štrumbelj and Igor Kononenko. 2014. Explaining Prediction Models andIndividual Predictions with Feature Contributions. Knowledge and InformationSystems 41, 3 (Dec. 2014), 647–665. https://doi.org/10.1007/s10115-013-0679-x

[28] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attributionfor Deep Networks. In International Conference on Machine Learning. PMLR,3319–3328.

[29] Saravanan Thirumuruganathan, Mourad Ouzzani, and Nan Tang. 2019. Explain-ing Entity Resolution Predictions: Where Are We and What Needs to Be Done?.In Proceedings of the Workshop on Human-In-the-Loop Data Analytics - HILDA’19.ACM Press, Amsterdam, Netherlands, 1–6. https://doi.org/10.1145/3328519.3329130

[30] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-Read Students Learn Better: On the Importance of Pre-Training Compact Models.arXiv:1908.08962 [cs] (Sept. 2019). arXiv:1908.08962 [cs]

[31] Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual Ex-planations Without Opening the Black Box: Automated Decisions and the GDPR.SSRN Scholarly Paper ID 3063289. Social Science Research Network, Rochester,NY. https://doi.org/10.2139/ssrn.3063289

[32] Chen Zhao and Yeye He. 2019. Auto-EM: End-to-End Fuzzy Entity-MatchingUsing Pre-Trained Deep Models and Transfer Learning. In The World Wide WebConference on - WWW ’19. ACM Press, San Francisco, CA, USA, 2413–2424.https://doi.org/10.1145/3308558.3313578

https://doi.org/10.18653/v1/N19-1423

https://doi.org/10.1145/3329859.3329878

https://doi.org/10.1145/3329859.3329878

https://doi.org/10.1145/3359786

https://doi.org/10.1145/3359786

https://doi.org/10.1109/ICDE.2019.00224

https://doi.org/10.14778/3236187.3236198

https://doi.org/10.1109/TKDE.2007.250581

https://doi.org/10.1109/DSAA.2018.00018

https://doi.org/10.1109/DSAA.2018.00018

https://doi.org/10.1145/3236009

https://doi.org/10.1145/3236009

https://doi.org/10.14778/2994509.2994535

https://doi.org/10.14778/2994509.2994535

https://doi.org/10.14778/1920841.1920904

https://doi.org/10.14778/3421424.3421431

https://doi.org/10.14778/3421424.3421431

https://arxiv.org/abs/2004.00584

https://doi.org/10.1002/asmb.446

https://doi.org/10.1145/3183713.3196926

https://doi.org/10.1145/3357384.3358018

https://doi.org/10.14778/3352063.3352068

https://doi.org/10.1145/2939672.2939778

https://doi.org/10.1145/2939672.2939778

https://doi.org/10.1007/s10115-013-0679-x

https://doi.org/10.1145/3328519.3329130

https://doi.org/10.1145/3328519.3329130

https://arxiv.org/abs/1908.08962

https://doi.org/10.2139/ssrn.3063289

https://doi.org/10.1145/3308558.3313578

lemon: explainable entity matching

Documents