LEMON: Explainable Entity Matching

Nils Barlaug†
Norwegian University of Science and Technology, Trondheim, Norway
ABSTRACT
State-of-the-art entity matching (EM) methods are hard to interpret, and there is significant value in bringing explainable AI to EM. Unfortunately, most popular explainability methods do not work well out of the box for EM and need adaptation. In this paper, we identify three challenges of applying local post hoc feature attribution methods to entity matching: cross-record interaction effects, non-match explanations, and variation in sensitivity. We propose our novel model-agnostic and schema-flexible method LEMON¹ that addresses all three challenges by (i) producing dual explanations to avoid cross-record interaction effects, (ii) introducing the novel concept of attribution potential to explain how two records could have matched, and (iii) automatically choosing explanation granularity to match the sensitivity of the matcher and record pair in question. Experiments on public datasets demonstrate that the proposed method is more faithful to the matcher and does a better job of helping users understand the decision boundary of the matcher than previous work. Furthermore, user studies show that the rate at which human subjects can construct counterfactual examples after seeing an explanation from our proposed method increases from 54% to 64% for matches and from 15% to 49% for non-matches compared to explanations from a standard adaptation of LIME.
KEYWORDS
entity matching, entity resolution, explainability
1 INTRODUCTION
Entity matching is an essential task in data integration [9]. It is the task of identifying which records refer to the same real-world entity. Table 1 shows an example. Machine learning has become a standard tool to tackle the variety of data and to avoid laborsome feature engineering from experts while still achieving high accuracy [e.g., 16, 18, 22]. Unfortunately, this often comes at the cost of reduced transparency and interpretability. While it is possible to carefully select classical machine learning methods that are intrinsically interpretable and combine them with classical string similarity metrics, the current state of the art is large deep learning models [4, 12, 18, 22], which offer very limited interpretability out of the box. The possible benefits of being able to explain black-box models are numerous. To mention some: 1) researchers can gain new insight into their models and find ways to improve them; 2) practitioners gain a valuable tool for verifying that the models work as expected and for debugging those that do not; 3) companies get the transparency they need to trust such black boxes for mission-critical data integration efforts; 4) end-users can be reassured that models and their results can be trusted (or discover themselves that they should not be).
† Also with Cognite.
¹ https://github.com/NilsBarlaug/lemon
Table 1: Example of two records that need to be classified as either a match or a non-match. In this case, from the Walmart-Amazon dataset.

Attribute  | Record a                                 | Record b
title      | belkin shield micra for ipod touch tint  | belkin ipod touch shield micra tint-royal purple
category   | mp3 accessories                          | cases
brand      | belkin                                   | belkin
modelno    | f8z646ttc01                              | f8z646ttc02
price      | 47.88                                    | 12.49
The challenge of explaining machine learning models, and the potential benefits of doing so, are of course not unique to entity matching. Therefore, explainable machine learning has in recent years received significant attention from the broader research community [1, 14, 15]. The result is a multitude of techniques and methods with different strengths and weaknesses. But as previous work has discussed, applying these techniques to entity matching is non-trivial. It is necessary to adapt and evolve them to tackle the unique characteristics of entity matching [8, 11, 29].
Local post hoc feature attribution methods are perhaps the most popular type of explainability method in general, and the most studied so far for entity matching [2, 8, 11, 29]. Previous work on explainable entity matching builds on LIME [25], one of the most popular methods of that type. In this paper, we choose to focus mainly on LIME to be consistent with, and for ease of comparison to, earlier work. Our work is relevant beyond LIME, and we will reference and include other methods in our experiments, but we consider in-depth adaptation and treatment of other methods outside the scope of this paper and hope to address them in future work.
Challenges. Unfortunately, standard local post hoc attribution methods do not work satisfactorily for entity matching out of the box. We identify three challenges of applying them:
(1) Cross-record interaction effects: Since EM is a matching problem, features across a record pair will tend to have strong interaction effects, but linear surrogate models such as the one in LIME implicitly assume independent features. This can severely impair the accuracy of the surrogate model.
(2) Non-match explanations: In essence, most feature attribution methods analyze the effect of removing features to determine their attribution. However, for record pairs for which the matcher is fairly confident that they do not match, so that the match score is close to zero, removing features is unlikely to make any significant difference to the match score. The result is that we have no significant attributions to explain why the records do not match. This is especially important since most record pairs do not match.
arXiv:2110.00516v1 [cs.DB] 1 Oct 2021
(3) Variation in sensitivity: While for some record pairs perturbing only a few features is enough to trigger a significant change in the output of the matcher, others may contain many redundant features, making it hard to substantially impact the matching score and provide meaningfully sized attributions. It can be hard to make the trade-off between token and attribute level feature granularity. You risk being either too fine-grained or unnecessarily coarse-grained, and the right choice differs between specific record pairs in the same dataset, across datasets, and across matchers.
As we will outline in Section 2, earlier work has only partially addressed these challenges.
Proposed method. Our proposed method specifically addresses all three challenges jointly by: 1) making dual explanations to avoid cross-record interaction effects; 2) introducing the novel concept of attribution potential, an improvement over the copy/append perturbations from previous work that is schema-flexible and more robust to dirty data and to matchers sensitive to the order of tokens; 3) choosing an interpretable representation granularity that optimizes the trade-off between counterfactual interpretation and the finest granularity possible. Our proposed method provides one unified frame of interpretation with the same single explanation format for all record pairs, has no dataset-specific hyperparameters that need tinkering, and does not require matched schemas.
Evaluation of explainability methods is still an open problem, and there are no standard ways of evaluating explainable entity matching. Ideally, in broad terms, we would like to measure to what degree an explanation helps users understand how the model makes a matching decision. Inspired by the motivations behind counterfactual examples [31], we argue that a useful attribution explanation should help the user understand where the decision boundary is and what kind of difference in input would be necessary to sway the matcher. To that end, we propose to ask users what they think would be a minimal change to a record pair to make the matcher change its prediction and then check if they are correct. We show how this can be done for simulated users as well as human subjects.
Finally, our results show that our proposed method is state-of-the-art, both in terms of faithfulness and helping users understand the matcher's behavior. Additionally, our user study shows real-world improvement in understanding by human subjects.
Contributions. In summary, our main contributions are:
• We propose a method that addresses three important challenges of applying local post hoc attribution methods to entity matching: 1) cross-record interaction effects, 2) non-match explanations, and 3) variation in sensitivity. We show through experiments that this is indeed effective.
• To evaluate entity matching attribution explanations, we propose a novel evaluation method that aims to measure to what degree explanations help users understand the decision boundary of a matcher. We show how to perform experiments on both simulated users and human subjects.
• Through extensive experiments on public datasets we show that our proposed method is state-of-the-art both in terms of faithfulness and helping users understand the matcher's behavior. We verify the real-world applicability of our proposed method by performing an extensive user study. To the best of our knowledge, we are the first to conduct a user study for explainable entity matching.
Outline. The rest of the paper is organized as follows. Section 2 briefly covers related work, Section 3 covers LIME and its adaptation to entity matching, and Section 4 goes into the details of our proposed method. We explain our experimental setup in Section 5, and then we walk through and discuss the experiments in Section 6 before we make our concluding remarks in Section 7.
2 RELATED WORK
Machine learning for EM. The immense variety in datasets makes machine learning a natural solution for entity matching. The traditional approach has been to handcraft string similarity metrics to produce similarity feature vectors and then utilize a classical off-the-shelf machine learning model such as SVM or random forest to classify them [5, 13, 16]. The two main drawbacks of this approach are the necessary manual tinkering and poor performance on dirty data [22]. However, in the last few years, the research community has increasingly adopted deep learning [4, 12, 18, 22, 23, 32]. While early work focused on custom architectures and trained models from scratch, the current state of the art focuses on fine-tuning large natural language models such as BERT [7], which offers higher accuracy and decreases the need for training examples [18]. We refer to Barlaug and Gulla [3] for an extensive survey on deep learning for EM.
Explainable AI. There are many ways to explain machine learning models. Generally, explanations can be either global or local [10], in other words, explaining the model's behavior as a whole or explaining a single prediction. Furthermore, we often distinguish between intrinsically interpretable models and post hoc interpretation methods [21] (which can be model-agnostic or not). The former are models that are interpretable on their own, like linear regression or decision trees, while the latter are methods for explaining black-box models. We refer the reader to one of many extensive sources on the topic [1, 10, 14, 15, 21].
A particularly prominent group of approaches are local post hoc attribution methods [e.g., 20, 25, 28], which aim to explain a prediction by communicating to what degree different parts of the input are to be attributed for the prediction. LIME [25] is one of the most prominent among such methods. It is a model-agnostic method, and works by randomly removing features of an input example and training a (usually linear) interpretable surrogate model to predict the model's output for these perturbations. Among other popular local post hoc attribution methods are the game-theoretic SHAP [20] and gradient-based methods [e.g., 26, 28].
Explainable EM. The use of explainability techniques for machine learning-based entity matching is still a young subject, and there has only been a limited amount of previous work. However, we note that rule-based methods have historically been used to make systems that can be interpreted by experts [13], and they are an alternative way to make explainable matchers [24].
Ebaid et al. [11] demonstrate ExplainER, a tool for explaining entity matching that utilizes multiple prominent explainability techniques, while Thirumuruganathan et al. [29] discuss challenges and research opportunities. Further, there have been two significant adaptations of LIME for entity matching, which we will now describe and contrast to our work.
Mojito [8] introduces two versions of LIME for entity matching: LIME_DROP and LIME_COPY. The former is a straightforward application of LIME similar to how the original authors do text classification using token level feature granularity, while in the latter they use an attribute level representation and perturb by copying the entire attribute value to the corresponding attribute in the other record instead of removing tokens. LIME_COPY is an elegant way to address challenge (2), but leaves more to be desired. Firstly, attribute level granularity is too coarse-grained for most cases with longer textual attributes (the extreme case being a single textual attribute). Secondly, since it is separate from LIME_DROP, it requires the user to interpret two different kinds of explanations. Finally, it relies on a matched schema with one-to-one attribute correspondence.
Recently, Landmark [2] was proposed as a two-part improvement over Mojito. Firstly, it makes two explanations, one per record, and avoids perturbing both records simultaneously, which effectively solves challenge (1). Secondly, for record pairs labeled as non-matches, instead of perturbing by randomly copying entire attributes, it appends every corresponding attribute value from the other record and performs regular token level exclusion perturbation (a technique named double-entity generation), effectively combining LIME_DROP and LIME_COPY. The authors demonstrate through experiments that their techniques are indeed effective and that Landmark is a substantial improvement over Mojito. The limitation of the approach is that tokens from the other record are only ever considered to be appended at the end of the corresponding attribute, which is unfortunate if the matcher is sensitive to the order of tokens (e.g., many products have the brand name first in the title), the schemas are not matched one-to-one, or the data is dirty. Similar to Mojito, Landmark also makes two different kinds of explanations.
While making important contributions, neither Mojito nor Landmark addresses all three challenges identified in Section 1. Only Landmark tackles challenge (1). Both propose a solution to challenge (2), but, as discussed above, with important limitations. Neither addresses challenge (3). Moreover, they do not provide a unified and coherent way of actually communicating or visualizing an explanation to the end-user in the same way the original authors of LIME do, something we aim to do.
3 PRELIMINARIES
In this section, we introduce LIME and describe how it can be adapted for entity matching.
3.1 LIME
The main idea of LIME [25] is to locally approximate a classifier around one instance with an interpretable model over an interpretable representation in a way that balances faithfulness and complexity, and then use the interpretable model as an explanation. The intuition is that while our problem is too complex for classical machine learning models that we regard as inherently interpretable (e.g., linear regression or decision trees) to be accurate enough, it might be possible to faithfully approximate the decision boundary of a black-box model locally around one input instance. In other words, we can train an interpretable surrogate model using local data points sampled by perturbing an input instance and use it as an explanation of that particular instance. And while the input features of a black-box model might be unsuited for human interpretation (e.g., deep learning embeddings or convoluted string metrics), we can define an alternative interpretable representation for the input instance we want to explain and use that to train our interpretable surrogate model.
Formally, let f(x) : A × B → ℝ be the classifier we want to explain, in our case an entity matching model that accepts two records, a ∈ A and b ∈ B, and outputs a prediction score. For a single instance x = (a, b) that we want to explain, let I_x = {0, 1}^{d_x} be the interpretable domain and x' ∈ I_x the interpretable representation of x. Its elements represent the presence (or absence) of what Ribeiro et al. [25] call interpretable components, essentially non-overlapping parts of the input. E.g., for text, it could be the presence of different words. For each x there must exist a function t_x(x') : I_x → A × B that can translate an interpretable representation to the input domain of the classifier f.
With the goal of approximating f locally around x, we sample a new dataset Z_x = {(z', y) ∈ I_x × ℝ} where y = f(t_x(z')). Each z' is drawn by setting a uniformly random-sized subset of x' to zero. Let G be a class of interpretable models over I_x. Furthermore, let L(f, g, π_x) be how unfaithful g ∈ G is to f in the neighborhood defined by the distance kernel π_x(z'). We want to find a g ∈ G that is as faithful to f as possible, but since many interpretable models can be made more accurate by increasing their complexity, we need to balance faithfulness with model complexity so that g is simple enough to actually be interpretable for humans. To that end, let Ω(g) be a measure of complexity for g, and choose the following explanation for x:
ξ(x) = argmin_{g ∈ G} [ L(f, g, π_x) + Ω(g) ]    (1)
In practice, Ribeiro et al. [25] chose G to be weighted sparse linear regression models and L to be the mean squared error weighted by π_x on Z_x. Furthermore, Ω(g) is chosen to be the number of non-zero coefficients of g, and the trade-off between L and Ω is simplified by constraining Ω(g) not to be greater than a constant K known to be low enough. It is then simply a matter of fitting a regularized weighted least squares linear regression model on Z_x. For a simplified overview of the whole process, see Figure 1.
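As a concrete illustration, the core surrogate fit can be sketched in a few lines of NumPy. The black-box function, the sample count, and all names below are toy stand-ins of our own (not the paper's implementation), and the sparsity constraint is omitted for brevity; only the kernel-weighted least-squares step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4  # number of interpretable components

# Toy black-box score: component 0 dominates, component 1 helps a little.
def f(Z):
    return 0.8 * Z[:, 0] + 0.2 * Z[:, 1]

x_prime = np.ones(d)                   # instance to explain: all components present
Z = rng.integers(0, 2, size=(500, d))  # perturbations z' (random subsets zeroed)
y = f(Z)                               # black-box outputs for the perturbations

D = (Z != x_prime).sum(axis=1)         # Hamming distance D(x', z')
pi = np.exp(-2 * D / d)                # neighborhood kernel pi_x(z')

# Weighted least squares without intercept: scale rows by sqrt(pi), solve OLS.
sw = np.sqrt(pi)
coef, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
print(coef.round(2))
```

Since the toy score is exactly linear, the surrogate recovers the weights, with component 0 receiving the largest attribution.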
3.2 LIME for Entity Matching
Before we describe our proposed method in the next section, we go through the design decisions made within the LIME framework, a setup we then build upon and use as a baseline for our proposed method.
Let a record r = {(α_j, v_j)} be an ordered set of attribute name and value pairs. Inspired by how Ribeiro et al. do LIME for text classification, we define the interpretable representation I_x to be the absence of unique tokens in attribute names and values for both records in x = (a, b) (i.e., t_x(0) = x). Attribute values that are
Figure 1: Illustration of LIME [25] for a binary classification problem. (Pipeline steps shown: interpretable components, sample perturbations, fit weighted linear regression model, make explanation.)
not strings are treated as single tokens, and their absence is their null/zero value. We sample Z_x by setting random subsets of x' to one, where the sizes of the subsets are sampled uniformly from the interval [0, D_max], and we use the neighborhood distance kernel

π_x(z') = exp(−2 · D(x', z') / D_max)    (2)

where D is the Hamming distance. While the original authors simply used D_max = d_x for text classification, we empirically find this neighborhood too large due to entity matching generally being more sensitive to single tokens than standard text classification. This could, of course, be accounted for by narrowing π_x, but it is more sample efficient to also reduce the neighborhood we are sampling from. We use D_max = max(5, ⌈d_x/5⌉), and let the number of samples |Z_x| be clamp(30·d_x, 500, 3000). In our experience, none of these parameters are especially sensitive.
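The neighborhood parameters above are simple closed-form rules. A minimal sketch, assuming the ceiling reading of the size bound and using our own function names:

```python
import math

def d_max(d_x: int) -> int:
    # D_max = max(5, ceil(d_x / 5))
    return max(5, math.ceil(d_x / 5))

def num_samples(d_x: int) -> int:
    # |Z_x| = clamp(30 * d_x, 500, 3000)
    return min(max(30 * d_x, 500), 3000)

def kernel(hamming: int, d_x: int) -> float:
    # pi_x(z') = exp(-2 * D(x', z') / D_max)   (Eq. 2)
    return math.exp(-2 * hamming / d_max(d_x))

print(d_max(100), num_samples(100), round(kernel(5, 100), 3))
```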
Finally, we let G be weighted linear regression models without intercept, with loss

L(f, g, π_x) = Σ_{(z', y) ∈ Z_x} π_x(z') · (y − g(z'))²    (3)

ξ(x) is found using weighted least squares and forward selection (choosing K coefficients).
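The forward-selection step can be sketched as follows: greedily add the feature whose inclusion most reduces the weighted squared error until K features are chosen. This is an illustrative implementation of ours, not the paper's code.

```python
import numpy as np

def forward_selection(Z, y, pi, K):
    """Greedy forward selection of K features under weighted least squares."""
    selected = []
    sw = np.sqrt(pi)
    for _ in range(min(K, Z.shape[1])):
        best, best_err = None, np.inf
        for j in range(Z.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            # Fit weighted least squares on the candidate feature set.
            coef, *_ = np.linalg.lstsq(Z[:, cols] * sw[:, None], y * sw, rcond=None)
            err = np.sum(pi * (y - Z[:, cols] @ coef) ** 2)
            if err < best_err:
                best, best_err = j, err
        selected.append(best)
    return sorted(selected)

rng = np.random.default_rng(1)
Z = rng.integers(0, 2, size=(200, 6)).astype(float)
y = 1.0 * Z[:, 2] - 0.5 * Z[:, 4]   # only features 2 and 4 matter
pi = np.ones(len(y))
print(forward_selection(Z, y, pi, K=2))
```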
4 METHOD
We now describe how we address the three challenges described in Section 1 with three distinct but coherent techniques that together form our proposed method: Local explanations for Entities that Match Or Not (LEMON). As discussed earlier, we use LIME as the basis for our method, but the proposed ideas have wider applicability. The three following subsections respectively address and propose a solution to the three challenges: (1) cross-record interaction effects, (2) non-match explanations, and (3) variation in sensitivity.
4.1 Dual Explanations
One shortcoming of LIME, when applied directly to entity matching, is that it does not take into account the inherent duality of the matching. No distinction is made between the two input records.
This is problematic because our surrogate linear regression model assumes independent features, but perturbations across two input records will almost by definition have strong interaction effects. In essence, the surrogate model g cannot sufficiently capture the behaviour of our matcher f even for small neighborhoods, which severely hurts the approximation accuracy.
The proposed solution is relatively straightforward but still effective. Like Baraldi et al. [2], we make two explanations, one for each record. For each explanation, we let I_x represent only the absence of tokens in one record. In effect, we approximate attributions from only one record at a time while keeping the other constant. That way, we avoid the strong interaction effects across them. Intuitively, we explain why record a matches b or not and why record b matches a or not, separately. The two explanations can still be presented together as one joint explanation to the user.
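To make the dual-explanation idea concrete, the toy sketch below perturbs the tokens of one record while holding the other fixed, then swaps roles. The Jaccard-overlap matcher and all names are illustrative stand-ins of our own, not the paper's matcher or code.

```python
def match_score(a_tokens, b_tokens):
    # Toy matcher: Jaccard overlap of the two token sets.
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def single_token_attributions(target, other):
    # Attribution of token t ~ score drop when t is removed from `target`,
    # with the `other` record kept constant (one side of a dual explanation).
    base = match_score(target, other)
    return {t: base - match_score([u for u in target if u != t], other)
            for t in target}

a = ["belkin", "shield", "micra"]
b = ["belkin", "shield", "case"]
expl_a = single_token_attributions(a, b)   # explains record a against fixed b
expl_b = single_token_attributions(b, a)   # explains record b against fixed a
print(expl_a)
```

Because each side is perturbed in isolation, no perturbation ever spans both records, which is exactly what removes the cross-record interaction terms from the surrogate fit.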
4.2 Attribution Potential
Attribution methods such as the LIME implementation described above tell us which parts of the input are the most influential. This is usually achieved using some kind of exclusion analysis where one looks at the difference between the absence and presence of input features [e.g., 20, 25, 28]. While that might be an effective approach for many machine learning problems, it is inherently unsuited for entity matching. The issue lies in explaining non-matches. Record pairs that a matcher classifies as a match can be explained subtractively because removing or zeroing out essential parts of the records will result in lower matching scores from most well-behaved matchers. But for record pairs where the matcher is convinced they do not match and provides a near-zero match score, it is unlikely that removing or zeroing out any part of the records will make a significant difference to the match score. Seemingly, nothing influences the match score, thereby providing no useful signal of the contribution from different features. Notice that standard gradient-based methods are not able to escape this problem because f will, in these cases, be in a flat area and the gradients will be rather uninformative. Intuitively, we cannot explain why two records do not match by what they contain. A natural solution to this problem is instead to explain by what they do not contain.
Interpretable Representation. Let the interpretable representation I_x = {P, A, M}^{d_x} be categorical instead of binary, where the values represent whether the corresponding token is Present, Absent, or Matched. Unsurprisingly, z'_i = A means t_x(z') will exclude token i, and if z'_i = P it will be kept, much like before. On the other hand, if z'_i = M, we copy and inject the token into the other record where it maximizes the match score f(t_x(z')). For now, let us assume we have an accurate and efficient implementation of t_x(z').
For the linear surrogate model, we dummy code I_x, using P as the reference value. When we do forward selection, we select both dummy variables representing a categorical variable at once, so that we either pick the entire categorical variable or not. We then have two estimated coefficients, β_i^A and β_i^M, for each of the K selected interpretable features. Finally, we define the attribution for token i to be w_i = −β_i^A, and the attribution potential to be p_i = β_i^M. Intuitively, w_i is the contribution of token i, and is the same as in LIME, while p_i can be interpreted as the maximum additional contribution token i could have had if the other record matched better. Note that it is important to model this new attribution potential through a categorical variable instead of simply adding another binary variable to I_x, because exclusion and injection perturbations of the same token strongly interact with each other and should be mutually exclusive.
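The dummy coding and the readout of w_i and p_i can be sketched as below. The simulated matcher responses, the intercept in the toy fit, and the use of a plain least-squares solve (rather than forward selection) are our own illustrative assumptions.

```python
import numpy as np

def dummy_code(Z_cat):
    # Z_cat: (n, d) array of strings in {"P", "A", "M"}; P is the reference
    # value, so each token i yields two indicator columns (z_i == A, z_i == M).
    A = (Z_cat == "A").astype(float)
    M = (Z_cat == "M").astype(float)
    return np.concatenate([A, M], axis=1)   # columns [A_1..A_d, M_1..M_d]

rng = np.random.default_rng(0)
d, n = 3, 400
Z_cat = rng.choice(["P", "A", "M"], size=(n, d))
X = dummy_code(Z_cat)

# Simulated matcher responses: removing token 0 costs 0.4; injecting it into
# the other record could add 0.3. Other tokens are irrelevant.
y = 0.5 - 0.4 * X[:, 0] + 0.3 * X[:, d + 0]

beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)
w0 = -beta[1]       # attribution of token 0:          w_i = -beta_i^A
p0 = beta[1 + d]    # attribution potential of token 0: p_i =  beta_i^M
print(round(w0, 2), round(p0, 2))
```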
Approximating t_x. In contrast to plain LIME as described in Section 3.2, t_x is now less straightforward to compute. The difficulty lies in where to inject tokens i for which z'_i = M to maximize f(t_x(z')). Since our method is model-agnostic, the best we can do is try all possible injections. That would, of course, be computationally prohibitive, both because of the high number of possible injection targets and because of the exponential growth of combinations when multiple tokens should be injected. Instead, we approximate it by sampling L combinations of injection targets and picking the one that gives the highest match score.
The possible injection targets for a token in an attribute value are anywhere in the string attributes of the other record (but without splitting tokens in the target attribute value). If the token is a non-string value, it can also overwrite attribute values of the same type, e.g., a number can replace a number attribute in the other record. Tokens from attribute names can only be injected into attribute names. When we sample injection targets, we first pick a target attribute uniformly at random and then a random position within that attribute. In addition, we employ a heuristic to incorporate information about matched schemas if available. In cases where the schemas are matched, we boost the probability of choosing the corresponding attribute as the target to 50%. This makes efficient use of prior knowledge about the schemas while still preserving robustness to dirty data. To pick the sample size L, we first sum the possible injection targets per token to be injected. We cap the number of targets to three per attribute and ten in total, and then let L be the maximum over all tokens to be injected.
Neighborhood sampling. We sample the neighborhood Z_x much like before. Now, x' will be a vector of only P, and we let z'_i = A instead of 1 for random subsets. But we additionally set random subsets of elements to M. The subset size is chosen uniformly at random from [0, max(3, ⌈d_x/3⌉)], except with a 50% chance of picking 0. The reason we sample M less often than A is that injections tend to have more dramatic effects on the match score than exclusions, and so we consider them to be larger perturbations and want to avoid drowning out the exclusion effects.
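The M-subset size rule can be sketched as a small sampler, assuming the ceiling reading of the bound max(3, ⌈d_x/3⌉); the function name is ours.

```python
import math
import random

def sample_m_subset_size(d_x: int, rng: random.Random) -> int:
    # 50% chance of no M-perturbations at all, otherwise uniform on
    # [0, max(3, ceil(d_x / 3))] (inclusive).
    if rng.random() < 0.5:
        return 0
    return rng.randint(0, max(3, math.ceil(d_x / 3)))

rng = random.Random(0)
sizes = [sample_m_subset_size(12, rng) for _ in range(1000)]
print(min(sizes), max(sizes))
```

The extra mass on 0 keeps most samples exclusion-only, so the stronger injection perturbations do not dominate the regression.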
4.3 Counterfactual Granularity
Depending on the dataset and matcher, influencing the match score significantly can sometimes require perturbing large parts of the input records. This is especially true for datasets where records contain many high-quality pieces of information, because this provides the matcher with multiple redundant strong signals about whether they match or not. An example is the iTunes-Amazon dataset, where attributes such as song name, artist name, album name, and more might all agree or disagree at the same time. The problem is that we want to pick out K important features for the user to focus on, but in such cases, no single token is likely to be significantly important. While we could use attribute-level features, that would be unnecessarily coarse-grained for many cases. Instead, we propose an adaptive strategy where we automatically choose an appropriate explanation granularity. The idea is to exponentially decrease the granularity of the interpretable features until the attributions and attribution potentials are large enough in magnitude to explain the decision boundary.
Let E_x = {(w_i, p_i)}_{i ∈ E} be the K pairs of attributions and attribution potentials for an explanation ξ(x), where E is the set of K interpretable features chosen to be used in the regularized linear surrogate model. Further, let inc_i = max(−w_i, p_i) and dec_i = w_i. In other words, this is how much perturbation of token i could increase or decrease the match score according to w_i and p_i if you removed the token or injected it into the other record. Then, to represent greedy actions to increase the match score, let INC be a vector of the elements in E with positive inc_i, sorted by inc_i in descending order, and similarly for DEC. We define the predicted counterfactual strength of n steps to be
CFS_x^n =
  τ − [ f(x) − Σ_{j=1}^{n} DEC_j ],   if f(x) > τ
  [ f(x) + Σ_{j=1}^{n} INC_j ] − τ,   if f(x) ≤ τ    (4)
Intuitively, this is to what degree one would expect to surpass the classification threshold τ after performing n greedy actions to change the matcher's prediction. Note that most matchers, including those in our experiments, have a classification threshold τ of 0.5 [16, 18]. Then let the greedy counterfactual strategy n_g be the smallest number of steps predicted to be necessary to achieve a counterfactual strength of at least ε:
n_g =
  min{ n : CFS_x^n ≥ ε },   if such an n exists
  |DEC|,                    otherwise, if f(x) > τ
  |INC|,                    otherwise, if f(x) ≤ τ    (5)
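The two definitions above combine into a short greedy procedure: sort the relevant step sizes, accumulate them, and stop as soon as the predicted strength reaches ε. The sketch below is our own reading of Eqs. (4) and (5); names and defaults are illustrative.

```python
def greedy_steps(f_x, pairs, tau=0.5, eps=0.1):
    """pairs: list of (w_i, p_i) for the K selected features.
    Returns (n_g, predicted strength after n_g steps)."""
    if f_x > tau:
        # Predicted match: greedy removals, step sizes dec_i = w_i > 0.
        deltas = sorted((w for w, p in pairs if w > 0), reverse=True)
        cfs = lambda n: tau - (f_x - sum(deltas[:n]))
    else:
        # Predicted non-match: greedy injections, inc_i = max(-w_i, p_i) > 0.
        deltas = sorted((max(-w, p) for w, p in pairs if max(-w, p) > 0),
                        reverse=True)
        cfs = lambda n: (f_x + sum(deltas[:n])) - tau
    for n in range(1, len(deltas) + 1):
        if cfs(n) >= eps:
            return n, cfs(n)
    return len(deltas), cfs(len(deltas))   # fall back to all steps (Eq. 5)

# A confident non-match (score 0.05) with one strong attribution potential:
n_g, strength = greedy_steps(0.05, [(0.0, 0.6), (-0.02, 0.1), (0.01, 0.0)])
print(n_g, round(strength, 2))
```

Here a single injection step is predicted to push the score past the threshold with strength 0.05 + 0.6 − 0.5 = 0.15 ≥ ε.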
Finally, we define the predicted counterfactual strength of the explanation ξ(x) to simply be ĈFS_x = CFS_x^{n_g}, and the actual counterfactual strength to be:

CFS_x =
  τ − f(t_x(x'_g)),   if f(x) > τ
  f(t_x(x'_g)) − τ,   if f(x) ≤ τ    (6)
where x'_g is the perturbation of the interpretable representation x' corresponding to the greedy counterfactual strategy.
When an explanation's interpretable features each represent (up to) a fixed number m of consecutive tokens, we say that the explanation has a granularity of m tokens. To find our desired granularity, we start with a granularity of one token and then double it until we find a granularity that satisfies ĈFS_x ≥ ε ∧ CFS_x ≥ ε or no coarser granularity is possible (i.e., all features are whole attributes). When no granularity satisfies the requirement, we pick the granularity with the highest harmonic mean of ĈFS_x and CFS_x, which favors them being large and similar.
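The doubling search can be sketched as follows. Here `explain_at` is a hypothetical hook of ours that runs the explanation at a given granularity and returns (predicted strength, actual strength); it is not an API from the paper.

```python
def choose_granularity(explain_at, max_tokens, eps=0.1):
    # Double the granularity until both strengths reach eps; otherwise fall
    # back to the granularity with the highest harmonic mean of the two.
    g, candidates = 1, []
    while True:
        predicted, actual = explain_at(g)
        if predicted >= eps and actual >= eps:
            return g
        if predicted > 0 and actual > 0:
            hm = 2 * predicted * actual / (predicted + actual)
            candidates.append((hm, g))
        if g >= max_tokens:   # coarsest level: whole attributes
            break
        g = min(2 * g, max_tokens)
    return max(candidates)[1] if candidates else max_tokens

# Toy: strengths grow with granularity and first cross eps at g = 4.
scores = {1: (0.02, 0.01), 2: (0.06, 0.05), 4: (0.15, 0.12), 8: (0.3, 0.2)}
print(choose_granularity(lambda g: scores[g], max_tokens=8))
```

The harmonic-mean fallback mirrors the text above: it rewards granularities where predicted and actual strength are both large and close to each other.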
We call the resulting approach for picking granularity counterfactual granularity. It will try to find explanations that explain the decision boundary while balancing maximal granularity and faithfulness. Note that the granularity is chosen independently for each of the two dual explanations.
4.4 Summary
All three extensions introduced above fit together in our proposed method. Finally, we choose K to be 5 and ε to be 0.1 for all examples. Figure 2 provides a simplified overview of the steps that make up LEMON. It is a model-agnostic and schema-flexible method without any hyperparameters that need tuning. One drawback compared to LIME is that it is more computationally demanding. The main reason is the increased number of matcher predictions made to estimate the attribution potential and find the right granularity. However, it is still possible to generate an explanation within a few seconds in most cases.
One key advantage of LEMON over previous work is that it provides one type of explanation for all record pairs, whether a match or a non-match, with a clear and easy way to interpret and visualize it. The attributions w_i are equivalent to those in LIME and can be interpreted the same way: an attribution states to what degree interpretable feature i (some part of a record) contributes to the match score. If the corresponding part of the record is removed, we expect the match score to decrease by approximately w_i. For the same interpretable feature i, the interpretation of the attribution potential p_i is how much higher the attribution w_i could have been if the other record matched the feature better.
While the explanation can be visualized in many ways, we propose a straightforward extension of the visualization proposed by the original LIME authors. Figure 3 shows an example. In addition to plotting a colored bar for each w_i, we also plot gray bars from w_i to w_i + p_i, outlining feature i's potential attribution.
5 EXPERIMENTAL SETUP

5.1 Datasets

All experiments are carried out on the 13 public datasets used in the evaluation of DeepMatcher [22], originally from Köpcke et al.
Table 2: The public DeepMatcher [22] benchmark datasets and F1 score for the Magellan (MG) and Bert-Mini (BM) matchers used in the experiments.

Type        Name              #Cand.   #Matches   F1 (MG)   F1 (BM)
Structured  Amazon-Google     11 460      1 167      0.52      0.67
Structured  Beer                 450         68      0.85      0.76
Structured  DBLP-ACM          12 363      2 220      0.99      0.98
Structured  DBLP-Scholar      28 707      5 347      0.94      0.93
Structured  Fodors-Zagats        946        110      1.00      0.95
Structured  iTunes-Amazon        539        132      0.90      0.93
Structured  Walmart-Amazon    10 242        962      0.66      0.80
Dirty       DBLP-ACM          12 636      2 220      0.91      0.97
Dirty       DBLP-Scholar      28 707      5 347      0.83      0.94
Dirty       iTunes-Amazon        539        132      0.53      0.90
Dirty       Walmart-Amazon    10 242        962      0.41      0.79
Textual     Abt-Buy            9 575      1 028      0.51      0.81
Textual     Company          112 632     28 200      0.57      0.90
[17] and Das et al. [6]. Table 2 lists them together with their number of candidates and number of matches. The datasets are divided into three types: structured, dirty, and textual. Structured datasets have nicely separated attributes, dirty datasets are created from their structured counterparts by randomly injecting other attributes into the title attribute [22], and textual datasets generally consist of long textual attributes containing multiple pieces of information. For the Company dataset, we truncate each record to at most 256 space-separated words.
5.2 Matchers

To show that our proposed method is versatile and model-agnostic, we perform our experiments for each dataset on both a matcher that uses classical machine learning with string metrics as features and on a deep learning based matcher. See Table 2 for their F1 scores on the benchmark datasets.
Magellan. For the classical approach, we train a Magellan [16] random forest matcher. We use the automatically suggested similarity features and the default random forest settings provided by the library. Furthermore, we do no downsampling and train on the entire training dataset.
Bert-Mini. For the deep learning approach, we train a baseline Ditto [18] matcher using Bert-Mini [30]. While not quite achieving state-of-the-art accuracy, Bert-Mini provides a decent accuracy vs. cost trade-off while maintaining the main characteristics of state-of-the-art deep learning matchers and still significantly outperforming the classical matcher on dirty and textual data. We use batch size 32, a linearly decreasing learning rate from 3 · 10⁻⁵ with 50 warmup steps, 16-bit precision optimization, and 1, 3, 5, 10, or 20 epochs depending on the dataset size. The final model is the one from the epoch with the highest F1 score on the validation dataset.
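The learning-rate schedule can be sketched as a plain function (the peak value and warmup length are from the text; the function itself is our illustration, not the Ditto code):

```python
def lr_at_step(step, total_steps, peak_lr=3e-5, warmup_steps=50):
    """Linear warmup to peak_lr over warmup_steps, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))
```

In practice one would register an equivalent schedule with the training framework's scheduler API rather than compute it by hand.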
Figure 2: Illustration of how LEMON generates an explanation for a record pair x = (a, b).
Figure 3: Example of how a LEMON explanation can be visualized and what we show users in our user study. This particular explanation is for the prediction of the Bert-Mini matcher used in our experiments on the record pair from Table 1.
5.3 Baselines

We now introduce the baselines we use for comparison. For a fair comparison, we adopt dual explanations for all of them.
LIME. Since our work can be seen as a continuation of LIME [25], it is a natural baseline. We use LIME as described in Section 3.2.
SHAP. Another popular approach for producing input attributions is SHAP [20]. It is based on the game-theoretic Shapley values [19, 27] and provides several methods for different types of models. For a fair comparison, we use their model-agnostic method, Kernel SHAP, which can be interpreted as using LIME to approximate Shapley values. Note that Kernel SHAP does not limit K and sets Ω(g) = 0. We use the default settings from the SHAP library.
Integrated gradients. Our proposed method is a perturbation-based attribution method. To compare against a gradient-based method, we use integrated gradients [28] as a baseline. This method is not truly model-agnostic since it requires gradients, so we can only apply it to our deep learning matcher.
Integrated gradients explain an input x in reference to some baseline input x′ (some neutral input that gives a score close to zero). Let x be the embedding vector of a record pair. As the original authors suggest for textual input, we let x′ be the zero embedding. The attribution for the i-th element of x is then defined to be

IG_i(x) = (x_i − x′_i) × ∫_{α=0}^{1} [ ∂f(x′ + α × (x − x′)) / ∂x_i ] dα    (7)
The integral is approximated by averaging the gradient at evenly spaced points from x′ to x. We use 50 points in our experiments. The raw attributions are for single elements of the embedded input and are by no means interpretable for humans, so it is common to sum them for each embedding. To get attributions on the same representation level as our method, we combine the attributions of Bert subword embeddings into whole words.
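The approximation of Eq. (7) can be sketched with NumPy; `f_grad` stands in for the matcher's gradient function. For a linear toy score the method is exact, which makes it easy to sanity-check:

```python
import numpy as np

def integrated_gradients(f_grad, x, x_baseline, steps=50):
    """Average the gradient at `steps` evenly spaced points between the
    baseline x' and x, then scale elementwise by (x - x')."""
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.stack([f_grad(x_baseline + a * (x - x_baseline)) for a in alphas])
    return (x - x_baseline) * grads.mean(axis=0)

# Toy score f(x) = w . x has constant gradient w, so the attribution for
# element i is exactly w_i * x_i with a zero baseline.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])
attributions = integrated_gradients(lambda v: w, x, np.zeros_like(x))
assert np.allclose(attributions, w * x)
```

For the linear case the attributions also sum to f(x) − f(x′), the completeness property of integrated gradients.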
6 EXPERIMENTS

We will now go through several experiments to evaluate LEMON and compare it to other methods. When we evaluate post hoc explainability, it is important to remember that we do not wish to measure the performance of the matchers, but rather what the explainability method can tell us about the matchers. Explanations should not be judged, disconnected from the matcher, on whether they provide the same rationale as users would, but on the degree to which they reflect the actual (correct or wrong) behavior of the matchers and the degree to which they are effective at communicating this to users.
6.1 Counterfactual Interpretation

Explanations can sometimes provide enough information for the user to understand how the prediction could have been different. Baraldi et al. [2] call this the "interest" of an explanation. We argue similarly that a useful explanation should implicitly reveal to the user some changes to the records that would flip the prediction outcome. But we further argue that we should help the user understand a minimal number of such changes, since that would mean the user has a greater understanding of where the decision boundary is.
To that end, we simulate users being shown an explanation for a record pair and then being asked what they think would be some minimal changes to the records that would flip the matcher's prediction. The simulated users greedily make the smallest number of perturbations necessary according to the explanation, as described in Section 4.3. Of the two dual explanations, they pick the one that needs the fewest perturbations if CFS ≥ τ, or the one with the highest CFS otherwise. We extend the same greedy strategy to Landmark explanations, but with their corresponding perturbations. Let the counterfactual recall of an attribution method be the fraction of explanations where at least one of the two dual explanations indicates how the matching prediction could be flipped (CFS ≥ τ), and the counterfactual precision be the fraction of those where the greedy counterfactual strategy is actually successful (CFS > 0). To unify them into a single metric, we report the counterfactual F1 score. We produce 500 explanations for both predicted matches and non-matches (or all when there are fewer than 500 available) for each explanation method per dataset. Table 3 shows the counterfactual F1 for the different explainability methods, matchers, and datasets.
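The metric can be sketched as follows; `results` is a hypothetical list with one entry per explained record pair, holding the best CFS of its two dual explanations and whether the greedy perturbation actually flipped the matcher's prediction:

```python
def counterfactual_f1(results, tau=0.1):
    """Counterfactual recall: fraction of pairs where CFS >= tau.
    Counterfactual precision: fraction of those where the greedy
    strategy actually flipped the prediction. F1 is their harmonic mean."""
    indicated = [flipped for cfs, flipped in results if cfs >= tau]
    recall = len(indicated) / len(results)
    precision = sum(indicated) / len(indicated) if indicated else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```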
We observe that LEMON performs the best overall, with the highest or close to the highest F1 score in most cases. It significantly outperforms all three baselines, where non-matches, as expected, show the starkest difference. Since all the baselines fundamentally analyze the prediction by observing what happens when features are removed, they suffer from the same issue of explaining non-matches as discussed in Section 4.2. Importantly, the low performance of all the baselines backs up the claim that standard local post hoc attribution methods do not work satisfactorily out of the box for entity matching. Our proposed method generally outperforms Landmark, with the exception of three datasets for the Magellan matcher on non-matches. We note that LEMON has the biggest advantage over Landmark on datasets that would typically require more substantial perturbations to flip the prediction, such as DBLP-GoogleScholar and Company.
6.2 Explanation Faithfulness

It is desirable that explanations are faithful to the matcher. All useful explanations provide some simplified view of a matcher's behavior, but we still want them to be indicative of how the matcher actually operates without being unnecessarily misleading. Inspired by Ribeiro et al. [25], we make perturbations to a record pair and compare the resulting match score with what we would expect from the attributions and attribution potentials. Specifically, if we remove feature i, we expect the match score to decrease by w_i (remember that w_i can be negative), and if we inject feature i into the other record, we expect the match score to increase by p_i. The same applies for Landmark, but appending instead of injecting features. Since the baselines do not estimate attribution potentials, we ignore that perturbation for them. We perform 1, 2, and 3 random perturbations among the interpretable features for both dual explanations. We repeat for 500 explanations of matches and non-matches for each matcher and dataset (or all when there are fewer than 500 available). Let Δ_j be the set of expected match score increases and decreases for experiment j out of L, and z_j be the perturbed version of the original input x_j. Let the mean absolute error be
MAE = (1/L) Σ_j | f(z_j) − [ f(x_j) + Σ_{δ∈Δ_j} δ ] |    (8)
This error measure alone will favor conservative explanation methods that make small and insignificant claims, and punish methods like Landmark and LEMON that provide higher impact explanations because of the inject/append features. Therefore, we define the perturbation error to be the mean absolute error divided by the average magnitude of the predicted change:

PE = MAE / ( (1/L) Σ_j Σ_{δ∈Δ_j} |δ| )    (9)
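Equations (8) and (9) amount to the following computation; each trial carries the original score f(x), the perturbed score f(z), and the explanation's predicted changes Δ (the variable names are ours):

```python
def perturbation_error(trials):
    """trials: list of (f_x, f_z, deltas), where deltas are the expected
    match score changes for the applied perturbations.
    Returns PE = MAE / average magnitude of the predicted change."""
    n = len(trials)
    mae = sum(abs(f_z - (f_x + sum(deltas))) for f_x, f_z, deltas in trials) / n
    avg_magnitude = sum(sum(abs(d) for d in deltas) for _, _, deltas in trials) / n
    return mae / avg_magnitude
```

Dividing by the average predicted magnitude is what stops a method from scoring well simply by always predicting near-zero changes.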
Table 4 shows the perturbation error for all methods. No method achieves truly low error levels, which is expected given the simplified assumption of independent additive attributions. However, we observe that LEMON is overall the method with the smallest errors, with LIME being very similar. Further, we see that Landmark sometimes gets an extremely high perturbation error, especially for Magellan on the dirty datasets. Upon closer inspection, we think this stems from a combination of sampling a too large neighborhood
Table 3: Counterfactual F1 score for all the evaluated explainability methods across all datasets and the two matchers.
                      Structured                               Dirty                  Textual
Method               AG    B    DA    DG    FZ    IA    WA    DA    DG    IA    WA    AB    C    Mean

Magellan, Match
LIME               0.96  0.83  1.00  0.87  0.77  0.98  0.82  0.50  0.79  0.90  0.86  0.95  0.47  0.82
SHAP               0.95  1.00  1.00  1.00  0.95  1.00  0.83  0.65  0.96  0.68  0.81  0.98  0.68  0.88
SHAP (w/ CFG)      1.00  1.00  1.00  1.00  1.00  1.00  0.98  0.99  1.00  0.91  0.99  0.99  0.87  0.98
Landmark           0.92  0.89  1.00  0.96  0.73  0.87  0.95  0.75  0.74  0.88  0.90  0.95  0.28  0.83
LEMON (w/o DE)     0.98  0.91  1.00  0.97  0.95  1.00  0.96  0.83  0.93  0.93  0.95  0.98  0.49  0.91
LEMON (w/o AP)     1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.99  0.98  0.83  0.98
LEMON (w/o CFG)    0.96  0.89  1.00  0.88  0.91  0.98  0.81  0.52  0.81  0.85  0.83  0.92  0.41  0.83
LEMON              1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.98  0.99  0.80  0.98

Magellan, Non-match
LIME               0.02  0.11  0.02  0.01  0.02  0.14  0.03  0.09  0.10  0.17  0.10  0.13  0.11  0.08
SHAP               0.00  0.03  0.00  0.01  0.00  0.05  0.02  0.07  0.06  0.10  0.12  0.06  0.13  0.05
SHAP (w/ CFG)      0.02  0.22  0.01  0.02  0.00  0.05  0.02  0.09  0.09  0.23  0.21  0.08  1.00  0.16
Landmark           0.14  0.84  0.14  0.20  0.23  0.21  0.93  0.04  0.43  0.39  0.03  0.79  0.09  0.34
LEMON (w/o DE)     0.59  0.78  0.06  0.37  0.65  0.64  0.82  0.64  0.69  0.87  0.82  0.88  0.96  0.68
LEMON (w/o AP)     0.04  0.42  0.02  0.03  0.02  0.14  0.05  0.17  0.21  0.26  0.36  0.17  0.68  0.20
LEMON (w/o CFG)    0.40  0.46  0.08  0.13  0.03  0.24  0.73  0.23  0.49  0.69  0.63  0.78  0.13  0.38
LEMON              0.71  0.50  0.12  0.54  0.98  0.77  0.76  0.75  0.78  0.87  0.87  0.87  0.96  0.73

Bert-Mini, Match
LIME               0.95  0.65  0.97  0.85  0.93  0.65  0.81  0.97  0.69  0.62  0.80  0.78  0.18  0.76
SHAP               0.90  0.81  0.79  0.65  0.91  0.69  0.71  0.79  0.62  0.63  0.75  0.77  0.25  0.71
SHAP (w/ CFG)      0.95  0.96  0.92  0.98  0.86  0.89  0.97  0.98  0.99  1.00  0.99  1.00  0.39  0.91
IG                 0.90  0.43  0.71  0.83  0.50  0.70  0.67  0.69  0.89  0.81  0.68  0.79  0.33  0.69
IG (w/ CFG)        0.88  0.52  0.66  0.94  0.68  0.70  0.77  0.86  0.95  0.94  0.87  0.93  0.28  0.77
Landmark           0.98  0.93  1.00  0.94  0.86  0.94  0.84  0.99  0.83  0.90  0.88  0.83  0.08  0.85
LEMON (w/o DE)     0.99  0.69  1.00  0.97  0.98  0.88  0.83  1.00  0.94  0.89  0.84  0.86  0.25  0.85
LEMON (w/o AP)     1.00  1.00  1.00  1.00  1.00  0.94  0.99  1.00  1.00  1.00  1.00  0.98  0.39  0.95
LEMON (w/o CFG)    0.95  0.65  0.98  0.86  0.98  0.58  0.81  0.99  0.70  0.59  0.81  0.79  0.15  0.76
LEMON              1.00  1.00  1.00  1.00  1.00  0.94  0.98  1.00  0.99  1.00  1.00  0.97  0.37  0.94

Bert-Mini, Non-match
LIME               0.13  0.06  0.01  0.04  0.04  0.13  0.08  0.02  0.04  0.23  0.07  0.05  0.03  0.07
SHAP               0.14  0.16  0.01  0.04  0.01  0.29  0.08  0.02  0.05  0.33  0.14  0.15  0.18  0.12
SHAP (w/ CFG)      0.14  0.26  0.02  0.04  0.01  0.34  0.11  0.02  0.05  0.49  0.17  0.37  0.26  0.18
IG                 0.07  0.00  0.00  0.03  0.01  0.00  0.02  0.01  0.01  0.00  0.03  0.07  0.02  0.02
IG (w/ CFG)        0.08  0.03  0.00  0.04  0.01  0.00  0.03  0.02  0.02  0.02  0.06  0.08  0.03  0.03
Landmark           0.40  0.70  0.05  0.17  0.50  0.63  0.64  0.07  0.35  0.50  0.74  0.67  0.01  0.42
LEMON (w/o DE)     0.75  0.93  0.55  0.68  0.86  0.91  0.93  0.74  0.78  0.88  0.97  0.93  0.97  0.84
LEMON (w/o AP)     0.18  0.08  0.03  0.08  0.05  0.67  0.14  0.04  0.09  0.56  0.16  0.23  0.26  0.20
LEMON (w/o CFG)    0.50  0.92  0.02  0.19  0.85  0.94  0.89  0.04  0.18  0.76  0.95  0.90  0.96  0.62
LEMON              0.81  0.94  0.65  0.68  0.86  0.97  0.90  0.50  0.79  0.87  0.95  0.98  0.97  0.84
and the ineffectiveness of the double-entity generation strategywhen the data does not follow the matched schemas (i.e., is dirty).
6.3 Ablation Study

Included in Table 3 and Table 4 is also an ablation study. We examine the effect of removing each of the three main components of LEMON: 1) Dual explanations: instead of dual explanations, we produce one joint explanation for both records. We use K = 10 for a fair comparison. 2) Attribution potential: we use the interpretable representation of the LIME baseline and do not estimate any attribution potential. 3) Counterfactual granularity: we fix the granularity to one token. In addition, we examine the effect of adding counterfactual granularity to the baselines SHAP and integrated gradients (there is no trivial way to do the same for attribution potential).
With dual explanations, LEMON performs better across the board for matches, but the picture is more varied for non-matches. This makes sense since the problematic interaction effects mainly occur when two records match and have a lot of similar content. Unsurprisingly, since its primary goal is to explain how records could match better, attribution potential only significantly improves non-match explanations. Nevertheless, the improvement for non-matches is dramatic, demonstrating how effective attribution potential is for explaining record pairs that do not match. Finally, we observe that the effectiveness of counterfactual granularity varies hugely from dataset to dataset. It makes the biggest difference on datasets where we consider the records to have multiple high-quality pieces of information, either in the form of several high-quality attributes or long textual attributes with multiple high-quality keywords.
Table 4: Perturbation error PE for all the evaluated explainability methods across all datasets and the two matchers.
                      Structured                               Dirty                  Textual
Method               AG    B    DA    DG    FZ    IA    WA    DA    DG    IA    WA    AB    C    Mean

Magellan, Match
LIME               0.33  0.38  0.31  0.34  0.31  0.25  0.67  0.47  0.41  0.35  0.46  0.33  0.60  0.40
SHAP               1.09  1.03  0.96  1.01  0.80  0.93  1.12  1.03  1.25  2.37  2.50  1.07  7.70  1.76
SHAP (w/ CFG)      1.03  0.79  0.96  1.00  0.79  0.95  0.31  0.64  1.17  1.24  1.28  1.04  3.58  1.14
Landmark           0.95  0.73  0.70  1.18  0.67  0.94  0.86  7.01  2.93  5.40  1.93  1.25  3.45  2.15
LEMON (w/o DE)     0.36  0.38  0.30  0.36  0.33  0.25  0.35  0.45  0.40  0.40  0.37  0.33  0.68  0.38
LEMON (w/o AP)     0.33  0.40  0.31  0.33  0.28  0.25  0.23  0.42  0.39  0.37  0.35  0.33  0.53  0.35
LEMON (w/o CFG)    0.33  0.36  0.30  0.34  0.30  0.26  0.27  0.45  0.40  0.35  0.39  0.33  0.67  0.37
LEMON              0.32  0.37  0.30  0.33  0.28  0.26  0.28  0.42  0.39  0.35  0.36  0.32  0.54  0.35

Magellan, Non-match
LIME               0.53  0.64  0.42  0.47  0.58  0.34  0.45  0.56  0.60  0.46  0.59  0.46  0.79  0.53
SHAP               0.89  0.93  1.34  1.08  1.43  1.10  1.16  1.44  1.70  2.15  1.24  1.34  6.36  1.71
SHAP (w/ CFG)      0.62  0.52  0.79  0.80  1.12  0.38  0.64  0.83  0.79  0.99  0.68  0.77  0.82  0.75
Landmark           0.64  0.74  0.81  0.79  0.76  1.45  0.63  0.90  1.04  5.58  3.05  1.34  2.10  1.53
LEMON (w/o DE)     0.44  0.52  0.63  0.50  0.46  0.37  0.38  0.59  0.51  0.40  0.46  0.33  0.47  0.47
LEMON (w/o AP)     0.46  0.35  0.47  0.46  0.51  0.21  0.40  0.49  0.51  0.44  0.41  0.40  0.59  0.44
LEMON (w/o CFG)    0.43  0.55  0.44  0.50  0.43  0.51  0.38  0.58  0.52  0.44  0.48  0.35  0.78  0.49
LEMON              0.42  0.59  0.64  0.53  0.45  0.43  0.40  0.54  0.49  0.46  0.47  0.37  0.47  0.48

Bert-Mini, Match
LIME               0.32  0.47  0.39  0.40  0.30  0.44  0.61  0.38  0.41  0.48  0.62  0.77  1.20  0.52
SHAP               0.67  0.50  1.48  0.87  0.65  0.61  1.14  1.37  0.85  0.81  1.08  0.82  2.18  1.00
SHAP (w/ CFG)      0.64  0.57  1.03  0.78  0.61  0.71  0.74  0.91  0.71  0.71  0.71  0.60  0.60  0.72
IG                 1.20  0.76  1.64  1.15  1.01  0.80  1.18  1.65  1.05  1.16  1.17  0.89  1.93  1.20
IG (w/ CFG)        1.16  0.72  1.45  1.06  0.87  0.93  1.11  1.28  0.92  0.89  0.95  0.80  0.84  1.00
Landmark           0.92  1.00  1.17  0.81  0.55  0.74  0.94  1.20  0.78  1.01  1.14  0.79  0.91  0.92
LEMON (w/o DE)     0.38  0.51  0.50  0.45  0.41  0.41  0.50  0.52  0.46  0.48  0.51  0.56  0.40  0.47
LEMON (w/o AP)     0.32  0.40  0.45  0.43  0.33  0.31  0.45  0.40  0.41  0.30  0.42  0.49  0.61  0.41
LEMON (w/o CFG)    0.33  0.42  0.38  0.40  0.31  0.43  0.51  0.38  0.41  0.44  0.54  0.60  0.68  0.45
LEMON              0.33  0.45  0.45  0.44  0.40  0.43  0.49  0.42  0.44  0.39  0.46  0.47  0.53  0.44

Bert-Mini, Non-match
LIME               0.50  0.51  0.68  0.46  0.61  0.50  0.61  0.58  0.52  0.53  0.67  0.63  0.61  0.57
SHAP               0.67  0.82  0.75  0.82  0.87  0.79  0.89  0.81  0.75  1.04  0.79  0.86  1.32  0.86
SHAP (w/ CFG)      0.65  0.80  0.77  0.75  0.79  0.77  0.75  0.76  0.68  0.78  0.71  0.80  0.62  0.74
IG                 0.95  1.01  1.05  1.02  1.71  1.80  1.27  0.73  0.88  2.73  1.03  0.88  1.31  1.26
IG (w/ CFG)        0.94  1.06  1.19  0.98  2.31  5.08  1.29  0.90  0.86  4.38  1.14  1.08  2.34  1.81
Landmark           0.76  0.83  0.84  0.79  0.72  0.65  0.72  0.85  0.83  0.83  0.74  1.03  0.80  0.80
LEMON (w/o DE)     0.53  0.34  0.81  0.66  0.45  0.43  0.39  0.72  0.61  0.50  0.36  0.33  0.42  0.50
LEMON (w/o AP)     0.48  0.49  0.63  0.54  0.58  0.47  0.53  0.57  0.50  0.48  0.58  0.63  0.62  0.55
LEMON (w/o CFG)    0.51  0.29  0.98  0.70  0.45  0.42  0.39  0.81  0.71  0.48  0.35  0.33  0.42  0.53
LEMON              0.49  0.40  0.87  0.58  0.43  0.48  0.46  0.77  0.54  0.48  0.38  0.33  0.42  0.51
6.4 Stability

Several of the benchmarked methods, including LEMON itself, rely on random sampling of the neighborhood of x. Different initial random seeds will result in different explanations. However, with sufficient samples we would like a well-behaved method to generate similar explanations; i.e., explanations should be stable and not change much if different random seeds are used. Stability is a desirable trait from a trust perspective, but it is also especially useful when examining or debugging a matcher. If we make changes to a matcher, we want to be confident that the differences we observe in the explanations mostly reflect the matcher changes and not instability of the explanation method. LEMON not only relies on sampling the neighborhood of x in the interpretable domain I_x, but also relies on sampling to approximate t_x when translating from I_x. A natural question is whether this additional random sampling hurts stability.
Let e¹_x = {(w¹_i, p¹_i)}_{i∈E1} and e²_x = {(w²_i, p²_i)}_{i∈E2} be two explanations for the same input x and matcher f with different random seeds. Let the similarity between the two explanations, s(e¹, e²) = |e¹ ∩ e²| / |e¹ ∪ e²|, be the weighted Jaccard coefficient such that the intersection is

e¹_x ∩ e²_x = Σ_{i∈E1∩E2} [ (w¹_i ∩̇ w²_i) + (p¹_i ∩̇ p²_i) ]    (10)

and the union is

e¹_x ∪ e²_x = Σ_{i∈E1∩E2} [ max(|w¹_i|, |w²_i|) + max(|p¹_i|, |p²_i|) ]
            + Σ_{i∈E1\E2} (w¹_i + p¹_i) + Σ_{i∈E2\E1} (w²_i + p²_i)    (11)
Figure 4: Stability of explainability methods based on neighborhood sampling for Bert-Mini on all datasets.
where ∩̇ is a shorthand for

a ∩̇ b = H(ab) · min(|a|, |b|)    (12)

and H is the unit step function. For LIME and SHAP, p_i is 0 for all i. To be able to compare explanations of different granularity, all explanations are normalized to single-token interpretable features; i.e., if feature i is a q-token interpretable feature, we split it into q features with attribution w_i / q and attribution potential p_i / q. Finally, we define the stability of an explanation method as the expected similarity between two explanations, E_x[ s(e¹_x, e²_x) ]. Note that this definition slightly favors methods such as SHAP and Landmark that use all interpretable features in their explanations instead of only the K most important, like LIME and LEMON. Picking the K most important interpretable features controls the explanation complexity at the cost of exposing the method to more instability, because small changes in importance can change which features are within or outside the top K.
Figure 4 shows the estimated stability of LIME, SHAP, Landmark, and LEMON for Bert-Mini on all datasets. For each dataset, we sample 100 predicted matches and non-matches uniformly at random, generate two explanations with different random seeds for each example, and average the similarities. We see that LEMON is relatively stable and is similar to LIME in terms of stability despite the sampling-based approximation of t_x. SHAP is overall the most stable method, while Landmark is the least stable. To understand these differences in stability, it is important to also take sample size into account.
Neighborhood sample size. From our experience, the main concern when choosing the neighborhood sample size |Z_x| is stability. One achieves a satisfactory counterfactual F1 score and perturbation error with relatively few samples, but it takes considerably more samples to get stable explanations. Thus, picking |Z_x| mostly boils down to a trade-off between stability and speed.
Figure 5: Stability of explainability methods based on neighborhood sampling for Bert-Mini on the Abt-Buy dataset.
The different neighborhood sampling-based methods have different strategies for picking a sample size (SHAP defaults to 2d_x + 2048 samples, Landmark to 500, and our LIME baseline and LEMON to clamp(30 d_x, 500, 3000), where d_x is the number of interpretable features), so it is interesting to also compare the stability at equal sample sizes. Figure 5 shows the stability of the explainability methods for Bert-Mini on the Abt-Buy dataset when we vary the neighborhood sample size (the methods generally behave similarly on the other datasets, but we only include one due to space constraints). As before, we sample 100 predicted matches and non-matches, generate two explanations per example, and estimate the stability as the average similarity between the explanation pairs. We observe that LEMON is close to or equally as sample efficient as LIME and SHAP for matches, and slightly more sample efficient for non-matches. This shows that the difference in stability between SHAP and LEMON is mainly a matter of difference in sample sizes. We deliberately use a less aggressive sampling scheme for LEMON than for SHAP because we find the returns in terms of stability diminishing, especially given
Table 5: Counterfactual precision of users after being shown an explanation from LIME or LEMON.

                                  LIME                 LEMON
Dataset                     Match  Non-match     Match  Non-match
Structured
  Amazon-Google              0.71    0.22         0.77    0.42
  Beer                       0.47    0.16         0.50    0.63
  DBLP-ACM                   0.82    0.06         0.79    0.23
  DBLP-GoogleScholar         0.53    0.08         0.62    0.27
  Fodors-Zagats              0.59    0.14         0.69    0.46
  iTunes-Amazon              0.45    0.18         0.60    0.60
  Walmart-Amazon             0.55    0.24         0.63    0.58
Dirty
  DBLP-ACM                   0.71    0.02         0.81    0.40
  DBLP-GoogleScholar         0.45    0.08         0.58    0.48
  iTunes-Amazon              0.53    0.22         0.75    0.56
  Walmart-Amazon             0.53    0.24         0.62    0.71
Textual
  Abt-Buy                    0.55    0.14         0.62    0.73
  Company                    0.14    0.16         0.29    0.35
Mean                         0.54    0.15         0.63    0.49
the higher computational footprint of LEMON. Landmark's instability, however, cannot be attributed to its smaller sample size. It is clear from Figure 5 that the method is significantly less sample efficient than the others. We suspect this is mostly due to the large neighborhood used when sampling.
6.5 User Study

To examine whether explanations from LEMON improve human subjects' understanding of a matcher compared to LIME, we adapt the experiment on counterfactual interpretation from Section 6.1 to human subjects. We recruit random test users from the research survey platform Prolific. Note that these users are laypersons and do not have any experience with entity matching or a background in computer science. A user is shown an explanation for a record pair and then asked what they think would be a minimal change to the record pair that would make the matcher predict the opposite. Each user is shown one explanation for a match and a non-match for each dataset, and we only use the Bert-Mini matcher. Afterward, we check what fraction of them successfully get the opposite prediction, i.e., the counterfactual precision. We conduct the experiment on 50 users for LIME and 50 other users for LEMON, and report the counterfactual precision in Table 5.
As expected, and in line with the experiments above, the greatest improvement is for non-matches. We see an average improvement in counterfactual precision of 0.09 for matches and 0.34 for non-matches. The results are generally less pronounced than for the simulated experiments. We suspect the lower maximum scores are a reflection of the difficulty of the task for laypersons, and that the higher minimum scores are a reflection of humans' ability to use common sense to "save" weak explanations. Note that we cannot compare to Landmark [2] since the authors do not propose any way of presenting an actual explanation to a user. The combination of double-entity generation and not limiting Ω(g) (i.e., explaining using all features instead of limiting them to K) makes such a presentation non-trivial.
7 CONCLUSION

Local post hoc feature attribution is a valuable and popular type of explainability method that can explain any classifier, but standard methods leave significant room for improvement when applied to entity matching. We have identified three challenges of applying such methods to entity matching and proposed LEMON, a model-agnostic and schema-flexible method that addresses all three challenges. Experiments and a novel evaluation method for explainable entity matching show that our proposed method is more faithful to the matcher and more effective in explaining to the user where the decision boundary is, especially for non-matches. Lastly, user studies verify a significant real-world improvement in understanding for laypersons seeing LEMON explanations compared to naive LIME explanations.
There is still much to be done within explainable entity matching. A disadvantage of LEMON (and other perturbation-based methods like LIME and SHAP) is the running time. Even though it is possible to trade significantly shorter running time for explanation stability, and trivial to parallelize the computational bottleneck (running inference of the matcher f), it might still be infeasible in practice, depending on the hardware and matcher, to do real-time explanation or to generate explanations for all record pairs in large datasets while achieving satisfactory stability. Therefore, more efficient sampling strategies should be explored. Furthermore, there is still work to be done on examining the adaptation of other explainability methods for entity matching in depth and on how to evaluate them. Our experiments and ablation study show that dual explanations and counterfactual granularity are easily applicable to SHAP and gradient-based methods, and that they are indeed effective for methods other than LIME. It is less clear how one would adapt the idea of attribution potential to those methods, and we hope to address that in future work.
ACKNOWLEDGMENTS

This work is supported by Cognite and the Research Council of Norway under Project 298998. We thank Hassan Abedi Firouzjaei, Jon Atle Gulla, Dhruv Gupta, Ludvig Killingberg, Kjetil Nørvåg, and Mateja Stojanović for valuable feedback.
REFERENCES

[1] Amina Adadi and Mohammed Berrada. 2018. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 6 (2018), 52138–52160. https://doi.org/10.1109/ACCESS.2018.2870052
[2] Andrea Baraldi, Francesco Del Buono, Matteo Paganelli, and Francesco Guerra. 2021. Using Landmarks for Explaining Entity Matching Models. In EDBT. OpenProceedings.org, 451–456.
[3] Nils Barlaug and Jon Atle Gulla. 2021. Neural Networks for Entity Matching: A Survey. ACM Transactions on Knowledge Discovery from Data 15, 3 (April 2021), 1–37. https://doi.org/10.1145/3442200
[4] Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - A Step Forward in Data Integration. In EDBT. OpenProceedings.org, 463–473.
[5] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer-Verlag, Berlin Heidelberg. https://doi.org/10.1007/978-3-642-31164-2
[6] Sanjib Das, A Doan, C Gokhale Psgc, P Konda, Y Govind, and D Paulsen. 2015.The Magellan Data Repository. (2015).
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:Pre-Training of Deep Bidirectional Transformers for Language Understanding. InProceedings of the 2019 Conference of the North American Chapter of the Associationfor Computational Linguistics: Human Language Technologies, Volume 1 (Long andShort Papers). Association for Computational Linguistics, Minneapolis, Minnesota,4171β4186. https://doi.org/10.18653/v1/N19-1423
[8] Vincenzo Di Cicco, Donatella Firmani, Nick Koudas, Paolo Merialdo, and DiveshSrivastava. 2019. Interpreting Deep Learning Models for Entity Resolution:An Experience Report Using LIME. In Proceedings of the Second InternationalWorkshop on Exploiting Artificial Intelligence Techniques for Data Management -aiDM β19. ACM Press, Amsterdam, Netherlands, 1β4. https://doi.org/10.1145/3329859.3329878
[9] AnHai Doan, Alon Halevy, and Zachary G. Ives. 2012. Principles of Data Integra-tion. Morgan Kaufmann, Waltham, MA.
[10] Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for InterpretableMachine Learning. Commun. ACM 63, 1 (Dec. 2019), 68β77. https://doi.org/10.1145/3359786
[11] Amr Ebaid, Saravanan Thirumuruganathan, Walid G. Aref, Ahmed Elmagarmid,and Mourad Ouzzani. 2019. EXPLAINER: Entity Resolution Explanations. In2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, Macao,Macao, 2000β2003. https://doi.org/10.1109/ICDE.2019.00224
[12] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, MouradOuzzani, and Nan Tang. 2018. Distributed Representations of Tuples for EntityResolution. Proceedings of the VLDB Endowment 11, 11 (July 2018), 1454β1467.https://doi.org/10.14778/3236187.3236198
[13] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007.Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and DataEngineering 19, 1 (Jan. 2007), 1β16. https://doi.org/10.1109/TKDE.2007.250581
[14] Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, andLalana Kagal. 2018. Explaining Explanations: An Overview of Interpretability ofMachine Learning. In 2018 IEEE 5th International Conference on Data Science andAdvanced Analytics (DSAA). IEEE, Turin, Italy, 80β89. https://doi.org/10.1109/DSAA.2018.00018
[15] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2019. A Survey of Methods for Explaining Black Box Models. Comput. Surveys 51, 5 (Jan. 2019), 1–42. https://doi.org/10.1145/3236009
[16] Pradap Konda, Jeff Naughton, Shishir Prasad, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, and Haojun Zhang. 2016. Magellan: Toward Building Entity Matching Management Systems. Proceedings of the VLDB Endowment 9, 12 (Aug. 2016), 1197–1208. https://doi.org/10.14778/2994509.2994535
[17] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-World Match Problems. Proceedings of the VLDB Endowment 3, 1-2 (Sept. 2010), 484–493. https://doi.org/10.14778/1920841.1920904
[18] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. Proceedings of the VLDB Endowment 14, 1 (Sept. 2020), 50–60. https://doi.org/10.14778/3421424.3421431 arXiv:2004.00584
[19] Stan Lipovetsky and Michael Conklin. 2001. Analysis of Regression in Game Theory Approach. Applied Stochastic Models in Business and Industry 17, 4 (Oct. 2001), 319–330. https://doi.org/10.1002/asmb.446
[20] Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc.
[21] Christoph Molnar. 2019. Interpretable Machine Learning - A Guide for Making Black Box Models Explainable.
[22] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data - SIGMOD '18. ACM Press, Houston, TX, USA, 19–34. https://doi.org/10.1145/3183713.3196926
[23] Hao Nie, Xianpei Han, Ben He, Le Sun, Bo Chen, Wei Zhang, Suhui Wu, and Hao Kong. 2019. Deep Sequence-to-Sequence Entity Matching for Heterogeneous Entity Resolution. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). Association for Computing Machinery, New York, NY, USA, 629–638. https://doi.org/10.1145/3357384.3358018
[24] Kun Qian, Lucian Popa, and Prithviraj Sen. 2019. SystemER: A Human-in-the-Loop System for Explainable Entity Resolution. Proceedings of the VLDB Endowment 12, 12 (Aug. 2019), 1794–1797. https://doi.org/10.14778/3352063.3352068
[25] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, San Francisco, CA, USA, 1135–1144. https://doi.org/10.1145/2939672.2939778
[26] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning Important Features through Propagating Activation Differences. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 3145–3153.
[27] Erik Štrumbelj and Igor Kononenko. 2014. Explaining Prediction Models and Individual Predictions with Feature Contributions. Knowledge and Information Systems 41, 3 (Dec. 2014), 647–665. https://doi.org/10.1007/s10115-013-0679-x
[28] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attribution for Deep Networks. In International Conference on Machine Learning. PMLR, 3319–3328.
[29] Saravanan Thirumuruganathan, Mourad Ouzzani, and Nan Tang. 2019. Explaining Entity Resolution Predictions: Where Are We and What Needs to Be Done?. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics - HILDA '19. ACM Press, Amsterdam, Netherlands, 1–6. https://doi.org/10.1145/3328519.3329130
[30] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-Read Students Learn Better: On the Importance of Pre-Training Compact Models. arXiv:1908.08962 [cs] (Sept. 2019).
[31] Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR. SSRN Scholarly Paper ID 3063289. Social Science Research Network, Rochester, NY. https://doi.org/10.2139/ssrn.3063289
[32] Chen Zhao and Yeye He. 2019. Auto-EM: End-to-End Fuzzy Entity-Matching Using Pre-Trained Deep Models and Transfer Learning. In The World Wide Web Conference - WWW '19. ACM Press, San Francisco, CA, USA, 2413–2424. https://doi.org/10.1145/3308558.3313578