efficient graph -based document similarity · 2018-12-05 · l lessons learned Ø value of dbpedia...

25
KIT – Universität des Landes Baden-Württemberg und nationales Großforschungszentrum in der Helmholtz-Gemeinschaft Institute of Applied Informatics and Formal Description Methods (AIFB) www.kit.edu Efficient Graph-based Document Similarity Christian Paul, Achim Rettinger, Aditya Mogadala, Craig A. Knoblock, Pedro Szekely

Upload: others

Post on 25-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

KIT – Universität des Landes Baden-Württemberg undnationales Großforschungszentrum in der Helmholtz-Gemeinschaft

Institute of Applied Informatics and Formal Description Methods (AIFB)

www.kit.edu

EfficientGraph-basedDocumentSimilarity

ChristianPaul,AchimRettinger,AdityaMogadala,CraigA.Knoblock,PedroSzekely

Page 2: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

2 06/01/2016

Commontask:Related-documentSearch

Applebreakslaptopsalesrecord

Hedrinksapplejuiceduringhalf-timebreak

All-timehighinMacBooks sold

U2recordpre-installedoniPhones

.

.

.

Querydocument

.

.

.

DocumentCollection

Page 3: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

3 06/01/2016

Hedrinksapplejuiceduringhalf-timebreak

Matchingwordsdonotalwaysindicatesimilarity

Applebreakslaptopsalesrecord

All-timehighinMacBookssold

U2recordpre-installedoniPhones

.

.

.

.

.

.

DocumentCollection

Querydocument

Page 4: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

4 06/01/2016

Wordco-occurrencecanbemisleading,too

Applebreakslaptopsalesrecord

All-timehighinMacBookssold

U2recordpre-installedoniPhones

.

.

.

.

.

.

DocumentCollection

Querydocument

Hedrinksapplejuiceduringhalf-timebreak

Page 5: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

5 06/01/2016

SemanticTechnologies:resolveambiguity&exploitrelationalknowledge

Applebreakslaptopsalesrecord

All-timehighinMacBookssold

U2recordpre-installedoniPhones

.

.

.

.

.

.

MacBook

AppleInc.

developer

Laptop

type

iPhone

developer

DocumentCollection

Querydocument

AppleJuice

Hedrinksapplejuiceduringhalf-timebreak

Page 6: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

6 06/01/2016

SemanticTechnologies:resolveambiguity&exploitrelationalknowledge

Applebreakslaptopsalesrecord

All-timehighinMacBookssold

U2recordpre-installedoniPhones

.

.

.

.

.

.

MacBook

AppleInc.

developer

Laptop

type

iPhone

developer

DocumentCollection

Querydocument

AppleJuice

Hedrinksapplejuiceduringhalf-timebreak

Expensive graph traversal

Page 7: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

7 06/01/2016

RelatedWork

Distributional:+scalable,fast- Noexplicitdisambiguationandconceptualrelations

ExplicitSemanticAnalysis(ESA)[GM07]

TF-IDF,VectorSpaceModel

SalientSemanticAnalysis(SSA) [HM11]

Knowledge-based:+richsemanticknowledge

- expensivegraphtraversal

PathSim [SHY+11]

HeteSim [SKH+14]

AnnSim:1-1matching,hierarchicalsimilarity

[PVH+13]

Schuhmacher, Ponzetto:GraphEditDistance[SP14]

Nunes etal.:Transversaldoc.similarity

[NKF+13]

Page 8: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

8 06/01/2016

Bridgingthegap

Distributional:+scalable,fast- Noexplicitdisambiguationandconceptualrelations

ExplicitSemanticAnalysis(ESA)[GM07]

TF-IDF,VectorSpaceModel

SalientSemanticAnalysis(SSA) [HM11]

Knowledge-based:+richsemanticknowledge

- expensivegraphtraversal

PathSim [SHY+11]

HeteSim [SKH+14]

AnnSim:1-1matching,hierarchicalsimilarity

[PVH+13]

Efficient Graph-based

DocumentSimilarity Schuhmacher, Ponzetto:

GraphEditDistance[SP14]

Nunes etal.:Transversaldoc.similarity

[NKF+13]

Page 9: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

9 06/01/2016

CoreContributions

Ø Scalablerelated-documentsearchprocess

Ø Graphtraversalduring pre-processing

Ø Light-weighttasksatsearchtime

Weachievesimilarcomputationalefficiencyasstatisticalapproaches

Page 10: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

10 06/01/2016

CoreContributions

Ø Scalablerelated-documentsearchprocess

Ø Graphtraversalduring pre-processing

Ø Light-weighttasksatsearchtime

Weachievesimilarcomputationalefficiencyasstatisticalapproaches

Ø Bag-of-entitiesdocumentmodel&similarity

Ø Documentsimilarityascombination ofpairwiseentitysimilarities

Ø Exploitshierarchical&transversal knowledgegraphrelations

Inourexperiments,weachievehighercorrelationwithhumannotion ofdocumentsimilaritythanthecompetition

Page 11: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

11 06/01/2016

Related-documentSearchusingGraph-basedSimilarity1) SemanticDocumentExpansion

• Enrichquerydocumentwithrelationalknowledge

2) Inclusionincorpus

• Store&indexexpanded document

3) Pre-search

• Useinvertedindextogeneratecandidateset

4) Fullsearch

• Entity-level,path-basedsimilarities

Page 12: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

12 06/01/2016

SemanticDocumentExpansion

l Enrichdocumentannotations

l Hierarchically

- Categories&theirancestors+hierarchicaldepths

l Transversally

- Weightneighboring entitiesbasedon

l numberofpaths

l lengthofpaths

w(e)=∑l= 1

L

βl∗∣pathsa , e(l) ∣

DocA

1.5

0.75

0.25

0.5

1

1

0.5

0.5

0.5

0.5

Page 13: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

13 06/01/2016

Pre-Search:GenerateCandidateSet

l Invertedindexfromentitiestodocuments

- Retrievecandidatesefficiently

l Assumption:Entityoverlapà contextualsimilarity

- Coarse,document-levelassessment

Page 14: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

14 06/01/2016

FullSearch:Graph-basedDocumentSimilarity

l Foreachcandidatedocument,reconstructquery-candidateannotationsubgraph-hierarchical &transversal

Ø Computeallpairwiseentitysimilarityscores

Ø Combine intodocumentscore

DocA DocB

Page 15: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

15 06/01/2016

l Usingstoredancestors&depthstocompute

l Example:

Hierarchicalentitysimilarity

hierSimdps(ent1 , ent2)=1

1+ 2+ 2= 0.2

hierSimdps (x , y )=d (root , lca( x , y ))

d (root , lca(x , y ))+ d (lca( x , y) , x )+ d ( lca(x , y ,) , y )

Page 16: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

16 06/01/2016

Transversalentitysimilarity

l Usestoredneighbors&weightstocompute:

l Example:transSim(ent1 , ent2)= 0.52+ 2∗0.252+ 0.5∗0.25= 0.5

transSim(a ,b)=∑l= 1

L∗ 2

βl∗∣pathsa ,b(l ) ∣

Page 17: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

17 06/01/2016

Documentsimilarity:bipartitegraphofentitysimilarities1. Annotationpairsimilarity:Combinetransversal&hierarchicalscores

2. DeterminemaxGraph: foreachannotation,choosemax.scoreedge(bold)

3. Computedocumentscorebasedonmax.edges foreachannotationa1i ofDocA:

DocA DocB

docSim(docA , docB)=∑a1i∈A1

(entSiment(a1i ,matched (a1i)))

∣A1∣+∣A2∣

(a1i ,matched (a1i))

Page 18: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

18 06/01/2016

Documentsimilarity:DBpediaexample

l Exampledocumentsscore:

docSim(docA , docB)= 0.53+ 0.92+ 0.43+ 0.53+ 0.58+ 0.813+ 3

≈ 0.63

Page 19: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

19 06/01/2016

Evaluation

• Task:Measurecorrelationwithhumannotionofsimilarity

• Datasets

• Documentsimilarity:Lee50[1]

• Sentencesimilarity:2012-MSRvid-Test[2],2015-Images[3]

• ...using andX-LiSA[ZR14] entityextractor

[1]https://webfiles.uci.edu/mdlee/LeePincombeWelsh.zip[2]http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/[3]http://ixa2.si.ehu.es/stswiki/index.php/

Page 20: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

20 06/01/2016

DocumentSimilarity: Lee50corpus

• 50shortnewsarticles(51to126words)

• Goldstandardsetoffullpairwisedocumentsimilarityscores

Ø Outperformingbaselines&competition:

• Statistical(LSA,ESA,SSA)

• Knowledge-based(GED)

Page 21: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

21 06/01/2016

SentenceSimilarity

• Comparedtorelatedunsupervisedapproaches(ontextswithoneormoreextractedentities)

• 2012-MSRvid-Test:Videodescriptions fromMSRVideoParaphraseCorpus

• 2015-Images:Flickrimagedescriptions

Ø Outperformingbaselines&competition

• Statistical(Polyglot)

• Knowledge-based(Tiantianzhu7,IRIT,WSL)

Page 22: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

22 06/01/2016

Related-documentSearch:Pre-Search,FullSearch&Efficiency

Ø Rankingscore(nDCG)improvesfromPre-Search toFullSearch

Ø Computationtimegrowslinearlywithcandidatesetsize

Ø Here:candidatesetofsize~15achieveshighperformance

Page 23: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

23 06/01/2016

Conclusion&Outlook

l EfficientGraph-basedDocumentSimilarity

• …combineshierarchical&transversalrelationalknowledge

• …outperforms relateddistributional &knowledge-basedapproaches,onbotharticlesandsentences

• …iscomputationallyefficient:related-documentsearch

l Lessonslearned

Ø ValueofDBpediaforsemanticsimilarity

Ø Themoreentities(atleastone)perdocument, thebetter:

Ø Fewentities:disambiguationhelps

Ø Manyentities:maxGraph entitypairingemphasizesmeaningful relations

l Resources(code,data,documents):http://people.aifb.kit.edu/amo/eswc2016/

Page 24: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

24 06/01/2016

ReferencesI

l [TMS08] Thiagarajan,Manjunath,Stumptner. Computing semanticsimilarityusingontologies.InISWC08,theInternational SemanticWebConference(ISWC),2008.

l [LD08] Lemaire,Denhière.Effectsofhigh-order co-occurrences onwordsemanticsimilarities.

l [GM07]Gabrilovich, Markovitch.Computing semanticrelatednessusingwikipedia-basedexplicit semanticanalysis.InIJCAI,volume7,pages1606–1611, 2007.

l [HM11] Hassan,Mihalcea.Semanticrelatednessusingsalientsemanticanalysis.InAAAI,2011.

l [SP14] Schuhmacher, Ponzetto. Knowledge-basedgraphdocument modeling.InProceedingsofthe7thACMInternational ConferenceonWebSearchandDataMining,WSDM’14.

l [NKF+13] Nunes, Kawase,Fetahu,Dietze, Casanova,Maynard.Interlinkingdocumentsbasedonsemanticgraphs.ProcediaComputerScience,22:231–240,2013.

l [PSA08] Potthast, Stein, Anderka.Awikipedia-basedmultilingual retrieval model.InAdvancesinInformationRetrieval, pages522–530.Springer, 2008.

l [SHY+11] Sun,Han,Yan,Yu,Wu.Pathsim:Metapath-basedtop-ksimilaritysearchinheterogeneous information networks.VLDB’11, 2011.

l [SKH+14] Chuan,Xiangnan,Yue,Yu,Bin.Hetesim:Ageneralframeworkforrelevancemeasureinheterogeneous networks.IEEETransactionsonKnowledge&DataEngineering.

l [PVH+13] Palma,Vidal,Haag,Raschid,Thor.Measuringrelatedness betweenscientific entities inannotation datasets.InProceedingsoftheInternational ConferenceonBioinformatics,Computational Biology andBiomedical Informatics,BCB’13.

l [ZR14] Zhang,Rettinger.X-lisa:Cross-lingual semanticannotation.ProceedingsoftheVLDBEndowment(PVLDB), the40thInternational ConferenceonVeryLargeDataBases(VLDB).

l [KJC+15] PavanKapanipathi, PrateekJain,Chitra Venkataramani,AmitSheth.Hierarchical interestgraph, 21January2015.wiki.knoesis.org/index.php/Hierarchical_Interest_Graph, lastaccessed07/15/2015

Page 25: Efficient Graph -based Document Similarity · 2018-12-05 · l Lessons learned Ø Value of DBpedia ... Computing semantic similarity using ontologies. In ISWC 08, the International

Institute of Applied Informatics and Formal Description Methods (AIFB)

25 06/01/2016

ReferencesII

l [LIJ+15] Lehmann,J.,Isele,R.,Jakob,M.,Jentzsch, A.,Kontokostas, D.,Mendes,P.N.,Hellmann, S.,Morsey,M.,vanKleef,P.,Auer, S.,etal.:Dbpedia-alarge-scale,multilingual knowledgebaseextracted fromwikipedia.SemanticWeb6(2),167-195(2015)