Generating Linked Data by inferring the semantics of tables Varish Mulwad (@varish) University of Maryland, Baltimore County September 2, 2011 Dr. Tim Finin Dr. Anupam Joshi

Generating Linked Data by inferring the semantics of tables

Varish Mulwad (@varish), University of Maryland, Baltimore County

September 2, 2011

Dr. Tim Finin Dr. Anupam Joshi

Goal

Image from : Zagari RM, Bianchi-Porro G, Fiocca R, Gasbarrini G, Roda E, Bazzoli F. Comparison of 1 and 2 weeks of omeprazole, amoxicillin and clarithromycin treatment for Helicobacter pylori eradication: the HYPER Study. Gut. 2007;56: 475-9. [PMID: 17028126]

Contribution

Name            Team           Position        Height
Michael Jordan  Chicago        Shooting guard  1.98
Allen Iverson   Philadelphia   Point guard     1.83
Yao Ming        Houston        Center          2.29
Tim Duncan      San Antonio    Power forward   2.11

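Class for the "Team" column: http://dbpedia.org/class/yago/NationalBasketballAssociationTeams

Entity for the cell "Allen Iverson": http://dbpedia.org/resource/Allen_Iverson

Relation between the "Name" and "Team" columns: dbprop:team

Map literals as values of properties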

Contribution

Name            Team           Position        Height
Michael Jordan  Chicago        Shooting guard  1.98
Allen Iverson   Philadelphia   Point guard     1.83
Yao Ming        Houston        Center          2.29
Tim Duncan      San Antonio    Power forward   2.11

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .

All of this is done in a completely automated way!

Introduction & Motivation

Tables are everywhere!

389,697 raw and geospatial datasets

The web – 154 million high quality relational tables (Cafarella et al. 2008)

Introduction Related Work Baseline Results Joint Inference Conclusion

Evidence–based medicine

Figure: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010

The idea behind evidence-based medicine is to judge the efficacy of treatments or tests by meta-analyses or reviews of clinical trials. Key information in such trials is encoded in tables.

However, the rate at which meta-analyses are published remains very low, which hampers effective health care treatment.

[Figure: number of clinical trials published in 2008 vs. number of meta-analyses published in 2008]

Related Work

• Extracting tables from documents and web pages: Hurst (2006), Embley et al. (2006)

• Understanding semantics of tables: Wang et al. (2011), Venetis et al. (2011), Limaye et al. (2010)

Current systems

• Use ‘semantically poor’ knowledge bases

• Only one system focuses on complete table interpretation

• Do not generate Linked Data

• No system tackles literal data
  – A critical piece of evidence for interpreting medical tables

• No system deals with tables in specialized domains (e.g. tables found in medical literature)

Building a table interpretation framework

• Preliminary work / baseline system

• Analysis and evaluation of the baseline

• Framework grounded in graphical models and probabilistic reasoning

The System’s Brain (Knowledgebase)

Yago

Wikitology [1]: a hybrid knowledge base where structured data meets unstructured data

[1] Wikitology was created as part of Zareen Syed’s Ph.D. dissertation

Syed, Z., and Finin, T. 2011. Creating and Exploiting a Hybrid Knowledge Base for Linked Data, volume 129 of Revised Selected Papers Series: Communications in Computer and Information Science. Springer.

The Baseline System

T2LD Framework

1. Predict class for columns
2. Link the table cells
3. Identify and discover relations

Predicting class labels for a column

Column: Team
Cells (instances): Chicago, Philadelphia, Houston, San Antonio

Top-ranked knowledge base instances for the cell "Chicago":
1. Chicago Bulls
2. Chicago
3. Judy Chicago

Classes of the instances retrieved for each cell:
{dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams}
{dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, yago:NationalBasketballAssociationTeams, …}
{…}

Candidate class labels for the column (union of the sets above):
dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, …
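
The candidate-gathering step on this slide can be sketched roughly as below. This is only an illustration, not the T2LD implementation: `CELL_CLASSES` is a hypothetical stand-in for a knowledge base lookup, the pairing of the second class set with "Philadelphia" is an assumption (the slide does not say which cell it belongs to, and the set is truncated there), and ranking by a plain vote count is an assumed placeholder for whatever scoring the system actually uses.

```python
from collections import Counter

# Hypothetical stand-in for a knowledge base lookup: cell text -> classes of
# the top-ranked instances returned for that cell. The class sets are copied
# from the slide; assigning the second set to "Philadelphia" is an assumption.
CELL_CLASSES = {
    "Chicago": {
        "dbpedia-owl:Place",
        "dbpedia-owl:City",
        "yago:WomenArtist",
        "yago:LivingPeople",
        "yago:NationalBasketballAssociationTeams",
    },
    "Philadelphia": {
        "dbpedia-owl:Place",
        "dbpedia-owl:PopulatedPlace",
        "dbpedia-owl:Film",
        "yago:NationalBasketballAssociationTeams",
    },
}


def candidate_classes(cells, lookup):
    """Count, for every class, how many cells of the column support it."""
    votes = Counter()
    for cell in cells:
        votes.update(lookup.get(cell, set()))
    return votes


def rank_classes(votes):
    """Order classes by vote count (assumed scoring), ties broken by name."""
    return sorted(votes, key=lambda cls: (-votes[cls], cls))


votes = candidate_classes(["Chicago", "Philadelphia"], CELL_CLASSES)
ranking = rank_classes(votes)
```

With these two cells, the union contains the seven classes listed on the slide, and the classes shared by both cells rise to the top of `ranking`.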

Linking table cells to entities

Query for the cell "Michael Jordan": Michael Jordan + Chicago + Shooting Guard + 1.98 + dbpedia-owl:BasketballPlayer

Candidate entities:
1. Michael Jordan
2. Michael-Hakim Jordan

Classifier 1: SVM Rank (ranks the set of entities)

Classifier 2: SVM (computes confidence)

Outcome: link to the top-ranked entity, or don’t link
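
The rank-then-decide control flow on this slide might look like the sketch below. The two SVMs are replaced by stand-in scoring functions, and the scores, the `threshold` parameter, and the function name `link_cell` are all hypothetical; only the two-stage structure comes from the slide.

```python
from typing import Callable, Optional, Sequence


def link_cell(
    candidates: Sequence[str],
    rank_score: Callable[[str], float],
    confidence: Callable[[str], float],
    threshold: float = 0.5,
) -> Optional[str]:
    """Pick the top-ranked candidate, then decide whether to link to it.

    rank_score stands in for Classifier 1 (SVM Rank) and confidence for
    Classifier 2 (SVM); threshold is an assumed cut-off, not from the talk.
    """
    if not candidates:
        return None
    best = max(candidates, key=rank_score)
    return best if confidence(best) >= threshold else None


# Hypothetical scores for the two candidates shown on the slide.
SCORES = {"Michael Jordan": 0.9, "Michael-Hakim Jordan": 0.2}

linked = link_cell(list(SCORES), SCORES.get, SCORES.get)
not_linked = link_cell(list(SCORES), SCORES.get, SCORES.get, threshold=0.95)
```

Here `linked` is the top-ranked entity, while raising the threshold above its confidence yields `None`, the "don't link" outcome.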

Identify Relations

Name            Team
Michael Jordan  Chicago
Allen Iverson   Philadelphia
Yao Ming        Houston
Tim Duncan      San Antonio

Candidate relation sets shown in the diagram:
Rel ‘A’
Rel ‘A’
Rel ‘A’, ‘C’
Rel ‘A’, ‘B’, ‘C’
Rel ‘A’, ‘B’
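
One simple way to turn per-row candidate relations into a single relation for a column pair is majority voting, sketched below. This is an assumption for illustration: the slide does not state how the candidates are combined, and the assignment of relation sets to rows in `ROW_CANDIDATES` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def pick_relation(row_candidates: Iterable[Set[str]]) -> Optional[str]:
    """Return the relation supported by the most rows (ties broken by name)."""
    votes = Counter()
    for relations in row_candidates:
        votes.update(set(relations))
    if not votes:
        return None
    return min(votes, key=lambda rel: (-votes[rel], rel))


# Hypothetical candidate relations for four (Name, Team) row pairs, using the
# placeholder relation names 'A', 'B', 'C' from the slide.
ROW_CANDIDATES = [{"A"}, {"A", "C"}, {"A", "B", "C"}, {"A", "B"}]

best_relation = pick_relation(ROW_CANDIDATES)
```

Relation 'A' appears in every row, so it is the one selected for the column pair.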

Generating a linked RDF representation

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .

"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .

"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .

"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .

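A representation like the one on this slide could be serialized with a few lines of string formatting, as in the sketch below. It is illustrative only: it writes subject-first Turtle statements rather than the N3 "is ... of" form used on the slide, and deriving the resource name by replacing spaces with underscores (`local_name`) is an assumption; a real system would use the URI chosen during entity linking.

```python
# Prefixes as declared on the slide.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "yago": "http://dbpedia.org/class/yago/",
}


def local_name(label):
    """Assumed naming rule: spaces become underscores."""
    return label.replace(" ", "_")


def entity_triples(label, class_name):
    """Emit the label and type statements for one linked table cell."""
    subject = f"dbpedia:{local_name(label)}"
    return [
        f'{subject} rdfs:label "{label}"@en .',
        f"{subject} a {class_name} .",
    ]


def to_turtle(entities):
    """Serialize prefix declarations followed by the entity statements."""
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    lines.append("")
    for label, class_name in entities:
        lines.extend(entity_triples(label, class_name))
    return "\n".join(lines)


document = to_turtle([
    ("Michael Jordan", "dbpedia-owl:BasketballPlayer"),
    ("Chicago Bulls", "yago:NationalBasketballAssociationTeams"),
])
```

Printing `document` gives the four prefix declarations followed by two statements per entity.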

Evaluation of the baseline system

Dataset summary

Number of tables            15
Total number of rows        199
Total number of columns     56 (52)
Total number of entities    639 (611)

* The numbers in parentheses exclude columns that contained numbers

Evaluation # 1 (MAP)

• Compared the system’s ranked list of labels against a human-ranked list of labels

• Metric: Average Precision (a.p.); Mean Average Precision (MAP) is the mean over a set of queries

• Commonly used in the Information Retrieval domain to compare two ranked sets
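
For readers unfamiliar with the metric, a common formulation of average precision is sketched below. The talk does not spell out its exact variant, so two choices here are assumptions: the human-ranked list is treated as an unordered set of relevant labels, and the sum is normalized by the number of relevant labels. The example labels are illustrative.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k where a relevant label appears.

    Assumes `ranked` has no duplicates; normalizes by the number of relevant
    labels (one common convention, assumed here).
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, label in enumerate(ranked, start=1):
        if label in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


system_ranked = ["Person", "Politician", "President"]
evaluator_labels = ["President", "Politician", "OfficeHolder"]
ap = average_precision(system_ranked, evaluator_labels)
```

In this example the relevant labels are found at ranks 2 and 3, so the score is (1/2 + 2/3) / 3.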

Evaluation # 1 (MAP)

[Figure: average precision (y-axis) for each column # (x-axis, 0 to 60)]

MAP = 0.411

System ranked:
1. Person
2. Politician
3. President

Evaluator ranked:
1. President
2. Politician
3. OfficeHolder

Evaluation # 2 (Correctness)

• Evaluated whether our predicted class labels were “fair and correct”

• A class label may not be the most accurate one, but may still be correct
  – E.g. dbpedia-owl:PopulatedPlace is not the most accurate, but still a correct label for a column of cities

• Three human judges evaluated our predicted class labels

Evaluation # 2 (Correctness)

[Figure: % of correctly and incorrectly predicted class labels, by category]

Category      Correct   Incorrect
Person        76.92%    23.08%
Place         90.48%    9.52%
Organization  66.67%    33.33%
Other         58.33%    41.67%

Example: Column: Nationality; Prediction: MilitaryConflict

Example: Column: Birth Place; Prediction: PopulatedPlace

Overall accuracy: 76.92%

Accuracy for Entity Linking

[Figure: % of correct and incorrect instances linked, by category]

Category      Correct   Incorrect
Person        83.05%    16.95%
Place         80.43%    19.57%
Organization  61.90%    38.10%
Other         29.22%    70.78%

Overall accuracy: 66.12%

Lessons Learnt

• Sequential system: errors percolated from one phase to the next

• Current system favors general classes over specific ones (MAP score = 0.411)

• Largely, a system driven by “heuristics”

• Although we consider evidence, we don’t do assignment jointly

T2LD Framework: Predict class for columns → Link the table cells → Identify and discover relations

Joint Inference over evidence in a table

Probabilistic Graphical Models

Markov Logic Networks

A graphical model for tables

Column header nodes: C1, C2, C3

Row value nodes: R11, R12, R13, R21, R22, R23, R31, R32, R33

Parameterized graphical model

Variable nodes: column headers C1, C2, C3 and row values R11, R12, R13, R21, R22, R23, R31, R32, R33

Factor nodes: ψ3 (three instances), ψ4 (three instances), ψ5

The factor nodes include:
• a function that captures the affinity between the column headers and row values
• a function that captures the interaction between column headers
• a function that captures the interaction between row values
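
As a rough guide to what such a parameterized model expresses, the joint distribution over all variable nodes could be written as a product of the factors. The mapping used below is an assumption, since the slide text does not state which symbol plays which role: ψ3 is taken to link a column header with the values in its column, ψ4 to link the values within one row, and ψ5 to link the column headers, with R_{ij} read as the j-th row value of column i.

```latex
P(C_1, C_2, C_3, R_{11}, \dots, R_{33}) \;\propto\;
  \psi_5(C_1, C_2, C_3)\,
  \prod_{i=1}^{3} \psi_3(C_i, R_{i1}, R_{i2}, R_{i3})\,
  \prod_{j=1}^{3} \psi_4(R_{1j}, R_{2j}, R_{3j})
```

Under this reading there is one ψ3 per column, one ψ4 per row, and a single ψ5, which matches the number of factor nodes shown in the diagram.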

Challenges - Abbreviations

• Other examples:
  – State abbreviations
  – Stock tickers
  – Airport codes
  – Currency codes

• Preprocessing – parse and identify such columns

• Replace abbreviations with expanded forms
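
The preprocessing idea above can be sketched as follows. Everything here is a hypothetical illustration: the small `STATE_NAMES` dictionary, the function names, and the `min_fraction` cut-off for deciding that a column holds abbreviations are assumptions, not details from the talk.

```python
# Tiny illustrative dictionary; a real system would need a full lookup table
# per abbreviation type (states, tickers, airport codes, currency codes).
STATE_NAMES = {
    "MD": "Maryland",
    "CA": "California",
    "TX": "Texas",
    "NY": "New York",
}


def is_abbreviation_column(cells, table, min_fraction=0.8):
    """Treat a column as abbreviations if enough cells are known codes."""
    if not cells:
        return False
    known = sum(1 for cell in cells if cell.strip().upper() in table)
    return known / len(cells) >= min_fraction


def expand_column(cells, table, min_fraction=0.8):
    """Replace known abbreviations with expanded forms, else leave as-is."""
    if not is_abbreviation_column(cells, table, min_fraction):
        return list(cells)
    return [table.get(cell.strip().upper(), cell) for cell in cells]


expanded = expand_column(["MD", "CA", "TX"], STATE_NAMES)
untouched = expand_column(["MD", "Chicago", "Houston"], STATE_NAMES)
```

Checking the whole column before expanding avoids rewriting an ordinary cell that merely happens to look like a code.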

Challenges - Literals

Population: 690,000; 345,000; 510,020; 120,000

Age: 75; 65; 50; 25

Conclusion

• Presented a framework for inferring the semantics of tables and generating Linked Data

• Evaluation of the baseline system shows the feasibility of tackling the problem

• Work is in progress on a framework grounded in graphical models and probabilistic reasoning

• Working on the challenges posed by tables from domains such as medicine and open government data

References

1. Cafarella, M. J.; Halevy, A. Y.; Wang, Z. D.; Wu, E.; and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. PVLDB 1(1):538–549.

2. Hurst, M. 2006. Towards a theory of tables. IJDAR 8(2-3):123–131.

3. Embley, D. W.; Lopresti, D. P.; and Nagy, G. 2006. Notes on contemporary table recognition. In Document Analysis Systems, 164–175.

4. Wang, J.; Shao, B.; Wang, H.; and Zhu, K. Q. 2010. Understanding tables on the web. Technical report, Microsoft Research Asia.

5. Venetis, P.; Halevy, A.; Madhavan, J.; Pasca, M.; Shen, W.; Wu, F.; Miao, G.; and Wu, C. 2011. Recovering semantics of tables on the web. In Proc. of the 37th Int'l Conference on Very Large Databases (VLDB).

6. Limaye, G.; Sarawagi, S.; and Chakrabarti, S. 2010. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).

Thank you! Questions?

[email protected]

@varish Web: http://goo.gl/NVu8N

“A little semantics goes a long way” ~ Jim Hendler