Transcript
Page 1: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

T2LD – An automatic framework for extracting, interpreting and representing tables as linked data

Varish Mulwad
Master’s Thesis Defense

Advisor: Dr. Tim Finin
June 29, 2010

Page 2: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Contribution - Tables to Linked Data

Link Cell Value to an entity: http://dbpedia.org/resource/Baltimore

http://dbpedia.org/ontology/PopulatedPlace

Find Relationships between columns: LargestCity

Page 3: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Contribution - Tables to Linked Data

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .

"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .

"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
… …

Page 4: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

A thousand reasons why it’s important…

1. Generate linked RDF for the Semantic Web
2. Enrich facts and knowledge that already exist on the Semantic Web
3. Add new facts and knowledge to the Semantic Web
4. Possible use in completing “incomplete tables”
5. Use in expanding the attributes / columns of a table

… and 995 other applications (or more) that will exploit this data

Page 5: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Overview

• Introduction
• Related Work & Motivation
• Tables to linked data
• Results
• Future Work
• Conclusion

Page 6: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Introduction


Page 7: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

The World Wide Web …

[Figure: web pages of free text, with fragments such as "Talk: abc, By: xyz, Venue: some location"]

Page 8: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

The World Wide Web …

Good for you and me …

… not so good for machines

Images from http://www.bbc.co.uk/blogs/radiolabs/s5/linked-data/s5.html

Page 9: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Web of Data – The Semantic Web

Image – www.linkeddata.org


Page 10: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Linked Data

The principles of Linked Data outline the best practices to share and expose structured data on the World Wide Web.

Every resource has a URI: Baltimore: http://dbpedia.org/resource/Baltimore


Page 11: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Related Work and Motivation


Page 12: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010


Page 13: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Chicken ? No – Egg … No – Chicken …

• More than a trillion documents on the Web

• ~ 14.1 billion tables, 154 million with high quality relational data (Cafarella et al. 2008)

• Where is structured data ?


Page 14: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Automate the process

• We need systems that can generate data from existing sources

• Not practical for humans to encode all this into RDF manually


Page 15: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

In Databases and Web Systems …

• Understanding tables for Data Integration (Ziegler & Dittrich 2004), (Pantel, Philpot, & Hovy 2005)

• Learning to index tables to improve search experience (Cafarella et al. 2008)

• Expanding attributes (columns) of web tables (Lin et al. 2010)


Page 16: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

On the Semantic Web

• Database to Ontology mapping (Barrasa, Corcho, & Gómez-Pérez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004)

• W3C working group – RDB2RDF !!!
• First working draft – June 8, 2010

Page 17: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

On the Semantic Web

• Mapping spreadsheets to RDF

• Systems like RDF123 (Han et al. 2008) allow users to convert spreadsheets to RDF

• Such systems are practical and helpful but …
– Require significant manual work
– Do not generate linked data

Page 18: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

On the Semantic Web

• Han et al. (2009) addressed the problem of recommending a set of terms to use to describe the objects and relationships in the table

• Did not focus on the overall interpretation of a table

• Did not attempt to understand and link cell values

Page 19: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

An overall interpretation

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .

"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .

"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
… …

Page 20: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Tables to Linked Data


Page 21: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

T2LD Framework

Predict Class for Columns

Linking the table cells

Identify and Discover relations


Input: Table Headers and Rows

Output: Linked Data Representation of a Table


Page 22: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

An overview

Input: Table Headers and Rows

Query Knowledge base

Predict Class for Columns

Re-query Knowledge base using the new evidence

Link cell value to an entity using the new results obtained

Identify Relationships between columns

Output: Linked Data

Page 23: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

T2LD Framework

Predict Class for Columns

Linking the table cells

Identify and Discover relations

Input: Table Headers and Rows

Output: Linked Data Representation of a Table


Page 24: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Querying the Knowledge–Base

For every cell from the column –

Query: Cell Value + Column Header + Row Content

Knowledge base: Wikitology, Yago

Result: Top N entities, Their Types, Google Page Rank (We use N = 5)

Example column: City (Type); Baltimore, Boston, New York (Instance)

Page 25: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Querying the Knowledge–Base

Column: City

Baltimore
1. Baltimore, Types, Page Rank
2. Baltimore County, Maryland, Types, Page Rank
3. John Baltimore, Types, Page Rank

Boston
1. Boston, Types, Page Rank
2. Boston_(band), Types, Page Rank
3. Boston_University, Types, Page Rank

New York
1. New_York_City, Types, Page Rank
2. New_York, Types, Page Rank
3. New_York_(album), Types, Page Rank

Page 26: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Set of Classes

Types for Baltimore
{dbpedia-owl:Place, dbpedia-owl:Area}

Types for Baltimore County
{dbpedia-owl:Place, dbpedia-owl:Area}

Types for John Baltimore
{yago:AmericanConductors, yago:LivingPeople}

Types for Boston
{dbpedia-owl:Place, dbpedia-owl:PopulatedPlace}

Types for Boston_band
{dbpedia-owl:Band, dbpedia-owl:Organisation}

. . .

Set of classes for a column:
{dbpedia-owl:Place, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, . . . }
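A minimal sketch of how the column's class set could be assembled: take the union of the types of every candidate entity returned for every cell. The dictionary layout and the function name `column_classes` are assumptions made for illustration; the type sets are the ones shown on the slides.

```python
from typing import Dict, List, Set

# Candidate entities per cell string, each represented by its set of types.
# This layout is hypothetical; only the type names come from the slides.
candidates: Dict[str, List[Set[str]]] = {
    "Baltimore": [
        {"dbpedia-owl:Place", "dbpedia-owl:Area"},            # Baltimore
        {"dbpedia-owl:Place", "dbpedia-owl:Area"},            # Baltimore County
        {"yago:AmericanConductors", "yago:LivingPeople"},     # John Baltimore
    ],
    "Boston": [
        {"dbpedia-owl:Place", "dbpedia-owl:PopulatedPlace"},  # Boston
        {"dbpedia-owl:Band", "dbpedia-owl:Organisation"},     # Boston_(band)
    ],
}


def column_classes(cands: Dict[str, List[Set[str]]]) -> Set[str]:
    """Union of all types over all candidate entities of all cells."""
    classes: Set[str] = set()
    for entity_types in cands.values():
        for types in entity_types:
            classes |= types
    return classes


classes = column_classes(candidates)
```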

Page 27: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Ranking the Classes

Create a pairing of all the class labels and strings in a column

[Baltimore, dbpedia-owl:Place]
[Boston, dbpedia-owl:Place]
[New York, dbpedia-owl:Place]
[Baltimore, dbpedia-owl:PopulatedPlace]
[Boston, dbpedia-owl:PopulatedPlace]
……
[Baltimore, dbpedia-owl:Band]
……
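The pairing step amounts to a Cartesian product of the column's cell strings and its candidate class labels. A short sketch, with the function name `make_pairs` chosen here for illustration:

```python
from itertools import product
from typing import List, Tuple


def make_pairs(cell_strings: List[str], class_labels: List[str]) -> List[Tuple[str, str]]:
    """Pair every cell string in the column with every candidate class label."""
    return list(product(cell_strings, class_labels))


pairs = make_pairs(
    ["Baltimore", "Boston", "New York"],
    ["dbpedia-owl:Place", "dbpedia-owl:PopulatedPlace", "dbpedia-owl:Band"],
)
```

Each resulting pair is then scored independently, so a class that matches no candidate entity of a cell simply contributes nothing for that cell.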

Page 28: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Ranking the Classes

• Assign a score to every pair based on –
– The entity’s rank that matches the class label
– Predicted Google Page Rank

• We use the following formula –
– Score = w x ( 1 / R ) + (1 – w) x (Normalized Google Page Rank)
– We use w = 0.25

Page 29: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Ranking the Classes

E.g. Processing class – “dbpedia-owl:Area”

String Baltimore:
(R = 1) Baltimore {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 6]
(R = 2) Baltimore County {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 4]
(R = 3) John Baltimore {yago:AmericanConductors, yago:LivingPeople} [PR = 5]

Score = w x ( 1 / R ) + (1 – w) x (Normalized Page Rank)
[Baltimore, dbpedia-owl:Area] = (0.25 x 1 / 1 ) + (0.75 x 6 / 7) = 0.892

E.g. Processing class – “dbpedia-owl:Band”

String Baltimore:
(R = 1) Baltimore {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 6]
(R = 2) Baltimore County {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 4]
(R = 3) John Baltimore {yago:AmericanConductors, yago:LivingPeople} [PR = 5]

[Baltimore, dbpedia-owl:Band] = 0 [Since the class does not match any of the entities for Baltimore]
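The worked example can be reproduced with a few lines of code. This is a sketch under two assumptions that the slides do not state explicitly: when several candidate entities carry the class, the best-ranked one is used, and the page rank is normalized by dividing by 7 (the divisor that appears in the slide's arithmetic). The function name `pair_score` and the tuple layout are hypothetical.

```python
from typing import List, Set, Tuple


def pair_score(
    cell_candidates: List[Tuple[Set[str], int]],
    class_label: str,
    w: float = 0.25,
    max_page_rank: int = 7,  # assumed normalizer, taken from "6 / 7" on the slide
) -> float:
    """Score = w * (1 / R) + (1 - w) * normalized page rank.

    cell_candidates holds (types, page_rank) tuples ordered by rank, R = 1 first.
    The first candidate whose types include class_label is used (assumption).
    """
    for rank, (types, page_rank) in enumerate(cell_candidates, start=1):
        if class_label in types:
            return w * (1 / rank) + (1 - w) * (page_rank / max_page_rank)
    return 0.0  # the class matches none of the candidate entities


baltimore = [
    ({"dbpedia-owl:Place", "dbpedia-owl:Area"}, 6),          # R = 1, Baltimore
    ({"dbpedia-owl:Place", "dbpedia-owl:Area"}, 4),          # R = 2, Baltimore County
    ({"yago:AmericanConductors", "yago:LivingPeople"}, 5),   # R = 3, John Baltimore
]

area_score = pair_score(baltimore, "dbpedia-owl:Area")
band_score = pair_score(baltimore, "dbpedia-owl:Band")
```

With these inputs `area_score` evaluates to 0.25 + 0.75 x 6/7, which is about 0.8929 (shown as 0.892 on the slide), and `band_score` is 0.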

Page 30: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Predicting the Classes

• Select the class that maximizes its sum of score over the entire column

• E.g. Sum of dbpedia-owl:Area
– [Baltimore, dbpedia-owl:Area] + [Boston, dbpedia-owl:Area] + [New York, dbpedia-owl:Area] = 2.85

• Sum of dbpedia-owl:Band
– [Baltimore, dbpedia-owl:Band] + [Boston, dbpedia-owl:Band] + [New York, dbpedia-owl:Band] = 0.25
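Selecting the column class is then a sum-and-argmax over the pair scores. In the sketch below the per-cell scores are hypothetical values chosen only so that the column totals match the 2.85 and 0.25 quoted above; the slides give the totals, not the individual terms. The function name `predict_class` is likewise an assumption.

```python
from typing import Dict, Tuple


def predict_class(pair_scores: Dict[Tuple[str, str], float]) -> Tuple[str, Dict[str, float]]:
    """Sum the scores of each class over the column and return the best class."""
    totals: Dict[str, float] = {}
    for (_cell, cls), score in pair_scores.items():
        totals[cls] = totals.get(cls, 0.0) + score
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical per-cell scores (only the column totals come from the slides).
pair_scores = {
    ("Baltimore", "dbpedia-owl:Area"): 0.89,
    ("Boston", "dbpedia-owl:Area"): 0.98,
    ("New York", "dbpedia-owl:Area"): 0.98,
    ("Baltimore", "dbpedia-owl:Band"): 0.0,
    ("Boston", "dbpedia-owl:Band"): 0.25,
    ("New York", "dbpedia-owl:Band"): 0.0,
}

best_class, totals = predict_class(pair_scores)
```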

Page 31: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Predicting the Classes

• We predict classes from four vocabularies – DBpedia Ontology, Freebase, WordNet and Yago

[City, dbpedia-owl:Area] = 1
[City, dbpedia-owl:PopulatedPlace] = 0.9
[City, dbpedia-owl:Band] = 0.2
[City, yago:LivingPeople] = 0.23

Page 32: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

The underlying query process …


Page 33: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Mapping Table to Wikipedia

State      Capital City   Largest City   Governor
Maryland   Annapolis      Baltimore      Martin O’Malley

Page 34: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Mapping Table to Wikipedia

State      Capital City   Largest City   Governor
Maryland   Annapolis      Baltimore      Martin O’Malley

Types, Linked Concepts, Property Values

Page 35: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Summary of the Query


Page 36: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Extracting Types from DBpedia

Types for Annapolis

SPARQL Query

Query redirects too … … to avoid disparity in KBs

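The SPARQL query itself is not captured in this transcript. As a rough sketch of what a type lookup could look like, the helper below builds a query that asks for every `rdf:type` of a resource. The function name `types_query` and the example URI are assumptions, and the redirect handling mentioned on the slide is deliberately left out.

```python
def types_query(resource_uri: str) -> str:
    """Build a SPARQL query that returns every rdf:type of the given resource.

    Illustrative only: the query used in T2LD is not shown in the transcript,
    and following redirects is omitted here.
    """
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "SELECT DISTINCT ?type WHERE {\n"
        f"  <{resource_uri}> rdf:type ?type .\n"
        "}"
    )


# Hypothetical resource URI used only to show the shape of the query.
query = types_query("http://dbpedia.org/resource/Annapolis")
```

The resulting string can be sent to any SPARQL endpoint that hosts the DBpedia data.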

Page 37: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

T2LD Framework

Predict Class for Columns

Linking the table cells

Identify and Discover relations

Input: Table Headers and Rows

Output: Linked Data Representation of a Table


Page 38: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Approach

Table Cell + Column Header + Row Data + Column Type

Requery KB with predicted class labels as additional evidence

Generate a feature vector for the top N results of the query

Classifier ranks the entities within the set of possible results

Select the highest ranked entity

Classifier decides whether to link or not

Link to the top ranked instance, or link to “NIL”

Page 39: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Re-querying KB

• Use of predicted class labels as “additional evidence”
– WordNet:City
– http://dbpedia.org/ontology/City
– Yago:CitiesinUnitedStates
– Freebase:Location

• Class labels are mapped to the typesRef field

• Restricts the types of the results returned to the predicted class labels

Page 40: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Summary of the re-query


Page 41: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Learning to Rank

• We trained an SVMrank classifier that learned to rank entities within a given set

Feature Vector

Similarity Measures
• Levenshtein distance
• Dice Score

Popularity Measures
• Wikitology Score
• PageRank
• Page Length
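The two similarity measures can be sketched as follows. The Levenshtein distance is the standard edit distance. For the Dice score, the slides do not say which units are compared, so this sketch assumes sets of lowercase character bigrams; both function names are chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram sets (assumed unit of comparison)."""
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two strings without bigrams are treated as identical
    return 2 * len(x & y) / (len(x) + len(y))
```

For a table cell "Boston", the candidate `Boston` has distance 0 and Dice score 1.0, while `Boston_(band)` is 7 edits away, which is the kind of signal the ranker can combine with the popularity measures.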

Page 42: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

“To Link or not to Link …”

• The highest ranked entity may not be the correct one to link to …
– Because the string we are querying may not be in the KB
– Top N results may not include the correct answer

• We trained an SVM classifier to determine whether to link to the top ranked entity or not

Page 43: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

“To Link or not to Link …”

• Feature vector included the feature vector of the top ranked entity and two additional features –
– The SVMrank score of the top ranked entity
– The difference in scores between the top two ranked entities
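Assembling the input for the link-or-not classifier can be sketched as below. The function name `link_features`, the tuple layout, the example numbers, and the choice to use 0.0 as the runner-up score when only one candidate exists are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def link_features(ranked: List[Tuple[Sequence[float], float]]) -> List[float]:
    """Build the link-or-not feature vector.

    ranked holds (feature_vector, svmrank_score) tuples, best entity first.
    The result is the top entity's features, followed by its SVMrank score
    and the score gap to the second-ranked entity.
    """
    top_features, top_score = ranked[0]
    # Assumption: with a single candidate, the runner-up score is taken as 0.0.
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return list(top_features) + [top_score, top_score - second_score]


# Hypothetical feature values: [Levenshtein, Dice, Wikitology score, PageRank, page length]
ranked = [
    ([0, 1.0, 0.8, 6, 1200], 2.5),
    ([7, 0.6, 0.3, 4, 300], 1.0),
]
features = link_features(ranked)
```

A large gap between the top two scores suggests an unambiguous match, while a small gap or a low top score is evidence for linking to "NIL".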

Page 44: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

T2LD Framework

Predict Class for Columns

Linking the table cells

Identify and Discover relations

Input: Table Headers and Rows

Output: Linked Data Representation of a Table


Page 45: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Relation between columns

City        State
Baltimore   Maryland
Boston      Massachusetts
New York    New York

Page 46: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Relation between columns

Maryland - Baltimore: dbonto:LargestCity

Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital

New York - New York: dbonto:LargestCity

Candidate relations: dbonto:Capital, dbonto:LargestCity

Page 47: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Scoring the relations

Maryland - Baltimore: dbonto:LargestCity

Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital

New York - New York: dbonto:LargestCity

Candidates: dbonto:Capital, dbonto:LargestCity

dbonto:Capital Score: 1

dbonto:LargestCity Score: 3
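The scoring shown here is a vote count: each row contributes one vote to every relation found between its two cells, and the relation with the most votes is selected. A minimal sketch, with the function name `score_relations` chosen for illustration:

```python
from collections import Counter
from typing import Iterable, Set


def score_relations(row_relations: Iterable[Set[str]]) -> Counter:
    """Count, for each candidate relation, the number of rows that support it."""
    votes: Counter = Counter()
    for relations in row_relations:
        votes.update(set(relations))  # at most one vote per relation per row
    return votes


rows = [
    {"dbonto:LargestCity"},                    # Maryland - Baltimore
    {"dbonto:LargestCity", "dbonto:Capital"},  # Massachusetts - Boston
    {"dbonto:LargestCity"},                    # New York - New York
]
scores = score_relations(rows)
best_relation = scores.most_common(1)[0][0]
```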

Page 48: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Relation between columns

Select * where { <http://dbpedia.org/resource/Maryland> ?relation <http://dbpedia.org/resource/Baltimore> }

Select * where { <http://dbpedia.org/resource/Maryland> ?relation "Baltimore"@en }

• Query the second column both as a URI and as a literal string

• Check all redirects when querying with URI

• Check all other common names when querying with literal string
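Both query forms can be generated from one row of linked cells. The helper below is an illustrative sketch: the name `relation_queries` is an assumption, quotes inside the label are not escaped, and the redirect and alternative-name lookups from the bullets are not covered.

```python
from typing import Tuple


def relation_queries(subject_uri: str, object_uri: str,
                     object_label: str, lang: str = "en") -> Tuple[str, str]:
    """Return the two SPARQL queries for one row: object as URI and as literal."""
    as_uri = f"SELECT * WHERE {{ <{subject_uri}> ?relation <{object_uri}> }}"
    as_literal = (
        f'SELECT * WHERE {{ <{subject_uri}> ?relation "{object_label}"@{lang} }}'
    )
    return as_uri, as_literal


uri_query, literal_query = relation_queries(
    "http://dbpedia.org/resource/Maryland",
    "http://dbpedia.org/resource/Baltimore",
    "Baltimore",
)
```

Every `?relation` binding returned by either query becomes a candidate relation for that row.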

Page 49: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

T2LD Framework

Predict Class for Columns

Linking the table cells

Identify and Discover relations

Input: Table Headers and Rows

Output: Linked Data Representation of a Table


Page 50: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

An example

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .

"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .

"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
"MD"@en is rdfs:label of dbpedia:Maryland .
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .

dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:range dbpedia-owl:City .

"City"@en is rdfs:label of dbpedia-owl:City .
“City” is the common / human name for the class dbpedia-owl:City

dbpedia:Baltimore a dbpedia-owl:City .
dbpedia:Baltimore is an instance of the type dbpedia-owl:City

dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion

dbpprop:LargestCity rdfs:range dbpedia-owl:City .
The objects of the triples using the property have to be instances of dbpedia-owl:City

Page 51: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Template

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

"ColumnHeader1" is rdfs:label of PredictedClassLabel1 .
"ColumnHeader2" is rdfs:label of PredictedClassLabel2 .

"TableCellString" is rdfs:label of CellValueURL .
CellValueURL a PredictedClassLabel .

property rdfs:domain PredictedClassLabel1 .
property rdfs:range PredictedClassLabel2 .

Where:
ColumnHeader - is a column header from the table
TableCellString - is a string representing a table cell
PredictedClassLabel - is the class label associated with the column
CellValueURL - is the DBpedia URL the table cell string is linked to
property - is the relation discovered between the two columns
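Filling in the template is plain string substitution. The sketch below is one hedged way to do it; the function name `render_linked_data` and its parameter layout are hypothetical, and the domain and range classes are passed explicitly instead of being tied to column order.

```python
from typing import Dict, List, Tuple


def render_linked_data(
    prefixes: Dict[str, str],
    columns: List[Tuple[str, str]],     # (column header, predicted class label)
    cells: List[Tuple[str, str, str]],  # (cell string, linked URL, class label)
    prop: str,
    domain_cls: str,
    range_cls: str,
) -> str:
    """Render the template: prefixes, header labels, cell links, relation."""
    lines = [f"@prefix {name}: <{uri}> ." for name, uri in prefixes.items()]
    lines.append("")
    for header, cls in columns:
        lines.append(f'"{header}" is rdfs:label of {cls} .')
    lines.append("")
    for text, url, cls in cells:
        lines.append(f'"{text}" is rdfs:label of {url} .')
        lines.append(f"{url} a {cls} .")
    lines.append("")
    lines.append(f"{prop} rdfs:domain {domain_cls} .")
    lines.append(f"{prop} rdfs:range {range_cls} .")
    return "\n".join(lines)


output = render_linked_data(
    prefixes={
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "dbpedia": "http://dbpedia.org/resource/",
        "dbpedia-owl": "http://dbpedia.org/ontology/",
        "dbpprop": "http://dbpedia.org/property/",
    },
    columns=[("City", "dbpedia-owl:City"),
             ("State", "dbpedia-owl:AdministrativeRegion")],
    cells=[("Baltimore", "dbpedia:Baltimore", "dbpedia-owl:City"),
           ("MD", "dbpedia:Maryland", "dbpedia-owl:AdministrativeRegion")],
    prop="dbpprop:LargestCity",
    domain_cls="dbpedia-owl:AdministrativeRegion",
    range_cls="dbpedia-owl:City",
)
```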

Page 52: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Results


Page 53: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Dataset summary

Number of Tables 15

Total Number of rows 199

Total Number of columns 56 (52)

Total Number of entities 639 (611)

* The number in the brackets indicates # excluding columns that contained numbers


Page 54: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Dataset summary


Page 55: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Dataset summary


Page 56: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation for class label predictions


Page 57: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation # 1 (MAP)

• Compared the system’s ranked list of labels against a human ranked list of labels

• Metric - Mean Average Precision (MAP)

• Commonly used in the Information Retrieval domain to compare two ranked sets


Page 58: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation # 1 (MAP)

• MAP is defined as –

• R(n) - is the relevance at n. If the class label ranked “n” in the system generated set is a relevant one then R(n) is 1, else it is 0.

• P(n) - is the precision at n. It measures the relevance of the top n results.

• N - is the number of labels retrieved. For our evaluation we consider the top 3 labels retrieved.
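The formula itself is not captured in this transcript. The sketch below uses the textbook definition of average precision, which is consistent with the symbols defined above: the sum of P(n) x R(n) over the top N labels, divided by the number of relevant labels. The choice of denominator is an assumption, and the function names are illustrative.

```python
from typing import Iterable, List, Set, Tuple


def average_precision(system_ranked: List[str], relevant: Set[str], n: int = 3) -> float:
    """Sum of P(k) * R(k) over the top n labels, divided by len(relevant).

    Dividing by the number of relevant labels is the textbook choice and an
    assumption here; the slide's own formula is not available.
    """
    hits = 0
    total = 0.0
    for rank, label in enumerate(system_ranked[:n], start=1):
        if label in relevant:     # R(k) = 1
            hits += 1
            total += hits / rank  # P(k): precision over the top k labels
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(cases: Iterable[Tuple[List[str], Set[str]]], n: int = 3) -> float:
    """Mean of the average precision over all (system list, relevant set) cases."""
    values = [average_precision(ranked, rel, n) for ranked, rel in cases]
    return sum(values) / len(values) if values else 0.0


ap = average_precision(
    ["Person", "Politician", "President"],
    {"President", "Politician", "OfficeHolder"},
)
```

In this example the system's first label is not relevant, the second and third are, so `ap` is (1/2 + 2/3) / 3.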

Page 59: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation # 1 (MAP)

80.76 %

System Ranked:
1. Person
2. Politician
3. President

Evaluator Ranked:
1. President
2. Politician
3. OfficeHolder

Page 60: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation # 2 (Recall)

• Checked whether the system was retrieving relevant class labels or not.

• Measure used : Recall (R)

• Top three labels ranked by the user were considered to be relevant.


Page 61: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation # 2 (Recall)

Recall > 0.6 (75 %)

System Ranked:
1. Person
2. Politician
3. President

Evaluator Ranked:
1. President
2. Politician
3. OfficeHolder
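Recall for one column can be computed as the share of evaluator-chosen labels that also appear among the system's labels. A small sketch using the example lists from this slide; the function name `recall` is chosen here for illustration.

```python
from typing import Iterable


def recall(system_labels: Iterable[str], relevant_labels: Iterable[str]) -> float:
    """Fraction of the relevant labels that the system retrieved."""
    relevant = set(relevant_labels)
    if not relevant:
        return 0.0
    return len(set(system_labels) & relevant) / len(relevant)


r = recall(
    ["Person", "Politician", "President"],
    ["President", "Politician", "OfficeHolder"],
)
```

Here two of the three evaluator labels are retrieved, so `r` is 2/3, which is above the 0.6 threshold quoted on the slide.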

Page 62: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation # 3 (Rank Match)

• A comparison of how many times the top three ranked system generated labels match with the top three labels ranked by the users


Page 63: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation # 4 (Correctness)

• Evaluated whether our predicted class labels were “fair and correct”

• Class label may not be the most accurate one, but may be correct.
– E.g. dbpedia-owl:PopulatedPlace is not the most accurate, but still a correct label for a column of cities

• Three human judges evaluated our predicted class labels

Page 64: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation # 4 (Correctness)

• A category-wise breakdown for class label correctness

Overall Accuracy: 76.92 %

Column – Nationality, Prediction – MilitaryConflict

Column – Birth Place, Prediction – PopulatedPlace

Page 65: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Summary – Class label prediction

• Recall and class label correctness show that our approach produces relevant and correct labels

• MAP and Rank Match show that we enjoyed moderate success in ranking labels within a set


Page 66: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation for linking table cells to entities


Page 67: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Category-wise accuracy for linking table cells

Overall Accuracy: 66.12 %


Page 68: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

SVMrank classifier

Correctly predicted top ranked instance: 215
Incorrectly predicted top ranked instance: 7
Total number of instances: 222
Accuracy: 96.84 %

Correctly predicted top ranked instance: 543
Incorrectly predicted top ranked instance: 68
Total number of instances: 611
Accuracy: 88.87 %

• Training data – 171 queries (each with 10 results)
• The correct entity was assigned the highest rank and all others were assigned the same lower rank
• Test data – 222 queries (each with 10 results)

Page 69: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

The binary SVM classifier

Correctly predicted: 145
Incorrectly predicted: 26
Total number of instances: 171
Accuracy: 84.79 %

Correctly predicted: 541
Incorrectly predicted: 70
Total number of instances: 611
Accuracy: 88.54 %

• Training data – 222 queries (146 +ve, 76 –ve examples)
• If the highest ranked instance was correct, a class label of “yes” was assigned
• Test data – 171 queries (119 +ve, 52 –ve examples)

Page 70: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Evaluation for relation between columns


Page 71: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Relation between columns

• Idea – Ask human evaluators to identify relations between columns in a given table

• Pilot Experiment – Asked three evaluators to annotate five random tables from our dataset

• Evaluators identified 20 relations

• Our accuracy – 5 out of 20 (25 % ) were correct


Page 72: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Future Work


Page 73: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Future Work

• Implement a machine learning based approach for class label predictions for columns

• Alternative approach for relation discovery and identification

• Approaches to handle unknown entities


Page 74: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Conclusion


Page 75: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Conclusion

• There is a lot of data stored in HTML tables, spreadsheets, databases and documents

• We presented an automated framework that extracts, interprets and represents tables as linked data

• We are unlocking large amounts of tabular data that is currently inaccessible to the Semantic Web, making it more meaningful and useful there

• We believe our work will contribute to materializing the web of data vision

Page 76: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

References

• Cafarella, M. J.; Halevy, A.; Wang, D. Z.; Wu, E.; and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. Proc. VLDB Endow. 1(1), 538-549.

• Ziegler, P., and Dittrich, K. R. 2004. Three decades of data integration: all problems solved? In Building the Information Society, volume 156 of IFIP International Federation for Information Processing, 3-12. Springer Boston.

• Pantel, P.; Philpot, A.; and Hovy, E. 2005. Aligning database columns using mutual information. In Proceedings of the 2005 national conference on Digital government research, dg.o '05, 205-210.

• Lin, C. X.; Zhao, B.; Weninger, T.; Han, J.; and Liu, B. 2010. Entity relation discovery from web tables and links. In Proceedings of the 19th international conference on World wide web (WWW '10). ACM, New York, NY, USA, 1145-1146.

• Barrasa, J.; Corcho, O.; and Gómez-Pérez, A. 2004. R2O, an extensible and semantically based database-to-ontology mapping language. In Proceedings of the 2nd Workshop on Semantic Web and Databases (SWDB2004). Vol. 3372. pp. 1069-1070.

Page 77: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

• Hu, W., and Qu, Y. 2007. Discovering simple mappings between relational database schemas and ontologies. In Aberer, K.; Choi, K.-S.; Noy, N. F.; Allemang, D.; Lee, K.-I.; Nixon, L. J. B.; Golbeck, J.; Mika, P.; Maynard, D.; Mizoguchi, R.; Schreiber, G.; and Cudre-Mauroux, P., eds., ISWC/ASWC, volume 4825 of Lecture Notes in Computer Science, 225-238. Springer.

• Papapanagiotou, P.; Katsiouli, P.; Tsetsos, V.; Anagnostopoulos, C.; and Hadjiefthymiades, S. 2006. Ronto: Relational to ontology schema matching. In AIS SIGSEMIS Bulletin.

• Lawrence, E. D. R. 2004. Composing mappings between schemas using a reference ontology. In Proceedings of International Conference on Ontologies, Databases and Application of Semantics (ODBASE), 783-800. Springer.

• Han, L.; Finin, T.; Parr, C.; Sachs, J.; and Joshi, A. 2008. RDF123: from Spreadsheets to RDF. In Seventh International Semantic Web Conference. Springer.

• Han, L.; Finin, T.; and Yesha, Y. 2009. Finding semantic web ontology terms from words. In Proceedings of the Eighth International Semantic Web Conference. Springer.

Page 78: Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim  Finin June 29, 2010

Discussion


