T2LD – An automatic framework for extracting, interpreting and
representing tables as linked data
Varish Mulwad
Master’s Thesis Defense
Advisor: Dr. Tim Finin
June 29, 2010
Contribution - Tables to Linked Data
Link Cell Value to an entity: http://dbpedia.org/resource/Baltimore
Class for a column: http://dbpedia.org/ontology/PopulatedPlace
Find Relationships between columns: LargestCity
Contribution - Tables to Linked Data
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
… …
A thousand reasons why it’s important…
1. Generate linked RDF for the Semantic Web
2. Enrich facts and knowledge that already exist on the Semantic Web
3. Add new facts and knowledge to the Semantic Web
4. Possible use in completing “incomplete tables”
5. Use in expanding the attributes / columns of a table
… and 995 other applications (or more) that will exploit this data
Overview
• Introduction
• Related Work & Motivation
• Tables to linked data
• Results
• Future Work
• Conclusion
The World Wide Web …
[Figure: web pages with content such as “Talk: abc / By: xyz / Venue: some location”]
Introduction Related Work Tables to Linked Data Results Future Work Conclusion 7
The World Wide Web …
Good for you and me …
… not so good for machines
Images from http://www.bbc.co.uk/blogs/radiolabs/s5/linked-data/s5.html
Web of Data – The Semantic Web
Image – www.linkeddata.org
Linked Data
The principles of Linked Data outline the best practices to share and expose structured data on the World Wide Web.
Every resource has a URI: Baltimore: http://dbpedia.org/resource/Baltimore
Chicken ? No – Egg … No – Chicken …
• More than a trillion documents on the Web
• ~ 14.1 billion tables, 154 million with high quality relational data (Cafarella et al. 2008)
• Where is structured data ?
Automate the process
• We need systems that can generate data from existing sources
• Not practical for humans to encode all this into RDF manually
In Databases and Web Systems …
• Understanding tables for Data Integration (Ziegler & Dittrich 2004), (Pantel, Philpot, & Hovy 2005)
• Learning to index tables to improve search experience (Cafarella et al. 2008)
• Expanding attributes (columns) of web tables (Lin et al. 2010)
On the Semantic Web
• Database to Ontology mapping (Barrasa, Corcho, & Gómez-Pérez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004)
• W3C working group – RDB2RDF !!!
  – First working draft – June 8, 2010
On the Semantic Web
• Mapping spreadsheets to RDF
• Systems like RDF123 (Han et al. 2008) allow users to convert spreadsheets to RDF
• Such systems are practical and helpful but …
  – Require significant manual work
  – Do not generate linked data
On the Semantic Web
• Han et al. (2009) addressed the problem of recommending a set of terms to describe the objects and relationships in a table
• Did not focus on the overall interpretation of a table
• Did not attempt to understand and link cell values
An overall interpretation
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
… …
T2LD Framework
Input: Table Headers and Rows
1. Predict Class for Columns
2. Linking the table cells
3. Identify and Discover relations
Output: Linked Data Representation of a Table
An overview
Input: Table Headers and Rows
1. Query Knowledge base
2. Predict Class for Columns
3. Re-query Knowledge base using the new evidence
4. Link cell value to an entity using the new results obtained
5. Identify Relationships between columns
Output: Linked Data
Querying the Knowledge–Base
For every cell from the column:
Query: Cell Value + Column Header + Row Content
Knowledge bases: Wikitology, Yago
Result: Top N entities, their types, Google PageRank (we use N = 5)
Querying the Knowledge–Base
Column header (Type): City
Cell values (Instance): Baltimore, Boston, New York
Results for Baltimore:
1. Baltimore, Types, Page Rank
2. Baltimore County, Maryland, Types, Page Rank
3. John Baltimore, Types, Page Rank
Results for Boston:
1. Boston, Types, Page Rank
2. Boston_(band), Types, Page Rank
3. Boston_University, Types, Page Rank
Results for New York:
1. New_York_City, Types, Page Rank
2. New_York, Types, Page Rank
3. New_York_(album), Types, Page Rank
Set of Classes
Types for Baltimore: {dbpedia-owl:Place, dbpedia-owl:Area}
Types for Baltimore County: {dbpedia-owl:Place, dbpedia-owl:Area}
Types for John Baltimore: {yago:AmericanConductors, yago:LivingPeople}
Types for Boston: {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace}
Types for Boston_(band): {dbpedia-owl:Band, dbpedia-owl:Organisation}
. . .
Set of classes for a column:
{dbpedia-owl:Place, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, . . . }
Ranking the Classes
Create a pairing of all the class labels and strings in a column:
[Baltimore, dbpedia-owl:Place]
[Boston, dbpedia-owl:Place]
[New York, dbpedia-owl:Place]
[Baltimore, dbpedia-owl:PopulatedPlace]
[Boston, dbpedia-owl:PopulatedPlace]
……
[Baltimore, dbpedia-owl:Band]
……
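The pairing step can be sketched in a few lines of Python. This is only an illustration: the function names and the dictionary layout are assumptions, and the type sets are the ones shown on the slides.

```python
from itertools import product

# Types returned for some candidate entities (values copied from the slides).
entity_types = {
    "Baltimore": {"dbpedia-owl:Place", "dbpedia-owl:Area"},
    "John Baltimore": {"yago:AmericanConductors", "yago:LivingPeople"},
    "Boston": {"dbpedia-owl:Place", "dbpedia-owl:PopulatedPlace"},
    "Boston_(band)": {"dbpedia-owl:Band", "dbpedia-owl:Organisation"},
}

def column_classes(types_by_entity):
    """Union of all types seen for any candidate entity in the column."""
    classes = set()
    for types in types_by_entity.values():
        classes |= types
    return classes

def class_string_pairs(cell_strings, classes):
    """Every (cell string, class label) combination, in a stable order."""
    return list(product(cell_strings, sorted(classes)))

cells = ["Baltimore", "Boston", "New York"]
classes = column_classes(entity_types)
pairs = class_string_pairs(cells, classes)
```

Every cell string is paired with every class in the column's set, including classes that none of its own candidates carry; those pairs simply receive a score of 0 later.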
Ranking the Classes
• Assign a score to every pair based on:
  – The rank of the entity that matches the class label
  – Predicted Google PageRank
• We use the following formula:
  – Score = w x (1 / R) + (1 – w) x (Normalized Google PageRank)
  – We use w = 0.25
Ranking the Classes
E.g. Processing class – “dbpedia:Area”
String Baltimore:
(R = 1) Baltimore {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 6]
(R = 2) Baltimore County {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 4]
(R = 3) John Baltimore {yago:AmericanConductors, yago:LivingPeople} [PR = 5]
Score = w x (1 / R) + (1 – w) x (Normalized Page Rank)
[Baltimore, dbpedia:Area] = (0.25 x 1 / 1) + (0.75 x 6 / 7) = 0.892
E.g. Processing class – “dbpedia:Band”
String Baltimore:
(R = 1) Baltimore {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 6]
(R = 2) Baltimore County {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 4]
(R = 3) John Baltimore {yago:AmericanConductors, yago:LivingPeople} [PR = 5]
[Baltimore, dbpedia:Band] = 0 [Since the class does not match any of the entities for Baltimore]
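The worked example above can be reproduced with a short Python sketch. Two points are assumptions rather than statements from the slides: the score is taken from the highest-ranked entity whose types contain the class, and the PageRank is normalized by dividing by 7, which is simply the divisor that appears in the worked example.

```python
W = 0.25        # weight w from the slides
PR_SCALE = 7.0  # divisor from the worked example (6 / 7); its origin is not stated

def pair_score(candidates, class_label, w=W, pr_scale=PR_SCALE):
    """Score one [string, class] pair.

    candidates: list of (types, page_rank), ordered by rank, best first.
    Uses the highest-ranked candidate whose types contain class_label
    (an assumption); returns 0 when no candidate matches.
    """
    for rank, (types, page_rank) in enumerate(candidates, start=1):
        if class_label in types:
            return w * (1.0 / rank) + (1.0 - w) * (page_rank / pr_scale)
    return 0.0

# Ranked results for the string "Baltimore", as listed on the slide.
baltimore = [
    ({"dbpedia-owl:Place", "dbpedia-owl:Area"}, 6),
    ({"dbpedia-owl:Place", "dbpedia-owl:Area"}, 4),
    ({"yago:AmericanConductors", "yago:LivingPeople"}, 5),
]

area_score = pair_score(baltimore, "dbpedia-owl:Area")  # about 0.893
band_score = pair_score(baltimore, "dbpedia-owl:Band")  # 0.0, no match
```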
Predicting the Classes
• Select the class with the highest total score over the entire column
• E.g. Sum of dbpedia:Area
  – [Baltimore, dbpedia:Area] + [Boston, dbpedia:Area] + [New York, dbpedia:Area] = 2.85
• Sum of dbpedia:Band
  – [Baltimore, dbpedia:Band] + [Boston, dbpedia:Band] + [New York, dbpedia:Band] = 0.25
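A minimal Python sketch of this selection step follows. The per-cell scores are illustrative values, not figures from the thesis; they were chosen only so that the column totals match the 2.85 and 0.25 shown above. The function name is hypothetical.

```python
from collections import defaultdict

def predict_class(pair_scores):
    """pair_scores maps (cell_string, class_label) -> score.

    Returns (best_class, totals), where totals holds the summed score
    of each class over the whole column.
    """
    totals = defaultdict(float)
    for (_cell, class_label), score in pair_scores.items():
        totals[class_label] += score
    best = max(totals, key=totals.get)
    return best, dict(totals)

# Illustrative per-cell scores (hypothetical), summing to 2.85 and 0.25.
scores = {
    ("Baltimore", "dbpedia:Area"): 0.89,
    ("Boston", "dbpedia:Area"): 0.98,
    ("New York", "dbpedia:Area"): 0.98,
    ("Baltimore", "dbpedia:Band"): 0.0,
    ("Boston", "dbpedia:Band"): 0.25,
    ("New York", "dbpedia:Band"): 0.0,
}

best, totals = predict_class(scores)
```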
Predicting the Classes
• We predict classes from four vocabularies – DBpedia Ontology, Freebase, WordNet and Yago
[City, dbpedia:Area] = 1
[City, dbpedia:PopulatedPlace] = 0.9
[City, dbpedia:Band] = 0.2
[City, yago:LivingPeople] = 0.23
Mapping Table to Wikipedia
State      Capital City   Largest City   Governor
Maryland   Annapolis      Baltimore      Martin O'Malley
Mapped to: Types, Linked Concepts, Property Values
Summary of the Query
Extracting Types from DBpedia
Types for Annapolis
SPARQL Query
Query redirects too, to avoid disparity between KBs
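The query shown on the slide did not survive in the transcript. As a rough sketch only, a type lookup that also follows redirects could be built as below. The function name, the resource URI and the redirect property are all assumptions; current DBpedia uses dbpedia-owl:wikiPageRedirects, and the 2010 dataset may have used a different property.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed redirect property; not confirmed by the slides.
REDIRECT_PROPERTY = "http://dbpedia.org/ontology/wikiPageRedirects"

def types_query(resource_uri, redirect_property=REDIRECT_PROPERTY):
    """Build a SPARQL query for the types of a resource.

    Types are collected both from the resource itself and from the
    resource it redirects to, if any.
    """
    return (
        f"PREFIX rdf: <{RDF_NS}>\n"
        "SELECT DISTINCT ?type WHERE {\n"
        f"  {{ <{resource_uri}> rdf:type ?type . }}\n"
        "  UNION\n"
        f"  {{ <{resource_uri}> <{redirect_property}> ?target .\n"
        "    ?target rdf:type ?type . }\n"
        "}"
    )

# Illustrative URI for the "Annapolis" cell.
query = types_query("http://dbpedia.org/resource/Annapolis")
```

The string can be sent to any SPARQL endpoint; building it separately keeps the sketch free of network calls.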
Approach
1. Query: Table Cell + Column Header + Row Data + Column Type
2. Requery KB with predicted class labels as additional evidence
3. Generate a feature vector for the top N results of the query
4. Classifier ranks the entities within the set of possible results
5. Select the highest ranked entity
6. Classifier decides whether to link or not
7. Link to “NIL” or link to the top ranked instance
Re-querying KB
• Use of predicted class labels as “additional evidence”
  – WordNet:City
  – http://dbpedia.org/ontology/City
  – Yago:CitiesinUnitedStates
  – Freebase:Location
• Class labels are mapped to the typesRef field
• Restricts the types of the results returned to the predicted class labels
Summary of the re-query
Learning to Rank
• We trained an SVMrank classifier that learned to rank entities within a given set
Feature Vector
• Similarity Measures
  – Levenshtein distance
  – Dice Score
• Popularity Measures
  – Wikitology Score
  – PageRank
  – Page Length
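The two similarity measures can be sketched in plain Python. The slides do not say how the Dice score was tokenised; character bigrams are used here as an assumption, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def bigrams(s):
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient over character-bigram sets (assumed tokenisation)."""
    x, y = bigrams(a.lower()), bigrams(b.lower())
    if not x and not y:
        return 1.0
    return 2.0 * len(x & y) / (len(x) + len(y))

def similarity_features(cell_string, entity_name):
    """The similarity part of a feature vector for one candidate entity."""
    return [levenshtein(cell_string, entity_name), dice(cell_string, entity_name)]
```

The popularity measures (Wikitology score, PageRank, page length) come from the knowledge base and would be appended to this list as returned.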
“To Link or not to Link …”
• The highest ranked entity may not be the correct one to link to …
  – Because the string we are querying may not be in the KB
  – Top N results may not include the correct answer
• We trained an SVM classifier that determines whether or not to link to the top ranked entity
“To Link or not to Link …”
• The feature vector included the feature vector of the top ranked entity and two additional features:
  – The SVMrank score of the top ranked entity
  – The difference in scores between the top two ranked entities
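Assembling this extended feature vector is straightforward; a small sketch follows. The function name is hypothetical, and the handling of a result set with a single candidate (using its own score as the gap) is an assumption the slides do not cover.

```python
def link_features(top_features, ranked_scores):
    """Feature vector for the link / do-not-link classifier.

    top_features: feature vector of the top ranked entity.
    ranked_scores: SVMrank scores of all candidates, best first.
    """
    top = ranked_scores[0]
    # With a single candidate there is no runner-up; use the top score
    # itself as the gap (an assumption).
    gap = top - ranked_scores[1] if len(ranked_scores) > 1 else top
    return list(top_features) + [top, gap]

features = link_features([0, 1.0, 0.8], [2.5, 1.0, 0.3])
```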
Relation between columns
City        State
Baltimore   Maryland
Boston      Massachusetts
New York    New York
Relation between columns
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidate relations: dbonto:Capital, dbonto:LargestCity
Scoring the relations
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidates: dbonto:Capital, dbonto:LargestCity
dbonto:Capital Score: 1
dbonto:LargestCity Score: 3
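The scores in this example are consistent with counting, for each candidate relation, the number of rows in which it holds. A sketch under that reading (the function name is hypothetical):

```python
from collections import Counter

# Relations found for each row pair, as listed on the slide.
row_relations = [
    {"dbonto:LargestCity"},                    # Maryland - Baltimore
    {"dbonto:LargestCity", "dbonto:Capital"},  # Massachusetts - Boston
    {"dbonto:LargestCity"},                    # New York - New York
]

def score_relations(relations_per_row):
    """Count in how many rows each candidate relation was found."""
    counts = Counter()
    for relations in relations_per_row:
        counts.update(relations)
    return counts

scores = score_relations(row_relations)
best_relation, best_score = scores.most_common(1)[0]
```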
Relation between columns
Select * where { <http://dbpedia.org/resource/Maryland> ?relation <http://dbpedia.org/resource/Baltimore> }
Select * where { <http://dbpedia.org/resource/Maryland> ?relation "Baltimore"@en }
• Query the second column's value both as a URI and as a literal string
• Check all redirects when querying with the URI
• Check all other common names when querying with the literal string
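Both query forms can be generated from a row's two linked values. The helper names below are hypothetical, and the sketch only builds the query strings; expanding redirects and alternative names would mean calling these helpers once per alternative.

```python
def relation_query_uri(subject_uri, object_uri):
    """Ask which properties connect two resources."""
    return f"SELECT * WHERE {{ <{subject_uri}> ?relation <{object_uri}> }}"

def relation_query_literal(subject_uri, text, lang="en"):
    """Ask which properties connect a resource to a language-tagged literal."""
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT * WHERE {{ <{subject_uri}> ?relation "{escaped}"@{lang} }}'

maryland = "http://dbpedia.org/resource/Maryland"
uri_query = relation_query_uri(maryland, "http://dbpedia.org/resource/Baltimore")
literal_query = relation_query_literal(maryland, "Baltimore")
```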
An example
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .
"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
"MD"@en is rdfs:label of dbpedia:Maryland .
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:range dbpedia-owl:City .
Reading the statements:
• "City"@en is rdfs:label of dbpedia-owl:City .
  – “City” is the common / human name for the class dbpedia-owl:City
• dbpedia:Baltimore a dbpedia-owl:City .
  – dbpedia:Baltimore is an instance of the type dbpedia-owl:City
• dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
  – The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion
• dbpprop:LargestCity rdfs:range dbpedia-owl:City .
  – The objects of the triples using the property have to be instances of dbpedia-owl:City
Template
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
"ColumnHeader1" is rdfs:label of PredictedClassLabel1 .
"ColumnHeader2" is rdfs:label of PredictedClassLabel2 .
"TableCellString" is rdfs:label of CellValueURL .
CellValueURL a PredictedClassLabel .
property rdfs:domain PredictedClassLabel1 .
property rdfs:range PredictedClassLabel2 .
Where:
• ColumnHeader - is a column header from the table
• TableCellString - is a string representing a table cell
• PredictedClassLabel - is the class label associated with the column
• CellValueURL - is the DBpedia URL the table cell string is linked to
• property - is the relation discovered between the two columns
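Filling in the template is plain string substitution. The sketch below is an illustration with hypothetical names; it adds the @en language tag used in the earlier example, which the template itself omits, and it leaves out the @prefix header.

```python
def render_linked_data(columns, cells, relation):
    """Render N3-style statements following the template.

    columns: list of (header, predicted_class) pairs.
    cells: list of (cell_string, cell_uri, predicted_class) triples.
    relation: (property, domain_class, range_class).
    """
    lines = []
    for header, cls in columns:
        lines.append(f'"{header}"@en is rdfs:label of {cls} .')
    for text, uri, cls in cells:
        lines.append(f'"{text}"@en is rdfs:label of {uri} .')
        lines.append(f"{uri} a {cls} .")
    prop, domain, range_ = relation
    lines.append(f"{prop} rdfs:domain {domain} .")
    lines.append(f"{prop} rdfs:range {range_} .")
    return "\n".join(lines)

output = render_linked_data(
    columns=[("City", "dbpedia-owl:City"),
             ("State", "dbpedia-owl:AdministrativeRegion")],
    cells=[("Baltimore", "dbpedia:Baltimore", "dbpedia-owl:City")],
    relation=("dbpprop:LargestCity",
              "dbpedia-owl:AdministrativeRegion",
              "dbpedia-owl:City"),
)
```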
Dataset summary
Number of Tables            15
Total Number of rows        199
Total Number of columns     56 (52)
Total Number of entities    639 (611)
* The number in brackets is the count excluding columns that contained numbers
Evaluation # 1 (MAP)
• Compared the system’s ranked list of labels against a human ranked list of labels
• Metric - Mean Average Precision (MAP)
• Commonly used in the Information Retrieval domain to compare two ranked sets
Evaluation # 1 (MAP)
• MAP is defined using the following terms:
• R(n) - is the relevance at n. If the class label ranked “n” in the system generated set is a relevant one then R(n) is 1, else it is 0.
• P(n) - is the precision at n. It measures the relevance of the top n results.
• N - is the number of labels retrieved. For our evaluation we consider the top 3 labels retrieved.
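The formula itself is not preserved in this transcript. The sketch below uses one common form of average precision, summing P(n) x R(n) over the top N labels and dividing by the number of relevant labels retrieved; that denominator and the function names are assumptions, as is treating the evaluator's top three labels as the relevant set.

```python
def average_precision(system_ranked, relevant, n=3):
    """Average precision over the top n system labels.

    P(k) is the fraction of relevant labels among the first k;
    R(k) is 1 when the label at position k is relevant, else 0.
    Divides by the number of relevant labels retrieved (an assumption).
    """
    hits = 0
    total = 0.0
    for position, label in enumerate(system_ranked[:n], start=1):
        if label in relevant:         # R(position) = 1
            hits += 1
            total += hits / position  # P(position)
    return total / hits if hits else 0.0

def mean_average_precision(runs, n=3):
    """Mean of average precision over (system_ranked, relevant) pairs."""
    return sum(average_precision(s, r, n) for s, r in runs) / len(runs)

system = ["Person", "Politician", "President"]
evaluator = {"President", "Politician", "OfficeHolder"}
ap = average_precision(system, evaluator)  # (1/2 + 2/3) / 2
```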
Evaluation # 1 (MAP)
80.76 %
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
Evaluation # 2 (Recall)
• Checked whether the system was retrieving relevant class labels or not.
• Measure used : Recall (R)
• Top three labels ranked by the user were considered to be relevant.
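Recall for one column can be sketched as the share of the evaluator's relevant labels that the system retrieved. The function name is hypothetical; the labels are the example ranking that appears in the evaluation slides.

```python
def recall(system_labels, relevant_labels):
    """Fraction of relevant labels that appear among the system's labels."""
    relevant = set(relevant_labels)
    if not relevant:
        return 0.0
    return len(relevant & set(system_labels)) / len(relevant)

system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
r = recall(system, evaluator)  # 2 of the 3 relevant labels were retrieved
```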
Evaluation # 2 (Recall)
Recall > 0.6 (75 %)
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
Evaluation # 3 (Rank Match)
• A comparison of how many times the top three ranked system generated labels match with the top three labels ranked by the users
Evaluation # 4 (Correctness)
• Evaluated whether our predicted class labels were “fair and correct”
• A class label may not be the most accurate one, but may still be correct.
  – E.g. dbpedia:PopulatedPlace is not the most accurate, but still a correct label for a column of cities
• Three human judges evaluated our predicted class labels
Evaluation # 4 (Correctness)
• A category-wise breakdown for class label correctness
Overall Accuracy: 76.92 %
Column – Nationality, Prediction – MilitaryConflict
Column – Birth Place, Prediction – PopulatedPlace
Summary – Class label prediction
• Recall and class label correctness show that our approach produces relevant and correct labels
• MAP and Rank Match show that we enjoyed moderate success in ranking labels within a set
Category-wise accuracy for linking table cells
Overall Accuracy: 66.12 %
SVMrank classifier
Correctly predicted top ranked instance      215
Incorrectly predicted top ranked instance    7
Total number of instances                    222
Accuracy                                     96.84 %

Correctly predicted top ranked instance      543
Incorrectly predicted top ranked instance    68
Total number of instances                    611
Accuracy                                     88.87 %

• Training data – 171 queries (each with 10 results)
• The correct entity was assigned the highest rank and all others were assigned the same lower rank
• Test data – 222 queries (each with 10 results)
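SVMrank reads training examples as lines of the form `<target> qid:<id> <feature>:<value> ...`, where a higher target means a better rank within the same query. The labelling scheme described above can be written out as follows; the helper name and the concrete target values 2 and 1 are illustrative choices, not taken from the thesis.

```python
def svmrank_lines(qid, candidates, correct_index):
    """Format one query's candidates as SVMrank training lines.

    candidates: list of feature lists, one per candidate entity.
    correct_index: position of the correct entity in candidates.
    The correct entity gets target 2; every other candidate gets 1.
    """
    lines = []
    for i, features in enumerate(candidates):
        target = 2 if i == correct_index else 1
        # Feature numbers start at 1 and increase, as the format requires.
        feats = " ".join(f"{k}:{v}" for k, v in enumerate(features, start=1))
        lines.append(f"{target} qid:{qid} {feats}")
    return lines

example = svmrank_lines(1, [[0, 1.0], [3, 0.5]], correct_index=0)
```

Only the relative order of targets within a query matters to the ranker, so any two distinct values with the correct entity higher would serve.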
The binary SVM classifier
Correctly predicted          145
Incorrectly predicted        26
Total number of instances    171
Accuracy                     84.79 %

Correctly predicted          541
Incorrectly predicted        70
Total number of instances    611
Accuracy                     88.54 %

• Training data – 222 queries (146 +ve, 76 –ve examples)
• If the highest ranked instance was correct, a class label of “yes” was assigned
• Test data – 171 queries (119 +ve, 52 –ve examples)
Relation between columns
• Idea – Ask human evaluators to identify relations between columns in a given table
• Pilot Experiment – Asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy – 5 out of 20 (25 % ) were correct
Future Work
• Implement a machine learning based approach for class label predictions for columns
• Alternative approach for relation discovery and identification
• Approaches to handle unknown entities
Conclusion
• A lot of data is stored in HTML tables, spreadsheets, databases and documents
• We presented an automated framework that extracts, interprets and represents tables as linked data
• We are unlocking large amounts of tabular data that are currently inaccessible to the Semantic Web, making them meaningful and useful there
• We believe our work will contribute to materializing the web of data vision
References
• Cafarella, M. J.; Halevy, A.; Wang, D. Z.; Wu, E.; and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. Proc. VLDB Endow. 1(1), 538-549.
• Ziegler, P., and Dittrich, K. R. 2004. Three decades of data integration: all problems solved? In Building the Information Society, volume 156 of IFIP International Federation for Information Processing, 3-12. Springer Boston.
• Pantel, P.; Philpot, A.; and Hovy, E. 2005. Aligning database columns using mutual information. In Proceedings of the 2005 National Conference on Digital Government Research, dg.o '05, 205-210.
• Lin, C. X.; Zhao, B.; Weninger, T.; Han, J.; and Liu, B. 2010. Entity relation discovery from web tables and links. In Proceedings of the 19th International Conference on World Wide Web (WWW '10), 1145-1146. ACM, New York, NY, USA.
• Barrasa, J.; Corcho, O.; and Gómez-Pérez, A. 2004. R2O, an extensible and semantically based database-to-ontology mapping language. In Proceedings of the 2nd Workshop on Semantic Web and Databases (SWDB2004), volume 3372, 1069-1070.
• Hu, W., and Qu, Y. 2007. Discovering simple mappings between relational database schemas and ontologies. In Aberer, K.; Choi, K.-S.; Noy, N. F.; Allemang, D.; Lee, K.-I.; Nixon, L. J. B.; Golbeck, J.; Mika, P.; Maynard, D.; Mizoguchi, R.; Schreiber, G.; and Cudre-Mauroux, P., eds., ISWC/ASWC, volume 4825 of Lecture Notes in Computer Science, 225-238. Springer.
• Papapanagiotou, P.; Katsiouli, P.; Tsetsos, V.; Anagnostopoulos, C.; and Hadjiefthymiades, S. 2006. Ronto: Relational to ontology schema matching. In AIS SIGSEMIS Bulletin.
• Lawrence, E. D. R. 2004. Composing mappings between schemas using a reference ontology. In Proceedings of International Conference on Ontologies, Databases and Application of Semantics (ODBASE), 783-800. Springer.
• Han, L.; Finin, T.; Parr, C.; Sachs, J.; and Joshi, A. 2008. RDF123: from Spreadsheets to RDF. In Seventh International Semantic Web Conference. Springer.
• Han, L.; Finin, T.; and Yesha, Y. 2009. Finding semantic web ontology terms from words. In Proceedings of the Eighth International Semantic Web Conference. Springer.