Domain Identification for Linked Open Data
DESCRIPTION
Linked Open Data (LOD) has emerged as one of the largest collections of interlinked structured datasets on the Web. Although the adoption of such datasets for applications is increasing, identifying relevant datasets for a specific task or topic is still challenging. As an initial step to make such identification easier, we provide an approach to automatically identify the topic domains of given datasets. Our method utilizes existing knowledge sources, more specifically Freebase, and we present an evaluation which validates the topic domains we can identify with our system. Furthermore, we evaluate the effectiveness of identified topic domains for the purpose of finding relevant datasets, thus showing that our approach improves reusability of LOD datasets.
TRANSCRIPT
Domain Identification for Linked Open Data
Sarasi Lalithsena
Pascal Hitzler
Amit Sheth
Kno.e.sis Center
Wright State University, Dayton, OH
Prateek Jain
IBM T.J. Watson Research Center
Yorktown, NY, USA
WI 2013 Atlanta, GA, USA
2
Motivation
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
262 datasets
870 alive datasets
lod cloud
3
Motivation
Lingvoj
Climbdata
Need better ways to discover, describe, and organize datasets
4
Problem
• How do we identify the relevant datasets from this structured knowledge space?
  – How do we create a registry of topics which describe the domain of a dataset?
5
State of the Art – Existing Problems with Dataset Lookup
• Rely on manual tagging provided by users and on a manual review process
  – CKAN data hub, LOD Diagram
• Rely on keywords and metadata provided by users
  – CKAN data hub, LODStats
• Need to know instances to start exploring the datasets
  – Semantic Search Engines (SSE) such as Sigma, Swoogle and Watson
• Need to know seed URIs to find the relevant datasets
  – Federated Querying Systems for LOD
6
What do we propose?
• Introduce a systematic and sophisticated way to identify possible domains, topics, and tags (Topic Domains) that better describe these datasets
• What can these topic domains be?
  – A predefined list
  – The schema types of each dataset
7
What do we propose?
Knowledge bases + category system → Topic Domains
8
How do we address the previous problems?
• Use the category system of existing knowledge sources as the vocabulary to describe the domain
  – No need to rely on a predefined set of tags
  – No need to rely on metadata and keywords
• Identify the topic domains automatically
• The vocabulary can be used to search and organize the datasets
9
Our approach - Freebase
• Use Freebase as our knowledge source to identify the topic domains
• Why Freebase?
  – Wide coverage: has 39 million topics
  – Simple category hierarchy system
• The Freebase category system categorizes each topic into types, and types are grouped into domains
• We use Freebase types and domains as our topic domains
Example: music (Domain), Artist (Type)
10
Our Approach - STEPS
1. Instance Identification
2. Category Hierarchy Creation
3. Category Hierarchy Merging
4. Candidate Category Hierarchy Selection
5. Frequency Count Generation
11
Our Approach
STEP 1 Instance Identification
  – Extract the instances of the dataset with their types
  – Extract the human-readable values of the instances and types
    Granite and its type Rock
  – Identify the closely related Freebase instance for each instance in our dataset
    Ignimbrite, Rock
    Slate, Rock
    Granite, Rock
    http://www.freebase.com/m/01tx7r
    http://www.freebase.com/m/01c_9j
    http://www.freebase.com/m/03fcm
12
Our Approach
• Instance Identification
We also attach the type information to the query string
Apple → Apple Company
Apple Fruit → Apple Fruit
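A minimal sketch of how a query string with type information could be assembled. The function name `build_query` and the simple space-joined format are assumptions for illustration; the slides do not show the exact query format or the Freebase call used.

```python
def build_query(label, type_label):
    """Join an instance label with its type label to form a lookup query.

    Hypothetical helper: the slides only state that the type is attached
    to the query string, not how the string is built.
    """
    parts = [label.strip(), type_label.strip()]
    # Skip empty parts so an instance without a type still yields a clean query
    return " ".join(part for part in parts if part)


queries = [build_query("Apple", "Fruit"), build_query("Granite", "Rock")]
print(queries)
```

Attaching the type is what separates the fruit sense of "Apple" from other senses that a bare label would match.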
13
Our Approach
• STEP 2 Category Hierarchy Creation
Ignimbrite /geology/rock_type {domain/type}
[Figure: category hierarchies created for Ignimbrite, slate, and granite. Node labels: rock type, geology, geography, mountain range, mountain, release track, music, recording]
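A minimal sketch of turning `/domain/type` identifiers into a small per-instance hierarchy. Only `/geology/rock_type` appears on the slide; the second identifier and both function names are assumptions made for illustration.

```python
from collections import defaultdict


def split_type_id(type_id):
    """Split a '/domain/type' identifier into (domain, readable type name)."""
    domain, type_name = type_id.strip("/").split("/", 1)
    return domain, type_name.replace("_", " ")


def build_hierarchy(type_ids):
    """Map each domain to the set of types seen under it for one instance."""
    hierarchy = defaultdict(set)
    for type_id in type_ids:
        domain, type_name = split_type_id(type_id)
        hierarchy[domain].add(type_name)
    return dict(hierarchy)


# "/geography/mountain_range" is an assumed identifier, not taken from the slide
ignimbrite = build_hierarchy(["/geology/rock_type", "/geography/mountain_range"])
print(ignimbrite)
```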
14
Our Approach
• Category Hierarchy Merging
[Figure: merged category hierarchies for Ignimbrite, slate, and granite. Node labels: rock type, geology, geography, mountain range, mountain, release track, music, recording]
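A minimal sketch of one way to merge hierarchies, assuming each hierarchy is a mapping from a domain to a set of types. Both that representation and the function name `merge_hierarchies` are assumptions; the slide shows the merge only as a figure.

```python
def merge_hierarchies(hierarchies):
    """Union several domain -> types mappings into a single mapping."""
    merged = {}
    for hierarchy in hierarchies:
        for domain, types in hierarchy.items():
            # Types under the same domain are combined; new domains are added
            merged.setdefault(domain, set()).update(types)
    return merged


merged = merge_hierarchies([
    {"geology": {"rock type"}, "geography": {"mountain range"}},
    {"geology": {"rock type"}, "music": {"release track"}},
])
print(merged)
```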
15
Our Approach
• Candidate Category Hierarchy Selection
Filter out insignificant category hierarchies using a simple heuristic
[Figure: candidate category hierarchies for Ignimbrite, slate, and granite after filtering. Node labels: rock type, geology, geography, mountain range, mountain, release track, music, recording]
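The slide does not say which heuristic is used, so the following is only a hypothetical stand-in: it keeps a domain when at least a given share of the instances support it. The threshold `min_share`, the function name, and the data layout are all assumptions.

```python
from collections import Counter


def select_candidates(instance_hierarchies, min_share=0.5):
    """Drop domains supported by fewer than min_share of the instances.

    Hypothetical heuristic for illustration only; not the authors' rule.
    """
    total = len(instance_hierarchies)
    # Count in how many instance hierarchies each domain occurs
    support = Counter(domain for h in instance_hierarchies for domain in h)
    keep = {d for d, n in support.items() if total and n / total >= min_share}
    return [{d: t for d, t in h.items() if d in keep} for h in instance_hierarchies]


filtered = select_candidates([
    {"geology": {"rock type"}, "geography": {"mountain range"}},
    {"geology": {"rock type"}, "music": {"release track"}},
    {"geology": {"rock type"}},
])
print(filtered)
```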
16
Our Approach
• Frequency Count Generation
Count the number of occurrences of each category (the number of instances having the given category)
Term            Frequency  Parent Node
geology         3
rock type       3          geology
mountain range  1          geography
…               …          …
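A minimal sketch of the frequency count, assuming each instance contributes one domain-to-types mapping. The `(term, parent)` key layout and the function name are assumptions; the toy input is chosen to reproduce the three rows shown in the table.

```python
from collections import Counter


def frequency_counts(instance_hierarchies):
    """Count, per (term, parent) pair, how many instances carry that category."""
    counts = Counter()
    for hierarchy in instance_hierarchies:
        for domain, types in hierarchy.items():
            counts[(domain, None)] += 1  # domains have no parent node
            for type_name in types:
                counts[(type_name, domain)] += 1
    return counts


counts = frequency_counts([
    {"geology": {"rock type"}, "geography": {"mountain range"}},
    {"geology": {"rock type"}},
    {"geology": {"rock type"}},
])
for (term, parent), n in counts.most_common():
    print(term, n, parent or "")
```

Keying on the pair rather than the term alone keeps types with the same name in different domains apart.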
17
Implementation
• Map Reduce Deployment
[Figure: <Inst, type> records are distributed to mappers (map 1 … map n), whose output goes to reducers (Reducer 1 … Reducer m), followed by post processing. Stage labels in the figure: STEP 2 and 3, STEP 4, STEP 5]
Instances that belong to the same type go to a single reducer
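A minimal in-memory sketch of the grouping idea, not an actual Hadoop job: keying the map output by type means all instances of one type reach the same reducer. The function names and the placeholder reducer body are assumptions.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (type, instance) pairs so the type acts as the shuffle key."""
    for instance, type_label in records:
        yield type_label, instance


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(type_label, instances):
    """Placeholder reducer: the per-type pipeline steps would run here."""
    return type_label, sorted(instances)


records = [("Granite", "Rock"), ("Slate", "Rock"), ("Apple", "Fruit")]
results = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(records)).items())
print(results)
```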
18
Evaluation
• We ran our experiments with 30 datasets in LOD for evaluation
Appropriateness of the identified domain
Effectiveness in finding the datasets
User Study
19
Appropriateness of the identified domain
• Selected the four most frequent domains and types from our results
• Mixed them with four other randomly selected domains and types
• Asked users to select the terms that best represent the higher-level domains of the dataset (20 users)
50% of the users agreed on 73% of the terms (88 out of 120)
20
Appropriateness of the identified domain
Terms with the highest user agreement for each dataset. A star (*) indicates that the term was also the highest ranked by our system (for 22 datasets).
21
Evaluation
Appropriateness of the identified domain
Effectiveness in finding the datasets
User Study
1. User Study with three other SE
22
Effectiveness in finding the datasets
• Developed a search application using the normalized frequency count
• User study comparing against three existing state-of-the-art systems
  – CKAN, LODStats and Sigma
• Term selection
• Top ten results are retrieved
• Asked users to rank which set of results they preferred
  – 1 (best) to 4 (worst)
• Calculated a user preference score using a weighted average
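A minimal sketch of a term search over normalized frequency counts. The slides do not define the normalization, so dividing by the number of instances in the dataset is an assumption, as are the function names, the dataset names, and the toy counts.

```python
def normalize(counts, total_instances):
    """Scale raw category counts by the dataset's instance count (assumed)."""
    if not total_instances:
        return {}
    return {term: n / total_instances for term, n in counts.items()}


def search(index, term, top_k=10):
    """Return up to top_k (dataset, score) pairs for a term, best first."""
    scored = [(name, scores[term]) for name, scores in index.items() if term in scores]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index with made-up dataset names and counts
index = {
    "dataset_a": normalize({"geology": 3, "music": 1}, 4),
    "dataset_b": normalize({"music": 9}, 10),
}
print(search(index, "music"))
```

Normalizing keeps a large dataset with a few incidental matches from outranking a small dataset that is mostly about the term.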
23
Effectiveness in finding the datasets
Term        Our Approach  CKAN   LODStats  Sigma
music       2.037         3.74   3.11      1.333
artist      2.815         3.926  1         2.259
biology     3.481         3.333  1         2.185
animal      2.926         1.63   3.481     1.926
geology     2.852         3.666  1         2.481
drug        2.926         3.148  2         2.555
gene        2.148         3.333  3.074     1.222
university  3.185         3.148  2.37      1.222
food        3.259         2.296  3         1.259
language    3.148         3.74   1         2.11
spacecraft  4             4      1         2
conference  2.814         3.555  1         2.666
astronaut   4             4      1         2
composer    3.815         3.037  1         2.11
tv program  3.666         2.923  1         2.370
instrument  3.852         2      2         3.148
recipe      3.926         2      2         3.074
student     2             3.889  2         3.111
phenotypes  2             3.923  2         3.037
energy      1             3.74   3.26      3.03
24
Evaluation
Appropriateness of the identified domain
Effectiveness in finding the datasets
User Study
1. User Study with three other SE
2. Evaluate CKAN as the baseline
25
Evaluate CKAN as the baseline
Term        P      R1     F1     R2     F2
music       0.286  1      0.445  0.1    0.148
artist      0.4    1      0.571  0.2    0.267
biology     0.125  1      0.222  0.333  0.182
animal      0      0      n/a    0      n/a
geology     0      0      n/a    0      n/a
drug        0.6    0.667  0.632  0.75   0.667
gene        0.333  1      0.5    0.125  0.182
university  0.5    1      0.667  0.051  0.093
food        0      0      n/a    0      n/a
language    1      1      1      0.045  0.0861
spacecraft  1      1      1      1      1
conference  1      1      1      0.125  0.222
astronaut   1      1      1      1      1
composer    0.25   1      0.4    0.5    0.333
tv program  0      0      n/a    0      n/a
instrument  0      1      0      1      0
recipe      0      1      0      1      0
student     1      0      0      0      0
phenotypes  1      0      0      0      0
energy      1      0      0      0      0
26
Evaluation
Appropriateness of the identified domain
Effectiveness in finding the datasets
User Study
1. User Study with three other SE
2. Evaluate CKAN as the baseline
3. Evaluate both CKAN and our approach using a manually curated gold standard
27
Evaluation with a manually curated gold standard
            CKAN                          Our Approach
Term        Precision  Recall  F-Measure  Precision  Recall  F-Measure
music       1          0.5     0.667      0.571      1       0.727
artist      1          0.25    0.4        0.8        1       0.9
biology     1          0.2     0.333      0.625      1       0.769
animal      0          0       n/a        0.333      1       0.5
geology     0          0       n/a        1          0.5     0.667
drug        1          0.6     0.75       1          1       1
gene        1          0.333   0.5        1          1       1
university  0.5        0.667   0.572      0.6        1       0.75
food        0          0       n/a        0.25       1       0.4
language    1          1       1          1          1       1
spacecraft  1          1       1          1          1       1
conference  1          1       1          1          1       1
tv program  0          0       n/a        1          1       1
instrument  1          0       0          0.75       1       0.857
astronaut   1          1       1          1          1       1
composer    1          0.25    0.4        1          1       1
recipe      1          0       0          1          1       1
phenotypes  1          1       1          1          0       0
student     1          0.5     0.667      1          0       0
energy      1          0.333   0.5        1          0       0
Mean        0.775      0.432   0.489      0.846      0.825   0.728
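A minimal sketch of the standard precision, recall, and F-measure definitions against a gold standard. The function names and the convention of returning `None` where the table prints n/a are assumptions; the slides do not state how edge cases such as empty result sets were scored.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall; None when undefined."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall; None when both are zero."""
    if precision + recall == 0:
        return None
    return 2 * precision * recall / (precision + recall)


# Made-up dataset identifiers for illustration
print(precision_recall({"a", "b"}, {"a", "c", "d"}))
# Precision and recall values taken from the CKAN "university" row above
print(round(f_measure(0.5, 0.667), 3))
```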
28
Conclusion and Future Work
• Our approach is helpful for systematically categorizing the datasets
• Demonstrated the potential of using the categorization for finding relevant datasets
• Utilized a diverse classification hierarchy such as Freebase
• There are other potential applications where this work might be important, such as browsing and interlinking
• Plan to improve the domain coverage by using knowledge sources such as Wikipedia and Yago
• Plan to compare the interpretations given by multiple knowledge sources to see which one gives a better interpretation
Thank You!
Questions?
http://knoesis.wright.edu/researchers/
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA