Domain Identification for Linked Open Data

Domain Identification for Linked Open Data. Sarasi Lalithsena, Pascal Hitzler, Amit Sheth (Kno.e.sis Center, Wright State University, Dayton, OH); Prateek Jain (IBM T.J. Watson Research Center, Yorktown, NY, USA). WI 2013, Atlanta, GA, USA.

Post on 10-May-2015

DESCRIPTION

Linked Open Data (LOD) has emerged as one of the largest collections of interlinked structured datasets on the Web. Although the adoption of such datasets for applications is increasing, identifying relevant datasets for a specific task or topic is still challenging. As an initial step to make such identification easier, we provide an approach to automatically identify the topic domains of given datasets. Our method utilizes existing knowledge sources, more specifically Freebase, and we present an evaluation which validates the topic domains we can identify with our system. Furthermore, we evaluate the effectiveness of identified topic domains for the purpose of finding relevant datasets, thus showing that our approach improves reusability of LOD datasets.

TRANSCRIPT

Page 1: Domain Identification for Linked Open Data

Domain Identification for Linked Open Data

Sarasi Lalithsena

Pascal Hitzler

Amit Sheth

Kno.e.sis Center

Wright State University, Dayton, OH

Prateek Jain

IBM T.J. Watson Research Center

Yorktown, NY, USA

WI 2013 Atlanta, GA, USA

Page 2: Domain Identification for Linked Open Data

Motivation

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

262 datasets

870 alive datasets

LOD cloud

Page 3: Domain Identification for Linked Open Data

Motivation

Lingvoj

Climbdata

Need better ways to discover, describe, and organize datasets

Page 4: Domain Identification for Linked Open Data

Problem

• How do we identify the relevant datasets from this structured knowledge space?
  – How do we create a registry of topics which describe the domain of a dataset?

Page 5: Domain Identification for Linked Open Data

State of the Art – Existing Problems with Dataset Lookup

• Rely on manual tagging provided by users and on a manual review process
  – CKAN data hub, LOD Diagram

• Rely on keywords and metadata provided by users
  – CKAN data hub, LODStats

• Need to know instances to start exploring the datasets
  – Semantic Search Engines (SSE) such as Sigma, Swoogle and Watson

• Need to know seed URIs to find the relevant datasets
  – Federated Querying Systems for LOD

Page 6: Domain Identification for Linked Open Data

What do we propose?

• Introduce a systematic and sophisticated way to identify possible domains, topics, and tags (Topic Domains) to better describe these datasets

• What can these topic domains be?
  – A predefined list
  – The types in the schema of each dataset

Page 7: Domain Identification for Linked Open Data

What do we propose?

Knowledge bases + category system → Topic Domains

Page 8: Domain Identification for Linked Open Data

How do we address the previous problems?

• Use the category system of existing knowledge sources as the vocabulary to describe the domain
  – Does not need to rely on a predefined set of tags
  – Does not need to rely on metadata and keywords

• Automatic way to identify the topic domains

• Vocabulary can be used to search the datasets and organize the datasets

Page 9: Domain Identification for Linked Open Data

Our approach - Freebase

• Use Freebase as our knowledge source to identify the topic domains

• Why Freebase?
  – Wide coverage: has 39 million topics
  – Simple category hierarchy system

• The Freebase category system categorizes each topic into types, and types are grouped into domains

• Use Freebase types and domains as our topic domains

Example: music (Domain) → Artist (Type)

Page 10: Domain Identification for Linked Open Data

Our Approach - STEPS

1. Instance Identification

2. Category Hierarchy Creation

3. Category Hierarchy Merging

4. Candidate Category Hierarchy Selection

5. Frequency Count Generation

Page 11: Domain Identification for Linked Open Data

Our Approach

• STEP 1 Instance Identification
  – Extract the instances of the dataset with their types
  – Extract the human-readable values of the instances and types
    Example: Granite and its type Rock
  – Identify the closely related instance in Freebase for each instance in our dataset

Dataset instances (instance, type): Ignimbrite, Rock; Slate, Rock; Granite, Rock

Freebase topics:

http://www.freebase.com/m/01tx7r

http://www.freebase.com/m/01c_9j

http://www.freebase.com/m/03fcm

Page 12: Domain Identification for Linked Open Data

Our Approach

• Instance Identification

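We also attach the type information to the query string, so that instances sharing a label can be told apart:

Apple (type Company): query string "Apple Company"

Apple (type Fruit): query string "Apple Fruit"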
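A minimal sketch of how such query strings might be assembled from a dataset's triples. This is not the authors' implementation: the function names, the `http://example.org/` namespace, and the choice to read labels from `rdfs:label` (falling back to the URI's last segment) are assumptions made for illustration. It also assumes one type per instance.

```python
from typing import Dict, Iterable, List, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

Triple = Tuple[str, str, str]


def local_name(uri: str) -> str:
    """Fallback label: the last path or fragment segment of a URI."""
    tail = uri.rstrip("/#")
    for separator in ("#", "/"):
        tail = tail.rsplit(separator, 1)[-1]
    return tail


def build_query_strings(triples: Iterable[Triple]) -> List[Tuple[str, str]]:
    """Return (instance URI, query string) pairs, one per typed instance."""
    labels: Dict[str, str] = {}
    types: Dict[str, str] = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types[subject] = obj  # assumption: keep one type per instance
        elif predicate == RDFS_LABEL:
            labels[subject] = obj

    queries = []
    for instance, type_uri in types.items():
        instance_label = labels.get(instance, local_name(instance))
        type_label = labels.get(type_uri, local_name(type_uri))
        # Appending the type label disambiguates instances that share a name.
        queries.append((instance, f"{instance_label} {type_label}"))
    return queries


EX = "http://example.org/"  # hypothetical namespace, not from the slides
triples = [
    (EX + "apple_inc", RDF_TYPE, EX + "Company"),
    (EX + "apple_inc", RDFS_LABEL, "Apple"),
    (EX + "apple", RDF_TYPE, EX + "Fruit"),
    (EX + "apple", RDFS_LABEL, "Apple"),
]
for uri, query in build_query_strings(triples):
    print(uri, "->", query)
```

The resulting strings would then be sent to whatever lookup service stands in for Freebase; that call is deliberately left out because the slides do not name it.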

Page 13: Domain Identification for Linked Open Data

Our Approach

• STEP 2 Category Hierarchy Creation

Ignimbrite: /geology/rock_type {domain/type}

[Figure: one category hierarchy per instance (Ignimbrite, slate, granite), with Freebase domains as parent nodes and types as child nodes. Domains shown: geology, geography, music. Types shown: rock type, mountain range, mountain, release track, recording.]
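As a rough illustration of this step, the sketch below turns identifiers of the form `/domain/type`, as in the `/geology/rock_type` example, into a small domain-to-types hierarchy for one instance. The function name is hypothetical, and `/geography/mountain_range` is an identifier assumed by analogy with the figure labels, not one shown on the slide.

```python
from typing import Dict, Iterable, Set


def build_hierarchy(type_ids: Iterable[str]) -> Dict[str, Set[str]]:
    """Group '/domain/type' identifiers into {domain: {type, ...}}."""
    hierarchy: Dict[str, Set[str]] = {}
    for type_id in type_ids:
        parts = type_id.strip("/").split("/")
        if len(parts) != 2:
            # Skip anything that is not exactly a domain/type pair.
            continue
        domain, type_name = parts
        hierarchy.setdefault(domain, set()).add(type_name.replace("_", " "))
    return hierarchy


# Illustrative input for one instance; the second identifier is assumed.
ignimbrite = build_hierarchy(["/geology/rock_type", "/geography/mountain_range"])
print(ignimbrite)
```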

Page 14: Domain Identification for Linked Open Data

Our Approach

• Category Hierarchy Merging

[Figure: the per-instance category hierarchies of Ignimbrite, slate, and granite, which share nodes such as geology and rock type, are merged.]
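One way to picture the merge, sketched under the assumption that each instance's hierarchy is a mapping from domain to a set of types: shared nodes are unified, and each node remembers which instances contributed it. The data structure and the instance-to-category assignments below are illustrative, not taken from the authors' system.

```python
from typing import Dict, Set

Hierarchy = Dict[str, Set[str]]              # domain -> types
Merged = Dict[str, Dict[str, Set[str]]]      # domain -> type -> instances


def merge_hierarchies(per_instance: Dict[str, Hierarchy]) -> Merged:
    """Union all per-instance hierarchies, tracking supporting instances."""
    merged: Merged = {}
    for instance, hierarchy in per_instance.items():
        for domain, types in hierarchy.items():
            domain_node = merged.setdefault(domain, {})
            for type_name in types:
                domain_node.setdefault(type_name, set()).add(instance)
    return merged


# Illustrative assignments only.
per_instance = {
    "Ignimbrite": {"geology": {"rock type"}, "geography": {"mountain range"}},
    "Slate": {"geology": {"rock type"}, "music": {"release track"}},
}
merged = merge_hierarchies(per_instance)
print(merged["geology"]["rock type"])
```

Keeping the supporting instances on each node makes the later filtering and counting steps simple set operations.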

Page 15: Domain Identification for Linked Open Data

Our Approach

• Candidate Category Hierarchy Selection

Filter out insignificant category hierarchies using simple heuristics

[Figure: the category hierarchies of Ignimbrite, slate, and granite from the previous slides, with the same domain and type nodes.]
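The slides do not say which heuristic is used. Purely as an assumed stand-in, the sketch below keeps a domain only if a minimum fraction of the dataset's instances fall under it; the threshold, the function name, and the input shape (domain → type → supporting instances) are all hypothetical.

```python
from typing import Dict, Set

Merged = Dict[str, Dict[str, Set[str]]]  # domain -> type -> instances


def select_candidates(merged: Merged, total_instances: int,
                      min_support: float = 0.5) -> Merged:
    """Keep domains supported by at least `min_support` of all instances.

    Assumption: this support threshold is a placeholder heuristic, not the
    one used in the original work.
    """
    if total_instances <= 0:
        return {}
    selected: Merged = {}
    for domain, types in merged.items():
        supporters = set().union(*types.values())
        if len(supporters) / total_instances >= min_support:
            selected[domain] = types
    return selected


merged = {
    "geology": {"rock type": {"Ignimbrite", "Slate", "Granite"}},
    "music": {"release track": {"Slate"}},
}
print(sorted(select_candidates(merged, total_instances=3)))
```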

Page 16: Domain Identification for Linked Open Data

Our Approach

• Frequency Count Generation

Count the number of occurrences for each category (number of instances having the given category)

Term             Frequency   Parent Node
geology          3
rock type        3           geology
mountain range   1           geography
…                …           …
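A small sketch of the counting, assuming each instance comes with a domain-to-types hierarchy. Each category is counted once per instance that has it, and every type row records its parent domain. The input assignments are illustrative and chosen so that the output matches the rows shown above.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

Hierarchy = Dict[str, Set[str]]  # domain -> types
Row = Tuple[str, int, Optional[str]]  # (term, frequency, parent node)


def frequency_counts(per_instance: Dict[str, Hierarchy]) -> List[Row]:
    """Count, per category, the number of instances having that category."""
    counts: Counter = Counter()
    for hierarchy in per_instance.values():
        for domain, types in hierarchy.items():
            counts[(domain, None)] += 1          # domains have no parent
            for type_name in types:
                counts[(type_name, domain)] += 1
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0][0]))
    return [(term, count, parent) for (term, parent), count in ordered]


# Illustrative assignments only.
per_instance = {
    "Ignimbrite": {"geology": {"rock type"}, "geography": {"mountain range"}},
    "Slate": {"geology": {"rock type"}},
    "Granite": {"geology": {"rock type"}},
}
rows = frequency_counts(per_instance)
for term, count, parent in rows:
    print(term, count, parent or "")
```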

Page 17: Domain Identification for Linked Open Data

Implementation

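• Map Reduce Deployment

[Figure: <Inst, type> pairs are distributed over mappers (Map 1 … Map n), whose output goes to reducers (Reducer 1 … Reducer m), followed by a post-processing stage. Stage labels in the figure: STEP 2 and 3, STEP 4, STEP 5.]

Instances belonging to the same type go into a single reducer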
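The grouping rule above (all instances of one type reach the same reducer) can be imitated in memory. The sketch below is not a Hadoop job and does not claim to reproduce the authors' stage boundaries: `stub_lookup`, its table, and the identifier `/music/release_track` are hypothetical placeholders for the Freebase lookup.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (instance label, type label)
Lookup = Callable[[str, str], List[str]]


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, str]]:
    """Emit the type as the key so equal types meet in one reducer."""
    for instance, type_label in records:
        yield type_label, instance


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group mapped values by key, as the framework would do."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(type_label: str, instances: List[str], lookup: Lookup) -> Counter:
    """Count each category once per instance within one type group."""
    counts: Counter = Counter()
    for instance in instances:
        counts.update(set(lookup(instance, type_label)))
    return counts


# Hypothetical stand-in for the knowledge-base lookup.
STUB_TABLE = {
    ("Granite", "Rock"): ["/geology/rock_type"],
    ("Slate", "Rock"): ["/geology/rock_type", "/music/release_track"],
}


def stub_lookup(instance: str, type_label: str) -> List[str]:
    return STUB_TABLE.get((instance, type_label), [])


records = [("Granite", "Rock"), ("Slate", "Rock")]
grouped = shuffle(map_phase(records))
totals: Counter = Counter()
for type_label, instances in grouped.items():
    totals += reduce_phase(type_label, instances, stub_lookup)  # post-processing
print(totals)
```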

Page 18: Domain Identification for Linked Open Data

Evaluation

• We ran our experiments with 30 datasets in LOD for evaluation

Evaluation

Appropriateness of the identified domain

Effectiveness in finding the datasets

User Study

Page 19: Domain Identification for Linked Open Data

Appropriateness of the identified domain

• Selected the four most frequent domains and types from our results

• Mixed them with four other randomly selected domains and types

• Asked users to select the terms that best represent the higher-level domains of the dataset (20 users)

50% of the users agreed on 73% of the terms (88 out of 120)

Page 20: Domain Identification for Linked Open Data

Appropriateness of the identified domain

Terms with the highest user agreement for each dataset. A star (*) indicates that the term was also the highest ranked by our system (the case for 22 datasets).

Page 21: Domain Identification for Linked Open Data

Evaluation

Evaluation

Appropriateness of the identified domain

Effectiveness in finding the datasets

User Study

1. User Study with three other SE

Page 22: Domain Identification for Linked Open Data

Effectiveness in finding the datasets

• Developed a search application using the normalized frequency count

• User study comparing against three existing state-of-the-art systems
  – CKAN, LODStats and Sigma

• Term selection

• Top ten results are retrieved

• Asked users to rank the sets of results by preference
  – 1 (best) to 4 (worst)

• Calculated a user preference score using a weighted average
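A rough sketch of how a search over normalized frequency counts could work. The slides do not define the normalization, so dividing each count by the dataset's total count is an assumption, as are the function names and the two made-up datasets. The cut-off of ten results follows the slide.

```python
from typing import Dict, List, Tuple


def normalize(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw category counts to fractions of the dataset's total count.

    Assumption: this is one plausible normalization, not a documented one.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def search(index: Dict[str, Dict[str, float]], term: str,
           top_k: int = 10) -> List[Tuple[str, float]]:
    """Return up to `top_k` datasets ranked by the term's normalized count."""
    hits = [(name, scores[term]) for name, scores in index.items() if term in scores]
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits[:top_k]


# Hypothetical datasets and counts, for illustration only.
index = {
    "dataset_a": normalize({"geology": 3, "music": 1}),
    "dataset_b": normalize({"music": 4}),
}
print(search(index, "music"))
```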

Page 23: Domain Identification for Linked Open Data

Effectiveness in finding the datasets

Term         Our Approach   CKAN    LODStats   Sigma
music        2.037          3.74    3.11       1.333
artist       2.815          3.926   1          2.259
biology      3.481          3.333   1          2.185
animal       2.926          1.63    3.481      1.926
geology      2.852          3.666   1          2.481
drug         2.926          3.148   2          2.555
gene         2.148          3.333   3.074      1.222
university   3.185          3.148   2.37       1.222
food         3.259          2.296   3          1.259
language     3.148          3.74    1          2.11
spacecraft   4              4       1          2
conference   2.814          3.555   1          2.666
astronaut    4              4       1          2
composer     3.815          3.037   1          2.11
tv program   3.666          2.923   1          2.370
instrument   3.852          2       2          3.148
recipe       3.926          2       2          3.074
student      2              3.889   2          3.111
phenotypes   2              3.923   2          3.037
energy       1              3.74    3.26       3.03
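The slides describe the user preference score only as a weighted average of the ranks users assigned. One plausible reading, sketched here as an assumption, weights each rank value by the number of users who chose it. The vote counts in the example are invented, and the slides do not confirm this exact formula or how the resulting values should be interpreted.

```python
from typing import Dict


def preference_score(rank_votes: Dict[int, int]) -> float:
    """Weighted average of rank values, weighted by votes per rank.

    Assumption: weights are the number of users choosing each rank.
    """
    total_votes = sum(rank_votes.values())
    if total_votes == 0:
        raise ValueError("at least one vote is required")
    weighted_sum = sum(rank * votes for rank, votes in rank_votes.items())
    return weighted_sum / total_votes


# Hypothetical votes for one system and one term: rank -> number of users.
example_votes = {1: 2, 2: 1, 4: 1}
print(preference_score(example_votes))
```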

Page 24: Domain Identification for Linked Open Data

Evaluation

Evaluation

Appropriateness of the identified domain

Effectiveness in finding the datasets

User Study

1. User Study with three other SE

2. Evaluate CKAN as the baseline

Page 25: Domain Identification for Linked Open Data

Evaluate CKAN as the baseline

Term         P       R1      F1      R2      F2
music        0.286   1       0.445   0.1     0.148
artist       0.4     1       0.571   0.2     0.267
biology      0.125   1       0.222   0.333   0.182
animal       0       0       n/a     0       n/a
geology      0       0       n/a     0       n/a
drug         0.6     0.667   0.632   0.75    0.667
gene         0.333   1       0.5     0.125   0.182
university   0.5     1       0.667   0.051   0.093
food         0       0       n/a     0       n/a
language     1       1       1       0.045   0.0861
spacecraft   1       1       1       1       1
conference   1       1       1       0.125   0.222
astronaut    1       1       1       1       1
composer     0.25    1       0.4     0.5     0.333
tv program   0       0       n/a     0       n/a
instrument   0       1       0       1       0
recipe       0       1       0       1       0
student      1       0       0       0       0
phenotypes   1       0       0       0       0
energy       1       0       0       0       0
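The F values reported here are consistent with the usual harmonic mean of precision and recall; for example, P = 0.286 and R = 1 give about 0.445. A short sketch of that computation follows. The function names are illustrative, and returning `None` where the slides print n/a (both values zero) is a convention chosen for this example.

```python
from typing import Optional, Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a retrieved set against a relevant set."""
    if not retrieved or not relevant:
        raise ValueError("both sets must be non-empty")
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)


def f_measure(precision: float, recall: float) -> Optional[float]:
    """Harmonic mean of precision and recall; None when both are zero."""
    if precision + recall == 0:
        return None  # corresponds to the n/a entries
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.286, 1.0), 3))
print(f_measure(0.0, 0.0))
```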

Page 26: Domain Identification for Linked Open Data

Evaluation

Evaluation

Appropriateness of the identified domain

Effectiveness in finding the datasets

User Study

1. User Study with three other SE

2. Evaluate CKAN as the baseline

3. Evaluate both CKAN and our approach using a manually curated gold standard

Page 27: Domain Identification for Linked Open Data

Evaluation with a manually curated gold standard

             CKAN                              Our Approach
Term         Precision   Recall   F-Measure   Precision   Recall   F-Measure
music        1           0.5      0.667       0.571       1        0.727
artist       1           0.25     0.4         0.8         1        0.9
biology      1           0.2      0.333       0.625       1        0.769
animal       0           0        n/a         0.333       1        0.5
geology      0           0        n/a         1           0.5      0.667
drug         1           0.6      0.75        1           1        1
gene         1           0.333    0.5         1           1        1
university   0.5         0.667    0.572       0.6         1        0.75
food         0           0        n/a         0.25        1        0.4
language     1           1        1           1           1        1
spacecraft   1           1        1           1           1        1
conference   1           1        1           1           1        1
tv program   0           0        n/a         1           1        1
instrument   1           0        0           0.75        1        0.857
astronaut    1           1        1           1           1        1
composer     1           0.25     0.4         1           1        1
recipe       1           0        0           1           1        1
phenotypes   1           1        1           1           0        0
student      1           0.5      0.667       1           0        0
energy       1           0.333    0.5         1           0        0
Mean         0.775       0.432    0.489       0.846       0.825    0.728

Page 28: Domain Identification for Linked Open Data

Conclusion and Future Work

• Our approach is helpful for systematically categorizing the datasets

• Demonstrated the potential of using the categorization for finding relevant datasets

• Utilizes a diverse classification hierarchy such as Freebase

• This work may also be important for other applications, such as browsing and interlinking

• Plan to improve the domain coverage by using knowledge sources such as Wikipedia and Yago

• Compare the interpretations given by multiple knowledge sources to see which one gives a better interpretation

Page 29: Domain Identification for Linked Open Data

Thank You!

Questions?

http://knoesis.wright.edu/researchers/

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing

Wright State University, Dayton, Ohio, USA