towards purposeful reuse of semantic datasets through goal-driven summarization

18
K-DRIVE Towards Purposeful Reuse of Semantic Datasets via Goal-Driven Data Summarization Panos Alexopoulos, Jose Manuel Gomez Perez 6th International Conference on Advances in Semantic Processing Porto, Portugal, October 3rd, 2013

Upload: panos-alexopoulos

Post on 05-Dec-2014

303 views

Category:

Technology


0 download

DESCRIPTION

The emergence in the last years of initiatives like the Linked Open Data (LOD) has led to a significant increase of the amount of structured semantic data on the Web. Nevertheless, the wider reuse of such public semantic data is inhibited by the difficulty for users to decide whether a given dataset is actually suitable for their needs. This is because semantic datasets typically cover diverse domains, do not follow a unified way of organizing the knowledge and may differ in a number of dimensions. With that in mind, in this paper, we report our work in progress on a goal-driven dataset summarization approach that may facilitate better understanding and reuse-oriented evaluation of available semantic data.

TRANSCRIPT

Page 1: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

K-DRIVE

Towards Purposeful Reuse of Semantic Datasets via Goal-Driven Data Summarization

Panos Alexopoulos, Jose Manuel Gomez Perez

6th International Conference on Advances in Semantic Processing

Porto, Portugal, October 3rd, 2013

Page 2: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

2

The Linked Data Use ChallengeIntroduction

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 3: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

3

Motivating ScenarioIntroduction

●Assume that some entity (individual or organization) wants to reuse public semantic datasets from the Web to:

●Enrich with them its own data.●Use the data to provide added-value services to its users/clients.

●These organizations can be:

●Technology providers (e.g. iSOCO)● Information providers (e.g. publishers, media, etc.)●Knowledge-driven and knowledge-intensive organisations

Page 4: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

4

Why Data Reuse?Data Enrichment

●The problem with semantic data is the high amount of time and effort required to construct and maintain it.

●The reuse of existing public semantic data can (partially) alleviate this problem:

●Their volume and diversity are increasing at high rates.

●Their maintenance and evolution is the responsibility of their publishers, reducing the required efforts and costs for this task in the organization's side

Page 5: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

5

ExampleData Enrichment

●A news organization wants to create and maintain a knowledge base about European Football.

●The pace at which this knowledge changes is quite fast meaning that the organization needs to constantly monitor these changes and update the data.

●Much of this information is already available as public semantic data (e.g. DBPedia).

●Thus it could be better for the organization to reuse this public data instead of creating them from scratch.

Page 6: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

6

Barriers to Data ReuseData Enrichment

●Difficulty for knowledge engineers to decide whether a given dataset is actually suitable for their needs.

● Semantic datasets typically cover diverse domains

● They do not follow a unified way of organizing the knowledge

● Differ in a number of features including size, coverage, granularity and descriptiveness

●This makes difficult the following tasks:

● Assessing whether a dataset satisfies particular requirements

● Comparing different datasets to select which one is more suitable for a given purpose.

Page 7: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

7

Our ApproachData Reuse

●We suggest the provision of the ability to data consumers to derive semantic data summaries.

●Existing summarization approaches treat the summarization task in an application and user independent way.

●By contrast, we are interested in facilitating the generation of requirements-oriented and goal-driven summaries that may be significantly more helpful to users.

Page 8: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

8

Problem DescriptionGoal-Driven Semantic Data Summarization

●Key question: “Given an application scenario where semantic data is required, how suitable is a given existing dataset for the purposes of this scenario?”

●To answer this, users normally need to be able to:

1. Explicitly express the requirements that a dataset needs to satisfy for a given task or goal.

2. Automatically measure/assess the extent to which a dataset satisfies each of these requirements and compile a summary report.

Page 9: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

9

ApproachGoal-Driven Semantic Data Summarization

●To implement these two capabilities we follow a checklist-based approach.

●Checklists are practically lists of action items arranged in a systematic manner that allow users to record the completion of each of them.

●They are widely applied across multiple industries, like healthcare or aviation, to ensure reliable and consistent execution of complex operations.

●In our case we apply checklists to define and execute custom dataset summarization tasks in the form of lists of goal-specific requirements and associated summarization processes.

Page 10: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

10

Summarization Task RepresentationGoal-Driven Semantic Data Summarization

●To represent custom summarization tasks according to the checklist paradigm we have adopted the Minim model.

●This defines the following information:

●The Goals the dataset summarization task is designed to serve

●The Requirements against which the summarization task evaluates the dataset.

●The Data Analysis Operations that the summarization task employs in order to assess the satisfaction of its requirements

Page 11: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

11

Example GoalsGoal-Driven Semantic Data Summarization

●Decide if a dataset is appropriate for a Semantic Annotation scenario.

●Decide if a dataset is appropriate for a Question Answering scenario

●Determine which of two or more similar datasets best represent a given corpus.

●Detect arising inconsistencies or other quality problems.

●…

Page 12: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

12

Example RequirementsGoal-Driven Semantic Data Summarization

●Evaluate the dataset’s coverage of a particular domain/topic: Aims to measure the extent to which a dataset describes a given domain or topic.

●Evaluate the dataset’s labeling adequacy and richness: Aims to measure the extent to which the dataset’s elements (concepts, instances, relations etc.) are accompanied by representative and comprehensible labels, in one or more languages.

●Evaluate Connectivity: This requirement checks the existence of paths between concepts or entities, i.e. whether it is possible to go from a given concept to another on the graph and in what ways.

Page 13: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

13

Example Data OperationsGoal-Driven Semantic Data Summarization

●Check the existence of a particular element (concept, relation, attribute, instance, axiom) in the dataset.

●Check the dataset’s consistency (e.g. by running a reasoner).

●Measure the number of ambiguous entities in the dataset.

●Measure the number of labeled entities.

Page 14: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

14

Application ExampleGoal-Driven Semantic Data Summarization

●We applied our framework to assess the suitability of public datasets for the purposes of reusing to semantically annotate texts describing football matches from the Spanish League.

●For that, we wanted the dataset to be reused to

●Contain information about all the current teams of the Spanish football league.

●All its entities to have at least one associated label and

●To relate teams with the players that current play in them.

Page 15: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

15

Defined Summarization TaskGoal-Driven Semantic Data Summarization

Page 16: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

16

Resulting Summary

Goal-Driven Semantic Data Summarization

●We executed this task against DBPedia and Freebase, automatically producing the following summary report

●The system provides a yes/no answer as to whether each dataset satisfies each requirement but also additional information on why this may or may not be the case.

●This is important because:

● A requirement might not be satisfied because of a high threshold● A requirement might seem to be satisfied, yet that might not be actually true.

Page 17: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

17

Summary Generation ToolOngoing Work

●We are currently developing a summarization tool that enables the definition manipulation and execution of summarization tasks as well as the dashboard-like visualization of their output

Page 18: Towards Purposeful Reuse of Semantic Datasets Through Goal-Driven Summarization

18

Questions?Thank you!

iSOCO MadridAv. del Partenón, 16-18, 1º7ª

Campo de las Naciones

28042 Madrid

España

(t) +34 913 349 797

iSOCO PamplonaParque Tomás

Caballero, 2, 6º4ª

31006 Pamplona

España

(t) +34 948 102 408

iSOCO ValenciaC/ Prof. Beltrán Báguena, 4

Oficina 107

46009 Valencia

España

(t) +34 963 467 143

iSOCO BarcelonaAv. Torre Blanca, 57

Edificio ESADE CREAPOLIS

Oficina 3C 15

08172 Sant Cugat del Vallès

Barcelona, España

(t) +34 935 677 200

iSOCO ColombiaComplejo Ruta N

Calle 67, 52-20

Piso 3, Torre A

Medellín

Colombia

(t) +57 516 7770 ext. 1132

Key Vendor Virtual Assistant 2013

Quieres innovar?

Dr. Panos Alexopoulos

Semantic Applications Research Manager

[email protected]

(t) +34 913 349 797