characterizing knowledge on the semantic web with watson mathieu d’aquin, claudio baldassarre,...

21
Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta The Knowledge Media Institute, The Open University [email protected]

Upload: morris-morgan

Post on 18-Jan-2018

221 views

Category:

Documents


0 download

DESCRIPTION

The Semantic Web is growing…

TRANSCRIPT

Page 1: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Characterizing Knowledge on the Semantic Web with Watson

Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico MottaThe Knowledge Media Institute, The Open University

[email protected]

Page 2: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

The Semantic Web is Growing

05

1015202530354045

2003 2004

#SW Pages

Lee, J., Goodwin, R. (2004) The Semantic Webscape: a View of the Semantic Web. IBM Research Report.

Page 3: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

The Semantic Web is growing…

http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

Page 4: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Next Generation Semantic Web applications

Exploiting the Semantic Web rather than engineering their own knowledge/ontologies

Need for a Gateway to the Semantic Web

Page 5: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Watson: a Gateway to the Semantic Web

Page 6: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

More on Watson?See also…

– Watson Web Interface: http://watson.kmi.open.ac.uk

– Watson poster and demo at ISWC 2007…

Page 7: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Characterizing Knowledge in Watson?

Beside being a gateway for applications, Watson gives the opportunity to better understand:

How semantic technologies are used to published knowledge online

How knowledge is structured on the Semantic Web

How ontologies and semantic documents are interconnected in a semantic network

through an analysis of its repository.

Such an analysis provides valuable information for application and tool developers concerning the knowledge they have to manipulate.

Page 8: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

The Watson Collection Collecting Semantic Content:

A number of specialized crawlers for Google, ontology repositories (e.g. Swoogle), PingTheSemanticWeb, etc.

Validated by parsing with Jena, to get only RDF documents

Filters: Before filtering, the repository was composed almost entirely of RSS and FOAF (more than 5 times the number of other documents)

Therefore, the analysis would have been more an analysis of RSS and FOAF than anything else.

These have been filtered out.

An analysis of the FOAF part of the repository separately would be interesting.

Page 9: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

The Watson Collection

Result:

almost 25,500 semantic documents

Page 10: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

The Watson Collection In order to index these documents, Watson extracts information about them.

Information about the content: classes, properties and individuals, the relations between them, the coverage in terms of domain topics, etc.

Information about the representation: the language used and its expressivity, the size and structure of the document, etc.

Information about the network aspects of semantic documents: identification, links between documents, etc.

It is these elements of information that we intend to analyse.

Note that all these elements of information are freely available through the Watson API.

Page 11: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

In the Following

Measures on the following aspects:1. Usage of semantic technologies to publish

knowledge on the Web2. Structure and coverage of semantic documents3. The knowledge network

Focusing more on the most “debatable” elements.

Page 12: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Semantic Web languages…

• The majority is factual data in RDF• OWL adopted as ontology language• Less overlap between OWL and

RDF-S than between DAML+OIL and RDFS: – better separation of the meta-

models in OWL

• Here a document is considered in a given language if it instantiates an entity of the language

- e.g. it is in OWL and RDF-S if it contains an owl:property and an rdfs:class for example

Page 13: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

… and their expressivity• Apparent contradiction:

– Most of the documents are in OWL FULL

– But 95% use only a very restricted part of the expressive power of OWL (below OWL Lite)

OWL Full because of simple syntactic mistakes

Page 14: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Size of the documents

Number ofclasses

Documents Documents

Number ofinstances

• Like for expressivity, a power law distribution: lots of very small document and a few very large ones (both for ontological knowledge and factual data, but on different scales)

Page 15: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Density of the representation

• In average, classes are:1. Poorly defined (small number of properties and

super-classes per class)2. Highly instantiated (high number of instances per

class)

Even the best represented class in each ontology only have 1 property in avg.

Page 16: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Topic Domain Coverage

• Level of coverage of ontologies for the top categories in DMOZ (details in the paper)

• Very heterogeneous distribution• Not well correlated with the one of the Web

Page 17: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Identification of semantic document

• Participates to the networked and distributed aspects of the Semantic Web

• URI are unique identifiers, but when applied to ontologies, they may be duplicated:– Default URI of the ontology editor (Protégé)– Misuse of the URI of existing vocabularies (OWL)– Different versions of an ontology having the same URI

• Also, it is a good practice for URIs to be dereferenceable, but only 30% of the semantic documents can be reached through their URI.

Page 18: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Connectedness and Redundancy

• Connectedness and redundancy are both important aspects of distributed systems.

• Connectedness:– A few large providers (W3.org, Stanford) and a few

locally dense networks (Ontoworld)– Otherwise, very local ontologies

• Redundancy:– Almost 30% of the semantic documents are duplicates– 12% of the entities are described more than once

A better support of the network aspects of ontologies is required.

Page 19: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Conclusion• Our analysis allows to draw some conclusions

about some of the characteristics of the knowledge published online.

• In particular, it shows that – Semantic Web documents tend to be small, lightweight

and weakly structured – Efforts are still required to publish knowledge in a

variety of domains– The network aspects are not taken enough into

consideration in semantic technologies• These constitute valuable information for tools and

applications developers.

Page 20: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Limitations• This work can be seen as a first step towards a

fine grained characterization of the Semantic Web.

• But in its current state, it suffers from a number of limitations:– Only a sample of the Semantic Web– A snapshot of the current dataset. Should consider

evolution– Simple analysis methods. Would data mining

approaches be relevant?– The analyzed aspects are insufficient to fully capture

the quality of the knowledge available online

Page 21: Characterizing Knowledge on the Semantic Web with Watson Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta

Comment, suggest, question…

http://[email protected]

A last word…

• We believe that the field of evaluation of ontologies and ontology based tools could provide valuable inputs to this study, so please:

• Watson is an open system, our data is available through the Watson API.