TRANSCRIPT
Knowledge Discovery for the Semantic Web
An Application to Web Usage Mining &
How to use semantics in the Preprocessing stage

KDD pipeline: Input Data → Data Preprocessing and Transformation → Data Mining → Interpretation and Evaluation → Information / Taking Action
• Preprocessing and transformation: data fusion (multiple sources), data cleaning (noise, missing values), feature selection, dimensionality reduction, data normalization
• Interpretation and evaluation: filtering patterns, visualization, statistical analysis (hypothesis testing, attribute evaluation, comparing learned models, computing confidence intervals)
Claudia D’Amato - University of Bari, IT.
Laura Hollink - Centrum Wiskunde & Informatica, Amsterdam, NL.
An application to Web Usage Mining
Web Usage Mining = discovering patterns in logs of user interaction with Web resources
• logs typically contain an identifier for users (e.g. IP address), their queries and clicks
• What about usage of Linked Open Data?
• Can we use semantics to improve mining of Web Usage?
Mining Usage of Linked Open Data in USEWOD
USEWOD: http://usewod.org/ [B. Berendt, L. Hollink, M. Luczak-Roesch, et al.]
1. USEWOD workshop series @ ESWC / WWW since 2011
2. USEWOD dataset: server logs of DBpedia, BioPortal, LinkedGeoData, etc., and client side logs from YASGUI.
Mining Usage of Linked Open Data in USEWOD
• Results of USEWOD: LOD usage mining for more efficient indexing [1], caching [2], auto-completion [3], etc.
[1] Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. An empirical study of real-world SPARQL queries. USEWOD @ WWW 2011.
[2] Lorey, J., & Naumann, F. Caching and prefetching strategies for SPARQL queries. USEWOD @ ESWC 2013.
[3] Kramer, K., Dividino, R. Q., & Gröner, G. SPACE: SPARQL Index for Efficient Autocompletion. ISWC (Posters & Demos) 2013.
[4] Rietveld, L., & Hoekstra, R. Man vs. Machine: Differences in SPARQL Queries. USEWOD @ ESWC 2014.
[5] Huelss, J., & Paulheim, H. What SPARQL Query Logs Tell and do not Tell about Semantic Relatedness in LOD. NoISE @ ESWC 2015.
• Issues:
  • what is the difference between queries by machines and humans? [4]
  • what is the meaning of repeated queries by bots/tools?
  • a lot of the usage is invisible due to data dump downloads [5]
Usage mining example 1: clustering rdf:properties in DBpedia
Instead of listing all DBpedia properties alphabetically, can we display them in a more meaningful way? Can we use query logs for this? [5]
[5] Huelss, J., & Paulheim, H. What SPARQL Query Logs Tell and do not Tell about Semantic Relatedness in LOD. NoISE @ ESWC 2015. Disclaimer: simplified discussion of this paper!
Approach: Hierarchical Clustering of properties, where the distance between a pair of properties is based on how often they co-occur in a SPARQL query in the USEWOD2015 logs.
Evaluation: run an experiment to measure how quickly and accurately people identify facts when looking at the standard view or the clustered view.
Result: no significant differences ☹
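The approach above can be sketched in a few lines. This is a toy example: the queries and property names below are made up for illustration, not taken from the USEWOD 2015 logs, and the clustering is a naive single-linkage variant rather than the exact setup of the paper.

```python
from itertools import combinations

# Toy query log: the set of rdf:properties used together in each SPARQL query.
queries = [
    {"dbo:birthDate", "dbo:birthPlace"},
    {"dbo:birthDate", "dbo:birthPlace", "dbo:deathDate"},
    {"dbo:birthPlace", "dbo:deathDate"},
    {"dbo:starring", "dbo:director"},
    {"dbo:starring", "dbo:director", "dbo:runtime"},
]

props = sorted(set().union(*queries))

def distance(p, q):
    """1 - Jaccard co-occurrence: properties that often appear in the same
    query are close; properties that never co-occur have distance 1."""
    together = sum(1 for s in queries if p in s and q in s)
    either = sum(1 for s in queries if p in s or q in s)
    return 1.0 - together / either

# Naive single-linkage agglomerative (hierarchical) clustering, stopped at 2 clusters.
clusters = [{p} for p in props]
while len(clusters) > 2:
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: min(distance(p, q)
                           for p in clusters[ij[0]] for q in clusters[ij[1]]),
    )
    clusters[i] |= clusters.pop(j)

print(clusters)  # person-related and film-related properties end up in separate clusters
```

On this toy log, the person properties (birthDate, birthPlace, deathDate) and the film properties (starring, director, runtime) separate cleanly, which is the kind of grouping a property display could use instead of an alphabetical list.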
Usage mining example 2: mining semantically enriched query logs
Data: queries and clicks on the Yahoo! search engine.
Problem when mining ‘raw’ logs: low support of even the most frequent patterns.
Laura Hollink, Peter Mika and Roi Blanco. Web Usage Mining with Semantic Analysis. WWW 2013.
Usage mining example 2: mining semantically enriched query logs
Approach:
1. link queries to entities in the LOD cloud
2. choose class of entity + selected properties
3. detect modifier words (download, trailer, cast, date, etc.)

1. Link queries to entities in the LOD cloud:
• Freebase (has a lot of movie-related info)
• DBpedia (Wikipedia is widely used)
Usage mining example 2: mining semantically enriched query logs
• Discover frequent patterns on the class level: sequential pattern mining using the efficient PrefixSpan algorithm to mine all possible subsequence patterns.
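The idea can be sketched with a minimal PrefixSpan over toy class-level sessions. The sessions below are invented (each item is the class of the entity behind a query), and this is a bare-bones implementation, not the optimized miner used in the paper.

```python
def prefixspan(sequences, min_support, prefix=()):
    """Minimal PrefixSpan: count the items that can extend the current prefix,
    then recurse into the projected database of each frequent item.
    Returns a dict {pattern: support}."""
    patterns = {}
    counts = {}
    for seq in sequences:
        for item in set(seq):            # count each sequence at most once
            counts[item] = counts.get(item, 0) + 1
    for item, support in counts.items():
        if support >= min_support:
            patterns[prefix + (item,)] = support
            # Project: keep the suffix after the first occurrence of item.
            projected = [s[s.index(item) + 1:] for s in sequences if item in s]
            patterns.update(prefixspan(projected, min_support, prefix + (item,)))
    return patterns

# Hypothetical class-level sessions (classes of entities in consecutive queries).
sessions = [["Film", "Actor", "Film"], ["Film", "Actor"], ["Actor", "Film"]]
patterns = prefixspan(sessions, min_support=2)
print(patterns)
```

With min_support = 2 this yields, among others, the sequential patterns ("Film", "Actor") and ("Actor", "Film"), each supported by two sessions; on the raw query strings these regularities would be invisible.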
Usage mining example 3: semantic patterns of query modification
• Goal: identify frequent query modifications in an image archive
• state of the art = 3 classes: generalization, specification, reformulation
• Approach:
  1. link queries to entities in the LOD cloud
  2. choose class of entity
  3. determine shortest path between consecutive queries Q1 and Q2
  4. rank property-paths according to support and confidence
Hollink, V., Tsikrika, T., & de Vries, A. P. (2011). Semantic search log analysis: a method and a study on professional image search. JASIST 62(4), 691-713.
See also: Huurnink, B., Hollink, L., Van Den Heuvel, W., & De Rijke, M. (2010). Search behavior of media professionals at an audiovisual archive: A transaction log analysis. JASIST, 61(6), 1180-1197.
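Step 3 (the shortest path between the entities behind consecutive queries) can be sketched as a BFS over a small knowledge graph. The triples below are invented for illustration; inverse traversal of a property is marked with a `^` prefix.

```python
from collections import deque

# Hypothetical mini knowledge graph as (subject, property, object) triples.
triples = [
    ("Rembrandt", "born-in", "Leiden"),
    ("Rembrandt", "movement", "Baroque"),
    ("Vermeer", "born-in", "Delft"),
    ("Vermeer", "movement", "Baroque"),
]

# Undirected adjacency list; traversing an edge backwards is marked with '^'.
edges = {}
for s, p, o in triples:
    edges.setdefault(s, []).append((o, p))
    edges.setdefault(o, []).append((s, "^" + p))

def shortest_property_path(start, goal):
    """BFS returning the sequence of properties linking two query entities."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        node, path = frontier.popleft()
        if node == goal:
            return path
        for nxt, prop in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [prop]))
    return None

path = shortest_property_path("Rembrandt", "Vermeer")
print(path)  # ['movement', '^movement']
```

The property-path `movement / ^movement` says the two queries moved between artists of the same movement; aggregating such paths over many query pairs and ranking them by support and confidence gives the semantic modification patterns of the paper.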
Conclusions:
• identified patterns that are not visible in the raw data
• but “the method is only moderately successful in identifying the most prominent relations for a given query pair”
The feature selection issue when using LOD
Feature Selection
• Feature selection = limiting the number of features, for faster computation times, more understandable models, and better predictive value.
• Using Linked Open Data can lead to a large number of features per data point:
  • a DBpedia resource easily has 50 property-value pairs
  • more are easily added using reasoning
  • note: these numbers are not large compared to the number of features in DNA strings, or all the words in a text corpus!
• Still, many of them are irrelevant or redundant.
Feature Selection Example
• Goal: learn a relation R between x and y.
• In this paper, R = ‘occupation’, ‘gender’, ‘instance_of’, ‘acted_in’, ‘genre’, ‘position_played_on_team’
• Approach: given a training set of pairs of x, y, learn a “whitelist” of properties in DBpedia, WikiData, YAGO and WordNet that indicate a relation R between x and y
• Cast as a subset selection problem:
• E = the set of possible properties
• local search over the power set of E (i.e. all subsets) to find the optimal subset.
Learning to Exploit Structured Resources for Lexical Inference. Vered Shwartz, Omer Levy, Ido Dagan and Jacob Goldberger. CoNLL 2015 (to appear).
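A minimal sketch of the subset-selection idea: greedy local search that flips one property in or out of the current subset while the score improves. The scoring function here is made up; in the paper the score would come from evaluating the whitelist on the training pairs.

```python
def local_search(properties, score):
    """Greedy local search over subsets of E: repeatedly flip one property
    in or out of the current subset as long as the score improves."""
    current, best = frozenset(), score(frozenset())
    improved = True
    while improved:
        improved = False
        for p in properties:
            candidate = current ^ {p}       # add p if absent, drop it if present
            s = score(candidate)
            if s > best:
                current, best, improved = candidate, s, True
    return current, best

# Hypothetical: two properties actually indicate the target relation R;
# the toy score rewards them and penalizes everything else.
relevant = {"dbo:occupation", "dbo:birthPlace"}
def toy_score(subset):
    return len(subset & relevant) - 0.5 * len(subset - relevant)

properties = ["dbo:occupation", "dbo:birthPlace", "dbo:abstract", "dbo:wikiPageID"]
whitelist, best_score = local_search(properties, toy_score)
print(sorted(whitelist))  # ['dbo:birthPlace', 'dbo:occupation']
```

Local search like this only finds a local optimum of the subset-selection problem, which is why the choice of starting point and scoring function matters in practice.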
Data Fusion
Data Fusion / Ontology Alignment / Mapping / Matching / Linking / Integration
• Ontology / schema / T-box level
• Instance / data / A-box level
Data Fusion
Entity detection / entity linking
Methods for Data Fusion: structural matchers
• E.g. Similarity Flooding: the similarity of a matched pair s1 and t1 propagates to their respective neighbors s2 and t2.
• neighbors can be defined as subclasses, superclasses, instances, domain/ranges, etc.
• Structural measures are in practice never used stand-alone.
[10] Ngo, Duy Hoa, and Zohra Bellahsene. YAM++ results for OAEI 2012. OAEI @ ISWC 2012.
[11] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity Flooding: A versatile graph matching algorithm and its application to schema matching. ICDE 2002.
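One propagation step in the spirit of Similarity Flooding can be sketched as follows. The graphs, initial similarities and the propagation weight are made up for illustration; the real algorithm derives propagation coefficients from the graphs and iterates to a fixpoint.

```python
# Neighbour relations in the source and target ontologies (e.g. subclass edges).
neighbours_s = {"s1": ["s2"], "s2": []}
neighbours_t = {"t1": ["t2"], "t2": []}

# Initial similarities, e.g. from a string matcher.
sim = {("s1", "t1"): 1.0, ("s2", "t2"): 0.0}

alpha = 0.5  # how much similarity spills over to neighbouring pairs (assumed)
new_sim = dict(sim)
for (s, t), value in sim.items():
    for s2 in neighbours_s.get(s, []):
        for t2 in neighbours_t.get(t, []):
            new_sim[(s2, t2)] = new_sim.get((s2, t2), 0.0) + alpha * value

# Normalise so the largest similarity is 1, as in the original algorithm.
top = max(new_sim.values())
new_sim = {pair: v / top for pair, v in new_sim.items()}
print(new_sim[("s2", "t2")])  # 0.5: the match (s1, t1) propagated to its neighbours
```

Even though (s2, t2) started at similarity 0, it inherits support from the strongly matched pair (s1, t1) next to it; this is exactly the "neighbours of matched pairs become more similar" intuition of the slide.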
Methods for Data Fusion: instance based matchers
• Match classes based on similarity of their instances
• note: you need a way to assess similarity of the instances!
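A simple instance-based signal is Jaccard overlap of the (already linked) instance sets of two classes. The class names and instances below are hypothetical:

```python
def jaccard(a, b):
    """Overlap of two instance sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical classes from two museum ontologies with partly shared instances.
painting_instances = {"night_watch", "milkmaid", "sunflowers"}
artwork_instances = {"night_watch", "milkmaid", "david"}

score = jaccard(painting_instances, artwork_instances)
print(score)  # 0.5: two shared instances out of four distinct ones
```

This only works once the instances themselves have been linked (here by identical identifiers), which is the "you need a way to assess similarity of the instances" caveat above.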
Methods for Data Fusion: string based
• This is the most important feature in ontology alignment.
• “nearly all [ontology alignment systems] use a string similarity metric” [12]
• stop word removal (‘stopping’) and stemming are not helpful! Nor is using WordNet synonyms. [12]
• In [13] we took an even less semantic approach: linking based on URL syntax.
[12] Cheatham, M., & Hitzler, P. String similarity metrics for ontology alignment. ISWC 2013.
[13] The debates of the European Parliament as Linked Open Data. Under review. See http://www.talkofeurope.eu/data/ for details.
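A typical string matcher scores label pairs by normalized edit distance. A minimal sketch (plain Levenshtein on lowercased labels; real alignment systems combine several such metrics [12]):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def string_similarity(a, b):
    """Normalized similarity in [0, 1]: 1 means identical labels."""
    a, b = a.lower(), b.lower()
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(round(string_similarity("architecten", "architects"), 2))  # 0.82
```

The Dutch label "architecten" and the English "architects" score high on pure string similarity, which is exactly why string metrics alone find many cross-language matches without any explicit semantics.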
http://www.dbpedia.org/page/Judith_Sargentini
Link types
• Equality: owl:sameAs, EquivalentClasses, EquivalentProperties
  e.g. “Den Haag” = “The Hague”; wood-material = wood
• Hierarchical: rdfs:subClassOf, rdf:type, rdfs:subPropertyOf
  e.g. aat:Artist ⊇ wn:Artist; tgn:Africa ∈ wn:Continent; conf:has_the_last_name = edas:hasLastName
• Weaker semantics: skos:closeMatch / exactMatch / broadMatch / narrowMatch / relatedMatch
  e.g. geonames:Italy skos:closeMatch librarytopics:Italy
• Domain-specific links: e.g. born-in, hasStyle, hasPart
  e.g. Van Gogh (ULAN) born-in Groot-Zundert (TGN)
Representation of links
• As a reified link with metadata:
  Link 001
    concept1: architecten
    concept2: architects
    link type: skos:exactMatch
    link method: manual (‘handmatig’)
    author: L. Hollink
• Or as a direct triple:
  architecten skos:exactMatch architects
• Open Question: how valid are the patterns we discover in data when the quality of the links is low?
• Even more important to be critical and evaluate the data:
  • source criticism
  • tool criticism (see http://event.cwi.nl/toolcriticism/)
Evaluation of Data Fusion / Linking
1. Manually rating (a sample of) mappings
  • relatively cheap and easy to interpret
  • only precision, no recall
2. Comparison to a reference alignment
  • precision and recall
  • used in OAEI on the SEALS platform
  • more expensive if a reference alignment has to be created (but: crowdsourcing!)
3. End-to-end evaluation (a.k.a. evaluating an application that uses the mappings)
  • arguably the best method!
  • need to have access to an application + users
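Comparison to a reference alignment boils down to set overlap of correspondences. A toy sketch, with invented correspondences (not from an actual OAEI track):

```python
# Correspondences produced by a matcher vs. a hypothetical reference alignment.
found = {("a:Painter", "b:Artist"), ("a:Work", "b:Artwork"), ("a:City", "b:Gallery")}
reference = {("a:Painter", "b:Artist"), ("a:Work", "b:Artwork"), ("a:Museum", "b:Gallery")}

tp = len(found & reference)          # correct correspondences found
precision = tp / len(found)          # 2/3
recall = tp / len(reference)         # 2/3
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 2))
```

Note that this binary view treats the near-miss ("a:City", "b:Gallery") exactly like a wildly wrong mapping; the alternative measures on the next slide relax that.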
Evaluation of Data Fusion / Linking
• Comparison to a reference alignment, alternative measures:
• 1. instead of a binary classification into correct/incorrect mappings, take into account how wrong a link is:
  • where r(a) is the semantic distance between correspondence a and correspondence a’ in the reference alignment, and A is the number of correspondences
• 2. weight the score of mappings based on the frequency of their use
  • e.g. from usage logs!
Laura Hollink, Mark van Assem, Shenghui Wang, Antoine Isaac, Guus Schreiber. Two Variations on Ontology Alignment Evaluation: Methodological Issues. ESWC 2008.
Discovering links from text Pointers to what happens in other communities
• Word2Vec: efficient deep learning algorithm to learn vector representations of words
• vector similarity captures semantics between words
• No explicit semantics, but we can’t deny that there is meaning there!
• Success seems to be mostly due to big data
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
Example:
Vec(Madrid) - Vec(Spain) + Vec(France) is closer to Vec(Paris) than to any other vector
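The analogy can be reproduced with toy vectors. The 3-d embeddings below are constructed so that each capital is its country plus a fixed "capital" offset; real word2vec vectors are learned from text and only approximate this regularity.

```python
from math import sqrt

# Toy 3-d embeddings: capital = country + offset [0, 0, 1] (by construction).
vec = {
    "Spain":  [1, 0, 0], "Madrid": [1, 0, 1],
    "France": [0, 1, 0], "Paris":  [0, 1, 1],
    "Italy":  [0, 0, 1], "Rome":   [0, 0, 2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Vec(Madrid) - Vec(Spain) + Vec(France)
target = [m - s + f for m, s, f in zip(vec["Madrid"], vec["Spain"], vec["France"])]

# Exclude the three query words, as is standard in analogy evaluation.
candidates = {w: v for w, v in vec.items() if w not in {"Madrid", "Spain", "France"}}
answer = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(answer)  # Paris
```

The vector arithmetic recovers "Paris" because the country-to-capital offset is (roughly) the same direction for every pair, which is the regularity the slide's example exploits.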
NELL: Never-Ending Language Learning
• several machine learning approaches to discover facts (beliefs) from text on the web
  • string features, distribution of context words, HTML structure, visual image analysis
• running since 2010, has so far learned over 80 million beliefs
T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, J. Welling. Never-Ending Learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2015.
Research Task Format
Work in 6 groups of 10 students:
• 5 people design an approach to association rules with semantics
• 5 people focus on how that approach should be evaluated
The idea is to work together! E.g. which measures are best for this approach? Which versions of the approach should be evaluated? Will this approach score high on these measures? In which cases?
• We would like one presentation per group of 10 people
  • of 3 or 4 slides
  • of max 4 minutes (less is fine too!)
• Send me the slides in PDF, with your group number in the title, by email to [email protected], today before 16:30.
• The presentation should show clearly:
  1. the AR method
  2. how did you take into account semantics?
  3. the evaluation method
• BONUS: argue when and why your approach will score high.
• BONUS: discuss how the newly learned links can be represented and used.
Tips:
• you may pick a dataset that you will use as an example