TRANSCRIPT
Knowledge Discovery for the Semantic Web
An Application to Web Usage Mining &
How to use semantics in the Preprocessing stage

KDD pipeline: Input Data → Data Preprocessing and Transformation → Data Mining → Interpretation and Evaluation → Information / Taking Action
• Preprocessing and transformation: data fusion (multiple sources), data cleaning (noise, missing values), feature selection, dimensionality reduction, data normalization
• Interpretation and evaluation: filtering patterns, visualization, statistical analysis (hypothesis testing, attribute evaluation, comparing learned models, computing confidence intervals)
Claudia D’Amato - University of Bari, IT.
Laura Hollink - Centrum Wiskunde & Informatica, Amsterdam, NL.
An application to Web Usage Mining
Web Usage Mining = discovering patterns in logs of user interaction with Web resources
• logs typically contain an identifier for users (e.g. IP address), their queries and clicks
• What about usage of Linked Open Data?
• Can we use semantics to improve mining of Web Usage?
Mining Usage of Linked Open Data in USEWOD
USEWOD: http://usewod.org/ [B. Berendt, L. Hollink, M. Luczak-Roesch, et al.]
1. USEWOD workshop series @ ESWC / WWW since 2011
2. USEWOD dataset: server logs of DBpedia, BioPortal, LinkedGeoData, etc., and client side logs from YASGUI.
Mining Usage of Linked Open Data in USEWOD
• Results of USEWOD: LOD usage mining for more efficient indexing [1], caching [2], auto-completion [3], etc.
[1] Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. An empirical study of real-world SPARQL queries. USEWOD @ WWW 2011.
[2] Lorey, J., & Naumann, F. Caching and prefetching strategies for SPARQL queries. USEWOD @ ESWC 2013.
[3] Kramer, K., Dividino, R. Q., & Gröner, G. SPACE: SPARQL Index for Efficient Autocompletion. ISWC (Posters & Demos) 2013.
[4] Rietveld, L., & Hoekstra, R. Man vs. Machine: Differences in SPARQL Queries. USEWOD @ ESWC 2014.
[5] Huelss, J., & Paulheim, H. What SPARQL Query Logs Tell and do not Tell about Semantic Relatedness in LOD. NoISE @ ESWC 2015.
• Issues:
  • what is the difference between queries by machines and humans? [4]
  • what is the meaning of repeated queries by bots/tools?
  • a lot of the usage is invisible due to data dump downloads [5]
Usage mining example 1: clustering rdf:properties in DBpedia
Instead of listing all DBpedia properties alphabetically, can we display them in a more meaningful way? Can we use query logs for this? [5]
[5] Huelss, J., & Paulheim, H. What SPARQL Query Logs Tell and do not Tell about Semantic Relatedness in LOD. NoISE @ ESWC 2015. Disclaimer: simplified discussion of this paper!
Approach: Hierarchical Clustering of properties, where the distance between a pair of properties is based on how often they co-occur in a SPARQL query in the USEWOD2015 logs.
Evaluation: run an experiment to measure how quickly and accurately people identify facts when looking at the standard view or the clustered view.
Result: no significant differences ☹
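The approach above can be sketched in a few lines. This is a toy example: the queries and property names below are made up for illustration, not taken from the USEWOD 2015 logs, and the clustering is a naive single-linkage variant rather than the exact setup of the paper.

```python
from itertools import combinations

# Toy query log: the set of rdf:properties used together in each SPARQL query.
queries = [
    {"dbo:birthDate", "dbo:birthPlace"},
    {"dbo:birthDate", "dbo:birthPlace", "dbo:deathDate"},
    {"dbo:birthPlace", "dbo:deathDate"},
    {"dbo:starring", "dbo:director"},
    {"dbo:starring", "dbo:director", "dbo:runtime"},
]

props = sorted(set().union(*queries))

def distance(p, q):
    """1 - Jaccard co-occurrence: properties that often appear in the same
    query are close; properties that never co-occur have distance 1."""
    together = sum(1 for s in queries if p in s and q in s)
    either = sum(1 for s in queries if p in s or q in s)
    return 1.0 - together / either

# Naive single-linkage agglomerative (hierarchical) clustering, stopped at 2 clusters.
clusters = [{p} for p in props]
while len(clusters) > 2:
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: min(distance(p, q)
                           for p in clusters[ij[0]] for q in clusters[ij[1]]),
    )
    clusters[i] |= clusters.pop(j)

print(clusters)  # person-related and film-related properties end up in separate clusters
```

On this toy log, the person properties (birthDate, birthPlace, deathDate) and the film properties (starring, director, runtime) separate cleanly, which is the kind of grouping a property display could use instead of an alphabetical list.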
Usage mining example 2: mining semantically enriched query logs
Data: queries and clicks on the Yahoo! search engine.
Problem when mining ‘raw’ logs: low support of even the most frequent patterns.
Laura Hollink, Peter Mika and Roi Blanco. Web Usage Mining with Semantic Analysis. WWW 2013.
Usage mining example 2: mining semantically enriched query logs
Approach:
1. link queries to entities in the LOD cloud
2. choose class of entity + selected properties
3. detect modifier words (download, trailer, cast, date, etc.)

1. Link queries to entities in the LOD cloud:
• Freebase (has a lot of movie-related info)
• DBpedia (Wikipedia is widely used)
Usage mining example 2: mining semantically enriched query logs
• Discover frequent patterns on the class level: sequential pattern mining using the efficient PrefixSpan algorithm to mine all possible subsequence patterns.
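The idea can be sketched with a minimal PrefixSpan over toy class-level sessions. The sessions below are invented (each item is the class of the entity behind a query), and this is a bare-bones implementation, not the optimized miner used in the paper.

```python
def prefixspan(sequences, min_support, prefix=()):
    """Minimal PrefixSpan: count the items that can extend the current prefix,
    then recurse into the projected database of each frequent item.
    Returns a dict {pattern: support}."""
    patterns = {}
    counts = {}
    for seq in sequences:
        for item in set(seq):            # count each sequence at most once
            counts[item] = counts.get(item, 0) + 1
    for item, support in counts.items():
        if support >= min_support:
            patterns[prefix + (item,)] = support
            # Project: keep the suffix after the first occurrence of item.
            projected = [s[s.index(item) + 1:] for s in sequences if item in s]
            patterns.update(prefixspan(projected, min_support, prefix + (item,)))
    return patterns

# Hypothetical class-level sessions (classes of entities in consecutive queries).
sessions = [["Film", "Actor", "Film"], ["Film", "Actor"], ["Actor", "Film"]]
patterns = prefixspan(sessions, min_support=2)
print(patterns)
```

With min_support = 2 this yields, among others, the sequential patterns ("Film", "Actor") and ("Actor", "Film"), each supported by two sessions; on the raw query strings these regularities would be invisible.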
Usage mining example 3: semantic patterns of query modification
• Goal: identify frequent query modifications in an image archive
• state of the art = 3 classes: generalization, specification, reformulation
• Approach:
  1. link queries to entities in the LOD cloud
  2. choose class of entity
  3. determine shortest path between consecutive queries Q1 and Q2
  4. rank property-paths according to support and confidence
Hollink, V., Tsikrika, T., & de Vries, A. P. (2011). Semantic search log analysis: a method and a study on professional image search. JASIST 62(4), 691-713.
See also: Huurnink, B., Hollink, L., Van Den Heuvel, W., & De Rijke, M. (2010). Search behavior of media professionals at an audiovisual archive: A transaction log analysis. JASIST, 61(6), 1180-1197.
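Step 3 (the shortest path between the entities behind consecutive queries) can be sketched as a BFS over a small knowledge graph. The triples below are invented for illustration; inverse traversal of a property is marked with a `^` prefix.

```python
from collections import deque

# Hypothetical mini knowledge graph as (subject, property, object) triples.
triples = [
    ("Rembrandt", "born-in", "Leiden"),
    ("Rembrandt", "movement", "Baroque"),
    ("Vermeer", "born-in", "Delft"),
    ("Vermeer", "movement", "Baroque"),
]

# Undirected adjacency list; traversing an edge backwards is marked with '^'.
edges = {}
for s, p, o in triples:
    edges.setdefault(s, []).append((o, p))
    edges.setdefault(o, []).append((s, "^" + p))

def shortest_property_path(start, goal):
    """BFS returning the sequence of properties linking two query entities."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        node, path = frontier.popleft()
        if node == goal:
            return path
        for nxt, prop in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [prop]))
    return None

path = shortest_property_path("Rembrandt", "Vermeer")
print(path)  # ['movement', '^movement']
```

The property-path `movement / ^movement` says the two queries moved between artists of the same movement; aggregating such paths over many query pairs and ranking them by support and confidence gives the semantic modification patterns of the paper.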
Conclusions:
• identified patterns that are not visible in the raw data
• but “the method is only moderately successful in identifying the most prominent relations for a given query pair”
The feature selection issue when using LOD
Feature Selection
• Feature selection = limiting the number of features, for faster computation times, more understandable models, and better predictive value.
• Using Linked Open Data can lead to a large number of features per data point:
  • a DBpedia resource easily has 50 property-value pairs
  • more are easily added using reasoning
  • note: these numbers are not large compared to the number of features in DNA strings, or all the words in a text corpus!
• Still, many of them are irrelevant or redundant.
Feature Selection Example
• Goal: learn a relation R between x and y.
• In this paper, R = ‘occupation’, ‘gender’, ‘instance_of’, ‘acted_in’, ‘genre’, ‘position_played_on_team’
• Approach: given a training set of pairs of x, y, learn a “whitelist” of properties in DBpedia, WikiData, YAGO and WordNet that indicate a relation R between x and y
• Cast as a subset selection problem:
• E = the set of possible properties
• local search over the power set of E (i.e. all subsets) to find the optimal subset.
Learning to Exploit Structured Resources for Lexical Inference. Vered Shwartz, Omer Levy, Ido Dagan and Jacob Goldberger. CoNLL 2015 (to appear).
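A minimal sketch of the subset-selection idea: greedy local search that flips one property in or out of the current subset while the score improves. The scoring function here is made up; in the paper the score would come from evaluating the whitelist on the training pairs.

```python
def local_search(properties, score):
    """Greedy local search over subsets of E: repeatedly flip one property
    in or out of the current subset as long as the score improves."""
    current, best = frozenset(), score(frozenset())
    improved = True
    while improved:
        improved = False
        for p in properties:
            candidate = current ^ {p}       # add p if absent, drop it if present
            s = score(candidate)
            if s > best:
                current, best, improved = candidate, s, True
    return current, best

# Hypothetical: two properties actually indicate the target relation R;
# the toy score rewards them and penalizes everything else.
relevant = {"dbo:occupation", "dbo:birthPlace"}
def toy_score(subset):
    return len(subset & relevant) - 0.5 * len(subset - relevant)

properties = ["dbo:occupation", "dbo:birthPlace", "dbo:abstract", "dbo:wikiPageID"]
whitelist, best_score = local_search(properties, toy_score)
print(sorted(whitelist))  # ['dbo:birthPlace', 'dbo:occupation']
```

Local search like this only finds a local optimum of the subset-selection problem, which is why the choice of starting point and scoring function matters in practice.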
Data Fusion
Data Fusion / Ontology Alignment / Mapping / Matching / Linking / Integration
• Ontology / schema / T-box level
• Instance / data / A-box level
Data Fusion
Entity detection / entity linking
Methods for Data Fusion: structural matchers
• E.g. Similarity Flooding: the similarity of a matched pair s1 and t1 propagates to their respective neighbors s2 and t2.
• neighbors can be defined as subclasses, superclasses, instances, domain/ranges, etc.
• Structural measures are in practice never used stand-alone.
[10] Ngo, Duy Hoa, and Zohra Bellahsene. YAM++ results for OAEI 2012. OAEI @ ISWC 2012.
[11] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity Flooding: A versatile graph matching algorithm and its application to schema matching. ICDE 2002.
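One propagation step in the spirit of Similarity Flooding can be sketched as follows. The graphs, initial similarities and the propagation weight are made up for illustration; the real algorithm derives propagation coefficients from the graphs and iterates to a fixpoint.

```python
# Neighbour relations in the source and target ontologies (e.g. subclass edges).
neighbours_s = {"s1": ["s2"], "s2": []}
neighbours_t = {"t1": ["t2"], "t2": []}

# Initial similarities, e.g. from a string matcher.
sim = {("s1", "t1"): 1.0, ("s2", "t2"): 0.0}

alpha = 0.5  # how much similarity spills over to neighbouring pairs (assumed)
new_sim = dict(sim)
for (s, t), value in sim.items():
    for s2 in neighbours_s.get(s, []):
        for t2 in neighbours_t.get(t, []):
            new_sim[(s2, t2)] = new_sim.get((s2, t2), 0.0) + alpha * value

# Normalise so the largest similarity is 1, as in the original algorithm.
top = max(new_sim.values())
new_sim = {pair: v / top for pair, v in new_sim.items()}
print(new_sim[("s2", "t2")])  # 0.5: the match (s1, t1) propagated to its neighbours
```

Even though (s2, t2) started at similarity 0, it inherits support from the strongly matched pair (s1, t1) next to it; this is exactly the "neighbours of matched pairs become more similar" intuition of the slide.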
Methods for Data Fusion: instance based matchers
• Match classes based on similarity of their instances
• note: you need a way to assess similarity of the instances!
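A simple instance-based signal is Jaccard overlap of the (already linked) instance sets of two classes. The class names and instances below are hypothetical:

```python
def jaccard(a, b):
    """Overlap of two instance sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical classes from two museum ontologies with partly shared instances.
painting_instances = {"night_watch", "milkmaid", "sunflowers"}
artwork_instances = {"night_watch", "milkmaid", "david"}

score = jaccard(painting_instances, artwork_instances)
print(score)  # 0.5: two shared instances out of four distinct ones
```

This only works once the instances themselves have been linked (here by identical identifiers), which is the "you need a way to assess similarity of the instances" caveat above.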
Methods for Data Fusion: string based
• This is the most important feature in ontology alignment.
• “nearly all [ontology alignment systems] use a string similarity metric” [12]
• stop word removal (‘stopping’) and stemming are not helpful! Nor is using WordNet synonyms. [12]
• In [13] we took an even less semantic approach: linking based on URL syntax.
[12] Cheatham, M., & Hitzler, P. String similarity metrics for ontology alignment. ISWC 2013.
[13] The debates of the European Parliament as Linked Open Data. Under review. See http://www.talkofeurope.eu/data/ for details.
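A typical string matcher scores label pairs by normalized edit distance. A minimal sketch (plain Levenshtein on lowercased labels; real alignment systems combine several such metrics [12]):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def string_similarity(a, b):
    """Normalized similarity in [0, 1]: 1 means identical labels."""
    a, b = a.lower(), b.lower()
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(round(string_similarity("architecten", "architects"), 2))  # 0.82
```

The Dutch label "architecten" and the English "architects" score high on pure string similarity, which is exactly why string metrics alone find many cross-language matches without any explicit semantics.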
http://www.dbpedia.org/page/Judith_Sargentini
Link types
• Equality: owl:sameAs, EquivalentClasses, EquivalentProperties
  e.g. “Den Haag” = “The Hague”; wood-material = wood
• Hierarchical: rdfs:subClassOf, rdf:type, rdfs:subPropertyOf
  e.g. aat:Artist ⊇ wn:Artist; tgn:Africa ∈ wn:Continent; conf:has_the_last_name = edas:hasLastName
• Weaker semantics: skos:closeMatch / exactMatch / broadMatch / narrowMatch / relatedMatch
  e.g. geonames:Italy skos:closeMatch librarytopics:Italy
• Domain-specific links: e.g. born-in, hasStyle, hasPart
  e.g. Van Gogh (ULAN) born-in Groot-Zundert (TGN)
Representation of links
• As a reified link with metadata:
  Link 001
    concept1: architecten
    concept2: architects
    link type: skos:exactMatch
    link method: manual (‘handmatig’)
    author: L. Hollink
• Or as a direct triple:
  architecten skos:exactMatch architects
• Open Question: how valid are the patterns we discover in data when the quality of the links is low?
• Even more important to be critical and evaluate the data:
  • source criticism
  • tool criticism (see http://event.cwi.nl/toolcriticism/)
Evaluation of Data Fusion / Linking
1. Manually rating (a sample of) mappings
  • relatively cheap and easy to interpret
  • only precision, no recall
2. Comparison to a reference alignment
  • precision and recall
  • used in OAEI on the SEALS platform
  • more expensive if a reference alignment has to be created (but: crowdsourcing!)
3. End-to-end evaluation (a.k.a. evaluating an application that uses the mappings)
  • arguably the best method!
  • need to have access to an application + users
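Comparison to a reference alignment boils down to set overlap of correspondences. A toy sketch, with invented correspondences (not from an actual OAEI track):

```python
# Correspondences produced by a matcher vs. a hypothetical reference alignment.
found = {("a:Painter", "b:Artist"), ("a:Work", "b:Artwork"), ("a:City", "b:Gallery")}
reference = {("a:Painter", "b:Artist"), ("a:Work", "b:Artwork"), ("a:Museum", "b:Gallery")}

tp = len(found & reference)          # correct correspondences found
precision = tp / len(found)          # 2/3
recall = tp / len(reference)         # 2/3
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 2))
```

Note that this binary view treats the near-miss ("a:City", "b:Gallery") exactly like a wildly wrong mapping; the alternative measures on the next slide relax that.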
Evaluation of Data Fusion / Linking
• Comparison to a reference alignment, alternative measures:
• 1. instead of a binary classification into correct/incorrect mappings, take into account how wrong a link is:
  • where r(a) is the semantic distance between correspondence a and correspondence a’ in the reference alignment, and A is the number of correspondences
• 2. weight the score of mappings based on the frequency of their use
  • e.g. from usage logs!
Laura Hollink, Mark van Assem, Shenghui Wang, Antoine Isaac, Guus Schreiber. Two Variations on Ontology Alignment Evaluation: Methodological Issues. ESWC 2008.
Discovering links from text Pointers to what happens in other communities
• Word2Vec: efficient deep learning algorithm to learn vector representations of words
• vector similarity captures semantics between words
• No explicit semantics, but we can’t deny that there is meaning there!
• Success seems to be mostly due to big data
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
Example:
Vec(Madrid) - Vec(Spain) + Vec(France) is closer to Vec(Paris) than to any other vector
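The analogy can be reproduced with toy vectors. The 3-d embeddings below are constructed so that each capital is its country plus a fixed "capital" offset; real word2vec vectors are learned from text and only approximate this regularity.

```python
from math import sqrt

# Toy 3-d embeddings: capital = country + offset [0, 0, 1] (by construction).
vec = {
    "Spain":  [1, 0, 0], "Madrid": [1, 0, 1],
    "France": [0, 1, 0], "Paris":  [0, 1, 1],
    "Italy":  [0, 0, 1], "Rome":   [0, 0, 2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Vec(Madrid) - Vec(Spain) + Vec(France)
target = [m - s + f for m, s, f in zip(vec["Madrid"], vec["Spain"], vec["France"])]

# Exclude the three query words, as is standard in analogy evaluation.
candidates = {w: v for w, v in vec.items() if w not in {"Madrid", "Spain", "France"}}
answer = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(answer)  # Paris
```

The vector arithmetic recovers "Paris" because the country-to-capital offset is (roughly) the same direction for every pair, which is the regularity the slide's example exploits.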
NELL: Never-Ending Language Learning
• several machine learning approaches to discover facts (beliefs) from text on the web
  • string features, distribution of context words, HTML structure, visual image analysis
• running since 2010, has so far learned over 80 million beliefs
T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, J. Welling. Never-Ending Learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2015.
Research Task Format
Work in 6 groups of 10 students:
• 5 people design an approach to association rules with semantics
• 5 people focus on how that approach should be evaluated
The idea is to work together! E.g. which measures are best for this approach? Which versions of the approach should be evaluated? Will this approach score high on these measures? In which cases?
• We would like one presentation per group of 10 people
  • of 3 or 4 slides
  • of max 4 minutes (less is fine too!)
• Send me the slides in PDF, with your group number in the title, by email to [email protected], today before 16:30.
• The presentation should show clearly:
  1. the AR method
  2. how did you take into account semantics?
  3. the evaluation method
• BONUS: argue when and why your approach will score high.
• BONUS: discuss how the newly learned links can be represented and used.
Tips:
• you may pick a dataset that you will use as an example