Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Upload: crokodil-consortium

Post on 23-Jan-2015


Page 1: Semantic Relatedness of Web Resources by XESA - Philipp Scholl

KOM - Multimedia Communications Lab
Prof. Dr.-Ing. Ralf Steinmetz (Director)

Dept. of Electrical Engineering and Information Technology
Dept. of Computer Science (adjunct Professor)

TUD – Technische Universität Darmstadt Rundeturmstr. 10, D-64283 Darmstadt, Germany

Tel.+49 6151 166150, Fax. +49 6151 166152 www.KOM.tu-darmstadt.de

© 2010 author(s) of these slides including research results from the KOM research network and TU Darmstadt. Otherwise it is specified at the respective slide

Dipl.-Inform. Philipp Scholl
Doreen Böhnstedt, M.Sc.
Dipl.-Inform. Renato Domínguez García
Dr.-Ing. Christoph Rensing
Prof. Dr.-Ing. Ralf Steinmetz

[email protected] Tel.+49 6151 166115


Extended Explicit Semantic Analysis for Calculating Semantic Relatedness of Web Resources

Presentation 2010/10/01 EC-TEL, Barcelona


Page 2: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Outline

A Learning Scenario: Knowledge Networks and Snippets
Measuring Semantic Relatedness with ESA
Proposed Enhancements to ESA
Evaluation
Conclusions & Outlook

Page 3: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Scenario: Crokodil

Crokodil: supporting resource-based learning with web resources
Collecting fragments of web resources (“snippets”)
Organizing snippets via (semantic) tagging (with types Person, Event, Goal, Location, …)
Underlying structure: personal and community knowledge networks

Embedded as an add-on into the sidebar of the Firefox web browser

Page 4: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Study Results: Snippets of Web Resources

Participants of study [SBB09] found saving fragments of web resources (instead of whole web pages) very useful

Snippets ≡ fragments of web resources
Definite, narrow scope of topic
Cover the user’s information needs

Findings in study [SBB09]
Comparison of 1357 snippets vs. 705 web resources
Snippets: 70% are shorter than 100 words
Web resources: 70% are shorter than 1000 words

[Figure: Comparison: Snippets vs. HTML Pages. x-axis: size in words / tokens (ticks 1 to 100000); y-axis: cumulated percentage (0 to 100); series: Snippets, Complete HTML Pages]

[SBB09] Scholl, P., Benz, B. F., Böhnstedt, D., Rensing, C., Schmitz, B., Steinmetz, R. (2009): Implementation and Evaluation of a Tool for Setting Goals in Self-Regulated Learning with Web Resources, In: Learning in the Synergy of Multiple Disciplines, EC-TEL 2009, pp. 521-534, Springer-Verlag Berlin/Heidelberg

Page 5: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Structural Recommendations

Suggesting related resources in Crokodil: based on structure of knowledge network

Whether the resource has already been saved in the personal or community knowledge network

Based on explicit connections between current web resource and tags

[Diagram: knowledge network with the blog entry "Visualization of Learning with Web 2.0", the paper excerpt "Social Network Analysis and Visualizations for Learning" and the tags Web 2.0, Life long learning, EC-TEL 2010, E-Learning and TEL; one edge is labeled "Recommendation"]

Page 6: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Challenge: Sparse Knowledge Networks

Direct, explicit connections do not always exist

Knowledge Networks are sparse

Goal: semantic recommendation based on snippets.

Some measure of similarity / relatedness between snippets is needed for recommendation

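[Diagram: knowledge network with the blog entry "e-learning in Web 2.0", the paper excerpt "Web 2.0 for learning" and the tags Web 2.0, Life long learning, TEL and E-learning; the "Recommendation" edge is marked with a question mark]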

Page 7: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Implications for Recommending Snippets

Snippets
Are mostly short
Have only few significant terms
The learning scenario needs recommendation of related, not necessarily similar snippets

Semantic Relatedness vs. Semantic Similarity

Challenge: vocabulary gap
Different wording and terminology

Only marginally similar in terminology, but semantically strongly related

Naïve Bag-Of-Words approach not feasible for comparison

One approach to accommodate these properties: Explicit Semantic Analysis

“TEL refers to the assistance of activities in knowledge acquisition through technology”

“E-Learning comprises all forms of electronically supported learning and teaching.”

[Diagram: the two example snippets above are connected by a question mark]
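The vocabulary gap can be made concrete with the two example snippets from this slide. The following is a minimal sketch of a naive bag-of-words comparison, assuming simple lowercase tokenization without stemming or stop-word removal (the helper names are illustrative, not part of the presented system). The only token the two snippets share is the function word "of", so the cosine similarity stays below 0.1 although both texts describe closely related concepts.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (no stemming, no stop words)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


tel = "TEL refers to the assistance of activities in knowledge acquisition through technology"
elearning = "E-Learning comprises all forms of electronically supported learning and teaching."

# Terms that occur in both snippets, and the resulting bag-of-words similarity.
shared = sorted(bag_of_words(tel).keys() & bag_of_words(elearning).keys())
similarity = cosine(bag_of_words(tel), bag_of_words(elearning))
print(shared, round(similarity, 3))
```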

Page 9: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Base Approach: Explicit Semantic Analysis

ESA
Calculates relatedness between words / texts [GM07]
Based on a reference corpus containing semantically distinct documents
Allows comparison between conceptualized abstractions of documents

The resulting semantic interpretation vector i_esa can be compared to other vectors (e.g. by cosine measure)

[Diagram: document d1 passes the preprocessing steps* and becomes a |terms|×1 term vector. The semantic interpretation matrix M_int (n×|terms|, built from the n documents of the corpus as n 1×|terms| vectors) multiplied by this term vector gives the n×1 semantic interpretation vector i_esa. Document d2 is processed the same way, and the two interpretation vectors are compared.]

* Preprocessing steps contain: 1. Tokenization 2. Stemming 3. Calculation of TF-IDF

[GM07] Gabrilovich, E. & Markovitch, S. (2007): Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 6-12
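The ESA pipeline on this slide can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' implementation: the three-article corpus is hypothetical and stands in for Wikipedia, stemming is omitted, the input text is represented by raw term counts, and the IDF weighting lives in the rows of M_int. All function names are made up for this sketch.

```python
import math
import re
from collections import Counter

# Hypothetical three-article reference corpus (stand-in for Wikipedia articles).
CORPUS = {
    "E-learning": "learning with electronic media and technology in teaching",
    "Gravitation": "gravitation attracts matter and curves space",
    "Web 2.0": "web platforms where users create and share content",
}


def tokenize(text):
    """Step 1: tokenization. Step 2 (stemming) is omitted in this sketch."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_interpretation_matrix(corpus):
    """Step 3: one TF-IDF weighted row per article, i.e. M_int with n rows."""
    counts = {name: Counter(tokenize(text)) for name, text in corpus.items()}
    n = len(counts)
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n / df[term]) for term in df}
    return {name: {t: tf * idf[t] for t, tf in c.items()} for name, c in counts.items()}


def interpret(text, m_int):
    """i_esa = M_int x d: one weight per article (concept) of the corpus."""
    d = Counter(tokenize(text))
    return [sum(w * d[t] for t, w in row.items()) for row in m_int.values()]


def cosine(u, v):
    """Cosine measure between two dense interpretation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


M_INT = build_interpretation_matrix(CORPUS)
d1 = interpret("technology supports learning", M_INT)
d2 = interpret("electronic teaching media", M_INT)
d3 = interpret("matter curves space", M_INT)
print(cosine(d1, d2), cosine(d1, d3))
```

In this toy setup the first two texts share no word, yet both activate only the "E-learning" concept, so their interpretation vectors point in the same direction; the third text activates only "Gravitation" and is unrelated to the first.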

Page 10: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Wikipedia as Reference Corpus

ESA commonly uses Wikipedia as a feasible reference corpus

Wikipedia: collaboratively edited encyclopedic knowledge
  German Wikipedia: 1 million articles
Each article corresponds to a semantic concept (topic)
Articles are densely interconnected by Wiki-Links
  German Wikipedia: 25 million links
Articles are semantically grouped into categories
  German Wikipedia: 122k categories
Articles are connected to corresponding / similar articles in other languages (266 languages available)

Source: wikipedia.org

Page 11: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Observation and Hypothesis

Observation: ESA only considers article text
It ignores semantic information contained in Wikipedia that can be used:
  Connectivity by links
  Category information

Implement different enhancements by semantic enrichment: eXtended Explicit Semantic Analysis (XESA)

Hypothesis: Semantically enriching the interpretation vector with this additional information, which Wikipedia readily provides, improves the comparison of snippets

Page 13: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


XESA – Overview

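Variants and the Wikipedia information they use:
ESA: article content
XESA_AG: article content + article graph
XESA_CAT: article content + category information
XESA_AG+CAT: article content + article graph + category information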

Page 14: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Article Graph Extension

Additional factors (not shown here):
Article-link weight w_link-weight: determines the weight of the Article Graph
selectBestN: selection of only the n best values of i_esa for complexity reduction

[Diagram: example article graph with the nodes Albert Einstein, Gravitation, Space, Matter, Curvature, Black Hole, General Relativity, Catholic School, Jewish and Ulm]

Article Graph matrix A (|articles|×|articles|) × semantic interpretation vector i_esa (|articles|×1) = i_esa_AG (|articles|×1)
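The matrix product on this slide can be illustrated as follows. The slide does not show how the link weight and selectBestN enter the calculation, so this sketch makes two explicit assumptions: the Article Graph matrix has the form A = I + w · L (identity plus weighted link adjacency), and selectBestN is applied to i_esa before the multiplication. The article list, the links and the vector values are hypothetical.

```python
def select_best_n(vector, n):
    """Keep the n largest entries of the vector and zero out the rest."""
    keep = set(sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)[:n])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


def article_graph_matrix(num_articles, links, link_weight):
    """Assumed form A = I + w * L, where L[i][j] = 1 if article j links to article i."""
    a = [[1.0 if i == j else 0.0 for j in range(num_articles)] for i in range(num_articles)]
    for source, target in links:
        a[target][source] += link_weight
    return a


def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


ARTICLES = ["General Relativity", "Gravitation", "Albert Einstein", "Ulm"]
LINKS = [(0, 1), (0, 2), (2, 3)]  # hypothetical (source, target) index pairs

i_esa = [0.9, 0.0, 0.2, 0.05]  # hypothetical interpretation vector
A = article_graph_matrix(len(ARTICLES), LINKS, link_weight=0.5)
i_esa_ag = mat_vec(A, select_best_n(i_esa, n=2))
print(dict(zip(ARTICLES, i_esa_ag)))
```

In this toy example "Gravitation" had no weight in i_esa but receives one in i_esa_AG because "General Relativity" links to it, which is the kind of enrichment the extension aims at.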

Page 15: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Category Graph Extension

As categories are appended to the concept space, the resulting interpretation vector has more dimensions

[Diagram: example category graph with the nodes General Relativity, Misner Space, Anti-Gravity, Atom, Heat, Fundamental Physics Concepts, Concepts of Heaven, Relativity, Theories of Gravitation, Physics Concepts by Field, Frames of Reference and General Relativity]

Category Graph matrix A (|cat+art|×|art|) × semantic interpretation vector i_esa (|art|×1) = i_esa_AG (|cat+art|×1)
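The growth of the vector from |art| to |cat+art| dimensions can be sketched as below. The exact construction of the matrix is not given on the slide, so this example assumes an identity block for the articles stacked on top of one binary membership row per category. The membership data is hypothetical, and the result is named i_esa_cat here only to distinguish it from the article graph sketch.

```python
ARTICLES = ["General Relativity", "Misner Space", "Atom"]

# Hypothetical category membership: category -> indices of member articles.
CATEGORIES = {
    "Theories of Gravitation": [0],
    "Relativity": [0, 1],
    "Fundamental Physics Concepts": [2],
}


def category_matrix(num_articles, categories):
    """Stack an identity block (articles) on top of one membership row per category."""
    rows = [[1.0 if i == j else 0.0 for j in range(num_articles)] for i in range(num_articles)]
    for members in categories.values():
        rows.append([1.0 if j in members else 0.0 for j in range(num_articles)])
    return rows


def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


i_esa = [0.5, 0.25, 0.0]  # hypothetical interpretation vector over the articles
i_esa_cat = mat_vec(category_matrix(len(ARTICLES), CATEGORIES), i_esa)
print(len(i_esa), "->", len(i_esa_cat), i_esa_cat)
```

The first three entries repeat the article weights; the three appended entries aggregate the weights of each category's member articles.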

Page 17: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Evaluation: Development of Our Own Corpus

12 participants were asked to answer questions with snippets

Task: find snippets answering 10 different questions in 5 flavors
  Facts (“What is FTAA”)
  Opinions (“Is the term ‘dark ages’ justifiable?”)
  Homonyms (“What is Java?”)
  Loosely coupled topics (“How are sweets produced?”)
  Wide topics (“What is origin of human race?”)
  + sub-groups where the meaning is ambiguous (e.g. Java programming language vs. Indonesian island Java)

Different search engines were used (Google, Bing, Yahoo!, …), resulting in 282 distinct snippets.

Note: The created corpus corresponds to our definition of snippets
  ø 95 terms, min 5, max 756, standard deviation 71.3

Page 18: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Evaluation: Methodology

Evaluation: ESA vs. XESA
As we do not have pair comparisons for all snippets, the rank is important: relevant and similar snippets should be delivered first

Evaluation methodology: break-even point, as used for search engines
Definition: the break-even point is the value at which precision and recall of a query are equal. The higher, the better.
Average Interpolated Precision is averaged over the comparisons of all snippets
Displayed as a precision-recall diagram

Baseline ESA: Break-even point at 0.595
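The break-even point of a single ranked result list can be computed as sketched below. With precision = hits / k and recall = hits / R (R being the number of relevant items), the two values coincide at rank k = R, so the sketch evaluates precision at that rank. How the presented evaluation averages over queries is not specified on the slide; the ranking used here is hypothetical.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """Return (precision, recall) after each rank of a ranked result list."""
    points, hits = [], 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        hits += int(relevant)
        points.append((hits / rank, hits / total_relevant))
    return points


def break_even_point(ranked_relevance, total_relevant):
    """Precision (= recall) at rank R, where R is the number of relevant items."""
    hits = sum(1 for relevant in ranked_relevance[:total_relevant] if relevant)
    return hits / total_relevant


# Hypothetical ranking for one query snippet: True marks a relevant result.
ranking = [True, True, False, True, False, False, True]
total_relevant = 4

points = precision_recall_points(ranking, total_relevant)
bep = break_even_point(ranking, total_relevant)
print(points[total_relevant - 1], bep)
```

For this ranking, three of the first four results are relevant, so precision and recall are both 0.75 at rank 4.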


Page 19: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Evaluation: Comparing Approaches

Selected parameters (adjusted experimentally)
selectBestN: n = 25
Article link weight: w ∈ {0.5, 0.75} does not make a significant difference

Best results: XESA_AG(B) (0.643), but no significant difference from XESA_AG(A) (0.641)
~ 9% better than ESA
XESA_CAT is good, but cannot catch up
XESA_AG+CAT performs worse than ESA

[Figure: precision-recall diagram of the compared approaches, with break-even point labels 0.643, 0.641, 0.620 and 0.543]

Page 21: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Recommending via Semantic Relatedness

[Diagram: knowledge network with the paper excerpt "Social Network Analysis and Visualizations for Learning", the blog entry "Visualization of Learning with Web 2.0" and the tags Web 2.0, Life long learning, E-Learning and TEL; the "Recommendation" edge between the two resources is derived via Semantic Relatedness (XESA)]

Page 22: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Conclusions and Future Work

Using Wikipedia as a reference corpus for calculating semantic relatedness for snippets is feasible
Enhancing ESA by integrating Wikipedia’s rich semantic structure yields better results
  Article Graph improves ESA by up to 9%

Performance: not yet applicable to online scenarios

Future Work:
  Next step: implement semantic relatedness in recommendations
  Coping with large datasets: make the approach perform well in real-life contexts
  Calculate a cut-off for “good” concept terms (dimension reduction)
  Measuring similarity between documents in different languages

Page 23: Semantic Relatedness of Web Resources by XESA - Philipp Scholl


Questions?

…Thank you for your attention!

This work was supported by funds from the German Federal Ministry of Education and Research under the mark 01 PF 08015 A and from the European Social Fund of the European Union (ESF).