Post on 15-Jan-2016
Minimum Spanning Trees Displaying Semantic Similarity
Włodzisław Duch & Paweł Matykiewicz
Department of Informatics, UMK Toruń
School of Computer Engineering, NTU Singapore
Cincinnati Children’s Hospital Research Foundation, OH, USA Google: Duch
The Problem
Finding people who share some of our interests in large organizations or worldwide is difficult.
Analyzing people’s homepages and their lists of publications is a good way to find groups and individuals sharing common scientific interests.
Maps should display individuals and groups. The structure of graphical representations
depends strongly on the selection of keywords or dimensionality reduction.
The Data
Reuters-21578 dataset, with 5 categories and 1 – 176 elements per category.
124 Personal Web Pages of the School of Electrical and Electronic Engineering (EEE) of the Nanyang Technological University (NTU) in Singapore, with 5 categories (control, microelectronics, information, circuit, power), and 14 – 41 documents per category.
Document-word matrix
Document1: word1 word2 word3. word4 word3 word5.
Document2: word1 word3 word5. word1 word3 word6.
The matrix F (documents × word frequencies):

F = | 1 1 2 1 1 0 |
    | 2 0 2 0 1 1 |
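The counting step above can be sketched in a few lines of Python; the tokenization and variable names are my own illustrative choices, not from the slides:

```python
# Build the document-word frequency matrix F for the two example
# documents above (vocabulary order fixed as word1..word6).
docs = [
    "word1 word2 word3. word4 word3 word5.",
    "word1 word3 word5. word1 word3 word6.",
]
vocab = ["word1", "word2", "word3", "word4", "word5", "word6"]

F = []
for doc in docs:
    tokens = doc.replace(".", "").split()   # drop sentence dots, split on spaces
    F.append([tokens.count(w) for w in vocab])

# F == [[1, 1, 2, 1, 1, 0], [2, 0, 2, 0, 1, 1]]
```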
Methods used
Inverse document frequency and term weighting.
Simple selection of relevant terms.
Latent Semantic Analysis (LSA) for dimensionality reduction.
Minimum Spanning Trees for visual representation.
TouchGraph XML visualization of MST trees.
Data Preparation
Normalize columns of F, dividing by the highest word frequencies:

tf_ij = f_ij / max_i f_ij

Among n documents, term j occurs d_j times; the inverse document frequency idf_j measures the uniqueness of term j:

idf_j = log_2(n / d_j) for d_j ≥ 1, and 0 otherwise

tf × idf term weights:

w_ij = tf_ij · idf_j
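A minimal sketch of this weighting, applied to the small example matrix F from the earlier slide (variable names are illustrative assumptions):

```python
import math

# tf x idf term weighting as defined above, on the example matrix F.
F = [[1, 1, 2, 1, 1, 0],
     [2, 0, 2, 0, 1, 1]]
n = len(F)        # number of documents
m = len(F[0])     # number of terms

# tf_ij = f_ij / max_i f_ij : normalize each column by its largest entry
col_max = [max(F[i][j] for i in range(n)) for j in range(m)]
tf = [[F[i][j] / col_max[j] for j in range(m)] for i in range(n)]

# d_j = number of documents containing term j; idf_j = log2(n / d_j)
d = [sum(1 for i in range(n) if F[i][j] > 0) for j in range(m)]
idf = [math.log2(n / d[j]) if d[j] > 0 else 0.0 for j in range(m)]

# w_ij = tf_ij * idf_j
W = [[tf[i][j] * idf[j] for j in range(m)] for i in range(n)]
```

With only two documents, terms appearing in both get idf = log2(2/2) = 0, so only the terms unique to one document keep nonzero weight.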
Simple selection

Simple selection: take w_ij weights above a certain threshold θ_j, binarize, and remove zero rows:

h_ij = Θ(w_ij − θ_j)

Calculate similarity using the cosine measure:

s_ij = Σ_k h_ik h_jk / ( Σ_k h_ik² · Σ_k h_jk² )^(1/2)
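These two steps can be sketched as follows; the threshold value and matrix entries are illustrative assumptions, and the step function Θ becomes a simple comparison:

```python
# Simple selection: threshold w_ij, binarize to h_ij, then compute
# the cosine similarity s_ij between the binary document vectors.
theta = 0.5                      # illustrative threshold

W = [[0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]

# h_ij = Theta(w_ij - theta): 1 if the weight exceeds the threshold
H = [[1 if w > theta else 0 for w in row] for row in W]

def cosine(a, b):
    """s_ij = sum_k a_k b_k / sqrt(sum_k a_k^2 * sum_k b_k^2)."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return num / den if den else 0.0

s = cosine(H[0], H[1])           # 0.0: the two rows share no terms
```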
Dimensionality reduction

Latent Semantic Analysis (LSA): use Singular Value Decomposition of the weight matrix W:

W = U Λ V^T

with U = eigenvectors of WW^T and V = eigenvectors of W^T W.

Remove small eigenvalues, recreate the reduced W, and calculate similarity:

s_ij = (W_i · W_j) / ( ||W_i|| ||W_j|| )
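A sketch of the LSA reduction with NumPy; the weight matrix and the cutoff k are illustrative assumptions, not values from the slides:

```python
import numpy as np

# LSA: SVD of W, truncate small singular values, rebuild the reduced
# matrix, then compare documents by cosine of their row vectors.
W = np.array([[1.0, 1.0, 2.0, 1.0, 1.0, 0.0],
              [2.0, 0.0, 2.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0, 2.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U diag(s) V^T

k = 2                                              # keep k largest singular values
W_red = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # reduced (recreated) W

def similarity(i, j, M):
    """s_ij = (W_i . W_j) / (||W_i|| ||W_j||)."""
    return (M[i] @ M[j]) / (np.linalg.norm(M[i]) * np.linalg.norm(M[j]))

s01 = similarity(0, 1, W_red)
```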
Kruskal’s Algorithm and Top-Down Clusterization
Modified Kruskal’s Algorithm and Bottom-Up Clusterization
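As a sketch, Kruskal's algorithm can be run over the documents with edge weight 1 − s_ij, so the most similar pairs are linked first; the union-find helper and all data here are my own illustrative additions, not the modified variant from the slides:

```python
# Kruskal's MST over documents: sort edges by weight, accept an edge
# only when it joins two different trees (tracked by union-find).
def kruskal_mst(n, edges):
    """edges: list of (weight, u, v); returns the MST edge list."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # joins two components
            parent[ru] = rv
            mst.append((u, v, w))
    return mst

# Illustrative similarities s_ij for 4 documents, as distances 1 - s_ij
sims = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.4,
        (1, 2): 0.3, (1, 3): 0.1, (2, 3): 0.8}
edges = [(1.0 - s, u, v) for (u, v), s in sims.items()]
mst = kruskal_mst(4, edges)                 # 3 edges spanning all 4 documents
```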
Reuters results
Method                     topics   clusters   accuracy
No dim. reduction            41       129       78.2%
LSA dim. red. 0.8 (476)      41       124       76.2%
LSA dim. red. 0.6 (357)      41       127       75.2%
Simple Selection             41       130       78.5%

Rank of W in SVD = 595
Results for EEE NTU Web pages
Method                     topics   clusters   accuracy
No dim. reduction            10       142       84.7%
LSA dim. red. 0.8 (467)      10       129       84.7%
LSA dim. red. 0.6 (350)      10       137       82.8%
Simple Selection             10       145       85.5%
Examples
TouchGraph LinkBrowser http://www.neuron.m4u.pl/search
Results for Summary Discharges
New experiments on medical texts.
10 classes and 10 documents per class:
Plain Doc-Word matrix ≈ 23%
Stop-List, TW-IDF, Simple Selection ≈ 64%
Concept Space ≈ 64%
Transformation ≈ 93%
Simple Word-Doc Vector Space
Meta-Map Concept Vector Space
Concept Vector Space after transformation
Summary
In real applications, a knowledge-based approach is needed to select only useful words and to parse people’s web pages.
Other visualization methods (such as MDS) may be explored.
People have many interests and thus may belong to several topic groups.
This could be a very useful tool for creating new shared-interest groups on the Internet.