
Minimum Spanning Trees Displaying Semantic Similarity

Włodzisław Duch & Paweł Matykiewicz

Department of Informatics, UMK Toruń

School of Computer Engineering, NTU Singapore

Cincinnati Children’s Hospital Research Foundation, OH, USA

Google: Duch

The Problem

Finding people who share some of our interests in large organizations or worldwide is difficult.

Analyzing people’s homepages and their lists of publications is a good way to find groups and individuals sharing common scientific interests.

Maps should display individuals and groups. The structure of graphical representations depends strongly on the selection of keywords or on dimensionality reduction.

The Data

Reuters-21578 dataset, with 5 categories and 1 – 176 elements per category.

124 Personal Web Pages of the School of Electrical and Electronic Engineering (EEE) of the Nanyang Technological University (NTU) in Singapore, with 5 categories (control, microelectronics, information, circuit, power), and 14 – 41 documents per category.

Document-word matrix

Document1: word1 word2 word3. word4 word3 word5.

Document2: word1 word3 word5. word1 word3 word6.

The matrix F (documents × word frequencies):

F = | 1 1 2 1 1 0 |
    | 2 0 2 0 1 1 |
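The toy example above can be reproduced with a short sketch. The tokenization here (strip periods, split on whitespace) is a simplifying assumption; real preprocessing would also handle stemming and stop words:

```python
from collections import Counter

# The two example documents and vocabulary from the slide.
docs = [
    "word1 word2 word3. word4 word3 word5.",
    "word1 word3 word5. word1 word3 word6.",
]
vocab = ["word1", "word2", "word3", "word4", "word5", "word6"]

def doc_word_matrix(documents, vocabulary):
    """Build the documents x word-frequencies matrix F."""
    rows = []
    for doc in documents:
        counts = Counter(doc.replace(".", "").split())
        rows.append([counts[w] for w in vocabulary])
    return rows

F = doc_word_matrix(docs, vocab)
print(F)  # [[1, 1, 2, 1, 1, 0], [2, 0, 2, 0, 1, 1]]
```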

Methods used

Inverse document frequency and term weighting.

Simple selection of relevant terms.

Latent Semantic Analysis (LSA) for dimensionality reduction.

Minimum Spanning Trees for visual representation.

TouchGraph XML visualization of MST trees.

Data Preparation

Normalize columns of F dividing by the highest word frequencies:

tf_ij = f_ij / max_i f_ij

Among n documents, term j occurs d_j times; inverse document frequency idf_j measures the uniqueness of term j:

idf_j = log₂(n / d_j) + 1,  d_j ≠ 0

tf × idf term weights:

w_ij = tf_ij · idf_j
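A minimal sketch of this weighting, assuming the formulas as reconstructed above: tf_ij = f_ij / max_i f_ij (column-wise normalization) and idf_j = log₂(n / d_j) + 1:

```python
import math

def tfidf_weights(F):
    """Compute w_ij = tf_ij * idf_j for a documents x terms count matrix F."""
    n = len(F)                 # number of documents
    n_terms = len(F[0])
    # Highest frequency of each term across documents (column maximum).
    col_max = [max(F[i][j] for i in range(n)) for j in range(n_terms)]
    # d_j: number of documents in which term j occurs.
    d = [sum(1 for i in range(n) if F[i][j] > 0) for j in range(n_terms)]
    W = []
    for i in range(n):
        row = []
        for j in range(n_terms):
            tf = F[i][j] / col_max[j] if col_max[j] else 0.0
            idf = math.log2(n / d[j]) + 1 if d[j] else 0.0
            row.append(tf * idf)
        W.append(row)
    return W

# The example matrix F from the earlier slide.
F = [[1, 1, 2, 1, 1, 0],
     [2, 0, 2, 0, 1, 1]]
W = tfidf_weights(F)
```

Terms occurring in only one of the two documents (e.g. word2, word6) get idf = 2, while shared terms get idf = 1.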

Simple selection

Simple selection: take w_ij weights above a certain threshold θ, binarize, and remove zero rows:

h_ij = Θ(w_ij − θ)

Calculate similarity using the cosine measure:

s_ij = Σ_k h_ik h_jk / (Σ_k h_ik² · Σ_k h_jk²)^(1/2)
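The selection-plus-cosine step might look like the sketch below; the threshold value and the small weight matrix are illustrative assumptions, not values from the slides:

```python
import math

def binarize(W, theta):
    """Simple selection: h_ij = 1 if w_ij > theta, else 0."""
    return [[1 if w > theta else 0 for w in row] for row in W]

def cosine(hi, hj):
    """Cosine similarity between two binary row vectors."""
    num = sum(a * b for a, b in zip(hi, hj))
    den = math.sqrt(sum(a * a for a in hi)) * math.sqrt(sum(b * b for b in hj))
    return num / den if den else 0.0

# Illustrative weights; theta = 0.6 is an assumed example threshold.
H = binarize([[0.5, 2.0, 1.0],
              [1.0, 0.0, 1.0]], theta=0.6)
s = cosine(H[0], H[1])   # the two rows share one selected term
```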

Dimensionality reduction

Latent Semantic Analysis (LSA): use Singular Value Decomposition on the weight matrix W:

W = U Λ Vᵀ

with U = eigenvectors of WWᵀ and V of WᵀW.

Remove small eigenvalues, recreate the reduced W, and calculate similarity:

s_ij = W_i · W_j / (‖W_i‖ ‖W_j‖)
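A sketch of the LSA step with NumPy: truncate the SVD to the k largest singular values, rebuild the reduced W, and take cosine similarities of its rows. The value of k and the example matrix are illustrative choices:

```python
import numpy as np

def lsa_similarity(W, k):
    """Truncated-SVD reduction of W followed by row-wise cosine similarity."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the k largest singular values and rebuild the reduced W.
    W_red = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    # Normalize rows so the dot product gives s_ij = W_i . W_j / (|W_i||W_j|).
    norms = np.linalg.norm(W_red, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    X = W_red / norms
    return X @ X.T

# The example count matrix from earlier, reduced to rank 1.
W = np.array([[1.0, 1.0, 2.0, 1.0, 1.0, 0.0],
              [2.0, 0.0, 2.0, 0.0, 1.0, 1.0]])
S = lsa_similarity(W, k=1)
```

With k = 1 both rows collapse onto the same direction, so their similarity becomes 1; larger k preserves more of the original distinctions.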

Kruskal’s Algorithm and Top-Down Clusterization

Modified Kruskal’s Algorithm and Bottom-Up Clusterization
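The standard Kruskal construction can be sketched as follows, assuming edges between documents are weighted by the dissimilarity 1 − s_ij (the slides' modified variant is not specified here, and the similarity matrix is illustrative):

```python
def kruskal_mst(sim):
    """Minimum spanning tree over a symmetric similarity matrix (Kruskal)."""
    n = len(sim)
    # Edge weight = dissimilarity; sort edges ascending.
    edges = sorted(
        (1.0 - sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))          # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # edge connects two components: keep it
            parent[ri] = rj
            mst.append((i, j, w))
    return mst

# Illustrative similarities: documents 0 and 1 are close, 2 is distant.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
tree = kruskal_mst(sim)   # an MST over n nodes has n - 1 = 2 edges
```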

Reuters results

Method                    topics  clusters  accuracy
No dim. red.                41      129      78.2%
LSA dim. red. 0.8 (476)     41      124      76.2%
LSA dim. red. 0.6 (357)     41      127      75.2%
Simple Selection            41      130      78.5%

W rank in SVD = 595

Results for EEE NTU Web pages

Method                    topics  clusters  accuracy
No dim. red.                10      142      84.7%
LSA dim. red. 0.8 (467)     10      129      84.7%
LSA dim. red. 0.6 (350)     10      137      82.8%
Simple Selection            10      145      85.5%

Examples

TouchGraph LinkBrowser http://www.neuron.m4u.pl/search

Results for Summary Discharges

New experiments on medical texts.

10 classes and 10 documents per class:

Plain Doc-Word matrix    ≈ 23%
Stop-List, TW-IDF, S.S.  ≈ 64%
Concept Space            ≈ 64%
Transformation           ≈ 93%

Simple Word-Doc Vector Space

Meta-Map Concept Vector Space

Concept Vector Space after transformation

Summary

In real applications a knowledge-based approach is needed to select only useful words and to parse personal web pages.

Other visualization methods (like MDS) may be explored.

People have many interests and thus may belong to several topic groups.

This could be a very useful tool for creating new shared-interest groups on the Internet.
