Gergely Palla - Extracting tag hierarchies

Upload: knowescape2013

Post on 15-Jan-2015

DESCRIPTION

Talk given by Gergely Palla at the first annual meeting of the KnowEscape COST Action

TRANSCRIPT

Page 1: Gergely Palla - Extracting tag hierarchies

Introduction | Benchmarks and testing | Results

Extracting tag hierarchies

Gergely Tibély, Péter Pollner, Tamás Vicsek and Gergely Palla

Statistical and Biological Physics Research Group, HAS (Eötvös University), Hungary

KnowEscape2013 Conference

G. Tibély, P. Pollner, T. Vicsek and G. Palla Extracting tag hierarchies

Page 2: Gergely Palla - Extracting tag hierarchies

Tags and tagging | The goal | Tag hierarchy extraction methods

Tags and tagging: blogs, news portals

Page 10: Gergely Palla - Extracting tag hierarchies

Tags and tagging: video and photo sharing

Page 12: Gergely Palla - Extracting tag hierarchies

Tagging and folksonomies

In some cases the emerging set of free tags is called a

FOLKSONOMY

- collaborative nature,
- tags are equal,
- no hierarchy,

→ The opposite of an ONTOLOGY.

Page 17: Gergely Palla - Extracting tag hierarchies

Tagging systems: How can we search?

Page 21: Gergely Palla - Extracting tag hierarchies

Tagging systems: Searching items

Page 23: Gergely Palla - Extracting tag hierarchies

Tagging systems: Searching

Page 24: Gergely Palla - Extracting tag hierarchies

Tagging systems: Searching tags

Page 30: Gergely Palla - Extracting tag hierarchies

The goal

Extracting a tag hierarchy!

Motivation:

- Help searching: if the tags are organised into a hierarchy, broadening or narrowing the scope is simple.
- Give recommendations about yet unvisited objects.

Page 31: Gergely Palla - Extracting tag hierarchies

Tag hierarchy extraction algorithms

- P. Heymann and H. Garcia-Molina, "Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems", Technical Report, Stanford InfoLab (2006).

- P. Schmitz, "Inducing Ontology from Flickr Tags", Proceedings of the 15th International Conference on World Wide Web (WWW) (2006).

- C. Van Damme, M. Hepp and K. Siorpaes, "FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies", Social Networks 2, 57–70 (2007).

- A. Plangprasopchok and K. Lerman, "Constructing Folksonomies from User-specified Relations on Flickr", Proceedings of the World Wide Web conference (2009).

Page 36: Gergely Palla - Extracting tag hierarchies

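Our methods: Basic outline

[Figure: three panels on five example tags (TAG 1 to TAG 5), labelled DEFINE LINKS, THRESHOLD and DETERMINE DIRECTION. In the last panel TAG 1 is drawn at the top, TAG 2 and TAG 3 on the second row, TAG 4 and TAG 5 on the third row.]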

Page 39: Gergely Palla - Extracting tag hierarchies

Benchmarks | Quality measures

How to test the method?

Benchmarks:

Gene Ontology:
- tag hierarchy: hierarchy of protein functions,
- tagged objects: proteins annotated by their known functions.

Synthetic benchmark:
- tag hierarchy: user defined,
- tagged objects: simulated tagging (random walk on the hierarchy).
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Ontologies and tag-statistics", New Journal of Physics 14, 053009 (2012).

Page 42: Gergely Palla - Extracting tag hierarchies

How to measure the quality?

[Figure: two tree diagrams on the same set of tags (0A; 1A to 1C; 2A to 2E; 3A to 3H), labelled EXACT and RECONSTRUCTED.]

Evaluation:

- fraction of correctly identified links, fraction of acceptable links, fraction of missing links, etc.
- Normalised Mutual Information: sensitive also to the position of the non-matching links.
  L. Danon et al., "Comparing community structure identification", J. Stat. Mech. P09008 (2005).
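The link-based scores listed on this slide can be computed from two parent maps. The following is a minimal sketch, not the authors' evaluation code. It assumes that both hierarchies are trees stored as hypothetical child → parent dictionaries (root mapped to `None`), that fractions are taken relative to the links of the exact hierarchy, and that an "acceptable" link is one whose reconstructed parent is any ancestor of the tag in the exact hierarchy (the slides do not define the term).

```python
def link_fractions(parent_exact, parent_recon):
    """Return (matching, acceptable) fractions relative to the exact links.

    Both arguments map each tag to its parent; the root maps to None.
    'Acceptable' is an assumed definition: the reconstructed parent is
    some ancestor of the tag in the exact hierarchy.
    """
    def exact_ancestors(tag):
        found = set()
        node = parent_exact[tag]
        while node is not None:
            found.add(node)
            node = parent_exact[node]
        return found

    # Every non-root tag corresponds to one link of the exact hierarchy.
    non_root = [t for t in parent_exact if parent_exact[t] is not None]
    matching = sum(1 for t in non_root
                   if parent_recon.get(t) == parent_exact[t])
    acceptable = sum(1 for t in non_root
                     if parent_recon.get(t) in exact_ancestors(t))
    return matching / len(non_root), acceptable / len(non_root)


exact = {"0A": None, "1A": "0A", "1B": "0A",
         "2A": "1A", "2B": "1A", "2C": "1B"}
recon = {"0A": None, "1A": "0A", "1B": "0A",
         "2A": "1A", "2B": "0A", "2C": "1A"}
matching, acceptable = link_fractions(exact, recon)
```

In this toy example tag 2B is attached to its grandparent (counted as acceptable but not matching) and tag 2C to an unrelated tag (counted as neither).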

Page 43: Gergely Palla - Extracting tag hierarchies

Normalised Mutual Information: Mathematical formulation

The probability of picking a tag at random from the descendants of tag $i$ in the exact hierarchy, $G_e$, and in the reconstructed hierarchy, $G_r$:

$$p_e(i) = \frac{|D_e(i)|}{N-1}, \qquad p_r(i) = \frac{|D_r(i)|}{N-1}.$$

The probability of picking a tag at random from the intersection of the two sets of descendants:

$$p_{e,r}(i) = \frac{|D_e(i) \cap D_r(i)|}{N-1}.$$

Based on this, the Normalised Mutual Information between the exact and reconstructed hierarchies:

$$I_{e,r} = -\frac{2\sum_{i=1}^{N} p_{e,r}(i) \ln\left(\frac{p_{e,r}(i)}{p_e(i)\,p_r(i)}\right)}{\sum_{i=1}^{N} p_e(i) \ln p_e(i) + \sum_{i=1}^{N} p_r(i) \ln p_r(i)} = -\frac{2\sum_{i=1}^{N} |D_e(i) \cap D_r(i)| \ln\left(\frac{|D_e(i) \cap D_r(i)|\,(N-1)}{|D_e(i)| \cdot |D_r(i)|}\right)}{\sum_{i=1}^{N} |D_e(i)| \ln\left(\frac{|D_e(i)|}{N-1}\right) + \sum_{i=1}^{N} |D_r(i)| \ln\left(\frac{|D_r(i)|}{N-1}\right)}.$$
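As an illustration of the formula on this slide, the sketch below evaluates $I_{e,r}$ for two small hierarchies. It is a hedged example under stated assumptions rather than the authors' implementation: both hierarchies are assumed to be trees over the same tag set, stored as hypothetical child → parent dictionaries, and terms with an empty descendant set are taken as zero (the usual $0 \ln 0 = 0$ convention).

```python
import math


def descendants(parent):
    """Map every tag to the set of its descendants (child -> parent input)."""
    desc = {tag: set() for tag in parent}
    for tag in parent:
        ancestor = parent[tag]
        while ancestor is not None:
            desc[ancestor].add(tag)
            ancestor = parent[ancestor]
    return desc


def hierarchy_nmi(parent_exact, parent_recon):
    """Normalised Mutual Information between two hierarchies on the same tags."""
    n = len(parent_exact)
    d_e = descendants(parent_exact)
    d_r = descendants(parent_recon)
    numerator = 0.0
    denominator = 0.0
    for tag in parent_exact:
        p_e = len(d_e[tag]) / (n - 1)
        p_r = len(d_r[tag]) / (n - 1)
        p_er = len(d_e[tag] & d_r[tag]) / (n - 1)
        if p_er > 0:  # implies p_e > 0 and p_r > 0
            numerator += p_er * math.log(p_er / (p_e * p_r))
        if p_e > 0:
            denominator += p_e * math.log(p_e)
        if p_r > 0:
            denominator += p_r * math.log(p_r)
    if denominator == 0:
        # Happens e.g. when both hierarchies are stars (root plus leaves only).
        raise ValueError("NMI is undefined for these hierarchies")
    return -2.0 * numerator / denominator


exact = {"0A": None, "1A": "0A", "1B": "0A",
         "2A": "1A", "2B": "1A", "2C": "1B", "2D": "1B"}
# Same shape, but the leaves are attached to the wrong branch.
swapped = {"0A": None, "1A": "0A", "1B": "0A",
           "2A": "1B", "2B": "1B", "2C": "1A", "2D": "1A"}
```

With these inputs, comparing `exact` with itself gives the maximal value 1, while `swapped` shares no descendant sets below the root and scores 0.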

Page 45: Gergely Palla - Extracting tag hierarchies

Normalised Mutual Information: Behaviour

[Figure: an example hierarchy (tags 0A; 1A, 1B; 2A to 2D; 3A to 3H) and a rearranged version of it, connected by the label RANDOMIZATION. Below, a plot of I against f, both axes running from 0 to 1, with three curves labelled "bottom up", "random" and "top down".]

Page 46: Gergely Palla - Extracting tag hierarchies

Hierarchy of protein functions | Flickr and IMDb | Synthetic data

Studied systems

Tagged proteins:
- 5,913,610 proteins from GO, annotated by
- 4,181 molecular functions.

Tagged photos:
- 1,519,030 photos from Flickr, tagged by
- 25,441 free English words.

Tagged films:
- 336,223 films from IMDb, tagged by
- 6,358 English keywords.

Synthetic benchmark:
- 2,000,000 virtual objects,
- 1,023 tags.

Page 47: Gergely Palla - Extracting tag hierarchies

Results: Hierarchy of protein functions

[Figure: two tree diagrams of protein function hierarchies, with nodes labelled by level and letter, from 0A at the top down to 5A to 5C.]

Page 48: Gergely Palla - Extracting tag hierarchies

Results: Hierarchy of protein functions

Measure                  algorithm A   algorithm B   P. H. & H. G.-M.   P. Schmitz
Matching links           21%           20%           19%                18%
Acceptable links         66%           52%           51%                65%
Normalised Mut. Info.    35%           30%           30%                30%
Linearised Mut. Info.    78%           75%           75%                75%

Page 49: Gergely Palla - Extracting tag hierarchies

Results: Tag hierarchy from Flickr data

Page 50: Gergely Palla - Extracting tag hierarchies

Results: Tag hierarchy from IMDb data

Page 51: Gergely Palla - Extracting tag hierarchies

Results: Synthetic benchmark, tagging by random walks

Input:
- the tag hierarchy,
- the distribution of the number of tags on the objects,
- the tag frequency distribution.

The tagging:
- the first tag at random,
- the rest:
  - with probability p: a short random walk on the DAG starting from the first tag,
  - with probability 1 − p: at random.
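The tagging procedure on this slide can be sketched in a few lines. This is a simplified illustration, not the benchmark used in the talk. The assumptions are: the hierarchy is a tree given as a hypothetical child → parent dictionary, the walk treats links as undirected and has a fixed length `walk_len`, the number of tags per object is a fixed `n_tags` rather than drawn from an input distribution, and "at random" means uniform over all tags rather than following the input tag frequency distribution.

```python
import random


def tag_object(parent, n_tags, p, walk_len, rng, max_tries=1000):
    """Assign up to n_tags distinct tags to one virtual object."""
    # Build child lists so the walk can move both up and down the tree.
    children = {}
    for tag, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(tag)
    all_tags = list(parent)

    first = rng.choice(all_tags)          # the first tag at random
    chosen = [first]
    for _attempt in range(max_tries):     # guard against endless retries
        if len(chosen) >= n_tags:
            break
        if rng.random() < p:
            # Short random walk starting from the first tag.
            current = first
            for _step in range(walk_len):
                neighbours = list(children.get(current, []))
                if parent[current] is not None:
                    neighbours.append(parent[current])
                if not neighbours:
                    break
                current = rng.choice(neighbours)
            candidate = current
        else:
            candidate = rng.choice(all_tags)  # unrelated tag at random
        if candidate not in chosen:
            chosen.append(candidate)
    return chosen


tree = {"0A": None, "1A": "0A", "1B": "0A",
        "2A": "1A", "2B": "1A", "2C": "1B", "2D": "1B"}
rng = random.Random(0)
tags = tag_object(tree, n_tags=3, p=0.5, walk_len=2, rng=rng)
```

A larger `p` makes the tags on an object cluster around the first tag in the hierarchy, which is the signal a hierarchy extraction method can pick up from co-occurrences.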

Page 62: Gergely Palla - Extracting tag hierarchies

Results: Synthetic benchmark

[Figure: four panels, a) to d).]

Page 63: Gergely Palla - Extracting tag hierarchies

Results: Synthetic benchmark

Measure                  algorithm A   algorithm B   P. H. & H. G.-M.   P. Schmitz
Matching links           31%           89%           48%                1%
Acceptable links         35%           91%           54%                2%
Normalised Mut. Info.    18%           83%           29%                1%
Linearised Mut. Info.    66%           97%           76%                5%

Page 64: Gergely Palla - Extracting tag hierarchies

Summary

- Tags are important in knowledge organisation.
- Tag-hierarchy extraction is an interesting problem with great potential for practical applications.
- We have set up a framework for tag-hierarchy extraction:
  - Benchmark systems for testing tag-hierarchy extraction algorithms can be found and can be created.
  - The mutual information provides a quality measure sensitive also to the position of the links in the hierarchy.

- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Ontologies and tag-statistics", New Journal of Physics 14, 053009 (2012).
- G. Tibély, P. Pollner, T. Vicsek and G. Palla, "Extracting tag hierarchies", accepted in PLoS ONE.

Page 65: Gergely Palla - Extracting tag hierarchies

Tagging by random walks | Mutual information

Further results from Flickr

[Figure: tag hierarchy diagram from the Flickr data. Node labels include winter, snow, ice, frost, cold, ski, snowboard, christmas, december, holiday, arctic, antarctica, svalbard, greenland, nunavut, walrus, igloo, iceberg and wisconsin.]

Page 66: Gergely Palla - Extracting tag hierarchies

Algorithm A: Details

- Pre-process: for a tag $i$, keep neighbours only if $n_{ij} \geq 0.4 \cdot \max(n_{ij})$.
- Link weight:
  - random estimate for the number of co-occurrences: $\langle n_{ij} \rangle_R = \frac{n_i n_j}{n}$,
  - link weight corresponds to the z-score: $z_{ij} = \frac{n_{ij} - \langle n_{ij} \rangle_R}{\sigma_R(n_{ij})}$.
- Threshold: keep only the strongest link on every tag.
- Direction: we assume that tag frequencies are higher close to the root,
  → for any given tag $i$, the strongest link is coming from its parent.
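The link-weighting and thresholding steps listed on this slide can be sketched as follows. This is an illustrative example only, not the authors' code. The slides do not define $\sigma_R$; the sketch assumes the tags are placed independently and uniformly at random, so that $n_{ij}$ follows a hypergeometric distribution with variance $n_i n_j (n - n_i)(n - n_j) / (n^2 (n - 1))$. The function names and the toy data are hypothetical, and the later steps of Algorithm A (exceptions, local roots, global assembly) are not covered.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_zscores(objects):
    """Return tag counts n_i, pair counts n_ij and z-scores z_ij."""
    n = len(objects)
    freq = Counter(tag for tags in objects for tag in tags)
    pair = Counter()
    for tags in objects:
        for a, b in combinations(sorted(tags), 2):
            pair[(a, b)] += 1
    z = {}
    for (a, b), n_ab in pair.items():
        expected = freq[a] * freq[b] / n          # <n_ij>_R = n_i n_j / n
        # Assumed null model: hypergeometric variance (not from the slides).
        variance = (freq[a] * freq[b] * (n - freq[a]) * (n - freq[b])
                    / (n * n * (n - 1)))
        if variance > 0:
            z[(a, b)] = (n_ab - expected) / math.sqrt(variance)
    return freq, pair, z


def strongest_links(pair, z, keep_ratio=0.4):
    """For every tag, pre-filter neighbours and keep the strongest link."""
    neighbours = {}
    for (a, b), n_ab in pair.items():
        neighbours.setdefault(a, {})[b] = n_ab
        neighbours.setdefault(b, {})[a] = n_ab
    best = {}
    for tag, counts in neighbours.items():
        cutoff = keep_ratio * max(counts.values())  # n_ij >= 0.4 * max(n_ij)
        kept = [other for other, c in counts.items()
                if c >= cutoff and tuple(sorted((tag, other))) in z]
        if kept:
            best[tag] = max(kept, key=lambda o: z[tuple(sorted((tag, o)))])
    return best


objects = [{"animal", "dog"}, {"animal", "dog"}, {"animal", "cat"},
           {"plant", "tree"}, {"plant"}, {"animal"}]
freq, pair, z = cooccurrence_zscores(objects)
best = strongest_links(pair, z)
```

Here `best` gives the proposed parent of each tag before any exception handling; for the frequent tag "animal" it points back at one of its own children, which is the situation the exception rule of Algorithm A is meant to resolve.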

Page 72: Gergely Palla - Extracting tag hierarchies

Tagging by random walksMutual information

Algorithm A: Details

Exception: when the given tag is also the proposed parent of its strongest neighbour,
  → in this case we choose the 2nd strongest neighbour as parent.

Local root: when the given tag is also the proposed parent for all of its strong neighbours.

Global assembly: the local root with the largest "entropy" becomes the global root; the rest of the local roots are linked in the order of their entropy.

Algorithm B: Details

Link weight:
  random estimate for the number of co-occurrences: $\langle n_{ij} \rangle_R = \frac{n_i n_j}{n}$,
  the link weight corresponds to the z-score: $z_{ij} = \frac{n_{ij} - \langle n_{ij} \rangle_R}{\sigma_R(n_{ij})}$.

Threshold: keep only links with $z_{ij} \ge 10$.

Centrality: calculate the eigenvector centrality based on the remaining weighted adjacency matrix.

Build the hierarchy:
  start from the lowest-centrality tags,
  choose the parent from the neighbours with higher centrality than the given tag,
  if there are several candidates, choose the most related one (according to the descendants already under the given tag).
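The steps of algorithm B can be sketched roughly as below, using NumPy. This is not the authors' implementation. It assumes a symmetric matrix `Z` of z-scores as input, and it interprets "most related" as the candidate with the largest summed z-score to the given tag and the tags already placed under it. That interpretation, like the function names, is an assumption made for illustration.

```python
import numpy as np


def eigenvector_centrality(A):
    """Leading eigenvector of a symmetric, non-negative weight matrix."""
    _, vectors = np.linalg.eigh(A)    # eigenvalues are returned in ascending order
    return np.abs(vectors[:, -1])     # eigenvector of the largest eigenvalue


def build_hierarchy(Z, z_min=10.0):
    """Return {child: parent} built from a symmetric z-score matrix Z."""
    A = np.where(Z >= z_min, Z, 0.0)  # threshold: keep links with z_ij >= z_min
    np.fill_diagonal(A, 0.0)
    c = eigenvector_centrality(A)
    n = len(c)

    parent = {}
    # under[i] holds tag i together with all tags already placed below it
    under = {i: {i} for i in range(n)}
    for i in (int(k) for k in np.argsort(c)):   # lowest centrality first
        candidates = [j for j in range(n) if A[i, j] > 0 and c[j] > c[i]]
        if not candidates:
            continue                  # no more central neighbour: i stays a root
        group = sorted(under[i])
        # ASSUMPTION: "most related" = largest total z-score to i and its descendants
        best = max(candidates, key=lambda j: A[j, group].sum())
        parent[i] = best
        # best is more central, so it is processed later and has no parent yet
        under[best] |= under[i]
    return parent
```

Because a parent always has strictly higher centrality than its child, the resulting structure cannot contain cycles.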

Mutual information and entropy

For discrete variables $x_i$ and $y_j$ with a joint probability distribution given by $p(x_i, y_j)$, the mutual information is defined as

$$I(x, y) \equiv \sum_i \sum_j p(x_i, y_j) \ln\left(\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}\right).$$

The entropies of the variables are usually formulated as

$$H(x) = -\sum_i p(x_i) \ln p(x_i), \qquad H(y) = -\sum_j p(y_j) \ln p(y_j).$$

Thus, the mutual information can also be given as

$$I(x, y) = H(x) + H(y) - H(x, y),$$

where $H(x, y) = -\sum_i \sum_j p(x_i, y_j) \ln p(x_i, y_j)$ is the joint entropy.

Based on this, the normalised mutual information in general is defined as

$$I_{\mathrm{norm}}(x, y) \equiv \frac{2\,I(x, y)}{H(x) + H(y)}.$$
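As a small numerical illustration of these definitions, the following Python sketch computes the normalised mutual information from a joint probability table using the identity $I = H(x) + H(y) - H(x, y)$. The function name and the array layout (rows index $x_i$, columns index $y_j$) are assumptions of the sketch.

```python
import numpy as np


def normalised_mutual_information(p_xy):
    """I_norm = 2 I(x, y) / (H(x) + H(y)) for a joint probability table."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)            # marginal distribution of x
    p_y = p_xy.sum(axis=0)            # marginal distribution of y

    def entropy(p):
        p = p[p > 0]                  # convention: 0 ln 0 = 0
        return -np.sum(p * np.log(p))

    h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())
    mutual_information = h_x + h_y - h_xy
    # Undefined (division by zero) when both variables are deterministic.
    return 2 * mutual_information / (h_x + h_y)
```

Two perfectly dependent variables give a value of 1, while independent variables give 0.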

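Comparing DAGs with mutual information

The probability for picking a tag at random from the descendants of tag $i$ in the exact hierarchy, $G_e$, and in the reconstructed hierarchy, $G_r$:

$$p_e(i) = \frac{|D_e(i)|}{N - 1}, \qquad p_r(i) = \frac{|D_r(i)|}{N - 1},$$

where $D_e(i)$ and $D_r(i)$ denote the sets of descendants of tag $i$ in $G_e$ and $G_r$, and $N$ is the number of tags.

The probability for picking a tag at random from the intersection of the two sets of descendants:

$$p_{e,r}(i) = \frac{|D_e(i) \cap D_r(i)|}{N - 1}.$$

Based on this, the Normalised Mutual Information between the exact and reconstructed hierarchies:

$$I_{e,r} = -\frac{2 \sum_{i=1}^{N} p_{e,r}(i) \ln\left(\frac{p_{e,r}(i)}{p_e(i)\,p_r(i)}\right)}{\sum_{i=1}^{N} p_e(i) \ln p_e(i) + \sum_{i=1}^{N} p_r(i) \ln p_r(i)} = -\frac{2 \sum_{i=1}^{N} |D_e(i) \cap D_r(i)| \ln\left(\frac{|D_e(i) \cap D_r(i)|\,(N - 1)}{|D_e(i)| \cdot |D_r(i)|}\right)}{\sum_{i=1}^{N} |D_e(i)| \ln\left(\frac{|D_e(i)|}{N - 1}\right) + \sum_{i=1}^{N} |D_r(i)| \ln\left(\frac{|D_r(i)|}{N - 1}\right)}.$$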
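The formula above can be evaluated directly from the descendant sets, as in the following Python sketch. The input format (a dictionary mapping each tag to its direct children) and the function names are assumptions made for illustration, and terms with an empty set are taken as zero by the usual $0 \ln 0 = 0$ convention.

```python
from math import log


def descendant_sets(children, tags):
    """Map every tag to the set of all tags reachable below it in the DAG."""
    result = {}
    for tag in tags:
        seen, stack = set(), list(children.get(tag, ()))
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(children.get(t, ()))
        result[tag] = seen
    return result


def hierarchy_nmi(children_e, children_r):
    """Normalised mutual information between an exact and a reconstructed DAG."""
    tags = set(children_e) | set(children_r)
    for mapping in (children_e, children_r):
        for kids in mapping.values():
            tags.update(kids)
    n = len(tags)
    d_e = descendant_sets(children_e, tags)
    d_r = descendant_sets(children_r, tags)

    numerator = 0.0
    denominator = 0.0
    for tag in tags:
        size_e, size_r = len(d_e[tag]), len(d_r[tag])
        common = len(d_e[tag] & d_r[tag])
        if common:
            numerator += common * log(common * (n - 1) / (size_e * size_r))
        if size_e:
            denominator += size_e * log(size_e / (n - 1))
        if size_r:
            denominator += size_r * log(size_r / (n - 1))
    if denominator == 0:
        # e.g. two star graphs: every non-empty descendant set has N - 1 tags
        raise ValueError("normalisation is zero; the measure is undefined")
    return -2 * numerator / denominator
```

For two identical hierarchies with at least two levels below the root the function returns 1.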

Results: In numbers

Hierarchy (target)                         GO subset    user defined
Data (input)                               proteins     simulated tagging

Matching links         algorithm A         21%          31%
                       algorithm B         20%          89%
                       P. H. & H. G.-M.    19%          48%
                       P. Schmitz          18%          1%

Acceptable links       algorithm A         66%          35%
                       algorithm B         52%          91%
                       P. H. & H. G.-M.    51%          54%
                       P. Schmitz          65%          2%

Normalised Mut. Info.  algorithm A         35%          18%
                       algorithm B         30%          83%
                       P. H. & H. G.-M.    30%          29%
                       P. Schmitz          30%          1%

Linearised Mut. Info.  algorithm A         78%          66%
                       algorithm B         75%          97%
                       P. H. & H. G.-M.    75%          76%
                       P. Schmitz          75%          5%
