mapping between taxonomies

Mapping Between Taxonomies Elena Eneva 11 Dec 2001 Advanced IR Seminar

Page 1: Mapping Between Taxonomies

Mapping Between Taxonomies

Elena Eneva

11 Dec 2001

Advanced IR Seminar

Page 2: Mapping Between Taxonomies

Mapping Between Taxonomies

Formal systems of orderly classification of knowledge, which are designed for a specific purpose

Companies organizing information in various ways (e.g. one for marketing, another for product development)

Page 3: Mapping Between Taxonomies

Approach

[Diagram: two taxonomies, By country (German, French) and By industry (Textile, Automobile)]

Page 7: Mapping Between Taxonomies

Approach

[Diagram: By industry taxonomy (Textile, Automobile)]

Page 8: Mapping Between Taxonomies

Approach

[Diagram: By industry taxonomy (Textile, Automobile), with placeholder documents ("abc") under the categories]

Page 10: Mapping Between Taxonomies

Approach

[Diagram: By country (German, French) and By industry (Textile, Automobile) taxonomies, with placeholder documents ("abc")]

Page 12: Mapping Between Taxonomies

Approach

[Diagram: By country (German, French) and By industry (Textile, Automobile) taxonomies, with two rows of placeholder documents ("abc")]

Page 13: Mapping Between Taxonomies

Datasets

Two classification schemes:

Reuters 2001 (807900 docs)
  Topics (127)
  Industry categories (871)
  Regions (376)

Hoovers-255 and Hoovers-28 (4286 docs)
  Industry categories (28)
  Industry categories (255)

Page 14: Mapping Between Taxonomies

Learning

2 separate methods of learning for the documents:
  Old doc category -> new doc category
  Doc contents -> new category

Combined method:
  Weighted average based on confidence
  Final result determined by a decision tree

One combined learner: used both old category and contents as features
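The first method above (old doc category -> new doc category) can be illustrated with a co-occurrence table. This is only a minimal sketch under stated assumptions: it is not the C4.5 or Naive Bayes learner used in the talk, the function names are hypothetical, and the training pairs are invented for illustration.

```python
from collections import Counter, defaultdict

def learn_category_mapping(pairs):
    """Count how often each old category co-occurs with each new category."""
    counts = defaultdict(Counter)
    for old_cat, new_cat in pairs:
        counts[old_cat][new_cat] += 1
    return counts

def predict_new_category(counts, old_cat):
    """Return (most frequent new category, its relative frequency) for an old category."""
    seen = counts.get(old_cat)
    if not seen:
        return None, 0.0  # old category never seen in training
    new_cat, n = seen.most_common(1)[0]
    return new_cat, n / sum(seen.values())

# Hypothetical (old category, new category) pairs, invented for illustration.
training_pairs = [
    ("German", "Automobile"),
    ("German", "Automobile"),
    ("German", "Textile"),
    ("French", "Textile"),
]
mapping = learn_category_mapping(training_pairs)
```

Calling `predict_new_category(mapping, "German")` returns the most frequent new category together with a relative frequency, which could serve as the confidence that a combined method weighs.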

Page 15: Mapping Between Taxonomies

Simple Learners

Simple Decision Tree (C4.5): learns probabilities of new categories based on 1 kind of feature
  Old categories (doesn’t know about documents/words)
  Word-based classification (doesn’t know about old categories)

Naïve Bayes (rainbow)
  Old categories (doesn’t know about documents/words)
  Word-based classification (doesn’t know about old categories)

Support Vector Machine (SVM-Light)
  Word-based classification (doesn’t know about old categories), linear kernel [results will be reported in the final paper]
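To make the word-based setting concrete, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. It is not the rainbow implementation named on the slide; the function names and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (list_of_words, new_category). Returns counts used for scoring."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(model, words):
    """Return the category with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy documents, invented for illustration.
toy_docs = [
    (["engine", "car", "factory"], "Automobile"),
    (["car", "wheel", "engine"], "Automobile"),
    (["cotton", "fabric", "loom"], "Textile"),
    (["fabric", "wool", "factory"], "Textile"),
]
nb_model = train_naive_bayes(toy_docs)
```

This learner sees only the words, so, like the word-based learners on the slide, it knows nothing about a document's old category.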

Page 16: Mapping Between Taxonomies

Learning

[Diagram: two separate learners (DT, NB, SVM), one using the document content and one using the document labels]

Page 17: Mapping Between Taxonomies

Combined Learners

Weighted Average: voting scheme

Combination Decision Tree: takes the outputs and confidences of two of the simple learners, predicts new category
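A confidence-weighted vote over the simple learners' outputs might look like the sketch below. The slides do not spell out the exact weighting, so summing confidences per category is an assumption, and `weighted_vote` is a hypothetical name.

```python
from collections import defaultdict

def weighted_vote(predictions):
    """predictions: list of (predicted_category, confidence) pairs, one per simple learner.

    Sums the confidences per category and returns the category with the largest total.
    """
    totals = defaultdict(float)
    for category, confidence in predictions:
        totals[category] += confidence
    return max(totals, key=totals.get)

# Invented example: the label-based learner and the word-based learner disagree.
combined = weighted_vote([("Automobile", 0.55), ("Textile", 0.80)])
```

The Combination Decision Tree would instead take the same (output, confidence) pairs as input features and learn how to resolve disagreements, rather than applying a fixed rule.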

Page 18: Mapping Between Taxonomies

Learning

[Diagram: two setups. Using both the content and the label: a single DT. Combining the two outputs: two learners (DT, NB, SVM) whose outputs are combined by voting or by a 3rd classifier]

Page 19: Mapping Between Taxonomies

Results: Words Only

5-fold cross validation

[Chart: % accuracy on 28p255 and 255p28 for words only NB and words only DT]

Page 20: Mapping Between Taxonomies

Results: Categories Only

5-fold cross validation

[Chart: % accuracy on 28p255 and 255p28 for categs only NB and categs only DT]

Page 21: Mapping Between Taxonomies

Results: Combination

5-fold cross validation

[Chart: % accuracy on 28p255 and 255p28 for Combination Vote and Combination Comb]

Page 22: Mapping Between Taxonomies

Results

% accuracy

words only     NB       DT
28p255         21.14    7.9
255p28         53.2     17.5

categs only    NB       DT
28p255         26.19    26.19
255p28         100      100

Combination    Vote     Comb
28p255         28.05    30.26
255p28         100      100
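The accuracies above are reported under 5-fold cross validation. As a rough sketch of that protocol (the actual fold assignment, shuffling, and tooling used for these experiments are not stated in the slides, and the function names here are hypothetical):

```python
def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k contiguous folds of nearly equal size."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(items, labels, train_fn, predict_fn, k=5):
    """Return mean % accuracy over k folds; each fold is held out once for testing.

    Assumes len(items) >= k so that no fold is empty.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(items), k):
        held_out = set(test_idx)
        train = [(items[i], labels[i]) for i in range(len(items)) if i not in held_out]
        model = train_fn(train)
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
        accuracies.append(100.0 * correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Here `train_fn` receives a list of (item, label) pairs and `predict_fn` receives the trained model and one item; in practice the data would usually be shuffled before being split into folds.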

Page 23: Mapping Between Taxonomies

Remarks

Hierarchy (old classes) usually ignored

Shown that it helps

Learners are not the issue

Better way of understanding

Old label (or hierarchy path) is meta data

Page 24: Mapping Between Taxonomies

Remaining Work

SVM results (running even as we speak)

Repeat experiments on Reuters-2001
  Internal hierarchies
  Missing labels
  Less correlated types of classes

Results in standard evaluation format

Page 25: Mapping Between Taxonomies

Future Work

Try with a web dataset (Google and Yahoo! hierarchies)

Hierarchies of more levels

Meta data (for non-text sources)

Page 26: Mapping Between Taxonomies

Related Literature

A Study of Approaches to Hypertext Categorization, Y. Yang, S. Slattery, and R. Ghani. Journal of Intelligent Information Systems, Volume 18, Number 2, March 2002 (to appear).

Learning Mappings between Data Schemas, A. Doan, P. Domingos, and A. Levy. Proceedings of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000, Austin, TX.

Page 27: Mapping Between Taxonomies

Questions and Suggestions

The end.

Page 28: Mapping Between Taxonomies

DT accuracy vs Vocabulary size

[Chart: % accuracy (train accuracy and test accuracy) against vocabulary size (10, 100, 500, 1000, 2000)]

Page 29: Mapping Between Taxonomies

Taxonomies

Formal systems of orderly classification of knowledge, which are designed for a specific purpose

Change of purpose, change of taxonomies

Businesses often need and keep the information in several structures

Important to be able to automatically map between taxonomies

Page 30: Mapping Between Taxonomies

Useful Mappings

Companies organizing information in various ways (e.g. one for marketing, another for product development)

Personal online bookmark classification

Search engines (e.g. Google <-> Yahoo)

EU Committee for Standardization “detailed overview of the existing taxonomies officially used in the EU, in order to derive general concepts such as: information organisation, properties, multilinguality, keywords, etc. and, last but not least, the mapping between.”