I-Know 2005 Experiments in clustering homogeneous XML documents to Validate an Existing Typology Thierry Despeyroux Yves Lechevallier Brigitte Trousse Anne-Marie Vercoustre Inria Projet Axis E_mail: [email protected]

I-Know 2005

Experiments in clustering homogeneous XML documents to validate an existing typology

Thierry Despeyroux
Yves Lechevallier
Brigitte Trousse
Anne-Marie Vercoustre

Inria Projet Axis

E_mail: [email protected]


Scientific Activity Report at Inria


Homogeneous presentation


Some activity report (RA) figures

• 146 files

• 229,000 lines of text

• 14.8 MB of data

• one DTD

• Optional sections

• Free style and content


Grouping by Themes (2003)


Grouping by Themes (2004)


Problem

• Presentation by research themes

• The themes vary over time

• The grouping is not politically neutral (funding, evaluation)

• Is there any natural grouping?

• What is the role of different parts of the report in highlighting the themes?


Methodology

1. Select specific parts by using the XML structure

2. Select significant words by using a tool for part-of-speech tagging and lemmatization (TreeTagger)

3. Cluster the documents into disjoint clusters

4. Evaluate those clusters
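
Step 1 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' tooling: the element names (`raweb`, `presentation`, `foundations`, `keyword`, `bibliography`) are hypothetical stand-ins for whatever the activity report DTD actually defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical activity report; the tag names are assumptions, not the real DTD.
SAMPLE = """
<raweb project="A3">
  <presentation><p>Designs, methods and tools for code analysis.</p></presentation>
  <foundations>
    <keyword>compiler</keyword>
    <p>Program transformation techniques.</p>
  </foundations>
  <bibliography><p>Not selected.</p></bibliography>
</raweb>
"""


def select_text(root, section_tags):
    """Return the text found under the given section elements only."""
    chunks = []
    for tag in section_tags:
        for section in root.iter(tag):
            # itertext() walks every nested text node of the section;
            # split/join normalises the whitespace left by indentation.
            chunks.append(" ".join(" ".join(section.itertext()).split()))
    return " ".join(chunks)


root = ET.fromstring(SAMPLE.strip())

# Text of one section only.
presentation_text = select_text(root, ["presentation"])

# Keywords that appear inside the foundations sections only.
foundation_keywords = [
    kw.text for section in root.iter("foundations") for kw in section.iter("keyword")
]
```

Selecting by element name keeps the rest of the pipeline independent of the report layout: changing which parts feed the clustering is a matter of passing a different list of tags.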


Various experiments

• K-F: keywords from the foundations sections

• K-all: all keywords

• T-P: text of the presentation section

• T-PF: text of the presentation and foundations sections

• T-C: names of conferences, workshops, congresses, etc. in the bibliography


TreeTagger

[Figure labels: XML, Tree Tagger]

A3 presentation a3 JJ <unknown>
A3 presentation designs NNS design
A3 presentation methods NNS method
A3 presentation and CC and
A3 presentation tools NNS tool
A3 presentation used VVN use
A3 presentation by IN by
A3 presentation compilers NNS compiler
A3 presentation or CC or
A3 presentation users NNS user
A3 presentation for IN for
A3 presentation code NN code
A3 presentation analysis NN analysis
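
The sample output above appears to carry five whitespace-separated fields per token: project, section, word, part-of-speech tag, and lemma. A minimal sketch of turning such output into per-project lemma counts is given below. Keeping only noun tags (`NN`, `NNS`) is an assumption made for illustration; the slides do not say which categories were retained.

```python
from collections import Counter, defaultdict

# A few lines in the five-column layout shown on the slide.
TAGGED = """\
A3 presentation a3 JJ <unknown>
A3 presentation designs NNS design
A3 presentation methods NNS method
A3 presentation and CC and
A3 presentation tools NNS tool
A3 presentation used VVN use
A3 presentation code NN code
A3 presentation analysis NN analysis
"""

KEPT_PREFIXES = ("NN",)  # assumption: keep nouns only


def lemma_counts(tagged_text, kept_prefixes=KEPT_PREFIXES):
    """Count lemmas per project, keeping only the selected tag families."""
    counts = defaultdict(Counter)
    for line in tagged_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        project, _section, word, tag, lemma = fields
        if not tag.startswith(kept_prefixes):
            continue  # drop conjunctions, prepositions, verbs, ...
        if lemma == "<unknown>":
            lemma = word  # fall back to the surface form
        counts[project][lemma] += 1
    return counts


counts = lemma_counts(TAGGED)
```

Counting lemmas rather than surface forms merges `designs` and `design` into one vocabulary entry, which keeps the document vectors smaller.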


Clustering Method

The objective of the third step is to cluster the documents into a set of disjoint classes, using the vocabularies selected for the five experiments.

We use a partition method close to the k-means algorithm where the distance between documents is based on the word frequency.
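
The slides describe the method only as close to k-means with a frequency-based distance. The sketch below is therefore plain k-means on relative word frequencies with squared Euclidean distance, which is an assumption and not the authors' exact algorithm. The project names, word counts, and the `init_indices` parameter are hypothetical.

```python
from collections import Counter


def to_vector(counts, vocabulary):
    """Relative frequency of each vocabulary word in one document."""
    total = sum(counts.values()) or 1
    return [counts.get(w, 0) / total for w in vocabulary]


def sq_dist(a, b):
    """Squared Euclidean distance between two frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, init_indices, max_iter=100):
    """Return one cluster index per vector (disjoint partition)."""
    centers = [vectors[i] for i in init_indices]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # partition is stable
        assignment = new_assignment
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment


# Hypothetical word counts for four projects.
docs = {
    "P1": Counter(network=4, processor=2),
    "P2": Counter(network=3, traffic=3),
    "P3": Counter(image=5, vision=1),
    "P4": Counter(image=2, vision=4),
}
vocabulary = sorted({w for c in docs.values() for w in c})
vectors = [to_vector(c, vocabulary) for c in docs.values()]
labels = kmeans(vectors, k=2, init_indices=[0, 2])
```

Fixing the initial centers makes the example reproducible; in practice the result of k-means depends on the initialization, so several starts are usually compared.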


K-F-a experiment: list of representative Keywords

Class 1: 3d, approximation, computer, differential, environment, modeling, processing, programming, vision

Class 2: computing, equation, grid, problem, transformation

Class 3: code, design, event, network, processor, time, traffic

Class 4: calculus, database, datum, image, indexing, information, integration, knowledge, logic, mining, pattern, recognition, user, web

For each cluster, a list of its most representative words can be produced. These words can be read as a summary of the class.
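
The slides do not state how representative words are chosen. One plausible criterion, used here purely as an assumption, is to rank words by how much more frequent they are inside a cluster than in the whole collection. The counts below are hypothetical.

```python
from collections import Counter


def representative_words(cluster_counts, top=3):
    """Rank each cluster's words by in-cluster frequency minus overall frequency."""
    overall = Counter()
    for counts in cluster_counts.values():
        overall.update(counts)
    overall_total = sum(overall.values())

    result = {}
    for name, counts in cluster_counts.items():
        total = sum(counts.values())
        score = {
            w: counts[w] / total - overall[w] / overall_total for w in counts
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(score, key=lambda w: (-score[w], w))
        result[name] = ranked[:top]
    return result


# Hypothetical aggregated word counts for two clusters.
clusters = {
    "Class 3": Counter(network=6, processor=3, problem=1),
    "Class 4": Counter(image=5, database=4, problem=1),
}
summary = representative_words(clusters, top=2)
```

A word such as `problem`, equally common in every cluster, scores zero and drops out of the summaries.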


Distribution of clusters compared to the 2003 themes


Distribution of the 2003 themes compared to clusters

[Chart: one bar per cluster (Cluster_1 to Cluster_4), vertical axis from 0 to 45, series Theme 1 to Theme 4]


Partition of projects

Themes Cluster_1 Cluster_2 Cluster_3 Cluster_4

1a A3 Apache Arles Caps Compsys Grand-Large Paris POPS R2D2 ReMap Regal Runtime Sardes

Jacquard

Tropics

1b AlGorille armor Mascotte aces Reso

Gyroweb

1c ADEPT DaRT Espresso trio Moscova Ostre tick Triskell Pop-Art Vasy VerTeCs mimosa s4

2a Compose Protheo Contraintes LogiCal Cristal Obasco miro Lemme lande oasis SECSI calligramme cassis modbio

2b Algo Arenaire Cafe Spaces Adage tanc coprin geometrica galaad


Partition of projects

Themes Cluster_1 Cluster_2 Cluster_3 Cluster_4

3a Eiffel HELIX LeD METISS MAIA Merlin Cordial

Parole

Symbiose

ECOO ACACIA ATLAS AXIS Cordial Gemo in-situ

MOSTRARE Orion PRIMA Smis WAM

TEXMEX Cortex Orpailleur WAM I3D

Atoll EXMO

DREAM SIGNES

3b Air2 Ariana IPARLA ALCOVE EVASION TEXMEX TEMICS Epidaure ISA LEAR

Mirages Odysee Imedia e-motion

PRIMA REVES siames VISTA artis

4a BIPOP COMORE Miaou corida IS2 CONGE Fractales NUMOPT Metalau Sydoco

Scilab macsi Imara Icare Sigma2

4b ALADIN Bang Estime IDOPT Macs Mathfi Micmac OMEGA Opale Caiman Calvi Smash

ScAIApplix sagep


External evaluation

The quality of the clusters can be evaluated by comparing them with the two lists of themes used by INRIA.

$n_{ik}$ is the number of research projects whose report is classed in cluster $U_i$ and that are allocated to group $C_k$ (theme k).

$n_{i.}$ is the number of research reports in cluster $U_i$, $n_{.k}$ is the number of research projects allocated to group $C_k$, and $n$ is the total number of research projects analysed.
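
The counts defined above form a contingency table with one row per cluster and one column per theme. A minimal sketch of building it from two labelings follows; the project names and labels are hypothetical.

```python
from collections import Counter


def contingency(cluster_of, theme_of):
    """Build the cluster-by-theme count table from two project labelings."""
    projects = sorted(cluster_of)
    clusters = sorted(set(cluster_of.values()))
    themes = sorted(set(theme_of.values()))
    pair_counts = Counter((cluster_of[p], theme_of[p]) for p in projects)
    # table[i][k] = number of projects in cluster i and theme k
    table = [[pair_counts[(u, c)] for c in themes] for u in clusters]
    return clusters, themes, table


# Hypothetical labelings for five projects.
cluster_of = {"P1": 1, "P2": 1, "P3": 2, "P4": 2, "P5": 2}
theme_of = {"P1": "A", "P2": "B", "P3": "B", "P4": "B", "P5": "A"}

clusters, themes, n = contingency(cluster_of, theme_of)
row_totals = [sum(row) for row in n]        # projects per cluster
col_totals = [sum(col) for col in zip(*n)]  # projects per theme
total = sum(row_totals)                     # all projects analysed
```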


Two evaluation measures

The F-measure proposed by (Jardine and Rijsbergen, 1963) combines the precision and recall measures between $U_i$ and $C_k$.

• recall is defined by $R(i,k) = n_{ik} / n_{i.}$

• precision is defined by $P(i,k) = n_{ik} / n_{.k}$

The F-measure between the a priori partition C into K groups and the partition U of INRIA projects produced by the clustering method is:

$$F = \sum_{k=1}^{K} \frac{n_{.k}}{n} \, \max_{i} \frac{2 \, R(i,k) \, P(i,k)}{R(i,k) + P(i,k)}$$
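
A small sketch of this F-measure computed from a contingency table is given below. It reads the formula as a sum over themes, weighted by theme size, of the best harmonic mean of recall and precision over clusters; that reading, and the example tables, are assumptions. Recall and precision follow the definitions on the slide, and since the harmonic mean is symmetric in the two, swapping their names would not change the score.

```python
def f_measure(n):
    """F-measure of a clustering against a reference partition.

    n[i][k] is the number of projects in cluster i and theme k.
    """
    row = [sum(r) for r in n]        # cluster sizes
    col = [sum(c) for c in zip(*n)]  # theme sizes
    total = sum(row)

    score = 0.0
    for k, n_k in enumerate(col):
        best = 0.0
        for i, n_i in enumerate(row):
            if n[i][k] == 0:
                continue  # no overlap: this pair contributes nothing
            recall = n[i][k] / n_i
            precision = n[i][k] / n_k
            best = max(best, 2 * recall * precision / (recall + precision))
        # Each theme contributes its best match, weighted by theme size.
        score += n_k / total * best
    return score


perfect = f_measure([[2, 0], [0, 2]])   # clusters reproduce the themes
mixed = f_measure([[1, 1], [1, 1]])     # clusters cut across the themes
```

A value of 1 means every theme is matched exactly by some cluster; lower values indicate that themes are split or merged by the clustering.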

The corrected Rand index (CR), proposed by Hubert and Arabie (1985), compares two partitions.
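
The slide does not give the formula. The sketch below uses the usual form of the Hubert and Arabie index, computed from pair counts in the contingency table: the observed number of agreeing pairs, minus its expected value under random labeling, divided by the maximum minus the expected value. The example tables are hypothetical.

```python
from math import comb


def corrected_rand(n):
    """Corrected (adjusted) Rand index from a contingency table n[i][k]."""
    row = [sum(r) for r in n]
    col = [sum(c) for c in zip(*n)]
    total = sum(row)
    if total < 2:
        return 1.0  # degenerate case: no pairs to compare

    pairs_cells = sum(comb(x, 2) for r in n for x in r)  # pairs together in both
    pairs_rows = sum(comb(x, 2) for x in row)            # pairs together in clusters
    pairs_cols = sum(comb(x, 2) for x in col)            # pairs together in themes

    expected = pairs_rows * pairs_cols / comb(total, 2)
    maximum = (pairs_rows + pairs_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (pairs_cells - expected) / (maximum - expected)


identical = corrected_rand([[2, 0], [0, 2]])
crossed = corrected_rand([[1, 1], [1, 1]])
```

The index is 1 for identical partitions and close to 0 for partitions that agree no more than chance would predict; it can be negative when they agree less than chance.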


Results

                 Themes 2003     Sub-themes 2003   Themes 2004
Exp.      K      F      Rand     F      Rand       F      Rand
K-F-a     4      0.53   0.14     0.38   0.09       0.46   0.11
K-F-b     5      0.44   0.05     0.35   0.06       0.37   0.03
K-F-c     9      0.42   0.10     0.37   0.08       0.43   0.12
K-all-a   4      0.52   0.17     0.36   0.09       0.47   0.15
K-all-b   5      0.53   0.17     0.37   0.10       0.54   0.22
K-all-c   9      0.46   0.13     0.40   0.12       0.38   0.10
T-P-a     4      0.55   0.19     0.40   0.14       0.50   0.19
T-P-b     5      0.45   0.11     0.42   0.12       0.47   0.15
T-P-c     9      0.44   0.11     0.45   0.16       0.44   0.14
T-PF-a    4      0.66   0.32     0.49   0.27       0.50   0.21
T-PF-b    5      0.56   0.22     0.43   0.18       0.51   0.20
T-PF-c    9      0.48   0.22     0.55   0.29       0.46   0.19
T-C-a     4      0.51   0.15     0.39   0.15       0.50   0.21
T-C-b     5      0.44   0.18     0.45   0.24       0.47   0.17
T-C-c     9      0.45   0.13     0.47   0.21       0.45   0.15


Conclusion

• Combination of selection by structure and by linguistic terms

• Evaluation of clustering compared to an existing typology

• The quality of clustering strongly depends on which parts of the activity reports are selected (which in turn indicates where the reports could be improved)

Future work:

• Measuring the stability of clusters when K varies

• Evolution of classes over time

• Experiments with other collections