Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology


Post on 03-Feb-2016


DESCRIPTION

Thierry Despeyroux, Yves Lechevallier, Brigitte Trousse, Anne-Marie Vercoustre. Inria, Projet Axis. E-mail: firstname.surname@inria.fr. Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology. Scientific activity report at Inria. Homogeneous presentation.

TRANSCRIPT

  • Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology

    Thierry Despeyroux, Yves Lechevallier, Brigitte Trousse, Anne-Marie Vercoustre

    Inria, Projet Axis. E-mail: firstname.surname@inria.fr

  • Scientific Activity Report at Inria

  • Homogeneous presentation

  • Some activity report (RA) figures

    • 146 files
    • 229,000 text lines
    • 14.8 MB of data
    • One DTD
    • Optional sections
    • Free style and content

  • Grouping by Themes (2003)

  • Grouping by Themes (2004)

  • Problem

    • Presentation by research themes
    • This presentation varies over time
    • It is not politically neutral (funding, evaluation)

    Is there any natural grouping? What is the role of the different parts of the report in highlighting the themes?

  • Methodology

    1. Select specific parts of the reports by using the XML structure
    2. Select significant words by using a tool for syntactic typing and stemming (TreeTagger)
    3. Cluster the documents into disjoint clusters
    4. Evaluate those clusters

  • Various experiments

    • K-F: keywords from the foundations sections
    • K-all: all keywords
    • T-P: text in the presentation section
    • T-PF: text in the presentation and foundations sections
    • T-C: names of conferences, workshops, congresses etc. in the bibliography
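The selection by XML structure behind experiments such as K-F, K-all, T-P and T-PF can be sketched as follows. This is only an illustration: the element names (`raweb`, `presentation`, `foundations`, `keyword`, `results`) and the sample report are assumptions, not the actual activity report DTD, and T-C (conference names from the bibliography) is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical report; element names are assumptions, not the real Inria DTD.
SAMPLE = """<raweb>
  <presentation><p>Designs, methods and tools used by compilers.</p></presentation>
  <foundations>
    <keyword>code analysis</keyword>
    <keyword>compilation</keyword>
    <p>Static analysis of programs.</p>
  </foundations>
  <results><p>Not selected in these experiments.</p></results>
</raweb>"""


def section_text(root, tags):
    """Concatenate all text found under the given section elements."""
    parts = []
    for tag in tags:
        for section in root.iter(tag):
            parts.append(" ".join(t.strip() for t in section.itertext() if t.strip()))
    return " ".join(parts)


def keywords(root, scope=None):
    """Return keyword strings, optionally restricted to one section tag."""
    roots = root.iter(scope) if scope else [root]
    return [k.text.strip() for r in roots for k in r.iter("keyword") if k.text]


root = ET.fromstring(SAMPLE)
selections = {
    "K-F": keywords(root, "foundations"),
    "K-all": keywords(root),
    "T-P": section_text(root, ["presentation"]),
    "T-PF": section_text(root, ["presentation", "foundations"]),
}
```

Each entry of `selections` would then be passed to the word-selection step; other sections of the report are ignored.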

  • TreeTagger

    From XML to TreeTagger output (columns: project, section, word, part-of-speech tag, lemma):

    A3  presentation  a3         JJ
    A3  presentation  designs    NNS  design
    A3  presentation  methods    NNS  method
    A3  presentation  and        CC   and
    A3  presentation  tools      NNS  tool
    A3  presentation  used       VVN  use
    A3  presentation  by         IN   by
    A3  presentation  compilers  NNS  compiler
    A3  presentation  or         CC   or
    A3  presentation  users      NNS  user
    A3  presentation  for        IN   for
    A3  presentation  code       NN   code
    A3  presentation  analysis   NN   analysis
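A minimal sketch of how tagged output of this kind might be filtered into significant words. TreeTagger writes one token per line as word, part-of-speech tag and lemma separated by tabs; the choice of tags to keep (`KEPT_PREFIXES`) is an assumption, since the slides do not say which categories were retained.

```python
# Sample lines in TreeTagger's tab-separated format: word, POS tag, lemma.
TAGGED = """designs\tNNS\tdesign
methods\tNNS\tmethod
and\tCC\tand
tools\tNNS\ttool
used\tVVN\tuse
by\tIN\tby
compilers\tNNS\tcompiler"""

# Assumption: keep nouns, adjectives and verbs; function words are dropped.
KEPT_PREFIXES = ("NN", "JJ", "VV")


def significant_lemmas(tagged_text, kept=KEPT_PREFIXES):
    """Return the lemmas of tokens whose tag starts with one of the kept prefixes."""
    lemmas = []
    for line in tagged_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip malformed or empty lines
        word, tag, lemma = fields
        if tag.startswith(kept):
            # Fall back to the lower-cased word when the lemma is unknown.
            lemmas.append(word.lower() if lemma == "<unknown>" else lemma)
    return lemmas


vocabulary_words = significant_lemmas(TAGGED)
```

Using lemmas rather than surface forms merges variants such as "design" and "designs" into a single vocabulary entry.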

  • Clustering method

    The objective of the third step is to cluster the documents into a set of disjoint classes, using the vocabularies selected for the five experiments.

    We use a partitioning method close to the k-means algorithm, in which the distance between documents is based on word frequencies.
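As an illustration of this kind of partitioning, here is a plain k-means sketch over relative word frequencies. It is not the authors' exact method: the Euclidean distance, the deterministic seeding with the first k documents and the toy documents are all assumptions.

```python
from collections import Counter
import math


def tf_vector(words, vocabulary):
    """Relative word frequencies of one document over a fixed vocabulary."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocabulary) or 1
    return [counts[w] / total for w in vocabulary]


def distance(u, v):
    """Euclidean distance between two frequency vectors (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: distance(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    ["code", "compiler", "code", "analysis"],
    ["image", "vision", "image"],
    ["compiler", "analysis", "code"],
    ["vision", "image", "recognition"],
]
vocabulary = sorted({w for d in docs for w in d})
vectors = [tf_vector(d, vocabulary) for d in docs]
labels = kmeans(vectors, k=2)
```

Every document receives exactly one label, so the resulting classes are disjoint, as the method requires.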

  • K-F-a experiment: list of representative keywords

    • Class 1: 3d approximation, computer, differential, environment, modeling, processing, programming, vision
    • Class 2: computing, equation, grid, problem, transformation
    • Class 3: code, design, event, network, processor, time, traffic
    • Class 4: calculus, database, datum, image, indexing, information, integration, knowledge, logic, mining, pattern, recognition, user, web

    Each cluster can be associated with a list of its most representative words. Those words can be interpreted as a summary of the class.
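One possible way to derive such lists is sketched below. The scoring rule (a word's share inside the cluster minus its share in the whole corpus) is an assumption chosen for illustration; the slides do not state how representativeness was measured, and the toy documents are invented.

```python
from collections import Counter


def representative_words(docs, labels, top=3):
    """Rank words by (share in the cluster) minus (share in the whole corpus)."""
    overall = Counter(w for d in docs for w in d)
    n_all = sum(overall.values())
    result = {}
    for c in sorted(set(labels)):
        inside = Counter(w for d, l in zip(docs, labels) if l == c for w in d)
        n_in = sum(inside.values())
        score = {w: inside[w] / n_in - overall[w] / n_all for w in inside}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(score, key=lambda w: (-score[w], w))
        result[c] = ranked[:top]
    return result


docs = [
    ["code", "compiler", "code", "network"],
    ["image", "vision", "image", "network"],
    ["compiler", "code", "processor"],
    ["vision", "image", "recognition"],
]
labels = [0, 1, 0, 1]
summary = representative_words(docs, labels, top=2)
```

A word such as "network", which is spread evenly over both clusters, scores zero and does not appear in either summary.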

  • Distribution of clusters across the 2003 themes

  • Distribution of the 2003 themes across clusters

    [Charts: number of projects per cluster and per theme, Theme 1 to Theme 4]

    Number of projects per cluster and theme (4 themes)

    Cluster    Total  Theme 1  Theme 2  Theme 3  Theme 4
    Total        139       39       47       23       30
    Cluster 1     42        0       11        9       22
    Cluster 2     31       24        3        0        4
    Cluster 3     31        0       29        0        2
    Cluster 4     35       15        4       14        2

    F-measure between clusters and themes (4 themes)

    Cluster    Theme 1  Theme 2  Theme 3  Theme 4
    Cluster 1     0.00     0.25     0.28     0.61
    Cluster 2     0.69     0.08     0.00     0.13
    Cluster 3     0.00     0.74     0.00     0.07
    Cluster 4     0.41     0.10     0.48     0.06

    Number of projects per cluster and theme (9 themes)

    Theme      Total    1    2    3    4    5    6    7    8    9
    Total        139   15   28   12    9   13   18   14   16   14
    Cluster 1     42    0    9    0    9    0    2   12   10    0
    Cluster 2     31   13    1   11    0    1    1    2    2    0
    Cluster 3     31    0   14    0    0    0   15    0    2    0
    Cluster 4     35    2    4    1    0   12    0    0    2   14

    F-measure between clusters and themes (9 themes)

    Theme         1     2     3     4     5     6     7     8     9
    Cluster 1  0.00  0.26  0.00  0.35  0.00  0.07  0.43  0.34  0.00
    Cluster 2  0.57  0.03  0.51  0.00  0.05  0.04  0.09  0.09  0.00
    Cluster 3  0.00  0.47  0.00  0.00  0.00  0.61  0.00  0.09  0.00
    Cluster 4  0.08  0.13  0.04  0.00  0.50  0.00  0.00  0.08  0.57

    Number of projects per cluster and theme (5 themes)

    Cluster    Total  Theme 1  Theme 2  Theme 3  Theme 4  Theme 5
    Total        139       35       29        8       36       31
    Cluster 1     42        0        9        4       20        9
    Cluster 2     31       18        0        0       11        2
    Cluster 3     31        0        7        2        3       19
    Cluster 4     35       17       13        2        2        1

    F-measure between clusters and themes (5 themes)

    Cluster    Theme 1  Theme 2  Theme 3  Theme 4  Theme 5
    Cluster 1     0.00     0.25     0.16     0.51     0.25
    Cluster 2     0.55     0.00     0.00     0.33     0.06
    Cluster 3     0.00     0.23     0.10     0.09     0.61
    Cluster 4     0.49     0.41     0.09     0.06     0.03

  • Partition of projects

  • Partition of projects

  • External evaluation

    The quality of the clusters can be evaluated by comparing the resulting clusters with the two lists of themes used by INRIA.

    $n_{ij}$ is the number of research projects whose report is classed in cluster $U_i$ and which are allocated to group $C_j$ (theme $j$).

    $n_{i.}$ is the number of research reports in cluster $U_i$, $n_{.k}$ is the number of research projects allocated to group $C_k$, and $n$ is the total number of research projects analysed.
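The counts defined here can be obtained from two labelings of the same projects, as in this small sketch. The function name `contingency` and the toy labels are hypothetical and serve only to illustrate the notation.

```python
from collections import Counter


def contingency(clusters, themes):
    """Return (n_ij, n_i., n_.j, n) for two labelings of the same projects."""
    if len(clusters) != len(themes):
        raise ValueError("both labelings must cover the same projects")
    n_ij = Counter(zip(clusters, themes))  # joint counts per (cluster, theme)
    n_i = Counter(clusters)                # row margins: projects per cluster
    n_j = Counter(themes)                  # column margins: projects per theme
    return n_ij, n_i, n_j, len(clusters)


# Toy example: six projects, three clusters, two themes.
clusters = [1, 1, 2, 2, 2, 3]
themes = ["A", "B", "A", "A", "B", "B"]
n_ij, n_i, n_j, n = contingency(clusters, themes)
```

A `Counter` returns 0 for pairs that never occur, which matches the empty cells of a contingency table.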

  • Two evaluation measures

    The F-measure proposed by Jardine and van Rijsbergen (1971) combines the precision and recall measures between $U_i$ and $C_k$ as their harmonic mean, $F(i,k) = 2 R(i,k) P(i,k) / (R(i,k) + P(i,k))$. Recall is defined by $R(i,k) = n_{ik} / n_{i.}$ and precision by $P(i,k) = n_{ik} / n_{.k}$.

    An overall F-measure is then computed between the a priori partition U into K groups and the partition C of the INRIA projects obtained by the clustering method.

    The corrected Rand index (CR) proposed by Hubert and Arabie (1985) is used to compare two partitions.
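Both measures can be computed from a contingency table, as in the sketch below. The pairwise F and the corrected Rand index follow their standard definitions. The aggregate in `overall_f` (a size-weighted sum, over rows, of the best F in each row) is one common form and is an assumption here, because the overall formula itself does not appear in this transcript. The sample counts are the 4-cluster by 4-theme counts found in this transcript's chart data, with rows indexed by i and columns by k.

```python
from math import comb


def pair_f(table, i, k):
    """Harmonic mean of R = n_ik / n_i. and P = n_ik / n_.k."""
    n_ik = table[i][k]
    row = sum(table[i])
    col = sum(r[k] for r in table)
    # 2RP / (R + P) simplifies to 2 n_ik / (n_i. + n_.k).
    return 2 * n_ik / (row + col) if n_ik else 0.0


def overall_f(table):
    """Assumed aggregate: size-weighted best pairwise F for each row group."""
    n = sum(map(sum, table))
    return sum(sum(row) / n * max(pair_f(table, i, k) for k in range(len(row)))
               for i, row in enumerate(table))


def corrected_rand(table):
    """Corrected (adjusted) Rand index of Hubert and Arabie."""
    n = sum(map(sum, table))
    pairs = sum(comb(x, 2) for row in table for x in row)
    rows = sum(comb(sum(row), 2) for row in table)
    cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = rows * cols / comb(n, 2)
    # Undefined (division by zero) when both partitions are trivial.
    return (pairs - expected) / ((rows + cols) / 2 - expected)


counts = [
    [0, 11, 9, 22],
    [24, 3, 0, 4],
    [0, 29, 0, 2],
    [15, 4, 14, 2],
]
f_first_row_last_col = pair_f(counts, 0, 3)
f_total = overall_f(counts)
cr = corrected_rand(counts)
```

The corrected Rand index is 1 for identical partitions and close to 0 when the agreement is no better than chance, which makes it easier to compare across experiments than raw pair counts.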

  • Results

  • Conclusion

    • Combination of selection by structure and by linguistic terms
    • Evaluation of clustering against an existing typology
    • The quality of the clustering strongly depends on which parts of the activity reports are selected (which in turn indicates where the reports could be improved)

    Future work:

    • Measuring the stability of clusters when K varies
    • Evolution of classes over time
    • Experiments with other collections