MORPHOLOGICAL DATA EXPLORATION USING THE SEMANA PLATFORM
Feature Granularity Problem in the Definition of Polish Gender

Georges SAUVET, UTAH - CREAP, CNRS Toulouse
André WLODARCZYK & Hélène WLODARCZYK, CELTA, Université Paris Sorbonne



CASE STUDY

Why Polish Adjective Declension?

Answer: Polish adjective declension is an application domain with a well-defined boundary, i.e. one in which the total function generates all the combinatory possibilities.

POLISH DECLENSION

In Polish School Grammar, the Adjective declension amalgamates 3 “morphological categories”. In our experimentation, we interpreted these categories as attributes of an information system (Rough Set Theory, Pawlak Z., 1982).

Case = {Nominative, Accusative, Genitive, Dative, Instrumental, Locative}

Number = {singular, plural}

Gender = {masculine, feminine, neuter, X, Y, Z*}

* X, Y, Z will be analyzed in the sequel.

THE PROBLEM OF GENDER IN POLISH

• In Slavic languages, Gender is a classificatory category for Nouns, whereas it is an inflectional category for Adjectives.

• In order to elucidate the problem of Gender in Polish noun morphology, we built a database of usages (not uses) of the proximal deictic adjectives.

• The root of these adjectives is very short: a single phoneme, t-.

THE DEICTIC MORPHEMES IN POLISH

The Nominative forms of the Polish morphemes with proximal (with respect to the speaker) deictic meaning are:

TEN, TA, TO

They correspond to:

           TEN      TA      TO
English    this     this    this
French     ce       cette   ce
German     dieser   diese   dieses
Japanese   kono     kono    kono

SAMPLES FROM OUR DATABASE

Some samples from the db (examples only in the Nominative case)

            Polish                        English translation
            Singular      Plural
Feminine    ta deska      te deski        this/these board(s)
            ta gęś        te gęsi         this/these goose/geese
            ta pani       te panie        this/these lady/ladies
Masculine   ten dom       te domy         this/these house(s)
            ten pies      te psy          this/these dog(s)
            ten pan       ci panowie      this/these sir(s)
Neuter      to pióro      te pióra        this/these feather(s)
            to kurczę     te kurczęta     this/these chicken(s)
            to dziecko    te dzieci       this/these child/children
            ...           ...             ...

Our database contains 108 different noun phrases which, taken together, combine all the categories involved in the declension: Case, Number, Gender and Animacy.

Defining Gender in Polish
7 “Genders”

In Polish Linguistics (cf. SALONI, Z. 1976), Gender is defined as a morpho-syntactic category. It is in the Accusative Case that the Gender forms of Polish Adjectives are most clearly differentiated. Sub-genders are distinguished in the singular and in the plural. Surprisingly, this has led to proposals of up to 7 gender classes:

Singular:
1. feminine (with a specific Accusative form)
2. neuter (with the same form in Accusative as in Nominative)
3. animal* masculine (with the same form in Accusative as in Genitive)
4. non animal masculine (with the same form in Accusative as in Nominative)

Plural:
1. personal** masculine (with the same form in Accusative as in Genitive)
2. non personal masculine (with the same form in Accusative as in Nominative)
3. “pluralia tantum”*** (with the same form in Accusative as in Nominative)

* “Animal” corresponds to the feature “animate” in descriptions of other European languages.
** “Personal” corresponds to the feature “human”.
*** Pluralia tantum are defective nouns with no singular form.

Defining Gender in Polish
5 “Genders”

In fact, Saloni’s theory derives from that of Mańczak, W. (1956) who distinguished the following five “sub-genders” only:

1. personal masculine
2. animal masculine
3. non animal masculine
4. feminine
5. neuter

DATABASE WITH 7 GENDERS

Nb of objects : 108
Nb of duplicates : 65
Nb of attributes : 3 (with respectively 2, 7, 6 values)
Nb->{plur or sing}
Gnd->{fem or mascAn or mascHum or mascInan or neu or nMasHum or plTant}
Case->{A or D or G or I or L or N}
Theoretical Combinations : 84
Apparent Saturation Index : 51.19%
Non Attested Pairs of Values (10)
If all non-attested pairs are inconsistent, the maximum number of combinations is : 54
Corrected Saturation Index : 79.63%

Our knowledge reduction algorithm cannot reduce the different descriptions. Instead 45 decision rules are proposed.
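The saturation figures above can be re-derived with a few lines of arithmetic. The sketch below is only one reading that happens to reproduce the reported numbers: it assumes that the index is the number of distinct descriptions (objects minus duplicates) divided by the number of possible combinations. The function name `saturation` and its parameters are hypothetical, not part of Semana.

```python
from math import prod


def saturation(objects, duplicates, value_counts, max_consistent=None):
    """Return (theoretical combinations, apparent index %, corrected index %).

    Assumption: index = distinct descriptions / possible combinations.
    """
    distinct = objects - duplicates        # descriptions that are really different
    theoretical = prod(value_counts)       # full Cartesian product of attribute values
    ceiling = max_consistent if max_consistent is not None else theoretical
    apparent = round(100 * distinct / theoretical, 2)
    corrected = round(100 * distinct / ceiling, 2)
    return theoretical, apparent, corrected


# Database with 7 "genders": 3 attributes with 2, 7 and 6 values
seven_genders = saturation(108, 65, (2, 7, 6), max_consistent=54)
print(seven_genders)
```

Under the same assumption, 108 objects without duplicates and 5 attributes with 6, 2, 3, 2 and 2 values give 144 theoretical combinations and an apparent index of 75%.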

CRITICAL REMARKS ON SUB-GENDERS

We observed that the 5 or 7 “sub-genders” of Polish School Grammars correspond neither (a) to any known semantic or ontological categories nor (b) to any known grammatical sub-gender in other languages.

In inflectional languages, the morphological amalgamation of several different categories in one single form can make it difficult to discern the semantic categories in question properly.

ANALYSIS of GENDER SUBCATEGORIZATION in POLISH GRAMMAR

FIRST TRIAL

SPLITTING GENDER

We split the Gender attribute, with its 7 “sub-gender” values, into more than one attribute (with fewer values each).

Observing the singular/plural oppositions in Adjective declension, we first divided it into 3 attributes:

gender = {feminine, neuter, masculine}
animacy = {animate, inanimate}
humanity = {human, non human}

FIRST TRIAL - RESULTS
SPLITTING GENDER

Objects : 108
Duplicates : 0
Duplicate ratio : 0%
The following pairs of attributes could be merged:
[HUM|INA] Confidence index = 99.9%
[HUM|nHUM] Confidence index = 99.9%
[INA|nHUM] Confidence index = 99.9%
Attributes : 5 (with resp. 6,2,3,2,2 values)
case, number, gender, animacy and humanity
Theoretical Combinations : 144
Apparent Saturation Index : 75%
Non-Attested Pairs of Values (1)
If all non-attested pairs were inconsistent, the maximum number of combinations would be: 108
Corrected Saturation Index : 100%
======================================================
Non Attested Pairs of Values (1)
inanimate, human, 2, 4

Our knowledge reduction algorithm reduces the 108 different descriptions to 34 decision rules.
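The step from 144 theoretical combinations to 108 consistent ones can be checked by enumeration. This is a minimal sketch, not Semana's own procedure: it builds the Cartesian product of the five attributes and discards every combination containing the non-attested pair (inanimate, human). The value labels are abbreviations chosen for the illustration.

```python
from itertools import product

# Attribute values of the first trial (labels abbreviated for the sketch)
ATTRIBUTES = {
    "case": ["nom", "acc", "gen", "dat", "ins", "loc"],
    "number": ["sin", "plu"],
    "gender": ["fem", "neu", "mas"],
    "animacy": ["animate", "inanimate"],
    "humanity": ["human", "non human"],
}

# The only non-attested pair of values reported by the analysis
INCONSISTENT = {("inanimate", "human")}

names = list(ATTRIBUTES)
all_combinations = [dict(zip(names, values)) for values in product(*ATTRIBUTES.values())]
consistent = [
    c for c in all_combinations
    if (c["animacy"], c["humanity"]) not in INCONSISTENT
]

print(len(all_combinations), len(consistent))
```

The 36 discarded combinations are exactly the 6 cases x 2 numbers x 3 genders that would carry the contradictory pair.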

SECOND TRIAL
MERGING ANIMACY with HUMANITY

Considering the results of the first trial:

- one pair of values (inanimate and human) is not attested in the db (in fact, this pair is clearly contradictory):
Non Attested Pairs of Values (1)
inanimate, human, 2, 4

- and the confidence indices were computed as below:
The following pairs of attributes could be merged:
[HUM|INA] Confidence index = 99.9%
[HUM|nHUM] Confidence index = 99.9%
[INA|nHUM] Confidence index = 99.9%

we decided to merge the two binary attributes ANIMACY and HUMANITY into one three-valued attribute as follows:

ANIMACY-*-{ANY}=[nhuman|inanimate|human]
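The merge of the two binary attributes can be pictured as a small mapping. The sketch below is an illustration only: the function name `merge_animacy` is hypothetical, and the contradictory pair (inanimate, human) is rejected explicitly rather than silently mapped.

```python
def merge_animacy(animacy, humanity):
    """Collapse (animacy, humanity) into one of: 'inanimate', 'nhuman', 'human'."""
    if animacy == "inanimate":
        if humanity == "human":
            # the pair that is never attested in the database
            raise ValueError("inanimate + human is contradictory")
        return "inanimate"
    # animate entities are split by humanity
    return "human" if humanity == "human" else "nhuman"


merged_values = {
    merge_animacy("inanimate", "non human"),
    merge_animacy("animate", "non human"),
    merge_animacy("animate", "human"),
}
print(sorted(merged_values))
```

Four pairs of binary values thus yield only three values of the merged attribute, which is why the theoretical number of combinations drops from 144 to 108.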

SECOND TRIAL - RESULTS
MERGING ANIMACY with HUMANITY

Nb of objects : 108
Nb of duplicates : 0
Nb of attributes : 4 (with respectively 2, 3, 3 and 6 values)
Nb-->{plur or sing}
Gnd-->{fem or masc or neu}
Anim-->{inanim or anim or animHum}
Case-->{A or D or G or I or L or N}
Duplicate ratio : 0%
Theoretical Combinations : 108
Apparent Saturation Index : 100%
Non-Attested Pairs of Values (0)
Corrected Saturation Index : 100%

Again our knowledge reduction algorithm reduces the 108 different descriptions to 34 decision rules.

Establishing an ANIMACY CATEGORY for Polish Grammar

KNOWLEDGE REDUCTION using SEMANA

The knowledge reduction algorithm reduces the 108 different descriptions of Polish Proximal Deictic Morphemes to 34 decision rules.

34 Morphological Rules

(te* and ta* transcribe the nasal-vowel forms tę and tą.)

r1 (9) : CASdat,NBRplu --> tym
r2 (3) : CASins,GNDmas,NBRsin --> tym
r3 (3) : CASins,GNDneu,NBRsin --> tym
r4 (3) : CASloc,GNDmas,NBRsin --> tym
r5 (3) : CASloc,GNDneu,NBRsin --> tym

r6 (9) : CASins,NBRplu --> tymi

r7 (1) : CASacc,ANYhum,GNDmas,NBRplu --> tych
r8 (9) : CASgen,NBRplu --> tych
r9 (9) : CASloc,NBRplu --> tych

r10 (3) : CASacc,GNDneu,NBRsin --> to
r11 (3) : CASnom,GNDneu,NBRsin --> to

r12 (3) : CASacc,ANYina,NBRplu --> te
r13 (3) : CASacc,ANYnhu,NBRplu --> te
r14 (3) : CASacc,GNDfem,NBRplu --> te
r15 (3) : CASacc,GNDneu,NBRplu --> te
r16 (3) : CASnom,ANYina,NBRplu --> te
r17 (3) : CASnom,ANYnhu,NBRplu --> te
r18 (3) : CASnom,GNDfem,NBRplu --> te
r19 (3) : CASnom,GNDneu,NBRplu --> te

r20 (1) : CASacc,ANYina,GNDmas,NBRsin --> ten
r21 (3) : CASnom,GNDmas,NBRsin --> ten

r22 (3) : CASdat,GNDmas,NBRsin --> temu
r23 (3) : CASdat,GNDneu,NBRsin --> temu

r24 (3) : CASdat,GNDfem,NBRsin --> tej
r25 (3) : CASgen,GNDfem,NBRsin --> tej
r26 (3) : CASloc,GNDfem,NBRsin --> tej

r27 (1) : CASacc,ANYhum,GNDmas,NBRsin --> tego
r28 (1) : CASacc,ANYnhu,GNDmas,NBRsin --> tego
r29 (3) : CASgen,GNDmas,NBRsin --> tego
r30 (3) : CASgen,GNDneu,NBRsin --> tego

r31 (3) : CASacc,GNDfem,NBRsin --> te*

r32 (3) : CASnom,GNDfem,NBRsin --> ta

r33 (3) : CASins,GNDfem,NBRsin --> ta*

r34 (1) : CASnom,ANYhum,GNDmas,NBRplu --> ci
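As a way of reading the 34 rules, the sketch below applies them to every combination of Case, Animacy, Gender and Number. It is not the Semana implementation: a rule is simply taken to fire when all of its conditions occur in a description, and the helper name `forms_for` is hypothetical. With this reading, each of the 108 descriptions receives exactly one morpheme.

```python
from collections import Counter
from itertools import product

# Left-hand sides and morphemes copied from rules r1 to r34
RULE_TEXT = """\
CASdat,NBRplu --> tym
CASins,GNDmas,NBRsin --> tym
CASins,GNDneu,NBRsin --> tym
CASloc,GNDmas,NBRsin --> tym
CASloc,GNDneu,NBRsin --> tym
CASins,NBRplu --> tymi
CASacc,ANYhum,GNDmas,NBRplu --> tych
CASgen,NBRplu --> tych
CASloc,NBRplu --> tych
CASacc,GNDneu,NBRsin --> to
CASnom,GNDneu,NBRsin --> to
CASacc,ANYina,NBRplu --> te
CASacc,ANYnhu,NBRplu --> te
CASacc,GNDfem,NBRplu --> te
CASacc,GNDneu,NBRplu --> te
CASnom,ANYina,NBRplu --> te
CASnom,ANYnhu,NBRplu --> te
CASnom,GNDfem,NBRplu --> te
CASnom,GNDneu,NBRplu --> te
CASacc,ANYina,GNDmas,NBRsin --> ten
CASnom,GNDmas,NBRsin --> ten
CASdat,GNDmas,NBRsin --> temu
CASdat,GNDneu,NBRsin --> temu
CASdat,GNDfem,NBRsin --> tej
CASgen,GNDfem,NBRsin --> tej
CASloc,GNDfem,NBRsin --> tej
CASacc,ANYhum,GNDmas,NBRsin --> tego
CASacc,ANYnhu,GNDmas,NBRsin --> tego
CASgen,GNDmas,NBRsin --> tego
CASgen,GNDneu,NBRsin --> tego
CASacc,GNDfem,NBRsin --> te*
CASnom,GNDfem,NBRsin --> ta
CASins,GNDfem,NBRsin --> ta*
CASnom,ANYhum,GNDmas,NBRplu --> ci
"""

RULES = []
for line in RULE_TEXT.splitlines():
    conditions, form = line.split(" --> ")
    RULES.append((frozenset(conditions.split(",")), form))


def forms_for(case, animacy, gender, number):
    """Return the set of morphemes produced by all rules matching a description."""
    description = {f"CAS{case}", f"ANY{animacy}", f"GND{gender}", f"NBR{number}"}
    return {form for conditions, form in RULES if conditions <= description}


CASES = ["nom", "acc", "gen", "dat", "ins", "loc"]
ANIMACY = ["hum", "nhu", "ina"]
GENDERS = ["fem", "mas", "neu"]
NUMBERS = ["sin", "plu"]

table = {combo: forms_for(*combo) for combo in product(CASES, ANIMACY, GENDERS, NUMBERS)}
counts = Counter(next(iter(forms)) for forms in table.values())
print(counts)
```

Several rules may match the same description (for instance r12 and r14 for an inanimate feminine accusative plural), but in this reading they always agree on the morpheme.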

DISCOVERED KNOWLEDGE

1. All the 108 different descriptions can be represented by 34 rules only.
2. 20 rules represent the singular forms and 14 rules represent the plural forms.
3. The Gender attribute is not necessary in 8 rules in plural and in cases other than Nominative. This confirms the generally observed fact that, in Polish grammar, in the plural oblique cases, gender is neutralized (no Gender distinction).
4. The Attribute “Animacy” is present in 9/34 rules and 17/108 samples.

3 rules contain the value Human (hum)
r7 (1) : CASacc,ANYhum,GNDmas,NBRplu --> tych
r27 (1) : CASacc,ANYhum,GNDmas,NBRsin --> tego
r34 (1) : CASnom,ANYhum,GNDmas,NBRplu --> ci

3 rules contain the value Inanimate (ina)
r20 (1) : CASacc,ANYina,GNDmas,NBRsin --> ten
r12 (3) : CASacc,ANYina,NBRplu --> te
r16 (3) : CASnom,ANYina,NBRplu --> te

3 rules contain the value non Human (nhu)
r17 (3) : CASnom,ANYnhu,NBRplu --> te
r13 (3) : CASacc,ANYnhu,NBRplu --> te
r28 (1) : CASacc,ANYnhu,GNDmas,NBRsin --> tego

GENDER and ANIMACY

The 7 genders theory proposed too coarse-grained an analysis of the domain, using only one attribute supposed to represent the Gender category.

In our “first trial”, in addition to Gender, two binary categories (Human and Animate) were introduced, which in fact resulted in too fine-grained a description of the domain.

In our “second trial”, after having merged the two binary categories, we got one three-valued Animacy category. As a result, the Analyser (1) detects none of the following anomalies: duplicates (of usages, not uses) and non-attested pairs of values, and (2) proposes no attribute merging possibilities.

Needless to say, our theory takes into account the definition of the Gender category as it is generally used in grammars of other languages.

The ONTOLOGICAL STRUCTURE of ANIMACY

Interestingly, we noticed that, since the Feature Structure of the Animacy Attribute is a binary tree, its values are all mutually exclusive by the law of non-contradiction: nothing can be true and false at the same time.

ANIMACY
  - : non animate
  + : HUMANITY
        - : non human
        + : human

RELATIVE WEIGHT OF THE ANIMACY ATTRIBUTE

If we consider the relative weight of the ANIMACY attribute (only 5.4%), we can better understand the difficulties that Polish linguists encountered in their work.

Relative weight of attributes
    Attribute     N    weight (%)
1.  CAS         116    36.6
2.  NBR         116    36.6
3.  GND          68    21.5
4.  ANY          17     5.4

It becomes clear that ANIMACY is not as important a category as the other three (Case, Number and Gender), which co-occur in the amalgamated adjective paradigm.
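The N column can be re-derived from the 34 rules under one assumption: the number in parentheses after each rule is read as the number of descriptions it covers, and N is the sum of these numbers over the rules that mention the attribute. The sketch below groups the rules by the attributes they use; the grouping and all names are illustrative, and Semana may compute the weights differently.

```python
# Rules grouped by the attributes on their left-hand side,
# each with the list of bracketed coverage numbers
RULE_GROUPS = [
    (("CAS", "NBR"), [9, 9, 9, 9]),            # r1, r6, r8, r9
    (("CAS", "GND", "NBR"), [3] * 21),         # the 21 rules using gender but not animacy
    (("CAS", "ANY", "NBR"), [3] * 4),          # r12, r13, r16, r17
    (("CAS", "ANY", "GND", "NBR"), [1] * 5),   # r7, r20, r27, r28, r34
]

counts = {}
for attributes, coverage in RULE_GROUPS:
    for attribute in attributes:
        counts[attribute] = counts.get(attribute, 0) + sum(coverage)

total = sum(counts.values())
weights = {attribute: round(100 * n / total, 1) for attribute, n in counts.items()}

for attribute in sorted(counts, key=counts.get, reverse=True):
    print(attribute, counts[attribute], weights[attribute])
```

With this reading the four groups contain 4 + 21 + 4 + 5 = 34 rules, and Animacy appears in 9 of them.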

Step 1: DB building

Using our “Dynamic db Builder”…

[Screenshot labels: morpheme; sample; attribute, value (features chosen for each entry)]


Step 2: Multi-valued Contingency Table

The 108 samples are collected into a Multi-valued Contingency Table


Step 3: One-valued Contingency Table

The Multi-valued Table is unfolded as a One-valued Table...

Step 4: Table of co-occurrences (Burt Table)

The One-valued Table is transformed into a Burt Table ...

Column groups: syntactic relators (acc to nom), animacy (hum, ina, nhu), gender (fem, mas, neu), number (plu, sin), morphemes (ci to tymi).

BURT TABLE
      acc  dat  gen  ins  loc  nom  hum  ina  nhu  fem  mas  neu  plu  sin   ci   ta  ta*   te  te* tego  tej temu  ten   to tych  tym tymi
acc    18    0    0    0    0    0    6    6    6    6    6    6    9    9    0    0    0    8    3    2    0    0    1    3    1    0    0
dat     0   18    0    0    0    0    6    6    6    6    6    6    9    9    0    0    0    0    0    0    3    6    0    0    0    9    0
gen     0    0   18    0    0    0    6    6    6    6    6    6    9    9    0    0    0    0    0    6    3    0    0    0    9    0    0
ins     0    0    0   18    0    0    6    6    6    6    6    6    9    9    0    0    3    0    0    0    0    0    0    0    0    6    9
loc     0    0    0    0   18    0    6    6    6    6    6    6    9    9    0    0    0    0    0    0    3    0    0    0    9    6    0
nom     0    0    0    0    0   18    6    6    6    6    6    6    9    9    1    3    0    8    0    0    0    0    3    3    0    0    0
hum     6    6    6    6    6    6   36    0    0   12   12   12   18   18    1    1    1    4    1    3    3    2    1    2    7    7    3
ina     6    6    6    6    6    6    0   36    0   12   12   12   18   18    0    1    1    6    1    2    3    2    2    2    6    7    3
nhu     6    6    6    6    6    6    0    0   36   12   12   12   18   18    0    1    1    6    1    3    3    2    1    2    6    7    3
fem     6    6    6    6    6    6   12   12   12   36    0    0   18   18    0    3    3    6    3    0    9    0    0    0    6    3    3
mas     6    6    6    6    6    6   12   12   12    0   36    0   18   18    1    0    0    4    0    5    0    3    4    0    7    9    3
neu     6    6    6    6    6    6   12   12   12    0    0   36   18   18    0    0    0    6    0    3    0    3    0    6    6    9    3
plu     9    9    9    9    9    9   18   18   18   18   18   18   54    0    1    0    0   16    0    0    0    0    0    0   19    9    9
sin     9    9    9    9    9    9   18   18   18   18   18   18    0   54    0    3    3    0    3    8    9    6    4    6    0   12    0
ci      0    0    0    0    0    1    1    0    0    0    1    0    1    0    1    0    0    0    0    0    0    0    0    0    0    0    0
ta      0    0    0    0    0    3    1    1    1    3    0    0    0    3    0    3    0    0    0    0    0    0    0    0    0    0    0
ta*     0    0    0    3    0    0    1    1    1    3    0    0    0    3    0    0    3    0    0    0    0    0    0    0    0    0    0
te      8    0    0    0    0    8    4    6    6    6    4    6   16    0    0    0    0   16    0    0    0    0    0    0    0    0    0
te*     3    0    0    0    0    0    1    1    1    3    0    0    0    3    0    0    0    0    3    0    0    0    0    0    0    0    0
tego    2    0    6    0    0    0    3    2    3    0    5    3    0    8    0    0    0    0    0    8    0    0    0    0    0    0    0
tej     0    3    3    0    3    0    3    3    3    9    0    0    0    9    0    0    0    0    0    0    9    0    0    0    0    0    0
temu    0    6    0    0    0    0    2    2    2    0    3    3    0    6    0    0    0    0    0    0    0    6    0    0    0    0    0
ten     1    0    0    0    0    3    1    2    1    0    4    0    0    4    0    0    0    0    0    0    0    0    4    0    0    0    0
to      3    0    0    0    0    3    2    2    2    0    0    6    0    6    0    0    0    0    0    0    0    0    0    6    0    0    0
tych    1    0    9    0    9    0    7    6    6    6    7    6   19    0    0    0    0    0    0    0    0    0    0    0   19    0    0
tym     0    9    0    6    6    0    7    7    7    3    9    9    9   12    0    0    0    0    0    0    0    0    0    0    0   21    0
tymi    0    0    0    9    0    0    3    3    3    3    3    3    9    0    0    0    0    0    0    0    0    0    0    0    0    0    9
FJ     90   90   90   90   90   90  180  180  180  180  180  180  270  270    5   15   15   80   15   40   45   30   20   30   95  105   45
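Steps 2 to 4 can be sketched on a toy sample. The code below is an illustration, not the Semana implementation: it takes four Nominative samples from the database slide, annotated with feature values that are my reading of those samples, unfolds them into a one-valued (0/1) table, and builds the Burt table by counting co-occurrences of categories. All variable names are hypothetical.

```python
# Multi-valued table: one row per sample, one column per attribute
rows = [
    {"case": "nom", "number": "sin", "gender": "fem", "animacy": "ina", "form": "ta"},   # ta deska
    {"case": "nom", "number": "plu", "gender": "fem", "animacy": "ina", "form": "te"},   # te deski
    {"case": "nom", "number": "sin", "gender": "mas", "animacy": "hum", "form": "ten"},  # ten pan
    {"case": "nom", "number": "plu", "gender": "mas", "animacy": "hum", "form": "ci"},   # ci panowie
]

# One column per (attribute, value) pair that occurs in the data
categories = sorted({(attribute, value) for row in rows for attribute, value in row.items()})

# One-valued table: 1 when the sample has that value, 0 otherwise
one_valued = [
    [1 if row[attribute] == value else 0 for attribute, value in categories]
    for row in rows
]

# Burt table: cell (i, j) counts the samples having both category i and category j
size = len(categories)
burt = [
    [sum(line[i] * line[j] for line in one_valued) for j in range(size)]
    for i in range(size)
]


def cell(first, second):
    """Look up a Burt cell by its two (attribute, value) labels."""
    return burt[categories.index(first)][categories.index(second)]


print(cell(("gender", "mas"), ("form", "ci")))
```

The diagonal of the Burt table holds the frequency of each category, and the table is symmetric by construction. With n samples and Q attributes, its grand total is n x Q x Q.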


Step 5: Correspondence Factor Analysis (CFA)

Numbers in the Table are considered as coordinates of points in an N-dimensional space.

[Figure: cloud of points in a space with axes x, y, z, together with its axes of inertia F1, F2, F3]

CFA calculates the axes of inertia of the cloud of points (F1, F2, F3 …) and displays projections in planes [F1,F2], [F1,F3], etc.

CFA is implemented as “Stat-3” in “Semana”.

Correspondence Factor Analysis

“Stat-3” gives useful information about axes of inertia.

Output by “Stat-3”:

 N  Eigenvalue      %  HISTOGRAM
 1    0.1572    13.05  ************************************************************
 2    0.1543    12.81  ***********************************************************
 3    0.1359    11.27  ****************************************************
 4    0.1205    10.00  **********************************************
 5    0.1000     8.30  **************************************
 6    0.0909     7.54  ***********************************
 7    0.0732     6.07  ****************************
 8    0.0628     5.21  ************************
 9    0.0524     4.35  ********************
10    0.0428     3.55  ****************
11    0.0400     3.32  ***************
12    0.0392     3.25  ***************
13    0.0373     3.09  **************
14    0.0292     2.43  ***********
15    0.0223     1.85  *********
16    0.0168     1.39  ******
17    0.0097     0.81  ****
18    0.0096     0.80  ****
19    0.0070     0.58  ***
20    0.0028     0.23  *

The % column is the contribution percent of each axis to the overall inertia of the cloud.

Note that, in this case, the first 4 axes have almost equal contributions. This means that the cloud is strongly multidimensional.
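How spread out the inertia is can be seen by accumulating the percentages of the axes. The short sketch below uses only the % column reported by “Stat-3”; the helper `axes_needed` is a hypothetical name introduced for the illustration.

```python
from itertools import accumulate

# Contribution (%) of axes 1 to 20, as reported by Stat-3
percent = [13.05, 12.81, 11.27, 10.0, 8.3, 7.54, 6.07, 5.21, 4.35, 3.55,
           3.32, 3.25, 3.09, 2.43, 1.85, 1.39, 0.81, 0.8, 0.58, 0.23]

# Running total of inertia explained by the first k axes
cumulative = [round(value, 2) for value in accumulate(percent)]


def axes_needed(threshold):
    """Smallest number of leading axes whose cumulative share reaches the threshold (%)."""
    return next(k + 1 for k, share in enumerate(cumulative) if share >= threshold)


print(cumulative[3], axes_needed(50), axes_needed(80))
```

The first four axes together carry less than half of the inertia, so a single factorial plane such as [F1,F2] shows only about a quarter of the structure of the cloud.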

Page 29: SEMANA MORPHOLOGICAL DATA EXPLORATION USING THE SEMANA PLATFORM Feature Granularity Problem in the Definition of Polish Gender Georges SAUVET UTAH - CREAP,

Correspondence Factor AnalysisCorrespondence Factor Analysis

CLOUD J FREQ QLT INR | F#1 COR CTR | F#2 COR CTR | F#3 COR CTR | F#4 COR CTR | —————————————————————————————————————————————————————————————————————————————————————————— acc 33 397 40 | -70 3 1 | -745 391 120 | 47 2 1 | -48 2 1 | dat 33 588 42 | 643 271 88 | 483 153 50 | 97 6 2 | 489 157 66 | gen 33 596 40 | -15 0 0 | 153 16 5 | -870 522 186 | -290 58 23 | ins 33 869 48 | -362 77 28 | 682 271 100 | 924 499 210 | -195 22 11 | loc 33 326 35 | -117 11 3 | 366 106 29 | -504 201 62 | -103 8 3 | nom 33 633 44 | -78 4 1 | -938 556 190 | 306 59 23 | 147 14 6 | hum 67 5 23 | 0 0 0 | 29 2 0 | -34 3 1 | 5 0 0 | ina 67 5 23 | -1 0 0 | -26 2 0 | 34 3 1 | 6 0 0 | nhu 67 0 22 | 1 0 0 | -3 0 0 | 0 0 0 | -11 0 0 | fem 67 768 33 | 20 1 0 | -26 1 0 | 85 12 4 | -669 754 247 | mas 67 245 28 | -9 0 0 | 56 6 1 | -97 19 5 | 332 220 61 | neu 67 232 28 | -11 0 0 | -29 2 0 | 12 0 0 | 337 229 63 | plu 100 873 30 | -546 823 189 | 43 5 1 | -68 13 3 | 108 32 10 | sin 100 873 30 | 546 823 189 | -43 5 1 | 68 13 3 | -108 32 10 | ci 2 76 36 | -644 18 5 | -841 30 8 | 128 1 0 | 804 28 10 | ta 6 276 40 | 497 29 9 |-1046 127 39 | 545 35 12 | -856 85 34 | ta* 6 445 40 | 207 5 2 | 635 47 15 | 1278 190 67 |-1321 203 80 | te 30 651 44 | -630 225 75 | -839 399 135 | 148 12 5 | 156 14 6 | te* 6 265 40 | 505 30 9 | -845 83 26 | 237 7 2 |-1121 146 58 | tego 15 249 42 | 516 79 25 | -91 2 1 | -751 167 62 | -6 0 0 | tej 17 588 42 | 749 187 60 | 274 25 8 | -324 35 13 |-1012 341 142 | temu 11 559 44 | 1200 306 102 | 469 47 16 | 145 4 2 | 973 201 87 | ten 7 208 39 | 469 34 10 | -917 132 40 | 262 11 4 | 440 30 12 | to 11 308 41 | 468 50 15 | -948 204 65 | 305 21 8 | 378 32 13 | tych 35 812 44 | -624 263 87 | 264 47 16 | -858 498 191 | -86 5 2 | tym 39 481 35 | 214 43 11 | 527 256 70 | 174 28 9 | 408 154 54 | tymi 17 726 47 | -924 251 91 | 752 166 61 | 1016 304 127 | -119 4 2 |

CLOUD J FREQ QLT INR | F#1 COR CTR | F#2 COR CTR | F#3 COR CTR | F#4 COR CTR | —————————————————————————————————————————————————————————————————————————————————————————— acc 33 397 40 | -70 3 1 | -745 391 120 | 47 2 1 | -48 2 1 | dat 33 588 42 | 643 271 88 | 483 153 50 | 97 6 2 | 489 157 66 | gen 33 596 40 | -15 0 0 | 153 16 5 | -870 522 186 | -290 58 23 | ins 33 869 48 | -362 77 28 | 682 271 100 | 924 499 210 | -195 22 11 | loc 33 326 35 | -117 11 3 | 366 106 29 | -504 201 62 | -103 8 3 | nom 33 633 44 | -78 4 1 | -938 556 190 | 306 59 23 | 147 14 6 | hum 67 5 23 | 0 0 0 | 29 2 0 | -34 3 1 | 5 0 0 | ina 67 5 23 | -1 0 0 | -26 2 0 | 34 3 1 | 6 0 0 | nhu 67 0 22 | 1 0 0 | -3 0 0 | 0 0 0 | -11 0 0 | fem 67 768 33 | 20 1 0 | -26 1 0 | 85 12 4 | -669 754 247 | mas 67 245 28 | -9 0 0 | 56 6 1 | -97 19 5 | 332 220 61 | neu 67 232 28 | -11 0 0 | -29 2 0 | 12 0 0 | 337 229 63 | plu 100 873 30 | -546 823 189 | 43 5 1 | -68 13 3 | 108 32 10 | sin 100 873 30 | 546 823 189 | -43 5 1 | 68 13 3 | -108 32 10 | ci 2 76 36 | -644 18 5 | -841 30 8 | 128 1 0 | 804 28 10 | ta 6 276 40 | 497 29 9 |-1046 127 39 | 545 35 12 | -856 85 34 | ta* 6 445 40 | 207 5 2 | 635 47 15 | 1278 190 67 |-1321 203 80 | te 30 651 44 | -630 225 75 | -839 399 135 | 148 12 5 | 156 14 6 | te* 6 265 40 | 505 30 9 | -845 83 26 | 237 7 2 |-1121 146 58 | tego 15 249 42 | 516 79 25 | -91 2 1 | -751 167 62 | -6 0 0 | tej 17 588 42 | 749 187 60 | 274 25 8 | -324 35 13 |-1012 341 142 | temu 11 559 44 | 1200 306 102 | 469 47 16 | 145 4 2 | 973 201 87 | ten 7 208 39 | 469 34 10 | -917 132 40 | 262 11 4 | 440 30 12 | to 11 308 41 | 468 50 15 | -948 204 65 | 305 21 8 | 378 32 13 | tych 35 812 44 | -624 263 87 | 264 47 16 | -858 498 191 | -86 5 2 | tym 39 481 35 | 214 43 11 | 527 256 70 | 174 28 9 | 408 154 54 | tymi 17 726 47 | -924 251 91 | 752 166 61 | 1016 304 127 | -119 4 2 |

Output by “stat-3”

INR: contribution of object J to the overall inertia of the cloud

FREQ: weight of object J / total

QLT: quality of the description of object J on the first 7 coordinates

CTR: contribution of object J to the definition of factor 1

COR: contribution of factor 1 to the description of object J

F#1: coordinate of object J on factor 1

“Stat-3” gives useful information about objects/features.
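The relation between the columns can be checked with a short sketch. It assumes the standard correspondence-analysis definition CTR = FREQ × F² / λ, where λ is the eigenvalue of the factor; this formula is not stated on the slides, and the function names are illustrative only. The values are copied from the F#1 columns of the stat-3 table (all in thousandths), and λ is estimated from the row for plu.

```python
# Values copied from the stat-3 table, in thousandths:
# label -> (FREQ, coordinate on F#1, CTR on F#1)
ROWS = {
    "plu": (100, -546, 189),
    "sin": (100, 546, 189),
    "dat": (33, 643, 88),
    "te": (30, -630, 75),
    "tych": (35, -624, 87),
    "tymi": (17, -924, 91),
}


def eigenvalue_from_row(freq, coord, ctr):
    """Solve CTR = FREQ * F^2 / lambda for lambda (assumed standard definition)."""
    f, x, c = freq / 1000, coord / 1000, ctr / 1000
    return f * x * x / c


def predicted_ctr(freq, coord, eigenvalue):
    """Recompute CTR (in thousandths) from the weight, the coordinate and lambda."""
    return round(1000 * (freq / 1000) * (coord / 1000) ** 2 / eigenvalue)


# Estimate the eigenvalue of factor 1 from a single row, then test it on the others.
lambda1 = eigenvalue_from_row(*ROWS["plu"])

for label, (freq, coord, ctr) in ROWS.items():
    print(f"{label:5s} table CTR = {ctr:3d}   recomputed CTR = {predicted_ctr(freq, coord, lambda1):3d}")
```

Small differences between the table and the recomputed values are expected, since every printed figure is rounded to thousandths.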


Correspondence Factor Analysis


Note that number (singular/plural) makes the highest contribution to axis 1 (CTR = 189 for both sin and plu).

Note that the quality of the description of the attribute “animacy” is very poor (QLT = 5, 5 and 0 for hum, ina and nhu) and that these elements contribute almost nothing to the first 4 factors (CTR of 0 or 1).


Projection in plane [1,2]

[Figure: projection in the factorial plane [1,2]. Horizontal: axis 2 (inertia 12.81%); vertical: axis 1 (inertia 13.05%). Width: 1.798197; height: 2.123853; number of points: 27.]

Types of points plotted:

Qualifiers = animacy, gender

Quantifiers = number

Syntactic relators = cases

Morphemes


Quantifiers and syntactic relators

[Figure: projection in the factorial plane [1,2] (horizontal: axis 2, inertia 12.81%; vertical: axis 1, inertia 13.05%), annotated with “quantifiers (on axis 1)”, “syntactic relators (on axis 2)” and “qualifiers”.]

• Axis 1 separates quantifiers => singular vs plural

• Axis 2 separates syntactic relators => {nom, acc} vs {gen, loc, dat, ins}

• “Qualifiers” (animacy and gender) are not differentiated on axes 1 and 2

• Morphemes are spread over plane [1,2]


Axis 1 separates quantifiers

[Figure: projection in the factorial plane [1,2] (horizontal: axis 2, inertia 12.81%; vertical: axis 1, inertia 13.05%).]

Morphemes strictly associated with singular:

=> ta, to, ten, te*, tego, tej, temu, ta*

Morphemes strictly associated with plural:

=> ci, te, tych, tymi

tym may be either singular or plural
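The split along axis 1 can be read directly from the signs of the F#1 coordinates in the stat-3 table. The following is a minimal sketch under that reading: the coordinates are copied from the table, while the variable names and the "sign of the coordinate" rule are illustrative assumptions, not part of the SEMANA platform.

```python
# F#1 coordinates of the morphemes, copied from the stat-3 table (thousandths).
F1 = {
    "ci": -644, "ta": 497, "ta*": 207, "te": -630, "te*": 505,
    "tego": 516, "tej": 749, "temu": 1200, "ten": 469, "to": 468,
    "tych": -624, "tym": 214, "tymi": -924,
}

# In the same table, sin has coordinate +546 and plu has -546 on F#1,
# so a positive coordinate is on the singular side of the axis.
singular_side = sorted(m for m, coord in F1.items() if coord > 0)
plural_side = sorted(m for m, coord in F1.items() if coord < 0)

print("singular side:", ", ".join(singular_side))
print("plural side:  ", ", ".join(plural_side))
```

The plural side contains exactly ci, te, tych and tymi. The ambiguous morpheme tym lands on the singular side with a positive coordinate, so the sign alone does not reveal that it can also be plural.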


Axis 2 separates syntactic relators

[Figure: projection in the factorial plane [1,2] (horizontal: axis 2, inertia 12.81%; vertical: axis 1, inertia 13.05%).]

On one side: ta, to, ten, te*, ci, te are only nominative and/or accusative.

On the other side: tej, tych, temu, tymi, ta* are only genitive, locative, dative and/or instrumental.

tego may be either accusative or genitive


Axis 3 separates {genitive, locative} vs {instrumental}

[Figure: projection in the factorial plane [1,3] (horizontal: axis 1; vertical: axis 3).]

• morphemes tych, tego, tej are associated only with genitive or locative

• morphemes tymi, ta* are associated only with instrumental

tym may be instrumental, locative or dative


Axis 4 separates gender: feminine vs {masculine, neuter}

[Figure: projection in the factorial plane [1,4] (horizontal: axis 4; vertical: axis 1, inertia 13.05%); number of points: 27.]

• morphemes ta*, te*, tej, ta are associated only with feminine

• morphemes tego, to, ten, temu, ci are associated only with masculine or neuter

Again, tym is ambiguous and may be associated with any gender

Note that animacy is still not differentiated on axis 4. Differentiation appears only on axis 9!


Differentiation of Animacy does not appear before factor 9

J    FREQ QLT INR |   F#1 COR CTR |   F#2 COR CTR |   F#3 COR CTR |   F#4 COR CTR |
-----------------------------------------------------------------------------------
hum    67   5  23 |     0   0   0 |    29   2   0 |   -34   3   1 |    -5   0   0 |
ina    67   5  23 |    -1   0   0 |   -26   2   0 |    34   3   1 |    -6   0   0 |
nhu    67   0  22 |     1   0   0 |    -3   0   0 |     0   0   0 |    11   0   0 |
acc    33 397  40 |   -70   3   1 |  -745 391 120 |    47   2   1 |    48   2   1 |
dat    33 588  42 |   643 272  88 |   483 153  50 |    97   6   2 |  -489 157  66 |
gen    33 596  40 |   -15   0   0 |   153  16   5 |  -870 522 186 |   290  58  23 |
ins    33 869  48 |  -362  77  28 |   682 271 100 |   924 499 210 |   195  22  11 |
loc    33 326  35 |  -117  11   3 |   366 106  29 |  -504 201  62 |   103   8   3 |
nom    33 633  44 |   -78   4   1 |  -938 556 190 |   306  59  23 |  -147  14   6 |
fem    67 768  33 |    20   1   0 |   -26   1   0 |    85  12   4 |   669 754 247 |
mas    67 245  28 |    -9   0   0 |    56   6   1 |   -97  19   5 |  -332 220  61 |
neu    67 232  28 |   -11   0   0 |   -29   2   0 |    12   0   0 |  -336 229  63 |
plu   100 873  30 |  -546 823 189 |    44   5   1 |   -68  13   3 |  -108  32  10 |
sin   100 873  30 |   546 823 189 |   -44   5   1 |    68  13   3 |   108  32  10 |
ci      2  76  36 |  -644  18   5 |  -840  30   8 |   128   1   0 |  -804  28  10 |
ta      6 276  40 |   497  29   9 | -1046 127  39 |   545  35  12 |   856  85  34 |
ta*     6 445  40 |   207   5   2 |   635  47  15 |  1278 190  67 |  1321 203  80 |
te     30 651  44 |  -630 225  75 |  -839 399 135 |   148  12   5 |  -156  14   6 |
te*     6 265  40 |   505  30   9 |  -845  83  26 |   237   7   2 |  1121 146  58 |
tego   15 249  42 |   516  79  25 |   -92   2   1 |  -751 167  62 |     6   0   0 |
tej    17 588  42 |   749 187  60 |   274  25   8 |  -324  35  13 |  1012 341 142 |
temu   11 559  44 |  1200 306 102 |   469  47  16 |   145   4   2 |  -973 201  87 |
ten     7 208  39 |   469  34  10 |  -918 132  40 |   262  11   4 |  -441  30  12 |
to     11 308  41 |   468  50  15 |  -948 204  65 |   305  21   8 |  -378  32  13 |
tych   35 812  44 |  -623 263  87 |   264  47  16 |  -858 498 191 |    86   5   2 |
tym    39 481  35 |   215  43  11 |   527 256  70 |   174  28   9 |  -408 154  54 |
tymi   17 726  47 |  -924 251  91 |   753 167  61 |  1016 304 127 |   119   4   2 |

J    |   F#5 COR CTR |   F#6 COR CTR |   F#7 COR CTR |   F#8 COR CTR |   F#9 COR CTR |
--------------------------------------------------------------------------------------
hum  |    27   2   0 |   -68  11   3 |    19   1   0 |   -19   1   0 |   503 617 322 |
ina  |   -34   3   1 |    -0   0   0 |   -25   2   1 |    87  19   8 |  -309 235 121 |
nhu  |     7   0   0 |    68  11   3 |     6   0   0 |   -69  12   5 |  -194  94  48 |
acc  |   160  18   9 |   721 366 191 |  -386 105  68 |   209  31  23 |    85   5   5 |
dat  |  -603 239 121 |   126  10   6 |  -246  40  27 |  -321  68  55 |    10   0   0 |
gen  |   526 191  92 |  -102   7   4 |    13   0   0 |  -431 128  99 |   -72   4   3 |
ins  |   446 116  66 |   -40   1   1 |    45   1   1 |    24   0   0 |    -5   0   0 |
loc  |  -357 101  42 |   -72   4   2 |   318  80  46 |   649 333 224 |     8   0   0 |
nom  |  -172  19  10 |  -633 253 147 |   255  41  30 |  -129  11   9 |   -26   0   0 |
fem  |  -317 170  67 |     0   0   0 |   -74   9   5 |   -94  15   9 |    12   0   0 |
mas  |   214  92  31 |  -311 193  71 |  -374 280 128 |   179  64  34 |   -13   0   0 |
neu  |   103  22   7 |   310 195  71 |   448 407 183 |   -85  15   8 |     1   0   0 |
plu  |  -163  73  26 |    31   3   1 |   -64  11   6 |   -90  22  13 |     2   0   0 |
sin  |   163  73  26 |   -31   3   1 |    64  11   6 |    90  22  13 |    -2   0   0 |
ci   |  -161   1   0 | -1931 159  76 |  -466   9   6 |  -233   2   2 |  3212 441 364 |
ta   |  -563  37  18 | -1309 199 105 |   699  57  37 |  -528  32  25 |  -110   1   1 |
ta*  |   501  29  14 |  -141   2   1 |   100   1   1 |    77   1   1 |    38   0   0 |
te   |  -342  66  35 |   242  33  19 |  -242  33  24 |  -278  44  37 |  -209  25  25 |
te*  |    10   0   0 |  1359 215 113 | -1121 146  95 |   809  76  58 |   654  50  45 |
tego |  1333 526 263 |   -11   0   0 |  -241  17  12 |  -445  59  47 |   -24   0   0 |
tej  |  -516  89  44 |   -93   3   2 |    55   1   1 |  -152   8   6 |   -53   1   1 |
temu |  -485  50  26 |   187   7   4 |  -409  36  25 |  -728 113  94 |    14   0   0 |
ten  |   481  36  17 | -1255 247 128 |  -627  62  40 |   975 149 112 |  -622  61  55 |
to   |   448  46  22 |   637  92  50 |  1269 366 245 |   177   7   6 |   197   9   8 |
tych |  -106   8   4 |   -65   3   2 |   152  16  11 |   129  11   9 |    13   0   0 |
tym  |  -205  39  16 |    35   1   1 |    81   6   4 |   374 129  87 |    11   0   0 |
tymi |   488  70  40 |   -17   0   0 |   -56   1   1 |  -263  20  18 |   -21   0   0 |

Animacy first appears on factor 9
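One way to make this observation explicit is to add up, factor by factor, the CTR values of the three animacy modalities. The sketch below uses the CTR columns of the two tables for this slide (in thousandths). The threshold of 100 is an arbitrary, illustrative cut-off chosen for this example, not a value taken from the presentation.

```python
# CTR of each animacy modality on factors 1..9, copied from the tables (thousandths).
ANIMACY_CTR = {
    "hum": [0, 0, 1, 0, 0, 3, 0, 0, 322],
    "ina": [0, 0, 1, 0, 1, 0, 1, 8, 121],
    "nhu": [0, 0, 0, 0, 0, 3, 0, 5, 48],
}

# Total contribution of the attribute "animacy" to each factor.
animacy_ctr_per_factor = [sum(column) for column in zip(*ANIMACY_CTR.values())]

# Hypothetical cut-off: animacy "appears" once it contributes at least 100 thousandths.
THRESHOLD = 100
first_factor = next(
    (k for k, ctr in enumerate(animacy_ctr_per_factor, start=1) if ctr >= THRESHOLD),
    None,
)

for k, ctr in enumerate(animacy_ctr_per_factor, start=1):
    print(f"factor {k}: animacy CTR = {ctr}")
print("first factor with a substantial animacy contribution:", first_factor)
```

On factors 1 to 8 the three modalities together never contribute more than 13 thousandths; on factor 9 they account for almost half of the definition of the axis.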


Axis 9 separates human vs {nonHuman, inanimate}

[Figure: projection in the factorial plane [1,9]. Horizontal: axis 1 (inertia 13.05%); vertical: axis 9 (inertia 4.35%). Width: 2.123853; height: 3.83365; number of points: 27.]

The morpheme ci applies only to human entities


Comparing Theories of Polish Noun Categories in Grammar

THEORIES: Mańczak W. (1956), Saloni Z. (1976)

GRAMMATICALIZED ATTRIBUTES:

GENDER: feminine; neuter; non animal masculine; animal masculine; personal masculine; non personal masculine; “pluralia tantum”

NUMBER: singular; plural

THEORIES: This proposal

GRAMMATICALIZED ATTRIBUTES:

GENDER: feminine; neuter; masculine

ANIMACY: inanimate; non human animate; human animate

NUMBER: singular; plural
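As an illustration of the proposal, the attributes can be written down as independent value sets and combined. This is a minimal sketch that assumes the attributes combine freely and adds the six cases listed at the start of the presentation. The variable names are hypothetical, and nothing here claims that every combination is attested by a distinct morpheme.

```python
from itertools import product

# Attribute values as named in the proposal; CASE is taken from the
# definition of the declension given earlier in the presentation.
GENDER = ["feminine", "neuter", "masculine"]
ANIMACY = ["inanimate", "non human animate", "human animate"]
NUMBER = ["singular", "plural"]
CASE = ["nominative", "accusative", "genitive", "dative", "instrumental", "locative"]

# Every combination of one value per attribute (assumes free combination).
combinations = list(product(GENDER, ANIMACY, NUMBER, CASE))

print(len(combinations), "attribute combinations")
print(combinations[0])
```

Keeping animacy as a separate attribute means that the masculine subgenders of the earlier descriptions are expressed as combinations of two simpler features rather than as additional gender values.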