Disentangling from Babylonian Confusion – Unsupervised Language Identification
Chris Biemann, Sven Teresniak, University of Leipzig, Germany. CICLing-05, Mexico City, February 18, 2005.
1
Disentangling from Babylonian Confusion –
Unsupervised Language Identification
Chris Biemann, Sven Teresniak
University of Leipzig, Germany
CICLing-05, Mexico City
February 18, 2005
2
Outline
1. Review: Supervised Language Identification
2. Co-occurrence graphs
• Co-occurrences
• Visualizing co-occurrences
3. Chinese Whispers Algorithm
• Finding words of the same language
4. Sorting text by language
• Evaluation and limitations
3
Review: Supervised Language Identification
• Needs training data
• Operates on letter n-grams or common words as features
• Works almost error-free for texts of 500 letters or more
Drawbacks:
• Does not work for previously unknown languages
• Danger of misclassifying instead of reporting "unknown"
Example: http://odur.let.rug.nl/~vannoord/TextCat/Demo
• “xx xxx x xxx …” classified as Nepali
• “öö ö öö ööö …” classified as Persian
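For orientation, letter n-gram identification can be sketched as rank-ordered n-gram profiles compared with an "out-of-place" distance, in the style of the Cavnar and Trenkle method that TextCat is based on. This is a minimal sketch, not TextCat's code: the function names, the profile size and the two toy training sentences are assumptions made for illustration.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Return letter n-grams (n = 1..n_max) of `text`, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(r - rank[g]) if g in rank else penalty
               for r, g in enumerate(doc_profile))

def classify(text, profiles):
    """Always returns the closest known language; there is no 'unknown' answer."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Toy training data (hypothetical, far too small for real use).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}
```

Because `classify` always picks the nearest profile, an input such as "xx xxx x xxx" still receives one of the trained languages, which is the drawback listed above.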
4
Co-occurrence Statistics
• Co-occurrence: occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbors, window...)
• Significant Co-occurrences reflect relations between words.
• Significance Measure (log-likelihood):
$$\mathrm{sig}(A,B) = x - k \log x + \log k!$$
with $n$ the number of sentences and $x = \frac{ab}{n}$.
• In the following, sentence-based co-occurrence statistics are used.
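The significance measure sig(A,B) = x − k log x + log k! with x = ab/n can be computed directly. The slide does not define a, b and k; the sketch below assumes that a and b are the numbers of sentences containing A and B, and that k is the number of sentences containing both. Under that reading the value is the negative logarithm of the Poisson probability of seeing k joint occurrences when x are expected.

```python
import math

def significance(a, b, k, n):
    """sig(A,B) = x - k*log(x) + log(k!) with x = a*b/n.

    a, b: sentences containing A resp. B (assumed meaning)
    k:    sentences containing both A and B (assumed meaning)
    n:    total number of sentences
    """
    x = a * b / n
    # math.lgamma(k + 1) equals log(k!) and avoids huge factorials.
    return x - k * math.log(x) + math.lgamma(k + 1)
```

For example, two words that each occur in 10 of 1000 sentences get a much higher score when they share 5 sentences than when they share only 1.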
5
Co-occurrence Graphs
• The entirety of all significant co-occurrences is a co-occurrence graph G(V,E) with
V: Vertices = Words
E: Edges (v1, v2, s) with v1, v2 words, s significance value.
• The co-occurrence graph is
- weighted
- undirected
• Small-world property
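A co-occurrence graph of this kind can be built from a list of sentences. This is a minimal sketch: whitespace tokenisation, the adjacency-dict representation and the `min_sig` threshold are assumptions, and the weight is the sentence-based measure sig = x − k log x + log k! with x = ab/n, where a, b and k are taken to be sentence counts.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_graph(sentences, min_sig=1.0):
    """Build a weighted, undirected graph {word: {neighbour: significance}}."""
    n = len(sentences)
    word_freq = Counter()  # number of sentences containing each word
    pair_freq = Counter()  # number of sentences containing both words
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))

    graph = defaultdict(dict)
    for (v1, v2), k in pair_freq.items():
        x = word_freq[v1] * word_freq[v2] / n
        sig = x - k * math.log(x) + math.lgamma(k + 1)
        if sig >= min_sig:       # keep only significant co-occurrences
            graph[v1][v2] = sig  # store the edge in both directions,
            graph[v2][v1] = sig  # since the graph is undirected
    return dict(graph)
```

Words that never share a sentence get no edge, which is what later keeps words of different languages in separate regions of the graph.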
6
Chinese Whispers - Motivation
• Small-world graphs consist of regions with a high clustering coefficient and hubs that connect those regions
• The nodes in a cluster region should all be assigned the same label
• Every node gets a label and whispers it to its neighbouring nodes. A node changes to a label if most of its neighbours whisper this label, or it invents a new one
• Assuming that strongly connected nodes are semantically close, well-motivated clusters should emerge
7
Chinese Whispers Algorithm
Assign different labels to every node in the graph;
For iteration i from 1 to total_iterations {
    mutation_rate = 1/(i^2);
    For each word w in the graph {
        new_label of w = highest ranked label in neighbourhood of w;
        with probability mutation_rate: new_label of w = new class label;
    }
    labels = new_labels;
}
• graph clustering algorithm
• linear time in the number of nodes
• random mutation can be omitted but showed better results for small graphs
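The pseudocode can be turned into a runnable Python sketch. Several details are assumptions not fixed by the slide: the graph is a symmetric adjacency dict `{node: {neighbour: weight}}`, "highest ranked" means the largest sum of edge weights per label, ties go to the smallest label id, and isolated nodes keep their label.

```python
import random

def chinese_whispers(graph, total_iterations=20, mutate=True, seed=0):
    """Cluster a weighted graph {node: {neighbour: weight}}; returns {node: label}."""
    rng = random.Random(seed)
    labels = {node: i for i, node in enumerate(graph)}  # a different label per node
    next_label = len(labels)                            # counter for fresh class labels

    for i in range(1, total_iterations + 1):
        mutation_rate = 1.0 / (i * i)
        new_labels = {}
        for w, neighbours in graph.items():
            # Sum the edge weights per label in the neighbourhood of w.
            scores = {}
            for v, weight in neighbours.items():
                scores[labels[v]] = scores.get(labels[v], 0.0) + weight
            if scores:
                # Highest-ranked label; ties go to the smallest label id (assumption).
                new_labels[w] = max(scores, key=lambda lab: (scores[lab], -lab))
            else:
                new_labels[w] = labels[w]  # isolated node keeps its label
            if mutate and rng.random() < mutation_rate:
                new_labels[w] = next_label  # invent a new class label
                next_label += 1
        labels = new_labels  # all nodes switch at once, as in the pseudocode
    return labels
```

With `mutate=False` the run is deterministic. In each iteration every node reads its neighbour list once, so the cost of one iteration grows with the size of those lists.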
8
Assigning New Labels
• Node A changes label from L1 to L3: Sum(L3)=9; Sum(L4)=8; Sum(L2)=5
• Other strategies result in different kinds of partitioning
- threshold for share
- weighting by node degrees
[Figure: node A (L1 → L3) and its neighbours B (L4), C (L3), D (L2), E (L3); edge weights 5, 8, 6, 3]
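The sums come from adding, per label, the weights of the edges that connect A to neighbours carrying that label. Reading the figure's edge weights 5, 8, 6, 3 as D = 5, B = 8 and the two L3 neighbours C and E = 6 and 3 (an assumption about which weight belongs to which edge, consistent with the stated sums):

$$\mathrm{Sum}(L3) = 6 + 3 = 9, \qquad \mathrm{Sum}(L4) = 8, \qquad \mathrm{Sum}(L2) = 5$$

Since $9 > 8 > 5$, node A adopts L3, even though the single strongest edge leads to the L4 node B.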
9
Chinese Whispers on 7 Languages
10
Chinese Whispers on 7 Languages
11
Assigning languages to sentences
• Use word-based language identification tool
• Largest clusters form word lists for different languages
• A sentence is assigned a cluster label if
- it contains at least 2 words from the cluster and
- not more words from another cluster
Questions for Evaluation:
• Up to what number of languages is that possible?
• How much can the corpus be biased?
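The assignment rule can be sketched in Python as follows. Tokenisation by whitespace with ASCII punctuation stripped, the name `assign_cluster`, and leaving exact ties between two clusters unclassified are assumptions; the slide only states the "at least 2 words" and "not more words from another cluster" conditions.

```python
import string
from collections import Counter

def assign_cluster(sentence, word_lists, min_hits=2):
    """Return the cluster label for `sentence`, or None if it stays unclassified."""
    tokens = [t.strip(string.punctuation) for t in sentence.lower().split()]
    hits = Counter()
    for label, words in word_lists.items():
        hits[label] = sum(1 for t in tokens if t in words)
    ranked = hits.most_common(2)
    if not ranked or ranked[0][1] < min_hits:
        return None  # fewer than min_hits words from any cluster
    if len(ranked) > 1 and ranked[1][1] == ranked[0][1]:
        return None  # tie between clusters (assumed to stay unclassified)
    return ranked[0][0]

# Hypothetical word lists; real ones would be the largest clusters found.
WORD_LISTS = {
    "c1": {"the", "is", "love", "you", "need", "all"},
    "c2": {"die", "mit", "der", "und"},
}
```

Cluster labels such as "c1" are arbitrary identifiers: the method sorts sentences by language without naming the language.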
12
Evaluation: Mix of 7 languages
• Languages used: Dutch, Estonian, English, French, German, Icelandic and Italian
• At least 100 sentences per language are necessary for consistent clusters
[Figure: Precision, Recall and F-value for 7-lingual corpora; x-axis: # of sentences per language (100, 1000, 10000, 100000); y-axis: P/R/F (0.96 to 1); series: Precision, Recall, F-value]
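The slides do not state how precision, recall and F-value were computed. A minimal sketch under the usual definitions, assuming that each cluster label has already been mapped to the language it represents and that unclassified sentences are marked `None`, might look like this:

```python
def precision_recall_f(gold, predicted):
    """gold: true language per sentence; predicted: assigned language or None."""
    classified = sum(1 for p in predicted if p is not None)
    correct = sum(1 for g, p in zip(gold, predicted) if p == g)
    precision = correct / classified if classified else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_value = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_value
```

Under these definitions a misclassified sentence lowers both precision and recall, while an unclassified sentence lowers recall only.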
13
Common mistakes
• Unclassified:
- mostly enumerations of sports teams
- very short sentences, e.g. headlines
- legal act ciphers in the Estonian case, e.g. 10.12.96 jõust.01.01.97 - RT I 1996 , 89 , 1590
• Misclassified: mixed-language sentences, like
French: Frönsku orðin "cinéma vérité" þýða "kvikmyndasannleikur"
English: Die Beatles mit "All you need is love".
14
Evaluation: Bilingual biased
• Language pairs used: English-Estonian, French-Italian, Dutch-German
• The 1st language was varied between 100 and 10,000 sentences; the 2nd language had 100,000 sentences
• A size factor of up to 200 between the two languages does not result in deterioration
• Above factor 200, the 1st-language cluster is not distinguishable in size from 2nd-language "noise"
[Figure: English noise in a 100K sentence Estonian Corpus; x-axis: # English sentences (100, 1000, 10000); y-axis: Precision / Recall (0.995 to 1); series: P Estonian, P English, R English]
[Figure: French noise in a 100K sentence Italian Corpus; x-axis: # French sentences (100, 1000); y-axis: P/R (0.925 to 1); series: P Italian, R Italian, P French, R French, P total, R total]
15
Conclusion
• Unsupervised language identification is possible
• It does not name the languages, but sorts the text by language
• It works for previously undescribed languages, even for dialects
• Accuracy on sentences (here about 120 characters) is comparable to supervised approaches
• When classifying documents, there should be virtually no errors
• Time-linear graph-clustering algorithm
16
Questions?
THANK YOU!
17
Small Cooccurring Worlds
Assumed structure of co-occurrence graphs: scale-free small worlds
• short path length between nodes
• high clustering coefficient
• power-law distribution of node degrees
• power-law distribution of component sizes
Node degree: number of (outgoing) edges
Component: connected set of nodes
Power-law distributions are easy to plot.
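Both distributions can be read off a graph with a few lines of Python. The sketch assumes a symmetric adjacency dict `{node: {neighbour: weight}}`; the function names are illustrative. Plotting the resulting counts on log-log axes shows a power law as an approximately straight line.

```python
from collections import Counter

def degree_distribution(graph):
    """Map node degree -> number of nodes with that degree."""
    return Counter(len(neighbours) for neighbours in graph.values())

def component_sizes(graph):
    """Sizes of the connected components, largest first."""
    seen, sizes = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for v in graph[node]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```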
18
Strategies for adopting a colour (label)
A node changes its colour to a new colour from its neighbourhood if that colour
(1) occurs with the highest significance sum (top)
(2) occurs with the highest significance sum weighted by node degree (a - linear, b - logarithmic) (dist)
(3) occurs with the highest significance sum and its share lies above a certain threshold (vote <x>)
[Figure: node A (L1) and its neighbours B (L4), C (L3), D (L2), E (L3); edge weights 5, 8, 6, 3; node degrees 1, 2, 3, 5, 4]
Example: colouring A
(1): Sum(L3)=9; Sum(L4)=8; Sum(L2)=5
(2a): wSum(L2)=5; wSum(L4)=4; wSum(L3)=2.2
(2b): wSum(L4)=7.28; wSum(L3)=5.51; wSum(L2)=3.46
(3): nSum(L3)=0.409; nSum(L4)=0.363; nSum(L2)=0.227
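The three strategies can be checked against the example numbers with a short Python sketch. The pairing of edge weights and node degrees is not readable from the figure and is inferred here from the listed sums (D: weight 5, degree 1; B: weight 8, degree 2; the L3 neighbours: weight 6 with degree 5 and weight 3 with degree 3). Which L3 neighbour is C and which is E is an arbitrary choice. Dividing by ln(degree + 1) for the logarithmic variant is likewise an assumption.

```python
import math

# Neighbours of A as (label, edge weight, node degree); pairing inferred, see above.
NEIGHBOURS_OF_A = {
    "B": ("L4", 8, 2),
    "C": ("L3", 6, 5),
    "D": ("L2", 5, 1),
    "E": ("L3", 3, 3),
}

def label_scores(neighbours, strategy="top"):
    """Sum edge weights per label, optionally down-weighted by node degree."""
    scores = {}
    for label, weight, degree in neighbours.values():
        if strategy == "dist_linear":
            weight = weight / degree
        elif strategy == "dist_log":
            weight = weight / math.log(degree + 1)  # assumed form of the log weighting
        scores[label] = scores.get(label, 0.0) + weight
    return scores

def vote(neighbours, threshold):
    """Return the top label if its share of the total exceeds `threshold`, else None."""
    scores = label_scores(neighbours, "top")
    best = max(scores, key=scores.get)
    share = scores[best] / sum(scores.values())
    return best if share > threshold else None
```

With these assumptions, "top" gives 9, 8, 5 and "dist_linear" gives 5, 4, 2.2. The shares 9/22, 8/22 and 5/22 match the listed nSum values. The logarithmic variant reproduces 7.28 for L4 and 5.51 for L3, but yields 5/ln 2 ≈ 7.21 for L2 rather than the listed 3.46, so the exact weighting behind that figure may differ.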
19
7 Clusters – 7 languages
68701:(3792) [...] a-t-elle, a-t-il, a-t-on, aanval, abandonné, abattu, abattus, aborder, abords, abouti, absolu, absolue, acceptent,
accepter, accepté, accessibles, accession, accord, accords, accordé, accordée, accusation, accusations, accuse, accusé, accusée, acheter, achevé, achevée, acte, actes, actifs, action, actionnaires, actions, actions-suicides, activement, activiste, activistes, activités, actuelle, actuellement, adeptes, adjoint, admettre, administratif, admis, [...]
80266:(3616) [...] a, abandoned, able, ablösen, aboard, abortion, abortions, about, above, abroad, absence, absolute, absolutely, accepted, accessible, accident, accidents, acclaim, according, accounting, accused, accusing, acid, acidic, acknowledged, acquire, acquistare, acre, acres, across, act, acting, active, activist, activists, acts, actually, added, addicts, adding, additional, address, administration, administrator, admitted, adopt, adopted, adoption, adults, advance,
68952:(3312) [...] abbandonato, abbastanza, abbia, abbiamo, abbiano, abile, abitante, abitanti, abitazioni, abruzzesi, accade, accaduto, accenno, accertare, accertato, accesso, accoglienza, accolto, accordi, accordo, accorta, accorti, accusa, acquisito, ad, addetti, addirittura, addosso, adesso, adottata, aereo, affari, affermato, affetto, affidare, affidato, affiliati, affiliato, affonda, affrontare, affronteranno, agenti, agenzie, agevolare, aggiunge, aggiungere, aggiunto, agli [...]
75760:(3249) [...] af, afar, afgreiðslutíma, afl, afla, aflaheimilda, aflaheimildum, aflann, aflaverðmætið, afli, aflinn, aflýst, afnema, afnotagjalda, afnotagjaldið, afnotagjöld, afnotagjöldin, afnotagjöldum, afrek, afráðið, afstýra, aftur, afurðaverðs, afurðum, aka, al-Qaeda, al-Zawahri, ala, aldar, aldrei, aldri, aldur, allan, allar, allir, allra, allri, alls, allt, alltaf, alltof, allur, almannafjölmiðla, almannamiðla, almenna, almennt, altari, alveg, andvirði, annan, annar, annarra [...]
81089:(2894) [...] an, aandacht, aandachtsgebied, aangehouden, aangekeken, aangenomen, aangepakt, aangesloten, aangevuld, aangewezen, aangezien, aanleiding, aanmerking, aanpak, aanslag, aansluiting, aantal, aantrekkelijk, aanvankelijk, aanvragen, aanwezige, aanwezigheid, aanwijsbaar, aanwijzingen, aanzien, aarde, aardige, abortuspil, abortuswetgeving, acceptabel, achtduizend, achter, achtergrond, achterhalen, achterover, actie, actieve, actuele [...]
68872:(2791) [...] ab, abend, aber, abermals, abgebaut, abgelaufenen, abgeschlossen, abgeschlossenen, abschneiden, abzuwarten, acht, achtzehn, achtziger, afghanischen, akzeptieren, allein, allem, allen, allenfalls, aller, allerdings, allgemein, allgemeinen, alt, alte, alten, alter, am, amerikanische, amerikanischen, amerikanischer, anderen, anderer, anerkennen, angedroht, angegeben, angehende, angekündigt, angenommen, angesichts, angestellt, angetastet [...]
72602:(2247) [...] aadress, aadressi, aadressil, aadressina, aasta, aastaaruande, aastabilansi, aastabilanss, aastaks, aastal, aastas, aastat, aastate, aeg, aegumistähtaeg, aga, agressiooni, ainuaktsionäri, ainult, ainuõigus, ajaks, ajal, ajast, ajutiselt, akt, aktid, aktide, aktsia, aktsiad, aktsiaid, aktsiakapital, aktsiakapitali, aktsiakapitalist, aktsiaraamatusse, aktsiaselts, aktsiaseltsi, aktsiaseltsiga, aktsiaseltsil, aktsiaseltsile, aktsiat, aktsiate, aktsiatega [...]
60154:(195) [...] afferma, créée, dimostrano, dom, domicilio, dovranno, escl, esclusi, escluso, feriale, festivi, festivo, fóru, gleðilegs, gratuita, gravidanza, incarico, intero, inv, io, jr, jäetud, jõust, liðna, lõige, lõiked, nere, næstir, pagamento, punktides, sab, saper, scenario, servono, sindacale, sjálfsm, socc, soccorso, spiegato, spilurum, techniques, ventanni, warnte, zuständige [...]
[...]
84023:(3) Inter, Mailand, Ronaldo