AINL 2016: Grigorieva


TRANSCRIPT

Page 1: AINL 2016: Grigorieva

Lidia Grigorieva

The Institute of Informatics Problems of the Russian Academy of Sciences (IPI RAN)

Page 2: AINL 2016: Grigorieva

Root ≠ Stem

из — prefix
бир — root
а, тель, ниц — suffixes
а — ending
избирательниц — stem

Page 3: AINL 2016: Grigorieva

Dimension Reduction

Dimension reduction is the process of reducing the number of random variables in machine learning tasks:

• Lemmatization – grouping together the inflected forms of a word. Tools: LemmaGen; morpha; pymorphy2; mystem... (see the sketch after this list)

• Stemming – reducing inflected words to their word stem. The stem need not be identical to the morphological root of the word. Tools: Snowball; Lovins; Porter; nltk.stem.*...

• Root extraction – reducing derivatives to their root, i.e. to their meaning.
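A minimal sketch of the first two reduction steps in Python, assuming the pymorphy2 analyzer and NLTK's Snowball stemmer named above are installed; the outputs are illustrative only.

# Hedged sketch: lemmatization vs. stemming for Russian,
# using pymorphy2 and NLTK's Snowball stemmer (tools named on the slide).
import pymorphy2
from nltk.stem.snowball import SnowballStemmer

morph = pymorphy2.MorphAnalyzer()
stemmer = SnowballStemmer("russian")

for word in ["лесистый", "лесник", "лесничество", "лесничий", "лесной"]:
    lemma = morph.parse(word)[0].normal_form   # most probable lemma
    stem = stemmer.stem(word)                  # suffix-stripped stem
    print(word, "lemma:", lemma, "stem:", stem)

# Neither step reaches the shared root "лес"; root extraction needs
# a morpheme dictionary or a trained model on top of this.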

Page 4: AINL 2016: Grigorieva

Lemmatization

Mapping from text-word to lemma:

мыла → мыть (verb, 'to wash')
мыла → мыло (noun, 'soap')

Page 5: AINL 2016: Grigorieva

Stemming

Mapping from text-word to stem (excluding endings):

лесистый → лесист
лесник → лесник
лесничество → лесничеств
лесничий → леснич
лесной → лесн

Page 6: AINL 2016: Grigorieva

Root Extraction

Mapping from lemma to meaning:

лесистый → лес
лесник → лес
лесничество → лес
лесничий → лес
лесной → лес

(5 lemmas → 1 root)

Page 7: AINL 2016: Grigorieva

Realization

• Neural network algorithm
• Train data – 749 cases
• Cross-validation – 84 cases (10%)
• Test data – 93 cases
• Accuracy ~0.7

Page 8: AINL 2016: Grigorieva

Tasks

• plagiarism detection;
• paraphrase detection;
• textual similarity;
• semantic disambiguation;
• topic modeling;
• text classification;
• text clustering;
• question answering systems;
• building semantic graphs (entities, links and relationships between them).

Page 9: AINL 2016: Grigorieva

References

• Рацибурская Л. В. Словарь уникальных морфем современного русского языка. — М.: Флинта: Наука, 2009. — 160 с.
• Аванесов Р. И., Ожегов С. И. Морфемно-орфографический словарь: Около 100 000 слов / А. Н. Тихонов. — М.: АСТ: Астрель, 2002. — 704 с.
• Тихонов А. Н. Морфемно-орфографический словарь русского языка, 2002.
• Кузнецова А. И., Ефремова Т. Ф. Словарь морфем русского языка: Ок. 52 000 слов. — М.: Рус. яз., 1986. — 1132 с.
• http://old.kpfu.ru/infres/slovar1/begall.htm
• http://snowball.tartarus.org/algorithms/russian/stemmer.html, http://snowballstem.org/demo.html

Page 10: AINL 2016: Grigorieva

Effective Paraphrase Expansion in Addressing Lexical Variability

Vasily Konovalov, Meni Adler, Ido Dagan

Department of Computer Science

Bar-Ilan University, Israel

The 5th Conference on Artificial Intelligence and Natural Language

Page 11: AINL 2016: Grigorieva

Problem

Lexical Variability

From the Negochat negotiation dialogue corpus:

‘Reject’: “I disagree”, “I reject your proposal”, “it’s not accepted”.
‘Accept’: “I accept your offer”, “I agree to the salary”, “It’s OK”.
‘Offer’: “I offer you a salary of 60,000 USD”, “How about the programmer position”, “I propose you a pension of 10%”.

Page 12: AINL 2016: Grigorieva

Solution

Translation-based paraphrase expansion:

[Diagram: a sentence is translated into a pivot language (PL) by one MT engine (MT1) and translated back by another (MT2), yielding a paraphrase; Google and Yandex are shown as example engines.]
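A minimal sketch of this pivot-translation scheme; translate() is a hypothetical placeholder standing in for whatever MT client is used (Google, Microsoft or Yandex), not a real API call.

# Hedged sketch of translation-based paraphrase expansion via a pivot language.
# translate() is a hypothetical stand-in for an MT API client, not a real one.
def translate(text, src, tgt, engine):
    """Placeholder: return `text` translated from `src` to `tgt` by `engine`."""
    raise NotImplementedError("plug in a real MT client here")

def paraphrase(sentence, pivot_lang, mt1, mt2, src_lang="en"):
    # Forward translation into the pivot language with the first engine,
    # then back translation with the second engine, yields a paraphrase.
    pivot = translate(sentence, src_lang, pivot_lang, mt1)
    return translate(pivot, pivot_lang, src_lang, mt2)

# e.g. paraphrase("I reject your proposal", "hu", "google", "yandex")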

Page 13: AINL 2016: Grigorieva

Our research questions

◮ What is the ‘best’ performing language? Why is it actually the ‘best’ one?
◮ What is the ‘best’ performing combination of MT engines?

Page 14: AINL 2016: Grigorieva

Our research settings

Languages: Portuguese, French, German, Hebrew, Russian, Arabic, Finnish, Chinese, Hungarian.

MT engines: Google Translate API, Microsoft Translator Text API, Yandex Translate API.

Page 15: AINL 2016: Grigorieva

Our findings

◮ Among the tested languages, Hungarian is the ‘best’ performing one.
◮ The performance of a language correlates well with the averaged smoothed BLEU.
◮ The language that generates the most lexically dissimilar paraphrases is the ‘best’ performing one.
◮ The differences between MT engines are insignificant according to the averaged smoothed BLEU and are not reflected in the evaluation.
◮ Language family relations are reflected in the averaged smoothed BLEU.
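The findings rely on an averaged smoothed BLEU between sentences and their paraphrases. A minimal sketch of such a score with NLTK is shown below; the smoothing method and whitespace tokenization are assumptions, since the slides do not specify the authors' exact setup.

# Hedged sketch: averaged smoothed BLEU between sentences and their paraphrases.
# SmoothingFunction().method1 and whitespace tokenization are assumed choices.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def averaged_smoothed_bleu(pairs):
    """pairs: iterable of (original_sentence, paraphrase) string tuples."""
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([orig.split()], para.split(), smoothing_function=smooth)
        for orig, para in pairs
    ]
    return sum(scores) / len(scores)

# Lower values indicate more lexically dissimilar paraphrases.
# e.g. averaged_smoothed_bleu([("I reject your proposal",
#                               "I do not accept your offer")])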

Page 16: AINL 2016: Grigorieva

Come and see our poster

Page 17: AINL 2016: Grigorieva

RESEARCHING QUANTITATIVE CHARACTERISTICS OF SHORT TEXTS: SCIENTIFIC, NEWS, USE WRITINGS

Page 18: AINL 2016: Grigorieva

■ For data analysis, we used several text collections.

■ For scientific texts: collections from the Dialogue conference (2003–2006) and from Corpus Linguistics.

■ For news: a collection of short mass-media articles from sources such as Lenta.ru, the Russian Newspaper, RBC, the Independent Newspaper, and Kompyulenta.

■ To research writings from the Unified State Examination (USE) we created two collections: a "reference" collection containing writings produced by experts, and a second one written by students.

Page 19: AINL 2016: Grigorieva

■ For this research we selected the most representative characteristics: entropy, readability, lexical diversity, verbal, autosem (all words except the function ("service") parts of speech), and frequencies (the ratio of the first hundred most frequent words of the Russian language to all words in the text).
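A minimal sketch of how two of these characteristics might be computed (word-level entropy and lexical diversity as a type/token ratio); standard textbook definitions are assumed, since the slides do not give the authors' exact formulas.

# Hedged sketch: word-level entropy and lexical diversity (type/token ratio).
# Standard definitions are assumed; the authors' exact formulas are not specified.
import math
from collections import Counter

def text_characteristics(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    # Shannon entropy of the word-frequency distribution, in bits per word.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Lexical diversity: distinct word forms divided by text length.
    lexical_diversity = len(counts) / total
    return {"entropy": entropy, "lexical_diversity": lexical_diversity}

# e.g. text_characteristics("мама мыла раму мама мыла".split())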

Page 20: AINL 2016: Grigorieva

[Bar chart: entropy for USE expert, USE students, News, and Scientific texts; vertical axis 0–14.]

Page 21: AINL 2016: Grigorieva

[Bar chart: readability for USE expert, USE students, News, and Scientific texts; vertical axis 0–0.25.]

Page 22: AINL 2016: Grigorieva

[Bar chart: lexical diversity for USE expert, USE students, News, and Scientific texts; vertical axis 0–0.7.]

Page 23: AINL 2016: Grigorieva

[Bar chart: verbal for USE expert, USE students, News, and Scientific texts; vertical axis 0.136–0.154.]

Page 24: AINL 2016: Grigorieva

[Bar chart: autosem for USE expert, USE students, News, and Scientific texts; vertical axis 0.68–0.8.]

Page 25: AINL 2016: Grigorieva

[Bar chart: frequencies for USE expert, USE students, News, and Scientific texts; vertical axis 0–0.3.]

Page 26: AINL 2016: Grigorieva

Building a Lexicon-Based Lemmatizer for Old Irish

Oksana Dereza

[email protected]

Page 27: AINL 2016: Grigorieva

Old Irish: Grammar

• Changes can occur to any part of the word:

o beginning: mutations
o middle: infixed pronouns
o end: flections

caraid ‘he / she / it loves’        rob-car-si ‘she has loved you’

• Forms within a paradigm (especially a verbal one) can look very different:

do-beir ‘gives, brings’        ní t(h)abair ‘does not give, bring’

Page 28: AINL 2016: Grigorieva

Old Irish: Orthography

• Inconsistent use of length marks
• Mutations are not always shown in writing
• Complex verb forms can be spelled either with or without a hyphen or a whitespace
• In later texts there are mute vowels to indicate the quality (broad / slender) of consonants next to them
⇨ a great number of possible spellings for every form

Consonant → mutated consonant(s):

b → bh, mb
c → ch, gc, cc
d → dh, nd
f → fh, ḟ, ḟh, bhf
g → gh, ng
l → ll, l-l
m → mh, mm, m-m
n → nn
p → ph, bp
r → rr
s → sh, ṡ, ss, ts, s-s
t → th, dt
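The lemmatizer described a few slides later uses "return a demutated form" as its baseline for unknown words. Below is a minimal sketch of how demutation might look based on the table above; the prefix mapping, the longest-match rule and the handling of hyphenated prefixes are assumptions, not the authors' actual rules.

# Hedged sketch: strip an initial mutation from an Old Irish form using the
# consonant table above. The mapping and longest-prefix rule are assumptions,
# not the authors' actual demutation logic.
MUTATIONS = {
    "bh": "b", "mb": "b",
    "ch": "c", "gc": "c", "cc": "c",
    "dh": "d", "nd": "d",
    "fh": "f", "ḟ": "f", "ḟh": "f", "bhf": "f",
    "gh": "g", "ng": "g",
    "mh": "m", "mm": "m",
    "ph": "p", "bp": "p",
    "sh": "s", "ṡ": "s", "ts": "s",
    "th": "t", "dt": "t",
}

def demutate(form):
    if form[:2] in ("n-", "h-", "t-"):      # hyphenated prefixes, e.g. "n-uaill"
        form = form[2:]
    for prefix in sorted(MUTATIONS, key=len, reverse=True):
        if form.startswith(prefix):
            return MUTATIONS[prefix] + form[len(prefix):]
    return form

# e.g. demutate("cheast") -> "ceast", demutate("n-uaill") -> "uaill"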

Page 29: AINL 2016: Grigorieva

Data

• Dictionary of the Irish Language (DIL)

43,345 entries ⇨ 79,140 unique forms

• Corpus

125 texts, 831,280 tokens

• Gold standard

50 random sentences from the test corpus, 840 tokens

• Not only classical Old Irish

The corpus covers VII-XVI centuries

Page 30: AINL 2016: Grigorieva

Problems

• DIL covers only ~ 41% of unique forms in the corpus

• Many contracted forms, but no unified system of contractions

• Inconsistent use of markup and punctuation

[Screenshot of the eDIL entry for "caraid" (dil.ie/8212), illustrating the inconsistent markup and punctuation: the Forms section lists dozens of attested spellings (-carim, -cairim, caraim, -caraim, -caru, -cari, carid, caraid, -cara, carthai, caras, charas, caris, carthar, ... rob-car-si, ro-car, arro-car, ... carthain, carthi), followed by grammatical notes and manuscript citations such as "weak vb. with reduplicated fut. on analogy of canaid (Thurn. Gramm. 402). Ind. pres. 1 s. -carim, Wb. 5c7 ..." and sense glosses such as "(a) loves (persons): ... rot charus ar th'airscélaib 'I have fallen in love with thee', LU 6084 (TBC)".]

Page 31: AINL 2016: Grigorieva

Lemmatizer

• Two methods for OOV words:

o Baseline: return a demutated form
o Predict a lemma using a modified Damerau-Levenshtein distance

• Disambiguation:

o For homonymous forms, the lemma with the highest lexical probability is chosen
o Lemma probability equals the sum of the probabilities of its forms, and form probability is its frequency count in the corpus
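A minimal sketch of this disambiguation rule; lemma_to_forms and form_freq are illustrative data structures standing in for the lexicon and the corpus counts, not the authors' code.

# Hedged sketch of the disambiguation rule: a lemma's probability is the sum of
# its forms' corpus frequencies; for a homonymous form, take the most probable lemma.
# `lemma_to_forms` and `form_freq` are illustrative stand-ins, not the authors' code.
def lemma_probability(lemma, lemma_to_forms, form_freq):
    return sum(form_freq.get(form, 0) for form in lemma_to_forms[lemma])

def disambiguate(candidate_lemmas, lemma_to_forms, form_freq):
    return max(candidate_lemmas,
               key=lambda lemma: lemma_probability(lemma, lemma_to_forms, form_freq))

# e.g. disambiguate(["cain", "canaid"], lemma_to_forms, form_freq)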

Page 32: AINL 2016: Grigorieva

Predicting lemmas for OOV-words

• Generate all possible strings at edit distance 1 and 2
• Check them against the dictionary
• Add real words to the candidate list
• Filter candidates by the first character:

"If the unknown word starts with a vowel, the candidate should also start with a vowel, and if the unknown word starts with a consonant, the candidate should start with the same consonant"

• The lemma of the candidate with the highest lexical probability (i.e. frequency count in the corpus) is taken as the lemma for the unknown word (see the sketch below)
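A minimal Python sketch of this candidate-generation procedure (Norvig-style edit operations); the alphabet, dictionary_forms, form_to_lemma and form_freq names are illustrative assumptions, not the authors' implementation.

# Hedged sketch of OOV lemma prediction: generate edit-distance-1/2 strings,
# keep those that are real dictionary forms, filter by first character, and
# choose by lexical probability. All names and the alphabet are illustrative.
VOWELS = set("aeiouáéíóú")
ALPHABET = "abcdefghilmnoprstuáéíóú"   # assumed Old Irish letter inventory

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts

def first_char_ok(unknown, candidate):
    if unknown[0] in VOWELS:
        return candidate[0] in VOWELS
    return candidate[0] == unknown[0]

def predict_lemma(word, dictionary_forms, form_to_lemma, form_freq):
    e1 = edits1(word)
    e2 = {c for w in e1 for c in edits1(w)}   # edit distance 2
    candidates = [c for c in (e1 | e2)
                  if c in dictionary_forms and first_char_ok(word, c)]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: form_freq.get(c, 0))
    return form_to_lemma[best]

# e.g. predict_lemma("cheast", ...) could return the lemma "ceist" via the form "ceist".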

Page 33: AINL 2016: Grigorieva

Evaluation

Lexicon                                  Forms     'Recall'
DIL forms only                           79,140    74.7 %
DIL + 1000 most frequent OOV-words       80,206    80.0 %

! 4,889 homonymous forms

                        Baseline     Predicted lemmas
Lemmatized correctly    483 / 840    552 / 840
Accuracy                57.50 %      65.71 %

Page 34: AINL 2016: Grigorieva

Evaluation

Tokens                                        840
Known words                                   654
Unknown words                                 186
Lemmatized correctly                          552
Lemmas predicted for unknown words            157
Predicted correctly                            84
Predicted incorrectly                          68
Several lemmas predicted, including the
correct one, but the wrong one chosen           5

~60 % of lemmas are predicted correctly

Page 35: AINL 2016: Grigorieva

Predicted lemmas

   Token        Best candidate          Best candidate's lemma(s)      Chosen lemma
                (closest dict. form)
+  eólais       eólas                   eólas                          eólas
+  fiarfaigid   fíarfaigid              fíarfaigid, íarmi-foich        íarmi-foich
+  cheast       ceist                   ceist                          ceist
*  déa          dia                     dá, de, do, día                de
+  bréithir     bréthir                 bríathar                       bríathar
–  n-uaill      aill                    aile, aill, all, aille         aile
–  chuain       cain                    cain, canaid, cani, caingen    canaid
–  christ       ceist                   ceist                          ceist
–  caeme        caíme                   caíme                          caíme
–  chniss       cliss                   cles                           cles

Page 36: AINL 2016: Grigorieva

Source Code & Corpora

Source code

https://github.com/ancatmara/old_irish_lemmatizer

Texts

https://github.com/ancatmara/old_irish_corpora

Page 37: AINL 2016: Grigorieva

Extraction of Social Networks from Literary Text

Tsygankova Viktoria, National Research University Higher School of Economics, Moscow

Page 38: AINL 2016: Grigorieva

NovelGraphs

A tool for automatic annotation of texts and for extracting social networks of characters from text, where nodes represent characters and edges are relations between them. It can also analyze the structural balance of the resulting graphs.
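The slides do not show NovelGraphs' internals, so as a rough illustration of the idea (nodes are characters, weighted edges are interactions), here is a minimal co-occurrence sketch with networkx, using sentence-level co-mention as a crude stand-in for the tool's extractors and aggregators.

# Hedged sketch: character graph where nodes are characters and edge weights count
# how often two characters appear in the same sentence. This is only a crude proxy
# for NovelGraphs' extractor/aggregator pipeline, not its actual implementation.
import itertools
import networkx as nx

def build_character_graph(sentences, characters):
    """sentences: list of strings; characters: list of lowercase character names."""
    graph = nx.Graph()
    graph.add_nodes_from(characters)
    for sentence in sentences:
        lowered = sentence.lower()
        present = [name for name in characters if name in lowered]
        for a, b in itertools.combinations(present, 2):
            weight = graph.get_edge_data(a, b, default={}).get("weight", 0)
            graph.add_edge(a, b, weight=weight + 1)
    return graph

# e.g. build_character_graph(sentences, ["holmes", "lestrade", "gregson", "rance"])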

Page 39: AINL 2016: Grigorieva

[Character graph figure. Nodes include: gray, henry wotton, hallward, lady henry, erskine, adrian, prince paradox, duke de valentinois, narborough, borgia, filippo, louis xii, gian maria visconti, romeo, mercutio, ruxton.]

Example graph of the “Picture of Dorian Gray” by Oscar Wilde

Page 40: AINL 2016: Grigorieva

Example graph of the “Study in Scarlet” by A. Conan Doyle

[Character graph figure. Nodes include: holmes, narrator, lestrade, gregson, murcher, rance, joseph stangerson.]

Page 41: AINL 2016: Grigorieva

Example graph of the “Study in Scarlet” by A. Conan Doyle with sentiment

Page 42: AINL 2016: Grigorieva

Example graph of the “Picture of Dorian Gray” by Oscar Wilde with sentiment

Page 43: AINL 2016: Grigorieva

Conclusions

• The NovelGraphs tool was created for English-language literary fiction; it uses a new approach to extracting characters and the connections between them.

• Nodes represent characters found in the text, and edges connect them to the other characters with whom they interact.

• At the moment, combinations of extractors and aggregators detect characters better than the interactions between them.

• Analysis of structural balance identifies key passages of the text that correspond to the minima and maxima on the balance plot.

Page 44: AINL 2016: Grigorieva

Thanks for watching!

Page 45: AINL 2016: Grigorieva

Are the results of your corpus research really reliable?
Getting automatic result analysis on GICR

Tatiana Shavrina, Daniil Selegey

AINL FRUCT, SPb, 12.11.2016

Page 46: AINL 2016: Grigorieva

Big Corpora Problem:

1. Billions of words, mostly coming from social media.

2. Getting just the IPM and search results in KWIC format doesn't tell you whether the results are biased.

3. A lot of metatext attributes – URLs, doc IDs, author IDs, region, gender, genre, etc. – all are potential sources of bias.

Users need corpus tools that show all the statistics of the search area, to check its homogeneity with the whole corpus (a minimal sketch of such a check follows below).
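The slides do not specify how GICR implements this check; purely as an illustration, here is a minimal sketch comparing the distribution of one metadata attribute (e.g. genre) in the search results against the whole corpus with a chi-square goodness-of-fit test from SciPy.

# Hedged sketch: is the search area homogeneous with the whole corpus on one
# metadata attribute (e.g. genre)? Chi-square goodness-of-fit is an assumed choice;
# this is not how GICR actually implements its result analysis.
from collections import Counter
from scipy.stats import chisquare

def homogeneity_check(result_attrs, corpus_attrs, alpha=0.05):
    """result_attrs / corpus_attrs: one attribute value per document."""
    corpus_counts = Counter(corpus_attrs)
    result_counts = Counter(result_attrs)
    categories = sorted(corpus_counts)
    observed = [result_counts.get(cat, 0) for cat in categories]
    total = sum(observed)
    # Expected counts if the search area mirrored the whole corpus exactly.
    expected = [total * corpus_counts[cat] / len(corpus_attrs) for cat in categories]
    stat, p_value = chisquare(observed, f_exp=expected)
    return p_value >= alpha, p_value   # True = no evidence of bias at level alpha

# e.g. ok, p = homogeneity_check(results_genres, corpus_genres)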

Page 47: AINL 2016: Grigorieva

Our solution: Search results analysis right in the interface!

Page 48: AINL 2016: Grigorieva

See you at our

Demo stand!