AINL 2016: Grigorieva


TRANSCRIPT

Page 1: AINL 2016: Grigorieva

Lidia Grigorieva

The Institute of Informatics Problems of the Russian Academy of Sciences (IPI RAN)

Page 2: AINL 2016: Grigorieva

Root ≠ Stem

из — prefix
бир — root
а, тель, ниц — suffixes
а — ending
избирательниц — stem

Page 3: AINL 2016: Grigorieva

Dimension Reduction

Dimension reduction is the process of reducing the number of random variables in machine learning tasks:

• Lemmatization – grouping together the inflected forms of a word. Tools: LemmaGen; morpha; pymorphy2; mystem... (see the sketch after this list)

• Stemming – reducing inflected words to their word stem. The stem need not be identical to the morphological root of the word. Tools: Snowball; Lovins; Porter; nltk.stem.*...

• Root extraction – reducing derivatives to their root, i.e. to their meaning.
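A minimal sketch of the first two reduction steps in Python, assuming the pymorphy2 analyzer and NLTK's Snowball stemmer named above are installed; the outputs are illustrative only.

# Hedged sketch: lemmatization vs. stemming for Russian,
# using pymorphy2 and NLTK's Snowball stemmer (tools named on the slide).
import pymorphy2
from nltk.stem.snowball import SnowballStemmer

morph = pymorphy2.MorphAnalyzer()
stemmer = SnowballStemmer("russian")

for word in ["лесистый", "лесник", "лесничество", "лесничий", "лесной"]:
    lemma = morph.parse(word)[0].normal_form   # most probable lemma
    stem = stemmer.stem(word)                  # suffix-stripped stem
    print(word, "lemma:", lemma, "stem:", stem)

# Neither step reaches the shared root "лес"; root extraction needs
# a morpheme dictionary or a trained model on top of this.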

Page 4: AINL 2016: Grigorieva

Lemmatization

Mapping from text-word to lemma:

мыла → мыть (verb, 'to wash')
мыла → мыло (noun, 'soap')

Page 5: AINL 2016: Grigorieva

Stemming

Mapping from text-word to stem (excluding endings):

лесистый → лесист
лесник → лесник
лесничество → лесничеств
лесничий → леснич
лесной → лесн

Page 6: AINL 2016: Grigorieva

Root Extraction

Mapping from lemma to meaning:

лесистый → лес
лесник → лес
лесничество → лес
лесничий → лес
лесной → лес

(5 lemmas → 1 root)

Page 7: AINL 2016: Grigorieva

Realization

• Neural network algorithm
• Train data – 749 cases
• Cross-validation – 84 cases (10%)
• Test data – 93 cases
• Accuracy ~0.7

Page 8: AINL 2016: Grigorieva

Tasks

• plagiarism detection;
• paraphrase detection;
• textual similarity;
• semantic disambiguation;
• topic modeling;
• text classification;
• text clustering;
• question answering systems;
• building semantic graphs (entities, links and relationships between them).

Page 9: AINL 2016: Grigorieva

References

• Рацибурская Л. В. Словарь уникальных морфем современного русского языка. — М.: Флинта: Наука, 2009. — 160 с.
• Аванесов Р. И., Ожегов С. И. Морфемно-орфографический словарь: Около 100 000 слов / А. Н. Тихонов. — М.: АСТ: Астрель, 2002. — 704 с.
• Тихонов А. Н. Морфемно-орфографический словарь русского языка, 2002.
• Кузнецова А. И., Ефремова Т. Ф. Словарь морфем русского языка: Ок. 52 000 слов. — М.: Рус. яз., 1986. — 1132 с.
• http://old.kpfu.ru/infres/slovar1/begall.htm
• http://snowball.tartarus.org/algorithms/russian/stemmer.html, http://snowballstem.org/demo.html

Page 10: AINL 2016: Grigorieva

Effective Paraphrase Expansion in Addressing Lexical Variability

Vasily Konovalov, Meni Adler, Ido Dagan

Department of Computer Science

Bar-Ilan University, Israel

The 5th Conference on Artificial Intelligence and Natural Language

Page 11: AINL 2016: Grigorieva

Problem

Lexical Variability

From the Negochat negotiation dialogue corpus:

‘Reject’: “I disagree”, “I reject your proposal”, “it’s not accepted”.
‘Accept’: “I accept your offer”, “I agree to the salary”, “It’s OK”.
‘Offer’: “I offer you a salary of 60,000 USD”, “How about the programmer position”, “I propose you a pension of 10%”.

Page 12: AINL 2016: Grigorieva

Solution

Translation-based paraphrase expansion:

[Diagram: a sentence is translated into a pivot language (PL) by one MT engine (MT1) and translated back by another (MT2), yielding a paraphrase; Google and Yandex are shown as example engines.]
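A minimal sketch of this pivot-translation scheme; translate() is a hypothetical placeholder standing in for whatever MT client is used (Google, Microsoft or Yandex), not a real API call.

# Hedged sketch of translation-based paraphrase expansion via a pivot language.
# translate() is a hypothetical stand-in for an MT API client, not a real one.
def translate(text, src, tgt, engine):
    """Placeholder: return `text` translated from `src` to `tgt` by `engine`."""
    raise NotImplementedError("plug in a real MT client here")

def paraphrase(sentence, pivot_lang, mt1, mt2, src_lang="en"):
    # Forward translation into the pivot language with the first engine,
    # then back translation with the second engine, yields a paraphrase.
    pivot = translate(sentence, src_lang, pivot_lang, mt1)
    return translate(pivot, pivot_lang, src_lang, mt2)

# e.g. paraphrase("I reject your proposal", "hu", "google", "yandex")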

Page 13: AINL 2016: Grigorieva

Our research questions

◮ What is the ‘best’ performing language? Why is it actually the ‘best’ one?
◮ What is the ‘best’ performing combination of MT engines?

Page 14: AINL 2016: Grigorieva

Our research settings

Languages: Portuguese, French, German, Hebrew, Russian, Arabic, Finnish, Chinese, Hungarian.

MT engines: Google Translate API, Microsoft Translator Text API, Yandex Translate API.

Page 15: AINL 2016: Grigorieva

Our findings

◮ Among the tested languages, Hungarian is the ‘best’ performing one.
◮ The performance of a language correlates well with the averaged smoothed BLEU.
◮ The language that generates the most lexically dissimilar paraphrases is the ‘best’ performing one.
◮ The differences between MT engines are insignificant according to the averaged smoothed BLEU and are not reflected in the evaluation.
◮ Language family relations are reflected in the averaged smoothed BLEU.
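The findings rely on an averaged smoothed BLEU between sentences and their paraphrases. A minimal sketch of such a score with NLTK is shown below; the smoothing method and whitespace tokenization are assumptions, since the slides do not specify the authors' exact setup.

# Hedged sketch: averaged smoothed BLEU between sentences and their paraphrases.
# SmoothingFunction().method1 and whitespace tokenization are assumed choices.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def averaged_smoothed_bleu(pairs):
    """pairs: iterable of (original_sentence, paraphrase) string tuples."""
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([orig.split()], para.split(), smoothing_function=smooth)
        for orig, para in pairs
    ]
    return sum(scores) / len(scores)

# Lower values indicate more lexically dissimilar paraphrases.
# e.g. averaged_smoothed_bleu([("I reject your proposal",
#                               "I do not accept your offer")])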

Page 16: AINL 2016: Grigorieva

Come and see our poster

Page 17: AINL 2016: Grigorieva

RESEARCHING QUANTITATIVE CHARACTERISTICS OF SHORT TEXTS: SCIENTIFIC, NEWS, USE WRITINGS

Page 18: AINL 2016: Grigorieva

■ For data analysis, we used several text collections.

■ For scientific texts: collections from the Dialogue conference (2003–2006) and from Corpus Linguistics.

■ For news: a collection of short mass-media articles from sources such as Lenta.ru, the Russian Newspaper, RBC, the Independent Newspaper, and Kompyulenta.

■ To research writings from the Unified State Examination (USE) we created two collections: a "reference" collection containing writings produced by experts, and a second one written by students.

Page 19: AINL 2016: Grigorieva

■ For this research we selected the most representative characteristics: entropy, readability, lexical diversity, verbal, autosem (all words except the function ("service") parts of speech), and frequencies (the ratio of the first hundred most frequent words of the Russian language to all words in the text).
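A minimal sketch of how two of these characteristics might be computed (word-level entropy and lexical diversity as a type/token ratio); standard textbook definitions are assumed, since the slides do not give the authors' exact formulas.

# Hedged sketch: word-level entropy and lexical diversity (type/token ratio).
# Standard definitions are assumed; the authors' exact formulas are not specified.
import math
from collections import Counter

def text_characteristics(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    # Shannon entropy of the word-frequency distribution, in bits per word.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Lexical diversity: distinct word forms divided by text length.
    lexical_diversity = len(counts) / total
    return {"entropy": entropy, "lexical_diversity": lexical_diversity}

# e.g. text_characteristics("мама мыла раму мама мыла".split())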

Page 20: AINL 2016: Grigorieva

[Bar chart: entropy for USE expert, USE students, News, and Scientific texts; vertical axis 0–14.]

Page 21: AINL 2016: Grigorieva

[Bar chart: readability for USE expert, USE students, News, and Scientific texts; vertical axis 0–0.25.]

Page 22: AINL 2016: Grigorieva

[Bar chart: lexical diversity for USE expert, USE students, News, and Scientific texts; vertical axis 0–0.7.]

Page 23: AINL 2016: Grigorieva

[Bar chart: verbal for USE expert, USE students, News, and Scientific texts; vertical axis 0.136–0.154.]

Page 24: AINL 2016: Grigorieva

[Bar chart: autosem for USE expert, USE students, News, and Scientific texts; vertical axis 0.68–0.8.]

Page 25: AINL 2016: Grigorieva

[Bar chart: frequencies for USE expert, USE students, News, and Scientific texts; vertical axis 0–0.3.]

Page 26: AINL 2016: Grigorieva

Building a Lexicon-Based Lemmatizer for Old Irish

Oksana Dereza

[email protected]

Page 27: AINL 2016: Grigorieva

Old Irish: Grammar

• Changes can occur to any part of the word:

o beginning: mutations
o middle: infixed pronouns
o end: flections

caraid ‘he / she / it loves’        rob-car-si ‘she has loved you’

• Forms within a paradigm (especially a verbal one) can look very different:

do-beir ‘gives, brings’        ní t(h)abair ‘does not give, bring’

Page 28: AINL 2016: Grigorieva

Old Irish: Orthography

• Inconsistent use of length marks
• Mutations are not always shown in writing
• Complex verb forms can be spelled either with or without a hyphen or a whitespace
• In later texts there are mute vowels to indicate the quality (broad / slender) of consonants next to them
⇨ a great number of possible spellings for every form

Consonant → mutated consonant(s):

b → bh, mb
c → ch, gc, cc
d → dh, nd
f → fh, ḟ, ḟh, bhf
g → gh, ng
l → ll, l-l
m → mh, mm, m-m
n → nn
p → ph, bp
r → rr
s → sh, ṡ, ss, ts, s-s
t → th, dt
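The lemmatizer described a few slides later uses "return a demutated form" as its baseline for unknown words. Below is a minimal sketch of how demutation might look based on the table above; the prefix mapping, the longest-match rule and the handling of hyphenated prefixes are assumptions, not the authors' actual rules.

# Hedged sketch: strip an initial mutation from an Old Irish form using the
# consonant table above. The mapping and longest-prefix rule are assumptions,
# not the authors' actual demutation logic.
MUTATIONS = {
    "bh": "b", "mb": "b",
    "ch": "c", "gc": "c", "cc": "c",
    "dh": "d", "nd": "d",
    "fh": "f", "ḟ": "f", "ḟh": "f", "bhf": "f",
    "gh": "g", "ng": "g",
    "mh": "m", "mm": "m",
    "ph": "p", "bp": "p",
    "sh": "s", "ṡ": "s", "ts": "s",
    "th": "t", "dt": "t",
}

def demutate(form):
    if form[:2] in ("n-", "h-", "t-"):      # hyphenated prefixes, e.g. "n-uaill"
        form = form[2:]
    for prefix in sorted(MUTATIONS, key=len, reverse=True):
        if form.startswith(prefix):
            return MUTATIONS[prefix] + form[len(prefix):]
    return form

# e.g. demutate("cheast") -> "ceast", demutate("n-uaill") -> "uaill"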

Page 29: AINL 2016: Grigorieva

Data

• Dictionary of the Irish Language (DIL)

43,345 entries ⇨ 79,140 unique forms

• Corpus

125 texts, 831,280 tokens

• Gold standard

50 random sentences from the test corpus, 840 tokens

• Not only classical Old Irish

The corpus covers VII-XVI centuries

Page 30: AINL 2016: Grigorieva

Problems

• DIL covers only ~ 41% of unique forms in the corpus

• Many contracted forms, but no unified system of contractions

• Inconsistent use of markup and punctuation

[Screenshot of the eDIL entry for "caraid" (dil.ie/8212), illustrating the inconsistent markup and punctuation: the Forms section lists dozens of attested spellings (-carim, -cairim, caraim, -caraim, -caru, -cari, carid, caraid, -cara, carthai, caras, charas, caris, carthar, ... rob-car-si, ro-car, arro-car, ... carthain, carthi), followed by grammatical notes and manuscript citations such as "weak vb. with reduplicated fut. on analogy of canaid (Thurn. Gramm. 402). Ind. pres. 1 s. -carim, Wb. 5c7 ..." and sense glosses such as "(a) loves (persons): ... rot charus ar th'airscélaib 'I have fallen in love with thee', LU 6084 (TBC)".]

Page 31: AINL 2016: Grigorieva

Lemmatizer

• Two methods for OOV words:

o Baseline: return a demutated form
o Predict a lemma using a modified Damerau-Levenshtein distance

• Disambiguation:

o For homonymous forms, the lemma with the highest lexical probability is chosen
o Lemma probability equals the sum of the probabilities of its forms, and form probability is its frequency count in the corpus
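A minimal sketch of this disambiguation rule; lemma_to_forms and form_freq are illustrative data structures standing in for the lexicon and the corpus counts, not the authors' code.

# Hedged sketch of the disambiguation rule: a lemma's probability is the sum of
# its forms' corpus frequencies; for a homonymous form, take the most probable lemma.
# `lemma_to_forms` and `form_freq` are illustrative stand-ins, not the authors' code.
def lemma_probability(lemma, lemma_to_forms, form_freq):
    return sum(form_freq.get(form, 0) for form in lemma_to_forms[lemma])

def disambiguate(candidate_lemmas, lemma_to_forms, form_freq):
    return max(candidate_lemmas,
               key=lambda lemma: lemma_probability(lemma, lemma_to_forms, form_freq))

# e.g. disambiguate(["cain", "canaid"], lemma_to_forms, form_freq)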

Page 32: AINL 2016: Grigorieva

Predicting lemmas for OOV-words

• Generate all possible strings at edit distance 1 and 2
• Check them against the dictionary
• Add real words to the candidate list
• Filter candidates by the first character:

"If the unknown word starts with a vowel, the candidate should also start with a vowel, and if the unknown word starts with a consonant, the candidate should start with the same consonant"

• The lemma of the candidate with the highest lexical probability (i.e. frequency count in the corpus) is taken as the lemma for the unknown word (see the sketch below)
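A minimal Python sketch of this candidate-generation procedure (Norvig-style edit operations); the alphabet, dictionary_forms, form_to_lemma and form_freq names are illustrative assumptions, not the authors' implementation.

# Hedged sketch of OOV lemma prediction: generate edit-distance-1/2 strings,
# keep those that are real dictionary forms, filter by first character, and
# choose by lexical probability. All names and the alphabet are illustrative.
VOWELS = set("aeiouáéíóú")
ALPHABET = "abcdefghilmnoprstuáéíóú"   # assumed Old Irish letter inventory

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts

def first_char_ok(unknown, candidate):
    if unknown[0] in VOWELS:
        return candidate[0] in VOWELS
    return candidate[0] == unknown[0]

def predict_lemma(word, dictionary_forms, form_to_lemma, form_freq):
    e1 = edits1(word)
    e2 = {c for w in e1 for c in edits1(w)}   # edit distance 2
    candidates = [c for c in (e1 | e2)
                  if c in dictionary_forms and first_char_ok(word, c)]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: form_freq.get(c, 0))
    return form_to_lemma[best]

# e.g. predict_lemma("cheast", ...) could return the lemma "ceist" via the form "ceist".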

Page 33: AINL 2016: Grigorieva

Evaluation

Lexicon                                  Forms     'Recall'
DIL forms only                           79,140    74.7 %
DIL + 1000 most frequent OOV-words       80,206    80.0 %

! 4,889 homonymous forms

                        Baseline     Predicted lemmas
Lemmatized correctly    483 / 840    552 / 840
Accuracy                57.50 %      65.71 %

Page 34: AINL 2016: Grigorieva

Evaluation

Tokens                                        840
Known words                                   654
Unknown words                                 186
Lemmatized correctly                          552
Lemmas predicted for unknown words            157
Predicted correctly                            84
Predicted incorrectly                          68
Several lemmas predicted, including the
correct one, but the wrong one chosen           5

~60 % of lemmas are predicted correctly

Page 35: AINL 2016: Grigorieva

Predicted lemmas

   Token        Best candidate          Best candidate's lemma(s)      Chosen lemma
                (closest dict. form)
+  eólais       eólas                   eólas                          eólas
+  fiarfaigid   fíarfaigid              fíarfaigid, íarmi-foich        íarmi-foich
+  cheast       ceist                   ceist                          ceist
*  déa          dia                     dá, de, do, día                de
+  bréithir     bréthir                 bríathar                       bríathar
–  n-uaill      aill                    aile, aill, all, aille         aile
–  chuain       cain                    cain, canaid, cani, caingen    canaid
–  christ       ceist                   ceist                          ceist
–  caeme        caíme                   caíme                          caíme
–  chniss       cliss                   cles                           cles

Page 36: AINL 2016: Grigorieva

Source Code & Corpora

Source code

https://github.com/ancatmara/old_irish_lemmatizer

Texts

https://github.com/ancatmara/old_irish_corpora

Page 37: AINL 2016: Grigorieva

Extraction of Social Networks from Literary Text

Tsygankova Viktoria, National Research University Higher School of Economics, Moscow

Page 38: AINL 2016: Grigorieva

NovelGraphs

A tool for automatic annotation of texts and for extracting social networks of characters from text, where nodes represent characters and edges are relations between them. It can also analyze the structural balance of the resulting graphs.
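The slides do not show NovelGraphs' internals, so as a rough illustration of the idea (nodes are characters, weighted edges are interactions), here is a minimal co-occurrence sketch with networkx, using sentence-level co-mention as a crude stand-in for the tool's extractors and aggregators.

# Hedged sketch: character graph where nodes are characters and edge weights count
# how often two characters appear in the same sentence. This is only a crude proxy
# for NovelGraphs' extractor/aggregator pipeline, not its actual implementation.
import itertools
import networkx as nx

def build_character_graph(sentences, characters):
    """sentences: list of strings; characters: list of lowercase character names."""
    graph = nx.Graph()
    graph.add_nodes_from(characters)
    for sentence in sentences:
        lowered = sentence.lower()
        present = [name for name in characters if name in lowered]
        for a, b in itertools.combinations(present, 2):
            weight = graph.get_edge_data(a, b, default={}).get("weight", 0)
            graph.add_edge(a, b, weight=weight + 1)
    return graph

# e.g. build_character_graph(sentences, ["holmes", "lestrade", "gregson", "rance"])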

Page 39: AINL 2016: Grigorieva

[Character graph figure. Nodes include: gray, henry wotton, hallward, lady henry, erskine, adrian, prince paradox, duke de valentinois, narborough, borgia, filippo, louis xii, gian maria visconti, romeo, mercutio, ruxton.]

Example graph of the “Picture of Dorian Gray” by Oscar Wilde

Page 40: AINL 2016: Grigorieva

Example graph of the “Study in Scarlet” by A. Conan Doyle

[Character graph figure. Nodes include: holmes, narrator, lestrade, gregson, murcher, rance, joseph stangerson.]

Page 41: AINL 2016: Grigorieva

Example graph of the “Study in Scarlet” by A. Conan Doyle with sentiment

Page 42: AINL 2016: Grigorieva

Example graph of the “Picture of Dorian Gray” by Oscar Wilde with sentiment

Page 43: AINL 2016: Grigorieva

Conclusions

• The NovelGraphs tool was created for English-language literary fiction; it uses a new approach to extracting characters and the connections between them.

• Nodes represent characters found in the text, and edges connect them to the other characters with whom they interact.

• At the moment, combinations of extractors and aggregators detect characters better than the interactions between them.

• Analysis of structural balance identifies key passages of the text that correspond to the minima and maxima on the balance plot.

Page 44: AINL 2016: Grigorieva

Thanks for watching!

Page 45: AINL 2016: Grigorieva

Are the results of your corpus research really reliable?
Getting automatic result analysis on GICR

Tatiana Shavrina, Daniil Selegey

AINL FRUCT, SPb, 12.11.2016

Page 46: AINL 2016: Grigorieva

Big Corpora Problem:

1. Billions of words, mostly coming from social media.

2. Getting just the IPM and search results in KWIC format doesn't tell you whether the results are biased.

3. A lot of metatext attributes – URLs, doc IDs, author IDs, region, gender, genre, etc. – all are potential sources of bias.

Users need corpus tools that show all the statistics of the search area, to check its homogeneity with the whole corpus (a minimal sketch of such a check follows below).
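The slides do not specify how GICR implements this check; purely as an illustration, here is a minimal sketch comparing the distribution of one metadata attribute (e.g. genre) in the search results against the whole corpus with a chi-square goodness-of-fit test from SciPy.

# Hedged sketch: is the search area homogeneous with the whole corpus on one
# metadata attribute (e.g. genre)? Chi-square goodness-of-fit is an assumed choice;
# this is not how GICR actually implements its result analysis.
from collections import Counter
from scipy.stats import chisquare

def homogeneity_check(result_attrs, corpus_attrs, alpha=0.05):
    """result_attrs / corpus_attrs: one attribute value per document."""
    corpus_counts = Counter(corpus_attrs)
    result_counts = Counter(result_attrs)
    categories = sorted(corpus_counts)
    observed = [result_counts.get(cat, 0) for cat in categories]
    total = sum(observed)
    # Expected counts if the search area mirrored the whole corpus exactly.
    expected = [total * corpus_counts[cat] / len(corpus_attrs) for cat in categories]
    stat, p_value = chisquare(observed, f_exp=expected)
    return p_value >= alpha, p_value   # True = no evidence of bias at level alpha

# e.g. ok, p = homogeneity_check(results_genres, corpus_genres)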

Page 47: AINL 2016: Grigorieva

Our solution: Search results analysis right in the interface!

Page 48: AINL 2016: Grigorieva

See you at our

Demo stand!