introduction to natural language...

48
Introduction to Natural Language Processing a course taught as B4M36NLP at Open Informatics by members of the Institute of Formal and Applied Linguistics Today: Week 4, lecture Today’s topic: Overview of Language Data Resources Today’s teacher: Zdenˇ ek ˇ Zabokrtsk´ y E-mail: [email protected]ff.cuni.cz WWW: http://ufal.mff.cuni.cz/zdenek-zabokrtsky Zdenˇ ek ˇ Zabokrtsk´ y ( ´ UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 1 / 48

Upload: others

Post on 24-Feb-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Introduction to Natural Language Processing

a course taught as B4M36NLP at Open Informatics

by members of the Institute of Formal and Applied Linguistics

Today: Week 4, lectureToday’s topic: Overview of Language Data Resources

Today’s teacher: Zdenek Zabokrtsky

E-mail: [email protected]

WWW: http://ufal.mff.cuni.cz/zdenek-zabokrtsky

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 1 / 48

Page 2: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Why language data?

In general, when studying any language phenomenon, there are two basicways to go:

thinking about it in the context of one’s language experience, usingintrospection. . .

or using empirical evidence, statistical models based on real worldusage of language . . .

I side remark: this includes also using brain-imaging methods or at leasteye-tracking devices, but such approaches are still rare in the real NLPindustry

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 2 / 48

Page 3: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Armchair linguistics or data crunching?1957: Noam Chomsky’s attack: “Any natural corpus will be skewed.Some sentences won’t occur because they are obvious, others becausethey are false, still others because they are impolite. The corpus, ifnatural, will be so wildly skewed that the description would be nomore than a mere list.”1992: Charles J. Fillmoore’s caricature of “armchair linguists” vs.“corpus linguists”1988: Frederick Jelinek: ”Every time I fire a linguist, the performanceof the speech recognizer goes up” (perhaps not an exact citation)but 2004: Frederick Jelinek: “My colleagues and I always hoped thatlinguistics will eventually allow us to strike gold.”2005: Tony McEnery: “Corpus data are, for many applications, theraw fuel of NLP, and/or the testbed on which an NLP application isevaluated.”200?: Eric Brill: “More data is more important than betteralgorithms.”200?: Eugene Charniac: “Future is in statistics.”

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 3 / 48

Page 4: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

The world of language data resources today

Today’s Language data resources map - hopelessly diverse.

A very very tiny fragment for illustration: only ontologically-orienteddata collections, just those adhering to the linked open data principles(credit: Wikipedia)

2016: 1,250 submissions to LREC 2016 (International Conference onLanguage Resources and Evaluation, biannual)

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 4 / 48

Page 5: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Why is that so complicated?

Why researchers need so many different pieces of data?

Is the natural language really so complex? Well, yes.

In addition,I thousands of languages (plus dialects), different writing systems. . .I many underlying theoriesI many end-application purposes

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 5 / 48

Page 6: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Let’s try to systematize the space of data resourcesBasic dimensions:

corpus vs. lexiconI lexicon in the broad sense, as a repertory of tokens’ types

modality: spoken vs. writtenI and other, eg. sign languages

covered languages: monolingual vs. multilingualI if multilingual, then possibly parallel

time axis: synchronic vs. diachronicI if annotated, then what on which “level”, with which underlying theory,

what tag set . . .time axis: synchronic vs. diachronic

I if annotated, then what on which “level”, with which underlying theory,what tag set . . .

plain vs. annotatedI if annotated, then what on which “level” (which language phenomena

are captured), with which underlying theory, with what set of labels(tag set) . . .

other language variables:I original vs. translationI native speaker vs. learnerI various kinds of language disorders . . .

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 6 / 48

Page 7: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Corpora

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 7 / 48

Page 8: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

CORPUS according to Merriam-Webster

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 8 / 48

Page 9: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

A historical remark

linguists recognized the need for unbiased empirical evidence longbefore modern NLP

I excerption tickets collected systematically for Czech from 1911

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 9 / 48

Page 10: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Corpus size

typically measured in tokens (words plus puntuation marks)

sampling is inescapableI an I-want-it-all corpus is far beyond our technology (even in a strictly

synchronic sense)

but still, the corpora sizes have been growing at an exponential pacefor some time:

I Brown Corpus in 1964 ≈ 1MWI (electronic corpus of Czech texts in 1970s: 500kW)I British Natural Corpus in 1994 ≈ 100 MWI English Gigaword in 2004 ≈ 1GWI Google’s 5-gram for 10 European Languages in 2009 based on ≈ 1TW

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 10 / 48

Page 11: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Balanced corpora

an elusive goal: a balanced corpus whose proportions correspond tothe real language usage

criteria for choosing types of texts their relative proportion in thecorpus (and eventually concrete texts)?

I style, genreI reception vs. perception (a few influential authors vs. production of a

large community)?

actually no convincing generally valid answers for an optimal mixture. . .

. . . but at least some strategies seem to be more reasonable thanothers

an example of a clearly imbalanced corpus: Wall Street JournalCorpus

I unfortunately used as a material source for the Penn Treebank, whichis undoubtedly among the most influential LR

I “NLP = Wall Street Journal science”

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 11 / 48

Page 12: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Corpus annotation

raw texts – difficult to exploit

solution: gradual “information adding” (more exactly, adding theinformation in an explicit, machine tractable form)

annotation = adding selected linguistic information in an explicit formto a corpus

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 12 / 48

Page 13: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Corpus annotation criticism

some critics: an annotated corpus is worse than a raw corpus becauseof forced interpretations

I one has to struggle with different linguistic traditions of differentnational schools

I example: part of speech categories

relying on annotation might be misleading if the quality is low (errorsor inconsistencies)

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 13 / 48

Page 14: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Variability of PoS tag sets

Penn Treebank POS tagset (for English)

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 14 / 48

Page 15: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Variability of PoS tag sets, cont.

Negra Corpus POS tagset (for German)

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 15 / 48

Page 16: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Variability of PoS tag sets, cont.

Prague Dependency Treebank morphologitagset (for Czech), severalthousand combinations using 15-character long positional tags

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 16 / 48

Page 17: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Treebanks

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 17 / 48

Page 18: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Treebanks

a treebank is a corpus in which sentences’ syntax and/or semantics isanalyzed using tree-shaped data structures

a tree in the sense of graph theory (a connected acyclic graph)

sentence syntactic analysis ... it sounds familiar to most of you,doesn’t it?

Credit: http://konecekh.blog.cz

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 18 / 48

Page 19: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Why trees: Initial thoughts

1 Honestly: trees are irresistibly attractive data structures.

2 We believe sentences can be reasonably represented by discrete unitsand relations among them.

3 Some relations among sentence components (such as some wordgroupings) make more sense than others.

4 In other words, we believe there is an latent but identifiable discretestructure hidden in each sentence.

5 The structure must allow for various kinds of nestedness (. . . a ja murek, ze nejsem Rek, abych mu rek, kolik je v Recku reckych rek . . . ).

6 This resembles recursivity. Recursivity reminds us of trees.

7 Let’s try to find such trees that make sense linguistically and can besupported by empirical evidence.

8 Let’s hope they’ll be useful in developing NLP applications such asMachine Translation.

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 19 / 48

Page 20: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

So what kind of trees?There are two types of trees broadly used:

constituency (phrase-structure) trees

dependency trees

Credit: Wikipedia

Constituency trees simply don’t fit to languages with freer word order,such as Czech. Let’s use dependency trees.

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 20 / 48

Page 21: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

How do we know there is a dependency between twowords?

There are various clues manifested, such as

I word order (juxtapositon): “. . . prijdu zıtra . . . ”I agreement: “. . . novymi.pl.instr knihami.pl.instr . . . ”I government: “. . . slıbil Petrovi.dative . . . ”

Different languages use different mixtures of morphological strategiesto express relations among sentence units.

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 21 / 48

Page 22: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Basic assumptions about building units

If a sentence is to be represented by a dependency tree, then we need tobe able to:

identify sentence boundaries.

identify word boundaries within a sentence.

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 22 / 48

Page 23: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Basic assumptions about dependencies

If a sentence is to be represented by a dependency tree, then:

there must be a unique parent word for each word in each sentence,except for the root word

there are no loops allowed.

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 23 / 48

Page 24: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Even the most basic assumptions are violated

Sometimes sentence boundaries are unclear – generally in speech,but e.g. in written Arabic too, and in some situations even in writtenCzech (e.g. direct speech)

Sometimes word boundaries are unclear, (Chinese, “ins” inGerman, “abych” in Czech).

Sometimes its unclear which words should become parents (Apreposition or a noun? An auxiliary verb or a meaningful verb? . . . ).

Sometimes there are too many relations (“Zahledla ho boseho.”),which implies loops.

Life’s hard. Let’s ignore it and insist on trees.

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 24 / 48

Page 25: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Counter-examples revisited

If we cannot find lingustically justified decisions, then make them at leastconsistent.

Sometimes sentence boundaries are unclear (generally in speech, bute.g. in written Arabic too. . . )

I OK, so let’s introduce annotation rules for sentencesegmentation.

Sometimes word boundaries are unclear, (Chinese, “ins” in German,“abych” in Czech).

I OK, so let’s introduce annotation rules for tokenization.

Sometimes it’s not clear which word should become parent (e.g. apreposition or a noun?).

I OK, so let’s introduce annotation rules for choosing parent.

Sometimes there are too many relations (“Zahledla ho boseho.”),which implies loops.

I OK, so let’s introduce annotation rules for choosing tree-shapedskeleton.

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 25 / 48

Page 26: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Treebanking

Is our dependency approach viable? Can we check it?

Let’s start by building the trees manually.

a treebank - a collection of sentences and associated (typicallymanually annotated) dependency trees

for English: Penn Treebank [Marcus et al., 1993]

for Czech: Prague Dependency Treebank [Hajic et al., 2001]I layered annotation scheme: morhology, surface syntax, deep syntaxI dependency trees for about 100,000 sentences

high degree of design freedom and local linguistic tradition bias

different treebanks =⇒ different annotation styles

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 26 / 48

Page 27: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Case study on treebank variability: Coordination

coordination structures such as“lazy dogs, cats and rats” consistsof

I conjunctsI conjunctionsI shared modifiersI punctuations

16 different annotation stylesidentified in 26 treebanks (andmany more possible)

different expressivity, limitedconvertibility, limited comparabilityof experiments. . .

harmonization of annotationstyles badly needed!

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 27 / 48

Page 28: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

How many treebanks are there out there?

growing interest in dependency treebanks in the last decade or two

existing treebanks for about 50 languages now (but roughly 7,000languages in the world)

UFAL participated in several treebank unification efforts:I 13 languages in CoNLL in 2006I 29 languages in HamleDT in 2011I 37 languages in Universal Dependencies in 2015:

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 28 / 48

Page 29: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Other specialized corpora

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 29 / 48

Page 30: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Parallel corporaspecific feature: alignment between corresponding units in two (ormore) languages

I document level alignmentI sentence level alignmentI word level alignmentI (morpheme level alignment?)

example: The Rosetta Stoneexample: CzEng - a Czech-English parallel corpus, roughly 0.5 wordsfor each language, automatically parsed (using PDT schema) and

alignedZdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 30 / 48

Page 31: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Named entity corpora

specific feature: instances of proper names, such as names of people,geographical names,

example: Czech Named Entity Corpus - two-level hierarchy of 46named entity types, 35k NE instances in 9k sentences

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 31 / 48

Page 32: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Coreference corpora

specific feature: capturing relations between expressions that refer tothe same entity of the real world

(credit: Shumin Wu and Nicolas Nicolov)

example: Prague Dependency Treebanks (around 40k coreferencelinks in Czech texts)

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 32 / 48

Page 33: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Sentiment corpora

specific feature: capture the attitude (in the sense of emotionalpolarity) of a speaker with respect to some topic/expression

simply said: “is this good or is it bad?”

obviously over-simplified, but highly demanded e.g. by the marketingindustry

(credit: SemEval 2014 documentation)

example: MPQA Corpus

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 33 / 48

Page 34: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Highly multi-lingual corpora

specific feature: as many languages as possible

examples:I W2C - at least 1MW for more than 100 languagesI The Bible Corpus - translations of the Bible into 900 languages

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 34 / 48

Page 35: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Examples of Lexicon-like Data Resources

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 35 / 48

Page 36: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Inflectional lexicons

specific feature: capturing the relation between a lemma and inflectedword forms, ideally in both directions

example: MorfFlex CZ, around 120M word forms associated with 1Mlemmas

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 36 / 48

Page 37: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Derivational lexicons

specific feature: capturing the relation between a base word and aderived word (typically by prefixing and/or suffixing)

example: DeriNet, 1M lemmas, 700k derivation links

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 37 / 48

Page 38: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Thesaurus

specific feature: capturing semantic relations between words, such assynonymy and antonymy

example:

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 38 / 48

Page 39: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Wordnets

specific feature: hyponymy (hyperonymy) forest composed of synsets(sets of synonymous words)

example: Princeton Wordnet

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 39 / 48

Page 40: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

EuroWordNet

specific feature: wordnets of several languages interconnected throughEnglish as the hub language

(credit: intuit.ru)

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 40 / 48

Page 41: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Valency lexicons

specific feature: capturing combinatory potential of a word (mostfrequently of a verb) with other sentence elements

example: VALLEX - Valency Lexicon of Czech Verbs

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 41 / 48

Page 42: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

... and many other types of language resources

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 42 / 48

Page 43: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Speech corpora

specific feature: recordings of authentic speech, typically with manualtranscriptions

for training Automatic Speech Recognition systems

example: The Switchboard-1 Telephone Speech Corpus, 2,400telephone conversations, manual transcriptions

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 43 / 48

Page 44: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Datasets primarily uninteded as corpora

Web as a corpus

Wikipedia as a corpus

Enron corpus - 600,000 emails generated by 158 employees of theEnron Corporation

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 44 / 48

Page 45: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

“Metainformation” about languages

example: The World Atlas of Language Structures (WALS)I http://wals.info/I specific feature: various language properties (related e.g. to word

order, morphology, syntax) captured for hundreds of languages

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 45 / 48

Page 46: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

Final remarks

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 46 / 48

Page 47: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

A final remark: current trends in language resources . . .

trends (in the last few years) according to Nicoletta Calzolari’s LREC 2016foreword

social media analysis

discourse, dialog and interactivity

treebanks

under-resourced languages

semantics

multi-linguality

evaluation methodologies

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 47 / 48

Page 48: Introduction to Natural Language Processingufal.mff.cuni.cz/~zabokrtsky/npfl124/slides/lect04-language-data.pdfor using empirical evidence, statistical models based on real world usage

. . . and the last word

Be careful when you hear (or say) that some language data resource (or anannotation scheme, or a probabilistic model, or a technologicalstandard. . . ) is

theory neutral, orI If fact we cannot “measure” language stuctures per se, and thus we

always rely on some assumptions or conventions etc.

language independent.I In fact it is impossible for an NLP developer to consider all variations

in morphology/syntax/semantics of all language.

Zdenek Zabokrtsky (UFAL MFF UK) Overview of Language Data Resources Week 4, lecture 48 / 48