TRANSCRIPT
What Every Corpus Linguist Should Know About Type-Token Distributions and Zipf's Law
Tutorial Workshop #9, 22 July 2019
Stefan Evert, FAU Erlangen-Nürnberg
http://zipfr.r-forge.r-project.org/lrec2018.html
Licensed under CC-by-sa version 3.0
Stefan Evert T1: Zipf’s Law 22 July 2019 | CC-by-sa 1 / 117
Outline

Introduction
- Motivation
- Notation & basic concepts
- Zipf's law
- First steps (zipfR)

LNRE models
- Population & samples
- The mathematics of LNRE

Applications & examples
- Productivity & lexical diversity
- Practical LNRE modelling
- Bootstrapping experiments
- LNRE as Bayesian prior

Challenges
- Model inference
- Zipf's law
- Non-randomness
- Significance testing
- Outlook
Introduction: Motivation
Some research questions
- How many words did Shakespeare know?
- What is the coverage of my treebank grammar on big data?
- How many typos are there on the Internet?
- Is -ness more productive than -ity in English?
- Are there differences in the productivity of nominal compounds between academic writing and novels?
- Does Dickens use a more complex vocabulary than Rowling?
- Can a decline in lexical complexity predict Alzheimer's disease?
- How frequent is a hapax legomenon from the Brown corpus?
- What is appropriate smoothing for my n-gram model?
- Who wrote the Bixby letter, Lincoln or Hay?
- How many different species of ... are there? (Brainerd 1982)
Some research questions
[The slide revisits the questions above, grouped by type of analysis:]
- coverage estimates
- productivity
- lexical complexity & stylometry
- prior & posterior distribution
- unexpected applications
Type-token statistics
- These applications relate token and type counts
  - tokens = individual instances (occurrences)
  - types = distinct items
- Type-token statistics differ from most statistical inference
  - not about the probability of a specific event
  - but about the diversity of events and their probability distribution
- Relatively little work in statistical science
- Not a major research topic in computational linguistics either
  - very specialized, usually plays an ancillary role in NLP
- Corpus linguistics: TTR & simple productivity measures
  - often applied without any statistical inference
Zipf’s law (Zipf 1949)
A) Frequency distributions in natural language are highly skewed
B) Curious relationship between rank & frequency (Dickens):

   word   r    f        r · f
   the    1    142,776  142,776
   and    2    100,637  201,274
   be     3     94,181  282,543
   of     4     74,054  296,216

C) Various explanations of Zipf's law:
- principle of least effort (Zipf 1949)
- optimal coding system, MDL (Mandelbrot 1953, 1962)
- random sequences (Miller 1957; Li 1992; Cao et al. 2017)
- Markov processes → n-gram models (Rouault 1978)
D) Language evolution: birth-death process (Simon 1955)
→ not the main topic today!
Goals of this tutorial
- Introduce descriptive statistics, notation and terminology
- Explain the mathematical foundations of LNRE models for statistical inference
- Practise application of the models in R
- Discuss measures of productivity & lexical richness
- Address problems and advanced techniques
Introduction: Notation & basic concepts
Tokens & types
our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very

- N = 15: number of tokens = sample size
- V = 7: number of distinct types = vocabulary size
  (recently, very, not, otherwise, much, merely, now)

type-frequency list:

   w          f_w
   recently   1
   very       5
   not        3
   otherwise  1
   much       2
   merely     2
   now        1
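The token count, type count and type-frequency list above can be computed directly; a minimal sketch in Python (the tutorial itself uses R/zipfR for this):

```python
from collections import Counter

tokens = ("recently very not otherwise much very very "
          "merely not now very much merely not very").split()

N = len(tokens)         # sample size (number of tokens)
freq = Counter(tokens)  # type-frequency list f_w
V = len(freq)           # vocabulary size (number of distinct types)

print(N, V, freq["very"])  # 15 7 5
```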
Zipf ranking
our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very

- N = 15: number of tokens = sample size
- V = 7: number of distinct types = vocabulary size
  (recently, very, not, otherwise, much, merely, now)

Zipf ranking:

   w          r   f_r
   very       1   5
   not        2   3
   merely     3   2
   much       4   2
   now        5   1
   otherwise  6   1
   recently   7   1

[Plot "Zipf ranking: adverbs": rank (1–7) vs. frequency (0–10)]
A realistic Zipf ranking: the Brown corpus
   top frequencies          bottom frequencies
   r   f      word    rank range      f   randomly selected examples
   1   69836  the      7731 –  8271  10   schedules, polynomials, bleak
   2   36365  of       8272 –  8922   9   tolerance, shaved, hymn
   3   28826  and      8923 –  9703   8   decreased, abolish, irresistible
   4   26126  to       9704 – 10783   7   immunity, cruising, titan
   5   23157  a       10784 – 11985   6   geographic, lauro, portrayed
   6   21314  in      11986 – 13690   5   grigori, slashing, developer
   7   10777  that    13691 – 15991   4   sheath, gaulle, ellipsoids
   8   10182  is      15992 – 19627   3   mc, initials, abstracted
   9    9968  was     19628 – 26085   2   thar, slackening, deluxe
  10    9801  he      26086 – 45215   1   beck, encompasses, second-place
A realistic Zipf ranking: the Brown corpus
[Plot of the full Zipf ranking of the Brown corpus]
Introduction Notation & basic concepts
Frequency spectrum

- pool types with f = 1 (hapax legomena), types with f = 2 (dis legomena), ..., f = m, ...
- V1 = 3: number of hapax legomena (now, otherwise, recently)
- V2 = 2: number of dis legomena (merely, much)
- general definition: Vm = |{w | f_w = m}|

Zipf ranking:               frequency spectrum:

   w          r   f_r          m   Vm
   very       1   5            1   3
   not        2   3            2   2
   merely     3   2            3   1
   much       4   2            5   1
   now        5   1
   otherwise  6   1
   recently   7   1

[Plot "frequency spectrum: adverbs": m vs. Vm (0–10)]
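Computing the spectrum from a type-frequency list is a one-liner; a sketch in Python (the tutorial uses zipfR's tfl2spc for this):

```python
from collections import Counter

tokens = ("recently very not otherwise much very very "
          "merely not now very much merely not very").split()

freq = Counter(tokens)             # type-frequency list f_w
spectrum = Counter(freq.values())  # Vm = |{w | f_w = m}|

print(sorted(spectrum.items()))    # [(1, 3), (2, 2), (3, 1), (5, 1)]
```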
A realistic frequency spectrum: the Brown corpus
[Bar plot "frequency spectrum: Brown corpus": m = 1–15 vs. Vm (0–20,000)]
Vocabulary growth curve
our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very

- N = 1:  V(N) = 1, V1(N) = 1
- N = 3:  V(N) = 3, V1(N) = 3
- N = 7:  V(N) = 5, V1(N) = 4
- N = 12: V(N) = 7, V1(N) = 4
- N = 15: V(N) = 7, V1(N) = 3

[Plot "vocabulary growth curve: adverbs": N vs. V(N) and V1(N)]
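The vocabulary growth values above can be reproduced by scanning prefixes of the sample; a sketch in Python (zipfR's vec2vgc does this in R):

```python
from collections import Counter

tokens = ("recently very not otherwise much very very "
          "merely not now very much merely not very").split()

def vgc(tokens, checkpoints):
    """Vocabulary growth: V(N) and V1(N) at the given sample sizes N."""
    out = []
    for n in checkpoints:
        freq = Counter(tokens[:n])                    # first n tokens only
        v = len(freq)                                 # V(N)
        v1 = sum(1 for f in freq.values() if f == 1)  # V1(N): hapax count
        out.append((n, v, v1))
    return out

print(vgc(tokens, [1, 3, 7, 12, 15]))
# [(1, 1, 1), (3, 3, 3), (7, 5, 4), (12, 7, 4), (15, 7, 3)]
```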
A realistic vocabulary growth curve: the Brown corpus
[Plot "vocabulary growth curve: Brown corpus": V(N) and V1(N) for N up to 1 million tokens; V(N) reaches ca. 50,000 types]
Vocabulary growth in authorship attribution

- Authorship attribution by n-gram tracing applied to the case of the Bixby letter (Grieve et al. 2018)
- Word or character n-grams in the disputed text are compared against large "training" corpora from the candidate authors

[Figure 1: Gettysburg Address 2-word n-gram traces]

"In addition to analysing 2-word n-grams, we also analysed 1-, 3- and 4-word n-grams, based on the average percentage of n-grams seen in 50 random 260,000-word samples of texts. The analysis was only run up to 4-word n-grams because from that point onward the Hay corpus contains none of the n-grams in the Gettysburg Address. The 3- and 4-word n-gram analyses also correctly attributed the Gettysburg Address to Lincoln: 18% of 3-grams for Lincoln vs. 14% of 3-grams for Hay and 2% of 4-grams for Lincoln vs. 0% of 4-grams for Hay. The 1-word n-gram analysis, however, incorrectly attributed the Gettysburg Address to Hay. Figure 3 presents the aggregated n-gram traces for all analyses. Notably, the 2-, 3- and 4-word n-gram analyses, which correctly attributed the document to Lincoln, appear to be far more definitive than the incorrect 1-word n-gram analysis."
Introduction: Zipf's law
Observing Zipf's law: across languages and different linguistic units
Observing Zipf's law: the Italian prefix ri- in the la Repubblica corpus
Observing Zipf’s law
- A straight line in double-logarithmic space corresponds to a power law for the original variables
- This leads to Zipf's (1949; 1965) famous law:

      f_r = C / r^a

- Taking the logarithm on both sides, we obtain:

      log f_r = log C − a · log r
      (y = log f_r, x = log r: a linear relationship)

- Intuitive interpretation of a and C:
  - a is the slope, determining how fast the log frequency decreases
  - log C is the intercept, i.e. the log frequency of the most frequent word (r = 1 → log r = 0)
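The straight-line relationship can be verified numerically: generate frequencies from an exact power law and recover a and log C by least-squares regression in log-log space. A sketch in Python with assumed parameter values (the corpus-based fit on the following slide works the same way, in R):

```python
import math

# exact Zipfian frequencies f_r = C / r^a (assumed illustrative parameters)
a_true, C_true = 1.1, 60000.0
ranks = range(1, 101)
freqs = [C_true / r**a_true for r in ranks]

# least-squares fit of log f = log C - a * log r
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
a_hat = -a_true if False else -slope  # estimated exponent (sign flipped)
C_hat = math.exp(my - slope * mx)     # estimated constant

print(round(a_hat, 3), round(C_hat, 1))  # 1.1 60000.0
```

On exact power-law data the regression recovers the parameters up to floating-point error; on real corpus data the fit deviates, which motivates the Zipf-Mandelbrot refinement below.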
Observing Zipf's law: least-squares fit = linear regression in log-space (Brown corpus)
Zipf-Mandelbrot law (Mandelbrot 1953, 1962)

- Mandelbrot's extra parameter:

      f_r = C / (r + b)^a

- Zipf's law is the special case with b = 0
- Assuming a = 1, C = 60,000, b = 1:
  - for the word with rank 1, Zipf's law predicts a frequency of 60,000; Mandelbrot's variation predicts a frequency of 30,000
  - for the word with rank 1,000, Zipf's law predicts a frequency of 60; Mandelbrot's variation predicts a frequency of 59.94
- The Zipf-Mandelbrot law forms the basis of statistical LNRE models
- The ZM law can be derived mathematically as the limiting distribution of the vocabulary generated by a character-level Markov process
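The example numbers above are easy to reproduce; a small Python sketch comparing the two predictions:

```python
def zipf(r, C=60000, a=1):
    """Zipf's law: f_r = C / r^a."""
    return C / r**a

def zipf_mandelbrot(r, C=60000, a=1, b=1):
    """Zipf-Mandelbrot law: f_r = C / (r + b)^a."""
    return C / (r + b)**a

for r in (1, 1000):
    print(r, zipf(r), round(zipf_mandelbrot(r), 2))
# 1 60000.0 30000.0
# 1000 60.0 59.94
```

The extra parameter b mostly affects the top ranks and becomes negligible far down the ranking, exactly as the predicted frequencies show.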
Zipf-Mandelbrot law: non-linear least-squares fit (Brown corpus)
Introduction: First steps (zipfR)
zipfR (Evert and Baroni 2007)

- http://zipfR.R-Forge.R-Project.org/
- Conveniently available from the CRAN repository
- Package vignette = gentle tutorial introduction
First steps with zipfR
- Set up a folder for this course, and make sure it is your working directory in R (preferably as an RStudio project)
- Install the most recent version of the zipfR package
  - the tutorial requires version 0.7 or newer
- Package, handouts, code samples & data sets are available from http://zipfr.r-forge.r-project.org/lrec2018.html
> library(zipfR)
> ?zipfR # documentation entry point
> vignette("zipfr-tutorial") # read the zipfR tutorial
Loading type-token data
- Most convenient input: sequence of tokens as a text file in vertical format ("one token per line")
  + mapped to appropriate types: normalized word forms, word pairs, lemmata, semantic classes, n-grams of POS tags, ...
  + language data should always be in UTF-8 encoding!
  + large files can be compressed (.gz, .bz2, .xz)
- Sample data: brown_adverbs.txt on the tutorial homepage
  - lowercased adverb tokens from the Brown corpus (original order)
  + download and save to your working directory

> adv <- readLines("brown_adverbs.txt", encoding="UTF-8")
> head(adv, 30)  # mathematically, a "vector" of tokens
> length(adv)    # sample size = 52,037 tokens
Descriptive statistics: type-frequency list
> adv.tfl <- vec2tfl(adv)
> adv.tfl
       k    f type
not    1 4859  not
n't    2 2084  n't
so     3 1464   so
only   4 1381 only
then   5 1374 then
now    6 1309  now
even   7 1134 even
as     8 1089   as
...
N     V
52037 1907

> N(adv.tfl)  # sample size
> V(adv.tfl)  # type count
Descriptive statistics: type-frequency list
- Visualize descriptive statistics with the plot method

> plot(adv.tfl)            # Zipf ranking
> plot(adv.tfl, log="xy")  # logarithmic scale recommended

[Plot "Type-Frequency List (Zipf ranking)" on log-log scales: rank (1–1000+) vs. frequency (1–5000)]
Descriptive statistics: frequency spectrum
> adv.spc <- tfl2spc(adv.tfl)  # or directly with vec2spc
> adv.spc
    m  Vm
1   1 762
2   2 260
3   3 144
4   4  99
5   5  69
6   6  50
7   7  40
8   8  34
...
N     V
52037 1907

> N(adv.spc)  # sample size
> V(adv.spc)  # type count
Descriptive statistics: frequency spectrum
> plot(adv.spc)  # barplot of frequency spectrum
> ?plot.spc      # see help page for further options

[Bar plot "Frequency Spectrum": m = 1–15 vs. Vm (0–800)]
Descriptive statistics: vocabulary growth
- A VGC lists the vocabulary size V(N) at different sample sizes N
- Optionally also spectrum elements Vm(N) up to m.max

> adv.vgc <- vec2vgc(adv, m.max=2)
> plot(adv.vgc, add.m=1:2)  # plot all three VGCs

[Plot "Vocabulary Growth": V(N), V1(N) and V2(N) for N up to 50,000]
Further example data sets
?Brown               words from the Brown corpus
?BrownSubsets        various subsets
?Dickens             words from novels by Charles Dickens
?ItaPref             Italian word-formation prefixes
?TigerNP             NP and PP patterns from the German Tiger treebank
?Baayen2001          frequency spectra from Baayen (2001)
?EvertLuedeling2001  German word-formation affixes (manually corrected data from Evert and Lüdeling 2001)

Practice:
- Explore these data sets with descriptive statistics
- Try different plot options (see help pages ?plot.tfl, ?plot.spc, ?plot.vgc)
LNRE models: Population & samples
Why do we need statistics?

- We often want to compare samples of different sizes
  + extrapolation of VGCs & productivity measures
- We are interested in the productivity of an affix, the vocabulary of an author, ...; not in a particular text or sample
  + statistical inference from sample to population
  + significance of differences in productivity
- Discrete frequency counts are difficult to capture with generalizations such as Zipf's law
  + Zipf's law predicts many impossible types with 1 < f_r < 2
  + the population does not suffer from such quantization effects
LNRE models
- This tutorial introduces the state-of-the-art LNRE approach proposed by Baayen (2001)
- LNRE = Large Number of Rare Events
- LNRE uses various approximations and simplifications to obtain a tractable and elegant model
- Of course, we could also estimate the precise discrete distributions using MCMC simulations, but ...
  1. the LNRE model is usually a minor component of a complex procedure
  2. it is often applied to very large samples (N > 1 M tokens)
  3. it is still better than naive least-squares regression on the Zipf ranking
The LNRE population
- Population: set of S types w_i with occurrence probabilities π_i
- S = population diversity; can be finite or infinite (S = ∞)
- We are not interested in specific types → arrange them by decreasing probability: π_1 ≥ π_2 ≥ π_3 ≥ ...
  + impossible to determine the probabilities of all individual types
- Normalization: π_1 + π_2 + ... + π_S = 1
- We need a parametric statistical model to describe the full population (esp. for S = ∞), i.e. a function i ↦ π_i
  - type probabilities π_i cannot be estimated reliably from a sample, but the parameters of this function can
- NB: population index i ≠ Zipf rank r
What should the population look like?

[Slide shows four candidate populations as plots of type probability π_k against type index k = 1–50, differing in how quickly the probabilities decay]
LNRE models Population & samples
Zipf-Mandelbrot law as a population model

- Zipf-Mandelbrot law for type probabilities:

      π_i := C / (i + b)^a

- Two free parameters: a > 1 and b ≥ 0
  + C is not a parameter but a normalization constant, needed to ensure that Σ_i π_i = 1
- Third parameter: S > 0 or S = ∞
- This is the Zipf-Mandelbrot population model (Evert 2004)
  - ZM for the Zipf-Mandelbrot model (S = ∞)
  - fZM for the finite Zipf-Mandelbrot model
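For a finite population the normalization constant C can be computed directly from the constraint Σ_i π_i = 1; a minimal sketch with assumed illustrative parameter values:

```python
# finite Zipf-Mandelbrot population: pi_i = C / (i + b)^a, i = 1..S
# (illustrative parameter values; a > 1, b >= 0)
a, b, S = 2.0, 10.0, 1000

weights = [1.0 / (i + b)**a for i in range(1, S + 1)]
C = 1.0 / sum(weights)         # normalization constant, not a free parameter
pi = [C * w for w in weights]  # type probabilities, decreasing in i

print(abs(sum(pi) - 1.0) < 1e-9, pi[0] > pi[1] > pi[2])  # True True
```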
The parameters of the Zipf-Mandelbrot model

[Plots of π_k against k = 1–50 (linear scales) for four parameter settings: a = 1.2, b = 1.5; a = 2, b = 10; a = 2, b = 15; a = 5, b = 40]
The parameters of the Zipf-Mandelbrot model

[The same four parameter settings plotted on double-logarithmic scales]
Sampling from a population model

Assume we believe that the population we are interested in can be described by a Zipf-Mandelbrot model:

[Plots: ZM population with a = 3, b = 50, on linear and double-logarithmic scales]

Use computer simulation to generate random samples:
- Draw N tokens from the population such that in each step, type w_i has probability π_i to be picked
- This allows us to make predictions for samples (= corpora) of arbitrary size N
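The simulation step can be sketched in a few lines of Python, assuming a finite ZM population (the slides' population is infinite; S and the seed below are illustrative):

```python
import random

# assumed finite ZM population (the slides use a = 3, b = 50)
a, b, S = 3.0, 50.0, 10000
weights = [1.0 / (i + b)**a for i in range(1, S + 1)]

random.seed(42)
N = 1000  # sample size (number of tokens)
# each draw picks type i with probability proportional to its weight
sample = random.choices(range(1, S + 1), weights=weights, k=N)

V = len(set(sample))  # number of distinct types drawn
print(N, V)           # far fewer types than tokens
```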
Sampling from a population model

   #1:   1  42  34  23 108  18  48  18   1 ...
       time order room school town course area course time ...
   #2: 286  28  23  36   3   4   7   4   8 ...
   #3:   2  11 105  21  11  17  17   1  16 ...
   #4:  44   3 110  34 223   2  25  20  28 ...
   #5:  24  81  54  11   8  61   1  31  35 ...
   #6:   3  65   9 165   5  42  16  20   7 ...
   #7:  10  21  11  60 164  54  18  16 203 ...
   #8:  11   7 147   5  24  19  15  85  37 ...
   ...

(each row lists the type indices i of the first tokens of one random sample; for sample #1, the corresponding types are also shown)
Samples: type frequency list & spectrum (sample #1)

   rank r   f_r   type i        m   Vm
   1        37     6            1   83
   2        36     1            2   22
   3        33     3            3   20
   4        31     7            4   12
   5        31    10            5   10
   6        30     5            6    5
   7        28    12            7    5
   8        27     2            8    3
   9        24     4            9    3
   10       24    16           10    3
   11       23     8           ...
   12       22    14
   ...
Samples: type frequency list & spectrum (sample #2)

   rank r   f_r   type i        m   Vm
   1        39     2            1   76
   2        34     3            2   27
   3        30     5            3   17
   4        29    10            4   10
   5        28     8            5    6
   6        26     1            6    5
   7        25    13            7    7
   8        24     7            8    3
   9        23     6           10    4
   10       23    11           11    2
   11       20     4           ...
   12       19    17
   ...
Random variation in type-frequency lists

[Plots for samples #1 and #2: frequency against Zipf rank (r ↔ f_r) and against population index (i ↔ f_i)]
Random variation: frequency spectrum

[Bar plots of the frequency spectra of samples #1–#4]
Random variation: vocabulary growth curve

[Plots of V(N) and V1(N) for samples #1–#4, N up to 1000]
Expected values

- There is no reason why we should choose a particular sample to compare to the real data or make a prediction: each one is equally likely or unlikely
- Take the average over a large number of samples, called the expected value or expectation in statistics
- Notation: E[V(N)] and E[Vm(N)]
  - indicates that we are referring to expected values for a sample of size N
  - rather than to the specific values V and Vm observed in a particular sample or a real-world data set
- Expected values can be calculated efficiently without generating thousands of random samples
The expected frequency spectrum

[Bar plots comparing the observed spectra Vm of samples #1–#4 with the expected spectrum E[Vm]]
The expected vocabulary growth curve

[Plots comparing the observed V(N) and V1(N) of sample #1 with the expected curves E[V(N)] and E[V1(N)]]
Prediction intervals for the expected VGC

[Plots: expected VGCs E[V(N)] and E[V1(N)] with prediction intervals, together with the observed curves of sample #1]

"Confidence intervals" indicate the predicted sampling distribution:
+ for 95% of samples generated by the LNRE model, the VGC will fall within the range delimited by the thin red lines
Parameter estimation by trial & error

[Plots comparing an observed frequency spectrum and VGC with ZM model expectations for successive parameter guesses: a = 1.5, b = 7.5; a = 1.3, b = 7.5; a = 1.3, b = 0.2; a = 2, b = 550]
Automatic parameter estimation

[Plots comparing the observed spectrum and VGC with the expectations of the automatically estimated model]

- By trial & error we found a = 2.0 and b = 550
- The automatic estimation procedure finds a = 2.39 and b = 1968.49
LNRE models: The mathematics of LNRE
The sampling model

- Draw a random sample of N tokens from the LNRE population
- Sufficient statistic: the set of type frequencies {f_i}
  - because the tokens of a random sample have no ordering
- Joint multinomial distribution of the {f_i}:

      Pr({f_i = k_i} | N) = N! / (k_1! ··· k_S!) · π_1^(k_1) ··· π_S^(k_S)

- Approximation: do not condition on the fixed sample size N
  - N is now the average (expected) sample size
- The random variables f_i then have independent Poisson distributions:

      Pr(f_i = k_i) = e^(−N π_i) (N π_i)^(k_i) / k_i!
Frequency spectrum

- Key problem: we cannot determine f_i in the observed sample
  - because we don't know which type w_i is
  - recall the population ordering f_i ≠ Zipf ranking f_r
- Use the spectrum {Vm} and vocabulary size V as statistics
  - they contain all the information we have about the observed sample
- They can be expressed in terms of indicator variables:

      I_[f_i = m] = 1 if f_i = m, 0 otherwise

      Vm = Σ_{i=1}^{S} I_[f_i = m]

      V = Σ_{i=1}^{S} I_[f_i > 0] = Σ_{i=1}^{S} (1 − I_[f_i = 0])
The expected spectrum

- It is easy to compute expected values for the frequency spectrum (and variances, because the f_i are independent):

      E[I_[f_i = m]] = Pr(f_i = m) = e^(−N π_i) (N π_i)^m / m!

      E[Vm] = Σ_{i=1}^{S} E[I_[f_i = m]] = Σ_{i=1}^{S} e^(−N π_i) (N π_i)^m / m!

      E[V] = Σ_{i=1}^{S} E[1 − I_[f_i = 0]] = Σ_{i=1}^{S} (1 − e^(−N π_i))

- NB: Vm and V are not independent, because they are derived from the same random variables f_i
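These formulas can be checked numerically against two identities: Σ_m m · E[Vm] = N (every expected token is counted once) and Σ_m E[Vm] = E[V]. A sketch for a small assumed population (the pmf is computed in log-space to avoid overflow):

```python
import math

def poisson_pmf(m, lam):
    """Poisson probability Pr(f_i = m), computed in log-space."""
    return math.exp(-lam + m * math.log(lam) - math.lgamma(m + 1))

# assumed illustrative finite population with power-law probabilities
S, N = 100, 500
weights = [1.0 / (i + 5.0)**2 for i in range(1, S + 1)]
total = sum(weights)
pi = [w / total for w in weights]

def E_Vm(m):
    """Expected spectrum element E[Vm] under the Poisson approximation."""
    return sum(poisson_pmf(m, N * p) for p in pi)

E_V = sum(1 - math.exp(-N * p) for p in pi)  # expected vocabulary size

# sanity checks (sums truncated at m = 299, tails are negligible here)
token_mass = sum(m * E_Vm(m) for m in range(1, 300))
type_mass = sum(E_Vm(m) for m in range(1, 300))
print(round(token_mass), round(type_mass - E_V, 6))  # 500 0.0
```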
Sampling distribution of Vm and V

- The joint sampling distribution of {Vm} and V is complicated
- Approximation: V and {Vm} asymptotically follow a multivariate normal distribution
  - motivated by the multivariate central limit theorem: sum of many independent variables I_[f_i = m]
- Usually limited to the first spectrum elements, e.g. V1, ..., V15
  - the approximation of the discrete Vm by a continuous distribution is suitable only if E[Vm] is sufficiently large
- Parameters of the multivariate normal: µ = (E[V], E[V1], E[V2], ...) and Σ = covariance matrix

      Pr((V, V1, ..., Vk) = v) ∼ exp(−(1/2) (v − µ)ᵀ Σ⁻¹ (v − µ)) / √((2π)^(k+1) det Σ)
LNRE models The mathematics of LNRE
Type density function
- Discrete sums of probabilities in E[V], E[V_m], ... are inconvenient and computationally expensive
- Approximation: continuous type density function g(π)

$$\big|\{w_i \mid a \le \pi_i \le b\}\big| = \int_a^b g(\pi)\, d\pi \qquad \sum \{\pi_i \mid a \le \pi_i \le b\} = \int_a^b \pi\, g(\pi)\, d\pi$$

- Normalization constraint:

$$\int_0^\infty \pi\, g(\pi)\, d\pi = 1$$

- Good approximation for low-probability types, but the probability mass of w_1, w_2, ... is "smeared out" over a range
[Figure: left panel "Type density as continuous approximation" — type density g(π) against occurrence probability π, with the first types w_1, ..., w_4 marked; right panel "Probability density vs. type probabilities" — probability density f(π) vs. individual type probabilities π_1 = .376, π_2 = .215, π_3 = .130, π_4 = .082]
ZM and fZM as LNRE models
- Discrete Zipf-Mandelbrot population:

$$\pi_i := \frac{C}{(i + b)^a} \quad \text{for } i = 1, \ldots, S$$

- Corresponding type density function (Evert 2004):

$$g(\pi) = \begin{cases} C \cdot \pi^{-\alpha-1} & A \le \pi \le B \\ 0 & \text{otherwise} \end{cases}$$

with parameters
  - α = 1/a (0 < α < 1)
  - B = (1 − α)/(b · α)
  - 0 ≤ A < B determines S (ZM with S = ∞ for A = 0)
- NB: C is a normalization factor, not a parameter
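The role of the parameters can be illustrated with a small stdlib sketch: the normalization constraint fixes C, and the population size S = ∫_A^B g(π) dπ then follows from the lower cutoff A (parameter values below are hypothetical):

```python
def fzm_population_size(alpha, A, B):
    """Population size S = ∫_A^B C π^(-α-1) dπ of an fZM model, with the
    normalization ∫_A^B π g(π) dπ = 1 fixing C = (1-α)/(B^(1-α) - A^(1-α))."""
    C = (1 - alpha) / (B**(1 - alpha) - A**(1 - alpha))
    return C * (A**(-alpha) - B**(-alpha)) / alpha

alpha, b = 0.5, 1.0                 # hypothetical parameters (a = 2)
B = (1 - alpha) / (b * alpha)       # upper cutoff B = (1 - α)/(b·α) = 1.0
for A in (1e-4, 1e-6, 1e-8):
    print(A, round(fzm_population_size(alpha, A, B)))
# the smaller the lower cutoff A, the larger S; S → ∞ as A → 0 (plain ZM)
```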
[Figure: type density g(π) of an fZM model against occurrence probability π on a logarithmic scale]
Expectations as integrals
- Expected values can now be expressed as integrals over g(π):

$$E[V_m] = \int_0^\infty \frac{(N\pi)^m}{m!}\, e^{-N\pi} g(\pi)\, d\pi \qquad E[V] = \int_0^\infty \left(1 - e^{-N\pi}\right) g(\pi)\, d\pi$$

- These reduce to a simple closed form for ZM with b = 0 (i.e. B = ∞):

$$E[V_m] = \frac{C}{m!} \cdot N^\alpha \cdot \Gamma(m - \alpha) \qquad E[V] = C \cdot N^\alpha \cdot \frac{\Gamma(1 - \alpha)}{\alpha}$$

- fZM and the general ZM lead to expressions involving the incomplete Gamma function
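The closed form can be sanity-checked with the identity E[V] = Σ_{m≥1} E[V_m] (every observed type has some frequency m ≥ 1); a stdlib sketch with hypothetical parameter values, using a stable recurrence for Γ(m−α)/m! to avoid overflow:

```python
import math

alpha = 0.7                  # hypothetical ZM shape parameter
N, C = 1000.0, 1.0           # hypothetical sample size; C taken as given here

# E[V_m] = C/m! * N^α * Γ(m - α); accumulate t_m = Γ(m-α)/m! via
# t_{m+1} = t_m * (m - α)/(m + 1), since Γ(m-α) itself overflows for large m
t = math.gamma(1 - alpha)    # t_1 = Γ(1-α)/1!
total = 0.0
for m in range(1, 1_000_000):
    total += t
    t *= (m - alpha) / (m + 1)

EV_from_sum = C * N**alpha * total
EV_closed = C * N**alpha * math.gamma(1 - alpha) / alpha
print(round(EV_from_sum, 2), round(EV_closed, 2))   # the two values agree closely
```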
Parameter estimation from training corpus
- For ZM, α = E[V_1]/E[V] ≈ V_1/V can be estimated directly, but this is prone to overfitting
- General parameter fitting by MLE: maximize the likelihood of the observed spectrum v

$$\max_{\alpha, A, B} \Pr\big((V, V_1, \ldots, V_k) = \mathbf{v} \,\big|\, \alpha, A, B\big)$$

- With the multivariate normal approximation:

$$\min_{\alpha, A, B} (\mathbf{v}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{v}-\boldsymbol{\mu})$$

- Minimization by gradient descent (BFGS, CG) or simplex search (Nelder-Mead)
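A toy version of such a fit, hypothetical throughout: for a pure ZM model the relative spectrum E[V_m]/E[V] = α·Γ(m−α)/(m!·Γ(1−α)) depends only on α, so α can be recovered from an "observed" relative spectrum by simplex search (here scipy's Nelder-Mead, one of the algorithms named above):

```python
import math
from scipy.optimize import minimize

def rel_spectrum(alpha, k=5):
    """E[V_m]/E[V] for a ZM model, m = 1..k (note the m = 1 value equals alpha)."""
    return [alpha * math.gamma(m - alpha) / (math.factorial(m) * math.gamma(1 - alpha))
            for m in range(1, k + 1)]

v = rel_spectrum(0.6)        # pretend this is an observed relative spectrum

def loss(x):
    alpha = x[0]
    if not 0 < alpha < 1:    # keep the simplex inside the valid parameter range
        return 1e6
    return sum((vi - mi)**2 for vi, mi in zip(v, rel_spectrum(alpha)))

res = minimize(loss, x0=[0.3], method="Nelder-Mead")
print(round(res.x[0], 3))    # recovers the true alpha = 0.6
```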
[Figure: goodness-of-fit landscape X² (m = 10) for bare singular PPs in the BNC, plotted over the parameters α and log10(B), with the optimum at (0.65, −2.11)]
Stefan Evert T1: Zipf’s Law 22 July 2019 | CC-by-sa 69 / 117
LNRE models The mathematics of LNRE
Goodness-of-fit (Baayen 2001, Sec. 3.3)

- How well does the fitted model explain the observed data?
- For the multivariate normal distribution:

$$X^2 = (\mathbf{V}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{V}-\boldsymbol{\mu}) \sim \chi^2_{k+1}$$

where $\mathbf{V} = (V, V_1, \ldots, V_k)$
- This yields a multivariate chi-squared test of goodness-of-fit:
  - replace V by the observed v → test statistic x²
  - the df = k + 1 must be reduced by the number of estimated parameters
- NB: significant rejection of the LNRE model for p < .05
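A small worked example of the test decision (values hypothetical): with a spectrum fitted up to V_5, i.e. k + 1 = 6 statistics and, say, 2 estimated parameters, df = 4; for even df the chi-squared survival function has a simple closed form:

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-squared distribution with even df:
    exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!"""
    assert df % 2 == 0
    return math.exp(-x / 2) * sum((x / 2)**j / math.factorial(j)
                                  for j in range(df // 2))

x2 = 11.2                 # hypothetical observed goodness-of-fit statistic
p = chi2_sf(x2, df=4)
print(round(p, 4))        # p < .05, so the LNRE model is rejected
```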
Coffee break!
Applications & examples Productivity & lexical diversity
Measuring morphological productivity (example from Evert and Lüdeling 2001)

[Figure: vocabulary growth curves V(N) for the suffixes -bar, -sam and -ös, up to N = 35,000 tokens]
[Figure: observed and expected V(N) and V1(N) for a fitted LNRE model with a = 1.45, b = 34.59, S = 20,587, up to N = 350,000 tokens]
Quantitative measures of productivity (Tweedie and Baayen 1998; Baayen 2001)

- Baayen's (1991) productivity index P (slope of the vocabulary growth curve): $P = \frac{V_1}{N}$
- Type-token ratio: $\mathit{TTR} = \frac{V}{N}$
- Zipf-Mandelbrot slope: $a$
- Herdan's law (1964): $C = \frac{\log V}{\log N}$
- Yule (1944) / Simpson (1949): $K = 10{,}000 \cdot \frac{\sum_m m^2 V_m - N}{N^2}$
- Guiraud (1954): $R = \frac{V}{\sqrt{N}}$
- Sichel (1975): $S = \frac{V_2}{V}$
- Honoré (1979): $H = \frac{\log N}{1 - \frac{V_1}{V}}$
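All of these measures follow directly from N, V and the frequency spectrum; a stdlib sketch (note that Honoré's H is undefined when every type is a hapax legomenon, i.e. V_1 = V):

```python
import math
from collections import Counter

def productivity_measures(tokens):
    """The lexical measures listed above, computed from a token sequence."""
    freqs = Counter(tokens)
    spectrum = Counter(freqs.values())
    N, V = len(tokens), len(freqs)
    V1, V2 = spectrum.get(1, 0), spectrum.get(2, 0)
    return {
        "TTR": V / N,                          # type-token ratio
        "P":   V1 / N,                         # Baayen's productivity index
        "C":   math.log(V) / math.log(N),      # Herdan
        "K":   10_000 * (sum(m * m * Vm for m, Vm in spectrum.items()) - N) / N**2,
        "R":   V / math.sqrt(N),               # Guiraud
        "S":   V2 / V,                         # Sichel
        "H":   math.log(N) / (1 - V1 / V),     # Honoré (requires V1 < V)
    }

meas = productivity_measures("a rose is a rose is a rose in my garden".split())
print({k: round(v, 3) for k, v in meas.items()})
```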
Productivity measures for bare singulars in the BNC

            spoken    written
V            2,039     12,876
N            6,766     85,750
K            86.84      28.57
R            24.79      43.97
S             0.13       0.15
C             0.86       0.83
P             0.21       0.08
TTR          0.301      0.150
a             1.18       1.27
pop. S      15,958     36,874

[Figure: vocabulary growth curves V(N) for bare singulars in the written vs. spoken BNC]
Are these “lexical constants” really constant?
[Figure: eight panels plotting Yule's K, Guiraud's R, Sichel's S, Herdan's C, Baayen's P, TTR, Zipf slope a and the estimated population size against sample size N (up to 5,000 tokens)]
Applications & examples Practical LNRE modelling
interactive demo
Applications & examples Bootstrapping experiments
Bootstrapping
- An empirical approach to sampling variation:
  - take many random samples from the same population
  - analyse the distribution e.g. of productivity measures (mean, median, s.d., boxplot, histogram, ...)
  - alternatively, estimate an LNRE model from each sample and analyse the distribution of model parameters (→ later)
  - problem: how to obtain the additional samples?
- Bootstrapping (Efron 1979)
  - resample from the observed data with replacement
  - this approach is not suitable for type-token distributions (resamples underestimate vocabulary size V!)
- Parametric bootstrapping
  - use the fitted LNRE model to generate samples, i.e. sample from the population described by the model
  - advantage: the "correct" parameter values are known
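A minimal parametric-bootstrap sketch (stdlib only; the population and sample sizes are hypothetical): samples are drawn from a known finite Zipfian population, and the sampling distribution of the vocabulary size V is inspected directly:

```python
import random

def sample_zipf(S, a, N, rng):
    """Draw N tokens from a finite Zipfian population: pi_i proportional to 1/i^a."""
    weights = [1 / i**a for i in range(1, S + 1)]
    return rng.choices(range(S), weights=weights, k=N)

rng = random.Random(42)
Vs = sorted(len(set(sample_zipf(S=5000, a=1.3, N=2000, rng=rng)))
            for _ in range(50))          # 50 bootstrap replicates
print(Vs[0], Vs[len(Vs) // 2], Vs[-1])   # min / median / max of V across replicates
```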
Parametric bootstrapping with LNRE models
- Use simulation experiments to gain a better understanding of quantitative measures
- An LNRE model = a well-defined population
- Parametric bootstrapping based on the LNRE population:
  - dependence on sample size
  - controlled manipulation of confounding factors
  - empirical sampling distribution → variability
  - E[P] etc. can be computed directly in simple cases

[Figure: expected Zipf-Mandelbrot frequency spectra E[V_m] for m = 1, ..., 7 with a = 2, a = 1.4 and a = 1.1]
Experiment: sample size
[Figure: eight panels plotting Yule-Simpson K, Guiraud R, Herdan's C, TTR, Baayen P, Zipf slope a, Sichel S and Honoré H against sample size (up to 25,000 tokens)]
Experiment: frequent lexicalized types
[Figure: eight panels plotting Yule-Simpson K, Guiraud R, Herdan's C, TTR, Baayen P, Zipf slope a, Sichel S and Honoré H against the probability mass of frequent lexicalized types (0 to 0.4)]
Applications & examples LNRE as Bayesian prior
Posterior distribution
[Figure: posterior distribution Pr(π | m = 1) over the expected frequency Nπ for a ZM model with α = 0.4, marking the MLE, 95% and 99.9% confidence intervals and the Good-Turing estimate, for the posterior and a log-adjusted version]
[Figure: posterior distribution Pr(π | m = 1) over the expected frequency Nπ for a ZM model with α = 0.9, marking the MLE, confidence intervals and the Good-Turing estimate]
Challenges Model inference
How reliable are the fitted models?
Three potential issues:

1. Model assumptions ≠ population (e.g. the distribution does not follow a Zipf-Mandelbrot law)
   → the model cannot be adequate, regardless of parameter settings
2. Parameter estimation unsuccessful (i.e. suboptimal goodness-of-fit to the training data)
   → optimization algorithm trapped in a local minimum
   → can result in a highly inaccurate model
3. Uncertainty due to sampling variation (i.e. training data differ from the population distribution)
   → the model is fitted to training data and may not reflect the true population
   → another training sample would have led to different parameters
   → especially critical for small samples (N < 10,000)
Bootstrapping (parametric bootstrapping with 100 replicates)

Zipfian slope a = 1/α

[Figure: sampling density of the estimated α across 100 bootstrap replicates]
Goodness-of-fit statistic X² (model not plausible for X² > 11)

[Figure: sampling density of X² across 100 bootstrap replicates, with the p < 0.05 rejection region marked]
Population diversity S

[Figure: sampling density of the estimated population size S across 100 bootstrap replicates]
Sample size matters! The Brown corpus is too small for reliable LNRE parameter estimation (bare singulars).

[Figure: bootstrap densities of the Zipf slope a and the estimated population size for BNC spoken (N = 6,766) vs. Brown (N = 1,005)]
Challenges Zipf’s law
How well does Zipf’s law hold?
- The Z-M law seems to fit the first few thousand ranks very well, but then the slope of the empirical ranking becomes much steeper
  - similar patterns have been found in many different data sets
- Various modifications and extensions have been suggested (Sichel 1971; Kornai 1999; Montemurro 2001)
  - the mathematics of the corresponding LNRE models are often much more complex and numerically challenging
  - they may not have a closed form for E[V], E[V_m], or for the cumulative type distribution $G(\rho) = \int_\rho^\infty g(\pi)\, d\pi$
- E.g. the Generalized Inverse Gauss-Poisson (GIGP; Sichel 1971):

$$g(\pi) = \frac{(2/bc)^{\gamma+1}}{K_{\gamma+1}(b)} \cdot \pi^{\gamma-1} \cdot e^{-\frac{\pi}{c} - \frac{b^2 c}{4\pi}}$$
The GIGP model (Sichel 1971)
[Figure: type density g(π) of fZM and GIGP models against occurrence probability π on a logarithmic scale]
Challenges Non-randomness
How accurate is LNRE-based extrapolation? (Baroni and Evert 2005)

[Figure: observed vs. model-based vocabulary growth V(N) for the suffix -bar, with ZM, fZM and GIGP models estimated on 25% and 50% of the data]
[Figure: observed vs. model-based vocabulary growth for the LOB corpus (25% and 50% training samples), with ZM, fZM, GIGP and a log N baseline]
[Figure: observed vs. model-based vocabulary growth for the BNC (1% and 10% training samples), with ZM, fZM, GIGP and a log N baseline]
Reasons for poor extrapolation quality
- Major problem: non-randomness of corpus data
  - LNRE modelling assumes that the corpus is a random sample
- Cause 1: repetition within texts
  - most corpora use the entire text as the unit of sampling
  - also referred to as "term clustering" or "burstiness"
  - well known in computational linguistics (Church 2000)
- Cause 2: non-homogeneous corpus
  - cannot extrapolate from the spoken BNC to the written BNC
  - similarly for different genres and domains
  - also within a single text, e.g. the beginning vs. the end of a novel
The ECHO correction (Baroni and Evert 2007)

- Empirical study: quality of extrapolation from N0 to 4·N0, starting from random samples of corpus texts

[Figure: relative error of E[V] vs. V on DEWAC at 2·N0 and 3·N0 for ZM, fZM and GIGP; goodness-of-fit X² vs. extrapolation accuracy rMSE (%) for V(3·N0), standard models]
- Assumption: repetition of a type within a short span is not a new lexical access or spontaneous formation
- Replace every repetition within the span by the special type echo
  - N, V and V1 are not affected → same VGC and P
  - the ECHO correction works as a pre-processing step → no modifications to LNRE models or other analysis software needed
- What is an appropriate span size? Repetition within a textual unit (→ document frequencies):

A fine example. echo very echo echo. Only the echo echo. echo echo are echo. ...
The cat sat on echo mat. Another very fine echo echo down echo echo echo. Two echo are echo. ...
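A sketch of the correction with the textual unit (document) as span, in plain Python (the original study works on tokenized, POS-annotated corpora):

```python
def echo_correct(documents):
    """Replace every repetition of a type within the same document by the
    special type 'echo'; N is unchanged, and each type still occurs once."""
    corrected = []
    for doc in documents:
        seen, out = set(), []
        for token in doc.split():
            out.append("echo" if token in seen else token)
            seen.add(token)
        corrected.append(" ".join(out))
    return corrected

docs = ["a fine example is a very fine one", "the cat sat on the mat"]
print(echo_correct(docs))
# ['a fine example is echo very echo one', 'the cat sat on echo mat']
```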
- ECHO correction: replace every repetition within the same text by the special type echo (= document frequencies)

[Figure: relative error of E[V] vs. V on DEWAC for ZM, fZM, GIGP, echo-corrected fZM and GIGP, and partition-adjusted GIGP; goodness-of-fit X² vs. rMSE (%) for V(3·N0), comparing standard, echo and partition-adjusted models]
Challenges Significance testing
Case study: Iris Murdoch & early symptoms of AD (Evert et al. 2017)

- Renowned British author (1919–1999)
- Published a total of 26 novels, mostly well received by critics
- Murdoch experienced unexpected difficulties composing her last novel, which was received "without enthusiasm" (Garrard et al. 2005)
- Diagnosis of Alzheimer's disease shortly after its publication

Conflicting results:
- decline of lexical diversity in the last novel (Garrard et al. 2005; Pakhomov et al. 2011)
- no clear effects found (Le et al. 2011)

http://news.bbc.co.uk/2/hi/health/4058605.stm
- Corpus data
  - 19 out of 26 novels written by Iris Murdoch
  - including her 9 last novels, spanning a period of almost 20 years
  - acquired as e-books (no errors due to OCR)
- Pre-processing and annotation
  - Stanford CoreNLP (Manning et al. 2014) for tokenization, sentence splitting, POS tagging, and syntactic parsing
  - dialogue excluded based on typographic quotation marks (following Garrard et al. 2005; Pakhomov et al. 2011)
- The challenge
  - assess the significance of differences in productivity for single texts
  - might explain the conflicting results in prior work
Measures of vocabulary diversity = productivity (Evert et al. 2017)

[Figure: type count / TTR and Honoré's H across Murdoch's novels]
Cross-validation for productivity measures (Evert et al. 2017)

As a first step:
- Partition each novel into folds of 10,000 consecutive tokens → k ≥ 6 folds for each novel (leftover tokens discarded)

Then:
- Evaluate the complexity measure of interest on each fold: y_1, ..., y_k
- Compute the macro-average as the overall measure for the entire text,

$$\bar{y} = \frac{y_1 + \cdots + y_k}{k}$$

instead of the value x obtained by evaluating the measure on the full text
Significance testing procedure:
- The standard deviation σ of individual folds is estimated from the data:

$$\sigma^2 \approx s^2 = \frac{1}{k-1} \sum_{i=1}^{k} (y_i - \bar{y})^2$$

- The standard deviation of the macro-average can then be computed as

$$\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{k}} \approx \frac{s}{\sqrt{k}}$$

- Asymptotic 95% confidence intervals are given by $\bar{y} \pm 1.96 \cdot \sigma_{\bar{y}}$
- Samples are compared with Student's t-test, based on pooled cross-validation folds (feasible even for n_1 = 1)
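The procedure is a few lines of Python (the per-fold TTR values below are hypothetical):

```python
import math

def macro_average_ci(ys):
    """Macro-average of per-fold measures with an asymptotic 95% CI,
    following the formulas above (s / sqrt(k) as the standard error)."""
    k = len(ys)
    ybar = sum(ys) / k
    s = math.sqrt(sum((y - ybar)**2 for y in ys) / (k - 1))
    se = s / math.sqrt(k)
    return ybar, (ybar - 1.96 * se, ybar + 1.96 * se)

# hypothetical per-fold TTR values for one novel (k = 6 folds of 10,000 tokens)
ybar, (lo, hi) = macro_average_ci([0.31, 0.29, 0.33, 0.30, 0.32, 0.28])
print(round(ybar, 3), round(lo, 3), round(hi, 3))   # 0.305 0.29 0.32
```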
Productivity measures with confidence intervals (Evert et al. 2017)

[Figure: type count / TTR and Honoré's H with confidence intervals; significance test of the last novel against the first 17 novels: t = −6.1, df = 5.52, p = .0012**]
Cross-validated measures depend on fold size!
[Figure: expected frequency spectrum at 100k tokens, and the expected values of TTR, Yule's K and Honoré's H as a function of fold size (up to 50,000 tokens), for Zipf slopes a = 3.0, 2.0 and 1.5]
Challenges Outlook
Research programme for LNRE models
- Improve the efficiency & numerical accuracy of the implementation
  - numerical integrals instead of differences of Gamma functions
  - better parameter estimation (gradient, aggregated spectrum)
- Analyze the accuracy of the LNRE approximations
  - comprehensive simulation experiments, esp. for small samples
- Specify more flexible LNRE population models
  - my favourite: piecewise Zipfian type density functions
  - Baayen (2001): mixture distributions (different parameters)
- Develop hypothesis tests & confidence intervals
  - key challenge: goodness-of-fit vs. confidence region
  - prediction intervals for model-based extrapolation
- Simulation experiments for productivity measures
  - Can we find a quantitative measure that is robust against confounding factors and corresponds to intuitive notions of productivity & lexical diversity?
- Is non-randomness a problem?
  - not for morphological productivity → echo correction
  - tricky to include explicitly in the LNRE approach
- Do we need LNRE models for practical applications?
  - better productivity measures + empirical sampling variation
  - based on the cross-validation approach (Evert et al. 2017)
- How important are semantics & context?
  - Does it make sense to measure productivity and lexical diversity purely in terms of type-token distributions?
  - e.g. register variation for morphological productivity
  - e.g. semantic preferences in productive slots of a construction
  - type-token ratio ≠ complexity of the author's vocabulary
Thank you!
References
Baayen, Harald (1991). A stochastic process for word frequency distributions. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 271–278.
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht.
Baroni, Marco and Evert, Stefan (2005). Testing the extrapolation quality of word frequency models. In P. Danielsson and M. Wagenmakers (eds.), Proceedings of Corpus Linguistics 2005, volume 1, no. 1 of Proceedings from the Corpus Linguistics Conference Series, Birmingham, UK. ISSN 1747-9398.
Baroni, Marco and Evert, Stefan (2007). Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 904–911, Prague, Czech Republic.
Brainerd, Barron (1982). On the relation between the type-token and species-area problems. Journal of Applied Probability, 19(4), 785–793.
Cao, Yong; Xiong, Fei; Zhao, Youjie; Sun, Yongke; Yue, Xiaoguang; He, Xin; Wang, Lichao (2017). Power law in random symbolic sequences. Digital Scholarship in the Humanities, 32(4), 733–738.
Church, Kenneth W. (2000). Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p². In Proceedings of COLING 2000, pages 173–179, Saarbrücken, Germany.
Efron, Bradley (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26.
Evert, Stefan (2004). A simple LNRE model for random character sequences. In Proceedings of the 7èmes Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2004), pages 411–422, Louvain-la-Neuve, Belgium.
Evert, Stefan and Baroni, Marco (2007). zipfR: Word frequency distributions in R. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, pages 29–32, Prague, Czech Republic.
Evert, Stefan and Lüdeling, Anke (2001). Measuring morphological productivity: Is automatic preprocessing sufficient? In P. Rayson, A. Wilson, T. McEnery, A. Hardie, and S. Khoja (eds.), Proceedings of the Corpus Linguistics 2001 Conference, pages 167–175, Lancaster. UCREL.
Evert, Stefan; Wankerl, Sebastian; Nöth, Elmar (2017). Reliable measures of syntactic and lexical complexity: The case of Iris Murdoch. In Proceedings of the Corpus Linguistics 2017 Conference, Birmingham, UK.
Garrard, Peter; Maloney, Lisa M.; Hodges, John R.; Patterson, Karalyn (2005). The effects of very early Alzheimer's disease on the characteristics of writing by a renowned author. Brain, 128(2), 250–260.
Grieve, Jack; Clarke, Isobelle; Chiang, Emily; Gideon, Hannah; Heini, Annina; Nini, Andrea; Waibel, Emily (2018). Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities. doi:10.1093/llc/fqy042.
Herdan, Gustav (1964). Quantitative Linguistics. Butterworths, London.
Kornai, András (1999). Zipf's law outside the middle range. In Proceedings of the Sixth Meeting on Mathematics of Language, pages 347–356, University of Central Florida.
Le, Xuan; Lancashire, Ian; Hirst, Graeme; Jokel, Regina (2011). Longitudinal detection of dementia through lexical and syntactic changes in writing: a case study of three British novelists. Literary and Linguistic Computing, 26(4), 435–461.
Li, Wentian (1992). Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842–1845.
Mandelbrot, Benoît (1953). An informational theory of the statistical structure of languages. In W. Jackson (ed.), Communication Theory, pages 486–502. Butterworth, London.
Mandelbrot, Benoît (1962). On the theory of word frequencies and on related Markovian models of discourse. In R. Jakobson (ed.), Structure of Language and its Mathematical Aspects, pages 190–219. American Mathematical Society, Providence, RI.
Manning, Christopher D.; Surdeanu, Mihai; Bauer, John; Finkel, Jenny; Bethard, Steven J.; McClosky, David (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014): System Demonstrations, pages 55–60, Baltimore, MD.
Miller, George A. (1957). Some effects of intermittent silence. The American Journal of Psychology, 52, 311–314.
Montemurro, Marcelo A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A, 300, 567–578.
Pakhomov, Serguei; Chacon, Dustin; Wicklund, Mark; Gundel, Jeanette (2011). Computerized assessment of syntactic complexity in Alzheimer's disease: A case study of Iris Murdoch's writing. Behavior Research Methods, 43(1), 136–144.
Rouault, Alain (1978). Lois de Zipf et sources markoviennes. Annales de l'Institut H. Poincaré (B), 14, 169–188.
Sichel, H. S. (1971). On a family of discrete distributions particularly suited to represent long-tailed frequency data. In N. F. Laubscher (ed.), Proceedings of the Third Symposium on Mathematical Statistics, pages 51–97, Pretoria, South Africa. C.S.I.R.
Sichel, H. S. (1975). On a distribution law for word frequencies. Journal of the American Statistical Association, 70, 542–547.
Simon, Herbert A. (1955). On a class of skew distribution functions. Biometrika, 47(3/4), 425–440.
Tweedie, Fiona J. and Baayen, R. Harald (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352.
Yule, G. Udny (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press, Cambridge.
Zipf, George Kingsley (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA.
Zipf, George Kingsley (1965). The Psycho-biology of Language. MIT Press, Cambridge, MA.