Frequency-free authorship attribution: Testing the n-gram tracing method
Dr Andrea Nini, University of [email protected]
International Corpus Linguistics Conference 2019, Cardiff University, Wales, UK
Prof Jack Grieve, University of Birmingham
[email protected]
sites.google.com/view/grievejw
Authorship Analysis
• Authorship Attribution
• Authorship Profiling
• Authorship Verification
The cat sat on the mat and the dog sat on the chair.
[Figure: the sentence broken into word tokens: the, cat, sat, on, the, mat, sat, on, dog]
Relative frequency
A set of tokens
• Word 2-grams: the cat, cat sat, sat on, on the …
• Word 3-grams: the cat sat, cat sat on, on the mat …
• Character 2-grams: th, he, e_, _c, ca …
• Character 3-grams: the, he_, e_c, cat …
• Part of Speech 2-grams: DT NN, NN VBD, VBD PP …
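The word and character feature types listed above can be illustrated with a short sketch. This is a minimal example and not the extraction code used in the study: the function names `word_ngrams` and `char_ngrams`, the lower-casing, and the punctuation stripping are assumptions made for illustration. Part-of-Speech n-grams are left out because they need a tagger.

```python
def word_ngrams(text, n):
    """Return the word n-gram tokens of `text`, in order of occurrence."""
    # Assumption: lower-case and strip surrounding punctuation before counting.
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-gram tokens, writing spaces as '_' as on the slide."""
    chars = text.lower().replace(" ", "_")
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


sentence = "The cat sat on the mat and the dog sat on the chair."
print(word_ngrams(sentence, 2)[:4])  # ['the cat', 'cat sat', 'sat on', 'on the']
print(char_ngrams(sentence, 2)[:5])  # ['th', 'he', 'e_', '_c', 'ca']
```

Both functions return tokens, so repeated n-grams appear more than once; wrapping the result in `set(...)` gives the corresponding types.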
Grieve, Jack (2007) Quantitative authorship attribution: An evaluation of techniques, Literary and Linguistic Computing, 22(3), pp. 251–270.
40 Authors
Approx. 50k to 30k words per author
500 – 2000 words per text
The cat looked at the owner arrogantly.
Th = the, then, theanthropism
on = on, ontology
Amanda Birks
Died 17th January 2009.
House fire
Christopher Birks rescued children but could not rescue Amanda.
Evidence of murder
• No smoke inhalation
• No attempt at escape
• Fire originated in multiple locations
• Wearing outside clothes in bed
Amanda sent SMS text messages throughout the day until 19:48
Amanda Birks’s texting style.
‘ad’ for ‘had’
‘dont’ for ‘don’t’
‘t’ for ‘the’
No use by CB
Rare use by CB
‘bak’ for ‘back’
‘av’ for ‘have’
‘wud’ for ‘would’
‘w’ for ‘with’
Some use by CB
‘wil’ for ‘will’
‘y’ for ‘yes’
‘wen’ for ‘when’
Christopher Birks’s texting style.
‘2’ for ‘to’ with no trailing space
‘4’ for ‘for’ with no trailing space
‘wiv’ for ‘with’
No use by AB
Rare use by AB
‘jst’ for ‘just’
‘dnt’ for ‘dont’
Use of ‘,’ (comma)
‘4get’ for ‘forget’
‘thanx’ for ‘thanks’
Grant, Tim (2013) TXT 4N6: Method, consistency, and distinctiveness in the analysis of SMS text messages, Journal of Law and Policy, 21, pp. 467–494.
The cat sat on the mat and the dog sat on the chair.
[Figure: the sentence reduced to its word types: the, cat, sat, on, mat, dog]
Relative frequency
A set of types
• Word 2-grams: the cat, cat sat, sat on, on the …
• Word 3-grams: the cat sat, cat sat on, on the mat …
• Character 2-grams: th, he, e_, _c, ca …
• Character 3-grams: the, he_, e_c, cat …
• Part of Speech 2-grams: DT NN, NN VBD, VBD PP …
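The difference between a set of tokens and a set of types can be made concrete with the example sentence. The sketch below is only an illustration; the simple whitespace tokenisation and the variable names are assumptions.

```python
from collections import Counter

sentence = "The cat sat on the mat and the dog sat on the chair."
tokens = [w.strip(".").lower() for w in sentence.split()]

# Token-based view: every occurrence counts, which yields relative frequencies.
counts = Counter(tokens)
rel_freq = {word: count / len(tokens) for word, count in counts.items()}

# Type-based view: only the presence or absence of a word matters.
types = set(tokens)

print(len(tokens), len(types))   # 13 8
print(counts["the"])             # 4
```

A frequency-free method works only with `types`; the information held in `rel_freq` is discarded.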
ad, dont, t, bak, av, wud, w, wil, y, wen, 2, 4, wiv, jst, dnt, ‘,’ (comma), 4get, thanx
Dnt b daft m8, wots goin on? Iv got go 2glossop dya want me cum 2u?
Ok bt frm wot iv seen ur quite capable m8, r u sure u dnt want hav a few wks wiv tim? Then maybe altern8 btween 1 n 2 man teams?
Jaccard coefficient
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0
$J = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{number of features in } A \text{ and } B}{\text{number of features in } A \text{ or } B} = \frac{1}{4} = 0.25$
Grant, Tim (2013) TXT 4N6: Method, consistency, and distinctiveness in the analysis of SMS text messages, Journal of Law and Policy, 21, pp. 467–494.
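The Jaccard calculation on the two binary feature vectors can be checked with a few lines of code. This is a minimal sketch: the function name `jaccard` and the convention of returning 0.0 when neither vector contains a feature are assumptions, not part of the slide.

```python
def jaccard(a, b):
    """Jaccard coefficient of two equal-length binary feature vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # features in A and B
    either = sum(1 for x, y in zip(a, b) if x or y)   # features in A or B
    # Assumption: two empty vectors are given a similarity of 0.0.
    return both / either if either else 0.0


v1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
v2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(jaccard(v1, v2))  # 0.25
```

The two vectors share one feature and contain four distinct features between them, which gives 1/4.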
176 authors (2.5M word tokens of short emails)
“although accuracy rates overall decrease with sample size, this method is still successful in attributing exceptionally small segments of data to their correct authors, even when there are 176 candidate authors in each attribution task”
Wright (2017)
Wright, David (2017) Using word n-grams to identify authors and idiolects. A corpus approach to a forensic linguistic problem, International Journal of Corpus Linguistics, 22(2), pp. 212–241.
Dear Madam,—
I have been shown in the files of the War Department a statement of the Adjutant General of Massachusetts, that you are the mother of five sons who have died gloriously on the field of battle. I feel how weak and fruitless must be any words of mine which should attempt to beguile you from the grief of a loss so overwhelming. But I cannot refrain from tendering to you the consolation that may be found in the thanks of the Republic they died to save. I pray that our Heavenly Father may assuage the anguish of your bereavement, and leave you only the cherished memory of the loved and lost, and the solemn pride that must be yours, to have laid so costly a sacrifice upon the altar of Freedom.
Yours, very sincerely and respectfully,
Abraham Lincoln
John Hay
1,085 texts, 400,747 word tokens
577 texts, 261,126 word tokens
139 word tokens
[Figure: the word types of the questioned text Q (the, a, although, truth, universally) compared with the word types of the known data A (because, of, be, however, neighbourhood, surrounding, and others)]
4/5 = 80%
Overlap coefficient: $O = \frac{|A \cap Q|}{\min(|A|, |Q|)}$
Jaccard coefficient: $J = \frac{|A \cap Q|}{|A \cup Q|} = \frac{4}{11} = 0.36$
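The contrast between the two coefficients can be reproduced with Python sets. The sketch below is illustrative only: the exact membership of `Q` and `A` is an assumption read off the slide diagrams and chosen so that the figures 4/5 and 4/11 come out, and the function names are hypothetical.

```python
def overlap_coefficient(a, q):
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(a), len(q))
    return len(a & q) / smaller if smaller else 0.0


def jaccard_coefficient(a, q):
    """Size of the intersection divided by the size of the union."""
    union = a | q
    return len(a & q) / len(union) if union else 0.0


# Assumed sets: Q has 5 types, A has 10, and they share 4.
Q = {"the", "a", "although", "truth", "universally"}
A = {"the", "because", "although", "truth", "universally",
     "of", "be", "however", "neighbourhood", "surrounding"}

print(overlap_coefficient(A, Q))            # 0.8
print(round(jaccard_coefficient(A, Q), 2))  # 0.36
```

Because the overlap coefficient divides by the smaller set, enlarging the known data A cannot lower the score through the denominator, whereas the Jaccard coefficient shrinks as the union grows.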
Q and A: 4/5 = 80%
Overlap coefficient: $O = \frac{|A \cap Q|}{\min(|A|, |Q|)}$
Shared co-selection coefficient: $\frac{|\bigcup_{i=1}^{n} A_i \cap Q|}{|Q|}$, where $\sum_{i=1}^{n} |A_i| \geq |Q|$
Parameters
N-gram types
• Word (1 ≤ n ≤ 8)
• Part-of-Speech (1 ≤ n ≤ 15)
• Character (1 ≤ n ≤ 20)
Number of candidate authors
• 2 author case
• 10 author case
Length of known data set (words): 100; 1,000; 2,000; 5,000; 10,000; 20,000; 50,000; 100,000; 200,000
Length of questioned text (words): 25; 50; 100; 200; 300; 500; 1,000
Ten authors (Project Gutenberg): Austen, Conrad, Dickens, Doyle, Eliot, Hardy, Lawrence, Stevenson, Thackeray, and Wells
N-gram tracing
[Figure: a 25 word sample from the questioned text (Q) is compared with a 1,000 word sample from each candidate author, Austen and Dickens]
Q: the, a, although, truth, universally
Austen: the, because, although, truth, universally, of, be, however, neighbourhood, surrounding
Dickens: the, a, though, truth, wonderful, of, be, lately, lizard, waddling
Austen: 4/5 = 80%
Dickens: 2/5 = 40%
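The comparison shown on these slides can be sketched as a small attribution routine: extract the n-gram types of the questioned text, pool the n-gram types of each candidate's known texts, and score each candidate by the overlap coefficient. This is a minimal sketch under stated assumptions and not the authors' implementation. The names `ngram_types` and `trace`, the regular-expression tokeniser, and the two-sentence toy corpus (the opening lines of Pride and Prejudice and A Tale of Two Cities) are all chosen for illustration.

```python
import re


def ngram_types(text, n):
    """Return the set of word n-gram types in `text`."""
    # Assumption: a word is a run of lower-cased letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def trace(q_text, known_texts, n):
    """Score each candidate by the overlap between Q and their pooled n-gram types."""
    q = ngram_types(q_text, n)
    scores = {}
    for author, texts in known_texts.items():
        # Pool the types of all known texts by this author.
        known = set().union(*(ngram_types(t, n) for t in texts))
        smaller = min(len(q), len(known))
        scores[author] = len(q & known) / smaller if smaller else 0.0
    return scores


# Toy data for illustration only.
known_texts = {
    "Austen": ["It is a truth universally acknowledged, that a single man in "
               "possession of a good fortune, must be in want of a wife."],
    "Dickens": ["It was the best of times, it was the worst of times."],
}
scores = trace("a truth universally acknowledged", known_texts, n=2)
print(scores)                       # {'Austen': 1.0, 'Dickens': 0.0}
print(max(scores, key=scores.get))  # Austen
```

No frequencies are used at any point: an n-gram counts once whether it occurs once or a hundred times in a candidate's known texts.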
N-gram tracing
1. character n-grams (n = 9, 10, 11, 12, 13)
2. word n-grams (n = 2, 3)
3. POS n-grams (n = 5, 6)
Q = 600 words written by Austen
AUSTEN: amuse, continuance, enjoyments, evils, mildness
CONRAD:
DICKENS: caresses, dearer
Q = 600 words written by Austen
AUSTEN: was yet a, ways of th, wed her to, which firs, whom she c, … [382]
CONRAD: t of a thi, than an in, timacy of , too much h, us friends … [134]
DICKENS: ts concern, ttle short, ughters of, ut particu, was the yo … [161]
Coulthard, Malcolm (2004) Author identification, idiolect, and linguistic uniqueness, Applied Linguistics, 25, pp. 431–447.
“Every speaker has a very large active vocabulary built up over many years, which will differ from the vocabularies others have similarly built up, not only in terms of actual items but also in preferences for selecting certain items rather than others.
Thus, whereas in principle any speaker/writer can use any word at any time, speakers in fact tend to make typical and individuating co-selections of preferred words.
This implies that it should be possible to devise a method of linguistic fingerprinting - in other words that the linguistic ‘impressions’ created by a given speaker/writer should be usable, just like a signature, to identify them.”
Coulthard (2004: 432)
“In reality, the concept of the linguistic fingerprint is an unhelpful, if not actually misleading metaphor, at least when used in the context of forensic investigations of authorship”
Coulthard (2004: 432)
In conclusion
• Size of comparison data much more important than size of disputed data
• Frequency information not essential in certain situations
• Results are much easier to interpret with a type-based method
• Results are more meaningful theoretically