Page 1

Frequency-free authorship attribution: Testing the n-gram tracing method

Dr Andrea Nini
University of Manchester
andrea.nini@manchester.ac.uk
www.andreanini.com

Prof Jack Grieve
University of Birmingham
https://sites.google.com/view/grievejw

International Corpus Linguistics Conference 2019
Cardiff University, Wales, UK

Page 2

Authorship Analysis

• Authorship Attribution

• Authorship Profiling

• Authorship Verification

Page 3

The cat sat on the mat and the dog sat on the chair.

[Figure: the words of the sentence (the, cat, sat, on, the, mat, sat, on, dog). Labels: Relative frequency; A set of tokens]

• Word 2-grams: the cat, cat sat, sat on, on the …

• Word 3-grams: the cat sat, cat sat on, on the mat …

• Character 2-grams: th, he, e_, _c, ca …

• Character 3-grams: the, he_, e_c, cat …

• Part of Speech 2-grams: DT NN, NN VBD, VBD PP …
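The n-gram types listed above can be reproduced in a few lines. This is a minimal Python sketch, not the authors' implementation: the function names `word_ngrams` and `char_ngrams`, the whitespace tokenisation, and the use of `_` for spaces (mirroring the slide) are assumptions made for illustration.

```python
def word_ngrams(text, n):
    """Return the word n-gram tokens of `text` (lowercased, edge punctuation stripped)."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-gram tokens of `text`, writing spaces as '_'."""
    chars = text.lower().replace(" ", "_")
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


sentence = "The cat sat on the mat and the dog sat on the chair."
print(word_ngrams(sentence, 2)[:4])  # ['the cat', 'cat sat', 'sat on', 'on the']
print(char_ngrams(sentence, 2)[:5])  # ['th', 'he', 'e_', '_c', 'ca']
```

Part-of-speech n-grams would follow the same pattern once each word has been replaced by its tag; the tagger is left out here because the slides do not name one.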

Page 4

Grieve, Jack (2007) Quantitative authorship attribution: An evaluation of techniques, Literary and Linguistic Computing, 22(3), pp. 251–270.

• 40 authors

• Approx. 30k to 50k words per author

• 500–2,000 words per text

Page 5

The cat looked at the owner arrogntly.

Th = the, then, theanthropism

on = on, ontology

Page 6

Grieve, Jack (2007) Quantitative authorship attribution: An evaluation of techniques, Literary and Linguistic Computing, 22(3), pp. 251–270.

• 40 authors

• Approx. 30k to 50k words per author

• 500–2,000 words per text

Page 7

Amanda Birks

Died 17th January 2009.

House fire

Christopher Birks rescued children but could not rescue Amanda.

Evidence of murder

• No smoke inhalation

• No attempt at escape

• Fire originated in multiple locations

• Wearing outside clothes in bed

Amanda sent SMS text messages throughout the day until 19:48

Amanda Birks’s texting style

No use by CB

• ‘ad’ for ‘had’

• ‘don’t’ for ‘don’t’

• ‘t’ for ‘the’

Rare use by CB

• ‘bak’ for ‘back’

• ‘av’ for ‘have’

• ‘wud’ for ‘would’

• ‘w’ for ‘with’

Some use by CB

• ‘wil’ for ‘will’

• ‘y’ for ‘yes’

• ‘wen’ for ‘when’

Christopher Birks’s texting style

No use by AB

• ‘2’ for ‘to’ with no trailing space

• ‘4’ for ‘for’ with no trailing space

• ‘wiv’ for ‘with’

Rare use by AB

• ‘jst’ for ‘just’

• ‘dnt’ for ‘dont’

• Use of ‘,’ (comma)

• ‘4get’ for ‘forget’

• ‘thanx’ for ‘thanks’

Grant, Tim (2013) TXT 4N6: Method, consistency, and distinctiveness in the analysis of SMS text messages, Journal of Law and Policy, 21, pp. 467–494.

Page 8

The cat sat on the mat and the dog sat on the chair.

[Figure: the words of the sentence (the, cat, sat, on, the, mat, sat, on, dog). Labels: Relative frequency; A set of tokens]

• Word 2-grams: the cat, cat sat, sat on, on the …

• Word 3-grams: the cat sat, cat sat on, on the mat …

• Character 2-grams: th, he, e_, _c, ca …

• Character 3-grams: the, he_, e_c, cat …

• Part of Speech 2-grams: DT NN, NN VBD, VBD PP …

Page 9

The cat sat on the mat and the dog sat on the chair.

[Figure: the words of the sentence (cat, the, mat, sat, on, dog). Labels: Relative frequency; A set of types]

• Word 2-grams: the cat, cat sat, sat on, on the …

• Word 3-grams: the cat sat, cat sat on, on the mat …

• Character 2-grams: th, he, e_, _c, ca …

• Character 3-grams: the, he_, e_c, cat …

• Part of Speech 2-grams: DT NN, NN VBD, VBD PP …
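The difference between treating a text as a set of tokens and as a set of types can be made concrete with the example sentence. The sketch below is illustrative only; the variable names and the simple whitespace tokenisation are assumptions, not part of the talk.

```python
from collections import Counter

sentence = "The cat sat on the mat and the dog sat on the chair."
tokens = [w.strip(".").lower() for w in sentence.split()]

# Token-based view: every occurrence counts, which yields relative frequencies.
counts = Counter(tokens)
relative_frequency = {word: count / len(tokens) for word, count in counts.items()}

# Type-based view: only the presence or absence of each word is recorded.
types = set(tokens)

print(len(tokens), len(types))              # 13 8
print(round(relative_frequency["the"], 3))  # 0.308
print("the" in types)                       # True
```

In the type-based view, "the" and "chair" carry the same weight even though one occurs four times and the other once, which is what makes a method built on types frequency-free.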

Page 10

Features: ad, don’t, t, bak, av, wud, w, wil, y, wen, 2, 4, wiv, jst, dnt, ‘,’, 4get, thanx

Dnt b daft m8, wots goin on? Iv got go 2glossop dya want me cum 2u?

Ok bt frm wot iv seen ur quite capable m8, r u sure u dnt want hav a few wks wiv tim? Then maybe altern8 btween 1 n 2 man teams?

0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0

0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0

Jaccard coefficient

$J = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{number of features in } A \text{ and } B}{\text{number of features in } A \text{ or } B} = \frac{1}{4} = 0.25$

Grant, Tim (2013) TXT 4N6: Method, consistency, and distinctiveness in the analysis of SMS text messages, Journal of Law and Policy, 21, pp. 467–494.
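The Jaccard calculation on this slide can be checked directly from the two binary vectors. The following Python sketch assumes that the vector positions follow the order in which the 18 features are listed on the slide; that alignment, and the names `features`, `present` and `jaccard`, are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# The 18 features in slide order (assumed to match the vector positions).
features = ["ad", "don't", "t", "bak", "av", "wud", "w", "wil", "y", "wen",
            "2", "4", "wiv", "jst", "dnt", ",", "4get", "thanx"]
vector_1 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
vector_2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0]


def present(vector):
    """Return the set of features whose bit is 1 in `vector`."""
    return {feature for feature, bit in zip(features, vector) if bit}


a, b = present(vector_1), present(vector_2)
print(sorted(a))      # [',', '2']
print(sorted(b))      # [',', 'dnt', 'wiv']
print(jaccard(a, b))  # 0.25
```

Only the comma is shared, and four distinct features occur across the two messages, giving 1/4 = 0.25 as on the slide.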

Page 11

$J = \frac{|A \cap B|}{|A \cup B|} = \frac{\text{number of features in } A \text{ and } B}{\text{number of features in } A \text{ or } B}$

176 authors (2.5M word tokens of short emails)

“although accuracy rates overall decrease with sample size, this method is still successful in attributing exceptionally small segments of data to their correct authors, even when there are 176 candidate authors in each attribution task”

Wright (2017)

Wright, David (2017) Using word n-grams to identify authors and idiolects. A corpus approach to a forensic linguistic problem, International Journal of Corpus Linguistics, 22(2), pp. 212–241.

Page 12

Dear Madam,—

I have been shown in the files of the War Department a statement of the Adjutant General of Massachusetts, that you are the mother of five sons who have died gloriously on the field of battle. I feel how weak and fruitless must be any words of mine which should attempt to beguile you from the grief of a loss so overwhelming. But I cannot refrain from tendering to you the consolation that may be found in the thanks of the Republic they died to save. I pray that our Heavenly Father may assuage the anguish of your bereavement, and leave you only the cherished memory of the loved and lost, and the solemn pride that must be yours, to have laid so costly a sacrifice upon the altar of Freedom.

Yours, very sincerely and respectfully,

Abraham Lincoln

John Hay

1,085 texts, 400,747 word tokens

577 texts, 261,126 word tokens

139 word tokens

Page 13

[Figure: Venn diagram of word types in two sets, A and Q. A: the, a, although, truth, universally, of, be, however, neighbourhood, surrounding. Q: the, because, although, truth, universally.]

Page 14

[Figure: Venn diagram of word types in two sets, A and Q. A: the, a, although, truth, universally, of, be, however, neighbourhood, surrounding. Q: the, because, although, truth, universally.]

4/5 = 80%

Overlap coefficient: $O = \frac{|A \cap Q|}{\min(|A|, |Q|)}$

Jaccard coefficient: $J = \frac{|A \cap Q|}{|A \cup Q|} = \frac{4}{11} = 0.36$
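The two coefficients on this slide can be compared on the word types shown in the diagram. This is a minimal Python sketch; the membership of the sets `A` and `Q` is read off the extracted slide text and should be treated as an interpretation, and the function names are chosen here for illustration.

```python
def overlap(a, q):
    """Overlap coefficient: |A ∩ Q| / min(|A|, |Q|)."""
    return len(a & q) / min(len(a), len(q))


def jaccard(a, q):
    """Jaccard coefficient: |A ∩ Q| / |A ∪ Q|."""
    return len(a & q) / len(a | q)


# Word types as read from the slide's diagram (an interpretation).
A = {"the", "a", "although", "truth", "universally",
     "of", "be", "however", "neighbourhood", "surrounding"}
Q = {"the", "because", "although", "truth", "universally"}

print(overlap(A, Q))            # 0.8
print(round(jaccard(A, Q), 2))  # 0.36
```

The overlap coefficient divides by the size of the smaller set, so a short questioned text is not penalised for everything the larger known sample contains that it does not; the Jaccard coefficient falls as the known sample grows.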

Page 15

[Figure: Venn diagram of word types in two sets, A and Q. A: the, a, although, truth, universally, of, be, however, neighbourhood, surrounding. Q: the, because, although, truth, universally.]

4/5 = 80%

Overlap coefficient: $O = \frac{|A \cap Q|}{\min(|A|, |Q|)}$

Shared co-selection coefficient: $S = \frac{\left|\left(\bigcup_{i=1}^{n} A_i\right) \cap Q\right|}{|Q|}$, where $\left|\bigcup_{i=1}^{n} A_i\right| \geq |Q|$

Page 16

Parameters

N-gram types

• Word (1 ≤ n ≤ 8)

• Part-of-Speech (1 ≤ n ≤ 15)

• Character (1 ≤ n ≤ 20)

Number of candidate authors

• 2 author case

• 10 author case

Length of known data set: 100, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, 100,000, 200,000

Length of questioned text: 25, 50, 100, 200, 300, 500, 1,000

Ten authors (Project Gutenberg): Austen, Conrad, Dickens, Doyle, Eliot, Hardy, Lawrence, Stevenson, Thackeray, and Wells

Page 17

N-gram tracing

Austen

Dickens

Page 18

N-gram tracing

Austen

Dickens

Q text: 25 word sample

Page 19

N-gram tracing

Q text: 25 word sample

Austen: 1,000 word sample

Dickens: 1,000 word sample

Page 20

N-gram tracing

Q text: 25 word sample

Austen: 1,000 word sample

Dickens: 1,000 word sample

Page 21

N-gram tracing

Q text: the, because, although, truth, universally

Austen: 1,000 word sample

Dickens: 1,000 word sample

Page 22

N-gram tracing

Q text: the, because, although, truth, universally

Austen: the, a, although, truth, universally, of, be, however, neighbourhood, surrounding

Dickens: the, a, though, truth, wonderful, of, be, lately, lizard, waddling

Page 23

N-gram tracing

Q text: the, because, although, truth, universally

Austen: the, a, although, truth, universally, of, be, however, neighbourhood, surrounding

4/5 = 80%

Dickens: the, a, though, truth, wonderful, of, be, lately, lizard, waddling

2/5 = 40%

Page 24

N-gram tracing

Q text: the, because, although, truth, universally

Austen: the, a, although, truth, universally, of, be, however, neighbourhood, surrounding

4/5 = 80%

Dickens: the, a, though, truth, wonderful, of, be, lately, lizard, waddling
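The walkthrough on the preceding slides amounts to scoring each candidate by the share of the questioned text's types found in that candidate's known sample, then picking the highest score. The Python sketch below illustrates that reading; it is not the authors' code, the word sets are an interpretation of the extracted slide text, and the names `overlap_with_q` and `attribute` are hypothetical.

```python
def overlap_with_q(known_types, q_types):
    """Share of the questioned text's types that also occur in the known sample."""
    return len(known_types & q_types) / len(q_types)


def attribute(q_types, candidates):
    """Return the best-scoring candidate and the score of every candidate."""
    scores = {name: overlap_with_q(types, q_types)
              for name, types in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Word types as read from the slides (an interpretation of the diagram).
q = {"the", "because", "although", "truth", "universally"}
candidates = {
    "Austen": {"the", "a", "although", "truth", "universally",
               "of", "be", "however", "neighbourhood", "surrounding"},
    "Dickens": {"the", "a", "though", "truth", "wonderful",
                "of", "be", "lately", "lizard", "waddling"},
}

best, scores = attribute(q, candidates)
print(best)    # Austen
print(scores)  # {'Austen': 0.8, 'Dickens': 0.4}
```

Dividing by the number of types in Q assumes the known sample has at least as many types as the questioned text, which holds here (10 versus 5).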

Page 25

1. character n-grams (n = 9, 10, 11, 12, 13)

2. word n-grams (n = 2, 3)

3. POS n-grams (n = 5, 6)

Page 26

Q = 600 words written by Austen

AUSTEN: amuse, continuance, enjoyments, evils, mildness

CONRAD:

DICKENS: caresses, dearer

Page 27

Q = 600 words written by Austen

AUSTEN: was yet a, ways of th, wed her to, which firs, whom she c, … [382]

CONRAD: t of a thi, than an in, timacy of , too much h, us friends … [134]

DICKENS: ts concern, ttle short, ughters of, ut particu, was the yo … [161]
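One way to produce per-author lists like those above is to keep, for each candidate, the n-grams of Q that occur in that candidate's known data and in no other candidate's. Whether the slides were built exactly this way is an assumption; the Python sketch below uses a hypothetical function name `unique_matches` and toy data borrowed from the word-level example.

```python
def unique_matches(q_types, candidates):
    """For each candidate, list the Q types found in that candidate's data only."""
    result = {}
    for name, types in candidates.items():
        # Pool the types of every other candidate.
        others = set().union(
            *(t for other, t in candidates.items() if other != name)
        )
        result[name] = sorted((q_types & types) - others)
    return result


# Toy data for illustration only; not the corpora used in the talk.
q = {"amuse", "evils", "caresses", "the"}
candidates = {
    "Austen": {"amuse", "evils", "the"},
    "Conrad": {"the"},
    "Dickens": {"caresses", "the"},
}

print(unique_matches(q, candidates))
# {'Austen': ['amuse', 'evils'], 'Conrad': [], 'Dickens': ['caresses']}
```

Types shared by every candidate, such as "the" here, drop out of all lists, so what remains is the evidence that separates the candidates and can be inspected by eye.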

Page 28

Coulthard, Malcolm (2004) Author identification, idiolect, and linguistic uniqueness, Applied Linguistics, 25, pp. 431–447.

“Every speaker has a very large active vocabulary built up over many years, which will differ from the vocabularies others have similarly built up, not only in terms of actual items but also in preferences for selecting certain items rather than others.

Thus, whereas in principle any speaker/writer can use any word at any time, speakers in fact tend to make typical and individuating co-selections of preferred words.

This implies that it should be possible to devise a method of linguistic fingerprinting - in other words that the linguistic ‘impressions’ created by a given speaker/writer should be usable, just like a signature, to identify them.”

Coulthard (2004: 432)

“In reality, the concept of the linguistic fingerprint is an unhelpful, if not actually misleading metaphor, at least when used in the context of forensic investigations of authorship”

Coulthard (2004: 432)

Page 29

In conclusion

• Size of comparison data much more important than size of disputed data

• Frequency information not essential in certain situations

• Results are much easier to interpret with a type-based method

• Results are more meaningful theoretically
