‘When the text counts’ Exploring the Implications of ‘text’ as unit in corpus linguistics

Matthew Brook O’Donnell, Mike Scott, Michaela Mahlberg, Michael Hoey



Text vs Corpus View (Tognini-Bonelli 2004: 18)

A text                               A corpus
Read whole                           Read fragmented
Read horizontally                    Read vertically
Read for content                     Read for formal patterning
Read as unique event                 Read for repeated events
Read as an individual act of will    Read as a sample of social practice
Coherent communicative event         Not a coherent communicative event

Referential vs Narrative Modes (McCarty 1996: 241)

1. referential - like objects from the corpus are retrieved on the basis of search criteria and grouped together

2. narrative - the language data in the corpus are viewed and processed in a sequential manner, word by word in corpus order
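
The two modes of access can be contrasted in a minimal Python sketch. The toy corpus, the function names `referential` and `narrative`, and the simple token lists are assumptions made for illustration; they are not part of McCarty's account.

```python
from typing import Iterator

# Toy corpus (hypothetical): each text is a list of tokens in corpus order.
corpus = [
    ["bad", "things", "happen", "yesterday"],
    ["yesterday", "the", "paper", "shelved", "its", "plan"],
]

def referential(corpus: list[list[str]], query: str, span: int = 2) -> list[tuple[int, list[str]]]:
    """Retrieve every hit for `query` with its context and group the hits together."""
    hits = []
    for text_id, tokens in enumerate(corpus):
        for i, token in enumerate(tokens):
            if token == query:
                hits.append((text_id, tokens[max(0, i - span): i + span + 1]))
    return hits

def narrative(corpus: list[list[str]]) -> Iterator[tuple[int, int, str]]:
    """Walk the corpus sequentially, word by word in corpus order."""
    for text_id, tokens in enumerate(corpus):
        for position, token in enumerate(tokens):
            yield text_id, position, token

hits = referential(corpus, "yesterday")   # like objects, pulled out of their texts
stream = list(narrative(corpus))          # every token, in its original sequence
```

The referential view returns a concordance-like list detached from the texts; the narrative view keeps each token tied to its text and position.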

Corpus Sampling & Quantitative Measures

Sampling
• 1st generation corpora: 2,000-word samples
• 2nd generation, e.g. BNC: larger samples, but for copyright reasons not complete texts in many cases
• generalized corpora cover many genres

Quantitative Measures
• normalization to facilitate comparison & smooth varying text lengths
• occs. per X thousand/million words
• investigation of register variation (Biber et al. 1999)
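
As a small illustration of normalization, the sketch below converts a raw count into occurrences per X words. The function name `per_n_words` and the counts are hypothetical, chosen only to show how the same raw count reads very differently in corpora of different sizes.

```python
def per_n_words(count: int, corpus_tokens: int, base: int = 1_000_000) -> float:
    """Normalized frequency: occurrences per `base` words of running text."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return count * base / corpus_tokens

# Hypothetical figures: the same raw count in corpora of different sizes.
small = per_n_words(150, 50_000, base=1_000)      # 150 hits in 50,000 words
large = per_n_words(150, 2_000_000, base=1_000)   # 150 hits in 2,000,000 words
```

Here `small` is 3.0 per thousand words and `large` is 0.075 per thousand words, although the raw count is identical.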

Corpus linguistics has tended to…
• fit Tognini-Bonelli’s ‘corpus’ paradigm
• utilize a referential mode of access
• be largely limited to small units, primarily word/phrase and sentence, i.e. those things which can be relatively easily counted
• investigate variation through normalization


Textually sensitive corpus linguistics
• Sinclair always advocated ‘whole text’ as default for corpus building and incorporated DA (discourse analysis) into his corpus-theoretical model
• Biber et al. 1996
  • Study of scientific articles with marked structural divisions: introduction, method, results, discussion, conclusion
  • Referential summary of features grouped by text section
  • Narrative view through use of visualization: discourse map

Working with specialized corpora
• Large number of relatively homogeneous texts in their entirety
  • same genre
  • limited variation in text size
• Example: Guardian news archive (1998-2004)
  • 650,000+ articles
  • News: Home news, overseas, features, sport, obituaries, etc.
  • Paragraph and other structural divisions in markup

Opportunities: Methodological
• If corpus has complete texts then features can be counted relative to text
• New measures
  • average occurrences per text
  • textual distribution profiles
  • positional frequency lists
  • textual collocates and keywords
  • dispersion and consistency
• Does normalization obscure textual effects?
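
Two of the text-relative measures can be sketched as follows. This is only one possible reading: "average occurrences per text" is taken here as the mean over the texts that contain the word, and dispersion is reduced to the share of texts containing it. The mini-corpus and the name `text_measures` are hypothetical.

```python
from collections import Counter

# Hypothetical mini-corpus of complete texts, already tokenized.
texts = [
    ["police", "said", "yesterday", "police", "were", "called"],
    ["the", "minister", "said", "yesterday"],
    ["a", "report", "on", "schools"],
]

def text_measures(texts, word):
    """Return (mean occurrences per text containing `word`, share of texts containing it)."""
    per_text = [Counter(tokens)[word] for tokens in texts]
    containing = [n for n in per_text if n > 0]
    if not containing:
        return 0.0, 0.0
    mean_per_text = sum(containing) / len(containing)
    text_range = len(containing) / len(texts)
    return mean_per_text, text_range

police_mean, police_range = text_measures(texts, "police")
said_mean, said_range = text_measures(texts, "said")
```

Both words occur twice in the toy corpus, so a per-thousand-words figure would treat them alike; counted relative to text, `police` is concentrated in one text while `said` is spread over two.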

Opportunities: Theoretical
• Without generalization and normalization, lower frequency and more delicate patterns emerge, and a more realistic and nuanced theory of language use emerges, e.g. lexical priming
  • ‘All primings are genre specific… this is particularly marked of textual primings’ (Hoey 2005: 125)
• Investigate and extend claims about discourse from text-linguistics

Opportunities: Theoretical
• Explore behaviour of features and variation within texts
• Biber et al. 1999
  • note high frequency of temporal adverbs in News registers
  • and that yesterday, last night, tomorrow are most frequent in British News and are infrequent in American News, which exhibits high frequency of days of the week (Monday, Tuesday,…)

Opportunities: Theoretical
• Explore behaviour of features and variation within texts
• Yesterday in the Guardian

George Bush yesterday suffered a blow to his argument… (P1S1, Sept 27, 2006)

Bulgaria and Romania yesterday received the green light to join the EU in January (P1S1, Sept 27, 2006)

Yesterday, the paper shelved its plan to give… (P2S2, Apr 3, 1998)

Yesterday the prime minister was taunted by Iain Duncan Smith. (P8S2, Apr 25, 2002)

Yesterday seems to avoid theme position in text-initial sentences
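
One crude way to check this observation on the four examples is to test whether the word is the first token of the sentence, treating sentence-initial position as a rough stand-in for theme position. That proxy, the function name `starts_with` and the regular expression are assumptions made for illustration only.

```python
import re

def starts_with(sentence: str, word: str) -> bool:
    """True if the first word token of `sentence` is `word` (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return bool(tokens) and tokens[0].lower() == word.lower()

# The four Guardian examples quoted above, with their paragraph/sentence position.
examples = [
    ("P1S1", "George Bush yesterday suffered a blow to his argument…"),
    ("P1S1", "Bulgaria and Romania yesterday received the green light to join the EU in January"),
    ("P2S2", "Yesterday, the paper shelved its plan to give…"),
    ("P8S2", "Yesterday the prime minister was taunted by Iain Duncan Smith."),
]

initial_by_position = [(pos, starts_with(s, "yesterday")) for pos, s in examples]
```

In these four examples the check is false for both text-initial sentences (P1S1) and true for the two later sentences.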

Textual Priming Project: Aims
• to investigate how many (and what types of) lexical items are primed to appear in text-initial or paragraph-initial position
• to identify lexico-grammatical patterns and see how these patterns can be functionally interpreted in the textual contexts
• to relate these lexical and corpus-driven facts to current textual descriptions of (hard) news stories that might provide explanations for the positive primings of relevant lexis

Subdividing the corpus

TISC = Text-initial sentences
PISC = Paragraph-initial sentences [not in TISC, and paragraph contains 2+ sentences]
SISC = Paragraphs of one sentence [not in TISC]
NISC = Non-paragraph-initial sentences
HISC = Headline, sub-headline, etc.
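
The definitions above can be read as a simple labelling procedure. The sketch below is one such reading, not the project's actual implementation: it assumes the article body has already been split into non-empty paragraphs and sentences, and that headlines (HISC) are identified separately from the markup. The name `label_sentences` is hypothetical.

```python
def label_sentences(paragraphs: list[list[str]]) -> list[str]:
    """Label each sentence of one article as TISC, PISC, SISC or NISC.

    `paragraphs` is the article body (headlines excluded), one list of
    sentences per paragraph, in text order.
    """
    labels = []
    for p_index, paragraph in enumerate(paragraphs):
        for s_index, _sentence in enumerate(paragraph):
            if p_index == 0 and s_index == 0:
                labels.append("TISC")   # first sentence of the text
            elif s_index > 0:
                labels.append("NISC")   # not first in its paragraph
            elif len(paragraph) == 1:
                labels.append("SISC")   # one-sentence paragraph
            else:
                labels.append("PISC")   # opens a paragraph of 2+ sentences
    return labels

# Hypothetical article skeleton: paragraphs of 2, 1 and 3 sentences.
article = [["s1", "s2"], ["s3"], ["s4", "s5", "s6"]]
labels = label_sentences(article)
```

The text-initial test comes first, so a text whose opening paragraph has a single sentence still contributes that sentence to TISC rather than SISC, matching the "[not in TISC]" conditions.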

Anatomy of a news article

Pope 'deeply sorry' but Muslim protests spread

(1) Italian police were yesterday ordered to tighten security at potential Catholic targets across the country as the leaders of the Roman Catholic church anxiously waited to see if a personal expression of regret by Pope Benedict would assuage Muslim fury over his remarks on Islam. (2) The Pope's speech in Germany last week, in which he quoted a medieval ruler who said Muhammad's innovations were "evil and inhuman", has led to widespread condemnation in the Muslim world. (3) Last night the controversy seemed to have claimed its first victims when gunmen killed a 65-year-old Italian nun and her bodyguard at the entrance to a hospital where she worked in the Somalian capital, Mogadishu.

(4) A doctor said the nun, who was named as Sister Leonella Sgorbati, from Piacenza in northern Italy, had been shot four times in the back by two men with pistols. (5) The attack was linked by some to the Pope's remarks.

Callouts on the slide: HISC (headline); TISC (text initial sentence); PISC (paragraph initial sentence), twice; NISC (non-initial sentence), twice

Guardian Sept 18, 2006

Anatomy of a news article

(6) A Vatican spokesman, Father Federico Lombardi, said he hoped it was "an isolated event", adding: "We are worried about the consequences of this wave of hatred and hope it doesn't have grave consequences for the church around the world."

(7) The pontiff appeared to risk causing fresh controversy during his speech yesterday when he cited a passage from St Paul that risked being interpreted as hostile - not by Muslims, but by Jews. (8) It described the crucifixion of Jesus as a "scandal for the Jews".

(9) At Castel Gandolfo in the hills east of Rome, where the Pope has his summer residence, armed police kept watch from a balcony on the town hall as he prepared for the most nervously awaited statement of his 17-month papacy.

(10) Some of the pilgrims entering the courtyard of his residence to listen to his traditional Sunday address were searched by police, who confiscated metal-tipped umbrellas and some bottles of liquid. (11) Plainclothes officers dressed as tourists filmed the gathering using video cameras.

Callouts on the slide: SISC (single-sentence paragraph), twice; PISC, twice; NISC, twice

Guardian Sept 18, 2006

Summary of positional subcorpora: Guardian Home News 1998-2004

                                      TISC        PISC        SISC        NISC
tokens                           3,122,037  12,521,902  17,129,694  19,338,590
types                               58,432     127,038     137,322     141,793
sentences                          113,288     607,125     555,641   1,064,493
mean sentence length (in words)         28          21          31          18
std.dev.                             11.11        9.68        23.8        9.88

Summary of positional subcorpora: Guardian Home News 1998-2004

[Figure: tokens and sentences by positional subcorpus]

Positional frequency list: raw frequencies

                 TISC     PISC     SISC     NISC
YESTERDAY       34646    19405    24510    13363
NIGHT           10700    10929    12747     9835
AFTER           15628    25895    37427    31638
LAST            14569    35161    35161    31696
A              101442   407112   407112   455977

Positional frequency list: per 1000 words

                 TISC     PISC     SISC     NISC
YESTERDAY       11.10     1.55     1.43     0.69
NIGHT            3.43     0.87     0.74     0.51
AFTER            5.01     2.07     2.18     1.64
LAST             4.67     2.81     2.05     1.64
A               32.49    32.51    23.77    23.58

Positional frequency list: per 100 sentences

                 TISC     PISC     SISC     NISC
YESTERDAY       30.58     3.20     4.41     1.26
NIGHT            9.44     1.80     2.29     0.92
AFTER           13.79     4.27     6.74     2.97
LAST            12.86     5.79     6.33     2.98
A               89.54    67.06    73.27    42.84
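
The per-sentence figures can be reproduced from the raw frequencies and the sentence totals in the summary of positional subcorpora. The sketch below does this for YESTERDAY; the numbers are copied from the slides, while the names `per_100_sentences`, `sentences` and `yesterday_raw` are illustrative.

```python
# Sentence totals per subcorpus and raw frequencies of YESTERDAY, copied from the slides.
sentences = {"TISC": 113_288, "PISC": 607_125, "SISC": 555_641, "NISC": 1_064_493}
yesterday_raw = {"TISC": 34_646, "PISC": 19_405, "SISC": 24_510, "NISC": 13_363}

def per_100_sentences(raw: dict[str, int], totals: dict[str, int]) -> dict[str, float]:
    """Occurrences per 100 sentences in each subcorpus, rounded to 2 places."""
    return {name: round(100 * raw[name] / totals[name], 2) for name in raw}

rates = per_100_sentences(yesterday_raw, sentences)
```

Note that this is a count of occurrences per 100 sentences; it equals the percentage of sentences containing the word only if the word never occurs twice in one sentence.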

Positional frequency list: percentage of total occs. of word across all subcorpora

                 TISC     PISC     SISC     NISC
YESTERDAY      37.69%   21.11%   26.66%   14.54%
NIGHT          24.20%   24.72%   28.83%   22.25%
AFTER          14.13%   23.42%   33.84%   28.61%
LAST           12.50%   30.16%   30.16%   27.19%
A               7.40%   29.68%   29.68%   33.24%
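
These shares follow directly from the raw frequencies: each subcorpus count is divided by the word's total across the four subcorpora. A minimal sketch for two of the words, with counts copied from the raw frequency table and the name `positional_share` chosen for illustration:

```python
# Raw frequencies copied from the positional frequency list.
raw = {
    "YESTERDAY": {"TISC": 34_646, "PISC": 19_405, "SISC": 24_510, "NISC": 13_363},
    "AFTER": {"TISC": 15_628, "PISC": 25_895, "SISC": 37_427, "NISC": 31_638},
}

def positional_share(counts: dict[str, int]) -> dict[str, float]:
    """Percentage of a word's total occurrences that falls in each subcorpus."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}

shares = {word: positional_share(counts) for word, counts in raw.items()}
```

These shares are not adjusted for subcorpus size. TISC holds roughly 6% of the tokens in the four subcorpora (3,122,037 of 52,112,223), so a TISC share of 37.69% for YESTERDAY is far above what an even spread would give.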

Positional frequency list: percentage of sentences in each subcorpus

                 TISC     PISC     SISC     NISC
YESTERDAY      30.58%    3.20%    4.41%    1.26%
NIGHT           9.44%    1.80%    2.29%    0.92%
AFTER          13.79%    4.27%    6.74%    2.97%
LAST           12.86%    5.79%    6.33%    2.98%
A              89.54%   67.06%   73.27%   42.84%

Bad things happen, man!

Corpus studies have illustrated the semantic association of HAPPEN with negative things

But where in text?


Bad things happen, man!

                 TISC     PISC     SISC     NISC
BAD             4.30%   23.77%   25.76%   46.17%
THINGS          1.79%   22.41%   19.62%   56.17%
HAPPEN          1.54%   20.12%   21.91%   56.43%
MAN            13.08%   23.30%   25.05%   38.57%

Percentage of total occs. of word across all subcorpora

…but not as often in the first sentence of a text!

Textual priming: Textual Colligation
• Every word may be primed for us to occur at the beginning or end of an independently recognised ‘chunk’ of text, e.g. the paragraph, the whole text


Textual priming: Textual Collocation
• We are primed to expect every word to enter into or avoid cohesive chains (or cohesive links)
• We are primed to associate each (cohesive) word with particular kinds of cohesion

How can this be measured?

Text frequency profile (TFP): yesterday

Occs per text   # of texts   % of total texts
1               35427        60.41
2               14033        23.93
3               5609         9.56
4               2169         3.70
5               862          1.47
6               333          0.57
7               130          0.22
8               45           0.08
9               19           0.03
10              13           0.02
11              4            0.01
14              1            0.00

• 60% of texts containing Y. have it only once
• nearly 1 in 4 of the texts containing Y. have two occs. forming cohesive link
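
A text frequency profile of this kind can be computed by counting the word in each complete text and then tallying how many texts share each count. The sketch below assumes tokenized whole texts and, like the table above, takes percentages over the texts that contain the word. The name `text_frequency_profile` and the toy texts are hypothetical.

```python
from collections import Counter

def text_frequency_profile(texts, word):
    """Map occurrences-per-text -> (number of texts, % of texts containing `word`)."""
    per_text = [tokens.count(word) for tokens in texts]
    profile = Counter(n for n in per_text if n > 0)
    containing = sum(profile.values())
    return {
        occs: (n_texts, round(100 * n_texts / containing, 2))
        for occs, n_texts in sorted(profile.items())
    }

# Hypothetical tokenized texts.
texts = [
    ["yesterday", "the", "rouble", "fell"],
    ["trading", "halted", "yesterday", "and", "prices", "rose", "yesterday"],
    ["the", "minister", "spoke", "yesterday"],
    ["no", "temporal", "adverb", "here"],
]
tfp = text_frequency_profile(texts, "yesterday")
```

As the slides note for TFP at this stage, this captures simple repetition of the same word form only.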

Nine times yesterday…

Russia in emergency talks with IMF

Chernomyrdin appeals for aid as West decries Moscow debt plan and leading bank collapses

(1) THE Russian prime minister, Viktor Chernomyrdin, flew to Crimea in Ukraine last night for emergency talks with the head of the International Monetary Fund as the crashing rouble and the collapse of a leading bank pushed Russia further into chaos.

(2) With Western bankers rounding on Moscow over its plans to reschedule some $40 billion ( pounds 24 billion) of debt, and the stock market in free fall, Mr Chernomyrdin last night desperately sought more aid from the IMF and leading industrial countries.

(3) The recalled prime minister is due to make a televised appeal to the nation within days, amid mounting speculation that President Boris Yeltsin's time is running out.

(4) The trip by the IMF head, Michel Camdessus, to meet Mr Chernomyrdin and the leaders of the former Soviet republics Ukraine and Belarus - which have been sucked into the Russian economic crisis - was kept secret until the last minute.

(5) As the rouble plumbed new lows against the dollar yesterday, the central bank declared the trade null and void and said it would no longer spend its dwindling reserves to support the currency.

(6) The first big banking collapse since rouble devaluation was announced yesterday when Bank Imperial, the 13th biggest bank, had its licence withdrawn.

(7) Deutschmark trade yesterday suggested the rouble would have fallen to almost 14 against the dollar, a loss of more than 100 per cent in 10 days and a clear signpost on the road to hyperinflation.

(8) Trading in the Ukrainian currency, the hrynva, was halted on the Kiev exchange yesterday as bankers scrambled for dollars. (9) In neo-Soviet Belarus the local currency has fallen about 600 per cent in recent months.

(10) In Russia price rises accelerated yesterday, many banks and exchange booths were closed and depositors queued at branches still open to withdraw cash - usually without success.

The Guardian, p. 2 - 27 August 1998 Home News

Nine times yesterday…

(11) One Russian advertising executive said: "I've lost a huge contract. (12) No one is doing business. (13) How can they? What price should they trade at? What currency should they use? You can't use the dollar because it's officially illegal. (14) And the rouble?"

(15) A Moscow-based British economist, Al Breach, said Russia could not now rule out general default on its foreign debts. (16) "If you default on one set of debts, why not default on all of them?" he said.

(17) Calls for Mr Yeltsin to quit grew more insistent yesterday. (18) Gennady Seleznyov, chairman of the lower house of parliament, the state Duma, said deputies had drafted a law guaranteeing any retiring president 10-year membership of the upper house. (19) This would make him immune from prosecution - though not his family or associates.

(20) Mr Chernomyrdin's options were narrowing to radical alternatives yesterday - essentially whether to govern with or without parliament.

(21) The Communist-led alliance, without whose support Mr Chernomyrdin cannot legally be confirmed as head of government, firmed up its demands: Mr Yeltsin's resignation; a sharp change of economic course involving tariff barriers, closer economic integration with ex-Soviet republics and increased rouble investment in industry; and constitutional changes to turn Russia into a parliamentary republic.

(22) Asked yesterday whether changing the constitution would not take up too much time, the Communist leader, Gennady Zyuganov, said: "An emergency situation demands emergency measures. (23) They can be taken in three days by the houses of parliament."

(24) He said his alliance would support only a government that "clearly and definitely renounced so-called monetarist reforms".

(25) If Mr Yeltsin were to step down or give his backing, and the security forces were behind him, Mr Chernomyrdin could dissolve the Duma and impose draconian measures to restore order. (26) The arrival in Moscow yesterday of the former general Alexander Lebed, who governs Krasnoyarsk in Siberia, spawned rumours of a Chernomyrdin-Lebed junta in the making.

(27) Russia in turmoil, page 13

The Guardian, p. 2 - 27 August 1998 Home News

Text frequency profiles
(# of texts and % of texts, by occs. per text)

             1                 2                 3                 4+
YESTERDAY    35427 (60.41%)    14033 (23.93%)    5609 (9.56%)      3575 (6.10%)
BRITISH      16835 (58.42%)    5952 (20.66%)     2627 (9.12%)      3400 (11.80%)
AFTER        31057 (52.01%)    15283 (25.59%)    7236 (12.12%)     6135 (10.27%)
LAST         30171 (52.09%)    14546 (25.11%)    7063 (12.19%)     6137 (10.60%)
POLICE       9862 (42.44%)     4376 (18.83%)     2647 (11.39%)     6353 (27.34%)
YEARS        24668 (57.68%)    10088 (23.59%)    4384 (10.25%)     3624 (8.47%)

Text frequency profiles
(# of texts and % of texts, by occs. per text)

             1                 2                 3                 4+
THINGS       7468 (81.57%)     1260 (13.76%)     299 (3.27%)       127 (1.39%)
ACCORDING    12101 (82.65%)    1988 (13.58%)     392 (2.68%)       160 (1.09%)
AGO          15276 (80.13%)    2919 (15.31%)     639 (3.35%)       228 (1.20%)
BAD          5478 (85.14%)     754 (11.72%)      142 (2.21%)       59 (0.92%)
CONSEQUENCE  524 (96.50%)      17 (3.13%)        2 (0.37%)         0 (0.00%)
MAN          12489 (63.51%)    4241 (21.57%)     1575 (8.01%)      1359 (6.91%)

Text frequency profiles
(# of texts and % of texts, by occs. per text)

             1                 2                 3                 4+
OF           7620 (7.04%)      5353 (4.95%)      4054 (3.75%)      91194 (84.27%)
INTO         27401 (58.17%)    11350 (24.10%)    4704 (9.99%)      3646 (7.74%)
OVER         28059 (57.99%)    12079 (24.97%)    4955 (10.24%)     3288 (6.80%)
IN           7859 (6.90%)      9841 (8.64%)      8875 (7.79%)      87320 (76.67%)
AROUND       15110 (75.29%)    3524 (17.56%)     981 (4.89%)       452 (2.25%)

Initial findings
• N.B. TFP at this stage only account for simple repetition
• Some words have TFP showing strong priming for non-cohesiveness

CONSEQUENCE   1: 524 (96.50%)   2: 17 (3.13%)   3: 2 (0.37%)   4+: 0 (0.00%)

Initial findings
• Some words have TFP indicating no tendency re. non-cohesiveness and a priming for links (2 repetitions)

YESTERDAY   1: 35427 (60.41%)   2: 14033 (23.93%)   3: 5609 (9.56%)   4+: 3575 (6.10%)
MAN         1: 12489 (63.51%)   2: 4241 (21.57%)    3: 1575 (8.01%)   4+: 1359 (6.91%)

Initial findings
• Some words have TFP showing a priming for cohesive chains (of simple repetition), i.e. 3+ occs. in a text

BRITISH   1: 16835 (58.42%)   2: 5952 (20.66%)   3: 2627 (9.12%)    4+: 3400 (11.80%)
POLICE    1: 9862 (42.44%)    2: 4376 (18.83%)   3: 2647 (11.39%)   4+: 6353 (27.34%)
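
The three tendencies in these findings (single occurrence, links of 2 occurrences, chains of 3+ occurrences) can be compared by collapsing each profile into three shares. The sketch below uses text counts copied from the text frequency profiles above; the function name `repetition_shares` and the labels `once`, `link` and `chain` are illustrative, and no cut-off for what counts as a "strong" priming is implied.

```python
# (1, 2, 3, 4+) text counts copied from the text frequency profiles above.
tfp_counts = {
    "CONSEQUENCE": (524, 17, 2, 0),
    "YESTERDAY": (35_427, 14_033, 5_609, 3_575),
    "BRITISH": (16_835, 5_952, 2_627, 3_400),
    "POLICE": (9_862, 4_376, 2_647, 6_353),
}

def repetition_shares(counts):
    """Percentage of texts with the word once, twice (a link), and 3+ times (a chain)."""
    once, twice, three, four_plus = counts
    total = sum(counts)
    return {
        "once": round(100 * once / total, 2),
        "link": round(100 * twice / total, 2),
        "chain": round(100 * (three + four_plus) / total, 2),
    }

shares = {word: repetition_shares(c) for word, c in tfp_counts.items()}
```

On these figures the 3+ share runs from under 1% of texts for CONSEQUENCE to nearly 39% for POLICE.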

Summary
• working with specialized corpora of whole texts
  • offers significant opportunities for the development of new corpus methods and measures
    • positional frequency
    • text frequency profile
  • allows for testing of theoretical claims, such as the textual components of lexical priming

THANK YOU!