Corpus analysis (2) Corpus Linguistics Richard Xiao lancsxiaoz@googlemail.com

Upload: kevin-callahan

Post on 28-Mar-2015


Page 1: Corpus analysis (2) Corpus Linguistics Richard Xiao lancsxiaoz@googlemail.com

Corpus analysis (2)

Corpus Linguistics

Richard Xiao

lancsxiaoz@googlemail.com

Page 2

Outline of the session

• Lecture
  – Keyword
  – Reference corpus
  – Key keyword

• Practical
  – WST keyword
  – AntConc keyword
  – Wmatrix keyword / key concept
  – Extra: keyword analysis with CQPweb

Page 3

What is a keyword?

• Keywords are those words whose frequency is exceptionally high (positive keywords) or low (negative keywords) in comparison with a reference corpus
  – Keywords usually refer to positive keywords
  – But negative keywords are equally interesting (see Xiao and McEnery 2005)
    • They appear at the very end of your listing, in a different colour in WordSmith
    • They are omitted automatically from a keywords database for key keyword analysis and a keyword plot
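One way to make "exceptionally high or low" concrete is the log-likelihood statistic, which keyword tools commonly offer. The slides do not name the statistic at this point, so treat the formula below as an assumption. The symbols are introduced here for illustration only: a and b are the frequencies of a word in the target and reference corpus, and c and d are the sizes (in tokens) of the two corpora.

```latex
\[
E_1 = \frac{c\,(a+b)}{c+d}, \qquad E_2 = \frac{d\,(a+b)}{c+d}
\]
\[
LL = 2\left( a \ln\frac{a}{E_1} + b \ln\frac{b}{E_2} \right)
\]
```

A term with a zero frequency contributes nothing to the sum. Under this sketch, a word with a high LL is a positive keyword when a/c > b/d and a negative keyword when a/c < b/d.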

Page 4

Why keyword analysis?

• Indicating the ‘aboutness’ (Scott 1999) of a particular text or corpus
  – Content analysis, discourse analysis
• Also revealing the salient features which are functionally related to a particular genre (Xiao and McEnery 2005)
  – Genre analysis, stylistic analysis

Page 5

How to do keyword analysis

• Make a wordlist of the target corpus
• Locate or make a wordlist of a reference corpus
  – Scott (2005) “In search of a bad reference corpus”
    • http://www.methodsnetwork.ac.uk/redist/pdf/es1_05scott.pdf
  – The reference corpus is usually larger than the target corpus
  – The appropriateness of a reference corpus depends on your research questions!
• Compare the frequency of each item in the two wordlists to extract keywords (the software does this automatically)
• Analyse and interpret the keywords (this part is up to you!)
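The comparison step can be sketched in a few lines of Python. This is only an illustration of the idea, not what WordSmith, AntConc or Wmatrix run internally: it assumes log-likelihood as the keyness statistic, and the function names and toy counts are invented for the example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a target corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(target, reference):
    """Return (word, signed keyness) pairs, strongest positive keywords first.

    A positive score marks a word that is relatively more frequent in the
    target corpus; a negative score marks a negative keyword.
    """
    c = sum(target.values())
    d = sum(reference.values())
    scored = []
    for word in set(target) | set(reference):
        a, b = target[word], reference[word]  # Counter gives 0 if missing
        ll = log_likelihood(a, b, c, d)
        sign = 1 if a / c >= b / d else -1
        scored.append((word, sign * ll))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy wordlists (invented counts, for illustration only)
target = Counter({"the": 50, "britain": 10, "of": 20})
reference = Counter({"the": 600, "britain": 5, "of": 350, "which": 45})
for word, score in keywords(target, reference):
    print(word, round(score, 2))
```

Sorting by the signed score puts positive keywords at the top and negative keywords at the bottom of the list, which mirrors the layout described for WordSmith.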

Page 6

Keywords in the party speeches

• Target corpus: just one text
  – David Cameron's speech at the Conservative conference (10 October 2012, Manchester)
    • http://www.bbc.co.uk/news/uk-politics-15189614
    • Local copy available (David_speech, Unicode text): download and unzip the file into a folder: www.fass.lancs.ac.uk/projects/corpus/data/workshop3texts.zip
• Reference corpus
  – The 100-million-word BNC: download and unzip (local copy available) www.lexically.net/downloads/version4/BNC_World.zip
• Tool
  – WST Keyword

Page 7

Wordlist of David’s speech

Page 8

Creating keyword list

Page 9

Keyword extraction in progress

Warning: It can take time if you have loaded two large wordlists

Page 10

Keywords in David’s speech

Negative keyword

What do these keywords tell us?

Page 11

Keyword: Plot view

Page 12

What company do keywords keep?

Page 13

Why “marriage”?

Page 14

Key clusters

Similar to word clusters, but only keywords are used.

Page 15

Key keywords

• A key keyword is one which is "key" in more than one of a number of related texts
  – The more texts it is "key" in, the more "key key" it is
  – This can avoid extracting keywords which are unusually frequent in only a small number of files
• Key keyword lists can be created automatically and are as simple to extract as keywords
• N.B. Negative keywords are omitted automatically from a key keyword list
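The idea of a key keyword can be illustrated with a short Python sketch. It assumes you already have the set of positive keywords for each text; the function name, the min_texts parameter and the file names are hypothetical and do not correspond to any WordSmith setting.

```python
from collections import Counter


def key_keywords(keywords_per_text, min_texts=2):
    """Count, for each word, the number of texts in which it is a keyword.

    keywords_per_text maps a text name to the set of its positive keywords.
    Words that are key in fewer than min_texts texts are dropped.
    """
    counts = Counter()
    for words in keywords_per_text.values():
        counts.update(set(words))  # each text counts at most once per word
    return [(w, n) for w, n in counts.most_common() if n >= min_texts]


# Hypothetical keyword sets for three related texts
per_text = {
    "speech_a.txt": {"britain", "economy", "marriage"},
    "speech_b.txt": {"britain", "economy", "nhs"},
    "speech_c.txt": {"britain", "jobs"},
}
print(key_keywords(per_text))
```

Only positive keywords go into the sets, in line with the note that negative keywords are left out of a key keyword list.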

Page 16

Making a batch wordlist

Specify a folder where you can write
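Outside WordSmith, a batch of wordlists can be approximated with a short Python script. This is a rough sketch under several assumptions: the texts are UTF-8 encoded .txt files, tokens are runs of letters and apostrophes, and the output file naming is invented for the example.

```python
import re
from collections import Counter
from pathlib import Path


def make_wordlist(text):
    """Lower-case the text and count word tokens (letters and apostrophes)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def batch_wordlists(in_dir, out_dir):
    """Write one tab-separated wordlist per .txt file in in_dir to out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # must be a writable folder
    written = []
    for txt in sorted(Path(in_dir).glob("*.txt")):
        counts = make_wordlist(txt.read_text(encoding="utf-8"))
        target = out_path / (txt.stem + ".wordlist.tsv")
        lines = [f"{word}\t{freq}" for word, freq in counts.most_common()]
        target.write_text("\n".join(lines), encoding="utf-8")
        written.append(target)
    return written
```

A call such as batch_wordlists("speeches", "wordlists") (both folder names are placeholders) would produce one frequency-sorted list per text. Files saved in another encoding would need a different encoding argument.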

Page 17

Batch making keyword lists

Page 18

Batch making keyword lists

Specify a folder where you can write

Page 19

Making a KW database

Page 20

Key keywords

key coverage of the corpus

An "associate" is a keyword that appears in the same text as a key keyword
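The notion of an associate can be sketched in Python as follows. The sketch assumes a mapping from each text to its set of keywords; the function name and the sample data are hypothetical, and the way WordSmith ranks associates may differ.

```python
from collections import Counter


def associates(key_keyword, keywords_per_text):
    """Count the other keywords found in the texts where key_keyword is key."""
    counts = Counter()
    for words in keywords_per_text.values():
        if key_keyword in words:
            counts.update(w for w in set(words) if w != key_keyword)
    return counts


# Hypothetical keyword sets for three related texts
sample = {
    "speech_a.txt": {"britain", "economy", "marriage"},
    "speech_b.txt": {"britain", "economy", "nhs"},
    "speech_c.txt": {"britain", "jobs"},
}
print(associates("economy", sample).most_common())
```

Here "britain" is the strongest associate of "economy" because both words are key in the same two texts.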

Page 21

Keyword in AntConc

target corpus

reference corpus

Page 22

Keyword in AntConc

Keywords in David's speech (in relation to Ed's speech)

Page 23

Wmatrix: Keywords and key concepts

• POS and semantic tagging
• Keyword / key concept analysis in Cameron’s speech in comparison with Miliband’s speech
• Copy and paste the speeches into two separate text files
  – http://www.bbc.co.uk/news/uk-politics-15189614
  – http://www.labour.org.uk/ed-milibands-speech-to-labour-party-conference
• Save the two texts as David_speech.txt and Ed_speech.txt (local copies: www.fass.lancs.ac.uk/projects/corpus/data/workshop3texts.zip)

Page 24

Wmatrix: Keywords and key concepts

• Log in using your zhejiangxx account
  – http://ucrel.lancs.ac.uk/wmatrix3.html

Page 25

Tagging Wizard

Page 26

Tagging in progress

Page 27

Tagging result

Page 28

Labour frequency list

Page 29

KWIC concordance

Page 30

“My folders”

Upload and tag Ed’s speech

…and click on “My folders”

Warning: Your folder view may look different!

Page 31

Open the David_speech folder and select Ed_speech in the “Keyword compared to” dropdown box

Page 32

Keyword list to download!

Page 33

Keyword cloud – even more interesting!

Page 34

David’s key concepts (“Key concepts compared to”)

Page 35

Keyword analysis in online corpora

• Using Lancaster’s CQPweb to compare British English (LOB + FLOB) and American English (Brown + Frown)
• Log in to CQPweb
  – http://cqpweb.lancs.ac.uk
• Similar analysis can be done at BFSU’s CQPweb corpus hub (different corpora)
  – http://124.193.83.252/cqp/
  – Account: ID and password are both “test”

Page 36

Creating subcorpora

Page 37

Creating subcorpus BrE

Page 38

Creating subcorpus AmE

Page 39

Making wordlists

Page 40

Wordlist available now

Page 41

Computing keywords

You can adjust the statistical measure, cut-off point, and minimum frequency according to your research purposes.
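The effect of the cut-off point and minimum frequency can be illustrated with a small Python sketch. It assumes log-likelihood as the statistical measure and uses the standard chi-square critical values for one degree of freedom as cut-offs. The function name, parameter names and sample scores are invented for the example and are not CQPweb options.

```python
# Chi-square critical values (1 degree of freedom), often used as
# log-likelihood cut-off points
CUTOFFS = {0.05: 3.84, 0.01: 6.63, 0.001: 10.83, 0.0001: 15.13}


def filter_keywords(scored, p=0.01, min_freq=3):
    """Keep items whose keyness reaches the cut-off for p and whose
    target-corpus frequency is at least min_freq.

    scored is a list of (word, target_frequency, log_likelihood) tuples.
    """
    threshold = CUTOFFS[p]
    return [
        (word, freq, ll)
        for word, freq, ll in scored
        if ll >= threshold and freq >= min_freq
    ]


# Invented scores, for illustration only
scored = [
    ("britain", 10, 42.1),
    ("the", 50, 0.2),
    ("marriage", 2, 9.5),
    ("jobs", 5, 7.0),
]
print(filter_keywords(scored, p=0.01, min_freq=3))
print(filter_keywords(scored, p=0.001, min_freq=3))
```

A stricter p value shortens the list, and a higher minimum frequency removes rare words whose keyness rests on very few occurrences.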

Page 42

Keywords in BrE and AmE