Programming for Linguists: An Introduction to Python (08/12/2011)
TRANSCRIPT
Programming for Linguists
An Introduction to Python
08/12/2011
Ex 1) Write a script that reads 5 words typed in by a user and tells the user which word is the shortest and which is the longest.
Ex. 1)

```python
def word_length():
    count = 5
    list1 = []
    while count > 0:
        s = raw_input("Please enter a word ")
        list1.append(s)
        count = count - 1
    longest = list1[0]
    shortest = list1[0]
    for word in list1:
        if len(word) > len(longest):
            longest = word
        elif len(word) < len(shortest):
            shortest = word
    print shortest, "is the shortest word."
    print longest, "is the longest word."
```
Ex 2) Write a function that takes a sentence as an argument and calculates the average word length of the words in that sentence
Ex 2)

```python
def awl(sent):
    wlist = []
    sentence = sent.split()
    for word in sentence:
        wlist.append(len(word))
    mean = sum(wlist) / float(len(wlist))
    print "The average word length is", mean

awl("this is a test sentence")
```
Ex 3) Take a short text of about 5 sentences. Write a script that splits the text into sentences (tip: use the punctuation as boundaries) and calculates the average sentence length, the average word length, and the standard deviation for both values.
Ex 3)

```python
import re

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def SD(numbers):
    devs = []
    for item in numbers:
        std = (item - mean(numbers)) ** 2
        devs.append(std)
    return (sum(devs) / float(len(devs))) ** 0.5
```
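As a quick numeric check (not from the slides; the numbers are a made-up example), the two functions give the expected population mean and standard deviation. The snippet runs under Python 2 and 3:

```python
def mean(numbers):
    return sum(numbers) / float(len(numbers))

def SD(numbers):
    # population standard deviation: square root of the mean squared deviation
    devs = []
    for item in numbers:
        devs.append((item - mean(numbers)) ** 2)
    return (sum(devs) / float(len(devs))) ** 0.5

mean([2, 4, 4, 4, 5, 5, 7, 9])  # 5.0
SD([2, 4, 4, 4, 5, 5, 7, 9])    # 2.0
```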
```python
def statistics(sent):
    asl = []
    awl = []
    sentences = re.split(r'[.!?]', sent)
    for sentence in sentences[:-1]:
        sentence = re.sub(r'\W+', ' ', sentence)
        tokens = sentence.split()
        asl.append(len(tokens))
        for token in tokens:
            awl.append(len(token))
    print mean(asl), SD(asl)
    print mean(awl), SD(awl)

statistics("sentences")  # replace "sentences" with your own short text
```
Dictionaries
Like a list, but more general
In a list the index has to be an integer, e.g. words[4]
In a dictionary the index can be almost any type
A dictionary is like a mapping between 2 sets: keys and values
function: dict( )
To create an empty list: list1 = [ ]
To create an empty dictionary: dictionary = { }
For example, a dictionary containing English and Spanish words:

```python
eng2sp = {}
eng2sp['one'] = 'uno'
print eng2sp
{'one': 'uno'}
```
In this case both the keys and the values are of the string type.
As with lists, you can create dictionaries yourself, e.g.

```python
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print eng2sp
```

Note: in general, the order of items in a dictionary is unpredictable.
You can use the keys to look up the corresponding values, e.g.

```python
print eng2sp['two']
```

The key 'two' always maps to the value 'dos', so the order of the items does not matter.
If the key is not in the dictionary you get an error message, e.g.

```python
print eng2sp['ten']
KeyError: 'ten'
```
The len( ) function returns the number of key-value pairs:

```python
len(eng2sp)
```

The in operator tells you whether something appears as a key in the dictionary:

```python
'one' in eng2sp
True
```

BUT

```python
'uno' in eng2sp
False
```
To see whether something appears as a value in a dictionary, you can use the values( ) function, which returns the values as a list, and then use the in operator, e.g.

```python
'uno' in eng2sp.values()
True
```
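Putting len( ) and the two membership tests together on a small dictionary (a self-contained sketch, not from the slides):

```python
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}

n_pairs = len(eng2sp)                 # 3 key-value pairs
has_key = 'one' in eng2sp             # True: 'one' is a key
has_val = 'uno' in eng2sp             # False: 'uno' is only a value
in_values = 'uno' in eng2sp.values()  # True: checks the values instead
```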
A Dictionary as a Set of Counters
Suppose you want to count the number of times each letter occurs in a string. You could:
create 26 variables, traverse the string and, for each letter, add 1 to the corresponding counter
create a dictionary with letters as keys and counters as the corresponding values
```python
def frequencies(sent):
    freq_dict = {}
    for let in sent:
        if let not in freq_dict:
            freq_dict[let] = 1
        else:
            freq_dict[let] += 1
    return freq_dict

frequencies("abracadabra")
```
The first line of the function creates an empty dictionary
The for loop traverses the string
Each time through the loop, if the letter is not in the dictionary, we create a new key with the initial value 1
If the letter is already in the dictionary we add 1 to its corresponding value
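As a self-contained check of this logic, here is the function again together with its result for the call above:

```python
def frequencies(sent):
    freq_dict = {}
    for let in sent:
        if let not in freq_dict:
            freq_dict[let] = 1   # first time we see this letter
        else:
            freq_dict[let] += 1  # seen before: increment its counter
    return freq_dict

frequencies("abracadabra")  # {'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}
```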
Write a function that counts the word frequencies in a sentence instead of the letter frequencies using a dictionary
```python
def words(sent):
    word_freq = {}
    wordlist = sent.split()
    for word in wordlist:
        if word not in word_freq:
            word_freq[word] = 1
        else:
            word_freq[word] += 1
    return word_freq

words("this is is a a test sentence")
```
Reverse Lookup
Given a dictionary word_freq and a key "is", finding the corresponding value: word_freq["is"]
This operation is called a lookup.
What if you know the value and want to look up the corresponding key?
Previous example:

```python
def words(sent):
    word_freq = {}
    wordlist = sent.split()
    for word in wordlist:
        if word not in word_freq:
            word_freq[word] = 1
        else:
            word_freq[word] += 1
    return word_freq

w_fr = words("this is is a a test sentence")
```
Write a function that takes as arguments the variable w_fr and a number nr (the number of times a word occurs in the sentence) and returns a list of the words that occur nr times, or prints "There are no words in the sentence that occur nr times."
```python
def reverse_lookup(w_fr, nr):
    list1 = []
    for word in w_fr:
        if w_fr[word] == nr:
            list1.append(word)
    if len(list1) > 0:
        return list1
    else:
        print "There are no words in the sentence that occur", nr, "times."
```
Sorting a Dictionary According to its Values
First you need to import itemgetter: from operator import itemgetter
To go over each item in a dictionary you can use .iteritems( )
To sort the dictionary according to the values, you need to use key = itemgetter(1)
To sort it decreasingly: reverse = True
```python
from operator import itemgetter

def words(s):
    w_fr = {}
    wordlist = s.split()
    for word in wordlist:
        if word not in w_fr:
            w_fr[word] = 1
        else:
            w_fr[word] += 1
    h = sorted(w_fr.iteritems(), key=itemgetter(1), reverse=True)
    return h
```
Inverting Dictionaries
It can be useful to invert a dictionary: keys and values switch places.

```python
def invert_dict(d):
    inv = {}
    for key in d:
        value = d[key]
        if value not in inv:
            inv[value] = [key]
        else:
            inv[value].append(key)
    return inv
```
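A usage sketch with a made-up frequency dictionary (the function is repeated so the snippet is self-contained); inverting groups the words by their count:

```python
def invert_dict(d):
    inv = {}
    for key in d:
        value = d[key]
        if value not in inv:
            inv[value] = [key]
        else:
            inv[value].append(key)
    return inv

# words that occur the same number of times end up in the same list
inv = invert_dict({"is": 2, "a": 2, "test": 1})
```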
But: lists can be values, but never keys!
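A minimal check of that restriction (not from the slides): dictionary keys must be hashable, and lists are mutable, hence unhashable, while tuples work fine:

```python
d = {}
d[('a', 'b')] = 1          # a tuple is immutable, so it can be a key

try:
    d[['a', 'b']] = 1      # a list raises TypeError: unhashable type
    list_key_ok = True
except TypeError:
    list_key_ok = False    # this branch is taken
```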
Getting Started with NLTK
In IDLE:

```python
import nltk
nltk.download()
```
Searching Texts
Start your script by importing all texts in NLTK:

```python
from nltk.book import *
```

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G. K. Chesterton 1908
Any time you want to find out about these texts, just enter their names at the Python prompt:

```python
>>> text1
<Text: Moby Dick by Herman Melville 1851>
```

A concordance view shows every occurrence of a given word, together with some context, e.g. "monstrous" in Moby Dick:

```python
text1.concordance("monstrous")
```
Try looking up the context of "lol" in the chat corpus (text5).
If you have a corpus that contains texts that are spread over time, you can look up how some words are used differently over time, e.g. the Inaugural Address Corpus (dates back to 1789): words like "nation", "terror", "God", …
You can also examine what other words appear in a similar context, e.g.

```python
text1.similar("monstrous")
```

common_contexts( ) allows you to examine the contexts that are shared by two or more words, e.g.

```python
text1.common_contexts(["very", "monstrous"])
```
You can also determine the location of a word in the text
This positional information can be displayed using a dispersion plot
Each stripe represents an instance of a word, and each row represents the entire text, e.g.
```python
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
```
Counting Tokens
To count the number of tokens (words + punctuation marks), just use the len( ) function, e.g. len(text5)
To count the number of unique tokens, you have to make a set, e.g. set(text5)
If you want them sorted alphabetically, try this:
sorted(set(text5))
Note: in Python all capitalized words precede lowercase words (you can use .lower( ) first to avoid this)
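A small illustration of that ordering, on a made-up word list (runs under Python 2 and 3):

```python
words = set("The the cat Dog".split())

caps_first = sorted(words)                       # ['Dog', 'The', 'cat', 'the']
lowered = sorted(set(w.lower() for w in words))  # ['cat', 'dog', 'the']
```

Note that lowercasing also merges "The" and "the" into a single type.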
Now you can calculate the lexical diversity of a text, e.g. the chat corpus (text5):
45,010 tokens; 6,066 unique tokens or types
The lexical diversity = nr of types / nr of tokens
Use the Python functions to calculate the lexical diversity of text 5
len(set(text5))/float(len(text5))
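The same type/token calculation can be written as a function and applied to any list of tokens without NLTK (the token list here is a made-up example):

```python
def lexical_diversity(tokens):
    # number of unique tokens (types) divided by the total number of tokens
    return len(set(tokens)) / float(len(tokens))

ratio = lexical_diversity(["the", "cat", "sat", "on", "the", "mat"])  # 5 types / 6 tokens
```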
Frequency Distributions
To find the n most frequent tokens, use FreqDist( ), e.g.

```python
fdist = FreqDist(text1)
fdist["have"]
760
all_tokens = fdist.keys()
all_tokens[:50]
```

The function .keys( ) combined with FreqDist( ) also gives you a list of all the unique tokens in the text.
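FreqDist behaves much like Python's collections.Counter; a sketch without NLTK (the sentence is a made-up example):

```python
from collections import Counter

tokens = "to be or not to be".split()
fdist = Counter(tokens)

top_two = fdist.most_common(2)  # the two most frequent tokens with their counts
hapaxes = [tok for tok, n in fdist.items() if n == 1]  # tokens occurring once
```

Here fdist["to"] is 2, and the hapaxes are "or" and "not".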
Frequency distributions can be informative, BUT the most frequent words usually are function words (the, of, and, …)
What proportion of the text is taken up with such words? A cumulative frequency plot shows this:

```python
fdist.plot(50, cumulative=True)
```
If frequent tokens do not give enough information, what about infrequent tokens?
Hapaxes = tokens which occur only once: fdist.hapaxes( )
Without their context, you do not get much information either.
Fine-grained Selection of Tokens
Extract tokens of a certain minimum length:

```python
tokens = set(text1)
long_tokens = []
for token in tokens:
    if len(token) >= 15:
        long_tokens.append(token)
```

OR

```python
long_tokens = list(token for token in tokens if len(token) >= 15)
```
BUT: very long words are often hapaxes
You can also extract frequently occurring long words of a certain length:

```python
words = set(text1)
fdist = FreqDist(text1)
freq_long_words = list(word for word in words if len(word) >= 7 and fdist[word] >= 7)
```
Collocations and Bigrams
A collocation is a sequence of words that occur together unusually often, e.g. "red wine" is a collocation, "yellow wine" is not.
Collocations are essentially just frequent bigrams (word pairs), but you can find bigrams that occur more often than is to be expected based on the frequency of the individual words:

```python
text8.collocations()
```
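The underlying bigrams are easy to compute without NLTK; a minimal sketch (the sentence is a made-up example):

```python
def bigrams(tokens):
    # pair each token with its successor
    return list(zip(tokens, tokens[1:]))

pairs = bigrams("the more the merrier".split())
# [('the', 'more'), ('more', 'the'), ('the', 'merrier')]
```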
Some Functions for NLTK's Frequency Distributions

```python
fdist = FreqDist(samples)
```

fdist["word"] frequency (count) of "word"
fdist.freq("word") relative frequency of "word"
fdist.N( ) total number of samples
fdist.keys( ) the samples sorted in order of decreasing frequency
for sample in fdist: iterates over the samples in order of decreasing frequency
fdist.max( ) sample with the greatest count
fdist.plot( ) graphical plot of the frequency distribution
fdist.plot(cumulative=True) cumulative plot of the frequency distribution
fdist1 < fdist2 tests if the samples in fdist1 occur less frequently than in fdist2
Accessing Corpora
NLTK also contains entire corpora, e.g.:
Brown Corpus
NPS Chat
Gutenberg Corpus
…
A complete list can be found on http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
Each of these corpora contains dozens of individual texts
To see which files are e.g. in the Gutenberg corpus in NLTK: nltk.corpus.gutenberg.fileids( )
Do not forget the dot notation nltk.corpus. This tells Python the location of the corpus
You can use the dot notation to work with a corpus from NLTK, or you can import a corpus at the beginning of your script:

```python
from nltk.corpus import gutenberg
```

After that you just have to use the name of the corpus and the dot notation before a function:

```python
gutenberg.fileids()
```
If you want to examine a particular text, e.g. Shakespeare's Hamlet, you can use the .words( ) function:

```python
hamlet = gutenberg.words("shakespeare-hamlet.txt")
```

Note that "shakespeare-hamlet.txt" is the file name that is to be found using the previous .fileids( ) function.
You can use some of the previously mentioned functions (corpus methods) on this text, e.g.

```python
fdist_hamlet = FreqDist(hamlet)
```
Some Corpus Methods in NLTK
brown.raw( ) raw data from the corpus file(s)
brown.categories( ) fileids( ) grouped per predefined categories
brown.words( ) a list of words and punctuation tokens
brown.sents( ) words( ) grouped into sentences
brown.tagged_words( ) a list of (word,tag) pairs
brown.tagged_sents( ) tagged_words( ) grouped into sentences
treebank.parsed_sents( ) a list of parse trees
```python
def statistics(corpus):
    for fileid in corpus.fileids():
        nr_chars = len(corpus.raw(fileid))
        nr_words = len(corpus.words(fileid))
        nr_sents = len(corpus.sents(fileid))
        nr_vocab = len(set([word.lower() for word in corpus.words(fileid)]))
        print fileid, "average word length:", nr_chars / nr_words, "average sentence length:", nr_words / nr_sents, "lexical diversity:", nr_words / nr_vocab
```
Some corpora contain several subcategories, e.g. the Brown Corpus contains “news”, “religion”,…
You can optionally specify these particular categories or files from a corpus, e.g.:

```python
from nltk.corpus import brown
brown.categories()
brown.words(categories='news')
brown.words(fileids=['cg22'])
brown.sents(categories=['news', 'editorial', 'reviews'])
```
Some linguistic research: comparing genres in the Brown corpus in their usage of modal verbs
```python
import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))

# Do not press enter halfway through: type the whole
# expression, including both for clauses, as one statement!
```
```python
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modal_verbs = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modal_verbs)
```
```
                 can could  may might must will
news              93    86   66    38   50  389
religion          82    59   78    12   54   71
hobbies          268    58  131    22   83  264
science_fiction   16    49    4    12    8   16
romance           74   193   11    51   45   43
humor             16    30    8     8    9   13
```

A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition".
The condition is usually the category of the text (news, religion, …).
Loading Your Own Text or Corpus
Make sure that the texts/files of your corpus are in plaintext format (convert them, do not just change the file extensions from e.g. .docx to .txt)
Make a folder with the name of your corpus which contains all the text files
A text in Python: open your file

```python
f = open("/Users/claudia/text1.txt", "r")
```

read in the text:

```python
text1 = f.read()       # reads the text entirely
text1 = f.readlines()  # reads in all lines that end with \n and makes a list
text1 = f.readline()   # reads in one line
```
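A self-contained illustration of the reading modes, using a throwaway file (the path and contents are made up):

```python
import os
import tempfile

# create a small two-line file to read back
path = os.path.join(tempfile.mkdtemp(), "text1.txt")
f = open(path, "w")
f.write("first line\nsecond line\n")
f.close()

f = open(path, "r")
whole = f.read()        # the entire text as one string
f.close()

f = open(path, "r")
lines = f.readlines()   # a list of lines, each ending in \n
f.close()
```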
Loading your own corpus in NLTK with no subcategories:

```python
import nltk
from nltk.corpus import PlaintextCorpusReader

loc = "/Users/claudia/my_corpus"     # Mac
loc = r"C:\Users\claudia\my_corpus"  # Windows (raw string, so the backslashes are kept)
my_corpus = PlaintextCorpusReader(loc, ".*")
```
Now you can use the corpus methods of NLTK on your own corpus, e.g. my_corpus.words( ), my_corpus.sents( ), …
Loading your own corpus in NLTK with subcategories:

```python
import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

loc = "/Users/claudia/my_corpus"     # Mac
loc = r"C:\Users\claudia\my_corpus"  # Windows 7
my_corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt',
                                             cat_pattern=r'(cat1|cat2)/.*')
```
If your corpus is loaded correctly, you should get a list of all files in your corpus by using:

```python
my_corpus.fileids()
```

For a corpus with subcategories, you can access the files in the subcategories by taking the name of the subcategory as an argument:

```python
my_corpus.fileids(categories="cat1")
my_corpus.words(categories="cat2")
```
For Next Week
Ex. 1) Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of "men", "women", and "people" in each document. What has happened to the usage of these words over time?
Ex 2) According to Strunk and White's Elements of Style, the word "however", used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. Use the concordance tool to study actual usage of this word in 5 NLTK texts.
Ex 3) Create a corpus of your own of at least 10 files containing text fragments. You can take your own texts, texts from the internet, … Write a program that investigates the usage of modal verbs in this corpus using the frequency distribution tool, and plot the 10 most frequent words.
To download and install NLTK: http://www.nltk.org/download
Note: you need to have Python's NumPy and Matplotlib packages installed in order to produce the graphical plots
See http://www.nltk.org/ for installation instructions
Thank you