python tutorial

Introduction to Python
Chen Lin, [email protected]
COSI 134a, Volen 110
Office Hour: Thurs. 3-5



For More Information?

http://python.org/ - documentation, tutorials, beginners guide, core distribution, ...

Books include:
- Learning Python by Mark Lutz
- Python Essential Reference by David Beazley
- Python Cookbook, ed. by Martelli, Ravenscroft and Ascher (online at http://code.activestate.com/recipes/langs/python/)
- http://wiki.python.org/moin/PythonBooks

Python Videos

http://showmedo.com/videotutorials/python
- "5 Minute Overview (What Does Python Look Like?)"
- "Introducing the PyDev IDE for Eclipse"
- "Linear Algebra with Numpy"
- And many more

4 Major Versions of Python

"Python" or "CPython" is written in C
- Version 2.7 came out in mid-2010
- Version 3.1.2 came out in early 2010
"Jython" is written in Java for the JVM
"IronPython" is written in C# for the .NET environment

Development Environments

What IDE to use? http://stackoverflow.com/questions/81584

1. PyDev with Eclipse
2. Komodo
3. Emacs
4. Vim
5. TextMate
6. Gedit
7. Idle
8. PIDA (Linux) (VIM based)
9. NotePad++ (Windows)
10. BlueFish (Linux)

PyDev with Eclipse

Python Interactive Shell

% python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

You can type things directly into a running Python session

>>> 2+3*4
14
>>> name = "Andrew"
>>> name
'Andrew'
>>> print "Hello", name
Hello Andrew
>>>

Background
Data Types/Structure
Control flow
File I/O
Modules
Class
NLTK

List

A compound data type:
[0]
[2.3, 4.5]
[5, "Hello", "there", 9.8]
[]

Use len() to get the length of a list

>>> names = ["Ben", "Chen", "Yaqin"]
>>> len(names)
3

Use [ ] to index items in the list

>>> names[0]
'Ben'
>>> names[1]
'Chen'
>>> names[2]
'Yaqin'
>>> names[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> names[-1]
'Yaqin'
>>> names[-2]
'Chen'
>>> names[-3]
'Ben'

[0] is the first item. [1] is the second item, and so on.
Out-of-range values raise an exception.
Negative values go backwards from the last element.
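The indexing rules above can be tried outside the interactive shell. The following is a minimal sketch written for Python 3 (the slides use Python 2 syntax); the helper name safe_get is hypothetical and only illustrates catching the IndexError.

```python
# Sketch of list indexing; safe_get is a hypothetical helper, not from the slides.
names = ["Ben", "Chen", "Yaqin"]


def safe_get(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return default


first = names[0]                         # [0] is the first item
last = names[-1]                         # negative indices count from the end
missing = safe_get(names, 3, "nobody")   # index 3 is out of range for 3 items

print(first, last, missing)
```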

Strings share many features with lists

>>> smiles = "C(=N)(N)N.C(=O)(O)O"
>>> smiles[0]
'C'
>>> smiles[1]
'('
>>> smiles[-1]
'O'
>>> smiles[1:5]
'(=N)'
>>> smiles[10:-4]
'C(=O)'

Use "slice" notation to get a substring

String Methods: find, split

>>> smiles = "C(=N)(N)N.C(=O)(O)O"
>>> smiles.find("(O)")
15
>>> smiles.find(".")
9
>>> smiles.find(".", 10)
-1
>>> smiles.split(".")
['C(=N)(N)N', 'C(=O)(O)O']
>>>

Use "find" to find the start of a substring.
A second argument tells find where to start looking (here, position 10).
Find returns -1 if it couldn't find a match.
Split the string into parts with "." as the delimiter.
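As a rough illustration of how find and split relate, the sketch below (Python 3 syntax, same SMILES string as the slide) cuts the string at the first "." by hand and compares the result with split. The variable names left and right are made up for this example.

```python
smiles = "C(=N)(N)N.C(=O)(O)O"

# find returns the index of the first match, or -1 when there is none
dot = smiles.find(".")
if dot != -1:
    left, right = smiles[:dot], smiles[dot + 1:]
else:
    left, right = smiles, ""

# split does the same job in one call when there is exactly one delimiter
parts = smiles.split(".")

print(dot, left, right, parts)
```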

String operators: in, not in

if "Br" in "Brother":
    print "contains brother"

email_address = "clin"
if "@" not in email_address:
    email_address += "@brandeis.edu"

String Method: "strip", "rstrip", "lstrip" are ways to remove whitespace or selected characters

>>> line = " # This is a comment line \n"
>>> line.strip()
'# This is a comment line'
>>> line.rstrip()
' # This is a comment line'
>>> line.rstrip("\n")
' # This is a comment line '
>>>

More String methods

email.startswith("c"), email.endswith("u") -> True/False

>>> "%s@brandeis.edu" % "clin"
'clin@brandeis.edu'

>>> names = ["Ben", "Chen", "Yaqin"]
>>> ", ".join(names)
'Ben, Chen, Yaqin'

>>> "chen".upper()
'CHEN'

Unexpected things about strings

>>> s = "andrew"
>>> s[0] = "A"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> s = "A" + s[1:]
>>> s
'Andrew'

Strings are read-only
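The read-only behaviour can be checked in a script as well. This is a small sketch in Python 3 syntax; the flag name assigned is invented for the example.

```python
s = "andrew"

# Item assignment on a str raises TypeError, so the original is never modified
try:
    s[0] = "A"
    assigned = True
except TypeError:
    assigned = False

# The usual workaround: build a new string from pieces of the old one
capitalized = "A" + s[1:]

print(assigned, s, capitalized)
```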

"\" is for special characters

\n -> newline
\t -> tab
\\ -> backslash
...

But Windows uses backslash for directories!

filename = "M:\nickel_project\reactive.smi" # DANGER!

filename = "M:\\nickel_project\\reactive.smi" # Better!

filename = "M:/nickel_project/reactive.smi" # Usually works
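To see why the first form is dangerous, the sketch below (Python 3 syntax) inspects the three strings from the slide. The variable names bad, good and forward are chosen here for illustration only.

```python
bad = "M:\nickel_project\reactive.smi"        # \n and \r are read as escapes
good = "M:\\nickel_project\\reactive.smi"     # doubled backslashes are literal
forward = "M:/nickel_project/reactive.smi"    # forward slashes need no escaping

# The "bad" string silently contains a newline and a carriage return
print(repr(bad))
print(len(bad), len(good))

# Converting the forward-slash form gives the same text as the escaped form
same = forward.replace("/", "\\") == good
print(same)
```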

Lists are mutable - some useful methods

>>> ids = ["9pti", "2plv", "1crn"]
>>> ids.append("1alm")        # append an element
>>> ids
['9pti', '2plv', '1crn', '1alm']
>>> del ids[0]                # remove an element
>>> ids
['2plv', '1crn', '1alm']
>>> ids.sort()                # sort by default order
>>> ids
['1alm', '1crn', '2plv']
>>> ids.reverse()             # reverse the elements in a list
>>> ids
['2plv', '1crn', '1alm']
>>> ids.insert(0, "9pti")     # insert an element at some specified position (slower than .append())
>>> ids
['9pti', '2plv', '1crn', '1alm']

ids.extend(L) extends the list by appending all the items in the given list L; equivalent to ids[len(ids):] = L.
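The slide mentions extend without showing it in action. A minimal sketch in Python 3 syntax follows; the extra IDs in more are hypothetical values invented for the example, and alias is there only to show that these methods change the list in place.

```python
ids = ["9pti", "2plv", "1crn"]
alias = ids                 # a second name for the same list object

ids.append("1alm")          # add one element at the end
more = ["1abc", "2xyz"]     # hypothetical extra IDs, not from the slides
ids.extend(more)            # add every element of another list

del ids[0]                  # remove the first element
ids.sort()                  # sort in place, default (string) order

# Because lists are mutable, the alias sees every change
print(ids)
print(alias is ids)
```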

Tuples: sort of an immutable list

>>> yellow = (255, 255, 0) # r, g, b
>>> one = (1,)
>>> yellow[0]
255
>>> yellow[1:]
(255, 0)
>>> yellow[0] = 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

Very common in string interpolation:

>>> "%s lives in %s at latitude %.1f" % ("Andrew", "Sweden", 57.7056)
'Andrew lives in Sweden at latitude 57.7'

zipping lists together

>>> names
['ben', 'chen', 'yaqin']
>>> gender = [0, 0, 1]
>>> zip(names, gender)
[('ben', 0), ('chen', 0), ('yaqin', 1)]
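One caveat when trying this on a newer interpreter: in Python 3, zip returns a lazy iterator rather than a list, so it has to be wrapped in list() to get the output shown above. A short sketch, where the lookup dictionary is an addition for illustration:

```python
names = ["ben", "chen", "yaqin"]
gender = [0, 0, 1]

# list() materializes the pairs; Python 2's zip returned a list directly
pairs = list(zip(names, gender))

# A list of (key, value) pairs converts straight into a dictionary
lookup = dict(pairs)

print(pairs)
print(lookup["yaqin"])
```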

Dictionaries

Dictionaries are lookup tables. They map from a "key" to a "value".

symbol_to_name = {
    "H": "hydrogen",
    "He": "helium",
    "Li": "lithium",
    "C": "carbon",
    "O": "oxygen",
    "N": "nitrogen"
}

Duplicate keys are not allowed
Duplicate values are just fine

Keys can be any immutable (hashable) value
numbers, strings, tuples, frozenset,
not list, dictionary, set, ...

atomic_number_to_name = {
    1: "hydrogen",
    6: "carbon",
    7: "nitrogen",
    8: "oxygen",
}

nobel_prize_winners = {
    (1979, "physics"): ["Glashow", "Salam", "Weinberg"],
    (1964, "chemistry"): ["Hodgkin"],
    (1983, "medicine"): ["McClintock"],
}

A set is an unordered collection with no duplicate elements.
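The restriction on key types can be demonstrated directly. This is a sketch in Python 3 syntax; the groups dictionary and its value are invented for the example.

```python
winners = {(1979, "physics"): ["Glashow", "Salam", "Weinberg"]}

# A tuple works as a key; a list with the same contents does not
try:
    winners[[1979, "physics"]] = []
    list_key_ok = True
except TypeError:          # raised because lists are unhashable
    list_key_ok = False

# frozenset is the immutable counterpart of set, so it can be a key too
groups = {frozenset(["H", "He"]): "first two elements"}

print(list_key_ok)
print(groups[frozenset(["He", "H"])])   # element order does not matter in a set
```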

Dictionary

>>> symbol_to_name["C"]
'carbon'
>>> "O" in symbol_to_name, "U" in symbol_to_name
(True, False)
>>> "oxygen" in symbol_to_name
False
>>> symbol_to_name["P"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'P'
>>> symbol_to_name.get("P", "unknown")
'unknown'
>>> symbol_to_name.get("C", "unknown")
'carbon'

Use [] to get the value for a given key.
Use "in" to test if the key exists ("in" only checks the keys, not the values).
[] lookup failures raise an exception. Use ".get()" if you want to return a default value.
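The two lookup styles can be compared side by side. A small sketch in Python 3 syntax, using a shortened version of the slide's dictionary; the function name describe is hypothetical.

```python
symbol_to_name = {"H": "hydrogen", "C": "carbon", "O": "oxygen"}


def describe(symbol):
    """Look up a symbol, falling back to 'unknown' instead of raising."""
    return symbol_to_name.get(symbol, "unknown")


# Square-bracket lookup raises KeyError for a missing key
try:
    symbol_to_name["P"]
    raised = False
except KeyError:
    raised = True

print(describe("C"), describe("P"), raised)
print("oxygen" in symbol_to_name)   # "in" checks keys, not values
```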

Some useful dictionary methods

>>> symbol_to_name.keys()
['C', 'H', 'O', 'N', 'Li', 'He']
>>> symbol_to_name.values()
['carbon', 'hydrogen', 'oxygen', 'nitrogen', 'lithium', 'helium']
>>> symbol_to_name.update( {"P": "phosphorus", "S": "sulfur"} )
>>> symbol_to_name.items()
[('C', 'carbon'), ('H', 'hydrogen'), ('O', 'oxygen'), ('N', 'nitrogen'), ('P', 'phosphorus'), ('S', 'sulfur'), ('Li', 'lithium'), ('He', 'helium')]
>>> del symbol_to_name['C']
>>> symbol_to_name
{'H': 'hydrogen', 'O': 'oxygen', 'N': 'nitrogen', 'P': 'phosphorus', 'S': 'sulfur', 'Li': 'lithium', 'He': 'helium'}

Background
Data Types/Structure
    list, string, tuple, dictionary
Control flow
File I/O
Modules
Class
NLTK

Control Flow

Things that are False
- The boolean value False
- The numbers 0 (integer), 0.0 (float) and 0j (complex).
- The empty string "".
- The empty list [], empty dictionary {} and empty set set().

Things that are True
- The boolean value True
- All non-zero numbers.
- Any string containing at least one character.
- A non-empty data structure.
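The truth rules listed above can be checked mechanically with bool(). The sketch below uses Python 3 syntax; the specific sample values in truthy are chosen here to show some easily misjudged cases (a string "0" and a list holding a zero are both non-empty).

```python
# Values the slide lists as false
falsy = [False, 0, 0.0, 0j, "", [], {}, set()]

# Non-zero numbers, non-empty strings and non-empty containers
truthy = [True, 1, -0.5, "0", [0], {"k": None}, {0}]

all_false = not any(bool(v) for v in falsy)
all_true = all(bool(v) for v in truthy)

print(all_false, all_true)
```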

If

>>> smiles = "BrC1=CC=C(C=C1)NN.Cl"
>>> bool(smiles)
True
>>> not bool(smiles)
False
>>> if not smiles:
...     print "The SMILES string is empty"
...

The "else" case is always optional

Use "elif" to chain subsequent tests

>>> mode = "absolute"
>>> if mode == "canonical":
...     smiles = "canonical"
... elif mode == "isomeric":
...     smiles = "isomeric"
... elif mode == "absolute":
...     smiles = "absolute"
... else:
...     raise TypeError("unknown mode")
...
>>> smiles
'absolute'
>>>

"raise" is the Python way to raise exceptions
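Wrapping the same chain in a function makes it easy to exercise both the matching branch and the raise. This is only a sketch in Python 3 syntax; the function name pick_smiles is hypothetical, and it keeps the slide's choice of TypeError.

```python
def pick_smiles(mode):
    """Mirror the slide's if/elif/else chain as a function."""
    if mode == "canonical":
        return "canonical"
    elif mode == "isomeric":
        return "isomeric"
    elif mode == "absolute":
        return "absolute"
    else:
        raise TypeError("unknown mode")


result = pick_smiles("absolute")

# An unrecognized mode falls through to the else branch and raises
try:
    pick_smiles("bogus")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(result, error_message)
```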

Boolean logic

Python expressions can have "and"s and "or"s:

if (ben <= 5 and chen >= 10 or
    chen == 500 and ben != 5):
    print "Ben and Chen"

Range Test

if (3 <= Time <= 5):
    print "Office Hour"
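Two details in these slides are worth spelling out: "and" binds more tightly than "or", and a chained comparison such as 3 <= x <= 5 tests both bounds at once. A minimal sketch in Python 3 syntax follows; the function names are invented for the example.

```python
def implicit(ben, chen):
    # As written on the slide, with no inner parentheses
    return ben <= 5 and chen >= 10 or chen == 500 and ben != 5


def explicit(ben, chen):
    # The grouping Python actually applies: "and" before "or"
    return (ben <= 5 and chen >= 10) or (chen == 500 and ben != 5)


def is_office_hour(time):
    # Chained comparison, equivalent to (3 <= time) and (time <= 5)
    return 3 <= time <= 5


same = all(implicit(b, c) == explicit(b, c)
           for b in (4, 5, 6) for c in (5, 10, 500))

print(same, is_office_hour(4), is_office_hour(6))
```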

For

>>> names = ["Ben", "Chen", "Yaqin"]
>>> for name in names:
...     print name
...
Ben
Chen
Yaqin

Tuple assignment in for loops

data = [ ("C20H20O3", 308.371),
         ("C22H20O2", 316.393),
         ("C24H40N4O2", 416.6),
         ("C14H25N5O3", 311.38),
         ("C15H20O2", 232.3181)]

for (formula, mw) in data:
    print "The molecular weight of %s is %s" % (formula, mw)

The molecular weight of C20H20O3 is 308.371
The molecular weight of C22H20O2 is 316.393
The molecular weight of C24H40N4O2 is 416.6
The molecular weight of C14H25N5O3 is 311.38
The molecular weight of C15H20O2 is 232.3181

Break, continue

>>> for value in [3, 1, 4, 1, 5, 9, 2]:
...     print "Checking", value
...     if value > 8:
...         print "Exiting for loop"
...         break
...     elif value < 3:
...         print "Ignoring"
...         continue
...     print "The square is", value**2
...
Checking 3
The square is 9
Checking 1
Ignoring
Checking 4
The square is 16
Checking 1
Ignoring
Checking 5
The square is 25
Checking 9
Exiting for loop
>>>

Use "break" to stop the for loop
Use "continue" to stop processing the current item

Range()

"range" creates a list of numbers in a specified range
range([start,] stop[, step]) -> list of integers
When step is given, it specifies the increment (or decrement).

>>> range(5)
[0, 1, 2, 3, 4]
>>> range(5, 10)
[5, 6, 7, 8, 9]
>>> range(0, 10, 2)
[0, 2, 4, 6, 8]

How to get every second element in a list?

for i in range(0, len(data), 2):
    print data[i]
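On Python 3, range returns a lazy range object instead of a list, so list() is needed to see the numbers. The sketch below (Python 3 syntax) also shows a slice with a step as an alternative way to take every second element; that alternative is an addition here, not something the slide shows, and the sample data is made up.

```python
data = ["a", "b", "c", "d", "e"]    # hypothetical sample list

# The slide's approach: walk the indices 0, 2, 4, ...
by_index = [data[i] for i in range(0, len(data), 2)]

# Alternative: slice notation with a step of 2
by_slice = data[::2]

# list() turns Python 3's range object into the list the slide shows
evens = list(range(0, 10, 2))

print(by_index, by_slice, evens)
```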


Reading files

>>> f = open("names.txt")
>>> f.readline()
'Yaqin\n'

Quick Way

>>> lst = [ x for x in open("text.txt","r").readlines() ]
>>> lst
['Chen Lin\n', '[email protected]\n', 'Volen 110\n', 'Office Hour: Thurs. 3-5\n', '\n', 'Yaqin Yang\n', '[email protected]\n', 'Volen 110\n', 'Office Hour: Tues. 3-5\n']

Ignore the header?

for (i,line) in enumerate(open('text.txt',"r").readlines()):
    if i == 0: continue
    print line

Using dictionaries to count occurrences

>>> name_count = {}
>>> for line in open('names.txt'):
...     name = line.strip()
...     name_count[name] = name_count.get(name,0)+ 1
...
>>> for (name, count) in name_count.items():
...     print name, count
...
Chen 3
Ben 3
Yaqin 3
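The counting idiom above can be run end to end without an existing names.txt. The following is a sketch in Python 3 syntax: it writes a small temporary file whose contents are hypothetical sample data, then counts the names with the same .get(name, 0) + 1 pattern.

```python
import os
import tempfile

# Hypothetical sample data standing in for names.txt
sample = "Chen\nBen\nYaqin\nBen\nChen\nBen\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

name_count = {}                      # must exist before the loop updates it
with open(path) as handle:           # "with" closes the file automatically
    for line in handle:
        name = line.strip()
        if name:                     # skip blank lines
            name_count[name] = name_count.get(name, 0) + 1

os.remove(path)                      # clean up the temporary file

for name, count in sorted(name_count.items()):
    print(name, count)
```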

File Output

input_file = open("in.txt")
output_file = open("out.txt", "w")
for line in input_file:
    output_file.write(line)

"w" = "write mode"
"a" = "append mode"
"wb" = "write in binary"
"r" = "read mode" (default)
"rb" = "read in binary"
"U" = "read files with Unix or Windows line endings"
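A self-contained version of the copy loop, plus the append mode from the list above, might look like the sketch below (Python 3 syntax). The temporary directory and the file contents are assumptions made so the example runs anywhere; the with statement is used so the files are closed even if an error occurs.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()                 # scratch directory for the demo
in_path = os.path.join(workdir, "in.txt")
out_path = os.path.join(workdir, "out.txt")

# Create a small input file (hypothetical contents)
with open(in_path, "w") as f:
    f.write("first line\nsecond line\n")

# Copy line by line: "r" is the default mode, "w" truncates and writes
with open(in_path) as input_file, open(out_path, "w") as output_file:
    for line in input_file:
        output_file.write(line)

# "a" keeps the existing contents and writes at the end
with open(out_path, "a") as output_file:
    output_file.write("appended line\n")

with open(out_path) as f:
    copied = f.readlines()

shutil.rmtree(workdir)                       # remove the scratch directory
print(copied)
```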


Modules

When a Python program starts it only has access to basic functions and classes
("int", "dict", "len", "sum", "range", ...).
"Modules" contain additional functionality.
Use "import" to tell Python to load a module.

>>> import math
>>> import nltk

import the math module

>>> import math
>>> math.pi
3.1415926535897931
>>> math.cos(0)
1.0
>>> math.cos(math.pi)
-1.0
>>> dir(math)
['__doc__', '__file__', '__name__', '__package__', 'acos', 'acosh',
'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos',
'cosh', 'degrees', 'e', 'exp', 'fabs', 'factorial', 'floor', 'fmod',
'frexp', 'fsum', 'hypot', 'isinf', 'isnan', 'ldexp', 'log', 'log10',
'log1p', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan',
'tanh', 'trunc']
>>> help(math)
>>> help(math.cos)

"import" and "from ... import ..."

>>> import math

math.cos

>>> from math import cos, pi

cos

>>> from math import *
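The difference between the import forms is only in how the names are reached afterwards. A short sketch in Python 3 syntax; the variable names qualified and unqualified are invented for the example.

```python
import math                  # names are reached through the module: math.cos
from math import cos, pi     # selected names are usable directly: cos, pi

qualified = math.cos(math.pi)
unqualified = cos(pi)

# Both forms refer to the very same function object
same_function = cos is math.cos

print(qualified, unqualified, same_function)
```

The third form, "from math import *", pulls every public name into the current namespace, which makes it harder to tell where a name came from.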


Classes

class ClassName(object):
    <statement-1>
    . . .
    <statement-N>

class MyClass(object):
    """A simple example class"""
    i = 12345
    def f(self):
        return self.i

class DerivedClassName(BaseClassName):
    <statement-1>
    . . .
    <statement-N>
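To make the templates above concrete, the sketch below (Python 3 syntax) instantiates MyClass and adds one subclass. DerivedClass and its overriding method are hypothetical, written only to show how a derived class can reuse a base-class method.

```python
class MyClass(object):
    """A simple example class"""
    i = 12345                      # class attribute shared by all instances

    def f(self):
        return self.i


class DerivedClass(MyClass):       # hypothetical subclass, not from the slides
    def f(self):
        # Call the base-class method, then adjust its result
        return MyClass.f(self) + 1


base = MyClass()
derived = DerivedClass()

print(base.f(), derived.f(), isinstance(derived, MyClass))
```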


http://www.nltk.org/book
NLTK is on berry patch machines!

>>> from nltk.book import *
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1.name
'Moby Dick by Herman Melville 1851'
>>> text1.concordance("monstrous")
>>> dir(text1)
>>> text1.tokens
>>> text1.index("my")
4647
>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']

Classify Text

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}
>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)

Featurize, train, test, predict

>>> featuresets = [(gender_features(n), g) for (n,g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.726
>>> classifier.classify(gender_features('Neo'))
'male'

from nltk.corpus import reuters

Reuters Corpus:
- 10,788 news documents, 1.3 million words.
- Classified into 90 topics
- Grouped into 2 sets, "training" and "test"
- Categories overlap with each other

http://nltk.googlecode.com/svn/trunk/doc/book/ch02.html

Reuters

>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', ...]