introduction to text-processing - stanford...

69
Introduction to Text-Processing Jim Notwell 23 January 2013 1

Upload: others

Post on 18-Mar-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Introduction to Text-ProcessingJim Notwell

23 January 2013

1

Page 2: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Stanford UNIX Resources

• Host: cardinal.stanford.edu

• To connect from UNIX / Linux / Mac:

• To connect from Windows

PuTTY (http://goo.gl/s0itD)

ssh [email protected]

2

Page 3: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Many useful UNIX text processing commands

• awk cat cut grep head sed sort tail tee tr uniq wc zcat

• Unix commands work together via text streams

• Example usage and others available at:

http://tldp.org/LDP/abs/html/textproc.html

3

Page 4: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Knowing UNIX Commands Eliminates Having To Reinvent The Wheel

For homework #1 last year, to perform a simple file sort, submissions used:

• 35 lines of Python

• 19 lines of Perl

• 73 lines of Java

• 1 line of UNIX commands

4

Page 5: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Anatomy of a UNIX Command

• options: -n 1 -g -c = -n1 -gc• Output is directed to “standard

output” (stdout)

• If no input file is specified, inut comes from “standard input” (stdin)

• “-” also means stdin in a file list

5

command [options] [file1] [file2]

Page 6: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

The real power of UNIX commands comes from combinations through

piping (“|”)• Pipes are used to pass the output of one

program (stdout) as the input (stdin) to another

• Pipe character is <shift>-\

grep “CS273a” grades.txt | sort -k 2,2gr | uniq

Find all lines in the file that contain “CS273a”

6

Page 7: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

The real power of UNIX commands comes from combinations through

piping (“|”)• Pipes are used to pass the output of one

program (stdout) as the input (stdin) to another

• Pipe character is <shift>-\

grep “CS273a” grades.txt | sort -k 2,2gr | uniq

Find all lines in the file that contain “CS273a”

Sort those lines by the second column, in numerical order, highest to lowest

7

Page 8: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

The real power of UNIX commands comes from combinations through

piping (“|”)• Pipes are used to pass the output of one

program (stdout) as the input (stdin) to another

• Pipe character is <shift>-\

grep “CS273a” grades.txt | sort -k 2,2gr | uniq

Find all lines in the file that contain “CS273a”

Sort those lines by the second column, in numerical order, highest to lowest

Remove duplicates and print to standard output

8

Page 9: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Output redirection (>, >>)

• Instead of writing everything to standard output, we can write (>) or append (>>) to a file

grep “CS273a” allClasses.txt > CS273aInfo.txt

9

cat addInfo.txt >> CS273aInfo.txt

Page 10: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

UNIX Commands

10

Page 11: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

man, whatis, apropos• UNIX program that invokes the manual written for a particular

program

• man sort• Shows all info about the program sort

• Hit <space> to scroll down, “q” to exit

• whatis sort• Shows short description of all programs that have “sort” in their

names

• apropos sort• Shows all programs that have sort in their names or short

descriptions

11

Page 12: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

cat• Concatenates files and prints them to standard output

• cat [option] [file] ...

• Variants for compressed input files:

• zcat (.gz files)

• bzcat (.bz2 files)

12

Page 13: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

cat• Concatenates files and prints them to standard output

• cat [option] [file] ...

• Variants for compressed input files:

• zcat (.gz files)

• bzcat (.bz2 files)

13

ABCD

123

Page 14: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

cat• Concatenates files and prints them to standard output

• cat [option] [file] ...

• Variants for compressed input files:

• zcat (.gz files)

• bzcat (.bz2 files)

14

ABCD

123

ABCD123

Page 15: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

head, tail• head: first ten lines

tail: last ten line

• -n option: number of lines• For tail, -n+K means line K to the end

• head -5 : first five lines

• tail -n73 : last 73 lines

• tail -n+10 | head -n 5 : lines 10-14

15

Page 16: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

cut• Prints the selected parts of lines from each file to

standard output

cut [option] ... [file] ...• -d Choose the delimiter between columns (default

tab)

• -f : Fields to print

• -f1,7 : fields 1 and 7

• -f1-4,7,11-13 :fields 1,2,3,4,7,11,12,13

16

Page 17: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

cut Example

17

CS 273 aCS.273.aCS 273 a

cut -f1,3 file.txt

cat file.txt | cut -f1,3=

file.txt

Page 18: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

cut Example

18

CS 273 aCS.273.aCS 273 a

cut -f1,3 file.txt

cat file.txt | cut -f1,3=

CS aCS.273.aCS

file.txt

Page 19: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

cut Example

19

CS 273 aCS.273.aCS 273 a

cut -d ‘.’ -f1,3 file.txt

file.txt

Page 20: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

cut Example

20

CS 273 aCS.273.aCS 273 a

cut -d ‘.’ -f1,3 file.txtCS 273 aCS.aCS 273 a

file.txt

Page 21: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

cut Example

21

CS 273 aCS.273.aCS 273 a

cut -d ‘.’ -f1,3 file.txtCS 273 aCS.aCS 273 a

file.txtIn general, you should make sure your file columns are all delimited with the same

character(s) before applying cut!

Page 22: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

wc

• Print line, word, and character (byte) counts for each file, and totals of each if more tan one file specified

wc [option]... [file]...• -l Print only line counts

22

Page 23: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

sort• Sorts lines in a delimited file (default: tab)

• -k m,n Sorts by columns m to n (1-based)

• -g Sorts by general numerical value (can handle scientific format)

• -r Sorts in descending order

• sort -k1,1gr -k2,3• Sort on field 1 numerically (high to low because of r)

• Break ties on field 2 alphabetically

• Break further ties on field 3 alphabetically

23

Page 24: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

uniq

• Discard all but one successive identical lines from input and print to standard output

• -d Only print duplicate lines

• -i Ignore case in comparison

• -u Only print unique lines

24

Page 25: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

uniq Example

25

CS 273aCS 273aTA: Jim NotwellCS 273a

uniq file.txt

file.txt

Page 26: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

CS 273 aCS.273.aCS 273 a

uniq Example

26

CS 273aCS 273aTA: Jim NotwellCS 273a

uniq file.txtCS 273aTA: Jim NotwellCS 273a

file.txt

Page 27: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

uniq Example

27

CS 273aCS 273aTA: Jim NotwellCS 273a

file.txt

uniq -u file.txt

Page 28: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

CS 273 aCS.273.aCS 273 a

uniq Example

28

CS 273aCS 273aTA: Jim NotwellCS 273a

uniq -u file.txt TA: Jim NotwellCS 273a

file.txt

Page 29: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

uniq Example

29

CS 273aCS 273aTA: Jim NotwellCS 273a

uniq -d file.txt

file.txt

Page 30: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

CS 273 aCS.273.aCS 273 a

uniq Example

30

CS 273aCS 273aTA: Jim NotwellCS 273a

uniq -d file.txt CS 273a

file.txt

Page 31: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

CS 273 aCS.273.aCS 273 a

uniq Example

31

CS 273aCS 273aTA: Jim NotwellCS 273a

uniq -d file.txt CS 273a

file.txtIn general, you probably want to make sure your file is sorted before applying uniq

Page 32: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

grep• Search for lines that contain a word or match a

regular expression

grep [options] PATTERN [file] ...• -i Ignores case

• -v Output lines that do not match

• -E regular expressions

• -f <file> : patterns from a file (1 per line)

32

Page 33: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

grep Example

33

CS 273aCS273CS 273cs 273CS 273

grep -E “^CS[[:space:]]+273$” file

file.txt

For lines that start with CS

Then have one or more spaces (or tabs)

And ends with 273

Page 34: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

grep Example

34

CS 273aCS273CS 273cs 273CS 273

grep -E “^CS[[:space:]]+273$” file

file.txt

For lines that start with CS

Then have one or more spaces (or tabs)

And ends with 273

CS 273CS 273

Page 35: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

tr• Translate or delete characters from standard

input to standard output

tr [option] ... SET1 [SET2]

• -d Delete characters in SET1, don’t translate

35

Page 36: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

tr• Translate or delete characters from standard

input to standard output

tr [option] ... SET1 [SET2]

• -d Delete characters in SET1, don’t translate

36

Thisis anExample.

file.txt

cat file.txt | tr ‘\n’ ‘,’

Page 37: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

tr• Translate or delete characters from standard

input to standard output

tr [option] ... SET1 [SET2]

• -d Delete characters in SET1, don’t translate

37

Thisis anExample.

file.txt

cat file.txt | tr ‘\n’ ‘,’

This,is an,Example.,

Page 38: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

• Most common use is a string replace

sed -e “s/search/replace/g”

sed: stream editor

38

Page 39: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

• Most common use is a string replace

sed -e “s/search/replace/g”

sed: stream editor

39

Thisis anExample.

file.txt

sed file.txt | sed -e “s/is/EEE/g”

Page 40: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

• Most common use is a string replace

sed -e “s/search/replace/g”

sed: stream editor

40

Thisis anExample.

file.txt

sed file.txt | sed -e “s/is/EEE/g”

ThEEEEEE anExample.

Page 41: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

• Join lines of two files on a common field

join [option] ... file1 file2

• -1 Specify which column of file1 to join on

• -2 Specify which column of file2 to join on

• Important: file 1 and file 2 must already be sorted on their join fields

join

41

Page 42: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Shell Scripting

42

Page 43: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Common Shells

43

• Two common shells: bash and tcsh

• Run ps to see which you are using

Page 44: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Multiple UNIX Commands Can Be Combined into a Single Shell Script

44

#!/bin/bashset -beEu -o pipefailcat $1 $2 > tmp.txtsort tmp.txt > $3

script.sh

>chmod u+x script.sh

> ./script.sh file1.txt file2.txtTo Run:

Set script to executable:

http://www.faqs.org/docs/bashman/bashref_toc.html

Page 45: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Multiple UNIX Commands Can Be Combined into a Single Shell Script

45

#!/bin/tcsh -ecat $1 $2 > tmp.txtsort tmp.txt > $3

script.csh

>chmod u+x script.csh

> ./script.csh file1.txt file2.txtTo Run:

Set script to executable:

http://www.the4cs.com/~corin/acm/tutorial/unix/tcsh-help.html

Page 46: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

for loop• BASH for loop to print 1,2,3 on separate lines

• TCSH for loop to print 1,2,3 on separate lines

46

for i in `seq 1 3`do echo ${i}done

foreach i ( `seq 1 3` ) echo ${i}end

Page 47: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

for loop• BASH for loop to print 1,2,3 on separate lines

• TCSH for loop to print 1,2,3 on separate lines

47

for i in `seq 1 3`do echo ${i}done

foreach i ( `seq 1 3` ) echo ${i}end

Special quote characters, usually left of “1” on

keyboard that indicate we should execute the

command within the quotes

Page 48: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

UCSC Kent Source Utilities

48

Page 49: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

• Many C programs in this directory that do manipulation of sequences or chromosome ranges

• Run programs with no arguments to see help message

/afs/ir/class/cs273a/bin/@sys/

49

selectFile

overlapSelect [OPTION]… selectFile inFile outFile

inFile

Page 50: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

• Many C programs in this directory that do manipulation of sequences or chromosome ranges

• Run programs with no arguments to see help message

/afs/ir/class/cs273a/bin/@sys/

50

selectFile

overlapSelect [OPTION]… selectFile inFile outFile

inFile

outFileOutput is all inFile elements that overlap any selectFile elements

Page 51: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Scripting Languages

51

Page 52: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

• A quick-and-easy shell scripting language

• http://www.grymoire.com/Unix/Awk.html

• Treats each line of a file as a record, and splits fields by whitespace

• Fields referenced as $1, $2, $3, … ($0 is entire line)

awk

52

Page 53: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Anatomy of an awk script

awk ‘BEGIN {...} {...} END {...}’

Before the first line Once per line After the last line

53

Page 54: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

• Output the lines where column 3 is less than column 5 in a comma-delimited file. Output a summary line at the end.

awk Example

54

awk -F’,’‘BEGIN{ct=0;}{ if ($3 < $5) { print $0; ct=ct+1; } }END { print “TOTAL LINES: “ ct; }’

Page 55: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Useful Things From awk• Make sure fields are delimited with tabs (to be used

by cut, sort, join, etc.)

awk ‘{print $1 “\t” $2 “\t” $3}’ whiteDelim.txt > tabDelim.txt

• Good string processing using substr, index, length functions

awk ‘{print substr($1, 1, 10)}’ longNames.txt > shortNames.txt

55

Page 56: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Useful Things From awk• Make sure fields are delimited with tabs (to be used

by cut, sort, join, etc.)

awk ‘{print $1 “\t” $2 “\t” $3}’ whiteDelim.txt > tabDelim.txt

• Good string processing using substr, index, length functions

awk ‘{print substr($1, 1, 10)}’ longNames.txt > shortNames.txt

56

String to manipulate Start Position Length

Page 57: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

57

Page 58: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Miscellaneous

58

Page 59: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Interac(ng  with  UCSC  Genome  Browser  MySQL  Tables

• Galaxy  (a  GUI  to  make  SQL  commands  easy)–hBp://main.g2.bx.psu.edu/

• Direct  interac(on  with  the  tables:mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A –Ne “<STMT>“

e.g.mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A –Ne \ “select count(*) from hg18.knownGene“;

+-------+

| 66803 |

+-------+

hBp://dev.mysql.com/doc/refman/5.1/en/tutorial.html30

Page 60: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Python

• A  scrip(ng  language  with  many  useful  constructs• Easier  to  read  than  Perl• hBp://wiki.python.org/moin/BeginnersGuide• hBp://docs.python.org/tutorial/index.html

• Call  a  python  program  from  the  command  line:python myProg.py

3660

Page 61: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Number  types

• Numbers:    int,  float>>> f = 4.7>>> i = int(f)>>> j = round(f)>>> i4>>> j5.0>>> i*j20.0>>> 2**i16

3761

Page 62: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Strings>>> dir(“”)

[…, 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

>>> s = “hi how are you?”

>>> len(s)15

>>> s[5:10]

‘w are’

>>> s.find(“how”)3

>>> s.find(“CS273”)

-1

>>> s.split(“ “)[‘hi’, ‘how’, ‘are’, ‘you?’]

>>> s.startswith(“hi”)True

>>> s.replace(“hi”, “hey buddy,”)‘hey buddy, how are you?’

>>> “ extraBlanks ”.strip()‘extraBlanks’

3862

Page 63: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Lists• A  container  that  holds  zero  or  more  objects  in  sequen(al  order

>>> dir([])[…, 'append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse',

'sort']

>>> myList = [“hi”, “how”, “are”, “you?”]

>>> myList[0]‘hi’

>>> len(myList)

4

>>> for word in myList: print word[0:2]

hi

hoar

yo

>>> nums = [1,2,3,4]>>> squares = [n*n for n in nums]

>>> squares

[1, 4, 9, 16]3963

Page 64: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Dic(onaries• A  container  like  a  list,  except  key  can  be  anything  (instead  of  a  non-­‐nega(ve  integer)

>>> dir({})

[…, clear', 'copy', 'fromkeys', 'get', 'has_key', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

>>> fruits = {“apple”: True, “banana”: True}

>>> fruits[“apple”]

True

>>> fruits.get(“apple”, “Not a fruit!”)

True

>>> fruits.get(“carrot”, “Not a fruit!”)

‘Not a fruit!’

>>> fruits.items()

[('apple', True), ('banana', True)]

4064

Page 65: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Reading  from  files

>>> openFile = open(“file.txt”, “r”)

>>> allLines = openFile.readlines()

>>> openFile.close()

>>> allLines[‘Hello, world!\n’, ‘This is a file-reading\n’, ‘\texample.\n’]

41

Hello,  world!This  is  a  file-­‐reading          example.

file.txt

65

Page 66: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Wri(ng  to  files>>> writer = open(“file2.txt”, “w”)

>>> writer.write(“Hello again.\n”)

>>> name = “Cory”

>>> writer.write(“My name is %s, what’s yours?\n” % name)

>>> writer.close()

42

Hello  again.My  name  is  Cory,  what’s  yours?

file2.txt

66

Page 67: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Crea(ng  func(onsdef compareParameters(param1, param2):

if param1 < param2:

return -1

elif param1 > param2:

return 1

else:

return 0

def factorial(n):

if n < 0:

return None

elif n == 0:

return 1 else:

retval = 1

num = 1

while num <= n:

retval = retval*num

num = num + 1

return retval4367

Page 68: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Example  program

#!/usr/bin/env python

import sys # Required to read arguments from command line

if len(sys.argv) != 3:

print “Wrong number of arguments supplied to Example.py”

sys.exit(1)

inFile = open(sys.argv[1], “r”)

allLines = inFile.readlines()

inFile.close()

outFile = open(sys.argv[2], “w”)

for line in allLines:

outFile.write(line)

outFile.close()

44

Example.py

68

Page 69: Introduction to Text-Processing - Stanford Universitycs173.stanford.edu/presentations/lecture5.pdfman, whatis, apropos • UNIX program that invokes the manual written for a particular

Example  program

python Example.py file1 file2

sys.argv = [‘Example.py’, ‘file1’, ‘file2’]

45

#!/usr/bin/env pythonimport sys # Required to read arguments from command line

if len(sys.argv) != 3: print “Wrong number of arguments supplied to Example.py” sys.exit(1)

inFile = open(sys.argv[1], “r”)allLines = inFile.readlines()inFile.close()

outFile = open(sys.argv[2], “w”)for line in allLines: outFile.write(line)

outFile.close()

69