Methods in Computational Linguistics II Queens College Lecture 5: List Comprehensions

Upload: rosemary-ward

Post on 11-Jan-2016

Methods in Computational Linguistics II

Queens College

Lecture 5: List Comprehensions

Split into words

• sent = "That isn't the problem, Bob."
• sent.split()
• vs.
• nltk.word_tokenize(sent)
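
The two calls differ mainly in how they treat punctuation. The sketch below uses a regular expression as a rough stand-in for a tokenizer. It is not NLTK's algorithm, and `simple_tokenize` is a hypothetical helper, so `nltk.word_tokenize` may split the sentence differently (for example, around contractions).

```python
import re

sent = "That isn't the problem, Bob."

# str.split() only breaks on whitespace, so punctuation stays glued to words.
by_whitespace = sent.split()


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: words (with an optional
    apostrophe part) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


tokens = simple_tokenize(sent)

print(by_whitespace)  # ['That', "isn't", 'the', 'problem,', 'Bob.']
print(tokens)         # ['That', "isn't", 'the', 'problem', ',', 'Bob', '.']
```

Note that `split()` leaves "problem," and "Bob." as single items, which matters as soon as words are counted or looked up.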

List Comprehensions

• Compact way to process every item in a list.

[x for x in array]

dest = []

for x in array:
    dest.append(x)

Methods

• Functions and methods can be applied to the iterating variable, x.

• Their return values are stored in the resulting list.

[len(x) for x in array]

dest = []

for x in array:
    dest.append(len(x))
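
A small runnable check that the comprehension and the loop build the same list. The word list is made up for illustration.

```python
array = ["the", "cat", "sat", "on", "that", "mat"]  # illustrative data

# Comprehension form: one expression per item.
lengths = [len(x) for x in array]

# Equivalent explicit loop.
dest = []
for x in array:
    dest.append(len(x))

print(lengths)          # [3, 3, 3, 2, 4, 3]
print(lengths == dest)  # True
```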

Conditionals

• Elements of the original list can be left out of the resulting list by adding an if condition.

[x for x in array if len(x) == 3]

dest = []

for x in array:
    if len(x) == 3:
        dest.append(x)

Building up

• These can be combined to build up complicated lists

[x.upper() for x in array if len(x) > 3 and x.startswith('t')]

dest = []

for x in array:
    if len(x) > 3 and x.startswith('t'):
        dest.append(x.upper())
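
To see filtering and transformation working together, here is a short sketch on an invented word list. The condition decides which items survive, and the leading expression decides what is stored for each survivor.

```python
array = ["the", "table", "that", "tree", "apple", "to"]  # illustrative data

# Filter only: keep the three-letter words unchanged.
three_letters = [x for x in array if len(x) == 3]

# Filter and transform: long words starting with 't', upper-cased.
shouted = [x.upper() for x in array if len(x) > 3 and x.startswith("t")]

print(three_letters)  # ['the']
print(shouted)        # ['TABLE', 'THAT', 'TREE']
```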

Lists Containing Lists

• Lists can contain lists
• [['a', 1], ['b', 2], ['d', 4]]
• ...or tuples
• [('a', 1), ('b', 2), ('d', 4)]
• [[d, d*d] for d in array if d < 4]
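
A quick sketch of the last comprehension on sample numbers, once producing inner lists and once producing tuples. The input values are assumptions chosen for illustration.

```python
array = [1, 2, 3, 4, 5]  # illustrative data

# Each result item is itself a two-element list: [number, square].
as_lists = [[d, d * d] for d in array if d < 4]

# Same idea with tuples, which cannot be modified after creation.
as_tuples = [(d, d * d) for d in array if d < 4]

print(as_lists)   # [[1, 1], [2, 4], [3, 9]]
print(as_tuples)  # [(1, 1), (2, 4), (3, 9)]
```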

Using multiple lists

• Multiple lists can be processed in a single list comprehension. The for clauses nest, so every x is paired with every y.

• [x*y for x in array1 for y in array2]
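
The sketch below runs the two-list comprehension on sample values and contrasts it with pairing items position by position using `zip`. The lists are invented for illustration.

```python
array1 = [1, 2, 3]
array2 = [10, 100]

# Two for clauses behave like nested loops: every x meets every y.
all_products = [x * y for x in array1 for y in array2]

# Equivalent nested loop.
dest = []
for x in array1:
    for y in array2:
        dest.append(x * y)

# Position-by-position pairing needs zip instead of a second for clause.
pairwise = [x * y for x, y in zip([1, 2, 3], [10, 100, 1000])]

print(all_products)  # [10, 100, 20, 200, 30, 300]
print(pairwise)      # [10, 200, 3000]
```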

List Comprehension Exercises

Make a list of the first ten multiples of ten (10, 20, 30... 90, 100) using a list comprehension.

Make a list of the first ten cubes (1, 8, 27... 1000) using a list comprehension.

Store five names in a list. Make a second list that adds the phrase "is awesome!" to each name, using a list comprehension.

Write out the following code without using a list comprehension:

plus_thirteen = [number + 13 for number in range(1,11)]

Exercises from: http://introtopython.org/all_exercises_challenges.html#ex_ch_12

Lists within lists are often called 2-d arrays

• This is another way we store tables.

• Similar to nested dictionaries.
• a = [[0,1], [1,0]]
• a[1][1]
• a[0][0]
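
A brief sketch of indexing into such a table: the first index selects the row, the second selects the column within that row. The column extraction at the end is an added illustration.

```python
a = [[0, 1],
     [1, 0]]

first_row = a[0]        # the whole first row
top_right = a[0][1]     # row 0, column 1
bottom_right = a[1][1]  # row 1, column 1

# A column is not stored directly; collect it row by row.
first_column = [row[0] for row in a]

print(first_row)      # [0, 1]
print(top_right)      # 1
print(bottom_right)   # 0
print(first_column)   # [0, 1]
```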

Numpy & Arrays

• NumPy is a commonly used package for numerical calculations in Python.

• Its main object is a multidimensional array.

• A[1]: List
• A[1][2]: 'Rectangular' 2-d Matrix
• A[1][2][3]: 'Cube/Prism' 3-d Matrix
• A[1][2][3][4]: 4-d Matrix
• etc.

Numpy arrays

from numpy import *

a = array([1,2,3,4])

a = array([[1,2], [3,4]])    # a 2-d array takes one nested list, not two arguments

a.ndim     # Number of dimensions

a.shape    # Length of each dimension

a.size     # Total number of elements
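
A short sketch showing the three attributes on a 1-d and a 2-d array. It uses `import numpy as np` instead of the wildcard import on the slide, which is a stylistic assumption and not part of the lecture, and it requires NumPy to be installed.

```python
import numpy as np

a = np.array([1, 2, 3, 4])        # 1-d: a single list
b = np.array([[1, 2], [3, 4]])    # 2-d: a list of rows

print(a.ndim, a.shape, a.size)    # 1 (4,) 4
print(b.ndim, b.shape, b.size)    # 2 (2, 2) 4
```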

numpy array initialization

>>> zeros( (3,4) )

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

>>> ones( (2,3,4), dtype=int16 )

array([[[ 1, 1, 1, 1],
        [ 1, 1, 1, 1],
        [ 1, 1, 1, 1]],

       [[ 1, 1, 1, 1],
        [ 1, 1, 1, 1],
        [ 1, 1, 1, 1]]], dtype=int16)

>>> empty( (2,3) )

array([[ 3.73603959e-262, 6.02658058e-154, 6.55490914e-260],
       [ 5.30498948e-313, 3.14673309e-307, 1.00000000e+000]])

Content Types

• arrays are homogeneous (ndarray)
  – array([1, 3, 4], dtype=int16)

• lists are not homogeneous
  – ['abc', 123, [list1, list2]]

• dtype describes the "type" of object in the array
  – str, tuple, int, etc.
  – numpy.int16, numpy.int32, numpy.float64 etc.
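
A small sketch of what homogeneity means in practice: an array has one dtype for all elements, so mixing integers and floats converts everything to float, while a list keeps each item's own type. It assumes NumPy is installed and uses the `np` alias instead of the slide's wildcard import.

```python
import numpy as np

small = np.array([1, 3, 4], dtype=np.int16)  # explicit element type
mixed = np.array([1, 2.5, 3])                # ints are upcast to float64

mixed_list = ["abc", 123, [1, 2]]            # a list keeps per-item types
item_types = [type(item).__name__ for item in mixed_list]

print(small.dtype)  # int16
print(mixed.dtype)  # float64
print(item_types)   # ['str', 'int', 'list']
```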

zip

• Zip allows you to “zip” two lists together, creating a list of tuples

• names = ['Andrew', 'Beth', 'Charles']
• ages = [35, 34, 33]
• name_age = zip(names, ages)
  – [('Andrew', 35), ('Beth', 34), ('Charles', 33)]
  – In Python 3, zip returns an iterator; use list(zip(names, ages)) to get the list.
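
A runnable sketch of `zip` written for Python 3, where the result has to be wrapped in `list(...)` or `dict(...)` to be materialized. The dictionary and the shorter second list are added illustrations, not part of the slide.

```python
names = ["Andrew", "Beth", "Charles"]
ages = [35, 34, 33]

# Pair items by position.
name_age = list(zip(names, ages))

# The same pairs make a convenient lookup table.
age_of = dict(zip(names, ages))

# zip stops at the end of the shortest input.
truncated = list(zip(names, [35, 34]))

print(name_age)       # [('Andrew', 35), ('Beth', 34), ('Charles', 33)]
print(age_of["Beth"])  # 34
print(truncated)      # [('Andrew', 35), ('Beth', 34)]
```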

foreach vs. indexed for loops

"More pythonic"

for n, a in zip(names, ages):
    print("%s -- %s" % (n, a))

vs.

for i in range(len(names)):
    print("%s -- %s" % (names[i], ages[i]))

map

• map allows you to apply the same function to a list of objects.

a = ['1', '2', '4']

map(int, a)

map

Any function can be mapped over a list, but each element of the list needs to be a valid argument to that function.

def uppercase(s):
    return s.upper()

a = ['abc', 'def', 'ghi']

map(uppercase, a)
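
The sketch below runs both `map` examples under Python 3, where `map` returns an iterator and `list(...)` is needed to see the values. The comprehension and the failing input are added for comparison and are not from the slides.

```python
def uppercase(s):
    return s.upper()


numbers = list(map(int, ["1", "2", "4"]))
shouted = list(map(uppercase, ["abc", "def", "ghi"]))

# The same result written as a list comprehension.
shouted_again = [uppercase(s) for s in ["abc", "def", "ghi"]]

# An element that is not a valid argument makes the whole map fail.
try:
    list(map(int, ["1", "two"]))
    failed = False
except ValueError:
    failed = True

print(numbers)  # [1, 2, 4]
print(shouted)  # ['ABC', 'DEF', 'GHI']
print(failed)   # True
```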

Functions as objects

• A function name can be assigned to a variable.
• map is an example of this, where the first argument to map is a function object.

a = [1, 3, 4]

len(a)

sum(a)

functions = [len, sum]

for fn in functions:
    print(str(fn), fn(a))
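
A minimal sketch of treating functions as ordinary values: storing them in a variable, in a list, and using their names as dictionary keys. Adding `max` to the list and the `measure` variable are illustrative choices.

```python
a = [1, 3, 4]

# A function can be bound to another name and called through it.
measure = len
count = measure(a)

# Functions can also sit in a list and be called in turn.
functions = [len, sum, max]
results = {fn.__name__: fn(a) for fn in functions}

print(count)    # 3
print(results)  # {'len': 3, 'sum': 8, 'max': 4}
```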

lambda

• Lambda functions are single use functions that do not need to be ‘def’ed.

• Using the uppercase example again:

def uppercase(s):
    return s.upper()

a = ['abc', 'def', 'ghi']

map(uppercase, a)

lambda

• Lambda functions are single use functions that do not need to be ‘def’ed.

• These are "anonymous" functions
• Using the uppercase example again:

a = ['abc', 'def', 'ghi']

map(lambda s: s.upper(), a)

By design, a lambda body is a single expression.
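
A short Python 3 sketch of the lambda version, plus one other common place lambdas show up: the `key` argument of `sorted`. The sorting example and its word list are additions for illustration.

```python
a = ["abc", "def", "ghi"]

# Anonymous function passed straight to map.
shouted = list(map(lambda s: s.upper(), a))

# A lambda as a sort key: order words by length.
words = ["banana", "fig", "apple"]
by_length = sorted(words, key=lambda s: len(s))

print(shouted)    # ['ABC', 'DEF', 'GHI']
print(by_length)  # ['fig', 'apple', 'banana']
```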

Aside: Glob

• Construct a list of all filenames matching a pattern.

from glob import glob

glob('*.txt')

glob('/Users/andrew/Documents/*/*.ppt')
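
To try `glob` without depending on any particular directory, the sketch below creates a few empty files in a temporary folder and matches them. The file names are invented, and the results are sorted because `glob` does not promise an order.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as folder:
    # Create three empty files to match against.
    for name in ["a.txt", "b.txt", "slides.ppt"]:
        open(os.path.join(folder, name), "w").close()

    txt_paths = sorted(glob(os.path.join(folder, "*.txt")))
    txt_names = [os.path.basename(path) for path in txt_paths]

print(txt_names)  # ['a.txt', 'b.txt']
```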

Linguistic Annotation

• Text only takes us so far.
• People are reliable judges of linguistic behavior.
• We can model with machines, but for "gold-standard" truth, we ask people to make judgments about linguistic qualities.

Example Linguistic Annotations

• Sentence Boundaries
• Part of Speech Tags
• Phonetic Transcription
• Syntactic parse trees
• Speaker Identity
• Semantic Role
• Speech Act
• Document Topic
• Argument structure
• Word Sense
• many many many more

We need…

• Techniques to process these.

• Every corpus has its own format for linguistic annotation.

• so…we need to parse annotation formats.

Constructing a linguistic corpus

• Decisions that need to be made:

  – Why are you doing this?
  – What material will be collected?
  – How will it be collected?
    • Automatically?
    • Manually?
    • Found material vs. laboratory language?
  – What meta information will be stored?
  – What manual annotations are required?
    • How will each annotation be defined?
    • How many annotators will be used?
    • How will agreement be assessed?
    • How will disagreements be resolved?
  – How will the material be disseminated?
    • Is this covered by your IRB if the material is the result of a human subject protocol?

Part of Speech Tagging

• Task: Given a string of words, identify the part of speech of each word.

Part of Speech tagging

• Surface level syntax.
• Primary operation
• Parsing
• Word Sense Disambiguation
• Semantic Role labeling
• Segmentation
• Discourse, Topic, Sentence

How is it done?

• Learn from Data.

• Annotated Data:

• Unlabeled Data:

Learn the association from Tag to Word

Limitations

• Unseen tokens
• Uncommon interpretations
• Long-term dependencies

Format conversion exercise

The/DET Dog/NN is/VB fast/JJ ./.

<word ortho="The" pos="DET"></word>
<word ortho="Dog" pos="NN"></word>
<word ortho="is" pos="VB"></word>
<word ortho="fast" pos="JJ"></word>
<word ortho="." pos="."></word>

The dog is fast.

1, 3, DET

5, 7, NN

9, 10, VB

12, 15, JJ

16, 16, .
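
One possible approach to the exercise, offered as a sketch and not as the intended solution: parse the slash-tagged line into (word, tag) pairs, then render those pairs in the other two formats. The function names are hypothetical, the offsets are assumed to be 1-based and inclusive (as the numbers on the slide suggest), and tokens are matched case-insensitively because the slide writes "Dog" in one format and "dog" in another.

```python
from xml.sax.saxutils import quoteattr


def parse_slash_tags(line):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    # Split on the last slash so a token such as './.' still parses.
    return [tuple(token.rsplit("/", 1)) for token in line.split()]


def to_xml(pairs):
    """Render each pair as a <word> element, escaping attribute values."""
    return ["<word ortho=%s pos=%s></word>" % (quoteattr(w), quoteattr(t))
            for w, t in pairs]


def to_standoff(text, pairs):
    """Return (start, end, tag) triples with 1-based, inclusive offsets."""
    spans = []
    cursor = 0
    lowered = text.lower()
    for word, tag in pairs:
        start = lowered.find(word.lower(), cursor)
        if start == -1:
            raise ValueError("token %r not found in text" % word)
        end = start + len(word)              # exclusive, 0-based
        spans.append((start + 1, end, tag))  # inclusive, 1-based
        cursor = end
    return spans


pairs = parse_slash_tags("The/DET Dog/NN is/VB fast/JJ ./.")
xml_lines = to_xml(pairs)
spans = to_standoff("The dog is fast.", pairs)

print("\n".join(xml_lines))
for start, end, tag in spans:
    print("%d, %d, %s" % (start, end, tag))
```

Keeping a single intermediate representation (the list of pairs) means each new output format only needs one small rendering function.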

Parsing

• Generate a parse tree.

Parsing

• Generate a Parse Tree from:
• The surface form (words) of the text
• Part of Speech Tokens

Parsing Styles

Parsing styles

Context Free Grammars for Parsing

• S → VP
• S → NP VP
• NP → Det Nom
• Nom → Noun
• Nom → Adj Nom
• VP → Verb Nom
• Det → "A", "The"
• Noun → "I", "John", "Address"
• Verb → "Gave"
• Adj → "My", "Blue"
• Adv → "Quickly"
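
To make the grammar concrete, here is a minimal top-down recognizer over exactly these rules. It is a sketch under stated assumptions: it only answers whether a sentence is covered by the grammar (it does not build a tree), words are lower-cased before lookup, and the names `GRAMMAR`, `LEXICON`, `ends`, and `recognizes` are hypothetical. It works here because none of the rules is left-recursive.

```python
GRAMMAR = {
    "S": [["VP"], ["NP", "VP"]],
    "NP": [["Det", "Nom"]],
    "Nom": [["Noun"], ["Adj", "Nom"]],
    "VP": [["Verb", "Nom"]],
}

LEXICON = {
    "Det": {"a", "the"},
    "Noun": {"i", "john", "address"},
    "Verb": {"gave"},
    "Adj": {"my", "blue"},
    "Adv": {"quickly"},
}


def ends(symbol, words, start):
    """Return every position where `symbol` can end if it begins at `start`."""
    if symbol in LEXICON:
        if start < len(words) and words[start] in LEXICON[symbol]:
            return {start + 1}
        return set()
    result = set()
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        # Thread the possible end positions through each child in turn.
        for child in rhs:
            positions = {e for p in positions for e in ends(child, words, p)}
        result |= positions
    return result


def recognizes(sentence):
    """True if the whole sentence can be derived from S."""
    words = sentence.lower().split()
    return len(words) in ends("S", words, 0)


print(recognizes("The John gave my blue address"))  # True
print(recognizes("Gave my address"))                # True
print(recognizes("John gave my address"))           # False
```

The last sentence is rejected because the only NP rule requires a determiner, which shows how brittle a small hand-built grammar can be.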

Limitations

• The grammar must be built by hand.
• Can't handle ungrammatical sentences.
• Can't resolve ambiguity.

Probabilistic Parsing

• Assign each transition a probability
• Find the parse with the greatest "likelihood"

• Build a table and count
  – How many times does each transition happen

• Structured learning.
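
The "build a table and count" step can be sketched as follows, assuming that "transition" means a grammar rule application and that probabilities are estimated by relative frequency. The observed rule list is invented toy data, not drawn from any corpus, and the names are hypothetical.

```python
from collections import Counter

# Toy data: rule applications read off hypothetical annotated parses,
# each written as (left-hand side, right-hand side).
observed = [
    ("S", ("NP", "VP")),
    ("S", ("NP", "VP")),
    ("S", ("VP",)),
    ("Nom", ("Noun",)),
    ("Nom", ("Noun",)),
    ("Nom", ("Noun",)),
    ("Nom", ("Adj", "Nom")),
]

rule_counts = Counter(observed)
lhs_counts = Counter(lhs for lhs, _ in observed)

# P(rhs | lhs) = count(lhs -> rhs) / count(lhs)
rule_prob = {rule: count / lhs_counts[rule[0]]
             for rule, count in rule_counts.items()}


def parse_probability(rules):
    """Score a parse as the product of the probabilities of its rules."""
    probability = 1.0
    for rule in rules:
        probability *= rule_prob.get(rule, 0.0)  # unseen rule: probability 0
    return probability


score = parse_probability([("S", ("NP", "VP")), ("Nom", ("Noun",))])
print(rule_prob[("Nom", ("Adj", "Nom"))])  # 0.25
print(round(score, 3))                     # 0.5
```

Given several candidate parses of one sentence, the one with the highest score would be chosen, which is how the probabilities help with ambiguity.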

Segmentation

• Sentence Segmentation

• Topic Segmentation

• Speaker Segmentation

• Phrase Chunking
  – NP, VP, PP, SubClause, etc.