1 languages, grammars, and regular expressions ling 570 fei xia week 2: 10/03/07 texpoint fonts used...

45
1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07

Post on 21-Dec-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

1

Languages, grammars, and regular expressions

LING 570

Fei Xia

Week 2: 10/03/07

Page 2: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

2

Today

• Hw1 is due: 1% penalty per hour after 3pm.

• Probability Theory (from last time)

• Formal languages, formal grammars, and regular expression J&M 2

• Hw2

Page 3: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

3

Coming up

• Next Monday– Finite-state automaton: J&M Chapter 2

• Next Wed– Finite-state transducer: J&M Section 3.4– Hw2 is due

Page 4: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

4

Probability theory(from last time)

Page 5: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

5

Common trick #1: Maximum likelihood estimation

• An example: toss a coin 3 times, and got two heads. What is the probability of getting a head with one toss?

• Maximum likelihood: (ML) * = arg max P(data | )

• In the example, – P(X=2) = 3 * p * p * (1-p) when p=2/3, P(X=2) reaches the maximum.

Page 6: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

6

Common trick #2:Chain rule

)|(*)()|(*)(),( YXPYPXYPXPYXP

),...|(),...,( 111

1 ii

in XXXPXXP

Page 7: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

7

Common trick #3:joint prob Marginal prob

y

yxPxP ),()(

nxx

nxxPxP,...,

11

2

),...,()(

Page 8: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

8

Common trick #4:Bayes’ rule

)(

)()|(

)(

),()|(

xP

yPyxP

xP

yxPxyP

)()|(maxarg

)(

)()|(maxarg

)|(maxarg*

yPyxP

xP

yPyxP

xyPy

y

y

y

Page 9: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

9

Independent random variables

• Two random variables X and Y are independent iff the value of X has no influence on the value of Y and vice versa.

• P(X,Y) = P(X) P(Y)

• P(Y|X) = P(Y)

• P(X|Y) = P(X)

• Our previous coin examples: P(X, C) != P(X) P(C)

Page 10: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

10

Conditional independence

Once we know the value of Z, knowing the value of Y does not help us predict the value of X, and vice versa.

• P(X,Y | Z) = P(X|Z) P(Y|Z)

• P(X|Y,Z) = P(X | Z)

• P(Y|X, Z) = P(Y |Z)

Page 11: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

11

Independence and conditional independence

• If A and B are independent, are they conditional independent?

• Example:– Burglar, Earthquake– Alarm

Page 12: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

12

Common trick #5:Independence assumption

)|(

),...|(),...,(

11

111

1

ii

i

ii

in

xxP

xxxPxxP

Page 13: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

13

An example

• P(w1 w2 … wn)

= P(w1) P(w2 | w1) P(w3 | w1 w2) * …

* P(wn | w1 …, wn-1)

¼ P(w1) P(w2 | w1) …. P(wn | wn-1)

• Why do we make independence assumption which we know are not true?

Page 14: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

14

Summary of elementaryprobability theory

• Basic concepts: sample space, event space, random variable, random vector

• Joint / conditional /marginal probability

• Independence and conditional independence

• Five common tricks:– Max likelihood estimation– Chain rule– Calculating marginal probability from joint probability– Bayes’ rule– Independence assumption

Page 15: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

15

Formal languages, formal grammars and

regular languages

Page 16: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

16

Unit #1

• Formal grammar, language and regular expression

• Finite-state automaton (FSA)

• Finite-state transducer (FST)

• Morphological analysis using FST

Page 17: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

17

Other units in ling570

• Unit #2: LM and smoothing

• Unit #3: HMM and POS tagging

• Unit #4: Classification and sequence labeling

• Unit #5: Introduction to other common tasks (e.g., IR/IE/WSD)

Page 18: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

18

Regular expression

• Two concepts:– Regular expression in formal language theory– Regular expression (or pattern) in pattern matching: it

is a way of expressing a pattern for the purpose of matching a string

• Both concepts describe a set of strings.

• The two concepts are closely related, but the latter is often more expressive than the former.

Page 19: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

19

Outline

• Formal language– Regular language

• Regular expression in formal language theory

• Formal grammars– Regular grammars

• Regular expression in pattern matching

Page 20: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

20

Formal languages

Page 21: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

21

Definition of formal language

• An alphabet is a finite set of symbols:– Ex: § = {a, b, c}

• A string is a finite sequence of symbols from a particular alphabet juxtaposed:– Ex: the string “baccab”– Ex: empty string ²

• A formal language is a set of strings defined over some alphabet.– Ex1: {aa, bb, cc, aaaa, abba, acca, baab, bbbb, ….}– Ex2: {an bn | n > 0} – Ex3: the empty set Á

Page 22: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

22

Definition of regular languages

• The class of regular languages over an alphabet § is formally defined as:– The empty set, Á, is a regular language– 8 a 2 § [ ², {a} is a regular language.– If L1 and L2 are regular languages, then so are:

(a) L1 ² L2 = {xy | x 2 L1; y 2 L2} (concatenation)

(b) L1 L2 (union or disjunction)

(c) L1* = {x1 x2 …, xn | xi 2 L1 , n 2 N} (Kleene closure)

– There are no other regular languages

Page 23: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

23

Kleene star

Another way to define L*:• L2 = L ² L• Ln = Ln-1 ² L• L* = { ²} L1 L2 …

Examples:• L = {a, bc}• L2 = {aa, abc, bca, bcbc}• L* = {abcbca, ….} = { (a|bc)*}

Page 24: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

24

Properties

• Regular languages are closed under– Concatenation – Union– Kleene closure

• Regular languages are also closed under: – Intersection: L1 Å L2

– Difference: L1 – L2

– Complementation: §* - L1

– Reversal

Page 25: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

25

Are the following languages regular?

• {a, aa, aaa, ….}• Any finite set of strings

• {xy | x2 §*, and y is the reverse of x}• {xx | x 2 §*}

• {an bn | n 2 N}• {an bn cn | n 2 N}

Page 26: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

26

Regular expression

Page 27: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

27

Definition of Regular expression(as in formal language theory)

• The set of regular expressions is defined as follows:(1) Every symbol of is a regular expression

(2) ² is a regular expression

(3) If r1 and r2 are regular expressions, so are (r1), r1 r2, r1 | r2 , r1*

(4) Nothing else is a regular expression.

§

Page 28: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

28

Examples

• ab*c• a (0|1|2|..|9)* b• (CV | CCV)+ C?C?: C is a consonant, V is

a vowel

Other operations that we can use:• a+ = a a* • a? = (a | ²)

Page 29: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

29

Relation between regular language and Regex

• With every regular expression we can associate a regular language.

• Conversely, every regular language can be obtained from a regular expression.

• Examples:– Regex = ab*c– Regular language = {ac, abc, abbc, ….}

Page 30: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

30

Formal grammar

Page 31: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

31

Definition of formal grammar

A formal grammar is a concise description of a formal language. It is a (N, §, P, S) tuple:

• A finite set N of nonterminal symbols

• A finite set Σ of terminal symbols that is disjoint from N

• A finite set P of production rules, each of the form: (§ N)* N (§ N)* ! (§ N)* • A distinguished symbol S 2 N that is the start symbol

Page 32: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

32

Chomsky hierarchyThe left-hand side of a rule must contain at least one non-terminal. ®, ¯, ° 2 (N §)*, A,B 2 N, a 2 §

• Type 0: unrestricted grammar: no other constraints.

• Type 1: Context-sensitive grammar: The rules must be of the form: ® A ¯ ! ® ° ¯

• Type 2: Context-free grammar (CFGs): The rules must be of the form: A ! ®

• Type 3: Regular grammar: The rules are of the forms: right regular grammar: A ! a, A ! aB, or A ! ² left regular grammar: A ! a, A ! Ba, or A ! ²

Are there other kinds of grammars?

Page 33: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

33

Strings generated from a grammar

• The rules are:S → x | y | z | S + S | S - S | S * S | S/S | (S)

• What strings can be generated?

• A grammar is ambiguous if there exists at least one string which has multiple parse trees.

• Is this grammar ambiguous?

Page 34: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

34

Languages generated by grammars

• Given a grammar G, L(G) is the set of strings that can be generated from G.

• Ex: G = (N, §, P, S)

N = {S}, § = {a, b, c} P = { S ! a S b, S ! c }

What is L(G)?

Page 35: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

35

The relation between regular grammars and regular languages

• The regular grammars describe exactly all regular languages.

• All the following are equivalent:– Regular languages– Regular grammars– Regular expression– Finite state automaton (FSA)

Page 36: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

36

Relation between grammars and languages (from wikipedia page)

Chomskyhierarchy

Grammars Languages Minimalautomaton

Type-0 Unrestricted Recursively enumerable

Turing machine

n/a (no common name) Recursive Decider

Type-1 Context-sensitive Context-sensitive Linear-bounded

n/a Indexed Indexed Nested stack

n/a Tree-adjoining Mildly context-sensitive

Thread

Type-2 Context-free Context-free Nondeterministic pushdown

n/a Deterministic context-free

Deterministic context-free

Deterministic pushdown

Type-3 Regular Regular Finite state

Page 37: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

37

How about human languages?

• Are they formal languages?– What is alphabet?– What is string?

• What type of formal languages are they?

Page 38: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

38

Outline

• Formal language – Regular language

• Regular expression in formal language theory

• Formal grammar– Regular grammar

• Patterns in pattern matching J&M 2.1

Page 39: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

39

Patterns in Perl

[ab] a|b. match any character^ the starting position in a string$ the ending position in a string(..) defines a marked subexpressiona* match “a” zero or more timesa+ match “a” one or more timea? match “a” zero or one timea{n,m} “a” appears n to m times

Page 40: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

40

Special symbols in the patterns

\s match any whitespace char\d match any digit\w match any letter or digit

\S match any non-whitespace char…

\+, \-, \., \?, \*, …

Page 41: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

41

Examples

Integer: (\+|\-)?\d+

Real: (\+|\-)?\d+\.\d+

Scientific notation: (\+|\-)? \d+ (\.\d+)?e (\+|\-)?\d+

Any of the three:

(\+|\-)? \d+ (\.\d+)? (e (\+|\-)?\d+)?

Page 42: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

42

Patterns in Perl and Regex

/^(.*)\1$/ { xx | x 2 §*}

/^(.+)a(.+)\1\2$/ {xayxy | x, y 2 §*}

The extra power comes from the ability to refer to marked subexpression.

Page 43: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

43

Outline

• Formal language – Regular language

• Regular expression in formal language theory

• Formal grammars– Regular grammars

• Regex patterns in pattern matching

Page 44: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

44

Homework #2

Page 45: 1 Languages, grammars, and regular expressions LING 570 Fei Xia Week 2: 10/03/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete

45

• Part I: probability

• Part II: formal grammar

• Part III: regular expression