1 languages, grammars, and regular expressions ling 570 fei xia week 2: 10/03/07 texpoint fonts used...
Post on 21-Dec-2015
216 views
TRANSCRIPT
1
Languages, grammars, and regular expressions
LING 570
Fei Xia
Week 2: 10/03/07
2
Today
• Hw1 is due: 1% penalty per hour after 3pm.
• Probability Theory (from last time)
• Formal languages, formal grammars, and regular expression J&M 2
• Hw2
3
Coming up
• Next Monday– Finite-state automaton: J&M Chapter 2
• Next Wed– Finite-state transducer: J&M Section 3.4– Hw2 is due
4
Probability theory(from last time)
5
Common trick #1: Maximum likelihood estimation
• An example: toss a coin 3 times, and got two heads. What is the probability of getting a head with one toss?
• Maximum likelihood: (ML) * = arg max P(data | )
• In the example, – P(X=2) = 3 * p * p * (1-p) when p=2/3, P(X=2) reaches the maximum.
6
Common trick #2:Chain rule
)|(*)()|(*)(),( YXPYPXYPXPYXP
),...|(),...,( 111
1 ii
in XXXPXXP
7
Common trick #3:joint prob Marginal prob
y
yxPxP ),()(
nxx
nxxPxP,...,
11
2
),...,()(
8
Common trick #4:Bayes’ rule
)(
)()|(
)(
),()|(
xP
yPyxP
xP
yxPxyP
)()|(maxarg
)(
)()|(maxarg
)|(maxarg*
yPyxP
xP
yPyxP
xyPy
y
y
y
9
Independent random variables
• Two random variables X and Y are independent iff the value of X has no influence on the value of Y and vice versa.
• P(X,Y) = P(X) P(Y)
• P(Y|X) = P(Y)
• P(X|Y) = P(X)
• Our previous coin examples: P(X, C) != P(X) P(C)
10
Conditional independence
Once we know the value of Z, knowing the value of Y does not help us predict the value of X, and vice versa.
• P(X,Y | Z) = P(X|Z) P(Y|Z)
• P(X|Y,Z) = P(X | Z)
• P(Y|X, Z) = P(Y |Z)
11
Independence and conditional independence
• If A and B are independent, are they conditional independent?
• Example:– Burglar, Earthquake– Alarm
12
Common trick #5:Independence assumption
)|(
),...|(),...,(
11
111
1
ii
i
ii
in
xxP
xxxPxxP
13
An example
• P(w1 w2 … wn)
= P(w1) P(w2 | w1) P(w3 | w1 w2) * …
* P(wn | w1 …, wn-1)
¼ P(w1) P(w2 | w1) …. P(wn | wn-1)
• Why do we make independence assumption which we know are not true?
14
Summary of elementaryprobability theory
• Basic concepts: sample space, event space, random variable, random vector
• Joint / conditional /marginal probability
• Independence and conditional independence
• Five common tricks:– Max likelihood estimation– Chain rule– Calculating marginal probability from joint probability– Bayes’ rule– Independence assumption
15
Formal languages, formal grammars and
regular languages
16
Unit #1
• Formal grammar, language and regular expression
• Finite-state automaton (FSA)
• Finite-state transducer (FST)
• Morphological analysis using FST
17
Other units in ling570
• Unit #2: LM and smoothing
• Unit #3: HMM and POS tagging
• Unit #4: Classification and sequence labeling
• Unit #5: Introduction to other common tasks (e.g., IR/IE/WSD)
18
Regular expression
• Two concepts:– Regular expression in formal language theory– Regular expression (or pattern) in pattern matching: it
is a way of expressing a pattern for the purpose of matching a string
• Both concepts describe a set of strings.
• The two concepts are closely related, but the latter is often more expressive than the former.
19
Outline
• Formal language– Regular language
• Regular expression in formal language theory
• Formal grammars– Regular grammars
• Regular expression in pattern matching
20
Formal languages
21
Definition of formal language
• An alphabet is a finite set of symbols:– Ex: § = {a, b, c}
• A string is a finite sequence of symbols from a particular alphabet juxtaposed:– Ex: the string “baccab”– Ex: empty string ²
• A formal language is a set of strings defined over some alphabet.– Ex1: {aa, bb, cc, aaaa, abba, acca, baab, bbbb, ….}– Ex2: {an bn | n > 0} – Ex3: the empty set Á
22
Definition of regular languages
• The class of regular languages over an alphabet § is formally defined as:– The empty set, Á, is a regular language– 8 a 2 § [ ², {a} is a regular language.– If L1 and L2 are regular languages, then so are:
(a) L1 ² L2 = {xy | x 2 L1; y 2 L2} (concatenation)
(b) L1 L2 (union or disjunction)
(c) L1* = {x1 x2 …, xn | xi 2 L1 , n 2 N} (Kleene closure)
– There are no other regular languages
23
Kleene star
Another way to define L*:• L2 = L ² L• Ln = Ln-1 ² L• L* = { ²} L1 L2 …
Examples:• L = {a, bc}• L2 = {aa, abc, bca, bcbc}• L* = {abcbca, ….} = { (a|bc)*}
24
Properties
• Regular languages are closed under– Concatenation – Union– Kleene closure
• Regular languages are also closed under: – Intersection: L1 Å L2
– Difference: L1 – L2
– Complementation: §* - L1
– Reversal
25
Are the following languages regular?
• {a, aa, aaa, ….}• Any finite set of strings
• {xy | x2 §*, and y is the reverse of x}• {xx | x 2 §*}
• {an bn | n 2 N}• {an bn cn | n 2 N}
26
Regular expression
27
Definition of Regular expression(as in formal language theory)
• The set of regular expressions is defined as follows:(1) Every symbol of is a regular expression
(2) ² is a regular expression
(3) If r1 and r2 are regular expressions, so are (r1), r1 r2, r1 | r2 , r1*
(4) Nothing else is a regular expression.
§
28
Examples
• ab*c• a (0|1|2|..|9)* b• (CV | CCV)+ C?C?: C is a consonant, V is
a vowel
Other operations that we can use:• a+ = a a* • a? = (a | ²)
29
Relation between regular language and Regex
• With every regular expression we can associate a regular language.
• Conversely, every regular language can be obtained from a regular expression.
• Examples:– Regex = ab*c– Regular language = {ac, abc, abbc, ….}
30
Formal grammar
31
Definition of formal grammar
A formal grammar is a concise description of a formal language. It is a (N, §, P, S) tuple:
• A finite set N of nonterminal symbols
• A finite set Σ of terminal symbols that is disjoint from N
• A finite set P of production rules, each of the form: (§ N)* N (§ N)* ! (§ N)* • A distinguished symbol S 2 N that is the start symbol
32
Chomsky hierarchyThe left-hand side of a rule must contain at least one non-terminal. ®, ¯, ° 2 (N §)*, A,B 2 N, a 2 §
• Type 0: unrestricted grammar: no other constraints.
• Type 1: Context-sensitive grammar: The rules must be of the form: ® A ¯ ! ® ° ¯
• Type 2: Context-free grammar (CFGs): The rules must be of the form: A ! ®
• Type 3: Regular grammar: The rules are of the forms: right regular grammar: A ! a, A ! aB, or A ! ² left regular grammar: A ! a, A ! Ba, or A ! ²
Are there other kinds of grammars?
33
Strings generated from a grammar
• The rules are:S → x | y | z | S + S | S - S | S * S | S/S | (S)
• What strings can be generated?
• A grammar is ambiguous if there exists at least one string which has multiple parse trees.
• Is this grammar ambiguous?
34
Languages generated by grammars
• Given a grammar G, L(G) is the set of strings that can be generated from G.
• Ex: G = (N, §, P, S)
N = {S}, § = {a, b, c} P = { S ! a S b, S ! c }
What is L(G)?
35
The relation between regular grammars and regular languages
• The regular grammars describe exactly all regular languages.
• All the following are equivalent:– Regular languages– Regular grammars– Regular expression– Finite state automaton (FSA)
36
Relation between grammars and languages (from wikipedia page)
Chomskyhierarchy
Grammars Languages Minimalautomaton
Type-0 Unrestricted Recursively enumerable
Turing machine
n/a (no common name) Recursive Decider
Type-1 Context-sensitive Context-sensitive Linear-bounded
n/a Indexed Indexed Nested stack
n/a Tree-adjoining Mildly context-sensitive
Thread
Type-2 Context-free Context-free Nondeterministic pushdown
n/a Deterministic context-free
Deterministic context-free
Deterministic pushdown
Type-3 Regular Regular Finite state
37
How about human languages?
• Are they formal languages?– What is alphabet?– What is string?
• What type of formal languages are they?
38
Outline
• Formal language – Regular language
• Regular expression in formal language theory
• Formal grammar– Regular grammar
• Patterns in pattern matching J&M 2.1
39
Patterns in Perl
[ab] a|b. match any character^ the starting position in a string$ the ending position in a string(..) defines a marked subexpressiona* match “a” zero or more timesa+ match “a” one or more timea? match “a” zero or one timea{n,m} “a” appears n to m times
40
Special symbols in the patterns
\s match any whitespace char\d match any digit\w match any letter or digit
\S match any non-whitespace char…
\+, \-, \., \?, \*, …
41
Examples
Integer: (\+|\-)?\d+
Real: (\+|\-)?\d+\.\d+
Scientific notation: (\+|\-)? \d+ (\.\d+)?e (\+|\-)?\d+
Any of the three:
(\+|\-)? \d+ (\.\d+)? (e (\+|\-)?\d+)?
42
Patterns in Perl and Regex
/^(.*)\1$/ { xx | x 2 §*}
/^(.+)a(.+)\1\2$/ {xayxy | x, y 2 §*}
The extra power comes from the ability to refer to marked subexpression.
43
Outline
• Formal language – Regular language
• Regular expression in formal language theory
• Formal grammars– Regular grammars
• Regex patterns in pattern matching
44
Homework #2
45
• Part I: probability
• Part II: formal grammar
• Part III: regular expression