nl!grammar!hierarchies! and!forfal!lang’ages ... · eecs6339 3.0 introduction to computational...
Post on 18-Aug-2020
2 Views
Preview:
TRANSCRIPT
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
1
NL Grammar Hierarchies And ForFal Lang'ages Quick Review of Reg'lar ExOressions, Finite State Automata, Markov Algorithms Representations of Lang'ages, Grammars
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 LAS -‐ nick@cse.yorku.ca
2
Reg'lar ExOressions (summarU) • (empty set) ∅ denoting the set ∅. • (empty string) ε denoting the set containing only the "empty" string, which has no
characters at all. • (literal character) a in Σ denoting the set containing only the character a. • The following operations are defined: • (concatenation) RS denoting the set { αβ | α in R and β in S }. For example {"ab",
"c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}. • (alternation) R | S denoting the set union of R and S. For example {"ab", "c"}|{"ab",
"d", "ef"} = {"ab", "c", "d", "ef"}. • (Kleene star) R* denoting the smallest superset of R that contains ε and is closed
under string concatenation. This is the set of all strings that can be made by concatenating any finite number (including zero) of strings from R. For example, {"0","1"}* is the set of all finite binary strings (including the empty string), and {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "abcab", ... }.
• nothing else is a regular expression over ∑
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 LAS -‐ nick@cse.yorku.ca
3
Finite State Automata • Automata are models of computation: they compute lang'ages.
• A finite-‐state automaton is a five-‐t'ple {Q, q0, ∑, δ, F}, where ∑ is a finite set of alphabet syFbols, Q is a finite set of states, q0 ∈ Q is the initial state, F ⊆ Q is a set of final (accepting) states and δ : Q × ∑ × Q is a relation _om states and alphabet syFbols to states.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 LAS -‐ nick@cse.yorku.ca
4
Finite State Automata • Example: Finite-‐state automaton
• Q = {q0, q1, q2, b3}
• ∑ = {c, a, t, r}
• F = {b3}
• δ = {q0, c, q1>, q1, a, q2>, q2, t, b3>, q2, r , b3>}
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 LAS -‐ nick@cse.yorku.ca
5
Finite State Automata
• The reflexive t&ansitive exfension of the t&ansition relation δ is a new relation, ˆδ, defined as follows:
– for everU state q ∈ Q, (b, ε, b) ∈ ˆδ
– for everU st&ing w ∈ ∑∗ and leier a ∈ ∑, if (q,w, q′) ∈ ˆδ and (q′, a, q′′) ∈ δ then (q,w ·∙ a, q′′) ∈ ˆδ.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
6
Finite State Automata
• Example: Paths
For the finite-‐state automaton:
ˆδ is the following set of t&iples:
q0, ǫ, q0>, q1, ǫ, q1>, q2, ǫ, q2>, b3, ǫ, b3>, q0, c, q1>, q1, a, q2>, q2, t, b3>, q2, r , b3>,
q0, ca, q2>, q1, at, b3>, q1, ar , b3>,
q0, cat, b3>, q0, car , b3>
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
7
Finite State Automata
An exfension: ε-‐moves.
• The t&ansition relation δ is exfended to:
δ ⊆ Q × (∑ ∪ {ε}) × Q
Example: Automata with ε-‐moves -‐ an automaton accepting the lang'age {do, undo, done, undone}:
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
8
ForFal lang'age theorU – definitions
• If L is a lang'age then the reversal of L, denoted LR, is the lang'age {w | wR ∈ L}.
• If L1 and L2 are lang'ages, then L1 ·∙ L2 = {w1 ·∙ w2 | w1 ∈ L1 and w2 ∈ L2}.
• Example: Lang'age operations
– Let L1 = {i, you, he, she, it, we, they}, L2 = {smile, sleep}.
– Then L1R = {i, uoy, eh, ehs, ti, ew, yeht} and L1 ·∙ L2 = {ismile, yousmile, hesmile, shesmile, itsmile, wesmile, theysmile, isleep, yousleep, hesleep, shesleep, itsleep, wesleep, theysleep}.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
9
ForFal lang'age theorU – definitions
• If L is a lang'age then the reversal of L, denoted LR, is the lang'age {w | wR ∈ L}.
• If L1 and L2 are lang'ages, then L1 ·∙ L2 = {w1 ·∙ w2 | w1 ∈ L1 and w2 ∈ L2}.
• Example: Lang'age operations
– Let L1 = {i, you, he, she, it, we, they}, L2 = {smile, sleep}.
– Then L1R = {i, uoy, eh, ehs, ti, ew, yeht} and L1 ·∙ L2 = {ismile, yousmile, hesmile, shesmile, itsmile, wesmile, theysmile, isleep, yousleep, hesleep, shesleep, itsleep, wesleep, theysleep}.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
10
ForFal lang'age theorU – definitions
• If L is a lang'age then the reversal of L, denoted LR, is the lang'age {w | wR ∈ L}.
• If L1 and L2 are lang'ages, then L1 ·∙ L2 = {w1 ·∙ w2 | w1 ∈ L1 and w2 ∈ L2}.
• Example: Lang'age operations
– Let L1 = {i, you, he, she, it, we, they}, L2 = {smile, sleep}.
– Then L1R = {i, uoy, eh, ehs, ti, ew, yeht} and L1 ·∙ L2 = {ismile, yousmile, hesmile, shesmile, itsmile, wesmile, theysmile, isleep, yousleep, hesleep, shesleep, itsleep, wesleep, theysleep}.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
11
ForFal lang'age theorU – definitions
• If L is a lang'age then L0 = {ε}.
• Then, for i > 0, Li = L ·∙ Li−1.
• Example: Lang'age exOonentiation
Let L be the set of words {bau, haus, hof, _au}. Then L0 = {ε}, L1 = L and L2 = {baubau, bauhaus, bauhof, bau_au, hausbau, haushaus, haushof, haus_au, horau, hosaus, hosof, hotau, _aubau, _auhaus, _auhof, _au_au}.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
ForFal lang'age theorU – definitions • The Kleene closure of L and is denoted L∗ and is defined as ∪∞
i=0 Li . • L+ = ∪∞
i=0 i=1 Li • Example: Kleene closure
Let L = {dog, cat}. Obserye that L0 = {ε}, L1 = {dog, cat}, L2 = {catcat, catdog, dogcat, dogdog}, etc. Thus L∗ contains, among its infinite set of st&ings, the st&ings ε, cat, dog, catcat, catdog, dogcat, dogdog, catcatcat, catdogcat, dogcatcat, dogdogcat, etc.
• The notation for ∗ should now become clear: it is simply a special case of L∗, where L = ∑
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
12
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
13
Markov Algorithms A Markov Algorithm is a finite sequence P1, P2,...,Pn of Markov productions to be applied to st&ings in a given alphabet according to the following r'les. Let S be a given st&ing. The sequence is searched to find the first production Pi whose antecedent occurs in S. If no such production exists, the operation of the algorithm halts without change in S. If there is a production in the algorithm whose antecedent occurs in S, the first such production is applied to S. If this is a conclusive production, the operation of the algorithm halts without f'rfher change in S. If this is a simple production, a new search is conducted using the st&ing S' into which S has been t&ansforFed. If the operation of the algorithm ultimately ceases with a st&ing S*, we say that S* is the result of applying the algorithm to S.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
14
Markov Algorithms Example:
Take the alphabet to be {a, b, c, d}. The algorithm is given below.
• Algorithm M1
M11: [conclusive] a d → •d c M12: [simple] b a → W
M13: [simple] a → b c
M14: [simple] b c → b b a M15: [simple] W → a
Taking S = “dcb” we apply the algorithm by M15 dcb becomes adcb by M11 adcb becomes dccb and halts.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
15
Markov Algorithms Example:
Let β be a marker not in the alphabet. If S is a st&ing in the alphabet, the result of applying algorithm M3 to S is the st&ing SA.
Algorithm M3 M31: [interchange] β δ → δ β δ, A ∈ member of alphabet M32: [conclusive] β → •A M33: W → β Since S initially does not contain β, the third production is then used to move β past the syFbols in S. If S contains n occur&ences of syFbols, then aster n steps we obtain the st&ing Sβ. At this point the first production no longer applies, and the second production produces SA. Since this production is conclusive, the st&ing SA is then the result.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
16
Markov Algorithms
In the preceding example, we have int&oduced a new notation. Namely, in the first production we have used the variable δ which ranges over the syFbols in the alphabet. Thus the first line is not really a production, but rather a production schema, denoting all the productions which can be obtained by substit'ting syFbols of the alphabet for δ.
Because of the manner in which the Markov algorithms are used, the order in which the productions are wriien is vital. If the first t�o lines of algorithm M3 were interchanged, the result would be to t&ansforF S into AS, rather than into SA, and the productions represented by βδ → δβ would never be used.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
17
Markov Algorithms Example:
Another procedure which is quite common is that of reversing a st&ing of characters. We do this by moving the first character to the end as before, then moving the nexf character down to the position just preceding the first character, and so on. Markers: µ, β
Algorithm M10 M101: µβ → •W δ, f ∈ members of the alphabet M102: µδβ → βδ M103: µδf → fµδ M104: µδ → βδ M105: W → µ
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
18
Markov Algorithms Illust&ating this algorithm on the st&ing “ABCD” we have
by M105 => µ A B C D
by M103 => B µ A C D
by M103 => B C µ A D
by M103 => B C D µ A
by M104 => B C D β A
by M105 => µ B C D β A
by M103 => C µ B D β A
by M103 => C D µ B β A by M102 => C D β B A
by M105 => µ C D β B A
by M103 => D µ C β B A
by M102 => D β C B A
by M105 => µ D β C B A by M102 => β D C B A by M105 => µ β D C B A by M101 => D C B A
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
19
Representations of Lang'ages How can we decide on a finite representations for lang'ages? One way to represent a lang'age is to give an algorithm which deterFines if a sentence is in the lang'age or not. A more general way is to give a procedure which halts with the answer "yes" for sentences in the lang'age and either does not terFinate or else halts with the answer "no" for sentences not in the lang'age. Such a procedure or algorithm is said to recog�ize the lang'age. As it t'r�s out, there are lang'ages we can recog�ize by a procedure, but not by any algorithm. The method described can represent lang'ages _om a recog�ition point of view. We can also represent lang'ages _om a generative point of view. That is, we can give a procedure which systematically generates successive sentences of the lang'age in some order.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
20
Representations of Lang'ages If we can recog�ize the sentences of a lang'age over alphabet V with either an algorithm or a procedure, then we can generate the lang'age, since we can systematically generate all sentences in V*, test each sentence to see if it is in the lang'age, and outOut in a list only those sentences in the lang'age. One must be caref'l in doing so. For if we generate the sentences in order and use a procedure which does not always halt for testing the sentences, we never get beyond the first sentence for which the procedure does not halt. The way to get around this problem is to organize the testing in such a manner that the procedure never continues to test one sentence forever. This organization requires that we int&oduce several const&'ctions.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
21
Representations of Lang'ages Assume that there are p syFbols in V. We can think of the sentences of V* as numbers represented in base p, plus the emptU sentence e. We can number the sentences in order of increasing lengfh and in "numerical" order for sentences of the same lengfh. In Fig. (a) we have the enumeration of the sentences of{a, b, c}*. We have implicitly assumed that a, b, and c cor&espond to 0, 1, and 2, respectively. (This arg'ment shows that the set of sentences over an alphabet is countable.)
1 E 2 a 3 b 4 c 5 aa 6 ab 7 ac 8 ba 9 bb • ° Fig. (a) The enumeration of sentences in {a, b, c}*
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
22
Representations of Lang'ages Let P be a procedure for testing a sentence to see if the sentence is in a lang'age L. We assume that P can be broken down into discrete steps so that it makes sense to talk about the ith step in the procedure for any given sentence. Before giving a procedure to enumerate the sentences of L, we first give a procedure to enumerate pairs of positive integers. We can map all ordered pairs of positive integers (x, y) onto the set of positive integers as shown in Fig. (b) by the forFula
z = {[(x + y-‐ 1)(x + y – 2)]/2} +y .
xê yè
1 2 3 4 5
1 1 3 6 10 15 2 2 5 9 14 3 4 8 13 . 4 7 12 . 5 11
Fig. (b) Mapping of ordered pairs of integers onto the integers
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
23
Representations of Lang'ages We can enumerate ordered pairs of integers according to the assig�ed value of z. Thus the first few pairs are (1, 1), (2, 1), (1, 2), (3, 1), (2, 2), . . . . Given any pair of integers (i, j), it will event'ally appear in the list. In fact, it will be the [( i+J -‐ 1) ( i+j -‐ 2)] / 2 + jth pair enumerated. We can now give a procedure for enumerating the st&ings of L. Enumerate ordered pairs of integers. When the pair (i, j) is enumerated, generate the ith sentence in V* and apply the first j steps of procedure P to the sentence. Whenever it is deterFined that a generated sentence is in L, add that sentence to the list of members of L. If word i is in L, it will be so deterFined by P in j steps, for some finite j. When (i, j) is enumerated, word i will be generated. This procedure will indeed enumerate all sentences in L.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
24
Representations of Lang'ages If we have a procedure for generating the sentences of a lang'age, then we can const&'ct a procedure for recog�izing the sentences of the lang'age, but not necessarily an algorithm. To deterFine if a sentence x is in L, simply enumerate the sentences of L and compare x with each sentence. If x is generated, the procedure halts, having recog�ized x as being in L. Of course, if x is not in L, the procedure will never terFinate. A lang'age whose sentences can be generated by a procedure is said to be recursively enumerable. Alter�atively, a lang'age is said to be recursively enumerable if there is a procedure for recog�izing the sentences of the lang'age. A lang'age is said to be recursive if there exists an algorithm for recog�izing the lang'age. In forFal lang'age theorU, the class of recursive lang'ages is a proper subset of the class of recursively enumerable lang'ages.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
25
Representations of Lang'ages There are lang'ages which are not even recursively enumerable. That is, there are lang'ages for which we cannot even effectively list the sentences of the lang'age. Asking the question, "What is lang'age theorU ?," we can say that lang'age theorU is the st'dy of sets of st&ings of syFbols, their representations, st&'ct'res, and properfies. Beyond this, we shall leave the question to be answered in a course on forFal g&ammars.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
26
Grammars There is one class of generating systems of primarU interest to us, systems known as g&ammars. The concept of a g&ammar was originally forFalized by ling'ists in their st'dy of nat'ral lang'ages. Ling'ists were concer�ed not only with defining precisely what is or is not a valid sentence of a lang'age, but also with providing st&'ct'ral descriptions of the sentences. One of their goals was to develop a forFal g&ammar capable of describing English. It was hoped that if, for example, one had a forFal g&ammar to describe the English lang'age, one could use the computer in ways that require it to "understand" English. Such a use might be lang'age t&anslation or the computer solution of word problems. To date, this goal is for the most parf unrealized in its totalitU.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
27
Grammars We do not have a definitive g&ammar for English, and there is even disag&eement as to what tUOes of forFal g&ammar are capable of describing English. However, in describing computer lang'ages, beier results have been achieved. For example, the Backus NorFal ForF used to describe ALGOL is a "contexf _ee g&ammar," a tUOe of g&ammar with which we shall st'dy and build upon. We are all familiar with the idea of diag&amming or parsing an English sentence. For example, the sentence "The liile boy ran quickly" is parsed by noting that the sentence consists of the noun phrase "The liile boy" followed by the verb phrase "ran quickly." The noun phrase is then broken down into the sing'lar noun "boy" modified by the t�o adjectives "The” and "liile." The verb phrase is broken down into the sing'lar verb "ran” modified by the adverb "quickly." This sentence st&'ct're is indicated in the diag&am of Fig. (c).
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
28 Fig. (c). A diag&am of the sentence "The liile boy ran quickly."
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
29
Grammars We recog�ize the sentence st&'ct're as being g&ammatically cor&ect. If we had a complete set of r'les for parsing all English sentences, then we would have a technique for deterFining whether or not a sentence is g&ammatically cor&ect. However, such a set does not exist. Parf of the reason for this stems _om the fact that there are no clear r'les for deterFining precisely what constit'tes a sentence. The r'les we applied to parsing the above sentence can be wriien in the following forF:
<sentence> → <noun phrase> <verb phrase> <noun phrase> → <adjective> <noun phrase> <noun phrase> → <adjective> <sing'lar noun> <verb phrase> → <sing'lar verb> <adverb> <adjective> → The | liile <sing'lar noun> → boy <sing'lar verb> → ran <adverb> → quickly
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
30
Grammars The ar&ow in the r'les indicates that the item to the lest of the ar&ow can generate the items to the right of the ar&ow. Note that we have enclosed the names of the parfs of the sentence such as noun, verb, verb phrase, etc., in brackets to avoid conf'sion with the English words and phrases "noun," "verb," "verb phrase," etc. Note that we cannot only test sentences for their g&ammatical cor&ect�ess, but can also generate g&ammatically cor&ect sentences by starfing with (sentence) and replacing (sentence) by (noun phrase) followed by (verb phrase). Nexf we select one of the t�o r'les for (noun phrase) and apply it, and so on, until no f'rfher application of the r'les is possible. Thus any one of an infinite number of sentences can be derived, e.g., a st&ing of occur&ences of "the" and "liile" followed by "boy ran quickly" such as "liile the the boy ran quickly" can be generated. Most of the sentences do not make sense but, neverfheless, are g&ammatically cor&ect.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 31
ForFal Notion of a Grammar
• ForFally, we denote a g'ammar G by (Vn, Vt, P, S). The syFbols Vn, Vt, P, and S are, respectively, the variables, ter2inals, productions, and star9 sy2bol. Vn, Vt, and P are finite sets. We assume that Vn and Vt contain no elements in common; i.e., Vn ∩ Vt = ε where ε denotes the emptU set and, conventionally Vn ∪ Vt = V The set of productions P consists of exOressions of the forF a → b, where a is a st&ing in V+ and b is a st&ing in V*. Finally, S is always a syFbol in Vn.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 32
ForFal Notion of a Grammar • Customarily, we shall use capital Latin alphabet leiers for variables. Lower case leiers at the
beginning of the Latin alphabet are used for terFinals. St&ings of terFinals are denoted by lower case leiers near the end of the Latin alphabet, and st&ings of variables and terFinals are denoted by lower case Greek leiers.
• Given the g&ammar G = (Vn, Vt, P, S), we define the lang'age it generates as follows. If a → b is a production of P and g and d are st&ings in V*, then gad → gbd.
• In English we say gad directly derives gbd in g'ammar G. We say that the production a → b is applied to the st&ing gad to obtain gbd. Thus G relates t�o st&ings exactly when the second is obtained _om the first by the application of a single production.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 33
ForFal Notion of a Grammar
We define g&ammars G1 and G2 to be equivalent if L(G1) = L(G2).
Example. Let us consider a g&ammar G = (VN, VT, P, S), where VN = {S}, VT = {0, 1}, P = {S ~ 0S1, S -‐+ 01}. Here, S is the only variable, 0 and 1 are terFinals. There are t�o productions, S → 0S1 and S → 01. By applying the first production n -‐ 1 times, followed by an application of the second production, we have
S → 0S1 → 00Sll → 03S13 ==> . . . ==> 0n-‐iS1 n-‐1 → 0n1n.§
FurfherFore, these are the only st&ings in L(G). Aster using the second production, we find that the number of S's in the sentential forF decreases by one. Each time the first production is used, the number of S's remains the same. Thus, aster using S → 01, no S's remain in the resulting st&ing.
Since both productions have an S on the lest, the only order in which the productions can be applied is S-‐+ 0S1 some number of times followed by one application of S ~ 01. Thus, L(G) = {0n1n|n ≥1}.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 34
ForFal Notion of a Grammar
The example was a simple example of a g&ammar. It was relatively easy to deterFine which words were derivable and which were not. In general, it may be exceedingly hard to deterFine what is generated by the g&ammar.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 35
TyOes of Grammars
The Chomsky hierarchy of lang'ages
• A hierarchy of classes of lang'ages, viewed as sets of st&ings, ordered by their “complexitU”. The higher the lang'age is in the hierarchy, the more “complex” it is.
• In parficular, the class of lang'ages in one class properly includes the lang'ages in lower classes.
• There exists a cor&espondence bet�een the class of lang'ages and the forFat of phrase-‐st&'ct're r'les necessarU for generating all its lang'ages. The more rest&icted are the r'les, the lower in the hierarchy are the lang'ages they generate.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 36
The Chomsky hierarchy of lang'ages
• Recursively enumerable lang'ages
• Contexf-‐sensitive lang'ages
• Contexf-‐_ee lang'ages
• Reg'lar lang'ages
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 37
The Chomsky hierarchy of lang'ages
• TyOe-‐0 g&ammars (unrest&icted g&ammars) include all forFal g&ammars. They generate exactly all lang'ages that can be recog�ized by a Turing machine. These lang'ages are also known as the recursively enumerable lang'ages.
• TyOe-‐1 g&ammars (contexf-‐sensitive g&ammars) generate the contexf-‐sensitive lang'ages. These g&ammars have r'les of the forF αAβ → αγβ with A a nonterFinal and α, β and γ st&ings of terFinals and nonterFinals. The st&ings α and β may be emptU, but γ must be nonemptU. The r'le S→ε is allowed if S does not appear on the right side of any r'le. The lang'ages described by these g&ammars are exactly all lang'ages that can be recog�ized by a linear bounded automaton (a nondeterFinistic Turing machine whose tape is bounded by a constant times the lengfh of the input.)
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 38
The Chomsky hierarchy of lang'ages • TyOe-‐2 g&ammars (contexf-‐_ee g&ammars) generate the contexf-‐_ee lang'ages defined by r'les of the forF
A→γ with A a nonterFinal and γ a st&ing of terFinals and nonterFinals. These lang'ages are exactly all lang'ages recog�ized by a non-‐deterFinistic pushdown automaton. Contexf-‐_ee lang'ages are the basis for most prog&amming lang'ages.
• TyOe-‐3 g&ammars (reg'lar g&ammars) generate the reg'lar lang'ages rest&icting its r'les to a single nonterFinal on the LHS and a RHS consisting of a single terFinal, possibly followed (or preceded, but not both in the same g&ammar) by a single nonterFinal. The r'le S→ε is allowed if S does not appear on the RHS of any r'le. These lang'ages are those that can be decided by a finite state automaton. This family of lang'ages can be obtained by reg'lar exOressions. Reg'lar lang'ages are commonly used to define search paier�s and the lexical st&'ct're of prog&amming lang'ages.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 39
The Chomsky hierarchy of lang'ages
EverU reg'lar lang'age is contexf-‐_ee, everU contexf-‐_ee lang'age, not containing the emptU st&ing, is contexf-‐sensitive and everU contexf-‐sensitive lang'age is recursive and everU recursive lang'age is recursively enumerable. These are all proper inclusions, meaning that there exist recursively enumerable lang'ages which are not contexf-‐sensitive, contexf-‐sensitive lang'ages which are not contexf-‐_ee and contexf-‐_ee lang'ages which are not reg'lar.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 40
The Chomsky hierarchy of lang'ages
The following table summarizes each of Chomsky's four tUOes of g&ammars, the class of lang'age it generates, the tUOe of automaton that recog�izes it, and the forF its r'les must have.’
Grammar Languages Automaton Production rules (constraints )
Type -0 Recursively enumerable Turing mac hine α _ β (no restrictions)
Type -1 Context -sensitive Linear -bounded non -deterministic Turing mac hine
αAβ _ αγβ
Type -2 Context -free Non -deterministic pushdown autom aton A _ γ
Type -3 Regular Finite state au tom aton A _ a and A _ a B
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 41
Why is it interesting?
• The hierarchy represents some inforFal notion of the complexitU of nat'ral lang'ages
• It can help accept or reject ling'istic theories
• It can shed light on questions of human processing of lang'age
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 42
What exactly is the question?
• When viewed as a set of st&ings, is English a reg'lar lang'age? Is it contexf-‐_ee? How about Hebrew?
• Competence vs. PerforFance
– This is the dog, that wor&ied the cat, that killed the rat, that ate the malt, that lay in the house that Jack built.
– This is the malt that the rat that the cat that the dog wor&ied killed ate
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 43
Where are nat'ral lang'ages located?
• Chomsky (1957): “English is not a reg'lar lang'age”
• As for contexf-‐_ee lang'ages, “I do not know whether or not English is itself literally outside the range of such analyses”
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 44
How not to do it
An int&oduction to the principles of t&ansforFational sy�tax (Akmajian and Heny, 1976)
“Since there seem to be no way of using such PS rAles to represent an obviously sigBificant generalization about one langAage, namely, English, we can be sure that phrase st'ActAre g'ammars cannot possibly represent all the sigBificant aspects of langAage st'ActAre. We must int'oduce a new kind of rAle that will per2it us to do so.”
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 45
How not to do it
Example: Sy�tax (Peter Culicover, 1976)
In general, for any phrase st'ActAre g'ammar containing a finite number of rAles it will always be possible to const'Act a sentence that the g'ammar will not generate. In fact, because of recursion there will always be an infinite number of such sentences. Hence, the phrase st'ActAre analysis will not be sufficient to generate English.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 46
How not to do it
Example: TransforFational g&ammar (Grinder & Elgin, 1973)
the girl saw the boy
∗the girl kiss the boy
this well-‐known sy�tactic phenomenon demonst&ates clearly the inadequacy of ... contexf-‐_ee phrase-‐st&'ct're g&ammars...
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Instructor: Nick Cercone - 3050 CSEB - nick@cse.yorku.ca 47
How not to do it
The defining characteristic of a contexf-‐_ee r'le is that the syFbol to be rewriien is to be rewriien without reference to the contexf in which it occurs. By definition, one cannot write a contexf-‐_ee r'le that will exOand the syFbol V into kiss in the contexf of being immediately preceded by the sequence the girls and that will exOand the syFbol V into kisses in the contexf of being immediately preceded by the sequence the girl. Any set of contexf-‐_ee r'les that generate (cor&ectly) the sequences the girl kisses the boy and the girls kiss the boy will also generate (incor&ectly) the sequences the girl kiss the boy and the girls kisses the boy.
EECS6339 3.0 Introduction to Computational Linguistics Instructor: Nick Cercone – 3050 LAS – nick.cercone@lassonde.yorku.ca
Tuesdays, Thursdays 10:00-11:20 – LAS 3033 Winter Semester, 2015
Inst&'ctor: Nick Cercone -‐ 3050 CSEB -‐ nick@cse.yorku.ca
48
I'd like -‐ I'd like to know what this whole show is all about before it's out
top related