8/2/2019 03 Handouts
Lecture 3: Syntax: Grammars, Derivations, Parse Trees.
Scanning.
September 1st, 2010
Lecture 3: Syntax: Grammars, Derivations, Parse Trees. Scanning. (1/54)
Lecture Outline
Programming Languages: Syntactic Specifications and Analysis
Formal Grammars
Backus-Naur Form
Classification of Formal Languages
Syntactic Analysis of Programs
Derivations
Syntax Trees
Ambiguity
Avoiding Ambiguity
Scanning
Summary
Formal Grammars contd
How to use a grammar to generate sentences?
1. Let σ be a sequence containing just the start variable: σ = vs.
2. While σ contains any non-terminals, do:
   2.1 Choose one non-terminal (say, v) in σ.
   2.2 From R choose a rule (say, r) in which v appears on the left-hand side.
   2.3 Replace the chosen occurrence of v in σ with the right-hand side of r.
3. Return σ.
What if σ contains a non-terminal v for which there is no rule in R that has v on its left-hand side?
The grammar is incomplete.
Formal Grammars contd
Example (Formal grammar)
V = {c}
S = {a, b}
R = {(c, ε), (c, aca), (c, bcb)}
vs = c
Is the string abacaba valid in L?
Is ababbbaba valid in L?
What is the language L generated by the grammar?
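The generation recipe can be sketched in Python for this example grammar. This is a minimal illustration, not part of the original slides: the names RULES, VARIABLES and generate are hypothetical, the empty string stands for the empty right-hand side, and rule choice is random.

```python
import random

# Rules R as (left-hand side, right-hand side) pairs; "" is the empty right-hand side.
RULES = [("c", ""), ("c", "aca"), ("c", "bcb")]
VARIABLES = {"c"}


def generate(start, rules, variables, rng, max_steps=1000):
    """Derive one sentence by repeatedly rewriting a randomly chosen non-terminal."""
    sigma = start
    for _ in range(max_steps):
        positions = [i for i, ch in enumerate(sigma) if ch in variables]
        if not positions:
            return sigma  # only terminals left: sigma is a sentence
        i = rng.choice(positions)
        candidates = [rhs for lhs, rhs in rules if lhs == sigma[i]]
        if not candidates:
            # No rule has this non-terminal on its left-hand side.
            raise ValueError("incomplete grammar: no rule for " + sigma[i])
        sigma = sigma[:i] + rng.choice(candidates) + sigma[i + 1:]
    raise RuntimeError("derivation did not finish within max_steps")


print(generate("c", RULES, VARIABLES, random.Random(0)))
```

The max_steps guard is an addition for safety; the recipe on the slide has no such bound.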
Backus-Naur Form
BNF Notation
Grammars are usually written using a special notation: the Backus-Naur Form (BNF).
BNF is often extended with convenience symbols to shorten the notation:
the Extended BNF (EBNF).
BNF (and EBNF) is a metalanguage, a language for talking about
languages.
We will use EBNF extensively during the course.
Backus-Naur Form contd
Elements of BNF
Terminals are distinguished from non-terminals (variables) by some typographical convention, for example:
non-terminals are written in italics, using angle brackets, etc.;
terminals are written in a monotype font, enclosed in quotation marks, etc.
Rules are written as strings which contain:
a non-terminal,
a special production symbol (typically, ::=),
a sequence of terminals and non-terminals, or the symbol ε.
By convention,
the terminals and non-terminals of the grammar are those, and only those, included in at least one of the rules;
the left-hand side (the first element) of the topmost rule is the start variable vs.
Backus-Naur Form contd
Example (BNF representation of a grammar, 1)
c ::= ε
c ::= aca
c ::= bcb
In this grammar Γ1,
V = {c},
S = {a, b},
R = {(c, ε), (c, aca), (c, bcb)},
vs = c.
The specified language L(Γ1) is:
L(Γ1) = {ε, aa, bb, aaaa, baab, abba, bbbb, aaaaaa, baaaab, . . . }
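The first few elements of this language can be checked mechanically. The sketch below, with hypothetical names RULES and sentences, enumerates all sentences up to a given length by always expanding the leftmost non-terminal; it assumes every non-empty right-hand side adds at least one terminal, which holds for this grammar.

```python
from collections import deque

# The grammar from the slide: "" stands for the empty right-hand side.
RULES = {"c": ["", "aca", "bcb"]}


def sentences(start, rules, max_len):
    """Return all sentences with at most max_len terminals, shortest first."""
    seen = {start}
    queue = deque([start])
    found = set()
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal, if any.
        idx = next((i for i, ch in enumerate(form) if ch in rules), None)
        if idx is None:
            found.add(form)
            continue
        for rhs in rules[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            terminals = sum(1 for ch in new if ch not in rules)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found, key=lambda s: (len(s), s))


print(sentences("c", RULES, 4))
```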
Backus-Naur Form contd
Example (EBNF representation of a grammar, 1)
The grammar can also be written as
c ::= ε
    | aca
    | bcb
or as
c ::= ε | aca | bcb
The special symbol | has the meaning of 'or', and is an element of the metalanguage, not of the language specified by the grammar.
Backus-Naur Form contd
Metasyntactic extensions
Convenient extensions to the metalanguage include:
the special symbols [ and ] used to enclose a subsequence that appears in the string at most once;
the special symbols { and } used to enclose a subsequence that appears in the string any number of times.¹
Alternatively, we can use only the symbols { and } together with a superscript to specify the number of occurrences:
{ sequence }² means two consecutive occurrences of sequence;
{ sequence }⁺ means at least one occurrence of sequence;
{ sequence }* means any number of occurrences of sequence.
Further extensions are possible (and are sometimes used).
¹ The Kleene closure.
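These extensions have close counterparts among regular-expression quantifiers. The sketch below uses Python's re module as an analogy only; the subsequence ab and the names OPTIONAL, ANY, TWICE, AT_LEAST_ONCE and matches are chosen for illustration.

```python
import re

OPTIONAL = re.compile(r"(?:ab)?")       # [ ab ]   : at most once
ANY = re.compile(r"(?:ab)*")            # { ab }   : any number of times
TWICE = re.compile(r"(?:ab){2}")        # { ab }^2 : exactly two occurrences
AT_LEAST_ONCE = re.compile(r"(?:ab)+")  # { ab }^+ : at least one occurrence


def matches(pattern, text):
    """True if the whole text is described by the pattern."""
    return pattern.fullmatch(text) is not None


print(matches(TWICE, "abab"), matches(AT_LEAST_ONCE, ""))
```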
Chomsky's Hierarchy of Languages
Noam Chomsky defined four classes of languages:
Type 0: Unconstrained Languages
Type 1: Context-Sensitive Languages
Type 2: Context-Free Languages
Type 3: Regular Languages
Chomsky's Hierarchy of Languages contd
Note:
All regular languages are context-free, but not all context-free languages
are regular. All context-free languages are context-sensitive [sic], but not all
context-sensitive languages are context-free.
etc.
This may sound unintuitive, but it follows a well-established convention.
Regular Grammars
What is a regular language?
A regular language is a language generated by a regular grammar.
In a regular grammar, all rules are of one of the forms:²
v ::= s v′
v ::= s
v ::= ε
where s ∈ S; v, v′ ∈ V; and it is not required that v ≠ v′.
Example (A regular grammar)
string ::= a substring | b substring
substring ::= ε | c substring
Regular grammars are conveniently expressed with regular expressions. The
above could be written as (a|b)c*, (?:a|b)c*, or [ab]c*, etc.
² These are right-regular grammars. In left-regular grammars, the first rule form above is replaced by v ::= v′ s.
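The three regular expressions quoted above can be tried out directly. The following sketch uses Python's re module; the names PATTERNS, in_language and patterns_agree are hypothetical helpers for this illustration.

```python
import re

# The three equivalent spellings mentioned on the slide.
PATTERNS = [r"(a|b)c*", r"(?:a|b)c*", r"[ab]c*"]


def in_language(text):
    """True if text is generated by the example regular grammar."""
    return re.fullmatch(PATTERNS[0], text) is not None


def patterns_agree(text):
    """True if all three regexes give the same verdict for text."""
    verdicts = {re.fullmatch(p, text) is not None for p in PATTERNS}
    return len(verdicts) == 1


for sample in ["a", "bccc", "", "ab", "ca"]:
    print(repr(sample), in_language(sample), patterns_agree(sample))
```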
Context-Free Grammars
What is a context-free language?
A context-free language is a language generated by a context-free grammar.
In a context-free grammar, all rules are of the form:
v ::= γ
where v ∈ V and γ ∈ (V ∪ S)* (the set of all sequences of variables from V and symbols from S).³
Example (A non-regular context-free grammar)
expression ::= number
             | expression operator expression
             | ( expression )
             | . . .
³ (V ∪ S)* is the Kleene closure of V ∪ S.
Context-Sensitive and Unconstrained Languages
What is a context-sensitive language?
A context-sensitive language is a language generated by a context-sensitive grammar.
In a context-sensitive grammar, all rules are of the form:
α v β ::= α γ β
where v ∈ V, and α, β, γ ∈ (V ∪ S)*.
What is an unconstrained language?
An unconstrained language is a language generated by an unrestricted grammar.
In an unrestricted grammar, all rules are of the form:
α ::= β
where α, β ∈ (V ∪ S)* and α is non-empty.
Chomsky's Hierarchy of Languages contd
Why care about the hierarchy of languages?
Different grammars have different computational complexity:
unconstrained > context-sensitive > context-free > regular
Regular grammars are commonly used to define the microsyntax of programming languages: the syntax of lexemes as sequences of symbols from the alphabet of characters.⁴
Context-free grammars are used to define the (macro)syntax of programming languages: the syntax of programs as sequences of symbols from the alphabet of tokens (classified lexemes).⁵
Additional constraints may be needed to further constrain the syntax, e.g., by specifying that variable identifiers can be used only after they have been declared, etc.⁶
⁴ CTMCP uses the term 'lexical syntax' rather than 'microsyntax'; others use the term 'lexical structure'.
⁵ Macrosyntax is usually referred to as 'syntactic structure'.
⁶ The less restrictive the metalanguage used to define the grammar, the more restrictive the grammar can be with respect to the specified language.
Syntactic Analysis of Programs
How are programs processed?
The initial input is linear: it is a sequence of symbols from the alphabet of characters.
A lexical analyzer (scanner, lexer, tokenizer) reads the sequence of characters and outputs a sequence of tokens.
A parser reads a sequence of tokens and outputs a structured (typically non-linear) internal representation of the program: a syntax tree (parse tree).
The syntax tree is further processed, e.g., by an interpreter or by a compiler.
We have seen some of these steps implemented in the mdc interpreter.⁷
⁷ There, both the microsyntax and the syntax were trivial, no parsing was really needed as the intermediate representation was linear and colinear with the list of tokens, and no compilation was developed.
Syntactic Analysis of Programs contd
How are programs processed? contd
Program: if X == 1 then . . .
Input: i f   X   = =   1   t h e n   . . .
Lexemization: if X == 1 then . . .
Tokenization: key(if) var(X) op(==) int(1) key(then) . . .
Parsing: program(ifthenelse(eq(var(X)
int(1))
. . .
. . . )
. . . )
Interpretation: actions according to the program and language semantics
Compilation: code generation according to the program and language
semantics
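The lexemization and tokenization stages can be sketched in a few lines of Python. This is an illustration under assumptions, not the course's mdc or Oz code: the keyword set, the single operator ==, and the names KEYWORDS, LEXEME and tokenize are all hypothetical.

```python
import re

KEYWORDS = {"if", "then", "else", "end"}  # assumed keyword set for this sketch

# One lexeme per match: a word, an integer numeral, or the operator ==.
LEXEME = re.compile(r"\s*(?:(?P<word>[A-Za-z][A-Za-z0-9_]*)|(?P<int>[0-9]+)|(?P<op>==))")


def tokenize(text):
    """Turn a string of characters into a list of (class, lexeme) tokens."""
    tokens = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = LEXEME.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        pos = match.end()
        kind = match.lastgroup  # name of the alternative that matched
        lexeme = match.group(kind)
        if kind == "word":
            if lexeme in KEYWORDS:
                kind = "key"
            elif lexeme[0].isupper():
                kind = "var"
            else:
                kind = "atom"
        tokens.append((kind, lexeme))
    return tokens


print(tokenize("if X == 1 then"))
```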
Syntactic Analysis of Programs contd
Example (Partial microsyntax of Oz, using Perl-style regexes)
variable ::= [A-Z][A-Za-z0-9_]*
A variable (a variable name) consists of an uppercase letter followed by any
number of word characters.
Variable is valid as a variable name, atom and 123 are not.
Example (Partial microsyntax of Oz, using POSIX classes)
atom ::= [[:lower:]][[:word:]]*
additional constraint: no keyword is an atom
An atom consists of a lowercase letter followed by any number of word
characters.
variable is valid as an atom, Atom and 123 are not.
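Both microsyntax rules can be tested with Python's re module. Python does not support POSIX bracket classes, so this sketch spells the classes out as ASCII ranges; the keyword set and the names is_variable and is_atom are assumptions made for the example.

```python
import re

VARIABLE = re.compile(r"[A-Z][A-Za-z0-9_]*")
ATOM = re.compile(r"[a-z][A-Za-z0-9_]*")
KEYWORDS = {"if", "then", "else", "end", "skip"}  # assumed, partial keyword list


def is_variable(lexeme):
    return VARIABLE.fullmatch(lexeme) is not None


def is_atom(lexeme):
    # Additional constraint from the slide: no keyword is an atom.
    return ATOM.fullmatch(lexeme) is not None and lexeme not in KEYWORDS


for lexeme in ["Variable", "atom", "123", "variable", "Atom", "if"]:
    print(lexeme, is_variable(lexeme), is_atom(lexeme))
```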
Syntactic Analysis of Programs contd
Example (Partial syntax of Oz)
statement ::= skip
            | if variable then statement else statement end
            | . . .
where skip, if, then, else, and end are symbols from the alphabet of lexemes.
if X then skip else if Y then skip else skip end end is a valid statement in Oz;
if X then skip end and if x then skip else skip end are not.⁸
⁸ The former is not valid in the Oz kernel language, but is valid in the syntactically extended version.
Syntactic Analysis of Programs contd
Note: It is convenient to use indentation to make the structure of a program
clear to the programmer, but (in Oz) this is inessential for the syntactic
and semantic validity of programs.
Example (Indentation in Oz)
if A then
   skip
else
   if B then
      if C then
         skip
      else
         skip
      end
   else
      skip
   end
end
Syntactic Analysis of Programs contd
Note: In some programming languages indentation is essential for the
syntactic and semantic validity of programs.
Example (Indentation in Python)
# valid function definition
def foo(bar):
    print bar
    return foo

# invalid
def foo(bar): print bar
    return foo

# invalid
def foo(bar):
    print bar
  return foo
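Whether a given layout is accepted can be checked with the built-in compile function. The sketch below assumes Python 3, so print is written as a function call rather than the Python 2 statement used on the slide; the indentation of the two invalid variants is one plausible reading of the slide, and is_valid_python is a hypothetical helper.

```python
VALID = "def foo(bar):\n    print(bar)\n    return foo\n"
# Body starts on the def line, then an unexpected indented line follows.
INVALID_INLINE = "def foo(bar): print(bar)\n    return foo\n"
# The return is dedented to a level that matches no enclosing block.
INVALID_DEDENT = "def foo(bar):\n    print(bar)\n  return foo\n"


def is_valid_python(source):
    """True if the source text compiles; IndentationError is a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


for source in (VALID, INVALID_INLINE, INVALID_DEDENT):
    print(is_valid_python(source))
```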
Syntactic Analysis of Programs contd
Note: In some programming languages the programmer has control of
whether indentation is essential for the syntactic and semantic validity
of programs or not.
Example (Indentation in F#)
(* valid, no indentation required *)
let hello =
fun name -> printf "hello, %a" name

(* invalid, 4-space indentation required *)
#light
let hello =
fun name -> printf "hello, %a" name
Derivations
Derivations
Following the recipe for using a grammar explained earlier, we can derive sentences in the language L(Γ) specified by a grammar Γ in a sequence of steps.
In each step we transform one sentential form (a sequence of terminals and/or non-terminals) into another sentential form by replacing one non-terminal with the right-hand side of a matching rule.
The first sentential form is the start variable vs alone.
The last sentential form is a valid sentence, composed only of terminals.
Sequences of sentential forms starting with vs and ending with a sentence in L(Γ) obtained as specified above are called derivations.
Derivations contd
The following are two of infinitely many derivations possible to obtain with the previously defined grammar Γ1.⁹
Example (Derivation using Γ1)
1. c
2. aca
3. abcba
4. abba
Example (Derivation using Γ1)
1. c
2. ε
⁹ c ::= ε | aca | bcb.
Derivations contd
Rightmost and leftmost derivations
A derivation is a sequence of sentential forms beginning with a single non-terminal and ending with a (valid) sequence of terminals.
A derivation such that in each step it is the leftmost non-terminal that is replaced is called a leftmost derivation.
A derivation such that in each step it is the rightmost non-terminal that is replaced is called a rightmost derivation.
There can be derivations that are neither leftmost nor rightmost.
Given a start variable v and a sequence s of terminals, there can be
no derivation of s from v (if s is not valid in the defined language);
exactly one derivation of s from v;
more than one derivation.
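A derivation can be replayed step by step in code. The sketch below uses the small grammar c ::= ε | aca | bcb with "" for the empty right-hand side; the names RULES and derive are hypothetical, and the caller supplies the right-hand side to use at each step.

```python
# Grammar c ::= (empty) | aca | bcb; "" stands for the empty right-hand side.
RULES = {"c": {"", "aca", "bcb"}}


def derive(start, choices, leftmost=True):
    """Return the list of sentential forms obtained by applying choices in order."""
    form = start
    forms = [form]
    for rhs in choices:
        positions = [i for i, ch in enumerate(form) if ch in RULES]
        if not positions:
            raise ValueError("no non-terminal left to replace")
        i = positions[0] if leftmost else positions[-1]
        if rhs not in RULES[form[i]]:
            raise ValueError("no rule %s ::= %r" % (form[i], rhs))
        form = form[:i] + rhs + form[i + 1:]
        forms.append(form)
    return forms


print(derive("c", ["aca", "bcb", ""]))
```

With a single non-terminal per sentential form, as in this grammar, the leftmost and rightmost derivations coincide; the leftmost flag only matters for grammars whose forms contain several non-terminals.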
Derivations contd
Example (A leftmost derivation)
1. statement
2. if variable then statement else statement end
3. if A then statement else statement end
4. if A then skip else statement end
5. if A then skip else
      if variable then statement else statement end
   end
. . .
11. if A then skip else
       if B then
          if C then skip else skip end
       else skip end
    end
Derivations contd
Example (A rightmost derivation)
1. statement
2. if variable then statement else statement end
3. if variable then statement else
      if variable then statement else statement end
   end
. . .
11. if A then skip else
       if B then
          if C then skip else skip end
       else skip end
    end
Syntax Trees
Syntax tree
A parse tree (a syntax tree) is a structured representation of a program.
Parse trees are generated in the process of parsing programs.
A parser is a function (a program) that takes as input a sequence of tokens (the output of a lexer) and returns a nested data structure corresponding to a parse tree.
The data structure returned by the parser is an internal (intermediate) representation of the program. A parse tree can be used to:
interpret the program (in interpreted languages);
generate target code (in compiled languages);
optimize the intermediate code (in both interpreted and compiled
languages).
Syntax Trees
Example (Syntax tree)
Let Γ have the following rule(s):
v ::= ε | av | vb | vv
Does the sequence ba belong to L(Γ)? Yes, it has the following parse tree:
v
   v
      v
         ε
      b
   v
      a
      v
         ε
How many distinct derivations lead from v to ba?
There are six such derivations (check this!).
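The count can be checked by brute force for this particular parse tree. In the sketch below, a node is represented as a tuple of its non-terminal children only (terminals and ε play no role in the ordering of steps); the names LEAF, LEFT, RIGHT, ROOT and count_derivations are hypothetical.

```python
def count_derivations(frontier):
    """Count the orders in which the unexpanded non-terminals can be rewritten.

    frontier is a tuple of nodes; each node is a tuple of its non-terminal children.
    """
    if not frontier:
        return 1
    total = 0
    for i, node in enumerate(frontier):
        # Rewrite node i: it is replaced in the frontier by its non-terminal children.
        total += count_derivations(frontier[:i] + node + frontier[i + 1:])
    return total


LEAF = ()                # v ::= (empty)
LEFT = (LEAF,)           # v ::= v b
RIGHT = (LEAF,)          # v ::= a v
ROOT = (LEFT, RIGHT)     # v ::= v v

print(count_derivations((ROOT,)))
```

This counts the derivations that correspond to the tree shown above: after the root step, the two steps in the left subtree and the two steps in the right subtree can be interleaved freely.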
Syntax Trees contd
Example (A simple syntax tree for Oz)
The Oz grammar includes the following rules:
statement ::= skip
            | if variable then statement else statement end
with the microsyntactic definition of variable given earlier. What is the parse tree for if A then skip else if B then skip else skip end end?
statement
   if
   variable
      A
   then
   statement
      skip
   else
   statement
      if
      variable
         B
      then
      statement
         skip
      else
      statement
         skip
      end
   end
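A parser for these two rules can be written as a small recursive-descent function. This is a sketch under assumptions: lexemes are taken to be separated by whitespace, the tree is returned as nested tuples, and the names VARIABLE, expect, parse_statement and parse are hypothetical.

```python
import re

VARIABLE = re.compile(r"[A-Z][A-Za-z0-9_]*")


def expect(tokens, pos, word):
    """Check that tokens[pos] is the given keyword and step past it."""
    if pos >= len(tokens) or tokens[pos] != word:
        raise SyntaxError("expected %r at token %d" % (word, pos))
    return pos + 1


def parse_statement(tokens, pos=0):
    """Parse one statement starting at pos; return (tree, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "skip":
        return ("skip",), pos + 1
    if tokens[pos] == "if":
        name = tokens[pos + 1] if pos + 1 < len(tokens) else ""
        if not VARIABLE.fullmatch(name):
            raise SyntaxError("expected a variable after if")
        pos = expect(tokens, pos + 2, "then")
        then_branch, pos = parse_statement(tokens, pos)
        pos = expect(tokens, pos, "else")
        else_branch, pos = parse_statement(tokens, pos)
        pos = expect(tokens, pos, "end")
        return ("if", name, then_branch, else_branch), pos
    raise SyntaxError("unexpected token %r" % tokens[pos])


def parse(text):
    tokens = text.split()
    tree, pos = parse_statement(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


print(parse("if A then skip else if B then skip else skip end end"))
```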
Syntax Trees contd
Suppose we rewrite the grammar above as
statement ::= skip
            | if variable then statement else statement
            | if variable then statement
How many syntax trees does if A then if B then skip else skip have, given this grammar? There are two parse trees for this sequence; see the next slide.
Syntax Trees contd
Example (Parse tree for if A then if B then skip else skip, with else attached to the inner if)
statement
   if
   variable
      A
   then
   statement
      if
      variable
         B
      then
      statement
         skip
      else
      statement
         skip
Example (Parse tree for if A then if B then skip else skip, with else attached to the outer if)
statement
   if
   variable
      A
   then
   statement
      if
      variable
         B
      then
      statement
         skip
   else
   statement
      skip
Syntax Trees contd
Does it matter that a sentence has more than one parse tree?
For a sentence like
if A then if B then skip else skip
where all the conditional actions are skip (do nothing, no-op), it does not matter much.
In general, it does matter, since what actions will be taken and in which
order depends on how the program is understood by the interpreter (or
compiler), which in turn depends on how the program is parsed.
It is therefore essential that
the specification of the syntax is unambiguous, and
the programmer does not make false assumptions about how the code
will be parsed.
Syntax Trees contd
Example (The if-then-else construct in Python)
Given these two pieces of code, what is the output for each possible combination of values if both a and b can have a value from {True, False}?
1. if a:
       if b: print 1
       else: print 2
2. if a:
       if b: print 1
   else: print 2
a = True, b = True: both print 1
a = True, b = False: the first prints 2, the second nothing
a = False, b = True: the second prints 2, the first nothing
a = False, b = False: the second prints 2, the first nothing
The lack of end would add to the grammar ambiguity, which is resolved by involving whitespace in the specification.
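The four cases can be verified by running both variants. In this sketch the printed values are collected in a list instead of being printed, so the results can be compared; the function names variant_1 and variant_2 are hypothetical.

```python
def variant_1(a, b):
    """The else belongs to the inner if."""
    out = []
    if a:
        if b: out.append(1)
        else: out.append(2)
    return out


def variant_2(a, b):
    """The else belongs to the outer if."""
    out = []
    if a:
        if b: out.append(1)
    else: out.append(2)
    return out


for a in (True, False):
    for b in (True, False):
        print(a, b, variant_1(a, b), variant_2(a, b))
```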
Syntax Trees contd
Example (Multistatement lines in Python)
In Python, a semicolon (;) can be used to separate multiple statements within one line.¹⁰ Which of the following are equivalent?
1. if a: print 1; print 2
2. if a:
       print 1
       print 2
3. if a:
       print 1
   print 2
1. is equivalent to 2.
What about if a: if b: print 1; else print 2?¹¹
¹⁰ Multistatement lines are considered bad practice in Python.
¹¹ Invalid syntax.
Ambiguity
Ambiguity
A grammar is ambiguous if a sentence can be parsed in more than one way:
the program has more than one parse tree, that is,
the program has more than one leftmost derivation.¹²
Note: The fact that a program has more than one derivation is not sufficient to consider the grammar ambiguous.
In practice, most programs have more than one derivation, but all these derivations correspond to the same parse tree: the grammar is unambiguous.
Two distinct leftmost derivations for the same program must correspond to two distinct parse trees: the grammar must be ambiguous in this case.
¹² Or more than one rightmost derivation.
Ambiguity contd
Example (An ambiguous grammar)
Let Γexp be a grammar including the following rules:
expression ::= integer
             | expression operator expression
operator ::= - | + | * | /
where integer may generate any integer numeral (a sequence of digits).
Why is Γexp ambiguous?
Sentences like 1 + 2 + 3 have more than one parse tree.
Worse, sentences like 1 + 2 * 3 have more than one parse tree.
Should 1 + 2 * 3 evaluate to 9 or to 7?
In Smalltalk, the result would be 9.
In general, we would like it to be 7.
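The two readings of 1 + 2 * 3 can be produced mechanically by enumerating every parse tree that the rule expression ::= integer | expression operator expression allows. This is a brute-force sketch for short inputs, assuming whitespace-separated tokens; OPS, parse_trees and evaluate are hypothetical names.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def parse_trees(tokens):
    """Return every parse tree of tokens under the ambiguous grammar."""
    if len(tokens) == 1:
        return [int(tokens[0])]
    trees = []
    # Any operator (odd positions) may be the root of the tree.
    for i in range(1, len(tokens), 2):
        for left in parse_trees(tokens[:i]):
            for right in parse_trees(tokens[i + 1:]):
                trees.append((left, tokens[i], right))
    return trees


def evaluate(tree):
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    return OPS[op](evaluate(left), evaluate(right))


trees = parse_trees("1 + 2 * 3".split())
print(sorted(evaluate(t) for t in trees))
```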
Ambiguity contd
Example (An ambiguous grammar contd)
The expression 1 + 2 * 3 has two parse trees:
expression
   expression
      integer
         1
   operator
      +
   expression
      expression
         integer
            2
      operator
         *
      expression
         integer
            3

expression
   expression
      expression
         integer
            1
      operator
         +
      expression
         integer
            2
   operator
      *
   expression
      integer
         3
Avoiding Ambiguity
There are a number of ways to avoid ambiguity in grammars. Here, we
consider four alternative solutions.
Solution 1: Obligatory parentheses
We can modify Γexp by enforcing parentheses around complex expressions:
expression ::= integer
             | ( expression operator expression )
operator ::= - | + | * | /
Benefit: Ambiguity has been resolved.
Drawback: Expressions such as 1 + 2 * 3, or even 1 + 2, are no longer legal. (We must type (1 + (2 * 3)) and (1 + 2) instead.)
Avoiding Ambiguity
Solution 2: Precedence of operators
We can modify Γexp by distinguishing operators of high and low priority:
expression ::= term
             | expression lp-operator expression
term ::= integer
       | ( expression )
       | term hp-operator term
hp-operator ::= * | /
lp-operator ::= + | -
where hp-operator and lp-operator are high-priority and low-priority operators, respectively.
Benefit: Expressions such as 1 + 2 * 3 can be (partially) parsed as 1 + expression but not as expression * 3.
Drawback: An expression like 1 - 2 - 3 is still ambiguous: it can be (partially) parsed both as expression - 3 and as 1 - expression.
Avoiding Ambiguity
Solution 3: Associativity of operators
We can modify Γexp by introducing associativity of operators:
expression ::= integer | expression operator integer
operator ::= * | / | + | -
Benefit: The operators in this grammar are left-associative; the expression 1 - 2 - 3 can only be (partially) parsed as expression - 3, and not as 1 - expression.
Drawback: All operators have equal precedence; an expression like 1 - 2 * 3 can only be (partially) parsed as expression * 3, and not as 1 - expression.
Ambiguity contd
Solution 4: Combine associativity, precedence, and parentheses
We can modify Γexp by adding all of the above:
expression ::= term
             | expression lp-operator term
term ::= factor
       | term hp-operator factor
factor ::= integer
         | ( expression )
hp-operator ::= * | /
lp-operator ::= + | -
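A grammar of this layered shape maps directly onto a small evaluator. The sketch below handles + and - at the expression level and * and / at the term level, so that * and / bind tighter, and it replaces the left-recursive rules with loops, which keeps the operators left-associative. All names (TOKEN, tokenize, expression, term, factor, evaluate) are hypothetical, and / is taken to be Python's true division.

```python
import re

TOKEN = re.compile(r"\s*([0-9]+|[-+*/()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def expression(tokens, pos):
    # expression ::= term { (+ | -) term }, evaluated left to right
    value, pos = term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = term(tokens, pos + 1)
        value = value + right if op == "+" else value - right
    return value, pos


def term(tokens, pos):
    # term ::= factor { (* | /) factor }, evaluated left to right
    value, pos = factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = factor(tokens, pos + 1)
        value = value * right if op == "*" else value / right
    return value, pos


def factor(tokens, pos):
    # factor ::= integer | ( expression )
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    if tokens[pos] == "(":
        value, pos = expression(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("expected )")
        return value, pos + 1
    if tokens[pos].isdigit():
        return int(tokens[pos]), pos + 1
    raise ValueError("unexpected token %r" % tokens[pos])


def evaluate(text):
    tokens = tokenize(text)
    value, pos = expression(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing input")
    return value


print(evaluate("1 + 2 * 3"), evaluate("1 - 2 - 3"), evaluate("(1 + 2) * 3"))
```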
Scanning
What is scanning?
Scanning is the process of translating programs from the string-of-characters input format into the sequence-of-tokens intermediate format.
We have seen scanning in action in the mdc example:
the lexemizer took as input a string of characters and returned a sequence of lexemes;
the tokenizer took as input a sequence of lexemes and returned a sequence of tokens.
These two steps are usually merged into one pass, called scanning. (Terminology varies: lexing or tokenization is sometimes used for both operations, and scanning is sometimes used only for creating the lexemes.)
Scanning contd
How do we design and implement a scanner?
Building a scanner requires a number of steps:
1. Specification of the microsyntax (the lexical structure) of the language,
typically using regular expressions (regexes).
2. Based on the regexes, a nondeterministic finite automaton (NFA) is built
that recognizes lexemes of the language.
3. A deterministic finite automaton (DFA) equivalent to the NFA is built.
4. The DFA is implemented using a nested control structure that processes the input one character at a time.
All steps can be realized manually, but there exist tools which allow one to specify the lexical structure using regular expressions, and build an implementation of the DFA automatically.
We shall revisit the mdc example and build a scanner both manually and using a scanner-building tool.
Scanning contd
Before we implement an mdc scanner, we first have a look at a recognizer for
mdc lexemes.
A scanner processes an input string and returns a list of lexemes (or
tokens).
A recognizer checks whether the whole input string is a single lexeme.
Example (A recognizer for mdc lexemes)
Step 1: The microsyntax of mdc is trivially specified with the following regular expressions:
command ::= [pf]           (exactly one p or one f)
operator ::= [\+\-\*\/]    (analogously, symbols escaped with \)
integer ::= [0-9]+         (one or more digits)
Scanning contd
Example (A recognizer for mdc lexemes contd)
Step 3: The regex specification is realized by the following DFA:¹³
start --[p, f]--> cmd
start --[+, -, *, /]--> op
start --[0, . . . , 9]--> int
int --[0, . . . , 9]--> int
¹³ We skip Step 2; see the further reading section for references if you need more details.
Scanning contd
Example (A recognizer for mdc lexemes contd)
Step 4: An algorithm for the mdc recognizer DFA:¹⁴
input: string of characters; output: boolean
state ← start; char ← next()
while char ≠ EOF:
   if state = start:
      if char ∈ {p, f}: state ← cmd
      else if char ∈ {+, -, *, /}: state ← op
      else if char ∈ {0, . . . , 9}: state ← int
      else: return false
   else if state ∈ {cmd, op}: return false
   else if state = int:
      if char ∉ {0, . . . , 9}: return false
   char ← next()
if state ∈ {cmd, op, int}: return true
else: return false
¹⁴ Notation varies. EOF means 'end of file' (input). Each call to next() returns the next character from the input.
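The recognizer algorithm translates almost line by line into Python. This is a sketch, not the course's Oz implementation: iterating over the string takes the place of next() and EOF, and the names DIGITS and recognize are hypothetical.

```python
DIGITS = set("0123456789")


def recognize(text):
    """True if the whole input string is a single mdc lexeme."""
    state = "start"
    for char in text:  # the loop ends where the pseudocode reaches EOF
        if state == "start":
            if char in ("p", "f"):
                state = "cmd"
            elif char in ("+", "-", "*", "/"):
                state = "op"
            elif char in DIGITS:
                state = "int"
            else:
                return False
        elif state in ("cmd", "op"):
            return False  # commands and operators are exactly one character long
        elif state == "int":
            if char not in DIGITS:
                return False
    return state in ("cmd", "op", "int")


for sample in ["p", "123", "1 2 +", ""]:
    print(repr(sample), recognize(sample))
```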
Scanning contd
The recognizer checks whether the whole string is a single lexeme, but we want more:
process strings that include more than one lexeme;
return a sequence of classified lexemes rather than a yes/no answer.
In the previous implementation of mdc, all lexemes in a program had to be separated by whitespace. This leads to a tradeoff:
it is more convenient to implement the lexemizer: just split the input by whitespace;
it is less convenient to use the language: the programmer must separate all lexemes with whitespace.
We shall now develop a scanner that makes whitespace between lexemes
optional (unless we want to separate two numerals).
Scanning contd
Try it! The file code/mdc-recognizer.oz contains an implementation of the
mdc recognizer and a few simple test cases.
Open the file in the OPI (oz &, then C-x C-f). Execute the code (C-. C-b).
What happens?
{MDCRecognizer "p"} evaluates to true, because the input is a
command. {MDCRecognizer "123"} evaluates to true, because the input is
an integer. {MDCRecognizer "1 2 +"} evaluates to false, because the input
is not a valid lexeme, even though it is a valid sentence (legal sequence of valid lexemes) in mdc.
Scanning contd
Example (A scanner for mdc)
Step 4: An algorithm for the mdc scanner DFA:¹⁵
input: string of characters; output: sequence of tokens
tokens ← (); state ← start; char ← next(); seen ← ()
while char ≠ EOF:
   if state = start:
      if char ∈ {p, f}: append ⟨cmd, char⟩ to tokens
      else if char ∈ {+, -, *, /}: append ⟨op, char⟩ to tokens
      else if char ∈ {0, . . . , 9}: state ← int; seen ← char
      else if char ∉ S: error(char)
      char ← next()
   else if state = int:
      if char ∈ {0, . . . , 9}: concatenate char to seen; char ← next()
      else: append ⟨int, seen⟩ to tokens; seen ← (); state ← start
if state = int: append ⟨int, seen⟩ to tokens
return tokens
¹⁵ tokens maintains a list of tokens recognized so far. seen maintains a string of characters seen since the most recently recognized token. Angle brackets (⟨ and ⟩) denote tokens (class-lexeme pairs).
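The scanner algorithm can be sketched in Python in the same way. Two assumptions are made here and should be read as such: the set S in the error test is taken to be whitespace, and tokens are represented as (class, lexeme) tuples. The names DIGITS and scan are hypothetical.

```python
DIGITS = set("0123456789")


def scan(text):
    """Return the list of (class, lexeme) tokens found in text."""
    tokens = []
    seen = ""
    state = "start"
    pos = 0
    while pos < len(text):  # pos == len(text) plays the role of EOF
        char = text[pos]
        if state == "start":
            if char in ("p", "f"):
                tokens.append(("cmd", char))
            elif char in ("+", "-", "*", "/"):
                tokens.append(("op", char))
            elif char in DIGITS:
                state = "int"
                seen = char
            elif not char.isspace():  # assumption: S is the set of whitespace characters
                raise ValueError("unexpected character %r" % char)
            pos += 1
        else:  # state == "int"
            if char in DIGITS:
                seen += char
                pos += 1
            else:
                # The numeral has ended; re-examine this character in the start state.
                tokens.append(("int", seen))
                seen = ""
                state = "start"
    if state == "int":
        tokens.append(("int", seen))
    return tokens


print(scan("1 2+p"))
```

Not advancing the input when a numeral ends is what makes whitespace optional: the character that terminated the numeral is handled again from the start state.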
Summary
This time: syntax, grammars, derivations, parse trees, ambiguity; recognizing, scanning; design and implementation of an mdc scanner.
Note! The code examples are used as an illustration; we will return to (some parts of) them when you learn more about the syntax and semantics of Oz.
Next time: syntax and semantics of the declarative kernel language.
Summary contd
Homework: Examine and try out today's code, read Mozart/Oz documentation if necessary.
Pensum: Most of today's slides, except for implementational details of mdc scanners and the recognizer and scanner DFA.
Further reading: See, e.g., Ch. 3 in Sebesta, Concepts of Programming Languages; Ch. 2 in Scott, Programming Language Pragmatics; Ch. 2-4 in Cooper and Torczon, Engineering a Compiler (a detailed, in-depth but readable presentation).
Questions . . . ? . . . ?
. . . ?