03 handouts

Upload: madamey

Post on 05-Apr-2018


    Lecture 3: Syntax: Grammars, Derivations, Parse Trees.

    Scanning.

    September 1st, 2010

    Lecture 3: Syntax: Grammars, Derivations, Parse Trees. Scanning. (1/54)

    Lecture Outline

    Programming Languages: Syntactic Specifications and Analysis
    Formal Grammars
    Backus-Naur Form
    Classification of Formal Languages
    Syntactic Analysis of Programs
    Derivations
    Syntax Trees
    Ambiguity
    Avoiding Ambiguity
    Scanning
    Summary

    Formal Grammars (cont'd)

    How to use a grammar to generate sentences?

    1. Let σ be a sequence containing just the start variable: σ = v_s.

    2. While σ contains any non-terminals, do:

       2.1 Choose one non-terminal (say, v) in σ.
       2.2 From R choose a rule (say, r) in which v appears on the left-hand side.
       2.3 Replace the chosen occurrence of v in σ with the right-hand side of r.

    3. Return σ.

    What if σ contains a non-terminal v for which there is no rule in R that has v on its left-hand side?

    The grammar is incomplete.
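The generation recipe can be sketched in Python. This is only an illustration under stated assumptions: the rule table encodes the small palindrome grammar from the formal grammar example (c as the only variable, a and b as terminals), the names RULES, START and generate are made up here, and the step limit is an extra safeguard that the recipe itself does not have.

```python
import random

# Assumed encoding of the rule set R: variable -> list of right-hand sides.
# The empty list stands for the empty sequence.
RULES = {"c": [[], ["a", "c", "a"], ["b", "c", "b"]]}
START = "c"


def generate(rules, start, rng, max_steps=50):
    """Rewrite non-terminals until none remain (steps 1-3 of the recipe)."""
    form = [start]                                   # step 1: the start variable alone
    for _ in range(max_steps):
        positions = [i for i, sym in enumerate(form) if sym in rules]
        if not positions:                            # step 2: no non-terminals left
            return "".join(form)                     # step 3: return the sentence
        i = rng.choice(positions)                    # step 2.1: pick a non-terminal
        rhs = rng.choice(rules[form[i]])             # step 2.2: pick a matching rule
        form[i:i + 1] = rhs                          # step 2.3: replace the occurrence
    return None                                      # gave up (safeguard, not in the recipe)


sentence = generate(RULES, START, random.Random(0))
print(repr(sentence))
```

Because the rule is chosen at random, the printed sentence depends on the seed; whatever it is, it reads the same forwards and backwards.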

    Formal Grammars (cont'd)

    Example (Formal grammar)

    V = {c}
    S = {a, b}
    R = {(c, ε), (c, aca), (c, bcb)}
    v_s = c

    Is the string abacaba valid in L?

    Is ababbbaba valid in L?

    What is the language L generated by the grammar?
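One way to explore the questions in this example is a small membership test. The sketch below assumes that the first rule's right-hand side is the empty sequence and that the other two rules are c ::= aca and c ::= bcb; the function name in_language is hypothetical.

```python
def in_language(s):
    """Return True if s can be derived from c with c ::= (empty) | a c a | b c b."""
    if s == "":
        return True                      # rule with the empty right-hand side
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "ab":
        return in_language(s[1:-1])      # rule c ::= a c a  or  c ::= b c b
    return False


for candidate in ["abba", "abacaba", "ababbbaba"]:
    print(candidate, in_language(candidate))
```

Each recursive call strips one matching pair of outer symbols, mirroring one rule application read backwards.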

    Backus-Naur Form

    BNF Notation

    Grammars are usually written using a special notation: the Backus-Naur Form (BNF).

    BNF is often extended with convenience symbols to shorten the notation: the Extended BNF (EBNF).

    BNF (and EBNF) is a metalanguage, a language for talking about languages.

    We will use EBNF extensively during the course.

    Backus-Naur Form (cont'd)

    Elements of BNF

    Terminals are distinguished from non-terminals (variables) by some typographical convention, for example:

    - non-terminals are written in italics, using angle brackets, etc.;
    - terminals are written in a monotype font, enclosed in quotation marks, etc.

    Rules are written as strings which contain:

    - a non-terminal,
    - a special production symbol (typically ::=),
    - a sequence of terminals and non-terminals, or the symbol ε.

    By convention,

    - the terminals and non-terminals of the grammar are those, and only those, included in at least one of the rules;
    - the left-hand side (the first element) of the topmost rule is the start variable v_s.

    Backus-Naur Form (cont'd)

    Example (BNF representation of a grammar, G1)

    c ::= ε
    c ::= aca
    c ::= bcb

    In this grammar G1,

    V = {c},
    S = {a, b},
    R = {(c, ε), (c, aca), (c, bcb)},
    v_s = c.

    The specified language L(G1) is:

    L(G1) = {ε, aa, bb, aaaa, baab, abba, bbbb, aaaaaa, baaaab, ...}

    Backus-Naur Form (cont'd)

    Example (EBNF representation of a grammar, G1)

    The grammar can also be written as

    c ::= ε
        | aca
        | bcb

    or as

    c ::= ε | aca | bcb

    The special symbol | has the meaning of "or", and is an element of the metalanguage, not of the language specified by the grammar.

    Backus-Naur Form (cont'd)

    Metasyntactic extensions

    Convenient extensions to the metalanguage include:

    - the special symbols [ and ], used to enclose a subsequence that appears in the string at most once;
    - the special symbols { and }, used to enclose a subsequence that appears in the string any number of times [1].

    Alternatively, we can use only the symbols { and } together with a superscript to specify the number of occurrences:

    - { sequence }^2 means two consecutive occurrences of sequence;
    - { sequence }^+ means at least one occurrence of sequence;
    - { sequence }^* means any number of occurrences of sequence.

    Further extensions are possible (and are sometimes used).

    [1] The Kleene closure.
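For readers who know regular expressions, the optional and repetition brackets have close analogues there. The sketch below uses Python's re module; the two EBNF rules in the comments (number and pair) are hypothetical examples invented for this illustration, not rules from the lecture.

```python
import re

# Hypothetical EBNF rule:  number ::= [ "-" ] digit { digit }
#   [ ... ]  (at most once)         corresponds to  ?
#   { ... }  (any number of times)  corresponds to  *
NUMBER = re.compile(r"-?[0-9][0-9]*")

# Hypothetical EBNF rule:  pair ::= { digit }^2
#   { ... }^2  (exactly two occurrences)  corresponds to  {2}
PAIR = re.compile(r"[0-9]{2}")

for text in ["-42", "7", "--1", ""]:
    print(repr(text), NUMBER.fullmatch(text) is not None)
```

The analogy only holds when the bracketed subsequence is itself regular; EBNF brackets may also enclose non-terminals that a regex cannot express.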

    Chomsky's Hierarchy of Languages

    Noam Chomsky defined four classes of languages:

    Type 0: Unconstrained Languages
    Type 1: Context-Sensitive Languages
    Type 2: Context-Free Languages
    Type 3: Regular Languages

    Chomsky's Hierarchy of Languages (cont'd)

    Note:

    All regular languages are context-free, but not all context-free languages are regular. All context-free languages are context-sensitive [sic], but not all context-sensitive languages are context-free. The same holds one level up, for context-sensitive and unconstrained languages.

    This may sound unintuitive, but it follows a well-established convention.

    Regular Grammars

    What is a regular language?

    A regular language is a language generated by a regular grammar.

    In a regular grammar, all rules are of one of the forms [2]:

    v ::= s v'
    v ::= s
    v ::= ε

    where s ∈ S; v, v' ∈ V; and it is not required that v ≠ v'.

    Example (A regular grammar)

    string ::= a substring | b substring
    substring ::= ε | c substring

    Regular grammars are conveniently expressed with regular expressions. The above could be written as (a|b)c*, (?:a|b)c*, or [ab]c*, etc.

    [2] These are right-regular grammars. In left-regular grammars, the first rule form above is replaced by v ::= v' s.
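The correspondence between the example grammar and the regular expression [ab]c* can be checked on a few strings. This is a rough sketch: matches_grammar is a hand-written, hypothetical reading of the two rules (string and substring), and the sample list is arbitrary.

```python
import re


def matches_grammar(s):
    """string ::= a substring | b substring ;  substring ::= (empty) | c substring"""
    if not s or s[0] not in "ab":
        return False                       # a string must start with a or b
    return all(ch == "c" for ch in s[1:])  # the rest is zero or more c's


PATTERN = re.compile(r"[ab]c*")

samples = ["a", "b", "acc", "bccc", "", "c", "ab", "acb"]
agreement = [matches_grammar(s) == (PATTERN.fullmatch(s) is not None) for s in samples]
print(all(agreement))
```

Agreement on a handful of samples is of course not a proof of equivalence, only a sanity check.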

    Context-Free Grammars

    What is a context-free language?

    A context-free language is a language generated by a context-free grammar.

    In a context-free grammar, all rules are of the form:

    v ::= γ

    where v ∈ V and γ ∈ (V ∪ S)* (the set of all sequences of variables from V and symbols from S) [3].

    Example (A non-regular context-free grammar)

    expression ::= number
                 | expression operator expression
                 | ( expression )
                 | ...

    [3] (V ∪ S)* is the Kleene closure of V ∪ S.

    Context-Sensitive and Unconstrained Languages

    What is a context-sensitive language?

    A context-sensitive language is a language generated by a context-sensitive grammar.

    In a context-sensitive grammar, all rules are of the form:

    α v β ::= α γ β

    where v ∈ V, and α, β, γ ∈ (V ∪ S)*.

    What is an unconstrained language?

    An unconstrained language is a language generated by an unrestricted grammar.

    In an unrestricted grammar, all rules are of the form:

    α ::= β

    where α, β ∈ (V ∪ S)* and α is non-empty.

    Chomsky's Hierarchy of Languages (cont'd)

    Why care about the hierarchy of languages?

    - Different classes of grammars have different computational complexity, in decreasing order: unconstrained, context-sensitive, context-free, regular.
    - Regular grammars are commonly used to define the microsyntax of programming languages: the syntax of lexemes as sequences of symbols from the alphabet of characters [4].
    - Context-free grammars are used to define the (macro)syntax of programming languages: the syntax of programs as sequences of symbols from the alphabet of tokens (classified lexemes) [5].
    - Additional constraints may be needed to further constrain the syntax, e.g., by specifying that variable identifiers can be used only after they have been declared [6].

    [4] CTMCP uses the term lexical syntax rather than microsyntax; others use the term lexical structure.
    [5] Macrosyntax is usually referred to as syntactic structure.
    [6] The less restrictive the metalanguage used to define the grammar, the more restrictive the grammar can be with respect to the specified language.


    Syntactic Analysis of Programs

    How are programs processed?

    - The initial input is linear: it is a sequence of symbols from the alphabet of characters.
    - A lexical analyzer (scanner, lexer, tokenizer) reads the sequence of characters and outputs a sequence of tokens.
    - A parser reads a sequence of tokens and outputs a structured (typically non-linear) internal representation of the program: a syntax tree (parse tree).
    - The syntax tree is further processed, e.g., by an interpreter or by a compiler.

    We have seen some of these steps implemented in the mdc interpreter [7].

    [7] There, both the microsyntax and the syntax were trivial, no parsing was really needed as the intermediate representation was linear and collinear with the list of tokens, and no compilation was developed.

    Syntactic Analysis of Programs (cont'd)

    How are programs processed? (cont'd)

    Program:        if X == 1 then ...
    Input:          i f X = = 1 t h e n . . .
    Lexemization:   if  X  ==  1  then  ...
    Tokenization:   key(if) var(X) op(==) int(1) key(then) ...
    Parsing:        program(ifthenelse(eq(var(X)
                                          int(1))
                                       ...
                                       ...)
                            ...)
    Interpretation: actions according to the program and language semantics
    Compilation:    code generation according to the program and language semantics
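The first two stages of this pipeline can be imitated in a few lines of Python. Everything here is an assumption made for illustration: the regular expression covers only identifiers, numerals and ==, the keyword set is partial, and the token classes simply copy the labels shown above (key, var, op, int) plus a hypothetical atom class.

```python
import re

KEYWORDS = {"if", "then", "else", "end", "skip"}   # partial, assumed keyword set


def lexemize(text):
    """Split a character string into lexemes; unmatched characters are silently skipped."""
    return re.findall(r"[A-Za-z_]\w*|[0-9]+|==", text)


def tokenize(lexemes):
    """Classify each lexeme, producing (class, lexeme) pairs."""
    tokens = []
    for lexeme in lexemes:
        if lexeme in KEYWORDS:
            tokens.append(("key", lexeme))
        elif lexeme == "==":
            tokens.append(("op", lexeme))
        elif lexeme.isdigit():
            tokens.append(("int", lexeme))
        elif lexeme[0].isupper():
            tokens.append(("var", lexeme))
        else:
            tokens.append(("atom", lexeme))
    return tokens


print(tokenize(lexemize("if X == 1 then")))
```

A real scanner would report characters it cannot match instead of skipping them; that refinement is left out to keep the sketch short.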

    Syntactic Analysis of Programs (cont'd)

    Example (Partial microsyntax of Oz, using Perl-style regexes)

    variable ::= [A-Z][A-Za-z0-9_]*

    A variable (a variable name) consists of an uppercase letter followed by any number of word characters.

    Variable is valid as a variable name; atom and 123 are not.

    Example (Partial microsyntax of Oz, using POSIX classes)

    atom ::= [[:lower:]][[:word:]]*
    additional constraint: no keyword is an atom

    An atom consists of a lowercase letter followed by any number of word characters.

    variable is valid as an atom; Atom and 123 are not.
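These two microsyntax rules can be tried out with Python's re module. Note the assumptions: Python's re does not support POSIX classes such as [[:lower:]], so explicit ASCII ranges are used instead, and the keyword set is a partial, made-up stand-in for the real Oz keyword list.

```python
import re

VARIABLE = re.compile(r"[A-Z][A-Za-z0-9_]*")
ATOM = re.compile(r"[a-z][A-Za-z0-9_]*")          # ASCII stand-in for the POSIX classes
KEYWORDS = {"if", "then", "else", "end", "skip"}   # partial, assumed keyword set


def is_variable(s):
    """True if the whole string is a variable name."""
    return VARIABLE.fullmatch(s) is not None


def is_atom(s):
    """True if the whole string is an atom; keywords are excluded."""
    return ATOM.fullmatch(s) is not None and s not in KEYWORDS


for s in ["Variable", "variable", "Atom", "123", "if"]:
    print(s, is_variable(s), is_atom(s))
```

The keyword check shows why the additional constraint is stated separately: it is awkward to express inside the regular expression itself.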

    Syntactic Analysis of Programs (cont'd)

    Example (Partial syntax of Oz)

    statement ::= skip
                | if variable then statement else statement end
                | ...

    where skip, if, then, else, and end are symbols from the alphabet of lexemes.

    if X then skip else if Y then skip else skip end end is a valid statement in Oz;

    if X then skip end and if x then skip else skip end are not [8].

    [8] The former is not valid in the Oz kernel language, but is valid in the syntactically extended version.

    Syntactic Analysis of Programs (cont'd)

    Note: It is convenient to use indentation to make the structure of a program clear to the programmer, but (in Oz) this is inessential for the syntactic and semantic validity of programs.

    Example (Indentation in Oz)

    if A then
       skip
    else
       if B then
          if C then
             skip
          else
             skip
          end
       else
          skip
       end
    end

    Syntactic Analysis of Programs (cont'd)

    Note: In some programming languages indentation is essential for the syntactic and semantic validity of programs.

    Example (Indentation in Python)

    # valid function definition
    def foo(bar):
        print(bar)
        return foo

    # invalid
    def foo(bar): print(bar)
        return foo

    # invalid
    def foo(bar):
        print(bar)
      return foo

    Syntactic Analysis of Programs (cont'd)

    Note: In some programming languages the programmer has control of whether indentation is essential for the syntactic and semantic validity of programs or not.

    Example (Indentation in F#)

    (* valid, no indentation required *)
    let hello =
    fun name -> printf "hello, %a" name

    (* invalid, 4-space indentation required *)
    #light
    let hello =
    fun name -> printf "hello, %a" name

    Derivations

    Derivations

    Following the recipe for using a grammar explained earlier, we can derive sentences in the language L(G) specified by a grammar G in a sequence of steps.

    - In each step we transform one sentential form (a sequence of terminals and/or non-terminals) into another sentential form by replacing one non-terminal with the right-hand side of a matching rule.
    - The first sentential form is the start variable v_s alone.
    - The last sentential form is a valid sentence, composed only of terminals.

    Sequences of sentential forms starting with v_s and ending with a sentence in L(G), obtained as specified above, are called derivations.

    Derivations (cont'd)

    The following are two of the infinitely many derivations that can be obtained with the previously defined grammar G1 [9].

    Example (Derivation using G1)

    1. c
    2. aca
    3. abcba
    4. abba

    Example (Derivation using G1)

    1. c
    2. ε

    [9] c ::= ε | aca | bcb.

    Derivations (cont'd)

    Rightmost and leftmost derivations

    A derivation is a sequence of sentential forms beginning with a single non-terminal and ending with a (valid) sequence of terminals.

    - A derivation such that in each step it is the leftmost non-terminal that is replaced is called a leftmost derivation.
    - A derivation such that in each step it is the rightmost non-terminal that is replaced is called a rightmost derivation.
    - There can be derivations that are neither leftmost nor rightmost.

    Given a start variable v and a sequence s of terminals, there can be

    - no derivation of s from v (if s is not valid in the defined language);
    - exactly one derivation of s from v;
    - more than one derivation.

    Derivations (cont'd)

    Example (A leftmost derivation)

    1. statement
    2. if variable then statement else statement end
    3. if A then statement else statement end
    4. if A then skip else statement end
    5. if A then skip else
          if variable then statement else statement end
       end
    ...
    11. if A then skip else
           if B then
              if C then skip else skip end
           else skip end
        end

    Derivations (cont'd)

    Example (A rightmost derivation)

    1. statement
    2. if variable then statement else statement end
    3. if variable then statement else
          if variable then statement else statement end
       end
    ...
    11. if A then skip else
           if B then
              if C then skip else skip end
           else skip end
        end

    Syntax Trees

    Syntax tree

    - A parse tree (a syntax tree) is a structured representation of a program.
    - Parse trees are generated in the process of parsing programs.
    - A parser is a function (a program) that takes as input a sequence of tokens (the output of a lexer) and returns a nested data structure corresponding to a parse tree.

    The data structure returned by the parser is an internal (intermediate) representation of the program. A parse tree can be used to:

    - interpret the program (in interpreted languages);
    - generate target code (in compiled languages);
    - optimize the intermediate code (in both interpreted and compiled languages).

    Syntax Trees (cont'd)

    Example (Syntax tree)

    Let G have the following rule(s):

    v ::= ε | av | vb | vv

    Does the sequence ba belong to L(G)? Yes, it has the following parse tree:

              v
            /   \
           v     v
          / \   / \
         v   b a   v
         |         |
         ε         ε

    How many distinct derivations of ba correspond to this parse tree?

    There are six such derivations (check this!).
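The count of six can be checked by brute force. The sketch below is written under two assumptions: the first alternative of the rule is the empty sequence, and the count refers to derivations that use exactly five rule applications, which is the fewest that can yield ba (one use of vv, one of av, one of vb, and two of the empty rule). The function and parameter names are hypothetical.

```python
RULES = ["", "av", "vb", "vv"]   # right-hand sides of v; "" is the empty sequence


def count_derivations(form, target, steps_left, leftmost_only=False):
    """Count rule-application sequences that turn form into target within steps_left steps."""
    if "v" not in form:
        return 1 if form == target else 0
    if steps_left == 0:
        return 0
    positions = [i for i, sym in enumerate(form) if sym == "v"]
    if leftmost_only:
        positions = positions[:1]        # only the leftmost non-terminal may be rewritten
    total = 0
    for i in positions:
        for rhs in RULES:
            new_form = form[:i] + rhs + form[i + 1:]
            total += count_derivations(new_form, target, steps_left - 1, leftmost_only)
    return total


print(count_derivations("v", "ba", 5))                      # all derivations in five steps
print(count_derivations("v", "ba", 5, leftmost_only=True))  # leftmost derivations only
```

With a larger step budget the rule v ::= vv allows further parse trees for ba, so the numbers grow; the five-step budget isolates the tree drawn above.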

    Syntax Trees (cont'd)

    Example (A simple syntax tree for Oz)

    The Oz grammar includes the following rules:

    statement ::= skip
                | if variable then statement else statement end

    with the microsyntactic definition of variable given earlier. What is the parse tree for if A then skip else if B then skip else skip end end?

    statement
      if
      variable  A
      then
      statement  skip
      else
      statement
        if
        variable  B
        then
        statement  skip
        else
        statement  skip
        end
      end

    Syntax Trees (cont'd)

    Suppose we rewrite the grammar above as

    statement ::= skip
                | if variable then statement else statement
                | if variable then statement

    How many syntax trees does if A then if B then skip else skip have, given this grammar? There are two parse trees for this sequence; see the next slide.

    Syntax Trees (cont'd)

    Example (Parse tree for if A then if B then skip else skip; the else belongs to the inner if)

    statement
      if
      variable  A
      then
      statement
        if
        variable  B
        then
        statement  skip
        else
        statement  skip

    Example (Parse tree for if A then if B then skip else skip; the else belongs to the outer if)

    statement
      if
      variable  A
      then
      statement
        if
        variable  B
        then
        statement  skip
      else
      statement  skip

    Syntax Trees (cont'd)

    Does it matter that a sentence has more than one parse tree?

    - For a sentence like

      if A then if B then skip else skip

      where all the conditional actions are skip (do nothing, no-op), it does not matter much.

    - In general, it does matter, since what actions will be taken and in which order depends on how the program is understood by the interpreter (or compiler), which in turn depends on how the program is parsed.

    It is therefore essential that

    - the specification of the syntax is unambiguous, and
    - the programmer does not make false assumptions about how the code will be parsed.
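To see the difference concretely, the two possible readings of a nested conditional without end can be written down as data and interpreted. This is a toy illustration: the tuple encoding, the action names x and y (standing in for statements other than skip), and the run function are all assumptions made here.

```python
# A tree is ("if", condition, then_branch, else_branch_or_None) or ("act", name).
# Both trees are readings of:  if A then if B then x else y
INNER_ELSE = ("if", "A", ("if", "B", ("act", "x"), ("act", "y")), None)
OUTER_ELSE = ("if", "A", ("if", "B", ("act", "x"), None), ("act", "y"))


def run(tree, env):
    """Return the list of actions performed when interpreting tree under env."""
    if tree is None:
        return []                        # a missing else branch does nothing
    if tree[0] == "act":
        return [tree[1]]
    _, condition, then_branch, else_branch = tree
    return run(then_branch if env[condition] else else_branch, env)


env = {"A": False, "B": True}
print(run(INNER_ELSE, env))   # the else belongs to the inner if
print(run(OUTER_ELSE, env))   # the else belongs to the outer if
```

With A false, the first reading performs no action while the second performs y, so the choice of parse tree changes what the program does.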

    Syntax Trees (cont'd)

    Example (The if-then-else construct in Python)

    Given these two pieces of code, what is the output for each possible combination of values if both a and b can have a value from {True, False}?

    1. if a:
           if b: print(1)
           else: print(2)

    2. if a:
           if b: print(1)
       else: print(2)

    a = True, b = True: both print 1
    a = True, b = False: the first prints 2, the second nothing
    a = False, b = True: the second prints 2, the first nothing
    a = False, b = False: the second prints 2, the first nothing

    Python has no end keyword, which on its own would make the grammar ambiguous; the ambiguity is resolved by making whitespace (indentation) part of the syntax specification.

    Syntax Trees (cont'd)

    Example (Multistatement lines in Python)

    In Python, a semicolon (;) can be used to separate multiple statements within one line [10]. Which of the following are equivalent?

    1. if a: print(1); print(2)

    2. if a:
           print(1)
           print(2)

    3. if a:
           print(1)
       print(2)

    1. is equivalent to 2.

    What about if a: if b: print(1); else print(2)? [11]

    [10] Multistatement lines are considered bad practice in Python.
    [11] Invalid syntax.

    Ambiguity

    Ambiguity

    A grammar is ambiguous if a sentence can be parsed in more than one way:

    - the program has more than one parse tree, that is,
    - the program has more than one leftmost derivation [12].

    Note: The fact that a program has more than one derivation is not sufficient to consider the grammar ambiguous.

    - In practice, most programs have more than one derivation, but all these derivations correspond to the same parse tree: the grammar is unambiguous.
    - Two distinct leftmost derivations for the same program must correspond to two distinct parse trees: the grammar must be ambiguous in this case.

    [12] Or more than one rightmost derivation.

    Ambiguity (cont'd)

    Example (An ambiguous grammar)

    Let G_exp be a grammar including the following rules:

    expression ::= integer
                 | expression operator expression
    operator ::= - | + | * | /

    where integer may generate any integer numeral (a sequence of digits).

    Why is G_exp ambiguous?

    - Sentences like 1 + 2 + 3 have more than one parse tree.
    - Worse, sentences like 1 + 2 * 3 have more than one parse tree.

    Should 1 + 2 * 3 evaluate to 9 or to 7?

    - In Smalltalk, the result would be 9.
    - In general, we would like it to be 7.

    Ambiguity (cont'd)

    Example (An ambiguous grammar, cont'd)

    The expression 1 + 2 * 3 has two parse trees:

    expression
      expression
        integer  1
      operator  +
      expression
        expression
          integer  2
        operator  *
        expression
          integer  3

    expression
      expression
        expression
          integer  1
        operator  +
        expression
          integer  2
      operator  *
      expression
        integer  3
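Evaluating the two trees shows where the values 7 and 9 come from. The sketch below encodes each tree as nested (left, operator, right) tuples, which is an encoding chosen here for brevity rather than anything prescribed by the lecture.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def evaluate(tree):
    """Evaluate a tree that is either an int or a (left, op, right) tuple."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    return OPS[op](evaluate(left), evaluate(right))


MULTIPLY_FIRST = (1, "+", (2, "*", 3))   # first tree: 2 * 3 is a subexpression
ADD_FIRST = ((1, "+", 2), "*", 3)        # second tree: 1 + 2 is a subexpression

print(evaluate(MULTIPLY_FIRST))  # 7
print(evaluate(ADD_FIRST))       # 9
```

The same token sequence therefore denotes two different computations, which is exactly why the ambiguity has to be removed.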

    Avoiding Ambiguity

    There are a number of ways to avoid ambiguity in grammars. Here, we consider four alternative solutions.

    Solution 1: Obligatory parentheses

    We can modify G_exp by enforcing parentheses around complex expressions:

    expression ::= integer
                 | ( expression operator expression )
    operator ::= - | + | * | /

    Benefit: Ambiguity has been resolved.

    Drawback: Expressions such as 1 + 2 * 3, or even 1 + 2, are no longer legal. (We must type (1 + (2 * 3)) and (1 + 2) instead.)

    Avoiding Ambiguity (cont'd)

    Solution 2: Precedence of operators

    We can modify G_exp by distinguishing operators of high and low priority:

    expression ::= term
                 | expression lp-operator expression
    term ::= integer
           | ( expression )
           | term hp-operator term
    hp-operator ::= * | /
    lp-operator ::= + | -

    where hp-operator and lp-operator are high-priority and low-priority operators, respectively.

    Benefit: Expressions such as 1 + 2 * 3 can be (partially) parsed as 1 + expression but not as expression * 3.

    Drawback: An expression like 1 - 2 - 3 is still ambiguous: it can be (partially) parsed both as expression - 3 and as 1 - expression.

    Avoiding Ambiguity (cont'd)

    Solution 3: Associativity of operators

    We can modify G_exp by introducing associativity of operators:

    expression ::= integer | expression operator integer
    operator ::= * | / | + | -

    Benefit: The operators in this grammar are left-associative; the expression 1 - 2 - 3 can only be (partially) parsed as expression - 3, and not as 1 - expression.

    Drawback: All operators have equal precedence; an expression like 1 - 2 * 3 can only be (partially) parsed as expression * 3, and not as 1 - expression.

    Avoiding Ambiguity (cont'd)

    Solution 4: Combine associativity, precedence, and parentheses

    We can modify G_exp by adding all of the above:

    expression ::= term
                 | expression lp-operator term
    term ::= factor
           | term hp-operator factor
    factor ::= integer
             | ( expression )
    hp-operator ::= * | /
    lp-operator ::= + | -

    The low-priority operators appear at the expression level and the high-priority operators at the term level, so that * and / bind more tightly than + and -.
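A grammar with this shape maps fairly directly onto a small recursive-descent evaluator. The sketch below is one possible rendering, not the lecture's own code: it assumes the low-priority operators (+, -) sit at the expression level and the high-priority ones (*, /) at the term level, it replaces the left-recursive rules by loops (a standard trick that keeps left associativity), and all function names are hypothetical.

```python
import re


def tokenize(text):
    """Split into integer numerals, operators and parentheses; other characters are skipped."""
    return re.findall(r"[0-9]+|[-+*/()]", text)


def parse_expression(tokens, pos=0):
    """expression ::= term | expression lp-operator term   (as a loop)"""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        value = value + right if op == "+" else value - right
    return value, pos


def parse_term(tokens, pos):
    """term ::= factor | term hp-operator factor   (as a loop)"""
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        value = value * right if op == "*" else value / right
    return value, pos


def parse_factor(tokens, pos):
    """factor ::= integer | ( expression )"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        value, pos = parse_expression(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return value, pos + 1
    if token.isdigit():
        return int(token), pos + 1
    raise SyntaxError("unexpected token " + token)


def evaluate(text):
    tokens = tokenize(text)
    value, pos = parse_expression(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return value


print(evaluate("1 + 2 * 3"), evaluate("1 - 2 - 3"), evaluate("(1 + 2) * 3"))
```

For brevity the sketch computes values while parsing instead of building a parse tree; returning nested tuples from each function would give the tree instead.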

    Scanning

    What is scanning?

    Scanning is the process of translating programs from the string-of-characters input format into the sequence-of-tokens intermediate format.

    We have seen scanning in action in the mdc example:

    - the lexemizer took as input a string of characters and returned a sequence of lexemes;
    - the tokenizer took as input a sequence of lexemes and returned a sequence of tokens.

    These two steps are usually merged into one pass, called scanning. (Sometimes lexing or tokenization is used for both operations, and scanning may be used for creating the lexemes only.)

    Scanning (cont'd)

    How do we design and implement a scanner?

    Building a scanner requires a number of steps:

    1. Specification of the microsyntax (the lexical structure) of the language, typically using regular expressions (regexes).
    2. Based on the regexes, a nondeterministic finite automaton (NFA) is built that recognizes lexemes of the language.
    3. A deterministic finite automaton (DFA) equivalent to the NFA is built.
    4. The DFA is implemented using a nested control structure that processes the input one character at a time.

    All steps can be realized manually, but there exist tools which allow one to specify the lexical structure using regular expressions and build an implementation of the DFA automatically.

    We shall revisit the mdc example and build a scanner both manually and using a scanner-building tool.

    Scanning (cont'd)

    Before we implement an mdc scanner, we first have a look at a recognizer for mdc lexemes.

    - A scanner processes an input string and returns a list of lexemes (or tokens).
    - A recognizer checks whether the whole input string is a single lexeme.

    Example (A recognizer for mdc lexemes)

    Step 1: The microsyntax of mdc is trivially specified with the following regular expressions:

    command  ::= [pf]          (exactly one p or one f)
    operator ::= [\+\-\*\/]    (analogously, symbols escaped with \)
    integer  ::= [0-9]+        (one or more digits)

    Scanning (cont'd)

    Example (A recognizer for mdc lexemes, cont'd)

    Step 3: The regex specification is realized by the following DFA [13]:

    start --( p, f )--------> cmd
    start --( +, -, *, / )--> op
    start --( 0, ..., 9 )---> int
    int   --( 0, ..., 9 )---> int

    [13] We skip Step 2; see the further reading section for references if you need more details.

    Scanning (cont'd)

    Example (A recognizer for mdc lexemes, cont'd)

    Step 4: An algorithm for the mdc recognizer DFA [14]:

    input: string of characters; output: boolean
    state ← start; char ← next()
    while char ≠ EOF:
        if state = start:
            if char ∈ {p, f}: state ← cmd
            else if char ∈ {+, -, *, /}: state ← op
            else if char ∈ {0, ..., 9}: state ← int
            else: return false
        else if state ∈ {cmd, op}: return false
        else if state = int:
            if char ∉ {0, ..., 9}: return false
        char ← next()
    if state ∈ {cmd, op, int}: return true
    else: return false

    [14] Notation varies. EOF means end of file (input). Each call to next() returns the next character from the input.
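The recognizer algorithm translates almost line by line into Python. This is a sketch rather than the course's Oz implementation: the for loop plays the role of the repeated next() calls and ends at the end of the input, and the function name recognize is made up here.

```python
DIGITS = "0123456789"


def recognize(text):
    """Return True if the whole input string is a single mdc lexeme."""
    state = "start"
    for char in text:                      # iterating replaces next(); the loop ends at EOF
        if state == "start":
            if char in "pf":
                state = "cmd"
            elif char in "+-*/":
                state = "op"
            elif char in DIGITS:
                state = "int"
            else:
                return False
        elif state in ("cmd", "op"):
            return False                   # commands and operators are single characters
        elif state == "int":
            if char not in DIGITS:
                return False
    return state in ("cmd", "op", "int")   # the start state is not accepting


for text in ["p", "123", "1 2 +", ""]:
    print(repr(text), recognize(text))
```

The empty string is rejected because the automaton is still in the start state when the input ends.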

    Scanning (cont'd)

    The recognizer checks whether the whole string is a single lexeme, but we want more:

    - process strings that include more than one lexeme;
    - return a sequence of classified lexemes rather than a yes/no answer.

    In the previous implementation of mdc, all lexemes in a program had to be separated by whitespace. This leads to a tradeoff:

    - it is more convenient to implement the lexemizer: just split the input by whitespace;
    - it is less convenient to use the language: the programmer must separate all lexemes with whitespace.

    We shall now develop a scanner that makes whitespace between lexemes optional (unless we want to separate two numerals).

    Scanning (cont'd)

    Try it! The file code/mdc-recognizer.oz contains an implementation of the mdc recognizer and a few simple test cases.

    Open the file in the OPI (oz &, then C-x C-f). Execute the code (C-. C-b).

    What happens?

    - {MDCRecognizer "p"} evaluates to true, because the input is a command.
    - {MDCRecognizer "123"} evaluates to true, because the input is an integer.
    - {MDCRecognizer "1 2 +"} evaluates to false, because the input is not a valid lexeme, even though it is a valid sentence (a legal sequence of valid lexemes) in mdc.

    Scanning (cont'd)

    Example (A scanner for mdc)

    Step 4: An algorithm for the mdc scanner DFA [15]:

    input: string of characters; output: sequence of tokens
    tokens ← (); state ← start; char ← next(); seen ← ()
    while char ≠ EOF:
        if state = start:
            if char ∈ {p, f}: append <cmd, char> to tokens
            else if char ∈ {+, -, *, /}: append <op, char> to tokens
            else if char ∈ {0, ..., 9}: state ← int; seen ← char
            else if char ∉ S: error(char)
            char ← next()
        else if state = int:
            if char ∈ {0, ..., 9}: concatenate char to seen; char ← next()
            else: append <int, seen> to tokens; seen ← (); state ← start
    if state = int: append <int, seen> to tokens
    return tokens

    [15] tokens maintains a list of tokens recognized so far. seen maintains a string of characters seen since the most recently recognized token. Angle brackets (< and >) denote tokens (class-lexeme pairs).
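A Python rendering of the scanner algorithm might look as follows. It is a sketch under assumptions: the set S in the algorithm is taken to be the set of whitespace characters (the transcript does not define it), tokens are represented as (class, lexeme) tuples, an index replaces the next() calls, and the name scan is hypothetical.

```python
DIGITS = "0123456789"


def scan(text):
    """Turn a string of characters into a list of (class, lexeme) tokens."""
    tokens = []
    state = "start"
    seen = ""
    i = 0                                   # index of the current character
    while i < len(text):                    # running off the end plays the role of EOF
        char = text[i]
        if state == "start":
            if char in "pf":
                tokens.append(("cmd", char))
            elif char in "+-*/":
                tokens.append(("op", char))
            elif char in DIGITS:
                state = "int"
                seen = char
            elif not char.isspace():        # assumption: S is the set of whitespace characters
                raise ValueError("unexpected character: " + repr(char))
            i += 1
        else:                               # state == "int"
            if char in DIGITS:
                seen += char
                i += 1
            else:
                tokens.append(("int", seen))
                seen = ""
                state = "start"             # char is examined again in the start state
    if state == "int":
        tokens.append(("int", seen))
    return tokens


print(scan("1 2 +"))
print(scan("12+3p"))
```

Note that a non-digit ending a numeral is not consumed in the int state; it is looked at again from the start state, which is what makes whitespace between lexemes optional.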


    Summary

    This time: syntax, grammars, derivations, parse trees, ambiguity; recognizing, scanning; design and implementation of an mdc scanner.

    Note! The code examples are used as an illustration; we will return to (some parts of) them when you learn more about the syntax and semantics of Oz.

    Next time: syntax and semantics of the declarative kernel language.

    Summary (cont'd)

    Homework: Examine and try out today's code; read the Mozart/Oz documentation if necessary.

    Pensum: Most of today's slides, except for the implementation details of the mdc scanners and the recognizer and scanner DFAs.

    Further reading: See, e.g., Ch. 3 in Sebesta, Concepts of Programming Languages; Ch. 2 in Scott, Programming Language Pragmatics; Ch. 2-4 in Cooper and Torczon, Engineering a Compiler (a detailed, in-depth but readable presentation).

    Questions?