Syntax
The Structure of a Language
Lexical Structure
• The structure of the tokens of a programming language
• The scanner takes a sequence of characters and collects them into tokens
Tokens
• Reserved words (keywords)
  – if, while
• Literals or constants
  – 3.14, “Fred”
• Special symbols
  – +, =
• Identifiers
Principle of Longest Substring
• At each point, the longest possible string is collected into a single token
• Natural token separators
  – Symbols that separate tokens
    • ; + =
  – White space
    • Spaces and tabs
    • Newlines
    • Comments
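As a minimal sketch of how a scanner might apply the principle of longest substring, the Python below collects characters into tokens with the standard `re` module. The token classes, the `KEYWORDS` set and the `scan` function are illustrative assumptions, not part of the slides. Python's alternation picks the first alternative that matches rather than the longest one, so the sketch lists `<=` before `<` to get longest-match behaviour.

```python
import re

# Hypothetical token classes for a tiny C-like language (an assumption for illustration).
KEYWORDS = {"if", "while"}
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+(?:\.[0-9]+)?"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SYMBOL", r"<=|>=|==|[-+*/=<>;()]"),  # two-character symbols listed first
    ("SKIP", r"\s+"),                       # white space only separates tokens
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(text):
    """Collect the characters of text into (kind, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        # An identifier is a keyword only if the whole lexeme is reserved.
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens
```

For example, `scan("ifx<=10")` yields the single identifier `ifx`, the symbol `<=` and the number `10`, rather than the keyword `if` followed by `x`.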
FORTRAN violates these rules
• DO 99 I = 1.10
  – Assigns 1.10 to the variable DO99I, because blanks are not significant
• DO 99 I = 1,10
  – Sets up a loop with loop counter I going from 1 to 10
• FORTRAN has no reserved words at all
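The two statements above differ by a single character, which the following Python sketch tries to make concrete. It is a deliberately crude illustration, not how a real FORTRAN compiler works: the function name `classify_fortran` and the rule "a comma after `=` means a DO loop" are assumptions made only for this example.

```python
def classify_fortran(statement):
    """Crudely classify a statement the way blank-insensitive scanning would see it."""
    # Blanks are not significant, so remove them before looking at the text.
    squeezed = statement.replace(" ", "")
    left, _, right = squeezed.partition("=")
    # Simplifying assumption: a comma after '=' marks a DO loop header.
    if squeezed.startswith("DO") and "," in right:
        return ("loop", squeezed)
    return ("assignment", left, right)
```

With this sketch, `classify_fortran("DO 99 I = 1.10")` reports an assignment to `DO99I`, while `classify_fortran("DO 99 I = 1,10")` reports a loop.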
C token conventions
• Six classes of tokens
  – Identifiers
  – Keywords
  – Constants
  – String literals
  – Operators
  – Other separators
• White space characters are ignored except as they separate tokens
• Adheres to the principle of longest substring
Regular Expressions
• Regular expressions were invented by Stephen Kleene and appeared in a RAND Corporation report in 1951
• Regular expressions represent a form of language definition
• Each regular expression E denotes a language L(E) defined over the alphabet of the language
Rules defining REs
• Empty
  – The empty string is an RE
• Atom
  – Any symbol from the alphabet is an RE
• Alternation
  – If a and b are REs then so is a|b
  – All strings identified by a and all those identified by b
• Concatenation
  – If a and b are REs then so is ab
  – All strings formed by concatenating a string identified by b to the end of one identified by a
More rules for REs
• Kleene Closure
  – If a is an RE then so is a*
  – All strings formed by concatenating zero or more strings identified by a
• Positive Closure
  – If a is an RE then so is a+
  – All strings formed by concatenating one or more strings identified by a
Examples of REs
• (a|b)c
  – Recognizes ac and bc but no others
• (a|b)*c
  – Recognizes c, ac, bc, aac, abc, abac, and so on
• (a|b)+c
  – Does not recognize c but recognizes all the others above
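The three example expressions can be checked with Python's standard `re` module, whose syntax for alternation, `*` and `+` coincides with the notation used here. This is a small sketch; the `samples` list and the `accepted` helper are names chosen for illustration. `re.fullmatch` is used so that the whole string must match, not just a prefix.

```python
import re

# Candidate strings, including one ("ca") that none of the patterns should accept.
samples = ["c", "ac", "bc", "aac", "abc", "abac", "ca"]
patterns = [r"(a|b)c", r"(a|b)*c", r"(a|b)+c"]


def accepted(pattern, strings):
    """Return the strings that the regular expression recognizes in full."""
    return [s for s in strings if re.fullmatch(pattern, s)]


results = {pattern: accepted(pattern, samples) for pattern in patterns}
for pattern, matches in results.items():
    print(pattern, matches)
```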
Extensions
• [] – any one of a set of characters
  – [A-Z] – any capital letter
  – [0123456789] – any digit
• ? – an optional item (0 or 1 of these)
  – [A-Z][0-9]? – a single capital letter, or a single capital letter followed by a single digit
• . (period) – any character
More Examples
• [0-9]+
  – Simple integer constants
• [0-9]+(\.[0-9]+)?
  – Simple floating-point constants (the fractional part is optional, so plain integers also match)
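A quick way to see what such patterns accept is to try them with Python's `re` module. In this sketch the fractional part is written as `\.[0-9]+` so that more than one digit may follow the decimal point; that choice, and the helper names `is_integer` and `is_float`, are assumptions made for the example.

```python
import re

INTEGER = re.compile(r"[0-9]+")
# Assumption: one or more digits are allowed after the decimal point.
FLOAT = re.compile(r"[0-9]+(\.[0-9]+)?")


def is_integer(text):
    """True if the whole string is a simple integer constant."""
    return INTEGER.fullmatch(text) is not None


def is_float(text):
    """True if the whole string matches the simple floating-point pattern."""
    return FLOAT.fullmatch(text) is not None
```

Because the fractional part is optional, `is_float("42")` is true as well as `is_float("3.14")`, while `"3."` and `".5"` are rejected.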
Context-Free Grammars (CFGs)
• Context-free grammars were developed by Noam Chomsky as a way to specify language
• Rules are generally specified in Backus-Naur Form (BNF) or in Extended BNF (EBNF)
What makes up a CFG?
• A set N of non-terminal symbols
• A set T of terminal symbols
• A set P of production rules
• A special non-terminal symbol S called the start symbol (or goal symbol)
Sample CFG
• sentence → noun-phrase verb-phrase .
• noun-phrase → article noun
• article → a | the
• noun → girl | dog
• verb-phrase → verb noun-phrase
• verb → sees | pets
Parts of the grammar
• Non-terminal symbols
  – {sentence, noun-phrase, article, noun, verb-phrase, verb}
• Terminal symbols
  – { ., a, the, girl, dog, sees, pets }
• Production rules
  – The previous slide provides these
• Start symbol
  – sentence
Notes on CFG
• Non-terminal symbols are those that appear on the left-hand side (lhs) of the production rules
• Terminal symbols are those that appear only on the right-hand side (rhs) of the production rules
• → and | are meta-symbols
(Left-Most) Derivation
sentence ⇒ noun-phrase verb-phrase .
  ⇒ article noun verb-phrase .
  ⇒ the noun verb-phrase .
  ⇒ the girl verb-phrase .
  ⇒ the girl verb noun-phrase .
  ⇒ the girl sees noun-phrase .
  ⇒ the girl sees article noun .
  ⇒ the girl sees a noun .
  ⇒ the girl sees a dog .
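The derivation can be replayed mechanically. The following Python sketch stores the sample grammar as a dictionary and always expands the leftmost non-terminal; the names `GRAMMAR` and `leftmost_derivation`, and the idea of passing a list of alternative indices as `choices`, are assumptions made for this illustration.

```python
# The sample grammar: each non-terminal maps to its list of alternatives.
GRAMMAR = {
    "sentence": [["noun-phrase", "verb-phrase", "."]],
    "noun-phrase": [["article", "noun"]],
    "article": [["a"], ["the"]],
    "noun": [["girl"], ["dog"]],
    "verb-phrase": [["verb", "noun-phrase"]],
    "verb": [["sees"], ["pets"]],
}


def leftmost_derivation(start, choices):
    """Expand the leftmost non-terminal until only terminals remain.

    choices supplies the index of the alternative to use each time a
    non-terminal with more than one alternative is expanded.
    """
    form = [start]
    steps = [" ".join(form)]
    choices = iter(choices)
    while True:
        # Find the leftmost symbol that is still a non-terminal.
        index = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if index is None:
            return steps
        alternatives = GRAMMAR[form[index]]
        pick = next(choices) if len(alternatives) > 1 else 0
        form[index:index + 1] = alternatives[pick]
        steps.append(" ".join(form))


# the, girl, sees, a, dog
for step in leftmost_derivation("sentence", [1, 0, 0, 0, 1]):
    print(step)
```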
Corresponding Parse Tree
sentence
  noun-phrase
    article
      the
    noun
      girl
  verb-phrase
    verb
      sees
    noun-phrase
      article
        a
      noun
        dog
  .
Ambiguous Grammars
• A grammar is ambiguous if some sentence has
  – two distinct leftmost derivations or, equivalently,
  – two distinct parse trees
Grammar for expressions
expr → expr + expr
     | expr * expr
     | ( expr )
     | number
number → number digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
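Parse trees for 3 + 5 * 7
(each operand n below stands for the subtree expr → number → digit → n)
Tree 1, with + at the root, groups the input as 3 + (5 * 7):
expr
  expr (3)
  +
  expr
    expr (5)
    *
    expr (7)
Tree 2, with * at the root, groups the input as (3 + 5) * 7:
expr
  expr
    expr (3)
    +
    expr (5)
  *
  expr (7)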
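The two parse trees for 3 + 5 * 7 matter because they give different values. As a small sketch, the Python below encodes each tree as a nested tuple and evaluates it bottom-up; the tuple encoding and the names `evaluate`, `tree_plus_at_root` and `tree_times_at_root` are assumptions for illustration.

```python
def evaluate(tree):
    """Evaluate a parse tree given as an int leaf or a (left, op, right) tuple."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b


# + at the root: the multiplication sits lower in the tree and is done first.
tree_plus_at_root = (3, "+", (5, "*", 7))
# * at the root: the addition sits lower in the tree and is done first.
tree_times_at_root = ((3, "+", 5), "*", 7)

print(evaluate(tree_plus_at_root))   # 38
print(evaluate(tree_times_at_root))  # 56
```

The operator that appears lower in the tree is applied first, which is why a grammar that encodes precedence pushes the higher-precedence operator further down.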
Handling Ambiguity
• The grammar rules for expressions can be rewritten so that operator precedence is built into the grammar, which removes this ambiguity
• Introduce a new non-terminal that forces the higher-precedence operator lower in the parse tree
Precedence handled
expr → expr + expr | term
term → term * term | ( expr ) | number
number → number digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Associativity
• This grammar is still ambiguous
• There are two parse trees for 5 + 7 + 9
• This may be ok for addition & multiplication, but not for subtraction & division, which are left-associative
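Why associativity matters for some operators but not others can be checked numerically. The sketch below groups a list of operands from the left and from the right; `fold_left` and `fold_right` are hypothetical helper names, and the operand lists are chosen only for illustration.

```python
import operator
from functools import reduce


def fold_left(op, operands):
    """Group from the left: ((a op b) op c)."""
    return reduce(op, operands)


def fold_right(op, operands):
    """Group from the right: (a op (b op c))."""
    result = operands[-1]
    for value in reversed(operands[:-1]):
        result = op(value, result)
    return result


# Addition gives the same value either way.
print(fold_left(operator.add, [5, 7, 9]), fold_right(operator.add, [5, 7, 9]))
# Subtraction does not: (10 - 4) - 3 versus 10 - (4 - 3).
print(fold_left(operator.sub, [10, 4, 3]), fold_right(operator.sub, [10, 4, 3]))
```

Only the left grouping matches the usual meaning of 10 - 4 - 3, so the grammar has to force that tree.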
Revised Grammar (not ambiguous)
expr → expr + term | term
term → term * factor | factor
factor → ( expr ) | number
number → number digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
EBNFs
• Extended BNF adds more metasymbols
  • { } – a repeated item (0 or more times)
  • [ ] – an optional item (0 or 1 time)
Expression Grammar in EBNF
expr → term { + term }
term → factor { * factor }
factor → ( expr ) | number
number → digit { digit }
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
EBNF for if-statement
if-statement → if ( expression ) statement [ else statement ]
Syntax Diagrams
• Syntax diagrams are an alternative to EBNF
• Study the diagrams on pp 99-101 and observe the direct relationship of each to the EBNF grammar rules for expressions
Parsers
• The simplest parser is a recognizer
  – Accepts or rejects strings based on whether they are legal strings in the language
• More general parsers
  – Build parse trees (or abstract syntax trees)
  – May calculate values of expressions, etc.
Bottom-up Parsers
• Attempts to match the input with the RHSs of the grammar rules
• When a match occurs, the RHS is replaced by the non-terminal on the LHS of the rule (called a reduce)
• Sometimes called shift-reduce parsing
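The shift and reduce steps can be sketched for the small English grammar from the sample CFG slide. This Python version is a deliberately naive recognizer: it reduces greedily whenever the top of the stack matches a right-hand side, which happens to work for this grammar but is not how practical shift-reduce parsers choose their actions. The names `RULES` and `shift_reduce` are assumptions made for the example.

```python
# Production rules of the sample grammar as (LHS, RHS) pairs.
RULES = [
    ("sentence", ["noun-phrase", "verb-phrase", "."]),
    ("noun-phrase", ["article", "noun"]),
    ("verb-phrase", ["verb", "noun-phrase"]),
    ("article", ["a"]),
    ("article", ["the"]),
    ("noun", ["girl"]),
    ("noun", ["dog"]),
    ("verb", ["sees"]),
    ("verb", ["pets"]),
]


def shift_reduce(tokens, goal="sentence"):
    """Return (accepted, trace) for a greedy shift-reduce pass over tokens."""
    stack, trace = [], []
    for token in tokens:
        stack.append(token)                      # shift the next input token
        trace.append(("shift", list(stack)))
        reduced = True
        while reduced:                           # reduce for as long as some RHS matches
            reduced = False
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    stack[-len(rhs):] = [lhs]    # replace the RHS by its LHS
                    trace.append(("reduce", list(stack)))
                    reduced = True
                    break
    return stack == [goal], trace


accepted, trace = shift_reduce("the girl sees a dog .".split())
for action, stack in trace:
    print(action, stack)
```

The input is accepted when the stack has been reduced to the start symbol alone once all tokens have been shifted.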
Top-down Parsers
• Non-terminals are expanded to match incoming tokens and the parser directly constructs a derivation
Recursive-Descent Parsing
• A recursive-descent parser is a program made up of a collection of mutually recursive procedures, one for each non-terminal
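As a sketch of the idea, the Python below has one method per non-terminal of the EBNF expression grammar (expr, term, factor, number), with each `{ ... }` repetition turned into a `while` loop. It also computes the value of the expression as it parses. The class name `Parser`, the `parse` wrapper and the decision to discard all spaces up front are assumptions made for brevity, so "1 2" would be read as 12.

```python
DIGITS = set("0123456789")


class Parser:
    def __init__(self, text):
        self.text = text.replace(" ", "")  # simplification: drop all spaces
        self.pos = 0

    def peek(self):
        """Return the next character, or '' at end of input."""
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def expr(self):
        # expr -> term { + term }
        value = self.term()
        while self.peek() == "+":
            self.pos += 1
            value += self.term()
        return value

    def term(self):
        # term -> factor { * factor }
        value = self.factor()
        while self.peek() == "*":
            self.pos += 1
            value *= self.factor()
        return value

    def factor(self):
        # factor -> ( expr ) | number
        if self.peek() == "(":
            self.pos += 1
            value = self.expr()
            if self.peek() != ")":
                raise SyntaxError(f"expected ')' at position {self.pos}")
            self.pos += 1
            return value
        return self.number()

    def number(self):
        # number -> digit { digit }
        start = self.pos
        while self.peek() in DIGITS:
            self.pos += 1
        if start == self.pos:
            raise SyntaxError(f"expected a digit at position {self.pos}")
        return int(self.text[start:self.pos])


def parse(text):
    """Parse and evaluate text, rejecting any trailing input."""
    parser = Parser(text)
    value = parser.expr()
    if parser.pos != len(parser.text):
        raise SyntaxError(f"unexpected input at position {parser.pos}")
    return value
```

Because `term` is called from inside `expr`, multiplication binds tighter than addition: `parse("3 + 5 * 7")` gives 38 and `parse("(3 + 5) * 7")` gives 56.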