scanning & parsing with lex and yacc


Upload: gautam

Post on 05-Feb-2016


Page 1: Scanning & Parsing with  Lex and YACC

Scanning & Parsing with Lex and YACC

Hans-Arno Jacobsen

ECE 297

Can we generate code to support mundane coding tasks and save time?

Powerful, but not easy

Give you an example for Milestone 1.

• Submissions: 99
• Average for A2: 71%
• Early submission bonus: 1
• Full marks: 5
• 16 teams attempted nonce bonus
• 7 got full marks
• 7 teams attempted ACC bonus
• 7 got full marks

Page 2: Scanning & Parsing with  Lex and YACC

CoursePeer – try it out!

• Developed by a former ECE297 student
– Many of the videos under tips & tricks are from him too

• Short video about CoursePeer

• To sign up and auto-enrol under ECE297, use this link
– http://www.crspr.com/?rid=339

• Will have a quick demo and use it on Wednesday for our Q&A session

Page 3: Scanning & Parsing with  Lex and YACC

Know your tools!

• Can we generate code based on a specification of what we want?

• Is the specification simpler than writing a program for doing the same task?

• Fully automated program generation has been a dream since the early days of computing.

Page 4: Scanning & Parsing with  Lex and YACC

Where do we need parsing in the storage server?

Page 5: Scanning & Parsing with  Lex and YACC

Where do we need parsing in the storage server?

• Configuration file (file)
• Bulk loading of data files (file)
• Protocol messages (network)
• Command line arguments (string)

Page 6: Scanning & Parsing with  Lex and YACC

Parsing

• default.conf – the way the disk may see it

server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF

The same file, the way we read it:

server_host localhost
server_port 1111
table marks
data_directory ./data

Tokens:

PROPERTY VALUE
PROPERTY VALUE
(TABLE TABLE-NAME)+
PROPERTY VALUE

Page 7: Scanning & Parsing with  Lex and YACC

Scenarios
Where would we like to save time by writing a quick language processor?

Conceptually speaking
• Languages
– Data description language
– Script language
– Markup language
• System configurations
• Workload generation

In our storage servers
• Languages
– Data schema & data
– Query language
– Output formatting (Web, LaTeX, PDF, Word, Excel)
• Storage server configuration
• Benchmarking

Page 8: Scanning & Parsing with  Lex and YACC

Parser generation from 30K feet

[Diagram: a specification (written by the developer) is fed to the generator, which produces generated code. The generated code and other code (written by the developer) go through the compiler / linker to produce the executable.]

Page 9: Scanning & Parsing with  Lex and YACC

Scanning & parsing I

Input: server_host localhost \n server_port 1111 \n table marks \n # This data …

Scanning: turns the input into a token stream
PROPERTY VALUE PROPERTY VALUE …

Parsing: checks the token stream against the expected structure
PROPERTY VALUE
PROPERTY VALUE
(TABLE TABLE-NAME)+
PROPERTY VALUE

Processing: verify content, add to data structures, …
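To make the scanning step concrete, here is a minimal hand-written sketch in C of what a scanner does with the configuration text: split it into words and label each word with a token kind. The names `token_kind`, `classify` and `next_token` are hypothetical and chosen for this illustration only. flex generates far more capable code from a specification, so this is a sketch of the idea, not of flex's output.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, loosely mirroring PROPERTY / TABLE / VALUE above. */
enum token_kind { TOK_EOF, TOK_PROPERTY, TOK_TABLE, TOK_VALUE };

/* Classify one word by comparing it against the known keywords. */
enum token_kind classify(const char *word)
{
    if (strcmp(word, "server_host") == 0 ||
        strcmp(word, "server_port") == 0 ||
        strcmp(word, "data_directory") == 0)
        return TOK_PROPERTY;
    if (strcmp(word, "table") == 0)
        return TOK_TABLE;
    return TOK_VALUE;
}

/*
 * Copy the next whitespace-delimited word from *cursor into buf (at most
 * size - 1 characters, size must be at least 1), advance *cursor past the
 * word, and return its token kind. Returns TOK_EOF when only whitespace
 * remains.
 */
enum token_kind next_token(const char **cursor, char *buf, size_t size)
{
    const char *p = *cursor;
    size_t n = 0;

    /* Skip blanks, tabs and newlines between words. */
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *cursor = p;
        buf[0] = '\0';
        return TOK_EOF;
    }
    /* Consume the word; characters beyond the buffer size are dropped. */
    while (*p != '\0' && !isspace((unsigned char)*p)) {
        if (n + 1 < size)
            buf[n++] = *p;
        p++;
    }
    buf[n] = '\0';
    *cursor = p;
    return classify(buf);
}
```

Calling `next_token` repeatedly on "server_port 1111" yields TOK_PROPERTY and then TOK_VALUE, which is the token stream the parsing step consumes. Comments (lines starting with #) are not handled in this sketch.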

Page 10: Scanning & Parsing with  Lex and YACC

Regular expressions

• (TABLE TABLE-NAME)+
– TABLE TABLE-NAME
– TABLE TABLE-NAME TABLE TABLE-NAME
– …

• Regular expressions (formal languages)

• Extended regular expressions (UNIX)

Patterns

Page 11: Scanning & Parsing with  Lex and YACC

Scanning & parsing II

• Parsing is really two steps
– Scanning (a.k.a. tokenizing or lexical analysis)
– Parsing, i.e., analysis of structure and syntax according to a grammar (i.e., a set of rules)
• flex is the scanner generator (open source)
– Fast Lex for lexical analysis
• YACC is the parser generator
– Yet Another Compiler Compiler for structural and syntax analysis
• Lex and YACC work together
• The generated parser drives the generated scanner (yyparse() calls yylex() for each token)
• We use flex (fast Lex) and Bison (GNU YACC)
• There are myriad other tools for Java, C++, …, some of which combine Lex/Yacc into one tool (e.g., javacc)

Page 12: Scanning & Parsing with  Lex and YACC

Objectives for today

• Cover the basics of Lex & Yacc

• Everybody should have an appreciation of the potential of these tools

• There is a lot more detail that remains unsaid

• To challenge you

Page 13: Scanning & Parsing with  Lex and YACC

Lex & YACC overview

input stream -> Lexical Analyzer -> token stream -> Structural Analyzer -> output defined by actions in parser specification (often an in-memory representation of input)

Example input stream:
server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF

Example token stream:
PROPERTY VALUE PROPERTY VALUE

Page 14: Scanning & Parsing with  Lex and YACC

LEXICAL ANALYSIS WITH LEX

Page 15: Scanning & Parsing with  Lex and YACC

Lex introduction

Input specification (*.l) -> flex -> lex.yy.c -> C compiler -> Lexical Analyzer
input stream -> Lexical Analyzer -> token stream

• You generate the lexical analyzer by using flex (flex is fast Lex)
• You can control the name of the generated file
• Synonyms: lexical analyzer, scanner, lexer, tokenizer

Page 16: Scanning & Parsing with  Lex and YACC

Lex
• Input specification for lex – the “program”
– Three parts: Definitions, Rules, User code
– Use “%%” as a delimiter for each part
• First part: Definitions
– Options used by flex inside the scanner
– Defines variables & macros
– Code within “%{” and “%}” directly copied into the scanner (e.g., global variables, header files)
• Second part: Rules
– Patterns and corresponding actions
– Actions are executed when the corresponding pattern(s) match
– Patterns are defined by regular expressions

Page 17: Scanning & Parsing with  Lex and YACC

Parsing the configuration file of Milestone 1

config_parser.l

%{
#include "config_parser.tab.h"
...
%}
a2Z    [a-zA-Z]
host   server_host
port   server_port
dir    data_directory

%%

{host}     { return HOST_PROPERTY; }
{port}     { return PORT_PROPERTY; }
table      { return TABLE; }
{dir}      { return DDIR_PROPERTY; }
[\t\n ]+   { }
#.*\n      { }
{a2Z}*     { yylval.sval = strdup(yytext); return STRING; }
[0-9]+     { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
.          { return yytext[0]; }
…

The names a2Z, host, port and dir are shorthands for use below the first “%%”. In each rule, the left-hand part is the pattern and the code in braces is the action.

Page 18: Scanning & Parsing with  Lex and YACC

flex pattern matching principles

• Actions are executed when patterns match
– Tokens are returned to caller; next pattern …
• Patterns match a given input character or string only once
– Input stream is consumed
• flex executes the action for the longest possible matching input
– Order of patterns in the spec. is important: when several patterns match the same longest input, the pattern listed first wins
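The longest-match principle and the role of pattern order can be sketched in plain C. This is an illustration under stated assumptions, not flex's actual algorithm (flex compiles all patterns into one automaton). The names `matcher_fn`, `match_table_keyword`, `match_letters` and `pick_rule` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule type: reports how many leading characters a rule matches. */
typedef size_t (*matcher_fn)(const char *input);

/* Matches the literal keyword "table" at the start of the input. */
size_t match_table_keyword(const char *input)
{
    return strncmp(input, "table", 5) == 0 ? 5 : 0;
}

/* Matches a run of one or more letters, like the pattern [a-zA-Z]+. */
size_t match_letters(const char *input)
{
    size_t n = 0;
    while (isalpha((unsigned char)input[n]))
        n++;
    return n;
}

/*
 * Return the index of the winning rule, or -1 if no rule matches.
 * The longest match wins. On a tie the rule listed first wins, because
 * a later rule replaces the current best only if it is strictly longer.
 * The length of the winning match is stored in *length.
 */
int pick_rule(const matcher_fn *rules, size_t nrules,
              const char *input, size_t *length)
{
    int best = -1;
    size_t best_len = 0;
    size_t i;

    for (i = 0; i < nrules; i++) {
        size_t len = rules[i](input);
        if (len > best_len) {
            best = (int)i;
            best_len = len;
        }
    }
    *length = best_len;
    return best;
}
```

With the rules ordered as { match_table_keyword, match_letters }, the input "table marks" is a tie at five characters and the keyword rule wins, while "tables" is claimed by the letters rule because its match is longer. This is why keyword patterns are usually listed before a general string pattern.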

Page 19: Scanning & Parsing with  Lex and YACC

flex regular expressions by example I (Really: extended regular expressions)

x          match the character 'x'
.          any character (byte) except newline
[xyz]      match either an 'x', a 'y', or a 'z'
[abj-oZ]   match an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
[^A-Z]     a "negated character class", i.e., any character EXCEPT those in the class
[^A-Z\n]   any character EXCEPT an uppercase letter or a newline

Page 20: Scanning & Parsing with  Lex and YACC

flex regular expression by example II

r*        zero or more r's
r+        one or more r's
r?        zero or one r (that is, “an optional r”)
r{2,5}    anywhere from two to five r's
r{2,}     two or more r's
r{4}      exactly 4 r's
<<EOF>>   an end-of-file

In every row, r is any regular expression.
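The repetition operators and character classes can be tried out from C without flex, using the POSIX `regcomp` / `regexec` functions with the `REG_EXTENDED` flag. The operators listed above behave the same way there, although flex-only syntax such as `<<EOF>>` is not available. The helper name `full_match` and the 256-byte buffer are assumptions made for this sketch.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if the whole of text matches the extended regular expression
 * pattern, 0 if it does not, and -1 if the pattern is too long for the
 * local buffer or fails to compile. (full_match is a hypothetical helper.)
 */
int full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, matched;

    /* Anchor the pattern so that it must cover the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    matched = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

For example, `full_match("r{2,5}", "rrr")` returns 1, whereas `full_match("r{2,5}", "r")` returns 0 because one r is fewer than two.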

Page 21: Scanning & Parsing with  Lex and YACC

flex regular expressions

• There are many more expressions, see manual

• Form complex expressions– E.g.: IP address, names, …

• The expression syntax is used in other tools as well (well worth learning)

Page 22: Scanning & Parsing with  Lex and YACC

Parsing the configuration file of Milestone 1

config_parser.l

%{
#include "config_parser.tab.h"
...
%}
a2Z    [a-zA-Z]
host   server_host
port   server_port
dir    data_directory

%%

{host}     { return HOST_PROPERTY; }
{port}     { return PORT_PROPERTY; }
table      { return TABLE; }
{dir}      { return DDIR_PROPERTY; }
[\t\n ]+   { }
#.*\n      { }
{a2Z}*     { yylval.sval = strdup(yytext); return STRING; }
[0-9]+     { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
.          { return yytext[0]; }
<<EOF>>    { return 0; }

yylval is a user-defined variable in YACC (conveys the token value to YACC).

Example input:

server_host localhost
server_port 1111
table marks
data_directory ./data

Page 23: Scanning & Parsing with  Lex and YACC

PARSING WITH YACC

Page 24: Scanning & Parsing with  Lex and YACC

YACC introduction

Input specification (*.y) -> YACC -> y.tab.c -> C compiler -> Syntax analyzer / parser
token stream (e.g., via flex) -> Syntax analyzer / parser -> output defined by actions in parser specification

• From the specified grammar, YACC generates a parser which recognizes “sentences” according to the grammar
• You can control the name of the generated file

Page 25: Scanning & Parsing with  Lex and YACC

YACC
• Input specification for YACC (similar to flex)
– Three parts: Definitions, Rules, User code
– Use “%%” as a delimiter for each part
• First part: Definitions
– Definition of tokens for the second part and for use by flex
– Definition of variables for use by the parser code
• Second part: Rules
– Grammar for the parser
• Third part: User code
– The code in this part is copied into the parser generated by YACC

Page 26: Scanning & Parsing with  Lex and YACC

Configuration file parser Milestone 1

config_parser.y, definition section

%{
#include <string.h>
#include <stdio.h>
#include <stdlib.h>   /* malloc, used in the rule actions */

struct table *tl, *t;
struct configuration *c;

/* define a linked list of table names */
struct table {
  char *table_name;
  struct table *next;
};

/* define a structure for the configuration information */
struct configuration {
  char *host;
  int port;
  struct table *tlist;
  char *data_dir;
};

Page 27: Scanning & Parsing with  Lex and YACC

Configuration file parser Milestone 1

config_parser.y, definition section cont’d.

%}
%union {
  char *sval;   // String value (user defined)
  int pval;     // Port number value (user defined)
}
%token <sval> STRING
%token <pval> PORT_NUMBER
%token HOST_PROPERTY PORT_PROPERTY DDIR_PROPERTY TABLE

%%

Page 28: Scanning & Parsing with  Lex and YACC

Configuration file parser Milestone 1

config_parser.y, (grammar) rules section (simplified)

property_list:
      HOST_PROPERTY STRING
      PORT_PROPERTY PORT_NUMBER
      table_list
      data_directory
    ;

table_list:
      table_list TABLE STRING
    | TABLE STRING
    ;

data_directory:
      DDIR_PROPERTY STRING
    ;
%%

Page 29: Scanning & Parsing with  Lex and YACC

config_parser.y, (grammar) rules section (details)

data_directory:
      DDIR_PROPERTY STRING
        {
          c = (struct configuration *) malloc(sizeof(struct configuration));
          // Check c for NULL
          c->data_dir = strdup( $2 );
        }
    ;

In the action, $1 refers to DDIR_PROPERTY and $2 to STRING.

For reference:

struct configuration { char *host; int port; struct table *tlist; char *data_dir; };
struct configuration *c;

Page 30: Scanning & Parsing with  Lex and YACC

config_parser.y, (grammar) rules section (details)

property_list:
      HOST_PROPERTY STRING PORT_PROPERTY PORT_NUMBER table_list data_directory
        {
          c->host = strdup( $2 );
          c->port = $4;
          c->tlist = tl;
        }
    ;

For reference:

struct configuration { char *host; int port; struct table *tlist; char *data_dir; };
struct configuration *c;

Page 31: Scanning & Parsing with  Lex and YACC

Configuration file parser Milestone 1

config_parser.y, (grammar) rules section (simplified)

property_list:
      HOST_PROPERTY STRING
      PORT_PROPERTY PORT_NUMBER
      table_list
      data_directory
    ;

table_list:
      table_list TABLE STRING
    | TABLE STRING
    ;

data_directory:
      DDIR_PROPERTY STRING
    ;
%%

table_list matches token sequences such as: … TABLE STRING TABLE STRING

Page 32: Scanning & Parsing with  Lex and YACC

table_list is a recursive rule

• Example table specification in configuration file
table MyCourses
table MyMarks
table MyFriends

• table_list: table_list TABLE STRING | TABLE STRING ;

• Terminology
– table_list is called a non-terminal
– TABLE & STRING are terminals

Page 33: Scanning & Parsing with  Lex and YACC

Recursive rule execution

table_list: table_list TABLE STRING | TABLE STRING ;

Expanding the recursive alternative step by step:

table_list TABLE STRING
table_list TABLE STRING TABLE STRING
TABLE STRING TABLE STRING TABLE STRING

Input:

table MyCourses
table MyMarks
table MyFriends

List contents as the entries are processed:

table MyCourses
table MyMarks, table MyCourses
table MyMarks, table MyCourses, table MyFriends
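A left-recursive rule such as table_list describes "one or more TABLE STRING pairs", which a hand-written recognizer would express as a loop. The sketch below shows that correspondence in C over an array of tokens. The enum `cfg_token` and the function `count_table_entries` are hypothetical names for this illustration, and the generated YACC parser works differently internally (it is table-driven).

```c
#include <stddef.h>

/* Hypothetical token codes for this sketch. */
enum cfg_token { T_TABLE, T_STRING, T_OTHER };

/*
 * Count how many complete "TABLE STRING" pairs start the token array.
 * Each loop iteration corresponds to one application of the table_list
 * rule. A result of 0 means the input does not begin with a table_list.
 */
size_t count_table_entries(const enum cfg_token *tokens, size_t ntokens)
{
    size_t pos = 0;
    size_t entries = 0;

    while (pos + 1 < ntokens &&
           tokens[pos] == T_TABLE && tokens[pos + 1] == T_STRING) {
        entries++;
        pos += 2;   /* consume one TABLE STRING pair */
    }
    return entries;
}
```

For the three-table example above, the token array holds three TABLE STRING pairs and the function returns 3.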

Page 34: Scanning & Parsing with  Lex and YACC

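config_parser.y

table_list:
      table_list TABLE STRING
        {
          t = (struct table *) malloc(sizeof(struct table));
          t->table_name = strdup( $3 );
          t->next = tl;
          tl = t;
        }
    | TABLE STRING
        {
          tl = (struct table *) malloc(sizeof(struct table));
          tl->table_name = strdup( $2 );
          tl->next = NULL;
        }
    ;

• First alternative: $1 is table_list, $2 is TABLE, $3 is STRING
• Second alternative: $1 is TABLE, $2 is STRING
• t->next = tl links the new entry t in front of the current list; tl = t makes it the new head
• tl->next = NULL terminates the list at its first entry

For reference:

struct table { char *table_name; struct table *next; };
struct table *tl, *t;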
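The list manipulation inside the two actions can be isolated into a small C function to see its effect. This is a sketch, assuming the same `struct table` layout as on the slide. The helpers `prepend_table` and `free_tables` are hypothetical and not part of the course code. Unlike the slide code, the sketch checks the result of malloc.

```c
#include <stdlib.h>
#include <string.h>

struct table {
    char *table_name;
    struct table *next;
};

/*
 * Put a new entry with a copy of name in front of the list and return
 * the new head. Returns NULL if memory allocation fails, in which case
 * the existing list is left untouched.
 */
struct table *prepend_table(struct table *head, const char *name)
{
    struct table *t = malloc(sizeof *t);

    if (t == NULL)
        return NULL;
    t->table_name = malloc(strlen(name) + 1);
    if (t->table_name == NULL) {
        free(t);
        return NULL;
    }
    strcpy(t->table_name, name);
    t->next = head;   /* same step as "t->next = tl" in the action */
    return t;
}

/* Release every entry of the list together with its name. */
void free_tables(struct table *head)
{
    while (head != NULL) {
        struct table *next = head->next;
        free(head->table_name);
        free(head);
        head = next;
    }
}
```

Because each entry is added at the front, prepending MyCourses, MyMarks and MyFriends in that order produces a list that starts with MyFriends and ends with MyCourses, i.e. the reverse of the order in the configuration file.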

Page 35: Scanning & Parsing with  Lex and YACC

How to invoke the parser

int main (int argc, char **argv)
{
  FILE *f;
  extern FILE *yyin;
  if (argc == 2) {
    f = fopen(argv[1], "r");
    if (!f) { … // error handling … }
    yyin = f;
    while ( !feof(yyin) ) {
      if (yyparse() != 0) {
        …
        yyerror("");
        exit(1);   /* non-zero exit status signals the parse failure */
      }
    }
    fclose(f);
  }
  …

• yylex() calls the generated scanner
• By default, yylex() is called within yyparse()

Page 36: Scanning & Parsing with  Lex and YACC

In the Makefile

lexer: config_parser.l
	${LEX} config_parser.l
	${CC} ${CFLAGS} ${INCLUDE} -c lex.yy.c

yaccer: config_parser.y
	${YACC} -d config_parser.y
	${CC} ${CFLAGS} ${INCLUDE} -c config_parser.tab.c

parser: config_parser.tab.o lex.yy.o
	${CC} ${CFLAGS} ${INCLUDE} -c parser.c
	${CC} -o p ${CFLAGS} ${INCLUDE} lex.yy.o \
		config_parser.tab.o \
		parser.o

Page 37: Scanning & Parsing with  Lex and YACC

Benefits
• Faster development
– Compared to manual implementation
• Easier to change the specification and generate a new parser
– Than to modify 1000s of lines of code to add, change, or delete a feature
• Less error-prone, as code is generated
• Cost: Learning curve
– Invest once, amortized over a 40+ year career

Page 38: Scanning & Parsing with  Lex and YACC

If you want to know more
• Lecture, examples and some recommended reading are enough to tackle all of the parsing for Milestone 3 & 4

• 3rd and 4th year lectures on Compilers may show you the algorithms behind & inside Lex & YACC

• Lectures on Computability and Theory of Computation may also show you these algorithms

Page 39: Scanning & Parsing with  Lex and YACC
Page 40: Scanning & Parsing with  Lex and YACC

A flex specification

The header:

%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%

The “guts” (regular expressions annotated with actions):

" "           ;
[a-z]         { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]         { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
[^a-z0-9\b]   { c = yytext[0]; return(c); }

Page 41: Scanning & Parsing with  Lex and YACC

The header

%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%

• int c: temporary variable(s)
• yylval: special variable
– set in the scanner
– used in the parser
– for transferring values associated with tokens to the parser
• %%: dividing line between header and rules section

Page 42: Scanning & Parsing with  Lex and YACC

The rules

%%
" "           ;
[a-z]         { c = yytext[0]; yylval = c - 'a'; return (LETTER); }
[0-9]         { c = yytext[0]; yylval = c - '0'; return (DIGIT); }
[^a-z0-9\b]   { c = yytext[0]; return(c); }

yytext: the string associated with the token

Page 43: Scanning & Parsing with  Lex and YACC

The rules

%%
" "           ;
[a-z]         { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]         { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
[^a-z0-9\b]   { c = yytext[0]; return(c); }

• [a-z] rule: sets yylval to the character’s alphabetical order
• [0-9] rule: sets yylval to the digit’s numerical value
• Last rule: otherwise simply returns that character; presumably it’s an operator: +, *, -, etc.

Page 44: Scanning & Parsing with  Lex and YACC

Simple example

• Implement a calculator which can recognize adding or subtracting of numbers

[linux33]% ./y_calc
1+101
= 102
[linux33]% ./y_calc
1000-300+200+100
= 1000
[linux33]%

Page 45: Scanning & Parsing with  Lex and YACC

Example – the Lex part

Definitions:

%{
#include <math.h>
#include "y.tab.h"
extern int yylval;
%}

Rules (pattern on the left, action on the right):

%%
[0-9]+    { yylval = atoi(yytext); return NUMBER; }
[\t ]+    ;                /* Do nothing for white space */
\n        return 0;        /* End of the logic */
.         return yytext[0];
%%

Page 46: Scanning & Parsing with  Lex and YACC

Example – the Yacc part

Definitions:

%token NAME NUMBER

%%

Rules:

statement:
      NAME '=' expression
    | expression               { printf("= %d\n", $1); }
    ;

expression:
      expression '+' NUMBER    { $$ = $1 + $3; }
    | expression '-' NUMBER    { $$ = $1 - $3; }
    | NUMBER                   { $$ = $1; }
    ;

Include the Yacc library (-ly) when linking.
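To see what the expression rules compute, the same left-to-right evaluation can be written by hand in C. This is only a sketch of the arithmetic that the grammar actions perform, not the parser YACC generates. The names `eval_sum`, `read_number` and `skip_blanks` are hypothetical, and integer overflow is not checked.

```c
#include <ctype.h>

/* Skip spaces and tabs, as the scanner's white-space rule does. */
static const char *skip_blanks(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* Read an unsigned decimal number at *p. Returns 0 on success, -1 otherwise. */
static int read_number(const char **p, int *value)
{
    const char *s = skip_blanks(*p);
    int v = 0;

    if (!isdigit((unsigned char)*s))
        return -1;
    while (isdigit((unsigned char)*s)) {
        v = v * 10 + (*s - '0');
        s++;
    }
    *value = v;
    *p = s;
    return 0;
}

/*
 * Evaluate NUMBER (('+' | '-') NUMBER)* from left to right, stopping at a
 * newline or the end of the string. Stores the value in *result and
 * returns 0 on success, or returns -1 on a syntax error.
 */
int eval_sum(const char *text, int *result)
{
    const char *p = text;
    int acc, rhs;

    if (read_number(&p, &acc) != 0)      /* expression: NUMBER */
        return -1;
    for (;;) {
        p = skip_blanks(p);
        if (*p == '+' || *p == '-') {    /* expression '+' or '-' NUMBER */
            char op = *p++;
            if (read_number(&p, &rhs) != 0)
                return -1;
            acc = (op == '+') ? acc + rhs : acc - rhs;
        } else if (*p == '\0' || *p == '\n') {
            *result = acc;
            return 0;
        } else {
            return -1;
        }
    }
}
```

Evaluating "1000-300+200+100" this way gives 1000, matching the sample session of the calculator. The accumulator plays the role of $$ and $1 in the left-recursive rule, which is why subtraction associates to the left.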