Perly Parsing with Regexp::Grammars

Perly Parsers: Perl-byacc, Parse::Yapp, Parse::RecDescent, Regexp::Grammars. Steven Lembark, Workhorse Computing, lembark@wrkhors.com


DESCRIPTION

A short description of Perly grammar processors leading up to Regexp::Grammars. Develops two R::G modules, one for single-line logfile entries, another for larger FASTA format entries in the NCBI "nr.gz" file. The second example shows how to derive one grammar from another by overriding tags in the base grammar.

TRANSCRIPT

Perly Parsing with Regexp::Grammars

Perly Parsers:
Perl-byacc, Parse::Yapp, Parse::RecDescent, Regexp::Grammars

Steven Lembark
Workhorse Computing
lembark@wrkhors.com

Grammars are the guts of compilers

Compilers convert text from one form to another
– C compilers convert C source to CPU-specific assembly
– Databases compile SQL into RDBMS ops

Grammars define structure, precedence, valid inputs
– Realistic ones are often recursive or context-sensitive
– The complexity in defining grammars led to a variety of tools for defining them
– The standard format for a long time has been "BNF", which is the input to YACC

They are wasted on flat text
– If "split /\t/" does the job, skip grammars entirely
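A minimal sketch of that flat-text case (the field layout here is made up): when every record is one tab-delimited line, split does all the work and a grammar adds nothing.

    #!/usr/bin/env perl
    # Hypothetical tab-delimited input: one record per line, fields split on \t.
    use strict;
    use warnings;

    while( my $line = <STDIN> )
    {
        chomp $line;

        my @fieldz = split /\t/, $line;

        print scalar @fieldz, " fields: @fieldz\n";
    }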

The first Yet Another: YACC

Yet Another Compiler Compiler
– YACC takes in a standard-format grammar structure
– It processes tokens and their values, organizing the results according to the grammar into a structure

Between the source and YACC is a tokenizer
– This parses the inputs into individual tokens defined by the grammar
– It doesn't know about structure, only breaking the text stream up into tokens

Parsing is a pain in the lex

The real pain is gluing the parser and tokenizer together
– Tokenizers deal in the language of patterns
– Grammars are defined in terms of structure

Passing data between them makes for most of the difficulty
– One issue is the global yylex call, which makes having multiple parsers difficult
– Context-sensitive grammars with multiple sub-grammars are painful

The perly way

Regexen, logic, glue... hmm, been there before
– The first approach most of us try is lexing with regexen
– Then add captures and if-blocks or execute (?{ code }) blocks inside of each regex (sketched below)

The problem is that the grammar is defined by your code structure
– Modifying the grammar requires re-coding it
– Hubris, maybe, but Truly Lazy it ain't
– That was the whole reason for developing standard grammars & their handlers in the first place
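A short sketch of that hand-rolled style (the log format here is hypothetical): the captures and if-blocks work, but the "grammar" lives entirely in the code structure, so changing it means re-coding.

    #!/usr/bin/env perl
    # Hand-rolled lexing with regexen: each branch hard-codes part of the grammar.
    use strict;
    use warnings;

    while( my $line = <STDIN> )
    {
        if( my ( $stamp, $rest ) = $line =~ m{^ (\d+) : \s* (.+) }x )
        {
            if( $rest =~ m{^ [*]{3} \s* (.+) }x )
            {
                print "status  $stamp: $1\n";
            }
            else
            {
                print "message $stamp: $rest\n";
            }
        }
    }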

Early Perl Grammar Modules

These take in a YACC grammar and spit out compiler code

Intentionally looked like YACC
– Able to re-cycle existing YACC grammar files
– Benefit from using Perl as a built-in lexer
– Perl-byacc & Parse::Yapp

Good: Recycles knowledge for YACC users
Bad: Still not lazy. The grammars are difficult to maintain, and you still have to plug in post-processing code to deal with the results.

%right  '='
%left   '-' '+'
%left   '*' '/'
%left   NEG
%right  '^'

input:  # empty
    |   input line          { push( @{ $_[1] }, $_[2] ); $_[1] }
;

line:   '\n'                { $_[1] }
    |   exp '\n'            { print "$_[1]\n" }
    |   error '\n'          { $_[0]->YYErrok }
;

exp:    NUM
    |   VAR                 { $_[0]->YYData->{VARS}{ $_[1] } }
    |   VAR '=' exp         { $_[0]->YYData->{VARS}{ $_[1] } = $_[3] }
    |   exp '+' exp         { $_[1] + $_[3] }
    |   exp '-' exp         { $_[1] - $_[3] }
    |   exp '*' exp         { $_[1] * $_[3] }
;

Example: Parse::Yapp grammar

The Swiss Army Chainsaw

Parse::RecDescent extended the original BNF syntax, combining the tokens & handlers

Grammars are largely declarative, using OO Perl to do the heavy lifting
– OO interface allows multiple context-sensitive parsers
– Rules with Perl blocks allow the code to do anything
– Results can be acquired from a hash, an array, or $1
– Left- and right-associative tags simplify messy situations

Example: PRD

This is part of an infix formula compiler I wrote. It compiles equations to a sequence of closures:

add_op  : '+' | '-'             { $item[ 1 ] }
mult_op : '*' | '/' | '^'       { $item[ 1 ] }

add     : <leftop: mult add_op mult>
          { compile_binop @{ $item[1] } }

mult    : <leftop: factor mult_op factor>
          { compile_binop @{ $item[1] } }

Just enough rope to shoot yourself

The biggest problem: PRD is sloooooooow

Learning curve is perl-ish: shallow and long
– Unless you really know what all of it does, you may not be able to figure out the pieces
– Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcher
– People used it for a lot of things that are not compilers
– Good & Bad thing: it really is a compiler
– Bad rap for not doing well what it wasn't supposed to do at all

RIP PRD

Supposed to be replaced with Parse::FastDescent
– Damian dropped work on PFD for Perl6
– His goal was to replace the shortcomings of PRD with something more complete and quite a bit faster

The result is Perl6 Grammars
– Declarative syntax extends matching with rules
– Built into Perl6 as a structure, not an add-on
– Much faster
– Not available in Perl5

Regexp::Grammars

Perl5 implementation derived from Perl6
– Back-porting an idea, not the Perl6 syntax
– Much better performance than PRD

Extends the v5.10 recursive matching syntax, leveraging the regex engine
– Most of the speed issues are with regex design, not the parser itself
– Simplifies mixing code and matching
– Single place to get the final results
– Cleaner syntax with automatic whitespace handling

Extending regexen

"use Regexp::Grammars" turns on the added syntax
– block-scoped (avoids collisions with existing code)

You will probably want to add "xm" or "xs"
– extended syntax (x) avoids whitespace issues
– multi-line mode (m) simplifies line anchors for line-oriented parsing
– single-line mode (s) makes ignoring line-wrap whitespace largely automatic
– I use "xm" with explicit "\n" or "\s" matches to span lines where necessary

What you get

The parser is simply a regex-ref
– You can bless it or have multiple parsers in the same program

Grammars can reference one another
– Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoring
– Largely declarative syntax helps
– OOP provides inheritance with overrides for rules

my $compiler
= do
{
    use Regexp::Grammars;

    qr
    {
        <data>

        <rule: data>    <[text]>+
        <rule: text>    .+
    }xm;
};

Example: Creating a compiler

Context can be a do-block, subroutine, or branch logic

"data" is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named %/
– Keys are the rule names that produced the results
– Empty keys ('') hold the input text (for errors or debugging)
– Easy to handle with Data::Dumper

The hash has at least one key for the entry rule, plus one empty key for the input data if context is being saved.
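A minimal sketch of getting at the results (assuming the $compiler qr// from the earlier slide and the log text in $text): after a successful match the tree is in %/ and can be dumped directly.

    use Data::Dumper;

    if( $text =~ $compiler )
    {
        print Dumper \%/;
    }
    else
    {
        warn "parse failed\n";
    }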

For example, feeding two lines of a Gentoo emerge log through the line grammar gives:

'' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
       1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
data => {
    '' => '1367874132:  Started emerge on: May 06, 2013 21:02:12
           1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
    text => [
        '1367874132:  Started emerge on: May 06, 2013 21:02:12',
        '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
    ]
}

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages.

They also get in the way and can cost a lot of memory on large inputs.

You can turn them on and off with <context:> and <nocontext:> in the rules:

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data>    <text>+     # oops: left off the []
    <rule: text>    .+
}xm;

warn:
| Repeated subrule <text>+ will only capture its final match
| (Did you mean <[text]>+ instead?)
|

data => {
    text => '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk'
}

You usually want [] with +

data => {       # the [text] parses to an array of text
    text => [
        '1367874132:  Started emerge on: May 06, 2013 21:02:12',
        '1367874132:  emerge --jobs --autounmask-write ...'
    ]
}

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data>    <[text]>+
    <rule: text>    (.+)
}xm;

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

<data>
<rule: data>        <[line]>+
<rule: line>        <ref_id> <[text]>
<token: ref_id>     ^(\d+)
<rule: text>        .+

line => [
    {
        ref_id => '1367874132',
        text   => ':  Started emerge on: May 06, 2013 21:02:12'
    },
    ...
]

Removing cruft: "ws"

Be nice to remove the leading ": " from text lines. In this case the "whitespace" needs to include a colon along with the spaces. Whitespace is defined by <ws: ... >:

<rule: line>    <ws: [\s:]+ > <ref_id> <text>

ref_id => '1367874132',
text   => 'emerge --jobs --autounmask-wr...'

The prefix means something

Be nice to know what type of line was being processed. <prefix= regex > assigns the regex's capture to the "prefix" tag:

<rule: line>    <ws: [\s:]+ > <ref_id> <entry>

<rule: entry>
    <prefix= ( [*][*][*] ) > <text>
  | <prefix= ( [>][>][>] ) > <text>
  | <prefix= ( [=][=][=] ) > <text>
  | <prefix= ( [:][:][:] ) > <text>
  | <text>

entry => {
    text => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132',

entry => {
    prefix => '***',
    text   => 'emerge --jobs --autounmask-write ...'
},
ref_id => '1367874132',

entry => {
    prefix => '>>>',
    text   => 'emerge (1 of 2) sys-apps/...'
},
ref_id => '1367874256'

"entry" now contains the optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results. The match from "text" is aliased to a named type of log entry:

<rule: entry>
    <prefix= ( [*][*][*] ) > <command=text>
  | <prefix= ( [>][>][>] ) > <stage=text>
  | <prefix= ( [=][=][=] ) > <status=text>
  | <prefix= ( [:][:][:] ) > <final=text>
  | <message=text>

entry => {
    message => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132',

entry => {
    command => 'emerge --jobs --autounmask-write ...',
    prefix  => '***'
},
ref_id => '1367874132',

entry => {
    command => 'terminating.',
    prefix  => '***'
},
ref_id => '1367874133'

Generic "text" replaced with a type

Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.

A leading '.' tells R::G to parse but not store the results in %/:

<rule: entry>
    <.prefix= ( [*][*][*] ) > <command=text>
  | <.prefix= ( [>][>][>] ) > <stage=text>
  | <.prefix= ( [=][=][=] ) > <status=text>
  | <.prefix= ( [:][:][:] ) > <final=text>
  | <message=text>

entry => {
    message => 'Started emerge on: May 06, 2013 21:02:12'
},
ref_id => '1367874132',

entry => {
    command => 'emerge --jobs --autounmask-write ...'
},
ref_id => '1367874132',

entry => {
    command => 'terminating.'
},
ref_id => '1367874133'

"entry" now has typed keys

The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level:

<ws: [\s:]+ > <ref_id>
(
    <.prefix= ( [*][*][*] ) > <command=text>
  | <.prefix= ( [>][>][>] ) > <stage=text>
  | <.prefix= ( [=][=][=] ) > <status=text>
  | <.prefix= ( [:][:][:] ) > <final=text>
  | <message=text>
)

data => {
    line => [
        {
            message => 'Started emerge on: May 06, 2013 21:02:12',
            ref_id  => '1367874132'
        },
        {
            command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
            ref_id  => '1367874132'
        },
        {
            command => 'terminating.',
            ref_id  => '1367874133'
        },
        {
            message => 'Started emerge on: May 06, 2013 21:02:17',
            ref_id  => '1367874137'
        },
        ...
    ]
}

Result: array of "line" with ref_id & type

Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by text:

<rule: entry>   <ws: [\s:]+ > <ref_id> <type>? <text>

<token: type>   ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

entry => [
    {
        ref_id => '1367874132',
        text   => 'Started emerge on: May 06, 2013 21:02:12'
    },
    {
        ref_id => '1367874133',
        text   => 'terminating.',
        type   => '***'
    },
    {
        ref_id => '1367874137',
        text   => 'Started emerge on: May 06, 2013 21:02:17'
    },
    {
        ref_id => '1367874137',
        text   => 'emerge --jobs --autounmask-write ...',
        type   => '***'
    },
    ...
]

prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a character class with a {3} count:

    [*>:=]{3}

qr
{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+

    <rule: entry>   <ws: [\s:]+ > <ref_id> <prefix>? <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;

This is the skeleton parser

Doesn't take much
– Declarative syntax
– No Perl code at all

Easy to modify by extending the definition of "text" for specific types of messages

Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines
– Sub-rules can be defined for the different line types

<rule: command>     "emerge" <ws> <[switch]>+

<token: switch>     ( [-][-]\S+ )

This is what makes the grammars useful: nested, context-sensitive content

Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach (a small sketch follows this list)

Code can assemble the contents of a qr{} without having to eval or deal with messy quoted strings

This makes modular or context-sensitive grammars relatively simple to compose
– References can cross package or module boundaries
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
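A small sketch of the building-block idea with made-up grammar and rule names (Base::Number, pair): the first qr// only defines a named grammar, the second extends it and supplies the entry rule.

    use strict;
    use warnings;

    # Define a named base grammar; this qr// is never matched directly.
    my $base = do
    {
        use Regexp::Grammars;

        qr
        {
            <grammar: Base::Number>

            <rule: number>  \d+
        }xm;
    };

    # Extend it from a separate parser that has an entry rule.
    my $pair_parser = do
    {
        use Regexp::Grammars;

        qr
        {
            <extends: Base::Number>

            <pair>
            <rule: pair>    <x=number> , <y=number>
        }xm;
    };

    '3, 14' =~ $pair_parser
        and print 'x=', $/{pair}{x}, ' y=', $/{pair}{y}, "\n";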

The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed.

The file consists of a simple FASTA format with headings separated by ctrl-A chars:

>Heading 1
[amino-acid sequence characters]

>Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters
– Each species has a set of source & identifier pairs followed by a single description
– Within-species separator is a pipe ("|") with optional whitespace
– Species counts in some headers run into the thousands

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>       <start> <head> <ws> <[body]>+

    <rule: head>        .+ <ws>
    <rule: body>        ( <[seq]> | <comment> ) <ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta"
– This cannot be used to generate results by itself
– Accessible anywhere via Regexp::Grammars

The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... })
– Allows inserting almost-arbitrary code into the regex
– "almost" because the code cannot include regexen

seq => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
    ...
    'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
    ...
    'VQKLLNPDQ'
]

Munging results: $MATCH

The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines
– join + split won't work because split uses a regex

<rule: body>    ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})

One more step: Remove the arrayref

Now the body is a single string
– No need for an arrayref to contain one string
– Since the body has one entry, assign offset zero

body => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
]

<rule: fasta>   <start> <head> <ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})

Result: a generic FASTA parser

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    }
]

The head and body are easily accessible. Next: parse the nr-specific header.

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case the derived grammar references ParseFasta and extracts a list of fasta entries:

<extends: ParseFasta>

<[fasta]>+

Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator
– The tail item on the list doesn't have a \cA to anchor on
– Using ".+ [\cA\n]" walks off the header onto the sequence
– This is a common problem with separators & tokenizers
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way

First pass: Literal "tail" item

This works but is ugly
– Have two rules for the main list and the tail
– Alias the tail to get them all in one place

<rule: head>    <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors

    tr/\cA\n//d for @{ $MATCH{ ident } };
})

<token: ident>  .+? \cA
<token: final>  .+? \n

Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    ]
}

Dealing with separators: <sep>

Separators happen often enough
– 1, 2, 3, 4, 13, 91        numbers by commas, spaces
– g-c-a-g-t-t-a-c-a         characters by dashes
– /usr/local/bin            basenames by dir markers
– /usr:/usr/local/bin       dirs separated by colons

that R::G has special syntax for dealing with them: combining the item with "%" and a separator (a runnable sketch follows the two rule forms below):

<rule: list>        <[item]>+ % <separator>     # one-or-more
<rule: list_zom>    <[item]>* % <separator>     # zero-or-more
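A minimal sketch of the separator syntax on the comma-separated list above (recent Regexp::Grammars spells the operator "%"; the rule and token names here are made up):

    use strict;
    use warnings;
    use Data::Dumper;

    my $list_parser = do
    {
        use Regexp::Grammars;

        qr
        {
            <nocontext:>
            <list>

            <rule: list>    <[item]>+ % [,]

            <token: item>   \d+
        }xm;
    };

    '1, 2, 3, 4, 13, 91' =~ $list_parser
        and print Dumper $/{ list };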

Cleaner nr.gz header rule

Separator syntax cleans things up
– No more tail rule with an alias
– No code block required to strip the separators and trailing newline
– Non-greedy match ".+?" avoids capturing separators

qr
{
    <nocontext:>

    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;

Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers. Replace $MATCH from the "head" rule with the nested identifier contents:

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident>  .+?
}xm;

Result

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]'
        ]
    }
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.

One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: ( \s* [|] \s* )
– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical...

Becomes:

gi   => '66816243',
ref  => 'XP_642131.1',
desc => 'hypothetical...'

Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?

Result: head with sources & "desc"

fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN',
    head => [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => '15674171',
            ref  => 'NP_268346.1'
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => '116513137',
            ref  => 'YP_812044.1'
        },
        ...
    ]
}

Balancing R::G with calling code

The regex engine could process all of nr.gz
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads
– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character
– Making it optional with <start> fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}

Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm;

Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;

Result: Use grammars

Most of the "real" work is done under the hood
– Regexp::Grammars does the lexing and basic compilation
– Code is only needed for cleanups or re-arranging structs

Code can simplify your grammar
– Too much code makes them hard to maintain
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way, the result is going to be more maintainable than hardwiring the grammar into code.

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'"
– One for the Perl code, the other for $MATCH and friends

The immediate fix for this is in the last few lines of Regexp::Grammars::import, which push the pragmas into the caller.

Look up $^H in perlvar to see how it works.

require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of PRD
– It is worth taking time to learn how to optimize NDF regexen, however

Or better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+.

PerlMonks has plenty of good postings.

Perl Review article by brian d foy on recursive matching in Perl 5.10.

Page 2: Perly Parsing with Regexp::Grammars

Grammars are the guts of compilers

Compilers convert text from one form to anotherndash C compilers convert C source to CPU-specific assembly

ndash Databases compile SQL into RDBMS ops

Grammars define structure precedence valid inputsndash Realistic ones are often recursive or context-sensitive

ndash The complexity in defining grammars led to a variety of tools for defining them

ndash The standard format for a long time has been ldquoBNFrdquo which is the input to YACC

They are wasted on flat textndash If ldquosplit trdquo does the job skip grammars entirely

The first Yet Another YACC

Yet Another Compiler Compiler ndash YACC takes in a standard-format grammar structure

ndash It processes tokens and their values organizing the results according to the grammar into a structure

Between the source and YACC is a tokenizerndash This parses the inputs into individual tokens defined by the grammar

ndash It doesnt know about structure only breaking the text stream up into tokens

Parsing is a pain in the lex

The real pain is gluing the parser and tokenizer togetherndash Tokenizers deal in the language of patterns

ndash Grammars are defined in terms of structure

Passing data between them makes for most of the difficultyndash One issue is the global yylex call which makes having multiple parsers

difficult

ndash Context-sensitive grammars with multiple sub-grammars are painful

The perly way

Regexen logic glue hmm been there beforendash The first approach most of us try is lexing with regexen

ndash Then add captures and if-blocks or excute (code) blocks inside of each regex

The problem is that the grammar is defined by your code structurendash Modifying the grammar requires re-coding it

ndash Hubris maybe but Truly Lazy it aint

ndash Was the whole reason for developing standard grammars amp their handlers in the first place

Early Perl Grammar Modules

These take in a YACC grammar and spit out compiler code Intentionally looked like YACC

ndash Able to re-cycle existing YACC grammar files

ndash Benefit from using Perl as a built-in lexer

ndash Perl-byacc amp ParseYapp

Good Recycles knowledge for YACC users Bad Still not lazy The grammars are difficult to maintain and you

still have to plug in post-processing code to deal with the results

right =left - +left left NEGright ^

input empty

| input line push($_[1]$_[2]) $_[1]

line n $_[1] | exp n print $_[1]n | error n $_[0]-gtYYErrok

exp NUM| VAR $_[0]-gtYYData-gtVARS$_[1] | VAR = exp $_[0]-gtYYData-gtVARS$_[1]=$_[3] | exp + exp $_[1] + $_[3] | exp - exp $_[1] - $_[3] | exp exp $_[1] $_[3]

Example ParseYapp grammar

The Swiss Army Chainsaw

ParseRecDescent extended the original BNF syntax combining the tokens amp handlers

Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers

ndash Rules with Perl blocks allows the code to do anything

ndash Results can be acquired from a hash an array or $1

ndash Left right associative tags simplify messy situations

Example PRD

This is part of an infix formula compiler I wrote

It compiles equations to a sequence of closures

add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]

add ltleftop mult add_op multgt compile_binop $item[1]

mult ltleftop factor mult_op factorgt compile_binop $item[1]

Just enough rope to shoot yourself

The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long

ndash Unless you really know what all of it does you may not be able to figure out the pieces

ndash Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers

ndash Good amp Bad thing it really is a compiler

ndash Bad rap for not doing well what it wasnt supposed to do at all

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 3: Perly Parsing with Regexp::Grammars

The first Yet Another YACC

Yet Another Compiler Compiler ndash YACC takes in a standard-format grammar structure

ndash It processes tokens and their values organizing the results according to the grammar into a structure

Between the source and YACC is a tokenizerndash This parses the inputs into individual tokens defined by the grammar

ndash It doesnt know about structure only breaking the text stream up into tokens

Parsing is a pain in the lex

The real pain is gluing the parser and tokenizer togetherndash Tokenizers deal in the language of patterns

ndash Grammars are defined in terms of structure

Passing data between them makes for most of the difficultyndash One issue is the global yylex call which makes having multiple parsers

difficult

ndash Context-sensitive grammars with multiple sub-grammars are painful

The perly way

Regexen logic glue hmm been there beforendash The first approach most of us try is lexing with regexen

ndash Then add captures and if-blocks or excute (code) blocks inside of each regex

The problem is that the grammar is defined by your code structurendash Modifying the grammar requires re-coding it

ndash Hubris maybe but Truly Lazy it aint

ndash Was the whole reason for developing standard grammars amp their handlers in the first place

Early Perl Grammar Modules

These take in a YACC grammar and spit out compiler code Intentionally looked like YACC

ndash Able to re-cycle existing YACC grammar files

ndash Benefit from using Perl as a built-in lexer

ndash Perl-byacc amp ParseYapp

Good Recycles knowledge for YACC users Bad Still not lazy The grammars are difficult to maintain and you

still have to plug in post-processing code to deal with the results

right =left - +left left NEGright ^

input empty

| input line push($_[1]$_[2]) $_[1]

line n $_[1] | exp n print $_[1]n | error n $_[0]-gtYYErrok

exp NUM| VAR $_[0]-gtYYData-gtVARS$_[1] | VAR = exp $_[0]-gtYYData-gtVARS$_[1]=$_[3] | exp + exp $_[1] + $_[3] | exp - exp $_[1] - $_[3] | exp exp $_[1] $_[3]

Example ParseYapp grammar

The Swiss Army Chainsaw

ParseRecDescent extended the original BNF syntax combining the tokens amp handlers

Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers

ndash Rules with Perl blocks allows the code to do anything

ndash Results can be acquired from a hash an array or $1

ndash Left right associative tags simplify messy situations

Example PRD

This is part of an infix formula compiler I wrote

It compiles equations to a sequence of closures

add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]

add ltleftop mult add_op multgt compile_binop $item[1]

mult ltleftop factor mult_op factorgt compile_binop $item[1]

Just enough rope to shoot yourself

The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long

ndash Unless you really know what all of it does you may not be able to figure out the pieces

ndash Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers

ndash Good amp Bad thing it really is a compiler

ndash Bad rap for not doing well what it wasnt supposed to do at all

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

"entry" now contains an optional prefix.

Aliases can also assign tag results

Aliases assign a key to rule results

The match from "text" is aliased to a named type of log entry:

<rule: entry>

    <prefix=([*][*][*])> <command=text>
|   <prefix=([>][>][>])> <stage=text>
|   <prefix=([=][=][=])> <status=text>
|   <prefix=([:][:][:])> <final=text>
|   <message=text>

{
    entry => {
        message => 'Started emerge on: May 06, 2013 21:02:12'
    },
    ref_id => '1367874132'
},
{
    entry => {
        command => 'emerge --jobs --autounmask-write ...',
        prefix  => '***'
    },
    ref_id => '1367874132'
},
{
    entry => {
        command => 'terminating.',
        prefix  => '***'
    },
    ref_id => '1367874133'
}

Generic "text" replaced with a type.

Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.

A leading "." tells R::G to parse but not store the results in %/:

<rule: entry >

    <.prefix=([*][*][*])> <command=text>
|   <.prefix=([>][>][>])> <stage=text>
|   <.prefix=([=][=][=])> <status=text>
|   <.prefix=([:][:][:])> <final=text>
|   <message=text>

{
    entry => {
        message => 'Started emerge on: May 06, 2013 21:02:12'
    },
    ref_id => '1367874132'
},
{
    entry => {
        command => 'emerge --jobs --autounmask-write -'
    },
    ref_id => '1367874132'
},
{
    entry => {
        command => 'terminating.'
    },
    ref_id => '1367874133'
}

"entry" now has typed keys.

The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level.

<ws:[\s:]*> <ref_id>
(
    <.prefix=([*][*][*])> <command=text>
|   <.prefix=([>][>][>])> <stage=text>
|   <.prefix=([=][=][=])> <status=text>
|   <.prefix=([:][:][:])> <final=text>
|   <message=text>
)

data => {
    line => [
        {
            message => 'Started emerge on: May 06, 2013 21:02:12',
            ref_id  => '1367874132'
        },
        {
            command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk',
            ref_id  => '1367874132'
        },
        {
            command => 'terminating.',
            ref_id  => '1367874133'
        },
        {
            message => 'Started emerge on: May 06, 2013 21:02:17',
            ref_id  => '1367874137'
        },
        ...

Result: array of "line" with ref_id & type.

Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by the text:

<rule: entry >  <ws:[\s:]*> <ref_id> <type>? <text>

<token: type>   ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

entry => [
    {
        ref_id => '1367874132',
        text   => 'Started emerge on: May 06, 2013 21:02:12'
    },
    {
        ref_id => '1367874133',
        text   => 'terminating.',
        type   => '***'
    },
    {
        ref_id => '1367874137',
        text   => 'Started emerge on: May 06, 2013 21:02:17'
    },
    {
        ref_id => '1367874137',
        text   => 'emerge --jobs --autounmask-write ...',
        type   => '***'
    },
    ...

The prefix alternations look ugly.

Using a count works:

[*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, a single character class with a count does the job:

[*>:=]{3}

qr
{
    <nocontext:>

    <data>
    <rule: data >       <[entry]>+

    <rule: entry >      <ws:[\s:]*> <ref_id> <prefix>? <text>

    <token: ref_id >    ^(\d+)
    <token: prefix >    [*>:=]{3}
    <token: text >      .+
}xm;

This is the skeleton parser

Doesn't take much
– Declarative syntax
– No Perl code at all

Easy to modify by extending the definition of "text" for specific types of messages.

Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines
– Sub-rules can be defined for the different line types (a sketch of wiring one in follows below)

<rule: command>     "emerge" <ws> <[switch]>+

<token: switch>     ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
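One way to wire that into the skeleton – a sketch, not the talk's finished parser; the alternation in "entry", the \s+ after the literal, and the trailing <text>? are my assumptions built from the fragments above:

qr
{
    <nocontext:>

    <data>
    <rule: data >       <[entry]>+

    # try the detailed "emerge ..." form first, fall back to plain text
    <rule: entry >      <ws:[\s:]*> <ref_id> <prefix>?
                        ( <command> | <text> )

    <rule: command >    emerge \s+ <[switch]>+ <text>?

    <token: switch >    ([-][-]\S+)
    <token: ref_id >    ^(\d+)
    <token: prefix >    [*>:=]{3}
    <token: text >      .+
}xm;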

Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents for a qr// without having to eval or deal with messy quoted strings.

This makes modular or context-sensitive grammars relatively simple to compose
– References can cross package or module boundaries
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+ GB uncompressed. The file consists of a simple FASTA format, with headings separated by ctrl-A chars:

>Heading 1

[amino-acid sequence characters]

>Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters
– Each species has a set of source & identifier pairs followed by a single description
– Within-species separator is a pipe ("|") with optional whitespace
– Species counts in some headers run into the thousands

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^A
gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^A
gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^A
gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta >      <start> <head> <ws> <[body]>+

    <rule: head >       .+ <ws>
    <rule: body >       ( <[seq]> | <comment> ) <ws>

    <token: start >     ^ [>]
    <token: comment >   ^ [;] .+
    <token: seq >       ^ [\nw\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta"
– This cannot be used to generate results by itself
– Accessible anywhere via Regexp::Grammars

The output needs help, however.

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... })
– Allows inserting almost-arbitrary code into the regex
– "almost" because the code cannot include regexen

seq => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
    'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
    'VQKLLNPDQ'
]

Munging results: $MATCH

The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines
– join + split won't work because split uses a regex

<rule: body > ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})

One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

body => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
]

<rule: fasta>   <start> <head> <ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})

Result: a generic FASTA parser

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    }
]

The head and body are easily accessible. Next: parse the nr-specific header.

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case, this references the grammar and extracts a list of fasta entries:

<extends: ParseFasta>

<[fasta]>+

Splitting the head into identifiers

Overloading fasta's "head" rule handles splitting out the identifiers for individual species.

Catch: \cA is a separator, not a terminator
– The tail item on the list doesn't have a \cA to anchor on
– Using ".+? [\cA\n]" walks off the header onto the sequence
– This is a common problem with separators & tokenizers
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way

First pass: Literal "tail" item

This works but is ugly
– Have two rules, for the main list and the tail
– Alias the tail to get them all in one place

<rule: head>    <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors

    tr/\cA\n//d for @{ $MATCH{ ident } };
})

<token: ident > .+? \cA
<token: final > .+? \n

Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    ]
}

Dealing with separators: % <sep>

Separators happen often enough
– 1, 2, 3, 4, 13, 91        numbers separated by commas & spaces
– g-c-a-g-t-t-a-c-a         characters separated by dashes
– /usr/local/bin            basenames separated by dir markers
– /usr:/usr/local/bin       dirs separated by colons

that R::G has special syntax for dealing with them: combining the item with "%" and a separator:

<rule: list>        <[item]>+ % <separator>     # one-or-more

<rule: list_zom>    <[item]>* % <separator>     # zero-or-more
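A small, self-contained illustration of the separated-list syntax (the grammar and input below are mine, not from the talk, and use a parenthesized separator pattern in the style of the nr.gz rules):

use strict;
use warnings;
use Data::Dumper;

my $number_list
= do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <list>

        <rule: list >       <[number]>+ % ( \s* , \s* )
        <token: number >    \d+
    }xm;
};

'1, 2, 3, 4, 13, 91' =~ $number_list
    and print Dumper $/{ list }{ number };
    # arrayref of the numbers: [ '1', '2', '3', '4', '13', '91' ]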

Cleaner nr.gz header rule: the separator syntax cleans things up
– No more tail rule with an alias
– No code block required to strip the separators and trailing newline
– Non-greedy match ".+?" avoids capturing separators

qr
{
    <nocontext:>

    <extends: ParseFasta>

    <[fasta]>+

    <rule: head >   <[ident]>+ % [\cA]
    <token: ident > .+?
}xm;

Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head >   <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident > .+?
}xm;

Result

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]'
        ]
    }
]

The fasta content is broken into the usual "body", plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure: idents

Species have <source> | <identifier> pairs, followed by a description.

Add a separator clause: "% ( \s* [|] \s* )"
– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical ...

Becomes:

{
    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical ...'
}

Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head >   <[ident]>+ % [\cA]
<token: ident > <[taxa]>+ % ( \s* [|] \s* )
<token: taxa >  .+?

Result: head with sources, "desc"

fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN',
    head => [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => '15674171',
            ref  => 'NP_268346.1'
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => '116513137',
            ref  => 'YP_812044.1'
        },
        ...

Balancing R::G with calling code

The regex engine could process all of nr.gz
– Catch: <[fasta]>+ returns about 250_000 keys, and literally millions of total identifiers in the heads
– Better approach: <fasta> on single entries, but chunking the input on ">" removes it as a leading character

– Making it optional with <start>? fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
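Put together as a runnable driver – a sketch, assuming the ParseFasta base grammar from the earlier slides has already been compiled in the same file and that an uncompressed copy of the data sits in a file named "nr" (the file name and the $head/$body variables are mine):

my $nr_gz
= do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <fasta>
    }xm;
};

open my $fh, '<', 'nr' or die "nr: $!";

local $/ = '>';             # chunk the input on FASTA record boundaries

while( my $chunk = readline $fh )
{
    chomp $chunk;           # strip the trailing '>'
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz
        or next;

    my $head = $/{ fasta }{ head };     # results for this record land in %/
    my $body = $/{ fasta }{ body };

    # ... hand $head / $body to whatever consumes the records ...
}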

Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta >  <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head >   .+ <ws>
    <rule: body >   ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start >     ^ [>]
    <token: comment >   ^ [;] .+
    <token: seq >       ^ ( [\nw\-]+ )
}xm;

Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head >   <[ident]>+ % [\cA]
    <rule: ident >  <[taxa]>+  % ( \s* [|] \s* )
    <token: taxa >  .+?
}xm;

Result: Use grammars

Most of the "real" work is done under the hood
– Regexp::Grammars does the lexing and basic compilation
– Code is only needed for cleanups or re-arranging structs

Code can simplify your grammar
– Too much code makes them hard to maintain
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way, the result is going to be more maintainable than hardwiring the grammar into code.

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'"
– One for the Perl code, the other for $MATCH and friends

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller.

Look up $^H in perlvar to see how it works.

require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );
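If patching the module isn't an option, the same two pragmas can simply be applied in the scope that compiles the grammar – a workaround sketch of mine, not part of the talk; the tiny grammar here only exists to give the (?{ ... }) block something to do:

my $parser
= do
{
    use Regexp::Grammars;

    use re 'eval';          # v5.18+: allow (?{ ... }) in an interpolated regex
    no strict 'vars';       # $MATCH / %MATCH are package variables

    qr
    {
        <nocontext:>
        <data>

        <rule: data >   <[text]>+
        (?{
            # the code block is what triggers the need for the pragmas above
            $MATCH{ count } = @{ $MATCH{ text } };
        })

        <rule: text >   .+
    }xm;
};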

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD
– It is worth taking time to learn how to optimize NDF regexen, however

Or better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. See also the Perl Review article by brian d foy on recursive matching in Perl 5.10.

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 4: Perly Parsing with Regexp::Grammars

Parsing is a pain in the lex

The real pain is gluing the parser and tokenizer togetherndash Tokenizers deal in the language of patterns

ndash Grammars are defined in terms of structure

Passing data between them makes for most of the difficultyndash One issue is the global yylex call which makes having multiple parsers

difficult

ndash Context-sensitive grammars with multiple sub-grammars are painful

The perly way

Regexen logic glue hmm been there beforendash The first approach most of us try is lexing with regexen

ndash Then add captures and if-blocks or excute (code) blocks inside of each regex

The problem is that the grammar is defined by your code structurendash Modifying the grammar requires re-coding it

ndash Hubris maybe but Truly Lazy it aint

ndash Was the whole reason for developing standard grammars amp their handlers in the first place

Early Perl Grammar Modules

These take in a YACC grammar and spit out compiler code Intentionally looked like YACC

ndash Able to re-cycle existing YACC grammar files

ndash Benefit from using Perl as a built-in lexer

ndash Perl-byacc amp ParseYapp

Good Recycles knowledge for YACC users Bad Still not lazy The grammars are difficult to maintain and you

still have to plug in post-processing code to deal with the results

right =left - +left left NEGright ^

input empty

| input line push($_[1]$_[2]) $_[1]

line n $_[1] | exp n print $_[1]n | error n $_[0]-gtYYErrok

exp NUM| VAR $_[0]-gtYYData-gtVARS$_[1] | VAR = exp $_[0]-gtYYData-gtVARS$_[1]=$_[3] | exp + exp $_[1] + $_[3] | exp - exp $_[1] - $_[3] | exp exp $_[1] $_[3]

Example ParseYapp grammar

The Swiss Army Chainsaw

ParseRecDescent extended the original BNF syntax combining the tokens amp handlers

Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers

ndash Rules with Perl blocks allows the code to do anything

ndash Results can be acquired from a hash an array or $1

ndash Left right associative tags simplify messy situations

Example PRD

This is part of an infix formula compiler I wrote

It compiles equations to a sequence of closures

add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]

add ltleftop mult add_op multgt compile_binop $item[1]

mult ltleftop factor mult_op factorgt compile_binop $item[1]

Just enough rope to shoot yourself

The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long

ndash Unless you really know what all of it does you may not be able to figure out the pieces

ndash Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers

ndash Good amp Bad thing it really is a compiler

ndash Bad rap for not doing well what it wasnt supposed to do at all

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 5: Perly Parsing with Regexp::Grammars

The perly way

Regexen logic glue hmm been there beforendash The first approach most of us try is lexing with regexen

ndash Then add captures and if-blocks or excute (code) blocks inside of each regex

The problem is that the grammar is defined by your code structurendash Modifying the grammar requires re-coding it

ndash Hubris maybe but Truly Lazy it aint

ndash Was the whole reason for developing standard grammars amp their handlers in the first place

Early Perl Grammar Modules

These take in a YACC grammar and spit out compiler code Intentionally looked like YACC

ndash Able to re-cycle existing YACC grammar files

ndash Benefit from using Perl as a built-in lexer

ndash Perl-byacc amp ParseYapp

Good Recycles knowledge for YACC users Bad Still not lazy The grammars are difficult to maintain and you

still have to plug in post-processing code to deal with the results

right =left - +left left NEGright ^

input empty

| input line push($_[1]$_[2]) $_[1]

line n $_[1] | exp n print $_[1]n | error n $_[0]-gtYYErrok

exp NUM| VAR $_[0]-gtYYData-gtVARS$_[1] | VAR = exp $_[0]-gtYYData-gtVARS$_[1]=$_[3] | exp + exp $_[1] + $_[3] | exp - exp $_[1] - $_[3] | exp exp $_[1] $_[3]

Example ParseYapp grammar

The Swiss Army Chainsaw

ParseRecDescent extended the original BNF syntax combining the tokens amp handlers

Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers

ndash Rules with Perl blocks allows the code to do anything

ndash Results can be acquired from a hash an array or $1

ndash Left right associative tags simplify messy situations

Example PRD

This is part of an infix formula compiler I wrote

It compiles equations to a sequence of closures

add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]

add ltleftop mult add_op multgt compile_binop $item[1]

mult ltleftop factor mult_op factorgt compile_binop $item[1]

Just enough rope to shoot yourself

The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long

ndash Unless you really know what all of it does you may not be able to figure out the pieces

ndash Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers

ndash Good amp Bad thing it really is a compiler

ndash Bad rap for not doing well what it wasnt supposed to do at all

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

– Making it optional with <start> fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;

        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process a single fasta record in %/
    }
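How the loop gets its input is not shown on the slides; a sketch of one way to feed it, assuming the compressed file is read through a gzip pipe rather than decompressed to disk (the pipe command and explicit filehandle are assumptions):

    open my $fh, '-|', 'gzip -dc nr.gz'
        or die "unable to start gzip: $!";

    local $/ = '>';     # chunk the stream on the FASTA record marker

    while( my $chunk = readline $fh )
    {
        # same body as above: chomp, skip empty chunks, match against $nr_gz
    }

Plain readline with no argument reads from *ARGV, so piping the decompressed stream into the script also works with the loop exactly as written above.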

Fasta base grammar: 3 lines of code

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta> <start> <head> <ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

        <rule: head> .+ <ws>

        <rule: body> ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

        <token: start>   ^ [>]?
        <token: comment> ^ [;] .+
        <token: seq>     ^ ( [\n\w\-]+ )
    }xm

Extension to Fasta: 6 lines of code

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

        <rule: head>   <[ident]>+ % [\cA]
        <rule: ident>  <[taxa]>+  % ( \s* [|] \s* )
        <token: taxa>  .+?
    }xm
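A sketch of exercising the derived grammar on one chunk and dumping what comes back; it assumes the base grammar has already been compiled (so ParseFasta is registered), the derived qr above is in $nr_gz, and $chunk holds one ">"-delimited record:

    use Data::Dumper;

    if( $chunk =~ $nr_gz )
    {
        my $fasta = $/{ fasta };

        print Dumper $fasta->{ head };
        # [ { gi => ..., ref => ..., desc => '...' }, ... ]

        printf "%d sources for a %d-residue sequence\n",
            scalar @{ $fasta->{ head } }, length $fasta->{ body };
    }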

Result: Use grammars

Most of the "real" work is done under the hood.

– Regexp::Grammars does the lexing, basic compilation

– Code only needed for cleanups or re-arranging structs

Code can simplify your grammar.

– Too much code makes the grammar hard to maintain

– The trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way, the result is going to be more maintainable than hardwiring the grammar into code.

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.

– This requires "use re 'eval'" and "no strict 'vars'"

– One for the Perl code, the other for $MATCH and friends

The immediate fix for this is in the last few lines of RG::import, which push the pragmas into the caller.

Look up $^H in perlvars to see how it works

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );
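A sketch of the equivalent fix applied by hand in the scope that compiles a grammar, for code that does not want to rely on RG's import doing it; the toy grammar is invented:

    my $parser = do
    {
        use Regexp::Grammars;
        use re 'eval';      # allow the (?{ ... }) code blocks under v5.18
        no strict 'vars';   # %MATCH and friends are package variables

        qr
        {
            <data>

            <rule: data>  <[line]>+
                (?{ $MATCH{ count } = @{ $MATCH{ line } } })

            <token: line> ^ .+ $
        }xm;
    };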

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.

– Frankly, even if you do have old grammars.

Regexp::Grammars avoids the performance pitfalls of PRD.

– It is worth taking time to learn how to optimize NDF regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+.

PerlMonks has plenty of good postings.

Perl Review article by brian d foy on recursive matching in Perl 5.10.

Early Perl Grammar Modules

These take in a YACC grammar and spit out compiler code Intentionally looked like YACC

ndash Able to re-cycle existing YACC grammar files

ndash Benefit from using Perl as a built-in lexer

ndash Perl-byacc amp ParseYapp

Good Recycles knowledge for YACC users Bad Still not lazy The grammars are difficult to maintain and you

still have to plug in post-processing code to deal with the results

right =left - +left left NEGright ^

input empty

| input line push($_[1]$_[2]) $_[1]

line n $_[1] | exp n print $_[1]n | error n $_[0]-gtYYErrok

exp NUM| VAR $_[0]-gtYYData-gtVARS$_[1] | VAR = exp $_[0]-gtYYData-gtVARS$_[1]=$_[3] | exp + exp $_[1] + $_[3] | exp - exp $_[1] - $_[3] | exp exp $_[1] $_[3]

Example ParseYapp grammar

The Swiss Army Chainsaw

ParseRecDescent extended the original BNF syntax combining the tokens amp handlers

Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers

ndash Rules with Perl blocks allows the code to do anything

ndash Results can be acquired from a hash an array or $1

ndash Left right associative tags simplify messy situations

Example PRD

This is part of an infix formula compiler I wrote

It compiles equations to a sequence of closures

add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]

add ltleftop mult add_op multgt compile_binop $item[1]

mult ltleftop factor mult_op factorgt compile_binop $item[1]

Just enough rope to shoot yourself

The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long

ndash Unless you really know what all of it does you may not be able to figure out the pieces

ndash Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers

ndash Good amp Bad thing it really is a compiler

ndash Bad rap for not doing well what it wasnt supposed to do at all

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 7: Perly Parsing with Regexp::Grammars

right =left - +left left NEGright ^

input empty

| input line push($_[1]$_[2]) $_[1]

line n $_[1] | exp n print $_[1]n | error n $_[0]-gtYYErrok

exp NUM| VAR $_[0]-gtYYData-gtVARS$_[1] | VAR = exp $_[0]-gtYYData-gtVARS$_[1]=$_[3] | exp + exp $_[1] + $_[3] | exp - exp $_[1] - $_[3] | exp exp $_[1] $_[3]

Example ParseYapp grammar

The Swiss Army Chainsaw

ParseRecDescent extended the original BNF syntax combining the tokens amp handlers

Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers

ndash Rules with Perl blocks allows the code to do anything

ndash Results can be acquired from a hash an array or $1

ndash Left right associative tags simplify messy situations

Example PRD

This is part of an infix formula compiler I wrote

It compiles equations to a sequence of closures

add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]

add ltleftop mult add_op multgt compile_binop $item[1]

mult ltleftop factor mult_op factorgt compile_binop $item[1]

Just enough rope to shoot yourself

The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long

ndash Unless you really know what all of it does you may not be able to figure out the pieces

ndash Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers

ndash Good amp Bad thing it really is a compiler

ndash Bad rap for not doing well what it wasnt supposed to do at all

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 8: Perly Parsing with Regexp::Grammars

The Swiss Army Chainsaw

ParseRecDescent extended the original BNF syntax combining the tokens amp handlers

Grammars are largely declarative using OO Perl to do the heavy liftingndash OO interface allows multiple context sensitive parsers

ndash Rules with Perl blocks allows the code to do anything

ndash Results can be acquired from a hash an array or $1

ndash Left right associative tags simplify messy situations

Example PRD

This is part of an infix formula compiler I wrote

It compiles equations to a sequence of closures

add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]

add ltleftop mult add_op multgt compile_binop $item[1]

mult ltleftop factor mult_op factorgt compile_binop $item[1]

Just enough rope to shoot yourself

The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long

ndash Unless you really know what all of it does you may not be able to figure out the pieces

ndash Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers

ndash Good amp Bad thing it really is a compiler

ndash Bad rap for not doing well what it wasnt supposed to do at all

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines.
– Sub-rules can be defined for the different line types

<rule: command> emerge <ws> <[switch]>+

<token: switch> ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content (sketched below).
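As a hedged illustration of that nested parsing, the sketch below (mine, not the slide's finished parser) applies just the command sub-rule to one emerge invocation; I group the repetition as (?: <[switch]> )+ so the whitespace between switches is matched explicitly, and the expected output shape is shown in the trailing comment.

use strict;
use warnings;
use Data::Dumper;

my $cmd_parser = do
{
    use Regexp::Grammars;

    qr{
        <nocontext:>
        <command>

        <rule: command> emerge (?: <[switch]> )+
        <token: switch> ([-][-]\S+)
    }x;
};

'emerge --jobs --autounmask-write --deep talk' =~ $cmd_parser
    and print Dumper \%/;

# Expect something like:
# command => { switch => [ '--jobs', '--autounmask-write', '--deep' ] }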

Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quoted strings.

This makes modular or context-sensitive grammars relatively simple to compose:
– References can cross package or module boundaries
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers (see the toy sketch below)
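A toy sketch of the building-block idea, with invented names (MyBase, pair, key, value): the first qr{} only registers a named grammar, and the second qr{} extends it, overriding the "value" token.

use strict;
use warnings;
use Data::Dumper;

# Compiling this qr registers the named grammar; it produces no results itself.
my $base = do
{
    use Regexp::Grammars;

    qr{
        <grammar: MyBase>

        <rule: pair>   <key> = <value>
        <token: key>   \w+
        <token: value> \S+
    }x;
};

# The derived parser inherits MyBase's rules and narrows <value> to digits.
my $derived = do
{
    use Regexp::Grammars;

    qr{
        <nocontext:>
        <extends: MyBase>

        <pair>

        <token: value> \d+
    }x;
};

print Dumper \%/ if 'answer = 42' =~ $derived;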

The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format with headings separated by ctrl-A chars:

>Heading 1

[amino-acid sequence characters]

>Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters.
– Each species has a set of source & identifier pairs followed by a single description
– Within-species separator is a pipe ("|") with optional whitespace
– Species counts in some headers run into the thousands

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ [\n\w\-]+
}xm

Instead of defining an entry rule, this just defines a name, "ParseFasta".
– This cannot be used to generate results by itself
– Accessible anywhere via Regexp::Grammars

The output needs help however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }).
– Allows inserting almost-arbitrary code into the regex
– "almost" because the code cannot include regexen

seq => [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ' ]

Munging results: $MATCH

The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
– join + split won't work because split uses a regex

<rule: body> ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})

One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

body => [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ' ]

<rule: fasta> <start> <head> <ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})

Result: a generic FASTA parser

fasta =>
[
  {
    body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
    head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]\cAgi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1\cAgi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]\cAgi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
  },
]

The head and body are easily accessible. Next: parse the nr-specific header.

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case, the derived grammar references ParseFasta and extracts a list of fasta entries:

qr
{
    <extends: ParseFasta>

    <[fasta]>+
}xm
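Assuming the ParseFasta grammar from the previous slide has already been compiled in the same program, using this derived parser might look like the sketch below; the record is an abbreviated version of the Dictyostelium entry shown earlier, with "\cA" standing in for the ctrl-A separator.

use strict;
use warnings;
use Data::Dumper;

my $nr_parser = do
{
    use Regexp::Grammars;

    qr{
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+
    }xm;
};

# One abbreviated nr.gz record.
my $record
    = ">gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827\cA"
    . "gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1\n"
    . "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQ\n"
    ;

print Dumper \%/ if $record =~ $nr_parser;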

Splitting the head into identifiers

Overloading fasta's "head" rule handles splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on
– Using ".+? [\cA\n]" walks off the header onto the sequence
– This is a common problem with separators & tokenizers
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way

First pass: Literal "tail" item

This works but is ugly:
– Have two rules, for the main list and the tail
– Alias the tail to get them all in one place

<rule: head> <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors

    tr/\cA\n//d for @{ $MATCH{ ident } };
})

<token: ident> .+? \cA
<token: final> .+? \n

Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

head =>
{
  ident =>
  [
    'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
    'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
    'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
  ],
}

Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91        numbers separated by commas and spaces
– g-c-a-g-t-t-a-c-a         characters separated by dashes
– /usr/local/bin            basenames separated by dir markers
– /usr:/usr/local/bin       dirs separated by colons

... that R::G has special syntax for dealing with them: combine the item with '%' and a separator.

<rule: list>     <[item]>+ % <separator>    # one-or-more

<rule: list_zom> <[item]>* % <separator>    # zero-or-more
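To make the separator syntax concrete, here is a small sketch with invented names that parses comma-separated integers; the items land in an array under "item", with the commas and surrounding whitespace consumed by the rule.

use strict;
use warnings;
use Data::Dumper;

my $list_parser = do
{
    use Regexp::Grammars;

    qr{
        <nocontext:>
        <list>

        <rule: list>       <[item]>+ % <separator>
        <token: item>      \d+
        <token: separator> [,]
    }x;
};

print Dumper \%/ if '1, 2, 3, 4, 13, 91' =~ $list_parser;
# list => { item => [ 1, 2, 3, 4, 13, 91 ], ... }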

Cleaner nr.gz header rule: Separator syntax cleans things up.
– No more tail rule with an alias
– No code block required to strip the separators and trailing newline
– Non-greedy match ".+?" avoids capturing separators

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>   <[ident]>+ % [\cA]
    <token: ident> .+?
}xm

Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident> .+?
}xm

Result

fasta =>
[
  {
    body => 'MASTQNIVEEVQKMLDT ... NPDQ',
    head =>
    [
      'gi|66816243|ref|XP_6 ... rt=CAF-1',
      'gi|793761|dbj|BAA0626 ... oideum]',
      'gi|60470106|gb|EAL68086 ... m discoideum AX4]',
    ],
  },
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: ( \s* [|] \s* )

– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical ...

Becomes

gi => 66816243, ref => 'XP_642131.1', desc => 'hypothetical ...'

Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head>   <[ident]>+ % [\cA]
<token: ident> <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>  .+?
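Stripped of the grammar plumbing, the code block above is ordinary Perl list-to-hash munging; on one already-split ident it looks roughly like this (sample values echo the earlier example):

use strict;
use warnings;

# One "ident" as the <[taxa]>+ list delivers it: pairs, then a description.
my $taxa = [ 'gi', '66816243', 'ref', 'XP_642131.1',
             'hypothetical protein DDB_G0277827' ];

my $desc  = pop @$taxa;                    # description is the last item
my %ident = ( @$taxa, desc => $desc );     # remaining pairs become key/value

# %ident is now:
# ( gi => '66816243', ref => 'XP_642131.1',
#   desc => 'hypothetical protein DDB_G0277827' )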

Result: head with sources & "desc"

fasta =>
{
  body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN ...',
  head =>
  [
    {
      desc => '30S ribosomal protein S18 [Lactococ ...',
      gi   => 15674171,
      ref  => 'NP_268346.1',
    },
    {
      desc => '30S ribosomal protein S18 [Lactoco ...',
      gi   => 116513137,
      ref  => 'YP_812044.1',
    },
    ...
  ],
}

Balancing R::G with calling code

The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads
– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character
– Making it optional with <start>? fixes the problem

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}

Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta> <start>? <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head> .+ <ws>
    <rule: body> ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ ( [\n\w\-]+ )
}xm

Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>  <[ident]>+ % [\cA]
    <rule: ident> <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa> .+?
}xm

Result: Use grammars

Most of the "real" work is done under the hood.
– Regexp::Grammars does the lexing, basic compilation
– Code only needed for cleanups or re-arranging structs

Code can simplify your grammar:
– Too much code makes them hard to maintain
– Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way, the result is going to be more maintainable than hardwiring the grammar into code.

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'"
– One for the Perl code, the other for $MATCH and friends

The immediate fix for this is in the last few lines of R::G's import, which push the pragmas into the caller.

Look up $^H in perlvars to see how it works.

require re;        re->import( 'eval' );
require strict;    strict->unimport( 'vars' );

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner.
– Frankly, even if you do have old grammars

Regexp::Grammars avoids the performance pitfalls of P::RD.
– It is worth taking time to learn how to optimize NDF regexen, however

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. Perl Review article by brian d foy on recursive matching in Perl 5.10.

Page 9: Perly Parsing with Regexp::Grammars

Example PRD

This is part of an infix formula compiler I wrote

It compiles equations to a sequence of closures

add_op + | - | $item[ 1 ] mult_op | | ^ $item[ 1 ]

add ltleftop mult add_op multgt compile_binop $item[1]

mult ltleftop factor mult_op factorgt compile_binop $item[1]

Just enough rope to shoot yourself

The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long

ndash Unless you really know what all of it does you may not be able to figure out the pieces

ndash Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers

ndash Good amp Bad thing it really is a compiler

ndash Bad rap for not doing well what it wasnt supposed to do at all

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 10: Perly Parsing with Regexp::Grammars

Just enough rope to shoot yourself

The biggest problem PRD is sloooooooowsloooooooow Learning curve is perl-ish shallow and long

ndash Unless you really know what all of it does you may not be able to figure out the pieces

ndash Lots of really good docs that most people never read

Perly blocks also made it look too much like a job-dispatcherndash People used it for a lot of things that are not compilers

ndash Good amp Bad thing it really is a compiler

ndash Bad rap for not doing well what it wasnt supposed to do at all

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 11: Perly Parsing with Regexp::Grammars

RIP PRD

Supposed to be replaced with ParseFastDescentndash Damian dropped work on PFD for Perl6

ndash His goal was to replace the shortcomings with PRD with something more complete and quite a bit faster

The result is Perl6 Grammarsndash Declarative syntax extends matching with rules

ndash Built into Perl6 as a structure not an add-on

ndash Much faster

ndash Not available in Perl5

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr
{
    <nocontext:>    # turn off globally

    <data>
    <rule: data>    <[text]>+
    <rule: text>    (.+)
}xm

An array[ref] of text
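
As a sketch of how the resulting structure might be walked (assuming the corrected qr{} above is stored in $parser and the log text is in $text):

if( $text =~ $parser )
{
    # <[text]>+ leaves an array reference under the "text" key
    my @lines = @{ $/{data}{text} };

    printf "%d lines parsed\n", scalar @lines;
    print "$_\n" for @lines;
}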

Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

<data>
<rule: data>        <[line]>+
<rule: line>        <ref_id> <[text]>
<token: ref_id>     ^(\d+)
<rule: text>        .+

line =>
[
    {
        ref_id => '1367874132',
        text   => ':  Started emerge on: May 06, 2013 21:02:12'
    },

    ...

]

Removing cruft: "ws"

Be nice to remove the leading ":  " from text lines. In this case the "whitespace" needs to include a colon along with the

spaces. Whitespace is defined by <ws: … >

<rule: line>    <ws: [\s:]+> <ref_id> <text>

ref_id => '1367874132',
text   => 'emerge --jobs --autounmask-wr...'

The prefix means something

Be nice to know what type of line was being processed. <prefix= regex > assigns the regex's capture to the "prefix" tag

<rule: line>    <ws: [\s:]*> <ref_id> <entry>

<rule: entry>
    <prefix=([*][*][*])> <text>
|   <prefix=([>][>][>])> <text>
|   <prefix=([=][=][=])> <text>
|   <prefix=([:][:][:])> <text>
|   <text>

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

"entry" now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results.

The match from "text" is aliased to a named type of log entry:

<rule: entry>

    <prefix=([*][*][*])> <command=text>
|   <prefix=([>][>][>])> <stage=text>
|   <prefix=([=][=][=])> <status=text>
|   <prefix=([:][:][:])> <final=text>
|   <message=text>

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic "text" replaced with a type

Parsing without capturing

At this point we don't really need the prefix strings since the entries are labeled.

A leading "." tells RG to parse but not store the results in %/

<rule: entry>
    <.prefix=([*][*][*])> <command=text>
|   <.prefix=([>][>][>])> <stage=text>
|   <.prefix=([=][=][=])> <status=text>
|   <.prefix=([:][:][:])> <final=text>
|   <message=text>

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

"entry" now has typed keys

The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level

<ws: [\s:]*> <ref_id>
(
    <.prefix=([*][*][*])> <command=text>
|   <.prefix=([>][>][>])> <stage=text>
|   <.prefix=([=][=][=])> <status=text>
|   <.prefix=([:][:][:])> <final=text>
|   <message=text>
)

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result: array of "line" with ref_id & type

Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by text:

<rule: entry>   <ws: [\s:]*> <ref_id> <type>? <text>

<token: type>   ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type"

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a block:

    [*>:=]{3}

qr
{
    <nocontext:>

    <data>
    <rule: data>        <[entry]>+

    <rule: entry>       <ws: [\s:]*> <ref_id> <prefix>? <text>

    <token: ref_id>     ^(\d+)
    <token: prefix>     [*>:=]{3}
    <token: text>       .+
}xm

This is the skeleton parser

Doesn't take much
– Declarative syntax

– No Perl code at all

Easy to modify by extending the definition of "text" for specific types of messages
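
Gluing the pieces together, a self-contained sketch of the skeleton parser might look like this (the prefix character class and the input handling are my reading of the slides, not verbatim):

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my $parser = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <data>

        <rule: data>        <[entry]>+
        <rule: entry>       <ws: [\s:]*> <ref_id> <prefix>? <text>

        <token: ref_id>     ^(\d+)
        <token: prefix>     [*>:=]{3}
        <token: text>       .+
    }xm;
};

my $log = do { local $/; <> };      # slurp the emerge log from ARGV/STDIN

print Dumper $/{data}{entry}
    if $log =~ $parser;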

Finishing the parser

Given the different line types it will be useful to extract commands, switches, outcomes from appropriate lines
– Sub-rules can be defined for the different line types

<rule: command>     emerge <ws> <[switch]>+

<token: switch>     ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content

Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.

This makes modular or context-sensitive grammars relatively simple to compose
– References can cross package or module boundaries

– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers
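
A minimal, self-contained sketch of the building-block idea (the grammar name and rules here are invented for illustration, not taken from the talk):

use strict;
use warnings;

my $parser = do
{
    use Regexp::Grammars;

    # a named grammar: reusable, but it cannot match anything by itself
    qr
    {
        <grammar: Demo::Base>

        <rule: greeting>    hello <name>
        <token: name>       \w+
    }xm;

    # a separate parser extends it and supplies the entry point
    qr
    {
        <extends: Demo::Base>
        <greeting>
    }xm;
};

print "greeted: $/{greeting}{name}\n"
    if 'hello world' =~ $parser;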

The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format with headings separated

by ctrl-A chars:

>Heading 1

[amino-acid sequence characters]

>Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters
– Each species has a set of sources & identifier pairs followed by a single

description

– Within-species separator is a pipe ("|") with optional whitespace

– Species counts in some headers run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>       <start> <head> <ws> <[body]>+

    <rule: head>        .+ <ws>
    <rule: body>        ( <[seq]> | <comment> ) <ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm

Instead of defining an entry rule, this just defines a name, "ParseFasta"
– This cannot be used to generate results by itself

– Accessible anywhere via Regexp::Grammars

The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, "(?{ ... })"
– Allows inserting almost-arbitrary code into the regex

– "almost" because the code cannot include regexen
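
For example, a rule can post-process its own result inside such a block (a toy sketch, not one of the talk's rules):

<rule: shout>
    <word>
    (?{ $MATCH = uc $MATCH{ word }; })  # replace the rule's result with the upper-cased word

<token: word>
    \w+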

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines
– join + split won't work because split uses a regex

<rule: body>    ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})

One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

<rule: fasta>   <start> <head> <ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})

Result: a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible. Next: parse the nr-specific header.

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case:

References the grammar and extracts a list of fasta entries

<extends: ParseFasta>

<[fasta]>+

Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator
– The tail item on the list doesn't have a \cA to anchor on

– Using ".+? [\cA\n]" walks off the header onto the sequence

– This is a common problem with separators & tokenizers

– This can be handled with special tokens in the grammar, but RG provides a cleaner way

First pass: Literal "tail" item

This works but is ugly
– Have two rules for the main list and tail

– Alias the tail to get them all in one place

<rule: head>    <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors

    tr/\cA\n//d for @{ $MATCH{ ident } };
})

<token: ident>  .+? \cA
<token: final>  .+? \n

Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators: <sep>

Separators happen often enough
– 1, 2, 3, 4, 13, 91            numbers by commas, spaces

– g-c-a-g-t-t-a-c-a             characters by dashes

– /usr/local/bin                basenames by dir markers

– /usr:/usr/local/bin           dirs separated by colons

that RG has special syntax for dealing with them: combining the item with "%" and a separator

<rule: list>        <[item]>+ % <separator>     # one-or-more

<rule: list_zom>    <[item]>* % <separator>     # zero-or-more
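
A tiny sketch of the separator syntax in action (assuming a Regexp::Grammars version that supports the "%" separator; the rule names are made up):

my $csv = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <list>

        <rule: list>    <[value]>+ % [,]
        <token: value>  \w+
    }xm;
};

print "@{ $/{list}{value} }\n"      # prints: a b c
    if 'a, b, c' =~ $csv;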

Cleaner nr.gz header rule: Separator syntax cleans things up

– No more tail rule with an alias

– No code block required to strip the separators and trailing newline

– Non-greedy match ".+?" avoids capturing separators

qr
{
    <nocontext:>

    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm

Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers. Replace $MATCH from the "head" rule with the nested identifier

contents:

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident>  .+?
}xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )"

– This can be parsed into a hash, something like

gi|66816243|ref|XP_642131.1| hypothetical...

Becomes

gi => '66816243', ref => 'XP_642131.1', desc => 'hypothetical...'

Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?

Result: head with sources, "desc"

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nr.gz
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in

the heads

– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character

– Making it optional with <start>? fixes the problem

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
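
Filled out a bit, the calling side might look like this sketch (the gzip pipe and file name are illustrative; $nr_gz is assumed to hold the derived grammar's qr{} from the previous slides):

use strict;
use warnings;

our $nr_gz;     # populated elsewhere with the derived grammar's qr{}

open my $fh, '-|', 'gzip -dc nr.gz'
    or die "gzip: $!";

local $/ = '>';                     # chunk the stream on FASTA record starts

while( my $chunk = readline $fh )
{
    chomp $chunk;                   # strip the trailing '>'
    length $chunk or next;          # the first chunk is empty

    if( $chunk =~ $nr_gz )
    {
        my $fasta = $/{ fasta };

        # process one record via $fasta->{ head } and $fasta->{ body }
    }
}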

Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>       <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

    <rule: head>        .+ <ws>
    <rule: body>        ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm

Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm

Result: Use grammars

Most of the "real" work is done under the hood
– Regexp::Grammars does the lexing, basic compilation

– Code only needed for cleanups or re-arranging structs

Code can simplify your grammar
– Too much code makes them hard to maintain

– Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code.

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front
– This requires "use re 'eval'" and "no strict 'vars'"

– One for the Perl code, the other for $MATCH and friends

The immediate fix for this is in the last few lines of RG::import, which push the pragmas into the caller.

Look up $^H in perlvar to see how it works.

require re;     re->import( 'eval' );
require strict; strict->unimport( 'vars' );
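
If patching RG::import is not appealing, the same pragmas can simply be enabled in the scope that compiles the grammar (a workaround sketch; the scoping and pragma placement are my reading of the issue, not from the slides):

use re qw( eval );          # allow the grammar's (?{...}) blocks to compile
no strict qw( vars );       # let $MATCH and friends resolve inside them

my $parser = do
{
    use Regexp::Grammars;

    qr
    {
        <shout>
        <rule: shout>   <word> (?{ $MATCH = uc $MATCH{ word } })
        <token: word>   \w+
    }xm;
};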

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner
– Frankly, even if you do have old grammars

Regexp::Grammars avoids the performance pitfalls of PRD
– It is worth taking time to learn how to optimize NDF regexen, however

Or better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. Perl Review article by brian d foy on recursive matching in Perl 5.10.

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 12: Perly Parsing with Regexp::Grammars

RegexGrammars

Perl5 implementation derived from Perl6ndash Back-porting an idea not the Perl6 syntax

ndash Much better performance than PRD

Extends the v510 recursive matching syntax leveraging the regex enginendash Most of the speed issues are with regex design not the parser itself

ndash Simplifies mixing code and matching

ndash Single place to get the final results

ndash Cleaner syntax with automatic whitespace handling

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 13: Perly Parsing with Regexp::Grammars

Extending regexen

ldquouse RegexpGrammarrdquo turns on added syntaxndash block-scoped (avoids collisions with existing code)

You will probably want to add ldquoxmrdquo or ldquoxsrdquondash extended syntax avoids whitespace issues

ndash multi-line mode (m) simplifies line anchors for line-oriented parsing

ndash single-line mode (s) makes ignoring line-wrap whitespace largely automatic

ndash I use ldquoxmrdquo with explicit ldquonrdquo or ldquosrdquo matches to span lines where necessary

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 14: Perly Parsing with Regexp::Grammars

What you get

The parser is simply a regex-refndash You can bless it or have multiple parsers in the same program

Grammars can reference one anotherndash Extending grammars via objects or modules is straightforward

Comfortable for incremental development or refactoringndash Largely declarative syntax helps

ndash OOP provides inheritance with overrides for rules

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages.

They also get in the way, and can cost a lot of memory on large inputs.

You can turn them on and off with <context:> and <nocontext:> in the rules:

    qr
    {
        <nocontext:>            # turn off globally

        <data>
        <rule: data>    <text>+     # oops: left off the [ ]
        <rule: text>    .+
    }xm;

    warn:
    | Repeated subrule <text>+ will only capture its final match
    | (Did you mean <[text]>+ instead?)
    |
    { data => { text => '1367874132:  emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk' } }

You usually want [ ] with +.

    qr
    {
        <nocontext:>        # turn off globally

        <data>
        <rule: data>    <[text]>+
        <rule: text>    (.+)
    }xm;

    data =>
    {
        text =>             # the [text] parses to an array of text
        [
            '1367874132:  Started emerge on: May 06, 2013 21:02:12',
            '1367874132:  emerge --jobs --autounmask-write…',
        ],
    }

An array[ref] of text.

Breaking up lines

Each log entry is prefixed with an entry id. Parsing the ref_id off the front adds:

    <data>
    <rule: data>     <[line]>+
    <rule: line>     <ref_id> <[text]>
    <token: ref_id>  ^(\d+)
    <rule: text>     .+

    line =>
    [
        {
            ref_id => '1367874132',
            text   => 'Started emerge on: May 06, 2013 21:02:12',
        },
        …
    ]

Removing cruft: "ws"

Be nice to remove the leading ":  " from the text lines. In this case the "whitespace" needs to include a colon along with the spaces. Whitespace is defined by <ws: … >:

    <rule: line>    <ws: [\s:]+> <ref_id> <text>

    ref_id => '1367874132',
    text   => 'emerge --jobs --autounmask-wr…',

The prefix means something

Be nice to know what type of line was being processed. <prefix=( regex )> assigns the regex's capture to the "prefix" tag:

    <rule: line>    <ws: [\s:]*> <ref_id> <entry>

    <rule: entry>
        <prefix=([*][*][*])> <text>
      | <prefix=([>][>][>])> <text>
      | <prefix=([=][=][=])> <text>
      | <prefix=([:][:][:])> <text>
      | <text>

    entry  => { text => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry  => { prefix => '***', text => 'emerge --jobs --autounmask-write…' },
    ref_id => '1367874132',

    entry  => { prefix => '>>>', text => 'emerge (1 of 2) sys-apps…' },
    ref_id => '1367874256',

"entry" now contains the optional prefix.

Aliases can also assign tag results

Aliases assign a key to rule results.

The match from "text" is aliased to a named type of log entry:

    <rule: entry>
        <prefix=([*][*][*])> <command=text>
      | <prefix=([>][>][>])> <stage=text>
      | <prefix=([=][=][=])> <status=text>
      | <prefix=([:][:][:])> <final=text>
      | <message=text>

    entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry  => { command => 'emerge --jobs --autounmask-write…', prefix => '***' },
    ref_id => '1367874132',

    entry  => { command => 'terminating', prefix => '***' },
    ref_id => '1367874133',

Generic "text" replaced with a type.

Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.

A leading "." tells R::G to parse but not store the results in %/:

    <rule: entry>
        <.prefix=([*][*][*])> <command=text>
      | <.prefix=([>][>][>])> <stage=text>
      | <.prefix=([=][=][=])> <status=text>
      | <.prefix=([:][:][:])> <final=text>
      | <message=text>

    entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132',

    entry  => { command => 'emerge --jobs --autounmask-write…' },
    ref_id => '1367874132',

    entry  => { command => 'terminating' },
    ref_id => '1367874133',

"entry" now has typed keys.

The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level.

    # body of <rule: line>, with "entry" inlined

    <ws: [\s:]*> <ref_id>
    (
        <.prefix=([*][*][*])> <command=text>
      | <.prefix=([>][>][>])> <stage=text>
      | <.prefix=([=][=][=])> <status=text>
      | <.prefix=([:][:][:])> <final=text>
      | <message=text>
    )

    data =>
    {
        line =>
        [
            { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => '1367874132' },
            { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => '1367874132' },
            { command => 'terminating', ref_id => '1367874133' },
            { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => '1367874137' },
        ],
    }

Result: an array of "line" with ref_id & type.

Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by text:

    <rule: entry>   <ws: [\s:]*> <ref_id> <type>? <text>

    <token: type>   ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

    entry =>
    [
        { ref_id => '1367874132', text => 'Started emerge on: May 06, 2013 21:02:12' },
        { ref_id => '1367874133', text => 'terminating', type => '***' },
        { ref_id => '1367874137', text => 'Started emerge on: May 06, 2013 21:02:17' },
        { ref_id => '1367874137', text => 'emerge --jobs --autounmask-write…', type => '***' },
    ]

prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a character class with a count:

    [*>:=]{3}

This is the skeleton parser:

    qr
    {
        <nocontext:>

        <data>
        <rule:  data>    <[entry]>+

        <rule:  entry>   <ws: [\s:]*> <ref_id> <prefix>? <text>

        <token: ref_id>  ^(\d+)
        <token: prefix>  [*>:=]{3}
        <token: text>    .+
    }xm;

Doesn't take much:
– Declarative syntax.
– No Perl code at all.

Easy to modify by extending the definition of "text" for specific types of messages.
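
A hedged, self-contained sketch of driving the skeleton (the log path is illustrative, and the skeleton is repeated here so the snippet runs on its own; a real emerge.log can be large, so the Dumper call is only for eyeballing the structure):

    use strict;
    use warnings;
    use Data::Dumper;

    my $parser = do
    {
        use Regexp::Grammars;

        qr
        {
            <nocontext:>
            <data>

            <rule:  data>    <[entry]>+
            <rule:  entry>   <ws: [\s:]*> <ref_id> <prefix>? <text>

            <token: ref_id>  ^(\d+)
            <token: prefix>  [*>:=]{3}
            <token: text>    .+
        }xm;
    };

    my $log = do
    {
        local $/;
        open my $fh, '<', '/var/log/emerge.log' or die "emerge.log: $!";
        readline $fh;
    };

    $log =~ $parser
        and print Dumper $/{ data };    # array of parsed entries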

Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines.
– Sub-rules can be defined for the different line types:

    <rule:  command>    emerge <ws> <[switch]>+

    <token: switch>     ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
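
As a hedged sketch of one such sub-rule (the rule name and sample line are invented, and it uses the "%" separator syntax from the later slides rather than bare repetition):

    use strict;
    use warnings;
    use Data::Dumper;

    my $cmd_rx = do
    {
        use Regexp::Grammars;

        qr
        {
            <nocontext:>
            <command>

            <rule:  command>    emerge <[switch]>+ % (\s+)
            <token: switch>     [-][-]\S+
        }xm;
    };

    'emerge --jobs --autounmask-write --deep' =~ $cmd_rx
        and print Dumper $/{ command }{ switch };   # arrayref of the switches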

Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quoted strings.

This makes modular or context-sensitive grammars relatively simple to compose.
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.
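
A minimal sketch of the building-block idea (the grammar and rule names here are invented): one qr{} only names a grammar, another extends it and produces results.

    use strict;
    use warnings;

    # define-only: registers the named grammar, returns no results by itself
    my $base = do
    {
        use Regexp::Grammars;

        qr
        {
            <grammar: MyBase>

            <rule: item>    \w+
        }x;
    };

    # derived: inherits MyBase's rules and supplies an entry point
    my $derived = do
    {
        use Regexp::Grammars;

        qr
        {
            <extends: MyBase>

            <item>
        }x;
    };

    'foo' =~ $derived
        and print 'item: ', $/{ item }, "\n";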

The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format, with headings separated by ctrl-A chars:

    >Heading 1
    [amino-acid sequence characters]
    >Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters.
– Each species has a set of source & identifier pairs, followed by a single description.
– The within-species separator is a pipe ("|") with optional whitespace.
– Species counts in some headers run into the thousands.

The \cA separators are invisible in the header line below:

    >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
    MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step: Parse FASTA

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule:  fasta>      <start> <head> <ws> <[body]>+

        <rule:  head>       .+ <ws>
        <rule:  body>       ( <[seq]> | <comment> ) <ws>

        <token: start>      ^ [>]
        <token: comment>    ^ [;] .+
        <token: seq>        ^ [\n\w\-]+
    }xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta".
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.

The output needs help, however.

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser, using Perl's regex code-block syntax (?{ ... }).
– Allows inserting almost-arbitrary code into the regex.
– "almost", because the code cannot include regexen.

    seq =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY
    DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP
    VQKLLNPDQ
    '
    ]

Munging results: $MATCH

The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
– join + split won't work, because split uses a regex.

    <rule: body>    ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{ seq } };
        $MATCH =~ tr/\n//d;
    })

One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

    body =>
    [
        'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
    ]

    <rule: fasta>   <start> <head> <ws> <[body]>+
    (?{
        $MATCH{ body } = $MATCH{ body }[0];
    })

Result: a generic FASTA parser

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
            head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        },
    ]

The head and body are easily accessible. Next: parse the nr-specific header.

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case, the derived grammar references ParseFasta and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+

Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting out the identifiers for individual species.

Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+? [\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.

First pass: Literal "tail" item

This works, but is ugly:
– Have two rules, for the main list and the tail.
– Alias the tail to get them all in one place.

    <rule: head>    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors

        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

    <token: ident>  .+? \cA
    <token: final>  .+? \n

Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

    head =>
    {
        ident =>
        [
            'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
            'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
            'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
            'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        ],
    }

Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91        numbers separated by commas and spaces
– g-c-a-g-t-t-a-c-a         characters separated by dashes
– /usr/local/bin            basenames separated by dir markers
– /usr:/usr/local/bin       dirs separated by colons

that R::G has special syntax for dealing with them: combine the item with "%" and a separator.

    <rule: list>        <[item]>+ % <separator>     # one-or-more
    <rule: list_zom>    <[item]>* % <separator>     # zero-or-more
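
A toy, hedged sketch of the separator syntax (the rule names and input are invented):

    use strict;
    use warnings;
    use Data::Dumper;

    my $list_rx = do
    {
        use Regexp::Grammars;

        qr
        {
            <nocontext:>
            <list>

            <rule:  list>   <[item]>+ % ([,]\s*)
            <token: item>   \d+
        }x;
    };

    '1, 2, 3, 4, 13, 91' =~ $list_rx
        and print Dumper $/{ list }{ item };    # arrayref of the numbers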

Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

    qr
    {
        <nocontext:>

        <extends: ParseFasta>

        <[fasta]>+

        <rule:  head>   <[ident]>+ % [\cA]
        <token: ident>  .+?
    }xm;

Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <[fasta]>+

        <rule:  head>   <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

        <token: ident>  .+?
    }xm;

Result

    fasta =>
    [
        {
            body => 'MASTQNIVEEVQKMLDT…NPDQ',
            head =>
            [
                'gi|66816243|ref|XP_6…rt=CAF-1',
                'gi|793761|dbj|BAA0626…oideum]',
                'gi|60470106|gb|EAL68086…m discoideum AX4]',
            ],
        },
    ]

The fasta content is broken into the usual "body", plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.

One more level of structure: idents

Species have <source> | <identifier> pairs, followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
– This can be parsed into a hash, something like:

    gi|66816243|ref|XP_642131.1|hypothetical

becomes

    { gi => '66816243', ref => 'XP_642131.1', desc => 'hypothetical' }

Munging the separated input

    <fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

    <rule:  head>   <[ident]>+ % [\cA]
    <token: ident>  <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?

Result: head with sources and "desc"

    fasta =>
    {
        body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN…',
        head =>
        [
            { desc => '30S ribosomal protein S18 [Lactococ…', gi => '15674171',  ref => 'NP_268346.1' },
            { desc => '30S ribosomal protein S18 [Lactoco…',  gi => '116513137', ref => 'YP_812044.1' },
        ],
    }

Balancing R::G with calling code

The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys, and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking the input on '>' removes it as a leading character.
– Making it optional with <start>? fixes the problem:

    local $/ = '>';

    while( my $chunk = readline )
    {
        chomp $chunk;
        length $chunk or do { --$.; next };

        $chunk =~ $nr_gz;

        # process a single fasta record in %/
    }
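
A fuller, hedged sketch of the chunked read (the gzip pipe and the pared-down grammar are placeholders; the real code would use the ParseFasta-based grammar from the surrounding slides):

    use strict;
    use warnings;

    my $fasta_rx = do
    {
        use Regexp::Grammars;

        qr
        {
            <nocontext:>
            <fasta>

            <rule:  fasta>   <start>? <head> <ws> <[body]>+

            <rule:  head>    .+ <ws>
            <token: body>    ^ [\n\w\-]+
            <token: start>   ^ [>]
        }xm;
    };

    open my $fh, '-|', 'gzip -dc nr.gz' or die "nr.gz: $!";

    local $/ = '>';                     # chunk input on the record prefix

    while( my $chunk = readline $fh )
    {
        chomp $chunk;                   # strip the trailing '>'
        length $chunk or do { --$.; next };

        $chunk =~ $fasta_rx or next;

        # one fasta record is now in %/, e.g. $/{ fasta }{ head }
    }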

Fasta base grammar: 3 lines of code

    qr
    {
        <grammar: ParseFasta>
        <nocontext:>

        <rule: fasta>   <start> <head> <ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

        <rule: head>    .+ <ws>
        <rule: body>    ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

        <token: start>      ^ [>]
        <token: comment>    ^ [;] .+
        <token: seq>        ^ ( [\n\w\-]+ )
    }xm;

Extension to Fasta: 6 lines of code

    qr
    {
        <nocontext:>
        <extends: ParseFasta>

        <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;

                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

        <rule:  head>   <[ident]>+ % [\cA]
        <rule:  ident>  <[taxa]>+ % ( \s* [|] \s* )
        <token: taxa>   .+?
    }xm;

Result: Use grammars

Most of the "real" work is done under the hood.
– Regexp::Grammars does the lexing and basic compilation.
– Code is only needed for cleanups or re-arranging structs.

Code can simplify your grammar.
– Too much code makes them hard to maintain.
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'".
– One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G's import(), which push the pragmas into the caller:

    require re;     re->import( 'eval' );
    require strict; strict->unimport( 'vars' );

Look up $^H in perlvar to see how it works.
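
If the installed Regexp::Grammars does not push these pragmas for you, a hedged manual workaround is to enable them yourself in the scope that compiles the grammar (the trivial rule and its code block here are only there to exercise the pragmas):

    my $parser = do
    {
        use Regexp::Grammars;
        use re 'eval';          # needed once inline (?{...}) code is compiled up front
        no strict 'vars';       # the code blocks refer to $MATCH and friends

        qr
        {
            <data>

            <rule: data>    .+
            (?{ $MATCH = uc $MATCH })
        }xm;
    };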

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining grammars is cleaner.
– Frankly, even if you do have old grammars.

Regexp::Grammars avoids the performance pitfalls of P::RD.
– It is worth taking time to learn how to optimize NDF regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working (if un-annotated) examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. The Perl Review has an article by brian d foy on recursive matching in Perl 5.10.

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 15: Perly Parsing with Regexp::Grammars

my $compiler= do use RegexpGrammars

qr ltdatagt

ltrule data gt lt[text]gt+ ltrule text gt +

xm

Example Creating a compiler

Context can be a do-block subroutine or branch logic

ldquodatardquo is the entry rule

All this does is read lines into an array with automatic ws handling

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 16: Perly Parsing with Regexp::Grammars

Results

The results of parsing are in a tree-hash named ndash Keys are the rule names that produced the results

ndash Empty keys () hold input text (for errors or debugging)

ndash Easy to handle with DataDumper

The hash has at least one key for the entry rule one empty key for input data if context is being saved

For example feeding two lines of a Gentoo emerge log through the line grammar gives

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 17: Perly Parsing with Regexp::Grammars

=gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk data =gt =gt 1367874132 Started emerge on May 06 2013 2102121367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk text =gt [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ]

Parsing a few lines of logfile

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.

A leading "." tells R::G to parse but not store the results in the result hash:

<rule: entry>
    <.prefix=([*][*][*])> <command=text>
|   <.prefix=([>][>][>])> <stage=text>
|   <.prefix=([=][=][=])> <status=text>
|   <.prefix=([:][:][:])> <final=text>
|   <message=text>

{
    entry  => { message => 'Started emerge on: May 06, 2013 21:02:12' },
    ref_id => '1367874132'
},
{
    entry  => { command => 'emerge --jobs --autounmask-write -' },
    ref_id => '1367874132'
},
{
    entry  => { command => 'terminating.' },
    ref_id => '1367874133'
}

"entry" now has typed keys
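As a standalone illustration of the leading dot, outside of the log grammar (the rule names here are made up for the example):

use strict;
use warnings;
use Regexp::Grammars;

my $decl = qr{
    <nocontext:>
    <assign>

    <rule: assign>      <.keyword> <name> = <value>     # keyword parsed, not stored
    <token: keyword>    my | our
    <token: name>       \w+
    <token: value>      \d+
}x;

if ( 'my answer = 42' =~ $decl ) {
    # %/ holds name and value for the assign rule; there is no 'keyword' key.
    print "$/{assign}{name} -> $/{assign}{value}\n";
}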

The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level.

<ws:[\s:]*> <ref_id>
(
    <.prefix=([*][*][*])> <command=text>
  | <.prefix=([>][>][>])> <stage=text>
  | <.prefix=([=][=][=])> <status=text>
  | <.prefix=([:][:][:])> <final=text>
  | <message=text>
)

data => {
    line => [
        { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => '1367874132' },
        { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => '1367874132' },
        { command => 'terminating.', ref_id => '1367874133' },
        { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => '1367874137' },
        ...
    ]
}

Result: array of "line" with ref_id & type

Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by text:

<rule: entry>   <ws:[\s:]*> <ref_id> <type>? <text>
<token: type>   ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

entry => [
    { ref_id => '1367874132', text => 'Started emerge on: May 06, 2013 21:02:12' },
    { ref_id => '1367874133', text => 'terminating.',                             type => '***' },
    { ref_id => '1367874137', text => 'Started emerge on: May 06, 2013 21:02:17' },
    { ref_id => '1367874137', text => 'emerge --jobs --autounmask-write –',       type => '***' },
    ...
]

prefix alternations look ugly

Using a count works

[*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a character class with a count:

[*>:=]{3}

qr{
    <nocontext:>

    <data>
    <rule: data>        <[entry]>+

    <rule: entry>       <ws:[\s:]*> <ref_id> <prefix>? <text>

    <token: ref_id>     ^(\d+)
    <token: prefix>     [*>:=]{3}
    <token: text>       .+
}xm;

This is the skeleton parser

Doesn't take much:
– Declarative syntax

– No Perl code at all

Easy to modify by extending the definition of "text" for specific types of messages.
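A hedged sketch of driving the skeleton from ordinary Perl; the log path and the optional <prefix>? are assumptions for illustration, not part of the slides:

use strict;
use warnings;

use Regexp::Grammars;
use Data::Dumper;

my $parser = qr{
    <nocontext:>
    <data>

    <rule: data>        <[entry]>+
    <rule: entry>       <ws:[\s:]*> <ref_id> <prefix>? <text>

    <token: ref_id>     ^(\d+)
    <token: prefix>     [*>:=]{3}
    <token: text>       .+
}xm;

# Assumed log location; adjust to taste.
my $log = do
{
    open my $fh, '<', '/var/log/emerge.log' or die "emerge.log: $!";
    local $/;
    <$fh>;
};

if ( $log =~ $parser )
{
    for my $entry ( @{ $/{data}{entry} } )
    {
        printf "%s  [%s]  %s\n",
            $entry->{ref_id},
            $entry->{prefix} // '   ',
            $entry->{text};
    }
}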

Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines:
– Sub-rules can be defined for the different line types (see the sketch below).

<rule: command>     emerge <ws: \s+ > <[switch]>+

<token: switch>     ( [-][-]\S+ )

This is what makes the grammars useful: nested, context-sensitive content.
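One hedged way to wire such a sub-rule into the skeleton, assuming the same use Regexp::Grammars setup as the driver sketch above; the alternation, the <rest=(...)> alias, and the optional prefix are illustrative choices, not the slides' exact code:

my $parser = qr{
    <nocontext:>
    <data>

    <rule: data>        <[entry]>+
    <rule: entry>       <ws:[\s:]*> <ref_id> <prefix>?
                        (?: <command> | <text> )

    # emerge command lines get their switches broken out;
    # <rest=(...)> soaks up whatever follows the switches on the line
    <rule: command>     emerge <[switch]>+ <rest=( [^\n]* )>
    <token: switch>     ( [-][-]\S+ )

    <token: ref_id>     ^(\d+)
    <token: prefix>     [*>:=]{3}
    <token: text>       .+
}xm;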

Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{...} without having to eval or deal with messy quoted strings.

This makes modular or context-sensitive grammars relatively simple to compose:
– References can cross package or module boundaries.

– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers (a small sketch follows).
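A minimal sketch of the building-block idea; the grammar name and rules are invented for illustration:

use strict;
use warnings;
use Regexp::Grammars;

# A named grammar: a library of rules only, it matches nothing by itself.
my $base = qr{
    <grammar: Demo::Numbers>

    <rule: number>      <value>
    <token: value>      \d+
}x;

# A derived parser: pulls in the base rules and adds a start rule.
my $list = qr{
    <extends: Demo::Numbers>
    <nocontext:>
    <list>

    <rule: list>        <[number]>+ % <comma>
    <token: comma>      ,
}x;

if ( '1, 2, 3' =~ $list )
{
    print scalar @{ $/{list}{number} }, " numbers parsed\n";   # 3 numbers parsed
}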

The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format with headings separated by ctrl-A chars:

>Heading 1

[amino-acid sequence characters]

>Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters:
– Each species has a set of source & identifier pairs followed by a single description.

– The within-species separator is a pipe ("|") with optional whitespace.

– Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step: Parse FASTA

qr{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>       <start> <head> <ws> <[body]>+

    <rule: head>        .+ <ws>
    <rule: body>        ( <[seq]> | <comment> ) <ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta":
– This cannot be used to generate results by itself.

– Accessible anywhere via Regexp::Grammars.

The output needs help, however.

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }):
– Allows inserting almost-arbitrary code into the regex.

– "almost" because the code cannot include regexen.

seq => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ'
]

Munging results: $MATCH

The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines:
– join + split won't work because split uses a regex.

<rule: body> ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{seq} };
    $MATCH =~ tr/\n//d;
})

One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

body => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
]

<rule: fasta> <start> <head> <ws> <[body]>+
(?{
    $MATCH{body} = $MATCH{body}[0];
})

Result: a generic FASTA parser

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    }
]

The head and body are easily accessible. Next: parse the nr-specific header.
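For example, once a chunk of FASTA text has matched against a qr// that calls the grammar via <[fasta]>+, the pieces are right there in %/. A usage sketch, with $fasta_parser assumed to be that qr// and $fasta_text the raw input:

if ( $fasta_text =~ $fasta_parser )
{
    for my $rec ( @{ $/{fasta} } )
    {
        printf "%4d residues  %.40s...\n",
            length $rec->{body},
            $rec->{head};
    }
}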

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case the derived qr//

references the grammar and extracts a list of fasta entries:

<extends: ParseFasta>

<[fasta]>+

Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator:
– The tail item on the list doesn't have a \cA to anchor on.

– Using ".+[\cA\n]" walks off the header onto the sequence.

– This is a common problem with separators & tokenizers.

– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.

First pass: Literal "tail" item

This works but is ugly:
– Have two rules, for the main list and the tail.

– Alias the tail to get them all in one place.

<rule: head> <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors
    tr/\cA\n//d for @{ $MATCH{ident} };
})

<token: ident>  .+? \cA
<token: final>  .+  \n

Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    ]
}

Dealing with separators: <sep>

Separators happen often enough that R::G has special syntax for dealing with them:

– 1, 2, 3, 4, 13, 91          numbers separated by commas and spaces

– g-c-a-g-t-t-a-c-a           characters separated by dashes

– /usr/local/bin              basenames separated by dir markers

– /usr:/usr/local/bin         dirs separated by colons

Combining the item with "%" and a separator:

<rule: list>        <[item]>+ % <separator>     # one-or-more

<rule: list_zom>    <[item]>* % <separator>     # zero-or-more
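A throwaway sketch of the comma-separated case (the input string is made up):

use strict;
use warnings;
use Regexp::Grammars;
use Data::Dumper;

my $csv = qr{
    <nocontext:>
    <list>

    <rule: list>        <[item]>+ % <separator>
    <token: item>       \d+
    <token: separator>  ,
}x;

print Dumper $/{list}{item}             # prints the six numbers
    if '1, 2, 3, 4, 13, 91' =~ $csv;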

Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias.

– No code block required to strip the separators and trailing newline.

– Non-greedy match ".+?" avoids capturing separators.

qr{
    <nocontext:>

    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]

    <token: ident>  .+?
}xm;

Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ident};
    })

    <token: ident>  .+?
}xm;

Result

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT ... NPDQ',
        head => [
            'gi|66816243|ref|XP_6 ... rt=CAF-1',
            'gi|793761|dbj|BAA0626 ... oideum]',
            'gi|60470106|gb|EAL68086 ... m discoideum AX4]'
        ]
    }
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".

– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical

Becomes:

gi => '66816243', ref => 'XP_642131.1', desc => 'hypothetical'
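In plain Perl the munge amounts to popping the description and treating the rest as key => value pairs. A small sketch with the slide's values; the literal array here is hypothetical stand-in data, not the grammar's actual output:

use strict;
use warnings;

# One "ident" as the grammar hands it over: alternating source/identifier
# values with the free-text description as the final element.
my @taxa = ( 'gi', '66816243', 'ref', 'XP_642131.1', 'hypothetical' );

my $desc  = pop @taxa;                  # description is the trailing item
my %ident = ( @taxa, desc => $desc );

# %ident now holds:
#   gi   => '66816243',
#   ref  => 'XP_642131.1',
#   desc => 'hypothetical',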

Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{fasta}{head}{ident};

    for( @$identz )
    {
        my $pairz = $_->{taxa};
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{fasta}{head} = $identz;
})

<rule: head>    <[ident]>+ % [\cA]

<token: ident>  <[taxa]>+ % ( \s* [|] \s* )

<token: taxa>   .+?

Result: head with sources, "desc"

fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN',
    head => [
        {
            desc => '30S ribosomal protein S18 [Lactococ',
            gi   => '15674171',
            ref  => 'NP_268346.1'
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco',
            gi   => '116513137',
            ref  => 'YP_812044.1'
        },
        ...
    ]
}

Balancing R::G with calling code

The regex engine could process all of nr.gz:
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.

– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character.

– Making it optional with <start>? fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}

Fasta base grammar: 3 lines of code

qr{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>       <start> <head> <ws> <[body]>+
    (?{
        $MATCH{body} = $MATCH{body}[0];
    })

    <rule: head>        .+ <ws>
    <rule: body>        ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{seq} };
        $MATCH =~ tr/\n//d;
    })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm;

Extension to Fasta: 6 lines of code

qr{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{fasta}{head}{ident};

        for( @$identz )
        {
            my $pairz = $_->{taxa};
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{fasta}{head} = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;

Result: Use grammars

Most of the "real" work is done under the hood:
– Regexp::Grammars does the lexing, basic compilation.

– Code only needed for cleanups or re-arranging structs.

Code can simplify your grammar:
– Too much code makes them hard to maintain.

– Trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.

– This requires "use re 'eval'" and "no strict 'vars'".

– One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of R::G::import, which push the pragmas into the caller.

Look up $^H in perlvar to see how it works.

require re;        re->import( 'eval' );
require strict;    strict->unimport( 'vars' );
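If patching R::G's import is not an option, the same two pragmas can be applied directly in the code that compiles the grammar. A workaround sketch; the tiny grammar is invented just to show an inline code block in use:

use strict;
use warnings;
use Regexp::Grammars;

use re 'eval';      # allow the (?{...}) blocks the grammar compiles into
no strict 'vars';   # let $MATCH / %MATCH resolve inside those blocks

my $parser = qr{
    <greeting>
    <rule: greeting>    hello <name>
        (?{ $MATCH{name} = uc $MATCH{name} })   # embedded code block
    <token: name>       \w+
}x;

print "$/{greeting}{name}\n" if 'hello world' =~ $parser;   # WORLD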

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner:
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of P::RD:
– It is worth taking time to learn how to optimize NDF regexen, however.

Or better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. Perl Review has an article by brian d foy on recursive matching in Perl 5.10.

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 18: Perly Parsing with Regexp::Grammars

Getting rid of context

The empty-keyed values are useful for development or explicit error messages

They also get in the way and can cost a lot of memory on large inputs

You can turn them on and off with ltcontextgt and ltnocontextgt in the rules

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 19: Perly Parsing with Regexp::Grammars

qr

ltnocontextgt turn off globallyltdatagtltrule data gt lttextgt+ oops left off the []ltrule text gt +

xm

warn | Repeated subrule lttextgt+ will only capture its final match | (Did you mean lt[text]gt+ instead) | data =gt text =gt 1367874132 emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk

You usually want [] with +

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 20: Perly Parsing with Regexp::Grammars

data =gt text =gt the [text] parses to an array of text [ 1367874132 Started emerge on May 06 2013 210212 1367874132 emerge --jobs --autounmask-write ndash ]

qr

ltnocontextgt turn off globally

ltdatagtltrule data gt lt[text]gt+ltrule text gt (+)

xm

An array[ref] of text

Breaking up lines

Each log entry is prefixed with an entry id Parsing the ref_id off the front adds

ltdatagtltrule data gt lt[line]gt+ltrule line gt ltref_idgt lt[text]gtlttoken ref_id gt ^(d+)ltrule text gt +

line =gt[

ref_id =gt 1367874132text =gt Started emerge on May 06 2013 210212

hellip

]

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible. Next: parse the nr-specific header.

Deriving a grammar

Existing grammars are “extended”. The derived grammars are capable of producing results. In this case, this references the grammar and extracts a list of fasta entries:

<extends: ParseFasta>

<[fasta]>+
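As a rough, self-contained sketch of how the pieces fit together (the driver code and sample input here are mine, not from the slides; the grammar lines follow the ones above, so details like the comment and seq character classes are best-guess):

#!/usr/bin/env perl
# Sketch: compile the ParseFasta grammar, derive a parser from it,
# and dump the result hash (%/) for one made-up FASTA chunk.
use strict;
use warnings;
use Data::Dumper;
use Regexp::Grammars;

my $fasta_base = qr{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>    ^ [>]
    <token: comment>  ^ [;] .+
    <token: seq>      ^ [\n\w\-]+
}xm;

my $parser = qr{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+
}xm;

my $text = ">Example heading\nMASTQNIVEE\nVQKMLDTYDT\n";

$text =~ $parser
    and print Dumper $/{ fasta };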

Splitting the head into identifiers

Overloading fasta's “head” rule allows splitting out identifiers for individual species.

Catch: \cA is a separator, not a terminator:

– The tail item on the list doesn't have a \cA to anchor on

– Using “.+? [\cA\n]” walks off the header onto the sequence

– This is a common problem with separators & tokenizers

– This can be handled with special tokens in the grammar, but R::G provides a cleaner way

First pass: Literal “tail” item

This works but is ugly:

– Have two rules, for the main list and the tail

– Alias the tail to get them all in one place

<rule: head> <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ ident } };
    })

<token: ident>  .+? \cA
<token: final>  .+? \n

Breaking up the header

The last header item is aliased to “ident”. This breaks up all of the entries:

head => ident => [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators: <sep>

Separators happen often enough:

– 1, 2, 3, 4, 13, 91 – numbers by commas, spaces

– g-c-a-g-t-t-a-c-a – characters by dashes

– /usr/local/bin – basenames by dir markers

– /usr:/usr/local/bin – dirs separated by colons

that R::G has special syntax for dealing with them, combining the item with a separator:

<rule: list>      <[item]>+ % <separator>    # one-or-more

<rule: list_zom>  <[item]>* % <separator>    # zero-or-more
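A toy example of the separator syntax (a sketch; the rule name, token name, and input are made up, and it assumes a Regexp::Grammars version that supports the “%” operator shown above):

use strict;
use warnings;
use Regexp::Grammars;

# Parse a comma-separated list; the separator is matched but not stored.
my $csv = qr{
    <nocontext:>
    <list>

    <rule: list>    <[item]>+ % [,]
    <token: item>   \w+
}x;

'foo, bar, baz' =~ $csv
    and print join( ' ', @{ $/{list}{item} } ), "\n";   # foo bar baz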

Cleaner nr.gz header rule: Separator syntax cleans things up

– No more tail rule with an alias

– No code block required to strip the separators and trailing newline

– Non-greedy match “.+?” avoids capturing separators

qr{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;

Nested “ident” tag is extraneous

Simpler to replace the “head” with a list of identifiers: replace $MATCH from the “head” rule with the nested identifier contents.

qr{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ ident };
        })

    <token: ident>  .+?
}xm;

Result

fasta => [ body => MASTQNIVEEVQKMLDTNPDQ head => [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual “body” plus a “head” broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: “% ( \s* [|] \s* )”

– This can be parsed into a hash, something like:

gi|66816243|ref|XP_6421311| hypothetical

Becomes:

gi => 66816243, ref => XP_6421311, desc => hypothetical

Munging the separated input

<fasta>
    (?{
        my $identz = delete $MATCH{ fasta }{ head }{ ident };

        for( @$identz )
        {
            my $pairz = $_->{ taxa };
            my $desc  = pop @$pairz;
            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{ fasta }{ head } = $identz;
    })

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?

Result: head with sources, “desc”

fasta => body => MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head => [ desc => 30S ribosomal protein S18 [Lactococ gi => 15674171 ref => NP_2683461 desc => 30S ribosomal protein S18 [Lactoco gi => 116513137 ref => YP_8120441

Balancing R::G with calling code

The regex engine could process all of nr.gz:

– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads

– Better approach: <fasta> on single entries, but chunking input on “>” removes it as a leading character

– Making it optional with <start>? fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;

    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}

Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+
        (?{
            $MATCH{ body } = $MATCH{ body }[0];
        })

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{ seq } };
            $MATCH =~ tr/\n//d;
        })

    <token: start>    ^ [>]
    <token: comment>  ^ [;] .+
    <token: seq>      ^ ( [\n\w\-]+ )
}xm;

Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
        (?{
            my $identz = delete $MATCH{ fasta }{ head }{ ident };

            for( @$identz )
            {
                my $pairz = $_->{ taxa };
                my $desc  = pop @$pairz;
                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{ fasta }{ head } = $identz;
        })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;

Result: Use grammars

Most of the “real” work is done under the hood:

– Regexp::Grammars does the lexing, basic compilation

– Code only needed for cleanups or re-arranging structs

Code can simplify your grammar:

– Too much code makes them hard to maintain

– Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way, the result is going to be more maintainable than hardwiring the grammar into code.

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code. Code that used to be eval-ed in the regex is now compiled up front:

– This requires “use re 'eval'” and “no strict 'vars'”

– One for the Perl code, the other for $MATCH and friends

The immediate fix for this is in the last few lines of RG::import, which push the pragmas into the caller. Look up $^H in perlvar to see how it works:

require re;        re->import( 'eval' );
require strict;    strict->unimport( 'vars' );
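If patching RG::import is not an option, the same pragmas can go directly into the code that compiles the grammar regexen – a sketch of that workaround (my wording, not from the slides):

# Workaround in the calling code for v5.18+:
use re qw( eval );      # allow (?{...}) blocks in interpolated patterns
no strict qw( vars );   # let those blocks use $MATCH / %MATCH unqualified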

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner:

– Frankly, even if you do have old grammars

Regexp::Grammars avoids the performance pitfalls of P::RD:

– It is worth taking time to learn how to optimize NDF regexen, however

Or better yet, use Perl 6 grammars, available today at your local copy of Rakudo Perl 6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

“perldoc perlre” shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. Perl Review article by brian d foy on recursive matching in Perl 5.10.

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 22: Perly Parsing with Regexp::Grammars

Removing cruft ldquowsrdquo

Be nice to remove the leading ldquo ldquo from text lines In this case the ldquowhitespacerdquo needs to include a colon along with the

spaces Whitespace is defined by ltws hellip gt

ltrule linegt ltws[s]+gt ltref_idgt lttextgt

ref_id =gt 1367874132text =gt emerge --jobs ndashautounmask-wr

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 23: Perly Parsing with Regexp::Grammars

The prefix means something

Be nice to know what type of line was being processed ltprefix= regex gt asigns the regexs capture to the ldquoprefixrdquo tag

ltrule line gt ltws[s]gt ltref_idgt ltentrygt ltrule entry gt ltprefix=([][][])gt lttextgt | ltprefix=([gt][gt][gt])gt lttextgt | ltprefix=([=][=][=])gt lttextgt | ltprefix=([][][])gt lttextgt | lttextgt

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we dont really need the prefix strings since the entries are labeled

A leading tells RG to parse but not store the results in

ltrule entry gtltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write - ref_id =gt 1367874132 entry =gt command =gt terminating ref_id =gt 1367874133

ldquoentryrdquo now has typed keys

The ldquoentryrdquo nesting gets in the way

The named subrule is not hard to get rid of just move its syntax up one level

ltws[s]gt ltref_idgt ( ltprefix=([][][])gt ltcommand=textgt | ltprefix=([gt][gt][gt])gt ltstage=textgt | ltprefix=([=][=][=])gt ltstatus=textgt | ltprefix=([][][])gt ltfinal=textgt | ltmessage=textgt )

data =gt line =gt [ message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 command =gt emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk ref_id =gt 1367874132 command =gt terminating ref_id =gt 1367874133 message =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137

Result array of ldquolinerdquo with ref_id amp type

Funny names for things

Maybe ldquocommandrdquo and ldquostatusrdquo arent the best way to distinguish the text

You can store an optional token followed by text

ltrule entry gt ltws[s]gt ltref_idgt lttypegt lttextgt lttoken typegt ( [][][] | [gt][gt][gt] | [=][=][=] | [][][] )

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 24: Perly Parsing with Regexp::Grammars

entry =gt text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt prefix =gt text =gt emerge --jobs ndashautounmask-write ref_id =gt 1367874132 entry =gt prefix =gt gtgtgt text =gt emerge (1 of 2) sys-apps ref_id =gt 1367874256

ldquoentryrdquo now contains optional prefix

Aliases can also assign tag results

Aliases assign a key to rule results

The match from ldquotextrdquo is aliased to a named type of log entry

ltrule entrygt

ltprefix=([][][])gt ltcommand=textgt|ltprefix=([gt][gt][gt])gt ltstage=textgt|ltprefix=([=][=][=])gt ltstatus=textgt|ltprefix=([][][])gt ltfinal=textgt|ltmessage=textgt

entry =gt message =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874132 entry =gt command =gt emerge --jobs --autounmask-write ndash prefix =gt ref_id =gt 1367874132 entry =gt command =gt terminating prefix =gt ref_id =gt 1367874133

Generic ldquotextrdquo replaced with a type

Parsing without capturing

At this point we don't really need the prefix strings, since the entries are labeled.

A leading "." tells R::G to parse but not store the results in %/:

<rule: entry>

      <.prefix=([*][*][*])> <command=text>
    | <.prefix=([>][>][>])> <stage=text>
    | <.prefix=([=][=][=])> <status=text>
    | <.prefix=([:][:][:])> <final=text>
    |                       <message=text>

entry => { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => '1367874132' }
entry => { command => 'emerge --jobs --autounmask-write -', ref_id => '1367874132' }
entry => { command => 'terminating.', ref_id => '1367874133' }

"entry" now has typed keys.

The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level.

<ws: [\s]*> <ref_id>

(
      <.prefix=([*][*][*])> <command=text>
    | <.prefix=([>][>][>])> <stage=text>
    | <.prefix=([=][=][=])> <status=text>
    | <.prefix=([:][:][:])> <final=text>
    |                       <message=text>
)

data => {
    line => [
        { message => 'Started emerge on: May 06, 2013 21:02:12', ref_id => '1367874132' },
        { command => 'emerge --jobs --autounmask-write --keep-going --load-average=4.0 --complete-graph --with-bdeps=y --deep talk', ref_id => '1367874132' },
        { command => 'terminating.', ref_id => '1367874133' },
        { message => 'Started emerge on: May 06, 2013 21:02:17', ref_id => '1367874137' },
    ],
}

Result: an array of "line" entries with ref_id & type.

Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional type token followed by the text:

<rule: entry>   <ws: [\s]*> <ref_id> <type>? <text>

<token: type>
    ( [*][*][*] | [>][>][>] | [=][=][=] | [:][:][:] )

Entries now have "text" and "type":

entry => [
    { ref_id => '1367874132', text => 'Started emerge on: May 06, 2013 21:02:12' },
    { ref_id => '1367874133', text => 'terminating.', type => '***' },
    { ref_id => '1367874137', text => 'Started emerge on: May 06, 2013 21:02:17' },
    { ref_id => '1367874137', text => 'emerge --jobs --autounmask-write', type => '***' },
]

The prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [:]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, a single character class with a {3} count does the job:

    [*>:=]{3}

qr
{
    <nocontext:>

    <data>
    <rule: data>    <[entry]>+

    <rule: entry>
        <ws: [\s]*> <ref_id> <prefix>? <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>:=]{3}
    <token: text>   .+
}xm;

This is the skeleton parser.

It doesn't take much:
– Declarative syntax.

– No Perl code at all.

It is easy to modify by extending the definition of "text" for specific types of messages; a runnable sketch follows.
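
A minimal, self-contained sketch of the skeleton in use. The sample log lines are simplified stand-ins for real emerge.log entries, and the variable names are mine, not the slides':

    #!/usr/bin/env perl
    use strict;
    use warnings;

    use Regexp::Grammars;
    use Data::Dumper;

    my $parser = qr
    {
        <nocontext:>

        <data>
        <rule: data>    <[entry]>+

        <rule: entry>   <ws: [\s]*> <ref_id> <prefix>? <text>

        <token: ref_id> ^(\d+)
        <token: prefix> [*>:=]{3}
        <token: text>   .+
    }xm;

    my $log = "1367874132 *** emerge --jobs --autounmask-write\n"
            . "1367874133 *** terminating.\n";

    # on success Regexp::Grammars leaves the parse tree in %/
    print Dumper $/{data}{entry}
        if $log =~ $parser;

Running it dumps an arrayref of two entry hashes, each with ref_id, prefix, and text keys.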

Finishing the parser

Given the different line types, it will be useful to extract commands, switches, and outcomes from the appropriate lines.
– Sub-rules can be defined for the different line types:

<rule: command>     emerge <ws> <[switch]>+

<token: switch>     ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content. A sketch of how such a sub-rule plugs into the skeleton follows.
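
Here is one way that might look, as a hedged sketch rather than the slides' exact code: the skeleton's "text" is split into a "command" alternative that breaks out the switches, and everything else falls through to a plain line. The rule names "text", "command", and "line" as laid out here are illustrative.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    use Regexp::Grammars;
    use Data::Dumper;

    my $parser = qr
    {
        <nocontext:>

        <data>
        <rule: data>     <[entry]>+

        <rule: entry>    <ws: [\s]*> <ref_id> <prefix>? <text>
        <rule: text>     <command> | <line>

        <rule: command>  emerge <[switch]>+ % (\s+)
        <token: switch>  ([-][-]\S+)
        <token: line>    .+

        <token: ref_id>  ^(\d+)
        <token: prefix>  [*>:=]{3}
    }xm;

    my $log = "1367874132 *** emerge --jobs --autounmask-write\n";

    print Dumper $/{data}{entry}[0]{text}
        if $log =~ $parser;
    # text => { command => { switch => [ '--jobs', '--autounmask-write' ] } }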

Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quoted strings.

This makes modular or context-sensitive grammars relatively simple to compose.
– References can cross package or module boundaries.

– It is easy to define a basic grammar in one place and reference or extend it from multiple other parsers; a small sketch follows.
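
A bare-bones illustration of the building-block idea, with made-up grammar and rule names; it is the same pattern the FASTA grammars below use. The first qr{} only defines a named grammar; the second extends it and does the actual matching.

    use strict;
    use warnings;

    use Regexp::Grammars;

    # Define-only: registers the grammar "My::Words"; produces no results on its own.
    my $base = qr
    {
        <grammar: My::Words>

        <rule: wordlist>    <[word]>+ % (\s+)
        <token: word>       \w+
    }xm;

    # Derive-and-match: pulls in My::Words and adds a top-level rule call.
    my $parser = qr
    {
        <nocontext:>
        <extends: My::Words>

        <wordlist>
    }xm;

    if ( 'parse these words' =~ $parser )
    {
        print "@{ $/{wordlist}{word} }\n";   # parse these words
    }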

The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format with headings separated

by ctrl-A chars:

>Heading 1

[amino-acid sequence characters]

>Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters, shown below as "^A".
– Each species has a set of source & identifier pairs followed by a single

description.

– The within-species separator is a pipe ("|") with optional whitespace.

– Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta".
– This cannot be used to generate results by itself.

– It is accessible anywhere via Regexp::Grammars.

The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }).
– This allows inserting almost-arbitrary code into the regex.

– "almost" because the code cannot include regexen.

seq => [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ ]

Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out of %MATCH, join them with nothing, and use "tr" to strip the newlines.
– join + split won't work because split uses a regex.

<rule: body>    ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{seq} };
        $MATCH =~ tr/\n//d;
    })

One more step: Remove the arrayref

Now the body is a single string.

There is no need for an arrayref to contain one string: since the body has one entry, assign offset zero.

body => [ 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ' ]

<rule: fasta>   <start> <head> <ws> <[body]>+
    (?{
        $MATCH{body} = $MATCH{body}[0];
    })

Result: a generic FASTA parser

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]^Agi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1^Agi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]^Agi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    },
]

The head and body are easily accessible. Next: parse the nr-specific header.

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case the derived grammar

references the ParseFasta grammar and extracts a list of fasta entries:

<extends: ParseFasta>

<[fasta]>+

Splitting the head into identifiers

Overloading ParseFasta's "head" rule handles splitting out the identifiers for the individual species.

Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.

– Using ".+ [\cA\n]" walks off the header onto the sequence.

– This is a common problem with separators & tokenizers.

– It can be handled with special tokens in the grammar, but R::G provides a cleaner way.

First pass: Literal "tail" item

This works, but it is ugly.
– There are two rules, one for the main list and one for the tail.

– Aliasing the tail gets them all in one place.

<rule: head>    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors

        tr/\cA\n//d for @{ $MATCH{ident} };
    })

<token: ident>  .+? \cA
<token: final>  .+? \n

Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    ],
}

Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91            # numbers by commas, spaces

– g-c-a-g-t-t-a-c-a             # characters by dashes

– /usr/local/bin                # basenames by dir markers

– /usr:/usr/local/bin           # dirs separated by colons

that R::G has special syntax for dealing with them: combine the item with "%" and a separator (example below).

<rule: list>        <[item]>+ % <separator>    # one-or-more

<rule: list_zom>    <[item]>* % <separator>    # zero-or-more
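
A tiny, runnable illustration of the one-or-more form, using the comma-separated numbers from the first bullet; the rule names here are mine:

    use strict;
    use warnings;

    use Regexp::Grammars;
    use Data::Dumper;

    my $parser = qr
    {
        <nocontext:>

        <list>
        <rule: list>    <[item]>+ % ([,]\s*)
        <token: item>   \d+
    }xm;

    print Dumper $/{list}{item}
        if '1, 2, 3, 4, 13, 91' =~ $parser;
    # an arrayref of the six numbers, with the separators discarded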

Cleaner nr.gz header rule

The separator syntax cleans things up:
– No more tail rule with an alias.

– No code block required to strip the separators and trailing newline.

– The non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>

    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;

Nested "ident" tag is extraneous

It is simpler to replace the "head" with the list of identifiers: replace $MATCH from the "head" rule with the nested "ident"

contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ident};
        })

    <token: ident>  .+?
}xm;

Result:

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]',
        ],
    },
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.

One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1| hypothetical

becomes

gi => '66816243', ref => 'XP_642131.1', desc => 'hypothetical'
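
The munge itself is plain Perl; here is a standalone sketch of the idea applied to a single raw identifier (the input string is the reconstructed example from above):

    use strict;
    use warnings;

    use Data::Dumper;

    my $ident = 'gi|66816243|ref|XP_642131.1| hypothetical';

    my @taxa  = split m{ \s* [|] \s* }x, $ident;
    my $desc  = pop @taxa;          # the last field is the description

    my %entry = ( @taxa, desc => $desc );

    print Dumper \%entry;
    # { gi => '66816243', ref => 'XP_642131.1', desc => 'hypothetical' }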

Munging the separated input

<fasta>
    (?{
        my $identz = delete $MATCH{fasta}{head}{ident};

        for( @$identz )
        {
            my $pairz = $_->{taxa};
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{fasta}{head} = $identz;
    })

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?

Result: head with sources, "desc"

fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN',
    head => [
        {
            desc => '30S ribosomal protein S18 [Lactococ',
            gi   => '15674171',
            ref  => 'NP_268346.1',
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco',
            gi   => '116513137',
            ref  => 'YP_812044.1',
        },
    ],
}

Balancing R::G with calling code

The regex engine could process all of nr.gz.
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in

the heads.

– Better approach: <fasta> on single entries, but chunking the input on ">" removes it as a leading character.

– Making it optional with <start>? fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process a single fasta record in %/
}
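
For reference, a tiny standalone demonstration of what chunking on ">" does, using an in-memory filehandle and made-up records rather than nr.gz:

    use strict;
    use warnings;

    my $fasta = ">one\nABCDEF\n>two\nGHIJKL\n";

    open my $fh, '<', \$fasta or die "open: $!";

    local $/ = '>';

    while( my $chunk = readline $fh )
    {
        chomp $chunk;               # strip the trailing '>' record separator
        length $chunk or next;      # the first read returns only the leading '>'

        print "chunk:\n$chunk";
    }

The loop sees "one\nABCDEF\n" and then "two\nGHIJKL\n", each missing its leading ">", which is why the grammar's <start> token has to be optional.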

Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+
        (?{
            $MATCH{body} = $MATCH{body}[0];
        })

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{seq} };
            $MATCH =~ tr/\n//d;
        })

    <token: start>      ^ [>]
    <token: comment>    ^ [;] .+
    <token: seq>        ^ ( [\n\w\-]+ )
}xm;

Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
        (?{
            my $identz = delete $MATCH{fasta}{head}{ident};

            for( @$identz )
            {
                my $pairz = $_->{taxa};
                my $desc  = pop @$pairz;
                $_ = { @$pairz, desc => $desc };
            }

            $MATCH{fasta}{head} = $identz;
        })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;

Result: Use grammars

Most of the "real" work is done under the hood.
– Regexp::Grammars does the lexing and basic compilation.

– Code is only needed for cleanups or re-arranging structs.

Code can simplify your grammar.
– Too much code makes them hard to maintain.

– The trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.

– This requires "use re 'eval'" and "no strict 'vars'".

– One is for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of Regexp::Grammars' import(), which push the pragmas into the caller.

Look up $^H in perlvar to see how it works.

require re;        re->import( 'eval' );
require strict;    strict->unimport( 'vars' );

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner.
– Frankly, even if you do have old grammars.

Regexp::Grammars avoids the performance pitfalls of P::RD.
– It is worth taking the time to learn how to optimize NDF regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. There is a Perl Review article by brian d foy on recursive matching in Perl 5.10.

Page 28: Perly Parsing with Regexp::Grammars

entry => {
    message => 'Started emerge on: May 06, 2013 21:02:12',
    ref_id  => 1367874132,
}

entry => {
    command => 'emerge --jobs --autounmask-write -',
    ref_id  => 1367874132,
}

entry => {
    command => 'terminating',
    ref_id  => 1367874133,
}

"entry" now has typed keys

The "entry" nesting gets in the way

The named subrule is not hard to get rid of: just move its syntax up one level.

<ws:[\s]*> <ref_id>
(
  <prefix=([*][*][*])> <command=text>
|
  <prefix=([>][>][>])> <stage=text>
|
  <prefix=([=][=][=])> <status=text>
|
  <prefix=([#][#][#])> <final=text>
|
  <message=text>
)

data => {
    line => [
        {
            message => 'Started emerge on: May 06, 2013 21:02:12',
            ref_id  => 1367874132,
        },
        {
            command => 'emerge --jobs --autounmask-write --keep-going --load-average=40 --complete-graph --with-bdeps=y --deep talk',
            ref_id  => 1367874132,
        },
        {
            command => 'terminating',
            ref_id  => 1367874133,
        },
        {
            message => 'Started emerge on: May 06, 2013 21:02:17',
            ref_id  => 1367874137,
        },
    ],
}

Result: array of "line" with ref_id & type

Funny names for things

Maybe "command" and "status" aren't the best way to distinguish the text.

You can store an optional token followed by text:

<rule: entry>  <ws:[\s]*> <ref_id> <type>? <text>

<token: type>  ( [*][*][*] | [>][>][>] | [=][=][=] | [#][#][#] )

Entries now have "text" and "type"

entry => [
    {
        ref_id => 1367874132,
        text   => 'Started emerge on: May 06, 2013 21:02:12',
    },
    {
        ref_id => 1367874133,
        text   => 'terminating',
        type   => '***',
    },
    {
        ref_id => 1367874137,
        text   => 'Started emerge on: May 06, 2013 21:02:17',
    },
    {
        ref_id => 1367874137,
        text   => 'emerge --jobs --autounmask-write –',
        type   => '***',
    },
]

prefix alternations look ugly

Using a count works:

    [*]{3} | [>]{3} | [#]{3} | [=]{3}

but isn't all that much more readable.

Given the way these are used, use a block:

    [*>#=]{3}

qr
{
    <nocontext:>
    <data>

    <rule: data>    <[entry]>+
    <rule: entry>   <ws:[\s]*> <ref_id> <prefix> <text>

    <token: ref_id> ^(\d+)
    <token: prefix> [*>#=]{3}
    <token: text>   .+
}xm;

This is the skeleton parser

Doesn't take much:
– Declarative syntax.
– No Perl code at all!

Easy to modify by extending the definition of "text" for specific types of messages.
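A quick way to exercise the skeleton (the sample lines are simplified to fit the rules exactly, and $emerge_rx is an assumed name for the qr{} above):

use Data::Dumper;

my $text = <<'LOG';
1367874132 *** emerge --jobs --autounmask-write
1367874133 *** terminating
LOG

$text =~ $emerge_rx
    and print Dumper $/{ data };    # { entry => [ { ref_id, prefix, text }, ... ] }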

Finishing the parser

Given the different line types, it will be useful to extract commands, switches, outcomes from appropriate lines.
– Sub-rules can be defined for the different line types.

<rule: command> emerge <ws> <[switch]>+
<token: switch> ([-][-]\S+)

This is what makes the grammars useful: nested, context-sensitive content.
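One way to wire the sub-rule in is an alternation inside the entry rule; the alternation and the optional trailing <text> here are my assumptions, not part of the slides:

<rule: entry>    <ws:[\s]*> <ref_id> <prefix> ( <command> | <text> )

<rule: command>  emerge <ws> <[switch]>+ <text>?
<token: switch>  ([-][-]\S+)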

Inheriting & Extending Grammars

<grammar: name> and <extends: name> allow a building-block approach.

Code can assemble the contents of a qr{} without having to eval or deal with messy quote strings.

This makes modular or context-sensitive grammars relatively simple to compose.
– References can cross package or module boundaries.
– Easy to define a basic grammar in one place and reference or extend it from multiple other parsers.

The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format with headings separated by ctrl-A chars:

>Heading 1
[amino-acid sequence characters]
>Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters.
– Each species has a set of source & identifier pairs followed by a single description.
– Within-species separator is a pipe ("|") with optional whitespace.
– Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1
gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]
gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step: Parse FASTA

qr
{
    <grammar: Parse::Fasta>
    <nocontext:>

    <rule: fasta>   <start> <head> <ws> <[body]>+

    <rule: head>    .+ <ws>
    <rule: body>    ( <[seq]> | <comment> ) <ws>

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "Parse::Fasta".
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.

The output needs help, however

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }).
– Allows inserting almost-arbitrary code into the regex.
– "almost" because the code cannot include regexen.

seq => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIY',
    'DKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDP',
    'VQKLLNPDQ',
]

Munging results: $MATCH

The $MATCH and %MATCH can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out, join them with nothing, and use "tr" to strip the newlines.
– join + split won't work because split uses a regex.

<rule: body> ( <[seq]> | <comment> ) <ws>
(?{
    $MATCH = join '' => @{ delete $MATCH{ seq } };
    $MATCH =~ tr/\n//d;
})

One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string. Since the body has one entry, assign offset zero:

body => [
    'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ'
]

<rule: fasta> <start> <head> <ws> <[body]>+
(?{
    $MATCH{ body } = $MATCH{ body }[0];
})

Result: a generic FASTA parser

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    },
]

The head and body are easily accessible. Next: parse the nr-specific header.
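Pulling them out of %/ is plain hash walking (the variable names here are mine):

my( $record ) = @{ $/{ fasta } };    # first entry from <[fasta]>+

my $head = $record->{ head };        # raw nr header string
my $body = $record->{ body };        # joined amino-acid sequence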

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case this references the grammar and extracts a list of fasta entries:

<extends: Parse::Fasta>

<[fasta]>+

Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+? [\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.

First pass: Literal "tail" item

This works but is ugly:
– Have two rules for the main list and tail.
– Alias the tail to get them all in one place.

<rule: head> <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors
    tr/\cA\n//d for @{ $MATCH{ ident } };
})

<token: ident> .+? \cA
<token: final> .+? \n

Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
    ],
}

Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91          numbers by commas, spaces
– g-c-a-g-t-t-a-c-a           characters by dashes
– /usr/local/bin              basenames by dir markers
– /usr:/usr/local/bin         dirs separated by colons

that R::G has special syntax for dealing with them: combining the item with "%" and a separator.

<rule: list>     <[item]>+ % <separator>    # one-or-more
<rule: list_zom> <[item]>* % <separator>    # zero-or-more
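A tiny self-contained illustration of the separator syntax (the number-list grammar is my example, not from the talk):

use Regexp::Grammars;

my $list_rx = qr
{
    <nums>

    <rule: nums>  <[num]>+ % ( , \s* )
    <token: num>  \d+
}x;

'1, 2, 3, 4, 13, 91' =~ $list_rx
    and print join '|', @{ $/{ nums }{ num } };    # 1|2|3|4|13|91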

Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <[fasta]>+

    <rule: head>   <[ident]>+ % [\cA]
    <token: ident> .+?
}xm;

Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <[fasta]>+

    <rule: head> <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident> .+?
}xm;

Result

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_6...',
            '...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]',
        ],
    },
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.

One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: "% ( \s* [|] \s* )".
– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1|hypothetical

becomes

{
    gi   => '66816243',
    ref  => 'XP_642131.1',
    desc => 'hypothetical',
}

Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{ fasta }{ head }{ ident };

    for( @$identz )
    {
        my $pairz = $_->{ taxa };
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{ fasta }{ head } = $identz;
})

<rule: head>   <[ident]>+ % [\cA]
<token: ident> <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>  .+?

Page 32: Perly Parsing with Regexp::Grammars

Entrys now have ldquotextrdquo and ldquotyperdquo

entry =gt [ ref_id =gt 1367874132 text =gt Started emerge on May 06 2013 210212 ref_id =gt 1367874133 text =gt terminating type =gt ref_id =gt 1367874137 text =gt Started emerge on May 06 2013 210217 ref_id =gt 1367874137 text =gt emerge --jobs --autounmask-write ndash type =gt

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBI's "nr.gz" file is a list of sequences and all of the places they are known to appear.

It is moderately large: 140+GB uncompressed. The file consists of a simple FASTA format with headings separated by ctrl-A chars:

>Heading 1

[amino-acid sequence characters]

>Heading 2

Example: A short nr.gz FASTA entry

Headings are grouped by species, separated by ctrl-A ("\cA") characters:
– Each species has a set of source & identifier pairs followed by a single description.
– The within-species separator is a pipe ("|") with optional whitespace.
– Species counts in some headers run into the thousands.

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step: Parse FASTA

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>    <start> <head> <ws> <[body]>+

    <rule: head>     .+ <ws>
    <rule: body>     ( <[seq]> | <comment> ) <ws>

    <token: start>   ^ [>]
    <token: comment> ^ [;] .+
    <token: seq>     ^ [\n\w\-]+
}xm;

Instead of defining an entry rule, this just defines a name, "ParseFasta":
– This cannot be used to generate results by itself.
– Accessible anywhere via Regexp::Grammars.

The output needs help, however.

The "<seq>" token captures newlines that need to be stripped out to get a single string.

Munging these requires adding code to the parser using Perl's regex code-block syntax, (?{ ... }):
– Allows inserting almost-arbitrary code into the regex.
– "almost" because the code cannot include regexen.

seq => [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ ]

Munging results: $MATCH

The $MATCH and %MATCH variables can be assigned to alter the results from the current or lower levels of the parse.

In this case I take the "seq" match contents out, join them with nothing, and use "tr" to strip the newlines:
– join + split won't work because split uses a regex.

<rule: body>
    ( <[seq]> | <comment> ) <ws>
    (?{
        $MATCH = join '' => @{ delete $MATCH{seq} };
        $MATCH =~ tr/\n//d;
    })

One more step: Remove the arrayref

Now the body is a single string.

No need for an arrayref to contain one string: since the body has one entry, assign offset zero.

body => [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ ]

<rule: fasta>
    <start> <head> <ws> <[body]>+
    (?{
        $MATCH{body} = $MATCH{body}[0];
    })

Result: a generic FASTA parser

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ',
        head => 'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    }
]

The head and body are easily accessible. Next: parse the nr-specific header.
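A usage sketch of the parser at this stage (the rule bodies below are my transcription of the slides, inlined rather than kept as a named grammar so the snippet runs on its own; the sample record is shortened):

use strict;
use warnings;
use Data::Dumper;

my $fasta_qr = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <fasta>

        <rule: fasta>    <start> <head> <ws> <[body]>+

        <rule: head>     .+ <ws>
        <rule: body>     ( <[seq]> | <comment> ) <ws>

        <token: start>   ^ [>]
        <token: comment> ^ [;] .+
        <token: seq>     ^ [\n\w\-]+
    }xm;
};

my $record = ">gi|66816243|ref|XP_642131.1| hypothetical protein\nMASTQNIVEE\nVQKMLDTY\n";

if( $record =~ $fasta_qr )
{
    print Dumper $/{fasta}{head};    # the heading line
    print Dumper $/{fasta}{body};    # arrayref of sequence chunks
}

The (?{...}) cleanup blocks from the slides are omitted here to keep the example minimal; without them the body entries keep their embedded newlines.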

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results; in this case, the derived grammar references ParseFasta and extracts a list of fasta entries:

    <extends: ParseFasta>

    <[fasta]>+

Splitting the head into identifiers

Overloading fasta's "head" rule allows splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator:
– The tail item on the list doesn't have a \cA to anchor on.
– Using ".+ [\cA\n]" walks off the header onto the sequence.
– This is a common problem with separators & tokenizers.
– This can be handled with special tokens in the grammar, but RG provides a cleaner way.

First pass: Literal "tail" item

This works but is ugly:
– Have two rules, for the main list and the tail.
– Alias the tail to get them all in one place.

<rule: head>
    <[ident]>+ <[ident=final]>
    (?{
        # remove the matched anchors
        tr/\cA\n//d for @{ $MATCH{ident} };
    })

<token: ident>  .+? \cA
<token: final>  .+  \n

Breaking up the header

The last header item is aliased to "ident". This breaks up all of the entries:

head => {
    ident => [
        'gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]',
        'gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1',
        'gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]',
        'gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]'
    ]
}

Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91          numbers separated by commas, spaces
– g-c-a-g-t-t-a-c-a           characters separated by dashes
– /usr/local/bin              basenames separated by dir markers
– /usr:/usr/local/bin         dirs separated by colons

...that RG has special syntax for dealing with them: combining the item with "%" and a separator.

<rule: list>      <[item]>+ % <separator>    # one-or-more

<rule: list_zom>  <[item]>* % <separator>    # zero-or-more
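A tiny self-contained example of the separator form (the list contents here are made up):

use strict;
use warnings;

my $list_qr = do
{
    use Regexp::Grammars;

    qr{
        <nocontext:>
        <list>

        <rule: list>    <[item]>+ % ( [,\s]+ )
        <token: item>   \d+
    }xms;
};

'1, 2, 3, 4, 13, 91' =~ $list_qr
    and print join( '|', @{ $/{list}{item} } ), "\n";   # 1|2|3|4|13|91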

Cleaner nr.gz header rule

Separator syntax cleans things up:
– No more tail rule with an alias.
– No code block required to strip the separators and trailing newline.
– Non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>

    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;

Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested "ident" contents.

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <[fasta]>+

    <rule: head>
        <[ident]>+ % [\cA]
        (?{
            $MATCH = delete $MATCH{ident};
        })

    <token: ident>  .+?
}xm;

Result

fasta => [
    {
        body => 'MASTQNIVEEVQKMLDT...NPDQ',
        head => [
            'gi|66816243|ref|XP_6...rt=CAF-1',
            'gi|793761|dbj|BAA0626...oideum]',
            'gi|60470106|gb|EAL68086...m discoideum AX4]'
        ]
    }
]

The fasta content is broken into the usual "body" plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.

One more level of structure: idents

Species have <source> | <identifier> pairs followed by a description.

Add a separator clause: ( \s* [|] \s* )
– This can be parsed into a hash, something like:

gi|66816243|ref|XP_642131.1|hypothetical

becomes

gi   => 66816243,
ref  => 'XP_642131.1',
desc => 'hypothetical'

Munging the separated input

<fasta>
(?{
    my $identz = delete $MATCH{fasta}{head}{ident};

    for( @$identz )
    {
        my $pairz = $_->{taxa};
        my $desc  = pop @$pairz;

        $_ = { @$pairz, desc => $desc };
    }

    $MATCH{fasta}{head} = $identz;
})

<rule: head>    <[ident]>+ % [\cA]
<token: ident>  <[taxa]>+ % ( \s* [|] \s* )
<token: taxa>   .+?

Result: head with sources, "desc"

fasta => {
    body => 'MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN...',
    head => [
        {
            desc => '30S ribosomal protein S18 [Lactococ...',
            gi   => 15674171,
            ref  => 'NP_268346.1'
        },
        {
            desc => '30S ribosomal protein S18 [Lactoco...',
            gi   => 116513137,
            ref  => 'YP_812044.1'
        },
        ...

Balancing RG with calling code

The regex engine could process all of nr.gz:
– Catch: <[fasta]>+ returns about 250_000 keys and literally millions of total identifiers in the heads.
– Better approach: <fasta> on single entries, but chunking input on ">" removes it as a leading character.
– Making it optional with <start>? fixes the problem:

local $/ = '>';

while( my $chunk = readline )
{
    chomp $chunk;
    length $chunk or do { --$.; next };

    $chunk =~ $nr_gz;

    # process single fasta record in %/
}
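Putting that loop together with the grammar gives a driver along these lines (the file name, the gzip pipe, and the $nr_qr variable name are assumptions; the grammar body is the one sketched above with <start> made optional):

use strict;
use warnings;

my $nr_qr = do
{
    use Regexp::Grammars;

    qr
    {
        <nocontext:>
        <fasta>

        <rule: fasta>    <start>? <head> <ws> <[body]>+

        <rule: head>     .+ <ws>
        <rule: body>     ( <[seq]> | <comment> ) <ws>

        <token: start>   ^ [>]
        <token: comment> ^ [;] .+
        <token: seq>     ^ [\n\w\-]+
    }xm;
};

# decompress on the fly; the "nr.gz" path is assumed
open my $fh, '-|', 'gzip -dc nr.gz'
    or die "gzip -dc nr.gz: $!";

local $/ = '>';                      # chunk the stream on record leaders

while( my $chunk = readline $fh )
{
    chomp $chunk;                    # strip the trailing '>'
    length $chunk or next;           # the first chunk is empty

    $chunk =~ $nr_qr or next;

    # process a single fasta record in %/
    my $head = $/{fasta}{head};
    # ...
}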

Fasta base grammar: 3 lines of code

qr
{
    <grammar: ParseFasta>
    <nocontext:>

    <rule: fasta>
        <start> <head> <ws> <[body]>+
        (?{
            $MATCH{body} = $MATCH{body}[0];
        })

    <rule: head>    .+ <ws>

    <rule: body>
        ( <[seq]> | <comment> ) <ws>
        (?{
            $MATCH = join '' => @{ delete $MATCH{seq} };
            $MATCH =~ tr/\n//d;
        })

    <token: start>    ^ [>]
    <token: comment>  ^ [;] .+
    <token: seq>      ^ ( [\n\w\-]+ )
}xm;

Extension to Fasta: 6 lines of code

qr
{
    <nocontext:>
    <extends: ParseFasta>

    <fasta>
    (?{
        my $identz = delete $MATCH{fasta}{head}{ident};

        for( @$identz )
        {
            my $pairz = $_->{taxa};
            my $desc  = pop @$pairz;

            $_ = { @$pairz, desc => $desc };
        }

        $MATCH{fasta}{head} = $identz;
    })

    <rule: head>    <[ident]>+ % [\cA]
    <rule: ident>   <[taxa]>+ % ( \s* [|] \s* )
    <token: taxa>   .+?
}xm;

Result: Use grammars

Most of the "real" work is done under the hood:
– Regexp::Grammars does the lexing and basic compilation.
– Code is only needed for cleanups or re-arranging structs.

Code can simplify your grammar:
– Too much code makes them hard to maintain.
– The trick is keeping the balance between simplicity in the grammar and cleanup in the code.

Either way, the result is going to be more maintainable than hardwiring the grammar into code.

Aside: KwikFix for Perl v5.18

v5.17 changed how the regex engine handles inline code: code that used to be eval-ed in the regex is now compiled up front.
– This requires "use re 'eval'" and "no strict 'vars'".
– One for the Perl code, the other for $MATCH and friends.

The immediate fix for this is in the last few lines of RG::import, which push the pragmas into the caller.

Look up $^H in perlvars to see how it works.

require re;     re->import( qw( eval ) );
require strict; strict->unimport( qw( vars ) );
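The equivalent caller-side workaround, since the patch above just pushes these pragmas into the caller anyway, is to enable them yourself in the scope where the grammar is compiled. A small sketch (the grammar and input here are invented; the pragmas are the standard re and strict interfaces, nothing specific to Regexp::Grammars):

use strict;
use warnings;

my $parser = do
{
    use Regexp::Grammars;
    use re 'eval';       # let the compiled regex run its (?{...}) blocks
    no strict 'vars';    # $MATCH / %MATCH inside those blocks are package vars

    qr{
        <nocontext:>
        <data>

        <rule: data>
            <[word]>+ % (\s+)
            (?{ $MATCH{count} = scalar @{ $MATCH{word} }; })

        <token: word>  \w+
    }xms;
};

'hello world' =~ $parser
    and print $/{data}{count}, " words\n";    # "2 words"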

Use Regexp::Grammars

Unless you have old YACC BNF grammars to convert, the newer facility for defining the grammars is cleaner.
– Frankly, even if you do have old grammars...

Regexp::Grammars avoids the performance pitfalls of PRD.
– It is worth taking time to learn how to optimize non-deterministic regexen, however.

Or, better yet, use Perl6 grammars, available today at your local copy of Rakudo Perl6.

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. Perl Review article by brian d foy on recursive matching in Perl 5.10.

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 33: Perly Parsing with Regexp::Grammars

prefix alternations look ugly

Using a count works

[]3 | [gt]3 | []3 | [=]3but isnt all that much more readable

Given the way these are used use a block

[gt=] 3

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 34: Perly Parsing with Regexp::Grammars

qr ltnocontextgt

ltdatagt ltrule data gt lt[entry]gt+

ltrule entry gt ltws[s]gt ltref_idgt ltprefixgt lttextgt

lttoken ref_id gt ^(d+) lttoken prefix gt [gt=]3 lttoken text gt + xm

This is the skeleton parser

Doesnt take muchndash Declarative syntax

ndash No Perl code at all

Easy to modify by extending the definition of ldquotextrdquo for specific types of messages

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 35: Perly Parsing with Regexp::Grammars

Finishing the parser

Given the different line types it will be useful to extract commands switches outcomes from appropriate linesndash Sub-rules can be defined for the different line types

ltrule commandgt ldquoemergerdquo ltwsgtlt[switch]gt+

lttoken switchgt ([-][-]S+) This is what makes the grammars useful nested context-sensitive

content

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 36: Perly Parsing with Regexp::Grammars

Inheriting amp Extending Grammars

ltgrammar namegt and ltextends namegt allow a building-block approach

Code can assemble the contents of for a qr without having to eval or deal with messy quote strings

This makes modular or context-sensitive grammars relatively simple to composendash References can cross package or module boundaries

ndash Easy to define a basic grammar in one place and reference or extend it from multiple other parsers

The Non-Redundant File

NCBIs ldquonrgzrdquo file is a list if sequences and all of the places they are known to appear

It is moderately large 140+GB uncompressed The file consists of a simple FASTA format with heading separated

by ctrl-A chars

gtHeading 1

[amino-acid sequence characters]

gtHeading 2

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on Regexp::Grammars

The POD is thorough and quite descriptive [comfortable chair, enjoyable beverage suggested].

The demo directory has a number of working – if un-annotated – examples.

"perldoc perlre" shows how recursive matching works in v5.10+. PerlMonks has plenty of good postings. See also the Perl Review article by brian d foy on recursive matching in Perl 5.10.

Page 38: Perly Parsing with Regexp::Grammars

Example A short nrgz FASTA entry

Headings are grouped by species separated by ctrl-A (ldquocArdquo) charactersndash Each species has a set of sources amp identifier pairs followed by a single

description

ndash Within-species separator is a pipe (ldquo|rdquo) with optional whitespace

ndash Species counts in some header run into the thousands

gtgi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKVQKLLNPDQ

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 39: Perly Parsing with Regexp::Grammars

First step Parse FASTA

qr ltgrammar ParseFastagt ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+

ltrule head gt + ltwsgt ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt

lttoken start gt ^ [gt] lttoken comment gt ^ [] + lttoken seq gt ^ [nw-]+xm

Instead of defining an entry rule this just defines a name ldquoParseFastardquo ndash This cannot be used to generate results by itself

ndash Accessible anywhere via RexepGrammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 40: Perly Parsing with Regexp::Grammars

The output needs help however

The ldquoltseqgtrdquo token captures newlines that need to be stripped out to get a single string

Munging these requires adding code to the parser using Perls regex code-block syntax ()ndash Allows inserting almost-arbitrary code into the regex

ndash ldquoalmostrdquo because the code cannot include regexen

seq =gt [ MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYDKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPVQKLLNPDQ]

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 41: Perly Parsing with Regexp::Grammars

Munging results $MATCH

The $MATCH and MATCH can be assigned to alter the results from the current or lower levels of the parse

In this case I take the ldquoseqrdquo match contents out of join them with nothing and use ldquotrrdquo to strip the newlinesndash join + split wont work because split uses a regex

ltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt ( $MATCH = join =gt delete $MATCH seq $MATCH =~ trnd )

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Breaking up the header

The last header item is aliased to ldquoidentrdquo Breaks up all of the entries

head =gt ident =gt [ gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1 gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum] gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

Dealing with separators ltsepgt

Separators happen often enoughndash 1 2 3 4 13 91 numbers by commas spaces

ndash g-c-a-g-t-t-a-c-a characters by dashes

ndash usrlocalbin basenames by dir markers

ndash usrusrlocalbin dirs separated by colons

that RG has special syntax for dealing with them Combining the item with and a seprator

ltrule listgt lt[item]gt+ ltseparatorgt one-or-more

ltrule list_zomgt lt[item]gt ltseparatorgt zero-or-more

Cleaner nrgz header rule Separator syntax cleans things up

ndash No more tail rule with an alias

ndash No code block required to strip the separators and trailing newline

ndash Non-greedy match ldquo+rdquo avoids capturing separators

qr ltnocontextgt

ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt + xm

Nested ldquoidentrdquo tag is extraneous

Simpler to replace the ldquoheadrdquo with a list of identifiers Replace $MATCH from the ldquoheadrdquo rule with the nested identifier

contents

qr ltnocontextgt ltextends ParseFastagt

lt[fasta]gt+

ltrule head gt lt[ident]gt+ [cA] ( $MATCH = delete $MATCH ident )

lttoken ident gt + xm

Result

fasta =gt [ body =gt MASTQNIVEEVQKMLDTNPDQ head =gt [ gi|66816243|ref|XP_6rt=CAF-1 gi|793761|dbj|BAA0626oideum] gi|60470106|gb|EAL68086m discoideum AX4] ] ]

The fasta content is broken into the usual ldquobodyrdquo plus a ldquoheadrdquo broken down on cA boundaries

Not bad for a dozen lines of grammar with a few lines of code

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt

while( my $chunk = readline )

chomplength $chunk or do --$ next

$chunk =~ $nr_gz

process single fasta record in

Fasta base grammar 3 lines of codeqr

ltgrammar ParseFastagt

ltnocontextgt

ltrule fasta gt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

ltrule head gt + ltwsgtltrule body gt ( lt[seq]gt | ltcommentgt ) ltwsgt(

$MATCH = join =gt delete $MATCH seq $MATCH =~ trnd

)

lttoken start gt ^ [gt]lttoken comment gt ^ [] + lttoken seq gt ^ ( [nw-]+ )

xm

Extension to Fasta 6 lines of codeqr

ltnocontextgtltextends ParseFastagtltfastagt(

my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz)

ltrule head gt lt[ident]gt+ [cA]ltrule ident gt lt[taxa]gt+ ( s [|] s )lttoken taxa gt +

xm

Result Use grammars

Most of the ldquorealrdquo work is done under the hoodndash RegexpGrammars does the lexing basic compilation

ndash Code only needed for cleanups or re-arranging structs

Code can simplify your grammarndash Too much code makes them hard to maintain

ndash Trick is keeping the balance between simplicity in the grammar and cleanup in the code

Either way the result is going to be more maintainable than hardwiring the grammar into code

Aside KwikFix for Perl v518

v517 changed how the regex engine handles inline code Code that used to be eval-ed in the regex is now compiled up front

ndash This requires ldquouse re evalrdquo and ldquono strict varsrdquo

ndash One for the Perl code the other for $MATCH and friends

The immediate fix for this is in the last few lines of RGimport which push the pragmas into the caller

Look up $^H in perlvars to see how it works

require re re-gtimport( eval )require strict strict-gtunimport( vars )

Use RegexpGrammars

Unless you have old YACC BNF grammars to convert the newer facility for defining the grammars is cleanerndash Frankly even if you do have old grammars

RegexpGrammars avoids the performance pitfalls of PRDndash It is worth taking time to learn how to optimize NDF regexen however

Or better yet use Perl6 grammars available today at your local copy of Rakudo Perl6

More info on RegexpGrammars

The POD is thorough and quite descriptive [comfortable chair enjoyable beverage suggested]

The demo directory has a number of working ndash if un-annotated ndash examples

ldquoperldoc perlrerdquo shows how recursive matching in v510+ PerlMonks has plenty of good postings Perl Review article by brian d foy on recursive matching in Perl

510

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
  • Slide 27
  • Slide 28
  • Slide 29
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Slide 34
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • Slide 51
  • Slide 52
  • Slide 53
  • Slide 54
  • Slide 55
  • Slide 56
  • Slide 57
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
Page 42: Perly Parsing with Regexp::Grammars

One more step Remove the arrayref

Now the body is a single string

No need for an arrayref to contain one string Since the body has one entry assign offset zero

body =gt [MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDTKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ]

ltrule fastagt ltstartgt ltheadgt ltwsgt lt[body]gt+(

$MATCH body = $MATCH body [0])

Result a generic FASTA parser

fasta =gt [ body =gt MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ head =gt gi|66816243|ref|XP_6421311| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P546701|CAF1_DICDI RecName Full=Calfumirin-1 Short=CAF-1gi|793761|dbj|BAA062661| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL680861| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] ]

The head and body are easily accessible Next parse the nr-specific header

Deriving a grammar

Existing grammars are ldquoextendedrdquo The derived grammars are capable of producing results In this case

References the grammar and extracts a list of fasta entries

ltextends ParseFastagt

lt[fasta]gt+

Splitting the head into identifiers

Overloading fastas ldquoheadrdquo rule handles allows splitting identifiers for individual species

Catch cA is separator not a terminatorndash The tail item on the list doest have a cA to anchor on

ndash Using ldquo+[cAn] walks off the header onto the sequence

ndash This is a common problem with separators amp tokenizers

ndash This can be handled with special tokens in the grammar but RG provides a cleaner way

First pass Literal ldquotailrdquo item

This works but is uglyndash Have two rules for the main list and tail

ndash Alias the tail to get them all in one place

ltrule headgt lt[ident]gt+ lt[ident=final]gt (

remove the matched anchors

trcAnd for $MATCH ident )

lttoken ident gt + cAlttoken final gt + n

Page 43: Perly Parsing with Regexp::Grammars

Result: a generic FASTA parser

fasta => [
    {
        body => "MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEYKEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQKVQKLLNPDQ",
        head => "gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
    },
]

The head and body are easily accessible. Next: parse the nr-specific header.
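
With the results in %/, the pieces come out directly. A minimal sketch (not from the slides), assuming the text was just matched with the <[fasta]>+ grammar shown later:

for my $entry ( @{ $/{ fasta } } )
{
    my $head = $entry->{ head };
    my $body = $entry->{ body };

    print "head: $head\n";
    print "body: ", length $body, " residues\n";
}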

Page 44: Perly Parsing with Regexp::Grammars

Deriving a grammar

Existing grammars are "extended". The derived grammars are capable of producing results. In this case:

References the grammar and extracts a list of fasta entries.

<extends: Parse::Fasta>

<[fasta]>+
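
A minimal sketch of the full derived regex (the surrounding qr{} is assumed, not shown on this slide); $nr_gz is the name used for it later on:

use Regexp::Grammars;

# assumes Parse::Fasta was compiled earlier with <grammar: Parse::Fasta>
my $nr_gz = qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <[fasta]>+
}xm;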

Page 45: Perly Parsing with Regexp::Grammars

Splitting the head into identifiers

Overloading fasta's "head" rule handles splitting identifiers for individual species.

Catch: \cA is a separator, not a terminator.
– The tail item on the list doesn't have a \cA to anchor on.

– Using ".+? [\cA\n]" walks off the header onto the sequence.

– This is a common problem with separators & tokenizers.

– This can be handled with special tokens in the grammar, but R::G provides a cleaner way.
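
The problem is easy to see outside the grammar. A sketch with hypothetical data (not from the slides): anchoring each identifier on a trailing \cA silently drops the final one.

my $head = "gi|123|ref|A_1| first\cAgi|456|ref|B_2| second\n";

# anchored on the separator: the last entry has no \cA, so it never matches
my @ident = $head =~ m{ (.+?) \cA }xg;

print scalar @ident, "\n";    # prints 1 -- "second" was never captured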

Page 46: Perly Parsing with Regexp::Grammars

First pass: Literal "tail" item

This works, but is ugly:
– Have two rules, for the main list and the tail.

– Alias the tail to get them all in one place.

<rule: head>    <[ident]>+ <[ident=final]>
(?{
    # remove the matched anchors

    tr/\cA\n//d for @{ $MATCH{ ident } };
})

<token: ident>  .+? \cA
<token: final>  .+? \n

Page 47: Perly Parsing with Regexp::Grammars

Breaking up the header

The last header item is aliased to "ident". Breaks up all of the entries:

head => {
    ident => [
        "gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
        "gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1",
        "gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum]",
        "gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]",
    ],
}

Page 48: Perly Parsing with Regexp::Grammars

Dealing with separators: <sep>

Separators happen often enough:
– 1, 2, 3, 4, 13, 91      numbers by commas, spaces

– g-c-a-g-t-t-a-c-a       characters by dashes

– /usr/local/bin          basenames by dir markers

– /usr:/usr/local/bin     dirs separated by colons

that R::G has special syntax for dealing with them: combining the item with "%" and a separator.

<rule: list>      <[item]>+ % <separator>    # one-or-more

<rule: list_zom>  <[item]>* % <separator>    # zero-or-more
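
A minimal, self-contained sketch of the separated-list syntax (the grammar name and sample data here are assumptions, not from the slides):

use Regexp::Grammars;

my $numberz = qr
{
    <nocontext:>
    <list>

    <rule: list>        <[item]>+ % <separator>
    <token: item>       \d+
    <token: separator>  \s* , \s*
}xm;

'1, 2, 3, 4, 13, 91' =~ $numberz
    and print join ' ' => @{ $/{ list }{ item } };    # 1 2 3 4 13 91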

Page 49: Perly Parsing with Regexp::Grammars

Cleaner nr.gz header rule: Separator syntax cleans things up.

– No more tail rule with an alias.

– No code block required to strip the separators and trailing newline.

– Non-greedy match ".+?" avoids capturing separators.

qr
{
    <nocontext:>

    <extends: Parse::Fasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    <token: ident>  .+?
}xm;

Page 50: Perly Parsing with Regexp::Grammars

Nested "ident" tag is extraneous

Simpler to replace the "head" with a list of identifiers: replace $MATCH from the "head" rule with the nested identifier contents.

qr
{
    <nocontext:>
    <extends: Parse::Fasta>

    <[fasta]>+

    <rule: head>    <[ident]>+ % [\cA]
    (?{
        $MATCH = delete $MATCH{ ident };
    })

    <token: ident>  .+?
}xm;

Page 51: Perly Parsing with Regexp::Grammars

Result

fasta => [
    body => "MASTQNIVEEVQKMLDT...NPDQ",
    head => [
        "gi|66816243|ref|XP_6...rt=CAF-1",
        "gi|793761|dbj|BAA0626...oideum]",
        "gi|60470106|gb|EAL68086...m discoideum AX4]",
    ],
]

The fasta content is broken into the usual "body", plus a "head" broken down on \cA boundaries.

Not bad for a dozen lines of grammar with a few lines of code.

One more level of structure idents

Species have ltsource gt | ltidentifiergt pairs followed by a description

Add a separator clause ldquo (s|s)rdquo

ndash This can be parsed into a hash something like

gi|66816243|ref|XP_6421311|hypothetical

Becomes

gi =gt 66816243 ref =gt XP_6421311 desc =gt hypothetical

Munging the separated input

ltfastagt ( my $identz = delete $MATCH fasta head ident

for( $identz )

my $pairz = $_-gt taxa my $desc = pop $pairz$_ = $pairz desc =gt $desc

$MATCH fasta head = $identz )

ltrule head gt lt[ident]gt+ [cA] lttoken ident gt lt[taxa]gt+ ( s [|] s ) lttoken taxa gt +

Result head with sources ldquodescrdquo

fasta =gt body =gt MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKREDQN head =gt [ desc =gt 30S ribosomal protein S18 [Lactococ gi =gt 15674171 ref =gt NP_2683461 desc =gt 30S ribosomal protein S18 [Lactoco gi =gt 116513137 ref =gt YP_8120441

Balancing RG with calling code

The regex engine could process all of nrgzndash Catch lt[fasta]gt+ returns about 250_000 keys and literally millions of total identifiers in

the heads

ndash Better approach ltfastagt on single entries but chunking input on gt removes it as a leading charactor

ndash Making it optional with ltstartgt fixes the problemlocal $ = gt
