Processing XML: a rewriting system approach
Yet another method to parse XML: rewrite it!
Alberto Simões
Portuguese Perl Workshop – 2010
Alberto Simões Processing XML: a rewriting system approach

Motivation and Goals
- XML is usually generated from structured information: databases, spreadsheets, forms, etc.
- But it can also be generated from unstructured (or poorly structured) data: textual documents, domain-specific languages.
- The question arises: how to produce XML documents from textual documents?
  - write a parser (natural language, domain specific, etc.);
  - produce XML by rewriting the textual document!

How does textual rewriting work?
Write rewriting rules, where
    rule ≅ pattern × restriction × action
- pattern: a regular (or irregular) expression that should be textually matched;
- restriction: conditional code that checks whether the rule should be applied;
- action: a piece of code (or simply a string) that produces the text that replaces the originally matched text.
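
The triple above can be sketched in plain Perl. This is only an illustration of the idea, not how any particular tool stores its rules: the hash keys (`pattern`, `restriction`, `action`) and the helper `apply_rule` are names assumed for this example.

```perl
use strict;
use warnings;

# One rule as a (pattern, restriction, action) triple.
# The field names are illustrative, not part of any library.
my %rule = (
    pattern     => qr/(\d+)/,                     # text to match
    restriction => sub { $_[0] % 2 == 0 },        # apply only to even numbers
    action      => sub { "<even>$_[0]</even>" },  # replacement text
);

# Apply the rule to every match; when the restriction rejects a
# match, the matched text ($&) is put back unchanged.
sub apply_rule {
    my ($rule, $text) = @_;
    $text =~ s{$rule->{pattern}}{
        $rule->{restriction}->($1) ? $rule->{action}->($1) : $&
    }ge;
    return $text;
}

print apply_rule(\%rule, "7 and 12 and 40"), "\n";
```

Here the restriction lets the rule fire on 12 and 40 but not on 7, which stays as it was.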

Are there text rewriting tools?
For this work we used Text::RewriteRules:
- written in Perl:
  - the power of the Perl regular expression engine;
  - a reflexive language (code can be generated on the fly);
- supports different rewriting approaches:
  - fixed-point rewriting;
  - sliding-cursor rewriting;
  - lexical analyzer;
- home-developed.

Fixed-point rewriting approach
Algorithm:
- easy to understand;
- a sequence of rules that are applied in order;
- the first rule is applied, and each following rule is applied only if no previous rule can be applied;
- a rule may change the document in a way that makes a previous rule applicable again;
- the process ends when no rule can be applied (or when a specific rule forces the system to end).
Code example: anonymization of emails
    RULES anonymize
    \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
    ENDRULES
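
The fixed-point algorithm can be sketched in core Perl, without Text::RewriteRules. The rule list and the function name `fixed_point` are assumptions made for this example, and the second rule (collapsing repeated spaces) is added only to show more than one rule at work.

```perl
use strict;
use warnings;

# Ordered list of [pattern, replacement] pairs (hypothetical rules).
my @rules = (
    [ qr/\w+(?:\.\w+)*\@\w+\.\w+(?:\.\w+)*/, '[[hidden email]]' ],
    [ qr/[ ]{2,}/,                           ' '                ],
);

# Fixed-point loop: after every rewrite, start again from the first
# rule; stop when a full pass over the rules changes nothing.
sub fixed_point {
    my ($text) = @_;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        for my $rule (@rules) {
            my ($pattern, $replacement) = @$rule;
            if ($text =~ s/$pattern/$replacement/) {
                $changed = 1;
                last;    # restart from the first rule
            }
        }
    }
    return $text;
}

print fixed_point("mail  john.doe\@example.com or  jane\@example.org"), "\n";
```

The loop only terminates if no rule keeps producing text that some rule matches again, so rule sets for this approach have to be written with that in mind.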

Sliding-cursor rewriting approach
Algorithm:
- the cursor is placed at the beginning of the string;
- patterns are matched only if they occur right after the cursor;
- if a rule is applied, the cursor is placed after the rewritten region;
- if no rule matches, the cursor moves ahead one character;
- the process ends when the cursor reaches the end of the string;
- text that was already rewritten is never rewritten again.
Code example: brute force translation
    RULES/m translate
    (\w+)=e=> $translation{$1} !! exists($translation{$1})
    ENDRULES
Example ("_" marks the cursor):
    _ latest train
    último _ train
    último combóio _
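
The cursor movement can be sketched in core Perl with the `\G` anchor and `pos`. This is an assumed re-implementation for illustration only; the function name `sliding_cursor` and the toy dictionary are not part of Text::RewriteRules.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Toy dictionary, for illustration only.
my %translation = (latest => 'último', train => 'combóio');

sub sliding_cursor {
    my ($text, $dict) = @_;
    my $out = '';
    pos($text) = 0;                       # cursor at the beginning
    while (pos($text) < length($text)) {
        my $start = pos($text);
        # \G anchors the pattern right at the cursor; /c keeps the
        # cursor in place when the match fails.
        if ($text =~ /\G(\w+)/gc && exists $dict->{$1}) {
            $out .= $dict->{$1};          # cursor now sits after the match
        }
        else {
            $out .= substr($text, $start, 1);
            pos($text) = $start + 1;      # no rule applied: move one character
        }
    }
    return $out;
}

print sliding_cursor('latest train', \%translation), "\n";
```

Replacements are appended to a separate output string, so they are never scanned again. Note that a word missing from the dictionary is retried from its second character, which follows the one-character slide described above.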

Valid Rewriting Rules
Different approaches have different possible rules, but the most relevant ones are:
- ==>  simple pattern substitution: the left hand side is a Perl regular expression and the right hand side is the string that replaces the match;
- =e=>  similar to the previous one, but the right hand side is Perl code to be evaluated; its result replaces the match;
- =begin=>  has no left hand side; the right hand side code is executed before the rewrite starts;
- =end=>  has no right hand side; when the left hand side pattern matches, the rewrite system quits.
Rules can include a restriction block (!!) to the right of the action.
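
As a rough analogy only, the first two rule types and the restriction block behave much like Perl's own substitution operator with and without the `/e` modifier. The snippet below uses plain Perl, not the rule syntax itself, and the sample text is made up.

```perl
use strict;
use warnings;

my $text = "price: 10 EUR, 25 EUR";

# Like ==> : the right hand side is a plain replacement string.
(my $plain = $text) =~ s/EUR/euros/g;

# Like =e=> : the right hand side is code; its value replaces the match.
(my $evaluated = $text) =~ s/(\d+)/$1 * 2/ge;

# Like !! : a condition decides whether the match is rewritten at all.
(my $restricted = $text) =~ s/(\d+)/$1 > 20 ? "[$1]" : $1/ge;

print "$plain\n$evaluated\n$restricted\n";
```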

Rewriting Text into XML
How to produce XML from weakly structured data?
- write a parser;
- or rewrite the data step by step into XML!
Two case studies:
- rewriting a dictionary in textual format into TEI;
- rewriting an XML DSL authoring tool into XML.

Rewriting Text into TEI
Rewrite this...
    *Cachimbo*,
    _m._
    Apparelho de fumador, composto d..
    Peça de ferro, em que entra o es..
    Buraco, em que se encaixa a vela..
    * _Bras. de Pernambuco._
    Bebida, preparada com aguardente..
    * _Pl. Gír._
    Pés.
    (Do químb. _quixima_)
...into this!
    <entry id="cachimbo">
    <form><orth>Cachimbo</orth></form>
    <sense>
    <gramGrp>m.</gramGrp>
    <def>
    Apparelho de fumador, composto d..
    Peça de ferro, em que entra o es..
    Buraco, em que se encaixa a vela..
    </def>
    </sense>
    <sense ast="1">
    <usg type="geo">Bras. de Pernamb..
    <def>
    Bebida, preparada com aguardente..
    </def>
    </sense>
    <sense ast="1">
    <gramGrp>Pl.</gra..
    <usg type="style">Gír.</usg>
    <def>
    Pés.
    </def>
    </sense>
    <etym ori="químb">(Do químb. _qu..
    </entry>

Rewriting Text into TEI
This rewrite was all based on:
- a few tables (grammatical and usage strings):
    entries genres: Gír Fam Pop Des Fig Vulg Ant Chul Euph
    entries domains: Agr Anat Anthrop Apicult Arith Artilh Archit
- rewriting the little existing mark-up into a better XML structure:
    ((\* )?_([^_]|_[^_]{1,5}_)+_( *)?)\n=e=>$a=$1;end_def.end_sense.start_sense.gramGrp($a)."\n".start_def
- rewriting the new XML structure to detect and annotate a more complex structure:
    <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".gramGrp($a)
- detecting and correcting wrong XML elements:
    </form></sense>==></form>
    </form></def>\n</sense>==></form>
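
To give a feel for one such step, here is a hypothetical first rule written as a plain Perl substitution. The slides do not show the rule that handles headwords, so the pattern and the function name `headword_to_entry` are assumptions made for this sketch, based only on the input and output shown for the "Cachimbo" entry.

```perl
use strict;
use warnings;

# Hypothetical first step: turn a headword line such as "*Cachimbo*,"
# into the opening of a TEI entry. The real rule is not in the slides.
sub headword_to_entry {
    my ($text) = @_;
    $text =~ s{^\*([^*\n]+)\*,[ \t]*$}{
        '<entry id="' . lc($1) . '">' . "\n" . "<form><orth>$1</orth></form>"
    }gme;
    return $text;
}

print headword_to_entry("*Cachimbo*,\n_m._\n");
```

Later rules would then work on the partially marked-up text, which is the step-by-step style the case study describes.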

Rewriting Text into TEI
Case study conclusions:
- flexible tool;
- works on big files:
  - the text file is 13 MB;
  - the output XML is 30 MB;
  - the process takes about nine minutes!
- we even rewrote XML into XML.
Hey!! XML is text!! How can we rewrite it!?

Rewriting XML
- different from the usual DOM- or SAX-oriented approaches;
- looks at XML as text, that is, as unstructured data;
- the rewrite can be done:
  - as in any other text rewriting system;
  - taking advantage of irregular expressions.
Irregular expressions? Are you kidding?

Not so regular expressions
Perl has a powerful regular expression engine:
- regular expressions can define capture zones: small pieces of the match that can be used later;
- regular expressions can define look-ahead or look-behind: checks on the context of the matching zone;
- since Perl 5.10, regular expressions can be recursive: a regular expression that depends on itself.
    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
For XML, we defined two classes:
- [[:XML:]] matches any well-formed XML fragment;
- [[:XML(tag):]] matches an XML fragment with a specific root element.
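
The recursive pattern above can be tried directly, and the same trick gives a rough idea of how a balanced XML element might be matched. The `$element` pattern below is a simplified sketch written for this example, not the definition behind `[[:XML:]]`: it does not check that closing tag names match, and it ignores self-closing tags, comments and CDATA.

```perl
use strict;
use warnings;

# The pattern from the slide: (?-1) re-enters the enclosing group.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

my ($group) = "f(a, (b + c) * (d)) + g(e)" =~ $parens;
print "$group\n";

# Simplified sketch of a balanced element, using named recursion.
my $element = qr{
    (?<element>
        < \w+ [^<>]* >                   # opening tag, attributes allowed
        (?: [^<]++ | (?&element) )*+     # text or nested elements
        </ \w+ >                         # closing tag, name not checked
    )
}x;

if ("see <tu><seg>one</seg><seg>two</seg></tu> here" =~ $element) {
    print "$+{element}\n";
}
```

The first match is the whole parenthesised argument list of `f`, and the second is the complete `<tu>` element including its nested `<seg>` children.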
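
Rewriting XML
As a simple example, we can remove duplicate translation units in a translation memory file.
Code example:
    use Text::RewriteRules;
    use XML::DT;                  # provides dtstring
    use Digest::MD5 qw(md5);

    my %visited;                  # digests of the translation units seen so far

    RULES/m duplicates
    ([[:XML(tu):]])==>!!duplicate($1)
    ENDRULES

    # True when an identical <tu> was already seen.
    sub duplicate {
        my $tu = shift;
        my $tumd5 = md5(dtstring($tu, -default => sub{$c}));
        return 1 if exists $visited{$tumd5};
        $visited{$tumd5}++;
        return 0;
    }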
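
For readers without Text::RewriteRules or XML::DT at hand, the same idea can be sketched with core modules only. This version is an assumption-laden simplification: it takes `<tu>` elements to be non-nested, so a non-greedy match is enough, and it compares them after collapsing whitespace instead of normalising them with `dtstring`. The function name `remove_duplicate_tus` is made up for this example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Remove repeated <tu> elements, keeping the first copy of each.
# Expects a byte string; md5_hex rejects wide characters.
sub remove_duplicate_tus {
    my ($tmx) = @_;
    my %visited;
    $tmx =~ s{(<tu\b[^>]*>.*?</tu>)}{
        my $tu = $1;                       # save before the next regex runs
        (my $key = $tu) =~ s/\s+/ /g;      # ignore differences in whitespace runs
        $visited{ md5_hex($key) }++ ? '' : $tu
    }gse;
    return $tmx;
}

my $tmx = "<body><tu><tuv>a</tuv></tu><tu><tuv>b</tuv></tu><tu><tuv>a</tuv></tu></body>";
print remove_duplicate_tus($tmx), "\n";
```

The `\b` after `tu` keeps the pattern from starting at a `<tuv>` element, and the third unit disappears because its digest was already recorded.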

Conclusions
The rewriting approach is:
- flexible;
- powerful;
- easy to learn;
- grows quickly;
- big systems can be difficult to maintain.
The Perl regular expression engine:
- makes it easy to match anything;
- almost supports full grammars;
- makes it possible to define block structures.
So, it can be applied to XML easily!

Thank You!
Alberto Simões