Perl/XML::DOM - reading and writing XML from Perl
Dr. Andrew C.R. Martin
[email protected]
http://www.bioinf.org.uk/
Aims and objectives
  Refresh the structure of an XML (or XHTML) document
  Know the problems in reading and writing XML
  Understand the requirements of XML parsers and the two main types
  Know how to write code using the DOM parser
  PRACTICAL: write a script to read XML
An XML refresher!
<mutants>
  <mutant_group native='1abc01'>
    <structure>
      <method>x-ray</method>
      <resolution>1.8</resolution>
      <rfactor>0.20</rfactor>
    </structure>
    <mutant domid='2bcd01'>
      <structure>
        <method>x-ray</method>
        <resolution>1.8</resolution>
        <rfactor>0.20</rfactor>
      </structure>
      <mutation res='L24' native='ALA' subs='ARG' />
    </mutant>
  </mutant_group>
</mutants>
Tags: paired opening and closing tags
  May contain data and/or other (nested) tags
  Un-paired tags use special syntax
Attributes: optional, contained within the opening tag
Writing XML
Writing XML is straightforward
  Generate XML from a Perl script using print() statements.
However, you must make sure that:
  tags are correctly nested
  quote marks are correctly paired
  international character sets are handled
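
A minimal sketch of the print() approach, using only core Perl. The helper name xml_escape is hypothetical (it is not part of any module), and the sketch does not deal with character encodings; it only shows escaping special characters and pairing the quote marks around attribute values:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the five characters that are special in XML
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first, or the entities below get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

my %mutation = (res => 'L24', native => 'ALA', subs => 'ARG');

# Build the un-paired <mutation /> tag, escaping every attribute value
my $tag = sprintf("<mutation res='%s' native='%s' subs='%s' />",
                  map { xml_escape($mutation{$_}) } qw(res native subs));

print "$tag\n";
```

Nesting is still the script's responsibility: every tag that is printed open must be printed closed in the reverse order.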
Reading XML
As simple or complex as you wish!
If you have full control over the XML:
  a simple pattern may suffice
Otherwise, this may be dangerous
  may rely on un-guaranteed formatting
<mutants><mutant_group native='1abc01'><structure><method>x-ray</method><resolution>1.8</resolution><rfactor>0.20</rfactor></structure><mutant domid='2bcd01'><structure><method>x-ray</method><resolution>1.8</resolution><rfactor>0.20</rfactor></structure><mutation res='L24' native='ALA' subs='ARG'/></mutant></mutant_group></mutants>
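
To illustrate the danger, here is a small sketch in core Perl. The subroutine resolutions_by_line is a hypothetical, deliberately fragile reader that assumes each element sits alone on its own line. It finds the value in the indented form and misses it when the same XML arrives on a single line:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical fragile reader: expects one element per line
sub resolutions_by_line {
    my ($xml) = @_;
    my @found;
    foreach my $line (split /\n/, $xml) {
        # Anchored pattern: only matches if the element is alone on its line
        if ($line =~ /^\s*<resolution>([^<]*)<\/resolution>\s*$/) {
            push @found, $1;
        }
    }
    return @found;
}

# The same data, formatted in two equally legal ways
my $pretty  = "<structure>\n  <method>x-ray</method>\n  <resolution>1.8</resolution>\n</structure>\n";
my $compact = "<structure><method>x-ray</method><resolution>1.8</resolution></structure>";

my @from_pretty  = resolutions_by_line($pretty);
my @from_compact = resolutions_by_line($compact);

print scalar(@from_pretty),  " match(es) in the indented form\n";
print scalar(@from_compact), " match(es) in the single-line form\n";
```

Both strings are the same XML document as far as a parser is concerned; only the pattern-based reader treats them differently.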
XML Parsers
Clear rules for data boundaries and hierarchy
Predictable; unambiguous
Parser translates XML into
  stream of events
  complex data object
XML Parsers
Good parser will handle:
  Different sources of data
    files
    character strings
    remote references
  different character encodings
    standard Latin
    Japanese
  checking for well-formedness errors
XML Parsers
Read stream of characters
  differentiate markup and data
Optionally replace entity references (e.g. &lt; with <)
Assemble complete document
  disparate (perhaps remote) sources
Report syntax and validation errors
Pass data to client program
XML Parsers
If XML has no syntax errors it is 'well formed'
With a DTD, a validating parser will check it matches: 'valid'
XML Parsers
Writing a good parser is a lot of work!
A lot of testing needed
Fortunately, many parsers available
Getting data to your program
Parser can generate 'events'
  Tags are converted into events
  Events triggered in your program as the document is read
Parser acts as a pipeline converting XML into processed chunks of data sent to your program: an 'event stream'
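
A sketch of what an event stream looks like from the program's side, assuming the XML::Parser module is installed. Its Start, End and Char handlers are called as the document is read; here each handler simply records the event in the (hypothetical) array @events:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my $xml = "<structure><method>x-ray</method><resolution>1.8</resolution></structure>";

my @events;    # the events, in the order the parser raised them

my $parser = XML::Parser->new(
    Handlers => {
        # Called for each opening tag
        Start => sub { my ($expat, $tag, %attr) = @_; push @events, "start $tag"; },
        # Called for each closing tag
        End   => sub { my ($expat, $tag) = @_; push @events, "end $tag"; },
        # Called for character data between tags (skip whitespace-only text)
        Char  => sub { my ($expat, $text) = @_; push @events, "text $text" if $text =~ /\S/; },
    },
);

$parser->parse($xml);

print "$_\n" foreach @events;
```

Note that the parser is free to deliver one run of text in more than one Char event, so real code normally accumulates text until the End event arrives.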
Getting data to your program
OR…
XML converted into a tree structure
Reflects organization of the XML
Whole document read into memory before your program gets access
Pros and cons
In the parser, everything is likely to be event-driven; tree-based parsers create a data structure from the event stream

Data structure
  More convenient
  Can access data in any order
  Code usually simpler
  May be impossible to handle very large files
  Need more processor time
  Need more memory

Event stream
  Faster to access limited data
  Use less memory
  Parser loses data at the next event
  More complex code
SAX and DOM
de facto standard APIs for XML parsing
SAX (Simple API for XML)
  event-stream API
  originally for Java, but now for several programming languages (including Perl)
  development promoted by Peter Murray-Rust, 1997-8
DOM (Document Object Model)
  W3C standard tree-based parser
  platform- and language-neutral
  allows update of document content and structure as well as reading
Perl XML parsers
Many parsers available
Differ in three major ways:
  parsing style (event driven or data structure)
  'standards-completeness'
  speed (implementation in C or pure Perl)
Perl XML parsers
XML::Simple
Very easy to use
Designed for simple applications
Can't handle 'mixed content': tags containing both data and other tags
  <p>This is <b>mixed</b> content</p>
Perl XML parsers
XML::Parser
  Oldest Perl XML parser
  Reasonably fast and flexible
  Not very standards-compliant
  Is a wrapper to 'expat'
    probably the first C XML parser, written by James Clark
XML::Parser
Simple example - check well-formedness

use XML::Parser;

my $xmlfile = shift @ARGV;    # the file to parse

# initialize parser object and parse the file
# ErrorContext => 2: remember 2 lines before an error
my $parser = XML::Parser->new( ErrorContext => 2 );

# eval used so that a parser error doesn't cause the program to exit
eval { $parser->parsefile( $xmlfile ); };

# report error or success; $@ stores the error
if( $@ ) {
    $@ =~ s/at \/.*?$//s;     # remove module line number
    print STDERR "\nERROR in '$xmlfile':\n$@\n";
} else {
    print STDERR "'$xmlfile' is well-formed\n";
}
Perl XML parsers
XML::DOM
  Implements W3C DOM Level 1
  Built on top of XML::Parser
  Very good: fast, stable and complete
  Limited extended functionality
XML::SAX
  Implements SAX2 wrapper to Expat
  Fast, stable and complete
Perl XML parsers
XML::LibXML
Wrapper around GNOME libxml2
Very fast, complete and stable
Validating/non-validating
DOM and SAX support
Perl XML parsers
XML::Twig
DOM-like parser, BUT
Allows you to define elements which can be parsed as discrete units: 'twigs' (small branches of a tree)
Perl XML parsers
Several others
More specialized, adding…
  XPath (to select data from XML document)
  re-formatting (XSLT or other methods)
  ...
XML::DOM
DOM is a standard API
  once learned, moving to a different language is straightforward
  moving between implementations is also easy
Suppose we want to extract some data from an XML file...
XML::DOM
<data>
  <species name='Felix domesticus'>
    <common-name>cat</common-name>
    <conservation status='not endangered' />
  </species>
  <species name='Drosophila melanogaster'>
    <common-name>fruit fly</common-name>
    <conservation status='not endangered' />
  </species>
</data>
We want:
  cat (Felix domesticus) not endangered
  fruit fly (Drosophila melanogaster) not endangered
#!/usr/bin/perl
use XML::DOM;

$file = shift @ARGV;
$parser = XML::DOM::Parser->new();
$doc = $parser->parsefile($file);

foreach $species ($doc->getElementsByTagName('species'))
{
    $common_name = $species->getElementsByTagName('common-name');
    $cname = $common_name->item(0)->getFirstChild->getNodeValue;

    $name = $species->getAttribute('name');

    $conservation = $species->getElementsByTagName('conservation');
    $status = $conservation->item(0)->getAttribute('status');

    print "$cname ($name) $status\n";
}
$doc->dispose();
Import XML::DOM module
Obtain filename
Initialize a parser
Parse the file
Returns each element matching 'species'
Look at each species element in turn
$species contains content and descendant elements
Can now treat $species in the same way as $doc
Nested loops correspond to nested tags
Returns each element matching 'common-name'
$common_name contains content and any descendant elements
Here there is only one <common-name> element per <species> element, but the parser can't know that. We have to specify that we want the first (and only) <common-name> element
The <common-name> element contains only one child, which is data. Obtain the first (and only) child node using ->getFirstChild
Extract the text from that node using ->getNodeValue
->getElementsByTagName returns an array when called in list context. Here we work through the array:

foreach $species ($doc->getElementsByTagName('species'))
{
}

Here (scalar context) we obtain an object and use the ->item() method to obtain the first item:

$common_name = $species->getElementsByTagName('common-name');
$cname = $common_name->item(0)->getFirstChild->getNodeValue;

Alternative is to access an individual array element:

@common_names = $species->getElementsByTagName('common-name');
$cname = $common_names[0]->getFirstChild->getNodeValue;
<species name='Felix domesticus'>
  <common-name>cat</common-name>
  <conservation status='not endangered' />
</species>
There could have been more than one <common-name> element within this <species> element
Obtain the actual text rather than an element object
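
If a <species> really did contain several <common-name> elements, the node list could be walked with ->getLength and ->item(). The sketch below assumes XML::DOM is installed; the second common name ('house cat') is invented purely for illustration and is not part of the example data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM;

# Illustrative data: the second <common-name> is a made-up addition
my $xml = <<'END_XML';
<species name='Felix domesticus'>
  <common-name>cat</common-name>
  <common-name>house cat</common-name>
</species>
END_XML

my $parser = XML::DOM::Parser->new();
my $doc    = $parser->parse($xml);     # parse a string rather than a file

my @cnames;
my $common_names = $doc->getElementsByTagName('common-name');   # scalar context: a node list
for (my $i = 0; $i < $common_names->getLength; $i++) {
    my $child = $common_names->item($i)->getFirstChild;
    next unless defined $child;        # skip an empty <common-name/> element
    push @cnames, $child->getNodeValue;
}

print join(', ', @cnames), "\n";
$doc->dispose();
```

Checking that ->getFirstChild returned something avoids calling a method on an undefined value when an element turns out to be empty.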
Attributes much simpler!
  Can only contain text (no nested elements)
  Simply specify the attribute
<species name='Felix domesticus'>
Attributes
<species name='Felix domesticus'>
  <common-name>cat</common-name>
  <conservation status='not endangered' />
</species>

$conservation = $species->getElementsByTagName('conservation');
$status = $conservation->item(0)->getAttribute('status');

Extract the <conservation> elements from this <species> element
The DOM parser can't know there is only one <conservation> element per species. Extract the first (and only) one.
This is an empty element: there are no child elements, so we don't need ->getFirstChild
It contains only a 'status' attribute; extract its value with ->getAttribute()
Print the extracted information
Loop back to next <species> element
Clean up and free memory
XML::DOM
Note
Not necessary to use variable names that match the tags, but it is a very good idea!
There are many many more functions, but this set covers most needs
Writing XML with XML::DOM
#!/usr/bin/perl
use XML::DOM;

$nspecies = 2;
@names = ('Felix domesticus', 'Drosophila melanogaster');
@commonNames = ('cat', 'fruit fly');
@consStatus = ('not endangered', 'not endangered');

$doc = XML::DOM::Document->new;
$xml_pi = $doc->createXMLDecl ('1.0');
print $xml_pi->toString;

$root = $doc->createElement('data');

for($i=0; $i<$nspecies; $i++)
{
    $species = $doc->createElement('species');
    $species->setAttribute('name', $names[$i]);
    $root->appendChild($species);

    $cname = $doc->createElement('common-name');
    $text = $doc->createTextNode($commonNames[$i]);
    $cname->appendChild($text);
    $species->appendChild($cname);

    $cons = $doc->createElement('conservation');
    $cons->setAttribute('status', $consStatus[$i]);
    $species->appendChild($cons);
}
print $root->toString;

Import XML::DOM
Initialize data
Create an XML document object
Utility method to print XML header
<?xml version="1.0" ?>
Create a 'root' element for our data
Loop through each of the species
Create a <species> element
Set its 'name' attribute
Join the <species> element to the parent <data> element
<?xml version="1.0" ?>
<data>
  <species name='Felix domesticus' />
</data>
Create a <common-name> element
Create a text node containing the specified text
Join this text node as a child of the <common-name> element
Join the <common-name> element as a child of <species>
<?xml version="1.0" ?>
<data>
  <species name='Felix domesticus'>
    <common-name>cat</common-name>
  </species>
</data>
Create a <conservation> element
Set the 'status' attribute
Join the <conservation> element as a child of <species>
<?xml version="1.0" ?>
<data>
  <species name='Felix domesticus'>
    <common-name>cat</common-name>
    <conservation status='not endangered' />
  </species>
</data>
Loop back to handle the next species
Finally print the resulting data structure
<?xml version="1.0" ?>
<data>
  <species name='Felix domesticus'>
    <common-name>cat</common-name>
    <conservation status='not endangered' />
  </species>
  <species name='Drosophila melanogaster'>
    <common-name>fruit fly</common-name>
    <conservation status='not endangered' />
  </species>
</data>
Summary - reading XML
Create a parser
    $parser = XML::DOM::Parser->new();
Parse a file
    $doc = $parser->parsefile('filename');
Extract all elements matching tag-name
    $element_set = $doc->getElementsByTagName('tag-name');
Extract first element of a set
    $element = $element_set->item(0);
Extract first child of an element
    $child_element = $element->getFirstChild;
Extract text from that child (a text node)
    $text = $child_element->getNodeValue;
Get the value of a tag's attribute
    $text = $element->getAttribute('attribute-name');
Summary - writing XML
Create an XML document structure
    $doc = XML::DOM::Document->new;
Utility to create an XML header
    $header = $doc->createXMLDecl('1.0');
Create a tagged element
    $element = $doc->createElement('tag-name');
Set an attribute for an element
    $element->setAttribute('attrib-name', 'value');
Append a child element to an element
    $parent_element->appendChild($child_element);
Create a text node element
    $element = $doc->createTextNode('text');
Print a document structure as a string
    print $root_element->toString;
Summary
Two types of parser
  Event-driven
  Data structure
Writing a good parser is difficult!
Many parsers available
XML::DOM for reading and writing data