perl/xml::dom - reading and writing xml from perl dr. andrew c.r. martin [email protected]

Perl/XML::DOM - reading and writing XML from Perl

Dr. Andrew C.R. [email protected]://www.bioinf.org.uk/

Aims and objectivesRefresh the structure of a XML (or

XHTML) documentKnow problems in reading and

writing XMLUnderstand the requirements of

XML parsers and the two main types

Know how to write code using the DOM parser

PRACTICAL: write a script to read XML

An XML refresher!<mutants> <mutant_group native='1abc01'> <structure> <method>x-ray</method> <resolution>1.8</resolution> <rfactor>0.20</rfactor> </structure>

<mutant domid='2bcd01'> <structure> <method>x-ray</method> <resolution>1.8</resolution> <rfactor>0.20</rfactor> </structure> <mutation res='L24' native='ALA' subs='ARG' /> </mutant> </mutant_group></mutants>

Tags: paired opening and

closing tags

May contain data and/or

other (nested) tags

un-paired tags use

special syntax

Attributes: optional

– contained within

the opening tag

Writing XML

Writing XML is straightforward

Generate XML from a Perl script using print() statements.

However:tags correctly nested

quote marks correctly paired

international character sets

Reading XML

As simple or complex as you wish!

Full control over XML:simple pattern may suffice

Otherwise, may be dangerousmay rely on un-guaranteed formatting

<mutants><mutant_group native='1abc01'><structure><method>x-ray</method><resolution>1.8</resolution><rfactor>0.20</rfactor></structure><mutant domid='2bcd01'><structure><method>x-ray</method><resolution>1.8</resolution><rfactor>0.20</rfactor></structure><mutation res='L24’native='ALA’ subs='ARG'/></mutant></mutant_group></mutants>

XML ParsersClear rules for data boundaries

and hierarchy

Predictable; unambiguous

Parser translates XML into

stream of events

complex data object

XML Parsers

Different data sources of datafilescharacter stringsremote references

different character encodingsstandard LatinJapanese

checking for well-formedness errors

Good parser will handle:

XML ParsersRead stream of characters

differentiate markup and dataOptionally replace entity

references(e.g. < with <)

Assemble complete documentdisparate (perhaps remote) sources

Report syntax and validation errorsPass data to client program

XML ParsersIf XML has no syntax errors it is

'well formed'

With a DTD, a validating parser will check it matches:'valid'

XML ParsersWriting a good parser is a lot of

work!

A lot of testing needed

Fortunately, many parsers available

Getting data to your programParser can generate 'events'

Tags are converted into events

Events triggered in your program as the document is read

Parser acts as a pipeline converting XML into processed chunks of data sent to your program: an 'event stream'

Getting data to your program

OR…

XML converted into a tree structure

Reflects organization of the XML

Whole document read into memory before your program gets access

Pros and cons

In the parser, everything is likely to be event-driven tree-based parsers create a data structure from the

event stream

Data structure

More convenientCan access data in any

order Code usually simplerMay be impossible to

handle very large filesNeed more processor

timeNeed more memory

Event stream

Faster to access limited data

Use less memoryParser loses data at the

next eventMore complex code

SAX and DOMde facto standard APIs for XML

parsingSAX (Simple API for XML)

event-stream APIoriginally for Java, but now for several

programming languages (including Perl)development promoted by Peter Murray Rust,

1997-8

DOM (Document Object Model) W3C standard tree-based parserplatform- and language-neutralallows update of document content and structure

as well as reading

Perl XML parsers

Many parsers available

Differ in three major ways: parsing style (event driven or data

structure)

'standards-completeness’

speed (implementation in C or pure Perl)

Perl XML parsers

XML::Simple

Very easy to use

Designed for simple applications

Can’t handle 'mixed content' tags containing both data and other

tags

<p>This is <b>mixed</b> content</p>

Perl XML parsers

XML::ParserOldest Perl XML parser

Reasonably fast and flexibile

Not very standards-compliant.

Is a wrapper to 'expat’probably the first C XML parser written

by James Clark

XML::Parser

use XML::Parser;

my $xmlfile = shift @ARGV; # the file to parse

# initialize parser object and parse the stringmy $parser = XML::Parser->new( ErrorContext => 2 );eval { $parser->parsefile( $xmlfile ); };

# report error or successif( $@ ) { $@ =~ s/at \/.*?$//s; # remove module line number print STDERR "\nERROR in '$xmlfile':\n$@\n";} else { print STDERR "'$xmlfile' is well-formed\n";}

Simple example - check well-formedness

Eval used so parser doesn’t cause program to exit

Remember 2 lines before an error $@ stores

the error

Perl XML parsers

XML::DOM Implements W3C DOM Level 1Built on top of XML::ParserVery good fast, stable and

complete Limited extended functionality

XML::SAX Implements SAX2 wrapper to ExpatFast, stable and complete

Perl XML parsers

XML::LibXML

Wrapper around GNOME libxml2

Very fast, complete and stable

Validating/non-validating

DOM and SAX support

Perl XML parsers

XML::Twig

DOM-like parser, BUT

Allows you to define elements which can be parsed as discrete units'twigs' (small branches of a tree)

Perl XML parsers

Several others

More specialized, adding… XPath (to select data from XML

document)

re-formatting (XSLT or other methods)

...

XML::DOMDOM is a standard API

once learned moving to a different language is straightforward

moving between implementations also easy

Suppose we want to extract some data from an XML file...

XML::DOM<data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> <species name='Drosophila melanogaster'> <common-name>fruit fly</common-name> <conservation status='not endangered' /> </species></data>

We want:cat (Felix domesticus) not endangeredfruit fly (Drosophila melanogaster) not endangered

#!/usr/bin/perl

use XML::DOM;

$file = shift @ARGV;

$parser = XML::DOM::Parser->new();

$doc = $parser->parsefile($file);

foreach $species ($doc->getElementsByTagName('species'))

{

$common_name = $species->getElementsByTagName('common-name');

$cname = $common_name->item(0)->getFirstChild->getNodeValue;

$name = $species->getAttribute('name');

$conservation = $species->getElementsByTagName('conservation');

$status = $conservation->item(0)->getAttribute('status');

print "$cname ($name) $status\n";

}

$doc->dispose();

Import XML::DOM moduleobtain filenameinitialize a parserparse the file

Nested loops correspond to

nested tags

Returns each element matching ‘species’

Look at each species element in turn$species contains content and descendant elements

Can now treat $speciesin the same way as $doc

#!/usr/bin/perl

use XML::DOM;





{







}

$doc->dispose();

$common_name contains contentand any descendant elements

Here there is only one <common-name> elementper <species> element, but parser can't know that. Have to specify that we want

the first (and only) <common-name> elementThe <common-name> element contains only

one child which is data. Obtain first (and only) child element

using ->getFirstChild

Extract the text from the element object using ->getNodeValue

Returns each element matching ‘common-name’

->getElementsByTagName returns an arrayHere we work through the array:

foreach $species ($doc->getElementsByTagName('species')){}

$common_name = $species->getElementsByTagName('common-name');$cname = $common_name->item(0)->getFirstChild->getNodeValue;

Here we obtain an object and use the ->item() method to obtain the first item:

@common_names = $species->getElementsByTagName('common-name');$cname = $common_names[0]->getFirstChild->getNodeValue;

Alternative is to access an individual array element:

$common_name = $species->getElementsByTagName('common-name');$cname = $common_name->item(0)->getFirstChild->getNodeValue;

<species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /></species>

There could have been more thanone <common-name> element

within this <species> element

Obtain the actual textrather than an element object

#!/usr/bin/perl

use XML::DOM;





{







}

$doc->dispose();

Attributes much simpler!Can only contain text (no nested elements)Simply specify the attribute

<species name='Felix domesticus'>

Attributes

<species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /></species>

$conservation = $species->getElementsByTagName('conservation');$status = $conservation->item(0)->getAttribute('status');

Extract the <conservation> elements from this <species> element

DOM parser can't know there is only one<conservation> element per species.Extract the first (and only) one.This is an empty element, there are no

child elements so we don’t need ->getFirstChild

Contains only a ‘status’ attribute,extract its value with ->getAttribute()

#!/usr/bin/perl

use XML::DOM;





{







}

$doc->dispose();

Print the extracted information

Loop back to next <species> element

Clean up and free memory

XML::DOM

Note

Not necessary to use variable names that match the tags, but it is a very good idea!

There are many many more functions, but this set covers most needs

Writing XML with XML::DOM

#!/usr/bin/perluse XML::DOM;$nspecies = 2;@names = ('Felix domesticus', 'Drosophila melanogaster');@commonNames = ('cat', 'fruit fly');@consStatus = ('not endangered', 'not endangered');

$doc = XML::DOM::Document->new;$xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString;

$root = $doc->createElement('data');

for($i=0; $i<$nspecies; $i++){ $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]);

$root->appendChild($species);

$cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname);

$cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons);}print $root->toString;

Import XML::DOMInitialize dataCreate an XML

document objectUtility method toprint XML header

<?xml version=“1.0” ?>








Create a ‘root’ element for our dataLoop through each ofthe species

Create a <species>elementSet its ‘name’attributeJoin the <species> element

to the parent <data> element








Create a <common-name>element

Create a text node containingthe specified text

Join this text node as achild of the <name> elementJoin the <name> elementas a child of <species>








Create a <conservation>elementSet the ‘status’attributeJoin the <conservation> element

as a child of <species>








Loop back to handle thenext species

Finally print the resultingdata structure


<data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> <species name='Drosophila melanogaster'> <common-name>fruit fly</common-name> <conservation status='not endangered' /> </species></data>

Summary - reading XMLCreate a parser


Parse a file$doc = $parser->parsefile('filename');

Extract all elements matching tag-name$element_set = $doc->getElementsByTagName('tag-

name')

Extract first element of a set$element = $element_set->item(0);

Extract first child of an element$child_element = $element->getFirstChild;

Extract text from an element$text = $element->getNodeValue;

Get the value of a tag’s attribute$text = $element->getAttribute('attribute-name');

Summary - writing XMLCreate an XML document structure

$doc = XML::DOM::Document->new;

Utility to create an XML header

$header = $doc->createXMLDecl('1.0');

Create a tagged element

$element = $doc->createElement('tag-name');

Set an attribute for an element$element->setAttribute('attrib-name',

'value');

Append a child element to an element$parent_element->appendChild($child_element);

Create a text node element$element = $doc->createTextNode('text');

Print a document structure as a stringprint $root_element->toString;

SummaryTwo types of parser

Event-drivenData structure

Writing a good parser is difficult!Many parsers availableXML::DOM for reading and writing

data

perl/xml::dom - reading and writing xml from perl dr. andrew c.r. martin [email protected]

Documents