perl/xml::dom - reading and writing xml from perl dr. andrew c.r. martin [email protected]

45
Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin [email protected] http://www.bioinf.org.uk/

Upload: stephen-joseph

Post on 24-Dec-2015

228 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Perl/XML::DOM - reading and writing XML from Perl

Dr. Andrew C.R. [email protected]://www.bioinf.org.uk/

Page 2: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Aims and objectivesRefresh the structure of a XML (or

XHTML) documentKnow problems in reading and

writing XMLUnderstand the requirements of

XML parsers and the two main types

Know how to write code using the DOM parser

PRACTICAL: write a script to read XML

Page 3: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

An XML refresher!<mutants> <mutant_group native='1abc01'> <structure> <method>x-ray</method> <resolution>1.8</resolution> <rfactor>0.20</rfactor> </structure>

<mutant domid='2bcd01'> <structure> <method>x-ray</method> <resolution>1.8</resolution> <rfactor>0.20</rfactor> </structure> <mutation res='L24' native='ALA' subs='ARG' /> </mutant> </mutant_group></mutants>

Tags: paired opening and

closing tags

May contain data and/or

other (nested) tags

un-paired tags use

special syntax

Attributes: optional

– contained within

the opening tag

Page 4: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Writing XML

Writing XML is straightforward

Generate XML from a Perl script using print() statements.

However:tags correctly nested

quote marks correctly paired

international character sets

Page 5: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Reading XML

As simple or complex as you wish!

Full control over XML:simple pattern may suffice

Otherwise, may be dangerousmay rely on un-guaranteed formatting

<mutants><mutant_group native='1abc01'><structure><method>x-ray</method><resolution>1.8</resolution><rfactor>0.20</rfactor></structure><mutant domid='2bcd01'><structure><method>x-ray</method><resolution>1.8</resolution><rfactor>0.20</rfactor></structure><mutation res='L24’native='ALA’ subs='ARG'/></mutant></mutant_group></mutants>

Page 6: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

XML ParsersClear rules for data boundaries

and hierarchy

Predictable; unambiguous

Parser translates XML into

stream of events

complex data object

Page 7: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

XML Parsers

Different data sources of datafilescharacter stringsremote references

different character encodingsstandard LatinJapanese

checking for well-formedness errors

Good parser will handle:

Page 8: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

XML ParsersRead stream of characters

differentiate markup and dataOptionally replace entity

references(e.g. &lt; with <)

Assemble complete documentdisparate (perhaps remote) sources

Report syntax and validation errorsPass data to client program

Page 9: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

XML ParsersIf XML has no syntax errors it is

'well formed'

With a DTD, a validating parser will check it matches:'valid'

Page 10: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

XML ParsersWriting a good parser is a lot of

work!

A lot of testing needed

Fortunately, many parsers available

Page 11: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Getting data to your programParser can generate 'events'

Tags are converted into events

Events triggered in your program as the document is read

Parser acts as a pipeline converting XML into processed chunks of data sent to your program: an 'event stream'

Page 12: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Getting data to your program

OR…

XML converted into a tree structure

Reflects organization of the XML

Whole document read into memory before your program gets access

Page 13: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Pros and cons

In the parser, everything is likely to be event-driven tree-based parsers create a data structure from the

event stream

Data structure

More convenientCan access data in any

order Code usually simplerMay be impossible to

handle very large filesNeed more processor

timeNeed more memory

Event stream

Faster to access limited data

Use less memoryParser loses data at the

next eventMore complex code

Page 14: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

SAX and DOMde facto standard APIs for XML

parsingSAX (Simple API for XML)

event-stream APIoriginally for Java, but now for several

programming languages (including Perl)development promoted by Peter Murray Rust,

1997-8

DOM (Document Object Model) W3C standard tree-based parserplatform- and language-neutralallows update of document content and structure

as well as reading

Page 15: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Perl XML parsers

Many parsers available

Differ in three major ways: parsing style (event driven or data

structure)

'standards-completeness’

speed (implementation in C or pure Perl)

Page 16: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Perl XML parsers

XML::Simple

Very easy to use

Designed for simple applications

Can’t handle 'mixed content' tags containing both data and other

tags

<p>This is <b>mixed</b> content</p>

Page 17: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Perl XML parsers

XML::ParserOldest Perl XML parser

Reasonably fast and flexibile

Not very standards-compliant.

Is a wrapper to 'expat’probably the first C XML parser written

by James Clark

Page 18: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

XML::Parser

use XML::Parser;

my $xmlfile = shift @ARGV; # the file to parse

# initialize parser object and parse the stringmy $parser = XML::Parser->new( ErrorContext => 2 );eval { $parser->parsefile( $xmlfile ); };

# report error or successif( $@ ) { $@ =~ s/at \/.*?$//s; # remove module line number print STDERR "\nERROR in '$xmlfile':\n$@\n";} else { print STDERR "'$xmlfile' is well-formed\n";}

Simple example - check well-formedness

Eval used so parser doesn’t cause program to exit

Remember 2 lines before an error $@ stores

the error

Page 19: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Perl XML parsers

XML::DOM Implements W3C DOM Level 1Built on top of XML::ParserVery good fast, stable and

complete Limited extended functionality

XML::SAX Implements SAX2 wrapper to ExpatFast, stable and complete

Page 20: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Perl XML parsers

XML::LibXML

Wrapper around GNOME libxml2

Very fast, complete and stable

Validating/non-validating

DOM and SAX support

Page 21: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Perl XML parsers

XML::Twig

DOM-like parser, BUT

Allows you to define elements which can be parsed as discrete units'twigs' (small branches of a tree)

Page 22: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Perl XML parsers

Several others

More specialized, adding… XPath (to select data from XML

document)

re-formatting (XSLT or other methods)

...

Page 23: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

XML::DOMDOM is a standard API

once learned moving to a different language is straightforward

moving between implementations also easy

Suppose we want to extract some data from an XML file...

Page 24: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

XML::DOM<data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> <species name='Drosophila melanogaster'> <common-name>fruit fly</common-name> <conservation status='not endangered' /> </species></data>

We want:cat (Felix domesticus) not endangeredfruit fly (Drosophila melanogaster) not endangered

Page 25: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

#!/usr/bin/perl

use XML::DOM;

$file = shift @ARGV;

$parser = XML::DOM::Parser->new();

$doc = $parser->parsefile($file);

foreach $species ($doc->getElementsByTagName('species'))

{

$common_name = $species->getElementsByTagName('common-name');

$cname = $common_name->item(0)->getFirstChild->getNodeValue;

$name = $species->getAttribute('name');

$conservation = $species->getElementsByTagName('conservation');

$status = $conservation->item(0)->getAttribute('status');

print "$cname ($name) $status\n";

}

$doc->dispose();

Import XML::DOM moduleobtain filenameinitialize a parserparse the file

Nested loops correspond to

nested tags

Returns each element matching ‘species’

Look at each species element in turn$species contains content and descendant elements

Can now treat $speciesin the same way as $doc

Page 26: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

#!/usr/bin/perl

use XML::DOM;

$file = shift @ARGV;

$parser = XML::DOM::Parser->new();

$doc = $parser->parsefile($file);

foreach $species ($doc->getElementsByTagName('species'))

{

$common_name = $species->getElementsByTagName('common-name');

$cname = $common_name->item(0)->getFirstChild->getNodeValue;

$name = $species->getAttribute('name');

$conservation = $species->getElementsByTagName('conservation');

$status = $conservation->item(0)->getAttribute('status');

print "$cname ($name) $status\n";

}

$doc->dispose();

$common_name contains contentand any descendant elements

Here there is only one <common-name> elementper <species> element, but parser can't know that. Have to specify that we want

the first (and only) <common-name> elementThe <common-name> element contains only

one child which is data. Obtain first (and only) child element

using ->getFirstChild

Extract the text from the element object using ->getNodeValue

Returns each element matching ‘common-name’

Page 27: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

->getElementsByTagName returns an arrayHere we work through the array:

foreach $species ($doc->getElementsByTagName('species')){}

$common_name = $species->getElementsByTagName('common-name');$cname = $common_name->item(0)->getFirstChild->getNodeValue;

Here we obtain an object and use the ->item() method to obtain the first item:

@common_names = $species->getElementsByTagName('common-name');$cname = $common_names[0]->getFirstChild->getNodeValue;

Alternative is to access an individual array element:

Page 28: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

$common_name = $species->getElementsByTagName('common-name');$cname = $common_name->item(0)->getFirstChild->getNodeValue;

<species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /></species>

There could have been more thanone <common-name> element

within this <species> element

Obtain the actual textrather than an element object

Page 29: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

#!/usr/bin/perl

use XML::DOM;

$file = shift @ARGV;

$parser = XML::DOM::Parser->new();

$doc = $parser->parsefile($file);

foreach $species ($doc->getElementsByTagName('species'))

{

$common_name = $species->getElementsByTagName('common-name');

$cname = $common_name->item(0)->getFirstChild->getNodeValue;

$name = $species->getAttribute('name');

$conservation = $species->getElementsByTagName('conservation');

$status = $conservation->item(0)->getAttribute('status');

print "$cname ($name) $status\n";

}

$doc->dispose();

Attributes much simpler!Can only contain text (no nested elements)Simply specify the attribute

<species name='Felix domesticus'>

Attributes

Page 30: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

<species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /></species>

$conservation = $species->getElementsByTagName('conservation');$status = $conservation->item(0)->getAttribute('status');

Extract the <conservation> elements from this <species> element

DOM parser can't know there is only one<conservation> element per species.Extract the first (and only) one.This is an empty element, there are no

child elements so we don’t need ->getFirstChild

Contains only a ‘status’ attribute,extract its value with ->getAttribute()

Page 31: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

#!/usr/bin/perl

use XML::DOM;

$file = shift @ARGV;

$parser = XML::DOM::Parser->new();

$doc = $parser->parsefile($file);

foreach $species ($doc->getElementsByTagName('species'))

{

$common_name = $species->getElementsByTagName('common-name');

$cname = $common_name->item(0)->getFirstChild->getNodeValue;

$name = $species->getAttribute('name');

$conservation = $species->getElementsByTagName('conservation');

$status = $conservation->item(0)->getAttribute('status');

print "$cname ($name) $status\n";

}

$doc->dispose();

Print the extracted information

Loop back to next <species> element

Clean up and free memory

Page 32: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

XML::DOM

Note

Not necessary to use variable names that match the tags, but it is a very good idea!

There are many many more functions, but this set covers most needs

Page 33: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Writing XML with XML::DOM

Page 34: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

#!/usr/bin/perluse XML::DOM;$nspecies = 2;@names = ('Felix domesticus', 'Drosophila melanogaster');@commonNames = ('cat', 'fruit fly');@consStatus = ('not endangered', 'not endangered');

$doc = XML::DOM::Document->new;$xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString;

$root = $doc->createElement('data');

for($i=0; $i<$nspecies; $i++){ $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]);

$root->appendChild($species);

$cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname);

$cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons);}print $root->toString;

Import XML::DOMInitialize dataCreate an XML

document objectUtility method toprint XML header

<?xml version=“1.0” ?>

Page 35: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

#!/usr/bin/perluse XML::DOM;$nspecies = 2;@names = ('Felix domesticus', 'Drosophila melanogaster');@commonNames = ('cat', 'fruit fly');@consStatus = ('not endangered', 'not endangered');

$doc = XML::DOM::Document->new;$xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString;

$root = $doc->createElement('data');

for($i=0; $i<$nspecies; $i++){ $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]);

$root->appendChild($species);

$cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname);

$cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons);}print $root->toString;

Create a ‘root’ element for our dataLoop through each ofthe species

Create a <species>elementSet its ‘name’attributeJoin the <species> element

to the parent <data> element

Page 36: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

<?xml version=“1.0” ?>

<data> <species name='Felix domesticus’ /></data>

Page 37: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

#!/usr/bin/perluse XML::DOM;$nspecies = 2;@names = ('Felix domesticus', 'Drosophila melanogaster');@commonNames = ('cat', 'fruit fly');@consStatus = ('not endangered', 'not endangered');

$doc = XML::DOM::Document->new;$xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString;

$root = $doc->createElement('data');

for($i=0; $i<$nspecies; $i++){ $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]);

$root->appendChild($species);

$cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname);

$cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons);}print $root->toString;

Create a <common-name>element

Create a text node containingthe specified text

Join this text node as achild of the <name> elementJoin the <name> elementas a child of <species>

Page 38: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

<?xml version=“1.0” ?>

<data> <species name='Felix domesticus'> <common-name>cat</common-name> </species></data>

Page 39: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

#!/usr/bin/perluse XML::DOM;$nspecies = 2;@names = ('Felix domesticus', 'Drosophila melanogaster');@commonNames = ('cat', 'fruit fly');@consStatus = ('not endangered', 'not endangered');

$doc = XML::DOM::Document->new;$xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString;

$root = $doc->createElement('data');

for($i=0; $i<$nspecies; $i++){ $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]);

$root->appendChild($species);

$cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname);

$cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons);}print $root->toString;

Create a <conservation>elementSet the ‘status’attributeJoin the <conservation> element

as a child of <species>

Page 40: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

<?xml version=“1.0” ?>

<data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species></data>

Page 41: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

#!/usr/bin/perluse XML::DOM;$nspecies = 2;@names = ('Felix domesticus', 'Drosophila melanogaster');@commonNames = ('cat', 'fruit fly');@consStatus = ('not endangered', 'not endangered');

$doc = XML::DOM::Document->new;$xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString;

$root = $doc->createElement('data');

for($i=0; $i<$nspecies; $i++){ $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]);

$root->appendChild($species);

$cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname);

$cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons);}print $root->toString;

Loop back to handle thenext species

Finally print the resultingdata structure

Page 42: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

<?xml version=“1.0” ?>

<data> <species name='Felix domesticus'> <common-name>cat</common-name> <conservation status='not endangered' /> </species> <species name='Drosophila melanogaster'> <common-name>fruit fly</common-name> <conservation status='not endangered' /> </species></data>

Page 43: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Summary - reading XMLCreate a parser

$parser = XML::DOM::Parser->new();

Parse a file$doc = $parser->parsefile('filename');

Extract all elements matching tag-name$element_set = $doc->getElementsByTagName('tag-

name')

Extract first element of a set$element = $element_set->item(0);

Extract first child of an element$child_element = $element->getFirstChild;

Extract text from an element$text = $element->getNodeValue;

Get the value of a tag’s attribute$text = $element->getAttribute('attribute-name');

Page 44: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

Summary - writing XMLCreate an XML document structure

$doc = XML::DOM::Document->new;

Utility to create an XML header

$header = $doc->createXMLDecl('1.0');

Create a tagged element

$element = $doc->createElement('tag-name');

Set an attribute for an element$element->setAttribute('attrib-name',

'value');

Append a child element to an element$parent_element->appendChild($child_element);

Create a text node element$element = $doc->createTextNode('text');

Print a document structure as a stringprint $root_element->toString;

Page 45: Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk

SummaryTwo types of parser

Event-drivenData structure

Writing a good parser is difficult!Many parsers availableXML::DOM for reading and writing

data