Python and XML
TRANSCRIPT
Copyright 2001, ActiveState
About me
• Paul Prescod, ([email protected])
• ActiveState Senior Developer
• Co-Author, XML Handbook
Preview
• About Python
• Python SAX/DOM
• PyXML Package
• Python XSLT/XPath
• Python SOAP/XML-RPC
• XML and Zope
What is Python?
• Python is an easy to learn, powerful programming language.
– Efficient high-level data structures
– Simple approach to object-oriented programming.
– Elegant syntax and dynamic typing
Brief History of Python
• CWI, early 90s.
• Dynamic Object Oriented High Level Language.
• More than a text processing language.
• More than a scripting language.
• Scalable and object oriented from the beginning.
• Dynamically type checked.
Python's business case
• Python can displace many other languages in the organization.
• The Python interpreter is free.
• Python is legally unencumbered.
• Professional programmers find Python more flexible than most languages.
• Amateur programmers are (often) more comfortable with Python than with Perl or Java.
Usability features
• Exceptionally clear syntax.
• Provides an obvious way to do most things.
• Small set of features combine in powerful ways.
• Only innovative where innovation is really necessary.
More Usability features
• Huge amount of free code and libraries
• Interactive.
• Designed to talk to the world.
• Runs with Unix, Mac and Windows.
• Integrates with JVM (Jython) and .NET Framework (Python.NET)
• Talks MS COM, XPCOM, CORBA, SOAP, XML-RPC, …
Scalability features
• Simple but powerful module system.
• Simple but powerful class system.
• Structured, standardized exceptions.
Environments
• Unix (almost all)
• Windows (3.1, 95, NT, CE)
• Mac
• JVM
• Various legacy systems...
Extendable
• New data types -- in Python or C
• Modules -- in Python or C
• Functions -- in Python or C
Python isn't picky!
• COM/CORBA
• HTML/XML/SGML
• Win API/POSIX
• You can write code that is portable or platform-specific.
Compared to Perl
• Simpler syntactically.
• More object oriented.
• Easier to extend.
• But slower regular expressions...
Compared to Java
• Java is more difficult for amateur programmers.
• Static type checking can be inconvenient in text processing.
• Puritanical OO can be inconvenient.
• Bottom line: Java can make simple projects harder.
Why not Java: political
• "100% pure Java" gets in the way.
• The Java environment punishes interoperability. (e.g. getenv is deprecated)
• Java is designed to have interoperability limitations.
• Embedding Java is relatively painful.
Jython (nee JPython)
• Compiles Python classes to Java classes
• Embedded interpreter allows interactive coding.
• Access to all Java classes.
• For better or worse: maintains Java's security/platform-independence bubble.
Jython can use Java tools
• RDF
• XPointer
• Various parsers
• Swing GUI
• Unicode
Python Limitations
• "Ordinary Python" has 8-bit and Unicode string types.
– Handling explicit conversions can be annoying.
• Not as fast as C++.
• Raw text searching is not as fast as Perl.
• Dynamic type checking requires more care in testing.
Python interpreter
• Just type:
C:\> python
Python 1.5.2 (#0, Apr 13 1999, 10:51:12) [MSC 32 bit (Intel)] on win32
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> print "Hello, World"
Hello, World
>>> print "Goodbye, World"
Goodbye, World
>>> ^Z
C:\>
Byte-compiling
• Python automatically bytecompiles modules.
• Next execution does not require compilation.
• .py files get a .pyc in the same directory
• When the .py is updated, the .pyc is updated
Interpreters
• DOS/Win32 (last slide)
• Unix (use ^D to exit)
• Graphical: “IDLE”, “PythonWin”
Python variables
• Any Python variable can hold any value.
>>> width = 20
>>> height = 5 * 9
>>> width * height
900
>>> width = "really wide"
>>> width
'really wide'
Numeric types
• int: 32 bit, e.g. "x=5"
• long: arbitrary sized, e.g. "x=2L**128"
• float: accuracy depends on platform, e.g. "x=3.14"
• complex: real+imag., "x=5.3+3.2j"
Sequence types:
• Strings: "abcd"
• Tuples: (1,2,"b")
• Lists: [1,"a",3]
Sequence operations
• Iteration:
for i in myList:
    print i
• Numeric indexing:
k = myList[3]
• Slicing:
k = myList[2:5]
Sequence types: string
myStr = "abc" # assignment
myStr = myStr + "def" # = "abcdef"
for char in myStr: print char # iterate
otherstr = myStr[1:4] # = "bcd"
Sequence types: lists
myList = ["a",5,3.25,2L,4+3j]
anotherList = ["a",myList, ["3","2"]]
anotherList2 = myList + myList # = ["a",5,...,"a",5,...]
yetAnotherList = myList[1:3] # = [5,3.25]
Iterating over sequences
strlist = ["abc", "def", "ghi"]
for item in strlist:
    for char in item:
        print char
Sequence Concatenation
>>> word = 'Help' + 'A'
>>> word
'HelpA'
>>> list = ["Hello"] + ["World"]
>>> print list
['Hello', 'World']
Negative indexes
>>> word[-1] # The last character
'A'
>>> word[-2] # The last-but-one character
'p'
>>> word[-2:] # The last two characters
'pA'
>>> word[:-2] # All but the last two characters
'Hel'
Getting the length
• The len() function gets a sequence's length
>>> len( "abc" )
3
>>> len( ["abc","def"] )
2
Tuples
• Immutable list-like objects are called "tuples"
>>> a=(1,2)
>>> a[0]=3
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
Dictionaries
• Serve as a lookup table
• Maps "keys" to "values".
• Keys can be of any immutable type
• Assignment adds or changes members
• keys() method returns keys
Dictionaries
>>> dict={"a":"alpha", "b":"bravo","c":"charlie"}
>>> dict["abc"]=10
>>> dict[5]="def"
>>> dict[2.52]=6.71
>>> print dict
{2.52: 6.71, 5: 'def', 'abc': 10, 'b': 'bravo', 'c': 'charlie', 'a': 'alpha'}
Dictionary Methods
>>> dict.keys()
[2.52, 5, 'abc', 'b', 'c', 'a']
>>> dict.values()
[6.71, 'def', 10, 'bravo', 'charlie', 'alpha']
>>> dict.items()
[(2.52, 6.71), (5, 'def'), ('abc', 10), ('b', 'bravo'), …]
>>> dict.clear()
>>> print dict
{}
File Objects
• Represent opened files:
myFile = open( "catalog.txt", "r" )
data = myFile.read()
myFile = open( "catalog2.txt", "w" )
data = data + "more data"
myFile.write( data )
Function definitions
• Encapsulate bits of code.
• Can take a fixed or variable number of arguments.
• Arguments can have default values.
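The three points above could look roughly like this. It is a sketch written for a current Python 3 interpreter rather than the Python 1.5/2.x used on the slides, and the function name describe and all of its parameters are made up for illustration:

```python
def describe(tag, *children, indent=2, **attrs):
    """Return a one-line summary of an element-like record.

    tag      -- a fixed, required argument
    children -- a variable number of extra positional arguments
    indent   -- an argument with a default value
    attrs    -- a variable number of extra keyword arguments
    """
    pad = " " * indent
    attr_text = "".join(' %s="%s"' % (k, v) for k, v in sorted(attrs.items()))
    return "%s<%s%s> with %d children" % (pad, tag, attr_text, len(children))


# Only the required argument: the default indent of 2 is used.
summary_default = describe("folder")

# Extra positional and keyword arguments, plus an explicit indent.
summary_full = describe("bookmark", "title", "note",
                        indent=0, href="http://www.python.org/")

print(summary_default)
print(summary_full)
```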
Functions are objects
>>> def myClickFunction():
...     print "I was clicked"
...
>>> # assume button is a GUI button
>>> button.OnClick = myClickFunction
>>> print button.OnClick.__name__
myClickFunction
>>>
Exception handling
• Python exception handling is like that of Java/C++.
• Errors are reported in tracebacks.
• Exceptions propagate up.
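A small sketch of propagation, in Python 3 syntax (the slides use Python 1.5/2.x). The function names a, b and c are placeholders: the error is raised in c, passes through b, which has no handler, and is caught in a.

```python
def c():
    return 1 / 0          # raises ZeroDivisionError


def b():
    return c()            # no handler here, so the exception passes through


def a():
    try:
        return b()
    except ZeroDivisionError as exc:
        # The exception raised in c() propagated up through b() to here.
        return "caught: %s" % type(exc).__name__


result = a()
print(result)
```

Without the try/except in a(), the exception would keep propagating to the top level and the interpreter would print a traceback.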
Exception traceback
Traceback (innermost last):
  File "test.py", line 10, in ?
    a()
  File "test.py", line 2, in a
    b( )
  File "test.py", line 5, in b
    c( )
  File "test.py", line 8, in c
    1/0
ZeroDivisionError: integer division or modulo
Classes
• Classes combine code and data.
• They represent real world objects.
• We create "instance objects" from classes.
• Closest languages in terms of object model are Smalltalk or Ruby.
• Much more flexible than Java or C++
• More central to the language than Perl/Tcl/PHP.
Inheritance
• Classes can specify a base class.
• The new class "inherits" methods and data.
• The new class can
– "override" methods.
– add data and methods.
• Multiple Inheritance is okay
• All methods are virtual.
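To make the inheritance points concrete, here is a hedged Python 3 sketch. The class names Node and Element are invented for illustration and are not the DOM classes discussed later.

```python
class Node:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return "node " + self.name

    def label(self):
        # Calls describe() through self, so a subclass override is used:
        # this is what "all methods are virtual" means in practice.
        return "[" + self.describe() + "]"


class Element(Node):                 # Node is the base class
    def __init__(self, name, attrs=None):
        Node.__init__(self, name)    # reuse the inherited initializer
        self.attrs = attrs or {}     # added data

    def describe(self):              # overridden method
        return "element " + self.name

    def attribute_names(self):       # added method
        return sorted(self.attrs)


el = Element("bookmark", {"href": "http://www.python.org/"})
print(el.label())
print(el.attribute_names())
```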
Modules and Packages
• A module is a set of code in a single file.
• A package is a collection of related modules.
XML and Python
• Accessing XML with Python
• Parsing XML with Python
– Non validating Parsers
– Validating Parsers
Reading XML
• XML as a character data stream
– the RE module
• XML as a tree structure
– lists of node objects
• XML as an event source
– event dispatching to methods
Parsers in Python
• C extension modules
– PyExpat
– sgmlop
• Written in Python code:
– xmllib
– xmlproc
Manipulating XML
• Flat file processing with RE's (briefly!)
• PySAX - Simple API for XML
• PyDOM - W3C Document Object Model
• …
Flat File Processing
• XML documents are text.
• Ordinary textual tools continue to work.
• E.g. search for emph elements:
import re
# re.findall returns every match; the non-greedy .*? stops at the first </emph>
for i in re.findall( r"<emph>(.*?)</emph>", input ):
    print i
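A runnable Python 3 variant of the same idea, using a made-up sample string, shows why the pattern matters: a greedy (.*) runs on to the last closing tag, while a non-greedy (.*?) stops at the first.

```python
import re

# Sample text invented for this illustration.
text = "<p>Use <emph>SAX</emph> for streams and <emph>DOM</emph> for trees.</p>"

# Non-greedy: one result per emph element.
emph_contents = re.findall(r"<emph>(.*?)</emph>", text)

# Greedy: a single result spanning from the first <emph> to the last </emph>.
greedy_contents = re.findall(r"<emph>(.*)</emph>", text)

print(emph_contents)
print(greedy_contents)
```

Neither pattern copes with nested emph elements, attributes, or comments, which is one reason regular expressions only suit very simple cases.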
Flat File Recipe
• Unless your needs are very simple, let me help you!
• I’ve already converted the ultimate XML parsing regular expression to Python:
http://aspn.activestate.com/ASPN/Python/Cookbook/Recipe/65125
Events
• Think of an XML document as a series of events
• "Start tag", "End tag", "Characters", etc.
• We can handle hierarchy by tracking start/end tags.
• We can deal with the document a little at a time.
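One way to picture "handling hierarchy by tracking start/end tags" is to keep a stack of open elements. The sketch below uses the standard library's xml.sax module under Python 3; the class name PathHandler and the sample document are assumptions made for the example.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PathHandler(ContentHandler):
    """Record the path of every element by tracking start/end tags."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.stack = []      # names of the currently open elements
        self.paths = []      # one entry per start tag seen

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()     # the element is closed; go back up one level


doc = (b"<folder><title>XML bookmarks</title>"
       b"<bookmark><title>SIG</title></bookmark></folder>")
handler = PathHandler()
xml.sax.parseString(doc, handler)
print(handler.paths)
```

Only the stack of open elements is held in memory, not the whole document, which is the "a little at a time" property.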
PySAX
• "Simple API for XML"
• Common API for parsers.
• Based on Java API.
• Parser implements certain interfaces.
• Application implements callback interfaces.
SAX Model
• The application hands the parser an event handler object.
• The parser sends events to the handler.
• The handler can
– store them somehow,
– build something,
– re-route them to other parts of the app.
Application side
• Applications must provide:
– ContentHandler
– ErrorHandler
– DTDHandler
– EntityResolver
• Parser developer implements:
– XMLReader
– A few more (out of scope)
ContentHandler
• Captures document instance events.
• App can:
– Build app. objects.
– Output something.
– Build a GUI
– ...
ContentHandler callbacks
• Main ones:
startElement(name, attrs)
endElement(name)
characters(content)
ignorableWhitespace(whitespace)
processingInstruction(target, data)
(cont’d)
ContentHandler eg
from xml.sax.handler import ContentHandler

class countHandler(ContentHandler):
    def __init__(self):
        self.tags={}

    def startElement(self, name, attr):
        if not self.tags.has_key(name):
            self.tags[name] = 0
        self.tags[name] += 1
ContentHandler eg
import xml.sax
parser = xml.sax.make_parser()
handler = countHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
print handler.tags
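For readers on a current interpreter, the same counting handler can be sketched in Python 3. Two things are assumptions of this sketch and not part of the slides: it parses an in-memory sample with xml.sax.parseString so that no test.xml file is needed, and it uses dict.get because has_key no longer exists in Python 3.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class CountHandler(ContentHandler):
    """Count how many times each element type starts."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tags = {}

    def startElement(self, name, attrs):
        # dict.get replaces the has_key test used on the slide.
        self.tags[name] = self.tags.get(name, 0) + 1


# Sample document invented for the example.
sample = b"<folder><title>a</title><bookmark><title>b</title></bookmark></folder>"
handler = CountHandler()
xml.sax.parseString(sample, handler)
print(handler.tags)
```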
PySax Distribution
• Default content handler implementation is provided.
• Subclass can override only what it needs.
• Function to get parser is also provided.
ErrorHandling
• In addition to a content handler, we should assign an error handler.
class MyErrorHandler:
    def warning(self, exception):
        print "Whoa, nelly!"
        print exception

    def error(self, exception):
        print "Whoa, nelly!"
        raise exception

    def fatalError(self, exception):
        print "Whoa, nelly!"
        raise exception
ErrorHandling (cont'd)
...
errHandler = MyErrorHandler()
parser.setErrorHandler( errHandler )
parser.parse("\\temp\\test.xml")
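To see an error handler actually fire, the following Python 3 sketch feeds the parser a deliberately malformed document. The class name CollectingErrorHandler and the sample input are assumptions for illustration; the handler raises on error and fatalError, as on the slide, but collects warnings in a list.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    def __init__(self):
        self.warnings = []

    def warning(self, exception):
        self.warnings.append(exception)   # note it and carry on

    def error(self, exception):
        raise exception

    def fatalError(self, exception):
        raise exception                   # well-formedness errors end up here


# The title element is never closed, so the document is not well-formed.
broken = b"<folder><title>unclosed</folder>"

try:
    xml.sax.parseString(broken, ContentHandler(), CollectingErrorHandler())
    outcome = "parsed"
except xml.sax.SAXParseException as exc:
    outcome = "fatal error at line %d" % exc.getLineNumber()

print(outcome)
```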
Character handling
# print out characters in document
from xml.sax.handler import ContentHandler
import xml.sax, sys

class textHandler(ContentHandler):
    def characters(self, ch):
        sys.stdout.write(ch.encode("Latin-1"))

parser = xml.sax.make_parser()
parser.setContentHandler(textHandler())
parser.parse("test.xml")
Document Object Model
• Document Object Model
• The DOM is a W3C standard.
• Extended version of "Dynamic HTML"
• Defined in CORBA IDL.
• Implemented in various languages.
• Implemented in IE5.0 and eventually Netscape
The DOM
• The DOM is a tree-based API.
• This implies a certain amount of overhead.
• But also a lot of convenience and flexibility.
• XPath implementation essentially requires tree-based APIs.
DOM Nodes
• Elements, attributes, comments, etc. called "nodes".
• Classes represent node types.
• All node types subclass the "node" base class.
Node Objects
• Example methods include:
– getNodeType
– getParentNode
– getChildNodes
– getAttributes
– insertBefore
– cloneNode
Element Objects
• Elements are a representative subclass:
• getTagName
• getAttribute
• setAttribute
• getElementsByTagName
DOM node types
ATTRIBUTE
CDATA_SECTION
COMMENT
DOCUMENT
DOCUMENT_FRAGMENT
DOCUMENT_TYPE
More DOM node types
ELEMENT
ENTITY
ENTITY_REFERENCE
NOTATION
PROCESSING_INSTRUCTION
TEXT
Navigation properties
• parentNode - Parent of this node
• firstChild - First child of this node
• lastChild - Last child of this node
• previousSibling - Node immediately preceding this node
• nextSibling - Node immediately following this node
• childNodes - List containing all the children of this node
Example
<folder>
  <title>XML bookmarks</title>
  <bookmark href="http://www.python.org/sigs/xml-sig/">
    <title>SIG for XML Processing in Python</title>
  </bookmark>
</folder>
First "title" node
Properties:
• parentNode: folder element
• firstChild: Text node 'XML bookmarks'
• lastChild: Text node 'XML bookmarks'
• previousSibling: None
• nextSibling: bookmark element
• childNodes: A 1-element list: [ Text node 'XML bookmarks' ]
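These properties can be checked with the standard library's minidom under Python 3. One assumption in this sketch: the bookmark document is written without whitespace between tags. If it were parsed with indentation and line breaks between the tags, minidom would also create whitespace-only Text nodes, and the sibling properties would point at those instead.

```python
from xml.dom import minidom

# No whitespace between tags, so no whitespace-only Text nodes appear.
doc = minidom.parseString(
    '<folder><title>XML bookmarks</title>'
    '<bookmark href="http://www.python.org/sigs/xml-sig/">'
    '<title>SIG for XML Processing in Python</title>'
    '</bookmark></folder>'
)
title = doc.documentElement.firstChild      # the first "title" element

print(title.parentNode.nodeName)            # folder
print(title.firstChild.data)                # XML bookmarks
print(title.firstChild is title.lastChild)  # True
print(title.previousSibling)                # None
print(title.nextSibling.nodeName)           # bookmark
print(len(title.childNodes))                # 1
```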
DOM
• The DOM API is very large and beyond the scope of the tutorial.
• A few short examples will illustrate the basic model.
Building a DOM
from xml.dom import minidom
dom = minidom.parse("test.xml")
rootel = dom.documentElement
print rootel.nodeName
topnodes = rootel.childNodes
for toplevel in topnodes:
    print toplevel.nodeName
Searching a DOM
# print the last point element
# in the tree
print h.document.documentElement.\
    getElementsByTagName('point')[-1]
Modifying a DOM
appendChild(newChild)
insertBefore(newChild, refChild)
replaceChild(newChild, oldChild)
removeChild(oldChild)
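A short Python 3 sketch that exercises each of the four methods with minidom. The sample document and the element names in it are made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString("<folder><title>old</title><bookmark/></folder>")
root = doc.documentElement
title, bookmark = root.childNodes

# appendChild: add a new element at the end of the folder
note = doc.createElement("note")
root.appendChild(note)

# insertBefore: put a comment in front of the bookmark
root.insertBefore(doc.createComment("links"), bookmark)

# replaceChild: swap the old title for a new one
new_title = doc.createElement("title")
new_title.appendChild(doc.createTextNode("XML bookmarks"))
root.replaceChild(new_title, title)

# removeChild: drop the note again
root.removeChild(note)

result = root.toxml()
print(result)
```

New nodes are created through the Document object (createElement, createTextNode, createComment) and only become part of the tree once one of the four methods attaches them.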
The Document Node
• One Document node per document.
• The base of the entire tree
• documentElement attribute contains a single Element node
• childNodes may have additional children, such as ProcessingInstruction nodes.
PyXML Package
• http://pyxml.sourceforge.net
• Collection of lots of useful Python XML stuff.
• Collectively maintained.
PyDOM
• A richer, more robust DOM than minidom.
• More classes, support for DOM 2+
• Integration with XPath and XSLT
PyXML Marshalling
• Convert Python types into XML
• xml.marshal.generic – generic base class
• xml.marshal.wddx – marshal Python types as WDDX
• xml.marshal.xmlrpc – marshal Python types as XML-RPC elements
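The idea of marshalling Python values as XML-RPC elements can be tried without PyXML. The sketch below swaps in the Python 3 standard library module xmlrpc.client in place of xml.marshal.xmlrpc, so it illustrates the concept and not the PyXML API itself. The method name echo and the sample values are assumptions.

```python
import xmlrpc.client

# A tuple of ordinary Python values: string, int, list, dict.
params = ("Hello world", 42, [1.5, True], {"lang": "Python"})

# dumps() marshals the tuple into an XML-RPC request document.
request_xml = xmlrpc.client.dumps(params, methodname="echo")

# loads() reverses the process, returning the parameters and method name.
decoded_params, method = xmlrpc.client.loads(request_xml)

print(request_xml)
print(method)
print(decoded_params)
```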
PyTrex
• PyTrex is a schema processor for the TREX schema language
• http://sourceforge.net/projects/pytrex/
• http://www.thaiopensource.com/trex/
Python SOAP/XML-RPC
• PythonWare distributes the XML-RPC client: www.pythonware.com
• There are various SOAP implementations:
– SOAP.py : http://www.actzero.com
– soaplib.py : http://www.pythonware.com
– 4Suite: http://4suite.org/
– …
Python SOAP Example
• SOAP.py:
import SOAP
server = SOAP.SOAPProxy( "http://localhost:8000/")
print server.echo("Hello world")
XML and Zope
• Zope is an Open Source application server that publishes objects on the Internet.
• ParsedXML: Breaks up an XML document into bits.
• XML-RPC: You can plumb the depths of Zope with XML-RPC.
• Zcatalog: Index based on element-type names, attribute names, etc.
ParsedXML
• A free Zope “product” (extension)
• Every element is a first-class Zope object.
• You can add “behavior” to XML documents
• RSS Channel Product
Zope XML-RPC
import xmlrpclib
d=xmlrpclib.Server('http://localhost:8080/Zope')
content=d.document_src()
content=content.replace( 'test', 'CHANGED')
d.manage_upload(content)
Redfoot
• Redfoot is a framework for distributed RDF-based applications, written in Python.
– an RDF database
– a query API for RDF
– an RDF parser and serializer
– a simple HTTP server providing a web interface for viewing and editing RDF
– a fully customizable UI
– the beginnings of a peer-to-peer architecture for communication between different RDF databases
More Information
• XML Topic Guide
– http://www.python.org/topics/xml/
• SIG
– http://www.python.org/sigs/
• ActiveState Programmers Network
– http://www.activestate.com/ASPN
• XML-DEV: subscribe at:
– [email protected]
General XML
• Definitive Spec.
– http://www.w3c.org/TR/xml-spec.html
• Annotated Spec.
– http://www.xml.com/xml/pub/axml/axmlintro.html
• FAQ:
– http://www.ucc.ie/xml
• Definitive Reference to all things XML
– http://www.oasis-org.org/sgml