Page 1: LT PyXML: A fast validating XML parser embedded in Python

LT PyXML: A fast validating XML parser embedded in Python

Henry S. Thompson
HCRC Language Technology Group
University of Edinburgh

Page 2: LT PyXML: A fast validating XML parser embedded in Python

HCRC Language Technology Group

Henry S. Thompson, XML DevCon, Montréal, 1999-08-19

Acknowledgements

- This work was carried out in the Language Technology Group of the Human Communication Research Centre, whose baseline funding comes from the UK Economic and Social Research Council
- The UK Engineering and Physical Sciences Research Council funded project NSCOPE, which stimulated some of the work discussed here today
- This work was also helped by grants to our group from Sun Microsystems and Microsoft

Page 3: LT PyXML: A fast validating XML parser embedded in Python

How we use SGML/XML

- We use SGML and XML in the context of collecting, standardising, distributing, annotating and using large text collections (corpora) for computational linguistics research and development
- These corpora are:
  - Large: 10-100 million words
  - Densely annotated: often every word has associated markup
- DTDs and validation are very important to us

Page 4: LT PyXML: A fast validating XML parser embedded in Python

An aside about validation

- A DTD or schema is a contract between producers and consumers
- It provides a guaranteed interface
- Producers validate to ensure they are providing what they promised
- Consumers validate to check up on producers and to protect their applications
- Application authors validate to simplify their task
  - Leave error detection and analysis to the validating parser

Page 5: LT PyXML: A fast validating XML parser embedded in Python

How we use XML (2)

- Like any other SME, we produce documents
- Being a university-embedded SME, we produce lots of documents
- Lots of those documents are trivial variations on one another, based on target medium and/or audience
  - Overhead slides for teaching
  - Web pages for publicity/teaching backup
  - Presentation slides for conferences
  - Research papers for monographs and journals

Page 6: LT PyXML: A fast validating XML parser embedded in Python

Our application needs

- Batch applications to automatically add linguistic annotation
- Modular, pipelined programs supporting data parallelism
- Specialised interactive editors to hand-correct markup
- Authoring tools and publication tools which make content-sharing easy

Page 7: LT PyXML: A fast validating XML parser embedded in Python

We built software: RXP & LT XML

- Because of the following issues:
  - Price
  - Efficiency
  - C-language interface
  - Documentation
- Contrast with EXPAT
  - 50 to 100% slower
    - but still 90% faster than Java implementations
  - Thoroughly documented
  - Validates
  - Coverage nine nines identical

Page 8: LT PyXML: A fast validating XML parser embedded in Python

LT XML: Basic Architecture

- Pipelines of ‘fat’ streams
  - cf. Unix ‘thin’ streams
- API provides primitives for XML-appropriate input and output
- Two alternative views:
  - micro-sequence: start-tag, comment, char-data, end-tag, proc. inst.
  - tree-structure: sequence of sub-trees, level ad lib.

Page 9: LT PyXML: A fast validating XML parser embedded in Python

Flat view

- Provides GetNextBit which reads the next bit of XML:
  - Start/empty tags (including attributes and all values)
  - Text==PCDATA
  - End tags
  - Processing instructions
- PrintBit will write one of these to an output stream
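
To make the flat view concrete, here is a minimal sketch of the same idea using only the Python standard library. It does not use the LT XML or LT PyXML API: the function name `bits` and the tuple layout are assumptions made for illustration, and the standard `xml.parsers.expat` module stands in for the parser (note that it does not validate).

```python
from xml.parsers import expat


def bits(xml_text):
    """Return the document as a flat list of 'bits', in document order.

    Hypothetical helper: the name and tuple shapes are illustrative only.
    """
    out = []
    parser = expat.ParserCreate()
    parser.buffer_text = True  # deliver each run of character data as one bit
    parser.StartElementHandler = lambda name, attrs: out.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: out.append(("end", name))
    parser.CharacterDataHandler = lambda data: out.append(("text", data))
    parser.ProcessingInstructionHandler = (
        lambda target, data: out.append(("pi", target, data))
    )
    parser.CommentHandler = lambda data: out.append(("comment", data))
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return out


for bit in bits('<doc id="d1"><!--note-->Hello<?app go?></doc>'):
    print(bit)
```

Each tuple plays the role of one bit: an application can dispatch on the first field and write the bit straight back out, which is what makes this view convenient for stream filters.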

Page 10: LT PyXML: A fast validating XML parser embedded in Python

Tree-structured view

- Items are subtrees of the SGML structure
- Reading
  - GetNextItem
  - GetNextQueryItem
- Writing
  - PrintItem
- The two views (flat or tree-structured) can be mixed to suit the needs of the application
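
As a rough analogue of reading one subtree at a time, the sketch below uses the standard library's `xml.etree.ElementTree.iterparse`. It is not the LT XML API: the generator name `items` and the sample element names are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def items(stream, tag):
    """Yield each completed <tag> subtree from the stream, one at a time.

    Hypothetical helper, loosely in the spirit of a "get next item" call.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            yield elem
            elem.clear()  # drop the subtree's contents once the caller is done


doc = b"<text><p><s>One</s><s>Two</s></p></text>"
for s in items(io.BytesIO(doc), "s"):
    print(s.tag, s.text)
```

Because each subtree is cleared after it has been consumed, the text and children of already-processed items are released, which matters for corpora of the size mentioned earlier.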

Page 11: LT PyXML: A fast validating XML parser embedded in Python

Query language

- LT XML defines a query language which allows the specification of elements from an XML document
- Queries are tree based, using element names, attribute values and textual data
- Similar path-style syntax to XPath
- Regular expressions are allowed for attribute values.

Page 12: LT PyXML: A fast validating XML parser embedded in Python

Query language, continued

- The LT XML query language is not a complete relational query language, although that can be built on top
- For efficiency reasons, LT XML doesn't allow queries which require back-tracking or an unbounded amount of left context
- The query language allows programmers to quickly find the sub-structure they are interested in, while ignoring the rest

Page 13: LT PyXML: A fast validating XML parser embedded in Python

Query example

.*/TEXT/./P[TYPE=STD]/S[1]
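
For readers more familiar with XPath, the sketch below shows a roughly comparable selection using the limited XPath subset in Python's `xml.etree.ElementTree`. The reading of the query is an assumption: a TEXT element at any depth, then any single child, then a P whose TYPE attribute is STD, then an indexed S. The slide does not say whether the LT XML index is 0-based or 1-based; the XPath predicate used here is 1-based. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample document; element names follow the query on the slide.
DOC = """<CORPUS>
  <TEXT>
    <BODY>
      <P TYPE="STD"><S>first</S><S>second</S></P>
      <P TYPE="HEAD"><S>title</S></P>
    </BODY>
  </TEXT>
</CORPUS>"""

root = ET.fromstring(DOC)

# Approximate XPath-style counterpart (an assumption, not LT XML syntax).
matches = root.findall(".//TEXT/*/P[@TYPE='STD']/S[1]")
print([s.text for s in matches])
```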

Page 14: LT PyXML: A fast validating XML parser embedded in Python

Simple Tools are Simple to Build

- Less than one page of C code to produce a simple application
- Pipelines mean you can compose simple tools for complex applications

Page 15: LT PyXML: A fast validating XML parser embedded in Python

Pre-constructed Tools

- Extract text content: textonly
- Select fragments based on tags, attributes and text content: sggrep
- Count tags: sgcount
- Production-system style transformation: sgmltrans
- Simple pattern-based information extraction: sgrpg
- Indexing for fast access: mkindex
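
To give a feel for how small such tools can be, here is a sketch of two of them written against the Python standard library. These are not the LT XML programs themselves: the function names `count_tags` and `text_only` are hypothetical, and the behaviour is only assumed to resemble what the tool descriptions above suggest.

```python
import io
from collections import Counter
import xml.etree.ElementTree as ET


def count_tags(stream):
    """Count how often each element name occurs (in the spirit of a tag counter)."""
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("start",)):
        counts[elem.tag] += 1
    return counts


def text_only(stream):
    """Return the character data of the document with all markup removed."""
    return "".join(ET.parse(stream).getroot().itertext())


sample = b"<doc><p>One <em>two</em></p><p>three</p></doc>"
print(count_tags(io.BytesIO(sample)))
print(text_only(io.BytesIO(sample)))
```

Both functions take a byte stream, so in a pipeline they could be pointed at `sys.stdin.buffer` instead of an in-memory buffer.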

Page 16: LT PyXML: A fast validating XML parser embedded in Python

Availability

- Free to all for research use
- Executables and libraries for Unix (Solaris, SunOS, Linux, FreeBSD) and Win32
- Sources for Unix
- Packaged executable for Mac
- http://www.ltg.ed.ac.uk/software/xml/

Page 17: LT PyXML: A fast validating XML parser embedded in Python

What about user interaction?

- C is not the world's easiest or most portable GUI-building environment
- We have in-house clients who are happy with scripting languages
- So we've embedded LT XML inside a number of other contexts
  - Common Lisp
  - Perl
  - Python
- It's the Python embedding that's the main topic for today

Page 18: LT PyXML: A fast validating XML parser embedded in Python

LT PyXML Basics

- A C-implemented Python module
- Integrates the LT XML API into Python
  - Architecture
    - Both views (bits and tree fragments)
  - Objects
    - including garbage collection
  - Functions
    - A modest subset
- We've used the Tkinter module for all our GUI work, but Python has other GUI options

Page 19: LT PyXML: A fast validating XML parser embedded in Python

LT PyXML functions

- Files
  - Open, OpenString, Fopen, Close
- Bits
  - GetNextBit, ItemParse
- Attributes
  - GetAttrVal, ItemActualAttributes, PutAttrVal
- Queries
  - ParseQuery, GetNextQueryItem
- Printing
  - Print, PrintEndTag, PrintStartTag, PrintTextLiteral

Page 20: LT PyXML: A fast validating XML parser embedded in Python

LT PyXML Objects

- Use native Python lists and dictionaries where we can
- New primitive Objects, often lazy wrt pullthrough
  - Files
    - NSL_File
  - Doctypes
    - NSL_Doctype, NSL_ElementType, NSL_AttrDefn, NSL_ContentParticle
  - Instances
    - NSL_Bit, NSL_Item, NSL_ERef, NSL_OOB
  - Queries
    - NSL_Query

Page 21: LT PyXML: A fast validating XML parser embedded in Python

LT PyXML limitations

- 8-bit character inventory (Python/Tk limitation)
- I haven't delivered on the promise in the abstract, but
  - The binary is in the XED distributions
  - A proper release will appear shortly

Page 22: LT PyXML: A fast validating XML parser embedded in Python

Three applications

- XED
  - instance access minimal
  - doctype access minimal
- Schema workbench
  - instance access paradigmatic
  - depends heavily on validation
- XML DTD Normaliser
  - instance access non-existent
  - doctype access paradigmatic

Page 23: LT PyXML: A fast validating XML parser embedded in Python

XED

- A text editor for XML document instances
- Implemented in Python using LT PyXML and Tkinter
- Optimised for hand-authoring small- to medium-sized documents
- Cross-platform
- Free of charge
- Sources not yet available

Page 24: LT PyXML: A fast validating XML parser embedded in Python

XED features

- Single-window WYSIWYG presentation
- Add, remove and rename balanced start/end tag pairs and empty elements
- Add, remove and rename attribute name/value pairs
- Add or remove comments, CDATA sections and processing instructions
- Context-sensitive tag and attribute menus

Page 25: LT PyXML: A fast validating XML parser embedded in Python

XED features, cont'd

- Filling of text content, indenting of element-only content
- Structure-sensitive point-and-sweep selection paradigm
- Structure-preserving cut and paste
- Multiple undo
- Key bindings based on xxxPad under WIN32; based on Emacs under Unix

Page 26: LT PyXML: A fast validating XML parser embedded in Python

XED demo

- See http://www.ltg.ed.ac.uk/ht/xed.html
- The vast bulk of XED is Python/Tk, but it's made possible by LT PyXML
  - Control of text segments
  - Control of OOB processing
- Context-sensitive menus are initialised from the DTD
  - Really helps newcomers to XML get started
- Cannot produce ill-formed XML

Page 27: LT PyXML: A fast validating XML parser embedded in Python

Schema Workbench demo

- Not publicly available yet
- Built to facilitate development of the XML Schema spec
- When I started writing large schemata which exploited the refinement aspects of the public WD
  - I needed to see the type hierarchy
  - I needed to produce a normalised DTD to compare with the originals

Page 28: LT PyXML: A fast validating XML parser embedded in Python

Schema Workbench features

- The schema document to schema structures part of this took less than a day to write
- Two main reasons
  - Validation on the way in meant
    - I could depend on the presence of required components
    - I didn't need to check for misplaced bits
  - Python's object-creation and evaluation facilities
    - Turned most NSL_Items directly into Python objects with object type == GI
- Once I had the structures, implementing refinement was easy
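
The "object type == GI" idea can be sketched with standard Python alone. The code below is an illustration under assumptions, not the Schema Workbench source: it uses `xml.etree.ElementTree` elements in place of NSL_Items, and the names `Node`, `class_for`, `build` and the sample element names are all hypothetical. Each element becomes an instance of a class created on demand and named after the element's generic identifier (GI).

```python
import xml.etree.ElementTree as ET


class Node:
    """Base class holding the attributes, child objects and text of one element."""

    def __init__(self, attrs, children, text):
        self.attrs = attrs
        self.children = children
        self.text = text


_classes = {}


def class_for(gi):
    """Return (creating on first use) a Node subclass named after the GI."""
    if gi not in _classes:
        _classes[gi] = type(gi, (Node,), {})
    return _classes[gi]


def build(elem):
    """Recursively turn an element into an object whose class name is its GI."""
    children = [build(child) for child in elem]
    return class_for(elem.tag)(dict(elem.attrib), children, (elem.text or "").strip())


# Invented sample input; the element names are placeholders.
schema = build(ET.fromstring(
    '<schema><elementType name="para"><mixed/></elementType></schema>'
))
print(type(schema).__name__, [type(c).__name__ for c in schema.children])
```

This sketch does no validation, so unlike the workflow described above it cannot assume that required components are present; a real version would rely on the validating parser for that guarantee.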

Page 29: LT PyXML: A fast validating XML parser embedded in Python

DTD normaliser

- This was a two hour, 1.5 page job:
  - Find the DTD
  - Construct a string file which uses it
  - Open that string
  - Sort the doctype
  - Print the declarations, sorting disjunctions
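
The steps above can be sketched with the standard library. This is not the original normaliser and does not use LT PyXML: it relies on the element-declaration callback of `xml.parsers.expat`, handles element declarations only (attribute lists and entities are ignored), and the names `render` and `normalise` are hypothetical.

```python
from xml.parsers import expat
from xml.parsers.expat import model

# Map expat's quantifier codes to DTD occurrence indicators.
QUANT = {
    model.XML_CQUANT_NONE: "",
    model.XML_CQUANT_OPT: "?",
    model.XML_CQUANT_REP: "*",
    model.XML_CQUANT_PLUS: "+",
}


def render(particle):
    """Render one content particle, sorting the alternatives of disjunctions."""
    kind, quant, name, children = particle
    if kind == model.XML_CTYPE_EMPTY:
        return "EMPTY"
    if kind == model.XML_CTYPE_ANY:
        return "ANY"
    if kind == model.XML_CTYPE_NAME:
        return name + QUANT[quant]
    parts = [render(child) for child in children]
    if kind == model.XML_CTYPE_MIXED:
        body = "|".join(["#PCDATA"] + sorted(parts))
    elif kind == model.XML_CTYPE_CHOICE:
        body = "|".join(sorted(parts))
    else:  # sequence: order is significant, so it is left alone
        body = ",".join(parts)
    return "(" + body + ")" + QUANT[quant]


def normalise(dtd_text, root):
    """Wrap the DTD in a tiny document, parse it, and return sorted declarations."""
    decls = {}
    parser = expat.ParserCreate()
    parser.ElementDeclHandler = lambda name, m: decls.__setitem__(name, render(m))
    doc = "<!DOCTYPE %s [\n%s\n]>\n<%s/>" % (root, dtd_text, root)
    parser.Parse(doc, True)
    return ["<!ELEMENT %s %s>" % (n, decls[n]) for n in sorted(decls)]


dtd = """<!ELEMENT doc (p|h)*>
<!ELEMENT p (#PCDATA|em)*>
<!ELEMENT h EMPTY>
<!ELEMENT em (#PCDATA)>"""
print("\n".join(normalise(dtd, "doc")))
```

Two DTDs normalised this way can be compared line by line, since declaration order and the order of alternatives no longer differ.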

Page 30: LT PyXML: A fast validating XML parser embedded in Python

I can't resist :-)

- Once I got the tools built, I could diff the normalised XHTML draft DTD and the DTD produced from my XHTML schema
- I found one error in the DTD!

Page 31: LT PyXML: A fast validating XML parser embedded in Python

When it's time to railroad, everybody railroads

- The next big challenge for XML, Schemas particularly, is
  - Managing the mapping between document infoset and application infoset
- LT PyXML has proved to be a useful laboratory for exploring this issue