TDX: a High-Performance Table-Driven XML Parser
Wei Zhang
Robert van Engelen
Department of Computer Science
Florida State University
Outline
• Motivation
• Introduction
• Recent Work
• Table-Driven XML Parsing (TDX)
• TDX Construction Toolkit
• Results and Preliminary Conclusion
Motivation
• Enhance performance for XML-based Web services
• Provide flexibility
• Offer high-level modularity
Introduction
• Validating XML parsing has three stages
  • Well-formedness checking
  • Validation
  • Data conversion
• Keeping the stages separate introduces overhead and requires frequent access to the schema
[Figure: XML passes through well-formedness, validation, and data conversion stages before reaching the application]
Introduction (cont'd)
• Schema-specific XML parsing (SSP)
  • Merges well-formedness checking and validation
  • Does not require frequent access to the schema
  • Data conversion is still a separate stage in implemented SSP parsers
[Figure: well-formedness and validation merged into one stage, with data conversion as a separate stage]
Recent Work
Chiu: "A compiler-based approach to schema-specific XML parsing"
• Merges parsing and validation by constructing a pushdown automaton (PDA)
• No namespace support
• Conversion from NFA to DFA may result in exponentially growing space requirements
Recent Work (cont'd)
van Engelen: "Constructing finite automata for high-performance web services"
• Integrates parsing and validation into one stage, with parsing actions encoded by a DFA
• Cannot process cyclic XML schemas
Recent Work (cont'd)
van Engelen: "The gSOAP toolkit for web services and peer-to-peer computing networks"
• Namespace support
• Merges parsing and validation
• Implements a recursive-descent parser
• Disadvantages of recursive descent
  • Code size and function-call overhead
Table-Driven XML Parsing (TDX)
• An LL(1) grammar can be derived from the schema
• XML documents can be parsed and validated using the LL(1) grammar
• Well-formedness (parsing) can be verified through grammar rules
• Validation can be accomplished using semantic actions
• Application-specific events can also be encoded as semantic actions
Illustrating Example
<schema>
  <element name="book" type="bookType"/>
  <complexType name="bookType">
    <sequence>
      <element name="title" type="string"/>
      <element name="author" type="string"/>
    </sequence>
  </complexType>
</schema>
LL(1) Grammar:
s  → '<book>' t '</book>'
t  → t1 t2
t1 → '<title>' DATA //imp_s(s.val) '</title>'
t2 → '<author>' DATA //imp_s(s.val) '</author>'
Illustrating Example (cont'd)
(a) An XML instance
<book>
  <title>XML Tech</title>
  <author>Bob</author>
</book>
(b) Predictive parsing
[Figure: parse tree. s expands to '<book>' t '</book>'; t expands to t1 t2; t1 expands to '<title>' DATA '</title>' and triggers imp_s("XML Tech"); t2 expands to '<author>' DATA '</author>' and triggers imp_s("Bob")]
Roadmap
• Recent Work
• Table-Driven XML Parsing (TDX)
  • Illustrating example
  • Architecture
  • Token generation
  • Mapping schema to LL(1)
  • Parsing table
  • Parsing engine
  • Scanner/tokenizer
• TDX Construction Toolkit
• Experiment Results and Preliminary Conclusion
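TDX - Architecture
[Figure: architecture diagram. The scanner/tokenizer (DFA) turns the XML input into tokens and character data (CDATA). The parsing engine (TDX) consumes the tokens using the LL(1) parsing table and the LL(1) grammar productions and actions (labeled "Modules"), and either delivers events to the application or reports "Error: invalid"]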
Token Generation
• Tokens are defined by <namespace, tag> pairs
  • Element name (opening and closing)
  • Attribute name
• Some data types
  • Such as enumeration
• Namespace binding
  • Identical tag names under different namespaces are represented as different tokens
• Normalized tokens
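As a rough illustration of the <namespace, tag> idea, the sketch below assigns one integer token to each distinct pair, with separate tokens for opening and closing tags. The function name `build_token_table`, the integer encoding, and the reserved `DATA` value are assumptions made for this example; the slides do not show how TDX numbers its tokens.

```python
from itertools import count

DATA = 0  # assumed: token 0 is reserved for character data


def build_token_table(names):
    """Assign one integer token per distinct (namespace, tag) pair.

    Each element name gets two tokens: one for the opening tag and one
    for the closing tag (written here with a leading "/").
    """
    ids = count(1)
    table = {}
    for namespace, tag in names:
        for key in ((namespace, tag), (namespace, "/" + tag)):
            if key not in table:
                table[key] = next(ids)
    return table


# The same tag name under two namespaces yields different tokens.
tokens = build_token_table([
    ("http://www.x.org", "title"),
    ("http://www.y.org", "title"),
])
```

Because the parser only ever compares small integers, the cost of namespace handling is paid once, in the scanner.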
Mapping Schema to LL(1) Grammar
• Structural constraints are mapped to rules
• Validation constraints are mapped to semantic actions
• Note that many types of validation constraints are mapped to rules
  • Such as occurrence and enumeration
Mapping Example (1)
<simpleType name="state">
  <restriction base="string">
    <enumeration value="OFF"/>
    <enumeration value="ON"/>
  </restriction>
</simpleType>
state → "OFF" | "ON"
<simpleType name="value">
  <restriction base="integer">
    <minInclusive value="10"/>
    <maxInclusive value="250"/>
  </restriction>
</simpleType>
value → DATA //imp_i(char *s)
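The slide names the semantic action `imp_i` but does not show its body. A minimal sketch of what such an action might do for the `value` type is given below: convert the character data to an integer and enforce the `minInclusive` and `maxInclusive` facets. The `lo` and `hi` parameters and the use of `ValueError` are assumptions for illustration only.

```python
def imp_i(s, lo=10, hi=250):
    """Convert character data to an integer and check the range facets.

    lo and hi mirror minInclusive/maxInclusive of the "value" type;
    passing them as parameters is an assumption of this sketch.
    """
    try:
        value = int(s.strip())
    except ValueError:
        raise ValueError(f"not an integer: {s!r}") from None
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside [{lo}, {hi}]")
    return value
```

The grammar rule only checks that character data is present; the value-based check lives entirely in the action, which is the division of labor the slide describes.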
Mapping Example (2)
<complexType name="example">
  <choice>
    <element name="id" type="id_type" minOccurs="0"/>
    <element name="value" type="value_type" minOccurs="2" maxOccurs="unbounded"/>
  </choice>
</complexType>
With <choice>:
example → c1 | c2
With <sequence> in place of <choice>:
example → c1 c2
Shared rules:
c1   → '<id>' id_type '</id>'
c1   → ε
c2   → c'2 c'2 c''2
c'2  → '<value>' value_type '</value>'
c''2 → c'2 c''2
c''2 → ε
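The occurrence rules above follow a regular pattern: repeat the item `minOccurs` times, then add a tail that is either bounded or recursive. The sketch below generates such productions mechanically. The function name `expand_occurs`, the generated nonterminal names, and the use of an empty list for ε are assumptions of this example, not part of the TDX toolkit.

```python
def expand_occurs(name, item, min_occurs, max_occurs=None):
    """Return productions (lhs, rhs) for `item` repeated min..max times.

    max_occurs=None means "unbounded". An empty rhs stands for epsilon.
    """
    if max_occurs is None:
        tail = name + "_tail"
        return [
            (name, [item] * min_occurs + [tail]),
            (tail, [item, tail]),   # one more item, then maybe more
            (tail, []),             # or stop
        ]
    optional = max_occurs - min_occurs
    rhs = [item] * min_occurs
    if optional:
        rhs.append(f"{name}_opt1")
    productions = [(name, rhs)]
    for k in range(1, optional + 1):
        rest = [f"{name}_opt{k + 1}"] if k < optional else []
        productions.append((f"{name}_opt{k}", [item] + rest))
        productions.append((f"{name}_opt{k}", []))
    return productions


# minOccurs="2" maxOccurs="unbounded", as for the "value" element
rules = expand_occurs("c2", "value_item", 2)
```

Every generated alternative either starts with `item` or is empty, so one token of lookahead is enough to choose between them as long as `item` cannot also follow the list.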
LL(1) Parsing Table
• Constructed from the LL(1) grammar
• Indexed by nonterminals and terminals
• Each entry contains either the index of a grammar production or an error entry
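To make the table concrete, here is a small sketch that builds it for the book grammar from the illustrating example. It assumes the grammar has no ε rules and no left recursion, which holds for that grammar, so FIRST sets are enough and FOLLOW sets are not needed. The dictionary layout and function names are assumptions; a missing key plays the role of the error entry.

```python
PRODUCTIONS = [
    ("s",  ["<book>", "t", "</book>"]),
    ("t",  ["t1", "t2"]),
    ("t1", ["<title>", "DATA", "</title>"]),
    ("t2", ["<author>", "DATA", "</author>"]),
]
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}


def first(symbol):
    """FIRST set of a symbol (valid only for epsilon-free grammars)."""
    if symbol not in NONTERMINALS:
        return {symbol}
    result = set()
    for lhs, rhs in PRODUCTIONS:
        if lhs == symbol:
            result |= first(rhs[0])
    return result


def build_table():
    """Map (nonterminal, terminal) to the index of the production to apply."""
    table = {}
    for index, (lhs, rhs) in enumerate(PRODUCTIONS):
        for terminal in first(rhs[0]):
            if (lhs, terminal) in table:
                raise ValueError("grammar is not LL(1)")
            table[(lhs, terminal)] = index
    return table


TABLE = build_table()
```

A grammar with ε rules, such as the occurrence rules in Mapping Example (2), would additionally need FOLLOW sets to fill the entries for the empty alternatives.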
Parsing Engine
• Schema independent
• Maintains
  • Parsing table
  • Production table
  • Action table
  • Stack
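The four structures listed above can be seen working together in the following sketch of a table-driven LL(1) loop, run on the book example. It is not the TDX implementation: the table contents are written by hand, actions are marked with a leading `#` inside productions, and tokens are plain `(kind, text)` pairs, all of which are assumptions of this example.

```python
# Production table: right-hand sides; "#name" marks a semantic action.
PRODUCTIONS = [
    ["<book>", "t", "</book>"],                    # 0: s
    ["t1", "t2"],                                  # 1: t
    ["<title>", "DATA", "#imp_s", "</title>"],     # 2: t1
    ["<author>", "DATA", "#imp_s", "</author>"],   # 3: t2
]
# Parsing table: (nonterminal, lookahead) -> production index.
TABLE = {
    ("s", "<book>"): 0,
    ("t", "<title>"): 1,
    ("t1", "<title>"): 2,
    ("t2", "<author>"): 3,
}


def parse(tokens, actions, start="s"):
    """Parse (kind, text) tokens; raise ValueError if the input is invalid."""
    stack = [start]
    pos = 0
    last_text = None
    while stack:
        top = stack.pop()
        if top.startswith("#"):                  # action table lookup
            actions[top[1:]](last_text)
            continue
        if pos == len(tokens):
            raise ValueError("unexpected end of input")
        kind, text = tokens[pos]
        if (top, kind) in TABLE:                 # nonterminal: expand
            stack.extend(reversed(PRODUCTIONS[TABLE[(top, kind)]]))
        elif top == kind:                        # terminal: match and advance
            last_text = text
            pos += 1
        else:
            raise ValueError(f"invalid: expected {top}, got {kind}")
    if pos != len(tokens):
        raise ValueError("trailing input")
    return True


seen = []
tokens = [
    ("<book>", None), ("<title>", None), ("DATA", "XML Tech"),
    ("</title>", None), ("<author>", None), ("DATA", "Bob"),
    ("</author>", None), ("</book>", None),
]
parse(tokens, {"imp_s": seen.append})
```

Nothing in `parse` mentions books, titles, or authors. Swapping in different tables changes the accepted language without touching the loop, which is the sense in which the engine is schema independent.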
Scanner/Tokenizer
• Constructed from the schema
• The schema provides DFA state information
  • Element name
    • Has attribute?
  • Attribute name
• Root element needs special care
  • Schema information
Scanner/Tokenizer Example
<book xmlns:x="http://www.x.org" xmlns:y="http://www.y.org" targetnamespace="http://www.x.org">
  <title>XML Bible</title>
  <author>
    <name>Bob</name>
    <y:title>professor</y:title>
  </author>
</book>
Tokens produced:
<title>XML Bible</title> yields <"www.x.org", "title">, DATA, <"www.x.org", "/title">
<y:title> yields <"www.y.org", "title">
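The example can be mimicked with a small regex-based scanner that resolves prefixes and emits (namespace, tag) tokens. This is only a sketch: TDX generates a flex DFA from the schema, whereas the code below is hand-written, treats prefix bindings as global rather than scoped, ignores attributes other than `xmlns:` declarations, and takes the namespace of unprefixed tags from a hypothetical `default_ns` parameter.

```python
import re

# Either a tag (optional "/", optional prefix, name, attributes) or text.
TAG = re.compile(r"<(/?)([\w.-]+:)?([\w.-]+)([^<>]*)>|([^<>]+)")
XMLNS = re.compile(r'xmlns:([\w.-]+)\s*=\s*"([^"]*)"')


def scan(xml, default_ns):
    """Yield (namespace, tag) tokens and ("DATA", text) tokens."""
    prefixes = {}
    for match in TAG.finditer(xml):
        close, prefix, name, attrs, text = match.groups()
        if text is not None:
            if text.strip():                     # skip whitespace-only text
                yield ("DATA", text.strip())
            continue
        prefixes.update(XMLNS.findall(attrs))    # record prefix bindings
        namespace = prefixes[prefix[:-1]] if prefix else default_ns
        yield (namespace, close + name)


X, Y = "http://www.x.org", "http://www.y.org"
doc = ('<book xmlns:y ="http://www.y.org"><title>XML Bible</title>'
       '<author><name> Bob </name><y:title> professor</y:title></author></book>')
result = list(scan(doc, X))
```

The two `title` elements come out as different tokens, `(X, "title")` and `(Y, "title")`, so the parser never has to look at namespace strings itself.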
TDX Construction Toolkit
[Figure: toolkit workflow. wsdl2TDX takes Service.wsdl and generates Service_flex.l, Service_TDX.h, and Service_TDX.c; flex turns Service_flex.l into tab.yy.c]
Experiment Setup
• Compared with
  • DFA-based parser
  • gSOAP 2.7
  • eXpat 1.2
  • Xerces 2.7.0
• Memory-resident XML message
• Elapsed real time measured using timeofday()
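The measurement setup, a message already held in memory and elapsed real time around repeated parses, can be sketched as follows. The harness is illustrative only: it times Python's standard `xml.etree.ElementTree` parser as a stand-in, not TDX or any of the parsers in the comparison, and the function name, repeat count, and sample message are assumptions.

```python
import time
import xml.etree.ElementTree as ET


def time_parse(message, repeats=1000):
    """Average elapsed wall-clock time per parse, in microseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        ET.fromstring(message)       # message is already in memory: no I/O
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1e6


message = "<book><title>XML Tech</title><author>Bob</author></book>"
average_us = time_parse(message)
```

Averaging over many repeats reduces the effect of timer resolution on a single parse of a small message.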
Parsing Performance (1)
[Figure: bar chart of time (us), from 0 to 350, for EchoString with array size = 1024 B. Parsers compared: TDX, TDX -Cfa, DFA, DFA -Cfa, eXpat, gSOAP, Xerces. Legend: validation, decoding+validation, parsing, parsing+validation]
Parsing Performance (2)
[Figure: log-log plot of time (us), from 1 to 100000, against EchoString array size, from 1 to 10000. Series: Xerces, gSOAP, eXpat, TDX, DFA]
Conclusion
• Enhances parsing speed
• Flexible framework
• Encodes value-based validation and application-specific events as semantic rules
• Combines structural, syntactic, and semantic constraints in one pass
• High level of modularity