tdx: a high-performance table-driven xml parser

Page 1: TDX: a High-Performance              Table-Driven XML Parser

TDX: a High-Performance Table-Driven XML Parser

Wei Zhang

Robert van Engelen

Department of Computer Science

Florida State University

Page 2: TDX: a High-Performance              Table-Driven XML Parser

Outline

• Motivation
• Introduction
• Recent Work
• Table-Driven XML Parsing – TDX
• TDX Construction Toolkit
• Results and Preliminary Conclusion

Page 3: TDX: a High-Performance              Table-Driven XML Parser

Motivation

• Enhance performance for XML-based Web Services
• Provide flexibility
• Offer high-level modularity

Page 5: TDX: a High-Performance              Table-Driven XML Parser

Introduction

• Validating XML parsing has three stages
  • Well-formedness
  • Validation
  • Data conversion
• Separating the stages introduces overhead and requires frequent access to the schema

[Figure: XML input passing through the well-formedness, validation, and data conversion stages before reaching the application]

Page 6: TDX: a High-Performance              Table-Driven XML Parser

Introduction (cont'd)

• Schema-specific XML parsing (SSP)
  • Merges well-formedness checking and validation
  • Does not require frequent access to the schema
  • Data conversion remains a separate stage in implemented SSP parsers

[Figure: stage diagram with Well-formedness, Validation, and Data Conversion]

Page 8: TDX: a High-Performance              Table-Driven XML Parser

Recent Work

• Chiu: “A compiler-based approach to schema-specific XML parsing”
  • Merges parsing and validation by constructing a PDA
  • No namespace support
  • Conversion from NFA to DFA may result in exponentially growing space requirements

Page 9: TDX: a High-Performance              Table-Driven XML Parser

Recent Work (cont'd)

• van Engelen: “Constructing finite automata for high-performance web services”
  • Integrates parsing and validation into one stage, with parsing actions encoded by a DFA
  • Cannot process cyclic XML schemas

Page 10: TDX: a High-Performance              Table-Driven XML Parser

Recent Work (cont'd)

• van Engelen: “The gSOAP toolkit for web services and peer-to-peer computing networks”
  • Namespace support
  • Merges parsing and validation
  • Implements a recursive-descent parser
  • Disadvantages of recursive descent
    • Code size and function call overhead

Page 12: TDX: a High-Performance              Table-Driven XML Parser

Table-Driven XML Parsing (TDX)

• An LL(1) grammar can be derived from the schema
• XML documents can be parsed and validated using the LL(1) grammar
• Well-formedness (parsing) can be verified through grammar rules
• Validation can be accomplished using semantic actions
• Application-specific events can also be encoded as semantic actions

Page 13: TDX: a High-Performance              Table-Driven XML Parser

Illustrating Example

<schema>
  <element name="book" type="bookType"/>
  <complexType name="bookType">
    <sequence>
      <element name="title" type="string"/>
      <element name="author" type="string"/>
    </sequence>
  </complexType>
</schema>

LL(1) Grammar:

s  → '<book>' t '</book>'
t  → t1 t2
t1 → '<title>' DATA //imp_s(s.val) '</title>'
t2 → '<author>' DATA //imp_s(s.val) '</author>'

Page 14: TDX: a High-Performance              Table-Driven XML Parser

Illustrating Example (cont'd)

(a) An XML Instance

<book>
  <title>XML Tech</title>
  <author>Bob</author>
</book>

(b) Predictive Parsing

s → '<book>' t '</book>'
  t → t1 t2
    t1 → '<title>' DATA '</title>', with action imp_s("XML Tech")
    t2 → '<author>' DATA '</author>', with action imp_s("Bob")
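The predictive parse of the book instance can be sketched as a small table-driven loop. This is a minimal illustration in Python, not the TDX implementation: the table layout, the "#imp_s" marker for actions, and the (kind, text) token shape are all assumptions made for the example.

```python
# Productions of the slide's grammar. Names starting with "#" are semantic
# actions; this encoding is an assumption, not TDX's actual table layout.
PRODUCTIONS = [
    ["<book>", "t", "</book>"],                   # 0: s  -> '<book>' t '</book>'
    ["t1", "t2"],                                 # 1: t  -> t1 t2
    ["<title>", "DATA", "#imp_s", "</title>"],    # 2: t1 -> '<title>' DATA '</title>'
    ["<author>", "DATA", "#imp_s", "</author>"],  # 3: t2 -> '<author>' DATA '</author>'
]
NONTERMINALS = {"s", "t", "t1", "t2"}

# Parsing table: (nonterminal, lookahead token) -> production index.
# A missing key plays the role of an error entry.
TABLE = {
    ("s", "<book>"): 0,
    ("t", "<title>"): 1,
    ("t1", "<title>"): 2,
    ("t2", "<author>"): 3,
}


def parse(tokens):
    """Parse (kind, text) tokens; return the strings passed to imp_s."""
    tokens = list(tokens) + [("$", None)]    # end-of-input marker
    stack = ["$", "s"]
    collected, last_text, pos = [], None, 0
    while stack:
        top = stack.pop()
        kind, text = tokens[pos]
        if top.startswith("#"):              # semantic action: record last DATA
            collected.append(last_text)
        elif top in NONTERMINALS:            # expand using the parsing table
            index = TABLE.get((top, kind))
            if index is None:
                raise ValueError(f"invalid: unexpected {kind} while parsing {top}")
            stack.extend(reversed(PRODUCTIONS[index]))
        elif top == kind:                    # match a terminal, advance input
            last_text = text
            pos += 1
        else:
            raise ValueError(f"invalid: expected {top}, got {kind}")
    return collected


book = [("<book>", None), ("<title>", None), ("DATA", "XML Tech"),
        ("</title>", None), ("<author>", None), ("DATA", "Bob"),
        ("</author>", None), ("</book>", None)]
print(parse(book))
```

The loop itself never mentions book, title, or author; only the tables do, which is the property that lets one engine serve different schemas.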

Page 15: TDX: a High-Performance              Table-Driven XML Parser

Roadmap

• Recent Work
• Table-Driven XML Parsing – TDX
  • Illustrating example
  • Architecture
  • Token generation
  • Mapping schema to LL(1)
  • Parsing table
  • Parsing engine
  • Scanner/tokenizer
• TDX Construction Toolkit
• Experiment Results and Preliminary Conclusion

Page 16: TDX: a High-Performance              Table-Driven XML Parser

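TDX - Architecture

[Figure: TDX architecture. Labeled components: XML input; Scanner/Tokenizer (DFA) producing tokens and CDATA; Parsing Engine (TDX); modules holding the LL(1) parsing table and the LL(1) grammar productions and actions; events delivered to the application; "Error: invalid" output.]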

Page 18: TDX: a High-Performance              Table-Driven XML Parser

Token Generation

• Defined by <namespace, tag>
  • Element name (opening and closing)
  • Attribute name
• Some data types
  • Such as enumeration
• Namespace binding
  • Identical tag names under different namespaces are represented as different tokens
• Normalized tokens
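One way to picture namespace-aware tokens is a table that hands out one small integer per distinct <namespace, tag> pair. The sketch below is an assumption about what "normalized tokens" could look like; the TokenTable class and its intern method are hypothetical names, not part of TDX.

```python
class TokenTable:
    """Assigns one integer code per distinct (namespace, tag) pair."""

    def __init__(self):
        self._codes = {}

    def intern(self, namespace, tag):
        key = (namespace, tag)
        if key not in self._codes:
            self._codes[key] = len(self._codes)   # next unused code
        return self._codes[key]


X = "http://www.x.org"
Y = "http://www.y.org"

table = TokenTable()
x_title = table.intern(X, "title")          # opening tag in namespace X
y_title = table.intern(Y, "title")          # same tag name, other namespace
x_title_close = table.intern(X, "/title")   # closing tag gets its own token
```

With integer codes, the parsing table can be indexed directly by token, and two elements that share a tag name but live in different namespaces can never be confused.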

Page 20: TDX: a High-Performance              Table-Driven XML Parser

Mapping Schema to LL(1) Grammar

• Structural constraints are mapped to rules
• Validation constraints are mapped to semantic actions
  • Note that many types of validation constraints are mapped to rules
    • Such as occurrence, enumeration

Page 21: TDX: a High-Performance              Table-Driven XML Parser

Mapping Example (1)

<simpleType name="state">
  <restriction base="string">
    <enumeration value="OFF"/>
    <enumeration value="ON"/>
  </restriction>
</simpleType>

state → "OFF" | "ON"

<simpleType name="value">
  <restriction base="integer">
    <minInclusive value="10"/>
    <maxInclusive value="250"/>
  </restriction>
</simpleType>

value → DATA //imp_i(char *s)
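The two mappings differ in where the check happens: the enumeration becomes grammar alternatives, while the integer range is left to a semantic action. A rough Python analogue follows; the slides only name the action imp_i, so the range check inside it and the match_state helper are assumptions, and int() only approximates the lexical rules of the schema integer type.

```python
def imp_i(text, lo=10, hi=250):
    """Hypothetical analogue of the imp_i action: convert DATA, check the range."""
    value = int(text.strip())          # raises ValueError for non-integers
    if not lo <= value <= hi:
        raise ValueError(f"{value} is outside [{lo}, {hi}]")
    return value


# Enumeration handled structurally: each literal is one grammar alternative.
STATE_ALTERNATIVES = ("OFF", "ON")


def match_state(text):
    """Accept text only if it is one of the alternatives of the state rule."""
    if text not in STATE_ALTERNATIVES:
        raise ValueError(f"{text!r} matches no alternative of state")
    return text
```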

Page 22: TDX: a High-Performance              Table-Driven XML Parser

Mapping Example (2)

<complexType name="example">
  <choice>
    <element name="id" type="id_type" minOccurs="0"/>
    <element name="value" type="value_type" minOccurs="2" maxOccurs="unbounded"/>
  </choice>
</complexType>

example → c1 | c2
c1   → '<id>' id_type '</id>'
c1   → ε
c2   → c'2 c'2 c''2
c'2  → '<value>' value_type '</value>'
c''2 → c'2 c''2
c''2 → ε

With <sequence> in place of <choice>: example → c1 c2
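The occurrence mapping can be generated mechanically: repeat the item minOccurs times, then add either a recursive tail (unbounded) or a chain of optional items (bounded). The function below is a sketch under that reading of the slide; occurrence_rules and the generated nonterminal names are hypothetical.

```python
def occurrence_rules(name, item, min_occurs, max_occurs=None):
    """Return {nonterminal: [alternatives]}; an empty list stands for ε.

    max_occurs=None models maxOccurs="unbounded".
    """
    rules = {}
    required = [item] * min_occurs
    if max_occurs is None:
        tail = f"{name}_more"
        rules[name] = [required + [tail]]
        rules[tail] = [[item, tail], []]          # tail -> item tail | ε
        return rules
    optional = max_occurs - min_occurs
    rules[name] = [required + ([f"{name}_opt1"] if optional > 0 else [])]
    for i in range(1, optional + 1):
        rest = [f"{name}_opt{i + 1}"] if i < optional else []
        rules[f"{name}_opt{i}"] = [[item] + rest, []]   # one more item, or stop
    return rules


# minOccurs="2", maxOccurs="unbounded", as for the value element above
value_rules = occurrence_rules("c2", "c2_item", 2)
# minOccurs="0" with the default maxOccurs of 1, as for the id element
id_rules = occurrence_rules("c1", "c1_item", 0, 1)
```

Here value_rules reproduces the shape of c2 → c'2 c'2 c''2 with c''2 → c'2 c''2 | ε, using c2_item for c'2 and c2_more for c''2.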

Page 24: TDX: a High-Performance              Table-Driven XML Parser

LL(1) Parsing Table

• Constructed from the LL(1) grammar
• Indexed by nonterminals and terminals
• Each entry contains either the index of a grammar production or an error entry
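The standard textbook construction computes FIRST and FOLLOW sets and fills one entry per (nonterminal, lookahead) pair. The sketch below applies that construction to a repetition grammar like the one in Mapping Example (2); it is not taken from TDX, and the function name, the dictionary representation, and the c2_item/c2_more symbols are assumptions.

```python
EPS, END = "ε", "$"


def build_ll1_table(grammar, start):
    """grammar: {nonterminal: [rhs, ...]}, each rhs a list of symbols ([] is ε).

    Returns {(nonterminal, terminal): index of the alternative to expand}.
    Pairs that are absent from the result are error entries.
    """
    first = {n: set() for n in grammar}
    follow = {n: set() for n in grammar}
    follow[start].add(END)

    def first_of(symbols):
        out = set()
        for sym in symbols:
            f = first[sym] if sym in grammar else {sym}
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)                     # every symbol can derive ε
        return out

    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for head, alternatives in grammar.items():
            for rhs in alternatives:
                new_first = first_of(rhs)
                if not new_first <= first[head]:
                    first[head] |= new_first
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue
                    rest = first_of(rhs[i + 1:])
                    new_follow = rest - {EPS}
                    if EPS in rest:
                        new_follow |= follow[head]
                    if not new_follow <= follow[sym]:
                        follow[sym] |= new_follow
                        changed = True

    table = {}
    for head, alternatives in grammar.items():
        for index, rhs in enumerate(alternatives):
            f = first_of(rhs)
            lookaheads = (f - {EPS}) | (follow[head] if EPS in f else set())
            for token in lookaheads:
                if (head, token) in table:
                    raise ValueError(f"not LL(1): conflict at ({head}, {token})")
                table[(head, token)] = index
    return table


grammar = {
    "c2": [["c2_item", "c2_item", "c2_more"]],
    "c2_more": [["c2_item", "c2_more"], []],
    "c2_item": [["<value>", "DATA", "</value>"]],
}
ll1 = build_ll1_table(grammar, "c2")
```

The ε alternative of c2_more is selected on the end marker, because that is the only token in its FOLLOW set.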

Page 26: TDX: a High-Performance              Table-Driven XML Parser

Parsing Engine

• Schema independent
• Maintains
  • Parsing table
  • Production table
  • Action table
  • Stack

Page 28: TDX: a High-Performance              Table-Driven XML Parser

Scanner/Tokenizer

• Constructed from the schema
• Schema provides DFA state information
  • Element name
    • Has attribute?
  • Attribute name
• Root element needs special care
  • Schema information

Page 29: TDX: a High-Performance              Table-Driven XML Parser

Scanner/Tokenizer Example

<book xmlns:x="http://www.x.org"
      xmlns:y="http://www.y.org"
      targetnamespace="http://www.x.org">
  <title>XML Bible</title>
  <author>
    <name> Bob </name>
    <y:title> professor</y:title>
  </author>
</book>

Resulting tokens:

• <title> → <"www.x.org", "title">
• XML Bible → DATA
• </title> → <"www.x.org", "/title">
• <y:title> → <"www.y.org", "title">
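The token stream for this example can be imitated with a small regex scanner. This is only a sketch: TDX generates a DFA scanner with flex, whereas the code below uses Python's re module, ignores attributes, and follows the slide's simplification that unprefixed names belong to the target namespace. The tokenize function and its parameters are hypothetical.

```python
import re

# One alternative for tags (optional "/" and optional prefix),
# one for character data between tags.
PIECE = re.compile(r"<(/?)(?:(\w+):)?(\w+)[^>]*>|([^<]+)")


def tokenize(xml, bindings, target_ns):
    """Return (namespace, tag) pairs for tags and ("DATA", text) for content."""
    tokens = []
    for match in PIECE.finditer(xml):
        slash, prefix, local, text = match.groups()
        if local is None:                        # character data
            if text.strip():                     # skip whitespace-only runs
                tokens.append(("DATA", text.strip()))
            continue
        namespace = bindings[prefix] if prefix else target_ns
        tokens.append((namespace, slash + local))
    return tokens


X, Y = "http://www.x.org", "http://www.y.org"
fragment = "<title>XML Bible</title><y:title> professor</y:title>"
result = tokenize(fragment, {"x": X, "y": Y}, target_ns=X)
```

The two title elements come out as different tokens because the namespace is part of the token, not because the scanner looks at their position in the document.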

Page 31: TDX: a High-Performance              Table-Driven XML Parser

TDX Construction Toolkit

[Figure: toolkit workflow. Labeled items: Service.wsdl as input to wsdl2TDX; generated files Service_flex.l, Service_TDX.h, and Service_TDX.c; flex producing tab.yy.c.]

Page 33: TDX: a High-Performance              Table-Driven XML Parser

Experiment Setup

• Compared with
  • DFA-based parser
  • gSOAP 2.7
  • eXpat 1.2
  • Xerces 2.7.0
• Memory-resident XML message
• Elapsed real time measured using timeofday()

Page 34: TDX: a High-Performance              Table-Driven XML Parser

Parsing Performance (1)

[Figure: bar chart for EchoString, array size = 1024B. Y-axis: time (us), 0 to 350. Bars: TDX, TDX -Cfa, DFA, DFA -Cfa, eXpat, gSOAP, Xerces. Legend: validation, decoding+validation, parsing, parsing+validation.]

Page 35: TDX: a High-Performance              Table-Driven XML Parser

Parsing Performance (2)

[Figure: line chart on logarithmic axes. X-axis: EchoString array size, 1 to 10000. Y-axis: time (us), 1 to 100000. Series: Xerces, gSOAP, eXpat, TDX, DFA.]

Page 36: TDX: a High-Performance              Table-Driven XML Parser

Conclusion

• Enhances parsing speed
• Flexible framework
  • Encodes value-based validation and application-specific events as semantic rules
  • Combines structural, syntactic, and semantic constraints in one pass
• High level of modularity