1
XDuce
XDuce: A Statically Typed XML Processing Language, by Hosoya and Pierce
Presented by: Guy Korland
Based on a presentation by: Tabuchi Naoshi ([email protected])
2
Presentation Outline
Introduction (XDuce is pronounced “transduce”)
Programming in XDuce
    Values
    Regular Expression Types
    Subtyping
    Pattern matching
Conclusions
3
XDuce: What For?
A functional language for XML processing.
Based on:
    Regular Expression Types
    Pattern Matching
Statically typed, i.e. outputs are statically checked for conformance to a DTD, etc.
4
Advantages (vs. “untyped”)
“Untyped” XML processing: programs using DOM etc.
Little connection between the program and the XML schema.
Validity can be checked only at run-time, if at all.
5
Advantages (vs. “embedding”)
“Embedding”: mapping an XML schema into the language’s type system.
e.g.
    <!ELEMENT person (name, mail*, tel?)>          (DTD)
    type person = name * mail list * tel option    (ML)
6
Advantages (vs. “embedding”)
Embedding does not suit intuition in some cases.
e.g. intuitively…
    (name, mail*, tel?) <: (name, mail*, tel*)
but not
    name * mail list * tel option <: name * mail list * tel list    (ML)
7
Values
Values are XML documents (input, output, intermediate).
Syntax:
    XDuce’s native syntax.
    Standard XML syntax (documents).
8
Values(cont.)
Standard XML syntax:
    <!-- mybook.xml -->
    <addrbook>
      <person>
        <name> Haruo Hosoya </name>
        <email> hahosoya@kyoto-u </email>
        <email> hahosoya@upenn </email>
      </person>
      <person>
        <name> Benjamin Pierce </name>
        <email> bcpierce@upenn </email>
        <tel> 123-456-789 </tel>
      </person>
    </addrbook>

    let val doc = load_xml("mybook.xml")
9
Values(cont.)
XDuce’s native syntax:
    let val mybook = addrbook[
      person[name["Haruo Hosoya"],
             email["hahosoya@kyoto-u"],
             email["hahosoya@upenn"]],
      person[name["Benjamin Pierce"],
             email["bcpierce@upenn"],
             tel["123-456-789"]]]
Constructor: label[…], where … is a sequence of other values.
Strings are enclosed in double quotes, unlike in XML.
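
As a rough illustration of how the two syntaxes relate, the sketch below models XDuce-style values in Python and serializes them to standard XML. The encoding (an element as a `(label, children)` tuple, text as a `str`, a sequence as a list) and the function name `to_xml` are assumptions made for this example, not part of XDuce.

```python
from xml.sax.saxutils import escape

# Hypothetical encoding (not XDuce's own): an element is a (label, children)
# tuple, text is a plain str, and a sequence of values is a Python list.
def to_xml(seq):
    """Serialize a sequence of values to standard XML syntax."""
    parts = []
    for v in seq:
        if isinstance(v, str):
            parts.append(escape(v))          # text content
        else:
            label, children = v
            parts.append(f"<{label}>{to_xml(children)}</{label}>")
    return "".join(parts)

mybook = [("addrbook", [
    ("person", [("name", ["Haruo Hosoya"]),
                ("email", ["hahosoya@kyoto-u"]),
                ("email", ["hahosoya@upenn"])]),
    ("person", [("name", ["Benjamin Pierce"]),
                ("email", ["bcpierce@upenn"]),
                ("tel", ["123-456-789"])]),
])]

print(to_xml(mybook))
```

The nesting of the Python tuples mirrors the nesting of `label[…]` constructors in the native syntax.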
10
Regular Expression Types
Types are defined in regular expression form with labels.
Concatenation, alternation (union), and repetition are the basic constructors.
Labels correspond to elements of XML (person, name, mail, etc…).
11
Regular Expression Types (cont.)
Example:
    type Addrbook = addrbook[(Name, Addr, Tel?)*]
    type Name = name[String]
    type Addr = addr[String]
    type Tel = tel[String]
Corresponding DTD:
    <!ELEMENT addrbook (name, addr, tel?)*>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT addr (#PCDATA)>
    <!ELEMENT tel (#PCDATA)>
Note: Name, Addr, and Tel inside addrbook[…] are types, not labels.
12
Regular Expression Types (cont.)
Syntax of Types:
    T ::= ()      (* empty sequence type *)
        | X       (* type variable *)
        | L[T]    (* label *)
        | T,T     (* concat. *)
        | T|T     (* alter. *)
        | T*      (* rep. *)
where
    X : Type Variable (String, Int, …)
    L : Label
13
Regular Expression Types (cont.)
Syntactic sugar:
    T+ ≡ T,T*
    T? ≡ T|()
Types can be (mutually) recursive:
    type Folder = Entry*
    type Entry = name[String], file[String]
               | name[String], folder[Folder]
14
Regular Expression Types (cont.)
Syntax of Labels:
    L ::= l       (* specific label *)
        | ~       (* wildcard label *)
        | L|L     (* union *)
        | L\L     (* difference *)
15
Regular Expression Types (cont.)
The label class ~ represents the set of all labels.
We can define a type Any:
    type Any = (~[Any] | Int | Float | String)*
Label union:
    type Heading = (h1|h2|h3|h4|h5|h6)[Inline]    (* HTML headings *)
16
Subtyping
The meaning of subtyping is as usual: all values t of T are also values of T’.
    T <: T’ ⇔ ∀t. t ∈ T ⇒ t ∈ T’
Examples:
    Name,Addr <: Name,Addr,Tel?
    Name,Addr,Tel <: Name,Addr,Tel?
    addrbook[Name,Addr,Name,Addr,Tel] <: addrbook[(Name,Addr,Tel?)*]
17
Subtyping - Union Types
Union (or alternation) type constructor |.
Examples:
    Name <: Name | Tel
    Tel <: Name | Tel
Forgetting ordering:
    (Name,Addr)*,(Name,Tel)* <: ((Name,Addr)|(Name,Tel))*
Distributivity:
    (Name,Tel)|(Name,Addr) <: Name,(Addr|Tel)
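
Because subtyping is set inclusion, the flat examples on this slide and the previous one can be sanity-checked by brute force. The sketch below is only a bounded test, not XDuce's subtyping algorithm: it assumes one character per element type (N = Name, A = Addr, T = Tel), treats each flat type as an ordinary Python regex, and compares the sets of sequences up to a fixed length. The helper names `values_of` and `subtype_bounded` are made up for this example.

```python
import re
from itertools import product

# Assumption: one character per element type (N = Name, A = Addr, T = Tel),
# so a flat regular expression type becomes an ordinary Python regex.
def values_of(type_regex, alphabet="NAT", max_len=6):
    """All label sequences up to max_len that belong to the type."""
    pattern = re.compile(type_regex)
    words = ("".join(w)
             for n in range(max_len + 1)
             for w in product(alphabet, repeat=n))
    return {w for w in words if pattern.fullmatch(w)}

def subtype_bounded(s, t):
    """Bounded check of S <: T: every value of S (up to max_len) is in T."""
    return values_of(s) <= values_of(t)

print(subtype_bounded("NA", "NAT?"))                   # Name,Addr <: Name,Addr,Tel?
print(subtype_bounded("(NA)*(NT)*", "((NA)|(NT))*"))   # forgetting ordering
print(subtype_bounded("(NT)|(NA)", "N(A|T)"))          # distributivity
print(subtype_bounded("((NA)|(NT))*", "(NA)*(NT)*"))   # the converse direction
```

A bounded check like this can refute a subtyping claim with a counterexample, but it cannot prove one in general.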
18
Subtyping - Subtagging
Allows subtyping between types with different labels (beyond the expressive power of DTDs).
e.g. (HTML)
    subtag i <: fontstyle
    subtag b <: fontstyle
so that
    i[T] <: fontstyle[T]
    b[T] <: fontstyle[T]
19
Complexity of Subtyping
In general, deciding the subtype relation (T <: T’) is equivalent to deciding inclusion between context-free grammars (CFGs): undecidable!
Some restrictions on the syntax are needed.
(next slide…)
20
Well-formedness of Types
Syntactic restriction on types to ensure “regularity”.
Recursive uses of types can occur only at the tail position of a type definition, or inside labels.
21
Well-formed Types: Examples
    type X = Int, Y
    type Y = String, X | ()
and
    type Z = String, lab[Z], String | ()
are well-formed, but
    type U = Int, U, String | ()
is not.
22
Complexity of Subtyping, again
With well-formedness, checking the subtype relation is:
    still EXPTIME-complete (equivalent to inclusion of tree automata [CDG+]),
    but acceptable in practical cases.
23
Pattern matching
ML-like pattern matching:
    “pattern -> expression”
Example:
    val url = match v with
        www[val s as String] -> "http://" ^ s
      | email[val s as String] -> "mailto:" ^ s
      | ftp[val s as String] -> "ftp://" ^ s
24
Pattern matching (cont.)
Pattern matching can also involve regular expression types.
e.g.
    match p with
    | person[name[String], (val ms as Mail*), (val t as Tel?)] -> …
25
Pattern matching (cont.)
Functions: reusable pattern matching.
Example:
    fun make_url(val s as String): String =
      match s with
        www[val s as String] -> "http://" ^ s
      | email[val s as String] -> "mailto:" ^ s
      | ftp[val s as String] -> "ftp://" ^ s
26
Policies of Pattern Matching
Pattern matching has two basic policies:
    First-match (as in ML): only the first pattern that matches is taken.
    Longest-match (as usual in regexp matching on strings): each sub-pattern matches as much as possible.
27
First-match: Example
(* p = person[name, mail, tel] *)
match p with
| person[Name, (val ms as Mail*), Tel]
    -> (* invoked *)
| person[Name, (val ms as Mail*), Tel?]
    -> (* not invoked *)
28
Longest-match: Example
(* p = person[name, mail, mail, tel] *)
match p with
| … (val m1 as Mail*), (val m2 as Mail*), …
    -> (* m1 = mail, mail
          m2 = () *)
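
The longest-match policy behaves like greedy quantifiers in string regular expressions. The sketch below shows the analogy with Python's `re` module; it assumes one character per element (n = name, m = mail, t = tel) and fills in the elided parts of the pattern with `name` and `tel`, which is a guess made only for this example.

```python
import re

# Assumption: one character per element (n = name, m = mail, t = tel).
p = "nmmt"   # stands for person[name, mail, mail, tel]

# Two adjacent greedy groups: the first takes as many mails as it can.
match = re.fullmatch(r"n(?P<m1>m*)(?P<m2>m*)t", p)
m1, m2 = match.group("m1"), match.group("m2")

print(repr(m1), repr(m2))   # prints: 'mm' ''
```

The first group consumes both mails, leaving the second group with the empty sequence, as in the slide.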
29
Exhaustiveness and Redundancy
Pattern matches are checked for exhaustiveness and redundancy.
    Exhaustiveness: no “omission” of values.
    Redundancy: no never-matched patterns.
30
Exhaustiveness
A pattern match
    P1 -> e1 | … | Pn -> en
is exhaustive (wrt. input type T)
    ⇔ all values t ∈ T are matched by some Pi,
or, equivalently,
    T <: P1 | … | Pn
31
Exhaustiveness: Example (1/2)
(* type Person = person[Name, Mail*, Tel?] *)
match p with
| person[Name, Mail*, Tel] -> ...
| person[Name, Mail*] -> ...
is an exhaustive pattern match (wrt. Person).
32
Exhaustiveness: Example (2/2)
(* type Person = person[Name, Mail*, Tel?] *)
match p with
| person[Name, Mail*, Tel] -> ...
| person[Name, Mail+] -> ...
is NOT exhaustive (wrt. Person): person[name[...]] does not match.
33
Redundancy
A pattern Pi is redundant in
    P1 -> e1 | … | Pn -> en
(wrt. input type T)
    ⇔ all values matched by Pi are also matched by P1 | ... | Pi-1
34
Redundancy: Example
(* type Person = person[Name, Mail*, Tel?] *)
match p with
| person[Name, Mail*, Tel?]
    -> ...
| person[Name, Mail*]
    -> ...
The second pattern is redundant: anything that matches the second pattern also matches the first one.
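
Both checks reduce to set comparisons, so the Person examples from the last few slides can be replayed by brute force. The sketch below is a bounded test only, not the tree-automata algorithm XDuce uses. It assumes one character per element type (N = Name, M = Mail, T = Tel) and models the content of person[…] as a Python regex; `missing_values` and `redundant` are hypothetical helper names.

```python
import re
from itertools import product

def values_of(regex, alphabet="NMT", max_len=5):
    """All sequences over the alphabet, up to max_len, matched by the regex."""
    pat = re.compile(regex)
    words = ("".join(w)
             for n in range(max_len + 1)
             for w in product(alphabet, repeat=n))
    return {w for w in words if pat.fullmatch(w)}

def missing_values(input_type, patterns):
    """Values of the input type matched by no pattern (empty set: exhaustive)."""
    covered = set().union(*(values_of(p) for p in patterns))
    return values_of(input_type) - covered

def redundant(input_type, patterns, i):
    """True if pattern i only matches inputs already matched by earlier patterns."""
    earlier = set().union(*(values_of(p) for p in patterns[:i]))
    return (values_of(patterns[i]) & values_of(input_type)) <= earlier

PERSON = "NM*T?"   # content of person[Name, Mail*, Tel?]

print(missing_values(PERSON, ["NM*T", "NM*"]))   # exhaustive example
print(missing_values(PERSON, ["NM*T", "NM+"]))   # non-exhaustive example
print(redundant(PERSON, ["NM*T?", "NM*"], 1))    # redundancy example
```

In the non-exhaustive case the only missing value is the bare name, which corresponds to person[name[...]].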
35
Complete Example (1/3)
(* types for input documents *)
type Addrbook = addrbook[Person*]
type Person = person[Name,Email*,Tel?]
type Name = name[String]
type Email = email[String]
type Tel = tel[String]

(* types for output documents *)
type TelBook = telbook[TelPerson*]
type TelPerson = person[Name,Tel]

(* load an address book *)
let val doc = load_xml("mybook.xml")
36
Complete Example (2/3)
(* validate it against the type Addrbook *)
let val valid_doc = validate doc with Addrbook

(* extract the content of the top label addrbook *)
let val out_doc =
  match valid_doc with
    addrbook[val persons as Person*] ->
      telbook[make_tel_book(persons)]

(* save out_doc to output.xml *)
save_xml("output.xml")(out_doc)
37
Complete Example (3/3)
(* take ps of type Person* and return TelPerson* *)
fun make_tel_book (val ps as Person*) : TelPerson* =
  match ps with
    person[name[val n as String], Email*, tel[val t as String]],
    val rest as Person*
      -> person[name[n], tel[t]], make_tel_book(rest)    (* recursive call *)
  | person[name[val n as String], Email*], val rest as Person*
      -> make_tel_book(rest)
  | () -> ()
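
For readers more used to mainstream languages, here is a rough Python rendering of make_tel_book. It is a sketch under assumptions: each person is modeled as a dict with a "name", a list of "emails", and an optional "tel", which is not how XDuce represents values, and Python performs none of the static checks that XDuce does.

```python
# Assumption: a Person is a dict with "name", "emails" and an optional "tel".
def make_tel_book(persons):
    """Keep only persons that have a tel, dropping their emails."""
    if not persons:                       # | () -> ()
        return []
    first, rest = persons[0], persons[1:]
    if first.get("tel") is not None:      # person[name, Email*, tel], rest
        entry = {"name": first["name"], "tel": first["tel"]}
        return [entry] + make_tel_book(rest)
    return make_tel_book(rest)            # person[name, Email*], rest

addrbook = [
    {"name": "Haruo Hosoya",
     "emails": ["hahosoya@kyoto-u", "hahosoya@upenn"]},
    {"name": "Benjamin Pierce",
     "emails": ["bcpierce@upenn"], "tel": "123-456-789"},
]

print(make_tel_book(addrbook))
```

The three branches correspond one-to-one to the three patterns in the XDuce function.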
38
Conclusion
The expressiveness of regular expression types and pattern matching is useful for XML processing.
Type inference (including checking the subtype relation) is possible and efficient in most practical cases. (Appendix 2)
39
Applications
Bookmarks (Mozilla bookmark extraction).
Html2Latex.
Diff (diff for XML).
Each is 300–350 lines.
40
Future Work
Precise type inference on all variables.
Introducing an Any type:
    Not possible in a naïve way.
    Breaks the closure property of tree automata.
    Makes type inference impossible.
41
References
XDuce: A Statically Typed XML Processing Language: Hosoya and Pierce
XDuce: A typed XML Processing Language: Hosoya and Pierce
Regular Expression Pattern Matching for XML: Hosoya and Pierce
Regular Expression Types for XML: Hosoya, Vouillon, and Pierce
Available @ http://xduce.sourceforge.net
42
Appendix 1:Type Inference
43
Type Inference (1/2)
Infer the types of variables in patterns.
The results are exact types of variables.
The type of each variable depends on the pattern itself and on the type of the input.
44
Type Inference (2/2)
Type inference is “flow-sensitive”.
In P1 -> e1 | … | Pn -> en, inference on Pi depends on P1 ... Pi-1.
Because... values matched by Pi are those NOT matched by P1 ... Pi-1.
45
Type Inference: Example (1/2)
(* p :: person[name[], mail*, tel[]?] *)
match p with
| person[name[], rest] -> …
The type of rest is inferred as
    mail*, tel[]?
in this case.
46
Type Inference: Example (2/2)
match p with
| person[name[], tel[]] -> …
| person[name[], rest] -> …
The type of rest becomes
    (mail+, tel[]?) | ()
in this case, because person[name[], (), tel[]] is matched by the first pattern.
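
The inferred type can be read as a set difference: the values the input type allows for rest, minus those already taken by the first pattern. The sketch below checks that reading by bounded enumeration. It assumes one character per element (M = mail, T = tel[]) and regexes for the flat types; it is a sanity check, not the inference algorithm itself.

```python
import re
from itertools import product

def values_of(regex, alphabet="MT", max_len=6):
    """All sequences over the alphabet, up to max_len, matched by the regex."""
    pat = re.compile(regex)
    words = ("".join(w)
             for n in range(max_len + 1)
             for w in product(alphabet, repeat=n))
    return {w for w in words if pat.fullmatch(w)}

# What can follow name[] according to the input type: mail*, tel[]?
after_name = values_of("M*T?")
# What follows name[] in values taken by the first pattern: tel[]
taken_by_first = values_of("T")

remaining = after_name - taken_by_first
inferred = values_of("(M+T?)?")          # (mail+, tel[]?) | ()

print(remaining == inferred)
```

Up to the chosen length bound, the two sets coincide, which is what the slide's inferred type claims.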
47
Type Inference: Limitations
“Exact” type inference is possible only on:
    variables at tail position, or
    variables inside labels (cf. well-formedness).
The limitation comes from the internal representation of patterns (binary trees).
48
Appendix 2:Algorithms for Pattern Matching
49
Algorithms for Pattern Matching
Pattern matching takes the following steps:
    Translation of values into internal forms (binary trees).
    Translation of types and patterns into internal forms (binary trees and tree automata).
    Matching of values against patterns, in terms of tree automata.
50
Internal Forms of Values
Values are represented as binary trees internally:
    t ::= ε          (* leaves *)
        | l(t, t)    (* labels *)
The first child is the content of the label; the second is the remainder of the sequence.
51
Internal Forms of Values: Example
person[name[], mail[], mail[]]
is translated into
person(name(ε,mail(ε,mail(ε,ε))),ε)
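
The translation can be sketched in a few lines of Python. The representation is an assumption made for this example: an external sequence is a list of `(label, children)` pairs, and a binary tree is either `None` (for ε) or a `(label, content, remainder)` triple.

```python
# Assumption: an external sequence is a list of (label, children) pairs;
# a binary tree is None (for ε) or a (label, content, remainder) triple.
def encode(seq):
    """Translate a sequence of elements into its binary-tree form."""
    if not seq:
        return None
    (label, children), rest = seq[0], seq[1:]
    return (label, encode(children), encode(rest))

def show(t):
    """Print a binary tree in the l(t, t) notation of the slides."""
    if t is None:
        return "ε"
    label, content, remainder = t
    return f"{label}({show(content)},{show(remainder)})"

person = [("person", [("name", []), ("mail", []), ("mail", [])])]
print(show(encode(person)))   # prints: person(name(ε,mail(ε,mail(ε,ε))),ε)
```

Siblings end up chained through the second child, so a sequence of any length fits into a tree with exactly two children per node.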
52
Internal Forms of Types
Types are also translated into binary trees:
    T ::= φ          (* empty *)
        | ε          (* leaves *)
        | T|T        (* union *)
        | l(X, X)    (* label *)
X ranges over states, which are used in tree automata.
53
Internal Forms of Types: Tree Automata
A tree automaton M is a mapping States -> Types, e.g.
    M(X) = name(Y, Z)
    M(Y) = ε
    M(Z) = mail(Y, Z) | ε
    ...
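
To make the mapping concrete, the sketch below encodes the example automaton M as a Python dict and tests whether a binary tree is accepted from a given state. The representation (a list of alternatives per state, `"eps"` for ε, `None` for the ε tree) and the function name `accepts` are assumptions made for this example.

```python
# Each state maps to a list of alternatives: "eps" or (label, state, state).
M = {
    "X": [("name", "Y", "Z")],
    "Y": ["eps"],
    "Z": [("mail", "Y", "Z"), "eps"],
}

def accepts(automaton, state, tree):
    """tree is None (for ε) or a (label, content, remainder) triple."""
    for alt in automaton[state]:
        if alt == "eps":
            if tree is None:
                return True
        elif tree is not None:
            label, s1, s2 = alt
            if (tree[0] == label
                    and accepts(automaton, s1, tree[1])
                    and accepts(automaton, s2, tree[2])):
                return True
    return False

name_then_mails = ("name", None, ("mail", None, ("mail", None, None)))
print(accepts(M, "X", name_then_mails))
print(accepts(M, "X", ("mail", None, None)))
```

Starting from X, this automaton accepts exactly a name[] followed by any number of mail[] elements.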
54
Internal Forms of Types: Example
    type Person = person[name[], mail*, tel[]?]
is translated into the binary tree
    person(X1, X0)
and a tree automaton M, s.t.
    M(X0) = ε
    M(X1) = name(X0, X2)
    M(X2) = mail(X0, X2) | tel(X0, X0) | ε
55
Internal Forms of Patterns
Patterns are similar to types, with some additions:
    P ::= (* same as types... *)
        | x : P    (* x as P *)
        | T        (* wildcard *)
Wildcards are used for non-“as”-ed variables.
56
Internal Forms of Patterns: Example
The pattern
    person[name[n], (ms as mail*)]
is translated into the binary tree
    person(Y1, Y0)
and a tree automaton N, s.t.
    N(Y0) = ε
    N(Y1) = name(n:T, ms:Y2)
    N(Y2) = mail(Y0, Y2) | ε
57
Pattern Matching (1/3)
Pattern matching has two roles:
    match input values (of course!)
    bind variables to components of the input value, if matched
Written formally:
    t ∈ D ⇒ V
“t is matched by D, yielding V” (V : Vars -> Values)
58
Pattern Matching (2/3)
The matching relation t ∈ D ⇒ V is defined by the following rules... (next slide)
Assumptions:
    D is a set of patterns and states.
    A tree automaton N is implied.
    (D, N) corresponds to the external pattern.
59
Pattern Matching (3/3)
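    (wildcard)  t ∈ T ⇒ ∅
    (leaf)      ε ∈ ε ⇒ ∅
    (binding)   if t ∈ P ⇒ V, then t ∈ x : P ⇒ V ∪ {x ↦ t}
    (union 1)   if t ∈ P1 ⇒ V, then t ∈ P1 | P2 ⇒ V
    (union 2)   if t ∉ P1 and t ∈ P2 ⇒ V, then t ∈ P1 | P2 ⇒ V
    (label)     if t1 ∈ Y1 ⇒ V1 and t2 ∈ Y2 ⇒ V2, then l(t1, t2) ∈ l(Y1, Y2) ⇒ V1 ∪ V2
    (state)     if t ∈ N(Y) ⇒ V, then t ∈ Y ⇒ V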
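
The matching relation t ∈ D ⇒ V can be prototyped as a small recursive function. The sketch below is an illustration under assumptions, not XDuce's implementation: patterns are tagged tuples, binary trees are `None` (for ε) or `(label, content, remainder)` triples, label children may be arbitrary patterns (as in the name(n:T, ms:Y2) example), and a leaf rule ε ∈ ε ⇒ ∅ is assumed. The automaton N is the one from the "Internal Forms of Patterns: Example" slide.

```python
def match(t, p, N):
    """Return the bindings V if tree t matches pattern p, else None."""
    kind = p[0]
    if kind == "wild":                    # t ∈ T ⇒ ∅
        return {}
    if kind == "eps":                     # assumed leaf rule: ε ∈ ε ⇒ ∅
        return {} if t is None else None
    if kind == "bind":                    # t ∈ x:P ⇒ V ∪ {x ↦ t}
        _, x, inner = p
        v = match(t, inner, N)
        return None if v is None else {**v, x: t}
    if kind == "or":                      # try P1 first, P2 only if P1 fails
        _, p1, p2 = p
        v = match(t, p1, N)
        return v if v is not None else match(t, p2, N)
    if kind == "lab":                     # l(t1, t2) against l(P1, P2)
        _, label, p1, p2 = p
        if t is None or t[0] != label:
            return None
        v1 = match(t[1], p1, N)
        if v1 is None:
            return None
        v2 = match(t[2], p2, N)
        return None if v2 is None else {**v1, **v2}
    if kind == "state":                   # t ∈ Y if t ∈ N(Y)
        return match(t, N[p[1]], N)
    raise ValueError(f"unknown pattern form: {kind}")

# Automaton N for the pattern person[name[n], (ms as mail*)].
N = {
    "Y0": ("eps",),
    "Y1": ("lab", "name", ("bind", "n", ("wild",)),
                          ("bind", "ms", ("state", "Y2"))),
    "Y2": ("or", ("lab", "mail", ("state", "Y0"), ("state", "Y2")),
                 ("eps",)),
}
pattern = ("lab", "person", ("state", "Y1"), ("state", "Y0"))

# person[name[], mail[], mail[]] in binary-tree form.
mails = ("mail", None, ("mail", None, None))
value = ("person", ("name", None, mails), None)

print(match(value, pattern, N))
```

Here n is bound to ε (the empty content of name[]) and ms to the binary tree for the two mails. Trying the left alternative of a union first is how the first-match policy shows up in this sketch.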