info-h-509 xml and web technologies - lecture 2: xml and xpath · unicode(6):utf-8. for a code unit...

107

info-h-509 xml and web technologies Lecture 2: XML and XPath . Stijn Vansummeren February 20, 2019

Upload: others

Post on 31-May-2020

3 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

Page 1: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

info-h-509 xml and web technologies

Lecture 2: XML and XPath.

Stijn VansummerenFebruary 20, 2019

Page 2: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

lecture outline.

Unicode

XML and Namespaces

XPath

JSON

1

Page 3: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

..unicode

Page 4: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (1).

• HTML (and XML) files are represented as text files• A text file is logically a sequence of characters• But physically a sequence of bytes• Several mappings from bytes to characters exist:

◦ ASCII◦ EBCDIC◦ Unicode

..Unicode aims to cover all characters in all past or present written languages.In its current version, Unicode consists of a repertoire of more than 107,000

characters covering 90 scripts

3

Page 5: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (2): characters.

..A character is a symbol that appears in a text.Definition

Examples• letters of the alphabet• pictograms like ©• accents

Unicode characters are abstract entities:• LATIN CAPITAL LETTER A• LATIN CAPITAL LETTER A WITH RING ABOVE• HIRAGANA LETTER SA• RUNIC LETTER THURISAZ THURS THORN

4

Page 6: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (3): glyphs.

..A glyph is a graphical presentation of a character.Definition

For instance, the glyph Å represents several characters:• LATIN CAPITAL LETTER A WITH RING ABOVE• ANGSTROM SIGN

It can also be described by a sequence of characters:• LATIN CAPITAL LETTER A COMBINING RING ABOVE

Some characters even result in several glyphs

5

Page 7: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (4): code points.

..Every Unicode character is assigned a unique number between 0 and 1 114 112,called its code point

.Definition

For instance:• the character HIRAGANA LETTER SA is assigned the code point 12 373• Code points 0 through 127 coincide with ASCII

Only 107 000 code points in use today

6

Page 8: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (5): character encodings.

..A character encoding defines how to interpret a sequence of bytes as a se-quence of code points

.Definition

In Unicode:• bytes are first parsed into fixed-length code units• one or more code units may be required to denote a code point

Example character encodings are UTF-8, UTF-16, UTF-32

7

Page 9: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (6): utf-8.

• Each code unit is a single byte• Code points is from 1 to 4 code units• Code Units between 0 and 127 directly represent the corresponding code points(same as ASCII)

..

Consider the following sequence of 3 bytes

01000001 01000010 01000011

decimal 65 66 67glyphs A B C

8

Page 10: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (6): utf-8.

• Each code unit is a single byte• Code points is from 1 to 4 code units• Code Units between 0 and 127 directly represent the corresponding code points(same as ASCII)

..

Consider the following sequence of 3 bytes

01000001 01000010 01000011

decimal 65 66 67

glyphs A B C

8

Page 11: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (6): utf-8.

• Each code unit is a single byte• Code points is from 1 to 4 code units• Code Units between 0 and 127 directly represent the corresponding code points(same as ASCII)

..

Consider the following sequence of 3 bytes

01000001 01000010 01000011

decimal 65 66 67glyphs A B C

8

Page 12: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (6): utf-8.

• For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are usedto indicate that the code point comprises 2, 3, respectively 4 units.

• The remaining units must have bit mask 10XXXXXX

..

Consider the following sequence of 3 bytes

11100011 10000001 10010101

• The first unit indicates that the code point comprises 3 units• We then interpret the non-bitmask bits of the 3 units as one binary number

0011 0000 0101 0101

• This is 12 373 in decimal, the code point for HIRAGANA LETTER SA

9

Page 13: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (6): utf-8.

• For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are usedto indicate that the code point comprises 2, 3, respectively 4 units.

• The remaining units must have bit mask 10XXXXXX

..

Consider the following sequence of 3 bytes

11100011 10000001 10010101

• The first unit indicates that the code point comprises 3 units

• We then interpret the non-bitmask bits of the 3 units as one binary number

0011 0000 0101 0101

• This is 12 373 in decimal, the code point for HIRAGANA LETTER SA

9

Page 14: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (6): utf-8.

• For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are usedto indicate that the code point comprises 2, 3, respectively 4 units.

• The remaining units must have bit mask 10XXXXXX

..

Consider the following sequence of 3 bytes

11100011 10000001 10010101

• The first unit indicates that the code point comprises 3 units• We then interpret the non-bitmask bits of the 3 units as one binary number

0011 0000 0101 0101

• This is 12 373 in decimal, the code point for HIRAGANA LETTER SA

9

Page 15: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (6): utf-8.

• For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are usedto indicate that the code point comprises 2, 3, respectively 4 units.

• The remaining units must have bit mask 10XXXXXX

..

Consider the following sequence of 3 bytes

11100011 10000001 10010101

• The first unit indicates that the code point comprises 3 units• We then interpret the non-bitmask bits of the 3 units as one binary number

0011 0000 0101 0101

• This is 12 373 in decimal, the code point for HIRAGANA LETTER SA

9

Page 16: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (7).

The big picture:

.. sequenceof bytes

. sequence ofcode units

. sequence ofcode points

.

sequence ofcharacters

.

sequenceof glyphs

Other encodings:• See book for details on UTF-16 and UTF-32• Another popular character encoding is ISO-8859-1 (aka Latin-1) with only 256code points. It coincides with ASCII on code points 0-127, but cannot representgeneral Unicode

10

Page 17: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

unicode (8): use in html.

..<meta http-equiv=”Content-Type” content=”text/html;charset=ISO-8859-1”>

11

Page 18: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

..xml and namespaces

Page 19: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

what is xml?.

XML: the eXtensible Markup Language

• In essence a data exchange format• Each application domain is free to chose its own markup tags• It can also be viewed as a framework for defining markup languages• There is a common set of generic tools for processing XML documents• Developed by World Wide Web Consortium (W3C), standardized in 1998.

13

Page 20: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the textual representation of xml: an example.

By means ofOnline demonstration

14

Page 21: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the textual representation of xml: some key points.

• Always starts with the XML Declaration<?xml version=”1.1” encoding=”UTF-8”?>

• Consists mainly of Element start and end tags<bla ...> ... </bla>

◦ Start and end tags must match and nest properly!OK: <x> <y> </y> </x>

Not OK: </z> <x> <y> </x> </y>◦ Short hand for empty elements: <bla/>

• Attributes: name=”value” in start tags

• Always exactly one root element

Well-formedness can be checked by tools

15

Page 22: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the textual representation of xml: some key points.

• Always starts with the XML Declaration<?xml version=”1.1” encoding=”UTF-8”?>

• Consists mainly of Element start and end tags<bla ...> ... </bla>

◦ Start and end tags must match and nest properly!OK: <x> <y> </y> </x>

Not OK: </z> <x> <y> </x> </y>◦ Short hand for empty elements: <bla/>

• Attributes: name=”value” in start tags

• Always exactly one root element

Well-formedness can be checked by tools

15

Page 23: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the textual representation of xml: some key points.

• Always starts with the XML Declaration<?xml version=”1.1” encoding=”UTF-8”?>

• Consists mainly of Element start and end tags<bla ...> ... </bla>

◦ Start and end tags must match and nest properly!OK: <x> <y> </y> </x>

Not OK: </z> <x> <y> </x> </y>◦ Short hand for empty elements: <bla/>

• Attributes: name=”value” in start tags

• Always exactly one root element

Well-formedness can be checked by tools

15

Page 24: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the textual representation of xml: some key points.

• Always starts with the XML Declaration<?xml version=”1.1” encoding=”UTF-8”?>

• Consists mainly of Element start and end tags<bla ...> ... </bla>

◦ Start and end tags must match and nest properly!OK: <x> <y> </y> </x>

Not OK: </z> <x> <y> </x> </y>◦ Short hand for empty elements: <bla/>

• Attributes: name=”value” in start tags

• Always exactly one root element

Well-formedness can be checked by tools

15

Page 25: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the textual representation of xml: some key points.

• Always starts with the XML Declaration<?xml version=”1.1” encoding=”UTF-8”?>

• Consists mainly of Element start and end tags<bla ...> ... </bla>

◦ Start and end tags must match and nest properly!OK: <x> <y> </y> </x>

Not OK: </z> <x> <y> </x> </y>◦ Short hand for empty elements: <bla/>

• Attributes: name=”value” in start tags

• Always exactly one root element

Well-formedness can be checked by tools

15

Page 26: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the textual representation of xml: special constructs.

• Comment tags: allow you to write some meta-information in the XML file. Thisis normally ignored by all XML processing applications.

<!– bla –>

• Processing instructions: allow you to specify specific instructions to specificprocessors. Each has a target and a value.

<?target value?>

16

Page 27: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the textual representation of xml: more constructs.

Special characters can be referenced by character references• These take the form &#XXXX;

<bla> Copyright © 2005 </bla>• < and > and & denote < and > and & respectively

CDATA sections allow text that doesn’t need to be escaped by &lt and &gt

<![CDATA[ <this is not a tag> ]]>

17

Page 28: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

refresh your memory about trees.

Do you know the following terms?• node, edge• root, leaf• child, parent• sibling (ordered trees), ancestor,descendant

...A.

B

.

E

.

F

.

C

.

D

.

G

.

H

.

I

.

H

18

Page 29: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the xml conceptual data model: trees.

19

<?xml version=”1.1” encoding=”UTF-8”?><store><order><customer>

<name>John</name><email>[email protected]</email>

</customer><item>

<id> 18F </id><price> 100 </price>

</item><item>

<id> G43 </id><price> 50 </price>

</item></order><order key=”3”><customer>

<name>Jane</name><email>[email protected]</email>

</customer><item>

<id> 18F </id><price> 22 </price>

</item></order>

</store>

...

store

.

order

.

customer

.

name

.

John

.

email

.

[email protected]

.

item

.

id

.

18F

.

price

.

100

.

item

.

id

.

G43

.

price

.

50

.

order

.

key=3

.

customer

.

name

.

Jane

.

email

.

[email protected]

.

item

.

id

.

18F

.

price

.

22

.

legend

.

element node

.

attribute node

.

text node

..

Conceptually, an XML document is an orderedtree with different types of nodes• document structure = tree structure• document data = leaf of tree• also comment nodes, processing instructions,

... (not shown)

Page 30: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the xml conceptual data model: conversion.

..An XML parser is an algorithm that given the textual XML document, constructsits tree representation

.Definition

..On file.

In memory

.

XML Parser.

<?xml vers ion=”1.0” encoding=”UTF-8”?><order>

<customer><name>John</name><email>[email protected]</email>

</customer><item>

< id> 18F </ id><price > 100 </price >

</item><item>

< id> G43 </ id><price > 50 </price >

</item></order>

..

order

.

customer

.

name

.

John

.

email

.

jj.com

.

item

.

id

.

18F

.

price

.

100

.

item

.

id

.

G43

.

price

.

50 20

Page 31: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the xml conceptual data model: caution!.

There are multiple data models for XML:• XQuery and XPath data model (used here)• The post-validation infoset• DOM, JDOM• …

..We use the XPath data model unless specified otherwise!

21

Page 32: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

example: a recipy markup language.

..

<?xml vers ion=”1.0” encoding=”UTF-8”?><co l l ec t ion ><descr ipt ion >Recipes suggested by Jane Dow</descr ipt ion >

< rec ipe id=” r 1 1 7”>< t i t l e >Rhubarb Cobbler</ t i t l e ><date>Wed, 14 Jun 95</date>

< ing red ient name=”diced rhubarb” amount=”2.5” uni t=”cup”/>< ing red ient name=”sugar” amount=”2” uni t=”tablespoon”/>< ing red ient name=” f a i r l y r ipe banana” amount=”2”/>< ing red ient name=”cinnamon” amount=”0.25” uni t=”teaspoon”/>< ing red ient name=”nutmeg” amount=”1” uni t=”dash”/>

<preparation ><step>Combine a l l and use as cobbler, pie, or c r i sp.</step>

</preparation >

<comment> Rhubarb Cobbler made i t with ... </comment>

<nu t r i t i on ca lo r i e s=”170” f a t=”28%”carbohydrates=”58%” prote in=”14%”/>

< re la ted re f=”42”>Garden Quiche</re lated ></recipe >

</co l l ec t ion >

22

Page 33: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

applications of xml.

Rough Classification:• Data-oriented languages (e.g configuration files, CML)• Document-oriented languages (e.g. XHTML)• Protocols and programming languages (e.g SOAP, WSDL, XSLT)• Hybrids

23

Page 34: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

example: xhtml.

XHTML is the XML version of HTML• XHMTL must be a well-formed XML document (e.g. tags must match and nestproperly)

• Ordinary HTML is typically “ invalid” and hence renders differently on differentbrowsers.

..

<?xml vers ion=”1.0” encoding=”UTF-8”?><html xmlns=”http://www.w3.org/1999/xhtml”><head>< t i t l e >Hello world!</ t i t l e ></head><body><h1>This i s a heading</h1>This i s some tex t.

</body></html>

24

Page 35: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

example: chemical markup language cml.

..

<?xml vers ion=”1.0” encoding=”UTF-8”?><molecule id=”METHANOL”><atomArray>< s t r i n gA r r a y bu i l t i n=” id”>a1 a2 a3 a4 a5 a6</s t r ingAr ray >< s t r i n gA r r a y bu i l t i n=”elementType”>C O H H H H</s t r ingAr ray >< f l o a t A r r a y bu i l t i n=”x3” uni ts=”pm”>-0.748 0.558 ...

</ f l oa tA r ray >< f l o a t A r r a y bu i l t i n=”y3” uni ts=”pm”>-0.015 0.420 ...

</ f l oa tA r ray >< f l o a t A r r a y bu i l t i n=”z3” uni ts=”pm”>0.024 -0.278 ...

</ f l oa tA r ray ></atomArray>

</molecule>

25

Page 36: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

example: ebxml.

..

<?xml vers ion=”1.0” encoding=”UTF-8”?><Mul t iPa r t yCo l l abora t ion name=”DropShip”><BusinessPartnerRole name=”Customer”><Performs i n i t i a t i n g Ro l e=’//binaryCo l labora t ion[@name=”Firm Order”]/

I n i t i a t i n g Ro l e[@name=”buyer”]’ /></BusinessPartnerRole ><BusinessPartnerRole name=”Re ta i l e r”><Performs respondingRole=’//binaryCo l labora t ion[@name=”Firm Order”]/

RespondingRole[@name=” s e l l e r”]’ /><Performs i n i t i a t i n g Ro l e=’//binaryCo l labora t ion[...]/

I n i t i a t i n g Ro l e[@name=”buyer”]’ /></BusinessPartnerRole ><BusinessPartnerRole name=”DropShip Vendor”>...

</BusinessPartnerRole ></Mul t iPar tyCo l labora t ion >

26

Page 37: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

example: thml.

..

<?xml vers ion=”1.0” encoding=”UTF-8”?><h3 c lass=”s05” id=”One.2.p0.2”>Having a Humble Opinion of Se l f </h3><p c lass=” F i r s t ” id=”One.2.p0.3”>EVERY man na tu ra l l y des i res knowledge<note place=” foot” id=”One.2.p0.4”><p c lass=”Footnote” id=”One.2.p0.5”><added id=”One.2.p0.6”>

<name id=”One.2.p0.7”>A r i s t o t l e </name>, Metaphysics, i. 1.</added></p>

</note>;but what good i s knowledge without fear of God? Indeed a humbler u s t i c who serves God i s bet te r than a proud i n t e l l e c t u a l whoneglects h is soul to study the course of the s ta r s.<added id=”One.2.p0.8”><note place=” foot” id=”One.2.p0.9”><p c lass=”Footnote” id=”One.2.p0.10”>

Augustine, Confessions V. 4.</p>

</note></added></p>

27

Page 38: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xml namespaces (1): motivation.

Consider an XML language for widgets. The <info> element containsinformation in XHTML markup.

..

<widget type=”gadget”><head s i ze=”medium”/><big><subwidget re f=”gizmo”/></big>< info >

<head>< t i t l e >Descr ip t ion of gadget</ t i t l e >

</head><body><h1>Gadget</h1>A gadget contains a big gizmo

</body></ info >

</widget>

..

?

• When combining languages, element names may become ambiguous• A general solution?

28

Page 39: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xml namespaces (1): motivation.

Consider an XML language for widgets. The <info> element containsinformation in XHTML markup.

..

<widget type=”gadget”><head s i ze=”medium”/><big><subwidget re f=”gizmo”/></big>< info >

<head>< t i t l e >Descr ip t ion of gadget</ t i t l e >

</head><body><h1>Gadget</h1>A gadget contains a big gizmo

</body></ info >

</widget>

..

?

• When combining languages, element names may become ambiguous• A general solution?

28

Page 40: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xml namespaces (1): motivation.

Consider an XML language for widgets. The <info> element containsinformation in XHTML markup.

..

<widget type=”gadget”><head s i ze=”medium”/><big><subwidget re f=”gizmo”/></big>< info >

<head>< t i t l e >Descr ip t ion of gadget</ t i t l e >

</head><body><h1>Gadget</h1>A gadget contains a big gizmo

</body></ info >

</widget>

..

?

• When combining languages, element names may become ambiguous• A general solution?

28

Page 41: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xml namespaces (2): the idea.

• Assign a URI to every (sub-)languagee.g. http://www.w3.org/1999/xhtml for XTML 1.0

• Qualify element names with URIs:{http://www.w3.org/1999/xhtml}head

29

Page 42: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xml namespaces (3): actual solution.

Namespace declarations bind URIs to prefixes

<... xmlns:foo=”http://www.w3.org/TR/xhtml1”>...<foo:head>...</foo:head>...

</...>

• Lexical scope• Default namespace (no prefix) declared with xmlns=”...”• Attribute names can also be prefixed

30

Page 43: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xml namespaces (4): example.

Namespace declarations bind URIs to prefixes

..

<widget type=”gadget” xmlns=”http://www.widget. inc”><head s i ze=”medium”/><big><subwidget re f=”gizmo”/></big>< i n fo xmlns:xhtml=”http://www.w3.org/TR/xhtml1”>

<xhtml:head><xhtml: t i t l e >Descr ip t ion of gadget</xhtml: t i t l e >

</xhtml:head><xhtml:body><xhtml:h1>Gadget</xhtml:h1>

A gadget contains a big gizmo</xhtml:body>

</ info ></widget>

Namespace map: for each element, maps prefixes to URIs

31

Page 44: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

..xpath

Page 45: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

what is xpath?.

XPath is:• A flexible notation for navigating in trees

• Widely used as a basic component in many XML-related standards:◦ XML Schema: uniqueness and scope◦ XSLT & XQuery: pattern matching, selection and value computation◦ XLink & XPointer: document relationships

XPath has multiple versions:• version 1.0 (limited to location paths)• version 2.0 (focus of this lecture)• version 3.0 (small changes w.r.t 2.0, standardized in 2014)• version 3.1 (towards support for JSON, standardized in 2017)

33

Page 46: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

basic syntax.

..The basic building block of XPath expressions is the location path, which con-sists of a sequence of location steps separated by /.

.

Definition

...

Example location path:

.descendant::C. /. child::E[attribute::id]. /. child::F

..

location step

..

location step

..

location step

34

Page 47: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

basic syntax.

..The basic building block of XPath expressions is the location path, which con-sists of a sequence of location steps separated by /.

.

Definition

...

Example location path:

.descendant::C. /. child::E[attribute::id]. /. child::F..

location step

..

location step

..

location step

34

Page 48: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

basic syntax.

..

A location step consists of• an axis• a nodetest• (optionally) some predicateswith the following general syntax

axis :: nodetest [ exp1 ] [ exp2 ] …

.

Definition

..

Examples:• child::rcp:ingredient• child::rcp:recipe[attribute::id=’117’]• attribute::amount

35

Page 49: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating a location path.

..

• Semantically, each step maps a context node into a sequence• This is easily extended to map sequences of nodes to sequences of nodes

◦ each node in the input sequence serves as a context node◦ and is replaced with the result of applying the step

• The path then applies each step in return

Example:

..A.

B

.

C

.

E

.

F

.

F

.

F

.

C

.

F

.

B

.

D

.

C

.

E

.

E

.

F

.

E

.

F

36

Page 50: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating a location path.

..

• Semantically, each step maps a context node into a sequence• This is easily extended to map sequences of nodes to sequences of nodes

◦ each node in the input sequence serves as a context node◦ and is replaced with the result of applying the step

• The path then applies each step in return

Example:

Start with context node ...A.

B

.

C

.

E

.

F

.

F

.

F

.

C

.

F

.

B

.

D

.

C

.

E

.

E

.

F

.

E

.

F

36

Page 51: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating a location path.

..

• Semantically, each step maps a context node into a sequence• This is easily extended to map sequences of nodes to sequences of nodes

◦ each node in the input sequence serves as a context node◦ and is replaced with the result of applying the step

• The path then applies each step in return

Example:

descendant::C ......A.

B

.

C

.

E

.

F

.

F

.

F

.

C

.

F

.

B

.

D

.

C

.

E

.

E

.

F

.

E

.

F

36

Page 52: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating a location path.

..

• Semantically, each step maps a context node into a sequence• This is easily extended to map sequences of nodes to sequences of nodes

◦ each node in the input sequence serves as a context node◦ and is replaced with the result of applying the step

• The path then applies each step in return

Example:

descendant::C/child::E .....A.

B

.

C

.

E

.

F

.

F

.

F

.

C

.

F

.

B

.

D

.

C

.

E

.

E

.

F

.

E

.

F

36

Page 53: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating a location path.

..

• Semantically, each step maps a context node into a sequence• This is easily extended to map sequences of nodes to sequences of nodes

◦ each node in the input sequence serves as a context node◦ and is replaced with the result of applying the step

• The path then applies each step in return

Example:

descendant::C/child::E ........A.

B

.

C

.

E

.

F

.

F

.

F

.

C

.

F

.

B

.

D

.

C

.

E

.

E

.

F

.

E

.

F

36

Page 54: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating a location path.

..

• Semantically, each step maps a context node into a sequence• This is easily extended to map sequences of nodes to sequences of nodes

◦ each node in the input sequence serves as a context node◦ and is replaced with the result of applying the step

• The path then applies each step in return

Example:

descendant::C/child::E/child::F .....A.

B

.

C

.

E

.

F

.

F

.

F

.

C

.

F

.

B

.

D

.

C

.

E

.

E

.

F

.

E

.

F

36

Page 55: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating a location path.

..

• Semantically, each step maps a context node into a sequence• This is easily extended to map sequences of nodes to sequences of nodes

◦ each node in the input sequence serves as a context node◦ and is replaced with the result of applying the step

• The path then applies each step in return

Example:

descendant::C/child::E/child::F .........A.

B

.

C

.

E

.

F

.

F

.

F

.

C

.

F

.

B

.

D

.

C

.

E

.

E

.

F

.

E

.

F

36

Page 56: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating a single location step (1).

..• The input to a location step consists of (1) the tree representation of an XMLdocument and (2) a context

• The output of a location step is a sequence of nodes

.

Definition

...A.

B

.

C

.

E

.

F

.

F

.

F

.

C

.

F

.

B

.

D

.

C

.

E

.

E

.

F

.

E

.

F

.

n1

.

n2

.

n3

.

n4

.

(n1,n2,n3,n4)

.

descendant::E

37

Page 57: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating a single location step (2).

..

The context of an XPath expression consists of 6 components:• a context node (a node in the XML tree)• a context position and size (two natural numbers)• a set of variable bindings• a function library (depends on the implementation)• a set of namespace declarations (mapping prefixes to namespace URIs)

.

Definition

• The application determines the initial context• If the path starts with ’/’ then

◦ the initial context node is the root◦ the initial position and size are 1

38

Page 58: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating location steps (3).

To evaluate axis :: nodetest [ exp1 ] [ exp2 ]:.. input tree + context.

sequence of nodes

.

sequence of nodes

.

sequence of nodes

.

(1) evaluate axis

.

(2) filter by node test

.

(3) filter by predicates

39

Page 59: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

the order of nodes in the resulting sequence.

..

• The document order of an XML tree is the order in which node m occursbefore node n if the open tag of m occurs before the open tag of n in thecorresponding text representation.

• The result of a location step and path is always sorted in document order,with duplicate nodes removed.

.

Definition

...A.

B

.

C

.

E

.

F

.

F

.

F

.

C

.

F

.

B

.

D

.

C

.

E

.

E

.

F

.

E

.

F

.

n1

.

n2

.

n3

.

n4

.

(n1,n2,n3,n4)

.

descendant::E

40

Page 60: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating the axis.

..Remember:• An axis is evaluated relative to the context node• And evaluates to a sequence of nodes

XPath supports 12 different axes

• child• descendant• parent• ancestor• following-sibling• preceding-sibling

• attribute• following• preceding• self• descendant-or-self• ancestor-or-self

41

Page 61: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating the axis (1): the parent axis.

...............................

..• Selects the unique parent node of the context node.• Returns the empty sequence if the context node is the root.

42

Page 62: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating the axis (2): the child axis.

................................

..• Selects all children excluding attributes

43

Page 63: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating the axis (3): the descendant axis.

...................................

..

• The transitive closure of the child axis. Selects all descendants excludingattributes.

• The descendant-or-self axis returns the concatenation of the self anddescendant sequences.

44

Page 64: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating the axis (4): the ancestor axis.

.................................

..

• The transitive closure of the parent axis. All ancestors of the context node,from parent to the root element node.

• The ancestor-or-self axis returns the concatenation of the self andancestor sequences.

45

Page 65: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating the axis (5): the following-sibling axis.

...............................

..• Returns the right-hand siblings of the context node excluding attributes. Itreturns the empty sequence if the context node is an attribute.

46

Page 66: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating the axis (6): the preceding-sibling axis.

................................

..• Returns the left-hand siblings of the context node, excluding attributes. Itreturns the empty sequence if the context node is an attribute.

47

Page 67: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating the axis (7): the following axis.

........................................

..• Returns all nodes appearing strictly later in document order, but excludingdescendants and attribute nodes.

48

Page 68: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating the axis (8): the preceding axis.

.......................................

..• Returns all nodes appearing strictly earlier in document order, but excludingancestors and attribute nodes.

49

Page 69: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating location steps: remember.

To evaluate axis :: nodetest [ exp1 ] [ exp2 ]:.. input tree + context.

sequence of nodes

.

sequence of nodes

.

sequence of nodes

.

(1) evaluate axis

.

(2) filter by node test

.

(3) filter by predicates

50

Page 70: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

node tests.

..The node test in a location path takes as input the sequence computed by thelocation step, and further selects from this sequence all nodes that satisfy thetest.

In particular• text() selects all text nodes in the sequence• comment() selects all comment nodes• processing-instruction() selects all processing instruction nodes• node() select all nodes• * selects all element nodes, except when it is used after an attribute axis, inwhich case it selects all attribute nodes

• name selects all nodes with the given name (must be a Qualified Name, orQName for short). Text nodes do not have a name!

• *:localname selects the nodes with the given name in any namespace• prefix:* selects nodes as *, but only those in the given namespace.

51

Page 71: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

quiz.

What does location step child::hi select in the following document whenevaluated from the root?<bla>

<hi/> hi</bla>

52

Page 72: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

evaluating location steps: remember.

To evaluate axis :: nodetest [ exp1 ] [ exp2 ]:.. input tree + context.

sequence of nodes

.

sequence of nodes

.

sequence of nodes

.

(1) evaluate axis

.

(2) filter by node test

.

(3) filter by predicates

53

Page 73: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

predicates.

..

• The predicates in a location path take as input the sequence computed by the nodetest. For each node in that sequence, the predicate is evaluated (using the currentnode as context node). The result is coerced into a boolean:◦ a number yields true if it equals the context position◦ a string yields true if it is nonempty◦ a sequence yields true if it is nonempty

All nodes that returned true are kept, the rest is discarded.• A predicate can be any XPath expression (see later), it need not be a location step

Examples• /descendant::recipe[descendant::ingredient]• /descendant::ingredient[attribute::name=’sugar’]• /descendant::recipe[descendant::ingredient[attribute::name=’sugar’]]• /descendant::recipe[4]• /descendant::recipe[position()=4][position()=1]

54

Page 74: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

a note about axes directions.

Each XPath axis has a direction• A forward direction means that the input sequence is filtered in documentorder:◦ child, descendant, following-sibling, following, self, descendant-or-self

• A backward direction means that the input sequence is filtered in reversedocument order:◦ parent, ancestor, preceding-sibling, preceding

• Attribute nodes are considered to be unordered. Therefore, the order in whichthe nodes in the sequence resulting from the attribute axis are filtered isimplementation-dependent, but stable

..

Example:• child::node()[position()=2] returns the second child• ancestor::node()[position()=2] returns the grand-parent (the parenthas position 1, the grand-parent position 2, and so on ...)

55

Page 75: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

abbreviations.

XPath has an abbreviated syntax• Child is the default axis, so /child::rcp:recipe is equivalent to/rcp:recipe

• The attribute axis may be replaced by the @ character• The fragment descendant-or-self::node() may be replaced by //• The fragment self::node() may be replaced by .• The fragment parent::node() may be replaced by ..

..What is the unabbreviated syntax for the following expression?

//rcp:nutrition[@calories=349]/..//rcp:title/text()

56

Page 76: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

abbreviations.

XPath has an abbreviated syntax• Child is the default axis, so /child::rcp:recipe is equivalent to/rcp:recipe

• The attribute axis may be replaced by the @ character• The fragment descendant-or-self::node() may be replaced by //• The fragment self::node() may be replaced by .• The fragment parent::node() may be replaced by ..

..What is the unabbreviated syntax for the following expression?

//rcp:nutrition[@calories=349]/..//rcp:title/text()

56

Page 77: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

beyond the basics: general expressions.

..A general XPath expression evaluates to a sequence of items, where each itemis either an node or an atomic value

Atomic values may be:• numbers• booleans• Unicode strings• or one of the other datatypes defined in XML Schema (see later)

And we already know that each node has an identity

57

Page 78: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: literal expressions.

..

Examples:

• 42• 3.1415• 6.022E13• ’XPath is a lot of fun’• ”XPath is a lot of fun”

• ’The cat said ”Meow!”’• ”The cat said ””Meow!”””• ”XPath is just

so much fun”

58

Page 79: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: arithmetic expressions.

..

Examples:

• 42 + 34• 42 - 34• 3.1415 * 5

• 3.1415 div 5• 42 idiv 34• 42 mod 3

..

Operators are generalized to sequences• If any argument is empty, the result is empty• If all argument are singleton sequences of numbers, the operation isperformed

• Otherwise, a runtime error occurs

59

Page 80: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: variable references.

..Examples:

• $foo • $bar:foo

Variables are set in the context• Can be set by the application before running the xpath expression (inputvariables)

• But variables can also be bound by the XPath for-expression (see later)

Careful!• $foo-17 refers to the variable “foo-17”• To substract 17 from the value of variable “foo”, we write ($foo)-17, $foo-17, or $foo+-17

60

Page 81: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: sequence expressions.

..

Operators are generalized to sequences• The ’,’ operator concatenates sequences• Integer ranges are constructed with ’to’• Sequences containing only nodes can be combined using the set operatorsunion, intersect, except. The result is always sorted in document order,and duplicate-free.

• Sequences are always flattened

..

The following expressions all give the same result:• (1, (2,3,4), ((5)), (), (((6,7)), 8,9))• 1 to 9• 1,2,3,4,5,6,7,8,9

61

Page 82: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: functions.

• XPath has an extensive function library• Default namespace for functions:http://www.w3.org/2006/xpath-functions

• 106 functions are required• More functions with the namespace:http://www.w3.org/2001/XMLSchema

• See book for pages 78-81 for useful standard functions

..Example• calling a function with 4 arguments: fn:avg(1,2,3,4)• calling a function with 1 argument: fn:avg((1,2,3,4))

62

http://www.w3.org/2006/xpath-functions

http://www.w3.org/2001/XMLSchema

Page 83: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: path expressions.

• Location Paths are expressions• They may start from arbitrary sequences that contain only nodes

◦ evaluate the path for each node in the sequence◦ using the current node as context node◦ context position and size are taken from the sequence◦ the results are combined in document order

..

Example:• (fn:doc(”veggie.xml”), fn:doc(”bbq.xml”))//rcp:title

(For multiple documents, the document order between nodes in different documentsis implementation-dependent, but stable).

63

Page 84: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: filter expressions.

• Predicates are generalized to arbitrary sequences• The expression ’.’ is the context item

..Example (what is the result?):• (10 to 40)[. mod 5 = 0 and position()>20]

64

Page 85: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: value comparison.

..

• Operators eq, ne, lt, le, gt, ge• Used on atomic values• when applied to an arbitrary values:

◦ atomize◦ if either argument is empty, the result is empty◦ if either has length> 1, raise runtime error (error in book)◦ if incomparible (e.g. string comparison with float), raise runtime error◦ otherwise, compare the two atomic values

..Examples:• 8 eq 4+4• (//rcp:ingredient)[1]/@name eq ”beef cube steak”

65

Page 86: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

atomization.

..

Atomization is the process of transforming an arbitrary sequence into a se-quence of atomic values.• Each node is replaced by its atomized value• For element nodes, the atomized value is the concatenation of the contentof all descendant text nodes (in document order)

• For other nodes, the atomized value is the obvious string

.

Definition

66

Page 87: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: generalized comparison.

..

• Operators =, !=, <, <=, >, >=• Used on arbitrary values:

◦ atomize◦ if either argument is empty, the result is empty◦ if there exists two atomic values, one from each atomized argument, whose

comparison holds, the result is true◦ otherwise, the result is false

• So, this is an existential semantics!

..

Examples:• 8 = 4+4• (1,2) = (2,4)• //rcp:ingredient/@name = ”salt”

67

Page 88: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: node comparison.

..

• Operators is, «, »,• Used to compare nodes on identity and document order• When applied to arbitrary values:

◦ if either argument is empty, the result is empty◦ if both are singleton nodes, the nodes are compared◦ otherwise, a runtime error is raised

..

Examples:• (//rcp:recipe)[2] is //rcp:recipe[rcp:title eq ”Ricotta Pie” ]• /rcp:collection « (//rcp:recipe)[4]• (//rcp:recipe)[4] » (//rcp:recipe)[3]

68

Page 89: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

be careful about comparisons.

What is the result of the following comparisons?• ((//rcp:ingredient)[40]/@name, (//rcp:ingredient)[40]/@amount) eq((//rcp:ingredient)[53]/@name, (//rcp:ingredient)[53]/@amount)

• ((//rcp:ingredient)[40]/@name, (//rcp:ingredient)[40]/@amount) =((//rcp:ingredient)[53]/@name, (//rcp:ingredient)[53]/@amount)

• ((//rcp:ingredient)[40]/@name, (//rcp:ingredient)[40]/@amount) is((//rcp:ingredient)[53]/@name, (//rcp:ingredient)[53]/@amount)

69

Page 90: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

be careful about comparisons (2).

..

XPath violates most algebraic laws of equality

• Reflexivity:() = () yields false

• Transitivity:(1,2)=(2,3) and (2,3)=(3,4), but not (1,2)=(3,4)

• Anti-symmetry:(1,4)<=(2,3) and (2,3)<=(1,4) but not (1,2)=(3,4)

• Negation:both (1) != () and (1) = () yield false

70

Page 91: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: boolean expressions.

..

• Operators and, or, not,• Arguments are interpreted as false if

◦ the boolean false◦ the empty sequence◦ the empty string◦ the number zero

• Constants using functions true() and false()

..Example:• (//rcp:recipe) and //ingredient[4]/@quantity

71

Page 92: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: for expressions.

..

General form for $var in expr1 return expr2• evaluates expr1 to a sequence• for each item in that sequence evaluate expr2 with var bound to the currentitem

• concatenates the resulting sequences

..

Examples:• for $r in //rcp:recipe returnfn:count($r//rcp:ingredient[fn:not(rcp:ingredient)])

• for $i in (1 to 5)for $j in (1 to $i)return $j

72

Page 93: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: conditional expressions.

..

Example:• fn:avg(

for $r in //rcp:ingredient returnif ( $r/@unit = ”cup” )then xs:double($r/@amount) * 237

else if ( $r/@unit = ”teaspoon” )then xs:double($r/@amount) * 5

else if ( $r/@unit = ”tablespoon” )then xs:double($r/@amount) * 15

else ())

73

Page 94: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

general expressions: quantified expressions.

..Example:• some $r in //rcp:ingredient satisfies $r/@name eq ”sugar”• every $r in //rcp:recipy satisfies $r/@name neq ”cake”

74

Page 95: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xpath 1.0 restrictions.

..

• Many implementations only support XPath 1.0• Smaller function library• Implicit casts of values• Some expressions change semantics:

”4” < ”4.0”is false in XPath 1.0 but true in XPath 2.0

75

Page 96: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xpath 3.0 extensions.

..

New features in XPath 3.0 include:• String concatenation operator ||• Simple mapping operator !. E1 ! E2 is similar to E1/E2 except that:

◦ E1 can return an arbitrary sequence, not necessarily a sequence of nodes◦ E1 is evaluated to a sequence. For every item in that sequence E2 is evaluated

with the current item as context item.◦ Results are concatenated, but there is no duplicate elimination, nor sorting after

the concatenation.

• Higher-order functions: functions are valid items; anonymous functions canbe defined and called dynamically.

• Let binding.

..

Examples:• ’ab’ || ’def’ = ’abdef’• let $x := doc(’a.xml’)/*, $y := $x//*return $y[@value gt $x/@min]

76

Page 97: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xpath 3.1 extensions.

..

XPath 3.1 extends the XPath Data Model• A value in xpath 3.1 is a sequence of 0 or more items, where every item is anatomic value, a node, a map, an array, or a function.

• Maps are dictionaries from keys to values.◦ Keys must be atomic values.◦ Duplicate keys are not allowed.

• Arrays are Arrays are ordered collections of items (like sequences), but anarray can include as element a sequence (which a sequence cannot).

• New expressions to create and manipulate maps and arrays.• This allows to also load json.

..Create a map with 3 entries:• map { ”Su” : ”Sunday”, ”Mo” : ”Monday”, ”Tu” : ”Tuesday”}

77

Page 98: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xpath 3.1 extensions.

..

XPath 3.1 extends the XPath Data Model• A value in xpath 3.1 is a sequence of 0 or more items, where every item is anatomic value, a node, a map, an array, or a function.

• Maps are dictionaries from keys to values.◦ Keys must be atomic values.◦ Duplicate keys are not allowed.

• Arrays are Arrays are ordered collections of items (like sequences), but anarray can include as element a sequence (which a sequence cannot).

• New expressions to create and manipulate maps and arrays.• This allows to also load json.

..Maps can be nested:• map { ”book” : map { ”author” : [”J. Doe”, ”X. Brown”], ”title”: ”TheX”}}

77

Page 99: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xpath 3.1 extensions.

..

XPath 3.1 extends the XPath Data Model• A value in xpath 3.1 is a sequence of 0 or more items, where every item is anatomic value, a node, a map, an array, or a function.

• Maps are dictionaries from keys to values.◦ Keys must be atomic values.◦ Duplicate keys are not allowed.

• Arrays are Arrays are ordered collections of items (like sequences), but anarray can include as element a sequence (which a sequence cannot).

• New expressions to create and manipulate maps and arrays.• This allows to also load json.

..Maps lookup• $x(”book”)

77

Page 100: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

xpath 3.1 extensions.

..

XPath 3.1 extends the XPath Data Model• A value in xpath 3.1 is a sequence of 0 or more items, where every item is anatomic value, a node, a map, an array, or a function.

• Maps are dictionaries from keys to values.◦ Keys must be atomic values.◦ Duplicate keys are not allowed.

• Arrays are Arrays are ordered collections of items (like sequences), but anarray can include as element a sequence (which a sequence cannot).

• New expressions to create and manipulate maps and arrays.• This allows to also load json.

..

Array creation• Array with three members: [1, 2, ”book”]• Array with two members, both sequences: [ (), (27, 17, 0)]• Array with two members, the second being a sequence: [1, $x//recipe]• Curly array constructor creates flat arrays: array {1, $x//recipe}

..Array lookup1, 2, 5, 7 (4) evaluates to 7.

..Load JSON• fn:parse-json(”data.json”)

77

Page 101: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

..json

Page 102: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

json, that other xml.

JSON = JavaScript Object Notation• A syntactical fragment of JavaScript (aka EcmaScript) for describing data• Libraries for reading & writing JSON exist for virtually every programming language• Widely used with JavaScript & AJAX

Example:

..

{ ”company”: { ”name”: ”WIND so lu t ions”, ”symbol”: ”WIND” },”reportData”:[{ ”year”: 2011, ”p r o f i t”: 2000000 },{ ”year”: 2010, ”d e f i c i t”: 1000 },

],”author”: ”John Doe”

}

79

Page 103: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

json in a nutshell.

80

• JSON can represent four primitive types (strings, numbers, booleans, and null) andtwo structured types (objects and arrays).

• A value is a a string, number, boolean, null, object, or array.

Number primitive type• A number is any integer or fractional number like in most programming languages

Example:..10 -12.34 10e5 +3.141516

Page 104: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

json in a nutshell.

81

• JSON can represent four primitive types (strings, numbers, booleans, and null) andtwo structured types (objects and arrays).

• A value is a a string, number, boolean, null, object, or array.

String primitive type• A string is a sequence of zero or more unicode characters.• String literals are delimited by double quotes. The C-style escape characters (\”, \n,

\t, …) apply.

Example:..”This is a string literal\n This continues on a new line”

Page 105: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

json in a nutshell.

82

• JSON can represent four primitive types (strings, numbers, booleans, and null) andtwo structured types (objects and arrays).

• A value is a a string, number, boolean, null, object, or array.

Boolean primitive types• A boolean is either the literal true or false

Page 106: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

json in a nutshell.

83

• JSON can represent four primitive types (strings, numbers, booleans, and null) andtwo structured types (objects and arrays).

• A value is a a string, number, boolean, null, object, or array.

Object structured type• An object is an unordered collection of zero or more name/value pairs, where aname is a string and a value is a string, number, boolean, null, object, or array.

• Objects are delimited by curly braces; within the same object, each name must beunique.

Example:

..

{ ”company”: { ”name”: ”WIND so lu t ions”, ”symbol”: ”WIND” },”reportData”:[{ ”year”: 2011, ”p r o f i t”: 2000000 },{ ”year”: 2010, ”d e f i c i t”: 1000 },

],”author”: ”John Doe”

}

Page 107: INFO-H-509 XML and Web Technologies - Lecture 2: XML and XPath · unicode(6):utf-8. For a code unit > 127, the bit masks 110XXXXX, 1110XXXX, and 11110XXX are used to indicate that

json in a nutshell.

84

• JSON can represent four primitive types (strings, numbers, booleans, and null) andtwo structured types (objects and arrays).

• A value is a a string, number, boolean, null, object, or array.

Array structured type• An array is an ordered sequence of zero or more values.• Arrays are delimited by square brackets.

Example:

..[

{ ”year”: 2011, ”p r o f i t”: 2000000 },{ ”year”: 2010, ”d e f i c i t”: 1000 },

]

Querying XML XPath - Artificial Intelligenceopenclassroom.stanford.edu/.../old-site/docs/pdfs/XPath.pdf · Querying XML XPath . Jennifer Widom XPath Querying XML Not nearly as mature

XPATH. Cos’è XPATH: XPath e’ una sintassi per selezionare frammenti di documenti XML XPath non e’ un linguaggio XML XPath e’ standardizzato dal W3C

XML 6.6 XPath 6. What is XPath? XPath is a syntax used for selecting parts of an XML document The way XPath describes paths to elements is similar to

Intermediate XSLT and XPath xml-xpath Intermediate XSLT ...tecfa.unige.ch/guides/te/files/xml-xpath.pdf · • XPath uses a compact non-XML syntax (to facilitate use of XPath within

XMLI Structure of XML Data Structure of XML Data XML Document Schema XML Document Schema XPATH XPATH

CS 433 Xml, DTD, XPath, & Xslt

Xpath injection in XML databases

Xpath Sources: amoeller/XML

Processing XML: XPath, XQuery...320302 Databases & WebServices (P. Baumann) Processing XML: XPath, XQuery Ramakrishnan & Gehrke, Chapter 24 / 27

Xml, DTD, XPath, & Xslt

XPath-AwareChunking of XML-Documentsdoesen0.informatik.uni-leipzig.de/proceedings/slides/btw2003_wiss... · XPath-AwareChunking of XML-Documents Wolfgang Lehner. Uni Erlangen XPath-Aware

Basisdatensatz.pdf ·

XPath - Roma Tre Universityatzeni/didattica/BD/20112012... · XPath Introduction XPath is a language that lets you identify particular parts of XML documents XPath interprets XML

1 Chapter 10: XML What is XML What is XML Basic Components of XML Basic Components of XML XPath XPath XQuery XQuery

XML Parsers XPath, XQuery Outline - EPFLlsir · ¥XML parsers ¥XPath ¥XQuery. 31 XQuery Motivation ¥Query is a strongly typed query language ¥Builds on XPath ¥XPath expressivity

XML and XPath with PHP

XPath Was ist XPath XPath ist eine Syntax für das Definieren der Teile eines XML-Dokumentes. XPath benutzt Pfade, um XML-Elemente zu definieren. XPath

" href="https://vdocuments.net/56813a38550346895da2215c.html">

XML & XPath Injections

More XML XML schema, XPATH, XSLT

Introduction to XML, XPath, & XQuery

University" href="https://vdocuments.net/xml-schema-xpath-and-xquery-school-of-dakoopcs5530lecturesxml-schema-querypdf.html">XML Schema, XPath, and XQuery - School of Computingdakoop/cs5530/lectures/xml-schema-query.pdf · XML Schema, XPath, and XQuery ... SQL XML Schema Relational ... xsd="> University

1 XPath. 2 What is XPath? XPath is a syntax for defining parts of an XML document XPath uses paths to define XML elements XPath defines a library of standard

XML 、 Xpath 轉換 XML 文件

Course: The XPath Language - Project WAMwam.inrialpes.fr/courses/PG-MoSIG12/xpath.pdfCourse on XML, DTDs, XML Schema, Static Analysis of XPath, logic, Keywords XML, XPath, XML Schema,

Querying XML: XPath and XQuery

XPath XML Path Language. Outline XML Path Language (XPath) Data Model Description Node values XPath expressions Relative expressions Simple subset of

XML and XPath details

XML Query Languages XPATH XQUERY

XML Data Management 5. Extracting Data from XML: XPath

XML and Databases XQuery, XSLT and XPath XQuery XQuery XML

XPATH XML Path Language. Xpath – XML Path Language IT Zertifikat - Daten und Metadatenstandards: XPath 2 Entwicklung des W3C Adressierungssprache für

XML, XML Schema, Xpath and Xquery