introduction to xmlavid.cs.umass.edu/courses/445/f2009/lectures/lec14-xml.pdf · introduction to...

30
Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.

Upload: others

Post on 25-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

Introduction to XML

Yanlei DiaoUMass Amherst

Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu, Zack Ives and Gerome Miklau.

Page 2: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

2

Data Integration & Sharing

Intranet………

NULLAnna Wang111

513-123-0102Frank James109

413-555-1223Joe Black107

ContactNameSid

…………

[email protected]

[email protected]

[email protected]

Contact (NOT NULL)LastNameFirstNameSid

Amherst College Student Database

UMass Student Database

Page 3: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

3

WWW

Structured data - Databases

Unstructured Text - Documents

Semistructured Data

Integration of Text & Structured Data

Page 4: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

4

Needs for A New Data Model

Loose and rich structure

Evolving, unknown, irregular structures

Integration of structured, but heterogeneous data sources

Textual data with tags and links

Combination of data models (relational, hierarchical)

Answer to all of above : Extensible Markup Language (XML) http://www.w3.org/TR/REC-xml/

Page 5: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

5

From HTML to XML

Page 6: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

6

HTML

<h1> Bibliography </h1><p> <i> Foundations of Databases </i> Abiteboul, Hull, Vianu <br> Addison Wesley, 1995<p> <i> Data on the Web </i> Abiteboul, Buneman, Suciu <br> Morgan Kaufmann, 1999

HTML describes the presentation.

Page 7: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

7

XML

<bibliography> <book> <title> Foundations… </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> …

</bibliography>

XML describes the content!

Page 8: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

8

Outline

XML Syntax

XML typing

Page 9: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

9

XML Syntax

Tags: book, title, author, …− start tag: <book>− end tag: </book>

Elements: <book>…</book>,<author>…</author>− An element can have text data in its content.− An element can have nested elements.− Empty element: <red></red>, abbrv. <red/>

XML document: single root element

An XML document is well formed if it has a single rootelement and matching tags.

Page 10: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

10

XML Syntax

<book price = “55” currency = “USD”> <title> Foundations of Databases </title> <author> Abiteboul </author> … <year> 1995 </year></book>

Attributes are alternative ways to represent data.

Page 11: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

11

XML Syntax

<family><person id=“o555”> <name> Jane </name> </person><person id=“o456”> <name> Mary </name>

<children idrefs=“o123 o555”/></person><person id=“o123” mother=“o456”><name>John</name></person>

</family>

Oids and references in XML are just syntax.

Page 12: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

12

XML Semantics: a Tree !

<data> <person id=“o555” > <name> Mary </name> <address> <street> Maple </street> <no> 345 </no> <city> Seattle </city> </address> </person> <person> <name> John </name> <address> Thailand </address> <phone> 23456 </phone> </person></data>

data

Mary

person

person

name address

name address

street no city

Maple 345 Seattle

JohnThai

phone

23456

id

o555

Elementnode

Textnode

Attributenode

Order matters ! IDREF will turn it to a graph.

Page 13: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

13

XML Data

XML is self-describing due to the use of tags

Schema elements become part of the data– Relational schema: persons(name,phone)– In XML <persons>, <name>, <phone> are part of the

data, and are repeated many times

Consequence: XML is much more flexible

Some real data:http://www.cs.washington.edu/research/xmldatasets/

Page 14: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

14

Relational Data as XML

<person><row> <name>John</name> <phone> 3634</phone></row> <row> <name>Sue</name> <phone> 6343</phone> <row> <name>Dick</name> <phone> 6363</phone></row>

</person>

row row row

name name namephone phone phone

“John” 3634 “Sue” “Dick”6343 6363

personXML: person

name phone

John 3634

Sue 6343

Dick 6363

Page 15: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

15

XML is Semi-structured Data

Missing attributes:

Could represent ina table with nulls

<data> <person> <name> John</name> <phone>1234</phone> </person> <person> <name>Joe</name> </person></data>

← no phone !

name phone

John 1234

Joe -

Page 16: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

16

XML is Semi-structured Data

Repeated attributes

Impossible in tables:nested collections

(non 1NF)

<person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone></person>

← two phones !

name phone

Mary 2345 3456 ???

Page 17: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

17

XML is Semi-structured Data

Attributes with different types in different objects

Mixed content: <p><person>Joe</person> went to <school>Umass</school>.</p>

<data> <person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone> </person> <person> <name> M. Carey</name> <phone>3456</phone> </person></data>

← structured name !

← unstructured name !

Page 18: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

18

Outline

XML Syntax

XML typing

Page 19: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

19

Data Typing in XML

Data typing in the relational model: schema

Data typing in XML– Much more complex– Typing restricts valid trees that can occur

• theoretical foundation: tree languages

– Practical methods:• DTD (Document Type Definition)• XML Schema: superset of DTD (not covered in detail in

this class)

Page 20: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

20

Document Type Definitions (DTD)

Part of the original XML specification.• DTD can be part of an XML document.

An XML document may have a DTD.• But it is not required.

XML document iswell-formed = if tags are correctly closedValid = if it further has a DTD and conforms to it

Page 21: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

21

DTD Example

<!DOCTYPE company [ <!ELEMENT company ((person|product)*)> <!ELEMENT person (ssn, name, office, phone?)> <!ELEMENT ssn (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT office (#PCDATA)> <!ELEMENT phone (#PCDATA)> <!ELEMENT product (pid, name, description?)> <!ELEMENT pid (#PCDATA)> <!ELEMENT description (#PCDATA)>]>

Page 22: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

22

DTD Example

<company> <person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person> <person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person> <product> ... </product> ...</company>

Example of valid XML document:

Page 23: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

23

DTD: The Content Model

Content model:– Complex = a regular expression over other elements– Text-only = #PCDATA– Empty = EMPTY– Mixed content = (#PCDATA | A | B | C)*

<!ELEMENT tag_name (CONTENT)>

contentmodel

Page 24: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

24

DTD: Regular Expressions

<!ELEMENT name (firstName, lastName))

<name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName></name>

<!ELEMENT name (firstName?, lastName))

DTD XML

<!ELEMENT person (name, phone*))

sequence

optional

<!ELEMENT person (name, (phone|email)))

Kleene star

alternation

<person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . .</person>

Page 25: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

25

Attributes in DTDs

<!ELEMENT person (ssn, name, office, phone?)><!ATTLIST person age CDATA #REQUIRED>

<person age=“25”> <name> ....</name> ...</person>

Page 26: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

26

Attributes in DTDs

<!ELEMENT person (ssn, name, office, phone?)><!ATTLIST person age CDATA #REQUIRED id ID #REQUIRED manager IDREF #REQUIRED manages IDREFS #REQUIRED>

<person age=“25” id=“p29432” manager=“p48293” manages=“p34982 p423234”> <name> ....</name> ...</person>

Page 27: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

27

Attributes in DTDs

Types: CDATA = string ID = key IDREF = foreign key IDREFS = foreign keys separated by space (Monday | Wednesday | Friday) = enumeration

Page 28: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

28

Attributes in DTDs

Kind: #REQUIRED #IMPLIED = optional value = default value value #FIXED = the only value allowed

Page 29: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

29

Using DTDs

Must include in the XML document

Either include the entire DTD:– <!DOCTYPE nameRootElement [ DTD ]>

Or include a reference to it:– <!DOCTYPE nameRootElement SYSTEM “http://www.mydtd.org/”>

Page 30: Introduction to XMLavid.cs.umass.edu/courses/445/f2009/lectures/Lec14-XML.pdf · Introduction to XML Yanlei Diao UMass Amherst Slides Courtesy of Ramakrishnan & Gehrke, Dan Suciu,

30

XML Schema

DTDs capture grammatical structure, but have limitations:− Not themselves in XML, inconvenient to build tools− Don’t capture database data types− No way of defining OO-like inheritance…

XML Schema addresses shortcomings of DTDs− XML syntax− Subclassing− Domains and built-in data types− MIN and MAX # of occurrences of elements− http://www.w3.org/XML/Schema