Structured Data 1. HTML 2. XML 3. XHTML 4. JSON 5. XMLSchema

Upload: eleanor-jacobs

Post on 11-Jan-2016


Structured Data

1. HTML
2. XML
3. XHTML
4. JSON
5. XML Schema

Structured Data
• Machine-processable data needs to be structured
• There are many examples
• Properties files:

host=example.com
port=8080
protocol=https

• Comma Separated Values:

host,port,protocol
example.com,8080,https

• These are examples of ‘flat files’
• Flat files make it hard to model composite structures
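To make the 'flat file' point concrete, here is a minimal Python sketch that reads the two examples above. The helper names (`parse_properties`, `parse_csv`) are illustrative assumptions, not part of the slides. Both formats come back as flat key/value records with every value a plain string, which is why nesting is hard to express.

```python
import csv
import io

# The two flat-file examples from the slide, held as strings for the demo.
PROPERTIES_TEXT = "host=example.com\nport=8080\nprotocol=https\n"
CSV_TEXT = "host,port,protocol\nexample.com,8080,https\n"


def parse_properties(text):
    """Parse key=value lines into a dict; blank lines are skipped."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def parse_csv(text):
    """Parse CSV text with a header row into a list of row dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


properties = parse_properties(PROPERTIES_TEXT)
rows = parse_csv(CSV_TEXT)
print(properties)
print(rows)
```

Note that the port is the string "8080" in both cases: a flat file carries no type or structure information of its own.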

HTML and XML
• Derivatives of Standard Generalized Markup Language (SGML).
• Offer a machine-readable, yet machine-independent, means of conveying information
• Use the angle bracket syntax (<>) to structure the document.
• Based on a tree structure:

<html>                        root
  <head></head>               siblings (<head> and <body>)
  <body>
    <p> hello world </p>      child (of <body>)
  </body>
</html>

Elements and Attributes
• Elements are structural
• Attributes qualify elements

<html>
  <head></head>
  <body bgcolor="red">        attribute (bgcolor) on the <body> element
    <p> hello world </p>      element
  </body>
</html>
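The tree, element and attribute ideas can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration: it works here because this particular snippet happens to be well-formed, which is not true of HTML in general.

```python
import xml.etree.ElementTree as ET

# The slide's example document, with straight quotes around the attribute value.
DOCUMENT = """
<html>
  <head></head>
  <body bgcolor="red">
    <p> hello world </p>
  </body>
</html>
"""

root = ET.fromstring(DOCUMENT.strip())

# Iterating an element yields its direct children, in document order.
children = [child.tag for child in root]

body = root.find("body")      # <head> and <body> are siblings under the root
paragraph = body.find("p")    # <p> is a child of <body>

print(root.tag)                  # the root element
print(children)                  # the root's children
print(body.attrib)               # attributes qualify the element
print(paragraph.text.strip())    # text content of the <p> element
```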

Hypertext Markup Language (HTML)

• Its primary purpose is to convey information to a browser for human consumption:
  – <p>, <b>, <i>, <pre> etc.
• It does contain other tags that are not presentational.
• Like one for metadata:
  – <meta>
• And ones that are structural:
  – e.g. <head>, <body>, <div>, <span>
• And some that are sort of in between:
  – e.g. <ol>, <ul>, <h1>, <title>
• HTML can embed information:
  – e.g. <img>, <object>
• It can also contain style and script content in the header:
  – <style>, <script>
• Most importantly, it can link to other resources via the anchor tag and href attribute:
  – e.g. <a href="http://example.com/otherpage.html">

HTML
• HTML example describing a book

<h1>The Cat in the Hat</h1>
<br>
<p>by Dr Seuss</p>
<ul>
  <li>Publisher: HarperCollins</li>
  <li>Genre: Children’s Fiction</li>
  <li>Year: 2003</li>
  <li>ISBN: 0-00-715853</li>
</ul>
<br>
visit the website <a href="http://harp.co.uk">here</a>

HTML
• The main limitations of HTML are:
  – Fixed set of tags
  – Focus on presentation
    • Like the Web, it is primarily for human consumption
  – Not all HTML is ‘well-formed’, i.e. it breaks the tree structure
    • The classic case is orphan <br> tags. Strictly speaking, a tag must either be closed by a matching end tag, or be an empty tag (<br/>).
    • During the browser wars, mostly between Microsoft and Netscape, browsers became very forgiving of invalid markup to recruit users.
    • This is just about OK when dealing with a fixed set of presentational tags, free market economics permitting
    • But it is not sustainable and not good for machine parsing
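A small Python sketch of why orphan tags matter to a machine: a strict XML parser rejects markup with an unclosed <br>, while the empty-tag form parses. The helper name `is_well_formed` and the two sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# An orphan <br>: opened but never closed, so the tree structure is broken.
ORPHAN_BR = "<body><p>hello</p><br><p>world</p></body>"

# The same content using the empty-tag form <br/>.
EMPTY_BR = "<body><p>hello</p><br/><p>world</p></body>"

print(is_well_formed(ORPHAN_BR))
print(is_well_formed(EMPTY_BR))
```

A forgiving browser renders both strings the same way; a strict parser only accepts the second.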

Extensible Markup Language (XML)

• XML is (e)xtensible.
  – You can create your own tags, which means
  – Tags can be understood in semantic terms:
    • e.g. <book> contains <author>
  – XML MUST be well-formed (no structural inconsistencies like an unclosed <br>)
  – Validation against a Document Type Definition (DTD), XML Schema or RELAX NG document is easier because it is well-formed.
    • These define what a particular document can contain, e.g. a book element MUST contain >= 1 author elements

XML
• XML example of a book

<?xml version="1.0"?>
<book>
  <title>The Cat in the Hat</title>
  <author>Dr Seuss</author>
  <isbn>0-00-715853</isbn>
  <genre>Children’s Fiction</genre>
  <published>2003</published>
  <publisher>
    <name>HarperCollins</name>
    <url>http://harp.co.uk</url>
  </publisher>
</book>
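As a rough sketch of machine processing, the book document can be read with Python's standard `xml.etree.ElementTree`. The `<isbn>` element is closed properly here so the document is well-formed. The shape of the `record` dictionary is an illustrative choice, not something the slides define.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0"?>
<book>
  <title>The Cat in the Hat</title>
  <author>Dr Seuss</author>
  <isbn>0-00-715853</isbn>
  <genre>Children’s Fiction</genre>
  <published>2003</published>
  <publisher>
    <name>HarperCollins</name>
    <url>http://harp.co.uk</url>
  </publisher>
</book>
"""

book = ET.fromstring(BOOK_XML)

# Unlike a flat file, the publisher is a composite value nested inside the book.
record = {
    "title": book.findtext("title"),
    "author": book.findtext("author"),
    "isbn": book.findtext("isbn"),
    "published": book.findtext("published"),
    "publisher": {
        "name": book.findtext("publisher/name"),
        "url": book.findtext("publisher/url"),
    },
}
print(record)
```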

XML Pros
• Plain text
  – Human readable
  – Create/edit in standard text editor (if you really want to)
• Self-describing, structured data
  – Extensible tag language
  – Machine readable
  – Can be validated against DTDs and Schema
• Presentation independent
  – Unlike HTML
  – Format to other languages using transformations (e.g. XSLT)
• Programming language independent
  – Java, C, C++, Visual Basic, Perl…
• Simple to parse
• Widely used in many domains and for many purposes

XML Cons

• The main limitations of XML are:
  – Verbose way of describing data
  – How do you include binary data (e.g. images)?
    • (work in progress and not ubiquitously supported)
  – A proliferation of DTD and Schema types because anyone can create their own tags
    • Lots of processing time for each new XML doc and DTD/Schema you come across
    • New software components to understand the new XML docs (their semantics, not structure)
    • How do I know if your <author> tag means the same as my <author> tag?

XML Namespaces
• This last issue is addressed through namespaces
  – Allows a tag to be qualified by a URI:

<a:author xmlns:a="http://andrew/namespace">
<s:author xmlns:s="http://sue/namespace">

• Now I can tell the difference between the two author tags :-)
• But the XML is more complicated :-(
• And what happens if I change the definition of my author tag?
• I suppose I’d better change the namespace:

<a:author xmlns:a="http://andrew/namespace/v1">

(a is the prefix, http://andrew/namespace/v1 is the namespace, and the xmlns:a attribute is the binding between them)

• That’s better :-)
• But now every client that understood the previous namespace is broken :-(
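A minimal Python sketch of how a parser sees namespaced tags. The wrapping `<authors>` element and the text inside each author tag are assumptions added so the two slide examples fit in one well-formed document. It also shows the versioning problem: a client looking for the `/v1` namespace finds nothing in a document that uses the old URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper document combining the two author tags from the slide.
DOCUMENT = """
<authors xmlns:a="http://andrew/namespace" xmlns:s="http://sue/namespace">
  <a:author>Andrew's kind of author</a:author>
  <s:author>Sue's kind of author</s:author>
</authors>
"""

root = ET.fromstring(DOCUMENT.strip())

# ElementTree expands each prefix to its URI, written as {uri}localname.
expanded_tags = [child.tag for child in root]

# A prefix is only a local alias; the mapping below is what binds it to a URI.
namespaces = {"a": "http://andrew/namespace", "s": "http://sue/namespace"}
andrew = root.find("a:author", namespaces)
sue = root.find("s:author", namespaces)

# A client written against the changed (v1) namespace no longer matches.
v1_author = root.find("{http://andrew/namespace/v1}author")

print(expanded_tags)
print(andrew.text, "|", sue.text)
print(v1_author)
```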

RDF XML example

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="#AL">
    <foaf:name>Archibald Leach</foaf:name>
    <foaf:mbox_sha1sum>cf2342293...</foaf:mbox_sha1sum>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Katharine Hepburn</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>

XHTML
• In between HTML and XML
  – It is valid HTML and valid XML
• MUST be well-formed.
• Fixed set of tags
  – Makes use of HTML non-presentational tags.
  – Defers presentational concerns completely to Cascading Style Sheets (CSS)
    • Instead uses element attributes to inject presentational hints to the CSS:

<div class="my-important-type">I’m important</div>      (class attribute)

Cascading Style Sheets (CSS)
• A rendering language that goes in the header of an HTML page
  – Property based
    • element-type {presentation-key: value}
• CSS allows for extensibility!
  – I can define a class, and define rendering hints to the browser for that class:

<style type="text/css">
  .my-important-type {color: red}
</style>

And in the document:

<div class="my-important-type">Hey wait!</div>

• Hey, wait!
• At the same time as defining rendering hints to the browser, I’m also classifying an element in the document.
• Perhaps I can use this to support semantic information, not just rendering information
• So I could call my class .book and have elements inside it like .title and .author. Hmm…

XHTML example

<head>
  <title>My Book</title>
</head>
<body>
  <div class="book">
    <h1 class="title">The Cat in the Hat</h1>
    <p>by <span class="author">Dr Seuss</span></p>
    <ul>
      <li>Publisher: <span class="pub">HarperCollins</span></li>
      <li>Genre: <span class="genre">Children’s Fiction</span></li>
      <li>Year: <span class="year">2003</span></li>
      <li>ISBN: <span class="isbn">0-00-715853</span></li>
    </ul>
  </div>
  <p>visit the website at <a href="http://harp.co.uk" class="url" title="http://harp.co.uk">here</a></p>
</body>

XHTML with some CSS
• Here’s what it looks like in a browser with a bit of CSS in the head of the HTML page: [screenshot]
• The important thing to take away here is that the data has not been lost through rendering.
• It looks nice for a human, but a machine can still extract the book properties.
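One way a machine might extract the book properties is sketched below in Python. It relies on the XHTML being well-formed, so a plain XML parser can read it. The markup is a trimmed copy of the book example, and the `PROPERTY_CLASSES` tuple and `extract_book` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the XHTML book example.
XHTML = """
<body>
  <div class="book">
    <h1 class="title">The Cat in the Hat</h1>
    <p>by <span class="author">Dr Seuss</span></p>
    <ul>
      <li>Publisher: <span class="pub">HarperCollins</span></li>
      <li>Year: <span class="year">2003</span></li>
    </ul>
  </div>
</body>
"""

# Class names treated as book properties (an assumption for this sketch).
PROPERTY_CLASSES = ("title", "author", "pub", "year")


def extract_book(markup):
    """Collect the text of every element whose class names a book property."""
    root = ET.fromstring(markup.strip())
    book = root.find(".//div[@class='book']")
    properties = {}
    for element in book.iter():
        css_class = element.get("class")
        if css_class in PROPERTY_CLASSES:
            properties[css_class] = element.text
    return properties


book_properties = extract_book(XHTML)
print(book_properties)
```

The same class attributes that drive the CSS rendering double as labels for the data.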

HTML 5
• Builds on HTML 4
• A set of features, rather than a monolithic spec.
• Not all browsers support all features yet.
• HTML 5 can be written as well-formed XML (the XHTML syntax), although the HTML syntax does not require it
• Some core features:
  – Canvas – drawing area
  – Video – embed directly – no need for plugins
  – Local storage
  – Multi-threaded Javascript (Web Workers)
  – Geolocation
  – Semantic tags – section, header, footer etc.
  – Micro data – embedded semantic metadata, e.g. licensing, vCards and your own vocabs.

HTML 5
• Micro data – embedded semantic metadata, e.g. licensing, vCards and your own vocabs.
• You can create scopes on a tag:

<section itemscope itemtype="http://data-vocabulary.org/Person">

  – Then mark up elements within the scope:

<img itemprop="photo" src="…"/>
<p itemprop="name">Andrew</p>

• Then publish your vocabulary so people can use it.
• Publish in human-readable form, and RDF for machine processing.

See http://html5demos.com/
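A rough Python sketch of reading microdata with the standard `html.parser` module. It handles only the simple case on this slide and is not a full microdata processor. The `MicrodataParser` class is hypothetical, and the image path `photo.png` is a placeholder, since the slide leaves the src value elided.

```python
from html.parser import HTMLParser

# The slide's markup; "photo.png" is a placeholder for the elided src value.
MARKUP = """
<section itemscope itemtype="http://data-vocabulary.org/Person">
  <img itemprop="photo" src="photo.png"/>
  <p itemprop="name">Andrew</p>
</section>
"""


class MicrodataParser(HTMLParser):
    """Collect itemprop values from a single, non-nested itemscope."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._pending = None  # itemprop still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            self.item_type = attributes.get("itemtype")
        prop = attributes.get("itemprop")
        if prop is None:
            return
        if tag == "img":
            # For images the property value is the src attribute.
            self.properties[prop] = attributes.get("src")
        else:
            self._pending = prop

    def handle_data(self, data):
        if self._pending is not None and data.strip():
            self.properties[self._pending] = data.strip()
            self._pending = None


parser = MicrodataParser()
parser.feed(MARKUP)
parser.close()
print(parser.item_type)
print(parser.properties)
```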

Javascript Object Notation (JSON)

• Another structured document type, not based on XML.
• Instead uses properties, and nested curly braces to describe data:

{"location":
  {"id": "WashingtonDC",
   "city": "Washington DC",
   "venue": "Hilton Hotel, Tysons Corner",
   "address": "7920 Jones Branch Drive"
  }
}

• Essentially a dictionary
• Supports number, string, boolean, null, array (list) and object (map)
• JSON can be parsed into a Javascript object using JSON.parse(string); the older eval(string) approach also works but executes arbitrary code, so it is unsafe for untrusted input.
• Popular because it is simpler than XML and natively understood by browsers.
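The 'essentially a dictionary' point carries over to other languages too. As a small sketch, Python's standard `json` module turns the location example into nested dictionaries; the variable names are illustrative.

```python
import json

TEXT = """
{"location":
  {"id": "WashingtonDC",
   "city": "Washington DC",
   "venue": "Hilton Hotel, Tysons Corner",
   "address": "7920 Jones Branch Drive"
  }
}
"""

# json.loads only parses data; unlike eval, it never executes code.
data = json.loads(TEXT)
location = data["location"]

# Serialising and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(data))

print(type(data).__name__)
print(location["city"])
print(sorted(location))
```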

XML Schema

• XML syntax for describing how XML documents should be structured.
  – Has some built-in data types
• Allows for validation of an XML document
• Allows for code generation
  – Create objects in your favorite programming language to manipulate XML documents

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:book"
            xmlns:bk="urn:book">

  <xsd:element name="book" type="bk:Book"/>

  <xsd:complexType name="Book">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="isbn" type="xsd:string"/>
      <xsd:element name="genre" type="xsd:string"/>
      <xsd:element name="published" type="xsd:date"/>
      <xsd:element name="publisher" type="bk:Publisher"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="Publisher">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="url" type="xsd:anyURI"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
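Because a schema is itself an XML document, it can be read with an ordinary XML parser. The Python sketch below uses a trimmed copy of the Book type and a deliberately simplified check: it only compares child element names against the declared sequence, ignoring data types and namespaces. It is a stand-in for illustration, not a real XML Schema validator, and the helper names are assumptions.

```python
import xml.etree.ElementTree as ET

# Expanded-name prefix for elements in the XML Schema namespace.
XSD = "{http://www.w3.org/2001/XMLSchema}"

# Trimmed copy of the schema: only three of the Book elements are kept.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Book">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="isbn" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""

GOOD_BOOK = (
    "<book><title>The Cat in the Hat</title>"
    "<author>Dr Seuss</author><isbn>0-00-715853</isbn></book>"
)
BAD_BOOK = "<book><title>The Cat in the Hat</title><isbn>0-00-715853</isbn></book>"


def declared_sequence(schema_text, type_name):
    """Return the element names declared, in order, for a complex type."""
    schema = ET.fromstring(schema_text.strip())
    for complex_type in schema.iter(XSD + "complexType"):
        if complex_type.get("name") == type_name:
            return [e.get("name") for e in complex_type.iter(XSD + "element")]
    raise KeyError(type_name)


def matches_sequence(document_text, expected):
    """Simplified check: child element names must equal the declared sequence."""
    root = ET.fromstring(document_text.strip())
    return [child.tag for child in root] == expected


expected = declared_sequence(SCHEMA, "Book")
print(expected)
print(matches_sequence(GOOD_BOOK, expected))
print(matches_sequence(BAD_BOOK, expected))
```

In practice a dedicated schema validator would also check data types, occurrence counts and namespaces, which this sketch skips.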

Structured Data

• Why use structured data?
• Understand how structured data encapsulates information
• What are the strengths/weaknesses of different types of structured data?