Structured Data
1. HTML
2. XML
3. XHTML
4. JSON
5. XML Schema
Structured Data
• Machine-processable data needs to be structured
• There are many examples
• Properties files:
    host=example.com
    port=8080
    protocol=https
• Comma Separated Values:
    host,port,protocol
    example.com,8080,https
• These are examples of ‘flat files’
• Hard to model composite structures
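Both flat-file formats can be read with a few lines of code. The sketch below assumes Python and its standard csv module; `parse_properties` and `parse_csv` are hypothetical helper names introduced here, and the sample data is copied from the slide.

```python
import csv
import io

# Sample data copied from the slide.
PROPERTIES_TEXT = "host=example.com\nport=8080\nprotocol=https\n"
CSV_TEXT = "host,port,protocol\nexample.com,8080,https\n"


def parse_properties(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def parse_csv(text):
    """Parse CSV text with a header row into a list of dicts, one per data row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


properties = parse_properties(PROPERTIES_TEXT)
rows = parse_csv(CSV_TEXT)
print(properties)
print(rows)
```

Every value comes back as a flat string. Neither format has a natural way to nest one record inside another, which is the limitation noted above.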
HTML and XML
• Derivatives of Standard Generalized Markup Language (SGML).
• Offer a machine-readable, yet machine-independent, means of conveying information
• Use the angle bracket syntax (<>) to structure the document.
• Based on a tree structure:
    <html>                        (root)
      <head></head>               (head and body are siblings)
      <body>
        <p> hello world </p>      (child of body)
      </body>
    </html>
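The tree structure can be made concrete by parsing the markup and walking it. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; `walk` is a hypothetical helper name introduced here.

```python
import xml.etree.ElementTree as ET

# The document from the slide, written on one line.
DOC = "<html><head></head><body><p> hello world </p></body></html>"


def walk(element, depth=0):
    """Yield (depth, tag) pairs for an element and all of its descendants."""
    yield depth, element.tag
    for child in element:
        yield from walk(child, depth + 1)


nodes = list(walk(ET.fromstring(DOC)))
for depth, tag in nodes:
    print("  " * depth + tag)
```

The root sits at depth 0, the siblings head and body at depth 1, and p, the child of body, at depth 2.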
Elements and Attributes
• Elements are structural
• Attributes qualify elements
    <html>
      <head></head>
      <body bgcolor="red">        (body is an element, bgcolor is an attribute)
        <p> hello world </p>
      </body>
    </html>
Hypertext Markup Language (HTML)
• Its primary purpose is to convey information to a browser for human consumption:
  – <p>, <b>, <i>, <pre> etc.
• It does contain other tags that are not presentational.
• Like one for metadata:
  – <meta>
• And ones that are structural:
  – e.g. <head>, <body>, <div>, <span>
• And some that are sort of in between:
  – e.g. <ol>, <ul>, <h1>, <title>
• HTML can embed information:
  – e.g. <img>, <object>
• It can also contain style and script content in the head:
  – <style>, <script>
• Most importantly, it can link to other resources via the anchor tag and href attribute:
  – e.g. <a href="http://example.com/otherpage.html">
HTML
• HTML example describing a book
    <h1>The Cat in the Hat</h1>
    <br>
    <p>by Dr Seuss</p>
    <ul>
      <li>Publisher: HarperCollins</li>
      <li>Genre: Children’s Fiction</li>
      <li>Year: 2003</li>
      <li>ISBN: 0-00-715853</li>
    </ul>
    <br>
    visit the website <a href="http://harp.co.uk">here</a>
HTML
• The main limitations of HTML are:
  – Fixed set of tags
  – Focus on presentation
    • Like the Web, it is primarily for human consumption
  – Not all HTML is ‘well-formed’, i.e. it breaks the tree structure
    • The classic case is orphan <br> tags. Strictly speaking, a tag must either be closed by a matching end tag (wrapping any child tags) or be written as an empty tag (<br/>).
    • During the browser wars, mostly between Microsoft and Netscape, browsers became very forgiving of invalid markup in order to win users.
    • This is just about OK when dealing with a fixed set of presentational tags, free market economics permitting
    • But it is not sustainable and not good for machine parsing
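The difference between an orphan <br> and an empty <br/> shows up as soon as a strict parser is involved. This is a small sketch, assuming Python's standard xml.etree.ElementTree module as the strict parser; `is_well_formed` is a hypothetical helper name, and the two sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


orphan_br = "<p>line one<br>line two</p>"    # <br> is opened but never closed
empty_br = "<p>line one<br/>line two</p>"    # <br/> is a self-closed empty element

print(is_well_formed(orphan_br))
print(is_well_formed(empty_br))
```

A forgiving browser renders both strings the same way. A strict parser accepts only the second.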
Extensible Markup Language (XML)
• XML is (e)xtensible.
  – You can create your own tags, which means
  – Tags can be understood in semantic terms:
    • e.g. <book> contains <author>
  – XML MUST be well-formed (no structural inconsistencies like <br>)
  – Validation against a Document Type Definition (DTD), XML Schema or RELAX NG document is easier because it is well-formed.
    • These define what a particular document can contain, e.g. a book element MUST contain one or more author elements
XML
• XML example of a book
    <?xml version="1.0"?>
    <book>
      <title>The Cat in the Hat</title>
      <author>Dr Seuss</author>
      <isbn>0-00-715853</isbn>
      <genre>Children’s Fiction</genre>
      <published>2003</published>
      <publisher>
        <name>HarperCollins</name>
        <url>http://harp.co.uk</url>
      </publisher>
    </book>
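Because the tags carry meaning, a program can pull out individual properties by name, including the nested publisher. The sketch below assumes Python's standard xml.etree.ElementTree module and uses the book document with every element properly closed; the `record` dictionary and its keys are illustrative choices, not part of any standard.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0"?>
<book>
  <title>The Cat in the Hat</title>
  <author>Dr Seuss</author>
  <isbn>0-00-715853</isbn>
  <genre>Children's Fiction</genre>
  <published>2003</published>
  <publisher>
    <name>HarperCollins</name>
    <url>http://harp.co.uk</url>
  </publisher>
</book>"""

book = ET.fromstring(BOOK_XML)

# findtext takes a path, so nested elements are addressed as parent/child.
record = {
    "title": book.findtext("title"),
    "author": book.findtext("author"),
    "published": book.findtext("published"),
    "publisher": book.findtext("publisher/name"),
    "url": book.findtext("publisher/url"),
}
print(record)
```

The composite publisher element is exactly the kind of structure that the flat properties and CSV formats could not express.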
XML Pros
• Plain text
  – Human readable
  – Create/edit in standard text editor (if you really want to)
• Self-describing, structured data
  – Extensible tag language
  – Machine readable
  – Can be validated against DTDs and Schema
• Presentation independent
  – Unlike HTML
  – Format to other languages using transformations (e.g. XSLT)
• Programming language independent
  – Java, C, C++, Visual Basic, Perl…
• Simple to parse
• Widely used in many domains and for many purposes
XML Cons
• The main limitations of XML are:
  – Verbose way of describing data
  – How do you include binary data (e.g. images)?
    • (work in progress and not ubiquitously supported)
  – A proliferation of DTD and Schema types because anyone can create their own tags
    • Lots of processing time for each new XML doc and DTD/Schema you come across
    • New software components to understand the new XML docs (their semantics, not structure)
    • How do I know if your <author> tag means the same as my <author> tag?
XML Namespaces
• This last issue is addressed through namespaces
  – Allows a tag to be qualified by a URI:
    <a:author xmlns:a="http://andrew/namespace">
    <s:author xmlns:s="http://sue/namespace">
• Now I can tell the difference between the two author tags :-)
• But the XML is more complicated :-(
• And what happens if I change the definition of my author tag?
• I suppose I better change the namespace:
    <a:author xmlns:a="http://andrew/namespace/v1">
    (a is the prefix, the URI is the namespace, and xmlns:a is the binding between them)
• That’s better :-)
• But now every client that understood the previous namespace is broken :-(
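A namespace-aware parser shows both the benefit and the cost. The sketch below assumes Python's standard xml.etree.ElementTree module, which expands each prefixed tag to `{namespace}localname`; the `<people>` wrapper element and the text inside the author tags are hypothetical, added only so that both tags can live in one document.

```python
import xml.etree.ElementTree as ET

DOC = """<people>
  <a:author xmlns:a="http://andrew/namespace">Andrew's idea of an author</a:author>
  <s:author xmlns:s="http://sue/namespace">Sue's idea of an author</s:author>
</people>"""

root = ET.fromstring(DOC)

# The prefix disappears after parsing; only the namespace URI identifies the tag.
tags = [child.tag for child in root]
print(tags)

# A client written against the original namespace finds Andrew's tag...
andrew_authors = root.findall("{http://andrew/namespace}author")

# ...but a client looking for the versioned namespace finds nothing, and vice versa.
v1_authors = root.findall("{http://andrew/namespace/v1}author")
print(len(andrew_authors), len(v1_authors))
```

Matching is done on the exact URI string, so appending /v1 produces a different namespace as far as every existing client is concerned.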
RDF XML example
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:foaf="http://xmlns.com/foaf/0.1/">
      <foaf:Person rdf:about="#AL">
        <foaf:name>Archibald Leach</foaf:name>
        <foaf:mbox_sha1sum>cf2342293...</foaf:mbox_sha1sum>
        <foaf:knows>
          <foaf:Person>
            <foaf:name>Katharine Hepburn</foaf:name>
          </foaf:Person>
        </foaf:knows>
      </foaf:Person>
    </rdf:RDF>
XHTML
• In between HTML and XML
  – It is valid HTML and valid XML
• MUST be well-formed.
• Fixed set of tags
  – Makes use of HTML non-presentational tags.
  – Defers presentational concerns completely to Cascading Style Sheets (CSS)
    • Instead uses element attributes to inject presentational hints to the CSS:
      <div class="my-important-type">I’m important</div>
      (class is the attribute that carries the hint)
Cascading Style Sheets (CSS)
• A rendering language that goes in the header of an HTML page
  – Property based
    • element-type {presentation-key: value}
• CSS allows for extensibility!
  – I can define a class, and define rendering hints to the browser for that class:
    <style type="text/css">
      .my-important-type {color: red}
    </style>
    And in the document:
    <div class="my-important-type">Hey wait!</div>
• Hey, wait!
  – At the same time as defining rendering hints to the browser, I’m also classifying an element in the document.
  – Perhaps I can use this to support semantic information, not just rendering information
  – So I could call my class .book and have elements inside it like .title and .author. Hmm…
XHTML example
    <head>
      <title>My Book</title>
    </head>
    <body>
      <div class="book">
        <h1 class="title">The Cat in the Hat</h1>
        <p>by <span class="author">Dr Seuss</span></p>
        <ul>
          <li>Publisher: <span class="pub">HarperCollins</span></li>
          <li>Genre: <span class="genre">Children’s Fiction</span></li>
          <li>Year: <span class="year">2003</span></li>
          <li>ISBN: <span class="isbn">0-00-715853</span></li>
        </ul>
      </div>
      <p>visit the website at <a href="http://harp.co.uk" class="url" title="http://harp.co.uk">here</a></p>
    </body>
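Since well-formed XHTML is also XML, the class attributes can be read back out by an ordinary XML parser. The following is a rough sketch, assuming Python's standard xml.etree.ElementTree module and a well-formed copy of the book markup wrapped in a single `<html>` root; treating each class name as a property key is an assumption made for this illustration, not an established convention.

```python
import xml.etree.ElementTree as ET

XHTML = """<html>
<head><title>My Book</title></head>
<body>
  <div class="book">
    <h1 class="title">The Cat in the Hat</h1>
    <p>by <span class="author">Dr Seuss</span></p>
    <ul>
      <li>Publisher: <span class="pub">HarperCollins</span></li>
      <li>Genre: <span class="genre">Children's Fiction</span></li>
      <li>Year: <span class="year">2003</span></li>
      <li>ISBN: <span class="isbn">0-00-715853</span></li>
    </ul>
  </div>
</body>
</html>"""

root = ET.fromstring(XHTML)

# Locate the element classified as a book, then collect every classified
# descendant, using the class name as the key and the element text as the value.
book_div = root.find(".//div[@class='book']")
book = {
    element.get("class"): element.text
    for element in book_div.iter()
    if element is not book_div and element.get("class")
}
print(book)
```

The same markup that a browser styles for a human reader yields a dictionary of book properties for a machine.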
XHTML with some CSS
• Here’s what it looks like in a browser with a bit of CSS in the head of the HTML page:
  (browser screenshot of the styled book page)
The important thing to take away here is that the data has not been lost through rendering.
It looks nice for a human, but a machine can still extract the book properties.
HTML 5
• Builds on HTML 4
• A set of features, rather than a monolithic spec.
• Not all browsers support all features yet.
• HTML 5 MUST be well-formed when written as XHTML (its XML serialization)
• Some core features:
  – Canvas – drawing area
  – Video – embed directly – no need for plugins
  – Local storage
  – Multi-threaded JavaScript
  – Geolocation
  – Semantic tags – section, header, footer etc.
  – Microdata – embedded semantic metadata, e.g. licencing, vCards and your own vocabs.
HTML 5
• Microdata – embedded semantic metadata, e.g. licencing, vCards and your own vocabs.
• You can create scopes on a tag:
    <section itemscope itemtype="http://data-vocabulary.org/Person">
  – Then mark up elements within the scope:
    <img itemprop="photo" src="…"/>
    <p itemprop="name">Andrew</p>
• Then publish your vocabulary so people can use it.
• Publish in human-readable form, and RDF for machine processing.
• See http://html5demos.com/
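Microdata can be read back with a simple tag-by-tag scan. The sketch below assumes Python's standard html.parser module; `ItempropParser` is a hypothetical class written for this illustration, the file name andrew.jpg is a made-up placeholder, and the scan handles only one flat scope rather than the full microdata rules.

```python
from html.parser import HTMLParser

PAGE = """<section itemscope itemtype="http://data-vocabulary.org/Person">
  <img itemprop="photo" src="andrew.jpg"/>
  <p itemprop="name">Andrew</p>
</section>"""


class ItempropParser(HTMLParser):
    """Collect itemprop values from a single, non-nested itemscope."""

    def __init__(self):
        super().__init__()
        self.itemtype = None   # vocabulary URI declared on the scope
        self.items = {}        # property name -> value
        self._pending = None   # property waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.itemtype = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag == "img":
            # For images the property value is the src attribute.
            self.items[prop] = attrs.get("src")
        else:
            self._pending = prop

    def handle_data(self, data):
        # The first text after a tagged element is taken as its value.
        if self._pending is not None:
            self.items[self._pending] = data.strip()
            self._pending = None


parser = ItempropParser()
parser.feed(PAGE)
parser.close()
print(parser.itemtype, parser.items)
```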
JavaScript Object Notation (JSON)
• Another structured document type, not based on XML.
• Instead uses properties, and nested curly braces to describe data:
    {"location":
      {"id": "WashingtonDC",
       "city": "Washington DC",
       "venue": "Hilton Hotel, Tysons Corner",
       "address": "7920 Jones Branch Drive"
      }
    }
• Essentially a dictionary
• Supports number, string, boolean, null, array (list) and object (map)
• JSON can be parsed into a JavaScript object using JSON.parse(string). The older eval(string) approach also works, but it executes arbitrary code and is unsafe for untrusted input.
• Popular because it is simpler than XML and natively understood by browsers.
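JSON is not tied to JavaScript; most languages ship a parser that maps it onto their own dictionary and list types. This is a minimal sketch, assuming Python's standard json module and the location document from the slide.

```python
import json

TEXT = """{"location":
  {"id": "WashingtonDC",
   "city": "Washington DC",
   "venue": "Hilton Hotel, Tysons Corner",
   "address": "7920 Jones Branch Drive"
  }
}"""

# json.loads only parses data; unlike eval, it never executes the input.
data = json.loads(TEXT)
city = data["location"]["city"]
print(city)

# Serializing the dictionary again gives text that parses back to the same data.
round_trip = json.dumps(data, indent=2)
print(round_trip)
```

The nested object becomes a nested dictionary, so access is plain key lookup, with no tree API or path language in between.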
XML Schema
• XML syntax for describing how XML documents should be structured.
  – Has some built-in data types
• Allows for validation of an XML document
• Allows for code generation
  – Create objects in your favorite programming language to manipulate XML documents
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                targetNamespace="urn:book"
                xmlns:bk="urn:book">

      <xsd:element name="book" type="bk:Book"/>

      <xsd:complexType name="Book">
        <xsd:sequence>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="author" type="xsd:string"/>
          <xsd:element name="isbn" type="xsd:string"/>
          <xsd:element name="genre" type="xsd:string"/>
          <xsd:element name="published" type="xsd:date"/>
          <xsd:element name="publisher" type="bk:Publisher"/>
        </xsd:sequence>
      </xsd:complexType>

      <xsd:complexType name="Publisher">
        <xsd:sequence>
          <xsd:element name="name" type="xsd:string"/>
          <xsd:element name="url" type="xsd:anyURI"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>
Structured Data
• Why use structured data?
• Understand how structured data encapsulates information
• What are the strengths/weaknesses of different types of structured data?