IS 257 – Fall 2011 2011.08.25 - SLIDE 1
XML Foundations: Introduction
Ray R. Larson
University of California, Berkeley
School of Information
IS 242: XML Foundations
This lecture is largely based on an earlier lecture by Eric Wilde
IS 257 – Fall 2011 2011.08.25 - SLIDE 2
Abstract
• The Extensible Markup Language (XML) was introduced in 1998 to enable content providers to publish their content on the Web in an application-specific format. HTML was thought to lack sufficient semantics, since its only purpose was (and is) the preparation of content for Web-based publishing. XML was the first step towards machine-readable data formats for the Web, a trend that has been taken to higher levels with the introduction of the Semantic Web. XML appeared when the Web was in the steepest part of its success curve, and since then has taken over as the globally accepted format for the exchange of machine-readable structured data.
IS 257 – Fall 2011 2011.08.25 - SLIDE 3
XML Overview
• More and more value switches from goods to information
• Information sharing needs well-defined structures
• Business agility and flexibility are critical success factors
• Standardized formats prevent lock-in and incompatibilities
• XML is the most successful format for structured data
• XML technologies are widely used and universally available
• XML for B2B enables better workflow engineering
• XML for B2C is a good interface between B2B and Web interfaces
• XML is a mission-critical success factor for optimizing ROI and minimizing interoperability risks in today's fast-moving globalized fragmented business landscape …
IS 257 – Fall 2011 2011.08.25 - SLIDE 4
Plan for the Course
• XML Basics and how to apply them
• Describing classes of XML documents
• Combining different vocabularies of XML documents
• Selecting parts of an XML document
• Transforming XML into something else (or XML again)
• A more complicated way to describe classes of XML documents
• Even more ways of describing classes of XML documents
• How does all of this relate to databases?
• What to expect as future developments
IS 257 – Fall 2011 2011.08.25 - SLIDE 5
What will we be doing?
• Projects
  – Encoded Archival Description/Encoded Archival Context
  – iTunes XML as the common theme (linking with other data)
  – how to understand an XML document representing an iTunes library
  – how to write a schema describing this document's structure
  – how to select parts of the library (tracks, playlists, artists, …)
  – how to transform libraries/playlists (into HTML, Atom, …)
• Tools
  – XML editor such as Altova XML Spy (XSLT and XQuery included) or Oxygen
  – XSLT Processor such as Saxon
  – XQuery Processor such as Saxon
IS 257 – Fall 2011 2011.08.25 - SLIDE 6
Outline
• Varia
• What is XML?
• Why XML?
• Beyond XML
IS 257 – Fall 2011 2011.08.25 - SLIDE 7
About the course
• All subject to change…
• Web site at
  – http://courses.ischool.berkeley.edu/i242/f11/
• Office hours TuTh 2-3
• TA: Yiming Liu
  – Office, lab hours TBA
• Guest lecturers when away
  – Eric Wilde, Jeroen van Rotterdam (EMC)
IS 257 – Fall 2011 2011.08.25 - SLIDE 8
Outline
• Varia
• What is XML?
  – What is XML Good for?
  – What is XML NOT Good for?
• Why XML?
• Beyond XML
IS 257 – Fall 2011 2011.08.25 - SLIDE 9
XML Yin/Yang
• XML is …
  – … great for exchanging trees (if this is what you want to do)
  – … platform-independent (even your mobile phone processes XML)
  – … a foundation for other technologies (some of which we will look at)
• XML is not …
  – … a programming language (ever programmed comma-separated values?)
  – … capturing semantics (without higher-layer consensus, XML is worthless)
  – … ensuring interoperability (we both use bits! we can interoperate!)
IS 257 – Fall 2011 2011.08.25 - SLIDE 10
Outline
• Varia
• What is XML?
  – What is XML Good for?
  – What is XML NOT Good for?
• Why XML?
• Beyond XML
IS 257 – Fall 2011 2011.08.25 - SLIDE 11
Why use XML
• Because you want to share data
  – share it in a format which is widely used and easy to use
  – enable others to use it on various platforms with existing tools
• Because you want to share data cheaply
  – it is easier to use XML than to invent something new
  – it is even easier to use an existing XML schema than to invent a new one
• Because you want to share data openly
  – if you invent new formats, people must process them
  – avoid applying the "security through obscurity" principle inadvertently
  – application-specific processing should be deferred to higher layers
IS 257 – Fall 2011 2011.08.25 - SLIDE 12
Is XML self-describing?
• XML is often said to be "self-describing"
  – many people think this is the same as "self-explanatory"
  – the catch is what exactly you mean by "describing"
• Database data cannot live without a database
  – database data is simply content, the structure is provided by a DBMS
  – XML documents have their structure encoded within them
  – compared to database data, XML in fact is "self-describing"
• What is the gap between "self-describing" and "self-explanatory"?
  – it is impossible to find out how the document could be modified
  – there are no semantics associated with either structure or content
  – so "self-describing" means you can guess a lot, but you may be wrong
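A minimal Python sketch of that gap, using only the standard library. The order record and its field names are invented for illustration. The field structure can be read straight out of the document, but the date has two equally valid readings and nothing in the XML says which one was meant.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical record; nothing in the document says what its fields mean.
DOC = "<order><id>17</id><date>03/04/2011</date><total>12.50</total></order>"

root = ET.fromstring(DOC)

# The structure is recoverable from the document alone ...
fields = {child.tag: child.text for child in root}
print(fields)

# ... but the interpretation is a guess: both readings of the date parse fine.
as_month_first = datetime.strptime(fields["date"], "%m/%d/%Y").date()
as_day_first = datetime.strptime(fields["date"], "%d/%m/%Y").date()
print(as_month_first, as_day_first)
```

Both calls succeed and return different dates, which is the "you can guess a lot, but you may be wrong" situation in miniature.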
IS 257 – Fall 2011 2011.08.25 - SLIDE 13
Outline
• Varia
• What is XML?
  – What is XML Good for?
  – What is XML NOT Good for?
• Why XML?
• Beyond XML
IS 257 – Fall 2011 2011.08.25 - SLIDE 14
XML is Character-based
• XML is not a binary format, it is based on Unicode
  – "binary structures" cannot (or rather should not) be described using XML
• Multimedia formats often are binary
  – image formats such as GIF, JPEG, and PNG
  – audio formats such as MP3 and AAC
  – video formats such as MPEG4 and H.264
• But: multimedia also uses many XML formats
  – vector graphics formats such as Scalable Vector Graphics (SVG)
  – Synchronized Multimedia Integration Language (SMIL) for describing presentations
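One way to see why binary data fits XML poorly: bytes have to be turned into characters before they can go into a document. The sketch below assumes Base64 as the text encoding and a made-up <blob> element; neither comes from the lecture.

```python
import base64
import xml.etree.ElementTree as ET

# 768 arbitrary bytes standing in for binary data such as an image.
payload = bytes(range(256)) * 3

# Hypothetical <blob> element; Base64 is an assumed choice of text encoding.
blob = ET.Element("blob", encoding="base64")
blob.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(blob, encoding="unicode")

# Round trip: parse the XML and decode the characters back into bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)

print(len(payload), len(blob.text))  # 768 bytes become 1024 characters
```

The data survives the round trip, but it grows by a third and is no longer readable or queryable as XML, which is the practical meaning of "should not".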
IS 257 – Fall 2011 2011.08.25 - SLIDE 15
XML is a syntax for trees
• Not all data is easily represented by trees
  – overlapping markup (multiple "views" of the same content)
  – graph-like structures which are less constrained than trees
• What is it that you have in your tree?
  – XML encodes a structure purely on the syntactic level
  – what the structures mean is in no way described by XML
  – XML structures must be accompanied by semantic descriptions
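To make the tree view concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree. The library document and its element names are invented for illustration and are not the iTunes format used in the course.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are illustrative only.
DOC = """<library>
  <track id="t1"><title>Blue in Green</title><artist>Miles Davis</artist></track>
  <track id="t2"><title>Naima</title><artist>John Coltrane</artist></track>
</library>"""

def outline(element, depth=0):
    """Return one (depth, tag) pair per element, in document order."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(DOC)
tree_outline = outline(root)
for depth, tag in tree_outline:
    print("  " * depth + tag)
```

The parser recovers the nesting and nothing more: that a "track" is a piece of music is knowledge the reader brings, not something the syntax carries.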
IS 257 – Fall 2011 2011.08.25 - SLIDE 16
XML Usage
• XML can be used in different ways
  – people should be able to use your XML directly using standard tools
  – if they absolutely need a set of special tools, something is wrong
• XML is hip, so everybody wants to use it
  – many things have been created ad-hoc and without much planning
  – if you start something which is XML-based, use XML responsibly
  – if you have to use some "bad XML", complain about it
• Finding the balance can be hard
  – XML is great for prototyping and experiments
  – once you decide to redesign your XML, it may be too late
  – XML documents may be short-lived, XML schemas are definitely not
IS 257 – Fall 2011 2011.08.25 - SLIDE 17
Outline
• Varia
• What is XML?
• Why XML?
  – Pre-XML problems
  – XML on the Web
  – XML today
• Beyond XML
IS 257 – Fall 2011 2011.08.25 - SLIDE 18
Web Technology
• Early Web: URI+HTTP+HTML
  – URIs identify resources (in a human-readable way)
  – HTTP retrieves resources (using a simple protocol)
  – HTML is the resource format (using a simple data format)
• The early Web was a distributed hypermedia system
  – not designed by hypermedia researchers or companies
  – simple enough to be adopted very fast
• The Web today uses many different technologies
  – URI+HTTP+HTML for basic Web publishing
  – CSS & JavaScript (maybe even Ajax) for advanced publishing
• JavaScript & XML (a.k.a. Ajax)
  – scripts dynamically loading data from a server
  – machine-to-machine interaction: the server and the script
IS 257 – Fall 2011 2011.08.25 - SLIDE 19
From Humans to Machines
• The Web was designed for humans
  – HTML is a language for describing page layout and links
  – machines were only used for implementing it
• Search engines were the first machine users on the Web
  – they made the Web's success possible
  – they demonstrated how hard it is to "understand" HTML pages
  – search engines are still a very active field of research
• A bigger Web needs more automation
IS 257 – Fall 2011 2011.08.25 - SLIDE 20
Outline
• Varia
• What is XML?
• Why XML?
  – Pre-XML problems
  – XML on the Web
  – XML today
• Beyond XML
IS 257 – Fall 2011 2011.08.25 - SLIDE 21
SGML, HTML and XML
• Standard Generalized Markup Language (SGML)
  – a language for designing document types
  – a very complex standard with many expensive and non-interoperable implementations
• Hypertext Markup Language (HTML)
  – implements a simple SGML document type
  – its syntax is SGML syntax, it is not defined by HTML itself
  – uses very few SGML features, dedicated processors are rather easy to build
• Extensible Markup Language (XML)
  – a language for designing document types (i.e., classes of documents)
  – a greatly simplified version of SGML, omitting many obscure features
  – a specification with no optional parts!
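One visible consequence of the simplification is that XML parsers reject markup that HTML processors traditionally tolerate, such as an omitted end tag. The helper name below is made up for this sketch; the parser is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

html_style = "<p>first line<br>second line</p>"   # end tag for <br> omitted
xml_style = "<p>first line<br/>second line</p>"   # empty-element tag instead

print(is_well_formed(html_style), is_well_formed(xml_style))  # False True
```

Because every conforming parser makes the same accept/reject decision, a document that parses in one tool parses in all of them.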
IS 257 – Fall 2011 2011.08.25 - SLIDE 22
Outline
• Varia
• What is XML?
• Why XML?
  – Pre-XML problems
  – XML on the Web
  – XML today
• Beyond XML
IS 257 – Fall 2011 2011.08.25 - SLIDE 23
XML Documents on the Web
• XML's idea was that content should be published as XML
  – stylesheets could then be used to render human-readable views
  – machines could simply use the underlying XML
• There are (almost) no XML documents on the Web
  – stylesheet support depends on browsers (software has a long life!)
  – many content providers do not want to publish machine-readable data
• There are many XML documents behind HTML documents
  – content does not have to be made public in a machine-readable way
  – browser-independent HTML can be produced from XML
  – XML technologies can be leveraged on the server-side
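A rough sketch of the server-side pattern: XML is kept as the source and plain HTML is generated from it. The course uses XSLT for this step; Python stands in here only because its standard library has no XSLT processor. The playlist format and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical playlist document kept on the server.
SOURCE = """<playlist name="Evening">
  <track><title>Naima</title><artist>John Coltrane</artist></track>
  <track><title>Blue in Green</title><artist>Miles Davis</artist></track>
</playlist>"""

def playlist_to_html(xml_text):
    """Build an HTML fragment (heading plus list) from the playlist XML."""
    playlist = ET.fromstring(xml_text)
    section = ET.Element("div")
    ET.SubElement(section, "h1").text = playlist.get("name")
    items = ET.SubElement(section, "ul")
    for track in playlist.findall("track"):
        item = ET.SubElement(items, "li")
        item.text = f'{track.findtext("title")} ({track.findtext("artist")})'
    return ET.tostring(section, encoding="unicode", method="html")

html = playlist_to_html(SOURCE)
print(html)
```

The browser only ever sees the generated HTML, so the XML can stay private and no stylesheet support is needed on the client.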
IS 257 – Fall 2011 2011.08.25 - SLIDE 24
XML Documents Elsewhere
• XML is not used as intended, but it is very successful
  – as a server-side foundation for Web publishing
  – as a B2B-focused format with no Web publishing in mind
• XML has been successful for several reasons
  – being there at the right time (Internet bubble)
  – politically correct (the W3C is OS-agnostic)
  – technically sound (simple and no optional parts)
  – human-readable based on a well-known syntax
  – great for rapid prototyping and experiments
IS 257 – Fall 2011 2011.08.25 - SLIDE 25
Outline
• Varia
• What is XML?
• Why XML?
  – Pre-XML problems
  – XML on the Web
  – XML today
• Beyond XML
IS 257 – Fall 2011 2011.08.25 - SLIDE 26
Used Everywhere
• Very small: Messages from sensors
  – e.g., building automation or car electronics
  – mostly implemented in hardware or firmware
• Very large: Genome sequences
  – encoding the results of genome analyses
  – yields very large XML documents (several gigabytes)
• Very different processing requirements
  – very fast processing (time critical applications)
  – memory-conserving processing (very large documents)
  – incremental processing (streaming)
  – random access (only small parts required)
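Incremental processing can be sketched with the standard library's iterparse, which hands over each element as soon as its end tag has been read. The input below is synthetic and the element names are hypothetical; a real use would pass a file instead of an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET

# Synthetic input standing in for a very large file.
records = "".join(
    f"<read id='{i}'><length>{i * 10}</length></read>" for i in range(1000)
)
stream = io.BytesIO(f"<reads>{records}</reads>".encode("utf-8"))

count = 0
total = 0
# Each element is delivered when its end tag is seen, so records can be
# handled one at a time instead of after the whole document is loaded.
for event, elem in ET.iterparse(stream, events=("end",)):
    if elem.tag == "read":
        total += int(elem.findtext("length"))
        count += 1
        elem.clear()  # discard the children and text of the finished record

print(count, total)  # 1000 4995000
```

Clearing each record keeps memory use low, although the root still holds the emptied elements; fully constant memory would need them removed from the root as well.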
IS 257 – Fall 2011 2011.08.25 - SLIDE 27
This course and XML
• "XML is ASCII for the 21st century"
  – information professionals should know and use XML
  – you will see it in many projects
  – you will hopefully use it in many projects
  – you will be able to build and test prototypes very rapidly
• What do you need for using XML?
  – XML and some kind of schema language
  – XSLT for processing it
  – XQuery and XML Databases for search and access
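As a small taste of the "search and access" part, the sketch below selects parts of a document with path expressions. It uses the limited XPath subset built into Python's xml.etree.ElementTree rather than a full XPath or XQuery processor such as Saxon, and the library document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical library; element and attribute names are illustrative only.
DOC = """<library>
  <track genre="jazz"><title>Naima</title><artist>John Coltrane</artist></track>
  <track genre="rock"><title>Heroes</title><artist>David Bowie</artist></track>
  <track genre="jazz"><title>Blue in Green</title><artist>Miles Davis</artist></track>
</library>"""

root = ET.fromstring(DOC)

# Attribute predicate: the title of every track whose genre is "jazz".
jazz_titles = [t.text for t in root.findall("./track[@genre='jazz']/title")]

# Child-text predicate: the first track whose artist is David Bowie.
bowie = root.find("./track[artist='David Bowie']")

print(jazz_titles, bowie.findtext("title"))
```

XQuery and XSLT build on the same path idea and add joins, sorting, and construction of new documents on top of it.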
IS 257 – Fall 2011 2011.08.25 - SLIDE 28
Outline
• Varia
• What is XML?
• Why XML?
• Beyond XML
IS 257 – Fall 2011 2011.08.25 - SLIDE 29
Sharing Concepts
• XML is a syntax for trees
  – trees are just structured data
  – for doing something useful, you must understand the trees
• Schema-based sharing of concepts is possible
  – HTML works great because everybody is using it
  – anything beyond HTML's capabilities needs a new schema
• General sharing of concepts is hard
  – the AI community tried for decades and failed
  – micro-formats are a more humble approach to "reusable shared concepts"
  – agreement in communities gets exponentially harder with their size
IS 257 – Fall 2011 2011.08.25 - SLIDE 30
The Semantic Web
• Technologies for describing concepts
  – the foundation of successful interaction is mutual understanding
  – describe your XML using Semantic Web technologies
• XML core technologies do not convey any meaning
  – XML is a language for exchanging trees
  – XML schema languages describe what trees may be exchanged
  – XML schema languages are for markup design
• Semantic Web technologies have received a lot of attention
  – and a lot of research funding (latest rebranding: Linked Data)
  – success for the most general approaches is questionable at best
  – debatable success of AI's overall promises ("thinking machines")
  – modest approaches are more promising and likely to succeed