Markup for Statisticians

An Introduction to Alphabet Soup

WWW

• The World Wide Web (WWW) was proposed in 1989 and came into being in the early 1990s

• for documentation on most projects that affect the WWW, look at

– www.w3.org

• a major factor in its success was the notion of a markup language

WWW

• the technology hurdle that this overcame was the separation of content from presentation

• a web browser is responsible for understanding and rendering the content in a web page

WWW

• that content is marked-up using HTML (or a relative)

• on IE5, under the View menu you will find an option for source

• select a web page and view the source

WWW

• this is very nice – now everyone can view your page using any browser

• all the browser has to do is to understand and implement a number of HTML directives

• notions such as linking (directing people to another place with a click) are easily implemented in this framework

WWW

• NCSA puts out a document entitled

– A Beginner’s Guide to HTML

• this is one of the better guides I have seen

• There are many books (most much longer than they need to be)

• O’Reilly’s HTML Pocket Reference seems pretty useful

WWW

• How does it work?

• Your web browser opens a special type of connection (an http connection usually) to another computer and through that protocol asks for the information on a particular web page

• Other types of connections, such as ftp, are also generally supported
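
To make the request step concrete, here is a minimal sketch in Python (chosen only for illustration; the slides themselves use R) of the text a browser might send over an http connection. The helper name build_get_request is hypothetical, and real browsers send many more header lines than the three shown.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Return the text of a minimal HTTP/1.1 GET request for `url`."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, resource, protocol version
        f"Host: {parts.netloc}",  # which site on that computer we are asking for
        "Connection: close",      # ask the server to close the connection afterwards
        "",                       # an empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


request_text = build_get_request("http://www.w3.org/")
print(request_text)
```

The server answers with a status line, some headers of its own, and then the content of the page, which is what the browser renders.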

WWW

• now we have solved the problem of how to put content onto your computer

• how do we solve the problem of providing programs or applications to perform some computations?

• this is where Java came in

• Java is a language that has a strong security model

WWW: Java

• Java applets can be secured in the sense that you can determine before you run them that they will do nothing harmful to your computer

• if you could not ensure that you would be ill advised to run an applet

• this is why there are no C or C++ applets

• they can be written, but no one should be silly enough to run one

WWW

• while all web browsers use http as their basic means of transferring data other programs can also use http

• now the web is full of information about all sorts of topics

• how do we begin to make sense of that information?

WWW

• HTML has severe limitations

• these became apparent when search engines were first being developed

• the problem is that there is no way to indicate the meaning of any of the information

• for example consider the tags that you have available for a table

WWW: Table tags

• the table tags are:

– <caption>, <table>, <td>, <th>, and <tr>

– a few more in HTML 4.0

• except by convention there is no way to indicate the content of the table

• but tables often contain data – data that we want to use

• without information on content it is hard to use the data programmatically
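
The point about table tags can be illustrated with a small sketch, written in Python with its standard html.parser module (an illustration only; the table and the class name TableCells are made up for this example). A program can recover the cells easily enough, but all it gets back are strings.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # a new row starts
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")    # a new, still empty, cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical page fragment, invented for this sketch.
page = """
<table>
  <caption>Measurements</caption>
  <tr><th>subject</th><th>height</th></tr>
  <tr><td>A</td><td>24</td></tr>
  <tr><td>B</td><td>61</td></tr>
</table>
"""

parser = TableCells()
parser.feed(page)
print(parser.rows)
```

Nothing in the markup says that the first row holds column names, that 24 is a number, or what units it is in; the program has to assume all of that.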

WWW

• we want to have smart programs

• there is no sense in having people find and manipulate data – if it is on the web it would be nice if it were in a format that a program could deal with

• the more we can automate the more we can do

WWW and R

• open R and look at the manual page for connections

• look at URL connections

• we want to open a connection to Leo Breiman’s home page

– bhp <- url("http://oz.berkeley.edu/users/breiman/", open="r")

– bhp.content <- readLines(bhp)

– close(bhp)

WWW and R

• now look at what bhp.content contains

• Dr. Breiman has also put up a data set at

– http://oz.berkeley.edu/users/breiman/glass6.dat

• open a url connection to this page and read the data

• what does it look like?

• what would we like to do with it?

WWW and R

• we would probably like to put it into a dataframe

• we would also like to know what the data means

• there is no way to do that with HTML except by convention

• and even then we have to parse the data

• writing parsers is complicated
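
As a rough sketch of what parsing "by convention" involves, the following Python fragment (an illustration only) reads a small whitespace-separated text. The sample text and its column names are invented for the example, and the convention that the first line holds the names is an assumption the program simply has to trust.

```python
import io

# Hypothetical file contents. The convention (first line holds column names,
# fields are separated by whitespace) is assumed, not stated in the data.
raw = io.StringIO(
    "id height weight\n"
    "1 24 3.5\n"
    "2 61 4.0\n"
)

header = raw.readline().split()
records = []
for line in raw:
    if not line.strip():
        continue  # skip blank lines
    values = [float(field) for field in line.split()]
    records.append(dict(zip(header, values)))

print(records[0])
```

If the file used a different separator, had no header line, or coded missing values with a symbol, this code would fail or, worse, silently produce the wrong result.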

WWW and XML

• The eXtensible Markup Language is intended to provide the missing functionality

• it comes with a number of additional tools

• XSL, XSLT, XPointer, XLink and XPath

• XML is a simplified form of SGML

XML

• is becoming the standard for data transfer

• it is also becoming popular for tasks like remote procedure calls and for communication between cooperating programs written in different languages, via SOAP

XML

• with XML you can define your own tags

– <foo>, <bar>, and so on

• to give them meaning you use a Document Type Definition (or DTD)

• the DTD specifies which tags are valid, which attributes a tag can have and also the order (or nesting) requirements

XML

• in XML every opening tag must have a corresponding closing tag

– <foo> must be followed by </foo>, with any other tags that were opened after <foo> closed before </foo>

– this ensures proper nesting of the XML tags and makes it possible to parse the documents easily

XML

• an element consists of an opening tag and a closing tag, together with anything between them; for example

– <fruit> orange </fruit>

• any text between the tags is considered to be part of the element and is formatted according to the rules for that element

XML

• elements can have attributes

– <height units="inches"> 24 </height>

• notice that under these circumstances it is reasonably easy to extract all the heights from an XML document (and to get the units right!)

• attribute values must be contained inside of quotation marks, either double or single
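
The claim that heights and their units are easy to extract can be sketched as follows, using Python's standard xml.etree.ElementTree module (an illustration only; the surrounding <subjects> and <subject> tags are invented for the example).

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the <height> element comes from the slides.
doc = """
<subjects>
  <subject><height units="inches"> 24 </height></subject>
  <subject><height units='cm'> 61 </height></subject>
</subjects>
"""

root = ET.fromstring(doc)

# Visit every <height> element, wherever it sits in the tree, and pair the
# numeric value with the units recorded in its attribute.
heights = [(float(h.text), h.get("units")) for h in root.iter("height")]
print(heights)
```

Note that both double and single quotation marks around the attribute values are accepted, as stated above.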

XML

• a non-empty element must have both an opening and a closing tag

• an empty element might be there as a placeholder or to carry its attributes

– <foo xx="hi there" />

• this is an empty element; no closing tag is required, but we had to put a / before the closing >

XML

• tags must be nested correctly

• so the following is not allowed

– <foo> <bar> that’s all folks </foo> </bar>

• since <bar> was opened second, it must be the first one to close

• an XML document that adheres to these rules is said to be well-formed
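
A quick way to see these rules in action is to hand both versions of the example to a parser. This is a minimal sketch using Python's standard xml.etree.ElementTree (an illustration only); the helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


good = "<foo> <bar> that's all folks </bar> </foo>"
bad = "<foo> <bar> that's all folks </foo> </bar>"

print(is_well_formed(good), is_well_formed(bad))
```

The parser needs no knowledge of what foo or bar mean; the nesting rules alone are enough to accept the first string and reject the second.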

XML

• well-formed XML documents can be parsed using standard methods

• a second concept that can be applied to XML documents is validity

• an XML document is said to be valid if it conforms to its DTD

• XML documents can be well-formed but not valid

XML

• XML documents can be useful even when there is no DTD

• in other situations (e.g. my system for documenting clinical trials) the use of a DTD to ensure validity is necessary

• more recently an alternative to the DTD has been developed: it is called XML Schema and is more flexible than a DTD

XML

• PI – processing instructions

• a PI tells an application to carry out a specific task

• a PI is not part of the rendered document but rather is an instruction to either the XML parser or to an application that uses the resultant document

XML

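• PIs are of the form:

– <?target instructions?>

• An example with PI syntax:

– <?xml version="1.0" standalone="no"?>

• this line (strictly speaking the XML declaration, although it is written like a PI) is included as the first line in almost all XML documents

• it indicates the version; standalone="no" indicates that the document may depend on external declarations, such as a DTD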
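
To show how an application gets at a PI, here is a small sketch using Python's standard xml.dom.minidom module (an illustration only). The PI target render and its instruction text are invented for the example.

```python
from xml.dom import minidom

# Hypothetical document: <?render ...?> is a made-up processing instruction.
text = '<?xml version="1.0"?><?render style="plain"?><doc>hello</doc>'
dom = minidom.parseString(text)

# Collect (target, instructions) for every PI at the top level of the document.
pis = [
    (node.target, node.data)
    for node in dom.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(pis)
```

The parser hands the instruction text over as an uninterpreted string; it is up to the application named by the target to decide what to do with it. The opening <?xml ...?> line is consumed by the parser itself and does not appear in the list.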

XML

• Namespaces: we need some means of limiting the scope of the definition of a tag

• suppose we have combined two DTD’s in a single XML document (this is both legal and useful)

• suppose that both DTD’s define a tag named leg

• except in one it stands for a person’s leg

XML

• and in the other the leg of a chair

• we wouldn’t want to mix those up

• namespaces can be used to ensure that tags from one DTD do not get confused with tags from another

• namespaces really don’t do anything though

• they are simply macro substitutions

XML

• namespaces should be unique

• it is common to use a URI (which need not exist)

• <Book xmlns:RG="http://www.rgentleman.org/">

• from here on tags can use

• <RG:foo>

• and this is the equivalent of prepending the namespace string to the tag
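
The "prepending" idea is visible directly in some parsers. As a sketch, Python's standard xml.etree.ElementTree (used here only for illustration) reports each tag with its namespace string attached in braces. The child elements of <Book> are invented for the example.

```python
import xml.etree.ElementTree as ET

doc = """
<Book xmlns:RG="http://www.rgentleman.org/">
  <RG:foo>inside the RG namespace</RG:foo>
  <foo>no namespace</foo>
</Book>
"""

root = ET.fromstring(doc)

# Tag names as the parser sees them: the prefix is replaced by the full string.
tags = [child.tag for child in root]
print(tags)

# When searching, the prefix can be mapped back to the namespace string.
ns = {"RG": "http://www.rgentleman.org/"}
rg_foo = root.find("RG:foo", ns)
print(rg_foo.text)
```

The two foo elements end up with different names, so a program cannot confuse them, and nothing is ever fetched from the URI.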

XSL

• eXtensible Stylesheet Language

• this has not yet been completely formed (but should be soon)

• a style sheet describes how the XML document should be transformed to provide the rendered output

• you can have multiple style sheets for any XML document

XSL

• this means that you can have different versions of the document depending on whether the output is a Web page, a pdf document, input for another processing step and so on

• XSL (through XSLT) provides a means of rendering the data in an XML document

XPath

• an XML document has a tree structure

• there is a root node and below that there can be many more nodes

• for XSLT (and XPointer) to work well they need to be able to reference different elements within the document

• they do this via XPath

XPath

• a simple example

• *[not(self::FOO:Bar)]

• is an XPath statement that refers to all children of the current node whose name (the tag) is not FOO:Bar

• you can refer to parent nodes, children, grandparents and so on
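
Navigation of this kind can be tried out with Python's standard xml.etree.ElementTree (an illustration only), which supports a limited subset of XPath. It does not support not(self::...), so in this sketch the "all children except" selection is done with an ordinary filter instead. The document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only to demonstrate the path expressions.
root = ET.fromstring(
    "<basket>"
    "<fruit><name>orange</name></fruit>"
    "<fruit><name>apple</name></fruit>"
    "<tool><name>knife</name></tool>"
    "</basket>"
)

# All children of the current (root) node.
children = [c.tag for c in root.findall("./*")]

# Grandchildren reached through <fruit> children only.
fruit_names = [n.text for n in root.findall("./fruit/name")]

# Children whose tag is not "fruit": filtered in Python, because the
# not(self::...) form is outside the subset this module understands.
not_fruit = [c.tag for c in root if c.tag != "fruit"]

# Parents of every <name> element, found by stepping up with "..".
parents = [p.tag for p in root.findall(".//name/..")]

print(children, fruit_names, not_fruit, parents)
```

A full XPath implementation would accept the expression from the slide directly; the subset shown here is enough to see the tree being walked up and down.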

XLL

• eXtensible Linking Language

• another part of the XML family are the mechanisms for linking different documents and portions of documents

• XLink and XPointer are the two mechanisms used to carry out the linking (similar to what goes on in a web page but with more control)

XLink

• a link is only an assertion of a relationship between pieces of a document (or documents)

• how that link is presented to the user depends on many things and can be quite different in different settings

• XML IDs are used to provide unique labels for XLink to link to

XPointer

• IDs give you a flexible way to link to parts of the same document

• when you want to link to other documents then you need XPointer

• the syntax is pretty complex

Literate Programming

• literate programming is an idea that originated with Don Knuth

• he wanted a system that allowed him to mix text and code in a more natural way

• so that documentation could be read easily by humans

Literate Programming

• to make the code runnable the code segments are extracted and placed in a separate file

• the development version of R (and soon a separate library) includes a literate programming system for the R language

• it is called Sweave

Sweave

• the idea is to produce a LaTeX-like document that has a mix of LaTeX and R code

• this document is passed through an S engine and the code may be replaced by the output that it generates (including graphics)

Sweave

• this allows you to easily update reports when the data change

• it also allows you to document the code together with the report that the code is used to write

• see the Sweave User Manual that is also provided for today’s lecture

Sweave

• a second but important use for Sweave is to use it to document R packages

• using Sweave we can produce files that contain examples of analyses

• the Tangle facility allows us to extract the code segments into separate files and to run them

STangle

• Tangling is sort of the opposite of weaving

• it separates the components

• for R/S packages the text portion is generally not of interest

• the code portions allow us to ensure that the program is still functioning as we expect

• it allows us to put much more complex examples into our code
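
The tangling step can be sketched in a few lines. The following is not the Sweave implementation, only a minimal Python illustration that assumes the noweb-style markers Sweave documents use: a line of the form <<options>>= opens a code chunk and a line starting with @ returns to text. The function name tangle and the sample report are made up for the example.

```python
import re

# Assumed chunk syntax: "<<options>>=" opens a code chunk and a line
# starting with "@" switches back to documentation text.
CHUNK_START = re.compile(r"^<<(.*)>>=\s*$")


def tangle(source: str) -> str:
    """Return only the code chunks of a noweb-style document."""
    code_lines = []
    in_code = False
    for line in source.splitlines():
        if CHUNK_START.match(line):
            in_code = True             # chunk header: start collecting code
        elif line.startswith("@"):
            in_code = False            # end of chunk: back to text
        elif in_code:
            code_lines.append(line)
    return "\n".join(code_lines)


# Hypothetical report mixing LaTeX text with one R code chunk.
report = r"""\documentclass{article}
\begin{document}
The mean of the sample is computed below.
<<mean-example>>=
x <- c(1, 2, 3)
mean(x)
@
\end{document}
"""

print(tangle(report))
```

What comes out is plain R code that can be saved to a file and run on its own, which is what makes the extracted chunks usable as tests.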

Sweave

• once this becomes a stable part of R I anticipate that most of you will find it a very useful device for doing homework assignments and data analyses
