Markup for Statisticians An Introduction to Alphabet Soup

Post on 20-Dec-2015

Markup for Statisticians

An Introduction to Alphabet Soup

WWW

• the World Wide Web (WWW) came into being around 1990, following Tim Berners-Lee's 1989 proposal at CERN

• for documentation on most projects that affect the WWW, look at

– www.w3.org

• a major factor in its success was the notion of a markup language

WWW

• the technology hurdle that this overcame was the separation of content from presentation

• a web browser is responsible for understanding and rendering the content in a web page

WWW

• that content is marked-up using HTML (or a relative)

• on IE5, under the View menu you will find an option for source

• select a web page and view the source

WWW

• this is very nice – now everyone can view your page using any browser

• all the browser has to do is to understand and implement a number of HTML directives

• notions such as linking (directing people to another place by a click) are easily implemented in this framework

WWW

• NCSA puts out a document entitled

– A Beginner’s Guide to HTML

• this is one of the better guides I have seen

• There are many books (most much longer than they need to be)

• O’Reilly’s HTML Pocket Reference seems pretty useful

WWW

• How does it work?

• Your web browser opens a special type of connection (an http connection usually) to another computer and through that protocol asks for the information on a particular web page

• Other types of connections, such as ftp, are also generally supported

WWW

• now we have solved the problem of how to put content onto your computer

• how do we solve the problem of providing programs or applications to perform some computations?

• this is where Java came in

• Java is a language that has a strong security model

WWW: Java

• Java applets can be secured in the sense that you can determine before you run them that they will do nothing harmful to your computer

• if you could not ensure that, you would be ill-advised to run an applet

• this is why there are no C or C++ applets

• they can be written but no one should be silly enough to run one

WWW

• while all web browsers use http as their basic means of transferring data other programs can also use http

• now the web is full of information about all sorts of topics

• how do we begin to make sense of that information?

WWW

• HTML has severe limitations

• these became apparent when search engines were first being developed

• the problem is that there is no way to indicate the meaning of any of the information

• for example consider the tags that you have available for a table

WWW: Table tags

• the table tags are:

– <caption>, <table>, <td>, <th>, and <tr>

– a few more in HTML 4.0

• except by convention there is no way to indicate the content of the table

• but tables often contain data – data that we want to use

• without information on content it is hard to use the data programmatically
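The point about table tags can be illustrated with a small sketch. It uses Python's standard-library html.parser (a choice made for this example, not something the slides prescribe) and a made-up table; the class name CellCollector and the function table_rows are hypothetical. The parser recovers the layout of the table (rows and cells), but every cell comes back as an anonymous string: nothing in the markup says which column is a height or what its units are.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the names and numbers are made up.
PAGE = """
<table>
  <caption>Heights</caption>
  <tr><th>name</th><th>height</th></tr>
  <tr><td>Ann</td><td>24</td></tr>
  <tr><td>Bob</td><td>61</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row being built, or None outside <tr>
        self._in_cell = False   # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def table_rows(html_text):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = CellCollector()
    parser.feed(html_text)
    parser.close()
    return parser.rows


rows = table_rows(PAGE)
print(rows)
```

Treating the first row as a header and the second column as numeric is a convention the program has to assume; the HTML itself does not state it.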

WWW

• we want to have smart programs

• there is no sense in having people find and manipulate data – if it is on the web it would be nice if it were in a format that a program could deal with

• the more we can automate the more we can do

WWW and R

• open R and look at the manual page for connections

• look at URL connections

• we want to open a connection to Leo Breiman’s home page

– bhp <- url("http://oz.berkeley.edu/users/breiman/", open="r")

– bhp.content <- readLines(bhp)

WWW and R

• now look at what bhp.content contains

• Dr. Breiman has also put up a data set at

– http://oz.berkeley.edu/users/breiman/glass6.dat

• open a url connection to this page and read the data

• what does it look like?

• what would we like to do with it?

WWW and R

• we would probably like to put it into a data frame

• we would also like to know what the data means

• there is no way to do that with HTML except by convention

• and even then we have to parse the data

• writing parsers is complicated
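As a rough illustration of what parsing "by convention" involves, here is a minimal sketch in Python's standard library (the slides themselves work in R; Python is used here only for the example). The function names fetch_lines and parse_numeric_rows are hypothetical, and the sample lines are made up; they are not the contents of glass6.dat. The parser has to assume that fields are separated by whitespace and that every field is a number, because nothing in the file says so.

```python
from urllib.request import urlopen


def fetch_lines(url):
    """Read a URL and return its text as a list of lines (needs network access)."""
    with urlopen(url) as conn:
        return conn.read().decode("utf-8", errors="replace").splitlines()


def parse_numeric_rows(lines):
    """Split whitespace-separated lines into rows of floats, skipping blank lines.

    Assumes every field is numeric; float() raises ValueError otherwise.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if fields:
            rows.append([float(field) for field in fields])
    return rows


# Made-up sample lines, NOT the contents of glass6.dat.
sample = ["1 1.52 13.6", "", "2 1.51 13.9"]
print(parse_numeric_rows(sample))
```

Even after this step the columns have no names, types, or units attached, which is the gap the following slides address.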

WWW and XML

• The eXtensible Markup Language is intended to provide the missing functionality

• it comes with a number of additional tools

• XSL, XSLT, XPointer, XLink and XPath

• XML is a simplified form of SGML

XML

• is becoming the standard for data transfer

• it is also becoming popular for tasks such as remote procedure calls and for communication between cooperating programs written in different languages, via SOAP

XML

• with XML you can define your own tags

– <foo>, <bar>, and so on

• to give them meaning you use a Document Type Definition (or DTD)

• the DTD specifies which tags are valid, which attributes a tag can have and also the order (or nesting) requirements

XML

• in XML all open tags must have a corresponding closing tag

– <foo> must be followed by </foo>, with any other tags that have been opened after <foo> closed before </foo>

– this ensures proper nesting of the XML tags and makes it possible to parse the documents easily

XML

• an element consists of an opening tag, a closing tag, and everything between them

– <fruit> orange </fruit> is an element

• any text between the tags is considered to be part of the element and is formatted according to the rules for that element

XML

• elements can have attributes

– <height units="inches"> 24 </height>

• notice that under these circumstances it is reasonably easy to extract all the heights from an XML document (and to get the units right!)

• attribute values must be contained inside of quotation marks, either double or single
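The claim that heights (and their units) are easy to extract can be sketched with Python's standard-library xml.etree.ElementTree. The document, the names in it, and the conversion to centimetres are all made up for this example; only the height element with its units attribute comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only <height units="..."> comes from the slide.
DOC = """
<people>
  <person><name>Ann</name><height units="inches"> 24 </height></person>
  <person><name>Bob</name><height units="cm"> 155 </height></person>
</people>
"""

# Conversion factors to centimetres for the units this sketch understands.
TO_CM = {"inches": 2.54, "cm": 1.0}


def heights_in_cm(xml_text):
    """Return every <height> in the document, converted to centimetres."""
    root = ET.fromstring(xml_text)
    result = []
    for height in root.iter("height"):
        units = height.get("units")      # attribute value, or None if absent
        value = float(height.text)       # float() ignores surrounding spaces
        result.append(value * TO_CM[units])  # KeyError on unknown units
    return result


print(heights_in_cm(DOC))
```

Because the tag names the quantity and the attribute names the unit, the program does not need to guess which numbers are heights; an unknown or missing unit fails loudly instead of being silently misread.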

XML

• a non-empty element must have both an opening and a closing tag

• an empty element might be there as a placeholder or to carry its attributes

– <foo xx="hi there" /> is an empty element; no separate closing tag is required, but a / must appear before the closing >

XML

• tags must be nested correctly

• so the following is not allowed

– <foo> <bar> that’s all folks </foo> </bar>

• since bar was opened last it must be closed first

• an XML document that adheres to these rules is said to be well-formed
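The well-formedness rules above can be checked mechanically. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree, whose parser raises ParseError on input that is not well-formed; the helper name is_well_formed is hypothetical. It checks well-formedness only and says nothing about validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = "<foo> <bar> that's all folks </bar> </foo>"    # bar closed before foo
bad = "<foo> <bar> that's all folks </foo> </bar>"     # tags overlap
empty = '<foo xx="hi there" />'                        # empty element
unquoted = "<height units=inches> 24 </height>"        # attribute not quoted

for text in (good, bad, empty, unquoted):
    print(is_well_formed(text), text)
```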

XML

• well-formed XML documents can be parsed using standard methods

• a second concept that can be applied to XML documents is validity

• an XML document is said to be valid if it conforms to its DTD

• XML documents can be well-formed but not valid

XML

• XML documents can be useful even when there is no DTD

• in other situations (e.g., my system for documenting clinical trials) the use of a DTD to ensure validity is necessary

• more recently an alternative to the DTD has been introduced: it is called XML Schema and is more flexible than a DTD

XML

• PI – processing instructions

• a PI tells an application to carry out a specific task

• a PI is not part of the rendered document but rather is an instruction to either the XML parser or to an application that uses the resultant document

XML

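• PIs are of the form:

– <?target instructions?>

• a familiar line that uses the same syntax:

– <?xml version="1.0" standalone="no"?>

• this line, the XML declaration, is the first line in almost all XML documents (strictly speaking, the XML specification does not count it as a PI, although it looks like one)

• it gives the XML version, and standalone="no" indicates that the document may depend on external declarations such as an external DTD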

XML

• Namespaces: we need some means of limiting the scope of the definition of a tag

• suppose we have combined two DTD’s in a single XML document (this is both legal and useful)

• suppose that both DTD’s define a tag named leg

• except in one it stands for a person’s leg

XML

• and in the other the leg of a chair

• we wouldn’t want to mix those up

• namespaces can be used to ensure that tags from one DTD do not get confused with tags from another

• namespaces really don’t do anything though

• they are simply macro substitutions

XML

• namespaces should be unique

• it is common to use a URI (which need not exist)

• <Book xmlns:RG="www.rgentleman.org">

• from here on, tags can use the RG prefix, as in <RG:foo>

• this is the equivalent of prepending the namespace string to the tag
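The "prepending" view of namespaces can be seen directly in Python's standard-library xml.etree.ElementTree, which reports each tag as {namespace}name. The sketch below reuses the two-kinds-of-leg example; the example.org URIs and the prefixes P and C are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; as the slides note, they need not exist.
PERSON = "http://example.org/person"
CHAIR = "http://example.org/chair"

DOC = f"""
<room xmlns:P="{PERSON}" xmlns:C="{CHAIR}">
  <P:leg>left</P:leg>
  <C:leg>front-left</C:leg>
</room>
"""

root = ET.fromstring(DOC)

# Each prefix has been replaced by its namespace string in braces.
tags = [child.tag for child in root]
print(tags)

# Asking for a person's leg cannot pick up the leg of a chair.
person_legs = root.findall(f"{{{PERSON}}}leg")
print([leg.text for leg in person_legs])
```

The prefix itself carries no meaning; two documents that bind different prefixes to the same namespace string produce identical tags after parsing.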

XSL

• eXtensible Stylesheet Language

• the XSL specification has not yet been finalized (but should be soon)

• a style sheet describes how the XML document should be transformed to provide the rendered output

• you can have multiple style sheets for any XML document

XSL

• this means that you can have different versions of the document depending on whether the output is a Web page, a PDF document, input for another processing step, and so on

• XSL (through XSLT) provides a means of rendering the data in an XML document

XPath

• an XML document has a tree structure

• there is a root node and below that there can be many more nodes

• for XSLT (and XPointer) to work well they need to be able to reference different elements within the document

• they do this via XPath

XPath

• a simple example

– *[not(self::FOO:Bar)]

• is an XPath expression that selects all child elements of the current node whose name (the tag) is not FOO:Bar

• you can refer to parent nodes, children, grandparents and so on
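A rough equivalent of the expression above can be sketched with Python's standard-library xml.etree.ElementTree. That module supports only a subset of XPath and has no not() function or self:: axis, so in this sketch the path "*" selects the children and the "not FOO:Bar" test is done in ordinary Python. The namespace URI and the document are made up for the example.

```python
import xml.etree.ElementTree as ET

FOO = "http://example.org/foo"   # hypothetical namespace URI for the FOO prefix

DOC = f"""
<FOO:Top xmlns:FOO="{FOO}">
  <FOO:Bar>skip me</FOO:Bar>
  <FOO:Baz>keep 1</FOO:Baz>
  <note>keep 2</note>
</FOO:Top>
"""

root = ET.fromstring(DOC)

# "*" selects every child element of the current node (here, the root).
children = root.findall("*")

# Emulates *[not(self::FOO:Bar)]: keep the children whose tag is not FOO:Bar.
bar_tag = f"{{{FOO}}}Bar"
others = [child for child in children if child.tag != bar_tag]
print([child.text for child in others])

# The positive form, FOO:Bar, is supported directly via a prefix map.
bars = root.findall("FOO:Bar", {"FOO": FOO})
print([bar.text for bar in bars])
```

A full XPath engine would evaluate the original expression as written; the filtering step here stands in for it.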

XLL

• eXtensible Linking Language

• another part of the XML family is the set of mechanisms for linking different documents and portions of documents

• XLink and XPointer are the two mechanisms used to carry out the linking (similar to what goes on in a web page but with more control)

XLink

• a link is only an assertion of a relationship between pieces of a document (or documents)

• how that link is presented to the user depends on many things and can be quite different in different settings

• XML IDs are used to provide unique labels for XLink to link to

XPointer

• IDs give you a flexible way to link to parts of the same document

• when you want to point at a specific part of another document, you need XPointer

• the syntax is pretty complex

Literate Programming

• literate programming is an idea that originated with Don Knuth

• he wanted a system that allowed him to mix text and code in a more natural way

• so that documentation could be read easily by humans

Literate Programming

• to make the code runnable the code segments are extracted and placed in a separate file

• the development version of R (and soon a separate library) includes an implementation of literate programming for the R language

• it is called Sweave

Sweave

• the idea is to produce a LaTeX-like document that has a mix of LaTeX and R code

• this document is passed through an S engine and the code may be replaced by the output that it generates (including graphics)

Sweave

• this allows you to easily update reports when the data change

• it also allows you to document the code together with the report that the code is used to write

• see the Sweave User Manual that is also provided for today’s lecture

Sweave

• a second but important use for Sweave is to use it to document R packages

• using Sweave we can produce files that contain examples of analyses

• the Tangle facility allows us to extract the code segments into separate files and to run them

Stangle

• Tangling is sort of the opposite of weaving

• it separates the components

• for R/S packages the text portion is generally not of interest

• the code portions allow us to ensure that the program is still functioning as we expect

• it allows us to put much more complex examples into our code
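To make the idea of tangling concrete, here is a minimal sketch in Python. It is not Sweave's Stangle implementation and ignores chunk options and chunk reuse; it only assumes the noweb convention that Sweave follows, in which a code chunk starts with a line of the form <<options>>= and ends with a line containing only @. The function name tangle and the sample document are made up for the example.

```python
import re

# A chunk header is a whole line of the form <<...>>= (options may be empty).
CHUNK_START = re.compile(r"^<<(.*)>>=\s*$")


def tangle(source):
    """Return only the code lines of a noweb-style document, in order."""
    code = []
    in_chunk = False
    for line in source.splitlines():
        if not in_chunk and CHUNK_START.match(line):
            in_chunk = True              # header line itself is not code
        elif in_chunk and line.rstrip() == "@":
            in_chunk = False             # a lone @ ends the chunk
        elif in_chunk:
            code.append(line)
    return "\n".join(code)


# Made-up Sweave-style input mixing LaTeX text and R code chunks.
DOC = r"""\documentclass{article}
\begin{document}
The mean of the data is computed below.
<<mean, echo=TRUE>>=
x <- c(1, 2, 3)
mean(x)
@
More text here.
<<>>=
summary(x)
@
\end{document}
"""

print(tangle(DOC))
```

The extracted lines form a plain R script that can be run on its own, which is what makes the code portions usable as checks that a program still behaves as expected.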

Sweave

• once this becomes a stable part of R I anticipate that most of you will find it a very useful device for doing homework assignments and data analyses