CIT 383: Administrative Scripting
XML
Topics
1. What is XML?
2. XML Structure
3. REXML
eXtensible Markup Language
Extensible descriptive markup language framework
– Began as a subset of Standard Generalized Markup Language (SGML).
– Designed to ensure that data remains available after the programs that originally created or read it become obsolete or unusable.
<?xml version="1.0" encoding="UTF-8"?>
<inventory>
  <book isbn="0976694042">
    <author>Chris Pine</author>
    <title>Learn to Program</title>
  </book>
</inventory>
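As a minimal sketch of how a script might read the inventory document above, the following uses REXML from Ruby. The variable names are illustrative, and the document is held in a string rather than a file so the example is self-contained.

```ruby
require 'rexml/document'

# The inventory document from the slide, held in a string.
xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <inventory>
    <book isbn="0976694042">
      <author>Chris Pine</author>
      <title>Learn to Program</title>
    </book>
  </inventory>
XML

doc  = REXML::Document.new(xml)
book = doc.root.elements['book']   # first <book> child of <inventory>

isbn   = book.attributes['isbn']         # attribute value
author = book.elements['author'].text    # text content of a child element
title  = book.elements['title'].text

puts "#{title} by #{author} (ISBN #{isbn})"
```

The data stays readable by any XML parser, which is the point of the format: nothing here depends on the program that originally wrote the file.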
Descriptive vs Presentational
Presentational languages describe how documents should look.
<b>text</b> turns on boldface for text.
What if you want to change book titles from bold to italics?
Search-and-replace won’t work if items other than book titles are bold.
Descriptive languages focus on the meaning.
<title>xml and you</title>
Stylesheets describe how to present logical items.
Can also be used purely for data storage and interchange.
A/K/A logical or structural markup languages.
XML-based Languages
• Ant
• Atom
• CML
• MathML
• MML
• MusicXML
• ODF
• OPML
• RDF
• SAML
• SOAP
• SVG
• VoiceXML
• WML
• XHTML
• XUL
Evolution of XML
1986 SGML standard published as ISO 8879
1987 Unicode proposal published
1991 First volume of Unicode standard
1996 XML work started
1998 XML 1.0 released as a W3C Recommendation
2001 XML Schema language
2004 XML 1.1 released (not widely used)
2007 Unicode 5.0 published
XML Tree Structure
<todo>
  <title>Monday’s List</title>
  <item>Study for midterm</item>
  <item><priority>10</priority>Scripting Class</item>
  <item>Bathe cat</item>
</todo>
The same document as a tree:
todo
  title: Monday’s List
  item: Study for midterm
  item: Scripting Class
    priority: 10
  item: Bathe cat
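The tree view of the to-do list can be reproduced with a short recursive walk. This is a sketch using REXML; the `outline` helper is hypothetical, and writing the priority as a child element with text content is an assumption about what the slide intended.

```ruby
require 'rexml/document'

xml = <<~XML
  <todo>
    <title>Monday's List</title>
    <item>Study for midterm</item>
    <item><priority>10</priority>Scripting Class</item>
    <item>Bathe cat</item>
  </todo>
XML

# Return one indented line per element, visiting the tree depth-first.
def outline(element, depth = 0)
  own_text = element.texts.map(&:value).join.strip   # text directly inside this element
  line = "#{'  ' * depth}#{element.name}"
  line += ": #{own_text}" unless own_text.empty?
  [line] + element.elements.to_a.flat_map { |child| outline(child, depth + 1) }
end

doc   = REXML::Document.new(xml)
lines = outline(doc.root)
puts lines
```

Each level of indentation in the output corresponds to one level of nesting in the document, with `todo` as the single root.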
Elements and Attributes
An element consists of tags and contents.
<title>Learn to Program</title>
Begin and end tags are mandatory; an empty element may use the self-closing form.
<isbn number="0976694042" />
Attributes
number="0976694042"
Elements may have zero or more attributes.
Attribute values must always be quoted.
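Elements and attributes can also be created from a script instead of typed by hand. The sketch below builds a small document with REXML; the `book` wrapper element is an assumption added so that the two elements from the slide have a common root.

```ruby
require 'rexml/document'

doc  = REXML::Document.new
book = doc.add_element('book')                        # root element (assumed wrapper)
book.add_element('title').text = 'Learn to Program'   # element with text content
book.add_element('isbn', 'number' => '0976694042')    # empty element with one attribute

title = book.elements['title']
isbn  = book.elements['isbn']

puts doc.to_s   # REXML quotes the attribute value for us when serializing
```

Because the library writes the tags, matching end tags and quoted attribute values come for free.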
Text
XML declaration specifies character encoding.
<?xml version="1.0" encoding="UTF-8"?>
Encodings
Unicode: universal character set, UTF-8, UTF-32
ISO-8859: 8-bit encodings, 8859-1 is Western Europe
Entities
&#nnnn; encodes the specified Unicode character
&name; are named character entities, such as
&lt; is <
&gt; is >
&amp; is &
currency symbols, fractions, Greek letters, math symbols, etc. (these must be declared by a DTD; XML itself predefines only &lt;, &gt;, &amp;, &apos;, and &quot;)
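To see entity handling from a script, the sketch below parses text containing named and numeric references and then escapes a plain string for output. It assumes REXML's default behavior of expanding references when text is read and escaping special characters when a `REXML::Text` node is serialized; the sample strings are made up for illustration.

```ruby
require 'rexml/document'

# Named entities and a numeric character reference (8364 is the euro sign).
xml  = '<note>5 &lt; 7 &amp;&amp; 7 &gt; 5, cost: &#8364;10</note>'
doc  = REXML::Document.new(xml)
text = doc.root.text        # references are expanded to the characters they stand for

# Going the other way: serializing plain text escapes the special characters.
escaped = REXML::Text.new('AT&T <rocks>').to_s

puts text
puts escaped
```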
XML Syntax Rules
1. There is one and only one root element.
2. Every begin tag must be matched by an end tag.
3. XML tags must be properly nested.
4. XML tags are case sensitive.
5. All attribute values must be quoted.
6. Whitespace within element content is part of the text.
7. Newlines are always normalized to LF by the parser.
8. HTML-style comments: <!-- comment -->
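Several of the rules above can be checked by handing small documents to a parser and seeing which ones it rejects. This is a sketch using REXML; `well_formed?` is a hypothetical helper, and the sample documents are invented for illustration.

```ruby
require 'rexml/document'

# Return true if REXML can parse the string, false if it raises a parse error.
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

samples = {
  'properly nested'    => '<a><b>text</b></a>',
  'improperly nested'  => '<a><b>text</a></b>',   # breaks rule 3
  'missing end tag'    => '<a><b>text</b>',       # breaks rule 2
  'case mismatch'      => '<a>text</A>',          # breaks rule 4
  'unquoted attribute' => '<a id=1>text</a>'      # breaks rule 5
}

results = samples.transform_values { |xml| well_formed?(xml) }
results.each { |label, ok| puts "#{label}: #{ok ? 'well-formed' : 'rejected'}" }
```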
Correctness
Well-formed
– Conforms to XML syntax rules.
– A conforming parser will not parse documents that are not well-formed.
Valid
– Conforms to the structural rules defined in a schema, such as
  • Document Type Definition (DTD)
  • XML Schema
– A validating parser will not parse invalid documents.
XML Schema Languages
Document Type Definitions
– Inherited from SGML.
– Do not support all XML features (for example, namespaces and data types).
XML Schema
– Most commonly used.
– Schemas are XML docs.
– A/K/A WXS, XSD
RELAX NG
– REgular LAnguage for XML Next Generation
– Has both XML and non-XML (compact) forms.
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Recipient" type="xs:string" />
        <xs:element name="House" type="xs:string" />
        <xs:element name="Street" type="xs:string" />
        <xs:element name="Town" type="xs:string" />
        <xs:element minOccurs="0" name="County" type="xs:string" />
        <xs:element name="PostCode" type="xs:string" />
        <xs:element name="Country">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:enumeration value="FR" />
              <xs:enumeration value="DE" />
              <xs:enumeration value="ES" />
              <xs:enumeration value="UK" />
              <xs:enumeration value="US" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
Ruby XML Parsers
REXML: Ruby Electric XML
– Standard with the Ruby language.
– Slow on large documents.
libxml-ruby
– Ruby bindings for the GNOME libxml2 XML toolkit.
– Very fast (30X as fast as REXML).
Hpricot
– Parses XML as well as HTML.
– Fast (3-4X as fast as REXML).
– Does not check for well-formedness or validity.
Types of Parsing
Tree Parsing (DOM-like)
– Good for small documents.
– Loads entire document into memory.
– Simple API.
Stream Parsing (SAX-like)
– Good for large documents.
– User defines callback methods, passes to API.
– Parser runs callback methods on pattern match.
Tree Parsing
Loads entire XML doc into memory.
require 'rexml/document'
include REXML
input = File.new('data.xml')
doc = Document.new(input)
root = doc.root
Search document as a tree using XPath
doc.elements.each("ch/section") do |e|
  puts e.attributes["title"]
end
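A self-contained variant of the tree-parsing code above, as a sketch: the document is an inline string instead of data.xml, and its contents (a `ch` root with `section` children carrying `title` attributes) are an assumption chosen to match the XPath on the slide.

```ruby
require 'rexml/document'

# Assumed sample data shaped to fit the "ch/section" search.
xml = <<~XML
  <ch title="XML">
    <section title="What is XML?"/>
    <section title="XML Structure"/>
    <section title="REXML"/>
  </ch>
XML

doc = REXML::Document.new(xml)   # the whole document is now in memory

# Same search as on the slide, collecting results instead of printing them.
titles = []
doc.elements.each('ch/section') { |e| titles << e.attributes['title'] }

# REXML::XPath offers the same kind of search as module methods.
first = REXML::XPath.first(doc, '/ch/section')
count = REXML::XPath.match(doc, '//section').size

puts titles
```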
Stream Parsing
Define listener class.
require 'rexml/document'
require 'rexml/streamlistener'
class MyListener
  include REXML::StreamListener
  # Called once for each begin tag; args are the tag name and its attributes.
  def tag_start(*args)
    puts "start: #{args.map {|x| x.inspect}.join(',')}"
  end
end
Invoke parser
include REXML
listen = MyListener.new
source = File.new('data.xml')
Document.parse_stream(source, listen)
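The listener above only prints each begin tag. A slightly fuller sketch, assuming the inventory document from earlier in the deck as input, collects data in the callbacks as the parser streams through the text. `InventoryListener` and its accessors are hypothetical names.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects element names, book ISBNs, and title text as the parser reports them.
class InventoryListener
  include REXML::StreamListener
  attr_reader :tags, :isbns, :titles

  def initialize
    @tags, @isbns, @titles = [], [], []
    @in_title = false
  end

  # Called for each begin tag with the tag name and a hash of attributes.
  def tag_start(name, attrs)
    @tags << name
    @isbns << attrs['isbn'] if name == 'book'
    @in_title = (name == 'title')
  end

  # Called for each run of text; keep it only while inside <title>.
  def text(data)
    @titles << data if @in_title
  end

  # Called for each end tag.
  def tag_end(_name)
    @in_title = false
  end
end

xml = <<~XML
  <inventory>
    <book isbn="0976694042">
      <author>Chris Pine</author>
      <title>Learn to Program</title>
    </book>
  </inventory>
XML

listener = InventoryListener.new
REXML::Document.parse_stream(xml, listener)   # no tree is built; callbacks fire as input is read

puts listener.titles
```

The listener keeps only what it needs, which is why this style suits documents too large to load as a tree.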
XPath Searches
doc.search("p")
Find all paragraph tags in document.
doc.search("/html/body//p")
Find all paragraph tags within the body tag.
doc.search("//a[@src]")
Find all anchor tags with a src attribute.
doc.search("//a[@src='google.com']")
Find all a tags with a src attribute of google.com.
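The searches above use a `search` method, which REXML does not provide. As a sketch under that assumption, the same XPath expressions can be run with `REXML::XPath.match` instead. The sample document is invented and keeps the `src` attribute from the slide's examples.

```ruby
require 'rexml/document'

# Assumed sample page; well-formed so that REXML can parse it.
html = <<~XML
  <html>
    <body>
      <p>First paragraph</p>
      <div><p>Nested paragraph</p></div>
      <a src="google.com">Search</a>
      <a name="top">Top</a>
    </body>
  </html>
XML

doc = REXML::Document.new(html)

all_paragraphs  = REXML::XPath.match(doc, '//p')              # every <p> in the document
body_paragraphs = REXML::XPath.match(doc, '/html/body//p')    # <p> at any depth under <body>
with_src        = REXML::XPath.match(doc, '//a[@src]')        # <a> that has a src attribute
google_anchors  = REXML::XPath.match(doc, "//a[@src='google.com']")

puts all_paragraphs.size
puts google_anchors.first.text
```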