Introduction to XML
Outline
• Background
• XML Basics
• Document Type Definitions (DTDs)
• XML schema
• CML
From HTML To XML
• HTML - HyperText Markup Language
  – Means of structuring text for visual presentation
  – Designed to describe how a Web browser should arrange text, images and push-buttons on a page
• HTML describes:
  – Intra-document structure
  – Inter-document structure
<HTML>
  <HEAD><TITLE>Introduction to XML</TITLE></HEAD>
  <BODY>
    <H1>XML</H1>
    <IMG SRC="info_logo.jpeg" WIDTH="200" HEIGHT="150">
  </BODY>
</HTML>
Parts labelled in the example: opening tag, text (PCDATA), closing tag, attribute name, attribute value
From HTML to XML
• Need for data structuring for more general applications than display applications
• Examples:
  – Extracting biological data from an NCBI search result page to be used for running a bioinformatics tool
  – Extracting financial data from web pages to conduct financial analyses
• Solution: markup language to structure document contents (XML)
XML: brief history
• XML: eXtensible Markup Language
• Subset of SGML
• First version (1.0) formally ratified by the W3C in 1998
• Current version is XML 1.1 released in 2004
• XML is becoming the standard for data interchange between applications
XML: brief history
• Purpose: used for structuring the content of documents
• Basis for various application-specific markup languages including:
  – GML: Geography Markup Language
  – OFX: Open Financial Exchange Markup Language
  – SBML: Systems Biology Markup Language
  – MusicXML: Music Markup Language
  – CML: Chemical Markup Language
  – Many more …
XML: brief history
• Some advantages of XML
  – XML is extensible
  – XML is both human readable and computer readable
  – XML is platform and language independent
  – XML is a public standard
  – XML tool set is large and growing
  – XML works well with the Internet
  – XML documents can be transformed
  – XML is global
XML: brief history
• Some of the disadvantages
  – XML is verbose
  – XML is not a cure-all for data integration
  – XML does not guarantee a unified format
  – XML has a steep learning curve
Outline
• Background
• XML Basics
• Document Type Definitions (DTDs)
• XML schema
• CML
XML structure - Example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE PROTEIN_SET SYSTEM "protein.dtd">
<PROTEIN_SET>
  <PROTEIN>
    <ACCESSION>P26954</ACCESSION>
    <ENTRY_NAME>IL3B_MOUSE</ENTRY_NAME>
    <PROTEIN_NAME>Interleukin-3 receptor class II beta chain [Precursor]</PROTEIN_NAME>
    <GENE_NAME>CSF2RB2</GENE_NAME>
    <GENE_NAME>AI2CA</GENE_NAME>
    <GENE_NAME>IL3RB2</GENE_NAME>
    <GENE_NAME>IL3R</GENE_NAME>
    <ORGANISM taxonomy_id="10090">Mus musculus</ORGANISM>
    <COMMENT>FUNCTION: IN MOUSE THERE ARE TWO CLASSES OF HIGH-AFFINITY IL-3 RECEPTORS. ONE CONTAINS THIS IL-3-SPECIFIC BETA CHAIN AND THE OTHER CONTAINS THE BETA CHAIN ALSO SHARED BY HIGH-AFFINITY IL-5 AND GM-CSF RECEPTORS.</COMMENT>
    <COMMENT>SUBUNIT: Heterodimer of an alpha and a beta chain.</COMMENT>
    <KEYWORD>Receptor</KEYWORD>
    <KEYWORD>Glycoprotein</KEYWORD>
    <KEYWORD>Signal</KEYWORD>
  </PROTEIN>
</PROTEIN_SET>
XML structure
• Key components:
  – Tags
  – Text
XML structure
• Tags:
  – Represent element names
  – Used in pairs
  – E.g. <GENE_NAME>…</GENE_NAME>
  – Must be properly nested:
    • <reference> <author> ... </author> ... </reference> --- good
    • <reference> <author> ... </reference> ... </author> --- bad
XML structure
• Element names follow the XML name specification
• XML names:
  – May include:
    • Alphanumeric characters
    • Non-English letters (e.g. Ω)
    • Ideograms
    • Underscore (_), hyphen (-), period (.), colon (:)
  – Must not include:
    • White space, quotation marks, apostrophes, dollar signs, percent symbols, carets, and semicolons
  – May only start with:
    • Letters
    • Ideograms
    • Underscore character
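The naming rules above can be approximated with a regular expression. The following is a minimal sketch, not the full Name production of the XML specification; the function name `is_xml_name` is a hypothetical helper introduced here for illustration.

```python
import re

# Simplified approximation of the naming rules listed above (an assumption,
# not the complete XML grammar): the first character is a letter or an
# underscore; later characters may also be digits, hyphens, periods or colons.
_NAME_RE = re.compile(r"[^\W\d][\w.\-:]*")


def is_xml_name(candidate: str) -> bool:
    """Return True if candidate looks like a legal XML name (approximation)."""
    return _NAME_RE.fullmatch(candidate) is not None


if __name__ == "__main__":
    for name in ["GENE_NAME", "taxonomy_id", "2nd_gene", "gene name", "price$"]:
        print(name, is_xml_name(name))
```

Names that start with a digit or contain white space or a dollar sign are rejected, in line with the list above.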
XML structure
• Text:
  – XML has only one "basic" type: text
  – Text is bounded by tags
  – E.g.:
    <PROTEIN_NAME> Interleukin-3 receptor class II beta chain [Precursor] </PROTEIN_NAME>
    <Seq_length> 2650 </Seq_length> --- 2650 is still text
  – XML text is called PCDATA (for parsed character data)
XML structure
• Tag nesting - used for expressing various data structures including:
  – Tuple (record):
<reference>
  <author> Johnston, M. </author>
  <title> The nucleotide sequence of Saccharomyces cerevisiae chromosome XII </title>
  <publication_year> 1997 </publication_year>
</reference>
  – List:
<protein_set>
  <protein> … </protein>
  <protein> … </protein>
  ...
</protein_set>
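To illustrate how nesting maps onto familiar data structures, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The helper names and the contents of the `<protein>` elements are assumptions made for the example (the slide elides them with "…").

```python
import xml.etree.ElementTree as ET

# Tuple (record) example from the slide.
REFERENCE_XML = """
<reference>
  <author> Johnston, M. </author>
  <title> The nucleotide sequence of Saccharomyces cerevisiae chromosome XII </title>
  <publication_year> 1997 </publication_year>
</reference>
"""

# List example; the protein contents are placeholders chosen for illustration.
PROTEIN_SET_XML = """
<protein_set>
  <protein>P26954</protein>
  <protein>P12345</protein>
</protein_set>
"""


def element_to_record(xml_text):
    """Read a tuple-like element into a dict keyed by child tag name."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def element_to_list(xml_text):
    """Read a list-like element into a Python list of child texts."""
    root = ET.fromstring(xml_text)
    return [(child.text or "").strip() for child in root]


if __name__ == "__main__":
    print(element_to_record(REFERENCE_XML))
    print(element_to_list(PROTEIN_SET_XML))
```

Note that the publication year comes back as the string "1997": element content is text until the application converts it.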
XML Terminology - Elements
• Element: segment of an XML document between an opening and a corresponding closing tag
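<reference>
  <author> Johnston, M. </author>
  <author> Hillier, L. </author>
  <title> The nucleotide sequence of Saccharomyces cerevisiae chromosome XII </title>
  <publication_year> 1997 </publication_year>
</reference>
In the example, the whole <reference> … </reference> span is an element, and each nested item such as <author> … </author> is an element, a sub-element of reference.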
XML Terminology - Elements
• Mixed content: an element may contain a mixture of sub-elements and PCDATA
• E.g.:
<book>
  <title>My First XML</title>
  <prod id="33-657" media="paper"></prod>
  <chapter>
    Introduction to XML
    <para>What is HTML</para>
    <para>What is XML</para>
  </chapter>
  …
</book>
XML Terminology - Attributes
• An (opening) tag may contain attributes
• Typically used to describe the content of an element
• Syntax: attribute_name="value1 value2 …"
• Attribute names follow XML naming
• Example 1: <ORGANISM taxonomy_id="10090">Mus musculus</ORGANISM>
• Example 2: <file type="gif">computer.gif</file>
XML Terminology - Attributes
• Common use for attributes is to express dimension or type
<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
  <data encoding="gif" compression="zip"> M05-.+C$@02!G96YE&lt;FEC ... </data>
</picture>
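As a rough illustration of how an application might read such attributes, the sketch below parses the height and width elements with Python's standard `xml.etree.ElementTree`. The function name `read_dimensions` is a hypothetical helper, and the `<data>` element is left out for brevity.

```python
import xml.etree.ElementTree as ET

PICTURE_XML = """
<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
</picture>
"""


def read_dimensions(xml_text):
    """Return {tag: (numeric value, unit)} for each child of <picture>."""
    root = ET.fromstring(xml_text)
    result = {}
    for child in root:
        unit = child.get("dim")             # attribute value, or None if absent
        value = float(child.text.strip())   # element text is a string; convert it
        result[child.tag] = (value, unit)
    return result


if __name__ == "__main__":
    print(read_dimensions(PICTURE_XML))
```

The attribute carries the unit while the element content carries the value, which is the usage pattern the slide describes.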
XML Terminology - Using IDs
• Special attribute
• Used to uniquely identify elements
• Can be used by other elements for referencing purposes
• Value of an ID attribute is unique
• Must be declared of type ID in the DTD
XML Terminology - Using IDs
<family>
  <person perId="jane" mother="mary" father="john">
    <name> Jane Doe </name>
  </person>
  <person perId="john" children="jane jack">
    <name> John Doe </name>
  </person>
  <person perId="mary" children="jane jack">
    <name> Mary Doe </name>
  </person>
  <person perId="jack" mother="mary" father="john">
    <name> Jack Doe </name>
  </person>
</family>
A Complete XML Document
An XML document includes:
• A declaration part (optional in XML 1.0, but recommended), e.g. <?xml version="1.0" encoding="ISO-8859-1" ?>
• A root element (required), e.g. <PROTEIN_SET> … </PROTEIN_SET>
Well-formed documents
• XML documents must be well-formed:
  – Presence of one root element
  – Proper XML naming
  – Proper matching of tags
  – Proper nesting of tags
  – Attribute values must be quoted
  – The name of an attribute is unique within an element
  – Comments and processing instructions may not appear inside tags
  – No un-escaped < or & may appear in the character data of an element or an attribute
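Several of the well-formedness rules above can be tried out with any XML parser, since a conforming parser rejects documents that break them. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "<reference><author>A</author></reference>",   # properly nested
        "<reference><author>A</reference></author>",   # improperly nested
        "<a x=1/>",                                     # unquoted attribute value
        '<a x="1" x="2"/>',                             # duplicate attribute name
        "<a>1 < 2</a>",                                 # un-escaped <
        "<a/><b/>",                                     # more than one root element
    ]
    for sample in samples:
        print(sample, is_well_formed(sample))
```

This only checks well-formedness; it says nothing about whether the document is valid against a DTD or schema.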
Outline
• Background
• XML Basics
• Document Type Definitions (DTDs)
• XML schema
• CML
Document Type Definitions
• Document Type Definitions (DTDs) impose structure on an XML document
• The DTD is a syntactic specification
• General syntax:
<!DOCTYPE DTD-name [
  <!ELEMENT …>
  <!ELEMENT …>
  …
  <!ATTLIST …>
  <!ATTLIST …>
  …
]>
• Note: DTD-name corresponds to the root element of XML documents that use the DTD for validation
Example: An Address Book
<person>
  <name> MacNiel, John </name>
  <greet> Dr. John MacNiel </greet>
  <addr> 1234 Huron Street </addr>
  <addr> Rome, OH 98765 </addr>
  <tel> (321) 786 2543 </tel>
  <fax> (321) 786 2543 </fax>
  <tel> (321) 786 2543 </tel>
  <email> [email protected] </email>
</person>
• name: exactly one name
• greet: at most one greeting
• addr: as many address lines as needed (in order)
• tel, fax: mixed telephones and faxes
• email: as many as needed
Specifying the structure
name to specify a name element
greet? to specify an optional (0 or 1) greet element
name,greet? to specify a name followed by an optional greet
Specifying the structure (cont)
addr* to specify 0 or more address lines
tel | fax a tel or a fax element
(tel | fax)* 0 or more repeats of tel or fax
email* 0 or more email elements
Specifying the structure (cont)
So the whole structure of a person entry is specified by
name, greet?, addr*, (tel | fax)*, email*
This is known as a regular expression
A DTD for the address book
<!DOCTYPE addressbook [
  <!ELEMENT addressbook (person*)>
  <!ELEMENT person (name, greet?, addr*, (fax | tel)*, email*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT greet (#PCDATA)>
  <!ELEMENT addr (#PCDATA)>
  <!ELEMENT tel (#PCDATA)>
  <!ELEMENT fax (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
Summary of XML regular expressions for element description
• e The tag e occurs
• e1,e2 The expression e1 followed by e2
• e* 0 or more occurrences of e
• e? Optional -- 0 or 1 occurrences
• e+ 1 or more occurrences
• e1 | e2 either e1 or e2
• (e) grouping
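Because a content model is a regular expression over child element names, it can be mimicked with an ordinary regex. The sketch below encodes the person model name, greet?, addr*, (tel | fax)*, email* by joining the child tag names with spaces; this is an illustration of the idea, not how a real DTD validator is implemented, and `matches_person_model` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*  written over "tag " tokens.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")


def matches_person_model(person_xml: str) -> bool:
    """Check the sequence of child tags of <person> against the content model."""
    person = ET.fromstring(person_xml)
    child_tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_tags) is not None


if __name__ == "__main__":
    ok = ("<person><name>N</name><greet>G</greet><addr>A</addr><addr>B</addr>"
          "<tel>1</tel><fax>2</fax><tel>3</tel><email>e</email></person>")
    bad = "<person><name>N</name><addr>A</addr><greet>G</greet></person>"
    print(matches_person_model(ok))
    print(matches_person_model(bad))
```

The second document fails because greet appears after addr, which the ordered model does not allow.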
Specifying attributes in the DTD
• Example:
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
  dimension CDATA #REQUIRED
  accuracy CDATA #IMPLIED>
• The dimension attribute is required; the accuracy attribute is optional
• CDATA is the “type” of the attribute -- it means string
Specifying attributes in the DTD
• General syntax:
<!ATTLIST element_name
  attribute_name attribute_type default_value>
• Attribute types include:
  – CDATA, enumeration (a list of allowed values), ID, IDREF, IDREFS, NOTATION, NMTOKEN, NMTOKENS, ENTITY, ENTITIES
• Attribute default values include:
  – #IMPLIED, #REQUIRED, #FIXED, literal
Specifying ID and IDREF attributes
<!DOCTYPE family [
  <!ELEMENT family (person)*>
  <!ELEMENT person (name)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST person
    perId ID #REQUIRED
    mother IDREF #IMPLIED
    father IDREF #IMPLIED
    children IDREFS #IMPLIED>
]>
Some conforming data
<family>
  <person perId="jane" mother="mary" father="john">
    <name> Jane Doe </name>
  </person>
  <person perId="john" children="jane jack">
    <name> John Doe </name>
  </person>
  <person perId="mary" children="jane jack">
    <name> Mary Doe </name>
  </person>
  <person perId="jack" mother="mary" father="john">
    <name> Jack Doe </name>
  </person>
</family>
Consistency of ID and IDREF attribute values
• If an attribute is declared as ID
  – the associated values must all be distinct (no confusion)
• If an attribute is declared as IDREF
  – the associated value must exist as the value of some ID attribute
• Similarly for all the values of an IDREFS attribute
• ID and IDREF attributes are not typed
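The consistency rules for ID, IDREF and IDREFS can be sketched as a small check over the family example. This is an illustrative approximation in Python, assuming the attribute roles from the family DTD (perId is the ID, mother and father are IDREF, children is IDREFS); a validating parser would read these roles from the DTD itself. The name `id_ref_errors` is hypothetical.

```python
import xml.etree.ElementTree as ET

FAMILY_XML = """
<family>
  <person perId="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person perId="john" children="jane jack"><name>John Doe</name></person>
  <person perId="mary" children="jane jack"><name>Mary Doe</name></person>
  <person perId="jack" mother="mary" father="john"><name>Jack Doe</name></person>
</family>
"""


def id_ref_errors(xml_text):
    """Return a list of ID/IDREF consistency problems (empty if consistent)."""
    persons = list(ET.fromstring(xml_text).iter("person"))
    errors = []
    seen = set()
    # Rule 1: ID values must all be distinct.
    for person in persons:
        per_id = person.get("perId")
        if per_id in seen:
            errors.append("duplicate ID: %s" % per_id)
        seen.add(per_id)
    # Rule 2: every IDREF / IDREFS value must match some ID value.
    for person in persons:
        refs = [person.get(attr) for attr in ("mother", "father")]
        refs += (person.get("children") or "").split()   # IDREFS: space-separated
        for ref in refs:
            if ref is not None and ref not in seen:
                errors.append("dangling reference: %s" % ref)
    return errors


if __name__ == "__main__":
    print(id_ref_errors(FAMILY_XML))
```

The check does not care what kind of element an ID belongs to, which mirrors the point that ID and IDREF attributes are not typed.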
Connecting the document with its DTD
• In line:
<?xml version="1.0"?>
<!DOCTYPE PROTEIN_SET [<!ELEMENT ...> … ]>
<PROTEIN_SET> ... </PROTEIN_SET>
• Another file:
<!DOCTYPE PROTEIN_SET SYSTEM "protein.dtd">
<PROTEIN_SET> ... </PROTEIN_SET>
• A URL:
<!DOCTYPE PROTEIN_SET SYSTEM "http://.../protein.dtd">
<PROTEIN_SET> ... </PROTEIN_SET>
Valid Documents
• XML documents are checked for validity by an XML validator against a schema such as a DTD
• Validity specifies that the document conforms to the DTD: it conforms to the regular expression grammar, the types of its attributes are correct, and the constraints on ID and IDREF(S) are satisfied
Outline
• Background
• XML Basics
• Document Type Definitions (DTDs)
• XML schema
• CML
XML schema
• W3C recommendation
• Successor of DTDs
• Used to validate XML documents
• Specification lengthy and rather complex
• Proposed to address the shortcomings of DTDs
XML schema
• Features
  – Data typing: in contrast to DTDs, where element and attribute values are all strings
  – Schema files are XML files
  – Support for object-oriented practices
  – Additional validation rules (e.g. pattern of an element's content, minimum/maximum values for attributes)
  – Full support of namespaces
XML schema
• XML schema for scientific applications
  – Used in several areas: bioinformatics, chemical informatics, laboratory informatics, etc.
  – Examples include:
    • AGAVE
    • CML
    • PEML
    • PSI-MI
    • SBML
    • UniProt XML
    • XFF
Introductory Example
• Example1: Representing protein data
XML Schema constructs
• The <schema> element
  – Root of an XML schema document
  – E.g. <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  – Prefix xs references the namespace of XML schemas
  – Used to reference schema constructs such as xs:annotation, xs:complexType
XML Schema constructs
• Schema Documentation
  – Schema element xs:annotation is used to document schema documents, providing information about the document and detailed information about its elements and attributes
  – Two types of documentation:
    • Human readable: using element xs:documentation
    • Machine readable: using element xs:appinfo
  – E.g.
<xs:annotation>
  <xs:documentation>Sample XML Schema for representing Protein data.</xs:documentation>
</xs:annotation>
XML Schema constructs
• Simple types vs. complex types
  – A schema element is either of simpleType or complexType
  – An element is of simple type if it does not contain any attributes or child elements
  – An element is of complex type if it includes children, attributes or both
  – See example1
XML Schema constructs
• Global elements vs. local elements
  – Global elements are direct children of the root schema element
  – E.g. protein_set in example1 is a global element
  – Local elements are not direct children of the schema element
  – Global elements can be referenced within the document while local elements cannot
  – E.g. <xs:element ref="organism"/> - the protein element contains the organism element
XML Schema constructs
• Creating instance documents
  – An instance document is an XML document that adheres to an XML grammar defined by a DTD, an XML schema, etc.
  – Example: protein_set instance document
<protein_set xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="protein.xsd">
  – Declaring the instance namespace makes its constructs available, such as the noNamespaceSchemaLocation attribute
  – The noNamespaceSchemaLocation attribute indicates that the XML schema has no target namespace; its value gives the location of the XML schema
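For illustration, the schema location hint can be read back from an instance document with Python's standard `xml.etree.ElementTree`. The sketch below only extracts the attribute; ElementTree itself does not validate against the schema. The function name `schema_location` is hypothetical.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

INSTANCE_XML = """<protein_set
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="protein.xsd">
</protein_set>"""


def schema_location(xml_text):
    """Return the xsi:noNamespaceSchemaLocation value, or None if absent."""
    root = ET.fromstring(xml_text)
    # ElementTree stores namespaced attribute names as "{namespace-uri}local-name".
    return root.get("{%s}noNamespaceSchemaLocation" % XSI_NS)


if __name__ == "__main__":
    print(schema_location(INSTANCE_XML))
```

The lookup uses the namespace URI rather than the xsi prefix, because the prefix is only a local abbreviation chosen by the document.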
XML Schema constructs
• Working with simple types
  – A simple type holds a single value such as a string, integer, or NMTOKEN
  – XML schema provides 44 built-in schema types including:
    • string, byte, decimal, float, boolean, time, QName, anyURI
  – Built-in data types are organized in a hierarchy rooted at the anyType data type
  – Type declaration: using the type attribute when defining an attribute or an element
XML Schema constructs
• Working with facets to derive new data types
  – Users may derive new data types
  – Two mechanisms for derivation:
    • By extension
    • By restriction
  – Facets are provided to derive new data types by restriction
  – XML schema supports 12 facets including length, minLength, maxLength, pattern, and enumeration
XML Schema constructs
• Declaring new data types using facets
  – Example:
<xs:simpleType name="accessionType">
  <xs:restriction base="xs:string">
    <xs:minLength value="4"/>
    <xs:maxLength value="8"/>
  </xs:restriction>
</xs:simpleType>
• Are these elements of type accessionType valid or not?
  <accession>P23</accession>
  <accession>P12345678</accession>
XML Schema constructs
• Example of facets – Pattern facet
  – Used to restrict string values to match a regular expression pattern
  – E.g. accession numbers must start with the letter P
<xs:simpleType name="accessionType">
  <xs:restriction base="xs:string">
    <xs:minLength value="4"/>
    <xs:maxLength value="8"/>
    <xs:pattern value="P.*"/>
  </xs:restriction>
</xs:simpleType>
• E.g. a DNA sequence should only include A, C, G or T characters
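The effect of the minLength, maxLength and pattern facets can be imitated in a few lines of Python. This is a sketch of the semantics, not a schema validator: the function names are hypothetical, and the DNA pattern [ACGT]+ is an assumption, since the slide states the constraint but gives no pattern.

```python
import re


def is_valid_accession(value: str) -> bool:
    """Mimic accessionType: minLength=4, maxLength=8, pattern='P.*'."""
    # Schema patterns must match the whole value, hence fullmatch.
    return 4 <= len(value) <= 8 and re.fullmatch(r"P.*", value) is not None


def is_dna_sequence(value: str) -> bool:
    """Assumed pattern [ACGT]+ for 'only A, C, G or T characters'."""
    return re.fullmatch(r"[ACGT]+", value) is not None


if __name__ == "__main__":
    for accession in ["P23", "P12345678", "P26954", "Q26954"]:
        print(accession, is_valid_accession(accession))
    for sequence in ["ACGT", "ACGU"]:
        print(sequence, is_dna_sequence(sequence))
```

Under these facets, both accession values from the earlier question are rejected: P23 has only 3 characters and P12345678 has 9.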
XML Schema constructs
• Example of facets – Enumeration facet
  – Used to restrict the list of possible values
  – E.g. restrict the list of biological databases to be referenced
<xs:simpleType name="databaseType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="EMBL"/>
    <xs:enumeration value="PIR"/>
    <xs:enumeration value="MGD"/>
    <xs:enumeration value="InterPro"/>
    <xs:enumeration value="Pfam"/>
    <xs:enumeration value="SMART"/>
    <xs:enumeration value="PROSITE"/>
  </xs:restriction>
</xs:simpleType>
XML Schema constructs
• Working with complex types
  – Used to define elements that include attributes and/or children
  – Elements including attributes only are defined as complex types with simple content
  – E.g.
<xs:element name="organism">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="taxonomy_id" type="xs:integer" use="required"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>
XML Schema constructs
• Working with complex types
  – Elements including children are defined as complex types with complex content
  – E.g.
<xs:element name="protein_set">
  <xs:complexType>
    <xs:complexContent>
      <xs:restriction base="xs:anyType">
        <xs:sequence>
          <xs:element ref="protein" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:element>
XML Schema constructs
• Working with complex types
  – Abbreviation: elements defined as complex types with complex content that restricts anyType may be abbreviated
  – E.g.
<xs:element name="protein_set">
<xs:complexType>
<xs:sequence>
<xs:element ref="protein" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
XML Schema constructs
• Occurrence constraints on elements
  – Used to set how many times a child element may appear
  – Using the minOccurs and maxOccurs attributes
  – Values range from "0" to "unbounded"
  – If not specified, both default to "1"
  – E.g.
<xs:element name="protein">
  <xs:complexType>
    <xs:complexContent>
      <xs:restriction base="xs:anyType">
        <xs:sequence>
          <xs:element name="accession" type="xs:string"/>
          <xs:element name="entry_name" type="xs:string"/>
          <xs:element name="protein_name" type="xs:string"/>
          <xs:element name="gene_name" type="xs:string" maxOccurs="unbounded"/>
          <xs:element ref="organism"/>
          <xs:element ref="cross_reference" minOccurs="0" maxOccurs="unbounded"/>
          <xs:element name="comment" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
          <xs:element name="keyword" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:element>
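To make the minOccurs/maxOccurs rules concrete, the sketch below counts the children of a protein element and compares the counts with the bounds from the declaration above. It is a simplified illustration in Python: it checks counts only, not the order required by xs:sequence, and the table `OCCURS` and function `occurrence_errors` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# (minOccurs, maxOccurs) per child of <protein>; None stands for "unbounded".
# Unspecified bounds default to 1, as stated above.
OCCURS = {
    "accession": (1, 1),
    "entry_name": (1, 1),
    "protein_name": (1, 1),
    "gene_name": (1, None),
    "organism": (1, 1),
    "cross_reference": (0, None),
    "comment": (0, None),
    "keyword": (0, None),
}

PROTEIN_XML = """
<protein>
  <accession>P26954</accession>
  <entry_name>IL3B_MOUSE</entry_name>
  <protein_name>Interleukin-3 receptor class II beta chain</protein_name>
  <gene_name>CSF2RB2</gene_name>
  <gene_name>IL3RB2</gene_name>
  <organism taxonomy_id="10090">Mus musculus</organism>
  <keyword>Receptor</keyword>
</protein>
"""


def occurrence_errors(protein_xml):
    """Return one message per child whose count falls outside its bounds."""
    counts = {}
    for child in ET.fromstring(protein_xml):
        counts[child.tag] = counts.get(child.tag, 0) + 1
    errors = []
    for tag, (low, high) in OCCURS.items():
        found = counts.get(tag, 0)
        if found < low or (high is not None and found > high):
            upper = "unbounded" if high is None else high
            errors.append("%s: found %d, expected %d..%s" % (tag, found, low, upper))
    return errors


if __name__ == "__main__":
    print(occurrence_errors(PROTEIN_XML))
```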
XML Schema constructs
• Compositors: sequence, choice and all
  – Used to specify how elements are organized
  – Using sequence: E.g.
<xs:element name="PubmedArticle">
<xs:complexType>
<xs:sequence>
<xs:element name="MedlineID" type="xs:long"/>
<xs:element name="PMID" type="xs:long"/>
</xs:sequence>
</xs:complexType>
</xs:element>
XML Schema constructs
• Compositors: sequence, choice and all
  – Using choice: E.g.
<xs:element name="PubmedArticle">
<xs:complexType>
<xs:choice>
<xs:element name="MedlineID" type="xs:long"/>
<xs:element name="PMID" type="xs:long"/>
</xs:choice>
</xs:complexType>
</xs:element>
XML Schema constructs
• Compositors: sequence, choice and all
  – Sequence and choice constructs can be combined to organize sub-elements. E.g.
<xs:element name="PubmedArticle">
  <xs:complexType>
    <xs:sequence>
      <xs:choice>
        <xs:element name="MedlineID" type="xs:long"/>
        <xs:element name="PMID" type="xs:long"/>
      </xs:choice>
      <xs:element name="ArticleTitle" type="xs:string"/>
      <xs:element name="AbstractText" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
XML Schema constructs
• Using all:
  – Used to indicate that a group of elements can appear in any order
  – But each element should be either optional or appear only once
  – E.g.
<xs:element name="PubmedArticle">
  <xs:complexType>
    <xs:all>
      <xs:element name="MedlineID" type="xs:long"/>
      <xs:element name="PMID" type="xs:long"/>
    </xs:all>
  </xs:complexType>
</xs:element>
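The difference between the three compositors (sequence, choice and all) can be summarised over lists of child element names. The following Python sketch assumes every declared element has the default minOccurs and maxOccurs of 1; the function names are hypothetical and real schema processors handle many more cases.

```python
def matches_sequence(children, declared):
    """sequence: every declared element appears, in the declared order."""
    return list(children) == list(declared)


def matches_choice(children, declared):
    """choice: exactly one of the declared elements appears."""
    return len(children) == 1 and children[0] in declared


def matches_all(children, declared):
    """all: every declared element appears once, in any order."""
    return sorted(children) == sorted(declared)


if __name__ == "__main__":
    declared = ["MedlineID", "PMID"]
    swapped = ["PMID", "MedlineID"]
    print(matches_sequence(swapped, declared))
    print(matches_choice(["PMID"], declared))
    print(matches_all(swapped, declared))
```

With the children in swapped order, only the all compositor accepts the pair, while choice accepts a document that carries just one of the two identifiers.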
XML Schema constructs
• Defining named complex types
  – Named complex types can be reused through the type attribute of element declarations
  – Defined as direct children of the schema element, with a name attribute
  – E.g.
<xs:complexType name="organismType">
  <xs:simpleContent>
    <xs:extension base="xs:string">
      <xs:attribute name="taxonomy_id" type="xs:integer" use="required"/>
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>
Putting it together
• Representing protein data: an updated version using basic XML schema constructs such as named data types, etc.
Outline
• Background
• XML Basics
• Document Type Definitions (DTDs)
• XML schema
• CML
CML
• What is CML?
  – CML (Chemical Markup Language)
  – CML is an XML-based language for representing chemical data
  – More precisely, CML is the application of XML to the representation of molecules, crystallography and spectra
  – CML evolved in the chemical industry to meet the need for exchanging molecular and other information for publishing Web-based documents for patent applications, standards committees, and other organizations
  – CML does not cover all chemistry but focuses on molecules (and similar structures representable by a formula)
  – CML represents molecules, atoms, and bonds
CML
• History
  – Early discussion began in 1994
  – Version 1.0 formally published in 1999
  – CML current version is 2.0 published in
  – CML has a CML schema specification
CML
• CML component-based approach:
  – STM-ML. Generic support for scientific information
  – CMLCore. Molecular and related information
  – CMLReact. Chemical reactions
  – CMLSpec. Spectra
  – CMLComp. Computational Chemistry
  – CMLQuery. General query language for chemistry
CML
• CML modular architecture:
  – Allows it to be used as part of an application
  – E.g.
    • "substance" in a GeneOntology
    • Description or chemical support for a New Drug Application
  – Intended to interoperate (not compete!) with other chemical informatics projects such as JCAMP-DX, SpectroML, etc.
CML
• CML as part of a larger chemical informatics community
  – Provide support only for the main components (e.g. core CML, CMLReact, etc.)
  – Other chemistry-related fields are not supported
  – E.g.
    • Macromolecular structures and sequences are no longer supported directly in CML
    • New strategy: working with the biomolecular community to use CML to represent molecules precisely
• Examples of scientific languages likely to use CML
  – Materials Markup Language (MatML), NIST
  – Gene Ontology (GO), EBI, etc.
  – CellML (for describing cells, including reactions)
  – e-CTD (Common Technical Dossier for e-submission of new drug applications (NDAs)), FDA, EMEA
References
• XML in a Nutshell, O'Reilly, 2004
• XML for Bioinformatics, by Ethan Cerami, Springer, 2005
• Extensible Markup Language (XML): http://www.w3.org/XML
• Document Type Definitions: http://www.w3schools.com/dtd//
• XML Schema: http://www.w3schools.com/schema/default.asp
• CML official sites: http://www.xml-cml.org/, http://cml.sourceforge.net/main.html, http://www.xml-cml.org/information/disciplines/index.html
• Gemma L. Holliday, Peter Murray-Rust, and Henry S. Rzepa. Chemical Markup, XML, and the World Wide Web. 6. CMLReact, an XML Vocabulary for Chemical Reactions. Web release date: 30-Nov-2005. DOI: 10.1021/ci0502698
• Chemical markup language zone: http://www.adobe.com/svg/demos/devtrack/chemical.htm