introduction to xml. outline background xml basics document type descriptors (dtds) xml schema cml

Introduction to XML

Upload: frederick-mcdonald

Post on 28-Dec-2015


Page 1: Introduction to XML. Outline Background XML Basics Document Type Descriptors (DTDs) XML schema CML

Introduction to XML

Outline

• Background

• XML Basics

• Document Type Descriptors (DTDs)

• XML schema

• CML

From HTML to XML
• HTML: HyperText Markup Language
– Meant for structuring text for visual presentation
– Designed to describe how a Web browser should arrange text, images and push-buttons on a page
• HTML describes:
– Intra-document structure
– Inter-document structure

    <HTML>
      <HEAD><TITLE>Introduction to XML</TITLE></HEAD>
      <BODY>
        <H1>XML</H1>
        <IMG SRC="info_logo.jpeg" WIDTH="200" HEIGHT="150">
      </BODY>
    </HTML>

Labels in the example: opening tag (e.g. <H1>), text or PCDATA (e.g. XML), closing tag (e.g. </H1>), attribute name (e.g. WIDTH), attribute value (e.g. "200")

From HTML to XML

• Need for data structuring for more general applications than display applications

• Examples:
– Extracting biological data from an NCBI search result page to be used for running a bioinformatics tool
– Extracting financial data from web pages to conduct financial analyses

• Solution: markup language to structure document contents (XML)

XML: brief history

• XML: eXtensible Markup Language

• Subset of SGML

• First version (1.0) formally ratified by the W3C in 1998

• XML 1.1 was released in 2004; XML 1.0 (in its later editions) remains the version most widely used

• XML is becoming the standard for data interchange between applications

XML: brief history

• Purpose: used for structuring the content of documents

• Basis for various application-specific markup languages including:
– GML: Geography Markup Language
– OFX: Open Financial Exchange Markup Language
– SBML: Systems Biology Markup Language
– MusicXML: Music Markup Language
– CML: Chemical Markup Language
– Many more …

XML: brief history

• Some advantages of XML
– XML is extensible
– XML is both human readable and computer readable
– XML is platform and language independent
– XML is a public standard
– XML tool set is large and growing
– XML works well with the Internet
– XML documents can be transformed
– XML is global

XML: brief history

• Some of the disadvantages
– XML is verbose
– XML is not a cure-all for data integration
– XML does not guarantee a unified format
– XML has a steep learning curve

Outline

• Background

• XML Basics

• Document Type Descriptors (DTDs)

• XML schema

• CML

XML structure - Example

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE PROTEIN_SET SYSTEM "protein.dtd">
    <PROTEIN_SET>
      <PROTEIN>
        <ACCESSION>P26954</ACCESSION>
        <ENTRY_NAME>IL3B_MOUSE</ENTRY_NAME>
        <PROTEIN_NAME>Interleukin-3 receptor class II beta chain [Precursor]</PROTEIN_NAME>
        <GENE_NAME>CSF2RB2</GENE_NAME>
        <GENE_NAME>AI2CA</GENE_NAME>
        <GENE_NAME>IL3RB2</GENE_NAME>
        <GENE_NAME>IL3R</GENE_NAME>
        <ORGANISM taxonomy_id="10090">Mus musculus</ORGANISM>
        <COMMENT>FUNCTION: IN MOUSE THERE ARE TWO CLASSES OF HIGH-AFFINITY IL-3 RECEPTORS. ONE CONTAINS THIS IL-3-SPECIFIC BETA CHAIN AND THE OTHER CONTAINS THE BETA CHAIN ALSO SHARED BY HIGH-AFFINITY IL-5 AND GM-CSF RECEPTORS.</COMMENT>
        <COMMENT>SUBUNIT: Heterodimer of an alpha and a beta chain.</COMMENT>
        <KEYWORD>Receptor</KEYWORD>
        <KEYWORD>Glycoprotein</KEYWORD>
        <KEYWORD>Signal</KEYWORD>
      </PROTEIN>
    </PROTEIN_SET>
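To make the structure of the example concrete, here is a minimal sketch of how a program might read it. It assumes Python's standard xml.etree.ElementTree module and an abbreviated copy of the document (the XML declaration, the DOCTYPE line and the COMMENT elements are left out for brevity); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the PROTEIN_SET example (declaration and DOCTYPE omitted).
PROTEIN_XML = """
<PROTEIN_SET>
  <PROTEIN>
    <ACCESSION>P26954</ACCESSION>
    <ENTRY_NAME>IL3B_MOUSE</ENTRY_NAME>
    <GENE_NAME>CSF2RB2</GENE_NAME>
    <GENE_NAME>AI2CA</GENE_NAME>
    <GENE_NAME>IL3RB2</GENE_NAME>
    <GENE_NAME>IL3R</GENE_NAME>
    <ORGANISM taxonomy_id="10090">Mus musculus</ORGANISM>
    <KEYWORD>Receptor</KEYWORD>
  </PROTEIN>
</PROTEIN_SET>
"""

root = ET.fromstring(PROTEIN_XML)          # root element: PROTEIN_SET
protein = root.find("PROTEIN")             # first PROTEIN child

accession = protein.findtext("ACCESSION")  # text of a single child element
gene_names = [g.text for g in protein.findall("GENE_NAME")]  # repeated children
organism = protein.find("ORGANISM")

print(accession)
print(gene_names)
print(organism.get("taxonomy_id"), organism.text)
```

Repeated elements such as GENE_NAME come back as a list, while the taxonomy_id attribute is read separately from the element's text.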

XML structure

• Key components:
– Tags
– Text

XML structure

• Tags:
– Represent element names
– Used in pairs, e.g. <GENE_NAME>…</GENE_NAME>
– Must be properly nested:

• <reference> <author> ... </author> ... </reference> --- good

• <reference> <author> ... </reference>... </author> --- bad
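The nesting rule can be checked mechanically, because an XML parser rejects improperly nested tags. The sketch below assumes Python's standard xml.etree.ElementTree parser; is_well_formed is a hypothetical helper name, and the author name is taken from the reference example used later in the deck.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for mismatched or improperly nested tags, among other errors.
        return False

good = "<reference><author>Johnston, M.</author></reference>"
bad = "<reference><author>Johnston, M.</reference></author>"

print(is_well_formed(good))
print(is_well_formed(bad))
```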

XML structure

• Element names follow the XML name specification
• XML names:
– May include:
• Alphanumeric characters
• Non-English letters, e.g. Ω
• Ideograms
• Underscore (_), hyphen (-), period (.), colon (:, reserved for namespaces)
– May not include:
• White space, quotation marks, apostrophes, dollar signs, percent symbols, carets, and semicolons
– May only start with:
• Letters
• Ideograms
• Underscore character

XML structure

• Text:
– XML has only one "basic" type: text
– Text is bounded by tags
– E.g.:

    <PROTEIN_NAME> Interleukin-3 receptor class II beta chain [Precursor] </PROTEIN_NAME>
    <Seq_length> 2650 </Seq_length>    (2650 is still text)

– XML text is called PCDATA (for parsed character data)
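Because element content is always text, any numeric interpretation is up to the application. A small sketch, assuming Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<Seq_length> 2650 </Seq_length>")

raw = elem.text                  # the parser hands back a string, spaces included
seq_length = int(raw.strip())    # the application decides to treat it as an integer

print(repr(raw))
print(seq_length + 1)
```

The conversion step is exactly what a typed schema language (introduced later in the deck) lets a validator check in advance.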

XML structure
• Tag nesting is used for expressing various data structures, including:
– Tuple (record):

    <reference>
      <author> Johnston, M. </author>
      <title> The nucleotide sequence of Saccharomyces cerevisiae chromosome XII </title>
      <publication_year> 1997 </publication_year>
    </reference>

– List:

    <protein_set>
      <protein> … </protein>
      <protein> … </protein>
      ...
    </protein_set>

XML Terminology - Elements

• Element: segment of an XML document between an opening and a corresponding closing tag

    <reference>
      <author> Johnston, M. </author>
      <author> Hillier, L. </author>
      <title> The nucleotide sequence of Saccharomyces cerevisiae chromosome XII </title>
      <publication_year> 1997 </publication_year>
    </reference>

element

element, a sub-element of

XML Terminology - Elements
• Mixed content: an element may contain a mixture of sub-elements and PCDATA
• E.g.:

    <book>
      <title>My First XML</title>
      <prod id="33-657" media="paper"></prod>
      <chapter>
        Introduction to XML
        <para>What is HTML</para>
        <para>What is XML</para>
      </chapter>
      …
    </book>

XML Terminology - Attributes
• An (opening) tag may contain attributes
• Typically used to describe the content of an element
• Syntax: attribute_name="value1 value2 …"
• Attribute names follow the XML naming rules
• Example 1: <ORGANISM taxonomy_id="10090">Mus musculus</ORGANISM>
• Example 2: <file type="gif">computer.gif</file>

XML Terminology - Attributes

• Common use for attributes is to express dimension or type

    <picture>
      <height dim="cm"> 2400 </height>
      <width dim="in"> 96 </width>
      <data encoding="gif" compression="zip"> M05-.+C$@02!G96YE&lt;FEC ... </data>
    </picture>
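As a rough illustration of why a dimension attribute is useful, the sketch below reads the dim attribute and normalises both lengths to centimetres. It assumes Python's standard xml.etree.ElementTree module; the conversion table and the helper name length_in_cm are assumptions made for this example, not part of the slide.

```python
import xml.etree.ElementTree as ET

PICTURE_XML = """
<picture>
  <height dim="cm"> 2400 </height>
  <width dim="in"> 96 </width>
</picture>
"""

# Conversion factors to centimetres (assumed for this sketch).
CM_PER_UNIT = {"cm": 1.0, "in": 2.54}

def length_in_cm(elem):
    """Combine the element text (the number) with its dim attribute (the unit)."""
    unit = elem.get("dim")
    return float(elem.text) * CM_PER_UNIT[unit]

picture = ET.fromstring(PICTURE_XML)
height_cm = length_in_cm(picture.find("height"))
width_cm = length_in_cm(picture.find("width"))
print(height_cm, width_cm)
```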

XML Terminology - Using IDs

• Special attribute

• Used to uniquely identify elements

• Can be used by other elements for referencing purposes

• Value of an ID attribute is unique

• Must be declared of type ID in the DTD

XML Terminology - Using IDs

    <family>
      <person perId="jane" mother="mary" father="john">
        <name> Jane Doe </name>
      </person>
      <person perId="john" children="jane jack">
        <name> John Doe </name>
      </person>
      <person perId="mary" children="jane jack">
        <name> Mary Doe </name>
      </person>
      <person perId="jack" mother="mary" father="john">
        <name> Jack Doe </name>
      </person>
    </family>

A Complete XML Document

An XML document includes:

• A declaration part (optional in XML 1.0, but recommended), e.g. <?xml version="1.0" encoding="ISO-8859-1" ?>

• A root element, e.g. <PROTEIN_SET> … </PROTEIN_SET>

Well-formed documents

• XML documents must be well-formed:
– Presence of one root element
– Proper XML naming
– Proper matching of tags
– Proper nesting of tags
– Attribute values must be quoted
– The name of an attribute is unique within an element
– Comments and processing instructions may not appear inside tags
– No un-escaped < or & may appear in the character data of an element or in an attribute value
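The last rule is the one most often broken when XML is assembled from strings. A minimal sketch of escaping, assuming Python's standard xml.sax.saxutils helpers (escape for character data, quoteattr for attribute values); the sample text and the note attribute are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

raw_text = "IL-3 < IL-5 & GM-CSF"       # contains both forbidden characters
raw_attr = "a & b"

escaped_text = escape(raw_text)          # < becomes &lt; and & becomes &amp;
quoted_attr = quoteattr(raw_attr)        # escapes the value and adds the quotes

doc = f"<COMMENT note={quoted_attr}>{escaped_text}</COMMENT>"
elem = ET.fromstring(doc)                # parses, because everything is escaped

print(doc)
print(elem.text, "|", elem.get("note"))
```

After parsing, the application sees the original characters again; the escaping exists only in the serialized document.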

Outline

• XML Basics

• Document Type Descriptors (DTDs)

• XML schema

• CML

Document Type Descriptors
• Document Type Descriptors (DTDs; the XML specification calls them Document Type Definitions) impose structure on an XML document
• The DTD is a syntactic specification
• General syntax:

    <!DOCTYPE DTD-name [
      <!ELEMENT …>
      <!ELEMENT …>
      …
      <!ATTLIST …>
      <!ATTLIST …>
      …
    ]>

• Note: DTD-name corresponds to the root element of XML documents that use the DTD for validation

<person>

<name> MacNiel, John </name>

<greet> Dr. John MacNiel </greet>

<addr>1234 Huron Street </addr>

<addr> Rome, OH 98765 </addr>

<tel> (321) 786 2543 </tel>

<fax> (321) 786 2543 </fax>

<tel> (321) 786 2543 </tel>

<email> [email protected] </email>

</person>

Example: An Address Book

Constraints annotated on the example:
– name: exactly one name
– greet: at most one greeting
– addr: as many address lines as needed (in order)
– tel, fax: mixed telephones and faxes
– email: as many as needed

Specifying the structure

name            to specify a name element

greet?          to specify an optional (0 or 1) greet element

name, greet?    to specify a name followed by an optional greet

Specifying the structure (cont)

addr* to specify 0 or more address lines

tel | fax a tel or a fax element

(tel | fax)* 0 or more repeats of tel or fax

email* 0 or more email elements

Specifying the structure (cont)

So the whole structure of a person entry is specified by

name, greet?, addr*, (tel | fax)*, email*

This is known as a regular expression
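Since the content model is a regular expression over child element names, it can be imitated with an ordinary regex engine. The sketch below is only an illustration of the idea, not how a validating parser is implemented: it assumes Python's re and xml.etree.ElementTree modules, writes each child tag followed by a space, and matches the resulting string against a hand-translated pattern. PERSON_MODEL and matches_person_model are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# name, greet?, addr*, (tel | fax)*, email*  translated by hand:
# every tag name is followed by one space so that names cannot run together.
PERSON_MODEL = re.compile(r"name (greet )?(addr )*((tel|fax) )*(email )*")

def matches_person_model(person) -> bool:
    """Check the sequence of child tags of a <person> element against the model."""
    child_tags = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_tags) is not None

# Empty elements are enough here: only the order of the tags matters.
valid = ET.fromstring(
    "<person><name/><greet/><addr/><addr/><tel/><fax/><tel/><email/></person>"
)
invalid = ET.fromstring("<person><name/><email/><tel/></person>")  # tel after email

print(matches_person_model(valid))
print(matches_person_model(invalid))
```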

A DTD for the address book

    <!DOCTYPE addressbook [
      <!ELEMENT addressbook (person*)>
      <!ELEMENT person (name, greet?, addr*, (fax | tel)*, email*)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT greet (#PCDATA)>
      <!ELEMENT addr (#PCDATA)>
      <!ELEMENT tel (#PCDATA)>
      <!ELEMENT fax (#PCDATA)>
      <!ELEMENT email (#PCDATA)>
    ]>

Summary of XML regular expressions for element description

• e The tag e occurs

• e1,e2 The expression e1 followed by e2

• e* 0 or more occurrences of e

• e? Optional -- 0 or 1 occurrences

• e+ 1 or more occurrences

• e1 | e2 either e1 or e2

• (e) grouping

Specifying attributes in the DTD

• Example:

    <!ELEMENT height (#PCDATA)>
    <!ATTLIST height
      dimension CDATA #REQUIRED
      accuracy  CDATA #IMPLIED>

• The dimension attribute is required; the accuracy attribute is optional

• CDATA is the “type” of the attribute -- it means string

Specifying attributes in the DTD

• General syntax:

    <!ATTLIST element_name
      attribute_name attribute_type default_value>

• Attribute types include:
– CDATA, enumeration (written as a list of values, e.g. (a | b)), ID, IDREF, IDREFS, NOTATION, NMTOKEN, NMTOKENS, ENTITY, ENTITIES
• Attribute default values include:
– #IMPLIED, #REQUIRED, #FIXED, literal

Specifying ID and IDREF attributes

    <!DOCTYPE family [
      <!ELEMENT family (person)*>
      <!ELEMENT person (name)>
      <!ELEMENT name (#PCDATA)>
      <!ATTLIST person
        perId    ID     #REQUIRED
        mother   IDREF  #IMPLIED
        father   IDREF  #IMPLIED
        children IDREFS #IMPLIED>
    ]>

Some conforming data

    <family>
      <person perId="jane" mother="mary" father="john">
        <name> Jane Doe </name>
      </person>
      <person perId="john" children="jane jack">
        <name> John Doe </name>
      </person>
      <person perId="mary" children="jane jack">
        <name> Mary Doe </name>
      </person>
      <person perId="jack" mother="mary" father="john">
        <name> Jack Doe </name>
      </person>
    </family>

Consistency of ID and IDREF attribute values

• If an attribute is declared as ID
– the associated values must all be distinct (no confusion)
• If an attribute is declared as IDREF
– the associated value must exist as the value of some ID attribute
• Similarly for all the values of an IDREFS attribute
• ID and IDREF attributes are not typed: an IDREF may refer to an element of any type
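These consistency rules are normally enforced by a validating parser. As a hand-written illustration of what is being checked, the sketch below applies them to the family example with Python's standard xml.etree.ElementTree module. The attribute roles (perId as ID, mother and father as IDREF, children as IDREFS) are hard-coded from the DTD above, and check_ids is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

FAMILY_XML = """<family>
  <person perId="jane" mother="mary" father="john"><name>Jane Doe</name></person>
  <person perId="john" children="jane jack"><name>John Doe</name></person>
  <person perId="mary" children="jane jack"><name>Mary Doe</name></person>
  <person perId="jack" mother="mary" father="john"><name>Jack Doe</name></person>
</family>"""

def check_ids(root):
    """Return a list of problems: duplicate IDs and references to missing IDs."""
    problems = []
    seen = set()
    for person in root.iter("person"):
        per_id = person.get("perId")
        if per_id in seen:
            problems.append(f"duplicate ID: {per_id}")
        seen.add(per_id)
    for person in root.iter("person"):
        refs = []
        for attr in ("mother", "father"):              # IDREF: a single reference
            if person.get(attr) is not None:
                refs.append(person.get(attr))
        refs.extend(person.get("children", "").split())  # IDREFS: space-separated
        for ref in refs:
            if ref not in seen:
                problems.append(f"dangling reference: {ref}")
    return problems

ok_problems = check_ids(ET.fromstring(FAMILY_XML))
# Break one reference on purpose: "jill" is not the perId of any person.
broken_xml = FAMILY_XML.replace('children="jane jack"', 'children="jane jill"', 1)
broken_problems = check_ids(ET.fromstring(broken_xml))
print(ok_problems, broken_problems)
```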

Connecting the document with its DTD

• In line:

    <?xml version="1.0"?>
    <!DOCTYPE PROTEIN_SET [<!ELEMENT ...> … ]>
    <PROTEIN_SET> ... </PROTEIN_SET>

• Another file:

    <!DOCTYPE PROTEIN_SET SYSTEM "protein.dtd">
    <PROTEIN_SET> ... </PROTEIN_SET>

• A URL:

    <!DOCTYPE PROTEIN_SET SYSTEM "http://.../protein.dtd">
    <PROTEIN_SET> ... </PROTEIN_SET>

Valid Documents

• XML documents are checked for validity by a validating parser against a schema such as a DTD

• Validity means that the document conforms to the DTD: it conforms to the regular expression grammar, the types of its attributes are correct, and the constraints on ID and IDREF(S) are satisfied

Outline

• Background

• XML Basics

• Document Type Descriptors (DTDs)

• XML schema

• CML

XML schema

• W3C recommendation

• Successor of DTDs

• Used to validate XML documents

• Specification lengthy and rather complex

• Proposed to address the pitfalls of DTDs

XML schema

• Features
– Data typing: compared to DTDs, where element and attribute values are strings
– Schema files are XML files
– Support for object-oriented practices
– Additional validation rules (e.g. pattern of an element's content, minimum/maximum values for attributes)
– Full support of namespaces

XML schema

• XML schema for scientific applications
– Used in several areas: bioinformatics, chemical informatics, laboratory informatics, etc.
– Examples include:
• AGAVE
• CML
• PEML
• PSI-MI
• SBML
• UniProt XML
• XFF

Introductory Example

• Example1: Representing protein data

XML Schema constructs

• The <schema> element
– Root of an XML schema document
– E.g. <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
– Prefix xs references the namespace of XML schemas
– Used to reference schema constructs such as xs:annotation, xs:complexType

XML Schema constructs

• Schema Documentation
– Schema element xs:annotation is used to document schema documents, providing information about the document and detailed information about its elements and attributes
– Two types of documentation:
• Human readable: using element xs:documentation
• Machine readable: using element xs:appinfo
– E.g.

    <xs:annotation>
      <xs:documentation>Sample XML Schema for representing Protein data.</xs:documentation>
    </xs:annotation>

XML Schema constructs

• Simple types vs. complex types
– A schema element is either of simpleType or complexType
– An element is of simple type if it does not contain any attributes or child elements
– An element is of complex type if it includes children, attributes or both
– See example1

XML Schema constructs

• Global elements vs. local elements
– Global elements are direct children of the root schema element
– E.g. protein_set in example1 is a global element
– Local elements are not direct children of the schema element
– Global elements can be referenced within the document while local elements cannot
– E.g. <xs:element ref="organism"/>: the protein element contains the organism element

XML Schema constructs

• Creating instance documents
– An instance document is an XML document that adheres to an XML grammar defined by a DTD, an XML schema, etc.
– Example: protein_set instance document

    <protein_set xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:noNamespaceSchemaLocation="protein.xsd">

– Declaring the instance namespace enables the use of its constructs, such as the noNamespaceSchemaLocation attribute
– Attribute noNamespaceSchemaLocation specifies that the XML schema has no declared namespace, and its value corresponds to the location of the XML schema
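A program that wants to find the schema for an instance document has to read this namespaced attribute. A minimal sketch, assuming Python's standard xml.etree.ElementTree module, which spells a namespaced name as {namespace-URI}local-name; the instance below is closed immediately so that it parses on its own.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

INSTANCE = """<protein_set
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="protein.xsd">
</protein_set>"""

root = ET.fromstring(INSTANCE)

# The xsi: prefix is resolved to its URI; the prefix itself is not part of the name.
schema_location = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
print(schema_location)
```

ElementTree only reports where the schema is said to live; it does not load the schema or validate against it.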

XML Schema constructs

• Working with simple types
– A simple type contains a single value such as a string, integer, or NMTOKEN
– XML schema provides 44 built-in schema types including:
• string, byte, decimal, float, boolean, time, QName, anyURI
– Built-in data types are organized in a hierarchy rooted at the anyType data type
– Type declaration: using the type attribute when defining an attribute or an element

XML Schema constructs

• Working with facets to derive new data types
– Users may derive new data types
– Two mechanisms for derivation:
• By extension
• By restriction
– Facets are provided to derive new data types by restriction
– XML schema supports 12 facets including length, minLength, maxLength, pattern, and enumeration

XML Schema constructs

• Declaring new data types using facets
– Example:

    <xs:simpleType name="accessionType">
      <xs:restriction base="xs:string">
        <xs:minLength value="4"/>
        <xs:maxLength value="8"/>
      </xs:restriction>
    </xs:simpleType>

• Are these elements of type accessionType valid or not?

    <accession>P23</accession>
    <accession>P12345678</accession>
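One way to reason about the question is to apply the two length facets by hand. The sketch below is plain Python rather than a schema validator; is_valid_accession is a hypothetical helper whose defaults mirror the minLength and maxLength values of accessionType.

```python
def is_valid_accession(value: str, min_length: int = 4, max_length: int = 8) -> bool:
    """Apply the minLength/maxLength facets of accessionType to a string value."""
    return min_length <= len(value) <= max_length

for candidate in ("P23", "P12345678", "P26954"):
    print(candidate, len(candidate), is_valid_accession(candidate))
```

P23 has 3 characters and P12345678 has 9, so both fall outside the 4 to 8 range; P26954, the accession from the earlier protein example, has 6 and passes.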

XML Schema constructs

• Example of facets: pattern facet
– Used to restrict string values to match a regular expression pattern
– E.g. accession numbers must start with the letter P

    <xs:simpleType name="accessionType">
      <xs:restriction base="xs:string">
        <xs:minLength value="4"/>
        <xs:maxLength value="8"/>
        <xs:pattern value="P.*"/>
      </xs:restriction>
    </xs:simpleType>

• E.g. A DNA sequence should only include A, C, G or T characters
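A pattern facet must match the whole value, which corresponds to a full match in an ordinary regex engine. The sketch below uses Python's re module to imitate the two patterns; the DNA pattern [ACGT]+ is one plausible way to write the rule stated on the slide (it is an assumption, not taken from the deck), and Python's regex dialect is close to, but not identical with, the XML Schema one.

```python
import re

ACCESSION_PATTERN = re.compile(r"P.*")     # pattern value from accessionType
DNA_PATTERN = re.compile(r"[ACGT]+")       # assumed pattern for a DNA sequence

def matches_facet(pattern, value: str) -> bool:
    """XML Schema patterns are implicitly anchored, hence fullmatch."""
    return pattern.fullmatch(value) is not None

print(matches_facet(ACCESSION_PATTERN, "P26954"))
print(matches_facet(ACCESSION_PATTERN, "Q26954"))
print(matches_facet(DNA_PATTERN, "ACGTTGCA"))
print(matches_facet(DNA_PATTERN, "ACGU"))
```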

XML Schema constructs

• Example of facets: enumeration facet
– Used to restrict the list of possible values
– E.g. restrict the list of biological databases to be referenced

    <xs:simpleType name="databaseType">
      <xs:restriction base="xs:string">
        <xs:enumeration value="EMBL"/>
        <xs:enumeration value="PIR"/>
        <xs:enumeration value="MGD"/>
        <xs:enumeration value="InterPro"/>
        <xs:enumeration value="Pfam"/>
        <xs:enumeration value="SMART"/>
        <xs:enumeration value="PROSITE"/>
      </xs:restriction>
    </xs:simpleType>

XML Schema constructs

• Working with complex types
– Used to define elements that include attributes and/or children
– Elements including attributes only are defined as complex types with simple content
– E.g.

    <xs:element name="organism">
      <xs:complexType>
        <xs:simpleContent>
          <xs:extension base="xs:string">
            <xs:attribute name="taxonomy_id" type="xs:integer" use="required"/>
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
    </xs:element>

XML Schema constructs

• Working with complex types
– Elements including children are defined as complex types with complex content
– E.g.

    <xs:element name="protein_set">
      <xs:complexType>
        <xs:complexContent>
          <xs:restriction base="xs:anyType">
            <xs:sequence>
              <xs:element ref="protein" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:restriction>
        </xs:complexContent>
      </xs:complexType>
    </xs:element>

XML Schema constructs

• Working with complex types
– Abbreviation: elements defined as complex types with complex content restriction of anyType may be abbreviated
– E.g.

    <xs:element name="protein_set">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="protein" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

XML Schema constructs
• Occurrence constraints on elements
– Used to set how many times a child element may appear
– Using the minOccurs and maxOccurs attributes
– Values range from "0" to "unbounded" ("unbounded" applies to maxOccurs only)
– If not specified, default values are set to "1"
– E.g.

    <xs:element name="protein">
      <xs:complexType>
        <xs:complexContent>
          <xs:restriction base="xs:anyType">
            <xs:sequence>
              <xs:element name="accession" type="xs:string"/>
              <xs:element name="entry_name" type="xs:string"/>
              <xs:element name="protein_name" type="xs:string"/>
              <xs:element name="gene_name" type="xs:string" maxOccurs="unbounded"/>
              <xs:element ref="organism"/>
              <xs:element ref="cross_reference" minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="comment" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="keyword" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:restriction>
        </xs:complexContent>
      </xs:complexType>
    </xs:element>
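To see what the occurrence constraints mean for an instance, the sketch below counts the children of a protein element and compares the counts with the minOccurs/maxOccurs values from the example (defaults of 1 filled in by hand). It assumes Python's standard xml.etree.ElementTree module, checks counts only (not the order required by xs:sequence), and uses the hypothetical names OCCURS and occurrence_errors.

```python
import xml.etree.ElementTree as ET

# (minOccurs, maxOccurs) per child of <protein>; None stands for "unbounded".
OCCURS = {
    "accession": (1, 1),
    "entry_name": (1, 1),
    "protein_name": (1, 1),
    "gene_name": (1, None),
    "organism": (1, 1),
    "cross_reference": (0, None),
    "comment": (0, None),
    "keyword": (0, None),
}

def occurrence_errors(protein):
    """List every child element whose count violates its occurrence constraint."""
    errors = []
    for tag, (lowest, highest) in OCCURS.items():
        count = len(protein.findall(tag))
        if count < lowest or (highest is not None and count > highest):
            errors.append(f"{tag}: found {count}")
    return errors

valid = ET.fromstring(
    "<protein><accession>P26954</accession><entry_name>IL3B_MOUSE</entry_name>"
    "<protein_name>Interleukin-3 receptor class II beta chain</protein_name>"
    "<gene_name>CSF2RB2</gene_name><gene_name>IL3R</gene_name>"
    "<organism taxonomy_id='10090'>Mus musculus</organism>"
    "<keyword>Receptor</keyword></protein>"
)
# Same record without any gene_name, which needs at least one occurrence.
invalid = ET.fromstring(
    "<protein><accession>P26954</accession><entry_name>IL3B_MOUSE</entry_name>"
    "<protein_name>Interleukin-3 receptor class II beta chain</protein_name>"
    "<organism taxonomy_id='10090'>Mus musculus</organism></protein>"
)

print(occurrence_errors(valid))
print(occurrence_errors(invalid))
```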

XML Schema constructs
• Compositors: sequence, choice and all
– Used to specify how elements are organized
– Using sequence, e.g.

    <xs:element name="PubmedArticle">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="MedlineID" type="xs:long"/>
          <xs:element name="PMID" type="xs:long"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

XML Schema constructs
• Compositors: sequence, choice and all
– Using choice, e.g.

    <xs:element name="PubmedArticle">
      <xs:complexType>
        <xs:choice>
          <xs:element name="MedlineID" type="xs:long"/>
          <xs:element name="PMID" type="xs:long"/>
        </xs:choice>
      </xs:complexType>
    </xs:element>

XML Schema constructs
• Compositors: sequence, choice and all
– Sequence and choice constructs can be combined to organize sub-elements, e.g.

    <xs:element name="PubmedArticle">
      <xs:complexType>
        <xs:sequence>
          <xs:choice>
            <xs:element name="MedlineID" type="xs:long"/>
            <xs:element name="PMID" type="xs:long"/>
          </xs:choice>
          <xs:element name="ArticleTitle" type="xs:string"/>
          <xs:element name="AbstractText" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
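Read as a rule, the combined model says: first either a MedlineID or a PMID, then an ArticleTitle, then an AbstractText. The sketch below spells that rule out by hand with Python's standard xml.etree.ElementTree module; matches_pubmed_article is a hypothetical helper, the element contents are placeholders, and element types (xs:long, xs:string) are not checked.

```python
import xml.etree.ElementTree as ET

def matches_pubmed_article(article) -> bool:
    """Check the child order: (MedlineID | PMID), ArticleTitle, AbstractText."""
    tags = [child.tag for child in article]
    return (
        len(tags) == 3
        and tags[0] in {"MedlineID", "PMID"}                 # xs:choice
        and tags[1:] == ["ArticleTitle", "AbstractText"]     # rest of xs:sequence
    )

with_pmid = ET.fromstring(
    "<PubmedArticle><PMID>1</PMID><ArticleTitle>t</ArticleTitle>"
    "<AbstractText>a</AbstractText></PubmedArticle>"
)
with_both = ET.fromstring(
    "<PubmedArticle><MedlineID>1</MedlineID><PMID>1</PMID>"
    "<ArticleTitle>t</ArticleTitle><AbstractText>a</AbstractText></PubmedArticle>"
)

print(matches_pubmed_article(with_pmid))
print(matches_pubmed_article(with_both))
```

Supplying both identifiers fails because xs:choice admits exactly one of its alternatives when minOccurs and maxOccurs are left at their default of 1.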

XML Schema constructs

• Using all:
– Used to indicate that a group of elements can appear in any order
– But each element should be either optional or appear only once
– E.g.

    <xs:element name="PubmedArticle">
      <xs:complexType>
        <xs:all>
          <xs:element name="MedlineID" type="xs:long"/>
          <xs:element name="PMID" type="xs:long"/>
        </xs:all>
      </xs:complexType>
    </xs:element>

XML Schema constructs
• Defining named complex types
– Named complex types can be reused through the type attribute of an element declaration
– Defined as direct children of the schema element, with the name attribute specified
– E.g.

    <xs:complexType name="organismType">
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="taxonomy_id" type="xs:integer" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>

Putting it together

• Representing protein data: an updated version using basic XML schema constructs such as named data types, etc.

Outline

• Background

• XML Basics

• Document Type Descriptors (DTDs)

• XML schema

• CML

CML
• What is CML?
– CML (Chemical Markup Language)
– CML is an XML-based language for representing chemical data
– More precisely, CML is the application of XML for the representation of molecules and molecular representation, crystallography and spectra
– CML evolved in the chemical industry to meet the need for exchanging molecular and other information for publishing Web-based documents for patent applications, standards committees, and other organizations
– CML does not cover all chemistry but focuses on molecules (and similar structures representable by a formula)
– CML represents molecules, atoms, and bonds
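As a rough illustration of how molecules, atoms and bonds look in CML, the sketch below parses a small water molecule with Python's standard xml.etree.ElementTree module. The element and attribute names (molecule, atomArray, atom, elementType, bondArray, bond, atomRefs2, order) and the namespace URI are written from common CML usage and are assumptions here; they should be checked against the CML schema before being relied on.

```python
import xml.etree.ElementTree as ET

# Hand-written example document; vocabulary and namespace are assumptions.
WATER_CML = """<molecule id="water" xmlns="http://www.xml-cml.org/schema">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""

NS = {"cml": "http://www.xml-cml.org/schema"}
root = ET.fromstring(WATER_CML)

# Map each atom id to its chemical element symbol.
elements = {
    atom.get("id"): atom.get("elementType")
    for atom in root.findall("cml:atomArray/cml:atom", NS)
}
# Each bond refers to two atoms by id, much like the IDREFS idea seen earlier.
bonds = [
    tuple(bond.get("atomRefs2").split())
    for bond in root.findall("cml:bondArray/cml:bond", NS)
]

print(elements)
print(bonds)
```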

CML

• History
– Early discussion began in 1994
– Version 1.0 formally published in 1999
– CML current version is 2.0 published in
– CML has a CML schema specification

CML

• CML component-based approach:
– STM-ML. Generic support for scientific information
– CMLCore. Molecular and related information
– CMLReact. Chemical reactions
– CMLSpec. Spectra
– CMLComp. Computational chemistry
– CMLQuery. General query language for chemistry

CML

• CML modular architecture:
– Allows it to be used as part of an application
– E.g.
• "substance" in a GeneOntology
• Description or chemical support for a New Drug Application
– Intended to interoperate (not compete!) with other chemical informatics projects such as JCAMP-DX, SpectroML, etc.

CML
• CML as part of a larger chemical informatics community
– Provides support only for the main components (e.g. core CML, CMLReact, etc.)
– Other chemistry-related fields are not supported
– E.g.
• No longer any support for macromolecular structures and sequences directly in CML
• New strategy: working with the biomolecular community to use CML for representing molecules precisely
• Examples of scientific languages likely to use CML
– Materials Markup Language (MatML), NIST
– Gene Ontology (GO), EBI, etc.
– CellML (for describing cells, including reactions)
– e-CTD (Common Technical Dossier for e-submission of new drug applications (NDAs)), FDA, EMEA

References
• XML in a Nutshell, O’Reilly, 2004
• XML for Bioinformatics, by Ethan Cerami, Springer, 2005
• Extensible Markup Language (XML): http://www.w3.org/XML
• Document Type Descriptors: http://www.w3schools.com/dtd//
• XML Schema: http://www.w3schools.com/schema/default.asp
• CML official sites: http://www.xml-cml.org/, http://cml.sourceforge.net/main.html, http://www.xml-cml.org/information/disciplines/index.html
• Gemma L. Holliday, Peter Murray-Rust, and Henry S. Rzepa. Chemical Markup, XML, and the World Wide Web. 6. CMLReact, an XML Vocabulary for Chemical Reactions. Web release date: 30-Nov-2005. DOI: 10.1021/ci0502698
• Chemical markup language zone: http://www.adobe.com/svg/demos/devtrack/chemical.htm