xtech 2008 presentation; "representing, indexing and mining science with xml and rdf: golem and...

Representing, indexing and mining science with XML and RDF: Golem and CrystalEye Andrew Walkingshaw Unilever Centre for Molecular Science Informatics University Chemical Laboratory University of Cambridge 9th May, 2008 http://creativecommons.org/licenses/by-nc-sa/2.0/uk/


Representing, indexing and mining science with XML and RDF: Golem and CrystalEye

Andrew Walkingshaw
Unilever Centre for Molecular Science Informatics
University Chemical Laboratory
University of Cambridge

9th May, 2008

http://creativecommons.org/licenses/by-nc-sa/2.0/uk/

The Unilever Centre: Home, sweet office

Photo by Peter Corbett (cc-by): http://www.flickr.com/photos/ptc24/2175475233/

Experimentalists: Science as we knew it

Photo by spike55151 (cc-by-nc-sa): http://www.flickr.com/photos/spike55151/444102323/

Theorists: Deforestation: their fault

Photo by jaevus (cc-by-nc): http://www.flickr.com/photos/jaevus/429209470/

Computationalists: Practical quantum mechanics

Photo by vabellon (cc-by-nc-sa): http://www.flickr.com/photos/vabellon/243812958/

Scientific informatics: Tying it all together

Photo by Pulpolux !!! (cc-by-nc): http://www.flickr.com/photos/pulpolux/128967513/

The seven stages of visualization

• acquire data

• parse data

• filter data

• mine data

• represent data

• refine representation

• add interactivity

“Visualizing Data” by Ben Fry: cover © O’Reilly Media

Our tools

- Web standards: HTTP and REST; HTML, CSS, JavaScript

- Markup: XML and CML; XPath and XSLT; CIF and other legacy formats

- Semantic web: RDF and SPARQL

- Machine learning: OSCAR 3, non-negative matrix factorization, etc.

Photo by neilt (cc-by-nc): http://www.flickr.com/photos/neilt/2517652/

The information business

"Ultimately, Reuters' news is the raw material for analysis and application by investors and downstream news organizations. Adding metadata to make that job of analysis easier for those building additional value on top of your product is a really interesting way to view the publishing opportunity. If you don't think of what you produce as the 'final product' but rather as a step in an information pipeline, what do you do differently to add value for downstream consumers? In Reuters' case, Devin thinks you add hooks to make your information more programmable. This is a really important insight, and one I'm going to be chewing on for some time."

— Tim O'Reilly (http://radar.oreilly.com/archives/2008/02/reuters-ceo-sees-semantic-web.html)

Photo by Thomas Hawk (cc-by-nc): http://www.flickr.com/photos/thomashawk/205664675/

Simulations

• mostly written in Fortran (77!)

• large legacy codebases

• fast, but really retro I/O

• result: unstructured output in US-ASCII

• So we need to speak their language, literally, if we want to change things.

Photo by Adrian DeeJay (cc-by-nc-sa): http://www.flickr.com/photos/adrian-deejay/442309109/

FoX

• A validating, W3C-test-suite-passing XML parser/writer in pure Fortran 95.

• Written by Dr Toby White, Earth Sciences, Cambridge

• Includes libraries which make it easy to write either KML (geodata) or a subset of CML (chemistry).

• BSD licensed.

• http://www.uszla.me.uk/FoX/

Photo by johnmuk (cc-by-nc-sa): http://www.flickr.com/photos/jm999uk/57312427/

CML “Microformats”

• Unlike http://microformats.org/:

• CML files (as written by FoX) are made up of components with defined mathematical semantics

• i.e. you don’t know what they mean, but you do know how to read and manipulate them.

• Consistent serializations of the datatypes we’re exposing.

• Local, not global, semantics.

Photo by Bret Arnett (cc-by-nc-sa): http://www.flickr.com/photos/bretarnett/136222945/

Adding labels

• We can attach higher-level semantics to these objects through CML’s @dictRef attribute:

<property dictRef="castep:Etot">
  <scalar units="si:joule">123.1</scalar>
</property>

• So this is how we attach meanings to the elements our documents are made of.

• How do we make use of this?

http://www.lexical.org.uk/science/golem/

Photo by Leo Reynolds (cc-by-nc-sa): http://www.flickr.com/photos/lwr/28239165/

the Golem language

• XML language annotating CML dictionaries

• Golem annotations supply the information needed to make concepts processable:

• locations (XPath)

• transformations (XSLT)

• bounds and type

• relationships between concepts

Photo by cvrt (cc-by-sa): http://flickr.com/photos/covert/697928394/

An example dictionary entry (from CASTEP)

<entry id="Etot" term="Total energy">
  <annotation />
  <definition>Total energy of the calculation.</definition>
  <description>
    <h:div class="dictDescription">
      It should be noted that total energies in DFT calculations are relative rather than absolute, and thus as such have no intrinsic meaning in and of themselves; rather, they only have meaning when considered in comparison to atomically-equivalent systems calculated at the same precision (plane-wave cutoff, k-grid, etcetera).
    </h:div>
  </description>
  <metadataList>
    <metadata name="dc:author" content="golem-kiln" />
  </metadataList>
  <golem:xpath>
    /cml:cml/cml:propertyList[@id="finalProperties"]/cml:property[@dictRef="castep:Etot"]
  </golem:xpath>
  <golem:template call="scalar" role="getvalue" binding="pygolem_serialization" />
  <golem:implements>value</golem:implements>
  <golem:implements>absolute</golem:implements>
  <golem:childOf>id_finalProperties</golem:childOf>
</entry>
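A minimal sketch of how the XPath stored in a dictionary entry could be used to pull a value out of a CML document. This is not pygolem: it uses only Python's standard library, and the sample document, the helper names `find_concept` and `scalar_value`, and the CML namespace URI are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI; the slides only show the "cml:" prefix.
NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical CML fragment shaped to match the XPath in the dictionary entry.
CML_DOC = """<cml xmlns="http://www.xml-cml.org/schema">
  <propertyList id="finalProperties">
    <property dictRef="castep:Etot">
      <scalar units="castepunits:ev">200</scalar>
    </property>
  </propertyList>
</cml>"""

# The XPath from the golem:xpath element of the Etot entry.
ETOT_XPATH = ('/cml:cml/cml:propertyList[@id="finalProperties"]'
              '/cml:property[@dictRef="castep:Etot"]')


def find_concept(root, absolute_xpath):
    """Return the elements matching an absolute XPath whose first step is the root.

    ElementTree only evaluates relative paths, so the first step is checked
    against the root element by hand and the rest is evaluated from there.
    """
    first_step, _, rest = absolute_xpath.lstrip("/").partition("/")
    prefix, _, local = first_step.partition(":")
    if root.tag != "{%s}%s" % (NS[prefix], local):
        return []
    return root.findall("./" + rest, NS)


def scalar_value(property_element):
    """Read a property's scalar child as a (value, units) pair."""
    scalar = property_element.find("cml:scalar", NS)
    return float(scalar.text), scalar.get("units")


root = ET.fromstring(CML_DOC)
matches = find_concept(root, ETOT_XPATH)
value, units = scalar_value(matches[0])
print(value, units)
```

The path splitting here is naive (it assumes no "/" inside the first step), which is enough for absolutely-positioned entries like this one.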

Information flow

Fortran (FoX):

call cmlAddProperty(xmlfile, 200, &
     dictRef="castep:Etot", &
     units="castepunits:ev")

CML:

<cml:property dictRef="castep:Etot">
  <cml:scalar units="castepunits:ev">200</cml:scalar>
</cml:property>

XSLT:

[200, "castepunits:ev"]

Golem (Python):

total_energy = \
    Etot.findin(document)[0].getvalue()

>>> total_energy
200
>>> total_energy.unit
u'castepunits:ev'

Photo by eqqman (cc-by-nc): http://www.flickr.com/photos/eqqman/223682514/

Pathfinding

• Some concepts can occur in different contexts – different places in different documents.

<a>
  <b>
    <c dictRef="target" />
  </b>
</a>

/a/b/c[@dictRef="target"]

<a>
  <d>
    <b>
      <c dictRef="target" />
    </b>
  </d>
</a>

/a/d/b/c[@dictRef="target"]

<a>
  <d />
  <b dictRef="one" />
  <b>
    <c dictRef="target" />
  </b>
</a>

/a/b/c[@dictRef="target"]

• Here there are two potential XPaths for our target node in the corpus. How do we find it?

• We can’t use either directly. What we can use instead is the longest common subset of the paths in our corpus: the trailing steps they all share.

• Here, that’s .//b/c[@dictRef="target"] ...
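One way to read "longest common subset" is as the longest run of trailing location steps shared by every observed path. The sketch below works under that assumption; the function names are hypothetical, and splitting on "/" assumes no predicate contains a slash.

```python
def split_steps(xpath):
    """Split an absolute XPath into its location steps."""
    return [step for step in xpath.split("/") if step]


def common_relative_xpath(xpaths):
    """Longest run of trailing steps shared by every path, as a relative XPath."""
    step_lists = [split_steps(p) for p in xpaths]
    shared = []
    # Walk backwards from the target node for as long as all paths agree.
    for steps in zip(*(reversed(s) for s in step_lists)):
        if len(set(steps)) != 1:
            break
        shared.append(steps[0])
    if not shared:
        raise ValueError("paths share no trailing steps")
    return ".//" + "/".join(reversed(shared))


# The three paths observed in the example corpus above.
corpus_paths = [
    '/a/b/c[@dictRef="target"]',
    '/a/d/b/c[@dictRef="target"]',
    '/a/b/c[@dictRef="target"]',
]
print(common_relative_xpath(corpus_paths))
```

For this corpus the shared tail stops at the step where one document has `d` and the others have `a`, leaving the two steps `b` and `c[@dictRef="target"]`.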

XPath algebra

• If the XPath for a concept goes back to the root node, it’s always in the same place — so we know where to find it. (We call it absolutely-positioned.)

• But what if the XPath doesn’t go back to the root, as for Energy here?

• Here, we want “final energy” – the instance of concept Energy which is a child of an instance of Final config. So:

“/Root node/Final config” + “.//Energy” giving “/Root node/Final config/.//Energy”

— and we’ve got the address. Now we can find it with XPath, deduce its datatype, transform it to JSON with XSLT, and we’re away.

Diagram: Root node has two children, Initial setup and Final config, and each of them has a child Energy.
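The path concatenation can be sketched in a few lines. The element names `root`, `initial`, `final` and `energy` and both helper functions are invented for this illustration; the evaluation uses the limited XPath subset in Python's standard library rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mirroring the diagram: Energy appears under both
# the initial setup and the final configuration.
DOC = """<root>
  <initial><energy>-10.0</energy></initial>
  <final><energy>-12.5</energy></final>
</root>"""


def join_xpath(parent_absolute, child_relative):
    """Concatenate an absolutely-positioned parent with a relative child path."""
    if not parent_absolute.startswith("/"):
        raise ValueError("parent path must be absolute")
    if not child_relative.startswith("."):
        raise ValueError("child path must be relative")
    return parent_absolute.rstrip("/") + "/" + child_relative


def evaluate(root, absolute_xpath):
    """Evaluate an absolute path with ElementTree, which only accepts relative ones."""
    first_step, _, rest = absolute_xpath.lstrip("/").partition("/")
    if first_step != root.tag:
        return []
    return root.findall("./" + rest)


root = ET.fromstring(DOC)
final_energy_path = join_xpath("/root/final", ".//energy")
print(final_energy_path)  # /root/final/.//energy
print([e.text for e in evaluate(root, final_energy_path)])
```

Only the energy under the final configuration is selected; the one under the initial setup is outside the parent's subtree.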

Building richer data browsers

Implementations

• CASTEP: http://www.castep.org/

• DALTON

• DL_POLY 3

• GULP: http://www.ivec.org/gulp/

• MOPAC: http://www.openmopac.net/

• OSSIA

• SIESTA: http://www.uam.es/siesta/

• VAMP

Photo by Felix42 (cc-by-nc-sa): http://www.flickr.com/photos/felix42/76536001/

Halfway.

Photo by mike157 (cc-by-sa): http://www.flickr.com/photos/72486075@N00/383577651/

Crystallography

• Crystallographers got their act together early.

• CIF is the standard data format for publishing crystallographic data:

http://www.iucr.org/iucr-top/cif/

• CrystalEye is an aggregation of “supplementary data” from journals:

http://wwmm.ch.cam.ac.uk/crystaleye/

Photo by cobalt123 (cc-by-nc): http://www.flickr.com/photos/cobalt/94863441/

A page from CrystalEye

Underneath...

• Data is converted to CML,

• ... and available as attachments to an Atom feed,

• ... and it’s got @dictRefs.

Therefore:

• Run Golem over the corpus;

• and now we can program against CrystalEye’s data.

In triple time...

• Data in RDF would be good — mostly because of SPARQL.

• A bit of sleight of hand:

<cml:property dictRef="_cell_volume">
  <cml:scalar units="cmlunits:A3">200</cml:scalar>
</cml:property>

to (in pseudo-NTriples):

<http://doc-url> <_cell_volume> [200, "cmlunits:A3"]

• Deem @dictRefs to be RDF classes

• So we can go from CML to RDF automatically via Golem (for absolutely-positioned concepts)
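A rough sketch of that conversion, assuming the document URL is the subject, the @dictRef is the predicate, and the value and units form the object. It emits the same pseudo-NTriples notation as the slide, not valid N-Triples; the function names and the CML namespace URI are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI, in ElementTree's {uri}tag notation.
CML_NS = "{http://www.xml-cml.org/schema}"

# Hypothetical document containing the property shown on the slide.
CML_DOC = """<cml xmlns="http://www.xml-cml.org/schema">
  <property dictRef="_cell_volume">
    <scalar units="cmlunits:A3"> 200 </scalar>
  </property>
</cml>"""


def property_triples(doc_url, root):
    """Yield (subject, predicate, object) for each dictRef-labelled scalar property."""
    for prop in root.iter(CML_NS + "property"):
        scalar = prop.find(CML_NS + "scalar")
        if scalar is None or prop.get("dictRef") is None:
            continue  # skip properties we cannot label or read
        value = float(scalar.text.strip())
        yield (doc_url, prop.get("dictRef"), (value, scalar.get("units")))


def pseudo_ntriple(triple):
    """Format a triple in the slide's informal notation."""
    subject, predicate, (value, units) = triple
    return '<%s> <%s> [%g, "%s"]' % (subject, predicate, value, units)


root = ET.fromstring(CML_DOC)
lines = [pseudo_ntriple(t) for t in property_triples("http://doc-url", root)]
print("\n".join(lines))
```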

Photo by sillydog (cc-by-nc-sa): http://www.flickr.com/photos/sillydog/72697229/

What CrystalEye doesn’t exploit

• Lots of things, but firstly —

• Bibliographic data, such as:

• author names

• institutions

• date of publication

• So how can we best use this data?

Photo by Thomas Hawk (cc-by-nc): http://www.flickr.com/photos/thomashawk/205664675/

Indexing

• Here’s a poser —

• What papers has an author written?

• Who were their coauthors?

• SPARQL makes this sort of question easy:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
DESCRIBE ?file WHERE {
  ?file dc:contributor '%s' .
}

• And add a front end.
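When a front end fills the author name into the query, the name should be escaped before it is substituted. The sketch below shows one way to do that. The co-author query is an assumption about how the second question might be asked, given that names are stored as plain dc:contributor literals; the function names and the placeholder author are hypothetical.

```python
# Query from the slide, with the literal supplied by the caller.
PAPERS_BY_AUTHOR = """PREFIX dc: <http://purl.org/dc/terms/>
DESCRIBE ?file WHERE {
  ?file dc:contributor %s .
}"""

# Assumed follow-up query: everyone else listed on the same files.
COAUTHORS = """PREFIX dc: <http://purl.org/dc/terms/>
SELECT DISTINCT ?coauthor WHERE {
  ?file dc:contributor %s .
  ?file dc:contributor ?coauthor .
  FILTER (?coauthor != %s)
}"""

# Characters that must be escaped inside a double-quoted SPARQL string.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def sparql_literal(text):
    """Quote and escape text as a SPARQL string literal."""
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in text) + '"'


def papers_by_author_query(author):
    return PAPERS_BY_AUTHOR % sparql_literal(author)


def coauthors_query(author):
    literal = sparql_literal(author)
    return COAUTHORS % (literal, literal)


# "A. N. Other" is a placeholder name, not an author from the corpus.
print(papers_by_author_query("A. N. Other"))
print(coauthors_query("A. N. Other"))
```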

Photo by Reeding Lessons (cc-by-nc-sa): http://www.flickr.com/photos/reedinglessons/2238990839/

Automatic indexing (look, no cards)

More axes

• By author is good. By subject would be even better.

• We have the papers’ titles: can we extract chemical names and terms from them? Yes:

• http://oscar3-chem.sourceforge.net/

• OpenCalais for chemistry. (Sort of.)

• Store OSCAR’s annotations as triples in the same store — then we can query on those too. Basically: tagging by machines.

Photo by Global Mouser (cc-by-nc-sa): http://www.flickr.com/photos/hryciuk/234940727/

Integration

• Documents about chemistry from different sources will use the same set of named entities

• ... say, chemical blogs.

• So we can integrate those (they’ve got per-post URIs, so we can add triples about them) into our index too —

• and we’re building connections between journal articles and blogs, basically for free.

Photo by solofotones (cc-by-nc-sa): http://www.flickr.com/photos/solofotones/1839531915/

Clustering

• What papers are tagged with a given tag?

• Another question:

What tags co-occur frequently in our corpus? Again, SPARQL makes this kind of thing easy.

• Another approach: brute-force. Cluster all the papers by title (using Porter stemming and an algorithm like NMF) – works well at finding related papers.
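The co-occurrence question can also be answered outside the triple store once the tags have been fetched. This sketch counts tag pairs in plain Python rather than SPARQL; the papers and tags are made-up stand-ins for OSCAR's annotations.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tags per paper, standing in for OSCAR annotations.
paper_tags = {
    "paper-1": {"benzene", "copper", "ligand"},
    "paper-2": {"copper", "ligand"},
    "paper-3": {"benzene", "ligand"},
}


def cooccurrence_counts(tags_by_paper):
    """Count how many papers carry each unordered pair of tags."""
    counts = Counter()
    for tags in tags_by_paper.values():
        # Sorting makes ("a", "b") and ("b", "a") the same key.
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(paper_tags)
for pair, n in sorted(counts.items()):
    print(pair, n)
```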

New visions

• So far, all the things I’ve shown have been pretty website-like.

• But now that we have the data and can query it, we can find new ways of looking at it and answer new questions.

• For example: how has the geographic distribution of work in crystallography changed over the last few years?

Photo by Leo Reynolds (cc-by-nc-sa): http://www.flickr.com/photos/lwr/12364944/

Out into the world

• Can we map institutions to locations?

• http://geonames.org/ has geodata; geotag the cities.

• Store the latitude and longitude as triples as before.

• (Why triple stores are great: no schema, one query language!)

• Managed to geotag about 30,000 papers in our corpus successfully.

Photo by rogiro (cc-by-nc): http://www.flickr.com/photos/riot/71878108/

Making data beautiful

• We can learn a lot from artists and graphic designers!

• http://processing.org/ is great for hacking up visualizations

• Get the data with a little Python/SPARQL program:

query = """PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
PREFIX fn: <http://www.w3.org/2005/xpath-functions>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?place ?lat ?lon ?date
WHERE {
  ?url ce:AcceptanceDate ?date .
  ?url ce:location ?loc .
  ?loc ce:address ?place .
  ?loc ce:latitude ?lat .
  ?loc ce:longitude ?lon .
  FILTER (xsd:date(?date) >= xsd:date("%s")) .
  FILTER (xsd:date(?date) < xsd:date("%s")) .
}
ORDER BY ?date
"""

Image by scloopy (cc-by-nc-sa): http://www.flickr.com/photos/onecm/438336731/

The process(ing)

• Plot week-by-week

• Each unique location associated with a paper that week gets plotted.

• Points fade out over an eight-week period.

• So:

From http://en.wikipedia.org/wiki/Equirectangular_projection

The world of... Crystallography 2000-2007
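The plotting rules can be sketched independently of Processing. The equirectangular mapping is the standard one; the linear shape of the fade is an assumption, since the slides only say that points fade out over eight weeks, and the function names are hypothetical.

```python
from datetime import date

FADE_WEEKS = 8  # points fade out over an eight-week period


def equirectangular(lat, lon, width, height):
    """Map latitude/longitude in degrees to pixel coordinates (origin top-left)."""
    x = (lon + 180.0) / 360.0 * width
    y = (90.0 - lat) / 180.0 * height
    return x, y


def opacity(paper_date, frame_date):
    """Linear fade: 1.0 in the week of acceptance, 0.0 from eight weeks on."""
    age_weeks = (frame_date - paper_date).days // 7
    if age_weeks < 0 or age_weeks >= FADE_WEEKS:
        return 0.0  # not yet accepted, or fully faded
    return 1.0 - age_weeks / FADE_WEEKS


# A point on the equator at the prime meridian lands in the centre of the map.
print(equirectangular(0.0, 0.0, 360, 180))
# Four weeks after acceptance, a point is drawn at half strength.
print(opacity(date(2007, 1, 1), date(2007, 1, 29)))
```

Each weekly frame would then draw every location returned by the query at the position given by `equirectangular` and with the alpha given by `opacity`.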

Where next?

• What’s going to happen to:

• scientific data capture?

• scientific publishing?

• These are cultural and economic questions as much as anything else; we have the technology.

• Where can we reuse these ideas?

• What other data can we free?

Photo by baboon™ (cc-by-nc): http://www.flickr.com/photos/baboon/92082777/

That’s all, folks!

Photo by psd (cc-by): http://www.flickr.com/photos/psd/2086641/

Acknowledgments (I) - my colleagues

• Prof. Peter Murray-Rust, Nick Day, Jim Downing (for CrystalEye)

• Dr Peter Corbett (for OSCAR 3)

• Dr Joe Townsend, Alan Tonge (SPECTRa-T)

• Dr Nico Adams, Nick England, Dr Lezan Hawizy (Polymer Informatics)

• Prof. Martin Dove, Dr Richard Bruin, Dr Kevin Yang, Dr Dan Wilson (MaterialsGrid)

• Dr Toby White (eMinerals project; FoX and http://cmlcomp.org/)

Acknowledgments (II) - academic collaborators

• Prof. Kurt Mikkelsen (DALTON)

• Dr Thomas Steinke (VAMP)

• the CASTEP Development Group (http://castep.org/)

• CrystalEye (http://wwmm.ch.cam.ac.uk/crystaleye/):

• the International Union of Crystallography (http://www.iucr.org/)

• the Royal Society of Chemistry (http://rsc.org/)

• OSCAR (http://oscar3-chem.sourceforge.net/):

• the Royal Society of Chemistry (http://rsc.org/)

• Nature Publishing Group (http://nature.com/)

Acknowledgments (III)

for access to the Talis N2 platform beta

Our commercial partners in MaterialsGrid

Acknowledgments (IV) — free software and data

• Python (http://python.org/)

• NumPy (http://numpy.scipy.org/)

• SciPy (http://scipy.org/)

• simplejson (http://undefined.org/python/)

• rdflib (http://rdflib.org/)

• lxml (http://codespeak.net/lxml/)

• Django (http://djangoproject.org/)

• Processing (http://processing.org/)

• jQuery (http://jquery.com/) and Flot (http://code.google.com/p/flot/)

• Geodata from Geonames (http://geonames.org/) and Wikipedia