ICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry

The Big Data Challenges Associated with Building a National Data Repository for Chemistry Antony Williams ICIC Meeting, Vienna October 14th 2013

DESCRIPTION

The Big Data Challenges Associated with Building a National Data Repository for Chemistry. Antony Williams (Royal Society of Chemistry, USA). At a time when the data explosion has simply been redefined as “Big”, the hurdles associated with building a subject-specific data repository for chemistry are daunting. The challenge is significant: it means combining a multitude of non-standard data formats for chemicals, related properties, reactions, spectra, etc., dealing with the confusion of licensing and embargoing, and providing for data exchange and integration with services and platforms external to the repository. All of this comes at a time when semantic technologies are touted as the fundamental technology to enhance integration and discoverability. Funding agencies are demanding change, especially a move towards open data access that parallels their expectations around Open Access publishing. The Royal Society of Chemistry has been funded by the Engineering and Physical Sciences Research Council (EPSRC) of the UK to deliver a “chemical database service” for UK scientists. This presentation will provide an overview of the challenges associated with this project and our progress in delivering a chemistry repository capable of handling the complex data types associated with chemistry. The benefits of such a repository in terms of providing data to develop prediction models to further enable scientific discovery will be discussed, and the potential impact on the future of scientific publishing will also be examined.

TRANSCRIPT

Page 1: ICIC 2013 Conference Proceedings Antony Williams Royal Society of Chemistry

The Big Data Challenges Associated with Building a National Data Repository for Chemistry

Antony Williams

ICIC Meeting, Vienna

October 14th 2013

Page 2:

So what is all this Big Data?

Page 3:
Page 4:

And the World of Chemistry?

Page 5:

And the World of Chemistry?

“The InChIKey indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records.”
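
The quoted point rests on InChIKeys being fixed-length strings that a general web search engine can index as single tokens. The following is a minimal sketch, assuming the standard 27-character InChIKey layout; the function names are illustrative and not part of any RSC or Google API, and the check covers the format only, not whether the key corresponds to a real structure.

```python
import re

# Assumed InChIKey layout: 14 letters, hyphen, 8 letters + flag (S or N)
# + version letter, hyphen, 1 protonation letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{8}[SN][A-Z]-[A-Z]")


def is_inchikey(text: str) -> bool:
    """Return True if text has the shape of an InChIKey (format check only)."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def search_phrase(inchikey: str) -> str:
    """Build an exact-phrase query string for a general web search engine."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return f'"{inchikey.strip()}"'


if __name__ == "__main__":
    key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # standard InChIKey of aspirin
    print(is_inchikey(key))
    print(search_phrase(key))
```

Pasting the quoted key into a search engine is all the "indexing" a user needs; no structure-aware software is involved on the search side.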

Page 6:

And the World of Chemistry?

Page 7:

RSC’s ChemSpider

>29 million chemicals from >500 sources

Page 8:

…and the world of Openness

Page 9:

Times have changed…

Open Access funder mandates…

Page 10:

Times have changed…

Growth, growth, growth…

Page 11:

Publishers are responding

Page 12:

The world of Open Data…

Page 13:

Open Data are everywhere

• Are Openness and Social Sharing changing the world?

• The cultural experiments in Open Data and exchange are almost daily

• Mobile platforms enhance participation

• And then what of Chemistry Data?

Page 14:

Publications: a summary of work

• Scientific publications are a summary of work

• Is all work reported?

• How much science is lost to pruning?

• What of value sits in notebooks and is lost?

• Publications offering access to “real data”?

• How much data is lost?

• How many compounds never reported?

• How many syntheses fail or succeed?

• How many characterization measurements?

Page 15:

About Me…as a Chemist

• I’ve performed a few dozen chemical syntheses

• I’ve run thousands of analytical spectra

• I’ve generated thousands of NMR assignments

• I’ve probably published <5% of all work

• Most of it has been lost

• But things can be different today…

• But it still needs to be associated with me…

Page 16:

What of non-abstracted data?

• How much data generated in a lab, that COULD go public, is lost forever?

Page 17:

What of non-abstracted data?

• How much data generated in a lab, that COULD go public, is lost forever?

• Public Domain reference databases of value?

• Syntheses

• Properties

• Spectra and CIFs

• Images

• Raw data vs. representations of data

Page 18:

ChemSpider

• ChemSpider allowed the community to participate in linking the internet of chemistry & crowdsourcing of data

• Successful experiment in terms of building a central hub for integrated web search

• More people are “users” than “contributors”

• Yet basic feedback and game-play helps

Page 19:

Crowdsourced “Annotations”

• Users can add:

• Descriptions, Syntheses and Commentaries

• Links to PubMed articles

• Links to articles via DOIs

• Spectral data

• Crystallographic Information Files

• Photos

• MP3 files

• Videos

Page 20:

An EPSRC Call

“…the identification of the need for a UK national service for the provision of a searchable, electronic chemical database for the UK academic research community.”

Page 21:

National Chemical Database Service

• Service for UK Academics

• “Prepaid access” integrating commercial databases and services

• Access to curated data sets

• Provision of prediction algorithms

Page 22:

National Chemical Database Service

Page 23:

National Chemical Database Service

• Service for UK Academics

• “Prepaid access” integrating commercial databases and services

• Access to curated data sets

• Provision of prediction algorithms

• Ultimate goal is to federate search

• Development of “data repository”

Page 24:

Development of Data Repository

• Data repository should not just be a data dump – should not be a “big disk”

• Searchable, integrated, segregated repository of data types

• Data access levels including private, shared, embargoed and public

• Delivery of derived models from data

• Integrated with AltMetrics models
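
The four access levels named on this slide can be pictured as a small visibility check. This is only a sketch under stated assumptions: the class and function names are hypothetical, and the rule that an embargoed deposit opens to everyone once its date passes is an assumption for illustration, not a documented repository policy.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional, Set


class Access(Enum):
    PRIVATE = "private"
    SHARED = "shared"
    EMBARGOED = "embargoed"
    PUBLIC = "public"


@dataclass
class Deposit:
    """A hypothetical deposited data set with its access settings."""
    owner: str
    access: Access
    shared_with: Set[str] = field(default_factory=set)
    embargo_until: Optional[date] = None  # only meaningful for EMBARGOED


def can_view(deposit: Deposit, user: str, today: date) -> bool:
    """Decide whether user may see the deposit on the given day."""
    if user == deposit.owner or deposit.access is Access.PUBLIC:
        return True
    if deposit.access is Access.SHARED:
        return user in deposit.shared_with
    if deposit.access is Access.EMBARGOED:
        # Assumed rule: the deposit becomes visible once the embargo date is reached.
        return deposit.embargo_until is not None and today >= deposit.embargo_until
    return False  # PRIVATE: owner only


if __name__ == "__main__":
    thesis_data = Deposit("alice", Access.EMBARGOED, embargo_until=date(2014, 1, 1))
    print(can_view(thesis_data, "bob", date(2013, 10, 14)))
    print(can_view(thesis_data, "bob", date(2014, 1, 1)))
```

Keeping the owner attached to every deposit is also what would let later reuse be credited back to the contributing scientist.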

Page 25:

What can drive participation?

• What can drive scientists to participate and contribute?

• Ensuring provenance of their data for reuse

• Mandates from funding agencies

• Improved systems to ease contribution

• Additional contributions to science

• Improved publishing processes

• Recognition for contributions

Page 26:

AltMetrics

Page 27:

AltMetrics

Page 28:

AltMetrics as Scientist Impact

Page 29:

AltMetrics

Page 30:

Plum Analytics

Page 31:

Plum Analytics

Page 32:

Rewards and Recognition

Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.

The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.

Page 33:

AltMetrics Feeds

• For our data repository, ensure that contributions of data feed out to the AltMetrics platforms

• Every data point, every data download, use and reuse will be associated with the scientist

• Data will be DOI’ed (presently under review)

• Services provided will allow for AltMetrics use

Page 34:

Domain Specific Challenges

• Creating a platform of value, not just a data dump

• Searchability, segregation, tagging, use and reuse, collaboration, low barrier to participation

• Quality of chemistry data at source

• ensuring chemicals are correct

• reactions map and balance as appropriate

• file format handling for analytical data types – binary file formats are proprietary

• valid interpretation of data

Page 35:

Domain Specific Challenges

• Quality of data at source

• ensuring chemicals are correct – VALIDATION

• reactions map and balance as appropriate – VALIDATION and STANDARDIZATION

• file format handling for analytical data types – binary file formats are proprietary – STANDARDIZATION

• valid interpretation of data – VALIDATION and ANNOTATION
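
As one concrete instance of the "balance" part of reaction validation, the sketch below counts atoms on each side of a reaction. It is an illustration only, assuming flat molecular formulas without brackets, charges or isotopes; the function names are hypothetical, it does not attempt atom mapping, and it is not the validation routine used by the RSC.

```python
import re
from collections import Counter
from typing import List, Tuple

ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")

Side = List[Tuple[int, str]]  # (stoichiometric coefficient, formula) pairs


def count_atoms(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C2H6O' (no brackets or charges)."""
    if not FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    atoms: Counter = Counter()
    for symbol, digits in ELEMENT.findall(formula):
        atoms[symbol] += int(digits) if digits else 1
    return atoms


def side_total(side: Side) -> Counter:
    """Sum atom counts over one side of a reaction, weighted by coefficient."""
    total: Counter = Counter()
    for coefficient, formula in side:
        for symbol, n in count_atoms(formula).items():
            total[symbol] += coefficient * n
    return total


def is_balanced(reactants: Side, products: Side) -> bool:
    """Return True if every element appears equally often on both sides."""
    return side_total(reactants) == side_total(products)


if __name__ == "__main__":
    # 2 H2 + O2 -> 2 H2O
    print(is_balanced([(2, "H2"), (1, "O2")], [(2, "H2O")]))
```

A deposition system could run a check of this kind at upload time and flag, rather than reject, reactions that fail it, since published reaction schemes often omit by-products.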

Page 36:

Validating Chemicals

• Community service for validation and standardization of chemicals (CVSP)

• Open rule sets, but the standard set is based on the FDA Substance Registration System

Page 37:

Validating chemicals

DB08128

J. Brecher, IUPAC Graphical Representation of Stereochemical Configuration, Section ST-1.1.10

DB06287

Page 38:

Standardizing Chemicals

Page 39:

Validated Name-Structure dictionaries for data checking

• Chemical name dictionaries used for:

• Text-mining (publications, patents)

• Linking to other databases – think Biology

• Drug names are incredibly valuable links

• Searching the web

• Names link to structures
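
A name-structure dictionary used for text-mining can be pictured as a simple lookup over normalized names. The sketch below is illustrative only: the three-entry dictionary, the SMILES strings chosen as the structure representation, and the function name are assumptions, and a production dictionary would need curated synonyms and far more careful tokenization.

```python
import re
from typing import Dict, List, Tuple

# Tiny illustrative dictionary: lower-cased chemical name -> SMILES string.
NAME_TO_SMILES: Dict[str, str] = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "acetylsalicylic acid": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
}


def find_names(text: str, dictionary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (name, structure) pairs for dictionary names found in text."""
    hits: List[Tuple[str, str]] = []
    lowered = text.lower()
    # Longest names first, purely to give a deterministic output order.
    for name in sorted(dictionary, key=len, reverse=True):
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            hits.append((name, dictionary[name]))
    return hits


if __name__ == "__main__":
    sentence = "Aspirin and caffeine were dissolved in water."
    for name, smiles in find_names(sentence, NAME_TO_SMILES):
        print(name, smiles)
```

Because two names can map to the same structure, the structure (not the name) is the natural key for linking hits to other databases.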

Page 40:

Difficult to navigate…

• What’s the structure?

• Are they in our file?

• What’s similar?

• What’s the target?

• Pharmacology data?

• Known Pathways?

• Working On Now?

• Connections to disease?

• Expressed in right cell type?

• Competitors?

• IP?

Page 41:

Inside our Publication Archive

• How much data is in the archive, in the publications and in the supplementary info?

• How many compounds for ChemSpider?

• How many syntheses for ChemSpider reactions?

• How many characterization measurements?

• Property Data

• Spectral Data

• Graphs and charts to be used for modeling?

Page 42:

What if we could capture it all?

Digitally Enhancing the RSC Archive

Page 43:

Linking Names to Structures

Page 44:

Semantic Mark-up of Articles

Page 45:

Hosting Reactions

• Seed set of over 1 million reactions from patents to develop validation and standardization routines

• Reactions to be extracted from RSC journal articles and ESI; reaction databases will also be examined

• Resulting validation algorithms used at deposition

Page 46:

The challenges of analytical data

• Integration of ChemSpider with analytical instrumentation vendors already in place

• Agilent, Bruker, Thermo, Waters

• Vendors produce complex proprietary data formats and standard formats are required (JCAMP, NetCDF, AnIML)

• ChemSpider already hosts thousands of JCAMP spectra

• Support of “assigned spectra” in place

• Data validation approaches understood

• There are a myriad of analytical data types…
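
Part of the appeal of JCAMP as a standard format is that its metadata is plain text, written as `##LABEL= value` records. The sketch below reads only those header records from a made-up sample; it is an illustration under simplifying assumptions (no compressed data forms, no multi-line values, labels normalized by upper-casing only), not a complete JCAMP-DX reader, and the function name is hypothetical.

```python
from typing import Dict


def read_jcamp_header(text: str) -> Dict[str, str]:
    """Collect '##LABEL= value' records from JCAMP-DX text into a dict."""
    records: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("$$", 1)[0].strip()  # drop '$$' comments
        if not line.startswith("##") or "=" not in line:
            continue  # data rows and continuation lines are skipped here
        label, value = line[2:].split("=", 1)
        records[label.strip().upper()] = value.strip()
    return records


# Made-up sample for illustration; not a real measured spectrum.
SAMPLE = """##TITLE= example spectrum
##JCAMP-DX= 4.24
##DATA TYPE= INFRARED SPECTRUM
##XUNITS= 1/CM
##YUNITS= TRANSMITTANCE
##NPOINTS= 2
##XYDATA= (X++(Y..Y))
4000 0.98
3999 0.97
##END=
"""

if __name__ == "__main__":
    for label, value in read_jcamp_header(SAMPLE).items():
        print(f"{label}: {value}")
```

A repository can index such header fields (title, data type, units) for search without having to interpret the numerical data block itself.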

Page 47:

Turning “Figures” Into Data

Page 48:

Community Data Repository

• Automated depositions of data – service-based deposition, sweep and deposit

• Integrate with Electronic Lab Notebooks as feeds

• Databases of reference data would be of high value, but validated by models and by the community

• National services feeding the repository – crystallography, mass spectrometry

Page 49:

E-Lab Notebooks

• Integration between ELNs and:

• ChemSpider

• ChemSpider Reactions

• Chemistry Data Repository

Page 50:

What do we have in place?

• We are testing a data repository on our assets – ChemSpider and our archive of publications

• Working with many collaborators to define needs

• Deposition system for chemical compounds – hosts >29 million chemicals

• Crowdsourcing curation & annotation platform

• Chemical validation & standardization platform

• Chemical reactions database with >1 million reactions; RSVP presently in development

• Analytical data handling formats (JCAMP preferred)

• And lots in development…

Page 51:

The Challenges Ahead

• Chemistry is NOT just nicely defined structures!

• Materials, minerals, attached to beads, polymers, ambiguous materials

• Domain-specific measurements

• File format standards are limited in application

• Encouraging scientists to free up their data

• AltMetrics, open data mandates, systems

• The data explosion continues

• 4 years ahead to expand capability

Page 52:

The Future

Internet Data

Commercial Software

Pre-competitive Data

Open Science

Open Data

Publishers

Educators

Open Databases

Chemical Vendors

Small organic molecules

Undefined materials

Organometallics

Nanomaterials

Polymers

Minerals

Particle bound

Links to Biologicals

Page 53:

Thank you

Email: [email protected]

Twitter: @ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams