data rich chemistry inside wikipedia and other wikis

DATA-RICH CHEMISTRY INSIDE WIKIPEDIA & OTHER WIKIS Martin A Walker, SUNY Potsdam


Post on 20-Nov-2014


Page 1: Data rich chemistry inside wikipedia and other wikis

DATA-RICH CHEMISTRY INSIDE WIKIPEDIA & OTHER WIKIS

Martin A Walker, SUNY Potsdam

Page 2: Data rich chemistry inside wikipedia and other wikis

OVERVIEW

Chemical data in Wikipedia

Validation of Wikipedia chemical data

RSC Learn Chemistry

Conclusion

Page 3: Data rich chemistry inside wikipedia and other wikis

SUBSTANCE DATA IN WIKIPEDIA

Wikipedia is designed as an encyclopedia, NOT a database, BUT many cheminformatics groups want to use data from Wikipedia

Since most data are entered by a human being, rather than by machine, Wikipedia can often provide a data source that is independent of the main online databases

Could the Wikipedia chemists make the data more accessible without compromising the project’s mission? What about DBpedia?

Page 4: Data rich chemistry inside wikipedia and other wikis

CHEMBOXES & DRUGBOXES

The Chembox on a substance page contains standard representations such as:

Skeletal formula
IUPAC name
InChI and InChIKey
CAS no. (represents substance, not structure)
SMILES (proprietary but de facto standard before InChI)

These were traditionally supplied for use by readers to copy/paste, but we were asked to make a machine-friendly version

Page 5: Data rich chemistry inside wikipedia and other wikis

WIKIPEDIA DRUG PAGES

Page 6: Data rich chemistry inside wikipedia and other wikis

EARLY CHEMBOXES

Chemboxes were originally set up as tables – OK for people, but not for data mining.

A typical chembox from 2007

Page 7: Data rich chemistry inside wikipedia and other wikis

NEW CHEMBOXES

Now designed as a set of data fields with values entered by the editor – better for data extraction and for validation

Drugboxes also redesigned

Machine-friendly formats (SMILES, InChI, InChIKey, CAS Reg. No.) included in nearly all chemboxes

Hide/show used to avoid table “explosions”

Collections of Wikipedia data are now available for cheminformatics groups to use
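As a rough illustration of why field-based chemboxes are easier to mine than tables, the sketch below pulls `| name = value` pairs out of a chembox-style template. The sample wikitext, its field names, and the `extract_fields` helper are assumptions made for this example, not the tooling actually used by the Wikipedia chemists; a real extraction would need a proper wikitext parser to cope with nested templates.

```python
import re

# Illustrative wikitext only: the field names and values are assumptions
# modelled on a typical chembox, not copied from a live article.
SAMPLE_WIKITEXT = """{{Chembox
| Name = Ethanol
| CASNo = 64-17-5
| SMILES = CCO
| StdInChI = 1S/C2H6O/c1-2-3/h3H,2H2,1H3
| StdInChIKey = LFQSCWFLJHTTHZ-UHFFFAOYSA-N
}}"""

# One field per line: a leading pipe, a name, an equals sign, then the value.
FIELD_LINE = re.compile(r"^\|\s*(?P<name>[^=|]+?)\s*=\s*(?P<value>.*?)\s*$")


def extract_fields(wikitext: str) -> dict[str, str]:
    """Collect `| name = value` lines into a dictionary.

    Handles only one field per line; nested templates are not parsed.
    """
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match:
            fields[match.group("name")] = match.group("value")
    return fields


fields = extract_fields(SAMPLE_WIKITEXT)
print(fields["CASNo"])
print(fields["StdInChIKey"])
```

Because each value sits behind a named field, the same dictionary can feed both data extraction and a later validation step.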

Page 8: Data rich chemistry inside wikipedia and other wikis

CURRENT FORM OF CHEMBOX

(Images: the simple form and the full form of the chembox)

Page 9: Data rich chemistry inside wikipedia and other wikis

TABLE EXPLOSIONS!

Some data (e.g., InChIs for complex molecules) can be very long – and this was a hindrance to their use in Wikipedia

Page 10: Data rich chemistry inside wikipedia and other wikis

VALUE OF THE INCHI AND INCHIKEY

InChI can be used to define what structure is being represented when compiling a virtual database.

InChI can provide an unambiguous reference when validating structures on Wikipedia

InChIKey is useful to help those using search engines
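One reason the InChIKey suits search engines is that it is a fixed-length string of 27 characters, unlike the InChI itself, which can be very long. The sketch below checks only that a string has the InChIKey shape (14 letters, 10 letters, 1 letter, separated by hyphens); the function names are made up for this example, and a well-formed key is not proof that it matches any real structure.

```python
import re

# An InChIKey is 27 characters: 14 letters, a hyphen, 10 letters, a hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(key: str) -> bool:
    """Return True if the string has the InChIKey shape (format check only)."""
    return INCHIKEY_PATTERN.fullmatch(key) is not None


def connectivity_block(key: str) -> str:
    """Return the first 14-letter block, which is hashed from the connectivity layers."""
    if not is_wellformed_inchikey(key):
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return key.split("-")[0]


water_key = "XLYOFNOQVPJJNP-UHFFFAOYSA-N"  # InChIKey of water
print(is_wellformed_inchikey(water_key))
print(connectivity_block(water_key))
```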

Page 11: Data rich chemistry inside wikipedia and other wikis

DATA PAGES

PROBLEM: Table creep – users ask for the table to include the Standard Free Energy of Hydroformylation in a Black Box

ANSWER: Put it on a sub-page – the supplementary data page (something unique to chemistry!). Click on a link from the bottom of the Chembox:

Page 12: Data rich chemistry inside wikipedia and other wikis

DATA PAGES

Page 13: Data rich chemistry inside wikipedia and other wikis

DATA VALIDATION

Page 14: Data rich chemistry inside wikipedia and other wikis

DATA VALIDATION

How I use the key terms:

Validation => “How can I be sure the data are correct?”

Curation => an ongoing process of fixing errors

Page 15: Data rich chemistry inside wikipedia and other wikis

CONTENT VALIDATION

In 2008 a data validation drive was initiated for basic chemical identifiers

Led to a collaboration with CAS, to ensure Wikipedia CAS registry nos. are correct

More than 3500 substances have now been validated against CAS Common Chemistry, as having correct name, structure & CAS RN

Other fields now being validated

Validated content indicated with a check mark

Page 16: Data rich chemistry inside wikipedia and other wikis

THE APPROACH TO VALIDATION

Every old version of an article (each identified by a RevID) is preserved for posterity and is visible to all, so it can potentially serve as a permanent record of a validated version.
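Since a RevID pins down one exact version of a page, a validated version can be cited by a permanent link. The sketch below builds such a link using MediaWiki's `index.php?title=...&oldid=...` URL form; the helper name and the revision number are placeholders for this example, not a real validated revision.

```python
from urllib.parse import urlencode


def revision_permalink(title: str, rev_id: int, host: str = "en.wikipedia.org") -> str:
    """Build a permanent link to one specific revision of a wiki page."""
    # MediaWiki page titles use underscores in place of spaces.
    query = urlencode({"title": title.replace(" ", "_"), "oldid": rev_id})
    return f"https://{host}/w/index.php?{query}"


# The revision number here is a placeholder, not a real validated RevID.
print(revision_permalink("Ethanol", 123456789))
```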

Page 17: Data rich chemistry inside wikipedia and other wikis

PROTECTING VALIDATED FIELDS

PROBLEM: This is “the encyclopedia anyone can edit” – so anyone can change the BP of water to 200 °C.

SOLUTION: A bot patrols the pages, and watches for edits to key fields. Any dubious edits are flagged with a red X (next to the data), and logged.

System developed by Dirk Beetstra (Eindhoven University of Technology). It is the only such tool on Wikipedia.
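The core idea of watching key fields can be sketched as a comparison between the validated values and the values in the current revision. This is a minimal illustration only: it is not the bot's actual code, and the field names, the `FieldChange` record, and the `find_changed_fields` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldChange:
    """One watched field whose current value differs from the validated one."""
    field: str
    validated: str
    current: Optional[str]  # None means the field was removed entirely


def find_changed_fields(
    validated: dict[str, str],
    current: dict[str, str],
    watched: tuple[str, ...],
) -> list[FieldChange]:
    """Return the watched fields whose value no longer matches the validated record."""
    changes = []
    for name in watched:
        old = validated.get(name)
        new = current.get(name)
        # Only fields that have a validated value can be flagged.
        if old is not None and old != new:
            changes.append(FieldChange(name, old, new))
    return changes


# Hypothetical field names; the boiling point edit mirrors the example on the slide.
validated_water = {"BoilingPtC": "100", "CASNo": "7732-18-5"}
edited_water = {"BoilingPtC": "200", "CASNo": "7732-18-5"}

for change in find_changed_fields(validated_water, edited_water, ("BoilingPtC", "CASNo")):
    print(f"flag {change.field}: validated {change.validated!r}, now {change.current!r}")
```

A changed value is only flagged and logged here, matching the slide's description of dubious edits being marked rather than automatically reverted.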

Page 18: Data rich chemistry inside wikipedia and other wikis

VALIDATION PROTECTED BY BOT

If anyone tries to vandalize a validated field, this will be flagged by a bot soon afterwards.

This example received a red X 11 minutes after it was vandalized.

Page 19: Data rich chemistry inside wikipedia and other wikis

VALIDATED REVISIONIDS

Page 20: Data rich chemistry inside wikipedia and other wikis

CHECKING STRUCTURES

In 2008-2010, around 3000 chemical structures were informally checked against CAS Common Chemistry

PROBLEM: Structures are loaded from an external file on Wikimedia Commons, which can be “invisibly” changed

Page 21: Data rich chemistry inside wikipedia and other wikis

SINCE FALL 2010

The bot has been modified to watch changes to the RevID of the Wikimedia Commons structure image

A few hundred images validated so far

Page 22: Data rich chemistry inside wikipedia and other wikis

DRUGBOXES

Drugboxes are patrolled by the bot, but at present WP:PHARM is not active in formal validation. Most of the work is done by Dirk Beetstra, using official lists from data sources (e.g., ChEBI).

Page 23: Data rich chemistry inside wikipedia and other wikis

RSC LEARN CHEMISTRY

Page 24: Data rich chemistry inside wikipedia and other wikis

RSC LEARN CHEMISTRY WIKI

Aims to enrich RSC educational content with data from ChemSpider, then make it open for educators to contribute their own content (licensed under Creative Commons)

Page 25: Data rich chemistry inside wikipedia and other wikis

SUBSTANCE SEARCHES

Page 26: Data rich chemistry inside wikipedia and other wikis

SUBSTANCE PAGES: FOUND BY INCHI SEARCH

Page 27: Data rich chemistry inside wikipedia and other wikis

WITH LINKS TO SPECTRA:

Page 28: Data rich chemistry inside wikipedia and other wikis

QUIZZES: “PREDICT THE PRODUCT”

Page 29: Data rich chemistry inside wikipedia and other wikis

QUIZZES

Page 30: Data rich chemistry inside wikipedia and other wikis

QUIZZES

Page 31: Data rich chemistry inside wikipedia and other wikis

INCHI PROVIDES THE WAY

Page 32: Data rich chemistry inside wikipedia and other wikis

CONCLUSION

Wikipedia can provide a useful “virtual database” of highly curated information on common chemicals and drugs.

Don’t forget the data page information!

The validation effort needs to go further – YOUR help is very welcome!

RSC Learn Chemistry shows that chemical data can also be used to enrich an educational site.

Page 33: Data rich chemistry inside wikipedia and other wikis

ACKNOWLEDGEMENTS

Congratulations to Henry and Peter, and thanks for the invitation to speak in their symposium.

Thanks to Antony Williams for his many contributions to both Wikipedia and Learn Chemistry.

Thanks to Aileen Day, Lorna Thomson, Duncan McMillan and RSC Education staff, and to RSC for the funding of Learn Chemistry.

Thanks to undergraduate student Tyson Terpstra for uploading many quiz InChIs.

Thank you for your attention!

Page 34: Data rich chemistry inside wikipedia and other wikis

ANY QUESTIONS?

Thank you for your attention

Page 35: Data rich chemistry inside wikipedia and other wikis

COPYRIGHT INFORMATION

All of my own content in this presentation is released under a Creative Commons BY-SA-3.0 license

Copyright information for images is usually attributed on the slide itself

Content from Wikipedia and Learn Chemistry is reused via a Creative Commons BY-SA-3.0 license. For authors, please visit the original Wikipedia page and select the “history” tab.

Any other pictures that are not attributed should be my own personal pictures, also released under Creative Commons BY-SA-3.0.