Wikiconference 2016 talk Burgstaller

Upload: sebotic

Post on 15-Jan-2017


Drug and chemical compound items in Wikidata as a data source for Wikipedia infoboxes

https://commons.wikimedia.org/wiki/File:Wikidata-logo-en.svg

Sebastian Burgstaller-Muehlbacher, PhD

User:Sebotic

Twitter: @sebotic

Contents

The Problem

Introduction to Wikidata

Data model

References

Values/data types

Gene Wiki Info Boxes - An Example solution

Chemistry Data in Wikidata

Issues with the data

Community cleanup

Migration of Info Boxes to Wikidata

The Problem (with chemistry data)

Wikipedia has ~300 different language projects

Currently, chemistry data resides in info box parameters

Data are not reusable between language projects

Data are not machine readable

Data are hard to update automatically

Data cannot be reused for other purposes, e.g. science.

The solution

-Labels, descriptions, aliases in different languages

-Diverse Properties

-Sitelinks

Wikidata items

Two types of entities

Properties (Pxxxx): Describe the nature of a data value

Different data types

2,900 different properties in Wikidata

Data items (Qxxxx): A set of claims or statements

Consist of property value pairs

20 million items in Wikidata

-Properties must be proposed and approved by the community

-Data items can be edited by any Wikidata user and are the true data stores.

A Wikidata Statement

https://commons.wikimedia.org/wiki/File:Wikidata_statement.svg

Claim: Property with value + optional qualifiers

Statement: A claim with its references
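The claim/statement distinction above can be sketched as a small data structure. This is only an illustrative model, not the Wikibase implementation: the class names (`Snak`, `Claim`, `Statement`) and the placeholder values `Q_EXAMPLE_SOURCE` and `en` are assumptions chosen for the example. P235 (InChIKey), P248 (stated in) and P407 (language of work or name) are existing Wikidata properties.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snak:
    """One property-value pair, e.g. P235 (InChIKey) -> 'UEJJ...'."""
    property_id: str
    value: str


@dataclass
class Claim:
    """A property with a value, plus optional qualifiers."""
    main: Snak
    qualifiers: List[Snak] = field(default_factory=list)


@dataclass
class Statement:
    """A claim together with its references."""
    claim: Claim
    # Each reference is itself a list of property-value pairs.
    references: List[List[Snak]] = field(default_factory=list)

    def is_referenced(self) -> bool:
        # True if at least one non-empty reference is attached.
        return any(len(ref) > 0 for ref in self.references)


# Hypothetical example: an InChIKey claim, first bare, then referenced.
claim = Claim(
    main=Snak("P235", "UEJJHQNACJXSKW-UHFFFAOYSA-N"),
    qualifiers=[Snak("P407", "en")],  # placeholder qualifier, for illustration only
)
bare = Statement(claim=claim)
referenced = Statement(
    claim=claim,
    references=[[Snak("P248", "Q_EXAMPLE_SOURCE")]],  # placeholder source item
)
```

Here `bare.is_referenced()` is false and `referenced.is_referenced()` is true, which mirrors the slide: the same claim only becomes a fully formed statement once a reference is attached.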

Wikidata Data types

The current Wikidata data types: String

WDItemID

External ID

MonolingualText

Property

Quantity

Time

Url

GlobeCoordinate

CommonsMedia

Mathematical formula

-Many queries to the Wikidata API make the bot slow and might make Wikimedia people/administrators unhappy.

-Calling wbeditentity ensures that all data is either written or not, so if the connection or bot breaks, no harm is done.

-No new items will be created and then left unpopulated.
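A minimal sketch of the single-call idea: all claims for an item are collected into one JSON document and sent in one `wbeditentity` request, instead of one request per claim. The helper names (`string_claim`, `build_edit_request`) are hypothetical and not taken from the authors' bots. The sketch only builds the request parameters and does not send anything; obtaining a login session and a CSRF token is assumed to happen elsewhere.

```python
import json


def string_claim(property_id, value):
    """Build one string-valued claim in the Wikibase JSON layout."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {"value": value, "type": "string"},
        },
        "type": "statement",
        "rank": "normal",
    }


def build_edit_request(claims, csrf_token, item_id=None):
    """Bundle all claims into the parameters of ONE wbeditentity call.

    If item_id is None the request creates a new item, so the item is
    created and populated in the same request.
    """
    params = {
        "action": "wbeditentity",
        "format": "json",
        "token": csrf_token,
        "data": json.dumps({"claims": claims}),
    }
    if item_id is None:
        params["new"] = "item"
    else:
        params["id"] = item_id
    return params


# Usage with placeholder values ("TOKEN" and "Q_EXAMPLE" are not real).
request_params = build_edit_request(
    [string_claim("P235", "UEJJHQNACJXSKW-UHFFFAOYSA-N")],
    csrf_token="TOKEN",
    item_id="Q_EXAMPLE",
)
```

The resulting dictionary would be posted to www.wikidata.org/w/api.php by whatever HTTP client the bot uses.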

Unique Features of Wikidata

Completely free, even for commercial usage (CC0).

Granular: Single values with references.

Anybody can contribute.

Extensive item history.

A repository for data on all domains of knowledge.

Full integration with the semantic web.

Essentially: A giant graph of knowledge.

Single value refs/nano publications

Revisions/data releases

Burgstaller-Muehlbacher, et al, Database, 2016

Data use case: Gene Wiki infoboxes

Issues with chemical data in the Wiki space

Incorrect identifiers in info boxes or on Wikidata items

Incorrect chemical properties

Incorrect labels, aliases

Incorrect isomeric forms of the compound

Mixture of different isomeric forms

https://commons.wikimedia.org/wiki/File:Isomerism.svg

How to solve Isomerism issues?

Make sure that the structure in Wikidata and Wikipedia are correct and consistent:

Use the InChI (International Chemical Identifier) or InChI key to determine what isomer a certain article or WD item is actually talking about.






What are InChIs

IUPAC InChI (International Chemical Identifier).

Describes the structure of a chemical compound or substance.

Freely usable.

Can be computed from e.g SMILES, or MOL format.

Do not need to be assigned by an organization.

What are InChI keys

A fixed-length (27 character) hashed version of an InChI, computed with SHA-256

Makes chemicals searchable on the Web

Makes chemicals easily comparable

Short, unique

UEJJHQNACJXSKW-UHFFFAOYSA-N

First block (14 letters) encodes the skeleton (connectivity)

Second block (8 letters) encodes stereochemistry and isotopes

Last letter encodes the number of protons (charge)
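The block structure above can be checked mechanically. The following sketch splits an InChI key into its parts. The function and field names are hypothetical. One detail goes beyond the slide: in the InChI key format the 8 hash letters of the second block are followed by two flag letters (standard/non-standard and InChI version) before the final hyphen.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + 2 flag letters, hyphen, 1 letter.
_INCHI_KEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str        # hash of the connectivity
    stereo_isotope: str  # hash of stereochemistry / isotope layers
    standard_flag: str   # 'S' for a standard InChI key
    version: str         # InChI version letter
    protonation: str     # 'N' means neutral


def parse_inchi_key(key: str) -> InChIKeyParts:
    """Split an InChI key into blocks; raise ValueError if malformed."""
    match = _INCHI_KEY.match(key.strip())
    if match is None:
        raise ValueError("not a well-formed InChI key: %r" % key)
    return InChIKeyParts(*match.groups())


parts = parse_inchi_key("UEJJHQNACJXSKW-UHFFFAOYSA-N")
```

Two items whose keys share the same `skeleton` but differ in `stereo_isotope` describe the same connectivity with different stereochemistry or isotopes, which is the comparison the isomerism cleanup relies on.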

How to solve Isomerism issues?

Make sure that the structure in Wikidata and Wikipedia are correct and consistent:

Use the InChI (International Chemical Identifier) or InChI key to determine what isomer a certain article or WD item is actually talking about.

Minimum requirement: Correct, unique InChI key on item.

Best case: Make sure all structural identifiers are correct (isomeric SMILES, canonical SMILES, InChI or InChI key).

A minimum of a correct InChI key allows for the rest of the chemical compound item to be populated by (our) bots.
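As a rough illustration of the "correct, unique InChI key" requirement, the sketch below flags two situations in a mapping of item IDs to InChI keys: items that share a skeleton (first block) but carry different keys, and items that carry exactly the same key. The item IDs and keys in `EXAMPLE_ITEMS` are made up for the example, and the function names are hypothetical. This is not the authors' validation code.

```python
from collections import defaultdict

# Made-up item IDs and keys, for illustration only.
EXAMPLE_ITEMS = {
    "Q_A": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "Q_B": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "Q_C": "CCCCCCCCCCCCCC-UHFFFAOYSA-N",
}


def find_isomer_groups(item_keys):
    """Return {skeleton: [item IDs]} where one skeleton has several keys.

    Such groups are candidates for mixed-up isomeric forms.
    """
    by_skeleton = defaultdict(dict)
    for item_id, key in item_keys.items():
        skeleton = key.split("-")[0]
        by_skeleton[skeleton][item_id] = key
    groups = {}
    for skeleton, members in by_skeleton.items():
        if len(set(members.values())) > 1:
            groups[skeleton] = sorted(members)
    return groups


def find_duplicate_keys(item_keys):
    """Return {key: [item IDs]} for keys used by more than one item."""
    by_key = defaultdict(list)
    for item_id, key in item_keys.items():
        by_key[key].append(item_id)
    return {k: sorted(v) for k, v in by_key.items() if len(v) > 1}


isomer_groups = find_isomer_groups(EXAMPLE_ITEMS)
duplicates = find_duplicate_keys(EXAMPLE_ITEMS)
```

In this toy data `Q_A` and `Q_B` end up in one isomer group, so a human would check whether each article really describes the form its key encodes.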

What has been accomplished so far?

Discussion on WikiProject Chemistry: https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry#Wikidata_as_source_for_infobox_data

General consensus that info boxes should use Wikidata

Wikidata needs to improve on data quality

Of the 17,000 original chemical compound Wikidata items, 16,000 have been validated around an InChI key.

More chemical data have been imported, so they are readily available for new Wikipedia articles or for correcting existing ones.

Things that need your attention

I generated a list of items at Wikidata WikiProject Chemistry which need human intervention.

https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry#Annotation_in_which_species_chemical_compounds_are_found

Please have a look at those and unify the stereochemistry and identifiers around one unique InChI key!

Data maintenance in Wikidata

Our bots are written in Python (2.7 and 3.x compatible).

Python bots keep Wikidata in sync with authoritative data sources (PubChem, ChemSpider, ChEBI, ChEMBL).

Bots are run according to data release cycles of authoritative data sources.

Mechanisms in place for detection of inconsistencies.

Contributions of other Wikidata users are being accounted for, based on references.
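One way reference-based bookkeeping could work is sketched below. The slides do not describe the actual mechanism, so everything here is an assumption: the dictionary layout with `value` and `source` keys, the function name `plan_update`, and the three action names. The idea is that a bot only touches statements whose reference names its own data source, leaves statements referenced to other sources alone, and reports disagreements as inconsistencies.

```python
def plan_update(existing, source_id, new_value):
    """Decide what a bot should do for one property of one item.

    existing:  list of {"value": ..., "source": ...} dicts (assumed layout),
               where "source" is the source named in the statement's reference.
    source_id: the bot's own data source.
    new_value: the value in the current release of that source.

    Returns (action, conflicts): action is "add", "keep" or "update";
    conflicts lists statements from OTHER sources that disagree.
    """
    ours = [s for s in existing if s["source"] == source_id]
    others = [s for s in existing if s["source"] != source_id]
    conflicts = [s for s in others if s["value"] != new_value]

    if not ours:
        action = "add"      # nothing from our source yet
    elif all(s["value"] == new_value for s in ours):
        action = "keep"     # already in sync
    else:
        action = "update"   # our own statement is outdated
    return action, conflicts


# Hypothetical example: a user-added value disagrees with the source release.
action, conflicts = plan_update(
    [{"value": "old", "source": "SOURCE_A"},
     {"value": "other", "source": "USER_REF"}],
    source_id="SOURCE_A",
    new_value="new",
)
```

In the example the bot would update its own outdated statement and leave the user-contributed one in place, flagging it for review.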

Wikidata API and query endpoints

Three ways to access data:

Wikidata API allows read, write and full text search. (www.wikidata.org/w/api.php)

REST endpoint for fast, direct data access. (queryr.wmflabs.org/)

Wikidata query service (WDQS) as a SPARQL endpoint for complex queries. (query.wikidata.org/)

The SPARQL endpoint allows complex and also federated queries on the full WD content.

REST and SPARQL are still in beta mode.
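As a small illustration of the SPARQL route, the sketch below builds a WDQS request URL that looks up the item carrying a given InChI key (Wikidata property P235). The function names are hypothetical. The URL is only constructed, not fetched. The `/sparql` path and the `format=json` parameter reflect how the query service is commonly called and should be checked against the current WDQS documentation.

```python
import re
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def inchi_key_query(inchi_key):
    """Return a SPARQL query finding items with the given InChI key (P235)."""
    # Validate first so that nothing unexpected is pasted into the query.
    if re.fullmatch(r"[A-Z]{14}-[A-Z]{10}-[A-Z]", inchi_key) is None:
        raise ValueError("not a well-formed InChI key: %r" % inchi_key)
    return 'SELECT ?item WHERE {{ ?item wdt:P235 "{}" . }}'.format(inchi_key)


def build_wdqs_url(sparql):
    """Return the GET URL for a query, asking for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


query = inchi_key_query("UEJJHQNACJXSKW-UHFFFAOYSA-N")
url = build_wdqs_url(query)
```

Fetching `url` with any HTTP client would return the matching items as JSON. More than one result for a single key would indicate duplicate items to merge or fix.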

Acknowledgments

Andrew Su, Benjamin Good, Tim Putman, Julia Turner, Gregg Stupp (TSRI)

Gang Fu, Evan Bolton (NIH, PubChem)

Andra Waagmeester (Micelio.be)

Elvira Mitraka, Lynn Schriml (Disease Ontology, U Baltimore)