Text Mining for Chemistry and Building a Public Platform for Document Markup

Antony Williams

Upload: orcid-0000-0002-2668-4821

Post on 10-May-2015

DESCRIPTION

The identification of chemical names in documents has provided platforms that enable structure-based searching of patents and the markup of chemistry publications. A natural extension is the ability to make chemistry articles, blog pages, wiki pages and other documents searchable by the extracted chemical structures. The ChemSpider database is built on over 21 million unique chemical entities from close to 200 data sources and provides a rich resource of information for chemists. We will report on our efforts to integrate chemical name extraction with the ChemSpider platform to enable structure searching of Open Access chemistry articles and online chemistry materials. We will unveil our online document markup platform for chemists to make both their open- and closed-access publications searchable by the language of chemistry – the structure.

TRANSCRIPT

Page 1: Text Mining for Chemistry and Building a Public Platform for Document Markup

Text mining for chemistry and building a public platform for document markup

Antony Williams

Page 2: Text Mining for Chemistry and Building a Public Platform for Document Markup

Building the Primary Web Portal for Chemistry

Searching and Reading Articles…

Online search tools for chemistry articles are generally text-based

Searching articles based on chemical structure and substructure is very expensive, but this is changing

Text mining is a “hot area” of research, but what is public? What depends on public curation?

Page 3: Text Mining for Chemistry and Building a Public Platform for Document Markup

Text-Based Search Tools

Google

PubMed

Google Scholar

Publishers' websites

And tens of other resources…

Page 4: Text Mining for Chemistry and Building a Public Platform for Document Markup

Vancomycin Through PubChem

Page 5: Text Mining for Chemistry and Building a Public Platform for Document Markup

Vancomycin Text Searches

PubMed

Google Scholar

Page 6: Text Mining for Chemistry and Building a Public Platform for Document Markup

Online Structure Searching of Articles

Some capabilities from publishers are starting to show up

Page 7: Text Mining for Chemistry and Building a Public Platform for Document Markup

Publishers should adopt/add InChIs

RSC and Nature Publishing Group have!

Page 8: Text Mining for Chemistry and Building a Public Platform for Document Markup

Page 9: Text Mining for Chemistry and Building a Public Platform for Document Markup

ChemMantis - Single Click Mark-up

Page 10: Text Mining for Chemistry and Building a Public Platform for Document Markup

Name-Structure Pairs

Page 11: Text Mining for Chemistry and Building a Public Platform for Document Markup

Converting Detected Names…

Names are searched against a validated dictionary (this expands as ChemSpider is curated)

If not found, they are passed through a Name to Structure algorithm

If they cannot be converted, ChemSpider is searched for non-validated names
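The three lookup steps on this slide can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: the function name `resolve_name`, the status labels, and the toy dictionaries are hypothetical, SMILES strings stand in for stored structures, and `toy_name_to_structure` is a placeholder for a real name-to-structure converter rather than the algorithm the platform uses.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical status labels; the slides do not name them.
VALIDATED = "validated"
NAME_TO_STRUCTURE = "name_to_structure"
NON_VALIDATED = "non_validated"
UNRESOLVED = "unresolved"


def resolve_name(
    name: str,
    validated: Dict[str, str],
    name_to_structure: Callable[[str], Optional[str]],
    non_validated: Dict[str, str],
) -> Tuple[str, Optional[str]]:
    """Return (status, structure) for one detected chemical name."""
    key = name.strip().lower()

    # Step 1: look the name up in the validated dictionary.
    if key in validated:
        return VALIDATED, validated[key]

    # Step 2: fall back to a name-to-structure conversion.
    structure = name_to_structure(name)
    if structure is not None:
        return NAME_TO_STRUCTURE, structure

    # Step 3: last resort, search the non-validated names.
    if key in non_validated:
        return NON_VALIDATED, non_validated[key]

    return UNRESOLVED, None


# Toy data for illustration only.
TOY_VALIDATED = {"benzene": "c1ccccc1"}
TOY_NON_VALIDATED = {"wood spirit": "CO"}


def toy_name_to_structure(name: str) -> Optional[str]:
    """Stand-in for a real converter; it only knows one systematic name."""
    return {"ethanol": "CCO"}.get(name.strip().lower())
```

Returning the status alongside the structure keeps track of which step resolved the name, which is the information a markup layer would need to style each hit differently.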

Page 12: Text Mining for Chemistry and Building a Public Platform for Document Markup

RED Underline

Non-validated, Cannot Convert through NTS

“Names” can be added to Suppress List

Page 13: Text Mining for Chemistry and Building a Public Platform for Document Markup

BLUE Underline

Name to Structure Converted
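The red and blue underlines and the Suppress List described on these slides could be rendered along the following lines. This is a rough sketch, not the ChemMantis implementation: the function `mark_up`, the CSS class names, and the status strings are all hypothetical, and the input is assumed to be plain text with simple word-like names.

```python
import re
from typing import Dict, FrozenSet

# Hypothetical mapping from resolution status to an underline style.
UNDERLINE_CLASS = {
    "non_validated": "underline-red",       # found only as a non-validated name
    "name_to_structure": "underline-blue",  # converted by name-to-structure
}


def mark_up(
    text: str,
    statuses: Dict[str, str],
    suppress: FrozenSet[str] = frozenset(),
) -> str:
    """Wrap each detected name in a span, skipping suppressed names."""
    lookup = {
        name.lower(): status
        for name, status in statuses.items()
        if name.lower() not in suppress
    }
    if not lookup:
        return text

    # Longest names first so that a longer name wins over a name it contains.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match: "re.Match[str]") -> str:
        css = UNDERLINE_CLASS.get(lookup[match.group(0).lower()])
        if css is None:
            return match.group(0)  # no style defined for this status
        return f'<span class="{css}">{match.group(0)}</span>'

    return pattern.sub(wrap, text)
```

In this sketch a word such as "lead" that is wrongly detected as a chemical can be silenced by adding it to `suppress`, which mirrors the role the slides give to the Suppress List.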

Page 14: Text Mining for Chemistry and Building a Public Platform for Document Markup

Deposit Structures

Page 15: Text Mining for Chemistry and Building a Public Platform for Document Markup

Entity extraction built around modified algorithms from SureChem

Optimized for “publications”

Dictionaries for chemical entities, groups, reactions, elements, families, species…

Dictionaries can be expanded – presently adding PDB
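Dictionary-driven extraction of the kind listed above can be illustrated with a plain lookup over several categories. The sketch below is an assumption-laden stand-in: it uses simple case-insensitive matching rather than the modified SureChem algorithms, and the function `tag_entities`, the category names, and the dictionary entries are invented for the example.

```python
import re
from typing import Dict, List, Set, Tuple

Match = Tuple[int, int, str, str]  # (start, end, matched text, category)


def tag_entities(text: str, dictionaries: Dict[str, Set[str]]) -> List[Match]:
    """Find dictionary terms in text and label each with its category."""
    term_category: Dict[str, str] = {}
    for category, terms in dictionaries.items():
        for term in terms:
            # The first dictionary to list a term decides its category.
            term_category.setdefault(term.lower(), category)
    if not term_category:
        return []

    # Longest terms first so multi-word entries win over shorter ones.
    ordered = sorted(term_category, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"(?<!\w)" + re.escape(t) + r"(?!\w)" for t in ordered),
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), term_category[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Toy dictionaries for illustration only.
TOY_DICTIONARIES = {
    "reactions": {"Diels-Alder reaction"},
    "elements": {"sodium"},
    "groups": {"hydroxyl"},
}
```

Because each dictionary is just another key in the mapping, expanding coverage (for example with a set of PDB codes) would not require changing the matching code in this sketch.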

Page 16: Text Mining for Chemistry and Building a Public Platform for Document Markup

Species…

Page 17: Text Mining for Chemistry and Building a Public Platform for Document Markup

What do you do with a markup system?

Test it, show it off and make it available…

Tested on chemistry articles, so why not HOST articles?

…and create an online journal…

Page 18: Text Mining for Chemistry and Building a Public Platform for Document Markup

The ChemSpider Journal

Page 19: Text Mining for Chemistry and Building a Public Platform for Document Markup

Open Access Community Journal

Page 20: Text Mining for Chemistry and Building a Public Platform for Document Markup

Deposit Article

Import URL or Document

Copy-Paste

Markup

Page 21: Text Mining for Chemistry and Building a Public Platform for Document Markup

Copy-Paste Version

Martin Walker Monthly Article

Page 22: Text Mining for Chemistry and Building a Public Platform for Document Markup

Chemical names

Page 23: Text Mining for Chemistry and Building a Public Platform for Document Markup

Names, Elements, Groups, Families

Page 24: Text Mining for Chemistry and Building a Public Platform for Document Markup

Outlinks

Page 25: Text Mining for Chemistry and Building a Public Platform for Document Markup

Mark Up Open Access Article

Page 26: Text Mining for Chemistry and Building a Public Platform for Document Markup

Online Journals and Live Data

Page 27: Text Mining for Chemistry and Building a Public Platform for Document Markup

A Community Resource of Spectra

Spectra deposited on ChemSpider as “Open Data” are available to anybody to “Embed” in their articles, blogs, wikis, etc.

Page 28: Text Mining for Chemistry and Building a Public Platform for Document Markup

Present Dictionaries

Chemical names - ChemSpider Validated Names

Reactions - Wikipedia Named Reactions and RSC Reaction Ontology reactions

Species – Wikipedia “species”

To add – New Dictionaries

PDB codes

IUPAC Gold Book

Page 29: Text Mining for Chemistry and Building a Public Platform for Document Markup

Conclusions

The internet enables chemistry – and at a reduced cost

Web 2.0 is here and improving quality – to benefit 3.0

Question Quality!

Crowdsourcing for expansion, curation and integration

Classical models may die quite quickly – business models must change soon or fail

Publishers – heed the proliferation of InChIs for Chemistry