Text Mining for Chemistry and Building a Public Platform for Document Markup
DESCRIPTION
The identification of chemical names in documents has provided platforms that enable structure-based searching of patents and markup of chemistry publications. A natural extension is the ability to make chemistry articles, blog pages, wiki pages and other documents searchable by the extracted chemical structures. ChemSpider is built on a database of over 21 million unique chemical entities from close to 200 data sources and provides a rich resource of information for chemists. We will report on our efforts to integrate chemical name extraction with the ChemSpider platform to enable structure searching of Open Access chemistry articles and online chemistry materials. We will unveil our online document markup platform for chemists to make both their open- and closed-access publications searchable by the language of chemistry: the structure.
TRANSCRIPT
Text mining for chemistry and building a public platform for document markup
Antony Williams
Building the Primary Web Portal for Chemistry
Searching and Reading Articles…
Online search tools for chemistry articles are generally text-based
Searching articles based on chemical structure and substructure is very expensive… but is changing
Text-mining is a “hot area” of research… but what is public? What depends on public curation?
Text-Based Search Tools
Google
PubMed
Google Scholar
Publishers' websites
And tens of other resources…
Vancomycin Through PubChem
Vancomycin Text Searches
PubMed
Google Scholar
Online Structure Searching of Articles
Some capabilities from publishers are starting to show up
Publishers should adopt/add InChIs
RSC and Nature Publishing Group have!
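One way to see why publisher-supplied InChIs matter: once each article carries the InChI strings of its compounds, an exact structure lookup reduces to a dictionary lookup. The sketch below is illustrative only. The article identifiers, the `ARTICLES` table and both function names are hypothetical and do not describe how RSC, Nature Publishing Group or ChemSpider store their data. The two InChI strings are the standard InChIs for ethanol and benzene.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical articles, each listing the InChIs a publisher attached to it.
ARTICLES: Dict[str, List[str]] = {
    "article-1": ["InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"],   # ethanol
    "article-2": [
        "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",             # benzene
        "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",              # ethanol
    ],
}


def build_inchi_index(articles: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each InChI string to the set of article ids that mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for article_id, inchis in articles.items():
        for inchi in inchis:
            index[inchi].add(article_id)
    return index


def find_articles(index: Dict[str, Set[str]], inchi: str) -> List[str]:
    """Return the sorted ids of articles tagged with exactly this InChI."""
    return sorted(index.get(inchi, set()))


index = build_inchi_index(ARTICLES)
ethanol_hits = find_articles(index, "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

This covers exact-structure matches only. Substructure searching needs a cheminformatics toolkit and is outside the scope of this sketch.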
ChemMantis - Single Click Mark-up
Name-Structure Pairs
Converting Detected Names…
Names are searched against a validated dictionary (this expands as ChemSpider is curated)
If not found, they are passed through a Name to Structure algorithm
If they cannot be converted, ChemSpider is searched for non-validated names
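The three-step fallback above can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not the ChemMantis implementation. The `Resolution` type, the `resolve_name` function, the toy lookup tables and the stand-in `toy_name_to_structure` converter are all hypothetical, and structures are written as SMILES strings purely for convenience.

```python
from typing import Callable, Dict, NamedTuple, Optional


class Resolution(NamedTuple):
    name: str
    structure: Optional[str]  # SMILES here; the representation is an assumption
    source: str               # "validated", "nts", "non-validated" or "unresolved"


def resolve_name(
    name: str,
    validated: Dict[str, str],
    name_to_structure: Callable[[str], Optional[str]],
    non_validated: Dict[str, str],
) -> Resolution:
    """Try each source in the order given on the slide; stop at the first hit."""
    key = name.strip().lower()
    # Step 1: validated (curated) dictionary.
    if key in validated:
        return Resolution(name, validated[key], "validated")
    # Step 2: algorithmic name-to-structure conversion.
    structure = name_to_structure(key)
    if structure is not None:
        return Resolution(name, structure, "nts")
    # Step 3: non-validated names.
    if key in non_validated:
        return Resolution(name, non_validated[key], "non-validated")
    return Resolution(name, None, "unresolved")


# Toy data standing in for the real dictionaries and the real converter.
VALIDATED = {"aspirin": "CC(=O)Oc1ccccc1C(O)=O"}
NTS_TABLE = {"propan-2-one": "CC(C)=O"}
NON_VALIDATED = {"grain alcohol": "CCO"}


def toy_name_to_structure(name: str) -> Optional[str]:
    """Placeholder for a real systematic-name parser."""
    return NTS_TABLE.get(name)


result = resolve_name("Aspirin", VALIDATED, toy_name_to_structure, NON_VALIDATED)
```

Recording which source produced the structure lets a markup layer treat validated, algorithmically converted and non-validated names differently.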
RED Underline
Non-validated, Cannot Convert through NTS
“Names” can be added to Suppress List
BLUE Underline
Name to Structure Converted
Deposit Structures
Entity Extraction built around modified algorithms from SureChem
Optimized for “publications”
Dictionaries for chemical entities, groups, reactions, elements, families, species…
Dictionaries can be expanded – presently adding PDB
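Dictionary-driven tagging of the kind listed above can be illustrated with a longest-match lookup over per-category term lists. This is a sketch, assuming small hand-made dictionaries. It is not the SureChem-derived extractor, and the `DICTIONARIES` contents, `build_pattern` and `tag_entities` are hypothetical names chosen for the example.

```python
import re
from typing import Dict, List, Pattern, Set, Tuple

# Tiny illustrative dictionaries; a real system would load far larger ones.
DICTIONARIES: Dict[str, Set[str]] = {
    "chemical": {"benzene", "acetic acid", "ethanol"},
    "group": {"hydroxyl", "methyl"},
    "element": {"sodium", "carbon"},
    "reaction": {"diels-alder reaction", "aldol condensation"},
    "family": {"alkaloids", "ketones"},
}


def build_pattern(dictionaries: Dict[str, Set[str]]) -> Tuple[Pattern[str], Dict[str, str]]:
    """Compile one case-insensitive regex covering every term in every dictionary."""
    term_to_category = {
        term.lower(): category
        for category, terms in dictionaries.items()
        for term in terms
    }
    # Longest terms first, so multi-word entries win over shorter overlaps.
    ordered = sorted(term_to_category, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)
    return pattern, term_to_category


def tag_entities(text: str, dictionaries: Dict[str, Set[str]] = DICTIONARIES) -> List[Tuple[str, str]]:
    """Return (matched text, category) pairs in order of appearance."""
    pattern, term_to_category = build_pattern(dictionaries)
    return [
        (match.group(0), term_to_category[match.group(0).lower()])
        for match in pattern.finditer(text)
    ]


tags = tag_entities("The Diels-Alder reaction of benzene in ethanol")
```

Expanding coverage then amounts to adding another entry to `DICTIONARIES`, for example a set of PDB codes under a new category.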
Species…
What do you do with a markup system?
Test it, show it off and make it available…
Tested on chemistry articles, so why not HOST articles?
…and create an online journal…
The ChemSpider Journal
Open Access Community Journal
Deposit Article
Import URL or Document
Copy-Paste
Markup
Copy-Paste Version
Martin Walker Monthly Article
Chemical names
Names, Elements, Groups, Families
Outlinks
Mark Up Open Access Article
Online Journals and Live Data
A Community Resource of Spectra
Spectra deposited on ChemSpider as “Open Data” are available to anybody to “Embed” in their articles, blogs, wikis, etc.
Present Dictionaries
Chemical names - ChemSpider Validated Names
Reactions - Wikipedia Named Reactions and RSC Reaction Ontology reactions
Species - Wikipedia “species”
To add - New Dictionaries
PDB codes
IUPAC Gold Book
Conclusions
The internet enables chemistry – and at a reduced cost
Web 2.0 is here and improving quality – to benefit 3.0
Question Quality!
Crowdsourcing for expansion, curation and integration
Classical models may die quite quickly – business models must change soon or fail
Publishers – heed the proliferation of InChIs for Chemistry