content mining a short introduction to practices and policies

STM Future Lab

Content Mininga short introduction to practices and policiesSummary of a study for the Publishing Research Consortiuminto Journal Article Mining,By Eefke Smit, Maurits van der Graaf (2011)

Full study available on PRC website

1Lets start with a potential user (1)Use-case-1: keeping up-to-dateSince 1982: 90,000 journal articles on neuroregeneration (e.g. spinal cord injury)New articles: on average 22 journal articles per day on neuroregeneration

Prof. Joost Verhaagen PhD, Netherlands Institute for Neuroscience, Amsterdam

2Lets start with a potential user (2)Use-case-2: Information needed as result of laboratory experimentsWhich molecules do play a role in this process?Typical outcome of an experiment: hundreds of molecules show enhanced activityNext step: how to filter out the relevant molecules?You would like to have for each of these molecules a meta-analysis about what is already known about these molecules in other processes

Prof. Joost Verhaagen PhD, Netherlands Institute for Neuroscience, Amsterdam

3The essence of TDM is:So much information to analyze:Can a machine do this for him ?

Text mining tool for semantic search by PubTator, see http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/tutorial/index.htmlWhat TDM looks like:

Typical text mining consists ofProcessing large corpora of text in an automated wayTo identify entities, instances, actions, relationships and patterns and do further analysis (e.g. on assertions or sentiments)Examples: genes, proteins, gene-disease patterns, compound properties, chemical structures, side effects of drugsText mining output typically consists of a new metadata layer for the information:Article clusters and categorisations, indexesTopical maps, to show the occurrence of topics and their inter-relationshipsDatabases with facts, patterns, relationships, statements, assertions, properties found in the articles, Visualisations like graphs, mappings, plot-graphs and topical maps

Study commisioned in 2011 by the Publishing Research ConsortiumAuthors:Eefke Smit, Intnl Assoc of STM PublishersMaurits van der Graaf, Pleiade Management & Consultancy Two parts:Qualitative study:29 interviews with experts in academia, research, libraries, vendors and publishersQuantitative studySurvey among publishers (members Crossref & STM)190 responsesFull report on PRC website www.publishingresearch.netArticle in the 1st issue of 2012 of Learned Publishing

7Optimists and Pessimists on TDMSkeptics about TDM:Has always over-promisedOnly in specialized fieldsTools still complicatedManual curation necessaryHigh investmentsDomain dependentNo common dictionaryOverambition in the promise of knowledge discoveryOptimists about TDM:Vast digital corpus available and growingMore and more application areas (business, legal, social, etc)Tools improving fastManual work reducedPublic domain or domain precisionProcessing power less of a problem, analytical tools better, visualisation adds to analysis

8

Publishers are optimistic:Opinions/ expectations for Content Mining in the next 3 years

Publishers are optimistic, continued:Opinions/ expectations for Content Mining on scholarly content in the next 3 years

10

but publishers do not yet get many mining requests from 3rd parties:11

Publishers are liberal in allowing mining:How case-by-case requests are treated 12

and plan more mining themselves:for retrieval and navigation

Cross-sector solutions to facilitate Content Mining betterSuggestions made by experts during the interviews:

Standardization of Content FormatsOne Content Mining platformCommonly agreed access and permission termsOne window for mining permissionsCollaboration with national libraries

(ad 3: most interviewed experts do NOT see Open Access as a related issue; access terms also relate to datafile delivery or mining on the platform itself)

Survey results for the 5 suggestions for cross-sector solutions

Standardisation best prefered,of content formats and of APIsTop 3 for all Respondents:Standardisation of FormatsOne Mining PlatformAgreed Permission Terms

Top 3 for Experts only: Standardisation of FormatsAgreed Permission TermsOne Mining PlatformExperts believe less in one platform and support standardisation even stronger, not just for content, also for APIs:

Progress since 2011 on collective solutions facilitating TDM:Commonly agreed permission terms:STM standard clause for TDM on subscribed content for non commercial TDMSTM standard clause for pharma-TDM via Pharma-Documentention-Ring (PDR)One window for mining permissions across publishers:PLS/ CLA prototype, sepcially serving small and medium sized publishersObtain minable content in one place and provide standardized APIs and content formats:CCC content mining platformCrossref project-Prospect: single point web access for multiple publishers

Questions ?

Eefke SmitDirector Standards and TechnologyInternational Association of STM [email protected]

content mining a short introduction to practices and policies

Documents

typical text mining

mining r

text mining tool

journal articles

study available

years publishers

hundreds of molecules

relevant molecules