Sci Know Mine 2013: What can we learn from topic modeling on 350M academic documents?

What can we learn from topic modeling on 350M documents? William Gunn, Head of Academic Outreach, Mendeley. @mrgunn – https://orcid.org/0000-0002-3555-2054

Upload: william-gunn

Post on 05-Dec-2014


DESCRIPTION

Mendeley Talk

TRANSCRIPT

Page 1: Sci Know Mine 2013: What can we learn from topic modeling on 350M academic documents?

What can we learn from topic modeling on 350M documents?

William Gunn

Head of Academic Outreach

Mendeley

@mrgunn – https://orcid.org/0000-0002-3555-2054

Page 2

Based in London, Mendeley is researchers, graduates and software developers from...

Page 3

The opposite problem

We have the papers (400M) and are looking for the best way to turn them into structured knowledge.

We have useful triage indicators - #altmetrics, reproducibility

You have great use cases

Page 4

Mendeley extracts research data…

...and aggregates data in the cloud

Collecting rich signals from domain experts.

Page 5

Rich user profile data

Page 6

TEAM Project: academic knowledge management solutions

• Algorithms to determine the content similarity of academic papers

• Performing text disambiguation and entity recognition to differentiate between and relate similar in-text entities and authors of research papers.

• Developing semantic technologies and semantic web languages with the focus of metadata integration/validation

• Investigating profiling and user analysis technologies, e.g. based on search logs and document interaction.

• We will also improve folksonomies and, through that, ontologies of text.

• Finally, tagging behaviour will be analysed to improve tag recommendations and strategies.

• http://team-project.tugraz.at/blog/
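
The first bullet above, content similarity of academic papers, can be illustrated with a minimal sketch. The slides do not say which algorithm the TEAM project uses; TF-IDF weighting with cosine similarity is assumed here only because it is a common baseline, and the three abstracts are made-up examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so terms found in every document keep a small weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical abstracts, not taken from the Mendeley catalogue.
abstracts = [
    "Enzymatic catalysis of a hydrolysis reaction",
    "Catalysis and reaction kinetics of an enzymatic process",
    "Tag recommendation for social bookmarking systems",
]
vecs = tfidf_vectors(abstracts)
sim_01 = cosine(vecs[0], vecs[1])  # two chemistry abstracts
sim_02 = cosine(vecs[0], vecs[2])  # chemistry vs. tagging
print(sim_01, sim_02)
```

The two chemistry abstracts share several terms and score above zero, while the first and third share none and score exactly zero. This is the syntax-only limitation that the next slides address.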

Page 7

Semantics vs. Syntax

• Language expresses semantics via syntax

• Syntax is all a computer sees in a research article.

• How do we get to semantics?

• Topic Modeling!
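
As a rough illustration of how topic modeling gets from word counts to something closer to semantics, here is a tiny collapsed Gibbs sampler for latent Dirichlet allocation (LDA). This is a sketch under assumptions: the slides do not name the model or the software Mendeley used, and the function name `lda_gibbs`, the hyperparameters and the four toy documents are all hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns (theta, topic_word, vocab), where
    theta[d][k] is the estimated share of topic k in document d and
    topic_word[k][w] counts how often word w was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Turn per-document counts into smoothed topic proportions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab


# Hypothetical toy corpus: two chemistry-like and two networking-like documents.
docs = [
    "enzyme catalysis reaction enzyme substrate".split(),
    "reaction catalysis enzyme kinetics".split(),
    "network routing protocol network latency".split(),
    "protocol latency routing network".split(),
]
theta, topic_word, vocab = lda_gibbs(docs, n_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each printed row is one document's mixture over the two topics. On a corpus this small the split is only suggestive, and a production system at the scale described in the title would need a distributed or online implementation rather than this loop.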

Page 8

Distribution of Topics

[Bar chart. Categories: Bio, Phys, Engineer, Comp Sci, Psych & Edu, Business, Law, Other. Percentage axis from 0% to 35%.]

Page 9

Subcategories of Comp. Sci.

[Bar chart. Categories: AI, HCI, Info Sci, Software Eng, Networks. Percentage axis from 0% to 20%.]

Page 10
Page 11

Generated topics – Comp. Sci.

Page 12

Generated Topics - Biology

Page 13

Categorization is imperfect

Page 14

Categorization As A Process

Thing

Process

Reaction

Catalysis

Enzymatic

Page 15

Categorization As A Process

Thing

Process

Reaction

Catalysis

Enzymatic

Page 16

Categories change over time

Page 17

Can we assist triage?

Page 18

Code Project

Use case = mining research papers for facts to add to Linked Open Data (LOD) repositories and light-weight ontologies.

• Crowd-sourcing enabled semantic enrichment & integration techniques for integrating facts contained in unstructured information into the LOD cloud

• Federated, provenance-enabled querying methods for fact discovery in LOD repositories

• Web-based visual analysis interfaces to support human-based analysis, integration and organisation of facts

• Socio-economic factors – roles, revenue-models and value chains – realisable in the envisioned ecosystem.

• http://code-research.eu/

Page 19
Page 20
Page 21
Page 22

Metrics as a discovery tool

Page 23

“We didn’t see that a target is more likely to be validated if it was reported in ten publications or in two publications”

NATURE REVIEWS DRUG DISCOVERY 10, 712 (SEPTEMBER 2011)

Page 24

“Either the results were reproducible and showed transferability in other models, or even a 1:1 reproduction of published experimental procedures revealed inconsistencies between published and in-house data”

NATURE REVIEWS DRUG DISCOVERY 10, 712 (SEPTEMBER 2011)

Page 25

Amgen: 47 of 53 “landmark” oncology publications could not be reproduced.

Bayer: 43 of 67 oncology & cardiovascular projects were based on contradictory results

Dr. John Ioannidis: 432 publications purporting sex differences in hypertension, multiple sclerosis, or lung cancer. Only one data set was reproducible.

There is no Gold Standard
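
To put the Amgen and Bayer counts above on a common scale, the short calculation below converts them to percentages. It uses only the numbers on this slide; the helper name `share` is made up for the example, and the Ioannidis figure is left out because the slide does not give a comparable denominator.

```python
def share(count, total):
    """Return count / total as a percentage, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Counts exactly as quoted on the slide.
amgen_not_reproduced = share(47, 53)
bayer_contradictory = share(43, 67)

print(f"Amgen: {amgen_not_reproduced}% of landmark publications not reproduced")
print(f"Bayer: {bayer_contradictory}% of projects based on contradictory results")
```

That works out to roughly 89% for Amgen and 64% for Bayer.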

Page 26

Building a reproducibility dataset

• Mendeley and Science Exchange have started the Reproducibility Initiative

• working with Figshare & PLOS to host data & replication reports

• building open datasets backing high-impact work

• extending the “executable paper” concept to biomedical research

Page 27

Make it porous & part of the web.

Our success as a crowdsourcing platform is largely due to our openness & end-user usefulness.

Communities must be open if they are to thrive.

Page 28

www.mendeley.com

@mrgunn