Post on 05-Dec-2014

DESCRIPTION

Mendeley Talk

TRANSCRIPT

Page 1: VIVO 2013 Topic Modeling Entity Extraction

What can we learn from topic modeling on 350M documents?

William Gunn

Head of Academic Outreach

Mendeley

@mrgunn – https://orcid.org/0000-0002-3555-2054

Page 2

Who am I?

PhD Biomedical Science

I've been active in online science communities since 1995

Established the community program at Mendeley – 1700 advisors from 650 schools in 60 countries.

Lead the outreach to librarian, academic research, and tech communities

Page 3

Based in London, Mendeley is researchers, graduates and software developers from...

Page 4

Two new approaches

Embed a tool within the researcher workflow to capture data

Capture new kinds of data – usage of research objects, not just citations of papers.

Page 5

Mendeley extracts research data…

...and aggregates data in the cloud

Collecting rich signals from domain experts.

Page 6

Rich user profile data

Page 7

TEAM Project academic knowledge management solutions

• Algorithms to determine the content similarity of academic papers

• Performing text disambiguation and entity recognition to differentiate between and relate similar in-text entities and authors of research papers.

• Developing semantic technologies and semantic web languages with the focus of metadata integration/validation

• Investigating profiling and user analysis technologies, e.g. based on search logs and document interaction.

• We will also improve folksonomies and, through that, ontologies of text.

• Finally, tagging behaviour will be analysed to improve tag recommendations and strategies.

• http://team-project.tugraz.at/blog/
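The first bullet above, content similarity between papers, can be illustrated with a minimal sketch. The slides do not say which algorithm the TEAM project uses; TF-IDF weighting with cosine similarity is assumed here as one common baseline, and the sample abstracts and all function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps every weight positive, even for ubiquitous terms.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical abstracts, not taken from the Mendeley corpus.
abstracts = [
    "Enzymatic catalysis of the reaction depends on protein structure",
    "Protein structure determines the rate of enzymatic catalysis",
    "Tag recommendation improves search in social bookmarking systems",
]
vecs = tfidf_vectors(abstracts)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

The two biochemistry abstracts share most of their vocabulary and score well above the third, which shares no words with the first. A production system would likely add stop-word removal and stemming, which this sketch leaves out.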

Page 8

Semantics vs. Syntax

• Language expresses semantics via syntax

• Syntax is all a computer sees in a research article.

• How do we get to semantics?

• Topic Modeling!
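As a rough illustration of how topic modeling moves from surface word counts toward themes, here is a toy collapsed Gibbs sampler for latent Dirichlet allocation (LDA). The talk does not state which model or implementation was run on the Mendeley corpus, so the choice of LDA, the hyperparameters, and the four tiny documents are all assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of tokenized documents.

    Returns (doc_topic, topic_word, vocab); the first two are count tables.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                           # all tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n highest-count words for each topic."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: row[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result


# Hypothetical mini-corpus: two "biology" and two "computer science" documents.
docs = [
    "enzyme protein catalysis reaction enzyme".split(),
    "protein reaction enzyme catalysis".split(),
    "network software algorithm network".split(),
    "algorithm software network learning".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
print(top_words(topic_word, vocab))
```

The sampler only ever sees which words co-occur in which documents, which is the "syntax" of the slide; the word groupings it produces are what a human then reads as topics. On a corpus this small the result depends on the random seed, so treat the output as illustrative.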

Page 9

Distribution of Topics

[Bar chart. Vertical axis: 0% to 35% in 5% steps. Categories: Bio, Phys, Engineer, CompSci, Psych & Edu, Business, Law, Other]

Page 10

Subcategories of Comp. Sci.

[Bar chart. Vertical axis: 0% to 20% in 5% steps. Categories: AI, HCI, Info Sci, Software Eng, Networks]

Page 11
Page 12

Generated topics – Comp. Sci.

Page 13

Generated Topics - Biology

Page 14

Categorization As A Process

Thing

Process

Reaction

Catalysis

Enzymatic
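The chain on this slide can be read as a path through a category tree. The sketch below assumes the labels are listed from most general to most specific (the slide itself does not say so) and stores them as child-to-parent links; the function names are illustrative.

```python
# Child -> parent links for the chain shown on the slide
# (assumed order: Thing is the root, Enzymatic the most specific label).
PARENT = {
    "Process": "Thing",
    "Reaction": "Process",
    "Catalysis": "Reaction",
    "Enzymatic": "Catalysis",
}


def category_path(label, parent=PARENT):
    """Walk from a label up to the root and return the path root-first."""
    path = [label]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return list(reversed(path))


def is_a(label, ancestor, parent=PARENT):
    """True if `ancestor` is the label itself or lies above it in the tree."""
    return ancestor in category_path(label, parent)


print(" > ".join(category_path("Enzymatic")))
```

Categorizing a paper then amounts to deciding how far down such a path it can be placed, one step at a time.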

Page 16

Categorization is imperfect

Page 17

Categories change over time

Page 18

Code Project

Use case = mining research papers for facts to add to LOD repositories and light-weight ontologies.

• Crowd-sourcing enabled semantic enrichment & integration techniques for integrating facts contained in unstructured information into the LOD cloud

• Federated, provenance-enabled querying methods for fact discovery in LOD repositories

• Web-based visual analysis interfaces to support human based analysis, integration and organisation of facts

• Socio-economic factors – roles, revenue-models and value chains – realisable in the envisioned ecosystem.

• http://code-research.eu/
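To make the use case concrete, a fact mined from a paper can be written as a subject-predicate-object statement that records where it came from. The sketch below serializes one such fact as an N-Quads line, using the graph label to carry the source paper as provenance. This is an assumed representation, not the CODE project's actual data model, and every URI under example.org is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str    # URI of the entity the fact is about
    predicate: str  # URI of the relation
    obj: str        # literal value extracted from the text
    source: str     # URI of the paper the fact was mined from (provenance)


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for a quoted literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_nquad(fact):
    """Serialize a fact as one N-Quads line; the graph label holds the source."""
    return (
        f'<{fact.subject}> <{fact.predicate}> '
        f'"{escape_literal(fact.obj)}" <{fact.source}> .'
    )


# Hypothetical fact; none of these URIs refer to real resources.
fact = Fact(
    subject="http://example.org/entity/lipase",
    predicate="http://example.org/vocab/catalyses",
    obj="ester hydrolysis",
    source="http://example.org/paper/123",
)
print(to_nquad(fact))
```

Keeping the source paper alongside each statement is what makes the provenance-enabled querying in the second bullet possible: a consumer can filter or weight facts by where they were extracted from.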

Page 19
Page 20
Page 21
Page 22

Metrics as a discovery tool

Page 23

Google Analytics for Research

Page 24

Building a reproducibility dataset

• Mendeley and Science Exchange have started the Reproducibility Initiative

• working with Figshare & PLOS to host data & replication reports

• building open datasets backing high-impact work

• extending the “executable paper” concept to biomedical research

Page 25

Make it porous & part of the web.

All these examples show that people's main motivation for getting data (pictures, bookmarks, etc.) off their computers and onto the web is that it helps them find more of the same.

Communities must be open if they are to thrive.

Page 26

www.mendeley.com

@mrgunn