contextualization of topics - browsing through terms, authors, journals and cluster allocations

Introduction Method Results Summary Contextualization of Topics browsing through terms, authors, journals and cluster allocations Rob Koopman 1 Shenghui Wang 1 Andrea Scharnhorst 2 1 OCLC Research 2 DANS-KNAW ISSI 2015

Upload: shenghui-wang

Post on 12-Aug-2015

Introduction Method Results Summary

Contextualization of Topics: browsing through terms, authors, journals and cluster allocations

Rob Koopman (1), Shenghui Wang (1), Andrea Scharnhorst (2)

(1) OCLC Research, (2) DANS-KNAW

ISSI 2015

Introduction

What are the essence and boundaries of a scientific field?

There are different ways to find clusters in scientific literature, based on connectivity in terms of authorship, citations, language similarity, etc.

Ambiguous nature of science

Ariadne: interactive context explorer

Ariadne is an interactive interface which allows users to explore the context of entities such as authors, journals, topical terms, etc.

It builds on semantic indexing statistically computed from a large-scale bibliographic corpus

It was originally implemented to explore 1M topical terms, 3M authors, 35K journals and 700+ Dewey decimal classes associated with 65M articles.

Research questions

Q1: How does the Ariadne algorithm work on a much smaller, field-specific dataset?

Q2: Can we use Ariadne to label the clusters produced by the different methods?

Q3: Can we use Ariadne to compare different clustering solutions?

LittleAriadne

LittleAriadne: context explorer over Astrophysics data

Offline: generates a semantic representation for each entity

Online: finds the most related entities and displays them using multidimensional scaling
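
The online step can be illustrated with a minimal sketch. It is not LittleAriadne's implementation: it assumes that entity vectors are already available as rows of a NumPy array, picks the k most related entities by cosine similarity, and lays them out in 2D with classical (Torgerson) multidimensional scaling on cosine distances. The slides do not say which MDS variant is used; the function names and the random toy vectors are hypothetical.

```python
import numpy as np

def cosine_to_all(query, vectors):
    """Cosine similarity between `query` and every row of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    return (vectors @ query) / np.where(norms == 0.0, 1.0, norms)

def classical_mds(distances, dims=2):
    """Classical (Torgerson) MDS: embed a distance matrix in `dims` dimensions."""
    n = distances.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (distances ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)  # ascending order
    top = np.argsort(eigenvalues)[::-1][:dims]
    # Cosine distance is not Euclidean, so negative eigenvalues are clipped.
    return eigenvectors[:, top] * np.sqrt(np.clip(eigenvalues[top], 0.0, None))

def related_layout(query_index, vectors, k=5):
    """Return the k entities most related to the query and their 2D positions."""
    similarities = cosine_to_all(vectors[query_index], vectors)
    nearest = np.argsort(-similarities)[:k]  # the query itself comes first
    selected = vectors[nearest]
    unit = selected / np.linalg.norm(selected, axis=1, keepdims=True)
    distances = 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)
    return nearest, classical_mds(distances)

# Hypothetical toy data: 20 entities with 16-dimensional vectors.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(20, 16))
nearest, coords = related_layout(0, toy_vectors, k=5)
print(nearest, coords.shape)
```

Classical MDS is used here only because it needs nothing beyond an eigendecomposition; an iterative stress-minimising MDS would fit the same role.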

An example article

Article ID ISI:000276828000006

Title On the Mass Transfer Rate in SS Cyg

Abstract The mass transfer rate in SS Cyg at quiescence, estimated from the observed luminosity of the hot spot, is log M-tr = 16.8 +/- 0.3. This is safely below the critical mass transfer rates of log M-crit = 18.1 (corresponding to log T-crit(0) = 3.88) or log M-crit = 17.2 (corresponding to the “revised” value of log T-crit(0) = 3.65). The mass transfer rate during outbursts is strongly enhanced

Author [author:smak j]

ISSN [issn:0001-5237]

Subject [subject:accretion, accretion disks] [subject:cataclysmic variables] [subject:disc instability model] [subject:dwarf novae] [subject:novae, cataclysmic variables] [subject:outbursts] [subject:parameters] [subject:stars] [subject:stars dwarf novae] [subject:stars individual ss cyg] [subject:state] [subject:superoutbursts]

Cluster label [cluster:a 19] [cluster:b 16] [cluster:c 15] [cluster:d 51] [cluster:e 17] [cluster:f 1]

Six different clustering solutions

x   Source     y = #Clusters   #Clusters in LittleAriadne
a   cwts 1.8   23              23
b   UMSI       23              23
c   oclc 20    20              20
d   hu         139             48
e   sts        5664            229
f   ECOOM      15              15

Entities in the Astrophysics dataset

There are in total 90,343 entities associated with 111,616 astrophysics articles

59 journals

27,027 author names (no disambiguation applied)

39,577 topical terms

23,322 subjects (extracted from "Author Keywords" and "Keywords Plus")

358 cluster labels (source + cluster id)

Build semantic representation

Basic assumptions

Entities can be represented by their context

Entities which share more context are more likely to be related

Context is the textual environment where an entity occurs
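
These assumptions can be made concrete with a small sketch. It assumes, purely for illustration, that the context of an entity is the set of other entities occurring in the same article record; the slides do not specify how context is weighted or whether title and abstract words are treated differently. The first record reuses entities from the example article, while the other two records and the `author:hypothetical ...` names are made up.

```python
from collections import Counter, defaultdict

def build_contexts(articles):
    """Map each entity to a Counter of the entities it co-occurs with."""
    contexts = defaultdict(Counter)
    for items in articles:
        unique = set(items)
        for entity in unique:
            for other in unique:
                if other != entity:
                    contexts[entity][other] += 1
    return contexts

def shared_context(contexts, a, b):
    """Number of distinct context items that entities a and b have in common."""
    return len(set(contexts[a]) & set(contexts[b]))

# Hypothetical records; only the first one is taken from the example article.
articles = [
    ["author:smak j", "issn:0001-5237", "subject:dwarf novae", "subject:outbursts"],
    ["author:hypothetical a", "issn:0001-5237", "subject:dwarf novae", "subject:stars"],
    ["author:hypothetical b", "issn:9999-9999", "subject:cosmology"],
]
contexts = build_contexts(articles)

# Two authors publishing on dwarf novae in the same journal share context;
# the cosmology author shares none.
print(shared_context(contexts, "author:smak j", "author:hypothetical a"))
print(shared_context(contexts, "author:smak j", "author:hypothetical b"))
```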

Dimension reduction using Random Projection

[Figure: entities of the example article, including the topical term "mass transfer rate", [subject:outbursts], [subject:stars], [subject:parameters], [author:smak j], [cluster:a 19] and [issn:0001-5237]]
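
As a rough illustration of dimension reduction by random projection, the sketch below multiplies a high-dimensional count matrix by a random Gaussian matrix. This is one common variant, chosen here as an assumption; the slides do not state which projection matrix, target dimension or weighting LittleAriadne uses, and the toy count vectors are synthetic.

```python
import numpy as np

def random_projection(counts, dims, seed=0):
    """Project the rows of `counts` onto `dims` random Gaussian directions."""
    rng = np.random.default_rng(seed)
    projection = rng.normal(0.0, 1.0 / np.sqrt(dims), size=(counts.shape[1], dims))
    return counts @ projection

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic context counts over 5000 context items for three entities:
# rows 0 and 1 share most of their context, row 2 shares none with row 0.
rng = np.random.default_rng(1)
counts = np.zeros((3, 5000))
shared = rng.choice(2500, size=200, replace=False)
counts[0, shared] = 1.0
counts[1, shared[:180]] = 1.0
counts[2, 2500 + rng.choice(2500, size=200, replace=False)] = 1.0

reduced = random_projection(counts, dims=256)
print(reduced.shape)
print(cosine(counts[0], counts[1]), cosine(reduced[0], reduced[1]))
print(cosine(counts[0], counts[2]), cosine(reduced[0], reduced[2]))
```

The point of the comparison is that cosine similarities in the reduced space stay close to those in the original space, so related entities remain related after the reduction.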

From semantic representation to visualisation and more

Each entity has its semantic representation

Cosine similarity between entities can be computed very fast; the 2D visualisation is built on it

For each article, we collect the semantic representations of all the entities it involves and take their average as the article's semantic representation

We applied a standard K-means clustering method to cluster these articles based on their semantic representations
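
The article-level steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: entity vectors are hypothetical 2D toy values, and K-means is written out as plain Lloyd iterations with a deterministic farthest-first initialisation, which the slides do not specify.

```python
import numpy as np

def article_vector(entity_ids, entity_vectors):
    """Average the vectors of all entities attached to one article."""
    return np.mean([entity_vectors[e] for e in entity_ids], axis=0)

def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with farthest-first initialisation (an assumption)."""
    centroids = [points[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(np.argmax(nearest))])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iterations):
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Hypothetical entity vectors and article records.
entity_vectors = {
    "subject:dwarf novae": np.array([1.0, 0.1]),
    "subject:outbursts": np.array([0.9, 0.0]),
    "subject:cosmology": np.array([0.0, 1.0]),
    "subject:inflation": np.array([0.1, 0.9]),
}
articles = [
    ["subject:dwarf novae", "subject:outbursts"],
    ["subject:outbursts"],
    ["subject:cosmology", "subject:inflation"],
    ["subject:inflation"],
]
points = np.array([article_vector(a, entity_vectors) for a in articles])
labels, centroids = kmeans(points, k=2)
print(labels)
```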

Experiment 1: Exploring context

Now we can explore

Let's start with stars

An overview of all journals

Contextual view of stars

Experiment 2: Labelling clusters

What is cluster a 2?

Cluster ID: Top 9 most related topical terms

a 2: "cosmology", "dark energy", "density perturbations", "cosmologies", "planck", "cosmological", "spatial curvature", "inflationary", "inflation"

b 2: "cosmology", "cosmological constant", "cosmologies", "cosmological", "universes", "dark energy", "quadratic", "tensor", "planck"

c 17: "power spectrum", "cosmological parameters", "cmb", "last scattering", "anisotropies", "microwave background", "power spectra", "planck", "cosmic microwave"

d 28: "density perturbations", "inflationary", "inflation", "dark energy", "scale invariant", "spatial curvature", "cosmological perturbations", "inflationary models", "cosmologies"
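
A list like the one above could be produced by ranking topical terms by their cosine similarity to a cluster label's vector. The sketch below shows that ranking only; the `term:` prefix for topical terms and all vector values are assumptions made for illustration and do not come from the LittleAriadne data.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_related_terms(label, vectors, k=9):
    """Rank topical terms by cosine similarity to the given cluster label."""
    target = vectors[label]
    terms = [name for name in vectors if name.startswith("term:")]
    ranked = sorted(terms, key=lambda name: cosine(target, vectors[name]), reverse=True)
    return ranked[:k]

# Hypothetical 3-dimensional vectors.
vectors = {
    "cluster:a 2": [1.0, 0.2, 0.0],
    "term:cosmology": [0.9, 0.1, 0.0],
    "term:dark energy": [0.8, 0.4, 0.1],
    "term:dwarf novae": [0.0, 0.1, 1.0],
}
print(top_related_terms("cluster:a 2", vectors, k=2))
```

As the summary notes, such a ranked list is raw material: expert knowledge is still needed to turn the terms into a real label.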

Experiment 3: Comparing clustering solutions

Cluster labels are treated as entities

Let’s compare
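
Because cluster labels are entities with their own vectors, two clustering solutions can be compared by looking for the closest counterpart of each cluster. The following sketch is one possible way to do this numerically, assuming label names of the form `cluster:<solution> <id>` as in the example article; the vector values are hypothetical, and the slides themselves rely on visual comparison rather than on this matching.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_matches(vectors, source, target):
    """For each cluster of `source`, find the most similar cluster of `target`."""
    source_labels = [n for n in vectors if n.startswith(f"cluster:{source} ")]
    target_labels = [n for n in vectors if n.startswith(f"cluster:{target} ")]
    matches = {}
    for name in source_labels:
        best = max(target_labels, key=lambda other: cosine(vectors[name], vectors[other]))
        matches[name] = (best, cosine(vectors[name], vectors[best]))
    return matches

# Hypothetical 2-dimensional vectors for four cluster labels.
vectors = {
    "cluster:a 2": [1.0, 0.1],
    "cluster:a 19": [0.1, 1.0],
    "cluster:b 2": [0.9, 0.2],
    "cluster:b 16": [0.0, 1.0],
}
matches = best_matches(vectors, "a", "b")
for name, (counterpart, similarity) in matches.items():
    print(name, "->", counterpart, round(similarity, 3))
```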

Highly similar clustering solutions

Partially agreeing clustering solutions

An overview of all clustering solutions

Summary

We present a method and an interface that allow visual exploration through the contexts of entities

We can provide the most related topical terms for clusters, although expert knowledge is needed to transform them into real labels/topics

LittleAriadne provides a visual way of comparing different clustering solutions

Our naïve way of clustering is worth exploring further

Future extensions

Add more types of entities, such as citations, publishers, conferences, etc., to provide richer context

Add direct links to articles to answer information retrieval needs

Study context sensitivity

compare "young" and "young"

Thank you

http://thoth.pica.nl/astro/relate

Rob Koopman ([email protected])
Shenghui Wang ([email protected])
Andrea Scharnhorst ([email protected])