Data You May Like: A Recommender System for Research Data Discovery

Anusuriya Devaraju (1), Rob Davy (2) and Dominic Hogan (2)

(1) Mineral Resources, (2) Information Management & Technology

IN21D: New Approaches to Data Discovery Across Geoscience Domains I, AGU 2016, 13th December 2016.

Image: orbital-recruitment.co.uk

Upload: anusuriya-devaraju

Post on 22-Jan-2018



Introducing Recommender Systems

We can classify recommender systems into two broad groups:

• Content-based filtering systems examine the properties of the items being recommended.

• Collaborative filtering systems recommend items based on similarity measures between users or item co-occurrences.

2 | Data You May Like: A Recommender System for Research Data Discovery | Anusuriya Devaraju et al.

Introducing Case Study

• Total collections (30.11.2016): 1853 (1802 public, 51 private)

• Domains: Agriculture & food, Astronomy & space science, Data61, Energy, Food & nutrition, Health & biosecurity, Land & water, Manufacturing, Mineral resources, Oceans & atmosphere


CSIRO Data Access Portal (DAP)

Motivations

• Direct search: data search is limited to matching on title, keywords and descriptions.

• Faceted browsing: the filters are exhaustive, and working through them is time-consuming.


A Recommender System for Research Data

• Search and recommendations are complementary.

• Enhances data visibility, especially for users unfamiliar with the datasets.


[Figure: a data user who is interested in (views or downloads) Dataset A is likely to be interested in datasets similar to Dataset A.]

Similar datasets may be determined based on:

• Explicit information, e.g., metadata of datasets.

• Implicit information, e.g., data consumption details inferred from logs.

Data Sources

• Explicit information: DAP Web Service (https://ws.data.csiro.au/)

• Implicit information: DAP Google Analytics Reporting API and server log files

Data Similarity Model

The overall similarity between two datasets ($D_i$, $D_j$) is:

$$
S(D_i, D_j) = \omega_1 S_{f_1}(D_i, D_j) + \omega_2 S_{f_2}(D_i, D_j) + \omega_3 S_{f_3}(D_i, D_j) + \dots + \omega_n S_{f_n}(D_i, D_j)
$$

• $D_i$, $D_j$: datasets

• $S$: overall similarity

• $\omega_1 \dots \omega_n$: feature weights

• $S_{f_n}(D_i, D_j)$: similarity between datasets $D_i$ and $D_j$ based on feature class $n$ (normalized to a value in [0, 1])
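The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the feature names, weight values and per-feature similarities are hypothetical, and treating a missing feature as similarity 0 is an assumption. Weights that sum to 1 (also an assumption, not stated in the slides) keep the overall score in [0, 1].

```python
from typing import Dict


def overall_similarity(feature_sims: Dict[str, float],
                       weights: Dict[str, float]) -> float:
    """Weighted sum S(Di, Dj) = sum over n of w_n * S_fn(Di, Dj)."""
    total = 0.0
    for feature, weight in weights.items():
        # Assumption: a feature with no computed similarity contributes 0.
        sim = feature_sims.get(feature, 0.0)
        if not 0.0 <= sim <= 1.0:
            raise ValueError(f"similarity for {feature!r} must be in [0, 1]")
        total += weight * sim
    return total


# Hypothetical weights and per-feature similarities for one dataset pair.
weights = {"title": 0.5, "keyword": 0.3, "download": 0.2}
sims = {"title": 0.8, "keyword": 0.5, "download": 0.0}
score = overall_similarity(sims, weights)
print(round(score, 2))
```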

Best Choice of Weights?

• Survey period: 08.06.2016 – 23.08.2016

• Respondents: data owners and consumers

• Number of respondents: 151


Feature Extraction

Feature class        Feature extraction                      Similarity measure
-------------------  --------------------------------------  -------------------
Title                TF-IDF score                            Cosine similarity
Description          TF-IDF score                            Cosine similarity
Keyword              TF-IDF score                            Cosine similarity
Activity             TF-IDF score                            Cosine similarity
Fields of Research   Presence/absence of research fields     Jaccard coefficient
Lead Researcher      Presence/absence of lead researchers    Jaccard coefficient
Contributor          Presence/absence of contributors        Jaccard coefficient
Search behaviour     Common query term                       Cosine similarity
Download             Datasets downloaded together            Cosine similarity
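The two families of measures in the table can be illustrated with a small, dependency-free Python sketch. It is not the system's code: the tokenised titles and the field-of-research codes are made up, and the smoothed IDF formula is one common variant chosen here as an assumption (the slides do not say which TF-IDF weighting was used).

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per tokenised document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        if not doc:
            vectors.append({})
            continue
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF (assumed variant): log((1 + n) / (1 + df)) + 1
            term: (count / len(doc))
            * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b[term] for term, value in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical tokenised dataset titles.
titles = [
    ["soil", "moisture", "map"],
    ["soil", "moisture", "survey"],
    ["galaxy", "radio", "survey"],
]
vecs = tfidf_vectors(titles)
sim_01 = cosine(vecs[0], vecs[1])  # two shared terms
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms

# Hypothetical field-of-research codes for two datasets.
for_sim = jaccard({"0403", "0406"}, {"0406", "0502"})
print(sim_01, sim_02, for_sim)
```

Both measures return values in [0, 1] for these inputs, which matches the normalization the similarity model expects from each feature class.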

Examples: Infer Related Datasets

• Common search term: datasets that appear in the search results for the same search term (e.g., "rock" returns Dataset 1, Dataset 2, Dataset 3, Dataset 4, …).

• Daily data download: datasets that are downloaded together.
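One way to turn such co-occurrences into a similarity score is to describe each dataset by the set of contexts (search terms, or download sessions) in which it appeared, and then compare those sets with cosine similarity on binary vectors. The sketch below is an assumption about how this could work, not a description of the deployed system; the event tuples and dataset identifiers are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def cooccurrence_profiles(
    events: Iterable[Tuple[str, str]]
) -> Dict[str, Set[str]]:
    """Map each dataset id to the set of contexts it appeared in.

    Each event is (context, dataset_id), where the context is a search
    term or a download session key.
    """
    profiles: Dict[str, Set[str]] = defaultdict(set)
    for context, dataset_id in events:
        profiles[dataset_id].add(context)
    return dict(profiles)


def binary_cosine(a: Set[str], b: Set[str]) -> float:
    """Cosine similarity of two binary vectors given as sets of contexts."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical log extract: (search term, dataset shown in the results).
events = [
    ("rock", "csiro:1"),
    ("rock", "csiro:2"),
    ("soil", "csiro:2"),
    ("soil", "csiro:3"),
]
profiles = cooccurrence_profiles(events)
sim_12 = binary_cosine(profiles["csiro:1"], profiles["csiro:2"])
sim_13 = binary_cosine(profiles["csiro:1"], profiles["csiro:3"])
print(sim_12, sim_13)
```

The same two functions apply to downloads if the context is a download session instead of a search term.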


System Architecture

[Architecture diagram. Components shown:]

• CSIRO Data Access Portal (DAP), with a "You may also like …" panel

• Recommender Service

• HDF

• SQL database

• Research Data Recommender Engine

• Data sources: CSIRO-DAP Web Service, Analytics Reporting API, DAP Server Logs

Examples of Web Service Requests

• Obtain the similarity results via a GET request:

http://{server-name}/simhdf?collection=DAP&nn=5&uw=0&target=csiro:6110

• Get the features and weights associated with a collection:

http://{server-name}/features?collection=DAP
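A client could assemble these request URLs as sketched below. The helper names and the host `recommender.example.org` are hypothetical placeholders for `{server-name}`; the query parameters (`collection`, `nn`, `uw`, `target`) are passed through exactly as they appear in the example requests, without assuming anything about their semantics or about the response format.

```python
from urllib.parse import urlencode


def similarity_url(server: str, target: str, collection: str = "DAP",
                   nn: int = 5, uw: int = 0) -> str:
    """Build the /simhdf request URL from the slide's example parameters."""
    query = urlencode(
        {"collection": collection, "nn": nn, "uw": uw, "target": target},
        safe=":",  # keep the colon in identifiers such as csiro:6110
    )
    return f"http://{server}/simhdf?{query}"


def features_url(server: str, collection: str = "DAP") -> str:
    """Build the /features request URL for a collection."""
    return f"http://{server}/features?{urlencode({'collection': collection})}"


# Hypothetical host standing in for {server-name}.
sim_request = similarity_url("recommender.example.org", "csiro:6110")
feat_request = features_url("recommender.example.org")
print(sim_request)
print(feat_request)
```

The resulting strings can be passed to any HTTP client, for example `urllib.request.urlopen`, once a real server name is known.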


Offline Evaluation

• Respondents: lead researchers

• Number of datasets evaluated: 52

• Evaluation period: 10.11.2016 – 30.11.2016

• Binary relevance tests

a. Top-ranked datasets: 51/52 datasets (98%) were rated as relevant.

– 1 dataset was ‘undecided’

b. Next-ranked datasets (not created by the evaluator): 46/52 datasets (88%) were rated as relevant.

– 6 datasets were rated as ‘less relevant’


Business unit       Number of evaluators
------------------  --------------------
Agriculture         11
ICT                 11
Energy              3
Land & Water        14
Manufacturing       3
Mineral Resources   10
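As a quick arithmetic check on the figures above, the evaluator counts can be summed and the relevance rates recomputed. The snippet only uses numbers reported on this slide; the variable and function names are illustrative.

```python
# Evaluators per business unit, as listed in the table.
evaluators = {
    "Agriculture": 11,
    "ICT": 11,
    "Energy": 3,
    "Land & Water": 14,
    "Manufacturing": 3,
    "Mineral Resources": 10,
}
total_evaluators = sum(evaluators.values())

datasets_evaluated = 52


def relevance_rate(relevant: int, total: int) -> float:
    """Share of datasets rated relevant, as a percentage."""
    return 100.0 * relevant / total


top_ranked = relevance_rate(51, datasets_evaluated)   # top-ranked test
next_ranked = relevance_rate(46, datasets_evaluated)  # next-ranked test
print(total_evaluators, round(top_ranked, 1), round(next_ranked, 1))
```

The evaluator counts add up to 52, the same as the number of datasets evaluated, and the two rates come to roughly 98.1% and 88.5%.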

What’s Next?

• Enhance the model by adding new features: spatial and temporal information.

• More evaluation: increase the number of evaluators, compare ranked lists, and apply 10-fold cross-validation.

• Apply the recommender model to infer similar research datasets from other repositories, e.g., data.gov.au, TERN, etc.


Mineral Resources
Anusuriya Devaraju, Postdoctoral Fellow

IMT
Robert Davy, Research Engineer

IMT
Dominic Hogan, Data Librarian

Acknowledgement:

• CSIRO eResearch Collaboration Project (ERRFP-368).

• CSIRO IMT Data Management Capability Enhancement Program (DMCEP).