Data You May Like: A Recommender System for Research Data Discovery
Anusuriya Devaraju (1), Rob Davy (2) and Dominic Hogan (2)
(1) Mineral Resources, (2) Information Management & Technology
IN21D: New Approaches to Data Discovery Across Geoscience Domains I, AGU 2016, 13 December 2016.
Introducing Recommender Systems
We can classify recommender systems into two broad groups:
• Content-based filtering systems examine properties of the items recommended.
• Collaborative filtering systems recommend items based on similarity measures between users or item co-occurrences.
2 | Data You May Like: A Recommender System for Research Data Discovery | Anusuriya Devaraju et al.
Introducing Case Study
• Total collections (30.11.2016): 1853 (1802 public, 51 private)
• Domains
  • Agriculture & food
  • Astronomy & space science
  • Data61
  • Energy
  • Food & nutrition
  • Health & biosecurity
  • Land & water
  • Manufacturing
  • Mineral resources
  • Oceans & atmosphere
CSIRO Data Access Portal (DAP)
Motivations
• Direct search
  • Data search is limited to titles, keywords and descriptions.
• Faceted browsing
  • Filters are exhaustive and time-consuming to work through.
A Recommender System for Research Data
• Search and recommendations are complementary.
• Enhances data visibility, especially for users unfamiliar with the datasets.
[Figure: a data user who is interested in Dataset A (views or downloads it) is likely to be interested in datasets similar to Dataset A.]
Similar datasets may be determined based on:
• Explicit information, e.g., metadata of datasets
• Implicit information, e.g., data consumption details inferred from logs
Data Sources
• Explicit information: DAP Web Service (https://ws.data.csiro.au/)
• Implicit information: DAP Google Analytics Reporting API and server log files
Data Similarity Model
The overall similarity between two datasets (Di, Dj) is:
S(Di, Dj) = ω1 Sf1(Di, Dj) + ω2 Sf2(Di, Dj) + ω3 Sf3(Di, Dj) + … + ωn Sfn(Di, Dj)
• Di, Dj: datasets
• S: overall similarity
• ω1 … ωn: feature weights
• Sfn(Di, Dj): similarity between datasets Di and Dj based on feature class n (normalized to [0, 1])
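The weighted sum above can be sketched in a few lines of Python. This is only an illustration of the formula, not the authors' implementation: the function name, the feature names and the weight values are all hypothetical, and treating a missing feature as a similarity of 0 is an assumption.

```python
from typing import Mapping


def overall_similarity(feature_sims: Mapping[str, float],
                       weights: Mapping[str, float]) -> float:
    """Weighted sum S(Di, Dj) = w1*Sf1(Di, Dj) + ... + wn*Sfn(Di, Dj)."""
    total = 0.0
    for feature, weight in weights.items():
        # Assumption: a feature with no computed similarity contributes 0.
        sim = feature_sims.get(feature, 0.0)
        if not 0.0 <= sim <= 1.0:
            raise ValueError(f"similarity for {feature!r} must lie in [0, 1]")
        total += weight * sim
    return total


# Hypothetical weights and per-feature similarities for one dataset pair.
example_weights = {"title": 0.5, "keyword": 0.3, "download": 0.2}
example_sims = {"title": 0.8, "keyword": 0.5, "download": 0.0}
example_score = overall_similarity(example_sims, example_weights)
print(round(example_score, 3))
```

If the weights are non-negative and sum to 1, the overall score stays in [0, 1] like the per-feature similarities.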
Best Choice of Weights?
• Survey period: 08.06.2016 – 23.08.2016
• Respondents: data owners and consumers
• Number of respondents: 151
Feature Extraction
Feature Class | Feature Extraction | Similarity Measure
Title | TF-IDF score | Cosine similarity
Description | TF-IDF score | Cosine similarity
Keyword | TF-IDF score | Cosine similarity
Activity | TF-IDF score | Cosine similarity
Fields of Research | Presence/absence of research fields | Jaccard coefficient
Lead Researcher | Presence/absence of lead researchers | Jaccard coefficient
Contributor | Presence/absence of contributors | Jaccard coefficient
Search behaviour | Common query terms | Cosine similarity
Download | Datasets downloaded together | Cosine similarity
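The two families of measures listed above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the slides do not say which TF-IDF variant or tokenisation was used, so the smoothed IDF below is one common choice, and the example titles and research fields are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(a, b):
    """Jaccard coefficient between two sets (0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical tokenised titles and research-field sets.
titles = [["soil", "moisture", "map"], ["soil", "carbon", "map"], ["galaxy", "survey"]]
vecs = tfidf_vectors(titles)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
print(jaccard({"Geology", "Geophysics"}, {"Geology"}))
```

Both measures return values in [0, 1] for non-negative inputs, which matches the normalized range the similarity model expects for each feature class.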
Examples : Infer Related Datasets
[Figure: two examples of inferring related datasets. Common Search Term: a search term such as "rock" returns Dataset 1, Dataset 2, Dataset 3, Dataset 4, …. Daily Data Download.]
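One way to turn such co-occurrences into a similarity score is sketched below. It assumes that usage logs have already been reduced to (context, dataset) pairs, where a context is a search term or a download session; that log format, the function name and the sample events are hypothetical and not taken from the DAP logs.

```python
import math
from collections import defaultdict


def cooccurrence_similarity(events):
    """Cosine similarity between datasets based on shared contexts.

    events: iterable of (context, dataset_id) pairs. Each dataset is treated
    as a binary vector over contexts, so the cosine reduces to
    |A & B| / sqrt(|A| * |B|).
    """
    contexts = defaultdict(set)
    for context, dataset in events:
        contexts[dataset].add(context)

    sims = {}
    ids = sorted(contexts)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = len(contexts[a] & contexts[b])
            if shared:  # keep only pairs that co-occur at least once
                sims[(a, b)] = shared / math.sqrt(len(contexts[a]) * len(contexts[b]))
    return sims


# Hypothetical search log: which datasets appeared for which search term.
search_events = [("rock", "D1"), ("rock", "D2"), ("soil", "D2"), ("soil", "D3")]
search_sims = cooccurrence_similarity(search_events)
print(search_sims)
```

The same function applies to downloads if each context is, for example, a (user, day) pair.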
Example
System Architecture
[Figure: architecture diagram. Labelled components: CSIRO Data Access Portal (DAP) with a "You may also like …" panel, Recommender Service, HDF, SQL database, Research Data Recommender Engine, CSIRO-DAP Web Service, Analytics Reporting API, DAP Server Logs.]
Examples of Web Service Requests
• Obtain the similarity result via a GET request:
  http://{server-name}/simhdf?collection=DAP&nn=5&uw=0&target=csiro:6110
• Get the features and weights associated with a collection:
  http://{server-name}/features?collection=DAP
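The two request URLs above can be assembled programmatically, as in the sketch below. It only builds the URLs and sends nothing. The base address is a hypothetical stand-in for {server-name}, the helper name is made up, and the meaning of the nn and uw parameters is not stated on the slide, so their values are simply copied from the example.

```python
from urllib.parse import urlencode


def build_request_url(base_url, endpoint, **params):
    """Build a GET URL for the recommender web service (hypothetical helper)."""
    # safe=":" keeps identifiers such as "csiro:6110" readable in the query.
    query = urlencode(params, safe=":")
    return f"{base_url.rstrip('/')}/{endpoint}?{query}"


BASE_URL = "http://localhost:8080"  # hypothetical stand-in for {server-name}

similar_url = build_request_url(
    BASE_URL, "simhdf", collection="DAP", nn=5, uw=0, target="csiro:6110"
)
features_url = build_request_url(BASE_URL, "features", collection="DAP")

print(similar_url)
print(features_url)
```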
Offline Evaluation
• Respondents: lead researchers
• Number of datasets evaluated: 52
• Evaluation period: 10.11.2016 – 30.11.2016
• Binary relevance tests
  a. Top-ranked datasets: 51/52 datasets (98%) were rated as relevant.
     – 1 dataset was ‘undecided’
  b. Next-ranked datasets (not created by the evaluator): 46/52 datasets (88%) were rated as relevant.
     – 6 datasets were rated as ‘less relevant’
Business Unit | Number of Evaluators
Agriculture | 11
ICT | 11
Energy | 3
Land & Water | 14
Manufacturing | 3
Mineral Resources | 10
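The evaluation figures can be checked with a short calculation. The sketch below uses only the counts reported above; the variable and function names are illustrative.

```python
# Evaluator counts per business unit, as reported in the table above.
evaluators = {
    "Agriculture": 11,
    "ICT": 11,
    "Energy": 3,
    "Land & Water": 14,
    "Manufacturing": 3,
    "Mineral Resources": 10,
}

total_evaluated = sum(evaluators.values())  # matches the 52 datasets evaluated


def percent_relevant(relevant, evaluated):
    """Share of datasets rated relevant, as a percentage with one decimal."""
    return round(100 * relevant / evaluated, 1)


top_ranked = percent_relevant(51, total_evaluated)   # top-ranked datasets
next_ranked = percent_relevant(46, total_evaluated)  # next-ranked datasets

print(total_evaluated, top_ranked, next_ranked)  # 52 98.1 88.5
```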
What’s Next?
• Enhance the model with new features: spatial and temporal information.
• More evaluation: more evaluators, comparison of ranked lists, 10-fold cross-validation.
• Apply the recommender model to infer similar research datasets from other repositories, e.g., data.gov.au, TERN, etc.
Anusuriya Devaraju, Postdoctoral Fellow, Mineral Resources
Robert Davy, Research Engineer, IMT
Dominic Hogan, Data Librarian, IMT
Acknowledgement:
• CSIRO eResearch Collaboration Project (ERRFP-368).
• CSIRO IMT Data Management Capability Enhancement Program (DMCEP).