Cloud Dataverse: A Data Repository Platform for an OpenStack Cloud

CLOUD DATAVERSE. Mercè Crosas (1), Orran Krieger (2), Piyanai Saowarattitada (2). (1) Institute for Quantitative Social Science (IQSS), Harvard University; (2) Massachusetts Open Cloud (MOC), Boston University

Upload: merce-crosas

Post on 22-Jan-2018


Category:

Data & Analytics



TRANSCRIPT

Page 1: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

CLOUD DATAVERSE

Mercè Crosas (1)

Orran Krieger (2)

Piyanai Saowarattitada (2)

(1) Institute for Quantitative Social Science (IQSS), Harvard University

(2) Massachusetts Open Cloud (MOC), Boston University

Page 2: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

DATA REPOSITORIES NEED CLOUDS

CLOUDS NEED DATA REPOSITORIES

Page 3: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

This Talk

1. The Need

– The rise of big data-centric computation

– The rise of modern data repositories

2. Our Platforms

– Dataverse: A premier open-source data repository platform

– MOC: Top collaborative OpenStack cloud with Big Data compute

3. The Solution

– Cloud Dataverse: Bringing MOC and Dataverse together

Page 4: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

AWS sees the value in data

“When data is made publicly available on AWS, anyone can analyze any volume of data without needing to download or store it themselves.”

Page 5: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Data and Compute lead to Discovery

Page 6: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

A wide range of fields and industries can benefit from access to data

Page 7: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

But AWS public datasets miss key aspects needed in data repositories

• Incentives to share data

• Citation to each version of the data

• Metadata for Discoverability

• Tiered access to non-public data

• Commitment to data archival & preservation

Page 8: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

The scientific community has been thinking about data archives and repositories for some time

Timeline of data archives and repositories (figure), grouped by field:

Social sciences

• 1957: Roper Center (public and operational)

• 1960: Zentral Archiv für Empirische Sozialforschung (Germany)

• 1962: ICPSR

• 1964: Steinmetz Archive (Netherlands)

• 1965: ODUM Data Archive

• 1967: UK Data Archive

Life sciences

• 1970: Protein Data Bank

• 1982: European Nucleotide Archive, GenBank

Astronomy

• 1966: National Space Science Data Center

• 1987: Astrophysics Data System

Earth sciences

• 1990: EOSDIS

• 1995: Pangaea

Page 9: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Number of data repositories grows with growth of data sharing

Timeline (figure): Dryad, Figshare, Zenodo, DataCite, Data Citation Principles; years shown: 2006, 2009, 2011, 2013

Number of data repositories (all types) from 2012 to 2016; source: re3data.org

More than 1,500 research data repositories

Page 10: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Today’s repositories incentivize data sharing by giving credit to data authors through formal citation

Persistent citations to datasets published in data repositories

Bibliography

Page 11: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

The Dataverse open-source platform enables building any type of data repository

Agriculture data

Repository in Fudan, China

Data from 20 Universities

Public data repository

Science Consortium

Page 12: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Challenges

• Datasets have to be small

• Hard to copy 40 PB over the internet

• Not everyone has the right compute infrastructure

DATAVERSE NEEDED A CLOUD

Page 13: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

The Massachusetts Open Cloud – an Open Cloud eXchange

Page 14: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Imagine shrinking Pacific Research Platform to the size of a building

Page 15: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Imagine shrinking Pacific Research Platform to the size of a building

Consortium comparable to Pacific Research Platform

• Huge community covering every field of research

• Collaborations across the globe

• Massive data and computational requirements

• Massive student population covering every discipline

Widths are proportional to enrollment

Page 16: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

MGHPCC Data Center

15 MW, 90,000 square feet + can grow

Tens of thousands of HPC users, potentially many more cloud users

Page 17: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

The MOC partnership

Page 18: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Today’s model of Cloud

Page 19: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

What we need:

an “Open Cloud eXchange (OCX)”

Diagram labels: C3DDB; HPC; Big Data; Web

Page 20: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

OpenStack is great, but… where is the data?

• We need:

– to share data between providers

– to expose cloud metadata to researchers/companies

• Our scientific users need:

– in-situ compute on public & community data sets

– control over with whom and how their data is shared

– a reduced barrier to exploiting rich tools to compute on the data

• Our commercial & public sector partners need:

– to share data with researchers/startups

– to reduce the risk/barrier of publishing data

– a model to expose technology in an environment with rich data

The MOC needs a modern dataset repository

OpenStack needs a dataset repository project

Page 21: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Dataverse Before Cloud Dataverse

Diagram labels: Data depositor; Data users; Compute

Page 22: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Dataverse After Cloud Dataverse

Diagram labels: Data depositor; Data users; Swift (Object Storage); Nova (Compute), shown twice; Horizon; Sahara (Analytics); Giji; UI

Page 23: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

What’s missing in Cloud?

Diagram labels: Data depositor; Data users; Swift (Object Storage); Nova (Compute), shown twice; Horizon; Sahara (Analytics); Giji; UI

Page 24: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

What’s missing in Dataverse?

Diagram labels: Data depositor; Data users; Swift (Object Storage); Nova (Compute), shown twice; Horizon; Sahara (Analytics); Giji

Page 25: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

So what is Cloud Dataverse?

Diagram label: Swift (Object Storage)

Page 26: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

DEMO: Billion Object Platform (BOP) GeoTweets

Page 27: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

Summary: BOP GeoTweets COLD Demo

Diagram labels: Data depositor; Data users/analyst; Tweets; BOP; BOP GeoTweets COLD report; Swift (Object Storage); Nova (Compute), shown twice; Horizon; Sahara (Analytics); Giji

Page 28: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

A Year in the Life of Cloud Dataverse

Timeline (figure). Dates shown: Summer 2016, Fall 2016, October 2016, December 2016, January 2017, May 2017, Summer 2017

Milestones shown:

• Dataverse Community review

• POC

• MOC Annual Workshop: POC preview

• Barcelona OpenStack Summit #vBrownBag

• Full collaborative development begins

• Boston OpenStack Summit (May 2017): Swift per repository, URI, demos

• Worldwide Data Federation (Summer 2017)

Page 29: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

World Wide Data Federation

Page 30: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

DATA REPOSITORIES NEED CLOUDS

CLOUDS NEED DATA REPOSITORIES

With Cloud Dataverse, we combine the power and scalability of an OpenStack cloud with the need to access data through a feature-rich repository

Page 31: Cloud Dataverse: A Data repository platform for an OpenStack Cloud

THANKS