Globus US/UK Data Workshop, 26 January 2015 (Blaiszik)


Globus Scientific Data Publication Services

Ben Blaiszik, Kyle Chard, Rachana Ananthakrishnan, Steve Tuecke, Ian Foster, and the Globus Team

blaiszik@uchicago.edu
www.globus.org

Computation Institute

Overview
•  What is Globus?
•  Globus Services
   –  Data publication
   –  Data cataloging
   –  Data transfer
   –  User authentication
   –  Groups
   –  Sharing


>8,000 endpoints
>85 U.S. campuses
European Globus Community: http://www.egcf.eu/


Globus is ...

Research data management delivered via SaaS

Big data transfer, sharing, publication, and discovery…

…directly from your own storage systems OR the cloud


Globus Delivers

SaaS Market Domination...

…for your photos

…for your e-mail

…for your entertainment

…for your research data


Research data management scenarios and challenges


“I need to easily, quickly, and reliably move or mirror portions of my data to other places.”

[Pictured endpoints: public cloud, research computing HPC cluster, lab server, personal laptop, XSEDE resource]


“I need to easily and securely share my data with my colleagues at other institutions.”

[Pictured: scientific instrumentation]


“I need to publish my data so that others can find it and use it.”

[Pictured publication tiers: scholarly publication, reference dataset, active research collaboration]


Globus Transfer
•  “Fire-and-forget” transfers
   –  Optimized transfers
   –  Automatic fault recovery
   –  Automatic retry
   –  Seamless security integration
   –  128-bit checksums
•  Intuitive web GUI and powerful APIs for automation (see the sketch below)
   –  REST and Python APIs
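This deck predates the current Globus Python SDK, but the fire-and-forget pattern it describes looks roughly like this today (a sketch: endpoint UUIDs and paths are placeholders, and token acquisition via Globus Auth is elided):

```python
import globus_sdk

# Token acquisition (Globus Auth / OAuth2) elided; see the globus-sdk docs.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER-TOKEN")
)

# One request describes the whole job; Globus handles optimization,
# retries, fault recovery, and checksum verification from here on.
tdata = globus_sdk.TransferData(
    tc, "SRC-ENDPOINT-UUID", "DST-ENDPOINT-UUID",
    label="fire-and-forget demo", sync_level="checksum",
)
tdata.add_item("/source/data/", "/dest/data/", recursive=True)

task = tc.submit_transfer(tdata)
print("task id:", task["task_id"])  # Globus notifies you on completion
```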


Globus moves the data for you

[Diagram: you submit a transfer request between two secure endpoints, A (e.g. a laptop) and B (e.g. midway); Globus performs the transfer and notifies you once it is complete.]

Globus, the Abridged Version

[Diagram: a metadata layer (Catalog, Data Publication, and a Discover plugin point [federation?]) sits on top of a data layer (Transfers, Groups, Sharing, User Auth, and endpoint file systems).]

* REST and Python APIs throughout

Globus Catalog
•  Automate metadata ingestion from instrumentation and acquisition machines
   –  API/CLI integration
•  Allow near real-time, metadata-driven feedback to experiments
•  Allow for insert points in the workflow
   –  Ingest at point of collection
   –  Catalog metadata and provenance
   –  Push to data store
   –  Push to local or external HPC
•  Allow building and sharing of typed metadata definitions (see the sketch below)
   –  e.g. build a definition set that specifically fits X-ray scattering data at your beamline
   –  Addresses the problem of T, temp, Temp, temperature, temperature_kelvin, ...
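The deck does not show what a typed definition looks like; here is a minimal sketch of the idea, assuming a simple alias table (the field names and structure are illustrative, not the Catalog's actual schema language):

```python
# A typed definition for one beamline field: canonical name, type,
# unit, and the aliases it absorbs (T, temp, Temp, ...).
TEMPERATURE = {
    "name": "temperature_kelvin",
    "type": float,
    "unit": "K",
    "aliases": {"T", "temp", "Temp", "temperature"},
}

def normalize(record: dict, definition: dict) -> dict:
    """Rewrite any alias key to the canonical, typed key."""
    out = {}
    for key, value in record.items():
        if key == definition["name"] or key in definition["aliases"]:
            out[definition["name"]] = definition["type"](value)
        else:
            out[key] = value
    return out

print(normalize({"Temp": "293"}, TEMPERATURE))
# -> {'temperature_kelvin': 293.0}
```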


•  Group data based on use and features, not location/filename
   –  Logical grouping to organize, search, and describe
•  Operate on datasets as units
•  Tag datasets with characteristics that reflect content
•  Share/move datasets for collaboration
•  Interact via REST API, Python API, GUI, and CLI

Globus Catalog hierarchy: Catalog → Datasets → Members
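To make the containment concrete, a sketch of the Catalog → Datasets → Members hierarchy as plain Python structures (illustrative only; the field names are assumptions, not the Catalog's actual data model):

```python
from dataclasses import dataclass, field

@dataclass
class Member:            # one item in a dataset, e.g. a file on an endpoint
    uri: str

@dataclass
class Dataset:           # a logical unit: tagged, shared, and moved as one
    name: str
    tags: dict = field(default_factory=dict)
    members: list = field(default_factory=list)

@dataclass
class Catalog:           # top level: a collection of datasets
    name: str
    datasets: list = field(default_factory=list)

scan = Dataset("2014-04-04-nf-hedm-scan", tags={"technique": "nf-hedm"})
scan.members.append(Member("globus://endpoint/path/img_0001.tif"))
catalog = Catalog("beamline-catalog")
catalog.datasets.append(scan)
```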

Globus Catalog Web User Interface

Near-field HEDM Workflow (Sharma, Almer)


** Supported by the Data Engines for Big Data LDRD (Wilde, Wozniak, Sharma, Almer, Blaiszik)

[Workflow diagram:
Detector: up to 1,000 datasets/week; each dataset is 360 files, 4 GB total.
1: Median calculation (MedianImage.c, uses Swift): 75 s (90% I/O).
2: Peak search (ImageProcessing.c, uses Swift): 15 s per file; yields a reduced dataset of 360 files, 5 MB total.
3: Generate parameters (FOP.c): 50 concurrent tasks, 25 s/task, ¼ CPU hour.
4: Analysis pass (FitOrientation.c): 10^5 concurrent tasks at 20 s/task (555 CPU hours), then 1 min/task (1,667 CPU hours).
Feedback to the experiment: overnight or real time; up to 2.2 M CPU hours per week. Real-time run: 4/4/2014 on Orthros.]
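The headline numbers follow from simple arithmetic on the per-task timings above; a quick check:

```python
# Back-of-envelope check of the slide's CPU-hour figures.
TASKS = 10**5                      # FitOrientation tasks per dataset

pass1 = TASKS * 20 / 3600          # 20 s/task  -> ~555 CPU hours
pass2 = TASKS * 60 / 3600          # 1 min/task -> ~1,667 CPU hours
per_dataset = pass1 + pass2        # ~2,222 CPU hours

datasets_per_week = 1000           # detector peak rate
print(f"{per_dataset * datasets_per_week / 1e6:.1f} M CPU hours/week")
# -> 2.2 M CPU hours/week
```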

Experimenting “in the data dark”
•  Feedback during each experiment was non-existent
•  Months were required to compute the information relevant for publication, or to find out the experiment was corrupted
•  Now: initial feedback over lunch, using Globus, Swift, and Catalog to leverage HPC and track metadata

Globus Data Publication
•  Operated as a hosted service
•  Designed for big data
•  Bring your own (per-collection) storage
•  Extensible metadata schemas and input forms
•  Customizable publication and curation workflows
•  Associate unique and persistent digital identifiers with datasets
•  Rich discovery model (in development)


[Workflow diagram: the researcher assembles a data set and describes it with metadata (Dublin Core and domain-specific); a curator reviews and approves; the data set is published on campus or other storage; peers and the public search for and discover data sets, then access and transfer them from the published data store.]

Data Publication Dashboard


Start a New Submission


Policies at the Collection Level
•  Required metadata and schemas
•  Data storage location
•  Metadata curation policies

Describe Submission: 1) Dublin Core


•  The scientist or a representative describes the data they are submitting
•  For this collection, Dublin Core and a collection metadata template are required
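For concreteness, the kind of record these two description steps produce, as a minimal sketch (the dc.* terms are standard Dublin Core; the values and the domain-specific field are invented for illustration):

```python
# Dublin Core terms plus a domain-specific block required
# by the collection's metadata template.
submission_metadata = {
    "dc.title": "Near-field HEDM scan, sample A",
    "dc.creator": ["Blaiszik, Ben"],
    "dc.date": "2015-01-26",
    "dc.subject": ["X-ray scattering", "HEDM"],
    "dc.description": "Reduced detector images and fit parameters.",
    # collection-specific template field (hypothetical)
    "beamline.temperature_kelvin": 293.0,
}
```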

Describe Submission: 2) Scientific Metadata


Assemble the Dataset


Transfer Files to Submission Endpoint


•  The scientist transfers dataset files to a unique publication endpoint
•  The endpoint is created on the collection-specified data store
•  The dataset may be assembled over any period of time
•  When the submission is finished, the dataset is rendered immutable via checksums (see the sketch below)
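The deck does not show how immutability is enforced; here is a minimal sketch of the general technique, assuming a per-file checksum manifest recorded at submission time (the layout and hash choice are illustrative, not Globus's actual implementation):

```python
import hashlib
from pathlib import Path

def checksum_manifest(dataset_dir: str) -> dict:
    """Map each file's relative path to its MD5 digest."""
    root = Path(dataset_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest

# Recorded once at submission time; any later change to a file
# changes its digest, so the dataset fails verification.
frozen = checksum_manifest("/data/my_submission")
assert frozen == checksum_manifest("/data/my_submission")
```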

Check Dataset Assembly


•  Verify size, file names, etc.
•  The system attempts to determine file types
•  The scientist can choose to edit, remove, or add more files
•  The scientist then accepts the collection-specified license and completes the submission (not pictured)

DOI Assignment

Submission Curation


If configured, a curator can approve the submission, reject it, or edit its metadata.

Discover a Published Dataset


•  Search on ranged metadata (see the sketch below)
•  Link back to the published dataset
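The rich discovery model was still in development when this deck was written; for flavor, here is a sketch of a ranged-metadata query against today's Globus Search service, which postdates the deck (the index UUID, token, and field name are hypothetical):

```python
import globus_sdk

# Token acquisition elided; the index UUID and field are placeholders.
sc = globus_sdk.SearchClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("SEARCH-TOKEN")
)
results = sc.post_search(
    "INDEX-UUID",
    {
        "q": "*",
        "filters": [{
            "type": "range",
            "field_name": "temperature_kelvin",  # canonical key, not T/temp/Temp
            "values": [{"from": 100, "to": 300}],
        }],
    },
)
for hit in results["gmeta"]:
    print(hit["subject"])  # identifier linking back to the published dataset
```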

View Downloaded Dataset


Use Globus Connect Personal to pull the files locally for analysis

...all of this via SaaS and with your own (institutional or personal) resources or cloud resources

Summary
•  Transfer
•  User authentication, groups, and sharing
•  Data publication
•  Data cataloging
•  Automation and workflows

Thank you to our sponsors!
•  U.S. Department of Energy
•  Data Engines for Big Data LDRD
