
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671500.

Percipient StorAGe for Exascale Data Centric Computing
Computing for the Exascale

Shaun de Witt, Culham Centre for Fusion Energy, UK
2nd Technical Meeting on Fusion Data Processing, Validation and Analysis - June 2nd 2017

Per·cip·i·ent (pər-sĭp′ē-ənt)
adj. Having the power of perceiving, especially perceiving keenly and readily.
n. One that perceives.

Storage cannot keep up with compute!
• Way too much data
• Way too much energy to move data
• How to use new storage devices is unclear

Opportunity: Big Data Analytics and Extreme Computing overlap.

Storage Problems at Extreme Scale

SAGE Project Goal
• Big (massive!) data analysis
  – Avoid data movement
  – Manage and process extremely large data sets
• Extreme computing
  – Changing I/O needs
  – HDDs cannot keep up
• Together these need exascale data-centric computing systems: Big Data Extreme Computing (BDEC) systems
• SAGE validates a BDEC system which can ingest, store, process and manage extreme amounts of data

Consortium

Coordinator: Seagate

SAGE Overall Objectives
• Provide/validate a novel storage architecture
  – Object storage with a very flexible API, driving:
    • a multi-tiered hierarchical storage system
    • integrated compute capability
    • purpose: increase overall scientific throughput!
  – Co-designed with the use cases
  – Integrated with ecosystem tools
• Provide a roadmap of component technologies for achieving extreme scale
  – Including programming models and access methods
• European excellence in the area of exascale data-centric computing, targeting:
  – HPC & big data technology influencers
  – Scientific communities & infrastructure/wider markets

• Tracking European exascale and HPC objectives very closely
  – Very active participation in ETP4HPC
  – Strategic Research Agenda (SRA) goals – SRA2 (http://www.etp4hpc.eu/en/sra.html)
    • Continuing to be extremely well aligned
    • Driving future H2020 projects (FETHPC2, ESDs, etc.)
• Synergistic with worldwide initiatives
  – ECP (Exascale Computing Project): https://exascaleproject.org/
  – BDEC (Big Data Extreme Compute): http://www.exascale.org/bdec/

SAGE Relevance to Fusion
• ITER data analysis
  – Prompt analysis, pre-emptive caching, …
• Engineering and modelling
  – Note: PPPL is already working on exascale development with XGC* – opportunities for collaboration?

* http://www.pppl.gov/news/2017/02/advanced-fusion-code-led-pppl-selected-participate-early-science-programs-three-new-0

Research Areas

Applications
• Primary goal
  – Demonstrate use cases and co-design the system
• Methodology
  – Obtain requirements from:
    • the various use cases
    • detailed profiling supported by tools
  – Feed requirements back to the platform ("co-design")
  – Co-design requirements driven by automated application characterisation with tools

• CCFE fusion energy applications
  – ALF: Analytics of Log-files for Fusion
  – Spectre: providing near-real-time feedback on plasma
  – ParaFEM: finite element analysis
• Savu – tomography reconstruction and processing pipeline
• Ray – distributed assembly of metagenomes
• JURASSIC – fast radiative transfer model simulation code
• iPIC3D – particle-in-cell code for simulations of space plasma
• NEST – simulator for spiking neural network models
• Angelia – benchmarking framework for Apache Flink

Percipient Storage Overview
• Goal
  – Build the data-centric computing platform
• Methodology
  – Advanced object storage
  – New NVRAM technologies in the I/O stack
  – Ability for I/O to accept computation
    • Including memory as part of the storage tiers
  – API for massive data ingest and extreme I/O
  – Commodity server and computing components in the I/O stack

Percipient Storage Stack

API Layers

SAGE Unique Features

Tiered Object Storage: HSM Design
• Data organisation is based on the Mero composite layout
• Data migration decisions come from:
  – user access
  – a knowledge base filled from all Mero information (hints, events, …)
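A minimal sketch of what such a migration decision could look like, assuming a policy that combines access recency with a knowledge-base hint; every name here is invented for illustration and this is not the actual SAGE/Mero policy:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical tiering policy: the HSM combines observed user access
     * with knowledge-base hints to decide where an object should live. */
    enum tier { TIER_NVRAM, TIER_FLASH, TIER_DISK };
    static const char *tier_name[] = { "NVRAM", "flash", "disk" };

    struct obj_stats {
        time_t last_access;   /* from user-access tracking         */
        int    pinned_hot;    /* hint from the Mero knowledge base */
    };

    static enum tier choose_tier(const struct obj_stats *s, time_t now)
    {
        if (s->pinned_hot)                return TIER_NVRAM;  /* keep hot    */
        if (now - s->last_access < 3600)  return TIER_FLASH;  /* warm, < 1 h */
        return TIER_DISK;                                     /* cold        */
    }

    int main(void)
    {
        struct obj_stats s = { .last_access = time(NULL) - 7200, .pinned_hot = 0 };
        printf("migrate to: %s\n", tier_name[choose_tier(&s, time(NULL))]);
        return 0;
    }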

Function Offloading

Containers
• Logical grouping of objects by:
  – System properties: access latency, resilience, …
  – Object properties: formats, access mechanisms, …
  – Scientific properties: location, time, diagnostic, …
  – Event based: 'shot', earthquake, hurricane, …
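An illustrative container descriptor, assuming the properties above are carried as free-form key/value metadata; the struct and its fields are invented for this sketch and do not show the Mero container format:

    #include <stdio.h>

    /* A container groups objects and carries system, object, scientific
     * and event properties as key/value pairs. */
    struct container {
        const char *name;
        const char *props[4][2];   /* {key, value} pairs */
    };

    int main(void)
    {
        struct container shot = {
            .name  = "shot-12345",
            .props = {
                { "access_latency", "low"        },  /* system property     */
                { "format",         "HDF5"       },  /* object property     */
                { "diagnostic",     "Thomson"    },  /* scientific property */
                { "event",          "shot 12345" },  /* event based         */
            },
        };
        for (int i = 0; i < 4; i++)
            printf("%s: %s = %s\n", shot.name, shot.props[i][0], shot.props[i][1]);
        return 0;
    }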

Built-in Flink connector
("Maybe NN are better for low latency…")

HDF5 and NetCDF Integration

MPI and PGAS

Status:
• Global partitioned address space for hierarchical storage
  – Realised via MPI windows allocated on storage
  – Can substitute for MPI I/O, eliminating the distinction between programming interfaces for memory and storage
• Implemented in PMPI and MPICH (available as open source on GitHub)
• Uses MPI "hints"
• No change to the MPI standard
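A minimal sketch of the idea, assuming hint keys such as "alloc_type" and "storage_alloc_filename" (the actual keys used by the SAGE prototype may differ; a standard MPI library simply ignores unknown hints, so the program runs unchanged anywhere):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Assumed hint keys asking for a storage-backed window. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "alloc_type", "storage");
        MPI_Info_set(info, "storage_alloc_filename", "/mnt/sage/win.dat");

        /* Allocate a window that may live on storage rather than DRAM;
         * it is then used with ordinary MPI one-sided operations. */
        double *base;
        MPI_Win win;
        MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
                         info, MPI_COMM_WORLD, &base, &win);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_fence(0, win);
        base[0] = (double)rank;   /* a plain store instead of MPI_File_write */
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }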

SAGE Feature Analysis
• iPIC3D
  – PGAS I/O
  – Function shipping for data analysis
• JURASSIC
  – Function off-loading or run-time system for:
    • data pre-processing (data extraction)
    • data compression/decompression
  – Asynchronous I/O with data staging using a semi-persistent cache
• NEST
  – Data post-processing (data analysis) using the run-time system
• Savu
  – Native HDF5 support
• ALF (doi:10.1016/j.fusengdes.2017.03.113)
  – Native block store with a Ceph-like interface
  – Function off-loading for slicing of data regions using a Python interface
• Spectre
  – Apache Flink for parallelising FFT and buffered streaming of data to storage


pNFS Services on Object Storage
• pNFS is a standard parallel file system protocol
• Separates metadata access from data access
• The POSIX namespace is stored in the Mero KV store
• File data are stored in Mero objects


[Diagram: clients use the pNFS protocol to reach a metadata server backed by the Mero KV store, and the Mero protocol to reach Mero objects, over the network]
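A toy illustration of that namespace split, assuming path-to-object-id entries in a KV store; the key layout and ids are invented and do not show Mero's actual format:

    #include <stdio.h>
    #include <string.h>

    /* Directory entries live in a KV store (path -> object id);
     * file contents live in objects addressed by that id. */
    struct kv { const char *key; unsigned long oid; };

    static const struct kv namespace_kv[] = {
        { "/shots",              0x1000 },  /* directory entry */
        { "/shots/12345",        0x1001 },  /* directory entry */
        { "/shots/12345/run.h5", 0x2a42 },  /* file: id of the data object */
    };

    /* Resolve a POSIX path to the object holding its data. */
    static unsigned long lookup(const char *path)
    {
        for (size_t i = 0; i < sizeof namespace_kv / sizeof *namespace_kv; i++)
            if (strcmp(namespace_kv[i].key, path) == 0)
                return namespace_kv[i].oid;
        return 0;
    }

    int main(void)
    {
        printf("oid = 0x%lx\n", lookup("/shots/12345/run.h5"));
        return 0;
    }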

Extreme Resiliency for Applications: Distributed Transactions
• Distributed transactions are groups of storage (including I/O) operations that are atomic in the face of certain failures, known as allowed failures.
• Allowed failures:
  – Transient network failures
  – Node crash and restart
• Distributed Transaction Manager (DTM)
  – Creates transactions
  – Controls transactions
• Actions:
  – Scatter-gather write of data into a Mero object
  – Scatter-gather read of data from a Mero object
  – Creation of a new sub-directory
  – Renaming of a file
  – Writing of a data unit reconstructed from parity blocks into a spare unit

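A hypothetical, in-memory mock of a DTM-style client API to make the all-or-nothing grouping concrete; every name is invented for illustration and the real Mero DTM interface differs (a real DTM would also replicate its log to survive the allowed failures):

    #include <stdio.h>
    #include <stdlib.h>

    /* Operations are queued against a transaction and applied
     * all-or-nothing at commit time. */
    typedef struct {
        char ops[8][128];   /* queued operation descriptions */
        int  n;
    } dtx_t;

    static dtx_t *dtx_open(void) { return calloc(1, sizeof(dtx_t)); }

    static int dtx_add(dtx_t *tx, const char *desc)
    {
        if (tx->n == 8) return -1;   /* queue full: caller aborts */
        snprintf(tx->ops[tx->n++], sizeof tx->ops[0], "%s", desc);
        return 0;
    }

    static int dtx_commit(dtx_t *tx)   /* atomic w.r.t. allowed failures */
    {
        for (int i = 0; i < tx->n; i++)
            printf("apply: %s\n", tx->ops[i]);
        free(tx);
        return 0;
    }

    static void dtx_abort(dtx_t *tx) { free(tx); }

    int main(void)
    {
        dtx_t *tx = dtx_open();
        if (dtx_add(tx, "write object 0x2a42 [0, 1 MiB)") ||
            dtx_add(tx, "mkdir /shots/12345") ||
            dtx_add(tx, "rename /tmp/run.h5 -> /shots/12345/run.h5")) {
            dtx_abort(tx);
            return 1;
        }
        return dtx_commit(tx);   /* either all three apply, or none */
    }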

Integration, Demonstration
• Goal
  – Hardware definition, integration and demonstration
• Methodology
  – Design and bring-up of the SAGE hardware
    • Seagate hardware
    • Atos hardware
  – Integration of all the software components
    • at the Jülich Supercomputing Centre (JSC)
  – Demonstration of the use cases
    • Extrapolate performance to exascale
    • Study other object stores vis-à-vis Mero

SAGE hardware prototype: built, shipped and integrated at JSC

Alignment with European Goals [ETP4HPC SRA]

SAGE is extremely well aligned with the broader goals for Europe in the area of storage, I/O and energy efficiency:
• M-BIO-1: Tightly coupled storage-class memory I/O systems demo
• M-BIO-3: Multi-tiered heterogeneous storage system demo
• M-BIO-5: Big data analytics tools developed for HPC use
• M-BIO-6: 'Active storage' capability demonstrated
• M-BIO-8: Extreme-scale multi-tier data management tools available
• M-ARCH-3: New compute nodes and storage architecture use NVRAM
• M-ENER-X: Addresses energy goals by avoiding data movement
  – Moving data takes ~100x more energy than computing on it!
• M-ENER-FT-10: Application survival on unreliable hardware

Expected Impacts & Innovation
• Commercial and market impacts (storage, systems & tools)
• Key inputs into European roadmapping
• Key inputs into "data intensive" research programmes
• Primary European storage platform for extreme scale (applicability: big science and BDEC)

Acknowledgements
Sai Narasimhamurthy (Seagate)
Dirk Pleiter (Forschungszentrum Jülich)
Stefano Markidis (Kungliga Tekniska Högskolan)

Questions? [email protected]

Backup

Services, Systemware & Tools
• Goal
  – Explore tools and services on top of Mero
• Methodology
  – "HSM": methods to automatically move data across tiers
  – "pNFS": parallel file system access on Mero
  – Scale-out object storage integrity checking service
  – Allinea performance analysis tools

Status:
• HSM PoC implementation ready
• pNFS PoC implementation ready
• Performance analysis tools framework ready
• Completed scoping/architecture of data integrity checking

Mero Extreme-Scale Features & NVRAM
• Goal
  – Mero object storage platform development (with the Clovis API)
  – Evaluate NVRAM options
• Methodology
  – Co-design extreme-scale object store components
  – Study NVRAM technologies, including emulation

Status:
• Concept & architecture of key Mero exascale components
• NVRAM state-of-the-art studies
• Low-level system software/emulation of NVRAM

Programming Models and Analytics
• Goal
  – Explore usage of SAGE by programming models, runtimes and data analytics solutions
• Methodology
  – Usage of SAGE through MPI and PGAS
    • Adapt MPI-IO for SAGE
    • Adapt PGAS for SAGE
  – Runtimes for SAGE
    • Pre/post-processing (e.g. volume rendering)
    • Exploit the caching hierarchy
  – Data analytics methods on top of Clovis
    • Apache Flink over Clovis, looking beyond Hadoop
    • Exploit NVRAM as an extension of memory (see the sketch after this list)

Status:
• PGAS for SAGE proof of concept
• Following up from the MPI-IO gap analysis
• Runtimes proof of concept
• Detailed architecture of data analytics
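One common way to treat NVRAM as a memory extension is to mmap a file on a DAX-mounted persistent-memory filesystem, so that loads and stores reach the device directly; a minimal POSIX sketch, where the mount point is an assumption:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/mnt/pmem/scratch.bin";  /* assumed DAX mount */
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, 4096) != 0) { perror("ftruncate"); return 1; }

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ordinary loads/stores now address byte-addressable storage. */
        strcpy(p, "data placed in NVRAM-backed memory");
        msync(p, 4096, MS_SYNC);   /* flush to the persistent medium */

        munmap(p, 4096);
        close(fd);
        return 0;
    }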

SAGE Project Ambition
SAGE aims to lay the foundation for a European storage platform that is #1 at extreme scale.

M9 Review Recommendations

[Recommendation 1] – As part of the validation process, we recommend that the consortium include criteria to evaluate the performance of the proposed storage system. These criteria will serve to monitor the progress of the project internally and will not entail additional obligations towards the European Commission.
• WP5 discussion
• A Performance Evaluation Working Group (PEWG) was set up
• Methods to continuously track the performance of the SAGE system

[Recommendation 2] – As part of the dissemination activities concerning future work, we recommend that the consortium better advertise the SAGE project to the scientific community, for instance through the publication of scientific papers. In order to facilitate user engagement, we also suggest exploring the possibility of giving users outside the consortium access to the prototype.
• More focus on open publications (WP6 discussion)
• Initiation of an activity to provide access to the prototype (WP5 discussion)

Definition of Terms
• Exascale/extreme computing
  – Computing at an exaflop and beyond
• Object storage
  – Grouping data into user-defined "objects"; no particular notion of grouping into hierarchical trees
• Data-centric computing
  – Computing that depends on and generates lots of data
  – Classical HPC was mainly about simulations; data was secondary
• Parallel file systems
  – A popular paradigm for accessing storage in HPC, parallelising I/O to the storage subsystem
• pNFS
  – A type of parallel file system
• NVM/NVRAM
  – Non-volatile memory
• MPI
  – A popular programming model in HPC; MPI-IO is its I/O library

WP6: Dissemination, Exploitation…
• Goal
  – Dissemination, exploitation & collaboration
• Methodology
  – Disseminate SAGE through events, conferences, publications and talks
  – Exploitation through exploring market opportunities and expanding European IP
  – Collaboration with other European and international projects
    • Seeking influence on de facto standard methods

Status:
• Continued website updates
• Continued publications
• Continued social media
• Continued participation in key events & talks
• Press releases and press coverage
• Discussions with potential users of SAGE technology