Sayantan Sur, Intel ExaComm Workshop held in conjunction with ISC 2018


Source: nowlab.cse.ohio-state.edu/.../exacomm18/exacomm18-invited-talk-sayantan-sur.pdf

Sayantan Sur, Intel

ExaComm Workshop held in conjunction with ISC 2018


Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.


Legal Disclaimer & Optimization Notice

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804


Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Copyright © 2018, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.


What is HPC?

Inputs:

• Better processors
• Better fabric
• Better system design

Output:

• Increasing overall performance

[Figure courtesy Top500.org]

HPC is an activity characterized by the workload’s nature, intent and response to scale


Can other problems use high performance?

IoT and Analytics

• Total data in the world doubling every two years [1]
• Vast amounts of data unutilized
• “only 1 percent of data from an oil rig with 30,000 sensors is examined” [2]
• Simply adding more commodity compute / fabric doesn’t help: need high performance!

Artificial Intelligence

• Data Parallel Deep Learning already utilizing HPC Fabrics / techniques
• “Deep Learning at 15PF” on Intel Xeon Phi [3]
• Model Parallel imposes challenging memory / bandwidth limits

[1] https://insidebigdata.com/2017/02/16/the-exponential-growth-of-data/
[2] McKinsey Global Institute report on The Internet of Things: Mapping the value beyond the hype
[3] https://arxiv.org/pdf/1708.05256.pdf


Converged workflows could create new value

[Figure: converged workflow diagram. Labels: Streaming Data; Data Collection; Simulation or Analytics; Human or AI Agent; Interactive Steering; Discard Bad Data; Store (NoSQL Databases, Parallel Filesystems); Process. Data sources: medical device, radio telescope, IoT network, data from Web.]

Fabric must comprehend multiple application domains to serve a converged workflow


Example Desired State: Unified Architecture

[Figure: unified architecture stack]

• Applications: FORTRAN / C++ Applications on MPI (High Performance); Java / Scala Applications on Hadoop (Easy to Use); C++ Applications on TensorFlow / Caffe / MXNet (Easy to Use)
• Resource Managers
• Storage Abstractions; Remote Storage
• Compute & Big Data Capable, High Performance Components: Server, Storage (SSDs), HPC Fabric
• Infrastructure: Traditional Infrastructure / Private/Public Cloud


Requirements on Fabric and Fabric Software


Hardware + Software: Support a wider set of fabric users than MPI, PGAS, File systems …

Hardware: Accelerate new communication paradigms emerging in converged workflows

Software: Provide semantic abstractions that enable communication offload while enhancing portability


How can Open Fabrics Interface (OFI) help?

Enable “software leads hardware”

• Same abstractions on multiple fabrics, regardless of feature set
• Carefully defined semantics: net overhead remains the same even when hardware assists are unavailable

Interfaces aligned to a variety of use models: HPC (tag-matching, RMA), Client-Server (connected, datagram)

Powerful network and service discovery features

[Figure: layered stack of Application, Fabric Software, and Hardware, annotated “Implement my semantics!” and “Here’s what I can do!”, with the questions “Should apps target this?” and “Or should they target this?”]


OFI: State of the Union

[Figure: OFI software stack]

• libfabric Enabled Middleware: Intel® MPI Library, MPICH*, Open MPI*, GASNet*, Sandia SHMEM*, NetIO*, Charm++*, Intel® MLSL# (# Exploration)
• Advanced application oriented semantics: Tag Matching, Scalable memory registration, Triggered Operations, Multi-Receive buffers, Reliable Datagram Endpoints, Remote Completion Semantics, Streaming Endpoints, Shared Address Vectors, Unexpected Message Buffering
• Providers, including WIP Providers: Sockets (TCP, UDP), Verbs, Cisco usNIC*, Intel® OPA PSM, Cray GNI*, Mellanox*, IBM Blue Gene*

OFI insulates applications from the wide diversity of fabrics underneath

* Other names and brands may be claimed as property of others


What new communication paradigms in Analytics could be accelerated?


RPC / Distributed Object Store Paradigms

Commonly occurring paradigm in Analytics middleware

• NoSQL databases
• Spark

Multiple RPC libraries available

• gRPC, Netty, etc.

How to express RPC semantics on HPC Fabrics which have rich offloads?

[Figure: a Client Library sends an Object Request (request by hash) or an RPC Request carrying an Object Hash or RPC ID to the Server Application; the server looks up the object hash or handle and sends a Response with Object.]


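RDMA Doesn’t quite fit the RPC Model

When objects are small (CLIENT / SERVER exchange):

• Client sends Object Request (Hash)
• Server performs Lookup
• Server returns Object Data

When objects are big (CLIENT / SERVER exchange):

• Client sends Object Request (Hash)
• Server performs Lookup
• Server returns Object Length
• Client performs Allocation and returns Address
• Server issues Put
• Completion (shown twice on the slide)

Server has to track the transaction

Does this look familiar to anyone?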
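The two exchanges on this slide can be sketched as a small Python simulation. This is illustrative only: the `EAGER_LIMIT` cut-off, the `fetch_object` function, and the message labels are assumptions made for the sketch and are not taken from the slides or from any RDMA library.

```python
# Illustrative simulation only; not an RDMA or libfabric API.
EAGER_LIMIT = 64  # hypothetical size cut-off between "small" and "big" objects

def fetch_object(store, key):
    """Return (data, wire_messages, server_tracked) for one object request."""
    messages = ["request(hash)"]            # client -> server
    obj = store[key]                        # server-side lookup
    if len(obj) <= EAGER_LIMIT:
        messages.append("object data")      # server -> client, one round trip
        return obj, messages, False
    # Big object: the server must remember this transaction until the put completes.
    messages.append("object length")        # server -> client
    buffer = bytearray(len(obj))            # client allocates once the size is known
    messages.append("address")              # client -> server
    buffer[:] = obj                         # stands in for the RDMA put
    messages.append("put")                  # server writes into client memory
    messages.append("completion")           # signals that the data has landed
    return bytes(buffer), messages, True

store = {"small": b"x" * 16, "big": b"y" * 4096}
small_data, small_msgs, small_tracked = fetch_object(store, "small")
big_data, big_msgs, big_tracked = fetch_object(store, "big")
print(len(small_msgs), len(big_msgs))
```

In this toy model the big-object path needs more wire messages than the small-object path and leaves per-request state on the server, which is the cost the slide points at.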


Traditional Tag matching doesn’t fit either!

MPI use model

• Apps encouraged to pre-post recv
• Sender knows size of data
• Receiver knows max size of data
• Match order strictly defined

Distributed Object store use model

• Object Request may arrive at any time
• Object size not known until lookup is performed on the hash
• Ordering might be relaxed

[Figure: MPI-style Tag Matching Model between CLIENT and SERVER. Labels: Post Recv w/ Tag; Send w/ Tag; Internal completion.]


A Better Use Model with OFI 1.7

Variable Sized Messages

• When receiver doesn’t know size of object prior to communication
• Easy to use when message sizes vary greatly
• Alleviates buffer management
• Removes application level rendezvous

Can leverage tag matching if available

[Figure: CLIENT / SERVER exchange. Labels: Object Request (Hash); Lookup; Variable Sized Message; fi_recv(FI_CLAIM); Completion. Rendezvous is internal to OFI.]
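A minimal sketch of the variable-sized receive idea, written as a toy Python model. The class `VariableMessageEndpoint` and its `deliver`, `peek_size`, and `claim` methods are hypothetical names invented for this illustration; they do not reproduce libfabric signatures, and the slide's `fi_recv(FI_CLAIM);` is only loosely mirrored by `claim`.

```python
# Illustrative model only; class and method names are hypothetical, not libfabric calls.
from collections import deque

class VariableMessageEndpoint:
    """Toy endpoint where the receiver learns the size only when a message arrives."""

    def __init__(self):
        self._arrived = deque()

    def deliver(self, payload):
        # Stands in for the fabric delivering a message of any size.
        self._arrived.append(bytes(payload))

    def peek_size(self):
        # Report the size of the next message without consuming it.
        return len(self._arrived[0]) if self._arrived else None

    def claim(self, buffer):
        # Copy the next message into a caller-supplied buffer and consume it.
        payload = self._arrived.popleft()
        if len(buffer) < len(payload):
            self._arrived.appendleft(payload)
            raise ValueError("buffer too small")
        buffer[:len(payload)] = payload
        return len(payload)

ep = VariableMessageEndpoint()
ep.deliver(b"a" * 10)
ep.deliver(b"b" * 100_000)

received = []
while (size := ep.peek_size()) is not None:
    buf = bytearray(size)        # allocate exactly what is needed, after arrival
    ep.claim(buf)
    received.append(bytes(buf))
```

The application never pre-posts a maximum-size buffer and never runs its own length/address handshake; in the slide's model that rendezvous would happen inside OFI.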


What are the new communication paradigms in AI (Deep Learning)?


Deep Learning Image Recognition Training

[Figure: a network of layers (Layer 0, Layer 1, ..., Layer N-1, Layer N). Forward Propagation (computation) produces an output (person 0.10, cat 0.15, dog 0.20, …, bike 0.05) that is compared with the expected label (person 0, cat 1, dog 0, …, bike 0) to give a penalty (error or cost). Back Propagation then communicates revised weights, with one Allreduce per layer: Allreduce #1, Allreduce #2, ..., Allreduce #N-1, Allreduce #N.]

• In data parallel mode, the model is replicated on each node
• Model is further split into individual layers of neurons
• Computation / layer is fixed
• Communication / layer is fixed
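The per-layer Allreduce pattern can be sketched in plain Python. This is a simplified illustration under stated assumptions: lists stand in for tensors, `allreduce_mean` is a local stand-in for a real collective, and the learning rate and gradient values are made up.

```python
# Illustrative sketch; plain Python lists stand in for tensors and for the fabric.

def allreduce_mean(per_node_values):
    """Average element-wise across nodes and return one copy per node."""
    n = len(per_node_values)
    mean = [sum(vals) / n for vals in zip(*per_node_values)]
    return [list(mean) for _ in range(n)]

def training_step(weights, grads_per_node, lr=0.1):
    """weights: list of layers; grads_per_node[node][layer] is a gradient list."""
    num_nodes = len(grads_per_node)
    # Data parallel: every node starts from the same replicated model.
    replicas = [[list(layer) for layer in weights] for _ in range(num_nodes)]
    # Back propagation visits the last layer first; one allreduce per layer.
    for layer in reversed(range(len(weights))):
        reduced = allreduce_mean(
            [grads_per_node[node][layer] for node in range(num_nodes)]
        )
        for node in range(num_nodes):
            replicas[node][layer] = [
                w - lr * g for w, g in zip(replicas[node][layer], reduced[node])
            ]
    return replicas

weights = [[1.0, 2.0], [3.0]]                       # two layers
grads = [[[0.2, 0.4], [1.0]], [[0.6, 0.0], [3.0]]]  # two nodes, different mini-batches
replicas = training_step(weights, grads)
```

Because every node applies the same averaged gradient, the replicas stay identical after the step, and the amount of data reduced per layer depends only on the layer's size.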


Performance depends on Overlap

[Figure: timeline (Time axis) of Iteration N Backpropagation over layers L3, L2, L1, L0, followed by Iteration N+1 Forward over layers L0, L1, L2, L3. Legend: Compute; Start Communication; Wait Communication. Annotations: Allreduce Operations; Non overlapped communication; Potentially Overlapped communication; Waiting for completion in reverse order of initiating them.]

Messages from each Allreduce have different priority
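A small timing model makes the overlap argument concrete. It is a sketch under simplifying assumptions that are not from the slides: every layer has the same compute time, every Allreduce takes the same time, and Allreduce operations progress concurrently without competing for bandwidth. The function name and the numbers are made up for illustration.

```python
# Illustrative timing model; the cost parameters are made-up inputs, not measurements.

def forward_stalls(num_layers, compute_time, allreduce_time):
    """Return the time the next forward pass stalls on each layer's allreduce."""
    # Back propagation runs from the last layer down to layer 0.
    now = 0.0
    done_at = {}
    for layer in reversed(range(num_layers)):
        now += compute_time                      # backprop compute for this layer
        done_at[layer] = now + allreduce_time    # allreduce starts right after
    # Forward propagation of the next iteration runs from layer 0 upward.
    stalls = []
    for layer in range(num_layers):
        stall = max(0.0, done_at[layer] - now)   # wait for this layer's weights
        stalls.append(stall)
        now += stall + compute_time
    return stalls

stalls = forward_stalls(num_layers=4, compute_time=1.0, allreduce_time=2.0)
print(stalls)
```

In this model only layer 0 stalls the forward pass: its Allreduce is started last but needed first, so it has nothing to overlap with, while the Allreduce operations of later layers hide behind computation. That asymmetry is one way to read the slide's remark about differing priorities.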


Triggered Operations

Offloaded Communication

• Trigger op when a condition is met
• Useful to advance a pre-computed schedule

Designed for MPI Collectives

• Latency reduction for blocking collectives
• Increase overlap for non-blocking collectives

[Figure: a Triggered Operation (RMA/Atomic etc.) with Threshold=T is bound to a Threshold Counter (TC) and a Completion Counter (CC). When the value of TC reaches the value of T, the operation is executed; when the operation is complete, CC is incremented by 1.]
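The counter mechanism in the figure can be modeled in a few lines of Python. This is a host-side sketch, not an offload implementation: the `Counter` and `TriggeredOp` classes and the operation names are hypothetical and do not mirror libfabric's API.

```python
# Illustrative model; class names are hypothetical and do not mirror libfabric's API.

class Counter:
    def __init__(self):
        self.value = 0
        self._pending = []          # (threshold, triggered operation)

    def attach(self, threshold, op):
        self._pending.append((threshold, op))
        self._fire_ready()

    def increment(self):
        self.value += 1
        self._fire_ready()

    def _fire_ready(self):
        # Run every operation whose threshold has been reached, exactly once.
        ready = [op for t, op in self._pending if self.value >= t]
        self._pending = [(t, op) for t, op in self._pending if self.value < t]
        for op in ready:
            op.execute()

class TriggeredOp:
    def __init__(self, name, completion_counter, log):
        self.name, self.cc, self.log = name, completion_counter, log

    def execute(self):
        self.log.append(self.name)  # stands in for the RMA/atomic operation
        self.cc.increment()         # completion increments CC by 1

log = []
tc_a, tc_b, done = Counter(), Counter(), Counter()
tc_a.attach(2, TriggeredOp("put to parent", tc_b, log))   # runs after two arrivals
tc_b.attach(1, TriggeredOp("notify root", done, log))     # chained on the first op
tc_a.increment()   # first arrival: threshold not reached, nothing runs
tc_a.increment()   # second arrival: the whole chain advances
```

Using one operation's completion counter as the next operation's threshold counter is what lets a pre-computed schedule advance without the application stepping in between stages.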


How to handle multiple schedules?

• Completion of one operation could trigger multiple operations
• Should any runnable operation be run?
• Should we check if some other op completed so we run those dependencies instead?

Choice of semantic can have performance / implementation implications

Active area of research

[Figure: a Completion arrives while Schedule #1 and Schedule #2 both have pending operations. Labels: Run? Run? Check?]


Summary

Requirements on HPC Fabric – hardware and software are increasing

Many new software frameworks must access the HPC fabric

Analytics frameworks evolved outside of HPC

Artificial intelligence frameworks can tax HPC fabrics with new semantics

Software components will evolve at their own pace, on a variety of hardware

OFI can help in aligning the software and hardware requirements, enabling software to lead hardware

Completely open and free participation in OFI development


http://libfabric.org/
