
THE CUDA C++ STANDARD LIBRARY
2019-12-03

Bryce Adelstein Lelbach
CUDA C++ Core Libraries Lead
ISO C++ Library Evolution Incubator Chair, ISO C++ Tooling Study Group Chair
@blelbach

ISO C++ == Core Language + Standard Library

C++ without a Standard Library is severely diminished.

CUDA C++ == Core Language + ???

CUDA C++ == Core Language + libcu++

Version 1 in CUDA 10.2!

libcu++ is the opt-in, heterogeneous, incremental CUDA C++ Standard Library.

Opt-in

#include <…>           ISO C++, __host__ only.
std::                  Strictly conforming to ISO C++.

#include <cuda/std/…>  CUDA C++, __host__ __device__.
cuda::std::            Strictly conforming to ISO C++.

#include <cuda/…>      CUDA C++, __host__ __device__.
cuda::                 Conforming extensions to ISO C++.

#include <atomic>
std::atomic<int> x;

#include <cuda/std/atomic>
cuda::std::atomic<int> x;

#include <cuda/atomic>
cuda::atomic<int, cuda::thread_scope_block> x;

Does not interfere with or replace your host standard library.
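For example (a minimal sketch, not from the slides; the kernel name and the counting use case are illustrative), a __global__ kernel can use the opt-in atomic directly:

#include <cuda/std/atomic>

// Each thread that sees a positive element bumps a shared counter.
// Assumes a GPU supported by libcu++'s <atomic> (Pascal or newer).
__global__ void count_positives(int const* data, int n,
                                cuda::std::atomic<int>* count) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && data[i] > 0)
    count->fetch_add(1, cuda::std::memory_order_relaxed);
}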

Heterogeneous

Copyable/Movable objects can migrate between host & device.

Host & device can call all (member) functions.

Host & device can concurrently use synchronization primitives*.

*: Synchronization primitives must be in managed memory and be declared with cuda::thread_scope_system.
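A minimal sketch of host & device concurrently using one synchronization primitive (the allocation pattern, kernel name, and host spin-wait are illustrative and assume a platform with concurrent managed-memory access, e.g. Linux on Pascal or newer):

#include <cuda/atomic>
#include <new>

__global__ void device_signals(cuda::atomic<int, cuda::thread_scope_system>* flag) {
  flag->store(1, cuda::std::memory_order_release);
}

int main() {
  cuda::atomic<int, cuda::thread_scope_system>* flag = nullptr;
  cudaMallocManaged(&flag, sizeof(*flag)); // Managed memory: visible to host & device.
  new (flag) cuda::atomic<int, cuda::thread_scope_system>(0);

  device_signals<<<1, 1>>>(flag);
  while (flag->load(cuda::std::memory_order_acquire) != 1)
    ; // The host and the running kernel use the same atomic concurrently.

  cudaDeviceSynchronize();
  cudaFree(flag);
  return 0;
}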

Incremental

Not a complete standard library today; each release will add more.

[Figure: standard library facilities plotted by how hard they are to bring to the device, from "same optimal design for host/device (just add __device__)" to "different optimal design for host/device" and "large effort / difficult to implement for device". Examples shown: std::iostream, std::hash, std::min/std::max, std::function, std::array, std::atomic, and std::tuple, with a "highest impact" region and a region already available in Thrust or CUB.]

We are prioritizing facilities that are neither trivial nor difficult to port, are not already available in CUDA, and have a high impact.


Based on LLVM’s libc++

Forked from LLVM’s libc++.

License: Apache 2.0 with LLVM Exception.

NVIDIA is already contributing back to the community:

Freestanding atomic<T>: reviews.llvm.org/D56913

C++20 synchronization library: reviews.llvm.org/D68480


libcu++ Release Schedule

Version 1 (CUDA 10.2, now): <atomic> (Pascal+), <type_traits>.

Version 2 (1H 2020): atomic<T>::wait/notify, <barrier>, <latch>, <counting_semaphore> (all Volta+), <chrono>, <ratio>, <functional> minus function.

Future priorities: atomic_ref<T>, <complex>, <tuple>, <array>, <utility>, <cmath>, string processing, …


namespace cuda {
  enum thread_scope {
    thread_scope_system, // All threads.
    thread_scope_device,
    thread_scope_block
  };

  template <typename T, thread_scope S = thread_scope_system>
  struct atomic;

  namespace std {
    template <typename T>
    using atomic = cuda::atomic<T>;
  } // namespace std
} // namespace cuda
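As a hedged illustration of why the scope parameter exists (the kernel, variable names, and per-block counting are not from the slides), an atomic touched only by threads within one block can be declared with thread_scope_block, letting the implementation use cheaper block-scoped operations:

#include <cuda/atomic>

__global__ void count_positives_per_block(int const* data, int n, int* block_counts) {
  // Shared-memory atomic used only by this block, so block scope suffices.
  __shared__ cuda::atomic<int, cuda::thread_scope_block> count;
  if (threadIdx.x == 0) count.store(0, cuda::std::memory_order_relaxed);
  __syncthreads();

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && data[i] > 0)
    count.fetch_add(1, cuda::std::memory_order_relaxed);
  __syncthreads();

  if (threadIdx.x == 0)
    block_counts[blockIdx.x] = count.load(cuda::std::memory_order_relaxed);
}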

__host__ __device__
void signal_flag(volatile int& flag) {
  // ^^^ volatile was "notionally right" for flag in legacy CUDA C++.
  __threadfence_system();     // <- Should be fused on the operation.
  // vvv We "cast away" the `volatile` qualifier.
  atomicExch((int*)&flag, 1); // <- Ideally want an atomic store.
}

__host__ __device__
void signal_flag(volatile int& flag) {
  // ^^^ volatile was "notionally right" for flag in legacy CUDA C++.
  __threadfence_system(); // <- Should be fused on the operation.
  // vvv We "cast away" the `volatile` qualifier.
  flag = 1; // <- "Works" for a store but is UB (volatile != atomic).
}

__host__ __device__
void signal_flag_better(atomic<bool>& flag) {
  flag = true;
}

__host__ __device__
void signal_flag_even_better(atomic<bool>& flag) {
  flag.store(true, memory_order_release);
}

__host__ __device__
void signal_flag_excellent(atomic<bool>& flag) {
  flag.store(true, memory_order_release);
  flag.notify_all(); // <- Will make sense later (Version 2).
}

__host__ __device__
int poll_flag_then_read(volatile int& flag, int& data) {
  // ^^^ volatile was "notionally right" for flag in legacy CUDA C++.
  // vvv We "cast away" the volatile qualifier.
  while (1 != atomicAdd((int*)&flag, 0)) // <- Should be atomic load.
    ; // <- Spinloop without backoff is bad under contention.
  __threadfence_system(); // <- 9 out of 10 of you forget this one!
  return data; // <- Even if volatile, you still need the fence.
}

__host__ __device__
int poll_flag_then_read(volatile int& flag, int& data) {
  // ^^^ volatile was "notionally right" for flag in legacy CUDA C++.
  // vvv We "cast away" the volatile qualifier.
  while (1 != flag) // <- "Works" but is UB (volatile != atomic).
    ; // <- Spinloop without backoff is bad under contention.
  __threadfence_system(); // <- 9 out of 10 of you forget this one!
  return data; // <- Even if volatile, you still need the fence.
}

__host__ __device__
int poll_flag_then_read_better(atomic<bool>& flag, int& data) {
  while (!flag)
    ; // <- Spinloop without backoff is bad under contention.
  return data;
}

__host__ __device__
int poll_flag_then_read_even_better(atomic<bool>& flag, int& data) {
  while (!flag.load(memory_order_acquire))
    ; // <- Spinloop without backoff is bad under contention.
  return data;
}

__host__ __device__
int poll_flag_then_read_excellent(atomic<bool>& flag, int& data) {
  flag.wait(false, memory_order_acquire); // Version 2.
  // ^^^ Backoff to mitigate heavy contention.
  return data;
}

// Mixing scopes can be a messy error; we prevent it at compile time.
__host__ __device__ void foo() {
  atomic<bool> s_flag;
  signal_flag(s_flag); // Ok; expects and got system atomic type.

  atomic<bool, thread_scope_device> d_flag;
  signal_flag(d_flag); // Compile error; expects system atomic type.
}

// Writing __host__ __device__ functions today is nearly impossible.
__host__ __device__ void bar(volatile int& a) {
#ifdef __CUDA_ARCH__
  atomicAdd((int*)&a, 1);
#else
  // What do I write here for all the CPUs & compilers I support?
#endif
}

__host__ __device__ void bar_better(atomic<int>& a) {
  a += 1;
}

__host__ __device__ void bar_even_better(atomic<int>& a) {
  a.fetch_add(1, memory_order_relaxed);
}


Stop Using Legacy Atomics (atomic[A-Z]*):

Sequential consistency & acquire/release are not first-class.

Device-only.

Memory scope is a property of operations, not objects.

Atomicity is a property of operations, not objects.


Stop Using volatile for synchronization:

volatile != atomic.

volatile is a vague pact; atomic<T> has clear semantics.


Volta+ NVIDIA GPUs deliver and libcu++ exposes:

C++ Parallel Forward Progress Guarantees.

The C++ Memory Model.

Why does this matter?


Why does this matter?

Volta+ NVIDIA GPUs and libcu++ enable a wide range of concurrent algorithms & data structures previously unavailable on GPUs.

More concurrent algorithms & data structures means more code can run on GPUs!

                              No limitations on    Threads delayed      Thread delays
                              thread delays        infinitely often     limited
Every thread makes progress   Wait-free            Obstruction-free     Starvation-free
Some thread makes progress    Lock-free            Clash-free           Deadlock-free
                              Non-Blocking         Non-Blocking         Blocking

Source: http://www.cs.tau.ac.il/~shanir/progress.pdf

Weakly parallel forward progress: pre-Volta NVIDIA GPUs, other GPUs.
Parallel forward progress: only Volta+.


template <typename Key, typename Value,
          typename Hash = hash<Key>,
          typename Equal = equal_to<Key>>
struct concurrent_insert_only_map {
  enum state_type { state_empty, state_reserved, state_filled };

  // ...

  __host__ __device__ Value* try_insert(Key const& key, Value const& value);

private:
  uint64_t capacity_;
  Key* keys_;
  Value* values_;
  atomic<state_type>* states_;
  Hash hash_;
  Equal equal_;
};
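The slides do not show how the map's storage is created. One possible setup (purely illustrative; the helper name and the choice of cudaMallocManaged are assumptions, not part of the talk) allocates the backing arrays in managed memory so try_insert can be called from host and device threads alike:

#include <cuda/std/atomic>
#include <cstdint>
#include <new>

enum state_type { state_empty, state_reserved, state_filled };

// Hypothetical helper: allocate backing arrays for `capacity` slots in managed
// memory and mark every slot empty. Keys and values stay uninitialized; they
// are constructed by try_insert with placement new.
template <typename Key, typename Value>
void allocate_map_storage(std::uint64_t capacity,
                          Key*& keys, Value*& values,
                          cuda::std::atomic<state_type>*& states) {
  cudaMallocManaged(&keys,   capacity * sizeof(Key));
  cudaMallocManaged(&values, capacity * sizeof(Value));
  cudaMallocManaged(&states, capacity * sizeof(*states));
  for (std::uint64_t i = 0; i < capacity; ++i)
    new (states + i) cuda::std::atomic<state_type>(state_empty);
}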


struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    // ...
  }
};

struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    for (uint64_t i = 0; i < capacity_; ++i) { // Linearly probe up to `capacity_` times.
      // ...
    }
    return nullptr; // If we are here, the container is full.
  }
};

struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    for (uint64_t i = 0; i < capacity_; ++i) { // Linearly probe up to `capacity_` times.
      state_type old = states_[index].load(memory_order_acquire);
      while (old == state_empty) { // As long as the slot is empty, try to lock it.
        // ...
      }
      // ...
    }
    return nullptr; // If we are here, the container is full.
  }
};

struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    for (uint64_t i = 0; i < capacity_; ++i) { // Linearly probe up to `capacity_` times.
      state_type old = states_[index].load(memory_order_acquire);
      while (old == state_empty) { // As long as the slot is empty, try to lock it.
        if (states_[index].compare_exchange_weak(old, state_reserved, memory_order_acq_rel)) {
          // We locked it by setting the state to `state_reserved`; now insert the key & value.
          // ...
        }
      }
      // ...
    }
    return nullptr; // If we are here, the container is full.
  }
};

struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    for (uint64_t i = 0; i < capacity_; ++i) { // Linearly probe up to `capacity_` times.
      state_type old = states_[index].load(memory_order_acquire);
      while (old == state_empty) { // As long as the slot is empty, try to lock it.
        if (states_[index].compare_exchange_weak(old, state_reserved, memory_order_acq_rel)) {
          // We locked it by setting the state to `state_reserved`; now insert the key & value.
          new (keys_ + index) Key(key);
          new (values_ + index) Value(value);
          states_[index].store(state_filled, memory_order_release); // Unlock the slot.
          return values_ + index;
        }
      }
      // ...
    }
    return nullptr; // If we are here, the container is full.
  }
};

struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    for (uint64_t i = 0; i < capacity_; ++i) { // Linearly probe up to `capacity_` times.
      state_type old = states_[index].load(memory_order_acquire);
      while (old == state_empty) { // As long as the slot is empty, try to lock it.
        if (states_[index].compare_exchange_weak(old, state_reserved, memory_order_acq_rel)) {
          // We locked it by setting the state to `state_reserved`; now insert the key & value.
          new (keys_ + index) Key(key);
          new (values_ + index) Value(value);
          states_[index].store(state_filled, memory_order_release); // Unlock the slot.
          states_[index].notify_all(); // Wake up anyone who was waiting for us to fill the slot.
          return values_ + index;
        }
      }
      // ...
    }
    return nullptr; // If we are here, the container is full.
  }
};

struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    for (uint64_t i = 0; i < capacity_; ++i) { // Linearly probe up to `capacity_` times.
      state_type old = states_[index].load(memory_order_acquire);
      while (old == state_empty) { // As long as the slot is empty, try to lock it.
        if (states_[index].compare_exchange_weak(old, state_reserved, memory_order_acq_rel)) {
          // We locked it by setting the state to `state_reserved`; now insert the key & value.
          new (keys_ + index) Key(key);
          new (values_ + index) Value(value);
          states_[index].store(state_filled, memory_order_release); // Unlock the slot.
          states_[index].notify_all(); // Wake up anyone who was waiting for us to fill the slot.
          return values_ + index;
        }
      } // If we didn't fill the slot, wait for it to be filled and check if it matches.
      while (state_filled != states_[index].load(memory_order_acquire))
        ;
      // ...
    }
    return nullptr; // If we are here, the container is full.
  }
};

struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    for (uint64_t i = 0; i < capacity_; ++i) { // Linearly probe up to `capacity_` times.
      state_type old = states_[index].load(memory_order_acquire);
      while (old == state_empty) { // As long as the slot is empty, try to lock it.
        if (states_[index].compare_exchange_weak(old, state_reserved, memory_order_acq_rel)) {
          // We locked it by setting the state to `state_reserved`; now insert the key & value.
          new (keys_ + index) Key(key);
          new (values_ + index) Value(value);
          states_[index].store(state_filled, memory_order_release); // Unlock the slot.
          states_[index].notify_all(); // Wake up anyone who was waiting for us to fill the slot.
          return values_ + index;
        }
      } // If we didn't fill the slot, wait for it to be filled and check if it matches.
      states_[index].wait(state_reserved, memory_order_acquire);
      // ...
    }
    return nullptr; // If we are here, the container is full.
  }
};

struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    for (uint64_t i = 0; i < capacity_; ++i) { // Linearly probe up to `capacity_` times.
      state_type old = states_[index].load(memory_order_acquire);
      while (old == state_empty) { // As long as the slot is empty, try to lock it.
        if (states_[index].compare_exchange_weak(old, state_reserved, memory_order_acq_rel)) {
          // We locked it by setting the state to `state_reserved`; now insert the key & value.
          new (keys_ + index) Key(key);
          new (values_ + index) Value(value);
          states_[index].store(state_filled, memory_order_release); // Unlock the slot.
          states_[index].notify_all(); // Wake up anyone who was waiting for us to fill the slot.
          return values_ + index;
        }
      } // If we didn't fill the slot, wait for it to be filled and check if it matches.
      states_[index].wait(state_reserved, memory_order_acquire);
      if (equal_(keys_[index], key)) return values_ + index; // Someone else inserted.
      // ...
    }
    return nullptr; // If we are here, the container is full.
  }
};

struct concurrent_insert_only_map {
  __host__ __device__ Value* try_insert(Key const& key, Value const& value) {
    auto index(hash_(key) % capacity_);
    for (uint64_t i = 0; i < capacity_; ++i) { // Linearly probe up to `capacity_` times.
      state_type old = states_[index].load(memory_order_acquire);
      while (old == state_empty) { // As long as the slot is empty, try to lock it.
        if (states_[index].compare_exchange_weak(old, state_reserved, memory_order_acq_rel)) {
          // We locked it by setting the state to `state_reserved`; now insert the key & value.
          new (keys_ + index) Key(key);
          new (values_ + index) Value(value);
          states_[index].store(state_filled, memory_order_release); // Unlock the slot.
          states_[index].notify_all(); // Wake up anyone who was waiting for us to fill the slot.
          return values_ + index;
        }
      } // If we didn't fill the slot, wait for it to be filled and check if it matches.
      states_[index].wait(state_reserved, memory_order_acquire);
      if (equal_(keys_[index], key)) return values_ + index; // Someone else inserted.
      index = (index + 1) % capacity_; // Collision: keys didn't match. Try the next slot.
    }
    return nullptr; // If we are here, the container is full.
  }
};

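To round out the example, here is a hedged usage sketch (the kernel name, launch shape, and batch-insert scenario are illustrative, not from the talk). Because try_insert is __host__ __device__, the same call also works from host threads, provided the map and its storage are visible to both (e.g. in managed memory):

// Each device thread inserts one key/value pair. Threads that race on the
// same key converge on one slot, because try_insert waits for reserved slots
// to fill and then compares keys before moving on.
template <typename Map, typename Key, typename Value>
__global__ void insert_all(Map* map, Key const* keys, Value const* values, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    map->try_insert(keys[i], values[i]);
}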

libcu++: The CUDA C++ Standard Library

Opt-in, heterogeneous, incremental C++ standard library for CUDA.

Open source; port of LLVM's libc++; contributing upstream.

Version 1 (next week): <atomic> (Pascal+), <type_traits>.

Version 2 (1H 2020): atomic<T>::wait/notify, <barrier>, <latch>, <counting_semaphore> (all Volta+), <chrono>, <ratio>, <functional> minus function.

Future priorities: atomic_ref<T>, <complex>, <tuple>, <array>, <utility>, <cmath>, string processing, …

Copyright (C) 2019 Bryce Adelstein Lelbach

@blelbach