accelerating simulation of agent-based models on heterogeneous architectures

23
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Accelerating Simulation of Agent- Based Models on Heterogeneous Architectures Jin Wang , Norman Rubin ‡* , Haicheng Wu , Sudhakar Yalamanchili † Georgia Institute of Technology ‡ AMD * The author is now affiliated with NVIDIA Research 1

Upload: muriel

Post on 07-Feb-2016

30 views

Category:

Documents


0 download

DESCRIPTION

Accelerating Simulation of Agent-Based Models on Heterogeneous Architectures. Jin Wang † , Norman Rubin ‡* , Haicheng Wu † , Sudhakar Yalamanchili † † Georgia Institute of Technology ‡ AMD * The author is now affiliated with NVIDIA Research. Discrete Heterogeneous Architectures. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Accelerating Simulation of Agent-Based Models on Heterogeneous Architectures

Jin Wang†, Norman Rubin‡*, Haicheng Wu†, Sudhakar Yalamanchili†

† Georgia Institute of Technology‡ AMD

* The author is now affiliated with NVIDIA Research1

Page 2: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 2

Discrete Heterogeneous Architectures

Discrete GPU and CPU connected by PCIe bus Powerful GPUs and CPUs Slow PCIe bus

Compute Unit

Compute Unit

Compute Unit

Compute Unit

Compute Unit

Compute Unit

Compute Unit

Compute Unit

Device

Mem

ory

GPU (e.g. AMD Southern Island) CPU (e.g. Intel Core i7)

PCIe bus

Host M

em

ory

Page 3: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 3

Integrated Heterogeneous Architectures

CPU and GPU one the same die Less powerful CPU and GPU Faster On-chip Memory Bus E.g. AMD Fusion APU

GPU

Compute Unit

Compute Unit

Compute Unit

Compute Unit

CPU

System Memory

Physical Memory

Host Mem Dev Mem

CPU GPU

UNB

L2 WC

Page 4: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 4

Agent-Base Model on Heterogeneous Architectures

Agent-based Model: Time-step or event-driven simulation of a group of agents

with states

On Discrete GPU: Intrinsic Parallel structure for GPU implementations CPUs only transfer data and are idle most of time

On Integrated Architectures: More computation capability can be extracted from CPU

Agent States

Agent States

Agent States

Transit F

unctions

Updated States

Updated States

Updated States

Page 5: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 5

This Work

Goal: Efficiently use integrated CPU-GPU architectures for agent-based model simulations

Proposed: A massively parallel implementation of agent-based model on

GPUs An optimization for integrated architectures that moves a

portion of computation to CPU

Uses Traffic Simulation as an example for Agent-based Model

Page 6: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 6

Traffic Simulation

Agent-based Model Simulation Two-lane Traffic Depends on close neighbors

States velocity xPosition Lane vehicleType

Transit Functions Acceleration Function: Depends on preceding neighbor Lane-change Function: Depends on preceding and back

neighbor on both lanes Three neighbors: Preceding neighbors in both lanes / back

neighbor in the other lane

xPosition

Page 7: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 7

GPU Massively Parallel Implementation

Data structure for states Structure of Arrays Sorted according to x positions Stored in Global Memory

Mapping One work-item for one vehicle A work-group for a block of vehicles

Three steps Locate Neighbors Update States Sort states according to x positions

Kernel 1

Kernel 2

Page 8: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 8

Locate Neighbors

Target: locate neighbors from mixed lanesTwo stages, both of which have BSP structure

Stage 1: Locate group neighbors Stage 2: Locate individual neighbors within a group

Page 9: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 9

Stage 1: Locate group neighbors

Stage 2: Locate individual neighbors within a group

Load group neighbors with current block to local memory

Locate Neighbors Cont.

Page 10: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 10

Update States and Sort

For each vehicle (mapped to each work-item) Get the neighbor index (from “locate-neighbor” step) Load neighbor data (velocity, xposition) Compute new acceleration, velocity and xposition according

to neighbor data using the transit functions Store the newly update states to global memory

Sort vehicles according to x position Sort is necessary because

Lane-changing One lane is moving faster than the other

In other agent-based model: Restructuring mechanism the same as or similar to sorting

Page 11: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 11

Experiments Platforms

Discrete Platform: GPU: AMD Radeon HD7950

Southern Island (GCN) 28 Compute Unites 850MHz

CPU: Intel Core i7-920 2.66GHz 4 CPU Cores / 8 Threads

Integrated Platform: AMD Trinity APU A10-5800K

4 CPU Cores HD7660D GPU

Northern Islands (4-way VLIW) 6 Compute Unites 800MHz

Page 12: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 12

Performance for GPU ImplementationSort consumes lots of time!

Radix Sort

Bitonic Sort Odd-Even Sort

Page 13: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 13

Optimization for Integrated Architectures

Move some computation to CPUUtilize faster on-chip memory bus

Non-zero copy memory access zero copy memory access

Page 14: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 14

Optimization: Local Sort and CPU Merge

Most of time: sort is only required within block (local sort)

Some time: merge required across blocks

Page 15: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 15

Optimization: Local Sort and CPU Merge Cont.

Merge neighbor blocks on CPU if necessary

Compare max X Position in current block with min X Position in the next block

There can be consecutive merge

Maximum consecutive blocks to be merged

Page 16: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 16

Benefit of Proposed Optimization

Reduce global workloadMerge algorithm is serial and can have CPU as its more natural venue

Communication between CPU and GPU for merge stage is faster on integrated platform through on-chip memory bus

Page 17: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 17

Results

Speedup over pure GPU implementation on Discrete Platform

128

256

512 1K 2K 4K 8K 16

K32

K64

K12

8K25

6K51

2K 1M 2M0

0.5

1

1.5

2

2.5

3

3.5

4

AMD Radeon HD 7950 (Baseline)AMD Radeon HD 7950 (Optimized with CPU)AMD A10-5800K APU (Optimized with CPU)

Number of Vehicles

Sp

eed

up

Even worse than baseline

Page 18: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 18

Sort on HD7950 baselineTraffic States Update on HD7950 baseline

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6CPU Merge (including Memory Access) on HD7950Traffic States Update + Local Sort on HD7950

Number of Vehicles

No

rmal

ized

Exe

cuti

on

Tim

e

CPU Merge (including Memory Access) on A10-5800KTraffic States Update + Local Sort on A10-5800K

Results Cont.

Breakdown for States Update / Sort / CPU Merge

Page 19: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 19

CPU Merge (including Memory Access) on A10-5800KTraffic States Update + Local Sort on A10-5800K

Sort on HD7950 baselineTraffic States Update on HD7950 baseline

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6CPU Merge (including Memory Access) on HD7950Traffic States Update + Local Sort on HD7950

Number of Vehicles

No

rmal

ized

Exe

cuti

on

Tim

e

Results Cont.

Breakdown for States Update / Sort / CPU Merge

Page 20: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 20

Sort on HD7950 baselineTraffic States Update on HD7950 baseline

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6CPU Merge (including Memory Access) on HD7950Traffic States Update + Local Sort on HD7950

Number of Vehicles

No

rmal

ized

Exe

cuti

on

Tim

e

CPU Merge (including Memory Access) on A10-5800KTraffic States Update + Local Sort on A10-5800K

Results Cont.

Breakdown for States Update / Sort / CPU Merge

Page 21: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 21

Conclusion

Optimization of agent-based model on Integrated Architectures through traffic simulation problem

Utilize computation capability of both CPU and GPU Memory access from host to device is faster through the on-

chip memory bus

Provides insight to mapping traditional GPGPU applications to integrated architectures

Page 22: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 22

Thank you!

Questions?

Page 23: Accelerating Simulation of Agent-Based Models  on Heterogeneous  Architectures

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 23

Appendix: Traffic Simulation Models

Acceleration Model

Lane-changing Model Interact with front vehicle before/after lane-changing, back vehicle

after lane-changings’ > minGap

s’’ > minGap

acc' (M') - acc (M) > p [ acc (B') - acc' (B') ] + athr

Reference: Martin Treiber and Arne Kesting. An open-source microscopic traffic simulator. Intelligent Transportation Systems Magazine, 2(3):6{13, Fall 2010.

acc’(M’)

acc(M)

acc(B’), acc’(B’)

s' s''

velocity distance to front vehicle

velocity difference from front vehicle