(gpu) - cpe.ku.ac.thparuj/204521/gpu.pdf · warp thread nvidia simt (single instruction multiple...

- A warp is a group of threads that execute the same instruction together.
- Nvidia calls this style of processing SIMT (Single Instruction Multiple Thread).
- Every thread in a warp runs the same kernel.

[Figure: Thread Warps 3, 7, and 8, each with a common PC. A thread warp is made of scalar threads W, X, Y, Z, which feed a SIMD pipeline.]

for (i=0; i < N; i++)
    C[i] = A[i] + B[i];

[Figure: scalar sequential code versus vectorized code, drawn over time. In the scalar version each iteration (Iter. 1, Iter. 2, ...) issues its own load, load, add, and store. In the vectorized version one vector instruction performs each load, add, or store for several iterations at once.]

Slide credit: Krste Asanovic
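The difference between the two timelines can be put into numbers with a rough instruction count. The sketch below is only a model under stated assumptions: it counts the four operations shown in the figure (load, load, add, store), ignores loop overhead, and uses hypothetical names (`scalar_instruction_count`, `vector_instruction_count`, `vlen`) that do not come from the slides.

```cpp
#include <cassert>

// Operations per element-wise add, as drawn in the figure:
// load A[i], load B[i], add, store C[i]. Loop overhead is ignored (assumption).
const long kOpsPerAdd = 4;

// Scalar code issues every operation once per iteration.
long scalar_instruction_count(long n) {
    assert(n >= 0);
    return kOpsPerAdd * n;
}

// Vector code issues every operation once per strip of `vlen` elements.
long vector_instruction_count(long n, long vlen) {
    assert(n >= 0 && vlen > 0);
    long strips = (n + vlen - 1) / vlen;  // ceiling division
    return kOpsPerAdd * strips;
}
```

For N = 64 and an assumed vector length of 16, this model gives 256 scalar instructions against 16 vector instructions.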

SIMT

- Each thread uses its thread id as the index of the element it works on.
- Let's assume N=16, blockDim=4 → 4 blocks

[Figure: elements 0 to 15 divided into 4 blocks of 4 threads; each thread performs one addition (+).]

Slide credit: Hyesoon Kim
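The N=16, blockDim=4 case can be checked on the host without a GPU. The following is a minimal sketch that walks every (block, thread) pair sequentially; `global_index` and `simt_add` are illustrative names, and the code assumes N is a multiple of the block size, as in the slide.

```cpp
#include <cassert>
#include <vector>

// Mirrors the CUDA expression blockDim.x * blockIdx.x + threadIdx.x.
int global_index(int block_dim, int block_idx, int thread_idx) {
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_dim * block_idx + thread_idx;
}

// Adds a and b element-wise by visiting every (block, thread) pair in turn,
// the way a grid of n / block_dim blocks covers n elements.
std::vector<int> simt_add(const std::vector<int>& a,
                          const std::vector<int>& b, int block_dim) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    assert(block_dim > 0 && n % block_dim == 0);  // assumption: even split
    std::vector<int> c(n);
    for (int block = 0; block < n / block_dim; ++block) {
        for (int t = 0; t < block_dim; ++t) {
            int i = global_index(block_dim, block, t);
            c[i] = a[i] + b[i];
        }
    }
    return c;
}
```

With 16 elements and a block size of 4, thread 3 of block 2 handles element `global_index(4, 2, 3)`, which is 11.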

CPU code

for (ii = 0; ii < 100; ++ii) {
    C[ii] = A[ii] + B[ii];
}

CUDA code

// there are 100 threads
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    int varA = aa[tid];
    int varB = bb[tid];
    C[tid] = varA + varB;
}

Slide credit: Hyesoon Kim
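The kernel on this slide relies on exactly 100 threads being launched. When the block size does not divide the element count, more threads than elements exist, and a bounds guard is the usual remedy. The sketch below emulates that on the host. The guard, the ceiling-division grid size, and the names `blocks_for`, `kernel_body`, and `launch` are assumptions added for illustration and are not part of the slide's code.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed to cover n elements (ceiling division).
int blocks_for(int n, int block_dim) {
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

// Host-side stand-in for the kernel body of a single thread.
void kernel_body(int block_dim, int block_idx, int thread_idx,
                 const std::vector<int>& aa, const std::vector<int>& bb,
                 std::vector<int>& c) {
    int tid = block_dim * block_idx + thread_idx;
    if (tid >= static_cast<int>(c.size())) return;  // guard: surplus thread
    c[tid] = aa[tid] + bb[tid];
}

// Runs every thread of every block sequentially.
std::vector<int> launch(const std::vector<int>& aa,
                        const std::vector<int>& bb, int block_dim) {
    assert(aa.size() == bb.size());
    std::vector<int> c(aa.size());
    int grid_dim = blocks_for(static_cast<int>(c.size()), block_dim);
    for (int b = 0; b < grid_dim; ++b)
        for (int t = 0; t < block_dim; ++t)
            kernel_body(block_dim, b, t, aa, bb, c);
    return c;
}
```

For 100 elements and an assumed block size of 32, `blocks_for` yields 4 blocks, so 128 threads run and the last 28 exit at the guard.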

Fine-grained multithreading

- One instruction from one thread is in the pipeline at any given time (no branch prediction).
- Latency is hidden by scheduling other warps in between (interleaved warp execution).
- The register file is large enough to hold the state of the threads in the warps.
- There is no context switching by the OS.
- A warp that misses in the D-Cache is taken out of the pipeline and set aside to wait.

[Figure: warps available for scheduling (Thread Warps 3, 8, 7) feed a SIMD pipeline with the stages I-Fetch, Decode, RF, ALU, D-Cache, and Writeback. After the D-Cache, a warp whose accesses all hit ("All Hit?") continues to Writeback; a warp with a miss ("Miss?") waits for its data alongside Thread Warps 1, 2, and 6.]

Slide credit: Tor Aamodt
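The latency-hiding effect of interleaving can be illustrated with a toy scheduler. This is a simplified model, not the hardware's actual policy: one instruction issues per cycle, a warp that issues cannot issue again for `latency` cycles, and ready warps are picked round-robin. The function name and its parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Returns the number of cycles needed to issue every instruction of every
// warp. Idle cycles, where no warp is ready, are counted as well.
int cycles_to_issue_all(int num_warps, int instrs_per_warp, int latency) {
    assert(num_warps > 0 && instrs_per_warp >= 0 && latency >= 1);
    std::vector<int> remaining(num_warps, instrs_per_warp);
    std::vector<int> ready_at(num_warps, 0);  // first cycle a warp may issue
    int left = num_warps * instrs_per_warp;
    int cycle = 0;
    int next = 0;  // round-robin starting point
    while (left > 0) {
        for (int k = 0; k < num_warps; ++k) {
            int w = (next + k) % num_warps;
            if (remaining[w] > 0 && ready_at[w] <= cycle) {
                --remaining[w];
                --left;
                ready_at[w] = cycle + latency;  // warp waits for its result
                next = (w + 1) % num_warps;
                break;  // at most one issue per cycle
            }
        }
        ++cycle;
    }
    return cycle;
}
```

In this model a single warp with 4 instructions and a latency of 4 needs 13 cycles, most of them idle, while 4 such warps finish all 16 instructions in 16 cycles because some warp is always ready.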

Conventional SIMD versus warp-based SIMD

- Conventional SIMD has only one thread.
  - It executes in lock step.
  - Programming is fairly hard: the programmer must know how to use control registers (such as VLEN) and must know the details of the pipeline.
- Warp-based SIMD has many threads, but every thread uses the same instruction stream.
  - The threads do not have to run in lock step.
  - Programming is fairly easy.
  - The code looks like a single-thread program; the compiler and the GPU hardware turn it into multiple threads.

Branches in warp-based SIMD

- Consider a warp of 4 threads running code blocks A through G, shown as a control flow graph.

[Figure: control flow graph with blocks A, B, C, D, E, F, G, next to a thread warp with a common PC and Threads 1, 2, 3, 4.]

Slide credit: Tor Aamodt

Branch divergence

- Branch divergence occurs when threads in the same warp take different paths at a branch.

[Figure: the threads of a warp reach a Branch; some follow Path A and the others follow Path B.]

Slide credit: Tor Aamodt

Stack

Each entry holds a reconvergence PC, the next PC, and an active mask. State right after the warp diverges at B:

Reconv. PC   Next PC   Active Mask
-            E         1111
E            D         0110
E            C         1001         <- TOS

The bottom entry starts as (-, A, 1111), moves on to (-, B, 1111), becomes (-, E, 1111) at the divergence, and ends as (-, G, 1111). When C finishes, its entry reads (E, E, 1001) and is popped, which leaves (E, D, 0110) at the top of the stack.

[Figure: control flow graph annotated with active masks: A/1111, B/1111, C/1001, D/0110, E/1111, G/1111 (F carries no mask). The thread warp has a common PC and Threads 1, 2, 3, 4.]

Execution order over time: A, B, C, D, E, G, A
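The stack mechanism can be traced in software. The sketch below hard-codes the example from the figure under stated assumptions: threads are numbered 0 to 3 and map to mask bits 0 to 3, threads 0 and 3 go to C and threads 1 and 2 go to D, and E is the reconvergence point of the branch at B. The names `StackEntry`, `next_block`, `reconvergence_point`, and `run_warp` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct StackEntry {
    char reconv_pc;  // 0 plays the role of "-" in the figure
    char next_pc;    // 0 means the program has ended
    unsigned mask;   // bit t set = thread t active
};

const int kWarpSize = 4;

// Successor block of `block` for a given thread (example from the figure).
char next_block(char block, int thread) {
    switch (block) {
        case 'A': return 'B';
        case 'B': return (thread == 0 || thread == 3) ? 'C' : 'D';
        case 'C': case 'D': return 'E';
        case 'E': return 'G';
        default:  return 0;  // G is the last block
    }
}

// Reconvergence point of a divergent branch; only B diverges here.
char reconvergence_point(char block) {
    assert(block == 'B');
    return 'E';
}

// Returns the order in which the warp executes the basic blocks.
std::string run_warp() {
    std::vector<StackEntry> stack = {{0, 'A', 0xFu}};
    std::string trace;
    while (!stack.empty()) {
        StackEntry& top = stack.back();
        if (top.next_pc == top.reconv_pc) {  // reached reconvergence: pop
            stack.pop_back();
            continue;
        }
        char pc = top.next_pc;
        trace += pc;
        std::map<char, unsigned> targets;  // successor -> threads going there
        for (int t = 0; t < kWarpSize; ++t)
            if ((top.mask >> t) & 1u) targets[next_block(pc, t)] |= 1u << t;
        if (targets.size() == 1) {
            top.next_pc = targets.begin()->first;
        } else {
            char r = reconvergence_point(pc);
            top.next_pc = r;  // this entry resumes at the reconvergence point
            // Push in reverse so the first target ends up on top of the stack.
            for (auto it = targets.rbegin(); it != targets.rend(); ++it)
                stack.push_back({r, it->first, it->second});
        }
    }
    return trace;
}
```

Calling `run_warp()` yields the order A, B, C, D, E, G, with C and D executed one after the other under masks 1001 and 0110.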


A;
if (some condition) {
    B;
} else {
    C;
}
D;

Control Flow Stack (one per warp):

Next PC   Recv PC   Amask
D         --        1111
B         D         1110
C         D         0001

The bottom entry is (A, --, 1111) before the branch and becomes (D, --, 1111) once the two paths are on the stack.

Execution sequence over time, one row per thread (1 = active, 0 = masked off):

            A   C   B   D
Thread 1    1   0   1   1
Thread 2    1   0   1   1
Thread 3    1   0   1   1
Thread 4    1   1   0   1
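The cost of this divergence can be expressed as SIMD utilization: the share of lane slots that do useful work. The sketch below replays the A, C, B, D sequence for an arbitrary per-thread condition and reports the active mask of each step. It is a simplified model (one step per basic block, else-path first as in the figure), and the names `Step`, `execute_if_else`, and `simd_utilization` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Step = std::pair<char, std::string>;  // basic block, active mask

std::string mask_string(const std::vector<bool>& lanes) {
    std::string s;
    for (bool active : lanes) s += active ? '1' : '0';
    return s;
}

bool any_active(const std::vector<bool>& lanes) {
    for (bool active : lanes) if (active) return true;
    return false;
}

// cond[i] is the outcome of "some condition" for thread i.
// A path with no active thread is skipped entirely.
std::vector<Step> execute_if_else(const std::vector<bool>& cond) {
    assert(!cond.empty());
    std::vector<bool> all(cond.size(), true);
    std::vector<bool> else_lanes(cond.size());
    for (std::size_t i = 0; i < cond.size(); ++i) else_lanes[i] = !cond[i];
    std::vector<Step> steps;
    steps.push_back({'A', mask_string(all)});
    if (any_active(else_lanes)) steps.push_back({'C', mask_string(else_lanes)});
    if (any_active(cond)) steps.push_back({'B', mask_string(cond)});
    steps.push_back({'D', mask_string(all)});
    return steps;
}

// Fraction of lane slots that were active across all steps.
double simd_utilization(const std::vector<Step>& steps) {
    assert(!steps.empty());
    std::size_t active = 0, slots = 0;
    for (const Step& s : steps) {
        slots += s.second.size();
        for (char ch : s.second) if (ch == '1') ++active;
    }
    return static_cast<double>(active) / static_cast<double>(slots);
}
```

With the condition true for threads 1 to 3 and false for thread 4, the steps carry the masks 1111, 0001, 1110, 1111 and the utilization is 12 of 16 lane slots, or 0.75. A warp whose threads all agree skips one path and reaches 1.0.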


- Idea: after a branch, threads that diverge can be regrouped into new warps.
- Warp … SIMD … cycle (… cycle)

[Figure: Branch, Path A, Path B]

- Related article: Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007

Active masks of Warp x and Warp y at each basic block of the control flow graph:

Block   Warp x   Warp y
A       1111     1111
B       1110     0011
C       1000     0010
D       0110     0001
E       1110     0011
F       0001     1100
G       1111     1111

Legend: two boxes labeled A mark the execution of Warp x and of Warp y at Basic Block A. A single box labeled D marks a new warp created from scalar threads of both Warp x and y executing at Basic Block D.

Execution over time:

Baseline:                 A A B B C C D D E E F F G G A A
Dynamic Warp Formation:   A A B B C D E E F G G A A
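The saving shown in the two timelines can be recomputed from the masks. The sketch below counts issue slots per basic block under a simplifying assumption: two warps can merge into one only when their active masks do not overlap, so that no two threads compete for the same lane. This is an illustration of the idea and not the scheduler described by Fung et al.; `BlockMasks`, `baseline_slots`, `merged_slots`, and `figure_example` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct BlockMasks {
    char block;
    unsigned x;  // active mask of Warp x at this block
    unsigned y;  // active mask of Warp y at this block
};

// Baseline: every warp with at least one active thread takes its own slot.
int baseline_slots(const std::vector<BlockMasks>& path) {
    int slots = 0;
    for (const BlockMasks& b : path) slots += (b.x != 0) + (b.y != 0);
    return slots;
}

// Merging: two active warps share one slot when no lane is used by both.
int merged_slots(const std::vector<BlockMasks>& path) {
    int slots = 0;
    for (const BlockMasks& b : path) {
        bool both_active = b.x != 0 && b.y != 0;
        if (both_active && (b.x & b.y) == 0)
            slots += 1;  // lanes are disjoint: form one new warp
        else
            slots += (b.x != 0) + (b.y != 0);
    }
    return slots;
}

// Masks taken from the figure, blocks A through G.
std::vector<BlockMasks> figure_example() {
    return {
        {'A', 0b1111, 0b1111}, {'B', 0b1110, 0b0011}, {'C', 0b1000, 0b0010},
        {'D', 0b0110, 0b0001}, {'E', 0b1110, 0b0011}, {'F', 0b0001, 0b1100},
        {'G', 0b1111, 0b1111},
    };
}
```

For one pass from A to G this model gives 14 slots for the baseline and 11 with merging, since blocks C, D, and F each collapse to a single warp, which matches the timelines when the trailing "A A" of the next pass is left out.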


In NVIDIA's terms:
- 240 stream processors
- “SIMT execution”

In generic terms:
- 30 cores
- 8 SIMD functional units in each core

Slide credit: Kayvon Fatahalian

[Figure: one core with 64 KB of storage for fragment contexts (registers). Legend: SIMD functional unit, control shared across 8 units; instruction stream decode; execution context storage; multiply-add; multiply.]

[Figure: the same core with 64 KB of storage for thread contexts (registers).]

- One warp is 32 threads
- Up to 32 warps can be scheduled
- 1024 thread contexts



[Figure: the full chip, made of 30 cores with texture units (Tex).]

30 cores: 30,720 threads
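The figures quoted on these slides follow from a few multiplications, spelled out below. The struct and function names are hypothetical; the numbers are the ones given on the slides. The bytes-per-context value is a derived quantity that assumes the 64 KB is split evenly across all 1024 contexts.

```cpp
#include <cassert>

struct GpuConfig {
    int cores;
    int simd_units_per_core;
    int warps_per_core;
    int threads_per_warp;
    int context_bytes_per_core;
};

// Values as stated on the slides.
GpuConfig slide_config() {
    return {30, 8, 32, 32, 64 * 1024};
}

int stream_processors(const GpuConfig& g) {
    return g.cores * g.simd_units_per_core;
}

int thread_contexts_per_core(const GpuConfig& g) {
    return g.warps_per_core * g.threads_per_warp;
}

int total_threads(const GpuConfig& g) {
    return g.cores * thread_contexts_per_core(g);
}

// Register storage per thread, assuming an even split (assumption).
int bytes_per_thread_context(const GpuConfig& g) {
    assert(thread_contexts_per_core(g) > 0);
    return g.context_bytes_per_core / thread_contexts_per_core(g);
}
```

With the slide's numbers this gives 240 stream processors, 1024 thread contexts per core, 30,720 threads in total, and 64 bytes of register storage per thread context when all contexts are in use.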


