
ATI Stream Computing

ATI Radeon™ HD 2900 Series Instruction Set Architecture

Micah Villmow

May 30, 2008

| ATI Stream Computing Update | Confidential2 2 | ATI Stream Computing – ATI Radeon™ HD 2900 Series Instruction Set Architecture

ATI Radeon™ HD 2900 Series GPU ISA

• Useful Definitions

• Why learn ISA?

• Control Flow Programs – What are they?

• Clauses – Atomicity Guaranteed

ALU Clauses

TEX Clauses

VTX Clauses

• Instructions – Understanding them


Definitions

CF – Control Flow

Clause temps – GPR124-127, which are temporary registers, also referred to as T#

kcache – on-chip constant memory that can be locked

Clause – homogeneous group of instructions run atomically on the hardware, either ALU, TEX, or VTX

Quad – Four (x,y) data elements arranged in a 2-by-2 array

Fetch – Load data via the vertex or texture instructions

Predicate – A bit that is set or cleared as the result of a condition and that masks the writing of an ALU result

PV – Previous Vector, holds the vector unit (XYZW) results from the previous ALU instruction group

PS – Previous Scalar, holds the trans unit (T) result from the previous ALU instruction group


Whole Quad Mode vs. Valid Pixel Mode

Whole Quad Mode (WQM)

Executes the clause as if all pixels are alive.

Valid Pixel Mode (VPM)

Executes only live pixels.

Figure: a 2-by-2 quad with pixels numbered 0 1 / 2 3, shown in two sequences.

WQM: all pixels valid → kill pixel 1 → executing ALU with the WQM flag brings pixel 1 back temporarily.

VPM: all pixels valid → kill pixel 1 → executing ALU with the VPM flag ignores pixel 1.
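The two modes can be illustrated with a small mask model. This is only a sketch under simple assumptions: the quad is a list of four booleans, and the function name `executing_pixels` is hypothetical, not part of any ATI tool.

```python
def executing_pixels(valid_mask, whole_quad_mode):
    """Return which pixels of a quad run the clause.

    valid_mask      -- four booleans, True where the pixel is still alive
    whole_quad_mode -- True for WQM, False for VPM
    """
    if whole_quad_mode:
        # WQM: run the clause as if every pixel in the quad were alive.
        return [True] * len(valid_mask)
    # VPM: only the live pixels run.
    return list(valid_mask)


quad = [True, True, True, True]   # all pixels valid
quad[1] = False                   # kill pixel 1

wqm = executing_pixels(quad, whole_quad_mode=True)
vpm = executing_pixels(quad, whole_quad_mode=False)
```

The valid mask itself is left unchanged in this model, which mirrors the slide's point that WQM brings pixel 1 back only temporarily.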


Why Learn ISA?

• Help to understand what is actually being executed

• Allow exact calculation of theoretical peaks

• Determine bottlenecks in code

• Help to optimize code by analyzing generated code

Understand this (GPU ISA):

    ;PS
    ; -------- Disassembly --------------------
    00 ALU: ADDR(32) CNT(8) KCACHE0(CB0:0-15)
          0  x: MOV R0.x, KC0[1].x
             y: MOV R1.y, KC0[1].y
             z: MOV R1.z, KC0[1].z
             w: MOV R1.w, KC0[1].w
          1  x: MOV R1.x, KC0[0].w
             y: MOV R1.y, KC0[0].z
             z: MOV R1.z, KC0[0].y
             w: MOV R1.w, KC0[0].x
    01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
    02 ALU_BREAK: ADDR(40) CNT(3)
          2  x: SETGT_DX10 R2.x, R0.x, 0x3C23D70A
          3  x: PREDNE_INT ____, R2.x, 0.0f
    03 ALU: ADDR(43) CNT(5) KCACHE0(CB0:0-15)
          4  x: ADD R1.x, KC0[0].x, R1.x
             y: ADD R1.y, KC0[0].y, R1.y
             z: ADD R1.z, KC0[0].z, R1.z
             w: ADD R1.w, KC0[0].w, R1.w
             t: ADD R0.x, R0.x, -1.0f
    04 ENDLOOP i0 PASS_JUMP_ADDR(2)
    05 EXP_DONE: PIX0, R1
    END_OF_PROGRAM

Write this (AMD HLSL):

    const CALchar* HLSLKernel =
    "cbuffer myConstants\n"
    "{ float4 inc;float4 repeat;};\n"
    "void main( in float4 wpos:VPOS, out float4 out0 : SV_TARGET )\n"
    "{\n"
    " out0 = inc.wzyx; \n"
    " for( ; repeat.x>0.01f; repeat.x=repeat.x-1.f){\n"
    " out0 = out0 + inc; \n"
    " }}\n";

    CALobject obj;
    calutAMDhlslCompileProgram(&obj, CAL_PROGRAM_TYPE_PS, HLSLKernel, CAL_TARGET_670);
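One detail that helps when matching the HLSL to the disassembly: float constants can appear in the ISA as raw 32-bit hex literals. The loop bound 0.01f from the HLSL shows up as 0x3C23D70A in the SETGT_DX10 instruction. The helper below is a minimal sketch for checking such literals; the name `decode_f32` is hypothetical.

```python
import struct


def decode_f32(bits):
    """Reinterpret a 32-bit integer literal as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Literal taken from "SETGT_DX10 R2.x, R0.x, 0x3C23D70A" in the disassembly.
loop_bound = decode_f32(0x3C23D70A)
```

The decoded value is the nearest single-precision float to 0.01, so it differs from the Python double 0.01 in the low digits.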


Control Flow Programs

• Series of Control Flow instructions which can:

Initiate execution of clauses

Allocate space in input or output buffer

Export to or import from a data buffer

Control branching, looping, and stack operations

• 40 cycle latency that needs to be hidden

Control Flow Instructions

In the disassembly, the lines with two-digit numbers (00 to 05) are the control flow instructions; the lines listed under an ALU clause are that clause's ALU instructions.

    00 ALU: ADDR(32) CNT(8) KCACHE0(CB0:0-15)
    01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
    02 ALU_BREAK: ADDR(40) CNT(3)
    03 ALU: ADDR(43) CNT(5) KCACHE0(CB0:0-15)
    04 ENDLOOP i0 PASS_JUMP_ADDR(2)
    05 EXP_DONE: PIX0, R1
    END_OF_PROGRAM
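To pick the control flow instructions out of a disassembly dump, a short script can be enough. The sketch below assumes the text layout used in this deck (CF instructions start at column 0 with a two-digit index, clause contents are indented). The listing is abbreviated to one ALU instruction per clause, and `control_flow_instructions` is a hypothetical helper name.

```python
import re

# Abbreviated copy of the disassembly; clause bodies are indented.
LISTING = """\
00 ALU: ADDR(32) CNT(8) KCACHE0(CB0:0-15)
      0  x: MOV R0.x, KC0[1].x
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
02 ALU_BREAK: ADDR(40) CNT(3)
      2  x: SETGT_DX10 R2.x, R0.x, 0x3C23D70A
      3  x: PREDNE_INT ____, R2.x, 0.0f
03 ALU: ADDR(43) CNT(5) KCACHE0(CB0:0-15)
      4  x: ADD R1.x, KC0[0].x, R1.x
04 ENDLOOP i0 PASS_JUMP_ADDR(2)
05 EXP_DONE: PIX0, R1
"""

# A CF line: two digits at column 0, a space, then the opcode.
CF_LINE = re.compile(r"^(\d{2}) ([A-Z0-9_]+)")


def control_flow_instructions(listing):
    """Return (index, opcode) for every control flow line in the listing."""
    found = []
    for line in listing.splitlines():
        match = CF_LINE.match(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


cf = control_flow_instructions(LISTING)
```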


Typical CF Program Flow


Control Flow Clauses – ALU/TEX

ALU CF Clause – 1 to 128 ALU slots per clause, with a maximum of 5 ALU slots (x, y, z, w, t) per instruction group.

    03 ALU: ADDR(43) CNT(5) KCACHE0(CB0:0-15)
          4  x: ADD R1.x, KC0[0].x, R1.x
             y: ADD R1.y, KC0[0].y, R1.y
             z: ADD R1.z, KC0[0].z, R1.z
             w: ADD R1.w, KC0[0].w, R1.w
             t: ADD R0.x, R0.x, -1.0f

TEX CF Clause – 1 to 8 TEX slots per clause

    03 TEX: ADDR(176) CNT(8) VALID_PIX
          9  SAMPLE R0, R18.wxww, t4, s4 UNNORM(XYZW)
         10  SAMPLE R5, R18.wyww, t0, s0 UNNORM(XYZW)
         11  SAMPLE R6, R18.wyww, t1, s1 UNNORM(XYZW)
         12  SAMPLE R7, R18.wyww, t2, s2 UNNORM(XYZW)
         13  SAMPLE R8, R18.wyww, t3, s3 UNNORM(XYZW)
         14  SAMPLE R1, R18.wxww, t5, s5 UNNORM(XYZW)
         15  SAMPLE R3, R18.wxww, t6, s6 UNNORM(XYZW)
         16  SAMPLE R9, R18.wxww, t7, s7 UNNORM(XYZW)


Control Flow Clauses – VTX/VTX_TC

VTX CF Clause – 1 to 8 VTX slots per clause

    00 VTX: ADDR(176) CNT(2)
          0  VFETCH R1.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)
          2  VFETCH R2.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)

VTX_TC CF Clause – same as VTX, but through the texture cache; used when a vertex unit does not exist on the chip

    00 VTX_TC: ADDR(176) CNT(2)
          0  VFETCH R1.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)
          2  VFETCH R2.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)
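The slot limits quoted on these slides (1 to 128 for ALU, 1 to 8 for TEX, VTX, and VTX_TC) can be checked against a clause header line. The following is a rough sketch that assumes the CNT(...) field of the header is the clause's slot count; the names `parse_clause_header` and `clause_count_ok` are hypothetical.

```python
import re

# Slot limits per clause type, as stated on the slides.
CLAUSE_SLOT_LIMITS = {"ALU": (1, 128), "TEX": (1, 8), "VTX": (1, 8), "VTX_TC": (1, 8)}

# Matches headers such as "03 TEX: ADDR(176) CNT(8) VALID_PIX".
HEADER = re.compile(r"^(\d+) (\w+): ADDR\((\d+)\) CNT\((\d+)\)")


def parse_clause_header(line):
    """Split a clause header into its CF index, clause type, address, and count."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError(f"not a clause header: {line!r}")
    index, kind, addr, cnt = match.groups()
    return {"index": int(index), "kind": kind, "addr": int(addr), "cnt": int(cnt)}


def clause_count_ok(header):
    """True when the count lies within the limit for this clause type."""
    low, high = CLAUSE_SLOT_LIMITS[header["kind"]]
    return low <= header["cnt"] <= high


tex = parse_clause_header("03 TEX: ADDR(176) CNT(8) VALID_PIX")
vtx = parse_clause_header("00 VTX_TC: ADDR(176) CNT(2)")
```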


Control Flow Clauses – Color/Scratch

EXP_DONE – Sends data out via the pixel buffer, or color buffer, and signals that no more exports will occur

    05 EXP_DONE: PIX0, R1               // write R1 to o0 only

or

    01 EXP_DONE: PIX0, R1 BRSTCNT(7)    // write R1-R8 to o0-o7

Scratch Write – Write to a scratch buffer

HD38XX:

    01 MEM_SCRATCH_WRITE_IND: VEC_PTR[0+R0.x], R1, ARRAY_SIZE(1) ELEM_SIZE(3) BURST_CNT(0)

HD48XX:

    2 MEM_SCRATCH_WRITE_IND_ACK: VEC_PTR[0+R2.x], R1, ARRAY_SIZE(1) ELEM_SIZE(3) BURST_CNT(0)

Scratch Read – Read from a scratch buffer

HD38XX:

    03 MEM_SCRATCH_READ_IND: R3, VEC_PTR[0+R3.x], ARRAY_SIZE(1) ELEM_SIZE(3)

HD48XX:

    04 VTX: ADDR(48) CNT(2)
          4  MEM_SCRATCH_READ_VF R3, VEC_PTR[0+R3.x], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED BURST_CNT(0)
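The burst form of EXP_DONE can be read as "export this register and the next BRSTCNT registers to consecutive outputs". The sketch below models only that register-to-output mapping, based on the two comments in the slide (no burst writes R1 to o0, BRSTCNT(7) writes R1-R8 to o0-o7); `export_targets` is a hypothetical name.

```python
def export_targets(first_gpr, burst_count=0):
    """List the (register, output) pairs covered by one EXP_DONE.

    burst_count is the BRSTCNT value; a burst of n exports n + 1 registers.
    """
    return [(f"R{first_gpr + i}", f"o{i}") for i in range(burst_count + 1)]


single = export_targets(1)                 # EXP_DONE: PIX0, R1
burst = export_targets(1, burst_count=7)   # EXP_DONE: PIX0, R1 BRSTCNT(7)
```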


Control Flow Clauses – Global Buffer

Gather Clause – Read from Global Memory Buffer

HD38XX:

    01 MEM_GLOBAL_READ_IND: R0, DWORD_PTR[0+R0.x], ELEM_SIZE(3)

HD48XX:

    01 VTX: ADDR(48) CNT(1)
          4  MEM_SCATTER_READ_VF R0, DWORD_PTR[0+R0.x], ELEM_SIZE(3) UNCACHED BURST_CNT(0)

Scatter Clause – Write to Global Memory Buffer

HD38XX:

    01 MEM_GLOBAL_WRITE_IND: DWORD_PTR[0+R1.x], R0, ELEM_SIZE(3)

or HD48XX:

    01 MEM_GLOBAL_WRITE_IND_ACK: DWORD_PTR[0+R1.x], R0, ELEM_SIZE(3)


Control Flow Clauses - Conditionals

ALU_BREAK: Breaks out of a loop based on the predicate set by an instruction in the clause

    02 ALU_BREAK: ADDR(37) CNT(2) KCACHE0(CB0:0-15)
          1  y: SETE_INT R0.y, R0.x, KC0[1].x
          2  x: PREDE_INT ____, R0.y, 0.0f UPDATE_EXEC_MASK UPDATE_PRED

Other instructions (if, else, endloop, etc.):

Jump (i.e., if):

    01 JUMP ADDR(5) VALID_PIX

Else:

    05 ELSE POP_CNT(1) ADDR(22) VALID_PIX

Push stack:

    12 ALU_PUSH_BEFORE: ADDR(50) CNT(3)

Pop stack:

    44 ALU_POP_AFTER: ADDR(122) CNT(1)

While loop:

    01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX

End loop:

    04 ENDLOOP i0 PASS_JUMP_ADDR(2)
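To see how LOOP_DX10, ALU_BREAK, and ENDLOOP fit together, it can help to walk the six-instruction CF program from the earlier loop example by hand. The sketch below does that for a single thread. It is a simplification under stated assumptions: there is no stack or per-pixel mask, the thread breaks when the predicate from PREDNE_INT is false, and a break resumes at the address given by FAIL_JUMP_ADDR(5). The function name `run_cf_program` is hypothetical.

```python
def run_cf_program(inc, repeat_x):
    """Walk CF instructions 00-05 for one thread.

    inc      -- KC0[0] as (x, y, z, w)
    repeat_x -- KC0[1].x, the loop counter
    Returns the list of CF addresses visited and the exported R1.
    """
    r0_x = 0.0
    r1 = [0.0, 0.0, 0.0, 0.0]
    trace = []
    pc = 0
    while True:
        trace.append(pc)
        if pc == 0:      # 00 ALU: R0.x = KC0[1].x, R1 = KC0[0].wzyx
            r0_x = repeat_x
            r1 = [inc[3], inc[2], inc[1], inc[0]]
            pc = 1
        elif pc == 1:    # 01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5)
            pc = 2
        elif pc == 2:    # 02 ALU_BREAK: SETGT_DX10 + PREDNE_INT
            predicate = r0_x > 0.01
            pc = 3 if predicate else 5   # assumed: a false predicate breaks
        elif pc == 3:    # 03 ALU: R1 += KC0[0]; R0.x -= 1.0
            r1 = [c + r for c, r in zip(inc, r1)]
            r0_x -= 1.0
            pc = 4
        elif pc == 4:    # 04 ENDLOOP i0 PASS_JUMP_ADDR(2)
            pc = 2
        else:            # 05 EXP_DONE: PIX0, R1
            return trace, tuple(r1)


trace, exported = run_cf_program((1.0, 2.0, 3.0, 4.0), 2.0)
```

With a counter of 2.0 the loop body runs twice, so the trace visits addresses 2, 3, 4 twice before the break sends control to the export at address 5.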


ALU Clauses – ALU Overview

    03 ALU: ADDR(43) CNT(5) KCACHE0(CB0:0-15)
          4  x: ADD R1.x, KC0[0].x, R1.x
             y: ADD R1.y, KC0[0].y, R1.y
             z: ADD R1.z, KC0[0].z, R1.z
             w: ADD R1.w, KC0[0].w, R1.w
             t: ADD R0.x, R0.x, -1.0f

Figure: R1 and KC0[0] are each 128 bits wide, made of four 32-bit components ordered W Z Y X from MSB to LSB. The x, y, z, and w slots add the matching components of KC0[0] and R1 into R1; the T slot takes R0.x and -1.0f and writes R0.x.
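The data flow of this instruction group can be written out as a small model. This is a sketch, assuming registers are represented as dictionaries keyed by component and that all five slots read their sources before any result is written; `execute_group_4` is a hypothetical name.

```python
def execute_group_4(r0, r1, kc0_0):
    """Model instruction group 4: four vector ADDs plus one trans ADD.

    r0, r1, kc0_0 -- dicts with keys "x", "y", "z", "w"
    Returns the updated (r0, r1) without modifying the inputs.
    """
    # Vector slots x, y, z, w: R1.c = KC0[0].c + R1.c
    vector_results = {c: kc0_0[c] + r1[c] for c in "xyzw"}
    # Trans slot t: R0.x = R0.x + (-1.0f)
    trans_result = r0["x"] + (-1.0)

    new_r1 = dict(vector_results)
    new_r0 = dict(r0, x=trans_result)
    return new_r0, new_r1


r0 = {"x": 3.0, "y": 0.0, "z": 0.0, "w": 0.0}
r1 = {"x": 4.0, "y": 3.0, "z": 2.0, "w": 1.0}
kc0_0 = {"x": 1.0, "y": 2.0, "z": 3.0, "w": 4.0}

new_r0, new_r1 = execute_group_4(r0, r1, kc0_0)
```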


ALU Clauses – GPRs

• 127 GPRs accessible per thread, via the R register

• 256 constants per thread, via C register

• GPRs 124-127 are temps that last through an ALU CF clause, via the T register

• PV/PS are temps that last for one ALU instruction group

• SR – Shared global registers

• AR – Address register allows dynamic indexing into register file, only via MOVA instruction

• aL – Index loop register for loop based offsets

• KC0/1 – Constant cache bank 0/1 registers

• Read port and cycle Restrictions!


ALU Clause - ALU Data Flow

• GPR read port restrictions – only 3 different GPRs are accessible per ALU clause

• Constant read port restrictions – only 4 distinct elements can be read per ALU clause


ALU Clauses – Misc

Cycle restrictions cause issues when reading from T/R/SR registers.

The src registers are read over three cycles.

Src0 = cycle 0, src1 = cycle 1, src2 = cycle 2

VEC_### changes the cycle the read would occur at because of port restrictions

          4  x: ADD    R1.x, R31.y, (0xC0400000, -3.0f).x
             y: MULADD T0.y, -PV3.y, (0x41000000, 8.0f).y, T1.z
             z: ADD    R2.z, T0.x, R29.z
             w: MULADD T0.w, -PV3.z, (0x41000000, 8.0f).y, T0.z VEC_120
             t: ADD    R2.w, R16.w, R31.y

    02 ALU: ADDR(39) CNT(2)
          3  x: MOV SR0.x, R0.x
          4  x: MOV R1.x, SR0.x
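The read-cycle rule above (src0 in cycle 0, src1 in cycle 1, src2 in cycle 2) and the effect of a VEC_### tag can be sketched as a lookup. The slides do not spell out how the three digits are interpreted, so this sketch assumes that they name the read cycle for src0, src1, and src2 in that order; `read_cycles` is a hypothetical helper.

```python
def read_cycles(bank_swizzle="VEC_012"):
    """Map each source operand to the cycle in which it is read.

    Assumption: in VEC_abc, digit a is the cycle for src0, b for src1,
    and c for src2. VEC_012 is then the default ordering.
    """
    digits = bank_swizzle.split("_")[1]
    if sorted(digits) != ["0", "1", "2"]:
        raise ValueError(f"unexpected swizzle: {bank_swizzle!r}")
    return {f"src{i}": int(d) for i, d in enumerate(digits)}


default_order = read_cycles()
swizzled = read_cycles("VEC_120")   # tag on the w-slot MULADD
```

Under that assumption, the VEC_120 tag on the w-slot MULADD moves its src2 read (T0.z) to cycle 0 and pushes src0 and src1 one cycle later.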


TEX/VTX Clauses

    03 TEX: ADDR(18748) CNT(4) VALID_PIX
         20  SAMPLE R2, R2.zwzz, t1, s1 UNNORM(XYZW)
         21  SAMPLE R6, R0.zyzz, t1, s1 UNNORM(XYZW)
         22  SAMPLE R8, R3.zwzz, t1, s1 UNNORM(XYZW)
         23  SAMPLE R32, R29.zwzz, t1, s1 UNNORM(XYZW)

    01 TEX: ADDR(112) CNT(5)
         16  LDS_READ R0, R0.zy WATERFALL

    01 VTX: ADDR(48) CNT(1)
          4  MEM_SCATTER_READ_VF R0, DWORD_PTR[0+R0.x], ELEM_SIZE(3) UNCACHED BURST_CNT(0)

    00 VTX: ADDR(176) CNT(2)
          0  VFETCH R1.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)

    04 VTX: ADDR(48) CNT(2)
          4  MEM_SCRATCH_READ_VF R3, VEC_PTR[0+R3.x], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED BURST_CNT(0)

    01 TEX: ADDR(48) CNT(1)
          1  LDS_WRITE (0) R0.xyyy, STRIDE(16) SIMD_REL
    02 TEX: ADDR(48) CNT(1)
          2  LDS_WRITE (0) R1.xyyy, STRIDE(16) SIMD_ABS
    03 TEX: ADDR(48) CNT(1)
          3  LDS_WRITE (0) R2.xyyy, STRIDE(16) SIMD_REL FFT_PERMUTE


Disclaimer & Attribution

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.
