ATI Stream Computing
ATI Radeon™ HD 2900 Series Instruction Set Architecture
Micah Villmow
May 30, 2008
| ATI Stream Computing Update | Confidential | ATI Stream Computing – ATI Radeon™ HD 2900 Series Instruction Set Architecture
ATI Radeon™ HD 2900 Series GPU ISA
• Useful Definitions
• Why learn ISA?
• Control Flow Programs – What are they?
• Clauses – Atomicity Guaranteed
ALU Clauses
TEX Clauses
VTX Clauses
• Instructions – Understanding them
Definitions
CF – Control Flow
Clause temps – GPRs 124–127, which serve as temporary registers; also referred to as T#
kcache – on-chip constant memory that can be locked
Clause – homogeneous group of instructions run atomically on the hardware; either ALU, TEX, or VTX
Quad – Four (x,y) data elements arranged in a 2-by-2 array
Fetch – Load data via the vertex or texture instructions
Predicate – A bit that is set or cleared as the result of a condition and that masks the writing of an ALU result
PV – Previous Vector; holds the vector unit (XYZW) results from the previous ALU instruction group
PS – Previous Scalar; holds the trans unit (T) result from the previous ALU instruction group
Whole Quad Mode vs. Valid Pix
Whole Quad Mode (WQM)
Executes the clause as if all pixels are alive.
Valid Pixel Mode (VPM)
Executes only live pixels.
[Figure: a 2-by-2 quad with pixels numbered 0 1 / 2 3, shown in two sequences]
WQM sequence: all pixels valid, then pixel 1 is killed, then executing ALU with the WQM flag brings pixel 1 back temporarily.
VPM sequence: all pixels valid, then pixel 1 is killed, then executing ALU with the VPM flag ignores pixel 1.
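The difference between the two modes can be sketched as a per-quad execution mask. This is only an illustrative model in Python, not hardware behaviour; the function name active_pixels and the list-of-booleans representation are assumptions made for this example.

```python
def active_pixels(valid, whole_quad_mode):
    """Return which of the four pixels of a quad execute a clause.

    valid           -- list of 4 booleans, True if the pixel is still alive
    whole_quad_mode -- True for WQM, False for Valid Pixel Mode
    """
    if whole_quad_mode:
        # WQM: run the clause as if every pixel in the quad were alive.
        return [True] * 4
    # VPM: run the clause only for pixels that are still alive.
    return list(valid)


quad_valid = [True, False, True, True]      # pixel 1 has been killed
wqm = active_pixels(quad_valid, True)       # pixel 1 is brought back temporarily
vpm = active_pixels(quad_valid, False)      # pixel 1 is ignored
```

The valid mask itself is not modified, which mirrors the slide's note that WQM only brings the killed pixel back temporarily.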
Why Learn ISA?
• Help to understand what is actually being executed
• Allow exact calculation of theoretical peaks
• Determine bottlenecks in code
• Help to optimize code by analyzing generated code
GPU ISA
;PS
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(8) KCACHE0(CB0:0-15)
      0  x: MOV R0.x, KC0[1].x
         y: MOV R1.y, KC0[1].y
         z: MOV R1.z, KC0[1].z
         w: MOV R1.w, KC0[1].w
      1  x: MOV R1.x, KC0[0].w
         y: MOV R1.y, KC0[0].z
         z: MOV R1.z, KC0[0].y
         w: MOV R1.w, KC0[0].x
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
02 ALU_BREAK: ADDR(40) CNT(3)
      2  x: SETGT_DX10 R2.x, R0.x, 0x3C23D70A
      3  x: PREDNE_INT ____, R2.x, 0.0f
03 ALU: ADDR(43) CNT(5) KCACHE0(CB0:0-15)
      4  x: ADD R1.x, KC0[0].x, R1.x
         y: ADD R1.y, KC0[0].y, R1.y
         z: ADD R1.z, KC0[0].z, R1.z
         w: ADD R1.w, KC0[0].w, R1.w
         t: ADD R0.x, R0.x, -1.0f
04 ENDLOOP i0 PASS_JUMP_ADDR(2)
05 EXP_DONE: PIX0, R1
END_OF_PROGRAM
Understand this
AMD HLSL
/* HLSL source for the pixel shader, embedded as a C string */
const CALchar* HLSLKernel =
    "cbuffer myConstants\n"
    "{ float4 inc; float4 repeat; };\n"
    "void main( in float4 wpos : VPOS, out float4 out0 : SV_TARGET )\n"
    "{\n"
    "    out0 = inc.wzyx;\n"
    "    for( ; repeat.x > 0.01f; repeat.x = repeat.x - 1.f ) {\n"
    "        out0 = out0 + inc;\n"
    "    }\n"
    "}\n";
/* Compile the HLSL source to a CAL object for the target GPU */
CALobject obj;
calutAMDhlslCompileProgram( &obj, CAL_PROGRAM_TYPE_PS, HLSLKernel, CAL_TARGET_670 );
Write this
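One way reading the ISA helps with peak and bottleneck estimates is to count how many of the five ALU slots (x, y, z, w, t) each instruction group fills. The Python sketch below does this for the five instruction groups in the disassembly above. The slot lists were transcribed by hand from that listing, the helper name slot_utilization is hypothetical, and the figure is only one input to a real estimate since it ignores fetch and control flow latency.

```python
SLOTS_PER_GROUP = 5  # x, y, z, w and t

# Slots filled by each instruction group (0-4) in the disassembly above.
groups = {
    0: "xyzw",   # four MOVs
    1: "xyzw",   # four MOVs
    2: "x",      # SETGT_DX10
    3: "x",      # PREDNE_INT
    4: "xyzwt",  # five ADDs
}


def slot_utilization(groups):
    """Fraction of available ALU slots that carry an instruction."""
    used = sum(len(slots) for slots in groups.values())
    available = SLOTS_PER_GROUP * len(groups)
    return used / available


utilization = slot_utilization(groups)  # 15 of 25 slots are filled
```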
Control Flow Programs
• Series of Control Flow instructions which can:
    • Initiate execution of clauses
    • Allocate space in an input or output buffer
    • Export to or import from a data buffer
    • Control branching, looping, and stack operations
• 40-cycle latency that needs to be hidden
GPU ISA
;PS
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(8) KCACHE0(CB0:0-15)
      0  x: MOV R0.x, KC0[1].x
         y: MOV R1.y, KC0[1].y
         z: MOV R1.z, KC0[1].z
         w: MOV R1.w, KC0[1].w
      1  x: MOV R1.x, KC0[0].w
         y: MOV R1.y, KC0[0].z
         z: MOV R1.z, KC0[0].y
         w: MOV R1.w, KC0[0].x
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
02 ALU_BREAK: ADDR(40) CNT(3)
      2  x: SETGT_DX10 R2.x, R0.x, 0x3C23D70A
      3  x: PREDNE_INT ____, R2.x, 0.0f
03 ALU: ADDR(43) CNT(5) KCACHE0(CB0:0-15)
      4  x: ADD R1.x, KC0[0].x, R1.x
         y: ADD R1.y, KC0[0].y, R1.y
         z: ADD R1.z, KC0[0].z, R1.z
         w: ADD R1.w, KC0[0].w, R1.w
         t: ADD R0.x, R0.x, -1.0f
04 ENDLOOP i0 PASS_JUMP_ADDR(2)
05 EXP_DONE: PIX0, R1
END_OF_PROGRAM
Control Flow Instructions
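To make the role of each control flow instruction concrete, here is a Python sketch that follows the six CF instructions (00 to 05) of the listing for a single thread. It is a reading aid rather than an emulator: the function names are hypothetical, and the mapping of KC0[0] to inc and KC0[1] to repeat is inferred from the order of the cbuffer members in the HLSL source.

```python
import struct


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run_kernel(inc, repeat_x):
    """Follow CF instructions 00-05 for one thread; registers are [x, y, z, w]."""
    threshold = bits_to_float(0x3C23D70A)       # the literal in SETGT_DX10, about 0.01
    # 00 ALU: R0.x = KC0[1].x and R1 = KC0[0].wzyx
    r0_x = repeat_x
    r1 = [inc[3], inc[2], inc[1], inc[0]]
    while True:                                 # 01 LOOP_DX10
        if not (r0_x > threshold):              # 02 ALU_BREAK: leave when the test fails
            break
        r1 = [k + r for k, r in zip(inc, r1)]   # 03 ALU: R1 = KC0[0] + R1
        r0_x = r0_x + -1.0                      #         t slot: R0.x = R0.x + -1.0f
        # 04 ENDLOOP: jump back to CF instruction 02
    return r1                                   # 05 EXP_DONE: export R1 to PIX0


result = run_kernel((1.0, 2.0, 3.0, 4.0), 3.0)
```

With inc = (1, 2, 3, 4) and repeat.x = 3 the loop body runs three times, so the exported value is inc.wzyx plus three times inc.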
Typical CF Program Flow
Control Flow Clauses – ALU/TEX
ALU CF Clause – 1 to 128 ALU slots per clause, with a maximum of 5 ALU slots (x, y, z, w, t) per instruction group.
03 ALU: ADDR(43) CNT(5) KCACHE0(CB0:0-15)
      4  x: ADD R1.x, KC0[0].x, R1.x
         y: ADD R1.y, KC0[0].y, R1.y
         z: ADD R1.z, KC0[0].z, R1.z
         w: ADD R1.w, KC0[0].w, R1.w
         t: ADD R0.x, R0.x, -1.0f
TEX CF Clause – 1 to 8 TEX slots per clause
03 TEX: ADDR(176) CNT(8) VALID_PIX
      9  SAMPLE R0, R18.wxww, t4, s4 UNNORM(XYZW)
     10  SAMPLE R5, R18.wyww, t0, s0 UNNORM(XYZW)
     11  SAMPLE R6, R18.wyww, t1, s1 UNNORM(XYZW)
     12  SAMPLE R7, R18.wyww, t2, s2 UNNORM(XYZW)
     13  SAMPLE R8, R18.wyww, t3, s3 UNNORM(XYZW)
     14  SAMPLE R1, R18.wxww, t5, s5 UNNORM(XYZW)
     15  SAMPLE R3, R18.wxww, t6, s6 UNNORM(XYZW)
     16  SAMPLE R9, R18.wxww, t7, s7 UNNORM(XYZW)
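Because a TEX clause holds at most 8 fetches, a longer run of fetches has to be spread over several clauses. The Python sketch below shows the simplest possible split. It is not how the real compiler schedules fetches (the slides do not describe that), and the name split_into_clauses is hypothetical.

```python
MAX_TEX_SLOTS = 8  # limit stated above: 1 to 8 TEX slots per clause


def split_into_clauses(fetches, max_slots=MAX_TEX_SLOTS):
    """Split a list of fetch instructions into clauses of at most max_slots."""
    return [fetches[i:i + max_slots] for i in range(0, len(fetches), max_slots)]


# 19 hypothetical fetches need three clauses: 8 + 8 + 3.
clauses = split_into_clauses([f"SAMPLE #{n}" for n in range(19)])
clause_sizes = [len(clause) for clause in clauses]
```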
Control Flow Clauses – VTX/VTX_TC
VTX CF Clause – 1 to 8 VTX slots per clause
VTX_TC CF Clause – same as VTX, but through the texture cache; used when the vertex unit does not exist on chip
00 VTX: ADDR(176) CNT(2)
      0  VFETCH R1.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)
      2  VFETCH R2.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)
00 VTX_TC: ADDR(176) CNT(2)
      0  VFETCH R1.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)
      2  VFETCH R2.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)
Control Flow Clauses – Color/Scratch
EXP_DONE – Sends data out via the pixel buffer, or color buffer, and signals that no more exports will occur
05 EXP_DONE: PIX0, R1 // write R1 to o0 only
or
01 EXP_DONE: PIX0, R1 BRSTCNT(7) // write R1-R8 to o0-o7
Scratch Write – Write to a scratch buffer
HD38XX:
01 MEM_SCRATCH_WRITE_IND: VEC_PTR[0+R0.x], R1, ARRAY_SIZE(1) ELEM_SIZE(3) BURST_CNT(0)
HD48XX:
2 MEM_SCRATCH_WRITE_IND_ACK: VEC_PTR[0+R2.x], R1, ARRAY_SIZE(1) ELEM_SIZE(3) BURST_CNT(0)
Scratch Read – Read from a scratch buffer
HD38XX:
03 MEM_SCRATCH_READ_IND: R3, VEC_PTR[0+R3.x], ARRAY_SIZE(1) ELEM_SIZE(3)
HD48XX:
04 VTX: ADDR(48) CNT(2)
      4  MEM_SCRATCH_READ_VF R3, VEC_PTR[0+R3.x], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED BURST_CNT(0)
Control Flow Clauses – Global Buffer
Gather Clause – Read from Global Memory Buffer
HD38XX:
01 MEM_GLOBAL_READ_IND: R0, DWORD_PTR[0+R0.x], ELEM_SIZE(3)
HD48XX:
01 VTX: ADDR(48) CNT(1)
      4  MEM_SCATTER_READ_VF R0, DWORD_PTR[0+R0.x], ELEM_SIZE(3) UNCACHED BURST_CNT(0)
Scatter Clause – Write to Global Memory Buffer
HD38XX:
01 MEM_GLOBAL_WRITE_IND: DWORD_PTR[0+R1.x], R0, ELEM_SIZE(3)
or HD48XX:
01 MEM_GLOBAL_WRITE_IND_ACK: DWORD_PTR[0+R1.x], R0, ELEM_SIZE(3)
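In both directions the address comes from a register (R0.x for the read, R1.x for the write), so each thread can touch an arbitrary location. The Python sketch below illustrates only that idea of gather and scatter. It models the global buffer as a plain list and says nothing about ELEM_SIZE, caching or acknowledgement; the function names are hypothetical.

```python
def gather(buffer, addresses):
    """Each thread reads the element at its own computed address."""
    return [buffer[address] for address in addresses]


def scatter(buffer, addresses, values):
    """Each thread writes its value to its own computed address."""
    for address, value in zip(addresses, values):
        buffer[address] = value


global_buffer = [10, 20, 30, 40]
read_back = gather(global_buffer, [3, 0])   # two threads read elements 3 and 0
scatter(global_buffer, [1, 2], [7, 8])      # two threads write elements 1 and 2
```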
Control Flow Clauses - Conditionals
ALU_BREAK: Breaks out of a loop based on the predicate set by an instruction in the clause
02 ALU_BREAK: ADDR(37) CNT(2) KCACHE0(CB0:0-15)
      1  y: SETE_INT R0.y, R0.x, KC0[1].x
      2  x: PREDE_INT ____, R0.y, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
Other instructions: if, else, endloop, etc.
Jump (i.e. if):
01 JUMP ADDR(5) VALID_PIX
Else:
05 ELSE POP_CNT(1) ADDR(22) VALID_PIX
Push stack:
12 ALU_PUSH_BEFORE: ADDR(50) CNT(3)
Pop stack:
44 ALU_POP_AFTER: ADDR(122) CNT(1)
While loop:
01 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
Endloop:
04 ENDLOOP i0 PASS_JUMP_ADDR(2)
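When many threads run the same loop in lockstep, a predicate-driven break is usually pictured as clearing a thread's bit in an execution mask, with the loop ending once no thread is left active. The Python sketch below models that picture for a loop of the form "while counter > threshold: counter -= 1". It is a simplified assumption about the behaviour, not a description of the hardware stack; the names are hypothetical.

```python
def run_loop(repeat_counts, threshold=0.01):
    """Return how many times each thread executes the loop body."""
    counters = list(repeat_counts)
    active = [True] * len(counters)          # per-thread execution mask
    iterations = [0] * len(counters)
    while any(active):                       # loop until every thread has left
        # ALU_BREAK: threads whose loop condition fails are masked off.
        for i, counter in enumerate(counters):
            if active[i] and not (counter > threshold):
                active[i] = False
        # Loop body: executed only by threads that are still active.
        for i in range(len(counters)):
            if active[i]:
                counters[i] -= 1.0
                iterations[i] += 1
    return iterations


per_thread = run_loop([3.0, 1.0, 0.0, 2.0])
```

All threads step through the loop together, so in this model the total time is set by the thread with the largest trip count.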
ALU Clauses – ALU Overview
03 ALU: ADDR(43) CNT(5) KCACHE0(CB0:0-15)
      4  x: ADD R1.x, KC0[0].x, R1.x
         y: ADD R1.y, KC0[0].y, R1.y
         z: ADD R1.z, KC0[0].z, R1.z
         w: ADD R1.w, KC0[0].w, R1.w
         t: ADD R0.x, R0.x, -1.0f
[Figure: register layout for the instruction group above. R1 and KC0[0] are each 128 bits wide, made up of four 32-bit channels ordered W Z Y X from MSB to LSB. The x, y, z, and w slots add KC0[0] to R1 channel by channel; the T unit takes -1.0f and R0.x and writes R0.x.]
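The register layout in the figure (four 32-bit channels, W at the most significant end and X at the least significant end of a 128-bit register) can be written out as a small Python sketch. Storing the register as one Python integer is purely an illustration device, and the helper names are hypothetical.

```python
import struct

CHANNEL_SHIFT = {"x": 0, "y": 32, "z": 64, "w": 96}  # X in the low bits, W in the high bits


def pack_register(x, y, z, w):
    """Pack four floats into one 128-bit integer."""
    value = 0
    for name, number in zip("xyzw", (x, y, z, w)):
        bits = struct.unpack("<I", struct.pack("<f", number))[0]
        value |= bits << CHANNEL_SHIFT[name]
    return value


def channel(value, name):
    """Extract one 32-bit channel and return it as a float."""
    bits = (value >> CHANNEL_SHIFT[name]) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


r1 = pack_register(1.0, 2.0, 3.0, 4.0)
low_bits = r1 & 0xFFFFFFFF    # bit pattern of the X channel
high_bits = r1 >> 96          # bit pattern of the W channel
```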
ALU Clauses – GPRs
• 127 GPRs accessible per thread, via the R registers
• 256 constants per thread, via the C registers
• GPRs 124–127 are temporaries that last through one ALU CF clause, via the T registers
• PV/PS are temporaries that last for one ALU instruction group
• SR – Shared global registers
• AR – Address register; allows dynamic indexing into the register file, set only via the MOVA instruction
• aL – Loop index register for loop-based offsets
• KC0/1 – Constant cache bank 0/1 registers
• Read port and cycle restrictions!
ALU Clause - ALU Data Flow
• GPR read port restrictions – only 3 different GPRs are accessible per channel in each ALU instruction group
• Constant read port restrictions – only 4 distinct elements can be read per ALU instruction group
ALU Clauses – Misc
Cycle restrictions cause issues when reading from T/R/SR registers.
The source registers are read over three cycles:
src0 = cycle 0, src1 = cycle 1, src2 = cycle 2
VEC_### changes the cycle in which each source read occurs, to work around port restrictions
      4  x: ADD R1.x, R31.y, (0xC0400000, -3.0f).x
         y: MULADD T0.y, -PV3.y, (0x41000000, 8.0f).y, T1.z
         z: ADD R2.z, T0.x, R29.z
         w: MULADD T0.w, -PV3.z, (0x41000000, 8.0f).y, T0.z VEC_120
         t: ADD R2.w, R16.w, R31.y
02 ALU: ADDR(39) CNT(2)
      3  x: MOV SR0.x, R0.x
      4  x: MOV R1.x, SR0.x
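The VEC_120 on the w slot of instruction group 4 can be understood with a small scheduling model. The Python sketch below assumes that each channel can deliver one GPR per read cycle, and that the three digits of VEC_abc give the read cycle for src0, src1 and src2. Both points are simplifying assumptions for illustration and are not stated on the slide. Only the four vector slots are modelled; literals and PV operands are written as None because they do not use a GPR read port in this model.

```python
def read_schedule(group):
    """Map (cycle, channel) to the set of GPRs read there."""
    schedule = {}
    for sources, swizzle in group:
        cycles = [int(digit) for digit in swizzle[-3:]]  # digit i = cycle for src i
        for index, source in enumerate(sources):
            if source is None:
                continue  # literal or PV/PS operand: no GPR port needed
            register, chan = source.split(".")
            schedule.setdefault((cycles[index], chan), set()).add(register)
    return schedule


def conflicts(group):
    """(cycle, channel) pairs where two different GPRs are requested."""
    return sorted(key for key, regs in read_schedule(group).items() if len(regs) > 1)


def vector_slots(w_swizzle):
    """GPR sources of the x, y, z, w slots of instruction group 4 above."""
    return [
        (["R31.y", None], "VEC_012"),        # x: ADD
        ([None, None, "T1.z"], "VEC_012"),   # y: MULADD
        (["T0.x", "R29.z"], "VEC_012"),      # z: ADD
        ([None, None, "T0.z"], w_swizzle),   # w: MULADD
    ]


without_swizzle = conflicts(vector_slots("VEC_012"))
with_swizzle = conflicts(vector_slots("VEC_120"))
```

Under these assumptions, the default order would ask channel z for both T1 and T0 in cycle 2, while VEC_120 moves the read of T0.z to cycle 0 and removes the clash.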
TEX/VTX Clauses
03 TEX: ADDR(18748) CNT(4) VALID_PIX
     20  SAMPLE R2, R2.zwzz, t1, s1 UNNORM(XYZW)
     21  SAMPLE R6, R0.zyzz, t1, s1 UNNORM(XYZW)
     22  SAMPLE R8, R3.zwzz, t1, s1 UNNORM(XYZW)
     23  SAMPLE R32, R29.zwzz, t1, s1 UNNORM(XYZW)
01 TEX: ADDR(112) CNT(5)
     16  LDS_READ R0, R0.zy WATERFALL
01 VTX: ADDR(48) CNT(1)
      4  MEM_SCATTER_READ_VF R0, DWORD_PTR[0+R0.x], ELEM_SIZE(3) UNCACHED BURST_CNT(0)
00 VTX: ADDR(176) CNT(2)
      0  VFETCH R1.xy__, R0.x, fc128 MEGA(16) OFFSET(0) FETCH_TYPE(NO_INDEX_OFFSET)
04 VTX: ADDR(48) CNT(2)
      4  MEM_SCRATCH_READ_VF R3, VEC_PTR[0+R3.x], ARRAY_SIZE(1) ELEM_SIZE(3) UNCACHED BURST_CNT(0)
01 TEX: ADDR(48) CNT(1)
      1  LDS_WRITE (0) R0.xyyy, STRIDE(16) SIMD_REL
02 TEX: ADDR(48) CNT(1)
      2  LDS_WRITE (0) R1.xyyy, STRIDE(16) SIMD_ABS
03 TEX: ADDR(48) CNT(1)
      3  LDS_WRITE (0) R2.xyyy, STRIDE(16) SIMD_REL FFT_PERMUTE
Disclaimer & Attribution
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.