Heterogeneous Computing: Moving from CPU to GPU

NCSOFT LE Team

이권일

Speaker

• A regular speaker at NDC

• Works on the NCSOFT Lineage Eternal development team

• Optimization and 3D engine development (past tense)

Talk overview

• Heterogeneous computing

• DirectX 11 Compute Shader

• Walkthrough of the demo examples

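Heterogeneous computing

• A system with more than one kind of processor built in
– Originally AMD's marketing term for its CPU+GPU solution
– Now a common integrated solution

• Programming that uses the CPU and the GPU together
– The CPU is efficient at complex, sequential work
– The GPU is efficient at repetitive work that can be parallelized

30 to 40% of the CPU die is GPU area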

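When should you start?

• Every desktop processor on the market already supports it
– Every Intel/AMD CPU released since 2012 is supported

• Smartphone processors support OpenCL too
– SoCs with OpenCL 1.2 support keep coming out

• Read the trend from Stack Overflow posts
– The ratio of SSE posts to CUDA posts is about 1:20
– SSE/AVX is really unpopular.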

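Let's start with DirectCompute

• A GPGPU environment included since DirectX 11
– Uses the familiar DirectX interfaces and HLSL
– VS 2012/2013 support Compute Shader HLSL files

• Supported by Windows out of the box; nothing extra to install
– Supports Shared Memory and Structured Data
– Windows XP is not supported, so you may prefer CUDA/OpenCL there
– As a GPU programming library, though, it is the most bare-bones

• DirectCompute can be used on its own once Direct3D is initialized.
– Initialization fits in under 20 lines, and the total amount of code is small

Visual Studio's HLSL support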

A minimal DirectCompute setup: initialize and run

// Initialize
CComPtr<ID3D11Device> pDevice;
CComPtr<ID3D11DeviceContext> pContext;
D3D_FEATURE_LEVEL FeatureLevel[] = { D3D_FEATURE_LEVEL_11_0, };
D3D11CreateDevice(NULL, D3D_DRIVER_TYPE_HARDWARE, NULL, D3D11_CREATE_DEVICE_DEBUG,
                  FeatureLevel, 1, D3D11_SDK_VERSION, &pDevice, NULL, &pContext);

// Create Shader
CComPtr<ID3D11ComputeShader> pTestCS;
std::vector<BYTE> temp = LoadShader(_T("test.cso"));
pDevice->CreateComputeShader(temp.data(), temp.size(), NULL, &pTestCS);
pContext->CSSetShader(pTestCS, NULL, 0);

// Create Output Buffer
CComPtr<ID3D11Texture2D> pFloatBuffer;
CD3D11_TEXTURE2D_DESC descFloatBuffer(DXGI_FORMAT_R32G32B32A32_FLOAT, texture_width, texture_height, 1, 1,
                                      D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS);
pDevice->CreateTexture2D(&descFloatBuffer, NULL, &pFloatBuffer);

CComPtr<ID3D11UnorderedAccessView> pFloatView;
CD3D11_UNORDERED_ACCESS_VIEW_DESC descFloatView(pFloatBuffer, D3D11_UAV_DIMENSION_TEXTURE2D);
pDevice->CreateUnorderedAccessView(pFloatBuffer, &descFloatView, &pFloatView);

ID3D11UnorderedAccessView* uoViews[] = { pFloatView };
UINT uoCounts[] = { (UINT)-1 };  // one initial count per view; -1 keeps the current counter
pContext->CSSetUnorderedAccessViews(0, _countof(uoViews), uoViews, uoCounts);

// Dispatch (the division assumes the texture size is a multiple of 32)
pContext->Dispatch(texture_width/32, texture_height/32, 1);
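The listing above calls `LoadShader`, which the slides do not show. The following is a minimal sketch of what such a helper might look like, assuming it only reads a compiled `.cso` file into a byte vector. `LoadShaderBytes` is a hypothetical name, and it takes a narrow `std::string` path instead of the `_T(...)` string used on the slide.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for the LoadShader() helper used on the slide:
// reads an entire compiled shader file (.cso) into memory.
// Returns an empty vector if the file cannot be opened.
std::vector<std::uint8_t> LoadShaderBytes(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        return {};
    }
    // Copy every byte of the stream into the vector.
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(file),
                                     std::istreambuf_iterator<char>());
}
```

The result can be handed to `CreateComputeShader` through `data()` and `size()`. Checking for an empty vector before creating the shader avoids passing a null bytecode pointer.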

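Compute Shader 4.x vs 5.0

• CS 4.x is supported by most GPUs back to the DX10 generation
– 16 kB of groupshared memory, with restrictions on how it is accessed
– Only 1 UA Resource is supported (1 output channel)
– The UA Resource format is restricted

• CS 5.0 supports the general GPGPU model
– Up to 32 kB of groupshared memory
– Up to 8 UA Resources of any form
– Texture writes are possible (2D and 3D arrays plus format conversion)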

So which one should you use?

• Intel supports CS 4.0 from Sandy Bridge, 2011 (CPU + GPU)
• Intel extends this to CS 5.0 from Ivy Bridge, 2012 (CPU + GPU)

• ATI supports CS 4.0 from boards with the HD 2000, 2006 (GPU)
• AMD supports CS 5.0 from the Llano APU, 2011 (CPU + GPU)

• nVIDIA supports CS 4.0 from the GeForce 8 series, 2006 (GPU)
• nVIDIA supports CS 5.0 from the GeForce 400 series, 2010 (GPU)

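GPU programming basics

• Multi-Threading
– Work that the CPU processed in a loop is processed in parallel
– 32 or 64 units execute the same instruction at the same time

• Shared Memory
– A small memory shared between threads
– Holds information exchanged between threads, or data several threads would otherwise duplicate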

GPU Multi-Threading

// cpu_test.cpp
void func()
{
    for(uint i=0; i<count; ++i)
    {
        c[i] = a[i] + b[i];
    }
}

// gpu_test_cs.hlsl
[numthreads(groupWidth, 1, 1)]
void func(uint i : SV_DispatchThreadID)
{
    c[i] = a[i] + b[i];
}

// gpu_test.cpp
// Each group covers groupWidth elements (assumes count is a multiple of groupWidth).
pContext->Dispatch(count / groupWidth, 1, 1);
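When `count` is not a multiple of the group size, the usual approach is to round the group count up and have the extra threads do nothing. The sketch below emulates that on the CPU for the `c[i] = a[i] + b[i]` kernel; `GroupCount` and `AddOnEmulatedGpu` are illustrative names, not part of the talk or of DirectX.

```cpp
#include <cstdint>
#include <vector>

// Number of thread groups needed to cover `count` items when each group
// runs `groupWidth` threads (ceiling division).
std::uint32_t GroupCount(std::uint32_t count, std::uint32_t groupWidth) {
    return (count + groupWidth - 1) / groupWidth;
}

// CPU emulation of Dispatch(GroupCount(count, groupWidth), 1, 1).
// The two nested loops stand in for the groups and the threads inside a
// group; on a GPU the iterations would run in parallel.
std::vector<float> AddOnEmulatedGpu(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::uint32_t groupWidth) {
    const std::uint32_t count = static_cast<std::uint32_t>(a.size());
    std::vector<float> c(count, 0.0f);
    const std::uint32_t groups = GroupCount(count, groupWidth);
    for (std::uint32_t group = 0; group < groups; ++group) {
        for (std::uint32_t thread = 0; thread < groupWidth; ++thread) {
            const std::uint32_t i = group * groupWidth + thread;  // SV_DispatchThreadID
            if (i >= count) {
                continue;  // threads past the end of the data do nothing
            }
            c[i] = a[i] + b[i];
        }
    }
    return c;
}
```

The bounds check is the part that has to be carried over into the shader once the group count is rounded up.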

GPU multi-threading

[Figure: Job 0 through Job 15, comparing how the CPU and the GPU process them]

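Single Instruction Multi Thread

• 32 or 64 threads execute the same instruction.
– One instruction unit controls several threads
– Similar to SIMD, but with somewhat more freedom in the code.
– Called a Thread Group, Warp, or WaveFront, among other names.

• When a branch occurs, both sides are executed.
– If not every thread enters an if/for/while, the threads that stay out stop working until the others finish.
– It gets very slow when the threads of a group take many different branches.
– If every thread fails the condition, the group moves on quickly (a fast context switch).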

GPU branching

if(threadId < 16)
{
    // do something
    do_some_things();
}
else
{
    // do something else
    do_other_things();
}
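To make the cost of the branch above concrete, here is a deliberately simplified cost model, not a description of real hardware scheduling. It assumes a warp pays the full cost of a branch side whenever at least one of its threads takes that side. The function names and the cost numbers are made up for illustration.

```cpp
#include <vector>

// Simplified SIMT cost model (an assumption for illustration only):
// a warp pays for a side of the branch if any of its threads takes it.
int WarpBranchCost(const std::vector<bool>& takesIf, int ifCost, int elseCost) {
    bool anyIf = false;
    bool anyElse = false;
    for (bool taken : takesIf) {
        if (taken) {
            anyIf = true;
        } else {
            anyElse = true;
        }
    }
    return (anyIf ? ifCost : 0) + (anyElse ? elseCost : 0);
}

// Evaluates the predicate `threadId < threshold` for a warp of `warpSize`
// threads whose first thread has id `firstThreadId`.
std::vector<bool> PredicateLessThan(int firstThreadId, int warpSize, int threshold) {
    std::vector<bool> result;
    for (int i = 0; i < warpSize; ++i) {
        result.push_back(firstThreadId + i < threshold);
    }
    return result;
}
```

Under this model a 32-thread warp running `if(threadId < 16)` pays for both sides, while a warp whose threads all land on the same side pays for only one.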

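Using Shared Memory as a temporary buffer

• Optimize memory access by using it as a cache
– Device Memory is relatively very slow!
– Copy sequentially, then access Shared Memory at random

• Use Shared Memory to reduce register usage.
– Store temporary data that is somewhat bulky
– Implementing lists, queues, or stacks in it works well too.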

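Sharing Shared Memory between threads

• Store results computed by other threads.
– Useful for exchanging data between threads.
– If Shared Memory runs out, use Device Memory as well

• Use the threads like a thread pool
– Threads switch which objects they handle depending on the situation.
– With more than 32 threads you must do a GroupSync() (GroupMemoryBarrierWithGroupSync() in HLSL).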
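The exchange pattern described above can be emulated on the CPU. In this sketch, which is only an illustration, each loop is one phase that all threads of a group would run in parallel, and the boundary between the two loops plays the role of the group sync. `ExchangeWithNeighbour` and the squared-value workload are assumptions.

```cpp
#include <cstddef>
#include <vector>

// CPU emulation of one thread group exchanging data through shared memory.
std::vector<int> ExchangeWithNeighbour(const std::vector<int>& input) {
    const std::size_t n = input.size();
    std::vector<int> shared(n);   // stands in for a groupshared array
    std::vector<int> output(n);

    // Phase 1: every "thread" writes its own result into shared memory.
    for (std::size_t tid = 0; tid < n; ++tid) {
        shared[tid] = input[tid] * input[tid];
    }

    // A group sync belongs here: no thread may start phase 2 until every
    // thread has finished phase 1. In this emulation the loop boundary
    // provides that guarantee.

    // Phase 2: every thread combines its value with its right neighbour's.
    for (std::size_t tid = 0; tid < n; ++tid) {
        output[tid] = shared[tid] + shared[(tid + 1) % n];
    }
    return output;
}
```

Without the sync, a thread in phase 2 could read a neighbour's slot before the neighbour has written it.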

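Moving work from the CPU to the GPU

• Which of the code you run today is a good candidate?
– It loops a lot. (well?)
– It still works when the order changes or is random. (good!!)
– It uses little temporary memory. (perfect!!!)

• Test the algorithm in C first
– Implement a CPU version of the algorithm that can also run on the GPU
– Parallelize, reduce memory use, reduce random access
– GPU debugging is hard, so keep the CPU version as a reference for comparing results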

GPU programming by example

• Bicubic Test
– Uniform Cubic B-Spline Curve
– A scaling algorithm used by image editing tools

• Tiny Viewer
– Plays Hierarchical Keyframe Animation
– An example that feeds GPU compute results directly into 3D rendering

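Uniform Cubic B-Spline

• GPU Gems 2 Chapter 20

• Reads 16 pixels for each output pixel
– 4 interpolations along the X axis, 1 interpolation along the Y axis

Uniform Cubic B-Spline

To compute the value at coordinate (1.5, 1.5):

At Y=0, compute the B-Spline value along the X axis
At Y=1, compute the B-Spline value along the X axis
At Y=2, compute the B-Spline value along the X axis
At Y=3, compute the B-Spline value along the X axis
From those 4 values, compute the B-Spline value along the Y axis

Total: 16 pixel reads + 5 B-Spline evaluations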

CPU B-Spline

// C++ B-Spline code
const BGRA& p(int x, int y) {
    if(x >= 0 && x < g_image_width && y >= 0 && y < g_image_height) {
        return g_source[g_image_width*y + x];
    } else {
        return g_black;
    }
}

float w0(float t) {
    return (1.0f / 6.0f)*(-t*t*t + 3.0f*t*t - 3.0f*t + 1.0f);
}

float w1(float t) {
    return (1.0f / 6.0f)*(3.0f*t*t*t - 6.0f*t*t + 4.0f);
}

float w2(float t) {
    return (1.0f / 6.0f)*(-3.0f*t*t*t + 3.0f*t*t + 3.0f*t + 1.0f);
}

float w3(float t) {
    return (1.0f / 6.0f)*(t*t*t);
}

// Pixel on the left, weight on the right, so that the member operators
// BGRA::operator*(float) and fBGRA::operator*(float) declared below apply.
fBGRA px(int x, int y, float dx) {
    return p(x - 1, y)*w0(dx) + p(x, y)*w1(dx) + p(x + 1, y)*w2(dx) + p(x + 2, y)*w3(dx);
}

fBGRA pxy(int x, int y, float dx, float dy) {
    return px(x, y - 1, dx)*w0(dy) + px(x, y, dx)*w1(dy) +
           px(x, y + 1, dx)*w2(dy) + px(x, y + 2, dx)*w3(dy);
}

// C++ resize loop
void ResizeBruteForce() {
    float sx = ox + (g_image_width/zoom)/2;
    float sy = oy + (g_image_height/zoom)/2;

    for(int y=0; y<g_screen_height; ++y) {
        int v = (int)floor( (y+sy) * zoom );
        float dv = ( (y+sy) * zoom ) - v;

        for(int x=0; x<g_screen_width; ++x) {
            int u = (int)floor( (x+sx) * zoom );
            float du = ( (x+sx) * zoom ) - u;

            g_screen[g_screen_width*y + x] = pxy(u, v, du, dv);
        }
    }
}

struct BGRA {
    BGRA(BYTE r, BYTE g, BYTE b, BYTE a) : R(r), G(g), B(b), A(a) {}

    struct fBGRA operator*(float v) const;
    struct fBGRA operator*(int v) const;

    BYTE R, G, B, A;
};

struct fBGRA {
    fBGRA() {}
    fBGRA(float r, float g, float b, float a) : R(r), G(g), B(b), A(a) {}

    struct fBGRA operator+(const fBGRA& v) const;
    struct fBGRA operator*(float v) const;
    operator struct BGRA();

    float R, G, B, A;
};

BGRA* g_screen = (BGRA*)_aligned_malloc(~~~~, 16);
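The listing above depends on globals and operators that the slides only declare. As a self-contained check of the same arithmetic, the sketch below repeats it on a single-channel float image. `GrayImage` and its `reads` counter are assumptions added for illustration. The counter confirms the 16 reads per output pixel, and because the four weights sum to 1, a constant image comes back unchanged.

```cpp
#include <cmath>
#include <vector>

// Uniform cubic B-spline weights, same formulas as the listing above.
float w0(float t) { return (1.0f / 6.0f) * (-t*t*t + 3.0f*t*t - 3.0f*t + 1.0f); }
float w1(float t) { return (1.0f / 6.0f) * (3.0f*t*t*t - 6.0f*t*t + 4.0f); }
float w2(float t) { return (1.0f / 6.0f) * (-3.0f*t*t*t + 3.0f*t*t + 3.0f*t + 1.0f); }
float w3(float t) { return (1.0f / 6.0f) * (t*t*t); }

// Hypothetical single-channel image used only for this sketch.
struct GrayImage {
    int width = 0;
    int height = 0;
    std::vector<float> pixels;
    int reads = 0;  // counts calls to p(), including out-of-range ones
};

// Pixel fetch with black outside the image, as in the slide's p().
float p(GrayImage& img, int x, int y) {
    ++img.reads;
    if (x >= 0 && x < img.width && y >= 0 && y < img.height) {
        return img.pixels[img.width * y + x];
    }
    return 0.0f;
}

// One interpolation along X: 4 pixel reads.
float px(GrayImage& img, int x, int y, float dx) {
    return w0(dx) * p(img, x - 1, y) + w1(dx) * p(img, x, y) +
           w2(dx) * p(img, x + 1, y) + w3(dx) * p(img, x + 2, y);
}

// Four X interpolations followed by one along Y: 16 pixel reads in total.
float pxy(GrayImage& img, int x, int y, float dx, float dy) {
    return w0(dy) * px(img, x, y - 1, dx) + w1(dy) * px(img, x, y, dx) +
           w2(dy) * px(img, x, y + 1, dx) + w3(dy) * px(img, x, y + 2, dx);
}
```

Calling `pxy(img, 1, 1, 0.5f, 0.5f)` corresponds to the (1.5, 1.5) example from the slides.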

GPU B-Spline

// HLSL code
float4 p(int x, int y)
{
    return source[int2(x, y)];
}

// w0(t) to w3(t) are the same weight functions as in the C++ version.

float4 px(int x, int y, float dx)
{
    return w0(dx)*p(x - 1, y) + w1(dx)*p(x, y) + w2(dx)*p(x + 1, y) +
           w3(dx)*p(x + 2, y);
}

float4 pxy(int x, int y, float dx, float dy)
{
    return w0(dy)*px(x, y - 1, dx) + w1(dy)*px(x, y, dx) +
           w2(dy)*px(x, y + 1, dx) + w3(dy)*px(x, y + 2, dx);
}

[numthreads(g_thread_width, g_thread_height, 1)]
void main(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    int v = (int)floor((dispatchThreadID.y + sy) * zoom);
    int u = (int)floor((dispatchThreadID.x + sx) * zoom);

    float dv = ((dispatchThreadID.y + sy) * zoom) - v;
    float du = ((dispatchThreadID.x + sx) * zoom) - u;

    // The slide abbreviates the output as a return value; a compute shader
    // writes to a UAV instead. "target" is an assumed RWTexture2D<float4>.
    target[dispatchThreadID.xy] = pxy(u, v, du, dv);
}

// C++ code
void Resize()
{
    float sx = ox + (g_image_width / zoom) / 2;
    float sy = oy + (g_image_height / zoom) / 2;
    float constants[4] = { sx, sy, zoom, 0 };
    g_context->UpdateSubresource(g_constBuffer, 0, NULL, constants, 0, 0);

    // Dispatch
    g_context->Dispatch(g_screen_width / g_thread_width,
                        g_screen_height / g_thread_height,
                        1);

    // Copy the data from the GPU to the CPU
    g_context->CopyResource(g_copyBuffer, g_targetBuffer);

    // Wait for the work to finish so the time can be measured
    g_context->End(g_query);
    while (g_context->GetData(g_query, NULL, 0, 0) == S_FALSE) {}
}

CPU vs GPU

[Slides: the C++ and HLSL listings above shown side by side, first the B-Spline functions and then the resize loop against the shader entry point and the Resize() dispatch code]


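Optimizing Bicubic Test

• Optimize by storing the X-axis results in a temporary buffer
– At 1x zoom: 1 X-axis and 1 Y-axis evaluation per pixel
– At 2x zoom: 0.5 X-axis and 1 Y-axis evaluations per pixel

• Converting the optimized version to SSE gives more than a 4x improvement
– Better memory read/write efficiency
– Pixel encode/decode with SSE intrinsics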

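Optimized B-Spline (animation)

To compute the value at coordinate (1.5, 1.5):

The X-axis results for Y=0 through Y=3 that the computation needs are all computed into a temporary buffer.

The X-axis results in the temporary buffer are then used to compute the whole Y axis.

For the first row the temporary buffer must be filled with the X-axis results for Y=0 through Y=3, but after that rows are filled only as they are needed.

At a 1:1 ratio: 1 pixel read + 2 B-Spline evaluations
At a 2:1 ratio (2x zoom): 0.5 pixel reads + 2 B-Spline evaluations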
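The row-caching idea can be sketched on a single-channel image and checked against the brute-force version. This is an illustration under simplifying assumptions, not the talk's implementation: it has no `sx`/`sy` offsets, `step` is the number of source pixels per output pixel, a `std::map` stands in for the rotating four-row buffer, and all names are hypothetical.

```cpp
#include <cmath>
#include <map>
#include <vector>

float w0(float t) { return (1.0f / 6.0f) * (-t*t*t + 3.0f*t*t - 3.0f*t + 1.0f); }
float w1(float t) { return (1.0f / 6.0f) * (3.0f*t*t*t - 6.0f*t*t + 4.0f); }
float w2(float t) { return (1.0f / 6.0f) * (-3.0f*t*t*t + 3.0f*t*t + 3.0f*t + 1.0f); }
float w3(float t) { return (1.0f / 6.0f) * (t*t*t); }

struct Gray { int width; int height; std::vector<float> pixels; };

float p(const Gray& img, int x, int y) {
    if (x >= 0 && x < img.width && y >= 0 && y < img.height) return img.pixels[img.width * y + x];
    return 0.0f;  // black outside the image
}

float px(const Gray& img, int x, int y, float dx) {
    return w0(dx) * p(img, x - 1, y) + w1(dx) * p(img, x, y) +
           w2(dx) * p(img, x + 1, y) + w3(dx) * p(img, x + 2, y);
}

// Brute force: four X interpolations and one Y interpolation per output pixel.
std::vector<float> ResizeBruteForce(const Gray& src, int outW, int outH, float step) {
    std::vector<float> out(outW * outH);
    for (int y = 0; y < outH; ++y) {
        float fv = y * step; int v = (int)std::floor(fv); float dv = fv - v;
        for (int x = 0; x < outW; ++x) {
            float fu = x * step; int u = (int)std::floor(fu); float du = fu - u;
            out[outW * y + x] = w0(dv) * px(src, u, v - 1, du) + w1(dv) * px(src, u, v, du) +
                                w2(dv) * px(src, u, v + 1, du) + w3(dv) * px(src, u, v + 2, du);
        }
    }
    return out;
}

// One source row interpolated along X at every output column.
std::vector<float> InterpolateRow(const Gray& src, int row, int outW, float step) {
    std::vector<float> result(outW);
    for (int x = 0; x < outW; ++x) {
        float fu = x * step; int u = (int)std::floor(fu);
        result[x] = px(src, u, row, fu - u);
    }
    return result;
}

// Row-cached: each X-interpolated source row is computed once and reused
// by every output row that needs it. Assumes step > 0 so v never decreases.
std::vector<float> ResizeRowCached(const Gray& src, int outW, int outH, float step, int& rowsComputed) {
    std::vector<float> out(outW * outH);
    std::map<int, std::vector<float> > cache;
    rowsComputed = 0;
    for (int y = 0; y < outH; ++y) {
        float fv = y * step; int v = (int)std::floor(fv); float dv = fv - v;
        cache.erase(cache.begin(), cache.lower_bound(v - 1));  // drop rows no longer needed
        for (int r = v - 1; r <= v + 2; ++r) {
            if (cache.count(r) == 0) { cache[r] = InterpolateRow(src, r, outW, step); ++rowsComputed; }
        }
        const std::vector<float>& r0 = cache[v - 1];
        const std::vector<float>& r1 = cache[v];
        const std::vector<float>& r2 = cache[v + 1];
        const std::vector<float>& r3 = cache[v + 2];
        for (int x = 0; x < outW; ++x) {
            out[outW * y + x] = w0(dv) * r0[x] + w1(dv) * r1[x] + w2(dv) * r2[x] + w3(dv) * r3[x];
        }
    }
    return out;
}
```

After the first output row, each new output row triggers at most one new X-interpolated row at `step = 1.0f`, and about one every second row at `step = 0.5f`, which is where the 1 and 0.5 X-axis evaluations per pixel come from.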

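Bicubic Test GPU implementation

• The BruteForce GPU implementation is 2x faster than the SSE-optimized CPU code.
– About 80% of the shader code is copied and pasted from the C++ code
– Roughly 40x faster than the original code.

• With the optimized algorithm it is 3.7x faster than SSE.
– SSE-optimized code: 170 lines
– GPU-optimized code: 30 lines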

Optimized CPU code

// C++ SSE-optimized loop
void px_sse(int y, BGRA* g_source, __m128* image_sse, __m128 result_sse[g_screen_width],
            int u[g_screen_width], __m128 wx_sse[g_screen_width][4], int x_loop[9])
{
    if(y>=0 && y<g_image_height) {
        BGRA* image = g_source + g_image_width * y;

        __m128* iter = image_sse;
        for(int x=0; x<g_image_width/4; ++x) {
            __m128i current = _mm_load_si128((__m128i*)(image));
            __m128i low = _mm_unpacklo_epi8(current, _mm_setzero_si128());
            __m128i high = _mm_unpackhi_epi8(current, _mm_setzero_si128());
            image += 4;
            *iter++ = _mm_cvtepi32_ps(_mm_unpacklo_epi16(low, _mm_setzero_si128()));
            *iter++ = _mm_cvtepi32_ps(_mm_unpackhi_epi16(low, _mm_setzero_si128()));
            *iter++ = _mm_cvtepi32_ps(_mm_unpacklo_epi16(high, _mm_setzero_si128()));
            *iter++ = _mm_cvtepi32_ps(_mm_unpackhi_epi16(high, _mm_setzero_si128()));
        }

        int x=0;
        for(; x<x_loop[0]; ++x) {
            result_sse[x] = g_zero4;
        }
        for(; x<x_loop[1]; ++x) {
            result_sse[x] = wx_sse[x][3]*image_sse[u[x]+2];
        }
        for(; x<x_loop[2]; ++x) {
            result_sse[x] = wx_sse[x][2]*image_sse[u[x]+1] + wx_sse[x][3]*image_sse[u[x]+2];
        }
        for(; x<x_loop[3]; ++x) {
            result_sse[x] = wx_sse[x][1]*image_sse[u[x]+0] + wx_sse[x][2]*image_sse[u[x]+1]
                          + wx_sse[x][3]*image_sse[u[x]+2];
        }
        for(; x<x_loop[4]; ++x) {
            result_sse[x] = wx_sse[x][0]*image_sse[u[x]-1] + wx_sse[x][1]*image_sse[u[x]+0]
                          + wx_sse[x][2]*image_sse[u[x]+1] + wx_sse[x][3]*image_sse[u[x]+2];
        }
        for(; x<x_loop[5]; ++x) {
            result_sse[x] = wx_sse[x][0]*image_sse[u[x]-1] + wx_sse[x][1]*image_sse[u[x]+0]
                          + wx_sse[x][2]*image_sse[u[x]+1];
        }
        for(; x<x_loop[6]; ++x) {
            result_sse[x] = wx_sse[x][0]*image_sse[u[x]-1] + wx_sse[x][1]*image_sse[u[x]+0];
        }
        for(; x<x_loop[7]; ++x) {
            result_sse[x] = wx_sse[x][0]*image_sse[u[x]-1];
        }
        for(; x<x_loop[8]; ++x) {
            result_sse[x] = g_zero4;
        }
    } else {
        for(int x=0; x<g_screen_width; ++x) {
            result_sse[x] = g_zero4;
        }
    }
}

void ResizeOptimized_sse()
{
    if (g_source) {
        float sx = ox + (g_image_width/zoom)/2 + 0.5f;
        float sy = oy + (g_image_height/zoom)/2 + 0.5f;

        const __m128 c0_sse = _mm_set_ps( 1.0f/6.0f, 4.0f/6.0f, 1.0f/6.0f, 0.0f/6.0f);
        const __m128 c1_sse = _mm_set_ps(-3.0f/6.0f, 0.0f/6.0f, 3.0f/6.0f, 0.0f/6.0f);
        const __m128 c2_sse = _mm_set_ps( 3.0f/6.0f,-6.0f/6.0f, 3.0f/6.0f, 0.0f/6.0f);
        const __m128 c3_sse = _mm_set_ps(-1.0f/6.0f, 3.0f/6.0f,-3.0f/6.0f, 1.0f/6.0f);

        int u[g_screen_width];
        int x_loop[9] = { 0, 0, 0, 0, g_screen_width, g_screen_width, g_screen_width, g_screen_width, g_screen_width };
        __m128 wx_sse[g_screen_width][4];

        for(int x=0; x<g_screen_width; ++x) {
            float fu = (x+sx) * zoom - 0.5f;
            u[x] = (int)floor( fu );

            //u[x]-1 >= 0;
            if(u[x]+2 < 0) x_loop[0] = x+1;
            if(u[x]+1 < 0) x_loop[1] = x+1;
            if(u[x]+0 < 0) x_loop[2] = x+1;
            if(u[x]-1 < 0) x_loop[3] = x+1;
            //u[x]+2 < g_image_width;
            if(u[x]+2 < g_image_width) x_loop[4] = x+1;
            if(u[x]+1 < g_image_width) x_loop[5] = x+1;
            if(u[x]+0 < g_image_width) x_loop[6] = x+1;
            if(u[x]-1 < g_image_width) x_loop[7] = x+1;

            __m128 t1_sse = _mm_set1_ps(fu - u[x]);

            __m128 temp = c3_sse*t1_sse*t1_sse*t1_sse + c2_sse*t1_sse*t1_sse + c1_sse*t1_sse + c0_sse;

            wx_sse[x][0] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(3,3,3,3));
            wx_sse[x][1] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(2,2,2,2));
            wx_sse[x][2] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(1,1,1,1));
            wx_sse[x][3] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(0,0,0,0));
        }

        __m128* image_sse = g_image_data;
        __m128* temp_sse[4] = { g_temp_data + g_screen_width * 0, g_temp_data + g_screen_width * 1,
                                g_temp_data + g_screen_width * 2, g_temp_data + g_screen_width * 3 };
        __m128* swap_sse[3];
        int v_last = INT_MIN;
        for(int y=0; y<g_screen_height; ++y) {
            float fv = (y+sy) * zoom - 0.5f;
            int v = (int)floor( fv );

            switch(v - v_last) {
            case 0:
                break;
            case 1:
                swap_sse[0] = temp_sse[0];

                temp_sse[0] = temp_sse[1];
                temp_sse[1] = temp_sse[2];
                temp_sse[2] = temp_sse[3];
                temp_sse[3] = swap_sse[0];

                px_sse(v + 2, g_source, image_sse, temp_sse[3], u, wx_sse, x_loop);
                break;
            case 2:
                swap_sse[0] = temp_sse[0];
                swap_sse[1] = temp_sse[1];

                temp_sse[0] = temp_sse[2];
                temp_sse[1] = temp_sse[3];
                temp_sse[2] = swap_sse[0];
                temp_sse[3] = swap_sse[1];

                px_sse(v + 1, g_source, image_sse, temp_sse[2], u, wx_sse, x_loop);
                px_sse(v + 2, g_source, image_sse, temp_sse[3], u, wx_sse, x_loop);
                break;
            case 3:
                swap_sse[0] = temp_sse[0];
                swap_sse[1] = temp_sse[1];
                swap_sse[2] = temp_sse[2];

                temp_sse[0] = temp_sse[3];
                temp_sse[1] = swap_sse[0];
                temp_sse[2] = swap_sse[1];
                temp_sse[3] = swap_sse[2];

                px_sse(v + 0, g_source, image_sse, temp_sse[1], u, wx_sse, x_loop);
                px_sse(v + 1, g_source, image_sse, temp_sse[2], u, wx_sse, x_loop);
                px_sse(v + 2, g_source, image_sse, temp_sse[3], u, wx_sse, x_loop);
                break;

            default:
                px_sse(v - 1, g_source, image_sse, temp_sse[0], u, wx_sse, x_loop);
                px_sse(v + 0, g_source, image_sse, temp_sse[1], u, wx_sse, x_loop);
                px_sse(v + 1, g_source, image_sse, temp_sse[2], u, wx_sse, x_loop);
                px_sse(v + 2, g_source, image_sse, temp_sse[3], u, wx_sse, x_loop);
                break;
            }

            v_last = v;

            __m128 t1_sse = _mm_set1_ps(fv - v);
            __m128 temp = c3_sse*t1_sse*t1_sse*t1_sse + c2_sse*t1_sse*t1_sse + c1_sse*t1_sse + c0_sse;

            __m128 wy0 = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(3,3,3,3));
            __m128 wy1 = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(2,2,2,2));
            __m128 wy2 = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(1,1,1,1));
            __m128 wy3 = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(0,0,0,0));

            __m128* src0 = temp_sse[0];
            __m128* src1 = temp_sse[1];
            __m128* src2 = temp_sse[2];
            __m128* src3 = temp_sse[3];
            __m128i* dest = (__m128i*)(g_screen+g_screen_width*y);
            BGRA* destTemp = g_screen + g_screen_width*y;

            for(int x=0; x<g_screen_width/4; ++x) {
                int prefetch = 8;
                _mm_prefetch((char*)(src0+prefetch), _MM_HINT_NTA);
                _mm_prefetch((char*)(src1+prefetch), _MM_HINT_NTA);
                _mm_prefetch((char*)(src2+prefetch), _MM_HINT_NTA);
                _mm_prefetch((char*)(src3+prefetch), _MM_HINT_NTA);

                _mm_stream_si128(
                    dest++,
                    _mm_packus_epi16(
                        _mm_packs_epi32(
                            _mm_cvtps_epi32(_mm_load_ps((const float*)(src0+0))*wy0 + _mm_load_ps((const float*)(src1+0))*wy1
                                          + _mm_load_ps((const float*)(src2+0))*wy2 + _mm_load_ps((const float*)(src3+0))*wy3),
                            _mm_cvtps_epi32(_mm_load_ps((const float*)(src0+1))*wy0 + _mm_load_ps((const float*)(src1+1))*wy1
                                          + _mm_load_ps((const float*)(src2+1))*wy2 + _mm_load_ps((const float*)(src3+1))*wy3)
                        ),
                        _mm_packs_epi32(
                            _mm_cvtps_epi32(_mm_load_ps((const float*)(src0+2))*wy0 + _mm_load_ps((const float*)(src1+2))*wy1
                                          + _mm_load_ps((const float*)(src2+2))*wy2 + _mm_load_ps((const float*)(src3+2))*wy3),
                            _mm_cvtps_epi32(_mm_load_ps((const float*)(src0+3))*wy0 + _mm_load_ps((const float*)(src1+3))*wy1
                                          + _mm_load_ps((const float*)(src2+3))*wy2 + _mm_load_ps((const float*)(src3+3))*wy3)
                        )
                    )
                );
                src0 += 4;
                src1 += 4;
                src2 += 4;
                src3 += 4;
            }
        }
    }
}
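The `c0_sse` to `c3_sse` constants in the listing above hold the four weight polynomials side by side, so one cubic evaluation yields `w0` through `w3` at once. The scalar sketch below checks that reading without intrinsics. `Weights` is a hypothetical name, and lane `k` of each array simply holds the coefficient for weight `k`, which matches the left-to-right argument order of the `_mm_set_ps` calls.

```cpp
#include <array>
#include <cmath>

// Evaluates all four B-spline weights from per-lane cubic coefficients:
// w[k] = c3[k]*t^3 + c2[k]*t^2 + c1[k]*t + c0[k]
std::array<float, 4> Weights(float t) {
    const std::array<float, 4> c0 = {{ 1.0f / 6.0f,  4.0f / 6.0f,  1.0f / 6.0f, 0.0f / 6.0f}};
    const std::array<float, 4> c1 = {{-3.0f / 6.0f,  0.0f / 6.0f,  3.0f / 6.0f, 0.0f / 6.0f}};
    const std::array<float, 4> c2 = {{ 3.0f / 6.0f, -6.0f / 6.0f,  3.0f / 6.0f, 0.0f / 6.0f}};
    const std::array<float, 4> c3 = {{-1.0f / 6.0f,  3.0f / 6.0f, -3.0f / 6.0f, 1.0f / 6.0f}};
    std::array<float, 4> w = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (int k = 0; k < 4; ++k) {
        // Horner form of the cubic for lane k.
        w[k] = ((c3[k] * t + c2[k]) * t + c1[k]) * t + c0[k];
    }
    return w;
}
```

At `t = 0.5` this gives 1/48, 23/48, 23/48 and 1/48, the same values as the scalar `w0` to `w3` functions.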

Optimized GPU code

// HLSL optimized loop
groupshared float4 buffer2[g_thread_height + 4][g_thread_width];

float4 pxy_opt2(int v, int thread_x, int thread_y, float dv)
{
    return w0(dv) * buffer2[v][thread_x] + w1(dv) * buffer2[v + 1][thread_x] +
           w2(dv) * buffer2[v + 2][thread_x] + w3(dv) * buffer2[v + 3][thread_x];
}

[numthreads(g_thread_width, g_thread_height, 1)]
void main(uint2 group : SV_GroupID, uint2 thread : SV_GroupThreadID, uint2 dispatch : SV_DispatchThreadID)
{
    float fu = (dispatch.x + sx) * zoom;
    float fv = (dispatch.y + sy) * zoom;
    int u = (int)floor(fu);
    int v = (int)floor(fv);
    float du = fu - u;
    float dv = fv - v;

    int v_first = (int)floor((group.y * g_thread_height + sy) * zoom);
    int v_last = (int)floor((group.y * g_thread_height + g_thread_height - 1 + sy) * zoom);

    // Each thread fills its share of the X-interpolated rows in shared memory.
    int buffer_y = thread.y;
    for (int pv = v_first - 1 + thread.y; pv <= v_last + 3; pv += g_thread_height)
    {
        buffer2[buffer_y][thread.x] = px(u, pv, du);
        buffer_y += g_thread_height;
    }

    GroupMemoryBarrierWithGroupSync();

    // The slide abbreviates the output as a return value; a compute shader
    // writes to a UAV instead. "target" is an assumed RWTexture2D<float4>.
    target[dispatch.xy] = pxy_opt2(v - v_first, thread.x, thread.y, dv);
}

Bicubic Test benchmark

[Chart: Bicubic Test Benchmark. Categories: CPU BruteForce, CPU Optimized, CPU SSE Optimized, Intel HD 4000, nVIDIA GTS 440, nVIDIA GTX 680. Series: 1x zoom, 2x zoom. Axis ticks from 20 to 140.]

Results relative to CPU BruteForce

[Chart: Benchmark vs CPU BruteForce. Categories: CPU BruteForce, CPU Optimized, CPU SSE Optimized, Intel HD 4000, nVIDIA GTS 440, nVIDIA GTX 680. Series: 1x zoom, 2x zoom. Axis ticks from 50 to 350.]

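Tiny Viewer example

• Moves the actual character computation to the GPU, rather than relying on VTFetch (vertex texture fetch)
– Keyframe data with non-uniform time intervals
– Multiple animations can be blended (same as on the CPU)
– Compressed keyframes and IK can be implemented too (same as on the CPU)

• The results go through the same renderer as the CPU path
– The computation can be split between the CPU and the GPU at the same time
– GPU results can be used again on the CPU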

Keyframe update (animation)

for each instance:
    for each animation track:
        find keyframe and interpolation
        local[i] = matrix(key.pos, key.rot)

    for each bone hierarchy:
        find parent for bone
        world[i] = local[i] * world[parent]

    for each target skin bone:
        find bone index from skin index
        skin[i] = invWorld[i] * world[bone]

Mapping to the GPU

The loop problem in the Local -> World transform

[Figure: a chain of World 0 through World 6]

Using Group Sync

instance = threadId / animationCount
animation = threadId % animationCount

instance = threadId / columnWidth
column = threadId % columnWidth

instance = threadId / skinCount
column = threadId % skinCount

[Figure: the processing stages separated by GroupSync barriers]

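Matrix -> Pos/Rot: reducing memory

• Both the CPU and the GPU are sensitive to bandwidth.
– The more you optimize, the more memory performance matters
– Change 4x4 matrices to 4x3, or to translation/rotation pairs

• Choose between AoS and SoA carefully.
– SoA is said to be faster, but AoS is sometimes faster when many streams are involved.
– The lower the bandwidth or the fewer the banks, the more AoS is preferred.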
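The size difference between these representations is easy to quantify. The sketch below assumes 32-bit floats and a quaternion for the rotation; the slides only say "translation, rotation pair", so the quaternion is an assumption. All struct and function names are illustrative.

```cpp
#include <cstddef>
#include <vector>

struct Matrix4x4 { float m[16]; };                 // 64 bytes per bone
struct Matrix4x3 { float m[12]; };                 // 48 bytes per bone
struct PosRot    { float pos[3]; float rot[4]; };  // 28 bytes: translation + quaternion (assumed)

// Bytes moved per frame for a given bone count and per-bone size.
std::size_t BytesForBones(std::size_t boneCount, std::size_t bytesPerBone) {
    return boneCount * bytesPerBone;
}

// AoS is simply std::vector<PosRot>. The SoA form keeps one stream per component.
struct BonesSoA {
    std::vector<float> posX, posY, posZ;
    std::vector<float> rotX, rotY, rotZ, rotW;
};

// Converts an array of structures into a structure of arrays.
BonesSoA ToSoA(const std::vector<PosRot>& bones) {
    BonesSoA soa;
    for (const PosRot& b : bones) {
        soa.posX.push_back(b.pos[0]);
        soa.posY.push_back(b.pos[1]);
        soa.posZ.push_back(b.pos[2]);
        soa.rotX.push_back(b.rot[0]);
        soa.rotY.push_back(b.rot[1]);
        soa.rotZ.push_back(b.rot[2]);
        soa.rotW.push_back(b.rot[3]);
    }
    return soa;
}
```

Going from a 4x4 matrix to a position plus quaternion cuts the per-bone size from 64 to 28 bytes. The SoA form turns one stream into seven, which is the "many streams" situation where the slides note AoS can win.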

Tiny Viewer benchmark

Instance count                      1024    2048    4096    8192   16384

Intel HD 4000
Full Render                        13.29   25.74   50.23   99.24       -
Full CPU Copy                      15.72   30.27   59.26  117.05       -
Full CPU Animation + Copy          15.73   30.23   59.27  117.12       -
Full GPU Animation                 14.19   27.44   53.51  105.83       -
1/10 Render                         1.99    4.26    6.67   14.73   27.88
1/10 CPU Copy                       4.38    9.41   16.13   31.25   60.69
1/10 CPU Animation + Copy           5.15    8.37   16.15   36.35   60.93
1/10 GPU Animation                  3.62    5.45   11.74   21.96   40.74
1 Indices Render                    0.47    1.49    0.52    2.06    1.28
1 Indices CPU Copy                  3.44    5.16    9.77   19.28   37.16
1 Indices CPU Animation + Copy      5.09    6.80   18.24   24.65   49.86
1 Indices GPU Animation             1.37    2.26    4.76    7.45   14.32

nVIDIA GTS 440
Full Render                         3.80    7.25   14.04   27.44   54.21
Full CPU Copy                       4.01    7.63   14.80   29.07   57.45
Full CPU Animation + Copy           4.01    7.63   14.78   28.92   57.07
Full GPU Animation                  4.59    8.69   16.82   32.90   65.01

nVIDIA GTX 680
Full Render                         0.58    1.10    2.16    4.25    8.39
Full CPU Copy                       0.76    1.47    2.87    5.70   11.39
Full CPU Animation + Copy           2.42    4.50    8.67   17.11   33.38
Full GPU Animation                  0.74    1.41    2.78    5.39   10.67

[Chart: Intel HD Graphics 4000 + 1 Indices. Series: 1 Indices Render, 1 Indices CPU Copy, 1 Indices CPU Animation + Copy, 1 Indices GPU Animation. X axis 1 to 5, Y axis 0.00 to 60.00.]

[Chart: nVIDIA GTS 440 / nVIDIA GTX 680. Series for each GPU: Full Render, Full CPU Copy, Full CPU Animation + Copy, Full GPU Animation. X axis 1 to 5, Y axis 0.00 to 70.00.]

Asynchronous processing between the GPU and the CPU

[Figure: timelines comparing CPU Animation (blocks: CPU Animation, Copy, Render, Idle) with GPU Animation (blocks: Cmd, GPU Animation, Render, Idle)]

Tiny Viewer conclusions

• Intel U3317 + HD 4000
– Even CPU Copy alone is slower than GPU Animation.
– Adding CPU Animation increases the load and makes it slower still.

• Intel Sandy Bridge 2500 + nVIDIA GTS 440
– GPU Animation takes about 3.75x as long as CPU Copy.
– This is the case where GPU Animation loses the most, although it can still take load off the CPU.

• Intel Sandy Bridge 2500 + nVIDIA GTX 680
– CPU Copy time and GPU Animation time are about the same.
– Because the GPU has to wait for the CPU Animation result, GPU Animation is always faster.
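The slides do not say how the 3.75x figure for the GTS 440 was computed. One reading that lands close to it, offered here only as an interpretation, compares the time each method adds on top of Full Render, using the 1024-instance column of the nVIDIA GTS 440 rows in the benchmark table:

```latex
\frac{t_{\text{GPU Animation}} - t_{\text{Full Render}}}
     {t_{\text{CPU Copy}} - t_{\text{Full Render}}}
= \frac{4.59 - 3.80}{4.01 - 3.80}
= \frac{0.79}{0.21}
\approx 3.76
```

The same ratio computed from the other columns ranges from roughly 3.3 to 3.8, so the quoted figure is plausibly an approximate value across instance counts.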

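Know the GPU's weaknesses

• Tricky GPU memory access
– Bandwidth is wide, but random access is not a strength.
– Even Shared Memory can be up to 16x slower in the worst case.

• Per-thread speed is slower than you might expect
– The clock is about 1/4 to 1/5 of a CPU's, and IPC about 1/5 to 1/8
– Out-of-order execution (OOE) is supported, but it is not as advanced as a CPU's.

• Data and results have to be exchanged with the CPU.
– Copies are very, very expensive. Reuse data as much as possible.
– Keep the data you transfer as small as possible too. (fp16, uint16)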
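As one example of shrinking transferred data, a float with a known range can be stored as a normalized 16-bit integer, halving its size. This is a generic sketch; the function names and the choice of a linear mapping are assumptions, not something taken from the talk.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Maps value from [minValue, maxValue] to a 16-bit unsigned integer.
// Values outside the range are clamped. Assumes maxValue > minValue.
std::uint16_t QuantizeUnorm16(float value, float minValue, float maxValue) {
    float t = (value - minValue) / (maxValue - minValue);
    t = std::min(1.0f, std::max(0.0f, t));
    return static_cast<std::uint16_t>(std::lround(t * 65535.0f));
}

// Inverse mapping back to a float in [minValue, maxValue].
float DequantizeUnorm16(std::uint16_t q, float minValue, float maxValue) {
    return minValue + (maxValue - minValue) * (static_cast<float>(q) / 65535.0f);
}
```

The round trip loses at most half a quantization step, that is, (maxValue - minValue) / 131070. That is often acceptable for positions within a bounded volume or for normalized rotation components.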

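What was hard while learning GPU programming

• There is still no proper profiler
– Subtle memory penalties
– Subtle occupancy penalties

• Every GPU behaves differently
– Some prefer AoS and some prefer SoA
– Some do better with vector types and some do better with plain scalar code

• Bugs in Intel integrated graphics
– Sometimes data could not be read back, which caused a lot of trouble.
– It runs fine on nVIDIA, so try it there.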

Q&A

?
