Intel® Atom™ Processor - Graphics Developer's Guide
How to maximize graphics and game performance on Intel®
Atom™ processor-based platforms
Copyright © 2008-2011 Intel Corporation
All Rights Reserved
Revision: 1.0
Contributors: Ron Fosner, Orion Granatir
World Wide Web: http://www.intel.com
Disclaimer and Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO
LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD
CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are
available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Software Source Code Disclaimer
Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license:
Intel Sample Source Code License Agreement
This license governs use of the accompanying software. By installing or copying all or
any part of the software components in this package, you (“you” or “Licensee”) agree
to the terms of this agreement. Do not install or copy the software until you have
carefully read and agreed to the following terms and conditions. If you do not agree
to the terms of this agreement, promptly return the software to Intel Corporation
(“Intel”).
1. Definitions:
A. “Materials” are defined as the software (including the Redistributables and
Sample Source as defined herein), documentation, and other materials, including any updates and upgrade thereto, that are provided to you under this Agreement.
B. "Redistributables" are the files listed in the "redist.txt" file that is included in
the Materials or are otherwise clearly identified as redistributable files by Intel.
C. “Sample Source” is the source code file(s) that: (i) demonstrate(s) certain functions for particular purposes; (ii) are identified as sample source code;
and (iii) are provided hereunder in source code form.
D. “Intel's Licensed Patent Claims” means those claims of Intel's patents that
(a) are infringed by the Sample Source or Redistributables, alone and not in combination, in their unmodified form, as furnished by Intel to Licensee and (b) Intel has the right to license.
2. License Grant: Subject to all of the terms and conditions of this Agreement:
A. Intel grants to you a non-exclusive, non-assignable, copyright license to use
the Material for your internal development purposes only.
B. Intel grants to you a non-exclusive, non-assignable copyright license to reproduce the Sample Source, prepare derivative works of the Sample Source and distribute the Sample Source or any derivative works thereof that you create, as part of the product or application you develop using the Materials.
C. Intel grants to you a non-exclusive, non-assignable copyright license to distribute the Redistributables, or any portions thereof, as part of the product or application you develop using the Materials.
D. Intel grants Licensee a non-transferable, non-exclusive, worldwide, non-sublicenseable license under Intel's Licensed Patent Claims to make, use, sell, and import the Sample Source and the Redistributables.
3. Conditions and Limitations:
A. This license does not grant you any rights to use Intel's name, logo or trademarks.
B. Title to the Materials and all copies thereof remain with Intel. The Materials are copyrighted and are protected by United States copyright laws. You will not remove any copyright notice from the Materials. You agree to prevent
any unauthorized copying of the Materials. Except as expressly provided herein, Intel does not grant any express or implied right to you under Intel patents, copyrights, trademarks, or trade secret information.
C. You may NOT: (i) use or copy the Materials except as provided in this Agreement; (ii) rent or lease the Materials to any third party; (iii) assign this Agreement or transfer the Materials without the express written consent of Intel; (iv) modify, adapt, or translate the Materials in whole or in part except as provided in this Agreement; (v) reverse engineer, decompile, or
disassemble the Materials not provided to you in source code form; or (vii) distribute, sublicense or transfer the source code form of any components of the Materials and derivatives thereof to any third party except as provided in this Agreement.
4. No Warranty:
THE MATERIALS ARE PROVIDED “AS IS”. INTEL DISCLAIMS ALL EXPRESS OR IMPLIED WARRANTIES WITH RESPECT TO THEM, INCLUDING ANY
IMPLIED WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, AND FITNESS FOR ANY PARTICULAR PURPOSE.
5. Limitation of Liability: NEITHER INTEL NOR ITS SUPPLIERS SHALL BE
LIABLE FOR ANY DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, OR OTHER LOSS) ARISING OUT OF THE USE OF OR INABILITY TO USE THE SOFTWARE, EVEN IF INTEL HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. BECAUSE SOME JURISDICTIONS PROHIBIT THE EXCLUSION OR LIMITATION OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL
DAMAGES, THE ABOVE LIMITATION MAY NOT APPLY TO YOU.
6. USER SUBMISSIONS: You agree that any material, information or other communication, including all data, images, sounds, text, and other things embodied therein, you transmit or post to an Intel website or provide to Intel under this Agreement will be considered non-confidential ("Communications"). Intel will have no confidentiality obligations with respect to the Communications. You agree that Intel and its designees will be free to copy, modify, create derivative works, publicly display, disclose, distribute, license and sublicense through multiple tiers of distribution and licensees, incorporate and otherwise use the Communications, including derivative works thereto, for any and all commercial or non-commercial purposes.
7. TERMINATION OF THIS LICENSE: This Agreement becomes effective on the
date you accept this Agreement and will continue until terminated as provided for in this Agreement. Intel may terminate this license at any time if you are in breach of any of its terms and conditions. Upon termination, you will immediately return to Intel or destroy the Materials and all copies thereof.
8. U.S. GOVERNMENT RESTRICTED RIGHTS: The Materials are provided with "RESTRICTED RIGHTS". Use, duplication or disclosure by the Government is subject to restrictions set forth in FAR52.227-14 and DFAR252.227-7013 et seq. or its successor. Use of the Materials by the Government constitutes acknowledgment of Intel's rights in them.
9. APPLICABLE LAWS: Any claim arising under or relating to this Agreement shall be governed by the internal substantive laws of the State of Delaware,
without regard to principles of conflict of laws. You may not export the Materials in violation of applicable export laws.
* Other names and brands may be claimed as the property of others.
Intel and Intel Atom are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright (C) 2008 – 2011, Intel Corporation. All rights reserved.
Revision History
Revision Number   Description                                 Revision Date
1.0               Intel® Atom™ Processor Developer's Guide    Feb 2011
Contents
Disclaimer and Legal Information
Software Source Code Disclaimer
Copyright (C) 2008 – 2011, Intel Corporation. All rights reserved.
Revision History
1 About this Document
  1.1 Intended Audience
  1.2 Conventions, Symbols, and Terms
  1.3 Related Information
2 Intel® Atom™ Processor Optimization
  2.1 Overview
  2.2 Intel® Atom™ Processor Series
    2.2.1 Detecting Intel® Atom™ Processors
  2.3 Intel® Atom™ Processor Block Diagram
  2.4 Front End
    2.4.1 Locating x87 instructions
    2.4.2 Avoid x87 instructions
    2.4.3 Intel® Hyper-Threading Technology
  2.5 Execution Core
    2.5.1 Optimization with Intel® Streaming SIMD Extensions (Intel® SSE)
    2.5.2 Optimization for In-order Execution
    2.5.3 64-bit support
  2.6 Tools
    2.6.1 Intel® Composer XE (Compilers and Libraries)
    2.6.2 Intel® VTune™ Amplifier XE
    2.6.3 Intel® Graphics Performance Analyzers (Intel® GPA) - Platform Analyzer
  2.7 Intel® Atom™ Processor-based Platform Optimizations
    2.7.1 Tune for Power
    2.7.2 Tools
3 Intel® Atom™ Processor Integrated Graphics
  3.1 Overview
  3.2 Understanding the Intel® Atom™ Processor 3D Graphics Systems
  3.3 Intel® Graphics Media Accelerator 950/3150
  3.4 Intel® Graphics Media Accelerator 500/600
  3.5 Graphics API Support
  3.6 Detecting GPUs
4 Quick Tips: Graphics Performance Tuning
  4.1 Primitive Processing
    4.1.1 Vertex Capabilities
    4.1.2 Tips On Vertex/Primitive Processing
  4.2 Shader Capabilities
    4.2.1 Tips on Shader Capabilities
  4.3 Texture Sample and Pixel Operations
    4.3.1 Tips on Texture Sampling / Pixel Operations
  4.4 Managing Constants on Microsoft DirectX*
  4.5 Graphics Memory
    4.5.1 Resource Management
    4.5.2 Checking for Available Memory
  4.6 Creating a Microsoft DirectX* 9 Device for Intel® Atom™ Processor Graphics
5 Performance Analysis with Intel® Graphics Performance Analyzers
  5.1 Intel® GPA Monitor
  5.2 Intel® GPA System Analyzer HUD
  5.3 Intel® GPA Frame Analyzer
  5.4 Diagnosing Performance Bottlenecks
6 Support
7 References
8 Optimization Notice
1 About this Document
This document provides development hints and tips to ensure that your customers will have a great experience playing your games and running other interactive 3D graphics applications on platforms with Intel® Atom™ processors. This document details
software development practices encompassing the entire range of Intel® Atom™ processors with a focus on performance analysis using Microsoft DirectX*. Intel® Software Development Products useful in optimizing and profiling graphics applications are discussed throughout this document.
Figure 1 - The Intel® Atom™ processors are the brand names for a family of low-power
processors and platforms designed specifically for mobile Internet devices.
Intel Atom processors enable a broad range of devices including netbooks, entry-level desktops, tablets, handhelds, smartphones, consumer electronics (CE) devices, and other companion devices. Today Intel Atom processors integrate features such as controllers for memory, graphics, video, and display for a host of new applications
that deliver flexibility and innovation. In the future, 32nm-based System-on-Chip (SoC) solutions will provide even greater functionality and form factor options.
Intel Atom processors are optimized to enable new connected experiences with a range of capabilities:
- A new range of power-efficient devices with excellent performance enabled by industry-leading 45nm high-k metal gate technology and, soon, 32nm silicon process technology
- Highly integrated application processor that transforms everyday devices
- Smaller, more compact designs with a thermal design power (TDP) ranging from less than 1 watt to 13 watts
- Low power options in select devices enabling incredibly low idle, allowing devices to conserve energy
- Better performance and increased system responsiveness enabled by Intel® Hyper-Threading Technology (Intel® HT Technology)
Therefore, it makes sense to write your 3D applications to take advantage of this broad market and optimize the experience for the greatest number of people. By following the tips and tricks in this document, you have the opportunity to make your application shine with the graphics volume market leader.
1.1 Intended Audience
This document is targeted at experienced graphics developers who are familiar with
OpenGL*/Microsoft DirectX*, C/C++, multithreaded and shader programming, Microsoft Windows* operating systems, and 3D graphics.
1.2 Conventions, Symbols, and Terms
The following conventions are used in this document.
Table 1 Coding Style and Symbols Used in this Document
Source code:
for (int i = 0; i < 10; ++i) {
    cout << i << endl;
}
The following terms are used in this document.
Table 2 Terms Used in this Document
1. Intel Integrated Graphics Hardware (IIG)
a. GPU – Graphics Processing Unit
b. GMCH – Graphics and Memory Controller Hub – a parent component
architecture and chipset housing some Intel integrated graphics hardware
(GPU)
c. GMA – Graphics Media Accelerator – component name describing the GPU
chipset component in Intel integrated graphics.
d. UMA – Unified Memory Architecture – an architecture where the graphics
subsystem does not have exclusive dedicated memory and uses the host
system's memory (SDRAM)
e. DVMT – Dynamic Video Memory Technology – a memory allocation scheme
in UMA systems which allocates an exclusive, dynamically resizable chunk of
main memory to the graphics (driver)
f. VF – Vertex Fetch
g. VS – Vertex Shader
h. PS – Pixel Shader
i. GS – Geometry Shader
j. EU – Execution Unit, a vector machine component
k. CS – Command Stream manager component controlling 3D and media
l. I$ – Instruction cache
m. SO – Stream Output
2. Imagination Technologies POWERVR*
a. USSE – Universal Scalable Shader Engine
b. CGS – Coarse Grain Scheduler
c. ISP – Image Synthesis Processor
3. SWGP – Software geometry processing, a superset of CPU-based processing that
includes CPU vertex processing. SWGP is not equivalent to the Microsoft DirectX*
reference device.
4. SWVP – Software vertex processing
5. HWVP – Hardware vertex processing
1.3 Related Information
There are several other places you can look for additional information on Intel graphics, including the following sites:
Intel® HD Graphics: http://software.intel.com/en-us/articles/intel-graphics-developers-guides/
Intel® 4 Series Chipsets (the Intel® 4500, X4500, and X4500HD GMAs) Developer's Guide: http://software.intel.com/en-us/articles/intel-graphics-media-accelerator-developers-guide/
Intel® 3 Series Express Chipsets including the Intel® 3000 GMA and Intel® X3000 GMA Developer's Guide: http://software.intel.com/en-us/articles/intel-gma-3000-and-x3000-developers-guide/
We hope your questions are covered in these resources, including this guide. We are constantly updating these resources and welcome your comments and suggestions. If you have questions not answered in these resources, or have suggestions on
improving the guide, please get in touch with us at: [email protected]. If you are actively working with Intel already, you can also reach us through your engineering or account management contacts.
2 Intel® Atom™ Processor Optimization
2.1 Intel® Atom™ Processor Overview
The Intel® Atom™ processor was designed for general performance requirements of modern workloads while maintaining low power consumption.
The key features allowing the Intel Atom processors to maintain this low power consumption and efficient performance include:
- Intel® Hyper-Threading Technology provides two logical processors for multitasking and multi-threading workloads.
- Support for Single-Instruction Multiple-Data (SIMD) extensions up to Intel® Streaming SIMD Extensions 3 (Intel® SSE3) and Supplemental Streaming SIMD Extensions 3 (SSSE3).
- Enhanced Intel SpeedStep® Technology enables the operating system (OS) to program a processor to transition to lower frequency and/or voltage levels while executing a workload.
- Support for deep power-down technology to reduce static power consumption by turning off power to cache and other sub-systems in the processor.
- For greater power efficiency, Intel Atom processors utilize in-order processing. This differs from the out-of-order processors commonly found in desktops and laptops: Intel Atom processors do not reorder an instruction stream to extract instruction-level parallelism as other Intel® processors do.
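As a rough illustration of how an application might confirm the SIMD and threading features listed above at run time, the sketch below decodes the feature flags that CPUID returns in ECX and EDX for InfoType 1. The struct and function names are hypothetical (they do not come from this guide), and the register values are passed in as plain integers so the decoding can be followed without executing CPUID.

```cpp
#include <cassert>   // available for callers that want to assert on the result
#include <cstdint>

// Hypothetical helper type: the subset of CPUID InfoType 1 feature flags
// that corresponds to the features listed above.
struct SimdFeatures
{
    bool sse;    // EDX bit 25
    bool sse2;   // EDX bit 26
    bool htt;    // EDX bit 28: package can expose multiple logical processors
    bool sse3;   // ECX bit 0
    bool ssse3;  // ECX bit 9
};

// Decode the feature flags from the ECX and EDX values of CPUID InfoType 1.
inline SimdFeatures decodeFeatures(std::uint32_t ecx, std::uint32_t edx)
{
    SimdFeatures f;
    f.sse   = ((edx >> 25) & 1u) != 0;
    f.sse2  = ((edx >> 26) & 1u) != 0;
    f.htt   = ((edx >> 28) & 1u) != 0;
    f.sse3  = ((ecx >> 0)  & 1u) != 0;
    f.ssse3 = ((ecx >> 9)  & 1u) != 0;
    return f;
}
```

With the Microsoft __cpuid intrinsic used later in this chapter, ECX and EDX would be the third and fourth elements of the output array.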
Note: For an in-depth resource on optimizing for Intel® Atom™ processors, please review the Intel® 64 and IA-32 Architectures Software Developer's Manuals:
http://www.intel.com/products/processor/manuals/.
For Intel Atom processors, see Chapter 12 of the Intel® Architecture Optimization Reference Manual.
2.2 Intel® Atom™ Processor Series
The best place to get information about all available Intel® Atom™ processors is ark.intel.com.
Intel® Atom™ Processor Series   Devices                                          64-bit Support   Hyper-threading
N270, N280                      Netbooks                                         No               Yes
N4xx series, N5xx series        Netbooks                                         Yes              Yes
D4xx series, D5xx series        Entry level desktops (e.g. nettops)              Yes              Yes
230, 330                        Entry level desktops (e.g. nettops)              Yes              Yes
Z5xx                            Mobile Internet Devices (MIDs), some netbooks    No               Yes (except Z510)
Z6xx                            Mobile Internet Devices (MIDs), some netbooks    No               Yes
CE4100                          Consumer Electronics (e.g. Internet TV)          No               Yes
E-series                        Embedded                                         No               Yes
2.2.1 Detecting Intel® Atom™ Processors
An application can use the CPUID instruction to determine information about the host processor. This includes detecting Intel® Atom™ processors and support for features
like Intel® Hyper-Threading Technology.
Note: For a more in-depth discussion and full cross-platform processor detection, please refer to Chapter 14 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual Volume 1: Basic Architecture.
Beginning with the Intel486™ processor family, the type of CPU can be determined based on the processor identification signature. For all currently shipping Intel Atom processors (those manufactured using the 45 nm process), the processor identification
signature will be (values are in binary):
Extended Family   Extended Model   Type   Family Code   Model No.
00000000          0001             00     0110          1100
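To make the signature layout concrete, the following sketch extracts the fields from the EAX value that CPUID returns for InfoType 1 and compares them with the values in the table. The type and function names are hypothetical. The combined model value follows the usual convention of placing the extended model in front of the model number for family 6, and the check covers only the 45 nm signature given above.

```cpp
#include <cassert>   // available for callers that want to assert on the result
#include <cstdint>

// Hypothetical container for the fields of the processor signature.
struct Signature
{
    unsigned stepping;   // bits 0-3
    unsigned model;      // bits 4-7
    unsigned family;     // bits 8-11
    unsigned type;       // bits 12-13
    unsigned extModel;   // bits 16-19
    unsigned extFamily;  // bits 20-27
};

// Split the EAX value of CPUID InfoType 1 into its fields.
inline Signature decodeSignature(std::uint32_t eax)
{
    Signature s;
    s.stepping  =  eax        & 0xFu;
    s.model     = (eax >> 4)  & 0xFu;
    s.family    = (eax >> 8)  & 0xFu;
    s.type      = (eax >> 12) & 0x3u;
    s.extModel  = (eax >> 16) & 0xFu;
    s.extFamily = (eax >> 20) & 0xFFu;
    return s;
}

// Combined model number: the extended model only applies to families 6 and 15.
inline unsigned displayModel(const Signature& s)
{
    return (s.family == 0x6u || s.family == 0xFu) ? ((s.extModel << 4) | s.model)
                                                  : s.model;
}

// True if the signature matches the table above (stepping is ignored).
inline bool matches45nmAtomSignature(std::uint32_t eax)
{
    const Signature s = decodeSignature(eax);
    return s.extFamily == 0x00u && s.extModel == 0x1u && s.type == 0x0u &&
           s.family == 0x6u && s.model == 0xCu;
}
```

Read this way, the table describes family 6 with a combined model of 0x1C. Matching on the signature alone would miss later processors, which is one reason the listing below also inspects the brand string.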
The following source code will determine if the application is running on an Intel Atom processor and check for Intel Hyper-Threading Technology. Please note that this source code is not completely cross-platform because it doesn't properly support older CPUs (e.g., Intel386™ and Intel486 CPUs). For a more in-depth discussion and full cross-platform processor detection, please review the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture.
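#include <intrin.h>   // __cpuid (Microsoft Visual C++ intrinsic)
#include <stdio.h>    // printf_s
#include <string.h>   // memset, memcpy
#include <tchar.h>    // _tmain, _TCHAR

struct CPUInfoStruct
{
    union {
        char    CPUBrandString[48];    // brand string as 48 characters
        __int32 nCPUBrandString[12];   // the same 48 bytes viewed as 12 ints
    };
    int  nSteppingID;
    int  nModel;
    int  nFamily;
    int  nProcessorType;
    int  nBasicProcessorID;
    int  nExtendedModel;
    int  nExtendedFamily;
    bool bAtomProcessor;
    bool bHyperThreading;
    char CPUString[13];
};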
bool isAtom(const CPUInfoStruct& info)
{
    // Atom(TM) = 8 chars
    const int markerLength = 8;
    // Last index at which the 8 characters still fit inside the brand string
    const int lastStart = (int)sizeof(info.CPUBrandString) - markerLength;

    // Search backwards, looking for 'A' 't' 'o' 'm' '(' 'T' 'M' ')', until
    // the start of the brand string has been checked. Indexing avoids
    // decrementing a pointer past the beginning of the array.
    for (int i = lastStart; i >= 0; --i)
    {
        const char* c = info.CPUBrandString + i;
        if (c[0] == 'A' &&
            c[1] == 't' &&
            c[2] == 'o' &&
            c[3] == 'm' &&
            c[4] == '(' &&
            c[5] == 'T' &&
            c[6] == 'M' &&
            c[7] == ')')
        {
            return true;
        }
    }
    return false;
}
// This function fills in the CPUInfoStruct for us.
// Multiple CPUID calls are necessary to get all the information, and
// each call tells you more about the depth of calls you can make.
void fillCPUInfo(CPUInfoStruct& info)
{
    __int32 CPUInfo[4] = {0};
    ::memset(&info, 0, sizeof(CPUInfoStruct));

    // The __cpuid intrinsic executes the CPUID instruction and returns four
    // 32-bit values; the results depend on the InfoType parameter passed in.
    // Get the maximum InfoType value we can call for this processor
    // and also get the ID string
    ::__cpuid(CPUInfo, 0);

    // Swap last two to put in readable form
    int temp = CPUInfo[2];
    CPUInfo[2] = CPUInfo[3];
    CPUInfo[3] = temp;

    // Copy 12 characters
    ::memcpy(info.CPUString, &(CPUInfo[1]), 12); // 13th position is zero

    // Check to see if we can make the next call
    if (CPUInfo[0] < 1)
    {
        return;
    }

    // Call with InfoType == 1
    // CPUInfo will be set to the following:
    // 0: Bits 0-3:   Stepping ID
    // 0: Bits 4-7:   Model Number
    // 0: Bits 8-11:  FamilyCode
    // 0: Bits 12-13: Processor Type
    // 0: Bits 14-15: Reserved
    // 0: Bits 16-19: Extended Model
    // 0: Bits 20-27: Extended Family
    // 0: Bits 28-31: Reserved
    // 3: Bit 28:     Hyper-threading technology
    ::__cpuid(CPUInfo, 1);
    info.nSteppingID     = CPUInfo[0] & 0xf;               // bits 0-3
    info.nModel          = (CPUInfo[0] >> 4) & 0xf;        // bits 4-7
    info.nFamily         = (CPUInfo[0] >> 8) & 0xf;        // bits 8-11
    info.nProcessorType  = (CPUInfo[0] >> 12) & 0x3;       // bits 12-13
    info.nExtendedModel  = (CPUInfo[0] >> 16) & 0xf;       // bits 16-19
    info.nExtendedFamily = (CPUInfo[0] >> 20) & 0xff;      // bits 20-27
    info.bHyperThreading = (CPUInfo[3] & 0x10000000) != 0; // bit 28

    // Check to see if we can get the Processor Brand String
    // Call with InfoType == 0x80000000
    ::__cpuid(CPUInfo, 0x80000000);
    // Compare as unsigned: extended info supported up to 0x80000004?
    if (static_cast<unsigned int>(CPUInfo[0]) < 0x80000004u)
    {
        return;
    }

    // Yes, make the 3 calls (16 chars each or 4 ints each)
    // to make up the brand string - it's null terminated
    ::__cpuid(info.nCPUBrandString + 0, 0x80000002);
    ::__cpuid(info.nCPUBrandString + 4, 0x80000003);
    ::__cpuid(info.nCPUBrandString + 8, 0x80000004);

    // Now we can check for Atom(TM) processors
    info.bAtomProcessor = isAtom(info);
}
int _tmain(int argc, _TCHAR* argv[])
{
    // This is how to use the code
    CPUInfoStruct info;   // Create the struct
    fillCPUInfo(info);    // Fill it
    // Query the bits
    printf_s("CPUString : %s\n", info.CPUString);
    printf_s("Brand String : %s\n", info.CPUBrandString);
    printf_s("Hyperthreaded?: %s\n", info.bHyperThreading ? "Yes": "No");
    printf_s("is it an Atom : %s\n", info.bAtomProcessor ? "Yes": "No");
    return 0;
}
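The character-by-character search in isAtom can also be written with the standard library. The sketch below is an alternative under stated assumptions rather than the guide's method: the function name and its pointer-plus-length signature are hypothetical, and it looks for the same "Atom(TM)" marker anywhere in the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical alternative to isAtom(): returns true if the marker
// "Atom(TM)" occurs anywhere in the first `len` bytes of `brand`.
inline bool brandContainsAtom(const char* brand, std::size_t len)
{
    assert(brand != nullptr);
    // Constructing the string from an explicit length means a missing
    // terminator in the buffer cannot cause a read past its end.
    const std::string text(brand, len);
    return text.find("Atom(TM)") != std::string::npos;
}
```

With the struct from the listing above, a call might look like brandContainsAtom(info.CPUBrandString, sizeof(info.CPUBrandString)).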
2.3 Intel® Atom™ Processor Block Diagram
2.4 Front End
The front end features a power-optimized pipeline that can deliver up to two instructions per cycle to the instruction queue for scheduling. This means the ideal cycles per instruction (CPI) retired is 0.5.
Tip: By default, Intel® VTune™ Amplifier XE will show “Clocks per Instructions Retired – CPI”. For this metric, the ideal is 0.5. However, many things can prevent the ideal scenario, such as delays due to cache misses. See Section 2.6.2 for more details on Intel VTune Amplifier XE.
It's important to avoid legacy x87 instructions (see Section 2.4.1, “Locating x87 instructions”, for more details). Back-to-back x87 instructions can cause the front end to stall because it can decode only one x87 instruction per cycle.
Tip: Avoid x87 instructions; see Section 2.4.2 for more details. In general, Intel® SSE will have better performance at lower power utilization. Whenever possible, use Intel SSE for floating point-intensive operations. See Section 2.5.1 for more information about Intel SSE.
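As a small worked example of the metric mentioned above: CPI is the number of clock cycles divided by the number of instructions retired, so a front end that sustains two instructions per cycle gives 1 / 2 = 0.5. The helper names below are hypothetical, and the counter values in the note that follows are made-up illustrations, not measurements.

```cpp
#include <cassert>

// Cycles per instruction retired, from two raw counter values.
inline double cyclesPerInstruction(double cycles, double instructionsRetired)
{
    assert(instructionsRetired > 0.0);
    return cycles / instructionsRetired;
}

// Ideal CPI for a pipeline that can deliver `issueWidth` instructions per cycle.
inline double idealCpi(double issueWidth)
{
    assert(issueWidth > 0.0);
    return 1.0 / issueWidth;
}

// Ratio of measured CPI to ideal CPI: 1.0 means ideal, larger means slower.
inline double cpiRelativeToIdeal(double cpi, double issueWidth)
{
    return cpi / idealCpi(issueWidth);
}
```

For instance, 1,500,000 cycles for 1,000,000 retired instructions is a CPI of 1.5, which is three times the two-wide ideal of 0.5.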
2.4.1 Locating x87 instructions
If the compiler generates code using x87 instructions, then the disassembly view will appear similar to the following:
for(int i=0; i!=n ;i++)
003238C8 mov esi,dword ptr [ebp+8]
003238CB push edi
003238CC mov edi,dword ptr [dest]
003238CF add esi,8
003238D2 mov ebx,4000h
N[i] = V[i] / magnitude(V[i]);
003238D7 fld dword ptr [esi-4]
003238DA fld dword ptr [esi-8]
003238DD fld dword ptr [esi]
003238DF fld st(1)
003238E1 fmulp st(2),st
003238E3 fld st(2)
003238E5 fmulp st(3),st
003238E7 fxch st(1)
003238E9 faddp st(2),st
003238EB fmul st(0),st
003238ED faddp st(1),st
003238EF fstp dword ptr [ebp-4]
003238F2 fld dword ptr [ebp-4]
003238F5 call _CIsqrt (3255B0h)
003238FA fstp dword ptr [ebp-4]
...
The assembly instructions beginning with the letter 'f', including fmul, fld, faddp, and fmulp, are legacy x87 math coprocessor instructions that predate the Intel® Pentium® processor. Furthermore, the second-to-last line, call _CIsqrt, invokes a function to compute the square root rather than computing it inline. This sort of assembly is not ideal for high-performance code.
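For reference, the disassembly above appears to come from a vector-normalization loop. The sketch below is a guess at what such source might look like; the Vec3 type and both function names are hypothetical, since the guide does not show them. It keeps all math in single precision and multiplies by one reciprocal instead of dividing three times. Whether the compiler emits x87 or Intel SSE instructions for it still depends on the compiler settings.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical 3-component vector of single-precision floats.
struct Vec3 { float x, y, z; };

inline float magnitude(const Vec3& v)
{
    // std::sqrt has a float overload, so nothing is promoted to double here
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// N[i] = V[i] / magnitude(V[i]); zero-length inputs are copied unchanged.
inline void normalizeAll(const Vec3* V, Vec3* N, std::size_t n)
{
    assert(n == 0 || (V != nullptr && N != nullptr));
    for (std::size_t i = 0; i != n; ++i)
    {
        const float len = magnitude(V[i]);
        const float inv = (len > 0.0f) ? 1.0f / len : 1.0f;  // one divide per vector
        N[i].x = V[i].x * inv;
        N[i].y = V[i].y * inv;
        N[i].z = V[i].z * inv;
    }
}
```

Rebuilding a loop like this with the settings described in Section 2.4.2 and looking at the disassembly view again is a quick way to confirm that the f-prefixed instructions are gone.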
2.4.2 Avoid x87 instructions
2.4.2.1 Proper Microsoft Visual Studio* Settings
In Microsoft Visual Studio*, there is a setting that will avoid generating x87 instructions.
In the Project Properties, under the C/C++ group, open the Code Generation options. Set “Enable Enhanced Instruction Set” to “Streaming SIMD Extensions 2” so the compiler will generate Intel® SSE instructions to better use all the execution units and avoid generating x87 instructions. Also change “Floating Point Model” to “Fast” so that the compiler may keep float calculations in 32-bit single precision instead of promoting them to double. Changing these options requires a rebuild of all the code to take effect.
Tip: For Microsoft Visual Studio*, set Enhanced Instruction Set to Streaming SIMD Extensions 2 (/arch:SSE2).
Tip: For Microsoft Visual Studio*, set Floating Point Model to Fast.
2.4.2.2 Proper GCC Settings
The -ffast-math option is appropriate for games and allows the compiler to generate faster math code that doesn't exactly implement IEEE or ISO rules and specifications for math functions. This option sets -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans and -fcx-limited-range.
Please note that this option might have adverse effects on functionality that requires a high level of precision or cross-platform support.
Tip: For GCC, use the -ffast-math option.
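The -mssse3 switch enables the compiler to generate Supplemental SSE3 (SSSE3) instructions. Since all Intel Atom processors support SSSE3, this will better utilize all execution units. Note that on 32-bit x86 targets GCC still uses x87 for scalar floating-point math by default; adding -mfpmath=sse moves that math to SSE registers and avoids generating x87 instructions.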
Tip: For GCC, use the -mssse3 option.
Changing these options will require a rebuild of all the code to take effect.
2.4.2.3 Proper Intel® C++ Composer XE 2011 Settings
The /fp:fast and -fp-model fast options are appropriate for games and allow the compiler to generate faster math code that doesn't exactly implement IEEE or ISO rules and specifications for math functions.
Please note that this option might have adverse effects on functionality that requires a high level of precision or cross-platform support.
Tip: For Intel® C++ Composer XE 2011 for Windows*, use the /fp:fast option.
Tip: For Intel® C++ Composer XE 2011 for Linux*, use the -fp-model fast option.
The SSE3_ATOM option allows the compiler to generate Supplemental SSE3 (SSSE3) and MOVBE instructions. Since all Intel Atom processors support SSSE3, this will better utilize all execution units and avoid generating x87 instructions.
Tip: For Intel® C++ Composer XE 2011 for Windows*, use the /QxSSE3_ATOM option.
Tip: For Intel® C++ Composer XE 2011 for Linux*, use the -xSSE3_ATOM option.
2.4.3 Intel® Hyper-Threading Technology
Intel® Hyper-Threading Technology (Intel® HT Technology) enables multiple threads
to run on each core. Intel HT Technology is designed to increase processor throughput and overall performance on threaded software. Nearly all Intel Atom
processors support Intel HT Technology.
Tip: Use threading to fully utilize all components of the Intel® Atom™ processor. This
is especially true for multi-core Intel Atom processors.
The instruction queue is statically partitioned for scheduling instruction execution from two threads. The scheduler can pick one instruction from either thread and dispatch it to either port 0 or port 1 for execution. The hardware chooses which thread to fetch, decode, and dispatch instructions from based on fairness and on each thread's readiness to make forward progress.
Basically, if one thread isn't using all execution units because of dependency stalls or an unbalanced instruction stream, a second thread can run on the underutilized execution units.
Note: Intel® GMA 950/3150 graphics offload vertex processing to the CPU. This means that 3D game applications and the associated vertex processing work from the driver will be utilizing CPU resources.
The multithreaded graphics driver will be running alongside your app and utilizing
resources. Threading might incur a performance penalty due to oversubscription.
However, on multi-core Intel Atom processors, use of multithreading is paramount to achieving maximum performance.
Tip: Use the Intel® Software Development Products to help measure performance
and scaling with multithreading. See Section 2.6 for more information.
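As a rough illustration of the threading advice above, the following C++ sketch splits an independent per-element update across the logical processors reported by the standard library. The function name ParallelScale and the chunking scheme are assumptions made for this example and do not come from any Intel library; a real game would more likely hand the chunks to a task scheduler.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: multiplies every element of 'data' by 'factor',
// giving each logical processor one contiguous chunk of the array.
void ParallelScale(std::vector<float>& data, float factor)
{
    // hardware_concurrency() may return 0 when the count is unknown.
    const std::size_t workers =
        std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (data.size() + workers - 1) / workers;

    std::vector<std::thread> threads;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk)
    {
        const std::size_t end = std::min(begin + chunk, data.size());
        threads.emplace_back([&data, factor, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                data[i] *= factor;
        });
    }
    for (std::thread& t : threads)
        t.join();  // wait for every chunk before returning
}
```

Because the graphics driver may also be running threads, it is worth measuring whether one worker per logical processor actually helps on the target device.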
2.5 Execution Core
Since the front-end can issue two instructions per cycle, the execution cores should be making forward progress with two instructions whenever possible. The compiler will handle most of the details of selecting the best instruction ordering.
Several instructions take more than one cycle to complete. In most cases, other multiple-cycle instructions can be pipelined with longer instructions. However, single-cycle instructions will block due to the requirements of program order. Divides and 64-bit floating point operations are examples of multi-cycle instructions that do not pipeline well. Multiplies are an example of instructions that pipeline well.
Tip: Divide instructions should only be used when absolutely necessary. In Intel®
VTune™ Amplifier XE, “DIV” and “CYCLES_DIV_BUSY” events can be used to determine if divides are a bottleneck in your program.
Tip: Use 32-bit floating point instead of 64-bit floating point whenever possible. 64-bit instructions take longer to complete and generally can't be pipelined as well as 32-bit versions.
Tip: It's important to use Intel® Streaming SIMD Extensions (Intel® SSE) for performance critical code that is computationally intense. See “Optimization with Intel® Streaming SIMD Extensions (Intel® SSE)” for more information.
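The tips on divides and 32-bit floating point can be combined in a small sketch. The helper below is hypothetical (NormalizeSamples is not an existing API); it performs one single-precision divide outside the loop and uses multiplies inside it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: scales 'count' samples so that 'maxValue' maps to 1.0f.
// The divide is hoisted out of the loop; the loop body only multiplies.
void NormalizeSamples(float* samples, std::size_t count, float maxValue)
{
    assert(maxValue != 0.0f);
    const float invMax = 1.0f / maxValue;  // one divide, 32-bit float
    for (std::size_t i = 0; i < count; ++i)
        samples[i] *= invMax;              // multiplies pipeline well
}
```

Multiplying by a reciprocal can differ from a true divide in the last bit of the result, so this substitution is only appropriate where that difference is acceptable.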
2.5.1 Optimization with Intel® Streaming SIMD Extensions (Intel® SSE)
All Intel® Atom™ processors support Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (SSSE3). Intel SSE instructions allow the CPU to work on four 32-bit floating point values with a single instruction. This can greatly increase floating point operations per second (FLOPS) and is vital for computationally intense code.
In general, Intel SSE will promote better power efficiency with increased throughput.
Tip: Setting proper compiler options will allow the compiler to automatically generate Intel® SSE instructions:
Compiler | Option to Enable Intel® SSE Code Generation
Microsoft Visual Studio* | Set “Enhanced Instruction Set” to “Streaming SIMD Extensions 2 (/arch:SSE2)”
GCC | -mssse3
Intel® C++ Composer XE 2011 for Windows* | /QxSSE3_ATOM
Intel® C++ Composer XE 2011 for Linux* | -xSSE3_ATOM
There are multiple ways to use Intel SSE. For developers interested in maximum control, intrinsics are the best way to utilize Intel SSE. Intrinsics are compiler-specific functions that the compiler expands inline into highly efficient machine instructions. For developers targeting Microsoft DirectX* on PC or Microsoft Xbox*, the Microsoft XNA* Math Library wraps the use of intrinsics in a library that already supports vectors and matrices.
Tip: There are a few things to keep in mind when utilizing the Microsoft XNA* Math Library. First, be careful accessing individual elements. Getting and setting elements inside an Intel® SSE vector isn't free. It's best to put data into XMVECTOR variables and keep it there as long as possible. Also, make sure you are using properly aligned data.
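To make the intrinsics approach concrete, here is a minimal sketch that processes four 32-bit floats per instruction using the SSE intrinsics from xmmintrin.h. The function name MultiplyAdd4 and its alignment and size requirements are assumptions chosen for this example.

```cpp
#include <cassert>
#include <xmmintrin.h>  // Intel SSE intrinsics

// Hypothetical helper: out[i] = a[i] * b[i] + c[i], four floats at a time.
// All pointers must be 16-byte aligned and 'count' must be a multiple of 4.
void MultiplyAdd4(const float* a, const float* b, const float* c,
                  float* out, int count)
{
    assert(count % 4 == 0);
    for (int i = 0; i < count; i += 4)
    {
        const __m128 va = _mm_load_ps(a + i);  // aligned load of 4 floats
        const __m128 vb = _mm_load_ps(b + i);
        const __m128 vc = _mm_load_ps(c + i);
        const __m128 r  = _mm_add_ps(_mm_mul_ps(va, vb), vc);
        _mm_store_ps(out + i, r);              // aligned store of 4 floats
    }
}
```

Arrays passed to this function can be declared with alignas(16) to satisfy the alignment requirement. The data stays in SSE registers for the whole computation, which follows the advice above about avoiding per-element access.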
2.5.2 Optimization for In-order Execution
Instruction scheduling heuristics and coding techniques that apply to out-of-order microarchitectures may not deliver optimal performance on an in-order
microarchitecture. Likewise, instruction scheduling heuristics and coding techniques for an in-order pipeline like Intel® Atom™ microarchitecture may not achieve optimal performance on out-of-order microarchitectures.
Here is an example of where improperly ordered instructions can cause stalls that would otherwise be avoided in an out-of-order processor:
The easiest way to optimize for the in-order nature of Intel Atom processors is to use Intel® C++ Composer XE 2011 with the xL option. This option allows the compiler to assume that the target system has an in-order processor and to aggressively unroll loops.
Tip: For Intel® C++ Composer XE 2011 for Windows*, use the /QxL option to enable optimizations for in-order processors.
Tip: For Intel® C++ Composer XE 2011 for Linux*, use the -xL option to enable optimizations for in-order processors.
Loop unrolling can help find instructions to pair with long-latency operations (e.g., multi-cycle instructions). For example, issuing multiple long-latency multiply instructions together will generate better throughput. It is worthwhile to investigate loop unrolling in critical sections of your application's code.
Tip: Unrolling loops can help pair instructions for execution and pipelining. However, unrolling loops can put more pressure on the front-end. In Intel® VTune™ Amplifier XE, the “ICACHE_MISSES” event can be used to measure whether the increased instruction footprint is harmful.
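A hand-unrolled loop might look like the following sketch. The function DotUnrolled is a hypothetical example, not code from this guide; it unrolls a dot product by four and uses independent accumulators so that consecutive multiplies do not depend on each other.

```cpp
#include <cstddef>

// Hypothetical helper: dot product unrolled by four. Each accumulator forms
// its own dependency chain, so the four multiplies can overlap in the pipeline.
float DotUnrolled(const float* a, const float* b, std::size_t count)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)
    {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < count; ++i)  // remainder elements
        sum += a[i] * b[i];
    return sum;
}
```

Splitting the sum across accumulators changes the order of the floating point additions, so results can differ slightly from a simple loop. Whether the unrolled version is faster should be confirmed by profiling.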
2.5.3 64-bit support
It's worthwhile to note that 64-bit support is not ubiquitous. To reach the broadest market, target 32-bit whenever possible.
Intel® Atom™ Processor Series | Support for 64-bit
N270, N280 | No
N4xx series, N5xx series | Yes
D4xx series, D5xx series | Yes
230, 330 | Yes
Z5xx | No
Z6xx | No
CE4100 | No
E-series | No
2.6 Tools
2.6.1 Intel® Composer XE (Compilers and Libraries)
Intel C++ Composer XE 2011 (formerly the Intel® C++ Compiler) has several flags that are ideal for applications targeting Intel® Atom™ processor-based platforms:
Platform | Compiler Option | Details
Microsoft Windows* | /QxSSE3_ATOM | Allows the compiler to generate Supplemental SSE3 (SSSE3) and MOVBE instructions
Microsoft Windows* | /QxL | Enables optimization around in-order execution
Microsoft Windows* | /fp:fast | Allows the compiler to generate faster math code that doesn't exactly implement IEEE or ISO rules and specifications for math functions
Linux* | -xSSE3_ATOM | Allows the compiler to generate Supplemental SSE3 (SSSE3) and MOVBE instructions
Linux* | -xL | Enables optimization around in-order execution
Linux* | -fp-model fast | Allows the compiler to generate faster math code that doesn't exactly implement IEEE or ISO rules and specifications for math functions
Intel® Composer XE also includes a set of parallel development mechanisms called the Intel® Parallel Building Blocks. Intel® Threading Building Blocks helps developers
to build performant task-based threading appropriate for games.
For more information on Intel® Composer XE, visit: http://software.intel.com/en-us/articles/intel-composer-xe/.
See Section 8 for a notice about optimizations with Intel® Software Development
Products.
2.6.2 Intel® VTune™ Amplifier XE
Intel® VTune™ Amplifier XE is a great tool for locating bottlenecks of CPU workloads. In addition to general profiling information, it includes several profiling events that are
appropriate for Intel® Atom™ processor-based platforms:
Intel® VTune™ Amplifier XE Event | Details
Clocks per Instruction Retired (CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY) | Measures the average latency per instruction retired (completed). Ideally, this should be 0.5 because the front end can decode and issue 2 instructions per clock.
DECODE_RESTRICTION | Counts the number of occurrences in a workload that encountered delays causing a reduction of decode throughput. Avoid back-to-back x87 instructions.
BACLEARS | Can provide a means to evaluate whether loop unrolling is helping or hurting front end performance.
ICACHE_MISSES | Can help evaluate if loop unrolling is increasing the instruction footprint too much.
BR_MISSP_TYPE_RETIRED | Can provide a means to evaluate branch prediction issues due to branch types.
DIV and CYCLES_DIV_BUSY | Can provide a means to determine if divides are a bottleneck.
For more information on VTune™ Amplifier XE, visit: http://www.intel.com/software/products/vtune.
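The clocks-per-instruction metric listed among the events above is a simple ratio of two event counts. The sketch below shows the arithmetic; the function name is hypothetical and the counts are assumed to be read from a profiler report.

```cpp
#include <cstdint>

// Clocks per instruction retired, computed from two raw event counts.
// Returns 0.0 when no instructions were retired, to avoid dividing by zero.
double ClocksPerInstruction(std::uint64_t cpuClkUnhaltedThread,
                            std::uint64_t instRetiredAny)
{
    if (instRetiredAny == 0)
        return 0.0;
    return static_cast<double>(cpuClkUnhaltedThread) /
           static_cast<double>(instRetiredAny);
}
```

For example, 1,000 unhalted clocks and 2,000 retired instructions give a value of 0.5, the ideal for a front end that issues two instructions per clock. Values well above 0.5 suggest stalls worth investigating with the other events.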
2.6.3 Intel® Graphics Performance Analyzers (Intel® GPA) - Platform Analyzer
Intel® Graphics Performance Analyzers (Intel® GPA) is a set of tools designed for game development that helps profile and analyze Microsoft DirectX* graphics applications. For more information on System Analyzer and Frame Analyzer, see Section 5.
Intel® GPA Platform Analyzer is a tool designed to visualize the execution profile of the tasks in a code base on the heterogeneous (CPU+GPU) PC platform over time. This tool collects trace data during the application run to provide detailed analysis of how code executes across all threads, and correlates the CPU work with work being done on the GPU. The tool automatically aligns clocks across all cores in the entire system so that CPU-based workloads can be analyzed together with GPU-based workloads on the timeline.
Note: Use Intel® GPA System Analyzer HUD to capture traces. For most Intel® Atom™ processor-based devices, Intel® GPA Platform Analyzer will need to be run on a separate machine and connect over a network. See Section 5 for more information.
Platform View requires a developer to instrument their code base; this involves marking up areas of the code with a simple API. Once a code base is properly
instrumented, the tool will show performance over time which includes multithreaded support and automatic Microsoft DirectX* driver markup.
For more information on GPA visit: http://software.intel.com/en-us/articles/intel-gpa/.
2.7 Intel® Atom™ Processor-based Platform Optimizations
2.7.1 Tune for Power
The Intel® Atom™ processor was designed to meet the performance requirements of modern workloads with minimal power consumption to facilitate small form-factor devices. It's important to be power-conscious when targeting Intel Atom processor-based platforms.
In general, avoid operations that frequently wake the hardware, such as spin waits or hardware polling. Activities that spin up a hard drive or media device (CD/DVD) can use a significant amount of power.
Target a fixed frame rate. It's better to let the hardware idle and conserve power than to leave the frame rate uncapped.
Tip: Reduce the number of cycles to complete a task and allow the hardware to sleep
sooner.
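One way to target a fixed frame rate is to sleep until the next frame is due rather than spinning. The class below is a minimal sketch using only the C++ standard library; FrameLimiter and its interface are assumptions made for illustration, and a real game might instead rely on vertical sync.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical frame limiter: construct it once, then call Wait() at the end
// of every frame so the thread sleeps until the next frame is due.
class FrameLimiter
{
public:
    using Clock = std::chrono::steady_clock;

    explicit FrameLimiter(int targetFps)
        : frameTime_(FrameDuration(targetFps)),
          nextFrame_(Clock::now() + frameTime_)
    {
    }

    void Wait()
    {
        std::this_thread::sleep_until(nextFrame_);  // idle instead of spinning
        nextFrame_ += frameTime_;
        const Clock::time_point now = Clock::now();
        if (nextFrame_ < now)
            nextFrame_ = now + frameTime_;  // frame overran; restart schedule
    }

private:
    static Clock::duration FrameDuration(int targetFps)
    {
        assert(targetFps > 0);
        return std::chrono::duration_cast<Clock::duration>(
            std::chrono::nanoseconds(1000000000LL / targetFps));
    }

    Clock::duration   frameTime_;
    Clock::time_point nextFrame_;
};
```

A game loop would create the limiter with its target rate, for example FrameLimiter limiter(30), and call limiter.Wait() after presenting each frame. Finishing the frame's work quickly then translates directly into longer idle periods.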
The Intel® Laptop Gaming TDK gives an application access to information about
power source and battery life (see Section 2.7.2.1 for more details).
Tip: Save the game when the battery is about to die.
A developer can also use the Windows* Power Management functions. For example, listening for the WM_POWERBROADCAST event allows an application to detect system
suspension, hibernation, closed lid, low battery, and more.
For developers targeting Linux* platforms, see http://lesswatts.org for more tips on optimizing for power.
2.7.2 Tools
2.7.2.1 Intel® Laptop Gaming Technology Development Kit (TDK)
The Intel® Laptop Gaming TDK provides an easy interface to add mobile-aware features to a game. Here are some examples:
GetPwrSrc(): returns information about the power source (battery or A/C).
GetPercentBatteryLife(): returns the percentage of remaining battery life.
Get80211SignalStrength(): returns the network connectivity strength.
The TDK also includes functionality to build a Wi-Fi Ad-Hoc peer-to-peer network.
For more information on the Intel® Laptop Gaming TDK, visit: http://software.intel.com/en-us/articles/intel-laptop-gaming-technology-development-kit/
3 Intel® Atom™ Processor Integrated Graphics
The latest generation of Intel® Atom™ processors contains an on-board low-power GPU, designed to provide a satisfying user experience when watching HD video and playing 3D games. The Intel Atom processor is powerful enough to play basic 3D games, with the on-board graphics processor providing a new level of 3D gaming support.
Figure 2. On-chip graphics architecture of the mobile chipset featuring Intel® Atom™ processor codenamed Pineview and platform controller hub codenamed
Tigerpoint
3.1 Overview
Some versions of the Intel® Atom™ processors contain an on-board GPU. These range in power from the low-end Intel® Graphics Media Accelerator 500 (Intel® GMA 500) series to the current top-end Intel Atom graphics processor, the Intel® Graphics
Media Accelerator 3150 (Intel® GMA 3150). The Intel GMA 500 and Intel GMA 600
are based upon the Imagination Technologies POWERVR* graphics processor, while the Intel GMA 950 and Intel GMA 3150 are based upon the Intel® 945G Express chipset. This variation in graphics power can complicate programming graphics on an Intel Atom processor with integrated graphics, so it's important to identify the particular GPU your application is running on and to program to the strengths of each GPU.
The following table lists the Intel Atom processors that have an integrated graphics processor and the GPUs found on those chipsets.
Tip: Validate on a system with Intel® GMA 500/600 graphics and a system with
Intel® GMA 950/3150 graphics. The performance characteristics are different enough to warrant separate validation on these two classes of graphics hardware.
Since the Intel GMA 500/600 and Intel GMA 950/3150 are built on different core technologies, it's important to remember to properly check the Microsoft DirectX* Caps.
3.2 Understanding the Intel® Atom™ Processor 3D Graphics Systems
It is best to think of Intel® Atom™ processor-based graphics solutions in two categories: Intel® GMA 950/3150 and Intel® GMA 500/600. Unless otherwise stated, all advice in this guide applies to both the Intel GMA 950/3150 and Intel GMA
500/600. The Intel GMA 950/3150 and Intel GMA 500/600 are based on different core technologies, but are both integrated solutions backed by Intel Atom processor-based Intel chipsets, so they share similar characteristics.
Intel GMA 950/3150 has been designed with a deep pipelined architecture, where performance is maximized by allowing each stage of the pipeline to simultaneously operate on different primitives or portions of the same primitive. The main blocks of the pipeline are the Setup Engine, Rasterizer, Texture Pipeline, and Raster Pipeline. A typical programming sequence would be to send instructions to set the state of the pipeline followed by rendering instructions containing 3D primitive vertex data.
Graphics Solution | Intel® Atom™ Processor Series | Microsoft DirectX* Support | OpenGL* Support (Microsoft Windows*) | Vertex Processing
Intel® GMA 500 (Section 3.2) | Z5xx | DirectX* 9.0c (Shader Model 2) | OpenGL* 1.1 | Hardware
Intel GMA 600 (Section 3.3) | Z6xx | DirectX* 9.0c (Shader Model 2) | OpenGL* 1.1 | Hardware
Intel GMA 950 (Section 3.4) | N2xx | DirectX* 9.0c (Shader Model 2) | OpenGL* 1.4 | Software
Intel GMA 3150 (Section 3.5) | D4xx/D5xx, N4xx/N5xx | DirectX* 9.0c (Shader Model 2) | OpenGL* 1.4 | Software
Intel GMA 500/600 cores are a tile-based 3D rendering architecture. They feature a 3D graphics engine as well as a 2D graphics engine. Since it is a tile-based architecture, the 3D engine will render and process small sections of a screen (called 'tiles') to the frame buffer, rather than filling a frame buffer with an entire scene. Sending smaller sections of a scene to the engine permits more consistent utilization of the graphics hardware and allows for a small internal frame buffer (similar to a cache) which is flushed to an external frame buffer. Traditionally, larger frame buffers have been used, which increases power consumption. The graphics core does internal Z processing which permits better organization of the write operations and eliminates the need for a physical Z buffer, also saving power.
Tip: Target a fixed frame rate. It's better to let the hardware idle and conserve power instead of letting the frame rate be uncapped.
3.3 Intel® Graphics Media Accelerator 950/3150
The Intel® Graphics Media Accelerator 950 (Intel® GMA 950) is an integrated (on-board) graphics chip on the Mobile Intel® 945G Express chipset for Intel processors. It is a faster clocked version of the Intel GMA 900.
The Intel GMA 3150 is a very low power integrated (shared memory) graphics part
that is located on the processor package (on die with the Intel® Atom™ processor). It features two processor cores clocked at 200 MHz.
Intel GMA 950/3150 are based on a deep pipelined architecture:
Intel GMA 950/3150 do not support hardware vertex processing. They support Microsoft DirectX* 9.0c with Shader Model 2.0 (with software Vertex Shader) and
OpenGL* 1.4. In addition to 3D acceleration, Intel GMA 950/3150 have extensive
hardware to accelerate 2D video.
3.4 Intel® Graphics Media Accelerator 500/600
The Intel® Graphics Media Accelerator 500/600 (Intel® GMA 500/600) are graphics solutions for embedded products (e.g. MIDs), netbooks, and other small mobile devices. They are based on Imagination Technologies POWERVR SGX* cores (e.g.,
POWERVR SGX535*).
Intel GMA 500 is clocked at 100 MHz (UL11L chipset) or 200 MHz (US15L and US15W chipsets). Intel GMA 600 has a maximum clock speed of 400 MHz.
Intel GMA 500/600 cores are a tile-based 3D rendering architecture:
Intel GMA 500/600 do support hardware vertex processing. They support Microsoft
DirectX* 9.0c with Shader Model 2.0 and OpenGL* 1.1. In addition to 3D acceleration, Intel GMA 500/600 both have extensive hardware to accelerate 2D video.
Note: For more information about POWERVR*, check out Imagination Technologies
POWERVR Insider* SDK at: http://www.imgtec.com/powervr/insider/powervr-sdk.asp.
3.5 Graphics API Support
This is the current level of support found in the drivers for Intel® Atom™ processor
GPUs. In addition there are variations in the particular level of support as new features are added in the drivers. You should check the latest drivers to see if there have been any updates to the level of support.
GPU | Microsoft DirectX* | Vertex SM | Pixel SM | OpenGL* (Microsoft Windows*) | OpenGL* (Linux*)
Intel® GMA 500/600 | 9.0c | 3.0 | 3.0 | 1.1 | 2.0
Intel GMA 950/3150 | 9.0c | 3.0 (SW) | 2.0 | 1.5 | 2.0
Tip: In general, Intel® GMA 500/600 have a higher level of Microsoft DirectX* Capabilities (DX Caps) than Intel® GMA 950/3150. For applications targeting both Intel GMA 950/3150 and Intel GMA 500/600, use Intel GMA 950/3150 capabilities as the target functionality. See Sections 4.1.1, 4.2, and 4.3 for more details.
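The idea of targeting the lower of the two capability sets can be sketched as follows. The GpuCaps structure and its field names are illustrative only and are not members of D3DCAPS9; the numbers are copied from the tables in Sections 4.2 and 4.3 of this guide, and at run time the real values should come from the Microsoft DirectX* caps query.

```cpp
#include <algorithm>

// Illustrative subset of capabilities; not the layout of D3DCAPS9.
struct GpuCaps
{
    int  pixelShaderInstructionSlots;
    int  temporaryRegisters;
    int  maxTextureDimension;
    int  maxAnisotropy;
    int  maxRenderTargets;
    bool dynamicFlowControl;
};

// Returns the capabilities that are safe to rely on for both GPUs.
GpuCaps CommonCaps(const GpuCaps& a, const GpuCaps& b)
{
    GpuCaps c;
    c.pixelShaderInstructionSlots =
        std::min(a.pixelShaderInstructionSlots, b.pixelShaderInstructionSlots);
    c.temporaryRegisters  = std::min(a.temporaryRegisters, b.temporaryRegisters);
    c.maxTextureDimension = std::min(a.maxTextureDimension, b.maxTextureDimension);
    c.maxAnisotropy       = std::min(a.maxAnisotropy, b.maxAnisotropy);
    c.maxRenderTargets    = std::min(a.maxRenderTargets, b.maxRenderTargets);
    c.dynamicFlowControl  = a.dynamicFlowControl && b.dynamicFlowControl;
    return c;
}

// Values taken from the tables in Sections 4.2 and 4.3.
const GpuCaps kGma950_3150 = { 96, 12, 2048, 4, 1, false };
const GpuCaps kGma500_600  = { 512, 32, 4096, 16, 4, true };
```

With these inputs the common set equals the Intel GMA 950/3150 values, which matches the tip above.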
3.6 Detecting GPUs
A short sample that demonstrates a way to detect the primary graphics device present in a system is available on Intel's Visual Computing Developer Community: http://software.intel.com/en-us/articles/gpu-detect-sample. The source code determines the primary graphics device based on the Vendor ID and Device ID.
Tip: Use proper GPU detection code to automatically set default feature levels.
The source code mentioned above can easily be extended for non-Intel hardware.
Since Intel® GMA 950/3150 and Intel® GMA 500/600 are built on different core technologies, it is important to identify which class of graphics hardware is present.
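The classification step can be sketched independently of the sample mentioned above. The code below is not taken from that sample; the types and function are hypothetical, and the device ID tables are deliberately left for the caller to fill from a verified list. Only the vendor ID 0x8086, the PCI vendor ID assigned to Intel, is hard-coded.

```cpp
#include <cstdint>
#include <set>

enum class GpuClass { Gma500_600, Gma950_3150, OtherIntel, NonIntel };

const std::uint16_t kIntelVendorId = 0x8086;  // PCI vendor ID for Intel

// Hypothetical lookup tables, filled by the caller from a verified ID list.
struct DeviceIdTables
{
    std::set<std::uint16_t> gma500_600;
    std::set<std::uint16_t> gma950_3150;
};

// Maps a Vendor ID / Device ID pair to one of the two graphics classes.
GpuClass ClassifyGpu(std::uint16_t vendorId, std::uint16_t deviceId,
                     const DeviceIdTables& tables)
{
    if (vendorId != kIntelVendorId)
        return GpuClass::NonIntel;
    if (tables.gma500_600.count(deviceId) != 0)
        return GpuClass::Gma500_600;
    if (tables.gma950_3150.count(deviceId) != 0)
        return GpuClass::Gma950_3150;
    return GpuClass::OtherIntel;
}
```

The result can then select default feature levels, falling back to conservative settings for the OtherIntel and NonIntel cases.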
4 Quick Tips: Graphics Performance Tuning
4.1 Primitive Processing
Intel® GMA 500/600 supports both Hardware Vertex Processing (HWVP) and Software
Vertex Processing (SWVP). However, Intel GMA 950/3150 only supports Software Vertex Processing.
For some workloads on Intel GMA 500/600, CPU vertex processing may offer even
greater performance enhancements. For this reason, it is recommended to use D3DCREATE_PUREDEVICE during device creation. This allows software processing to be
enabled based on performance that is determined by the specific configuration, workload, and Intel integrated graphics capability. However, Intel GMA 950/3150 does not support D3DCREATE_PUREDEVICE.
Tip: See Section 4.6 for details on how to create a Microsoft DirectX* 9 Device that
will properly use HWVP and SWVP as needed.
4.1.1 Vertex Capabilities
Capability | Intel® GMA 950/3150 | Intel® GMA 500/600
Max Primitive Count | 64K | 1.3 million (64K DX9 limit)
Max Vertex Index | 64K | 16.7 million
Vertex Processing | Software | Hardware
4.1.2 Tips On Vertex/Primitive Processing
1. Use IDirect3DDevice9::DrawIndexedPrimitive (DirectX* 9)
a. The vertex cache size will increase over time and can be discovered using D3DQUERYTYPE_VCACHE.
2. Ensure adequate batching of primitives to amortize runtime and driver overhead.
a. Maximize batch sizes, in general bigger is better.
b. Minimize render state changes between batches to reduce the number of
pipeline flushes.
c. Use instancing to enable better vertex throughput, especially for small batch
sizes. This also minimizes state changes and Draw calls.
3. Use static vertex buffers as much as possible.
4. Do as much CPU side clipping as possible. Use visibility tests to reject objects that
fall outside the view frustum to reduce the impact of objects that are not visible.
a. Set D3DRS_CLIPPING to FALSE for objects that do not need clipping.
5. For Intel GMA 500/600, it is more important to sort by state than sort by distance
from the camera.
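The advice on batching and sorting by state can be sketched without any graphics API. In the example below, DrawItem and its stateKey field are hypothetical; the key is assumed to encode whatever combination of shader, texture, and render state the application uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical draw record: 'stateKey' identifies the pipeline state needed.
struct DrawItem
{
    std::uint32_t stateKey;
    float         distance;  // distance from the camera
};

// Orders draws so that equal state keys are adjacent; distance only breaks ties.
void SortByState(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  if (a.stateKey != b.stateKey)
                      return a.stateKey < b.stateKey;
                  return a.distance < b.distance;
              });
}

// Counts how many times the state changes between consecutive draws.
int CountStateChanges(const std::vector<DrawItem>& items)
{
    int changes = 0;
    for (std::size_t i = 1; i < items.size(); ++i)
        if (items[i].stateKey != items[i - 1].stateKey)
            ++changes;
    return changes;
}
```

Comparing CountStateChanges before and after sorting gives a quick estimate of how many state changes a frame can save.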
4.2 Shader Capabilities
Capability | Intel® GMA 950/3150 | Intel® GMA 500/600
Vertex Shader Model | 2.0 (Software) | 3.0
Pixel Shader Format | 2.0 | 3.0
Dynamic Flow Control | No | Yes
Predication | No | Yes
Number of Instruction Slots (Pixel Shader) | 96 | 512
Number of Temporary Registers | 12 | 32
4.2.1 Tips on Shader Capabilities
1. Use programmable shaders over fixed functions as much as possible. For
example, use shader-based fog instead of fixed function fog.
2. Do not use dynamic flow control or predication. Intel® GMA 950/3150 do not
support these features, and they can quickly be a bottleneck on Intel GMA
500/600. Static flow control, such as execution depending on uniform variables, is
supported but make sure to validate performance.
3. Favor per vertex calculations over per pixel calculations. For example, use per
vertex lighting instead of per pixel lighting.
4. Keep pixel shaders as short and simple as possible.
5. Balance texture samples and shader complexity.
a. Texture samples are executed in parallel with shader execution. For best results, have a high ratio of ALU instructions (math operations) per texture sample.
b. Although large shaders can be supported via cache structure, it is important to be aware of the limited number of registers available; running out of them can reduce the efficiency of the execution units.
6. Space texture sampling calls away from where they are used in pixel shaders
when possible and practical.
7. Optimize your shader performance by adequate use of your integrated graphics:
a. Reduce the use of macro/transcendental functions where possible.
Instructions like LOG, LIT, ARL, POW, EXP, INV, RSQ, SQRT, SIN, COS, SINCOS, etc. are more expensive, particularly for full screen effects.
b. In general, use full precision for non-transcendental instructions.
8. The following common shader effects typically affect performance and should be tested for performance and optimization. Pay special attention to full screen post-processing effects, including per-pixel and multiple pass techniques, when evaluating graphics related performance bottlenecks.
a. Glow/Bloom
b. Motion Blur
c. Depth of Field
d. HDR/Tone Mapping
e. Heat Distortion
f. Atmospheric Effects
g. Dynamic Ambient Occlusion
4.3 Texture Sample and Pixel Operations
Gfx Arch | Intel® GMA 950/3150 | Intel® GMA 500/600
Format Support | 16/32-bit fixed point | 16/32-bit fixed point, 16/32-bit floating point operations
Max # of Samples | Up to 8 | Up to 8
Vertex Textures | No (needs Shader Model 3) | No (needs Shader Model 3)
Max 2D/3D/Cube Textures Dimension | 2K/256/512 | 4K/4K/512
Filtering Type Support | Bilinear, Trilinear, and Anisotropic (max 4) | Bilinear, Trilinear, and Anisotropic (max 16)
Texture Compression | DX9: DXT1/3/5 | DX9: DXT1/3/5
Non Power of 2 Textures | Yes | Yes
Render to Texture | Yes | Yes
Multi-Sample Render (MSAA) | No | No
Multi-Target Render | No | Max=4
Max Texture Dimension | 2048 | 4096
4.3.1 Tips on Texture Sampling / Pixel Operations
1. Use compressed textures and mipmaps.
2. Minimize the use of large textures even though the architecture supports up to
2K×2K. For optimal performance, use texture sizes that are 256x256 or less.
3. Minimize the use of Trilinear and Anisotropic Filtering
a. Utilize a type of filtering based on the usage in a scene rather than using it
everywhere.
4. Do not use floating point textures. Intel® GMA 950/3150 do not support them, and they can quickly become a bottleneck on Intel GMA 500/600.
5. Minimize the number of Clear calls.
a. Clear surfaces, color and Z/Stencil buffer at the same time when required.
6. Minimize lock/blit of Z and/or stencil buffer to minimize bandwidth impact.
7. On Intel GMA 950/3150, use shadow maps instead of stencil shadows, because stencil shadows are fill-intensive.
8. Multi-texture rendering is better than multi-pass rendering since multi-texture
rendering reduces state changes, driver overhead, and CPU load. In addition, Intel
integrated graphics utilizes main system memory for graphics. The intermediate
pixels computed in a multi-pass rendering need to be transported back to main
memory and then back to the graphics subsystem when needed again, causing a
full round trip over the bus per render target for each pass.
Tip: For Intel® GMA 950/3150, if the “Texture 2x2” state override in the Intel® GPA System Analyzer HUD shows a significant performance increase, the texture samplers
are likely a bottleneck. See Section 5 for more details.
4.4 Managing Constants on Microsoft DirectX*
Constants are external variables passed as parameters to the shaders; their values remain “constant” during each invocation of the shader program. Despite their name, constants are one of the most frequently changing values in a Microsoft DirectX*
application. A shader program can initialize a constant variable statically to a value in the shader file or at runtime through the application.
Most of the recommendations described here are not completely new and may have been described elsewhere. However, they are still very much applicable to Intel integrated graphics, and this section attempts to present them in a cohesive manner. In addition to these points, it is worth noting that:
1. The driver optimizes access to the most frequently used constants. Use fewer than 32 constants to achieve the highest performance gain from this feature. Limit the use of dynamically indexed constants (C[ax], C[r]) as these cannot be optimized by the driver, causing high latency in shaders. These constants are normally found in vertex shaders.
2. Higher performance is obtained with local constants over global constants.
3. Immediate constants provide better performance than dynamically indexed constants. With dynamically indexed constants, the driver cannot determine the index value a priori and needs to create a full-size constant buffer space in memory instead of using the hardware constant buffer.
4. To take advantage of the optimization, limit the use of global constants and the
use of dynamically indexed constants C[ax] as these skip the Intel integrated
graphics optimization algorithm within the Intel driver.
4.5 Graphics Memory
Integrated graphics will continue to use the Unified Memory Architecture (UMA) and Dynamic Video Memory Technology (DVMT). As with past integrated graphics solutions, UMA specifies that memory resources can be used for video memory when needed. DVMT is an enhancement of the UMA concept, wherein the optimum amount of memory is allocated for balanced graphics and system performance.
DVMT ensures the most efficient use of available memory, regardless of frame buffer or main memory size, for balanced 2D/3D graphics performance and system performance. DVMT dynamically responds to system requirements and application demands by allocating the proper amount of display, texturing, and buffer memory after the operating system has booted. For example, a 3D application when launched may require more vertex buffer memory to enhance the complexity of objects or more texture memory to enhance the richness of the 3D environment. The operating system views the Intel graphics driver as an application. The driver uses Direct AGP, a high-speed mechanism that lets the graphics controller communicate directly with system memory, to request allocation of additional memory for 3D applications, and it returns the memory to the operating system when it is no longer required.
4.5.1 Resource Management
Allocate surfaces in priority order. The render surfaces that will be used most frequently should be allocated first.
The 3D engines' performance is dependent on the memory bandwidth available. Systems that have more bandwidth available will outperform systems with less
bandwidth. The engines' performance is also dependent on the core clock frequency. The higher the frequency, the more data is processed.
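As a rough illustration of allocating in priority order, the sketch below sorts pending surface requests so that the most frequently used ones are created first. The SurfaceRequest structure and its usesPerFrame estimate are hypothetical bookkeeping, not part of Direct3D; the actual allocation call (for example IDirect3DDevice9::CreateRenderTarget) is left to the application.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a surface the application intends to create.
struct SurfaceRequest {
    std::string name;       // label for debugging only
    unsigned width;
    unsigned height;
    unsigned usesPerFrame;  // estimated number of times the surface is used per frame
};

// Returns the requests ordered so the most frequently used surfaces come first.
// std::stable_sort keeps the original order among equally used surfaces.
std::vector<SurfaceRequest> OrderForAllocation(std::vector<SurfaceRequest> requests)
{
    for (const SurfaceRequest& r : requests) {
        assert(r.width > 0 && r.height > 0);  // a surface needs non-zero dimensions
    }
    std::stable_sort(requests.begin(), requests.end(),
                     [](const SurfaceRequest& a, const SurfaceRequest& b) {
                         return a.usesPerFrame > b.usesPerFrame;
                     });
    return requests;
}
```

The application would then walk the returned list from front to back and create each surface in turn.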
Tip: On Microsoft DirectX* 9, use D3DPOOL_DEFAULT for lockable memory (dynamic
vertex/index buffers).
Tip: On Microsoft DirectX* 9, use D3DPOOL_MANAGED for non-lockable memory
(textures, back buffers, etc).
Tip: Proper texture compression can greatly improve the utilization of memory bandwidth. The texture sampler has dedicated hardware for decompressing known texture formats (DXT1, DXT2, DXT3, etc.).
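To give a sense of the savings involved, the following sketch compares the storage for one mip level of an uncompressed 32-bit texture with the same level in a DXT format. It relies on the DXT block layout (4x4 texel blocks, 8 bytes per block for DXT1 and 16 bytes per block for DXT2 through DXT5); the function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for one mip level of an uncompressed 32-bit texture (4 bytes per texel).
std::uint64_t UncompressedBytes(std::uint32_t width, std::uint32_t height)
{
    assert(width > 0 && height > 0);
    return static_cast<std::uint64_t>(width) * height * 4;
}

// Bytes for one mip level of a DXT-compressed texture.
// DXT formats store 4x4 texel blocks: 8 bytes per block for DXT1,
// 16 bytes per block for DXT2 through DXT5.
std::uint64_t DxtBytes(std::uint32_t width, std::uint32_t height, bool isDxt1)
{
    assert(width > 0 && height > 0);
    const std::uint64_t blocksWide = (static_cast<std::uint64_t>(width) + 3) / 4;   // round up
    const std::uint64_t blocksHigh = (static_cast<std::uint64_t>(height) + 3) / 4;  // round up
    return blocksWide * blocksHigh * (isDxt1 ? 8 : 16);
}
```

For a 1024x1024 texture this gives 4 MiB uncompressed, 512 KiB as DXT1, and 1 MiB as DXT3, so each sample fetch moves correspondingly less data over the memory bus.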
4.5.2 Checking for Available Memory
The operating system manages memory for an application that uses Microsoft DirectX*. Video memory on Intel integrated graphics uses dynamically allocated DVMT (Dynamic Video Memory Technology), which means that graphics memory is allocated from main memory as needed.
Developers should consider DVMT memory as “local memory” in addition to any
“dedicated” memory. Memory checks that only supply the available amount of “dedicated” graphics memory do not supply an appropriate number for the integrated graphics. In many software queries for integrated graphics, “Non-Local Video Memory” will show as ZERO (0). That number should not be used to determine “AGP” or “PCI Express” compatibility.
As a result of the dynamic allocation of graphics memory performed by the integrated graphics (based on application requests), you need to ensure that you understand all of the memory that is truly available to the graphics device.
The Microsoft DirectX* SDK (June 2010) includes the VideoMemory sample code which demonstrates 5 commonly used methods to detect the total amount of video memory. Of these tests only GetVideoMemoryViaD3D9 and GetVideoMemoryViaDXGI work properly on Intel GMA 950/3150 and Intel GMA 500/600. All other methods return only the local/dedicated graphics memory and consequently are incorrect for integrated graphics. For more information, see the sample code: http://msdn.microsoft.com/en-us/library/ee419018(v=VS.85).aspx.
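The Direct3D 9 path in that sample is understood to rest on IDirect3DDevice9::GetAvailableTextureMem, called on a created device. The sketch below wraps that call and converts the result to MiB. It is written as a template purely so that it compiles without the Windows headers; FakeDevice is a hypothetical stand-in, and in an application the argument would be a real IDirect3DDevice9 pointer.

```cpp
#include <cassert>
#include <cstdint>

// Works with any device type exposing GetAvailableTextureMem(), the method
// that IDirect3DDevice9 provides. The template form is an assumption made
// for portability of this sketch.
template <typename Device>
std::uint32_t AvailableTextureMemMiB(Device* device)
{
    assert(device != nullptr);
    // Direct3D 9 documents this value as an estimate, so treat it as a
    // budget hint rather than an exact figure.
    const std::uint64_t bytes = device->GetAvailableTextureMem();
    return static_cast<std::uint32_t>(bytes / (1024u * 1024u));
}

// Hypothetical stand-in used only to exercise the helper without Direct3D.
struct FakeDevice {
    std::uint32_t bytes;
    std::uint32_t GetAvailableTextureMem() const { return bytes; }
};
```

Because DVMT allocates memory on demand, the reported figure can change while the application runs, so it is better suited to choosing an asset quality level at startup than to exact accounting.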
4.6 Creating a Microsoft DirectX* 9 Device for Intel® Atom™ Processor Graphics
The following code shows how to correctly initialize and detect Microsoft DirectX* 9 Software Vertex Processing (SWVP). This sample also shows how to switch to software vertex processing for the devices that support it, and conversely, hardware vertex processing for the devices that support that.
Tip: To determine the available graphics memory, use the GetVideoMemoryViaD3D9
method in the Microsoft DirectX* SDK VideoMemory sample code. GetVideoMemoryViaDXGI also works, but does not have support for Microsoft Windows* XP.
// pD3D, Adapter, DeviceType, hFocusWindow and pPresentationParameters are
// assumed to be set up by the application before this point.
HRESULT hr;
DWORD BehaviorFlags = 0;
IDirect3DDevice9* pDevice = NULL;
D3DCAPS9 Caps9;
ZeroMemory(&Caps9, sizeof(Caps9));
UINT nMinRequiredVertexShaderLevel = yourMinimumVSLevel; // e.g. D3DVS_VERSION(3,0)
UINT nMinRequiredPixelShaderLevel = yourMinimumPSLevel;  // e.g. D3DPS_VERSION(2,0)

// Clear any vertex processing flags
BehaviorFlags &= ~(D3DCREATE_HARDWARE_VERTEXPROCESSING |
                   D3DCREATE_MIXED_VERTEXPROCESSING |
                   D3DCREATE_SOFTWARE_VERTEXPROCESSING);

// We'll try to get a 'PURE' hardware device first
BehaviorFlags |= D3DCREATE_PUREDEVICE;
hr = pD3D->CreateDevice(Adapter,
                        DeviceType,
                        hFocusWindow,
                        BehaviorFlags | D3DCREATE_HARDWARE_VERTEXPROCESSING,
                        pPresentationParameters,
                        &pDevice);
if(D3D_OK == hr)
{
    // NOTE: We're using pDevice->GetDeviceCaps and not pD3D->GetDeviceCaps
    hr = pDevice->GetDeviceCaps(&Caps9);
}
if( (D3D_OK != hr)
    || (Caps9.VertexShaderVersion < nMinRequiredVertexShaderLevel)
    || (Caps9.PixelShaderVersion < nMinRequiredPixelShaderLevel) )
{
    // We didn't get a suitable 'PURE' hardware device. Release it if it
    // was created, then clear the flag and try mixed vertex processing.
    if(pDevice != NULL)
    {
        pDevice->Release();
        pDevice = NULL;
    }
    BehaviorFlags &= ~D3DCREATE_PUREDEVICE;
    hr = pD3D->CreateDevice(Adapter,
                            DeviceType,
                            hFocusWindow,
                            BehaviorFlags | D3DCREATE_MIXED_VERTEXPROCESSING,
                            pPresentationParameters,
                            &pDevice);
    if(D3D_OK == hr)
    {
        hr = pDevice->GetDeviceCaps(&Caps9);
    }
    if( (D3D_OK != hr)
        || (Caps9.VertexShaderVersion < nMinRequiredVertexShaderLevel)
        || (Caps9.PixelShaderVersion < nMinRequiredPixelShaderLevel) )
    {
        // Mixed vertex processing was not suitable either. Release the
        // device if it was created and fall back to software vertex processing.
        if(pDevice != NULL)
        {
            pDevice->Release();
            pDevice = NULL;
        }
        hr = pD3D->CreateDevice(Adapter,
                                DeviceType,
                                hFocusWindow,
                                BehaviorFlags | D3DCREATE_SOFTWARE_VERTEXPROCESSING,
                                pPresentationParameters,
                                &pDevice);
        if(D3D_OK == hr)
        {
            hr = pDevice->GetDeviceCaps(&Caps9);
            if( (D3D_OK != hr)
                || (Caps9.PixelShaderVersion < nMinRequiredPixelShaderLevel) )
            {
                // Minimum specs for this application are
                // higher than this system can handle.
                // Exit this application gracefully...
                pDevice->Release();
                pDevice = NULL;
                hr = E_FAIL;
            }
        }
    }
}
5 Performance Analysis with Intel® Graphics Performance Analyzers
The Intel® Graphics Performance Analyzers (Intel® GPA) were created with the goal
of making a great Microsoft DirectX* tool that would provide all the information needed to analyze frame captures and improve graphics performance on Intel graphics hardware.
There are four major components to Intel GPA:
- Intel® GPA Monitor – See Section 5.1
- Intel® GPA System Analyzer HUD (Heads-Up Display) - See Section 5.2
- Intel® GPA Frame Analyzer – See Section 5.3
- Intel® GPA Platform Analyzer – See Section 2.6.3
Intel GPA will work on most Microsoft DirectX* graphics parts including Intel GMA 950/3150. However, at this time Intel GPA does not support tile-based rendering like Intel GMA 500/600. For Intel GMA 500/600, there are tools provided by Imagination Technologies* for POWERVR*-based graphics such as PVRTrace* and PVRTune*.
For more information on Intel GPA, visit: http://software.intel.com/en-us/articles/intel-gpa/.
5.1 Intel® GPA Monitor
Intel® GPA Monitor connects Intel GPA to an application (locally or on a remote
computer), and enables the configuration of the Intel GPA System Analyzer HUD mode and hot keys.
On most Intel® Atom™ processor-based devices, the analysis tools must be run on a separate machine. The Intel GPA Monitor can be configured to connect to any
Microsoft DirectX* application or launch a specific application.
Note: Intel® GPA does not support Intel® GMA 500/600. For more information
about POWERVR* and PVRTune*, check out the Imagination Technologies POWERVR
Insider* SDK at: http://www.imgtec.com/powervr/insider/powervr-sdk.asp.
Tip: For remote analysis, start Intel® GPA Monitor on both the target machine (e.g. Intel® Atom™ processor-based device) and the host machine. Start the analysis tool
(Intel GPA Frame Analyzer or Intel GPA Platform Analyzer) on the host machine and enter the target machine's IP address.
Review the Quick Start Guide included with Intel GPA for more information on using the Intel GPA Monitor.
5.2 Intel® GPA System Analyzer HUD
Intel® GPA System Analyzer HUD (Heads-Up Display) displays application
performance metrics in real time, overlaid on Microsoft DirectX* applications. This tool provides high-level performance profiling of graphics applications, in order to determine whether the application is CPU-bound or GPU-bound. If the application is GPU-bound, there is a hotkey to capture a GPU frame for detailed analysis by the Intel® GPA Frame Analyzer. If the application is CPU-bound, there is a hotkey to
capture a trace file for detailed analysis by the Intel® GPA Platform Analyzer. Press Ctrl-F1 in the HUD to see the hotkey list.
Note: Use Intel® GPA System Analyzer HUD to capture frames. For most Intel® Atom™
processor-based devices, Intel GPA Frame Analyzer must be run on a separate machine and connected over a network.
5.3 Intel® GPA Frame Analyzer
Intel® GPA Frame Analyzer provides a detailed view of a captured frame file, which contains all Microsoft DirectX* context used to render the selected 3D frame, as well
as per-draw call/region GPU metrics. This tool provides performance info from applications at the frame level, render target level, and draw call level. It enables detailed analysis and “what if” optimization experiments without the need to recompile or rebuild an application.
Note: Use Intel® GPA System Analyzer HUD to capture frames. For most Intel® Atom™
processor-based devices, Intel GPA Frame Analyzer must be run on a separate machine and connected over a network.
5.4 Diagnosing Performance Bottlenecks
At a very high level, the graphics stack includes a rendering system that takes
polygons, textures, and commands as input to display the resulting picture on an output device.
The graphics stack consists of the CPU, main memory, and the bus which delivers the visual payload of data to the Intel integrated graphics chipset. Several scenarios
involving these components can affect overall performance. Considering that each of these computational systems resides along a highway where data is flowing, the following could occur:
- If any of these channels are underutilized, the system may be underperforming in
terms of overall capacity to do more work.
- If any of these channels are overutilized, the system may be underperforming in
terms of capacity to keep the data moving fast enough.
For optimal performance, the application should maximize the performance of the graphics subsystem and operate the other channels optimally to keep the graphics subsystem continuously productive with minimal starving or blocking situations.
Tip: For Intel® GMA 950/3150, if the “Disable Draw Calls” override in the Intel® GPA System Analyzer HUD does not show a significant performance increase, the CPU is likely a bottleneck. This could be the application, graphics driver, or both.
If the application is CPU-bound, there is a hotkey to capture a trace file for detailed analysis by the Intel® GPA Platform Analyzer. See Section 2.6.3 for more information.
Tip: If decreasing the screen resolution doesn't increase the frame rate, it's likely that the application is CPU-bound, vertex processing bound, or limited by fixed function processing (e.g. clipping).
If the application is GPU-bound, there is a hotkey to capture a GPU frame for detailed
analysis by the Intel® GPA Frame Analyzer. See Section 5.3 for more information.
There are several overrides available to investigate possible GPU limitations. Here are
just a few suggestions:
Override             Significant Frame Rate Increase        No Change
-------------------  -------------------------------------  ----------------------------------
Disable Draw Calls   GPU-bound                              CPU-bound (application or driver)
Texture 2x2          Texture sampler or memory bandwidth    --
                     bound
Simple Pixel Shader  Probably pixel shader bound (possibly  If GPU-bound, investigate vertex
                     from texture sampling)                 processing or other fixed function
                                                            processing (e.g. clipping)
For more information about overrides in Intel GPA System Analyzer HUD, see the
documentation included with Intel GPA.
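The override table above can be read as a small decision procedure. The sketch below encodes it so that measured frame rates can be turned into a first-pass diagnosis. The 10% cut-off for a "significant" increase is an assumption chosen for illustration and is not an Intel GPA setting; the frame rates themselves would be read from the HUD with and without each override.

```cpp
#include <cassert>
#include <string>

enum class Override { DisableDrawCalls, Texture2x2, SimplePixelShader };

// True when the override raised the frame rate by more than `threshold`
// (a fraction of the baseline). The 10% default is an assumed cut-off.
bool IsSignificantIncrease(double baselineFps, double overrideFps, double threshold = 0.10)
{
    assert(baselineFps > 0.0 && overrideFps >= 0.0);
    return (overrideFps - baselineFps) / baselineFps > threshold;
}

// Maps an override result to the interpretation given in the table.
std::string Interpret(Override o, bool significantIncrease)
{
    switch (o) {
    case Override::DisableDrawCalls:
        return significantIncrease ? "GPU-bound"
                                   : "CPU-bound (application or driver)";
    case Override::Texture2x2:
        return significantIncrease ? "Texture sampler or memory bandwidth bound"
                                   : "No conclusion";
    case Override::SimplePixelShader:
        return significantIncrease
            ? "Probably pixel shader bound (possibly from texture sampling)"
            : "If GPU-bound, investigate vertex processing or fixed function processing";
    }
    return "Unknown override";
}
```

A typical session would start with Disable Draw Calls to separate CPU-bound from GPU-bound cases, and only then apply the texture and pixel shader overrides to narrow down a GPU limitation.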
6 Support
Intel® 64 and IA-32 Architectures Software Developer's Manuals:
http://www.intel.com/products/processor/manuals/
Intel's integrated graphics chipset development community forum:
http://software.intel.com/en-us/forums/developing-software-for-visual-computing/
Game programming resources:
http://software.intel.com/en-us/visual-computing/
Intel® Software Network:
http://software.intel.com/en-us/
Intel® Software Partner Program:
http://www.intel.com/software/partner/visualcomputing/
Intel® Visual Adrenaline graphics and gaming campaign:
http://www.intel.com/software/visualadrenaline/
Intel® Graphics Performance Analyzers (Intel® GPA):
http://software.intel.com/en-us/articles/intel-gpa/
Intel® Composer XE:
http://software.intel.com/en-us/articles/intel-composer-xe/
Intel® VTune™ Amplifier XE:
http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/
7 References
[1] “Copying and Accessing Resource Data (Direct3D 10)”. Direct3D Programming
Guide. Microsoft DirectX* SDK (November 2008).
[2] “DirectX* Constants Optimizations for Intel Integrated Graphics”. Intel
Software Network, Intel: http://software.intel.com/en-us/articles/directx-constants-optimizations-for-intel-integrated-graphics/.
8 Optimization Notice
Intel® compilers, associated libraries and associated development tools may include
or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are
reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler
Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code
and other factors, you likely will get extra performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel®
Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in
obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.
(Notice revision #20101101)