Microprocessor system architectures – IA32 advanced
features and rests
Jakub Yaghob
Multiple-processor management
  Mechanisms
    Support for atomic operations on system memory
    Serializing instructions
    APIC
    L2 and L3 caches
    Hyper-threading
  Aims
    Maintain system memory coherence
    Maintain cache coherence
    Predictable ordering of writes to memory
    Distribute interrupt handling among processors
    Increase system performance by exploiting multi-threaded OSs and applications
Locked atomic operations
  Three independent mechanisms
    Guaranteed atomic operations
    Bus locking using the LOCK# signal or the LOCK instruction prefix
    Cache coherency protocols ensuring cache coherency for atomic operations on cached data (cache lock) (Pentium Pro+)
Guaranteed atomic operations
  i486+
    R/W a byte
    R/W a word (2B) aligned on a word
    R/W a dword (4B) aligned on a dword
  Pentium+
    R/W a qword (8B) aligned on a qword
    R/W a word from/to uncached memory within 32-bit bus
  Pentium Pro+
    Unaligned word, dword, qword R/W from/to cached memory within a cache line
Bus locking
  Automatic locking
    XCHG with memory
    Setting the B (busy) flag of a TSS descriptor
    Updating descriptors (e.g. the A flag)
    Updating page tables
    Interrupt acknowledgement
  Software-controlled locking (LOCK prefix)
    Automatically assumed for XCHG
    BTS, BTC, BTR
    XADD, CMPXCHG, CMPXCHG8B
    INC, DEC, NOT, NEG, ADD, ADC, SUB, SBB, AND, OR, XOR
    LOCK with any other instruction raises a #UD exception (invalid opcode)
    Memory access can be unaligned
    Pentium Pro+ serializes locked operations
Self-modifying code
  Option 1
    Write modified code using data segment
    Jump to new code or an intermediate location
    Execute the new code
  Option 2
    Write modified code using data segment
    Execute a serializing instruction
    Execute the new code
  Required for Pentium Pro+
  Performance penalty
  Cross-modifying code
    One CPU changes the code and a second one executes it
    Synchronize CPUs and execute a serializing instruction
Memory ordering
  Program-ordering
    Alias strong-ordering
    R/W issued on the bus in the order they occur in the instruction stream under all circumstances
    i386
  Processor-ordering
    Alias speculative-ordering or weak-ordering
    Allows increased instruction execution speed, while maintaining memory coherency
    The exact behavior depends on a model; Pentium Pro+
  Pentium and i486
    They use processor-ordering
    In most cases they behave as program-ordered
    R miss goes ahead of W, when all buffered W are cache hits
    I/O always in the order of instruction stream (strong-ordering)
Processor-ordering I.
  Single-processor and WB memory
    R can be carried out speculatively and in any order
    R can pass buffered W, but the CPU is self-consistent
    W to memory are always carried out in program order, excluding instructions CLFLUSH, MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, MOVNTPD
    W can be buffered
    W are not speculative; performed only for really executed (retired) instructions
    Data from buffered W can be passed to waiting R within the CPU
    R/W cannot pass I/O, locked or serializing instructions
    R cannot pass LFENCE and MFENCE
    W cannot pass SFENCE and MFENCE
  Multiple CPUs
    Individual CPUs behave as single-processor
    Writes by a single CPU are observed in the same order by all CPUs
    Writes from the individual CPUs on the bus are NOT ordered with respect to each other
Processor-ordering II.
  „Fast string“ operation
    Pentium Pro+
    MOVS or STOS
    CPU works with cache lines
    Reads are not performed during cache line writes
    Interrupts only on the cache line border
  Conditions
    EDI and ESI aligned to 8B (PIII), EDI aligned to 8B (P4)
    Ascending order (DF=0)
    Initial counter ECX>=64
    Source and target must not overlap by less than one cache line (64B for P4+, 32B otherwise)
    Memory type WC or WB
Strengthening or weakening memory ordering
  Strengthening
    I/O instructions, locked instructions, LOCK and serializing instructions
    SFENCE (PIII+), LFENCE and MFENCE (P4+)
      SFENCE – all W finished before this instruction
      LFENCE – all R finished before this instruction
      MFENCE – all R and W finished before this instruction
    PAT (Page Attribute Table) strengthens ordering for pages (PIII+)
  Weakening or strengthening
    MTRR (Memory Type Range Registers) weaken or strengthen ordering for physical memory regions (Pentium Pro+)
Serializing instructions
  CPU finishes all changes to flags, registers and memory
  CPU drains all buffered W
  Pentium+
  Privileged instructions
    MOV CRx, MOV DRx, WRMSR, INVD, INVLPG, WBINVD, LGDT, LIDT, LTR
  Non-privileged instructions
    CPUID, IRET, RSM
  Non-privileged for memory ordering
    LFENCE, SFENCE, MFENCE
Propagation of page table entry changes
  „TLB shootdown“
  Simple method
    Send IPI to all CPUs
    Stop all CPUs excluding one (spin-lock)
    Active CPU makes the changes (invalidates page tables in memory) and resumes all CPUs
    All CPUs invalidate their TLB (selectively or all entries)
    All CPUs return from IPI
  Complicated and faster methods can be developed
    Different TLB mappings are not used on different CPUs during the update
    The OS must be prepared for a situation where CPUs use stale mapping during the update
MPS 1.4
  Multiprocessor Specification
  Controlled booting of multiple CPUs without dedicated HW
  HW can initiate a boot without a dedicated signal or a predefined boot CPU
  All IA-32 CPUs have the same boot protocol (including HT)
  Different mechanisms for different CPU models (P4 vs. older Xeon vs. newer Xeon)
  BSP = Bootstrap Processor
  AP = Application Processor
Detecting hyper-threading or multi-core
  Hardware Multi-Threading feature flag
    CPUID.1:EDX[28] = 1
  Logical processors per Package
    CPUID.1:EBX[23:16]
  Cores per Package
    Only when CPUID works with EAX=4, otherwise it has 1 core
    CPUID.(EAX=4,ECX=0):EAX[31:26]+1
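The bit fields above can be decoded from raw register values as in the following minimal sketch. It does not execute CPUID itself; the function name `decode_topology` and its parameters are hypothetical, and the register values are assumed to have been obtained elsewhere (for example from a small assembly stub). Treating the logical-processor count as 1 when the feature flag is clear is an assumption of this sketch.

```python
def decode_topology(cpuid1_ebx, cpuid1_edx, cpuid4_eax=None):
    """Decode HT / multi-core fields from raw CPUID register values.

    cpuid1_ebx, cpuid1_edx: EBX and EDX returned by CPUID with EAX=1.
    cpuid4_eax: EAX returned by CPUID with EAX=4, ECX=0, or None when
                CPUID does not support EAX=4 (the package then has 1 core).
    """
    # Hardware Multi-Threading feature flag: CPUID.1:EDX[28]
    htt = bool((cpuid1_edx >> 28) & 1)

    # Logical processors per package: CPUID.1:EBX[23:16]
    # (assumption: the field is only meaningful when the flag is set)
    logical = (cpuid1_ebx >> 16) & 0xFF if htt else 1

    # Cores per package: CPUID.(EAX=4,ECX=0):EAX[31:26] + 1
    cores = ((cpuid4_eax >> 26) & 0x3F) + 1 if cpuid4_eax is not None else 1

    return {
        "htt": htt,
        "logical_per_package": logical,
        "cores_per_package": cores,
    }


if __name__ == "__main__":
    # Made-up register values: HTT set, 2 logical processors, 2 cores.
    print(decode_topology(0x00020800, 1 << 28, 0x04000121))
```

When the logical-processor count equals the core count, each core runs a single instruction stream, so the package is multi-core without hyper-threading.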
Hyper-threading – I
  One core is able to execute 2 or more instruction streams
  Some parts of a core are private to each logical processor, some parts are shared among logical processors
Hyper-threading – II
  Private state of a logical processor
    General purpose registers EAX-ESP (RAX-RSP, R8-R15)
    Segment registers CS-SS
    EFLAGS and EIP (RIP)
    x87 (ST0-ST7), MMX (MM0-MM7), SSE (XMM0-XMM7/XMM15) and their control and status registers
    Control registers CRx, GDTR, IDTR, LDTR, IA32_EFER
    Debug registers DRx
    Time stamp
    Most of MSRs (including PAT)
    Local APIC
    Instruction TLB
  Shared state
    MTRR
    Data TLB
    Cache, the bus
    Some MSRs
Multi-Core
Programming MT-capable CPUs – I
  Requires support from OS
  Using PAUSE instruction in spin-lock
    Encoded as REP NOP
    Older IA-32 CPUs interpret PAUSE as NOP
    Older AMD CPUs do NOT understand it
  Using HLT
    Idle logical processor must use HLT and must not actively wait
  Using MONITOR/MWAIT
    SSE3, check CPUID.1:ECX[3] = 1, available only for CPL=0
    MONITOR sets up a memory range monitored for W
    MWAIT places the processor in an optimized state until a W to the monitored range occurs
Programming MT-capable CPUs – II
  Scheduling
    Dispatch tasks to logical processors 0 for all cores, then to logical processors 1, etc.
    Use thread affinity
  Do not measure the speed of a CPU by an active loop
  Each lock or semaphore should be placed in its own aligned 128B block of memory
APIC (Advanced Programmable Interrupt Controller)
  Local APIC
    Internal in CPUs
    Receives interrupts from the CPU’s interrupt pins, from internal sources and from an external I/O APIC
    Sends and receives IPIs (InterProcessor Interrupts)
  I/O APIC
    Part of a chipset
    Receives external interrupts and relays them to a local APIC
    Possibility of IPI distribution among CPUs
  xAPIC
    Newer architecture
    EXtended APIC
    P4 and Xeons
APIC – xAPIC
xAPIC system (P4 and Xeon)
APIC – „traditional“ APIC
APIC system (Pentium and Pentium Pro+)
Local APIC structure
Internal cache
Cache structure of P4 and Xeon
Characteristics of caches
  Cache type | Pentium/MMX | Pentium Pro+ | Pentium M, Core/Core2 | P4 and Xeon
  Trace cache | N/A | N/A | N/A | 12Kμops; 8wa
  L1 instruction | 8K; 2wa/16K; 32B; 4wa | 16K; 32B; 4wa | 32K; 64B; 8wa | N/A
  L1 data | 8K; 2wa/16K; 32B; 4wa | 8K; 2wa/16K; 32B; 4wa | 32K; 64B; 8wa | 8K; 64B; 4wa/16K; 8wa
  L2 common | external | 128K-2M; 32B; 4wa | <2M; 64B; 8wa/<4M; 64B; 16wa | 256K-2M; 64B; 8wa
  L3 common | N/A | N/A | N/A | Xeon 512K-4M; 64B; 8wa
  Instr TLB 4K | 32; 4wa/fa | 32; 4wa | 128; 4wa | 128; 4wa
  Data TLB 4K | 64; 4wa/fa | 64; 4wa | 128; 4wa/DTLB0:16, DTLB1:256; 4wa | 64; fa
  Instr TLB LP | ==ITLB4K | 2; fa | 2; fa/4; 4wa | Fragmented??
  Data TLB LP | 8; 4wa/==DTLB4K | 8; 4wa | 8; fa/DTLB0:16; DTLB1:32; 4wa | ==DTLB4K
  Store buffer | 2*1/4*4 | 12 | 16/20 | 24
  WC buffer | N/A | 4 | 6/8 | 6/8
Cache terminology
  Caches use the MESI protocol for maintaining coherency
  Cache line fill
    An operand is read from cacheable memory
    The entire cache line is read
  Cache hit
    An operand is in a cache
    An access uses a value from a cache
  Cache miss
    An operand is not in a cache
  Write hit
    If a valid cache line exists, CPU can write into the cache
    If a write misses a cache, cache line fill occurs
  Snooping
    CPU checks memory accesses on the bus against its cache lines
MESI
  Each cache line has 2 status bits
  Transparent for programs
  Instruction L1 has only SI
  Transition by snooping
    CPU detects W to a line it holds in the M state
    Cancels the transaction
    Writes the line directly to the other CPU, with a branch to the memory
    Moves the line to the I state

  Cache line status | M (Modified) | E (Exclusive) | S (Shared) | I (Invalid)
  Is it valid? | yes | yes | yes | no
  The memory copy is... | ...out of date | ...exact | ...exact | N/A
  Copies in other CPUs? | no | no | maybe | maybe
  W to this line... | ...does not go to the bus | ...does not go to the bus, moving to M | ...moving to E | ...goes directly to the memory
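The last row of the MESI table can be read as a small transition function. The sketch below is only a toy model of that one row, not a cache simulator; the names `WRITE_TRANSITION` and `write_to_line` are hypothetical. The table does not say whether a write to a Shared line uses the bus; the sketch assumes it does, since other copies may exist.

```python
# Effect of a write on a cache line, keyed by the line's current MESI state.
# Each value is (state after the write, whether the write appears on the bus).
WRITE_TRANSITION = {
    "M": ("M", False),  # does not go to the bus
    "E": ("M", False),  # does not go to the bus, moving to M
    "S": ("E", True),   # moving to E (bus use is an assumption of this sketch)
    "I": ("I", True),   # goes directly to the memory
}


def write_to_line(state):
    """Return (new_state, uses_bus) for a write to a line in `state`."""
    if state not in WRITE_TRANSITION:
        raise ValueError("unknown MESI state: %r" % (state,))
    return WRITE_TRANSITION[state]


if __name__ == "__main__":
    for s in "MESI":
        print(s, "->", write_to_line(s))
```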
Cache control
  CR0[CD]
    =0 – caching enabled for the whole of system memory, can be restricted for regions or pages
    =1 – caching disabled on Pentium, restricted on other CPUs
  CR0[NW]
    =0 – WB enabled, can be restricted
    =1 – WB disabled
  PCD and PWT in the page tables and directories
    Disable caching/WB for pages or page directories
  PCD and PWT in the CR3
    Disable caching/WB for page directories
  G in the page tables (Pentium Pro+)
    The TLB entry is not flushed during implicit flushing (task switch, mov cr3,eax)
  CR4[PGE] (Pentium Pro+)
    Enables G in page tables
  MTRR (Pentium Pro+)
    Memory types for regions of physical memory
  PAT (PIII+)
    Memory types for pages
Store buffers
  IA-32 temporarily stores each W to memory in a store buffer
  CPU continues without waiting for the memory or a cache
  Transparent for software
  Draining store buffers
    An interrupt or an exception
    Serializing instruction (Pentium Pro+)
    I/O operation
    LOCK operation
    BINIT operation (Pentium Pro+) (machine check)
    SFENCE instruction (PIII+)
    MFENCE instruction (P4+)
Memory types – an overview
  Pentium has UC, WT, WB
    Control using NW, CD
  UC- from PIII with PAT

  Memory type | Cacheable | Write cacheable | Speculative reads | Memory ordering model
  Strong Uncacheable (UC) | No | No | No | Strong
  Uncacheable (UC-) | No | No | No | Strong, can be overridden by WC in MTRR
  Write combining (WC) | No | No | Yes | Weak
  Write through (WT) | Yes | No | Yes | Speculative
  Write back (WB) | Yes | Yes | Yes | Speculative
  Write protected (WP) | Yes (R) | No | Yes | Speculative
Memory types – I
  Strong uncacheable (UC)
    The system memory is not cached
    All R/W have strong-ordering, no speculation
    Useful for memory-mapped I/O
    Greatly reduces system performance
  Uncacheable (UC-)
    Like UC, can be overridden to WC using MTRR
    Only PIII+ using PAT
  Write Combining (WC)
    The system memory is not cached
    No coherency protocol
    Speculative R enabled, W ordering is NOT ensured
    W delayed and combined in WC buffers
    Useful for video frame buffers
Memory types – II
  Write Through (WT)
    R/W from/to the system memory cached
    R comes from a cache on cache hit; cache line fills on cache miss; speculative R
    W writes to a cache and the main memory on cache hit; does not write to the cache on cache miss
    WC enabled
    Useful for video frame buffers or devices without snooping
  Write Back (WB)
    R/W from/to the system memory cached
    R comes from a cache on cache hit; cache line fills on cache miss; speculative R
    W is performed in the cache on cache hit and written back to the main memory later; cache line fill on cache miss
    Cache coherency protocol
  Write Protected (WP)
    R comes from a cache on cache hit; cache line fills on cache miss; speculative R
    W directly propagated on the system bus
MTRR (Memory Type Range Registers)
  Assigning memory types to the physical memory regions
  Checking MTRR presence using CPUID
  MSR R/O register IA32_MTRRCAP
    Support for fixed ranges
    Number of variable ranges (Pentium Pro+)
    Support for WC type
  Default type
    MSR IA32_MTRR_DEF_TYPE defines memory type for physical memory not covered by fixed and variable ranges
  Fixed ranges
    8 ranges of 64K size in the lowest 512K (00000000-0007FFFF)
    16 ranges of 16K size in the next 256K (00080000-000BFFFF)
    64 ranges of 4K size in the next 256K (000C0000-000FFFFF)
  Variable ranges
    An address belongs to range n when Address & PHYSMASKn = PHYSBASEn & PHYSMASKn
    When a variable range overlaps with a fixed range, the fixed range wins
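The fixed-range layout and the variable-range match rule can be sketched as a lookup over plain integers. This is a simplified model under stated assumptions: base and mask are passed as full physical addresses rather than in the MSR bit-field layout, overlapping variable ranges are not resolved (the first match is returned), and all function and parameter names are hypothetical.

```python
def fixed_range_index(addr):
    """Return (range_size, index) of the fixed range covering addr, or None."""
    if addr < 0x80000:                       # 8 ranges of 64K
        return 0x10000, addr // 0x10000
    if addr < 0xC0000:                       # 16 ranges of 16K
        return 0x4000, (addr - 0x80000) // 0x4000
    if addr < 0x100000:                      # 64 ranges of 4K
        return 0x1000, (addr - 0xC0000) // 0x1000
    return None


def variable_range_matches(addr, phys_base, phys_mask):
    """Address & PHYSMASKn == PHYSBASEn & PHYSMASKn."""
    return (addr & phys_mask) == (phys_base & phys_mask)


def lookup_memory_type(addr, fixed_types, variable_ranges, default_type):
    """Pick a memory type for a physical address.

    fixed_types: dict mapping (range_size, index) to a type string
    variable_ranges: list of (phys_base, phys_mask, type) tuples
    default_type: the type from IA32_MTRR_DEF_TYPE
    """
    key = fixed_range_index(addr)
    if key is not None and key in fixed_types:
        return fixed_types[key]              # the fixed range wins
    for base, mask, mem_type in variable_ranges:
        if variable_range_matches(addr, base, mask):
            return mem_type                  # first match only (simplification)
    return default_type


if __name__ == "__main__":
    fixed = {(0x4000, 8): "UC"}              # 000A0000-000A3FFF
    variable = [(0xE0000000, 0xFF000000, "WC")]
    print(lookup_memory_type(0x000A0000, fixed, variable, "WB"))
    print(lookup_memory_type(0xE0123000, fixed, variable, "WB"))
    print(lookup_memory_type(0x00200000, fixed, variable, "WB"))
```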
PAT (Page Attribute Table)
  Assigning memory type to the ranges of linear address space
  Checking PAT presence using CPUID
  MSR IA32_CR_PAT defines 8 types
  The type for a page is selected from IA32_CR_PAT by an index created from PAT(4), PCD(2), PWT(1) bits in page tables
  It is always switched on
  The initial setting after RESET is backward compatible with PCD and PWT – 2 * (WB, WT, UC-, UC)
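As a worked illustration of the index rule, the sketch below builds the index from the three page-table bits and looks it up in the initial table after RESET, taken from the "2 * (WB, WT, UC-, UC)" note. The names `DEFAULT_PAT`, `pat_index` and `page_memory_type` are hypothetical, and the table is modelled as a list of strings rather than as the packed MSR value.

```python
# Initial IA32_CR_PAT contents after RESET, entries 0..7: 2 * (WB, WT, UC-, UC).
DEFAULT_PAT = ["WB", "WT", "UC-", "UC"] * 2


def pat_index(pat, pcd, pwt):
    """Index into IA32_CR_PAT: PAT has weight 4, PCD weight 2, PWT weight 1."""
    return (pat << 2) | (pcd << 1) | pwt


def page_memory_type(pat, pcd, pwt, pat_table=DEFAULT_PAT):
    """Memory type selected for a page with the given PAT, PCD and PWT bits."""
    return pat_table[pat_index(pat, pcd, pwt)]


if __name__ == "__main__":
    for bits in [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 1)]:
        print(bits, "->", page_memory_type(*bits))
```

With the initial table, the PAT bit has no effect: entries 4 to 7 repeat entries 0 to 3, which is what keeps page tables written for PCD and PWT only behaving as before.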
Memory types restrictions
  If CR0[CD]=1, then caching is disabled
  If CR0[CD]=0, then caching is restricted using PAT (or PCD and PWT) and MTRR
  The most restrictive type is always selected
    WT „wins“ over WB
    WC „wins“ over WT and WB
Reset
  Sets a CPU to a well-known state
  CPU in the real mode
  Internal caches, TLB and BTB invalidated
  CPU model dependent behavior
    Pentium Pro+
      All CPUs start the initialization protocol, one of them is chosen as BSP and continues with OS initialization, all other APs halt and wait for an IPI („Wait for Startup“)
    i486 and Pentium
      HW knows which CPU is BSP, other APs halt and wait for SIPI
INIT
  Like RESET
  Internal caches, MSR, MTRR, x87, SSE do not change
  Move to the real mode
CPU state after RESET, INIT and power-up
  EFLAGS       00000002h  |  CR0        60000010h
  EIP          0000FFF0h  |  CRx        0
  CS           F000h      |  EAX, ...   0
    Base       FFFF0000h  |  EDX        00000mxxh
    Limit      FFFFh      |  STx        +0.0
  xS           0000h      |  x87 CW     0040h
    Base       00000000h  |  x87 SW     0000h
    Limit      FFFFh      |  x87 Tag    5555h
  GDTR, IDTR
    Base       00000000h  |  XMMx       0
    Limit      FFFFh      |  MXCSR      1F80h
  LDTR, TR     0000h      |  DRx        0
    Base       00000000h  |  DR6        FFFF0FF0h
    Limit      FFFFh      |  DR7        00000400h
Microcode update
  Pentium Pro+ has an interface for uploading a microcode block with patches to the CPU
  Microcode block is supplied by Intel directly to the BIOS vendors
  Microcode block has a header with CPU model specification
  The CPU model in the microcode header is checked against the current CPU
  A microcode must be uploaded before L2 is enabled, and many other constraints apply (e.g. segment limit exceeding)
Virtual machine extensions (VMX)
  Two classes of software
    Virtual machine monitor (VMM)
      Acts like a host
      Full control of HW
      Presents abstract HW to guests
    Guest software
      Guest software environment with OS and applications
Virtual-machine control data structure (VMCS) – I
  VMX non-root operation and VMX transitions are controlled by a VMCS
  Access through the VMCS pointer (one per logical CPU)
    Reading and changing the pointer using the VMPTRST and VMPTRLD instructions
  VMCS configuration using VMREAD, VMWRITE, VMCLEAR instructions
  VMM could use a different VMCS for each virtual CPU
  Each logical CPU associates a physical memory region (one 4KB frame) with each VMCS
Virtual-machine control data structure (VMCS) – II
  VMCS state
    Inactive
      After VMCLEAR
    Active
      Memory region after VMPTRLD
      Maintains CPU state
    Current
      VMPTRLD loads current VMCS
      VMLAUNCH, VMPTRST, VMREAD, VMRESUME and VMWRITE operate with current VMCS
Virtual-machine control data structure (VMCS) – III
  VMCS data
    Guest-state area
      CPU state is saved on VM exits and loaded from there on VM entries
    Host-state area
      CPU state is loaded on VM exits
    VM-execution control fields
    VM-exit control fields
    VM-entry control fields
    VM-exit information fields
Guest-state area
  Registers
    CR0, CR3, CR4
    RSP, RIP, RFLAGS
    CS, DS, ES, FS, GS, SS, LDTR, TR
      Selector and part of internal cache
    GDTR, IDTR
  MSRs
    IA32_DEBUGCTL, IA32_SYSENTER_CS, IA32_SYSENTER_ESP, IA32_SYSENTER_EIP
  Activity state
    Active, HLT, shutdown, wait-for-SIPI
  Interruptibility state
    Blocking by STI, MOV SS, NMI, SMI
  Pending debug exceptions
  VMCS link pointer
Host-state area
  Registers
    CR0, CR3, CR4
    RSP, RIP
    CS, DS, ES, FS, GS, SS, TR
    Base address for FS, GS, TR, GDTR, IDTR
  MSRs
    IA32_SYSENTER_CS, IA32_SYSENTER_ESP, IA32_SYSENTER_EIP
VM-execution control fields
  Pin-based VM-execution controls
    VM-exits on external interrupt or NMI
  CPU-based VM-execution controls
    Instructions and events causing VM-exits
  Exception bitmap
  I/O-bitmap addresses
  Guest/host masks and read shadows for CR0 and CR4
  CR3 target controls
    4 target addresses+counter
  CR8 access control
  MSR bitmap address
VM-exit control fields
  VM-exit controls
    Basic operation of VM-exit
  VM-exit controls for MSRs
    List of MSRs stored and loaded on VM-exit
VM-entry control fields
  VM-entry controls
    Basic operation on VM-entry
  VM-entry controls for MSRs
    List of MSRs to be loaded on VM-entry
  Event injection
    “Executed” before the first guest-mode instruction
    Interrupts, exceptions including error-code
VM-exit information fields
  Basic VM-exit information
    Exit reason, exit qualification
  Vectored events
    Interrupts, exceptions
  VM-exits during event delivery
  VM-exits due to instruction execution
    Instruction address, length, detailed information
VMXON region
  Physical memory region (4KB frame) for VMX operation
  Operand of VMXON instruction
Using VMCS
  VMCLEAR should be executed before VM-entry
  VMLAUNCH should be used for the first VM-entry using VMCS after VMCLEAR
  VMRESUME should be used for any subsequent VM-entry
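The ordering rules above can be pictured as a small state machine. The following is a toy model only: it does not touch VMX hardware, the class `VmcsModel` and its method names are hypothetical, and the state names "clear" and "launched" are labels chosen for this sketch. Real hardware reports a failed VM-entry through its own error mechanism, not through a Python exception.

```python
class VmcsModel:
    """Toy model of the order in which a VMCS may be used for VM-entry."""

    def __init__(self):
        self.launch_state = None        # unknown until VMCLEAR is executed

    def vmclear(self):
        # VMCLEAR should be executed before VM-entry.
        self.launch_state = "clear"

    def vmlaunch(self):
        # VMLAUNCH is for the first VM-entry after VMCLEAR.
        if self.launch_state != "clear":
            raise RuntimeError("VMLAUNCH needs a VMCS that was just cleared")
        self.launch_state = "launched"

    def vmresume(self):
        # VMRESUME is for any subsequent VM-entry.
        if self.launch_state != "launched":
            raise RuntimeError("VMRESUME needs a VMCS that was launched")


if __name__ == "__main__":
    vmcs = VmcsModel()
    vmcs.vmclear()
    vmcs.vmlaunch()     # first VM-entry
    vmcs.vmresume()     # later VM-entries
    print(vmcs.launch_state)
```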
VMX non-root operation
  Instructions which cause VM-exit
    Unconditionally: CPUID, INVD, MOV from CR3, all VMX instructions
    Conditionally: CLTS, HLT, IN/OUT, INVLPG, LMSW, MONITOR, MOV CR8, MOV to CR0, MOV to CR3, MOV to CR4, MOV DR, MWAIT, PAUSE, RDMSR, RDPMC, RDTSC, RSM, WRMSR
  Other causes
    Exceptions, interrupts, INIT signals, start-up IPI, task switches, system-management interrupts