microcomputer course koby gottlieb, 3/03, p-1 microcomputer course: x86 architecture basics koby...

Microcomputer CourseKoby Gottlieb, 3/03, p-1

Microcomputer Course:x86 Architecture Basics

Koby Gottlieb, Intel IsraelSpring Semester 2003


Administration

Participants

Grading 65% exam, 35% exercises All exercises are mandatory! No exercise => no final grade

Web page: http://www.ee.technion.ac.il/~ee044800/ Latest version of slides Pointers to additional info

Material

Contacts Koby Gottlieb [email protected] 04-8655718 George Leifman [[email protected]]


Objectives

Get acquainted with the X86 architecture and system components Historical perspective

Major features

SW/HW aspects

What you will not get here Expertise in X86 programming

Full details of everything

Expect Varying level of details

Hopefully: taste of more


Lecture 1: Introduction


History, Trends, & Terminology


Major Milestones in Microprocessors 1971 - 1975

Intel: 4004, 4 bit, 8008, 8 bit, 8080, 8 bit Motorola & other Still as a logic substitution

1975: Altair, 8080 based, 1st kit. Basic interpreter 1977: TRS80 of RadioShack

Apple, Apple-II 1980: VISICALC - 1st killer application

12 active companies 1981: IBM-PC, w/8088. MS-DOS, IBM PC/XT later Open standard! 1982: 80286 CPU 1984: IBM PC/AT, 80286 based 1985: i386 CPU 1987: i386 based PC 1989: i486 CPU, Windows! 1991: i486 PC, x86 competitors 1993: Pentium® processor 1994: Pentium based PC 1995: Pentium Pro processor 1996: Pentium Pro based PC Windows 95, Windows NT 1996: Pentium w/MMXTM Technology + PC 1997: Pentium-II® processor + PC, Cyrix MediaGX Did not mention buses, networking, etc...


Trends Phase-I (-1970): “Enter the game”

Technology progress

Phase-II (1970-1990): “Imitate the big guys”Use technology and copy ideas from mainframes Enlarge word size FP hardware Paging, Protection, Caching High integration

Phase-III (1990-): “Lead the game”Innovation unique to microcomputers: Super-scalar, out of order execution Branch prediction High frequency networking Graphics

Diverging again?Net-PC, thin client, connected PC, High integration, etc...


Representative Data

ScreenYear Xistors MIPS Memory Disk Screen

Memory___________________________________________________

1975 10K 0.2 <64K - Mono 2K

1985 275K 5.5 1M <10MB CGA 14” 64K

1995 5.5M 300 16M >500MB SVGA 17” 1MB___________________________________________________

Factor 20 40 16 50 - 16

1998 7.5M 600 32M 4GB “ 4MB


Performance Observations

Performance is doubled every 18-24 months

The “Spiral” principle (A. Grove):

New software is written to the fastest available processor. Actually, it takes the next generation of processors to run this software (reasonably).

There will always be applications that will use all available computing power.

<well, some people question that…>

"My nightmare is to wake up some day and not need any more computing power." Gordon Moore, Feb 1998


Computer System

Class focus CPU:

General architectureMajor components

System:Buses (CPU, ISA, PCI, USB) InterruptsMemory hierarchy

CPU

PCI BUS

Host/PCICacheBridge

Memory

ScsiAdap

ScsiBus

LanAdap

LAN

USBHub

KeyBoard,Monitor,Mouse,Speakers

etc..

GraphAdapt

VideoBuffer


Architecture & Microarchitecture

Architecture:“The science, art, or profession of designing and constructing building, bridges, etc...” Focus on how it looks externally Consider how it can be built and how good it performs!

Construction:“The way in which something is built, formed or devised by fitting parts or elements together systematically” Focus on how it is built/operated Make sure it works and works correctly!

Definition applicable also for buildings, cars, computers, etc...In computers: construction == architecture


Architecture & Microarchitecture (cont.)

Architecture:The collection of features of a processor (or a system) as they are seen by the “user”

User: a binary executable running on that processor, or

assembly level programmer

MicroArchitecture:The collection of features or way of implementation of a processor (or a system) do not affect the user.

Features which change Timing/performance are considered microarchitecture.


Architecture & Microarchitecture Elements

Architecture: Registers data width (8/16/32) Instruction set Addressing modes Addressing methods (Segmentation, Paging, etc...) Protection etc…

Architecture Bus width Physical memory size Caches & Cache size Number of execution units Execution Pipelines, Number & type execution units Branch prediction TLB etc...

Timing is considered Architecture (though it is user visible!)


Compatibility

Processor A (or system A) is compatible to processor B (or system B) if it preserves the same architecture features as seen by a user of B

SW compatible - most important for us! The ability to run existing SW!

HW compatible Upward compatible “Exactly” compatible (sometimes - “Downward” compatible)

(Required more for application, less for OS) Architecture Extension: addition of data-types, instructions,

etc… to enable better support of new/existing applications.e.g. MMX

Trends: 1985-1955: no/minimal architecture changes 1995- : we see them coming (MMX, SSE etc.)


Basic Architecture

Overview Data types Registers Stack Flags Segmentation Instructions:

Format Memory operands Instruction Set


IA (X86) evolution - Detailed

Year Proce-ssor

bits bus PhMem

MHZ MIPS FP Cache Prot Page Tran Tech P W Feature

1978 8086 16 16 1M 5/10 .3/.8 8087 - - - 29K 3u 40 2.5/1.0 New, 16 bit

1979 8088 16 8 1M 5/8 -"- 8087 - - - -"- 3u 40 2.5/1.0 8 bit bus

1982 80186 16 16 1M 80187 - - - 56K More integration

1982 80286 16 16 16M 8/12.5 1.2/2.7 80287 - + - 134K 68 3.3/1.1 Protection, 16M

1985 80386 32 16/32 4G 16/33 5.0/11.4 80387 - + + 275K 132 2.7 32 bit, paging

1989 i486 32 32 4G 25/100 20/80 on/487 8K + + 1.2M 1.2/.8 168 4.5 FP on chip,Cache, “RISC”

1993 Pentium 32 64 4/64G 66/200 112/400 on 2*8K + + 3.1M .8/.6 273 10/15 Dual Pipe, FPpipe, DP support

1995 P6 32 64 4/64G 150/200 400 on 2*8K+256K + + 5.5M 0.6 387 14-18 Dynamicexecution

1996 P55C 32 64 4/64G 200/233 466 on 2*16K + + >5M 0.6/0.35 273 8-10 MMX Technology

1997 PentiumII

32 64 4/64G 233/300 600 on 2*16K+512Kon SECC

+ + ~7.5M 0.35 242(SEC)

34/43

Each major generation is 2-3X faster than its predecessor!


Operation Modes

Architecture reflects this evolution!

Evolution => several modes: 16 bit:

Real Mode (8086 and up) Protected mode (286 and up) Virtual 86 mode (386 and up)

32 bit: Protected mode (386 and up) Paging enabled/disabled

Proce-ssor

RealMode

Prot-16 Prot-32

86 +

286 + +

386+ + + +


Philosophical Basis of First Wave Architecture

HW / Instruction level Support for high level languages

Instruction Set encoding for memory efficiency Two-Operand Instructions

reg-reg, reg-imm, reg-memory, read-modify-write

Specialized and Implicit Register Usage

Complex Instructions to maximize functionality / instruction fetch

Memory Segmentation Relocatable programs / modules well supported

Addressing modes suitable for high-level languages Stack and Data spaces in instruction set architecture

Push / Pop Addressing in Stack Space

Structure and Array element addressing syllables


Lecture 2: x86 Architecture Basics


Basic Data Types

BYTE

WORDHIGHBYTE

LOWBYTE

15 7 0

Address N+1

Address N

DOUBLEWORD

QUADWORD

HIGH WORD LOW WORD

Address N+3

Address N+2

Address N+1

Address N

31 15 0

LOW DOUBLEWORD

63 47 31 15 0

AddressN+7

AddressN+6

AddressN+5

AddressN+4

Address N+3

Address N+2

AddressN+1

Address N

Figure 3-2 Fundamental Date Types

BYTE

AddressN

7 0

HIGH DOUBLEWORD


Memory Layout Little Endian - LSB in lower addresses

Figure 3-3 Bytes, Words, Doublewords and Quadwords in Memory

7A

FE

06

36

1F

A4

23

0B

74

CB

31

0EH

0DH

0CH

0BH

0AH

9H

8H

7H

6H

5H

4H

3H

2H

1H

0H

QUADWORD AT ADDRESS 6CONTAINS 7AFE06361FA4230BH

DOUBLEWORD AT ADDRESS OAHCONTAINS 7AFE0636H

WORD AT ADDRESS 0BHCONTAINS FE06H

BYTE AT ADDRESS 9CONTAINS 1FH

WORD AT ADDRESS 6CONTAINS 230BH

WORD AT ADDRESS 2CONTAINS 74CBH

WORD AT ADDRESS 1CONTAINS CB31H


Extended Data Types7 0

15 0

031

07

15 0

...

...

N 0

N

47 31

0

0

0

...

...

Figure 3-4. Data Types

BYTE INTEGER7 BIT MAGNITUDE1 BIT SIGN

WORD INTEGER15 BIT MAGNITUDE1 BIT SIGN

DOUBLEWORD INTEGER31 BIT MAGNITUDE1 BIT SIGN

BYTE ORDINAL8 BIT MAGNITUDE

WORD ORDINAL16 BIT MAGNITUDE

DOUBLEWORD ORDINAL32 BIT MAGNITUDE

BCD INTEGER4 BIT DIGIT PER BYTE

PACKED BCD INTEGER4 BIT DIGIT PER HALF-BYTE

NEAR POINTER32/16 BIT OFFSET

FAR POINTER32/16 BIT OFFSET16 BIT SELECTOR

BIT FIELDUP TO 32 BITS

BIT STRINGUP TO 4 GIGABITS

BYTE STRINGUP TO 4 GIGABYTES

31 0

31 15 0Pointers w/ 16 Bit Offsets

031


Registers

8 “General” 32 bit registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP

Lowest 16 bits can be specified as AX, BX, ... Bits 0-7, 8-15 of first four registers can be addressed as bytes:

AL, AH, BL, BH, CL, CH, DL, DH

Special registers: eFLAGS, eIP Segment registers: CS, DS, ES, SS, FS, GS 8 Floating Point registers + FP control & status 8 MMX registers Various OS registers (trace, debug, control, status).

Not discussed here

* Throughout this presentation: eREG = REG / EREG, according to 16/32 bit mode


Basic Registers

Figure 3-5. Application Register Set

GENERAL REGISTERS

31 23 15 7 0

AH

DH

CH

BH

AL

DL

CL

BL

16-BIT

AX

DX

CX

BX

32-BIT

EAX

EDX

ECX

EBX

EBP

ESI

EDI

ESP

BP

SI

DI

SP

15 0SEGMENT REGISTERS

CS

SS

DS

ES

FS

GS

STATUS AND CONTROL REGISTERS31 0

EFLAGS

EIP


Floating Point

FPU DATA REGISTERSTAGFIELD

SIGNIFICANDEXPONENTSIGN

79 78 64 63 0 1 0FP7

FP6

FP5

FP4

FP3

FP2

FP1

FP0

CONTROL REGISTER

STATUS REGISTER

TAG WORD

15 0 47

INSTRUCTION POINTER

DATA POINTER

Figure 6-1. Floating-Point Unit Register Set


MMX (1)

Packed bytes (8x8 bits)

63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0

Packed word (4x16 bits)

63 48 47 32 31 16 15 0

Packed doublewords (2x32 bits)63 32 31 0

63 0

Quadword (64 bits)

Figure 2-3. Packed Data Types


MMX (2)

63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0

Memory address 1008h Memory address 1000h

Figure 2-4. Eight Packed Bytes in Memory (at address 1000H)

Figure 2-2. MMX Register Set

MM7

MM6

MM5

MM4

MM3

MM2

MM1

MM0

63 0


Streaming SIMD Extensions

Packed, single-precision FP (4x32 bits)

127 96 95 64 63 32 31 0

Streaming SIMD Extensions Register Set

XMM7

XMM6

XMM5

XMM4

XMM3

XMM2

XMM1

XMM0

127 0


Stack Related instructions: Push/Pop, Call/Ret Used in interrupt handling

TOP OF STACK

STACK SEGMENT31 0

SUBROUTINEPASSED

VARIABLES

EBP

ESP

BOTTOM OF STACK(INITIAL ESP VALUE)

PUSHES PUT THETOP OF STACK ATLOWER ADDRESSES

POPS PUT THE TOP OF STACK ATHIGHER ADDRESSES

Figure 3-8. Stacks


Flags

Affected by ALU operations, e.g.,: ADD, CMP

Used mainly for control flow: Jcc

But also by other instructions ADC, RLC, etc...

Table 3-2. Status FlagsNAME PURPOSE CONDITIONS REPORTED

OF OVERFLOW Result exceeds positive or negative limit of numbers range

SF Sign Result is negative (less then zero)

ZF Zero Result is zero

AF Auxiliary carry Carry out of bit position 3 (used for BCD)

PF Parity Low byte of resutl had even parity (even number of set bits)

CF Carry flag Carry out of most significant bit of result

FIGURE 3-9. EFLAGS Register

CF

1PF

0AF

0ZF

SF

TF

IF

DF

OF

IOPL

NT

0RF

VM

AC

VIF

VIP

ID

0000000000

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

X ID FLAG (ID)X VIRTUAL INTERRUPT PENDING (VIP)X VIRTUAL INTERRUPT FLAG (VIF)X ALIGNMENT CHECK (AC)X VIRTUAL 8086 MODE (VM)X RESUME FLAG (RF)X NESTED TASK (NT)X I/O PRIVILEGE LEVEL (IOPL)S OVERFLOW FLAG (OF)C DIRECTION FLAG (DF)X INTERRUPT ENABLE FLAG (IF)X TRAP FLAG (TF)S SIGN FLAG(SF)S ZERO FLAG (ZF)S AUXILIARY CARRY FLAG (AF)S PARITY FLAG (PF)S CARRY FLAG (CF)

S INDICATES A STATUS FLAGC INDICATES A CONTROL FLAGX INDICATES A SYSTEM FLAG

BIT POSITIONS SHOWN AS O OR 1 ARE INTEL RESERVED.DO NOT USE. ALWAYS SET THEM TO THE VALUE PREVIOUSLY READ.


Segmentation (1) Problem: use “large” memory with short word width

16 bits can cover only 64KB

Solution: address space divided into “segments” Logical address is segment:offset Linear address: address in the “flat” address range Segmentation: mapping from Logical to Linear

Segment base = f(segment register) offset < segment size: 64K (8086/286), 4G (386+) linear address = segment_base + offset

At any time. there are 4 (8086) or 6 (386 and on) active segment registers CS: code DS: data SS: stack ES: extra FS, GS: new extra

Rough analogy - phone directory: region and a number within


Segmentation (2)

Figure 3-7. A Segmented Memory

CSSS

DSES

FSGS

DIFFERENT LOGICAL SEGMENTSDIFFERENT ADDRESS SPACE

IN PHYSICAL MEMORY

CODESEGMENT

STACKSEGMENT

DATASEGMENT

DATASEGMENT

DATASEGMENT

DATASEGMENT

Figure 3-6. An Unsegmented Memory

DIFFERENT LOGICAL SEGMENTS

GSFSESDSCSSS

ONE PHYSICAL ADDRESS SPACE


Instruction Set Architecture

Data transfer Common ALU operations (add, sub, cmp...) Control flow (jmp, call, jcc) String (movs, cmps, sets - using eSI, eDI, eCX) FP instructions (also SSE variant) MMX instructions OS support (task switch, call gates, table handling, etc.) I/O instructions

“Exotic” instructions (DAA, XLAT, etc.)


Data Transfer Instructions MOV Byte / Word / Dword, register to / from register / memory

Also Zero/Sign extended moves

STACK OPERATIONS PUSH [SOURCE]

SP = SP - 2 ; ( SS : SP ) = SOURCE ; (16-bit)

ESP = ESP - 4 ; ( SS : ESP ) = SOURCE ; (32-bit)

POP [DEST]

DEST = ( SS : SP ) ; SP = SP + 2; (16-bit)

DEST = ( SS : ESP ) ; ESP = ESP + 4; (32-bit)

I/O instructions IN (I/O port specified by 8-bit immediate or DX)

EAX / AX / AL = Dword / Word / Byte from I/O port

OUT (I/O port specified by 8-bit immediate or DX)

I/O Port = Dword / Word / Byte from EAX / AX / AL


ALU and Shift Instructions

ALU Usual ALU ops on Byte / Word / Dword operands Integer Multiply and Divide Instructions

Multiply and Divide use implicit register pairs Chained add / subtract for multi-precision arithmetic Conversion ops for packed and unpacked decimal math

SHIFTS and ROTATES Logical and arithmetic shifts Rotates through / not through C bit By constant (1) or by count in CL

Later extended to immediate operand in 80386


Control Flow

Assortment of conditional jumps Jcc Byte/Word/Dword displacement CS not modified, eIP = eIP + signed displacement *

JMP SHORT - one byte displacement NEAR - two/four byte displacement FAR - Across Segments (Loads CS : eIP)

CALL Like JMP Pushes next eIP in SHORT and NEAR forms Pushes CS : eIP in FAR forms

RET (Return) Pops into eIP in NEAR form Pops into CS : eIP in FAR form

* eIP = IP / EIP, according to 16/32 bit mode

Add examples


String Instructions String operation components

Opcode specifying ES:eDI op DS:eSI Typically preceded by REP (Repeat) Prefix

for ( ; Repeat Condition; eCX -- )

{

ES : eDI op DS : eSI

eDI = eDI + d*oprnd_size ; eSI = eSI + d*oprnd_size

}

d = 1 or -1 Value selected by Direction Flag

OP can be move, compare, find … oprnd_size is one of 1/2/4

Repeat Condition is test of Zero Flag Can be test if set or test if clear Zero Flag set by CX == 0 or by op (CMPS)

* eREG = REG / EREG, according to 16/32 bit mode


Small ExamplesRepeat Mov

struct str {char c[1000];}src,dst;void move()

{ dst=src;}

COMM_src:BYTE:03e8HCOMM_dst:BYTE:03e8H...; 2 : {

push ebpmov ebp, esppush esipush edi

; 3 : dst=src;mov ecx, 250mov esi, OFFSET FLAT:_srcmov edi, OFFSET FLAT:_dstrep movsd

; 4 : }pop edipop esipop ebpret 0

Mov, ALU, Jmp

int array[100];int sum=0;

void sum_array(){ register int i;

for(i=0;i<100;i++)sum+=array[i];

}

COMM_array:DWORD:064H_sum DD 01H DUP (?); 5 : register int i;; 6 : for(i=0;i<100;i++)

mov eax, OFFSET FLAT:_array$L28:; 7 : sum+=array[i];

mov ecx, DWORD PTR [eax]add eax, 4add DWORD PTR _sum, ecxcmp eax, OFFSET FLAT:_array+400jl SHORT $L28

; 8 : }


Instruction Format (1) Variable length instructions! Remember - only 2 explicit operands - source and destination

may be the same Machine language

Zero or more prefixes Opcode Mod/Rm Displacement Immediate

8/16/32 Handling Instruction encoding differs for 8-bit and 16/32 bit operands Same instruction encoding is used for 16 and 32 bit operands Current code segment defines if processor runs 16 or 32 bit code In 16 bit code “word” operands indicate 16 bit operand,

In 32 bit code “word” operands indicate 32 bit operand. This can be toggled by the operand-size prefix (0x66)


Instruction Format (2)

Figure F-1. Core Instruction Format

d32|16|8| none d32|16|8|none{ {TTTTTTTT TTTTTTTT mod TTT r/m ss index base

7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 7-6 5-3 2-0 7-6 5-3 2-0

{ { {{

opcode(one or two bytes)(T represents an

opcode bit.)

mod r/mbyte

s-i-bbyte

register and addressmode specifier

addressdisplacement(4, 2, 1 bytes

or none)

immediatedata

(4, 2, 1 bytesor none)

Figure 25-1. Instruction Format

1 OR 2 0 OR 1 0 OR 1 0,1,2 OR 4

NUMBER OF BYTES

0,1,2 OR 4

INSTRUCTIONPREFIX

ADDRESS-SIZE PREFIX

OPERAND-SIZE PREFIX

SEGMENTOVERRIDE

0 OR 1 0 OR 1 0 OR 1 0 OR 1

NUMBER OF BYTES

REP PREFIX

LOCK PREFIX

0 OR 1 0 OR 1

OPCODE MOD R/M SIB DISPLACMENT IMMEDIATE


Assembly Version

MS/Intel “programming tool” Case insensitive Operand size implicit Destination on left

Unix “compiler back-end” Case sensitive Operand size explicit Destination on right

Assembler should be smart enough to handle prefix when 16 bit operand is used in 32 bit code...


Example of ASM FormatIntel/Microsoft tools:

- op dest, src- implied width => requires adding BYTE PTR if not clear- many other goodies (structures style)

MOV EAX, ECX ; EAX:=ECXMOV BYTE PTR GS:[EAX]+4, 0 ; GS:[EAX+4]:=0

REP MOVSB REP MOVS BYTE PTR ES:[EDI], BYTE PTR DS:[ESI]

ADD BX,[EAX+EDX*2]+4 ; 16 bit opsize ADD EBX, [BX+SI] ; 16 bit addr STRING:DB ’A’,’B’,’C’, 0

Unix tools:- op src, dest- explicit width- case sensitive- less flexible

movl %ecx,%eaxmovb 0, %gs:4(%eax)

rep movsbaddw 4(%eax,%edx,2),%bxaddl (%bx,%si),%ebx

L001: .4byte “abc”,’0’


Instruction Format Example strncpy code listing

section .textstrncpy() 0: 55 pushl %ebp 1: 8b ec movl %esp,%ebp 3: 83 ec 0c subl $0xc,%esp 6: c7 45 fc 00 00 00 00 movl $0x0,-4(%ebp) d: 2b d2 subl %edx,%edx f: 3b 55 10 cmpl 16(%ebp),%edx 12: 73 26 jae 0x26 <3a> 14: 8d 45 fc leal -4(%ebp),%eax 17: 89 04 24 movl %eax,(%esp) 1a: e8 fc ff ff ff call 0xfffffffc <1b> 1f: 8b 45 0c movl 12(%ebp),%eax 22: 8b 55 fc movl -4(%ebp),%edx 25: 8a 04 10 movb (%eax,%edx),%al 28: 8b 4d 08 movl 8(%ebp),%ecx 2b: 88 04 11 movb %al,(%ecx,%edx) 2e: 8b 45 fc movl -4(%ebp),%eax 31: 40 incl %eax 32: 89 45 fc movl %eax,-4(%ebp) 35: 3b 45 10 cmpl 16(%ebp),%eax 38: 72 da jb 0xda <14> 3a: 8b 45 0c movl 12(%ebp),%eax 3d: 8b e5 movl %ebp,%esp 3f: 5d popl %ebp 40: c3 ret main() 41: 83 ec 0c subl $0xc,%esp 44: c7 04 24 00 00 00 00 movl $0x0,(%esp) 4b: c7 44 24 04 00 00 00 00 movl $0x0,4(%esp) 53: c7 44 24 08 04 00 00 00 movl $0x4,8(%esp) 5b: e8 a0 ff ff ff call 0xffffffa0 <0> 60: a3 00 00 00 00 movl %eax,0x0 65: 83 c4 0c addl $0xc,%esp 68: c3 ret

Encoding and Prefix example

disassembly for 6.o

section .text 0: a3 00 00 00 00 movl %eax,0x0 5: 66 a3 00 00 00 00 movw %eax,0x0 b: a2 00 00 00 00 movb %al,0x010: 89 1d 00 00 00 00 movl %ebx,0x016: 66 89 1d 00 00 00 00 movw %bx,0x01d: 88 1d 00 00 00 00 movb %bl,0x023: 89 19 movl %ebx,(%ecx)25: 66 89 19 movw %bx,(%ecx)28: 88 19 movb %bl,(%ecx)2a: 67 89 19 movl %ebx,(%bx,%di)2d: 67 66 89 19 movw %bx,(%bx,%di)31: 67 88 19 movb %bl,(%bx,%di)34: a4 movsb (%esi),(%edi)35: f3 a4 repz movsb (%esi),(%edi)37: ff 01 incl (%ecx)39: f0 ff 01 lock incl (%ecx)3c: 64 ff 01 incl %fs:(%ecx)3f: 66 ff 01 incw (%ecx)42: 67 ff 01 incl (%bx,%di)45: f3 ff 01 repz incl (%ecx)48: f3 f0 64 66 67 ff 01 repz lock incw %fs:(%bx,%di)4f: c3 ret


Lecture 3: Floating Point & MMX


Floating Point

Supports single, double and extended memory format (32, 64 and 80 bits), as well as other exotic formats (BCD, integers)

8 registers arranged as a stack

A stack entry (register) is always 80 bits; conversion done during load/store only

Each operation works on top of stack and optionally on another element (another stack entry of memory). Stack top moves!

Supports all IEEE-754 operations (fadd, fsub, fmul, fdiv, fsqrt, fcmp...) and many others (trig, transcendentals, etc...)

Separate unit (“co-processor”) for 8086, 286, 386. On-chip since i486

Exception model reflects this history

Big performance boost in i486, large in Pentium, huge in Pentium-II


Floating Point Formats General Format: (-1)S 2E (b0b1b2b3..bp-1)

Where:S = 0 or 1E = any integer between Emin and Emax, inclusivebi = 0 or 1p = number of bits of precision

SpecificsParameter Single Double ExtendedFormat width in bits 32 64 80p (bits of precision) 24 53 64 2’s

complementExponent width in bits 8 11 15 Biased formatEmax +127 +1023 +16383Emin -126 -1022 -16382Exponent bias +127 +1023 +16383

b0 is implied in single/double, explicit in extended Example of single precision number:

Significand (Mantissa)S Exponent

02331


Floating Point Formats (cont.) Example

Ordinary Decimal 178.125

Scientific Decimal 1.78125E2

Scientific Binary 1 0110010001 E111

Scientific Binary (Biased Exp +127) 1 0110010001 E10000110

Single Format (Normalized)

Sign 0

Biased Exponent 10000110

Significand 01100100010000000000000

1 (implicit)

Special numbers:Exponent is all 0’s or all 1’s: Zero - the entire number is 0

There’s a negative zero, too! Denormals - (+/-) very small numbers, implicit 1 is not assumed Infinite - (+/-) Special rules for 0*inf, etc... NaN - Not a Number - Signaling NaNs and Quite NaNs

Used to propagate errors in computations when exceptions are masked


Floating Point - Control

Rounding modes Precision control Exception Masks

Rounding Control00 - Round to Nearest or Even01 - Round Down (Toward -)10 - Round Up (Toward +)11 - Chop (Truncate Toward 0)

Precision control00 - 24 Bits (Single Precision)01 - (Reserved)10 - 53 Bits (Double Precision)11 - 64 Bits (Extended Precision)

Reserved(Infinity Control)*Rounding ControlPrecision Control

ReservedException Mask

PrecisionUnderflowOverflowZero DivideDenormalized OperandInvalid Operation

IM

DM

ZM

OM

UM

PM

XPCRCX

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

X

* Infinity Control has no effect on 386 and on. Used for compatibility with 8086 & 286

Figure 6-3. FPU Control Word Format


Floating Point - Status

Exception Flags Top Of Stack

position Complex exception

model and handling.Not discussed.

Figure 6-3. FPU Status Word Format

ES is set if any unmasked exception bit is set; Cleared otherwise.

Top values:000 = Register 0 is top of stack001 = Register 1 is top of stack....111 = Register 7 is top of stack

FPU BusyTop of Stack PointerCondition Code

Error Summary StatusStack FaultException Flags

PrecisionUnderflowOverflowZero DivideDenormalized OperandInvalid Operation

IE

DE

ZE

OE

UE

PE

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

TOPSF

C0

C1

C2

C3

BES


Floating Point - Status

Flag each stack value as Valid, Empty, Zero, Special

Tag Values00 = Valid01 = Zero10 = Special: Invalid (NaN, Unsupported, Infinity, or Denormal) 11 = Empty

Tag(7) Tag(6) Tag(5) Tag(4) Tag(3) Tag(2) Tag(1) Tag(0)

015

Figure 6-4. Tag Word Format


Floating Point ExampleCOMM _x1:QWORD ; S E P

_a DD 040b00000r ; 5.5 0 81 (1.)6000_b DD 040800000r ; 4 0 81 (1.)0_c DD 03e800000r ; 0.25 0 7B (1.)0_i2 DD 02H_$T32 DD 040800000r ; 4

; 8 : {

push ebpmov ebp, esp tos=0

; 9 : x1=(-b+sqrt(b*b-4*a*c))/(i2*a);

fld DWORD PTR _b ; b tos=7 fchs ;-bfld DWORD PTR _b ; b tos=6 fmul DWORD PTR _b ; b2

fld DWORD PTR $T32 ; 4 tos=5 fmul DWORD PTR _a ; 4afmul DWORD PTR _c ; 4acfsubp ST(1), ST(0) ; b2-4ac tos=6 fsqrt ; =sqrt(b2-4ac)faddp ST(1), ST(0) ;-b+ tos=7 fild DWORD PTR _i2 ; 2 tos=6 fmul DWORD PTR _a ; 2*afdivp ST(1), ST(0) ; (-b+)/2a tos=7 fstp QWORD PTR _x1 ; store tos=0

; 10 : }

pop ebpret 0

_

// solving a quadratic equationfloat a=5.5;float b=4;float c=.25;int i2=2;double x1;

void main(){ x1=(-b+sqrt(b*b-4*a*c))/(i2*a);}

Note: Stack movement Data type used Number

representation


MMX


MMX

An extension to the X86 Architecture Useful mainly for multimedia applications Technique: Single Instruction, Multiple Data (SIMD) with

existing registers New data types - packed 64 bit consists of either

8 bytes 4 words 2 doublewords

Instructions to operate on this data Software model to allow easy migration with existing OS


MMX Instructions and Data Types

Operations (57 instructions) Arithmetic (saturating & wrap-around)

– Add– Subtract– Multiply– Multiply-add (for fast multiply

accumulate) Conversions

– Pack from large to smaller words– Unpack from small to large words

Data Types Packed (fixed point) integers Signed, Unsigned Bytes, Words, D-Words

8 Packed Bytes

4 Packed Words

2 Packed D-words

Logical– And– Or– Shifts– Compares (including arithmetic)

Transfers – Register to register– Load from and store to memory

Integration into Intel Architecture 8 new registers - MM0- MM7 Uses previously reserved opcodes Uses same addressing modes

Uses same 2 operand encoding scheme– One is src/dest– Other src may come from memory


Example of Operations

Paddusw: Add Unsigned Packed word with Saturate

a3 a2 a1

b3 b2 b1+ + + +

a3+b3 a2+b2 a1+b1

FFFFh

8000h

FFFFh

Packed Add

Paddw: Add Packed word

a3 a2 a1

b3 b2 b1+ + + +

a3+b3 a2+b2 a1+b1

FFFFh

8000h

7FFFh

ShiftPsllq: Shift left logical all 64-bit

a3 a2 a1

0010h

a0

a1 a0a2 0000h

Multiply-AddPmaddwd: multiply and add

a3 a2 a1 a0

b3 b2 b1 b0** **

a3*b3+a2*b2a1*b1+a0*b0


Example Operations

Multiply-AddPmaddwd: multiply and add

a3 a2 a1 a0

b3 b2 b1 b0** **

a3*b3+a2*b2a1*b1+a0*b0

Multiply-HighPmulhw: multiply high

a3 a2 a1 a0

b3 b2 b1 b0

c3 c2 c1 c0

a0*b0a1*b1a2*b2a3*b3

High order 16 bits

Multiply-LowPmullw: multiply low

a3 a2 a1 a0

b3 b2 b1 b0

c3 c2 c1 c0

a0*b0a1*b1a2*b2a3*b3

low order 16 bits


FP/MMX view of registers

ST7

ST0

TOSStatusWord

MM Tag

FPU Tag063

63

00

00

00

00

00

00

MM7

MM0 TOS=0

00

00

1113

MMX registers MM0-MM7 overlap FP registers All tags are valid w/ MMX TOS is 0

Pros No change of context Easy OS migration

Cons Difficult to Intermix MMX

& FP Need to signal MMX to

FP move(EMMS: clear tags)

Example: Conditional Select / Branch Removal

PANDN

PCMPEQW

X4 !=greenX2 != greenX1=green X3=green

greengreengreen green

0x0000 0xFFFF 0xFFFF 0x0000

X1 X2 X4X3 PAND

0x0000 0xFFFF 0xFFFF 0x0000

Y1 Y2 Y4Y3

POR 0x0000 0x0000

finished pixels

0x0000 0x0000

Y1

X2

Y3

X4

Y1

X2

X4

Y3

if (X[i] != green) then new_image[i] = X[i] else new_image[i] = Y[i]

+ =

Packed Comparison (Pcmpcc) and the logical instructions enable conditional select operations in parallel and without data dependent branches.

X Y new_image

MOVQ MM1, XPCMPEQW MM1, GREENMOVQ MM2, MM1PANDN MM1, XPAND MM2, YPOR MM1, MM2MOVQ New, MM1


Streaming SIMD Extensions

New technology to exploit parallelism in FP and INT applications Key capabilities

Packed Operations Branch Removal/Compression Data Movement/Hints FP/INT Type Conversion

Application Domains… 3D Graphics: geometry, lighting signal processing high precision simulation & modeling video encoding/decoding other apps requiring streaming input and output


SIMD FP instructions types

arithmetic square root approximation instructions min & max loads & stores

move mask compare & set mask logical compare & set eflags conversion

Instruction CategoriesSIMD FP - 4 wide, single-precision, 2 wide double precision SIMD INT - extensions to MMX™ technology capabilities, Extending all the MMX instruction to 128 bit using the new XMM registersCacheability controlState management


New SIMD FP Registers

xmm0

xmm1

xmm2

xmm3

xmm4

xmm5

xmm6

xmm7

eight 128-bit, 4x32bit single-precision FP2x64bit single-precision FP, 128 bit MMX

•New set of registers!•Direct access•Used for data only•Extended processor state


Packed SP Data Type Each register holds 4 single-precision FP values or 2 double precision IEEE-754 compatible Scalar operates on least-significant number

xmm0

1-bit sign8-bit exponent23-bit mantissa

031

0

32127

22233031

1-bit sign10-bit exponent53-bit mantissa


Packed Operations

xmm0

xmm1

xmm0

a0a1a2a3

b0b1b2b3

a0 op b0a1 op b1a2 op b2a3 op b3

op is one of•addps addpd •subps subpd•mulps mulpd•divps divpd


Scalar Operations

xmm0

xmm1

xmm0

a0a1a2a3

b0b1b2b3

a0 op b0a1a2a3

op is one of•Addss addsd•subss subsd•nulss mulsd•divss divsd


SIMD Data Organization

Exploit vertical parallelism SOA versus AOS

Array of Structures

Structure of Arrays

X0, X1, X2, X3, … Y0, Y1, Y2, Y3, … Z0, Z1, Z2, Z3, ...

X0, Y0, Z0, X1, Y1, Z1, X2, Y2, Z2, …

Better cacheability

Better SIM

D calculations


Matrix Vector Multiply example

Typical 3D operation Load values in SOA format

xxxx…, yyyy…, zzzz…

Follow with multiply and add operations

movaps xmm0, [list+X+ecx] ;load X components

movaps xmm2, [list+Y+ecx] ;load Y components

movaps xmm3, [list+Z+ecx] ;load Z components

movaps xmm1, [esi+m00] ;m00 m00 m00 m00

movaps xmm4, [esi+m01] ;m01 m01 m01 m01

Loop back and pick up next 4 vertices…mulps xmm1, xmm0 ;x*m00 x*m00 x*m00 x*m00

mulps xmm4, xmm2 ;y*m01 y*m01 y*m01 y*m01

addps xmm4, xmm1 ;add the 2 results

movaps xmm1, [esi+m02] ;load matrix element m02 (x4)

mulps xmm1, xmm3 ;z*m02 z*m02 z*m02 z*m02

addps xmm4, xmm1 ;add results

addps xmm4, [esi+m03] ;add last element of matrix


SIMD Integer Instructions

Extensions to MMXTM technology instructions Operate on same 64-bit registers as previous MMX technology

instructions, Or on the new 128 bit XMM registers Instructions: extract, insert, min/max, byte mask integer,

multiply high unsigned, shuffle multiply unsigned double, add and sub qword, average


Computing Model

Processinginput output

If you can’t bring data in fast enough or spit it out fast enough… SIMD is of little or no use

The “streaming” part of the Streaming SIMD Extensions is critical to overall performance


Prefetch

Hides latency by bringing in data before you need it Provide cache hints to fetch data to different cache levels

prefetchnta - prefetch non-temporal data to non-temporal cache (L1)

prefetcht0 - prefetch temporal data into both L1 and L2 caches prefetcht1, prefetcht2 - prefetch temporal data into L2 cache


Prefetch Illustrated

memoryL1

L2

prefetchnta [esi]


Prefetch Illustrated (2)

memoryL1

L2

prefetcht0 [esi]


Prefetch Illustrated (3)

memoryL1

L2

prefetcht1 [esi]prefetcht2 [esi]


Warning about prefetch

Proper placing of prefetch is critical to insure that there’s enough time between prefetch and actual load limited resources to load data

Excessive use of prefetch can actually hurt performance Spread out prefetches and insure sufficient computation

before load


Streaming Stores

Avoids polluting cache with data that you don’t real soon (or ever)

Supports 128 and 64-bit versions movntps - from xmm reg to memory movntq - from mm reg to memory

If store is a cache hit, then cache is updated and not sent to memory

Weakly ordered Use sfence instruction to insure order


Streaming Store Illustrated

memory

L1

L2

movntps [esi]

xmm0

No write allocation*

* If it was a cache hit,then data goes to cacheand is not writtendirectly to memory


AMD 64 bit extension

Starting with Opron (released April – 2003)


Lecture 4: Addressing and Instruction Encoding


Addressing Modes Compute the offset part of a memory reference

Direct Register (by name) Memory: 16 bit addressing modes

(Segment) + base + index + displacement disp(base,index) 8 combinations of base + index

BX, SI, DI, BX+SI, BX+DI, BP+SI, BP+DI, disp only Disp 0, 8, or 16 bit. Occupies 0, 1 or 2 bytes All combinations are encoded in the ModRM byte Examples: 8(bp); (bp,bx);

Memory: 32 bit addressing modes (Segment) + base + index*scale + disp

disp(base,index,scale) Base: each of the 8 general registers or none Index: each of 7 general registers (no ESP!) or none Scale: 1,2,4,8 Index/scale requires an additional SIB bytes.

Base only can use ModRM byte only Disp: 0, 8 or 32 bit. Occupies 0, 1 or 4 bytes Examples: (eax); array(ecx,4); based_array(eax,ebx,4)

Addressing mode determined by code mode (16/32) Can be toggled using the address prefix (0x67)


Addressing Modes

Figure 25-2. ModR/M and SIB byte Formats

MOD REG/OPCODE R/M

SIB (SCALE INDEX BASE) BYTE

7 6 5 4 3 2 1 0

SS INDEX BASE

MODR/M BYTE

7 6 5 4 3 2 1 0

Figure 3-10. Effective Address Computation (32-bit vesrion)

{CSSSDSESFSGS

EAXECXEDXEBXESPEBPESIEDI

EAXECXEDXEBXEBPESIEDI

1

2

4

8

No Displacement8-bit Displacement32-bit Displacement

+ + * +{} }{}{}{ }

Segment + Base + (Index * Scale) + Displacement


Encoding Example

ModRM byte: MOD REG/OP Reg/Mem 7 6 5 4 3 2 1 0

SIB byte: SS INDEX BASE 83 ec 0c subl $0xc,%esp

89 03 movl %eax,(%ebx) / base onlyc7 45 fc 07 00 00 00 movl $0x7,-4(%ebp) / base, disp ; imm8b 55 fc movl -4(%ebp),%edx / base, disp ; reg89 45 fc movl %eax,-4(%ebp)3b 55 10 cmpl 16(%ebp),%edx89 04 24 movl %eax,(%esp) / base only espc7 04 24 08 00 00 00 movl $0x8,(%esp)c7 44 24 04 09 00 00 00 movl $0x9,4(%esp) / base+disp esp8a 04 10 movb (%eax,%edx),%al / base + index8a 04 50 movb (%eax,%edx,2),%al / base+scale*index8a 04 42 movb (%edx,%eax,2),%al / 88 04 11 movb %al,(%ecx,%edx)40 incl %eax72 da jb 0xda <14>5d popl %ebpc3 ret


MOD/RM-SIB Table: 16-bit addressing table

Table 11-1. 16-Bit Addressing Forms with the ModR/M Byter8(/r)r16(/r)r32(/r)/digit (Opcode)REG =

ALAXEAX0000

CLCXECX1001

DLDXEDX2010

BLBXEBX3011

AHSPESP4100

CHBP1

EBP

5

101

DHSIESI6110

BHDIEDI7111

EffectiveAddress Mod R/M ModR/M Values in Hexadecimal

[BX+SI][BX+DI][BP+SI][BP+DI][SI][DI]disp162

[BX]

00 000001010011100101110111

0001020304050607

08090A0B0C0D0E0F

1011121314151617

18191A1B1C1D1E1F

2021222324252627

28292A2B2C2D2E2F

3031323334353637

38393A3B3C3D3E3F

[BX+SI]+disp83

[BX+DI]+disp8[BP+SI]+disp8[BP+DI]+disp8[SI]+disp8[DI]+disp8[BP]+disp8[BX]+disp8

01 000001010011100101110111

4041424344454647

48494A4B4C4D4E4F

5051525354555657

58595A5B5C5D5E5F

6061626364656667

68696A6B6C6D6E6F

7071727374757677

78797A7B7C7D7E7F

[BX+SI]+disp16[BX+DI]+disp16[BP+SI]+disp16[BP+DI]+disp16[SI]+disp16[DI]+disp16[BP]+disp16[BX]+disp16

10 000001010011100101110111

8081828384858687

88898A8B8C8D8E8F

9091929394959697

98999A9B9C9D9E9F

A0A1A2A3A4A5A6A7

A8A9AAABACADAEAF

B0B1B2B3B4B5B6B7

B8B9BABBBCBDBEBF

EAX/AX/ALECX/CX/CLEDX/DX/DLEBX/BX/BLESP/SP/AHEBP/BP/CHESI/SI/DHEDI/DI/BH

11 000001010011100101110111

C0C1C2C3C4C5C6C7

C8C9CACBCCCDCECF

D0D1D2D3D4D5D6D7

D8D9DADBDCDDDEDF

E0EQE2E3E4E5E6E7

E8E9EAEBECEDEEEF

F0F1F2F3F4F5F6F7

F8F9FAFBFCFDFEFF

Notes

1. The default segment register is SS for the effective addresses containing a BP index, DS for other effective addresses.

2. The “disp16” nomenclature denotes a 16-bit displacement following the ModR/M byte, to be added to the index.

3. The “disp8” nomenclature denotes an 8-bit displacement following the ModR/M byte, to be sign-extended and added to the index.


MOD/RM-SIB Table: 32-bit addressing table

Table 11-2. 32-Bit Addressing Forms with the ModR/M Byter8(/r)r16(/r)r32(/r)/digit (Opcode)REG =

ALAXEAX0000

CLCXECX1001

DLDXEDX2010

BLBXEBX3011

AHSPESP4100

CHBPEBP5101

DHSIESI6110

BHDIEDI7111

EffectiveAddress Mod R/M ModR/M Values in Hexadecimal

[EAX][ECX][EDX][EBX][--][--]1

disp32[ESI][EDI]

00 000001010011100101110111

0001020304050607

08090A0B0C0D0E0F

1011121314151617

18191A1B1C1D1E1F

2021222324252627

28292A2B2C2D2E2F

3031323334353637

38393A3B3C3D3E3F

disp8[EAX]3

disp8[ECX]disp8[EDX]disp8[EBX]disp8[--][--]disp8[EBP]disp8[ESI]disp8[EDI]

01 000001010011100101110111

4041424344454647

48494A4B4C4D4E4F

5051525354555657

58595A5B5C5D5E5F

6061626364656667

68696A6B6C6D6E6F

7071727374757677

78797A7B7C7D7E7F

disp32[EAX]disp32[ECX]disp32[EDX]disp32[EBX]disp32[--][--]disp32[EBP]disp32[ESI]disp32[EDI]

10 000001010011100101110111

8081828384858687

88898A8B8C8D8E8F

9091929394959697

98999A9B9C9D9E9F

A0A1A2A3A4A5A6A7

A8A9AAABACADAEAF

B0B1B2B3B4B5B6B7

B8B9BABBBCBDBEBF

EAX/AX/ALECX/CX/CLEDX/DX/DLEBX/BX/BLESP/SP/AHEBP/BP/CHESI/SI/DHEDI/DI/BH

11 000001010011100101110111

C0C1C2C3C4C5C6C7

C8C9CACBCCCDCECF

D0D1D2D3D4D5D6D7

D8D9DADBDCDDDEDF

E0E1E2E3E4E5E6E7

E8E9EAEBECEDEEEF

F0F1F2F3F4F5F6F7

F8F9FAFBFCFDFEFF

Notes

1. The [--][--] nomenclature means a SIB follows the ModR/M byte.

2. The disp32 nomenclature denotes a 32-bit displacement following the SIB byte, to be added to the index.

3. The disp8 nomenclature denotes an 8-bit displacement following the SIB byte, to be sign-extended and added to the index.

Table 11-3. 32-Bit Addressing Forms with the SIB Byte

r32Base =Base =

EAX0

000

ECX1

001

EDX2

010

EBX3

011

ESP4

100

[*]5

101

ESI6

110

EDI7

111

Scaled Index SS Index SIB Values in Hexadecimal

[EAX][ECX][EDX][EBX]none[EBP][ESI][EDI]

00 000001010011100101110111

0008101820283038

0109111921293139

020A121A222A323A

030B131B232B333B

040C141C242C343C

050D151D252D353D

060E161E262E363E

070F171F272F373F

[EAX*2][ECX*2][EDX*2][EBX*2]none[EBP*2][ESI*2][EDI*2]

01 000001010011100101110111

4048505860687078

4149515961697179

424A525A626A727A

434B535B636B737B

444C545C646C747C

454D555D656D757D

464E565E666E767E

474F575F676F777F


10 000001010011100101110111

80889098A0A8B0B8

81899189A1A9B1B9

828A929AA2AAB2BA

838B939BA3ABB3BB

848C949CA4ACB4BC

858D959DA5ADB5BD

868E969EA6AEB6BE

878F979FA7AFB7BF


11 000001010011100101110111

C0C8D0D8E0E8F0F8

C1C9D1D9E1E9F1F9

C2CAD2DAE2EAF2FA

C3CBD3DBE3EBF3FB

C4CCD4DCE4ECF4FC

C5CDD5DDE5EDF5FD

C6CED6DEE6EEF6FE

C7CFD7DFE7EFF7FF

Notes

1. The [*] nomenclature means a disp32 with no base if MOD is 00,[EBP] otherwise. This provides the following addressing modes:disp32[index] (MOD=00).disp8[EBP][index] (MOD=01).disp32[EBP][index] (MOD=10).


Addressing Mode Examples_s$ = -4_t$ = -8; 16 : // typedef struct {char mem1, mem2, prev,next;} str_type_t;; 17 : // int i,j; <Globals>; 18 : // char a,b,c,d,e,f,g,h; <Globals>; 19 : // int s,t;; <Stack>

; 20 : /* int var; */ a = var;mov al, BYTE PTR _var; DISP only

; 21 : /* str_type_t struc; */ b = struc.mem2;mov cl, BYTE PTR _struc+1; DISP only

; 22 : /* int array[100]; */ c = array[i]; mov edx, DWORD PTR _imov al, BYTE PTR _array[edx*4]; INDEX*SCALE+DISP

; 23 : /* str_type_t arr_str[100]; */ d = arr_str[i].next;mov ecx, DWORD PTR _imov dl, BYTE PTR _arr_str[ecx*4+3]; INDEX*SCALE+DISP

; 24 : /* int *p; */ e = *p; mov eax, DWORD PTR _pmov cl, BYTE PTR [eax]; BASE

; 25 : /* str_type_t *str_p; */ f = str_p->prev; mov edx, DWORD PTR _str_pmov al, BYTE PTR [edx+2]; BASE+DISP

; 26 : /* int *arr_p; */ g = arr_p[j]; mov ecx, DWORD PTR _jmov edx, DWORD PTR _arr_pmov al, BYTE PTR [edx+ecx*4]; BASE+INDEX*SCALE

; 27 : /* str_type_t *arr_str_p; */ h = arr_str_p[j]->prev;mov ecx, DWORD PTR _jmov edx, DWORD PTR _arr_str_pmov al, BYTE PTR [edx+ecx*4+2]; BASE+INDEX*SCALE+DISP

; 28 : /* int s,t (stack) */ t = s;mov ecx, DWORD PTR _s$[ebp]mov DWORD PTR _t$[ebp], ecx BASE+DISP


Software Conventions

API - Application Programming Interface Interface among procedures modules: param types, order, etc.. Machine dependent; Defined in HLL.

Calling conventions - machine level calling sequence Parameter passing location and order (stack/regs) Register preserved across calls Machine dependent, may be also language dependent

ABI - Application Binary Interface Interface among application and DLLs/OS machine dependent; OS dependent

C conventions for x86: Parameters are pushed last to 1st EAX, EBX, EDX are scratch, all other preserved Return results - Int: EAX (EDX also), FP: top of FP stack All parameters (without prototypes) are widened to int/double


Compatibility Issues

Goal: enable modules compiled by different compilers to interoperate Possibly also different languages

Calling conventions Parameter location/order Static link (e.g. for Pascal)

External variable names - for the linker “Name decoration” - for C++

External name encodes the variable’s type Exception handling

How to locate the user-registered exception handler when an exception occurs

Debug information format This is only nice-to-have

All of the above are broken in practice!


Administration

Please send me your Email address, so that we can have a mailing list My address is [email protected] Subject line: Microcomputer course mailing list

Contacts koby [email protected] 04-8655718 Ofri [email protected] 04-8655851


Example for Calling Conventions Example of calling stackUsing 32 bit mode, near pointer

#include <string.h>

char in[]="abc";char out[10];char *p;

char *strncpy(char *d, const char *s, size_t n){ int i=0; for (i=0;i<n;i++) { monitor(&i); d[i]=s[i]; } return (char*)s;}

main(){ p = strncpy(out,in,sizeof(in));}==============================================

.file "strncpy.c"

.data

.globl inin: .size in,4

.string "abc" // char in[]="abc"

.comm out,10,4 // char out[10];

.comm p,4,4 // char *p

.text

/====================/ strncpy/--------------------

.globl strncpystrncpy:

pushl %ebp // put old frame ptrmovl %esp,%ebp // set new frame ptrsubl $12,%esp // Space for automat movl $0,-4(%ebp) // i = 0subl %edx,%edx // %edx = 0cmpl 16(%ebp),%edx // if (i<n)jae .L1

.L2:lea -4(%ebp),%eax // %eax = &imovl %eax,(%esp) // push &icall monitor // movl 12(%ebp),%eax // %eax = smovl -4(%ebp),%edx // %edx = imovb (%eax,%edx),%al // %al = s[i]movl 8(%ebp),%ecx // %ecx = dmovb %al,(%ecx,%edx) // d[i] = %almovl -4(%ebp),%eax // \incl %eax // | i++movl %eax,-4(%ebp) // /cmpl 16(%ebp),%eax // if (i<n)jb .L2

.L1:movl 12(%ebp),%eax // return smovl %ebp,%esp // clear stackpopl %ebp // return frameret

/====================/ main/--------------------

.globl mainmain:

subl $12,%espmovl $out,(%esp) // outmovl $in,4(%esp) // inmovl $4,8(%esp) // pcall strncpy // callmovl %eax,p // store return valueaddl $12,%esp // clean stackret

esp frame itemebp+16 sizeof(in)/nebp+12 in / s

before call strncpy ebp+8 out / dafter the call ebp+4 ret-addrafter push ebp ebp old ebp

ebp-4 var iebp-8 unused

before call monitor ebp-12 param I


Code example - Unix style

; unsigned int fib(int x) {; if(x>2); return(fib(x-1)+fib(x-2));; else; return(1);; } ; ; Stack frame:; %esp + 8: x ; %esp +4: return address ; note - no ebp!

.file "0f.c"

.align 16fib: / x= 8(%esp)

pushl %eax / save %eax is it needed?movl 8(%esp),%edx / edx=xcmpl $2,%edx / x>2?pushl %ebx / (save %ebx)jle .L66decl %edx / x<2, x--pushl %edx / \call fib / } %ebx = fib(x-1)movl %eax,%ebx / /movl 16(%esp),%eax / \subl $2,%eax / } %eax = x-2pushl %eax / /call fib / %eax = fib(x-2)addl %ebx,%eax / %eax = fib(x-1) + fib(x-2)addl $8,%esp / reset stack

.L67:popl %ebx / restore %ebxaddl $4,%esp / reset stack (contained %eax) ret / return.align 16,7,4

.L66:popl %ebxmovl $1,%eaxaddl $4,%espret / ret 1

Code example - Microsot/Intel style; unsigned int fib(int x) {; if(x>2); return(fib(x-1)+fib(x-2));; else; return(1); }; Stack frame:; %esp + 12: x ; %esp + 8: return address; %esp + 4,+8: esi, edi; %eax contains return value ; note - no ebp! TITLE fib.c .386P.model FLAT..._x$ = 8_fib PROC NEAR; 39 : { push esi ; push temps push edi; 40 : if( x > 2 ) mov edi, DWORD PTR _x$[esp+4] ; edi = x cmp edi, 2 ; x > 2? jle SHORT $L370; 41 : return (fib(x-1) + fib(x - 2)); lea eax, DWORD PTR [edi-2] ; eax = x-2! dec edi ; edi = x-1 push eax call _fib ; eax = fib(x-2) add esp, 4 ; pop param mov esi, eax push edi call _fib ; eax = fib(x-1) add esp, 4 add eax, esi; eax = fib(x-1)+fib(x-2) pop edi pop esi ; restore temps ret 0$L370:; 42 : else; 43 : return (1); mov eax, 1 ; return 1 pop edi; 44 : } pop esi ret 0_fib ENDP_TEXT ENDSEND


CISC vs RISC

CISC (Complex Instruction Set Computer) Simple as well as exotic instructions Small number of registers Variety of addressing modes Operation on registers and memory “Goal”: reduced instruction count

RISC (Reduced Instruction Set Computer) Simple instructions only Large number of registers Small set of addressing modes Memory access in load/stores only “Goal”: fast execution (through simple design)


Lecture 5: Segmentation


Segmentation and Protected Mode


Segmentation Already mentioned:

Address space divided into “segments” Logical address: segment:offset Linear address: address in the “flat” address range Segmentation: mapping from Logical to Linear

Segment base = f(segment register)offset < segment size: <64K (86/286), <4G (386+) linear address = segment_base + offset

At any time, there are 4 (8086) or 6 (286 and on) active segment registersCS: code, DS: data, SS: stack, ES: extra. FS, GS: new extra

Segment register can be modified by: A move from general register (e.g., mov ds,ax) LxS inst (e.g., LDS): loads full ptr to both seg reg and general reg.

All memory reference instructions assume default seg register CS for branches, SS for Stack, DS for data ES for string instruction destination operands

Segment prefix can be used to modify the default segment register for the current instruction


Address Computation - Real Mode Segment Base = segment*16 Linear address = segment*16 + offset

0x1234:0x5678 = 0x12340+0x5678 = 0x179B8 Implications:

Application may consume many segments Address space = 64K*16 = 1M Linear addresses have many (4096) logical representations

Seg+1:offset = seg:offset+16 But: do not assume that!

Last address: 0xFFFF*16+0xFFFF = 0x10FFEF (1M+64K-16)

19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

16-Bit Segment Selector 0 0 0 0

19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

0 0 0 0 16-Bit Effective Address0 0 0 0

21-Bit Linear Address

Base

Offset

Linear Address

+

= 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Figure 9-1. 8086 Address Translation

A20

0 0 0 0


Segmentation Models (16-bit) Terminology:

Code (“text”): code + optional read-only variables Data: global variables, static variable. Read/Write and Read Only Stack: parameters, automatic variables Heap: dynamically allocated memory

Model Segment Size Pointer Type

Tiny Code+Data+Stack+Heap <64K Offset

Small Code<64K, Data+Stack+Heap<64K Offset

Compact Code<64K, Data<64K, Stack<64K, Heap>64K Seg:Offset

Large Code>64K, Data>64K, Stack<64K, Heap>64K Seg:Offset

Huge Single Element > 64K Seg:Offset Trade-off between size and performance: smaller is faster

Tiny/Small: no need to switch segment registers Compact: uses only one data segment, but can address the entire memory

(e.g., heap) , if needed, not too many segment switches

32-bit mode typically uses Flat Model: One segment, starts at 0, spans 4G, all seg registers point to it


Segmentation Pros/Cons

Great solution for large memory w/ small word Overhead depends on memory requirements

Pay per use…

Allows reasonable code/data sharing

- More complex software development and tools

- Real mode segmentation => 1MB only

Did x86 succeed because of segmentation?… or in spite of it?...


Protected Mode - MotivationTwo Problems: 1MB is too small

More sophisticated usage: Multiprogramming, Multiusers.

Addresses are interpreted differently by different clients:

entry *next_entry (entry *p)

{ return (p->next); }

...

crnt_p = next_entry(crnt_p);

crnt_p->value = 7;

...

Increased need for protection among apps and OS In real mode, each application can crash the OS!


Protection - Solution

Solution - store information on segments in a table Segment register becomes an index to a “Descriptor Table”

contains address (base) and protection information

Address info Segment base

Linear address is still segment_base + offset, but…Segment_base = Descriptor_Table[f(segment)].base

Protection info Segment size (“limit”) <64K for 286, <4G for 386+ Segment type R, RW, ER, EO (Read, R/W, Execute/Read, Exec-Only) Segment privilege (0..3: 0 is higher, 3 is lower) Restrictions are checked for every access Descriptor table should be protected!

… And more colorful protection mechanisms…

Small step to the segment, big leap to mankind!


Address Computation Linear address = Descriptor_Table[f(segment)].base + offset Coverage

286: 16MB address space. Requires 512 segments to cover! 386: 4GB address space. 1 segment is sufficient

Descriptor Table

Selector Offset

SegmentDescriptor

LinearAddress

LogicalAddress

BaseAddress

015

031

031

+

Figure 11-5. Segment Translation


A General Segmentation Model The OS allocates each application a number of segments

CS

SS

GS

DS

ES

FS

Access LimitBase Address

SegmentRegisters

SegmentDescriptors









PhysicalMemory

Figure 11-3. Multisegmented Model


Descriptor Tables 2 tables

GDT (Global)Pointed by GDTR

LDT (Local)Per taskPointed indirectly by

LDTR

Table pointers GDTR: Base + Limit LDTR: GDT Index only

Entries (descriptors) of various types: Segments Call/interrupt gates Tasks LDT - in the GDT

Figure 11-4. TI Bit Selects Descriptor Table

TI

Base AddressLimit

GDTRBase Address

Limit LDTR

Selector

SegmentSelector

GlobalDescriptor

Table

LocalDescriptor

Table

TI = 0 TI = 1


13 bit index (8K entries) 1 bit table index - GDT (0) or LDT (1) 2 bit “requested” privilege level - see below

Segment Selector Format

INDEXTI

RPL

5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 01 1

Table Indicator 0 = GDT 1 = LDTRequested Privilege Level 00 = Most Privileged 11 = Least

Figure 11-7. Segment Selector Format


Segment Descriptor Format Note

286 => 386 evolution (high-order 16 bits) Limit >4M achieved by the granularity bit - scale by 212

D/B: 16/32 bit

Information moves fromdescriptor (in memory)to CPU when segmentregister is loaded

Figure 11-8. Segment Descriptors Format

AVL Available for use by system softwareBASE Segment base addressD/B Default operation size

(0 = 16-Bit segment; 1 = 32-bit segment)DPL Descriptor privilege level G GranularityLIMIT Segment limitP Segment presentS Descriptor type

(0 = System; 1 = application)TYPE Segment type

Reserved

5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 01

1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 63 23

0151616

Base 31:24 P Base 23:16

Segment Limit 15:0

DPL

TYPESGD/B

0AVL

SegLimit19:16

Base Address 15:0 +0

+4


Privilege Levels Lower number => higher privilege Code can access data of equal/lower privilege levels only Code can call more privileged data via “call gates” only

Each level has its own stack!

Some instructions and I/O operations are restricted to certain privilege levels

Also referred to as:

“Rings”

LEVEL 3

LEVEL 2

LEVEL 1

LEVEL 0

Applications

Operating SystemServices (DeviceDrivers, etc..)

Operating SystemKernel

Protection Rings

Figure 12-2: Protections Rings


For Data: DPL >= max(RPL,CPL)Otherwise: exception!

RPL: used by OSto protect against userTrojan Horses

Trojan Horse:User code calls OS codeand passes a pointer toa privileged data area

Privilege Level Check

Figure 12-3. Privilege Check for Data Access

CPL Current Privilege LevelDPL Descriptor Privilege LevelRPL Requested Privilege Level

5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 01

1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 63 23

0151616

DPL

+0

+4

CPL

RPL

PrivilegeCheck

Operand Segment Selector

Current Code Segment Register

Operand Segment Descriptor


Call Gate A mechanism for the OS to restrict lower privileged code to

call only pre-defined entry points Restricts the caller privilege level

Not Used

Procedure EntryPoint

+

DPL Count

Selector Offset 15:0

BASE DPL BASE

BASE

Offset 31:16

Descriptor Table

GateDescriptor

Code SegmentDescriptor

Selector Offset Within Segment

Destination Address

015 031

Figure 12-6. Call Gate Mechanism

From which PrivilegeLevel the gate may be used.

The called code Privilege Level.

# of parameters


Inter Level Call - Stack Switching On inter-level call

CPL changes Control transferred Stack switched

Stack Switch New (privileged) SS:eSP is taken from the TSS Old SS:eSP is stored on new stack Parameters are copied to new stack

Figure 12-9. Stack Frame During Inter Level Call

xyz

PARM 1

PARM 2

PARM 3

OLD SS

OLD ESP

PARM 1

PARM 2

PARM 3

xyz

OLD CS

OLD EIP

Old Stack Before Call

New StackAfter Call, Before Return

Old Stack After RETURN(far retN N=3)

ESP3

ESP0

ESP3

SS0:ESP0

From TSS

Assume call from 3 to 0 Unique to interlevel calls


Task State Segment (TSS) HW mechanism to allow multi-tasking Contains task state:

Dynamic: Integer registers, segment registers, flags...

Static: CR3, per level stack pointer, etc... Task switch occurs on

JMP/Call to a TSS descriptor task gate Interrupt indexed to a task gate Certain IRET

On switch: Current TSS content is updated

For example, general registers are saved by the CPU!

New task’s values are retrieved from TSS

Contains initial stack values for level 0-2

Figure 13-1. 32-bit Task State Segment

CR3

ESP1

ESP0

EIPEFLAGS

EAXECXEDXEBXESPEBPESIEDI

ESP2

IO map base addr

Sel for task LDTGSFSDSSSCSES

SS2

SS1

SS0

Link (old TSS sel)

T

048C1014181C2024282C3034383C4044484C5054585C6064

01531


Protected ModeImplications and Observations (1)

Software conversion Straightforward if software does not assume linear = seg*16+offset

Real mode “tricks” do not work. Usually for the better: Cannot write on code

Cannot access segment 0

Cannot jump/write into the OS

New issues How to write on code (self-modifying code, SMC) - use aliasing

How to use OS services - via call gates

How to access a known linear address -Map a segment to a known base. e.g., use 4G at base 0.

How to protect between processes (applications) -Use LDT per process. Reload LDTR at context switch

Additional related mechanisms exist - not discussed


Protected ModeImplications and Observations (2)

Segmentation performance impactIssues: Indirect vs direct addressing Protection check

Solution: information is cached for each active segment Almost no overhead on memory reference... But Change of a segment register is costly!

Segment based virtual memory Possible - through use of “segment not present” bit Unattractive - due to variable length of segments


Protected Mode in Practice

Value of segmentation/protected mode Only option for a 16-bit OS 32-bit OS prefers flat model, page based protection

Call gates used for system calls by Unix Not by Windows NT

Task switch mechanism rarely if ever used OS/2 uses 32-bit segmentation Exotic uses for segment registers

FS used by NT for multi-threading


Segmentation - Summary

Mapping of logical address (selector:offset) to linear address Initially introduced to support addressability larger than the

machine word Since 286, descriptor tables enable

Protection Even larger addressability Potentially swapping to disk (but not very useful)

Starting with 386, 4GB addressability But 32-bit operating systems almost all use flat memory

Protection is the separation of Several different segments with different attributes Processes from one another Processes from the OS (protection “rings”)

All memory accesses are verified Control transfers are limited: call gates


Lecture 6: Paging


Paging


Paging - Motivation The problem:

Effective address can cover 4GB, but actual memory is smaller Application should be independent of actual memory location

Avoid memory size and location concernsAllow using fixed, pre-defined, addressesAllow easier code/data sharingInsensitive to gaps within address space

e.g., because of unused, but allocated, array space

Solution: Paging Idea: memory is divided to equal sized chunks - pages (4KB) A layer on top of segmentation. Performs linear-to-physical mapping Address “processing”

Logical (Seg:offset) ==> Linear ==> Physical Page tables map linear address to physical. Keeping track of:

Valid pagesPaged out pages (to secondary memory)Non-existent pages

Analogy: Big library, good catalog, reading room, storage room


Page Translation Mechanism 2-level hierarchical mapping

Using page directory and page tables All pages and page tables are 4K

Linear address divided to: Directory 10 bits Table 10 bits Offset 12 bits Dir/Table serves as index

into a page table Offset serves as pointer

into a data page

Page entry points to apage table or page

Performance issues: TLB!

Advanced mechanism 4MB pages(not discussed here)

Figure 11-16. Combined Segment and Page Address Translation

Descriptor Table

Selector Offset

SegmentDescriptor

LogicalAddress

BaseAddress

015

031

031

+

DIR TABLE OFFSET

Page Table

PG Tbl Entry

Page Directory

4K Dir Entry

4K Page Frame

Operand

Linear Address Space (4K Page)1121

CR3


Page Translation Mechanism (Cont.) CR3 points to current page directory

May be changed per process

Usually, a page directory entry (covers 4MB) will point to a page table that covers data of the same type/usage

Sharing. Can alias Different linear on

same physical(e.g., SMC)

Pages from differentprocesses to samephysical (e.g., OS)

DIR TABLE OFFSET

Page Dir Code

Data

StackOS

Phys Mem

4K page

CR3

Page Tables


Page Entry Format 20 bit pointer to a 4K

Aligned address 12 bits flags Virtual memory

Present Accessed, Dirty

Protection Writable (R#/W) User (U/S#)

2 levels only!

0, 1, 2 Supervisor

Caching Page Write-Through Page Cache

Disabled

3 bits for OS usage

Figure 11-14. Format of Page Directory and Page Table Entries for 4K Pages

0

0 0

Page Frame Address 31:12 AVAIL 0 0 APCD

PWT

U W P

PresentWritableUserWrite-ThroughCache DisableAccessedPage Size (0: 4 Kbyte)Available for OS Use

Page DirEntry

04 12357911 681231

Page Frame Address 31:12 AVAIL D APCD

PWT

U W P

PresentWritableUserWrite-ThroughCache DisableAccessedDirtyAvailable for OS Use

Page TableEntry

04 12357911 681231

Reserved by Intel for future use (should be zero)

-

-


Paging - Virtual Memory

A page can be New, not yet loaded Loaded Paged out to disk

A loaded page can be Clean Dirty

When a page is not loaded (P bit clear) => a page fault occurs It may require removing a loaded page to insert the new one

OS prioritizes removal by LRU and dirty/clean/avail bitsDirty page should be written to disk. Clean ones need not

New page is either loaded from disk or “initialized” CPU sets page “access” flag when accessed, “dirty” when written

Copy-on-write strategy for forked Unix processes


Paging Pros and Cons

Independence of physical memory size No need to preserve continuous memory ranges Physical memory is allocated only if accessed Can map same linear address to different physical address Can share data/code among processes Pages have predefined size - easy memory management Simple protection mechanism

Overhead of tables (insignificant) Internal fragmentation (insignificant) Translation time - reduced by TLB


Entering Protected Mode, Paging “Easy”: set CR0.PE to enable protection, CR0.PG to enable Paging…

But one better make sure all tables are ready and valid...

Figure 10-3. Control Registers

Reserved ReservedNE

ET

TS

EM

MP

PE

Page Directory Base PCD

PWT

0 ... 0 0MCE

PSE

DE

TSD

PVI

VME

Page Fault Linear Address

WP

AM

PG

CD

NW CR0

CR1

CR2

CR3

CR4

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0


Mapping 4M pages, 36 bit address.

PSE (page file extension) flag bit 4 of CR4 The PSE enables 4MByte pages (or 2Mbyte pages when

PAE flag is set). Those pages are mapped directly from the page directory, without using page tables.

PAE (physical address extension) flag bit 5 of CR4 The PAE flag enables 36 bit physical addresses (Not

covered here)


Paging/Segmentation Example All left indentation are data to the example, right are conclusions/results. All numbers are in HEX.GDTR: 0000 2000CR3: 0000 3000Logical address DS: 23, Offset: 3456

0010 0011

DS: index = 4, TI=GDT, Priv level = 3GDT(4) is in address 2000+4*8=2020

Assume protection validation is successful...GDT(4): base 00C03000 (linear!)

Linear address = 00C03000+3456 = 00C064560000 0000 1100 0000 0110 0100 0101 0110

PDE = 003 (0000 0000 11)PTE = 006 (00 0000 0110)Off = 456 (0100 0101 0110)

address in PDE = 3000+3*4 = 300C

M(300C) = 4000

address in PTE = 4000+6*4 = 4018

M(4018) = 5000

full address in page = 5000+456 = 5456

M(5456) = 12345678Finished? Not exactly....


Paging/Segmentation Example (Cont.)

But: where do we find the GDT?

GDTR points to 2000 linear, need to look at the physical address to see GDT(4)!

Linear address = 00002020 = 0000 0000 0000 0000 0010 0000 0010 0000

PDE = 000 (0000 0000 00)

PTE = 002 (00 0000 0020)

Off = 020 (0000 0010 0000)

address in PDE = 3000+0*4 = 3000

M(3000) = 6000

address in PTE = 6000+2*4 = 6008

M(6008) = 5000

full address in page = 5000+020 = 5020

M(5020) = 0010 f0c0 3000 1234

(G=0, 16bit, avail, present, DPL=3, appl, data/RO, limit=1234, base = 00c03000)


Addressing Methods - Summary

16 bits real and virtual mode: seg_reg*16+offset

16/32 bit protected mode:Use global/local descriptor table (GDT/LDT) xDT[seg_reg].base + offset (xDT = GDT/LDT) Check protection! Note: 32-bit is always protected

32 bit paging enabled: First translate as above, then Convert virtual (linear) to physical

pdi=vaddr[31:23]; pti=vaddr[22:13]; offset=vaddr[11:0]paddr = CR3->pdt[pdi]->pt[pti]+offset

Check protection!


Lecture 7: Interrupts and Exceptions


Interrupts and Exceptions

Some of the material in this section was taken from:The advanced Intel microprocessors - 80286/80386/80486.By: Barry B. Brey, DeVry institute of technology. Pub. by MacMillan publishing company


What is it?

“A signal from from hardware or software, such as a keystroke, that demands immediate attention and takes priority over other operations ”

Analogy: Alarm clock Pain


What Causes It? Interrupts

External to the CPU

Peripheral: I/O: mouse move/click, keyboard, timer

Exceptions Internal to the CPU

Exceptions: Processor detected:

Fault: restart from the causing instruction (e.g., page fault)

Trap: restart from the next instruction (e.g., overflow)

Abort: cannot restart (double fault) Programmed Exceptions (“software interrupts”):

e.g.: INTO, INT 3, INT n, BOUND

The terms “interrupts” & “exceptions” are frequently intermixed


Why Is It Important?

Few categories

Inevitable: only way to implement a feature Cannot implement timers, page faults, ^break etc. without it

Difficult without: alternatives imply huge program overhead e.g., polling to identify keyboard stroke

Convenience: cheaper way to implement a feature e.g., zero divide, INTO, BOUND, etc..

Conventions: Application/OS communication System call. Allow using procedures without linkage


How Does It Work?

Internal interrupts: handled completely by CPU + SW

External interrupt involves CPU + SW + peripherals External device delivers “INTR” signal (or NMI, SMI)

CPU acknowledges with “INTA” bus cycle

Device sends interrupt# on Data bus

Usually requires a PIC (Programmable Interrupt Controller) in system

External interrupts (except NMI/SMI) can be masked (IF=0)

SignalIntr

Data

Nmi,..

DEVICE PIC CPU

IntA


Interrupt Vector 256 entry array Each entry points to an interrupt/exception handler Located in address 0 (8086) or as pointed by IDTR register First 32 entries reserved for “Intel” for CPU internal exceptions Entry contains

Real mode: 4 bytes = Segment:offset Protected mode: 8 bytes = Interrupt/Trap/Task Gate

In both cases, control transferbased on the entry info

Figure 14-1. IDTR Locates IDT in Memory

IDTR Register

IDT Base Address IDT Limit

0151647

+ Gate forInterrupt #N

Gate forInterrupt #1

Gate forInterrupt #0

+8

+8*N

+16

+0

Real Mode Interrupt Vector

CS #NIP #N

CS #1IP #1CS #0IP #0

4

8

0

4*N


Reserved entriesVector Number Description

0 Divide Error 86

1 Debug Exception 86

2 NMI Interrupt 86

3 Breakpoint 86

4 INTO-detected Overflow 86

5 BOUND Range Exceeded 186

6 Invalid Opcode 86

7 Device Not Available 86

8 Double Fault 286

9 CoProcessor Segment Overrun (reserved) 286

10 Invalid Task State Segment 286

11 Segment Not Present 286

12 Stack Fault 286

13 General Protection 86

14 Page Fault 386

15 (Intel reserved. Do not use.)

16 Floating-Point Error 386

17 Alignment Check 486

18 Machine Check* Pentium

19-31 (Intel reserved. Do not use.)

32-255 Maskable Interrupts


Interrupt Vector Content (DOS)# Interrupt Name Address Owner

00 Divide by Zero 1756:00D2 NC01 Single Step 0070:06F4 Unknown02 Nonmaskable 1993:26D2 DOS03 Breakpoint 0070:06F4 Unknown04 Overflow 0070:06F4 Unknown05 Print Screen F000:FF54 BIOS06 Invalid Opcode F000:EB43 BIOS07 Reserved F000:EAEB BIOS08 IRQ0 - System Timer 15BE:0000 FAXTSR09 IRQ1 - Keyboard 15BE:001D FAXTSR0A IRQ2 - Reserved 0A08:0057 DOS0B IRQ3 - COM2 0A08:006F DOS0C IRQ4 - COM1 D60B:0095 BIOS0D IRQ5 - Reserved 0A08:009F DOS0E IRQ6 - Diskette 0A08:00B7 DOS0F IRQ7 - Printer 0070:06F4 Unknown10 Video 15BE:0100 FAXTSR11 Equipment Determination F000:F84D BIOS12 Memory Size Determination F000:F841 BIOS13 Fixed Disk/Diskette 15BE:0174 FAXTSR14 Asynchronous Communication F000:E739 BIOS15 System Services 0C4C:19A0 SMARTDRV16 Keyboard F000:E82E BIOS17 Printer F000:EFD2 BIOS18 Resident BASIC F000:E000 BIOS19 Bootstrap Loader 0C4C:1990 SMARTDRV1A Real-Time Clock Services F000:FE6E BIOS1B Keyboard Break 0070:06EE Unknown1C User Timer Tick 1465:00D6 CASMGR1D Video Parameters F000:F0A4 BIOS1E Diskette Parameters 0000:0522 Device Driver1F Video Graphics Characters C000:5A52 BIOS20 Program Terminate 0255:003A DOS System Area21 General DOS Functions 15BE:01A1 FAXTSR22 Terminate Address 18B9:02B1 COMMAND23 Ctrl+Break Handler Address 1993:26BA DOS24 Critical Error Handler 18B9:0155 COMMAND25 Absolute Disk Read 0C4C:19DE SMARTDRV26 Absolute Disk Write 0C4C:1A27 SMARTDRV27 Terminate and Stay Resident 0255:004C DOS System Area

# Interrupt Name Address Owner

28 DOS Idle 15BE:0222 FAXTSR29 DOS Internal - FAST PUTCHAR 0255:0052 DOS System Area2A Microsoft Networks 15BE:073E FAXTSR2B Reserved for DOS 0116:10DA DOS System Area2C Reserved for DOS 0116:10DA DOS System Area2D Reserved for DOS 0116:10DA DOS System Area2E DOS - Execute Command 0ACD:013F COMMAND2F Multiplex (Process Interface) 15BE:079E FAXTSR....40 Diskette BIOS Revector F000:EC59 BIOS41 Fixed Disk Parameters F000:E13D BIOS42 Relocated Video Handler F000:F065 BIOS43 EGA/VGA User Font Table C000:5652 BIOS44 Novell Netware API F000:EA97 BIOS45 Reserved F000:EA97 BIOS46 Fixed Disk Parameters F000:E401 BIOS....70 IRQ8 - Real-Time Clock 0A08:0052 DOS71 IRQ9 - Reserved F000:EED2 BIOS72 IRQ10 - Reserved 0A08:00CF DOS73 IRQ11 - Reserved 0A08:00E7 DOS74 IRQ12 - Reserved 0A08:00FF DOS75 IRQ13 - Redirect to NMI F000:EEDB BIOS76 IRQ14 - Fixed Disk 0A08:0117 DOS77 IRQ15 - Reserved F000:8E8D BIOS....F8 Reserved for User Programs 0040:0000 Device DriverF9 Reserved for User Programs 7A0E:7E00 UnknownFA Reserved for User Programs 0080:7C00 SSCDROMFB Reserved for User Programs 972F:0001 UnknownFC Reserved for User Programs 8304:F81C UnknownFD Reserved for User Programs EF40:7C00 BIOSFE Reserved for User Programs F729:F80B SSCDROMFF Reserved for User Programs 0002:F000 SMARTDRV

CPU Interrupts

External Interrupts


Additional Players Flags

IF flag: masks external interruptsSet/Clear by STI/CLI

TF flag: is the CPU in TRAP mode (single-step)Set/Clear by eFLAGS manipulation

Instruction IRET: “return from interrupt” = RET + few more things Software interrupts: INT n, INT3, INTO (4), BOUND (int 5)

Signals (for external interrupts) INTR: interrupt request INTA: interrupt acknowledge (bus cycle) = IO#*C*R# Data Bus: pass interrupt # NMI, SMI: additional interrupt lines

NMI - non maskable (timer, power fail)SMI - system management (OS independent power management)

When an interrupt/exception occurs, the interrupt# is decoded and then the story begins...


Interrupt Processing On Call (italic: Protected mode only)

Stack (SS:ESP) is set to level 0 stack (If not there) Store on stack

Old SS, ESP (on privilege level change only)Old FlagsOld CS, EIP (the offending instruction, or the next one!)Error code (for certain exceptions only. Contains erred selector)

Clear the IF/RF bits of EFLAGs (masks interrupts, delay single-step) Jump to the interrupt/exception handler

Handler code <Do the right things right…> Issue IRET

On IRET: Similar to regular RET + Restore EFLAGs

Figure 14-4. Stack Frame After Exception or Interruptwith and without privilege level change

Unused

Old SS

Old ESP

Old EIP

Old EFLAGs

Old CS

Error Code

ESP fromTSS

NewESP

OldESP


Interrupt Handling

Hardware side Ensure transparency to the affected task

Will return to the point of interruption w/ the right flagsDoes the minimum - only what software cannot do!

Interrupt handlers Transparent: preserve all used registers! Short as possible. If cannot - enable interrupts during handling

Do not disable interrupts for a long time! Avoid nesting of same interrupts, or make sure the software can

handle that! Some of the information can/should be extracted from the Stack

(old CS:eIP, old SS:eSP, eFLAGS, error code)

Interrupt handler writers must fully understand the relevant HW interaction (PIC + actual peripheral)


Simple Interrupt Routine

<From Barry B. Bray: page 412 CHAPTER 10: INTERRUPTS>EXAMPLE 10-3

;Procedure that displays all the registers and their contents;on the video screen in response to a trap interrupt

0000 TRACE PROC FAR

0000 50 PUSH AX ;save registers0001 55 PUSH BP0002 53 PUSH BX

0003 BE 0054 R MOV BX, OFFSET NAMES ;address register names

;display registers

0006 E8 009A R CALL CRLF ;get new display line0009 E8 00A9 R CALL DISP ;display AX000C 88 C3 MOV AX,BX000E E8 00A9 R CALL DISP ;display BX0011 88C1 MOV AX,CX0013 E8 00A9 R CALL DISP ;display CX0016 88 C2 MOV AX,DX0018 EB 00A9 R CALL DISP ;display DX0018 88 C4 MOV AX,SP001D 05 000C ADD AX,120020 E8 00A9 R CALL DISP ;display SP0023 88 CS MOV AX,BP0025 E8 00A9 R CALL DISP ;display BP0028 8B C6 MOV AX,SI002A EB 00A9 R CALL DISP ;display SI002D 8B EC MOV BP,SP ;address stack with BP002F 88 46 06 MOV AX,[BP+6]0032 E8 00A9 R CALL DISP ;display IP0035 8B 46 0A MOV AX,[BP+10]0038 E8 00A9 R CALL DISP ;display flags0038 8B 46 08 MOV AX,[BP+8]003E EB 00A9 R CALL DISP ;display CS0041 8CDB MOV AX, DS

0043 E8 00A9 R CALL DISP ;display DS0046 8C C0 MOV AX,ES0048 E8 00A9 R CALL DISP ;display ES0048 8C D0 MOV AX,SS004D E8 00A9 R CALL DISP ;display SS

0050 5B POP BX ;restore registers0051 5D POP BP0052 58 POP AX0053 CF IRET0054 TRACE ENDP

Note:; Call CRLF: print new line; Call DISP: print reg name & value

Push registers

Manipulates stack

Restore registers


Setting/Clear TF Flag There are no special instructions that set or clear the trap flag. Flag is on the stack at [BP+8]

; Procedure that sets TF to enable trap0000 TRON PROC FAR0000 50 PUSH AX ; save registers0001 55 PUSH BP0002 8B EC MOV BP, SP ; get SP0004 88 46 08 MOV AX, [BP+8] ; get flags0007 80 CC 01 OR AH,1 ; set TF000A 89 46 08 MOV [BP+8],AX ; save flags000D 5D POP BP ; restore registers000E 58 POP AX000F CF IRET0010 TRON ENDP

;Procedure that clears TF to disable trap0010 TROFF PROC FAR0010 50 PUSH AX ; save registers0011 55 PUSH BP0012 8B EC MOV BP, SP ; get SP0014 88 46 08 MOV AX, [BP+8] ; get flags0017 80 ES FE AND AH, 0FEH ; clear TF001A 89 46 08 MOV [BP+8],AX ; save flags001D 5D POP BP ; restore registers001E 58 POP AX001F CF IRET0020 TROFF ENDP


Order of interrupts Last instruction traps (breakpoint) External interrupts Fetch related (Code page fault) Decode related (Illegal opcode) Execution related (Data page fault)


Hardware Example

10K

D0D1D2D3D4D5D6D7

RD#WR#A0A1RESETCS#

PA0PA1PA2PA3PA4PA5PA6PA7

PB0PB1PB2PB3PB4PB5PB6PB7

PC0PC1PC2PC3PC4PC5PC6PC7

D0D1D2D3D4D5D6D7

DAV#

Keyboard

8255A-5

1 1 1 1 2 2 2 2 Y Y Y Y Y Y Y Y1 2 3 4 1 2 3 4

1 1 1 1 2 2 2 2 A A A A A A A A 1 21 2 3 4 1 2 3 4 G G

D0D1D2D3D4D5D6D7

IORC#

A1A2

RESET

WAIT2#

IOWC#A0A3A4A5A6A7A8A9

A10

A11A12A13A14A15

INTR

INTA#

I1I2I3I4I5I6I7I8I9I10

O1O2O3O4O5O6O7O8

16L8

vcc

U1

STB#

U3

U2

74ALS244

An 8255A-5 Interface to a KB from 80286 using interrupt vector 40H

123456789

11

1918171615141312

3433323130292827

53698

35

6

432140393837

1819202122232425

1415161713121110

18

16

14

12

9

7

5

3

2

4

6

8

11

13

15

17

1

19

D0-D7: 00000010 = 40H


HW Intr Routine <From Barry B. Bray: page 422 CHAPTER 10: INTERRUPTS>

EXAMPLE 10-5; Interrupt service procedure that reads a

key; from the keyboard.

= 0500 PORTA EQU 500H ; port A= 0506 CNTR EQU 506H ; control

register0000 0100 FIFO DB 256 DUP (?) ; FIFO buffer0100 0000 INP DW ? ; input pointer0102 0000 OUTP DW ? ; output pointer

0104 KEY PROC FAR

0104 50 PUSH AX ; save registers0105 53 PUSH EX0106 57 PUSH DI0107 52 PUSH DX

0108 2E: 8B 1E 0100 R MOV BX, INP ; address FIFO010D 2E: 88 3E 0102 R MOV DI, OUTP0112 FE C3 INC BL ; test for full0114 38 DF CMP BX, DI0116 7411 JE FULL ; if full

0118 FE CB DEC BL011A BA 0500 MOV DX, PORTA011D EC IN AL, DX ; get data011E 2E: 88 07 MOV CS:[BX],AL ; save data0121 2E: FE 06 0100 R INC BYTE PTR INP0126 EB 07 90 JMP DONE

0129 FULL:0129 BO 08 MOV A/8 ; disable

interrupts0128 BA 0506 MOV DX, CNTR012E EE OUT DX, AL

012F DONE:012F 5A POP DX ; restore

register0130 5F POP DI0131 58 POP EX0132 58 POP AX0133 CF IRET

0134 KEY ENDP

<From Barry B. Bray: page 423 CHAPTER 10: INTERRUPTS>

EXAMPLE 10-6; procedure that reads a key from the FIFO and; returns with it in AH

0104 READ PROC FAR0104 53 PUSH EX ; save registers0105 57 PUSH DI0106 52 PUSH DX

0107 EMPTY:

0107 2E: 8B 1E 0100 R MOV BX, INP ; address FIFO010C 2E: 8B 3E 0102 R MOV DI, OUTP0111 38 DF CMP BX, DI ; test for empty0113 74 F2 JE EMPTY ; if empty

0115 2E: 8A 25 MOV AH, CS:[DI] ; get data0118 8009 MOV AL9 ; enable 8255A-5 intr011A BA0506 MOV DX, CNTR011D EE OUT DX, AL

; increment pointerO11E 2E: FE 06 01M R INC BYTE PTR OUTP0123 5A POP DX : restore registers0124 5F POP DI0125 58 POP EX0126 CB RET

READ ENDP

0

256

INP

OUTP

INP (after KB press)

OUTP (after “READ”)

FIFO


The Programmable Interrupt Controller (PIC)

Heavy-weight legacy, the 8259A component - PC/AT Today part of the “chip-set”

But with full software compatibility

Newer systems use the Advanced PIC (APIC) Supports multiprocessor systems Not discussed here

Functions: React to an external interrupt, according to priorities Convert the interrupt request (IRQ) into a vector Inform the peripheral that the interrupt was acknowledged

Vector addresses can be programmed Using a set of “commands”

Up to 8 components can be cascaded. In practice, only 2


8259A connection 10K

D0D1D2D3D4D5D6D7

A0CS#RD#WR#SP/EN#INTINTA

IR0IR1IR2IR3IR4IR5IR6IR7

CAS0CAS1CAS2

8259A

D0D1D2D3D4D5D6D7

A1

IORC#

INTRINTA#

WAIT2#

IOWC#A0A2A3A4A5A6A7A8A9

A10A11A12A13A14A15

I1I2I3I4I5I6I7I8I9I10

O1O2O3O4O5O6O7O8

16L8

U2

U1

An 8259A Interface to the 80286

123456789

11

1918171615141312

1110 987654

27132

1617

26

1819202122232425

121315

vcc


Cascading 8259 & IRQs

Number Address Name Owner

IRQ 00 15BE:0000 Timer Output 0 FAXTSR

IRQ 01 15BE:001D Keyboard FAXTSR

IRQ 02 0A08:0057 [Cascade] DOS

IRQ 03 0A08:006F COM2 DOS

IRQ 04 D60B:0095 Mouse BIOS

IRQ 05 0A08:009F LPT2 DOS

IRQ 06 0A08:00B7 Floppy Disk DOS

IRQ 07 0070:06F4 LPT1 Unknown

IRQ 08 0A08:0052 Realtime-Clock DOS

IRQ 09 F000:EED2 Reserved BIOS

IRQ 10 0A08:00CF Reserved DOS

IRQ 11 0A08:00E7 Reserved DOS

IRQ 12 0A08:00FF Reserved DOS

IRQ 13 F000:EEDB Coprocessor BIOS

IRQ 14 0A08:0117 Fixed Disk DOS

IRQ 15 F000:8E8D Reserved BIOS

8259 #1

8259 #2

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

INTRto CPU

INTRto next8259


Lecture 8: The Software-Hardware Interface


The Software/Hardware Interaction

Background Hal/BIOS Reset/Boot Memory management More...


Old Computers...

Applications talk directly to hardware OK for embedded systems…

Program is burned in ROM

Problematic for a personal computer Who loads programs? What if underlying HW changes? Managing shared resources: memory,

files, protection? No reuse of software Programmer must be expert in HW?

Reprogrammable computer must have an OPERATING SYSTEM

Application

Hardware


OS Role Ideal model

Applications talk with the OS OS talks with the HW

This seems to solve all problem: OS loads programs HW changes require OS changes, but

not application change OS manages shared resources OS contains reused (shared) SW Programmers need not be aware of HW

In practice, however, applications continued to interact with HW

No OS support for new devices Performance requirement (e.g., screen)

Applications

OS

Hardware

Ideal Model

Applications

OS

Hardware

Practical Model


General OS models Monolithic

All procedures interact All can talk to the HW

Hardware

System Services

ApplicationProgram. . . . . .

User Mode

Kernel Mode

Monolithic Operating System

OperatingSystem

Procedures

ApplicationProgram

File System

Memory and I/ODevice mgmt

ProcessorScheduling

Hardware

System Services

ApplicationProgram. . . . . .

User Mode

Kernel Mode

Layered Operating System

OperatingSystem

Procedures

ApplicationProgram

Layered A procedure can talk to lower

layers only May be enforced by HW


General OS Models (Cont.) Client/Server Ideally - kernel consists of message passing only

Clean, modular, secure... but slow

In practice - kernel contains more. e.g. Scheduler, memory mgmt, device drivers, etc...

Hardware

MicroKernel

User Mode

Kernel Mode

Client/Server Operating System

ServiceProcedures

ProcessServer

FileServer

MemoryServer

NetworkServer

DisplayServer

ClientApplication

ClientApplication

. . . . . .

Send

Reply


Windows NT Basic Layers

Hardware

System ServicesSystem Services

ProcessManager

ProcessManager

VirtualMemoryManager

VirtualMemoryManager

LPCFacility

LPCFacility

Object Manager

Object Manager

SecurityReference

Monitor

SecurityReference

Monitor I/OSystem

I/OSystem

KernelKernel

Hardware Abstraction LayerHardware Abstraction Layer

Hardware

Hardware

UserKernel

Win32Client

Win32Client

Win32sub

system

Win32sub

system

PosixClient

PosixClient

Posixsub

system

Posixsub

system

. . . . . . LogonProcess

LogonProcess

Securitysub

system

Securitysub

system

. . . . . .


Hardware Independence

Separate between the OS and the actual hardware Allows

Same OS can BOOT on different platforms as-is Same OS can run on different platforms as-is

At boot time: hardware independence achieved by BIOS (basic I/O system)

At run time: BIOS (DOS, Windows 3.1/95) HAL (NT) Device drivers

Applications

OS

BIOS/HAL

BIOS Model

Hardware

Ideal model Reality


BIOS Contains basic functionality for HW handling

Disk I/O Screen I/O Mouse, keyboard

All that is needed to boot the system - and more

Resides in a non-volatile memory (EPROM, Flash) Supplied with the computer

BIOS suppliers: Phoenix, AMI (American Megatrend)Usually in cooperation with chip and system makers

OS supplier: Microsoft, Microsoft, ...

Interfaces to devices are done also by: Applications (DOS only) Device drivers (Windows)

BIOS additional tasks Boot Setup (HW parameters)


Inter-Layer Interface

API (application programming interface) Defines the function, parameters, return values. etc...

Level of interface: Programmer level: C-like, machine independent Code level: machine dependent (stack layout, etc..) Transfer from programmer code to library code

Regular call, standard API

Transfer from user code to OS code (still in user mode!) Call/Jmp. May use non-standard API

Transfer from user mode to system mode Software interrupt Inter-level call (using call gate)

extern int read(int, void *, unsigned); main (){ int file=0,buf[100],icount=10,count; count = read(file,buf,icount);}

disassembly for read.out:

.._cerror() 80481c8: a3 70 94 04 08 movl %eax,0x8049470 / errno <- code 80481cd: b8 ff ff ff ff movl $0xffffffff,%eax / ret val <- -1 80481d2: c3 ret 80481d3: 90 nop 80481d4: 90 nop 80481d5: 90 nop 80481d6: 90 nop 80481d7: 90 nop

_read()read() 80481d8: b8 03 00 00 00 movl $0x3,%eax / set read code 80481dd: 9a 00 00 00 00 07 00 lcall 0x7,0x0 / enter kernel 80481e4: 72 02 jb 0x2 <80481e8> / check error 80481e6: c3 ret / OK: return 80481e7: 90 nop 80481e8: 3c 5b cmpb $0x5b,%al 80481ea: 74 ec je 0xec <80481d8> 80481ec: e9 d7 ff ff ff jmp 0xffffffd7 <80481c8> / error routine 80481f1: 90 nop 80481f2: 90 nop 80481f3: 90 nop

main()[3] 80481f4: 55 pushl %ebp 80481f5: 8b ec movl %esp,%ebp 80481f7: 81 ec 90 01 00 00 subl $0x190,%esp 80481fd: 57 pushl %edi 80481fe: 56 pushl %esi[4] 80481ff: 2b ff subl %edi,%edi 8048201: be 0a 00 00 00 movl $0xa,%esi[6] 8048206: 56 pushl %esi 8048207: 8d 85 70 fe ff ff leal -400(%ebp),%eax 804820d: 50 pushl %eax 804820e: 57 pushl %edi 804820f: e8 c4 ff ff ff call 0xffffffc4 <80481d8> / call read lib 8048214: 83 c4 0c addl $0xc,%esp[7] 8048217: 5e popl %esi 8048218: 5f popl %edi 8048219: 8b e5 movl %ebp,%esp 804821b: 5d popl %ebp 804821c: c3 ret

Unix system code sequence

#include <time.h>main (){ time_t l1,l2; l2=time(&l1); // time receives pointer to time_t (long) and returns time_t }--------------------------------------------------------------- User code4: {00401010 55 push ebp00401011 8bec mov ebp,esp00401013 83ec08 sub esp,0000000800401016 53 push ebx00401017 56 push esi00401018 57 push edi5: time_t l1,l2;6:7: l2=time(&l1);00401019 8d45fc lea eax,dword ptr [l1]0040101c 50 push eax0040101d e812000000 call time (00401034) / go to library00401022 83c404 add esp,0000000400401025 8945f8 mov dword ptr [l2],eax8: }00401028 5f pop edi00401029 5e pop esi0040102a 5b pop ebx0040102b c9 leave0040102c c3 ret---------------------------------------------------------------- Library code_time:00401034 55 push ebp00401035 8bec mov ebp,esp00401037 83ec10 sub esp,000000100040103a 8d45f0 lea eax,dword ptr [ebp-10]0040103d 50 push eax / enter DLL0040103e ff15bc804000 call dword ptr [__imp__GetLocalTime@4 (004080bc)]00401044 0fb74dfc movzx ecx,word ptr [ebp-04]00401048 0fb755fa movzx edx,word ptr [ebp-06]0040104c 51 push ecx0040104d 52 push edx0040104e 0fb745f8 movzx eax,word ptr [ebp-08]00401052 0fb74df6 movzx ecx,word ptr [ebp-0a]00401056 50 push eax00401057 51 push ecx00401058 0fb755f2 movzx edx,word ptr [ebp-0e]0040105c 0fb745f0 movzx eax,word ptr [ebp-10]00401060 52 push edx00401061 50 push eax00401062 e87f010000 call ___loctotime_t (004011e6)00401067 83c418 add esp,000000180040106a 8b4d08 mov ecx,dword ptr [ebp+08]0040106d 85c9 test ecx,ecx0040106f 7402 jz _time+0000003f (00401073)00401071 8901 mov dword ptr [ecx],eax00401073 8be5 mov esp,ebp00401075 5d pop ebp00401076 c3 ret---------------------------------------------------------------- Dll code82929af0 683970f7bf push bff7703982929af5 e923aa663d jmp bff93c1d / enter kernel---------------------------------------------------------------- Kernel codebff93c1d ...... ......... ....... jmp far .... / enter ring 0003b:03d8 cd30 int 30 ;#0028:c0236288 --------Priveleged level

WIN95 system code sequence


DOS/BIOS Interface Very low level interface definition

Parameters are passed in registersAH contains function code

Return values are passed in registersCF contains error indication (Simple JC to check!)

Actual call is done by software interruptDOS: int 0x21BIOS: various 0x10-0x1A, e.g., 0x13 for disk, 0x16 for KBD

Low level interface pros/cons+ Faster code

+ Denser code size

- Very machine dependent: not portable, less extensible

Was OK for its time. Inappropriate for today


DOS ExamplesDOS: Delete file: INT 21H / DOS call

Input Parameters AH: 41HDS:DX asciiz pointer (“myfile.dat”)

Return values AX: 0: succeed

2: file not found5: fail to delete

example: LDS DX, file_name / set file nameMOV AH, 41H / delete codeINT 21H / dos callCMP AX, 0 / succeeded?JNZ fail

....file_name DB “myfile.dat”,0

----------------------------------------------------------------------------------------------

DOS: Read from file/ Device: INT 21H / DOS call

Input Parameters AH: 3FHBX file handleCX bytes to readDS:DX pointer to data buffer

Return values AX: CF=0 Number of bytes read

AX: CF=1 error code (5-access denied, 6-invalid handle)


BIOS Example

BIOS ReadDiskSector: INT 13H / Diskette/Fixed disk BIOS serviceAH: 2H / Read Disk sector

Input Parameters: AL: Number of sectors to readCH: cylinder # (low 8 bits)CL: Cylinder/Sector #DH: Head#DL: Drive#ES:BX Pointer to buffer

Return Values: AH: 0 = no errorFixed disk error code

CF: 0 = no error, 1 = error

Normal operation Application calls DOS to bring bytes from file DOS examines which device is associated with the file For disk:

DOS file system translates logical file address into physical disk address DOS Calls BIOS to read the actual bytes


BIOS Example (Cont.)

BIOS ReadKeyBoardInput: INT 16H / keyboard BIOS service

Input Parameter: AH: 0H / Read keyboard

Return Values: AH: Scan code

AL: Ascii code

Normal operation Application calls DOS to bring bytes from file DOS examines which device is associated with the file For keyboard: Calls BIOS to read the actual bytes

In practice, applications may access the keyboard directly+ Faster access+ Can get more info (extended code)- Some capabilities are lost: e.g., cannot redirect input!


Boot-Strap Process Reset

CS:IP is 0xf000:0xfff0 (within BIOS), CR0: Cache disabled, write-thru Perform optional BIST (built-in self test) Enter Real mode, most general/segment registers are 0

POST = Power on Self Test Start OS load (OS independent)

Call ROM loader Identify boot disk Load master boot record - pass control Find boot sector - load and pass control

DOS Specific (also WIN3.1) Find IO.SYS - load and pass control (BIOS extensions) Find MSDOS.SYS - load and pass control (DOS code) Load drivers Execute config.sys (more Device drivers) Find command.com - load and pass control (“shell”) Find autoexec.bat - run (initial apps) Prompt user


POST – Power-On Self Test

Check CPU integrity (regs, math, protected mode, etc..) BIOS checksum CMOS checksum (modifiable values for setup) Check/initialize DMA, keyboard controller Check lower 64KB of memory Check/initialize PIC, Cache, display controller Check all memory Check/initialize Serial & Parallel ports Check/initialize Disk/Diskette controller Look for BIOS extensions in segments A000:F000

Identified by AA55 on a 2K memory boundary. Code follows

Other operations: create shadow ROM BIOS, etc... Start Boot Strap


Boot-Strap - Comparison

NTNTInt 19

NT Loader

NT Min Kernel

NT Kernel

Device Driver

Login Process

BIOSBIOSReset

POST

BIOS

BIOS extensions

DOSDOSDOS Device Driver

DOS

TSRs

DOS Applications

WIN 3.1WIN 3.1Windows VxD

Windows Drivers (DLLs)

Windows apps and DLL

WIN 95WIN 95Windows VxD

Windows Drivers (DLLs)

Windows apps and DLL

Win95option 2

Win95option 1

Schematic onlySome details ignored


DOS, Win 3.1 & Win 95 Memory Map

* EW Jan/29/980

Extended Memory

High Mem

Rom BIOS

Empty RAM

Video, SCSI, Network,Expanded Memory, etc..

Empty RAM

DOS Free Memory

Applications

TSRs

DOS Device Drivers

DOS COMMAND.COM

BIOS

Interrupt Vector

4GB

1M+64K: 10FFEF

1M: 0FFFFF

1M-128K: 0E0000

640K: 9FFFF

1K, 3FF

Upper memory


Win 95 memory map

Shared

Ring 0

Shared

Ring 0

Shared

Ring 3

System Tables (IDT,GDT ..)

VxD

System DLLs

Memory Mapped Files

Top portion ofWin-16 global heap

Per Process area

Win32

Lower portion ofwin-16 global heap

MS-DOS

Per Process

Ring 3

4GB

3GB

2GB

4MB

0


NT Virtual Memory Map

Non Paged Pool(Cannot be swapped out)

Paged

Direct MappingAddresses

Paged

4GB

2GB

0

User & SystemDLLs

Per Process

Ring 3

System

Shared

Ring 0

Kernel, Drivers, IDT, GDT, TSS..


Memory Model in Win/NT and Win95

Minimize usage of segments Use paging for address and access check

Address error:Memory access beyond program limits is checked by the OS following page fault exception

Access error:Memory Access is checked for read/write & privilege level in the page table and the address segment attribute

There are different privilege levels for the OS and the user context (need to remember this during interrupt time)

The FS segment register is used as a Task pointer to the OS database

Win/NT Still supports 16-bit and real mode segmentation applications with Virtual X86 (need to remember during Interrupt service)


Win/NT / Win ‘95 Page Tables

PageDirectory

Process 1’s CR3

PageDirectory

Process 2’s CR3

PageTable

PageTable

4K

PageTable

PageTable

PageTable

PageTable

System PageTable

4K

4K

4K

PhysicalMemorypages

4K


Background: Processes and Threads

Different for different operating system, we’ll talk Win32 (Windows ‘95 and Windows NT)

Processes are the primary running objects One executable and several DLLs Address space Object (e.g. file) handles Security attributes Execution priority

A process can have multiple threads For concurrency Minimal state: just registers and a private stack

Threads are known to the OS Preemptive thread switch, as well as preemptive process switch

Threads need to synchronize access to shared resources


Process and Thread Memory Management

Each Process has its own page table Separate CR3

All user processes use the same segments

Process switch includes switching to its pages

Linear mapping address are the same as the logical address “Flat” memory model

Each thread has different stack allocation obtained from the its process memory space

No paging magic, simply a different value in ESP

How memory is shared between Processes Same physical memory page is mapped by two or more

processes’ page tables

=> Two linear addresses calculate to the same physical address

Each process switch causes flushing of the TLB


Mode Switch Enter protected mode:

Prepare at least minimal GDT/IDT/TR. Set GDTR CR0.PE <= 1. Segment register still points to old base/limit Load segment registers as needed Perform board initialization - as needed Avoid interrupts while switching. Now update IDTR. Update TSS

Enter Paging: Prepare minimal page tables (at least 1directory, 1 table) CR3 <= page directory, CR0.PG <=1 The above code in page which phys=linear.

Need jump before leaving the page.

Exit paging: The same thing, with CR0.PG <= 0

Exit protected (in 286 - just reset) Set all segments so limit=0xffff. Disable interrupts CR0.PE <= 0 IDTR = 0, enable interrupts


Get_vector: ...XOR EAX, EAX; MOV ES, AX / seg=0MOV AX, ES:x*4MOV old_vector_p, AX / get offsetMOV AX, ES:x*4+2MOV old_vector_p+2, AX / get selector

...old_vector_p: DD ? / apart from code! ....Set_vector: ...

XOR EAX, EAX; MOV ES, AX / seg=0MOV AX, new_vector_pMOV ES:x*4, AX / set offsetMOV AX, new_vector_p+2MOV ES:x*4+2, AX / set selector

...new_vector_p: DD new_service_proc / new proc addr ...int_service: IN AL, port# / read status of the keyboard

CMP AL, hot_key / check if special valueJZ hot_key_service / if OK, get requestJMP [old_service_p] / jmp far to saved address...

hot_key_service: ...

/** C example */void chain_intr(unsigned int vector,

unsigned char mask){

void (_cdecl _interrupt_far * _cdecl np)(unsigned);

_disable();np = _dos_getvect( vector);_dos_setvect(vector, process_intr);_enable();/* program PIC to enable

interrupts */outp( PIC_e, inp( PIC_e ) & mask);outp( PIC_b, EOI);

}

Interrupt Vector Chaining (simplified)


Running DOS under Windows - Virtual X86

Allow execution of real mode apps under protected mode OS

Use Real-Mode logical-to-linear address translation Linear = seg*16+offset

Paging is used for address translation Can have several VM X86 tasks simultaneously Lower 1MB can be mapped to anywhere

DOS/BIOS code is copied for each task Emulate (virtualize) privileged operations

Invokes a general protection handler Handler decides how to continue

In protected mode (level 0) - e.g., I/OBack in VM mode (level 3) - e.g., divide by 0


VM X86 Address Translation

Segment Unit Page Unit32

16/3

2

Effective Address (offset)

BaseLinearaddress

Physical address

DOS BOXVM1

DOS BoxVM2

Page table 1

Page table 2

20bit, 0-1M

20bit, 0-1M

Linear space

Linear space

Memory

4GB


Virtualizing Privileged I/O and Instructions

Windows has APIs (at the VxD level) to tie a handler to a port

Upon an exception:1. Check interrupt vector

2. Enter protected

mode handler

3. Optionally,

Process in VM X86

TASK A

TSS A

IO Bit Map

Level 0Handler

Level 3DOS/Windows

Application

IOPL = 0CPL=3

GP13

OUT DX, AL

LGDT

DIV AX,0

Interruptvector

1

2

3


Other


Other things

Debug/Test features. I/O addressing Initialization MultiProcessing (Locking) Event Monitoring (Pentium)


X86 SW Challenge

SW architects - define models of computations 16 bit: small/compact/medium/large 32 bit: flat is enough...

Compilers - Generates fastest code for: Family of processors Individual processor

Application writers Write faster code.


Lecture 9: Performance


General Performance Questions

How efficiently is the application running on the platform? How are resources being used? CPU, memory, IO, bus? Where CPU time is being spent? Can program logic and or algorithms be improved in time

consuming code

Section 1: Overview 181Microcomputer CourseKoby Gottlieb, 3/03, p-181

Order of Optimizations

Time everythingAccess memory efficientlyMaximize branch predictabilityAvoid stallsUse a good blend of instructions


Component profiling:

flat profiling Collecting time based and event based information while running

the application Break down of active executables (exe’s, dll’s, drivers etc.)

call graph profiling Display of the call graph Collecting the information based on hierarchical profiling

Adding the time spent in functions that were called to the time of each function

Cache utilization profiling Tradeoff between code and data size.

Example: Processing data in phase A and B. Running phase A to until a data buffer between phases is full.


Dynamic Execution

Dynamic Data Flow Analysis Determines data and register dependencies to detect opportunities

for out-of-order execution.

Deep Branch Prediction Processor can predict many branches ahead of the instruction

pointer

Speculative Execution Allows the processor to execute instructions ahead of the program

counter but still retire the results in the order of the original instruction stream.


Architecture Diagram

Fetch & Decode Unit(In order unit)•Fetches instructions•Decodes instructions to µOPs•Performs branch prediction

Retirement Unit(In order unit)•Retires instructions in order•Writes results to registers/memory

Dispatch / Execute Unit(out of order unit)•Schedules and executes µOPs•Contains 5 execution ports

L2 cache

Bus Interface Unit

L1 data cacheL1 instruction

cache

Instruction Pool/reorder buffer•Buffer of µOPs waiting for execution

System bus

Fetch Load Store

Section 2: Optimizations: Fetch Unit


Fetch Unit

Reads cache lines (32 bytes) from L1 instruction cache

Sends instructions to the decode unit

Fetches the next instructions based on branch prediction

Fetch & Decode Unit

Retirement Unit

Dispatch & Execute Unit

L2

Bus Interface Unit

data cacheIcache

System bus

Instruction Pool(ROB)



Branch Prediction

Static Branch Prediction: Forward conditional branches will be predicted as not taken

jz @forward

Backward conditional branches will be predicted as takenjnz @backward

Unconditional direct branches are always takenjmp label

Unconditional indirect branches fall throughjmp [eax]

Dynamic Branch Prediction: Based on prediction and history in Branch Target Buffer (BTB)

If you can guess from the history what the branch will be, so will the processor (0101 -- taken, not taken, taken, not taken)



Static Branch Prediction Example

What does the static branch prediction algorithm predict for the following branches?

loop:

add eax, [edi]

dec ecx

add edi, 4

jnz loop

cmp eax, ebx

je start

mov eax, ecx

start:

mul ebx

test ebx, ebx

jnz not_zero

mov ebx, 50

not_zero:

add ebx, 10

mov ebx, [esi]

push ebx

push eax

call swap

lea edx, [ecx *4]

jmp [edx]

1

Begin:

shr eax, 1

jnc Begin

mov ebx, eax

2 3

4 5 6

Answers: 1-3 taken, 4-6 not taken



Return Stack Buffer

Return Stack Buffer is used to predict RETs Every time a CALL is executed, the return address is pushed on

the Return Stack Buffer Every RET pops the Return Stack Buffer and uses the address for

prediction

Make sure for each CALL there is a RET or the addresses will be out of sequence and all return address predictions will be wrong

Section 2: Optimizations: Decode Unit


Decode Unit

Converts Intel Architecture instructions into micro-ops (µOps)

Stores µOps in the Instruction Pool

To reduce false register dependencies, register references are converted to an internal register set Register alias table Register renaming

Fetch & Decode Unit

Retirement Unit


L2

Bus Interface Unit

data cacheIcache

System bus




The Decoders

3 parallel decoders Decode instructions which require 4 or fewer µOps

Microcode instruction sequencer Decodes instructions which are 5 or more µOps

Do not reorder code, minimal performance gain

microcode instruction sequencer

decoder 01-4 µOps

decoder 21 µOp

decoder 11 µOp

register allocation table

ROB/Instruction Pool



Register Renaming

40 internal registers used to reduce data dependencies No user access and transparent to software

What the processor sees:Asm code internal to processor

mov al, mem

add al, 6

mov [edi], al

mov eax, 5

mov dest1, eax

mov ax, mem2

shr ax, 3

mov dest2, ax

mov <r0>, mem

add <r0>, 6 =>> <r3>

mov [edi], <r3>

mov <r1>, 5

mov dest1, <r1>

mov <r2>, mem2

shr <r2>, 3 =>> <r4>

mov dest2, <r4>



Register Stalls

Register stalls occur when a large register is read after one of its partial registers is written

The instructions do not have to be adjacent because of out-of-order execution

Caused by register renaming The large register might be split among multiple internal registers

Penalty: Time it takes to retire the small register write

mov al, byte ptr [mem]

inc eax ; stalled waiting for al to retire



The Clean Bit

Used to indicate that a register is zero Means the partial register equals the larger register Only eax, ebx, ecx, and edx have clean bits

Partial register stalls will NOT happen if the clean bit is on No direct user access to this bit, completely internal to the

processor



What turns on the clean bit?

Zero the register with XOR or SUB ONLY!

The clean bit can stay on for multiple instructions The clean bit is not copied between registers

xor eax, eax ; eax’s clean bit is on

sub ebx, ebx ; ebx’s clean bit is on

xor eax, eax ; eax’s clean bit is on

mov ebx, eax ; does not copy the clean bit



What turns off the clean bit?

Clean bit is turned off when the operation has the possibility of changing the larger register

Use of the ‘h’ register turns off the clean bit Keep the clean bit on when using partial registers

xor eax, eax ; clean bit on

mov al, 5 ; clean bit still on

dec al ; clean bit still on

inc eax ; clean bit off, could spill into ah

mov al, 6 ; partial load

inc eax ; stall



Register Stall Examples

Identify code segments with register stalls

lahf

mov ebx, eax

sub eax, eax

mov ax, bx

inc eax

xor eax, eax

mov ax, bx

inc eax

xor eax, eaxmov ah, 5push eax

mov eax, 0

mov ax, bx

inc eax

movzx eax, bx

push eax1 2 3

4 5 6

pop ax

mov [mem], eax

xor eax, eax

mov al, bl

mov ah, cl

inc eax

pop eax

mov [mem], al7 8 9

Answers: odd boxes stall



Mixing of Code and Data

Don’t put data in the instruction stream put data:

in the data segment or on the stackat the end of the code segmentat the end of a function

Don’t make the processor decode data wastes decode time

usually happens with indirect jumps and their tables reduces cache efficiency

Section 2: Optimizations: Execution Unit



Only out-of-order unit Schedules and executes µOPs

based on data-flow constraints

Fetch & Decode Unit

Retirement Unit


L2

Bus Interface Unit

data cacheIcache


System bus



Execution Units Diagram

ROBReservation

Station

Port 1

Port 3Port 2 Port 4

IntegerUnit

Load Unit

Store Address

Calculation Unit

Store DataUnit

Port 0

MOV EAX,[Memref]

MOV [Memref], EAX

INC

FMUL

FDIV

FADD

LEA

PAND

PMULHW PackedMultiply

PackedALU

Address Generation

Unit

FP Unit

IntegerUnit

PackedShift

PackedALU

ADDJMP

PAND

PSRLQPACKSSWB



Out-of-Order Execution and Data Dependencies

Out-of-Order execution is limited by data dependencies

for (i=0; i<SIZE/2; i+=2){

accum += array[i];accum += array[i+1];

}

for (i=0; i<SIZE/2; i+=2){

accum1 += array[i];accum2 += array[i+1];

}accum = accum1 + accum2;

Good: Unrolled, but... Better: Dependencies reduced



Serializing Instructions

Serializing instructions cause execution to be delayed until all previous instructions execute, retire, and results are written back to memory. read-modify-write instructions with lock prefix

Typically used in multiprocessor/ shared memory environments XCHG with a memory operand mov to control or debug registers CPUID FLDCW (used in _ftol)



ftolftol

Casting floats to Integer formats cause serious penalties on the Pentium® II processor ~50 clks using ftol

~25clks in exe / ~30 clks in stalls!

3 Problems: Serialization Partial stall with fnstcw Memory stall

Solutions: In most cases you can use round



Store Address Unknown Blocking

If the destination address for a store is unknown, all subsequent loads will stall. e.g.

e.g.

loop:

mov eax, [esi] ;stalls waiting for add edi,4 to retire

add esi, 4

and eax, 0ffh

add edi, 4

mov [edi-4], eax

dec ecx

jnz loop

mov eax, [table+ebx*4]

mov [eax], ecx

mov ecx, [esi] ; stalls



Self Modifying Code

Do not modify code that might already be in the processor’s execution pipeline Both the in-order and out-of-order units will be flushed May cause branch prediction problems Processor stalls the in-order unit until the ROB empties Cache problems and penalties

Not the same as self generating code Code that is created once and executed, but never modified suffers

no penalty

Section 2: Optimizations: Retirement


Retirement Unit

Commits µOps to permanent machine state in original order

Removes µOps from reorder buffer

Fetch & Decode Unit

Retirement Unit


L2

Bus Interface Unit

data cacheIcache

Instruction Pool

System bus

Section 2: Optimizations: Retirement


Store Forwarding

Improves the performance of load operations of recently stored data Must be same data size and alignment Saves the time it takes the cache to combine the data

GOOD BAD

32 bit load

8 bit store

16 bit store

16 bit load

32 bit store

32 bit store

32 bit load

8 bit load

16 bit load

32 bit store

32 bit load

16 bit store


Store Forward Example

Avoid mixing MMx memory LD’s with scalar integer memory ST’s (i.e., 8-, 16-, 32-bit ST’s) to the same SIMD data

Code sequences such as the following will cause MOB blocks:

MOV mem, eax # dword store to “mem”, low-half of SIMD variable

MOV mem+4, ebx # dword store to “mem+4”, high-half of SIMD variable

:

MOVQ mm0, mem # qword load of entire SIMD variable “mem”

# MOB blocks this load because the data requested is

# not a subset of the data of either ST in the MOB



Store Forwarding C code example

float f;

unsigned int I;

.

.

.

f = I;

Why is it a problem?


Store Forwarding code example

Assigning unsigned int to floatmov eax, DWORD PTR _imov DWORD PTR -16+[ebp], eaxmov DWORD PTR -16+[ebp+4], 0fild QWORD PTR -16+[ebp]

A better code could be:int uint_fix[2] = { 0, 0x4f800000};f1 = ((int) i) + *(((float *) &uint_fix[i>>31]));

And in assembly:mov eax, DWORD PTR _ifild DWORD PTR _ishr eax, 31fadd DWORD PTR _uint_fix[eax*4]

Section 2: Optimizations: Memory & Cache


Memory Interface

16 Kb data cache 2 way, set associative

Write back cache by default

Fetch & Decode Unit

Retirement Unit


L2

Bus Interface Unit

data cacheIcache

Instruction Pool

System bus



The Memory Types

Write back (WB) Default for system memory All reads and writes are cached Write misses allocate and fill cache lines Read and write sequentially for maximum performance

Uncacheable memory (UC) Accesses are executed in order without reordering Accesses to other memory types can pass accesses to UC No speculative accesses



Write-Combining (WC)

Writes are buffered and burst across the bus improving bus utilization Writes are stored in an internal buffer (32 bytes), delayed,

reordered, and combined Speculative reads ARE allowed Reads are NOT cached Buffer is flushed on:

UC access, different address, return from an interrupt, serializing instruction, locked, or I/O instruction XCHG with memory (fastest locking instruction)


Lecture 10: Hyper-Threading


Intel's Hyper-Threading TechnologyOverview

Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture.

Hyper-Threading Technology makes a single physical processor appear as two logical processors. the physical execution resources are shared and the

architecture state is duplicated for the two logical processors.

From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors.

From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.


Thread Level - Amdahl’s Law

Maximum Efficiency

Fraction parallel limits scalability

Key: Parallelize everything significant

0.00

0.10

0.20

0.30

0.40

0.50

0.60

0.70

0.80

0.90

1.00

0 20 40 60 80 100

Percent Parallel

Ma

xim

um

Pa

rall

el

Eff

icie

nc

y

2 Threads

4 Threads

8 Threads

16 Threads

32 Threads

64 Threads

)1(

1

pNp


Multithreading + Multiple Processors= Improved Performance

Multithreading + Multiple Processors= Improved Performance

Run threads using multiple processors

CPU 1CPU 1

CPU 2CPU 2

CPU 3CPU 3

Brief Introduction to ThreadsBrief Introduction to Threads

Multiprocessing


Open DB’s Address Book Open DB’s Address Book

InBox CalendarInBox Calendar


Functional Parallelism

Concurrent Tasks

Apply different operations to different data elements


Data Parallelism Apply the same operation to different data elements


function SpellCheck {

loop (word = 1, words_in_file)

compare_to_dictionary (word);

}

function SpellCheck {

loop (word = 1, words_in_file)

compare_to_dictionary (word);

}

Open FileOpen File Edit Spell Check Edit Spell Check


Thread Libraries – Win32* API

C language interfacesThreads exist within a single processGood for asynchronous concurrencyAll threads are peers

No explicit parent-child model Exception: main() thread

Creating Win32* ThreadsHANDLE CreateThread(

LPSECURITY_ATTRIBUTES ThreadAttributes,DWORD StackSize,LPTHREAD_START_ROUTINE StartAddress,LPVOID Parameter,DWORD CreationFlags,LPDWORD ThreadId );


*Other names and brands may be claimed as the property of others

Thread handle is a synchronization object

Functions are explicitly mapped to threads


Pentium 4 Block diagram


Execution pipeline

A high-level view of the microarchitecture pipeline. buffering queues separate major pipeline logic blocks. The buffering queues are either partitioned or duplicated to

ensure independent forward progress through each logic block.


Fetch and deliver

• Alternate between logical processors

• Execution trace cache/ Microcode ROM

• Fetch and decode instructions

• Register rename and allocation


Execution The out-of-order execution engine consists of

the allocation, register renaming, scheduling, and execution functions

Logical processors execute simultaneously Compete for schedulers, ALUs Schedulers map independent instructions to available

execution resources


The memory subsystem The memory subsystem includes:

The DTLB: translates addresses to physical addresses. It has 64 fully associative entries; each entry can map either a 4K or a 4MB page. Although the DTLB is a shared structure between the two

logical processors, each entry includes a logical processor ID tag.

Each logical processor also has a reservation register to ensure fairness and forward progress in processing DTLB misses.

the low-latency Level 1 (L1) data cache the Level 2 (L2) unified cache, and the Level 3 unified cache (the Level 3 cache is only available

on the Intel® XeonTM processor MP). Access to the memory subsystem is also largely oblivious to

logical processors. The schedulers send load or store uops without regard to logical

processors and the memory subsystem handles them as they come.


Instruction retirement

Alternate between logical processorsCommit state in program order


Hyper Threading implementation

Two logical processors for very small additional die area

Alternate between logical processorsFetch and deliverReorder and retire

Competitive sharing between logical processorsRapid execution engineCaches


OS support From the OS point of view HT is just like multi processing

Needs BIOS support for initialization

HLT instruction (for idle) The HLT instruction stops instruction execution and places the

processor in HALT stat. An enabled interrupt, NMI or Reset resume execution. The return instruction from HLT is the next instruction

There are two modes of operation referred to as single-task (ST) or multi-task (MT).

On a processor with Hyper-Threading Technology, executing HALT transitions the processor from MT-mode to ST0- or ST1-mode, depending on which logical processor executed the HALTSpin loops:

Use new pause instructionFor long wait use the OS call

Wait on object


Lecture 11: x86 and 3D graphics


Quick Intro to 3D Graphics

Glossary:Vertex – point in 3D spaceTriangle – 3 connected verticesObject – list of triangles that have the

same material properties (AKA Mesh)Texture – a 2D image that is wrapped

on the surface of a Mesh


For Each Triangle...

Geometry transform each vertex (FP)

Lighting compute lighting from light sources and surface

properties (FP)

Rasterization setup triangle for rasterization (FP) perform shading & texture mapping during

triangle fill (INT)


Rendering Processlight

source

objects

viewing planeX

Y

Z

incident light

perspective projection

Mathematical modelsfor light, objects & viewercreate a 2D image viaa 3D process


Coordinate Systems

xw

yw

zw

world xo

yo

zo

object

xv

yv

zv viewer

xl

yl

zl

light


Transformation

Transformation change the coordinate system a point lies in

ExamplesViewing Transformation - translate

objects to viewer coordinate before projection to viewing plane

Shadows - translate objects to light source coordinates to calculate shadows


NL R

S

Lighting Process

Surface

Diffuse Light scatters in all directionsSpecular Light reflects in direction of reflection vector R

V


Rasterization

Now that we have transformed and lit polygons… actually, the vertices...

And, we know where they appear on the screen…

We have to fill their interiors!


Textures

Image maps to apply surfaces.Gives impression of complex surface properties.

Can substitute for lots of polys (eg, tree bitmap on a single rectangle, vs. thousands of leaf polys)


Rasterization (again)


Flat Fill


Some lighting


And textures


Demo

SkinnedMesh.exe


History of 3D Pipeline Partitioning

Geometry

Lighting

Edge Setup

Rasterization

Application

Software 3DProcessor Does All

First Generation3D HW Accelerators

Rasterization

HW

Geometry

Lighting

Edge Setup

Application

2nd Generation HW (1998-99)

HW

Geometry

Lighting

Edge Setup

Rasterization

Application

3rd Generation HW (1999-2000)

HW

Geometry

Lighting

Edge Setup

Rasterization

Application


Memory BW: The problem

Frame buffer - ~3MB Z buffer - ~3MB For each object:

Mesh (vertices and triangle connectivity) – 1KB – 1MB Texture(s) - ~256KB

Typical game frame – 8MB – 20MB


AGP: the Solution AGP (Accelerated Graphics Port)

Larger BW than PCI Lower HW cost (less local RAM needed)

Frame/Z buffers stored in graphics local memory Texturing from system memory


Memory traffic when using AGP


How does it work

The AGP aperture is mapped as one chunk (1:1 mapping in CPU’s paging HW), both the gfx chip and the CPU reference the same addresses

The GART is mapping AGP memory address to the system memory address (like paging HW in the CPU)


Current generation

Multi-texturing: More than one texture for surface, used for details maps, reflections, refractions, lighting tricks, etc.


Programmable HW

Helps the developer in customizing its transform, lighting and texture operations


Programmable vertex machine


Programmable HW demos


Vertex Shader Processing

Setup

map vertex data to VVM input registers

execute vertex shader

collect data from VVM output registers, format and write to output

clip and render

Vertex shader declaration

Vertex shader code


Vertex Virtual Machine (DX8)

Constant Memory

96 entries

128 bits4 floats

16 entries

13 entries

12 entries

Vertex Input

Vertex Output

RegistersVertex

ShaderA0.X

128 instructions

addr

data

128 bits4 floats

128 bits4 floats

128 bits4 floats


Vertex Virtual Machine (DX9)

19 entries

12 entries

Vertex Output

RegistersVertex

ShaderA0.XYZW

256 instructions

Constant Memory

256 entries

128 bits4 floats

addr

data

128 bits4 floats

16 entries

Vertex Input128 bits4 floats

128 bits4 floats

CLC

16 entries

Loop control128 bits

4 ints

16 entries

Branch control16 BOOLs

Changes from DX8 in red


Vs.2.0 (DX9) shader language

DX8 Mov Add Mul Mad Dp3 Dp4 Slt Sge Min Max Rcp Rsq Expp Logp Dst* Lit*

DX9 additions Loop, Endloop Rep, Endrep Jnz Call Callnz Ret Exp Log M3x3, M4x3, M3x4, M3x2,M2x3** Sgn** Abs** Frc** Lrp** Pow** Nrm** Sincos**

*Removed from DX9

**Macro instructions


Vertex Shader Example

Vs.1.1

Dcl_position v0

Dcl_color0 v1

Dcl_texcoord0 v2

Dp4 oPos.x, v0, c0

Dp4 oPos.y, v0, c1

Dp4 oPos.z, v0, c2

Dp4 oPos.w, v0, c3

Mov oD0.rgb, v1

Mov oT0.xy, v2


SW Vertex shader


backup


Shading Techniques

Scan Line

A

B

C

L R

P

Polygon fill is done across the

scanline

Flat shading Color the whole polygon with one color Does not show highlights inside polygons

Gouraud shading If the object surfaces are curved, we can

approximate it by polygons Interpolating vertex intensity values along

scanline Still does not show highlights inside polygons


Texture Mapping]

1. Texture coordinates (s,t,q) for each vertex. These are interpolated along polygon edges.

2. Texture coordinate for each point on scanline is interpolated. Linear interpolation, or a quadratic approximation (for perspective correction) is used.

3. The (s,t) value maps into the source texture map. It usually does not fall on the center of a texel.

4. A texture mapping algorithm is applied to the nearest texel, and possibly surrounding texels, to determine the result.

Texturing eats CPU MIPS & bandwidth. 20-70% slower than flat shading.

(s0, t0 , q0)

(s1, t1, q1)

(s2, t2 , q2)

(si, ti)

1

2

3

4

s = 0, t = 0 s = 1, t = 0

s = 1, t = 1


Credits

This foil set was developed by Ronny Ronen

microcomputer course koby gottlieb, 3/03, p-1 microcomputer course: x86 architecture basics koby...

Documents

i486 pc

processor pc

terminology slide

introduction slide

5m60032m4gb4mb slide

pc l1995

pc l1989

pentium pro based pc