Post on 19-Dec-2015
Computer Design – Introduction
MAMAS – Computer Architecture 234267
Dr. Lihu Rappoport
Some of the slides were taken from Avi Mendelson, Randi Katz, Patterson, Gabriel Loh
General Course Information
Grade
  20% exercise (mandatory)
  80% final exam
Textbooks
  Computer Architecture: A Quantitative Approach, Hennessy & Patterson
Other course information
  Course web site: http://webcourse.cs.technion.ac.il/234267
  Foils will be on the web several days before the class
Class Focus
CPU
  Introduction: performance, instruction set (RISC vs. CISC)
  Pipeline, hazards
  Branch prediction
  Out-of-order execution
Memory hierarchy
  Cache
  Main memory
  Virtual memory
Advanced topics
PC architecture
  Motherboard & chipset, DRAM, I/O, disk, peripherals
Computer System Structure
[Figure: block diagram of a PC. Labeled components: CPU, cache, CPU bus, North Bridge, memory bus, DDRII channel 1, DDRII channel 2, graphics adapter, PCI express ×16, South Bridge, PCI, PCI express ×1, LAN adapter, LAN, sound card, speakers, IDE controller, SATA controller, USB controller, IO controller, DVD drive, hard disk, floppy drive, parallel port, serial port, keyboard, mouse]
Architecture & Microarchitecture
Architecture: the processor features seen by the “user”
  Instruction set, addressing modes, data width, …
Micro-architecture: the way a processor is implemented
  Cache size and structure, number of execution units, …
  Timing is considered uArch (though it is user visible)
Processors with different uArch can support the same architecture
Compatibility
Backward compatibility
  New hardware can run existing software
  • Core2 Duo can run SW written for Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286
Forward compatibility
  New software can run on existing hardware
  Example: new software written with SSE2™ runs on an older processor which does not support SSE2™
  Commonly supports one or two generations behind
Architecture independent SW
  JIT (just-in-time compiler): Java and .NET
  Binary translation
Technology Trends and Performance
Computing capacity: 4× per 3 years
  If we could keep all the transistors busy all the time
  Actual: 3.3× per 3 years
Moore’s Law: performance is doubled every ~18 months
  Trend is slowing: process scaling declines, power is up
[Figure: two log-scale plots comparing Logic and DRAM over time. Speed plot (1 to 1000): growth labels 2× in 3 years and 1.1× in 3 years, with the note "CPU speed and Memory speed grow apart". Capacity plot (1 to 1000000): growth labels 2× in 3 years and 4× in 3 years]
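The growth rates on this slide compound quickly, and a short calculation makes that concrete. The sketch below is only illustrative: it assumes the 2× per 3 years label belongs to logic (CPU) speed and the 1.1× per 3 years label to DRAM speed, which is how the figure labels appear to pair up. The function names are made up for this example.

```python
def growth(factor_per_period, years, period_years=3.0):
    """Total growth after `years`, given a growth factor per period."""
    return factor_per_period ** (years / period_years)


def speed_gap(years, cpu_factor=2.0, dram_factor=1.1):
    """Ratio of CPU speed growth to DRAM speed growth after `years`.

    Assumption (not stated explicitly on the slide): CPU speed grows
    2x per 3 years and DRAM speed grows 1.1x per 3 years.
    """
    return growth(cpu_factor, years) / growth(dram_factor, years)


# Doubling every 18 months (1.5 years) is the same as 4x per 3 years.
capacity_per_3_years = growth(2.0, 3.0, period_years=1.5)

# Under the assumption above, the CPU/DRAM speed gap after 9 years is about 6x.
gap_after_9_years = speed_gap(9.0)
```

Because the gap itself grows exponentially, even a modest difference in yearly improvement ends up dominating, which is the motivation for the memory hierarchy part of the course.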
Moore’s Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm
CPI – Cycles Per Instruction
CPUs work according to a clock signal
  Clock cycle is measured in nsec ($10^{-9}$ of a second)
  Clock frequency (= 1/clock cycle) is measured in GHz ($10^{9}$ cyc/sec)
Instruction Count (IC)
  Total number of instructions executed in the program
CPI – Cycles Per Instruction
  Average #cycles per instruction (in a given program)
  IPC (= 1/CPI): instructions per cycle
  $$\mathrm{CPI} = \frac{\#\text{cycles required to execute the program}}{\mathrm{IC}}$$
CPU Time
CPU Time: the time required to execute a program
  $$\text{CPU Time} = \mathrm{IC} \times \mathrm{CPI} \times \text{clock cycle}$$
Our goal: minimize CPU Time
  Minimize clock cycle: more GHz (process, circuit, uArch)
  Minimize CPI: uArch (e.g.: more execution units)
  Minimize IC: architecture (e.g.: SSE™)
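The CPU Time formula can be tried out directly. The following is a minimal sketch. The program parameters (2 billion instructions, CPI of 1.5, 3 GHz clock) are hypothetical values chosen for illustration and do not come from the slides.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU Time = IC x CPI x clock cycle.

    The clock cycle is 1 / clock frequency, so multiplying by the
    clock cycle is written here as dividing by the frequency.
    """
    return instruction_count * cpi / clock_hz


# Hypothetical baseline: 2e9 instructions, CPI = 1.5, 3 GHz clock.
baseline = cpu_time_seconds(2e9, 1.5, 3e9)

# Each of the three levers from the slide shortens CPU Time:
faster_clock = cpu_time_seconds(2e9, 1.5, 4e9)   # shorter clock cycle
lower_cpi = cpu_time_seconds(2e9, 1.0, 3e9)      # better uArch
fewer_instr = cpu_time_seconds(1.5e9, 1.5, 3e9)  # richer architecture
```

In practice the three factors are not independent: for example, an architectural change that lowers IC may raise CPI, so it is the product that has to be compared.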
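Amdahl’s Law
Suppose enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected, then:
$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$
$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$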
Amdahl’s Law: Example
• Floating point instructions improved to run at 2×, but only 10% of executed instructions are FP
$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times (0.9 + 0.1 / 2) = 0.95 \times \text{ExTime}_{\text{old}}$$
$$\text{Speedup}_{\text{overall}} = \frac{1}{0.95} = 1.053$$
Corollary: Make The Common Case Fast
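The floating point example can be checked numerically. This is a small sketch of Amdahl's Law as stated on the previous slide. The function name is chosen for this example only.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the task is accelerated.

    The new execution time, relative to the old one, is the unaffected
    part plus the enhanced part divided by its speedup.
    """
    relative_new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / relative_new_time


# Slide example: 10% of the instructions are FP and they run 2x faster.
fp_example = amdahl_speedup(0.10, 2.0)   # 1 / 0.95

# Even an enormous FP speedup cannot exceed 1 / 0.9, because 90% of
# the execution time is untouched.
fp_limit = amdahl_speedup(0.10, 1e12)
```

The second call illustrates the corollary: the achievable gain is bounded by the fraction being improved, so effort is best spent on the common case.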
Calculating the CPI of a Program
$IC_i$: #times an instruction of type $i$ is executed in the program
$IC$: #instructions executed in the program:
  $$IC = \sum_{i=1}^{n} IC_i$$
$F_i$: relative frequency of instructions of type $i$: $F_i = IC_i / IC$
$CPI_i$: #cycles to execute an instruction of type $i$
  e.g.: $CPI_{add} = 1$, $CPI_{mul} = 3$
#cycles required to execute the program:
  $$\#cyc = \sum_{i=1}^{n} CPI_i \times IC_i$$
CPI:
  $$CPI = \frac{\#cyc}{IC} = \frac{\sum_{i=1}^{n} CPI_i \times IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \times F_i$$
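The two equivalent forms of the CPI formula (total cycles divided by IC, and the frequency-weighted sum) can be compared in a few lines. This is a minimal sketch. The per-type CPI values for add and mul come from the slide, while the instruction counts are hypothetical.

```python
def program_cpi(mix):
    """Compute the CPI of a program in two equivalent ways.

    `mix` maps an instruction type to a pair (IC_i, CPI_i).
    Returns (CPI from total cycles, CPI from weighted frequencies).
    """
    total_ic = sum(ic for ic, _ in mix.values())
    total_cycles = sum(cpi * ic for ic, cpi in mix.values())
    cpi_from_cycles = total_cycles / total_ic
    # F_i = IC_i / IC, and CPI is the sum of CPI_i * F_i.
    cpi_from_frequencies = sum(cpi * (ic / total_ic) for ic, cpi in mix.values())
    return cpi_from_cycles, cpi_from_frequencies


# Hypothetical program: 800 adds (1 cycle each) and 200 muls (3 cycles each).
mix = {"add": (800, 1), "mul": (200, 3)}
by_cycles, by_frequencies = program_cpi(mix)   # both are about 1.4
```

Here 1400 cycles are spent on 1000 instructions, so the program's CPI is about 1.4 even though most instructions take a single cycle.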
Comparing Performance
Peak performance
  MIPS, MFLOPS
  Often not useful: unachievable / unsustainable in practice
Benchmarks
  Real applications, or representative parts of real apps
  Targeted at the specific system usages
SPEC INT – integer applications
  Data compression, C compiler, Perl interpreter, database system, chess-playing, text-processing, …
SPEC FP – floating point applications
  Mostly important scientific applications
TPC Benchmarks
  Measure transaction-processing throughput
Instruction Set Design
The ISA is what the user / compiler sees
The HW implements the ISA
[Figure: layers labeled software, instruction set, hardware]
ISA Considerations
Code size
  Long instructions take more time to fetch
  Longer instructions require a larger memory
  • Important in small devices, e.g., cell phones
Number of instructions (IC)
  Reducing IC reduces execution time
  • At a given CPI and frequency
Code “simplicity”
  Simple HW implementation
  • Higher frequency and lower power
  Code optimization can better be applied to “simple code”
Architectural Consideration Example
Displacement Address Size
  1% of addresses > 16 bits
  12 - 16 bits of displacement needed
[Figure: distribution of displacement sizes. X-axis: Address Bits (0 to 15). Y-axis: 0% to 30%. Series: Int. Avg., FP Avg.]
CISC Processors
CISC - Complex Instruction Set Computer
  The idea: a high level machine language
  Example: x86
Characteristics
  Many instruction types, with many addressing modes
  Some of the instructions are complex
  • Execute complex tasks
  • Require many cycles
  ALU operations directly on memory
  • Only a few registers, in many cases not orthogonal
  Variable length instructions
  • Common instructions get short codes, which saves code length
Top 10 x86 Instructions
Rank   Instruction              % of total executed
1      load                     22%
2      conditional branch       20%
3      compare                  16%
4      store                    12%
5      add                      8%
6      and                      6%
7      sub                      5%
8      move register-register   4%
9      call                     1%
10     return                   1%
Total                           96%
Simple instructions dominate instruction frequency
CISC Drawbacks
Complex instructions and complex addressing modes
  complicate the processor
  slow down the simple, common instructions
  contradict Make The Common Case Fast
Compilers don’t use complex instructions / indexing methods
Variable length instructions are a real pain in the neck
  Difficult to decode a few instructions in parallel
  • As long as an instruction is not decoded, its length is unknown
    It is unknown where the instruction ends
    It is unknown where the next instruction starts
  An instruction may span more than a single cache line
  An instruction may span more than a single page
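The decoding problem with variable length instructions can be illustrated with a toy encoding. In the sketch below, the first byte of each instruction holds its total length in bytes. This encoding is purely hypothetical and is not how x86 encodes instruction lengths. It only shows why instruction boundaries must be found one after the other.

```python
def variable_length_starts(code):
    """Find instruction start offsets in a variable-length stream.

    Toy encoding (an assumption for this example): the first byte of
    each instruction is its total length in bytes. The start of
    instruction k+1 is known only after instruction k was examined,
    so the scan is inherently sequential.
    """
    starts = []
    pc = 0
    while pc < len(code):
        length = code[pc]
        if length <= 0:
            raise ValueError("invalid instruction length at offset %d" % pc)
        starts.append(pc)
        pc += length
    return starts


def fixed_length_starts(code, width=4):
    """With fixed-length instructions every start offset is known up front."""
    return list(range(0, len(code), width))


stream = bytes([2, 0xAA, 1, 3, 0xBB, 0xCC, 2, 0xDD])
sequential = variable_length_starts(stream)   # [0, 2, 3, 6]
parallel_ready = fixed_length_starts(stream)  # [0, 4]
```

With a fixed width, a decoder can start on several instructions at once because their offsets do not depend on each other.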
RISC Processors
RISC - Reduced Instruction Set Computer
  The idea: simple instructions enable fast hardware
Characteristics
  A small instruction set, with only a few instruction formats
  Simple instructions
  • Execute simple tasks
  • Most of them require a single cycle (with pipeline)
  A few indexing methods
  ALU operations on registers only
  • Memory is accessed using Load and Store instructions only
  • Many orthogonal registers
  • Three address machine: Add dst, src1, src2
  Fixed length instructions
Examples: MIPS™, Sparc™, Alpha™, Power™
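The load/store, three-address style can be made concrete with a tiny interpreter. This is only a sketch of the idea. The instruction names, the tuple format and the eight-register file are assumptions made for this example and do not describe any of the processors listed above.

```python
def run(program, memory, num_regs=8):
    """Execute a toy load/store program and return the register file.

    Only load and store touch memory. The ALU operation (add) works on
    registers only and names three operands: dst, src1, src2.
    """
    regs = [0] * num_regs
    for op, *args in program:
        if op == "load":        # load dst, addr
            dst, addr = args
            regs[dst] = memory[addr]
        elif op == "store":     # store src, addr
            src, addr = args
            memory[addr] = regs[src]
        elif op == "add":       # add dst, src1, src2
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        else:
            raise ValueError("unknown instruction: %r" % (op,))
    return regs


# memory[2] = memory[0] + memory[1], written in load/store style.
memory = {0: 5, 1: 7, 2: 0}
program = [
    ("load", 1, 0),
    ("load", 2, 1),
    ("add", 3, 1, 2),
    ("store", 3, 2),
]
registers = run(program, memory)
```

A CISC machine could express the same computation with fewer instructions that operate on memory directly. The RISC version trades a higher instruction count for instructions that are each simple to decode and execute.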
RISC Processors (Cont.)
Simple architecture, simple micro-architecture
  Simple, small and fast control logic
  Simpler to design and validate
  Room for large on-die caches
  Shortens time-to-market
Using a smart compiler
  Better pipeline usage
  Better register allocation
Existing RISC processors are not “pure” RISC
  e.g., support division, which takes many cycles
Compilers and ISA
Ease of compilation
  Orthogonality:
  • no special registers
  • few special cases
  • all operand modes available with any data type or instruction type
  Regularity:
  • no overloading for the meanings of instruction fields
  Streamlined:
  • resource needs easily determined
Register assignment is critical too
  Easier if lots of registers
CISC Is Dominant
The x86 architecture, which is a CISC architecture, dominates the processor market
  A vast amount of existing software
  Intel, AMD, Microsoft and others benefit from this
  • Intel and AMD put a lot of money into making high performance x86 processors, despite the architectural disadvantage
  • Current x86 processors give the best cost/performance
CISC processors use arch ideas from the RISC world
  Starting at Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally
  • the inside core looks much like that of a RISC processor
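The idea of translating a complex instruction into RISC-like operations can be sketched with a toy translator. Everything here is hypothetical: the instruction names, the tuple format and the temporary register are invented for illustration and do not describe the internal operations of any real x86 processor.

```python
TMP = 7  # hypothetical internal temporary register, not visible to software


def to_micro_ops(instr):
    """Translate one toy instruction into a list of RISC-like operations.

    A memory-operand add ("add_mem", addr, src), meaning
    memory[addr] += register src, is split into load / add / store.
    Instructions that are already simple pass through unchanged.
    """
    op = instr[0]
    if op == "add_mem":
        _, addr, src = instr
        return [
            ("load", TMP, addr),       # bring the memory operand into a register
            ("add", TMP, TMP, src),    # register-only ALU operation
            ("store", TMP, addr),      # write the result back
        ]
    return [instr]


complex_ops = to_micro_ops(("add_mem", 16, 2))
simple_ops = to_micro_ops(("add", 1, 2, 3))
```

After translation, the execution core only sees simple register operations plus loads and stores, which is why it can be built much like a RISC core.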
Software Specific Extensions
Extend arch to accelerate exec of specific apps
Example: SSE™ – Streaming SIMD Extensions
  128-bit packed (vector) / scalar single precision FP (4×32)
  Introduced on Pentium® III in ’99
  8 new 128-bit registers (XMM0 – XMM7)
  Accelerates graphics, video, scientific calculations, …
[Figure: two 128-bit add diagrams.
  Packed: (x0, x1, x2, x3) + (y0, y1, y2, y3) = (x0+y0, x1+y1, x2+y2, x3+y3)
  Scalar: (x0, x1, x2, x3) + (y0, y1, y2, y3) = (x0+y0, y1, y2, y3)]
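The difference between the packed and scalar forms can be mimicked with ordinary lists. The sketch below only models the data flow shown in the slide's diagrams on four-element lists. It does not use real SSE instructions, and the function names are invented for this example.

```python
def add_packed(x, y):
    """Packed add: all four element pairs are added."""
    return [a + b for a, b in zip(x, y)]


def add_scalar(x, y):
    """Scalar add, following the slide's diagram: only element 0 is
    added, and elements 1..3 of the result are taken from y."""
    return [x[0] + y[0]] + list(y[1:])


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
packed = add_packed(x, y)   # [11.0, 22.0, 33.0, 44.0]
scalar = add_scalar(x, y)   # [11.0, 20.0, 30.0, 40.0]
```

In hardware the packed form performs the four additions with a single instruction, which is where the reduction in instruction count comes from.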