1

Structure of Computer Systems
(Advanced Computer Architectures)

Course: Gheorghe Sebestyen

Lab. works: Anca Hangan, Madalin Neagu, Ioana Dobos

2

Objectives and content

• design of computer components and systems
• study of methods used for increasing the speed and the efficiency of computer systems
• study of advanced computer architectures

3

Bibliography

• Baruch, Z. F., Structure of Computer Systems, U.T.PRES, Cluj-Napoca, 2002
• Baruch, Z. F., Structure of Computer Systems with Applications, U.T.PRES, Cluj-Napoca, 2003
• Gorgan, G. Sebestyen, Proiectarea calculatoarelor, Editura Albastra, 2005
• Gorgan, G. Sebestyen, Structura calculatoarelor, Editura Albastra, 2000
• J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, 1st-5th editions
• D. Patterson, J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 1st-3rd editions
• any book about computer architecture, microprocessors, microcontrollers or digital signal processors
• Search: Intel Academic Community, Intel technologies (http://www.intel.com/technology/product/demos/index.htm), etc.
• my web page: http://users.utcluj.ro/~sebestyen

4

Course Content

• Factors that influence the performance of computer systems, technological trends
• Computer arithmetic: ALU design
• CPU design strategies
  • pipeline architectures, super-pipeline
  • parallel architectures (multi-core, multiprocessor systems)
  • RISC architectures
  • microprocessors
• Interconnection systems
• Memory design
  • ROM, SRAM, DRAM, SDRAM, etc.
  • cache memory
  • virtual memory
• Technological trends

5

Performance features

• execution time
• reaction time to external events
• memory capacity and speed
• input/output facilities (interfaces)
• development facilities
• dimension and shape
• predictability, safety and fault tolerance
• costs: absolute and relative

6

Performance features

Execution time
• execution time of:
  • operations (arithmetic operations)
    • e.g. multiplication is 30-40 times slower than addition
    • single or multiple clock periods
  • instructions
    • simple and complex instructions have different execution times
    • average execution time: $t_{average} = \sum_i t_{instruction}(i) \cdot p_{instruction}(i)$
      where $p_{instruction}(i)$ is the probability of instruction "i"
    • dependable/predictable systems: fixed execution time for instructions
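The weighted average above can be sketched in a few lines of Python. The instruction mix below (names, times and probabilities) is an illustrative assumption, not data from the course.

```python
# Hypothetical instruction mix: type -> (execution time in ns, probability).
# All numbers are illustrative assumptions, not measurements.
instruction_mix = {
    "add": (1.0, 0.50),
    "load": (2.0, 0.30),
    "multiply": (30.0, 0.20),  # roughly 30x slower than an addition
}


def average_execution_time(mix):
    """Return the sum of t(i) * p(i) over all instruction types."""
    total_p = sum(p for _, p in mix.values())
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(t * p for t, p in mix.values())


avg_ns = average_execution_time(instruction_mix)
print(f"average execution time: {avg_ns:.2f} ns")
```

Even though multiplications are only 20% of this assumed mix, they contribute most of the average, which is why rare but slow instructions matter.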

7

Performance features

Execution time
• execution time of:
  • procedures, tasks
    • the time to solve a given function (e.g. sorting, printing, selection, I/O operations, context switch)
  • transactions
    • execution of a sequence of operations to update a database
  • applications
    • e.g. 3D rendering, simulation of fluid flow, computation of statistical data

8

Performance features

Reaction time
• response time to a given event
• solutions:
  • best effort: batch programming
  • interactive systems: event-driven systems
  • real-time systems: the worst-case execution time (WCET) is guaranteed
    • scheduling strategies for single- or multi-processor systems
• influences:
  • execution time of interrupt routines or procedures
  • context-switch time
  • background execution of the operating system's threads

9

Performance features

• memory capacity and speed:
  • cache memory: SRAM, very high speed (< 1 ns), low capacity (1-8 MB)
  • internal memory: SRAM or DRAM, average speed (15-70 ns), medium capacity (1-8 GB)
  • external memory (storage): HD, DVD, CD, Flash (1-10 ms), very large capacity (0.5-12 TB)
• input/output facilities (interfaces):
  • very diverse or dedicated to a purpose
  • input devices: keyboard, mouse, joystick, video camera, microphone, sensors/transducers
  • output devices: printer, video, sound, actuators
  • input/output: storage devices
• development facilities:
  • OS services (e.g. display, communication, file system, etc.)
  • programming and debugging frameworks
  • development kits (minimal hardware and software for building dedicated systems)

10

Performance features

• dimension and shape
  • supercomputers: minimal dimensional restrictions
  • personal computers (desktop, laptop, tablet PC): some limitations
  • mobile devices ("hand-held devices"): phones, medical devices
  • dedicated systems: significant dimensional and shape-related restrictions
• predictability, safety and fault tolerance
  • predictable execution time
  • controllable quality and safety
  • safety-critical systems, industrial computers, medical devices
• costs
  • absolute or relative (cost/performance, cost/bit)
  • cost restrictions for dedicated or embedded systems

11

Physical performance parameters

Clock signal's frequency
• was a good measure of performance for a long period of time
• depends on:
  • the integration technology: the dimension of a transistor and the path lengths
  • the supply voltage and the relative distance between the high and low states
• clock period = the time delay of the longest signal path
  = no_of_gates * delay_of_a_gate
• the clock period grows with the complexity of the CPU
  • RISC computers increase the clock frequency by reducing the CPU complexity
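As a rough sketch of the relation clock period = no_of_gates * delay_of_a_gate, the snippet below compares two hypothetical critical paths. The gate counts and the 25 ps gate delay are assumed values chosen only to make the arithmetic easy to follow.

```python
def max_clock_frequency_hz(no_of_gates, delay_of_a_gate_s):
    """Clock period = gates on the longest path * delay per gate; f = 1 / period."""
    clock_period_s = no_of_gates * delay_of_a_gate_s
    return 1.0 / clock_period_s


# Assumed values: a complex CPU with 40 gate levels on its longest path,
# a simplified (RISC-like) CPU with 20, and 25 ps per gate for both.
complex_f = max_clock_frequency_hz(40, 25e-12)
simple_f = max_clock_frequency_hz(20, 25e-12)
print(f"complex CPU: {complex_f / 1e9:.1f} GHz, simple CPU: {simple_f / 1e9:.1f} GHz")
```

Halving the length of the longest path doubles the achievable frequency, which is the argument made for simpler CPUs.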

12

Physical performance parameters

Clock signal's frequency
• we can compare computers with the same internal architecture
• for different architectures the clock frequency is less relevant
• after 60 years of steady growth, the frequency has now saturated at 2-3 GHz because of power dissipation limits:

  $\text{dynamic\_power} = \alpha \cdot C \cdot V^2 \cdot f$

  where: α = activation factor (0.1-1), C = capacitance, V = voltage, f = frequency
• increasing the clock frequency:
  • technological improvement: smaller transistors, through better lithographic methods
  • architectural improvement: simpler CPU, shorter signal paths
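The dynamic power relation can be explored numerically. This is a minimal sketch; the activation factor, switched capacitance and voltages below are assumed values, not figures for any real processor.

```python
def dynamic_power(alpha, capacitance_f, voltage_v, frequency_hz):
    """dynamic_power = alpha * C * V^2 * f, in watts."""
    return alpha * capacitance_f * voltage_v ** 2 * frequency_hz


# Assumed values: alpha = 0.5, C = 10 nF of switched capacitance, V = 1.2 V.
base = dynamic_power(0.5, 10e-9, 1.2, 3e9)
faster = dynamic_power(0.5, 10e-9, 1.2, 6e9)       # doubling f doubles the power
low_voltage = dynamic_power(0.5, 10e-9, 0.6, 3e9)  # halving V quarters the power
print(f"base: {base:.1f} W, 2x frequency: {faster:.1f} W, half voltage: {low_voltage:.1f} W")
```

Power grows linearly with frequency but quadratically with voltage, which is why reducing the supply voltage is so effective.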

13

Physical performance parameters

Average instructions executed per second (IPS)

  $\text{average\_no\_instr} = \dfrac{1}{\sum_i p_i \cdot t_i}$

• where:
  • $p_i$ = probability of using instruction i: $p_i = \text{no\_instr}_i / \text{total\_no\_instructions}$
  • $t_i$ = execution time of instruction i
• instruction types:
  • short instructions (e.g. addition): 1-5 clock cycles
  • long instructions (e.g. multiplication): 100-120 clock cycles
  • integer instructions
  • floating point instructions (slower)
• measuring units: MIPS, MFLOPS, TFLOPS
• can compare computers with the same or similar instruction sets
• not good for CISC vs. RISC comparison

Type       Year   Freq.      MIPS
I4004      1971   0.74 MHz   0.09
I80286     1982   12 MHz     2.66
I80486     1992   66 MHz     52
P III      2000   600 MHz    2,054
Intel I7   2011   3.33 GHz   177,730
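A small sketch of the IPS formula follows. It assumes a hypothetical 100 MHz CPU and an invented mix of short and long instructions; only the formula itself comes from the slide.

```python
def instructions_per_second(mix, clock_hz):
    """mix maps instruction type -> (clock cycles, probability).

    Returns 1 / sum(p_i * t_i), with t_i = cycles_i / clock_hz.
    """
    avg_time_s = sum(p * cycles / clock_hz for cycles, p in mix.values())
    return 1.0 / avg_time_s


# Assumed instruction mix (illustrative only).
mix = {
    "short (add)": (2, 0.95),       # 2 clock cycles, 95% of instructions
    "long (multiply)": (100, 0.05),  # 100 clock cycles, 5% of instructions
}
mips = instructions_per_second(mix, 100e6) / 1e6
print(f"{mips:.2f} MIPS")
```

With this assumed mix the average instruction takes 6.9 cycles, so the 100 MHz clock yields about 14.5 MIPS rather than 50 MIPS.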

14

Physical performance parameters

Execution time of a program
• more realistic
• can compare computers with different architectures
• influenced by the operating system, communication and storage systems
• How to select a good program for comparison (a good benchmark)?
  • real programs: compilers, coding/decoding, zip/unzip
  • significant parts of a real program: OS kernel modules, mathematical libraries, graphical processing functions
  • synthetic programs: a combination of instructions in proportions typical for a group of applications (with no real outcome):
    • Dhrystone: combination of integer instructions
    • Whetstone: contains floating point instructions too
• issues with benchmarks:
  • processor architectures optimized for benchmarks
  • compiler optimization techniques eliminate useless instructions

15

Physical performance parameters

Other metrics:
• number of transactions per second
  • in case of databases or server systems
  • number of concurrent accesses to a database or warehouse
  • operations: read-modify-write, communication, access to external memory
  • describes the whole computer system, not only the CPU
• communication bandwidth
  • number of Mbytes transmitted per second
  • total bandwidth or useful/usable bandwidth
• context switch time
  • for embedded and real-time systems
  • example: EEMBC (EDN Embedded Microprocessor Benchmark Consortium)

16

Principles for performance improvement

• Moore's Law
• Amdahl's Law
• Locality: time and space
• Parallel execution

17

Principles for performance improvement

• Moore's Law (1965, Gordon Moore): "the number of transistors on integrated circuits doubles approximately every two years"
• 18 months law (David House, Intel): "the performance of a computer is doubled every 18 months" (1.5 years), as a result of more and faster transistors
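To get a feel for the two doubling periods quoted above, the following sketch computes the growth factor over a given number of years. The 10-year horizon is an arbitrary choice for illustration.

```python
def growth_factor(years, doubling_period_years):
    """Factor by which a quantity grows if it doubles every doubling_period_years."""
    return 2.0 ** (years / doubling_period_years)


transistors_10y = growth_factor(10, 2.0)  # doubling every two years
performance_10y = growth_factor(10, 1.5)  # doubling every 18 months
print(f"transistors: x{transistors_10y:.0f}, performance: x{performance_10y:.0f}")
```

Over ten years, a two-year doubling period gives a factor of 32, while an 18-month period gives a factor of roughly 100.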

18

[Figure: Moore's law, with data points labeled 4004, 8080, 8086, '286, '386, '486, Pentium, Pentium 4]

19

Principles for performance improvement

Moore's law (cont.)
• the growth will continue, but not for long (2013-2018)
• now the doubling period is 3 years
• Intel predicts a limit at 16 nanometer technology (read more on Wikipedia)

Other similar growth trends:
• clock frequency: saturated 3-4 years ago
• capacity of internal memories (DRAMs)
• capacity of external memories (HD, DVD)
• number of pixels for image and video devices

Semiconductor manufacturing processes (source: Wikipedia)
• 10 µm: 1971
• 3 µm: 1975
• 1.5 µm: 1982
• 1 µm: 1985
• 800 nm: 1989
• 600 nm: 1994
• 350 nm: 1995
• 250 nm: 1998
• 180 nm: 1999
• 130 nm: 2000
• 90 nm: 2002
• 65 nm: 2006
• 45 nm: 2008
• 32 nm: 2010
• 22 nm: 2012
• 14 nm: approx. 2014
• 10 nm: approx. 2016
• 7 nm: approx. 2018
• 5 nm: approx. 2020

20

Principles for performance improvement

Precursors:
• 90/10 principle: 90% of the time the processor executes 10% of the code
• principle: "make the common case fast"
• invest more in the parts that count more

Amdahl's law
• How to measure the impact of a new technology?
• speedup η: how many times faster the execution is (old time divided by new time)

  $\eta = \dfrac{t_{old\_exec}}{t_{new\_exec}} = \dfrac{t_{old\_exec}}{(1-f) \cdot t_{old\_exec} + \dfrac{f}{\eta'} \cdot t_{old\_exec}} = \dfrac{1}{(1-f) + f/\eta'}$

  where: η' = the speedup of the new component
         f = the fraction of the program that benefits from the improvement
• Consequence: the overall speedup is limited by Amdahl's law; however large η' is, η cannot exceed 1/(1-f)

Numerical example:
• f = 0.1; η' = 2 => η = 1/0.95 ≈ 1.053 (about 5% gain)
• f = 0.1; η' = ∞ => η = 1/0.9 ≈ 1.111 (about 11% gain)
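The numerical example can be checked with a short Python sketch of Amdahl's law. The function and parameter names are choices made for this illustration.

```python
def amdahl_speedup(f, component_speedup):
    """Overall speedup: 1 / ((1 - f) + f / component_speedup).

    f is the fraction of the program that benefits from the improvement;
    component_speedup is the speedup of the improved part (may be infinite).
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    return 1.0 / ((1.0 - f) + f / component_speedup)


print(f"{amdahl_speedup(0.1, 2):.3f}")             # improved part runs 2x faster
print(f"{amdahl_speedup(0.1, float('inf')):.3f}")  # improved part takes no time
print(f"{amdahl_speedup(0.9, 10):.3f}")            # improving the common case
```

The last call illustrates "make the common case fast": the same kind of improvement pays off far more when it applies to 90% of the execution time than to 10%.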

21

Principles for performance improvement

Locality principles
• Time locality
  • "if a memory location is accessed, then it has a high probability of being accessed again in the near future"
  • explanations:
    • execution of instructions in a loop
    • a variable is used a number of times in a program sequence
  • consequence (good practice):
    • bring the newly accessed memory location closer to the processor for a better access time on the next access => justification of cache memories

22

Principles for performance improvement

Locality principles
• Space locality
  • "if a memory location is accessed, then its neighboring locations have a high probability of being accessed in the near future"
  • explanations:
    • execution of instructions in a loop
    • consecutive access to the elements of a data structure (vector, matrix, record, list, etc.)
  • consequence (good practice):
    • bring the location's neighbors closer to the processor for a better access time on the next access => justification of cache memories
    • transfer blocks of data instead of single locations; block transfer on DRAMs is much faster
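The effect of both locality principles can be illustrated with a toy cache simulation. This is only a sketch under stated assumptions: a direct-mapped cache with 64 lines and 8-location blocks, and three invented access patterns. It does not model any real processor.

```python
def hit_rate(addresses, num_lines=64, block_size=8):
    """Fraction of accesses served by a simulated direct-mapped cache.

    On a miss the whole block of block_size neighboring locations is loaded.
    """
    lines = [None] * num_lines  # block number currently held by each cache line
    hits = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_lines
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block  # miss: fetch the block from main memory
    return hits / len(addresses)


n = 4096
sequential = hit_rate(list(range(n)))             # vector traversal: space locality
strided = hit_rate([i * 8 for i in range(n)])     # one access per block: no locality
looped = hit_rate([i % 16 for i in range(n)])     # small loop: time locality
print(f"sequential: {sequential:.3f}, strided: {strided:.3f}, looped: {looped:.4f}")
```

In this model the sequential traversal hits on 7 of every 8 accesses because each miss brings in the neighbors, the strided pattern never hits, and the small loop hits on almost every access.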

23

Principles for performance improvement

Parallel execution principle
• "when the technology limits the speed increase, a further improvement may be obtained through parallel execution"
• parallel execution levels:
  • data level: multiple ALUs
  • instruction level: pipeline architectures, super-pipeline and superscalar, wide instruction set computers
  • thread level: multi-cores, multiprocessor systems
  • application level: distributed systems, Grid and cloud systems
• parallel execution is one of the explanations for the speedup of the latest processors (look at the table on slide 13)

24

Improving the CPU performance

Execution time: the measure of the CPU performance

  $t_{exec} = \dfrac{\text{Instr\_no}}{IPS}$

  $t_{exec} = \text{Instr\_no} \cdot CPI \cdot T_{clk} = \dfrac{\text{Instr\_no} \cdot CPI}{f_{clk}}$

  where: IPS = instructions per second
         CPI = cycles per instruction
         $T_{clk}$, $f_{clk}$ = clock signal's period and frequency

• Goal: reduce the execution time in order to have a better CPU performance
• Solution: influence (reduce or increase) the parameters in the above formulas in order to reduce the execution time
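A minimal sketch of the two execution-time formulas, using an invented program (2 billion instructions, CPI of 1.5, 3 GHz clock) to show that both forms give the same result. It uses IPS = f_clk / CPI to link them.

```python
def execution_time_s(instr_no, cpi, f_clk_hz):
    """t_exec = Instr_no * CPI * T_clk = Instr_no * CPI / f_clk."""
    return instr_no * cpi / f_clk_hz


def ips(cpi, f_clk_hz):
    """Instructions per second: IPS = f_clk / CPI."""
    return f_clk_hz / cpi


# Hypothetical program and CPU (assumed values).
t_exec = execution_time_s(2e9, 1.5, 3e9)
t_exec_via_ips = 2e9 / ips(1.5, 3e9)  # t_exec = Instr_no / IPS
print(f"{t_exec:.2f} s, {t_exec_via_ips:.2f} s")

# Each parameter can be improved independently:
fewer_instructions = execution_time_s(1e9, 1.5, 3e9)
lower_cpi = execution_time_s(2e9, 0.75, 3e9)
higher_clock = execution_time_s(2e9, 1.5, 6e9)
```

Halving the instruction count, halving the CPI, or doubling the clock frequency each halve the execution time in this model, provided the other two parameters stay unchanged.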

25

Improving the CPU performance

Solutions: increase the number of instructions per second

  External view: $IPS = \dfrac{1}{\sum_i p_i \cdot t_i}$

  Architectural view: $IPS = \dfrac{1}{CPI \cdot T_{clk}} = \dfrac{f_{clk}}{CPI}$

• How to do it?
  • reduce the duration of instructions
  • reduce the frequency (probability) of long and complex instructions (e.g. replace multiply operations)
  • reduce the clock period and increase the frequency
  • reduce CPI
• external factors that may influence IPS:
  • access time to instruction code and data may drastically influence the execution time of an instruction
  • example: for the same instruction type (e.g. addition):
    • < 1 ns for instruction and data in the cache memory
    • 15-70 ns for instruction and data in the main memory
    • 1-10 ms for instruction and data in the virtual (HD) memory

26

Improving the CPU performance

Solutions: reduce the number of instructions

Instr_no: number of instructions executed by the CPU during an application execution
• improve algorithms
• reduce the complexity of the algorithm
• more powerful instructions: multiple operations during a single instruction
  • parallel ALUs, SIMD architectures, string operations

Instr_no = op_no / op_per_instr
• op_no: number of elementary operations required to solve a given problem (application)
• op_per_instr: number of operations executed in a single instruction (average value)
• increasing op_per_instr may increase the CPI (the next parameter in the formula)

27

Improving the CPU performance

Solutions (cont.): reduce CPI

CPI (cycles per instruction): number of clock periods needed to execute an instruction
• instructions have variable CPIs; an average value is needed:

  $CPI_{av} = \dfrac{\sum_i n_i \cdot CPI_i}{\sum_i n_i} = \sum_i p_i \cdot CPI_i$

  where: $n_i$ = number of instructions of type "i" in the analyzed program sequence
         $CPI_i$ = CPI for instructions of type "i"
• methods to reduce the CPI:
  • pipeline execution of instructions => CPI close to 1
  • superscalar, superpipeline => CPI ∈ (0.25, 1)
  • simplify the CPU and the instructions: RISC architecture
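The average CPI can be computed directly from instruction counts. The program profile below is a made-up example used only to exercise the formula.

```python
def average_cpi(profile):
    """profile maps instruction type -> (n_i, CPI_i).

    Returns sum(n_i * CPI_i) / sum(n_i).
    """
    total = sum(n for n, _ in profile.values())
    return sum(n * cpi for n, cpi in profile.values()) / total


# Hypothetical instruction counts and per-type CPIs (illustrative only).
program = {
    "alu": (600, 1),
    "load/store": (300, 2),
    "multiply": (100, 10),
}
cpi_av = average_cpi(program)
print(f"average CPI: {cpi_av:.2f}")
```

In this assumed profile the multiply instructions are 10% of the count but contribute almost half of the cycles, so reducing their share or their CPI has the largest effect.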

28

Improving the CPU performance

Solutions (cont.): reduce the clock signal's period or increase the frequency
• $T_{clk}$: the period of the clock signal, or
• $f_{clk}$: the frequency of the clock signal

Methods:
• reduce the dimension of a switching element and increase the integration ratio
• reduce the operating voltage
• reduce the length of the longest path: simplify the CPU architecture

[Figure: labels Δt, Δt' and Vcc]

29

Conclusions

Ways of increasing the speed of processors:
• fewer instructions
• smaller CPI: simpler instructions
• parallel execution at different levels
• higher clock frequency