processor designjarnau.site.ac.upc.edu/pd/01_moores_law_and_dennard_scaling.pdf76 / 98 end of...

1 / 98

Processor Design

José María Arnau

2 / 98

Objective

● Understand the intricacies of advanced IC design and development

3 / 98

Objective

module dff(input clk, input d, output q);

always @(posedge clk)q <= d;

endmodule

RTL code

4 / 98

Objective

module dff(input clk, input d, output q);

always @(posedge clk)q <= d;

endmodule

RTL code

IC layout

5 / 98

PD Contents

1. Moore’s Law and Dennard Scaling

2. Hardware Design Cycle

3. Functional Verification

4. Circuit Design

5. Physical Design

6 / 98

Evaluation

● Presentation on related research work– PechaKucha format: 20 slides, 20 seconds/slide

● List of papers posted soon in PD web site– http://jarnau.site.ac.upc.edu/PD/

● 20% of final mark

7 / 98

Practical approach

● Theory sessions are useful for lab assigments– How to write effective test benches

– How to verify your design

– How to synthesize, place and route your design

– How to estimate max freq and area of your design

● EDA tools for digital synthesis– Qflow: http://opencircuitdesign.com/qflow/

8 / 98

PD Contents

(a) Transistor basic physics

(b) Power wall

(c) Dark silicon

2. Hardware Design Cycle

3. Functional Verification

4. Circuit Design

5. Physical Design

9 / 98

10 / 98

The number of transistors in a dense integrated circuit doubles about every two years.

Moore’s Law

11 / 98

The number of transistors in a dense integrated circuit doubles about every two years.

As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.

Moore’s Law

Dennard Scaling

12 / 98

MOS Transistors

● Metal Oxide Semiconductor

nMOS Transistor pMOS Transistor

13 / 98

MOS Transistors

nMOS gate

source

pMOS gate

source

14 / 98

MOS Transistors

nMOS gate

source

pMOS gate

source

gate = 0

source

15 / 98

MOS Transistors

nMOS gate

source

pMOS gate

source

gate = 0

source

OFF gate = 1

source

16 / 98

MOS Transistors

nMOS gate

source

pMOS gate

source

gate = 0

source

OFF gate = 1

source

gate = 0

source

17 / 98

MOS Transistors

nMOS gate

source

pMOS gate

source

gate = 0

source

OFF gate = 1

source

gate = 0

source

ON gate = 1

source

18 / 98

CMOS Logic

● Inverter

19 / 98

CMOS Logic

● Inverter

20 / 98

CMOS Logic

● Inverter

21 / 98

CMOS Logic

● Inverter

22 / 98

CMOS Logic

● Inverter

23 / 98

CMOS Logic

● Inverter

24 / 98

CMOS Logic

● Inverter

25 / 98

CMOS Logic

● Inverter

26 / 98

CMOS Logic

● Inverter

27 / 98

CMOS Logic

● Inverter

28 / 98

CMOS Logic

● NAND gate

29 / 98

CMOS Logic

● NAND gate

30 / 98

CMOS Logic

● NAND gate

31 / 98

CMOS Logic

● NAND gate

32 / 98

CMOS Logic

● NAND gate

33 / 98

CMOS Logic

● NAND gate

34 / 98

CMOS Logic

● NAND gate

35 / 98

CMOS Logic

● NAND gate

36 / 98

CMOS Logic

● NAND gate

37 / 98

CMOS Logic

● NAND gate

OFF OFF

38 / 98

CMOS Logic

● NAND gate

OFF OFF

39 / 98

CMOS Logic

● NAND gate

OFF OFF

40 / 98

CMOS Logic

● Complementary Metal-Oxide Semicondutor– pMOS pull-up network

– nMOS pull-down network

41 / 98

CMOS Logic

● Multiplexers

42 / 98

CMOS Logic

● Multiplexers

43 / 98

CMOS Logic

● Multiplexers

44 / 98

CMOS Logic

● Multiplexers

45 / 98

CMOS Logic

● Multiplexers

OFFOFF

46 / 98

CMOS Logic

● Multiplexers

OFFOFF

47 / 98

CMOS Logic

● Latches and flip-flops

D latch D flip-flop

48 / 98

Moore’s Law

● The number of transistors in a dense integrated circuit doubles about every two years

49 / 98

Moore’s Law

50 / 98

Moore’s Law

~2 years

51 / 98

Moore’s Law

161514131211109876543210

Electronics, April 19, 1965.

52 / 98

Moore’s Law

53 / 98

Moore’s LawTransistors per Square Millimeter by Year, 1971–2018. Logarithmic scale. Data from Wikipedia.

54 / 98

Technology

● What do “7nm” or “10nm” technology mean?

55 / 98

Layout Design Rules● Define how small features can be and how closely

they can be reliably packed

● Lambda based rules

– C. Mead and L. Conway, Introduction to VLSI Systems, Reading, MA: Addison-Wesley, 1980.

56 / 98

Dennard Scaling

● As transistors shrink, they become faster, consume less power, and are cheaper to manufacture.

– Miniaturization provides smaller and faster transistors

57 / 98

Dennard Scaling

Area = S2

58 / 98

Dennard Scaling

Area = S2

Area = 0.5*S2

59 / 98

Dennard Scaling

Area = S2

Area = 0.5*S2

● Transistor dimensions scaled by 30%

60 / 98

Dennard Scaling

Area = S2

Area = 0.5*S2

● Delay reduced by 30%

61 / 98

Dennard Scaling

Area = S2

Area = 0.5*S2

● Delay reduced by 30%

● Frequency increased by ~40%

62 / 98

Performance Scaling

● Smaller transistors are faster– ~1.4x higher performance per generation

● More transistors allow more functionality– Better branch predictors

– Larger caches

– ILP: superscalar, out-of-order execution

– DLP: vector unit

– Better memory controller

63 / 98

CPU Performance

Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.

64 / 98

CPU Frequency

65 / 98

CPU FrequencyPower wall!

66 / 98

CMOS Power

● Dynamic Power– Switching power

● Static Power (Leakage)– Power dissipated even

when not switching

– Around one third of the total power in current technology

Pdyn=αCV 2 f

Psta=I leakV

Ptotal=Pdyn+Psta

67 / 98

CMOS Power

● Static Power

– Gate leakage (Igate)

– Subthreshold leakage (Isub)

– Junction leakage (Ijunct)

– Depends on the temperature

68 / 98

CMOS Power

Im, Y., Cho, E. S., Choi, K., & Kang, S. (2007, March). Development of Junction Temperature Decision (JTD) Map for Thermal Design of Nano-scale Devices Considering Leakage Power. In Twenty-Third Annual IEEE Semiconductor Thermal Measurement and Management Symposium (pp. 63-67). IEEE.

69 / 98

Frequency vs Power

70 / 98

Power Wall

71 / 98

Power Wall

72 / 98

Power Wall and ILP Limits

● Cannot increase performance by raising the frequency due to thermal constraints

● Diminishing returns in ILP– Computer architects optimized superscalar out-of-

order CPUs for decades

● But Moore’s law still provides more and more transistors

● What can we do with the extra transistors?

73 / 98

Multi-core Processors

https://www.guru3d.com/articles-summary/amd-ryzen-5-2400g-review,2.html

74 / 98

Multi-core Processors

● Use extra transistors to include multiple cores in the same chip

● Exploit TLP (Thread-Level Parallelism)● Programmer has to manually extract parallelism

– The Free Lunch Is Over. A Fundamental Turn Toward Concurrency in Software. Herb Sutter. http://www.gotw.ca/publications/concurrency-ddj.htm

75 / 98

CPU Trends

76 / 98

End of Dennard Scaling● According to Dennard scaling, power density remains

constant

– Smaller transistors consume less power

– If frequency is not increased, total power remains the same

● Dennard scaling did not take subthreshold leakage into account

● In current technology, more transistors result in more power

– Power density increases even at the same frequency!

P=αCV 2 f + I leakV

77 / 98

Power Density

John Hennessy and David Patterson. “A New Golden Age for Computer Architecture: Domain-Specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development

78 / 98

Dark Silicon

● Dark Silicon and the End of Multicore Scaling. Esmaeilzadeh, H., Blem, E., Amant, R. S., Sankaralingam, K., & Burger, D. ISCA 2011.

● Part of the chip must be powered off due to thermal constraints– 22 nm: 21% of the chip powered off

– 8 nm: ~50% of the chip powered off

79 / 98

Multicore Limitations

● Increasing the number of cores is no longer an effective solution to improve performance– Due to the dark silicon problem, not all the cores

can be powered at the same time

– Diminishing returns in TLP

● But Moore’s law still provides more and more transistors

● What can we do with the extra transistors?

80 / 98

Hardware Accelerators● Welcome to the golden age of hardware

accelerators!

https://www.hotchips.org/hc30/1conf/1.05_AMD_APU_AMD_Raven_HotChips30_Final.pdf

81 / 98

Hardware Accelerators

Qualcomm Snapdragon 835 [1] NVIDIA “Parker” SoC [2]

1. https://www.notebookcheck.net/Qualcomm-Snapdragon-835-SoC-Benchmarks-and-Specs.207842.0.html2. https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.30-Low-Power-Epub/HC28.22.322-Tegra-Parker-AndiSkende-NVIDIA-v01.pdf

82 / 98

● Application Specific Integrated Circuit

https://ngcodec.com/markets-cloud-transcoding/

83 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;

84 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;

la $t0, v la $t1, v+16 li $t2, 0x40000000 mtc1 $t2, $f2for: lwc1 $f0, 0($t0) mul.s $f0, $f0, $f2 swc1 $f0, 0($t0) addiu $t0, $t0, 4 bne $t0, $t1, for

85 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;

https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors

86 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;

ReadI$

R/W D$

ReadRF

WriteRF

ADD/SUB

MUL BytesR MM

Bytes W MM

4 4 4 4 4 16

4 8 4 4

4 4 8 4 16

4 4 4 4

20 8 32 12 16 4 16 16TOTAL:

87 / 98

Software Overheads

for (i = 0; i < 4; i++) v[i] *= 2.0f;

ReadI$

R/W D$

ReadRF

WriteRF

ADD/SUB

MUL BytesR MM

Bytes W MM

4 4 4 4 4 16

4 8 4 4

4 4 8 4 16

4 4 4 4

20 8 32 12 16 4 16 16TOTAL:

Accesses to TLBTLB missesROB accessesIssue queue...

88 / 98

ASIC vs CPU

Main MemoryMain Memory

ReadMEM

WriteMEM

MUL BytesR MM

Bytes W MM

4 16 16

89 / 98

ASIC vs CPU

ReadMEM

WriteMEM

MUL BytesR MM

Bytes W MM

4 16 16

ReadI$

R/W D$

ReadRF

WriteRF

ADD/SUB

MUL BytesR MM

Bytes W MM

20 8 32 12 16 4 16 16

90 / 98

ASIC vs CPU

ReadMEM

WriteMEM

MUL BytesR MM

Bytes W MM

4 16 16

ReadI$

R/W D$

ReadRF

WriteRF

ADD/SUB

MUL BytesR MM

Bytes W MM

20 8 32 12 16 4 16 16

ASIC does much less work, so it is faster and consumes less energy

91 / 98

Shao, Yakun Sophia, et al. "The aladdin approach to accelerator design and modeling." IEEE Micro 35.3 (2015): 58-70.

92 / 98

Shao, Yakun Sophia, et al. "The aladdin approach to accelerator design and modeling." IEEE Micro 35.3 (2015): 58-70.

93 / 98

Sea of Accelerators

Y. S. Shao, B. Reagen, G. Wei and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), 2014, pp. 97-108.

94 / 98

● ASICs provide higher performance and energy-efficiency

● But they are less programmable and have large development costs

● Only accelerate in hardware applications that are widely popular and compute intensive

● Example: machine learning accelerators

95 / 98

Google TPUJouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE, 2017.

● 92 TOPS

● 28 MB on-chip

● 15x – 30x speedup

● 30x – 80x higher Perf/W

96 / 98

Turing Lecture at ISCA 2018

https://www.acm.org/hennessy-patterson-turing-lecture

97 / 98

What about the future?

● Hardware accelerators will start providing diminishing returns in the future

● Any solution?– Quantum computing

– DNA computing

– Photonic computing

– Superconducting computing

98 / 98

Bibliography

● Weste, N. H., & Harris, D. “CMOS VLSI Design : A Circuits and Systems Perspective”. 4th Edition, 2010.

● Kahng AB, Lienig J, Markov IL, Hu J. “VLSI Physical Design: from Graph Partitioning to Timing Closure”. Springer Science & Business Media; 2011 Jan 27.

● Mead, C., & Conway, L. (1980). Introduction to VLSI systems (Vol. 1080). Reading, MA: Addison-Wesley.

processor designjarnau.site.ac.upc.edu/pd/01_moores_law_and_dennard_scaling.pdf76 / 98 end of...

Documents

the solar system. scaling things down let’s make...

the new era of scaling in an socworld · 2010. 1. 5. ·...

measurement and scaling fundamentals and comparative scaling

chapter 21 state-of-the-art - linköping university ·...

march 3 - 6, 2019 mesa, arizona archive...–density scaling...

iucn ssc publications action plans · 2005. 4. 12. · 7....

cloud scalability patterns · scaling up == vertical...

principles and practice of scalable systems (pposs) · 2...

computer systems research in the post-dennard...

scaling - resources.whiterosemaths.com › wp-content ›...

abstraction and performance in database systems · dsls are...

isscc 2016 silicon systems for the internet of...

saya woolfalk chimatek tool - csdt · • reflection:...

lecture 15: moore’s law and dennard...

something strange and deadly by susan dennard

chain of trust: can trusted hardware help scaling...

the impact of moore’s law and loss of dennard scaling ·...

lesson 14. scaling-up and scaling-down - … lesson 1… ·...

scaling factors and scaling parameters

with articles by dennard, itoh, koyanagi, sunami, foss...