2005-4-28  John Lazzaro
(www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 27 – Multiprocessors
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Last Time: Synchronization
T2 & T3 (2 copies of consumer thread):
      LW   R3, head(R0)   ; Load queue head into R3
spin: LW   R4, tail(R0)   ; Load queue tail into R4
      BEQ  R4, R3, spin   ; If queue empty, wait
      LW   R5, 0(R3)      ; Read x from queue into R5
      ADDi R3, R3, 4      ; Shift head by one word
      SW   R3, head(R0)   ; Update head memory addr

T1 code (producer):
      ORi  R1, R0, x      ; Load x value into R1
      LW   R2, tail(R0)   ; Load queue tail into R2
      SW   R1, 0(R2)      ; Store x into queue
      ADDi R2, R2, 4      ; Shift tail by one word
      SW   R2, tail(R0)   ; Update tail memory addr

[Figure: the queue in memory before and after the enqueue. Head and Tail point into the queue, which grows toward higher addresses; before, the queue holds y; after the producer runs, it also holds x and Tail has advanced by one word.]
Critical section: T2 and T3 must take turns running the consumer code shown in red on the slide.
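Not from the slides, but to make the critical section concrete: a minimal C sketch of the same queue, with a pthread mutex standing in for whatever lock last lecture's synchronization primitives would build. The register-to-variable mapping follows the assembly above; wrap-around of the indices is ignored.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define QSIZE 1024
static uint32_t queue[QSIZE];                 /* no wrap-around handling: a sketch only */
static volatile uint32_t head = 0, tail = 0;  /* word indices, like head/tail in memory */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

/* Producer (T1): with a single producer, no lock is needed here. */
void enqueue(uint32_t x)
{
    queue[tail] = x;               /* SW  R1, 0(R2)             */
    tail = tail + 1;               /* ADDi + SW of tail         */
}

/* Consumer (T2/T3): the load-test-load-update sequence is the critical section. */
int dequeue(uint32_t *out)
{
    pthread_mutex_lock(&qlock);
    if (head == tail) {            /* BEQ R4, R3, spin (queue empty) */
        pthread_mutex_unlock(&qlock);
        return 0;
    }
    *out = queue[head];            /* LW  R5, 0(R3)             */
    head = head + 1;               /* ADDi + SW of head         */
    pthread_mutex_unlock(&qlock);
    return 1;
}

int main(void)                     /* single-threaded smoke test */
{
    uint32_t v;
    enqueue(42);
    if (dequeue(&v)) printf("dequeued %u\n", v);
    return 0;
}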
Today: Memory System Design
NUMA and Clusters: Two different ways to build very large computers.
Multiprocessor memory systems: Consequences of cache placement.
Write-through cache coherency: Simple, but limited, approach to multiprocessor memory systems.
Two CPUs, two caches, shared DRAM ...
[Figure: CPU0 and CPU1, each with a private write-through cache, share a main memory in which address 16 initially holds the value 5.]
CPU0: LW R2, 16(R0)   ; CPU0's cache now holds 16 -> 5
CPU1: LW R2, 16(R0)   ; CPU1's cache now holds 16 -> 5
CPU1: SW R0, 16(R0)   ; main memory and CPU1's cache now hold 16 -> 0, but CPU0's cache still holds the stale 5 (write-through caches)
View of memory no longer “coherent”.
Loads of location 16 from CPU0 and CPU1 see different values!
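A toy simulation (my own sketch, not from the lecture) that reproduces the sequence above: two private write-through caches over one shared memory, with no invalidation, leave CPU0 reading a stale 5 after CPU1 has written 0.

#include <stdio.h>

#define NADDR 32
static int dram[NADDR];

struct cache { int valid[NADDR]; int data[NADDR]; };  /* one-word "lines" */

static int load(struct cache *c, int addr)
{
    if (!c->valid[addr]) {             /* miss: fill from DRAM        */
        c->data[addr] = dram[addr];
        c->valid[addr] = 1;
    }
    return c->data[addr];              /* hit: no memory access       */
}

static void store(struct cache *c, int addr, int value)
{
    c->data[addr] = value;             /* update own cache            */
    c->valid[addr] = 1;
    dram[addr] = value;                /* write-through to DRAM       */
    /* missing: invalidate the line in the *other* cache              */
}

int main(void)
{
    struct cache cpu0 = {0}, cpu1 = {0};
    dram[16] = 5;
    printf("CPU0 LW 16 -> %d\n", load(&cpu0, 16));           /* 5          */
    printf("CPU1 LW 16 -> %d\n", load(&cpu1, 16));           /* 5          */
    store(&cpu1, 16, 0);                                     /* SW 0 -> 16 */
    printf("CPU0 LW 16 -> %d (stale!)\n", load(&cpu0, 16));  /* still 5    */
    return 0;
}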
Today: What to do ...
The simplest solution ... one cache!
[Figure: CPU0 and CPU1 connect through a memory switch to a shared multi-bank cache, which fronts the shared main memory.]
CPUs do not have internal caches. Only one cache, so different values for a memory address cannot appear in two caches!
Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank.
In that case, one request is stalled.
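To make the bank-conflict rule concrete, a small sketch with assumed parameters (4 banks and 32-byte lines; the slide gives no actual numbers): the bank index is typically taken from low-order bits of the line address, so the two CPUs collide only when their addresses map to the same bank.

#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 32   /* assumed line size  */
#define NBANKS      4   /* assumed bank count */

static unsigned bank_of(uint32_t addr)
{
    return (addr / LINE_BYTES) % NBANKS;   /* low-order line-address bits */
}

int main(void)
{
    uint32_t a0 = 0x1000, a1 = 0x1020;  /* adjacent lines: different banks */
    uint32_t b0 = 0x2000, b1 = 0x2080;  /* same bank: one access stalls    */
    printf("0x1000 -> bank %u, 0x1020 -> bank %u (no conflict)\n",
           bank_of(a0), bank_of(a1));
    printf("0x2000 -> bank %u, 0x2080 -> bank %u (conflict)\n",
           bank_of(b0), bank_of(b1));
    return 0;
}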
Not a complete solution ... good for L2.
[Figure: the same organization: CPU0 and CPU1 reach a shared multi-bank cache through a memory switch, backed by shared main memory.]
For modern clock rates, access to the shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing down the clock 10X for LWs. Not good.
This approach was a complete solution in the days when DRAM row access time and CPU clock period were well matched.
Modified form: Private L1s, shared L2
[Figure: CPU0 and CPU1 each have private L1 caches; both reach a shared multi-bank L2 cache through a memory switch or bus, backed by shared main memory.]
Thus, we need to solve the cache coherency problem for the L1 caches.
Advantages of shared L2 over private L2s:
Processors communicate at cache speed, not DRAM speed.
Constructive interference, if both CPUs need same data/instr.
Disadvantage: CPUs share BW to L2 cache ...
Excerpt from the IEEE Micro article on the Power5 (Hot Chips 15):

supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview
Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance.5 Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130 nm lithography, the chip uses eight metal levels and measures 389 mm2. The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core
We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram. In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).
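As a quick sanity check on the excerpt's numbers (my arithmetic, not part of the article): three slices, each 10-way set-associative with 512 congruence classes of 128-byte lines, do add up to the stated 1,920-Kbyte (1.875-Mbyte) L2.

#include <stdio.h>

int main(void)
{
    long ways = 10, sets = 512, line = 128, slices = 3;   /* from the excerpt     */
    long slice_bytes = ways * sets * line;                /* 655,360 B = 640 KB   */
    long total_kb    = slices * slice_bytes / 1024;       /* 1,920 KB             */
    printf("per slice: %ld KB, total: %ld KB (= %.3f MB)\n",
           slice_bytes / 1024, total_kb, total_kb / 1024.0);
    return 0;
}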
Sequentially Consistent Memory Systems
Recall: Sequential Consistency
Sequential Consistency: As if each thread takes turns executing, and instructions in each thread execute in program order.
T2 code (consumer):
      LW   R3, head(R0)   ; Load queue head into R3
spin: LW   R4, tail(R0)   ; Load queue tail into R4
      BEQ  R4, R3, spin   ; If queue empty, wait
      LW   R5, 0(R3)      ; Read x from queue into R5
      ADDi R3, R3, 4      ; Shift head by one word
      SW   R3, head(R0)   ; Update head memory addr

T1 code (producer):
      ORi  R1, R0, x      ; Load x value into R1
      LW   R2, tail(R0)   ; Load queue tail into R2
      SW   R1, 0(R2)      ; Store x into queue
      ADDi R2, R2, 4      ; Shift tail by one word
      SW   R2, tail(R0)   ; Update tail memory addr
The numbered memory operations: 1 and 2 are the producer's store of x and its update of tail (in program order); 3 and 4 are the consumer's load of tail and its read of x (in program order).
Legal orders: 1, 2, 3, 4 or 1, 3, 2, 4 or 3, 4, 1, 2 ... but not 2, 3, 1, 4!
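A small enumeration sketch (mine, not from the slides) that applies the rule behind those legal orders: any interleaving of operations 1-4 is a sequentially consistent execution as long as 1 precedes 2 and 3 precedes 4, i.e. each thread's program order is preserved.

#include <stdio.h>

/* Position of operation `op` in the candidate global order `o`. */
static int pos(const int *o, int op)
{
    for (int i = 0; i < 4; i++)
        if (o[i] == op) return i;
    return -1;
}

int main(void)
{
    int ops[4] = {1, 2, 3, 4};
    /* Enumerate all 4! orders with four nested index loops. */
    for (int a = 0; a < 4; a++)
    for (int b = 0; b < 4; b++)
    for (int c = 0; c < 4; c++)
    for (int d = 0; d < 4; d++) {
        if (a == b || a == c || a == d || b == c || b == d || c == d) continue;
        int o[4] = {ops[a], ops[b], ops[c], ops[d]};
        /* Keep orders that respect both threads' program order. */
        if (pos(o, 1) < pos(o, 2) && pos(o, 3) < pos(o, 4))
            printf("legal: %d %d %d %d\n", o[0], o[1], o[2], o[3]);
    }
    return 0;
}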
Sequential consistency requirements ...
[Figure: CPU0 and CPU1, each with a private cache, in front of a shared memory hierarchy; location 16 is written by one CPU and read by both.]
1. Only one processor at a time has write permission for a memory location.
   The “sequential” part of sequential consistency.
2. No processor can load a stale copy of a location after a write.
   The “consistent” part of sequential consistency.
Implementation: Snoopy Caches
[Figure: CPU0 and CPU1 each have a cache with an attached snooper; both caches sit on a common memory bus to the shared main memory hierarchy.]
Each cache has the ability to “snoop” on memory bus transactions of other CPUs.
The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate cache lines of other CPUs.
Writes from 10,000 feet ...
[Figure: the same two snoopy caches on a shared memory bus in front of the main memory hierarchy.]
1. Writing CPU takes control of the bus.
2. Address to be written is invalidated in all other caches. (Reads will no longer hit in the cache and return stale data.)
3. Write is sent to main memory. (Reads will cache miss and retrieve the new value from main memory.)
For write-thru caches ...
To first order, reads will “just work” if write-thru caches implement this policy.
A “two-state” protocol (cache lines are “valid” or “invalid”).
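Extending the earlier toy simulation (again my own sketch, not the lecture's), here is the two-state write-through policy: every write goes on the bus, every other cache's snooper invalidates its copy, and the next read misses and fetches the new value from main memory.

#include <stdio.h>

#define NADDR 32
#define NCPU   2
static int dram[NADDR];
struct cache { int valid[NADDR]; int data[NADDR]; };  /* valid/invalid = two states */
static struct cache caches[NCPU];

static void bus_write(int writer, int addr, int value)
{
    for (int c = 0; c < NCPU; c++)          /* snoopers watch the bus       */
        if (c != writer)
            caches[c].valid[addr] = 0;      /* invalidate other copies      */
    dram[addr] = value;                     /* write-through to main memory */
}

static int load(int cpu, int addr)
{
    struct cache *c = &caches[cpu];
    if (!c->valid[addr]) {                  /* miss: refetch from DRAM      */
        c->data[addr] = dram[addr];
        c->valid[addr] = 1;
    }
    return c->data[addr];
}

static void store(int cpu, int addr, int value)
{
    caches[cpu].data[addr] = value;
    caches[cpu].valid[addr] = 1;
    bus_write(cpu, addr, value);            /* every write goes to the bus  */
}

int main(void)
{
    dram[16] = 5;
    printf("CPU0 LW 16 -> %d\n", load(0, 16));   /* 5                     */
    printf("CPU1 LW 16 -> %d\n", load(1, 16));   /* 5                     */
    store(1, 16, 0);                             /* invalidates CPU0 copy */
    printf("CPU0 LW 16 -> %d\n", load(0, 16));   /* 0: coherent again     */
    return 0;
}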
Limitations of the write-thru approach
[Figure: the two snoopy caches on a shared memory bus, as before.]
Every write goes to the bus.
Total bus write bandwidth does not support more than 2 CPUs, in modern practice.
To scale further, we need to use write-back caches.
The big write-back trick: keep track of whether other caches also contain a cached line. If not, a cache holds an “exclusive” copy of the line, and can read and write the line as if it were the only CPU. For details, take CS 252 ...
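The lecture defers the details to CS 252, but to make “exclusive” concrete, here is a sketch of the local-write transitions in one common write-back scheme (textbook MESI-style states, not anything specific to this course): only a line held Modified or Exclusive can be written without any bus traffic.

#include <stdio.h>

enum state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* State after a local write, and whether a bus transaction is required.
 * (On a snooped write from another CPU, a line drops to INVALID.) */
static enum state local_write(enum state s, int *needs_bus)
{
    switch (s) {
    case MODIFIED:
    case EXCLUSIVE: *needs_bus = 0; return MODIFIED;  /* silent upgrade        */
    case SHARED:    *needs_bus = 1; return MODIFIED;  /* invalidate other copies */
    default:        *needs_bus = 1; return MODIFIED;  /* read-for-ownership     */
    }
}

int main(void)
{
    int bus;
    enum state s = local_write(EXCLUSIVE, &bus);
    printf("EXCLUSIVE + local write -> state %d, bus needed: %d\n", s, bus);
    s = local_write(SHARED, &bus);
    printf("SHARED    + local write -> state %d, bus needed: %d\n", s, bus);
    return 0;
}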
Other Machine Architectures
NUMA: Non-uniform Memory Access
[Figure: CPU 0 through CPU 1023, each with its own cache and its own local DRAM, connected by an interconnection network.]
Each CPU has part of main memory attached to it.
To access other parts of main memory, use the interconnection network.
For best results, applications take the non-uniform memory latency into account.
Good for applications that match the machine model ...
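A back-of-the-envelope sketch with invented latencies (the slide gives no numbers) showing why applications care about the non-uniform latency: average memory access time degrades quickly as the fraction of remote accesses grows.

#include <stdio.h>

int main(void)
{
    double t_local = 100.0, t_remote = 1000.0;   /* ns, assumed values */
    for (double f = 1.0; f >= 0.5; f -= 0.25)    /* fraction of local accesses */
        printf("local fraction %.2f -> average latency %.0f ns\n",
               f, f * t_local + (1.0 - f) * t_remote);
    return 0;
}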
Clusters ... an application-level approach
Rack stats:
• Weight: 1500 Lbs
• Power: 98 Amps
• Fans: 340 (2”) + 2 (8”)
• Wire: 0.25 miles
• Assembly and wiring time: 60 man-hours
Connect large numbers of 1-CPU or 2-CPU rack mount computers together with high-end network technology.
Instead of using hardware to create a shared memory abstraction, let an application build its own memory model.
University of Illinois, 650 2-CPU Apple Xserve cluster, connected with Myrinet (3.5 µs ping time, a low-latency network).
Clusters also used for web servers
In some applications, each machine can handle a net query by itself.
Example: serving static web pages. Each machine has a copy of the website.
... but I intentionally ignore them here because they are well studied elsewhere and because the issues in this article are largely orthogonal to the use of databases.
Advantages
The basic model that giant-scale services follow provides some fundamental advantages:
• Access anywhere, anytime. A ubiquitous infrastructure facilitates access from home, work, airport, and so on.
• Availability via multiple devices. Because the infrastructure handles most of the processing, users can access services with devices such as set-top boxes, network computers, and smart phones, which can offer far more functionality for a given cost and battery life.
• Groupware support. Centralizing data from many users allows service providers to offer group-based applications such as calendars, teleconferencing systems, and group-management systems such as Evite (http://www.evite.com/).
• Lower overall cost. Although hard to measure, infrastructure services have a fundamental cost advantage over designs based on stand-alone devices. Infrastructure resources can be multiplexed across active users, whereas end-user devices serve at most one user (active or not). Moreover, end-user devices have very low utilization (less than 4 percent), while infrastructure resources often reach 80 percent utilization. Thus, moving anything from the device to the infrastructure effectively improves efficiency by a factor of 20. Centralizing the administrative burden and simplifying end devices also reduce overall cost, but are harder to quantify.
• Simplified service updates. Perhaps the most powerful long-term advantage is the ability to upgrade existing services or offer new services without the physical distribution required by traditional applications and devices. Devices such as Web TVs last longer and gain usefulness over time as they benefit automatically from every new Web-based service.
Components
Figure 1 shows the basic model for giant-scale sites. The model is based on several assumptions. First, I assume the service provider has limited control over the clients and the IP network. Greater control might be possible in some cases, however, such as with intranets. The model also assumes that queries drive the service. This is true for most common protocols including HTTP, FTP, and variations of RPC. For example, HTTP's basic primitive, the "get" command, is by definition a query. My third assumption is that read-only queries greatly outnumber updates (queries that affect the persistent data store). Even sites that we tend to think of as highly transactional, such as e-commerce or financial sites, actually have this type of "read-mostly" traffic1: Product evaluations (reads) greatly outnumber purchases (updates), for example, and stock quotes (reads) greatly outnumber stock trades (updates). Finally, as the sidebar, "Clusters in Giant-Scale Services" (next page) explains, all giant-scale sites use clusters.
The basic model includes six components:
• Clients, such as Web browsers, standalone e-mail readers, or even programs that use XML and SOAP, initiate the queries to the services.
• The best-effort IP network, whether the public Internet or a private network such as an intranet, provides access to the service.
• The load manager provides a level of indirection between the service's external name and the servers' physical names (IP addresses) to preserve the external name's availability in the presence of server faults. The load manager balances load among active servers. Traffic might flow through proxies or firewalls before the load manager.
• Servers are the system's workers, combining CPU, memory, and disks into an easy-to-replicate unit.
[Figure 1 diagram: clients reach a single-site server over the IP network; inside the site, a load manager feeds the servers, an optional backplane connects them, and a persistent data store sits behind them.]
Figure 1. The basic model for giant-scale services. Clients connect via the Internet and then go through a load manager that hides down nodes and balances traffic.
The load manager is a special-purpose computer that assigns incoming HTTP connections to a particular machine. Image from Eric Brewer's IEEE Internet Computing article.
Clusters also used for web services
In other applications, many machines work together on each transaction.
Example: Web searching. The search is partitioned over many machines, each of which holds a part of the database.
The Altavista web search engine did not use clusters; instead, Altavista used shared-memory multiprocessors. This approach could not scale with the web.
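A hypothetical sketch (not from the lecture or the article) of the partitioned-search idea: each node owns a slice of the index, the query is scattered to every node, and the partial results are gathered and merged. The arrays below stand in for machines reached over the network.

#include <stdio.h>
#include <string.h>

#define NODES 4

/* Each "node" holds one shard (slice) of the index. */
static const char *shard[NODES][3] = {
    {"cat", "dog", "fish"},  {"dog", "bird", "cat"},
    {"fish", "cat", "dog"},  {"bird", "dog", "cat"},
};

/* Per-node work: count matches in this node's slice of the index. */
static int node_count(int node, const char *term)
{
    int hits = 0;
    for (int i = 0; i < 3; i++)
        if (strcmp(shard[node][i], term) == 0) hits++;
    return hits;
}

int main(void)
{
    const char *query = "cat";
    int total = 0;
    for (int n = 0; n < NODES; n++)      /* scatter the query to every node */
        total += node_count(n, query);   /* gather and merge partial results */
    printf("\"%s\": %d hits across %d nodes\n", query, total, NODES);
    return 0;
}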
... above 20 Gbits per second. They detect down nodes automatically, usually by monitoring open TCP connections, and thus dynamically isolate down nodes from clients quite well.
Two other load-management approaches are typically employed in combination with layer-4 switches. The first uses custom "front-end" nodes that act as service-specific layer-7 routers (in software).2 Wal-Mart's site uses this approach, for example, because it helps with session management: Unlike switches, the nodes track session information for each user.
The final approach includes clients in the load-management process when possible. This general "smart client" end-to-end approach goes beyond the scope of a layer-4 switch.3 It greatly simplifies switching among different physical sites, which in turn simplifies disaster tolerance and overload recovery. Although there is no generic way to do this for the Web, it is common with other systems. In DNS, for instance, clients know about an alternative server and can switch to it if the primary disappears; with cell phones this approach is implemented as part of roaming; and application servers in the middle tier of three-tier database systems understand database failover.
Figures 2 and 3 illustrate systems at opposite ends of the complexity spectrum: a simple Web farm and a server similar to the Inktomi search engine cluster. These systems differ in load management, use of a backplane, and persistent data store.
The Web farm in Figure 2 uses round-robin DNS for load management. The persistent data store is implemented by simply replicating all content to all nodes, which works well with a small amount of content. Finally, because all servers can handle all queries, there is no coherence traffic and no need for a backplane. In practice, even simple Web farms often have a second LAN (backplane) to simplify manual updates of the replicas. In this version, node failures reduce system capacity, but not data availability.
In Figure 3, a pair of layer-4 switches manages the load within the site. The "clients" are actually other programs (typically Web servers) that use the smart-client approach to failover among different physical clusters, primarily based on load.
Because the persistent store is partitioned across servers, possibly without replication, node failures could reduce the store's effective size and overall capacity. Furthermore, the nodes are no longer identical, and some queries might need to be directed to specific nodes. This is typically accomplished using a layer-7 switch to parse URLs, but some systems, such as clustered Web caches, might also use the backplane to route requests to the correct node.4
High Availability
High availability is a major driving requirement behind giant-scale system design. Other infra- ...
[Figure 2 diagram: clients reach a single-site server over the IP network; round-robin DNS spreads them across nodes that share a simple replicated store.]
Figure 2. A simple Web farm. Round-robin DNS assigns different servers to different clients to achieve simple load balancing. Persistent data is fully replicated and thus all nodes are identical and can handle all queries.
[Figure 3 diagram: client programs reach a single-site server over the IP network; a load manager feeds the nodes, which hold a partitioned data store and are connected by a Myrinet backplane.]
Figure 3. Search engine cluster. The service provides support to other programs (Web servers) rather than directly to end users. These programs connect via layer-4 switches that balance load and hide faults. Persistent data is partitioned across the servers, which increases aggregate capacity but implies there is some data loss when a server is down. A backplane allows all nodes to access all data.
CS 152: What’s left ...
Monday 5/2: Final report due, 11:59 PM
Thursday 5/5: Midterm II, 6 PM to 9 PM, 320 Soda.
Tuesday 5/10: Final presentations.
Watch email for final project peer review request.
No class on Thursday. Review session on Tuesday 5/2, + HKN (???).
Deadline to bring up grading issues: Tues 5/10 @ 5 PM. Contact John at lazzaro@eecs
Tomorrow: Final Project Checkoff
[Block diagram: pipelined CPU, instruction cache, data cache, DRAM controller, and DRAM, connected by the IC, IM, DC, and DM buses.]
TAs will provide “secret” MIPS machine code tests.
Bonus points if these tests run by 2 PM. If not, the TAs give you test code to use over the weekend.