06 - cs131 lec - memory errors
TRANSCRIPT
-
8/3/2019 06 - CS131 Lec - Memory Errors
COMPUTER ARCHITECTURE
Memory Errors, Detection and Correction
-
MEMORY ERRORS
Memory is an electronic storage device
All electronic storage devices have the potential to return information different from what was originally stored
Some technologies are more likely than others to do this
DRAM, because of its nature, is likely to return occasional memory errors
-
MEMORY ERRORS
DRAM stores ones and zeros as charges on small capacitors that must be continually refreshed to ensure that the data is not lost.
This is less reliable than the static storage used by SRAMs
-
MEMORY ERRORS
Two kinds of errors can typically occur in a memory system:
1. Repeatable or hard error
2. Transient or soft error
-
MEMORY ERRORS
Repeatable or hard error: a piece of hardware is broken and will consistently return incorrect results.
A bit may be stuck so that it always returns "0", for example, no matter what is written to it.
Hard errors usually indicate loose memory modules, blown chips, motherboard defects or other physical problems.
They are relatively easy to diagnose and correct because they are consistent and repeatable.
-
MEMORY ERRORS
Transient or soft error: a bit reads back the wrong value once, but subsequently functions correctly.
These problems are much more difficult to diagnose
They are also, unfortunately, more common.
Eventually, a soft error will usually repeat itself, but it can take anywhere from minutes to years for this to happen.
-
MEMORY ERRORS
Soft errors are sometimes caused by:
memory that is physically bad
poor quality motherboards
memory system timings that are set too fast
static shocks
other similar problems that are not related to the memory directly
In addition, stray radioactivity that is naturally present in materials used in PC systems can cause the occasional soft error.
-
MEMORY ERRORS
DRAMs used today are far more reliable than those of five to ten years ago.
This has been the chief excuse used by system vendors who have dropped error detection support from their PCs
-
MEMORY ERRORS
However, there are factors that make the problem worse in modern systems as well:
More memory is being used
Systems today are running much faster
Regardless of how often memory errors occur, they do occur.
How much damage they create depends on when they happen and what it is that they get wrong
-
NON-PARITY, PARITY AND ECC MEMORY
Memory modules have traditionally been available in two basic types: non-parity and parity.
Non-parity is "regular" memory
It contains exactly one bit of memory for every bit of data to be stored
8 bits are used to store each byte of data
Parity memory adds a single extra bit for every eight bits of data, used only for error detection and correction
9 bits are used to store each byte of data
-
NON-PARITY, PARITY AND ECC MEMORY
A summary of the different common module sizes and their bit widths:
Module Type     Non-Parity Bit Width    Parity Bit Width
30-Pin SIMM     8 bits                  9 bits
72-Pin SIMM     32 bits                 36 bits
168-Pin DIMM    64 bits                 72 bits
-
PARITY CHECKING
Parity checking is a rudimentary method of detecting simple, single-bit errors in a memory system
It has been present in PCs since the original IBM PC in 1981, and until the early 1990s was used in every PC sold on the market
It requires the use of parity memory, which provides an extra bit for every byte stored
This extra bit is used to store information to allow error detection
-
PARITY CHECKING
Every byte of data that is stored in the system memory contains 8 bits of real data, each one a zero or a one
It is possible to count up the number of zeros or ones in a byte
For example, the byte 10110011 has 3 zeros and 5 ones
The byte 00100100 has 6 zeros and 2 ones
Some bytes will have an even number of ones, and some will have an odd number
-
ODD PARITY
Each time a byte is written to memory, a logic circuit called a parity generator/checker examines the byte and determines whether the data byte has an even or an odd number of ones.
If it has an even number of ones, the ninth (parity) bit is set to a one, otherwise it is set to a zero
The result is that no matter how many ones there were in the original eight data bits, there is an odd number of ones when you look at all nine bits together.
-
EVEN PARITY
It is also possible to have even parity, where the generator makes the sum always come out even
But the standard in PC memory is odd parity
-
ODD PARITY
This table shows some examples of how this works:
Sample Data Bits    Ones in Data Bits    Parity Bit    Ones Including Parity Bit
00000000            0                    1             1
10110011            5                    0             5
00100100            2                    1             3
11111111            8                    1             9
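The rows of the table above can be reproduced in a few lines of Python. This is a minimal sketch of the rule described on the slides, not the hardware generator itself; the function name `odd_parity_bit` is an assumption made for illustration.

```python
def odd_parity_bit(data: int) -> int:
    """Return the ninth (parity) bit for an 8-bit value under odd parity."""
    ones = bin(data & 0xFF).count("1")   # number of ones in the eight data bits
    return 1 if ones % 2 == 0 else 0     # even count: add a one so the total is odd


for sample in (0b00000000, 0b10110011, 0b00100100, 0b11111111):
    ones = bin(sample).count("1")
    parity = odd_parity_bit(sample)
    print(f"{sample:08b}  ones={ones}  parity={parity}  total={ones + parity}")
```

In every printed row the total number of ones, parity bit included, is odd.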
-
ODD PARITY
When all nine bits are taken together, there is always an odd number of ones
When the data is read back from memory, the parity circuit this time acts as a checker
It reads back all nine bits and determines again if there are an odd or an even number of ones
If there is an even number of ones, there must have been an error in one of the bits, because when it stored the byte the circuit set the parity bit so that there would always be an odd number of ones
-
ODD PARITY
The system knows one bit is wrong, although it doesn't know which one it is
When a parity error is detected, the parity circuit generates what is called a "non-maskable interrupt" or "NMI", which is usually used to instruct the processor to immediately halt
This is done to ensure that the incorrect memory does not end up corrupting anything
-
ODD PARITY
What happens if there is an error in two of the bits?
Example data: 00100100
Stored data: 00100100 1 (with parity bit)
Read back as: 01100000 1
Here we have two bits that have flipped, one of them from 1 to 0 and the other from 0 to 1.
But the number of ones is still odd!
Parity does not protect against double-bit errors.
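The blind spot can be checked directly. The sketch below packs the eight data bits and the odd-parity bit into one 9-bit integer; the helper names `store` and `parity_ok` are assumptions made for illustration, and the values are the ones used in the example above.

```python
def ones(value: int) -> int:
    """Count the one bits in an integer."""
    return bin(value).count("1")


def store(data: int) -> int:
    """Return the 9-bit word: eight data bits followed by an odd-parity bit."""
    parity = 1 if ones(data) % 2 == 0 else 0
    return (data << 1) | parity


def parity_ok(word: int) -> bool:
    """The check passes when the nine bits contain an odd number of ones."""
    return ones(word) % 2 == 1


stored = store(0b00100100)            # 00100100 1
single = stored ^ 0b000000100         # one bit flipped
double = (0b01100000 << 1) | 1        # read back as 01100000 1 (two bits flipped)

print(parity_ok(stored))   # stored word passes the check
print(parity_ok(single))   # single-bit error is caught
print(parity_ok(double))   # double-bit error slips through
```

Any even number of flipped bits leaves the count of ones odd, so the checker cannot see it.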
-
PARITY CHECKING
Incidentally, contrary to popular myth, parity checking does not slow down the operation of the memory system at all
The parity bit generation and detection is done in parallel with the writing or reading of the system memory, in transistor-transistor logic (TTL) that is much faster than the DRAM circuits being used
If the checker finds an error, it raises an interrupt.
-
ECC
ECC stands for error correcting circuits, error correcting code, or error correction code
This protocol not only detects both single-bit and multi-bit errors, it will actually correct single-bit errors
Unlike parity, which uses a single bit to provide protection to eight bits, ECC uses larger groupings: 7 bits to protect 32 bits, or 8 bits to protect 64 bits.
-
ECC
There are special ECC memory modules designed specifically for use in ECC mode, but most modern motherboards that support ECC will in fact work in that mode using standard parity memory modules as well.
Since parity memory includes one extra bit for every eight bits of data, 64 bits worth of parity memory is 72 bits wide, which leaves the 8 extra bits needed to do ECC.
-
ECC
ECC has the ability to correct a detected single-bit error in a 64-bit block of memory
When this happens, the computer will continue without an interruption
It will have no idea that anything even happened
-
ECC
ECC will detect (but not correct) errors of 2, 3 or even 4 bits, in addition to detecting (and correcting) single-bit errors.
ECC memory handles these multi-bit errors similarly to how parity handles single-bit errors: a non-maskable interrupt (NMI) that instructs the system to shut down to avoid data corruption.
Multi-bit errors are extremely rare in memory.
-
ECC
Unlike parity checking, ECC will cause a slight slowdown in system operation
The reason is that the ECC algorithm is more complicated, and a bit of time must be allowed for ECC to correct any detected errors
The penalty is usually one extra wait state per memory read
This translates in most cases to a real world decrease in performance of approximately 2-3%
-
HAMMING DISTANCE (HD)
The Hamming distance between two values is the number of bit positions in which their bit patterns differ
Example: 1101 and 1010 have a Hamming distance of 3
It is computed by taking the bitwise Boolean EXCLUSIVE OR of the two codewords and counting the number of 1 bits in the result
Its significance is that if two codewords are a Hamming distance d apart, it will require d single-bit errors to convert one into the other
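The XOR-and-count recipe is a one-liner in Python. This is a minimal sketch; the function name `hamming_distance` is an assumption made for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """XOR leaves a 1 wherever the two values differ; count those ones."""
    return bin(a ^ b).count("1")


# 1101 XOR 1010 = 0111, which contains three ones
print(hamming_distance(0b1101, 0b1010))
```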
-
HD OF A CODE SET
A code set is just a series of bit values
A code set that uses n bits may or may not use all $2^n$ possible bit patterns
The HD of a code set is the minimum HD between any pair of elements in the code set
Example code set (HD = 1):
XXX
000
010
100
110
-
HD OF A CODE SET
Example code set (HD = 2):
XXXX
0000
0101
1010
1100
The HD of the ASCII character set is 1
In general, if a code set uses up all possible bit patterns, HD = 1
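The HD of a code set can be found by brute force over every pair of codewords. The sketch below represents codewords as bit strings and checks the two example code sets; the names `distance` and `code_set_hd` are assumptions made for illustration, and 7-bit ASCII is modelled simply as all 128 seven-bit patterns.

```python
from itertools import combinations


def distance(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def code_set_hd(codes) -> int:
    """Minimum Hamming distance over every pair of codewords in the set."""
    return min(distance(a, b) for a, b in combinations(codes, 2))


print(code_set_hd(["000", "010", "100", "110"]))        # three-bit example set
print(code_set_hd(["0000", "0101", "1010", "1100"]))    # four-bit example set

# A set that uses every 7-bit pattern, as 7-bit ASCII does
ascii_patterns = [format(n, "07b") for n in range(128)]
print(code_set_hd(ascii_patterns))
```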
-
ERROR CORRECTION
To detect d-bit errors, the HD of the code set must be greater than or equal to d+1
To correct d-bit errors, HD >= 2d+1
Notice that the addition of the parity bit ensures that the HD of the code set is 2, enabling it to detect 1-bit errors
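The effect of the parity bit on the HD can be verified on a small case. The sketch below uses all eight 3-bit patterns instead of full bytes to keep the output short (an assumption made for brevity), appends an odd-parity bit to each, and applies the two inequalities above. The function names are hypothetical.

```python
from itertools import combinations, product


def distance(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def code_set_hd(codes) -> int:
    """Minimum distance over every pair of codewords."""
    return min(distance(a, b) for a, b in combinations(codes, 2))


def add_odd_parity(bits: str) -> str:
    """Append one bit so that the total number of ones is odd."""
    return bits + ("1" if bits.count("1") % 2 == 0 else "0")


plain = ["".join(p) for p in product("01", repeat=3)]   # all 8 three-bit patterns
protected = [add_odd_parity(c) for c in plain]

for name, codes in (("plain", plain), ("with parity", protected)):
    hd = code_set_hd(codes)
    # HD >= d + 1 detects d-bit errors; HD >= 2d + 1 corrects d-bit errors
    print(f"{name}: HD={hd}, detects {hd - 1}, corrects {(hd - 1) // 2}")
```

With HD = 2 the protected set detects one error but corrects none, which is why parity memory can only halt on an error.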
-
ERROR CORRECTION
Example: the code set {00000, 01011, 10110, 11101} has an HD of 3, enabling it to detect and correct 1-bit errors
If a bit pattern is transmitted and one of the bits is inverted, the receiver would immediately recognize that the pattern received is not one of the valid patterns
To correct the error, the receiver inverts each of the bits in turn until a valid pattern is obtained
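The invert-and-test procedure can be sketched as follows, using the code set from the example. The function names are assumptions made for illustration, and returning `None` for patterns that no single flip can repair is a choice made for this sketch.

```python
CODE_SET = {"00000", "01011", "10110", "11101"}


def flip(bits: str, i: int) -> str:
    """Return a copy of the bit string with position i inverted."""
    return bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]


def correct_single_error(received: str):
    """Invert each bit in turn until a valid codeword appears."""
    if received in CODE_SET:
        return received                     # nothing to correct
    for i in range(len(received)):
        candidate = flip(received, i)
        if candidate in CODE_SET:
            return candidate                # unique, because HD = 3
    return None                             # more than one bit is wrong


print(correct_single_error("01111"))   # 01011 with its third bit inverted
print(correct_single_error("11000"))   # two bits away from every codeword
```

Because the HD is 3, a pattern with one flipped bit is one step from its original codeword and at least two steps from every other, so the first match is the only match.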
-
HAMMING CODE
Formal means of transmitting bit patterns so that the receiver can detect (correct) 1-bit errors
Given m data bits, we add r check bits and send m + r bits
r is chosen such that $m + r + 1 \le 2^r$
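The smallest r satisfying the standard Hamming condition $m + r + 1 \le 2^r$ can be found by simple search. This is a minimal sketch; the function name `check_bits_needed` is an assumption made for illustration.

```python
def check_bits_needed(m: int) -> int:
    """Smallest r with m + r + 1 <= 2**r, for m data bits."""
    r = 1
    while m + r + 1 > 2 ** r:
        r += 1
    return r


for m in (4, 8, 32, 64):
    r = check_bits_needed(m)
    print(f"m={m:2d}  r={r}  codeword length={m + r}")
```

For 32 and 64 data bits this gives 6 and 7 check bits. Memory ECC commonly adds one overall parity bit on top so that double-bit errors can be detected as well, which matches the 7-for-32 and 8-for-64 figures quoted earlier.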
-
HAMMING CODE
Position the r check bits among the m data bits
Bits are numbered starting from 1, starting at the left
Example: m = 4 gives $r + 5 \le 2^r$, so r = 3
-
HAMMING CODE
The value of each check bit is the parity of the bit positions whose binary representation has a 1 in the same place as the check bit's own position
1: 1, 3, 5, 7, 9, 11
2: 2, 3, 6, 7, 10, 11
4: 4, 5, 6, 7, 12
8: 8, 9, 10, 11, 12
Position    Binary
1           0001
2           0010
3           0011
4           0100
5           0101
6           0110
7           0111
8           1000
9           1001
10          1010
11          1011
12          1100
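The coverage lists follow directly from the binary form of each position: a check bit covers every position whose binary representation has a 1 where the check bit's own position does. A minimal sketch, with the function name `covered_positions` assumed for illustration:

```python
def covered_positions(check_bit: int, length: int) -> list:
    """Positions (1-based) whose binary form shares the check bit's single 1."""
    return [pos for pos in range(1, length + 1) if pos & check_bit]


for check_bit in (1, 2, 4, 8):
    print(check_bit, covered_positions(check_bit, 12))
```

For a 12-bit codeword, position 11 (1011) falls under check bits 1, 2 and 8, and position 12 (1100) under check bits 4 and 8.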
-
HAMMING CODE
Example: Using even parity, transmit 0010
Position:  1 2 3 4 5 6 7
Sent:      0 1 0 1 0 1 0
If bit #3 changed
Position:  1 2 3 4 5 6 7
Received:  0 1 1 1 0 1 0
Check bit: positions covered, parity correct?
4: 4, 5, 6, 7    YES
2: 2, 3, 6, 7    NO
1: 1, 3, 5, 7    NO
The erroneous bit is the sum of all check bits that are incorrect: 2 + 1 = 3
-
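HAMMING CODE
Example: Using even parity, transmit 0010
Position:  1 2 3 4 5 6 7
Sent:      0 1 0 1 0 1 0
If bit #6 changed
Position:  1 2 3 4 5 6 7
Received:  0 1 0 1 0 0 0
Check bit: positions covered, parity correct?
4: 4, 5, 6, 7    NO
2: 2, 3, 6, 7    NO
1: 1, 3, 5, 7    YES
Sum of the incorrect check bits: 4 + 2 = 6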
-
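HAMMING CODE
Example: Using even parity, transmit 0010
Position:  1 2 3 4 5 6 7
Sent:      0 1 0 1 0 1 0
If bit #4 changed
Position:  1 2 3 4 5 6 7
Received:  0 1 0 0 0 1 0
Check bit: positions covered, parity correct?
4: 4, 5, 6, 7    NO
2: 2, 3, 6, 7    YES
1: 1, 3, 5, 7    YES
Sum of the incorrect check bits: 4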
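The three worked examples can be replayed in code. The sketch below encodes four data bits into a 7-bit even-parity Hamming codeword (check bits at positions 1, 2 and 4, data at 3, 5, 6 and 7) and sums the failing check bits to locate an error. The function names `encode`, `syndrome` and `correct` are assumptions made for illustration.

```python
CHECK_POSITIONS = (1, 2, 4)
DATA_POSITIONS = (3, 5, 6, 7)


def encode(data_bits):
    """Place four data bits at positions 3, 5, 6, 7 and fill in even-parity check bits."""
    word = [0] * 8                      # index 0 unused so indices match bit positions
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        word[pos] = bit
    for check in CHECK_POSITIONS:
        covered = [p for p in range(1, 8) if p & check and p != check]
        word[check] = sum(word[p] for p in covered) % 2
    return word[1:]


def syndrome(codeword):
    """Sum the check-bit positions whose parity group has an odd number of ones."""
    total = 0
    for check in CHECK_POSITIONS:
        group = [codeword[p - 1] for p in range(1, 8) if p & check]
        if sum(group) % 2 == 1:
            total += check
    return total


def correct(codeword):
    """Invert the bit named by the syndrome, if any."""
    fixed = list(codeword)
    bad = syndrome(fixed)
    if bad:
        fixed[bad - 1] ^= 1
    return fixed


sent = encode([0, 0, 1, 0])
print(sent)
for position in (3, 6, 4):
    received = list(sent)
    received[position - 1] ^= 1         # flip one bit in transit
    print(position, syndrome(received), correct(received) == sent)
```

A syndrome of 0 means every check passes; any other value is the position of the single flipped bit, including the case where the flipped bit is itself a check bit.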
-
THE MARKET'S CHANGE FROM PARITY TO NON-PARITY MEMORY
At one time, all computers used parity memory
This changed rapidly, and in a few short years parity went from the default and the standard to the minority in new systems
Most Pentium class systems not only do not use parity memory, most cannot support parity checking (or ECC) at all
What happened?
-
MEMORY ERRORS
There are valid technical reasons to not do parity checking but the original motivation was cost savings
By not having to include an extra bit of parity storage for every eight bits of data, non-parity memory is approximately 11% cheaper than parity memory
-
MEMORY ERRORS
The move to non-parity memory seemed to be nearly universal
There were a few reasons for this:
When Intel introduced the Pentium it also introduced its new chipsets for Pentium motherboards
These early chipsets were the standard from the early going, and they did not support parity checking (Intel's 430HX Pentium chipset supports parity and ECC, but was not introduced until two years after Intel's first Pentium chipsets)
-
MEMORY ERRORS
Removing parity checking was easy to do because few PC buyers have enough computer knowledge to understand the difference or what the implications are of not having parity memory
Keeping parity memory had no positive impact on performance
The negative effects of not having parity memory (risk of unreliable operation) are virtually impossible for the average person to see; they tend to blame some other component.
-
MEMORY ERRORS
Overall, the market became oriented so that a vendor deciding to include parity memory would incur an additional cost and get no sales benefit for it
Most of the reasons for parity memory costing more today are supply and demand issues