06 - cs131 lec - memory errors

Upload: iamca-cagurangan

Post on 06-Apr-2018

  • 8/3/2019 06 - CS131 Lec - Memory Errors

    COMPUTER ARCHITECTURE

    Memory Errors, Detection and Correction

    MEMORY ERRORS

    Memory is an electronic storage device

    All electronic storage devices have the potential to return information different from what was originally stored

    Some technologies are more likely than others to do this

    DRAM, because of its nature, is likely to return occasional memory errors

    MEMORY ERRORS

    DRAM stores ones and zeros as charges on small capacitors that must be continually refreshed to ensure that the data is not lost

    This is less reliable than the static storage used by SRAMs

    MEMORY ERRORS

    Two kinds of errors can typically occur in a memory system:

    1. Repeatable or hard error
    2. Transient or soft error

    MEMORY ERRORS

    Repeatable or hard error - a piece of hardware is broken and will consistently return incorrect results.

    A bit may be stuck so that it always returns "0", for example, no matter what is written to it.

    Hard errors usually indicate loose memory modules, blown chips, motherboard defects or other physical problems.

    They are relatively easy to diagnose and correct because they are consistent and repeatable.

    MEMORY ERRORS

    Transient or soft error - this occurs when a bit reads back the wrong value once, but subsequently functions correctly.

    These problems are much more difficult to diagnose.

    They are also, unfortunately, more common.

    Eventually, a soft error will usually repeat itself, but it can take anywhere from minutes to years for this to happen.

    MEMORY ERRORS

    Soft errors are sometimes caused by:

        memory that is physically bad
        poor quality motherboards
        memory system timings that are set too fast
        static shocks
        other similar problems that are not related to the memory directly

    In addition, stray radioactivity that is naturally present in materials used in PC systems can cause the occasional soft error.

    MEMORY ERRORS

    DRAMs used today are far more reliable than those of five to ten years ago.

    This has been the chief excuse used by system vendors who have dropped error detection support from their PCs

    MEMORY ERRORS

    However, there are factors that make the problem worse in modern systems as well

        More memory is being used
        Systems today are running much faster

    Regardless of how often memory errors occur, they do occur.

    How much damage they create depends on when they happen and what it is that they get wrong

    NON-PARITY, PARITY AND ECC MEMORY

    Memory modules have traditionally been available in two basic types: non-parity and parity.

    Non-parity is "regular" memory

        it contains exactly one bit of memory for every bit of data to be stored
        8 bits are used to store each byte of data

    Parity memory adds a single extra bit for every eight bits of data, used only for error detection and correction

        9 bits are used to store each byte of data

    NON-PARITY, PARITY AND ECC MEMORY

    A summary of the different common module sizes and their bit widths:

    Module Type     Bit Width of Non-Parity Module   Bit Width of Parity Module
    30-Pin SIMM     8 bits                           9 bits
    72-Pin SIMM     32 bits                          36 bits
    168-Pin DIMM    64 bits                          72 bits

    PARITY CHECKING

    Parity checking is a rudimentary method of detecting simple, single-bit errors in a memory system

    It has in fact been present in PCs since the original IBM PC in 1981, and until the early 1990s was used in every PC sold on the market

    It requires the use of parity memory, which provides an extra bit for every byte stored

    This extra bit is used to store information to allow error detection

    PARITY CHECKING

    Every byte of data that is stored in the system memory contains 8 bits of real data, each one a zero or a one

    It is possible to count up the number of zeros or ones in a byte

    For example, the byte 10110011 has 3 zeros and 5 ones

    The byte 00100100 has 6 zeros and 2 ones

    Some bytes will have an even number of ones, and some will have an odd number
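    The counting described above can be sketched in a few lines of Python. The helper name count_bits is hypothetical and is not part of the lecture; it simply reproduces the two sample bytes.

```python
def count_bits(byte_str):
    """Return (zeros, ones) for a string of '0'/'1' characters."""
    ones = byte_str.count("1")
    return len(byte_str) - ones, ones


# The two sample bytes from the slide.
for sample in ("10110011", "00100100"):
    zeros, ones = count_bits(sample)
    kind = "even" if ones % 2 == 0 else "odd"
    print(f"{sample}: {zeros} zeros, {ones} ones ({kind} number of ones)")
```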

    ODD PARITY

    Each time a byte is written to memory, a logic circuit called a parity generator/checker examines the byte and determines whether the data byte has an even or an odd number of ones.

    If it has an even number of ones, the ninth (parity) bit is set to a one, otherwise it is set to a zero

    The result is that no matter how many ones there were in the original eight data bits, there is an odd number of ones when you look at all nine bits together.

    EVEN PARITY

    It is also possible to have even parity, where the generator makes the sum always come out even

    But the standard in PC memory is odd parity

    ODD PARITY

    This table shows some examples of how this works:

    Sample Data Bits   Number of Ones in Data Bits   Parity Bit   Number of Ones Including Parity Bit
    00000000           0                             1            1
    10110011           5                             0            5
    00100100           2                             1            3
    11111111           8                             1            9
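    As a minimal sketch of the generator side, the following Python models the odd-parity rule in software. The real generator is a hardware circuit; the function name odd_parity_bit is an assumption made for illustration. It reproduces the four rows of the table.

```python
def odd_parity_bit(data):
    """Parity bit that makes the count of ones in all nine bits odd."""
    ones = bin(data & 0xFF).count("1")
    # Even number of ones in the data -> parity bit 1, otherwise 0.
    return 1 if ones % 2 == 0 else 0


for data in (0b00000000, 0b10110011, 0b00100100, 0b11111111):
    ones = bin(data).count("1")
    parity = odd_parity_bit(data)
    print(f"{data:08b}  ones={ones}  parity={parity}  total={ones + parity}")
```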

    ODD PARITY

    When all nine bits are taken together, there is always an odd number of ones

    When the data is read back from memory, the parity circuit this time acts as a checker

    It reads back all nine bits and determines again if there are an odd or an even number of ones

    If there is an even number of ones, there must have been an error in one of the bits, because when it stored the byte the circuit set the parity bit so that there would always be an odd number of ones

    ODD PARITY

    The system knows one bit is wrong, although it doesn't know which one it is

    When a parity error is detected, the parity circuit generates what is called a "non-maskable interrupt" or "NMI", which is usually used to instruct the processor to immediately halt

    This is done to ensure that the incorrect memory does not end up corrupting anything

    ODD PARITY

    What happens if there is an error in two of the bits?

    Example data: 00100100

    Stored data:  00100100 1 (with parity bit)
    Read back as: 01100000 1

    Here we have two bits that have flipped, one of them from 1 to 0 and the other from 0 to 1. But the number of ones is still odd!

    Parity does not protect against double-bit errors.
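    The checker side can be sketched the same way. The snippet below is an illustrative software model, not the hardware circuit; the names odd_parity_bit and parity_ok are hypothetical. It shows a single-bit flip being caught while the two-bit corruption from the example slips through.

```python
def odd_parity_bit(data):
    """Parity bit that makes the count of ones in all nine bits odd."""
    return 1 if bin(data & 0xFF).count("1") % 2 == 0 else 0


def parity_ok(data, parity):
    """True when the nine bits read back hold an odd number of ones."""
    return (bin(data & 0xFF).count("1") + parity) % 2 == 1


stored = 0b00100100
parity = odd_parity_bit(stored)      # 1 for this byte

single_flip = stored ^ 0b00000100    # one bit flipped: 00100000
double_flip = 0b01100000             # the two-bit corruption from the example

print(parity_ok(stored, parity))       # True: data intact
print(parity_ok(single_flip, parity))  # False: single-bit error detected
print(parity_ok(double_flip, parity))  # True: double-bit error goes unnoticed
```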

    PARITY CHECKING

    Incidentally, contrary to popular myth, parity checking does not slow down the operation of the memory system at all

    The parity bit generation and detection is done in parallel with the writing or reading of the system memory, in transistor-to-transistor logic that is much faster than the DRAM memory circuits being used

    If the checker finds an error, it raises an interrupt.

    ECC

    ECC stands for error correcting circuits, error correcting code, or error correction code

    This protocol not only detects both single-bit and multi-bit errors, it will actually correct single-bit errors

    Unlike parity, which uses a single bit to provide protection to eight bits, ECC uses larger groupings: 7 bits to protect 32 bits, or 8 bits to protect 64 bits.

    ECC

    There are special ECC memory modules designed specifically for use in ECC mode, but most modern motherboards that support ECC will in fact work in that mode using standard parity memory modules as well.

    Since parity memory includes one extra bit for every eight bits of data, this means 64 bits worth of parity memory is 72 bits wide, which means there is enough to do ECC.
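    The width arithmetic above can be checked with a short sketch. The check-bit counts (7 for 32 data bits, 8 for 64) are the groupings quoted earlier in the lecture; the function name parity_module_width and the dictionary are assumptions made for illustration.

```python
def parity_module_width(data_bits):
    """Total width of parity memory: one extra bit per eight data bits."""
    return data_bits + data_bits // 8


# ECC groupings quoted in the lecture: data bits -> check bits needed.
ECC_CHECK_BITS = {32: 7, 64: 8}

for data_bits, needed in ECC_CHECK_BITS.items():
    width = parity_module_width(data_bits)
    spare = width - data_bits
    print(f"{data_bits} data bits: {width} bits wide, {spare} spare, "
          f"ECC needs {needed}, enough: {spare >= needed}")
```

    Under these numbers, only the 64-bit grouping has enough spare parity bits (8) to hold the ECC check bits; a 36-bit-wide grouping offers 4 spare bits against the 7 that ECC would need for 32 data bits.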

    ECC

    ECC has the ability to correct a detected single-bit error in a 64-bit block of memory

    When this happens, the computer will continue without an interruption

    It will have no idea that anything even happened

    ECC

    ECC will detect (but not correct) errors of 2, 3 or even 4 bits, in addition to detecting (and correcting) single-bit errors.

    ECC memory handles these multi-bit errors similarly to how parity handles single-bit errors: it raises a non-maskable interrupt (NMI) that instructs the system to shut down to avoid data corruption.

    Multi-bit errors are extremely rare in memory.

    ECC

    Unlike parity checking, ECC will cause a slight slowdown in system operation

    The reason is that the ECC algorithm is more complicated, and a bit of time must be allowed for ECC to correct any detected errors

    The penalty is usually one extra wait state per memory read

    This translates in most cases to a real world decrease in performance of approximately 2-3%

    HAMMING DISTANCE (HD)

    The Hamming distance between two values is the number of bit positions in which their bit patterns differ

    Example: 1101 and 1010 have a Hamming distance of 3

    It is computed by taking the bitwise Boolean EXCLUSIVE OR of the two codewords and counting the number of 1 bits in the result

    Its significance is that if two codewords are a Hamming distance d apart, it will require d single-bit errors to convert one into the other
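    The XOR-and-count recipe translates directly into code. This is a minimal sketch with a hypothetical function name, hamming_distance, applied to the example pair from the slide.

```python
def hamming_distance(a, b):
    """Number of bit positions in which the integers a and b differ."""
    # XOR leaves a 1 wherever the two patterns disagree.
    return bin(a ^ b).count("1")


print(hamming_distance(0b1101, 0b1010))  # 3, since 1101 XOR 1010 = 0111
```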

    HD OF A CODE SET

    A code set is just a series of bit values

    A code set that uses n bits may or may not use all 2^n possible bit patterns

    The HD of a code set is the minimum HD between each pair of elements in the code set

    Example code set (HD = 1):

    XXX
    000
    010
    100
    110

    HD OF A CODE SET

    Example code set (HD = 2):

    XXXX
    0000
    0101
    1010
    1100

    The HD of the ASCII character set is 1

    In general, if a code set uses up all possible bit patterns, HD = 1
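    The HD of a code set can be computed by brute force over every pair of codewords. The sketch below assumes the two example code sets are {000, 010, 100, 110} and {0000, 0101, 1010, 1100}, as read from the slides; the function names are hypothetical.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")


def code_set_distance(codewords):
    """Minimum Hamming distance over every pair of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


print(code_set_distance([0b000, 0b010, 0b100, 0b110]))      # 1
print(code_set_distance([0b0000, 0b0101, 0b1010, 0b1100]))  # 2
print(code_set_distance(range(8)))  # 1: every 3-bit pattern is in use
```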

    ERROR CORRECTION

    To detect d-bit errors, the HD of the code set must be greater than or equal to d+1

    To correct d-bit errors, HD >= 2d+1

    Notice that the addition of the parity bit ensures that the HD of the code set is 2, enabling it to detect 1-bit errors
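    Solving the two inequalities for d gives the largest error size a code set of a given HD can handle. A small sketch, with hypothetical function names:

```python
def max_detectable(hd):
    """Largest d with hd >= d + 1."""
    return hd - 1


def max_correctable(hd):
    """Largest d with hd >= 2*d + 1."""
    return (hd - 1) // 2


for hd in (1, 2, 3):
    print(f"HD={hd}: detects up to {max_detectable(hd)}-bit errors, "
          f"corrects up to {max_correctable(hd)}-bit errors")
```

    With HD = 2 (a byte plus its parity bit) this gives detection of 1-bit errors and no correction, matching the behavior of parity checking described earlier.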

    ERROR CORRECTION

    Example: the code set {00000, 01011, 10110, 11101} has an HD of 3, enabling it to detect and correct 1-bit errors

    If a bit pattern is transmitted and one of the bits is inverted, the receiver would immediately recognize that the pattern received is not one of the valid patterns

    To correct the error, the receiver inverts each of the bits in turn until a valid pattern is obtained
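    The invert-until-valid procedure can be sketched as follows, using the example code set from the slide. The helper names flip and correct_single_error are hypothetical, and the sketch assumes at most one bit was inverted in transit.

```python
CODE_SET = {"00000", "01011", "10110", "11101"}


def flip(bits, i):
    """Return bits with the bit at index i inverted."""
    return bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]


def correct_single_error(received, code_set=CODE_SET):
    """Return the valid pattern at most one bit away, or None if there is none."""
    if received in code_set:
        return received               # already a valid pattern
    for i in range(len(received)):
        candidate = flip(received, i)
        if candidate in code_set:
            return candidate          # unique, because the HD of the set is 3
    return None                       # more than one bit must be wrong


print(correct_single_error("01011"))  # 01011 (valid as received)
print(correct_single_error("01111"))  # 01011 (third bit was inverted)
```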

    HAMMING CODE

    Formal means of transmitting bit patterns so that the receiver can detect (correct) 1-bit errors

    Given m data bits, we add r check bits and send m + r bits

    r is chosen such that m + r + 1 <= 2^r

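    HAMMING CODE

    Position the r check bits within the m data bits, at the positions that are powers of two (1, 2, 4, 8, ...)

    Bits are numbered starting from 1, starting at the left

    Example: m = 4 requires r + 5 <= 2^r, so r = 3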
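    The choice of r can be sketched as a search for the smallest value that satisfies the standard single-error-correcting Hamming condition m + r + 1 <= 2^r. The slide's own formula is cut off in this transcript, so that inequality is an assumption here, and the function name check_bits_needed is hypothetical.

```python
def check_bits_needed(m):
    """Smallest r such that m + r + 1 <= 2**r."""
    r = 0
    while m + r + 1 > 2 ** r:
        r += 1
    return r


for m in (4, 8, 32, 64):
    r = check_bits_needed(m)
    print(f"m={m}: r={r}, codeword length m + r = {m + r}")
```

    For m = 4 this gives r = 3 (a 7-bit codeword), and for m = 8 it gives r = 4 (a 12-bit codeword).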

    HAMMING CODE

    The value of each check bit is the parity computed over the bit positions whose binary representation contains that check bit's position

    1: 1,3,5,7,9,11
    2: 2,3,6,7,10,11
    4: 4,5,6,7,12
    8: 8,9,10,11,12

    Position   Binary
    1          0001
    2          0010
    3          0011
    4          0100
    5          0101
    6          0110
    7          0111
    8          1000
    9          1001
    10         1010
    11         1011
    12         1100
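    The coverage lists follow mechanically from the binary position numbers: check bit c covers every position whose binary representation has the c bit set. A minimal sketch, with a hypothetical function name:

```python
def covered_positions(check_bit, total_bits):
    """Positions 1..total_bits whose binary form includes check_bit."""
    return [pos for pos in range(1, total_bits + 1) if pos & check_bit]


# A 12-bit codeword (8 data bits plus check bits at 1, 2, 4 and 8).
for check_bit in (1, 2, 4, 8):
    print(f"{check_bit}: {covered_positions(check_bit, 12)}")
```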

    HAMMING CODE

    Example: Using even parity, transmit 0010

    1 2 3 4 5 6 7
    0 1 0 1 0 1 0

    If bit #3 changed

    1 2 3 4 5 6 7
    0 1 1 1 0 1 0

    The erroneous bit is the sum of all check bits that are incorrect

    Check bit   Positions covered   Incorrect?
    4           4, 5, 6, 7          NO
    2           2, 3, 6, 7          YES
    1           1, 3, 5, 7          YES

    2 + 1 = 3, so bit #3 is the erroneous bit

    HAMMING CODE

    Example: Using even parity, transmit 0010

    1 2 3 4 5 6 7
    0 1 0 1 0 1 0

    If bit #6 changed

    1 2 3 4 5 6 7
    0 1 0 1 0 0 0

    Check bit   Positions covered   Incorrect?
    4           4, 5, 6, 7          YES
    2           2, 3, 6, 7          YES
    1           1, 3, 5, 7          NO

    4 + 2 = 6, so bit #6 is the erroneous bit

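    HAMMING CODE

    Example: Using even parity, transmit 0010

    1 2 3 4 5 6 7
    0 1 0 1 0 1 0

    If bit #4 changed

    1 2 3 4 5 6 7
    0 1 0 0 0 1 0

    Check bit   Positions covered   Incorrect?
    4           4, 5, 6, 7          YES
    2           2, 3, 6, 7          NO
    1           1, 3, 5, 7          NO

    Only check bit 4 is incorrect, so bit #4 (the check bit itself) is the erroneous bit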
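    The three worked examples can be reproduced with a short sketch of a 7-bit Hamming code using even parity, with check bits at positions 1, 2 and 4 and positions numbered from the left starting at 1. The function names are hypothetical, and the sketch assumes at most one bit is in error.

```python
def hamming74_encode(data_bits):
    """Place 4 data bits at positions 3, 5, 6, 7 and fill in check bits 1, 2, 4."""
    code = [0] * 8                    # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = data_bits
    for check in (1, 2, 4):
        covered = [p for p in range(1, 8) if p & check and p != check]
        code[check] = sum(code[p] for p in covered) % 2   # even parity
    return code[1:]


def hamming74_syndrome(word):
    """Sum of the check bits whose parity test fails (0 means no error seen)."""
    code = [0] + list(word)
    syndrome = 0
    for check in (1, 2, 4):
        if sum(code[p] for p in range(1, 8) if p & check) % 2 != 0:
            syndrome += check
    return syndrome


def hamming74_correct(word):
    """Invert the bit named by the syndrome, if any, and return the result."""
    fixed = list(word)
    syndrome = hamming74_syndrome(fixed)
    if syndrome:
        fixed[syndrome - 1] ^= 1
    return fixed


sent = hamming74_encode([0, 0, 1, 0])
print(sent)                           # [0, 1, 0, 1, 0, 1, 0]

for bad_bit in (3, 6, 4):             # the three cases from the slides
    received = list(sent)
    received[bad_bit - 1] ^= 1
    print(bad_bit, hamming74_syndrome(received), hamming74_correct(received) == sent)
```

    In each case the syndrome equals the position of the flipped bit, which is why adding up the incorrect check bits locates the error.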

    THE MARKET'S CHANGE FROM PARITY TO NON-PARITY MEMORY

    At one time, all computers used parity memory

    This changed rapidly, and in a few short years parity went from the default and the standard to the minority in new systems

    Most Pentium class systems not only do not use parity memory, most cannot support parity checking (or ECC) at all

    What happened?

    MEMORY ERRORS

    There are valid technical reasons to not do parity checking, but the original motivation was cost savings

    By not having to include an extra bit of parity storage for every eight bits of data, non-parity memory is approximately 11% cheaper than parity memory (8 bits stored per byte instead of 9, or about 1/9 less storage)

    MEMORY ERRORS

    The move to non-parity memory seemed to be nearly universal

    There were a few reasons for this:

    When Intel introduced the Pentium it also introduced its new chipsets for Pentium motherboards

    These early chipsets were the standard from the early going, and they did not support parity checking (Intel's 430HX Pentium chipset supports parity and ECC, but was not introduced until two years after Intel's first Pentium chipsets)

    MEMORY ERRORS

    Removing parity checking was easy to do because few PC buyers have enough computer knowledge to understand the difference or what the implications are of not having parity memory

    Keeping parity memory had no positive impact on performance

    The negative effects of not having parity memory (risk of unreliable operation) are virtually impossible for the average person to see - they tend to blame some other component.

    MEMORY ERRORS

    Overall, the market became oriented so that a vendor deciding to include parity memory would incur an additional cost and get no sales benefit for it

    Most of the reasons for parity memory costing more today are supply and demand issues