
Spring 2006 CS 332 1

Reliable Byte-Stream (TCP)

Outline
• Connection Establishment/Termination
• Sliding Window Revisited
• Flow Control
• Adaptive Timeout


End-to-End Protocols

• Underlying best-effort network
– drops messages
– re-orders messages
– delivers duplicate copies of a given message
– limits messages to some finite size
– delivers messages after an arbitrarily long delay

• Common end-to-end services
– guarantee message delivery
– deliver messages in the same order they are sent
– deliver at most one copy of each message
– support arbitrarily large messages
– support synchronization (between sender and receiver)
– allow the receiver to flow control the sender
– support multiple application processes on each host


Simple Demultiplexer (UDP)

• Extends host-to-host service into process-to-process
• Unreliable and unordered datagram service
• Adds multiplexing
• No flow control
• Endpoints identified by ports (why not PID?)
– servers have well-known ports (clients don’t need this)
– the well-known port is often just the starting point
– see /etc/services on Unix
– implemented as a message queue


Simple Demultiplexer (UDP)

• Header format
– Note 16-bit port number (so only 64K ports)
– Process really identified via <port, host> pair
• Checksum (optional in IPv4, mandatory in IPv6)
– computed over pseudo header + UDP header + data
– Pseudo header: protocol number, source IP, destination IP, UDP length field (Why?)

[Figure: UDP header, bit positions 0, 16, 31: SrcPort, DstPort, Length, Checksum, Data]
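As a rough illustration of what the checksum covers, the following Python sketch computes an IPv4 UDP checksum over the pseudo header, the UDP header, and the data. The function names and example addresses are invented for this sketch; the header field order (SrcPort, DstPort, Length, Checksum) is the one given in RFC 768.

```python
import socket
import struct

UDP_PROTOCOL_NUMBER = 17  # IP protocol number assigned to UDP


def ones_complement_sum(data: bytes) -> int:
    """16-bit ones' complement sum; odd-length data is padded with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """Source IP, destination IP, zero byte, protocol number, UDP length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, UDP_PROTOCOL_NUMBER, udp_length))


def udp_checksum(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 payload: bytes) -> int:
    """Checksum over pseudo header + UDP header (checksum field zeroed) + data."""
    length = 8 + len(payload)  # the UDP header is 8 bytes
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    total = ones_complement_sum(
        pseudo_header(src_ip, dst_ip, length) + header + payload)
    checksum = ~total & 0xFFFF
    return checksum or 0xFFFF  # in IPv4, 0 means "no checksum was computed"
```

A receiver can verify a datagram by summing the same bytes with the checksum field filled in; the result should be 0xFFFF. Because the pseudo header includes the IP addresses, a datagram delivered to the wrong host fails this check.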


TCP Overview

• Connection-oriented
• Byte-stream
– app writes bytes
– TCP sends segments
– app reads bytes
• Full duplex
• Flow control: keep sender from overrunning receiver
• Congestion control: keep sender from overrunning network

[Figure: the sending application process writes bytes into the TCP send buffer; TCP transmits segments; the receiving application process reads bytes from the TCP receive buffer]


Flow Control vs Congestion Control

• Flow Control
– Prevent sender from overloading receiver
– End-to-end issue

• Congestion Control
– Prevent too much data from being injected into network
– Concerned with how hosts and network interact


Data Link Reliability (text 2.5)

Wherein we look at reliability issues on a point-to-point link! Error-correcting codes can’t handle all possible errors without introducing lots of overhead (and paying that overhead on every frame means not designing for the normal situation), so badly garbled frames are dropped. We need a way to recover from these lost frames.


Acks and Timeouts

• Acknowledgement (ACK)
– Small frame sent to peer indicating receipt of frame
– No data
– Piggybacking

• Timeout
– If ACK not received within reasonable time, original frame is retransmitted

• Automatic Repeat Request (ARQ)
– General strategy of using ACKs and timeouts to implement reliable delivery


Acknowledgements & Timeouts


A Subtlety…

• Consider scenarios (c) and (d) in previous slide.
– Receiver receives two good frames (duplicate)
– It may deliver both to higher layer protocol (not good!)
– Solution: 1-bit sequence number in frame header


Stop-and-Wait

• Problem: keeping the pipe full
• Example
– 1.5 Mbps link x 45 ms RTT = 67.5 Kb (8 KB)
– 1 KB frames imply 1/8th link utilization (next slide)

[Figure: timeline between Sender and Receiver]


Bandwidth x Delay Product

• Sending a 1KB packet in 45ms implies sending at rate of (1024 x 8)/0.045 = 182 Kbps, or 1/8 of bandwidth.

• Bandwidth-delay: The number of bits that fits in the pipe in a single round trip. (I.e. the amount of data that could be “in transit” at any given time.)

• Goal: Want to be able to send this much data before getting first ACK. (called keeping the pipe full)
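The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch; the function and variable names are chosen here for illustration.

```python
def bandwidth_delay_bits(bandwidth_bps: float, rtt_s: float) -> float:
    """Number of bits that fit in the pipe during one round trip."""
    return bandwidth_bps * rtt_s


def stop_and_wait_rate_bps(frame_bytes: int, rtt_s: float) -> float:
    """Stop-and-wait sends one frame per RTT, so this is its sending rate."""
    return frame_bytes * 8 / rtt_s


link_bps = 1.5e6  # 1.5 Mbps link from the example
rtt = 0.045       # 45 ms round-trip time

pipe_bits = bandwidth_delay_bits(link_bps, rtt)  # about 67.5 Kb
rate = stop_and_wait_rate_bps(1024, rtt)         # about 182 Kbps
utilization = rate / link_bps                    # roughly 1/8 of the link
```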


Sliding Window

• Allow multiple outstanding (un-ACKed) frames
• Upper bound on un-ACKed frames, called window

[Figure: sliding-window timeline between Sender and Receiver]


Sliding Window: Sender

• Assign sequence number to each frame (SeqNum)
• Maintain three state variables:
– send window size (SWS)
– last acknowledgment received (LAR)
– last frame sent (LFS)
• Maintain invariant: LFS - LAR ≤ SWS
• Advance LAR when ACK arrives
• Buffer up to SWS frames (must be prepared to retransmit frames until they are ACKed)

[Figure: sender window of size SWS between LAR and LFS]


Sliding Window: Receiver

• Maintain three state variables
– receive window size (RWS) (upper bound on # of out-of-order frames)
– largest frame acceptable (LFA) (sequence # of)
– last frame received (LFR)
• Maintain invariant: LFA - LFR ≤ RWS
• Frame SeqNum arrives:
– if LFR < SeqNum ≤ LFA, accept
– if SeqNum ≤ LFR or SeqNum > LFA, discard
• Send cumulative ACKs

[Figure: receiver window of size RWS between LFR and LFA]
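The receiver rules can be sketched in Python as follows. This is a simplified model, not a full implementation: it assumes sequence numbers never wrap, that frames are numbered from 1, and that LFR means the last frame received in order. The class and method names are invented for the sketch.

```python
class SlidingWindowReceiver:
    """Toy receiver; assumes unbounded sequence numbers starting at 1."""

    def __init__(self, rws: int):
        self.rws = rws
        self.lfr = 0           # last frame received in order (0 = none yet)
        self.buffered = set()  # accepted frames still waiting for a gap to fill

    @property
    def lfa(self) -> int:
        # Largest frame acceptable, chosen so that LFA - LFR == RWS.
        return self.lfr + self.rws

    def on_frame(self, seq_num: int):
        """Return the cumulative ACK to send, or None if the frame is discarded."""
        if not (self.lfr < seq_num <= self.lfa):
            return None  # SeqNum <= LFR or SeqNum > LFA: discard
        self.buffered.add(seq_num)
        # Slide the window past every frame that is now in order.
        while self.lfr + 1 in self.buffered:
            self.lfr += 1
            self.buffered.remove(self.lfr)
        return self.lfr  # cumulative ACK covers everything up to LFR
```

With RWS = 4, receiving frame 2 before frame 1 yields an ACK of 0 (nothing in order yet); when frame 1 then arrives, the cumulative ACK jumps to 2.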


Note:

• When packet loss occurs, the pipe is no longer kept full!

• The longer it takes to notice a lost packet, the worse the condition becomes

• Possible solutions:
– Send NACKs
– Selective acknowledgements (ACK exactly those frames received, not just the highest frame received)
– Not used: too much added complexity


Sequence Number Space

• SeqNum field is finite; sequence numbers wrap around
• Sequence number space must be larger than the number of outstanding frames (e.g. stop-and-wait had a sequence number space of size 2)
– For example, if the sequence number space is of size 8 (say 0..7) and the number of outstanding frames is allowed to be 10, then the sender can send sequence numbers 0,1,2,3,4,5,6,7,0,1 all at once. Now if the receiver sends back an ACK with sequence number 1, which packet 1 is it ACKing?


Sequence Number Space

• Even SWS < SequenceSpaceSize is not sufficient
– suppose 3-bit SeqNum field (0..7) (so SequenceSpaceSize = 8)
– Let SWS = RWS = 7
– sender transmits frames 0..6
– Frames arrive successfully, but ACKs are lost
– sender retransmits 0..6
– receiver is expecting 7, 0..5, but receives the second incarnation of 0..5 (because the receiver has at this point updated its various pointers)

• SWS < (SequenceSpaceSize+1)/2 is the rule (if SWS = RWS); for SequenceSpaceSize = 8 this means SWS ≤ 4

• Intuitively, SeqNum “slides” between two halves of sequence number space
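The lost-ACK scenario can be reproduced with a small Python sketch. It assumes the sender starts at sequence number 0, the receiver has accepted the whole first window, and every ACK was lost; the helper names are made up for illustration.

```python
def window(start: int, size: int, space: int) -> set:
    """Sequence numbers covered by a window of `size` beginning at `start`."""
    return {(start + i) % space for i in range(size)}


def retransmission_is_ambiguous(sws: int, rws: int, space: int) -> bool:
    """True if retransmitted old frames could be mistaken for new ones."""
    retransmitted = window(0, sws, space)  # sender resends its first window
    expected = window(sws, rws, space)     # receiver has already slid forward
    return bool(retransmitted & expected)
```

For a sequence space of size 8, `retransmission_is_ambiguous(7, 7, 8)` is true (old frames 0..5 land in the receiver's new window 7, 0..5), while `retransmission_is_ambiguous(4, 4, 8)` is false.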


Easy to overlook…

• Relationship between window size and sequence number space depends on assumption that frames are not reordered in transit (easy to assume on point-to-point link).


Back to Chapter 5…


Data Link Versus Transport

• Transport potentially connects many different hosts
– need explicit connection establishment and termination

• Transport has potentially different RTT (over different routes and at different times, even on scale of minutes)
– need adaptive timeout mechanism

• Transport has potentially long delay in network
– need to be prepared for arrival of very old packets

• Transport has potentially different capacity at destination
– need to accommodate different node capacity

• Transport has potentially different network capacity
– need to be prepared for network congestion


The “End-to-End” Argument

• Consider TCP vs X.25
• TCP: Consider underlying IP network unreliable and use sliding window to provide end-to-end in-order reliable delivery
• X.25: Use sliding window within network on hop-by-hop basis (which should guarantee end-to-end). Several problems with this:
– No guarantee that added hop preserves service
– In link from A to B to C, no guarantee that B behaves perfectly (nodes have been known to introduce errors and mix packet order)


End-to-End

• “A function should not be provided in the lower levels of the system unless it can be completely and correctly implemented at that level”

• Does allow for functions to be incompletely provided at lower levels for optimization
– E.g. detecting and retransmitting a single corrupt packet across one hop is preferable to retransmitting the entire file end-to-end.

• See reading assignment on class homework page


Segment Format

[Figure: TCP header layout, bit positions 0, 4, 10, 16, 31: SrcPort, DstPort; SequenceNum; Acknowledgment; HdrLen, 0, Flags, AdvertisedWindow; Checksum, UrgPtr; Options (variable); Data]


Segment Format (cont)

• Each connection identified with 4-tuple:
– (SrcPort, SrcIPAddr, DestPort, DestIPAddr)

• Sliding window and flow control
– Acknowledgment, SequenceNum, AdvertisedWindow

• Flags
– SYN, FIN, RESET, PUSH, URG, ACK

• Checksum
– pseudo header + TCP header + data

[Figure: Sender sends Data (SequenceNum) to Receiver; Receiver returns Acknowledgment + AdvertisedWindow]
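To make the header layout concrete, here is a small Python sketch that unpacks the fixed 20-byte part of a TCP header with the standard `struct` module. The function name and the dictionary keys are chosen for this example; the flag bit values follow the standard TCP header (FIN is the lowest bit, URG the highest of the six).

```python
import struct

# Bit masks for the six flags named on the slide.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RESET": 0x04,
             "PUSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header (options and data are ignored)."""
    (src_port, dst_port, seq, ack, hdrlen_flags,
     window, checksum, urg_ptr) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (hdrlen_flags >> 12) * 4,  # HdrLen counts 32-bit words
        "flags": {name for name, bit in FLAG_BITS.items() if hdrlen_flags & bit},
        "advertised_window": window,
        "checksum": checksum,
        "urg_ptr": urg_ptr,
    }
```

For instance, a header packed with HdrLen = 5 and flag bits 0x12 decodes to a 20-byte header carrying SYN and ACK.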


Connection Establishment and Termination

[Figure: three-way handshake timeline between the active participant (client) and the passive participant (server)]

– Client to server: SYN, SequenceNum = x
– Server to client: SYN + ACK, SequenceNum = y, Acknowledgment = x + 1
– Client to server: ACK, Acknowledgment = y + 1

Note: SequenceNum contains the sequence number of the first data byte contained in the segment. The ACK field always gives the sequence number of the next data byte expected. (Except for the SYN segments.)
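The three segments of the handshake can be written out as plain data. The Python sketch below is illustrative only: the dictionary layout and function name are invented here, and it assumes 32-bit sequence numbers that wrap modulo 2^32.

```python
SEQ_SPACE = 2**32  # 32-bit SequenceNum field


def three_way_handshake(x: int, y: int) -> list:
    """Segments exchanged when a client with initial sequence number x
    opens a connection to a server with initial sequence number y."""
    syn = {"from": "client", "flags": {"SYN"}, "seq": x}
    syn_ack = {"from": "server", "flags": {"SYN", "ACK"},
               "seq": y, "ack": (x + 1) % SEQ_SPACE}
    ack = {"from": "client", "flags": {"ACK"}, "ack": (y + 1) % SEQ_SPACE}
    return [syn, syn_ack, ack]
```

Each side acknowledges the other's initial sequence number plus one, so a server that chose y = 2^32 - 1 is acknowledged with 0.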


State Transition Diagram

[Figure: TCP state transition diagram. States: CLOSED, LISTEN, SYN_RCVD, SYN_SENT, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT, CLOSE_WAIT, LAST_ACK. Edges are labeled event/action (for example Passive open, Active open/SYN, SYN/SYN + ACK, SYN + ACK/ACK, Close/FIN, FIN/ACK, ACK + FIN/ACK); the upper part covers opening a connection and the lower part closing it. TIME_WAIT returns to CLOSED after a timeout of two segment lifetimes.]


Sliding Window Revisited

• Sending side
– LastByteAcked ≤ LastByteSent
– LastByteSent ≤ LastByteWritten
– buffer bytes between LastByteAcked and LastByteWritten

• Receiving side
– LastByteRead < NextByteExpected
– NextByteExpected ≤ LastByteRcvd + 1
– buffer bytes between LastByteRead and LastByteRcvd

[Figure: the sending application above the TCP send buffer, marked with LastByteAcked, LastByteSent, LastByteWritten; the receiving application above the TCP receive buffer, marked with LastByteRead, NextByteExpected, LastByteRcvd]


Flow Control

• Send buffer size: MaxSendBuffer
• Receive buffer size: MaxRcvBuffer
• Receiving side
– LastByteRcvd - LastByteRead ≤ MaxRcvBuffer
– AdvertisedWindow = MaxRcvBuffer - (LastByteRcvd - LastByteRead)

• Sending side
– LastByteSent - LastByteAcked ≤ AdvertisedWindow
– EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)
– LastByteWritten - LastByteAcked ≤ MaxSendBuffer
– block sender if (LastByteWritten - LastByteAcked) + y > MaxSendBuffer, where y is the number of bytes the application is trying to write
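The window formulas translate directly into code. The following Python sketch simply restates them as functions; the function and parameter names are this example's own, mirroring the slide's variables.

```python
def advertised_window(max_rcv_buffer: int, last_byte_rcvd: int,
                      last_byte_read: int) -> int:
    """Free space left in the receive buffer."""
    return max_rcv_buffer - (last_byte_rcvd - last_byte_read)


def effective_window(adv_window: int, last_byte_sent: int,
                     last_byte_acked: int) -> int:
    """How many more bytes the sender may put in flight right now."""
    return adv_window - (last_byte_sent - last_byte_acked)


def write_would_block(max_send_buffer: int, last_byte_written: int,
                      last_byte_acked: int, y: int) -> bool:
    """True if writing y more bytes would overflow the send buffer."""
    return (last_byte_written - last_byte_acked) + y > max_send_buffer
```

For example, with a 4096-byte receive buffer holding 1024 unread bytes, the advertised window is 3072; if 2048 bytes are already in flight, the effective window is 1024.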


Flow Control

• Always send ACK in response to arriving data segment
– This response contains the latest Acknowledgment and AdvertisedWindow fields even if they haven’t changed

• Problem: How does the sending side know when the advertised window is no longer 0?
– It can’t get this info, since the receiver only sends window advertisements in response to received packets, and the sender can’t send anything because it believes the window size is zero.

• Solution: Persist when AdvertisedWindow = 0
– Periodically send a probe segment with one byte of data. Although most won’t be accepted, they trigger responses, and eventually one will come back with a nonzero advertised window.


Protection Against Wrap Around

• 32-bit SequenceNum

Bandwidth            Time Until Wrap Around
T1 (1.5 Mbps)        6.4 hours
Ethernet (10 Mbps)   57 minutes
T3 (45 Mbps)         13 minutes
FDDI (100 Mbps)      6 minutes
STS-3 (155 Mbps)     4 minutes
STS-12 (622 Mbps)    55 seconds
STS-24 (1.2 Gbps)    28 seconds
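The entries in this table follow from dividing the size of the 32-bit sequence space (in bytes, since TCP numbers bytes) by the link rate. A minimal Python sketch, with a function name chosen for illustration:

```python
def wrap_time_seconds(bandwidth_bps: float, seq_bits: int = 32) -> float:
    """Time to send 2**seq_bits bytes at the given rate, i.e. time until
    the sequence number wraps when sending at full speed."""
    return (2 ** seq_bits) * 8 / bandwidth_bps


t1_hours = wrap_time_seconds(1.5e6) / 3600       # about 6.4 hours
ethernet_minutes = wrap_time_seconds(10e6) / 60  # about 57 minutes
```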


Keeping the Pipe Full

• 16-bit AdvertisedWindow

Values assume an RTT of 100 ms, typical for a cross-country link.

Bandwidth            Delay x Bandwidth Product
T1 (1.5 Mbps)        18 KB
Ethernet (10 Mbps)   122 KB
T3 (45 Mbps)         549 KB
FDDI (100 Mbps)      1.2 MB
STS-3 (155 Mbps)     1.8 MB
STS-12 (622 Mbps)    7.4 MB
STS-24 (1.2 Gbps)    14.8 MB
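A 16-bit AdvertisedWindow can describe at most 65,535 bytes, so it is worth comparing that limit with the delay x bandwidth products. The sketch below does so in Python under the same 100 ms RTT assumption; the names are illustrative only.

```python
MAX_ADVERTISED_WINDOW = 2**16 - 1  # largest value of a 16-bit field, in bytes


def delay_bandwidth_bytes(bandwidth_bps: float, rtt_s: float = 0.1) -> float:
    """Bytes that must be in flight to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8


def window_can_fill_pipe(bandwidth_bps: float, rtt_s: float = 0.1) -> bool:
    """True if an unscaled 16-bit window covers the whole pipe."""
    return delay_bandwidth_bytes(bandwidth_bps, rtt_s) <= MAX_ADVERTISED_WINDOW
```

Under these assumptions only the T1 link (about 18 KB in flight) fits within the unscaled window; every faster link in the table needs more than 64 KB outstanding.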


TCP Extensions

• Implemented as header options
• Store timestamp in outgoing segments
• Extend sequence space with 32-bit timestamp: PAWS (Protection Against Wrapped Sequence Numbers)
• Shift (scale) advertised window


Adaptive Retransmission(Original Algorithm)

• Measure SampleRTT for each segment/ACK pair

• Compute weighted average of RTT
– EstRTT = α × EstRTT + (1 − α) × SampleRTT
– α between 0.8 and 0.9 (recommended value 0.9)
– Note that α in this range has a strong smoothing effect

• Set timeout based on EstRTT
– TimeOut = 2 × EstRTT (rather conservative)
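The original algorithm is an exponentially weighted moving average. A minimal Python sketch, with function names chosen for this example:

```python
def update_est_rtt(est_rtt: float, sample_rtt: float,
                   alpha: float = 0.9) -> float:
    """EstRTT = alpha * EstRTT + (1 - alpha) * SampleRTT."""
    return alpha * est_rtt + (1 - alpha) * sample_rtt


def timeout(est_rtt: float) -> float:
    """TimeOut = 2 * EstRTT."""
    return 2 * est_rtt
```

With α = 0.9, a single 200 ms sample moves a 100 ms estimate only to about 110 ms, which shows how strongly the old estimate dominates.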


Karn/Partridge Algorithm

• Problem: ACK doesn’t acknowledge a transmission (it acks a receive)
• Do not sample RTT when retransmitting
• Double timeout after each retransmission (exponential backoff) (Why?)

[Figure: two Sender/Receiver timelines, each showing an original transmission, a retransmission, and an ACK, with SampleRTT marked on each]
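The two rules can be layered on top of the original estimator. The Python sketch below is one possible arrangement, not a description of any particular TCP implementation; the class and its method names are hypothetical.

```python
class RetransmitTimer:
    """Toy retransmission timer applying the Karn/Partridge rules."""

    def __init__(self, est_rtt: float, alpha: float = 0.9):
        self.alpha = alpha
        self.est_rtt = est_rtt
        self.timeout = 2 * est_rtt

    def on_timeout(self) -> None:
        # Exponential backoff: double the timeout after each retransmission.
        self.timeout *= 2

    def on_ack(self, sample_rtt: float, retransmitted: bool) -> None:
        if retransmitted:
            # The ACK might match either transmission, so the sample is
            # ambiguous: skip it and keep the backed-off timeout.
            return
        self.est_rtt = self.alpha * self.est_rtt + (1 - self.alpha) * sample_rtt
        self.timeout = 2 * self.est_rtt
```

The timeout is recomputed from EstRTT only when an ACK arrives for a segment that was sent exactly once.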


A Problem

• Problem with both these approaches: they can’t keep up with wide RTT fluctuations, thus causing unnecessary retransmissions

• When the network is already loaded, unnecessary retransmissions add to the network load (as Stevens notes, “It is the network equivalent of pouring gasoline on a fire”)

• What’s needed: keep track of the variance in RTT measurements AND use smooth RTT estimator.


Jacobson/Karels Algorithm

• New calculations for average RTT
• Diff = SampleRTT - EstRTT
• EstRTT = EstRTT + (g x Diff)
– Recommended value for g is 0.125
– EstRTT is just the smoothed RTT as before
• Dev = Dev + h x (|Diff| - Dev)
– Recommended value for h is 0.25
– Dev is the smoothed mean deviation (the mean deviation is easier to compute than the standard deviation, which requires a square root)
• TimeOut = EstRTT + 4 x Dev
– Larger gain for the deviation makes the TimeOut value increase faster when the RTT changes.
• Notes
– algorithm only as good as granularity of clock (500ms on Unix)
– accurate timeout mechanism important to congestion control (later)

Note these values?
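One update step of this algorithm can be sketched in Python as follows. The function name is invented here, and the deviation multiplier k = 4 is an assumption taken from the standard formulation of the algorithm; g and h default to the recommended values above.

```python
def jacobson_karels_update(est_rtt: float, dev: float, sample_rtt: float,
                           g: float = 0.125, h: float = 0.25, k: float = 4):
    """Return the new (EstRTT, Dev, TimeOut) after one RTT sample."""
    diff = sample_rtt - est_rtt
    est_rtt = est_rtt + g * diff        # smoothed RTT
    dev = dev + h * (abs(diff) - dev)   # smoothed mean deviation
    return est_rtt, dev, est_rtt + k * dev
```

Starting from EstRTT = 100 and Dev = 10, a sample of 140 gives EstRTT = 105, Dev = 17.5, and TimeOut = 175. Both gains are negative powers of two, so an implementation can apply them with shifts instead of multiplications.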


TCP Interactive Data Flow

• Material here is from TCP/IP Illustrated, Vol. 1
• Study by Caceres et al. (1991):
– On a packet count basis, about half of all TCP segments contain bulk data (ftp, email, Usenet news)
– Half contain interactive data (telnet, rlogin)
– On a byte count basis, the ratio is around 90% bulk transfer, 10% interactive.
– Bulk data segments tend to be full size (normally 512 bytes of data); interactive segments are much smaller (90% of telnet and rlogin packets carry less than 10 bytes of data).


Rlogin and Telnet

• Surprisingly, each interactive keystroke typically generates a packet (as opposed to a line generating a packet).

• Moreover, a single rlogin keystroke can generate 4 segments (though usually 3)

i. Interactive keystroke from client
ii. ACK of keystroke from server (typically piggybacked in echo of data byte; see next slide)
iii. Echo of data byte from server
iv. ACK of echoed byte from client


Delayed ACKs

• Normally, TCP does not send an ACK the instant it receives data. Instead, it delays the ACK, hoping to have data going in other direction on which it can piggyback the ACK.

• Most implementations use a 200ms delay (delays ACK up to 200ms before sending the ACK by itself)

• This is why in previous slide, ACK would normally piggyback with the echoed character


Nagle Algorithm

• 1 byte data segment generates 41 byte packets (20 for IP header + 20 for TCP header).

• Small packets are called tinygrams
– On LANs, usually not an issue, but on WANs, this can be a problem (it adds congestion)

• Solution: Nagle Algorithm (RFC 896, Nagle, 1984): When a TCP connection has outstanding data that has not yet been Acked, small segments cannot be sent until the outstanding data is acknowledged.


Nagle Algorithm (continued)

• Nagle is self-clocking: the faster the ACKs come back, the faster the data is sent. But on a slow WAN, where tinygrams can be a problem, fewer segments are sent.
– Ex. On a LAN, the time for a single byte to be sent, ACKed and echoed is around 16ms. To generate data at this rate, you need to be typing around 60 characters per second (so on a LAN you don’t kick in Nagle)
– On a WAN, you’ll often kick in Nagle
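The send decision described by the Nagle algorithm can be sketched as a single predicate. This Python version is a simplification: it treats any segment shorter than the maximum segment size (MSS) as "small", which is an assumption of this sketch, and it ignores the advertised window.

```python
def nagle_can_send(pending_bytes: int, unacked_bytes: int, mss: int) -> bool:
    """Decide whether buffered data may be sent now under the Nagle rule."""
    if pending_bytes <= 0:
        return False            # nothing to send
    if pending_bytes >= mss:
        return True             # a full-size segment is never held back
    return unacked_bytes == 0   # small segment: only if nothing is outstanding
```

While one keystroke is unacknowledged, later keystrokes accumulate in the send buffer and go out together once the ACK arrives, which is what makes the scheme self-clocking.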


Disabling the Nagle Algorithm

• Why would you want to?
– X Window System: small messages (mouse movements) need to be delivered without delay
– Typing one of the terminal’s special function keys during an interactive login

• Function keys normally generate multiple bytes of data, beginning with the ASCII escape character. If TCP gets the data a byte at a time, it can potentially send the first byte and then hold the rest of the characters. The server wouldn’t generate the ACK until it received the rest of the command, so Nagle would kick in, meaning the rest of the bytes are not sent for 200ms, which can be a noticeable delay.

• With sockets API, the TCP_NODELAY option disables Nagle

• Host Requirements RFCs (1122, 1123) specify that there must be a way for an app to disable Nagle on an individual TCP connection.
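In Python's standard `socket` module, the option is set per socket with `setsockopt`. The snippet below is a minimal sketch; it only creates a socket and sets the option, without connecting anywhere, and the variable names are arbitrary.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable the Nagle algorithm on this one TCP socket.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back; a nonzero value means Nagle is disabled.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0

sock.close()
```

Because the option applies to an individual socket, other connections on the same host keep the default behavior.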


TCP Teardown
