
Cmpt 371 Transport Layer

The Transport Layer

We already know the functions provided by the transport layer, because we needed to talk about them in order to describe how applications make use of the services provided by the transport layer. Let’s quickly review:

• The transport layer must provide a multiplexing and demultiplexing service. This is its most important function.

For outgoing data, the transport layer attaches a transport header which includes a port number which identifies the sending process. It then hands the transport protocol data unit (PDU) to the network layer for transmission. This is a multiplexing function, with messages from many sources (applications) multiplexed into a single stream of segments that are handed to the network layer.

For incoming data, the transport layer examines the port number in the transport header and uses the port number to choose the application (process) that should receive the message carried in the transport PDU. This is a demultiplexing function, with segments arriving from the network layer distributed to the correct application.

• The transport layer could provide reliable data transfer, integrity (authentication and encryption), and quality of service (QoS) guarantees.

By analogy, then, this would be the place to consider the services that the transport layer might want from the network layer.

• The essential service provided by the network layer is delivery of a packet from one end system to another.

But what about services that the network layer could provide? This discussion is conspicuous in its absence from the start of Chapter 3.

• A skip ahead to Chapter 4 offers a possible explanation: The list of services that could be provided by the network layer looks a lot like the list of services that could be provided by the transport layer.

• The Internet transport protocols make the least possible assumption about the network layer: best-effort delivery between end systems. By design! TCP and UDP can run over top of any network layer.

1 June, 2012


• If there were any serious competitors at the transport level, it’d be worth having a discussion of possible network layer services and how the presence or absence of those services would affect the design of transport protocols. But there aren’t, so we won’t.

The Internet protocol suite provides two transport protocols, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol).

• UDP provides only multiplexing and demultiplexing.

• TCP provides reliable data transfer in addition to multiplexing and demultiplexing.

• A ‘bolt-on’ module, TLS (Transport Layer Security), can be used to add integrity to TCP; a datagram variant, DTLS, provides the same service over UDP.

• The Internet transport protocols provide no QoS guarantees, because the underlying network may not be capable of supporting such guarantees.

Port numbers allow TCP and UDP to identify specific processes.

• A port number is associated with exactly one process, but a process can acquire multiple port numbers.

• In the Internet protocols the allowable range of port numbers is 0 – 65535 ($2^{16} - 1$). This range is divided into system ports, 0 – 1023, user ports, 1024 – 49151, and dynamic ports, 49152 – 65535. The IANA assigns port numbers [1] from the system range for services associated with standard Internet protocols. It also administers the use of ports in the user range as a convenience for network application developers.

In the socket API, each port number is associated with a socket, the object created by an application to access the services of the transport layer. As we already know, it’s actually a bit more complicated.

• A socket using the UDP protocol is associated with a local port number and network address. Each time an application wants to use the socket to send a message, it must specify the destination port and network address. The destination can be different for each use of the socket.

[1] See http://www.iana.org/protocols and scroll down the page to the section with the heading ‘Port Numbers’. The ‘Service Name and Transport Protocol Port Number Registry’ lists all system and user port assignments.



• A socket using the TCP protocol is associated with a local port number and network address and with a remote port number and network address, as part of the TCP connection setup. Once the connection is established, the application does not need to specify the destination with each message, but the destination cannot be changed.

• In order to send a message to an application, we must somehow know a port number for a socket associated with that application. In the Internet, this is solved by using well-known port numbers (the system and user ports mentioned above). Using the socket API, we can request a specific well-known port number be associated with a socket created by the application.
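The way UDP sockets use port numbers can be sketched with Python’s standard socket module. This is a minimal illustration under some assumptions: everything runs on the loopback address, and each socket binds to port 0 so the operating system picks a free port (a real server would request its well-known port instead). The function name udp_demux_demo is made up for this example.

```python
import socket

def udp_demux_demo():
    """Send one datagram to each of two receivers and report what arrived."""
    receivers = []
    for _ in range(2):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("127.0.0.1", 0))      # port 0: let the OS choose a free port
        s.settimeout(2.0)             # don't hang forever if nothing arrives
        receivers.append(s)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.bind(("127.0.0.1", 0))
    results = []
    try:
        for i, r in enumerate(receivers):
            # A UDP socket names the destination (address, port) on every send.
            sender.sendto(f"for receiver {i}".encode(), r.getsockname())
        for r in receivers:
            # The destination port decided which socket received the datagram.
            data, source = r.recvfrom(1024)
            results.append((data.decode(), source == sender.getsockname()))
    finally:
        for s in receivers + [sender]:
            s.close()
    return results
```

Each receiver sees only the datagram addressed to its own port, and the source address it reports is the sender’s local address and port.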

Reliable Data Transfer

Let’s dive right into one of the core subjects of this chapter: reliable data transfer. To be reliable, we require that data be delivered without loss or error, and in the order in which it was sent.

• Without going into details, detection of errors requires that extra information be transmitted with the data. The sender performs some calculation over the data and sends the result of this calculation along with the data. The receiver repeats the calculation and checks its result against the result sent with the data. If the two results agree, the data has been received without error.

The amount of extra information required is surprisingly small. Often, it is called a checksum.
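As one concrete instance of such a calculation, here is a sketch of the 16-bit one’s-complement checksum used by the Internet protocols (described in RFC 1071). The notes do not commit to a particular checksum, so treat the choice of algorithm and the function names as assumptions made for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of the data, complemented."""
    if len(data) % 2:
        data += b"\x00"                # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

def is_corrupt(data: bytes, checksum: int) -> bool:
    """Receiver side: repeat the calculation and compare with what was sent."""
    return internet_checksum(data) != checksum
```

Only 16 bits of extra information travel with the data, whatever its length, which is the sense in which the overhead is small.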

Assuming that we detect an error in data delivered to the local system, what can we do about it?

• We could try to correct the error. Just as with error detection, error correction requires that extra information be transmitted with the data. The amount of extra information required for error correction is large, and the transport protocols discussed here do not use this technique.

• We can ask the sender to retransmit the data. This requires some care to do right but it is a practical technique. We’ll start simple and work up to the algorithms used in modern protocols.

To establish a trivial base case, consider a perfect channel [2] between the sender and receiver. No data is ever lost or corrupted and data is delivered in the order that it’s sent.



As shown in the figure, the implementation really is trivial. A single state suffices for each of the sender and receiver. The sender’s transport layer waits for the application to send a message with a call to rdt_send. When it arrives, the transport layer wraps it in a transport layer segment with a call to wrap and hands it off to the network layer with a call to udt_send.

The receiver’s transport layer waits for the network layer to provide data with a call to rdt_rcv. When it arrives, the transport layer removes the message from the segment with a call to extract and delivers it to the application with a call to deliver_msg.

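[Figure: state machines for the perfect-channel protocol, written as event / action. Sender, state ‘wait for call from above’: rdt_send(msg) / seg = wrap(msg); udt_send(seg). Receiver, state ‘wait for call from below’: rdt_rcv(seg) / msg = extract(seg); deliver_msg(msg).]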

(The notation used in this and subsequent figures is the standard ‘bubble diagram’ notation for a state machine. See the text for details if you’re not familiar with it.)

Nothing else is required. With a perfect channel, nothing can go wrong.

In reality, errors happen. Let’s start by assuming that data can be corrupted but never entirely lost. By assumption, the receiver can detect data errors but doesn’t have enough information to repair the error. The receiver will need to ask the sender to retransmit the data. We’ve identified the three capabilities required to do this:

• Error detection is needed so that the receiver is aware that there’s a problem.

• Receiver feedback is needed so that the sender is aware that there’s a problem.

• Retransmission of the data by the sender is necessary to fix the problem.

Figure 1 shows the enhanced state machines required for the sender and receiver.

• Each time the application hands a message to the sender’s transport layer for transmission, the transport layer calculates a checksum over the message. The message and checksum are wrapped together in a segment and handed to the network layer for transmission.

[2] Because we don’t want to be specific about the connection between the sender and receiver (it could be a single link or the Internet), we’ll use the word channel for the connection between the sender and receiver.



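[Figure 1 diagram; transitions recovered from its labels, written as event / action.
Sender, state ‘wait for call from above’: rdt_send(msg) / xmtseg = wrap(msg,chksum); udt_send(xmtseg); move to ‘wait for ACK or NAK’.
Sender, state ‘wait for ACK or NAK’: rdt_rcv(rcvseg) && isnak(rcvseg) / udt_send(xmtseg). rdt_rcv(rcvseg) && isack(rcvseg) / Λ; return to ‘wait for call from above’.
Receiver, state ‘wait for call from below’: rdt_rcv(rcvseg) && !corrupt(rcvseg) / xmtseg = wrap(ACK); udt_send(xmtseg). rdt_rcv(rcvseg) && corrupt(rcvseg) / xmtseg = wrap(NAK); udt_send(xmtseg).]

Figure 1: Sender and receiver state machines for rdt2.0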

The sender’s work is not done. It must wait for a reply from the receiver. If the received segment is a NAK, the sender must retransmit the segment. If the received segment is an ACK, the sender can return to the initial state and wait for another message from the application.

• Each time the network layer delivers a segment to the receiver’s transport layer, the receiver must check that the segment is correct. To do this, it calculates a checksum for the received message and compares it to the checksum sent with the message.

If the checksums match, the segment is correct and it can be delivered to the application. In addition, an ACK segment must be sent to the sender’s transport layer so that it knows the segment was received without error.

If the checksums don’t match, the segment is corrupted and it is simply discarded. In addition, a NAK segment must be sent to the sender’s transport layer so that it knows to retransmit the segment.
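The exchange just described can be sketched as a small simulation. It makes several assumptions that are not part of the notes: the checksum is a simple 16-bit byte sum standing in for a real one, corruption is scripted by a list of booleans (one per transmission), messages are non-empty, and ACK and NAK replies always arrive intact, which is exactly the assumption rdt2.0 relies on. All function names other than wrap and corrupt are invented for the example.

```python
def checksum(payload: bytes) -> int:
    # Stand-in checksum for this sketch: 16-bit sum of the bytes.
    return sum(payload) & 0xFFFF

def wrap(payload: bytes) -> tuple:
    # A segment is modelled as (payload, checksum).
    return (payload, checksum(payload))

def corrupt(seg: tuple) -> bool:
    payload, chk = seg
    return checksum(payload) != chk

def flip_bit(seg: tuple) -> tuple:
    # Damage the first payload byte but leave the stored checksum alone.
    payload, chk = seg
    return (bytes([payload[0] ^ 0x01]) + payload[1:], chk)

def rdt20_transfer(messages, corrupt_plan):
    """Deliver messages stop-and-wait style; corrupt_plan scripts the channel."""
    plan = iter(corrupt_plan)
    delivered, transmissions = [], 0
    for msg in messages:
        xmtseg = wrap(msg)                      # 'wait for call from above'
        while True:                             # 'wait for ACK or NAK'
            transmissions += 1
            rcvseg = flip_bit(xmtseg) if next(plan, False) else xmtseg
            if corrupt(rcvseg):
                reply = "NAK"                   # receiver discards the segment
            else:
                delivered.append(rcvseg[0])     # receiver delivers the message
                reply = "ACK"
            if reply == "ACK":
                break                           # otherwise retransmit xmtseg
    return delivered, transmissions
```

With the plan [True, False, False], the first message is sent twice and the second once, and both are delivered exactly once.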

There are several problems with rdt2.0. The first is merely annoying. While the sender is in the ‘wait for ACK or NAK’ state, it cannot accept new messages from the application. If the application calls rdt_send, it will block waiting for the transport layer to return to the ‘wait for call from above’ state. This type of protocol, where the sender must stop and wait for an acknowledgement for each segment, is commonly called a stop-and-wait protocol.

The second is fundamental. If any segment can be corrupted, ACK and NAK segments can be corrupted. Suppose that the receiver sends a NAK segment that is corrupted en route.

• The sender will not recognise it as either of ACK or NAK and will stay in the ‘wait for ACK or NAK’ state.

• The receiver will take no further action; it will simply wait for the next segment to arrive.

• Our protocol is deadlocked: neither side will do anything more.

How can we fix this problem?

• It should be quickly apparent that adding a checksum to ACK and NAK segments won’t help. The sender will know that the segment just received is corrupt, but that doesn’t help. It’s still sitting in state ‘wait for ACK or NAK’.

Adding additional capability to the transport protocol to allow the sender to request that the receiver retransmit an ACK or NAK won’t help either. Both the sender and receiver will require additional states to deal with this new segment type. A bit of thought should convince you that we’ve just moved the problem to the new states.

• We could decide that if the sender receives a corrupt segment, it will assume the worst (NAK) and retransmit xmtseg. But what if the corrupted segment was really an ACK? Now the receiver will get a second copy of the message, with no way of knowing it’s a copy.

Let’s pursue this last idea for a moment. Knowing that it’s received a copy, the receiver could take action, discarding the copy and resending the ACK. This would tell the sender that the segment was received without error and allow it to return to ‘wait for call from above’ to await the next message from the application. Successful error recovery!

The technique that we’ll adopt to allow the receiver to detect a duplicate is sequence numbers. Now we have a new set of questions to answer.

• How many sequence numbers do we need? Here, two (0 and 1) would suffice, because there’s only one segment being transmitted (commonly referred to as ‘in flight’) at any given moment. If the receiver has sent an ACK for a segment with sequence number 0 and it receives another segment with sequence number 0, it knows that its ACK didn’t get through and it can send it again.

A bit of thought should convince you that, in general, if we want to have N − 1 segments in flight we need N sequence numbers.

While we’re at it, consider that when the receiver receives a segment with the wrong sequence number, it knows that the previous ACK was not received and it should send it again.

• We could apply the same logic at the sender if the ACK also carried sequence numbers. Suppose that each ACK contains the sequence number of the last segment that was correctly received. If this isn’t the same sequence number as the segment that was just transmitted, the sender knows that it must retransmit the segment.

That brings us to rdt2.2, shown in Figure 2. The state machines for the sender and receiver have been doubled so that we can use state to keep track of the segment sequence number. Assume that the sender and receiver each start in the state identified by the dashed arrow. Let’s see what happens when a segment is transmitted without error.

• The sender starts in state ‘wait for call 0 from above’. When the application hands a message to the transport layer, it’s wrapped in a segment along with a checksum, assigned a sequence number of 0, and handed to the network layer for transmission. The sender now moves to state ‘wait for ACK 0’ to receive the reply from the receiver.

• The receiver starts in state ‘wait for call 0 from below’. If the segment is received without error (i.e., not corrupted and sequence number 0), the receiver will deliver the message to the application, send back an ACK message with sequence number 0 (ACK(0)), and move to ‘wait for call 1 from below’ to await delivery of the next segment.

• When the sender receives ACK(0), it knows that the segment with sequence number 0 was received without error. It moves to state ‘wait for call 1 from above’ to await the next message from the application.

But suppose the segment is corrupted on the way to the receiver.

• The receiver will discard the segment, send back ACK(1), and remain in state ‘wait for call 0 from below’ in anticipation that the sender will retransmit the segment.



[Figure 2: Sender and receiver state machines for rdt2.2. The diagram did not survive extraction; the labels that did are listed here as event / action.
Sender states: ‘wait for call 0 from above’, ‘wait for ACK 0’, ‘wait for ACK 1’.
Sender transitions: rdt_send(msg) / xmtseg = wrap(0,msg,chksum); udt_send(xmtseg). rdt_rcv(rcvseg) && (corrupt(rcvseg) || isack(rcvseg,1)) / udt_send(xmtseg). rdt_rcv(rcvseg) && !corrupt(rcvseg) && isack(rcvseg,0) / Λ.
Receiver state: ‘wait for call 1 from below’.
Receiver transitions: rdt_rcv(rcvseg) && !corrupt(rcvseg) && seq(rcvseg,0) / xmtseg = wrap(0,ACK,chksum); udt_send(xmtseg). Guards rdt_rcv(rcvseg) && (corrupt(rcvseg) || seq(rcvseg,1)) and rdt_rcv(rcvseg) && (corrupt(rcvseg) || seq(rcvseg,0)).]


• Receipt of ACK(1) will cause the sender to retransmit the segment with sequence number 0. This exchange will repeat until the segment is received without error.

Suppose that the segment is received correctly by the receiver but the ACK is corrupted on its way back to the sender.

• On the receiver’s side, everything looks good. It delivers the message to the application and moves to state ‘wait for call 1 from below’ to await the next segment.

• The sender receives a corrupt segment and responds by resending the segment with sequence number 0.

• When the receiver sees the segment with sequence number 0, it discards the segment and repeats ACK(0).

• This sequence will repeat until the ACK(0) is received without error by the sender. When this happens, the sender knows that the segment with sequence number 0 was received without error and it can move to state ‘wait for call 1 from above.’

In general, if the sender receives an ACK with a sequence number that does not match the segment it just transmitted, it knows that the receiver did not receive the segment. If the receiver receives a duplicate segment, it knows that the sender did not receive the ACK for that segment. If we want to have many segments in flight, we’ll need to use a variable to keep track of the sequence numbers instead of creating more states.
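The two error cases above can be traced with a short simulation of the alternating sequence number. It is a sketch under stated assumptions: corruption is scripted by two lists of booleans (one entry per data segment sent, one per ACK sent) rather than detected with a real checksum, and the names rdt22_transfer, data_corrupt and ack_corrupt are invented for the example.

```python
def rdt22_transfer(messages, data_corrupt, ack_corrupt):
    """Trace rdt2.2-style delivery with scripted corruption in both directions."""
    data_plan, ack_plan = iter(data_corrupt), iter(ack_corrupt)
    delivered, log = [], []
    seq = 0          # sender: sequence number of the segment in flight
    expected = 0     # receiver: sequence number it is waiting for
    for msg in messages:
        while True:
            data_ok = not next(data_plan, False)
            if data_ok and seq == expected:
                delivered.append(msg)        # new segment: deliver it
                ack = expected
                expected ^= 1
            else:
                # Corrupt or duplicate: repeat the ACK for the last good segment.
                ack = 1 - expected
            ack_ok = not next(ack_plan, False)
            log.append((seq, data_ok, ack, ack_ok))
            if ack_ok and ack == seq:
                seq ^= 1                     # move on to the next message
                break
            # Corrupt ACK or wrong sequence number: retransmit the segment.
    return delivered, log
```

In the test scenario the first message is corrupted once on the way out, and the ACK for the second message is corrupted once on the way back; each message is still delivered exactly once.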

Let’s add the second type of error: loss of a segment. In the current Internet, the most common reason is that a router has dropped a packet because of congestion (no space in the transmit buffer for a link). Once either of the sender or receiver recognises that a segment has gone missing, we can easily recover from the error by retransmitting the segment. But how can we detect the absence of something?

• The first thing to note is that in order to realise something hasn’t arrived, we must be expecting something to arrive. Given that we’re expecting something to arrive, we probably have some notion of when, and that allows us to say ‘the thing I’m expecting hasn’t arrived in a reasonable amount of time.’

• The technique is called a timeout. For example, when the sender hands a segment to the network layer for transmission to the receiver, it knows that it should receive an ACK within a ‘reasonable’ amount of time. It can set a timer to go off at the end of the interval. Computers don’t do well with ‘reasonable’, however, so we’ll need to be more specific.

• One possibility is to keep track of the average time between handing a segment to the network layer and receiving an ACK in reply. This is the average round trip time (RTT), and it’s an estimate of the minimum interval before the sender can expect an ACK from the receiver.

• The trick is to wait long enough to be reasonably certain of loss, but not so long that it takes an unacceptably long time to recover from loss of a segment.

• This balancing act introduces the possibility of unnecessary retransmissions, but fortunately our protocol already copes with duplicates.
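One way to turn ‘reasonable’ into a number is sketched below: keep a running average of measured round trip times and set the timeout to some multiple of it. The class name and the margin factor of 2 are assumptions chosen for illustration; the notes do not prescribe a formula, and production protocols such as TCP use a weighted average that also tracks how much the RTT varies.

```python
class RttEstimator:
    """Running average of RTT samples with a safety margin for the timeout."""

    def __init__(self, margin: float = 2.0):
        self.samples = 0
        self.mean = 0.0
        self.margin = margin      # assumed multiplier; larger means fewer false timeouts

    def add_sample(self, rtt: float) -> None:
        # Incremental mean: no need to store every sample.
        self.samples += 1
        self.mean += (rtt - self.mean) / self.samples

    def timeout(self) -> float:
        # Wait longer than the average RTT before declaring a loss.
        return self.margin * self.mean
```

A larger margin reduces unnecessary retransmissions at the cost of slower recovery from a genuine loss, which is the balancing act described above.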

The state machines for rdt3.0 are shown in Figure 3. Each time a segment is handed to the network layer (udt_send), a timer is started. When an ACK indicates that the segment was received without error, the timer is stopped.

• In each ‘wait for ACK’ state, the sender now relies completely on the timeout to trigger retransmission. Arguably this is not the best choice: the protocol might recover faster if we kept the behaviour of rdt2.2 and retransmitted the segment on receipt of a corrupted segment or an ACK with the wrong sequence number.

• The sender must also consider the possibility that an ACK will arrive in one of the ‘wait for call from above’ states, triggered by an unnecessary retransmission of a segment.

Notice that the receiver’s state machine is unchanged from rdt2.2. The receiver already has the capability to recognise and discard duplicate segments. The receiver has no use for a timeout, because each segment might be the last. The sender, on the other hand, expects to receive an ACK for each segment it transmits.

The text illustrates the operation of rdt3.0 with four scenarios (Figure 3.16). Only the scenario for a premature timeout by the sender (Figure 3.16(d)) is discussed here. The timeline is modified slightly to show how the sender might receive an ACK while in a ‘wait for call from above’ state.

[Figure 3: Sender and receiver state machines for rdt3.0. The diagram did not survive extraction; the labels that did are listed here as event / action.
Sender states: ‘wait for ACK 0’, ‘wait for ACK 1’.
Sender transitions: rdt_send(msg) / ... start_timer. timeout / udt_send(xmtseg); start_timer. rdt_rcv(rcvseg) / Λ. Receipt of the expected ACK / stop_timer.
Receiver guards: rdt_rcv(rcvseg) && (corrupt(rcvseg) || seq(rcvseg,1)) and rdt_rcv(rcvseg) && (corrupt(rcvseg) || seq(rcvseg,0)).]


[Figure: timeline for a premature timeout.
Sender: send seg(0); rcv ACK(0), send seg(1); timeout, resend seg(1); rcv ACK(1); rcv ACK(1) (dup); send seg(0).
Receiver: rcv seg(0), send ACK(0); rcv seg(1), delay, send ACK(1); rcv seg(1) (dup), send ACK(1).]

The segment with sequence number 0 is transmitted and acknowledged without error. When the segment with sequence number 1 arrives, the receiver is busy and there is some delay before it transmits ACK(1), enough that the sender times out and retransmits the segment with sequence number 1.

After the retransmission, the original ACK(1) arrives from the receiver. The sender processes the ACK and moves to state ‘wait for call 0 from above’ where it waits for the application to provide another message.

Meanwhile, the receiver has processed the duplicate segment with sequence number 1, sending ACK(1) in response. Because the application hasn’t generated a new message, the sender is still in state ‘wait for call 0 from above’ when the duplicate ACK(1) arrives.

The protocols we’ve just explored are stop-and-wait protocols: the sender will wait for an acknowledgement before sending the next segment. Is there room for improvement? Let’s do a quick calculation.

• Transcontinental distances are on the order of 4000 – 5000 km; transoceanic distances up to 9000 km. Signal propagation speeds are on the order of $2 \times 10^8$ – $3 \times 10^8$ m/sec. To an order of magnitude, $d_{prop}$ for a segment will be $10^{-2}$ sec; a few 10’s of msec.

This doesn’t account for the other nodal delays. Some quick tests with ping show an average delay of 80 – 90 msec to the east coast of North America, 200 msec across the Pacific.

• How long will it take us to transmit a typical segment? The size of a typical Ethernet frame is 1500 bytes or 12000 bits. For convenience, let’s use a segment size of 10 kb and transmit it over gigabit Ethernet links at $10^9$ b/sec. The transmission time for a segment is on the order of $10^{-5}$ sec; about 10 µsec.



• Bottom line: It takes about 10 µsec to transmit the segment, after which the sender waits for 10’s of msec for the ACK.

We’re using less than 1/1000 of the available bandwidth!
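The arithmetic can be checked with a few lines of Python. The segment size and link rate are the ones used above; the round trip time of 20 msec is an assumed stand-in for ‘10’s of msec’, and the function name is invented for this sketch.

```python
def stop_and_wait_utilisation(seg_bits: float, rate_bps: float, rtt_s: float) -> float:
    """Fraction of time the sender spends transmitting in a stop-and-wait protocol."""
    t_trans = seg_bits / rate_bps          # time to push one segment onto the link
    return t_trans / (rtt_s + t_trans)     # busy time over one send-and-wait cycle

# 10 kb segment, gigabit link, assumed 20 msec round trip time.
u = stop_and_wait_utilisation(10_000, 1e9, 0.020)
print(f"utilisation = {u:.6f}")
```

Even with this fairly short RTT, the sender is busy for well under a thousandth of the time.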

Yes, surely there must be some way to improve on this!

• Using larger segments could help, but that brings its own problems in a network based on store-and-forward routers. Question P31 from Chapter 1 explored the advantage of breaking a single large segment into multiple small segments for transmission. The key lesson is that we want to have many segments in flight, reducing the total transmission time.

Let’s explore this idea, and see how to design a reliable data transmission protocol that allows for many segments in flight between two end systems. There are two variations, called go-back-n and selective repeat.

• In a go-back-n protocol, the sender puts many segments in flight toward the receiver. The receiver sends back acknowledgements as each segment is received. If a segment is lost or corrupted, the receiver requests retransmission of the missing segment and discards all segments that arrive until the missing segment is received.

This simplifies the design of the receiver, but many segments may be transmitted correctly, discarded, and retransmitted, in the course of error recovery.

• In a selective repeat protocol, the sender puts many segments in flight toward the receiver. The receiver sends back acknowledgements as each segment is received. If a segment is lost or corrupted, the receiver requests retransmission of the missing segment and buffers all segments that arrive without error until the missing segment is received.

This complicates the design of the receiver, but error recovery is limited to retransmission of the missing segment(s).

Implementing either protocol has some implications:

• As mentioned earlier, if we want to have N − 1 segments in flight, we will need at least N sequence numbers. We will need to introduce variables to handle the bookkeeping; it’s not practical to keep adding states to the protocol.

Typically, N is determined by the number of bits available to hold the sequence number. Sequence numbers with k bits have $N = 2^k$ distinct values, 0 – $2^k - 1$, hence the window size is $N - 1 = 2^k - 1$.



• The sender must be able to retransmit any segment that has been transmitted but not acknowledged by the receiver. That means that the sender needs enough buffer space to store all segments in flight until they are acknowledged.

The buffer requirement at the receiver will depend on the choice of protocol.

Before we talk about the details of either protocol, let’s justify the assertion that N sequence numbers are necessary for N − 1 segments in flight. The figure below illustrates the proper use of sequence numbers for a window of size N − 1 = 7.

[Figure: sequence numbers 0 – 7 repeating along a line from past to future, partitioned into segments acknowledged, segments in flight, and segments not yet transmitted.]

At any one time, we can have a maximum of 7 segments in flight (transmitted but not yet acknowledged). Suppose that the window size was 8 and the sender transmitted an eighth segment. As shown in the following figure, it would have sequence number 3.

[Figure: sender and receiver views of the sequence number line, past to future, with eight segments in flight (sequence numbers 4, 5, 6, 7, 0, 1, 2, 3); the receiver’s expected sequence number is marked.]

Now, suppose that the receiver has actually received all the segments transmitted by the sender (segments #4, #5, . . ., #2, #3), but for some reason all the acknowledgements have been lost. The receiver is expecting a segment with sequence number 4. If the sender retransmits the oldest unacknowledged segment (with sequence number 4), the receiver will think it’s a new segment and will accept the duplicate.



• The general problem is that errors in the channel can cause the sender and receiver to have very different views of the window of legal sequence numbers.

Given N sequence numbers, we can have at most N − 1 segments in flight.
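The argument can be checked mechanically. The sketch below models the worst case described above: the receiver has accepted every segment in flight and every acknowledgement has been lost. The function name and parameters are invented for illustration.

```python
def retransmission_is_ambiguous(n_seq: int, window: int, oldest: int = 0) -> bool:
    """True if a retransmitted oldest segment looks like a new one to the receiver."""
    # Sequence numbers of the segments in flight, oldest first.
    in_flight = [(oldest + i) % n_seq for i in range(window)]
    # The receiver got all of them, so it now expects the one after the newest.
    expected = (oldest + window) % n_seq
    # The sender times out and retransmits the oldest unacknowledged segment.
    return in_flight[0] == expected

print(retransmission_is_ambiguous(8, 7, oldest=4))   # window N - 1
print(retransmission_is_ambiguous(8, 8, oldest=4))   # window N
```

With 8 sequence numbers and a window of 7, the retransmitted segment never matches the expected sequence number; with a window of 8 it always does.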

The figures invite another interpretation. At any given time, we can have N − 1 segments in flight. Considered in terms of the infinite sequence of past, present, and future segments, it’s as if we are sliding a window of size N − 1 over the sequence of segments. Another name for this type of protocol is a sliding window protocol.

Now let’s consider the details for a go-back-n protocol. What actions are required of the sender? There’s only one state, so we can dispense with the bubble diagram [3]. Assume that sequence numbers range from 0 to N − 1 and the first segment to be transmitted will receive sequence number 0. Create variables oldest and newest to track the sequence number of the oldest and newest segments in flight, and inflight to count the number of segments in flight. Let’s consider how the sender responds to the possible events: ‘message from application’, ‘receive ack’, ‘corrupt segment’, and ‘timeout’.

• Initialise oldest to 0, newest to N − 1, and inflight to 0.

• Message from application: When the application tries to transmit a message, the transport layer must decide if there’s room in the window. If so, it can transmit the segment and place a copy in the buffer holding segments in flight. If not, it must refuse the message.

    if inflight < N-1:
        # Room in the window: take the next sequence number and send.
        newest = (newest+1) % N
        segbuffer[newest] = wrap(newest, msg, chksum)
        udt_send(segbuffer[newest])
        inflight = inflight+1
        if newest == oldest:
            start_timer()
    else:
        refuse(msg)

Notice that we’re only concerned with a timeout on the ACK for the oldest segment; there’s no need to keep a timer for other segments.

[3] The presentation of go-back-n and selective repeat in these notes restates the presentation in the text in Python and makes explicit the modulo-N arithmetic used for sequence numbers. You should convince yourself that the presentations are equivalent.



• Receive ACK: If an ACK arrives without error, we can take that as an indication that the receiver has correctly received all segments with sequence numbers up to the sequence number in the ACK message. It may well be that some previous ACK has been lost; that’s ok. Assume that diffModN calculates the difference between two sequence numbers using mod N arithmetic and returns a non-negative value [4] between 0 and N − 1.

    ackseq = seq(rcvseg)
    cnt = diffModN(ackseq, oldest)+1
    if cnt < N:
        # The ACK covers cnt segments at the front of the window.
        inflight = inflight-cnt
        oldest = ackseq
        if oldest == newest:
            stop_timer()
        else:
            start_timer()
        oldest = (oldest+1) % N

It could be that this ACK is a request for retransmission of the segment with sequence number oldest. In this case [5] the ACK will specify a sequence number that’s (oldest-1) mod N. No previously unacknowledged segment is acknowledged, hence the transmit window is unchanged. Otherwise, the ACK will be for some segment in the transmit window. The count of segments in flight is reduced accordingly and the base of the window moves forward to one past the segment just acknowledged.

If this ACK acknowledges the newest segment (the one just transmitted), then we have no segments in flight, so stop the timer. If there are segments still in flight, restart the timer (this may result in a longer timeout for the oldest segment still in flight).

• Corrupt segment: There’s nothing to be done when this happens. It may be that we’ll receive an uncorrupted ACK in a bit, in which case the loss of this ACK won’t matter. Or maybe we won’t, in which case the timer will go off and we’ll retransmit.

• Timeout: We haven’t received an ACK for the oldest segment in flight, and it’s well past time for that to happen. Assume that there’s been an error and resend everything.

[4] In other words, diffModN returns the result of counting forward from oldest to ackseq.

[5] To justify the test cnt < N, recall that $(k - 1) \bmod N \equiv (k + (N - 1)) \bmod N$.



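    # Resend the oldest segment and restart the timer for it.
    udt_send(segbuffer[oldest])
    start_timer()
    # Then resend every other segment still in flight, in order.
    xmtseq = (oldest+1) % N
    cnt = 1
    while cnt < inflight:
        udt_send(segbuffer[xmtseq])
        cnt = cnt + 1
        xmtseq = (xmtseq+1) % N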
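The sender’s three event handlers can be collected into one runnable sketch. Several things here are assumptions made so that the code can run on its own: segments are modelled as (sequence number, message) pairs with no checksum, udt_send is replaced by appending to a list named sent, the timer is a boolean flag, and diff_mod_n is one possible implementation of the diffModN helper that the notes only describe.

```python
class GbnSender:
    """Go-back-n sender following the event handlers described above."""

    def __init__(self, n_seq: int):
        self.N = n_seq
        self.oldest = 0                 # oldest segment in flight
        self.newest = n_seq - 1         # newest segment in flight
        self.inflight = 0
        self.segbuffer = [None] * n_seq
        self.timer_running = False
        self.sent = []                  # stand-in for udt_send

    def diff_mod_n(self, a: int, b: int) -> int:
        # Count forward from b to a, giving a value in 0 .. N-1.
        return (a - b) % self.N

    def send(self, msg) -> bool:
        if self.inflight >= self.N - 1:
            return False                # window full: refuse the message
        self.newest = (self.newest + 1) % self.N
        self.segbuffer[self.newest] = (self.newest, msg)
        self.sent.append(self.segbuffer[self.newest])
        self.inflight += 1
        if self.newest == self.oldest:
            self.timer_running = True   # first segment in an empty window
        return True

    def on_ack(self, ackseq: int) -> None:
        cnt = self.diff_mod_n(ackseq, self.oldest) + 1
        if cnt < self.N:                # same test as the notes; cnt == N is a repeat ACK
            self.inflight -= cnt
            self.oldest = ackseq
            # Stop the timer if nothing is left in flight, else restart it.
            self.timer_running = self.oldest != self.newest
            self.oldest = (self.oldest + 1) % self.N

    def on_timeout(self) -> None:
        seq = self.oldest
        for _ in range(self.inflight):  # resend everything still in flight
            self.sent.append(self.segbuffer[seq])
            seq = (seq + 1) % self.N
        self.timer_running = self.inflight > 0
```

A usage example: with 8 sequence numbers the sender accepts 7 messages and refuses the eighth; a cumulative ACK(2) frees three slots, and a repeated ACK(2) leaves the window unchanged.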

What actions are required of the receiver? Really, there are only two events of interest: the arrival of an uncorrupted segment with the expected sequence number (the ‘correct segment’), and the arrival of any other segment, corrupt or not (‘default’). Assume a variable expected that contains the expected sequence number.

• Initialise expected to 0.

• Correct segment: The receiver should extract the message and pass it to the application and send an acknowledgement to the sender.

    msg = extract(rcvseg)
    deliver_msg(msg)
    ackseg = wrap(expected, ACK, chksum)
    udt_send(ackseg)
    expected = (expected+1) % N

• Default: If anything else arrives, it’s wrong. Repeat the acknowledgment of the last segment correctly received.

    # (expected-1) mod N, written so the left operand is never negative
    ackseq = (expected+(N-1)) % N
    ackseg = wrap(ackseq, ACK, chksum)
    udt_send(ackseg)
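The receiver’s two cases fit in a few lines of runnable Python. As a simplifying assumption, corruption is passed in as a flag instead of being detected with a checksum, and the method returns the sequence number to acknowledge instead of sending an ACK segment; the class and method names are invented for this sketch.

```python
class GbnReceiver:
    """Go-back-n receiver: accept only the expected segment, re-ACK otherwise."""

    def __init__(self, n_seq: int):
        self.N = n_seq
        self.expected = 0
        self.delivered = []             # stand-in for deliver_msg

    def on_segment(self, seq: int, msg, corrupt: bool = False) -> int:
        """Process one arriving segment and return the sequence number to ACK."""
        if not corrupt and seq == self.expected:
            # Correct segment: deliver it and advance the window of size 1.
            self.delivered.append(msg)
            ack = self.expected
            self.expected = (self.expected + 1) % self.N
        else:
            # Default: repeat the ACK for the last segment correctly received.
            ack = (self.expected + self.N - 1) % self.N
        return ack
```

Out-of-order and corrupt segments are both discarded, so the application only ever sees messages in the order they were sent.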

We can summarise the windows at the sender and receiver as follows:

• At the sender, the oldest segment of interest is the oldest segment that’s been transmitted but not acknowledged. The window of available sequence numbers is anchored here. The window advances each time the oldest segment is acknowledged by the receiver.

• At the receiver, the only segment of interest is the expected segment, a trivial window of size 1. The window advances each time the expected segment is received.



Ideally, we will get maximum utilisation of the channel if the time required to transmit a full set of segments (where a full set is defined to be the length of the sender’s window) is equal to the RTT of the channel, in seconds.

• The acknowledgement for the first segment transmitted would arrive at the sender just as it finishes transmitting the segment that fills the window. If acknowledgements arrive on schedule, there will be no gaps in the outgoing stream of segments.
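To get a feel for the numbers, the sketch below computes how many segments are needed to fill one RTT and how many sequence number bits that window implies, using the relation window = $2^k - 1$ from earlier. The 10 kb segment and gigabit link come from the earlier calculation; the 20 msec RTT is an assumed value, and both function names are invented for illustration.

```python
def segments_to_fill_pipe(seg_bits: int, rate_bps: int, rtt_us: int) -> int:
    """Smallest window whose transmission time is at least one RTT."""
    numer = rtt_us * rate_bps            # bits that fit in one RTT, times 10^6
    denom = seg_bits * 1_000_000
    return -((-numer) // denom)          # ceiling division using integers only

def seq_bits_for_window(window: int) -> int:
    """Smallest k such that 2**k - 1 >= window."""
    return window.bit_length()

w = segments_to_fill_pipe(10_000, 10**9, 20_000)
print(w, seq_bits_for_window(w))
```

Integer arithmetic is used throughout so that the ceiling is not disturbed by floating point rounding.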

The problem with go-back-n is that, in the worst case, error recovery can require retransmission of one entire window of segments. Over a channel with long latency and high bandwidth, this can be a very large amount of data.

Can we do better if we adopt the policy that the receiver will acknowledge and keep all segments received without error and the sender will retransmit only those segments that are not acknowledged?

• To achieve in-order delivery, the receiver will need to buffer segments while it requests retransmission of a missing segment and awaits its arrival. If we can buffer segments at the sender, we can surely manage it at the receiver.

• The receiver will have a nontrivial window, as we’re willing to accept segments that arrive early and buffer them until they can be passed to the application.

There’s a subtle problem stemming from the fact that the sender and receiver may not agree on the position of the window. The following figure[6] illustrates the problem.

[6] This is Figure 3.23 from the text, with sequence numbers and annotations to make clear the story told by the figure. The window size (14) is an odd choice, but only because it’s not 2^k − 1.



[Figure: sender and receiver windows over sequence numbers 0 to 14. Sender: a window running from ‘oldest’ to ‘newest’, split into ‘used’ and ‘available’ portions, with the past to its left and the future to its right. Receiver: a window running from ‘oldest (not received)’ to ‘newest (not received)’, distinguishing segments acknowledged and delivered to the application from segments not yet received.]

The scenario in the figure shows that the sender thinks it has transmitted all segments up to segment #3 (the portion of the window labelled ‘used’) but hasn’t received acknowledgements for segments #11 or #12. The receiver is awaiting segment #0. It thinks it has acknowledged all segments up to segment #14, and has also received and acknowledged segments #1, #2, and #3.

• This implies that ACK(11) and ACK(12) (the acknowledgements for segments #11 and #12) were corrupted or lost, that segment #0 was corrupted or lost, and that ACK(3) is in flight, corrupted, or lost.

Suppose that the sender’s timer expires and it decides to retransmit segment #11.

• The receiver will receive a segment with sequence number 11. The receiver’s window extends to sequence number 13, so the arrival of segment #11 is within expectation, if a bit fast. The receiver will accept this duplicate segment, incorrectly, and buffer it for delivery once the intervening segments arrive. The acknowledgement will satisfy the sender and the error will be undetected.

• But we can’t just ignore this error pattern. It will happen that acknowledgements get lost, just as shown here, and there must be some way to clear the lack of acknowledgement at the sender so that it can advance its window.



A bit of thought should convince you that every segment in the combined window extending from the leftmost (past) edge of the sender’s window to the rightmost (future) edge of the receiver’s window must have a unique sequence number. In effect, the sender’s and receiver’s windows must be treated as one large window. The rule for a selective repeat protocol is that the window at the sender and receiver should be ⌊(N − 1)/2⌋.

Figure 3.27 in the text provides another illustration of this error scenario, using sequence numbers from 0 to 3 and a window size of 3.

With the preliminary analysis out of the way, what are the actions for the sender? Assume that sequence numbers range from 0 to N − 1, that the window at the sender and receiver is of size ⌊(N − 1)/2⌋ = W, and the first segment to be transmitted will receive sequence number 0. As before, oldest holds the sequence number of the oldest unacknowledged segment. The variable newest will be the sequence number of the most recently transmitted segment, and inuse will count the number of sequence numbers in use. Notice that sequence number arithmetic is performed modulo N, even though the window size is limited to W.

• Initialise oldest to 0, newest to N − 1, and inuse to 0.

• Message from application: When the application tries to transmit a message, the transport layer must decide if there’s room in the window. If so, it can transmit the segment and place a copy in the buffer holding segments in flight. If not, it must refuse the message.

    if inuse < W :
        newest = (newest+1) mod N
        segbuffer[newest] = wrap(newest,msg,chksum)
        udt_send(segbuffer[newest])
        inuse = inuse+1
        start_timer(newest)
    else :
        refuse(msg)

Because we’re only resending segments that are not correctly received (thus not acknowledged), we need a separate timeout for each segment.

• Receive ACK: If an ACK arrives without error, we can take that as an indication that the receiver has correctly received the referenced segment.

    ackseq = seq(rcvseg)
    stop_timer(ackseq)
    mark_seg_as_acked(ackseq)
    if oldest == ackseq :
        while seg_is_acked(oldest) :
            inuse = inuse-1
            oldest = (oldest+1) mod N

If the ACK is for the oldest unacknowledged segment, we can advance the sender’s window to the next unacknowledged segment. We have to check the status (acknowledged or not) of each segment. The complete condition for the while loop is

    while oldest <= newest && seg_is_acked(oldest) :

but testing oldest <= newest is awkward in mod N arithmetic. Since a segment that’s not yet sent cannot be acknowledged, the while loop must stop when oldest is incremented to be greater than newest. An explicit test for oldest <= newest is not required. (This relies on the acknowledged mark for a sequence number being cleared before that number is reused.)

• Corrupt segment: There’s nothing to be done when this happens. It may be that we’ll receive an uncorrupted ACK in a bit, in which case the loss of this ACK won’t matter. Or maybe we won’t, in which case the timer will go off and we’ll retransmit.

• Timeout(lateseq): We haven’t received an ACK for the oldest segment in flight, with sequence number lateseq, and it’s past time for that to happen. Assume that there’s been an error and resend just this one segment.

    udt_send(segbuffer[lateseq])
    start_timer(lateseq)
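The sender events above can be assembled into one runnable sketch. The class name SrSender and the send, start_timer, and stop_timer callbacks are assumptions standing in for udt_send and the timer primitives. One detail is added beyond the pseudocode: the acknowledged mark and buffered copy are discarded as the window slides, so a reused sequence number never looks acknowledged.

```python
class SrSender:
    """Selective repeat sender sketch; variable names follow the notes."""

    def __init__(self, n, send, start_timer, stop_timer):
        self.n = n                    # sequence numbers are 0 .. N-1
        self.w = (n - 1) // 2         # window size W = floor((N-1)/2)
        self.oldest = 0
        self.newest = n - 1
        self.inuse = 0
        self.segbuffer = {}           # seq -> message awaiting acknowledgement
        self.acked = set()            # seqs acknowledged but not yet slid past
        self.send = send              # hypothetical callbacks
        self.start_timer = start_timer
        self.stop_timer = stop_timer

    def on_message(self, msg):
        if self.inuse >= self.w:
            return False              # no room in the window: refuse
        self.newest = (self.newest + 1) % self.n
        self.segbuffer[self.newest] = msg
        self.send(self.newest, msg)
        self.inuse += 1
        self.start_timer(self.newest)
        return True

    def on_ack(self, ackseq):
        if ackseq not in self.segbuffer:
            return                    # stale or duplicate ACK: ignore
        self.stop_timer(ackseq)
        self.acked.add(ackseq)
        while self.oldest in self.acked:
            # Slide the window and clear the mark so it cannot go stale.
            self.acked.discard(self.oldest)
            del self.segbuffer[self.oldest]
            self.inuse -= 1
            self.oldest = (self.oldest + 1) % self.n

    def on_timeout(self, lateseq):
        # Resend just the one segment whose timer expired.
        self.send(lateseq, self.segbuffer[lateseq])
        self.start_timer(lateseq)
```

Using a set for the marks makes the loop stop naturally at the first unacknowledged (or not yet sent) sequence number, which is the same argument the notes make for dropping the explicit oldest <= newest test.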

And the receiver? It becomes a bit more complex because it must now manage buffers and a nontrivial window. The base of the window is held in oldest, the sequence number of the first segment not yet received. Each time a segment arrives without error, there are three cases to consider:

• The sequence number of the segment matches oldest.

Call this event ‘oldest’. In this case, we want to send an ACK and begin to deliver messages to the application, advancing the window through consecutive sequence numbers until we come to a missing segment.

• The segment has a sequence number less than oldest, but within the past window (i.e., the W sequence numbers preceding oldest). This is a segment that arrived without error and was delivered to the application, but the ACK was lost and the sender has retransmitted the segment.

Call this event ‘past’. In this case, we want to send an ACK so the sender will know it’s been received, but that’s all we need to do.



• The segment has a sequence number larger than oldest, but within the future window (i.e., the W − 1 sequence numbers following oldest). This is a segment that’s arrived early.

Call this event ‘future’. In this case, we want to send an ACK so the sender will know it’s been received, but we can’t yet deliver it to the application because an earlier message hasn’t arrived. Buffer the segment for later delivery.

In Python-style pseudocode, the actions will be as follows.

• Oldest: The sequence number of the segment, rcvseq, matches the sequence number in oldest. The receiver should unwrap and deliver the message in this segment, then scan the segment buffer to see if there are additional messages ready for delivery. For uniformity, stash the newly arrived segment in the buffer before starting the scan. Each buffer slot is cleared once its message is delivered (clear is an assumed helper, the counterpart of empty), so that a stale segment is not delivered again when the sequence numbers wrap around.

    rcvseq = seq(rcvseg)
    ackseg = wrap(rcvseq,ACK,chksum)
    udt_send(ackseg)
    segbuffer[rcvseq] = rcvseg
    while !empty(segbuffer[rcvseq]) :
        msg = extract(segbuffer[rcvseq])
        deliver_msg(msg)
        clear(segbuffer[rcvseq])
        rcvseq = (rcvseq+1) mod N
    oldest = rcvseq

When the scan reaches an empty segment buffer, reset oldest.

• Past: All that needs to be done is send an ACK.

    rcvseq = seq(rcvseg)
    ackseg = wrap(rcvseq,ACK,chksum)
    udt_send(ackseg)

• Future: We need to acknowledge this segment and buffer it, because we can’t deliver the message to the application while one or more earlier segments are missing.

    rcvseq = seq(rcvseg)
    ackseg = wrap(rcvseq,ACK,chksum)
    udt_send(ackseg)
    segbuffer[rcvseq] = rcvseg

• Default: If the arriving segment is corrupt, we don’t need to do anything. If the arriving segment is not corrupt but has a sequence number outside the past or future windows, something is seriously wrong (this shouldn’t happen).
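The receiver’s four events can also be sketched as runnable Python. The names SrReceiver, classify, deliver, and send_ack are assumptions for illustration. The only new idea is how to decide ‘past’ versus ‘future’ in mod N arithmetic: measure the distance ahead of and behind oldest, each taken modulo N.

```python
class SrReceiver:
    """Selective repeat receiver sketch; events follow the notes."""

    def __init__(self, n, deliver, send_ack):
        self.n = n
        self.w = (n - 1) // 2      # window size W = floor((N-1)/2)
        self.oldest = 0            # first segment not yet received
        self.segbuffer = {}        # seq -> message held for in-order delivery
        self.deliver = deliver     # hypothetical callbacks
        self.send_ack = send_ack

    def classify(self, seq):
        ahead = (seq - self.oldest) % self.n    # distance into the future
        behind = (self.oldest - seq) % self.n   # distance into the past
        if ahead == 0:
            return "oldest"
        if ahead <= self.w - 1:
            return "future"
        if behind <= self.w:
            return "past"
        return "default"

    def on_segment(self, seq, msg, corrupt=False):
        event = "default" if corrupt else self.classify(seq)
        if event == "default":
            return event           # corrupt or outside both windows: ignore
        self.send_ack(seq)
        if event == "past":
            return event           # already delivered; the ACK is all we owe
        self.segbuffer[seq] = msg
        if event == "oldest":
            # Deliver consecutive buffered messages; pop() clears each slot.
            while self.oldest in self.segbuffer:
                self.deliver(self.segbuffer.pop(self.oldest))
                self.oldest = (self.oldest + 1) % self.n
        return event
```

Because 2W ≤ N − 1, the past and future windows can never overlap, so the order of the tests in classify does not matter.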



Now that we have a good understanding of the relationship between window size and range of sequence numbers, it’s time to admit that we can’t always use the minimum range of sequence numbers.

• Our model of the channel between the sender and receiver allows for segments to be corrupted or lost completely, and retransmission triggered by timeouts can cause duplicates to arrive at the receiver. But our model of the channel assumes that, with the exception of complete loss, segments arrive at the receiver in the order that they were sent.

• For a sender and receiver connected by a single link, this is trivially true. Bits cannot pass one another as they propagate along the link. This extends to the situation where there’s a single path between the sender and receiver, even if the path has multiple links.

• This assumption does not hold in a large packet-switched network where there are many alternative paths between the sender and receiver. It’s possible (if unlikely) for a segment to be delayed for a significant amount of time, long enough for the sender and receiver to retransmit the segment, recover from the loss, and move on. When the delayed segment finally appears at the receiver, its sequence number may well be within the current window and this would result in the segment being accepted, which is an error.

• The practical solution is to place an upper bound on the lifetime of packets in the network[7] and use a range of sequence numbers large enough to avoid any repetition in that time period.
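A rough calculation shows what ‘large enough’ means. The sketch below counts how many units (bytes or segments) a sender could emit during one packet lifetime and returns the number of sequence number bits needed to avoid repeating a number in that time. The function name and the example rates are assumptions; the 180 second lifetime is the three minute estimate from the footnote.

```python
def sequence_bits_needed(rate_bps: int, lifetime_s: int, unit_bytes: int = 1) -> int:
    """Smallest b such that 2**b exceeds the units sent in one packet lifetime."""
    units_per_lifetime = rate_bps * lifetime_s // (8 * unit_bytes)
    return units_per_lifetime.bit_length()


# Assumed examples, all with a 180 s lifetime:
print(sequence_bits_needed(10_000_000, 180))           # 10 Mb/s, numbering bytes
print(sequence_bits_needed(1_000_000_000, 180))        # 1 Gb/s, numbering bytes
print(sequence_bits_needed(1_000_000_000, 180, 1500))  # 1 Gb/s, numbering 1500 byte segments
```

Numbering individual bytes at gigabit rates needs more than 32 bits under these assumptions, which is worth keeping in mind when we look at TCP’s 32 bit sequence numbers.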

It’s time to summarise what we’ve learned about reliable data transfer protocols.

Our model of the communication channel between the sender and the receiver allows for three types of errors:

• Data can be corrupted, so that the bits that arrive at the receiver are not the bits transmitted by the sender. This can occur due to noise in the channel or intermittent equipment failure. With today’s technology, data corruption is very rare for guided media, slightly more common with unguided media.

• Data can be outright lost, so that nothing arrives at the receiver. This can occur when an intermediate router runs out of buffer space and must discard a datagram.

[7] For the Internet, this is estimated to be about three minutes.



• Data can be delayed for long periods and arrive at the receiver out of order. This is an extremely rare error, caused by extreme congestion delays or transient errors in router forwarding tables.

It’s important to keep in mind that terabytes (10^12) of data are transmitted on the Internet each second. A one-in-a-billion error happens somewhere once each millisecond.

To achieve reliable data transfer in the presence of these errors, we have a suite of techniques to apply:

• Checksums are the result of some calculation performed over the data and transmitted with the data for verification by the receiver. They are used to detect corrupted data.

• Timers are used to measure an interval. In reliable data transfer, they are used to time the interval between sending a segment and receiving the acknowledgement. They are used to detect complete data loss.

• Sequence numbers are used to identify each unit of data (a segment, for example). They allow the receiver to detect loss or duplication of data. A gap in the sequence numbers seen by the receiver indicates data loss. A repeated sequence number indicates duplication.

• Acknowledgements provide positive feedback from the receiver to the sender so that the sender knows what data the receiver has received. Acknowledgements allow the sender to discard data that’s buffered for possible retransmission. Negative acknowledgements are an alternate implementation choice.

• A sliding window allows multiple units of data to be in flight between the sender and the receiver. This allows a reliable data transfer protocol to achieve an acceptable data transfer rate by increasing the utilisation of the channel.


Cmpt 371 UDP

UDP

The User Datagram[8] Protocol (UDP), defined in RFC 768, provides a connectionless, best-effort data transfer service.

We’ve mentioned already that the Internet network protocol, IP, provides a connectionless, best-effort data transfer service between hosts. Why do we need UDP?

• For one thing, it adds the essential transport service, the ability to specify particular application processes on the source and destination hosts.

• A second added capability is a checksum over the entire UDP segment plus selected items from the IP header so that it’s possible to detect if the header or data has been corrupted in transmission.

There’s not a whole lot to a UDP segment:

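+-------------------------+-------------------------+
| source port (16)        | destination port (16)   |
+-------------------------+-------------------------+
| length (16)             | checksum (16)           |
+-------------------------+-------------------------+
| payload (max 64KiB)                               |
+---------------------------------------------------+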

• The source (local) and destination (remote) port numbers are 16 bit values, as explained previously.

• The 16 bit message length includes the UDP header. The real limit on length is the IP header’s length field, also 16 bits. Allowing for 20 bytes of mandatory IP header and 8 bytes of UDP header, the maximum payload is (2^16 − 1) − 8 − 20 = 65507 bytes.

• The checksum is the standard Internet checksum documented in RFC 1071. It’s calculated as the one’s complement of the one’s complement sum of the UDP header and data as 16 bit (2 byte) words. If the payload is an odd number of bytes, a byte with value 0 is added for the purpose of computing the checksum.

Use of the checksum is optional, and a checksum value of 0 is interpreted as absence of the checksum[9].
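The checksum calculation is short enough to show in full. This is a minimal sketch of the RFC 1071 algorithm as described above; the function names are assumptions. The second function applies the UDP convention that a computed value of 0 is transmitted as all 1’s, since 0 in the field means ‘no checksum’.

```python
def internet_checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of 16 bit words."""
    if len(data) % 2:
        data += b"\x00"                       # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16 bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


def udp_checksum_field(data: bytes) -> int:
    """Value to place in the UDP checksum field (never 0 when a checksum is used)."""
    return internet_checksum(data) or 0xFFFF


# The worked example from RFC 1071: the sum is 0xDDF2, so the checksum is 0x220D.
example = bytes.fromhex("0001f203f4f5f6f7")
print(hex(internet_checksum(example)))
```

A receiver verifies by running the same calculation over the data with the checksum included; an undamaged segment yields 0.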

[8] The RFC refers to a UDP segment as a datagram, but these notes will use segment for compatibility with the text.


Cmpt 371 TCP

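To provide additional protection against changes in the IP header, selected fields are collected into a pseudo-header that’s prepended to the UDP segment when the checksum is calculated. This provides an extra measure of protection against corruption of the IP header as it’s forwarded from router to router.

+---------------------------------------------------+
| source IP address                                 |
+---------------------------------------------------+
| destination IP address                            |
+-----+---------------------+-----------------------+
| 0   | protocol (udp = 17) | length                |
+-----+---------------------+-----------------------+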

• To see why this is useful, we need to look ahead a bit. Each router will modify the IP header: at the least, the router will modify the hop count in the header. The router must then recalculate the IP header checksum. If the router introduces an error as it modifies the IP header, the new checksum will be correct for the erroneous header.

The UDP header and data are not modified in transit, so routers do not have the same opportunity to recalculate the UDP checksum and hide an error.

• The pseudo-header is not transmitted with the UDP segment, but the UDP checksum is transmitted. The destination end system rebuilds the pseudo-header from the received IP header and recalculates the checksum. A mismatch indicates an error somewhere in the UDP segment or the IP header fields (but it’s still not possible to pinpoint the error).

• The inclusion of the length of the UDP segment in the pseudo-header seems redundant. A likely explanation is that it is included for symmetry with TCP, which uses the same pseudo-header format and does not provide an explicit length field in the TCP segment header.
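To make the pseudo-header concrete, here is a sketch that builds a UDP segment over IPv4 and verifies it at the destination. The helper names (checksum, pseudo_header, build_udp_segment, verify_udp_segment) and the example addresses are assumptions; only the standard library struct and socket modules are used. The field widths (8 bit zero, 8 bit protocol, 16 bit length) are those of RFC 768.

```python
import socket
import struct


def checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071) over 16 bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    # source address, destination address, zero byte, protocol 17, UDP length
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, udp_length))


def build_udp_segment(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    length = 8 + len(payload)                       # the length field includes the header
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = checksum(pseudo_header(src_ip, dst_ip, length) + header + payload) or 0xFFFF
    return struct.pack("!HHHH", src_port, dst_port, length, csum) + payload


def verify_udp_segment(src_ip, dst_ip, segment: bytes) -> bool:
    # The receiver rebuilds the pseudo-header from the IP header it received.
    return checksum(pseudo_header(src_ip, dst_ip, len(segment)) + segment) == 0


seg = build_udp_segment("192.0.2.1", "192.0.2.2", 5000, 53, b"hello")
print(verify_udp_segment("192.0.2.1", "192.0.2.2", seg))
```

Verifying against a different destination address fails, which is exactly the misdelivery case the pseudo-header is meant to catch.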

TCP

TCP, defined in RFC 793, provides reliable end-to-end data transport over an unreliable network layer. Connections are full duplex and point-to-point.

• More specifically, TCP is designed to run over an unreliable internetwork, and adapt its behaviour to the varying characteristics (bandwidth, delay, maximum transfer unit) of paths through an internetwork.

[9] Recall that this is not a problem for one’s complement representation. If the computed checksum turns out to be 0, we just use the other one’s complement representation of 0, which is all 1’s.


As explained earlier, each end of a TCP connection is identified by an ⟨address:port⟩ pair. The four-tuple ⟨local_ip:local_port ; remote_ip:remote_port⟩ defines a connection. This four-tuple must be unique for each connection.

• Port numbers are interpreted exactly as for UDP. Many applications are allocated the same port number in both the UDP and TCP port space.

The unit of data transfer for TCP is called a segment.

• It consists of a 20 byte mandatory header, some header options, and the data payload. A segment need not have any data.

• All told, a segment must fit within the payload limit of IP: 64K minus whatever space is occupied by the IP header.

Recall that TCP provides a byte stream and does not preserve the boundaries between blocks of data as provided by the application. If markers are required to separate messages, the application must supply them.

• By not preserving boundaries between blocks of data, TCP is free to group data for maximum efficiency before passing it to the network layer for transmission, and before passing it to the application on the receiving side.

The goal is to send the least amount of network overhead data per byte of data transferred between applications. The TCP and IP headers have a fixed minimum size, so the only way to reduce overhead is to send as much application data as possible in each segment.

• The ability to regroup data for efficient transmission allows TCP to avoid ‘silly window syndrome’, where the transmitting side is continually sending tiny segments in response to small receiver window increments.

To suggest that a TCP implementation send even small amounts of data promptly (e.g., for use with interactive applications where a single keypress or mouse click evokes an action), the protocol defines a ‘push’ mechanism.

• ‘Push’ is a strong suggestion to the local TCP implementation to immediately send whatever data it has accumulated, and a similar suggestion to the remote TCP implementation to immediately deliver any accumulated data to the process at the other end of the connection, without waiting for more data to build a larger segment.


• On the sending side, when the sending application requests a push, the local TCP process should send all unsent data to the remote TCP process with the ‘push’ (PSH) flag set. On receipt of this segment, the remote TCP process should deliver the data to the receiving application without further delay.

Note that ‘push’ is a suggestion, not an order, although it’s expected that the TCP implementation will comply if at all possible.

To signal the presence of urgent data in the byte stream (e.g., ctl-C to abort an operation) TCP provides an ‘urgent’ mechanism.

• When the sending application signals that the message is urgent, the local TCP process sets a field in the header, the urgent pointer, indicating the end of urgent data in the byte stream as an offset from the start of the segment. The ‘urgent’ (URG) flag is set to indicate to the remote TCP process that the urgent pointer is valid.

On receipt of a segment with the URG flag set, the remote TCP process should immediately notify the receiving application. The receiving application has the responsibility for deciding how to act on the notification. The expectation is that it will process received data as expediently as possible (in the extreme, simply discarding it) in order to reach the marked place in the byte stream. Urgent data is not delivered ‘out-of-band’, i.e., it is not delivered ahead of data sent before the sending application produced the urgent data.

On each subsequent call by the receiving application to receive data, the urgent flag will be set until all data up to the byte specified by the urgent pointer has been read.

• Note that the value of the urgent pointer itself is not returned. The sending application can send additional urgent data, and in this case the value of the urgent pointer is transparently updated while the receiving application is working through the byte stream to process the urgent data.

• Urgent does not automatically imply push. This should be signalled as well to move the data as fast as possible.

As documented in RFC 6093, there are problems with the implementation of the urgent mechanism in the TCP implementation for many common operating systems. The authors of the RFC go so far as to recommend that new applications not use the urgent mechanism.

Let’s have a look at the TCP header format. The header may also include options, not shown here or discussed in these notes.


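+-----------------------------+-----------------------------+
| source port (16)            | destination port (16)       |
+-----------------------------+-----------------------------+
| sequence number (32)                                      |
+-----------------------------------------------------------+
| acknowledgement number (32)                               |
+----------+----------+-------+-----------------------------+
| hlen (4) | rsvd (4) | flags | window (16)                 |
+----------+----------+-------+-----------------------------+
| checksum (16)               | urgent pointer (16)         |
+-----------------------------+-----------------------------+

flags, one bit each, in header order: CWR ECE URG ACK PSH RST SYN FIN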

• The source and destination port fields serve the same function as in UDP: they identify a particular application process at the source and destination hosts.

• The checksum calculation uses the same pseudo-header defined for UDP, containing the IP source and destination addresses, protocol (TCP, in this case), and the total TCP segment length (header and data). Unlike for UDP, the segment length is not redundant. The TCP header does not contain an explicit length field, so the receiving network layer must calculate the length of the TCP segment as the difference between the IP datagram length and the IP header length.

• The sequence number, acknowledgement number, and window size fields are used to implement a sliding window protocol.

Each data byte in a TCP data stream receives a sequence number (header bytes are not included). The sequence number specifies the position in the data stream of the first data byte in the segment. The acknowledgement number is valid only if the ACK flag is set in the control field. It specifies the next byte that the receiver is expecting and acknowledges all previous bytes (i.e., cumulative acknowledgement).

The size of the transmit window (i.e., the amount of additional data that the receiver permits the sender to transmit) is specified by the window field, so that acknowledgement of receipt and permission to send more data are separate functions. The TCP window size is used for flow control and is not equivalent to window size as defined for the go-back-n and selective repeat protocols. The TCP sliding window protocol will be explained in detail below.

• The TCP header length (hlen) specifies the size of the header, in 4 byte units. An explicit length field is needed because the TCP header may contain optional fields in addition to the mandatory fields described here. The amount of data carried in the segment must be calculated implicitly using the IP datagram length and the TCP header length.

• The control flags field contains flags that specify the type of segment and data urgency. The use of the PSH and URG flags has already been described.


◦ As mentioned above, the ACK flag indicates that the acknowledgement number is valid. It should be set in all segments except the very first segment sent by the end system that initiates the connection. It may also be unset in segments that abort (reset) the connection.

◦ SYN, FIN, and RST indicate connection control segments. SYN indicates that this segment is a connection request or connection accept segment sent as part of the three-way handshake that opens a connection (described below). The FIN bit is set to indicate a clear request segment that closes the connection in one direction (also described below). The RST bit is set to indicate refusal of a connection request, or to reset (abort) the connection. On receipt, the connection is immediately terminated.

◦ RFC 3168 defines two additional flags, ECE and CWR, which are used as part of Explicit Congestion Notification (ECN). These flags allow TCP to cooperate with the network layer when a TCP segment is received in an IP datagram with the ‘congestion experienced’ (CE) indication.
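The fixed 20 byte header maps directly onto a struct format string, which makes a compact way to check your understanding of the layout. The names TcpHeader and parse_tcp_header are assumptions for this sketch; options (anything beyond the first 20 bytes) are not decoded.

```python
import struct
from typing import NamedTuple

# Flag bits from most significant (0x80) to least significant (0x01).
FLAG_NAMES = ("CWR", "ECE", "URG", "ACK", "PSH", "RST", "SYN", "FIN")


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    hlen_bytes: int      # header length converted from 4 byte units to bytes
    flags: frozenset
    window: int
    checksum: int
    urgent: int


def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Decode the mandatory 20 byte TCP header (network byte order)."""
    (src, dst, seq, ack, hlen_rsvd, flag_bits,
     window, csum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    hlen_bytes = (hlen_rsvd >> 4) * 4            # upper 4 bits hold hlen
    flags = frozenset(name for i, name in enumerate(FLAG_NAMES)
                      if flag_bits & (0x80 >> i))
    return TcpHeader(src, dst, seq, ack, hlen_bytes, flags, window, csum, urgent)


# A hand-built connection request: hlen = 5 (20 bytes), only SYN set.
raw = struct.pack("!HHIIBBHHH", 49152, 80, 1000, 0, 5 << 4, 0x02, 65535, 0, 0)
print(parse_tcp_header(raw))
```

The data length is not in the header; a receiver would compute it as the IP payload length minus hlen_bytes.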

Connection establishment is a three-way handshake, as shown in the figure.

[Figure: three-way handshake between End System A and End System B: A sends SYN, B replies SYN+ACK, A replies ACK, and the connection is open.]

• A connect request is a segment with the SYN bit set, but no ACK bit. It initiates the connection establishment process and gives the initial sequence number k for bytes moving from Host A to Host B.

• A connect accept is a segment with the SYN and ACK bits set. It acknowledges the sequence number of the connect request (SYN) segment (the value in the acknowledgement number field will be k + 1), and proposes a sequence number l for bytes moving from Host B to Host A.

• The final step in the connection establishment is a segment with only the ACK bit set. It will acknowledge the sequence number of the connect accept (SYN, ACK) segment (the acknowledgement number will be l + 1).

• All three segments in this exchange can, in theory, carry data, and the acknowledgement numbers will be adjusted accordingly. In practice, the first two segments rarely contain data. The final ACK that completes connection establishment will most likely carry data.
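The numbering in the handshake is easy to get wrong by one, so here is a small sketch that lists the three segments for given initial sequence numbers k and l, assuming none of them carries data. The function name and the dictionary representation are assumptions for illustration. Sequence numbers are 32 bits, so the arithmetic wraps modulo 2^32.

```python
MOD = 2 ** 32  # TCP sequence numbers are 32 bits wide


def three_way_handshake(k: int, l: int) -> list:
    """Segments exchanged when A (initial seq k) connects to B (initial seq l)."""
    syn = {"from": "A", "flags": {"SYN"}, "seq": k}
    syn_ack = {"from": "B", "flags": {"SYN", "ACK"}, "seq": l, "ack": (k + 1) % MOD}
    # The SYN consumes one sequence number, so A's next segment starts at k + 1.
    ack = {"from": "A", "flags": {"ACK"}, "seq": (k + 1) % MOD, "ack": (l + 1) % MOD}
    return [syn, syn_ack, ack]


for segment in three_way_handshake(1000, 5000):
    print(segment)
```

Although a SYN carries no data, it is acknowledged as if it occupied one byte of the stream, which is why both acknowledgement numbers are one more than the initial sequence number.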


Connection release is symmetric. Each side can terminate transmission, while remaining open to receive data from the other.

[Figure: connection release between End System A and End System B. Labels: ACK+FIN, ACK (A → B closed), ..., ACK+FIN, ACK (B → A closed).]

• The side which initiated the close (End System A in the example) will continue to process segments for a time after receiving a close from the remote peer (End System B) so that any late-arriving segments are properly handled (e.g., a retransmitted FIN segment from End System B when the final ACK from End System A is lost).

• Too many retransmissions of FIN with no response will eventually result in a unilateral close, after an appropriate timeout.

Before we describe TCP’s sliding window algorithm in detail, there’s one more mechanism to discuss: Determining an appropriate length for the timeout that triggers retransmission of a segment. As mentioned in the description of rdt3.0, the minimum possible timeout is the round-trip time (RTT) between the sender and receiver. In a packet-switched network, the RTT will vary depending on the instantaneous load at each router along the path between the end systems. The instantaneous load on the remote end system will also have an effect. We don’t want to base timeout decisions on instantaneous values, so we need some way to calculate an estimate of the average RTT.

• First, we need some samples on which to base an estimate. We don’t need a measurement for every segment; it’ll suffice to pick a segment, set a timer, and wait for the acknowledgement for that segment. When the acknowledgement arrives, start a new measurement with the next segment to be transmitted. This produces a new measurement roughly once per RTT. Call this measurement sampleRTT.

Because we want the estimated RTT to reflect the normal round-trip time, the sample is used only if the segment is transmitted and acknowledged without error.

• For each new measurement, calculate

estRTT = (1 − α)(estRTT) + α(sampleRTT)

This calculation produces an exponentially weighted moving average (EWMA) which smooths out the often considerable variation in the instantaneous samples. (See Figure 3.32 in the text.)


• A similar calculation is used to maintain an estimate of the variation (deviation) from one instantaneous sample to the next as

devRTT = (1 − β)(devRTT) + β(|estRTT − sampleRTT|)

• Given estRTT and devRTT, what’s an appropriate timeout interval? Years of experience have resulted in the formula

timeoutInterval = estRTT + 4(devRTT)

In addition, there are rules to quickly increase timeoutInterval to avoid unnecessary timeouts and to quickly reduce it to avoid wasting time waiting for a timeout.
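The three formulas fit in a few lines of Python. This is a sketch under stated assumptions: the class name RttEstimator is hypothetical, the weights α = 0.125 and β = 0.25 are the commonly recommended values rather than values given in these notes, and seeding devRTT with half the first sample is likewise an assumed starting point. The update order follows the order in which the formulas are presented above.

```python
class RttEstimator:
    """EWMA estimate of RTT and its deviation, giving a retransmission timeout."""

    def __init__(self, first_sample: float, alpha: float = 0.125, beta: float = 0.25):
        self.alpha = alpha                 # assumed weight for estRTT
        self.beta = beta                   # assumed weight for devRTT
        self.est_rtt = first_sample        # seed the average with the first sample
        self.dev_rtt = first_sample / 2    # assumed initial deviation

    def update(self, sample_rtt: float) -> float:
        self.est_rtt = (1 - self.alpha) * self.est_rtt + self.alpha * sample_rtt
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(self.est_rtt - sample_rtt))
        return self.timeout_interval()

    def timeout_interval(self) -> float:
        return self.est_rtt + 4 * self.dev_rtt


est = RttEstimator(100.0)          # times in milliseconds
for sample in (100.0, 200.0, 100.0):
    print(sample, est.update(sample))
```

A single slow sample moves estRTT only a little but raises devRTT noticeably, so the timeout grows mainly through the 4(devRTT) margin.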

Now that all the mechanisms are in place, let’s look at how TCP assembles them into an efficient sliding window reliable data transfer protocol. The TCP protocol is a hybrid between go-back-n and selective repeat. As with go-back-n, acknowledgements are cumulative. However, many (perhaps all) TCP implementations buffer segments that are received correctly but out-of-order and the TCP protocol takes advantage of this.

• TCP sequence numbers are 32 bits and count individual bytes in the data stream. 2^32 sequence numbers seems like a huge range, but typical RTT values (100’s of msec) combined with gigabit data rates push the limits of sequence number reuse.

• The acknowledgement sequence number specifies the next byte that the receiver expects to receive. As with go-back-n, acknowledgements are cumulative.

• As with selective repeat, when a timeout occurs a TCP sender will retransmit only the segment that triggered the timeout. In addition, it will temporarily double the timeout interval with each consecutive timeout, in the hope that it can avoid timing out on other unacknowledged segments. As soon as a segment is acknowledged without error, the timeout interval reverts to the estimate described above.

• Assume that the receiver is buffering segments that arrive out-of-order while it waits for a lost segment to be retransmitted. When the retransmitted segment arrives, the receiver can send a cumulative acknowledgement that covers the lost segment and all buffered data. If this cumulative acknowledgement arrives at the sender before the next timeout, only the segment in error will be retransmitted.


• As with selective repeat, the arrival of an uncorrupted segment triggers the receiver to send an acknowledgement[10]. When an acknowledgement sequence number does not refer to the most recently transmitted segment, it’s an indication that the specified segment was corrupted or lost.

If a TCP sender receives four successive segments with the same acknowledgement sequence number, it will proactively retransmit the segment instead of waiting for a timeout to occur. This mechanism is called ‘fast retransmit’ and is defined in RFC 5681[11].
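The trigger for fast retransmit is just a counter. In this sketch the class name FastRetransmit and the retransmit callback are assumptions; four successive identical acknowledgement numbers means the original plus three duplicates, so the threshold on the duplicate count is 3. Everything else a real sender does on this event (adjusting the congestion window, for example) is left out.

```python
class FastRetransmit:
    """Counts duplicate ACKs and triggers one retransmission at the threshold."""

    DUP_THRESHOLD = 3  # three duplicates = four identical ACK numbers in a row

    def __init__(self, retransmit):
        self.retransmit = retransmit   # hypothetical callback: resend from this byte
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack_no: int) -> None:
        if ack_no == self.last_ack:
            self.dup_count += 1
            if self.dup_count == self.DUP_THRESHOLD:
                # The receiver keeps asking for ack_no: assume that segment is lost.
                self.retransmit(ack_no)
        else:
            self.last_ack = ack_no
            self.dup_count = 0


resent = []
fr = FastRetransmit(resent.append)
for ack_no in (1000, 2000, 2000, 2000, 2000, 2000, 3000):
    fr.on_ack(ack_no)
print(resent)
```

Testing for equality with the threshold, rather than ‘at least’, keeps further duplicates from triggering repeated retransmissions of the same segment.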

To close the discussion of TCP, let’s consider flow control. The sender must limit the rate at which it transmits data so as not to exceed the rate at which the remote end system can accept data or the rate at which the intervening network can forward data.

Let’s start with the rate at which the remote end system can accept data. When we looked at go-back-n and selective repeat, acknowledgements confirmed that the receiver received the data and they gave the sender permission to advance its window and send more data. In TCP, these two functions are separated.

• The acknowledgement sequence number tells the sender that the receiver has received all bytes in the data stream up to the specified sequence number.

• The window size tells the sender how much additional data it’s allowed to send. The receiver sets the window size based on its available buffer space.

This provides an effective way for the receiver to limit the sender’s data rate, but it’s not without problems.

• When the receiver reduces the window size to zero, the sender cannot send any more data. But (as stated above) it’s the receipt of a segment that triggers the receiver to send an acknowledgement, and the sender is forbidden to send more data!

TCP gets out of this dilemma by specifying that the sender should probe the receiver periodically with a segment containing a single byte.

[10] This is a slight oversimplification; see Table 3.2 in the text.

[11] RFC 5681 deals with TCP’s four congestion avoidance and recovery mechanisms: slow start, congestion avoidance, fast retransmit, and fast recovery.


• A small but non-zero window size (silly window syndrome, mentioned earlier) is also a problem because of the fixed overhead of the link, network, and transport headers.

To avoid this, a TCP receiver should not advertise a small window size. When its buffer space drops below a minimum amount it should simply advertise zero until there is sufficient buffer space to accept a segment of reasonable size.

On the sender’s side, another algorithm (Nagle’s algorithm) is used to decide if single bytes of data should be transmitted immediately or held until more data is supplied by the application. The basic rule is that segments of just a few bytes can be sent as long as there is no unacknowledged data in flight.
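The basic rule of Nagle’s algorithm can be sketched as follows. This is a simplification under stated assumptions: the class name NagleSender, the mss parameter (the size of a full segment), and the send callback are hypothetical, and the receiver’s window and all timers are ignored. A full segment is always sent; a small one is sent only when nothing is unacknowledged.

```python
class NagleSender:
    """Simplified Nagle rule: hold small segments while data is in flight."""

    def __init__(self, mss: int, send):
        self.mss = mss          # assumed full segment size, in bytes
        self.send = send        # hypothetical callback: transmit one segment
        self.pending = b""      # data written by the application, not yet sent
        self.unacked = 0        # bytes sent but not yet acknowledged

    def write(self, data: bytes) -> None:
        self.pending += data
        self._flush()

    def on_ack(self, nbytes: int) -> None:
        self.unacked -= nbytes
        self._flush()

    def _flush(self) -> None:
        while self.pending:
            if len(self.pending) < self.mss and self.unacked > 0:
                break           # small segment and data in flight: keep accumulating
            chunk, self.pending = self.pending[:self.mss], self.pending[self.mss:]
            self.send(chunk)
            self.unacked += len(chunk)


out = []
tx = NagleSender(4, out.append)
tx.write(b"a")     # nothing in flight, so the single byte goes out at once
tx.write(b"b")     # held: "a" is unacknowledged
tx.write(b"c")     # held and merged with "b"
tx.on_ack(1)       # "a" acknowledged, so "bc" goes out as one segment
print(out)
```

The effect is self-clocking: on a slow path keystrokes are batched into fewer segments, while on a fast path each one goes out almost immediately.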

Now, how can the sender also adjust its transmission rate to avoid or alleviate congestion in the network? A TCP sender keeps a second window size limit imposed by congestion, the congestion window size. At any given time, the size of the sender’s transmit window is limited to the minimum of the congestion window and the receiver’s window.

• Suppose that the window size given by the receiver is large enough so that it’s not a limit. The sender can transmit no more than the number of bytes specified by the congestion window before it must wait for an acknowledgement, which cannot arrive in less than the RTT between the local and remote end systems. The effect is to roughly limit the sender’s transmit rate to (congestion window size)/RTT.
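
To put numbers on this, here is a small worked example in Python. The function names and the sample values (a congestion window of ten 1460-byte segments, a 100 ms RTT) are made up for illustration.

```python
def effective_window(cwnd_bytes: int, rwnd_bytes: int) -> int:
    """The transmit window is the smaller of the two limits."""
    return min(cwnd_bytes, rwnd_bytes)


def approx_rate_bps(cwnd_bytes: int, rwnd_bytes: int, rtt_seconds: float) -> float:
    """Rough upper bound on throughput: one window per round trip."""
    return effective_window(cwnd_bytes, rwnd_bytes) * 8 / rtt_seconds


rate = approx_rate_bps(cwnd_bytes=14_600, rwnd_bytes=65_535, rtt_seconds=0.1)
print(f"{rate / 1e6:.3f} Mbit/s")  # 1.168 Mbit/s
```

Here the receiver's window is not the limit, so the rate is set by the congestion window: 14,600 bytes every 0.1 s, or roughly 1.17 Mbit/s no matter how fast the links are.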

Recall that the biggest reason for segment loss is that some router ran out of buffer space and dropped the datagram carrying the segment. An aggressive retransmit strategy will not help.

• Consider the retransmit strategy of go-back-n, which immediately retransmits all unacknowledged segments. This will make the congestion worse. More packets arrive at already congested routers, further increasing the delay and the number of dropped packets, which prompts more retransmissions, in a positive feedback loop.

Add to this that a dropped packet may be successfully forwarded through several routers before it’s finally dropped. This is wasted work and uses capacity that could have been put to a more productive use.

• TCP’s basic strategy (retransmit only a single segment and double the timeout) works well in this scenario. It retransmits the minimum amount of data (the lost or corrupted segment), adding the least possible load to the network as it waits for acknowledgements that may simply be delayed by congestion.
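
The "double the timeout" part of the strategy can be sketched as follows. The function name is hypothetical, and the 60-second ceiling is an assumption added so that the sequence stays bounded; the notes do not specify a cap.

```python
def backoff_timeouts(initial_timeout: float, retries: int,
                     max_timeout: float = 60.0) -> list[float]:
    """Timeout used for each successive retransmission of the same segment."""
    timeouts = []
    timeout = initial_timeout
    for _ in range(retries):
        timeouts.append(timeout)
        timeout = min(timeout * 2, max_timeout)  # double, up to the assumed cap
    return timeouts


print(backoff_timeouts(1.0, 5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Each failed attempt makes the sender wait twice as long before trying again, so a sender facing persistent congestion quickly backs off to a trickle of one segment per (long) timeout.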

How, then, should TCP adjust the congestion window? We can certainly detect loss of a segment, by a timeout or the receipt of too many acknowledgements with the same sequence number, and we can assume that the cause is congestion. How should we use this knowledge to decrease the congestion window? Nor can we forget the other side of the problem: How can we detect when congestion has eased and the congestion window should be increased?

TCP has a three-part approach. Let’s start at the beginning with the deceptively named slow start algorithm.

• Slow start sets the initial congestion window to one segment. (More specifically, the size of one segment.)

• Each time a segment is acknowledged without error, the congestion window is increased by one segment. This results in exponential growth for the congestion window. (When the first acknowledgement returns after one RTT, two segments can be sent out. If each of them is acknowledged without error, the congestion window is four at the end of two RTT times. Four segments are sent out ...)

• There is, of course, a limit, and exponential growth will reach it quickly. The first segment loss triggers the slow start algorithm to remember half the current congestion window as a value called the slow start threshold and to cut the congestion window back to one.

• Exponential growth begins again but changes to linear growth when the window reaches the slow start threshold. (In other words, avoid that last doubling of the window that resulted in segment loss.) This is called congestion avoidance mode. Linear growth means that the congestion window is increased by one segment for each RTT without error.

Even linear growth will eventually reach the limit of the network’s ability to accept segments, or some transient problem will cause the loss of a segment, and a timeout will occur. The reaction is the same: set the slow start threshold to half the current congestion window, cut the congestion window to one, and start again.
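
The interplay of slow start, the threshold, and congestion avoidance is easier to see in a toy simulation. This is a sketch under simplifying assumptions, not real TCP. The window is counted in whole segments and updated once per RTT, losses are injected by round number, and the initial threshold of 64 and the floor of two segments are assumed values. The name `simulate_cwnd` is hypothetical.

```python
def simulate_cwnd(rounds: int, loss_rounds: set[int],
                  initial_ssthresh: int = 64) -> list[int]:
    """Return the congestion window (in segments) at the start of each RTT."""
    cwnd, ssthresh = 1, initial_ssthresh
    history = []
    for rnd in range(rounds):
        history.append(cwnd)
        if rnd in loss_rounds:
            # Timeout: remember half the window, restart from one segment.
            ssthresh = max(cwnd // 2, 2)  # floor of two segments is assumed
            cwnd = 1
        elif cwnd < ssthresh:
            # Slow start: one extra segment per ACK doubles the window per RTT.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: one extra segment per RTT.
            cwnd += 1
    return history


print(simulate_cwnd(10, loss_rounds={4}))
# [1, 2, 4, 8, 16, 1, 2, 4, 8, 9]
```

The loss in round 4 (window of 16) sets the threshold to 8. The window then doubles back up to 8 and grows by one segment per round after that.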

It’s also possible that loss of a segment will be detected before a timeout occurs, by receipt of multiple acknowledgements with the same sequence number (duplicate acknowledgements). This really does indicate a transient situation. Yes, one segment was dropped, but subsequent segments must be getting through in order to trigger the receiver to send the duplicate acknowledgements.

• A less drastic reaction is in order. First, recall that fast retransmit means that TCP will retransmit the lost segment without waiting for a timeout.

The slow start threshold is set to half the current congestion window and the congestion window is cut back to roughly the same value. (RFC 5681 sets it to the new threshold plus three segments, one for each duplicate acknowledgement that triggered the retransmission.) TCP then enters fast recovery mode.

• In fast recovery mode, TCP will increase the congestion window by one segment for each duplicate acknowledgement. This gives it room to keep transmitting new segments while waiting for the acknowledgement that indicates that the retransmitted segment has successfully arrived at the receiver.

Eventually the retransmitted segment will get through and the anticipated acknowledgement will arrive. This will be a cumulative acknowledgement that covers the retransmitted segment plus all the additional segments that have arrived at the receiver. At this point, TCP resets the congestion window to the slow start threshold and returns to congestion avoidance mode.
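
Fast retransmit and fast recovery can be sketched as a small state machine. This is an illustrative sketch that uses the values from RFC 5681: on the third duplicate acknowledgement the threshold becomes half the window and the window becomes the threshold plus three segments. The class and method names are hypothetical, windows are counted in segments, and normal window growth on new acknowledgements is left out to keep the example short.

```python
from dataclasses import dataclass

DUP_ACK_THRESHOLD = 3  # duplicate ACKs that trigger fast retransmit


@dataclass
class Sender:
    cwnd: int                      # congestion window, in segments
    ssthresh: int                  # slow start threshold, in segments
    dup_acks: int = 0
    in_fast_recovery: bool = False

    def on_duplicate_ack(self) -> bool:
        """Return True when the lost segment should be retransmitted now."""
        self.dup_acks += 1
        if self.in_fast_recovery:
            self.cwnd += 1  # each duplicate ACK means a segment left the network
            return False
        if self.dup_acks == DUP_ACK_THRESHOLD:
            self.ssthresh = max(self.cwnd // 2, 2)
            self.cwnd = self.ssthresh + DUP_ACK_THRESHOLD
            self.in_fast_recovery = True
            return True
        return False

    def on_new_ack(self) -> None:
        """An ACK for new data ends fast recovery, if it was in progress."""
        self.dup_acks = 0
        if self.in_fast_recovery:
            self.cwnd = self.ssthresh  # back to congestion avoidance
            self.in_fast_recovery = False


s = Sender(cwnd=16, ssthresh=64)
print([s.on_duplicate_ack() for _ in range(3)], s.cwnd, s.ssthresh)
# [False, False, True] 11 8
s.on_duplicate_ack()
s.on_new_ack()
print(s.cwnd)  # 8
```

Starting from a window of 16, the third duplicate acknowledgement triggers the retransmission and drops the threshold to 8. The window is temporarily inflated while duplicates keep arriving, then settles at 8 once the cumulative acknowledgement comes in. Compare this with a timeout, which would have cut the window all the way back to one.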

The net effect of slow start, congestion avoidance, and fast recovery is that a plot of data transmission rates for a typical TCP connection will look like a sawtooth if the only limit is network congestion.

• More often, arbitrary bandwidth caps imposed at some point in the network will be the limiting factor for TCP’s data transmission rate. For example, the bandwidth on the link between a home and the local ISP is usually capped at a limit set by the service agreement. The more you’re willing to pay, the higher the cap.

The mechanisms just described attempt to control the rate of data transmission without any feedback from the routers along the path between the end systems. More recently, RFC 3168 defines a new mechanism called Explicit Congestion Notification (ECN). It requires cooperation between the network layer protocol (IP) and the transport layer protocol (TCP). To understand how it works, we need to start by saying a bit more about how routers decide to drop datagrams.

• Early routers used a trivial algorithm: If a datagram arrives and the relevant queue is full, drop the datagram. This is known as tail drop. It has a number of bad properties: it’s insensitive to the type of traffic, and it tends to drop one or two datagrams from many data streams, disrupting all of them simultaneously.

• Modern routers are more proactive: They monitor the amount of free space in a queue and use some algorithm to choose datagrams to drop before queue overflow leaves no choice. The most common algorithm is called Random Early Discard (RED), also widely known as Random Early Detection. When the free space in a queue drops below a minimum, each datagram that arrives is at risk of being dropped with probability p. The value of p increases as queue space decreases, reaching 1 when the queue overflows.

Given a proactive algorithm like RED, a router has a choice: It can discard the datagram as before, or it can attempt to notify the end systems for the data flow that congestion is increasing. The hope is that the end systems will cooperate and reduce their transmission rate.
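
A RED-style decision, including the choice between dropping and notifying, might be sketched like this. Several things here are assumptions made for illustration. The sketch works with the current queue length (the mirror image of free space), it ramps p linearly between a minimum threshold and the queue capacity, and the function names are hypothetical. Deployed RED implementations are more elaborate (for example, they typically work from an averaged queue length).

```python
import random


def red_drop_probability(queue_len: int, min_threshold: int, capacity: int) -> float:
    """Probability p that an arriving datagram is chosen, rising to 1 at overflow."""
    if queue_len <= min_threshold:
        return 0.0
    if queue_len >= capacity:
        return 1.0
    return (queue_len - min_threshold) / (capacity - min_threshold)


def red_action(queue_len: int, min_threshold: int, capacity: int,
               ecn_capable: bool, rng=random.random) -> str:
    """Decide what to do with an arriving datagram: enqueue, mark, or drop."""
    if queue_len >= capacity:
        return "drop"  # queue is full, so there is no choice left
    if rng() < red_drop_probability(queue_len, min_threshold, capacity):
        # Chosen by RED: notify ECN-capable flows, drop the rest.
        return "mark" if ecn_capable else "drop"
    return "enqueue"


print(red_drop_probability(60, min_threshold=20, capacity=100))  # 0.5
```

With the queue at 60 of 100 slots and a threshold of 20, an arriving datagram has an even chance of being chosen. If its flow supports ECN it is marked and forwarded; otherwise it is dropped.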

• In the IP header (see footnote 12), two bits of the DS (Differentiated Services) field are used by ECN. These ECN bits are used to define a set of code points:

    00  The datagram is part of a data flow which does not support ECN.
    01  The datagram is part of a data flow which supports ECN.
    10  The datagram is part of a data flow which supports ECN.
    11  The datagram has been marked by a router to indicate congestion.

In the TCP header, two flags are added to the control flags, ECE (ECN-echo) and CWR (congestion window reduced).

• During TCP connection setup, the end system initiating the connection (i.e., the system sending the connection request segment) sets both ECE and CWR to indicate that it’s ECN-capable and desires to use ECN. If the responding end system wishes to use ECN, it sets ECE only in the connection accept segment (see footnote 13).

• Once the end systems have established that they are capable of and willing to participate in ECN, they will set the ECN bits in the IP header to one of the two code point values (01 or 10) that indicate that the datagram is part of an ECN-capable data flow.

• If a router decides that it should drop a datagram to reduce congestion but the output queue isn’t completely full, it may instead decide to set the ECN bits of the IP header to 11 and forward the datagram on towards its destination.

12. The IP protocol will be covered in detail when we talk about the network layer.

13. Not ECE and CWR; this is a defense against a class of broken TCP implementations which copy the received flags field to the reply before modifying the flags. See RFC 3168 for details.

• When an end system receives an IP datagram with the ECN bits set to 11, it will set the ECE bit in the TCP header when it acknowledges the segment carried in that IP datagram.

• When the other end system receives a TCP segment with the ECE bit set, it will respond just as if a segment had been dropped, by reducing its transmission window. It sets the CWR bit in the next segment it sends to the other end point, so that the other end point knows that the ECE indication has been received and acted on.

• This capsule summary doesn’t begin to do justice to the complexity of the ECN mechanism. There are many subtle issues that must be considered in order to make this work well. Please see RFC 3168 for the details.
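
The path of a congestion signal (router marks, receiver echoes, sender reacts) can be traced with a short sketch. The code point names follow RFC 3168 and the ECN bits are taken to be the two low-order bits of the byte that holds the DS field. Everything else is an illustrative assumption: the function names are hypothetical, TCP flags are modelled as a set of strings, and the sender's reaction is modelled as halving the window.

```python
from enum import IntEnum
from typing import Optional


class EcnCodepoint(IntEnum):
    NOT_ECT = 0b00  # flow does not support ECN
    ECT_1 = 0b01    # ECN-capable
    ECT_0 = 0b10    # ECN-capable
    CE = 0b11       # congestion experienced (set by a router)


def ecn_bits(tos_byte: int) -> EcnCodepoint:
    """Extract the two low-order ECN bits from the DS/TOS byte."""
    return EcnCodepoint(tos_byte & 0b11)


def router_mark(tos_byte: int) -> Optional[int]:
    """Mark instead of dropping if the flow is ECN-capable; None means drop."""
    if ecn_bits(tos_byte) is EcnCodepoint.NOT_ECT:
        return None
    return tos_byte | EcnCodepoint.CE


def receiver_ack_flags(tos_byte: int) -> set[str]:
    """Flags on the ACK for a segment that arrived in this datagram."""
    flags = {"ACK"}
    if ecn_bits(tos_byte) is EcnCodepoint.CE:
        flags.add("ECE")  # echo the congestion indication to the sender
    return flags


def sender_on_ack(cwnd: int, ack_flags: set[str]) -> tuple[int, set[str]]:
    """Return the new window and the extra flags for the next outgoing segment."""
    if "ECE" in ack_flags:
        return max(cwnd // 2, 1), {"CWR"}  # react as to a loss, then confirm
    return cwnd, set()


marked = router_mark(0b10)                     # ECN-capable datagram gets marked
print(ecn_bits(marked).name)                   # CE
print(sender_on_ack(10, receiver_ack_flags(marked)))
```

The datagram leaves the router with code point 11, the receiver's acknowledgement carries ECE, and the sender halves its window from 10 to 5 and sets CWR on its next segment, all without a single datagram being dropped.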

References

[1] J. Kurose and K. Ross. Computer Networking: A Top-Down Approach. Addison-Wesley, 6th edition, 2012.
