Transmission Control Protocol (TCP) Data Flow
TCP data flow provides reliable data transfer through the sequencing of outbound data and the acknowledgment of inbound data. Along with reliability, TCP data transfer includes behaviors to prevent inefficient use of the network and provide sender-side and receiver-side flow control.
Basic TCP Data Flow Behavior
The following mechanisms govern TCP data flow, whether for interactive traffic, such as Telnet sessions, or for bulk data transfer, such as the downloading of a large file with the File Transfer Protocol (FTP):
- Acknowledgments: TCP acknowledgments are delayed and cumulative for contiguous data and selective for noncontiguous data.
- Sliding send and receive windows: A send window for the sender and a receive window for the receiver control the amount of data that can be sent. Send and receive windows provide receiver-side flow control. As data is sent and acknowledged, the send and receive windows slide along the sequence space of the sender's byte stream.
- Avoidance of small segments: Small segments (TCP segments that are not at the TCP maximum segment size [MSS]) are allowed, but are governed to avoid inefficient internetwork use.
- Sender-side flow control: TCP sliding windows provide a way for the receiver to determine flow control, but the sender also uses flow control algorithms to avoid sending too much data and congesting the internetwork.
TCP Acknowledgments
A TCP acknowledgment (ACK) is a TCP segment with the ACK flag set. In an ACK, the Acknowledgment Number field indicates the next byte in the contiguous byte stream that the ACK's sender expects to receive. Additionally, if the TCP Selective Acknowledgment (SACK) option is present, the ACK indicates up to four blocks of noncontiguous data received.
Delayed Acknowledgments
When a TCP peer receives a segment, the acknowledgment for the segment (either cumulative or selective) is not sent immediately. The TCP peer delays the sending of the ACK segment for the following reasons:
- If, during the delay, additional TCP segments are received, a single ACK segment can acknowledge the receipt of multiple TCP segments.
- For full-duplex data flow, delaying the ACK makes it possible for the ACK segment to contain data. This is known as piggybacking the data on the ACK, or piggyback ACKs. If the incoming TCP segment contains data that requires a response from the receiver, the response can be sent along with the ACK. This is common for Telnet traffic, in which each keystroke of the Telnet client is sent to the Telnet server process. The received Telnet keystroke must be echoed back to the Telnet client. Rather than sending an ACK for the keystroke received and then sending the echoed keystroke, a single TCP segment containing the ACK and the echoed keystroke is sent.
- TCP has the time to perform general connection maintenance. The Application Layer protocol has additional time to retrieve data from the TCP receive buffer, and an updated window size can be sent with the ACK.
RFC 1122 specifies that the acknowledgment delay should be no longer than 0.5 seconds. By default, TCP/IP for the Microsoft Windows Server 2003 family and Windows XP uses an acknowledgment delay of 200 ms (0.2 seconds), which can be configured per interface by the TcpDelAckTicks registry setting.
TcpDelAckTicks
Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\InterfaceGUID
Data type: REG_DWORD
Valid range: 0-6
Default value: 2
Present by default: No
TcpDelAckTicks sets the delayed acknowledgment timer (in 100-ms intervals) of an interface. Setting TcpDelAckTicks to zero disables delayed acknowledgments. The default value of 2 specifies a 200-ms delayed acknowledgment timer.
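The relationship between the registry value and the timer can be sketched in a few lines of Python. The function name is hypothetical and the code only illustrates the arithmetic described above (100-ms units, valid range 0-6); it does not read or write the registry.

```python
def delayed_ack_timer_ms(tcp_del_ack_ticks=2):
    """Convert a TcpDelAckTicks value (100-ms units) to milliseconds.

    A result of 0 means that delayed acknowledgments are disabled.
    The 0-6 range check mirrors the valid range stated in the text.
    """
    if not 0 <= tcp_del_ack_ticks <= 6:
        raise ValueError("TcpDelAckTicks must be in the range 0-6")
    return tcp_del_ack_ticks * 100


print(delayed_ack_timer_ms())   # default value of 2 -> 200
print(delayed_ack_timer_ms(0))  # 0 -> delayed acknowledgments disabled
```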
Cumulative for Contiguous Data
As originally defined in RFC 793, the TCP acknowledgment scheme is cumulative. The presence of the ACK flag and the value of the Acknowledgment Number field explicitly acknowledge all bytes in the received byte stream from the Initial Sequence Number (ISN) + 1 (the first byte of data sent on the connection), up to but not including the number in the Acknowledgment Number field (Acknowledgment Number – 1). Figure 14-1 illustrates the cumulative acknowledgment scheme of TCP.
Figure 14-1: The cumulative acknowledgment scheme of TCP.
An ACK with a new Acknowledgment Number field is sent when a TCP segment is received containing data that is contiguous with previous data received. TCP segments received that are not contiguous with the previous segments received are not acknowledged. Only when the missing segments are retransmitted and received, creating a contiguous block of one or more TCP segments, is an ACK segment sent with the new Acknowledgment Number field.
Although the original cumulative acknowledgment scheme for TCP works well and provides reliable data transfer, in high-loss environments this relatively simple acknowledgment scheme can slow throughput and use additional network bandwidth.
For example, a TCP peer sends six TCP segments. If the first of the six segments is dropped and the last five segments arrive, no ACK for the five received segments is sent. With normal TCP retransmission behavior, after the retransmission time-out (RTO), the sending TCP peer begins to retransmit all six segments. When the retransmission of the first TCP segment arrives, the receiving TCP peer sends an ACK segment confirming receipt of all six segments. Although the dropped first segment was successfully recovered, TCP needlessly sent duplicates of segments that successfully arrived.
Selective for Noncontiguous Data
With selective acknowledgments, the Acknowledgment Number field still indicates the next byte expected after the last contiguous byte received, but the TCP SACK option can acknowledge noncontiguous received segments. With the TCP SACK option, the left and right edges of the blocks of noncontiguous data received are explicitly acknowledged, preventing needless retransmission. Figure 14-2 illustrates TCP's selective acknowledgment scheme.
Figure 14-2: The selective acknowledgment scheme of TCP.
Using the previous example, if six TCP segments are sent and the first segment is dropped, the receiving TCP peer sends an ACK segment with the following settings: the Acknowledgment Number field is set to the first byte of the missing TCP segment, and the TCP SACK option is set with the left and right edge of the block consisting of the second through the sixth received TCP segments. After receipt of the ACK with the TCP SACK option, the sender marks the selectively acknowledged TCP segments and does not retransmit them. The sending TCP peer retransmits the first TCP segment after its RTO. After receipt, the receiving TCP peer sends an ACK segment with the Acknowledgment Number field set to the first octet past the sixth TCP segment.
Selective acknowledgments are especially important for the recovery of data on a TCP connection with a large window size. The previous example has a window size of six segments. Imagine a high-bandwidth, high-delay link such as a satellite channel with a window size of 200 segments. The sender transmits 200 segments at a time. If cumulative acknowledgments are used and the first segment is dropped, the sender needlessly retransmits many of the successfully received segments before the dropped segment is recovered. Selective acknowledgments eliminate needless retransmissions of successfully received segments.
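The six-segment example can be worked through with a short Python sketch that derives the acknowledgment number and the SACK blocks from the byte ranges a receiver holds. The function name, the 100-byte segment size, and the starting sequence number of 1000 are illustrative assumptions. Ranges are half-open (the right edge is the byte just past the block), and sequence number wraparound is ignored.

```python
def build_ack(next_expected, received_ranges, max_sack_blocks=4):
    """Return (ack_number, sack_blocks) for half-open ranges [start, end).

    next_expected is the next contiguous byte the receiver is waiting for.
    Ranges that touch it advance the cumulative acknowledgment number;
    everything else is reported as a noncontiguous SACK block.
    """
    ack = next_expected
    blocks = []
    for start, end in sorted(received_ranges):
        if start <= ack:
            # Contiguous with the in-order data: advance the cumulative ACK.
            ack = max(ack, end)
        elif blocks and start <= blocks[-1][1]:
            # Adjacent to the previous noncontiguous block: merge them.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    # The text states that an ACK indicates up to four blocks.
    return ack, blocks[:max_sack_blocks]


# Six 100-byte segments starting at 1000; the first one is dropped.
arrived = [(1000 + 100 * i, 1100 + 100 * i) for i in range(1, 6)]
print(build_ack(1000, arrived))                   # (1000, [(1100, 1600)])

# The retransmitted first segment arrives.
print(build_ack(1000, arrived + [(1000, 1100)]))  # (1600, [])
```

With cumulative acknowledgments only, the receiver could report just the first element of each result, which is why the sender in the earlier example learns nothing about segments two through six.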
TCP Sliding Windows
To govern the amount of data that can be sent at any one time and to provide receiver-side flow control, data transfer between TCP peers is performed using a window. The window is the span of data on the byte stream that the receiver permits the sender to send. The sender can send only the bytes of the byte stream that lie within the window. New data can be sent only with the receiver's permission. The window slides along the sender's outbound byte stream and the receiver's inbound byte stream.
The values of the Acknowledgment Number and Window fields in ACKs that the receiver sends determine the actual bytes within the window. The Acknowledgment Number field indicates the next byte of data that the receiver expects to receive. The Window field indicates the amount of space left in a receive buffer to store incoming TCP data on this connection. The span of data within the window is from Acknowledgment Number through the value of Acknowledgment Number + Window – 1.
For a given logical pipe—one direction of the full-duplex TCP connection—the sender maintains a send window and the receiver maintains a receive window. When there are no data or ACK segments in transit, a logical pipe's send window and receive window are matched. In other words, the span of data that the sender is permitted to send is matched to the span of data that the receiver is able to receive.
Send Window
To maintain the send window, the sender must account for the bytes in the outbound byte stream that have been
- Sent and acknowledged (Sent/ACKed)
- Sent but not acknowledged (Sent/UnACKed)
- Unsent but fit within the current send window (Unsent/Inside)
- Unsent but lie beyond the current send window (Unsent/Outside)
Figure 14-3 illustrates the types of data that exist for the send window.
Figure 14-3: The types of data for the TCP send window.
The span of data that lies within the send window is the Sent/UnACKed and Unsent/Inside data.
Sent/ACKed Data
Sent/ACKed data is data that has been sent and acknowledged as received. The first byte of Sent/ACKed data is the value of ISN + 1. Recall from the TCP connection establishment process that the TCP peer chooses an ISN that is explicitly acknowledged as if it were a data byte. Therefore, the first byte of user data sent on the connection is ISN + 1. Recall also that the acknowledgment number is the next byte of data the receiver expects to receive, explicitly acknowledging all bytes received up to but not including the acknowledgment number. Therefore, the last byte of ACKed data is the value of Acknowledgment Number – 1.
Sent/UnACKed Data
Sent/UnACKed data is data that has been sent but for which no acknowledgments have been received. Either the data is in transit, the data was dropped from the internetwork, the data has arrived at the receiver but no ACK has been sent (because of delayed acknowledgments), or the ACK for the data is in transit.
To distinguish Sent/UnACKed data from Unsent/Inside data, TCP maintains a variable known as SND.NXT, which is the value of the next byte to be sent. The value of SND.NXT becomes the value of the Sequence Number field for the next TCP segment sent.
The first byte of Sent/UnACKed data is the Acknowledgment Number field's value of the last ACK segment received from the receiver. The last byte of Sent/UnACKed data is the value of SND.NXT – 1.
Unsent/Inside Data
Unsent/Inside data is data that has not yet been sent but is within the current send window. Unsent/Inside data can be sent because the receiver has permitted it. It is natural to assume that if the data has been permitted, the sender will send all data within the send window before waiting for an acknowledgment and an updated window size from the receiver. In other words, there is no Unsent/Inside data when waiting for an acknowledgment.
However, as discussed later in this chapter, when starting the initial data flow and when encountering congestion, the sender-side flow control mechanisms of slow start and congestion avoidance prevent the sender from sending all the data that falls within the send window. In such cases, these mechanisms govern the amount of data sent before waiting for an acknowledgment.
The first byte of Unsent/Inside data is the value of the SND.NXT variable. The last byte of Unsent/Inside data is the last byte of data within the send window, the value of Acknowledgment Number + Window – 1.
Unsent/Outside Data
Unsent/Outside data is data that is unsent and outside the current send window, representing future data to be sent. Unsent/Outside data relative to the current send window should never be sent because it falls outside the receive window. The receiver's receive window is a direct reflection of buffer space remaining to store incoming data. The receiver discards data that cannot be stored in the receive buffer for the connection, and sends an ACK segment with the current acknowledgment number. The first byte of Unsent/Outside data is the value of Acknowledgment Number + Window.
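The four categories reduce to comparisons against three boundaries: the last acknowledgment number received, SND.NXT, and Acknowledgment Number + Window. The following Python sketch classifies a single byte of the outbound byte stream. The function name and the sample values are assumptions chosen for illustration, and sequence number wraparound is ignored.

```python
def classify_send_byte(seq, isn, ack_number, snd_nxt, window):
    """Classify one byte of the sender's outbound byte stream.

    ack_number and window come from the last ACK received;
    snd_nxt is the sender's SND.NXT variable.
    """
    if seq < isn + 1:
        raise ValueError("sequence number precedes the first data byte")
    right_edge = ack_number + window  # first byte of Unsent/Outside data
    if seq < ack_number:
        return "Sent/ACKed"
    if seq < snd_nxt:
        return "Sent/UnACKed"
    if seq < right_edge:
        return "Unsent/Inside"
    return "Unsent/Outside"


# Assumed state: ISN 1000, last ACK 3001 with Window 4000, SND.NXT 5001.
for seq in (1001, 3000, 3001, 5000, 5001, 7000, 7001):
    print(seq, classify_send_byte(seq, 1000, 3001, 5001, 4000))
```

In this state the send window spans bytes 3001 through 7000, which matches Acknowledgment Number through Acknowledgment Number + Window – 1.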
Sliding the Send Window
The send window has a left edge (defined by the boundary between Sent/ACKed and Sent/UnACKed data) and a right edge (defined by the boundary between Unsent/Inside and Unsent/Outside data). When an ACK is received with a higher acknowledgment number, the send window closes and the left edge advances to the right. When an ACK is received in which the value of Acknowledgment Number + Window is greater than the previous value of Acknowledgment Number + Window, the send window opens and the right edge advances to the right. The sum of the Acknowledgment Number + Window fields in an ACK is the acknowledgment number of the ACK for the last TCP segment that fits within the current send window. Figure 14-4 illustrates the sliding of the send window.
Figure 14-4: The sliding of the send window showing window closing and opening.
It is possible for the send window to close but not open—for the left edge of the send window to advance while the right edge does not. For example, the sender receives an ACK with an increased acknowledgment number but a decreased window, such that the sum of Acknowledgment Number + Window does not change. This can happen when the receiver receives the data, which is acknowledged, but the received data has not been passed to the Application Layer protocol on the receiver. Therefore, the value of the Acknowledgment Number field in the ACK increases because of the contiguous data arriving, but the window decreases by the same amount, keeping the value of Acknowledgment Number + Window the same.
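Closing and opening can be expressed as the movement of the two edges between consecutive ACKs. The Python sketch below is a minimal illustration with a hypothetical function name; it ignores sequence number wraparound.

```python
def send_window_change(old_ack, old_window, new_ack, new_window):
    """Return (closed, opened): how far each edge of the send window moved.

    closed is the advance of the left edge (newly acknowledged bytes).
    opened is the advance of the right edge, which is the change in
    Acknowledgment Number + Window; a negative value would mean that
    the receiver shrank the window.
    """
    closed = new_ack - old_ack
    opened = (new_ack + new_window) - (old_ack + old_window)
    return closed, opened


# Window closes and opens by the same amount: it slides.
print(send_window_change(3001, 4000, 4001, 4000))  # (1000, 1000)

# Window closes but does not open: the data was acknowledged but not
# yet passed to the Application Layer protocol.
print(send_window_change(3001, 4000, 4001, 3000))  # (1000, 0)
```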
Zero Send Window
When the receiver advertises a window size of zero, the left and right edges of the send window are at the same boundary—the boundary between Sent/ACKed data and Unsent/Outside data. A zero window size can occur when the receiver's receive buffer fills with acknowledged data but the data has not yet been retrieved by the Application Layer protocol. This can happen when TCP has not yet indicated the data to the Application Layer protocol or when the Application Layer protocol has not explicitly informed TCP that it is ready to receive the next block of data from the TCP receive buffer.
With a zero send window, no new data can be sent until an ACK with a nonzero window size is received. However, because no new data is sent, the receiver is not sending any new ACKs. This can produce a deadlock situation in which the sender waits to receive a new window size and the receiver does not send a new window size because there are no new ACKs to send. Consequently, receiver and sender behaviors are defined to prevent the deadlock.
When the data in the TCP receive buffer is passed to the Application Layer protocol, the receiver sends an ACK segment with the current acknowledgment number and new nonzero window size. However, this segment is an ACK containing no data. ACK segments without data are not sent reliably; the peer that receives them does not acknowledge them, and the peer that sent them does not retransmit them when no acknowledgment of their receipt arrives. Therefore, if the ACK that the receiver sends to update the window size is lost, the sender has no notification that new data can be sent. The TCP connection would be indefinitely deadlocked; the receiver has informed the sender that new data can be sent, but the sender still considers the window size to be zero.
To prevent the deadlock of the dropped ACK that the receiver sent, the sender periodically sends a TCP segment containing 1 byte of new data for the connection. Because the data byte is Unsent/Outside data, the receiver discards the data and sends an ACK with the current acknowledgment number and window size. This sender-side mechanism is known as probing the window. The first window probe is sent after the current RTO, and the interval for successive probes is determined by doubling the timeout for the previous probe.
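The probe timing described above (first probe after the current RTO, then a doubled interval for each successive probe) can be sketched as follows. The function name is hypothetical, and the sketch applies no upper bound to the interval, which an actual TCP implementation might well impose.

```python
def window_probe_times(rto, probes):
    """Return the times at which window probes are sent.

    Times are measured from the moment the zero window was learned,
    in the same unit as rto. The first probe is sent after one RTO
    and each later interval is double the previous one.
    """
    times = []
    elapsed = 0
    interval = rto
    for _ in range(probes):
        elapsed += interval
        times.append(elapsed)
        interval *= 2
    return times


# With an assumed RTO of 3 seconds, the first four probes:
print(window_probe_times(3, 4))  # [3, 9, 21, 45]
```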
Receive Window
To maintain the receive window, the receiver must account for the bytes in the inbound byte stream that have been
- Received, acknowledged, and retrieved by the Application Layer protocol (Rcvd/ACKed/Retr)
- Received, acknowledged, and not retrieved by the Application Layer protocol (Rcvd/ACKed/NotRetr)
- Received, but not acknowledged (Rcvd/UnACKed)
- Not received, but inside the current receive window (NotRcvd/Inside)
- Not received, but outside the current receive window (NotRcvd/Outside)
Figure 14-5 illustrates the types of data that exist for the receive window.
Figure 14-5: The types of data for the TCP receive window.
The span of data that lies within the maximum receive window is Rcvd/ACKed/NotRetr, Rcvd/UnACKed, and NotRcvd/Inside. The span of data that lies within the current receive window is Rcvd/UnACKed and NotRcvd/Inside.
Notice the difference between the maximum receive window and the current receive window. The maximum receive window is a fixed size and corresponds to a receive buffer used to store inbound TCP segments. The current receive window is of variable size and is the amount of space that is left in the receive buffer to store inbound TCP segments. The current receive window's size is the value of the Window field advertised in ACKs sent back to the sender, and is the difference between the maximum receive window size and the amount of data that has been received and acknowledged but not passed to the Application Layer protocol.
Rcvd/ACKed/Retr Data
Rcvd/ACKed/Retr data is data that has been received, acknowledged, and retrieved by the Application Layer protocol. The first byte of Rcvd/ACKed/Retr data is the value of ISN + 1. To track the next byte to be passed to the Application Layer protocol, TCP maintains a variable called RCV.USER. Therefore, the last byte of Rcvd/ACKed/Retr data is the value of RCV.USER – 1.
Rcvd/ACKed/NotRetr Data
Rcvd/ACKed/NotRetr data is data that has been received and acknowledged but has not been passed up to the Application Layer protocol. This category of data is the difference between the fixed-size maximum receive window and the variable-size current receive window. The first byte of Rcvd/ACKed/NotRetr data is the value of RCV.USER. The last byte of Rcvd/ACKed/NotRetr data is the value of Acknowledgment Number – 1.
Rcvd/UnACKed Data
Rcvd/UnACKed data is data that has been received but not acknowledged. To keep track of the next contiguous byte to be received, TCP maintains a variable called RCV.NEXT. When an ACK segment is sent, the ACK segment's Acknowledgment Number field is set to the value of RCV.NEXT. The first byte of Rcvd/UnACKed data is the current acknowledgment number. The last byte of Rcvd/UnACKed data is the value of RCV.NEXT – 1.
If there are no TCP segments in transit and the receiver has not yet sent the ACK for TCP segments received, the send window's Sent/UnACKed data is the same as the receive window's Rcvd/UnACKed data. In this situation, the value of RCV.NEXT kept by the receiver is equal to the value of SND.NXT kept by the sender.
NotRcvd/Inside Data
NotRcvd/Inside data is data that can be received and will fit within the current receive window. The first byte of NotRcvd/Inside data is the value of RCV.NEXT. The last byte of NotRcvd/Inside data within the receive window is the value of Acknowledgment Number + Window – 1.
NotRcvd/Outside Data
NotRcvd/Outside data is data that has not been received and is outside the current receive window, representing future data to be received. NotRcvd/Outside data relative to the current receive window should never be received because it falls outside the current receive window. The receiver discards data that cannot be stored in the current receive window and sends an ACK with the current acknowledgment number. The first byte of NotRcvd/Outside data is the value of Acknowledgment Number + Window.
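The five receive-side categories, and the relationship between the maximum and current receive windows, can be illustrated with a Python sketch similar to the one a sender might use. The function names and sample values are assumptions, the maximum receive window is taken to begin at RCV.USER, and sequence number wraparound is ignored.

```python
def current_receive_window(max_window, rcv_user, ack_number):
    """Maximum window minus the data acknowledged but not yet retrieved."""
    return max_window - (ack_number - rcv_user)


def classify_receive_byte(seq, isn, rcv_user, ack_number, rcv_next, max_window):
    """Classify one byte of the receiver's inbound byte stream.

    ack_number is the acknowledgment number of the last ACK sent;
    rcv_user and rcv_next are the RCV.USER and RCV.NEXT variables.
    """
    if seq < isn + 1:
        raise ValueError("sequence number precedes the first data byte")
    window = current_receive_window(max_window, rcv_user, ack_number)
    right_edge = ack_number + window  # first byte of NotRcvd/Outside data
    if seq < rcv_user:
        return "Rcvd/ACKed/Retr"
    if seq < ack_number:
        return "Rcvd/ACKed/NotRetr"
    if seq < rcv_next:
        return "Rcvd/UnACKed"
    if seq < right_edge:
        return "NotRcvd/Inside"
    return "NotRcvd/Outside"


# Assumed state: ISN 1000, RCV.USER 2001, last ACK sent 3001,
# RCV.NEXT 4001, maximum receive window 16,384 bytes.
print(current_receive_window(16384, 2001, 3001))  # 15384
for seq in (1001, 2001, 3001, 4001, 18384, 18385):
    print(seq, classify_receive_byte(seq, 1000, 2001, 3001, 4001, 16384))
```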
Sliding the Receive Window
The current receive window has a left edge (defined by the boundary between Rcvd/ACKed/NotRetr and Rcvd/UnACKed data) and a right edge (defined by the boundary between NotRcvd/Inside and NotRcvd/Outside data). When an ACK segment is sent with an acknowledgment number set to RCV.NEXT, the current receive window closes and the left edge advances to the right. When the Rcvd/ACKed/NotRetr data is passed up to the Application Layer protocol, the maximum receive window opens and the right edge advances to the right. When this occurs, space is made available in the fixed-size receive buffer and new data can be received. The maximum receive window slides to the right by the number of bytes passed to the Application Layer protocol. When the maximum receive window slides as a result of data being passed to the Application Layer protocol, the current receive window slides also, as the right edge of the maximum receive window and the current receive window are the same. The next ACK that the receiver sends contains an updated window size. The increase in the sum of the acknowledgment number and the window size indicates to the sender that more data can be sent.
Figure 14-6 illustrates the sliding of the receive window.
Figure 14-6: Sliding the receive window showing window closing, opening, and shrinking.
If the Application Layer protocol does not receive the data in a timely fashion, the receive window closes instead of sliding. This is indicated to the sender by increasing the acknowledgment number for new data received and decreasing the value of the Window field by the same amount, thereby keeping the value of Acknowledgment Number + Window the same. In an extreme situation, the maximum receive window is filled with Rcvd/ACKed/NotRetr data and the left and right edges are the same (a zero receive window).
Shrinking the Window
Shrinking the window is the movement of the right edge of the receive window to the left. To shrink the receive window, an ACK segment is sent where the value of Acknowledgment Number + Window decreases. Normally, the value of Acknowledgment Number + Window either increases or remains the same. RFC 1122 discourages shrinking the window. However, a sending TCP peer must be prepared to adjust its send window accordingly. The receiver discards any data sent that is suddenly outside the shrunken receive window.
TCP/IP for the Windows Server 2003 Family and Windows XP Maximum Receive Window Size
The TCP/IP for the Windows Server 2003 family and Windows XP maximum receive window size is set to 16,384 bytes by default. The default maximum receive window size and the MSS of the connection negotiated during the TCP connection establishment process determine the maximum receive window size. For maximum efficiency in bulk data transfers, the maximum receive window size is adjusted to be an integral multiple of the MSS for the connection.
The maximum receive window size is calculated using the following algorithm (based on the default maximum window size of 16,384 bytes):
- Assume a maximum receive window size of 16,384 bytes (16 KB). In the synchronize (SYN) segment sent to establish a TCP connection, include the TCP MSS option and set the window size to 16,384.
- When the SYN-ACK segment is returned, examine the TCP MSS option to determine the MSS for the connection (the minimum MSS of the two TCP peers).
- Based on the connection's MSS, divide 16,384 by the connection's MSS and round up to the next integer value. The candidate window size is that integer multiplied by the connection's MSS.
- If the candidate window size is not at least four times the connection's MSS, set the window size to four times the MSS (up to a maximum of 65,535). Window scaling must be in effect to use window sizes greater than 65,535.
Ethernet Example
For the maximum receive window size for an Ethernet-based TCP connection, the algorithm produces the following:
- Assume a maximum receive window size of 16,384 bytes (16 KB). In the SYN segment sent to establish a TCP connection, set the window size to 16,384.
- When the SYN-ACK segment is returned, examine the TCP MSS option to determine the connection's MSS (the minimum of the MSS of the two TCP peers). The MSS for two Ethernet-based TCP peers is 1460.
- Based on the connection's MSS, divide 16,384 by the connection's MSS and round up to the next integer value. Therefore, 16,384/1460 = 11.22, which, rounded up to the next integer value, is 12. Therefore, the maximum receive window size for two Ethernet TCP peers is 17,520 (12*1460).
The default of 17,520 for Ethernet assumes that additional TCP options, such as SACK and TCP timestamps, are not being used. If used, the maximum receive window size is adjusted accordingly.
Token Ring (4-Mbps Ring with an IP Maximum Transmission Unit of 4168)
For the maximum receive window size for a 4-Mbps Token Ring–based TCP connection, the algorithm produces the following:
- Assume a maximum receive window size of 16,384 bytes (16 KB). In the SYN segment sent to establish a TCP connection, set the window size to 16,384.
- When the SYN-ACK segment is returned, examine the TCP MSS option to determine the connection's MSS (the minimum of the MSS of the two TCP peers). The MSS for two 4-Mbps Token Ring TCP peers is 4128.
- Based on the connection's MSS, divide 16,384 by the connection's MSS and round up to the next integer value. Therefore, 16,384/4128 = 3.97, which, rounded up to the next integer value, is 4. Therefore, the maximum receive window size for two 4-Mbps Token Ring TCP peers is 16,512 (4*4128).
Token Ring (16-Mbps Ring with an IP MTU of 17,928)
For the maximum receive window size for a 16-Mbps Token Ring–based TCP connection, the algorithm produces the following:
- Assume a maximum receive window size of 16,384 bytes (16 KB). In the SYN segment sent to establish a TCP connection, set the window size to 16,384.
- When the SYN-ACK segment is returned, examine the TCP MSS option to determine the connection's MSS (the minimum of the MSS of the two TCP peers). The MSS for two 16-Mbps Token Ring TCP peers is 17,888.
- Based on the connection's MSS, divide 16,384 by the connection's MSS and round up to the next integer value. Therefore, 16,384/17,888 = 0.9, which, rounded up to the next integer value, is 1.
- The result of rounding up (1) is less than four, so the candidate window size is not at least four times the connection's MSS. Therefore, the window size is set to four times the MSS, or 71,552 (17,888*4). However, without window scaling, this window size cannot be accommodated. Therefore, the maximum window size is set to a single MSS, or 17,888.
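The algorithm and its three worked examples can be checked with a short Python sketch. The function and constant names are hypothetical. The fallback to a single MSS when four times the MSS exceeds 65,535 without window scaling is taken from the 16-Mbps Token Ring example rather than from the general algorithm, so treat that branch as an assumption.

```python
DEFAULT_WINDOW = 16384       # default maximum receive window, in bytes
MAX_UNSCALED_WINDOW = 65535  # largest window without window scaling


def max_receive_window(mss, window_scaling=False):
    """Compute the maximum receive window for a connection's MSS."""
    # Divide 16,384 by the MSS and round up to the next integer.
    segments = (DEFAULT_WINDOW + mss - 1) // mss
    if segments >= 4:
        window = segments * mss
    else:
        # Not at least four times the MSS: use four times the MSS.
        window = 4 * mss
    if window > MAX_UNSCALED_WINDOW and not window_scaling:
        # Assumption based on the 16-Mbps Token Ring example:
        # fall back to a single MSS.
        window = mss
    return window


print(max_receive_window(1460))   # Ethernet: 17520
print(max_receive_window(4128))   # 4-Mbps Token Ring: 16512
print(max_receive_window(17888))  # 16-Mbps Token Ring: 17888
```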
Changing the Default Maximum Receive Window Size
The default maximum receive window size can be set through the setsockopt() Windows Sockets function on a per socket basis or through the following registry settings:
GlobalMaxTcpWindowSize
Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Data type: REG_DWORD
Valid range: 0–0x3FFFFFFF
Present by default: No
GlobalMaxTcpWindowSize sets the number of bytes in the default maximum receive window for all interfaces (unless overridden per interface using the TcpWindowSize registry setting). Values greater than 65,535 can be used only in conjunction with enabling window scaling with the Tcp1323Opts registry setting and with TCP peers that support window scaling. The maximum value, 0x3FFFFFFF, or 1,073,741,823, reflects the largest window size possible using window scaling.
TcpWindowSize
Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters and HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\InterfaceGUID
Data type: REG_DWORD
Valid range: 0–0x3FFFFFFF
Present by default: No
TcpWindowSize (in Tcpip\Parameters) sets the number of bytes in the default maximum receive window for all interfaces unless overridden by the interface-based TcpWindowSize registry setting (in Tcpip\Parameters\Interfaces\InterfaceGUID). The default value is the smallest of the following values:
- 0xFFFF (65,535)
- GlobalMaxTcpWindowSize
- Four times the connection MSS
- 16,384 rounded up to an even multiple of the connection MSS
Values greater than 65,535 can be used only in conjunction with enabling window scaling (using the Tcp1323Opts registry setting), and with TCP peers that support window scaling. The maximum value, 0x3FFFFFFF, or 1,073,741,823, reflects the largest window size possible using window scaling. TCP also adjusts the window size based on the bit rate of the sending interface in the following way: below 1 Mbps, the window size is set to 8 KB; from 1 Mbps to 100 Mbps, the window size is set to 17 KB; and for 100 Mbps or higher, the window size is set to 64 KB.
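As an alternative to the registry settings, this section notes that the window size can be set per socket with setsockopt(). The text does not name the socket option; the Python sketch below assumes it is SO_RCVBUF, the standard receive buffer size option, and uses a hypothetical helper name. The operating system may adjust the requested value, so the helper returns what the stack reports.

```python
import socket


def request_receive_buffer(sock, size_bytes):
    """Ask for a receive buffer size on one socket and return what
    the stack reports afterward.

    Assumption: SO_RCVBUF is the option behind the per-socket
    setsockopt() method mentioned in the text.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Request 12 Ethernet-sized segments (12 * 1460 bytes).
    print(request_receive_buffer(sock, 17520))
finally:
    sock.close()
```

Setting the option before the connection is established is the safer choice, because the window size is advertised during connection establishment.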
Small Segments
A small segment is a TCP segment that is smaller than the MSS. To increase the efficiency of sending data, TCP avoids sending and receiving small segments by using the Nagle algorithm and by avoiding silly window syndrome.
The Nagle Algorithm
For interactive data, such as the data of a Telnet or Rlogin session, much of the traffic is made up of individual keystrokes sent by the client and echoed by the server. For each keystroke, a single byte of data is sent. This is a network efficiency of 2.5 percent (the number of bytes of data [1 byte] divided by the number of bytes of overhead needed to send the data [40 bytes]). For interactive sessions, such as Telnet, each typed character must be sent and echoed back to the Telnet client application to be displayed on the user's screen. Therefore, sending small segments cannot be avoided for interactive sessions. Preventing the sending of a small segment would mean that the user would not see the keystroke as entered on the keyboard.
In the case of Telnet and Rlogin, a single keystroke echoed back to the user generates the following three TCP segments:
- The client application sends the keystroke byte as a small TCP segment with the Push (PSH) flag set.
- The keystroke TCP segment is passed to the server application, which sends an echo of the keystroke back to the client application (along with an ACK of the keystroke byte) as a small TCP segment with the PSH flag set.
- The echoed keystroke TCP segment is passed to the client application, which sends an ACK of the echoed keystroke segment.
Typical interactive sessions consist of multiple keystrokes in rapid succession.
To minimize the sending of small TCP segments, TCP is required to use the Nagle algorithm, named after John Nagle, the author of RFC 896, which describes the algorithm. The Nagle algorithm's premise is that a TCP connection can send only a single unacknowledged small segment. If a small segment is sent and not acknowledged, no other small segments can be sent.
More Info: The Nagle algorithm is described in RFC 896, which can be found in the Rfc folder on the companion CD-ROM.
In the case of interactive session traffic, such as Telnet and Rlogin, a keystroke segment is sent. Additional keystrokes entered by the user are accumulated in the TCP send buffer until the ACK for the outstanding small segment arrives. The next segment sent could contain multiple keystrokes. Depending on the average time to receive acknowledgments and the user's typing speed, this simple rule can decrease the number of TCP segments sent in the session by a factor of three or more.
The Nagle algorithm adapts itself to the environment in which the TCP segments are being sent. In a high-bandwidth, low-delay environment, such as a local area network (LAN), ACKs return more quickly and less accumulation occurs. However, in such an environment, lower efficiency can be tolerated because of the higher capacity of the LAN. In a low-bandwidth, high-delay environment, such as a wide area network (WAN), ACKs return less quickly, producing more accumulation. This results in more efficient data transfer for environments with less capacity.
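The accumulation effect can be seen in a small simulation. The Python sketch below is a simplified model, not an implementation of TCP: each keystroke is one byte, every segment stays smaller than the MSS, and the ACK for a segment arrives a fixed time after the segment is sent (delayed acknowledgments and piggybacking are ignored). The function name and the timings are illustrative assumptions.

```python
def nagle_segments(keystroke_times, ack_delay):
    """Simulate the single-outstanding-small-segment rule.

    keystroke_times are the times (in ms) at which 1-byte keystrokes
    are typed. Returns a list of (send_time, byte_count) segments.
    """
    segments = []
    pending = 0     # keystrokes accumulated in the send buffer
    ack_due = None  # arrival time of the ACK for the outstanding segment
    for t in sorted(keystroke_times):
        # Process every ACK that arrives before this keystroke.
        while ack_due is not None and ack_due <= t:
            if pending:
                # The ACK releases the accumulated keystrokes as one segment.
                segments.append((ack_due, pending))
                pending = 0
                ack_due += ack_delay
            else:
                ack_due = None
        if ack_due is None:
            # Nothing outstanding: send the keystroke immediately.
            segments.append((t, 1))
            ack_due = t + ack_delay
        else:
            # A small segment is unacknowledged: accumulate.
            pending += 1
    if pending:
        segments.append((ack_due, pending))
    return segments


typing = [0, 50, 100, 150, 200, 250]  # one keystroke every 50 ms

# LAN-like case: ACKs return in 10 ms, so nothing accumulates.
print(len(nagle_segments(typing, 10)))  # 6

# WAN-like case: ACKs return in 120 ms, so keystrokes accumulate.
print(nagle_segments(typing, 120))  # [(0, 1), (120, 2), (240, 2), (360, 1)]
```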
TCP/IP for the Windows Server 2003 family and Windows XP uses the Nagle algorithm by default. The Nagle algorithm can be disabled through the TCP_NODELAY Windows Sockets option. Developers should disable the Nagle algorithm only when the immediate sending of multiple small segments is required. To improve performance of file locking and manipulation, a computer running a member of the Windows Server 2003 family or Windows XP disables the Nagle algorithm for NetBIOS over TCP/IP (NetBT) and non-NetBIOS–based redirector and server communication.
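As a hedged illustration of setting this option from application code, the following sketch uses Python's standard `socket` module, which exposes the same `TCP_NODELAY` option at the `IPPROTO_TCP` level. The helper name `make_nodelay_socket` is hypothetical and introduced only for this example.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A nonzero value for TCP_NODELAY asks the stack to send small segments immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Reading the option back returns a nonzero value when it is set.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```

The option is set per socket, so it affects only the connection that needs low-latency small writes and leaves the default behavior in place elsewhere.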
Silly Window Syndrome
Whenever data is passed to the receiver's Application Layer protocol, the receive window opens and a new window size is advertised. Depending on how much data is retrieved from the receive buffer, this mechanism can cause the following behavior:
- The sender and receiver are in a zero window state. The sender has sent all the data it can. The receiver has acknowledged all the data in the receive buffer and is waiting for the Application Layer protocol to retrieve the data before it is free to advertise a nonzero window size.
- The Application Layer protocol retrieves a single byte of data from the receive buffer. The receive window advances by one byte.
- The receiver sends an ACK with the window size set to 1.
- The sender, realizing that the value of Acknowledgment Number + Window has increased, advances its send window by one byte. Because the receiver has permitted the sending of a single byte, the sender sends a single byte.
Each time the Application Layer protocol fetches a single byte of data from the buffer, the sender sends a single-byte TCP segment. The data sent on the TCP connection consists of a steady pattern of small segments. This behavior is known as the silly window syndrome (SWS). Both the sender and the receiver avoid SWS.
Receiver-Side SWS Avoidance
The receiver avoids SWS by not advertising a new window size unless it is at least either an MSS or half of the maximum receive window size. Figure 14-7 illustrates receiver-side SWS avoidance.
Figure 14-7: SWS avoidance as implemented by the receiver.
As data is passed to the application, the receive window advances. If the receive window advances n bytes, receiver-side SWS avoidance dictates that a new window size cannot be advertised unless n is at least MSS bytes or half the maximum receive window.
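The receiver-side rule can be expressed as a single predicate. The sketch below is an illustration only; the function name `can_advertise_new_window` and the default values (a 1,460-byte MSS and a 17,520-byte maximum receive window) are assumptions chosen for the example.

```python
MSS = 1460                   # assumed Ethernet MSS
MAX_RECEIVE_WINDOW = 17520   # assumed maximum receive window, in bytes


def can_advertise_new_window(bytes_advanced: int,
                             mss: int = MSS,
                             max_window: int = MAX_RECEIVE_WINDOW) -> bool:
    """Return True if the receive window has advanced enough to advertise a new size.

    The window must have advanced by at least one MSS or by at least half of
    the maximum receive window, whichever is reached first.
    """
    return bytes_advanced >= mss or bytes_advanced >= max_window // 2


print(can_advertise_new_window(1))      # a single byte retrieved by the application
print(can_advertise_new_window(1460))   # a full MSS retrieved
```

With these defaults, retrieving a single byte does not trigger a new advertisement, while retrieving a full MSS does.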
Sender-Side SWS Avoidance
The sender avoids SWS by not sending a TCP segment containing data unless the advertised receive window size is at least MSS bytes. However, as previously discussed, small segments must be allowed for interactive data. Therefore, small segments are allowed if either of the following is true:
- The data is being pushed and adheres to the Nagle algorithm. Interactive data typically sets the TCP header's PSH flag. A single small segment can be sent according to the Nagle algorithm.
- The data is at least half the size of the maximum receive window and adheres to the Nagle algorithm.
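The sender-side conditions above can be combined into one decision function. This is a simplified sketch, not the actual stack logic: the function name `may_send`, its parameters, and the guard that data must fit within the advertised window are assumptions added for the example.

```python
def may_send(data_len: int,
             advertised_window: int,
             pushed: bool,
             small_segment_unacked: bool,
             mss: int = 1460,
             max_receive_window: int = 17520) -> bool:
    """Decide whether the sender may transmit a segment of data_len bytes."""
    # Assumption for this sketch: nothing is sent if there is no data or the
    # data does not fit within the advertised receive window.
    if data_len == 0 or data_len > advertised_window:
        return False
    # A full-sized segment can be sent when the advertised window is at least one MSS.
    if data_len >= mss and advertised_window >= mss:
        return True
    # Small segments must adhere to the Nagle rule: no other small segment outstanding.
    nagle_ok = not small_segment_unacked
    if pushed and nagle_ok:
        return True
    if data_len >= max_receive_window / 2 and nagle_ok:
        return True
    return False


print(may_send(1, 17520, pushed=True, small_segment_unacked=False))   # interactive keystroke
print(may_send(1, 17520, pushed=True, small_segment_unacked=True))    # held back by Nagle
```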
Sender-Side Flow Control
Receiver-side flow control is implemented through the send and receive windows. The receiver can inform the sender to stop sending data by reducing the advertised receive window to zero. However, once a nonzero receive window size is advertised, there is nothing in the TCP sliding window mechanism that prevents the sender from sending all possible segments in the send window.
For example, during the TCP connection process, the maximum receive window size is determined. If the TCP peers are Ethernet-based, the maximum receive window and the advertised receive window at the end of the TCP connection establishment process for two Windows Server 2003 family– or Windows XP–based hosts are both 17,520 bytes, or 12 MSS-sized TCP segments (assuming no TCP options are present). According to the TCP sliding window mechanism, the sender can immediately send all 12 segments that fit within the receive window without waiting for any ACKs. Although this behavior is permitted, it can also lead to network congestion, especially when sending TCP segments across multiple routers.
To prevent the flooding of segments that fit within the advertised receive window, TCP implementations, including TCP/IP for the Windows Server 2003 family and Windows XP, use the following algorithms described in RFC 2001:
- The slow start algorithm: Increases the actual send window (the number of segments within the send window that the sender can send before waiting for an acknowledgment) for each ACK segment received that acknowledges new data.
- The congestion avoidance algorithm: Increases the actual send window for each round-trip time.
Although the slow start and congestion avoidance algorithms were developed to solve separate problems, they are used together to provide sender-side flow control.
Both the slow start and congestion avoidance algorithms maintain an additional variable called the congestion window (cwind) to help define how much data can be sent. For both algorithms, the size of the actual send window is the minimum of the advertised receive window and the congestion window (the value of cwind).
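The relationship between the advertised receive window, the congestion window, and the actual send window can be shown with a short calculation. The function name `actual_send_window` is hypothetical; the 17,520-byte window and the 1,460-byte Ethernet MSS are taken from the example in this section.

```python
MSS = 1460  # Ethernet MSS with no TCP options, in bytes


def actual_send_window(advertised_receive_window: int, cwind: int) -> int:
    """The actual send window is the smaller of the two limits."""
    return min(advertised_receive_window, cwind)


advertised = 17520
print(advertised // MSS)                            # receive window expressed in MSS-sized segments
print(actual_send_window(advertised, 2 * MSS))      # congestion window is the limit
print(actual_send_window(advertised, 20 * MSS))     # advertised receive window is the limit
```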
Slow Start Algorithm
The premise of the slow start algorithm is that TCP increases the cwind by the MSS (or one segment size) for every ACK received that acknowledges new data. Every time cwind is updated, it is compared to the current advertised receive window size, and the minimum of both values is used to update the actual send window size.
When TCP data begins to flow on a connection after the connection establishment process or after a prolonged idle time, the following slow start process is used to increase the actual send window size (assuming two Ethernet-based TCP peers):
- Set cwind's initial value to 2 MSS (two MSS-sized segments). Compare cwind's value and the currently advertised receive window size (17,520 bytes, or 12 MSS). Set the actual send window size to the minimum of cwind and the currently advertised receive window size. Result: cwind = 2 MSS, advertised receive window size = 12 MSS, actual send window = 2 MSS.
- Two TCP segments are sent. The sender waits for ACKs.
- When the sender receives an ACK, cwind is set to 3 MSS. Compare cwind's value and the currently advertised receive window size. Set the actual send window size to the minimum of cwind and the currently advertised receive window size. Result: cwind = 3 MSS, advertised receive window size = 12 MSS, actual send window = 3 MSS.
- Three TCP segments are sent. The sender waits for ACKs.
- When the sender receives an ACK, cwind is set to 4 MSS. Compare cwind's value and the currently advertised receive window size. Set the actual send window size to the minimum of cwind and the currently advertised receive window size. Result: cwind = 4 MSS, advertised receive window size = 12 MSS, actual send window = 4 MSS.
- Four TCP segments are sent. The sender waits for ACKs.
This process continues until cwind becomes greater than the currently advertised receive window (12 MSS), at which point the currently advertised receive window governs how much data can be sent at a time, and slow start is finished. There is no more sender-side flow control unless a TCP segment needs to be retransmitted.
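The steps above can be sketched as a loop. The sketch assumes, as in the steps, that one ACK for new data arrives for each window of segments sent (for example, because of delayed acknowledgments); the function name `slow_start_windows` and its parameters are hypothetical, and window sizes are counted in MSS units.

```python
def slow_start_windows(advertised_mss: int = 12, initial_cwind_mss: int = 2) -> list:
    """Return the actual send window (in MSS units) used for each flight of segments."""
    cwind = initial_cwind_mss
    windows = []
    # Slow start ends once cwind exceeds the advertised receive window.
    while cwind <= advertised_mss:
        windows.append(min(cwind, advertised_mss))  # segments sent in this flight
        cwind += 1                                  # one ACK for new data: cwind grows by 1 MSS
    return windows


print(slow_start_windows())
```

With a 12 MSS advertised receive window, the flights grow by one segment at a time from 2 MSS up to 12 MSS, after which the advertised receive window governs.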
The following Network Monitor trace (Capture 14-01, included in the Captures folder on the companion CD-ROM) illustrates the slow start behavior for the downloading of a file using FTP up to 6 MSS:
17 FTP Server FTP Client ....S., len: 0, seq: 10482005-10482005, ack: 0
18 FTP Client FTP Server .A..S., len: 0, seq: 376829-376829, ack: 10482006
19 FTP Server FTP Client .A...., len: 0, seq: 10482006-10482006, ack: 376830
20 FTP Server FTP Client .A...., len: 1460, seq: 10482006-10483465, ack: 376830
21 FTP Server FTP Client .A...., len: 1460, seq: 10483466-10484925, ack: 376830
22 FTP Client FTP Server .A...., len: 0, seq: 376830-376830, ack: 10484926
23 FTP Server FTP Client .A...., len: 1460, seq: 10484926-10486385, ack: 376830
24 FTP Server FTP Client .A...., len: 1460, seq: 10486386-10487845, ack: 376830
25 FTP Server FTP Client .A...., len: 1460, seq: 10487846-10489305, ack: 376830
26 FTP Client FTP Server .A...., len: 0, seq: 376830-376830, ack: 10489306
27 FTP Server FTP Client .A...., len: 1460, seq: 10489306-10490765, ack: 376830
28 FTP Server FTP Client .A...., len: 1460, seq: 10490766-10492225, ack: 376830
29 FTP Server FTP Client .A...., len: 1460, seq: 10492226-10493685, ack: 376830
30 FTP Server FTP Client .A...., len: 1460, seq: 10493686-10495145, ack: 376830
31 FTP Client FTP Server .A...., len: 0, seq: 376830-376830, ack: 10495146
32 FTP Server FTP Client .A...., len: 1460, seq: 10495146-10496605, ack: 376830
33 FTP Server FTP Client .A...., len: 1460, seq: 10496606-10498065, ack: 376830
34 FTP Server FTP Client .A...., len: 1460, seq: 10498066-10499525, ack: 376830
35 FTP Server FTP Client .A...., len: 1460, seq: 10499526-10500985, ack: 376830
36 FTP Server FTP Client .A...., len: 1460, seq: 10500986-10502445, ack: 376830
37 FTP Client FTP Server .A...., len: 0, seq: 376830-376830, ack: 10500986
38 FTP Server FTP Client .A...., len: 1460, seq: 10502446-10503905, ack: 376830
39 FTP Server FTP Client .A...., len: 1460, seq: 10503906-10505365, ack: 376830
40 FTP Server FTP Client .A...., len: 1460, seq: 10505366-10506825, ack: 376830
41 FTP Server FTP Client .A...., len: 1460, seq: 10506826-10508285, ack: 376830
42 FTP Server FTP Client .A...., len: 1460, seq: 10508286-10509745, ack: 376830
43 FTP Server FTP Client .A...., len: 1460, seq: 10509746-10511205, ack: 376830
The slow start algorithm for this data transfer is as follows:
- The TCP connection establishment process is done in Frames 17 through 19. cwind is set to 2 MSS.
- Frames 20 and 21 are the two segments corresponding to the current actual send window size of 2 MSS.
- Frame 22 is an ACK segment for Frames 20 and 21. cwind is set to 3 MSS.
- Frames 23 through 25 are the three segments corresponding to the current actual send window size of 3 MSS.
- Frame 26 is an ACK segment for Frames 23 to 25. cwind is set to 4 MSS.
- Frames 27 through 30 are the four segments corresponding to the current actual send window size of 4 MSS.
- Frame 31 is an ACK segment for Frames 27 through 30. cwind is set to 5 MSS.
- Frames 32 through 36 are the five segments corresponding to the current actual send window size of 5 MSS.
- Frame 37 is an ACK segment for Frames 32 through 35 (its Acknowledgment Number, 10500986, is the next byte after Frame 35). cwind is set to 6 MSS.
- Frames 38 through 43 are the six segments corresponding to the current actual send window size of 6 MSS.
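The growth pattern in the trace can be checked with a few lines of code. The sketch below transcribes only the frame numbers, direction, and payload length of Frames 20 through 43 from the capture; the function name `flights` and the tuple layout are assumptions made for the example.

```python
DATA_LEN = 1460                      # payload length of every data segment in the trace
CLIENT_ACK_FRAMES = {22, 26, 31, 37}  # frames sent by the FTP client (pure ACKs)

# (frame number, source, payload length) for Frames 20 through 43
frames = [
    (n, "client", 0) if n in CLIENT_ACK_FRAMES else (n, "server", DATA_LEN)
    for n in range(20, 44)
]


def flights(frame_list):
    """Count the data segments the server sends between consecutive client ACKs."""
    sizes, current = [], 0
    for _, source, length in frame_list:
        if source == "server" and length > 0:
            current += 1
        elif current:
            sizes.append(current)
            current = 0
    if current:
        sizes.append(current)
    return sizes


print(flights(frames))

# Segments covered by the ACK in Frame 37, counted from the start of Frame 32.
print((10500986 - 10495146) // DATA_LEN)
```

The flight sizes grow by one segment after each ACK, and the Acknowledgment Number in Frame 37 covers four full segments starting at Frame 32.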
The rate at which the size of the actual send window increases depends on how quickly ACK segments are returned. In a high-bandwidth, low-delay environment such as a LAN, the actual send window opens quickly. In a low-bandwidth, high-delay environment such as a WAN, the actual send window opens more slowly.
Although called the slow start algorithm, the actual send window size can increase at an exponential rate based on the receipt of multiple ACKs for multiple segments sent. For example, when starting the actual send window at 2 MSS, two segments are sent. If an ACK is sent for each segment sent, the actual send window increases to 4 MSS; four segments are sent. If an ACK is sent for each segment sent, the actual send window increases to 8 MSS. The actual send window has quickly grown from 2 MSS to 4 MSS, and then to 8 MSS. The actual window growth depends on how many ACK segments are received.
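The exponential case can be sketched under the assumption stated in the paragraph above, namely that a separate ACK is returned for every segment sent. The function name `per_segment_ack_growth` is hypothetical, and window sizes are counted in MSS units.

```python
def per_segment_ack_growth(initial_mss: int = 2, flights: int = 3) -> list:
    """Actual send window (in MSS units) per flight when every segment is ACKed separately."""
    window = initial_mss
    history = [window]
    for _ in range(flights - 1):
        acks = window      # one ACK per segment sent in this flight
        window += acks     # each ACK for new data adds 1 MSS
        history.append(window)
    return history


print(per_segment_ack_growth())
```

Because each flight produces as many ACKs as segments, the window doubles every round trip in this idealized case.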
Congestion Avoidance Algorithm
Once data is flowing on the TCP connection, the actual send window is governed by the currently advertised receive window and receiver-side flow control is in effect. When a TCP segment must be retransmitted, the assumption is that the packet loss is a result of congestion at a router, rather than damage to the packet causing a checksum verification to fail. If the packet loss is a result of congestion at a router, the sender's transmission rate must be immediately lowered and then gradually increased back to the rate at which data was being sent before the congestion occurred. For TCP connections, the transmission rate is the amount of data that the sender can send before having to wait for an ACK.
When congestion occurs, the slow start algorithm is used to increase the actual send window size to half of the value of the advertised receive window size when the congestion occurred. The congestion avoidance algorithm then takes over. To keep track of when to use slow start and when to use congestion avoidance, an additional variable called the slow start threshold (ssthresh) is used. When a connection is established, ssthresh is set to 65,535. As with slow start, during congestion avoidance, the actual send window is the minimum of cwind and the currently advertised receive window.
The premise of the congestion avoidance algorithm is to increase cwind by 1 MSS for each round-trip time, which is the time it takes for a TCP segment to be sent and acknowledged. The congestion avoidance algorithm provides a smooth, linear increase in cwind, thereby increasing the actual send window. There are different ways of implementing the change in cwind for congestion avoidance, as follows:
- One method is to increase cwind by MSS*MSS/cwind (integer division) for each segment that is acknowledged. For example, if cwind is set to 7 MSS, for each segment that is acknowledged, cwind is incremented by MSS*MSS/(7*MSS), or MSS/7. Therefore, after seven acknowledged segments, cwind increases by approximately 1 MSS. When cwind is incremented by a quantity that is not a full MSS, sender-side SWS avoidance prevents a small segment from being sent. Only after cwind is incremented to another MSS can another full segment be sent.
- Another method is to track the current actual send window size in increments of the MSS. When the number of segments that corresponds to the current actual send window size has been ACKed, increment cwind by an MSS. Thus, the actual send window grows by 1 MSS for each full window of data that has been acknowledged.
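The first method can be worked through numerically. The sketch below assumes a 1,460-byte MSS and a starting cwind of 7 MSS; the function names `per_ack_increment` and `grow_per_ack` are hypothetical.

```python
MSS = 1460  # assumed Ethernet MSS, in bytes


def per_ack_increment(cwind: int, mss: int = MSS) -> int:
    """Congestion avoidance increment for one acknowledged segment (integer division)."""
    return mss * mss // cwind


def grow_per_ack(cwind: int, acked_segments: int, mss: int = MSS) -> int:
    """Apply the per-ACK increment once for each acknowledged segment."""
    for _ in range(acked_segments):
        cwind += per_ack_increment(cwind, mss)
    return cwind


start = 7 * MSS
after = grow_per_ack(start, 7)
print(per_ack_increment(start))   # first increment, equal to MSS // 7
print(after - start)              # total growth after seven acknowledged segments
```

Because cwind grows slightly with every increment and integer division truncates, the total growth after seven acknowledged segments comes out a little under one full MSS, which is why the increase is described as approximately linear per round-trip time.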
With slow start, the actual send window increases by an MSS for each ACK received in a round-trip time. With congestion avoidance, the actual send window increases by an MSS for multiple ACKs received in a round-trip time.
When congestion occurs (indicated when a TCP segment must be retransmitted or when a duplicate ACK is received), the combination of slow start and congestion avoidance for TCP/IP for the Windows Server 2003 family and Windows XP works as follows:
1. Set ssthresh to half the value of the current send window with a minimum value of 2 MSS. cwind is set to the value of 2 MSS.
2. Set the actual send window to the minimum of the currently advertised receive window and cwind.
3. Send the appropriate number of TCP segments.
4. As ACKs are received, increment cwind. If cwind ≤ ssthresh, increment cwind using slow start. If cwind > ssthresh, increment cwind using congestion avoidance.
5. Return to step 2.
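The recovery procedure above can be sketched as follows. This is a simplified model, not the Windows implementation: it assumes the comparison in the ACK-processing step is cwind <= ssthresh for slow start (as in RFC 2001), uses the MSS*MSS/cwind method for congestion avoidance, and processes one ACK at a time. The function names `on_congestion`, `on_ack`, and `recover` are hypothetical.

```python
MSS = 1460  # assumed Ethernet MSS, in bytes


def on_congestion(send_window: int, mss: int = MSS):
    """Reset ssthresh and cwind when congestion is detected."""
    ssthresh = max(send_window // 2, 2 * mss)  # half the send window, at least 2 MSS
    cwind = 2 * mss
    return ssthresh, cwind


def on_ack(cwind: int, ssthresh: int, mss: int = MSS) -> int:
    """Grow cwind for one ACK that acknowledges new data."""
    if cwind <= ssthresh:
        return cwind + mss               # slow start: add a full MSS
    return cwind + mss * mss // cwind    # congestion avoidance: add a fraction of an MSS


def recover(send_window: int, advertised: int, acks: int, mss: int = MSS):
    """Return ssthresh and the actual send window after each of `acks` ACKs."""
    ssthresh, cwind = on_congestion(send_window, mss)
    windows = []
    for _ in range(acks):
        cwind = on_ack(cwind, ssthresh, mss)
        windows.append(min(cwind, advertised))  # actual send window
    return ssthresh, windows


ssthresh, windows = recover(send_window=12 * MSS, advertised=12 * MSS, acks=6)
print(ssthresh // MSS, [w / MSS for w in windows])
```

In this run the window climbs by a full MSS per ACK until cwind passes ssthresh (6 MSS), after which each ACK adds only a fraction of an MSS.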
The result of using the combination of slow start and congestion avoidance is that when congestion occurs, the sender uses slow start to quickly increase the actual send window size to half the size of the actual send window when the congestion occurred. Then, congestion avoidance is used to more slowly increase the actual send window size up to the currently advertised receive window size. This gradual increase in the amount of data being sent allows the internetwork to clear its routing buffers and recover from the congestion.
Summary
TCP achieves reliable data transfer through the cumulative or selective acknowledgment of TCP segments received. Selective acknowledgments improve TCP performance in high-loss environments or for TCP connections with large window sizes. To provide receiver-side flow control, TCP uses sliding send and receive windows. With each ACK segment, the receiver indicates how much more data can be sent and successfully buffered. To avoid sending small segments, TCP uses the Nagle algorithm and SWS avoidance. To provide sender-side flow control, TCP uses the slow start and congestion avoidance algorithms. Slow start is used to increase the size of the actual send window by 1 MSS for each ACK segment received. Congestion avoidance is used to increase the size of the actual send window by 1 MSS for each round-trip time. Slow start and congestion avoidance are used to avoid congesting an IP internetwork when sending and retransmitting data.