1TCP(7) Linux Programmer's Manual TCP(7)
2
3
4
6 tcp - TCP protocol
7
9 #include <sys/socket.h>
10 #include <netinet/in.h>
11 #include <netinet/tcp.h>
12
13 tcp_socket = socket(AF_INET, SOCK_STREAM, 0);
14
16 This is an implementation of the TCP protocol defined in RFC 793,
17 RFC 1122 and RFC 2001 with the NewReno and SACK extensions. It pro‐
18 vides a reliable, stream-oriented, full-duplex connection between two
19 sockets on top of ip(7), for both v4 and v6 versions. TCP guarantees
20 that the data arrives in order and retransmits lost packets. It gener‐
21 ates and checks a per-packet checksum to catch transmission errors.
22 TCP does not preserve record boundaries.
23
24 A newly created TCP socket has no remote or local address and is not
25 fully specified. To create an outgoing TCP connection use connect(2)
26 to establish a connection to another TCP socket. To receive new incom‐
27 ing connections, first bind(2) the socket to a local address and port
28 and then call listen(2) to put the socket into the listening state.
29 After that a new socket for each incoming connection can be accepted
30 using accept(2). A socket which has had accept(2) or connect(2) suc‐
31 cessfully called on it is fully specified and may transmit data. Data
32 cannot be transmitted on listening or not yet connected sockets.
33
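The sequence above can be sketched in C roughly as follows. This is only an illustration for IPv4: the helper names tcp_listen_any() and tcp_connect_ipv4() are hypothetical, and error handling is reduced to returning -1.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a socket, bind(2) it to the given port on
   all local IPv4 addresses, and put it into the listening state.
   Returns the listening descriptor, or -1 on error. */
int tcp_listen_any(uint16_t port, int backlog)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);      /* 0 lets the kernel pick a port */

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1 ||
        listen(fd, backlog) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: open an outgoing connection to a dotted-quad
   IPv4 address. Returns a connected descriptor, or -1 on error. */
int tcp_connect_ipv4(const char *ip, uint16_t port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A server would then call accept(2) on the descriptor returned by tcp_listen_any() to obtain one new socket per incoming connection.
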
34 Linux supports RFC 1323 TCP high performance extensions. These include
35 Protection Against Wrapped Sequence Numbers (PAWS), Window Scaling and
36 Timestamps. Window scaling allows the use of large (> 64K) TCP windows
37 in order to support links with high latency or bandwidth. To make use
38 of them, the send and receive buffer sizes must be increased. They can
39 be set globally with the /proc/sys/net/ipv4/tcp_wmem and
40 /proc/sys/net/ipv4/tcp_rmem files, or on individual sockets by using
41 the SO_SNDBUF and SO_RCVBUF socket options with the setsockopt(2) call.
42
43 The maximum sizes for socket buffers declared via the SO_SNDBUF and
44 SO_RCVBUF mechanisms are limited by the values in the
45 /proc/sys/net/core/rmem_max and /proc/sys/net/core/wmem_max files.
46 Note that TCP actually allocates twice the size of the buffer requested
47 in the setsockopt(2) call, and so a subsequent getsockopt(2) call will
48 not return the same size of buffer as requested in the setsockopt(2)
49 call. TCP uses the extra space for administrative purposes and inter‐
50 nal kernel structures, and the /proc file values reflect the larger
51 sizes compared to the actual TCP windows. On individual connections,
52 the socket buffer size must be set prior to the listen(2) or connect(2)
53 calls in order to have it take effect. See socket(7) for more informa‐
54 tion.
55
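As a rough illustration of the per-socket path, the following sketch requests a receive buffer size and reads back what the kernel reports. The helper name set_rcvbuf() is hypothetical; the value returned depends on the system's rmem_max setting.

```c
#include <sys/socket.h>

/* Hypothetical helper: request a receive buffer of `bytes` on `fd` and
   return the size the kernel reports afterwards, or -1 on error.
   Call it before listen(2) or connect(2). */
int set_rcvbuf(int fd, int bytes)
{
    int actual = 0;
    socklen_t len = sizeof(actual);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) == -1)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) == -1)
        return -1;
    return actual;   /* usually not equal to `bytes`; see the text above */
}
```

Comparing the return value with the requested size shows the doubling described above, as long as the request stays within the rmem_max limit.
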
56 TCP supports urgent data. Urgent data is used to signal the receiver
57 that some important message is part of the data stream and that it
58 should be processed as soon as possible. To send urgent data specify
59 the MSG_OOB option to send(2). When urgent data is received, the ker‐
60 nel sends a SIGURG signal to the process or process group that has been
61 set as the socket "owner" using the SIOCSPGRP or FIOSETOWN ioctls (or
62 the POSIX.1-2001-specified fcntl(2) F_SETOWN operation). When the
63 SO_OOBINLINE socket option is enabled, urgent data is put into the
64 normal data stream (a program can test for its location using the
65 SIOCATMARK ioctl described below); otherwise it can be received only
66 when the MSG_OOB flag is set for recv(2) or recvmsg(2).
67
68 Linux 2.4 introduced a number of changes for improved throughput and
69 scaling, as well as enhanced functionality. Some of these features
70 include support for zero-copy sendfile(2), Explicit Congestion Notifi‐
71 cation, new management of TIME_WAIT sockets, keep-alive socket options
72 and support for Duplicate SACK extensions.
73
74 Address Formats
75 TCP is built on top of IP (see ip(7)). The address formats defined by
76 ip(7) apply to TCP. TCP only supports point-to-point communication;
77 broadcasting and multicasting are not supported.
78
79 /proc interfaces
80 System-wide TCP parameter settings can be accessed by files in the
81 directory /proc/sys/net/ipv4/. In addition, most IP /proc interfaces
82 also apply to TCP; see ip(7). Variables described as Boolean take an
83 integer value, with a nonzero value ("true") meaning that the corre‐
84 sponding option is enabled, and a zero value ("false") meaning that the
85 option is disabled.
86
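A program can read a single-valued file in this directory with ordinary file I/O. The sketch below assumes the file holds one integer; the helper names read_int_file() and sysctl_enabled() are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: read a single integer from a file such as
   /proc/sys/net/ipv4/tcp_fin_timeout. Returns 0 on success and -1
   on error (file missing, unreadable, or not an integer). */
int read_int_file(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Interpret a Boolean /proc value as described above: nonzero means
   enabled, zero means disabled. */
int sysctl_enabled(long value)
{
    return value != 0;
}
```

For example, read_int_file("/proc/sys/net/ipv4/tcp_sack", &v) followed by sysctl_enabled(v) would report whether selective acknowledgements are enabled, provided the file exists on the running kernel.
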
87 tcp_abc (Integer; default: 0; since Linux 2.6.15)
88 Control the Appropriate Byte Count (ABC), defined in RFC 3465.
89 ABC is a way of increasing the congestion window (cwnd) more
90 slowly in response to partial acknowledgments. Possible values
91 are:
92
93 0 increase cwnd once per acknowledgment (no ABC)
94
95 1 increase cwnd once per acknowledgment of a full-sized segment
96
97 2 allow cwnd to increase by two if the acknowledgment is of two
98 segments, to compensate for delayed acknowledgments.
99
100 tcp_abort_on_overflow (Boolean; default: disabled; since Linux 2.4)
101 Enable resetting connections if the listening service is too
102 slow and unable to keep up and accept them. While the option is
103 disabled, a connection can recover from an overflow caused by a burst.
104 Enable this option only if you are really sure that the listen‐
105 ing daemon cannot be tuned to accept connections faster.
106 Enabling this option can harm the clients of your server.
107
108 tcp_adv_win_scale (integer; default: 2; since Linux 2.4)
109 Count buffering overhead as bytes/2^tcp_adv_win_scale, if
110 tcp_adv_win_scale is greater than 0; or bytes-
111 bytes/2^(-tcp_adv_win_scale), if tcp_adv_win_scale is less than
112 or equal to zero.
113
114 The socket receive buffer space is shared between the applica‐
115 tion and kernel. TCP maintains part of the buffer as the TCP
116 window; this is the size of the receive window advertised to the
117 other end. The rest of the space is used as the "application"
118 buffer, used to isolate the network from scheduling and applica‐
119 tion latencies. The tcp_adv_win_scale default value of 2
120 implies that the space used for the application buffer is one
121 fourth that of the total.
122
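The tcp_adv_win_scale formula can be written out as a small calculation. This is a sketch of the arithmetic as stated in this entry, not the kernel's code; the function names are hypothetical, and bytes is assumed to be nonnegative.

```c
/* Buffering overhead, in bytes, as given by the formula above.
   `bytes` is the receive buffer space; `scale` is tcp_adv_win_scale. */
long adv_win_overhead(long bytes, int scale)
{
    if (scale > 0)
        return bytes >> scale;              /* bytes / 2^scale */
    return bytes - (bytes >> -scale);       /* bytes - bytes / 2^(-scale) */
}

/* Space left for the advertised window once the overhead is removed. */
long adv_win_window(long bytes, int scale)
{
    return bytes - adv_win_overhead(bytes, scale);
}
```

With a buffer of 87380 bytes and the default scale of 2, the overhead is 21845 bytes (one fourth), which leaves 65535 bytes for the window.
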
123 tcp_allowed_congestion_control (String; default: see text; since Linux
124 2.4.20)
125 Show/set the congestion control algorithm choices available to
126 unprivileged processes (see the description of the TCP_CONGES‐
127 TION socket option). The list is a subset of those listed in
128 tcp_available_congestion_control. The default value for this
129 list is "reno" plus the default setting of tcp_congestion_con‐
130 trol.
131
132 tcp_available_congestion_control (String; read-only; since Linux
133 2.4.20)
134 Show a list of the congestion-control algorithms that are regis‐
135 tered. This list is a limiting set for the list in
136 tcp_allowed_congestion_control. More congestion-control algo‐
137 rithms may be available as modules, but not loaded.
138
139 tcp_app_win (integer; default: 31; since Linux 2.4)
140 This variable defines how many bytes of the TCP window are
141 reserved for buffering overhead.
142
143 max(window/2^tcp_app_win, mss) bytes in the window are
144 reserved for the application buffer. A value of 0 implies that
145 no amount is reserved.
146
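The tcp_app_win reservation can be sketched as follows, reading the expression above as the larger of window/2^tcp_app_win and mss. That reading is an interpretation, and the function name app_win_reserved() is hypothetical.

```c
#include <stdint.h>

/* Bytes of the window reserved for the application buffer, following
   the expression in this entry. A tcp_app_win of 0 reserves nothing. */
uint64_t app_win_reserved(uint64_t window, unsigned int tcp_app_win,
                          uint64_t mss)
{
    uint64_t share;

    if (tcp_app_win == 0)
        return 0;
    /* A shift by 64 or more is undefined in C; such a share is 0. */
    share = tcp_app_win < 64 ? window >> tcp_app_win : 0;
    return share > mss ? share : mss;
}
```

With the default of 31, window/2^31 is zero for any window below 2 GB, so under this reading the reservation is simply one mss.
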
147 tcp_base_mss (Integer; default: 512; since Linux 2.6.17)
148 The initial value of search_low to be used by the packetization
149 layer Path MTU discovery (MTU probing). If MTU probing is
150 enabled, this is the initial MSS used by the connection.
151
152 tcp_bic (Boolean; default: disabled; Linux 2.4.27/2.6.6 to 2.6.13)
153 Enable BIC TCP congestion control algorithm. BIC-TCP is a
154 sender-side only change that ensures a linear RTT fairness under
155 large windows while offering both scalability and bounded TCP-
156 friendliness. The protocol combines two schemes called additive
157 increase and binary search increase. When the congestion window
158 is large, additive increase with a large increment ensures lin‐
159 ear RTT fairness as well as good scalability. Under small con‐
160 gestion windows, binary search increase provides TCP friendli‐
161 ness.
162
163 tcp_bic_low_window (integer; default: 14; Linux 2.4.27/2.6.6 to 2.6.13)
164 Set the threshold window (in packets) where BIC TCP starts to
165 adjust the congestion window. Below this threshold BIC TCP
166 behaves the same as the default TCP Reno.
167
168 tcp_bic_fast_convergence (Boolean; default: enabled; Linux 2.4.27/2.6.6
169 to 2.6.13)
170 Force BIC TCP to more quickly respond to changes in congestion
171 window. Allows two flows sharing the same connection to con‐
172 verge more rapidly.
173
174 tcp_congestion_control (String; default: see text; since Linux 2.4.13)
175 Set the default congestion-control algorithm to be used for new
176 connections. The algorithm "reno" is always available, but
177 additional choices may be available depending on kernel configu‐
178 ration. The default value for this file is set as part of ker‐
179 nel configuration.
180
181 tcp_dma_copybreak (integer; default: 4096; since Linux 2.6.24)
182 Lower limit, in bytes, of the size of socket reads that will be
183 offloaded to a DMA copy engine, if one is present in the system
184 and the kernel was configured with the CONFIG_NET_DMA option.
185
186 tcp_dsack (Boolean; default: enabled; since Linux 2.4)
187 Enable RFC 2883 TCP Duplicate SACK support.
188
189 tcp_ecn (Boolean; default: disabled; since Linux 2.4)
190 Enable RFC 3168 Explicit Congestion Notification. When enabled,
191 connectivity to some destinations could be affected due to
192 older, misbehaving routers along the path causing connections to
193 be dropped.
194
195 tcp_fack (Boolean; default: enabled; since Linux 2.2)
196 Enable TCP Forward Acknowledgement support.
197
198 tcp_fin_timeout (integer; default: 60; since Linux 2.2)
199 This specifies how many seconds to wait for a final FIN packet
200 before the socket is forcibly closed. This is strictly a viola‐
201 tion of the TCP specification, but required to prevent denial-
202 of-service attacks. In Linux 2.2, the default value was 180.
203
204 tcp_frto (integer; default: 0; since Linux 2.4.21/2.6)
205 Enable F-RTO, an enhanced recovery algorithm for TCP retransmis‐
206 sion timeouts (RTOs). It is particularly beneficial in wireless
207 environments where packet loss is typically due to random radio
208 interference rather than intermediate router congestion. See
209 RFC 4138 for more details.
210
211 This file can have one of the following values:
212
213 0 Disabled.
214
215 1 The basic version of the F-RTO algorithm is enabled.
216
217 2 Enable SACK-enhanced F-RTO if the flow uses SACK. The basic
218 version can also be used when SACK is in use, though there
219 are then scenarios in which F-RTO interacts badly with the
220 packet counting of the SACK-enabled TCP flow.
221
222 Before Linux 2.6.22, this parameter was a Boolean value, sup‐
223 porting just values 0 and 1 above.
224
225 tcp_frto_response (integer; default: 0; since Linux 2.6.22)
226 When F-RTO has detected that a TCP retransmission timeout was
227 spurious (i.e., the timeout would have been avoided had TCP set a
228 longer retransmission timeout), TCP has several options concern‐
229 ing what to do next. Possible values are:
230
231 0 Rate halving based; a smooth and conservative response,
232 results in halved congestion window (cwnd) and slow-start
233 threshold (ssthresh) after one RTT.
234
235 1 Very conservative response; not recommended because even
236 though being valid, it interacts poorly with the rest of
237 Linux TCP; halves cwnd and ssthresh immediately.
238
239 2 Aggressive response; undoes congestion-control measures that
240 are now known to be unnecessary (ignoring the possibility of
241 a lost retransmission that would require TCP to be more cau‐
242 tious); cwnd and ssthresh are restored to the values prior to
243 timeout.
244
245 tcp_keepalive_intvl (integer; default: 75; since Linux 2.4)
246 The number of seconds between TCP keep-alive probes.
247
248 tcp_keepalive_probes (integer; default: 9; since Linux 2.2)
249 The maximum number of TCP keep-alive probes to send before giv‐
250 ing up and killing the connection if no response is obtained
251 from the other end.
252
253 tcp_keepalive_time (integer; default: 7200; since Linux 2.2)
254 The number of seconds a connection needs to be idle before TCP
255 begins sending out keep-alive probes. Keep-alives are only sent
256 when the SO_KEEPALIVE socket option is enabled. The default
257 value is 7200 seconds (2 hours). An idle connection is
258 terminated after approximately an additional 11 minutes (9 probes
259 at intervals of 75 seconds) when keep-alive is enabled.
260
261 Note that underlying connection tracking mechanisms and applica‐
262 tion timeouts may be much shorter.
263
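The arithmetic behind these keep-alive figures can be checked with a one-line calculation. The function name is hypothetical; its arguments correspond to tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes.

```c
/* Seconds from the last activity until an unresponsive peer is
   given up on: the idle time plus one interval per unanswered probe. */
long keepalive_detect_seconds(long idle, long intvl, long probes)
{
    return idle + intvl * probes;
}
```

With the defaults (7200, 75, 9) this gives 7875 seconds: two hours of idle time plus 675 seconds (about 11 minutes) of probing.
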
264 tcp_low_latency (Boolean; default: disabled; since Linux 2.4.21/2.6)
265 If enabled, the TCP stack makes decisions that prefer lower
266 latency as opposed to higher throughput. If this option is
267 disabled, then higher throughput is preferred. An example of an
268 application where this default should be changed would be a
269 Beowulf compute cluster.
270
271 tcp_max_orphans (integer; default: see below; since Linux 2.4)
272 The maximum number of orphaned (not attached to any user file
273 handle) TCP sockets allowed in the system. When this number is
274 exceeded, the orphaned connection is reset and a warning is
275 printed. This limit exists only to prevent simple denial-of-
276 service attacks. Lowering this limit is not recommended. Net‐
277 work conditions might require you to increase the number of
278 orphans allowed, but note that each orphan can eat up to ~64K of
279 unswappable memory. The default initial value is set equal to
280 the kernel parameter NR_FILE. This initial default is adjusted
281 depending on the memory in the system.
282
283 tcp_max_syn_backlog (integer; default: see below; since Linux 2.2)
284 The maximum number of queued connection requests which have
285 still not received an acknowledgement from the connecting
286 client. If this number is exceeded, the kernel will begin drop‐
287 ping requests. The default value of 256 is increased to 1024
288 when the memory present in the system is adequate or greater (>=
289 128 MB), and reduced to 128 for those systems with very low
290 memory (<= 32 MB). It is recommended that if this needs to be
291 increased above 1024, TCP_SYNQ_HSIZE in include/net/tcp.h be
292 modified to keep TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog, and the
293 kernel be recompiled.
294
295 tcp_max_tw_buckets (integer; default: see below; since Linux 2.4)
296 The maximum number of sockets in TIME_WAIT state allowed in the
297 system. This limit exists only to prevent simple denial-of-ser‐
298 vice attacks. The default value of NR_FILE*2 is adjusted
299 depending on the memory in the system. If this number is
300 exceeded, the socket is closed and a warning is printed.
301
302 tcp_moderate_rcvbuf (Boolean; default: enabled; since Linux
303 2.4.17/2.6.7)
304 If enabled, TCP performs receive buffer auto-tuning, attempting
305 to automatically size the buffer (no greater than tcp_rmem[2])
306 to match the size required by the path for full throughput.
307
308 tcp_mem (since Linux 2.4)
309 This is a vector of 3 integers: [low, pressure, high]. These
310 bounds, measured in units of the system page size, are used by
311 TCP to track its memory usage. The defaults are calculated at
312 boot time from the amount of available memory. (TCP can only
313 use low memory for this, which is limited to around 900
314 megabytes on 32-bit systems. 64-bit systems do not suffer this
315 limitation.)
316
317 low TCP doesn't regulate its memory allocation when the
318 number of pages it has allocated globally is below
319 this number.
320
321 pressure When the amount of memory allocated by TCP exceeds
322 this number of pages, TCP moderates its memory con‐
323 sumption. This memory pressure state is exited once
324 the number of pages allocated falls below the low
325 mark.
326
327 high The maximum number of pages, globally, that TCP will
328 allocate. This value overrides any other limits
329 imposed by the kernel.
330
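Because the tcp_mem values are expressed in pages, a monitoring tool usually converts them to bytes. The sketch below parses the three fields from a text line; the struct and function names are hypothetical, and the page size is passed in by the caller (for example from sysconf(_SC_PAGESIZE)).

```c
#include <stdio.h>

/* The three tcp_mem thresholds, converted to bytes. */
struct tcp_mem_bytes {
    long long low;
    long long pressure;
    long long high;
};

/* Hypothetical helper: parse "low pressure high" (in pages) from `text`
   and convert each value to bytes. Returns 0 on success, -1 if the
   text does not contain three integers. */
int parse_tcp_mem(const char *text, long page_size,
                  struct tcp_mem_bytes *out)
{
    long low, pressure, high;

    if (sscanf(text, "%ld %ld %ld", &low, &pressure, &high) != 3)
        return -1;
    out->low = (long long) low * page_size;
    out->pressure = (long long) pressure * page_size;
    out->high = (long long) high * page_size;
    return 0;
}
```

The text argument would typically be the single line read from /proc/sys/net/ipv4/tcp_mem.
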
331 tcp_mtu_probing (integer; default: 0; since Linux 2.6.17)
332 This parameter controls TCP Packetization-Layer Path MTU Discov‐
333 ery. The following values may be assigned to the file:
334
335 0 Disabled
336
337 1 Disabled by default, enabled when an ICMP black hole is detected
338
339 2 Always enabled, use initial MSS of tcp_base_mss.
340
341 tcp_no_metrics_save (Boolean; default: disabled; since Linux 2.6.6)
342 By default, TCP saves various connection metrics in the route
343 cache when the connection closes, so that connections estab‐
344 lished in the near future can use these to set initial condi‐
345 tions. Usually, this increases overall performance, but it may
346 sometimes cause performance degradation. If tcp_no_metrics_save
347 is enabled, TCP will not cache metrics on closing connections.
348
349 tcp_orphan_retries (integer; default: 8; since Linux 2.4)
350 The maximum number of attempts made to probe the other end of a
351 connection which has been closed by our end.
352
353 tcp_reordering (integer; default: 3; since Linux 2.4)
354 The maximum a packet can be reordered in a TCP packet stream
355 without TCP assuming packet loss and going into slow start. It
356 is not advisable to change this number. This is a packet
357 reordering detection metric designed to minimize unnecessary
358 back off and retransmits provoked by reordering of packets on a
359 connection.
360
361 tcp_retrans_collapse (Boolean; default: enabled; since Linux 2.2)
362 Try to send full-sized packets during retransmit.
363
364 tcp_retries1 (integer; default: 3; since Linux 2.2)
365 The number of times TCP will attempt to retransmit a packet on
366 an established connection normally, without the extra effort of
367 getting the network layers involved. Once we exceed this number
368 of retransmits, we first have the network layer update the route
369 if possible before each new retransmit. The default is the RFC
370 specified minimum of 3.
371
372 tcp_retries2 (integer; default: 15; since Linux 2.2)
373 The maximum number of times a TCP packet is retransmitted in
374 established state before giving up. The default value is 15,
375 which corresponds to a duration of between approximately 13 and
376 30 minutes, depending on the retransmission timeout. The
377 RFC 1122 specified minimum limit of 100 seconds is typically
378 deemed too short.
379
380 tcp_rfc1337 (Boolean; default: disabled; since Linux 2.2)
381 Enable TCP behavior conformant with RFC 1337. When disabled, if
382 a RST is received in TIME_WAIT state, we close the socket imme‐
383 diately without waiting for the end of the TIME_WAIT period.
384
385 tcp_rmem (since Linux 2.4)
386 This is a vector of 3 integers: [min, default, max]. These
387 parameters are used by TCP to regulate receive buffer sizes.
388 TCP dynamically adjusts the size of the receive buffer from the
389 defaults listed below, in the range of these values, depending
390 on memory available in the system.
391
392 min minimum size of the receive buffer used by each TCP
393 socket. The default value is the system page size.
394 (On Linux 2.4, the default value is 4K, lowered to
395 PAGE_SIZE bytes in low-memory systems.) This value is
396 used to ensure that in memory pressure mode, alloca‐
397 tions below this size will still succeed. This is not
398 used to bound the size of the receive buffer declared
399 using SO_RCVBUF on a socket.
400
401 default the default size of the receive buffer for a TCP
402 socket. This value overwrites the initial default
403 buffer size from the generic global
404 net.core.rmem_default defined for all protocols. The
405 default value is 87380 bytes. (On Linux 2.4, this
406 will be lowered to 43689 in low-memory systems.) If
407 larger receive buffer sizes are desired, this value
408 should be increased (to affect all sockets). To
409 employ large TCP windows, the net.ipv4.tcp_win‐
410 dow_scaling must be enabled (default).
411
412 max the maximum size of the receive buffer used by each
413 TCP socket. This value does not override the global
414 net.core.rmem_max. This is not used to limit the size
415 of the receive buffer declared using SO_RCVBUF on a
416 socket. The default value is calculated using the
417 formula
418
419 max(87380, min(4MB, tcp_mem[1]*PAGE_SIZE/128))
420
421 (On Linux 2.4, the default is 87380*2 bytes, lowered
422 to 87380 in low-memory systems).
423
424 tcp_sack (Boolean; default: enabled; since Linux 2.2)
425 Enable RFC 2018 TCP Selective Acknowledgements.
426
427 tcp_slow_start_after_idle (Boolean; default: enabled; since Linux
428 2.6.18)
429 If enabled, provide RFC 2861 behavior and time out the conges‐
430 tion window after an idle period. An idle period is defined as
431 the current RTO (retransmission timeout). If disabled, the con‐
432 gestion window will not be timed out after an idle period.
433
434 tcp_stdurg (Boolean; default: disabled; since Linux 2.2)
435 If this option is enabled, then use the RFC 1122 interpretation
436 of the TCP urgent-pointer field. According to this interpreta‐
437 tion, the urgent pointer points to the last byte of urgent data.
438 If this option is disabled, then use the BSD-compatible inter‐
439 pretation of the urgent pointer: the urgent pointer points to
440 the first byte after the urgent data. Enabling this option may
441 lead to interoperability problems.
442
443 tcp_syn_retries (integer; default: 5; since Linux 2.2)
444 The maximum number of times initial SYNs for an active TCP con‐
445 nection attempt will be retransmitted. This value should not be
446 higher than 255. The default value is 5, which corresponds to
447 approximately 180 seconds.
448
449 tcp_synack_retries (integer; default: 5; since Linux 2.2)
450 The maximum number of times a SYN/ACK segment for a passive TCP
451 connection will be retransmitted. This number should not be
452 higher than 255.
453
454 tcp_syncookies (Boolean; since Linux 2.2)
455 Enable TCP syncookies. The kernel must be compiled with CON‐
456 FIG_SYN_COOKIES. Send out syncookies when the syn backlog queue
457 of a socket overflows. The syncookies feature attempts to pro‐
458 tect a socket from a SYN flood attack. This should be used as a
459 last resort, if at all. This is a violation of the TCP proto‐
460 col, and conflicts with other areas of TCP such as TCP exten‐
461 sions. It can cause problems for clients and relays. It is not
462 recommended as a tuning mechanism for heavily loaded servers to
463 help with overloaded or misconfigured conditions. For recom‐
464 mended alternatives see tcp_max_syn_backlog, tcp_synack_retries,
465 and tcp_abort_on_overflow.
466
467 tcp_timestamps (Boolean; default: enabled; since Linux 2.2)
468 Enable RFC 1323 TCP timestamps.
469
470 tcp_tso_win_divisor (integer; default: 3; since Linux 2.6.9)
471 This parameter controls what percentage of the congestion window
472 can be consumed by a single TCP Segmentation Offload (TSO)
473 frame. The setting of this parameter is a tradeoff between
474 burstiness and building larger TSO frames.
475
476 tcp_tw_recycle (Boolean; default: disabled; since Linux 2.4)
477 Enable fast recycling of TIME_WAIT sockets. Enabling this
478 option is not recommended since this causes problems when work‐
479 ing with NAT (Network Address Translation).
480
481 tcp_tw_reuse (Boolean; default: disabled; since Linux 2.4.19/2.6)
482 Allow reuse of TIME_WAIT sockets for new connections when it is
483 safe from the protocol viewpoint. It should not be changed without
484 advice/request of technical experts.
485
486 tcp_vegas_cong_avoid (Boolean; default: disabled; Linux 2.2 to 2.6.13)
487 Enable TCP Vegas congestion avoidance algorithm. TCP Vegas is a
488 sender-side only change to TCP that anticipates the onset of
489 congestion by estimating the bandwidth. TCP Vegas adjusts the
490 sending rate by modifying the congestion window. TCP Vegas
491 should provide less packet loss, but it is not as aggressive as
492 TCP Reno.
493
494 tcp_westwood (Boolean; default: disabled; Linux 2.4.26/2.6.3 to 2.6.13)
495 Enable TCP Westwood+ congestion control algorithm. TCP West‐
496 wood+ is a sender-side only modification of the TCP Reno proto‐
497 col stack that optimizes the performance of TCP congestion con‐
498 trol. It is based on end-to-end bandwidth estimation to set
499 congestion window and slow start threshold after a congestion
500 episode. Using this estimation, TCP Westwood+ adaptively sets a
501 slow start threshold and a congestion window which takes into
502 account the bandwidth used at the time congestion is experi‐
503 enced. TCP Westwood+ significantly increases fairness with
504 respect to TCP Reno in wired networks and throughput over wire‐
505 less links.
506
507 tcp_window_scaling (Boolean; default: enabled; since Linux 2.2)
508 Enable RFC 1323 TCP window scaling. This feature allows the use
509 of a large window (> 64K) on a TCP connection, should the other
510 end support it. Normally, the 16 bit window length field in the
511 TCP header limits the window size to less than 64K bytes. If
512 larger windows are desired, applications can increase the size
513 of their socket buffers and the window scaling option will be
514 employed. If tcp_window_scaling is disabled, TCP will not nego‐
515 tiate the use of window scaling with the other end during con‐
516 nection setup.
517
518 tcp_wmem (since Linux 2.4)
519 This is a vector of 3 integers: [min, default, max]. These
520 parameters are used by TCP to regulate send buffer sizes. TCP
521 dynamically adjusts the size of the send buffer from the default
522 values listed below, in the range of these values, depending on
523 memory available.
524
525 min Minimum size of the send buffer used by each TCP
526 socket. The default value is the system page size.
527 (On Linux 2.4, the default value is 4K bytes.) This
528 value is used to ensure that in memory pressure mode,
529 allocations below this size will still succeed. This
530 is not used to bound the size of the send buffer
531 declared using SO_SNDBUF on a socket.
532
533 default The default size of the send buffer for a TCP socket.
534 This value overwrites the initial default buffer size
535 from the generic global
536 /proc/sys/net/core/wmem_default defined for all proto‐
537 cols. The default value is 16K bytes. If larger send
538 buffer sizes are desired, this value should be
539 increased (to affect all sockets). To employ large
540 TCP windows, the /proc/sys/net/ipv4/tcp_window_scaling
541 must be set to a nonzero value (default).
542
543 max The maximum size of the send buffer used by each TCP
544 socket. This value does not override the value in
545 /proc/sys/net/core/wmem_max. This is not used to
546 limit the size of the send buffer declared using
547 SO_SNDBUF on a socket. The default value is calcu‐
548 lated using the formula
549
550 max(65536, min(4MB, tcp_mem[1]*PAGE_SIZE/128))
551
552 (On Linux 2.4, the default value is 128K bytes,
553 lowered to 64K on low-memory systems.)
554
555 tcp_workaround_signed_windows (Boolean; default: disabled; since Linux
556 2.6.26)
557 If enabled, assume that no receipt of a window-scaling option
558 means that the remote TCP is broken and treats the window as a
559 signed quantity. If disabled, assume that the remote TCP is not
560 broken even if we do not receive a window scaling option from
561 it.
562
563 Socket Options
564 To set or get a TCP socket option, call getsockopt(2) to read or set‐
565 sockopt(2) to write the option with the option level argument set to
566 IPPROTO_TCP. In addition, most IPPROTO_IP socket options are valid on
567 TCP sockets. For more information see ip(7).
568
569 TCP_CORK (since Linux 2.2)
570 If set, don't send out partial frames. All queued partial
571 frames are sent when the option is cleared again. This is use‐
572 ful for prepending headers before calling sendfile(2), or for
573 throughput optimization. As currently implemented, there is a
574 200 millisecond ceiling on the time for which output is corked
575 by TCP_CORK. If this ceiling is reached, then queued data is
576 automatically transmitted. This option can be combined with
577 TCP_NODELAY only since Linux 2.5.71. This option should not be
578 used in code intended to be portable.
579
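A typical use of TCP_CORK is to set the option, write a header, send the body, and then clear the option so that everything queued is flushed. The helper below is a minimal sketch; the name tcp_set_cork() is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: set (on != 0) or clear (on == 0) TCP_CORK.
   Returns 0 on success, -1 on error with errno set by setsockopt(2). */
int tcp_set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}
```

A caller might use tcp_set_cork(fd, 1) before writing the header and calling sendfile(2), then tcp_set_cork(fd, 0) to send any remaining partial frame.
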
580 TCP_DEFER_ACCEPT (since Linux 2.4)
581 Allow a listener to be awakened only when data arrives on the
582 socket. Takes an integer value (seconds); this can bound the
583 maximum number of attempts TCP will make to complete the connec‐
584 tion. This option should not be used in code intended to be
585 portable.
586
587 TCP_INFO (since Linux 2.4)
588 Used to collect information about this socket. The kernel
589 returns a struct tcp_info as defined in the file
590 /usr/include/linux/tcp.h. This option should not be used in
591 code intended to be portable.
592
593 TCP_KEEPCNT (since Linux 2.4)
594 The maximum number of keepalive probes TCP should send before
595 dropping the connection. This option should not be used in code
596 intended to be portable.
597
598 TCP_KEEPIDLE (since Linux 2.4)
599 The time (in seconds) the connection needs to remain idle before
600 TCP starts sending keepalive probes, if the socket option
601 SO_KEEPALIVE has been set on this socket. This option should
602 not be used in code intended to be portable.
603
604 TCP_KEEPINTVL (since Linux 2.4)
605 The time (in seconds) between individual keepalive probes. This
606 option should not be used in code intended to be portable.
607
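The three keepalive options are usually set together with SO_KEEPALIVE. The following is a minimal sketch; the helper name tcp_enable_keepalive() and the values a caller passes are illustrative only.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable keepalive on `fd` and override the
   system-wide timing for this socket. `idle` and `intvl` are in
   seconds; `cnt` is the number of probes. Returns 0 or -1. */
int tcp_enable_keepalive(int fd, int idle, int intvl, int cnt)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) == -1)
        return -1;
    return 0;
}
```

For instance, tcp_enable_keepalive(fd, 60, 10, 5) would start probing after 60 idle seconds and give up after 5 unanswered probes sent 10 seconds apart.
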
608 TCP_LINGER2 (since Linux 2.4)
609 The lifetime of orphaned FIN_WAIT2 state sockets. This option
610 can be used to override the system-wide setting in the file
611 /proc/sys/net/ipv4/tcp_fin_timeout for this socket. This is not
612 to be confused with the socket(7) level option SO_LINGER. This
613 option should not be used in code intended to be portable.
614
615 TCP_MAXSEG
616 The maximum segment size for outgoing TCP packets. If this
617 option is set before connection establishment, it also changes
618 the MSS value announced to the other end in the initial packet.
619 Values greater than the (eventual) interface MTU have no effect.
620 TCP will also impose its minimum and maximum bounds over the
621 value provided.
622
623 TCP_NODELAY
624 If set, disable the Nagle algorithm. This means that segments
625 are always sent as soon as possible, even if there is only a
626 small amount of data. When not set, data is buffered until
627 there is a sufficient amount to send out, thereby avoiding the
628 frequent sending of small packets, which results in poor uti‐
629 lization of the network. This option is overridden by TCP_CORK;
630 however, setting this option forces an explicit flush of pending
631 output, even if TCP_CORK is currently set.
632
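Disabling the Nagle algorithm takes a single setsockopt(2) call, sketched below with a hypothetical helper name.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: set (on != 0) or clear (on == 0) TCP_NODELAY.
   Returns 0 on success, -1 on error. */
int tcp_set_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

Interactive protocols that send many small messages are the usual candidates for tcp_set_nodelay(fd, 1).
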
633 TCP_QUICKACK (since Linux 2.4.4)
634 Enable quickack mode if set or disable quickack mode if cleared.
635 In quickack mode, acks are sent immediately, rather than delayed
636 if needed in accordance with normal TCP operation. This flag is
637 not permanent; it only enables a switch to or from quickack
638 mode. Subsequent operation of the TCP protocol will once again
639 enter/leave quickack mode depending on internal protocol pro‐
640 cessing and factors such as delayed ack timeouts occurring and
641 data transfer. This option should not be used in code intended
642 to be portable.
643
644 TCP_SYNCNT (since Linux 2.4)
645 Set the number of SYN retransmits that TCP should send before
646 aborting the attempt to connect. It cannot exceed 255. This
647 option should not be used in code intended to be portable.
648
649 TCP_WINDOW_CLAMP (since Linux 2.4)
650 Bound the size of the advertised window to this value. The ker‐
651 nel imposes a minimum size of SOCK_MIN_RCVBUF/2. This option
652 should not be used in code intended to be portable.
653
654 Sockets API
655 TCP provides limited support for out-of-band data, in the form of (a
656 single byte of) urgent data. In Linux this means that if the other
657 end sends newer out-of-band data, the older urgent data is inserted as
658 normal data into the stream (even when SO_OOBINLINE is not set). This
659 differs from BSD-based stacks.
660
661 Linux uses the BSD-compatible interpretation of the urgent pointer
662 field by default. This violates RFC 1122, but is required for
663 interoperability with other stacks. It can be changed via
664 /proc/sys/net/ipv4/tcp_stdurg.
665
666 It is possible to peek at out-of-band data using the recv(2) MSG_PEEK
667 flag.
668
669 Since version 2.4, Linux supports the use of MSG_TRUNC in the flags
670 argument of recv(2) (and recvmsg(2)). This flag causes the received
671 bytes of data to be discarded, rather than passed back in a caller-sup‐
672 plied buffer. Since Linux 2.4.4, MSG_PEEK also has this effect when
673 used in conjunction with MSG_OOB to receive out-of-band data.
674
675 Ioctls
676 The following ioctl(2) calls return information in value. The correct
677 syntax is:
678
679 int value;
680 error = ioctl(tcp_socket, ioctl_type, &value);
681
682 ioctl_type is one of the following:
683
684 SIOCINQ
685 Returns the amount of queued unread data in the receive buffer.
686 The socket must not be in LISTEN state, otherwise an error (EIN‐
687 VAL) is returned.
688
689 SIOCATMARK
690 Returns true (i.e., value is nonzero) if the inbound data stream
691 is at the urgent mark.
692
693 If the SO_OOBINLINE socket option is set, and SIOCATMARK returns
694 true, then the next read from the socket will return the urgent
695 data. If the SO_OOBINLINE socket option is not set, and
696 SIOCATMARK returns true, then the next read from the socket will
697 return the bytes following the urgent data (reading the urgent
698 data itself requires recv(2) with the MSG_OOB flag).
699
700 Note that a read never reads across the urgent mark. If an
701 application is informed of the presence of urgent data via
702 select(2) (using the exceptfds argument) or through delivery of
703 a SIGURG signal, then it can advance up to the mark using a loop
704 which repeatedly tests SIOCATMARK and performs a read (request‐
705 ing any number of bytes) as long as SIOCATMARK returns false.
706
707 SIOCOUTQ
708 Returns the amount of unsent data in the socket send queue. The
709 socket must not be in LISTEN state, otherwise an error (EINVAL)
710 is returned.
711
712 Error Handling
713 When a network error occurs, TCP tries to resend the packet. If it
714 doesn't succeed after some time, either ETIMEDOUT or the last received
715 error on this connection is reported.
716
717 Some applications require quicker error notification. This can be
718 enabled with the IPPROTO_IP level IP_RECVERR socket option. When this
719 option is enabled, all incoming errors are immediately passed to the
720 user program. Use this option with care: it makes TCP less tolerant
721 of routing changes and other normal network conditions.
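Enabling the option on an IPv4 TCP socket could look like the sketch
below. The helper name enable_recverr() is an illustrative
assumption.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Ask for incoming errors to be passed to the program immediately.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_recverr(int fd)
{
    int on = 1;

    return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
}
```

An application that sets this option should be prepared to see errors
that TCP would otherwise have recovered from by retransmitting.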
722
723 ERRORS
724 EAFNOTSUPPORT
725 The socket address type passed in sin_family was not AF_INET.
726
727 EPIPE The other end closed the socket unexpectedly or a write is
728 executed on a shut down socket.
729
730 ETIMEDOUT
731 The other end didn't acknowledge retransmitted data after some
732 time.
733
734 Any errors defined for ip(7) or the generic socket layer may also be
735 returned for TCP.
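As a sketch of how a sender might surface these errors, the helper
below writes a whole buffer and reports failure through errno. It
assumes the application wants an EPIPE error return rather than a
SIGPIPE signal, and therefore passes MSG_NOSIGNAL; send_all() is a
hypothetical helper name.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send the whole buffer, retrying after short writes and EINTR.
 * Returns 0 on success, or -1 on error with errno left as set by
 * send(2), for example EPIPE or ETIMEDOUT. */
int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = send(fd, p, len, MSG_NOSIGNAL);

        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal; retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

After a failure the caller can inspect errno to decide whether to
reconnect or to report the error.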
736
737 VERSIONS
738 Support for Explicit Congestion Notification, zero-copy sendfile(2),
739 reordering, and some SACK extensions (DSACK) was introduced in Linux
740 2.4. Support for forward acknowledgement (FACK), TIME_WAIT recycling,
741 and per-connection keepalive socket options was introduced in 2.3.
742
743 BUGS
744 Not all errors are documented.
745 IPv6 is not described.
746
747 SEE ALSO
748 accept(2), bind(2), connect(2), getsockopt(2), listen(2), recvmsg(2),
749 sendfile(2), sendmsg(2), socket(2), ip(7), socket(7)
750
751 RFC 793 for the TCP specification.
752 RFC 1122 for the TCP requirements and a description of the Nagle algo‐
753 rithm.
754 RFC 1323 for TCP timestamp and window scaling options.
755 RFC 1644 for a description of TIME_WAIT assassination hazards.
756 RFC 3168 for a description of Explicit Congestion Notification.
757 RFC 2581 for TCP congestion control algorithms.
758 RFC 2018 and RFC 2883 for SACK and extensions to SACK.
759
760 COLOPHON
761 This page is part of release 3.25 of the Linux man-pages project. A
762 description of the project, and information about reporting bugs, can
763 be found at http://www.kernel.org/doc/man-pages/.
764
765
766
767Linux 2009-09-30 TCP(7)