TCP(7) Linux Programmer's Manual TCP(7)

NAME
tcp - TCP protocol

SYNOPSIS
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

tcp_socket = socket(AF_INET, SOCK_STREAM, 0);
14
DESCRIPTION
This is an implementation of the TCP protocol defined in RFC 793,
RFC 1122 and RFC 2001 with the NewReno and SACK extensions. It
provides a reliable, stream-oriented, full-duplex connection between
two sockets on top of ip(7), for both v4 and v6 versions. TCP
guarantees that the data arrives in order and retransmits lost
packets. It generates and checks a per-packet checksum to catch
transmission errors. TCP does not preserve record boundaries.
23
24 A newly created TCP socket has no remote or local address and is not
25 fully specified. To create an outgoing TCP connection use connect(2)
26 to establish a connection to another TCP socket. To receive new incom‐
27 ing connections, first bind(2) the socket to a local address and port
28 and then call listen(2) to put the socket into the listening state.
29 After that a new socket for each incoming connection can be accepted
30 using accept(2). A socket which has had accept(2) or connect(2) suc‐
31 cessfully called on it is fully specified and may transmit data. Data
32 cannot be transmitted on listening or not yet connected sockets.
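
The sequence described above can be sketched in C as follows. This is a minimal illustration rather than a reference implementation: the function name tcp_loopback_roundtrip() is hypothetical, and the sketch assumes that the loopback address 127.0.0.1 is usable. Both ends live in one process only to keep the example short.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if one byte travels from a connect(2)ed
 * socket to an accept(2)ed socket over loopback, -1 on any error. */
int tcp_loopback_roundtrip(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int lfd, cfd = -1, afd = -1, rc = -1;
    char sent = 'x', got = 0;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a free port */

    /* Passive side: bind(2), then listen(2); getsockname(2) reveals
     * the port that was chosen. */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
        goto done;

    /* Active side: connect(2) to the listening address. */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 ||
        connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto done;

    /* accept(2) returns a new, fully specified socket. */
    afd = accept(lfd, NULL, NULL);
    if (afd < 0)
        goto done;

    /* Only the connected pair may carry data, never the listener. */
    if (send(cfd, &sent, 1, 0) == 1 &&
        recv(afd, &got, 1, 0) == 1 && got == sent)
        rc = 0;
done:
    if (afd >= 0)
        close(afd);
    if (cfd >= 0)
        close(cfd);
    close(lfd);
    return rc;
}
```

A real server would normally keep the listening socket open and call accept(2) in a loop.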
33
34 Linux supports RFC 1323 TCP high performance extensions. These include
35 Protection Against Wrapped Sequence Numbers (PAWS), Window Scaling and
36 Timestamps. Window scaling allows the use of large (> 64K) TCP windows
37 in order to support links with high latency or bandwidth. To make use
38 of them, the send and receive buffer sizes must be increased. They can
39 be set globally with the /proc/sys/net/ipv4/tcp_wmem and
40 /proc/sys/net/ipv4/tcp_rmem files, or on individual sockets by using
41 the SO_SNDBUF and SO_RCVBUF socket options with the setsockopt(2) call.
42
43 The maximum sizes for socket buffers declared via the SO_SNDBUF and
44 SO_RCVBUF mechanisms are limited by the values in the
45 /proc/sys/net/core/rmem_max and /proc/sys/net/core/wmem_max files.
46 Note that TCP actually allocates twice the size of the buffer requested
47 in the setsockopt(2) call, and so a succeeding getsockopt(2) call will
48 not return the same size of buffer as requested in the setsockopt(2)
49 call. TCP uses the extra space for administrative purposes and inter‐
50 nal kernel structures, and the /proc file values reflect the larger
51 sizes compared to the actual TCP windows. On individual connections,
52 the socket buffer size must be set prior to the listen(2) or connect(2)
53 calls in order to have it take effect. See socket(7) for more informa‐
54 tion.
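
The effect of the doubling can be observed with a short sketch such as the one below. The helper name rcvbuf_after_request() is hypothetical, and the exact value returned depends on the limits configured on the running system.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: requests a receive buffer of `requested` bytes on
 * a fresh, unconnected TCP socket and returns the size that
 * getsockopt(2) reports afterwards, or -1 on error. */
int rcvbuf_after_request(int requested)
{
    int fd, actual = 0;
    socklen_t len = sizeof(actual);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* The size is set before any listen(2) or connect(2) call. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
        actual = -1;

    close(fd);
    return actual;
}
```

On a typical system a request for 8192 bytes is expected to be reported back as a larger value, in line with the doubling described above.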
55
56 TCP supports urgent data. Urgent data is used to signal the receiver
57 that some important message is part of the data stream and that it
58 should be processed as soon as possible. To send urgent data specify
59 the MSG_OOB option to send(2). When urgent data is received, the ker‐
60 nel sends a SIGURG signal to the process or process group that has been
61 set as the socket "owner" using the SIOCSPGRP or FIOSETOWN ioctls (or
62 the POSIX.1-2001-specified fcntl(2) F_SETOWN operation). When the
63 SO_OOBINLINE socket option is enabled, urgent data is put into the nor‐
64 mal data stream (a program can test for its location using the SIOCAT‐
65 MARK ioctl described below), otherwise it can be only received when the
66 MSG_OOB flag is set for recv(2) or recvmsg(2).
67
68 Linux 2.4 introduced a number of changes for improved throughput and
69 scaling, as well as enhanced functionality. Some of these features
70 include support for zero-copy sendfile(2), Explicit Congestion Notifi‐
71 cation, new management of TIME_WAIT sockets, keep-alive socket options
72 and support for Duplicate SACK extensions.
73
74 Address Formats
75 TCP is built on top of IP (see ip(7)). The address formats defined by
76 ip(7) apply to TCP. TCP only supports point-to-point communication;
77 broadcasting and multicasting are not supported.
78
79 /proc interfaces
80 System-wide TCP parameter settings can be accessed by files in the
81 directory /proc/sys/net/ipv4/. In addition, most IP /proc interfaces
82 also apply to TCP; see ip(7). Variables described as Boolean take an
83 integer value, with a non-zero value ("true") meaning that the corre‐
84 sponding option is enabled, and a zero value ("false") meaning that the
85 option is disabled.
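
As a rough sketch, a program might read one of these files as shown below. The helper names read_sysctl_int() and sysctl_enabled() are hypothetical, and the code only handles files that hold a single integer (vector-valued files such as tcp_mem would need more parsing).

```c
#include <stdio.h>

/* Hypothetical helper: reads the first integer found in `path` into
 * *value. Returns 0 on success, -1 if the file cannot be opened or
 * does not start with an integer. */
int read_sysctl_int(const char *path, long *value)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%ld", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Boolean variables: any non-zero value means "enabled". */
int sysctl_enabled(long value)
{
    return value != 0;
}
```

For example, read_sysctl_int("/proc/sys/net/ipv4/tcp_sack", &v) followed by sysctl_enabled(v) would report whether the tcp_sack variable described below is switched on.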
86
87 tcp_abc (Integer; default: 0; since Linux 2.6.15)
88 Controls the Appropriate Byte Count (ABC), defined in RFC 3465.
89 ABC is a way of increasing the congestion window (cwnd) more
90 slowly in response to partial acknowledgments. Possible values
91 are:
92
93 0 increase cwnd once per acknowledgment (no ABC)
94
95 1 increase cwnd once per acknowledgment of full sized segment
96
97 2 allow increase cwnd by two if acknowledgment is of two seg‐
98 ments to compensate for delayed acknowledgments.
99
100 tcp_abort_on_overflow (Boolean; default: disabled; since Linux 2.4)
Enable resetting connections if the listening service is too
slow and unable to keep up and accept them. With this option
disabled (the default), a connection that overflowed because of a
burst can still recover. Enable this option only if you are really
sure that the listening daemon cannot be tuned to accept connections
faster. Enabling this option can harm the clients of your server.
107
108 tcp_adv_win_scale (integer; default: 2; since Linux 2.4)
109 Count buffering overhead as bytes/2^tcp_adv_win_scale, if
110 tcp_adv_win_scale is greater than 0; or bytes-
111 bytes/2^(-tcp_adv_win_scale), if tcp_adv_win_scale is less than
112 or equal to zero.
113
The socket receive buffer space is shared between the application
and kernel. TCP maintains part of the buffer as the TCP window; this
is the size of the receive window advertised to the other end. The
rest of the space is used as the "application" buffer, used to
isolate the network from scheduling and application latencies. The
tcp_adv_win_scale default value of 2 implies that the space used for
the application buffer is one fourth that of the total.
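
The two cases of the overhead formula can be written out as a small sketch. The function names are hypothetical, and the code assumes a non-negative byte count and a scale whose magnitude is smaller than the width of a long.

```c
/* Buffering overhead, in bytes, for a receive buffer of `bytes` bytes,
 * following the two cases given for tcp_adv_win_scale. */
long tcp_buffer_overhead(long bytes, int adv_win_scale)
{
    if (adv_win_scale > 0)
        return bytes >> adv_win_scale;          /* bytes/2^scale */
    return bytes - (bytes >> -adv_win_scale);   /* bytes-bytes/2^(-scale) */
}

/* What is left of the buffer once the overhead has been subtracted. */
long tcp_window_space(long bytes, int adv_win_scale)
{
    return bytes - tcp_buffer_overhead(bytes, adv_win_scale);
}
```

With the default scale of 2 and a buffer of 87380 bytes, the overhead is 21845 bytes (one fourth), which leaves 65535 bytes for the window.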
122
123 tcp_allowed_congestion_control (String; default: see text; since Linux
124 2.4.20)
125 Show/set the congestion control choices available to non-privi‐
126 leged processes (see the description of the TCP_CONGESTION
127 socket option). The list is a subset of those listed in
128 tcp_available_congestion_control. The default value for this
129 list is "reno" plus the default setting of tcp_congestion_con‐
130 trol.
131
132 tcp_available_congestion_control (String; read-only; since Linux
133 2.4.20)
134 Shows a list of the congestion-control algorithms that are reg‐
135 istered. This list is a limiting set for the list in
136 tcp_allowed_congestion_control. More congestion-control algo‐
137 rithms may be available as modules, but not loaded.
138
139 tcp_app_win (integer; default: 31; since Linux 2.4)
140 This variable defines how many bytes of the TCP window are
141 reserved for buffering overhead.
142
In total, max(window/2^tcp_app_win, mss) bytes of the window are
reserved for the application buffer. A value of 0 implies that no
amount is reserved.
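
The reservation rule can be sketched as below, reading the expression as the larger of window/2^tcp_app_win and the MSS. The function name tcp_app_reserved() is hypothetical, and the guard against shift counts of 32 or more is an addition made here to keep the C well defined.

```c
#include <stdint.h>

/* Bytes of `window` reserved for the application buffer. */
uint32_t tcp_app_reserved(uint32_t window, unsigned app_win, uint32_t mss)
{
    uint32_t share;

    if (app_win == 0)
        return 0;                   /* 0 means nothing is reserved */
    share = (app_win < 32) ? (window >> app_win) : 0;
    return share > mss ? share : mss;
}
```

With the default of 31, window/2^31 is 0 for any realistic window, so under this reading the reservation comes out as one MSS.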
146
147 tcp_base_mss (Integer; default: 512; since Linux 2.6.17)
148 The initial value of search_low to be used by the packetization
149 layer Path MTU discovery (MTU probing). If MTU probing is
150 enabled, this is the initial MSS used by the connection.
151
152 tcp_bic (Boolean; default: disabled; Linux 2.4.27/2.6.6 to 2.6.13)
153 Enable BIC TCP congestion control algorithm. BIC-TCP is a
154 sender-side only change that ensures a linear RTT fairness under
155 large windows while offering both scalability and bounded TCP-
156 friendliness. The protocol combines two schemes called additive
157 increase and binary search increase. When the congestion window
158 is large, additive increase with a large increment ensures lin‐
159 ear RTT fairness as well as good scalability. Under small con‐
160 gestion windows, binary search increase provides TCP friendli‐
161 ness.
162
163 tcp_bic_low_window (integer; default: 14; Linux 2.4.27/2.6.6 to 2.6.13)
164 Sets the threshold window (in packets) where BIC TCP starts to
165 adjust the congestion window. Below this threshold BIC TCP
166 behaves the same as the default TCP Reno.
167
168 tcp_bic_fast_convergence (Boolean; default: enabled; Linux 2.4.27/2.6.6
169 to 2.6.13)
170 Forces BIC TCP to more quickly respond to changes in congestion
171 window. Allows two flows sharing the same connection to con‐
172 verge more rapidly.
173
tcp_congestion_control (String; default: see text; since Linux 2.4.13)
175 Set the default congestion-control algorithm to be used for new
176 connections. The algorithm "reno" is always available, but
177 additional choices may be available depending on kernel configu‐
178 ration. The default value for this file is set as part of ker‐
179 nel configuration.
180
181 tcp_dma_copybreak (integer; default: 4096; since Linux 2.6.24)
182 Lower limit, in bytes, of the size of socket reads that will be
183 offloaded to a DMA copy engine, if one is present in the system
184 and the kernel was configured with the CONFIG_NET_DMA option.
185
186 tcp_dsack (Boolean; default: enabled; since Linux 2.4)
187 Enable RFC 2883 TCP Duplicate SACK support.
188
189 tcp_ecn (Boolean; default: disabled; since Linux 2.4)
Enable RFC 3168 Explicit Congestion Notification. When enabled,
connectivity to some destinations could be affected due to older,
misbehaving routers along the path causing connections to be dropped.
194
195 tcp_fack (Boolean; default: enabled; since Linux 2.2)
196 Enable TCP Forward Acknowledgement support.
197
198 tcp_fin_timeout (integer; default: 60; since Linux 2.2)
199 This specifies how many seconds to wait for a final FIN packet
200 before the socket is forcibly closed. This is strictly a viola‐
201 tion of the TCP specification, but required to prevent denial-
202 of-service attacks. In Linux 2.2, the default value was 180.
203
204 tcp_frto (integer; default: 0; since Linux 2.4.21/2.6)
205 Enables F-RTO, an enhanced recovery algorithm for TCP retrans‐
206 mission timeouts (RTOs). It is particularly beneficial in wire‐
207 less environments where packet loss is typically due to random
208 radio interference rather than intermediate router congestion.
209 See RFC 4138 for more details.
210
211 This file can have one of the following values:
212
213 0 Disabled.
214
1 The basic version of the F-RTO algorithm is enabled.

2 Enable SACK-enhanced F-RTO if the flow uses SACK. The basic version
can also be used when SACK is in use, though in that case there are
scenarios where F-RTO interacts badly with the packet counting of
the SACK-enabled TCP flow.
221
222 Before Linux 2.6.22, this parameter was a Boolean value, sup‐
223 porting just values 0 and 1 above.
224
225 tcp_frto_response (integer; default: 0; since Linux 2.6.22)
When F-RTO has detected that a TCP retransmission timeout was
spurious (i.e., the timeout would have been avoided had TCP set a
longer retransmission timeout), TCP has several options concerning
what to do next. Possible values are:
230
231 0 Rate halving based; a smooth and conservative response,
232 results in halved congestion window (cwnd) and slow-start
233 threshold (ssthresh) after one RTT.
234
1 Very conservative response; not recommended because, although
valid, it interacts poorly with the rest of Linux TCP; halves cwnd
and ssthresh immediately.
238
239 2 Aggressive response; undoes congestion-control measures that
240 are now known to be unnecessary (ignoring the possibility of
241 a lost retransmission that would require TCP to be more cau‐
242 tious); cwnd and ssthresh are restored to the values prior to
243 timeout.
244
245 tcp_keepalive_intvl (integer; default: 75; since Linux 2.4)
246 The number of seconds between TCP keep-alive probes.
247
248 tcp_keepalive_probes (integer; default: 9; since Linux 2.2)
249 The maximum number of TCP keep-alive probes to send before giv‐
250 ing up and killing the connection if no response is obtained
251 from the other end.
252
253 tcp_keepalive_time (integer; default: 7200; since Linux 2.2)
The number of seconds a connection needs to be idle before TCP
begins sending out keep-alive probes. Keep-alives are only sent
when the SO_KEEPALIVE socket option is enabled. The default value
is 7200 seconds (2 hours). An idle connection is terminated after
approximately an additional 11 minutes (9 probes sent at intervals
of 75 seconds) when keep-alive is enabled.
260
261 Note that underlying connection tracking mechanisms and applica‐
262 tion timeouts may be much shorter.
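
The arithmetic behind these figures can be sketched as follows. The function name is hypothetical, and the model is deliberately simple: it adds the idle time to one interval per unanswered probe and ignores any other timers.

```c
/* Seconds from the last activity until an unresponsive peer is given up
 * on: the idle time before the first probe plus one interval for each
 * unanswered probe. */
long keepalive_detection_seconds(long time_s, long probes, long intvl_s)
{
    return time_s + probes * intvl_s;
}
```

With the defaults (7200, 9 and 75) this gives 7200 + 675 = 7875 seconds, that is, two hours plus roughly 11 minutes.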
263
264 tcp_low_latency (Boolean; default: disabled; since Linux 2.4.21/2.6)
If enabled, the TCP stack makes decisions that prefer lower latency
as opposed to higher throughput. If this option is disabled, then
higher throughput is preferred. An example of an application where
this default should be changed would be a Beowulf compute cluster.
270
271 tcp_max_orphans (integer; default: see below; since Linux 2.4)
272 The maximum number of orphaned (not attached to any user file
273 handle) TCP sockets allowed in the system. When this number is
274 exceeded, the orphaned connection is reset and a warning is
275 printed. This limit exists only to prevent simple denial-of-
276 service attacks. Lowering this limit is not recommended. Net‐
277 work conditions might require you to increase the number of
278 orphans allowed, but note that each orphan can eat up to ~64K of
279 unswappable memory. The default initial value is set equal to
280 the kernel parameter NR_FILE. This initial default is adjusted
281 depending on the memory in the system.
282
283 tcp_max_syn_backlog (integer; default: see below; since Linux 2.2)
The maximum number of queued connection requests which have still
not received an acknowledgement from the connecting client. If this
number is exceeded, the kernel will begin dropping requests. The
default value of 256 is increased to 1024 when the memory present in
the system is adequate or greater (>= 128 MB), and reduced to 128
for those systems with very low memory (<= 32 MB). It is recommended
that if this needs to be increased above 1024, TCP_SYNQ_HSIZE in
include/net/tcp.h be modified to keep
TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog, and the kernel be recompiled.
294
295 tcp_max_tw_buckets (integer; default: see below; since Linux 2.4)
296 The maximum number of sockets in TIME_WAIT state allowed in the
297 system. This limit exists only to prevent simple denial-of-ser‐
298 vice attacks. The default value of NR_FILE*2 is adjusted
299 depending on the memory in the system. If this number is
300 exceeded, the socket is closed and a warning is printed.
301
302 tcp_moderate_rcvbuf (Boolean; default: enabled; since Linux
303 2.4.17/2.6.7)
304 If enabled, TCP performs receive buffer auto-tuning, attempting
305 to automatically size the buffer (no greater than tcp_rmem[2])
306 to match the size required by the path for full throughput.
307
308 tcp_mem (since Linux 2.4)
309 This is a vector of 3 integers: [low, pressure, high]. These
310 bounds, measured in units of the system page size, are used by
311 TCP to track its memory usage. The defaults are calculated at
312 boot time from the amount of available memory. (TCP can only
313 use low memory for this, which is limited to around 900
314 megabytes on 32-bit systems. 64-bit systems do not suffer this
315 limitation.)
316
317 low TCP doesn't regulate its memory allocation when the
318 number of pages it has allocated globally is below
319 this number.
320
321 pressure When the amount of memory allocated by TCP exceeds
322 this number of pages, TCP moderates its memory con‐
323 sumption. This memory pressure state is exited once
324 the number of pages allocated falls below the low
325 mark.
326
327 high The maximum number of pages, globally, that TCP will
328 allocate. This value overrides any other limits
329 imposed by the kernel.
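
The way the three bounds interact can be modelled with a small state function. This is an illustrative reading of the text above, not kernel code: the enum, its constants and the function name are all hypothetical, and treating "at or above high" as a separate state is an assumption made for the sketch.

```c
/* Hypothetical states for the model. */
enum tcp_mem_state { TCP_MEM_NORMAL, TCP_MEM_PRESSURE, TCP_MEM_EXHAUSTED };

/* mem[0] = low, mem[1] = pressure, mem[2] = high, all in pages.
 * `prev` is the state before this allocation count was observed. */
enum tcp_mem_state tcp_mem_next_state(enum tcp_mem_state prev,
                                      long allocated, const long mem[3])
{
    if (allocated >= mem[2])
        return TCP_MEM_EXHAUSTED;   /* at the global maximum */
    if (allocated > mem[1])
        return TCP_MEM_PRESSURE;    /* crossed the pressure mark */
    if (allocated < mem[0])
        return TCP_MEM_NORMAL;      /* below low: pressure is left */
    /* Between low and pressure: keep the state we were already in. */
    return prev == TCP_MEM_NORMAL ? TCP_MEM_NORMAL : TCP_MEM_PRESSURE;
}
```

The last branch is the point of the sketch: pressure begins above the pressure mark but ends only below the low mark, so the same allocation count can map to either state depending on history.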
330
331 tcp_mtu_probing (integer; default: 0; since Linux 2.6.17)
332 This parameter controls TCP Packetization-Layer Path MTU Discov‐
333 ery. The following values may be assigned to the file:
334
335 0 Disabled
336
1 Disabled by default, enabled when an ICMP black hole is detected
338
339 2 Always enabled, use initial MSS of tcp_base_mss.
340
341 tcp_no_metrics_save (Boolean; default: disabled; since Linux 2.6.6)
342 By default, TCP saves various connection metrics in the route
343 cache when the connection closes, so that connections estab‐
344 lished in the near future can use these to set initial condi‐
345 tions. Usually, this increases overall performance, but it may
346 sometimes cause performance degradation. If tcp_no_metrics_save
347 is enabled, TCP will not cache metrics on closing connections.
348
349 tcp_orphan_retries (integer; default: 8; since Linux 2.4)
350 The maximum number of attempts made to probe the other end of a
351 connection which has been closed by our end.
352
353 tcp_reordering (integer; default: 3; since Linux 2.4)
354 The maximum a packet can be reordered in a TCP packet stream
355 without TCP assuming packet loss and going into slow start. It
356 is not advisable to change this number. This is a packet
357 reordering detection metric designed to minimize unnecessary
358 back off and retransmits provoked by reordering of packets on a
359 connection.
360
361 tcp_retrans_collapse (Boolean; default: enabled; since Linux 2.2)
362 Try to send full-sized packets during retransmit.
363
364 tcp_retries1 (integer; default: 3; since Linux 2.2)
365 The number of times TCP will attempt to retransmit a packet on
366 an established connection normally, without the extra effort of
367 getting the network layers involved. Once we exceed this number
368 of retransmits, we first have the network layer update the route
369 if possible before each new retransmit. The default is the RFC
370 specified minimum of 3.
371
372 tcp_retries2 (integer; default: 15; since Linux 2.2)
The maximum number of times a TCP packet is retransmitted in
established state before giving up. The default value is 15, which
corresponds to a duration of approximately 13 to 30 minutes,
depending on the retransmission timeout. The RFC 1122 specified
minimum limit of 100 seconds is typically deemed too short.
379
380 tcp_rfc1337 (Boolean; default: disabled; since Linux 2.2)
381 Enable TCP behavior conformant with RFC 1337. When disabled, if
382 a RST is received in TIME_WAIT state, we close the socket imme‐
383 diately without waiting for the end of the TIME_WAIT period.
384
385 tcp_rmem (since Linux 2.4)
386 This is a vector of 3 integers: [min, default, max]. These
387 parameters are used by TCP to regulate receive buffer sizes.
388 TCP dynamically adjusts the size of the receive buffer from the
389 defaults listed below, in the range of these values, depending
390 on memory available in the system.
391
392 min minimum size of the receive buffer used by each TCP
393 socket. The default value is the system page size.
394 (On Linux 2.4, the default value is 4K, lowered to
395 PAGE_SIZE bytes in low-memory systems.) This value is
396 used to ensure that in memory pressure mode, alloca‐
397 tions below this size will still succeed. This is not
398 used to bound the size of the receive buffer declared
399 using SO_RCVBUF on a socket.
400
401 default the default size of the receive buffer for a TCP
402 socket. This value overwrites the initial default
403 buffer size from the generic global
404 net.core.rmem_default defined for all protocols. The
405 default value is 87380 bytes. (On Linux 2.4, this
406 will be lowered to 43689 in low-memory systems.) If
407 larger receive buffer sizes are desired, this value
408 should be increased (to affect all sockets). To
409 employ large TCP windows, the net.ipv4.tcp_win‐
410 dow_scaling must be enabled (default).
411
412 max the maximum size of the receive buffer used by each
413 TCP socket. This value does not override the global
414 net.core.rmem_max. This is not used to limit the size
415 of the receive buffer declared using SO_RCVBUF on a
416 socket. The default value is calculated using the
417 formula
418
419 max(87380, min(4MB, tcp_mem[1]*PAGE_SIZE/128))
420
421 (On Linux 2.4, the default is 87380*2 bytes, lowered
422 to 87380 in low-memory systems).
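
The formula for the default maximum can be evaluated with a sketch like the one below. The function name is hypothetical, "4MB" is assumed here to mean 4*1024*1024 bytes, and the floor is passed as a parameter so that the same code also fits the analogous tcp_wmem formula, which uses 65536.

```c
/* Default for the "max" field:
 *   max(floor_bytes, min(4MB, pressure_pages * page_size / 128))
 * where pressure_pages stands for tcp_mem[1]. */
unsigned long long tcp_default_max_buf(unsigned long long floor_bytes,
                                       unsigned long long pressure_pages,
                                       unsigned long long page_size)
{
    /* "4MB" is taken as 4 MiB; this is an assumption of the sketch. */
    const unsigned long long cap = 4ULL * 1024 * 1024;
    unsigned long long share = pressure_pages * page_size / 128;

    if (share > cap)
        share = cap;
    return share > floor_bytes ? share : floor_bytes;
}
```

For instance, with tcp_mem[1] equal to 100000 pages of 4096 bytes the share is 3200000 bytes, which lies between the floor and the cap and is therefore the result.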
423
424 tcp_sack (Boolean; default: enabled; since Linux 2.2)
425 Enable RFC 2018 TCP Selective Acknowledgements.
426
427 tcp_slow_start_after_idle (Boolean; default: enabled; since Linux
428 2.6.18)
429 If enabled, provide RFC 2861 behavior and time out the conges‐
430 tion window after an idle period. An idle period is defined as
431 the current RTO (retransmission timeout). If disabled, the con‐
432 gestion window will not be timed out after an idle period.
433
434 tcp_stdurg (Boolean; default: disabled; since Linux 2.2)
435 If this option is enabled, then use the RFC 1122 interpretation
436 of the TCP urgent-pointer field. According to this interpreta‐
437 tion, the urgent pointer points to the last byte of urgent data.
438 If this option is disabled, then use the BSD-compatible inter‐
439 pretation of the urgent pointer: the urgent pointer points to
440 the first byte after the urgent data. Enabling this option may
441 lead to interoperability problems.
442
443 tcp_syn_retries (integer; default: 5; since Linux 2.2)
444 The maximum number of times initial SYNs for an active TCP con‐
445 nection attempt will be retransmitted. This value should not be
446 higher than 255. The default value is 5, which corresponds to
447 approximately 180 seconds.
448
449 tcp_synack_retries (integer; default: 5; since Linux 2.2)
450 The maximum number of times a SYN/ACK segment for a passive TCP
451 connection will be retransmitted. This number should not be
452 higher than 255.
453
454 tcp_syncookies (Boolean; since Linux 2.2)
455 Enable TCP syncookies. The kernel must be compiled with CON‐
456 FIG_SYN_COOKIES. Send out syncookies when the syn backlog queue
457 of a socket overflows. The syncookies feature attempts to pro‐
458 tect a socket from a SYN flood attack. This should be used as a
459 last resort, if at all. This is a violation of the TCP proto‐
460 col, and conflicts with other areas of TCP such as TCP exten‐
461 sions. It can cause problems for clients and relays. It is not
462 recommended as a tuning mechanism for heavily loaded servers to
463 help with overloaded or misconfigured conditions. For recom‐
464 mended alternatives see tcp_max_syn_backlog, tcp_synack_retries,
465 and tcp_abort_on_overflow.
466
467 tcp_timestamps (Boolean; default: enabled; since Linux 2.2)
468 Enable RFC 1323 TCP timestamps.
469
470 tcp_tso_win_divisor (integer; default: 3; since Linux 2.6.9)
471 This parameter controls what percentage of the congestion window
472 can be consumed by a single TCP Segmentation Offload (TSO)
473 frame. The setting of this parameter is a tradeoff between
474 burstiness and building larger TSO frames.
475
476 tcp_tw_recycle (Boolean; default: disabled; since Linux 2.4)
477 Enable fast recycling of TIME_WAIT sockets. Enabling this
478 option is not recommended since this causes problems when work‐
479 ing with NAT (Network Address Translation).
480
481 tcp_tw_reuse (Boolean; default: disabled; since Linux 2.4.19/2.6)
Allow reuse of TIME_WAIT sockets for new connections when it is
safe from the protocol viewpoint. It should not be changed without
advice/request of technical experts.
485
486 tcp_vegas_cong_avoid (Boolean; default: disabled; Linux 2.2 to 2.6.13)
487 Enable TCP Vegas congestion avoidance algorithm. TCP Vegas is a
488 sender-side only change to TCP that anticipates the onset of
489 congestion by estimating the bandwidth. TCP Vegas adjusts the
490 sending rate by modifying the congestion window. TCP Vegas
491 should provide less packet loss, but it is not as aggressive as
492 TCP Reno.
493
494 tcp_westwood (Boolean; default: disabled; Linux 2.4.26/2.6.3 to 2.6.13)
495 Enable TCP Westwood+ congestion control algorithm. TCP West‐
496 wood+ is a sender-side only modification of the TCP Reno proto‐
497 col stack that optimizes the performance of TCP congestion con‐
498 trol. It is based on end-to-end bandwidth estimation to set
499 congestion window and slow start threshold after a congestion
500 episode. Using this estimation, TCP Westwood+ adaptively sets a
501 slow start threshold and a congestion window which takes into
502 account the bandwidth used at the time congestion is experi‐
503 enced. TCP Westwood+ significantly increases fairness with
504 respect to TCP Reno in wired networks and throughput over wire‐
505 less links.
506
507 tcp_window_scaling (Boolean; default: enabled; since Linux 2.2)
508 Enable RFC 1323 TCP window scaling. This feature allows the use
509 of a large window (> 64K) on a TCP connection, should the other
510 end support it. Normally, the 16 bit window length field in the
511 TCP header limits the window size to less than 64K bytes. If
512 larger windows are desired, applications can increase the size
513 of their socket buffers and the window scaling option will be
514 employed. If tcp_window_scaling is disabled, TCP will not nego‐
515 tiate the use of window scaling with the other end during con‐
516 nection setup.
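
To illustrate why a 16-bit field needs scaling, the sketch below computes the smallest shift count that lets a given window fit. It is not the kernel's selection logic; the function name is hypothetical, and the upper limit of 14 is taken from RFC 1323 rather than from this page.

```c
/* Smallest shift count (0..14) for which `bytes >> shift` fits in the
 * 16-bit window field of the TCP header. */
int window_shift_for(unsigned long bytes)
{
    int shift = 0;

    while (shift < 14 && (bytes >> shift) > 65535UL)
        shift++;
    return shift;
}
```

A window of 65535 bytes needs no scaling, while a 1 MiB window (1048576 bytes) needs a shift of 5, since 1048576 / 2^5 = 32768.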
517
518 tcp_wmem (since Linux 2.4)
519 This is a vector of 3 integers: [min, default, max]. These
520 parameters are used by TCP to regulate send buffer sizes. TCP
521 dynamically adjusts the size of the send buffer from the default
522 values listed below, in the range of these values, depending on
523 memory available.
524
525 min Minimum size of the send buffer used by each TCP
526 socket. The default value is the system page size.
527 (On Linux 2.4, the default value is 4K bytes.) This
528 value is used to ensure that in memory pressure mode,
529 allocations below this size will still succeed. This
530 is not used to bound the size of the send buffer
531 declared using SO_SNDBUF on a socket.
532
533 default The default size of the send buffer for a TCP socket.
534 This value overwrites the initial default buffer size
535 from the generic global net.core.wmem_default defined
536 for all protocols. The default value is 16K bytes.
537 If larger send buffer sizes are desired, this value
538 should be increased (to affect all sockets). To
539 employ large TCP windows, the
540 /proc/sys/net/ipv4/tcp_window_scaling must be set to a
541 non-zero value (default).
542
543 max The maximum size of the send buffer used by each TCP
544 socket. This value does not override the value in
545 /proc/sys/net/core/wmem_max. This is not used to
546 limit the size of the send buffer declared using
547 SO_SNDBUF on a socket. The default value is calcu‐
548 lated using the formula
549
550 max(65536, min(4MB, tcp_mem[1]*PAGE_SIZE/128))
551
(On Linux 2.4, the default value is 128K bytes, lowered to 64K
bytes on low-memory systems.)
554
555 tcp_workaround_signed_windows (Boolean; default: disabled; since Linux
556 2.6.26)
557 If enabled, assume that no receipt of a window-scaling option
558 means that the remote TCP is broken and treats the window as a
559 signed quantity. If disabled, assume that the remote TCP is not
560 broken even if we do not receive a window scaling option from
561 it.
562
563 Socket Options
564 To set or get a TCP socket option, call getsockopt(2) to read or set‐
565 sockopt(2) to write the option with the option level argument set to
566 IPPROTO_TCP. In addition, most IPPROTO_IP socket options are valid on
567 TCP sockets. For more information see ip(7).
568
569 TCP_CONGESTION (since Linux 2.6.13)
570 Get or set the congestion-control algorithm for this socket.
571 The optval argument is a pointer to a character-string buffer.
572
573 For getsockopt() *optlen specifies the amount of space available
574 in the buffer pointed to by optval, which should be at least 16
575 bytes (defined by the kernel-internal constant TCP_CA_NAME_MAX).
576 On return, the buffer pointed to by optval is set to a null-ter‐
577 minated string containing the name of the congestion-control
578 algorithm for this socket, and *optlen is set to the minimum of
579 its original value and TCP_CA_NAME_MAX. If the value passed in
580 *optlen is too small, then the string returned in *optval is
581 silently truncated, and no terminating null byte is added. If
582 an empty string is returned, then the socket is using the
583 default congestion-control algorithm, determined as described
584 under tcp_congestion_control above.
585
586 For setsockopt() optlen specifies the length of the congestion-
587 control algorithm name contained in the buffer pointed to by
588 optval; this length need not include any terminating null byte.
589 The algorithm "reno" is always permitted; other algorithms may
590 be available, depending on kernel configuration. Possible
591 errors from setsockopt() include: algorithm not found/available
592 (ENOENT); setting this algorithm requires the CAP_NET_ADMIN
593 capability (EPERM); and failure getting kernel module (EBUSY).
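
A minimal sketch of both calls is shown below. The helper name set_and_get_congestion() is hypothetical. Because the string returned by getsockopt() may lack a terminating null byte when the buffer is too small, the sketch reads into a zero-filled buffer that is one byte longer than the 16 bytes it offers to the kernel.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: sets the congestion-control algorithm `name` on a
 * fresh TCP socket, then copies the name reported by the kernel into
 * out[0..outlen-1]. Returns 0 on success, -1 on error. */
int set_and_get_congestion(const char *name, char *out, size_t outlen)
{
    char buf[17] = "";      /* 16 bytes plus a guaranteed terminator */
    socklen_t len = 16;
    int fd, rc = -1;

    if (outlen == 0)
        return -1;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* For setsockopt(), optlen need not count the terminating null. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                   name, (socklen_t)strlen(name)) == 0 &&
        getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0) {
        strncpy(out, buf, outlen - 1);
        out[outlen - 1] = '\0';
        rc = 0;
    }
    close(fd);
    return rc;
}
```

Calling the helper with "reno" should succeed for any caller, since that algorithm is always permitted; other names may fail with the errors listed above.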
594
595 TCP_CORK (since Linux 2.2)
596 If set, don't send out partial frames. All queued partial
597 frames are sent when the option is cleared again. This is use‐
598 ful for prepending headers before calling sendfile(2), or for
599 throughput optimization. As currently implemented, there is a
600 200 millisecond ceiling on the time for which output is corked
601 by TCP_CORK. If this ceiling is reached, then queued data is
602 automatically transmitted. This option can be combined with
603 TCP_NODELAY only since Linux 2.5.71. This option should not be
604 used in code intended to be portable.
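
Setting and clearing the option might look like the sketch below; the helper names tcp_set_cork() and tcp_get_cork() are hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sets (on != 0) or clears (on == 0) TCP_CORK on fd. Returns 0 or -1. */
int tcp_set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Reads back the current TCP_CORK state: 1, 0, or -1 on error. */
int tcp_get_cork(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, &len) < 0)
        return -1;
    return on != 0;
}
```

A typical pattern would be tcp_set_cork(fd, 1), then writing the headers and calling sendfile(2), then tcp_set_cork(fd, 0) so that any queued partial frame is sent.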
605
606 TCP_DEFER_ACCEPT (since Linux 2.4)
Allows a listener to be awakened only when data arrives on the
socket. Takes an integer value (seconds); this can bound the
maximum number of attempts TCP will make to complete the connection.
This option should not be used in code intended to be portable.
612
613 TCP_INFO (since Linux 2.4)
614 Used to collect information about this socket. The kernel
615 returns a struct tcp_info as defined in the file
616 /usr/include/linux/tcp.h. This option should not be used in
617 code intended to be portable.
618
619 TCP_KEEPCNT (since Linux 2.4)
620 The maximum number of keepalive probes TCP should send before
621 dropping the connection. This option should not be used in code
622 intended to be portable.
623
624 TCP_KEEPIDLE (since Linux 2.4)
625 The time (in seconds) the connection needs to remain idle before
626 TCP starts sending keepalive probes, if the socket option
627 SO_KEEPALIVE has been set on this socket. This option should
628 not be used in code intended to be portable.
629
630 TCP_KEEPINTVL (since Linux 2.4)
631 The time (in seconds) between individual keepalive probes. This
632 option should not be used in code intended to be portable.
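
The three keepalive options are usually set together with SO_KEEPALIVE, as in this sketch. The helper name tcp_enable_keepalive() is hypothetical, and the values passed by a caller are for that caller to choose.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enables keepalive on fd and overrides the three system-wide settings
 * for this socket only. Returns 0 on success, -1 on the first failure. */
int tcp_enable_keepalive(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intvl_s, sizeof(intvl_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &count, sizeof(count)) < 0)
        return -1;
    return 0;
}
```

For example, tcp_enable_keepalive(fd, 60, 10, 5) asks for the first probe after 60 idle seconds and up to 5 probes spaced 10 seconds apart.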
633
634 TCP_LINGER2 (since Linux 2.4)
635 The lifetime of orphaned FIN_WAIT2 state sockets. This option
636 can be used to override the system-wide setting in the file
637 /proc/sys/net/ipv4/tcp_fin_timeout for this socket. This is not
638 to be confused with the socket(7) level option SO_LINGER. This
639 option should not be used in code intended to be portable.
640
641 TCP_MAXSEG
642 The maximum segment size for outgoing TCP packets. If this
643 option is set before connection establishment, it also changes
644 the MSS value announced to the other end in the initial packet.
645 Values greater than the (eventual) interface MTU have no effect.
646 TCP will also impose its minimum and maximum bounds over the
647 value provided.
648
       TCP_NODELAY
              If set, disable the Nagle algorithm. This means that segments
              are always sent as soon as possible, even if there is only a
              small amount of data. When not set, data is buffered until
              there is a sufficient amount to send out, thereby avoiding
              the frequent sending of small packets, which results in poor
              utilization of the network. This option is overridden by
              TCP_CORK; however, setting this option forces an explicit
              flush of pending output, even if TCP_CORK is currently set.

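              A short sketch of toggling the Nagle algorithm on a socket.
              The helper name set_nodelay() is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: disable the Nagle algorithm when enable is
 * non-zero, re-enable it when enable is zero.
 * Returns 0 on success, -1 on error. */
int set_nodelay(int fd, int enable)
{
    int flag = enable ? 1 : 0;

    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}
```

              Interactive protocols that exchange many small messages are
              the typical candidates for set_nodelay(fd, 1); bulk
              transfers usually do better with the default.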
       TCP_QUICKACK (since Linux 2.4.4)
              Enable quickack mode if set or disable quickack mode if
              cleared. In quickack mode, acks are sent immediately, rather
              than delayed if needed in accordance with normal TCP
              operation. This flag is not permanent; it only enables a
              switch to or from quickack mode. Subsequent operation of the
              TCP protocol will once again enter/leave quickack mode
              depending on internal protocol processing and factors such as
              delayed ack timeouts occurring and data transfer. This option
              should not be used in code intended to be portable.

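              Because the flag is not permanent, an application that wants
              immediate acks throughout a transfer has to request quickack
              mode repeatedly. A sketch, assuming the request is reissued
              after every receive; request_quickack() is a hypothetical
              helper name.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to switch fd into quickack
 * mode. The kernel may leave the mode again on its own, so callers
 * that depend on it are assumed to call this after each recv(2).
 * Returns 0 on success, -1 on error. */
int request_quickack(int fd)
{
    int on = 1;

    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &on, sizeof(on));
}
```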
       TCP_SYNCNT (since Linux 2.4)
              Set the number of SYN retransmits that TCP should send before
              aborting the attempt to connect. It cannot exceed 255. This
              option should not be used in code intended to be portable.

       TCP_WINDOW_CLAMP (since Linux 2.4)
              Bound the size of the advertised window to this value. The
              kernel imposes a minimum size of SOCK_MIN_RCVBUF/2. This
              option should not be used in code intended to be portable.

   Sockets API
       TCP provides limited support for out-of-band data, in the form of
       (a single byte of) urgent data. In Linux this means that if the
       other end sends newer out-of-band data, the older urgent data is
       inserted as normal data into the stream (even when SO_OOBINLINE is
       not set). This differs from BSD-based stacks.

       Linux uses the BSD-compatible interpretation of the urgent pointer
       field by default. This violates RFC 1122, but is required for
       interoperability with other stacks. It can be changed via
       /proc/sys/net/ipv4/tcp_stdurg.

       It is possible to peek at out-of-band data using the recv(2)
       MSG_PEEK flag.

       Since version 2.4, Linux supports the use of MSG_TRUNC in the flags
       argument of recv(2) (and recvmsg(2)). This flag causes the received
       bytes of data to be discarded, rather than passed back in a
       caller-supplied buffer. Since Linux 2.4.4, MSG_PEEK also has this
       effect when used in conjunction with MSG_OOB to receive out-of-band
       data.

   Ioctls
       The following ioctl(2) calls return information in value. The
       correct syntax is:

              int value;
              error = ioctl(tcp_socket, ioctl_type, &value);

       ioctl_type is one of the following:

       SIOCINQ
              Returns the amount of queued unread data in the receive
              buffer. The socket must not be in LISTEN state, otherwise an
              error (EINVAL) is returned.

       SIOCATMARK
              Returns true (i.e., value is non-zero) if the inbound data
              stream is at the urgent mark.

              If the SO_OOBINLINE socket option is set, and SIOCATMARK
              returns true, then the next read from the socket will return
              the urgent data. If the SO_OOBINLINE socket option is not
              set, and SIOCATMARK returns true, then the next read from the
              socket will return the bytes following the urgent data
              (actually reading the urgent data requires the MSG_OOB flag
              of recv(2)).

              Note that a read never reads across the urgent mark. If an
              application is informed of the presence of urgent data via
              select(2) (using the exceptfds argument) or through delivery
              of a SIGURG signal, then it can advance up to the mark using
              a loop which repeatedly tests SIOCATMARK and performs a read
              (requesting any number of bytes) as long as SIOCATMARK
              returns false.

       SIOCOUTQ
              Returns the amount of unsent data in the socket send queue.
              The socket must not be in LISTEN state, otherwise an error
              (EINVAL) is returned.

   Error Handling
       When a network error occurs, TCP tries to resend the packet. If it
       doesn't succeed after some time, either ETIMEDOUT or the last
       received error on this connection is reported.

       Some applications require a quicker error notification. This can be
       enabled with the IPPROTO_IP level IP_RECVERR socket option. When
       this option is enabled, all incoming errors are immediately passed
       to the user program. Use this option with care: it makes TCP less
       tolerant to routing changes and other normal network conditions.

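       A minimal sketch of turning on the quicker error notification
       described above. The helper name enable_recverr() is hypothetical.

```c
#include <netinet/in.h>    /* IPPROTO_IP, IP_RECVERR */
#include <sys/socket.h>

/* Hypothetical helper: enable IP_RECVERR on the TCP socket fd so
 * that incoming errors are passed to the program immediately.
 * Returns 0 on success, -1 on error. */
int enable_recverr(int fd)
{
    int on = 1;

    return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
}
```

       With the option set, calls on the socket can fail on errors that
       TCP would otherwise have retried, so the caller needs its own
       retry policy.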
ERRORS
       EAFNOTSUPPORT
              Passed socket address type in sin_family was not AF_INET.

       EPIPE  The other end closed the socket unexpectedly or a read is
              executed on a shut down socket.

       ETIMEDOUT
              The other end didn't acknowledge retransmitted data after
              some time.

       Any errors defined for ip(7) or the generic socket layer may also be
       returned for TCP.

VERSIONS
       Support for Explicit Congestion Notification, zero-copy sendfile(2),
       reordering and some SACK extensions (DSACK) was introduced in
       Linux 2.4. Support for forward acknowledgement (FACK), TIME_WAIT
       recycling, and per-connection keepalive socket options was
       introduced in Linux 2.3.

BUGS
       Not all errors are documented.
       IPv6 is not described.

SEE ALSO
       accept(2), bind(2), connect(2), getsockopt(2), listen(2), recvmsg(2),
       sendfile(2), sendmsg(2), socket(2), ip(7), socket(7)

       RFC 793 for the TCP specification.
       RFC 1122 for the TCP requirements and a description of the Nagle
       algorithm.
       RFC 1323 for TCP timestamp and window scaling options.
       RFC 1644 for a description of TIME_WAIT assassination hazards.
       RFC 3168 for a description of Explicit Congestion Notification.
       RFC 2581 for TCP congestion control algorithms.
       RFC 2018 and RFC 2883 for SACK and extensions to SACK.

COLOPHON
       This page is part of release 3.22 of the Linux man-pages project. A
       description of the project, and information about reporting bugs,
       can be found at http://www.kernel.org/doc/man-pages/.



Linux                             2008-12-01                            TCP(7)