epoll(7)

EPOLL(7)                   Linux Programmer's Manual                  EPOLL(7)

NAME

       epoll - I/O event notification facility

SYNOPSIS

       #include <sys/epoll.h>

DESCRIPTION

       The epoll API performs a similar task to poll(2): monitoring multiple
       file descriptors to see if I/O is possible on any of them.  The epoll
       API can be used either as an edge-triggered or a level-triggered
       interface and scales well to large numbers of watched file descriptors.
       The following system calls are provided to create and manage an epoll
       instance:

       *  epoll_create(2) creates an epoll instance and returns a file
          descriptor referring to that instance.  (The more recent
          epoll_create1(2) extends the functionality of epoll_create(2).)

       *  Interest in particular file descriptors is then registered via
          epoll_ctl(2).  The set of file descriptors currently registered on
          an epoll instance is sometimes called an epoll set.

       *  epoll_wait(2) waits for I/O events, blocking the calling thread if
          no events are currently available.

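       The three calls listed above can be combined as in the following
       minimal sketch.  The function name wait_for_input() is hypothetical
       and not part of the epoll API; it creates a throwaway epoll instance,
       registers a single descriptor for EPOLLIN, and waits once.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of ready descriptors (0 or 1)
   for 'fd' within 'timeout_ms' milliseconds, or -1 on error. */
int wait_for_input(int fd, int timeout_ms)
{
    struct epoll_event ev, out;
    int epfd, nfds;

    epfd = epoll_create1(0);                  /* create the epoll instance */
    if (epfd == -1)
        return -1;

    ev.events = EPOLLIN;                      /* interested in readability */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }

    nfds = epoll_wait(epfd, &out, 1, timeout_ms);   /* wait for an event */
    close(epfd);
    return nfds;
}
```

       A long-running program would normally create the epoll instance once
       and reuse it, rather than creating one per wait as this sketch does.
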
   Level-triggered and edge-triggered
       The epoll event distribution interface is able to behave both as
       edge-triggered (ET) and as level-triggered (LT).  The difference
       between the two mechanisms can be described as follows.  Suppose that
       this scenario happens:

       1. The file descriptor that represents the read side of a pipe (rfd) is
          registered on the epoll instance.

       2. A pipe writer writes 2 kB of data on the write side of the pipe.

       3. A call to epoll_wait(2) is done that will return rfd as a ready file
          descriptor.

       4. The pipe reader reads 1 kB of data from rfd.

       5. A call to epoll_wait(2) is done.

       If the rfd file descriptor has been added to the epoll interface using
       the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in
       step 5 will probably hang despite the available data still present in
       the file input buffer; meanwhile the remote peer might be expecting a
       response based on the data it already sent.  The reason for this is
       that edge-triggered mode delivers events only when changes occur on the
       monitored file descriptor.  So, in step 5 the caller might end up
       waiting for some data that is already present inside the input buffer.
       In the above example, an event on rfd will be generated because of the
       write done in step 2, and the event is consumed in step 3.  Since the
       read operation done in step 4 does not consume the whole buffer data,
       the call to epoll_wait(2) done in step 5 might block indefinitely.

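       The five steps above can be replayed on a real pipe.  The following
       is a sketch only: step5_ready_count() is a hypothetical name, and
       step 5 uses a zero timeout so that the call returns instead of
       hanging.  With EPOLLET the last call is expected to report no ready
       descriptor; without it (level-triggered) it is expected to report
       one, because 1 kB is still unread.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: replays steps 1-5 and returns the result of the
   epoll_wait() call made in step 5, or -1 if any earlier step failed. */
int step5_ready_count(int edge_triggered)
{
    int pfd[2], epfd, n = -1;
    char buf[2048];
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    ev.events = EPOLLIN;
    if (edge_triggered)
        ev.events |= EPOLLET;
    ev.data.fd = pfd[0];
    memset(buf, 'x', sizeof(buf));

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&       /* step 1 */
        write(pfd[1], buf, sizeof(buf)) == (ssize_t) sizeof(buf) && /* step 2 */
        epoll_wait(epfd, &out, 1, 1000) == 1 &&                   /* step 3 */
        read(pfd[0], buf, 1024) == 1024)                          /* step 4 */
        n = epoll_wait(epfd, &out, 1, 0);                         /* step 5 */

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```
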
       An application that employs the EPOLLET flag should use nonblocking
       file descriptors to avoid having a blocking read or write starve a task
       that is handling multiple file descriptors.  The suggested way to use
       epoll as an edge-triggered (EPOLLET) interface is as follows:

              i   with nonblocking file descriptors; and

              ii  by waiting for an event only after read(2) or write(2)
                  return EAGAIN.

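       Rules i and ii can be sketched as two small helpers.  Both names,
       set_nonblocking() and drain_fd(), are hypothetical and chosen for
       illustration; a real program would process the data where the
       comment indicates rather than discard it.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper for rule i: switch 'fd' to nonblocking mode.
   Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper for rule ii: read from a nonblocking 'fd' until
   read(2) fails with EAGAIN or reports end of file.  Returns the number
   of bytes consumed, or -1 on any other error.  Only after this returns
   should the caller wait for the next edge-triggered event. */
long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0)
            total += n;              /* a real program processes buf here */
        else if (n == 0)
            return total;            /* end of file */
        else if (errno == EINTR)
            continue;                /* interrupted by a signal: retry */
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;            /* input exhausted for now */
        else
            return -1;
    }
}
```
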
       By contrast, when used as a level-triggered interface (the default,
       when EPOLLET is not specified), epoll is simply a faster poll(2), and
       can be used wherever the latter is used since it shares the same
       semantics.

       Since even with edge-triggered epoll, multiple events can be generated
       upon receipt of multiple chunks of data, the caller has the option to
       specify the EPOLLONESHOT flag, to tell epoll to disable the associated
       file descriptor after the receipt of an event with epoll_wait(2).  When
       the EPOLLONESHOT flag is specified, it is the caller's responsibility
       to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.

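       The rearming requirement of EPOLLONESHOT can be observed with the
       following sketch.  The function oneshot_sequence() is a hypothetical
       name; it records the results of three epoll_wait(2) calls on a pipe
       that holds one unread byte throughout.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: stores in seq[0..2] the results of epoll_wait()
   (0) after the first event, (1) again without rearming, and (2) after
   rearming with EPOLL_CTL_MOD.  Returns 0 on success, -1 on error. */
int oneshot_sequence(int seq[3])
{
    int pfd[2], epfd, rc = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&
        write(pfd[1], "x", 1) == 1) {
        seq[0] = epoll_wait(epfd, &out, 1, 1000);  /* event is delivered */
        seq[1] = epoll_wait(epfd, &out, 1, 0);     /* fd is now disabled */

        /* Rearm the descriptor; the byte is still unread. */
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev) == 0) {
            seq[2] = epoll_wait(epfd, &out, 1, 1000);
            rc = 0;
        }
    }

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```
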
   /proc interfaces
       The following interfaces can be used to limit the amount of kernel
       memory consumed by epoll:

       /proc/sys/fs/epoll/max_user_watches (since Linux 2.6.28)
              This specifies a limit on the total number of file descriptors
              that a user can register across all epoll instances on the
              system.  The limit is per real user ID.  Each registered file
              descriptor costs roughly 90 bytes on a 32-bit kernel, and
              roughly 160 bytes on a 64-bit kernel.  Currently, the default
              value for max_user_watches is 1/25 (4%) of the available low
              memory, divided by the registration cost in bytes.

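       As an illustration, the limit can be read from a program, and the
       default described above can be approximated with simple arithmetic.
       Both function names below are hypothetical, and the estimate is only
       a byte-based approximation of the rule stated in the text, not the
       kernel's exact computation.

```c
#include <stdio.h>

/* Hypothetical helper: returns the current limit, or -1 if the file
   cannot be read (for example, on kernels older than 2.6.28). */
long read_max_user_watches(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Hypothetical helper: 1/25 of low memory, divided by the cost in bytes
   of one registration (roughly 90 on 32-bit, 160 on 64-bit kernels). */
unsigned long long estimate_default_watches(unsigned long long low_mem_bytes,
                                            unsigned int cost_bytes)
{
    return low_mem_bytes / 25 / cost_bytes;
}
```

       For example, with 1 GiB of low memory and a cost of 160 bytes, the
       estimate is 1073741824 / 25 / 160, or about 268000 registrations.
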
   Example for suggested usage
       While the usage of epoll when employed as a level-triggered interface
       does have the same semantics as poll(2), the edge-triggered usage
       requires more clarification to avoid stalls in the application event
       loop.  In this example, listen_sock is a nonblocking socket on which
       listen(2) has been called.  The function do_use_fd() uses the new ready
       file descriptor until EAGAIN is returned by either read(2) or write(2).
       An event-driven state machine application should, after having received
       EAGAIN, record its current state so that at the next call to
       do_use_fd() it will continue to read(2) or write(2) from where it
       stopped before.

           #define MAX_EVENTS 10
           struct epoll_event ev, events[MAX_EVENTS];
           int listen_sock, conn_sock, nfds, epollfd, n;
           struct sockaddr_storage local;  /* peer address from accept() */
           socklen_t addrlen;

           /* Set up listening socket, 'listen_sock' (socket(),
              bind(), listen()) */

           epollfd = epoll_create(10);
           if (epollfd == -1) {
               perror("epoll_create");
               exit(EXIT_FAILURE);
           }

           ev.events = EPOLLIN;
           ev.data.fd = listen_sock;
           if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) {
               perror("epoll_ctl: listen_sock");
               exit(EXIT_FAILURE);
           }

           for (;;) {
               nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
               if (nfds == -1) {
                   perror("epoll_wait");
                   exit(EXIT_FAILURE);
               }

               for (n = 0; n < nfds; ++n) {
                   if (events[n].data.fd == listen_sock) {
                       addrlen = sizeof(local);
                       conn_sock = accept(listen_sock,
                                       (struct sockaddr *) &local, &addrlen);
                       if (conn_sock == -1) {
                           perror("accept");
                           exit(EXIT_FAILURE);
                       }
                       setnonblocking(conn_sock);
                       ev.events = EPOLLIN | EPOLLET;
                       ev.data.fd = conn_sock;
                       if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                                   &ev) == -1) {
                           perror("epoll_ctl: conn_sock");
                           exit(EXIT_FAILURE);
                       }
                   } else {
                       do_use_fd(events[n].data.fd);
                   }
               }
           }

       When used as an edge-triggered interface, for performance reasons, it
       is possible to add the file descriptor to the epoll interface
       (EPOLL_CTL_ADD) once by specifying (EPOLLIN|EPOLLOUT).  This allows you
       to avoid continuously switching between EPOLLIN and EPOLLOUT by calling
       epoll_ctl(2) with EPOLL_CTL_MOD.

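       A single registration for both directions might look like the
       following sketch, which uses a UNIX domain socket pair for
       illustration.  The function name both_directions() is hypothetical.
       It records the event mask reported right after registration and the
       mask reported after the peer has sent one byte.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: registers one socket once with
   EPOLLIN | EPOLLOUT | EPOLLET and stores the masks reported before and
   after the peer writes.  Returns 0 on success, -1 on error. */
int both_directions(unsigned int *before, unsigned int *after)
{
    int sv[2], epfd, rc = -1;
    struct epoll_event ev, out;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;   /* registered only once */
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) == 0 &&
        epoll_wait(epfd, &out, 1, 1000) == 1) {
        *before = out.events;             /* writable, nothing to read yet */
        if (write(sv[1], "x", 1) == 1 &&
            epoll_wait(epfd, &out, 1, 1000) == 1) {
            *after = out.events;          /* the peer's byte has arrived */
            rc = 0;
        }
    }

    close(epfd);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```
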
   Questions and answers
       Q0  What is the key used to distinguish the file descriptors registered
           in an epoll set?

       A0  The key is the combination of the file descriptor number and the
           open file description (also known as an "open file handle", the
           kernel's internal representation of an open file).

       Q1  What happens if you register the same file descriptor on an epoll
           instance twice?

       A1  You will probably get EEXIST.  However, it is possible to add a
           duplicate (dup(2), dup2(2), fcntl(2) F_DUPFD) descriptor to the
           same epoll instance.  This can be a useful technique for filtering
           events, if the duplicate file descriptors are registered with
           different event masks.

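       The behavior described in A1 can be checked with a short sketch.
       The function name duplicate_registration_check() is hypothetical,
       and the event masks are arbitrary choices for illustration.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if a second EPOLL_CTL_ADD of the same
   descriptor fails with EEXIST while a dup(2) of that descriptor can be
   added with a different event mask; returns -1 otherwise. */
int duplicate_registration_check(void)
{
    int pfd[2], epfd, dupfd = -1, rc = -1;
    struct epoll_event ev;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1 &&  /* same fd */
        errno == EEXIST &&
        (dupfd = dup(pfd[0])) != -1) {
        ev.events = EPOLLIN | EPOLLET;    /* a different mask for the dup */
        ev.data.fd = dupfd;
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, dupfd, &ev) == 0)
            rc = 0;
    }

    if (dupfd != -1)
        close(dupfd);
    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```
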
       Q2  Can two epoll instances wait for the same file descriptor?  If so,
           are events reported to both epoll file descriptors?

       A2  Yes, and events would be reported to both.  However, careful
           programming may be needed to do this correctly.

       Q3  Is the epoll file descriptor itself poll/epoll/selectable?

       A3  Yes.  If an epoll file descriptor has events waiting, then it will
           be reported as readable.

       Q4  What happens if one attempts to put an epoll file descriptor into
           its own file descriptor set?

       A4  The epoll_ctl(2) call will fail (EINVAL).  However, you can add an
           epoll file descriptor inside another epoll file descriptor set.

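       Answers A3 and A4 can be illustrated together.  In this sketch the
       function name nested_epoll_check() and the variable names inner and
       outer are hypothetical.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if (a) adding an epoll descriptor to
   its own set fails with EINVAL, (b) adding it to another epoll instance
   succeeds, and (c) the outer instance reports the inner one as readable
   once the inner one has an event waiting.  Returns -1 otherwise. */
int nested_epoll_check(void)
{
    int pfd[2], inner, outer, rc = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    inner = epoll_create1(0);
    outer = epoll_create1(0);
    if (inner == -1 || outer == -1)
        goto done;

    ev.events = EPOLLIN;
    ev.data.fd = inner;
    if (epoll_ctl(inner, EPOLL_CTL_ADD, inner, &ev) != -1 ||
        errno != EINVAL)                               /* (a) own set */
        goto done;
    if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) == -1)  /* (b) */
        goto done;

    ev.data.fd = pfd[0];
    if (epoll_ctl(inner, EPOLL_CTL_ADD, pfd[0], &ev) == -1 ||
        write(pfd[1], "x", 1) != 1)
        goto done;

    if (epoll_wait(outer, &out, 1, 1000) == 1 &&       /* (c) */
        out.data.fd == inner && (out.events & EPOLLIN))
        rc = 0;

done:
    if (inner != -1)
        close(inner);
    if (outer != -1)
        close(outer);
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```
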
       Q5  Can I send an epoll file descriptor over a UNIX domain socket to
           another process?

       A5  Yes, but it does not make sense to do this, since the receiving
           process would not have copies of the file descriptors in the epoll
           set.

       Q6  Will closing a file descriptor cause it to be removed from all
           epoll sets automatically?

       A6  Yes, but be aware of the following point.  A file descriptor is a
           reference to an open file description (see open(2)).  Whenever a
           descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or
           fork(2), a new file descriptor referring to the same open file
           description is created.  An open file description continues to
           exist until all file descriptors referring to it have been closed.
           A file descriptor is removed from an epoll set only after all the
           file descriptors referring to the underlying open file description
           have been closed (or before if the descriptor is explicitly removed
           using epoll_ctl(2) EPOLL_CTL_DEL).  This means that even after a
           file descriptor that is part of an epoll set has been closed,
           events may be reported for that file descriptor if other file
           descriptors referring to the same underlying file description
           remain open.

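       The pitfall described in A6 can be reproduced with the sketch below.
       The function name event_after_close() is hypothetical.  It registers
       the read end of a pipe, closes it while a dup(2) of it stays open,
       and then checks which descriptor number epoll_wait(2) reports.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: stores the registered descriptor number in
   *registered and returns the descriptor number that epoll_wait()
   reports after that descriptor has been closed, or -1 if no event was
   reported or an error occurred. */
int event_after_close(int *registered)
{
    int pfd[2], epfd, dupfd, reported = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    dupfd = dup(pfd[0]);      /* second reference to the file description */
    *registered = pfd[0];

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epfd != -1 && dupfd != -1 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0) {
        close(pfd[0]);        /* registration survives: dupfd is open */
        pfd[0] = -1;
        if (write(pfd[1], "x", 1) == 1 &&
            epoll_wait(epfd, &out, 1, 1000) == 1)
            reported = out.data.fd;
    }

    if (pfd[0] != -1)
        close(pfd[0]);
    if (dupfd != -1)
        close(dupfd);
    if (epfd != -1)
        close(epfd);
    close(pfd[1]);
    return reported;
}
```

       Calling epoll_ctl(2) with EPOLL_CTL_DEL before close(2) avoids such
       stale reports.
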
       Q7  If more than one event occurs between epoll_wait(2) calls, are they
           combined or reported separately?

       A7  They will be combined.

       Q8  Does an operation on a file descriptor affect the already collected
           but not yet reported events?

       A8  Two operations can be applied to an already registered file
           descriptor: remove (EPOLL_CTL_DEL) and modify (EPOLL_CTL_MOD).
           Remove would be meaningless for this case.  Modify will reread
           available I/O.

       Q9  Do I need to continuously read/write a file descriptor until EAGAIN
           when using the EPOLLET flag (edge-triggered behavior)?

       A9  Receiving an event from epoll_wait(2) should suggest to you that
           the file descriptor is ready for the requested I/O operation.  You
           must consider it ready until the next (nonblocking) read/write
           yields EAGAIN.  When and how you will use the file descriptor is
           entirely up to you.

           For packet/token-oriented files (e.g., datagram socket, terminal in
           canonical mode), the only way to detect the end of the read/write
           I/O space is to continue to read/write until EAGAIN.

           For stream-oriented files (e.g., pipe, FIFO, stream socket), the
           condition that the read/write I/O space is exhausted can also be
           detected by checking the amount of data read from / written to the
           target file descriptor.  For example, if you call read(2) by asking
           to read a certain amount of data and read(2) returns a lower number
           of bytes, you can be sure of having exhausted the read I/O space
           for the file descriptor.  The same is true when writing using
           write(2).  (Avoid this latter technique if you cannot guarantee
           that the monitored file descriptor always refers to a
           stream-oriented file.)

   Possible pitfalls and ways to avoid them
       o Starvation (edge-triggered)

       If there is a large amount of I/O space, it is possible that by trying
       to drain it the other files will not get processed, causing starvation.
       (This problem is not specific to epoll.)

       The solution is to maintain a ready list and mark the file descriptor
       as ready in its associated data structure, thereby allowing the
       application to remember which files need to be processed but still
       round robin amongst all the ready files.  This also supports ignoring
       subsequent events you receive for file descriptors that are already
       ready.

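       One possible shape for such a ready list is sketched below.  The
       structure, the function names, and the 4096-byte per-turn budget are
       all assumptions made for illustration.  Each ready descriptor gets
       one bounded read per pass, so a descriptor with a large backlog
       cannot starve the others, and it stays marked ready until read(2)
       reports EAGAIN or end of file.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define TURN_BUDGET 4096      /* hypothetical per-turn byte budget */

struct conn {
    int fd;
    int ready;                /* set by the caller when epoll reports fd */
    size_t total;             /* bytes consumed so far */
};

/* Hypothetical helper: prepare a connection record; makes fd nonblocking. */
void conn_init(struct conn *c, int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags != -1)
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    c->fd = fd;
    c->ready = 0;
    c->total = 0;
}

/* Hypothetical helper: one round-robin pass over all records.  Returns
   how many records are still marked ready after the pass. */
size_t service_ready(struct conn *conns, size_t nconns)
{
    char buf[TURN_BUDGET];
    size_t i, still_ready = 0;

    for (i = 0; i < nconns; i++) {
        ssize_t n;

        if (!conns[i].ready)
            continue;
        n = read(conns[i].fd, buf, sizeof(buf));    /* one bounded turn */
        if (n > 0)
            conns[i].total += (size_t) n;   /* real code processes buf */
        else if (n == 0 || errno != EINTR)
            conns[i].ready = 0;             /* EOF, EAGAIN, or an error */
        if (conns[i].ready)
            still_ready++;
    }
    return still_ready;
}
```

       An event loop following this design would call epoll_wait(2) with a
       zero timeout while any record is still marked ready, and block only
       when the ready list is empty.
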
       o If using an event cache...

       If you use an event cache or store all the file descriptors returned
       from epoll_wait(2), then make sure to provide a way to mark a file
       descriptor's closure dynamically (i.e., caused by a previous event's
       processing).  Suppose you receive 100 events from epoll_wait(2), and in
       event #47 a condition causes event #13 to be closed.  If you remove the
       structure and close(2) the file descriptor for event #13, then your
       event cache might still say there are events waiting for that file
       descriptor, causing confusion.

       One solution for this is to call, during the processing of event #47,
       epoll_ctl(EPOLL_CTL_DEL) to delete file descriptor 13 and close(2),
       then mark its associated data structure as removed and link it to a
       cleanup list.  If you find another event for file descriptor 13 in your
       batch processing, you will discover the file descriptor had been
       previously removed and there will be no confusion.
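       The solution described above might be sketched as follows.  The
       structure cached_fd and both function names are hypothetical, and
       the sketch assumes that each registration stores a pointer to its
       structure in the data.ptr field of struct epoll_event.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

struct cached_fd {
    int fd;
    int removed;                  /* set once fd has been closed */
    struct cached_fd *next_dead;  /* link in the cleanup list */
};

/* Hypothetical helper: called while another event of the same batch is
   being processed.  Deregisters and closes the descriptor, marks the
   structure as removed, and links it to the cleanup list. */
void cached_fd_remove(int epfd, struct cached_fd *c,
                      struct cached_fd **cleanup)
{
    struct epoll_event dummy = { 0 };   /* non-NULL for old kernels */

    if (c->removed)
        return;
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, &dummy);
    close(c->fd);
    c->removed = 1;
    c->next_dead = *cleanup;
    *cleanup = c;
}

/* Hypothetical helper: nonzero if the event refers to a structure that
   was removed earlier in the same batch and must be skipped. */
int event_is_stale(const struct epoll_event *e)
{
    const struct cached_fd *c = e->data.ptr;

    return c->removed;
}
```

       The structures on the cleanup list would be freed only after the
       whole batch has been processed, so that pointers held by later
       events in the batch remain valid.
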

VERSIONS

       The epoll API was introduced in Linux kernel 2.5.44.  Support was added
       to glibc in version 2.3.2.

CONFORMING TO

       The epoll API is Linux-specific.  Some other systems provide similar
       mechanisms, for example, FreeBSD has kqueue, and Solaris has /dev/poll.

COLOPHON

       This page is part of release 3.53 of the Linux man-pages project.  A
       description of the project, and information about reporting bugs, can
       be found at http://www.kernel.org/doc/man-pages/.



Linux                             2012-04-17                          EPOLL(7)
