seccomp_unotify(2)

SECCOMP_UNOTIFY(2)         Linux Programmer's Manual        SECCOMP_UNOTIFY(2)

NAME

       seccomp_unotify - Seccomp user-space notification mechanism

SYNOPSIS

       #include <linux/seccomp.h>
       #include <linux/filter.h>
       #include <linux/audit.h>

       int seccomp(unsigned int operation, unsigned int flags, void *args);

       #include <sys/ioctl.h>

       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
                 struct seccomp_notif *req);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
                 struct seccomp_notif_resp *resp);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ADDFD,
                 struct seccomp_notif_addfd *addfd);

DESCRIPTION

       This page describes the user-space notification mechanism provided by
       the Secure Computing (seccomp) facility.  As well as the use of the
       SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SECCOMP_RET_USER_NOTIF
       action value, and the SECCOMP_GET_NOTIF_SIZES operation described in
       seccomp(2), this mechanism involves the use of a number of related
       ioctl(2) operations (described below).

   Overview
       In conventional usage of a seccomp filter, the decision about how to
       treat a system call is made by the filter itself.  By contrast, the
       user-space notification mechanism allows the seccomp filter to delegate
       the handling of the system call to another user-space process.  Note
       that this mechanism is explicitly not intended as a method of
       implementing security policy; see NOTES.

       In the discussion that follows, the thread(s) on which the seccomp
       filter is installed is (are) referred to as the target, and the process
       that is notified by the user-space notification mechanism is referred
       to as the supervisor.

       A suitably privileged supervisor can use the user-space notification
       mechanism to perform actions on behalf of the target.  The advantage of
       the user-space notification mechanism is that the supervisor will
       usually be able to retrieve information about the target and the
       performed system call that the seccomp filter itself cannot.  (A
       seccomp filter is limited in the information it can obtain and the
       actions that it can perform because it is running on a virtual machine
       inside the kernel.)

       An overview of the steps performed by the target and the supervisor is
       as follows:

       1. The target establishes a seccomp filter in the usual manner, but
          with two differences:

          • The seccomp(2) flags argument includes the flag
            SECCOMP_FILTER_FLAG_NEW_LISTENER.  Consequently, the return value
            of the (successful) seccomp(2) call is a new "listening" file
            descriptor that can be used to receive notifications.  Only one
            "listening" seccomp filter can be installed for a thread.

          • In cases where it is appropriate, the seccomp filter returns the
            action value SECCOMP_RET_USER_NOTIF.  This return value will
            trigger a notification event.

       2. In order that the supervisor can obtain notifications using the
          listening file descriptor, (a duplicate of) that file descriptor
          must be passed from the target to the supervisor.  One way in which
          this could be done is by passing the file descriptor over a UNIX
          domain socket connection between the target and the supervisor
          (using the SCM_RIGHTS ancillary message type described in unix(7)).
          Another way to do this is through the use of pidfd_getfd(2).

       3. The supervisor will receive notification events on the listening
          file descriptor.  These events are returned as structures of type
          seccomp_notif.  Because this structure and its size may evolve over
          kernel versions, the supervisor must first determine the size of
          this structure using the seccomp(2) SECCOMP_GET_NOTIF_SIZES
          operation, which returns a structure of type seccomp_notif_sizes.
          The supervisor allocates a buffer of size
          seccomp_notif_sizes.seccomp_notif bytes to receive notification
          events.  In addition, the supervisor allocates another buffer of
          size seccomp_notif_sizes.seccomp_notif_resp bytes for the response
          (a struct seccomp_notif_resp structure) that it will provide to the
          kernel (and thus the target).

       4. The target then performs its workload, which includes system calls
          that will be controlled by the seccomp filter.  Whenever one of
          these system calls causes the filter to return the
          SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet)
          execute the system call; instead, execution of the target is
          temporarily blocked inside the kernel (in a sleep state that is
          interruptible by signals) and a notification event is generated on
          the listening file descriptor.

       5. The supervisor can now repeatedly monitor the listening file
          descriptor for SECCOMP_RET_USER_NOTIF-triggered events.  To do this,
          the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation
          to read information about a notification event; this operation
          blocks until an event is available.  The operation returns a
          seccomp_notif structure containing information about the system
          call that is being attempted by the target.  (As described in NOTES,
          the file descriptor can also be monitored with select(2), poll(2),
          or epoll(7).)

       6. The seccomp_notif structure returned by the SECCOMP_IOCTL_NOTIF_RECV
          operation includes the same information (a seccomp_data structure)
          that was passed to the seccomp filter.  This information allows the
          supervisor to discover the system call number and the arguments for
          the target's system call.  In addition, the notification event
          contains the ID of the thread that triggered the notification and a
          unique cookie value that is used in subsequent
          SECCOMP_IOCTL_NOTIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND
          operations.

          The information in the notification can be used to discover the
          values of pointer arguments for the target's system call.  (This is
          something that can't be done from within a seccomp filter.)  One way
          in which the supervisor can do this is to open the corresponding
          /proc/[tid]/mem file (see proc(5)) and read bytes from the location
          that corresponds to one of the pointer arguments whose value is
          supplied in the notification event.  (The supervisor must be careful
          to avoid a race condition that can occur when doing this; see the
          description of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation
          below.)  In addition, the supervisor can access other system
          information that is visible in user space but which is not
          accessible from a seccomp filter.

       7. Having obtained information as per the previous step, the supervisor
          may then choose to perform an action in response to the target's
          system call (which, as noted above, is not executed when the seccomp
          filter returns the SECCOMP_RET_USER_NOTIF action value).

          One example use case here relates to containers.  The target may be
          located inside a container where it does not have sufficient
          capabilities to mount a filesystem in the container's mount
          namespace.  However, the supervisor may be a more privileged process
          that does have sufficient capabilities to perform the mount
          operation.

       8. The supervisor then sends a response to the notification.  The
          information in this response is used by the kernel to construct a
          return value for the target's system call and provide a value that
          will be assigned to the errno variable of the target.

          The response is sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2)
          operation, which is used to transmit a seccomp_notif_resp structure
          to the kernel.  This structure must include the cookie value that
          the supervisor obtained in the seccomp_notif structure returned by
          the SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the kernel
          to associate the response with the target.

       9. Once the notification response has been sent, the system call in
          the target thread unblocks, returning the information that was
          provided by the supervisor in the notification response.

       As a variation on the last two steps, the supervisor can send a
       response that tells the kernel that it should execute the target
       thread's system call; see the discussion of
       SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
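
       As an illustration of step 1 above, the following is a minimal sketch
       of how a target might install a filter that yields a listening file
       descriptor.  The names build_notify_filter(), install_notify_filter(),
       and NOTIFY_FILTER_LEN are hypothetical.  The sketch assumes that
       seccomp(2) has to be invoked via syscall(2), and, for brevity, the
       filter omits the architecture check that a real filter should make.

```c
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NOTIFY_FILTER_LEN 4     /* Hypothetical: instructions in the filter */

/* Fill 'insns' with a BPF program that returns SECCOMP_RET_USER_NOTIF
   for system call number 'sysno' and SECCOMP_RET_ALLOW for all others.
   Assumption: seccomp_data.arch is not checked here; a real filter
   should verify it before looking at the system call number. */
void
build_notify_filter(struct sock_filter insns[NOTIFY_FILTER_LEN], int sysno)
{
    struct sock_filter prog[NOTIFY_FILTER_LEN] = {
        /* Load the system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* On a match, fall through; otherwise skip one instruction */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (__u32) sysno, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(insns, prog, sizeof(prog));
}

/* Install the filter in the calling thread.  On success, returns the
   listening file descriptor; on error, returns -1 with errno set. */
int
install_notify_filter(int sysno)
{
    struct sock_filter insns[NOTIFY_FILTER_LEN];
    struct sock_fprog prog = { .len = NOTIFY_FILTER_LEN, .filter = insns };

    build_notify_filter(insns, sysno);

    /* Required unless the caller has CAP_SYS_ADMIN; see seccomp(2) */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

       The descriptor returned by install_notify_filter() would then be
       passed to the supervisor as described in step 2.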

IOCTL OPERATIONS

       The following ioctl(2) operations are supported by the seccomp
       user-space notification file descriptor.  For each of these
       operations, the first (file descriptor) argument of ioctl(2) is the
       listening file descriptor returned by a call to seccomp(2) with the
       SECCOMP_FILTER_FLAG_NEW_LISTENER flag.

   SECCOMP_IOCTL_NOTIF_RECV
       The SECCOMP_IOCTL_NOTIF_RECV operation (available since Linux 5.0) is
       used to obtain a user-space notification event.  If no such event is
       currently pending, the operation blocks until an event occurs.  The
       third ioctl(2) argument is a pointer to a structure of the following
       form which contains information about the event.  This structure must
       be zeroed out before the call.

           struct seccomp_notif {
               __u64  id;              /* Cookie */
               __u32  pid;             /* TID of target thread */
               __u32  flags;           /* Currently unused (0) */
               struct seccomp_data data;   /* See seccomp(2) */
           };

       The fields in this structure are as follows:

       id     This is a cookie for the notification.  Each such cookie is
              guaranteed to be unique for the corresponding seccomp filter.

              • The cookie can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID
                ioctl(2) operation described below.

              • When returning a notification response to the kernel, the
                supervisor must include the cookie value in the
                seccomp_notif_resp structure that is specified as the
                argument of the SECCOMP_IOCTL_NOTIF_SEND operation.

       pid    This is the thread ID of the target thread that triggered the
              notification event.

       flags  This is a bit mask of flags providing further information on
              the event.  In the current implementation, this field is always
              zero.

       data   This is a seccomp_data structure containing information about
              the system call that triggered the notification.  This is the
              same structure that is passed to the seccomp filter.  See
              seccomp(2) for details of this structure.

       On success, this operation returns 0; on failure, -1 is returned, and
       errno is set to indicate the cause of the error.  This operation can
       fail with the following errors:

       EINVAL (since Linux 5.5)
              The seccomp_notif structure that was passed to the call
              contained nonzero fields.

       ENOENT The target thread was killed by a signal as the notification
              information was being generated, or the target's (blocked)
              system call was interrupted by a signal handler.
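
       The following minimal sketch shows one way a supervisor might size
       its request buffer and fetch a notification.  The helper names
       alloc_notif() and recv_notification() are hypothetical.  Retrying
       when the supervisor's own ioctl(2) call fails with EINTR is a design
       choice of the sketch, not a requirement stated above.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Allocate a zeroed buffer large enough for the running kernel's
   seccomp_notif structure.  The size is stored in '*size'.
   Returns NULL on failure. */
struct seccomp_notif *
alloc_notif(size_t *size)
{
    struct seccomp_notif_sizes sizes;

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
        return NULL;

    /* Never allocate less than the structure this code was built with */
    *size = sizes.seccomp_notif;
    if (*size < sizeof(struct seccomp_notif))
        *size = sizeof(struct seccomp_notif);

    return calloc(1, *size);
}

/* Wait for the next notification on 'notifyFd'.  Returns 0 on success,
   or -1 with errno set (for example, ENOENT if the target went away). */
int
recv_notification(int notifyFd, struct seccomp_notif *req, size_t size)
{
    for (;;) {
        memset(req, 0, size);   /* Buffer must be zeroed before each call */

        if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
            return 0;

        if (errno != EINTR)     /* Assumption: retry only if interrupted */
            return -1;
    }
}
```

       A caller that sees ENOENT from recv_notification() would typically
       just wait for the next notification.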

   SECCOMP_IOCTL_NOTIF_ID_VALID
       The SECCOMP_IOCTL_NOTIF_ID_VALID operation (available since Linux 5.0)
       is used to check that a notification ID returned by an earlier
       SECCOMP_IOCTL_NOTIF_RECV operation is still valid (i.e., that the
       target still exists and its system call is still blocked waiting for a
       response).

       The third ioctl(2) argument is a pointer to the cookie (id) returned by
       the SECCOMP_IOCTL_NOTIF_RECV operation.

       This operation is necessary to avoid race conditions that can occur
       when the target identified by the pid returned by the
       SECCOMP_IOCTL_NOTIF_RECV operation terminates, and that ID is reused
       by another process.  An example of this kind of race is the following:

       1. A notification is generated on the listening file descriptor.  The
          returned seccomp_notif contains the TID of the target thread (in the
          pid field of the structure).

       2. The target terminates.

       3. Another thread or process is created on the system that by chance
          reuses the TID that was freed when the target terminated.

       4. The supervisor open(2)s the /proc/[tid]/mem file for the TID
          obtained in step 1, with the intention of (say) inspecting the
          memory location(s) that contain the argument(s) of the system call
          that triggered the notification in step 1.

       In the above scenario, the risk is that the supervisor may try to
       access the memory of a process other than the target.  This race can
       be avoided by following the call to open(2) with a
       SECCOMP_IOCTL_NOTIF_ID_VALID operation to verify that the process that
       generated the notification is still alive.  (Note that if the target
       terminates after the latter step, a subsequent read(2) from the file
       descriptor may return 0, indicating end of file.)

       See NOTES for a discussion of other cases where
       SECCOMP_IOCTL_NOTIF_ID_VALID checks must be performed.

       On success (i.e., the notification ID is still valid), this operation
       returns 0.  On failure (i.e., the notification ID is no longer valid),
       -1 is returned, and errno is set to ENOENT.

   SECCOMP_IOCTL_NOTIF_SEND
       The SECCOMP_IOCTL_NOTIF_SEND operation (available since Linux 5.0) is
       used to send a notification response back to the kernel.  The third
       ioctl(2) argument is a pointer to a structure of the following form:

           struct seccomp_notif_resp {
               __u64 id;           /* Cookie value */
               __s64 val;          /* Success return value */
               __s32 error;        /* 0 (success) or negative error number */
               __u32 flags;        /* See below */
           };

       The fields of this structure are as follows:

       id     This is the cookie value that was obtained using the
              SECCOMP_IOCTL_NOTIF_RECV operation.  This cookie value allows
              the kernel to correctly associate this response with the system
              call that triggered the user-space notification.

       val    This is the value that will be used for a spoofed success return
              for the target's system call; see below.

       error  This is the value that will be used as the error number (errno)
              for a spoofed error return for the target's system call; see
              below.

       flags  This is a bit mask that includes zero or more of the following
              flags:

              SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
                     Tell the kernel to execute the target's system call.

       Two kinds of response are possible:

       • A response to the kernel telling it to execute the target's system
         call.  In this case, the flags field includes
         SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error and val fields must
         be zero.

         This kind of response can be useful in cases where the supervisor
         needs to do deeper analysis of the target's system call than is
         possible from a seccomp filter (e.g., examining the values of
         pointer arguments), and, having decided that the system call does
         not require emulation by the supervisor, the supervisor wants the
         system call to be executed normally in the target.

         The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used with
         caution; see NOTES.

       • A spoofed return value for the target's system call.  In this case,
         the kernel does not execute the target's system call, instead
         causing the system call to return a spoofed value as specified by
         fields of the seccomp_notif_resp structure.  The supervisor should
         set the fields of this structure as follows:

         +  flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.

         +  error is set either to 0 for a spoofed "success" return or to a
            negative error number for a spoofed "failure" return.  In the
            former case, the kernel causes the target's system call to return
            the value specified in the val field.  In the latter case, the
            kernel causes the target's system call to return -1, and errno is
            assigned the negated error value.

         +  val is set to a value that will be used as the return value for a
            spoofed "success" return for the target's system call.  The value
            in this field is ignored if the error field contains a nonzero
            value.

       On success, this operation returns 0; on failure, -1 is returned, and
       errno is set to indicate the cause of the error.  This operation can
       fail with the following errors:

       EINPROGRESS
              A response to this notification has already been sent.

       EINVAL An invalid value was specified in the flags field.

       EINVAL The flags field contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and
              the error or val field was not zero.

       ENOENT The blocked system call in the target has been interrupted by a
              signal handler or the target has terminated.
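
       The two kinds of response can be sketched with a few small helpers.
       All of the function names below are hypothetical.  Note that
       resp_failure() takes a positive errno value and stores its negation,
       as the error field requires.

```c
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Spoofed "success": the target's system call returns 'val' */
void
resp_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Spoofed "failure": the target's system call returns -1 and errno
   is set to 'err' (a positive value such as EPERM) */
void
resp_failure(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;         /* Kernel expects a negative number */
}

/* Let the kernel execute the target's system call; 'error' and
   'val' must remain zero in this case */
void
resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Deliver the response.  Returns 0 on success, or -1 with errno set
   (for example, ENOENT if the target's call is no longer blocked). */
int
send_response(int notifyFd, struct seccomp_notif_resp *resp)
{
    return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```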

   SECCOMP_IOCTL_NOTIF_ADDFD
       The SECCOMP_IOCTL_NOTIF_ADDFD operation (available since Linux 5.9)
       allows the supervisor to install a file descriptor into the target's
       file descriptor table.  Much like the use of SCM_RIGHTS messages
       described in unix(7), this operation is semantically equivalent to
       duplicating a file descriptor from the supervisor's file descriptor
       table into the target's file descriptor table.

       The SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to
       emulate a target system call (such as socket(2) or openat(2)) that
       generates a file descriptor.  The supervisor can perform the system
       call that generates the file descriptor (and associated open file
       description) and then use this operation to allocate a file descriptor
       that refers to the same open file description in the target.  (For an
       explanation of open file descriptions, see open(2).)

       Once this operation has been performed, the supervisor can close its
       copy of the file descriptor.

       In the target, the received file descriptor is subject to the same
       Linux Security Module (LSM) checks as are applied to a file descriptor
       that is received in an SCM_RIGHTS ancillary message.  If the file
       descriptor refers to a socket, it inherits the cgroup version 1
       network controller settings (classid and netprioidx) of the target.

       The third ioctl(2) argument is a pointer to a structure of the
       following form:

           struct seccomp_notif_addfd {
               __u64 id;           /* Cookie value */
               __u32 flags;        /* Flags */
               __u32 srcfd;        /* Local file descriptor number */
               __u32 newfd;        /* 0 or desired file descriptor
                                      number in target */
               __u32 newfd_flags;  /* Flags to set on target file
                                      descriptor */
           };

       The fields in this structure are as follows:

       id     This field should be set to the notification ID (cookie value)
              that was obtained via SECCOMP_IOCTL_NOTIF_RECV.

       flags  This field is a bit mask of flags that modify the behavior of
              the operation.  The following flags are supported:

              SECCOMP_ADDFD_FLAG_SETFD
                     When allocating the file descriptor in the target, use
                     the file descriptor number specified in the newfd field.

              SECCOMP_ADDFD_FLAG_SEND (since Linux 5.14)
                     Perform the equivalent of SECCOMP_IOCTL_NOTIF_ADDFD plus
                     SECCOMP_IOCTL_NOTIF_SEND as an atomic operation.  On
                     successful invocation, the target process's errno will
                     be 0 and the return value will be the file descriptor
                     number that was allocated in the target.  If allocating
                     the file descriptor in the target fails, the target's
                     system call continues to be blocked until a successful
                     response is sent.

       srcfd  This field should be set to the number of the file descriptor in
              the supervisor that is to be duplicated.

       newfd  This field determines which file descriptor number is allocated
              in the target.  If the SECCOMP_ADDFD_FLAG_SETFD flag is set,
              then this field specifies which file descriptor number should be
              allocated.  If this file descriptor number is already open in
              the target, it is atomically closed and reused.  If the
              descriptor duplication fails due to an LSM check, or if srcfd is
              not a valid file descriptor, the file descriptor newfd will not
              be closed in the target process.

              If the SECCOMP_ADDFD_FLAG_SETFD flag is not set, then this field
              must be 0, and the kernel allocates the lowest unused file
              descriptor number in the target.

       newfd_flags
              This field is a bit mask specifying flags that should be set on
              the file descriptor that is received in the target process.
              Currently, only the following flag is implemented:

              O_CLOEXEC
                     Set the close-on-exec flag on the received file
                     descriptor.

       On success, this ioctl(2) call returns the number of the file
       descriptor that was allocated in the target.  Assuming that the
       emulated system call is one that returns a file descriptor as its
       function result (e.g., socket(2)), this value can be used as the
       return value (resp.val) that is supplied in the response that is
       subsequently sent with the SECCOMP_IOCTL_NOTIF_SEND operation.

       On error, -1 is returned and errno is set to indicate the cause of the
       error.

       This operation can fail with the following errors:

       EBADF  Allocating the file descriptor in the target would cause the
              target's RLIMIT_NOFILE limit to be exceeded (see getrlimit(2)).

       EBUSY  If the flag SECCOMP_ADDFD_FLAG_SEND is used, this means the
              operation can't proceed until other SECCOMP_IOCTL_NOTIF_ADDFD
              requests are processed.

       EINPROGRESS
              The user-space notification specified in the id field exists but
              has not yet been fetched (by a SECCOMP_IOCTL_NOTIF_RECV) or has
              already been responded to (by a SECCOMP_IOCTL_NOTIF_SEND).

       EINVAL An invalid flag was specified in the flags or newfd_flags field,
              or the newfd field is nonzero and the SECCOMP_ADDFD_FLAG_SETFD
              flag was not specified in the flags field.

       EMFILE The file descriptor number specified in newfd exceeds the limit
              specified in /proc/sys/fs/nr_open.

       ENOENT The blocked system call in the target has been interrupted by a
              signal handler or the target has terminated.

       Here is some sample code (with error handling omitted) that uses the
       SECCOMP_IOCTL_NOTIF_ADDFD operation (here, to emulate a call to
       openat(2)):

           int fd, targetFd;

           /* 'path' is the pathname read from the target's memory
              (code omitted) */
           fd = openat(req->data.args[0], path, req->data.args[2],
                           req->data.args[3]);

           struct seccomp_notif_addfd addfd;
           addfd.id = req->id; /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
           addfd.srcfd = fd;
           addfd.newfd = 0;
           addfd.flags = 0;
           addfd.newfd_flags = O_CLOEXEC;

           targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);

           close(fd);          /* No longer needed in supervisor */

           struct seccomp_notif_resp *resp;
               /* Code to allocate 'resp' omitted */
           resp->id = req->id;
           resp->error = 0;        /* "Success" */
           resp->val = targetFd;
           resp->flags = 0;
           ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);

NOTES

       One example use case for the user-space notification mechanism is to
       allow a container manager (a process which is typically running with
       more privilege than the processes inside the container) to mount block
       devices or create device nodes for the container.  The mount use case
       provides an example of where the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag
       is useful.  Upon receiving a notification for the mount(2) system
       call, the container manager (the "supervisor") can distinguish a
       request to mount a block filesystem (which would not be possible for a
       "target" process inside the container) and mount that filesystem.  If,
       on the other hand, the container manager detects that the operation
       could be performed by the process inside the container (e.g., a mount
       of a tmpfs(5) filesystem), it can notify the kernel that the target
       process's mount(2) system call can continue.

   select()/poll()/epoll semantics
       The file descriptor returned when seccomp(2) is employed with the
       SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using poll(2),
       epoll(7), and select(2).  These interfaces indicate that the file
       descriptor is ready as follows:

       • When a notification is pending, these interfaces indicate that the
         file descriptor is readable.  Following such an indication, a
         subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block,
         returning either information about a notification or else failing
         with the error ENOENT if the target has been killed by a signal or
         its system call has been interrupted by a signal handler.

       • After the notification has been received (i.e., by the
         SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces
         indicate that the file descriptor is writable, meaning that a
         notification response can be sent using the SECCOMP_IOCTL_NOTIF_SEND
         ioctl(2) operation.

       • After the last thread using the filter has terminated and been reaped
         using waitpid(2) (or similar), the file descriptor indicates an
         end-of-file condition (readable in select(2); POLLHUP/EPOLLHUP in
         poll(2)/epoll_wait(2)).
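
       The readiness rules above can be sketched with poll(2) as follows.
       The function wait_for_notification() and the notify_state constants
       are hypothetical names, and restarting the full timeout after an
       interrupted poll(2) call is a simplification.

```c
#include <errno.h>
#include <poll.h>

enum notify_state {
    NOTIFY_ERROR   = -1,    /* poll(2) failed or descriptor is unusable */
    NOTIFY_TIMEOUT = 0,     /* Nothing happened within the timeout */
    NOTIFY_PENDING = 1,     /* Readable: a notification can be fetched */
    NOTIFY_CLOSED  = 2      /* POLLHUP: no thread uses the filter any more */
};

/* Wait up to 'timeoutMs' milliseconds (-1 means forever) for the
   listening descriptor to report a pending notification */
int
wait_for_notification(int notifyFd, int timeoutMs)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };
    int ready;

    do {
        ready = poll(&pfd, 1, timeoutMs);
    } while (ready == -1 && errno == EINTR);

    if (ready == -1)
        return NOTIFY_ERROR;
    if (ready == 0)
        return NOTIFY_TIMEOUT;

    if (pfd.revents & POLLIN)
        return NOTIFY_PENDING;
    if (pfd.revents & POLLHUP)
        return NOTIFY_CLOSED;

    return NOTIFY_ERROR;    /* POLLERR or POLLNVAL */
}
```

       After NOTIFY_PENDING, the supervisor would call
       SECCOMP_IOCTL_NOTIF_RECV and must still be prepared for that call to
       fail, as noted in the first bullet above.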
539
540   Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
541       The  intent  of  the user-space notification feature is to allow system
542       calls to be performed on behalf of the  target.   The  target's  system
543       call  should either be handled by the supervisor or allowed to continue
544       normally in the kernel (where standard security policies  will  be  ap‐
545       plied).
546
547       Note  well: this mechanism must not be used to make security policy de‐
548       cisions about the system call, which would be inherently race-prone for
549       reasons described next.
550
551       The  SECCOMP_USER_NOTIF_FLAG_CONTINUE  flag  must be used with caution.
552       If set by the supervisor, the target's system call will continue.  How‐
553       ever,  there  is  a  time-of-check, time-of-use race here, since an at‐
554       tacker could exploit the interval of time where the target  is  blocked
555       waiting  on  the "continue" response to do things such as rewriting the
556       system call arguments.
557
558       Note furthermore that a user-space notifier can be bypassed if the  ex‐
559       isting  filters  allow  the  use of seccomp(2) or prctl(2) to install a
560       filter that returns an action value with a higher precedence than  SEC‐
561       COMP_RET_USER_NOTIF (see seccomp(2)).
562
563       It  should thus be absolutely clear that the seccomp user-space notifi‐
564       cation mechanism can not be used to implement a  security  policy!   It
565       should  only  ever be used in scenarios where a more privileged process
566       supervises the system calls of a lesser privileged target to get around
567       kernel-enforced  security  restrictions  when the supervisor deems this
568       safe.  In other words, in order to continue a system call, the supervi‐
569       sor should be sure that another security mechanism or the kernel itself
570       will sufficiently block the system call if its arguments are  rewritten
571       to something unsafe.
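
       The two outcomes described above (the supervisor handles the call
       itself, or lets it continue in the kernel) correspond to two ways of
       filling in the response structure.  The following is a minimal sketch;
       the helper names makeErrorResponse() and makeContinueResponse() are
       hypothetical and not part of any API.  It relies on the fact that when
       SECCOMP_USER_NOTIF_FLAG_CONTINUE is set, the error and val fields must
       be zero.

```c
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>
#include <linux/types.h>

/* Hypothetical helper: the supervisor has decided the outcome itself and
   spoofs a failure. The 'error' field carries a negated errno value. */
static void
makeErrorResponse(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;
}

/* Hypothetical helper: ask the kernel to execute the target's system
   call. 'error' and 'val' are left as 0, as required with this flag.
   This is appropriate only if the kernel or another security mechanism
   would still block the call should its arguments be rewritten. */
static void
makeContinueResponse(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

       Either structure would then be passed to SECCOMP_IOCTL_NOTIF_SEND.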
572
573   Caveats regarding the use of /proc/[tid]/mem
574       The  discussion  above  noted  the  need  to  use the SECCOMP_IOCTL_NO‐
575       TIF_ID_VALID ioctl(2) when opening the /proc/[tid]/mem file of the tar‐
576       get  to  avoid  the  possibility  of  accessing the memory of the wrong
577       process in the event that the target terminates and its ID is  recycled
578       by another (unrelated) thread.  However, the use of this ioctl(2) oper‐
579       ation is also necessary in other situations, as explained in  the  fol‐
580       lowing paragraphs.
581
582       Consider the following scenario, where the supervisor tries to read the
583       pathname argument of a target's blocked mount(2) system call:
584
585       • From one of its functions (func()), the target calls mount(2),  which
586         triggers a user-space notification and causes the target to block.
587
588       • The  supervisor receives the notification, opens /proc/[tid]/mem, and
589         (successfully) performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.
590
591       • The target receives a signal, which causes the mount(2) to abort.
592
593       • The signal handler executes in the target, and returns.
594
595       • Upon return from the handler, the execution of func() resumes, and it
596         returns (and perhaps other functions are called, overwriting the mem‐
597         ory that had been used for the stack frame of func()).
598
599       • Using the address provided in the notification information,  the  su‐
600         pervisor reads from the target's memory location that used to contain
601         the pathname.
602
603       • The supervisor now calls mount(2) with some arbitrary bytes  obtained
604         in the previous step.
605
606       The  conclusion  from  the  above  scenario is this: since the target's
607       blocked system call may be interrupted by a signal handler, the  super‐
608       visor  must be written to expect that the target may abandon its system
609       call at any time; in such an event, any information that the supervisor
610       obtained from the target's memory must be considered invalid.
611
612       To  prevent such scenarios, every read from the target's memory must be
613       separated from use of the bytes  so  obtained  by  a  SECCOMP_IOCTL_NO‐
614       TIF_ID_VALID  check.   In  the above example, the check would be placed
615       between the two final steps.  An example of such a check  is  shown  in
616       EXAMPLES.
617
618       Following on from the above, it should be clear that a write by the su‐
619       pervisor into the target's memory can never be considered safe.
620
621   Caveats regarding blocking system calls
622       Suppose that the target performs a  blocking  system  call  (e.g.,  ac‐
623       cept(2))  that the supervisor should handle.  The supervisor might then
624       in turn execute the same blocking system call.
625
626       In this scenario, it is important to note that if the  target's  system
627       call  is now interrupted by a signal, the supervisor is not informed of
628       this.  If the supervisor does not take suitable steps to actively  dis‐
629       cover  that  the target's system call has been canceled, various diffi‐
630       culties can occur.  Taking the example  of  accept(2),  the  supervisor
631       might  remain  blocked  in its accept(2) holding a port number that the
632       target (which, after the interruption by the  signal  handler,  perhaps
633       closed   its  listening  socket)  might expect to be able to reuse in a
634       bind(2) call.
635
636       Therefore, when the supervisor wishes  to  emulate  a  blocking  system
637       call, it must do so in such a way that it gets informed if the target's
638       system call is interrupted by a signal handler.  For  example,  if  the
639       supervisor itself executes the same blocking system call, then it could
640       employ a separate thread that uses the SECCOMP_IOCTL_NOTIF_ID_VALID op‐
641       eration  to  check  if  the target is still blocked in its system call.
642       Alternatively, in the  accept(2)  example,  the  supervisor  might  use
643       poll(2) to monitor both the notification file descriptor (so as to dis‐
644       cover when the target's accept(2) call has been  interrupted)  and  the
645       listening  file  descriptor  (so as to know when a connection is avail‐
646       able).
647
648       If the target's system call is interrupted, the  supervisor  must  take
649       care  to release resources (e.g., file descriptors) that it acquired on
650       behalf of the target.
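
       The poll(2) approach mentioned above might be sketched as follows.
       This is an illustration under stated assumptions, not a definitive
       implementation: the function name waitForConnOrCancel() and the enum
       values are hypothetical, and any failure of the
       SECCOMP_IOCTL_NOTIF_ID_VALID check is treated as meaning that the
       target has abandoned its system call.

```c
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/types.h>

enum waitResult { WAIT_READY, WAIT_CANCELED, WAIT_RETRY, WAIT_ERROR };

/* Hypothetical helper: block until the listening socket is readable or
   there is activity on the notification file descriptor. 'id' is the ID
   of the notification that the supervisor is currently handling. */
static enum waitResult
waitForConnOrCancel(int notifyFd, int listenFd, __u64 id)
{
    struct pollfd pfd[2] = {
        { .fd = notifyFd, .events = POLLIN },
        { .fd = listenFd, .events = POLLIN },
    };

    if (poll(pfd, 2, -1) == -1)
        return WAIT_ERROR;          /* Caller may retry on EINTR */

    /* Activity on the notification FD may mean that the target's call
       was interrupted (and perhaps restarted); check whether the
       notification being handled is still valid. */

    if (pfd[0].revents != 0 &&
            ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
        return WAIT_CANCELED;

    if (pfd[1].revents & POLLIN)
        return WAIT_READY;          /* accept() should not block now */

    return WAIT_RETRY;              /* Still valid; nothing to accept */
}
```

       On WAIT_CANCELED, the supervisor would release anything it acquired
       on the target's behalf instead of calling accept(2).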
651
652   Interaction with SA_RESTART signal handlers
653       Consider the following scenario:
654
655       • The target process has used sigaction(2) to install a signal  handler
656         with the SA_RESTART flag.
657
658       • The target has made a system call that triggered a seccomp user-space
659         notification and the target is currently blocked until the supervisor
660         sends a notification response.
661
662       • A  signal  is  delivered to the target and the signal handler is exe‐
663         cuted.
664
665       • When (if) the supervisor attempts to send  a  notification  response,
666         the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will fail with the
667         ENOENT error.
668
669       In this scenario, the kernel will restart  the  target's  system  call.
670       Consequently,  the supervisor will receive another user-space notifica‐
671       tion.  Thus, depending on how many times the blocked system call is in‐
672       terrupted  by a signal handler, the supervisor may receive multiple no‐
673       tifications for the same instance of a system call in the target.
674
675       One oddity is that system call restarting as described in this scenario
676       will  occur even for the blocking system calls listed in signal(7) that
677       would never normally be restarted by the SA_RESTART flag.
678
679       Furthermore, if the supervisor response is a file descriptor added with
680       SECCOMP_IOCTL_NOTIF_ADDFD, then the flag SECCOMP_ADDFD_FLAG_SEND can be
681       used to atomically add the file descriptor and return that value,  mak‐
682       ing sure no file descriptors are inadvertently leaked into the target.
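
       As a rough sketch of the atomic variant just described, the supervisor
       could fill in a struct seccomp_notif_addfd with the
       SECCOMP_ADDFD_FLAG_SEND flag and issue a single ioctl(2).  The helper
       names prepareAddfdSend() and addFdAndRespond() are hypothetical, and
       the fallback definition of the flag is an assumption for building
       against older kernel headers.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>
#include <linux/types.h>

#ifndef SECCOMP_ADDFD_FLAG_SEND
#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1)  /* Assumed for older headers */
#endif

/* Hypothetical helper: describe the supervisor's descriptor 'srcFd' and
   request that installing it in the target also completes the
   notification identified by 'id'. 'newfd' and 'newfd_flags' stay 0,
   so the kernel picks the descriptor number in the target. */
static void
prepareAddfdSend(struct seccomp_notif_addfd *addfd, __u64 id, int srcFd)
{
    memset(addfd, 0, sizeof(*addfd));
    addfd->id = id;
    addfd->srcfd = srcFd;
    addfd->flags = SECCOMP_ADDFD_FLAG_SEND;
}

/* Hypothetical helper: returns the descriptor number allocated in the
   target (which also becomes the system call's return value there), or
   -1 with errno set; ENOENT indicates that the notification is no
   longer valid, in which case nothing was installed in the target. */
static int
addFdAndRespond(int notifyFd, __u64 id, int srcFd)
{
    struct seccomp_notif_addfd addfd;

    prepareAddfdSend(&addfd, id, srcFd);
    return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```

       With this flag, no separate SECCOMP_IOCTL_NOTIF_SEND is issued for
       the notification.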
683

BUGS

685       If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed after the
686       target terminates, then the ioctl(2) call simply  blocks  (rather  than
687       returning an error to indicate that the target no longer exists).
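
       One way a supervisor might avoid blocking forever is to wait with
       poll(2) before calling SECCOMP_IOCTL_NOTIF_RECV.  The sketch below
       assumes a kernel on which the notification file descriptor reports
       POLLHUP once the filter is no longer in use; the function name
       waitForNotification() is hypothetical.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a notification is ready to be
   fetched with SECCOMP_IOCTL_NOTIF_RECV, 0 if the file descriptor
   reports a hangup (no target remains), or -1 on error. */
static int
waitForNotification(int notifyFd)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };

    if (poll(&pfd, 1, -1) == -1)
        return -1;                  /* Caller may retry on EINTR */

    if (pfd.revents & POLLIN)
        return 1;

    if (pfd.revents & POLLHUP)
        return 0;

    return -1;                      /* POLLERR, POLLNVAL, or unexpected */
}
```

       The supervisor would call SECCOMP_IOCTL_NOTIF_RECV only when this
       function returns 1.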
688

EXAMPLES

690       The  (somewhat  contrived)  program shown below demonstrates the use of
691       the interfaces described in this page.  The  program  creates  a  child
692       process  that  serves  as  the "target" process.  The child process in‐
693       stalls a seccomp filter that returns the SECCOMP_RET_USER_NOTIF  action
694       value  if  a  call  is  made to mkdir(2).  The child process then calls
695       mkdir(2) once for each of the supplied command-line arguments, and  re‐
696       ports the result returned by the call.  After processing all arguments,
697       the child process terminates.
698
699       The parent process acts as the supervisor, listening for the  notifica‐
700       tions  that are generated when the target process calls mkdir(2).  When
701       such a notification occurs, the supervisor examines the memory  of  the
702       target  process  (using /proc/[pid]/mem) to discover the pathname argu‐
703       ment that was supplied to the mkdir(2) call, and performs  one  of  the
704       following actions:
705
706       • If  the  pathname begins with the prefix "/tmp/", then the supervisor
707         attempts to create the specified directory, and then spoofs a  return
708         for  the target process based on the return value of the supervisor's
709         mkdir(2) call.  If that call succeeds, the spoofed
710         success return value is the length of the pathname.
711
712       • If  the  pathname begins with "./" (i.e., it is a relative pathname),
713         the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE  response  to
714         the kernel to say that the kernel should execute the target process's
715         mkdir(2) call.
716
717       • If the pathname begins with some other prefix, the supervisor  spoofs
718         an  error return for the target process, so that the target process's
719         mkdir(2) call appears to fail with the error  EOPNOTSUPP  ("Operation
720         not  supported").  Additionally, if the specified pathname is exactly
721         "/bye", then the supervisor terminates.
722
723       This program can be used to demonstrate various aspects of the behavior
724       of the seccomp user-space notification mechanism.  To aid such
725       demonstrations, the program logs various messages to show the operation
726       of the target process (lines prefixed "T:") and the supervisor
727       (indented lines prefixed "S:").
728
729       In the following example, the target attempts to create  the  directory
730       /tmp/x.   Upon  receiving  the notification, the supervisor creates the
731       directory on the target's behalf, and spoofs a success return to be re‐
732       ceived by the target process's mkdir(2) call.
733
734           $ ./seccomp_unotify /tmp/x
735           T: PID = 23168
736
737           T: about to mkdir("/tmp/x")
738                   S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
739                   S: executing: mkdir("/tmp/x", 0700)
740                   S: success! spoofed return = 6
741                   S: sending response (flags = 0; val = 6; error = 0)
742           T: SUCCESS: mkdir(2) returned 6
743
744           T: terminating
745                   S: target has terminated; bye
746
747       In  the  above  output,  note that the spoofed return value seen by the
748       target process is 6 (the length of the pathname /tmp/x), whereas a nor‐
749       mal mkdir(2) call returns 0 on success.
750
751       In  the  next  example, the target attempts to create a directory using
752       the relative pathname ./sub.  Since this pathname starts with "./", the
753       supervisor  sends  a  SECCOMP_USER_NOTIF_FLAG_CONTINUE  response to the
754       kernel,  and  the  kernel  then  (successfully)  executes  the   target
755       process's mkdir(2) call.
756
757           $ ./seccomp_unotify ./sub
758           T: PID = 23204
759
760           T: about to mkdir("./sub")
761                   S: got notification (ID 0xddb16abe25b4c12) for PID 23204
762                   S: target can execute system call
763                   S: sending response (flags = 0x1; val = 0; error = 0)
764           T: SUCCESS: mkdir(2) returned 0
765
766           T: terminating
767                   S: target has terminated; bye
768
769       If the target process attempts to create a directory with a pathname
770       that begins with neither "./" nor the prefix "/tmp/",
771       then  the supervisor spoofs an error return (EOPNOTSUPP, "Operation not
772       supported") for the target's mkdir(2) call (which is not executed):
773
774           $ ./seccomp_unotify /xxx
775           T: PID = 23178
776
777           T: about to mkdir("/xxx")
778                   S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
779                   S: spoofing error response (Operation not supported)
780                   S: sending response (flags = 0; val = 0; error = -95)
781           T: ERROR: mkdir(2): Operation not supported
782
783           T: terminating
784                   S: target has terminated; bye
785
786       In the next example, the target process attempts to create a  directory
787       with  the  pathname /tmp/nosuchdir/b.  Upon receiving the notification,
788       the supervisor attempts to create that directory, but the mkdir(2) call
789       fails  because  the  directory  /tmp/nosuchdir  does not exist.  Conse‐
790       quently, the supervisor spoofs an error return that  passes  the  error
791       that it received back to the target process's mkdir(2) call.
792
793           $ ./seccomp_unotify /tmp/nosuchdir/b
794           T: PID = 23199
795
796           T: about to mkdir("/tmp/nosuchdir/b")
797                   S: got notification (ID 0x8744454293506046) for PID 23199
798                   S: executing: mkdir("/tmp/nosuchdir/b", 0700)
799                   S: failure! (errno = 2; No such file or directory)
800                   S: sending response (flags = 0; val = 0; error = -2)
801           T: ERROR: mkdir(2): No such file or directory
802
803           T: terminating
804                   S: target has terminated; bye
805
806       If the supervisor receives a notification and sees that the argument of
807       the target's mkdir(2) is the string "/bye", then (as well  as  spoofing
808       an EOPNOTSUPP error), the supervisor terminates.  If the target process
809       subsequently executes another mkdir(2) that triggers its seccomp filter
810       to  return  the  SECCOMP_RET_USER_NOTIF  action  value, then the kernel
811       causes the target process's system call to fail with the  error  ENOSYS
812       ("Function  not  implemented").   This is demonstrated by the following
813       example:
814
815           $ ./seccomp_unotify /bye /tmp/y
816           T: PID = 23185
817
818           T: about to mkdir("/bye")
819                   S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
820                   S: spoofing error response (Operation not supported)
821                   S: sending response (flags = 0; val = 0; error = -95)
822                   S: terminating **********
823           T: ERROR: mkdir(2): Operation not supported
824
825           T: about to mkdir("/tmp/y")
826           T: ERROR: mkdir(2): Function not implemented
827
828           T: terminating
829
830   Program source
831       #define _GNU_SOURCE
832       #include <errno.h>
833       #include <fcntl.h>
834       #include <limits.h>
835       #include <linux/audit.h>
836       #include <linux/filter.h>
837       #include <linux/seccomp.h>
838       #include <signal.h>
839       #include <stdbool.h>
840       #include <stddef.h>
841       #include <stdint.h>
842       #include <stdio.h>
843       #include <stdlib.h>
          #include <string.h>
844       #include <sys/socket.h>
845       #include <sys/ioctl.h>
846       #include <sys/prctl.h>
847       #include <sys/stat.h>
848       #include <sys/types.h>
849       #include <sys/un.h>
850       #include <sys/syscall.h>
851       #include <unistd.h>
852
853       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
854                               } while (0)
855
856       /* Send the file descriptor 'fd' over the connected UNIX domain socket
857          'sockfd'. Returns 0 on success, or -1 on error. */
858
859       static int
860       sendfd(int sockfd, int fd)
861       {
862           struct msghdr msgh;
863           struct iovec iov;
864           int data;
865           struct cmsghdr *cmsgp;
866
867           /* Allocate a char array of suitable size to hold the ancillary data.
868              However, since this buffer is in reality a 'struct cmsghdr', use a
869              union to ensure that it is suitably aligned. */
870           union {
871               char   buf[CMSG_SPACE(sizeof(int))];
872                               /* Space large enough to hold an 'int' */
873               struct cmsghdr align;
874           } controlMsg;
875
876           /* The 'msg_name' field can be used to specify the address of the
877              destination socket when sending a datagram. However, we do not
878              need to use this field because 'sockfd' is a connected socket. */
879
880           msgh.msg_name = NULL;
881           msgh.msg_namelen = 0;
882
883           /* On Linux, we must transmit at least one byte of real data in
884              order to send ancillary data. We transmit an arbitrary integer
885              whose value is ignored by recvfd(). */
886
887           msgh.msg_iov = &iov;
888           msgh.msg_iovlen = 1;
889           iov.iov_base = &data;
890           iov.iov_len = sizeof(int);
891           data = 12345;
892
893           /* Set 'msghdr' fields that describe ancillary data */
894
895           msgh.msg_control = controlMsg.buf;
896           msgh.msg_controllen = sizeof(controlMsg.buf);
897
898           /* Set up ancillary data describing file descriptor to send */
899
900           cmsgp = CMSG_FIRSTHDR(&msgh);
901           cmsgp->cmsg_level = SOL_SOCKET;
902           cmsgp->cmsg_type = SCM_RIGHTS;
903           cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
904           memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
905
906           /* Send real plus ancillary data */
907
908           if (sendmsg(sockfd, &msgh, 0) == -1)
909               return -1;
910
911           return 0;
912       }
913
914       /* Receive a file descriptor on a connected UNIX domain socket. Returns
915          the received file descriptor on success, or -1 on error. */
916
917       static int
918       recvfd(int sockfd)
919       {
920           struct msghdr msgh;
921           struct iovec iov;
922           int data, fd;
923           ssize_t nr;
924
925           /* Allocate a char buffer for the ancillary data. See the comments
926              in sendfd() */
927           union {
928               char   buf[CMSG_SPACE(sizeof(int))];
929               struct cmsghdr align;
930           } controlMsg;
931           struct cmsghdr *cmsgp;
932
933           /* The 'msg_name' field can be used to obtain the address of the
934              sending socket. However, we do not need this information. */
935
936           msgh.msg_name = NULL;
937           msgh.msg_namelen = 0;
938
939           /* Specify buffer for receiving real data */
940
941           msgh.msg_iov = &iov;
942           msgh.msg_iovlen = 1;
943           iov.iov_base = &data;       /* Real data is an 'int' */
944           iov.iov_len = sizeof(int);
945
946           /* Set 'msghdr' fields that describe ancillary data */
947
948           msgh.msg_control = controlMsg.buf;
949           msgh.msg_controllen = sizeof(controlMsg.buf);
950
951           /* Receive real plus ancillary data; real data is ignored */
952
953           nr = recvmsg(sockfd, &msgh, 0);
954           if (nr == -1)
955               return -1;
956
957           cmsgp = CMSG_FIRSTHDR(&msgh);
958
959           /* Check the validity of the 'cmsghdr' */
960
961           if (cmsgp == NULL ||
962                   cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
963                   cmsgp->cmsg_level != SOL_SOCKET ||
964                   cmsgp->cmsg_type != SCM_RIGHTS) {
965               errno = EINVAL;
966               return -1;
967           }
968
969           /* Return the received file descriptor to our caller */
970
971           memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
972           return fd;
973       }
974
975       static void
976       sigchldHandler(int sig)
977       {
978           char msg[] = "\tS: target has terminated; bye\n";
979
980           write(STDOUT_FILENO, msg, sizeof(msg) - 1);
981           _exit(EXIT_SUCCESS);
982       }
983
984       static int
985       seccomp(unsigned int operation, unsigned int flags, void *args)
986       {
987           return syscall(__NR_seccomp, operation, flags, args);
988       }
989
990       /* The following is the x86-64-specific BPF boilerplate code for checking
991          that the BPF program is running on the right architecture + ABI. At
992          completion of these instructions, the accumulator contains the system
993          call number. */
994
995       /* For the x32 ABI, all system call numbers have bit 30 set */
996
997       #define X32_SYSCALL_BIT         0x40000000
998
999       #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
1000               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
1001                       (offsetof(struct seccomp_data, arch))), \
1002               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
1003               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
1004                        (offsetof(struct seccomp_data, nr))), \
1005               BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
1006               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
1007
1008       /* installNotifyFilter() installs a seccomp filter that generates
1009          user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
1010          calls mkdir(2); the filter allows all other system calls.
1011
1012          The function return value is a file descriptor from which the
1013          user-space notifications can be fetched. */
1014
1015       static int
1016       installNotifyFilter(void)
1017       {
1018           struct sock_filter filter[] = {
1019               X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
1020
1021               /* mkdir() triggers notification to user-space supervisor */
1022
1023               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
1024               BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_USER_NOTIF),
1025
1026               /* Every other system call is allowed */
1027
1028               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
1029           };
1030
1031           struct sock_fprog prog = {
1032               .len = sizeof(filter) / sizeof(filter[0]),
1033               .filter = filter,
1034           };
1035
1036           /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1037              as a result, seccomp() returns a notification file descriptor. */
1038
1039           int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
1040                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
1041           if (notifyFd == -1)
1042               errExit("seccomp-install-notify-filter");
1043
1044           return notifyFd;
1045       }
1046
1047       /* Close a pair of sockets created by socketpair() */
1048
1049       static void
1050       closeSocketPair(int sockPair[2])
1051       {
1052           if (close(sockPair[0]) == -1)
1053               errExit("closeSocketPair-close-0");
1054           if (close(sockPair[1]) == -1)
1055               errExit("closeSocketPair-close-1");
1056       }
1057
1058       /* Implementation of the target process; create a child process that:
1059
1060          (1) installs a seccomp filter with the
1061              SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1062          (2) writes the seccomp notification file descriptor returned from
1063              the previous step onto the UNIX domain socket, 'sockPair[0]';
1064          (3) calls mkdir(2) for each element of 'argv'.
1065
1066          The function return value in the parent is the PID of the child
1067          process; the child does not return from this function. */
1068
1069       static pid_t
1070       targetProcess(int sockPair[2], char *argv[])
1071       {
1072           pid_t targetPid = fork();
1073           if (targetPid == -1)
1074               errExit("fork");
1075
1076           if (targetPid > 0)          /* In parent, return PID of child */
1077               return targetPid;
1078
1079           /* Child falls through to here */
1080
1081           printf("T: PID = %ld\n", (long) getpid());
1082
1083           /* Install seccomp filter(s) */
1084
1085           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
1086               errExit("prctl");
1087
1088           int notifyFd = installNotifyFilter();
1089
1090           /* Pass the notification file descriptor to the supervisor over
1091              a UNIX domain socket */
1092
1093           if (sendfd(sockPair[0], notifyFd) == -1)
1094               errExit("sendfd");
1095
1096           /* Notification and socket FDs are no longer needed in target */
1097
1098           if (close(notifyFd) == -1)
1099               errExit("close-target-notify-fd");
1100
1101           closeSocketPair(sockPair);
1102
1103           /* Perform a mkdir() call for each of the command-line arguments */
1104
1105           for (char **ap = argv; *ap != NULL; ap++) {
1106               printf("\nT: about to mkdir(\"%s\")\n", *ap);
1107
1108               int s = mkdir(*ap, 0700);
1109               if (s == -1)
1110                   perror("T: ERROR: mkdir(2)");
1111               else
1112                   printf("T: SUCCESS: mkdir(2) returned %d\n", s);
1113           }
1114
1115           printf("\nT: terminating\n");
1116           exit(EXIT_SUCCESS);
1117       }
1118
1119       /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
1120          operation is still valid. It will no longer be valid if the target
1121          process has terminated or is no longer blocked in the system call that
1122          generated the notification (because it was interrupted by a signal).
1123
1124          This operation can be used when doing such things as accessing
1125          /proc/PID files in the target process in order to avoid TOCTOU race
1126          conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
1127          terminates and is reused by another process. */
1128
1129       static bool
1130       cookieIsValid(int notifyFd, uint64_t id)
1131       {
1132           return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
1133       }
1134
1135       /* Access the memory of the target process in order to fetch the
1136          pathname referred to by the system call argument 'argNum' in
1137          'req->data.args[]'.  The pathname is returned in 'path',
1138          a buffer of 'len' bytes allocated by the caller.
1139
1140          Returns true if the pathname is successfully fetched, and false
1141          otherwise. For possible causes of failure, see the comments below. */
1142
1143       static bool
1144       getTargetPathname(struct seccomp_notif *req, int notifyFd,
1145                         int argNum, char *path, size_t len)
1146       {
1147           char procMemPath[PATH_MAX];
1148
1149           snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
1150
1151           int procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
1152           if (procMemFd == -1)
1153               return false;
1154
1155           /* Check that the process whose info we are accessing is still alive
1156              and blocked in the system call that caused the notification.
1157              If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
1158              cookieIsValid()) succeeded, we know that the /proc/PID/mem file
1159              descriptor that we opened corresponded to the process for which we
1160              received a notification. If that process subsequently terminates,
1161              then read() on that file descriptor will return 0 (EOF). */
1162
1163           if (!cookieIsValid(notifyFd, req->id)) {
1164               close(procMemFd);
1165               return false;
1166           }
1167
1168           /* Read bytes at the location containing the pathname argument */
1169
1170           ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
1171
1172           close(procMemFd);
1173
1174           if (nread <= 0)
1175               return false;
1176
1177           /* Once again check that the notification ID is still valid. The
1178              case we are particularly concerned about here is that just
1179              before we fetched the pathname, the target's blocked system
1180              call was interrupted by a signal handler, and after the handler
1181              returned, the target carried on execution (past the interrupted
1182              system call). In that case, we have no guarantees about what we
1183              are reading, since the target's memory may have been arbitrarily
1184              changed by subsequent operations. */
1185
1186           if (!cookieIsValid(notifyFd, req->id)) {
1187               perror("\tS: notification ID check failed!!!");
1188               return false;
1189           }
1190
1191           /* Even if the target's system call was not interrupted by a signal,
1192              we have no guarantees about what was in the memory of the target
1193              process. (The memory may have been modified by another thread, or
1194              even by an external attacking process.) We therefore treat the
1195              buffer returned by pread() as untrusted input. The buffer should
1196              contain a terminating null byte; if not, then we will trigger an
1197              error for the target process. */
1198
1199           if (strnlen(path, nread) < nread)
1200               return true;
1201
1202           return false;
1203       }
1204
1205       /* Allocate buffers for the seccomp user-space notification request and
1206          response structures. It is the caller's responsibility to free the
1207          buffers returned via 'req' and 'resp'. */
1208
1209       static void
1210       allocSeccompNotifBuffers(struct seccomp_notif **req,
1211               struct seccomp_notif_resp **resp,
1212               struct seccomp_notif_sizes *sizes)
1213       {
1214           /* Discover the sizes of the structures that are used to receive
1215              notifications and send notification responses, and allocate
1216              buffers of those sizes. */
1217
1218           if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
1219               errExit("seccomp-SECCOMP_GET_NOTIF_SIZES");
1220
1221           *req = malloc(sizes->seccomp_notif);
1222           if (*req == NULL)
1223               errExit("malloc-seccomp_notif");
1224
1225           /* When allocating the response buffer, we must allow for the fact
1226              that the user-space binary may have been built with user-space
1227              headers where 'struct seccomp_notif_resp' is bigger than the
1228              response buffer expected by the (older) kernel. Therefore, we
1229              allocate a buffer that is the maximum of the two sizes. This
1230              ensures that if the supervisor places bytes into the response
1231              structure that are past the response size that the kernel expects,
1232              then the supervisor is not touching an invalid memory location. */
1233
1234           size_t resp_size = sizes->seccomp_notif_resp;
1235           if (sizeof(struct seccomp_notif_resp) > resp_size)
1236               resp_size = sizeof(struct seccomp_notif_resp);
1237
1238           *resp = malloc(resp_size);
1239           if (*resp == NULL)
1240               errExit("malloc-seccomp_notif_resp");
1241
1242       }

       /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
          descriptor, 'notifyFd'. */

       static void
       handleNotifications(int notifyFd)
       {
           struct seccomp_notif_sizes sizes;
           struct seccomp_notif *req;
           struct seccomp_notif_resp *resp;
           char path[PATH_MAX];

           allocSeccompNotifBuffers(&req, &resp, &sizes);

           /* Loop handling notifications */

           for (;;) {

               /* Wait for next notification, returning info in '*req' */

               memset(req, 0, sizes.seccomp_notif);
               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
                   if (errno == EINTR)
                       continue;
                   errExit("\tS: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
               }

               printf("\tS: got notification (ID %#llx) for PID %d\n",
                       req->id, req->pid);

               /* The only system call that can generate a notification event
                  is mkdir(2). Nevertheless, we check that the notified system
                  call is indeed mkdir() as a kind of future-proofing of this
                  code in case the seccomp filter is later modified to
                  generate notifications for other system calls. */

               if (req->data.nr != __NR_mkdir) {
                   printf("\tS: notification contained unexpected "
                           "system call number; bye!!!\n");
                   exit(EXIT_FAILURE);
               }

               bool pathOK = getTargetPathname(req, notifyFd, 0, path,
                                               sizeof(path));

               /* Prepopulate some fields of the response */

               resp->id = req->id;     /* Response includes notification ID */
               resp->flags = 0;
               resp->val = 0;

               /* If getTargetPathname() failed, trigger an EINVAL error
                  response (sending this response may yield an error if the
                  failure occurred because the notification ID was no longer
                  valid); if the directory is in /tmp, then create it in the
                  supervisor on behalf of the target; if the pathname starts
                  with "./", tell the kernel to let the target process execute
                  the mkdir(); otherwise, give an error for a directory
                  pathname in any other location. */

               if (!pathOK) {
                   resp->error = -EINVAL;
                   printf("\tS: spoofing error for invalid pathname (%s)\n",
                           strerror(-resp->error));
               } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
                   printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                           path, req->data.args[1]);

                   if (mkdir(path, req->data.args[1]) == 0) {
                       resp->error = 0;            /* "Success" */
                       resp->val = strlen(path);   /* Used as return value of
                                                      mkdir() in target */
                       printf("\tS: success! spoofed return = %lld\n",
                               resp->val);
                   } else {

                       /* If mkdir() failed in the supervisor, pass the error
                          back to the target */

                       resp->error = -errno;
                       printf("\tS: failure! (errno = %d; %s)\n", errno,
                               strerror(errno));
                   }
               } else if (strncmp(path, "./", strlen("./")) == 0) {
                   resp->error = resp->val = 0;
                   resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
                   printf("\tS: target can execute system call\n");
               } else {
                   resp->error = -EOPNOTSUPP;
                   printf("\tS: spoofing error response (%s)\n",
                           strerror(-resp->error));
               }

               /* Send a response to the notification */

               printf("\tS: sending response "
                       "(flags = %#x; val = %lld; error = %d)\n",
                       resp->flags, resp->val, resp->error);

               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
                   if (errno == ENOENT)
                       printf("\tS: response failed with ENOENT; "
                               "perhaps target process's syscall was "
                               "interrupted by a signal?\n");
                   else
                       perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
               }

               /* If the pathname is just "/bye", then the supervisor breaks out
                  of the loop and terminates. This allows us to see what happens
                  if the target process makes further calls to mkdir(2). */

               if (strcmp(path, "/bye") == 0)
                   break;
           }

           free(req);
           free(resp);
           printf("\tS: terminating **********\n");
           exit(EXIT_FAILURE);
       }

       /* Implementation of the supervisor process:

          (1) obtains the notification file descriptor from 'sockPair[1]'
          (2) handles notifications that arrive on that file descriptor. */

       static void
       supervisor(int sockPair[2])
       {
           int notifyFd = recvfd(sockPair[1]);
           if (notifyFd == -1)
               errExit("recvfd");

           closeSocketPair(sockPair);  /* We no longer need the socket pair */

           handleNotifications(notifyFd);
       }

       int
       main(int argc, char *argv[])
       {
           int sockPair[2];

           setbuf(stdout, NULL);

           if (argc < 2) {
               fprintf(stderr, "At least one pathname argument is required\n");
               exit(EXIT_FAILURE);
           }

           /* Create a UNIX domain socket that is used to pass the seccomp
              notification file descriptor from the target process to the
              supervisor process. */

           if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
               errExit("socketpair");

           /* Create a child process--the "target"--that installs seccomp
              filtering. The target process writes the seccomp notification
              file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
              each directory in the command-line arguments. */

           (void) targetProcess(sockPair, &argv[optind]);

           /* Catch SIGCHLD when the target terminates, so that the
              supervisor can also terminate. */

           struct sigaction sa;
           sa.sa_handler = sigchldHandler;
           sa.sa_flags = 0;
           sigemptyset(&sa.sa_mask);
           if (sigaction(SIGCHLD, &sa, NULL) == -1)
               errExit("sigaction");

           supervisor(sockPair);

           exit(EXIT_SUCCESS);
       }
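
       The pathname checks performed in handleNotifications() amount to a
       small policy function.  The following sketch isolates that decision so
       that it can be examined without a live notification file descriptor.
       The type enum notifAction and the functions hasPrefix() and
       classifyPath() are hypothetical names introduced here for illustration;
       they are not part of the program above or of any kernel or library API.
       A NULL pathname stands in for the case where getTargetPathname() failed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical action codes, one per branch of the supervisor's policy */
enum notifAction {
    ACT_SPOOF_EINVAL,       /* Pathname could not be fetched from target */
    ACT_EMULATE,            /* Supervisor performs mkdir() for the target */
    ACT_CONTINUE,           /* Reply with SECCOMP_USER_NOTIF_FLAG_CONTINUE */
    ACT_SPOOF_EOPNOTSUPP    /* Any other location: spoof an error */
};

/* Return nonzero if 's' begins with 'prefix' */
int
hasPrefix(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Map a pathname to an action, in the same order as the if/else chain
   in handleNotifications(). 'path' is NULL if the pathname is unknown. */
enum notifAction
classifyPath(const char *path)
{
    if (path == NULL)
        return ACT_SPOOF_EINVAL;
    if (hasPrefix(path, "/tmp/"))
        return ACT_EMULATE;
    if (hasPrefix(path, "./"))
        return ACT_CONTINUE;
    return ACT_SPOOF_EOPNOTSUPP;
}
```

       Note that the prefix test is purely textual: "/tmp" without a trailing
       slash, or a relative pathname that does not begin with "./", falls
       through to the final branch.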

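       The buffer-sizing rule described in the comments of
       allocSeccompNotifBuffers() can be stated on its own: allocate the
       larger of the structure size known to the user-space headers and the
       size reported by the kernel.  The sketch below is a minimal
       illustration under that assumption; notifBufSize() and allocNotifBuf()
       are hypothetical helpers, not functions from the program above, and
       the sizes are passed in as plain numbers rather than obtained with
       SECCOMP_GET_NOTIF_SIZES.

```c
#include <stdlib.h>

/* Hypothetical helper: the allocation size is the maximum of the size
   compiled into the binary and the size the running kernel expects */
size_t
notifBufSize(size_t headerSize, size_t kernelSize)
{
    return (headerSize > kernelSize) ? headerSize : kernelSize;
}

/* Hypothetical helper: allocate a zeroed buffer of that size and store it
   in '*bufp'. Returns 0 on success, or -1 if the allocation failed. The
   failure test is applied to '*bufp' (the allocation result), not to
   'bufp' (the address of the caller's pointer). */
int
allocNotifBuf(void **bufp, size_t headerSize, size_t kernelSize)
{
    *bufp = calloc(1, notifBufSize(headerSize, kernelSize));
    return (*bufp == NULL) ? -1 : 0;
}
```

       A caller would pass sizeof(struct seccomp_notif_resp) as headerSize and
       the seccomp_notif_resp field of struct seccomp_notif_sizes as
       kernelSize, and release the buffer with free(3) when done.
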
COLOPHON

       This  page  is  part of release 5.13 of the Linux man-pages project.  A
       description of the project, information about reporting bugs,  and  the
       latest     version     of     this    page,    can    be    found    at
       https://www.kernel.org/doc/man-pages/.

Linux                             2021-06-20                SECCOMP_UNOTIFY(2)