SECCOMP_UNOTIFY(2)         Linux Programmer's Manual        SECCOMP_UNOTIFY(2)

NAME

       seccomp_unotify - Seccomp user-space notification mechanism

SYNOPSIS

       #include <linux/seccomp.h>
       #include <linux/filter.h>
       #include <linux/audit.h>

       int seccomp(unsigned int operation, unsigned int flags, void *args);

       #include <sys/ioctl.h>

       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
                 struct seccomp_notif *req);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
                 struct seccomp_notif_resp *resp);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ADDFD,
                 struct seccomp_notif_addfd *addfd);

DESCRIPTION

       This  page  describes the user-space notification mechanism provided by
       the Secure Computing (seccomp) facility.  As well as  the  use  of  the
       SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the SECCOMP_RET_USER_NOTIF ac‐
       tion value, and the SECCOMP_GET_NOTIF_SIZES operation described in sec‐
       comp(2),  this  mechanism  involves  the  use  of  a  number of related
       ioctl(2) operations (described below).

   Overview
       In conventional usage of a seccomp filter, the decision  about  how  to
       treat  a  system  call  is made by the filter itself.  By contrast, the
       user-space notification mechanism allows the seccomp filter to delegate
       the  handling  of  the system call to another user-space process.  Note
       that this mechanism is explicitly not intended as a method implementing
       security policy; see NOTES.

       In the discussion that follows, the thread(s) on which the seccomp fil‐
       ter is installed is (are) referred to as the target,  and  the  process
       that  is  notified by the user-space notification mechanism is referred
       to as the supervisor.

       A suitably privileged supervisor can use  the  user-space  notification
       mechanism to perform actions on behalf of the target.  The advantage of
       the user-space notification mechanism is that the supervisor will  usu‐
       ally be able to retrieve information about the target and the performed
       system call that the seccomp filter itself cannot.  (A  seccomp  filter
       is limited in the information it can obtain and the actions that it can
       perform because it is running on a virtual machine inside the kernel.)

       An overview of the steps performed by the target and the supervisor  is
       as follows:

       1. The  target  establishes  a  seccomp filter in the usual manner, but
          with two differences:

          • The seccomp(2)  flags  argument  includes  the  flag  SECCOMP_FIL‐
            TER_FLAG_NEW_LISTENER.   Consequently,  the  return  value  of the
            (successful) seccomp(2) call is a new "listening" file  descriptor
            that  can  be used to receive notifications.  Only one "listening"
            seccomp filter can be installed for a thread.

          • In cases where it is appropriate, the seccomp filter  returns  the
            action value SECCOMP_RET_USER_NOTIF.  This return value will trig‐
            ger a notification event.

       2. In order that the supervisor can obtain notifications using the lis‐
          tening  file  descriptor, (a duplicate of) that file descriptor must
          be passed from the target to the supervisor.  One way in which  this
          could  be  done is by passing the file descriptor over a UNIX domain
          socket connection between the target and the supervisor  (using  the
          SCM_RIGHTS  ancillary  message  type described in unix(7)).  Another
          way to do this is through the use of pidfd_getfd(2).

       3. The supervisor will receive notification  events  on  the  listening
          file  descriptor.   These  events are returned as structures of type
          seccomp_notif.  Because this structure and its size may evolve  over
          kernel  versions,  the  supervisor  must first determine the size of
          this structure using the seccomp(2)  SECCOMP_GET_NOTIF_SIZES  opera‐
          tion,  which  returns  a structure of type seccomp_notif_sizes.  The
          supervisor  allocates  a  buffer  of  size  seccomp_notif_sizes.sec‐
          comp_notif bytes to receive notification events.  In  addition,  the
          supervisor allocates another buffer of size seccomp_notif_sizes.sec‐
          comp_notif_resp  bytes for the response (a struct seccomp_notif_resp
          structure) that it will provide to the kernel (and thus the target).

       4. The target then performs its workload, which includes  system  calls
          that  will  be  controlled  by  the seccomp filter.  Whenever one of
          these  system  calls  causes  the  filter   to   return   the   SEC‐
          COMP_RET_USER_NOTIF  action value, the kernel does not (yet) execute
          the system call; instead, execution of  the  target  is  temporarily
          blocked inside the kernel (in a sleep state that is interruptible by
          signals) and a notification event is generated on the listening file
          descriptor.

       5. The  supervisor  can  now  repeatedly monitor the listening file de‐
          scriptor for SECCOMP_RET_USER_NOTIF-triggered events.  To  do  this,
          the  supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation
          to read information  about  a  notification  event;  this  operation
          blocks  until  an  event is available.  The operation returns a sec‐
          comp_notif structure containing information about  the  system  call
          that  is being attempted by the target.  (As described in NOTES, the
          file descriptor can also be monitored with  select(2),  poll(2),  or
          epoll(7).)

       6. The seccomp_notif structure returned by the SECCOMP_IOCTL_NOTIF_RECV
          operation includes the same information (a  seccomp_data  structure)
          that  was passed to the seccomp filter.  This information allows the
          supervisor to discover the system call number and the arguments  for
          the  target's system call.  In addition, the notification event con‐
          tains the ID of the thread that triggered  the  notification  and  a
          unique  cookie  value  that  is used in subsequent SECCOMP_IOCTL_NO‐
          TIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND operations.

          The information in the notification can be used to discover the val‐
          ues  of  pointer  arguments  for the target's system call.  (This is
          something that can't be done from within a seccomp filter.)  One way
          in  which  the  supervisor  can do this is to open the corresponding
          /proc/[tid]/mem file (see proc(5)) and read bytes from the  location
          that corresponds to one of the pointer arguments whose value is sup‐
          plied in the notification event.  (The supervisor must be careful to
          avoid  a  race condition that can occur when doing this; see the de‐
          scription of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation be‐
          low.)   In addition, the supervisor can access other system informa‐
          tion that is visible in user space but which is not accessible  from
          a seccomp filter.

       7. Having obtained information as per the previous step, the supervisor
          may then choose to perform an action in  response  to  the  target's
          system call (which, as noted above, is not executed when the seccomp
          filter returns the SECCOMP_RET_USER_NOTIF action value).

          One example use case here relates to containers.  The target may  be
          located  inside  a container where it does not have sufficient capa‐
          bilities to mount a filesystem in the container's  mount  namespace.
          However,  the  supervisor may be a more privileged process that does
          have sufficient capabilities to perform the mount operation.

       8. The supervisor then sends a response to the notification.   The  in‐
          formation  in this response is used by the kernel to construct a re‐
          turn value for the target's system call and  provide  a  value  that
          will be assigned to the errno variable of the target.

          The response is sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) op‐
          eration, which is used to transmit a seccomp_notif_resp structure to
          the kernel.  This structure must include the cookie value  that  the
          supervisor  obtained in the seccomp_notif structure returned by the
          SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the kernel  to
          associate the response with the target.

       9. Once the notification response has been sent, the system call in the
          target thread unblocks, returning the information that was  provided
          by the supervisor in the notification response.

       As a variation on the last two steps, the supervisor  can  send  a  re‐
       sponse that tells the kernel that it should execute the target thread's
       system call; see the  discussion  of  SECCOMP_USER_NOTIF_FLAG_CONTINUE,
       below.

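       The first step of the overview (installing a filter that  creates  a
       listening file descriptor) might look roughly like  the  sketch  below.
       The helper names build_notify_filter() and install_notify_filter() are
       hypothetical, and the filter is deliberately  minimal:  it  notifies
       on one system call number for one architecture and allows  everything
       else.  It assumes kernel headers from Linux 5.0 or later.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical helper: write into 'out' a filter that returns
   SECCOMP_RET_USER_NOTIF for system call 'syscall_nr' on architecture
   'audit_arch' and allows every other system call.  Returns the number
   of instructions written, or 0 if 'out' is too small. */
size_t
build_notify_filter(struct sock_filter *out, size_t cap,
                    __u32 audit_arch, __u32 syscall_nr)
{
    struct sock_filter insns[] = {
        /* Load the architecture; kill the process if it is unexpected */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, audit_arch, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

        /* Load the system call number; notify on a match */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),

        /* Everything else is allowed */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    size_t n = sizeof(insns) / sizeof(insns[0]);

    if (cap < n)
        return 0;
    memcpy(out, insns, sizeof(insns));
    return n;
}

/* Hypothetical helper: install the filter on the calling thread.
   Returns the listening file descriptor, or -1 with errno set. */
int
install_notify_filter(__u32 audit_arch, __u32 syscall_nr)
{
    struct sock_filter insns[8];
    struct sock_fprog prog;

    prog.len = (unsigned short) build_notify_filter(insns, 8,
                                                    audit_arch, syscall_nr);
    prog.filter = insns;

    /* Required unless the caller has CAP_SYS_ADMIN */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

       On x86-64, for example,  a  target  could  call  install_notify_fil‐
       ter(AUDIT_ARCH_X86_64, SYS_mkdir) and then pass  the  returned  file
       descriptor to the supervisor as described in step 2.
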

IOCTL OPERATIONS

       The  following  ioctl(2)  operations are supported by the seccomp user-
       space notification file descriptor.  For each of these operations,  the
       first  (file descriptor) argument of ioctl(2) is the listening file de‐
       scriptor returned  by  a  call  to  seccomp(2)  with  the  SECCOMP_FIL‐
       TER_FLAG_NEW_LISTENER flag.

   SECCOMP_IOCTL_NOTIF_RECV
       The  SECCOMP_IOCTL_NOTIF_RECV  operation (available since Linux 5.0) is
       used to obtain a user-space notification event.  If no  such  event  is
       currently  pending,  the  operation  blocks until an event occurs.  The
       third ioctl(2) argument is a pointer to a structure  of  the  following
       form  which  contains information about the event.  This structure must
       be zeroed out before the call.

           struct seccomp_notif {
               __u64  id;              /* Cookie */
               __u32  pid;             /* TID of target thread */
               __u32  flags;           /* Currently unused (0) */
               struct seccomp_data data;   /* See seccomp(2) */
           };

       The fields in this structure are as follows:

       id     This is a cookie for the  notification.   Each  such  cookie  is
              guaranteed to be unique for the corresponding seccomp filter.

              • The  cookie  can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID
                ioctl(2) operation described below.

              • When returning a notification response to the kernel, the  su‐
                pervisor  must  include  the  cookie  value in the seccomp_no‐
                tif_resp structure that is specified as the  argument  of  the
                SECCOMP_IOCTL_NOTIF_SEND operation.

       pid    This  is  the  thread ID of the target thread that triggered the
              notification event.

       flags  This is a bit mask of flags providing further information on the
              event.   In  the  current  implementation,  this field is always
              zero.

       data   This is a seccomp_data structure  containing  information  about
              the  system  call  that triggered the notification.  This is the
              same structure that is passed to the seccomp filter.   See  sec‐
              comp(2) for details of this structure.

       On  success,  this operation returns 0; on failure, -1 is returned, and
       errno is set to indicate the cause of the error.   This  operation  can
       fail with the following errors:

       EINVAL (since Linux 5.5)
              The  seccomp_notif  structure  that  was passed to the call con‐
              tained nonzero fields.

       ENOENT The target thread was killed by a signal as the notification in‐
              formation  was being generated, or the target's (blocked) system
              call was interrupted by a signal handler.

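       As a rough sketch of how a supervisor might size, zero, and fill  the
       receive buffer: the helper names alloc_notif() and recv_notification()
       below are hypothetical, and the error handling is only one reasonable
       choice.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: allocate a zeroed receive buffer of the size
   reported by the running kernel.  Returns NULL on failure; on success,
   *size holds the buffer size and the caller must free() the buffer. */
struct seccomp_notif *
alloc_notif(size_t *size)
{
    struct seccomp_notif_sizes sizes;

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
        return NULL;

    /* Never allocate less than the structure this code was built with */
    *size = sizes.seccomp_notif;
    if (*size < sizeof(struct seccomp_notif))
        *size = sizeof(struct seccomp_notif);

    return calloc(1, *size);
}

/* Hypothetical helper: fetch one notification into 'req', which must be
   at least 'size' bytes.  Returns 0 on success, or -1 with errno set.
   ENOENT means the target went away or was interrupted; the caller can
   simply wait for the next event. */
int
recv_notification(int notify_fd, struct seccomp_notif *req, size_t size)
{
    memset(req, 0, size);       /* The structure must be zeroed */
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req);
}
```

       A supervisor loop would typically call recv_notification()  repeat‐
       edly, skipping to the next iteration when errno is ENOENT.
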
   SECCOMP_IOCTL_NOTIF_ID_VALID
       The SECCOMP_IOCTL_NOTIF_ID_VALID operation (available since Linux  5.0)
       is  used  to  check  that a notification ID returned by an earlier SEC‐
       COMP_IOCTL_NOTIF_RECV operation is still valid (i.e., that  the  target
       still  exists  and  its  system call is still blocked waiting for a re‐
       sponse).

       The third ioctl(2) argument is a pointer to the cookie (id) returned by
       the SECCOMP_IOCTL_NOTIF_RECV operation.

       This operation is necessary to avoid race conditions  that  can  occur
       when the thread identified by the pid field returned by the  SEC‐
       COMP_IOCTL_NOTIF_RECV operation terminates, and its ID is  reused  by
       another thread or process.  An example of this kind  of  race  is  the
       following:

       1. A notification is generated on the listening file  descriptor.   The
          returned seccomp_notif contains the TID of the target thread (in the
          pid field of the structure).

       2. The target terminates.

       3. Another thread or process is created on the system  that  by  chance
          reuses the TID that was freed when the target terminated.

       4. The  supervisor  open(2)s  the  /proc/[tid]/mem file for the TID ob‐
          tained in step 1, with the intention of (say) inspecting the  memory
          location(s) that contain the argument(s) of the  system  call  that
          triggered the notification in step 1.

       In the above scenario, the risk is that the supervisor may try  to  ac‐
       cess  the  memory of a process other than the target.  This race can be
       avoided by following the  call  to  open(2)  with  a  SECCOMP_IOCTL_NO‐
       TIF_ID_VALID  operation  to  verify that the process that generated the
       notification is still alive.  (Note that if the target terminates after
       the  latter step, a subsequent read(2) from the file descriptor may re‐
       turn 0, indicating end of file.)

       See NOTES for a  discussion  of  other  cases  where  SECCOMP_IOCTL_NO‐
       TIF_ID_VALID checks must be performed.

       On  success  (i.e., the notification ID is still valid), this operation
       returns 0.  On failure (i.e., the notification ID is no longer  valid),
       -1 is returned, and errno is set to ENOENT.

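       The open-then-check ordering described above could be  sketched  as
       follows.  The helper name open_target_mem() is  hypothetical;  the
       point is only that the validity check comes after open(2), so  that
       a successful check vouches for the file descriptor already held.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: open /proc/[tid]/mem for the thread named in
   'req' and confirm that the notification is still valid.  Returns an
   open file descriptor, or -1 if the open or the check failed. */
int
open_target_mem(int notify_fd, const struct seccomp_notif *req)
{
    char path[64];
    int mem_fd;
    __u64 id = req->id;

    snprintf(path, sizeof(path), "/proc/%u/mem", req->pid);

    mem_fd = open(path, O_RDONLY | O_CLOEXEC);
    if (mem_fd == -1)
        return -1;

    /* The TID may have been recycled before open() ran.  If the
       notification is still valid, the target was still alive at this
       point, so 'mem_fd' refers to the target's memory. */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        close(mem_fd);
        return -1;
    }

    return mem_fd;
}
```
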
   SECCOMP_IOCTL_NOTIF_SEND
       The  SECCOMP_IOCTL_NOTIF_SEND  operation (available since Linux 5.0) is
       used to send a notification response back to  the  kernel.   The  third
       ioctl(2) argument is a pointer to a structure of the following form:

           struct seccomp_notif_resp {
               __u64 id;           /* Cookie value */
               __s64 val;          /* Success return value */
               __s32 error;        /* 0 (success) or negative error number */
               __u32 flags;        /* See below */
           };

       The fields of this structure are as follows:

       id     This is the cookie  value  that  was  obtained  using  the  SEC‐
              COMP_IOCTL_NOTIF_RECV  operation.   This cookie value allows the
              kernel to correctly associate this response with the system call
              that triggered the user-space notification.

       val    This is the value that will be used for a spoofed success return
              for the target's system call; see below.

       error  This is the value that will be used as the error number  (errno)
              for a spoofed error return for the target's system call; see be‐
              low.

       flags  This is a bit mask that includes zero or more of  the  following
              flags:

              SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
                     Tell the kernel to execute the target's system call.

       Two kinds of response are possible:

       • A  response  to  the kernel telling it to execute the target's system
         call.  In  this  case,  the  flags  field  includes  SECCOMP_USER_NO‐
         TIF_FLAG_CONTINUE and the error and val fields must be zero.

         This  kind  of  response  can be useful in cases where the supervisor
         needs to do deeper analysis of the target's system call than is  pos‐
         sible  from  a  seccomp filter (e.g., examining the values of pointer
         arguments), and, having decided that the system call does not require
         emulation  by the supervisor, the supervisor wants the system call to
         be executed normally in the target.

         The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be  used  with  cau‐
         tion; see NOTES.

       • A  spoofed  return value for the target's system call.  In this case,
         the kernel does not execute the target's system call, instead causing
         the  system  call to return a spoofed value as specified by fields of
         the seccomp_notif_resp structure.   The  supervisor  should  set  the
         fields of this structure as follows:

         +  flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.

         +  error  is  set  either to 0 for a spoofed "success" return or to a
            negative error number for a spoofed "failure" return.  In the for‐
            mer case, the kernel causes the target's system call to return the
            value specified in the val field.  In the latter case, the  kernel
            causes  the  target's  system  call to return -1, and errno is as‐
            signed the negated error value.

         +  val is set to a value that will be used as the return value for  a
            spoofed  "success" return for the target's system call.  The value
            in this field is ignored if the error  field  contains  a  nonzero
            value.

       On  success,  this operation returns 0; on failure, -1 is returned, and
       errno is set to indicate the cause of the error.   This  operation  can
       fail with the following errors:

       EINPROGRESS
              A response to this notification has already been sent.

       EINVAL An invalid value was specified in the flags field.

       EINVAL The  flags field contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and
              the error or val field was not zero.

       ENOENT The blocked system call in the target has been interrupted by  a
              signal handler or the target has terminated.

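       The three ways of filling in a seccomp_notif_resp  structure  can  be
       summarized in code.  The helper names  below  (resp_success(),
       resp_failure(), resp_continue()) are hypothetical; they only set  the
       fields as described above and leave the SECCOMP_IOCTL_NOTIF_SEND call
       to the caller.

```c
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

/* Spoofed "success": the target's system call returns 'val'. */
void
resp_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Spoofed "failure": 'err' is a positive errno value such as EPERM.
   The kernel expects a negative number; the target sees -1 and errno
   set to 'err'. */
void
resp_failure(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;
}

/* Let the kernel execute the target's system call.  The 'error' and
   'val' fields must be zero, which the memset() guarantees. */
void
resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

       Zeroing the whole structure first avoids the EINVAL  error  that  re‐
       sults from a stray nonzero error or val field in a "continue" response.
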
   SECCOMP_IOCTL_NOTIF_ADDFD
       The SECCOMP_IOCTL_NOTIF_ADDFD operation (available since Linux 5.9) al‐
       lows the supervisor to install a file descriptor into the target's file
       descriptor  table.   Much like the use of SCM_RIGHTS messages described
       in unix(7), this operation is semantically equivalent to duplicating  a
       file  descriptor  from  the supervisor's file descriptor table into the
       target's file descriptor table.

       The SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to  emu‐
       late  a target system call (such as socket(2) or openat(2)) that gener‐
       ates a file descriptor.  The supervisor can  perform  the  system  call
       that  generates  the file descriptor (and associated open file descrip‐
       tion) and then use this operation to allocate a  file  descriptor  that
       refers to the same open file description in the target.  (For an expla‐
       nation of open file descriptions, see open(2).)

       Once this operation has been performed, the supervisor  can  close  its
       copy of the file descriptor.

       In  the  target,  the  received  file descriptor is subject to the same
       Linux Security Module (LSM) checks as are applied to a file  descriptor
       that  is  received in an SCM_RIGHTS ancillary message.  If the file de‐
       scriptor refers to a socket, it inherits the cgroup version  1  network
       controller settings (classid and netprioidx) of the target.

       The  third ioctl(2) argument is a pointer to a structure of the follow‐
       ing form:

           struct seccomp_notif_addfd {
               __u64 id;           /* Cookie value */
               __u32 flags;        /* Flags */
               __u32 srcfd;        /* Local file descriptor number */
               __u32 newfd;        /* 0 or desired file descriptor
                                      number in target */
               __u32 newfd_flags;  /* Flags to set on target file
                                      descriptor */
           };

       The fields in this structure are as follows:

       id     This field should be set to the notification ID  (cookie  value)
              that was obtained via SECCOMP_IOCTL_NOTIF_RECV.

       flags  This  field  is  a bit mask of flags that modify the behavior of
              the operation.  Currently, only one flag is supported:

              SECCOMP_ADDFD_FLAG_SETFD
                     When allocating the file descriptor in  the  target,  use
                     the file descriptor number specified in the newfd field.

       srcfd  This field should be set to the number of the file descriptor in
              the supervisor that is to be duplicated.

       newfd  This field determines which file descriptor number is  allocated
              in  the  target.   If  the SECCOMP_ADDFD_FLAG_SETFD flag is set,
              then this field specifies which file descriptor number should be
              allocated.   If  this  file descriptor number is already open in
              the target, it is atomically closed and reused.  If the descrip‐
              tor  duplication fails due to an LSM check, or if srcfd is not a
              valid file descriptor, the file descriptor  newfd  will  not  be
              closed in the target process.

              If the SECCOMP_ADDFD_FLAG_SETFD flag is not set, then this field
              must be 0, and the kernel allocates the lowest unused  file  de‐
              scriptor number in the target.

       newfd_flags
              This  field is a bit mask specifying flags that should be set on
              the file descriptor that is  received  in  the  target  process.
              Currently, only the following flag is implemented:

              O_CLOEXEC
                     Set  the close-on-exec flag on the received file descrip‐
                     tor.

       On success, this ioctl(2) call returns the number of the file  descrip‐
       tor  that was allocated in the target.  Assuming that the emulated sys‐
       tem call is one that returns a file descriptor as its  function  result
       (e.g.,  socket(2)),  this  value  can  be  used  as  the  return  value
       (resp.val) that is supplied in the response that is  subsequently  sent
       with the SECCOMP_IOCTL_NOTIF_SEND operation.

       On  error, -1 is returned and errno is set to indicate the cause of the
       error.

       This operation can fail with the following errors:

       EBADF  Allocating the file descriptor in the  target  would  cause  the
              target's RLIMIT_NOFILE limit to be exceeded (see getrlimit(2)).

       EINPROGRESS
              The user-space notification specified in the id field exists but
              has not yet been fetched (by a SECCOMP_IOCTL_NOTIF_RECV) or  has
              already been responded to (by a SECCOMP_IOCTL_NOTIF_SEND).

       EINVAL An invalid flag was specified in the flags or newfd_flags field,
              or the newfd field is nonzero and  the  SECCOMP_ADDFD_FLAG_SETFD
              flag was not specified in the flags field.

       EMFILE The  file descriptor number specified in newfd exceeds the limit
              specified in /proc/sys/fs/nr_open.

       ENOENT The blocked system call in the target has been interrupted by  a
              signal handler or the target has terminated.

       Here  is  some  sample code (with error handling omitted) that uses the
       SECCOMP_IOCTL_NOTIF_ADDFD operation (here, to emulate a call to
       openat(2)):

           int fd, targetFd;

           /* 'path' is assumed to hold the pathname that the supervisor
              has already read from the target's memory */
           fd = openat(req->data.args[0], path, req->data.args[2],
                           req->data.args[3]);

           struct seccomp_notif_addfd addfd;
           addfd.id = req->id; /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
           addfd.srcfd = fd;
           addfd.newfd = 0;
           addfd.flags = 0;
           addfd.newfd_flags = O_CLOEXEC;

           targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);

           close(fd);          /* No longer needed in supervisor */

           struct seccomp_notif_resp *resp;
               /* Code to allocate 'resp' omitted */
           resp->id = req->id;
           resp->error = 0;        /* "Success" */
           resp->val = targetFd;
           resp->flags = 0;
           ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);


NOTES

       One  example  use  case for the user-space notification mechanism is to
       allow a container manager (a process which is  typically  running  with
       more  privilege than the processes inside the container) to mount block
       devices or create device nodes for the container.  The mount  use  case
       provides an example of where the SECCOMP_USER_NOTIF_FLAG_CONTINUE  flag
       is useful.  Upon receiving a notification for the mount(2) system call,
       the container manager (the "supervisor") can distinguish a  request  to
       mount a block filesystem (which would not be possible for  a  "target"
       process inside the container) and mount that filesystem.  If,  on  the
       other hand, the container manager detects that the operation could  be
       performed by the process inside the container (e.g.,  a  mount  of  a
       tmpfs(5) filesystem), it can notify the kernel that the target process's
       mount(2) system call can continue.

   select()/poll()/epoll semantics
       The  file descriptor returned when seccomp(2) is employed with the SEC‐
       COMP_FILTER_FLAG_NEW_LISTENER flag  can  be  monitored  using  poll(2),
       epoll(7),  and  select(2).  These interfaces indicate that the file de‐
       scriptor is ready as follows:

       • When a notification is pending, these interfaces  indicate  that  the
         file  descriptor is readable.  Following such an indication, a subse‐
         quent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block, returning ei‐
         ther  information about a notification or else failing with the error
         ENOENT if the target has been killed by a signal or its  system  call
         has been interrupted by a signal handler.

       • After   the  notification  has  been  received  (i.e.,  by  the  SEC‐
         COMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces  indicate
         that the file descriptor is writable, meaning that a notification re‐
         sponse can be sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) opera‐
         tion.

       • After the last thread using the filter has terminated and been reaped
         using waitpid(2) (or similar), the file descriptor indicates an  end-
         of-file   condition   (readable  in  select(2);  POLLHUP/EPOLLHUP  in
         poll(2)/ epoll_wait(2)).

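       The readiness rules above can be turned into a small wait  routine.
       The sketch below uses poll(2); the function name  wait_for_notifica‐
       tion() and the enum values are hypothetical.

```c
#include <poll.h>

enum notify_state {
    NOTIFY_ERROR   = -1,    /* poll() failed, or POLLERR/POLLNVAL */
    NOTIFY_TIMEOUT = 0,     /* Nothing happened within the timeout */
    NOTIFY_PENDING = 1,     /* Readable: NOTIF_RECV will not block */
    NOTIFY_CLOSED  = 2      /* POLLHUP: no thread uses the filter */
};

/* Hypothetical helper: wait up to 'timeout_ms' milliseconds (-1 means
   wait indefinitely) for activity on the listening file descriptor. */
int
wait_for_notification(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n == -1)
        return NOTIFY_ERROR;
    if (n == 0)
        return NOTIFY_TIMEOUT;
    if (pfd.revents & POLLIN)
        return NOTIFY_PENDING;
    if (pfd.revents & POLLHUP)
        return NOTIFY_CLOSED;
    return NOTIFY_ERROR;
}
```

       Even after NOTIFY_PENDING, the subsequent SECCOMP_IOCTL_NOTIF_RECV can
       still fail if the target was killed or interrupted in  the  meantime,
       so the caller should be prepared to go back to waiting.
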
   Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
       The intent of the user-space notification feature is  to  allow  system
       calls  to  be  performed  on behalf of the target.  The target's system
       call should either be handled by the supervisor or allowed to  continue
       normally  in  the  kernel (where standard security policies will be ap‐
       plied).

       Note well: this mechanism must not be used to make security policy  de‐
       cisions about the system call, which would be inherently race-prone for
       reasons described next.

       The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be  used  with  caution.
       If set by the supervisor, the target's system call will continue.  How‐
       ever, there is a time-of-check, time-of-use race  here,  since  an  at‐
       tacker  could  exploit the interval of time where the target is blocked
       waiting on the "continue" response to do things such as  rewriting  the
       system call arguments.

       Note  furthermore that a user-space notifier can be bypassed if the ex‐
       isting filters allow the use of seccomp(2) or  prctl(2)  to  install  a
       filter  that returns an action value with a higher precedence than SEC‐
       COMP_RET_USER_NOTIF (see seccomp(2)).

       It should thus be absolutely clear that the seccomp user-space  notifi‐
       cation mechanism cannot be used to implement a  security  policy!   It
       should only ever be used in scenarios where a more  privileged  process
       supervises the system calls of a less privileged target to get  around
       kernel-enforced security restrictions when the  supervisor  deems  this
       safe.  In other words, in order to continue a system call, the supervi‐
       sor should be sure that another security mechanism or the kernel itself
       will  sufficiently block the system call if its arguments are rewritten
       to something unsafe.

   Caveats regarding the use of /proc/[tid]/mem
       The discussion above  noted  the  need  to  use  the  SECCOMP_IOCTL_NO‐
       TIF_ID_VALID ioctl(2) when opening the /proc/[tid]/mem file of the tar‐
       get to avoid the possibility of  accessing  the  memory  of  the  wrong
       process  in the event that the target terminates and its ID is recycled
       by another (unrelated) thread.  However, the use of this ioctl(2) oper‐
       ation  is  also necessary in other situations, as explained in the fol‐
       lowing paragraphs.

       Consider the following scenario, where the supervisor tries to read the
       pathname argument of a target's blocked mount(2) system call:

       • From  one of its functions (func()), the target calls mount(2), which
         triggers a user-space notification and causes the target to block.

       • The supervisor receives the notification, opens /proc/[tid]/mem,  and
         (successfully) performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.

       • The target receives a signal, which causes the mount(2) to abort.

       • The signal handler executes in the target, and returns.

       • Upon return from the handler, the execution of func() resumes, and it
         returns (and perhaps other functions are called, overwriting the mem‐
         ory that had been used for the stack frame of func()).

       • Using  the  address provided in the notification information, the su‐
         pervisor reads from the target's memory location that used to contain
         the pathname.

       • The  supervisor now calls mount(2) with some arbitrary bytes obtained
         in the previous step.

       The conclusion from the above scenario  is  this:  since  the  target's
       blocked  system call may be interrupted by a signal handler, the super‐
       visor must be written to expect that the target may abandon its  system
       call at any time; in such an event, any information that the supervisor
       obtained from the target's memory must be considered invalid.

       To prevent such scenarios, every read from the target's memory must  be
       separated  from  use  of  the  bytes so obtained by a SECCOMP_IOCTL_NO‐
       TIF_ID_VALID check.  In the above example, the check  would  be  placed
       between  the  two  final steps.  An example of such a check is shown in
       EXAMPLES.

       Following on from the above, it should be clear that a write by the su‐
       pervisor into the target's memory can never be considered safe.

607   Caveats regarding blocking system calls
608       Suppose  that  the  target  performs  a blocking system call (e.g., ac‐
609       cept(2)) that the supervisor should handle.  The supervisor might  then
610       in turn execute the same blocking system call.
611
612       In  this  scenario, it is important to note that if the target's system
613       call is now interrupted by a signal, the supervisor is not informed  of
614       this.   If the supervisor does not take suitable steps to actively dis‐
615       cover that the target's system call has been canceled,  various  diffi‐
616       culties  can  occur.   Taking  the example of accept(2), the supervisor
617       might remain blocked in its accept(2) holding a port  number  that  the
618       target  (which,  after  the interruption by the signal handler, perhaps
619       closed  its listening socket) might expect to be able  to  reuse  in  a
620       bind(2) call.
621
622       Therefore,  when  the  supervisor  wishes  to emulate a blocking system
623       call, it must do so in such a way that it gets informed if the target's
624       system  call  is  interrupted by a signal handler.  For example, if the
625       supervisor itself executes the same blocking system call, then it could
626       employ a separate thread that uses the SECCOMP_IOCTL_NOTIF_ID_VALID op‐
627       eration to check if the target is still blocked  in  its  system  call.
628       Alternatively,  in  the  accept(2)  example,  the  supervisor might use
629       poll(2) to monitor both the notification file  descriptor  (so  as  to
630       discover when the target's accept(2) call has been interrupted) and the
631       listening file descriptor (so as to know when a  connection  is  avail‐
632       able).
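
          The poll(2)-based approach can be sketched as follows.  This is a
          minimal illustration rather than a complete supervisor: the function
          name waitForConnOrNotifyEvent() and the enum waitResult are
          hypothetical names introduced here, and listenFd is assumed to be a
          listening socket that the supervisor holds on behalf of the target.
          An event on the notification file descriptor is treated only as a
          hint; the supervisor would still need to perform a
          SECCOMP_IOCTL_NOTIF_ID_VALID check to learn whether the notification
          it is handling is still valid.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical result codes used only by this sketch */
enum waitResult {
    WAIT_CONN_READY,    /* Listening socket has an event; accept(2)
                           should not block */
    WAIT_NOTIFY_EVENT,  /* Activity on the notification FD; recheck the
                           notification ID before doing anything else */
    WAIT_ERROR          /* poll(2) failed; errno is set */
};

/* Block until either the notification file descriptor or the listening
   file descriptor reports an event. The notification FD is examined
   first, so that a possible cancellation takes priority over a newly
   arrived connection. */
static enum waitResult
waitForConnOrNotifyEvent(int notifyFd, int listenFd)
{
    struct pollfd pfds[2] = {
        { .fd = notifyFd, .events = POLLIN },
        { .fd = listenFd, .events = POLLIN },
    };
    const short mask = POLLIN | POLLHUP | POLLERR | POLLNVAL;

    for (;;) {
        int ready = poll(pfds, 2, -1);
        if (ready == -1) {
            if (errno == EINTR)
                continue;       /* Interrupted by a signal; retry */
            return WAIT_ERROR;
        }

        if (pfds[0].revents & mask)
            return WAIT_NOTIFY_EVENT;
        if (pfds[1].revents & mask)
            return WAIT_CONN_READY;
    }
}
```

          If the function reports WAIT_NOTIFY_EVENT, the supervisor should
          recheck the notification ID and, if the ID is no longer valid,
          abandon the emulated accept(2) rather than continuing to wait.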
633
634       If  the  target's  system call is interrupted, the supervisor must take
635       care to release resources (e.g., file descriptors) that it acquired  on
636       behalf of the target.
637
638   Interaction with SA_RESTART signal handlers
639       Consider the following scenario:
640
641       • The  target process has used sigaction(2) to install a signal handler
642         with the SA_RESTART flag.
643
644       • The target has made a system call that triggered a seccomp user-space
645         notification and the target is currently blocked until the supervisor
646         sends a notification response.
647
648       • A signal is delivered to the target and the signal  handler  is  exe‐
649         cuted.
650
651       • When  (if)  the  supervisor attempts to send a notification response,
652         the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will  fail  with  the
653         ENOENT error.
654
655       In  this  scenario,  the  kernel will restart the target's system call.
656       Consequently, the supervisor will receive another user-space  notifica‐
657       tion.  Thus, depending on how many times the blocked system call is in‐
658       terrupted by a signal handler, the supervisor may receive multiple  no‐
659       tifications for the same instance of a system call in the target.
660
661       One oddity is that system call restarting as described in this scenario
662       will occur even for the blocking system calls listed in signal(7)  that
663       would never normally be restarted by the SA_RESTART flag.
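
          One way for a supervisor to cope with this scenario is to treat an
          ENOENT failure of SECCOMP_IOCTL_NOTIF_SEND as a stale notification
          rather than as a fatal error.  The following sketch shows only that
          classification step; the names classifySendResult() and enum
          sendOutcome are hypothetical and do not appear in the example
          program in this page.

```c
#include <errno.h>

/* Hypothetical outcome codes used only by this sketch */
enum sendOutcome {
    SEND_OK,        /* Response was accepted by the kernel */
    SEND_STALE,     /* ENOENT: the notification is no longer valid (for
                       example, the target's system call was interrupted
                       by a signal handler) */
    SEND_FAILED     /* Any other error */
};

/* Classify the result of a SECCOMP_IOCTL_NOTIF_SEND ioctl(2) call.
   'ioctlRet' is the return value of ioctl(2) and 'savedErrno' is the
   value of errno captured immediately after the call. */
static enum sendOutcome
classifySendResult(int ioctlRet, int savedErrno)
{
    if (ioctlRet == 0)
        return SEND_OK;
    if (savedErrno == ENOENT)
        return SEND_STALE;
    return SEND_FAILED;
}
```

          On SEND_STALE, a supervisor would typically release anything it
          acquired on behalf of the target for this notification and then
          return to waiting for the next notification, which may be for the
          restarted system call.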
664

BUGS

666       If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed after the
667       target terminates, then the ioctl(2) call simply  blocks  (rather  than
668       returning an error to indicate that the target no longer exists).
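
          A possible way to avoid blocking indefinitely is to wait on the
          notification file descriptor with poll(2) and a timeout before
          calling SECCOMP_IOCTL_NOTIF_RECV.  The sketch below assumes a kernel
          in which the notification file descriptor reports POLLHUP once the
          filter is no longer in use; on kernels without that behavior, only
          the timeout helps.  The function name waitForNotification() is
          hypothetical.

```c
#include <errno.h>
#include <poll.h>

/* Wait up to 'timeoutMs' milliseconds for a notification to become
   readable on 'notifyFd'.

   Returns 1 if a notification can be received, 0 if the timeout
   expired, or -1 if poll(2) failed or the file descriptor reported
   only a hangup or error condition. */
static int
waitForNotification(int notifyFd, int timeoutMs)
{
    struct pollfd pfd = { .fd = notifyFd, .events = POLLIN };
    int ready;

    do {
        ready = poll(&pfd, 1, timeoutMs);
    } while (ready == -1 && errno == EINTR);

    if (ready == -1)
        return -1;
    if (ready == 0)
        return 0;
    if (pfd.revents & POLLIN)
        return 1;

    return -1;          /* POLLHUP, POLLERR, or POLLNVAL */
}
```

          A supervisor loop could call this function before each
          SECCOMP_IOCTL_NOTIF_RECV operation and use a 0 return as an
          opportunity to check by other means (for example, waitpid(2)) whether
          the target still exists.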
669

EXAMPLES

671       The  (somewhat  contrived)  program shown below demonstrates the use of
672       the interfaces described in this page.  The  program  creates  a  child
673       process  that  serves  as  the "target" process.  The child process in‐
674       stalls a seccomp filter that returns the SECCOMP_RET_USER_NOTIF  action
675       value  if  a  call  is  made to mkdir(2).  The child process then calls
676       mkdir(2) once for each of the supplied command-line arguments, and  re‐
677       ports the result returned by the call.  After processing all arguments,
678       the child process terminates.
679
680       The parent process acts as the supervisor, listening for the  notifica‐
681       tions  that are generated when the target process calls mkdir(2).  When
682       such a notification occurs, the supervisor examines the memory  of  the
683       target  process  (using /proc/[pid]/mem) to discover the pathname argu‐
684       ment that was supplied to the mkdir(2) call, and performs  one  of  the
685       following actions:
686
687       • If  the  pathname begins with the prefix "/tmp/", then the supervisor
688         attempts to create the specified directory, and then spoofs a  return
689         for  the target process based on the return value of the supervisor's
690         mkdir(2) call.  In the event that that  call  succeeds,  the  spoofed
691         success return value is the length of the pathname.
692
693       • If  the  pathname begins with "./" (i.e., it is a relative pathname),
694         the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE  response  to
695         the kernel to say that the kernel should execute the target process's
696         mkdir(2) call.
697
698       • If the pathname begins with some other prefix, the supervisor  spoofs
699         an  error return for the target process, so that the target process's
700         mkdir(2) call appears to fail with the error  EOPNOTSUPP  ("Operation
701         not  supported").  Additionally, if the specified pathname is exactly
702         "/bye", then the supervisor terminates.
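
          The decision logic in the list above can be condensed into a single
          pure function, shown here as a sketch.  The names chooseAction() and
          enum supervisorAction are hypothetical; the program source below
          implements the same tests inline rather than through such a helper.

```c
#include <string.h>

/* Hypothetical action codes used only by this sketch */
enum supervisorAction {
    ACT_CREATE_AND_SPOOF,   /* Supervisor performs mkdir(2) itself and
                               spoofs the target's return value */
    ACT_CONTINUE,           /* SECCOMP_USER_NOTIF_FLAG_CONTINUE: kernel
                               executes the target's mkdir(2) */
    ACT_DENY,               /* Spoof an EOPNOTSUPP error */
    ACT_DENY_AND_EXIT       /* Spoof EOPNOTSUPP, then terminate */
};

/* Map a pathname fetched from the target onto one of the actions
   described in the text. 'path' must be a null-terminated string. */
static enum supervisorAction
chooseAction(const char *path)
{
    if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0)
        return ACT_CREATE_AND_SPOOF;
    if (strncmp(path, "./", strlen("./")) == 0)
        return ACT_CONTINUE;
    if (strcmp(path, "/bye") == 0)
        return ACT_DENY_AND_EXIT;
    return ACT_DENY;
}
```

          Note that the prefix test uses "/tmp/" with a trailing slash, so the
          pathname "/tmp" on its own falls through to the error case.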
703
704       This program can be used to demonstrate various aspects of the behavior
705       of  the  seccomp  user-space  notification  mechanism.   To  aid  such
706       demonstrations, the program logs various messages to show the operation
707       of  the  target  process  (lines prefixed "T:") and the supervisor (in‐
708       dented lines prefixed "S:").
709
710       In the following example, the target attempts to create  the  directory
711       /tmp/x.   Upon  receiving  the notification, the supervisor creates the
712       directory on the target's behalf, and spoofs a success return to be re‐
713       ceived by the target process's mkdir(2) call.
714
715           $ ./seccomp_unotify /tmp/x
716           T: PID = 23168
717
718           T: about to mkdir("/tmp/x")
719                   S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
720                   S: executing: mkdir("/tmp/x", 0700)
721                   S: success! spoofed return = 6
722                   S: sending response (flags = 0; val = 6; error = 0)
723           T: SUCCESS: mkdir(2) returned 6
724
725           T: terminating
726                   S: target has terminated; bye
727
728       In  the  above  output,  note that the spoofed return value seen by the
729       target process is 6 (the length of the pathname /tmp/x), whereas a nor‐
730       mal mkdir(2) call returns 0 on success.
731
732       In  the  next  example, the target attempts to create a directory using
733       the relative pathname ./sub.  Since this pathname starts with "./", the
734       supervisor  sends  a  SECCOMP_USER_NOTIF_FLAG_CONTINUE  response to the
735       kernel,  and  the  kernel  then  (successfully)  executes  the   target
736       process's mkdir(2) call.
737
738           $ ./seccomp_unotify ./sub
739           T: PID = 23204
740
741           T: about to mkdir("./sub")
742                   S: got notification (ID 0xddb16abe25b4c12) for PID 23204
743                   S: target can execute system call
744                   S: sending response (flags = 0x1; val = 0; error = 0)
745           T: SUCCESS: mkdir(2) returned 0
746
747           T: terminating
748                   S: target has terminated; bye
749
750       If  the  target  process attempts to create a directory with a pathname
751       that doesn't start with "./" and doesn't begin with the prefix "/tmp/",
752       then  the supervisor spoofs an error return (EOPNOTSUPP, "Operation not
753       supported") for the target's mkdir(2) call (which is not executed):
754
755           $ ./seccomp_unotify /xxx
756           T: PID = 23178
757
758           T: about to mkdir("/xxx")
759                   S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
760                   S: spoofing error response (Operation not supported)
761                   S: sending response (flags = 0; val = 0; error = -95)
762           T: ERROR: mkdir(2): Operation not supported
763
764           T: terminating
765                   S: target has terminated; bye
766
767       In the next example, the target process attempts to create a  directory
768       with  the  pathname /tmp/nosuchdir/b.  Upon receiving the notification,
769       the supervisor attempts to create that directory, but the mkdir(2) call
770       fails  because  the  directory  /tmp/nosuchdir  does not exist.  Conse‐
771       quently, the supervisor spoofs an error return that  passes  the  error
772       that it received back to the target process's mkdir(2) call.
773
774           $ ./seccomp_unotify /tmp/nosuchdir/b
775           T: PID = 23199
776
777           T: about to mkdir("/tmp/nosuchdir/b")
778                   S: got notification (ID 0x8744454293506046) for PID 23199
779                   S: executing: mkdir("/tmp/nosuchdir/b", 0700)
780                   S: failure! (errno = 2; No such file or directory)
781                   S: sending response (flags = 0; val = 0; error = -2)
782           T: ERROR: mkdir(2): No such file or directory
783
784           T: terminating
785                   S: target has terminated; bye
786
787       If the supervisor receives a notification and sees that the argument of
788       the target's mkdir(2) is the string "/bye", then (as well  as  spoofing
789       an EOPNOTSUPP error), the supervisor terminates.  If the target process
790       subsequently executes another mkdir(2) that triggers its seccomp filter
791       to  return  the  SECCOMP_RET_USER_NOTIF  action  value, then the kernel
792       causes the target process's system call to fail with the  error  ENOSYS
793       ("Function  not  implemented").   This is demonstrated by the following
794       example:
795
796           $ ./seccomp_unotify /bye /tmp/y
797           T: PID = 23185
798
799           T: about to mkdir("/bye")
800                   S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
801                   S: spoofing error response (Operation not supported)
802                   S: sending response (flags = 0; val = 0; error = -95)
803                   S: terminating **********
804           T: ERROR: mkdir(2): Operation not supported
805
806           T: about to mkdir("/tmp/y")
807           T: ERROR: mkdir(2): Function not implemented
808
809           T: terminating
810
811   Program source
812       #define _GNU_SOURCE
813       #include <errno.h>
814       #include <fcntl.h>
815       #include <limits.h>
816       #include <linux/audit.h>
817       #include <linux/filter.h>
818       #include <linux/seccomp.h>
819       #include <signal.h>
820       #include <stdbool.h>
821       #include <stddef.h>
822       #include <stdint.h>
823       #include <stdio.h>
824       #include <stdlib.h>
          #include <string.h>
825       #include <sys/socket.h>
826       #include <sys/ioctl.h>
827       #include <sys/prctl.h>
828       #include <sys/stat.h>
829       #include <sys/types.h>
830       #include <sys/un.h>
831       #include <sys/syscall.h>
832       #include <unistd.h>
833
834       #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
835                               } while (0)
836
837       /* Send the file descriptor 'fd' over the connected UNIX domain socket
838          'sockfd'. Returns 0 on success, or -1 on error. */
839
840       static int
841       sendfd(int sockfd, int fd)
842       {
843           struct msghdr msgh;
844           struct iovec iov;
845           int data;
846           struct cmsghdr *cmsgp;
847
848           /* Allocate a char array of suitable size to hold the ancillary data.
849              However, since this buffer is in reality a 'struct cmsghdr', use a
850              union to ensure that it is suitably aligned. */
851           union {
852               char   buf[CMSG_SPACE(sizeof(int))];
853                               /* Space large enough to hold an 'int' */
854               struct cmsghdr align;
855           } controlMsg;
856
857           /* The 'msg_name' field can be used to specify the address of the
858              destination socket when sending a datagram. However, we do not
859              need to use this field because 'sockfd' is a connected socket. */
860
861           msgh.msg_name = NULL;
862           msgh.msg_namelen = 0;
863
864           /* On Linux, we must transmit at least one byte of real data in
865              order to send ancillary data. We transmit an arbitrary integer
866              whose value is ignored by recvfd(). */
867
868           msgh.msg_iov = &iov;
869           msgh.msg_iovlen = 1;
870           iov.iov_base = &data;
871           iov.iov_len = sizeof(int);
872           data = 12345;
873
874           /* Set 'msghdr' fields that describe ancillary data */
875
876           msgh.msg_control = controlMsg.buf;
877           msgh.msg_controllen = sizeof(controlMsg.buf);
878
879           /* Set up ancillary data describing file descriptor to send */
880
881           cmsgp = CMSG_FIRSTHDR(&msgh);
882           cmsgp->cmsg_level = SOL_SOCKET;
883           cmsgp->cmsg_type = SCM_RIGHTS;
884           cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
885           memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
886
887           /* Send real plus ancillary data */
888
889           if (sendmsg(sockfd, &msgh, 0) == -1)
890               return -1;
891
892           return 0;
893       }
894
895       /* Receive a file descriptor on a connected UNIX domain socket. Returns
896          the received file descriptor on success, or -1 on error. */
897
898       static int
899       recvfd(int sockfd)
900       {
901           struct msghdr msgh;
902           struct iovec iov;
903           int data, fd;
904           ssize_t nr;
905
906           /* Allocate a char buffer for the ancillary data. See the comments
907              in sendfd() */
908           union {
909               char   buf[CMSG_SPACE(sizeof(int))];
910               struct cmsghdr align;
911           } controlMsg;
912           struct cmsghdr *cmsgp;
913
914           /* The 'msg_name' field can be used to obtain the address of the
915              sending socket. However, we do not need this information. */
916
917           msgh.msg_name = NULL;
918           msgh.msg_namelen = 0;
919
920           /* Specify buffer for receiving real data */
921
922           msgh.msg_iov = &iov;
923           msgh.msg_iovlen = 1;
924           iov.iov_base = &data;       /* Real data is an 'int' */
925           iov.iov_len = sizeof(int);
926
927           /* Set 'msghdr' fields that describe ancillary data */
928
929           msgh.msg_control = controlMsg.buf;
930           msgh.msg_controllen = sizeof(controlMsg.buf);
931
932           /* Receive real plus ancillary data; real data is ignored */
933
934           nr = recvmsg(sockfd, &msgh, 0);
935           if (nr == -1)
936               return -1;
937
938           cmsgp = CMSG_FIRSTHDR(&msgh);
939
940           /* Check the validity of the 'cmsghdr' */
941
942           if (cmsgp == NULL ||
943                   cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
944                   cmsgp->cmsg_level != SOL_SOCKET ||
945                   cmsgp->cmsg_type != SCM_RIGHTS) {
946               errno = EINVAL;
947               return -1;
948           }
949
950           /* Return the received file descriptor to our caller */
951
952           memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
953           return fd;
954       }
955
956       static void
957       sigchldHandler(int sig)
958       {
959           char msg[] = "\tS: target has terminated; bye\n";
960
961           write(STDOUT_FILENO, msg, sizeof(msg) - 1);
962           _exit(EXIT_SUCCESS);
963       }
964
965       static int
966       seccomp(unsigned int operation, unsigned int flags, void *args)
967       {
968           return syscall(__NR_seccomp, operation, flags, args);
969       }
970
971       /* The following is the x86-64-specific BPF boilerplate code for checking
972          that the BPF program is running on the right architecture + ABI. At
973          completion of these instructions, the accumulator contains the system
974          call number. */
975
976       /* For the x32 ABI, all system call numbers have bit 30 set */
977
978       #define X32_SYSCALL_BIT         0x40000000
979
980       #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
981               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
982                       (offsetof(struct seccomp_data, arch))), \
983               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
984               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
985                        (offsetof(struct seccomp_data, nr))), \
986               BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
987               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
988
989       /* installNotifyFilter() installs a seccomp filter that generates
990          user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
991          calls mkdir(2); the filter allows all other system calls.
992
993          The function return value is a file descriptor from which the
994          user-space notifications can be fetched. */
995
996       static int
997       installNotifyFilter(void)
998       {
999           struct sock_filter filter[] = {
1000               X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
1001
1002               /* mkdir() triggers notification to user-space supervisor */
1003
1004               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
1005               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
1006
1007               /* Every other system call is allowed */
1008
1009               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
1010           };
1011
1012           struct sock_fprog prog = {
1013               .len = sizeof(filter) / sizeof(filter[0]),
1014               .filter = filter,
1015           };
1016
1017           /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1018              as a result, seccomp() returns a notification file descriptor. */
1019
1020           int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
1021                                  SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
1022           if (notifyFd == -1)
1023               errExit("seccomp-install-notify-filter");
1024
1025           return notifyFd;
1026       }
1027
1028       /* Close a pair of sockets created by socketpair() */
1029
1030       static void
1031       closeSocketPair(int sockPair[2])
1032       {
1033           if (close(sockPair[0]) == -1)
1034               errExit("closeSocketPair-close-0");
1035           if (close(sockPair[1]) == -1)
1036               errExit("closeSocketPair-close-1");
1037       }
1038
1039       /* Implementation of the target process; create a child process that:
1040
1041          (1) installs a seccomp filter with the
1042              SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1043          (2) writes the seccomp notification file descriptor returned from
1044              the previous step onto the UNIX domain socket, 'sockPair[0]';
1045          (3) calls mkdir(2) for each element of 'argv'.
1046
1047          The function return value in the parent is the PID of the child
1048          process; the child does not return from this function. */
1049
1050       static pid_t
1051       targetProcess(int sockPair[2], char *argv[])
1052       {
1053           pid_t targetPid = fork();
1054           if (targetPid == -1)
1055               errExit("fork");
1056
1057           if (targetPid > 0)          /* In parent, return PID of child */
1058               return targetPid;
1059
1060           /* Child falls through to here */
1061
1062           printf("T: PID = %ld\n", (long) getpid());
1063
1064           /* Install seccomp filter(s) */
1065
1066           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
1067               errExit("prctl");
1068
1069           int notifyFd = installNotifyFilter();
1070
1071           /* Pass the notification file descriptor to the supervisor process over
1072              a UNIX domain socket */
1073
1074           if (sendfd(sockPair[0], notifyFd) == -1)
1075               errExit("sendfd");
1076
1077           /* Notification and socket FDs are no longer needed in target */
1078
1079           if (close(notifyFd) == -1)
1080               errExit("close-target-notify-fd");
1081
1082           closeSocketPair(sockPair);
1083
1084           /* Perform a mkdir() call for each of the command-line arguments */
1085
1086           for (char **ap = argv; *ap != NULL; ap++) {
1087               printf("\nT: about to mkdir(\"%s\")\n", *ap);
1088
1089               int s = mkdir(*ap, 0700);
1090               if (s == -1)
1091                   perror("T: ERROR: mkdir(2)");
1092               else
1093                   printf("T: SUCCESS: mkdir(2) returned %d\n", s);
1094           }
1095
1096           printf("\nT: terminating\n");
1097           exit(EXIT_SUCCESS);
1098       }
1099
1100       /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
1101          operation is still valid. It will no longer be valid if the target
1102          process has terminated or is no longer blocked in the system call that
1103          generated the notification (because it was interrupted by a signal).
1104
1105          This operation can be used when doing such things as accessing
1106          /proc/PID files in the target process in order to avoid TOCTOU race
1107          conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
1108          terminates and is reused by another process. */
1109
1110       static bool
1111       cookieIsValid(int notifyFd, uint64_t id)
1112       {
1113           return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
1114       }
1115
1116       /* Access the memory of the target process in order to fetch the
1117          pathname referred to by the system call argument 'argNum' in
1118          'req->data.args[]'.  The pathname is returned in 'path',
1119          a buffer of 'len' bytes allocated by the caller.
1120
1121          Returns true if the pathname is successfully fetched, and false
1122          otherwise. For possible causes of failure, see the comments below. */
1123
1124       static bool
1125       getTargetPathname(struct seccomp_notif *req, int notifyFd,
1126                         int argNum, char *path, size_t len)
1127       {
1128           char procMemPath[PATH_MAX];
1129
1130           snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
1131
1132           int procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
1133           if (procMemFd == -1)
1134               return false;
1135
1136           /* Check that the process whose info we are accessing is still alive
1137              and blocked in the system call that caused the notification.
1138              If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
1139              cookieIsValid()) succeeded, we know that the /proc/PID/mem file
1140              descriptor that we opened corresponded to the process for which we
1141              received a notification. If that process subsequently terminates,
1142              then read() on that file descriptor will return 0 (EOF). */
1143
1144           if (!cookieIsValid(notifyFd, req->id)) {
1145               close(procMemFd);
1146               return false;
1147           }
1148
1149           /* Read bytes at the location containing the pathname argument */
1150
1151           ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
1152
1153           close(procMemFd);
1154
1155           if (nread <= 0)
1156               return false;
1157
1158           /* Once again check that the notification ID is still valid. The
1159              case we are particularly concerned about here is that just
1160              before we fetched the pathname, the target's blocked system
1161              call was interrupted by a signal handler, and after the handler
1162              returned, the target carried on execution (past the interrupted
1163              system call). In that case, we have no guarantees about what we
1164              are reading, since the target's memory may have been arbitrarily
1165              changed by subsequent operations. */
1166
1167           if (!cookieIsValid(notifyFd, req->id)) {
1168               perror("\tS: notification ID check failed!!!");
1169               return false;
1170           }
1171
1172           /* Even if the target's system call was not interrupted by a signal,
1173              we have no guarantees about what was in the memory of the target
1174              process. (The memory may have been modified by another thread, or
1175              even by an external attacking process.) We therefore treat the
1176              buffer returned by pread() as untrusted input. The buffer should
1177              contain a terminating null byte; if not, then we will trigger an
1178              error for the target process. */
1179
1180           if (strnlen(path, nread) < nread)
1181               return true;
1182
1183           return false;
1184       }
1185
1186       /* Allocate buffers for the seccomp user-space notification request and
1187          response structures. It is the caller's responsibility to free the
1188          buffers returned via 'req' and 'resp'. */
1189
1190       static void
1191       allocSeccompNotifBuffers(struct seccomp_notif **req,
1192               struct seccomp_notif_resp **resp,
1193               struct seccomp_notif_sizes *sizes)
1194       {
1195           /* Discover the sizes of the structures that are used to receive
1196              notifications and send notification responses, and allocate
1197              buffers of those sizes. */
1198
1199           if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
1200               errExit("seccomp-SECCOMP_GET_NOTIF_SIZES");
1201
1202           *req = malloc(sizes->seccomp_notif);
1203           if (*req == NULL)
1204               errExit("malloc-seccomp_notif");
1205
1206           /* When allocating the response buffer, we must allow for the fact
1207              that the user-space binary may have been built with user-space
1208              headers where 'struct seccomp_notif_resp' is bigger than the
1209              response buffer expected by the (older) kernel. Therefore, we
1210              allocate a buffer that is the maximum of the two sizes. This
1211              ensures that if the supervisor places bytes into the response
1212              structure that are past the response size that the kernel expects,
1213              then the supervisor is not touching an invalid memory location. */
1214
1215           size_t resp_size = sizes->seccomp_notif_resp;
1216           if (sizeof(struct seccomp_notif_resp) > resp_size)
1217               resp_size = sizeof(struct seccomp_notif_resp);
1218
1219           *resp = malloc(resp_size);
1220           if (*resp == NULL)
1221               errExit("malloc-seccomp_notif_resp");
1222
1223       }
1224
1225       /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
1226          descriptor, 'notifyFd'. */
1227
1228       static void
1229       handleNotifications(int notifyFd)
1230       {
1231           struct seccomp_notif_sizes sizes;
1232           struct seccomp_notif *req;
1233           struct seccomp_notif_resp *resp;
1234           char path[PATH_MAX];
1235
1236           allocSeccompNotifBuffers(&req, &resp, &sizes);
1237
1238           /* Loop handling notifications */
1239
1240           for (;;) {
1241
1242               /* Wait for next notification, returning info in '*req' */
1243
1244               memset(req, 0, sizes.seccomp_notif);
1245               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
1246                   if (errno == EINTR)
1247                       continue;
1248                   errExit("\tS: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
1249               }
1250
1251               printf("\tS: got notification (ID %#llx) for PID %d\n",
1252                       req->id, req->pid);
1253
1254               /* The only system call that can generate a notification event
1255                  is mkdir(2). Nevertheless, we check that the notified system
1256                  call is indeed mkdir() as kind of future-proofing of this
1257                  code in case the seccomp filter is later modified to
1258                  generate notifications for other system calls. */
1259
1260               if (req->data.nr != __NR_mkdir) {
1261                   printf("\tS: notification contained unexpected "
1262                           "system call number; bye!!!\n");
1263                   exit(EXIT_FAILURE);
1264               }
1265
1266               bool pathOK = getTargetPathname(req, notifyFd, 0, path,
1267                                               sizeof(path));
1268
               /* Prepopulate some fields of the response */

               resp->id = req->id;     /* Response includes notification ID */
               resp->flags = 0;
               resp->val = 0;

               /* If getTargetPathname() failed, trigger an EINVAL error
                  response (sending this response may yield an error if the
                  failure occurred because the notification ID was no longer
                  valid); if the directory is in /tmp, then create it on behalf
                  of the target; if the pathname starts with './', tell the
                  kernel to let the target process execute the mkdir();
                  otherwise, give an error for a directory pathname in any other
                  location. */

               if (!pathOK) {
                   resp->error = -EINVAL;
                   printf("\tS: spoofing error for invalid pathname (%s)\n",
                           strerror(-resp->error));
               } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
                   printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                           path, req->data.args[1]);

                   if (mkdir(path, req->data.args[1]) == 0) {
                       resp->error = 0;            /* "Success" */
                       resp->val = strlen(path);   /* Used as return value of
                                                      mkdir() in target */
                       printf("\tS: success! spoofed return = %lld\n",
                               resp->val);
                   } else {

                       /* If mkdir() failed in the supervisor, pass the error
                          back to the target */

                       resp->error = -errno;
                       printf("\tS: failure! (errno = %d; %s)\n", errno,
                               strerror(errno));
                   }
               } else if (strncmp(path, "./", strlen("./")) == 0) {
                   resp->error = resp->val = 0;
                   resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
                   printf("\tS: target can execute system call\n");
               } else {
                   resp->error = -EOPNOTSUPP;
                   printf("\tS: spoofing error response (%s)\n",
                           strerror(-resp->error));
               }

               /* Send a response to the notification */

               printf("\tS: sending response "
                       "(flags = %#x; val = %lld; error = %d)\n",
                       resp->flags, resp->val, resp->error);

               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
                   if (errno == ENOENT)
                       printf("\tS: response failed with ENOENT; "
                               "perhaps target process's syscall was "
                               "interrupted by a signal?\n");
                   else
                       perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
               }

               /* If the pathname is just "/bye", then the supervisor breaks out
                  of the loop and terminates. This allows us to see what happens
                  if the target process makes further calls to mkdir(2). */

               if (strcmp(path, "/bye") == 0)
                   break;
           }

           free(req);
           free(resp);
           printf("\tS: terminating **********\n");
           exit(EXIT_FAILURE);
       }

       /* Implementation of the supervisor process:

          (1) obtains the notification file descriptor from 'sockPair[1]'
          (2) handles notifications that arrive on that file descriptor. */

       static void
       supervisor(int sockPair[2])
       {
           int notifyFd = recvfd(sockPair[1]);
           if (notifyFd == -1)
               errExit("recvfd");

           closeSocketPair(sockPair);  /* We no longer need the socket pair */

           handleNotifications(notifyFd);
       }

       int
       main(int argc, char *argv[])
       {
           int sockPair[2];

           setbuf(stdout, NULL);

           if (argc < 2) {
               fprintf(stderr, "At least one pathname argument is required\n");
               exit(EXIT_FAILURE);
           }

           /* Create a pair of connected UNIX domain sockets that is used to
              pass the seccomp notification file descriptor from the target
              process to the supervisor process. */

           if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
               errExit("socketpair");

           /* Create a child process--the "target"--that installs seccomp
              filtering. The target process writes the seccomp notification
              file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
              each directory in the command-line arguments. */

           (void) targetProcess(sockPair, &argv[optind]);

           /* Catch SIGCHLD when the target terminates, so that the
              supervisor can also terminate. */

           struct sigaction sa;
           sa.sa_handler = sigchldHandler;
           sa.sa_flags = 0;
           sigemptyset(&sa.sa_mask);
           if (sigaction(SIGCHLD, &sa, NULL) == -1)
               errExit("sigaction");

           supervisor(sockPair);

           exit(EXIT_SUCCESS);
       }


SEE ALSO

       ioctl(2), pidfd_open(2), pidfd_getfd(2), seccomp(2)

       A further example program can be found in the kernel source file
       samples/seccomp/user-trap.c.


COLOPHON

       This page is part of release 5.12 of the Linux man-pages project. A
       description of the project, information about reporting bugs, and the
       latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.

Linux                             2021-06-20                SECCOMP_UNOTIFY(2)