seccomp_unotify(2)            System Calls Manual           seccomp_unotify(2)

NAME

       seccomp_unotify - Seccomp user-space notification mechanism

LIBRARY

       Standard C library (libc, -lc)

SYNOPSIS

       #include <linux/seccomp.h>
       #include <linux/filter.h>
       #include <linux/audit.h>

       int seccomp(unsigned int operation, unsigned int flags, void *args);

       #include <sys/ioctl.h>

       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
                 struct seccomp_notif *req);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
                 struct seccomp_notif_resp *resp);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
       int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ADDFD,
                 struct seccomp_notif_addfd *addfd);

DESCRIPTION

29       This  page  describes the user-space notification mechanism provided by
30       the Secure Computing (seccomp) facility.  As well as  the  use  of  the
31       SECCOMP_FILTER_FLAG_NEW_LISTENER  flag,  the SECCOMP_RET_USER_NOTIF ac‐
32       tion value, and the SECCOMP_GET_NOTIF_SIZES operation described in sec‐
33       comp(2),  this  mechanism  involves  the  use  of  a  number of related
34       ioctl(2) operations (described below).
35
36   Overview
       In conventional usage of a seccomp filter, the decision about how to
       treat a system call is made by the filter itself.  By contrast, the
       user-space notification mechanism allows the seccomp filter to delegate
       the handling of the system call to another user-space process.  Note
       that this mechanism is explicitly not intended as a method of
       implementing security policy; see NOTES.
43
44       In the discussion that follows, the thread(s) on which the seccomp fil‐
45       ter is installed is (are) referred to as the target,  and  the  process
46       that  is  notified by the user-space notification mechanism is referred
47       to as the supervisor.
48
49       A suitably privileged supervisor can use  the  user-space  notification
50       mechanism to perform actions on behalf of the target.  The advantage of
51       the user-space notification mechanism is that the supervisor will  usu‐
52       ally be able to retrieve information about the target and the performed
53       system call that the seccomp filter itself cannot.  (A  seccomp  filter
54       is limited in the information it can obtain and the actions that it can
55       perform because it is running on a virtual machine inside the kernel.)
56
57       An overview of the steps performed by the target and the supervisor  is
58       as follows:
59
60       (1)  The  target  establishes a seccomp filter in the usual manner, but
61            with two differences:
62
63            •  The seccomp(2) flags argument includes  the  flag  SECCOMP_FIL‐
64               TER_FLAG_NEW_LISTENER.   Consequently,  the return value of the
65               (successful) seccomp(2) call is a new "listening" file descrip‐
66               tor  that can be used to receive notifications.  Only one "lis‐
67               tening" seccomp filter can be installed for a thread.
68
69            •  In cases where it is appropriate, the  seccomp  filter  returns
70               the  action  value  SECCOMP_RET_USER_NOTIF.   This return value
71               will trigger a notification event.
72
73       (2)  In order that the supervisor can obtain  notifications  using  the
74            listening  file  descriptor, (a duplicate of) that file descriptor
75            must be passed from the target to  the  supervisor.   One  way  in
76            which  this could be done is by passing the file descriptor over a
77            UNIX domain socket connection between the target and the  supervi‐
78            sor  (using  the  SCM_RIGHTS  ancillary  message type described in
79            unix(7)).   Another  way  to  do  this  is  through  the  use   of
80            pidfd_getfd(2).
81
       (3)  The supervisor will receive notification events on the listening
            file descriptor.  These events are returned as structures of type
            seccomp_notif.  Because this structure and its size may evolve
            over kernel versions, the supervisor must first determine the size
            of this structure using the seccomp(2) SECCOMP_GET_NOTIF_SIZES
            operation, which returns a structure of type seccomp_notif_sizes.
            The supervisor allocates a buffer of size
            seccomp_notif_sizes.seccomp_notif bytes to receive notification
            events.  In addition, the supervisor allocates another buffer of
            size seccomp_notif_sizes.seccomp_notif_resp bytes for the response
            (a struct seccomp_notif_resp structure) that it will provide to
            the kernel (and thus the target).
94
95       (4)  The target then performs its workload, which includes system calls
96            that  will  be  controlled by the seccomp filter.  Whenever one of
97            these  system  calls  causes  the  filter  to  return   the   SEC‐
98            COMP_RET_USER_NOTIF  action  value, the kernel does not (yet) exe‐
99            cute the system call; instead, execution of the target  is  tempo‐
100            rarily  blocked inside the kernel (in a sleep state that is inter‐
101            ruptible by signals) and a notification event is generated on  the
102            listening file descriptor.
103
104       (5)  The  supervisor  can now repeatedly monitor the listening file de‐
105            scriptor for SECCOMP_RET_USER_NOTIF-triggered events.  To do this,
106            the  supervisor  uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) opera‐
107            tion to read information about a notification event;  this  opera‐
108            tion  blocks until an event is available.  The operation returns a
109            seccomp_notif structure containing information  about  the  system
110            call  that  is  being  attempted  by the target.  (As described in
111            NOTES, the file descriptor can also be monitored  with  select(2),
112            poll(2), or epoll(7).)
113
114       (6)  The  seccomp_notif  structure  returned  by  the SECCOMP_IOCTL_NO‐
115            TIF_RECV operation includes the same information  (a  seccomp_data
116            structure)  that  was passed to the seccomp filter.  This informa‐
117            tion allows the supervisor to discover the system call number  and
118            the  arguments for the target's system call.  In addition, the no‐
119            tification event contains the ID of the thread that triggered  the
120            notification  and a unique cookie value that is used in subsequent
121            SECCOMP_IOCTL_NOTIF_ID_VALID and  SECCOMP_IOCTL_NOTIF_SEND  opera‐
122            tions.
123
124            The  information  in  the notification can be used to discover the
125            values of pointer arguments for the target's system  call.   (This
126            is  something  that  can't  be done from within a seccomp filter.)
127            One way in which the supervisor can do this is to open the  corre‐
128            sponding  /proc/tid/mem file (see proc(5)) and read bytes from the
129            location that corresponds to one of the  pointer  arguments  whose
130            value is supplied in the notification event.  (The supervisor must
131            be careful to avoid a race condition that  can  occur  when  doing
132            this;  see  the  description  of  the SECCOMP_IOCTL_NOTIF_ID_VALID
133            ioctl(2) operation below.)  In addition, the supervisor can access
134            other  system  information that is visible in user space but which
135            is not accessible from a seccomp filter.
136
137       (7)  Having obtained information as per the previous step, the supervi‐
138            sor  may  then choose to perform an action in response to the tar‐
139            get's system call (which, as noted above, is not executed when the
140            seccomp filter returns the SECCOMP_RET_USER_NOTIF action value).
141
142            One  example  use case here relates to containers.  The target may
143            be located inside a container where it does  not  have  sufficient
144            capabilities  to mount a filesystem in the container's mount name‐
145            space.  However, the supervisor may be a more  privileged  process
146            that does have sufficient capabilities to perform the mount opera‐
147            tion.
148
       (8)  The supervisor then sends a response to the notification.  The
            information in this response is used by the kernel to construct a
            return value for the target's system call and provide a value that
            will be assigned to the errno variable of the target.

            The response is sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2)
            operation, which is used to transmit a seccomp_notif_resp
            structure to the kernel.  This structure must include the cookie
            value that the supervisor obtained in the seccomp_notif structure
            returned by the SECCOMP_IOCTL_NOTIF_RECV operation; the cookie
            allows the kernel to associate the response with the target.
164
       (9)  Once the response has been sent, the system call in the target
            thread unblocks, returning the information that was provided by
            the supervisor in the notification response.
168
169       As a variation on the last two steps, the supervisor  can  send  a  re‐
170       sponse that tells the kernel that it should execute the target thread's
171       system call; see the  discussion  of  SECCOMP_USER_NOTIF_FLAG_CONTINUE,
172       below.
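
       Step (1) above can be sketched roughly as follows.  This is a minimal
       illustration under stated assumptions, not a definitive
       implementation: the helper names build_notify_filter() and
       install_notify_filter() and the NOTIFY_FILTER_LEN constant are
       hypothetical, the caller is assumed to pass the AUDIT_ARCH_* value
       for its own architecture, and seccomp(2) is invoked through
       syscall(2) on the assumption that the C library provides no wrapper.

```c
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define NOTIFY_FILTER_LEN 7     /* Hypothetical: instructions built below */

/* Hypothetical helper: build a BPF program that returns
   SECCOMP_RET_USER_NOTIF for system call 'nr' and allows all others.
   'insns' must have room for NOTIFY_FILTER_LEN entries. */
static size_t
build_notify_filter(struct sock_filter *insns, unsigned int arch,
                    unsigned int nr)
{
    size_t n = 0;

    /* Load the architecture; kill the process if it is unexpected */
    insns[n++] = (struct sock_filter) BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        offsetof(struct seccomp_data, arch));
    insns[n++] = (struct sock_filter) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                        arch, 1, 0);
    insns[n++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_KILL_PROCESS);

    /* Load the system call number; notify on a match, else allow */
    insns[n++] = (struct sock_filter) BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        offsetof(struct seccomp_data, nr));
    insns[n++] = (struct sock_filter) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                        nr, 0, 1);
    insns[n++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_USER_NOTIF);
    insns[n++] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_ALLOW);
    return n;
}

/* Hypothetical helper: install the filter on the calling thread.
   Returns the listening file descriptor, or -1 on error. */
static int
install_notify_filter(unsigned int arch, unsigned int nr)
{
    struct sock_filter insns[NOTIFY_FILTER_LEN];
    struct sock_fprog prog = {
        .len = (unsigned short) build_notify_filter(insns, arch, nr),
        .filter = insns,
    };

    /* Needed unless the caller has CAP_SYS_ADMIN; see seccomp(2) */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

       The returned descriptor would then be handed to the supervisor as
       described in step (2), for example in an SCM_RIGHTS message.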
173

IOCTL OPERATIONS

175       The  following  ioctl(2)  operations are supported by the seccomp user-
176       space notification file descriptor.  For each of these operations,  the
177       first  (file descriptor) argument of ioctl(2) is the listening file de‐
178       scriptor returned  by  a  call  to  seccomp(2)  with  the  SECCOMP_FIL‐
179       TER_FLAG_NEW_LISTENER flag.
180
181   SECCOMP_IOCTL_NOTIF_RECV
182       The  SECCOMP_IOCTL_NOTIF_RECV  operation (available since Linux 5.0) is
183       used to obtain a user-space notification event.  If no  such  event  is
184       currently  pending,  the  operation  blocks until an event occurs.  The
185       third ioctl(2) argument is a pointer to a structure  of  the  following
186       form  which  contains information about the event.  This structure must
187       be zeroed out before the call.
188
           struct seccomp_notif {
               __u64  id;              /* Cookie */
               __u32  pid;             /* TID of target thread */
               __u32  flags;           /* Currently unused (0) */
               struct seccomp_data data;   /* See seccomp(2) */
           };
195
196       The fields in this structure are as follows:
197
198       id     This is a cookie for the  notification.   Each  such  cookie  is
199              guaranteed to be unique for the corresponding seccomp filter.
200
201              •  The  cookie can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID
202                 ioctl(2) operation described below.
203
204              •  When returning a notification response to the kernel, the su‐
205                 pervisor  must  include  the  cookie value in the seccomp_no‐
206                 tif_resp structure that is specified as the argument  of  the
207                 SECCOMP_IOCTL_NOTIF_SEND operation.
208
209       pid    This  is  the  thread ID of the target thread that triggered the
210              notification event.
211
212       flags  This is a bit mask of flags providing further information on the
213              event.   In  the  current  implementation,  this field is always
214              zero.
215
216       data   This is a seccomp_data structure  containing  information  about
217              the  system  call  that triggered the notification.  This is the
218              same structure that is passed to the seccomp filter.   See  sec‐
219              comp(2) for details of this structure.
220
221       On  success,  this operation returns 0; on failure, -1 is returned, and
222       errno is set to indicate the cause of the error.   This  operation  can
223       fail with the following errors:
224
225       EINVAL (since Linux 5.5)
226              The  seccomp_notif  structure  that  was passed to the call con‐
227              tained nonzero fields.
228
229       ENOENT The target thread was killed by a signal as the notification in‐
230              formation  was being generated, or the target's (blocked) system
231              call was interrupted by a signal handler.
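
       The sizing and zeroing requirements described above might be handled
       along the following lines.  This is only a sketch: the helper names
       alloc_notif() and recv_notif() and the return-code convention of
       recv_notif() are hypothetical, and seccomp(2) is again assumed to be
       reachable only through syscall(2).

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: allocate a zeroed notification buffer whose size
   is taken from SECCOMP_GET_NOTIF_SIZES (but never smaller than the
   structure this program was compiled against).  Returns NULL on error. */
static struct seccomp_notif *
alloc_notif(size_t *size_out)
{
    struct seccomp_notif_sizes sizes;

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
        return NULL;

    size_t size = sizes.seccomp_notif;
    if (size < sizeof(struct seccomp_notif))
        size = sizeof(struct seccomp_notif);

    *size_out = size;
    return calloc(1, size);
}

/* Hypothetical helper: fetch one notification into 'req'.
   Returns 0 on success, 1 if the target went away (ENOENT), so that the
   caller can simply wait for the next event, and -1 on any other error. */
static int
recv_notif(int notify_fd, struct seccomp_notif *req, size_t size)
{
    /* The structure must be zeroed before every call */
    memset(req, 0, size);

    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
        return 0;

    return (errno == ENOENT) ? 1 : -1;
}
```

       Treating ENOENT as a retryable condition is a design choice of this
       sketch; a supervisor could equally log the event before continuing.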
232
233   SECCOMP_IOCTL_NOTIF_ID_VALID
234       The SECCOMP_IOCTL_NOTIF_ID_VALID operation (available since Linux  5.0)
235       is  used  to  check  that a notification ID returned by an earlier SEC‐
236       COMP_IOCTL_NOTIF_RECV operation is still valid (i.e., that  the  target
237       still  exists  and  its  system call is still blocked waiting for a re‐
238       sponse).
239
240       The third ioctl(2) argument is a pointer to the cookie (id) returned by
241       the SECCOMP_IOCTL_NOTIF_RECV operation.
242
       This operation is necessary to avoid race conditions that can occur
       when the thread identified by the pid field returned by the
       SECCOMP_IOCTL_NOTIF_RECV operation terminates, and its ID is reused by
       another thread or process.  An example of this kind of race is the
       following:
247
248       (1)  A notification is generated on the listening file descriptor.  The
249            returned  seccomp_notif  contains the TID of the target thread (in
250            the pid field of the structure).
251
252       (2)  The target terminates.
253
254       (3)  Another thread or process is created on the system that by  chance
255            reuses the TID that was freed when the target terminated.
256
       (4)  The supervisor open(2)s the /proc/tid/mem file for the TID
            obtained in step 1, with the intention of (say) inspecting the
            memory location(s) that contain the argument(s) of the system call
            that triggered the notification in step 1.
261
262       In the above scenario, the risk is that the supervisor may try  to  ac‐
263       cess  the  memory of a process other than the target.  This race can be
264       avoided by following the  call  to  open(2)  with  a  SECCOMP_IOCTL_NO‐
265       TIF_ID_VALID  operation  to  verify that the process that generated the
266       notification is still alive.  (Note that if the target terminates after
267       the  latter step, a subsequent read(2) from the file descriptor may re‐
268       turn 0, indicating end of file.)
269
270       See NOTES for a  discussion  of  other  cases  where  SECCOMP_IOCTL_NO‐
271       TIF_ID_VALID checks must be performed.
272
273       On  success  (i.e., the notification ID is still valid), this operation
274       returns 0.  On failure (i.e., the notification ID is no longer  valid),
275       -1 is returned, and errno is set to ENOENT.
276
277   SECCOMP_IOCTL_NOTIF_SEND
       The SECCOMP_IOCTL_NOTIF_SEND operation (available since Linux 5.0) is
       used to send a notification response back to the kernel.  The third
       ioctl(2) argument is a pointer to a structure of the following form:
282
           struct seccomp_notif_resp {
               __u64 id;           /* Cookie value */
               __s64 val;          /* Success return value */
               __s32 error;        /* 0 (success) or negative error number */
               __u32 flags;        /* See below */
           };
289
290       The fields of this structure are as follows:
291
292       id     This is the cookie  value  that  was  obtained  using  the  SEC‐
293              COMP_IOCTL_NOTIF_RECV  operation.   This cookie value allows the
294              kernel to correctly associate this response with the system call
295              that triggered the user-space notification.
296
297       val    This is the value that will be used for a spoofed success return
298              for the target's system call; see below.
299
300       error  This is the value that will be used as the error number  (errno)
301              for a spoofed error return for the target's system call; see be‐
302              low.
303
304       flags  This is a bit mask that includes zero or more of  the  following
305              flags:
306
307              SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
308                     Tell the kernel to execute the target's system call.
309
310       Two kinds of response are possible:
311
312       •  A  response  to the kernel telling it to execute the target's system
313          call.  In this  case,  the  flags  field  includes  SECCOMP_USER_NO‐
314          TIF_FLAG_CONTINUE and the error and val fields must be zero.
315
316          This  kind  of  response can be useful in cases where the supervisor
317          needs to do deeper analysis of the target's system call than is pos‐
318          sible  from  a seccomp filter (e.g., examining the values of pointer
319          arguments), and, having decided that the system call  does  not  re‐
320          quire  emulation  by the supervisor, the supervisor wants the system
321          call to be executed normally in the target.
322
323          The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used  with  cau‐
324          tion; see NOTES.
325
326       •  A  spoofed return value for the target's system call.  In this case,
327          the kernel does not execute the target's system call, instead  caus‐
328          ing the system call to return a spoofed value as specified by fields
329          of the seccomp_notif_resp structure.  The supervisor should set  the
330          fields of this structure as follows:
331
332          +  flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.
333
334          +  error  is  set either to 0 for a spoofed "success" return or to a
335             negative error number for a spoofed  "failure"  return.   In  the
336             former case, the kernel causes the target's system call to return
337             the value specified in the val field.  In the  latter  case,  the
338             kernel causes the target's system call to return -1, and errno is
339             assigned the negated error value.
340
341          +  val is set to a value that will be used as the return value for a
342             spoofed "success" return for the target's system call.  The value
343             in this field is ignored if the error field  contains  a  nonzero
344             value.
345
346       On  success,  this operation returns 0; on failure, -1 is returned, and
347       errno is set to indicate the cause of the error.   This  operation  can
348       fail with the following errors:
349
350       EINPROGRESS
351              A response to this notification has already been sent.
352
353       EINVAL An invalid value was specified in the flags field.
354
355       EINVAL The  flags field contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and
356              the error or val field was not zero.
357
358       ENOENT The blocked system call in the target has been interrupted by  a
359              signal handler or the target has terminated.
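
       The three ways of filling in a response described above can be
       summarized in code.  The helper names resp_success(), resp_failure(),
       and resp_continue() are hypothetical; the fallback definition of
       SECCOMP_USER_NOTIF_FLAG_CONTINUE is included only on the assumption
       that older kernel headers might lack it.

```c
#include <string.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Spoofed "success": the target's system call returns 'val' */
static void
resp_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Spoofed "failure": the target's system call returns -1 with errno set
   to 'err'.  'err' is a positive errno value such as EPERM; the kernel
   expects it negated in the 'error' field. */
static void
resp_failure(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;
}

/* Let the kernel execute the target's system call; 'error' and 'val'
   must remain zero in this case */
static void
resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

       In each case the filled-in structure would then be passed to
       ioctl(2) with SECCOMP_IOCTL_NOTIF_SEND.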
360
361   SECCOMP_IOCTL_NOTIF_ADDFD
362       The SECCOMP_IOCTL_NOTIF_ADDFD operation (available since Linux 5.9) al‐
363       lows the supervisor to install a file descriptor into the target's file
364       descriptor  table.   Much like the use of SCM_RIGHTS messages described
365       in unix(7), this operation is semantically equivalent to duplicating  a
366       file  descriptor  from  the supervisor's file descriptor table into the
367       target's file descriptor table.
368
369       The SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to  emu‐
370       late  a target system call (such as socket(2) or openat(2)) that gener‐
371       ates a file descriptor.  The supervisor can  perform  the  system  call
372       that  generates  the file descriptor (and associated open file descrip‐
373       tion) and then use this operation to allocate a  file  descriptor  that
374       refers to the same open file description in the target.  (For an expla‐
375       nation of open file descriptions, see open(2).)
376
377       Once this operation has been performed, the supervisor  can  close  its
378       copy of the file descriptor.
379
380       In  the  target,  the  received  file descriptor is subject to the same
381       Linux Security Module (LSM) checks as are applied to a file  descriptor
382       that  is  received in an SCM_RIGHTS ancillary message.  If the file de‐
383       scriptor refers to a socket, it inherits the cgroup version  1  network
384       controller settings (classid and netprioidx) of the target.
385
386       The  third ioctl(2) argument is a pointer to a structure of the follow‐
387       ing form:
388
           struct seccomp_notif_addfd {
               __u64 id;           /* Cookie value */
               __u32 flags;        /* Flags */
               __u32 srcfd;        /* Local file descriptor number */
               __u32 newfd;        /* 0 or desired file descriptor
                                      number in target */
               __u32 newfd_flags;  /* Flags to set on target file
                                      descriptor */
           };
398
399       The fields in this structure are as follows:
400
401       id     This field should be set to the notification ID  (cookie  value)
402              that was obtained via SECCOMP_IOCTL_NOTIF_RECV.
403
       flags  This field is a bit mask of flags that modify the behavior of
              the operation.  The following flags are supported:
406
407              SECCOMP_ADDFD_FLAG_SETFD
408                     When allocating the file descriptor in  the  target,  use
409                     the file descriptor number specified in the newfd field.
410
411              SECCOMP_ADDFD_FLAG_SEND (since Linux 5.14)
412                     Perform  the equivalent of SECCOMP_IOCTL_NOTIF_ADDFD plus
413                     SECCOMP_IOCTL_NOTIF_SEND as an atomic operation.  On suc‐
414                     cessful  invocation, the target process's errno will be 0
415                     and the return value will be the file  descriptor  number
416                     that was allocated in the target.  If allocating the file
417                     descriptor in the target fails, the target's system  call
418                     continues  to  be  blocked until a successful response is
419                     sent.
420
421       srcfd  This field should be set to the number of the file descriptor in
422              the supervisor that is to be duplicated.
423
424       newfd  This  field determines which file descriptor number is allocated
425              in the target.  If the  SECCOMP_ADDFD_FLAG_SETFD  flag  is  set,
426              then this field specifies which file descriptor number should be
427              allocated.  If this file descriptor number is  already  open  in
428              the target, it is atomically closed and reused.  If the descrip‐
429              tor duplication fails due to an LSM check, or if srcfd is not  a
430              valid  file  descriptor,  the  file descriptor newfd will not be
431              closed in the target process.
432
433              If the SECCOMP_ADDFD_FLAG_SETFD flag it not set, then this field
434              must  be  0, and the kernel allocates the lowest unused file de‐
435              scriptor number in the target.
436
437       newfd_flags
438              This field is a bit mask specifying flags that should be set  on
439              the  file  descriptor  that  is  received in the target process.
440              Currently, only the following flag is implemented:
441
442              O_CLOEXEC
443                     Set the close-on-exec flag on the received file  descrip‐
444                     tor.
445
446       On  success, this ioctl(2) call returns the number of the file descrip‐
447       tor that was allocated in the target.  Assuming that the emulated  sys‐
448       tem  call  is one that returns a file descriptor as its function result
449       (e.g.,  socket(2)),  this  value  can  be  used  as  the  return  value
450       (resp.val)  that  is supplied in the response that is subsequently sent
451       with the SECCOMP_IOCTL_NOTIF_SEND operation.
452
453       On error, -1 is returned and errno is set to indicate the cause of  the
454       error.
455
456       This operation can fail with the following errors:
457
458       EBADF  Allocating  the  file  descriptor  in the target would cause the
459              target's RLIMIT_NOFILE limit to be exceeded (see getrlimit(2)).
460
       EBUSY  If the SECCOMP_ADDFD_FLAG_SEND flag is used, this means the
              operation can't proceed until other SECCOMP_IOCTL_NOTIF_ADDFD
              requests are processed.
464
465       EINPROGRESS
466              The user-space notification specified in the id field exists but
467              has  not yet been fetched (by a SECCOMP_IOCTL_NOTIF_RECV) or has
468              already been responded to (by a SECCOMP_IOCTL_NOTIF_SEND).
469
470       EINVAL An invalid flag was specified in the flags or newfd_flags field,
471              or  the  newfd field is nonzero and the SECCOMP_ADDFD_FLAG_SETFD
472              flag was not specified in the flags field.
473
474       EMFILE The file descriptor number specified in newfd exceeds the  limit
475              specified in /proc/sys/fs/nr_open.
476
477       ENOENT The  blocked system call in the target has been interrupted by a
478              signal handler or the target has terminated.
479
       Here is some sample code (with error handling omitted) that uses the
       SECCOMP_IOCTL_NOTIF_ADDFD operation (here, to emulate a call to
       openat(2)):
483
           int fd, targetFd;

           /* 'path' is assumed to hold the pathname that the supervisor
              has already read from the target's memory */
           fd = openat(req->data.args[0], path, req->data.args[2],
                       req->data.args[3]);

           struct seccomp_notif_addfd addfd;
           addfd.id = req->id; /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
           addfd.srcfd = fd;
           addfd.newfd = 0;    /* Kernel picks the lowest unused number */
           addfd.flags = 0;
           addfd.newfd_flags = O_CLOEXEC;

           targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);

           close(fd);          /* No longer needed in supervisor */

           struct seccomp_notif_resp *resp;
               /* Code to allocate 'resp' omitted */
           resp->id = req->id;
           resp->error = 0;        /* "Success" */
           resp->val = targetFd;
           resp->flags = 0;
           ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);
507

NOTES

       One example use case for the user-space notification mechanism is to
       allow a container manager (a process which is typically running with
       more privilege than the processes inside the container) to mount block
       devices or create device nodes for the container.  The mount use case
       provides an example of where the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag
       is useful.  Upon receiving a notification for the mount(2) system
       call, the container manager (the "supervisor") can distinguish a
       request to mount a block filesystem (which would not be possible for a
       "target" process inside the container) and mount that filesystem.  If,
       on the other hand, the container manager detects that the operation
       could be performed by the process inside the container (e.g., a mount
       of a tmpfs(5) filesystem), it can notify the kernel that the target
       process's mount(2) system call can continue.
522
523   select()/poll()/epoll semantics
524       The file descriptor returned when seccomp(2) is employed with the  SEC‐
525       COMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  using poll(2),
526       epoll(7), and select(2).  These interfaces indicate that the  file  de‐
527       scriptor is ready as follows:
528
       •  When a notification is pending, these interfaces indicate that the
          file descriptor is readable.  Following such an indication, a
          subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block,
          returning either information about a notification or else failing
          with the error ENOENT if the target has been killed by a signal or
          its system call has been interrupted by a signal handler.
535
536       •  After  the  notification  has  been  received  (i.e.,  by  the  SEC‐
537          COMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces indicate
538          that the file descriptor is writable, meaning  that  a  notification
539          response can be sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) op‐
540          eration.
541
       •  After the last thread using the filter has terminated and been
          reaped using waitpid(2) (or similar), the file descriptor indicates
          an end-of-file condition (readable in select(2); POLLHUP/EPOLLHUP in
          poll(2)/epoll_wait(2)).
546
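
       The readiness rules above suggest a small wrapper around poll(2).
       This is a sketch under stated assumptions: the helper name
       wait_listener() and its return-code convention are hypothetical, and
       only the readable and hang-up conditions are examined.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical helper: wait for activity on the listening descriptor.
   Returns 1 if a notification can be fetched without blocking,
           2 if the descriptor reports hang-up or an error condition,
           0 on timeout, and -1 if poll(2) itself fails. */
static int
wait_listener(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int n;

    /* Restart if the supervisor itself is interrupted by a signal */
    do {
        n = poll(&pfd, 1, timeout_ms);
    } while (n == -1 && errno == EINTR);

    if (n <= 0)
        return n;

    /* Prefer a pending notification over the hang-up indication */
    if (pfd.revents & POLLIN)
        return 1;
    if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL))
        return 2;
    return 0;
}
```

       A return value of 1 would normally be followed by a
       SECCOMP_IOCTL_NOTIF_RECV operation, and a return value of 2 by
       closing the listening file descriptor.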
547   Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
548       The  intent  of  the user-space notification feature is to allow system
549       calls to be performed on behalf of the  target.   The  target's  system
550       call  should either be handled by the supervisor or allowed to continue
551       normally in the kernel (where standard security policies  will  be  ap‐
552       plied).
553
554       Note  well: this mechanism must not be used to make security policy de‐
555       cisions about the system call, which would be inherently race-prone for
556       reasons described next.
557
558       The  SECCOMP_USER_NOTIF_FLAG_CONTINUE  flag  must be used with caution.
559       If set by the supervisor, the target's system call will continue.  How‐
560       ever,  there  is  a  time-of-check, time-of-use race here, since an at‐
561       tacker could exploit the interval of time where the target  is  blocked
562       waiting  on  the "continue" response to do things such as rewriting the
563       system call arguments.
564
565       Note furthermore that a user-space notifier can be bypassed if the  ex‐
566       isting  filters  allow  the  use of seccomp(2) or prctl(2) to install a
567       filter that returns an action value with a higher precedence than  SEC‐
568       COMP_RET_USER_NOTIF (see seccomp(2)).
569
       It should thus be absolutely clear that the seccomp user-space
       notification mechanism cannot be used to implement a security policy!
       It should only ever be used in scenarios where a more privileged
       process supervises the system calls of a less privileged target to get
       around kernel-enforced security restrictions when the supervisor deems
       this safe.  In other words, in order to continue a system call, the
       supervisor should be sure that another security mechanism or the
       kernel itself will sufficiently block the system call if its arguments
       are rewritten to something unsafe.
579
580   Caveats regarding the use of /proc/tid/mem
581       The  discussion  above  noted  the  need  to  use the SECCOMP_IOCTL_NO‐
582       TIF_ID_VALID ioctl(2) when opening the /proc/tid/mem file of the target
583       to  avoid  the possibility of accessing the memory of the wrong process
584       in the event that the target terminates and its ID is recycled  by  an‐
585       other  (unrelated) thread.  However, the use of this ioctl(2) operation
586       is also necessary in other situations, as explained  in  the  following
587       paragraphs.
588
589       Consider the following scenario, where the supervisor tries to read the
590       pathname argument of a target's blocked mount(2) system call:
591
592       (1)  From one of its functions (func()),  the  target  calls  mount(2),
593            which  triggers a user-space notification and causes the target to
594            block.
595
596       (2)  The supervisor receives the notification, opens /proc/tid/mem, and
597            (successfully) performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.
598
599       (3)  The target receives a signal, which causes the mount(2) to abort.
600
601       (4)  The signal handler executes in the target, and returns.
602
603       (5)  Upon return from the handler, the execution of func() resumes, and
604            it returns (and perhaps other functions  are  called,  overwriting
605            the memory that had been used for the stack frame of func()).
606
607       (6)  Using  the  address  provided in the notification information, the
608            supervisor reads from the target's memory location  that  used  to
609            contain the pathname.
610
611       (7)  The  supervisor  now  calls mount(2) with some arbitrary bytes ob‐
612            tained in the previous step.
613
614       The conclusion from the above scenario  is  this:  since  the  target's
615       blocked  system call may be interrupted by a signal handler, the super‐
616       visor must be written to expect that the target may abandon its  system
617       call at any time; in such an event, any information that the supervisor
618       obtained from the target's memory must be considered invalid.
619
620       To prevent such scenarios, every read from the target's memory must  be
621       separated  from  use  of  the  bytes so obtained by a SECCOMP_IOCTL_NO‐
622       TIF_ID_VALID check.  In the above example, the check  would  be  placed
623       between  the  two  final steps.  An example of such a check is shown in
624       EXAMPLES.
625
626       Following on from the above, it should be clear that a write by the su‐
627       pervisor into the target's memory can never be considered safe.
628
629   Caveats regarding blocking system calls
630       Suppose  that  the  target  performs  a blocking system call (e.g., ac‐
631       cept(2)) that the supervisor should handle.  The supervisor might  then
632       in turn execute the same blocking system call.
633
634       In  this  scenario, it is important to note that if the target's system
635       call is now interrupted by a signal, the supervisor is not informed  of
636       this.   If the supervisor does not take suitable steps to actively dis‐
637       cover that the target's system call has been canceled,  various  diffi‐
638       culties  can  occur.   Taking  the example of accept(2), the supervisor
639       might remain blocked in its accept(2) holding a port  number  that  the
640       target  (which,  after  the interruption by the signal handler, perhaps
641       closed  its listening socket) might expect to be able  to  reuse  in  a
642       bind(2) call.
643
644       Therefore,  when  the  supervisor  wishes  to emulate a blocking system
645       call, it must do so in such a way that it gets informed if the target's
646       system  call  is  interrupted by a signal handler.  For example, if the
647       supervisor itself executes the same blocking system call, then it could
648       employ a separate thread that uses the SECCOMP_IOCTL_NOTIF_ID_VALID op‐
649       eration to check if the target is still blocked  in  its  system  call.
650       Alternatively,  in  the  accept(2)  example,  the  supervisor might use
651       poll(2) to monitor both the notification file descriptor (so as to dis‐
652       cover  when  the  target's accept(2) call has been interrupted) and the
653       listening file descriptor (so as to know when a  connection  is  avail‐
654       able).
655
656       If  the  target's  system call is interrupted, the supervisor must take
657       care to release resources (e.g., file descriptors) that it acquired  on
658       behalf of the target.
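
       The poll(2)-based approach described above might be sketched as
       follows. This is an illustration under stated assumptions, not a
       complete implementation: the function and enum names are hypothetical,
       workFd stands for the descriptor on which the supervisor would
       otherwise block (for example, the listening socket), and any failure
       of the SECCOMP_IOCTL_NOTIF_ID_VALID check is treated as cancelation.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>

enum waitResult { WAIT_READY, WAIT_CANCELED, WAIT_ERROR };

/* Wait until 'workFd' is readable or the notification 'id' is no
   longer valid. The ID is rechecked on every wake-up and at least
   once per 'intervalMs' milliseconds. */
static enum waitResult
waitReadyOrCanceled(int notifyFd, uint64_t id, int workFd, int intervalMs)
{
    struct pollfd pfd[2] = {
        { .fd = workFd,   .events = POLLIN },
        { .fd = notifyFd, .events = POLLIN },
    };

    for (;;) {
        int n = poll(pfd, 2, intervalMs);

        if (n == -1 && errno != EINTR)
            return WAIT_ERROR;

        /* Has the target abandoned its system call (or terminated)? */
        if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
            return WAIT_CANCELED;

        /* 'revents' is meaningful only if poll() reported events */
        if (n > 0 && (pfd[0].revents & POLLIN))
            return WAIT_READY;
    }
}
```

       On WAIT_CANCELED, the supervisor would release whatever it acquired on
       behalf of the target. Note one simplification: if notifications from
       other target threads are left pending on notifyFd, poll(2) returns
       immediately on each iteration, so a real supervisor would fetch those
       notifications elsewhere or monitor the descriptor differently.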
659
660   Interaction with SA_RESTART signal handlers
661       Consider the following scenario:
662
663       (1)  The  target process has used sigaction(2) to install a signal han‐
664            dler with the SA_RESTART flag.
665
666       (2)  The target has made a system call that triggered a  seccomp  user-
667            space  notification  and the target is currently blocked until the
668            supervisor sends a notification response.
669
670       (3)  A signal is delivered to the target and the signal handler is exe‐
671            cuted.
672
673       (4)  When (if) the supervisor attempts to send a notification response,
674            the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation  will  fail  with
675            the ENOENT error.
676
677       In  this  scenario,  the  kernel will restart the target's system call.
678       Consequently, the supervisor will receive another user-space  notifica‐
679       tion.  Thus, depending on how many times the blocked system call is in‐
680       terrupted by a signal handler, the supervisor may receive multiple  no‐
681       tifications for the same instance of a system call in the target.
682
683       One oddity is that system call restarting as described in this scenario
684       will occur even for the blocking system calls listed in signal(7)  that
685       would never normally be restarted by the SA_RESTART flag.
686
687       Furthermore, if the supervisor response is a file descriptor added with
688       SECCOMP_IOCTL_NOTIF_ADDFD, then the flag SECCOMP_ADDFD_FLAG_SEND can be
689       used  to atomically add the file descriptor and return that value, mak‐
690       ing sure no file descriptors are inadvertently leaked into the target.
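
       Because of the behavior described in this section, a supervisor has to
       treat ENOENT from SECCOMP_IOCTL_NOTIF_SEND as an expected outcome
       rather than a fatal error. A minimal sketch, using a hypothetical
       wrapper named sendResponse():

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Send a notification response.

   Returns 1 if the response was delivered, 0 if the notification
   was no longer valid (ENOENT: the target's system call was
   interrupted by a signal, or the target terminated), and -1 on
   any other error, with 'errno' set by ioctl(). */
static int
sendResponse(int notifyFd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return 1;

    if (errno == ENOENT)
        return 0;

    return -1;
}
```

       When the wrapper returns 0, the supervisor would discard any work done
       for that notification; if the system call is restarted, a fresh
       notification with a new ID arrives.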
691

BUGS

693       If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed after the
694       target  terminates,  then  the ioctl(2) call simply blocks (rather than
695       returning an error to indicate that the target no longer exists).
696

EXAMPLES

698       The (somewhat contrived) program shown below demonstrates  the  use  of
699       the  interfaces  described  in  this page.  The program creates a child
700       process that serves as the "target" process.   The  child  process  in‐
701       stalls  a seccomp filter that returns the SECCOMP_RET_USER_NOTIF action
702       value if a call is made to mkdir(2).   The  child  process  then  calls
703       mkdir(2)  once for each of the supplied command-line arguments, and re‐
704       ports the result returned by the call.  After processing all arguments,
705       the child process terminates.
706
707       The  parent process acts as the supervisor, listening for the notifica‐
708       tions that are generated when the target process calls mkdir(2).   When
709       such  a  notification occurs, the supervisor examines the memory of the
710       target process (using /proc/pid/mem) to discover the pathname  argument
711       that was supplied to the mkdir(2) call, and performs one of the follow‐
712       ing actions:
713
714       •  If the pathname begins with the prefix "/tmp/", then the  supervisor
715          attempts to create the specified directory, and then spoofs a return
716          for the target process based on the return value of the supervisor's
717          mkdir(2)  call.   In  the event that that call succeeds, the spoofed
718          success return value is the length of the pathname.
719
720       •  If the pathname begins with "./" (i.e., it is a relative  pathname),
721          the  supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to
722          the kernel  to  say  that  the  kernel  should  execute  the  target
723          process's mkdir(2) call.
724
725       •  If the pathname begins with some other prefix, the supervisor spoofs
726          an error return for the target process, so that the target process's
727          mkdir(2)  call appears to fail with the error EOPNOTSUPP ("Operation
728          not supported").  Additionally, if the specified pathname is exactly
729          "/bye", then the supervisor terminates.
730
731       This program can be used to demonstrate various aspects of the behavior
732       of the seccomp user-space notification mechanism.  To  aid  such
733       demonstrations, the program logs various messages to show the operation
734       of the target process (lines prefixed "T:")  and  the  supervisor  (in‐
735       dented lines prefixed "S:").
736
737       In  the  following example, the target attempts to create the directory
738       /tmp/x.  Upon receiving the notification, the  supervisor  creates  the
739       directory on the target's behalf, and spoofs a success return to be re‐
740       ceived by the target process's mkdir(2) call.
741
742           $ ./seccomp_unotify /tmp/x
743           T: PID = 23168
744
745           T: about to mkdir("/tmp/x")
746                   S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
747                   S: executing: mkdir("/tmp/x", 0700)
748                   S: success! spoofed return = 6
749                   S: sending response (flags = 0; val = 6; error = 0)
750           T: SUCCESS: mkdir(2) returned 6
751
752           T: terminating
753                   S: target has terminated; bye
754
755       In the above output, note that the spoofed return  value  seen  by  the
756       target process is 6 (the length of the pathname /tmp/x), whereas a nor‐
757       mal mkdir(2) call returns 0 on success.
758
759       In the next example, the target attempts to create  a  directory  using
760       the relative pathname ./sub.  Since this pathname starts with "./", the
761       supervisor sends a  SECCOMP_USER_NOTIF_FLAG_CONTINUE  response  to  the
762       kernel,   and  the  kernel  then  (successfully)  executes  the  target
763       process's mkdir(2) call.
764
765           $ ./seccomp_unotify ./sub
766           T: PID = 23204
767
768           T: about to mkdir("./sub")
769                   S: got notification (ID 0xddb16abe25b4c12) for PID 23204
770                   S: target can execute system call
771                   S: sending response (flags = 0x1; val = 0; error = 0)
772           T: SUCCESS: mkdir(2) returned 0
773
774           T: terminating
775                   S: target has terminated; bye
776
777       If the target process attempts to create a directory  with  a  pathname
778       that doesn't start with "./" and doesn't begin with the prefix "/tmp/",
779       then the supervisor spoofs an error return (EOPNOTSUPP, "Operation  not
780       supported") for the target's mkdir(2) call (which is not executed):
781
782           $ ./seccomp_unotify /xxx
783           T: PID = 23178
784
785           T: about to mkdir("/xxx")
786                   S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
787                   S: spoofing error response (Operation not supported)
788                   S: sending response (flags = 0; val = 0; error = -95)
789           T: ERROR: mkdir(2): Operation not supported
790
791           T: terminating
792                   S: target has terminated; bye
793
794       In  the next example, the target process attempts to create a directory
795       with the pathname /tmp/nosuchdir/b.  Upon receiving  the  notification,
796       the supervisor attempts to create that directory, but the mkdir(2) call
797       fails because the directory  /tmp/nosuchdir  does  not  exist.   Conse‐
798       quently,  the  supervisor  spoofs an error return that passes the error
799       that it received back to the target process's mkdir(2) call.
800
801           $ ./seccomp_unotify /tmp/nosuchdir/b
802           T: PID = 23199
803
804           T: about to mkdir("/tmp/nosuchdir/b")
805                   S: got notification (ID 0x8744454293506046) for PID 23199
806                   S: executing: mkdir("/tmp/nosuchdir/b", 0700)
807                   S: failure! (errno = 2; No such file or directory)
808                   S: sending response (flags = 0; val = 0; error = -2)
809           T: ERROR: mkdir(2): No such file or directory
810
811           T: terminating
812                   S: target has terminated; bye
813
814       If the supervisor receives a notification and sees that the argument of
815       the  target's  mkdir(2) is the string "/bye", then (as well as spoofing
816       an EOPNOTSUPP error), the supervisor terminates.  If the target process
817       subsequently executes another mkdir(2) that triggers its seccomp filter
818       to return the SECCOMP_RET_USER_NOTIF  action  value,  then  the  kernel
819       causes  the  target process's system call to fail with the error ENOSYS
820       ("Function not implemented").  This is demonstrated  by  the  following
821       example:
822
823           $ ./seccomp_unotify /bye /tmp/y
824           T: PID = 23185
825
826           T: about to mkdir("/bye")
827                   S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
828                   S: spoofing error response (Operation not supported)
829                   S: sending response (flags = 0; val = 0; error = -95)
830                   S: terminating **********
831           T: ERROR: mkdir(2): Operation not supported
832
833           T: about to mkdir("/tmp/y")
834           T: ERROR: mkdir(2): Function not implemented
835
836           T: terminating
837
838   Program source
839       #define _GNU_SOURCE
840       #include <err.h>
841       #include <errno.h>
842       #include <fcntl.h>
843       #include <limits.h>
844       #include <linux/audit.h>
845       #include <linux/filter.h>
846       #include <linux/seccomp.h>
847       #include <signal.h>
848       #include <stdbool.h>
849       #include <stddef.h>
850       #include <stdint.h>
851       #include <stdio.h>
852       #include <stdlib.h>
853       #include <string.h>
854       #include <sys/ioctl.h>
855       #include <sys/prctl.h>
856       #include <sys/socket.h>
857       #include <sys/stat.h>
858       #include <sys/syscall.h>
859       #include <sys/types.h>
860       #include <sys/un.h>
861       #include <unistd.h>
862
863       #define ARRAY_SIZE(arr)  (sizeof(arr) / sizeof((arr)[0]))
864
865       /* Send the file descriptor 'fd' over the connected UNIX domain socket
866          'sockfd'. Returns 0 on success, or -1 on error. */
867
868       static int
869       sendfd(int sockfd, int fd)
870       {
871           int             data;
872           struct iovec    iov;
873           struct msghdr   msgh;
874           struct cmsghdr  *cmsgp;
875
876           /* Allocate a char array of suitable size to hold the ancillary data.
877              However, since this buffer is in reality a 'struct cmsghdr', use a
878              union to ensure that it is suitably aligned. */
879           union {
880               char   buf[CMSG_SPACE(sizeof(int))];
881                               /* Space large enough to hold an 'int' */
882               struct cmsghdr align;
883           } controlMsg;
884
885           /* The 'msg_name' field can be used to specify the address of the
886              destination socket when sending a datagram. However, we do not
887              need to use this field because 'sockfd' is a connected socket. */
888
889           msgh.msg_name = NULL;
890           msgh.msg_namelen = 0;
891
892           /* On Linux, we must transmit at least one byte of real data in
893              order to send ancillary data. We transmit an arbitrary integer
894              whose value is ignored by recvfd(). */
895
896           msgh.msg_iov = &iov;
897           msgh.msg_iovlen = 1;
898           iov.iov_base = &data;
899           iov.iov_len = sizeof(int);
900           data = 12345;
901
902           /* Set 'msghdr' fields that describe ancillary data */
903
904           msgh.msg_control = controlMsg.buf;
905           msgh.msg_controllen = sizeof(controlMsg.buf);
906
907           /* Set up ancillary data describing file descriptor to send */
908
909           cmsgp = CMSG_FIRSTHDR(&msgh);
910           cmsgp->cmsg_level = SOL_SOCKET;
911           cmsgp->cmsg_type = SCM_RIGHTS;
912           cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
913           memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
914
915           /* Send real plus ancillary data */
916
917           if (sendmsg(sockfd, &msgh, 0) == -1)
918               return -1;
919
920           return 0;
921       }
922
923       /* Receive a file descriptor on a connected UNIX domain socket. Returns
924          the received file descriptor on success, or -1 on error. */
925
926       static int
927       recvfd(int sockfd)
928       {
929           int            data, fd;
930           ssize_t        nr;
931           struct iovec   iov;
932           struct msghdr  msgh;
933
934           /* Allocate a char buffer for the ancillary data. See the comments
935              in sendfd() */
936           union {
937               char   buf[CMSG_SPACE(sizeof(int))];
938               struct cmsghdr align;
939           } controlMsg;
940           struct cmsghdr *cmsgp;
941
942           /* The 'msg_name' field can be used to obtain the address of the
943              sending socket. However, we do not need this information. */
944
945           msgh.msg_name = NULL;
946           msgh.msg_namelen = 0;
947
948           /* Specify buffer for receiving real data */
949
950           msgh.msg_iov = &iov;
951           msgh.msg_iovlen = 1;
952           iov.iov_base = &data;       /* Real data is an 'int' */
953           iov.iov_len = sizeof(int);
954
955           /* Set 'msghdr' fields that describe ancillary data */
956
957           msgh.msg_control = controlMsg.buf;
958           msgh.msg_controllen = sizeof(controlMsg.buf);
959
960           /* Receive real plus ancillary data; real data is ignored */
961
962           nr = recvmsg(sockfd, &msgh, 0);
963           if (nr == -1)
964               return -1;
965
966           cmsgp = CMSG_FIRSTHDR(&msgh);
967
968           /* Check the validity of the 'cmsghdr' */
969
970           if (cmsgp == NULL
971               || cmsgp->cmsg_len != CMSG_LEN(sizeof(int))
972               || cmsgp->cmsg_level != SOL_SOCKET
973               || cmsgp->cmsg_type != SCM_RIGHTS)
974           {
975               errno = EINVAL;
976               return -1;
977           }
978
979           /* Return the received file descriptor to our caller */
980
981           memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
982           return fd;
983       }
984
985       static void
986       sigchldHandler(int sig)
987       {
988           char msg[] = "\tS: target has terminated; bye\n";
989
990           write(STDOUT_FILENO, msg, sizeof(msg) - 1);
991           _exit(EXIT_SUCCESS);
992       }
993
994       static int
995       seccomp(unsigned int operation, unsigned int flags, void *args)
996       {
997           return syscall(SYS_seccomp, operation, flags, args);
998       }
999
1000       /* The following is the x86-64-specific BPF boilerplate code for checking
1001          that the BPF program is running on the right architecture + ABI. At
1002          completion of these instructions, the accumulator contains the system
1003          call number. */
1004
1005       /* For the x32 ABI, all system call numbers have bit 30 set */
1006
1007       #define X32_SYSCALL_BIT         0x40000000
1008
1009       #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
1010               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
1011                        (offsetof(struct seccomp_data, arch))), \
1012               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
1013               BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
1014                        (offsetof(struct seccomp_data, nr))), \
1015               BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
1016               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
1017
1018       /* installNotifyFilter() installs a seccomp filter that generates
1019          user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
1020          calls mkdir(2); the filter allows all other system calls.
1021
1022          The function return value is a file descriptor from which the
1023          user-space notifications can be fetched. */
1024
1025       static int
1026       installNotifyFilter(void)
1027       {
1028           int notifyFd;
1029
1030           struct sock_filter filter[] = {
1031               X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
1032
1033               /* mkdir() triggers notification to user-space supervisor */
1034
1035               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mkdir, 0, 1),
1036               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
1037
1038               /* Every other system call is allowed */
1039
1040               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
1041           };
1042
1043           struct sock_fprog prog = {
1044               .len = ARRAY_SIZE(filter),
1045               .filter = filter,
1046           };
1047
1048           /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1049              as a result, seccomp() returns a notification file descriptor. */
1050
1051           notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
1052                              SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
1053           if (notifyFd == -1)
1054               err(EXIT_FAILURE, "seccomp-install-notify-filter");
1055
1056           return notifyFd;
1057       }
1058
1059       /* Close a pair of sockets created by socketpair() */
1060
1061       static void
1062       closeSocketPair(int sockPair[2])
1063       {
1064           if (close(sockPair[0]) == -1)
1065               err(EXIT_FAILURE, "closeSocketPair-close-0");
1066           if (close(sockPair[1]) == -1)
1067               err(EXIT_FAILURE, "closeSocketPair-close-1");
1068       }
1069
1070       /* Implementation of the target process; create a child process that:
1071
1072          (1) installs a seccomp filter with the
1073              SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1074          (2) writes the seccomp notification file descriptor returned from
1075              the previous step onto the UNIX domain socket, 'sockPair[0]';
1076          (3) calls mkdir(2) for each element of 'argv'.
1077
1078          The function return value in the parent is the PID of the child
1079          process; the child does not return from this function. */
1080
1081       static pid_t
1082       targetProcess(int sockPair[2], char *argv[])
1083       {
1084           int    notifyFd, s;
1085           pid_t  targetPid;
1086
1087           targetPid = fork();
1088
1089           if (targetPid == -1)
1090               err(EXIT_FAILURE, "fork");
1091
1092           if (targetPid > 0)          /* In parent, return PID of child */
1093               return targetPid;
1094
1095           /* Child falls through to here */
1096
1097           printf("T: PID = %ld\n", (long) getpid());
1098
1099           /* Install seccomp filter(s) */
1100
1101           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
1102               err(EXIT_FAILURE, "prctl");
1103
1104           notifyFd = installNotifyFilter();
1105
1106           /* Pass the notification file descriptor to the supervisor over
1107              a UNIX domain socket */
1108
1109           if (sendfd(sockPair[0], notifyFd) == -1)
1110               err(EXIT_FAILURE, "sendfd");
1111
1112           /* Notification and socket FDs are no longer needed in target */
1113
1114           if (close(notifyFd) == -1)
1115               err(EXIT_FAILURE, "close-target-notify-fd");
1116
1117           closeSocketPair(sockPair);
1118
1119           /* Perform a mkdir() call for each of the command-line arguments */
1120
1121           for (char **ap = argv; *ap != NULL; ap++) {
1122               printf("\nT: about to mkdir(\"%s\")\n", *ap);
1123
1124               s = mkdir(*ap, 0700);
1125               if (s == -1)
1126                   perror("T: ERROR: mkdir(2)");
1127               else
1128                   printf("T: SUCCESS: mkdir(2) returned %d\n", s);
1129           }
1130
1131           printf("\nT: terminating\n");
1132           exit(EXIT_SUCCESS);
1133       }
1134
1135       /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
1136          operation is still valid. It will no longer be valid if the target
1137          process has terminated or is no longer blocked in the system call that
1138          generated the notification (because it was interrupted by a signal).
1139
1140          This operation can be used when doing such things as accessing
1141          /proc/PID files in the target process in order to avoid TOCTOU race
1142          conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
1143          terminates and is reused by another process. */
1144
1145       static bool
1146       cookieIsValid(int notifyFd, uint64_t id)
1147       {
1148           return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
1149       }
1150
1151       /* Access the memory of the target process in order to fetch the
1152          pathname referred to by the system call argument 'argNum' in
1153          'req->data.args[]'.  The pathname is returned in 'path',
1154          a buffer of 'len' bytes allocated by the caller.
1155
1156          Returns true if the pathname is successfully fetched, and false
1157          otherwise. For possible causes of failure, see the comments below. */
1158
1159       static bool
1160       getTargetPathname(struct seccomp_notif *req, int notifyFd,
1161                         int argNum, char *path, size_t len)
1162       {
1163           int      procMemFd;
1164           char     procMemPath[PATH_MAX];
1165           ssize_t  nread;
1166
1167           snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
1168
1169           procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
1170           if (procMemFd == -1)
1171               return false;
1172
1173           /* Check that the process whose info we are accessing is still alive
1174              and blocked in the system call that caused the notification.
1175              If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
1176              cookieIsValid()) succeeded, we know that the /proc/PID/mem file
1177              descriptor that we opened corresponded to the process for which we
1178              received a notification. If that process subsequently terminates,
1179              then read() on that file descriptor will return 0 (EOF). */
1180
1181           if (!cookieIsValid(notifyFd, req->id)) {
1182               close(procMemFd);
1183               return false;
1184           }
1185
1186           /* Read bytes at the location containing the pathname argument */
1187
1188           nread = pread(procMemFd, path, len, req->data.args[argNum]);
1189
1190           close(procMemFd);
1191
1192           if (nread <= 0)
1193               return false;
1194
1195           /* Once again check that the notification ID is still valid. The
1196              case we are particularly concerned about here is that just
1197              before we fetched the pathname, the target's blocked system
1198              call was interrupted by a signal handler, and after the handler
1199              returned, the target carried on execution (past the interrupted
1200              system call). In that case, we have no guarantees about what we
1201              are reading, since the target's memory may have been arbitrarily
1202              changed by subsequent operations. */
1203
1204           if (!cookieIsValid(notifyFd, req->id)) {
1205               perror("\tS: notification ID check failed!!!");
1206               return false;
1207           }
1208
1209           /* Even if the target's system call was not interrupted by a signal,
1210              we have no guarantees about what was in the memory of the target
1211              process. (The memory may have been modified by another thread, or
1212              even by an external attacking process.) We therefore treat the
1213              buffer returned by pread() as untrusted input. The buffer should
1214              contain a terminating null byte; if not, then we will trigger an
1215              error for the target process. */
1216
1217           if (strnlen(path, nread) < (size_t) nread)
1218               return true;
1219
1220           return false;
1221       }
1222
1223       /* Allocate buffers for the seccomp user-space notification request and
1224          response structures. It is the caller's responsibility to free the
1225          buffers returned via 'req' and 'resp'. */
1226
1227       static void
1228       allocSeccompNotifBuffers(struct seccomp_notif **req,
1229                                struct seccomp_notif_resp **resp,
1230                                struct seccomp_notif_sizes *sizes)
1231       {
1232           size_t  resp_size;
1233
1234           /* Discover the sizes of the structures that are used to receive
1235              notifications and send notification responses, and allocate
1236              buffers of those sizes. */
1237
1238           if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
1239               err(EXIT_FAILURE, "seccomp-SECCOMP_GET_NOTIF_SIZES");
1240
1241           *req = malloc(sizes->seccomp_notif);
1242           if (*req == NULL)
1243               err(EXIT_FAILURE, "malloc-seccomp_notif");
1244
1245           /* When allocating the response buffer, we must allow for the fact
1246              that the user-space binary may have been built with user-space
1247              headers where 'struct seccomp_notif_resp' is bigger than the
1248              response buffer expected by the (older) kernel. Therefore, we
1249              allocate a buffer that is the maximum of the two sizes. This
1250              ensures that if the supervisor places bytes into the response
1251              structure that are past the response size that the kernel expects,
1252              then the supervisor is not touching an invalid memory location. */
1253
1254           resp_size = sizes->seccomp_notif_resp;
1255           if (sizeof(struct seccomp_notif_resp) > resp_size)
1256               resp_size = sizeof(struct seccomp_notif_resp);
1257
1258           *resp = malloc(resp_size);
1259           if (*resp == NULL)
1260               err(EXIT_FAILURE, "malloc-seccomp_notif_resp");
1261
1262       }
1263
       /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
          descriptor, 'notifyFd'. */

       static void
       handleNotifications(int notifyFd)
       {
           bool                        pathOK;
           char                        path[PATH_MAX];
           struct seccomp_notif        *req;
           struct seccomp_notif_resp   *resp;
           struct seccomp_notif_sizes  sizes;

           allocSeccompNotifBuffers(&req, &resp, &sizes);

           /* Loop handling notifications */

           for (;;) {

               /* Wait for next notification, returning info in '*req' */

               memset(req, 0, sizes.seccomp_notif);
               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
                   if (errno == EINTR)
                       continue;
                   err(EXIT_FAILURE, "\tS: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
               }

               printf("\tS: got notification (ID %#llx) for PID %d\n",
                      req->id, req->pid);

               /* The only system call that can generate a notification event
                  is mkdir(2). Nevertheless, we check that the notified system
                  call is indeed mkdir() as kind of future-proofing of this
                  code in case the seccomp filter is later modified to
                  generate notifications for other system calls. */

               if (req->data.nr != SYS_mkdir) {
                   printf("\tS: notification contained unexpected "
                          "system call number; bye!!!\n");
                   exit(EXIT_FAILURE);
               }

               pathOK = getTargetPathname(req, notifyFd, 0, path, sizeof(path));

               /* Prepopulate some fields of the response */

               resp->id = req->id;     /* Response includes notification ID */
               resp->flags = 0;
               resp->val = 0;

               /* If getTargetPathname() failed, trigger an EINVAL error
                  response (sending this response may yield an error if the
                  failure occurred because the notification ID was no longer
                  valid); if the directory is in /tmp, then create it on behalf
                  of the target; if the pathname starts with "./", tell the
                  kernel to let the target process execute the mkdir();
                  otherwise, give an error for a directory pathname in any other
                  location. */

               if (!pathOK) {
                   resp->error = -EINVAL;
                   printf("\tS: spoofing error for invalid pathname (%s)\n",
                          strerror(-resp->error));
               } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
                   printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                          path, req->data.args[1]);

                   if (mkdir(path, req->data.args[1]) == 0) {
                       resp->error = 0;            /* "Success" */
                       resp->val = strlen(path);   /* Used as return value of
                                                      mkdir() in target */
                       printf("\tS: success! spoofed return = %lld\n",
                              resp->val);
                   } else {

                       /* If mkdir() failed in the supervisor, pass the error
                          back to the target */

                       resp->error = -errno;
                       printf("\tS: failure! (errno = %d; %s)\n", errno,
                              strerror(errno));
                   }
               } else if (strncmp(path, "./", strlen("./")) == 0) {
                   resp->error = resp->val = 0;
                   resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
                   printf("\tS: target can execute system call\n");
               } else {
                   resp->error = -EOPNOTSUPP;
                   printf("\tS: spoofing error response (%s)\n",
                          strerror(-resp->error));
               }

               /* Send a response to the notification */

               printf("\tS: sending response "
                      "(flags = %#x; val = %lld; error = %d)\n",
                      resp->flags, resp->val, resp->error);

               if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
                   if (errno == ENOENT)
                       printf("\tS: response failed with ENOENT; "
                              "perhaps target process's syscall was "
                              "interrupted by a signal?\n");
                   else
                       perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
               }

               /* If the pathname is just "/bye", then the supervisor breaks out
                  of the loop and terminates. This allows us to see what happens
                  if the target process makes further calls to mkdir(2). The
                  check is skipped when getTargetPathname() failed, because
                  'path' may then be uninitialized. */

               if (pathOK && strcmp(path, "/bye") == 0)
                   break;
           }

           free(req);
           free(resp);
           printf("\tS: terminating **********\n");
           exit(EXIT_FAILURE);
       }

       /* Implementation of the supervisor process:

          (1) obtains the notification file descriptor from 'sockPair[1]'
          (2) handles notifications that arrive on that file descriptor. */

       static void
       supervisor(int sockPair[2])
       {
           int notifyFd;

           notifyFd = recvfd(sockPair[1]);

           if (notifyFd == -1)
               err(EXIT_FAILURE, "recvfd");

           closeSocketPair(sockPair);  /* We no longer need the socket pair */

           handleNotifications(notifyFd);
       }

       int
       main(int argc, char *argv[])
       {
           int               sockPair[2];
           struct sigaction  sa;

           setbuf(stdout, NULL);

           if (argc < 2) {
               fprintf(stderr, "At least one pathname argument is required\n");
               exit(EXIT_FAILURE);
           }

           /* Create a UNIX domain socket that is used to pass the seccomp
              notification file descriptor from the target process to the
              supervisor process. */

           if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
               err(EXIT_FAILURE, "socketpair");

           /* Create a child process--the "target"--that installs seccomp
              filtering. The target process writes the seccomp notification
              file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
              each directory in the command-line arguments. */

           (void) targetProcess(sockPair, &argv[optind]);

           /* Catch SIGCHLD when the target terminates, so that the
              supervisor can also terminate. */

           sa.sa_handler = sigchldHandler;
           sa.sa_flags = 0;
           sigemptyset(&sa.sa_mask);
           if (sigaction(SIGCHLD, &sa, NULL) == -1)
               err(EXIT_FAILURE, "sigaction");

           supervisor(sockPair);

           exit(EXIT_SUCCESS);
       }

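
       The buffer-sizing rule used for the response structure in the program
       above (allocate the larger of the size reported by the kernel and the
       size the binary was compiled with) can be shown in isolation.  The
       following is only a sketch: the names maxSize() and allocNotifBuffer()
       are hypothetical and do not appear in the program, and the sizes are
       passed in as plain numbers rather than obtained from
       SECCOMP_GET_NOTIF_SIZES.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the program above): return the
   larger of the structure size reported by the kernel and the size
   that this binary was compiled with. */
size_t
maxSize(size_t kernelSize, size_t compiledSize)
{
    return (compiledSize > kernelSize) ? compiledSize : kernelSize;
}

/* Hypothetical helper: allocate a zero-filled buffer that is large
   enough for either view of the structure, so that writing any field
   known to the compiled-in definition stays inside the allocation.
   Returns NULL if the allocation fails; the caller frees the buffer. */
void *
allocNotifBuffer(size_t kernelSize, size_t compiledSize)
{
    return calloc(1, maxSize(kernelSize, compiledSize));
}
```

       A caller would pass, for example, sizes->seccomp_notif_resp and
       sizeof(struct seccomp_notif_resp) as the two arguments.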
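       The pathname rules that handleNotifications() applies inline can also
       be written as a pure function, which makes them easy to check without
       installing a seccomp filter.  This is a sketch under assumptions: the
       enum and the function name classifyPath() are hypothetical, and the
       function covers only the three pathname cases, not the case where
       getTargetPathname() fails.  As the Overview notes, such checks
       demonstrate the mechanism and are not a security policy.

```c
#include <string.h>

/* Hypothetical names; the program above makes this decision inline. */
enum pathAction {
    ACTION_SUPERVISOR_MKDIR,    /* Under "/tmp/": supervisor calls mkdir() */
    ACTION_CONTINUE,            /* Starts with "./": target runs the call */
    ACTION_DENY                 /* Anything else: spoof EOPNOTSUPP */
};

/* Classify a NUL-terminated pathname using the same prefix tests as
   the program above. */
enum pathAction
classifyPath(const char *path)
{
    if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0)
        return ACTION_SUPERVISOR_MKDIR;
    if (strncmp(path, "./", strlen("./")) == 0)
        return ACTION_CONTINUE;
    return ACTION_DENY;
}
```

       Note that a pathname of exactly "/tmp" (no trailing slash) does not
       match the "/tmp/" prefix and is therefore denied.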

SEE ALSO

       ioctl(2), pidfd_getfd(2), pidfd_open(2), seccomp(2)

       A further example program can be found in the kernel source file
       samples/seccomp/user-trap.c.



Linux man-pages 6.05              2023-05-03                seccomp_unotify(2)