SECCOMP_UNOTIFY(2) Linux Programmer's Manual SECCOMP_UNOTIFY(2)

NAME
seccomp_unotify - Seccomp user-space notification mechanism
7
SYNOPSIS
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>

int seccomp(unsigned int operation, unsigned int flags, void *args);

#include <sys/ioctl.h>

int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
          struct seccomp_notif *req);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
          struct seccomp_notif_resp *resp);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ADDFD,
          struct seccomp_notif_addfd *addfd);
24
DESCRIPTION
This page describes the user-space notification mechanism provided by
the Secure Computing (seccomp) facility. In addition to the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SECCOMP_RET_USER_NOTIF
action value, and the SECCOMP_GET_NOTIF_SIZES operation described in
seccomp(2), this mechanism involves a number of related ioctl(2)
operations (described below).
32
33 Overview
34 In conventional usage of a seccomp filter, the decision about how to
35 treat a system call is made by the filter itself. By contrast, the
36 user-space notification mechanism allows the seccomp filter to delegate
37 the handling of the system call to another user-space process. Note
that this mechanism is explicitly not intended as a method of
implementing security policy; see NOTES.
40
41 In the discussion that follows, the thread(s) on which the seccomp fil‐
42 ter is installed is (are) referred to as the target, and the process
43 that is notified by the user-space notification mechanism is referred
44 to as the supervisor.
45
46 A suitably privileged supervisor can use the user-space notification
47 mechanism to perform actions on behalf of the target. The advantage of
48 the user-space notification mechanism is that the supervisor will usu‐
49 ally be able to retrieve information about the target and the performed
50 system call that the seccomp filter itself cannot. (A seccomp filter
51 is limited in the information it can obtain and the actions that it can
52 perform because it is running on a virtual machine inside the kernel.)
53
54 An overview of the steps performed by the target and the supervisor is
55 as follows:
56
57 1. The target establishes a seccomp filter in the usual manner, but
58 with two differences:
59
60 • The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
61 TER_FLAG_NEW_LISTENER. Consequently, the return value of the
62 (successful) seccomp(2) call is a new "listening" file descriptor
63 that can be used to receive notifications. Only one "listening"
64 seccomp filter can be installed for a thread.
65
66 • In cases where it is appropriate, the seccomp filter returns the
67 action value SECCOMP_RET_USER_NOTIF. This return value will trig‐
68 ger a notification event.
69
70 2. In order that the supervisor can obtain notifications using the lis‐
71 tening file descriptor, (a duplicate of) that file descriptor must
72 be passed from the target to the supervisor. One way in which this
73 could be done is by passing the file descriptor over a UNIX domain
74 socket connection between the target and the supervisor (using the
75 SCM_RIGHTS ancillary message type described in unix(7)). Another
76 way to do this is through the use of pidfd_getfd(2).
77
78 3. The supervisor will receive notification events on the listening
79 file descriptor. These events are returned as structures of type
80 seccomp_notif. Because this structure and its size may evolve over
81 kernel versions, the supervisor must first determine the size of
82 this structure using the seccomp(2) SECCOMP_GET_NOTIF_SIZES opera‐
83 tion, which returns a structure of type seccomp_notif_sizes. The
supervisor allocates a buffer of size seccomp_notif_sizes.seccomp_notif
bytes to receive notification events. In addition, the supervisor
allocates another buffer of size seccomp_notif_sizes.seccomp_notif_resp
bytes for the response (a struct seccomp_notif_resp structure) that it
will provide to the kernel (and thus the target).
89
90 4. The target then performs its workload, which includes system calls
91 that will be controlled by the seccomp filter. Whenever one of
92 these system calls causes the filter to return the SEC‐
93 COMP_RET_USER_NOTIF action value, the kernel does not (yet) execute
94 the system call; instead, execution of the target is temporarily
95 blocked inside the kernel (in a sleep state that is interruptible by
96 signals) and a notification event is generated on the listening file
97 descriptor.
98
99 5. The supervisor can now repeatedly monitor the listening file de‐
100 scriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do this,
101 the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation
102 to read information about a notification event; this operation
103 blocks until an event is available. The operation returns a sec‐
104 comp_notif structure containing information about the system call
105 that is being attempted by the target. (As described in NOTES, the
106 file descriptor can also be monitored with select(2), poll(2), or
107 epoll(7).)
108
109 6. The seccomp_notif structure returned by the SECCOMP_IOCTL_NOTIF_RECV
110 operation includes the same information (a seccomp_data structure)
111 that was passed to the seccomp filter. This information allows the
112 supervisor to discover the system call number and the arguments for
113 the target's system call. In addition, the notification event con‐
114 tains the ID of the thread that triggered the notification and a
115 unique cookie value that is used in subsequent SECCOMP_IOCTL_NO‐
116 TIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND operations.
117
118 The information in the notification can be used to discover the val‐
119 ues of pointer arguments for the target's system call. (This is
120 something that can't be done from within a seccomp filter.) One way
121 in which the supervisor can do this is to open the corresponding
122 /proc/[tid]/mem file (see proc(5)) and read bytes from the location
123 that corresponds to one of the pointer arguments whose value is sup‐
124 plied in the notification event. (The supervisor must be careful to
125 avoid a race condition that can occur when doing this; see the de‐
126 scription of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation be‐
127 low.) In addition, the supervisor can access other system informa‐
128 tion that is visible in user space but which is not accessible from
129 a seccomp filter.
130
131 7. Having obtained information as per the previous step, the supervisor
132 may then choose to perform an action in response to the target's
133 system call (which, as noted above, is not executed when the seccomp
134 filter returns the SECCOMP_RET_USER_NOTIF action value).
135
136 One example use case here relates to containers. The target may be
137 located inside a container where it does not have sufficient capa‐
138 bilities to mount a filesystem in the container's mount namespace.
139 However, the supervisor may be a more privileged process that does
140 have sufficient capabilities to perform the mount operation.
141
142 8. The supervisor then sends a response to the notification. The in‐
143 formation in this response is used by the kernel to construct a re‐
144 turn value for the target's system call and provide a value that
145 will be assigned to the errno variable of the target.
146
The response is sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2)
operation, which is used to transmit a seccomp_notif_resp structure to
the kernel. This structure must include the cookie value that the
supervisor obtained in the seccomp_notif structure returned by the
SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the kernel to
associate the response with the target.
157
9. Once the notification response has been sent, the system call in the
target thread unblocks, returning the information that was provided by
the supervisor in the notification response.
161
162 As a variation on the last two steps, the supervisor can send a re‐
163 sponse that tells the kernel that it should execute the target thread's
164 system call; see the discussion of SECCOMP_USER_NOTIF_FLAG_CONTINUE,
165 below.
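
Step 1 of the overview (the target's side) can be sketched as follows.
This is a minimal illustration, not the code from EXAMPLES. The function
names build_notify_filter() and install_notify_filter() are
hypothetical. The sketch assumes an x86-64 target (AUDIT_ARCH_X86_64)
and omits the x32 ABI check that a production filter would need. It
calls seccomp(2) through syscall(2), on the assumption that the C
library provides no wrapper.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill 'prog' with a filter that returns SECCOMP_RET_USER_NOTIF for
   system call 'nr' and allows everything else. Returns the number of
   instructions written (7). AUDIT_ARCH_X86_64 is an assumption of
   this sketch. */
unsigned short
build_notify_filter(struct sock_filter prog[7], int nr)
{
    struct sock_filter f[] = {
        /* Load the architecture; kill the process on a mismatch */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

        /* Load the system call number and compare it with 'nr' */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int) nr, 0, 1),

        /* Match: notify the supervisor; otherwise: allow */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(prog, f, sizeof(f));
    return sizeof(f) / sizeof(f[0]);
}

/* Install the filter on the calling thread. On success, returns the
   listening file descriptor; on error, returns -1 with errno set. */
int
install_notify_filter(int nr)
{
    struct sock_filter insns[7];
    struct sock_fprog prog = {
        .len = build_notify_filter(insns, nr),
        .filter = insns,
    };

    /* Needed unless the caller has CAP_SYS_ADMIN; see seccomp(2) */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                   SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

The descriptor returned by install_notify_filter() is the one that must
then be passed to the supervisor, as described in step 2.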
166
IOCTL OPERATIONS
The following ioctl(2) operations are supported by the seccomp
user-space notification file descriptor. For each of these operations,
the first (file descriptor) argument of ioctl(2) is the listening file
descriptor returned by a call to seccomp(2) with the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
173
174 SECCOMP_IOCTL_NOTIF_RECV
175 The SECCOMP_IOCTL_NOTIF_RECV operation (available since Linux 5.0) is
176 used to obtain a user-space notification event. If no such event is
177 currently pending, the operation blocks until an event occurs. The
178 third ioctl(2) argument is a pointer to a structure of the following
179 form which contains information about the event. This structure must
180 be zeroed out before the call.
181
    struct seccomp_notif {
        __u64  id;                  /* Cookie */
        __u32  pid;                 /* TID of target thread */
        __u32  flags;               /* Currently unused (0) */
        struct seccomp_data data;   /* See seccomp(2) */
    };
188
189 The fields in this structure are as follows:
190
191 id This is a cookie for the notification. Each such cookie is
192 guaranteed to be unique for the corresponding seccomp filter.
193
194 • The cookie can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID
195 ioctl(2) operation described below.
196
197 • When returning a notification response to the kernel, the su‐
198 pervisor must include the cookie value in the seccomp_no‐
199 tif_resp structure that is specified as the argument of the
200 SECCOMP_IOCTL_NOTIF_SEND operation.
201
202 pid This is the thread ID of the target thread that triggered the
203 notification event.
204
205 flags This is a bit mask of flags providing further information on the
206 event. In the current implementation, this field is always
207 zero.
208
209 data This is a seccomp_data structure containing information about
210 the system call that triggered the notification. This is the
211 same structure that is passed to the seccomp filter. See sec‐
212 comp(2) for details of this structure.
213
214 On success, this operation returns 0; on failure, -1 is returned, and
215 errno is set to indicate the cause of the error. This operation can
216 fail with the following errors:
217
218 EINVAL (since Linux 5.5)
219 The seccomp_notif structure that was passed to the call con‐
220 tained nonzero fields.
221
222 ENOENT The target thread was killed by a signal as the notification in‐
223 formation was being generated, or the target's (blocked) system
224 call was interrupted by a signal handler.
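
A supervisor might size its buffers and fetch a notification roughly as
shown below. This is a sketch under stated assumptions: the helper
names alloc_notif_buffers() and recv_notification() are hypothetical,
seccomp(2) is invoked through syscall(2), and the policy of retrying
after ENOENT (the target abandoned its system call) is one possible
choice rather than a requirement.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel for the structure sizes and allocate both buffers.
   Returns 0 on success, or -1 with errno set. */
int
alloc_notif_buffers(struct seccomp_notif_sizes *sizes,
                    struct seccomp_notif **req,
                    struct seccomp_notif_resp **resp)
{
    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
        return -1;

    *req = calloc(1, sizes->seccomp_notif);
    *resp = calloc(1, sizes->seccomp_notif_resp);
    if (*req == NULL || *resp == NULL) {
        free(*req);
        free(*resp);
        return -1;
    }
    return 0;
}

/* Block until a notification has been fetched into 'req', which is a
   buffer of 'req_size' bytes. Returns 0 on success, or -1 with errno
   set for any error other than EINTR and ENOENT. */
int
recv_notification(int notify_fd, struct seccomp_notif *req,
                  size_t req_size)
{
    for (;;) {
        /* The structure must be zeroed before every call */
        memset(req, 0, req_size);

        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
            return 0;

        if (errno == EINTR)     /* Supervisor was interrupted */
            continue;
        if (errno == ENOENT)    /* Target gave up; wait for the next */
            continue;
        return -1;
    }
}
```

Zeroing inside the loop matters because a failed call may be followed
by a retry that reuses the same buffer.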
225
226 SECCOMP_IOCTL_NOTIF_ID_VALID
227 The SECCOMP_IOCTL_NOTIF_ID_VALID operation (available since Linux 5.0)
228 is used to check that a notification ID returned by an earlier SEC‐
229 COMP_IOCTL_NOTIF_RECV operation is still valid (i.e., that the target
230 still exists and its system call is still blocked waiting for a re‐
231 sponse).
232
233 The third ioctl(2) argument is a pointer to the cookie (id) returned by
234 the SECCOMP_IOCTL_NOTIF_RECV operation.
235
This operation is necessary to avoid race conditions that can occur
when the thread identified by the pid field returned by the
SECCOMP_IOCTL_NOTIF_RECV operation terminates, and its thread ID is
reused by another thread or process. An example of this kind of race
is the following:
240
241 1. A notification is generated on the listening file descriptor. The
242 returned seccomp_notif contains the TID of the target thread (in the
243 pid field of the structure).
244
245 2. The target terminates.
246
247 3. Another thread or process is created on the system that by chance
248 reuses the TID that was freed when the target terminated.
249
4. The supervisor open(2)s the /proc/[tid]/mem file for the TID
obtained in step 1, with the intention of (say) inspecting the memory
location(s) that contain the argument(s) of the system call that
triggered the notification in step 1.
254
255 In the above scenario, the risk is that the supervisor may try to ac‐
256 cess the memory of a process other than the target. This race can be
257 avoided by following the call to open(2) with a SECCOMP_IOCTL_NO‐
258 TIF_ID_VALID operation to verify that the process that generated the
259 notification is still alive. (Note that if the target terminates after
260 the latter step, a subsequent read(2) from the file descriptor may re‐
261 turn 0, indicating end of file.)
262
263 See NOTES for a discussion of other cases where SECCOMP_IOCTL_NO‐
264 TIF_ID_VALID checks must be performed.
265
266 On success (i.e., the notification ID is still valid), this operation
267 returns 0. On failure (i.e., the notification ID is no longer valid),
268 -1 is returned, and errno is set to ENOENT.
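
The open-then-check ordering described above might look like the
following sketch. The helper names cookie_is_valid() and
open_target_mem() are hypothetical, and treating every ioctl(2) failure
(not only ENOENT) as "no longer valid" is a simplification made for
brevity.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Return 1 if the notification identified by 'id' is still valid,
   0 otherwise (any ioctl() failure is treated as "not valid") */
int
cookie_is_valid(int notify_fd, __u64 id)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}

/* Open /proc/[tid]/mem for the thread named in 'req', then verify
   that the TID still refers to the target. Returns a file descriptor
   on success, or -1 if the open fails or the target is gone. */
int
open_target_mem(int notify_fd, const struct seccomp_notif *req)
{
    char path[64];
    int fd;

    snprintf(path, sizeof(path), "/proc/%u/mem", req->pid);

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    /* The check comes after open(): if the TID was recycled before
       the open, the cookie is no longer valid and the descriptor
       must not be used */
    if (!cookie_is_valid(notify_fd, req->id)) {
        close(fd);
        return -1;
    }

    return fd;
}
```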
269
270 SECCOMP_IOCTL_NOTIF_SEND
The SECCOMP_IOCTL_NOTIF_SEND operation (available since Linux 5.0) is
used to send a notification response back to the kernel. The third
ioctl(2) argument is a pointer to a structure of the following form:
275
    struct seccomp_notif_resp {
        __u64 id;       /* Cookie value */
        __s64 val;      /* Success return value */
        __s32 error;    /* 0 (success) or negative error number */
        __u32 flags;    /* See below */
    };
282
283 The fields of this structure are as follows:
284
285 id This is the cookie value that was obtained using the SEC‐
286 COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
287 kernel to correctly associate this response with the system call
288 that triggered the user-space notification.
289
290 val This is the value that will be used for a spoofed success return
291 for the target's system call; see below.
292
293 error This is the value that will be used as the error number (errno)
294 for a spoofed error return for the target's system call; see be‐
295 low.
296
297 flags This is a bit mask that includes zero or more of the following
298 flags:
299
300 SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
301 Tell the kernel to execute the target's system call.
302
303 Two kinds of response are possible:
304
305 • A response to the kernel telling it to execute the target's system
306 call. In this case, the flags field includes SECCOMP_USER_NO‐
307 TIF_FLAG_CONTINUE and the error and val fields must be zero.
308
309 This kind of response can be useful in cases where the supervisor
310 needs to do deeper analysis of the target's system call than is pos‐
311 sible from a seccomp filter (e.g., examining the values of pointer
312 arguments), and, having decided that the system call does not require
313 emulation by the supervisor, the supervisor wants the system call to
314 be executed normally in the target.
315
316 The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used with cau‐
317 tion; see NOTES.
318
319 • A spoofed return value for the target's system call. In this case,
320 the kernel does not execute the target's system call, instead causing
321 the system call to return a spoofed value as specified by fields of
322 the seccomp_notif_resp structure. The supervisor should set the
323 fields of this structure as follows:
324
325 + flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.
326
327 + error is set either to 0 for a spoofed "success" return or to a
328 negative error number for a spoofed "failure" return. In the for‐
329 mer case, the kernel causes the target's system call to return the
330 value specified in the val field. In the latter case, the kernel
331 causes the target's system call to return -1, and errno is as‐
332 signed the negated error value.
333
334 + val is set to a value that will be used as the return value for a
335 spoofed "success" return for the target's system call. The value
336 in this field is ignored if the error field contains a nonzero
337 value.
338
339 On success, this operation returns 0; on failure, -1 is returned, and
340 errno is set to indicate the cause of the error. This operation can
341 fail with the following errors:
342
343 EINPROGRESS
344 A response to this notification has already been sent.
345
346 EINVAL An invalid value was specified in the flags field.
347
348 EINVAL The flags field contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and
349 the error or val field was not zero.
350
351 ENOENT The blocked system call in the target has been interrupted by a
352 signal handler or the target has terminated.
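
The field settings for the two kinds of response can be captured in
small helpers such as the ones below. The function names are
hypothetical and the code is only a sketch of the rules stated above;
it fills in the structure but leaves the SECCOMP_IOCTL_NOTIF_SEND call
to the caller.

```c
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

/* Spoof a successful return of 'val' from the target's system call */
void
resp_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
    resp->error = 0;
}

/* Spoof a failure. 'err' is a positive errno value such as EPERM;
   the kernel expects it negated in the 'error' field. */
void
resp_error(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;     /* 'val' stays 0 and is ignored */
}

/* Tell the kernel to execute the target's system call. The 'error'
   and 'val' fields must both be zero in this case. */
void
resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

For example, resp_error(resp, req->id, EOPNOTSUPP) prepares a response
that makes the target's call fail with "Operation not supported".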
353
354 SECCOMP_IOCTL_NOTIF_ADDFD
355 The SECCOMP_IOCTL_NOTIF_ADDFD operation (available since Linux 5.9) al‐
356 lows the supervisor to install a file descriptor into the target's file
357 descriptor table. Much like the use of SCM_RIGHTS messages described
358 in unix(7), this operation is semantically equivalent to duplicating a
359 file descriptor from the supervisor's file descriptor table into the
360 target's file descriptor table.
361
362 The SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to emu‐
363 late a target system call (such as socket(2) or openat(2)) that gener‐
364 ates a file descriptor. The supervisor can perform the system call
365 that generates the file descriptor (and associated open file descrip‐
366 tion) and then use this operation to allocate a file descriptor that
367 refers to the same open file description in the target. (For an expla‐
368 nation of open file descriptions, see open(2).)
369
370 Once this operation has been performed, the supervisor can close its
371 copy of the file descriptor.
372
373 In the target, the received file descriptor is subject to the same
374 Linux Security Module (LSM) checks as are applied to a file descriptor
375 that is received in an SCM_RIGHTS ancillary message. If the file de‐
376 scriptor refers to a socket, it inherits the cgroup version 1 network
377 controller settings (classid and netprioidx) of the target.
378
379 The third ioctl(2) argument is a pointer to a structure of the follow‐
380 ing form:
381
    struct seccomp_notif_addfd {
        __u64 id;           /* Cookie value */
        __u32 flags;        /* Flags */
        __u32 srcfd;        /* Local file descriptor number */
        __u32 newfd;        /* 0 or desired file descriptor
                               number in target */
        __u32 newfd_flags;  /* Flags to set on target file
                               descriptor */
    };
391
392 The fields in this structure are as follows:
393
394 id This field should be set to the notification ID (cookie value)
395 that was obtained via SECCOMP_IOCTL_NOTIF_RECV.
396
397 flags This field is a bit mask of flags that modify the behavior of
398 the operation. Currently, only one flag is supported:
399
400 SECCOMP_ADDFD_FLAG_SETFD
401 When allocating the file descriptor in the target, use
402 the file descriptor number specified in the newfd field.
403
404 srcfd This field should be set to the number of the file descriptor in
405 the supervisor that is to be duplicated.
406
407 newfd This field determines which file descriptor number is allocated
408 in the target. If the SECCOMP_ADDFD_FLAG_SETFD flag is set,
409 then this field specifies which file descriptor number should be
410 allocated. If this file descriptor number is already open in
411 the target, it is atomically closed and reused. If the descrip‐
412 tor duplication fails due to an LSM check, or if srcfd is not a
413 valid file descriptor, the file descriptor newfd will not be
414 closed in the target process.
415
If the SECCOMP_ADDFD_FLAG_SETFD flag is not set, then this field
must be 0, and the kernel allocates the lowest unused file
descriptor number in the target.
419
420 newfd_flags
421 This field is a bit mask specifying flags that should be set on
422 the file descriptor that is received in the target process.
423 Currently, only the following flag is implemented:
424
425 O_CLOEXEC
426 Set the close-on-exec flag on the received file descrip‐
427 tor.
428
429 On success, this ioctl(2) call returns the number of the file descrip‐
430 tor that was allocated in the target. Assuming that the emulated sys‐
431 tem call is one that returns a file descriptor as its function result
432 (e.g., socket(2)), this value can be used as the return value
433 (resp.val) that is supplied in the response that is subsequently sent
434 with the SECCOMP_IOCTL_NOTIF_SEND operation.
435
436 On error, -1 is returned and errno is set to indicate the cause of the
437 error.
438
439 This operation can fail with the following errors:
440
441 EBADF Allocating the file descriptor in the target would cause the
442 target's RLIMIT_NOFILE limit to be exceeded (see getrlimit(2)).
443
444 EINPROGRESS
445 The user-space notification specified in the id field exists but
446 has not yet been fetched (by a SECCOMP_IOCTL_NOTIF_RECV) or has
447 already been responded to (by a SECCOMP_IOCTL_NOTIF_SEND).
448
449 EINVAL An invalid flag was specified in the flags or newfd_flags field,
450 or the newfd field is nonzero and the SECCOMP_ADDFD_FLAG_SETFD
451 flag was not specified in the flags field.
452
453 EMFILE The file descriptor number specified in newfd exceeds the limit
454 specified in /proc/sys/fs/nr_open.
455
456 ENOENT The blocked system call in the target has been interrupted by a
457 signal handler or the target has terminated.
458
Here is some sample code (with error handling omitted) that uses the
SECCOMP_IOCTL_NOTIF_ADDFD operation (here, to emulate a call to
openat(2)):
462
    int fd, targetFd;

    /* 'path' is assumed to hold the pathname argument that the
       supervisor has already obtained */
    fd = openat(req->data.args[0], path, req->data.args[2],
                req->data.args[3]);

    struct seccomp_notif_addfd addfd;
    addfd.id = req->id;     /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
    addfd.srcfd = fd;
    addfd.newfd = 0;        /* Let the kernel choose the number */
    addfd.flags = 0;
    addfd.newfd_flags = O_CLOEXEC;

    /* Returns the descriptor number allocated in the target */
    targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);

    close(fd);              /* No longer needed in supervisor */

    struct seccomp_notif_resp *resp;
    /* Code to allocate 'resp' omitted */
    resp->id = req->id;
    resp->error = 0;        /* "Success" */
    resp->val = targetFd;
    resp->flags = 0;
    ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);
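
The sample above lets the kernel choose the descriptor number. A
variant that requests a specific number with SECCOMP_ADDFD_FLAG_SETFD
might be written as in the following sketch. The helpers fill_addfd()
and install_fd() are hypothetical names, and using a negative 'newfd'
to mean "let the kernel choose" is a convention of this sketch only.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Prepare a request to install 'srcfd' in the target. If 'newfd' is
   negative, the kernel allocates the lowest unused descriptor number;
   otherwise exactly 'newfd' is requested. */
void
fill_addfd(struct seccomp_notif_addfd *addfd, __u64 id, int srcfd,
           int newfd, int cloexec)
{
    memset(addfd, 0, sizeof(*addfd));
    addfd->id = id;
    addfd->srcfd = srcfd;

    if (newfd >= 0) {
        addfd->flags = SECCOMP_ADDFD_FLAG_SETFD;
        addfd->newfd = newfd;
    }                       /* else: flags and newfd must stay 0 */

    addfd->newfd_flags = cloexec ? O_CLOEXEC : 0;
}

/* Returns the descriptor number allocated in the target, or -1 with
   errno set */
int
install_fd(int notify_fd, struct seccomp_notif_addfd *addfd)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, addfd);
}
```

Keeping newfd at 0 whenever the flag is absent avoids the EINVAL error
listed above.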
486
NOTES
One example use case for the user-space notification mechanism is to
allow a container manager (a process which is typically running with
more privilege than the processes inside the container) to mount block
devices or create device nodes for the container. The mount use case
provides an example of where the SECCOMP_USER_NOTIF_FLAG_CONTINUE
flag is useful. Upon receiving a notification for the
494 mount(2) system call, the container manager (the "supervisor") can dis‐
495 tinguish a request to mount a block filesystem (which would not be pos‐
496 sible for a "target" process inside the container) and mount that file
497 system. If, on the other hand, the container manager detects that the
498 operation could be performed by the process inside the container (e.g.,
499 a mount of a tmpfs(5) filesystem), it can notify the kernel that the
500 target process's mount(2) system call can continue.
501
502 select()/poll()/epoll semantics
503 The file descriptor returned when seccomp(2) is employed with the SEC‐
504 COMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using poll(2),
505 epoll(7), and select(2). These interfaces indicate that the file de‐
506 scriptor is ready as follows:
507
• When a notification is pending, these interfaces indicate that the
file descriptor is readable. Following such an indication, a
subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block, returning
either information about a notification or else failing with the error
ENOENT if the target has been killed by a signal or its system call
has been interrupted by a signal handler.
514
515 • After the notification has been received (i.e., by the SEC‐
516 COMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces indicate
517 that the file descriptor is writable, meaning that a notification re‐
518 sponse can be sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) opera‐
519 tion.
520
• After the last thread using the filter has terminated and been reaped
using waitpid(2) (or similar), the file descriptor indicates an
end-of-file condition (readable in select(2); POLLHUP/EPOLLHUP in
poll(2)/epoll_wait(2)).
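
The readiness rules above could be mapped onto a small poll(2) wrapper
like the one sketched here. The enum and the function name
wait_for_notification() are hypothetical, and reporting POLLERR and
POLLNVAL as a generic error is a choice made for this example.

```c
#include <poll.h>

enum notify_state {
    NOTIFY_TIMEOUT,     /* Nothing happened within the timeout */
    NOTIFY_PENDING,     /* SECCOMP_IOCTL_NOTIF_RECV will not block */
    NOTIFY_CLOSED,      /* Last thread using the filter is gone */
    NOTIFY_ERROR        /* poll() failed or descriptor is unusable */
};

/* Wait up to 'timeout_ms' milliseconds (-1 means forever) for the
   listening file descriptor to report a pending notification */
enum notify_state
wait_for_notification(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int n;

    n = poll(&pfd, 1, timeout_ms);
    if (n == -1)
        return NOTIFY_ERROR;
    if (n == 0)
        return NOTIFY_TIMEOUT;

    if (pfd.revents & POLLIN)
        return NOTIFY_PENDING;
    if (pfd.revents & POLLHUP)
        return NOTIFY_CLOSED;

    return NOTIFY_ERROR;    /* POLLERR or POLLNVAL */
}
```

A supervisor that handles several targets could place each listening
descriptor in one pollfd array and dispatch on the same conditions.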
525
526 Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
527 The intent of the user-space notification feature is to allow system
528 calls to be performed on behalf of the target. The target's system
529 call should either be handled by the supervisor or allowed to continue
530 normally in the kernel (where standard security policies will be ap‐
531 plied).
532
533 Note well: this mechanism must not be used to make security policy de‐
534 cisions about the system call, which would be inherently race-prone for
535 reasons described next.
536
537 The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution.
538 If set by the supervisor, the target's system call will continue. How‐
539 ever, there is a time-of-check, time-of-use race here, since an at‐
540 tacker could exploit the interval of time where the target is blocked
541 waiting on the "continue" response to do things such as rewriting the
542 system call arguments.
543
544 Note furthermore that a user-space notifier can be bypassed if the ex‐
545 isting filters allow the use of seccomp(2) or prctl(2) to install a
546 filter that returns an action value with a higher precedence than SEC‐
547 COMP_RET_USER_NOTIF (see seccomp(2)).
548
It should thus be absolutely clear that the seccomp user-space
notification mechanism cannot be used to implement a security policy! It
should only ever be used in scenarios where a more privileged process
supervises the system calls of a less privileged target to get around
kernel-enforced security restrictions when the supervisor deems this
safe. In other words, in order to continue a system call, the
supervisor should be sure that another security mechanism or the kernel
itself will sufficiently block the system call if its arguments are
rewritten to something unsafe.
559 Caveats regarding the use of /proc/[tid]/mem
560 The discussion above noted the need to use the SECCOMP_IOCTL_NO‐
561 TIF_ID_VALID ioctl(2) when opening the /proc/[tid]/mem file of the tar‐
562 get to avoid the possibility of accessing the memory of the wrong
563 process in the event that the target terminates and its ID is recycled
564 by another (unrelated) thread. However, the use of this ioctl(2) oper‐
565 ation is also necessary in other situations, as explained in the fol‐
566 lowing paragraphs.
567
568 Consider the following scenario, where the supervisor tries to read the
569 pathname argument of a target's blocked mount(2) system call:
570
571 • From one of its functions (func()), the target calls mount(2), which
572 triggers a user-space notification and causes the target to block.
573
574 • The supervisor receives the notification, opens /proc/[tid]/mem, and
575 (successfully) performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.
576
577 • The target receives a signal, which causes the mount(2) to abort.
578
579 • The signal handler executes in the target, and returns.
580
581 • Upon return from the handler, the execution of func() resumes, and it
582 returns (and perhaps other functions are called, overwriting the mem‐
583 ory that had been used for the stack frame of func()).
584
585 • Using the address provided in the notification information, the su‐
586 pervisor reads from the target's memory location that used to contain
587 the pathname.
588
589 • The supervisor now calls mount(2) with some arbitrary bytes obtained
590 in the previous step.
591
592 The conclusion from the above scenario is this: since the target's
593 blocked system call may be interrupted by a signal handler, the super‐
594 visor must be written to expect that the target may abandon its system
595 call at any time; in such an event, any information that the supervisor
596 obtained from the target's memory must be considered invalid.
597
598 To prevent such scenarios, every read from the target's memory must be
599 separated from use of the bytes so obtained by a SECCOMP_IOCTL_NO‐
600 TIF_ID_VALID check. In the above example, the check would be placed
601 between the two final steps. An example of such a check is shown in
602 EXAMPLES.
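
The read-then-check rule might be packaged as in the sketch below. The
function name read_target_string() is hypothetical, 'mem_fd' is assumed
to be an already open /proc/[tid]/mem descriptor, and the use of
pread(2) in place of lseek(2) plus read(2) is a choice made here for
brevity.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to 'len' - 1 bytes from address 'addr' in the target into
   'buf' and null-terminate the result. The bytes are reported only if
   the notification is still valid after the read. Returns the number
   of bytes read, or -1 if the read fails or the target has abandoned
   its system call. */
ssize_t
read_target_string(int notify_fd, __u64 id, int mem_fd, __u64 addr,
                   char *buf, size_t len)
{
    ssize_t n;

    if (len == 0)
        return -1;

    n = pread(mem_fd, buf, len - 1, (off_t) addr);
    if (n <= 0)
        return -1;
    buf[n] = '\0';

    /* The target may have been interrupted while the read was in
       progress, in which case 'buf' may hold arbitrary bytes */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
        return -1;

    return n;
}
```

Because the check is inside the helper, a caller cannot act on the
bytes without the validation having taken place.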
603
604 Following on from the above, it should be clear that a write by the su‐
605 pervisor into the target's memory can never be considered safe.
606
607 Caveats regarding blocking system calls
608 Suppose that the target performs a blocking system call (e.g., ac‐
609 cept(2)) that the supervisor should handle. The supervisor might then
610 in turn execute the same blocking system call.
611
612 In this scenario, it is important to note that if the target's system
613 call is now interrupted by a signal, the supervisor is not informed of
614 this. If the supervisor does not take suitable steps to actively dis‐
615 cover that the target's system call has been canceled, various diffi‐
616 culties can occur. Taking the example of accept(2), the supervisor
617 might remain blocked in its accept(2) holding a port number that the
618 target (which, after the interruption by the signal handler, perhaps
619 closed its listening socket) might expect to be able to reuse in a
620 bind(2) call.
621
622 Therefore, when the supervisor wishes to emulate a blocking system
623 call, it must do so in such a way that it gets informed if the target's
624 system call is interrupted by a signal handler. For example, if the
625 supervisor itself executes the same blocking system call, then it could
626 employ a separate thread that uses the SECCOMP_IOCTL_NOTIF_ID_VALID op‐
627 eration to check if the target is still blocked in its system call.
Alternatively, in the accept(2) example, the supervisor might use
poll(2) to monitor both the notification file descriptor (so as to
discover when the target's accept(2) call has been interrupted) and the
listening file descriptor (so as to know when a connection is
available).
633
634 If the target's system call is interrupted, the supervisor must take
635 care to release resources (e.g., file descriptors) that it acquired on
636 behalf of the target.
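
One way to combine these ideas without a second thread is to poll the
descriptor of interest with a short timeout and re-check the cookie
whenever the timeout expires. The sketch below takes that approach; it
is an illustration rather than the method the text prescribes. The
function name wait_readable_or_cancelled() and the polling interval are
assumptions.

```c
#include <errno.h>
#include <poll.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Wait until 'fd' (for example, a listening socket) is readable,
   checking every 'interval_ms' milliseconds whether the notification
   'id' is still valid. Returns 1 if 'fd' is readable, 0 if the
   validity check fails (any ioctl() failure is treated as
   cancellation), or -1 on a poll() error. */
int
wait_readable_or_cancelled(int notify_fd, __u64 id, int fd,
                           int interval_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        int n = poll(&pfd, 1, interval_ms);

        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }

        if (n > 0)
            return (pfd.revents & POLLIN) ? 1 : -1;

        /* Timeout: is the target still blocked in its call? */
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
            return 0;
    }
}
```

A return value of 0 is the cue to release anything acquired on behalf
of the target.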
637
638 Interaction with SA_RESTART signal handlers
639 Consider the following scenario:
640
641 • The target process has used sigaction(2) to install a signal handler
642 with the SA_RESTART flag.
643
644 • The target has made a system call that triggered a seccomp user-space
645 notification and the target is currently blocked until the supervisor
646 sends a notification response.
647
648 • A signal is delivered to the target and the signal handler is exe‐
649 cuted.
650
• When (if) the supervisor attempts to send a notification response,
the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will fail with the
ENOENT error.
654
655 In this scenario, the kernel will restart the target's system call.
656 Consequently, the supervisor will receive another user-space notifica‐
657 tion. Thus, depending on how many times the blocked system call is in‐
658 terrupted by a signal handler, the supervisor may receive multiple no‐
659 tifications for the same instance of a system call in the target.
660
661 One oddity is that system call restarting as described in this scenario
662 will occur even for the blocking system calls listed in signal(7) that
663 would never normally be restarted by the SA_RESTART flag.
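
Given this behavior, a supervisor may prefer to treat ENOENT from
SECCOMP_IOCTL_NOTIF_SEND as an expected outcome rather than a fatal
error. A minimal sketch follows; the function name send_response() and
its return convention are hypothetical.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Send 'resp' to the kernel. Returns 1 if the response was delivered,
   0 if the target abandoned the system call (ENOENT), or -1 on any
   other error. */
int
send_response(int notify_fd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return 1;

    /* The target was interrupted or has terminated. If the call is
       restarted, a fresh notification with a new cookie will arrive. */
    if (errno == ENOENT)
        return 0;

    return -1;
}
```

On a return of 0, the supervisor would normally undo any work done for
the abandoned call before fetching the next notification.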
664
BUGS
If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed after the
target terminates, then the ioctl(2) call simply blocks (rather than
returning an error to indicate that the target no longer exists).

EXAMPLES
       The (somewhat contrived) program shown below demonstrates the use of
       the interfaces described in this page. The program creates a child
       process that serves as the "target" process. The child process
       installs a seccomp filter that returns the SECCOMP_RET_USER_NOTIF
       action value if a call is made to mkdir(2). The child process then
       calls mkdir(2) once for each of the supplied command-line arguments,
       and reports the result returned by the call. After processing all
       arguments, the child process terminates.

       The parent process acts as the supervisor, listening for the
       notifications that are generated when the target process calls
       mkdir(2). When such a notification occurs, the supervisor examines the
       memory of the target process (using /proc/[pid]/mem) to discover the
       pathname argument that was supplied to the mkdir(2) call, and performs
       one of the following actions:

       •  If the pathname begins with the prefix "/tmp/", then the supervisor
          attempts to create the specified directory, and then spoofs a
          return for the target process based on the return value of the
          supervisor's mkdir(2) call. If that call succeeds, the spoofed
          success return value is the length of the pathname.

       •  If the pathname begins with "./" (i.e., it is a relative pathname),
          the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to
          the kernel to say that the kernel should execute the target
          process's mkdir(2) call.

       •  If the pathname begins with some other prefix, the supervisor
          spoofs an error return for the target process, so that the target
          process's mkdir(2) call appears to fail with the error EOPNOTSUPP
          ("Operation not supported"). Additionally, if the specified
          pathname is exactly "/bye", then the supervisor terminates.
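
       The three-way decision in the list above can be sketched on its own,
       separately from the seccomp plumbing. The enum and function names
       below (classify_path(), should_terminate(), and the ACT_* constants)
       are hypothetical and serve only to illustrate the prefix tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Possible supervisor reactions to a notified mkdir(2) call;
   hypothetical names for this sketch. */
enum action {
    ACT_CREATE_IN_SUPERVISOR,   /* Pathname starts with "/tmp/" */
    ACT_CONTINUE_IN_TARGET,     /* Pathname starts with "./" */
    ACT_REJECT                  /* Anything else: spoof EOPNOTSUPP */
};

/* Decide how to treat 'path', which must be a null-terminated string. */
static enum action
classify_path(const char *path)
{
    assert(path != NULL);

    if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0)
        return ACT_CREATE_IN_SUPERVISOR;
    if (strncmp(path, "./", strlen("./")) == 0)
        return ACT_CONTINUE_IN_TARGET;
    return ACT_REJECT;
}

/* The pathname "/bye" is rejected like any other unknown prefix, but
   additionally asks the supervisor to stop. */
static bool
should_terminate(const char *path)
{
    return strcmp(path, "/bye") == 0;
}
```

       Note that "/tmp" without a trailing slash does not match the "/tmp/"
       prefix, so it falls into the rejected case.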

       This program can be used to demonstrate various aspects of the behavior
       of the seccomp user-space notification mechanism. To aid such
       demonstrations, the program logs various messages to show the operation
       of the target process (lines prefixed "T:") and the supervisor
       (indented lines prefixed "S:").

       In the following example, the target attempts to create the directory
       /tmp/x. Upon receiving the notification, the supervisor creates the
       directory on the target's behalf, and spoofs a success return to be
       received by the target process's mkdir(2) call.

           $ ./seccomp_unotify /tmp/x
           T: PID = 23168

           T: about to mkdir("/tmp/x")
                   S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
                   S: executing: mkdir("/tmp/x", 0700)
                   S: success! spoofed return = 6
                   S: sending response (flags = 0; val = 6; error = 0)
           T: SUCCESS: mkdir(2) returned 6

           T: terminating
                   S: target has terminated; bye

       In the above output, note that the spoofed return value seen by the
       target process is 6 (the length of the pathname /tmp/x), whereas a
       normal mkdir(2) call returns 0 on success.

       In the next example, the target attempts to create a directory using
       the relative pathname ./sub. Since this pathname starts with "./", the
       supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the
       kernel, and the kernel then (successfully) executes the target
       process's mkdir(2) call.

           $ ./seccomp_unotify ./sub
           T: PID = 23204

           T: about to mkdir("./sub")
                   S: got notification (ID 0xddb16abe25b4c12) for PID 23204
                   S: target can execute system call
                   S: sending response (flags = 0x1; val = 0; error = 0)
           T: SUCCESS: mkdir(2) returned 0

           T: terminating
                   S: target has terminated; bye

       If the target process attempts to create a directory with a pathname
       that begins with neither "./" nor the prefix "/tmp/", then the
       supervisor spoofs an error return (EOPNOTSUPP, "Operation not
       supported") for the target's mkdir(2) call (which is not executed):

           $ ./seccomp_unotify /xxx
           T: PID = 23178

           T: about to mkdir("/xxx")
                   S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
           T: ERROR: mkdir(2): Operation not supported

           T: terminating
                   S: target has terminated; bye

       In the next example, the target process attempts to create a directory
       with the pathname /tmp/nosuchdir/b. Upon receiving the notification,
       the supervisor attempts to create that directory, but the mkdir(2) call
       fails because the directory /tmp/nosuchdir does not exist.
       Consequently, the supervisor spoofs an error return that passes the
       error that it received back to the target process's mkdir(2) call.

           $ ./seccomp_unotify /tmp/nosuchdir/b
           T: PID = 23199

           T: about to mkdir("/tmp/nosuchdir/b")
                   S: got notification (ID 0x8744454293506046) for PID 23199
                   S: executing: mkdir("/tmp/nosuchdir/b", 0700)
                   S: failure! (errno = 2; No such file or directory)
                   S: sending response (flags = 0; val = 0; error = -2)
           T: ERROR: mkdir(2): No such file or directory

           T: terminating
                   S: target has terminated; bye
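
       The way the supervisor turns the outcome of its own mkdir(2) call
       into a spoofed result, as seen in the /tmp/x and /tmp/nosuchdir/b runs
       above, can be sketched as follows. To keep the sketch free of kernel
       header dependencies, it uses a hypothetical 'struct resp_sketch' that
       is assumed to mirror the id, val, error, and flags fields of 'struct
       seccomp_notif_resp'; the function name fill_mkdir_response() is
       likewise an invention for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for 'struct seccomp_notif_resp' (assumed to
   have the same four fields). */
struct resp_sketch {
    uint64_t id;        /* Notification ID being answered */
    int64_t  val;       /* Spoofed success return value */
    int32_t  error;     /* 0, or a negated errno value */
    uint32_t flags;     /* 0 when the result is spoofed */
};

/* Fill in a spoofed response from the result of the supervisor's own
   mkdir() call: 'rc' is that call's return value and 'saved_errno' is
   the errno value captured immediately after a failure. */
static void
fill_mkdir_response(struct resp_sketch *resp, uint64_t id, int rc,
                    int saved_errno, const char *path)
{
    assert(resp != NULL && path != NULL);

    resp->id = id;
    resp->flags = 0;

    if (rc == 0) {
        resp->error = 0;
        resp->val = (int64_t) strlen(path);  /* Deliberately not 0 */
    } else {
        resp->error = -saved_errno;          /* Pass the failure through */
        resp->val = 0;
    }
}
```

       Capturing errno into a local variable before any further library
       calls (such as printf(3)) keeps the value passed back to the target
       from being overwritten.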

       If the supervisor receives a notification and sees that the argument of
       the target's mkdir(2) is the string "/bye", then (as well as spoofing
       an EOPNOTSUPP error), the supervisor terminates. If the target process
       subsequently executes another mkdir(2) that triggers its seccomp filter
       to return the SECCOMP_RET_USER_NOTIF action value, then the kernel
       causes the target process's system call to fail with the error ENOSYS
       ("Function not implemented"). This is demonstrated by the following
       example:

           $ ./seccomp_unotify /bye /tmp/y
           T: PID = 23185

           T: about to mkdir("/bye")
                   S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
                   S: spoofing error response (Operation not supported)
                   S: sending response (flags = 0; val = 0; error = -95)
                   S: terminating **********
           T: ERROR: mkdir(2): Operation not supported

           T: about to mkdir("/tmp/y")
           T: ERROR: mkdir(2): Function not implemented

           T: terminating

   Program source
           #define _GNU_SOURCE
           #include <errno.h>
           #include <fcntl.h>
           #include <limits.h>
           #include <linux/audit.h>
           #include <linux/filter.h>
           #include <linux/seccomp.h>
           #include <signal.h>
           #include <stdbool.h>
           #include <stddef.h>
           #include <stdint.h>
           #include <stdio.h>
           #include <stdlib.h>
           #include <string.h>
           #include <sys/socket.h>
           #include <sys/ioctl.h>
           #include <sys/prctl.h>
           #include <sys/stat.h>
           #include <sys/types.h>
           #include <sys/un.h>
           #include <sys/syscall.h>
           #include <unistd.h>

           #define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                                   } while (0)

           /* Send the file descriptor 'fd' over the connected UNIX domain socket
              'sockfd'. Returns 0 on success, or -1 on error. */

           static int
           sendfd(int sockfd, int fd)
           {
               struct msghdr msgh;
               struct iovec iov;
               int data;
               struct cmsghdr *cmsgp;

               /* Allocate a char array of suitable size to hold the ancillary data.
                  However, since this buffer is in reality a 'struct cmsghdr', use a
                  union to ensure that it is suitably aligned. */
               union {
                   char   buf[CMSG_SPACE(sizeof(int))];
                                   /* Space large enough to hold an 'int' */
                   struct cmsghdr align;
               } controlMsg;

               /* The 'msg_name' field can be used to specify the address of the
                  destination socket when sending a datagram. However, we do not
                  need to use this field because 'sockfd' is a connected socket. */

               msgh.msg_name = NULL;
               msgh.msg_namelen = 0;

               /* On Linux, we must transmit at least one byte of real data in
                  order to send ancillary data. We transmit an arbitrary integer
                  whose value is ignored by recvfd(). */

               msgh.msg_iov = &iov;
               msgh.msg_iovlen = 1;
               iov.iov_base = &data;
               iov.iov_len = sizeof(int);
               data = 12345;

               /* Set 'msghdr' fields that describe ancillary data */

               msgh.msg_control = controlMsg.buf;
               msgh.msg_controllen = sizeof(controlMsg.buf);

               /* Set up ancillary data describing file descriptor to send */

               cmsgp = CMSG_FIRSTHDR(&msgh);
               cmsgp->cmsg_level = SOL_SOCKET;
               cmsgp->cmsg_type = SCM_RIGHTS;
               cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
               memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));

               /* Send real plus ancillary data */

               if (sendmsg(sockfd, &msgh, 0) == -1)
                   return -1;

               return 0;
           }

           /* Receive a file descriptor on a connected UNIX domain socket. Returns
              the received file descriptor on success, or -1 on error. */

           static int
           recvfd(int sockfd)
           {
               struct msghdr msgh;
               struct iovec iov;
               int data, fd;
               ssize_t nr;

               /* Allocate a char buffer for the ancillary data. See the comments
                  in sendfd() */
               union {
                   char   buf[CMSG_SPACE(sizeof(int))];
                   struct cmsghdr align;
               } controlMsg;
               struct cmsghdr *cmsgp;

               /* The 'msg_name' field can be used to obtain the address of the
                  sending socket. However, we do not need this information. */

               msgh.msg_name = NULL;
               msgh.msg_namelen = 0;

               /* Specify buffer for receiving real data */

               msgh.msg_iov = &iov;
               msgh.msg_iovlen = 1;
               iov.iov_base = &data;       /* Real data is an 'int' */
               iov.iov_len = sizeof(int);

               /* Set 'msghdr' fields that describe ancillary data */

               msgh.msg_control = controlMsg.buf;
               msgh.msg_controllen = sizeof(controlMsg.buf);

               /* Receive real plus ancillary data; real data is ignored */

               nr = recvmsg(sockfd, &msgh, 0);
               if (nr == -1)
                   return -1;

               cmsgp = CMSG_FIRSTHDR(&msgh);

               /* Check the validity of the 'cmsghdr' */

               if (cmsgp == NULL ||
                       cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
                       cmsgp->cmsg_level != SOL_SOCKET ||
                       cmsgp->cmsg_type != SCM_RIGHTS) {
                   errno = EINVAL;
                   return -1;
               }

               /* Return the received file descriptor to our caller */

               memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
               return fd;
           }

           static void
           sigchldHandler(int sig)
           {
               char msg[] = "\tS: target has terminated; bye\n";

               write(STDOUT_FILENO, msg, sizeof(msg) - 1);
               _exit(EXIT_SUCCESS);
           }

           static int
           seccomp(unsigned int operation, unsigned int flags, void *args)
           {
               return syscall(__NR_seccomp, operation, flags, args);
           }

           /* The following is the x86-64-specific BPF boilerplate code for checking
              that the BPF program is running on the right architecture + ABI. At
              completion of these instructions, the accumulator contains the system
              call number. */

           /* For the x32 ABI, all system call numbers have bit 30 set */

           #define X32_SYSCALL_BIT         0x40000000

           #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
                   BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                           (offsetof(struct seccomp_data, arch))), \
                   BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
                   BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                           (offsetof(struct seccomp_data, nr))), \
                   BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
                   BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)

           /* installNotifyFilter() installs a seccomp filter that generates
              user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
              calls mkdir(2); the filter allows all other system calls.

              The function return value is a file descriptor from which the
              user-space notifications can be fetched. */

           static int
           installNotifyFilter(void)
           {
               struct sock_filter filter[] = {
                   X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,

                   /* mkdir() triggers notification to user-space supervisor */

                   BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
                   BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),

                   /* Every other system call is allowed */

                   BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
               };

               struct sock_fprog prog = {
                   .len = sizeof(filter) / sizeof(filter[0]),
                   .filter = filter,
               };

               /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
                  as a result, seccomp() returns a notification file descriptor. */

               int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
                                      SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
               if (notifyFd == -1)
                   errExit("seccomp-install-notify-filter");

               return notifyFd;
           }

           /* Close a pair of sockets created by socketpair() */

           static void
           closeSocketPair(int sockPair[2])
           {
               if (close(sockPair[0]) == -1)
                   errExit("closeSocketPair-close-0");
               if (close(sockPair[1]) == -1)
                   errExit("closeSocketPair-close-1");
           }

           /* Implementation of the target process; create a child process that:

              (1) installs a seccomp filter with the
                  SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
              (2) writes the seccomp notification file descriptor returned from
                  the previous step onto the UNIX domain socket, 'sockPair[0]';
              (3) calls mkdir(2) for each element of 'argv'.

              The function return value in the parent is the PID of the child
              process; the child does not return from this function. */

           static pid_t
           targetProcess(int sockPair[2], char *argv[])
           {
               pid_t targetPid = fork();
               if (targetPid == -1)
                   errExit("fork");

               if (targetPid > 0)          /* In parent, return PID of child */
                   return targetPid;

               /* Child falls through to here */

               printf("T: PID = %ld\n", (long) getpid());

               /* Install seccomp filter(s) */

               if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
                   errExit("prctl");

               int notifyFd = installNotifyFilter();

               /* Pass the notification file descriptor to the supervisor process
                  over a UNIX domain socket */

               if (sendfd(sockPair[0], notifyFd) == -1)
                   errExit("sendfd");

               /* Notification and socket FDs are no longer needed in target */

               if (close(notifyFd) == -1)
                   errExit("close-target-notify-fd");

               closeSocketPair(sockPair);

               /* Perform a mkdir() call for each of the command-line arguments */

               for (char **ap = argv; *ap != NULL; ap++) {
                   printf("\nT: about to mkdir(\"%s\")\n", *ap);

                   int s = mkdir(*ap, 0700);
                   if (s == -1)
                       perror("T: ERROR: mkdir(2)");
                   else
                       printf("T: SUCCESS: mkdir(2) returned %d\n", s);
               }

               printf("\nT: terminating\n");
               exit(EXIT_SUCCESS);
           }

           /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
              operation is still valid. It will no longer be valid if the target
              process has terminated or is no longer blocked in the system call that
              generated the notification (because it was interrupted by a signal).

              This operation can be used when doing such things as accessing
              /proc/PID files in the target process in order to avoid TOCTOU race
              conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
              terminates and is reused by another process. */

           static bool
           cookieIsValid(int notifyFd, uint64_t id)
           {
               return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
           }

           /* Access the memory of the target process in order to fetch the
              pathname referred to by the system call argument 'argNum' in
              'req->data.args[]'. The pathname is returned in 'path',
              a buffer of 'len' bytes allocated by the caller.

              Returns true if the pathname is successfully fetched, and false
              otherwise. For possible causes of failure, see the comments below. */

           static bool
           getTargetPathname(struct seccomp_notif *req, int notifyFd,
                             int argNum, char *path, size_t len)
           {
               char procMemPath[PATH_MAX];

               snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);

               int procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
               if (procMemFd == -1)
                   return false;

               /* Check that the process whose info we are accessing is still alive
                  and blocked in the system call that caused the notification.
                  If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
                  cookieIsValid()) succeeded, we know that the /proc/PID/mem file
                  descriptor that we opened corresponded to the process for which we
                  received a notification. If that process subsequently terminates,
                  then read() on that file descriptor will return 0 (EOF). */

               if (!cookieIsValid(notifyFd, req->id)) {
                   close(procMemFd);
                   return false;
               }

               /* Read bytes at the location containing the pathname argument */

               ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);

               close(procMemFd);

               if (nread <= 0)
                   return false;

               /* Once again check that the notification ID is still valid. The
                  case we are particularly concerned about here is that just
                  before we fetched the pathname, the target's blocked system
                  call was interrupted by a signal handler, and after the handler
                  returned, the target carried on execution (past the interrupted
                  system call). In that case, we have no guarantees about what we
                  are reading, since the target's memory may have been arbitrarily
                  changed by subsequent operations. */

               if (!cookieIsValid(notifyFd, req->id)) {
                   perror("\tS: notification ID check failed!!!");
                   return false;
               }

               /* Even if the target's system call was not interrupted by a signal,
                  we have no guarantees about what was in the memory of the target
                  process. (The memory may have been modified by another thread, or
                  even by an external attacking process.) We therefore treat the
                  buffer returned by pread() as untrusted input. The buffer should
                  contain a terminating null byte; if not, then we will trigger an
                  error for the target process. */

               if (strnlen(path, nread) < nread)
                   return true;

               return false;
           }

           /* Allocate buffers for the seccomp user-space notification request and
              response structures. It is the caller's responsibility to free the
              buffers returned via 'req' and 'resp'. */

           static void
           allocSeccompNotifBuffers(struct seccomp_notif **req,
                                    struct seccomp_notif_resp **resp,
                                    struct seccomp_notif_sizes *sizes)
           {
               /* Discover the sizes of the structures that are used to receive
                  notifications and send notification responses, and allocate
                  buffers of those sizes. */

               if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
                   errExit("seccomp-SECCOMP_GET_NOTIF_SIZES");

               *req = malloc(sizes->seccomp_notif);
               if (*req == NULL)
                   errExit("malloc-seccomp_notif");

               /* When allocating the response buffer, we must allow for the fact
                  that the user-space binary may have been built with user-space
                  headers where 'struct seccomp_notif_resp' is bigger than the
                  response buffer expected by the (older) kernel. Therefore, we
                  allocate a buffer that is the maximum of the two sizes. This
                  ensures that if the supervisor places bytes into the response
                  structure that are past the response size that the kernel expects,
                  then the supervisor is not touching an invalid memory location. */

               size_t resp_size = sizes->seccomp_notif_resp;
               if (sizeof(struct seccomp_notif_resp) > resp_size)
                   resp_size = sizeof(struct seccomp_notif_resp);

               *resp = malloc(resp_size);
               if (*resp == NULL)          /* Test the buffer, not the out-pointer */
                   errExit("malloc-seccomp_notif_resp");
           }

           /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
              descriptor, 'notifyFd'. */

           static void
           handleNotifications(int notifyFd)
           {
               struct seccomp_notif_sizes sizes;
               struct seccomp_notif *req;
               struct seccomp_notif_resp *resp;
               char path[PATH_MAX];

               allocSeccompNotifBuffers(&req, &resp, &sizes);

               /* Loop handling notifications */

               for (;;) {

                   /* Wait for next notification, returning info in '*req' */

                   memset(req, 0, sizes.seccomp_notif);
                   if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
                       if (errno == EINTR)
                           continue;
                       errExit("\tS: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
                   }

                   printf("\tS: got notification (ID %#llx) for PID %d\n",
                           req->id, req->pid);

                   /* The only system call that can generate a notification event
                      is mkdir(2). Nevertheless, we check that the notified system
                      call is indeed mkdir() as kind of future-proofing of this
                      code in case the seccomp filter is later modified to
                      generate notifications for other system calls. */

                   if (req->data.nr != __NR_mkdir) {
                       printf("\tS: notification contained unexpected "
                               "system call number; bye!!!\n");
                       exit(EXIT_FAILURE);
                   }

                   bool pathOK = getTargetPathname(req, notifyFd, 0, path,
                                                   sizeof(path));

                   /* Prepopulate some fields of the response */

                   resp->id = req->id;     /* Response includes notification ID */
                   resp->flags = 0;
                   resp->val = 0;

                   /* If getTargetPathname() failed, trigger an EINVAL error
                      response (sending this response may yield an error if the
                      failure occurred because the notification ID was no longer
                      valid); if the directory is in /tmp, then create it on behalf
                      of the target; if the pathname starts with "./", tell the
                      kernel to let the target process execute the mkdir();
                      otherwise, give an error for a directory pathname in any other
                      location. */

                   if (!pathOK) {
                       resp->error = -EINVAL;
                       printf("\tS: spoofing error for invalid pathname (%s)\n",
                               strerror(-resp->error));
                   } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
                       printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                               path, req->data.args[1]);

                       if (mkdir(path, req->data.args[1]) == 0) {
                           resp->error = 0;            /* "Success" */
                           resp->val = strlen(path);   /* Used as return value of
                                                          mkdir() in target */
                           printf("\tS: success! spoofed return = %lld\n",
                                   resp->val);
                       } else {

                           /* If mkdir() failed in the supervisor, pass the error
                              back to the target */

                           resp->error = -errno;
                           printf("\tS: failure! (errno = %d; %s)\n", errno,
                                   strerror(errno));
                       }
                   } else if (strncmp(path, "./", strlen("./")) == 0) {
                       resp->error = resp->val = 0;
                       resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
                       printf("\tS: target can execute system call\n");
                   } else {
                       resp->error = -EOPNOTSUPP;
                       printf("\tS: spoofing error response (%s)\n",
                               strerror(-resp->error));
                   }

                   /* Send a response to the notification */

                   printf("\tS: sending response "
                           "(flags = %#x; val = %lld; error = %d)\n",
                           resp->flags, resp->val, resp->error);

                   if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
                       if (errno == ENOENT)
                           printf("\tS: response failed with ENOENT; "
                                   "perhaps target process's syscall was "
                                   "interrupted by a signal?\n");
                       else
                           perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
                   }

                   /* If the pathname is just "/bye", then the supervisor breaks out
                      of the loop and terminates. This allows us to see what happens
                      if the target process makes further calls to mkdir(2). The
                      contents of 'path' are examined only if the pathname was
                      successfully fetched. */

                   if (pathOK && strcmp(path, "/bye") == 0)
                       break;
               }

               free(req);
               free(resp);
               printf("\tS: terminating **********\n");
               exit(EXIT_FAILURE);
           }

           /* Implementation of the supervisor process:

              (1) obtains the notification file descriptor from 'sockPair[1]'
              (2) handles notifications that arrive on that file descriptor. */

           static void
           supervisor(int sockPair[2])
           {
               int notifyFd = recvfd(sockPair[1]);
               if (notifyFd == -1)
                   errExit("recvfd");

               closeSocketPair(sockPair);  /* We no longer need the socket pair */

               handleNotifications(notifyFd);
           }

           int
           main(int argc, char *argv[])
           {
               int sockPair[2];

               setbuf(stdout, NULL);

               if (argc < 2) {
                   fprintf(stderr, "At least one pathname argument is required\n");
                   exit(EXIT_FAILURE);
               }

               /* Create a UNIX domain socket that is used to pass the seccomp
                  notification file descriptor from the target process to the
                  supervisor process. */

               if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
                   errExit("socketpair");

               /* Create a child process--the "target"--that installs seccomp
                  filtering. The target process writes the seccomp notification
                  file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
                  each directory in the command-line arguments. */

               (void) targetProcess(sockPair, &argv[optind]);

               /* Catch SIGCHLD when the target terminates, so that the
                  supervisor can also terminate. */

               struct sigaction sa;
               sa.sa_handler = sigchldHandler;
               sa.sa_flags = 0;
               sigemptyset(&sa.sa_mask);
               if (sigaction(SIGCHLD, &sa, NULL) == -1)
                   errExit("sigaction");

               supervisor(sockPair);

               exit(EXIT_SUCCESS);
           }
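
       The final check in getTargetPathname(), which treats the bytes read
       from the target's memory as untrusted and accepts them only if they
       contain a terminating null byte, can be isolated as shown below. The
       helper name buffer_holds_string() is hypothetical, and the sketch uses
       memchr(3) rather than strnlen(3); the two tests are intended to be
       equivalent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the first 'nread' bytes of 'buf' include a terminating
   null byte, so that 'buf' can safely be used as a C string. A buffer
   that is completely filled with non-null bytes (or an empty read) is
   rejected. */
static bool
buffer_holds_string(const char *buf, size_t nread)
{
    assert(buf != NULL);

    if (nread == 0)
        return false;

    return memchr(buf, '\0', nread) != NULL;
}
```

       Only after such a check succeeds is it safe to pass the buffer to
       functions such as strncmp(3), strlen(3), or mkdir(2), which expect a
       null-terminated string.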

SEE ALSO
       ioctl(2), pidfd_open(2), pidfd_getfd(2), seccomp(2)

       A further example program can be found in the kernel source file
       samples/seccomp/user-trap.c.

COLOPHON
       This page is part of release 5.12 of the Linux man-pages project. A
       description of the project, information about reporting bugs, and the
       latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.

Linux                             2021-06-20                SECCOMP_UNOTIFY(2)