SECCOMP_UNOTIFY(2)        Linux Programmer's Manual        SECCOMP_UNOTIFY(2)

NAME
seccomp_unotify - Seccomp user-space notification mechanism

SYNOPSIS
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>

int seccomp(unsigned int operation, unsigned int flags, void *args);

#include <sys/ioctl.h>

int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
          struct seccomp_notif *req);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
          struct seccomp_notif_resp *resp);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ADDFD,
          struct seccomp_notif_addfd *addfd);

DESCRIPTION
This page describes the user-space notification mechanism provided by
the Secure Computing (seccomp) facility. As well as the use of the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SECCOMP_RET_USER_NOTIF
action value, and the SECCOMP_GET_NOTIF_SIZES operation described in
seccomp(2), this mechanism involves the use of a number of related
ioctl(2) operations (described below).
32
33 Overview
In conventional usage of a seccomp filter, the decision about how to
treat a system call is made by the filter itself. By contrast, the
user-space notification mechanism allows the seccomp filter to delegate
the handling of the system call to another user-space process. Note
that this mechanism is explicitly not intended as a method for
implementing security policy; see NOTES.
40
41 In the discussion that follows, the thread(s) on which the seccomp fil‐
42 ter is installed is (are) referred to as the target, and the process
43 that is notified by the user-space notification mechanism is referred
44 to as the supervisor.
45
46 A suitably privileged supervisor can use the user-space notification
47 mechanism to perform actions on behalf of the target. The advantage of
48 the user-space notification mechanism is that the supervisor will usu‐
49 ally be able to retrieve information about the target and the performed
50 system call that the seccomp filter itself cannot. (A seccomp filter
51 is limited in the information it can obtain and the actions that it can
52 perform because it is running on a virtual machine inside the kernel.)
53
54 An overview of the steps performed by the target and the supervisor is
55 as follows:
56
57 1. The target establishes a seccomp filter in the usual manner, but
58 with two differences:
59
60 • The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
61 TER_FLAG_NEW_LISTENER. Consequently, the return value of the
62 (successful) seccomp(2) call is a new "listening" file descriptor
63 that can be used to receive notifications. Only one "listening"
64 seccomp filter can be installed for a thread.
65
66 • In cases where it is appropriate, the seccomp filter returns the
67 action value SECCOMP_RET_USER_NOTIF. This return value will trig‐
68 ger a notification event.
69
70 2. In order that the supervisor can obtain notifications using the lis‐
71 tening file descriptor, (a duplicate of) that file descriptor must
72 be passed from the target to the supervisor. One way in which this
73 could be done is by passing the file descriptor over a UNIX domain
74 socket connection between the target and the supervisor (using the
75 SCM_RIGHTS ancillary message type described in unix(7)). Another
76 way to do this is through the use of pidfd_getfd(2).
77
78 3. The supervisor will receive notification events on the listening
79 file descriptor. These events are returned as structures of type
80 seccomp_notif. Because this structure and its size may evolve over
81 kernel versions, the supervisor must first determine the size of
this structure using the seccomp(2) SECCOMP_GET_NOTIF_SIZES
operation, which returns a structure of type seccomp_notif_sizes. The
supervisor allocates a buffer of size
seccomp_notif_sizes.seccomp_notif bytes to receive notification
events. In addition, the supervisor allocates another buffer of size
seccomp_notif_sizes.seccomp_notif_resp bytes for the response (a
struct seccomp_notif_resp structure) that it will provide to the
kernel (and thus the target).
89
90 4. The target then performs its workload, which includes system calls
91 that will be controlled by the seccomp filter. Whenever one of
92 these system calls causes the filter to return the SEC‐
93 COMP_RET_USER_NOTIF action value, the kernel does not (yet) execute
94 the system call; instead, execution of the target is temporarily
95 blocked inside the kernel (in a sleep state that is interruptible by
96 signals) and a notification event is generated on the listening file
97 descriptor.
98
99 5. The supervisor can now repeatedly monitor the listening file de‐
100 scriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do this,
101 the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation
102 to read information about a notification event; this operation
103 blocks until an event is available. The operation returns a sec‐
104 comp_notif structure containing information about the system call
105 that is being attempted by the target. (As described in NOTES, the
106 file descriptor can also be monitored with select(2), poll(2), or
107 epoll(7).)
108
109 6. The seccomp_notif structure returned by the SECCOMP_IOCTL_NOTIF_RECV
110 operation includes the same information (a seccomp_data structure)
111 that was passed to the seccomp filter. This information allows the
112 supervisor to discover the system call number and the arguments for
113 the target's system call. In addition, the notification event con‐
114 tains the ID of the thread that triggered the notification and a
115 unique cookie value that is used in subsequent SECCOMP_IOCTL_NO‐
116 TIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND operations.
117
118 The information in the notification can be used to discover the val‐
119 ues of pointer arguments for the target's system call. (This is
120 something that can't be done from within a seccomp filter.) One way
121 in which the supervisor can do this is to open the corresponding
122 /proc/[tid]/mem file (see proc(5)) and read bytes from the location
123 that corresponds to one of the pointer arguments whose value is sup‐
124 plied in the notification event. (The supervisor must be careful to
125 avoid a race condition that can occur when doing this; see the de‐
126 scription of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation be‐
127 low.) In addition, the supervisor can access other system informa‐
128 tion that is visible in user space but which is not accessible from
129 a seccomp filter.
130
131 7. Having obtained information as per the previous step, the supervisor
132 may then choose to perform an action in response to the target's
133 system call (which, as noted above, is not executed when the seccomp
134 filter returns the SECCOMP_RET_USER_NOTIF action value).
135
136 One example use case here relates to containers. The target may be
137 located inside a container where it does not have sufficient capa‐
138 bilities to mount a filesystem in the container's mount namespace.
139 However, the supervisor may be a more privileged process that does
140 have sufficient capabilities to perform the mount operation.
141
142 8. The supervisor then sends a response to the notification. The in‐
143 formation in this response is used by the kernel to construct a re‐
144 turn value for the target's system call and provide a value that
145 will be assigned to the errno variable of the target.
146
The response is sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2)
operation, which is used to transmit a seccomp_notif_resp structure to
the kernel. This structure must include the cookie value that the
supervisor obtained in the seccomp_notif structure returned by the
SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the kernel to
associate the response with the target.
157
9. Once the notification response has been sent, the system call in the
target thread unblocks, returning the information that was provided by
the supervisor in the notification response.
161
162 As a variation on the last two steps, the supervisor can send a re‐
163 sponse that tells the kernel that it should execute the target thread's
164 system call; see the discussion of SECCOMP_USER_NOTIF_FLAG_CONTINUE,
165 below.
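Step 1 above can be sketched as follows. This is a minimal
illustration, not a complete filter: the function names and the
NOTIFY_FILTER_LEN constant are hypothetical, the filter traps a single
system call number, and there is no glibc wrapper assumed for
seccomp(2), so the call is made via syscall(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define NOTIFY_FILTER_LEN 7     /* Hypothetical name: instruction count */

/* Fill 'f' with a filter that returns SECCOMP_RET_USER_NOTIF for one
   system call number and SECCOMP_RET_ALLOW for everything else.
   Returns the number of instructions written. */
unsigned short
build_notify_filter(struct sock_filter *f, unsigned int syscall_nr,
                    unsigned int audit_arch)
{
    assert(f != NULL);

    /* Load the architecture; kill the process if it is unexpected */
    f[0] = (struct sock_filter) BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                offsetof(struct seccomp_data, arch));
    f[1] = (struct sock_filter) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                audit_arch, 1, 0);
    f[2] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
                                SECCOMP_RET_KILL_PROCESS);

    /* Load the system call number; notify on a match, else allow */
    f[3] = (struct sock_filter) BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                offsetof(struct seccomp_data, nr));
    f[4] = (struct sock_filter) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                syscall_nr, 0, 1);
    f[5] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
                                SECCOMP_RET_USER_NOTIF);
    f[6] = (struct sock_filter) BPF_STMT(BPF_RET | BPF_K,
                                SECCOMP_RET_ALLOW);

    return NOTIFY_FILTER_LEN;
}

/* Install the filter on the calling thread. On success, returns the
   listening file descriptor; on failure, returns -1 with errno set. */
int
install_notify_filter(unsigned int syscall_nr, unsigned int audit_arch)
{
    struct sock_filter insns[NOTIFY_FILTER_LEN];
    struct sock_fprog prog;

    prog.len = build_notify_filter(insns, syscall_nr, audit_arch);
    prog.filter = insns;

    /* Needed unless the caller has CAP_SYS_ADMIN; see seccomp(2) */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

On x86-64, a target might call install_notify_filter(SYS_mkdir,
AUDIT_ARCH_X86_64) and then pass the returned file descriptor to the
supervisor as described in step 2.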

IOCTL OPERATIONS
168 The following ioctl(2) operations are supported by the seccomp user-
169 space notification file descriptor. For each of these operations, the
170 first (file descriptor) argument of ioctl(2) is the listening file de‐
171 scriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
172 TER_FLAG_NEW_LISTENER flag.
173
174 SECCOMP_IOCTL_NOTIF_RECV
175 The SECCOMP_IOCTL_NOTIF_RECV operation (available since Linux 5.0) is
176 used to obtain a user-space notification event. If no such event is
177 currently pending, the operation blocks until an event occurs. The
178 third ioctl(2) argument is a pointer to a structure of the following
179 form which contains information about the event. This structure must
180 be zeroed out before the call.
181
struct seccomp_notif {
    __u64  id;                  /* Cookie */
    __u32  pid;                 /* TID of target thread */
    __u32  flags;               /* Currently unused (0) */
    struct seccomp_data data;   /* See seccomp(2) */
};
188
189 The fields in this structure are as follows:
190
191 id This is a cookie for the notification. Each such cookie is
192 guaranteed to be unique for the corresponding seccomp filter.
193
194 • The cookie can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID
195 ioctl(2) operation described below.
196
197 • When returning a notification response to the kernel, the su‐
198 pervisor must include the cookie value in the seccomp_no‐
199 tif_resp structure that is specified as the argument of the
200 SECCOMP_IOCTL_NOTIF_SEND operation.
201
202 pid This is the thread ID of the target thread that triggered the
203 notification event.
204
205 flags This is a bit mask of flags providing further information on the
206 event. In the current implementation, this field is always
207 zero.
208
209 data This is a seccomp_data structure containing information about
210 the system call that triggered the notification. This is the
211 same structure that is passed to the seccomp filter. See sec‐
212 comp(2) for details of this structure.
213
214 On success, this operation returns 0; on failure, -1 is returned, and
215 errno is set to indicate the cause of the error. This operation can
216 fail with the following errors:
217
218 EINVAL (since Linux 5.5)
219 The seccomp_notif structure that was passed to the call con‐
220 tained nonzero fields.
221
222 ENOENT The target thread was killed by a signal as the notification in‐
223 formation was being generated, or the target's (blocked) system
224 call was interrupted by a signal handler.
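The following sketch shows one way a supervisor might size its request
buffer and fetch a notification. The function names are hypothetical.
The buffer is zeroed before every SECCOMP_IOCTL_NOTIF_RECV call, as
required above, and the allocation uses the larger of the size reported
by the kernel and the size known at compile time.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Allocate a zeroed request buffer sized according to
   SECCOMP_GET_NOTIF_SIZES. The size is stored in *req_size.
   Returns NULL on failure. */
struct seccomp_notif *
alloc_notif_req(size_t *req_size)
{
    struct seccomp_notif_sizes sizes;
    size_t size;

    assert(req_size != NULL);

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
        return NULL;

    size = sizes.seccomp_notif;
    if (size < sizeof(struct seccomp_notif))
        size = sizeof(struct seccomp_notif);

    *req_size = size;
    return calloc(1, size);
}

/* Fetch one notification into 'req'. Returns 0 on success, or -1 with
   errno set. An ENOENT failure means that the target was killed or
   interrupted; the caller can simply call this function again. */
int
recv_notification(int notify_fd, struct seccomp_notif *req,
                  size_t req_size)
{
    assert(req != NULL && req_size >= sizeof(*req));

    for (;;) {
        memset(req, 0, req_size);   /* Must be zeroed before each call */

        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
            return 0;

        if (errno != EINTR)         /* EINTR: supervisor got a signal */
            return -1;
    }
}
```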
225
226 SECCOMP_IOCTL_NOTIF_ID_VALID
227 The SECCOMP_IOCTL_NOTIF_ID_VALID operation (available since Linux 5.0)
228 is used to check that a notification ID returned by an earlier SEC‐
229 COMP_IOCTL_NOTIF_RECV operation is still valid (i.e., that the target
230 still exists and its system call is still blocked waiting for a re‐
231 sponse).
232
233 The third ioctl(2) argument is a pointer to the cookie (id) returned by
234 the SECCOMP_IOCTL_NOTIF_RECV operation.
235
236 This operation is necessary to avoid race conditions that can occur
237 when the pid returned by the SECCOMP_IOCTL_NOTIF_RECV operation termi‐
238 nates, and that process ID is reused by another process. An example of
this kind of race is the following:
240
241 1. A notification is generated on the listening file descriptor. The
242 returned seccomp_notif contains the TID of the target thread (in the
243 pid field of the structure).
244
245 2. The target terminates.
246
247 3. Another thread or process is created on the system that by chance
248 reuses the TID that was freed when the target terminated.
249
4. The supervisor open(2)s the /proc/[tid]/mem file for the TID
obtained in step 1, with the intention of (say) inspecting the memory
location(s) that contain the argument(s) of the system call that
triggered the notification in step 1.
254
255 In the above scenario, the risk is that the supervisor may try to ac‐
256 cess the memory of a process other than the target. This race can be
257 avoided by following the call to open(2) with a SECCOMP_IOCTL_NO‐
258 TIF_ID_VALID operation to verify that the process that generated the
259 notification is still alive. (Note that if the target terminates after
260 the latter step, a subsequent read(2) from the file descriptor may re‐
261 turn 0, indicating end of file.)
262
263 See NOTES for a discussion of other cases where SECCOMP_IOCTL_NO‐
264 TIF_ID_VALID checks must be performed.
265
266 On success (i.e., the notification ID is still valid), this operation
267 returns 0. On failure (i.e., the notification ID is no longer valid),
268 -1 is returned, and errno is set to ENOENT.
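The open-then-check sequence described above might look like the
following sketch. The function names are hypothetical; the only
interfaces assumed are open(2) on /proc/[tid]/mem and the
SECCOMP_IOCTL_NOTIF_ID_VALID operation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Returns 1 if the notification ID is still valid, 0 otherwise */
int
notif_id_valid(int notify_fd, __u64 id)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
}

/* Open /proc/[tid]/mem for the thread named in the notification and
   then confirm that the notification is still valid, so that the
   returned descriptor is known to refer to the target rather than to
   a task that recycled the TID. Returns a file descriptor, or -1. */
int
open_target_mem(int notify_fd, const struct seccomp_notif *req)
{
    char path[64];
    int mem_fd;

    assert(req != NULL);

    snprintf(path, sizeof(path), "/proc/%u/mem", req->pid);

    mem_fd = open(path, O_RDONLY | O_CLOEXEC);
    if (mem_fd == -1)
        return -1;

    /* The check must come after the open(2), not before it */
    if (!notif_id_valid(notify_fd, req->id)) {
        close(mem_fd);
        return -1;
    }

    return mem_fd;
}
```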
269
270 SECCOMP_IOCTL_NOTIF_SEND
The SECCOMP_IOCTL_NOTIF_SEND operation (available since Linux 5.0) is
used to send a notification response back to the kernel. The third
ioctl(2) argument is a pointer to a structure of the following form:
275
struct seccomp_notif_resp {
    __u64 id;       /* Cookie value */
    __s64 val;      /* Success return value */
    __s32 error;    /* 0 (success) or negative error number */
    __u32 flags;    /* See below */
};
282
283 The fields of this structure are as follows:
284
285 id This is the cookie value that was obtained using the SEC‐
286 COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
287 kernel to correctly associate this response with the system call
288 that triggered the user-space notification.
289
290 val This is the value that will be used for a spoofed success return
291 for the target's system call; see below.
292
293 error This is the value that will be used as the error number (errno)
294 for a spoofed error return for the target's system call; see be‐
295 low.
296
297 flags This is a bit mask that includes zero or more of the following
298 flags:
299
300 SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
301 Tell the kernel to execute the target's system call.
302
303 Two kinds of response are possible:
304
305 • A response to the kernel telling it to execute the target's system
306 call. In this case, the flags field includes SECCOMP_USER_NO‐
307 TIF_FLAG_CONTINUE and the error and val fields must be zero.
308
309 This kind of response can be useful in cases where the supervisor
310 needs to do deeper analysis of the target's system call than is pos‐
311 sible from a seccomp filter (e.g., examining the values of pointer
312 arguments), and, having decided that the system call does not require
313 emulation by the supervisor, the supervisor wants the system call to
314 be executed normally in the target.
315
316 The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used with cau‐
317 tion; see NOTES.
318
319 • A spoofed return value for the target's system call. In this case,
320 the kernel does not execute the target's system call, instead causing
321 the system call to return a spoofed value as specified by fields of
322 the seccomp_notif_resp structure. The supervisor should set the
323 fields of this structure as follows:
324
325 + flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.
326
327 + error is set either to 0 for a spoofed "success" return or to a
328 negative error number for a spoofed "failure" return. In the for‐
329 mer case, the kernel causes the target's system call to return the
330 value specified in the val field. In the latter case, the kernel
331 causes the target's system call to return -1, and errno is as‐
332 signed the negated error value.
333
334 + val is set to a value that will be used as the return value for a
335 spoofed "success" return for the target's system call. The value
336 in this field is ignored if the error field contains a nonzero
337 value.
338
339 On success, this operation returns 0; on failure, -1 is returned, and
340 errno is set to indicate the cause of the error. This operation can
341 fail with the following errors:
342
343 EINPROGRESS
344 A response to this notification has already been sent.
345
346 EINVAL An invalid value was specified in the flags field.
347
348 EINVAL The flags field contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and
349 the error or val field was not zero.
350
351 ENOENT The blocked system call in the target has been interrupted by a
352 signal handler or the target has terminated.
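The two kinds of response described above can be constructed as in the
following sketch. The helper names are hypothetical. Note that the
error field carries a negative error number, and that a "continue"
response leaves both error and val as zero. The fallback definition of
SECCOMP_USER_NOTIF_FLAG_CONTINUE is only for building against headers
older than Linux 5.5.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Spoofed "success": the target's system call returns 'val' */
void
resp_success(struct seccomp_notif_resp *resp, __u64 id, __s64 val)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Spoofed "failure": the target's system call returns -1 with errno
   set to 'err' (a positive error number such as EPERM) */
void
resp_failure(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    assert(resp != NULL && err > 0);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;         /* The kernel expects a negative value */
}

/* Tell the kernel to execute the target's system call normally */
void
resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));     /* error and val must be 0 */
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Returns 0 on success, or -1 with errno set */
int
send_resp(int notify_fd, struct seccomp_notif_resp *resp)
{
    assert(resp != NULL);
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```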
353
354 SECCOMP_IOCTL_NOTIF_ADDFD
355 The SECCOMP_IOCTL_NOTIF_ADDFD operation (available since Linux 5.9) al‐
356 lows the supervisor to install a file descriptor into the target's file
357 descriptor table. Much like the use of SCM_RIGHTS messages described
358 in unix(7), this operation is semantically equivalent to duplicating a
359 file descriptor from the supervisor's file descriptor table into the
360 target's file descriptor table.
361
362 The SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to emu‐
363 late a target system call (such as socket(2) or openat(2)) that gener‐
364 ates a file descriptor. The supervisor can perform the system call
365 that generates the file descriptor (and associated open file descrip‐
366 tion) and then use this operation to allocate a file descriptor that
367 refers to the same open file description in the target. (For an expla‐
368 nation of open file descriptions, see open(2).)
369
370 Once this operation has been performed, the supervisor can close its
371 copy of the file descriptor.
372
373 In the target, the received file descriptor is subject to the same
374 Linux Security Module (LSM) checks as are applied to a file descriptor
375 that is received in an SCM_RIGHTS ancillary message. If the file de‐
376 scriptor refers to a socket, it inherits the cgroup version 1 network
377 controller settings (classid and netprioidx) of the target.
378
379 The third ioctl(2) argument is a pointer to a structure of the follow‐
380 ing form:
381
struct seccomp_notif_addfd {
    __u64 id;           /* Cookie value */
    __u32 flags;        /* Flags */
    __u32 srcfd;        /* Local file descriptor number */
    __u32 newfd;        /* 0 or desired file descriptor
                           number in target */
    __u32 newfd_flags;  /* Flags to set on target file
                           descriptor */
};
391
392 The fields in this structure are as follows:
393
394 id This field should be set to the notification ID (cookie value)
395 that was obtained via SECCOMP_IOCTL_NOTIF_RECV.
396
flags This field is a bit mask of flags that modify the behavior of
the operation. The following flags are supported:
399
400 SECCOMP_ADDFD_FLAG_SETFD
401 When allocating the file descriptor in the target, use
402 the file descriptor number specified in the newfd field.
403
404 SECCOMP_ADDFD_FLAG_SEND (since Linux 5.14)
405 Perform the equivalent of SECCOMP_IOCTL_NOTIF_ADDFD plus
406 SECCOMP_IOCTL_NOTIF_SEND as an atomic operation. On suc‐
407 cessful invocation, the target process's errno will be 0
408 and the return value will be the file descriptor number
409 that was allocated in the target. If allocating the file
410 descriptor in the target fails, the target's system call
411 continues to be blocked until a successful response is
412 sent.
413
414 srcfd This field should be set to the number of the file descriptor in
415 the supervisor that is to be duplicated.
416
417 newfd This field determines which file descriptor number is allocated
418 in the target. If the SECCOMP_ADDFD_FLAG_SETFD flag is set,
419 then this field specifies which file descriptor number should be
420 allocated. If this file descriptor number is already open in
421 the target, it is atomically closed and reused. If the descrip‐
422 tor duplication fails due to an LSM check, or if srcfd is not a
423 valid file descriptor, the file descriptor newfd will not be
424 closed in the target process.
425
If the SECCOMP_ADDFD_FLAG_SETFD flag is not set, then this field
must be 0, and the kernel allocates the lowest unused file
descriptor number in the target.
429
430 newfd_flags
431 This field is a bit mask specifying flags that should be set on
432 the file descriptor that is received in the target process.
433 Currently, only the following flag is implemented:
434
435 O_CLOEXEC
436 Set the close-on-exec flag on the received file descrip‐
437 tor.
438
439 On success, this ioctl(2) call returns the number of the file descrip‐
440 tor that was allocated in the target. Assuming that the emulated sys‐
441 tem call is one that returns a file descriptor as its function result
442 (e.g., socket(2)), this value can be used as the return value
443 (resp.val) that is supplied in the response that is subsequently sent
444 with the SECCOMP_IOCTL_NOTIF_SEND operation.
445
446 On error, -1 is returned and errno is set to indicate the cause of the
447 error.
448
449 This operation can fail with the following errors:
450
451 EBADF Allocating the file descriptor in the target would cause the
452 target's RLIMIT_NOFILE limit to be exceeded (see getrlimit(2)).
453
EBUSY If the flag SECCOMP_ADDFD_FLAG_SEND is used, this means the
operation can't proceed until other SECCOMP_IOCTL_NOTIF_ADDFD
requests are processed.
457
458 EINPROGRESS
459 The user-space notification specified in the id field exists but
460 has not yet been fetched (by a SECCOMP_IOCTL_NOTIF_RECV) or has
461 already been responded to (by a SECCOMP_IOCTL_NOTIF_SEND).
462
463 EINVAL An invalid flag was specified in the flags or newfd_flags field,
464 or the newfd field is nonzero and the SECCOMP_ADDFD_FLAG_SETFD
465 flag was not specified in the flags field.
466
467 EMFILE The file descriptor number specified in newfd exceeds the limit
468 specified in /proc/sys/fs/nr_open.
469
470 ENOENT The blocked system call in the target has been interrupted by a
471 signal handler or the target has terminated.
472
Here is some sample code (with error handling omitted) that uses the
SECCOMP_IOCTL_NOTIF_ADDFD operation (here, to emulate a call to
openat(2)):
476
int fd, targetFd;

/* Perform the open on the target's behalf; 'path' is assumed to
   have been read from the target's memory (code omitted) */
fd = openat(req->data.args[0], path, req->data.args[2],
            req->data.args[3]);

struct seccomp_notif_addfd addfd;
addfd.id = req->id;     /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
addfd.srcfd = fd;
addfd.newfd = 0;
addfd.flags = 0;
addfd.newfd_flags = O_CLOEXEC;

/* Returns the file descriptor number allocated in the target */
targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);

close(fd);              /* No longer needed in supervisor */

struct seccomp_notif_resp *resp;
/* Code to allocate 'resp' omitted */
resp->id = req->id;
resp->error = 0;        /* "Success" */
resp->val = targetFd;
resp->flags = 0;
ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);
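As a variation on the sample above, the SECCOMP_ADDFD_FLAG_SEND flag
combines the two ioctl(2) calls into one. The following sketch uses
hypothetical helper names, and the fallback definition of the flag is
only for building against headers older than Linux 5.14.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_ADDFD_FLAG_SEND
#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1)
#endif

/* Prepare a request that installs 'src_fd' at the lowest unused
   descriptor number in the target and, in the same step, completes
   the target's system call with that number as its return value */
void
fill_addfd_send(struct seccomp_notif_addfd *addfd, __u64 id, int src_fd)
{
    assert(addfd != NULL && src_fd >= 0);

    memset(addfd, 0, sizeof(*addfd));
    addfd->id = id;             /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
    addfd->srcfd = src_fd;
    addfd->newfd = 0;           /* Must be 0 without ..._FLAG_SETFD */
    addfd->flags = SECCOMP_ADDFD_FLAG_SEND;
    addfd->newfd_flags = O_CLOEXEC;
}

/* Returns the descriptor number allocated in the target, or -1 with
   errno set. No separate SECCOMP_IOCTL_NOTIF_SEND is needed after a
   successful call. */
int
addfd_and_send(int notify_fd, __u64 id, int src_fd)
{
    struct seccomp_notif_addfd addfd;

    fill_addfd_send(&addfd, id, src_fd);
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```

The supervisor can still close its own copy of src_fd once the call
has returned.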

NOTES
One example use case for the user-space notification mechanism is to
allow a container manager (a process which is typically running with
more privilege than the processes inside the container) to mount block
devices or create device nodes for the container. The mount use case
provides an example of where the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag
is useful. Upon receiving a notification for the mount(2) system call,
the container manager (the "supervisor") can distinguish a request to
mount a block filesystem (which would not be possible for a "target"
process inside the container) and mount that filesystem. If, on the
other hand, the container manager detects that the operation could be
performed by the process inside the container (e.g., a mount of a
tmpfs(5) filesystem), it can notify the kernel that the target
process's mount(2) system call can continue.
515
516 select()/poll()/epoll semantics
517 The file descriptor returned when seccomp(2) is employed with the SEC‐
518 COMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using poll(2),
519 epoll(7), and select(2). These interfaces indicate that the file de‐
520 scriptor is ready as follows:
521
• When a notification is pending, these interfaces indicate that the
file descriptor is readable. Following such an indication, a
subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block, returning
either information about a notification or else failing with the error
ENOENT if the target has been killed by a signal or its system call
has been interrupted by a signal handler.
528
529 • After the notification has been received (i.e., by the SEC‐
530 COMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces indicate
531 that the file descriptor is writable, meaning that a notification re‐
532 sponse can be sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) opera‐
533 tion.
534
• After the last thread using the filter has terminated and been reaped
using waitpid(2) (or similar), the file descriptor indicates an
end-of-file condition (readable in select(2); POLLHUP/EPOLLHUP in
poll(2)/epoll_wait(2)).
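The readiness rules above could be applied with poll(2) roughly as in
the following sketch. The enumeration and function names are
hypothetical, and the order in which the conditions are tested is a
design choice of this example rather than something the kernel
prescribes.

```c
#include <assert.h>
#include <poll.h>

enum notify_fd_state {
    NOTIFY_FD_IDLE,             /* Nothing to do (or timeout) */
    NOTIFY_FD_PENDING,          /* A notification can be received */
    NOTIFY_FD_CAN_RESPOND,      /* A response can be sent */
    NOTIFY_FD_CLOSED,           /* All threads using the filter are gone */
    NOTIFY_FD_ERROR             /* poll(2) itself failed */
};

/* Map the 'revents' value reported for the listening file descriptor
   to one of the states listed above */
enum notify_fd_state
classify_revents(short revents)
{
    if (revents & POLLHUP)
        return NOTIFY_FD_CLOSED;
    if (revents & POLLIN)
        return NOTIFY_FD_PENDING;
    if (revents & POLLOUT)
        return NOTIFY_FD_CAN_RESPOND;
    return NOTIFY_FD_IDLE;
}

/* Wait up to 'timeout_ms' milliseconds (-1 means no timeout) for an
   event on the listening file descriptor */
enum notify_fd_state
poll_notify_fd(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN | POLLOUT };
    int n;

    assert(timeout_ms >= -1);

    n = poll(&pfd, 1, timeout_ms);
    if (n == -1)
        return NOTIFY_FD_ERROR;
    if (n == 0)
        return NOTIFY_FD_IDLE;

    return classify_revents(pfd.revents);
}
```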
539
540 Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
541 The intent of the user-space notification feature is to allow system
542 calls to be performed on behalf of the target. The target's system
543 call should either be handled by the supervisor or allowed to continue
544 normally in the kernel (where standard security policies will be ap‐
545 plied).
546
547 Note well: this mechanism must not be used to make security policy de‐
548 cisions about the system call, which would be inherently race-prone for
549 reasons described next.
550
551 The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution.
552 If set by the supervisor, the target's system call will continue. How‐
553 ever, there is a time-of-check, time-of-use race here, since an at‐
554 tacker could exploit the interval of time where the target is blocked
555 waiting on the "continue" response to do things such as rewriting the
556 system call arguments.
557
558 Note furthermore that a user-space notifier can be bypassed if the ex‐
559 isting filters allow the use of seccomp(2) or prctl(2) to install a
560 filter that returns an action value with a higher precedence than SEC‐
561 COMP_RET_USER_NOTIF (see seccomp(2)).
562
It should thus be absolutely clear that the seccomp user-space
notification mechanism cannot be used to implement a security policy!
It should only ever be used in scenarios where a more privileged
process supervises the system calls of a less privileged target to get
around kernel-enforced security restrictions when the supervisor deems
this safe. In other words, in order to continue a system call, the
supervisor should be sure that another security mechanism or the kernel
itself will sufficiently block the system call if its arguments are
rewritten to something unsafe.
572
573 Caveats regarding the use of /proc/[tid]/mem
574 The discussion above noted the need to use the SECCOMP_IOCTL_NO‐
575 TIF_ID_VALID ioctl(2) when opening the /proc/[tid]/mem file of the tar‐
576 get to avoid the possibility of accessing the memory of the wrong
577 process in the event that the target terminates and its ID is recycled
578 by another (unrelated) thread. However, the use of this ioctl(2) oper‐
579 ation is also necessary in other situations, as explained in the fol‐
580 lowing paragraphs.
581
582 Consider the following scenario, where the supervisor tries to read the
583 pathname argument of a target's blocked mount(2) system call:
584
585 • From one of its functions (func()), the target calls mount(2), which
586 triggers a user-space notification and causes the target to block.
587
588 • The supervisor receives the notification, opens /proc/[tid]/mem, and
589 (successfully) performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.
590
591 • The target receives a signal, which causes the mount(2) to abort.
592
593 • The signal handler executes in the target, and returns.
594
595 • Upon return from the handler, the execution of func() resumes, and it
596 returns (and perhaps other functions are called, overwriting the mem‐
597 ory that had been used for the stack frame of func()).
598
599 • Using the address provided in the notification information, the su‐
600 pervisor reads from the target's memory location that used to contain
601 the pathname.
602
603 • The supervisor now calls mount(2) with some arbitrary bytes obtained
604 in the previous step.
605
606 The conclusion from the above scenario is this: since the target's
607 blocked system call may be interrupted by a signal handler, the super‐
608 visor must be written to expect that the target may abandon its system
609 call at any time; in such an event, any information that the supervisor
610 obtained from the target's memory must be considered invalid.
611
612 To prevent such scenarios, every read from the target's memory must be
613 separated from use of the bytes so obtained by a SECCOMP_IOCTL_NO‐
614 TIF_ID_VALID check. In the above example, the check would be placed
615 between the two final steps. An example of such a check is shown in
616 EXAMPLES.
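A minimal sketch of the read-then-check ordering follows. The function
name is hypothetical, and mem_fd is assumed to be a descriptor for the
target's /proc/[tid]/mem file that the supervisor has already opened.
The bytes are reported as usable only if the validity check performed
after the read succeeds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Copy a null-terminated string located at 'addr' in the target into
   'buf'. Returns 0 if the bytes may be used, or -1 if they must be
   discarded. */
int
read_target_string(int notify_fd, int mem_fd,
                   const struct seccomp_notif *req, __u64 addr,
                   char *buf, size_t len)
{
    __u64 id;
    ssize_t n;

    assert(req != NULL && buf != NULL && len > 0);

    n = pread(mem_fd, buf, len, (off_t) addr);
    if (n <= 0)
        return -1;

    /* The check comes after the read and before any use of the bytes */
    id = req->id;
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
        return -1;

    /* Require a terminating null byte within the bytes that were read */
    if (memchr(buf, '\0', (size_t) n) == NULL)
        return -1;

    return 0;
}
```

A caller would pass req->data.args[N] as addr, where N is the position
of the pointer argument in the trapped system call.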
617
618 Following on from the above, it should be clear that a write by the su‐
619 pervisor into the target's memory can never be considered safe.
620
621 Caveats regarding blocking system calls
622 Suppose that the target performs a blocking system call (e.g., ac‐
623 cept(2)) that the supervisor should handle. The supervisor might then
624 in turn execute the same blocking system call.
625
626 In this scenario, it is important to note that if the target's system
627 call is now interrupted by a signal, the supervisor is not informed of
628 this. If the supervisor does not take suitable steps to actively dis‐
629 cover that the target's system call has been canceled, various diffi‐
630 culties can occur. Taking the example of accept(2), the supervisor
631 might remain blocked in its accept(2) holding a port number that the
632 target (which, after the interruption by the signal handler, perhaps
633 closed its listening socket) might expect to be able to reuse in a
634 bind(2) call.
635
636 Therefore, when the supervisor wishes to emulate a blocking system
637 call, it must do so in such a way that it gets informed if the target's
638 system call is interrupted by a signal handler. For example, if the
639 supervisor itself executes the same blocking system call, then it could
640 employ a separate thread that uses the SECCOMP_IOCTL_NOTIF_ID_VALID op‐
641 eration to check if the target is still blocked in its system call.
642 Alternatively, in the accept(2) example, the supervisor might use
643 poll(2) to monitor both the notification file descriptor (so as to dis‐
644 cover when the target's accept(2) call has been interrupted) and the
645 listening file descriptor (so as to know when a connection is avail‐
646 able).
647
648 If the target's system call is interrupted, the supervisor must take
649 care to release resources (e.g., file descriptors) that it acquired on
650 behalf of the target.
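One possible shape for such monitoring, taking the accept(2) example,
is sketched below. It is a variation on the approaches just described:
rather than adding the notification file descriptor to the poll(2) set,
it polls the listening socket with a timeout and repeats the
SECCOMP_IOCTL_NOTIF_ID_VALID check on every wake-up. The names and the
recheck interval are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

enum wait_result {
    WAIT_READY,         /* A connection is available; accept(2) it */
    WAIT_CANCELLED,     /* Target no longer waits; release resources */
    WAIT_ERROR          /* poll(2) failed or the socket is unusable */
};

/* Wait until 'listen_fd' has a pending connection, checking every
   'recheck_ms' milliseconds that the notification 'id' is still
   valid */
enum wait_result
wait_for_connection(int notify_fd, int listen_fd, __u64 id,
                    int recheck_ms)
{
    struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };

    assert(recheck_ms >= 0);

    for (;;) {
        int n = poll(&pfd, 1, recheck_ms);

        if (n == -1 && errno != EINTR)
            return WAIT_ERROR;

        /* Whatever woke us, first confirm that the target is still
           blocked in the system call that is being emulated; any
           failure here is treated as cancellation */
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
            return WAIT_CANCELLED;

        if (n > 0 && (pfd.revents & POLLIN))
            return WAIT_READY;
        if (n > 0 && (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)))
            return WAIT_ERROR;
    }
}
```

Because the target can still be interrupted between WAIT_READY and the
supervisor's own accept(2), the result of the later
SECCOMP_IOCTL_NOTIF_SEND must be checked as well.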
651
652 Interaction with SA_RESTART signal handlers
653 Consider the following scenario:
654
655 • The target process has used sigaction(2) to install a signal handler
656 with the SA_RESTART flag.
657
658 • The target has made a system call that triggered a seccomp user-space
659 notification and the target is currently blocked until the supervisor
660 sends a notification response.
661
662 • A signal is delivered to the target and the signal handler is exe‐
663 cuted.
664
• When (if) the supervisor attempts to send a notification response,
the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will fail with the
ENOENT error.
668
669 In this scenario, the kernel will restart the target's system call.
670 Consequently, the supervisor will receive another user-space notifica‐
671 tion. Thus, depending on how many times the blocked system call is in‐
672 terrupted by a signal handler, the supervisor may receive multiple no‐
673 tifications for the same instance of a system call in the target.
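
The following sketch shows one way a supervisor might distinguish the ENOENT case in this scenario from other failures when sending a response. The helper name sendNotifResponse() and its return convention are assumptions made for illustration; only the SECCOMP_IOCTL_NOTIF_SEND operation and the ENOENT error come from this page.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Hypothetical helper: send 'resp' on the notification file descriptor.

   Returns 1 if the response was delivered, 0 if the notification was
   no longer pending (ENOENT; for example, because the target's system
   call was interrupted by a signal handler), or -1 on any other error
   (with errno set by ioctl()). */
static int
sendNotifResponse(int notifyFd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return 1;

    if (errno == ENOENT)
        return 0;       /* Stale notification; not a fatal error */

    return -1;
}
```

When the helper returns 0, the supervisor would release anything it acquired for the abandoned request and go back to waiting; if the system call is restarted, it arrives as a further notification.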
674
675 One oddity is that system call restarting as described in this scenario
676 will occur even for the blocking system calls listed in signal(7) that
677 would never normally be restarted by the SA_RESTART flag.
678
679 Furthermore, if the supervisor response is a file descriptor added with
680 SECCOMP_IOCTL_NOTIF_ADDFD, then the flag SECCOMP_ADDFD_FLAG_SEND can be
681 used to atomically add the file descriptor and return that value, mak‐
682 ing sure no file descriptors are inadvertently leaked into the target.
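
A minimal sketch of the atomic add-and-respond step might look as follows. The function names prepareAddfdSend() and addFdAndRespond() are hypothetical. The sketch assumes kernel headers that define struct seccomp_notif_addfd; the fallback definition of SECCOMP_ADDFD_FLAG_SEND is an assumption intended only for headers that predate that flag, and a running kernel that lacks the flag is expected to reject the request.

```c
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

#ifndef SECCOMP_ADDFD_FLAG_SEND
#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1)  /* Assumed value, for older
                                               headers only */
#endif

/* Fill in '*addfd' so that the descriptor 'srcFd' (a descriptor in the
   supervisor) is installed in the target at the lowest available
   descriptor number and a response is sent in the same step. */
static void
prepareAddfdSend(struct seccomp_notif_addfd *addfd, __u64 id, int srcFd)
{
    memset(addfd, 0, sizeof(*addfd));
    addfd->id = id;                         /* Notification ID */
    addfd->srcfd = srcFd;
    addfd->flags = SECCOMP_ADDFD_FLAG_SEND; /* Add and respond atomically */
    /* 'newfd' and 'newfd_flags' stay 0: no specific target descriptor
       number is requested and no descriptor flags are set */
}

/* Returns the descriptor number allocated in the target on success,
   or -1 on error (with errno set by ioctl()). */
static int
addFdAndRespond(int notifyFd, __u64 id, int srcFd)
{
    struct seccomp_notif_addfd addfd;

    prepareAddfdSend(&addfd, id, srcFd);
    return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```

Because the response is sent by the same operation, no separate SECCOMP_IOCTL_NOTIF_SEND call follows a successful addFdAndRespond().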
683
BUGS
If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed after the
target terminates, then the ioctl(2) call simply blocks (rather than
returning an error to indicate that the target no longer exists).
688
EXAMPLES
The (somewhat contrived) program shown below demonstrates the use of
691 the interfaces described in this page. The program creates a child
692 process that serves as the "target" process. The child process in‐
693 stalls a seccomp filter that returns the SECCOMP_RET_USER_NOTIF action
694 value if a call is made to mkdir(2). The child process then calls
695 mkdir(2) once for each of the supplied command-line arguments, and re‐
696 ports the result returned by the call. After processing all arguments,
697 the child process terminates.
698
699 The parent process acts as the supervisor, listening for the notifica‐
700 tions that are generated when the target process calls mkdir(2). When
701 such a notification occurs, the supervisor examines the memory of the
702 target process (using /proc/[pid]/mem) to discover the pathname argu‐
703 ment that was supplied to the mkdir(2) call, and performs one of the
704 following actions:
705
706 • If the pathname begins with the prefix "/tmp/", then the supervisor
707 attempts to create the specified directory, and then spoofs a return
708 for the target process based on the return value of the supervisor's
  mkdir(2) call. If that call succeeds, the spoofed success return
  value is the length of the pathname.
711
712 • If the pathname begins with "./" (i.e., it is a relative pathname),
713 the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to
714 the kernel to say that the kernel should execute the target process's
715 mkdir(2) call.
716
717 • If the pathname begins with some other prefix, the supervisor spoofs
718 an error return for the target process, so that the target process's
719 mkdir(2) call appears to fail with the error EOPNOTSUPP ("Operation
720 not supported"). Additionally, if the specified pathname is exactly
721 "/bye", then the supervisor terminates.
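
The three-way decision in the list above can be sketched in isolation as follows. The enum mkdirAction and the function classifyPathname() are hypothetical names introduced for illustration; the prefix tests mirror those used in the program source on this page.

```c
#include <string.h>

/* Hypothetical labels for the three supervisor actions */
enum mkdirAction {
    ACT_EMULATE,    /* Supervisor performs mkdir() on the target's behalf */
    ACT_CONTINUE,   /* Kernel is told to execute the target's mkdir() */
    ACT_DENY        /* Supervisor spoofs an EOPNOTSUPP error */
};

/* Decide how to treat a mkdir() request, based only on the prefix of
   'path', which must be a null-terminated string. */
static enum mkdirAction
classifyPathname(const char *path)
{
    if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0)
        return ACT_EMULATE;

    if (strncmp(path, "./", strlen("./")) == 0)
        return ACT_CONTINUE;

    return ACT_DENY;
}
```

Note that the comparison is purely textual: a pathname such as "/tmp" (without the trailing slash) or ".hidden" falls into the last category.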
722
723 This program can be used to demonstrate various aspects of the behavior
of the seccomp user-space notification mechanism. To aid such
demonstrations, the program logs various messages to show the operation
of the target process (lines prefixed "T:") and the supervisor
(indented lines prefixed "S:").
728
729 In the following example, the target attempts to create the directory
730 /tmp/x. Upon receiving the notification, the supervisor creates the
731 directory on the target's behalf, and spoofs a success return to be re‐
732 ceived by the target process's mkdir(2) call.
733
734 $ ./seccomp_unotify /tmp/x
735 T: PID = 23168
736
737 T: about to mkdir("/tmp/x")
738 S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
739 S: executing: mkdir("/tmp/x", 0700)
740 S: success! spoofed return = 6
741 S: sending response (flags = 0; val = 6; error = 0)
742 T: SUCCESS: mkdir(2) returned 6
743
744 T: terminating
745 S: target has terminated; bye
746
747 In the above output, note that the spoofed return value seen by the
748 target process is 6 (the length of the pathname /tmp/x), whereas a nor‐
749 mal mkdir(2) call returns 0 on success.
750
751 In the next example, the target attempts to create a directory using
752 the relative pathname ./sub. Since this pathname starts with "./", the
753 supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the
754 kernel, and the kernel then (successfully) executes the target
755 process's mkdir(2) call.
756
757 $ ./seccomp_unotify ./sub
758 T: PID = 23204
759
760 T: about to mkdir("./sub")
761 S: got notification (ID 0xddb16abe25b4c12) for PID 23204
762 S: target can execute system call
763 S: sending response (flags = 0x1; val = 0; error = 0)
764 T: SUCCESS: mkdir(2) returned 0
765
766 T: terminating
767 S: target has terminated; bye
768
If the target process attempts to create a directory with a pathname
that begins with neither "./" nor the prefix "/tmp/", then the
supervisor spoofs an error return (EOPNOTSUPP, "Operation not
supported") for the target's mkdir(2) call (which is not executed):
773
774 $ ./seccomp_unotify /xxx
775 T: PID = 23178
776
777 T: about to mkdir("/xxx")
778 S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
779 S: spoofing error response (Operation not supported)
780 S: sending response (flags = 0; val = 0; error = -95)
781 T: ERROR: mkdir(2): Operation not supported
782
783 T: terminating
784 S: target has terminated; bye
785
786 In the next example, the target process attempts to create a directory
787 with the pathname /tmp/nosuchdir/b. Upon receiving the notification,
788 the supervisor attempts to create that directory, but the mkdir(2) call
789 fails because the directory /tmp/nosuchdir does not exist. Conse‐
790 quently, the supervisor spoofs an error return that passes the error
791 that it received back to the target process's mkdir(2) call.
792
793 $ ./seccomp_unotify /tmp/nosuchdir/b
794 T: PID = 23199
795
796 T: about to mkdir("/tmp/nosuchdir/b")
797 S: got notification (ID 0x8744454293506046) for PID 23199
798 S: executing: mkdir("/tmp/nosuchdir/b", 0700)
799 S: failure! (errno = 2; No such file or directory)
800 S: sending response (flags = 0; val = 0; error = -2)
801 T: ERROR: mkdir(2): No such file or directory
802
803 T: terminating
804 S: target has terminated; bye
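
The "error = -2" shown in the output above reflects the fact that the supervisor places a negated errno value in the error field of the response. A minimal sketch of preparing such a response follows; the helper name fillErrorResponse() is hypothetical and is not part of the program on this page.

```c
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

/* Hypothetical helper: prepare '*resp' so that the target's system call
   appears to fail with the error 'errnum' (a positive errno value such
   as ENOENT). The 'val' and 'flags' fields are left as 0. */
static void
fillErrorResponse(struct seccomp_notif_resp *resp, __u64 id, int errnum)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;              /* Must match the notification ID */
    resp->error = -errnum;      /* Negated errno value */
}
```

A supervisor whose own mkdir(2) call failed could pass the saved errno value to this helper before sending the response with SECCOMP_IOCTL_NOTIF_SEND.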
805
806 If the supervisor receives a notification and sees that the argument of
807 the target's mkdir(2) is the string "/bye", then (as well as spoofing
808 an EOPNOTSUPP error), the supervisor terminates. If the target process
809 subsequently executes another mkdir(2) that triggers its seccomp filter
810 to return the SECCOMP_RET_USER_NOTIF action value, then the kernel
811 causes the target process's system call to fail with the error ENOSYS
812 ("Function not implemented"). This is demonstrated by the following
813 example:
814
815 $ ./seccomp_unotify /bye /tmp/y
816 T: PID = 23185
817
818 T: about to mkdir("/bye")
819 S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
820 S: spoofing error response (Operation not supported)
821 S: sending response (flags = 0; val = 0; error = -95)
822 S: terminating **********
823 T: ERROR: mkdir(2): Operation not supported
824
825 T: about to mkdir("/tmp/y")
826 T: ERROR: mkdir(2): Function not implemented
827
828 T: terminating
829
830 Program source
831 #define _GNU_SOURCE
832 #include <errno.h>
833 #include <fcntl.h>
834 #include <limits.h>
835 #include <linux/audit.h>
836 #include <linux/filter.h>
837 #include <linux/seccomp.h>
838 #include <signal.h>
839 #include <stdbool.h>
840 #include <stddef.h>
841 #include <stdint.h>
842 #include <stdio.h>
#include <stdlib.h>
#include <string.h>
844 #include <sys/socket.h>
845 #include <sys/ioctl.h>
846 #include <sys/prctl.h>
847 #include <sys/stat.h>
848 #include <sys/types.h>
849 #include <sys/un.h>
850 #include <sys/syscall.h>
851 #include <unistd.h>
852
853 #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
854 } while (0)
855
856 /* Send the file descriptor 'fd' over the connected UNIX domain socket
857 'sockfd'. Returns 0 on success, or -1 on error. */
858
859 static int
860 sendfd(int sockfd, int fd)
861 {
862 struct msghdr msgh;
863 struct iovec iov;
864 int data;
865 struct cmsghdr *cmsgp;
866
867 /* Allocate a char array of suitable size to hold the ancillary data.
868 However, since this buffer is in reality a 'struct cmsghdr', use a
869 union to ensure that it is suitably aligned. */
870 union {
871 char buf[CMSG_SPACE(sizeof(int))];
872 /* Space large enough to hold an 'int' */
873 struct cmsghdr align;
874 } controlMsg;
875
876 /* The 'msg_name' field can be used to specify the address of the
877 destination socket when sending a datagram. However, we do not
878 need to use this field because 'sockfd' is a connected socket. */
879
880 msgh.msg_name = NULL;
881 msgh.msg_namelen = 0;
882
883 /* On Linux, we must transmit at least one byte of real data in
884 order to send ancillary data. We transmit an arbitrary integer
885 whose value is ignored by recvfd(). */
886
887 msgh.msg_iov = &iov;
888 msgh.msg_iovlen = 1;
889 iov.iov_base = &data;
890 iov.iov_len = sizeof(int);
891 data = 12345;
892
893 /* Set 'msghdr' fields that describe ancillary data */
894
895 msgh.msg_control = controlMsg.buf;
896 msgh.msg_controllen = sizeof(controlMsg.buf);
897
898 /* Set up ancillary data describing file descriptor to send */
899
900 cmsgp = CMSG_FIRSTHDR(&msgh);
901 cmsgp->cmsg_level = SOL_SOCKET;
902 cmsgp->cmsg_type = SCM_RIGHTS;
903 cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
904 memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));
905
906 /* Send real plus ancillary data */
907
908 if (sendmsg(sockfd, &msgh, 0) == -1)
909 return -1;
910
911 return 0;
912 }
913
914 /* Receive a file descriptor on a connected UNIX domain socket. Returns
915 the received file descriptor on success, or -1 on error. */
916
917 static int
918 recvfd(int sockfd)
919 {
920 struct msghdr msgh;
921 struct iovec iov;
922 int data, fd;
923 ssize_t nr;
924
925 /* Allocate a char buffer for the ancillary data. See the comments
926 in sendfd() */
927 union {
928 char buf[CMSG_SPACE(sizeof(int))];
929 struct cmsghdr align;
930 } controlMsg;
931 struct cmsghdr *cmsgp;
932
933 /* The 'msg_name' field can be used to obtain the address of the
934 sending socket. However, we do not need this information. */
935
936 msgh.msg_name = NULL;
937 msgh.msg_namelen = 0;
938
939 /* Specify buffer for receiving real data */
940
941 msgh.msg_iov = &iov;
942 msgh.msg_iovlen = 1;
943 iov.iov_base = &data; /* Real data is an 'int' */
944 iov.iov_len = sizeof(int);
945
946 /* Set 'msghdr' fields that describe ancillary data */
947
948 msgh.msg_control = controlMsg.buf;
949 msgh.msg_controllen = sizeof(controlMsg.buf);
950
951 /* Receive real plus ancillary data; real data is ignored */
952
953 nr = recvmsg(sockfd, &msgh, 0);
954 if (nr == -1)
955 return -1;
956
957 cmsgp = CMSG_FIRSTHDR(&msgh);
958
959 /* Check the validity of the 'cmsghdr' */
960
961 if (cmsgp == NULL ||
962 cmsgp->cmsg_len != CMSG_LEN(sizeof(int)) ||
963 cmsgp->cmsg_level != SOL_SOCKET ||
964 cmsgp->cmsg_type != SCM_RIGHTS) {
965 errno = EINVAL;
966 return -1;
967 }
968
969 /* Return the received file descriptor to our caller */
970
971 memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
972 return fd;
973 }
974
975 static void
976 sigchldHandler(int sig)
977 {
978 char msg[] = "\tS: target has terminated; bye\n";
979
980 write(STDOUT_FILENO, msg, sizeof(msg) - 1);
981 _exit(EXIT_SUCCESS);
982 }
983
984 static int
985 seccomp(unsigned int operation, unsigned int flags, void *args)
986 {
987 return syscall(__NR_seccomp, operation, flags, args);
988 }
989
990 /* The following is the x86-64-specific BPF boilerplate code for checking
991 that the BPF program is running on the right architecture + ABI. At
992 completion of these instructions, the accumulator contains the system
993 call number. */
994
995 /* For the x32 ABI, all system call numbers have bit 30 set */
996
997 #define X32_SYSCALL_BIT 0x40000000
998
999 #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
1000 BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
1001 (offsetof(struct seccomp_data, arch))), \
1002 BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
1003 BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
1004 (offsetof(struct seccomp_data, nr))), \
1005 BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
1006 BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)
1007
1008 /* installNotifyFilter() installs a seccomp filter that generates
1009 user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
1010 calls mkdir(2); the filter allows all other system calls.
1011
1012 The function return value is a file descriptor from which the
1013 user-space notifications can be fetched. */
1014
1015 static int
1016 installNotifyFilter(void)
1017 {
1018 struct sock_filter filter[] = {
1019 X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,
1020
1021 /* mkdir() triggers notification to user-space supervisor */
1022
1023 BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mkdir, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
1025
1026 /* Every other system call is allowed */
1027
1028 BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
1029 };
1030
1031 struct sock_fprog prog = {
1032 .len = sizeof(filter) / sizeof(filter[0]),
1033 .filter = filter,
1034 };
1035
1036 /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1037 as a result, seccomp() returns a notification file descriptor. */
1038
1039 int notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
1040 SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
1041 if (notifyFd == -1)
1042 errExit("seccomp-install-notify-filter");
1043
1044 return notifyFd;
1045 }
1046
1047 /* Close a pair of sockets created by socketpair() */
1048
1049 static void
1050 closeSocketPair(int sockPair[2])
1051 {
1052 if (close(sockPair[0]) == -1)
1053 errExit("closeSocketPair-close-0");
1054 if (close(sockPair[1]) == -1)
1055 errExit("closeSocketPair-close-1");
1056 }
1057
1058 /* Implementation of the target process; create a child process that:
1059
1060 (1) installs a seccomp filter with the
1061 SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
1062 (2) writes the seccomp notification file descriptor returned from
1063 the previous step onto the UNIX domain socket, 'sockPair[0]';
1064 (3) calls mkdir(2) for each element of 'argv'.
1065
1066 The function return value in the parent is the PID of the child
1067 process; the child does not return from this function. */
1068
1069 static pid_t
1070 targetProcess(int sockPair[2], char *argv[])
1071 {
1072 pid_t targetPid = fork();
1073 if (targetPid == -1)
1074 errExit("fork");
1075
1076 if (targetPid > 0) /* In parent, return PID of child */
1077 return targetPid;
1078
1079 /* Child falls through to here */
1080
1081 printf("T: PID = %ld\n", (long) getpid());
1082
1083 /* Install seccomp filter(s) */
1084
1085 if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
1086 errExit("prctl");
1087
1088 int notifyFd = installNotifyFilter();
1089
/* Pass the notification file descriptor to the supervisor process
   over a UNIX domain socket */
1092
1093 if (sendfd(sockPair[0], notifyFd) == -1)
1094 errExit("sendfd");
1095
1096 /* Notification and socket FDs are no longer needed in target */
1097
1098 if (close(notifyFd) == -1)
1099 errExit("close-target-notify-fd");
1100
1101 closeSocketPair(sockPair);
1102
1103 /* Perform a mkdir() call for each of the command-line arguments */
1104
1105 for (char **ap = argv; *ap != NULL; ap++) {
1106 printf("\nT: about to mkdir(\"%s\")\n", *ap);
1107
1108 int s = mkdir(*ap, 0700);
1109 if (s == -1)
1110 perror("T: ERROR: mkdir(2)");
1111 else
1112 printf("T: SUCCESS: mkdir(2) returned %d\n", s);
1113 }
1114
1115 printf("\nT: terminating\n");
1116 exit(EXIT_SUCCESS);
1117 }
1118
1119 /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
1120 operation is still valid. It will no longer be valid if the target
1121 process has terminated or is no longer blocked in the system call that
1122 generated the notification (because it was interrupted by a signal).
1123
1124 This operation can be used when doing such things as accessing
1125 /proc/PID files in the target process in order to avoid TOCTOU race
1126 conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
1127 terminates and is reused by another process. */
1128
1129 static bool
1130 cookieIsValid(int notifyFd, uint64_t id)
1131 {
1132 return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
1133 }
1134
1135 /* Access the memory of the target process in order to fetch the
1136 pathname referred to by the system call argument 'argNum' in
1137 'req->data.args[]'. The pathname is returned in 'path',
1138 a buffer of 'len' bytes allocated by the caller.
1139
1140 Returns true if the pathname is successfully fetched, and false
1141 otherwise. For possible causes of failure, see the comments below. */
1142
1143 static bool
1144 getTargetPathname(struct seccomp_notif *req, int notifyFd,
1145 int argNum, char *path, size_t len)
1146 {
1147 char procMemPath[PATH_MAX];
1148
1149 snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);
1150
1151 int procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
1152 if (procMemFd == -1)
1153 return false;
1154
1155 /* Check that the process whose info we are accessing is still alive
1156 and blocked in the system call that caused the notification.
1157 If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
1158 cookieIsValid()) succeeded, we know that the /proc/PID/mem file
1159 descriptor that we opened corresponded to the process for which we
1160 received a notification. If that process subsequently terminates,
1161 then read() on that file descriptor will return 0 (EOF). */
1162
1163 if (!cookieIsValid(notifyFd, req->id)) {
1164 close(procMemFd);
1165 return false;
1166 }
1167
1168 /* Read bytes at the location containing the pathname argument */
1169
1170 ssize_t nread = pread(procMemFd, path, len, req->data.args[argNum]);
1171
1172 close(procMemFd);
1173
1174 if (nread <= 0)
1175 return false;
1176
1177 /* Once again check that the notification ID is still valid. The
1178 case we are particularly concerned about here is that just
1179 before we fetched the pathname, the target's blocked system
1180 call was interrupted by a signal handler, and after the handler
1181 returned, the target carried on execution (past the interrupted
1182 system call). In that case, we have no guarantees about what we
1183 are reading, since the target's memory may have been arbitrarily
1184 changed by subsequent operations. */
1185
1186 if (!cookieIsValid(notifyFd, req->id)) {
1187 perror("\tS: notification ID check failed!!!");
1188 return false;
1189 }
1190
1191 /* Even if the target's system call was not interrupted by a signal,
1192 we have no guarantees about what was in the memory of the target
1193 process. (The memory may have been modified by another thread, or
1194 even by an external attacking process.) We therefore treat the
1195 buffer returned by pread() as untrusted input. The buffer should
1196 contain a terminating null byte; if not, then we will trigger an
1197 error for the target process. */
1198
1199 if (strnlen(path, nread) < nread)
1200 return true;
1201
1202 return false;
1203 }
1204
1205 /* Allocate buffers for the seccomp user-space notification request and
1206 response structures. It is the caller's responsibility to free the
1207 buffers returned via 'req' and 'resp'. */
1208
1209 static void
1210 allocSeccompNotifBuffers(struct seccomp_notif **req,
1211 struct seccomp_notif_resp **resp,
1212 struct seccomp_notif_sizes *sizes)
1213 {
1214 /* Discover the sizes of the structures that are used to receive
1215 notifications and send notification responses, and allocate
1216 buffers of those sizes. */
1217
1218 if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
1219 errExit("seccomp-SECCOMP_GET_NOTIF_SIZES");
1220
1221 *req = malloc(sizes->seccomp_notif);
1222 if (*req == NULL)
1223 errExit("malloc-seccomp_notif");
1224
1225 /* When allocating the response buffer, we must allow for the fact
1226 that the user-space binary may have been built with user-space
1227 headers where 'struct seccomp_notif_resp' is bigger than the
1228 response buffer expected by the (older) kernel. Therefore, we
1229 allocate a buffer that is the maximum of the two sizes. This
1230 ensures that if the supervisor places bytes into the response
1231 structure that are past the response size that the kernel expects,
1232 then the supervisor is not touching an invalid memory location. */
1233
1234 size_t resp_size = sizes->seccomp_notif_resp;
1235 if (sizeof(struct seccomp_notif_resp) > resp_size)
1236 resp_size = sizeof(struct seccomp_notif_resp);
1237
1238 *resp = malloc(resp_size);
if (*resp == NULL)
1240 errExit("malloc-seccomp_notif_resp");
1241
1242 }
1243
1244 /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
1245 descriptor, 'notifyFd'. */
1246
1247 static void
1248 handleNotifications(int notifyFd)
1249 {
1250 struct seccomp_notif_sizes sizes;
1251 struct seccomp_notif *req;
1252 struct seccomp_notif_resp *resp;
1253 char path[PATH_MAX];
1254
1255 allocSeccompNotifBuffers(&req, &resp, &sizes);
1256
1257 /* Loop handling notifications */
1258
1259 for (;;) {
1260
1261 /* Wait for next notification, returning info in '*req' */
1262
1263 memset(req, 0, sizes.seccomp_notif);
1264 if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
1265 if (errno == EINTR)
1266 continue;
1267 errExit("\tS: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
1268 }
1269
1270 printf("\tS: got notification (ID %#llx) for PID %d\n",
1271 req->id, req->pid);
1272
1273 /* The only system call that can generate a notification event
1274 is mkdir(2). Nevertheless, we check that the notified system
call is indeed mkdir() as a kind of future-proofing of this
1276 code in case the seccomp filter is later modified to
1277 generate notifications for other system calls. */
1278
1279 if (req->data.nr != __NR_mkdir) {
1280 printf("\tS: notification contained unexpected "
1281 "system call number; bye!!!\n");
1282 exit(EXIT_FAILURE);
1283 }
1284
1285 bool pathOK = getTargetPathname(req, notifyFd, 0, path,
1286 sizeof(path));
1287
1288 /* Prepopulate some fields of the response */
1289
1290 resp->id = req->id; /* Response includes notification ID */
1291 resp->flags = 0;
1292 resp->val = 0;
1293
/* If getTargetPathname() failed, trigger an EINVAL error
   response (sending this response may yield an error if the
   failure occurred because the notification ID was no longer
   valid); if the directory is in /tmp, then create it on behalf
   of the target; if the pathname starts with "./", tell the
   kernel to let the target process execute the mkdir();
   otherwise, give an error for a directory pathname in any other
   location. */
1302
1303 if (!pathOK) {
1304 resp->error = -EINVAL;
1305 printf("\tS: spoofing error for invalid pathname (%s)\n",
1306 strerror(-resp->error));
1307 } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
1308 printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
1309 path, req->data.args[1]);
1310
1311 if (mkdir(path, req->data.args[1]) == 0) {
1312 resp->error = 0; /* "Success" */
1313 resp->val = strlen(path); /* Used as return value of
1314 mkdir() in target */
1315 printf("\tS: success! spoofed return = %lld\n",
1316 resp->val);
1317 } else {
1318
1319 /* If mkdir() failed in the supervisor, pass the error
1320 back to the target */
1321
1322 resp->error = -errno;
1323 printf("\tS: failure! (errno = %d; %s)\n", errno,
1324 strerror(errno));
1325 }
1326 } else if (strncmp(path, "./", strlen("./")) == 0) {
1327 resp->error = resp->val = 0;
1328 resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
1329 printf("\tS: target can execute system call\n");
1330 } else {
1331 resp->error = -EOPNOTSUPP;
1332 printf("\tS: spoofing error response (%s)\n",
1333 strerror(-resp->error));
1334 }
1335
1336 /* Send a response to the notification */
1337
1338 printf("\tS: sending response "
1339 "(flags = %#x; val = %lld; error = %d)\n",
1340 resp->flags, resp->val, resp->error);
1341
1342 if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
1343 if (errno == ENOENT)
1344 printf("\tS: response failed with ENOENT; "
1345 "perhaps target process's syscall was "
1346 "interrupted by a signal?\n");
1347 else
1348 perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
1349 }
1350
1351 /* If the pathname is just "/bye", then the supervisor breaks out
1352 of the loop and terminates. This allows us to see what happens
1353 if the target process makes further calls to mkdir(2). */
1354
if (pathOK && strcmp(path, "/bye") == 0)
1356 break;
1357 }
1358
1359 free(req);
1360 free(resp);
1361 printf("\tS: terminating **********\n");
1362 exit(EXIT_FAILURE);
1363 }
1364
1365 /* Implementation of the supervisor process:
1366
1367 (1) obtains the notification file descriptor from 'sockPair[1]'
1368 (2) handles notifications that arrive on that file descriptor. */
1369
1370 static void
1371 supervisor(int sockPair[2])
1372 {
1373 int notifyFd = recvfd(sockPair[1]);
1374 if (notifyFd == -1)
1375 errExit("recvfd");
1376
1377 closeSocketPair(sockPair); /* We no longer need the socket pair */
1378
1379 handleNotifications(notifyFd);
1380 }
1381
1382 int
1383 main(int argc, char *argv[])
1384 {
1385 int sockPair[2];
1386
1387 setbuf(stdout, NULL);
1388
1389 if (argc < 2) {
1390 fprintf(stderr, "At least one pathname argument is required\n");
1391 exit(EXIT_FAILURE);
1392 }
1393
1394 /* Create a UNIX domain socket that is used to pass the seccomp
1395 notification file descriptor from the target process to the
1396 supervisor process. */
1397
1398 if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
1399 errExit("socketpair");
1400
1401 /* Create a child process--the "target"--that installs seccomp
1402 filtering. The target process writes the seccomp notification
1403 file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
1404 each directory in the command-line arguments. */
1405
1406 (void) targetProcess(sockPair, &argv[optind]);
1407
1408 /* Catch SIGCHLD when the target terminates, so that the
1409 supervisor can also terminate. */
1410
1411 struct sigaction sa;
1412 sa.sa_handler = sigchldHandler;
1413 sa.sa_flags = 0;
1414 sigemptyset(&sa.sa_mask);
1415 if (sigaction(SIGCHLD, &sa, NULL) == -1)
1416 errExit("sigaction");
1417
1418 supervisor(sockPair);
1419
1420 exit(EXIT_SUCCESS);
1421 }
1422
SEE ALSO
ioctl(2), pidfd_getfd(2), pidfd_open(2), seccomp(2)
1425
A further example program can be found in the kernel source file
samples/seccomp/user-trap.c.
1428
COLOPHON
This page is part of release 5.13 of the Linux man-pages project. A
1431 description of the project, information about reporting bugs, and the
1432 latest version of this page, can be found at
1433 https://www.kernel.org/doc/man-pages/.
1434
1435
1436
Linux 2021-06-20 SECCOMP_UNOTIFY(2)