seccomp_unotify(2)          System Calls Manual          seccomp_unotify(2)

NAME
seccomp_unotify - Seccomp user-space notification mechanism

LIBRARY
Standard C library (libc, -lc)

SYNOPSIS
    #include <linux/seccomp.h>
    #include <linux/filter.h>
    #include <linux/audit.h>

    int seccomp(unsigned int operation, unsigned int flags, void *args);

    #include <sys/ioctl.h>

    int ioctl(int fd, SECCOMP_IOCTL_NOTIF_RECV,
              struct seccomp_notif *req);
    int ioctl(int fd, SECCOMP_IOCTL_NOTIF_SEND,
              struct seccomp_notif_resp *resp);
    int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ID_VALID, __u64 *id);
    int ioctl(int fd, SECCOMP_IOCTL_NOTIF_ADDFD,
              struct seccomp_notif_addfd *addfd);

DESCRIPTION
29 This page describes the user-space notification mechanism provided by
30 the Secure Computing (seccomp) facility. As well as the use of the
31 SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the SECCOMP_RET_USER_NOTIF ac‐
32 tion value, and the SECCOMP_GET_NOTIF_SIZES operation described in sec‐
33 comp(2), this mechanism involves the use of a number of related
34 ioctl(2) operations (described below).
35
36 Overview
37 In conventional usage of a seccomp filter, the decision about how to
38 treat a system call is made by the filter itself. By contrast, the
39 user-space notification mechanism allows the seccomp filter to delegate
the handling of the system call to another user-space process. Note
that this mechanism is explicitly not intended as a method of
implementing security policy; see NOTES.
43
44 In the discussion that follows, the thread(s) on which the seccomp fil‐
45 ter is installed is (are) referred to as the target, and the process
46 that is notified by the user-space notification mechanism is referred
47 to as the supervisor.
48
49 A suitably privileged supervisor can use the user-space notification
50 mechanism to perform actions on behalf of the target. The advantage of
51 the user-space notification mechanism is that the supervisor will usu‐
52 ally be able to retrieve information about the target and the performed
53 system call that the seccomp filter itself cannot. (A seccomp filter
54 is limited in the information it can obtain and the actions that it can
55 perform because it is running on a virtual machine inside the kernel.)
56
57 An overview of the steps performed by the target and the supervisor is
58 as follows:
59
60 (1) The target establishes a seccomp filter in the usual manner, but
61 with two differences:
62
63 • The seccomp(2) flags argument includes the flag SECCOMP_FIL‐
64 TER_FLAG_NEW_LISTENER. Consequently, the return value of the
65 (successful) seccomp(2) call is a new "listening" file descrip‐
66 tor that can be used to receive notifications. Only one "lis‐
67 tening" seccomp filter can be installed for a thread.
68
69 • In cases where it is appropriate, the seccomp filter returns
70 the action value SECCOMP_RET_USER_NOTIF. This return value
71 will trigger a notification event.
72
73 (2) In order that the supervisor can obtain notifications using the
74 listening file descriptor, (a duplicate of) that file descriptor
75 must be passed from the target to the supervisor. One way in
76 which this could be done is by passing the file descriptor over a
77 UNIX domain socket connection between the target and the supervi‐
78 sor (using the SCM_RIGHTS ancillary message type described in
79 unix(7)). Another way to do this is through the use of
80 pidfd_getfd(2).
81
82 (3) The supervisor will receive notification events on the listening
83 file descriptor. These events are returned as structures of type
84 seccomp_notif. Because this structure and its size may evolve
85 over kernel versions, the supervisor must first determine the size
86 of this structure using the seccomp(2) SECCOMP_GET_NOTIF_SIZES op‐
87 eration, which returns a structure of type seccomp_notif_sizes.
The supervisor allocates a buffer of size
seccomp_notif_sizes.seccomp_notif bytes to receive notification
events. In addition, the supervisor allocates another buffer of size
seccomp_notif_sizes.seccomp_notif_resp bytes for the response (a
seccomp_notif_resp structure) that it will provide to the kernel (and
thus the target).
94
95 (4) The target then performs its workload, which includes system calls
96 that will be controlled by the seccomp filter. Whenever one of
97 these system calls causes the filter to return the SEC‐
98 COMP_RET_USER_NOTIF action value, the kernel does not (yet) exe‐
99 cute the system call; instead, execution of the target is tempo‐
100 rarily blocked inside the kernel (in a sleep state that is inter‐
101 ruptible by signals) and a notification event is generated on the
102 listening file descriptor.
103
104 (5) The supervisor can now repeatedly monitor the listening file de‐
105 scriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do this,
106 the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) opera‐
107 tion to read information about a notification event; this opera‐
108 tion blocks until an event is available. The operation returns a
109 seccomp_notif structure containing information about the system
110 call that is being attempted by the target. (As described in
111 NOTES, the file descriptor can also be monitored with select(2),
112 poll(2), or epoll(7).)
113
114 (6) The seccomp_notif structure returned by the SECCOMP_IOCTL_NO‐
115 TIF_RECV operation includes the same information (a seccomp_data
116 structure) that was passed to the seccomp filter. This informa‐
117 tion allows the supervisor to discover the system call number and
118 the arguments for the target's system call. In addition, the no‐
119 tification event contains the ID of the thread that triggered the
120 notification and a unique cookie value that is used in subsequent
121 SECCOMP_IOCTL_NOTIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND opera‐
122 tions.
123
124 The information in the notification can be used to discover the
125 values of pointer arguments for the target's system call. (This
126 is something that can't be done from within a seccomp filter.)
127 One way in which the supervisor can do this is to open the corre‐
128 sponding /proc/tid/mem file (see proc(5)) and read bytes from the
129 location that corresponds to one of the pointer arguments whose
130 value is supplied in the notification event. (The supervisor must
131 be careful to avoid a race condition that can occur when doing
132 this; see the description of the SECCOMP_IOCTL_NOTIF_ID_VALID
133 ioctl(2) operation below.) In addition, the supervisor can access
134 other system information that is visible in user space but which
135 is not accessible from a seccomp filter.
136
137 (7) Having obtained information as per the previous step, the supervi‐
138 sor may then choose to perform an action in response to the tar‐
139 get's system call (which, as noted above, is not executed when the
140 seccomp filter returns the SECCOMP_RET_USER_NOTIF action value).
141
142 One example use case here relates to containers. The target may
143 be located inside a container where it does not have sufficient
144 capabilities to mount a filesystem in the container's mount name‐
145 space. However, the supervisor may be a more privileged process
146 that does have sufficient capabilities to perform the mount opera‐
147 tion.
148
149 (8) The supervisor then sends a response to the notification. The in‐
150 formation in this response is used by the kernel to construct a
151 return value for the target's system call and provide a value that
152 will be assigned to the errno variable of the target.
153
The response is sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2)
operation, which is used to transmit a seccomp_notif_resp structure
to the kernel. This structure must include the cookie value that the
supervisor obtained in the seccomp_notif structure returned by the
SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the kernel to
associate the response with the target.
164
(9) Once the notification response has been sent, the system call in
the target thread unblocks, returning the information that was
provided by the supervisor in the notification response.
168
169 As a variation on the last two steps, the supervisor can send a re‐
170 sponse that tells the kernel that it should execute the target thread's
171 system call; see the discussion of SECCOMP_USER_NOTIF_FLAG_CONTINUE,
172 below.
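
As an illustration of step (3) above, the following is a minimal sketch of how a supervisor might size and allocate its buffers. The helper names (seccomp_call(), buf_size(), alloc_notif_buffers()) are hypothetical and not part of any API. Taking the larger of the kernel-reported size and the compile-time structure size is an assumption made here for robustness; the text above only requires the kernel-reported size.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: invoke seccomp(2) through syscall(2). */
static int seccomp_call(unsigned int op, unsigned int flags, void *args)
{
    return (int) syscall(SYS_seccomp, op, flags, args);
}

/* Assumption: use the larger of the size reported by the kernel and
   the size of the structure in the headers we were compiled against. */
static size_t buf_size(size_t kernel_size, size_t header_size)
{
    return kernel_size > header_size ? kernel_size : header_size;
}

/* Allocate zero-filled request and response buffers.
   Returns 0 on success, or -1 if SECCOMP_GET_NOTIF_SIZES or the
   allocation fails. On success the caller must free() both buffers. */
static int alloc_notif_buffers(struct seccomp_notif **req,
                               struct seccomp_notif_resp **resp)
{
    struct seccomp_notif_sizes sizes;

    if (seccomp_call(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
        return -1;

    *req = calloc(1, buf_size(sizes.seccomp_notif, sizeof(**req)));
    *resp = calloc(1, buf_size(sizes.seccomp_notif_resp, sizeof(**resp)));
    if (*req == NULL || *resp == NULL) {
        free(*req);
        free(*resp);
        return -1;
    }
    return 0;
}
```

A supervisor would typically call alloc_notif_buffers() once at startup and reuse the two buffers for every notification.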

IOCTL OPERATIONS
175 The following ioctl(2) operations are supported by the seccomp user-
176 space notification file descriptor. For each of these operations, the
177 first (file descriptor) argument of ioctl(2) is the listening file de‐
178 scriptor returned by a call to seccomp(2) with the SECCOMP_FIL‐
179 TER_FLAG_NEW_LISTENER flag.
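
The following sketch shows one way a target might obtain that listening file descriptor. The function names and the FILTER_LEN constant are hypothetical. For brevity the filter omits the architecture check that a real filter should perform first (see seccomp(2)), and it assumes that setting the no_new_privs attribute is acceptable for the caller.

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define FILTER_LEN 4

/* Fill 'insns' with a filter that returns SECCOMP_RET_USER_NOTIF for
   system call number 'nr' and SECCOMP_RET_ALLOW for everything else.
   NOTE: the seccomp_data.arch check is deliberately omitted here. */
static void build_notify_filter(struct sock_filter insns[FILTER_LEN], int nr)
{
    const struct sock_filter prog[FILTER_LEN] = {
        /* Load the system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* Equal to 'nr': fall through; otherwise skip one instruction */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int) nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    for (int i = 0; i < FILTER_LEN; i++)
        insns[i] = prog[i];
}

/* Install the filter on the calling thread.
   Returns the listening file descriptor, or -1 on error. */
static int install_notify_filter(int nr)
{
    struct sock_filter insns[FILTER_LEN];
    struct sock_fprog prog = { .len = FILTER_LEN, .filter = insns };

    build_notify_filter(insns, nr);

    /* Needed unless the caller has CAP_SYS_ADMIN; see seccomp(2) */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

The descriptor returned by install_notify_filter() is the one that must then be passed to the supervisor.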
180
181 SECCOMP_IOCTL_NOTIF_RECV
182 The SECCOMP_IOCTL_NOTIF_RECV operation (available since Linux 5.0) is
183 used to obtain a user-space notification event. If no such event is
184 currently pending, the operation blocks until an event occurs. The
185 third ioctl(2) argument is a pointer to a structure of the following
186 form which contains information about the event. This structure must
187 be zeroed out before the call.
188
    struct seccomp_notif {
        __u64  id;                  /* Cookie */
        __u32  pid;                 /* TID of target thread */
        __u32  flags;               /* Currently unused (0) */
        struct seccomp_data data;   /* See seccomp(2) */
    };
195
196 The fields in this structure are as follows:
197
198 id This is a cookie for the notification. Each such cookie is
199 guaranteed to be unique for the corresponding seccomp filter.
200
201 • The cookie can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID
202 ioctl(2) operation described below.
203
204 • When returning a notification response to the kernel, the su‐
205 pervisor must include the cookie value in the seccomp_no‐
206 tif_resp structure that is specified as the argument of the
207 SECCOMP_IOCTL_NOTIF_SEND operation.
208
209 pid This is the thread ID of the target thread that triggered the
210 notification event.
211
212 flags This is a bit mask of flags providing further information on the
213 event. In the current implementation, this field is always
214 zero.
215
216 data This is a seccomp_data structure containing information about
217 the system call that triggered the notification. This is the
218 same structure that is passed to the seccomp filter. See sec‐
219 comp(2) for details of this structure.
220
221 On success, this operation returns 0; on failure, -1 is returned, and
222 errno is set to indicate the cause of the error. This operation can
223 fail with the following errors:
224
225 EINVAL (since Linux 5.5)
226 The seccomp_notif structure that was passed to the call con‐
227 tained nonzero fields.
228
229 ENOENT The target thread was killed by a signal as the notification in‐
230 formation was being generated, or the target's (blocked) system
231 call was interrupted by a signal handler.
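
A minimal sketch of a receive helper that follows the rules above (zero the structure first, treat ENOENT as "this notification is gone, try again") might look as follows. The function name and its return convention are hypothetical. Treating EINTR as retryable is an assumption, made on the basis that a blocking ioctl(2) in the supervisor can itself be interrupted by a signal.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

/* Fetch one notification into 'req' (a buffer of 'req_size' bytes).
   Returns  1 if 'req' now describes a pending notification,
            0 if the caller should simply try again,
           -1 on any other error (errno is left as set by ioctl(2)). */
static int recv_notification(int notify_fd, struct seccomp_notif *req,
                             size_t req_size)
{
    /* The structure must be zeroed out before the call */
    memset(req, 0, req_size);

    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == 0)
        return 1;

    /* ENOENT: target killed or its system call interrupted.
       EINTR (assumption): the supervisor itself caught a signal. */
    if (errno == ENOENT || errno == EINTR)
        return 0;

    return -1;
}
```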
232
233 SECCOMP_IOCTL_NOTIF_ID_VALID
234 The SECCOMP_IOCTL_NOTIF_ID_VALID operation (available since Linux 5.0)
235 is used to check that a notification ID returned by an earlier SEC‐
236 COMP_IOCTL_NOTIF_RECV operation is still valid (i.e., that the target
237 still exists and its system call is still blocked waiting for a re‐
238 sponse).
239
240 The third ioctl(2) argument is a pointer to the cookie (id) returned by
241 the SECCOMP_IOCTL_NOTIF_RECV operation.
242
This operation is necessary to avoid race conditions that can occur
when the thread identified by the pid returned by the
SECCOMP_IOCTL_NOTIF_RECV operation terminates, and its ID is reused
by another thread or process. An example of this kind of race is the
following:
247
248 (1) A notification is generated on the listening file descriptor. The
249 returned seccomp_notif contains the TID of the target thread (in
250 the pid field of the structure).
251
252 (2) The target terminates.
253
254 (3) Another thread or process is created on the system that by chance
255 reuses the TID that was freed when the target terminated.
256
(4) The supervisor open(2)s the /proc/tid/mem file for the TID
obtained in step 1, with the intention of (say) inspecting the
memory location(s) that contain the argument(s) of the system call
that triggered the notification in step 1.
261
262 In the above scenario, the risk is that the supervisor may try to ac‐
263 cess the memory of a process other than the target. This race can be
264 avoided by following the call to open(2) with a SECCOMP_IOCTL_NO‐
265 TIF_ID_VALID operation to verify that the process that generated the
266 notification is still alive. (Note that if the target terminates after
267 the latter step, a subsequent read(2) from the file descriptor may re‐
268 turn 0, indicating end of file.)
269
270 See NOTES for a discussion of other cases where SECCOMP_IOCTL_NO‐
271 TIF_ID_VALID checks must be performed.
272
273 On success (i.e., the notification ID is still valid), this operation
274 returns 0. On failure (i.e., the notification ID is no longer valid),
275 -1 is returned, and errno is set to ENOENT.
276
277 SECCOMP_IOCTL_NOTIF_SEND
The SECCOMP_IOCTL_NOTIF_SEND operation (available since Linux 5.0) is
used to send a notification response back to the kernel. The third
ioctl(2) argument of this operation is a pointer to a structure of the
following form:
282
    struct seccomp_notif_resp {
        __u64 id;       /* Cookie value */
        __s64 val;      /* Success return value */
        __s32 error;    /* 0 (success) or negative error number */
        __u32 flags;    /* See below */
    };
289
290 The fields of this structure are as follows:
291
292 id This is the cookie value that was obtained using the SEC‐
293 COMP_IOCTL_NOTIF_RECV operation. This cookie value allows the
294 kernel to correctly associate this response with the system call
295 that triggered the user-space notification.
296
297 val This is the value that will be used for a spoofed success return
298 for the target's system call; see below.
299
300 error This is the value that will be used as the error number (errno)
301 for a spoofed error return for the target's system call; see be‐
302 low.
303
304 flags This is a bit mask that includes zero or more of the following
305 flags:
306
307 SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
308 Tell the kernel to execute the target's system call.
309
310 Two kinds of response are possible:
311
312 • A response to the kernel telling it to execute the target's system
313 call. In this case, the flags field includes SECCOMP_USER_NO‐
314 TIF_FLAG_CONTINUE and the error and val fields must be zero.
315
316 This kind of response can be useful in cases where the supervisor
317 needs to do deeper analysis of the target's system call than is pos‐
318 sible from a seccomp filter (e.g., examining the values of pointer
319 arguments), and, having decided that the system call does not re‐
320 quire emulation by the supervisor, the supervisor wants the system
321 call to be executed normally in the target.
322
323 The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used with cau‐
324 tion; see NOTES.
325
326 • A spoofed return value for the target's system call. In this case,
327 the kernel does not execute the target's system call, instead caus‐
328 ing the system call to return a spoofed value as specified by fields
329 of the seccomp_notif_resp structure. The supervisor should set the
330 fields of this structure as follows:
331
332 + flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.
333
334 + error is set either to 0 for a spoofed "success" return or to a
335 negative error number for a spoofed "failure" return. In the
336 former case, the kernel causes the target's system call to return
337 the value specified in the val field. In the latter case, the
338 kernel causes the target's system call to return -1, and errno is
339 assigned the negated error value.
340
341 + val is set to a value that will be used as the return value for a
342 spoofed "success" return for the target's system call. The value
343 in this field is ignored if the error field contains a nonzero
344 value.
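
The two kinds of response can be sketched as small helpers that fill in a seccomp_notif_resp structure. The helper names are hypothetical. The fallback definition of SECCOMP_USER_NOTIF_FLAG_CONTINUE is only there for builds against headers older than Linux 5.5.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Spoofed "success": the target's system call returns 'val' */
static void resp_success(struct seccomp_notif_resp *resp, __u64 id,
                         __s64 val)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->val = val;
}

/* Spoofed "failure": the target's system call returns -1 with errno
   set to 'err'. 'err' is a positive errno value such as EPERM; it is
   negated here because the kernel expects a negative error number. */
static void resp_failure(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->error = -err;
}

/* Let the kernel execute the target's system call; 'error' and 'val'
   must be zero in this case. Use with caution (see NOTES). */
static void resp_continue(struct seccomp_notif_resp *resp, __u64 id)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = id;
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Send the prepared response; returns 0 on success, -1 on error */
static int send_response(int notify_fd, struct seccomp_notif_resp *resp)
{
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```

For example, resp_failure(resp, req->id, EPERM) followed by send_response(notify_fd, resp) makes the target's call fail with EPERM.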
345
346 On success, this operation returns 0; on failure, -1 is returned, and
347 errno is set to indicate the cause of the error. This operation can
348 fail with the following errors:
349
350 EINPROGRESS
351 A response to this notification has already been sent.
352
353 EINVAL An invalid value was specified in the flags field.
354
355 EINVAL The flags field contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and
356 the error or val field was not zero.
357
358 ENOENT The blocked system call in the target has been interrupted by a
359 signal handler or the target has terminated.
360
361 SECCOMP_IOCTL_NOTIF_ADDFD
362 The SECCOMP_IOCTL_NOTIF_ADDFD operation (available since Linux 5.9) al‐
363 lows the supervisor to install a file descriptor into the target's file
364 descriptor table. Much like the use of SCM_RIGHTS messages described
365 in unix(7), this operation is semantically equivalent to duplicating a
366 file descriptor from the supervisor's file descriptor table into the
367 target's file descriptor table.
368
369 The SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to emu‐
370 late a target system call (such as socket(2) or openat(2)) that gener‐
371 ates a file descriptor. The supervisor can perform the system call
372 that generates the file descriptor (and associated open file descrip‐
373 tion) and then use this operation to allocate a file descriptor that
374 refers to the same open file description in the target. (For an expla‐
375 nation of open file descriptions, see open(2).)
376
377 Once this operation has been performed, the supervisor can close its
378 copy of the file descriptor.
379
380 In the target, the received file descriptor is subject to the same
381 Linux Security Module (LSM) checks as are applied to a file descriptor
382 that is received in an SCM_RIGHTS ancillary message. If the file de‐
383 scriptor refers to a socket, it inherits the cgroup version 1 network
384 controller settings (classid and netprioidx) of the target.
385
386 The third ioctl(2) argument is a pointer to a structure of the follow‐
387 ing form:
388
    struct seccomp_notif_addfd {
        __u64 id;           /* Cookie value */
        __u32 flags;        /* Flags */
        __u32 srcfd;        /* Local file descriptor number */
        __u32 newfd;        /* 0 or desired file descriptor
                               number in target */
        __u32 newfd_flags;  /* Flags to set on target file
                               descriptor */
    };
398
399 The fields in this structure are as follows:
400
401 id This field should be set to the notification ID (cookie value)
402 that was obtained via SECCOMP_IOCTL_NOTIF_RECV.
403
flags This field is a bit mask of flags that modify the behavior of
the operation. The following flags are supported:
406
407 SECCOMP_ADDFD_FLAG_SETFD
408 When allocating the file descriptor in the target, use
409 the file descriptor number specified in the newfd field.
410
411 SECCOMP_ADDFD_FLAG_SEND (since Linux 5.14)
412 Perform the equivalent of SECCOMP_IOCTL_NOTIF_ADDFD plus
413 SECCOMP_IOCTL_NOTIF_SEND as an atomic operation. On suc‐
414 cessful invocation, the target process's errno will be 0
415 and the return value will be the file descriptor number
416 that was allocated in the target. If allocating the file
417 descriptor in the target fails, the target's system call
418 continues to be blocked until a successful response is
419 sent.
420
421 srcfd This field should be set to the number of the file descriptor in
422 the supervisor that is to be duplicated.
423
424 newfd This field determines which file descriptor number is allocated
425 in the target. If the SECCOMP_ADDFD_FLAG_SETFD flag is set,
426 then this field specifies which file descriptor number should be
427 allocated. If this file descriptor number is already open in
428 the target, it is atomically closed and reused. If the descrip‐
429 tor duplication fails due to an LSM check, or if srcfd is not a
430 valid file descriptor, the file descriptor newfd will not be
431 closed in the target process.
432
If the SECCOMP_ADDFD_FLAG_SETFD flag is not set, then this field
must be 0, and the kernel allocates the lowest unused file
descriptor number in the target.
436
437 newfd_flags
438 This field is a bit mask specifying flags that should be set on
439 the file descriptor that is received in the target process.
440 Currently, only the following flag is implemented:
441
442 O_CLOEXEC
443 Set the close-on-exec flag on the received file descrip‐
444 tor.
445
446 On success, this ioctl(2) call returns the number of the file descrip‐
447 tor that was allocated in the target. Assuming that the emulated sys‐
448 tem call is one that returns a file descriptor as its function result
449 (e.g., socket(2)), this value can be used as the return value
450 (resp.val) that is supplied in the response that is subsequently sent
451 with the SECCOMP_IOCTL_NOTIF_SEND operation.
452
453 On error, -1 is returned and errno is set to indicate the cause of the
454 error.
455
456 This operation can fail with the following errors:
457
458 EBADF Allocating the file descriptor in the target would cause the
459 target's RLIMIT_NOFILE limit to be exceeded (see getrlimit(2)).
460
EBUSY If the flag SECCOMP_ADDFD_FLAG_SEND is used, this means the
operation can't proceed until other SECCOMP_IOCTL_NOTIF_ADDFD
requests are processed.
464
465 EINPROGRESS
466 The user-space notification specified in the id field exists but
467 has not yet been fetched (by a SECCOMP_IOCTL_NOTIF_RECV) or has
468 already been responded to (by a SECCOMP_IOCTL_NOTIF_SEND).
469
470 EINVAL An invalid flag was specified in the flags or newfd_flags field,
471 or the newfd field is nonzero and the SECCOMP_ADDFD_FLAG_SETFD
472 flag was not specified in the flags field.
473
474 EMFILE The file descriptor number specified in newfd exceeds the limit
475 specified in /proc/sys/fs/nr_open.
476
477 ENOENT The blocked system call in the target has been interrupted by a
478 signal handler or the target has terminated.
479
Here is some sample code (with error handling omitted) that uses the
SECCOMP_IOCTL_NOTIF_ADDFD operation (here, to emulate a call to
openat(2)):
483
    int fd, targetFd;

    /* Perform the open in the supervisor on the target's behalf;
       'path' is assumed to hold the pathname argument that was
       already fetched from the target */
    fd = openat(req->data.args[0], path, req->data.args[2],
                req->data.args[3]);

    struct seccomp_notif_addfd addfd;
    addfd.id = req->id;     /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
    addfd.srcfd = fd;
    addfd.newfd = 0;        /* Must be 0: SECCOMP_ADDFD_FLAG_SETFD unset */
    addfd.flags = 0;        /* Kernel picks lowest unused descriptor */
    addfd.newfd_flags = O_CLOEXEC;

    targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);

    close(fd);              /* No longer needed in supervisor */

    struct seccomp_notif_resp *resp;
    /* Code to allocate 'resp' omitted */
    resp->id = req->id;
    resp->error = 0;        /* "Success" */
    resp->val = targetFd;   /* Descriptor number as seen by the target */
    resp->flags = 0;
    ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);

NOTES
509 One example use case for the user-space notification mechanism is to
510 allow a container manager (a process which is typically running with
511 more privilege than the processes inside the container) to mount block
devices or create device nodes for the container. The mount use case
provides an example of where the SECCOMP_USER_NOTIF_FLAG_CONTINUE
flag is useful. Upon receiving a notification for the mount(2)
system call, the container manager (the "supervisor") can distinguish
a request to mount a block filesystem (which would not be possible
for a "target" process inside the container) and mount that
filesystem. If, on the other hand, the container manager detects that the
519 operation could be performed by the process inside the container (e.g.,
520 a mount of a tmpfs(5) filesystem), it can notify the kernel that the
521 target process's mount(2) system call can continue.
522
523 select()/poll()/epoll semantics
524 The file descriptor returned when seccomp(2) is employed with the SEC‐
525 COMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using poll(2),
526 epoll(7), and select(2). These interfaces indicate that the file de‐
527 scriptor is ready as follows:
528
• When a notification is pending, these interfaces indicate that the
file descriptor is readable. Following such an indication, a
subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block,
returning either information about a notification or else failing
with the error ENOENT if the target has been killed by a signal or
its system call has been interrupted by a signal handler.
535
536 • After the notification has been received (i.e., by the SEC‐
537 COMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces indicate
538 that the file descriptor is writable, meaning that a notification
539 response can be sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) op‐
540 eration.
541
• After the last thread using the filter has terminated and been
reaped using waitpid(2) (or similar), the file descriptor indicates
an end-of-file condition (readable in select(2); POLLHUP/EPOLLHUP in
poll(2)/epoll_wait(2)).
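
The readiness rules above can be sketched with poll(2) as follows. The enum and function names are hypothetical. The sketch only classifies what poll(2) reports; it relies on nothing beyond the readable and hang-up indications described in the list.

```c
#include <poll.h>

enum notify_state {
    NOTIFY_TIMEOUT,     /* Nothing happened within the timeout */
    NOTIFY_PENDING,     /* Readable: SECCOMP_IOCTL_NOTIF_RECV won't block */
    NOTIFY_EOF,         /* Hang-up: no thread is using the filter anymore */
    NOTIFY_ERROR        /* poll(2) failed or reported an error condition */
};

/* Wait up to 'timeout_ms' milliseconds (-1 means wait indefinitely)
   for the listening file descriptor to become ready. */
static enum notify_state wait_for_notification(int notify_fd,
                                               int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n == -1)
        return NOTIFY_ERROR;
    if (n == 0)
        return NOTIFY_TIMEOUT;
    if (pfd.revents & POLLIN)
        return NOTIFY_PENDING;
    if (pfd.revents & POLLHUP)
        return NOTIFY_EOF;

    return NOTIFY_ERROR;
}
```

A supervisor loop might call this helper and issue SECCOMP_IOCTL_NOTIF_RECV only when NOTIFY_PENDING is returned, and exit when NOTIFY_EOF is returned.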
546
547 Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
548 The intent of the user-space notification feature is to allow system
549 calls to be performed on behalf of the target. The target's system
550 call should either be handled by the supervisor or allowed to continue
551 normally in the kernel (where standard security policies will be ap‐
552 plied).
553
554 Note well: this mechanism must not be used to make security policy de‐
555 cisions about the system call, which would be inherently race-prone for
556 reasons described next.
557
558 The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution.
559 If set by the supervisor, the target's system call will continue. How‐
560 ever, there is a time-of-check, time-of-use race here, since an at‐
561 tacker could exploit the interval of time where the target is blocked
562 waiting on the "continue" response to do things such as rewriting the
563 system call arguments.
564
565 Note furthermore that a user-space notifier can be bypassed if the ex‐
566 isting filters allow the use of seccomp(2) or prctl(2) to install a
567 filter that returns an action value with a higher precedence than SEC‐
568 COMP_RET_USER_NOTIF (see seccomp(2)).
569
It should thus be absolutely clear that the seccomp user-space
notification mechanism cannot be used to implement a security policy!
It should only ever be used in scenarios where a more privileged
process supervises the system calls of a less privileged target to
get around kernel-enforced security restrictions when the supervisor
deems this
575 safe. In other words, in order to continue a system call, the supervi‐
576 sor should be sure that another security mechanism or the kernel itself
577 will sufficiently block the system call if its arguments are rewritten
578 to something unsafe.
579
580 Caveats regarding the use of /proc/tid/mem
581 The discussion above noted the need to use the SECCOMP_IOCTL_NO‐
582 TIF_ID_VALID ioctl(2) when opening the /proc/tid/mem file of the target
583 to avoid the possibility of accessing the memory of the wrong process
584 in the event that the target terminates and its ID is recycled by an‐
585 other (unrelated) thread. However, the use of this ioctl(2) operation
586 is also necessary in other situations, as explained in the following
587 paragraphs.
588
589 Consider the following scenario, where the supervisor tries to read the
590 pathname argument of a target's blocked mount(2) system call:
591
592 (1) From one of its functions (func()), the target calls mount(2),
593 which triggers a user-space notification and causes the target to
594 block.
595
596 (2) The supervisor receives the notification, opens /proc/tid/mem, and
597 (successfully) performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.
598
599 (3) The target receives a signal, which causes the mount(2) to abort.
600
601 (4) The signal handler executes in the target, and returns.
602
603 (5) Upon return from the handler, the execution of func() resumes, and
604 it returns (and perhaps other functions are called, overwriting
605 the memory that had been used for the stack frame of func()).
606
607 (6) Using the address provided in the notification information, the
608 supervisor reads from the target's memory location that used to
609 contain the pathname.
610
611 (7) The supervisor now calls mount(2) with some arbitrary bytes ob‐
612 tained in the previous step.
613
614 The conclusion from the above scenario is this: since the target's
615 blocked system call may be interrupted by a signal handler, the super‐
616 visor must be written to expect that the target may abandon its system
617 call at any time; in such an event, any information that the supervisor
618 obtained from the target's memory must be considered invalid.
619
620 To prevent such scenarios, every read from the target's memory must be
621 separated from use of the bytes so obtained by a SECCOMP_IOCTL_NO‐
622 TIF_ID_VALID check. In the above example, the check would be placed
623 between the two final steps. An example of such a check is shown in
624 EXAMPLES.
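
A minimal sketch of the read-then-check pattern follows. The function names are hypothetical. The sketch assumes a 64-bit system (so that a target address fits in off_t), that /proc is mounted, and that the argument being fetched is a NUL-terminated string such as a pathname.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a NUL-terminated string located at 'addr' in the memory of
   thread 'tid' into 'buf' (capacity 'len' bytes).
   Returns the string length, or -1 if the read fails or no
   terminator is found within the bytes that were read. */
static ssize_t read_target_string(pid_t tid, __u64 addr, char *buf,
                                  size_t len)
{
    char path[64];
    ssize_t n;
    int fd;

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long) tid);
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    n = pread(fd, buf, len, (off_t) addr);
    close(fd);

    if (n <= 0 || strnlen(buf, (size_t) n) == (size_t) n)
        return -1;

    return (ssize_t) strlen(buf);
}

/* Fetch string argument number 'arg' of the target's system call and
   then verify that the notification is still valid. If the check
   fails, the bytes in 'buf' must be discarded, not acted upon. */
static ssize_t fetch_string_checked(int notify_fd,
                                    const struct seccomp_notif *req,
                                    int arg, char *buf, size_t len)
{
    __u64 id = req->id;
    ssize_t n = read_target_string((pid_t) req->pid,
                                   req->data.args[arg], buf, len);

    if (n == -1)
        return -1;

    /* The check sits between the read and any use of the bytes */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
        return -1;

    return n;
}
```

If the supervisor later reads more of the target's memory for the same notification, it would need to repeat the check after that read as well.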
625
626 Following on from the above, it should be clear that a write by the su‐
627 pervisor into the target's memory can never be considered safe.
628
629 Caveats regarding blocking system calls
630 Suppose that the target performs a blocking system call (e.g., ac‐
631 cept(2)) that the supervisor should handle. The supervisor might then
632 in turn execute the same blocking system call.
633
634 In this scenario, it is important to note that if the target's system
635 call is now interrupted by a signal, the supervisor is not informed of
636 this. If the supervisor does not take suitable steps to actively dis‐
637 cover that the target's system call has been canceled, various diffi‐
638 culties can occur. Taking the example of accept(2), the supervisor
639 might remain blocked in its accept(2) holding a port number that the
640 target (which, after the interruption by the signal handler, perhaps
641 closed its listening socket) might expect to be able to reuse in a
642 bind(2) call.
643
644 Therefore, when the supervisor wishes to emulate a blocking system
645 call, it must do so in such a way that it gets informed if the target's
646 system call is interrupted by a signal handler. For example, if the
647 supervisor itself executes the same blocking system call, then it could
648 employ a separate thread that uses the SECCOMP_IOCTL_NOTIF_ID_VALID op‐
649 eration to check if the target is still blocked in its system call.
650 Alternatively, in the accept(2) example, the supervisor might use
651 poll(2) to monitor both the notification file descriptor (so as to dis‐
652 cover when the target's accept(2) call has been interrupted) and the
653 listening file descriptor (so as to know when a connection is avail‐
654 able).
655
656 If the target's system call is interrupted, the supervisor must take
657 care to release resources (e.g., file descriptors) that it acquired on
658 behalf of the target.
659
660 Interaction with SA_RESTART signal handlers
661 Consider the following scenario:
662
663 (1) The target process has used sigaction(2) to install a signal han‐
664 dler with the SA_RESTART flag.
665
666 (2) The target has made a system call that triggered a seccomp user-
667 space notification and the target is currently blocked until the
668 supervisor sends a notification response.
669
670 (3) A signal is delivered to the target and the signal handler is exe‐
671 cuted.
672
(4) When (if) the supervisor attempts to send a notification response,
the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will fail with
the ENOENT error.
676
677 In this scenario, the kernel will restart the target's system call.
678 Consequently, the supervisor will receive another user-space notifica‐
679 tion. Thus, depending on how many times the blocked system call is in‐
680 terrupted by a signal handler, the supervisor may receive multiple no‐
681 tifications for the same instance of a system call in the target.
682
683 One oddity is that system call restarting as described in this scenario
684 will occur even for the blocking system calls listed in signal(7) that
685 would never normally be restarted by the SA_RESTART flag.
686
Furthermore, if the supervisor response is a file descriptor added with
SECCOMP_IOCTL_NOTIF_ADDFD, then the flag SECCOMP_ADDFD_FLAG_SEND can be
used to atomically add the file descriptor and return that value,
making sure no file descriptors are inadvertently leaked into the
target.

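A sketch of how such a request might be filled in is shown below. It
assumes kernel headers that define struct seccomp_notif_addfd; the
fallback definition of SECCOMP_ADDFD_FLAG_SEND is an assumption for
older headers, and the helper names prepareAddfdSend() and
addFdAndRespond() are hypothetical. With this flag, no separate
SECCOMP_IOCTL_NOTIF_SEND operation is issued for the notification.

```c
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

#ifndef SECCOMP_ADDFD_FLAG_SEND
#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1)  /* Assumed for old headers */
#endif

/* Fill in 'addfd' so that the supervisor's descriptor 'srcfd' is
   installed in the target and its number is returned as the result of
   the target's system call, in a single step. */

static void
prepareAddfdSend(struct seccomp_notif_addfd *addfd, __u64 id, int srcfd)
{
    memset(addfd, 0, sizeof(*addfd));
    addfd->id = id;                 /* Notification being answered */
    addfd->srcfd = srcfd;           /* Descriptor in the supervisor */
    addfd->flags = SECCOMP_ADDFD_FLAG_SEND;

    /* 'newfd' and 'newfd_flags' are left as 0, so the kernel chooses
       the descriptor number in the target */
}

/* Returns the descriptor number allocated in the target, or -1 on
   error. */

static int
addFdAndRespond(int notifyFd, __u64 id, int srcfd)
{
    struct seccomp_notif_addfd addfd;

    prepareAddfdSend(&addfd, id, srcfd);
    return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```
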
BUGS
If a SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation is performed after the
target terminates, then the ioctl(2) call simply blocks (rather than
returning an error to indicate that the target no longer exists).

EXAMPLES
The (somewhat contrived) program shown below demonstrates the use of
the interfaces described in this page. The program creates a child
process that serves as the "target" process. The child process
installs a seccomp filter that returns the SECCOMP_RET_USER_NOTIF
action value if a call is made to mkdir(2). The child process then
calls mkdir(2) once for each of the supplied command-line arguments,
and reports the result returned by the call. After processing all
arguments, the child process terminates.

The parent process acts as the supervisor, listening for the
notifications that are generated when the target process calls
mkdir(2). When such a notification occurs, the supervisor examines the
memory of the target process (using /proc/pid/mem) to discover the
pathname argument that was supplied to the mkdir(2) call, and performs
one of the following actions:

• If the pathname begins with the prefix "/tmp/", then the supervisor
  attempts to create the specified directory, and then spoofs a return
  for the target process based on the return value of the supervisor's
  mkdir(2) call. In the event that that call succeeds, the spoofed
  success return value is the length of the pathname.

• If the pathname begins with "./" (i.e., it is a relative pathname),
  the supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to
  the kernel to say that the kernel should execute the target
  process's mkdir(2) call.

• If the pathname begins with some other prefix, the supervisor spoofs
  an error return for the target process, so that the target process's
  mkdir(2) call appears to fail with the error EOPNOTSUPP ("Operation
  not supported"). Additionally, if the specified pathname is exactly
  "/bye", then the supervisor terminates.

This program can be used to demonstrate various aspects of the behavior
of the seccomp user-space notification mechanism. To aid such
demonstrations, the program logs various messages to show the operation
of the target process (lines prefixed "T:") and the supervisor
(indented lines prefixed "S:").

In the following example, the target attempts to create the directory
/tmp/x. Upon receiving the notification, the supervisor creates the
directory on the target's behalf, and spoofs a success return to be
received by the target process's mkdir(2) call.

    $ ./seccomp_unotify /tmp/x
    T: PID = 23168

    T: about to mkdir("/tmp/x")
            S: got notification (ID 0x17445c4a0f4e0e3c) for PID 23168
            S: executing: mkdir("/tmp/x", 0700)
            S: success! spoofed return = 6
            S: sending response (flags = 0; val = 6; error = 0)
    T: SUCCESS: mkdir(2) returned 6

    T: terminating
            S: target has terminated; bye

In the above output, note that the spoofed return value seen by the
target process is 6 (the length of the pathname /tmp/x), whereas a
normal mkdir(2) call returns 0 on success.

In the next example, the target attempts to create a directory using
the relative pathname ./sub. Since this pathname starts with "./", the
supervisor sends a SECCOMP_USER_NOTIF_FLAG_CONTINUE response to the
kernel, and the kernel then (successfully) executes the target
process's mkdir(2) call.

    $ ./seccomp_unotify ./sub
    T: PID = 23204

    T: about to mkdir("./sub")
            S: got notification (ID 0xddb16abe25b4c12) for PID 23204
            S: target can execute system call
            S: sending response (flags = 0x1; val = 0; error = 0)
    T: SUCCESS: mkdir(2) returned 0

    T: terminating
            S: target has terminated; bye

If the target process attempts to create a directory with a pathname
that doesn't start with "./" and doesn't begin with the prefix "/tmp/",
then the supervisor spoofs an error return (EOPNOTSUPP, "Operation not
supported") for the target's mkdir(2) call (which is not executed):

    $ ./seccomp_unotify /xxx
    T: PID = 23178

    T: about to mkdir("/xxx")
            S: got notification (ID 0xe7dc095d1c524e80) for PID 23178
            S: spoofing error response (Operation not supported)
            S: sending response (flags = 0; val = 0; error = -95)
    T: ERROR: mkdir(2): Operation not supported

    T: terminating
            S: target has terminated; bye

In the next example, the target process attempts to create a directory
with the pathname /tmp/nosuchdir/b. Upon receiving the notification,
the supervisor attempts to create that directory, but the mkdir(2) call
fails because the directory /tmp/nosuchdir does not exist.
Consequently, the supervisor spoofs an error return that passes the
error that it received back to the target process's mkdir(2) call.

    $ ./seccomp_unotify /tmp/nosuchdir/b
    T: PID = 23199

    T: about to mkdir("/tmp/nosuchdir/b")
            S: got notification (ID 0x8744454293506046) for PID 23199
            S: executing: mkdir("/tmp/nosuchdir/b", 0700)
            S: failure! (errno = 2; No such file or directory)
            S: sending response (flags = 0; val = 0; error = -2)
    T: ERROR: mkdir(2): No such file or directory

    T: terminating
            S: target has terminated; bye

If the supervisor receives a notification and sees that the argument of
the target's mkdir(2) is the string "/bye", then (as well as spoofing
an EOPNOTSUPP error), the supervisor terminates. If the target process
subsequently executes another mkdir(2) that triggers its seccomp filter
to return the SECCOMP_RET_USER_NOTIF action value, then the kernel
causes the target process's system call to fail with the error ENOSYS
("Function not implemented"). This is demonstrated by the following
example:

    $ ./seccomp_unotify /bye /tmp/y
    T: PID = 23185

    T: about to mkdir("/bye")
            S: got notification (ID 0xa81236b1d2f7b0f4) for PID 23185
            S: spoofing error response (Operation not supported)
            S: sending response (flags = 0; val = 0; error = -95)
            S: terminating **********
    T: ERROR: mkdir(2): Operation not supported

    T: about to mkdir("/tmp/y")
    T: ERROR: mkdir(2): Function not implemented

    T: terminating

Program source
    #define _GNU_SOURCE
    #include <err.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <linux/audit.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <signal.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/prctl.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define ARRAY_SIZE(arr)  (sizeof(arr) / sizeof((arr)[0]))

    /* Send the file descriptor 'fd' over the connected UNIX domain socket
       'sockfd'. Returns 0 on success, or -1 on error. */

    static int
    sendfd(int sockfd, int fd)
    {
        int data;
        struct iovec iov;
        struct msghdr msgh;
        struct cmsghdr *cmsgp;

        /* Allocate a char array of suitable size to hold the ancillary data.
           However, since this buffer is in reality a 'struct cmsghdr', use a
           union to ensure that it is suitably aligned. */
        union {
            char buf[CMSG_SPACE(sizeof(int))];
                            /* Space large enough to hold an 'int' */
            struct cmsghdr align;
        } controlMsg;

        /* The 'msg_name' field can be used to specify the address of the
           destination socket when sending a datagram. However, we do not
           need to use this field because 'sockfd' is a connected socket. */

        msgh.msg_name = NULL;
        msgh.msg_namelen = 0;

        /* On Linux, we must transmit at least one byte of real data in
           order to send ancillary data. We transmit an arbitrary integer
           whose value is ignored by recvfd(). */

        msgh.msg_iov = &iov;
        msgh.msg_iovlen = 1;
        iov.iov_base = &data;
        iov.iov_len = sizeof(int);
        data = 12345;

        /* Set 'msghdr' fields that describe ancillary data */

        msgh.msg_control = controlMsg.buf;
        msgh.msg_controllen = sizeof(controlMsg.buf);

        /* Set up ancillary data describing file descriptor to send */

        cmsgp = CMSG_FIRSTHDR(&msgh);
        cmsgp->cmsg_level = SOL_SOCKET;
        cmsgp->cmsg_type = SCM_RIGHTS;
        cmsgp->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsgp), &fd, sizeof(int));

        /* Send real plus ancillary data */

        if (sendmsg(sockfd, &msgh, 0) == -1)
            return -1;

        return 0;
    }

    /* Receive a file descriptor on a connected UNIX domain socket. Returns
       the received file descriptor on success, or -1 on error. */

    static int
    recvfd(int sockfd)
    {
        int data, fd;
        ssize_t nr;
        struct iovec iov;
        struct msghdr msgh;

        /* Allocate a char buffer for the ancillary data. See the comments
           in sendfd() */
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } controlMsg;
        struct cmsghdr *cmsgp;

        /* The 'msg_name' field can be used to obtain the address of the
           sending socket. However, we do not need this information. */

        msgh.msg_name = NULL;
        msgh.msg_namelen = 0;

        /* Specify buffer for receiving real data */

        msgh.msg_iov = &iov;
        msgh.msg_iovlen = 1;
        iov.iov_base = &data;       /* Real data is an 'int' */
        iov.iov_len = sizeof(int);

        /* Set 'msghdr' fields that describe ancillary data */

        msgh.msg_control = controlMsg.buf;
        msgh.msg_controllen = sizeof(controlMsg.buf);

        /* Receive real plus ancillary data; real data is ignored */

        nr = recvmsg(sockfd, &msgh, 0);
        if (nr == -1)
            return -1;

        cmsgp = CMSG_FIRSTHDR(&msgh);

        /* Check the validity of the 'cmsghdr' */

        if (cmsgp == NULL
            || cmsgp->cmsg_len != CMSG_LEN(sizeof(int))
            || cmsgp->cmsg_level != SOL_SOCKET
            || cmsgp->cmsg_type != SCM_RIGHTS)
        {
            errno = EINVAL;
            return -1;
        }

        /* Return the received file descriptor to our caller */

        memcpy(&fd, CMSG_DATA(cmsgp), sizeof(int));
        return fd;
    }

    static void
    sigchldHandler(int sig)
    {
        char msg[] = "\tS: target has terminated; bye\n";

        write(STDOUT_FILENO, msg, sizeof(msg) - 1);
        _exit(EXIT_SUCCESS);
    }

    static int
    seccomp(unsigned int operation, unsigned int flags, void *args)
    {
        return syscall(SYS_seccomp, operation, flags, args);
    }

    /* The following is the x86-64-specific BPF boilerplate code for checking
       that the BPF program is running on the right architecture + ABI. At
       completion of these instructions, the accumulator contains the system
       call number. */

    /* For the x32 ABI, all system call numbers have bit 30 set */

    #define X32_SYSCALL_BIT 0x40000000

    #define X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR \
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                 (offsetof(struct seccomp_data, arch))), \
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 2), \
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, \
                 (offsetof(struct seccomp_data, nr))), \
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1), \
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS)

    /* installNotifyFilter() installs a seccomp filter that generates
       user-space notifications (SECCOMP_RET_USER_NOTIF) when the process
       calls mkdir(2); the filter allows all other system calls.

       The function return value is a file descriptor from which the
       user-space notifications can be fetched. */

    static int
    installNotifyFilter(void)
    {
        int notifyFd;

        struct sock_filter filter[] = {
            X86_64_CHECK_ARCH_AND_LOAD_SYSCALL_NR,

            /* mkdir() triggers notification to user-space supervisor */

            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mkdir, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),

            /* Every other system call is allowed */

            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };

        struct sock_fprog prog = {
            .len = ARRAY_SIZE(filter),
            .filter = filter,
        };

        /* Install the filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
           as a result, seccomp() returns a notification file descriptor. */

        notifyFd = seccomp(SECCOMP_SET_MODE_FILTER,
                           SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
        if (notifyFd == -1)
            err(EXIT_FAILURE, "seccomp-install-notify-filter");

        return notifyFd;
    }

    /* Close a pair of sockets created by socketpair() */

    static void
    closeSocketPair(int sockPair[2])
    {
        if (close(sockPair[0]) == -1)
            err(EXIT_FAILURE, "closeSocketPair-close-0");
        if (close(sockPair[1]) == -1)
            err(EXIT_FAILURE, "closeSocketPair-close-1");
    }

    /* Implementation of the target process; create a child process that:

       (1) installs a seccomp filter with the
           SECCOMP_FILTER_FLAG_NEW_LISTENER flag;
       (2) writes the seccomp notification file descriptor returned from
           the previous step onto the UNIX domain socket, 'sockPair[0]';
       (3) calls mkdir(2) for each element of 'argv'.

       The function return value in the parent is the PID of the child
       process; the child does not return from this function. */

    static pid_t
    targetProcess(int sockPair[2], char *argv[])
    {
        int notifyFd, s;
        pid_t targetPid;

        targetPid = fork();

        if (targetPid == -1)
            err(EXIT_FAILURE, "fork");

        if (targetPid > 0)          /* In parent, return PID of child */
            return targetPid;

        /* Child falls through to here */

        printf("T: PID = %ld\n", (long) getpid());

        /* Install seccomp filter(s) */

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
            err(EXIT_FAILURE, "prctl");

        notifyFd = installNotifyFilter();

        /* Pass the notification file descriptor to the supervisor process
           over a UNIX domain socket */

        if (sendfd(sockPair[0], notifyFd) == -1)
            err(EXIT_FAILURE, "sendfd");

        /* Notification and socket FDs are no longer needed in target */

        if (close(notifyFd) == -1)
            err(EXIT_FAILURE, "close-target-notify-fd");

        closeSocketPair(sockPair);

        /* Perform a mkdir() call for each of the command-line arguments */

        for (char **ap = argv; *ap != NULL; ap++) {
            printf("\nT: about to mkdir(\"%s\")\n", *ap);

            s = mkdir(*ap, 0700);
            if (s == -1)
                perror("T: ERROR: mkdir(2)");
            else
                printf("T: SUCCESS: mkdir(2) returned %d\n", s);
        }

        printf("\nT: terminating\n");
        exit(EXIT_SUCCESS);
    }

    /* Check that the notification ID provided by a SECCOMP_IOCTL_NOTIF_RECV
       operation is still valid. It will no longer be valid if the target
       process has terminated or is no longer blocked in the system call that
       generated the notification (because it was interrupted by a signal).

       This operation can be used when doing such things as accessing
       /proc/PID files in the target process in order to avoid TOCTOU race
       conditions where the PID that is returned by SECCOMP_IOCTL_NOTIF_RECV
       terminates and is reused by another process. */

    static bool
    cookieIsValid(int notifyFd, uint64_t id)
    {
        return ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == 0;
    }

    /* Access the memory of the target process in order to fetch the
       pathname referred to by the system call argument 'argNum' in
       'req->data.args[]'. The pathname is returned in 'path',
       a buffer of 'len' bytes allocated by the caller.

       Returns true if the pathname is successfully fetched, and false
       otherwise. For possible causes of failure, see the comments below. */

    static bool
    getTargetPathname(struct seccomp_notif *req, int notifyFd,
                      int argNum, char *path, size_t len)
    {
        int procMemFd;
        char procMemPath[PATH_MAX];
        ssize_t nread;

        snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid);

        procMemFd = open(procMemPath, O_RDONLY | O_CLOEXEC);
        if (procMemFd == -1)
            return false;

        /* Check that the process whose info we are accessing is still alive
           and blocked in the system call that caused the notification.
           If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed in
           cookieIsValid()) succeeded, we know that the /proc/PID/mem file
           descriptor that we opened corresponded to the process for which we
           received a notification. If that process subsequently terminates,
           then read() on that file descriptor will return 0 (EOF). */

        if (!cookieIsValid(notifyFd, req->id)) {
            close(procMemFd);
            return false;
        }

        /* Read bytes at the location containing the pathname argument */

        nread = pread(procMemFd, path, len, req->data.args[argNum]);

        close(procMemFd);

        if (nread <= 0)
            return false;

        /* Once again check that the notification ID is still valid. The
           case we are particularly concerned about here is that just
           before we fetched the pathname, the target's blocked system
           call was interrupted by a signal handler, and after the handler
           returned, the target carried on execution (past the interrupted
           system call). In that case, we have no guarantees about what we
           are reading, since the target's memory may have been arbitrarily
           changed by subsequent operations. */

        if (!cookieIsValid(notifyFd, req->id)) {
            perror("\tS: notification ID check failed!!!");
            return false;
        }

        /* Even if the target's system call was not interrupted by a signal,
           we have no guarantees about what was in the memory of the target
           process. (The memory may have been modified by another thread, or
           even by an external attacking process.) We therefore treat the
           buffer returned by pread() as untrusted input. The buffer should
           contain a terminating null byte; if not, then we will trigger an
           error for the target process. */

        if (strnlen(path, nread) < nread)
            return true;

        return false;
    }

    /* Allocate buffers for the seccomp user-space notification request and
       response structures. It is the caller's responsibility to free the
       buffers returned via 'req' and 'resp'. */

    static void
    allocSeccompNotifBuffers(struct seccomp_notif **req,
                             struct seccomp_notif_resp **resp,
                             struct seccomp_notif_sizes *sizes)
    {
        size_t resp_size;

        /* Discover the sizes of the structures that are used to receive
           notifications and send notification responses, and allocate
           buffers of those sizes. */

        if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
            err(EXIT_FAILURE, "seccomp-SECCOMP_GET_NOTIF_SIZES");

        *req = malloc(sizes->seccomp_notif);
        if (*req == NULL)
            err(EXIT_FAILURE, "malloc-seccomp_notif");

        /* When allocating the response buffer, we must allow for the fact
           that the user-space binary may have been built with user-space
           headers where 'struct seccomp_notif_resp' is bigger than the
           response buffer expected by the (older) kernel. Therefore, we
           allocate a buffer that is the maximum of the two sizes. This
           ensures that if the supervisor places bytes into the response
           structure that are past the response size that the kernel expects,
           then the supervisor is not touching an invalid memory location. */

        resp_size = sizes->seccomp_notif_resp;
        if (sizeof(struct seccomp_notif_resp) > resp_size)
            resp_size = sizeof(struct seccomp_notif_resp);

        *resp = malloc(resp_size);
        if (*resp == NULL)
            err(EXIT_FAILURE, "malloc-seccomp_notif_resp");
    }

    /* Handle notifications that arrive via the SECCOMP_RET_USER_NOTIF file
       descriptor, 'notifyFd'. */

    static void
    handleNotifications(int notifyFd)
    {
        bool pathOK;
        char path[PATH_MAX];
        struct seccomp_notif *req;
        struct seccomp_notif_resp *resp;
        struct seccomp_notif_sizes sizes;

        allocSeccompNotifBuffers(&req, &resp, &sizes);

        /* Loop handling notifications */

        for (;;) {

            /* Wait for next notification, returning info in '*req' */

            memset(req, 0, sizes.seccomp_notif);
            if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1) {
                if (errno == EINTR)
                    continue;
                err(EXIT_FAILURE, "\tS: ioctl-SECCOMP_IOCTL_NOTIF_RECV");
            }

            printf("\tS: got notification (ID %#llx) for PID %d\n",
                   req->id, req->pid);

            /* The only system call that can generate a notification event
               is mkdir(2). Nevertheless, we check that the notified system
               call is indeed mkdir() as kind of future-proofing of this
               code in case the seccomp filter is later modified to
               generate notifications for other system calls. */

            if (req->data.nr != SYS_mkdir) {
                printf("\tS: notification contained unexpected "
                       "system call number; bye!!!\n");
                exit(EXIT_FAILURE);
            }

            pathOK = getTargetPathname(req, notifyFd, 0, path, sizeof(path));

            /* Prepopulate some fields of the response */

            resp->id = req->id;     /* Response includes notification ID */
            resp->flags = 0;
            resp->val = 0;

            /* If getTargetPathname() failed, trigger an EINVAL error
               response (sending this response may yield an error if the
               failure occurred because the notification ID was no longer
               valid); if the directory is in /tmp, then create it on behalf
               of the target; if the pathname starts with "./", tell the
               kernel to let the target process execute the mkdir();
               otherwise, give an error for a directory pathname in any other
               location. */

            if (!pathOK) {
                resp->error = -EINVAL;
                printf("\tS: spoofing error for invalid pathname (%s)\n",
                       strerror(-resp->error));
            } else if (strncmp(path, "/tmp/", strlen("/tmp/")) == 0) {
                printf("\tS: executing: mkdir(\"%s\", %#llo)\n",
                       path, req->data.args[1]);

                if (mkdir(path, req->data.args[1]) == 0) {
                    resp->error = 0;            /* "Success" */
                    resp->val = strlen(path);   /* Used as return value of
                                                   mkdir() in target */
                    printf("\tS: success! spoofed return = %lld\n",
                           resp->val);
                } else {

                    /* If mkdir() failed in the supervisor, pass the error
                       back to the target */

                    resp->error = -errno;
                    printf("\tS: failure! (errno = %d; %s)\n", errno,
                           strerror(errno));
                }
            } else if (strncmp(path, "./", strlen("./")) == 0) {
                resp->error = resp->val = 0;
                resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
                printf("\tS: target can execute system call\n");
            } else {
                resp->error = -EOPNOTSUPP;
                printf("\tS: spoofing error response (%s)\n",
                       strerror(-resp->error));
            }

            /* Send a response to the notification */

            printf("\tS: sending response "
                   "(flags = %#x; val = %lld; error = %d)\n",
                   resp->flags, resp->val, resp->error);

            if (ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1) {
                if (errno == ENOENT)
                    printf("\tS: response failed with ENOENT; "
                           "perhaps target process's syscall was "
                           "interrupted by a signal?\n");
                else
                    perror("ioctl-SECCOMP_IOCTL_NOTIF_SEND");
            }

            /* If the pathname is just "/bye", then the supervisor breaks out
               of the loop and terminates. This allows us to see what happens
               if the target process makes further calls to mkdir(2). The
               'pathOK' check ensures that 'path' is examined only if it
               holds a valid null-terminated string. */

            if (pathOK && strcmp(path, "/bye") == 0)
                break;
        }

        free(req);
        free(resp);
        printf("\tS: terminating **********\n");
        exit(EXIT_FAILURE);
    }

    /* Implementation of the supervisor process:

       (1) obtains the notification file descriptor from 'sockPair[1]'
       (2) handles notifications that arrive on that file descriptor. */

    static void
    supervisor(int sockPair[2])
    {
        int notifyFd;

        notifyFd = recvfd(sockPair[1]);

        if (notifyFd == -1)
            err(EXIT_FAILURE, "recvfd");

        closeSocketPair(sockPair);  /* We no longer need the socket pair */

        handleNotifications(notifyFd);
    }

    int
    main(int argc, char *argv[])
    {
        int sockPair[2];
        struct sigaction sa;

        setbuf(stdout, NULL);

        if (argc < 2) {
            fprintf(stderr, "At least one pathname argument is required\n");
            exit(EXIT_FAILURE);
        }

        /* Create a UNIX domain socket that is used to pass the seccomp
           notification file descriptor from the target process to the
           supervisor process. */

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sockPair) == -1)
            err(EXIT_FAILURE, "socketpair");

        /* Create a child process--the "target"--that installs seccomp
           filtering. The target process writes the seccomp notification
           file descriptor onto 'sockPair[0]' and then calls mkdir(2) for
           each directory in the command-line arguments. */

        (void) targetProcess(sockPair, &argv[optind]);

        /* Catch SIGCHLD when the target terminates, so that the
           supervisor can also terminate. */

        sa.sa_handler = sigchldHandler;
        sa.sa_flags = 0;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGCHLD, &sa, NULL) == -1)
            err(EXIT_FAILURE, "sigaction");

        supervisor(sockPair);

        exit(EXIT_SUCCESS);
    }

SEE ALSO
ioctl(2), pidfd_getfd(2), pidfd_open(2), seccomp(2)

A further example program can be found in the kernel source file
samples/seccomp/user-trap.c.

Linux man-pages 6.05 2023-05-03 seccomp_unotify(2)