SECCOMP(2)                 Linux Programmer's Manual                SECCOMP(2)

NAME
       seccomp - operate on Secure Computing state of the process
7
SYNOPSIS
       #include <linux/seccomp.h>  /* Definition of SECCOMP_* constants */
       #include <linux/filter.h>   /* Definition of struct sock_fprog */
       #include <linux/audit.h>    /* Definition of AUDIT_* constants */
       #include <linux/signal.h>   /* Definition of SIG* constants */
       #include <sys/ptrace.h>     /* Definition of PTRACE_* constants */
       #include <sys/syscall.h>    /* Definition of SYS_* constants */
       #include <unistd.h>

       int syscall(SYS_seccomp, unsigned int operation, unsigned int flags,
                   void *args);

       Note: glibc provides no wrapper for seccomp(), necessitating the use of
       syscall(2).
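
Because no wrapper exists, programs commonly define a small helper around syscall(2). The following is a minimal sketch; the name seccomp_call() is hypothetical and is not provided by glibc or any other library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper (not a glibc function): invoke seccomp(2)
   through syscall(2).  The raw result of the system call is
   returned; on error this is -1, with errno set. */
long
seccomp_call(unsigned int operation, unsigned int flags, void *args)
{
    return syscall(SYS_seccomp, operation, flags, args);
}
```

A caller would then write, for example, seccomp_call(SECCOMP_SET_MODE_STRICT, 0, NULL).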
22
DESCRIPTION
24 The seccomp() system call operates on the Secure Computing (seccomp)
25 state of the calling process.
26
27 Currently, Linux supports the following operation values:
28
29 SECCOMP_SET_MODE_STRICT
30 The only system calls that the calling thread is permitted to
31 make are read(2), write(2), _exit(2) (but not exit_group(2)),
32 and sigreturn(2). Other system calls result in the termination
33 of the calling thread, or termination of the entire process with
34 the SIGKILL signal when there is only one thread. Strict secure
35 computing mode is useful for number-crunching applications that
36 may need to execute untrusted byte code, perhaps obtained by
37 reading from a pipe or socket.
38
39 Note that although the calling thread can no longer call sig‐
40 procmask(2), it can use sigreturn(2) to block all signals apart
41 from SIGKILL and SIGSTOP. This means that alarm(2) (for exam‐
42 ple) is not sufficient for restricting the process's execution
43 time. Instead, to reliably terminate the process, SIGKILL must
44 be used. This can be done by using timer_create(2) with
45 SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by using setr‐
46 limit(2) to set the hard limit for RLIMIT_CPU.
47
48 This operation is available only if the kernel is configured
49 with CONFIG_SECCOMP enabled.
50
51 The value of flags must be 0, and args must be NULL.
52
53 This operation is functionally identical to the call:
54
55 prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
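
The effect of strict mode can be observed with a short experiment. The sketch below forks a child that enters strict mode, writes a message, and then either terminates through the raw exit system call or first makes a forbidden call. The function name run_strict_child() and its return-value convention are assumptions made for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo helper.  Returns the child's exit status, the
   negated number of the signal that killed it, 2 if strict mode
   could not be enabled, or -1 if fork(2) or waitpid(2) failed. */
int
run_strict_child(int violate)
{
    static const char msg[] = "inside strict mode\n";
    int wstatus;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL) == -1)
            _exit(2);               /* Not yet restricted at this point */
        if (write(STDERR_FILENO, msg, sizeof(msg) - 1) == -1)
            syscall(SYS_exit, 1);   /* write(2) is permitted */
        if (violate)
            syscall(SYS_getpid);    /* Not permitted: SIGKILL */
        /* The glibc _exit() wrapper uses exit_group(2), which is not
           permitted, so the exit system call is invoked directly. */
        syscall(SYS_exit, 0);
    }
    if (waitpid(pid, &wstatus, 0) == -1)
        return -1;
    if (WIFSIGNALED(wstatus))
        return -WTERMSIG(wstatus);
    return WEXITSTATUS(wstatus);
}
```

With violate set to 0 the child is expected to exit normally; with a nonzero value the parent should see termination by SIGKILL. If the process is already running under a seccomp filter (as in some container runtimes), entering strict mode fails and the helper reports 2.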
56
57 SECCOMP_SET_MODE_FILTER
58 The system calls allowed are defined by a pointer to a Berkeley
59 Packet Filter (BPF) passed via args. This argument is a pointer
60 to a struct sock_fprog; it can be designed to filter arbitrary
61 system calls and system call arguments. If the filter is in‐
62 valid, seccomp() fails, returning EINVAL in errno.
63
64 If fork(2) or clone(2) is allowed by the filter, any child pro‐
65 cesses will be constrained to the same system call filters as
66 the parent. If execve(2) is allowed, the existing filters will
67 be preserved across a call to execve(2).
68
69 In order to use the SECCOMP_SET_MODE_FILTER operation, either
70 the calling thread must have the CAP_SYS_ADMIN capability in its
71 user namespace, or the thread must already have the no_new_privs
72 bit set. If that bit was not already set by an ancestor of this
73 thread, the thread must make the following call:
74
75 prctl(PR_SET_NO_NEW_PRIVS, 1);
76
77 Otherwise, the SECCOMP_SET_MODE_FILTER operation fails and re‐
78 turns EACCES in errno. This requirement ensures that an unpriv‐
79 ileged process cannot apply a malicious filter and then invoke a
80 set-user-ID or other privileged program using execve(2), thus
81 potentially compromising that program. (Such a malicious filter
82 might, for example, cause an attempt to use setuid(2) to set the
83 caller's user IDs to nonzero values to instead return 0 without
84 actually making the system call. Thus, the program might be
85 tricked into retaining superuser privileges in circumstances
86 where it is possible to influence it to do dangerous things be‐
87 cause it did not actually drop privileges.)
88
89 If prctl(2) or seccomp() is allowed by the attached filter, fur‐
90 ther filters may be added. This will increase evaluation time,
91 but allows for further reduction of the attack surface during
92 execution of a thread.
93
94 The SECCOMP_SET_MODE_FILTER operation is available only if the
95 kernel is configured with CONFIG_SECCOMP_FILTER enabled.
96
97 When flags is 0, this operation is functionally identical to the
98 call:
99
100 prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);
101
102 The recognized flags are:
103
104 SECCOMP_FILTER_FLAG_LOG (since Linux 4.14)
105 All filter return actions except SECCOMP_RET_ALLOW should
106 be logged. An administrator may override this filter
107 flag by preventing specific actions from being logged via
108 the /proc/sys/kernel/seccomp/actions_logged file.
109
110 SECCOMP_FILTER_FLAG_NEW_LISTENER (since Linux 5.0)
111 After successfully installing the filter program, return
112 a new user-space notification file descriptor. (The
113 close-on-exec flag is set for the file descriptor.) When
114 the filter returns SECCOMP_RET_USER_NOTIF a notification
115 will be sent to this file descriptor.
116
117 At most one seccomp filter using the SECCOMP_FIL‐
118 TER_FLAG_NEW_LISTENER flag can be installed for a thread.
119
120 See seccomp_unotify(2) for further details.
121
122 SECCOMP_FILTER_FLAG_SPEC_ALLOW (since Linux 4.17)
123 Disable Speculative Store Bypass mitigation.
124
125 SECCOMP_FILTER_FLAG_TSYNC
126 When adding a new filter, synchronize all other threads
127 of the calling process to the same seccomp filter tree.
128 A "filter tree" is the ordered list of filters attached
129 to a thread. (Attaching identical filters in separate
130 seccomp() calls results in different filters from this
131 perspective.)
132
133 If any thread cannot synchronize to the same filter tree,
134 the call will not attach the new seccomp filter, and will
135 fail, returning the first thread ID found that cannot
136 synchronize. Synchronization will fail if another thread
137 in the same process is in SECCOMP_MODE_STRICT or if it
138 has attached new seccomp filters to itself, diverging
139 from the calling thread's filter tree.
140
141 SECCOMP_GET_ACTION_AVAIL (since Linux 4.14)
142 Test to see if an action is supported by the kernel. This oper‐
143 ation is helpful to confirm that the kernel knows of a more re‐
144 cently added filter return action since the kernel treats all
145 unknown actions as SECCOMP_RET_KILL_PROCESS.
146
147 The value of flags must be 0, and args must be a pointer to an
148 unsigned 32-bit filter return action.
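
A sketch of such a probe follows. The helper name action_available() and its three-way return value are assumptions made for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper.  Returns 1 if the running kernel supports
   'action', 0 if it reports EOPNOTSUPP, and -1 if the query itself
   failed (for example, because the kernel predates Linux 4.14). */
int
action_available(uint32_t action)
{
    if (syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return (errno == EOPNOTSUPP) ? 0 : -1;
}
```

A program might call action_available(SECCOMP_RET_KILL_PROCESS) and fall back to SECCOMP_RET_KILL_THREAD when the result is not 1.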
149
150 SECCOMP_GET_NOTIF_SIZES (since Linux 5.0)
151 Get the sizes of the seccomp user-space notification structures.
152 Since these structures may evolve and grow over time, this com‐
153 mand can be used to determine how much memory to allocate for
154 sending and receiving notifications.
155
156 The value of flags must be 0, and args must be a pointer to a
157 struct seccomp_notif_sizes, which has the following form:
158
           struct seccomp_notif_sizes {
               __u16 seccomp_notif;      /* Size of notification structure */
               __u16 seccomp_notif_resp; /* Size of response structure */
               __u16 seccomp_data;       /* Size of 'struct seccomp_data' */
           };
164
165 See seccomp_unotify(2) for further details.
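
The query and a matching allocation might be sketched as follows. The helper names are hypothetical, and the choice to allocate at least sizeof(struct seccomp_notif) is an assumption of this sketch, made so that the buffer is usable even if the kernel reports a smaller size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for the notification sizes.
   Returns 0 on success, or -1 with errno set; on failure '*sizes'
   is left zeroed. */
int
query_notif_sizes(struct seccomp_notif_sizes *sizes)
{
    memset(sizes, 0, sizeof(*sizes));
    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes) == -1)
        return -1;
    return 0;
}

/* Hypothetical helper: allocate a zeroed buffer large enough for
   both the kernel's and the compile-time notification structure. */
void *
alloc_notif_buffer(const struct seccomp_notif_sizes *sizes)
{
    size_t size = sizes->seccomp_notif;

    if (size < sizeof(struct seccomp_notif))
        size = sizeof(struct seccomp_notif);
    return calloc(1, size);
}
```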
166
167 Filters
168 When adding filters via SECCOMP_SET_MODE_FILTER, args points to a fil‐
169 ter program:
170
           struct sock_fprog {
               unsigned short      len;    /* Number of BPF instructions */
               struct sock_filter *filter; /* Pointer to array of
                                              BPF instructions */
           };
176
177 Each program must contain one or more BPF instructions:
178
           struct sock_filter {            /* Filter block */
               __u16 code;                 /* Actual filter code */
               __u8  jt;                   /* Jump true */
               __u8  jf;                   /* Jump false */
               __u32 k;                    /* Generic multiuse field */
           };
185
       When executing the instructions, the BPF program operates on the system
       call information, which is made available as a read-only buffer
       (accessed using the BPF_ABS addressing mode) of the following form:
189
           struct seccomp_data {
               int   nr;                   /* System call number */
               __u32 arch;                 /* AUDIT_ARCH_* value
                                              (see <linux/audit.h>) */
               __u64 instruction_pointer;  /* CPU instruction pointer */
               __u64 args[6];              /* Up to 6 system call arguments */
           };
197
198 Because numbering of system calls varies between architectures and some
199 architectures (e.g., x86-64) allow user-space code to use the calling
200 conventions of multiple architectures (and the convention being used
201 may vary over the life of a process that uses execve(2) to execute bi‐
202 naries that employ the different conventions), it is usually necessary
203 to verify the value of the arch field.
204
205 It is strongly recommended to use an allow-list approach whenever pos‐
206 sible because such an approach is more robust and simple. A deny-list
207 will have to be updated whenever a potentially dangerous system call is
208 added (or a dangerous flag or option if those are deny-listed), and it
209 is often possible to alter the representation of a value without alter‐
210 ing its meaning, leading to a deny-list bypass. See also Caveats be‐
211 low.
212
213 The arch field is not unique for all calling conventions. The x86-64
214 ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on
215 the same processors. Instead, the mask __X32_SYSCALL_BIT is used on
216 the system call number to tell the two ABIs apart.
217
218 This means that a policy must either deny all syscalls with
219 __X32_SYSCALL_BIT or it must recognize syscalls with and without
220 __X32_SYSCALL_BIT set. A list of system calls to be denied based on nr
221 that does not also contain nr values with __X32_SYSCALL_BIT set can be
222 bypassed by a malicious program that sets __X32_SYSCALL_BIT.
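
The recommendations above (verify arch, reject system call numbers with __X32_SYSCALL_BIT set, and prefer an allow list) can be combined as in the following sketch. The function allow_list_demo(), the NATIVE_ARCH macro, and the particular list of permitted system calls are assumptions made for this illustration; a real policy would need a list tailored to the program. The filter is installed in a forked child so that the caller remains unrestricted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "Add the AUDIT_ARCH_* value for this architecture"
#endif

#define X32_SYSCALL_BIT 0x40000000

/* Hypothetical demo.  Returns 0 if a system call outside the allow
   list failed with EPERM in the child, 1 if it did not, 2 if the
   filter could not be installed, and -1 on other failures. */
int
allow_list_demo(void)
{
    struct sock_filter filter[] = {
        /* [0] Load the architecture. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        /* [1] Expected architecture: skip [2]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        /* [2] Unexpected architecture. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* [3] Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* [4] Numbers below the x32 bit skip [5]. */
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1),
        /* [5] x32 system call numbers are not expected. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* [6]-[8] The allow list: each match jumps to [10]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),
        /* [9] Everything else fails with EPERM. */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* [10] Allowed. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    int wstatus;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1 ||
            syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog) == -1)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);      /* Not on the allow list */
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }
    if (waitpid(pid, &wstatus, 0) == -1)
        return -1;
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```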
223
224 Additionally, kernels prior to Linux 5.4 incorrectly permitted nr in
225 the ranges 512-547 as well as the corresponding non-x32 syscalls ORed
226 with __X32_SYSCALL_BIT. For example, nr == 521 and nr == (101 |
227 __X32_SYSCALL_BIT) would result in invocations of ptrace(2) with poten‐
228 tially confused x32-vs-x86_64 semantics in the kernel. Policies in‐
229 tended to work on kernels before Linux 5.4 must ensure that they deny
230 or otherwise correctly handle these system calls. On Linux 5.4 and
231 newer, such system calls will fail with the error ENOSYS, without doing
232 anything.
233
       The instruction_pointer field provides the address of the
       machine-language instruction that performed the system call.  This
       might be useful in conjunction with the use of /proc/[pid]/maps to
       perform checks based on which region (mapping) of the program made the
       system call.  (It is probably wise to lock down the mmap(2) and
       mprotect(2) system calls to prevent the program from subverting such
       checks.)
240
241 When checking values from args, keep in mind that arguments are often
242 silently truncated before being processed, but after the seccomp check.
243 For example, this happens if the i386 ABI is used on an x86-64 kernel:
244 although the kernel will normally not look beyond the 32 lowest bits of
245 the arguments, the values of the full 64-bit registers will be present
246 in the seccomp data. A less surprising example is that if the x86-64
247 ABI is used to perform a system call that takes an argument of type
248 int, the more-significant half of the argument register is ignored by
249 the system call, but visible in the seccomp data.
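
Because BPF loads are 32 bits wide, a filter that checks a 64-bit argument has to examine both halves. The sketch below builds a filter fragment that permits one system call only when args[0], taken as a full 64-bit value, equals a given 32-bit value. The ARG_LO() and ARG_HI() macros and the function build_arg0_filter() are hypothetical names; the architecture check that a real filter needs is omitted here for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Offsets of the two 32-bit halves of args[n] (hypothetical macros). */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_HI(n) (ARG_LO(n) + 4)
#else
#define ARG_HI(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_LO(n) (ARG_HI(n) + 4)
#endif

/* Fill 'out' (at least 8 entries) and return the instruction count. */
size_t
build_arg0_filter(struct sock_filter *out, __u32 nr, __u32 value)
{
    const struct sock_filter prog[] = {
        /* [0] Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* [1] Some other system call: jump to [7]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 5),
        /* [2] Load the more-significant half of args[0]. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_HI(0)),
        /* [3] Nonzero upper half: jump to [6]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 2),
        /* [4] Load the less-significant half of args[0]. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(0)),
        /* [5] Match: jump to [7]; otherwise fall through to [6]. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, value, 1, 0),
        /* [6] Reject with EPERM. */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* [7] Allow. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(out, prog, sizeof(prog));
    return sizeof(prog) / sizeof(prog[0]);
}
```

Checking only the lower half would let a caller on x86-64 pass a value with garbage in the upper 32 bits that still compares equal in the filter.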
250
       A seccomp filter returns a 32-bit value consisting of two parts: the
       most significant 16 bits (corresponding to the mask defined by the
       constant SECCOMP_RET_ACTION_FULL) contain one of the "action" values
       listed below; the least significant 16 bits (defined by the constant
       SECCOMP_RET_DATA) are "data" to be associated with this return value.
256
257 If multiple filters exist, they are all executed, in reverse order of
258 their addition to the filter tree—that is, the most recently installed
259 filter is executed first. (Note that all filters will be called even
260 if one of the earlier filters returns SECCOMP_RET_KILL. This is done
261 to simplify the kernel code and to provide a tiny speed-up in the exe‐
262 cution of sets of filters by avoiding a check for this uncommon case.)
263 The return value for the evaluation of a given system call is the
264 first-seen action value of highest precedence (along with its accompa‐
265 nying data) returned by execution of all of the filters.
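
The splitting of a return value and the selection among several filters can be modeled in user space. The sketch below is only an illustration with hypothetical function names; it relies on the numeric values of the SECCOMP_RET_* constants in <linux/seccomp.h>, where an action of higher precedence has a smaller value when read as a signed 32-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Action part (upper 16 bits) of a filter return value. */
uint32_t
ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Data part (lower 16 bits) of a filter return value. */
uint32_t
ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}

/* Model of combining results: rets[0] comes from the most recently
   installed filter.  The first-seen value whose action has the
   highest precedence wins; with no filters the result is ALLOW.
   The casts assume the usual two's-complement conversion. */
uint32_t
combine_results(const uint32_t *rets, size_t n)
{
    uint32_t best = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        if ((int32_t) ret_action(rets[i]) < (int32_t) ret_action(best))
            best = rets[i];
    }
    return best;
}
```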
266
267 In decreasing order of precedence, the action values that may be re‐
268 turned by a seccomp filter are:
269
270 SECCOMP_RET_KILL_PROCESS (since Linux 4.14)
271 This value results in immediate termination of the process, with
272 a core dump. The system call is not executed. By contrast with
273 SECCOMP_RET_KILL_THREAD below, all threads in the thread group
274 are terminated. (For a discussion of thread groups, see the de‐
275 scription of the CLONE_THREAD flag in clone(2).)
276
277 The process terminates as though killed by a SIGSYS signal.
278 Even if a signal handler has been registered for SIGSYS, the
279 handler will be ignored in this case and the process always ter‐
280 minates. To a parent process that is waiting on this process
281 (using waitpid(2) or similar), the returned wstatus will indi‐
282 cate that its child was terminated as though by a SIGSYS signal.
283
284 SECCOMP_RET_KILL_THREAD (or SECCOMP_RET_KILL)
285 This value results in immediate termination of the thread that
286 made the system call. The system call is not executed. Other
287 threads in the same thread group will continue to execute.
288
289 The thread terminates as though killed by a SIGSYS signal. See
290 SECCOMP_RET_KILL_PROCESS above.
291
292 Before Linux 4.11, any process terminated in this way would not
293 trigger a coredump (even though SIGSYS is documented in sig‐
294 nal(7) as having a default action of termination with a core
295 dump). Since Linux 4.11, a single-threaded process will dump
296 core if terminated in this way.
297
298 With the addition of SECCOMP_RET_KILL_PROCESS in Linux 4.14,
299 SECCOMP_RET_KILL_THREAD was added as a synonym for SEC‐
300 COMP_RET_KILL, in order to more clearly distinguish the two ac‐
301 tions.
302
303 Note: the use of SECCOMP_RET_KILL_THREAD to kill a single thread
304 in a multithreaded process is likely to leave the process in a
305 permanently inconsistent and possibly corrupt state.
306
307 SECCOMP_RET_TRAP
              This value results in the kernel sending a thread-directed
              SIGSYS signal to the triggering thread.  (The system call is not
              executed.)  Various fields will be set in the siginfo_t
              structure (see sigaction(2)) associated with the signal:
312
313 * si_signo will contain SIGSYS.
314
315 * si_call_addr will show the address of the system call in‐
316 struction.
317
318 * si_syscall and si_arch will indicate which system call was
319 attempted.
320
321 * si_code will contain SYS_SECCOMP.
322
323 * si_errno will contain the SECCOMP_RET_DATA portion of the
324 filter return value.
325
326 The program counter will be as though the system call happened
327 (i.e., the program counter will not point to the system call in‐
328 struction). The return value register will contain an architec‐
329 ture-dependent value; if resuming execution, set it to something
330 appropriate for the system call. (The architecture dependency
331 is because replacing it with ENOSYS could overwrite some useful
332 information.)
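
A SIGSYS handler that inspects these fields might look like the sketch below. The function trap_demo(), the choice of getppid(2) as the trapped system call, and the data value 42 are assumptions made for this illustration; the architecture check is omitted for brevity, and the filter is installed in a forked child so that the caller is unaffected.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_SECCOMP
#define SYS_SECCOMP 1       /* si_code for seccomp-generated SIGSYS */
#endif

static volatile sig_atomic_t trapped_nr = -1;
static volatile sig_atomic_t trapped_data = -1;

static void
sigsys_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void) sig;
    (void) ucontext;
    if (info->si_code == SYS_SECCOMP) {
        trapped_nr = info->si_syscall;      /* Attempted system call */
        trapped_data = info->si_errno;      /* SECCOMP_RET_DATA part */
    }
}

/* Hypothetical demo.  Returns 0 if the handler saw the trapped call,
   1 if not, 2 if setup failed in the child, -1 on other failures. */
int
trap_demo(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP | 42),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    struct sigaction sa;
    int wstatus;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigsys_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGSYS, &sa, NULL) == -1 ||
            prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1 ||
            syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog) == -1)
            _exit(2);
        syscall(SYS_getppid);               /* Raises SIGSYS */
        _exit((trapped_nr == SYS_getppid && trapped_data == 42) ? 0 : 1);
    }
    if (waitpid(pid, &wstatus, 0) == -1)
        return -1;
    return WIFEXITED(wstatus) ? WEXITSTATUS(wstatus) : -1;
}
```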
333
334 SECCOMP_RET_ERRNO
335 This value results in the SECCOMP_RET_DATA portion of the fil‐
336 ter's return value being passed to user space as the errno value
337 without executing the system call.
338
339 SECCOMP_RET_USER_NOTIF (since Linux 5.0)
              Forward the system call to an attached user-space supervisor
              process to allow that process to decide what to do with the
              system call.  If there is no attached supervisor (either because
              the filter was not installed with the
              SECCOMP_FILTER_FLAG_NEW_LISTENER flag or because the file
              descriptor was closed), the system call fails with the error
              ENOSYS (similar to what happens when a filter returns
              SECCOMP_RET_TRACE and there is no tracer).  See
              seccomp_unotify(2) for further details.
348
349 Note that the supervisor process will not be notified if another
350 filter returns an action value with a precedence greater than
351 SECCOMP_RET_USER_NOTIF.
352
353 SECCOMP_RET_TRACE
354 When returned, this value will cause the kernel to attempt to
355 notify a ptrace(2)-based tracer prior to executing the system
356 call. If there is no tracer present, the system call is not ex‐
357 ecuted and returns a failure status with errno set to ENOSYS.
358
359 A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
360 using ptrace(PTRACE_SETOPTIONS). The tracer will be notified of
361 a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of the
362 filter's return value will be available to the tracer via
363 PTRACE_GETEVENTMSG.
364
365 The tracer can skip the system call by changing the system call
366 number to -1. Alternatively, the tracer can change the system
367 call requested by changing the system call to a valid system
368 call number. If the tracer asks to skip the system call, then
369 the system call will appear to return the value that the tracer
370 puts in the return value register.
371
372 Before kernel 4.8, the seccomp check will not be run again after
373 the tracer is notified. (This means that, on older kernels,
374 seccomp-based sandboxes must not allow use of ptrace(2)—even of
375 other sandboxed processes—without extreme care; ptracers can use
376 this mechanism to escape from the seccomp sandbox.)
377
378 Note that a tracer process will not be notified if another fil‐
379 ter returns an action value with a precedence greater than SEC‐
380 COMP_RET_TRACE.
381
382 SECCOMP_RET_LOG (since Linux 4.14)
383 This value results in the system call being executed after the
384 filter return action is logged. An administrator may override
385 the logging of this action via the /proc/sys/kernel/seccomp/ac‐
386 tions_logged file.
387
388 SECCOMP_RET_ALLOW
389 This value results in the system call being executed.
390
391 If an action value other than one of the above is specified, then the
392 filter action is treated as either SECCOMP_RET_KILL_PROCESS (since
393 Linux 4.14) or SECCOMP_RET_KILL_THREAD (in Linux 4.13 and earlier).
394
395 /proc interfaces
396 The files in the directory /proc/sys/kernel/seccomp provide additional
397 seccomp information and configuration:
398
399 actions_avail (since Linux 4.14)
400 A read-only ordered list of seccomp filter return actions in
401 string form. The ordering, from left-to-right, is in decreasing
402 order of precedence. The list represents the set of seccomp
403 filter return actions supported by the kernel.
404
405 actions_logged (since Linux 4.14)
406 A read-write ordered list of seccomp filter return actions that
407 are allowed to be logged. Writes to the file do not need to be
408 in ordered form but reads from the file will be ordered in the
409 same way as the actions_avail file.
410
411 It is important to note that the value of actions_logged does
412 not prevent certain filter return actions from being logged when
413 the audit subsystem is configured to audit a task. If the ac‐
414 tion is not found in the actions_logged file, the final decision
415 on whether to audit the action for that task is ultimately left
416 up to the audit subsystem to decide for all filter return ac‐
417 tions other than SECCOMP_RET_ALLOW.
418
419 The "allow" string is not accepted in the actions_logged file as
420 it is not possible to log SECCOMP_RET_ALLOW actions. Attempting
421 to write "allow" to the file will fail with the error EINVAL.
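
These files can be read like any other text file. The sketch below reads actions_avail and looks for a given action name; both helper names are hypothetical, and the sketch assumes that the list is a single line of names separated by spaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the first line of actions_avail into
   'buf'.  Returns 0 on success, or -1 if the file cannot be read
   (for example, on kernels before Linux 4.14). */
int
read_actions_avail(char *buf, size_t size)
{
    FILE *fp = fopen("/proc/sys/kernel/seccomp/actions_avail", "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, (int) size, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Hypothetical helper: return 1 if 'name' appears as a whole word
   in the space-separated 'list', otherwise 0. */
int
action_listed(const char *list, const char *name)
{
    size_t name_len = strlen(name);

    while (*list != '\0') {
        size_t len = strcspn(list, " \n");

        if (len == name_len && strncmp(list, name, len) == 0)
            return 1;
        list += len;
        list += strspn(list, " \n");
    }
    return 0;
}
```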
422
423 Audit logging of seccomp actions
424 Since Linux 4.14, the kernel provides the facility to log the actions
425 returned by seccomp filters in the audit log. The kernel makes the de‐
426 cision to log an action based on the action type, whether or not the
427 action is present in the actions_logged file, and whether kernel audit‐
428 ing is enabled (e.g., via the kernel boot option audit=1). The rules
429 are as follows:
430
431 * If the action is SECCOMP_RET_ALLOW, the action is not logged.
432
433 * Otherwise, if the action is either SECCOMP_RET_KILL_PROCESS or SEC‐
434 COMP_RET_KILL_THREAD, and that action appears in the actions_logged
435 file, the action is logged.
436
437 * Otherwise, if the filter has requested logging (the SECCOMP_FIL‐
438 TER_FLAG_LOG flag) and the action appears in the actions_logged
439 file, the action is logged.
440
441 * Otherwise, if kernel auditing is enabled and the process is being
442 audited (autrace(8)), the action is logged.
443
444 * Otherwise, the action is not logged.
445
RETURN VALUE
447 On success, seccomp() returns 0. On error, if SECCOMP_FIL‐
448 TER_FLAG_TSYNC was used, the return value is the ID of the thread that
449 caused the synchronization failure. (This ID is a kernel thread ID of
450 the type returned by clone(2) and gettid(2).) On other errors, -1 is
451 returned, and errno is set to indicate the error.
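
A caller that uses SECCOMP_FILTER_FLAG_TSYNC therefore has three outcomes to distinguish. A minimal sketch follows; the enum and function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

enum install_status {
    INSTALL_OK,             /* Filter attached */
    INSTALL_TSYNC_FAILED,   /* A thread could not be synchronized */
    INSTALL_ERROR,          /* Other error; see errno */
};

/* Classify the raw result of a seccomp() call made with TSYNC. */
enum install_status
classify_result(long ret)
{
    if (ret == 0)
        return INSTALL_OK;
    if (ret > 0)
        return INSTALL_TSYNC_FAILED;    /* 'ret' is a thread ID */
    return INSTALL_ERROR;
}

/* Install 'prog' on all threads; on a synchronization failure,
   store the ID of the offending thread in '*tid'. */
enum install_status
install_with_tsync(struct sock_fprog *prog, long *tid)
{
    long ret = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                       SECCOMP_FILTER_FLAG_TSYNC, prog);
    enum install_status status = classify_result(ret);

    *tid = (status == INSTALL_TSYNC_FAILED) ? ret : 0;
    return status;
}
```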
452
ERRORS
454 seccomp() can fail for the following reasons:
455
456 EACCES The caller did not have the CAP_SYS_ADMIN capability in its user
457 namespace, or had not set no_new_privs before using SEC‐
458 COMP_SET_MODE_FILTER.
459
460 EBUSY While installing a new filter, the SECCOMP_FILTER_FLAG_NEW_LIS‐
461 TENER flag was specified, but a previous filter had already been
462 installed with that flag.
463
464 EFAULT args was not a valid address.
465
466 EINVAL operation is unknown or is not supported by this kernel version
467 or configuration.
468
469 EINVAL The specified flags are invalid for the given operation.
470
471 EINVAL operation included BPF_ABS, but the specified offset was not
472 aligned to a 32-bit boundary or exceeded sizeof(struct sec‐
473 comp_data).
474
475 EINVAL A secure computing mode has already been set, and operation dif‐
476 fers from the existing setting.
477
478 EINVAL operation specified SECCOMP_SET_MODE_FILTER, but the filter pro‐
479 gram pointed to by args was not valid or the length of the fil‐
480 ter program was zero or exceeded BPF_MAXINSNS [22m(4096) instruc‐
481 tions.
482
483 ENOMEM Out of memory.
484
485 ENOMEM The total length of all filter programs attached to the calling
486 thread would exceed MAX_INSNS_PER_PATH [22m(32768) instructions.
487 Note that for the purposes of calculating this limit, each al‐
488 ready existing filter program incurs an overhead penalty of 4
489 instructions.
490
491 EOPNOTSUPP
492 operation specified SECCOMP_GET_ACTION_AVAIL, but the kernel
493 does not support the filter return action specified by args.
494
495 ESRCH Another thread caused a failure during thread sync, but its ID
496 could not be determined.
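
The two length limits described under EINVAL and ENOMEM above can be checked before a filter is installed. The following sketch is plain arithmetic on the numbers given in this page; the function name and the macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FILTER_MAX_LEN      4096    /* BPF_MAXINSNS */
#define PATH_MAX_INSNS      32768   /* MAX_INSNS_PER_PATH */
#define PER_FILTER_PENALTY  4       /* Charged per existing filter */

/* Return 1 if a new filter of 'new_len' instructions would be
   within both limits, given the lengths of the 'count' filters
   already attached to the thread; otherwise return 0. */
int
filter_would_fit(const unsigned short *existing, size_t count,
                 unsigned short new_len)
{
    unsigned long total = new_len;

    if (new_len == 0 || new_len > FILTER_MAX_LEN)
        return 0;                       /* Would fail with EINVAL */
    for (size_t i = 0; i < count; i++)
        total += existing[i] + PER_FILTER_PENALTY;
    return total <= PATH_MAX_INSNS;     /* Otherwise ENOMEM */
}
```

For example, with six maximum-length filters already attached, a seventh of 4096 instructions totals 6 * (4096 + 4) + 4096 = 28696 and fits; with seven attached, the total of 32796 exceeds the limit.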
497
VERSIONS
499 The seccomp() system call first appeared in Linux 3.17.
500
CONFORMING TO
502 The seccomp() system call is a nonstandard Linux extension.
503
NOTES
505 Rather than hand-coding seccomp filters as shown in the example below,
506 you may prefer to employ the libseccomp library, which provides a
507 front-end for generating seccomp filters.
508
509 The Seccomp field of the /proc/[pid]/status file provides a method of
510 viewing the seccomp mode of a process; see proc(5).
511
512 seccomp() provides a superset of the functionality provided by the
513 prctl(2) PR_SET_SECCOMP operation (which does not support flags).
514
515 Since Linux 4.4, the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation can
516 be used to dump a process's seccomp filters.
517
518 Architecture support for seccomp BPF
519 Architecture support for seccomp BPF filtering is available on the fol‐
520 lowing architectures:
521
522 * x86-64, i386, x32 (since Linux 3.5)
523 * ARM (since Linux 3.8)
524 * s390 (since Linux 3.8)
525 * MIPS (since Linux 3.16)
526 * ARM-64 (since Linux 3.19)
527 * PowerPC (since Linux 4.3)
528 * Tile (since Linux 4.3)
529 * PA-RISC (since Linux 4.6)
530
531 Caveats
532 There are various subtleties to consider when applying seccomp filters
533 to a program, including the following:
534
535 * Some traditional system calls have user-space implementations in the
536 vdso(7) on many architectures. Notable examples include clock_get‐
537 time(2), gettimeofday(2), and time(2). On such architectures, sec‐
538 comp filtering for these system calls will have no effect. (How‐
539 ever, there are cases where the vdso(7) implementations may fall
540 back to invoking the true system call, in which case seccomp filters
541 would see the system call.)
542
543 * Seccomp filtering is based on system call numbers. However, appli‐
544 cations typically do not directly invoke system calls, but instead
545 call wrapper functions in the C library which in turn invoke the
546 system calls. Consequently, one must be aware of the following:
547
548 • The glibc wrappers for some traditional system calls may actually
549 employ system calls with different names in the kernel. For ex‐
550 ample, the exit(2) wrapper function actually employs the
551 exit_group(2) system call, and the fork(2) wrapper function actu‐
552 ally calls clone(2).
553
554 • The behavior of wrapper functions may vary across architectures,
555 according to the range of system calls provided on those archi‐
556 tectures. In other words, the same wrapper function may invoke
557 different system calls on different architectures.
558
559 • Finally, the behavior of wrapper functions can change across
560 glibc versions. For example, in older versions, the glibc wrap‐
561 per function for open(2) invoked the system call of the same
562 name, but starting in glibc 2.26, the implementation switched to
563 calling openat(2) on all architectures.
564
565 The consequence of the above points is that it may be necessary to fil‐
566 ter for a system call other than might be expected. Various manual
567 pages in Section 2 provide helpful details about the differences be‐
568 tween wrapper functions and the underlying system calls in subsections
569 entitled C library/kernel differences.
570
571 Furthermore, note that the application of seccomp filters even risks
572 causing bugs in an application, when the filters cause unexpected fail‐
573 ures for legitimate operations that the application might need to per‐
574 form. Such bugs may not easily be discovered when testing the seccomp
575 filters if the bugs occur in rarely used application code paths.
576
577 Seccomp-specific BPF details
578 Note the following BPF details specific to seccomp filters:
579
580 * The BPF_H and BPF_B size modifiers are not supported: all operations
581 must load and store (4-byte) words (BPF_W).
582
583 * To access the contents of the seccomp_data buffer, use the BPF_ABS
584 addressing mode modifier.
585
586 * The BPF_LEN addressing mode modifier yields an immediate mode oper‐
587 and whose value is the size of the seccomp_data buffer.
588
EXAMPLES
590 The program below accepts four or more arguments. The first three ar‐
591 guments are a system call number, a numeric architecture identifier,
592 and an error number. The program uses these values to construct a BPF
593 filter that is used at run time to perform the following checks:
594
       [1] If the program is not running on the specified architecture, the
           BPF filter terminates the process (SECCOMP_RET_KILL_PROCESS) as
           soon as it makes a system call.
597
598 [2] If the program attempts to execute the system call with the speci‐
599 fied number, the BPF filter causes the system call to fail, with
600 errno being set to the specified error number.
601
602 The remaining command-line arguments specify the pathname and addi‐
603 tional arguments of a program that the example program should attempt
604 to execute using execv(3) (a library function that employs the ex‐
605 ecve(2) system call). Some example runs of the program are shown be‐
606 low.
607
608 First, we display the architecture that we are running on (x86-64) and
609 then construct a shell function that looks up system call numbers on
610 this architecture:
611
           $ uname -m
           x86_64
           $ syscall_nr() {
               cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \
               awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
           }
618
619 When the BPF filter rejects a system call (case [2] above), it causes
620 the system call to fail with the error number specified on the command
621 line. In the experiments shown here, we'll use error number 99:
622
           $ errno 99
           EADDRNOTAVAIL 99 Cannot assign requested address
625
626 In the following example, we attempt to run the command whoami(1), but
627 the BPF filter rejects the execve(2) system call, so that the command
628 is not even executed:
629
           $ syscall_nr execve
           59
           $ ./a.out
           Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
           Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
                            AUDIT_ARCH_X86_64: 0xC000003E
           $ ./a.out 59 0xC000003E 99 /bin/whoami
           execv: Cannot assign requested address
638
639 In the next example, the BPF filter rejects the write(2) system call,
640 so that, although it is successfully started, the whoami(1) command is
641 not able to write output:
642
           $ syscall_nr write
           1
           $ ./a.out 1 0xC000003E 99 /bin/whoami
646
647 In the final example, the BPF filter rejects a system call that is not
648 used by the whoami(1) command, so it is able to successfully execute
649 and produce output:
650
           $ syscall_nr preadv
           295
           $ ./a.out 295 0xC000003E 99 /bin/whoami
           cecilia
655
   Program source
       #include <errno.h>
       #include <stddef.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <linux/audit.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <sys/prctl.h>
       #include <sys/syscall.h>

       #define X32_SYSCALL_BIT 0x40000000
       #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

       static int
       install_filter(int syscall_nr, int t_arch, int f_errno)
       {
           unsigned int upper_nr_limit = 0xffffffff;

           /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
              (in the x32 ABI, all system calls have bit 30 set in the
              'nr' field, meaning the numbers are >= X32_SYSCALL_BIT). */
           if (t_arch == AUDIT_ARCH_X86_64)
               upper_nr_limit = X32_SYSCALL_BIT - 1;

           struct sock_filter filter[] = {
               /* [0] Load architecture from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, arch))),

               /* [1] Jump forward 5 instructions if architecture does not
                      match 't_arch'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

               /* [2] Load system call number from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, nr))),

               /* [3] Check ABI - only needed for x86-64 in deny-list use
                      cases.  Use BPF_JGT instead of checking against the bit
                      mask to avoid having to reload the syscall number. */
               BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

               /* [4] Jump forward 1 instruction if system call number
                      does not match 'syscall_nr'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

               /* [5] Matching architecture and system call: don't execute
                      the system call, and return 'f_errno' in 'errno'. */
               BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

               /* [6] Destination of system call number mismatch: allow other
                      system calls. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

               /* [7] Destination of architecture mismatch: kill process. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
           };

           struct sock_fprog prog = {
               .len = ARRAY_SIZE(filter),
               .filter = filter,
           };

           /* glibc provides no seccomp() wrapper, so the system call is
              made through syscall(2). */
           if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
               perror("seccomp");
               return 1;
           }

           return 0;
       }

       int
       main(int argc, char *argv[])
       {
           if (argc < 5) {
               fprintf(stderr, "Usage: "
                       "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                       "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                       "                 AUDIT_ARCH_X86_64: 0x%X\n"
                       "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
               exit(EXIT_FAILURE);
           }

           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
               perror("prctl");
               exit(EXIT_FAILURE);
           }

           if (install_filter(strtol(argv[1], NULL, 0),
                              strtol(argv[2], NULL, 0),
                              strtol(argv[3], NULL, 0)))
               exit(EXIT_FAILURE);

           execv(argv[4], &argv[4]);
           perror("execv");
           exit(EXIT_FAILURE);
       }
757
SEE ALSO
759 bpfc(1), strace(1), bpf(2), prctl(2), ptrace(2), seccomp_unotify(2),
760 sigaction(2), proc(5), signal(7), socket(7)
761
762 Various pages from the libseccomp library, including: scmp_sys_re‐
763 solver(1), seccomp_export_bpf(3), seccomp_init(3), seccomp_load(3), and
764 seccomp_rule_add(3).
765
766 The kernel source files Documentation/networking/filter.txt and Docu‐
767 mentation/userspace-api/seccomp_filter.rst (or Documentation/prctl/sec‐
768 comp_filter.txt before Linux 4.13).
769
770 McCanne, S. and Jacobson, V. (1992) The BSD Packet Filter: A New Archi‐
771 tecture for User-level Packet Capture, Proceedings of the USENIX Winter
772 1993 Conference ⟨http://www.tcpdump.org/papers/bpf-usenix93.pdf⟩
773
COLOPHON
775 This page is part of release 5.13 of the Linux man-pages project. A
776 description of the project, information about reporting bugs, and the
777 latest version of this page, can be found at
778 https://www.kernel.org/doc/man-pages/.

Linux                             2021-08-27                        SECCOMP(2)