SECCOMP(2)                 Linux Programmer's Manual                SECCOMP(2)

NAME
       seccomp - operate on Secure Computing state of the process

SYNOPSIS
       #include <linux/seccomp.h>  /* Definition of SECCOMP_* constants */
       #include <linux/filter.h>   /* Definition of struct sock_fprog */
       #include <linux/audit.h>    /* Definition of AUDIT_* constants */
       #include <linux/signal.h>   /* Definition of SIG* constants */
       #include <sys/ptrace.h>     /* Definition of PTRACE_* constants */
       #include <sys/syscall.h>    /* Definition of SYS_* constants */
       #include <unistd.h>

       int syscall(SYS_seccomp, unsigned int operation, unsigned int flags,
                   void *args);

       Note: glibc provides no wrapper for seccomp(), necessitating the use of
       syscall(2).

DESCRIPTION
       The seccomp() system call operates on the Secure Computing (seccomp)
       state of the calling process.

       Currently, Linux supports the following operation values:

       SECCOMP_SET_MODE_STRICT
              The only system calls that the calling thread is permitted to
              make are read(2), write(2), _exit(2) (but not exit_group(2)),
              and sigreturn(2). Other system calls result in the delivery of
              a SIGKILL signal. Strict secure computing mode is useful for
              number-crunching applications that may need to execute untrusted
              byte code, perhaps obtained by reading from a pipe or socket.

              Note that although the calling thread can no longer call
              sigprocmask(2), it can use sigreturn(2) to block all signals
              apart from SIGKILL and SIGSTOP. This means that alarm(2) (for
              example) is not sufficient for restricting the process's
              execution time. Instead, to reliably terminate the process,
              SIGKILL must be used. This can be done by using timer_create(2)
              with SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by using
              setrlimit(2) to set the hard limit for RLIMIT_CPU.

              This operation is available only if the kernel is configured
              with CONFIG_SECCOMP enabled.

              The value of flags must be 0, and args must be NULL.

              This operation is functionally identical to the call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
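The strict-mode rules above can be tried out safely in a child process. The following is a minimal sketch, not a reference implementation: the helper name strict_mode_demo() and its return-value convention are assumptions made for illustration. It relies on the point made above that _exit(2) is permitted but exit_group(2) is not, so the child calls the raw exit system call rather than the glibc wrapper.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a child in strict mode and report its fate.
   Returns the child's exit status, 128 + signal number if the child was
   killed by a signal, or -1 if fork(2) or waitpid(2) failed. */
int
strict_mode_demo(int misbehave)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* flags must be 0 and args must be NULL for this operation. */
        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL) == -1)
            _exit(2);               /* Strict mode unavailable here */

        if (misbehave)
            syscall(SYS_getpid);    /* Not permitted: expect SIGKILL */

        /* The glibc _exit() wrapper employs exit_group(2), which strict
           mode forbids, so invoke the exit system call directly. */
        syscall(SYS_exit, 0);
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1)
        return -1;
    if (WIFSIGNALED(wstatus))
        return 128 + WTERMSIG(wstatus);
    return WEXITSTATUS(wstatus);
}
```

A well-behaved child should yield 0, and a misbehaving one 128 + SIGKILL; a result of 2 means the kernel or sandbox refused to enter strict mode.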

       SECCOMP_SET_MODE_FILTER
              The system calls allowed are defined by a pointer to a Berkeley
              Packet Filter (BPF) passed via args. This argument is a pointer
              to a struct sock_fprog; it can be designed to filter arbitrary
              system calls and system call arguments. If the filter is
              invalid, seccomp() fails, returning EINVAL in errno.

              If fork(2) or clone(2) is allowed by the filter, any child
              processes will be constrained to the same system call filters
              as the parent. If execve(2) is allowed, the existing filters
              will be preserved across a call to execve(2).

              In order to use the SECCOMP_SET_MODE_FILTER operation, either
              the calling thread must have the CAP_SYS_ADMIN capability in its
              user namespace, or the thread must already have the no_new_privs
              bit set. If that bit was not already set by an ancestor of this
              thread, the thread must make the following call:

                  prctl(PR_SET_NO_NEW_PRIVS, 1);

              Otherwise, the SECCOMP_SET_MODE_FILTER operation fails and
              returns EACCES in errno. This requirement ensures that an
              unprivileged process cannot apply a malicious filter and then
              invoke a set-user-ID or other privileged program using
              execve(2), thus potentially compromising that program. (Such a
              malicious filter might, for example, cause an attempt to use
              setuid(2) to set the caller's user IDs to nonzero values to
              instead return 0 without actually making the system call. Thus,
              the program might be tricked into retaining superuser privileges
              in circumstances where it is possible to influence it to do
              dangerous things because it did not actually drop privileges.)

              If prctl(2) or seccomp() is allowed by the attached filter,
              further filters may be added. This will increase evaluation
              time, but allows for further reduction of the attack surface
              during execution of a thread.

              The SECCOMP_SET_MODE_FILTER operation is available only if the
              kernel is configured with CONFIG_SECCOMP_FILTER enabled.

              When flags is 0, this operation is functionally identical to the
              call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);

              The recognized flags are:

              SECCOMP_FILTER_FLAG_LOG (since Linux 4.14)
                     All filter return actions except SECCOMP_RET_ALLOW should
                     be logged. An administrator may override this filter
                     flag by preventing specific actions from being logged via
                     the /proc/sys/kernel/seccomp/actions_logged file.

              SECCOMP_FILTER_FLAG_NEW_LISTENER (since Linux 5.0)
                     After successfully installing the filter program, return
                     a new user-space notification file descriptor. (The
                     close-on-exec flag is set for the file descriptor.) When
                     the filter returns SECCOMP_RET_USER_NOTIF a notification
                     will be sent to this file descriptor.

                     At most one seccomp filter using the
                     SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be installed
                     for a thread.

                     See seccomp_unotify(2) for further details.

              SECCOMP_FILTER_FLAG_SPEC_ALLOW (since Linux 4.17)
                     Disable Speculative Store Bypass mitigation.

              SECCOMP_FILTER_FLAG_TSYNC
                     When adding a new filter, synchronize all other threads
                     of the calling process to the same seccomp filter tree.
                     A "filter tree" is the ordered list of filters attached
                     to a thread. (Attaching identical filters in separate
                     seccomp() calls results in different filters from this
                     perspective.)

                     If any thread cannot synchronize to the same filter tree,
                     the call will not attach the new seccomp filter, and will
                     fail, returning the first thread ID found that cannot
                     synchronize. Synchronization will fail if another thread
                     in the same process is in SECCOMP_MODE_STRICT or if it
                     has attached new seccomp filters to itself, diverging
                     from the calling thread's filter tree.

       SECCOMP_GET_ACTION_AVAIL (since Linux 4.14)
              Test to see if an action is supported by the kernel. This
              operation is helpful to confirm that the kernel knows of a more
              recently added filter return action since the kernel treats all
              unknown actions as SECCOMP_RET_KILL_PROCESS.

              The value of flags must be 0, and args must be a pointer to an
              unsigned 32-bit filter return action.
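As an illustration of the probing described above, the sketch below wraps the operation in a small function. The name action_available() and its three-way return convention are assumptions for this example; the fallback definition of SECCOMP_GET_ACTION_AVAIL only matters when building against headers older than Linux 4.14.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SECCOMP_GET_ACTION_AVAIL
#define SECCOMP_GET_ACTION_AVAIL 2  /* Fallback for pre-4.14 headers */
#endif

/* Hypothetical helper: returns 1 if the kernel supports 'action', 0 if it
   reports EOPNOTSUPP, and -1 on any other error (for example, EINVAL on a
   kernel that predates this operation). */
int
action_available(uint32_t action)
{
    /* flags must be 0; args points to the 32-bit action value. */
    if (syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return (errno == EOPNOTSUPP) ? 0 : -1;
}
```

A program could call action_available(SECCOMP_RET_USER_NOTIF), for instance, before building a filter that depends on that action.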

       SECCOMP_GET_NOTIF_SIZES (since Linux 5.0)
              Get the sizes of the seccomp user-space notification structures.
              Since these structures may evolve and grow over time, this
              command can be used to determine how much memory to allocate
              for sending and receiving notifications.

              The value of flags must be 0, and args must be a pointer to a
              struct seccomp_notif_sizes, which has the following form:

                  struct seccomp_notif_sizes {
                      __u16 seccomp_notif;      /* Size of notification structure */
                      __u16 seccomp_notif_resp; /* Size of response structure */
                      __u16 seccomp_data;       /* Size of 'struct seccomp_data' */
                  };

              See seccomp_unotify(2) for further details.
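A size query along the lines described above might look like the following sketch. The helper name query_notif_sizes() is an assumption for illustration, and the code assumes kernel headers from Linux 5.0 or later, which define struct seccomp_notif_sizes.

```c
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: fill *sizes from the running kernel.
   Returns 0 on success, or -1 on error with errno set. */
int
query_notif_sizes(struct seccomp_notif_sizes *sizes)
{
    /* flags must be 0 for SECCOMP_GET_NOTIF_SIZES. */
    return (int) syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}
```

A supervisor would typically allocate its notification and response buffers using the reported sizes rather than sizeof(), so that it keeps working if the kernel's structures grow.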

   Filters
       When adding filters via SECCOMP_SET_MODE_FILTER, args points to a
       filter program:

           struct sock_fprog {
               unsigned short      len;    /* Number of BPF instructions */
               struct sock_filter *filter; /* Pointer to array of
                                              BPF instructions */
           };

       Each program must contain one or more BPF instructions:

           struct sock_filter {            /* Filter block */
               __u16 code;                 /* Actual filter code */
               __u8  jt;                   /* Jump true */
               __u8  jf;                   /* Jump false */
               __u32 k;                    /* Generic multiuse field */
           };

       When executing the instructions, the BPF program operates on the system
       call information made available (i.e., use the BPF_ABS addressing mode)
       as a (read-only) buffer of the following form:

           struct seccomp_data {
               int   nr;                   /* System call number */
               __u32 arch;                 /* AUDIT_ARCH_* value
                                              (see <linux/audit.h>) */
               __u64 instruction_pointer;  /* CPU instruction pointer */
               __u64 args[6];              /* Up to 6 system call arguments */
           };

       Because numbering of system calls varies between architectures and some
       architectures (e.g., x86-64) allow user-space code to use the calling
       conventions of multiple architectures (and the convention being used
       may vary over the life of a process that uses execve(2) to execute
       binaries that employ the different conventions), it is usually
       necessary to verify the value of the arch field.

       It is strongly recommended to use an allow-list approach whenever
       possible because such an approach is more robust and simple. A
       deny-list will have to be updated whenever a potentially dangerous
       system call is added (or a dangerous flag or option if those are
       deny-listed), and it is often possible to alter the representation of
       a value without altering its meaning, leading to a deny-list bypass.
       See also Caveats below.

       The arch field is not unique for all calling conventions. The x86-64
       ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on
       the same processors. Instead, the mask __X32_SYSCALL_BIT is used on
       the system call number to tell the two ABIs apart.

       This means that a policy must either deny all syscalls with
       __X32_SYSCALL_BIT or it must recognize syscalls with and without
       __X32_SYSCALL_BIT set. A list of system calls to be denied based on nr
       that does not also contain nr values with __X32_SYSCALL_BIT set can be
       bypassed by a malicious program that sets __X32_SYSCALL_BIT.
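The bypass described above can be modeled in ordinary C, outside of BPF. This is only a user-space illustration of the comparison logic; the function names are hypothetical, and X32_SYSCALL_BIT is defined locally with the same value as the kernel's __X32_SYSCALL_BIT so that the sketch also compiles on non-x86 hosts.

```c
#include <stdbool.h>

/* Same value as __X32_SYSCALL_BIT (bit 30), defined locally. */
#define X32_SYSCALL_BIT 0x40000000u

/* Naive deny-list check: compares the raw system call number only. */
bool
naive_is_denied(unsigned int nr, unsigned int denied_nr)
{
    return nr == denied_nr;
}

/* Stricter policy: refuse every number with the x32 bit set, then
   compare the remaining numbers against the deny-list entry. */
bool
strict_is_denied(unsigned int nr, unsigned int denied_nr)
{
    if (nr & X32_SYSCALL_BIT)
        return true;
    return nr == denied_nr;
}
```

With a deny-list entry of 59, the naive check rejects 59 but lets (59 | X32_SYSCALL_BIT) through, whereas the stricter check rejects both.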

       Additionally, kernels prior to Linux 5.4 incorrectly permitted nr in
       the ranges 512-547 as well as the corresponding non-x32 syscalls ORed
       with __X32_SYSCALL_BIT. For example, nr == 521 and nr == (101 |
       __X32_SYSCALL_BIT) would result in invocations of ptrace(2) with
       potentially confused x32-vs-x86_64 semantics in the kernel. Policies
       intended to work on kernels before Linux 5.4 must ensure that they deny
       or otherwise correctly handle these system calls. On Linux 5.4 and
       newer, such system calls will fail with the error ENOSYS, without doing
       anything.

       The instruction_pointer field provides the address of the
       machine-language instruction that performed the system call. This
       might be useful in conjunction with the use of /proc/[pid]/maps to
       perform checks based on which region (mapping) of the program made the
       system call. (Probably, it is wise to lock down the mmap(2) and
       mprotect(2) system calls to prevent the program from subverting such
       checks.)

       When checking values from args, keep in mind that arguments are often
       silently truncated before being processed, but after the seccomp check.
       For example, this happens if the i386 ABI is used on an x86-64 kernel:
       although the kernel will normally not look beyond the 32 lowest bits of
       the arguments, the values of the full 64-bit registers will be present
       in the seccomp data. A less surprising example is that if the x86-64
       ABI is used to perform a system call that takes an argument of type
       int, the more-significant half of the argument register is ignored by
       the system call, but visible in the seccomp data.
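Because seccomp BPF loads are 32 bits wide, one way to account for the upper half of an argument register is to test both halves explicitly. The following sketch emits such a fragment; the helper name emit_arg64_eq(), the choice of EPERM for mismatches, and the fragment's allow-on-match policy are all assumptions made for illustration, not a recommended policy.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets of the two 32-bit halves within each 64-bit args[] slot. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define ARG_LO_OFF 4
#define ARG_HI_OFF 0
#else
#define ARG_LO_OFF 0
#define ARG_HI_OFF 4
#endif

/* Hypothetical helper: write six instructions to 'out' that allow the
   system call only if all 64 bits of args[idx] equal 'value', and
   otherwise fail it with EPERM. Returns the instruction count. */
size_t
emit_arg64_eq(struct sock_filter *out, unsigned int idx, uint64_t value)
{
    uint32_t base = offsetof(struct seccomp_data, args) + 8 * idx;

    /* [0] Load the more-significant half of the argument. */
    out[0] = (struct sock_filter)
             BPF_STMT(BPF_LD | BPF_W | BPF_ABS, base + ARG_HI_OFF);
    /* [1] On mismatch, jump forward 3 instructions to [5]. */
    out[1] = (struct sock_filter)
             BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                      (uint32_t) (value >> 32), 0, 3);
    /* [2] Load the less-significant half of the argument. */
    out[2] = (struct sock_filter)
             BPF_STMT(BPF_LD | BPF_W | BPF_ABS, base + ARG_LO_OFF);
    /* [3] On mismatch, jump forward 1 instruction to [5]. */
    out[3] = (struct sock_filter)
             BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t) value, 0, 1);
    /* [4] Both halves matched. */
    out[4] = (struct sock_filter)
             BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    /* [5] Some half differed. */
    out[5] = (struct sock_filter)
             BPF_STMT(BPF_RET | BPF_K,
                      SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA));
    return 6;
}
```

A filter that compared only the low half would treat 0x100000002 and 2 as equal, which is the kind of mismatch between filter and kernel that the paragraph above warns about.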

       A seccomp filter returns a 32-bit value consisting of two parts: the
       most significant 16 bits (corresponding to the mask defined by the
       constant SECCOMP_RET_ACTION_FULL) contain one of the "action" values
       listed below; the least significant 16 bits (defined by the constant
       SECCOMP_RET_DATA) are "data" to be associated with this return value.
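The split into action and data can be expressed with the two masks named above. This is a small sketch with hypothetical function names; the fallback definition of SECCOMP_RET_ACTION_FULL is only needed with headers older than Linux 4.14.

```c
#include <linux/seccomp.h>
#include <stdint.h>

#ifndef SECCOMP_RET_ACTION_FULL
#define SECCOMP_RET_ACTION_FULL 0xffff0000U /* Fallback for old headers */
#endif

/* Upper 16 bits: which action the filter requested. */
uint32_t
filter_ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Lower 16 bits: the data accompanying that action. */
uint32_t
filter_ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}
```

For a filter return value of (SECCOMP_RET_ERRNO | 99), the action part is SECCOMP_RET_ERRNO and the data part is 99.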

       If multiple filters exist, they are all executed, in reverse order of
       their addition to the filter tree; that is, the most recently installed
       filter is executed first. (Note that all filters will be called even
       if one of the earlier filters returns SECCOMP_RET_KILL. This is done
       to simplify the kernel code and to provide a tiny speed-up in the
       execution of sets of filters by avoiding a check for this uncommon
       case.) The return value for the evaluation of a given system call is
       the first-seen action value of highest precedence (along with its
       accompanying data) returned by execution of all of the filters.
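The selection rule above can be modeled in user space as follows. This is an illustrative model rather than kernel code: the function names are hypothetical, and the ranking assumes the numeric action values defined in <linux/seccomp.h>, where reading the action bits as a signed number makes a lower value correspond to a higher precedence.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Fallbacks for headers older than Linux 4.14. */
#ifndef SECCOMP_RET_ACTION_FULL
#define SECCOMP_RET_ACTION_FULL 0xffff0000U
#endif
#ifndef SECCOMP_RET_KILL_PROCESS
#define SECCOMP_RET_KILL_PROCESS 0x80000000U
#endif
#ifndef SECCOMP_RET_KILL_THREAD
#define SECCOMP_RET_KILL_THREAD 0x00000000U
#endif

/* Read the action bits as signed, so that SECCOMP_RET_KILL_PROCESS
   (top bit set) ranks lowest and therefore takes precedence. */
static int32_t
action_rank(uint32_t ret)
{
    return (int32_t) (ret & SECCOMP_RET_ACTION_FULL);
}

/* results[0] comes from the most recently installed filter. Every
   result is examined; a later result replaces the current choice only
   if its action has strictly higher precedence, so the first-seen
   value wins among equal actions. */
uint32_t
combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t ret = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        if (action_rank(results[i]) < action_rank(ret))
            ret = results[i];
    }
    return ret;
}
```

For example, if two filters return (SECCOMP_RET_ERRNO | 1) and (SECCOMP_RET_ERRNO | 2) in that order, the combined result keeps the data value 1 from the filter that ran first.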

       In decreasing order of precedence, the action values that may be
       returned by a seccomp filter are:

       SECCOMP_RET_KILL_PROCESS (since Linux 4.14)
              This value results in immediate termination of the process, with
              a core dump. The system call is not executed. By contrast with
              SECCOMP_RET_KILL_THREAD below, all threads in the thread group
              are terminated. (For a discussion of thread groups, see the
              description of the CLONE_THREAD flag in clone(2).)

              The process terminates as though killed by a SIGSYS signal.
              Even if a signal handler has been registered for SIGSYS, the
              handler will be ignored in this case and the process always
              terminates. To a parent process that is waiting on this process
              (using waitpid(2) or similar), the returned wstatus will
              indicate that its child was terminated as though by a SIGSYS
              signal.

       SECCOMP_RET_KILL_THREAD (or SECCOMP_RET_KILL)
              This value results in immediate termination of the thread that
              made the system call. The system call is not executed. Other
              threads in the same thread group will continue to execute.

              The thread terminates as though killed by a SIGSYS signal. See
              SECCOMP_RET_KILL_PROCESS above.

              Before Linux 4.11, any process terminated in this way would not
              trigger a coredump (even though SIGSYS is documented in
              signal(7) as having a default action of termination with a core
              dump). Since Linux 4.11, a single-threaded process will dump
              core if terminated in this way.

              With the addition of SECCOMP_RET_KILL_PROCESS in Linux 4.14,
              SECCOMP_RET_KILL_THREAD was added as a synonym for
              SECCOMP_RET_KILL, in order to more clearly distinguish the two
              actions.

              Note: the use of SECCOMP_RET_KILL_THREAD to kill a single thread
              in a multithreaded process is likely to leave the process in a
              permanently inconsistent and possibly corrupt state.

       SECCOMP_RET_TRAP
              This value results in the kernel sending a thread-directed
              SIGSYS signal to the triggering thread. (The system call is not
              executed.) Various fields will be set in the siginfo_t
              structure (see sigaction(2)) associated with the signal:

              * si_signo will contain SIGSYS.

              * si_call_addr will show the address of the system call
                instruction.

              * si_syscall and si_arch will indicate which system call was
                attempted.

              * si_code will contain SYS_SECCOMP.

              * si_errno will contain the SECCOMP_RET_DATA portion of the
                filter return value.

              The program counter will be as though the system call happened
              (i.e., the program counter will not point to the system call
              instruction). The return value register will contain an
              architecture-dependent value; if resuming execution, set it to
              something appropriate for the system call. (The architecture
              dependency is because replacing it with ENOSYS could overwrite
              some useful information.)

       SECCOMP_RET_ERRNO
              This value results in the SECCOMP_RET_DATA portion of the
              filter's return value being passed to user space as the errno
              value without executing the system call.

       SECCOMP_RET_USER_NOTIF (since Linux 5.0)
              Forward the system call to an attached user-space supervisor
              process to allow that process to decide what to do with the
              system call. If there is no attached supervisor (either because
              the filter was not installed with the
              SECCOMP_FILTER_FLAG_NEW_LISTENER flag or because the file
              descriptor was closed), the filter returns ENOSYS (similar to
              what happens when a filter returns SECCOMP_RET_TRACE and there
              is no tracer). See seccomp_unotify(2) for further details.

              Note that the supervisor process will not be notified if another
              filter returns an action value with a precedence greater than
              SECCOMP_RET_USER_NOTIF.

       SECCOMP_RET_TRACE
              When returned, this value will cause the kernel to attempt to
              notify a ptrace(2)-based tracer prior to executing the system
              call. If there is no tracer present, the system call is not
              executed and returns a failure status with errno set to ENOSYS.

              A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
              using ptrace(PTRACE_SETOPTIONS). The tracer will be notified of
              a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of the
              filter's return value will be available to the tracer via
              PTRACE_GETEVENTMSG.

              The tracer can skip the system call by changing the system call
              number to -1. Alternatively, the tracer can change the system
              call requested by changing the system call to a valid system
              call number. If the tracer asks to skip the system call, then
              the system call will appear to return the value that the tracer
              puts in the return value register.

              Before kernel 4.8, the seccomp check will not be run again after
              the tracer is notified. (This means that, on older kernels,
              seccomp-based sandboxes must not allow use of ptrace(2), even of
              other sandboxed processes, without extreme care; ptracers can
              use this mechanism to escape from the seccomp sandbox.)

              Note that a tracer process will not be notified if another
              filter returns an action value with a precedence greater than
              SECCOMP_RET_TRACE.

       SECCOMP_RET_LOG (since Linux 4.14)
              This value results in the system call being executed after the
              filter return action is logged. An administrator may override
              the logging of this action via the
              /proc/sys/kernel/seccomp/actions_logged file.

       SECCOMP_RET_ALLOW
              This value results in the system call being executed.

       If an action value other than one of the above is specified, then the
       filter action is treated as either SECCOMP_RET_KILL_PROCESS (since
       Linux 4.14) or SECCOMP_RET_KILL_THREAD (in Linux 4.13 and earlier).

   /proc interfaces
       The files in the directory /proc/sys/kernel/seccomp provide additional
       seccomp information and configuration:

       actions_avail (since Linux 4.14)
              A read-only ordered list of seccomp filter return actions in
              string form. The ordering, from left-to-right, is in decreasing
              order of precedence. The list represents the set of seccomp
              filter return actions supported by the kernel.

       actions_logged (since Linux 4.14)
              A read-write ordered list of seccomp filter return actions that
              are allowed to be logged. Writes to the file do not need to be
              in ordered form but reads from the file will be ordered in the
              same way as the actions_avail file.

              It is important to note that the value of actions_logged does
              not prevent certain filter return actions from being logged when
              the audit subsystem is configured to audit a task. If the
              action is not found in the actions_logged file, the final
              decision on whether to audit the action for that task is
              ultimately left up to the audit subsystem to decide for all
              filter return actions other than SECCOMP_RET_ALLOW.

              The "allow" string is not accepted in the actions_logged file as
              it is not possible to log SECCOMP_RET_ALLOW actions. Attempting
              to write "allow" to the file will fail with the error EINVAL.
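A program could consult these files as sketched below. Both helper names are hypothetical, and the code assumes that the file holds a single line of space-separated action names, which is an assumption about the file format rather than something stated above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the first line of 'path' into 'buf'.
   Returns 0 on success, or -1 if the file cannot be read. */
int
read_actions_file(const char *path, char *buf, size_t size)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    char *line = fgets(buf, (int) size, fp);
    fclose(fp);
    return (line != NULL) ? 0 : -1;
}

/* Hypothetical helper: true if 'name' appears in 'list' as a whole
   word delimited by spaces, a newline, or the ends of the string. */
bool
action_in_list(const char *list, const char *name)
{
    size_t len = strlen(name);

    if (len == 0)
        return false;

    for (const char *p = list; (p = strstr(p, name)) != NULL; p += len) {
        bool starts = (p == list) || (p[-1] == ' ');
        bool ends = (p[len] == '\0') || (p[len] == ' ') ||
                    (p[len] == '\n');
        if (starts && ends)
            return true;
    }
    return false;
}
```

The whole-word test matters because one action name can be a substring of another; a plain strstr() lookup for a short name could match inside a longer one.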

   Audit logging of seccomp actions
       Since Linux 4.14, the kernel provides the facility to log the actions
       returned by seccomp filters in the audit log. The kernel makes the
       decision to log an action based on the action type, whether or not the
       action is present in the actions_logged file, and whether kernel
       auditing is enabled (e.g., via the kernel boot option audit=1). The
       rules are as follows:

       * If the action is SECCOMP_RET_ALLOW, the action is not logged.

       * Otherwise, if the action is either SECCOMP_RET_KILL_PROCESS or
         SECCOMP_RET_KILL_THREAD, and that action appears in the
         actions_logged file, the action is logged.

       * Otherwise, if the filter has requested logging (the
         SECCOMP_FILTER_FLAG_LOG flag) and the action appears in the
         actions_logged file, the action is logged.

       * Otherwise, if kernel auditing is enabled and the process is being
         audited (autrace(8)), the action is logged.

       * Otherwise, the action is not logged.

RETURN VALUE
       On success, seccomp() returns 0. On error, if
       SECCOMP_FILTER_FLAG_TSYNC was used, the return value is the ID of the
       thread that caused the synchronization failure. (This ID is a kernel
       thread ID of the type returned by clone(2) and gettid(2).) On other
       errors, -1 is returned, and errno is set to indicate the error.

ERRORS
       seccomp() can fail for the following reasons:

       EACCES The caller did not have the CAP_SYS_ADMIN capability in its user
              namespace, or had not set no_new_privs before using
              SECCOMP_SET_MODE_FILTER.

       EBUSY  While installing a new filter, the
              SECCOMP_FILTER_FLAG_NEW_LISTENER flag was specified, but a
              previous filter had already been installed with that flag.

       EFAULT args was not a valid address.

       EINVAL operation is unknown or is not supported by this kernel version
              or configuration.

       EINVAL The specified flags are invalid for the given operation.

       EINVAL operation included BPF_ABS, but the specified offset was not
              aligned to a 32-bit boundary or exceeded sizeof(struct
              seccomp_data).

       EINVAL A secure computing mode has already been set, and operation
              differs from the existing setting.

       EINVAL operation specified SECCOMP_SET_MODE_FILTER, but the filter
              program pointed to by args was not valid or the length of the
              filter program was zero or exceeded BPF_MAXINSNS (4096)
              instructions.

       ENOMEM Out of memory.

       ENOMEM The total length of all filter programs attached to the calling
              thread would exceed MAX_INSNS_PER_PATH (32768) instructions.
              Note that for the purposes of calculating this limit, each
              already existing filter program incurs an overhead penalty of 4
              instructions.

       EOPNOTSUPP
              operation specified SECCOMP_GET_ACTION_AVAIL, but the kernel
              does not support the filter return action specified by args.

       ESRCH  Another thread caused a failure during thread sync, but its ID
              could not be determined.

VERSIONS
       The seccomp() system call first appeared in Linux 3.17.

CONFORMING TO
       The seccomp() system call is a nonstandard Linux extension.

NOTES
       Rather than hand-coding seccomp filters as shown in the example below,
       you may prefer to employ the libseccomp library, which provides a
       front-end for generating seccomp filters.

       The Seccomp field of the /proc/[pid]/status file provides a method of
       viewing the seccomp mode of a process; see proc(5).
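Reading that field for the calling process could look like the following sketch. The helper name current_seccomp_mode() is an assumption for illustration; the meaning of the numeric values is documented in proc(5).

```c
#include <stdio.h>

/* Hypothetical helper: return the value of the "Seccomp:" field from
   /proc/self/status, or -1 if the file or the field is unavailable. */
int
current_seccomp_mode(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    if (fp == NULL)
        return -1;

    char line[256];
    int mode = -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        /* The literal colon keeps similarly named fields, such as
           "Seccomp_filters:", from matching. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }

    fclose(fp);
    return mode;
}
```

Checking this value before and after installing a filter is a simple way to confirm that the filter took effect.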

       seccomp() provides a superset of the functionality provided by the
       prctl(2) PR_SET_SECCOMP operation (which does not support flags).

       Since Linux 4.4, the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation can
       be used to dump a process's seccomp filters.

   Architecture support for seccomp BPF
       Architecture support for seccomp BPF filtering is available on the
       following architectures:

       * x86-64, i386, x32 (since Linux 3.5)
       * ARM (since Linux 3.8)
       * s390 (since Linux 3.8)
       * MIPS (since Linux 3.16)
       * ARM-64 (since Linux 3.19)
       * PowerPC (since Linux 4.3)
       * Tile (since Linux 4.3)
       * PA-RISC (since Linux 4.6)

   Caveats
       There are various subtleties to consider when applying seccomp filters
       to a program, including the following:

       * Some traditional system calls have user-space implementations in the
         vdso(7) on many architectures. Notable examples include
         clock_gettime(2), gettimeofday(2), and time(2). On such
         architectures, seccomp filtering for these system calls will have no
         effect. (However, there are cases where the vdso(7) implementations
         may fall back to invoking the true system call, in which case seccomp
         filters would see the system call.)

       * Seccomp filtering is based on system call numbers. However,
         applications typically do not directly invoke system calls, but
         instead call wrapper functions in the C library which in turn invoke
         the system calls. Consequently, one must be aware of the following:

         • The glibc wrappers for some traditional system calls may actually
           employ system calls with different names in the kernel. For
           example, the exit(2) wrapper function actually employs the
           exit_group(2) system call, and the fork(2) wrapper function
           actually calls clone(2).

         • The behavior of wrapper functions may vary across architectures,
           according to the range of system calls provided on those
           architectures. In other words, the same wrapper function may
           invoke different system calls on different architectures.

         • Finally, the behavior of wrapper functions can change across
           glibc versions. For example, in older versions, the glibc wrapper
           function for open(2) invoked the system call of the same name,
           but starting in glibc 2.26, the implementation switched to
           calling openat(2) on all architectures.

       The consequence of the above points is that it may be necessary to
       filter for a system call other than might be expected. Various manual
       pages in Section 2 provide helpful details about the differences
       between wrapper functions and the underlying system calls in
       subsections entitled C library/kernel differences.

       Furthermore, note that the application of seccomp filters even risks
       causing bugs in an application, when the filters cause unexpected
       failures for legitimate operations that the application might need to
       perform. Such bugs may not easily be discovered when testing the
       seccomp filters if the bugs occur in rarely used application code
       paths.

   Seccomp-specific BPF details
       Note the following BPF details specific to seccomp filters:

       * The BPF_H and BPF_B size modifiers are not supported: all operations
         must load and store (4-byte) words (BPF_W).

       * To access the contents of the seccomp_data buffer, use the BPF_ABS
         addressing mode modifier.

       * The BPF_LEN addressing mode modifier yields an immediate mode
         operand whose value is the size of the seccomp_data buffer.

EXAMPLES
       The program below accepts four or more arguments. The first three
       arguments are a system call number, a numeric architecture identifier,
       and an error number. The program uses these values to construct a BPF
       filter that is used at run time to perform the following checks:

       [1] If the program is not running on the specified architecture, the
           BPF filter causes the process to be killed
           (SECCOMP_RET_KILL_PROCESS) when it makes a system call.

       [2] If the program attempts to execute the system call with the
           specified number, the BPF filter causes the system call to fail,
           with errno being set to the specified error number.

       The remaining command-line arguments specify the pathname and
       additional arguments of a program that the example program should
       attempt to execute using execv(3) (a library function that employs the
       execve(2) system call). Some example runs of the program are shown
       below.

       First, we display the architecture that we are running on (x86-64) and
       then construct a shell function that looks up system call numbers on
       this architecture:

           $ uname -m
           x86_64
           $ syscall_nr() {
               cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \
               awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
           }

       When the BPF filter rejects a system call (case [2] above), it causes
       the system call to fail with the error number specified on the command
       line. In the experiments shown here, we'll use error number 99:

           $ errno 99
           EADDRNOTAVAIL 99 Cannot assign requested address

       In the following example, we attempt to run the command whoami(1), but
       the BPF filter rejects the execve(2) system call, so that the command
       is not even executed:

           $ syscall_nr execve
           59
           $ ./a.out
           Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
           Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
                            AUDIT_ARCH_X86_64: 0xC000003E
           $ ./a.out 59 0xC000003E 99 /bin/whoami
           execv: Cannot assign requested address

       In the next example, the BPF filter rejects the write(2) system call,
       so that, although it is successfully started, the whoami(1) command is
       not able to write output:

           $ syscall_nr write
           1
           $ ./a.out 1 0xC000003E 99 /bin/whoami

       In the final example, the BPF filter rejects a system call that is not
       used by the whoami(1) command, so it is able to successfully execute
       and produce output:

           $ syscall_nr preadv
           295
           $ ./a.out 295 0xC000003E 99 /bin/whoami
           cecilia

   Program source
       #include <errno.h>
       #include <stddef.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <linux/audit.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <sys/prctl.h>
       #include <sys/syscall.h>

       #define X32_SYSCALL_BIT 0x40000000
       #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

       static int
       install_filter(int syscall_nr, int t_arch, int f_errno)
       {
           unsigned int upper_nr_limit = 0xffffffff;

           /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
              (in the x32 ABI, all system calls have bit 30 set in the
              'nr' field, meaning the numbers are >= X32_SYSCALL_BIT). */
           if (t_arch == AUDIT_ARCH_X86_64)
               upper_nr_limit = X32_SYSCALL_BIT - 1;

           struct sock_filter filter[] = {
               /* [0] Load architecture from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, arch))),

               /* [1] Jump forward 5 instructions if architecture does not
                      match 't_arch'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

               /* [2] Load system call number from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, nr))),

               /* [3] Check ABI - only needed for x86-64 in deny-list use
                      cases. Use BPF_JGT instead of checking against the bit
                      mask to avoid having to reload the syscall number. */
               BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

               /* [4] Jump forward 1 instruction if system call number
                      does not match 'syscall_nr'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

               /* [5] Matching architecture and system call: don't execute
                      the system call, and return 'f_errno' in 'errno'. */
               BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

               /* [6] Destination of system call number mismatch: allow other
                      system calls. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

               /* [7] Destination of architecture mismatch: kill process. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
           };

           struct sock_fprog prog = {
               .len = ARRAY_SIZE(filter),
               .filter = filter,
           };

           /* glibc provides no seccomp() wrapper, so use syscall(2). */
           if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
               perror("seccomp");
               return 1;
           }

           return 0;
       }

       int
       main(int argc, char **argv)
       {
           if (argc < 5) {
               fprintf(stderr, "Usage: "
                       "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                       "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                       "                 AUDIT_ARCH_X86_64: 0x%X\n"
                       "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
               exit(EXIT_FAILURE);
           }

           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
               perror("prctl");
               exit(EXIT_FAILURE);
           }

           if (install_filter(strtol(argv[1], NULL, 0),
                              strtol(argv[2], NULL, 0),
                              strtol(argv[3], NULL, 0)))
               exit(EXIT_FAILURE);

           execv(argv[4], &argv[4]);
           perror("execv");
           exit(EXIT_FAILURE);
       }

SEE ALSO
       bpfc(1), strace(1), bpf(2), prctl(2), ptrace(2), seccomp_unotify(2),
       sigaction(2), proc(5), signal(7), socket(7)

       Various pages from the libseccomp library, including:
       scmp_sys_resolver(1), seccomp_export_bpf(3), seccomp_init(3),
       seccomp_load(3), and seccomp_rule_add(3).

       The kernel source files Documentation/networking/filter.txt and
       Documentation/userspace-api/seccomp_filter.rst (or
       Documentation/prctl/seccomp_filter.txt before Linux 4.13).

       McCanne, S. and Jacobson, V. (1992) The BSD Packet Filter: A New
       Architecture for User-level Packet Capture, Proceedings of the USENIX
       Winter 1993 Conference ⟨http://www.tcpdump.org/papers/bpf-usenix93.pdf⟩

COLOPHON
       This page is part of release 5.12 of the Linux man-pages project. A
       description of the project, information about reporting bugs, and the
       latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.

Linux                             2021-03-22                        SECCOMP(2)