seccomp(2)                    System Calls Manual                    seccomp(2)

NAME
seccomp - operate on Secure Computing state of the process

LIBRARY
Standard C library (libc, -lc)

SYNOPSIS
    #include <linux/seccomp.h>  /* Definition of SECCOMP_* constants */
    #include <linux/filter.h>   /* Definition of struct sock_fprog */
    #include <linux/audit.h>    /* Definition of AUDIT_* constants */
    #include <linux/signal.h>   /* Definition of SIG* constants */
    #include <sys/ptrace.h>     /* Definition of PTRACE_* constants */
    #include <sys/syscall.h>    /* Definition of SYS_* constants */
    #include <unistd.h>

    int syscall(SYS_seccomp, unsigned int operation, unsigned int flags,
                void *args);

Note: glibc provides no wrapper for seccomp(), necessitating the use of
syscall(2).

DESCRIPTION
The seccomp() system call operates on the Secure Computing (seccomp)
state of the calling process.

Currently, Linux supports the following operation values:

SECCOMP_SET_MODE_STRICT
The only system calls that the calling thread is permitted to
make are read(2), write(2), _exit(2) (but not exit_group(2)),
and sigreturn(2). Other system calls result in the termination
of the calling thread, or termination of the entire process with
the SIGKILL signal when there is only one thread. Strict secure
computing mode is useful for number-crunching applications that
may need to execute untrusted byte code, perhaps obtained by
reading from a pipe or socket.

Note that although the calling thread can no longer call
sigprocmask(2), it can use sigreturn(2) to block all signals apart
from SIGKILL and SIGSTOP. This means that alarm(2) (for example)
is not sufficient for restricting the process's execution
time. Instead, to reliably terminate the process, SIGKILL must
be used. This can be done by using timer_create(2) with
SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by using
setrlimit(2) to set the hard limit for RLIMIT_CPU.

This operation is available only if the kernel is configured
with CONFIG_SECCOMP enabled.

The value of flags must be 0, and args must be NULL.

This operation is functionally identical to the call:

    prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
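
As a rough illustration of strict mode, the sketch below forks a child
that enters SECCOMP_SET_MODE_STRICT, writes one line, and then
terminates with the raw exit system call (the glibc _exit() wrapper
uses exit_group(2), which strict mode does not permit). The helper name
run_in_strict_mode() is hypothetical, and error handling is minimal.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a tiny workload in a strict-mode child.
   Returns the child's wait status, or -1 if fork/waitpid failed. */
int
run_in_strict_mode(void)
{
    static const char msg[] = "hello from strict mode\n";
    int wstatus;
    pid_t pid;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* flags must be 0 and args must be NULL for this operation. */
        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL) == -1)
            syscall(SYS_exit, 2);       /* Strict mode unavailable */

        /* Only read, write, _exit, and sigreturn are permitted now. */
        ssize_t n = write(STDOUT_FILENO, msg, sizeof(msg) - 1);

        /* Use the raw exit system call; exit_group would be fatal. */
        syscall(SYS_exit, n == -1 ? 1 : 0);
    }

    if (waitpid(pid, &wstatus, 0) == -1)
        return -1;
    return wstatus;
}
```

A caller would typically inspect the result with WIFEXITED() and
WEXITSTATUS(); a child that made a forbidden system call would instead
show up as terminated by SIGKILL.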

SECCOMP_SET_MODE_FILTER
The system calls allowed are defined by a pointer to a Berkeley
Packet Filter (BPF) passed via args. This argument is a pointer
to a struct sock_fprog; it can be designed to filter arbitrary
system calls and system call arguments. If the filter is
invalid, seccomp() fails, returning EINVAL in errno.

If fork(2) or clone(2) is allowed by the filter, any child
processes will be constrained to the same system call filters as
the parent. If execve(2) is allowed, the existing filters will
be preserved across a call to execve(2).

In order to use the SECCOMP_SET_MODE_FILTER operation, either
the calling thread must have the CAP_SYS_ADMIN capability in its
user namespace, or the thread must already have the no_new_privs
bit set. If that bit was not already set by an ancestor of this
thread, the thread must make the following call:

    prctl(PR_SET_NO_NEW_PRIVS, 1);

Otherwise, the SECCOMP_SET_MODE_FILTER operation fails and
returns EACCES in errno. This requirement ensures that an
unprivileged process cannot apply a malicious filter and then invoke a
set-user-ID or other privileged program using execve(2), thus
potentially compromising that program. (Such a malicious filter
might, for example, cause an attempt to use setuid(2) to set the
caller's user IDs to nonzero values to instead return 0 without
actually making the system call. Thus, the program might be
tricked into retaining superuser privileges in circumstances
where it is possible to influence it to do dangerous things
because it did not actually drop privileges.)

If prctl(2) or seccomp() is allowed by the attached filter,
further filters may be added. This will increase evaluation time,
but allows for further reduction of the attack surface during
execution of a thread.

The SECCOMP_SET_MODE_FILTER operation is available only if the
kernel is configured with CONFIG_SECCOMP_FILTER enabled.

When flags is 0, this operation is functionally identical to the
call:

    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);

The recognized flags are:

SECCOMP_FILTER_FLAG_LOG (since Linux 4.14)
All filter return actions except SECCOMP_RET_ALLOW should
be logged. An administrator may override this filter
flag by preventing specific actions from being logged via
the /proc/sys/kernel/seccomp/actions_logged file.

SECCOMP_FILTER_FLAG_NEW_LISTENER (since Linux 5.0)
After successfully installing the filter program, return
a new user-space notification file descriptor. (The
close-on-exec flag is set for the file descriptor.) When
the filter returns SECCOMP_RET_USER_NOTIF a notification
will be sent to this file descriptor.

At most one seccomp filter using the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be installed for a thread.

See seccomp_unotify(2) for further details.

SECCOMP_FILTER_FLAG_SPEC_ALLOW (since Linux 4.17)
Disable Speculative Store Bypass mitigation.

SECCOMP_FILTER_FLAG_TSYNC
When adding a new filter, synchronize all other threads
of the calling process to the same seccomp filter tree.
A "filter tree" is the ordered list of filters attached
to a thread. (Attaching identical filters in separate
seccomp() calls results in different filters from this
perspective.)

If any thread cannot synchronize to the same filter tree,
the call will not attach the new seccomp filter, and will
fail, returning the first thread ID found that cannot
synchronize. Synchronization will fail if another thread
in the same process is in SECCOMP_MODE_STRICT or if it
has attached new seccomp filters to itself, diverging
from the calling thread's filter tree.

SECCOMP_GET_ACTION_AVAIL (since Linux 4.14)
Test to see if an action is supported by the kernel. This
operation is helpful to confirm that the kernel knows of a more
recently added filter return action since the kernel treats all
unknown actions as SECCOMP_RET_KILL_PROCESS.

The value of flags must be 0, and args must be a pointer to an
unsigned 32-bit filter return action.

SECCOMP_GET_NOTIF_SIZES (since Linux 5.0)
Get the sizes of the seccomp user-space notification structures.
Since these structures may evolve and grow over time, this
command can be used to determine how much memory to allocate for
sending and receiving notifications.

The value of flags must be 0, and args must be a pointer to a
struct seccomp_notif_sizes, which has the following form:

    struct seccomp_notif_sizes {
        __u16 seccomp_notif;      /* Size of notification structure */
        __u16 seccomp_notif_resp; /* Size of response structure */
        __u16 seccomp_data;       /* Size of 'struct seccomp_data' */
    };

See seccomp_unotify(2) for further details.
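
The two query operations, SECCOMP_GET_ACTION_AVAIL and
SECCOMP_GET_NOTIF_SIZES, might be wrapped as in the following sketch.
The helper names seccomp_action_avail() and seccomp_notif_sizes_get()
are hypothetical, and the sketch assumes kernel headers recent enough
to define both operations.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the kernel supports 'action',
   0 if it reports EOPNOTSUPP, and -1 on any other error (for
   example, EINVAL on kernels that predate this operation). */
int
seccomp_action_avail(uint32_t action)
{
    if (syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return (errno == EOPNOTSUPP) ? 0 : -1;
}

/* Hypothetical helper: fills 'sizes' with the kernel's structure
   sizes. Returns 0 on success, or -1 with errno set. */
int
seccomp_notif_sizes_get(struct seccomp_notif_sizes *sizes)
{
    return (int) syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}
```

A program could, for example, call
seccomp_action_avail(SECCOMP_RET_KILL_PROCESS) before relying on that
action, and fall back to SECCOMP_RET_KILL_THREAD if the result is not 1.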

Filters
When adding filters via SECCOMP_SET_MODE_FILTER, args points to a
filter program:

    struct sock_fprog {
        unsigned short      len;    /* Number of BPF instructions */
        struct sock_filter *filter; /* Pointer to array of
                                       BPF instructions */
    };

Each program must contain one or more BPF instructions:

    struct sock_filter {            /* Filter block */
        __u16 code;                 /* Actual filter code */
        __u8  jt;                   /* Jump true */
        __u8  jf;                   /* Jump false */
        __u32 k;                    /* Generic multiuse field */
    };

When executing the instructions, the BPF program operates on the system
call information made available (i.e., use the BPF_ABS addressing mode)
as a (read-only) buffer of the following form:

    struct seccomp_data {
        int   nr;                   /* System call number */
        __u32 arch;                 /* AUDIT_ARCH_* value
                                       (see <linux/audit.h>) */
        __u64 instruction_pointer;  /* CPU instruction pointer */
        __u64 args[6];              /* Up to 6 system call arguments */
    };

Because numbering of system calls varies between architectures and some
architectures (e.g., x86-64) allow user-space code to use the calling
conventions of multiple architectures (and the convention being used
may vary over the life of a process that uses execve(2) to execute
binaries that employ the different conventions), it is usually necessary
to verify the value of the arch field.

It is strongly recommended to use an allow-list approach whenever
possible because such an approach is more robust and simple. A deny-list
will have to be updated whenever a potentially dangerous system call is
added (or a dangerous flag or option if those are deny-listed), and it
is often possible to alter the representation of a value without
altering its meaning, leading to a deny-list bypass. See also Caveats
below.

The arch field is not unique for all calling conventions. The x86-64
ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on
the same processors. Instead, the mask __X32_SYSCALL_BIT is used on
the system call number to tell the two ABIs apart.

This means that a policy must either deny all syscalls with
__X32_SYSCALL_BIT or it must recognize syscalls with and without
__X32_SYSCALL_BIT set. A list of system calls to be denied based on nr
that does not also contain nr values with __X32_SYSCALL_BIT set can be
bypassed by a malicious program that sets __X32_SYSCALL_BIT.

Additionally, kernels prior to Linux 5.4 incorrectly permitted nr in
the ranges 512-547 as well as the corresponding non-x32 syscalls ORed
with __X32_SYSCALL_BIT. For example, nr == 521 and nr == (101 |
__X32_SYSCALL_BIT) would result in invocations of ptrace(2) with
potentially confused x32-vs-x86_64 semantics in the kernel. Policies
intended to work on kernels before Linux 5.4 must ensure that they deny
or otherwise correctly handle these system calls. On Linux 5.4 and
newer, such system calls will fail with the error ENOSYS, without doing
anything.
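
The advice above (verify arch, prefer an allow-list, and reject x32
system call numbers) can be sketched as a small filter. This is an
illustration rather than a recommended policy: it assumes an x86-64
target, the helper name fill_allowlist_prog() and the local
X32_SYSCALL_BIT define are hypothetical, and the three allowed system
calls were picked arbitrarily.

```c
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>

#define X32_SYSCALL_BIT 0x40000000   /* Assumed value of __X32_SYSCALL_BIT */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

static struct sock_filter allowlist[] = {
    /* [0] Load the architecture. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* [1] Not x86-64: jump to [8] (kill). */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 6),
    /* [2] Load the system call number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [3] x32 ABI (bit 30 set): jump to [8] (kill). */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 4, 0),
    /* [4]-[6] Allowed system calls jump to [7]; anything else ends
       up at [8]. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read, 2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 1, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 0, 1),
    /* [7] Allow. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* [8] Everything else: kill the process. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

/* Hypothetical helper: point 'prog' at the allow-list filter. */
void
fill_allowlist_prog(struct sock_fprog *prog)
{
    prog->len = ARRAY_SIZE(allowlist);
    prog->filter = allowlist;
}
```

The resulting program could then be passed as args to
SECCOMP_SET_MODE_FILTER, after no_new_privs has been set as described
earlier.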

The instruction_pointer field provides the address of the
machine-language instruction that performed the system call. This might
be useful in conjunction with the use of /proc/pid/maps to perform
checks based on which region (mapping) of the program made the system
call. (Probably, it is wise to lock down the mmap(2) and mprotect(2)
system calls to prevent the program from subverting such checks.)

When checking values from args, keep in mind that arguments are often
silently truncated before being processed, but after the seccomp check.
For example, this happens if the i386 ABI is used on an x86-64 kernel:
although the kernel will normally not look beyond the 32 lowest bits of
the arguments, the values of the full 64-bit registers will be present
in the seccomp data. A less surprising example is that if the x86-64
ABI is used to perform a system call that takes an argument of type
int, the more-significant half of the argument register is ignored by
the system call, but visible in the seccomp data.
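
Because seccomp BPF loads are 32 bits wide, a filter that wants to
check a full 64-bit argument has to load and compare its two halves
separately. The following sketch computes the BPF_ABS offset of each
half. The helper names are hypothetical, and the byte-order test relies
on the GCC/Clang predefined macro __BYTE_ORDER__.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the least significant 32 bits of args[n]. */
uint32_t
arg_lo_offset(unsigned int n)
{
    uint32_t base = offsetof(struct seccomp_data, args) +
                    n * sizeof(uint64_t);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return base;
#else
    return base + sizeof(uint32_t);
#endif
}

/* Offset of the most significant 32 bits of args[n]. */
uint32_t
arg_hi_offset(unsigned int n)
{
    uint32_t base = offsetof(struct seccomp_data, args) +
                    n * sizeof(uint64_t);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return base + sizeof(uint32_t);
#else
    return base;
#endif
}
```

A filter can then load both offsets and decide explicitly how to treat
the upper half, rather than silently ignoring it.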

A seccomp filter returns a 32-bit value consisting of two parts: the
most significant 16 bits (corresponding to the mask defined by the
constant SECCOMP_RET_ACTION_FULL) contain one of the "action" values
listed below; the least significant 16 bits (defined by the constant
SECCOMP_RET_DATA) are "data" to be associated with this return value.

If multiple filters exist, they are all executed, in reverse order of
their addition to the filter tree; that is, the most recently installed
filter is executed first. (Note that all filters will be called even
if one of the earlier filters returns SECCOMP_RET_KILL. This is done
to simplify the kernel code and to provide a tiny speed-up in the
execution of sets of filters by avoiding a check for this uncommon case.)
The return value for the evaluation of a given system call is the
first-seen action value of highest precedence (along with its
accompanying data) returned by execution of all of the filters.
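
The selection rule just described can be modeled in user space. The
sketch below is an illustration, not kernel code: it assumes that
precedence corresponds to comparing the action bits as a signed 32-bit
value (smaller means higher precedence), which matches the ordering of
the action constants in <linux/seccomp.h>. The function names are
hypothetical.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Action bits of a filter return value, as a signed quantity so that
   SECCOMP_RET_KILL_PROCESS (0x80000000) sorts before everything else. */
static int32_t
action_of(uint32_t ret)
{
    return (int32_t) (ret & SECCOMP_RET_ACTION_FULL);
}

/* Model of combining the return values of several filters, given in
   the order in which the filters ran. Ties keep the first-seen value. */
uint32_t
combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t best = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        if (action_of(results[i]) < action_of(best))
            best = results[i];
    }
    return best;
}
```

For instance, if three filters return SECCOMP_RET_ALLOW,
SECCOMP_RET_ERRNO with data 1, and SECCOMP_RET_ERRNO with data 2, in
that order, this model yields SECCOMP_RET_ERRNO with data 1.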

In decreasing order of precedence, the action values that may be
returned by a seccomp filter are:

SECCOMP_RET_KILL_PROCESS (since Linux 4.14)
This value results in immediate termination of the process, with
a core dump. The system call is not executed. By contrast with
SECCOMP_RET_KILL_THREAD below, all threads in the thread group
are terminated. (For a discussion of thread groups, see the
description of the CLONE_THREAD flag in clone(2).)

The process terminates as though killed by a SIGSYS signal.
Even if a signal handler has been registered for SIGSYS, the
handler will be ignored in this case and the process always
terminates. To a parent process that is waiting on this process
(using waitpid(2) or similar), the returned wstatus will
indicate that its child was terminated as though by a SIGSYS signal.

SECCOMP_RET_KILL_THREAD (or SECCOMP_RET_KILL)
This value results in immediate termination of the thread that
made the system call. The system call is not executed. Other
threads in the same thread group will continue to execute.

The thread terminates as though killed by a SIGSYS signal. See
SECCOMP_RET_KILL_PROCESS above.

Before Linux 4.11, any process terminated in this way would not
trigger a coredump (even though SIGSYS is documented in
signal(7) as having a default action of termination with a core
dump). Since Linux 4.11, a single-threaded process will dump
core if terminated in this way.

With the addition of SECCOMP_RET_KILL_PROCESS in Linux 4.14,
SECCOMP_RET_KILL_THREAD was added as a synonym for
SECCOMP_RET_KILL, in order to more clearly distinguish the two
actions.

Note: the use of SECCOMP_RET_KILL_THREAD to kill a single thread
in a multithreaded process is likely to leave the process in a
permanently inconsistent and possibly corrupt state.

SECCOMP_RET_TRAP
This value results in the kernel sending a thread-directed
SIGSYS signal to the triggering thread. (The system call is not
executed.) Various fields will be set in the siginfo_t
structure (see sigaction(2)) associated with the signal:

• si_signo will contain SIGSYS.

• si_call_addr will show the address of the system call
  instruction.

• si_syscall and si_arch will indicate which system call was
  attempted.

• si_code will contain SYS_SECCOMP.

• si_errno will contain the SECCOMP_RET_DATA portion of the
  filter return value.

The program counter will be as though the system call happened
(i.e., the program counter will not point to the system call
instruction). The return value register will contain an
architecture-dependent value; if resuming execution, set it to
something appropriate for the system call. (The architecture dependency
is because replacing it with ENOSYS could overwrite some useful
information.)

SECCOMP_RET_ERRNO
This value results in the SECCOMP_RET_DATA portion of the
filter's return value being passed to user space as the errno value
without executing the system call.

SECCOMP_RET_USER_NOTIF (since Linux 5.0)
Forward the system call to an attached user-space supervisor
process to allow that process to decide what to do with the
system call. If there is no attached supervisor (either because
the filter was not installed with the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag or because the file descriptor
was closed), the system call fails with the error ENOSYS (similar to
what happens when a filter returns SECCOMP_RET_TRACE and there is no
tracer). See seccomp_unotify(2) for further details.

Note that the supervisor process will not be notified if another
filter returns an action value with a precedence greater than
SECCOMP_RET_USER_NOTIF.

SECCOMP_RET_TRACE
When returned, this value will cause the kernel to attempt to
notify a ptrace(2)-based tracer prior to executing the system
call. If there is no tracer present, the system call is not
executed and returns a failure status with errno set to ENOSYS.

A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
using ptrace(PTRACE_SETOPTIONS). The tracer will be notified of
a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of the
filter's return value will be available to the tracer via
PTRACE_GETEVENTMSG.

The tracer can skip the system call by changing the system call
number to -1. Alternatively, the tracer can change the system
call requested by changing the system call to a valid system
call number. If the tracer asks to skip the system call, then
the system call will appear to return the value that the tracer
puts in the return value register.

Before Linux 4.8, the seccomp check will not be run again after
the tracer is notified. (This means that, on older kernels,
seccomp-based sandboxes must not allow use of ptrace(2), even of
other sandboxed processes, without extreme care; ptracers can use
this mechanism to escape from the seccomp sandbox.)

Note that a tracer process will not be notified if another
filter returns an action value with a precedence greater than
SECCOMP_RET_TRACE.

SECCOMP_RET_LOG (since Linux 4.14)
This value results in the system call being executed after the
filter return action is logged. An administrator may override
the logging of this action via the
/proc/sys/kernel/seccomp/actions_logged file.

SECCOMP_RET_ALLOW
This value results in the system call being executed.

If an action value other than one of the above is specified, then the
filter action is treated as either SECCOMP_RET_KILL_PROCESS (since
Linux 4.14) or SECCOMP_RET_KILL_THREAD (in Linux 4.13 and earlier).
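
For SECCOMP_RET_TRAP, a handler installed with SA_SIGINFO can read the
siginfo_t fields listed in that entry. The sketch below only records
two of them. The names install_sigsys_handler(), trapped_nr, and
trapped_data are hypothetical, and SYS_SECCOMP is defined locally (with
the value used by the kernel headers) in case the C library headers do
not provide it.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

#ifndef SYS_SECCOMP
#define SYS_SECCOMP 1       /* si_code for seccomp-triggered SIGSYS */
#endif

/* Hypothetical globals recording the most recent trapped call. */
volatile sig_atomic_t trapped_nr = -1;
volatile sig_atomic_t trapped_data = -1;

static void
sigsys_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void) ucontext;

    if (sig != SIGSYS || info->si_code != SYS_SECCOMP)
        return;

    trapped_nr = info->si_syscall;    /* Attempted system call */
    trapped_data = info->si_errno;    /* SECCOMP_RET_DATA portion */
    /* info->si_arch and info->si_call_addr are also available here. */
}

/* Returns 0 on success, or -1 with errno set. */
int
install_sigsys_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigsys_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    return sigaction(SIGSYS, &sa, NULL);
}
```

The handler must be installed before the filter takes effect, and the
filter must still permit whatever system calls the handler and the
return from it require (for example, the sigreturn family).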

/proc interfaces
The files in the directory /proc/sys/kernel/seccomp provide additional
seccomp information and configuration:

actions_avail (since Linux 4.14)
A read-only ordered list of seccomp filter return actions in
string form. The ordering, from left-to-right, is in decreasing
order of precedence. The list represents the set of seccomp
filter return actions supported by the kernel.

actions_logged (since Linux 4.14)
A read-write ordered list of seccomp filter return actions that
are allowed to be logged. Writes to the file do not need to be
in ordered form but reads from the file will be ordered in the
same way as the actions_avail file.

It is important to note that the value of actions_logged does
not prevent certain filter return actions from being logged when
the audit subsystem is configured to audit a task. If the
action is not found in the actions_logged file, the final decision
on whether to audit the action for that task is ultimately left
up to the audit subsystem to decide for all filter return
actions other than SECCOMP_RET_ALLOW.

The "allow" string is not accepted in the actions_logged file as
it is not possible to log SECCOMP_RET_ALLOW actions. Attempting
to write "allow" to the file will fail with the error EINVAL.

Audit logging of seccomp actions
Since Linux 4.14, the kernel provides the facility to log the actions
returned by seccomp filters in the audit log. The kernel makes the
decision to log an action based on the action type, whether or not the
action is present in the actions_logged file, and whether kernel
auditing is enabled (e.g., via the kernel boot option audit=1). The
rules are as follows:

• If the action is SECCOMP_RET_ALLOW, the action is not logged.

• Otherwise, if the action is either SECCOMP_RET_KILL_PROCESS or
  SECCOMP_RET_KILL_THREAD, and that action appears in the actions_logged
  file, the action is logged.

• Otherwise, if the filter has requested logging (the
  SECCOMP_FILTER_FLAG_LOG flag) and the action appears in the
  actions_logged file, the action is logged.

• Otherwise, if kernel auditing is enabled and the process is being
  audited (autrace(8)), the action is logged.

• Otherwise, the action is not logged.

RETURN VALUE
On success, seccomp() returns 0. On error, if
SECCOMP_FILTER_FLAG_TSYNC was used, the return value is the ID of the
thread that caused the synchronization failure. (This ID is a kernel
thread ID of the type returned by clone(2) and gettid(2).) On other
errors, -1 is returned, and errno is set to indicate the error.
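
Since a failed SECCOMP_FILTER_FLAG_TSYNC call reports a positive thread
ID rather than -1, callers need to distinguish three outcomes. A
minimal sketch, with hypothetical names throughout:

```c
enum seccomp_outcome {
    OUTCOME_OK,             /* seccomp() returned 0 */
    OUTCOME_TSYNC_FAILED,   /* Returned the ID of the blocking thread */
    OUTCOME_ERROR           /* Returned -1; consult errno */
};

/* Classify the return value of a seccomp() call that was made with
   SECCOMP_FILTER_FLAG_TSYNC. */
enum seccomp_outcome
classify_seccomp_result(long ret)
{
    if (ret == 0)
        return OUTCOME_OK;
    if (ret > 0)
        return OUTCOME_TSYNC_FAILED;
    return OUTCOME_ERROR;
}
```

In the OUTCOME_TSYNC_FAILED case, the value of ret itself is the thread
ID to report or investigate.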

ERRORS
seccomp() can fail for the following reasons:

EACCES The caller did not have the CAP_SYS_ADMIN capability in its user
       namespace, or had not set no_new_privs before using
       SECCOMP_SET_MODE_FILTER.

EBUSY  While installing a new filter, the
       SECCOMP_FILTER_FLAG_NEW_LISTENER flag was specified, but a
       previous filter had already been installed with that flag.

EFAULT args was not a valid address.

EINVAL operation is unknown or is not supported by this kernel version
       or configuration.

EINVAL The specified flags are invalid for the given operation.

EINVAL The filter program included a BPF_ABS load, but the specified
       offset was not aligned to a 32-bit boundary or exceeded
       sizeof(struct seccomp_data).

EINVAL A secure computing mode has already been set, and operation
       differs from the existing setting.

EINVAL operation specified SECCOMP_SET_MODE_FILTER, but the filter
       program pointed to by args was not valid or the length of the
       filter program was zero or exceeded BPF_MAXINSNS (4096)
       instructions.

ENOMEM Out of memory.

ENOMEM The total length of all filter programs attached to the calling
       thread would exceed MAX_INSNS_PER_PATH (32768) instructions.
       Note that for the purposes of calculating this limit, each
       already existing filter program incurs an overhead penalty of 4
       instructions.

EOPNOTSUPP
       operation specified SECCOMP_GET_ACTION_AVAIL, but the kernel
       does not support the filter return action specified by args.

ESRCH  Another thread caused a failure during thread sync, but its ID
       could not be determined.
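
Two of the limits above are simple arithmetic and can be checked before
calling seccomp(). The sketch below is a user-space model under stated
assumptions: the helper names are hypothetical, the constants 32768 and
4 are taken from the ENOMEM description above, and an offset equal to
the structure size is treated as invalid because a 4-byte load there
would read past the buffer.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

#define PATH_INSN_LIMIT   32768  /* MAX_INSNS_PER_PATH, per the text */
#define PER_FILTER_COST   4      /* Overhead per existing filter */

/* Nonzero if 'k' is usable as a BPF_ABS offset in a seccomp filter:
   32-bit aligned and inside struct seccomp_data. */
int
abs_offset_ok(uint32_t k)
{
    return (k % 4 == 0) && (k < sizeof(struct seccomp_data));
}

/* Nonzero if adding a filter of 'new_len' instructions to a thread
   that already has filters of the given lengths would exceed the
   per-thread limit. */
int
exceeds_path_limit(const unsigned short *existing, size_t n,
                   unsigned short new_len)
{
    unsigned long total = new_len;

    for (size_t i = 0; i < n; i++)
        total += existing[i] + PER_FILTER_COST;

    return total > PATH_INSN_LIMIT;
}
```

Under this model, eight existing filters of 4096 instructions each
already account for 32800 instructions, so any further filter would be
refused.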

STANDARDS
Linux.

HISTORY
Linux 3.17.

NOTES
Rather than hand-coding seccomp filters as shown in the example below,
you may prefer to employ the libseccomp library, which provides a
front-end for generating seccomp filters.

The Seccomp field of the /proc/pid/status file provides a method of
viewing the seccomp mode of a process; see proc(5).
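
As an illustration, the following sketch extracts that field from the
text of a status file. The function name parse_seccomp_mode() is
hypothetical, and the "Seccomp:" line format is assumed from proc(5).

```c
#include <stdio.h>
#include <string.h>

/* Returns the value of the "Seccomp:" field found in 'status_text'
   (the contents of a /proc/pid/status file), or -1 if it is absent. */
int
parse_seccomp_mode(const char *status_text)
{
    static const char key[] = "Seccomp:";
    const char *p = status_text;
    int mode;

    while (p != NULL && *p != '\0') {
        if (strncmp(p, key, sizeof(key) - 1) == 0) {
            if (sscanf(p + sizeof(key) - 1, "%d", &mode) == 1)
                return mode;
            return -1;
        }
        p = strchr(p, '\n');        /* Advance to the next line */
        if (p != NULL)
            p++;
    }
    return -1;
}
```

The caller is expected to have read the file into a null-terminated
buffer first; proc(5) documents the meaning of the returned value.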

seccomp() provides a superset of the functionality provided by the
prctl(2) PR_SET_SECCOMP operation (which does not support flags).

Since Linux 4.4, the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation can
be used to dump a process's seccomp filters.

Architecture support for seccomp BPF
Architecture support for seccomp BPF filtering is available on the
following architectures:

• x86-64, i386, x32 (since Linux 3.5)
• ARM (since Linux 3.8)
• s390 (since Linux 3.8)
• MIPS (since Linux 3.16)
• ARM-64 (since Linux 3.19)
• PowerPC (since Linux 4.3)
• Tile (since Linux 4.3)
• PA-RISC (since Linux 4.6)

Caveats
There are various subtleties to consider when applying seccomp filters
to a program, including the following:

• Some traditional system calls have user-space implementations in the
  vdso(7) on many architectures. Notable examples include
  clock_gettime(2), gettimeofday(2), and time(2). On such architectures,
  seccomp filtering for these system calls will have no effect.
  (However, there are cases where the vdso(7) implementations may fall
  back to invoking the true system call, in which case seccomp filters
  would see the system call.)

• Seccomp filtering is based on system call numbers. However,
  applications typically do not directly invoke system calls, but instead
  call wrapper functions in the C library which in turn invoke the
  system calls. Consequently, one must be aware of the following:

  • The glibc wrappers for some traditional system calls may actually
    employ system calls with different names in the kernel. For
    example, the exit(2) wrapper function actually employs the
    exit_group(2) system call, and the fork(2) wrapper function
    actually calls clone(2).

  • The behavior of wrapper functions may vary across architectures,
    according to the range of system calls provided on those
    architectures. In other words, the same wrapper function may invoke
    different system calls on different architectures.

  • Finally, the behavior of wrapper functions can change across
    glibc versions. For example, in older versions, the glibc
    wrapper function for open(2) invoked the system call of the same
    name, but starting in glibc 2.26, the implementation switched to
    calling openat(2) on all architectures.

The consequence of the above points is that it may be necessary to
filter for a system call other than might be expected. Various manual
pages in Section 2 provide helpful details about the differences
between wrapper functions and the underlying system calls in subsections
entitled C library/kernel differences.

Furthermore, note that the application of seccomp filters even risks
causing bugs in an application, when the filters cause unexpected
failures for legitimate operations that the application might need to
perform. Such bugs may not easily be discovered when testing the seccomp
filters if the bugs occur in rarely used application code paths.

Seccomp-specific BPF details
Note the following BPF details specific to seccomp filters:

• The BPF_H and BPF_B size modifiers are not supported: all operations
  must load and store (4-byte) words (BPF_W).

• To access the contents of the seccomp_data buffer, use the BPF_ABS
  addressing mode modifier.

• The BPF_LEN addressing mode modifier yields an immediate mode
  operand whose value is the size of the seccomp_data buffer.

EXAMPLES
The program below accepts four or more arguments. The first three
arguments are a system call number, a numeric architecture identifier,
and an error number. The program uses these values to construct a BPF
filter that is used at run time to perform the following checks:

• If the program is not running on the specified architecture, the BPF
  filter kills the process (the filter returns
  SECCOMP_RET_KILL_PROCESS).

• If the program attempts to execute the system call with the
  specified number, the BPF filter causes the system call to fail, with
  errno being set to the specified error number.

The remaining command-line arguments specify the pathname and
additional arguments of a program that the example program should
attempt to execute using execv(3) (a library function that employs the
execve(2) system call). Some example runs of the program are shown
below.

First, we display the architecture that we are running on (x86-64) and
then construct a shell function that looks up system call numbers on
this architecture:

    $ uname -m
    x86_64
    $ syscall_nr() {
        cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \
        awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
    }

When the BPF filter rejects a system call (the second case above), it
causes the system call to fail with the error number specified on the
command line. In the experiments shown here, we'll use error number 99:

    $ errno 99
    EADDRNOTAVAIL 99 Cannot assign requested address

In the following example, we attempt to run the command whoami(1), but
the BPF filter rejects the execve(2) system call, so that the command
is not even executed:

    $ syscall_nr execve
    59
    $ ./a.out
    Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
    Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
                     AUDIT_ARCH_X86_64: 0xC000003E
    $ ./a.out 59 0xC000003E 99 /bin/whoami
    execv: Cannot assign requested address

In the next example, the BPF filter rejects the write(2) system call,
so that, although it is successfully started, the whoami(1) command is
not able to write output:

    $ syscall_nr write
    1
    $ ./a.out 1 0xC000003E 99 /bin/whoami

In the final example, the BPF filter rejects a system call that is not
used by the whoami(1) command, so it is able to successfully execute
and produce output:

    $ syscall_nr preadv
    295
    $ ./a.out 295 0xC000003E 99 /bin/whoami
    cecilia

Program source
    #include <linux/audit.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define X32_SYSCALL_BIT 0x40000000
    #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

    static int
    install_filter(int syscall_nr, unsigned int t_arch, int f_errno)
    {
        unsigned int upper_nr_limit = 0xffffffff;

        /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
           (in the x32 ABI, all system calls have bit 30 set in the
           'nr' field, meaning the numbers are >= X32_SYSCALL_BIT). */
        if (t_arch == AUDIT_ARCH_X86_64)
            upper_nr_limit = X32_SYSCALL_BIT - 1;

        struct sock_filter filter[] = {
            /* [0] Load architecture from 'seccomp_data' buffer into
                   accumulator. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     (offsetof(struct seccomp_data, arch))),

            /* [1] Jump forward 5 instructions if architecture does not
                   match 't_arch'. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

            /* [2] Load system call number from 'seccomp_data' buffer into
                   accumulator. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     (offsetof(struct seccomp_data, nr))),

            /* [3] Check ABI - only needed for x86-64 in deny-list use
                   cases. Use BPF_JGT instead of checking against the bit
                   mask to avoid having to reload the syscall number. */
            BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

            /* [4] Jump forward 1 instruction if system call number
                   does not match 'syscall_nr'. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

            /* [5] Matching architecture and system call: don't execute
                   the system call, and return 'f_errno' in 'errno'. */
            BPF_STMT(BPF_RET | BPF_K,
                     SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

            /* [6] Destination of system call number mismatch: allow other
                   system calls. */
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

            /* [7] Destination of architecture mismatch: kill process. */
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        };

        struct sock_fprog prog = {
            .len = ARRAY_SIZE(filter),
            .filter = filter,
        };

        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
            perror("seccomp");
            return 1;
        }

        return 0;
    }

    int
    main(int argc, char *argv[])
    {
        if (argc < 5) {
            fprintf(stderr, "Usage: "
                    "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                    "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                    "                 AUDIT_ARCH_X86_64: 0x%X\n"
                    "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
            exit(EXIT_FAILURE);
        }

        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
            perror("prctl");
            exit(EXIT_FAILURE);
        }

        if (install_filter(strtol(argv[1], NULL, 0),
                           strtoul(argv[2], NULL, 0),
                           strtol(argv[3], NULL, 0)))
            exit(EXIT_FAILURE);

        execv(argv[4], &argv[4]);
        perror("execv");
        exit(EXIT_FAILURE);
    }

SEE ALSO
bpfc(1), strace(1), bpf(2), prctl(2), ptrace(2), seccomp_unotify(2),
sigaction(2), proc(5), signal(7), socket(7)

Various pages from the libseccomp library, including:
scmp_sys_resolver(1), seccomp_export_bpf(3), seccomp_init(3),
seccomp_load(3), and seccomp_rule_add(3).

The kernel source files Documentation/networking/filter.txt and
Documentation/userspace-api/seccomp_filter.rst (or
Documentation/prctl/seccomp_filter.txt before Linux 4.13).

McCanne, S. and Jacobson, V. (1992) The BSD Packet Filter: A New
Architecture for User-level Packet Capture, Proceedings of the USENIX
Winter 1993 Conference ⟨http://www.tcpdump.org/papers/bpf-usenix93.pdf⟩

Linux man-pages 6.04              2023-03-30                    seccomp(2)