SECCOMP(2)                 Linux Programmer's Manual                SECCOMP(2)

NAME
seccomp - operate on Secure Computing state of the process

SYNOPSIS
    #include <linux/seccomp.h>
    #include <linux/filter.h>
    #include <linux/audit.h>
    #include <linux/signal.h>
    #include <sys/ptrace.h>

    int seccomp(unsigned int operation, unsigned int flags, void *args);

DESCRIPTION
The seccomp() system call operates on the Secure Computing (seccomp)
state of the calling process.
20
21 Currently, Linux supports the following operation values:
22
23 SECCOMP_SET_MODE_STRICT
24 The only system calls that the calling thread is permitted to
25 make are read(2), write(2), _exit(2) (but not exit_group(2)),
26 and sigreturn(2). Other system calls result in the delivery of
27 a SIGKILL signal. Strict secure computing mode is useful for
28 number-crunching applications that may need to execute untrusted
29 byte code, perhaps obtained by reading from a pipe or socket.
30
31 Note that although the calling thread can no longer call sig‐
32 procmask(2), it can use sigreturn(2) to block all signals apart
33 from SIGKILL and SIGSTOP. This means that alarm(2) (for exam‐
34 ple) is not sufficient for restricting the process's execution
35 time. Instead, to reliably terminate the process, SIGKILL must
36 be used. This can be done by using timer_create(2) with
37 SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by using setr‐
38 limit(2) to set the hard limit for RLIMIT_CPU.
39
40 This operation is available only if the kernel is configured
41 with CONFIG_SECCOMP enabled.
42
43 The value of flags must be 0, and args must be NULL.
44
45 This operation is functionally identical to the call:
46
47 prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
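
As an illustration of strict mode, the following sketch forks a child
that enters SECCOMP_SET_MODE_STRICT, writes one line, and exits. It
assumes a kernel configured with CONFIG_SECCOMP; the helper name
strict_mode_child_status() is hypothetical. Because glibc provides no
seccomp() wrapper, the call is made through syscall(2).

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that enters strict mode, writes one
   line, and exits with status 42.  Returns the child's exit status
   (0..255), 256 plus the signal number if the child was killed by a
   signal, or -1 if fork(2) or waitpid(2) failed.  An exit status of 1
   means that the kernel refused the strict-mode request. */
static int
strict_mode_child_status(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        static const char msg[] = "running in strict mode\n";

        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL) == -1)
            syscall(SYS_exit, 1);

        /* From here on, only read(2), write(2), _exit(2), and
           sigreturn(2) are permitted. */
        syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1);

        /* Use the raw exit system call: the C library wrappers may
           invoke exit_group(2), which strict mode does not permit. */
        syscall(SYS_exit, 42);
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1)
        return -1;
    if (WIFSIGNALED(wstatus))
        return 256 + WTERMSIG(wstatus);
    return WEXITSTATUS(wstatus);
}
```

A result of 42 indicates that the child ran to completion under strict
mode; a result of 256 + SIGKILL would indicate that it attempted a
system call outside the permitted set.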
48
49 SECCOMP_SET_MODE_FILTER
50 The system calls allowed are defined by a pointer to a Berkeley
51 Packet Filter (BPF) passed via args. This argument is a pointer
52 to a struct sock_fprog; it can be designed to filter arbitrary
53 system calls and system call arguments. If the filter is in‐
54 valid, seccomp() fails, returning EINVAL in errno.
55
56 If fork(2) or clone(2) is allowed by the filter, any child pro‐
57 cesses will be constrained to the same system call filters as
58 the parent. If execve(2) is allowed, the existing filters will
59 be preserved across a call to execve(2).
60
61 In order to use the SECCOMP_SET_MODE_FILTER operation, either
62 the calling thread must have the CAP_SYS_ADMIN capability in its
63 user namespace, or the thread must already have the no_new_privs
64 bit set. If that bit was not already set by an ancestor of this
65 thread, the thread must make the following call:
66
67 prctl(PR_SET_NO_NEW_PRIVS, 1);
68
69 Otherwise, the SECCOMP_SET_MODE_FILTER operation fails and re‐
70 turns EACCES in errno. This requirement ensures that an unpriv‐
71 ileged process cannot apply a malicious filter and then invoke a
72 set-user-ID or other privileged program using execve(2), thus
73 potentially compromising that program. (Such a malicious filter
74 might, for example, cause an attempt to use setuid(2) to set the
75 caller's user IDs to nonzero values to instead return 0 without
76 actually making the system call. Thus, the program might be
77 tricked into retaining superuser privileges in circumstances
78 where it is possible to influence it to do dangerous things be‐
79 cause it did not actually drop privileges.)
80
81 If prctl(2) or seccomp() is allowed by the attached filter, fur‐
82 ther filters may be added. This will increase evaluation time,
83 but allows for further reduction of the attack surface during
84 execution of a thread.
85
86 The SECCOMP_SET_MODE_FILTER operation is available only if the
87 kernel is configured with CONFIG_SECCOMP_FILTER enabled.
88
89 When flags is 0, this operation is functionally identical to the
90 call:
91
92 prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);
93
94 The recognized flags are:
95
96 SECCOMP_FILTER_FLAG_TSYNC
97 When adding a new filter, synchronize all other threads
98 of the calling process to the same seccomp filter tree.
99 A "filter tree" is the ordered list of filters attached
100 to a thread. (Attaching identical filters in separate
101 seccomp() calls results in different filters from this
102 perspective.)
103
104 If any thread cannot synchronize to the same filter tree,
105 the call will not attach the new seccomp filter, and will
106 fail, returning the first thread ID found that cannot
107 synchronize. Synchronization will fail if another thread
108 in the same process is in SECCOMP_MODE_STRICT or if it
109 has attached new seccomp filters to itself, diverging
110 from the calling thread's filter tree.
111
112 SECCOMP_FILTER_FLAG_LOG (since Linux 4.14)
113 All filter return actions except SECCOMP_RET_ALLOW should
114 be logged. An administrator may override this filter
115 flag by preventing specific actions from being logged via
116 the /proc/sys/kernel/seccomp/actions_logged file.
117
118 SECCOMP_FILTER_FLAG_SPEC_ALLOW (since Linux 4.17)
119 Disable Speculative Store Bypass mitigation.
120
121 SECCOMP_GET_ACTION_AVAIL (since Linux 4.14)
122 Test to see if an action is supported by the kernel. This oper‐
123 ation is helpful to confirm that the kernel knows of a more re‐
124 cently added filter return action since the kernel treats all
125 unknown actions as SECCOMP_RET_KILL_PROCESS.
126
127 The value of flags must be 0, and args must be a pointer to an
128 unsigned 32-bit filter return action.
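
A minimal sketch of such a query follows. The helper name
seccomp_action_avail() and its three-way return convention are
assumptions made for illustration; only the SECCOMP_GET_ACTION_AVAIL
operation and the EOPNOTSUPP error come from this page.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the kernel supports 'action',
   0 if the kernel reports EOPNOTSUPP for it, and -1 on any other
   error (for example, a kernel that does not provide the
   SECCOMP_GET_ACTION_AVAIL operation). */
static int
seccomp_action_avail(uint32_t action)
{
    if (syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return (errno == EOPNOTSUPP) ? 0 : -1;
}
```

A caller might, for instance, test SECCOMP_RET_KILL_PROCESS this way and
choose a different action for its filter when the result is not 1.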
129
130 Filters
131 When adding filters via SECCOMP_SET_MODE_FILTER, args points to a fil‐
132 ter program:
133
    struct sock_fprog {
        unsigned short      len;    /* Number of BPF instructions */
        struct sock_filter *filter; /* Pointer to array of
                                       BPF instructions */
    };
139
140 Each program must contain one or more BPF instructions:
141
    struct sock_filter {            /* Filter block */
        __u16 code;                 /* Actual filter code */
        __u8  jt;                   /* Jump true */
        __u8  jf;                   /* Jump false */
        __u32 k;                    /* Generic multiuse field */
    };
148
When executing the instructions, the BPF program operates on the system
call information, which is made available (through the BPF_ABS
addressing mode) as a read-only buffer of the following form:
152
    struct seccomp_data {
        int   nr;                   /* System call number */
        __u32 arch;                 /* AUDIT_ARCH_* value
                                       (see <linux/audit.h>) */
        __u64 instruction_pointer;  /* CPU instruction pointer */
        __u64 args[6];              /* Up to 6 system call arguments */
    };
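
To make the relationship between the instructions and the seccomp_data
buffer concrete, the following is a toy user-space evaluator. It is not
the kernel's implementation: it handles only the handful of opcodes
named in its comment, and the names toy_run_filter() and toy_demo() are
hypothetical. It may be useful for dry-running a small filter before
installing it.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy evaluator (an illustration, not the kernel's code).  Supports
   only BPF_LD|BPF_W|BPF_ABS, BPF_JMP with BPF_JEQ/BPF_JGT/BPF_JGE/
   BPF_JSET against BPF_K, and BPF_RET|BPF_K.  Anything else, or
   running off the end, yields SECCOMP_RET_KILL_THREAD. */
static uint32_t
toy_run_filter(const struct sock_filter *insns, size_t len,
               const struct seccomp_data *sd)
{
    uint32_t acc = 0;                       /* BPF accumulator */

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *in = &insns[pc];
        int taken;

        switch (in->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            /* Offset must be 4-byte aligned and inside the buffer */
            if (in->k % 4 != 0 || in->k > sizeof(*sd) - 4)
                return SECCOMP_RET_KILL_THREAD;
            memcpy(&acc, (const char *) sd + in->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
        case BPF_JMP | BPF_JGT | BPF_K:
        case BPF_JMP | BPF_JGE | BPF_K:
        case BPF_JMP | BPF_JSET | BPF_K:
            switch (BPF_OP(in->code)) {
            case BPF_JEQ: taken = (acc == in->k); break;
            case BPF_JGT: taken = (acc > in->k); break;
            case BPF_JGE: taken = (acc >= in->k); break;
            default:      taken = (acc & in->k) != 0; break;
            }
            /* Offsets are relative to the next instruction */
            pc += taken ? in->jt : in->jf;
            break;
        case BPF_RET | BPF_K:
            return in->k;
        default:
            return SECCOMP_RET_KILL_THREAD;
        }
    }
    return SECCOMP_RET_KILL_THREAD;
}

/* Sample filter: allow system call number 1, fail all others with
   error number 99 */
static uint32_t
toy_demo(int nr)
{
    struct sock_filter prog[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 99),
    };
    struct seccomp_data sd = { .nr = nr };

    return toy_run_filter(prog, sizeof(prog) / sizeof(prog[0]), &sd);
}
```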
160
161 Because numbering of system calls varies between architectures and some
162 architectures (e.g., x86-64) allow user-space code to use the calling
163 conventions of multiple architectures (and the convention being used
164 may vary over the life of a process that uses execve(2) to execute bi‐
165 naries that employ the different conventions), it is usually necessary
166 to verify the value of the arch field.
167
It is strongly recommended to use an allow-list approach whenever
possible because such an approach is more robust and simpler. A
deny-list will have to be updated whenever a potentially dangerous
system call is added (or a dangerous flag or option if those are
deny-listed), and it is often possible to alter the representation of a
value without altering its meaning, leading to a deny-list bypass. See
also Caveats below.
175
176 The arch field is not unique for all calling conventions. The x86-64
177 ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on
178 the same processors. Instead, the mask __X32_SYSCALL_BIT is used on
179 the system call number to tell the two ABIs apart.
180
181 This means that a policy must either deny all syscalls with
182 __X32_SYSCALL_BIT or it must recognize syscalls with and without
183 __X32_SYSCALL_BIT set. A list of system calls to be denied based on nr
184 that does not also contain nr values with __X32_SYSCALL_BIT set can be
185 bypassed by a malicious program that sets __X32_SYSCALL_BIT.
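
The first option (deny every system call that has the x32 bit set) can
be sketched as a three-instruction fragment. The names x32_guard and
nr_uses_x32_abi() are hypothetical, and the fragment assumes that the
arch field has already been checked against AUDIT_ARCH_X86_64 by
earlier instructions.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>

/* Same value as __X32_SYSCALL_BIT; defined locally so that the sketch
   does not depend on architecture-specific headers. */
#define X32_SYSCALL_BIT 0x40000000

/* Fragment: kill the process if 'nr' has the x32 bit set; otherwise
   fall through to whatever instructions the caller appends.  It is
   not a complete program on its own. */
static const struct sock_filter x32_guard[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),
    /* If (nr & X32_SYSCALL_BIT) == 0, skip the kill instruction */
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, X32_SYSCALL_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

/* Plain C equivalent of the test performed by the fragment */
static int
nr_uses_x32_abi(unsigned int nr)
{
    return (nr & X32_SYSCALL_BIT) != 0;
}
```

BPF_JSET tests the bit directly. Comparing the number against
X32_SYSCALL_BIT - 1 with BPF_JGT is an alternative that achieves the
same effect for valid system call numbers.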
186
187 Additionally, kernels prior to Linux 5.4 incorrectly permitted nr in
188 the ranges 512-547 as well as the corresponding non-x32 syscalls ORed
189 with __X32_SYSCALL_BIT. For example, nr == 521 and nr == (101 |
190 __X32_SYSCALL_BIT) would result in invocations of ptrace(2) with poten‐
191 tially confused x32-vs-x86_64 semantics in the kernel. Policies in‐
192 tended to work on kernels before Linux 5.4 must ensure that they deny
193 or otherwise correctly handle these system calls. On Linux 5.4 and
194 newer, such system calls will fail with the error ENOSYS, without doing
195 anything.
196
197 The instruction_pointer field provides the address of the machine-lan‐
198 guage instruction that performed the system call. This might be useful
199 in conjunction with the use of /proc/[pid]/maps to perform checks based
200 on which region (mapping) of the program made the system call. (Proba‐
201 bly, it is wise to lock down the mmap(2) and mprotect(2) system calls
202 to prevent the program from subverting such checks.)
203
204 When checking values from args, keep in mind that arguments are often
205 silently truncated before being processed, but after the seccomp check.
206 For example, this happens if the i386 ABI is used on an x86-64 kernel:
207 although the kernel will normally not look beyond the 32 lowest bits of
208 the arguments, the values of the full 64-bit registers will be present
209 in the seccomp data. A less surprising example is that if the x86-64
210 ABI is used to perform a system call that takes an argument of type
211 int, the more-significant half of the argument register is ignored by
212 the system call, but visible in the seccomp data.
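
Because seccomp filters load only 32-bit words, a 64-bit argument has
to be examined as two halves. The sketch below requires the upper half
of args[0] to be zero before comparing the lower half, so that stray
high bits cannot slip past the check. The macro names ARG_LO and
ARG_HI, the helper build_arg0_check(), and the choice of EPERM are all
assumptions made for illustration.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets of the two 32-bit halves of args[n] within seccomp_data.
   Which half comes first depends on the host byte order. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n) + 4)
#define ARG_HI(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#else
#define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_HI(n) (offsetof(struct seccomp_data, args) + 8 * (n) + 4)
#endif

/* Hypothetical helper: write six instructions into 'out' that allow
   the call only if args[0], taken as a full 64-bit value, equals
   'want'; otherwise the call fails with EPERM.  Returns the number of
   instructions written. */
static size_t
build_arg0_check(struct sock_filter out[6], uint32_t want)
{
    const struct sock_filter prog[] = {
        /* [0] Load upper half of args[0] */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_HI(0)),
        /* [1] Upper half nonzero: jump to [5] */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 3),
        /* [2] Load lower half of args[0] */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(0)),
        /* [3] Lower half differs from 'want': jump to [5] */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, want, 0, 1),
        /* [4] Full 64-bit match */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        /* [5] Mismatch */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
    };

    memcpy(out, prog, sizeof(prog));
    return sizeof(prog) / sizeof(prog[0]);
}
```

Rejecting a nonzero upper half is the conservative choice for an
allow-list: it may refuse a call that the kernel would have accepted
after truncation, but it never admits a value that the filter did not
examine in full.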
213
214 A seccomp filter returns a 32-bit value consisting of two parts: the
215 most significant 16 bits (corresponding to the mask defined by the con‐
216 stant SECCOMP_RET_ACTION_FULL) contain one of the "action" values
217 listed below; the least significant 16-bits (defined by the constant
218 SECCOMP_RET_DATA) are "data" to be associated with this return value.
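
The split can be expressed with the two mask constants. The helper
names below are hypothetical; the masks are those named in the
paragraph above.

```c
#include <linux/seccomp.h>
#include <stdint.h>

/* Extract the "action" part (most significant 16 bits) */
static uint32_t
seccomp_ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Extract the "data" part (least significant 16 bits) */
static uint32_t
seccomp_ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}

/* Combine an action with 16 bits of data into a filter return value */
static uint32_t
seccomp_ret_make(uint32_t action, uint16_t data)
{
    return (action & SECCOMP_RET_ACTION_FULL) | data;
}
```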
219
220 If multiple filters exist, they are all executed, in reverse order of
221 their addition to the filter tree—that is, the most recently installed
222 filter is executed first. (Note that all filters will be called even
223 if one of the earlier filters returns SECCOMP_RET_KILL. This is done
224 to simplify the kernel code and to provide a tiny speed-up in the exe‐
225 cution of sets of filters by avoiding a check for this uncommon case.)
226 The return value for the evaluation of a given system call is the
227 first-seen action value of highest precedence (along with its accompa‐
228 nying data) returned by execution of all of the filters.
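
The selection rule can be sketched in user space as follows. The
function name combine_filter_results() is hypothetical, and the sketch
rests on one assumption: that precedence can be compared by treating
the action bits as a signed 32-bit number, which reproduces the order
given in this page (SECCOMP_RET_KILL_PROCESS highest,
SECCOMP_RET_ALLOW lowest).

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* 'results' holds each filter's return value in execution order (most
   recently installed filter first).  Returns the first-seen value
   whose action has the highest precedence. */
static uint32_t
combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t best = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        /* Assumption: a lower signed action value means a higher
           precedence */
        int32_t cur = (int32_t) (results[i] & SECCOMP_RET_ACTION_FULL);
        int32_t top = (int32_t) (best & SECCOMP_RET_ACTION_FULL);

        /* Strict '<' keeps the first-seen value among equal actions */
        if (cur < top)
            best = results[i];
    }
    return best;
}
```

For example, if three filters return SECCOMP_RET_ERRNO with data 1,
SECCOMP_RET_ALLOW, and SECCOMP_RET_ERRNO with data 2 (in that execution
order), the combined result is SECCOMP_RET_ERRNO with data 1.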
229
230 In decreasing order of precedence, the action values that may be re‐
231 turned by a seccomp filter are:
232
233 SECCOMP_RET_KILL_PROCESS (since Linux 4.14)
234 This value results in immediate termination of the process, with
235 a core dump. The system call is not executed. By contrast with
236 SECCOMP_RET_KILL_THREAD below, all threads in the thread group
237 are terminated. (For a discussion of thread groups, see the de‐
238 scription of the CLONE_THREAD flag in clone(2).)
239
240 The process terminates as though killed by a SIGSYS signal.
241 Even if a signal handler has been registered for SIGSYS, the
242 handler will be ignored in this case and the process always ter‐
243 minates. To a parent process that is waiting on this process
244 (using waitpid(2) or similar), the returned wstatus will indi‐
245 cate that its child was terminated as though by a SIGSYS signal.
246
247 SECCOMP_RET_KILL_THREAD (or SECCOMP_RET_KILL)
248 This value results in immediate termination of the thread that
249 made the system call. The system call is not executed. Other
250 threads in the same thread group will continue to execute.
251
252 The thread terminates as though killed by a SIGSYS signal. See
253 SECCOMP_RET_KILL_PROCESS above.
254
255 Before Linux 4.11, any process terminated in this way would not
256 trigger a coredump (even though SIGSYS is documented in sig‐
257 nal(7) as having a default action of termination with a core
258 dump). Since Linux 4.11, a single-threaded process will dump
259 core if terminated in this way.
260
261 With the addition of SECCOMP_RET_KILL_PROCESS in Linux 4.14,
262 SECCOMP_RET_KILL_THREAD was added as a synonym for SEC‐
263 COMP_RET_KILL, in order to more clearly distinguish the two ac‐
264 tions.
265
266 Note: the use of SECCOMP_RET_KILL_THREAD to kill a single thread
267 in a multithreaded process is likely to leave the process in a
268 permanently inconsistent and possibly corrupt state.
269
270 SECCOMP_RET_TRAP
This value results in the kernel sending a thread-directed
SIGSYS signal to the triggering thread. (The system call is not
executed.) Various fields will be set in the siginfo_t structure
(see sigaction(2)) associated with the signal:
275
276 * si_signo will contain SIGSYS.
277
278 * si_call_addr will show the address of the system call in‐
279 struction.
280
281 * si_syscall and si_arch will indicate which system call was
282 attempted.
283
284 * si_code will contain SYS_SECCOMP.
285
286 * si_errno will contain the SECCOMP_RET_DATA portion of the
287 filter return value.
288
289 The program counter will be as though the system call happened
290 (i.e., the program counter will not point to the system call in‐
291 struction). The return value register will contain an architec‐
292 ture-dependent value; if resuming execution, set it to something
293 appropriate for the system call. (The architecture dependency
294 is because replacing it with ENOSYS could overwrite some useful
295 information.)
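
A handler that consumes these fields might look like the following
sketch. The names sigsys_handler(), install_sigsys_handler(), and the
two global variables are hypothetical. The fallback definition of
SYS_SECCOMP is an assumption for C libraries whose <signal.h> does not
supply the constant.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>

#ifndef SYS_SECCOMP
#define SYS_SECCOMP 1   /* Assumed si_code value for seccomp SIGSYS */
#endif

/* Written by the handler, read by the main program */
static volatile sig_atomic_t trapped_syscall = -1;
static volatile sig_atomic_t trapped_data = -1;

/* Record which system call was trapped and the SECCOMP_RET_DATA value
   that the filter attached to SECCOMP_RET_TRAP */
static void
sigsys_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void) ucontext;

    if (sig != SIGSYS || info->si_code != SYS_SECCOMP)
        return;

    trapped_syscall = info->si_syscall;
    trapped_data = info->si_errno;
}

/* Returns 0 on success, or -1 with errno set (as for sigaction(2)) */
static int
install_sigsys_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigsys_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    return sigaction(SIGSYS, &sa, NULL);
}
```

The handler stores only into sig_atomic_t variables; any reporting or
emulation of the trapped call would then be done outside the handler.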
296
297 SECCOMP_RET_ERRNO
298 This value results in the SECCOMP_RET_DATA portion of the fil‐
299 ter's return value being passed to user space as the errno value
300 without executing the system call.
301
302 SECCOMP_RET_TRACE
303 When returned, this value will cause the kernel to attempt to
304 notify a ptrace(2)-based tracer prior to executing the system
305 call. If there is no tracer present, the system call is not ex‐
306 ecuted and returns a failure status with errno set to ENOSYS.
307
308 A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
309 using ptrace(PTRACE_SETOPTIONS). The tracer will be notified of
310 a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of the
311 filter's return value will be available to the tracer via
312 PTRACE_GETEVENTMSG.
313
314 The tracer can skip the system call by changing the system call
315 number to -1. Alternatively, the tracer can change the system
316 call requested by changing the system call to a valid system
317 call number. If the tracer asks to skip the system call, then
318 the system call will appear to return the value that the tracer
319 puts in the return value register.
320
Before Linux 4.8, the seccomp check was not run again after
the tracer was notified. (This means that, on older kernels,
seccomp-based sandboxes must not allow use of ptrace(2), even of
other sandboxed processes, without extreme care; ptracers can use
this mechanism to escape from the seccomp sandbox.)
326
327 Note that a tracer process will not be notified if another fil‐
328 ter returns an action value with a precedence greater than SEC‐
329 COMP_RET_TRACE.
330
331 SECCOMP_RET_LOG (since Linux 4.14)
332 This value results in the system call being executed after the
333 filter return action is logged. An administrator may override
334 the logging of this action via the /proc/sys/kernel/seccomp/ac‐
335 tions_logged file.
336
337 SECCOMP_RET_ALLOW
338 This value results in the system call being executed.
339
340 If an action value other than one of the above is specified, then the
341 filter action is treated as either SECCOMP_RET_KILL_PROCESS (since
342 Linux 4.14) or SECCOMP_RET_KILL_THREAD (in Linux 4.13 and earlier).
343
344 /proc interfaces
345 The files in the directory /proc/sys/kernel/seccomp provide additional
346 seccomp information and configuration:
347
348 actions_avail (since Linux 4.14)
349 A read-only ordered list of seccomp filter return actions in
350 string form. The ordering, from left-to-right, is in decreasing
351 order of precedence. The list represents the set of seccomp
352 filter return actions supported by the kernel.
353
354 actions_logged (since Linux 4.14)
355 A read-write ordered list of seccomp filter return actions that
356 are allowed to be logged. Writes to the file do not need to be
357 in ordered form but reads from the file will be ordered in the
358 same way as the actions_avail file.
359
360 It is important to note that the value of actions_logged does
361 not prevent certain filter return actions from being logged when
362 the audit subsystem is configured to audit a task. If the ac‐
363 tion is not found in the actions_logged file, the final decision
364 on whether to audit the action for that task is ultimately left
365 up to the audit subsystem to decide for all filter return ac‐
366 tions other than SECCOMP_RET_ALLOW.
367
368 The "allow" string is not accepted in the actions_logged file as
369 it is not possible to log SECCOMP_RET_ALLOW actions. Attempting
370 to write "allow" to the file will fail with the error EINVAL.
371
372 Audit logging of seccomp actions
373 Since Linux 4.14, the kernel provides the facility to log the actions
374 returned by seccomp filters in the audit log. The kernel makes the de‐
375 cision to log an action based on the action type, whether or not the
376 action is present in the actions_logged file, and whether kernel audit‐
377 ing is enabled (e.g., via the kernel boot option audit=1). The rules
378 are as follows:
379
380 * If the action is SECCOMP_RET_ALLOW, the action is not logged.
381
382 * Otherwise, if the action is either SECCOMP_RET_KILL_PROCESS or SEC‐
383 COMP_RET_KILL_THREAD, and that action appears in the actions_logged
384 file, the action is logged.
385
386 * Otherwise, if the filter has requested logging (the SECCOMP_FIL‐
387 TER_FLAG_LOG flag) and the action appears in the actions_logged
388 file, the action is logged.
389
390 * Otherwise, if kernel auditing is enabled and the process is being
391 audited (autrace(8)), the action is logged.
392
393 * Otherwise, the action is not logged.
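
The rules above can be restated as a small decision function. This is
a paraphrase of the list for illustration, not kernel code; the
function and parameter names are hypothetical, and task_audited stands
for "kernel auditing is enabled and the process is being audited".

```c
#include <linux/seccomp.h>
#include <stdbool.h>
#include <stdint.h>

/* 'action' is the action part of the filter return value;
   'in_actions_logged' says whether that action appears in the
   actions_logged file; 'filter_flag_log' says whether the filter was
   installed with SECCOMP_FILTER_FLAG_LOG. */
static bool
seccomp_action_is_logged(uint32_t action, bool in_actions_logged,
                         bool filter_flag_log, bool task_audited)
{
    if (action == SECCOMP_RET_ALLOW)
        return false;

    if ((action == SECCOMP_RET_KILL_PROCESS ||
         action == SECCOMP_RET_KILL_THREAD) && in_actions_logged)
        return true;

    if (filter_flag_log && in_actions_logged)
        return true;

    return task_audited;
}
```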
394
RETURN VALUE
On success, seccomp() returns 0. On error, if
SECCOMP_FILTER_FLAG_TSYNC was used, the return value is the ID of the
thread that caused the synchronization failure. (This ID is a kernel
thread ID of the type returned by clone(2) and gettid(2).) On other
errors, -1 is returned, and errno is set to indicate the cause of the
error.
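
A caller that uses SECCOMP_FILTER_FLAG_TSYNC therefore has three
outcomes to distinguish. The sketch below shows one way to do so; the
helper name describe_tsync_result() and the message texts are
hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* 'ret' is the raw return value of the seccomp() call; 'err' is the
   value of errno saved immediately after the call.  The message is
   written to 'buf' (of 'size' bytes), which is also returned. */
static const char *
describe_tsync_result(long ret, int err, char *buf, size_t size)
{
    if (ret == 0)
        snprintf(buf, size, "filter installed on all threads");
    else if (ret > 0)
        snprintf(buf, size, "thread %ld could not be synchronized", ret);
    else
        snprintf(buf, size, "seccomp failed: %s", strerror(err));

    return buf;
}
```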
401
ERRORS
seccomp() can fail for the following reasons:
404
405 EACCES The caller did not have the CAP_SYS_ADMIN capability in its user
406 namespace, or had not set no_new_privs before using SEC‐
407 COMP_SET_MODE_FILTER.
408
409 EFAULT args was not a valid address.
410
411 EINVAL operation is unknown or is not supported by this kernel version
412 or configuration.
413
414 EINVAL The specified flags are invalid for the given operation.
415
EINVAL The filter program included a BPF_ABS load, but the specified
offset was not aligned to a 32-bit boundary or exceeded
sizeof(struct seccomp_data).
419
420 EINVAL A secure computing mode has already been set, and operation dif‐
421 fers from the existing setting.
422
423 EINVAL operation specified SECCOMP_SET_MODE_FILTER, but the filter pro‐
424 gram pointed to by args was not valid or the length of the fil‐
425 ter program was zero or exceeded BPF_MAXINSNS (4096) instruc‐
426 tions.
427
428 ENOMEM Out of memory.
429
ENOMEM The total length of all filter programs attached to the calling
thread would exceed MAX_INSNS_PER_PATH (32768) instructions.
Note that for the purposes of calculating this limit, each
already existing filter program incurs an overhead penalty of 4
instructions.
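
The arithmetic can be sketched as follows. The function name
filter_would_fit() and the constant PER_FILTER_PENALTY are
hypothetical, MAX_INSNS_PER_PATH is defined locally with the value
quoted above, and the exact accounting in a given kernel version may
differ.

```c
#include <stddef.h>

#define MAX_INSNS_PER_PATH 32768    /* Limit quoted in this page */
#define PER_FILTER_PENALTY 4        /* Overhead per existing filter */

/* Returns nonzero if a new filter of 'new_len' instructions would
   stay within the limit, given the lengths of the 'n_existing'
   filters already attached to the thread. */
static int
filter_would_fit(const unsigned short *existing_lens, size_t n_existing,
                 unsigned short new_len)
{
    unsigned long total = new_len;

    for (size_t i = 0; i < n_existing; i++)
        total += existing_lens[i] + PER_FILTER_PENALTY;

    return total <= MAX_INSNS_PER_PATH;
}
```

With seven attached filters of 4096 instructions each, the running
total is 7 * (4096 + 4) = 28700, so a further filter of up to 4068
instructions would fit under this model.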
435
436 EOPNOTSUPP
437 operation specified SECCOMP_GET_ACTION_AVAIL, but the kernel
438 does not support the filter return action specified by args.
439
440 ESRCH Another thread caused a failure during thread sync, but its ID
441 could not be determined.
442
VERSIONS
The seccomp() system call first appeared in Linux 3.17.

CONFORMING TO
The seccomp() system call is a nonstandard Linux extension.

NOTES
Rather than hand-coding seccomp filters as shown in the example below,
you may prefer to employ the libseccomp library, which provides a
front-end for generating seccomp filters.
453
454 The Seccomp field of the /proc/[pid]/status file provides a method of
455 viewing the seccomp mode of a process; see proc(5).
456
457 seccomp() provides a superset of the functionality provided by the
458 prctl(2) PR_SET_SECCOMP operation (which does not support flags).
459
460 Since Linux 4.4, the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation can
461 be used to dump a process's seccomp filters.
462
463 Architecture support for seccomp BPF
464 Architecture support for seccomp BPF filtering is available on the fol‐
465 lowing architectures:
466
467 * x86-64, i386, x32 (since Linux 3.5)
468 * ARM (since Linux 3.8)
469 * s390 (since Linux 3.8)
470 * MIPS (since Linux 3.16)
471 * ARM-64 (since Linux 3.19)
472 * PowerPC (since Linux 4.3)
473 * Tile (since Linux 4.3)
474 * PA-RISC (since Linux 4.6)
475
476 Caveats
477 There are various subtleties to consider when applying seccomp filters
478 to a program, including the following:
479
480 * Some traditional system calls have user-space implementations in the
481 vdso(7) on many architectures. Notable examples include clock_get‐
482 time(2), gettimeofday(2), and time(2). On such architectures, sec‐
483 comp filtering for these system calls will have no effect. (How‐
484 ever, there are cases where the vdso(7) implementations may fall
485 back to invoking the true system call, in which case seccomp filters
486 would see the system call.)
487
488 * Seccomp filtering is based on system call numbers. However, appli‐
489 cations typically do not directly invoke system calls, but instead
490 call wrapper functions in the C library which in turn invoke the
491 system calls. Consequently, one must be aware of the following:
492
493 • The glibc wrappers for some traditional system calls may actually
494 employ system calls with different names in the kernel. For ex‐
495 ample, the exit(2) wrapper function actually employs the
496 exit_group(2) system call, and the fork(2) wrapper function actu‐
497 ally calls clone(2).
498
499 • The behavior of wrapper functions may vary across architectures,
500 according to the range of system calls provided on those archi‐
501 tectures. In other words, the same wrapper function may invoke
502 different system calls on different architectures.
503
504 • Finally, the behavior of wrapper functions can change across
505 glibc versions. For example, in older versions, the glibc wrap‐
506 per function for open(2) invoked the system call of the same
507 name, but starting in glibc 2.26, the implementation switched to
508 calling openat(2) on all architectures.
509
510 The consequence of the above points is that it may be necessary to fil‐
511 ter for a system call other than might be expected. Various manual
512 pages in Section 2 provide helpful details about the differences be‐
513 tween wrapper functions and the underlying system calls in subsections
514 entitled C library/kernel differences.
515
516 Furthermore, note that the application of seccomp filters even risks
517 causing bugs in an application, when the filters cause unexpected fail‐
518 ures for legitimate operations that the application might need to per‐
519 form. Such bugs may not easily be discovered when testing the seccomp
520 filters if the bugs occur in rarely used application code paths.
521
522 Seccomp-specific BPF details
523 Note the following BPF details specific to seccomp filters:
524
525 * The BPF_H and BPF_B size modifiers are not supported: all operations
526 must load and store (4-byte) words (BPF_W).
527
528 * To access the contents of the seccomp_data buffer, use the BPF_ABS
529 addressing mode modifier.
530
531 * The BPF_LEN addressing mode modifier yields an immediate mode oper‐
532 and whose value is the size of the seccomp_data buffer.
533
EXAMPLES
The program below accepts four or more arguments. The first three
arguments are a system call number, a numeric architecture identifier,
and an error number. The program uses these values to construct a BPF
filter that is used at run time to perform the following checks:

[1] If the program is not running on the specified architecture, the
BPF filter kills the process (SECCOMP_RET_KILL_PROCESS).
542
543 [2] If the program attempts to execute the system call with the speci‐
544 fied number, the BPF filter causes the system call to fail, with
545 errno being set to the specified error number.
546
547 The remaining command-line arguments specify the pathname and addi‐
548 tional arguments of a program that the example program should attempt
549 to execute using execv(3) (a library function that employs the ex‐
550 ecve(2) system call). Some example runs of the program are shown be‐
551 low.
552
553 First, we display the architecture that we are running on (x86-64) and
554 then construct a shell function that looks up system call numbers on
555 this architecture:
556
    $ uname -m
    x86_64
    $ syscall_nr() {
        cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \
        awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
    }
563
564 When the BPF filter rejects a system call (case [2] above), it causes
565 the system call to fail with the error number specified on the command
566 line. In the experiments shown here, we'll use error number 99:
567
    $ errno 99
    EADDRNOTAVAIL 99 Cannot assign requested address
570
571 In the following example, we attempt to run the command whoami(1), but
572 the BPF filter rejects the execve(2) system call, so that the command
573 is not even executed:
574
    $ syscall_nr execve
    59
    $ ./a.out
    Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
    Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
                     AUDIT_ARCH_X86_64: 0xC000003E
    $ ./a.out 59 0xC000003E 99 /bin/whoami
    execv: Cannot assign requested address
583
584 In the next example, the BPF filter rejects the write(2) system call,
585 so that, although it is successfully started, the whoami(1) command is
586 not able to write output:
587
    $ syscall_nr write
    1
    $ ./a.out 1 0xC000003E 99 /bin/whoami
591
592 In the final example, the BPF filter rejects a system call that is not
593 used by the whoami(1) command, so it is able to successfully execute
594 and produce output:
595
    $ syscall_nr preadv
    295
    $ ./a.out 295 0xC000003E 99 /bin/whoami
    cecilia
600
601 Program source
    #include <errno.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <linux/audit.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>

    #define X32_SYSCALL_BIT 0x40000000
    #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

    static int
    install_filter(int syscall_nr, int t_arch, int f_errno)
    {
        unsigned int upper_nr_limit = 0xffffffff;

        /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
           (in the x32 ABI, all system calls have bit 30 set in the
           'nr' field, meaning the numbers are >= X32_SYSCALL_BIT) */
        if (t_arch == AUDIT_ARCH_X86_64)
            upper_nr_limit = X32_SYSCALL_BIT - 1;

        struct sock_filter filter[] = {
            /* [0] Load architecture from 'seccomp_data' buffer into
                   accumulator */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     (offsetof(struct seccomp_data, arch))),

            /* [1] Jump forward 5 instructions if architecture does not
                   match 't_arch' */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

            /* [2] Load system call number from 'seccomp_data' buffer into
                   accumulator */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     (offsetof(struct seccomp_data, nr))),

            /* [3] Check ABI - only needed for x86-64 in deny-list use
                   cases.  Use BPF_JGT instead of checking against the bit
                   mask to avoid having to reload the syscall number. */
            BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

            /* [4] Jump forward 1 instruction if system call number
                   does not match 'syscall_nr' */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

            /* [5] Matching architecture and system call: don't execute
                   the system call, and return 'f_errno' in 'errno' */
            BPF_STMT(BPF_RET | BPF_K,
                     SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

            /* [6] Destination of system call number mismatch: allow other
                   system calls */
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

            /* [7] Destination of architecture mismatch: kill process */
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        };

        struct sock_fprog prog = {
            .len = ARRAY_SIZE(filter),
            .filter = filter,
        };

        /* glibc provides no wrapper for seccomp(), so invoke the
           system call through syscall(2) */
        if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
            perror("seccomp");
            return 1;
        }

        return 0;
    }

    int
    main(int argc, char **argv)
    {
        if (argc < 5) {
            fprintf(stderr, "Usage: "
                    "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                    "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                    "                 AUDIT_ARCH_X86_64: 0x%X\n"
                    "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
            exit(EXIT_FAILURE);
        }

        /* Required for SECCOMP_SET_MODE_FILTER unless the caller has
           CAP_SYS_ADMIN in its user namespace */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
            perror("prctl");
            exit(EXIT_FAILURE);
        }

        if (install_filter(strtol(argv[1], NULL, 0),
                           strtol(argv[2], NULL, 0),
                           strtol(argv[3], NULL, 0)))
            exit(EXIT_FAILURE);

        execv(argv[4], &argv[4]);
        perror("execv");
        exit(EXIT_FAILURE);
    }
702
SEE ALSO
bpfc(1), strace(1), bpf(2), prctl(2), ptrace(2), sigaction(2), proc(5),
signal(7), socket(7)

Various pages from the libseccomp library, including:
scmp_sys_resolver(1), seccomp_export_bpf(3), seccomp_init(3),
seccomp_load(3), and seccomp_rule_add(3).

The kernel source files Documentation/networking/filter.txt and
Documentation/userspace-api/seccomp_filter.rst (or
Documentation/prctl/seccomp_filter.txt before Linux 4.13).

McCanne, S. and Jacobson, V. (1992) The BSD Packet Filter: A New
Architecture for User-level Packet Capture, Proceedings of the USENIX
Winter 1993 Conference ⟨http://www.tcpdump.org/papers/bpf-usenix93.pdf⟩

COLOPHON
This page is part of release 5.10 of the Linux man-pages project. A
description of the project, information about reporting bugs, and the
latest version of this page, can be found at
https://www.kernel.org/doc/man-pages/.

Linux                             2020-11-01                        SECCOMP(2)