SECCOMP(2)                 Linux Programmer's Manual                SECCOMP(2)

NAME

       seccomp - operate on Secure Computing state of the process

SYNOPSIS

       #include <linux/seccomp.h>  /* Definition of SECCOMP_* constants */
       #include <linux/filter.h>   /* Definition of struct sock_fprog */
       #include <linux/audit.h>    /* Definition of AUDIT_* constants */
       #include <linux/signal.h>   /* Definition of SIG* constants */
       #include <sys/ptrace.h>     /* Definition of PTRACE_* constants */
       #include <sys/syscall.h>    /* Definition of SYS_* constants */
       #include <unistd.h>

       int syscall(SYS_seccomp, unsigned int operation, unsigned int flags,
                   void *args);

       Note: glibc provides no wrapper for seccomp(), necessitating the use of
       syscall(2).
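
       Because there is no wrapper, programs commonly define a small one of
       their own.  A minimal sketch follows; the name sys_seccomp() is
       hypothetical and is not provided by any library:

```c
#include <stddef.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: forwards its arguments to the raw system call.
   Returns what syscall(2) returns: -1 with errno set on most errors. */
static int
sys_seccomp(unsigned int operation, unsigned int flags, void *args)
{
    return (int) syscall(SYS_seccomp, operation, flags, args);
}
```

       With such a helper, a request for strict mode would read
       sys_seccomp(SECCOMP_SET_MODE_STRICT, 0, NULL).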

DESCRIPTION

       The seccomp() system call operates on the  Secure  Computing  (seccomp)
       state of the calling process.

       Currently, Linux supports the following operation values:

       SECCOMP_SET_MODE_STRICT
              The  only  system  calls that the calling thread is permitted to
              make are read(2), write(2), _exit(2)  (but  not  exit_group(2)),
              and  sigreturn(2).  Other system calls result in the delivery of
              a SIGKILL signal.  Strict secure computing mode  is  useful  for
              number-crunching applications that may need to execute untrusted
              byte code, perhaps obtained by reading from a pipe or socket.

              Note that although the calling thread can no  longer  call  sig‐
              procmask(2),  it can use sigreturn(2) to block all signals apart
              from SIGKILL and SIGSTOP.  This means that alarm(2)  (for  exam‐
              ple)  is  not sufficient for restricting the process's execution
              time.  Instead, to reliably terminate the process, SIGKILL  must
              be  used.   This  can  be  done  by  using  timer_create(2) with
              SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by  using  setr‐
              limit(2) to set the hard limit for RLIMIT_CPU.

              This  operation  is  available  only if the kernel is configured
              with CONFIG_SECCOMP enabled.

              The value of flags must be 0, and args must be NULL.

              This operation is functionally identical to the call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

       SECCOMP_SET_MODE_FILTER
              The system calls allowed are defined by a pointer to a  Berkeley
              Packet Filter (BPF) passed via args.  This argument is a pointer
              to a struct sock_fprog; it can be designed to  filter  arbitrary
              system  calls  and  system call arguments.  If the filter is in‐
              valid, seccomp() fails, returning EINVAL in errno.

              If fork(2) or clone(2) is allowed by the filter, any child  pro‐
              cesses  will  be  constrained to the same system call filters as
              the parent.  If execve(2) is allowed, the existing filters  will
              be preserved across a call to execve(2).

              In  order  to  use the SECCOMP_SET_MODE_FILTER operation, either
              the calling thread must have the CAP_SYS_ADMIN capability in its
              user namespace, or the thread must already have the no_new_privs
              bit set.  If that bit was not already set by an ancestor of this
              thread, the thread must make the following call:

                  prctl(PR_SET_NO_NEW_PRIVS, 1);

              Otherwise,  the  SECCOMP_SET_MODE_FILTER operation fails and re‐
              turns EACCES in errno.  This requirement ensures that an unpriv‐
              ileged process cannot apply a malicious filter and then invoke a
              set-user-ID or other privileged program  using  execve(2),  thus
              potentially compromising that program.  (Such a malicious filter
              might, for example, cause an attempt to use setuid(2) to set the
              caller's  user IDs to nonzero values to instead return 0 without
              actually making the system call.  Thus,  the  program  might  be
              tricked  into  retaining  superuser  privileges in circumstances
              where it is possible to influence it to do dangerous things  be‐
              cause it did not actually drop privileges.)

              If prctl(2) or seccomp() is allowed by the attached filter, fur‐
              ther filters may be added.  This will increase evaluation  time,
              but  allows  for  further reduction of the attack surface during
              execution of a thread.

              The SECCOMP_SET_MODE_FILTER operation is available only  if  the
              kernel is configured with CONFIG_SECCOMP_FILTER enabled.

              When flags is 0, this operation is functionally identical to the
              call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);

              The recognized flags are:

              SECCOMP_FILTER_FLAG_LOG (since Linux 4.14)
                     All filter return actions except SECCOMP_RET_ALLOW should
                     be  logged.   An  administrator  may override this filter
                     flag by preventing specific actions from being logged via
                     the /proc/sys/kernel/seccomp/actions_logged file.

              SECCOMP_FILTER_FLAG_NEW_LISTENER (since Linux 5.0)
                     After  successfully installing the filter program, return
                     a new  user-space  notification  file  descriptor.   (The
                     close-on-exec flag is set for the file descriptor.)  When
                     the filter returns SECCOMP_RET_USER_NOTIF a  notification
                     will be sent to this file descriptor.

                     At   most  one  seccomp  filter  using  the  SECCOMP_FIL‐
                     TER_FLAG_NEW_LISTENER flag can be installed for a thread.

                     See seccomp_unotify(2) for further details.

              SECCOMP_FILTER_FLAG_SPEC_ALLOW (since Linux 4.17)
                     Disable Speculative Store Bypass mitigation.

              SECCOMP_FILTER_FLAG_TSYNC
                     When adding a new filter, synchronize all  other  threads
                     of  the  calling process to the same seccomp filter tree.
                     A "filter tree" is the ordered list of  filters  attached
                     to  a  thread.   (Attaching identical filters in separate
                     seccomp() calls results in different  filters  from  this
                     perspective.)

                     If any thread cannot synchronize to the same filter tree,
                     the call will not attach the new seccomp filter, and will
                     fail,  returning  the  first  thread ID found that cannot
                     synchronize.  Synchronization will fail if another thread
                     in  the  same  process is in SECCOMP_MODE_STRICT or if it
                     has attached new seccomp  filters  to  itself,  diverging
                     from the calling thread's filter tree.

       SECCOMP_GET_ACTION_AVAIL (since Linux 4.14)
              Test to see if an action is supported by the kernel.  This oper‐
              ation is helpful to confirm that the kernel knows of a more  re‐
              cently  added  filter  return action since the kernel treats all
              unknown actions as SECCOMP_RET_KILL_PROCESS.

              The value of flags must be 0, and args must be a pointer  to  an
              unsigned 32-bit filter return action.

       SECCOMP_GET_NOTIF_SIZES (since Linux 5.0)
              Get the sizes of the seccomp user-space notification structures.
              Since these structures may evolve and grow over time, this  com‐
              mand  can  be  used to determine how much memory to allocate for
              sending and receiving notifications.

              The value of flags must be 0, and args must be a  pointer  to  a
              struct seccomp_notif_sizes, which has the following form:

              struct seccomp_notif_sizes {
                  __u16 seccomp_notif;      /* Size of notification structure */
                  __u16 seccomp_notif_resp; /* Size of response structure */
                  __u16 seccomp_data;       /* Size of 'struct seccomp_data' */
              };

              See seccomp_unotify(2) for further details.
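
       The SECCOMP_GET_ACTION_AVAIL operation described above could be used
       along the following lines.  This is a sketch only; the helper name
       action_avail() and its return convention are assumptions made for
       illustration:

```c
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the kernel supports the action,
   0 if it reports EOPNOTSUPP, and -1 on any other error (for example,
   EINVAL on a kernel that lacks SECCOMP_GET_ACTION_AVAIL). */
static int
action_avail(uint32_t action)
{
    /* flags must be 0; args points to the 32-bit action value. */
    if (syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return (errno == EOPNOTSUPP) ? 0 : -1;
}
```

       A program might call action_avail(SECCOMP_RET_KILL_PROCESS) before
       building a filter that returns that action, and fall back to
       SECCOMP_RET_KILL_THREAD if the result is not 1.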

   Filters
       When  adding filters via SECCOMP_SET_MODE_FILTER, args points to a fil‐
       ter program:

           struct sock_fprog {
               unsigned short      len;    /* Number of BPF instructions */
               struct sock_filter *filter; /* Pointer to array of
                                              BPF instructions */
           };

       Each program must contain one or more BPF instructions:

           struct sock_filter {            /* Filter block */
               __u16 code;                 /* Actual filter code */
               __u8  jt;                   /* Jump true */
               __u8  jf;                   /* Jump false */
               __u32 k;                    /* Generic multiuse field */
           };
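
       The two structures above fit together as in the following sketch,
       which builds a one-instruction program that allows every system call
       and installs it.  The helper name install_allow_all() is hypothetical,
       and the sketch sets no_new_privs first on the assumption that the
       caller lacks CAP_SYS_ADMIN:

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 on success, -1 with errno set on error. */
static int
install_allow_all(void)
{
    /* A program of exactly one instruction: return SECCOMP_RET_ALLOW. */
    struct sock_filter insns[] = {
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short) (sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Required unless the thread has CAP_SYS_ADMIN; otherwise the
       seccomp() call below fails with EACCES. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;

    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}
```

       A real policy would replace the single instruction with checks on the
       fields of struct seccomp_data.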

       When executing the instructions, the BPF program operates on the system
       call information made available (i.e., use the BPF_ABS addressing mode)
       as a (read-only) buffer of the following form:

           struct seccomp_data {
               int   nr;                   /* System call number */
               __u32 arch;                 /* AUDIT_ARCH_* value
                                              (see <linux/audit.h>) */
               __u64 instruction_pointer;  /* CPU instruction pointer */
               __u64 args[6];              /* Up to 6 system call arguments */
           };

       Because numbering of system calls varies between architectures and some
       architectures  (e.g.,  x86-64) allow user-space code to use the calling
       conventions of multiple architectures (and the  convention  being  used
       may  vary over the life of a process that uses execve(2) to execute bi‐
       naries that employ the different conventions), it is usually  necessary
       to verify the value of the arch field.
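
       A filter typically starts with such an arch check.  The fragment below
       is a sketch that assumes the policy targets x86-64 only and that a
       mismatch should kill the process; both choices are illustrative:

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Illustrative prologue; AUDIT_ARCH_X86_64 is an assumed target. */
static struct sock_filter arch_check[] = {
    /* [0] Load seccomp_data.arch into the accumulator. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* [1] If it matches, jump over instruction [2]. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /* [2] Unexpected architecture: kill the process. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] Placeholder for the rest of the policy. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```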

       It  is strongly recommended to use an allow-list approach whenever pos‐
       sible because such an approach is more robust and simpler.  A deny-list
       will have to be updated whenever a potentially dangerous system call is
       added (or a dangerous flag or option if those are deny-listed), and  it
       is often possible to alter the representation of a value without alter‐
       ing its meaning, leading to a deny-list bypass.  See also  Caveats  be‐
       low.

       The  arch  field is not unique for all calling conventions.  The x86-64
       ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on
       the  same  processors.   Instead, the mask __X32_SYSCALL_BIT is used on
       the system call number to tell the two ABIs apart.

       This  means  that  a  policy  must  either  deny  all   syscalls   with
       __X32_SYSCALL_BIT  or  it  must  recognize  syscalls  with  and without
       __X32_SYSCALL_BIT set.  A list of system calls to be denied based on nr
       that  does not also contain nr values with __X32_SYSCALL_BIT set can be
       bypassed by a malicious program that sets __X32_SYSCALL_BIT.
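
       The first option, denying every system call that has the bit set, can
       be sketched as follows.  The local macro X32_SYSCALL_BIT stands in for
       __X32_SYSCALL_BIT (0x40000000), which is only defined by the x86
       headers; the array name and the choice of SECCOMP_RET_KILL_PROCESS are
       illustrative:

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define X32_SYSCALL_BIT 0x40000000  /* Stand-in for __X32_SYSCALL_BIT */

static struct sock_filter deny_x32[] = {
    /* [0] Load the system call number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] If nr >= X32_SYSCALL_BIT, fall through to [2];
           otherwise skip it and continue at [3]. */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1),
    /* [2] x32 system call: kill the process. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] Placeholder for the rest of the policy. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```

       This fragment is meaningful only after the arch field has been checked
       against AUDIT_ARCH_X86_64.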

       Additionally, kernels prior to Linux 5.4 incorrectly  permitted  nr  in
       the  ranges  512-547 as well as the corresponding non-x32 syscalls ORed
       with __X32_SYSCALL_BIT.  For example, nr  ==  521  and  nr  ==  (101  |
       __X32_SYSCALL_BIT) would result in invocations of ptrace(2) with poten‐
       tially confused x32-vs-x86_64 semantics in the  kernel.   Policies  in‐
       tended  to  work on kernels before Linux 5.4 must ensure that they deny
       or otherwise correctly handle these system calls.   On  Linux  5.4  and
       newer, such system calls will fail with the error ENOSYS, without doing
       anything.

       The instruction_pointer field provides the address of the  machine-lan‐
       guage instruction that performed the system call.  This might be useful
       in conjunction with the use of /proc/[pid]/maps to perform checks based
       on which region (mapping) of the program made the system call.  (It  is
       probably wise to lock down the mmap(2) and mprotect(2) system calls  to
       prevent the program from subverting such checks.)

       When  checking  values from args, keep in mind that arguments are often
       silently truncated before being processed, but after the seccomp check.
       For  example, this happens if the i386 ABI is used on an x86-64 kernel:
       although the kernel will normally not look beyond the 32 lowest bits of
       the  arguments, the values of the full 64-bit registers will be present
       in the seccomp data.  A less surprising example is that if  the  x86-64
       ABI  is  used  to  perform a system call that takes an argument of type
       int, the more-significant half of the argument register is  ignored  by
       the system call, but visible in the seccomp data.
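
       Because seccomp BPF loads 32-bit words only, a filter that inspects an
       argument has to treat the two halves of each args element separately.
       The sketch below allows a call only when args[0] equals 2 as a full
       64-bit value; the macro names, the constant 2, and the EPERM result
       are all illustrative assumptions:

```c
#include <endian.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Byte offsets of the low and high 32-bit halves of args[n]. */
#if __BYTE_ORDER == __LITTLE_ENDIAN
#define ARG_LO(n) (offsetof(struct seccomp_data, args[n]))
#define ARG_HI(n) (offsetof(struct seccomp_data, args[n]) + sizeof(uint32_t))
#else
#define ARG_LO(n) (offsetof(struct seccomp_data, args[n]) + sizeof(uint32_t))
#define ARG_HI(n) (offsetof(struct seccomp_data, args[n]))
#endif

static struct sock_filter arg0_is_2[] = {
    /* [0] Load the high half of args[0]. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_HI(0)),
    /* [1] High half must be 0; otherwise jump to [5]. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 3),
    /* [2] Load the low half of args[0]. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(0)),
    /* [3] Low half must be 2; otherwise jump to [5]. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 0, 1),
    /* [4] Both halves matched. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* [5] Mismatch: fail the call with EPERM. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
};
```

       Whether a nonzero high half should be rejected or ignored depends on
       the system call being filtered; this sketch rejects it.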

       A  seccomp  filter  returns a 32-bit value consisting of two parts: the
       most significant 16 bits (corresponding to the mask defined by the con‐
       stant  SECCOMP_RET_ACTION_FULL)  contain  one  of  the  "action" values
       listed below; the least significant 16 bits (defined  by  the  constant
       SECCOMP_RET_DATA) are "data" to be associated with this return value.
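
       The split can be expressed with the two mask constants.  The helper
       names in this sketch are hypothetical:

```c
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Upper 16 bits: the action. */
static uint32_t
ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Lower 16 bits: the data that accompanies the action. */
static uint32_t
ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}

/* Compose a return value that fails the system call with 'err'. */
static uint32_t
make_errno_ret(int err)
{
    return SECCOMP_RET_ERRNO | ((uint32_t) err & SECCOMP_RET_DATA);
}
```

       For instance, make_errno_ret(EPERM) yields a value whose action part
       is SECCOMP_RET_ERRNO and whose data part is EPERM.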

       If  multiple  filters exist, they are all executed, in reverse order of
       their addition to the filter tree; that is, the most recently installed
       filter  is  executed first.  (Note that all filters will be called even
       if one of the earlier filters returns SECCOMP_RET_KILL.  This  is  done
       to  simplify the kernel code and to provide a tiny speed-up in the exe‐
       cution of sets of filters by avoiding a check for this uncommon  case.)
       The  return  value  for  the  evaluation  of a given system call is the
       first-seen action value of highest precedence (along with its  accompa‐
       nying data) returned by execution of all of the filters.
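
       The selection rule can be modeled in user space.  The sketch below is
       a model for illustration, not kernel code; it relies on the
       observation that the SECCOMP_RET_* action constants sort in precedence
       order when their action bits are compared as signed 32-bit integers.
       The function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Smaller rank means higher precedence.  The conversion to int32_t
   assumes the usual two's-complement behavior of GCC and Clang. */
static int32_t
action_rank(uint32_t ret)
{
    return (int32_t) (ret & SECCOMP_RET_ACTION_FULL);
}

/* results[0] comes from the most recently installed filter.  Returns
   the first-seen value whose action has the highest precedence. */
static uint32_t
combine_results(const uint32_t *results, size_t n)
{
    uint32_t best = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        /* Strict comparison keeps the first value seen on a tie. */
        if (action_rank(results[i]) < action_rank(best))
            best = results[i];
    }
    return best;
}
```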

       In  decreasing  order  of precedence, the action values that may be re‐
       turned by a seccomp filter are:

       SECCOMP_RET_KILL_PROCESS (since Linux 4.14)
              This value results in immediate termination of the process, with
              a core dump.  The system call is not executed.  By contrast with
              SECCOMP_RET_KILL_THREAD below, all threads in the  thread  group
              are terminated.  (For a discussion of thread groups, see the de‐
              scription of the CLONE_THREAD flag in clone(2).)

              The process terminates as though  killed  by  a  SIGSYS  signal.
              Even  if  a  signal  handler has been registered for SIGSYS, the
              handler will be ignored in this case and the process always ter‐
              minates.   To  a  parent process that is waiting on this process
              (using waitpid(2) or similar), the returned wstatus  will  indi‐
              cate that its child was terminated as though by a SIGSYS signal.

       SECCOMP_RET_KILL_THREAD (or SECCOMP_RET_KILL)
              This  value  results in immediate termination of the thread that
              made the system call.  The system call is not  executed.   Other
              threads in the same thread group will continue to execute.

              The  thread terminates as though killed by a SIGSYS signal.  See
              SECCOMP_RET_KILL_PROCESS above.

              Before Linux 4.11, any process terminated in this way would  not
              trigger  a  coredump  (even  though SIGSYS is documented in sig‐
              nal(7) as having a default action of  termination  with  a  core
              dump).   Since  Linux  4.11, a single-threaded process will dump
              core if terminated in this way.

              With the addition of  SECCOMP_RET_KILL_PROCESS  in  Linux  4.14,
              SECCOMP_RET_KILL_THREAD   was   added  as  a  synonym  for  SEC‐
              COMP_RET_KILL, in order to more clearly distinguish the two  ac‐
              tions.

              Note: the use of SECCOMP_RET_KILL_THREAD to kill a single thread
              in a multithreaded process is likely to leave the process  in  a
              permanently inconsistent and possibly corrupt state.

       SECCOMP_RET_TRAP
              This  value  results  in  the  kernel  sending a thread-directed
              SIGSYS signal to the triggering thread.  (The system call is not
              executed.)   Various  fields will be set in the siginfo_t struc‐
              ture (see sigaction(2)) associated with the signal:

              *  si_signo will contain SIGSYS.

              *  si_call_addr will show the address of  the  system  call  in‐
                 struction.

              *  si_syscall  and  si_arch  will indicate which system call was
                 attempted.

              *  si_code will contain SYS_SECCOMP.

              *  si_errno will contain the  SECCOMP_RET_DATA  portion  of  the
                 filter return value.

              The  program  counter will be as though the system call happened
              (i.e., the program counter will not point to the system call in‐
              struction).  The return value register will contain an architec‐
              ture-dependent value; if resuming execution, set it to something
              appropriate  for  the system call.  (The architecture dependency
              is because replacing it with ENOSYS could overwrite some  useful
              information.)

       SECCOMP_RET_ERRNO
              This  value  results in the SECCOMP_RET_DATA portion of the fil‐
              ter's return value being passed to user space as the errno value
              without executing the system call.

       SECCOMP_RET_USER_NOTIF (since Linux 5.0)
              Forward  the  system  call  to an attached user-space supervisor
              process to allow that process to decide what to do with the sys‐
              tem  call.   If  there is no attached supervisor (either because
              the   filter   was   not   installed   with   the   SECCOMP_FIL‐
              TER_FLAG_NEW_LISTENER  flag  or  because the file descriptor was
              closed), the filter returns ENOSYS (similar to what happens when
              a filter returns SECCOMP_RET_TRACE and there is no tracer).  See
              seccomp_unotify(2) for further details.

              Note that the supervisor process will not be notified if another
              filter  returns  an  action value with a precedence greater than
              SECCOMP_RET_USER_NOTIF.

       SECCOMP_RET_TRACE
              When returned, this value will cause the kernel  to  attempt  to
              notify  a  ptrace(2)-based  tracer prior to executing the system
              call.  If there is no tracer present, the system call is not ex‐
              ecuted and returns a failure status with errno set to ENOSYS.

              A  tracer  will be notified if it requests PTRACE_O_TRACESECCOMP
              using ptrace(PTRACE_SETOPTIONS).  The tracer will be notified of
              a  PTRACE_EVENT_SECCOMP  and the SECCOMP_RET_DATA portion of the
              filter's return value  will  be  available  to  the  tracer  via
              PTRACE_GETEVENTMSG.

              The  tracer can skip the system call by changing the system call
              number to -1.  Alternatively, the tracer can change  the  system
              call  requested  by  changing  the system call to a valid system
              call number.  If the tracer asks to skip the system  call,  then
              the  system call will appear to return the value that the tracer
              puts in the return value register.

              Before Linux 4.8, the seccomp check is not run again  after  the
              tracer  is  notified.   (This  means  that,  on  older  kernels,
              seccomp-based sandboxes must not allow use of ptrace(2), even of
              other  sandboxed  processes, without extreme care; ptracers  can
              use this mechanism to escape from the seccomp sandbox.)

              Note that a tracer process will not be notified if another  fil‐
              ter  returns an action value with a precedence greater than SEC‐
              COMP_RET_TRACE.

       SECCOMP_RET_LOG (since Linux 4.14)
              This value results in the system call being executed  after  the
              filter  return  action is logged.  An administrator may override
              the logging of this action via the  /proc/sys/kernel/seccomp/ac‐
              tions_logged file.

       SECCOMP_RET_ALLOW
              This value results in the system call being executed.

       If  an  action value other than one of the above is specified, then the
       filter action is  treated  as  either  SECCOMP_RET_KILL_PROCESS  (since
       Linux 4.14) or SECCOMP_RET_KILL_THREAD (in Linux 4.13 and earlier).

   /proc interfaces
       The  files in the directory /proc/sys/kernel/seccomp provide additional
       seccomp information and configuration:

       actions_avail (since Linux 4.14)
              A read-only ordered list of seccomp  filter  return  actions  in
              string form.  The ordering, from left-to-right, is in decreasing
              order of precedence.  The list represents  the  set  of  seccomp
              filter return actions supported by the kernel.

       actions_logged (since Linux 4.14)
              A  read-write ordered list of seccomp filter return actions that
              are allowed to be logged.  Writes to the file do not need to  be
              in  ordered  form but reads from the file will be ordered in the
              same way as the actions_avail file.

              It is important to note that the value  of  actions_logged  does
              not prevent certain filter return actions from being logged when
              the audit subsystem is configured to audit a task.  If  the  ac‐
              tion is not found in the actions_logged file, the final decision
              on whether to audit the action for that task is ultimately  left
              up  to  the  audit subsystem to decide for all filter return ac‐
              tions other than SECCOMP_RET_ALLOW.

              The "allow" string is not accepted in the actions_logged file as
              it is not possible to log SECCOMP_RET_ALLOW actions.  Attempting
              to write "allow" to the file will fail with the error EINVAL.
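
       A program can consult actions_avail as an alternative to
       SECCOMP_GET_ACTION_AVAIL.  The sketch below assumes that the file
       holds a single line of space-separated names such as "kill_process" or
       "allow"; the helper names are hypothetical:

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if 'name' appears as a whole token in the space-separated
   'list', and 0 otherwise. */
static int
list_has_action(const char *list, const char *name)
{
    size_t n = strlen(name);
    const char *p = list;

    if (n == 0)
        return 0;
    while ((p = strstr(p, name)) != NULL) {
        int at_start = (p == list) || (p[-1] == ' ');
        int at_end = (p[n] == '\0') || (p[n] == ' ') || (p[n] == '\n');

        if (at_start && at_end)
            return 1;
        p++;    /* Partial match only: keep searching. */
    }
    return 0;
}

/* Return 1 or 0 as above, or -1 if the file cannot be read. */
static int
kernel_has_action(const char *name)
{
    char buf[256];
    FILE *fp = fopen("/proc/sys/kernel/seccomp/actions_avail", "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return list_has_action(buf, name);
}
```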

   Audit logging of seccomp actions
       Since Linux 4.14, the kernel provides the facility to log  the  actions
       returned by seccomp filters in the audit log.  The kernel makes the de‐
       cision to log an action based on the action type, whether  or  not  the
       action is present in the actions_logged file, and whether kernel audit‐
       ing is enabled (e.g., via the kernel boot option audit=1).   The  rules
       are as follows:

       *  If the action is SECCOMP_RET_ALLOW, the action is not logged.

       *  Otherwise,  if the action is either SECCOMP_RET_KILL_PROCESS or SEC‐
          COMP_RET_KILL_THREAD, and that action appears in the  actions_logged
          file, the action is logged.

       *  Otherwise,  if  the  filter  has requested logging (the SECCOMP_FIL‐
          TER_FLAG_LOG flag) and the  action  appears  in  the  actions_logged
          file, the action is logged.

       *  Otherwise,  if  kernel  auditing is enabled and the process is being
          audited (autrace(8)), the action is logged.

       *  Otherwise, the action is not logged.
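
       The rules above form a decision cascade, which the following function
       models.  It is a restatement of the list for illustration, not kernel
       code, and its name and parameters are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * ret               filter return value (only the action bits are used)
 * in_actions_logged the action is listed in actions_logged
 * filter_flag_log   the filter was installed with SECCOMP_FILTER_FLAG_LOG
 * task_audited      kernel auditing is enabled and the task is audited
 */
static bool
action_is_logged(uint32_t ret, bool in_actions_logged,
                 bool filter_flag_log, bool task_audited)
{
    uint32_t action = ret & SECCOMP_RET_ACTION_FULL;

    if (action == SECCOMP_RET_ALLOW)
        return false;
    if ((action == SECCOMP_RET_KILL_PROCESS ||
         action == SECCOMP_RET_KILL_THREAD) && in_actions_logged)
        return true;
    if (filter_flag_log && in_actions_logged)
        return true;
    if (task_audited)
        return true;
    return false;
}
```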

RETURN VALUE

       On  success,  seccomp()  returns  0.    On   error,   if   SECCOMP_FIL‐
       TER_FLAG_TSYNC  was used, the return value is the ID of the thread that
       caused the synchronization failure.  (This ID is a kernel thread ID  of
       the  type  returned by clone(2) and gettid(2).)  On other errors, -1 is
       returned, and errno is set to indicate the error.

ERRORS

       seccomp() can fail for the following reasons:

       EACCES The caller did not have the CAP_SYS_ADMIN capability in its user
              namespace,  or  had  not  set  no_new_privs  before  using  SEC‐
              COMP_SET_MODE_FILTER.

       EBUSY  While installing a new filter, the  SECCOMP_FILTER_FLAG_NEW_LIS‐
              TENER flag was specified, but a previous filter had already been
              installed with that flag.

       EFAULT args was not a valid address.

       EINVAL operation is unknown or is not supported by this kernel  version
              or configuration.

       EINVAL The specified flags are invalid for the given operation.

       EINVAL operation  included  BPF_ABS,  but  the specified offset was not
              aligned to a  32-bit  boundary  or  exceeded  sizeof(struct sec‐
              comp_data).

       EINVAL A secure computing mode has already been set, and operation dif‐
              fers from the existing setting.

       EINVAL operation specified SECCOMP_SET_MODE_FILTER, but the filter pro‐
              gram  pointed to by args was not valid or the length of the fil‐
              ter program was zero or exceeded  BPF_MAXINSNS  (4096)  instruc‐
              tions.

       ENOMEM Out of memory.

       ENOMEM The  total length of all filter programs attached to the calling
              thread would  exceed  MAX_INSNS_PER_PATH  (32768)  instructions.
              Note  that  for the purposes of calculating this limit, each al‐
              ready existing filter program incurs an overhead  penalty  of  4
              instructions.

       EOPNOTSUPP
              operation  specified  SECCOMP_GET_ACTION_AVAIL,  but  the kernel
              does not support the filter return action specified by args.

       ESRCH  Another thread caused a failure during thread sync, but  its  ID
              could not be determined.
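
       The two length limits mentioned under EINVAL and ENOMEM above can be
       modeled as follows.  This is an arithmetic sketch of the documented
       limits, not kernel code; the macro and function names are
       hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAXINSNS        4096    /* BPF_MAXINSNS */
#define MODEL_MAX_PER_PATH    32768   /* MAX_INSNS_PER_PATH */
#define MODEL_FILTER_OVERHEAD 4       /* Penalty per existing filter */

/* existing[] holds the lengths of the n filters already attached.
   Returns true if a new filter of new_len instructions would be
   accepted as far as these two limits are concerned. */
static bool
new_filter_fits(const unsigned short *existing, size_t n,
                unsigned short new_len)
{
    unsigned long total = new_len;

    /* A zero-length or oversized program is rejected (EINVAL). */
    if (new_len == 0 || new_len > MODEL_MAXINSNS)
        return false;

    /* Each existing program counts its length plus the overhead. */
    for (size_t i = 0; i < n; i++)
        total += (unsigned long) existing[i] + MODEL_FILTER_OVERHEAD;

    /* Exceeding the per-path total is rejected (ENOMEM). */
    return total <= MODEL_MAX_PER_PATH;
}
```

       For example, with seven existing filters of 4096 instructions each,
       the existing filters count for 7 * (4096 + 4) = 28700 instructions, so
       a further 4096-instruction filter (28700 + 4096 = 32796) would not
       fit.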

VERSIONS

       The seccomp() system call first appeared in Linux 3.17.

CONFORMING TO

       The seccomp() system call is a nonstandard Linux extension.

NOTES

       Rather  than hand-coding seccomp filters as shown in the example below,
       you may prefer to employ  the  libseccomp  library,  which  provides  a
       front-end for generating seccomp filters.

       The  Seccomp  field of the /proc/[pid]/status file provides a method of
       viewing the seccomp mode of a process; see proc(5).

       seccomp() provides a superset of  the  functionality  provided  by  the
       prctl(2) PR_SET_SECCOMP operation (which does not support flags).

       Since  Linux 4.4, the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation can
       be used to dump a process's seccomp filters.

   Architecture support for seccomp BPF
       Architecture support for seccomp BPF filtering is available on the fol‐
       lowing architectures:

       *  x86-64, i386, x32 (since Linux 3.5)
       *  ARM (since Linux 3.8)
       *  s390 (since Linux 3.8)
       *  MIPS (since Linux 3.16)
       *  ARM-64 (since Linux 3.19)
       *  PowerPC (since Linux 4.3)
       *  Tile (since Linux 4.3)
       *  PA-RISC (since Linux 4.6)

   Caveats
       There  are various subtleties to consider when applying seccomp filters
       to a program, including the following:

       *  Some traditional system calls have user-space implementations in the
          vdso(7)  on many architectures.  Notable examples include clock_get‐
          time(2), gettimeofday(2), and time(2).  On such architectures,  sec‐
          comp  filtering  for  these system calls will have no effect.  (How‐
          ever, there are cases where the  vdso(7)  implementations  may  fall
          back to invoking the true system call, in which case seccomp filters
          would see the system call.)

       *  Seccomp filtering is based on system call numbers.  However,  appli‐
          cations  typically  do not directly invoke system calls, but instead
          call wrapper functions in the C library which  in  turn  invoke  the
          system calls.  Consequently, one must be aware of the following:

          •  The glibc wrappers for some traditional system calls may actually
             employ system calls with different names in the kernel.  For  ex‐
             ample,   the   exit(2)  wrapper  function  actually  employs  the
             exit_group(2) system call, and the fork(2) wrapper function actu‐
             ally calls clone(2).

          •  The  behavior of wrapper functions may vary across architectures,
             according to the range of system calls provided on  those  archi‐
             tectures.   In  other words, the same wrapper function may invoke
             different system calls on different architectures.

          •  Finally, the behavior of  wrapper  functions  can  change  across
             glibc  versions.  For example, in older versions, the glibc wrap‐
             per function for open(2) invoked the  system  call  of  the  same
             name,  but starting in glibc 2.26, the implementation switched to
             calling openat(2) on all architectures.

       The consequence of the above points is that it may be necessary to fil‐
       ter  for  a  system  call other than might be expected.  Various manual
       pages in Section 2 provide helpful details about  the  differences  be‐
       tween  wrapper functions and the underlying system calls in subsections
       entitled C library/kernel differences.

       Furthermore, note that the application of seccomp  filters  even  risks
       causing bugs in an application, when the filters cause unexpected fail‐
       ures for legitimate operations that the application might need to  per‐
       form.   Such bugs may not easily be discovered when testing the seccomp
       filters if the bugs occur in rarely used application code paths.
574
575   Seccomp-specific BPF details
576       Note the following BPF details specific to seccomp filters:
577
578       *  The BPF_H and BPF_B size modifiers are not supported: all operations
579          must load and store (4-byte) words (BPF_W).
580
581       *  To  access  the contents of the seccomp_data buffer, use the BPF_ABS
582          addressing mode modifier.
583
584       *  The BPF_LEN addressing mode modifier yields an immediate mode  oper‐
585          and whose value is the size of the seccomp_data buffer.
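
       Because only 4-byte loads are supported, a 64-bit system call
       argument has to be examined as two separate words.  The following is
       a sketch of how the load instructions might be built; the helper
       names (arg_lo_offset(), arg_hi_offset(), load_arg_lo(), and
       load_data_len()) are hypothetical, and the code assumes a compiler
       that defines __BYTE_ORDER__, as GCC and Clang do.

```c
#include <stddef.h>
#include <linux/filter.h>    /* struct sock_filter, BPF_* macros */
#include <linux/seccomp.h>   /* struct seccomp_data */

/* Offset of the least significant 32 bits of system call argument 'n'
   (0..5) within struct seccomp_data.  Which half comes first depends
   on the byte order of the build target. */
static unsigned int
arg_lo_offset(unsigned int n)
{
    unsigned int base = offsetof(struct seccomp_data, args) + 8 * n;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return base;
#else
    return base + 4;
#endif
}

/* Offset of the most significant 32 bits of argument 'n'. */
static unsigned int
arg_hi_offset(unsigned int n)
{
    unsigned int base = offsetof(struct seccomp_data, args) + 8 * n;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return base + 4;
#else
    return base;
#endif
}

/* BPF_W load with BPF_ABS addressing: the accumulator receives the
   low word of argument 'n'. */
static struct sock_filter
load_arg_lo(unsigned int n)
{
    struct sock_filter insn =
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, arg_lo_offset(n));
    return insn;
}

/* BPF_LEN load: the accumulator receives the size of the
   seccomp_data buffer. */
static struct sock_filter
load_data_len(void)
{
    struct sock_filter insn = BPF_STMT(BPF_LD | BPF_W | BPF_LEN, 0);
    return insn;
}
```

       A filter that compares a full 64-bit argument would load and test
       the high word (via arg_hi_offset()) in a second step.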
586

EXAMPLES

588       The  program below accepts four or more arguments.  The first three ar‐
589       guments are a system call number, a  numeric  architecture  identifier,
590       and  an error number.  The program uses these values to construct a BPF
591       filter that is used at run time to perform the following checks:
592
       [1] If the program is not running on the specified architecture, the
           BPF filter kills the process (action SECCOMP_RET_KILL_PROCESS).
595
596       [2] If  the program attempts to execute the system call with the speci‐
597           fied number, the BPF filter causes the system call  to  fail,  with
598           errno being set to the specified error number.
599
600       The  remaining  command-line  arguments  specify the pathname and addi‐
601       tional arguments of a program that the example program  should  attempt
602       to  execute  using  execv(3)  (a  library function that employs the ex‐
603       ecve(2) system call).  Some example runs of the program are  shown  be‐
604       low.
605
606       First,  we display the architecture that we are running on (x86-64) and
607       then construct a shell function that looks up system  call  numbers  on
608       this architecture:
609
610           $ uname -m
611           x86_64
612           $ syscall_nr() {
613               cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \
614               awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
615           }
616
617       When  the  BPF filter rejects a system call (case [2] above), it causes
618       the system call to fail with the error number specified on the  command
619       line.  In the experiments shown here, we'll use error number 99:
620
621           $ errno 99
622           EADDRNOTAVAIL 99 Cannot assign requested address
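
       For illustration, the error number ends up in the filter's return
       value roughly as sketched below, mirroring the expression used in
       the program source.  The helper name errno_action() is hypothetical;
       the SECCOMP_RET_* constants come from <linux/seccomp.h>.

```c
#include <linux/seccomp.h>   /* SECCOMP_RET_* constants */

/* Hypothetical helper: build the filter return value that makes a
   system call fail with error number 'err'.  Only the bits covered by
   SECCOMP_RET_DATA carry the error number; higher bits are discarded. */
static unsigned int
errno_action(int err)
{
    return SECCOMP_RET_ERRNO | ((unsigned int) err & SECCOMP_RET_DATA);
}
```

       With this helper, errno_action(99) yields the value that the example
       filter returns for the rejected system call.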
623
624       In  the following example, we attempt to run the command whoami(1), but
625       the BPF filter rejects the execve(2) system call, so that  the  command
626       is not even executed:
627
628           $ syscall_nr execve
629           59
630           $ ./a.out
631           Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
632           Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
633                            AUDIT_ARCH_X86_64: 0xC000003E
634           $ ./a.out 59 0xC000003E 99 /bin/whoami
635           execv: Cannot assign requested address
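
       The <arch> values in the usage hint above are not arbitrary.  As a
       sketch, assuming the definitions in <linux/audit.h> and
       <linux/elf-em.h>, each AUDIT_ARCH_* constant combines an ELF machine
       number with flag bits for word size and byte order.  The helper name
       audit_arch() is hypothetical.

```c
#include <linux/audit.h>     /* AUDIT_ARCH_*, __AUDIT_ARCH_* */
#include <linux/elf-em.h>    /* EM_386, EM_X86_64 */

/* Hypothetical helper: compose an AUDIT_ARCH_* style value from an
   ELF machine number and two flags. */
static unsigned int
audit_arch(unsigned int machine, int is_64bit, int is_little_endian)
{
    unsigned int arch = machine;

    if (is_64bit)
        arch |= __AUDIT_ARCH_64BIT;
    if (is_little_endian)
        arch |= __AUDIT_ARCH_LE;
    return arch;
}
```

       For example, audit_arch(EM_X86_64, 1, 1) gives the same value as
       AUDIT_ARCH_X86_64, and audit_arch(EM_386, 0, 1) gives the same value
       as AUDIT_ARCH_I386.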
636
637       In  the  next example, the BPF filter rejects the write(2) system call,
638       so that, although it is successfully started, the whoami(1) command  is
639       not able to write output:
640
641           $ syscall_nr write
642           1
643           $ ./a.out 1 0xC000003E 99 /bin/whoami
644
645       In  the final example, the BPF filter rejects a system call that is not
646       used by the whoami(1) command, so it is able  to  successfully  execute
647       and produce output:
648
649           $ syscall_nr preadv
650           295
651           $ ./a.out 295 0xC000003E 99 /bin/whoami
652           cecilia
653
654   Program source
       #include <errno.h>
       #include <stddef.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <linux/audit.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <sys/prctl.h>
       #include <sys/syscall.h>

       #define X32_SYSCALL_BIT 0x40000000
       #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

       static int
       install_filter(int syscall_nr, int t_arch, int f_errno)
       {
           unsigned int upper_nr_limit = 0xffffffff;

           /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
              (in the x32 ABI, all system calls have bit 30 set in the
              'nr' field, meaning the numbers are >= X32_SYSCALL_BIT). */
           if (t_arch == AUDIT_ARCH_X86_64)
               upper_nr_limit = X32_SYSCALL_BIT - 1;

           struct sock_filter filter[] = {
               /* [0] Load architecture from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, arch))),

               /* [1] Jump forward 5 instructions if architecture does not
                      match 't_arch'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

               /* [2] Load system call number from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, nr))),

               /* [3] Check ABI - only needed for x86-64 in deny-list use
                      cases.  Use BPF_JGT instead of checking against the bit
                      mask to avoid having to reload the syscall number. */
               BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

               /* [4] Jump forward 1 instruction if system call number
                      does not match 'syscall_nr'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

               /* [5] Matching architecture and system call: don't execute
                      the system call, and return 'f_errno' in 'errno'. */
               BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

               /* [6] Destination of system call number mismatch: allow other
                      system calls. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

               /* [7] Destination of architecture mismatch: kill process. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
           };

           struct sock_fprog prog = {
               .len = ARRAY_SIZE(filter),
               .filter = filter,
           };

           /* glibc provides no wrapper for seccomp(), so invoke the
              system call via syscall(2). */
           if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
               perror("seccomp");
               return 1;
           }

           return 0;
       }

       int
       main(int argc, char **argv)
       {
           if (argc < 5) {
               fprintf(stderr, "Usage: "
                       "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                       "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                       "                 AUDIT_ARCH_X86_64: 0x%X\n"
                       "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
               exit(EXIT_FAILURE);
           }

           /* Required so that an unprivileged process may install a
              filter. */
           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
               perror("prctl");
               exit(EXIT_FAILURE);
           }

           if (install_filter(strtol(argv[1], NULL, 0),
                              strtol(argv[2], NULL, 0),
                              strtol(argv[3], NULL, 0)))
               exit(EXIT_FAILURE);

           /* execv() returns only on failure. */
           execv(argv[4], &argv[4]);
           perror("execv");
           exit(EXIT_FAILURE);
       }
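
       The ABI check in instruction [3] of the filter can be restated in
       plain C as a sketch.  The helper name is_x32_syscall() is
       hypothetical; it applies the same unsigned "greater than
       X32_SYSCALL_BIT - 1" comparison that the filter performs when the
       target architecture is AUDIT_ARCH_X86_64.

```c
#define X32_SYSCALL_BIT 0x40000000

/* Hypothetical helper: return nonzero if 'nr' lies above the range of
   normal x86-64 system call numbers, which is the case for all x32
   system calls because they have bit 30 set. */
static int
is_x32_syscall(unsigned int nr)
{
    unsigned int upper_nr_limit = X32_SYSCALL_BIT - 1;

    return nr > upper_nr_limit;
}
```

       Under this check, a number such as 59 passes through to the system
       call comparison, whereas X32_SYSCALL_BIT + 59 takes the same path as
       an architecture mismatch.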
755

SEE ALSO

757       bpfc(1),  strace(1),  bpf(2),  prctl(2), ptrace(2), seccomp_unotify(2),
758       sigaction(2), proc(5), signal(7), socket(7)
759
760       Various pages from  the  libseccomp  library,  including:  scmp_sys_re‐
761       solver(1), seccomp_export_bpf(3), seccomp_init(3), seccomp_load(3), and
762       seccomp_rule_add(3).
763
764       The kernel source files Documentation/networking/filter.txt  and  Docu‐
765       mentation/userspace-api/seccomp_filter.rst (or Documentation/prctl/sec‐
766       comp_filter.txt before Linux 4.13).
767
       McCanne, S. and Jacobson, V. (1992) The BSD Packet Filter: A New
       Architecture for User-level Packet Capture, Proceedings of the USENIX
       Winter 1993 Conference
       ⟨http://www.tcpdump.org/papers/bpf-usenix93.pdf⟩
771

COLOPHON

773       This page is part of release 5.12 of the Linux  man-pages  project.   A
774       description  of  the project, information about reporting bugs, and the
775       latest    version    of    this    page,    can     be     found     at
776       https://www.kernel.org/doc/man-pages/.
777
778
779
780Linux                             2021-03-22                        SECCOMP(2)