seccomp(2)                    System Calls Manual                   seccomp(2)

NAME

       seccomp - operate on Secure Computing state of the process

LIBRARY

       Standard C library (libc, -lc)

SYNOPSIS

       #include <linux/seccomp.h>  /* Definition of SECCOMP_* constants */
       #include <linux/filter.h>   /* Definition of struct sock_fprog */
       #include <linux/audit.h>    /* Definition of AUDIT_* constants */
       #include <linux/signal.h>   /* Definition of SIG* constants */
       #include <sys/ptrace.h>     /* Definition of PTRACE_* constants */
       #include <sys/syscall.h>    /* Definition of SYS_* constants */
       #include <unistd.h>

       int syscall(SYS_seccomp, unsigned int operation, unsigned int flags,
                   void *args);

       Note: glibc provides no wrapper for seccomp(), necessitating the use of
       syscall(2).

DESCRIPTION

       The seccomp() system call operates on the Secure Computing (seccomp)
       state of the calling process.

       Currently, Linux supports the following operation values:

       SECCOMP_SET_MODE_STRICT
              The only system calls that the calling thread is permitted to
              make are read(2), write(2), _exit(2) (but not exit_group(2)),
              and sigreturn(2).  Other system calls result in the termination
              of the calling thread, or termination of the entire process with
              the SIGKILL signal when there is only one thread.  Strict secure
              computing mode is useful for number-crunching applications that
              may need to execute untrusted byte code, perhaps obtained by
              reading from a pipe or socket.

              Note that although the calling thread can no longer call
              sigprocmask(2), it can use sigreturn(2) to block all signals
              apart from SIGKILL and SIGSTOP.  This means that alarm(2) (for
              example) is not sufficient for restricting the process's
              execution time.  Instead, to reliably terminate the process,
              SIGKILL must be used.  This can be done by using timer_create(2)
              with SIGEV_SIGNAL and sigev_signo set to SIGKILL, or by using
              setrlimit(2) to set the hard limit for RLIMIT_CPU.

              This operation is available only if the kernel is configured
              with CONFIG_SECCOMP enabled.

              The value of flags must be 0, and args must be NULL.

              This operation is functionally identical to the call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

       SECCOMP_SET_MODE_FILTER
              The system calls allowed are defined by a pointer to a Berkeley
              Packet Filter (BPF) passed via args.  This argument is a pointer
              to a struct sock_fprog; it can be designed to filter arbitrary
              system calls and system call arguments.  If the filter is
              invalid, seccomp() fails, returning EINVAL in errno.

              If fork(2) or clone(2) is allowed by the filter, any child
              processes will be constrained to the same system call filters as
              the parent.  If execve(2) is allowed, the existing filters will
              be preserved across a call to execve(2).

              In order to use the SECCOMP_SET_MODE_FILTER operation, either
              the calling thread must have the CAP_SYS_ADMIN capability in its
              user namespace, or the thread must already have the no_new_privs
              bit set.  If that bit was not already set by an ancestor of this
              thread, the thread must make the following call:

                  prctl(PR_SET_NO_NEW_PRIVS, 1);

              Otherwise, the SECCOMP_SET_MODE_FILTER operation fails and
              returns EACCES in errno.  This requirement ensures that an
              unprivileged process cannot apply a malicious filter and then
              invoke a set-user-ID or other privileged program using
              execve(2), thus potentially compromising that program.  (Such a
              malicious filter might, for example, cause an attempt to use
              setuid(2) to set the caller's user IDs to nonzero values to
              instead return 0 without actually making the system call.  Thus,
              the program might be tricked into retaining superuser privileges
              in circumstances where it is possible to influence it to do
              dangerous things because it did not actually drop privileges.)

              If prctl(2) or seccomp() is allowed by the attached filter,
              further filters may be added.  This will increase evaluation
              time, but allows for further reduction of the attack surface
              during execution of a thread.

              The SECCOMP_SET_MODE_FILTER operation is available only if the
              kernel is configured with CONFIG_SECCOMP_FILTER enabled.

              When flags is 0, this operation is functionally identical to the
              call:

                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);

              The recognized flags are:

              SECCOMP_FILTER_FLAG_LOG (since Linux 4.14)
                     All filter return actions except SECCOMP_RET_ALLOW should
                     be logged.  An administrator may override this filter
                     flag by preventing specific actions from being logged via
                     the /proc/sys/kernel/seccomp/actions_logged file.

              SECCOMP_FILTER_FLAG_NEW_LISTENER (since Linux 5.0)
                     After successfully installing the filter program, return
                     a new user-space notification file descriptor.  (The
                     close-on-exec flag is set for the file descriptor.)  When
                     the filter returns SECCOMP_RET_USER_NOTIF, a notification
                     will be sent to this file descriptor.

                     At most one seccomp filter using the
                     SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be installed
                     for a thread.

                     See seccomp_unotify(2) for further details.

              SECCOMP_FILTER_FLAG_SPEC_ALLOW (since Linux 4.17)
                     Disable Speculative Store Bypass mitigation.

              SECCOMP_FILTER_FLAG_TSYNC
                     When adding a new filter, synchronize all other threads
                     of the calling process to the same seccomp filter tree.
                     A "filter tree" is the ordered list of filters attached
                     to a thread.  (Attaching identical filters in separate
                     seccomp() calls results in different filters from this
                     perspective.)

                     If any thread cannot synchronize to the same filter tree,
                     the call will not attach the new seccomp filter, and will
                     fail, returning the first thread ID found that cannot
                     synchronize.  Synchronization will fail if another thread
                     in the same process is in SECCOMP_MODE_STRICT or if it
                     has attached new seccomp filters to itself, diverging
                     from the calling thread's filter tree.

       SECCOMP_GET_ACTION_AVAIL (since Linux 4.14)
              Test to see if an action is supported by the kernel.  This
              operation is helpful to confirm that the kernel knows of a more
              recently added filter return action since the kernel treats all
              unknown actions as SECCOMP_RET_KILL_PROCESS.

              The value of flags must be 0, and args must be a pointer to an
              unsigned 32-bit filter return action.

       SECCOMP_GET_NOTIF_SIZES (since Linux 5.0)
              Get the sizes of the seccomp user-space notification structures.
              Since these structures may evolve and grow over time, this
              command can be used to determine how much memory to allocate for
              sending and receiving notifications.

              The value of flags must be 0, and args must be a pointer to a
              struct seccomp_notif_sizes, which has the following form:

              struct seccomp_notif_sizes {
                  __u16 seccomp_notif;      /* Size of notification structure */
                  __u16 seccomp_notif_resp; /* Size of response structure */
                  __u16 seccomp_data;       /* Size of 'struct seccomp_data' */
              };

              See seccomp_unotify(2) for further details.
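
       The two query operations, SECCOMP_GET_ACTION_AVAIL and
       SECCOMP_GET_NOTIF_SIZES, can be used to probe the running kernel before
       a filter is built.  A minimal sketch follows; the helper names
       action_is_available and get_notif_sizes are hypothetical, and recent
       kernel headers are assumed:

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if the kernel supports the given filter return action,
   0 if it reports EOPNOTSUPP, and -1 on any other error (for example,
   EINVAL on kernels that predate SECCOMP_GET_ACTION_AVAIL). */
static int
action_is_available(uint32_t action)
{
    if (syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return (errno == EOPNOTSUPP) ? 0 : -1;
}

/* Fills *sizes with the notification structure sizes.  Returns 0 on
   success, or -1 with errno set. */
static int
get_notif_sizes(struct seccomp_notif_sizes *sizes)
{
    return (int) syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, sizes);
}
```

       A program might call action_is_available(SECCOMP_RET_USER_NOTIF) and
       fall back to a different action when the result is not 1.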

   Filters
       When adding filters via SECCOMP_SET_MODE_FILTER, args points to a
       filter program:

           struct sock_fprog {
               unsigned short      len;    /* Number of BPF instructions */
               struct sock_filter *filter; /* Pointer to array of
                                              BPF instructions */
           };

       Each program must contain one or more BPF instructions:

           struct sock_filter {            /* Filter block */
               __u16 code;                 /* Actual filter code */
               __u8  jt;                   /* Jump true */
               __u8  jf;                   /* Jump false */
               __u32 k;                    /* Generic multiuse field */
           };

       When executing the instructions, the BPF program operates on the system
       call information made available (i.e., use the BPF_ABS addressing mode)
       as a (read-only) buffer of the following form:

           struct seccomp_data {
               int   nr;                   /* System call number */
               __u32 arch;                 /* AUDIT_ARCH_* value
                                              (see <linux/audit.h>) */
               __u64 instruction_pointer;  /* CPU instruction pointer */
               __u64 args[6];              /* Up to 6 system call arguments */
           };

       Because numbering of system calls varies between architectures and some
       architectures (e.g., x86-64) allow user-space code to use the calling
       conventions of multiple architectures (and the convention being used
       may vary over the life of a process that uses execve(2) to execute
       binaries that employ the different conventions), it is usually
       necessary to verify the value of the arch field.
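
       A check of the arch field is typically the first thing a filter does.
       The sketch below builds such a prelude with the BPF_STMT and BPF_JUMP
       macros from <linux/filter.h> and shows one way to install it.  The
       function names build_arch_check and install_filter are hypothetical,
       and a real policy would place its system call checks before the final
       SECCOMP_RET_ALLOW:

```c
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Writes four instructions into prog[]: kill the process unless
   seccomp_data.arch equals expected_arch, otherwise allow.
   Returns the number of instructions written. */
unsigned short
build_arch_check(struct sock_filter prog[4], __u32 expected_arch)
{
    const struct sock_filter insns[] = {
        /* A = seccomp_data.arch */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        /* If A == expected_arch, skip the kill instruction */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, expected_arch, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Placeholder for the rest of the policy */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(prog, insns, sizeof(insns));
    return sizeof(insns) / sizeof(insns[0]);
}

/* Sets no_new_privs and attaches the program to the calling thread.
   Returns 0 on success, or -1 with errno set. */
int
install_filter(struct sock_filter *prog, unsigned short len)
{
    struct sock_fprog fprog = { .len = len, .filter = prog };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog);
}
```

       On x86-64, for instance, the caller would pass AUDIT_ARCH_X86_64 as
       expected_arch.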

       It is strongly recommended to use an allow-list approach whenever
       possible because such an approach is more robust and simple.  A
       deny-list will have to be updated whenever a potentially dangerous
       system call is added (or a dangerous flag or option if those are
       deny-listed), and it is often possible to alter the representation of a
       value without altering its meaning, leading to a deny-list bypass.  See
       also Caveats below.

       The arch field is not unique for all calling conventions.  The x86-64
       ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on
       the same processors.  Instead, the mask __X32_SYSCALL_BIT is used on
       the system call number to tell the two ABIs apart.

       This means that a policy must either deny all syscalls with
       __X32_SYSCALL_BIT or it must recognize syscalls with and without
       __X32_SYSCALL_BIT set.  A list of system calls to be denied based on nr
       that does not also contain nr values with __X32_SYSCALL_BIT set can be
       bypassed by a malicious program that sets __X32_SYSCALL_BIT.
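
       One way to follow the first option (deny every x32 system call) is a
       short prelude that runs after arch has been verified as
       AUDIT_ARCH_X86_64.  This is only a sketch: the names X32_SYSCALL_BIT,
       deny_x32_prelude, and is_x32_syscall are hypothetical, and the constant
       is defined locally on the assumption that it matches
       __X32_SYSCALL_BIT:

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>

/* Assumed to equal __X32_SYSCALL_BIT from <asm/unistd.h> on x86;
   defined locally so the sketch compiles on other architectures. */
#define X32_SYSCALL_BIT 0x40000000U

/* Prelude for an x86-64 policy: kill on any x32 system call. */
struct sock_filter deny_x32_prelude[] = {
    /* A = seccomp_data.nr */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* Unsigned A >= X32_SYSCALL_BIT: fall through to the kill;
       otherwise skip one instruction. */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* Placeholder: the native-ABI allow-list checks would go here. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Plain-C counterpart that tests the bit directly. */
int
is_x32_syscall(int nr)
{
    return ((unsigned int) nr & X32_SYSCALL_BIT) != 0;
}
```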

       Additionally, kernels prior to Linux 5.4 incorrectly permitted nr in
       the ranges 512-547 as well as the corresponding non-x32 syscalls ORed
       with __X32_SYSCALL_BIT.  For example, nr == 521 and nr == (101 |
       __X32_SYSCALL_BIT) would result in invocations of ptrace(2) with
       potentially confused x32-vs-x86_64 semantics in the kernel.  Policies
       intended to work on kernels before Linux 5.4 must ensure that they deny
       or otherwise correctly handle these system calls.  On Linux 5.4 and
       newer, such system calls will fail with the error ENOSYS, without doing
       anything.

       The instruction_pointer field provides the address of the
       machine-language instruction that performed the system call.  This
       might be useful in conjunction with the use of /proc/pid/maps to
       perform checks based on which region (mapping) of the program made the
       system call.  (Probably, it is wise to lock down the mmap(2) and
       mprotect(2) system calls to prevent the program from subverting such
       checks.)

       When checking values from args, keep in mind that arguments are often
       silently truncated before being processed, but after the seccomp check.
       For example, this happens if the i386 ABI is used on an x86-64 kernel:
       although the kernel will normally not look beyond the 32 lowest bits of
       the arguments, the values of the full 64-bit registers will be present
       in the seccomp data.  A less surprising example is that if the x86-64
       ABI is used to perform a system call that takes an argument of type
       int, the more-significant half of the argument register is ignored by
       the system call, but visible in the seccomp data.
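
       Because only 32-bit loads are available, each 64-bit entry of args is
       read as two halves, and a check on an int argument would normally
       compare the low half only.  The sketch below illustrates this; the
       macro names ARG_LO and ARG_HI, the array name match_int_arg1, and the
       value 42 are all hypothetical choices for illustration:

```c
#include <endian.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>

/* Byte offsets of the low and high 32-bit halves of args[idx] inside
   struct seccomp_data; which half comes first depends on endianness. */
#if __BYTE_ORDER == __LITTLE_ENDIAN
#define ARG_LO(idx) (offsetof(struct seccomp_data, args) + 8 * (idx))
#define ARG_HI(idx) (ARG_LO(idx) + 4)
#else
#define ARG_HI(idx) (offsetof(struct seccomp_data, args) + 8 * (idx))
#define ARG_LO(idx) (ARG_HI(idx) + 4)
#endif

/* Fragment: fail with EPERM if the low 32 bits of args[1] equal 42,
   otherwise allow.  The high half is deliberately not examined,
   because a system call taking an int ignores it; comparing all 64
   bits would let a caller dodge the match by setting high bits. */
struct sock_filter match_int_arg1[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(1)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 42, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```

       For arguments that are genuinely 64 bits wide (pointers, for example),
       a filter would load and compare ARG_HI as well.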

       A seccomp filter returns a 32-bit value consisting of two parts: the
       most significant 16 bits (corresponding to the mask defined by the
       constant SECCOMP_RET_ACTION_FULL) contain one of the "action" values
       listed below; the least significant 16 bits (defined by the constant
       SECCOMP_RET_DATA) are "data" to be associated with this return value.

       If multiple filters exist, they are all executed, in reverse order of
       their addition to the filter tree; that is, the most recently installed
       filter is executed first.  (Note that all filters will be called even
       if one of the earlier filters returns SECCOMP_RET_KILL.  This is done
       to simplify the kernel code and to provide a tiny speed-up in the
       execution of sets of filters by avoiding a check for this uncommon
       case.)  The return value for the evaluation of a given system call is
       the first-seen action value of highest precedence (along with its
       accompanying data) returned by execution of all of the filters.
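
       The selection rule can be modeled in user space as follows.  This is a
       sketch of the documented behavior, not kernel code: the function name
       combine_results is hypothetical, and treating action values as signed
       32-bit integers (so that the numerically lowest action has the highest
       precedence) is an assumption of the model:

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Fallback for older kernel headers that lack this mask. */
#ifndef SECCOMP_RET_ACTION_FULL
#define SECCOMP_RET_ACTION_FULL 0xffff0000U
#endif

/* Combines the return values of all filters for one system call.
   results[0] belongs to the most recently installed filter, which is
   executed first.  The strict "<" keeps the first-seen value when two
   filters return the same action. */
static uint32_t
combine_results(const uint32_t *results, size_t n)
{
    uint32_t best = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        int32_t cur = (int32_t) (results[i] & SECCOMP_RET_ACTION_FULL);
        int32_t top = (int32_t) (best & SECCOMP_RET_ACTION_FULL);

        if (cur < top)
            best = results[i];
    }
    return best;
}
```

       With results of SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO | 1, and
       SECCOMP_RET_ERRNO | 2 (in execution order), the model returns
       SECCOMP_RET_ERRNO | 1.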

       In decreasing order of precedence, the action values that may be
       returned by a seccomp filter are:

       SECCOMP_RET_KILL_PROCESS (since Linux 4.14)
              This value results in immediate termination of the process, with
              a core dump.  The system call is not executed.  By contrast with
              SECCOMP_RET_KILL_THREAD below, all threads in the thread group
              are terminated.  (For a discussion of thread groups, see the
              description of the CLONE_THREAD flag in clone(2).)

              The process terminates as though killed by a SIGSYS signal.
              Even if a signal handler has been registered for SIGSYS, the
              handler will be ignored in this case and the process always
              terminates.  To a parent process that is waiting on this process
              (using waitpid(2) or similar), the returned wstatus will
              indicate that its child was terminated as though by a SIGSYS
              signal.

       SECCOMP_RET_KILL_THREAD (or SECCOMP_RET_KILL)
              This value results in immediate termination of the thread that
              made the system call.  The system call is not executed.  Other
              threads in the same thread group will continue to execute.

              The thread terminates as though killed by a SIGSYS signal.  See
              SECCOMP_RET_KILL_PROCESS above.

              Before Linux 4.11, any process terminated in this way would not
              trigger a coredump (even though SIGSYS is documented in
              signal(7) as having a default action of termination with a core
              dump).  Since Linux 4.11, a single-threaded process will dump
              core if terminated in this way.

              With the addition of SECCOMP_RET_KILL_PROCESS in Linux 4.14,
              SECCOMP_RET_KILL_THREAD was added as a synonym for
              SECCOMP_RET_KILL, in order to more clearly distinguish the two
              actions.

              Note: the use of SECCOMP_RET_KILL_THREAD to kill a single thread
              in a multithreaded process is likely to leave the process in a
              permanently inconsistent and possibly corrupt state.

       SECCOMP_RET_TRAP
              This value results in the kernel sending a thread-directed
              SIGSYS signal to the triggering thread.  (The system call is not
              executed.)  Various fields will be set in the siginfo_t
              structure (see sigaction(2)) associated with signal:

              •  si_signo will contain SIGSYS.

              •  si_call_addr will show the address of the system call
                 instruction.

              •  si_syscall and si_arch will indicate which system call was
                 attempted.

              •  si_code will contain SYS_SECCOMP.

              •  si_errno will contain the SECCOMP_RET_DATA portion of the
                 filter return value.

              The program counter will be as though the system call happened
              (i.e., the program counter will not point to the system call
              instruction).  The return value register will contain an
              architecture-dependent value; if resuming execution, set it to
              something appropriate for the system call.  (The architecture
              dependency is because replacing it with ENOSYS could overwrite
              some useful information.)

       SECCOMP_RET_ERRNO
              This value results in the SECCOMP_RET_DATA portion of the
              filter's return value being passed to user space as the errno
              value without executing the system call.

       SECCOMP_RET_USER_NOTIF (since Linux 5.0)
              Forward the system call to an attached user-space supervisor
              process to allow that process to decide what to do with the
              system call.  If there is no attached supervisor (either because
              the filter was not installed with the
              SECCOMP_FILTER_FLAG_NEW_LISTENER flag or because the file
              descriptor was closed), the filter returns ENOSYS (similar to
              what happens when a filter returns SECCOMP_RET_TRACE and there
              is no tracer).  See seccomp_unotify(2) for further details.

              Note that the supervisor process will not be notified if another
              filter returns an action value with a precedence greater than
              SECCOMP_RET_USER_NOTIF.

       SECCOMP_RET_TRACE
              When returned, this value will cause the kernel to attempt to
              notify a ptrace(2)-based tracer prior to executing the system
              call.  If there is no tracer present, the system call is not
              executed and returns a failure status with errno set to ENOSYS.

              A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
              using ptrace(PTRACE_SETOPTIONS).  The tracer will be notified of
              a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of the
              filter's return value will be available to the tracer via
              PTRACE_GETEVENTMSG.

              The tracer can skip the system call by changing the system call
              number to -1.  Alternatively, the tracer can change the system
              call requested by changing the system call to a valid system
              call number.  If the tracer asks to skip the system call, then
              the system call will appear to return the value that the tracer
              puts in the return value register.

              Before Linux 4.8, the seccomp check will not be run again after
              the tracer is notified.  (This means that, on older kernels,
              seccomp-based sandboxes must not allow use of ptrace(2), even of
              other sandboxed processes, without extreme care; ptracers can
              use this mechanism to escape from the seccomp sandbox.)

              Note that a tracer process will not be notified if another
              filter returns an action value with a precedence greater than
              SECCOMP_RET_TRACE.

       SECCOMP_RET_LOG (since Linux 4.14)
              This value results in the system call being executed after the
              filter return action is logged.  An administrator may override
              the logging of this action via the
              /proc/sys/kernel/seccomp/actions_logged file.

       SECCOMP_RET_ALLOW
              This value results in the system call being executed.

       If an action value other than one of the above is specified, then the
       filter action is treated as either SECCOMP_RET_KILL_PROCESS (since
       Linux 4.14) or SECCOMP_RET_KILL_THREAD (in Linux 4.13 and earlier).

   /proc interfaces
       The files in the directory /proc/sys/kernel/seccomp provide additional
       seccomp information and configuration:

       actions_avail (since Linux 4.14)
              A read-only ordered list of seccomp filter return actions in
              string form.  The ordering, from left-to-right, is in decreasing
              order of precedence.  The list represents the set of seccomp
              filter return actions supported by the kernel.

       actions_logged (since Linux 4.14)
              A read-write ordered list of seccomp filter return actions that
              are allowed to be logged.  Writes to the file do not need to be
              in ordered form but reads from the file will be ordered in the
              same way as the actions_avail file.

              It is important to note that the value of actions_logged does
              not prevent certain filter return actions from being logged when
              the audit subsystem is configured to audit a task.  If the
              action is not found in the actions_logged file, the final
              decision on whether to audit the action for that task is
              ultimately left up to the audit subsystem to decide for all
              filter return actions other than SECCOMP_RET_ALLOW.

              The "allow" string is not accepted in the actions_logged file as
              it is not possible to log SECCOMP_RET_ALLOW actions.  Attempting
              to write "allow" to the file will fail with the error EINVAL.
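
       Reading these files from a program needs nothing beyond ordinary file
       I/O.  A minimal sketch for actions_avail follows; the function name
       read_actions_avail is hypothetical:

```c
#include <stdio.h>
#include <string.h>

/* Reads the action list into buf as a NUL-terminated string with the
   trailing newline removed.  Returns 0 on success, or -1 if the file
   cannot be opened or read (for example, on kernels that do not
   provide it). */
static int
read_actions_avail(char *buf, size_t size)
{
    FILE *fp = fopen("/proc/sys/kernel/seccomp/actions_avail", "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, (int) size, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

       The resulting string can then be searched with strstr(3) for an action
       name before a filter that depends on it is built.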

   Audit logging of seccomp actions
       Since Linux 4.14, the kernel provides the facility to log the actions
       returned by seccomp filters in the audit log.  The kernel makes the
       decision to log an action based on the action type, whether or not the
       action is present in the actions_logged file, and whether kernel
       auditing is enabled (e.g., via the kernel boot option audit=1).  The
       rules are as follows:

       •  If the action is SECCOMP_RET_ALLOW, the action is not logged.

       •  Otherwise, if the action is either SECCOMP_RET_KILL_PROCESS or
          SECCOMP_RET_KILL_THREAD, and that action appears in the
          actions_logged file, the action is logged.

       •  Otherwise, if the filter has requested logging (the
          SECCOMP_FILTER_FLAG_LOG flag) and the action appears in the
          actions_logged file, the action is logged.

       •  Otherwise, if kernel auditing is enabled and the process is being
          audited (autrace(8)), the action is logged.

       •  Otherwise, the action is not logged.
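
       The rules above form a simple decision chain.  The following sketch
       models them as listed, and is not taken from the kernel source; the
       function name action_is_logged and its parameters are hypothetical:

```c
#include <linux/seccomp.h>
#include <stdbool.h>
#include <stdint.h>

/* in_actions_logged: the action appears in the actions_logged file.
   filter_flag_log:   the filter used SECCOMP_FILTER_FLAG_LOG.
   task_audited:      kernel auditing is enabled and the process is
                      being audited. */
static bool
action_is_logged(uint32_t action, bool in_actions_logged,
                 bool filter_flag_log, bool task_audited)
{
    if (action == SECCOMP_RET_ALLOW)
        return false;
    if ((action == SECCOMP_RET_KILL_PROCESS ||
         action == SECCOMP_RET_KILL_THREAD) && in_actions_logged)
        return true;
    if (filter_flag_log && in_actions_logged)
        return true;
    return task_audited;
}
```

       For example, SECCOMP_RET_ERRNO with the action present in
       actions_logged, but without the filter flag and without auditing, is
       not logged under this model.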

RETURN VALUE

       On success, seccomp() returns 0.  On error, if
       SECCOMP_FILTER_FLAG_TSYNC was used, the return value is the ID of the
       thread that caused the synchronization failure.  (This ID is a kernel
       thread ID of the type returned by clone(2) and gettid(2).)  On other
       errors, -1 is returned, and errno is set to indicate the error.
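
       A caller that uses SECCOMP_FILTER_FLAG_TSYNC therefore has three
       outcomes to distinguish.  A minimal sketch, in which the function name
       install_tsync is hypothetical:

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Attempts to install prog on all threads of the calling process.
   Returns 0 on success and -1 on failure, after reporting whether the
   failure was a thread that could not synchronize (positive return
   value) or an ordinary error (-1 with errno set). */
int
install_tsync(struct sock_fprog *prog)
{
    long ret = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                       SECCOMP_FILTER_FLAG_TSYNC, prog);

    if (ret == 0)
        return 0;
    if (ret > 0)
        fprintf(stderr, "thread %ld could not be synchronized\n", ret);
    else
        fprintf(stderr, "seccomp: %s\n", strerror(errno));
    return -1;
}
```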

ERRORS

       seccomp() can fail for the following reasons:

       EACCES The caller did not have the CAP_SYS_ADMIN capability in its user
              namespace, or had not set no_new_privs before using
              SECCOMP_SET_MODE_FILTER.

       EBUSY  While installing a new filter, the
              SECCOMP_FILTER_FLAG_NEW_LISTENER flag was specified, but a
              previous filter had already been installed with that flag.

       EFAULT args was not a valid address.

       EINVAL operation is unknown or is not supported by this kernel version
              or configuration.

       EINVAL The specified flags are invalid for the given operation.

       EINVAL operation included BPF_ABS, but the specified offset was not
              aligned to a 32-bit boundary or exceeded sizeof(struct
              seccomp_data).

       EINVAL A secure computing mode has already been set, and operation
              differs from the existing setting.

       EINVAL operation specified SECCOMP_SET_MODE_FILTER, but the filter
              program pointed to by args was not valid or the length of the
              filter program was zero or exceeded BPF_MAXINSNS (4096)
              instructions.

       ENOMEM Out of memory.

       ENOMEM The total length of all filter programs attached to the calling
              thread would exceed MAX_INSNS_PER_PATH (32768) instructions.
              Note that for the purposes of calculating this limit, each
              already existing filter program incurs an overhead penalty of 4
              instructions.

       EOPNOTSUPP
              operation specified SECCOMP_GET_ACTION_AVAIL, but the kernel
              does not support the filter return action specified by args.

       ESRCH  Another thread caused a failure during thread sync, but its ID
              could not be determined.
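
       The instruction limit described under ENOMEM can be checked ahead of
       time with simple arithmetic.  The sketch below is a user-space model
       only: the function name filter_fits is hypothetical, the constants are
       defined locally from the figures quoted above, and it assumes that the
       filter being added does not itself incur the penalty:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_INSNS_PER_PATH 32768  /* Limit quoted in the ENOMEM entry */
#define PER_FILTER_PENALTY 4      /* Overhead per existing filter */

/* existing_len[] holds the lengths, in instructions, of the n filters
   already attached to the thread; new_len is the length of the filter
   about to be added.  Returns true if the total stays within the
   limit. */
static bool
filter_fits(const unsigned short *existing_len, size_t n,
            unsigned short new_len)
{
    unsigned long total = new_len;

    for (size_t i = 0; i < n; i++)
        total += existing_len[i] + PER_FILTER_PENALTY;
    return total <= MAX_INSNS_PER_PATH;
}
```

       With seven existing filters of 4096 instructions each, the model
       counts 7 * (4096 + 4) = 28700 instructions, leaving room for a new
       filter of at most 4068 instructions.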

STANDARDS

       Linux.

HISTORY

       Linux 3.17.

NOTES

       Rather than hand-coding seccomp filters as shown in the example below,
       you may prefer to employ the libseccomp library, which provides a
       front-end for generating seccomp filters.

       The Seccomp field of the /proc/pid/status file provides a method of
       viewing the seccomp mode of a process; see proc(5).

       seccomp() provides a superset of the functionality provided by the
       prctl(2) PR_SET_SECCOMP operation (which does not support flags).

       Since Linux 4.4, the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation can
       be used to dump a process's seccomp filters.

   Architecture support for seccomp BPF
       Architecture support for seccomp BPF filtering is available on the
       following architectures:

       •  x86-64, i386, x32 (since Linux 3.5)
       •  ARM (since Linux 3.8)
       •  s390 (since Linux 3.8)
       •  MIPS (since Linux 3.16)
       •  ARM-64 (since Linux 3.19)
       •  PowerPC (since Linux 4.3)
       •  Tile (since Linux 4.3)
       •  PA-RISC (since Linux 4.6)
533
534   Caveats
535       There  are various subtleties to consider when applying seccomp filters
536       to a program, including the following:
537
538       •  Some traditional system calls have user-space implementations in the
539          vdso(7)  on many architectures.  Notable examples include clock_get‐
540          time(2), gettimeofday(2), and time(2).  On such architectures,  sec‐
541          comp  filtering  for  these system calls will have no effect.  (How‐
542          ever, there are cases where the  vdso(7)  implementations  may  fall
543          back to invoking the true system call, in which case seccomp filters
544          would see the system call.)
545
546       •  Seccomp filtering is based on system call numbers.  However,  appli‐
547          cations  typically  do not directly invoke system calls, but instead
548          call wrapper functions in the C library which  in  turn  invoke  the
549          system calls.  Consequently, one must be aware of the following:
550
551          •  The glibc wrappers for some traditional system calls may actually
552             employ system calls with different names in the kernel.  For  ex‐
553             ample,   the   exit(2)  wrapper  function  actually  employs  the
554             exit_group(2) system call, and the fork(2) wrapper function actu‐
555             ally calls clone(2).
556
557          •  The  behavior of wrapper functions may vary across architectures,
558             according to the range of system calls provided on  those  archi‐
559             tectures.   In  other words, the same wrapper function may invoke
560             different system calls on different architectures.
561
562          •  Finally, the behavior of  wrapper  functions  can  change  across
563             glibc  versions.  For example, in older versions, the glibc wrap‐
564             per function for open(2) invoked the  system  call  of  the  same
565             name,  but starting in glibc 2.26, the implementation switched to
566             calling openat(2) on all architectures.
567
568       The consequence of the above points is that it may be necessary to fil‐
569       ter  for  a  system  call other than might be expected.  Various manual
570       pages in Section 2 provide helpful details about  the  differences  be‐
571       tween  wrapper functions and the underlying system calls in subsections
572       entitled C library/kernel differences.
573
574       Furthermore, note that the application of seccomp  filters  even  risks
575       causing bugs in an application, when the filters cause unexpected fail‐
576       ures for legitimate operations that the application might need to  per‐
577       form.   Such bugs may not easily be discovered when testing the seccomp
578       filters if the bugs occur in rarely used application code paths.

   Seccomp-specific BPF details
       Note the following BPF details specific to seccomp filters:

       •  The BPF_H and BPF_B size modifiers are not supported: all operations
          must load and store (4-byte) words (BPF_W).

       •  To  access  the contents of the seccomp_data buffer, use the BPF_ABS
          addressing mode modifier.

       •  The BPF_LEN addressing mode modifier yields an immediate mode  oper‐
          and whose value is the size of the seccomp_data buffer.
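
       The constraints above can be illustrated with a minimal sketch.
       Because only BPF_W loads are permitted, each 64-bit args[] field of
       seccomp_data has to be read as two 32-bit words, and which word holds
       the low half depends on the byte order.  The macro and function names
       below are hypothetical, introduced only for illustration.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>

/* Byte offsets of the low and high 32-bit halves of args[idx]
   (hypothetical helper macros). */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARG_LO_OFF(idx) (offsetof(struct seccomp_data, args) + 8 * (idx))
#define ARG_HI_OFF(idx) (ARG_LO_OFF(idx) + 4)
#else
#define ARG_HI_OFF(idx) (offsetof(struct seccomp_data, args) + 8 * (idx))
#define ARG_LO_OFF(idx) (ARG_HI_OFF(idx) + 4)
#endif

/* Load the low 32 bits of system call argument 'idx' into the
   accumulator: a BPF_W load using BPF_ABS addressing. */
static struct sock_filter
load_arg_lo(unsigned int idx)
{
    return (struct sock_filter)
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO_OFF(idx));
}

/* Load the high 32 bits of system call argument 'idx'. */
static struct sock_filter
load_arg_hi(unsigned int idx)
{
    return (struct sock_filter)
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_HI_OFF(idx));
}

/* Load the size of the seccomp_data buffer into the accumulator
   by way of the BPF_LEN addressing mode. */
static struct sock_filter
load_data_len(void)
{
    return (struct sock_filter)
        BPF_STMT(BPF_LD | BPF_W | BPF_LEN, 0);
}
```

       A filter that compares a full 64-bit argument would typically load
       and test one half, then load and test the other.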

EXAMPLES

       The program below accepts four or more arguments.  The first three
       arguments are a system call number, a numeric architecture identifier,
       and an error number.  The program uses these values to construct a BPF
       filter that is used at run time to perform the following checks:

       •  If the program is not running on the specified architecture, the BPF
          filter kills the process (SECCOMP_RET_KILL_PROCESS).

       •  If the program attempts to execute the system call with the
          specified number, the BPF filter causes the system call to fail,
          with errno being set to the specified error number.
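
       The decision that the filter makes for each system call can be
       restated as ordinary C.  This is only a model for reasoning about the
       filter, not code that the kernel runs.  It follows the program source
       shown later in this section, in which both an architecture mismatch
       and an x32 system call number lead to SECCOMP_RET_KILL_PROCESS.  The
       function name filter_decision() is hypothetical.

```c
#include <linux/audit.h>
#include <linux/seccomp.h>

#define X32_SYSCALL_BIT 0x40000000

/* Model of the filter: given the 'arch' and 'nr' fields of
   seccomp_data and the three command-line values, return the
   value that the BPF program would return. */
static unsigned int
filter_decision(unsigned int arch, unsigned int nr,
                unsigned int t_arch, unsigned int syscall_nr,
                int f_errno)
{
    unsigned int upper_nr_limit = 0xffffffff;

    /* On x86-64, numbers with bit 30 set belong to the x32 ABI. */
    if (t_arch == AUDIT_ARCH_X86_64)
        upper_nr_limit = X32_SYSCALL_BIT - 1;

    if (arch != t_arch)
        return SECCOMP_RET_KILL_PROCESS;

    if (nr > upper_nr_limit)
        return SECCOMP_RET_KILL_PROCESS;

    if (nr == syscall_nr)
        return SECCOMP_RET_ERRNO |
               ((unsigned int) f_errno & SECCOMP_RET_DATA);

    return SECCOMP_RET_ALLOW;
}
```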

       The remaining command-line arguments specify the pathname and
       additional arguments of a program that the example program should
       attempt to execute using execv(3) (a library function that employs the
       execve(2) system call).  Some example runs of the program are shown
       below.

       First, we display the architecture that we are running on (x86-64) and
       then construct a shell function that looks up system call numbers on
       this architecture (older kernel source trees keep the table in
       arch/x86/syscalls/ rather than arch/x86/entry/syscalls/):

           $ uname -m
           x86_64
           $ syscall_nr() {
               awk '$2 != "x32" && $3 == "'$1'" { print $1 }' \
                   /usr/src/linux/arch/x86/entry/syscalls/syscall_64.tbl
           }
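
       The shell function above reads the kernel's system call table.  Within
       a C program, the numbers for the build architecture are available at
       compile time as the SYS_* constants from <sys/syscall.h>.  The sketch
       below wraps a few of them in a lookup by name; the table and the
       function lookup_syscall_nr() are hypothetical and cover only the
       system calls used in the example runs.

```c
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

struct nr_entry {
    const char *name;
    long        nr;
};

/* Only the system calls used in the example runs are listed. */
static const struct nr_entry nr_table[] = {
    { "execve", SYS_execve },
    { "write",  SYS_write },
    { "preadv", SYS_preadv },
};

/* Return the system call number for 'name' on the architecture
   this code was compiled for, or -1 if the name is not listed. */
static long
lookup_syscall_nr(const char *name)
{
    for (size_t i = 0; i < sizeof(nr_table) / sizeof(nr_table[0]); i++)
        if (strcmp(nr_table[i].name, name) == 0)
            return nr_table[i].nr;

    return -1;
}
```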

       When the BPF filter rejects a system call (the second case above), it
       causes the system call to fail with the error number specified on the
       command line.  In the experiments shown here, we'll use error number
       99:

           $ errno 99
           EADDRNOTAVAIL 99 Cannot assign requested address
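
       The error number reaches the kernel in the low 16 bits of the filter's
       return value, combined with the SECCOMP_RET_ERRNO action.  The
       following sketch shows that encoding for error number 99; the three
       function names are hypothetical.

```c
#include <linux/seccomp.h>

/* Build the filter return value that fails a system call with
   error number 'err'. */
static unsigned int
errno_action(int err)
{
    return SECCOMP_RET_ERRNO | ((unsigned int) err & SECCOMP_RET_DATA);
}

/* Extract the action part of a filter return value. */
static unsigned int
action_of(unsigned int ret)
{
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Extract the data part (here, the error number). */
static unsigned int
data_of(unsigned int ret)
{
    return ret & SECCOMP_RET_DATA;
}
```

       With these definitions, errno_action(99) evaluates to 0x00050063.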

       In  the following example, we attempt to run the command whoami(1), but
       the BPF filter rejects the execve(2) system call, so that  the  command
       is not even executed:

           $ syscall_nr execve
           59
           $ ./a.out
           Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
           Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
                            AUDIT_ARCH_X86_64: 0xC000003E
           $ ./a.out 59 0xC000003E 99 /bin/whoami
           execv: Cannot assign requested address
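
       The <arch> values in the usage message are AUDIT_ARCH_* constants from
       <linux/audit.h>.  A program that wants the identifier of the
       architecture it was compiled for could select it with preprocessor
       checks, as in the sketch below.  The function native_audit_arch() is
       hypothetical, and only three architectures are covered.

```c
#include <linux/audit.h>

/* Return the AUDIT_ARCH_* identifier matching the architecture
   this code was compiled for, or 0 if it is not one of the few
   listed here. */
static unsigned int
native_audit_arch(void)
{
#if defined(__x86_64__)
    return AUDIT_ARCH_X86_64;   /* also reported for the x32 ABI */
#elif defined(__i386__)
    return AUDIT_ARCH_I386;
#elif defined(__aarch64__)
    return AUDIT_ARCH_AARCH64;
#else
    return 0;
#endif
}
```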

       In  the  next example, the BPF filter rejects the write(2) system call,
       so that, although it is successfully started, the whoami(1) command  is
       not able to write output:

           $ syscall_nr write
           1
           $ ./a.out 1 0xC000003E 99 /bin/whoami

       In  the final example, the BPF filter rejects a system call that is not
       used by the whoami(1) command, so it is able  to  successfully  execute
       and produce output:

           $ syscall_nr preadv
           295
           $ ./a.out 295 0xC000003E 99 /bin/whoami
           cecilia

   Program source
       #include <linux/audit.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <stddef.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/prctl.h>
       #include <sys/syscall.h>
       #include <unistd.h>

       #define X32_SYSCALL_BIT 0x40000000
       #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

       static int
       install_filter(int syscall_nr, unsigned int t_arch, int f_errno)
       {
           unsigned int upper_nr_limit = 0xffffffff;

           /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
              (in the x32 ABI, all system calls have bit 30 set in the
              'nr' field, meaning the numbers are >= X32_SYSCALL_BIT). */
           if (t_arch == AUDIT_ARCH_X86_64)
               upper_nr_limit = X32_SYSCALL_BIT - 1;

           struct sock_filter filter[] = {
               /* [0] Load architecture from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, arch))),

               /* [1] Jump forward 5 instructions if architecture does not
                      match 't_arch'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

               /* [2] Load system call number from 'seccomp_data' buffer into
                      accumulator. */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, nr))),

               /* [3] Check ABI - only needed for x86-64 in deny-list use
                      cases.  Use BPF_JGT instead of checking against the bit
                      mask to avoid having to reload the syscall number. */
               BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

               /* [4] Jump forward 1 instruction if system call number
                      does not match 'syscall_nr'. */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

               /* [5] Matching architecture and system call: don't execute
                      the system call, and return 'f_errno' in 'errno'. */
               BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

               /* [6] Destination of system call number mismatch: allow other
                      system calls. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

               /* [7] Destination of architecture mismatch: kill process. */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
           };

           struct sock_fprog prog = {
               .len = ARRAY_SIZE(filter),
               .filter = filter,
           };

           if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
               perror("seccomp");
               return 1;
           }

           return 0;
       }

       int
       main(int argc, char *argv[])
       {
           if (argc < 5) {
               fprintf(stderr, "Usage: "
                       "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                       "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                       "                 AUDIT_ARCH_X86_64: 0x%X\n"
                       "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
               exit(EXIT_FAILURE);
           }

           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
               perror("prctl");
               exit(EXIT_FAILURE);
           }

           if (install_filter(strtol(argv[1], NULL, 0),
                              strtoul(argv[2], NULL, 0),
                              strtol(argv[3], NULL, 0)))
               exit(EXIT_FAILURE);

           execv(argv[4], &argv[4]);
           perror("execv");
           exit(EXIT_FAILURE);
       }

SEE ALSO

       bpfc(1),  strace(1),  bpf(2),  prctl(2), ptrace(2), seccomp_unotify(2),
       sigaction(2), proc(5), signal(7), socket(7)

       Various pages from  the  libseccomp  library,  including:  scmp_sys_re‐
       solver(1), seccomp_export_bpf(3), seccomp_init(3), seccomp_load(3), and
       seccomp_rule_add(3).

       The kernel source files Documentation/networking/filter.txt  and  Docu‐
       mentation/userspace-api/seccomp_filter.rst (or Documentation/prctl/sec‐
       comp_filter.txt before Linux 4.13).

       McCanne, S. and Jacobson, V. (1992) The BSD Packet Filter: A New Archi‐
       tecture for User-level Packet Capture, Proceedings of the USENIX Winter
       1993 Conference ⟨http://www.tcpdump.org/papers/bpf-usenix93.pdf⟩

Linux man-pages 6.04              2023-03-30                        seccomp(2)