seccomp(2)

SECCOMP(2)                 Linux Programmer's Manual                SECCOMP(2)

NAME

       seccomp - operate on Secure Computing state of the process

SYNOPSIS

       #include <linux/seccomp.h>
       #include <linux/filter.h>
       #include <linux/audit.h>
       #include <linux/signal.h>
       #include <sys/ptrace.h>

       int seccomp(unsigned int operation, unsigned int flags, void *args);

DESCRIPTION

18       The  seccomp()  system  call operates on the Secure Computing (seccomp)
19       state of the calling process.
20
21       Currently, Linux supports the following operation values:
22
23       SECCOMP_SET_MODE_STRICT
24              The only system calls that the calling thread  is  permitted  to
25              make  are  read(2),  write(2), _exit(2) (but not exit_group(2)),
26              and sigreturn(2).  Other system calls result in the delivery  of
27              a  SIGKILL  signal.   Strict secure computing mode is useful for
28              number-crunching applications that may need to execute untrusted
29              byte code, perhaps obtained by reading from a pipe or socket.
30
31              Note  that  although  the calling thread can no longer call sig‐
32              procmask(2), it can use sigreturn(2) to block all signals  apart
33              from  SIGKILL  and SIGSTOP.  This means that alarm(2) (for exam‐
34              ple) is not sufficient for restricting the  process's  execution
35              time.   Instead, to reliably terminate the process, SIGKILL must
36              be used.   This  can  be  done  by  using  timer_create(2)  with
37              SIGEV_SIGNAL  and  sigev_signo set to SIGKILL, or by using setr‐
38              limit(2) to set the hard limit for RLIMIT_CPU.
39
40              This operation is available only if  the  kernel  is  configured
41              with CONFIG_SECCOMP enabled.
42
43              The value of flags must be 0, and args must be NULL.
44
45              This operation is functionally identical to the call:
46
47                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
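
              As a minimal sketch of strict mode in action, the helper below
              forks a child, places it in strict mode, and lets it attempt a
              system call that is not on the permitted list.  The names
              seccomp_call() and strict_mode_demo() are hypothetical, and the
              call goes through syscall(2) on the assumption that the C
              library provides no seccomp() wrapper.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: invoke seccomp() through syscall(2). */
static int
seccomp_call(unsigned int operation, unsigned int flags, void *args)
{
    return (int) syscall(SYS_seccomp, operation, flags, args);
}

/* Hypothetical demo: returns the signal that terminated the child,
   0 if the child exited normally (strict mode was unavailable),
   or -1 if fork() or waitpid() failed. */
static int
strict_mode_demo(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* flags must be 0 and args must be NULL for this operation */
        if (seccomp_call(SECCOMP_SET_MODE_STRICT, 0, NULL) == -1)
            syscall(SYS_exit_group, 1);  /* not sandboxed, so this works */

        syscall(SYS_getpid);             /* not permitted: SIGKILL */
        syscall(SYS_exit, 0);            /* not reached */
    }

    int wstatus;
    if (waitpid(pid, &wstatus, 0) == -1)
        return -1;
    return WIFSIGNALED(wstatus) ? WTERMSIG(wstatus) : 0;
}
```

              A sandboxed thread that wants to terminate cleanly would use
              the raw exit system call, as in the last line of the child
              branch, because exit_group(2) is not permitted in strict mode.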
48
49       SECCOMP_SET_MODE_FILTER
50              The  system calls allowed are defined by a pointer to a Berkeley
51              Packet Filter (BPF) passed via args.  This argument is a pointer
52              to  a  struct sock_fprog; it can be designed to filter arbitrary
53              system calls and system call arguments.  If the  filter  is  in‐
54              valid, seccomp() fails, returning EINVAL in errno.
55
56              If  fork(2) or clone(2) is allowed by the filter, any child pro‐
57              cesses will be constrained to the same system  call  filters  as
58              the  parent.  If execve(2) is allowed, the existing filters will
59              be preserved across a call to execve(2).
60
61              In order to use the  SECCOMP_SET_MODE_FILTER  operation,  either
62              the calling thread must have the CAP_SYS_ADMIN capability in its
63              user namespace, or the thread must already have the no_new_privs
64              bit set.  If that bit was not already set by an ancestor of this
65              thread, the thread must make the following call:
66
67                  prctl(PR_SET_NO_NEW_PRIVS, 1);
68
69              Otherwise, the SECCOMP_SET_MODE_FILTER operation fails  and  re‐
70              turns EACCES in errno.  This requirement ensures that an unpriv‐
71              ileged process cannot apply a malicious filter and then invoke a
72              set-user-ID  or  other  privileged program using execve(2), thus
73              potentially compromising that program.  (Such a malicious filter
74              might, for example, cause an attempt to use setuid(2) to set the
75              caller's user IDs to nonzero values to instead return 0  without
76              actually  making  the  system  call.  Thus, the program might be
77              tricked into retaining  superuser  privileges  in  circumstances
78              where  it is possible to influence it to do dangerous things be‐
79              cause it did not actually drop privileges.)
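
              A small sketch of that preparation step follows.  The helper
              name ensure_no_new_privs() is hypothetical.  It passes explicit
              zeros for the unused prctl(2) arguments, which is a
              conservative choice rather than something this page requires.

```c
#include <sys/prctl.h>

/* Hypothetical helper: make sure the no_new_privs bit is set for the
   calling thread.  Returns 0 on success, -1 on error (errno is set). */
static int
ensure_no_new_privs(void)
{
    /* PR_GET_NO_NEW_PRIVS returns the current value of the bit */
    if (prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1)
        return 0;       /* already set, possibly by an ancestor */

    /* The bit cannot be cleared once it has been set */
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}
```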
80
81              If prctl(2) or seccomp() is allowed by the attached filter, fur‐
82              ther  filters may be added.  This will increase evaluation time,
83              but allows for further reduction of the  attack  surface  during
84              execution of a thread.
85
86              The  SECCOMP_SET_MODE_FILTER  operation is available only if the
87              kernel is configured with CONFIG_SECCOMP_FILTER enabled.
88
89              When flags is 0, this operation is functionally identical to the
90              call:
91
92                  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args);
93
94              The recognized flags are:
95
96              SECCOMP_FILTER_FLAG_TSYNC
97                     When  adding  a new filter, synchronize all other threads
98                     of the calling process to the same seccomp  filter  tree.
99                     A  "filter  tree" is the ordered list of filters attached
100                     to a thread.  (Attaching identical  filters  in  separate
101                     seccomp()  calls  results  in different filters from this
102                     perspective.)
103
104                     If any thread cannot synchronize to the same filter tree,
105                     the call will not attach the new seccomp filter, and will
106                     fail, returning the first thread  ID  found  that  cannot
107                     synchronize.  Synchronization will fail if another thread
108                     in the same process is in SECCOMP_MODE_STRICT  or  if  it
109                     has  attached  new  seccomp  filters to itself, diverging
110                     from the calling thread's filter tree.
111
112              SECCOMP_FILTER_FLAG_LOG (since Linux 4.14)
113                     All filter return actions except SECCOMP_RET_ALLOW should
114                     be  logged.   An  administrator  may override this filter
115                     flag by preventing specific actions from being logged via
116                     the /proc/sys/kernel/seccomp/actions_logged file.
117
118              SECCOMP_FILTER_FLAG_SPEC_ALLOW (since Linux 4.17)
119                     Disable Speculative Store Bypass mitigation.
120
121       SECCOMP_GET_ACTION_AVAIL (since Linux 4.14)
122              Test to see if an action is supported by the kernel.  This oper‐
123              ation is helpful to confirm that the kernel knows of a more  re‐
124              cently  added  filter  return action since the kernel treats all
125              unknown actions as SECCOMP_RET_KILL_PROCESS.
126
127              The value of flags must be 0, and args must be a pointer  to  an
128              unsigned 32-bit filter return action.
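
              The check can be wrapped as in the following sketch.  The
              helper name seccomp_action_avail() and its three-way return
              convention are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper.  Returns 1 if the kernel supports 'action',
   0 if the kernel reports EOPNOTSUPP, and -1 on any other error
   (for example, a kernel that predates SECCOMP_GET_ACTION_AVAIL). */
static int
seccomp_action_avail(uint32_t action)
{
    /* flags must be 0; args points to the 32-bit action value */
    if (syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return (errno == EOPNOTSUPP) ? 0 : -1;
}
```

              A caller might test SECCOMP_RET_KILL_PROCESS or SECCOMP_RET_LOG
              this way before building a filter that returns it.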
129
   Filters
       When adding filters via SECCOMP_SET_MODE_FILTER, args points to a
       filter program:

           struct sock_fprog {
               unsigned short      len;    /* Number of BPF instructions */
               struct sock_filter *filter; /* Pointer to array of
                                              BPF instructions */
           };

       Each program must contain one or more BPF instructions:

           struct sock_filter {            /* Filter block */
               __u16 code;                 /* Actual filter code */
               __u8  jt;                   /* Jump true */
               __u8  jf;                   /* Jump false */
               __u32 k;                    /* Generic multiuse field */
           };

       When the instructions execute, the BPF program operates on the system
       call information, which is made available as a read-only buffer of
       the following form (access it with the BPF_ABS addressing mode):

           struct seccomp_data {
               int   nr;                   /* System call number */
               __u32 arch;                 /* AUDIT_ARCH_* value
                                              (see <linux/audit.h>) */
               __u64 instruction_pointer;  /* CPU instruction pointer */
               __u64 args[6];              /* Up to 6 system call arguments */
           };

161       Because numbering of system calls varies between architectures and some
162       architectures  (e.g.,  x86-64) allow user-space code to use the calling
163       conventions of multiple architectures (and the  convention  being  used
164       may  vary over the life of a process that uses execve(2) to execute bi‐
165       naries that employ the different conventions), it is usually  necessary
166       to verify the value of the arch field.
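
       One way to perform that check is to start every filter with a short
       prologue such as the one sketched here.  The helper name
       emit_arch_check() is hypothetical, and killing the process on a
       mismatch is only one possible policy.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical helper: write a 3-instruction prologue into 'out' that
   kills the process unless seccomp_data.arch equals 'expected_arch'.
   Returns the number of instructions written. */
static size_t
emit_arch_check(struct sock_filter *out, __u32 expected_arch)
{
    struct sock_filter prologue[] = {
        /* Load the 'arch' field into the accumulator */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, arch)),
        /* On a match, skip the next instruction; otherwise fall through */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, expected_arch, 1, 0),
        /* Architecture mismatch */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    };
    size_t n = sizeof(prologue) / sizeof(prologue[0]);

    for (size_t i = 0; i < n; i++)
        out[i] = prologue[i];
    return n;
}
```

       The instructions that inspect nr or args would follow this prologue.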
167
       It is strongly recommended to use an allow-list approach whenever
       possible because such an approach is more robust and simpler.  A
       deny-list will have to be updated whenever a potentially dangerous
       system call is added (or a dangerous flag or option if those are
       deny-listed), and it is often possible to alter the representation of
       a value without altering its meaning, leading to a deny-list bypass.
       See also Caveats below.
175
176       The  arch  field is not unique for all calling conventions.  The x86-64
177       ABI and the x32 ABI both use AUDIT_ARCH_X86_64 as arch, and they run on
178       the  same  processors.   Instead, the mask __X32_SYSCALL_BIT is used on
179       the system call number to tell the two ABIs apart.
180
181       This  means  that  a  policy  must  either  deny  all   syscalls   with
182       __X32_SYSCALL_BIT  or  it  must  recognize  syscalls  with  and without
183       __X32_SYSCALL_BIT set.  A list of system calls to be denied based on nr
184       that  does not also contain nr values with __X32_SYSCALL_BIT set can be
185       bypassed by a malicious program that sets __X32_SYSCALL_BIT.
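
       The first option (deny every system call that has the bit set) can be
       sketched as the fragment below.  It assumes that the arch field has
       already been verified to be AUDIT_ARCH_X86_64.  The array name
       deny_x32 is hypothetical, and the local X32_SYSCALL_BIT definition
       mirrors the one used in the example program on this page.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define X32_SYSCALL_BIT 0x40000000U

/* Hypothetical fragment: kill the process if 'nr' is at or above
   X32_SYSCALL_BIT; otherwise continue with whatever follows. */
static const struct sock_filter deny_x32[] = {
    /* Load the system call number into the accumulator */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),
    /* nr >= X32_SYSCALL_BIT: fall through to the kill;
       otherwise skip one instruction */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_SYSCALL_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* ... per-syscall checks for the native ABI would follow here ... */
};
```

       BPF comparisons are unsigned, so this test also catches negative
       values of nr.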
186
187       Additionally, kernels prior to Linux 5.4 incorrectly  permitted  nr  in
188       the  ranges  512-547 as well as the corresponding non-x32 syscalls ORed
189       with __X32_SYSCALL_BIT.  For example, nr  ==  521  and  nr  ==  (101  |
190       __X32_SYSCALL_BIT) would result in invocations of ptrace(2) with poten‐
191       tially confused x32-vs-x86_64 semantics in the  kernel.   Policies  in‐
192       tended  to  work on kernels before Linux 5.4 must ensure that they deny
193       or otherwise correctly handle these system calls.   On  Linux  5.4  and
194       newer, such system calls will fail with the error ENOSYS, without doing
195       anything.
196
197       The instruction_pointer field provides the address of the  machine-lan‐
198       guage instruction that performed the system call.  This might be useful
199       in conjunction with the use of /proc/[pid]/maps to perform checks based
200       on which region (mapping) of the program made the system call.  (Proba‐
201       bly, it is wise to lock down the mmap(2) and mprotect(2)  system  calls
202       to prevent the program from subverting such checks.)
203
204       When  checking  values from args, keep in mind that arguments are often
205       silently truncated before being processed, but after the seccomp check.
206       For  example, this happens if the i386 ABI is used on an x86-64 kernel:
207       although the kernel will normally not look beyond the 32 lowest bits of
208       the  arguments, the values of the full 64-bit registers will be present
209       in the seccomp data.  A less surprising example is that if  the  x86-64
210       ABI  is  used  to  perform a system call that takes an argument of type
211       int, the more-significant half of the argument register is  ignored  by
212       the system call, but visible in the seccomp data.
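
       Because seccomp filters load 32-bit words, a 64-bit argument has to be
       examined as two halves.  One conservative option, sketched below, is
       to compare both halves, so that a value with unexpected high bits is
       rejected rather than treated as a match.  The macros ARG_LO() and
       ARG_HI() and the array name are hypothetical.

```c
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Byte offsets of the low and high 32-bit halves of args[n] */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_HI(n) (ARG_LO(n) + 4)
#else
#define ARG_HI(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_LO(n) (ARG_HI(n) + 4)
#endif

/* Hypothetical fragment: allow only if args[0] is exactly 1 */
static const struct sock_filter arg0_equals_one[] = {
    /* [0] Load the high half of args[0] */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_HI(0)),
    /* [1] High half must be 0; otherwise jump to [4] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 2),
    /* [2] Load the low half of args[0] */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(0)),
    /* [3] Low half must be 1: jump to [5]; otherwise fall to [4] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 1, 0),
    /* [4] Mismatch */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [5] Match */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```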
213
       A seccomp filter returns a 32-bit value consisting of two parts: the
       most significant 16 bits (corresponding to the mask defined by the
       constant SECCOMP_RET_ACTION_FULL) contain one of the "action" values
       listed below; the least significant 16 bits (defined by the constant
       SECCOMP_RET_DATA) are "data" to be associated with this return value.
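
       The split can be expressed with the two masks directly, as in this
       short sketch.  The helper names are hypothetical.

```c
#include <stdint.h>
#include <linux/seccomp.h>

/* Extract the "action" part (upper 16 bits) of a filter return value */
static uint32_t
ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Extract the "data" part (lower 16 bits) of a filter return value */
static uint32_t
ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}

/* Build a return value that fails the system call with 'err' */
static uint32_t
make_errno_ret(uint16_t err)
{
    return SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA);
}
```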
219
220       If  multiple  filters exist, they are all executed, in reverse order of
221       their addition to the filter tree—that is, the most recently  installed
222       filter  is  executed first.  (Note that all filters will be called even
223       if one of the earlier filters returns SECCOMP_RET_KILL.  This  is  done
224       to  simplify the kernel code and to provide a tiny speed-up in the exe‐
225       cution of sets of filters by avoiding a check for this uncommon  case.)
226       The  return  value  for  the  evaluation  of a given system call is the
227       first-seen action value of highest precedence (along with its  accompa‐
228       nying data) returned by execution of all of the filters.
229
230       In  decreasing  order  of precedence, the action values that may be re‐
231       turned by a seccomp filter are:
232
233       SECCOMP_RET_KILL_PROCESS (since Linux 4.14)
234              This value results in immediate termination of the process, with
235              a core dump.  The system call is not executed.  By contrast with
236              SECCOMP_RET_KILL_THREAD below, all threads in the  thread  group
237              are terminated.  (For a discussion of thread groups, see the de‐
238              scription of the CLONE_THREAD flag in clone(2).)
239
240              The process terminates as though  killed  by  a  SIGSYS  signal.
241              Even  if  a  signal  handler has been registered for SIGSYS, the
242              handler will be ignored in this case and the process always ter‐
243              minates.   To  a  parent process that is waiting on this process
244              (using waitpid(2) or similar), the returned wstatus  will  indi‐
245              cate that its child was terminated as though by a SIGSYS signal.
246
247       SECCOMP_RET_KILL_THREAD (or SECCOMP_RET_KILL)
248              This  value  results in immediate termination of the thread that
249              made the system call.  The system call is not  executed.   Other
250              threads in the same thread group will continue to execute.
251
252              The  thread terminates as though killed by a SIGSYS signal.  See
253              SECCOMP_RET_KILL_PROCESS above.
254
255              Before Linux 4.11, any process terminated in this way would  not
256              trigger  a  coredump  (even  though SIGSYS is documented in sig‐
257              nal(7) as having a default action of  termination  with  a  core
258              dump).   Since  Linux  4.11, a single-threaded process will dump
259              core if terminated in this way.
260
261              With the addition of  SECCOMP_RET_KILL_PROCESS  in  Linux  4.14,
262              SECCOMP_RET_KILL_THREAD   was   added  as  a  synonym  for  SEC‐
263              COMP_RET_KILL, in order to more clearly distinguish the two  ac‐
264              tions.
265
266              Note: the use of SECCOMP_RET_KILL_THREAD to kill a single thread
267              in a multithreaded process is likely to leave the process  in  a
268              permanently inconsistent and possibly corrupt state.
269
270       SECCOMP_RET_TRAP
              This value results in the kernel sending a thread-directed
              SIGSYS signal to the triggering thread.  (The system call is not
              executed.)  Various fields will be set in the siginfo_t
              structure (see sigaction(2)) associated with the signal:
275
276              *  si_signo will contain SIGSYS.
277
278              *  si_call_addr will show the address of  the  system  call  in‐
279                 struction.
280
281              *  si_syscall  and  si_arch  will indicate which system call was
282                 attempted.
283
284              *  si_code will contain SYS_SECCOMP.
285
286              *  si_errno will contain the  SECCOMP_RET_DATA  portion  of  the
287                 filter return value.
288
289              The  program  counter will be as though the system call happened
290              (i.e., the program counter will not point to the system call in‐
291              struction).  The return value register will contain an architec‐
292              ture-dependent value; if resuming execution, set it to something
293              appropriate  for  the system call.  (The architecture dependency
294              is because replacing it with ENOSYS could overwrite some  useful
295              information.)
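
              A handler that picks up these fields might look like the
              following sketch.  The handler and variable names are
              hypothetical, and the fallback definition of SYS_SECCOMP is an
              assumption for C libraries whose headers do not provide it.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

#ifndef SYS_SECCOMP
#define SYS_SECCOMP 1       /* assumed value if the headers lack it */
#endif

/* Written by the handler; -1 means "no seccomp trap seen yet" */
static volatile sig_atomic_t last_syscall = -1;
static volatile sig_atomic_t last_data = -1;

static void
sigsys_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void) sig;
    (void) ucontext;

    /* Ignore SIGSYS that did not come from a seccomp filter */
    if (info->si_code != SYS_SECCOMP)
        return;

    last_syscall = info->si_syscall;  /* attempted system call number */
    last_data = info->si_errno;       /* SECCOMP_RET_DATA of the filter */
}

/* Returns 0 on success, -1 on error (errno is set) */
static int
install_sigsys_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigsys_handler;
    sa.sa_flags = SA_SIGINFO;         /* needed to receive siginfo_t */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSYS, &sa, NULL);
}
```

              The handler only records values in sig_atomic_t variables, which
              keeps it async-signal-safe.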
296
297       SECCOMP_RET_ERRNO
298              This  value  results in the SECCOMP_RET_DATA portion of the fil‐
299              ter's return value being passed to user space as the errno value
300              without executing the system call.
301
302       SECCOMP_RET_TRACE
303              When  returned,  this  value will cause the kernel to attempt to
304              notify a ptrace(2)-based tracer prior to  executing  the  system
305              call.  If there is no tracer present, the system call is not ex‐
306              ecuted and returns a failure status with errno set to ENOSYS.
307
308              A tracer will be notified if it  requests  PTRACE_O_TRACESECCOMP
309              using ptrace(PTRACE_SETOPTIONS).  The tracer will be notified of
310              a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion  of  the
311              filter's  return  value  will  be  available  to  the tracer via
312              PTRACE_GETEVENTMSG.
313
314              The tracer can skip the system call by changing the system  call
315              number  to  -1.  Alternatively, the tracer can change the system
316              call requested by changing the system call  to  a  valid  system
317              call  number.   If the tracer asks to skip the system call, then
318              the system call will appear to return the value that the  tracer
319              puts in the return value register.
320
              Before Linux 4.8, the seccomp check was not run again after
              the tracer was notified.  (This means that, on older kernels,
              seccomp-based sandboxes must not allow use of ptrace(2), even
              of other sandboxed processes, without extreme care; ptracers
              can use this mechanism to escape from the seccomp sandbox.)
326
327              Note  that a tracer process will not be notified if another fil‐
328              ter returns an action value with a precedence greater than  SEC‐
329              COMP_RET_TRACE.
330
331       SECCOMP_RET_LOG (since Linux 4.14)
332              This  value  results in the system call being executed after the
333              filter return action is logged.  An administrator  may  override
334              the  logging of this action via the /proc/sys/kernel/seccomp/ac‐
335              tions_logged file.
336
337       SECCOMP_RET_ALLOW
338              This value results in the system call being executed.
339
340       If an action value other than one of the above is specified,  then  the
341       filter  action  is  treated  as  either SECCOMP_RET_KILL_PROCESS (since
342       Linux 4.14) or SECCOMP_RET_KILL_THREAD (in Linux 4.13 and earlier).
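
       The precedence rule for multiple filters can be modelled in user space
       as in the sketch below.  It is an illustration, not the kernel's code.
       It assumes that the action constants in <linux/seccomp.h> sort in
       decreasing order of precedence when their action bits are read as
       signed 32-bit integers, which holds for the values listed above.  The
       function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* 'results' holds the return values of the filters in the order in
   which they ran (most recently installed filter first).  Returns the
   first-seen value whose action has the highest precedence, or
   SECCOMP_RET_ALLOW if no filter returned anything stricter. */
static uint32_t
combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t best = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        int32_t cur = (int32_t) (results[i] & SECCOMP_RET_ACTION_FULL);
        int32_t top = (int32_t) (best & SECCOMP_RET_ACTION_FULL);

        /* Strictly lower signed value means higher precedence; the
           strict comparison keeps the first-seen value on a tie */
        if (cur < top)
            best = results[i];
    }
    return best;
}
```

       For example, if two filters return SECCOMP_RET_ERRNO with different
       data, the data of the one that ran first is kept.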
343
344   /proc interfaces
345       The files in the directory /proc/sys/kernel/seccomp provide  additional
346       seccomp information and configuration:
347
348       actions_avail (since Linux 4.14)
349              A  read-only  ordered  list  of seccomp filter return actions in
350              string form.  The ordering, from left-to-right, is in decreasing
351              order  of  precedence.   The  list represents the set of seccomp
352              filter return actions supported by the kernel.
353
354       actions_logged (since Linux 4.14)
355              A read-write ordered list of seccomp filter return actions  that
356              are  allowed to be logged.  Writes to the file do not need to be
357              in ordered form but reads from the file will be ordered  in  the
358              same way as the actions_avail file.
359
360              It  is  important  to note that the value of actions_logged does
361              not prevent certain filter return actions from being logged when
362              the  audit  subsystem is configured to audit a task.  If the ac‐
363              tion is not found in the actions_logged file, the final decision
364              on  whether to audit the action for that task is ultimately left
365              up to the audit subsystem to decide for all  filter  return  ac‐
366              tions other than SECCOMP_RET_ALLOW.
367
368              The "allow" string is not accepted in the actions_logged file as
369              it is not possible to log SECCOMP_RET_ALLOW actions.  Attempting
370              to write "allow" to the file will fail with the error EINVAL.
371
372   Audit logging of seccomp actions
373       Since  Linux  4.14, the kernel provides the facility to log the actions
374       returned by seccomp filters in the audit log.  The kernel makes the de‐
375       cision  to  log an action based on the action type,  whether or not the
376       action is present in the actions_logged file, and whether kernel audit‐
377       ing  is  enabled (e.g., via the kernel boot option audit=1).  The rules
378       are as follows:
379
380       *  If the action is SECCOMP_RET_ALLOW, the action is not logged.
381
382       *  Otherwise, if the action is either SECCOMP_RET_KILL_PROCESS or  SEC‐
383          COMP_RET_KILL_THREAD,  and that action appears in the actions_logged
384          file, the action is logged.
385
386       *  Otherwise, if the filter has  requested  logging  (the  SECCOMP_FIL‐
387          TER_FLAG_LOG  flag)  and  the  action  appears in the actions_logged
388          file, the action is logged.
389
390       *  Otherwise, if kernel auditing is enabled and the  process  is  being
391          audited (autrace(8)), the action is logged.
392
393       *  Otherwise, the action is not logged.
394

RETURN VALUE

396       On   success,   seccomp()   returns   0.   On  error,  if  SECCOMP_FIL‐
397       TER_FLAG_TSYNC was used, the return value is the ID of the thread  that
398       caused  the synchronization failure.  (This ID is a kernel thread ID of
399       the type returned by clone(2) and gettid(2).)  On other errors,  -1  is
400       returned, and errno is set to indicate the cause of the error.
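
       Because a failed SECCOMP_FILTER_FLAG_TSYNC call can return a positive
       value, testing only for -1 is not enough.  The sketch below classifies
       the three cases.  The enum and function names are hypothetical.

```c
enum tsync_outcome {
    TSYNC_OK,               /* filter attached, all threads synchronized */
    TSYNC_THREAD_BLOCKED,   /* a thread could not be synchronized */
    TSYNC_ERROR             /* other failure; consult errno */
};

/* 'ret' is the value returned by a seccomp() call made with
   SECCOMP_FILTER_FLAG_TSYNC.  On TSYNC_THREAD_BLOCKED, the ID of the
   offending thread is stored in *blocking_tid. */
static enum tsync_outcome
classify_tsync_result(long ret, long *blocking_tid)
{
    if (ret == 0)
        return TSYNC_OK;
    if (ret > 0) {
        *blocking_tid = ret;
        return TSYNC_THREAD_BLOCKED;
    }
    return TSYNC_ERROR;
}
```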
401

ERRORS

403       seccomp() can fail for the following reasons:
404
405       EACCES The caller did not have the CAP_SYS_ADMIN capability in its user
406              namespace,  or  had  not  set  no_new_privs  before  using  SEC‐
407              COMP_SET_MODE_FILTER.
408
409       EFAULT args was not a valid address.
410
411       EINVAL operation  is unknown or is not supported by this kernel version
412              or configuration.
413
414       EINVAL The specified flags are invalid for the given operation.
415
416       EINVAL operation included BPF_ABS, but the  specified  offset  was  not
417              aligned  to  a  32-bit  boundary  or exceeded sizeof(struct sec‐
418              comp_data).
419
420       EINVAL A secure computing mode has already been set, and operation dif‐
421              fers from the existing setting.
422
423       EINVAL operation specified SECCOMP_SET_MODE_FILTER, but the filter pro‐
424              gram pointed to by args was not valid or the length of the  fil‐
425              ter  program  was  zero or exceeded BPF_MAXINSNS (4096) instruc‐
426              tions.
427
428       ENOMEM Out of memory.
429
       ENOMEM The total length of all filter programs attached to the calling
              thread would exceed MAX_INSNS_PER_PATH (32768) instructions.
              Note that for the purposes of calculating this limit, each
              already existing filter program incurs an overhead penalty of 4
              instructions.
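
              The arithmetic can be sketched as follows.  The function and
              macro names are hypothetical, and the sketch assumes that the
              penalty applies only to the filters that are already attached,
              not to the one being added.

```c
#include <stddef.h>

#define INSNS_PER_PATH_LIMIT 32768UL  /* MAX_INSNS_PER_PATH, per this page */
#define PER_FILTER_PENALTY   4UL      /* overhead per existing filter */

/* Returns 1 if a new filter of 'new_len' instructions would stay
   within the limit given the lengths of the 'n' filters already
   attached, 0 otherwise. */
static int
new_filter_fits(const unsigned short *attached_lens, size_t n,
                unsigned short new_len)
{
    unsigned long total = new_len;

    for (size_t i = 0; i < n; i++)
        total += attached_lens[i] + PER_FILTER_PENALTY;

    return total <= INSNS_PER_PATH_LIMIT;
}
```

              Under these assumptions, seven attached filters of 4096
              instructions each account for 7 * (4096 + 4) = 28700
              instructions, leaving room for a new filter of at most 4068.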
435
436       EOPNOTSUPP
437              operation specified  SECCOMP_GET_ACTION_AVAIL,  but  the  kernel
438              does not support the filter return action specified by args.
439
440       ESRCH  Another  thread  caused a failure during thread sync, but its ID
441              could not be determined.
442

VERSIONS

444       The seccomp() system call first appeared in Linux 3.17.
445

CONFORMING TO

447       The seccomp() system call is a nonstandard Linux extension.
448

NOTES

450       Rather than hand-coding seccomp filters as shown in the example  below,
451       you  may  prefer  to  employ  the  libseccomp library, which provides a
452       front-end for generating seccomp filters.
453
454       The Seccomp field of the /proc/[pid]/status file provides a  method  of
455       viewing the seccomp mode of a process; see proc(5).
456
457       seccomp()  provides  a  superset  of  the functionality provided by the
458       prctl(2) PR_SET_SECCOMP operation (which does not support flags).
459
460       Since Linux 4.4, the ptrace(2) PTRACE_SECCOMP_GET_FILTER operation  can
461       be used to dump a process's seccomp filters.
462
463   Architecture support for seccomp BPF
464       Architecture support for seccomp BPF filtering is available on the fol‐
465       lowing architectures:
466
467       *  x86-64, i386, x32 (since Linux 3.5)
468       *  ARM (since Linux 3.8)
469       *  s390 (since Linux 3.8)
470       *  MIPS (since Linux 3.16)
471       *  ARM-64 (since Linux 3.19)
472       *  PowerPC (since Linux 4.3)
473       *  Tile (since Linux 4.3)
474       *  PA-RISC (since Linux 4.6)
475
476   Caveats
477       There are various subtleties to consider when applying seccomp  filters
478       to a program, including the following:
479
480       *  Some traditional system calls have user-space implementations in the
481          vdso(7) on many architectures.  Notable examples include  clock_get‐
482          time(2),  gettimeofday(2), and time(2).  On such architectures, sec‐
483          comp filtering for these system calls will have  no  effect.   (How‐
484          ever,  there  are  cases  where the vdso(7) implementations may fall
485          back to invoking the true system call, in which case seccomp filters
486          would see the system call.)
487
488       *  Seccomp  filtering is based on system call numbers.  However, appli‐
489          cations typically do not directly invoke system calls,  but  instead
490          call  wrapper  functions  in  the C library which in turn invoke the
491          system calls.  Consequently, one must be aware of the following:
492
493          •  The glibc wrappers for some traditional system calls may actually
494             employ  system calls with different names in the kernel.  For ex‐
495             ample,  the  exit(2)  wrapper  function  actually   employs   the
496             exit_group(2) system call, and the fork(2) wrapper function actu‐
497             ally calls clone(2).
498
499          •  The behavior of wrapper functions may vary across  architectures,
500             according  to  the range of system calls provided on those archi‐
501             tectures.  In other words, the same wrapper function  may  invoke
502             different system calls on different architectures.
503
504          •  Finally,  the  behavior  of  wrapper  functions can change across
505             glibc versions.  For example, in older versions, the glibc  wrap‐
506             per  function  for  open(2)  invoked  the system call of the same
507             name, but starting in glibc 2.26, the implementation switched  to
508             calling openat(2) on all architectures.
509
510       The consequence of the above points is that it may be necessary to fil‐
511       ter for a system call other than might  be  expected.   Various  manual
512       pages  in  Section  2 provide helpful details about the differences be‐
513       tween wrapper functions and the underlying system calls in  subsections
514       entitled C library/kernel differences.
515
516       Furthermore,  note  that  the application of seccomp filters even risks
517       causing bugs in an application, when the filters cause unexpected fail‐
518       ures  for legitimate operations that the application might need to per‐
519       form.  Such bugs may not easily be discovered when testing the  seccomp
520       filters if the bugs occur in rarely used application code paths.
521
522   Seccomp-specific BPF details
523       Note the following BPF details specific to seccomp filters:
524
525       *  The BPF_H and BPF_B size modifiers are not supported: all operations
526          must load and store (4-byte) words (BPF_W).
527
528       *  To access the contents of the seccomp_data buffer, use  the  BPF_ABS
529          addressing mode modifier.
530
531       *  The  BPF_LEN addressing mode modifier yields an immediate mode oper‐
532          and whose value is the size of the seccomp_data buffer.
533

EXAMPLES

535       The program below accepts four or more arguments.  The first three  ar‐
536       guments  are  a  system call number, a numeric architecture identifier,
537       and an error number.  The program uses these values to construct a  BPF
538       filter that is used at run time to perform the following checks:
539
540       [1] If  the  program  is not running on the specified architecture, the
541           BPF filter causes system calls to fail with the error ENOSYS.
542
543       [2] If the program attempts to execute the system call with the  speci‐
544           fied  number,  the  BPF filter causes the system call to fail, with
545           errno being set to the specified error number.
546
547       The remaining command-line arguments specify  the  pathname  and  addi‐
548       tional  arguments  of a program that the example program should attempt
549       to execute using execv(3) (a library  function  that  employs  the  ex‐
550       ecve(2)  system  call).  Some example runs of the program are shown be‐
551       low.
552
553       First, we display the architecture that we are running on (x86-64)  and
554       then  construct  a  shell function that looks up system call numbers on
555       this architecture:
556
           $ uname -m
           x86_64
           $ syscall_nr() {
               cat /usr/src/linux/arch/x86/syscalls/syscall_64.tbl | \
               awk '$2 != "x32" && $3 == "'$1'" { print $1 }'
           }
563
564       When the BPF filter rejects a system call (case [2] above),  it  causes
565       the  system call to fail with the error number specified on the command
566       line.  In the experiments shown here, we'll use error number 99:
567
           $ errno 99
           EADDRNOTAVAIL 99 Cannot assign requested address
570
571       In the following example, we attempt to run the command whoami(1),  but
572       the  BPF  filter rejects the execve(2) system call, so that the command
573       is not even executed:
574
           $ syscall_nr execve
           59
           $ ./a.out
           Usage: ./a.out <syscall_nr> <arch> <errno> <prog> [<args>]
           Hint for <arch>: AUDIT_ARCH_I386: 0x40000003
                            AUDIT_ARCH_X86_64: 0xC000003E
           $ ./a.out 59 0xC000003E 99 /bin/whoami
           execv: Cannot assign requested address

       In the next example, the BPF filter rejects the write(2) system
       call, so that, although it is successfully started, the whoami(1)
       command is not able to write output:

           $ syscall_nr write
           1
           $ ./a.out 1 0xC000003E 99 /bin/whoami

       In the final example, the BPF filter rejects a system call that is
       not used by the whoami(1) command, so it is able to successfully
       execute and produce output:

           $ syscall_nr preadv
           295
           $ ./a.out 295 0xC000003E 99 /bin/whoami
           cecilia

   Program source
       #include <errno.h>
       #include <stddef.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <linux/audit.h>
       #include <linux/filter.h>
       #include <linux/seccomp.h>
       #include <sys/prctl.h>
       #include <sys/syscall.h>

       #define X32_SYSCALL_BIT 0x40000000
       #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

       static int
       install_filter(int syscall_nr, unsigned int t_arch, int f_errno)
       {
           unsigned int upper_nr_limit = 0xffffffff;

           /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
              (in the x32 ABI, all system calls have bit 30 set in the
              'nr' field, meaning the numbers are >= X32_SYSCALL_BIT) */
           if (t_arch == AUDIT_ARCH_X86_64)
               upper_nr_limit = X32_SYSCALL_BIT - 1;

           struct sock_filter filter[] = {
               /* [0] Load architecture from 'seccomp_data' buffer into
                      accumulator */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, arch))),

               /* [1] Jump forward 5 instructions if architecture does not
                      match 't_arch' */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

               /* [2] Load system call number from 'seccomp_data' buffer into
                      accumulator */
               BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                        (offsetof(struct seccomp_data, nr))),

               /* [3] Check ABI - only needed for x86-64 in deny-list use
                      cases.  Use BPF_JGT instead of checking against the bit
                      mask to avoid having to reload the syscall number. */
               BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

               /* [4] Jump forward 1 instruction if system call number
                      does not match 'syscall_nr' */
               BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

               /* [5] Matching architecture and system call: don't execute
                      the system call, and return 'f_errno' in 'errno' */
               BPF_STMT(BPF_RET | BPF_K,
                        SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

               /* [6] Destination of system call number mismatch: allow other
                      system calls */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

               /* [7] Destination of architecture mismatch: kill process */
               BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
           };

           struct sock_fprog prog = {
               .len = ARRAY_SIZE(filter),
               .filter = filter,
           };

           /* glibc provides no wrapper for seccomp(); call it via
              syscall(2) */
           if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
               perror("seccomp");
               return 1;
           }

           return 0;
       }

       int
       main(int argc, char **argv)
       {
           if (argc < 5) {
               fprintf(stderr, "Usage: "
                       "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
                       "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
                       "                 AUDIT_ARCH_X86_64: 0x%X\n"
                       "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
               exit(EXIT_FAILURE);
           }

           if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
               perror("prctl");
               exit(EXIT_FAILURE);
           }

           /* <arch> values such as 0xC000003E do not fit in a signed
              32-bit integer, so parse that argument as unsigned */
           if (install_filter(strtol(argv[1], NULL, 0),
                              strtoul(argv[2], NULL, 0),
                              strtol(argv[3], NULL, 0)))
               exit(EXIT_FAILURE);

           execv(argv[4], &argv[4]);
           perror("execv");
           exit(EXIT_FAILURE);
       }
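
       The control flow of the BPF program in install_filter() can be
       easier to follow when written as ordinary C.  The following sketch
       models the decision made by instructions [0] to [7].  It is an
       illustration only, not code that the kernel runs, and the function
       name filter_decision() is hypothetical.

```c
#include <stdint.h>
#include <linux/audit.h>
#include <linux/seccomp.h>

#define X32_SYSCALL_BIT 0x40000000

/* Hypothetical model of the filter: given the 'arch' and 'nr' fields
   of a 'struct seccomp_data', return the value that the BPF program
   built by install_filter() would return. */
static uint32_t
filter_decision(uint32_t arch, uint32_t nr,
                uint32_t t_arch, uint32_t syscall_nr, uint32_t f_errno)
{
    uint32_t upper_nr_limit = 0xffffffff;

    if (t_arch == AUDIT_ARCH_X86_64)
        upper_nr_limit = X32_SYSCALL_BIT - 1;

    /* [0], [1]: an architecture mismatch jumps to [7] */
    if (arch != t_arch)
        return SECCOMP_RET_KILL_PROCESS;

    /* [2], [3]: a number above the limit (x32 ABI) jumps to [7] */
    if (nr > upper_nr_limit)
        return SECCOMP_RET_KILL_PROCESS;

    /* [4], [5]: the selected system call fails with 'f_errno' */
    if (nr == syscall_nr)
        return SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA);

    /* [6]: every other system call is allowed */
    return SECCOMP_RET_ALLOW;
}
```

       In this model, the first example run (system call 59, error number
       99, architecture AUDIT_ARCH_X86_64) maps execve(2) to
       SECCOMP_RET_ERRNO with 99 in the data bits, and maps all other
       native x86-64 system calls to SECCOMP_RET_ALLOW.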


COLOPHON

       This page is part of release 5.10 of the Linux man-pages project.  A
       description of the project, information about reporting bugs, and
       the latest version of this page, can be found at
       https://www.kernel.org/doc/man-pages/.



Linux                             2020-11-01                        SECCOMP(2)