rig(1) General Commands Manual rig(1)
2
3
4
NAME
rig - Monitor a system for events and trigger specific actions
7
SYNOPSIS
rig <RESOURCE OR SUBCOMMAND> [OPTIONS] <ACTIONS> [ACTION OPTIONS]
10
11
DESCRIPTION
rig is a tool to assist in troubleshooting seemingly randomly occurring
events or events that occur at times that make active monitoring by a
sysadmin difficult.
16
rig sets up detached processes, known as 'rigs', that watch a given
resource for a trigger condition and, once that condition is met,
take the actions defined by the user.
20
21
22
24 The following are options available to all rigs (resources).
25
26
27 --delay DELAY
28 Specify the number of seconds to wait after a rig is triggered
29 before running the configured actions. Note that the rig will
30 still trigger and stop all watcher threads immediately - this
31 delay comes after thread termination but before action execution
32 in order to avoid a possible race condition where multiple
33 watcher threads could conceivably trigger during a sufficiently
34 high delay time.
35
36 Default 0 seconds, meaning execute actions immediately upon rig
37 trigger condition being met.
38
39
40 --debug
41 Set logging level to debug instead of the default info level.
42
43
44 --foreground
45 Run the rig in the foreground, keeping stdout attached.
46
47 --interval SECONDS
48 Specify the amount of time to wait between a rig's polling cy‐
49 cles. Most rigs monitor their resources in a flow of update ->
50 compare -> wait, where wait is simply sleeping until the next
needed update. Use this option to set how long a rig should
sleep before updating its monitors again.
53
54 Default: 1, meaning update and compare once every second.
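
The update -> compare -> wait flow described above can be sketched as
follows. This is a minimal illustration only, not rig's actual code; the
names watch, update, and compare are hypothetical.

```python
import time


def watch(update, compare, interval=1.0, sleep=time.sleep):
    """Poll until compare() reports that the trigger condition is met.

    update and compare are hypothetical callables standing in for a
    rig's monitor; interval plays the role of --interval.
    """
    while True:
        value = update()        # refresh the monitored value
        if compare(value):      # has the trigger condition been met?
            return value
        sleep(interval)         # wait for the next polling cycle
```

With an interval of 1, a condition that first holds on the third update
is noticed after two sleeps.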
55
56
57 --name NAME
58 Give the rig a name, rather than generating a random one at de‐
59 ployment.
60
By default, rigs are given a randomly generated string as a
name, which will appear in rig info output and in the rig's
socket name. Using this option will use the provided name
instead, which may be useful for distinguishing rigs when
several are deployed at one time.
66
67 --no-archive
68 Do not create a tar archive of the collected data after a rig
69 has been triggered.
70
71 Normally, once all data has been collected, rig will create a
72 gzip'd tar archive under /var/tmp containing all the files cre‐
73 ated from the rig's actions - after which, the temp directory at
74 /var/tmp/rig/<id>/ is deleted.
75
76 Using this option skips creating the archive and preserves the
77 temp directory.
78
79
80 --repeat COUNT
81 Repeat certain actions COUNT number of times after the initial
82 execution of the action.
83
Unless otherwise specified by this option, actions execute only
once. With this option, actions that support repetition are
repeated an additional COUNT times. For example, using
--repeat 2 will result in repeatable actions being executed
three (3) total times.
89
Not every action supports repetition - in fact most do not. See
each action's section for information on whether it can be
repeated.
93
94
95 --repeat-delay SECONDS
96 Number of seconds to wait between repetitive executions of the
97 same action.
98
99 This can be useful when using an action like gcore when you want
100 to get coredumps over a certain time period. For example, using
101 --repeat 1 --repeat-delay 60 will give you two (2) coredumps
102 taken one minute apart.
103
104 Defaults to one second.
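
The arithmetic behind --repeat and --repeat-delay can be sketched as
below. The function name is hypothetical, and the sketch assumes the
delay is measured from the start of one execution to the start of the
next, ignoring how long the action itself takes.

```python
def execution_offsets(repeat=0, repeat_delay=1):
    """Seconds after the trigger at which a repeatable action runs.

    repeat mirrors --repeat (additional runs after the first) and
    repeat_delay mirrors --repeat-delay.
    """
    return [n * repeat_delay for n in range(repeat + 1)]
```

Here execution_offsets(1, 60) describes the two coredumps taken one
minute apart, and execution_offsets(2) describes three total runs.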
105
106
107 --restart COUNT
108 Restart a configured rig up to COUNT number of times after being
109 triggered.
110
111 By default, a rig will trigger once and then terminate. Using
112 this option, an individual rig may restart itself up to COUNT
113 number of times, producing an additional archive of the re‐
114 quested data after the triggering event happens again.
115
116 Note that this is the number of times to restart, not the total
117 number of times to run. Using a restart value of '2' means that
118 there will be 3 total archives generated for a rig.
119
120 By default, this is set to 0, meaning terminate after the first
121 trigger event. Use a value of '-1' to have a rig perpetually
122 restart itself without limit.
123
124
125
127 rig list
128 Show a list of known existing rigs and their status. Status in‐
129 formation is obtained by querying the socket created for that
130 particular rig.
131
132
133 rig destroy -i [ID or 'all']
Destroy a deployed rig with id ID. If ID is 'all', destroy all
known rigs. Note that if another entity kills the pid for the
running rig, destroy will fail as the socket is no longer
connected to the (now killed) process. In this case use the
--force option to clean up the lingering socket.
139
140 Any data the rig has generated will be lost when invoking de‐
141 stroy.
142
143
144 rig info -i [ID]
Get detailed information on a rig. This information includes
configuration options, the entire cmdline string given to launch
the rig, and information on each action the rig is configured
to take and the expected result of each action.
150
151 Currently, this data is written to stdout in JSON format.
152
153
154 rig trigger -i [ID]
155 Manually trigger rig with id ID. This will cause the specified
156 rig to begin executing the actions configured for it, as if the
157 trigger condition had been met.
158
159 Note that this is only effective on a single rig basis, so using
160 a value of 'all' for the ID will not work.
161
162
164 These are the system resources that rig can monitor. There may be addi‐
165 tional manpages for specific resources. Where applicable this will be
166 noted below.
167
168 Note that 'resources', 'monitors', and referencing 'a rig' as a dis‐
169 tinct entity all refer to the same thing.
170
When a rig is created successfully, its ID is printed to the console.
173
174
logs Watch one or more log files and/or journald units for a
specified message. When that message is matched in any watched
file or journal, the trigger condition is met and the configured
actions are initiated.
179
180 The following options are available for the logs rig:
181
182 -m|--message STRING
183 Define the string that serves as the trigger condition
184 for the rig. This can be a regex string or an exact mes‐
185 sage. Be very careful in using the '*' regex character as
186 this may cause unintended behavior such as the rig imme‐
187 diately triggering on the first message seen.
188
Note that a small amount of transformation and testing is
done on the provided STRING. First, '*' characters are
converted to the python-style regex match of '.*'. Then rig
tests whether the resulting pattern regex-matches the
provided message itself; if that test fails, rig aborts
the creation process.
195
196 Aside from the conversion noted above, regexes provided
197 in this option must be python-style and not shell-style.
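
The transformation and self-match test described above might look like
the following sketch. It is an assumption-laden illustration, not rig's
implementation: the function name is hypothetical, and it assumes that
only a '*' not already preceded by '.' is rewritten.

```python
import re


def prepare_message(message):
    """Convert shell-style '*' to '.*' and verify the result.

    Assumption: a '*' already preceded by '.' is left untouched.
    Raises ValueError when the pattern is invalid or does not
    match the message it was built from.
    """
    pattern = re.sub(r"(?<!\.)\*", ".*", message)
    try:
        compiled = re.compile(pattern)
    except re.error as err:
        raise ValueError(f"invalid regex {pattern!r}: {err}")
    if not compiled.search(message):
        raise ValueError(f"{pattern!r} does not match its own message")
    return pattern
```

A message such as 'a+b' fails the self-match test, because as a regex
it matches 'ab' or 'aab' but not the literal text 'a+b'. Escaping the
special character ('a\+b') avoids this.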
198
199
200 --logfile FILE
201 A comma-delimited list of files to watch. Each FILE spec‐
202 ified will be monitored from the current end of the file,
203 so old entries will not set off the rig's actions.
204
205 Default: /var/log/messages
206
207 --no-files
208 Do not monitor any log files.
209
210 --journal UNIT
211 A comma-delimited list of journal units to watch. The
212 journal is watched as a singular entity, and will be fil‐
213 tered to only read from the provided UNIT(s). If no UNIT
214 is specified, the whole system journal will be monitored.
215
216 Default: 'system'
217
218 --no-journal
219 Do not monitor the journal.
220
221 --count COUNT
The number of times the --message string should be
matched before the rig is triggered. Default: 1, meaning
trigger on the first occurrence.
225
226
227
228 ping Perform a simple ongoing ping test against a specified host.
229 Pings are sent one at a time at a defined interval, and the re‐
230 sponse is evaluated. Ping-type rigs may monitor for number of
231 lost packets and/or packets exceeding a specified RTT in mil‐
232 liseconds.
233
Packets are first evaluated for loss (including timeouts), then
for RTT.
236
237 The following options are available for the ping rig:
238
239 --host ADDRESS
240 The target IP or hostname to ping. This is a required op‐
241 tion in order for a ping rig to be created.
242
243 During rig creation, a 'sanity check' ping is sent to the
244 ADDRESS to ensure that it is an address that is reachable
245 on the network and that it will respond to ICMP packets.
246 If this sanity check fails, rig creation is aborted.
247
248 --ping-timeout SECONDS
249 Specify the number of SECONDS to allow for a ping re‐
250 sponse. If a ping encounters a timeout, then it is con‐
251 sidered both a lost packet and a packet exceeding the RTT
252 threshold (see --ping-ms-max and --ping-ms-count).
253
254 --lost-count PACKETS
255 Specify the number of PACKETS to accept being lost or
256 timed-out, before triggering the rig.
257
258 Default: 1 (trigger on the first lost packet)
259
260 --ping-interval SECONDS
261 Specify the number of SECONDS to wait between ping re‐
262 quests sent to the target host.
263
264 Default: 1
265
266 --ping-ms-max MILLISECONDS
267 Specify the RTT threshold to allow for a returned ping
268 request. If the RTT reported by the ping command is above
269 this value in milliseconds, it is counted against the
270 threshold of packets exceeding this value specified by
271 --ping-ms-count.
272
273 By default, this form of checking is disabled. Any inte‐
274 ger value passed to this option will enable RTT monitor‐
275 ing.
276
277 --ping-ms-count PACKETS
278 Specify the number of PACKETS that may exceed the defined
279 --ping-ms-max RTT value before triggering the rig.
280
281 Default: 5
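
How the loss and RTT counters interact can be sketched as below. The
class is hypothetical, and it assumes the rig triggers as soon as a
counter reaches its configured value, in line with the --lost-count
default of 1 triggering on the first lost packet.

```python
class PingCounters:
    """Track lost and slow packets for a ping-type rig (illustrative)."""

    def __init__(self, lost_count=1, ms_max=None, ms_count=5):
        self.lost_count = lost_count   # --lost-count
        self.ms_max = ms_max           # --ping-ms-max (None disables RTT checks)
        self.ms_count = ms_count       # --ping-ms-count
        self.lost = 0
        self.slow = 0

    def record(self, rtt_ms):
        """Record one result; rtt_ms is None for a lost or timed-out ping.

        Returns True when the rig should trigger.
        """
        if rtt_ms is None:
            self.lost += 1
            if self.ms_max is not None:
                self.slow += 1         # a timeout also counts against RTT
        elif self.ms_max is not None and rtt_ms > self.ms_max:
            self.slow += 1
        # Loss is evaluated first, then RTT.
        if self.lost >= self.lost_count:
            return True
        return self.ms_max is not None and self.slow >= self.ms_count
```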
282
283 process
284 Watch a single process or list of processes for state changes or
285 resource consumption thresholds. When the process enters the
286 specified state or the specified resource consumption threshold
287 is met, the trigger condition is met.
288
289 The following options are available for the process rig:
290
291 --proc A PID or process name of processes to watch. If a process
292 name is specified, then rig will attempt to convert this
293 to a PID during rig creation. If multiple PIDs are found,
294 the default behavior is to fail creation and exit. To
295 have rig monitor all processes found for a process name,
296 use the --all option.
297
298 --state STATE
299 The state that a process needs to be in, in order to
300 trigger the rig. The following is a list of supported
301 states:
302
NAME         DESCRIPTION                   SHORTHAND
dead         Dead - should never be seen   'X'
disk-sleep   Uninterruptible sleep         'D' or 'UN'
running      Currently running             'R' or 'run'
sleeping     Interruptible sleep           'S' or 'sleep'
stopped      Stopped                       'T' or 'stop'
zombie       Exited, still in proc table   'Z' or 'zomb'
316
317 Users can use either the full status name, or the short‐
318 hand noted in the final column of the table above. Both
319 the names and the shorthand values are case sensitive.
320
This can also be set to a "not" value by preceding one
of the above state strings with an exclamation mark (!),
e.g. '!sleeping' will match any non-sleep (S) state
status for the process(es). Most shells will require you to
quote the state string when using the '!' character.
326
327 Note that using '!running' will cause rig to not trigger
328 against a state of 'sleeping', as generally speaking
329 'running' processes spend much of their time in S state,
330 and it is assumed that triggering against such a process
331 is not desired.
332
333 Process status is polled once every second.
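
The state names, shorthand values, and '!' negation described above can
be modelled with the following sketch. The function names are
hypothetical and the code only mirrors the documented behaviour,
including the '!running' exception for sleeping processes.

```python
STATES = {
    "dead": ("X",),
    "disk-sleep": ("D", "UN"),
    "running": ("R", "run"),
    "sleeping": ("S", "sleep"),
    "stopped": ("T", "stop"),
    "zombie": ("Z", "zomb"),
}


def normalize(state):
    """Map a full name or shorthand (case sensitive) to the full name."""
    for name, shorthands in STATES.items():
        if state == name or state in shorthands:
            return name
    raise ValueError(f"unsupported state: {state!r}")


def state_matches(wanted, current):
    """Return True if a process in state current satisfies --state wanted."""
    negate = wanted.startswith("!")
    target = normalize(wanted[1:] if negate else wanted)
    current = normalize(current)
    if not negate:
        return current == target
    if target == "running" and current == "sleeping":
        return False    # '!running' deliberately ignores sleeping processes
    return current != target
```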
334
335 --rss INTEGER
336 The amount of rss (resident set size) memory usage to use
337 as a threshold for triggering the rig. If the process'
338 RSS usage goes above this value, trigger.
339
340 The value provided here may be suffixed with K, M, or G
341 to denote the IEC unit. Rig will convert the provided
342 value and suffix into a value in bytes.
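
The suffix conversion can be illustrated with a short sketch. The
function name is hypothetical, and treating a value without a suffix as
a plain byte count is an assumption.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}   # IEC multipliers


def to_bytes(value):
    """Convert a string such as '512M' into a number of bytes."""
    value = value.strip()
    suffix = value[-1]
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)   # assumption: no suffix means bytes
```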
343
344 --vms INTEGER
345 The same as --rss but monitoring Virtual Memory Size in‐
346 stead.
347
348 --memperc PERCENT
349 The percentage of total system memory a process is con‐
350 suming to use as a threshold for triggering the rig. If
351 the process' %mem meets or exceeds this value, trigger.
352
PERCENT may be a whole integer or a float. When using a
float, the process rig respects up to two (2) decimal
places of precision. For example, using '--memperc 10.25'
is the same as using '--memperc 10.25678'.
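
Since 10.25678 is treated the same as 10.25 rather than 10.26, the
extra digits appear to be dropped rather than rounded. A minimal sketch
under that assumption, with a hypothetical function name:

```python
from decimal import Decimal, ROUND_DOWN


def normalize_percent(value, places=2):
    """Truncate a percentage to the given number of decimal places."""
    step = Decimal(10) ** -places      # 0.01 for two places
    return Decimal(str(value)).quantize(step, rounding=ROUND_DOWN)
```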
357
358 --cpuperc PERCENT
359 The percentage of CPU usage a process is consuming to use
360 as a threshold for triggering the rig. If the process'
361 %cpu meets or exceeds this value, trigger.
362
363 PERCENT may be a whole integer or a float. When using a
364 float and monitoring for CPU usage, rig respects one (1)
365 decimal point of precision due to how CPU usage is re‐
366 ported.
367
368 PERCENT may be above 100 - as CPU usage can exceed 100
369 when a process is running on multiple CPUs.
370
371
372 system
373
Watch the system's utilization of resources as a whole, e.g.
total CPU or memory usage. When the utilization of a given
resource either exceeds or falls below the given threshold
(determined as appropriate for each resource), the trigger
condition is met.
379
380 The following options are available for the system rig:
381
382 --iowait PERCENT
383 The amount of %iowait as reported by the kernel to use as
384 a threshold value.
385
386 If exceeded, trigger the rig.
387
388 --steal PERCENT
389 The amount of %steal as reported by the kernel to use as
390 a threshold value.
391
392 If exceeded, trigger the rig.
393
394 --nice PERCENT
395 The amount of %nice as reported by the kernel to use as a
396 threshold value.
397
398 If exceeded, trigger the rig.
399
400 --guest PERCENT
401 The amount of %guest as reported by the kernel to use as
402 a threshold value.
403
404 If exceeded, trigger the rig.
405
406 --user The amount of %user as reported by the kernel to use as a
407 threshold value.
408
409 If exceeded, trigger the rig.
410
411 --available INTEGER
412 The amount of available memory in MiB as reported by the
413 kernel to use as a threshold value.
414
415 If the amount of available memory falls below this
416 threshold, trigger the rig.
417
418 --free INTEGER
419 The amount of free memory in MiB as reported by the ker‐
420 nel to use as a threshold value.
421
422 If the amount of free memory falls below this threshold,
423 trigger the rig.
424
425 --used INTEGER
426 The amount of used memory in MiB as reported by the ker‐
427 nel to use as a threshold value.
428
429 If the amount of used memory exceeds this threshold,
430 trigger the rig.
431
432 --slab INTEGER
433 The amount of slab memory in MiB as reported by the ker‐
434 nel to use as a threshold value.
435
436 If the amount of slab memory exceeds this threshold,
437 trigger the rig.
438
439 --cpuperc PERCENT
440 The amount of total CPU usage as reported by the kernel
441 as a percentage to use as a threshold value.
442
443 If exceeded, trigger the rig.
444
445 This value may be a whole integer or a float. Floats are
446 precise out to one (1) decimal point.
447
448 --memperc PERCENT
The amount of total memory usage as reported by the kernel
as a percentage to use as a threshold value.
451
452 If exceeded, trigger the rig.
453
454 This value may be a whole integer or a float. Floats are
455 precise out to one (1) decimal point.
456
457 --loadavg FLOAT
458 System load average as reported by the OS to use as a
459 threshold value. If the reported loadavg exceeds this
460 value, trigger the rig. This option can accept either an
461 integer (1) or a float (1.0).
462
463 Linux returns loadavg data for the past 1, 5, and 15 min‐
464 utes. The system rig will monitor only one (1) of these
465 intervals at a time, as controlled by the --loadavg-in‐
466 terval option.
467
468 --loadavg-interval [1, 5, 15]
469 Which time interval the rig should monitor when watching
470 the system's loadavg. Only 1, 5, and 15 are accepted
471 values for this option, as that is what the Linux kernel
472 returns loadavg data for.
473
474 Default: 1
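
Selecting one of the three load averages can be sketched with Python's
os.getloadavg(), which returns the 1, 5, and 15 minute values as a
tuple. The function name and the sample parameter (used for testing
without reading the live system) are hypothetical.

```python
import os

INDEX = {1: 0, 5: 1, 15: 2}   # --loadavg-interval -> tuple position


def loadavg_exceeded(threshold, interval=1, sample=None):
    """Return True if the chosen load average exceeds threshold."""
    if interval not in INDEX:
        raise ValueError("interval must be 1, 5, or 15")
    if sample is None:
        sample = os.getloadavg()
    return sample[INDEX[interval]] > float(threshold)
```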
475
476 --temp INTEGER
The CPU temperature threshold, in degrees Celsius. The rig
triggers when the CPU temperature meets or exceeds this value.
479
480 This option takes an integer value, though temperature
481 data is single decimal point sensitive, so a temperature
482 of 50.9 degrees will not trigger a rig that sets this op‐
483 tion to 51.
484
485 By default rig will monitor the first physical CPU pack‐
486 age installed on the system. This may be changed via the
487 --cpu-id option. Note that rig will only monitor whole
488 packages and not individual cores, and that package tem‐
489 peratures reported are the highest reported temperature
490 for any core in that package.
491
492
493 --cpu-id ID
494 If specified, monitor this physical CPU package. By de‐
495 fault, rig will monitor physical CPU package 0 - meaning
496 the first physically installed CPU.
497
498 When specifying an ID here, remember that in Linux CPU
499 IDs are zero-indexed, so the first CPU will be ID 0, the
500 second ID 1, and so forth.
501
502 Default: 0
503
filesystem
505
506 Watch a filesystem, directory, or file for utilization changes.
507 Currently this rig is focused on space consumption, and will
508 trigger when the specified path or backing filesystem exceeds
509 the defined threshold for space utilization.
510
511 The following options are available for the filesystem rig:
512
513 --path PATH
Specify the filesystem, directory, or file path for the
rig to monitor. The location provided must exist when
the rig initializes for monitoring to be supported.
517
518 --size SIZE
519 Specify the size threshold to trigger on for the provided
520 --path. The size given must be an integer suffixed with
521 either K, M, G, or T. The provided value will be con‐
522 verted to bytes.
523
524 --fs-size SIZE
525 Use this option instead of --size if you want to monitor
526 the space usage of the backing filesystem for --path
527 rather than the size of the path alone.
528
529 Similar to --size this value must be suffixed with either
530 K, M, G, or T.
531
532 --fs-used PERCENT
533 Similar to --fs-size but instead provide a percentage
534 value to trigger on, when the filesystem's %used exceeds
535 this value.
536
537 Note that using this option is ultimately the same as
538 --fs-size as rig will convert the specified percentage
539 into a raw bytes value to use for comparisons.
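
The percentage-to-bytes conversion can be sketched with
shutil.disk_usage(). The function names are hypothetical, and the
sketch assumes the percentage is taken against the filesystem's total
size, which may differ slightly from the %used figure tools such as df
report.

```python
import shutil


def percent_to_bytes(total_bytes, percent):
    """Convert a --fs-used style percentage into a byte threshold."""
    return int(total_bytes * percent / 100)


def fs_used_exceeded(path, percent):
    """Return True if the filesystem backing path is over the threshold."""
    usage = shutil.disk_usage(path)
    return usage.used > percent_to_bytes(usage.total, percent)
```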
540
541
542
544 The following actions are supported responses to triggered rigs. These
545 may be chained together on a single rig, so deploying multiple rigs
546 with matching trigger conditions with single, varying actions is unnec‐
547 essary.
548
549 Actions are executed based on a priority weighting system, where lower
550 values represent a higher priority action, and those actions with lower
551 values are executed before those with higher values. This is to allow
552 more time-sensitive actions to be taken before those that may either
553 take a long time to execute or are otherwise unaffected by allowing
554 other actions to run before them. Action priority values are set by the
555 actions directly and are currently not able to be modified by users.
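
The priority ordering can be pictured as a simple sort. The weights
below are hypothetical placeholders chosen only to reproduce the
documented ordering (gcore first, then tcpdump, with sosreport after
time-sensitive actions); the real values are internal to each action.

```python
# Hypothetical weights: lower values run first.
ACTION_PRIORITY = {"gcore": 1, "tcpdump": 2, "sosreport": 100}


def execution_order(enabled):
    """Return the enabled action names sorted by priority weight."""
    return sorted(enabled, key=lambda name: ACTION_PRIORITY[name])
```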
556
557 gcore Collect a coredump of a given process or processes using GDB's
558 gcore utility.
559
560 Note that this does _not_ interrupt the running process(es).
561 Cores are saved to /tmp and will be named either core.$pid or
562 core.$proc_name.$pid depending on if a PID or process name was
563 provided. This action will be executed first when a rig is trig‐
564 gered and multiple actions are specified.
565
566 This action supports repetition via the --repeat option.
567
568 The gcore action supports the following options:
569
570 --gcore PROCESS
571 Enables this action and takes either a PID or process
572 name as a value. If a process name is given, the PID is
573 determined at rig creation. If multiple PIDs are found
574 for the same process name, the default behavior is to
575 fail rig creation. Use the --all-pids option to instead
576 use all PIDs discovered for a process name.
577
This option can be specified multiple times. For example,
--gcore 12345 --gcore myprocess will generate a coredump
for PID 12345 and a process matching the name 'myprocess'.
581
582
583 --all-pids
584 Tells this action to collect a coredump for all PIDs
585 found for a provided process name.
586
587
588 --freeze
589 Freeze the process(es) that will be core dumped by send‐
590 ing a SIGSTOP prior to calling gcore on the discovered
591 pid(s).
592
593 If successful, then rig will send a SIGCONT after the
594 gcore execution has completed in order to thaw the
595 process.
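
The freeze, dump, and thaw sequence can be sketched as follows. This is
an illustration rather than rig's own code: the function name and the
injectable send_signal and run parameters are hypothetical, and the
gcore invocation shown uses that utility's -o prefix option.

```python
import os
import signal
import subprocess


def frozen_gcore(pid, output_prefix, send_signal=os.kill, run=subprocess.run):
    """Stop a process, dump its core with gcore, then resume it."""
    send_signal(pid, signal.SIGSTOP)        # freeze the process
    try:
        # gcore writes the core to <output_prefix>.<pid>
        return run(["gcore", "-o", output_prefix, str(pid)], check=False)
    finally:
        send_signal(pid, signal.SIGCONT)    # always thaw, even on failure
```

Using try/finally means the process is resumed even if gcore fails, so
a failed dump does not leave the target stopped.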
596
597
598 kdump Generate a vmcore by triggering a kernel crash via sysrq.
599
600 Note that this action WILL cause node disruption by triggering a
601 kernel panic to generate the vmcore. This means your system will
602 reboot when this action is triggered.
603
604 The kdump action does not perform any configuration checks on
605 the system's kdump installation. It is assumed that kdump has
606 been properly configured and tested prior to using this action.
607
608 The kdump action supports the following options:
609
610 --kdump
Enables this action.
612
613
614 --sysrq INTEGER
615 When the rig is deployed, if this option is set, rig will
616 set the system's /proc/sys/kernel/sysrq to the value pro‐
617 vided. See sysrq kernel documentation for information on
618 what values are supported.
619
620
621
622 sosreport
Run an sos report after the rig has been triggered. Select
plugin enablement options, as well as the --plugin-option option
from sos report, are supported by this action. This action should
run after any time-sensitive actions otherwise specified by the
user for a given rig.
628
629 The sosreport action supports the following options:
630
631 --sosreport
Enables this action.
633
634 --enable-plugins PLUGINS
635 Specifically force the specified comma-delimited list of
636 PLUGINS to be enabled.
637
638 --plugin-option PLUGOPT
639 Modify a specific plugin's runtime options. This is
640 passed directly to sos report as the same --plugin-option
641 value, which should take the form 'name.option=value'.
For example, to increase the podman plugin timeout use
'--plugin-option podman.timeout=600'.
644
645 If you need to pass multiple sos report plugin options,
646 use a comma-delimited list here instead of specifying
647 this option multiple times.
648
649 --skip-plugins PLUGINS
650 Do not run these specified plugins. Use a comma-delimited
651 list to skip multiple plugins.
652
653 --only-plugins PLUGINS
654 Only enable these specific plugins, disable all others.
655 Use a comma-delimited list to specify multiple plugins.
656
657
658 tcpdump
659 Start collecting a tcpdump when the rig is initialized, and stop
660 the collection when the rig triggers. This action will be trig‐
661 gered before most other actions, but after the gcore action.
662
663 Note there will be a slight delay in configuring any rig that
664 uses the tcpdump action as rig must verify that the tcpdump
665 process started successfully during the initialization process.
666
667 The tcpdump action supports the following options:
668
669 --tcpdump
Enables this action.
671
672 --iface INTERFACE
673 Starts the tcpdump to monitor the provided INTERFACE. In
674 almost all situations this should likely be set to a spe‐
675 cific interface on the system, however the value of 'any'
676 is accepted by the tcpdump command in order to listen on
all interfaces. Be wary of using this, however, as use of
'any' makes it impossible to determine which interface a
particular packet came in on in the resulting packet
capture.
681
682 Default: eth0
683
684 --filter FILTER
685 Provide a filter to use with tcpdump in order to reduce
686 the amount of traffic recorded in the packet capture.
687 This value is passed directly to the tcpdump utility, and
688 thus can be any valid filter accepted by tcpdump.
689
690 For most shells you must quote the filter string for rig
691 to pass it correctly.
692
693 --snaplen LENGTH --snapshot-length LENGTH
Set the snapshot length for the packet capture. This will
truncate captured packets to LENGTH bytes. The default
value of 0 implies a LENGTH of 262144 bytes.
698
699 --dump-size SIZE
700 Limit the size of the packet capture file(s) to SIZE in
701 MB.
702
703 Default: 10
704
705 --captures CAPTURES
Specify the number of packet capture files to keep. If
more than one (1), then tcpdump will rotate the packet
capture file when it reaches the --dump-size value and keep
CAPTURES number of files.

For example, using a CAPTURES of 2 and a DUMP-SIZE of 5,
when the rig terminates you will have up to two 5MB packet
captures.
714
715 Default: 1 (packet capture file is replaced upon reaching
716 SIZE limit).
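
One plausible way these options map onto a tcpdump command line is
sketched below. The mapping is an assumption, not taken from rig
itself: it uses tcpdump's own -i, -s, -C, -W, and -w flags, and the
function names and the outfile default are hypothetical.

```python
def tcpdump_argv(iface="eth0", capture_filter=None, snaplen=0,
                 dump_size=10, captures=1, outfile="capture.pcap"):
    """Build a tcpdump argument list from the action's options."""
    argv = [
        "tcpdump",
        "-i", iface,            # --iface
        "-s", str(snaplen),     # --snaplen (0 selects tcpdump's default)
        "-C", str(dump_size),   # --dump-size: rotate at this file size
        "-W", str(captures),    # --captures: number of files to keep
        "-w", outfile,
    ]
    if capture_filter:
        argv.append(capture_filter)   # --filter, passed through as-is
    return argv


def max_capture_mb(dump_size=10, captures=1):
    """Upper bound on disk space used by the capture files, in MB."""
    return dump_size * captures
```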
717
718
719 monitor
720 While a rig is running, monitor various system statistics and
721 record them for later review. These statistics may be file con‐
722 tents or command outputs.
723
724 This action begins collecting information when the rig is
725 started, and stops when the rig is triggered.
726
727 By default, networking-centric information is monitored via com‐
728 mands such as netstat, ss, top, ps, and more. Similarly several
729 networking-related files under /proc/ are monitored.
730
731 The rate at which these collections take place is controlled via
732 the --interval option.
733
734 The monitor action supports the following options:
735
736 --monitor
737 Enables this action.
738
739 --disable-monitor-defaults
740 Do not monitor or collect any of the default items. This
741 implies that all collections will be specified via --mon‐
742 itor-files and/or --monitor-commands.
743
744 --monitor-files FILES
745 A comma-delimited list of files to monitor. Monitored
746 files have their contents copied to a file within the
747 rig's archive of the same name. The contents will be sep‐
748 arated by a timestamp header taken at the time of collec‐
749 tion.
750
751 --monitor-commands COMMANDS
752 A comma-delimited list of commands to execute every --in‐
753 terval seconds, and have that output saved to a file
754 within the rig's archive. Output collections are sepa‐
755 rated by a timestamp header taken at the time of collec‐
756 tion.
757
758 Note that commands will need to be properly quoted if
759 there are spaces (or other quotes) in the command string.
760 For example, to run 'ps auxwww' the proper invocation
761 would be --monitor-commands='ps auxwww'.
762
763 In-line shell scripting is not supported. While it may be
764 possible for such values to function, there are no guar‐
765 antees as to those executions working properly, at all,
766 not causing unintended side-effects or harm, et cetera.
767 Dragons ahead, and so forth.
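
A single collection cycle for one monitored command might look like the
sketch below. It is illustrative only: the function name and the header
format are hypothetical. The command is split with shlex and run
without a shell, which is one reason in-line shell scripting such as
pipes or redirection would not behave as expected.

```python
import shlex
import subprocess
from datetime import datetime


def collect(command, outfile):
    """Run command once and append its output under a timestamp header."""
    argv = shlex.split(command)     # no shell: pipes and redirects are not interpreted
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    with open(outfile, "a") as handle:
        handle.write(f"==== {datetime.now().isoformat()} ====\n")
        handle.write(result.stdout)
```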
768
769
770 noop
771
Does nothing - this action runs a no-op. This is ideally used
when you need to test a rig's configuration to make sure its
trigger condition is set properly - e.g. a regex string for
the logs rig's --message option.
776
777 The noop action supports the following options:
778
--noop Enables this action.
780
Jake Hunsaker <jhunsake@redhat.com>
783
784
785
January 2019 rig(1)