rig(1)                      General Commands Manual                     rig(1)

rig - Monitor a system for events and trigger specific actions

rig <RESOURCE OR SUBCOMMAND> [OPTIONS] <ACTIONS> [ACTION OPTIONS]

rig is a tool to assist in troubleshooting events that occur seemingly at
random, or at times that make active monitoring by a sysadmin difficult.

rig sets up detached processes, known as 'rigs', that watch a given resource
for a trigger condition and, once that trigger condition is met, take the
actions defined by the user.


The following options are available to all rigs (resources).

--delay DELAY
       Specify the number of seconds to wait after a rig is triggered before
       running the configured actions. Note that the rig will still trigger
       and stop all watcher threads immediately. The delay comes after
       thread termination but before action execution, in order to avoid a
       possible race condition where multiple watcher threads could
       conceivably trigger during a sufficiently long delay.

       Default: 0 seconds, meaning actions are executed immediately once the
       rig's trigger condition is met.

--debug
       Set logging level to debug instead of the default info level.

--foreground
       Run the rig in the foreground, keeping stdout attached.

--interval SECONDS
       Specify the amount of time to wait between a rig's polling cycles.
       Most rigs monitor their resources in a flow of update -> compare ->
       wait, where wait is simply sleeping until the next needed update. Use
       this option to set how long a rig should sleep before updating its
       monitors again.

       Default: 1, meaning update and compare once every second.
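
The update -> compare -> wait flow can be pictured with a short Python
sketch. This is only an illustration and not rig's actual code: the function
name, its parameters and the max_cycles safety limit are all assumptions made
for the example.

```python
import time
from typing import Callable


def poll_until_triggered(
    update: Callable[[], float],
    compare: Callable[[float], bool],
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: int = 1000,
) -> int:
    """Run update -> compare -> wait until compare() reports a trigger.

    Returns the 1-based cycle on which the trigger fired, or -1 if
    max_cycles (a safety limit added for this sketch) is reached first.
    """
    for cycle in range(1, max_cycles + 1):
        value = update()      # refresh the monitored value
        if compare(value):    # check it against the trigger condition
            return cycle
        sleep(interval)       # wait before the next update
    return -1


# Example: a fake resource that grows by 10 per poll, with a threshold of 35.
readings = iter(range(10, 200, 10))
fired_on = poll_until_triggered(
    update=lambda: next(readings),
    compare=lambda value: value > 35,
    interval=1.0,
    sleep=lambda seconds: None,  # skip real sleeping in the example
)
print(fired_on)
```

A larger interval reduces how often the resource is sampled, at the cost of
noticing a short-lived condition later or not at all.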

--name NAME
       Give the rig a name, rather than generating a random one at
       deployment.

       By default, rigs are given a randomly generated string as a name,
       which will appear in rig info output and in the rig's socket name.
       With this option the provided name is used instead, which may be
       useful for distinguishing rigs when several are deployed at one time.

--no-archive
       Do not create a tar archive of the collected data after a rig has
       been triggered.

       Normally, once all data has been collected, rig will create a gzip'd
       tar archive under /var/tmp containing all the files created by the
       rig's actions, after which the temp directory at /var/tmp/rig/<id>/
       is deleted.

       Using this option skips creating the archive and preserves the temp
       directory.

--repeat COUNT
       Repeat certain actions COUNT number of times after the initial
       execution of the action.

       Unless otherwise specified by this option, actions execute only once.
       With this option, actions that support repetition will be repeated an
       additional COUNT number of times. For example, using --repeat 2 will
       result in repeatable actions being executed three (3) total times.

       Not every action supports repetition; in fact most do not. See each
       action's section for whether it can be repeated.

--repeat-delay SECONDS
       Number of seconds to wait between repeated executions of the same
       action.

       This can be useful when using an action like gcore when you want to
       get coredumps over a certain time period. For example, using
       --repeat 1 --repeat-delay 60 will give you two (2) coredumps taken
       one minute apart.

       Default: 1 second.
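
The arithmetic behind --repeat and --repeat-delay can be written out as a
small Python sketch. The function name is hypothetical, and the offsets
ignore however long the action itself takes to run, which is a simplification
made for the example.

```python
from typing import List


def execution_offsets(repeat: int = 0, repeat_delay: float = 1.0) -> List[float]:
    """Return when each execution starts, in seconds after the trigger.

    The action always runs once, then `repeat` additional times, with
    `repeat_delay` seconds between consecutive executions.
    """
    total_runs = repeat + 1
    return [run * repeat_delay for run in range(total_runs)]


# --repeat 2: three executions in total.
print(len(execution_offsets(repeat=2)))

# --repeat 1 --repeat-delay 60: two executions, one minute apart.
print(execution_offsets(repeat=1, repeat_delay=60))
```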

--restart COUNT
       Restart a configured rig up to COUNT number of times after being
       triggered.

       By default, a rig will trigger once and then terminate. Using this
       option, an individual rig may restart itself up to COUNT number of
       times, producing an additional archive of the requested data each
       time the triggering event happens again.

       Note that this is the number of times to restart, not the total
       number of times to run. Using a restart value of '2' means that up to
       3 total archives will be generated for a rig.

       Default: 0, meaning terminate after the first trigger event. Use a
       value of '-1' to have a rig restart itself perpetually, without limit.


rig list
       Show a list of known existing rigs and their status. Status
       information is obtained by querying the socket created for that
       particular rig.

rig destroy -i [ID or 'all']
       Destroy a deployed rig with id ID. If ID is 'all', destroy all known
       rigs. Note that if another entity kills the pid for the running rig,
       destroy will fail as the socket is no longer connected to the (now
       killed) process. In this case use the --force option to clean up the
       lingering socket.

rig info -i [ID]
       Get detailed information on a rig. This information will include
       configuration options, the entire cmdline string given to launch the
       rig, as well as information on each action the rig is configured to
       take and what the expected results of those action(s) are.

       Currently, this data is written to stdout in JSON format.

rig trigger -i [ID]
       Manually trigger rig with id ID. This will cause the specified rig to
       begin executing the actions configured for it, as if the trigger
       condition had been met.

       Note that this is only effective on a single rig basis, so using a
       value of 'all' for the ID will not work.


These are the system resources that rig can monitor. There may be additional
manpages for specific resources. Where applicable this will be noted below.

Note that 'resources', 'monitors', and references to 'a rig' as a distinct
entity all refer to the same thing.

When a rig is created successfully, its ID is printed to the console.

logs   Watch one or more log files and/or journald units for a specified
       message. When that message is matched in any watched file or journal,
       the trigger condition is met and the configured actions are
       initiated.

       The following options are available for the logs rig:

       -m|--message STRING
              Define the string that serves as the trigger condition for the
              rig. This can be a regex string or an exact message. Be very
              careful in using the '*' regex character, as this may cause
              unintended behavior such as the rig immediately triggering on
              the first message seen.

              Note that a small amount of transformation and testing is done
              on the provided STRING. First, '*' characters are converted to
              the python-style regex match of '.*'. Then rig tests whether
              the provided message will regex-match itself; if that test
              fails, rig aborts the creation process.

              Aside from the conversion noted above, regexes provided in
              this option must be python-style and not shell-style.
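
The conversion and self-match test described for --message can be
approximated with Python's re module. This is a sketch under stated
assumptions, not rig's implementation: the function names are hypothetical,
and the use of re.match (rather than re.search) is a guess.

```python
import re
from typing import Optional


def build_pattern(message: str) -> str:
    """Convert each '*' into the Python-style '.*' wildcard."""
    return message.replace("*", ".*")


def validate_message(message: str) -> Optional[str]:
    """Return the converted pattern, or None if creation should abort.

    The pattern is rejected when it does not compile, or when it fails
    to match the original message text.
    """
    pattern = build_pattern(message)
    try:
        if re.match(pattern, message) is None:
            return None
    except re.error:
        return None
    return pattern


print(validate_message("link * is down"))   # wildcard in the middle
print(validate_message("failed (code 5)"))  # unescaped parentheses
print(validate_message("*"))                # matches every message
```

The second call returns None because the parentheses form a regex group
rather than literal characters, so the pattern cannot match its own text;
escaping them as \( and \) avoids this. The third call shows why a bare '*'
is risky: the resulting '.*' pattern matches any line.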

       --logfile FILE
              A comma-delimited list of files to watch. Each FILE specified
              will be monitored from the current end of the file, so old
              entries will not set off the rig's actions.

              Default: /var/log/messages

       --no-files
              Do not monitor any log files.

       --journal UNIT
              A comma-delimited list of journal units to watch. The journal
              is watched as a singular entity, and will be filtered to only
              read from the provided UNIT(s). If no UNIT is specified, the
              whole system journal will be monitored.

              Default: 'system'

       --no-journal
              Do not monitor the journal.

       --count COUNT
              The number of times the --message string should be matched
              before the rig is triggered.

              Default: 1, meaning trigger on the first occurrence.


ping   Perform a simple ongoing ping test against a specified host. Pings
       are sent one at a time at a defined interval, and the response is
       evaluated. Ping-type rigs may monitor for the number of lost packets
       and/or packets exceeding a specified RTT in milliseconds.

       Packets are first evaluated for loss (including timeouts), then for
       RTT.

       The following options are available for the ping rig:

       --host ADDRESS
              The target IP or hostname to ping. This option is required in
              order for a ping rig to be created.

              During rig creation, a 'sanity check' ping is sent to the
              ADDRESS to ensure that it is reachable on the network and that
              it will respond to ICMP packets. If this sanity check fails,
              rig creation is aborted.

       --ping-timeout SECONDS
              Specify the number of SECONDS to allow for a ping response. If
              a ping encounters a timeout, then it is considered both a lost
              packet and a packet exceeding the RTT threshold (see
              --ping-ms-max and --ping-ms-count).

       --lost-count PACKETS
              Specify the number of lost or timed-out PACKETS required to
              trigger the rig.

              Default: 1 (trigger on the first lost packet)

       --ping-interval SECONDS
              Specify the number of SECONDS to wait between ping requests
              sent to the target host.

              Default: 1

       --ping-ms-max MILLISECONDS
              Specify the RTT threshold to allow for a ping response. If the
              RTT reported by the ping command is above this value in
              milliseconds, it is counted against the threshold of packets
              exceeding this value specified by --ping-ms-count.

              By default, this form of checking is disabled. Any integer
              value passed to this option will enable RTT monitoring.

       --ping-ms-count PACKETS
              Specify the number of PACKETS that may exceed the defined
              --ping-ms-max RTT value before triggering the rig.

              Default: 5
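
How the ping options interact can be sketched in Python. The function below
is hypothetical and only models the counting rules described above; it
assumes the rig fires as soon as a counter reaches its configured value,
which matches the documented default for --lost-count but is a guess for
--ping-ms-count.

```python
from typing import List, Optional, Tuple


def evaluate_pings(
    rtts_ms: List[Optional[float]],
    lost_count: int = 1,
    ms_max: Optional[int] = None,
    ms_count: int = 5,
) -> Optional[Tuple[int, str]]:
    """Return (ping_number, reason) for the ping that triggers, else None.

    Each entry in rtts_ms is a round-trip time in milliseconds, or None
    for a lost or timed-out ping.
    """
    lost = 0
    slow = 0
    for number, rtt in enumerate(rtts_ms, start=1):
        if rtt is None:
            lost += 1
            if ms_max is not None:
                slow += 1  # a timeout also counts against the RTT threshold
        elif ms_max is not None and rtt > ms_max:
            slow += 1
        # Loss is evaluated first, then RTT.
        if lost >= lost_count:
            return number, "lost"
        if ms_max is not None and slow >= ms_count:
            return number, "rtt"
    return None


print(evaluate_pings([12.0, 15.5, None]))
print(evaluate_pings([40.0, 250.0, 260.0], ms_max=200, ms_count=2))
print(evaluate_pings([10.0, 11.0, 9.5], ms_max=200))
```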

process
       Watch a single process or list of processes for state changes or
       resource consumption thresholds. When the process enters the
       specified state or reaches the specified resource consumption
       threshold, the trigger condition is met.

       The following options are available for the process rig:

       --proc A PID or process name of processes to watch. If a process name
              is specified, then rig will attempt to convert this to a PID
              during rig creation. If multiple PIDs are found, the default
              behavior is to fail creation and exit. To have rig monitor all
              processes found for a process name, use the --all option.

       --state STATE
              The state that a process needs to be in, in order to trigger
              the rig. The following is a list of supported states:

              NAME         DESCRIPTION                   SHORTHAND
              dead         Dead - should never be seen   'X'
              disk-sleep   Uninterruptible sleep         'D' or 'UN'
              running      Currently running             'R' or 'run'
              sleeping     Interruptible sleep           'S' or 'sleep'
              stopped      Stopped                       'T' or 'stop'
              zombie       Exited, still in proc table   'Z' or 'zomb'

              Users can use either the full status name or the shorthand
              noted in the final column of the table above. Both the names
              and the shorthand values are case sensitive.

              This can also be set to a "not" value by preceding one of the
              above state strings with an exclamation mark (!), e.g.
              '!sleeping' will match any non-sleep (S) state status for the
              process(es). Most shells will require you to quote the state
              string when using the '!' character.

              Note that using '!running' will cause rig to not trigger
              against a state of 'sleeping', as generally speaking 'running'
              processes spend much of their time in S state, and it is
              assumed that triggering against such a process is not desired.

              Process status is polled once every second.
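
The --state matching rules, including negation and the '!running' exception,
can be modelled in a few lines of Python. The names below are hypothetical
and the sketch only mirrors the behavior described above; it makes no claim
about how rig stores or looks up states internally.

```python
# Shorthand values from the table above, mapped to the full state names.
SHORTHAND = {
    "X": "dead",
    "D": "disk-sleep", "UN": "disk-sleep",
    "R": "running", "run": "running",
    "S": "sleeping", "sleep": "sleeping",
    "T": "stopped", "stop": "stopped",
    "Z": "zombie", "zomb": "zombie",
}
STATES = set(SHORTHAND.values())


def state_triggers(spec: str, current: str) -> bool:
    """Return True if a process in state `current` satisfies `spec`.

    `spec` is a full state name or shorthand, optionally prefixed with
    '!'. `current` is the full name of the state the process is in.
    """
    negate = spec.startswith("!")
    wanted = spec[1:] if negate else spec
    wanted = SHORTHAND.get(wanted, wanted)  # case-sensitive lookup
    if wanted not in STATES:
        raise ValueError(f"unsupported state: {spec}")
    if not negate:
        return current == wanted
    if wanted == "running" and current == "sleeping":
        return False  # documented exception for '!running'
    return current != wanted


print(state_triggers("Z", "zombie"))
print(state_triggers("!sleeping", "running"))
print(state_triggers("!running", "sleeping"))
```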

       --rss INTEGER
              The amount of RSS (resident set size) memory usage to use as a
              threshold for triggering the rig. If the process's RSS usage
              goes above this value, trigger.

              The value provided here may be suffixed with K, M, or G to
              denote the IEC unit. Rig will convert the provided value and
              suffix into a value in bytes.
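
The suffix conversion can be sketched as follows. The helper name is
hypothetical; the sketch assumes powers of 1024 (as the reference to IEC
units suggests) and treats a value without a suffix as already being in
bytes, which the text above does not state.

```python
IEC_FACTORS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(value: str) -> int:
    """Convert a string such as '512M' into a number of bytes."""
    value = value.strip()
    suffix = value[-1:]
    if suffix in IEC_FACTORS:
        return int(value[:-1]) * IEC_FACTORS[suffix]
    return int(value)  # assumption: no suffix means plain bytes


print(to_bytes("512M"))
print(to_bytes("2G"))
```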

       --vms INTEGER
              The same as --rss but monitoring Virtual Memory Size instead.

       --memperc PERCENT
              The percentage of total system memory a process is consuming
              to use as a threshold for triggering the rig. If the process's
              %mem meets or exceeds this value, trigger.

              PERCENT may be a whole integer or a float. When using a float,
              the process rig respects up to two (2) decimal places of
              precision. For example, using '--memperc 10.25' is the same as
              using '--memperc 10.25678'.
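
The example above implies that extra digits are dropped rather than rounded,
since rounding 10.25678 would give 10.26. A minimal Python sketch of that
behavior, using hypothetical helper names and the standard decimal module:

```python
from decimal import Decimal, ROUND_DOWN


def limit_precision(percent: str, places: int) -> Decimal:
    """Drop any digits beyond `places` decimal places (no rounding up)."""
    quantum = Decimal(1).scaleb(-places)  # places=2 gives a step of 0.01
    return Decimal(percent).quantize(quantum, rounding=ROUND_DOWN)


def memperc_triggers(current: str, threshold: str) -> bool:
    """'Meets or exceeds', compared at two decimal places."""
    return limit_precision(current, 2) >= limit_precision(threshold, 2)


print(limit_precision("10.25678", 2) == limit_precision("10.25", 2))
print(memperc_triggers("10.25", "10.25678"))
```

The same helper with places=1 would model the single decimal place that the
CPU percentage options describe.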

       --cpuperc PERCENT
              The percentage of CPU usage a process is consuming to use as a
              threshold for triggering the rig. If the process's %cpu meets
              or exceeds this value, trigger.

              PERCENT may be a whole integer or a float. When using a float
              and monitoring for CPU usage, rig respects one (1) decimal
              place of precision due to how CPU usage is reported.

              PERCENT may be above 100, as CPU usage can exceed 100 when a
              process is running on multiple CPUs.


system
       Watch the system's utilization of resources as a whole, e.g. total
       CPU or memory usage. When the utilization of a given resource either
       exceeds or falls below the given threshold (determined as appropriate
       for each resource), the trigger condition is met.

       The following options are available for the system rig:

       --iowait PERCENT
              The amount of %iowait as reported by the kernel to use as a
              threshold value.

              If exceeded, trigger the rig.

       --steal PERCENT
              The amount of %steal as reported by the kernel to use as a
              threshold value.

              If exceeded, trigger the rig.

       --nice PERCENT
              The amount of %nice as reported by the kernel to use as a
              threshold value.

              If exceeded, trigger the rig.

       --guest PERCENT
              The amount of %guest as reported by the kernel to use as a
              threshold value.

              If exceeded, trigger the rig.

       --user The amount of %user as reported by the kernel to use as a
              threshold value.

              If exceeded, trigger the rig.

       --available INTEGER
              The amount of available memory in MiB as reported by the
              kernel to use as a threshold value.

              If the amount of available memory falls below this threshold,
              trigger the rig.

       --free INTEGER
              The amount of free memory in MiB as reported by the kernel to
              use as a threshold value.

              If the amount of free memory falls below this threshold,
              trigger the rig.

       --used INTEGER
              The amount of used memory in MiB as reported by the kernel to
              use as a threshold value.

              If the amount of used memory exceeds this threshold, trigger
              the rig.

       --slab INTEGER
              The amount of slab memory in MiB as reported by the kernel to
              use as a threshold value.

              If the amount of slab memory exceeds this threshold, trigger
              the rig.

       --cpuperc PERCENT
              The amount of total CPU usage as reported by the kernel, as a
              percentage, to use as a threshold value.

              If exceeded, trigger the rig.

              This value may be a whole integer or a float. Floats are
              precise to one (1) decimal place.

       --memperc PERCENT
              The amount of total memory usage as reported by the kernel, as
              a percentage, to use as a threshold value.

              If exceeded, trigger the rig.

              This value may be a whole integer or a float. Floats are
              precise to one (1) decimal place.

       --loadavg FLOAT
              System load average as reported by the OS to use as a
              threshold value. If the reported loadavg exceeds this value,
              trigger the rig. This option accepts either an integer (1) or
              a float (1.0).

              Linux returns loadavg data for the past 1, 5, and 15 minutes.
              The system rig will monitor only one (1) of these intervals at
              a time, as controlled by the --loadavg-interval option.

       --loadavg-interval [1, 5, 15]
              Which time interval the rig should monitor when watching the
              system's loadavg. Only 1, 5, and 15 are accepted values for
              this option, as those are the intervals the Linux kernel
              returns loadavg data for.

              Default: 1
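
Selecting one of the three load averages can be illustrated with Python's
os.getloadavg(), which returns the 1, 5 and 15 minute values as a tuple. The
function name and its loadavg parameter (added so that the example can run
on fixed data) are assumptions of this sketch.

```python
import os
from typing import Optional, Sequence

# Position of each interval in the (1 min, 5 min, 15 min) tuple.
INTERVAL_INDEX = {1: 0, 5: 1, 15: 2}


def loadavg_exceeded(
    threshold: float,
    interval: int = 1,
    loadavg: Optional[Sequence[float]] = None,
) -> bool:
    """Return True if the chosen load average is above the threshold."""
    if interval not in INTERVAL_INDEX:
        raise ValueError("interval must be 1, 5, or 15")
    if loadavg is None:
        loadavg = os.getloadavg()  # live values from the OS
    return loadavg[INTERVAL_INDEX[interval]] > threshold


sample = (0.5, 1.2, 0.9)
print(loadavg_exceeded(1.0, interval=5, loadavg=sample))
print(loadavg_exceeded(1, interval=1, loadavg=sample))
```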

       --temp INTEGER
              The CPU temperature, in degrees Celsius, that triggers the rig
              when met or exceeded.

              This option takes an integer value, though temperature data is
              sensitive to a single decimal place, so a temperature of 50.9
              degrees will not trigger a rig that sets this option to 51.

              By default rig will monitor the first physical CPU package
              installed on the system. This may be changed via the --cpu-id
              option. Note that rig will only monitor whole packages and not
              individual cores, and that the temperature reported for a
              package is the highest temperature reported for any core in
              that package.
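
A short Python sketch of the comparison described above. The data layout (a
mapping of package ID to per-core readings) and the function names are
hypothetical; only the rules themselves, highest core temperature per package
and "meets or exceeds" without rounding, come from the text.

```python
from typing import Dict, List


def package_temp(core_temps: List[float]) -> float:
    """A package's temperature is the highest reading of any of its cores."""
    return max(core_temps)


def temp_triggers(
    packages: Dict[int, List[float]], threshold: int, cpu_id: int = 0
) -> bool:
    """Return True if the chosen package meets or exceeds the threshold."""
    return package_temp(packages[cpu_id]) >= threshold


readings = {0: [48.0, 50.9, 47.5], 1: [51.0, 49.0]}
print(temp_triggers(readings, 51))            # package 0 peaks at 50.9
print(temp_triggers(readings, 51, cpu_id=1))  # package 1 peaks at 51.0
```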

       --cpu-id ID
              If specified, monitor this physical CPU package. By default,
              rig will monitor physical CPU package 0, meaning the first
              physically installed CPU.

              When specifying an ID here, remember that in Linux CPU IDs are
              zero-indexed, so the first CPU will be ID 0, the second ID 1,
              and so forth.

              Default: 0


filesystem
       Watch a filesystem, directory, or file for utilization changes.
       Currently this rig is focused on space consumption, and will trigger
       when the specified path or backing filesystem exceeds the defined
       threshold for space utilization.

       The following options are available for the filesystem rig:

       --path PATH
              Specify the filesystem, directory, or file path for the rig to
              monitor. The location provided must exist when the rig
              initializes for monitoring to be supported.

       --size SIZE
              Specify the size threshold to trigger on for the provided
              --path. The size given must be an integer suffixed with either
              K, M, G, or T. The provided value will be converted to bytes.

       --fs-size SIZE
              Use this option instead of --size if you want to monitor the
              space usage of the backing filesystem for --path rather than
              the size of the path alone.

              As with --size, this value must be suffixed with either K, M,
              G, or T.

       --fs-used PERCENT
              Similar to --fs-size, but instead provide a percentage value;
              the rig triggers when the filesystem's %used exceeds this
              value.

              Note that using this option is ultimately the same as
              --fs-size, as rig will convert the specified percentage into a
              raw bytes value to use for comparisons.
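
The percentage-to-bytes conversion mentioned for --fs-used can be sketched
with Python's shutil.disk_usage(). The function names are hypothetical, and
the way "used" is measured here may differ from what rig or df report, for
example where a filesystem reserves blocks for root.

```python
import shutil


def percent_to_bytes(total_bytes: int, percent: float) -> int:
    """Convert a %used threshold into a raw byte count."""
    return int(total_bytes * percent / 100)


def fs_used_exceeded(path: str, percent: float) -> bool:
    """Return True if the filesystem backing `path` is above `percent` used."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage.used > percent_to_bytes(usage.total, percent)


# 80% of a 100 GiB filesystem, expressed in bytes.
print(percent_to_bytes(100 * 1024 ** 3, 80))
```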


The following actions are supported responses to triggered rigs. These may
be chained together on a single rig, so it is unnecessary to deploy multiple
rigs with matching trigger conditions that differ only in their action.

Actions are executed based on a priority weighting system, where lower
values represent a higher priority, and actions with lower values are
executed before those with higher values. This allows more time-sensitive
actions to be taken before those that may either take a long time to execute
or are otherwise unaffected by allowing other actions to run before them.
Action priority values are set by the actions themselves and currently
cannot be modified by users.
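
The ordering rule amounts to a sort on the priority value. In the Python
sketch below the numeric weights are invented for illustration; this page
only states the relative order (gcore first, tcpdump before most other
actions, sosreport after time-sensitive ones), not the real values.

```python
from typing import Iterable, List

# Hypothetical weights: lower values run first. The real values are set
# inside rig's actions and are not documented here.
PRIORITY = {"gcore": 1, "tcpdump": 2, "sosreport": 100}


def execution_order(enabled: Iterable[str]) -> List[str]:
    """Sort the enabled actions so that lower priority values run first."""
    return sorted(enabled, key=lambda name: PRIORITY[name])


print(execution_order(["sosreport", "tcpdump", "gcore"]))
```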

gcore  Collect a coredump of a given process or processes using GDB's gcore
       utility.

       Note that this does _not_ interrupt the running process(es). Cores
       are saved to /tmp and will be named either core.$pid or
       core.$proc_name.$pid depending on whether a PID or process name was
       provided. This action will be executed first when a rig is triggered
       and multiple actions are specified.

       This action supports repetition via the --repeat option.

       The gcore action supports the following options:

       --gcore PROCESS
              Enables this action and takes either a PID or process name as
              a value. If a process name is given, the PID is determined at
              rig creation. If multiple PIDs are found for the same process
              name, the default behavior is to fail rig creation. Use the
              --all-pids option to instead use all PIDs discovered for a
              process name.

              This option can be specified multiple times. For example,
              --gcore 12345 --gcore myprocess will generate a coredump for
              PID 12345 and for a process matching the name 'myprocess'.

       --all-pids
              Tells this action to collect a coredump for all PIDs found for
              a provided process name.

       --freeze
              Freeze the process(es) that will be core dumped by sending a
              SIGSTOP prior to calling gcore on the discovered pid(s).

              If successful, then rig will send a SIGCONT after the gcore
              execution has completed in order to thaw the process.
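
The freeze, dump, thaw sequence can be sketched in Python. This is an
illustration rather than rig's code: the function names are hypothetical,
the dump and send_signal parameters exist only so the sequence can be
exercised without a real process, and resuming the process even when the
dump fails is a choice made for this sketch.

```python
import os
import signal
import subprocess
from typing import Callable


def run_gcore(pid: int) -> None:
    """Write a core file for `pid`; gcore names it <prefix>.<pid>."""
    subprocess.run(["gcore", "-o", "/tmp/core", str(pid)], check=True)


def frozen_dump(
    pid: int,
    dump: Callable[[int], None] = run_gcore,
    send_signal: Callable[[int, int], None] = os.kill,
) -> None:
    """Stop the process, dump it, then let it continue."""
    send_signal(pid, signal.SIGSTOP)      # freeze
    try:
        dump(pid)                         # collect the coredump
    finally:
        send_signal(pid, signal.SIGCONT)  # thaw, even if the dump failed


# Dry run with stand-ins, so nothing is actually signalled or dumped.
calls = []
frozen_dump(
    4242,
    dump=lambda pid: calls.append(("dump", pid)),
    send_signal=lambda pid, sig: calls.append((sig, pid)),
)
print(calls)
```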


kdump  Generate a vmcore by triggering a kernel crash via sysrq.

       Note that this action WILL cause node disruption by triggering a
       kernel panic to generate the vmcore. This means your system will
       reboot when this action is triggered.

       The kdump action does not perform any configuration checks on the
       system's kdump installation. It is assumed that kdump has been
       properly configured and tested prior to using this action.

       The kdump action supports the following options:

       --kdump
              Enables this action.

       --sysrq INTEGER
              When the rig is deployed, if this option is set, rig will set
              the system's /proc/sys/kernel/sysrq to the value provided. See
              the sysrq kernel documentation for information on what values
              are supported.


sosreport
       Run a sosreport after the rig has been triggered. Select plugin
       enablement options, as well as the --plugin-option from sosreport,
       are supported by this action. This action should run after any
       time-sensitive actions otherwise specified by the user for a given
       rig.

       The sosreport action supports the following options:

       --sosreport
              Enables this action.

       --enable-plugins PLUGINS
              Specifically force the specified comma-delimited list of
              PLUGINS to be enabled.

       --plugin-option PLUGOPT
              Modify a specific plugin's runtime options. This is passed
              directly to sosreport as the same --plugin-option value, which
              should take the form 'name.option=value'. For example, to
              increase the podman plugin timeout use
              '--plugin-option podman.timeout=600'.

              If you need to pass multiple sosreport plugin options, use a
              comma-delimited list here instead of specifying this option
              multiple times.

       --skip-plugins PLUGINS
              Do not run these specified plugins. Use a comma-delimited list
              to skip multiple plugins.

       --only-plugins PLUGINS
              Only enable these specific plugins, disable all others. Use a
              comma-delimited list to specify multiple plugins.


tcpdump
       Start collecting a tcpdump when the rig is initialized, and stop the
       collection when the rig triggers. This action will be triggered
       before most other actions, but after the gcore action.

       Note that there will be a slight delay in configuring any rig that
       uses the tcpdump action, as rig must verify that the tcpdump process
       started successfully during the initialization process.

       The tcpdump action supports the following options:

       --tcpdump
              Enables this action.

       --iface INTERFACE
              Starts the tcpdump to monitor the provided INTERFACE. In
              almost all situations this should be set to a specific
              interface on the system; however, the value of 'any' is
              accepted by the tcpdump command in order to listen on all
              interfaces. Be wary of using this, as use of 'any' makes it
              impossible to determine which interface a particular packet
              came in on in the resulting packet capture.

              Default: eth0

       --filter FILTER
              Provide a filter to use with tcpdump in order to reduce the
              amount of traffic recorded in the packet capture. This value
              is passed directly to the tcpdump utility, and thus can be any
              valid filter accepted by tcpdump.

              For most shells you must quote the filter string for rig to
              pass it correctly.

       --size SIZE
              Limit the size of the packet capture file(s) to SIZE in MB.

              Default: 10

       --captures CAPTURES
              Specify the number of packet capture files to keep. If more
              than one (1), then tcpdump will rotate the packet capture file
              when it reaches the --size value and keep CAPTURES number of
              files.

              For example, with a CAPTURES of 2 and a SIZE of 5, you will
              have up to two 5 MB packet captures when the rig terminates.

              Default: 1 (packet capture file is replaced upon reaching
              SIZE limit).
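
One plausible way these options could map onto a tcpdump command line is
shown below. This is an assumption for illustration, not a statement of what
rig runs: the function name and output path are hypothetical, while -i, -w,
-C and -W are standard tcpdump options for the interface, the output file,
the per-file size limit and the number of files to rotate through.

```python
from typing import List, Optional


def build_tcpdump_cmd(
    outfile: str,
    iface: str = "eth0",
    size_mb: int = 10,
    captures: int = 1,
    capture_filter: Optional[str] = None,
) -> List[str]:
    """Build an argument list for a size-limited, rotating capture."""
    cmd = [
        "tcpdump",
        "-i", iface,          # interface to listen on
        "-w", outfile,        # write raw packets to this file
        "-C", str(size_mb),   # start a new file at this size
        "-W", str(captures),  # keep at most this many files
    ]
    if capture_filter:
        cmd.append(capture_filter)  # passed through unchanged
    return cmd


print(build_tcpdump_cmd("/tmp/example.pcap", size_mb=5, captures=2,
                        capture_filter="port 443"))
```

Passing the arguments as a list avoids a second round of shell quoting for
the filter expression.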


noop   Does nothing; this action runs a no-op. It is intended for testing a
       rig's configuration to make sure the trigger condition is set
       properly, e.g. a regex string for the logs rig's --message option.

       The noop action supports the following options:

       --noop Enables this action.


Jake Hunsaker <jhunsake@redhat.com>

January 2019                                                          rig(1)