COLMUX(1)                          colmux                          COLMUX(1)

NAME
colmux - multiplex communications to multiple systems running collectl
from a single system

SYNOPSIS
colmux [-command "collectl-switches... [-p filespec]"]
       [-address addr1[,addr2,...] | -addr filename]
       [-cols col1[,col2...]] | [-column num]

DESCRIPTION
This utility gathers data generated by collectl on multiple systems
and multiplexes it into a single consolidated format. It runs in
essentially 2 distinct modes. The first is known as real-time, because
data is retrieved and displayed in real time. The second is playback
mode, because data is played back from existing collectl data files.

There are also 2 general formats for the data being displayed. The
first is a multi-line display in which the data is displayed in the
native form that collectl displays it, except that it is sorted by a
distinct column, essentially allowing one to see the TOP producers of
that data. The second format is a single-line display in which one or
more distinct data elements from each source are displayed on the same
line. This latter format is never sorted, but rather positionally
organized by the name of the system that generated it.

Collectl will then be executed, using any optional switches specified
by -command, on each of the systems specified by -address. If the
target of that switch is a filename rather than a list of hosts, the
addresses are read from that file, and if -address is not specified,
collectl runs on the local system. See collectl for details of the
various switches. In some cases certain collectl switches will not
make sense in a colmux environment and if chosen will generate an
error. Further, if hosts are specified with -address, they should be
individual addresses or hostnames separated by commas. In turn, any of
them can be in what those familiar with pdsh would recognize as -w
format.
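
For readers unfamiliar with the pdsh -w notation, the following Python
sketch shows how a host list such as n[1-10,15] might expand into
individual names. It is an illustration only and handles just a simple
subset of the syntax (one bracketed group per name); the function name
expand_hosts is hypothetical and is not part of colmux or pdsh.

```python
import re


def expand_hosts(spec):
    """Expand a comma-separated host list with simple bracketed ranges.

    Illustrative only: supports one [..] group per name, e.g. n[1-3,15].
    """
    hosts = []
    # Split on commas that are not inside a bracketed range.
    for item in re.split(r",(?![^\[]*\])", spec):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", item)
        if not match:
            hosts.append(item)  # plain hostname, address or user@host
            continue
        prefix, ranges, suffix = match.groups()
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            high = high or low
            for n in range(int(low), int(high) + 1):
                # Keep zero padding such as n[01-03].
                hosts.append(f"{prefix}{n:0{len(low)}d}{suffix}")
    return hosts


print(expand_hosts("n[1-3,15],abc,user@def"))
```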

Colmux will then execute the collectl command, gather the results from
all sources for a particular interval and display them either one
result per line, sorted by the specified column, OR all on the same
line in groups specified by -cols. The number of lines displayed is
set to the size of the terminal window by default, but can be changed
using -lines. The one exception is the use of -nosort, which only
applies to the playback of existing collectl raw files. In this mode
all records for a particular interval will be displayed and the
sorting bypassed, making this a speedy and convenient mechanism for
gathering all data from all systems in one place for potential further
processing.
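
The multi-line behavior described above can be pictured with a small
Python sketch: rows collected from every host for one interval are
sorted by one column and cut to the number of display lines. This is
not colmux's code. The name top_lines, the row layout and the
assumption that the default order is largest-first are illustrative
only.

```python
def top_lines(rows, column, lines, reverse=False):
    """Return the first `lines` rows sorted by a 1-based data column.

    rows: list of (host, [values...]) gathered for a single interval.
    Assumption: the default order is descending (largest value first);
    reverse=True flips it, in the spirit of the -reverse switch.
    """
    ordered = sorted(rows, key=lambda row: row[1][column - 1],
                     reverse=not reverse)
    return ordered[:lines]


interval = [("n1", [12, 300]), ("n2", [85, 10]), ("n3", [40, 75])]
for host, values in top_lines(interval, column=1, lines=2):
    print(host, values)
```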

Colmux will never modify the size of the terminal window, so to see
more or wider lines either expand the window or override the number of
display lines and run it again. If the number of display lines is set
greater than the terminal height or to 0, colmux will no longer
overlay the previous window and will simply run in a continuous
scrolling mode.

Common Switches

-address list|pdsh|filename
    Specify any combination of addresses as hostnames OR in pdsh -w
    format OR a filename containing a list of hostnames/addresses, 1
    per line. You MUST have passwordless ssh access to these nodes.
    If a different username is required, be sure to specify addresses
    in username@host format, noting that you do not have to have the
    same username on each host. If specified, these usernames will
    override the one specified with the -username switch. rsh access
    is not supported.

-command switches
    One can specify virtually any collectl command here, in either
    real-time or playback mode. Some switches may only be used during
    one mode or the other and colmux will usually let you know if you
    specify an invalid combination or an otherwise restricted switch.
    Only those directly affecting colmux are listed below:

    --from, --thru
        Limit the timeframe for data being played back, noting you
        can include both the from and thru times with the --from
        switch if you separate them with a hyphen.

    -o time-format
        This is a "magic" switch in that it not only tells collectl
        how to display dates/times (no options are permitted with -o
        other than those from the set [dDTm]), it also tells colmux
        how to display dates/times.

        In single-line mode, the timestamp will come either from the
        host system in real-time mode OR from the first host when run
        in playback mode. This is the most common use/need for this
        switch. But be careful in choosing column numbers with -cols,
        as the position of the data shifts by 1 when time is included
        and by 2 if date and time are. Using -test will correctly
        show the shifted positions, but only if you include -o with
        the command at the same time you use -test.

        In real-time/top mode this switch is not allowed since colmux
        simply reports the current time of the system it is running
        on.

        When playing back multi-line formatted data from one or more
        files, a timestamp for each interval is reported, consisting
        of the time of that interval. When this switch is included,
        each line will be tagged with an appropriate timestamp since
        on rare occasions they may not all be identical.

    -p playback-file
        This switch tells colmux to run in playback mode. The
        filename should include the directory location and is usually
        specified with wild cards, limiting the selected file(s) to a
        specific date. When those files are on the same host as
        colmux (-address is not specified), they may be for multiple
        hosts, but when the files are on remote hosts they must all
        be for that unique host. If the file specification includes
        the string TODAY or YESTERDAY, it will be replaced with
        *yyyymmdd* for that date.
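
        As an illustration of the TODAY/YESTERDAY substitution just
        described, here is a minimal Python sketch. The helper name
        expand_date_keywords is hypothetical, and the way colmux
        itself performs the replacement may differ in detail.

```python
from datetime import date, timedelta


def expand_date_keywords(filespec, today=None):
    """Replace TODAY/YESTERDAY with a *yyyymmdd* wildcard pattern."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    filespec = filespec.replace("YESTERDAY", f"*{yesterday:%Y%m%d}*")
    return filespec.replace("TODAY", f"*{today:%Y%m%d}*")


# Fixed date used only to make the example reproducible.
print(expand_date_keywords("/var/log/collectl/YESTERDAY", date(2010, 12, 1)))
```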

    -P
        Run collectl in plot-format. This allows one to specify just
        about any combination of subsystems since all data is always
        displayed on a single line. However, due to the lack of
        formatting, this also makes no sense for multi-line displays
        and is therefore only supported in single-line format.

-help
    Show a brief help message and exit.

-hostwidth n
    By default, colmux sets the hostwidth to 8, unless it sees
    something wider, and for most situations this is sufficient.
    However, if one specifies hostnames that are aliases of the
    longer hostname, colmux has no way of knowing the real hostname
    lengths until after it starts receiving data from collectl, and
    the formatting will be off if the hostnames are longer than the
    default. To overcome this problem, use this switch to force the
    hostname to be wider.

-lines
    Change the number of lines that are displayed for each interval
    in multi-line mode. The default will be determined by the
    terminal size returned by the linux resize command if present.
    If that command is not present, the size will be initially set to
    24. If -lines is greater than the terminal size or 0, top-like
    behavior will not be used when in real-time mode.

    In single-line format, this controls the number of lines
    displayed between headers. A value of 0 will only display the
    header one time.

-noescape
    Colmux uses brute-force screen formatting, that is, it generates
    its own VT100 escape sequences to clear lines and/or move the
    cursor. On some occasions you may want to disable these sequences
    if you wish to recode the output and do your own post-processing
    of it. This switch will do just that.

-port
    Sometimes a remote version of collectl is already using the
    default socket. This allows one to start another instance and
    override that value.

-test
    This tells colmux to execute the specified collectl command
    either locally or on the first remote system specified by
    -address, print the associated header with the selected column(s)
    highlighted and also include each column name along with its
    ordinal number, making it fairly easy to make sure you've
    selected the right column(s).

-username name
    Use this username for ALL ssh commands. It can be overridden for
    specific hosts by specifying them in username@host format with
    the -address switch.

-version
    Display the version and exit. It will also report if
    Term::ReadKey is installed and if so what its version number is.

Playback Mode Specific

The following additional switches only apply to playback mode. There
are no real-time mode specific switches.

-delay seconds
    Introduce a delay between intervals in seconds. You can specify
    fractional values. Not using this switch will cause the output to
    be displayed as fast as it can be rendered.

-home
    Move the cursor to the home position (upper left-hand corner) of
    the display to use a top-like display format. This ONLY applies
    to multi-line mode when in playback mode and provides a mechanism
    for displaying recorded data in a top-like fashion.

-hostfilter addr[,addr]
    When playing back files for multiple hosts on the local system,
    sometimes you do not want to play back ALL the host files. This
    filter allows you to specify only those hosts which you want to
    process. The format of the list of addresses is specified in the
    same way as -address except that you cannot specify a filename.

-nosort
    Intended primarily for output that would be redirected to a file,
    do not sort or include any escape sequences in the output.

Multi-Line Format

When there is more output than will fit on the screen, colmux
includes the text:
    Displaying: lines xx thru yy out of zz
on the right side of the top line of the display, where xx is
typically 1.

However, once colmux is running, one might want to look at subsequent
lines, i.e. those below the bottom of the screen and therefore
invisible. If the ReadKey module is installed, one can simply use the
PageDown key to move down the display and the PageUp key to move in
the other direction. If ReadKey is not installed, typing the
multi-key sequences pd<ENTER> or pu<ENTER> will cause the same thing
to happen.

-colhelp
    When you wish to change the sort column and the arrow keys aren't
    available to you, it may be cumbersome to identify the number of
    the column to type in followed by RETURN. This tells colmux to
    display the numbers over each column, eliminating the need to
    manually count them and find the one you want.

-column num
    Set the sort column to this number. The column numbering is
    determined by the columns returned by collectl for the requested
    command. Since date/time columns are optional for non-plot data,
    their inclusion will change the numbering of the columns, so if
    you are not sure you selected the correct column, you should
    first execute your command with -test included.

    You can also change the column number interactively with the
    RIGHT/LEFT arrow keys IF the ReadKey module is installed (see
    colmux -version) OR simply type it in followed by the <ENTER>
    key.

-finalcr
    There is a real odd case in which you might want to pipe colmux
    real-time output to a script for further processing. However, if
    you do this you can't read the final line with a routine that
    expects a terminating CR, like python's readline(). Rather, that
    last line and the one that follows will be returned as one long
    string. This switch tells colmux to insert that final CR, which
    WILL mess up the screen under normal operations, so be
    forewarned.
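
    The effect described for -finalcr can be reproduced without
    colmux. The Python sketch below uses made-up data in an
    in-memory stream to show how readline() merges a last line that
    lacks its terminator with the line that follows. The sample text
    is invented and does not reflect real colmux output, which would
    also contain escape sequences.

```python
import io

# Simulated output for two intervals. The last line of the first
# interval ("n2 20") has no terminating newline, so the first line of
# the next interval runs straight into it.
stream = io.StringIO("n1 10\nn2 20" + "n1 11\nn2 21\n")

lines = [stream.readline() for _ in range(3)]
for line in lines:
    print(repr(line))
```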

-hostformat char:pos
    There are times one has long hostnames which can either take up
    valuable screen real estate or are simply painful to look at.
    This switch may evolve over time and is currently targeted at
    hostnames that have repeating parts along with a unique part,
    separated by a character such as a hyphen. This switch allows you
    to specify a single character followed by the piece of the
    hostname you'd like to see displayed. For example, if you have a
    hostname like aaa-bbbb-cccc-dddd, -hostformat -:3 will cause the
    cccc piece to be displayed.
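
    A minimal Python sketch of the char:pos selection, using the
    example above. The function name format_host is hypothetical, and
    falling back to the full hostname when the requested piece does
    not exist is an assumption rather than documented colmux
    behavior.

```python
def format_host(hostname, hostformat):
    """Return the pos-th piece (1-based) of hostname split on char."""
    sep = hostformat[0]          # the single separator character
    pos = int(hostformat[2:])    # the number after the colon
    pieces = hostname.split(sep)
    if 1 <= pos <= len(pieces):
        return pieces[pos - 1]
    return hostname              # assumption: leave the name untouched


print(format_host("aaa-bbbb-cccc-dddd", "-:3"))
```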

-nobold
    Do not highlight the selected column. This may be useful when
    redirecting output to a file and you do not want the associated
    escape sequences to be written to it.

-reverse
    Reverse the default sort order. You can also change the direction
    of the sort interactively with the UP/DOWN arrow keys IF the
    ReadKey module is installed (see colmux -version) OR simply type
    the r key and <ENTER>.

-zero
    Do not display any rows with 0 in the sort column. You can also
    type z<ENTER> interactively.

Single-Line Format

-col1000
    Divide each column by 1000 before display.

-colk
    Divide each column by 1024 before display.

-collog10
    Remap large numbers to a smaller number of values by taking the
    log10 of them and further transforming by the following mapping:
    0,1 to 0, 10 to 10, 100 to 20, 1000 to 30, 10000 to 40, ... 1e9
    to 90.
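
    The listed points are consistent with 10 times log10 of the
    value. The Python sketch below reproduces them under that
    assumption. How colmux treats values between the listed points,
    or negative values, is not stated here, so the rounding and the
    handling of values below 1 are guesses, and the name collog10 is
    used only for illustration.

```python
import math


def collog10(value):
    """Map a value onto the 0..90 scale described for -collog10."""
    if value <= 1:
        return 0  # 0 and 1 both map to 0 (values below 0: assumption)
    # Assumption: intermediate values are rounded to the nearest integer.
    return round(10 * math.log10(value))


for v in (0, 1, 10, 100, 1000, 10000, 10**9):
    print(v, collog10(v))
```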

-cols num,...
    Group all data together for each host by column number(s). As
    with -column, you can confirm the correct column(s) have been
    selected by first running with -test.

-colnodet
    Do not show data for individual hosts, just display the totals.

-colnodiv num,...
    Do not divide the specified column numbers by 1000 or 1024 when
    -col1000 or -colk is specified, and do not apply the -collog10
    transformation to them when that is specified. A typical usage is
    if you want to look at cpu loads as well as network or disk
    stats, in which case you may want to divide the latter by 1024
    but not the cpu.

-colnoinst
    Do not include the instance portion (and surrounding brackets) in
    totals column headers.

-coltotal
    Include the totals for each column to the right.

-colwidth
    Set the output columns to this width, typically used in
    conjunction with -col1000 or -colk to allow more hosts to fit
    onto the same line. It can also be used when the columns are too
    narrow for the host names in the column headers and you have room
    to display wider names.
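
To show how the single-line switches combine, here is a rough Python
sketch that builds one output line from per-host values. It is only a
model of the behavior described above: the function name single_line,
the data layout, the integer division and the placement of all totals
at the far right of the line are assumptions.

```python
def single_line(values_by_host, cols, divisor=1, nodiv=(), total=False,
                width=7):
    """Build one single-line record.

    values_by_host: {host: {column_number: value}}
    cols: selected column numbers (shown in ascending order)
    divisor: 1000 or 1024, in the spirit of -col1000 / -colk
    nodiv: columns exempt from division, in the spirit of -colnodiv
    """
    hosts = sorted(values_by_host)      # positional order by host name
    groups, totals = [], []
    for col in sorted(cols):            # always ascending column order
        scale = 1 if col in nodiv else divisor
        group = [values_by_host[h][col] // scale for h in hosts]
        groups.extend(group)
        totals.append(sum(group))
    fields = groups + (totals if total else [])
    return " ".join(f"{v:>{width}}" for v in fields)


data = {"n2": {6: 2048, 7: 10}, "n1": {6: 1024, 7: 20}}
print(single_line(data, [7, 6], divisor=1024, nodiv=(7,), total=True))
```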

Exception Reporting Specific

In single-line format, rather than wait for all hosts to report their
data, colmux simply reports the last data seen when the time to
generate a line of output has come. In most cases, these do reflect
the most recent data values, but in times of load the data may be late
getting to colmux and so a previous value may be reported. If the age
of that data exceeds a defined number of intervals (the default is
currently 2), an exception value of -1 will be reported. At other
times it has been seen where kernel/driver bugs may cause incorrect
values to be reported as negative numbers, and those values are also
reported as -1. Both the age and exception values can be changed with
the following switches.

-age number
    When initially starting up and all hosts have not yet reported
    any data, colmux will display a -1 to indicate no data has been
    seen yet. If during processing a host fails to report in -age
    intervals (the default is 2), colmux will also report a -1
    indicating the data is stale.

-negdataval val
    In some cases, there could be erroneous data reported as negative
    numbers (though sometimes negative numbers are valid). When
    specified, replace any negative numbers with this value.

-nodataval val
    This switch allows you to change the -1 that is normally reported
    for missing or stale data to the specified value, most commonly
    0.
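
The rules above can be summarized in a short Python sketch. It is a
reading of this section, not colmux's implementation: the function
name reported_value is hypothetical, and treating "stale" as an age
strictly greater than the -age value is an assumption.

```python
def reported_value(value, age, max_age=2, nodataval=-1, negdataval=-1):
    """Pick the value to display for one host in single-line format.

    value: last value seen from the host, or None if none seen yet
    age:   number of intervals since that value arrived
    """
    if value is None or age > max_age:
        return nodataval      # missing or stale data
    if value < 0:
        return negdataval     # suspect negative reading
    return value


print(reported_value(512, age=0))               # fresh data
print(reported_value(512, age=3))               # stale
print(reported_value(None, age=0, nodataval=0)) # nothing seen yet
```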

Diagnostics

The following switches are intended more for diagnostic purposes than
normal operation, though they are also worth using on appropriate
occasions.

-debug val
    This switch is for generating diagnostic information at various
    levels. It is actually a bit mask, whose values are listed at the
    beginning of colmux itself. Perhaps the most useful value is 1,
    as it will cause colmux to display all the remote commands issued
    to each host in the address list and can often reveal problems
    when things don't seem to be working correctly.

-nocheck
    This switch was initially included in an earlier version when
    remote host checking was causing problems in some cases and by
    skipping those checks, colmux would run more reliably. While it
    is felt that as of V3.2.0 these reachability checks are now
    reliable and should not be skipped, this switch has been left in
    place.

-quiet
    By default, and when -nocheck is not specified, colmux checks the
    versions of all collectl instances against that of the first node
    found to be running collectl and, if different, reports the
    mismatch. This switch suppresses that warning.

    When a connection is received from an unexpected address, a
    warning is also reported and the request promptly ignored. This
    switch suppresses those messages as well. For more information on
    problems connecting, see CONNECTION PROBLEMS.

-reachable
    By default, when a node is found to not be reachable, colmux will
    remove it from its list of hosts and continue execution. This
    switch will tell colmux to exit instead if not all hosts are
    reachable.

Miscellaneous

There are several switches whose descriptions don't really fit
anywhere else:

-colbin path
    On rare occasions, such as testing a patch to collectl in a copy
    NOT in /usr/bin, you may want to tell colmux to use that copy
    instead of the standard one. Use this switch to point to that
    copy. Naturally that copy must exist in that location on all
    systems.

-keepalive secs
    Colmux uses ssh to start collectl on each remote machine and then
    communications between collectl and colmux occur over a socket.
    Normally, ssh is configured to time out after an interval of
    inactivity, such as 30 minutes, which means a long-running colmux
    session will begin to lose connections when this interval is
    reached. By specifying a keepalive interval, you're telling ssh
    to send a periodic keepalive to the other end so that the
    connection doesn't get dropped.

-retaddr addr
    Tell remote collectls to open a socket on this address instead of
    the preselected one. For more details on this, see CONNECTION
    PROBLEMS.

-timeout secs
    By default, colmux waits up to 10 seconds for remote instances of
    collectl to connect back. On slower networks or when a very large
    number of instances have been started, they may fail to connect
    back in time. This switch will extend that timeout, but it also
    requires that collectl V3.6.4 be used because earlier versions do
    not support this feature.

-timerange secs
    When colmux starts up and checks the connectivity to all the
    machines specified by -addr, it also gets their current date/time
    and using that computes the range of system times across all
    nodes. If that range is found to be more than -timerange seconds,
    colmux generates a warning as this difference could cause
    reporting problems. One can increase the range to get rid of the
    message (not recommended unless other factors are preventing
    nodes from responding quickly enough to the date command) OR
    suppress the warning with -quiet.
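
    The check described for -timerange amounts to comparing the
    spread of the collected clock readings against a limit. A small
    Python sketch of that idea follows. The function name
    check_timerange, the use of epoch seconds and the wording of the
    warning are all illustrative assumptions.

```python
def check_timerange(times_by_host, timerange, quiet=False):
    """Return a warning string if host clocks differ by too much.

    times_by_host: {host: epoch_seconds} as read from each node
    timerange:     largest acceptable spread in seconds
    """
    spread = max(times_by_host.values()) - min(times_by_host.values())
    if spread > timerange and not quiet:
        return f"warning: clocks differ by {spread} seconds"
    return None


clocks = {"n1": 1291161600, "n2": 1291161603, "n3": 1291161609}
print(check_timerange(clocks, timerange=5))
```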

RESTRICTIONS
All logs being played back must have been collected using the same
interval, as colmux only looks at the first file/host to determine the
appropriate value.

It is assumed all clocks are reasonably well synchronized, as colmux
uses time to determine which data is to be displayed as a set.

All files must be in the same directory on all systems and that
directory must be included in the playback file specification.

All files on a remote host must be for that host only.

EXAMPLES
Run collectl on 3 nodes, showing CPU, Disk and Network statistics once
a second and sorted by column 1, which happens to be total cpu.

    colmux -addr abc,def,xyz

Dynamically display top processes on nodes n1-n10 of a cluster once a
second, sorted by column 5.

    colmux -addr n[1-10] -command "-sZ :1" -column 5

Do the same for yesterday, between the hours of 5AM and 6AM, being
sure to stall for 1/2 second between intervals. Note, if you leave off
-addr you could put all the logs into /var/log/collectl on the local
host and play them back from there.

    colmux -addr n[1-10] \
      -command "-sZ -p/var/log/collectl/YESTERDAY -from 05:00-06:00" \
      -column 5 -delay .5

Look at the amount of mapped and slab memory consumed on nodes n1-n10
and n15 in real-time, every 2 seconds using single-line format.
Include totals and preface each line with the time. Since memory sizes
tend to be rather large, divide each by 1024 so we see MB rather than
KB. Note that the column numbers are always displayed in ascending
order regardless of their order in -cols. To be sure, first test the
column numbers.

    colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot -colk -test
    colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot -colk

Display most active disks, based on KB written, on nodes n1, n4 and
n5.

    colmux -addr n1,n4,n5 -command "-sD" -column 6

Here is a cool trick. Collectl currently lets you look at top
processes with the --top switch and even choose a sort column by name.
However, if you want to change the column you need to exit, then rerun
collectl with a different sort column name. But if you run it like
this example, you get the power of colmux to dynamically change the
sort columns with the arrow keys! You can also use this technique to
have collectl dynamically sort any local multi-line data such as slabs
or even detail data like CPU, Disk, Lustre and Networks too! Naturally
this technique works just as well when playing back data.

    colmux -command "-sZ -i:1"

colmux requires passwordless ssh between the node it is running on and
those it is monitoring. Also be sure the port you are using for
communications (the default is 2655) is open.

CONNECTION PROBLEMS
The way colmux works is to choose an address it wants to communicate
over and start up one or more remote copies of collectl, telling them
to connect back to colmux using that address. The easiest way to see
this is to run colmux with -noesc, which tells it NOT to issue any
escape sequences and therefore not to run in full screen mode. The
additional switch of -debug 1 tells it to show the remote collectl
startup command. When there is a communications problem you will
typically see 'connection timed out' messages displayed.

There are actually a couple of possibilities here, one of which is
that a firewall is preventing connections. The easiest way to test
this is to run collectl on the local machine like this: collectl
-Aserver. This tells collectl to run as a server, listening for
connections just like colmux. Then log into a remote machine and run
/usr/share/collectl/util/client.pl addr-of-server, which tells
client.pl to open a socket to that copy of collectl. It should fail
just like when it was run via colmux, so try opening the firewall and
try it again. If that fixes the problem, it was indeed the firewall
blocking things and colmux should now work just fine.

Sometimes there are multiple interfaces defined on the machine hosting
colmux and in some cases only some addresses will allow socket
connections. Again, using client.pl on the remote machine, try
connecting back to collectl over different addresses and when you find
one that works, tell colmux to use that address for communication via
the -retaddr switch.
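
If client.pl is not at hand, a plain TCP connection attempt gives a
similar reachability test. The Python sketch below is an alternative
check, not part of collectl or colmux. The function name can_connect is
hypothetical, the default port of 2655 is the default named in this
page, and the 10-second timeout simply mirrors the default wait
mentioned under -timeout.

```python
import socket


def can_connect(address, port=2655, timeout=10.0):
    """Return True if a TCP connection to address:port can be opened."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run it on a remote node against each candidate address of the machine
hosting the server, for example can_connect("addr-of-server"), and
pass an address that succeeds to -retaddr.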

AUTHOR
This program was written by Mark Seger (mjseger@gmail.com).
Copyright 2016 Hewlett-Packard Development Company, L.P.

SEE ALSO
http://collectl-utils.sourceforge.net/colmux.html

LOCAL                          DECEMBER 2010                       COLMUX(1)