COLMUX(1)                           colmux                           COLMUX(1)

NAME

       colmux - multiplex communications to multiple systems running collectl
       from a single system

SYNOPSIS

       colmux [-command "collectl-switches... [-p filespec]"] [-address
       addr1[,addr2,...]|-addr filename] [-cols col1[,col2...]] | [-column
       num]

DESCRIPTION

       This utility gathers data generated by collectl from multiple systems
       and multiplexes it into a single consolidated format.  It runs in one
       of 2 distinct modes.  The first is known as real-time mode, because
       data is retrieved and displayed in real time.  The second is playback
       mode, because data is played back from existing collectl data files.

       There are also 2 general formats for the data being displayed.  The
       first is a multi-line display in which the data is displayed in the
       native form in which collectl displays it, except that it is sorted by
       a distinct column, essentially allowing one to see the TOP producers
       of that data.  The second format is a single-line display in which one
       or more distinct data elements from each source are displayed on the
       same line.  This latter format is never sorted, but rather positionally
       organized by the name of the system that generated it.

       Collectl will then be executed, using any optional switches specified
       by -command, on each of the systems specified by -address, OR on the
       systems whose addresses are read from a file if the target of that
       switch is a filename rather than a list of hosts, OR on the local
       system if -address is not specified.  See collectl for details of the
       various switches.  In some cases certain collectl switches will not
       make sense in a colmux environment and if chosen will generate an
       error.  Further, if hosts are specified with -address, they should be
       individual addresses or hostnames separated by commas.  In turn, any
       of them can be in what those familiar with pdsh would recognize as -w
       format.

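       To make the address syntax concrete, the following Python sketch
       expands a comma-separated list in which any item may carry a simple
       bracketed numeric range.  It is an illustration only, not colmux's
       own parser; the function name expand_hosts is hypothetical, and real
       pdsh -w syntax supports more than is handled here.

```python
import re


def expand_hosts(spec):
    """Expand a host list such as 'n[1-10,15],abc' (hypothetical helper).

    Only one bracket group per item and plain numeric ranges are handled;
    this is an assumption made for the sketch, not the full pdsh grammar.
    """
    hosts = []
    # Split on commas that are not inside a [...] group.
    for item in re.split(r",(?![^\[]*\])", spec):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", item)
        if not match:
            hosts.append(item)  # plain hostname, address or user@host
            continue
        prefix, body, suffix = match.groups()
        for part in body.split(","):
            low, _, high = part.partition("-")
            high = high or low  # a single number such as '15'
            for number in range(int(low), int(high) + 1):
                # Keep any zero padding used in the range start.
                hosts.append(f"{prefix}{str(number).zfill(len(low))}{suffix}")
    return hosts


print(expand_hosts("n[1-10,15]"))
```

       Under these assumptions, n[1-10,15] names the eleven hosts n1 through
       n10 plus n15, the same set used in the EXAMPLES section.
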
       Colmux will then execute the collectl command, gather the results from
       all sources for a particular interval and display them either one
       result per line, sorted by the specified column, OR all on the same
       line in groups specified by -cols.  The number of lines displayed is
       set to the size of the terminal window by default, but can be changed
       using -lines.  The one exception is the use of -nosort, which only
       applies to the playback of existing collectl raw files.  In this mode
       all records for a particular interval will be displayed and the
       sorting bypassed, making this a speedy and convenient mechanism for
       gathering all data from all systems in one place for potential further
       processing.

       Colmux will never modify the size of the terminal window, so to see
       more or wider lines either expand the window or override the number of
       display lines and run it again.  If the number of display lines is set
       greater than the terminal height or to 0, colmux will no longer
       overlay the previous window and will simply run in a continuous
       scrolling mode.

       Common Switches

       -address list|pdsh|filename
              Specify any combination of addresses as hostnames OR in pdsh -w
              format OR a filename containing a list of hostnames/addresses, 1
              per line.  You MUST have passwordless ssh access to these nodes.
              If a different username is required, be sure to specify
              addresses in username@host format, noting you do not have to
              have the same username on each host.  If specified, these
              usernames will override those specified with the -username
              switch.  rsh access is not supported.

       -command switches
              One can specify virtually any collectl command here, in either
              real-time or playback mode.  Some switches may only be used
              during one mode or the other and colmux will usually let you
              know if you specify an invalid combination or an otherwise
              restricted switch.  Only those directly affecting colmux are
              listed below:

              --from, --thru
                     Limit the timeframe for data being played back, noting
                     you can include both the from and thru times with the
                     --from switch if you separate them with a hyphen.

              -o time-format
                     This is a "magic" switch in that it not only tells
                     collectl how to display dates/times (no other options
                     are permitted using -o other than those from the set
                     [dDTm]), it also tells colmux how to display them.

                     In single-line mode, the timestamp will come either from
                     the host system in real-time mode OR from the first host
                     when run in playback mode.  This is the most common
                     use/need for this switch.  But be careful in choosing
                     column numbers with -cols, as the position of the data
                     shifts by 1 when time is included and by 2 if date and
                     time are.  Using -test will correctly show the shifted
                     positions, but only if you include -o with the command
                     at the same time you use -test.

                     In real-time/top mode this switch is not allowed since
                     colmux simply reports the current time of the system it
                     is running on.

                     When playing back multi-line formatted data from one or
                     more files, a timestamp for each interval is reported,
                     consisting of the time of that interval.  When this
                     switch is included, each line will be tagged with its
                     own timestamp, since on rare occasions the timestamps
                     may not all be identical.

              -p playback-file
                     This switch tells colmux to run in playback mode.  The
                     filename should include the directory location and is
                     usually specified with wild cards, limiting the selected
                     file(s) to a specific date.  When those files are on the
                     same host (-address is not specified), they may be for
                     multiple hosts, but when the files are on remote hosts
                     they must all be for that one host.  If the file
                     specification includes the string TODAY or YESTERDAY,
                     it will be replaced with *yyyymmdd* for that date.

              -P
                     Run collectl in plot-format.  This allows one to specify
                     just about any combination of subsystems since all data
                     is always displayed on a single line.  However, due to
                     the lack of formatting, this also makes no sense for
                     multi-line displays and is therefore only supported in
                     single-line format.

       -help
              Show a brief help message and exit.

       -hostwidth n
              By default, colmux sets the hostwidth to 8 unless it sees
              something wider, and for most situations this is sufficient.
              However, if one specifies hostnames that are aliases of a
              longer hostname, colmux has no way of knowing the real hostname
              lengths until after it starts receiving data from collectl, and
              the formatting will be off if the hostnames are longer than the
              default.  To overcome this problem, use this switch to force
              the hostname field to be wider.

       -lines
              Change the number of lines that are displayed for each interval
              in multi-line mode.  The default will be determined by the
              terminal size returned by the Linux resize command if present.
              If that command is not present, the size will be initially set
              to 24.  If -lines is greater than the terminal size or 0,
              top-like behavior will not be used when in real-time mode.

              In single-line format, this switch controls the number of lines
              displayed between headers.  A value of 0 will only display the
              header one time.

       -noescape
              Colmux uses brute-force screen formatting, that is, it generates
              its own VT100 escape sequences to clear lines and/or move the
              cursor.  On some occasions you may want to disable these
              sequences if you wish to record the output and do your own
              post-processing of it.  This switch will do just that.

       -port
              Sometimes a remote version of collectl is already using the
              default socket.  This allows one to start another instance and
              override that value.

       -test
              This tells colmux to execute the specified collectl command
              either locally or on the first remote system specified by
              -address, print the associated header with the selected
              column(s) highlighted and also include each column name along
              with its ordinal number, making it fairly easy to make sure
              you've selected the right column(s).

       -username name
              Use this username for ALL ssh commands.  It can be overridden
              for specific hosts by specifying them in username@host format
              with the -address switch.

       -version
              Display the version and exit.  It will also report if
              Term::ReadKey is installed and if so what its version number is.

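       The TODAY and YESTERDAY substitution described under -p can be
       pictured with the short Python sketch below.  It mirrors only what the
       text states (each keyword becomes *yyyymmdd* for that date); the
       function name expand_filespec and the explicit today parameter are
       assumptions made for illustration, not part of colmux.

```python
import datetime


def expand_filespec(filespec, today):
    """Replace TODAY/YESTERDAY with *yyyymmdd* (illustrative helper only)."""
    yesterday = today - datetime.timedelta(days=1)
    filespec = filespec.replace("YESTERDAY", f"*{yesterday:%Y%m%d}*")
    filespec = filespec.replace("TODAY", f"*{today:%Y%m%d}*")
    return filespec


# The date is passed in explicitly so the result is reproducible.
print(expand_filespec("/var/log/collectl/YESTERDAY", datetime.date(2010, 12, 2)))
```

       The resulting wildcard pattern is what limits playback to the files
       recorded on that one day.
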
       Playback Mode Specific

       The following additional switches only apply to playback mode.  There
       are no real-time mode specific switches.

       -delay seconds
              Introduce a delay between intervals in seconds.  You can specify
              fractional values.  Not using this switch will cause the output
              to be displayed as fast as it can be rendered.

       -home
              Move the cursor to the home position (upper left-hand corner) of
              the display to use a top-like display format.  This ONLY applies
              to multi-line mode when in playback mode and provides a
              mechanism for displaying recorded data in a top-like fashion.

       -hostfilter addr[,addr]
              When playing back files for multiple hosts on the local system,
              sometimes you do not want to play back ALL the host files.  This
              filter allows you to specify only those hosts which you want to
              process.  The format of the list of addresses is specified in
              the same way as -address except that you cannot specify a
              filename.

       -nosort
              Do not sort the output or include any escape sequences in it.
              Intended primarily for output that will be redirected to a
              file.

       Multi-Line Format

              When there is more output than will fit on the screen, colmux
              includes the text:
                     Displaying: lines xx thru yy out of zz
              on the right side of the top line of the display, where xx is
              typically 1.

              However, once colmux is running, one might want to look at
              subsequent lines, i.e. those below the bottom of the screen and
              therefore invisible.  If the ReadKey module is installed, one
              can simply use the PageDown key to move down the display and the
              PageUp key to move in the other direction.  If ReadKey is not
              installed, typing the multi-key sequences pd<ENTER> or pu<ENTER>
              will cause the same thing to happen.

       -colhelp
              When you wish to change the sort column and the arrow keys
              aren't available to you, it may be cumbersome to identify the
              number of the column to type in followed by RETURN.  This tells
              colmux to display the numbers over each column, eliminating the
              need to manually count them and find the one you want.

       -column num
              Set the sort column to this number.  The column numbering is
              determined by the columns returned by collectl for the requested
              command.  Since date/time columns are optional for non-plot
              data, their inclusion will change the numbering of the columns,
              so if you are not sure you selected the correct column, you
              should first execute your command with -test included.

              You can also change the column number interactively with the
              RIGHT/LEFT arrow keys IF the ReadKey module is installed (see
              colmux -version) OR simply type it in followed by the <ENTER>
              key.

       -finalcr
              There is a rather odd case in which you might want to pipe
              colmux real-time output to a script for further processing.
              However, if you do this you can't read the final line with a
              routine that expects a terminating CR, like Python's
              readline().  Rather, that last line and the one that follows
              will be returned as one long string.  This switch tells colmux
              to insert that final CR, which WILL mess up the screen under
              normal operations, so be forewarned.

       -hostformat char:pos
              There are times one has long hostnames which can either take up
              valuable screen real estate or are simply painful to look at.
              This switch may evolve over time and is currently targeted at
              hostnames that have repeating parts along with a unique part,
              separated by a character such as a hyphen.  This switch allows
              you to specify a single character followed by the piece of the
              hostname you'd like to see displayed.  For example, if you have
              a hostname like aaa-bbbb-cccc-dddd, -hostformat -:3 will cause
              the cccc piece to be displayed.

       -nobold
              Do not highlight the selected column.  This may be useful when
              redirecting output to a file and you do not want the associated
              escape sequences to be written to it.

       -reverse
              Reverse the default sort order.  You can also change the
              direction of the sort interactively with the UP/DOWN arrow keys
              IF the ReadKey module is installed (see colmux -version) OR
              simply type the r key and <ENTER>.

       -zero
              Do not display any rows with 0 in the sort column.  You can also
              type z<ENTER> interactively.

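       The -hostformat example above (separator character, then the position
       of the piece to show) can be sketched in Python as follows.  This is
       only a reading of the documented example, not colmux's code; the name
       format_host is hypothetical, and falling back to the full hostname
       when the position is out of range is an assumption.

```python
def format_host(hostname, spec):
    """Return one piece of hostname chosen by a 'char:pos' spec (sketch)."""
    # Split the spec at its last colon so the separator itself may be ':'.
    separator, _, position = spec.rpartition(":")
    pieces = hostname.split(separator)
    index = int(position)  # positions are counted from 1, as in the example
    if 1 <= index <= len(pieces):
        return pieces[index - 1]
    return hostname  # assumed fallback: leave the name untouched


print(format_host("aaa-bbbb-cccc-dddd", "-:3"))  # prints cccc
```

       Counting pieces from 1 matches the example in the text, where position
       3 of aaa-bbbb-cccc-dddd selects cccc.
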
       Single-Line Format

       -col1000
              Divide each column by 1000 before display.

       -colk
              Divide each column by 1024 before display.

       -collog10
              Remap large numbers to a smaller number of values by taking the
              log10 of them and further transforming by the following mapping:
              0,1 to 0, 10 to 10, 100 to 20, 1000 to 30, 10000 to 40, ... 1e9
              to 90.

       -cols num,...
              Group all data together for each host by column number(s).  As
              with -column, you can confirm the correct column(s) have been
              selected by first running with -test.

       -colnodet
              Do not show data for individual hosts, just display the totals.

       -colnodiv num,...
              Do not divide the specified columns by 1000 or 1024 when
              -col1000 or -colk is specified, and do not apply the -collog10
              transformation to them when that switch is specified.  A
              typical usage is if you want to look at cpu loads as well as
              network or disk stats, in which case you may want to divide the
              latter by 1024 but not the cpu.

       -colnoinst
              Do not include the instance portion (and surrounding brackets)
              in totals column headers.

       -coltotal
              Include the totals for each column to the right.

       -colwidth
              Set the output columns to this width, typically used in
              conjunction with -col1000 or -colk to allow more hosts to fit
              onto the same line.  It can also be used if the columns are too
              narrow for the host names in the column headers and you have
              room to display wider names.

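       The mapping listed for -collog10 (10 to 10, 100 to 20, ... 1e9 to 90)
       is consistent with multiplying log10 of the value by 10.  The Python
       sketch below reproduces the listed points under that assumption.  How
       colmux treats values between powers of ten is not stated here, so the
       rounding used is a guess, and the function name collog10 is
       hypothetical.

```python
import math


def collog10(value):
    """Map a value onto the 0..90 scale listed for -collog10 (sketch)."""
    if value <= 1:
        return 0  # the text maps both 0 and 1 to 0
    # Assumption: scale log10 by 10 and round to the nearest integer.
    return round(10 * math.log10(value))


for sample in (0, 1, 10, 100, 1000, 10000, 10**9):
    print(sample, collog10(sample))
```

       The effect is to compress nine orders of magnitude into at most two
       digits, which is what lets many hosts fit on one line.
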
       Exception Reporting Specific

       In single-line format, rather than wait for all hosts to report their
       data, colmux simply reports the last data seen when the time to
       generate a line of output has come.  In most cases, these do reflect
       the most recent data values, but in times of load the data may be late
       getting to colmux and so a previous value may be reported.  If the age
       of that data exceeds a defined number of intervals (the default is
       currently 2), an exception value of -1 will be reported.  At other
       times kernel/driver bugs have been seen to cause incorrect values to
       be reported as negative numbers, and those values are also reported as
       -1.  Both the age and exception values can be changed with the
       following switches.

       -age number
              When initially starting up and all hosts have not yet reported
              any data, colmux will display a -1 to indicate no data has been
              seen yet.  If during processing a host fails to report in -age
              intervals (the default is 2), colmux will also report a -1,
              indicating the data is stale.

       -negdataval val
              In some cases, there could be erroneous data reported as
              negative numbers (though sometimes negative numbers are valid).
              When specified, replace any negative numbers with this value.

       -nodataval val
              This switch allows you to change the -1 that is normally
              reported for missing or stale data to the specified value, most
              commonly 0.

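       The interaction of -age, -nodataval and -negdataval can be summarized
       in a small Python sketch.  It is one reading of the text above, not
       colmux's implementation: the function and parameter names are
       hypothetical, "stale" is taken to mean more than -age intervals old,
       and negative values are assumed to map to -1 unless -negdataval says
       otherwise.

```python
def reported_value(last_value, intervals_old, age=2, nodataval=-1, negdataval=-1):
    """Return what would be shown for one host/column (illustrative only).

    last_value    -- most recent value seen, or None if nothing has arrived
    intervals_old -- how many intervals ago that value was received
    """
    if last_value is None or intervals_old > age:
        return nodataval  # missing or stale data
    if last_value < 0:
        return negdataval  # suspicious negative value
    return last_value


print(reported_value(None, 0))               # nothing seen yet
print(reported_value(42, 1))                 # fresh enough
print(reported_value(42, 3))                 # stale
print(reported_value(42, 3, nodataval=0))    # stale, with -nodataval 0
```

       Choosing 0 for -nodataval keeps totals from being skewed by -1
       placeholders, at the cost of hiding the fact that a host went quiet.
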
       Diagnostics

       The following switches are intended more for diagnostic purposes than
       normal operation, though they are also worth using on appropriate
       occasions.

       -debug val
              This switch is for generating diagnostic information at various
              levels.  It is actually a bit mask, whose values are listed at
              the beginning of colmux itself.  Perhaps the most useful value
              is 1, as it will cause colmux to display all the remote
              commands issued to each host in the address list and can often
              reveal problems when things don't seem to be working correctly.

       -nocheck
              This switch was initially included in an earlier version when
              remote host checking was causing problems in some cases and, by
              skipping those checks, colmux would run more reliably.  While it
              is felt that as of V3.2.0 these reachability checks are now
              reliable and should not be skipped, this switch has been left in
              place.

       -quiet
              By default, and when -nocheck is not specified, colmux checks
              the versions of all collectl instances against that of the first
              node found to be running collectl and, if different, reports the
              mismatch.  This switch suppresses that warning.

              When a connection is received from an unexpected address, a
              warning is also reported and the request promptly ignored.  This
              switch suppresses those messages as well.  For more information
              on problems connecting, see CONNECTION PROBLEMS.

       -reachable
              By default, when a node is found to not be reachable, colmux
              will remove it from its list of hosts and continue execution.
              This switch will tell colmux to exit if any host is not
              reachable.

       Miscellaneous

       The following switches don't really fit anywhere else:

       -colbin path
              On rare occasions, such as testing a patch to collectl in a copy
              NOT in /usr/bin, you may want to tell colmux to use that copy
              instead of the standard one.  Use this switch to point to that
              copy.  Naturally that copy must exist in that location on all
              systems.

       -keepalive secs
              Colmux uses ssh to start collectl on each remote machine and
              then communications between collectl and colmux occur over a
              socket.  Normally, ssh is configured to time out after an
              interval of inactivity, such as 30 minutes, which means a
              long-running colmux session will begin to lose connections when
              this interval is reached.  By specifying a keepalive interval,
              you're telling ssh to send a periodic keepalive to the other end
              so that the connection doesn't get dropped.

       -retaddr addr
              Tell remote collectls to open a socket on this address instead
              of the preselected one.  For more details on this, see
              CONNECTION PROBLEMS.

       -timeout secs
              By default, colmux waits up to 10 seconds for remote instances
              of collectl to connect back.  On slower networks or when a very
              large number of instances have been started, they may fail to
              connect back in time.  This switch will extend that timeout, but
              it also requires that collectl V3.6.4 be used because earlier
              versions do not support this feature.

       -timerange secs
              When colmux starts up and checks the connectivity to all the
              machines specified by -addr, it also gets their current
              date/time and, using that, computes the range of system times
              across all nodes.  If that range is found to be more than
              -timerange seconds, colmux generates a warning, as this
              difference could cause reporting problems.  One can increase the
              range to get rid of the message (not recommended unless other
              factors are preventing nodes from responding quickly enough to
              the date command) OR suppress the warning with -quiet.
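       The -timerange check amounts to comparing the spread between the
       earliest and latest clock against a threshold.  A minimal Python
       sketch of that comparison follows; the function name, the dictionary
       of per-host epoch seconds and the sample values are all made up for
       illustration, and the actual default threshold is not stated here.

```python
def clock_spread(host_times, timerange):
    """Return (spread_in_seconds, exceeds_threshold) for a set of clocks.

    host_times -- mapping of hostname to that host's time in epoch seconds
    timerange  -- largest acceptable spread, as given to -timerange
    """
    spread = max(host_times.values()) - min(host_times.values())
    return spread, spread > timerange


# Hypothetical sample: n3 is 7 seconds ahead of n1.
times = {"n1": 1291300000, "n2": 1291300002, "n3": 1291300007}
print(clock_spread(times, 5))
```

       A spread larger than the threshold matters most in playback and
       single-line modes, where colmux uses time to decide which samples
       belong on the same line.
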

PLAYBACK MODE RESTRICTIONS

       All logs being played back must have been collected using the same
       interval, as colmux only looks at the first file/host to determine the
       appropriate value.

       It is assumed all clocks are reasonably well synchronized, as colmux
       uses time to determine which data is to be displayed as a set.

       All files must be in the same directory on all systems and that
       directory must be included in the playback file specification.

       All files on a remote host must be for that host only.

EXAMPLES

       Run collectl on 3 nodes, showing CPU, Disk and Network statistics once
       a second and sorted by column 1, which happens to be total cpu.

       colmux -addr abc,def,xyz

       Dynamically display top processes on nodes n1-n10 of a cluster once a
       second, sorted by column 5.

       colmux -addr n[1-10] -command "-sZ -i:1" -column 5

       Do the same for yesterday, between the hours of 5AM and 6AM, being sure
       to stall for 1/2 second between intervals.  Note, if you leave off
       -addr you could put all the logs into /var/log/collectl on the local
       host and play them back from there.

       colmux -addr n[1-10] -command "-sZ -p/var/log/collectl/YESTERDAY -from 05:00-06:00" -column 5 -delay .5

       Look at the amount of mapped and slab memory consumed on nodes n1-n10
       and n15 in real-time, every 2 seconds, using single-line format.
       Include totals and preface each line with the time.  Since memory sizes
       tend to be rather large, divide each by 1024 so we see MB rather than
       KB.  Note that the columns are always displayed in ascending order
       regardless of their order in -cols.  To be sure, first test the column
       numbers.

       colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot -colk -test
       colmux -addr n[1-10,15] -command "-sm -i2 -oT" -cols 6,7 -coltot -colk

       Display most active disks, based on KB written, on nodes n1, n4 and n5.

       colmux -addr n1,n4,n5 -command "-sD" -column 6

       Here is a cool trick.  Collectl currently lets you look at top
       processes with the --top switch and even choose a sort column by name.
       However, if you want to change the column you need to exit, then rerun
       collectl with a different sort column name.  But if you run it like
       this example, you get the power of colmux to dynamically change the
       sort columns with the arrow keys!  You can also use this technique to
       have collectl dynamically sort any local multi-line data such as slabs
       or even detail data like CPU, Disk, Lustre and Networks too!  Naturally
       this technique works just as well when playing back data.

       colmux -command "-sZ -i:1"

RESTRICTIONS

       colmux requires passwordless ssh between the node it is running on and
       those it is monitoring.  Also be sure the port you are using for
       communications (the default is 2655) is open.

CONNECTION PROBLEMS

       The way colmux works is to choose an address it wants to communicate
       over and start up one or more remote copies of collectl, telling them
       to connect back to colmux using that address.  The easiest way to see
       this is to run colmux with -noesc, which tells it NOT to issue any
       escape sequences and therefore not to run in full screen mode.  The
       additional switch of -debug 1 tells it to show the remote collectl
       startup command.  When there is a communications problem you will
       typically see 'connection timed out' messages displayed.

       There are actually a couple of possibilities here, one of which is that
       a firewall is preventing connections.  The easiest way to test this is
       to run collectl on the local machine like this: collectl -Aserver.
       This tells collectl to run as a server, listening for connections just
       like colmux.  Then log into a remote machine and run
       /usr/share/collectl/util/client.pl addr-of-server, which tells
       client.pl to open a socket to that copy of collectl.  It should fail
       just as it did when run via colmux, so try opening the firewall and
       try it again.  If that fixes the problem, it was indeed the firewall
       blocking things and colmux should now work just fine.

       Sometimes there are multiple interfaces defined on the machine hosting
       colmux and in some cases only some addresses will allow socket
       connections.  Again using client.pl on the remote machine, try
       connecting back to collectl over different addresses and, when you find
       one that works, tell colmux to use that address for communication via
       the -retaddr switch.

AUTHOR

       This program was written by Mark Seger (mjseger@gmail.com).
       Copyright 2016 Hewlett-Packard Development Company, L.P.

SEE ALSO

       http://collectl-utils.sourceforge.net/colmux.html

LOCAL                            DECEMBER 2010                       COLMUX(1)