corosync.conf(5)

1COROSYNC_CONF(5)  Corosync Cluster Engine Programmer's Manual COROSYNC_CONF(5)
2
3
4

NAME

6       corosync.conf - corosync executive configuration file
7
8

SYNOPSIS

10       /etc/corosync.conf
11
12

DESCRIPTION

14       The corosync.conf instructs the corosync executive about various param‐
15       eters needed to control the corosync executive.  Empty lines and  lines
16       starting with # character are ignored.  The configuration file consists
17       of bracketed top level directives.  The possible directive choices are:
18
19
20       totem { }
21              This top level directive contains configuration options for  the
22              totem protocol.
23
24       logging { }
25              This top level directive contains configuration options for log‐
26              ging.
27
28       event { }
29              This top level directive contains configuration options for  the
30              event service.
31
32
33       It  is  also possible to specify the top level parameter compatibility.
34       This directive indicates the level of compatibility  requested  by  the
35       user.  The option whitetank can be specified to remain backward
36       compatible with openais-0.80.z.  The option none can be specified to
37       only be compatible with corosync-1.Y.Z.  Extra processing during
38       configuration changes is required to remain backward compatible.
39
40       The default is whitetank. (backwards compatibility)
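
As a brief illustration, the parameter is written at the top level of the file, outside any bracketed directive, using the name: value form that corosync.conf uses for options. The choice of none here is only an example for a cluster that does not need openais-0.80.z compatibility.

```
# Top level parameter, placed outside totem { }, logging { } and event { }
compatibility: none
```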
41
42
43       Within the totem directive, an interface directive is required.  There
44       is also one required configuration option, version, described below.
45
46       Within  the  interface sub-directive of totem there are four parameters
47       which are required.  There is one parameter which is optional.
48
49
50       ringnumber
51              This specifies the ring number for the  interface.   When  using
52              the redundant ring protocol, each interface should specify sepa‐
53              rate ring numbers to uniquely identify to the membership  proto‐
54              col  which  interface  to  use  for  which  redundant  ring. The
55              ringnumber must start at 0.
56
57
58       bindnetaddr
59              This specifies the network address the corosync executive should
60              bind  to.   For  example, if the local interface is 192.168.5.92
61              with netmask 255.255.255.0, set bindnetaddr to 192.168.5.0.   If
62              the    local    interface    is    192.168.5.92   with   netmask
63              255.255.255.192, set bindnetaddr to 192.168.5.64, and so forth.
64
65              This may also be an IPv6 address, in which case IPv6  networking
66              will  be used.  In this case, the full address must be specified
67              and there is no automatic selection  of  the  network  interface
68              within a specific subnet as with IPv4.
69
70              If IPv6 networking is used, the nodeid field must be specified.
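
The second IPv4 case above can be worked through as follows: the last octet 92 is 01011100 in binary, the netmask octet 192 is 11000000, and their bitwise AND is 01000000, which is 64. A sketch of the resulting interface directive is shown below. The mcastaddr and mcastport values are placeholder assumptions chosen for illustration and are not taken from this page.

```
totem {
        version: 2
        interface {
                ringnumber: 0
                # 192.168.5.92 AND 255.255.255.192 = 192.168.5.64
                bindnetaddr: 192.168.5.64
                # Placeholder values; ask the network administrator
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}
```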
71
72
73       broadcast
74              This  is  optional  and can be set to yes.  If it is set to yes,
75              the broadcast address will be used for communication.   If  this
76              option is set, mcastaddr should not be set.
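
A minimal sketch of an interface directive that uses broadcast instead of multicast might look like this. The bindnetaddr and mcastport values are illustrative assumptions.

```
interface {
        ringnumber: 0
        bindnetaddr: 192.168.5.0
        # mcastaddr is deliberately left out when broadcast is enabled
        broadcast: yes
        mcastport: 5405
}
```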
77
78
79       mcastaddr
80              This  is  the multicast address used by corosync executive.  The
81              default should work for most networks, but the network  adminis‐
82              trator  should  be  queried  about  a  multicast address to use.
83              Avoid 224.x.x.x because this is a "config" multicast address.
84
85              This may also be an IPv6 multicast address, in which  case  IPv6
86              networking will be used.  If IPv6 networking is used, the nodeid
87              field must be specified.
88
89
90       mcastport
91              This specifies the UDP port number.  It is possible to  use  the
92              same  multicast  address on a network with the corosync services
93              configured for different UDP ports.  Note that corosync uses two
94              UDP ports: mcastport (for mcast receives) and mcastport - 1
95              (for mcast sends).  If you have multiple clusters  on  the  same
96              network using the same mcastaddr, configure the mcastports with
97              a gap between them.
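
To illustrate the gap, the sketch below shows interface directives from two separate clusters that share one multicast address. Each fragment belongs in a different cluster's corosync.conf. The address and port numbers are assumptions chosen for the example; what matters is that the port pairs (mcastport and mcastport - 1) do not overlap.

```
# Cluster A: uses UDP ports 5405 (receives) and 5404 (sends)
interface {
        ringnumber: 0
        bindnetaddr: 192.168.5.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
}

# Cluster B: uses UDP ports 5407 (receives) and 5406 (sends)
interface {
        ringnumber: 0
        bindnetaddr: 192.168.5.0
        mcastaddr: 226.94.1.1
        mcastport: 5407
}
```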
98
99
100       ttl    This specifies the Time To Live (TTL). If you run  your  cluster
101              on  a  routed network then the default of "1" will be too small.
102              This option provides a way to increase this up to 255. The valid
103              range  is  0..255.   Note  that  this is only valid on multicast
104              transport types.
105
106
107       member This specifies a member on the interface and is used with the
108              udpu transport only.  Every node that should be part of the
109              membership should be specified as a separate member directive.
110              Within the member directive there is a parameter memberaddr
111              which specifies the IP address of one of the nodes.
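
A sketch of a two node udpu configuration, with one member directive per node, could look like the following. The addresses and the port are illustrative assumptions.

```
totem {
        version: 2
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.5.0
                mcastport: 5405
                # One member directive for every node in the membership
                member {
                        memberaddr: 192.168.5.92
                }
                member {
                        memberaddr: 192.168.5.93
                }
        }
}
```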
112
113
114       Within the totem directive, there are seven  configuration  options  of
115       which one is required, five are optional, and one is required when IPv6
116       is configured in the interface subdirective.   The  required  directive
117       controls the version of the totem configuration.  The directive that is
118       optional unless IPv6 is used controls identification of the  processor.
119       The  optional options control secrecy and authentication, the redundant
120       ring mode of operation, maximum network MTU, the number of sending
121       threads, and the nodeid field.
122
123
124       version
125              This specifies the version of the configuration file.  Currently
126              the only valid version for this directive is 2.
127
128
129       nodeid This configuration  option  is  optional  when  using  IPv4  and
130              required when using IPv6.  This is a 32 bit value specifying the
131              node identifier delivered to the cluster membership service.  If
132              this is not specified with IPv4, the node id will be  determined
133              from the 32 bit IP address to which the system is bound with a
134              ring identifier of 0.  The node identifier value of zero is
135              reserved and should not be used.
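
For illustration, an explicit node identifier is set inside the totem directive. The value 1 is an arbitrary assumption; any nonzero 32 bit value that is unique within the cluster fits the description above.

```
totem {
        version: 2
        # Nonzero; required when the interface uses IPv6
        nodeid: 1
        # interface directive omitted here for brevity
}
```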
136
137
138       clear_node_high_bit
139              This configuration option is optional and is only relevant  when
140              no nodeid is specified.  Some openais clients require  a  signed
141              32 bit nodeid that is greater than zero; however, by default
142              openais uses all 32 bits of the IPv4 address space when
143              generating a nodeid.  Set this option to yes to force the high
144              bit to be zero and therefore ensure the nodeid is a positive
145              signed 32 bit integer.
146
147              WARNING: The cluster's behavior is undefined if this option is
148              enabled  on  only  a subset of the cluster (for example during a
149              rolling upgrade).
150
151
152       secauth
153              This specifies that HMAC/SHA1 authentication should be  used  to
154              authenticate  all  messages.  It further specifies that all data
155              should be encrypted with the sober128  encryption  algorithm  to
156              protect data from eavesdropping.
157
158              Enabling this option adds a 36 byte header to every message sent
159              by totem which reduces total throughput.  Encryption and authen‐
160              tication  consume  75% of CPU cycles in aisexec as measured with
161              gprof when enabled.
162
163              For 100mbit  networks  with  1500  MTU  frame  transmissions:  A
164              throughput of 9mb/sec is possible with 100% cpu utilization when
165              this option is enabled on 3ghz cpus.  A throughput  of  10mb/sec
166              is possible with 20% cpu utilization when this option is disabled
167              on 3ghz cpus.
168
169              For gig-e networks with large frame transmissions: A  throughput
170              of  20mb/sec  is  possible  when  this option is enabled on 3ghz
171              cpus.  A throughput of 60mb/sec is possible when this option  is
172              disabled on 3ghz cpus.
173
174              The default is on.
175
176
177       rrp_mode
178              This  specifies  the  mode of redundant ring, which may be none,
179              active, or passive.  Active replication  offers  slightly  lower
180              latency from transmit to delivery in faulty network environments
181              but with less performance.  Passive replication may nearly  dou‐
182              ble  the  speed  of  the  totem protocol if the protocol doesn't
183              become cpu bound.  The final option is none, in which case  only
184              one  network  interface will be used to operate the totem proto‐
185              col.
186
187              If only one interface directive is specified, none is  automati‐
188              cally  chosen.   If multiple interface directives are specified,
189              only active or passive may be chosen.
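
A sketch of a redundant ring setup with two interface directives is shown below. Ring numbers start at 0 and each interface gets its own. All network addresses and ports are assumptions for illustration.

```
totem {
        version: 2
        # With two interface directives, only active or passive is allowed
        rrp_mode: passive
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.5.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.0.0.0
                mcastaddr: 226.94.1.2
                mcastport: 5407
        }
}
```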
190
191
192       netmtu This specifies the network maximum transmission unit.  Setting
193              this value beyond 1500, the regular frame MTU, requires ethernet
194              devices that support large (also called jumbo) frames.  If any
195              device in the network doesn't support large frames, the protocol
196              will not operate properly.  The hosts must also have their MTU
197              size changed from 1500 to whatever frame size is specified here.
198
199              Please  note  while some NICs or switches claim large frame sup‐
200              port, they support 9000 MTU as the maximum frame size  including
201              the  IP  header.   Setting the netmtu and host MTUs to 9000 will
202              cause totem to use the full 9000 bytes of the frame.  Then Linux
203              will add an 18 byte header moving the full frame size to 9018.
204              As a result some hardware will not operate  properly  with  this
205              size  of data.  A netmtu of 8982 seems to work for the few large
206              frame devices that have been tested.  Some  manufacturers  claim
207              large  frame  support  when  in fact they support frame sizes of
208              4500 bytes.
209
210              Increasing the MTU from 1500 to 8982 doubles throughput  perfor‐
211              mance  from  30MB/sec to 60MB/sec as measured with evsbench with
212              175000 byte messages with the secauth directive set to off.
213
214              When sending multicast traffic, if the network frequently recon‐
215              figures,  chances  are  that  some device in the network doesn't
216              support large frames.
217
218              Choose hardware carefully if intending to use large  frame  sup‐
219              port.
220
221              The default is 1500.
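
The value 8982 mentioned above is 9000 minus the 18 byte header, which keeps the full frame at 9000 bytes. A minimal sketch, assuming the hosts and every switch in the path have already been configured for 9000 byte frames:

```
totem {
        version: 2
        # 9000 - 18 = 8982, leaving room for the 18 byte header
        netmtu: 8982
        # interface directive omitted here for brevity
}
```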
222
223
224       threads
225              This directive controls how many threads are used to encrypt and
226              send multicast messages.  If secauth is off, the  protocol  will
227              never  use  threaded  sending.  If secauth is on, this directive
228              allows systems to be  configured  to  use  multiple  threads  to
229              encrypt and send multicast messages.
230
231              A threads directive of 0 indicates that no threaded send should
232              be used.  This mode offers best performance for non-SMP systems.
233
234              The default is 0.
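
As a sketch, threaded sending only takes effect together with secauth. The thread count of 2 is an arbitrary assumption for an SMP system, not a recommendation from this page.

```
totem {
        version: 2
        # threads is ignored when secauth is off
        secauth: on
        threads: 2
        # interface directive omitted here for brevity
}
```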
235
236
237       vsftype
238              This directive controls the virtual synchrony filter  type  used
239              to  identify  a  primary component.  The preferred choice is YKD
240              dynamic linear voting; however, for clusters larger than 32
241              nodes YKD consumes a lot of memory.  For large  scale  clusters
242              that are created by changing the MAX_PROCESSORS_COUNT #define in
243              the  C code totem.h file, the virtual synchrony filter "none" is
244              recommended but then AMF and DLCK services (which are  currently
245              experimental) are not safe for use.
246
247              The default is ykd.  The vsftype can also be set to none.
248
249
250       transport
251              This  directive  controls  the transport mechanism used.  If the
252              interface to which corosync is binding is an RDMA interface such
253              as  RoCEE  or  Infiniband, the "iba" parameter may be specified.
254              To avoid the use of  multicast  entirely,  a  unicast  transport
255              parameter "udpu" can be specified.  This requires specifying the
256              list of members that could potentially make  up  the  membership
257              before deployment.
258
259              The  default is udp.  The transport type can also be set to udpu
260              or iba.
261
262              Within the totem  directive,  there  are  several  configuration
263              options which are used to control the operation of the protocol.
264              It is generally not recommended to change any  of  these  values
265              without  proper  guidance and sufficient testing.  Some networks
266              may require larger values if suffering from frequent  reconfigu‐
267              rations.  Some applications may require faster failure detection
268              times which can be achieved by reducing the token timeout.
269
270
271       token  This timeout specifies the time in milliseconds until a token loss
272              is declared after not receiving a token.  This is the time spent
273              detecting a failure of a processor in the current configuration.
274              Reforming  a  new  configuration  takes about 50 milliseconds in
275              addition to this timeout.
276
277              The default is 1000 milliseconds.
278
279
280       token_retransmit
281              This timeout specifies in milliseconds how long to wait without
282              receiving a token before the token is retransmitted.  This will be
283              automatically calculated if token is modified.  It is not recom‐
284              mended  to  alter  this value without guidance from the corosync
285              community.
286
287              The default is 238 milliseconds.
288
289
290       hold   This timeout specifies in milliseconds how long the token should
291              be  held  by  the  representative when the protocol is under low
292              utilization.   It is not recommended to alter this value without
293              guidance from the corosync community.
294
295              The default is 180 milliseconds.
296
297
298       token_retransmits_before_loss_const
299              This  value  identifies  how  many  token  retransmits should be
300              attempted before forming a new configuration.  If this value  is
301              set, token_retransmit and hold will be automatically calculated
302              from token_retransmits_before_loss_const and token.
303
304              The default is 4 retransmissions.
305
306
307       join   This timeout specifies in milliseconds how long to wait for join
308              messages in the membership protocol.
309
310              The default is 50 milliseconds.
311
312
313       send_join
314              This timeout specifies in milliseconds the upper bound of a range
315              from 0 to send_join to wait before sending a join message.  For
316              configurations with fewer than 32 nodes, this parameter is not
317              necessary.  For larger rings, this parameter is necessary to ensure
318              the  NIC  is not overflowed with join messages on formation of a
319              new ring.  A reasonable value for large rings (128 nodes)  would
320              be 80msec.  Other timer values must also change if this value is
321              changed.  Seek advice from the corosync mailing list  if  trying
322              to run larger configurations.
323
324              The default is 0 milliseconds.
325
326
327       consensus
328              This timeout specifies in milliseconds how long to wait for con‐
329              sensus to be achieved before starting a new round of  membership
330              configuration.   The  minimum  value for consensus must be 1.2 *
331              token.  This value will be automatically  calculated  at  1.2  *
332              token if the user doesn't specify a consensus value.
333
334              For two node clusters, a consensus larger than the join timeout
335              but less than token is safe.  For three node or larger clusters,
336              consensus should be larger than token.  There is an increasing
337              risk of odd membership changes, which still guarantee virtual
338              synchrony, as node count grows if consensus is less than token.
339
340              The default is 1200 milliseconds.
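
As a worked example of the 1.2 * token rule, raising token to 3000 milliseconds (an illustrative value) gives a consensus of 3600 milliseconds. Setting consensus explicitly, as sketched here, is optional because the same value would be calculated automatically.

```
totem {
        version: 2
        token: 3000
        # 1.2 * 3000 = 3600
        consensus: 3600
        # interface directive omitted here for brevity
}
```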
341
342
343       merge  This  timeout  specifies in milliseconds how long to wait before
344              checking for a partition when  no  multicast  traffic  is  being
345              sent.   If  multicast traffic is being sent, the merge detection
346              happens automatically as a function of the protocol.
347
348              The default is 200 milliseconds.
349
350
351       downcheck
352              This timeout specifies in milliseconds how long to  wait  before
353              checking  that  a network interface is back up after it has been
354              downed.
355
356              The default is 1000 milliseconds.
357
358
359       fail_recv_const
360              This constant specifies how many rotations of the token  without
361              receiving  any  of the messages when messages should be received
362              may occur before a new configuration is formed.
363
364              The default is 2500 failures to receive a message.
365
366
367       seqno_unchanged_const
368              This constant specifies how many rotations of the token  without
369              any  multicast  traffic  should occur before the merge detection
370              timeout is started.
371
372              The default is 30 rotations.
373
374
375       heartbeat_failures_allowed
376              [HeartBeating mechanism] Configures  the  optional  HeartBeating
377              mechanism for faster failure detection. Keep in mind that engag‐
378              ing this mechanism in lossy networks  could  cause  faulty  loss
379              declaration  as  the  mechanism relies on the network for heart‐
380              beating.
381
382              So as a rule of thumb use this mechanism if you require improved
383              failure detection in low to medium utilized networks.
384
385              This  constant  specifies  the  number of heartbeat failures the
386              system should tolerate before declaring heartbeat failure, e.g. 3.
387              If this value is not set or is 0, the heartbeat mechanism is not
388              engaged in the system and token rotation is the method of failure
389              detection.
390
391              The default is 0 (disabled).
392
393
394       max_network_delay
395              [HeartBeating mechanism] This constant specifies in milliseconds
396              the approximate delay that your network takes to  transport  one
397              packet from one machine to another.  This value should be set by
398              system engineers; do not change it if unsure, as this affects
399              the failure detection mechanism using heartbeat.
400
401              The default is 50 milliseconds.
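
A sketch that engages the heartbeat mechanism with the example tolerance of 3 failures mentioned above, leaving max_network_delay at its default:

```
totem {
        version: 2
        # 0 or unset leaves the heartbeat mechanism disabled
        heartbeat_failures_allowed: 3
        max_network_delay: 50
        # interface directive omitted here for brevity
}
```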
402
403
404       window_size
405              This  constant specifies the maximum number of messages that may
406              be sent on  one  token  rotation.   If  all  processors  perform
407              equally  well,  this  value  could  be  large (300), which would
408              introduce higher latency from origination to delivery  for  very
409              large rings.  To reduce latency in large rings (16+), the
410              defaults are a safe compromise.  If 1 or more slow processor(s)
411              are present among fast processors, window_size should be no
412              larger than 256000 / netmtu to avoid overflow of the kernel
413              receive buffers.  The user is notified of this by the display of
414              a retransmit list in the notification logs.  There is no loss of
415              data, but performance is reduced when these errors occur.
416
417              The default is 50 messages.
418
419
420       max_messages
421              This  constant specifies the maximum number of messages that may
422              be sent by one processor on receipt of the token.  The  max_mes‐
423              sages  parameter  is limited to 256000 / netmtu to prevent over‐
424              flow of the kernel transmit buffers.
425
426              The default is 17 messages.
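
The 256000 / netmtu limit can be worked through for a large frame setup. With netmtu set to 8982, 256000 / 8982 is about 28.5, so both window_size and max_messages should stay at or below 28. With the default netmtu of 1500 the limit is about 170, so the defaults of 50 and 17 are well inside it. The sketch below assumes a netmtu of 8982 and slow processors mixed with fast ones.

```
totem {
        version: 2
        netmtu: 8982
        # 256000 / 8982 is about 28.5, so do not exceed 28
        window_size: 28
        # The default of 17 is already below the limit
        max_messages: 17
        # interface directive omitted here for brevity
}
```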
427
428
429       miss_count_const
430              This constant defines the maximum number of times on receipt  of
431              a  token  a  message  is  checked  for  retransmission  before a
432              retransmission occurs.  This parameter is useful to  modify  for
433              switches  that delay multicast packets compared to unicast pack‐
434              ets.  The default setting  works  well  for  nearly  all  modern
435              switches.
436
437              The default is 5 messages.
438
439
440       rrp_problem_count_timeout
441              This  specifies  the  time in milliseconds to wait before decre‐
442              menting the problem count by 1 for a particular ring to ensure a
443              link is not marked faulty for transient network failures.
444
445              The default is 2000 milliseconds.
446
447
448       rrp_problem_count_threshold
449              This  specifies the number of times a problem is detected with a
450              link before setting the link faulty.  Once a link is set faulty,
451              no  more data is transmitted upon it.  Also, the problem counter
452              is no longer decremented when the problem count timeout expires.
453
454              A problem is detected whenever all tokens from the preceding
455              processor have not been received within the
456              rrp_token_expired_timeout.  The rrp_problem_count_threshold *
457              rrp_token_expired_timeout should be at least 50 milliseconds less
458              than the token timeout, or a complete reconfiguration may occur.
459
460              The default is 10 problem counts.
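
The constraint above can be checked against the documented defaults: 10 problem counts * 47 milliseconds = 470 milliseconds, which is 530 milliseconds below the default token timeout of 1000 milliseconds and so satisfies the 50 millisecond margin. The sketch below simply writes those defaults out. The values are normally calculated automatically, and they are spelled out here only to show the arithmetic.

```
totem {
        version: 2
        rrp_mode: passive
        token: 1000
        rrp_problem_count_threshold: 10
        rrp_token_expired_timeout: 47
        # 10 * 47 = 470, and 1000 - 470 = 530, which is more than 50
        # interface directives omitted here for brevity
}
```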
461
462
463       rrp_problem_count_mcast_threshold
464              This specifies the number of times a problem  is  detected  with
465              multicast  before  setting the link faulty for passive rrp mode.
466              This variable is unused in active rrp mode.
467
468              The default is 10 times rrp_problem_count_threshold.
469
470
471       rrp_token_expired_timeout
472              This specifies the time in milliseconds to increment the problem
473              counter  for  the  redundant  ring  protocol  after  not  having
474              received a token from all rings for a particular processor.
475
476              This value will automatically be calculated from the token time‐
477              out  and  problem_count_threshold  but may be overridden.  It is
478              not recommended to override this value without guidance from the
479              corosync community.
480
481              The default is 47 milliseconds.
482
483
484       rrp_autorecovery_check_timeout
485              This  specifies  the time in milliseconds to check if the failed
486              ring can be auto-recovered.
487
488              The default is 1000 milliseconds.
489
490
491       Within the logging directive, there are several  configuration  options
492       which are all optional.
493
494
495       The following 3 options are valid only for the top level logging direc‐
496       tive:
497
498
499       timestamp
500              This specifies that a timestamp is placed on all log messages.
501
502              The default is off.
503
504
505       fileline
506              This specifies that file and line should be printed.
507
508              The default is off.
509
510
511       function_name
512              This specifies that the code function name should be printed.
513
514              The default is off.
515
516
517       The following options are valid for the top level logging directive and
518       can be overridden in logger_subsys entries.
519
520
521       to_stderr
522
523       to_logfile
524
525       to_syslog
526              These specify the destination of logging output. Any combination
527              of these options may be specified. Valid options are yes and no.
528
529              The default is syslog and stderr.
530
531              Please note, if you are using to_logfile and want to rotate  the
532              file, use logrotate(8) with the option copytruncate, e.g.:
533
534              /var/log/corosync.log {
535                  missingok
536                  compress
537                  notifempty
538                  daily
539                  rotate 7
540                  copytruncate
541              }
542
543       logfile
544              If the to_logfile directive is set to yes, this option specifies
545              the pathname of the log file.
546
547              No default.
548
549
550       logfile_priority
551              This specifies the logfile priority for this particular  subsys‐
552              tem.  Ignored if debug is on.  Possible values are: alert, crit,
553              debug (same as debug = on), emerg, err, info, notice, warning.
554
555              The default is: info.
556
557
558       syslog_facility
559              This specifies the syslog facility type that will  be  used  for
560              any messages sent to syslog.  Valid options are daemon, local0,
561              local1, local2, local3, local4, local5, local6 and local7.
562
563              The default is daemon.
564
565
566       syslog_priority
567              This specifies the syslog level for this  particular  subsystem.
568              Ignored if debug is on.  Possible values are: alert, crit, debug
569              (same as debug = on), emerg, err, info, notice, warning.
570
571              The default is: info.
572
573
574       debug  This specifies whether debug output is logged for this  particu‐
575              lar logger.
576
577              The default is off.
578
579
580       tags   This  specifies  which tags should be traced for this particular
581              logger.  Set debug directive to on in order  to  enable  tracing
582              using tags.  Values are specified using a vertical bar as a log‐
583              ical OR separator:
584
585              enter|leave|trace1|trace2|trace3|...
586
587              The default is none.
588
589
590       Within the logging directive, logger_subsys directives are optional.
591
592
593       Within the logger_subsys sub-directive, all of the above  logging  con‐
594       figuration  options  are  valid and can be used to override the default
595       settings.  The subsys entry, described below, is mandatory to  identify
596       the subsystem.
597
598
599       subsys This  specifies  the subsystem identity (name) for which logging
600              is specified.  This is the name used by a service in the
601              log_init() call, e.g. 'CKPT'.  This directive is required.
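
Putting the logging options together, a sketch of a logging directive with one logger_subsys override might look like the following. The log file path matches the logrotate example above. The choice of which options to enable is an assumption for illustration.

```
logging {
        # Options valid only at the top level
        timestamp: on
        fileline: off
        function_name: off

        # Destinations; any combination may be enabled
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/corosync.log
        syslog_facility: daemon
        debug: off

        # Override the defaults for a single subsystem
        logger_subsys {
                subsys: CKPT
                debug: on
        }
}
```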
602
603

FILES

605       /etc/corosync.conf
606              The corosync executive configuration file.
607
608
