corosync.conf(5)

COROSYNC_CONF(5)  Corosync Cluster Engine Programmer's Manual COROSYNC_CONF(5)

NAME

       corosync.conf - corosync executive configuration file

SYNOPSIS

       /etc/corosync/corosync.conf

DESCRIPTION

       The corosync.conf file instructs the corosync executive about various
       parameters needed to control the corosync executive. Empty lines and
       lines starting with the # character are ignored. The configuration file
       consists of bracketed top level directives. The possible directive
       choices are:


       totem { }
              This top level directive contains configuration options for the
              totem protocol.

       logging { }
              This top level directive contains configuration options for
              logging.

       event { }
              This top level directive contains configuration options for the
              event service.


       It is also possible to specify the top level parameter compatibility.
       This directive indicates the level of compatibility requested by the
       user. The option whitetank can be specified to remain backward
       compatible with openais-0.80.z. The option none can be specified to
       only be compatible with corosync-1.Y.Z. Extra processing during
       configuration changes is required to remain backward compatible.

       The default is whitetank (backward compatibility).

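       As a rough sketch of the overall layout described above, a file might
       be organized as follows. It assumes the usual "name: value" option
       syntax of corosync.conf. The addresses, the port and the logging
       choice are illustrative assumptions, not recommendations.

       ```text
       # Top level parameter
       compatibility: whitetank

       totem {
               version: 2
               interface {
                       ringnumber: 0
                       bindnetaddr: 192.168.5.0
                       mcastaddr: 239.255.1.1
                       mcastport: 5405
               }
       }

       logging {
               to_syslog: yes
       }
       ```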

       Within the totem directive, an interface directive is required. There
       is also one configuration option which is required (version, described
       below).

       Within the interface sub-directive of totem there are four parameters
       which are required. There is one parameter which is optional.


       ringnumber
              This specifies the ring number for the interface. When using
              the redundant ring protocol, each interface should specify
              separate ring numbers to uniquely identify to the membership
              protocol which interface to use for which redundant ring. The
              ringnumber must start at 0.


       bindnetaddr
              This specifies the network address the corosync executive should
              bind to. For example, if the local interface is 192.168.5.92
              with netmask 255.255.255.0, set bindnetaddr to 192.168.5.0. If
              the local interface is 192.168.5.92 with netmask
              255.255.255.192, set bindnetaddr to 192.168.5.64, and so forth.

              This may also be an IPv6 address, in which case IPv6 networking
              will be used. In this case, the full address must be specified
              and there is no automatic selection of the network interface
              within a specific subnet as with IPv4.

              If IPv6 networking is used, the nodeid field must be specified.


       broadcast
              This is optional and can be set to yes. If it is set to yes,
              the broadcast address will be used for communication. If this
              option is set, mcastaddr should not be set.


       mcastaddr
              This is the multicast address used by corosync executive. The
              default should work for most networks, but the network
              administrator should be queried about a multicast address to
              use. Avoid 224.x.x.x because this is a "config" multicast
              address.

              This may also be an IPv6 multicast address, in which case IPv6
              networking will be used. If IPv6 networking is used, the nodeid
              field must be specified.


       mcastport
              This specifies the UDP port number. It is possible to use the
              same multicast address on a network with the corosync services
              configured for different UDP ports. Please note that corosync
              uses two UDP ports: mcastport (for mcast receives) and
              mcastport - 1 (for mcast sends). If you have multiple clusters
              on the same network using the same mcastaddr, configure their
              mcastport values with a gap so that these port pairs do not
              overlap.

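              For illustration, a minimal multicast interface sub-directive
              might look like the sketch below. The multicast address and the
              port 5405 are assumed values for this example. With this
              setting corosync would use UDP ports 5405 and 5404, so a second
              cluster sharing the same mcastaddr could use, for instance,
              mcastport 5407 (ports 5407 and 5406).

              ```text
              totem {
                      version: 2
                      interface {
                              ringnumber: 0
                              bindnetaddr: 192.168.5.0
                              mcastaddr: 239.255.1.1
                              mcastport: 5405
                      }
              }
              ```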

       ttl    This specifies the Time To Live (TTL). If you run your cluster
              on a routed network then the default of "1" will be too small.
              This option provides a way to increase this up to 255. The valid
              range is 0..255. Note that this is only valid on multicast
              transport types.


       member This specifies a member on the interface and is used with the
              udpu transport only. Every node that should be a member of the
              membership should be specified as a separate member directive.
              Within the member directive there is a parameter memberaddr
              which specifies the IP address of one of the nodes.

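              A sketch of a unicast configuration with two members, assuming
              nodes at 10.0.0.1 and 10.0.0.2 on a 255.255.255.0 network
              (hypothetical addresses) and an assumed port of 5405:

              ```text
              totem {
                      version: 2
                      transport: udpu
                      interface {
                              ringnumber: 0
                              bindnetaddr: 10.0.0.0
                              mcastport: 5405
                              member {
                                      memberaddr: 10.0.0.1
                              }
                              member {
                                      memberaddr: 10.0.0.2
                              }
                      }
              }
              ```

              The transport option used here is described with the other
              totem options.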

       Within the totem directive, there are seven configuration options of
       which one is required, five are optional, and one is required when IPv6
       is configured in the interface sub-directive. The required option
       controls the version of the totem configuration. The option which is
       optional unless IPv6 is used, nodeid, controls identification of the
       processor. The remaining optional options control secrecy and
       authentication, the redundant ring mode of operation, the maximum
       network MTU, and the number of sending threads.


       version
              This specifies the version of the configuration file. Currently
              the only valid version for this directive is 2.


       nodeid This configuration option is optional when using IPv4 and
              required when using IPv6. This is a 32 bit value specifying the
              node identifier delivered to the cluster membership service. If
              this is not specified with IPv4, the node id will be determined
              from the 32 bit IP address to which the system is bound with
              ring identifier of 0. The node identifier value of zero is
              reserved and should not be used.

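              As a sketch, an IPv6 configuration could set nodeid explicitly
              as shown below. The addresses, the port and the nodeid value
              are hypothetical; each node would need its own nonzero nodeid
              and its own full bindnetaddr.

              ```text
              totem {
                      version: 2
                      nodeid: 1
                      interface {
                              ringnumber: 0
                              bindnetaddr: fd00:0:0:1::92
                              mcastaddr: ff15::1
                              mcastport: 5405
                      }
              }
              ```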

       clear_node_high_bit
              This configuration option is optional and is only relevant when
              no nodeid is specified. Some openais clients require a signed
              32 bit nodeid that is greater than zero; however, by default
              openais uses all 32 bits of the IPv4 address space when
              generating a nodeid. Set this option to yes to force the high
              bit to be zero and therefore ensure the nodeid is a positive
              signed 32 bit integer.

              WARNING: The cluster's behavior is undefined if this option is
              enabled on only a subset of the cluster (for example during a
              rolling upgrade).


       secauth
              This specifies that HMAC/SHA1 authentication should be used to
              authenticate all messages. It further specifies that all data
              should be encrypted with the sober128 encryption algorithm to
              protect data from eavesdropping.

              Enabling this option adds a 36 byte header to every message sent
              by totem, which reduces total throughput. Encryption and
              authentication consume 75% of CPU cycles in aisexec as measured
              with gprof when enabled.

              For 100mbit networks with 1500 MTU frame transmissions: A
              throughput of 9mb/sec is possible with 100% CPU utilization when
              this option is enabled on 3ghz CPUs. A throughput of 10mb/sec
              is possible with 20% CPU utilization when this option is
              disabled on 3ghz CPUs.

              For gig-e networks with large frame transmissions: A throughput
              of 20mb/sec is possible when this option is enabled on 3ghz
              CPUs. A throughput of 60mb/sec is possible when this option is
              disabled on 3ghz CPUs.

              The default is on.


       rrp_mode
              This specifies the mode of redundant ring, which may be none,
              active, or passive. Active replication offers slightly lower
              latency from transmit to delivery in faulty network environments
              but with less performance. Passive replication may nearly
              double the speed of the totem protocol if the protocol doesn't
              become CPU bound. The final option is none, in which case only
              one network interface will be used to operate the totem
              protocol.

              If only one interface directive is specified, none is
              automatically chosen. If multiple interface directives are
              specified, only active or passive may be chosen.

              When using multiple interfaces, make sure to use a different
              multicast address/port pair for each interface (ports for the
              same address must differ by at least two) so that rrp works.
              This is checked by the parser.

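              The following sketch shows two interface directives for a
              passive redundant ring. The networks, multicast addresses and
              ports are assumed values chosen so that each ring has its own
              address/port pair.

              ```text
              totem {
                      version: 2
                      rrp_mode: passive
                      interface {
                              ringnumber: 0
                              bindnetaddr: 192.168.5.0
                              mcastaddr: 239.255.1.1
                              mcastport: 5405
                      }
                      interface {
                              ringnumber: 1
                              bindnetaddr: 10.0.0.0
                              mcastaddr: 239.255.2.1
                              mcastport: 5407
                      }
              }
              ```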

       netmtu This specifies the network maximum transmission unit. Setting
              this value beyond 1500, the regular frame MTU, requires ethernet
              devices that support large, also called jumbo, frames. If any
              device in the network doesn't support large frames, the
              protocol will not operate properly. The hosts must also have
              their MTU size set from 1500 to whatever frame size is specified
              here.

              Please note that while some NICs or switches claim large frame
              support, they support 9000 MTU as the maximum frame size
              including the IP header. Setting the netmtu and host MTUs to
              9000 will cause totem to use the full 9000 bytes of the frame.
              Then Linux will add an 18 byte header, moving the full frame
              size to 9018. As a result some hardware will not operate
              properly with this size of data. A netmtu of 8982 seems to work
              for the few large frame devices that have been tested. Some
              manufacturers claim large frame support when in fact they
              support frame sizes of 4500 bytes.

              Increasing the MTU from 1500 to 8982 doubles throughput
              performance from 30MB/sec to 60MB/sec as measured with evsbench
              with 175000 byte messages with the secauth directive set to off.

              When sending multicast traffic, if the network frequently
              reconfigures, chances are that some device in the network
              doesn't support large frames.

              Choose hardware carefully if intending to use large frame
              support.

              The default is 1500.


       threads
              This directive controls how many threads are used to encrypt and
              send multicast messages. If secauth is off, the protocol will
              never use threaded sending. If secauth is on, this directive
              allows systems to be configured to use multiple threads to
              encrypt and send multicast messages.

              A thread directive of 0 indicates that no threaded send should
              be used. This mode offers best performance for non-SMP systems.

              The default is 0.


       vsftype
              This directive controls the virtual synchrony filter type used
              to identify a primary component. The preferred choice is YKD
              dynamic linear voting; however, for clusters larger than 32
              nodes YKD consumes a lot of memory. For large scale clusters
              that are created by changing the MAX_PROCESSORS_COUNT #define in
              the C code totem.h file, the virtual synchrony filter "none" is
              recommended, but then the AMF and DLCK services (which are
              currently experimental) are not safe for use.

              The default is ykd. The vsftype can also be set to none.


       transport
              This directive controls the transport mechanism used. If the
              interface to which corosync is binding is an RDMA interface such
              as RoCEE or Infiniband, the "iba" parameter may be specified.
              To avoid the use of multicast entirely, a unicast transport
              parameter "udpu" can be specified. This requires specifying the
              list of members that could potentially make up the membership
              before deployment.

              The default is udp. The transport type can also be set to udpu
              or iba.

       Within the totem directive, there are several configuration options
       which are used to control the operation of the protocol. It is
       generally not recommended to change any of these values without proper
       guidance and sufficient testing. Some networks may require larger
       values if suffering from frequent reconfigurations. Some applications
       may require faster failure detection times which can be achieved by
       reducing the token timeout.


       token  This timeout specifies the time in milliseconds until a token
              loss is declared after not receiving a token. This is the time
              spent detecting a failure of a processor in the current
              configuration. Reforming a new configuration takes about 50
              milliseconds in addition to this timeout.

              The default is 1000 milliseconds.


       token_retransmit
              This timeout specifies in milliseconds how long to wait without
              receiving a token before the token is retransmitted. This will
              be automatically calculated if token is modified. It is not
              recommended to alter this value without guidance from the
              corosync community.

              The default is 238 milliseconds.


       hold   This timeout specifies in milliseconds how long the token should
              be held by the representative when the protocol is under low
              utilization. It is not recommended to alter this value without
              guidance from the corosync community.

              The default is 180 milliseconds.


       token_retransmits_before_loss_const
              This value identifies how many token retransmits should be
              attempted before forming a new configuration. If this value is
              set, token_retransmit and hold will be automatically calculated
              from token_retransmits_before_loss_const and token.

              The default is 4 retransmissions.


       join   This timeout specifies in milliseconds how long to wait for join
              messages in the membership protocol.

              The default is 50 milliseconds.


       send_join
              This timeout specifies in milliseconds an upper range between 0
              and send_join to wait before sending a join message. For
              configurations with less than 32 nodes, this parameter is not
              necessary. For larger rings, this parameter is necessary to
              ensure the NIC is not overflowed with join messages on formation
              of a new ring. A reasonable value for large rings (128 nodes)
              would be 80msec. Other timer values must also change if this
              value is changed. Seek advice from the corosync mailing list if
              trying to run larger configurations.

              The default is 0 milliseconds.


       consensus
              This timeout specifies in milliseconds how long to wait for
              consensus to be achieved before starting a new round of
              membership configuration. The minimum value for consensus must
              be 1.2 * token. This value will be automatically calculated at
              1.2 * token if the user doesn't specify a consensus value.

              For two node clusters, a consensus larger than the join timeout
              but less than token is safe. For three node or larger clusters,
              consensus should be larger than token. There is an increasing
              risk of odd membership changes, which still guarantee virtual
              synchrony, as node count grows if consensus is less than token.

              The default is 1200 milliseconds.

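              As a worked example, assuming the token timeout is raised to an
              illustrative 3000 milliseconds, the automatically calculated
              consensus would be 1.2 * 3000 = 3600 milliseconds. Setting
              both explicitly could be sketched as:

              ```text
              totem {
                      version: 2
                      token: 3000
                      consensus: 3600
              }
              ```

              The interface sub-directive is omitted from this sketch for
              brevity.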

       merge  This timeout specifies in milliseconds how long to wait before
              checking for a partition when no multicast traffic is being
              sent. If multicast traffic is being sent, the merge detection
              happens automatically as a function of the protocol.

              The default is 200 milliseconds.


       downcheck
              This timeout specifies in milliseconds how long to wait before
              checking that a network interface is back up after it has been
              downed.

              The default is 1000 milliseconds.


       fail_recv_const
              This constant specifies how many rotations of the token without
              receiving any of the messages, when messages should be
              received, may occur before a new configuration is formed.

              The default is 2500 failures to receive a message.


       seqno_unchanged_const
              This constant specifies how many rotations of the token without
              any multicast traffic should occur before the hold timer is
              started.

              The default is 30 rotations.


       heartbeat_failures_allowed
              [HeartBeating mechanism] Configures the optional HeartBeating
              mechanism for faster failure detection. Keep in mind that
              engaging this mechanism in lossy networks could cause faulty
              loss declaration as the mechanism relies on the network for
              heartbeating.

              So as a rule of thumb, use this mechanism if you require
              improved failure detection in low to medium utilized networks.

              This constant specifies the number of heartbeat failures the
              system should tolerate before declaring heartbeat failure (e.g.
              3). If this value is not set or is 0 then the heartbeat
              mechanism is not engaged in the system and token rotation is
              the method of failure detection.

              The default is 0 (disabled).


       max_network_delay
              [HeartBeating mechanism] This constant specifies in milliseconds
              the approximate delay that your network takes to transport one
              packet from one machine to another. This value is to be set by
              system engineers; do not change it if unsure, as it affects the
              failure detection mechanism using heartbeat.

              The default is 50 milliseconds.


       window_size
              This constant specifies the maximum number of messages that may
              be sent on one token rotation. If all processors perform
              equally well, this value could be large (300), which would
              introduce higher latency from origination to delivery for very
              large rings. To reduce latency in large rings (16+), the
              defaults are a safe compromise. If 1 or more slow processor(s)
              are present among fast processors, window_size should be no
              larger than 256000 / netmtu to avoid overflow of the kernel
              receive buffers. The user is notified of this by the display of
              a retransmit list in the notification logs. There is no loss of
              data, but performance is reduced when these errors occur.

              The default is 50 messages.


       max_messages
              This constant specifies the maximum number of messages that may
              be sent by one processor on receipt of the token. The
              max_messages parameter is limited to 256000 / netmtu to prevent
              overflow of the kernel transmit buffers.

              The default is 17 messages.

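              As a worked example of the 256000 / netmtu bound, with the
              default netmtu of 1500 the limit is 256000 / 1500, or about 170
              messages, while with a netmtu of 8982 it is 256000 / 8982, or
              about 28 messages. A jumbo frame setup with slow processors
              present might therefore be sketched as follows. The
              window_size value is an assumption derived from that
              arithmetic, and the interface sub-directive is omitted for
              brevity.

              ```text
              totem {
                      version: 2
                      netmtu: 8982
                      window_size: 28
                      max_messages: 17
              }
              ```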

       miss_count_const
              This constant defines the maximum number of times on receipt of
              a token a message is checked for retransmission before a
              retransmission occurs. This parameter is useful to modify for
              switches that delay multicast packets compared to unicast
              packets. The default setting works well for nearly all modern
              switches.

              The default is 5 messages.


       rrp_problem_count_timeout
              This specifies the time in milliseconds to wait before
              decrementing the problem count by 1 for a particular ring to
              ensure a link is not marked faulty for transient network
              failures.

              The default is 2000 milliseconds.


       rrp_problem_count_threshold
              This specifies the number of times a problem is detected with a
              link before setting the link faulty. Once a link is set faulty,
              no more data is transmitted upon it. Also, the problem counter
              is no longer decremented when the problem count timeout expires.

              A problem is detected whenever all tokens from the preceding
              processor have not been received within the
              rrp_token_expired_timeout. The rrp_problem_count_threshold *
              rrp_token_expired_timeout should be at least 50 milliseconds
              less than the token timeout, or a complete reconfiguration may
              occur.

              The default is 10 problem counts.

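              As a worked check using the defaults documented in this page:
              rrp_problem_count_threshold (10) * rrp_token_expired_timeout
              (47 milliseconds) = 470 milliseconds, which is more than 50
              milliseconds below the default token timeout of 1000
              milliseconds. Written out explicitly, those defaults would look
              like this sketch (interface sub-directives omitted):

              ```text
              totem {
                      version: 2
                      token: 1000
                      rrp_problem_count_threshold: 10
                      rrp_token_expired_timeout: 47
              }
              ```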

       rrp_problem_count_mcast_threshold
              This specifies the number of times a problem is detected with
              multicast before setting the link faulty for passive rrp mode.
              This variable is unused in active rrp mode.

              The default is 10 times rrp_problem_count_threshold.


       rrp_token_expired_timeout
              This specifies the time in milliseconds to increment the problem
              counter for the redundant ring protocol after not having
              received a token from all rings for a particular processor.

              This value will automatically be calculated from the token
              timeout and rrp_problem_count_threshold but may be overridden.
              It is not recommended to override this value without guidance
              from the corosync community.

              The default is 47 milliseconds.


       rrp_autorecovery_check_timeout
              This specifies the time in milliseconds to check if the failed
              ring can be auto-recovered.

              The default is 1000 milliseconds.


       Within the logging directive, there are several configuration options
       which are all optional.


       The following 3 options are valid only for the top level logging
       directive:


       timestamp
              This specifies that a timestamp is placed on all log messages.

              The default is off.


       fileline
              This specifies that file and line should be printed.

              The default is off.


       function_name
              This specifies that the code function name should be printed.

              The default is off.


       The following options are valid for the top level logging directive
       and can be overridden in logger_subsys entries.


       to_stderr

       to_logfile

       to_syslog
              These specify the destination of logging output. Any combination
              of these options may be specified. Valid options are yes and no.

              The default is syslog and stderr.

              Please note, if you are using to_logfile and want to rotate the
              file, use logrotate(8) with the option copytruncate, e.g.:
              /var/log/corosync.log {
                   missingok
                   compress
                   notifempty
                   daily
                   rotate 7
                   copytruncate
              }


       logfile
              If the to_logfile directive is set to yes, this option
              specifies the pathname of the log file.

              No default.


       logfile_priority
              This specifies the logfile priority for this particular
              subsystem. Ignored if debug is on. Possible values are: alert,
              crit, debug (same as debug = on), emerg, err, info, notice,
              warning.

              The default is: info.


       syslog_facility
              This specifies the syslog facility type that will be used for
              any messages sent to syslog. Options are daemon, local0, local1,
              local2, local3, local4, local5, local6 and local7.

              The default is daemon.


       syslog_priority
              This specifies the syslog level for this particular subsystem.
              Ignored if debug is on. Possible values are: alert, crit, debug
              (same as debug = on), emerg, err, info, notice, warning.

              The default is: info.


       debug  This specifies whether debug output is logged for this
              particular logger. It can also be set to trace, which is the
              highest level of debug information.

              The default is off.


       tags   This specifies which tags should be traced for this particular
              logger. Set debug directive to on in order to enable tracing
              using tags. Values are specified using a vertical bar as a
              logical OR separator:

              enter|leave|trace1|trace2|trace3|...

              The default is none.


       Within the logging directive, logger_subsys directives are optional.


       Within the logger_subsys sub-directive, all of the above logging
       configuration options are valid and can be used to override the default
       settings. The subsys entry, described below, is mandatory to identify
       the subsystem.


       subsys This specifies the subsystem identity (name) for which logging
              is specified. This is the name used by a service in the
              log_init() call, e.g. 'CKPT'. This directive is required.

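       Putting the logging options together, a logging directive with one
       logger_subsys override might be sketched as follows. The option values
       are assumptions for illustration; the log file path and the CKPT
       subsystem name are reused from the examples in this page.

       ```text
       logging {
               timestamp: on
               to_stderr: no
               to_logfile: yes
               to_syslog: yes
               logfile: /var/log/corosync.log
               syslog_facility: daemon
               debug: off
               logger_subsys {
                       subsys: CKPT
                       debug: on
               }
       }
       ```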

FILES

       /etc/corosync/corosync.conf
              The corosync executive configuration file.
