GANESHA-RADOS-CLUSTER-DESIGN(8)   NFS-Ganesha  GANESHA-RADOS-CLUSTER-DESIGN(8)

NAME

       ganesha-rados-cluster-design - Clustered RADOS Recovery Backend Design

OVERVIEW

       This document aims to explain the theory and design behind the
       rados_cluster recovery backend, which coordinates grace period
       enforcement among multiple, independent NFS servers.

       In order to understand the clustered recovery backend, it's first
       necessary to understand how recovery works with a single server:

SINGLETON SERVER RECOVERY

       NFSv4 is a lease-based protocol. Clients set up a relationship to
       the server and must periodically renew their lease in order to
       maintain their ephemeral state (open files, locks, delegations or
       layouts).

       When a singleton NFS server is restarted, any ephemeral state is
       lost. When the server comes back online, NFS clients detect that
       the server has been restarted and will reclaim the ephemeral state
       that they held at the time of their last contact with the server.

SINGLETON GRACE PERIOD

       In order to ensure that we don't end up with conflicts, clients
       are barred from acquiring any new state while in the Recovery
       phase. Only reclaim operations are allowed.

       This period of time is called the grace period. Most NFS servers
       have a grace period that lasts around two lease periods; however,
       nfs-ganesha can and will lift the grace period early if it
       determines that no more clients will be allowed to recover.

       Once the grace period ends, the server will move into its Normal
       operation state. During this period, no more recovery is allowed
       and new state can be acquired by NFS clients.

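       The early-lift decision can be sketched roughly as follows. This
       is only an illustration of the idea, not code from nfs-ganesha;
       the function and variable names are invented for the example:

          # Hypothetical sketch: the grace period can be lifted early once
          # every client recorded in the previous epoch's recovery database
          # has either completed its reclaim or had its lease expire.
          def can_lift_grace_early(previous_db, reclaimed, expired):
              """previous_db, reclaimed and expired are sets of client ids."""
              outstanding = previous_db - reclaimed - expired
              return len(outstanding) == 0

          # Two clients were known to the previous epoch; one has reclaimed
          # and the other's lease has expired, so no further recovery is
          # possible and grace can end early.
          print(can_lift_grace_early({"c1", "c2"}, {"c1"}, {"c2"}))  # True
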
REBOOT EPOCHS

       The lifecycle of a singleton NFS server can be considered to be a
       series of transitions from the Recovery period to Normal operation
       and back. In the remainder of this document we'll consider such a
       period to be an epoch, and assign each a number beginning with 1.

       Visually, we can represent it like this, such that each Normal ->
       Recovery transition is marked by a change in the epoch value:

          +-------+-------+-------+---------------+-------+
          | State | R | N | R | N | R | R | R | N | R | N |
          +-------+-------+-------+---------------+-------+
          | Epoch |   1   |   2   |       3       |   4   |
          +-------+-------+-------+---------------+-------+

       Note that it is possible to restart during the grace period (as
       shown above during epoch 3). That just serves to extend the
       recovery period and the epoch. A new epoch is only declared during
       a Recovery -> Normal transition.

CLIENT RECOVERY DATABASE

       There are some potential edge cases that can occur involving
       network partitions and multiple reboots. In order to prevent
       those, the server must maintain a list of clients that hold state
       on the server at any given time. This list must be maintained on
       stable storage. If a client sends a request to reclaim some state,
       then the server must check to make sure it's on that list before
       allowing the request.

       Thus when the server allows reclaim requests, it must always gate
       them against the recovery database from the previous epoch. As
       clients come in to reclaim, we establish records for them in a new
       database associated with the current epoch.

       The transition from recovery to normal operation should perform an
       atomic switch of recovery databases. A recovery database only
       becomes legitimate on a recovery to normal transition. Until that
       point, the recovery database from the previous epoch is the
       canonical one.

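       A minimal sketch of this gating logic, with in-memory sets standing
       in for the stable-storage databases (the class and method names are
       invented for the example):

          # Hypothetical sketch: reclaims are checked against the previous
          # epoch's recovery database, reclaiming clients are recorded in
          # the current epoch's database, and the databases are switched
          # atomically when the grace period ends.
          class RecoveryDBs:
              def __init__(self, previous_epoch_clients):
                  self.previous = set(previous_epoch_clients)  # canonical DB
                  self.current = set()          # built during the new epoch

              def try_reclaim(self, client_id):
                  if client_id not in self.previous:
                      return False              # unknown to the old epoch
                  self.current.add(client_id)   # record for the new epoch
                  return True

              def finish_grace(self):
                  # Recovery -> Normal: the new database becomes canonical.
                  self.previous, self.current = self.current, set()

          dbs = RecoveryDBs({"clientA"})
          print(dbs.try_reclaim("clientA"))   # True
          print(dbs.try_reclaim("clientB"))   # False, not in the old database
          dbs.finish_grace()
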
EXPORTING A CLUSTERED FILESYSTEM

       Let's consider a set of independent NFS servers, all serving out
       the same content from a clustered backend filesystem of any
       flavor. Each NFS server in this case can itself be considered a
       clustered FS client. This means that the NFS server is really just
       a proxy for state on the clustered filesystem.

       The filesystem must make some guarantees to the NFS server. The
       first filesystem guarantee:

       1. The filesystem ensures that the NFS servers (aka the FS
          clients) cannot obtain state that conflicts with that of
          another NFS server.

       This is somewhat obvious and is what we expect from any clustered
       filesystem outside of any requirements of NFS. If the clustered
       filesystem can provide this, then we know that conflicting state
       during normal operations cannot be granted.

       The recovery period has a different set of rules. If an NFS server
       crashes and is restarted, then we have a window of time when that
       NFS server does not know what state was held by its clients.

       If the state held by the crashed NFS server is immediately
       released after the crash, another NFS server could hand out
       conflicting state before the original NFS client has a chance to
       recover it.

       This must be prevented, which leads to the second filesystem
       guarantee:

       2. The filesystem must not release state held by a server during
          the previous epoch until all servers in the cluster are
          enforcing the grace period.

       In practical terms, we want the filesystem to provide a way for an
       NFS server to tell it when it's safe to release state held by a
       previous instance of itself. The server should do this once it
       knows that all of its siblings are enforcing the grace period.

       Note that we do not require that all servers restart and allow
       reclaim at that point. It's sufficient for them to simply begin
       grace period enforcement as soon as possible once one server needs
       it.

CLUSTERED GRACE PERIOD DATABASE

       At this point the cluster siblings are no longer completely
       independent, and the grace period has become a cluster-wide
       property. This means that we must track the current epoch on some
       sort of shared storage that the servers can all access.

       Additionally we must also keep track of whether a cluster-wide
       grace period is in effect. Any running nodes should all be
       informed when either of these values changes, so they can take
       appropriate steps when it occurs.

       In the rados_cluster backend, we track these using two epoch
       values:

       C:     is the current epoch. This represents the current epoch
              value of the cluster.

       R:     is the recovery epoch. This represents the epoch from which
              clients are allowed to recover. A non-zero value here means
              that a cluster-wide grace period is in effect. Setting this
              to 0 ends that grace period.

       In order to decide when to make grace period transitions, each
       server must also advertise its state to the other nodes.
       Specifically, each server must be able to determine these two
       things about each of its siblings:

       1. Does this server have clients from the previous epoch that will
          require recovery? (NEED)

       2. Is this server enforcing the grace period by refusing
          non-reclaim locks? (ENFORCING)

       We do this with a pair of flags per sibling (NEED and ENFORCING).
       Each server typically manages its own flags.

       The rados_cluster backend stores all of this information in a
       single RADOS object that is modified using read/modify/write
       cycles. Typically we'll read the whole object, modify it, and then
       attempt to write it back. If something changes between the read
       and the write, we redo the read and try it again.

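       A rough model of that read/modify/write cycle, with an invented
       in-memory object standing in for the shared RADOS object (none of
       these names come from the actual backend):

          # Hypothetical sketch: the shared grace "object" holds C, R and
          # the per-node NEED/ENFORCING flags.  An update is retried
          # whenever the object changed between the read and the write.
          class GraceObject:
              def __init__(self):
                  self.version = 0
                  self.data = {"C": 1, "R": 0, "flags": {}}

              def read(self):
                  copy = dict(self.data, flags=dict(self.data["flags"]))
                  return self.version, copy

              def write(self, expected_version, new_data):
                  if expected_version != self.version:
                      return False   # lost the race, caller must retry
                  self.data = new_data
                  self.version += 1
                  return True

          def update(obj, modify):
              # Read/modify/write loop: re-read and retry on conflict.
              while True:
                  version, data = obj.read()
                  modify(data)
                  if obj.write(version, data):
                      return data

          obj = GraceObject()
          update(obj, lambda d: d["flags"].update(
              {"node1": {"NEED", "ENFORCING"}}))
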
CLUSTERED CLIENT RECOVERY DATABASES

       In rados_cluster the client recovery databases are stored as RADOS
       objects. Each NFS server has its own set of them and they are
       given names that have the current epoch (C) embedded in them. This
       ensures that recovery databases are specific to a particular
       epoch.

       In general, it's safe to delete any recovery database that
       precedes R when R is non-zero, and safe to remove any recovery
       database except for the current one (the one with C in the name)
       when the grace period is not in effect (R==0).

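       Expressed as a small helper (purely an illustration of that rule,
       not an existing interface in the backend):

          # Hypothetical sketch of the cleanup rule: a recovery database
          # for epoch db_epoch may be removed when it predates the
          # recovery epoch R, or when no grace period is in effect
          # (R == 0) and it is not the current epoch's database.
          def can_delete_recovery_db(db_epoch, current_epoch, recovery_epoch):
              if recovery_epoch != 0:
                  return db_epoch < recovery_epoch
              return db_epoch != current_epoch

          print(can_delete_recovery_db(2, 4, 3))  # True: precedes R
          print(can_delete_recovery_db(3, 4, 0))  # True: no grace period
          print(can_delete_recovery_db(4, 4, 0))  # False: current database
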
ESTABLISHING A NEW GRACE PERIOD

       When a server restarts and wants to allow clients to reclaim their
       state, it must establish a new epoch by incrementing the current
       epoch to declare a new grace period (R=C; C=C+1).

       The exception to this rule is when the cluster is already in a
       grace period. In that case, a server can simply join the
       in-progress grace period instead of establishing a new one.

       In either case, the server should also set its NEED and ENFORCING
       flags at the same time.

       The other surviving cluster siblings should take steps to begin
       grace period enforcement as soon as possible. This entails
       "draining off" any in-progress state morphing operations and then
       blocking the acquisition of any new state (usually with a return
       of NFS4ERR_GRACE to clients that attempt it). Again, there is no
       need for the survivors from the previous epoch to allow recovery
       here.

       The surviving servers must, however, establish a new client
       recovery database at this point to ensure that their clients can
       do recovery in the event of a crash afterward.

       Once all of the siblings are enforcing the grace period, the
       recovering server can then request that the filesystem release the
       old state, and allow clients to begin reclaiming their state. In
       the rados_cluster backend driver, we do this by stalling server
       startup until all hosts in the cluster are enforcing the grace
       period.

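       A simplified sketch of those two steps, in the same illustrative
       style as the earlier examples (the dictionary layout, helper names
       and polling loop are all assumptions made for the example):

          import time

          # Hypothetical sketch: a restarting server either declares a new
          # grace period (R=C; C=C+1) or joins one that is already active
          # (R != 0), raising its NEED and ENFORCING flags either way.
          def begin_grace(data, node_id):
              if data["R"] == 0:
                  data["R"] = data["C"]   # recover from the epoch just ended
                  data["C"] += 1          # declare the new epoch
              data["flags"][node_id] = {"NEED", "ENFORCING"}

          # The recovering server then stalls until every cluster member
          # advertises ENFORCING before asking the filesystem to drop the
          # old state and letting its clients reclaim.
          def wait_for_cluster_enforcement(read_flags, members, interval=1.0):
              while True:
                  flags = read_flags()
                  if all("ENFORCING" in flags.get(m, set()) for m in members):
                      return
                  time.sleep(interval)

          # node2 is already enforcing in this toy snapshot, so the wait
          # returns immediately.
          data = {"C": 3, "R": 0, "flags": {"node2": {"ENFORCING"}}}
          begin_grace(data, "node1")       # now C == 4, R == 3
          wait_for_cluster_enforcement(lambda: data["flags"],
                                       ["node1", "node2"])
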
LIFTING THE GRACE PERIOD

       Transitioning from recovery to normal operation really consists of
       two different steps:

       1. the server decides that it no longer requires a grace period,
          either due to it timing out or there not being any clients that
          would be allowed to reclaim.

       2. the server stops enforcing the grace period and transitions to
          normal operation

       These concepts are often conflated in singleton servers, but in a
       cluster we must consider them independently.

       When a server is finished with its own local recovery period, it
       should clear its NEED flag. That server should continue enforcing
       the grace period until the grace period is fully lifted, but it
       must not permit reclaims after clearing its NEED flag.

       If the server's own NEED flag is the last one set, then it can
       lift the grace period (by setting R=0). At that point, all servers
       in the cluster can end grace period enforcement, and communicate
       that fact to the others by clearing their ENFORCING flags.
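
       Putting those two steps together in the same illustrative style
       (again a sketch with invented names, not the backend's actual
       code):

          # Hypothetical sketch: a server that finishes local recovery
          # clears its NEED flag; if it held the last NEED flag it lifts
          # the cluster-wide grace period (R=0), after which every server
          # may stop enforcing and clear its ENFORCING flag.
          def finish_local_recovery(data, node_id):
              data["flags"][node_id].discard("NEED")
              if not any("NEED" in f for f in data["flags"].values()):
                  data["R"] = 0        # lift the cluster-wide grace period

          def maybe_stop_enforcing(data, node_id):
              if data["R"] == 0:
                  data["flags"][node_id].discard("ENFORCING")

          data = {"C": 4, "R": 3,
                  "flags": {"node1": {"NEED", "ENFORCING"},
                            "node2": {"ENFORCING"}}}
          finish_local_recovery(data, "node1")   # last NEED flag, so R -> 0
          maybe_stop_enforcing(data, "node1")
          maybe_stop_enforcing(data, "node2")
          print(data["R"], data["flags"])
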
                                 Nov 26, 2019  GANESHA-RADOS-CLUSTER-DESIGN(8)