GANESHA-RADOS-CLUSTER-DESIGN(8)    NFS-Ganesha    GANESHA-RADOS-CLUSTER-DESIGN(8)

ganesha-rados-cluster-design - Clustered RADOS Recovery Backend Design

This document aims to explain the theory and design behind the
rados_cluster recovery backend, which coordinates grace period
enforcement among multiple, independent NFS servers.

In order to understand the clustered recovery backend, it's first
necessary to understand how recovery works with a single server:

NFSv4 is a lease-based protocol. Clients set up a relationship to the
server and must periodically renew their lease in order to maintain
their ephemeral state (open files, locks, delegations or layouts).

When a singleton NFS server is restarted, any ephemeral state is lost.
When the server comes back online, NFS clients detect that the server
has been restarted and will reclaim the ephemeral state that they held
at the time of their last contact with the server.

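The lease model described above can be sketched in a few lines of
Python. This is only an illustration: the class name LeaseTable, its
methods and the injectable clock are assumptions made for the example,
not part of nfs-ganesha.

```python
import time


class LeaseTable:
    """Tracks per-client leases (hypothetical helper for illustration)."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.expiry = {}  # client id -> time at which the lease lapses

    def renew(self, client_id):
        # Each renewal pushes the expiry one full lease period ahead.
        self.expiry[client_id] = self.clock() + self.lease_seconds

    def is_valid(self, client_id):
        deadline = self.expiry.get(client_id)
        return deadline is not None and self.clock() < deadline

    def restart(self):
        # Lease state is ephemeral: a restart forgets every client.
        self.expiry.clear()
```

After restart() no client holds a valid lease, which is the situation
that forces clients to reclaim their state.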
In order to ensure that we don't end up with conflicts, clients are
barred from acquiring any new state while the server is in the Recovery
phase. Only reclaim operations are allowed.

This period of time is called the grace period. Most NFS servers have a
grace period that lasts around two lease periods; however, nfs-ganesha
can and will lift the grace period early if it determines that no more
clients will be allowed to recover.

Once the grace period ends, the server will move into its Normal
operation state. During this period, no more recovery is allowed and
new state can be acquired by NFS clients.

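As a rough sketch of these rules, the toy server below admits only
reclaims during the grace period and only new state afterwards, and
lifts the grace period early once no client is left to recover. The
class and method names are assumptions made for the example. The error
names come from the NFSv4 protocol, the grace period timeout is omitted,
and reclaims are not checked against a recovery database here.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_GRACE = "NFS4ERR_GRACE"        # new state requested during grace
NFS4ERR_NO_GRACE = "NFS4ERR_NO_GRACE"  # reclaim attempted outside grace


class SingletonServer:
    """Toy model of the Recovery and Normal phases (hypothetical names)."""

    def __init__(self, clients_to_recover):
        self.pending = set(clients_to_recover)
        # With nobody to recover there is no reason to enforce grace.
        self.in_grace = bool(self.pending)

    def handle(self, client_id, reclaim):
        if self.in_grace:
            return NFS4_OK if reclaim else NFS4ERR_GRACE
        return NFS4ERR_NO_GRACE if reclaim else NFS4_OK

    def done_reclaiming(self, client_id):
        self.pending.discard(client_id)
        if not self.pending:
            # No more clients can recover: lift the grace period early.
            self.in_grace = False
```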
The lifecycle of a singleton NFS server can be considered to be a
series of transitions from the Recovery period to Normal operation and
back. In the remainder of this document we'll consider each Recovery
period, together with the Normal operation period that follows it, to
be an epoch, and assign each a number beginning with 1.

Visually, we can represent it like this, such that each Normal ->
Recovery transition is marked by a change in the epoch value:

+-------+-------+-------+---------------+-------+
| State | R | N | R | N | R | R | R | N | R | N |
+-------+-------+-------+---------------+-------+
| Epoch |   1   |   2   |       3       |   4   |
+-------+-------+-------+---------------+-------+

Note that it is possible to restart during the grace period (as shown
above during epoch 3). That just serves to extend the recovery period
and the epoch. A new epoch begins only on a Normal -> Recovery
transition.

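The numbering in the diagram can be reproduced with a short function.
It follows the diagram's convention that the epoch value changes on
each Normal -> Recovery transition and that a restart during recovery
stays in the same epoch. The function name and the single-letter state
encoding are assumptions made for the example.

```python
def epochs(states):
    """Label each state ('R' or 'N') with its epoch number, from 1."""
    labels = []
    epoch = 0
    previous = "N"  # treat the very first start as Normal -> Recovery
    for state in states:
        if state == "R" and previous == "N":
            epoch += 1  # only a Normal -> Recovery transition bumps it
        labels.append(epoch)
        previous = state
    return labels


# The state row from the diagram above.
print(epochs("RNRNRRRNRN"))  # [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]
```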
There are some potential edge cases that can occur involving network
partitions and multiple reboots. In order to prevent those, the server
must maintain a list of clients that hold state on the server at any
given time. This list must be maintained on stable storage. If a client
sends a request to reclaim some state, then the server must check to
make sure the client is on that list before allowing the request.

Thus, when the server allows reclaim requests, it must always gate them
against the recovery database from the previous epoch. As clients come
in to reclaim, we establish records for them in a new database
associated with the current epoch.

The transition from recovery to normal operation should perform an
atomic switch of recovery databases. A recovery database only becomes
legitimate on a recovery to normal transition. Until that point, the
recovery database from the previous epoch is the canonical one.

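A minimal sketch of this gating and switch-over follows. It keeps both
databases in memory, whereas a real server must keep them on stable
storage. The class and method names are assumptions made for the
example.

```python
class RecoveryStore:
    """Two recovery databases: the canonical one and the one being built."""

    def __init__(self, previous_clients=()):
        # Database from the previous epoch: the only one reclaims are
        # checked against until the grace period ends.
        self.canonical = frozenset(previous_clients)
        # Database for the current epoch, filled as clients reclaim.
        self.pending = set()

    def try_reclaim(self, client_id):
        if client_id not in self.canonical:
            return False  # unknown client: reclaim must be refused
        self.pending.add(client_id)
        return True

    def end_grace(self):
        # One assignment makes the new database the canonical one.
        self.canonical, self.pending = frozenset(self.pending), set()
```

If the server restarts before end_grace() runs, the previous epoch's
database is still the canonical one, so a client that had not yet
reclaimed is not locked out.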
Let's consider a set of independent NFS servers, all serving out the
same content from a clustered backend filesystem of any flavor. Each
NFS server in this case can itself be considered a clustered FS client.
This means that the NFS server is really just a proxy for state on the
clustered filesystem.

The filesystem must make some guarantees to the NFS server. First
filesystem guarantee:

1. The filesystem ensures that the NFS servers (aka the FS clients)
   cannot obtain state that conflicts with that of another NFS server.

This is somewhat obvious and is what we expect from any clustered
filesystem outside of any requirements of NFS. If the clustered
filesystem can provide this, then we know that conflicting state during
normal operations cannot be granted.

The recovery period has a different set of rules. If an NFS server
crashes and is restarted, then we have a window of time when that NFS
server does not know what state was held by its clients.

If the state held by the crashed NFS server is immediately released
after the crash, another NFS server could hand out conflicting state
before the original NFS client has a chance to recover it.

This must be prevented. Second filesystem guarantee:

2. The filesystem must not release state held by a server during the
   previous epoch until all servers in the cluster are enforcing the
   grace period.

In practical terms, we want the filesystem to provide a way for an NFS
server to tell it when it's safe to release state held by a previous
instance of itself. The server should do this once it knows that all of
its siblings are enforcing the grace period.

Note that we do not require that all servers restart and allow reclaim
at that point. It's sufficient for them to simply begin grace period
enforcement as soon as possible once one server needs it.

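The two guarantees can be illustrated with a toy lock table. This is
only a model of the behavior the NFS servers rely on, not how any
particular clustered filesystem implements it. The class name, the
method names and the "A#1" style instance identifiers are assumptions
made for the example.

```python
class ClusterLockTable:
    """Toy model of exclusive state held on a clustered filesystem."""

    def __init__(self):
        self.owner = {}  # resource -> id of the server instance holding it

    def acquire(self, instance, resource):
        # Guarantee 1: state that conflicts with another holder's state
        # is never granted.
        holder = self.owner.setdefault(resource, instance)
        return holder == instance

    def release_all(self, instance):
        # Guarantee 2: state of a previous instance is dropped only when
        # a server explicitly asks for it, never as a side effect of a
        # crash.
        for resource in [r for r, i in self.owner.items() if i == instance]:
            del self.owner[resource]


fs = ClusterLockTable()
fs.acquire("A#1", "file1")   # server A, first instance
# Server A crashes here. Nothing is released, so B still cannot
# obtain conflicting state.
print(fs.acquire("B#1", "file1"))  # False
```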
At this point the cluster siblings are no longer completely
independent, and the grace period has become a cluster-wide property.
This means that we must track the current epoch on some sort of shared
storage that the servers can all access.

We must also keep track of whether a cluster-wide grace period is in
effect. All running nodes should be informed when either of these
values changes, so they can take appropriate steps when that occurs.

In the rados_cluster backend, we track these using two epoch values:

   C: is the current epoch. This represents the current epoch value
      of the cluster.

   R: is the recovery epoch. This represents the epoch from which
      clients are allowed to recover. A non-zero value here means that
      a cluster-wide grace period is in effect. Setting this to 0 ends
      that grace period.

In order to decide when to make grace period transitions, each server
must also advertise its state to the other nodes. Specifically, each
server must be able to determine these two things about each of its
siblings:

1. Does this server have clients from the previous epoch that will
   require recovery? (NEED)

2. Is this server enforcing the grace period by refusing non-reclaim
   locks? (ENFORCING)

We do this with a pair of flags per sibling (NEED and ENFORCING). Each
server typically manages its own flags.

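One way to picture this shared state is as a small record holding the
two epoch values and a pair of flags per node. The sketch below is an
in-memory illustration only. The field names cur, rec, need and
enforcing are assumptions made for the example and say nothing about
how the backend actually encodes the data.

```python
from dataclasses import dataclass, field


@dataclass
class NodeFlags:
    need: bool = False       # NEED: has clients that require recovery
    enforcing: bool = False  # ENFORCING: refusing non-reclaim state


@dataclass
class GraceState:
    cur: int = 1  # C, the current epoch
    rec: int = 0  # R, the recovery epoch (0 means no grace period)
    nodes: dict = field(default_factory=dict)  # node id -> NodeFlags

    def in_grace(self):
        return self.rec != 0

    def all_enforcing(self):
        return all(flags.enforcing for flags in self.nodes.values())

    def any_need(self):
        return any(flags.need for flags in self.nodes.values())
```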
The rados_cluster backend stores all of this information in a single
RADOS object that is modified using read/modify/write cycles. Typically
we'll read the whole object, modify it, and then attempt to write it
back. If something changes between the read and write, we redo the read
and try it again.

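The retry pattern can be sketched as an optimistic update loop. The
VersionedStore class below is an in-memory stand-in for the shared
object, invented for this example. It is not the librados API, and the
retry limit is likewise an assumption.

```python
import copy


class VersionedStore:
    """In-memory stand-in for a shared object with a version counter."""

    def __init__(self, value):
        self._value = value
        self._version = 0

    def read(self):
        return copy.deepcopy(self._value), self._version

    def write_if_unchanged(self, value, expected_version):
        # Refuse the write if someone else wrote since our read.
        if self._version != expected_version:
            return False
        self._value = copy.deepcopy(value)
        self._version += 1
        return True


def update(store, modify, max_attempts=10):
    """Read, modify and write back, retrying when the object changed."""
    for _ in range(max_attempts):
        value, version = store.read()
        modify(value)
        if store.write_if_unchanged(value, version):
            return value
    raise RuntimeError("too much contention on the shared object")
```

Because modify() is re-run on a fresh copy after every failed write, it
should depend only on the value it is handed.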
In rados_cluster the client recovery databases are stored as RADOS
objects. Each NFS server has its own set of them and they are given
names that have the current epoch (C) embedded in them. This ensures
that recovery databases are specific to a particular epoch.

In general, it's safe to delete any recovery database that precedes R
when R is non-zero, and safe to remove any recovery database except for
the current one (the one with C in the name) when the grace period is
not in effect (R==0).

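The cleanup rule above translates directly into a small predicate. The
function name and the use of bare epoch numbers in place of object
names are assumptions made for the example.

```python
def removable_epochs(existing, cur, rec):
    """Return the epochs whose recovery databases are safe to delete.

    existing: epochs for which a recovery database exists
    cur:      C, the current epoch
    rec:      R, the recovery epoch (0 when no grace period is active)
    """
    if rec != 0:
        # Grace period in effect: anything older than R is unneeded.
        return sorted(e for e in existing if e < rec)
    # No grace period: only the current database must be kept.
    return sorted(e for e in existing if e != cur)


print(removable_epochs({1, 2, 3, 4}, cur=4, rec=3))  # [1, 2]
print(removable_epochs({1, 2, 3, 4}, cur=4, rec=0))  # [1, 2, 3]
```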
When a server restarts and wants to allow clients to reclaim their
state, it must establish a new epoch by incrementing the current epoch
to declare a new grace period (R=C; C=C+1).

The exception to this rule is when the cluster is already in a grace
period. Servers can just join an in-progress grace period instead of
establishing a new one if one is already active.

In either case, the server should also set its NEED and ENFORCING flags
at the same time.

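As a sketch, starting or joining a grace period amounts to the
following update of the shared state. The dictionary layout and the
function name are assumptions made for the example. In practice the
update would be applied through a read/modify/write cycle on the shared
object.

```python
def start_or_join_grace(state, node):
    """Begin a new grace period, or join the one already in progress.

    state is {"C": int, "R": int, "flags": {node: {"need", "enforcing"}}}.
    """
    if state["R"] == 0:
        # No grace period is active: declare one (R=C; C=C+1).
        state["R"] = state["C"]
        state["C"] += 1
    # In either case the node sets both of its flags at the same time.
    state["flags"][node] = {"need": True, "enforcing": True}
    return state


state = {"C": 4, "R": 0, "flags": {}}
start_or_join_grace(state, "a")   # starts a grace period: R=4, C=5
start_or_join_grace(state, "b")   # joins it: R and C are unchanged
print(state["C"], state["R"])     # 5 4
```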
The other surviving cluster siblings should take steps to begin grace
period enforcement as soon as possible. This entails "draining off" any
in-progress state-morphing operations and then blocking the acquisition
of any new state (usually with a return of NFS4ERR_GRACE to clients
that attempt it). Again, there is no need for the survivors from the
previous epoch to allow recovery here.

The surviving servers must, however, establish a new client recovery
database at this point to ensure that their clients can do recovery in
the event of a crash afterward.

Once all of the siblings are enforcing the grace period, the recovering
server can then request that the filesystem release the old state, and
allow clients to begin reclaiming their state. In the rados_cluster
backend driver, we do this by stalling server startup until all hosts
in the cluster are enforcing the grace period.

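The condition that the recovering server waits for can be written as a
simple check over the per-node flags. The function names and the
dictionary layout are assumptions made for the example, which does not
model draining of in-progress operations.

```python
def begin_enforcing(flags, node):
    """A surviving sibling reports that it now refuses new state."""
    entry = flags.setdefault(node, {"need": False, "enforcing": False})
    entry["enforcing"] = True


def all_enforcing(flags):
    """True once every sibling has its ENFORCING flag set."""
    return all(entry["enforcing"] for entry in flags.values())


flags = {
    "a": {"need": True, "enforcing": True},    # the recovering server
    "b": {"need": False, "enforcing": False},  # survivors
    "c": {"need": False, "enforcing": False},
}
begin_enforcing(flags, "b")
print(all_enforcing(flags))  # False: "c" is not enforcing yet
begin_enforcing(flags, "c")
print(all_enforcing(flags))  # True: "a" may release its old state
```

Note that the survivors set only ENFORCING. They have no clients to
recover, so their NEED flags stay clear.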
Transitioning from recovery to normal operation really consists of two
different steps:

1. the server decides that it no longer requires a grace period, either
   because it timed out or because there are no clients that would be
   allowed to reclaim.

2. the server stops enforcing the grace period and transitions to
   normal operation.

These concepts are often conflated in singleton servers, but in a
cluster we must consider them independently.

When a server is finished with its own local recovery period, it should
clear its NEED flag. That server should, however, continue enforcing
the grace period until the grace period is fully lifted. The server
must not permit reclaims after clearing its NEED flag.

If the server's own NEED flag is the last one set, then it can lift the
grace period (by setting R=0). At that point, all servers in the
cluster can end grace period enforcement, and communicate that fact to
the others by clearing their ENFORCING flags.

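The two steps can be sketched as two separate updates of the shared
state. The function names and the dictionary layout are assumptions
made for the example. In practice each update would be applied through
a read/modify/write cycle on the shared object.

```python
def finish_local_recovery(state, node):
    """Step 1: clear this node's NEED flag; lift grace if it was the last."""
    state["flags"][node]["need"] = False
    if not any(entry["need"] for entry in state["flags"].values()):
        state["R"] = 0  # last NEED flag cleared: grace period is lifted
    return state["R"] == 0


def stop_enforcing_if_lifted(state, node):
    """Step 2: stop enforcing, but only once the grace period is lifted."""
    if state["R"] != 0:
        return False  # some sibling still needs the grace period
    state["flags"][node]["enforcing"] = False
    return True


state = {
    "C": 5,
    "R": 4,
    "flags": {
        "a": {"need": True, "enforcing": True},
        "b": {"need": True, "enforcing": True},
    },
}
finish_local_recovery(state, "a")     # "b" still needs grace
stop_enforcing_if_lifted(state, "a")  # refused: "a" keeps enforcing
finish_local_recovery(state, "b")     # last NEED flag: sets R=0
stop_enforcing_if_lifted(state, "a")  # now allowed
```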
Feb 28, 2023                               GANESHA-RADOS-CLUSTER-DESIGN(8)