zbackup(1)                                                          zbackup(1)


Introduction
zbackup is a globally-deduplicating backup tool, based on the ideas found in rsync (http://rsync.samba.org/). Feed a large .tar into it, and it will store duplicate regions of it only once, then compress and optionally encrypt the result. Feed another .tar file, and it will also reuse any data found in any previous backups. This way only new changes are stored, and as long as the files are not very different, the amount of storage required is very low. Any of the backup files stored previously can be read back in full at any time. The program is format-agnostic, so you can feed virtually any files to it (any types of archives, proprietary formats, even raw disk images -- but see Caveats).

This is achieved by sliding a window with a rolling hash over the input at byte granularity and checking whether the block in focus has been seen before. If the rolling hash matches, an additional full cryptographic hash is calculated to ensure the block is indeed the same; only then is the block deduplicated.

The program has the following features:

• Parallel LZMA or LZO compression of the stored data

• Built-in AES encryption of the stored data

• Possibility to delete old backup data

• Use of a 64-bit rolling hash, keeping the number of soft collisions to zero

• Repository consists of immutable files; no existing files are ever modified

• Written in C++ with only modest library dependencies

• Safe to use in production (see below)

• Possibility to exchange data between repos without recompression

Build dependencies

• cmake >= 2.8.2 (though it should not be too hard to compile the sources by hand if needed)

• libssl-dev for all encryption, hashing and random numbers

• libprotobuf-dev and protobuf-compiler for data serialization

• liblzma-dev for compression

• liblzo2-dev for compression (optional)

• zlib1g-dev for adler32 calculation

Quickstart

To build and install:

    cd zbackup
    cmake .
    make
    sudo make install
    # or just run as ./zbackup

Zbackup is also part of the Debian (https://packages.debian.org/search?keywords=zbackup), Ubuntu (http://packages.ubuntu.com/search?keywords=zbackup) and Arch Linux (https://aur.archlinux.org/packages/zbackup/) distributions of GNU/Linux.

To use:

    zbackup init --non-encrypted /my/backup/repo
    tar c /my/precious/data | zbackup backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
    zbackup restore /my/backup/repo/backups/backup-`date '+%Y-%m-%d'` > /my/precious/backup-restored.tar

If you have a lot of RAM to spare, you can use it to speed up the restore process -- to use 512 MB more, pass --cache-size 512mb when restoring.

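For example (the date is illustrative; --cache-size is placed like the --password-file option in the examples below):

    # restore with an extra 512 MB of chunk cache
    zbackup --cache-size 512mb restore /my/backup/repo/backups/backup-2022-01-22 > /my/precious/backup-restored.tar
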
If encryption is wanted, create a file with your password:

    # more secure to use an editor
    echo mypassword > ~/.my_backup_password
    chmod 600 ~/.my_backup_password

Then init the repo the following way:

    zbackup init --password-file ~/.my_backup_password /my/backup/repo

And always pass the same argument afterwards:

    tar c /my/precious/data | zbackup --password-file ~/.my_backup_password backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
    zbackup --password-file ~/.my_backup_password restore /my/backup/repo/backups/backup-`date '+%Y-%m-%d'` > /my/precious/backup-restored.tar

If you have a 32-bit system and a lot of cores, consider lowering the number of compression threads by passing --threads 4 or --threads 2 if the program runs out of address space when backing up (see why in Caveats below, item 2). There should be no problem on a 64-bit system.

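For example (flag placement mirrors the --password-file examples above):

    # cap LZMA compression at two threads to stay within a 32-bit address space
    tar c /my/precious/data | zbackup --threads 2 backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
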
Caveats

• While you can pipe any data into the program, the data should be uncompressed and unencrypted -- otherwise no deduplication could be performed on it. zbackup compresses and encrypts the data itself, so there's no need to do that yourself. Just run tar c and pipe it into zbackup directly. If backing up disk images employing encryption, pipe the unencrypted version (the one you normally mount). If you create .zip or .rar files, use no compression (-0 or -m0) and no encryption.

• Parallel LZMA compression uses a lot of RAM (several hundred megabytes, depending on the number of threads used), and ten times more virtual address space. The latter is only relevant on 32-bit architectures, where it's limited to 2 or 3 GB. If you hit the ceiling, lower the number of threads with --threads.

• Since the data is deduplicated, there's naturally no redundancy in it. A loss of a single file can lead to a loss of virtually all data. Make sure you store it on redundant storage (RAID1, a cloud provider etc).

• The encryption key, if used, is stored in the info file in the root of the repo. It is encrypted with your password. Technically, then, you can change your password without re-encrypting any data, and as long as no one possesses the old info file and knows your old password, you would be safe (even though the actual option to change the password is not implemented yet -- someone who needs this is welcome to create a pull request -- the possibility is all there). Also note that it is crucial you don't lose your info file, as otherwise the whole backup would be lost.

Limitations

• Right now the only modes supported are reading from standard input and writing to standard output. FUSE mounts and NBD servers may be added later if someone contributes the code.

• The program keeps all known blocks in an in-RAM hash table, which may create scalability problems for very large repos (see below).

• The only encryption mode currently implemented is AES-128 in CBC mode with PKCS#7 padding. If you believe that this is not secure enough, patches are welcome. Before you jump to conclusions, however, read this article (http://www.schneier.com/blog/archives/2009/07/another_new_aes.html).

• The main compression mode is LZMA, which suits backups very nicely (LZO is also available; see Compression below).

• It's only possible to fully restore a backup in order to get to a required file; there is no option to quickly pick a single file out. tar would not allow that anyway, but for e.g. zip files it could have been possible. This could be implemented, e.g. by exposing the data over a FUSE filesystem.

• There's no option to specify block and bundle sizes other than the defaults (currently 64k and 2MB respectively), though it would be trivial to add command-line switches for those.

Most of those limitations can be lifted by implementing the respective features.

Safety

Is it safe to use zbackup for production data? Being free software, the program comes with no warranty of any kind. That said, it's perfectly safe for production, and here's why. When performing a backup, the program never modifies or deletes any existing files -- only new ones are created. It specifically checks for that, and the code paths involved are short and easy to inspect. Furthermore, each backup is protected by its SHA256 sum, which is calculated before piping the data into the deduplication logic. The code path doing that is also short and easy to inspect. When a backup is restored, its SHA256 is calculated again and compared against the stored one. The program fails on a mismatch. Therefore, to ensure safety it is enough to restore each backup to /dev/null immediately after creating it (see the example below). If it restores fine, it will restore fine ever after.

To add some statistics, the author of the program has been using an older version of zbackup internally for over a year. The SHA256 check never failed. Again, even if it did, you would know immediately, so no work would be lost. Therefore you are welcome to try the program in production, and if you like it, stick with it.

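A verification pass right after a backup might look like this (paths illustrative):

    # back up, then immediately check that the backup restores in full
    tar c /my/precious/data | zbackup backup /my/backup/repo/backups/backup-2022-01-22
    zbackup restore /my/backup/repo/backups/backup-2022-01-22 > /dev/null && echo "backup verified"
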
Usage notes

The repository has the following directory structure:

    /repo
        backups/
        bundles/
            00/
            01/
            02/
            ...
        index/
        info

• The backups directory contains your backups. Those are very small files which are needed for restoration. They are encrypted if encryption is enabled. The names can be arbitrary. It is possible to arrange files in subdirectories, too. Free renaming is also allowed.

• The bundles directory contains the bulk of the data. Each bundle internally contains multiple small chunks, compressed together and encrypted. Together, all those chunks account for all the deduplicated data stored.

• The index directory contains the full index of all chunks in the repository, together with their bundle names. A separate index file is created for each backup session. Technically those files are redundant: all the information is contained in the bundles themselves. However, having a separate index is nice for two reasons: 1) it's faster to read, as it incurs fewer seeks, and 2) it allows making backups while storing bundles elsewhere. Bundles are only needed when restoring -- otherwise it's sufficient to have only the index. One could then move all newly created bundles to another machine after each backup.

• info is a very important file which contains all global repository metadata, such as chunk and bundle sizes, and an encryption key encrypted with the user password. It is paramount not to lose it, so backing it up separately somewhere might be a good idea (see the example below). On the other hand, if you absolutely don't trust your remote storage provider, you might consider not storing it with the rest of the data. It would then be impossible to decrypt the data at all, even if your password gets known later.

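A minimal sketch of that precaution (destination path illustrative):

    # keep an off-site copy of the irreplaceable repository metadata
    cp /my/backup/repo/info /some/safe/location/zbackup-repo-info
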
The program does not have any facilities for sending your backup over the network. You can rsync the repo to another computer or use any kind of cloud storage capable of storing files. Since zbackup never modifies any existing files, the latter is especially easy -- just tell the upload tool you use not to upload any files which already exist on the remote side (e.g. with gsutil it's gsutil cp -R -n /my/backup gs://mybackup/).

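For example (bucket and host names illustrative; rsync's --ignore-existing gives the same never-reupload behaviour):

    # upload only files that don't exist on the remote side yet
    gsutil cp -R -n /my/backup gs://mybackup/
    rsync -av --ignore-existing /my/backup/ remote-host:/backups/repo/
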
To aid with creating backups, there's a utility called tartool included with zbackup. The idea is the following: one sprinkles empty files called .backup and .no-backup across the entire filesystem. Directories where .backup files are placed are marked for backing up. Similarly, directories with .no-backup files are marked not to be backed up. Additionally, it is possible to place .backup-XYZ in the same directory where XYZ is to mark XYZ for backing up, or place .no-backup-XYZ to mark it not to be backed up. Then tartool can be run with three arguments -- the root directory to start from (can be /), the output includes file, and the output excludes file. The tool traverses the given directory, noting the .backup* and .no-backup* files and creating include and exclude lists for the tar utility. tar could then be run as tar c --files-from includes --exclude-from excludes to store all the chosen data, as in the example below.

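A sketch of that workflow (all paths illustrative):

    touch /home/alice/documents/.backup            # back up this directory tree
    touch /home/alice/documents/.no-backup-cache   # ...but skip its "cache" entry
    tartool / includes excludes
    tar c --files-from includes --exclude-from excludes | zbackup backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
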
Scalability

This section tries to address the question of the maximum amount of data which can be held in a backup repository. What is meant here is the deduplicated data. The number of bytes in all source files ever fed into the repository doesn't matter, but the total size of the resulting repository does. Internally, all input data is split into small blocks called chunks (up to 64k each by default). Chunks are collected into bundles (up to 2MB each by default), and those bundles are then compressed and encrypted.

There are then two problems with the total number of chunks in the repository:

• Hashes of all existing chunks need to be kept in RAM while the backup is ongoing. Since the sliding window performs checking at single-byte granularity, lookups would otherwise be too slow. The amount of data to be stored is technically only 24 bytes per chunk, where the size of the chunk is up to 64k. In an example real-life 18GB repo, only 18MB are taken up by its hash index. Multiply this roughly by two to get an estimate of the RAM needed to store the index as an in-RAM hash table. However, as this size is proportional to the total size of the repo, a 2TB repo could already require about 2GB of RAM. Most repos are much smaller, though, and as long as the deduplication works properly, in many cases you can store terabytes of highly-redundant backup files in a 20GB repo easily.

• We use a 64-bit rolling hash, which allows an O(1) lookup cost at each byte we process. Due to the birthday paradox (https://en.wikipedia.org/wiki/Birthday_paradox), we would start having collisions when we approach 2^32 hashes. If each chunk is 32k on average, we would get there when the repo grows to 128TB (see the arithmetic after this list). We would still be able to continue, but as the number of collisions grew, we would have to resort to calculating the full hash of a block at each byte more and more often, which would result in a considerable slowdown.

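A back-of-the-envelope check of those figures, under the assumptions stated above (32 KiB average chunks, 24-byte index entries, roughly 2x hash-table overhead):

    # birthday bound: 2^32 chunks of 32 KiB each, expressed in TiB
    echo $(( (1 << 32) * 32 * 1024 / 1024 ** 4 ))                 # => 128
    # index RAM for a 2 TiB repo: one 24-byte entry per 32 KiB chunk, doubled, in MiB
    echo $(( 2 * 1024 ** 4 / (32 * 1024) * 24 * 2 / 1024 ** 2 ))  # => 3072
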
All in all, as long as the amount of RAM permits, one can go up to several terabytes in deduplicated data, and start having some slowdown after reaching hundreds of terabytes.

Design choices

• We use a 64-bit modified Rabin-Karp rolling hash (see rolling_hash.hh for details), while most other programs use a 32-bit one. As noted previously, one problem with the hash size is its birthday bound, which with a 32-bit hash is met after having only 2^16 hashes. The choice of a 64-bit hash allows us to scale much better while having virtually the same calculation cost on a typical 64-bit machine.

• rsync uses MD5 as its strong hash. While MD5 is known to be fast, it is also known to be broken, allowing a malicious user to craft colliding inputs. zbackup uses SHA1 instead. The cost of SHA1 calculations on modern machines is actually less than that of MD5 (run openssl speed md5 sha1 on yours, as shown below), so it's a win-win situation. We only keep the first 128 bits of the SHA1 output, and therefore together with the rolling hash we have a 192-bit hash for each chunk. It's a multiple of 8 bytes, which is a nice property on 64-bit machines, and it is long enough not to worry about possible collisions.

• AES-128 in CBC mode with PKCS#7 padding is used for encryption. This seems to be a reasonably safe classic solution. Each encrypted file has a random IV as its first 16 bytes.

• We use Google's protocol buffers (https://developers.google.com/protocol-buffers/) to represent data structures in binary form. They are very efficient and relatively simple to use.

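To compare the two digests on your own hardware:

    # benchmark MD5 vs SHA1 throughput; on most modern CPUs SHA1 comes out ahead
    openssl speed md5 sha1
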
Compression

zbackup uses LZMA to compress stored data. It compresses very well, but it will slow down your backup (unless you have a very fast CPU).

LZO is much faster, but the files will be bigger. If you don't want your backup process to be CPU-bound, you should consider using LZO. However, there are some caveats:

1. LZO is so fast that other parts of zbackup consume significant portions of the CPU. In fact, it is only using one core on my machine because compression is the only thing that can run in parallel.

2. I've hacked the LZO support in a day. You shouldn't trust it. Please make sure that restore works before you assume that your data is safe. That may still be faster than a backup with LZMA ;-)

3. LZMA is still the default, so make sure that you use the --compression lzo argument when you init the repo or whenever you do a backup (see the example below).

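For instance (repo path illustrative; the flag is the one named in item 3):

    # create a repo that defaults to LZO...
    zbackup init --non-encrypted --compression lzo /my/backup/repo
    # ...or choose LZO for an individual backup run
    tar c /my/precious/data | zbackup backup --compression lzo /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
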
You can mix LZMA and LZO in a repository. Each bundle file has a field that says how it was compressed, so zbackup will use the right method to decompress it. You could take an old zbackup repository with only LZMA bundles and start using LZO. However, please think twice before you do that, because old versions of zbackup won't be able to read those bundles.

Improvements

There's a lot to be improved in the program. It was released with the minimum amount of functionality to be useful. It is also stable. This should hopefully stimulate people to join the development and add all those other fancy features. Here's a list of ideas:

• Additional options, such as configurable chunk and bundle sizes etc.

• A command to change the password.

• Improved garbage collection. The program should support the ability to specify a maximum index file size / maximum index file count (for better compatibility with cloud storage as well) or something like a retention policy.

• A command to fsck the repo by doing something close to what garbage collection does, but also checking all hashes and so on.

• Parallel decompression. Right now decompression is single-threaded, but it is possible to look ahead in the stream and perform prefetching.

• Support for mounting the repo over FUSE. Random access to data would then be possible.

• Support for exposing a backed-up file over a userspace NBD server. It would then be possible to mount raw disk images without extracting them.

• Support for other encryption types (preferably for everything openssl supports with its evp).

• Support for other compression methods.

• You name it!

Communication

• The program's website is at <http://zbackup.org/>.

• Development happens at <https://github.com/zbackup/zbackup>.

• The discussion forum is at <https://groups.google.com/forum/#!forum/zbackup>. Please ask for help there!

The author is reachable over email at <ikm@zbackup.org>. Please be constructive and don't ask for help using the program, though. In most cases it's best to stick to the forum, unless you have something to discuss with the author in private.

Similar projects

zbackup is certainly not the first project to embrace the idea of using a rolling hash for deduplication. Here's a list of other projects the author found on the web:

• bup (https://github.com/bup/bup), based on storing data in git packs. No possibility of removing old data. This program was the initial inspiration for zbackup.

• ddar (http://www.synctus.com/ddar/), which seems to be a little bit outdated. Contains a nice list of alternatives with comparisons.

• rdiff-backup (http://www.nongnu.org/rdiff-backup/), based on the original rsync algorithm. Does not do global deduplication, only working over files with the same file name.

• duplicity (http://duplicity.nongnu.org/), which looks similar to rdiff-backup with regard to its mode of operation.

• Some filesystems (most notably ZFS (http://en.wikipedia.org/wiki/ZFS) and Btrfs (http://en.wikipedia.org/wiki/Btrfs)) provide deduplication features. They do so only at the block level, though, without a sliding window, so they cannot accommodate arbitrary byte insertion/deletion in the middle of data.

• Attic (https://attic-backup.org/), which looks very similar to zbackup.

Credits

Copyright (c) 2012-2014 Konstantin Isakov (<ikm@zbackup.org>) and ZBackup contributors, see CONTRIBUTORS. Licensed under GNU GPLv2 or later + OpenSSL, see LICENSE.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.



                               January 22, 2022                     zbackup(1)