zbackup(1)                                                          zbackup(1)


Introduction
zbackup is a globally-deduplicating backup tool, based on the ideas found in rsync (http://rsync.samba.org/). Feed a large .tar into it, and it will store duplicate regions of it only once, then compress and optionally encrypt the result. Feed another .tar file, and it will also reuse any data found in any previous backups. This way only new changes are stored, and as long as the files are not very different, the amount of storage required is very low. Any of the backup files stored previously can be read back in full at any time. The program is format-agnostic, so you can feed virtually any files to it (any types of archives, proprietary formats, even raw disk images -- but see Caveats).

This is achieved by sliding a window with a rolling hash over the input at byte granularity and checking whether the block in focus has been seen before. If the rolling hash matches, an additional full cryptographic hash is calculated to ensure the block is indeed the same; only then is the block deduplicated.

The program has the following features:

• Parallel LZMA or LZO compression of the stored data

• Built-in AES encryption of the stored data

• Possibility to delete old backup data

• Use of a 64-bit rolling hash, keeping the number of soft collisions to zero

• Repository consists of immutable files; no existing files are ever modified

• Written in C++ with only modest library dependencies

• Safe to use in production (see below)

• Possibility to exchange data between repos without recompression

Build dependencies

• cmake >= 2.8.2 (though it should not be too hard to compile the sources by hand if needed)

• libssl-dev for all encryption, hashing and random numbers

• libprotobuf-dev and protobuf-compiler for data serialization

• liblzma-dev for compression

• liblzo2-dev for compression (optional)

• zlib1g-dev for adler32 calculation

Quickstart

To build and install:

    cd zbackup
    cmake .
    make
    sudo make install
    # or just run as ./zbackup

Zbackup is also part of the Debian (https://packages.debian.org/search?keywords=zbackup), Ubuntu (http://packages.ubuntu.com/search?keywords=zbackup) and Arch Linux (https://aur.archlinux.org/packages/zbackup/) distributions of GNU/Linux.

To use:

    zbackup init --non-encrypted /my/backup/repo
    tar c /my/precious/data | zbackup backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
    zbackup restore /my/backup/repo/backups/backup-`date '+%Y-%m-%d'` > /my/precious/backup-restored.tar

If you have a lot of RAM to spare, you can use it to speed up the restore process -- to use 512 MB more, pass --cache-size 512mb when restoring.

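For example (the date is illustrative; --cache-size is placed like the --password-file option in the examples below):

    # restore with an extra 512 MB of chunk cache
    zbackup --cache-size 512mb restore /my/backup/repo/backups/backup-2022-01-22 > /my/precious/backup-restored.tar
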
If encryption is wanted, create a file with your password:

    # more secure to use an editor
    echo mypassword > ~/.my_backup_password
    chmod 600 ~/.my_backup_password

Then init the repo the following way:

    zbackup init --password-file ~/.my_backup_password /my/backup/repo

And always pass the same argument afterwards:

    tar c /my/precious/data | zbackup --password-file ~/.my_backup_password backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
    zbackup --password-file ~/.my_backup_password restore /my/backup/repo/backups/backup-`date '+%Y-%m-%d'` > /my/precious/backup-restored.tar

If you have a 32-bit system and a lot of cores, consider lowering the number of compression threads by passing --threads 4 or --threads 2 if the program runs out of address space when backing up (see why in Caveats below, item 2). There should be no problem on a 64-bit system.

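For example (flag placement mirrors the --password-file examples above):

    # cap LZMA compression at two threads to stay within a 32-bit address space
    tar c /my/precious/data | zbackup --threads 2 backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
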
Caveats

• While you can pipe any data into the program, the data should be uncompressed and unencrypted -- otherwise no deduplication could be performed on it. zbackup compresses and encrypts the data itself, so there's no need to do that yourself. Just run tar c and pipe it into zbackup directly. If backing up disk images employing encryption, pipe the unencrypted version (the one you normally mount). If you create .zip or .rar files, use no compression (-0 or -m0) and no encryption.

• Parallel LZMA compression uses a lot of RAM (several hundred megabytes, depending on the number of threads used), and ten times more virtual address space. The latter is only relevant on 32-bit architectures, where it's limited to 2 or 3 GB. If you hit the ceiling, lower the number of threads with --threads.

• Since the data is deduplicated, there's naturally no redundancy in it. A loss of a single file can lead to a loss of virtually all data. Make sure you store it on redundant storage (RAID1, a cloud provider etc).

• The encryption key, if used, is stored in the info file in the root of the repo. It is encrypted with your password. Technically, then, you can change your password without re-encrypting any data, and as long as no one possesses the old info file and knows your old password, you would be safe (even though the actual option to change the password is not implemented yet -- someone who needs this is welcome to create a pull request -- the possibility is all there). Also note that it is crucial you don't lose your info file, as otherwise the whole backup would be lost.

Limitations

• Right now the only modes supported are reading from standard input and writing to standard output. FUSE mounts and NBD servers may be added later if someone contributes the code.

• The program keeps all known blocks in an in-RAM hash table, which may create scalability problems for very large repos (see below).

• The only encryption mode currently implemented is AES-128 in CBC mode with PKCS#7 padding. If you believe that this is not secure enough, patches are welcome. Before you jump to conclusions, however, read this article (http://www.schneier.com/blog/archives/2009/07/another_new_aes.html).

• The main compression mode is LZMA, which suits backups very nicely (LZO is also available; see Compression below).

• It's only possible to fully restore a backup in order to get to a required file; there is no option to quickly pick a single file out. tar would not allow that anyway, but for e.g. zip files it could have been possible. This could be implemented, e.g. by exposing the data over a FUSE filesystem.

• There's no option to specify block and bundle sizes other than the defaults (currently 64k and 2MB respectively), though it would be trivial to add command-line switches for those.

Most of those limitations can be lifted by implementing the respective features.

Safety

Is it safe to use zbackup for production data? Being free software, the program comes with no warranty of any kind. That said, it's perfectly safe for production, and here's why. When performing a backup, the program never modifies or deletes any existing files -- only new ones are created. It specifically checks for that, and the code paths involved are short and easy to inspect. Furthermore, each backup is protected by its SHA256 sum, which is calculated before piping the data into the deduplication logic. The code path doing that is also short and easy to inspect. When a backup is restored, its SHA256 is calculated again and compared against the stored one. The program fails on a mismatch. Therefore, to ensure safety it is enough to restore each backup to /dev/null immediately after creating it (see the example below). If it restores fine, it will restore fine ever after.

To add some statistics, the author of the program has been using an older version of zbackup internally for over a year. The SHA256 check never failed. Again, even if it did, you would know immediately, so no work would be lost. Therefore you are welcome to try the program in production, and if you like it, stick with it.

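A verification pass right after a backup might look like this (paths illustrative):

    # back up, then immediately check that the backup restores in full
    tar c /my/precious/data | zbackup backup /my/backup/repo/backups/backup-2022-01-22
    zbackup restore /my/backup/repo/backups/backup-2022-01-22 > /dev/null && echo "backup verified"
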
Usage notes

The repository has the following directory structure:

    /repo
        backups/
        bundles/
            00/
            01/
            02/
            ...
        index/
        info

• The backups directory contains your backups. Those are very small files which are needed for restoration. They are encrypted if encryption is enabled. The names can be arbitrary. It is possible to arrange files in subdirectories, too. Free renaming is also allowed.

• The bundles directory contains the bulk of the data. Each bundle internally contains multiple small chunks, compressed together and encrypted. Together, all those chunks account for all the deduplicated data stored.

• The index directory contains the full index of all chunks in the repository, together with their bundle names. A separate index file is created for each backup session. Technically those files are redundant: all the information is contained in the bundles themselves. However, having a separate index is nice for two reasons: 1) it's faster to read, as it incurs fewer seeks, and 2) it allows making backups while storing bundles elsewhere. Bundles are only needed when restoring -- otherwise it's sufficient to have only the index. One could then move all newly created bundles to another machine after each backup.

• info is a very important file which contains all global repository metadata, such as chunk and bundle sizes, and an encryption key encrypted with the user password. It is paramount not to lose it, so backing it up separately somewhere might be a good idea (see the example below). On the other hand, if you absolutely don't trust your remote storage provider, you might consider not storing it with the rest of the data. It would then be impossible to decrypt the data at all, even if your password gets known later.

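A minimal sketch of that precaution (destination path illustrative):

    # keep an off-site copy of the irreplaceable repository metadata
    cp /my/backup/repo/info /some/safe/location/zbackup-repo-info
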
The program does not have any facilities for sending your backup over the network. You can rsync the repo to another computer or use any kind of cloud storage capable of storing files. Since zbackup never modifies any existing files, the latter is especially easy -- just tell the upload tool you use not to upload any files which already exist on the remote side (e.g. with gsutil it's gsutil cp -R -n /my/backup gs://mybackup/).

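For example (bucket and host names illustrative; rsync's --ignore-existing gives the same never-reupload behaviour):

    # upload only files that don't exist on the remote side yet
    gsutil cp -R -n /my/backup gs://mybackup/
    rsync -av --ignore-existing /my/backup/ remote-host:/backups/repo/
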
To aid with creating backups, there's a utility called tartool included with zbackup. The idea is the following: one sprinkles empty files called .backup and .no-backup across the entire filesystem. Directories where .backup files are placed are marked for backing up. Similarly, directories with .no-backup files are marked not to be backed up. Additionally, it is possible to place .backup-XYZ in the same directory where XYZ is to mark XYZ for backing up, or place .no-backup-XYZ to mark it not to be backed up. Then tartool can be run with three arguments -- the root directory to start from (can be /), the output includes file, and the output excludes file. The tool traverses the given directory, noting the .backup* and .no-backup* files and creating include and exclude lists for the tar utility. tar could then be run as tar c --files-from includes --exclude-from excludes to store all the chosen data, as in the example below.

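A sketch of that workflow (all paths illustrative):

    touch /home/alice/documents/.backup            # back up this directory tree
    touch /home/alice/documents/.no-backup-cache   # ...but skip its "cache" entry
    tartool / includes excludes
    tar c --files-from includes --exclude-from excludes | zbackup backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
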
Scalability

This section tries to address the question of the maximum amount of data which can be held in a backup repository. What is meant here is the deduplicated data. The number of bytes in all source files ever fed into the repository doesn't matter, but the total size of the resulting repository does. Internally, all input data is split into small blocks called chunks (up to 64k each by default). Chunks are collected into bundles (up to 2MB each by default), and those bundles are then compressed and encrypted.

There are then two problems with the total number of chunks in the repository:

• Hashes of all existing chunks need to be kept in RAM while the backup is ongoing. Since the sliding window performs checking at single-byte granularity, lookups would otherwise be too slow. The amount of data to be stored is technically only 24 bytes per chunk, where the size of the chunk is up to 64k. In an example real-life 18GB repo, only 18MB are taken up by its hash index. Multiply this roughly by two to get an estimate of the RAM needed to store the index as an in-RAM hash table. However, as this size is proportional to the total size of the repo, a 2TB repo could already require about 2GB of RAM. Most repos are much smaller, though, and as long as the deduplication works properly, in many cases you can store terabytes of highly-redundant backup files in a 20GB repo easily.

• We use a 64-bit rolling hash, which allows an O(1) lookup cost at each byte we process. Due to the birthday paradox (https://en.wikipedia.org/wiki/Birthday_paradox), we would start having collisions when we approach 2^32 hashes. If each chunk is 32k on average, we would get there when the repo grows to 128TB (see the arithmetic after this list). We would still be able to continue, but as the number of collisions grew, we would have to resort to calculating the full hash of a block at each byte more and more often, which would result in a considerable slowdown.

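A back-of-the-envelope check of those figures, under the assumptions stated above (32 KiB average chunks, 24-byte index entries, roughly 2x hash-table overhead):

    # birthday bound: 2^32 chunks of 32 KiB each, expressed in TiB
    echo $(( (1 << 32) * 32 * 1024 / 1024 ** 4 ))                 # => 128
    # index RAM for a 2 TiB repo: one 24-byte entry per 32 KiB chunk, doubled, in MiB
    echo $(( 2 * 1024 ** 4 / (32 * 1024) * 24 * 2 / 1024 ** 2 ))  # => 3072
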
All in all, as long as the amount of RAM permits, one can go up to several terabytes in deduplicated data, and start having some slowdown after reaching hundreds of terabytes.

Design choices

• We use a 64-bit modified Rabin-Karp rolling hash (see rolling_hash.hh for details), while most other programs use a 32-bit one. As noted previously, one problem with the hash size is its birthday bound, which with a 32-bit hash is met after having only 2^16 hashes. The choice of a 64-bit hash allows us to scale much better while having virtually the same calculation cost on a typical 64-bit machine.

• rsync uses MD5 as its strong hash. While MD5 is known to be fast, it is also known to be broken, allowing a malicious user to craft colliding inputs. zbackup uses SHA1 instead. The cost of SHA1 calculations on modern machines is actually less than that of MD5 (run openssl speed md5 sha1 on yours, as shown below), so it's a win-win situation. We only keep the first 128 bits of the SHA1 output, and therefore together with the rolling hash we have a 192-bit hash for each chunk. It's a multiple of 8 bytes, which is a nice property on 64-bit machines, and it is long enough not to worry about possible collisions.

• AES-128 in CBC mode with PKCS#7 padding is used for encryption. This seems to be a reasonably safe classic solution. Each encrypted file has a random IV as its first 16 bytes.

• We use Google's protocol buffers (https://developers.google.com/protocol-buffers/) to represent data structures in binary form. They are very efficient and relatively simple to use.

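To compare the two digests on your own hardware:

    # benchmark MD5 vs SHA1 throughput; on most modern CPUs SHA1 comes out ahead
    openssl speed md5 sha1
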
Compression

zbackup uses LZMA to compress stored data. It compresses very well, but it will slow down your backup (unless you have a very fast CPU).

LZO is much faster, but the files will be bigger. If you don't want your backup process to be CPU-bound, you should consider using LZO. However, there are some caveats:

1. LZO is so fast that other parts of zbackup consume significant portions of the CPU. In fact, it is only using one core on my machine because compression is the only thing that can run in parallel.

2. I've hacked the LZO support in a day. You shouldn't trust it. Please make sure that restore works before you assume that your data is safe. That may still be faster than a backup with LZMA ;-)

3. LZMA is still the default, so make sure that you use the --compression lzo argument when you init the repo or whenever you do a backup (see the example below).

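For instance (repo path illustrative; the flag is the one named in item 3):

    # create a repo that defaults to LZO...
    zbackup init --non-encrypted --compression lzo /my/backup/repo
    # ...or choose LZO for an individual backup run
    tar c /my/precious/data | zbackup backup --compression lzo /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
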
You can mix LZMA and LZO in a repository. Each bundle file has a field that says how it was compressed, so zbackup will use the right method to decompress it. You could take an old zbackup repository with only LZMA bundles and start using LZO. However, please think twice before you do that, because old versions of zbackup won't be able to read those bundles.

Improvements

There's a lot to be improved in the program. It was released with the minimum amount of functionality to be useful. It is also stable. This should hopefully stimulate people to join the development and add all those other fancy features. Here's a list of ideas:

• Additional options, such as configurable chunk and bundle sizes etc.

• A command to change the password.

• Improved garbage collection. The program should support the ability to specify a maximum index file size / maximum index file count (for better compatibility with cloud storage as well) or something like a retention policy.

• A command to fsck the repo by doing something close to what garbage collection does, but also checking all hashes and so on.

• Parallel decompression. Right now decompression is single-threaded, but it is possible to look ahead in the stream and perform prefetching.

• Support for mounting the repo over FUSE. Random access to data would then be possible.

• Support for exposing a backed-up file over a userspace NBD server. It would then be possible to mount raw disk images without extracting them.

• Support for other encryption types (preferably for everything openssl supports with its evp).

• Support for other compression methods.

• You name it!

Communication

• The program's website is at <http://zbackup.org/>.

• Development happens at <https://github.com/zbackup/zbackup>.

• The discussion forum is at <https://groups.google.com/forum/#!forum/zbackup>. Please ask for help there!

The author is reachable over email at <ikm@zbackup.org>. Please be constructive and don't ask for help using the program, though. In most cases it's best to stick to the forum, unless you have something to discuss with the author in private.

Similar projects

zbackup is certainly not the first project to embrace the idea of using a rolling hash for deduplication. Here's a list of other projects the author found on the web:

• bup (https://github.com/bup/bup), based on storing data in git packs. No possibility of removing old data. This program was the initial inspiration for zbackup.

• ddar (http://www.synctus.com/ddar/), which seems to be a little bit outdated. Contains a nice list of alternatives with comparisons.

• rdiff-backup (http://www.nongnu.org/rdiff-backup/), based on the original rsync algorithm. Does not do global deduplication, only working over files with the same file name.

• duplicity (http://duplicity.nongnu.org/), which looks similar to rdiff-backup with regard to its mode of operation.

• Some filesystems (most notably ZFS (http://en.wikipedia.org/wiki/ZFS) and Btrfs (http://en.wikipedia.org/wiki/Btrfs)) provide deduplication features. They do so only at the block level, though, without a sliding window, so they cannot accommodate arbitrary byte insertion/deletion in the middle of data.

• Attic (https://attic-backup.org/), which looks very similar to zbackup.

Credits

Copyright (c) 2012-2014 Konstantin Isakov (<ikm@zbackup.org>) and ZBackup contributors, see CONTRIBUTORS. Licensed under GNU GPLv2 or later + OpenSSL, see LICENSE.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.



                               January 22, 2022                     zbackup(1)