duperemove(8)                 System Manager's Manual                duperemove(8)


NAME
duperemove - Find duplicate extents and submit them for deduplication

SYNOPSIS
duperemove [options] files...

DESCRIPTION
duperemove is a simple tool for finding duplicated extents and submitting
them for deduplication. When given a list of files it will hash the contents
of their extents and compare those hashes to each other, finding and
categorizing extents that match each other. When given the -d option,
duperemove will submit those extents for deduplication using the Linux
kernel extent-same ioctl.

duperemove can store the hashes it computes in a hashfile. If given an
existing hashfile, duperemove will only compute hashes for those files which
have changed since the last run. Thus you can run duperemove repeatedly on
your data as it changes, without having to re-checksum unchanged data. For
more on hashfiles see the --hashfile option below as well as the Examples
section.

duperemove can also take input from the fdupes program, see the --fdupes
option below.

GENERAL
Duperemove has two major modes of operation, one of which is a subset of the
other.

Readonly / Non-deduplicating Mode
When run without -d (the default), duperemove will print out one or more
tables of matching extents it has determined would be ideal candidates for
deduplication. As a result, readonly mode is useful for seeing what
duperemove might do when run with -d.
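For instance, a readonly inspection pass over a hypothetical directory
/mnt/data (nothing is modified) could look like:

duperemove -rh /mnt/data

Once the candidate tables look reasonable, the same command can be re-run
with -d added to actually submit the extents for dedupe.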

Generally, duperemove does not concern itself with the underlying
representation of the extents it processes. Some of them could be
compressed, undergoing I/O, or even have already been deduplicated. In
dedupe mode, the kernel handles those details and therefore we try not to
replicate that work.

Deduping Mode
This functions similarly to readonly mode with the exception that the
duplicated extents found in our "read, hash, and compare" step will actually
be submitted for deduplication. Extents that have already been deduped will
be skipped. An estimate of the total data deduplicated will be printed after
the operation is complete. This estimate is calculated by comparing the
total amount of shared bytes in each file before and after the dedupe.

OPTIONS
files can refer to a list of regular files and directories or be a hyphen
(-) to read them from standard input. If a directory is specified, all
regular files within it will also be scanned. Duperemove can also be told to
recursively scan directories with the '-r' switch.
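As a sketch (the find expression and paths here are only an illustration), a
file list can also be piped in via the hyphen argument:

find /mnt/data -type f -name '*.iso' | duperemove -d --hashfile=/var/tmp/iso.hash -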

-r  Enable recursive dir traversal.

-d  De-dupe the results - only works on btrfs and xfs.

-A  Opens files readonly when deduping. Primarily for use by privileged
users on readonly snapshots.

-h  Print numbers in human-readable format.

-q  Quiet mode. Duperemove will only print errors and a short summary of any
dedupe.

--hashfile=hashfile
Use a file for storage of hashes instead of memory. This option drastically
reduces the memory footprint of duperemove and is recommended when your data
set is more than a few files large. Hashfiles are also reusable, allowing
you to further reduce the amount of hashing done on subsequent dedupe runs.

If hashfile does not exist it will be created. If it exists, duperemove will
check the file paths stored inside of it for changes. Files which have
changed will be rescanned and their updated hashes will be written to the
hashfile. Deleted files will be removed from the hashfile.

New files are only added to the hashfile if they are discoverable via the
files argument. For that reason you probably want to provide the same files
list and -r arguments on each run of duperemove. The file discovery
algorithm is efficient and will only visit each file once, even if it is
already in the hashfile.

Adding a new path to a hashfile is as simple as adding it to the files
argument.

When deduping from a hashfile, duperemove will avoid deduping files which
have not changed since the last dedupe.

-L  Print all files in the hashfile and exit. Requires the --hashfile
option. Will print additional information about each file when run with -v.

-R [file]
Remove file from the db and exit. Can be specified multiple times.
Duperemove will read the list from standard input if a hyphen (-) is
provided. Requires the --hashfile option.

Note: If you are piping filenames from another duperemove instance it is
advisable to do so into a temporary file first, as running duperemove
simultaneously on the same hashfile may corrupt that hashfile.
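As an illustration of the note above (paths are hypothetical, and this
assumes the -L output is one file path per line), stale entries can be
collected into a temporary file and then removed in a second step:

duperemove -L --hashfile=foo.hash | grep '^/mnt/data/old/' > /tmp/stale-files
duperemove -R - --hashfile=foo.hash < /tmp/stale-files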

--fdupes
Run in fdupes mode. With this option you can pipe the output of fdupes to
duperemove to dedupe any duplicate files found. When receiving a file list
in this manner, duperemove will skip the hashing phase.

-v  Be verbose.

--skip-zeroes
Read data blocks and skip any zeroed blocks. This is useful for speeding up
duperemove, but can prevent deduplication of zeroed files.

-b size
Use the specified block size for reading file extents. Defaults to 128K.
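For example (the exact size suffixes accepted may vary between builds; this
assumes the usual K notation):

duperemove -dr -b 64k --hashfile=foo.hash /foo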

--io-threads=N
Use N threads for I/O. This is used by the file hashing and dedupe stages.
Default is automatically detected based on number of host cpus.

--cpu-threads=N
Use N threads for CPU bound tasks. This is used by the duplicate extent
finding stage. Default is automatically detected based on number of host
cpus.

Note: Hyperthreading can adversely affect performance of the extent finding
stage. If duperemove detects an Intel CPU with hyperthreading it will use
half the number of cores reported by the system for cpu bound tasks.
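For example, to pin both thread counts explicitly on a hypothetical 8-core
machine rather than relying on autodetection:

duperemove -dr --io-threads=8 --cpu-threads=4 --hashfile=foo.hash /foo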

--dedupe-options=options
Comma separated list of options which alter how we dedupe. Prepend 'no' to
an option in order to turn it off.

[no]partial
Duperemove can often find more dedupe by comparing portions of extents to
each other. This can be a lengthy, CPU intensive task so it is turned off by
default. Using --batchsize is recommended to limit the negative effects of
this option (see the first example after this list).

The code behind this option is under active development and as a result the
semantics of the partial argument may change.

[no]same
Defaults to off. Allow dedupe of extents within the same file.

[no]fiemap
Defaults to on. Duperemove uses the fiemap ioctl during the dedupe stage to
optimize out already deduped extents as well as to provide an estimate of
the space saved after dedupe operations are complete.

Unfortunately, some versions of Btrfs exhibit extremely poor performance in
fiemap as the number of references on a file extent goes up. If you are
experiencing the dedupe phase slowing down or 'locking up', this option may
give you a significant amount of performance back.

Note: This does not turn off all usage of fiemap. To disable fiemap during
the file scan stage, you will also want to use the --lookup-extents=no
option (see the second example after this list).

[no]block
Deprecated.
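To illustrate the two notes above (the hashfile name and paths are
hypothetical): the first command enables partial matching together with
batching, the second disables fiemap everywhere for cases where the dedupe
phase appears to lock up:

duperemove -dr --hashfile=foo.hash --dedupe-options=partial --batchsize=1000 /foo
duperemove -dr --hashfile=foo.hash --dedupe-options=nofiemap --lookup-extents=no /foo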

--help  Prints help text.

--lookup-extents=[yes|no]
Defaults to yes. Allows duperemove to skip checksumming some blocks by
checking their extent state.

--read-hashes=hashfile
This option is primarily for testing. See the --hashfile option if you want
to use hashfiles.

Read hashes from a hashfile. A file list is not required with this option.
Dedupe can be done if duperemove is run from the same base directory as is
stored in the hash file (basically duperemove has to be able to find the
files).

--write-hashes=hashfile
This option is primarily for testing. See the --hashfile option if you want
to use hashfiles.

Write hashes to a hashfile. These can be read in at a later date and deduped
from.

--debug
Print debug messages, forces -v if selected.

--hash-threads=N
Deprecated, see --io-threads above.

--exclude=PATTERN
You can exclude certain files and folders from the deduplication process.
This might be beneficial for skipping subvolume snapshot mounts, for
instance. You need to provide a full path for exclusion. For example,
providing just a file name with a wildcard, i.e. duperemove --exclude
file-*, will never match because internally duperemove works with absolute
paths. Another thing to keep in mind is that shells usually expand glob
patterns, so the passed-in pattern ought to be quoted. Taking all of this
into consideration, the correct way to pass an exclusion pattern is:

duperemove --exclude "/path/to/dir/file*" /path/to/dir

-B, --batchsize=N
Run the deduplication phase every N files newly scanned. This greatly
reduces memory usage for large data sets, or when you are doing partial
extents lookup, but reduces multithreading efficiency.

Because of that small overhead, its value should be selected based on the
average file size and blocksize.

1000 is a sane value for extents-only lookups, while you can go as low as 1
if you are running duperemove on very large files (such as virtual machine
images).

By default, batching is disabled.
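For example, when deduping a directory of very large files such as virtual
machine images (path and hashfile name hypothetical), batching after every
file keeps memory usage low:

duperemove -dr --hashfile=vm.hash -B 1 /srv/images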

EXAMPLES
Simple Usage
Dedupe the files in directory /foo, recurse into all subdirectories. You
only want to use this for small data sets.

duperemove -dr /foo

Use duperemove with fdupes to dedupe identical files below directory foo.

fdupes -r /foo | duperemove --fdupes

Using Hashfiles
Duperemove can optionally store the hashes it calculates in a hashfile.
Hashfiles have two primary advantages - memory usage and re-usability. When
using a hashfile, duperemove will stream computed hashes to it, instead of
main memory.

If Duperemove is run with an existing hashfile, it will only scan those
files which have changed since the last time the hashfile was updated. The
files argument controls which directories duperemove will scan for newly
added files. In the simplest usage, you rerun duperemove with the same
parameters and it will only scan changed or newly added files - see the
first example below.

Dedupe the files in directory foo, storing hashes in foo.hash. We can run
this command multiple times and duperemove will only checksum and dedupe
changed or newly added files.

duperemove -dr --hashfile=foo.hash foo/

Don't scan for new files, only update changed or deleted files, then dedupe.

duperemove -dr --hashfile=foo.hash

Add directory bar to our hashfile and discover any files that were recently
added to foo.

duperemove -dr --hashfile=foo.hash foo/ bar/

List the files tracked by foo.hash.

duperemove -L --hashfile=foo.hash

FAQ
Is there an upper limit to the amount of data duperemove can process?
Duperemove is fast at reading and cataloging data. Dedupe runs will be
memory limited unless the '--hashfile' option is used. '--hashfile' allows
duperemove to temporarily store duplicated hashes to disk, thus removing the
large memory overhead and allowing for a far larger amount of data to be
scanned and deduped. Realistically though you will be limited by the speed
of your disks and cpu. In those situations where resources are limited you
may have success by breaking up the input data set into smaller pieces.

When using a hashfile, duperemove will only store duplicate hashes in
memory. During normal operation, the hash tree will make up the largest
portion of duperemove's memory usage. As of Duperemove v0.11 hash entries
are 88 bytes in size. If you know the number of duplicate blocks in your
data set you can get a rough approximation of memory usage by multiplying
that count by the hash entry size.
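For example, using the 88-byte figure above, one million duplicate block
hashes work out to roughly:

1000000 * 88 = 88000000 bytes, or about 84MB of memory.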

Actual performance numbers are dependent on hardware - up to date testing
information is kept on the duperemove wiki (see below for the link).

How large of a hashfile will duperemove create?
Hashfiles are essentially sqlite3 database files with several tables, the
largest of which are the files and extents tables. Each extents table entry
is about 72 bytes though that may grow as features are added. The size of a
files table entry depends on the file path but a good estimate is around 270
bytes per file. The number of extents in a data set is directly proportional
to its level of file fragmentation.

If you know the total number of extents and files in your data set then you
can calculate the hashfile size as:

Hashfile Size = Num Extents X 72 + Num Files X 270

Using a real world example of 1TB (8388608 128K blocks, here taken as one
extent per block) of data over 1000 files:

8388608 * 72 + 270 * 1000 = 604249776 or about 576MB for 1TB spread over
1000 files.

Note that none of this takes database overhead into account.

Is it safe to interrupt the program (Ctrl-C)?
Yes. Duperemove uses a transactional database engine and organizes db
changes to take advantage of those features. The result is that you should
be able to ctrl-c the program at any point and re-run without experiencing
corruption of your hashfile.

I have two identical files, why are they not deduped?
Duperemove by default works on extent granularity. What this means is if
there are two files which are logically identical (have the same content)
but are laid out on disk with different extent structure, they won't be
deduped. For example, if two files are 128k each and their contents are
identical but one of them consists of a single 128k extent and the other of
2 x 64k extents, then they won't be deduped. This behavior is dependent on
the current implementation and is subject to change as duperemove is being
improved.

How can I find out my space savings after a dedupe?
Duperemove will print out an estimate of the saved space after a dedupe
operation for you.

You can get a more accurate picture by running 'btrfs fi df' before and
after each duperemove run.

Be careful about using the 'df' tool on btrfs - it is common for space
reporting to be 'behind' while delayed updates get processed, so an
immediate df after deduping might not show any savings.
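One possible sequence (the mount point is hypothetical; the sync gives
delayed updates a chance to be processed before the second check):

btrfs fi df /mnt/data
duperemove -dr --hashfile=foo.hash /mnt/data
sync
btrfs fi df /mnt/data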

Why is the total deduped data reported as an estimate?
At the moment duperemove can detect that some underlying extents are shared
with other files, but it can not resolve which files those extents are
shared with.

Imagine duperemove is examining a series of files and it notes a shared data
region in one of them. That data could be shared with a file outside of the
series. Since duperemove can't resolve that information it will count the
shared data against our dedupe operation while in reality, the kernel might
deduplicate it further for us.

Why are my files showing dedupe but my disk space is not shrinking?
This is a little complicated, but it comes down to a feature in Btrfs called
_bookending_. The following article explains this in detail:
http://en.wikipedia.org/wiki/Btrfs#Extents.

Essentially though, the underlying representation of an extent in Btrfs can
not be split (with small exceptions). So sometimes we can end up in a
situation where a file extent gets partially deduped (and the extents marked
as shared) but the underlying extent item is not freed or truncated.

Is duperemove safe for my data?
Yes. To be specific, duperemove does not deduplicate the data itself. It
simply finds candidates for dedupe and submits them to the Linux kernel
extent-same ioctl. In order to ensure data integrity, the kernel locks out
other access to the file and does a byte-by-byte compare before proceeding
with the dedupe.

What is the cost of deduplication?
Deduplication will lead to increased fragmentation. The blocksize chosen can
have an effect on this. Larger blocksizes will fragment less but may not
save you as much space. Conversely, smaller block sizes may save more space
at the cost of increased fragmentation.

NOTES
Deduplication is currently only supported by the btrfs and xfs filesystems.

The Duperemove project page can be found at
https://github.com/markfasheh/duperemove

There is also a wiki at https://github.com/markfasheh/duperemove/wiki

SEE ALSO
hashstats(8) filesystems(5) btrfs(8) xfs(8) fdupes(1)


Version 0.12                     September 2016                    duperemove(8)