duperemove(8)               System Manager's Manual              duperemove(8)


NAME
       duperemove - Find duplicate extents and print them to stdout

SYNOPSIS
       duperemove [options] files...

DESCRIPTION
       duperemove is a simple tool for finding duplicated extents and submit‐
       ting them for deduplication. When given a list of files it will hash
       the contents of their extents and compare those hashes to each other,
       finding and categorizing extents that match each other. When given the
       -d option, duperemove will submit those extents for deduplication using
       the Linux kernel extent-same ioctl.
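
       For readers curious what an extent-same request looks like from user
       space, the sketch below issues a single FIDEDUPERANGE ioctl (the
       interface defined in linux/fs.h) for one source/destination pair. It
       is a minimal illustration, not duperemove's own code; the paths,
       offset and length are made-up examples.

           /* Minimal sketch: ask the kernel to share one 128K range     */
           /* between two files.  Simplified example, not duperemove     */
           /* source; the paths below are placeholders.                  */
           #include <fcntl.h>
           #include <stdio.h>
           #include <stdlib.h>
           #include <sys/ioctl.h>
           #include <linux/fs.h>

           int main(void)
           {
               int src = open("/mnt/data/a.img", O_RDONLY);
               int dst = open("/mnt/data/b.img", O_RDWR);
               if (src < 0 || dst < 0) {
                   perror("open");
                   return 1;
               }

               /* Request header plus one destination info record. */
               struct file_dedupe_range *range =
                   calloc(1, sizeof(*range) +
                             sizeof(struct file_dedupe_range_info));
               range->src_offset = 0;
               range->src_length = 128 * 1024;
               range->dest_count = 1;
               range->info[0].dest_fd = dst;
               range->info[0].dest_offset = 0;

               /* The kernel compares both ranges byte-by-byte and only
                * marks them shared if they are identical. */
               if (ioctl(src, FIDEDUPERANGE, range) < 0) {
                   perror("FIDEDUPERANGE");
                   return 1;
               }

               if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
                   printf("deduped %llu bytes\n", (unsigned long long)
                          range->info[0].bytes_deduped);
               else
                   printf("ranges differ or dedupe was skipped\n");

               free(range);
               return 0;
           }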

       duperemove can store the hashes it computes in a hashfile. If given an
       existing hashfile, duperemove will only compute hashes for those files
       which have changed since the last run. Thus you can run duperemove re‐
       peatedly on your data as it changes, without having to re-checksum un‐
       changed data. For more on hashfiles see the --hashfile option below as
       well as the Examples section.

       duperemove can also take input from the fdupes program, see the
       --fdupes option below.

GENERAL
       Duperemove has two major modes of operation, one of which is a subset
       of the other.

   Readonly / Non-deduplicating Mode
       When run without -d (the default) duperemove will print out one or more
       tables of matching extents it has determined would be ideal candidates
       for deduplication. As a result, readonly mode is useful for seeing what
       duperemove might do when run with -d.

       Generally, duperemove does not concern itself with the underlying rep‐
       resentation of the extents it processes. Some of them could be com‐
       pressed, undergoing I/O, or even have already been deduplicated. In
       dedupe mode, the kernel handles those details and therefore we try not
       to replicate that work.

   Deduping Mode
       This functions similarly to readonly mode with the exception that the
       duplicated extents found in our "read, hash, and compare" step will ac‐
       tually be submitted for deduplication. Extents that have already been
       deduped will be skipped. An estimate of the total data deduplicated
       will be printed after the operation is complete. This estimate is cal‐
       culated by comparing the total amount of shared bytes in each file be‐
       fore and after the dedupe.

OPTIONS
       files can refer to a list of regular files and directories or be a hy‐
       phen (-) to read them from standard input. If a directory is speci‐
       fied, all regular files within it will also be scanned. Duperemove can
       also be told to recursively scan directories with the '-r' switch.
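
       For example, a list of files produced by another tool can be fed to
       duperemove on standard input. The path, pattern and hashfile name
       below are only an illustration:

              find /mnt/data -type f -name '*.vmdk' | \
                      duperemove -d --hashfile=/tmp/data.hash -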

       -r     Enable recursive dir traversal.

       -d     De-dupe the results - only works on btrfs and xfs.

       -A     Opens files readonly when deduping. Primarily for use by privi‐
              leged users on readonly snapshots.

       -h     Print numbers in human-readable format.

       -q     Quiet mode. Duperemove will only print errors and a short sum‐
              mary of any dedupe.

       --hashfile=hashfile
              Use a file for storage of hashes instead of memory. This option
              drastically reduces the memory footprint of duperemove and is
              recommended when your data set is more than a few files large.
              Hashfiles are also reusable, allowing you to further reduce the
              amount of hashing done on subsequent dedupe runs.

              If hashfile does not exist it will be created. If it exists,
              duperemove will check the file paths stored inside of it for
              changes. Files which have changed will be rescanned and their
              updated hashes will be written to the hashfile. Deleted files
              will be removed from the hashfile.

              New files are only added to the hashfile if they are discover‐
              able via the files argument. For that reason you probably want
              to provide the same files list and -r arguments on each run of
              duperemove. The file discovery algorithm is efficient and will
              only visit each file once, even if it is already in the hash‐
              file.

              Adding a new path to a hashfile is as simple as adding it to the
              files argument.

              When deduping from a hashfile, duperemove will avoid deduping
              files which have not changed since the last dedupe.

       -L     Print all files in the hashfile and exit. Requires the --hash‐
              file option. Will print additional information about each file
              when run with -v.

       -R [file]
              Remove file from the db and exit. Can be specified multiple
              times. Duperemove will read the list from standard input if a
              hyphen (-) is provided. Requires the --hashfile option.

              Note: If you are piping filenames from another duperemove in‐
              stance it is advisable to do so into a temporary file first as
              running duperemove simultaneously on the same hashfile may cor‐
              rupt that hashfile.
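
              For example, to drop entries under a directory that no longer
              exists (the paths here are hypothetical), write the list to a
              temporary file first. This assumes the default one-path-per-line
              output of -L:

                     duperemove -L --hashfile=foo.hash | grep '^/mnt/gone/' > /tmp/stale.list
                     duperemove -R - --hashfile=foo.hash < /tmp/stale.list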

       --fdupes
              Run in fdupes mode. With this option you can pipe the output of
              fdupes to duperemove to dedupe any duplicate files found. When
              receiving a file list in this manner, duperemove will skip the
              hashing phase.

       -v     Be verbose.

       --skip-zeroes
              Read data blocks and skip any zeroed blocks. This can speed up
              duperemove, but it can prevent deduplication of zeroed files.

       -b size
              Use the specified block size for reading file extents. Defaults
              to 128K.

       --io-threads=N
              Use N threads for I/O. This is used by the file hashing and
              dedupe stages. Default is automatically detected based on the
              number of host cpus.

       --cpu-threads=N
              Use N threads for CPU bound tasks. This is used by the duplicate
              extent finding stage. Default is automatically detected based on
              the number of host cpus.

              Note: Hyperthreading can adversely affect performance of the ex‐
              tent finding stage. If duperemove detects an Intel CPU with hy‐
              perthreading it will use half the number of cores reported by
              the system for cpu bound tasks.
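
              For example, to cap both stages at four threads on a machine
              shared with other work (the hashfile name is illustrative):

                     duperemove -dr --io-threads=4 --cpu-threads=4 --hashfile=foo.hash foo/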

       --dedupe-options=options
              Comma-separated list of options which alter how we dedupe.
              Prepend 'no' to an option in order to turn it off. See the
              example at the end of this list.

              [no]partial
                     Duperemove can often find more dedupe by comparing por‐
                     tions of extents to each other. This can be a lengthy,
                     CPU intensive task so it is turned off by default.

                     The code behind this option is under active development
                     and as a result the semantics of the partial argument may
                     change.

              [no]same
                     Defaults to off. Allow dedupe of extents within the same
                     file.

              [no]fiemap
                     Defaults to on. Duperemove uses the fiemap ioctl during
                     the dedupe stage to optimize out already deduped extents
                     as well as to provide an estimate of the space saved af‐
                     ter dedupe operations are complete.

                     Unfortunately, some versions of Btrfs exhibit extremely
                     poor performance in fiemap as the number of references on
                     a file extent goes up. If you are experiencing the dedupe
                     phase slowing down or 'locking up' this option may give
                     you a significant amount of performance back.

                     Note: This does not turn off all usage of fiemap. To dis‐
                     able fiemap during the file scan stage, you will also
                     want to use the --lookup-extents=no option.

              [no]block
                     Deprecated.
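
              For example, to enable partial extent matching and turn fiemap
              off on a filesystem where fiemap is slow (the hashfile name is
              illustrative):

                     duperemove -dr --dedupe-options=partial,nofiemap --hashfile=foo.hash foo/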

       --help Prints help text.

       --lookup-extents=[yes|no]
              Defaults to yes. Allows duperemove to skip checksumming some
              blocks by checking their extent state.

       --read-hashes=hashfile
              This option is primarily for testing. See the --hashfile option
              if you want to use hashfiles.

              Read hashes from a hashfile. A file list is not required with
              this option. Dedupe can be done if duperemove is run from the
              same base directory as is stored in the hash file (basically
              duperemove has to be able to find the files).

       --write-hashes=hashfile
              This option is primarily for testing. See the --hashfile option
              if you want to use hashfiles.

              Write hashes to a hashfile. These can be read in at a later date
              and deduped from.

       --debug
              Print debug messages. Forces -v if selected.

       --hash-threads=N
              Deprecated, see --io-threads above.

       --hash=alg
              You can choose between murmur3 and xxhash. The default is mur‐
              mur3 as it is very fast and can generate 128 bit digests with a
              very small chance of collision. Xxhash may be faster but gener‐
              ates only 64 bit digests. Both hashes are fast enough that the
              default should work well for the overwhelming majority of users.

       --exclude=PATTERN
              You can exclude certain files and folders from the deduplication
              process. This might be beneficial for skipping subvolume
              snapshot mounts, for instance. You need to provide the full path
              for exclusion. For example, providing just a file name with a
              wildcard, e.g. duperemove --exclude file-*, will never match
              because internally duperemove works with absolute paths. Another
              thing to keep in mind is that shells usually expand glob
              patterns, so the pattern passed in ought to be quoted. Taking
              everything into consideration, the correct way to pass an
              exclusion pattern is:

                     duperemove --exclude "/path/to/dir/file*" /path/to/dir
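
              For instance, to skip a snapshot directory while deduping the
              rest of a tree (all paths and names here are examples):

                     duperemove -dr --exclude "/mnt/data/.snapshots/*" --hashfile=data.hash /mnt/data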

EXAMPLES
   Simple Usage
       Dedupe the files in directory /foo, recurse into all subdirectories.
       You only want to use this for small data sets.

              duperemove -dr /foo

       Use duperemove with fdupes to dedupe identical files below directory
       foo.

              fdupes -r /foo | duperemove --fdupes

   Using Hashfiles
       Duperemove can optionally store the hashes it calculates in a hashfile.
       Hashfiles have two primary advantages - memory usage and re-usability.
       When using a hashfile, duperemove will stream computed hashes to it,
       instead of main memory.

       If Duperemove is run with an existing hashfile, it will only scan those
       files which have changed since the last time the hashfile was updated.
       The files argument controls which directories duperemove will scan for
       newly added files. In the simplest usage, you rerun duperemove with the
       same parameters and it will only scan changed or newly added files -
       see the first example below.

       Dedupe the files in directory foo, storing hashes in foo.hash. We can
       run this command multiple times and duperemove will only checksum and
       dedupe changed or newly added files.

              duperemove -dr --hashfile=foo.hash foo/

       Don't scan for new files, only update changed or deleted files, then
       dedupe.

              duperemove -dr --hashfile=foo.hash

       Add directory bar to our hashfile and discover any files that were re‐
       cently added to foo.

              duperemove -dr --hashfile=foo.hash foo/ bar/

       List the files tracked by foo.hash.

              duperemove -L --hashfile=foo.hash

FAQ
   Is there an upper limit to the amount of data duperemove can process?
       Duperemove v0.11 is fast at reading and cataloging data. Dedupe runs
       will be memory limited unless the '--hashfile' option is used. '--hash‐
       file' allows duperemove to temporarily store duplicated hashes to disk,
       thus removing the large memory overhead and allowing for a far larger
       amount of data to be scanned and deduped. Realistically though you will
       be limited by the speed of your disks and cpu. In those situations
       where resources are limited you may have success by breaking up the in‐
       put data set into smaller pieces.

       When using a hashfile, duperemove will only store duplicate hashes in
       memory. During normal operation, the hash tree will make up the
       largest portion of duperemove memory usage. As of Duperemove v0.11 hash
       entries are 88 bytes in size. If you know the number of duplicate
       blocks in your data set you can get a rough approximation of memory us‐
       age by multiplying with the hash entry size.
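
       For example, a data set with roughly 1,000,000 duplicate blocks would
       need about 1,000,000 x 88 bytes of memory, i.e. roughly 84MB, for the
       hash entries alone.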

       Actual performance numbers are dependent on hardware - up to date test‐
       ing information is kept on the duperemove wiki (see below for the
       link).

   How large of a hashfile will duperemove create?
       Hashfiles are essentially sqlite3 database files with several tables,
       the largest of which are the files and extents tables. Each extents ta‐
       ble entry is about 72 bytes though that may grow as features are added.
       The size of a files table entry depends on the file path but a good es‐
       timate is around 270 bytes per file. The number of extents in a data
       set is directly proportional to file fragmentation level.

       If you know the total number of extents and files in your data set then
       you can calculate the hashfile size as:

              Hashfile Size = Num Hashes X 72 + Num Files X 270
       Using a real world example of 1TB (8388608 128K blocks) of data over
       1000 files:

              8388608 * 72 + 270 * 1000 = 604249776, or about 576MB for 1TB
              spread over 1000 files.

       Note that none of this takes database overhead into account.

   Is it safe to interrupt the program (Ctrl-C)?
       Yes, Duperemove uses a transactional database engine and organizes db
       changes to take advantage of those features. The result is that you
       should be able to ctrl-c the program at any point and re-run without
       experiencing corruption of your hashfile.

   I have two identical files, why are they not deduped?
       Duperemove by default works on extent granularity. What this means is
       if there are two files which are logically identical (have the same
       content) but are laid out on disk with different extent structure they
       won't be deduped. For example, if two files are 128k each and their
       contents are identical but one of them consists of a single 128k extent
       and the other of 2 x 64k extents then they won't be deduped. This
       behavior is dependent on the current implementation and is subject to
       change as duperemove is being improved.

   How can I find out my space savings after a dedupe?
       Duperemove will print out an estimate of the saved space after a dedupe
       operation for you.

       You can get a more accurate picture by running 'btrfs fi df' before and
       after each duperemove run.

       Be careful about using the 'df' tool on btrfs - it is common for space
       reporting to be 'behind' while delayed updates get processed, so an im‐
       mediate df after deduping might not show any savings.
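
       A typical check (the mount point and hashfile name are only examples)
       would be:

              btrfs fi df /mnt/data
              duperemove -dr --hashfile=data.hash /mnt/data
              btrfs fi df /mnt/data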

   Why is the total deduped data reported as an estimate?
       At the moment duperemove can detect that some underlying extents are
       shared with other files, but it can not resolve which files those ex‐
       tents are shared with.

       Imagine duperemove is examining a series of files and it notes a shared
       data region in one of them. That data could be shared with a file out‐
       side of the series. Since duperemove can't resolve that information it
       will account the shared data against our dedupe operation while in re‐
       ality, the kernel might deduplicate it further for us.

   Why are my files showing dedupe but my disk space is not shrinking?
       This is a little complicated, but it comes down to a feature in Btrfs
       called _bookending_. This is explained in detail at:
       http://en.wikipedia.org/wiki/Btrfs#Extents.

       Essentially though, the underlying representation of an extent in Btrfs
       cannot be split (with a small exception). So sometimes we can end up in
       a situation where a file extent gets partially deduped (and the extents
       marked as shared) but the underlying extent item is not freed or
       truncated.

   Is duperemove safe for my data?
       Yes. To be specific, duperemove does not deduplicate the data itself.
       It simply finds candidates for dedupe and submits them to the Linux
       kernel extent-same ioctl. In order to ensure data integrity, the kernel
       locks out other access to the file and does a byte-by-byte compare be‐
       fore proceeding with the dedupe.

   What is the cost of deduplication?
       Deduplication will lead to increased fragmentation. The blocksize cho‐
       sen can have an effect on this. Larger blocksizes will fragment less
       but may not save you as much space. Conversely, smaller block sizes may
       save more space at the cost of increased fragmentation.

NOTES
       Deduplication is currently only supported by the btrfs and xfs
       filesystems.

       The Duperemove project page can be found at
       https://github.com/markfasheh/duperemove

       There is also a wiki at https://github.com/markfasheh/duperemove/wiki

SEE ALSO
       hashstats(8) filesystems(5) btrfs(8) xfs(8) fdupes(1)

Version 0.11                     September 2016                   duperemove(8)