Duperemove(8)                 System Manager’s Manual                Duperemove(8)

NAME
       duperemove - Find duplicate extents and submit them for deduplication

SYNOPSIS
       duperemove [options] files...

DESCRIPTION
       duperemove is a simple tool for finding duplicated extents and
       submitting them for deduplication.  When given a list of files it
       will hash the contents of their extents and compare those hashes to
       each other, finding and categorizing extents that match each other.
       When given the -d option, duperemove will submit those extents for
       deduplication using the Linux kernel FIDEDUPERANGE ioctl.

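       As a rough illustration of that interface (this sketch is not part of
       duperemove itself, and the paths, offsets and length here are made
       up), a single dedupe request built along the lines of
       ioctl_fideduperange(2) could look like:

       /* Hypothetical example: ask the kernel to share one 128K block
        * between two files, as described in ioctl_fideduperange(2). */
       #include <fcntl.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/ioctl.h>
       #include <linux/fs.h>

       int main(void)
       {
           struct file_dedupe_range *range;
           int src = open("/foo/a", O_RDONLY);
           int dst = open("/foo/b", O_RDWR);

           if (src < 0 || dst < 0)
               return 1;

           /* One destination range; info[] is a flexible array member. */
           range = calloc(1, sizeof(*range) +
                             sizeof(struct file_dedupe_range_info));
           range->src_offset = 0;
           range->src_length = 128 * 1024;   /* one 128K block */
           range->dest_count = 1;
           range->info[0].dest_fd = dst;
           range->info[0].dest_offset = 0;

           if (ioctl(src, FIDEDUPERANGE, range) < 0) {
               perror("FIDEDUPERANGE");
               return 1;
           }

           /* The kernel reports status and bytes deduped per target. */
           if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
               printf("deduped %llu bytes\n",
                      (unsigned long long)range->info[0].bytes_deduped);
           else
               printf("not deduped (status %d)\n", range->info[0].status);

           free(range);
           return 0;
       }

       The kernel only shares the ranges after its own byte-by-byte
       comparison of the data succeeds (see the FAQ below), which is what
       makes submitting candidates this way safe.
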
       duperemove can store the hashes it computes in a hashfile.  If given
       an existing hashfile, duperemove will only compute hashes for those
       files which have changed since the last run.  Thus you can run
       duperemove repeatedly on your data as it changes, without having to
       re-checksum unchanged data.  For more on hashfiles see the --hashfile
       option below as well as the Examples section.

       duperemove can also take input from the fdupes program, see the
       --fdupes option below.

GENERAL
       Duperemove has two major modes of operation, one of which is a subset
       of the other.

   Readonly / Non-deduplicating Mode
       When run without -d (the default) duperemove will print out one or
       more tables of matching extents it has determined would be ideal
       candidates for deduplication.  As a result, readonly mode is useful
       for seeing what duperemove might do when run with -d.

       Generally, duperemove does not concern itself with the underlying
       representation of the extents it processes.  Some of them could be
       compressed, undergoing I/O, or even have already been deduplicated.
       In dedupe mode, the kernel handles those details and therefore we try
       not to replicate that work.

   Deduping Mode
       This functions similarly to readonly mode with the exception that the
       duplicated extents found in our “read, hash, and compare” step will
       actually be submitted for deduplication.  Extents that have already
       been deduped will be skipped.  An estimate of the total data
       deduplicated will be printed after the operation is complete.  This
       estimate is calculated by comparing the total amount of shared bytes
       in each file before and after the dedupe.

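       The shared-byte totals come from extent information reported by the
       kernel.  As a simplified illustration only (this is not duperemove’s
       actual accounting code), the shared portion of a file can be sampled
       with the FIEMAP ioctl by summing the extents flagged as shared:

       /* Sketch: report how many bytes of a file sit in extents that the
        * kernel flags as shared.  Simplified: a real tool would loop
        * until FIEMAP_EXTENT_LAST is seen and check for errors. */
       #include <fcntl.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <sys/ioctl.h>
       #include <linux/fs.h>
       #include <linux/fiemap.h>

       static unsigned long long shared_bytes(const char *path)
       {
           int fd = open(path, O_RDONLY);
           struct fiemap *fm;
           unsigned long long shared = 0;
           unsigned int i;

           if (fd < 0)
               return 0;

           fm = calloc(1, sizeof(*fm) + 256 * sizeof(struct fiemap_extent));
           fm->fm_start = 0;
           fm->fm_length = ~0ULL;            /* map the whole file */
           fm->fm_flags = FIEMAP_FLAG_SYNC;  /* flush delayed allocations */
           fm->fm_extent_count = 256;

           if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
               for (i = 0; i < fm->fm_mapped_extents; i++)
                   if (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED)
                       shared += fm->fm_extents[i].fe_length;

           free(fm);
           close(fd);
           return shared;
       }

       int main(int argc, char **argv)
       {
           int i;

           for (i = 1; i < argc; i++)
               printf("%s: %llu shared bytes\n", argv[i],
                      shared_bytes(argv[i]));
           return 0;
       }

       Comparing such totals before and after a dedupe run gives a
       space-savings figure in the same spirit as the printed estimate, with
       the caveats described in the FAQ below.
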
OPTIONS
   Common options
       files can refer to a list of regular files and directories or be a
       hyphen (-) to read them from standard input.  If a directory is
       specified, all regular files within it will also be scanned.
       Duperemove can also be told to recursively scan directories with the
       -r switch.

       -r     Enable recursive dir traversal.

       -d     De-dupe the results - only works on btrfs and xfs.  Use this
              option twice to disable the check and try to run the ioctl
              anyway.

       --hashfile=hashfile
              Use a file for storage of hashes instead of memory.  This
              option drastically reduces the memory footprint of duperemove
              and is recommended when your data set consists of more than a
              few files.  Hashfiles are also reusable, allowing you to
              further reduce the amount of hashing done on subsequent dedupe
              runs.

              If hashfile does not exist it will be created.  If it exists,
              duperemove will check the file paths stored inside of it for
              changes.  Files which have changed will be rescanned and their
              updated hashes will be written to the hashfile.  Deleted files
              will be removed from the hashfile.

              New files are only added to the hashfile if they are
              discoverable via the files argument.  For that reason you
              probably want to provide the same files list and -r arguments
              on each run of duperemove.  The file discovery algorithm is
              efficient and will only visit each file once, even if it is
              already in the hashfile.

              Adding a new path to a hashfile is as simple as adding it to
              the files argument.

              When deduping from a hashfile, duperemove will avoid deduping
              files which have not changed since the last dedupe.

       -B N, --batchsize=N
              Run the deduplication phase every N newly scanned files.  This
              greatly reduces memory usage for large data sets, or when you
              are doing partial extent lookups, but reduces multithreading
              efficiency.

              Because of that overhead, the value should be chosen based on
              the average file size and blocksize.

              The default is a sane value for extents-only lookups; you can
              go as low as 1 if you are running duperemove on very large
              files (virtual machine images, for example).

              By default, batching is set to 1024.

       -h     Print numbers in human-readable format.

       -q     Quiet mode.  Duperemove will only print errors and a short
              summary of any dedupe.

       -v     Be verbose.

       --help Prints help text.

   Advanced options
       --fdupes
              Run in fdupes mode.  With this option you can pipe the output
              of fdupes to duperemove to dedupe any duplicate files found.
              When receiving a file list in this manner, duperemove will
              skip the hashing phase.

       -L     Print all files in the hashfile and exit.  Requires the
              --hashfile option.  Will print additional information about
              each file when run with -v.

       -R files ...
              Remove files from the db and exit.  Duperemove will read the
              list from standard input if a hyphen (-) is provided.
              Requires the --hashfile option.

              Note: If you are piping filenames from another duperemove
              instance it is advisable to do so into a temporary file first,
              as running duperemove simultaneously on the same hashfile may
              corrupt that hashfile.

       --skip-zeroes
              Read data blocks and skip any zeroed blocks.  This can speed
              up duperemove, but may prevent deduplication of zeroed files.

       -b size
              Use the specified block size for reading file extents.
              Defaults to 128K.

       --io-threads=N
              Use N threads for I/O.  This is used by the file hashing and
              dedupe stages.  Default is automatically detected based on the
              number of host CPUs.

       --cpu-threads=N
              Use N threads for CPU bound tasks.  This is used by the
              duplicate extent finding stage.  Default is automatically
              detected based on the number of host CPUs.

              Note: Hyperthreading can adversely affect performance of the
              extent finding stage.  If duperemove detects an Intel CPU with
              hyperthreading it will use half the number of cores reported
              by the system for CPU bound tasks.

       --dedupe-options=options
              Comma separated list of options which alter how we dedupe.
              Prepend `no' to an option in order to turn it off.

              [no]partial
                     Duperemove can often find more dedupe by comparing
                     portions of extents to each other.  This can be a
                     lengthy, CPU intensive task so it is turned off by
                     default.  Using --batchsize is recommended to limit the
                     negative effects of this option.

                     The code behind this option is under active development
                     and as a result the semantics of the partial argument
                     may change.

              [no]same
                     Defaults to on.  Allow dedupe of extents within the
                     same file.

              [no]rescan_files
                     Defaults to on.  Duperemove will check for files that
                     were found and deduplicated in a previous run, based on
                     the hashfile.  Deduplicated files may have changed if
                     new content was added, but also if their physical
                     layout was modified (defrag for instance).  You can
                     disable those checks to increase performance when
                     running duperemove against a specific directory or file
                     which you know is the only changed part of a larger,
                     otherwise unchanged data set.  Duperemove will still
                     dedupe that specific target against any shared extent
                     found in the existing files.

              [no]only_whole_files
                     Defaults to off.  Duperemove will only work on whole
                     files.  Both extent-based and block-based deduplication
                     will be disabled.  The hashfile will be smaller, and
                     some operations will be faster, but deduplication
                     efficiency will be reduced.

       --read-hashes=hashfile
              This option is primarily for testing.  See the --hashfile
              option if you want to use hashfiles.

              Read hashes from a hashfile.  A file list is not required with
              this option.  Dedupe can be done if duperemove is run from the
              same base directory as is stored in the hashfile (basically
              duperemove has to be able to find the files).

       --write-hashes=hashfile
              This option is primarily for testing.  See the --hashfile
              option if you want to use hashfiles.

              Write hashes to a hashfile.  These can be read in at a later
              date and deduped from.

       --debug
              Print debug messages.  Forces -v if selected.

       --hash-threads=N
              Deprecated, see --io-threads above.

       --exclude=PATTERN
              You can exclude certain files and folders from the
              deduplication process.  This might be beneficial for skipping
              subvolume snapshot mounts, for instance.  Unless you provide a
              full path for exclusion, the exclude will be relative to the
              current working directory.  Another thing to keep in mind is
              that shells usually expand glob patterns, so the pattern
              passed in ought to be quoted.  Taking everything into
              consideration, the correct way to pass an exclusion pattern
              is:

                     duperemove --exclude "/path/to/dir/file*" /path/to/dir

EXAMPLES
   Simple Usage
       Dedupe the files in directory /foo, recursing into all
       subdirectories.  You only want to use this for small data sets:

              duperemove -dr /foo

       Use duperemove with fdupes to dedupe identical files below directory
       foo:

              fdupes -r /foo | duperemove --fdupes

   Using Hashfiles
       Duperemove can optionally store the hashes it calculates in a
       hashfile.  Hashfiles have two primary advantages - memory usage and
       re-usability.  When using a hashfile, duperemove will stream computed
       hashes to it, instead of main memory.

       If Duperemove is run with an existing hashfile, it will only scan
       those files which have changed since the last time the hashfile was
       updated.  The files argument controls which directories duperemove
       will scan for newly added files.  In the simplest usage, you rerun
       duperemove with the same parameters and it will only scan changed or
       newly added files - see the first example below.

       Dedupe the files in directory foo, storing hashes in foo.hash.  We
       can run this command multiple times and duperemove will only checksum
       and dedupe changed or newly added files:

              duperemove -dr --hashfile=foo.hash foo/

       Don’t scan for new files, only update changed or deleted files, then
       dedupe:

              duperemove -dr --hashfile=foo.hash

       Add directory bar to our hashfile and discover any files that were
       recently added to foo:

              duperemove -dr --hashfile=foo.hash foo/ bar/

       List the files tracked by foo.hash:

              duperemove -L --hashfile=foo.hash

FAQ
   Is duperemove safe for my data?
       Yes.  To be specific, duperemove does not deduplicate the data
       itself.  It simply finds candidates for dedupe and submits them to
       the Linux kernel FIDEDUPERANGE ioctl.  In order to ensure data
       integrity, the kernel locks out other access to the file and does a
       byte-by-byte compare before proceeding with the dedupe.

   Is it safe to interrupt the program (Ctrl-C)?
       Yes.  The Linux kernel deals with the actual data.  On Duperemove’s
       side, a transactional database engine is used.  The result is that
       you should be able to ctrl-c the program at any point and re-run it
       without experiencing corruption of your hashfile.  In case of a bug,
       your hashfile may be broken, but your data never will be.

   I got two identical files, why are they not deduped?
       Duperemove by default works on extent granularity.  What this means
       is if there are two files which are logically identical (have the
       same content) but are laid out on disk with different extent
       structure they won’t be deduped.  For example, if 2 files are 128k
       each and their contents are identical but one of them consists of a
       single 128k extent and the other of 2 * 64k extents, then they won’t
       be deduped.  This behavior is dependent on the current implementation
       and is subject to change as duperemove is being improved.

   What is the cost of deduplication?
       Deduplication will lead to increased fragmentation.  The blocksize
       chosen can have an effect on this.  Larger blocksizes will fragment
       less but may not save you as much space.  Conversely, smaller block
       sizes may save more space at the cost of increased fragmentation.

   How can I find out my space savings after a dedupe?
       Duperemove will print out an estimate of the saved space after a
       dedupe operation for you.

       You can get a more accurate picture by running `btrfs fi df' before
       and after each duperemove run.

       Be careful about using the `df' tool on btrfs - it is common for
       space reporting to be `behind' while delayed updates get processed,
       so an immediate df after deduping might not show any savings.

   Why is the total deduped data reported as an estimate?
       At the moment duperemove can detect that some underlying extents are
       shared with other files, but it can not resolve which files those
       extents are shared with.

       Imagine duperemove is examining a series of files and it notes a
       shared data region in one of them.  That data could be shared with a
       file outside of the series.  Since duperemove can’t resolve that
       information it will account the shared data against our dedupe
       operation while in reality, the kernel might deduplicate it further
       for us.

   Why are my files showing dedupe but my disk space is not shrinking?
       This is a little complicated, but it comes down to a feature in Btrfs
       called bookending.  The Btrfs article on Wikipedia
       (http://en.wikipedia.org/wiki/Btrfs#Extents) explains this in detail.

       Essentially though, the underlying representation of an extent in
       Btrfs can not be split (with small exception).  So sometimes we can
       end up in a situation where a file extent gets partially deduped (and
       the extents marked as shared) but the underlying extent item is not
       freed or truncated.

   Is there an upper limit to the amount of data duperemove can process?
       Duperemove is fast at reading and cataloging data.  Dedupe runs will
       be memory limited unless the --hashfile option is used.  --hashfile
       allows duperemove to temporarily store duplicated hashes to disk,
       thus removing the large memory overhead and allowing for a far larger
       amount of data to be scanned and deduped.  Realistically though you
       will be limited by the speed of your disks and cpu.  In those
       situations where resources are limited you may have success by
       breaking up the input data set into smaller pieces.

       When using a hashfile, duperemove will only store duplicate hashes in
       memory.  During normal operation the hash tree will make up the
       largest portion of duperemove memory usage.  As of Duperemove v0.11
       hash entries are 88 bytes in size.  If you know the number of
       duplicate blocks in your data set you can get a rough approximation
       of memory usage by multiplying with the hash entry size.

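       For example, in the (hypothetical) worst case where all 8388608 128K
       blocks of a 1TB data set are duplicates, the hash tree would need
       roughly:

              8388608 * 88 = 738197504 or about 704MB of memory.
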
       Actual performance numbers are dependent on hardware - up to date
       testing information is kept on the duperemove wiki (see below for the
       link).

   How large of a hashfile will duperemove create?
       Hashfiles are essentially sqlite3 database files with several tables,
       the largest of which are the files and extents tables.  Each extents
       table entry is about 72 bytes though that may grow as features are
       added.  The size of a files table entry depends on the file path but
       a good estimate is around 270 bytes per file.  The number of extents
       in a data set is directly proportional to file fragmentation level.

       If you know the total number of extents and files in your data set
       then you can calculate the hashfile size as:

              Hashfile Size = Num Hashes * 72 + Num Files * 270

       Using a real world example of 1TB (8388608 128K blocks) of data over
       1000 files:

              8388608 * 72 + 270 * 1000 = 755244720 or about 720MB for 1TB
              spread over 1000 files.

       Note that none of this takes database overhead into account.

NOTES
       Deduplication is currently only supported by the btrfs and xfs
       filesystems.

       The Duperemove project page can be found on GitHub
       (https://github.com/markfasheh/duperemove)

       There is also a wiki (https://github.com/markfasheh/duperemove/wiki)

SEE ALSO
       • hashstats(8)

       • filesystems(5)

       • btrfs(8)

       • xfs(8)

       • fdupes(1)

       • ioctl_fideduperange(2)



duperemove 0.13                  29 Sept 2023                     Duperemove(8)