duperemove(8)               System Manager's Manual              duperemove(8)


NAME
       duperemove - Find duplicate extents and print them to stdout

SYNOPSIS
       duperemove [options] files...

DESCRIPTION
       duperemove is a simple tool for finding duplicated extents and
       submitting them for deduplication. When given a list of files it will
       hash their contents on a block-by-block basis and compare those hashes
       to each other, finding and categorizing extents that match each other.
       When given the -d option, duperemove will submit those extents for
       deduplication using the Linux kernel extent-same ioctl.

       duperemove can store the hashes it computes in a hashfile. If given an
       existing hashfile, duperemove will only compute hashes for those files
       which have changed since the last run. Thus you can run duperemove
       repeatedly on your data as it changes, without having to re-checksum
       unchanged data. For more on hashfiles see the --hashfile option below
       as well as the Examples section.

       duperemove can also take input from the fdupes program; see the
       --fdupes option below.

GENERAL
       Duperemove has two major modes of operation, one of which is a subset
       of the other.

   Readonly / Non-deduplicating Mode
       When run without -d (the default), duperemove will print out one or
       more tables of matching extents it has determined would be ideal
       candidates for deduplication. As a result, readonly mode is useful for
       seeing what duperemove might do when run with -d. The output could
       also be used by some other software to submit the extents for
       deduplication at a later time.

       It is important to note that this mode will not print out all
       instances of matching extents, just those it would consider for
       deduplication.

       Generally, duperemove does not concern itself with the underlying
       representation of the extents it processes. Some of them could be
       compressed, undergoing I/O, or even have already been deduplicated. In
       dedupe mode, the kernel handles those details and therefore we try not
       to replicate that work.

   Deduping Mode
       This functions similarly to readonly mode with the exception that the
       duplicated extents found in our "read, hash, and compare" step will
       actually be submitted for deduplication. An estimate of the total data
       deduplicated will be printed after the operation is complete. This
       estimate is calculated by comparing the total amount of shared bytes
       in each file before and after the dedupe.

OPTIONS
       files can refer to a list of regular files and directories or be a
       hyphen (-) to read them from standard input. If a directory is
       specified, all regular files within it will also be scanned.
       Duperemove can also be told to recursively scan directories with the
       '-r' switch.

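       For example, a file list can be generated by another program and fed
       to duperemove on standard input. The find expression, path, and
       hashfile name below are hypothetical; this is only a sketch:

              find /mnt/data -name '*.vmdk' | \
                      duperemove -d --hashfile=vm.hash -
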
       -r     Enable recursive dir traversal.

       -d     De-dupe the results - only works on btrfs and xfs
              (experimental).

       -A     Opens files readonly when deduping. Primarily for use by
              privileged users on readonly snapshots.

       -h     Print numbers in human-readable format.

       -q     Quiet mode. Duperemove will only print errors and a short
              summary of any dedupe.

       --hashfile=hashfile
              Use a file for storage of hashes instead of memory. This option
              drastically reduces the memory footprint of duperemove and is
              recommended when your data set is more than a few files large.
              Hashfiles are also reusable, allowing you to further reduce the
              amount of hashing done on subsequent dedupe runs.

              If hashfile does not exist it will be created. If it exists,
              duperemove will check the file paths stored inside of it for
              changes. Files which have changed will be rescanned and their
              updated hashes will be written to the hashfile. Deleted files
              will be removed from the hashfile.

              New files are only added to the hashfile if they are
              discoverable via the files argument. For that reason you
              probably want to provide the same files list and -r arguments
              on each run of duperemove. The file discovery algorithm is
              efficient and will only visit each file once, even if it is
              already in the hashfile.

              Adding a new path to a hashfile is as simple as adding it to
              the files argument.

              When deduping from a hashfile, duperemove will avoid deduping
              files which have not changed since the last dedupe.

       -L     Print all files in the hashfile and exit. Requires the
              --hashfile option. Will print additional information about each
              file when run with -v.

       -R [file]
              Remove file from the db and exit. Can be specified multiple
              times. Duperemove will read the list from standard input if a
              hyphen (-) is provided. Requires the --hashfile option.

              Note: If you are piping filenames from another duperemove
              instance it is advisable to do so into a temporary file first
              as running duperemove simultaneously on the same hashfile may
              corrupt that hashfile.

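              For example, stale entries could be dropped either by naming
              them directly or by piping a list on standard input (the paths
              and hashfile name here are hypothetical):

                     duperemove -R /old/a.img -R /old/b.img --hashfile=foo.hash
                     duperemove -R - --hashfile=foo.hash < remove-list.txt
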
       --fdupes
              Run in fdupes mode. With this option you can pipe the output of
              fdupes to duperemove to dedupe any duplicate files found. When
              receiving a file list in this manner, duperemove will skip the
              hashing phase.

       -v     Be verbose.

       --skip-zeroes
              Read data blocks and skip any zeroed blocks. This is useful for
              speeding up duperemove, but it can prevent deduplication of
              zeroed files.

       -b size
              Use the specified block size. Raising the block size will
              consume less memory but may miss some duplicate blocks.
              Conversely, lowering the block size consumes more memory and
              may find more duplicate blocks. The default block size of 128K
              was chosen with these parameters in mind.

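              For example, a run with a smaller 64K block size might look
              like the following (the path and hashfile name are
              hypothetical, and the size is given in bytes to avoid assuming
              any particular suffix syntax):

                     duperemove -dr -b 65536 --hashfile=foo.hash /mnt/data
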
       --io-threads=N
              Use N threads for I/O. This is used by the file hashing and
              dedupe stages. Default is automatically detected based on the
              number of host cpus.

       --cpu-threads=N
              Use N threads for CPU bound tasks. This is used by the
              duplicate extent finding stage. Default is automatically
              detected based on the number of host cpus.

              Note: Hyperthreading can adversely affect performance of the
              extent finding stage. If duperemove detects an Intel CPU with
              hyperthreading it will use half the number of cores reported by
              the system for cpu bound tasks.

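              For example, the thread counts could be pinned explicitly on a
              shared machine (the counts, path and hashfile name here are
              hypothetical):

                     duperemove -dr --io-threads=4 --cpu-threads=4 \
                             --hashfile=foo.hash /mnt/data
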
       --dedupe-options=options
              Comma separated list of options which alter how we dedupe.
              Prepend 'no' to an option in order to turn it off. An example
              invocation follows the list below.

              [no]same
                     Defaults to off. Allow dedupe of extents within the same
                     file.

              [no]fiemap
                     Defaults to on. Duperemove uses the fiemap ioctl during
                     the dedupe stage to optimize out already deduped extents
                     as well as to provide an estimate of the space saved
                     after dedupe operations are complete.

                     Unfortunately, some versions of Btrfs exhibit extremely
                     poor performance in fiemap as the number of references
                     on a file extent goes up. If you are experiencing the
                     dedupe phase slowing down or 'locking up', this option
                     may give you a significant amount of performance back.

                     Note: This does not turn off all usage of fiemap. To
                     disable fiemap during the file scan stage, you will also
                     want to use the --lookup-extents=no option.

              [no]block
                     Defaults to off. Dedupe by block - don't optimize our
                     data into extents before dedupe. Generally this is
                     undesirable as it will greatly increase the total number
                     of dedupe requests. There is also a larger potential for
                     file fragmentation.

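              For example, turning off fiemap during the dedupe stage on a
              filesystem where it performs poorly might look like this (the
              path and hashfile name are hypothetical):

                     duperemove -dr --dedupe-options=nofiemap \
                             --hashfile=foo.hash /mnt/data
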
       --help Prints help text.

       --lookup-extents=[yes|no]
              Defaults to no. Allows duperemove to skip checksumming some
              blocks by checking their extent state.

       -x     Don't cross filesystem boundaries; this is the default behavior
              since duperemove v0.11. The option is kept for backwards
              compatibility.

       --read-hashes=hashfile
              This option is primarily for testing. See the --hashfile option
              if you want to use hashfiles.

              Read hashes from a hashfile. A file list is not required with
              this option. Dedupe can be done if duperemove is run from the
              same base directory as is stored in the hash file (basically
              duperemove has to be able to find the files).

       --write-hashes=hashfile
              This option is primarily for testing. See the --hashfile option
              if you want to use hashfiles.

              Write hashes to a hashfile. These can be read in at a later
              date and deduped from.

       --debug
              Print debug messages; forces -v if selected.

       --hash-threads=N
              Deprecated, see --io-threads above.

       --hash=alg
              You can choose between murmur3 and xxhash. The default is
              murmur3 as it is very fast and can generate 128 bit digests for
              a very small chance of collision. Xxhash may be faster but
              generates only 64 bit digests. Both hashes are fast enough that
              the default should work well for the overwhelming majority of
              users.

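              For example, a run using the xxhash algorithm instead of the
              default murmur3 (the path here is hypothetical):

                     duperemove -dr --hash=xxhash /mnt/data
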
EXAMPLES
   Simple Usage
       Dedupe the files in directory /foo, recursing into all subdirectories.
       You only want to use this for small data sets.

              duperemove -dr /foo

       Use duperemove with fdupes to dedupe identical files below directory
       foo.

              fdupes -r /foo | duperemove --fdupes

   Using Hashfiles
       Duperemove can optionally store the hashes it calculates in a
       hashfile. Hashfiles have two primary advantages - memory usage and
       re-usability. When using a hashfile, duperemove will stream computed
       hashes to it, instead of main memory.

       If duperemove is run with an existing hashfile, it will only scan
       those files which have changed since the last time the hashfile was
       updated. The files argument controls which directories duperemove will
       scan for newly added files. In the simplest usage, you rerun
       duperemove with the same parameters and it will only scan changed or
       newly added files - see the first example below.

       Dedupe the files in directory foo, storing hashes in foo.hash. We can
       run this command multiple times and duperemove will only checksum and
       dedupe changed or newly added files.

              duperemove -dr --hashfile=foo.hash foo/

       Don't scan for new files, only update changed or deleted files, then
       dedupe.

              duperemove -dr --hashfile=foo.hash

       Add directory bar to our hashfile and discover any files that were
       recently added to foo.

              duperemove -dr --hashfile=foo.hash foo/ bar/

       List the files tracked by foo.hash.

              duperemove -L --hashfile=foo.hash

FAQ
   Is there an upper limit to the amount of data duperemove can process?
       Duperemove v0.11 is fast at reading and cataloging data. Dedupe runs
       will be memory limited unless the '--hashfile' option is used.
       '--hashfile' allows duperemove to temporarily store duplicated hashes
       to disk, thus removing the large memory overhead and allowing for a
       far larger amount of data to be scanned and deduped. Realistically
       though you will be limited by the speed of your disks and cpu. In
       those situations where resources are limited you may have success by
       breaking up the input data set into smaller pieces.

       When using a hashfile, duperemove will only store duplicate hashes in
       memory. During normal operation the hash tree will then make up the
       largest portion of duperemove memory usage. As of Duperemove v0.11
       hash entries are 88 bytes in size. If you know the number of duplicate
       blocks in your data set you can get a rough approximation of memory
       usage by multiplying with the hash entry size.

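       For example, a data set with roughly one million (1048576) duplicated
       blocks - a purely illustrative figure - would need in the neighborhood
       of:

              1048576 * 88 = 92274688, or about 88MB of hash entries.
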
       Actual performance numbers are dependent on hardware - up to date
       testing information is kept on the duperemove wiki (see below for the
       link).

   How large of a hashfile will duperemove create?
       Hashfiles are essentially sqlite3 database files with several tables,
       the largest of which are the files and hashes tables. Each hashes
       table entry is under 90 bytes, though that may grow as features are
       added. The size of a files table entry depends on the file path, but a
       good estimate is around 270 bytes per file.

       If you know the total number of blocks and files in your data set then
       you can calculate the hashfile size as:

              Hashfile Size = Num Hashes X 90 + Num Files X 270

       Using a real world example of 1TB (8388608 128K blocks) of data over
       1000 files:

              8388608 * 90 + 270 * 1000 = 755244720

       or about 720MB for 1TB spread over 1000 files.

   Is it safe to interrupt the program (Ctrl-C)?
       Yes. Duperemove uses a transactional database engine and organizes db
       changes to take advantage of those features. The result is that you
       should be able to ctrl-c the program at any point and re-run without
       experiencing corruption of your hashfile.

   How can I find out my space savings after a dedupe?
       Duperemove will print out an estimate of the saved space after a
       dedupe operation for you.

       You can get a more accurate picture by running 'btrfs fi df' before
       and after each duperemove run.

       Be careful about using the 'df' tool on btrfs - it is common for space
       reporting to be 'behind' while delayed updates get processed, so an
       immediate df after deduping might not show any savings.

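       For example, a before-and-after comparison might look like the
       following (the mount point and hashfile name are hypothetical, and the
       sync is only there to give delayed space accounting a chance to
       settle):

              btrfs fi df /mnt/data
              duperemove -dr --hashfile=foo.hash /mnt/data
              sync
              btrfs fi df /mnt/data
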
   Why is the total deduped data reported as an estimate?
       At the moment duperemove can detect that some underlying extents are
       shared with other files, but it can not resolve which files those
       extents are shared with.

       Imagine duperemove is examining a series of files and it notes a
       shared data region in one of them. That data could be shared with a
       file outside of the series. Since duperemove can't resolve that
       information, it will account the shared data against our dedupe
       operation, while in reality the kernel might deduplicate it further
       for us.

   Why are my files showing dedupe but my disk space is not shrinking?
       This is a little complicated, but it comes down to a feature in Btrfs
       called _bookending_. The Wikipedia article on Btrfs explains this in
       detail: http://en.wikipedia.org/wiki/Btrfs#Extents.

       Essentially though, the underlying representation of an extent in
       Btrfs can not be split (with small exceptions). So sometimes we can
       end up in a situation where a file extent gets partially deduped (and
       the extents marked as shared) but the underlying extent item is not
       freed or truncated.

   Is duperemove safe for my data?
       Yes. To be specific, duperemove does not deduplicate the data itself.
       It simply finds candidates for dedupe and submits them to the Linux
       kernel extent-same ioctl. In order to ensure data integrity, the
       kernel locks out other access to the file and does a byte-by-byte
       compare before proceeding with the dedupe.

   What is the cost of deduplication?
       Deduplication will lead to increased fragmentation. The block size
       chosen can have an effect on this. Larger block sizes will fragment
       less but may not save you as much space. Conversely, smaller block
       sizes may save more space at the cost of increased fragmentation.

NOTES
       Deduplication is currently only supported by the btrfs and xfs
       filesystems.

       The Duperemove project page can be found at
       https://github.com/markfasheh/duperemove

       There is also a wiki at https://github.com/markfasheh/duperemove/wiki

SEE ALSO
       hashstats(8) filesystems(5) btrfs(8) xfs(8) fdupes(1)



Version 0.11                     September 2016                  duperemove(8)