duperemove(8)              System Manager's Manual             duperemove(8)

NAME
       duperemove - Find duplicate extents and print them to stdout

SYNOPSIS
       duperemove [options] files...

DESCRIPTION
       duperemove is a simple tool for finding duplicated extents and
       submitting them for deduplication. When given a list of files it
       will hash their contents on a block by block basis and compare
       those hashes to each other, finding and categorizing blocks that
       match each other. When given the -d option, duperemove will submit
       those extents for deduplication using the Linux kernel extent-same
       ioctl.

       duperemove can store the hashes it computes in a hashfile. If given
       an existing hashfile, duperemove will only compute hashes for those
       files which have changed since the last run. Thus you can run
       duperemove repeatedly on your data as it changes, without having to
       re-checksum unchanged data. For more on hashfiles see the
       --hashfile option below as well as the Examples section.

       duperemove can also take input from the fdupes program; see the
       --fdupes option below.

GENERAL
       Duperemove has two major modes of operation, one of which is a
       subset of the other.

   Readonly / Non-deduplicating Mode
       When run without -d (the default) duperemove will print out one or
       more tables of matching extents it has determined would be ideal
       candidates for deduplication. As a result, readonly mode is useful
       for seeing what duperemove might do when run with -d. The output
       could also be used by some other software to submit the extents
       for deduplication at a later time.

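       For example, a readonly run that recursively scans a directory and
       only reports candidate extents, without deduping anything (the
       path is illustrative):

              duperemove -r /foo
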
       Generally, duperemove does not concern itself with the underlying
       representation of the extents it processes. Some of them could be
       compressed, undergoing I/O, or even have already been deduplicated.
       In dedupe mode, the kernel handles those details and therefore we
       try not to replicate that work.

   Deduping Mode
       This functions similarly to readonly mode with the exception that
       the duplicated extents found in our "read, hash, and compare" step
       will actually be submitted for deduplication. An estimate of the
       total data deduplicated will be printed after the operation is
       complete. This estimate is calculated by comparing the total amount
       of shared bytes in each file before and after the dedupe.

OPTIONS
       files can refer to a list of regular files and directories or be a
       hyphen (-) to read them from standard input. If a directory is
       specified, all regular files within it will also be scanned.
       Duperemove can also be told to recursively scan directories with
       the '-r' switch.

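       For example, a file list can be piped in on standard input (a
       sketch; the find invocation and path are illustrative):

              find /foo -type f | duperemove -
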
       -r     Enable recursive dir traversal.

       -d     De-dupe the results - only works on btrfs and xfs
              (experimental).

       -A     Opens files readonly when deduping. Primarily for use by
              privileged users on readonly snapshots.

       -h     Print numbers in human-readable format.

       --hashfile=hashfile
              Use a file for storage of hashes instead of memory. This
              option drastically reduces the memory footprint of
              duperemove and is recommended when your data set is more
              than a few files large. Hashfiles are also reusable,
              allowing you to further reduce the amount of hashing done
              on subsequent dedupe runs.

              If hashfile does not exist it will be created. If it
              exists, duperemove will check the file paths stored inside
              of it for changes. Files which have changed will be
              rescanned and their updated hashes will be written to the
              hashfile. Deleted files will be removed from the hashfile.

              New files are only added to the hashfile if they are
              discoverable via the files argument. For that reason you
              probably want to provide the same files list and -r
              arguments on each run of duperemove. The file discovery
              algorithm is efficient and will only visit each file once,
              even if it is already in the hashfile.

              Adding a new path to a hashfile is as simple as adding it
              to the files argument.

              When deduping from a hashfile, duperemove will avoid
              deduping files which have not changed since the last
              dedupe.

       -L     Print all files in the hashfile and exit. Requires the
              --hashfile option. Will print additional information about
              each file when run with -v.

       -R [file]
              Remove file from the db and exit. Can be specified multiple
              times. Duperemove will read the list from standard input if
              a hyphen (-) is provided. Requires the --hashfile option.

              Note: If you are piping filenames from another duperemove
              instance it is advisable to do so into a temporary file
              first, as running duperemove simultaneously on the same
              hashfile may corrupt that hashfile.

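              For example, to drop a single stale path from an existing
              hashfile (both names are illustrative):

                     duperemove --hashfile=foo.hash -R /foo/old-file.img
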
       --fdupes
              Run in fdupes mode. With this option you can pipe the
              output of fdupes to duperemove to dedupe any duplicate
              files found. When receiving a file list in this manner,
              duperemove will skip the hashing phase.

       -v     Be verbose.

       --skip-zeroes
              Read data blocks and skip any zeroed blocks. This can speed
              up duperemove, but may prevent deduplication of zeroed
              files.

       -b size
              Use the specified block size. Raising the block size will
              consume less memory but may miss some duplicate blocks.
              Conversely, lowering the blocksize consumes more memory and
              may find more duplicate blocks. The default blocksize of
              128K was chosen with these parameters in mind.
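
              For example, to run with a smaller 64K block size (a
              sketch; the 'k' size suffix is assumed to be accepted and
              the path is illustrative):

                     duperemove -dr -b 64k /foo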

       --io-threads=N
              Use N threads for I/O. This is used by the file hashing and
              dedupe stages. Default is automatically detected based on
              number of host cpus.

       --cpu-threads=N
              Use N threads for CPU bound tasks. This is used by the
              duplicate extent finding stage. Default is automatically
              detected based on number of host cpus.

              Note: Hyperthreading can adversely affect performance of
              the extent finding stage. If duperemove detects an Intel
              CPU with hyperthreading it will use half the number of
              cores reported by the system for cpu bound tasks.

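              For example, to cap both the I/O and CPU thread pools on a
              shared machine (the counts and path are illustrative):

                     duperemove -dr --io-threads=4 --cpu-threads=2 /foo
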
       --dedupe-options=options
              Comma separated list of options which alter how we dedupe.
              Prepend 'no' to an option in order to turn it off.

              [no]same
                     Defaults to off. Allow dedupe of extents within the
                     same file.

              [no]fiemap
                     Defaults to on. Duperemove uses the fiemap ioctl
                     during the dedupe stage to optimize out already
                     deduped extents as well as to provide an estimate of
                     the space saved after dedupe operations are
                     complete.

                     Unfortunately, some versions of Btrfs exhibit
                     extremely poor performance in fiemap as the number
                     of references on a file extent goes up. If you are
                     experiencing the dedupe phase slowing down or
                     'locking up', this option may give you a significant
                     amount of performance back.

                     Note: This does not turn off all usage of fiemap. To
                     disable fiemap during the file scan stage, you will
                     also want to use the --lookup-extents=no option.

              [no]block
                     Defaults to on. Duperemove submits duplicate blocks
                     directly to the dedupe engine.

                     Duperemove can optionally optimize the duplicate
                     block lists into larger extents prior to dedupe
                     submission. The search algorithm used for this,
                     however, has a very high memory and cpu overhead,
                     but may reduce the number of extent references
                     created during dedupe. If you'd like to try this,
                     run with 'noblock'.
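
              For example, to work around slow fiemap behaviour while
              also trying the extent-optimizing dedupe path described
              above (a sketch; the path is illustrative):

                     duperemove -dr --dedupe-options=nofiemap,noblock /foo
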
       --help Prints help text.

       --lookup-extents=[yes|no]
              Defaults to no. Allows duperemove to skip checksumming some
              blocks by checking their extent state.

       -x     Don't cross filesystem boundaries. This is the default
              behavior since duperemove v0.11; the option is kept for
              backwards compatibility.

       --read-hashes=hashfile
              This option is primarily for testing. See the --hashfile
              option if you want to use hashfiles.

              Read hashes from a hashfile. A file list is not required
              with this option. Dedupe can be done if duperemove is run
              from the same base directory as is stored in the hash file
              (basically duperemove has to be able to find the files).

       --write-hashes=hashfile
              This option is primarily for testing. See the --hashfile
              option if you want to use hashfiles.

              Write hashes to a hashfile. These can be read in at a later
              date and deduped from.
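
              A sketch of a two-step test run (the paths are illustrative
              and assume both commands run from the same base directory):

                     duperemove -r --write-hashes=test.hash /foo
                     duperemove -d --read-hashes=test.hash
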
       --debug
              Print debug messages, forces -v if selected.

       --hash-threads=N
              Deprecated, see --io-threads above.

       --hash=alg
              You can choose between murmur3 and xxhash. The default is
              murmur3 as it is very fast and generates 128 bit digests,
              giving a very small chance of collision. Xxhash may be
              faster but generates only 64 bit digests. Both hashes are
              fast enough that the default should work well for the
              overwhelming majority of users.
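
              For example, to select xxhash instead of the default (the
              path is illustrative):

                     duperemove -dr --hash=xxhash /foo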

EXAMPLES
   Simple Usage
       Dedupe the files in directory /foo, recurse into all
       subdirectories. You only want to use this for small data sets.

              duperemove -dr /foo

       Use duperemove with fdupes to dedupe identical files below
       directory foo.

              fdupes -r /foo | duperemove --fdupes

   Using Hashfiles
       Duperemove can optionally store the hashes it calculates in a
       hashfile. Hashfiles have two primary advantages - memory usage and
       re-usability. When using a hashfile, duperemove will stream
       computed hashes to it, instead of main memory.

       If duperemove is run with an existing hashfile, it will only scan
       those files which have changed since the last time the hashfile
       was updated. The files argument controls which directories
       duperemove will scan for newly added files. In the simplest usage,
       you rerun duperemove with the same parameters and it will only
       scan changed or newly added files - see the first example below.

       Dedupe the files in directory foo, storing hashes in foo.hash. We
       can run this command multiple times and duperemove will only
       checksum and dedupe changed or newly added files.

              duperemove -dr --hashfile=foo.hash foo/

       Don't scan for new files, only update changed or deleted files,
       then dedupe.

              duperemove -dr --hashfile=foo.hash

       Add directory bar to our hashfile and discover any files that were
       recently added to foo.

              duperemove -dr --hashfile=foo.hash foo/ bar/

       List the files tracked by foo.hash.

              duperemove -L --hashfile=foo.hash

FAQ
   Is there an upper limit to the amount of data duperemove can process?
       Duperemove v0.11 is fast at reading and cataloging data. Dedupe
       runs will be memory limited unless the '--hashfile' option is
       used. '--hashfile' allows duperemove to temporarily store
       duplicated hashes to disk, thus removing the large memory overhead
       and allowing for a far larger amount of data to be scanned and
       deduped. Realistically, though, you will be limited by the speed
       of your disks and cpu. In those situations where resources are
       limited you may have success by breaking up the input data set
       into smaller pieces.

       When using a hashfile, duperemove will only store duplicate hashes
       in memory. During normal operation, the hash tree will make up the
       largest portion of duperemove memory usage. As of Duperemove v0.11
       hash entries are 88 bytes in size. If you know the number of
       duplicate blocks in your data set you can get a rough
       approximation of memory usage by multiplying that count by the
       hash entry size.
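
       As an illustrative calculation, a data set with one million
       duplicate blocks would need roughly 1000000 * 88 = 88000000 bytes,
       or about 84MB, for the in-memory hash tree.
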
       Actual performance numbers are dependent on hardware - up to date
       testing information is kept on the duperemove wiki (see below for
       the link).

   How large of a hashfile will duperemove create?
       Hashfiles are essentially sqlite3 database files with several
       tables, the largest of which are the files and hashes tables. Each
       hashes table entry is under 90 bytes though that may grow as
       features are added. The size of a files table entry depends on the
       file path but a good estimate is around 270 bytes per file.

       If you know the total number of blocks and files in your data set
       then you can calculate the hashfile size as:

              Hashfile Size = Num Hashes X 90 + Num Files X 270

       Using a real world example of 1TB (8388608 128K blocks) of data
       over 1000 files:

              8388608 * 90 + 270 * 1000 = 755244720

       or about 720MB for 1TB spread over 1000 files.

   Is it safe to interrupt the program (Ctrl-C)?
       Yes. Duperemove uses a transactional database engine and organizes
       db changes to take advantage of those features. The result is that
       you should be able to ctrl-c the program at any point and re-run
       without experiencing corruption of your hashfile.

   How can I find out my space savings after a dedupe?
       Duperemove will print out an estimate of the saved space after a
       dedupe operation for you.

       You can get a more accurate picture by running 'btrfs fi df'
       before and after each duperemove run.
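
       For example (the mount point is illustrative):

              btrfs fi df /mnt/data
              duperemove -dr --hashfile=foo.hash /mnt/data
              btrfs fi df /mnt/data
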
       Be careful about using the 'df' tool on btrfs - it is common for
       space reporting to be 'behind' while delayed updates get
       processed, so an immediate df after deduping might not show any
       savings.

   Why is the total deduped data reported as an estimate?
       At the moment duperemove can detect that some underlying extents
       are shared with other files, but it can not resolve which files
       those extents are shared with.

       Imagine duperemove is examining a series of files and it notes a
       shared data region in one of them. That data could be shared with
       a file outside of the series. Since duperemove can't resolve that
       information, it will count the shared data against our dedupe
       operation, while in reality the kernel might deduplicate it
       further for us.

   Why are my files showing dedupe but my disk space is not shrinking?
       This is a little complicated, but it comes down to a feature in
       Btrfs called _bookending_. Wikipedia's Btrfs article covers
       extents in more detail: http://en.wikipedia.org/wiki/Btrfs#Extents

       Essentially though, the underlying representation of an extent in
       Btrfs can not be split (with small exceptions). So sometimes we
       can end up in a situation where a file extent gets partially
       deduped (and the extents marked as shared) but the underlying
       extent item is not freed or truncated.

   Is duperemove safe for my data?
       Yes. To be specific, duperemove does not deduplicate the data
       itself. It simply finds candidates for dedupe and submits them to
       the Linux kernel extent-same ioctl. In order to ensure data
       integrity, the kernel locks out other access to the file and does
       a byte-by-byte compare before proceeding with the dedupe.

   What is the cost of deduplication?
       Deduplication will lead to increased fragmentation. The blocksize
       chosen can have an effect on this. Larger blocksizes will fragment
       less but may not save you as much space. Conversely, smaller block
       sizes may save more space at the cost of increased fragmentation.

NOTES
       Deduplication is currently only supported by the btrfs and xfs
       filesystems.

       The Duperemove project page can be found at
       https://github.com/markfasheh/duperemove

       There is also a wiki at https://github.com/markfasheh/duperemove/wiki

SEE ALSO
       hashstats(8), filesystems(5), btrfs(8), xfs(8), fdupes(1)

Version 0.11                    September 2016                 duperemove(8)