duperemove(8)               System Manager's Manual              duperemove(8)


NAME
       duperemove - Find duplicate extents and print them to stdout

SYNOPSIS
       duperemove [options] files...

DESCRIPTION
       duperemove is a simple tool for finding duplicated extents and
       submitting them for deduplication. When given a list of files it will
       hash their contents on a block-by-block basis and compare those hashes
       to each other, finding and categorizing extents that match each other.
       When given the -d option, duperemove will submit those extents for
       deduplication using the Linux kernel extent-same ioctl.

       duperemove can store the hashes it computes in a hashfile. If given an
       existing hashfile, duperemove will only compute hashes for those files
       which have changed since the last run. Thus you can run duperemove
       repeatedly on your data as it changes, without having to re-checksum
       unchanged data. For more on hashfiles see the --hashfile option below
       as well as the Examples section.

       duperemove can also take input from the fdupes program; see the
       --fdupes option below.

GENERAL
       Duperemove has two major modes of operation, one of which is a subset
       of the other.

   Readonly / Non-deduplicating Mode
       When run without -d (the default), duperemove will print out one or
       more tables of matching extents it has determined would be ideal
       candidates for deduplication. As a result, readonly mode is useful for
       seeing what duperemove might do when run with -d. The output could
       also be used by some other software to submit the extents for
       deduplication at a later time.

       It is important to note that this mode will not print out all
       instances of matching extents, just those it would consider for
       deduplication.

       Generally, duperemove does not concern itself with the underlying
       representation of the extents it processes. Some of them could be
       compressed, undergoing I/O, or even have already been deduplicated. In
       dedupe mode, the kernel handles those details and therefore we try not
       to replicate that work.

   Deduping Mode
       This functions similarly to readonly mode with the exception that the
       duplicated extents found in our "read, hash, and compare" step will
       actually be submitted for deduplication. An estimate of the total data
       deduplicated will be printed after the operation is complete. This
       estimate is calculated by comparing the total amount of shared bytes
       in each file before and after the dedupe.

OPTIONS
       files can refer to a list of regular files and directories or be a
       hyphen (-) to read them from standard input. If a directory is
       specified, all regular files within it will also be scanned.
       Duperemove can also be told to recursively scan directories with the
       '-r' switch.

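       For example, a file list can be generated by another program and fed
       to duperemove on standard input. The find expression, path, and
       hashfile name below are hypothetical; this is only a sketch:

              find /mnt/data -name '*.vmdk' | \
                      duperemove -d --hashfile=vm.hash -
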
       -r     Enable recursive dir traversal.

       -d     De-dupe the results - only works on btrfs and xfs
              (experimental).

       -A     Opens files readonly when deduping. Primarily for use by
              privileged users on readonly snapshots.

       -h     Print numbers in human-readable format.

       -q     Quiet mode. Duperemove will only print errors and a short
              summary of any dedupe.

       --hashfile=hashfile
              Use a file for storage of hashes instead of memory. This option
              drastically reduces the memory footprint of duperemove and is
              recommended when your data set is more than a few files large.
              Hashfiles are also reusable, allowing you to further reduce the
              amount of hashing done on subsequent dedupe runs.

              If hashfile does not exist it will be created. If it exists,
              duperemove will check the file paths stored inside of it for
              changes. Files which have changed will be rescanned and their
              updated hashes will be written to the hashfile. Deleted files
              will be removed from the hashfile.

              New files are only added to the hashfile if they are
              discoverable via the files argument. For that reason you
              probably want to provide the same files list and -r arguments
              on each run of duperemove. The file discovery algorithm is
              efficient and will only visit each file once, even if it is
              already in the hashfile.

              Adding a new path to a hashfile is as simple as adding it to
              the files argument.

              When deduping from a hashfile, duperemove will avoid deduping
              files which have not changed since the last dedupe.

       -L     Print all files in the hashfile and exit. Requires the
              --hashfile option. Will print additional information about each
              file when run with -v.

       -R [file]
              Remove file from the db and exit. Can be specified multiple
              times. Duperemove will read the list from standard input if a
              hyphen (-) is provided. Requires the --hashfile option.

              Note: If you are piping filenames from another duperemove
              instance it is advisable to do so into a temporary file first
              as running duperemove simultaneously on the same hashfile may
              corrupt that hashfile.

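              For example, stale entries could be dropped either by naming
              them directly or by piping a list on standard input (the paths
              and hashfile name here are hypothetical):

                     duperemove -R /old/a.img -R /old/b.img --hashfile=foo.hash
                     duperemove -R - --hashfile=foo.hash < remove-list.txt
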
       --fdupes
              Run in fdupes mode. With this option you can pipe the output of
              fdupes to duperemove to dedupe any duplicate files found. When
              receiving a file list in this manner, duperemove will skip the
              hashing phase.

       -v     Be verbose.

       --skip-zeroes
              Read data blocks and skip any zeroed blocks. This is useful for
              speeding up duperemove, but it can prevent deduplication of
              zeroed files.

       -b size
              Use the specified block size. Raising the block size will
              consume less memory but may miss some duplicate blocks.
              Conversely, lowering the block size consumes more memory and
              may find more duplicate blocks. The default block size of 128K
              was chosen with these parameters in mind.

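              For example, a run with a smaller 64K block size might look
              like the following (the path and hashfile name are
              hypothetical, and the size is given in bytes to avoid assuming
              any particular suffix syntax):

                     duperemove -dr -b 65536 --hashfile=foo.hash /mnt/data
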
       --io-threads=N
              Use N threads for I/O. This is used by the file hashing and
              dedupe stages. Default is automatically detected based on the
              number of host cpus.

       --cpu-threads=N
              Use N threads for CPU bound tasks. This is used by the
              duplicate extent finding stage. Default is automatically
              detected based on the number of host cpus.

              Note: Hyperthreading can adversely affect performance of the
              extent finding stage. If duperemove detects an Intel CPU with
              hyperthreading it will use half the number of cores reported by
              the system for cpu bound tasks.

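              For example, the thread counts could be pinned explicitly on a
              shared machine (the counts, path and hashfile name here are
              hypothetical):

                     duperemove -dr --io-threads=4 --cpu-threads=4 \
                             --hashfile=foo.hash /mnt/data
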
       --dedupe-options=options
              Comma separated list of options which alter how we dedupe.
              Prepend 'no' to an option in order to turn it off. An example
              invocation follows the list below.

              [no]same
                     Defaults to off. Allow dedupe of extents within the same
                     file.

              [no]fiemap
                     Defaults to on. Duperemove uses the fiemap ioctl during
                     the dedupe stage to optimize out already deduped extents
                     as well as to provide an estimate of the space saved
                     after dedupe operations are complete.

                     Unfortunately, some versions of Btrfs exhibit extremely
                     poor performance in fiemap as the number of references
                     on a file extent goes up. If you are experiencing the
                     dedupe phase slowing down or 'locking up', this option
                     may give you a significant amount of performance back.

                     Note: This does not turn off all usage of fiemap. To
                     disable fiemap during the file scan stage, you will also
                     want to use the --lookup-extents=no option.

              [no]block
                     Defaults to off. Dedupe by block - don't optimize our
                     data into extents before dedupe. Generally this is
                     undesirable as it will greatly increase the total number
                     of dedupe requests. There is also a larger potential for
                     file fragmentation.

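              For example, turning off fiemap during the dedupe stage on a
              filesystem where it performs poorly might look like this (the
              path and hashfile name are hypothetical):

                     duperemove -dr --dedupe-options=nofiemap \
                             --hashfile=foo.hash /mnt/data
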
       --help Prints help text.

       --lookup-extents=[yes|no]
              Defaults to no. Allows duperemove to skip checksumming some
              blocks by checking their extent state.

       -x     Don't cross filesystem boundaries; this is the default behavior
              since duperemove v0.11. The option is kept for backwards
              compatibility.

       --read-hashes=hashfile
              This option is primarily for testing. See the --hashfile option
              if you want to use hashfiles.

              Read hashes from a hashfile. A file list is not required with
              this option. Dedupe can be done if duperemove is run from the
              same base directory as is stored in the hash file (basically
              duperemove has to be able to find the files).

       --write-hashes=hashfile
              This option is primarily for testing. See the --hashfile option
              if you want to use hashfiles.

              Write hashes to a hashfile. These can be read in at a later
              date and deduped from.

       --debug
              Print debug messages; forces -v if selected.

       --hash-threads=N
              Deprecated, see --io-threads above.

       --hash=alg
              You can choose between murmur3 and xxhash. The default is
              murmur3 as it is very fast and can generate 128 bit digests for
              a very small chance of collision. Xxhash may be faster but
              generates only 64 bit digests. Both hashes are fast enough that
              the default should work well for the overwhelming majority of
              users.

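              For example, a run using the xxhash algorithm instead of the
              default murmur3 (the path here is hypothetical):

                     duperemove -dr --hash=xxhash /mnt/data
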
EXAMPLES
   Simple Usage
       Dedupe the files in directory /foo, recursing into all subdirectories.
       You only want to use this for small data sets.

              duperemove -dr /foo

       Use duperemove with fdupes to dedupe identical files below directory
       foo.

              fdupes -r /foo | duperemove --fdupes

   Using Hashfiles
       Duperemove can optionally store the hashes it calculates in a
       hashfile. Hashfiles have two primary advantages - memory usage and
       re-usability. When using a hashfile, duperemove will stream computed
       hashes to it, instead of main memory.

       If duperemove is run with an existing hashfile, it will only scan
       those files which have changed since the last time the hashfile was
       updated. The files argument controls which directories duperemove will
       scan for newly added files. In the simplest usage, you rerun
       duperemove with the same parameters and it will only scan changed or
       newly added files - see the first example below.

       Dedupe the files in directory foo, storing hashes in foo.hash. We can
       run this command multiple times and duperemove will only checksum and
       dedupe changed or newly added files.

              duperemove -dr --hashfile=foo.hash foo/

       Don't scan for new files, only update changed or deleted files, then
       dedupe.

              duperemove -dr --hashfile=foo.hash

       Add directory bar to our hashfile and discover any files that were
       recently added to foo.

              duperemove -dr --hashfile=foo.hash foo/ bar/

       List the files tracked by foo.hash.

              duperemove -L --hashfile=foo.hash

FAQ
   Is there an upper limit to the amount of data duperemove can process?
       Duperemove v0.11 is fast at reading and cataloging data. Dedupe runs
       will be memory limited unless the '--hashfile' option is used.
       '--hashfile' allows duperemove to temporarily store duplicated hashes
       to disk, thus removing the large memory overhead and allowing for a
       far larger amount of data to be scanned and deduped. Realistically
       though you will be limited by the speed of your disks and cpu. In
       those situations where resources are limited you may have success by
       breaking up the input data set into smaller pieces.

       When using a hashfile, duperemove will only store duplicate hashes in
       memory. During normal operation the hash tree will then make up the
       largest portion of duperemove memory usage. As of Duperemove v0.11
       hash entries are 88 bytes in size. If you know the number of duplicate
       blocks in your data set you can get a rough approximation of memory
       usage by multiplying with the hash entry size.

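       For example, a data set with roughly one million (1048576) duplicated
       blocks - a purely illustrative figure - would need in the neighborhood
       of:

              1048576 * 88 = 92274688, or about 88MB of hash entries.
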
       Actual performance numbers are dependent on hardware - up to date
       testing information is kept on the duperemove wiki (see below for the
       link).

   How large of a hashfile will duperemove create?
       Hashfiles are essentially sqlite3 database files with several tables,
       the largest of which are the files and hashes tables. Each hashes
       table entry is under 90 bytes, though that may grow as features are
       added. The size of a files table entry depends on the file path, but a
       good estimate is around 270 bytes per file.

       If you know the total number of blocks and files in your data set then
       you can calculate the hashfile size as:

              Hashfile Size = Num Hashes X 90 + Num Files X 270

       Using a real world example of 1TB (8388608 128K blocks) of data over
       1000 files:

              8388608 * 90 + 270 * 1000 = 755244720

       or about 720MB for 1TB spread over 1000 files.

   Is it safe to interrupt the program (Ctrl-C)?
       Yes. Duperemove uses a transactional database engine and organizes db
       changes to take advantage of those features. The result is that you
       should be able to ctrl-c the program at any point and re-run without
       experiencing corruption of your hashfile.

   How can I find out my space savings after a dedupe?
       Duperemove will print out an estimate of the saved space after a
       dedupe operation for you.

       You can get a more accurate picture by running 'btrfs fi df' before
       and after each duperemove run.

       Be careful about using the 'df' tool on btrfs - it is common for space
       reporting to be 'behind' while delayed updates get processed, so an
       immediate df after deduping might not show any savings.

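       For example, a before-and-after comparison might look like the
       following (the mount point and hashfile name are hypothetical, and the
       sync is only there to give delayed space accounting a chance to
       settle):

              btrfs fi df /mnt/data
              duperemove -dr --hashfile=foo.hash /mnt/data
              sync
              btrfs fi df /mnt/data
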
   Why is the total deduped data reported as an estimate?
       At the moment duperemove can detect that some underlying extents are
       shared with other files, but it can not resolve which files those
       extents are shared with.

       Imagine duperemove is examining a series of files and it notes a
       shared data region in one of them. That data could be shared with a
       file outside of the series. Since duperemove can't resolve that
       information, it will account the shared data against our dedupe
       operation, while in reality the kernel might deduplicate it further
       for us.

   Why are my files showing dedupe but my disk space is not shrinking?
       This is a little complicated, but it comes down to a feature in Btrfs
       called _bookending_. The Wikipedia article on Btrfs explains this in
       detail: http://en.wikipedia.org/wiki/Btrfs#Extents.

       Essentially though, the underlying representation of an extent in
       Btrfs can not be split (with small exceptions). So sometimes we can
       end up in a situation where a file extent gets partially deduped (and
       the extents marked as shared) but the underlying extent item is not
       freed or truncated.

   Is duperemove safe for my data?
       Yes. To be specific, duperemove does not deduplicate the data itself.
       It simply finds candidates for dedupe and submits them to the Linux
       kernel extent-same ioctl. In order to ensure data integrity, the
       kernel locks out other access to the file and does a byte-by-byte
       compare before proceeding with the dedupe.

   What is the cost of deduplication?
       Deduplication will lead to increased fragmentation. The block size
       chosen can have an effect on this. Larger block sizes will fragment
       less but may not save you as much space. Conversely, smaller block
       sizes may save more space at the cost of increased fragmentation.

NOTES
       Deduplication is currently only supported by the btrfs and xfs
       filesystems.

       The Duperemove project page can be found at
       https://github.com/markfasheh/duperemove

       There is also a wiki at https://github.com/markfasheh/duperemove/wiki

SEE ALSO
       hashstats(8) filesystems(5) btrfs(8) xfs(8) fdupes(1)



Version 0.11                     September 2016                  duperemove(8)