duperemove(8)              System Manager's Manual             duperemove(8)

NAME
       duperemove - Find duplicate extents and print them to stdout

SYNOPSIS
       duperemove [options] files...

DESCRIPTION
       duperemove is a simple tool for finding duplicated extents and
       submitting them for deduplication. When given a list of files it
       will hash their contents on a block by block basis and compare
       those hashes to each other, finding and categorizing blocks that
       match each other. When given the -d option, duperemove will submit
       those extents for deduplication using the Linux kernel extent-same
       ioctl.

       duperemove can store the hashes it computes in a hashfile. If given
       an existing hashfile, duperemove will only compute hashes for those
       files which have changed since the last run. Thus you can run
       duperemove repeatedly on your data as it changes, without having to
       re-checksum unchanged data. For more on hashfiles see the
       --hashfile option below as well as the Examples section.

       duperemove can also take input from the fdupes program; see the
       --fdupes option below.

GENERAL
       Duperemove has two major modes of operation, one of which is a
       subset of the other.

   Readonly / Non-deduplicating Mode
       When run without -d (the default) duperemove will print out one or
       more tables of matching extents it has determined would be ideal
       candidates for deduplication. As a result, readonly mode is useful
       for seeing what duperemove might do when run with -d. The output
       could also be used by some other software to submit the extents
       for deduplication at a later time.

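       For example, a readonly run that recursively scans a directory and
       only reports candidate extents, without deduping anything (the
       path is illustrative):

              duperemove -r /foo
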
       Generally, duperemove does not concern itself with the underlying
       representation of the extents it processes. Some of them could be
       compressed, undergoing I/O, or even have already been deduplicated.
       In dedupe mode, the kernel handles those details and therefore we
       try not to replicate that work.

   Deduping Mode
       This functions similarly to readonly mode with the exception that
       the duplicated extents found in our "read, hash, and compare" step
       will actually be submitted for deduplication. An estimate of the
       total data deduplicated will be printed after the operation is
       complete. This estimate is calculated by comparing the total amount
       of shared bytes in each file before and after the dedupe.

OPTIONS
       files can refer to a list of regular files and directories or be a
       hyphen (-) to read them from standard input. If a directory is
       specified, all regular files within it will also be scanned.
       Duperemove can also be told to recursively scan directories with
       the '-r' switch.

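       For example, a file list can be piped in on standard input (a
       sketch; the find invocation and path are illustrative):

              find /foo -type f | duperemove -
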
       -r     Enable recursive dir traversal.

       -d     De-dupe the results - only works on btrfs and xfs
              (experimental).

       -A     Opens files readonly when deduping. Primarily for use by
              privileged users on readonly snapshots.

       -h     Print numbers in human-readable format.

       --hashfile=hashfile
              Use a file for storage of hashes instead of memory. This
              option drastically reduces the memory footprint of
              duperemove and is recommended when your data set is more
              than a few files large. Hashfiles are also reusable,
              allowing you to further reduce the amount of hashing done
              on subsequent dedupe runs.

              If hashfile does not exist it will be created. If it
              exists, duperemove will check the file paths stored inside
              of it for changes. Files which have changed will be
              rescanned and their updated hashes will be written to the
              hashfile. Deleted files will be removed from the hashfile.

              New files are only added to the hashfile if they are
              discoverable via the files argument. For that reason you
              probably want to provide the same files list and -r
              arguments on each run of duperemove. The file discovery
              algorithm is efficient and will only visit each file once,
              even if it is already in the hashfile.

              Adding a new path to a hashfile is as simple as adding it
              to the files argument.

              When deduping from a hashfile, duperemove will avoid
              deduping files which have not changed since the last
              dedupe.

       -L     Print all files in the hashfile and exit. Requires the
              --hashfile option. Will print additional information about
              each file when run with -v.

       -R [file]
              Remove file from the db and exit. Can be specified multiple
              times. Duperemove will read the list from standard input if
              a hyphen (-) is provided. Requires the --hashfile option.

              Note: If you are piping filenames from another duperemove
              instance it is advisable to do so into a temporary file
              first, as running duperemove simultaneously on the same
              hashfile may corrupt that hashfile.

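              For example, to drop a single stale path from an existing
              hashfile (both names are illustrative):

                     duperemove --hashfile=foo.hash -R /foo/old-file.img
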
       --fdupes
              Run in fdupes mode. With this option you can pipe the
              output of fdupes to duperemove to dedupe any duplicate
              files found. When receiving a file list in this manner,
              duperemove will skip the hashing phase.

       -v     Be verbose.

       --skip-zeroes
              Read data blocks and skip any zeroed blocks. This can speed
              up duperemove, but may prevent deduplication of zeroed
              files.

       -b size
              Use the specified block size. Raising the block size will
              consume less memory but may miss some duplicate blocks.
              Conversely, lowering the blocksize consumes more memory and
              may find more duplicate blocks. The default blocksize of
              128K was chosen with these parameters in mind.
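
              For example, to run with a smaller 64K block size (a
              sketch; the 'k' size suffix is assumed to be accepted and
              the path is illustrative):

                     duperemove -dr -b 64k /foo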

       --io-threads=N
              Use N threads for I/O. This is used by the file hashing and
              dedupe stages. Default is automatically detected based on
              number of host cpus.

       --cpu-threads=N
              Use N threads for CPU bound tasks. This is used by the
              duplicate extent finding stage. Default is automatically
              detected based on number of host cpus.

              Note: Hyperthreading can adversely affect performance of
              the extent finding stage. If duperemove detects an Intel
              CPU with hyperthreading it will use half the number of
              cores reported by the system for cpu bound tasks.

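              For example, to cap both the I/O and CPU thread pools on a
              shared machine (the counts and path are illustrative):

                     duperemove -dr --io-threads=4 --cpu-threads=2 /foo
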
       --dedupe-options=options
              Comma separated list of options which alter how we dedupe.
              Prepend 'no' to an option in order to turn it off.

              [no]same
                     Defaults to off. Allow dedupe of extents within the
                     same file.

              [no]fiemap
                     Defaults to on. Duperemove uses the fiemap ioctl
                     during the dedupe stage to optimize out already
                     deduped extents as well as to provide an estimate of
                     the space saved after dedupe operations are
                     complete.

                     Unfortunately, some versions of Btrfs exhibit
                     extremely poor performance in fiemap as the number
                     of references on a file extent goes up. If you are
                     experiencing the dedupe phase slowing down or
                     'locking up', this option may give you a significant
                     amount of performance back.

                     Note: This does not turn off all usage of fiemap. To
                     disable fiemap during the file scan stage, you will
                     also want to use the --lookup-extents=no option.

              [no]block
                     Defaults to on. Duperemove submits duplicate blocks
                     directly to the dedupe engine.

                     Duperemove can optionally optimize the duplicate
                     block lists into larger extents prior to dedupe
                     submission. The search algorithm used for this,
                     however, has a very high memory and cpu overhead,
                     but may reduce the number of extent references
                     created during dedupe. If you'd like to try this,
                     run with 'noblock'.
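
              For example, to work around slow fiemap behaviour while
              also trying the extent-optimizing dedupe path described
              above (a sketch; the path is illustrative):

                     duperemove -dr --dedupe-options=nofiemap,noblock /foo
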
       --help Prints help text.

       --lookup-extents=[yes|no]
              Defaults to no. Allows duperemove to skip checksumming some
              blocks by checking their extent state.

       -x     Don't cross filesystem boundaries. This is the default
              behavior since duperemove v0.11; the option is kept for
              backwards compatibility.

       --read-hashes=hashfile
              This option is primarily for testing. See the --hashfile
              option if you want to use hashfiles.

              Read hashes from a hashfile. A file list is not required
              with this option. Dedupe can be done if duperemove is run
              from the same base directory as is stored in the hash file
              (basically duperemove has to be able to find the files).

       --write-hashes=hashfile
              This option is primarily for testing. See the --hashfile
              option if you want to use hashfiles.

              Write hashes to a hashfile. These can be read in at a later
              date and deduped from.
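
              A sketch of a two-step test run (the paths are illustrative
              and assume both commands run from the same base directory):

                     duperemove -r --write-hashes=test.hash /foo
                     duperemove -d --read-hashes=test.hash
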
       --debug
              Print debug messages, forces -v if selected.

       --hash-threads=N
              Deprecated, see --io-threads above.

       --hash=alg
              You can choose between murmur3 and xxhash. The default is
              murmur3 as it is very fast and generates 128 bit digests,
              giving a very small chance of collision. Xxhash may be
              faster but generates only 64 bit digests. Both hashes are
              fast enough that the default should work well for the
              overwhelming majority of users.
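
              For example, to select xxhash instead of the default (the
              path is illustrative):

                     duperemove -dr --hash=xxhash /foo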

EXAMPLES
   Simple Usage
       Dedupe the files in directory /foo, recurse into all
       subdirectories. You only want to use this for small data sets.

              duperemove -dr /foo

       Use duperemove with fdupes to dedupe identical files below
       directory foo.

              fdupes -r /foo | duperemove --fdupes

   Using Hashfiles
       Duperemove can optionally store the hashes it calculates in a
       hashfile. Hashfiles have two primary advantages - memory usage and
       re-usability. When using a hashfile, duperemove will stream
       computed hashes to it, instead of main memory.

       If duperemove is run with an existing hashfile, it will only scan
       those files which have changed since the last time the hashfile
       was updated. The files argument controls which directories
       duperemove will scan for newly added files. In the simplest usage,
       you rerun duperemove with the same parameters and it will only
       scan changed or newly added files - see the first example below.

       Dedupe the files in directory foo, storing hashes in foo.hash. We
       can run this command multiple times and duperemove will only
       checksum and dedupe changed or newly added files.

              duperemove -dr --hashfile=foo.hash foo/

       Don't scan for new files, only update changed or deleted files,
       then dedupe.

              duperemove -dr --hashfile=foo.hash

       Add directory bar to our hashfile and discover any files that were
       recently added to foo.

              duperemove -dr --hashfile=foo.hash foo/ bar/

       List the files tracked by foo.hash.

              duperemove -L --hashfile=foo.hash

FAQ
   Is there an upper limit to the amount of data duperemove can process?
       Duperemove v0.11 is fast at reading and cataloging data. Dedupe
       runs will be memory limited unless the '--hashfile' option is
       used. '--hashfile' allows duperemove to temporarily store
       duplicated hashes to disk, thus removing the large memory overhead
       and allowing for a far larger amount of data to be scanned and
       deduped. Realistically, though, you will be limited by the speed
       of your disks and cpu. In those situations where resources are
       limited you may have success by breaking up the input data set
       into smaller pieces.

       When using a hashfile, duperemove will only store duplicate hashes
       in memory. During normal operation, the hash tree will make up the
       largest portion of duperemove memory usage. As of Duperemove v0.11
       hash entries are 88 bytes in size. If you know the number of
       duplicate blocks in your data set you can get a rough
       approximation of memory usage by multiplying that count by the
       hash entry size.
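
       As an illustrative calculation, a data set with one million
       duplicate blocks would need roughly 1000000 * 88 = 88000000 bytes,
       or about 84MB, for the in-memory hash tree.
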
       Actual performance numbers are dependent on hardware - up to date
       testing information is kept on the duperemove wiki (see below for
       the link).

   How large of a hashfile will duperemove create?
       Hashfiles are essentially sqlite3 database files with several
       tables, the largest of which are the files and hashes tables. Each
       hashes table entry is under 90 bytes though that may grow as
       features are added. The size of a files table entry depends on the
       file path but a good estimate is around 270 bytes per file.

       If you know the total number of blocks and files in your data set
       then you can calculate the hashfile size as:

              Hashfile Size = Num Hashes X 90 + Num Files X 270

       Using a real world example of 1TB (8388608 128K blocks) of data
       over 1000 files:

              8388608 * 90 + 270 * 1000 = 755244720

       or about 720MB for 1TB spread over 1000 files.

   Is it safe to interrupt the program (Ctrl-C)?
       Yes. Duperemove uses a transactional database engine and organizes
       db changes to take advantage of those features. The result is that
       you should be able to ctrl-c the program at any point and re-run
       without experiencing corruption of your hashfile.

   How can I find out my space savings after a dedupe?
       Duperemove will print out an estimate of the saved space after a
       dedupe operation for you.

       You can get a more accurate picture by running 'btrfs fi df'
       before and after each duperemove run.
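
       For example (the mount point is illustrative):

              btrfs fi df /mnt/data
              duperemove -dr --hashfile=foo.hash /mnt/data
              btrfs fi df /mnt/data
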
       Be careful about using the 'df' tool on btrfs - it is common for
       space reporting to be 'behind' while delayed updates get
       processed, so an immediate df after deduping might not show any
       savings.

   Why is the total deduped data reported as an estimate?
       At the moment duperemove can detect that some underlying extents
       are shared with other files, but it can not resolve which files
       those extents are shared with.

       Imagine duperemove is examining a series of files and it notes a
       shared data region in one of them. That data could be shared with
       a file outside of the series. Since duperemove can't resolve that
       information, it will count the shared data against our dedupe
       operation, while in reality the kernel might deduplicate it
       further for us.

   Why are my files showing dedupe but my disk space is not shrinking?
       This is a little complicated, but it comes down to a feature in
       Btrfs called _bookending_. Wikipedia's Btrfs article covers
       extents in more detail: http://en.wikipedia.org/wiki/Btrfs#Extents

       Essentially though, the underlying representation of an extent in
       Btrfs can not be split (with small exceptions). So sometimes we
       can end up in a situation where a file extent gets partially
       deduped (and the extents marked as shared) but the underlying
       extent item is not freed or truncated.

   Is duperemove safe for my data?
       Yes. To be specific, duperemove does not deduplicate the data
       itself. It simply finds candidates for dedupe and submits them to
       the Linux kernel extent-same ioctl. In order to ensure data
       integrity, the kernel locks out other access to the file and does
       a byte-by-byte compare before proceeding with the dedupe.

   What is the cost of deduplication?
       Deduplication will lead to increased fragmentation. The blocksize
       chosen can have an effect on this. Larger blocksizes will fragment
       less but may not save you as much space. Conversely, smaller block
       sizes may save more space at the cost of increased fragmentation.

NOTES
       Deduplication is currently only supported by the btrfs and xfs
       filesystems.

       The Duperemove project page can be found at
       https://github.com/markfasheh/duperemove

       There is also a wiki at https://github.com/markfasheh/duperemove/wiki

SEE ALSO
       hashstats(8), filesystems(5), btrfs(8), xfs(8), fdupes(1)

Version 0.11                    September 2016                 duperemove(8)