duperemove(8)               System Manager's Manual              duperemove(8)


NAME
       duperemove - Find duplicate extents and print them to stdout

SYNOPSIS
       duperemove [options] files...

DESCRIPTION
       duperemove is a simple tool for finding duplicated extents and submit‐
       ting them for deduplication. When given a list of files it will hash
       the contents of their extents and compare those hashes to each other,
       finding and categorizing extents that match each other. When given the
       -d option, duperemove will submit those extents for deduplication using
       the Linux kernel extent-same ioctl.
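
       For readers curious what an extent-same request looks like from user
       space, the sketch below issues a single FIDEDUPERANGE ioctl (the
       interface defined in linux/fs.h) for one source/destination pair. It
       is a minimal illustration, not duperemove's own code; the paths,
       offset and length are made-up examples.

           /* Minimal sketch: ask the kernel to share one 128K range     */
           /* between two files.  Simplified example, not duperemove     */
           /* source; the paths below are placeholders.                  */
           #include <fcntl.h>
           #include <stdio.h>
           #include <stdlib.h>
           #include <sys/ioctl.h>
           #include <linux/fs.h>

           int main(void)
           {
               int src = open("/mnt/data/a.img", O_RDONLY);
               int dst = open("/mnt/data/b.img", O_RDWR);
               if (src < 0 || dst < 0) {
                   perror("open");
                   return 1;
               }

               /* Request header plus one destination info record. */
               struct file_dedupe_range *range =
                   calloc(1, sizeof(*range) +
                             sizeof(struct file_dedupe_range_info));
               range->src_offset = 0;
               range->src_length = 128 * 1024;
               range->dest_count = 1;
               range->info[0].dest_fd = dst;
               range->info[0].dest_offset = 0;

               /* The kernel compares both ranges byte-by-byte and only
                * marks them shared if they are identical. */
               if (ioctl(src, FIDEDUPERANGE, range) < 0) {
                   perror("FIDEDUPERANGE");
                   return 1;
               }

               if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
                   printf("deduped %llu bytes\n", (unsigned long long)
                          range->info[0].bytes_deduped);
               else
                   printf("ranges differ or dedupe was skipped\n");

               free(range);
               return 0;
           }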

       duperemove can store the hashes it computes in a hashfile. If given an
       existing hashfile, duperemove will only compute hashes for those files
       which have changed since the last run. Thus you can run duperemove re‐
       peatedly on your data as it changes, without having to re-checksum un‐
       changed data. For more on hashfiles see the --hashfile option below as
       well as the Examples section.

       duperemove can also take input from the fdupes program, see the
       --fdupes option below.

GENERAL
       Duperemove has two major modes of operation, one of which is a subset
       of the other.

   Readonly / Non-deduplicating Mode
       When run without -d (the default) duperemove will print out one or more
       tables of matching extents it has determined would be ideal candidates
       for deduplication. As a result, readonly mode is useful for seeing what
       duperemove might do when run with -d.

       Generally, duperemove does not concern itself with the underlying rep‐
       resentation of the extents it processes. Some of them could be com‐
       pressed, undergoing I/O, or even have already been deduplicated. In
       dedupe mode, the kernel handles those details and therefore we try not
       to replicate that work.

   Deduping Mode
       This functions similarly to readonly mode with the exception that the
       duplicated extents found in our "read, hash, and compare" step will ac‐
       tually be submitted for deduplication. Extents that have already been
       deduped will be skipped. An estimate of the total data deduplicated
       will be printed after the operation is complete. This estimate is cal‐
       culated by comparing the total amount of shared bytes in each file be‐
       fore and after the dedupe.

OPTIONS
       files can refer to a list of regular files and directories or be a hy‐
       phen (-) to read them from standard input. If a directory is speci‐
       fied, all regular files within it will also be scanned. Duperemove can
       also be told to recursively scan directories with the '-r' switch.
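
       For example, a list of files produced by another tool can be fed to
       duperemove on standard input. The path, pattern and hashfile name
       below are only an illustration:

              find /mnt/data -type f -name '*.vmdk' | \
                      duperemove -d --hashfile=/tmp/data.hash -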

       -r     Enable recursive dir traversal.

       -d     De-dupe the results - only works on btrfs and xfs.

       -A     Opens files readonly when deduping. Primarily for use by privi‐
              leged users on readonly snapshots.

       -h     Print numbers in human-readable format.

       -q     Quiet mode. Duperemove will only print errors and a short sum‐
              mary of any dedupe.

       --hashfile=hashfile
              Use a file for storage of hashes instead of memory. This option
              drastically reduces the memory footprint of duperemove and is
              recommended when your data set is more than a few files large.
              Hashfiles are also reusable, allowing you to further reduce the
              amount of hashing done on subsequent dedupe runs.

              If hashfile does not exist it will be created. If it exists,
              duperemove will check the file paths stored inside of it for
              changes. Files which have changed will be rescanned and their
              updated hashes will be written to the hashfile. Deleted files
              will be removed from the hashfile.

              New files are only added to the hashfile if they are discover‐
              able via the files argument. For that reason you probably want
              to provide the same files list and -r arguments on each run of
              duperemove. The file discovery algorithm is efficient and will
              only visit each file once, even if it is already in the hash‐
              file.

              Adding a new path to a hashfile is as simple as adding it to the
              files argument.

              When deduping from a hashfile, duperemove will avoid deduping
              files which have not changed since the last dedupe.

       -L     Print all files in the hashfile and exit. Requires the --hash‐
              file option. Will print additional information about each file
              when run with -v.

       -R [file]
              Remove file from the db and exit. Can be specified multiple
              times. Duperemove will read the list from standard input if a
              hyphen (-) is provided. Requires the --hashfile option.

              Note: If you are piping filenames from another duperemove in‐
              stance it is advisable to do so into a temporary file first as
              running duperemove simultaneously on the same hashfile may cor‐
              rupt that hashfile.
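
              For example, to drop entries under a directory that no longer
              exists (the paths here are hypothetical), write the list to a
              temporary file first. This assumes the default one-path-per-line
              output of -L:

                     duperemove -L --hashfile=foo.hash | grep '^/mnt/gone/' > /tmp/stale.list
                     duperemove -R - --hashfile=foo.hash < /tmp/stale.list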

       --fdupes
              Run in fdupes mode. With this option you can pipe the output of
              fdupes to duperemove to dedupe any duplicate files found. When
              receiving a file list in this manner, duperemove will skip the
              hashing phase.

       -v     Be verbose.

       --skip-zeroes
              Read data blocks and skip any zeroed blocks. This can speed up
              duperemove, but it can prevent deduplication of zeroed files.

       -b size
              Use the specified block size for reading file extents. Defaults
              to 128K.

       --io-threads=N
              Use N threads for I/O. This is used by the file hashing and
              dedupe stages. Default is automatically detected based on the
              number of host cpus.

       --cpu-threads=N
              Use N threads for CPU bound tasks. This is used by the duplicate
              extent finding stage. Default is automatically detected based on
              the number of host cpus.

              Note: Hyperthreading can adversely affect performance of the ex‐
              tent finding stage. If duperemove detects an Intel CPU with hy‐
              perthreading it will use half the number of cores reported by
              the system for cpu bound tasks.
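
              For example, to cap both stages at four threads on a machine
              shared with other work (the hashfile name is illustrative):

                     duperemove -dr --io-threads=4 --cpu-threads=4 --hashfile=foo.hash foo/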

       --dedupe-options=options
              Comma-separated list of options which alter how we dedupe.
              Prepend 'no' to an option in order to turn it off. See the
              example at the end of this list.

              [no]partial
                     Duperemove can often find more dedupe by comparing por‐
                     tions of extents to each other. This can be a lengthy,
                     CPU intensive task so it is turned off by default.

                     The code behind this option is under active development
                     and as a result the semantics of the partial argument may
                     change.

              [no]same
                     Defaults to off. Allow dedupe of extents within the same
                     file.

              [no]fiemap
                     Defaults to on. Duperemove uses the fiemap ioctl during
                     the dedupe stage to optimize out already deduped extents
                     as well as to provide an estimate of the space saved af‐
                     ter dedupe operations are complete.

                     Unfortunately, some versions of Btrfs exhibit extremely
                     poor performance in fiemap as the number of references on
                     a file extent goes up. If you are experiencing the dedupe
                     phase slowing down or 'locking up' this option may give
                     you a significant amount of performance back.

                     Note: This does not turn off all usage of fiemap. To dis‐
                     able fiemap during the file scan stage, you will also
                     want to use the --lookup-extents=no option.

              [no]block
                     Deprecated.
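
              For example, to enable partial extent matching and turn fiemap
              off on a filesystem where fiemap is slow (the hashfile name is
              illustrative):

                     duperemove -dr --dedupe-options=partial,nofiemap --hashfile=foo.hash foo/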

       --help Prints help text.

       --lookup-extents=[yes|no]
              Defaults to yes. Allows duperemove to skip checksumming some
              blocks by checking their extent state.

       --read-hashes=hashfile
              This option is primarily for testing. See the --hashfile option
              if you want to use hashfiles.

              Read hashes from a hashfile. A file list is not required with
              this option. Dedupe can be done if duperemove is run from the
              same base directory as is stored in the hash file (basically
              duperemove has to be able to find the files).

       --write-hashes=hashfile
              This option is primarily for testing. See the --hashfile option
              if you want to use hashfiles.

              Write hashes to a hashfile. These can be read in at a later date
              and deduped from.

       --debug
              Print debug messages. Forces -v if selected.

       --hash-threads=N
              Deprecated, see --io-threads above.

       --hash=alg
              You can choose between murmur3 and xxhash. The default is mur‐
              mur3 as it is very fast and can generate 128 bit digests with a
              very small chance of collision. Xxhash may be faster but gener‐
              ates only 64 bit digests. Both hashes are fast enough that the
              default should work well for the overwhelming majority of users.

       --exclude=PATTERN
              You can exclude certain files and folders from the deduplication
              process. This might be beneficial for skipping subvolume
              snapshot mounts, for instance. You need to provide the full path
              for exclusion. For example, providing just a file name with a
              wildcard, e.g. duperemove --exclude file-*, will never match
              because internally duperemove works with absolute paths. Another
              thing to keep in mind is that shells usually expand glob
              patterns, so the pattern passed in ought to be quoted. Taking
              everything into consideration, the correct way to pass an
              exclusion pattern is:

                     duperemove --exclude "/path/to/dir/file*" /path/to/dir
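
              For instance, to skip a snapshot directory while deduping the
              rest of a tree (all paths and names here are examples):

                     duperemove -dr --exclude "/mnt/data/.snapshots/*" --hashfile=data.hash /mnt/data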

EXAMPLES
   Simple Usage
       Dedupe the files in directory /foo, recurse into all subdirectories.
       You only want to use this for small data sets.

              duperemove -dr /foo

       Use duperemove with fdupes to dedupe identical files below directory
       foo.

              fdupes -r /foo | duperemove --fdupes

   Using Hashfiles
       Duperemove can optionally store the hashes it calculates in a hashfile.
       Hashfiles have two primary advantages - memory usage and re-usability.
       When using a hashfile, duperemove will stream computed hashes to it,
       instead of main memory.

       If Duperemove is run with an existing hashfile, it will only scan those
       files which have changed since the last time the hashfile was updated.
       The files argument controls which directories duperemove will scan for
       newly added files. In the simplest usage, you rerun duperemove with the
       same parameters and it will only scan changed or newly added files -
       see the first example below.

       Dedupe the files in directory foo, storing hashes in foo.hash. We can
       run this command multiple times and duperemove will only checksum and
       dedupe changed or newly added files.

              duperemove -dr --hashfile=foo.hash foo/

       Don't scan for new files, only update changed or deleted files, then
       dedupe.

              duperemove -dr --hashfile=foo.hash

       Add directory bar to our hashfile and discover any files that were re‐
       cently added to foo.

              duperemove -dr --hashfile=foo.hash foo/ bar/

       List the files tracked by foo.hash.

              duperemove -L --hashfile=foo.hash

FAQ
   Is there an upper limit to the amount of data duperemove can process?
       Duperemove v0.11 is fast at reading and cataloging data. Dedupe runs
       will be memory limited unless the '--hashfile' option is used. '--hash‐
       file' allows duperemove to temporarily store duplicated hashes to disk,
       thus removing the large memory overhead and allowing for a far larger
       amount of data to be scanned and deduped. Realistically though you will
       be limited by the speed of your disks and cpu. In those situations
       where resources are limited you may have success by breaking up the in‐
       put data set into smaller pieces.

       When using a hashfile, duperemove will only store duplicate hashes in
       memory. During normal operation, the hash tree will make up the
       largest portion of duperemove memory usage. As of Duperemove v0.11 hash
       entries are 88 bytes in size. If you know the number of duplicate
       blocks in your data set you can get a rough approximation of memory us‐
       age by multiplying with the hash entry size.
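
       For example, a data set with roughly 1,000,000 duplicate blocks would
       need about 1,000,000 x 88 bytes of memory, i.e. roughly 84MB, for the
       hash entries alone.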

       Actual performance numbers are dependent on hardware - up to date test‐
       ing information is kept on the duperemove wiki (see below for the
       link).

   How large of a hashfile will duperemove create?
       Hashfiles are essentially sqlite3 database files with several tables,
       the largest of which are the files and extents tables. Each extents ta‐
       ble entry is about 72 bytes though that may grow as features are added.
       The size of a files table entry depends on the file path but a good es‐
       timate is around 270 bytes per file. The number of extents in a data
       set is directly proportional to file fragmentation level.

       If you know the total number of extents and files in your data set then
       you can calculate the hashfile size as:

              Hashfile Size = Num Hashes X 72 + Num Files X 270
       Using a real world example of 1TB (8388608 128K blocks) of data over
       1000 files:

              8388608 * 72 + 270 * 1000 = 604249776, or about 576MB for 1TB
              spread over 1000 files.

       Note that none of this takes database overhead into account.

   Is it safe to interrupt the program (Ctrl-C)?
       Yes, Duperemove uses a transactional database engine and organizes db
       changes to take advantage of those features. The result is that you
       should be able to ctrl-c the program at any point and re-run without
       experiencing corruption of your hashfile.

   I have two identical files, why are they not deduped?
       Duperemove by default works on extent granularity. What this means is
       if there are two files which are logically identical (have the same
       content) but are laid out on disk with different extent structure they
       won't be deduped. For example, if two files are 128k each and their
       contents are identical but one of them consists of a single 128k extent
       and the other of 2 x 64k extents then they won't be deduped. This
       behavior is dependent on the current implementation and is subject to
       change as duperemove is being improved.

   How can I find out my space savings after a dedupe?
       Duperemove will print out an estimate of the saved space after a dedupe
       operation for you.

       You can get a more accurate picture by running 'btrfs fi df' before and
       after each duperemove run.

       Be careful about using the 'df' tool on btrfs - it is common for space
       reporting to be 'behind' while delayed updates get processed, so an im‐
       mediate df after deduping might not show any savings.
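
       A typical check (the mount point and hashfile name are only examples)
       would be:

              btrfs fi df /mnt/data
              duperemove -dr --hashfile=data.hash /mnt/data
              btrfs fi df /mnt/data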

   Why is the total deduped data reported as an estimate?
       At the moment duperemove can detect that some underlying extents are
       shared with other files, but it can not resolve which files those ex‐
       tents are shared with.

       Imagine duperemove is examining a series of files and it notes a shared
       data region in one of them. That data could be shared with a file out‐
       side of the series. Since duperemove can't resolve that information it
       will account the shared data against our dedupe operation while in re‐
       ality, the kernel might deduplicate it further for us.

   Why are my files showing dedupe but my disk space is not shrinking?
       This is a little complicated, but it comes down to a feature in Btrfs
       called _bookending_. This is explained in detail at:
       http://en.wikipedia.org/wiki/Btrfs#Extents.

       Essentially though, the underlying representation of an extent in Btrfs
       cannot be split (with a small exception). So sometimes we can end up in
       a situation where a file extent gets partially deduped (and the extents
       marked as shared) but the underlying extent item is not freed or
       truncated.

   Is duperemove safe for my data?
       Yes. To be specific, duperemove does not deduplicate the data itself.
       It simply finds candidates for dedupe and submits them to the Linux
       kernel extent-same ioctl. In order to ensure data integrity, the kernel
       locks out other access to the file and does a byte-by-byte compare be‐
       fore proceeding with the dedupe.

   What is the cost of deduplication?
       Deduplication will lead to increased fragmentation. The blocksize cho‐
       sen can have an effect on this. Larger blocksizes will fragment less
       but may not save you as much space. Conversely, smaller block sizes may
       save more space at the cost of increased fragmentation.

NOTES
       Deduplication is currently only supported by the btrfs and xfs
       filesystems.

       The Duperemove project page can be found at
       https://github.com/markfasheh/duperemove

       There is also a wiki at https://github.com/markfasheh/duperemove/wiki

SEE ALSO
       hashstats(8) filesystems(5) btrfs(8) xfs(8) fdupes(1)

Version 0.11                     September 2016                   duperemove(8)