duperemove(8)                 System Manager's Manual                duperemove(8)


NAME
duperemove - Find duplicate extents and submit them for deduplication

SYNOPSIS
duperemove [options] files...

DESCRIPTION
duperemove is a simple tool for finding duplicated extents and submitting
them for deduplication. When given a list of files it will hash the contents
of their extents and compare those hashes to each other, finding and
categorizing extents that match each other. When given the -d option,
duperemove will submit those extents for deduplication using the Linux
kernel extent-same ioctl.

duperemove can store the hashes it computes in a hashfile. If given an
existing hashfile, duperemove will only compute hashes for those files which
have changed since the last run. Thus you can run duperemove repeatedly on
your data as it changes, without having to re-checksum unchanged data. For
more on hashfiles see the --hashfile option below as well as the Examples
section.

duperemove can also take input from the fdupes program, see the --fdupes
option below.

GENERAL
Duperemove has two major modes of operation, one of which is a subset of the
other.

Readonly / Non-deduplicating Mode
When run without -d (the default), duperemove will print out one or more
tables of matching extents it has determined would be ideal candidates for
deduplication. As a result, readonly mode is useful for seeing what
duperemove might do when run with -d.
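For instance, a readonly inspection pass over a hypothetical directory
/mnt/data (nothing is modified) could look like:

duperemove -rh /mnt/data

Once the candidate tables look reasonable, the same command can be re-run
with -d added to actually submit the extents for dedupe.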

Generally, duperemove does not concern itself with the underlying
representation of the extents it processes. Some of them could be
compressed, undergoing I/O, or even have already been deduplicated. In
dedupe mode, the kernel handles those details and therefore we try not to
replicate that work.

Deduping Mode
This functions similarly to readonly mode with the exception that the
duplicated extents found in our "read, hash, and compare" step will actually
be submitted for deduplication. Extents that have already been deduped will
be skipped. An estimate of the total data deduplicated will be printed after
the operation is complete. This estimate is calculated by comparing the
total amount of shared bytes in each file before and after the dedupe.

OPTIONS
files can refer to a list of regular files and directories or be a hyphen
(-) to read them from standard input. If a directory is specified, all
regular files within it will also be scanned. Duperemove can also be told to
recursively scan directories with the '-r' switch.
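As a sketch (the find expression and paths here are only an illustration), a
file list can also be piped in via the hyphen argument:

find /mnt/data -type f -name '*.iso' | duperemove -d --hashfile=/var/tmp/iso.hash -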

-r  Enable recursive dir traversal.

-d  De-dupe the results - only works on btrfs and xfs.

-A  Opens files readonly when deduping. Primarily for use by privileged
users on readonly snapshots.

-h  Print numbers in human-readable format.

-q  Quiet mode. Duperemove will only print errors and a short summary of any
dedupe.

--hashfile=hashfile
Use a file for storage of hashes instead of memory. This option drastically
reduces the memory footprint of duperemove and is recommended when your data
set is more than a few files large. Hashfiles are also reusable, allowing
you to further reduce the amount of hashing done on subsequent dedupe runs.

If hashfile does not exist it will be created. If it exists, duperemove will
check the file paths stored inside of it for changes. Files which have
changed will be rescanned and their updated hashes will be written to the
hashfile. Deleted files will be removed from the hashfile.

New files are only added to the hashfile if they are discoverable via the
files argument. For that reason you probably want to provide the same files
list and -r arguments on each run of duperemove. The file discovery
algorithm is efficient and will only visit each file once, even if it is
already in the hashfile.

Adding a new path to a hashfile is as simple as adding it to the files
argument.

When deduping from a hashfile, duperemove will avoid deduping files which
have not changed since the last dedupe.

-L  Print all files in the hashfile and exit. Requires the --hashfile
option. Will print additional information about each file when run with -v.

-R [file]
Remove file from the db and exit. Can be specified multiple times.
Duperemove will read the list from standard input if a hyphen (-) is
provided. Requires the --hashfile option.

Note: If you are piping filenames from another duperemove instance it is
advisable to do so into a temporary file first, as running duperemove
simultaneously on the same hashfile may corrupt that hashfile.
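As an illustration of the note above (paths are hypothetical, and this
assumes the -L output is one file path per line), stale entries can be
collected into a temporary file and then removed in a second step:

duperemove -L --hashfile=foo.hash | grep '^/mnt/data/old/' > /tmp/stale-files
duperemove -R - --hashfile=foo.hash < /tmp/stale-files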

--fdupes
Run in fdupes mode. With this option you can pipe the output of fdupes to
duperemove to dedupe any duplicate files found. When receiving a file list
in this manner, duperemove will skip the hashing phase.

-v  Be verbose.

--skip-zeroes
Read data blocks and skip any zeroed blocks. This is useful for speeding up
duperemove, but can prevent deduplication of zeroed files.

-b size
Use the specified block size for reading file extents. Defaults to 128K.
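For example (the exact size suffixes accepted may vary between builds; this
assumes the usual K notation):

duperemove -dr -b 64k --hashfile=foo.hash /foo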

--io-threads=N
Use N threads for I/O. This is used by the file hashing and dedupe stages.
Default is automatically detected based on number of host cpus.

--cpu-threads=N
Use N threads for CPU bound tasks. This is used by the duplicate extent
finding stage. Default is automatically detected based on number of host
cpus.

Note: Hyperthreading can adversely affect performance of the extent finding
stage. If duperemove detects an Intel CPU with hyperthreading it will use
half the number of cores reported by the system for cpu bound tasks.
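For example, to pin both thread counts explicitly on a hypothetical 8-core
machine rather than relying on autodetection:

duperemove -dr --io-threads=8 --cpu-threads=4 --hashfile=foo.hash /foo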

--dedupe-options=options
Comma separated list of options which alter how we dedupe. Prepend 'no' to
an option in order to turn it off.

[no]partial
Duperemove can often find more dedupe by comparing portions of extents to
each other. This can be a lengthy, CPU intensive task so it is turned off by
default. Using --batchsize is recommended to limit the negative effects of
this option (see the first example after this list).

The code behind this option is under active development and as a result the
semantics of the partial argument may change.

[no]same
Defaults to off. Allow dedupe of extents within the same file.

[no]fiemap
Defaults to on. Duperemove uses the fiemap ioctl during the dedupe stage to
optimize out already deduped extents as well as to provide an estimate of
the space saved after dedupe operations are complete.

Unfortunately, some versions of Btrfs exhibit extremely poor performance in
fiemap as the number of references on a file extent goes up. If you are
experiencing the dedupe phase slowing down or 'locking up', this option may
give you a significant amount of performance back.

Note: This does not turn off all usage of fiemap. To disable fiemap during
the file scan stage, you will also want to use the --lookup-extents=no
option (see the second example after this list).

[no]block
Deprecated.
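To illustrate the two notes above (the hashfile name and paths are
hypothetical): the first command enables partial matching together with
batching, the second disables fiemap everywhere for cases where the dedupe
phase appears to lock up:

duperemove -dr --hashfile=foo.hash --dedupe-options=partial --batchsize=1000 /foo
duperemove -dr --hashfile=foo.hash --dedupe-options=nofiemap --lookup-extents=no /foo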

--help  Prints help text.

--lookup-extents=[yes|no]
Defaults to yes. Allows duperemove to skip checksumming some blocks by
checking their extent state.

--read-hashes=hashfile
This option is primarily for testing. See the --hashfile option if you want
to use hashfiles.

Read hashes from a hashfile. A file list is not required with this option.
Dedupe can be done if duperemove is run from the same base directory as is
stored in the hash file (basically duperemove has to be able to find the
files).

--write-hashes=hashfile
This option is primarily for testing. See the --hashfile option if you want
to use hashfiles.

Write hashes to a hashfile. These can be read in at a later date and deduped
from.

--debug
Print debug messages, forces -v if selected.

--hash-threads=N
Deprecated, see --io-threads above.

--exclude=PATTERN
You can exclude certain files and folders from the deduplication process.
This might be beneficial for skipping subvolume snapshot mounts, for
instance. You need to provide a full path for exclusion. For example,
providing just a file name with a wildcard, i.e. duperemove --exclude
file-*, will never match because internally duperemove works with absolute
paths. Another thing to keep in mind is that shells usually expand glob
patterns, so the passed-in pattern ought to be quoted. Taking all of this
into consideration, the correct way to pass an exclusion pattern is:

duperemove --exclude "/path/to/dir/file*" /path/to/dir

-B, --batchsize=N
Run the deduplication phase every N files newly scanned. This greatly
reduces memory usage for large data sets, or when you are doing partial
extents lookup, but reduces multithreading efficiency.

Because of that small overhead, its value should be selected based on the
average file size and blocksize.

1000 is a sane value for extents-only lookups, while you can go as low as 1
if you are running duperemove on very large files (such as virtual machine
images).

By default, batching is disabled.
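For example, when deduping a directory of very large files such as virtual
machine images (path and hashfile name hypothetical), batching after every
file keeps memory usage low:

duperemove -dr --hashfile=vm.hash -B 1 /srv/images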

EXAMPLES
Simple Usage
Dedupe the files in directory /foo, recurse into all subdirectories. You
only want to use this for small data sets.

duperemove -dr /foo

Use duperemove with fdupes to dedupe identical files below directory foo.

fdupes -r /foo | duperemove --fdupes

Using Hashfiles
Duperemove can optionally store the hashes it calculates in a hashfile.
Hashfiles have two primary advantages - memory usage and re-usability. When
using a hashfile, duperemove will stream computed hashes to it, instead of
main memory.

If Duperemove is run with an existing hashfile, it will only scan those
files which have changed since the last time the hashfile was updated. The
files argument controls which directories duperemove will scan for newly
added files. In the simplest usage, you rerun duperemove with the same
parameters and it will only scan changed or newly added files - see the
first example below.

Dedupe the files in directory foo, storing hashes in foo.hash. We can run
this command multiple times and duperemove will only checksum and dedupe
changed or newly added files.

duperemove -dr --hashfile=foo.hash foo/

Don't scan for new files, only update changed or deleted files, then dedupe.

duperemove -dr --hashfile=foo.hash

Add directory bar to our hashfile and discover any files that were recently
added to foo.

duperemove -dr --hashfile=foo.hash foo/ bar/

List the files tracked by foo.hash.

duperemove -L --hashfile=foo.hash

FAQ
Is there an upper limit to the amount of data duperemove can process?
Duperemove is fast at reading and cataloging data. Dedupe runs will be
memory limited unless the '--hashfile' option is used. '--hashfile' allows
duperemove to temporarily store duplicated hashes to disk, thus removing the
large memory overhead and allowing for a far larger amount of data to be
scanned and deduped. Realistically though you will be limited by the speed
of your disks and cpu. In those situations where resources are limited you
may have success by breaking up the input data set into smaller pieces.

When using a hashfile, duperemove will only store duplicate hashes in
memory. During normal operation, the hash tree will make up the largest
portion of duperemove's memory usage. As of Duperemove v0.11 hash entries
are 88 bytes in size. If you know the number of duplicate blocks in your
data set you can get a rough approximation of memory usage by multiplying
that count by the hash entry size.
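For example, using the 88-byte figure above, one million duplicate block
hashes work out to roughly:

1000000 * 88 = 88000000 bytes, or about 84MB of memory.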

Actual performance numbers are dependent on hardware - up to date testing
information is kept on the duperemove wiki (see below for the link).

How large of a hashfile will duperemove create?
Hashfiles are essentially sqlite3 database files with several tables, the
largest of which are the files and extents tables. Each extents table entry
is about 72 bytes though that may grow as features are added. The size of a
files table entry depends on the file path but a good estimate is around 270
bytes per file. The number of extents in a data set is directly proportional
to its level of file fragmentation.

If you know the total number of extents and files in your data set then you
can calculate the hashfile size as:

Hashfile Size = Num Extents X 72 + Num Files X 270

Using a real world example of 1TB (8388608 128K blocks, here taken as one
extent per block) of data over 1000 files:

8388608 * 72 + 270 * 1000 = 604249776 or about 576MB for 1TB spread over
1000 files.

Note that none of this takes database overhead into account.

Is it safe to interrupt the program (Ctrl-C)?
Yes. Duperemove uses a transactional database engine and organizes db
changes to take advantage of those features. The result is that you should
be able to ctrl-c the program at any point and re-run without experiencing
corruption of your hashfile.

I have two identical files, why are they not deduped?
Duperemove by default works on extent granularity. What this means is if
there are two files which are logically identical (have the same content)
but are laid out on disk with different extent structure, they won't be
deduped. For example, if two files are 128k each and their contents are
identical but one of them consists of a single 128k extent and the other of
2 x 64k extents, then they won't be deduped. This behavior is dependent on
the current implementation and is subject to change as duperemove is being
improved.

How can I find out my space savings after a dedupe?
Duperemove will print out an estimate of the saved space after a dedupe
operation for you.

You can get a more accurate picture by running 'btrfs fi df' before and
after each duperemove run.

Be careful about using the 'df' tool on btrfs - it is common for space
reporting to be 'behind' while delayed updates get processed, so an
immediate df after deduping might not show any savings.
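One possible sequence (the mount point is hypothetical; the sync gives
delayed updates a chance to be processed before the second check):

btrfs fi df /mnt/data
duperemove -dr --hashfile=foo.hash /mnt/data
sync
btrfs fi df /mnt/data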

Why is the total deduped data reported as an estimate?
At the moment duperemove can detect that some underlying extents are shared
with other files, but it can not resolve which files those extents are
shared with.

Imagine duperemove is examining a series of files and it notes a shared data
region in one of them. That data could be shared with a file outside of the
series. Since duperemove can't resolve that information it will count the
shared data against our dedupe operation while in reality, the kernel might
deduplicate it further for us.

Why are my files showing dedupe but my disk space is not shrinking?
This is a little complicated, but it comes down to a feature in Btrfs called
_bookending_. The following article explains this in detail:
http://en.wikipedia.org/wiki/Btrfs#Extents.

Essentially though, the underlying representation of an extent in Btrfs can
not be split (with small exceptions). So sometimes we can end up in a
situation where a file extent gets partially deduped (and the extents marked
as shared) but the underlying extent item is not freed or truncated.

Is duperemove safe for my data?
Yes. To be specific, duperemove does not deduplicate the data itself. It
simply finds candidates for dedupe and submits them to the Linux kernel
extent-same ioctl. In order to ensure data integrity, the kernel locks out
other access to the file and does a byte-by-byte compare before proceeding
with the dedupe.

What is the cost of deduplication?
Deduplication will lead to increased fragmentation. The blocksize chosen can
have an effect on this. Larger blocksizes will fragment less but may not
save you as much space. Conversely, smaller block sizes may save more space
at the cost of increased fragmentation.

NOTES
Deduplication is currently only supported by the btrfs and xfs filesystems.

The Duperemove project page can be found at
https://github.com/markfasheh/duperemove

There is also a wiki at https://github.com/markfasheh/duperemove/wiki

SEE ALSO
hashstats(8) filesystems(5) btrfs(8) xfs(8) fdupes(1)


Version 0.12                     September 2016                    duperemove(8)