Duperemove(8)                 System Manager’s Manual                Duperemove(8)

NAME
       duperemove - Find duplicate extents and submit them for deduplication

SYNOPSIS
       duperemove [options] files...

DESCRIPTION
       duperemove is a simple tool for finding duplicated extents and
       submitting them for deduplication.  When given a list of files it
       will hash the contents of their extents and compare those hashes to
       each other, finding and categorizing extents that match each other.
       When given the -d option, duperemove will submit those extents for
       deduplication using the Linux kernel FIDEDUPERANGE ioctl.

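       As a rough illustration of that interface (this sketch is not part of
       duperemove itself, and the paths, offsets and length here are made
       up), a single dedupe request built along the lines of
       ioctl_fideduperange(2) could look like:

       /* Hypothetical example: ask the kernel to share one 128K block
        * between two files, as described in ioctl_fideduperange(2). */
       #include <fcntl.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/ioctl.h>
       #include <linux/fs.h>

       int main(void)
       {
           struct file_dedupe_range *range;
           int src = open("/foo/a", O_RDONLY);
           int dst = open("/foo/b", O_RDWR);

           if (src < 0 || dst < 0)
               return 1;

           /* One destination range; info[] is a flexible array member. */
           range = calloc(1, sizeof(*range) +
                             sizeof(struct file_dedupe_range_info));
           range->src_offset = 0;
           range->src_length = 128 * 1024;   /* one 128K block */
           range->dest_count = 1;
           range->info[0].dest_fd = dst;
           range->info[0].dest_offset = 0;

           if (ioctl(src, FIDEDUPERANGE, range) < 0) {
               perror("FIDEDUPERANGE");
               return 1;
           }

           /* The kernel reports status and bytes deduped per target. */
           if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
               printf("deduped %llu bytes\n",
                      (unsigned long long)range->info[0].bytes_deduped);
           else
               printf("not deduped (status %d)\n", range->info[0].status);

           free(range);
           return 0;
       }

       The kernel only shares the ranges after its own byte-by-byte
       comparison of the data succeeds (see the FAQ below), which is what
       makes submitting candidates this way safe.
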
       duperemove can store the hashes it computes in a hashfile.  If given
       an existing hashfile, duperemove will only compute hashes for those
       files which have changed since the last run.  Thus you can run
       duperemove repeatedly on your data as it changes, without having to
       re-checksum unchanged data.  For more on hashfiles see the --hashfile
       option below as well as the Examples section.

       duperemove can also take input from the fdupes program, see the
       --fdupes option below.

GENERAL
       Duperemove has two major modes of operation, one of which is a subset
       of the other.

   Readonly / Non-deduplicating Mode
       When run without -d (the default) duperemove will print out one or
       more tables of matching extents it has determined would be ideal
       candidates for deduplication.  As a result, readonly mode is useful
       for seeing what duperemove might do when run with -d.

       Generally, duperemove does not concern itself with the underlying
       representation of the extents it processes.  Some of them could be
       compressed, undergoing I/O, or even have already been deduplicated.
       In dedupe mode, the kernel handles those details and therefore we try
       not to replicate that work.

   Deduping Mode
       This functions similarly to readonly mode with the exception that the
       duplicated extents found in our “read, hash, and compare” step will
       actually be submitted for deduplication.  Extents that have already
       been deduped will be skipped.  An estimate of the total data
       deduplicated will be printed after the operation is complete.  This
       estimate is calculated by comparing the total amount of shared bytes
       in each file before and after the dedupe.

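       The shared-byte totals come from extent information reported by the
       kernel.  As a simplified illustration only (this is not duperemove’s
       actual accounting code), the shared portion of a file can be sampled
       with the FIEMAP ioctl by summing the extents flagged as shared:

       /* Sketch: report how many bytes of a file sit in extents that the
        * kernel flags as shared.  Simplified: a real tool would loop
        * until FIEMAP_EXTENT_LAST is seen and check for errors. */
       #include <fcntl.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <sys/ioctl.h>
       #include <linux/fs.h>
       #include <linux/fiemap.h>

       static unsigned long long shared_bytes(const char *path)
       {
           int fd = open(path, O_RDONLY);
           struct fiemap *fm;
           unsigned long long shared = 0;
           unsigned int i;

           if (fd < 0)
               return 0;

           fm = calloc(1, sizeof(*fm) + 256 * sizeof(struct fiemap_extent));
           fm->fm_start = 0;
           fm->fm_length = ~0ULL;            /* map the whole file */
           fm->fm_flags = FIEMAP_FLAG_SYNC;  /* flush delayed allocations */
           fm->fm_extent_count = 256;

           if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
               for (i = 0; i < fm->fm_mapped_extents; i++)
                   if (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED)
                       shared += fm->fm_extents[i].fe_length;

           free(fm);
           close(fd);
           return shared;
       }

       int main(int argc, char **argv)
       {
           int i;

           for (i = 1; i < argc; i++)
               printf("%s: %llu shared bytes\n", argv[i],
                      shared_bytes(argv[i]));
           return 0;
       }

       Comparing such totals before and after a dedupe run gives a
       space-savings figure in the same spirit as the printed estimate, with
       the caveats described in the FAQ below.
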
OPTIONS
   Common options
       files can refer to a list of regular files and directories or be a
       hyphen (-) to read them from standard input.  If a directory is
       specified, all regular files within it will also be scanned.
       Duperemove can also be told to recursively scan directories with the
       -r switch.

       -r     Enable recursive dir traversal.

       -d     De-dupe the results - only works on btrfs and xfs.  Use this
              option twice to disable the check and try to run the ioctl
              anyway.

       --hashfile=hashfile
              Use a file for storage of hashes instead of memory.  This
              option drastically reduces the memory footprint of duperemove
              and is recommended when your data set consists of more than a
              few files.  Hashfiles are also reusable, allowing you to
              further reduce the amount of hashing done on subsequent dedupe
              runs.

              If hashfile does not exist it will be created.  If it exists,
              duperemove will check the file paths stored inside of it for
              changes.  Files which have changed will be rescanned and their
              updated hashes will be written to the hashfile.  Deleted files
              will be removed from the hashfile.

              New files are only added to the hashfile if they are
              discoverable via the files argument.  For that reason you
              probably want to provide the same files list and -r arguments
              on each run of duperemove.  The file discovery algorithm is
              efficient and will only visit each file once, even if it is
              already in the hashfile.

              Adding a new path to a hashfile is as simple as adding it to
              the files argument.

              When deduping from a hashfile, duperemove will avoid deduping
              files which have not changed since the last dedupe.

       -B N, --batchsize=N
              Run the deduplication phase every N newly scanned files.  This
              greatly reduces memory usage for large data sets, or when you
              are doing partial extent lookups, but reduces multithreading
              efficiency.

              Because of that overhead, the value should be chosen based on
              the average file size and blocksize.

              The default is a sane value for extents-only lookups; you can
              go as low as 1 if you are running duperemove on very large
              files (virtual machine images, for example).

              By default, batching is set to 1024.

       -h     Print numbers in human-readable format.

       -q     Quiet mode.  Duperemove will only print errors and a short
              summary of any dedupe.

       -v     Be verbose.

       --help Prints help text.

   Advanced options
       --fdupes
              Run in fdupes mode.  With this option you can pipe the output
              of fdupes to duperemove to dedupe any duplicate files found.
              When receiving a file list in this manner, duperemove will
              skip the hashing phase.

       -L     Print all files in the hashfile and exit.  Requires the
              --hashfile option.  Will print additional information about
              each file when run with -v.

       -R files ...
              Remove files from the db and exit.  Duperemove will read the
              list from standard input if a hyphen (-) is provided.
              Requires the --hashfile option.

              Note: If you are piping filenames from another duperemove
              instance it is advisable to do so into a temporary file first,
              as running duperemove simultaneously on the same hashfile may
              corrupt that hashfile.

       --skip-zeroes
              Read data blocks and skip any zeroed blocks.  This can speed
              up duperemove, but may prevent deduplication of zeroed files.

       -b size
              Use the specified block size for reading file extents.
              Defaults to 128K.

       --io-threads=N
              Use N threads for I/O.  This is used by the file hashing and
              dedupe stages.  Default is automatically detected based on the
              number of host CPUs.

       --cpu-threads=N
              Use N threads for CPU bound tasks.  This is used by the
              duplicate extent finding stage.  Default is automatically
              detected based on the number of host CPUs.

              Note: Hyperthreading can adversely affect performance of the
              extent finding stage.  If duperemove detects an Intel CPU with
              hyperthreading it will use half the number of cores reported
              by the system for CPU bound tasks.

       --dedupe-options=options
              Comma separated list of options which alter how we dedupe.
              Prepend `no' to an option in order to turn it off.

              [no]partial
                     Duperemove can often find more dedupe by comparing
                     portions of extents to each other.  This can be a
                     lengthy, CPU intensive task so it is turned off by
                     default.  Using --batchsize is recommended to limit the
                     negative effects of this option.

                     The code behind this option is under active development
                     and as a result the semantics of the partial argument
                     may change.

              [no]same
                     Defaults to on.  Allow dedupe of extents within the
                     same file.

              [no]rescan_files
                     Defaults to on.  Duperemove will check for files that
                     were found and deduplicated in a previous run, based on
                     the hashfile.  Deduplicated files may have changed if
                     new content was added, but also if their physical
                     layout was modified (defrag for instance).  You can
                     disable those checks to increase performance when
                     running duperemove against a specific directory or file
                     which you know is the only changed part of a larger,
                     otherwise unchanged data set.  Duperemove will still
                     dedupe that specific target against any shared extent
                     found in the existing files.

              [no]only_whole_files
                     Defaults to off.  Duperemove will only work on whole
                     files.  Both extent-based and block-based deduplication
                     will be disabled.  The hashfile will be smaller, and
                     some operations will be faster, but deduplication
                     efficiency will be reduced.

       --read-hashes=hashfile
              This option is primarily for testing.  See the --hashfile
              option if you want to use hashfiles.

              Read hashes from a hashfile.  A file list is not required with
              this option.  Dedupe can be done if duperemove is run from the
              same base directory as is stored in the hashfile (basically
              duperemove has to be able to find the files).

       --write-hashes=hashfile
              This option is primarily for testing.  See the --hashfile
              option if you want to use hashfiles.

              Write hashes to a hashfile.  These can be read in at a later
              date and deduped from.

       --debug
              Print debug messages.  Forces -v if selected.

       --hash-threads=N
              Deprecated, see --io-threads above.

       --exclude=PATTERN
              You can exclude certain files and folders from the
              deduplication process.  This might be beneficial for skipping
              subvolume snapshot mounts, for instance.  Unless you provide a
              full path for exclusion, the exclude will be relative to the
              current working directory.  Another thing to keep in mind is
              that shells usually expand glob patterns, so the pattern
              passed in ought to be quoted.  Taking everything into
              consideration, the correct way to pass an exclusion pattern
              is:

                     duperemove --exclude "/path/to/dir/file*" /path/to/dir

EXAMPLES
   Simple Usage
       Dedupe the files in directory /foo, recursing into all
       subdirectories.  You only want to use this for small data sets:

              duperemove -dr /foo

       Use duperemove with fdupes to dedupe identical files below directory
       foo:

              fdupes -r /foo | duperemove --fdupes

   Using Hashfiles
       Duperemove can optionally store the hashes it calculates in a
       hashfile.  Hashfiles have two primary advantages - memory usage and
       re-usability.  When using a hashfile, duperemove will stream computed
       hashes to it, instead of main memory.

       If Duperemove is run with an existing hashfile, it will only scan
       those files which have changed since the last time the hashfile was
       updated.  The files argument controls which directories duperemove
       will scan for newly added files.  In the simplest usage, you rerun
       duperemove with the same parameters and it will only scan changed or
       newly added files - see the first example below.

       Dedupe the files in directory foo, storing hashes in foo.hash.  We
       can run this command multiple times and duperemove will only checksum
       and dedupe changed or newly added files:

              duperemove -dr --hashfile=foo.hash foo/

       Don’t scan for new files, only update changed or deleted files, then
       dedupe:

              duperemove -dr --hashfile=foo.hash

       Add directory bar to our hashfile and discover any files that were
       recently added to foo:

              duperemove -dr --hashfile=foo.hash foo/ bar/

       List the files tracked by foo.hash:

              duperemove -L --hashfile=foo.hash

FAQ
   Is duperemove safe for my data?
       Yes.  To be specific, duperemove does not deduplicate the data
       itself.  It simply finds candidates for dedupe and submits them to
       the Linux kernel FIDEDUPERANGE ioctl.  In order to ensure data
       integrity, the kernel locks out other access to the file and does a
       byte-by-byte compare before proceeding with the dedupe.

   Is it safe to interrupt the program (Ctrl-C)?
       Yes.  The Linux kernel deals with the actual data.  On Duperemove’s
       side, a transactional database engine is used.  The result is that
       you should be able to ctrl-c the program at any point and re-run it
       without experiencing corruption of your hashfile.  In case of a bug,
       your hashfile may be broken, but your data never will be.

   I got two identical files, why are they not deduped?
       Duperemove by default works on extent granularity.  What this means
       is if there are two files which are logically identical (have the
       same content) but are laid out on disk with different extent
       structure they won’t be deduped.  For example, if 2 files are 128k
       each and their contents are identical but one of them consists of a
       single 128k extent and the other of 2 * 64k extents, then they won’t
       be deduped.  This behavior is dependent on the current implementation
       and is subject to change as duperemove is being improved.

   What is the cost of deduplication?
       Deduplication will lead to increased fragmentation.  The blocksize
       chosen can have an effect on this.  Larger blocksizes will fragment
       less but may not save you as much space.  Conversely, smaller block
       sizes may save more space at the cost of increased fragmentation.

   How can I find out my space savings after a dedupe?
       Duperemove will print out an estimate of the saved space after a
       dedupe operation for you.

       You can get a more accurate picture by running `btrfs fi df' before
       and after each duperemove run.

       Be careful about using the `df' tool on btrfs - it is common for
       space reporting to be `behind' while delayed updates get processed,
       so an immediate df after deduping might not show any savings.

   Why is the total deduped data reported as an estimate?
       At the moment duperemove can detect that some underlying extents are
       shared with other files, but it can not resolve which files those
       extents are shared with.

       Imagine duperemove is examining a series of files and it notes a
       shared data region in one of them.  That data could be shared with a
       file outside of the series.  Since duperemove can’t resolve that
       information it will account the shared data against our dedupe
       operation while in reality, the kernel might deduplicate it further
       for us.

   Why are my files showing dedupe but my disk space is not shrinking?
       This is a little complicated, but it comes down to a feature in Btrfs
       called bookending.  The Btrfs article on Wikipedia
       (http://en.wikipedia.org/wiki/Btrfs#Extents) explains this in detail.

       Essentially though, the underlying representation of an extent in
       Btrfs can not be split (with small exception).  So sometimes we can
       end up in a situation where a file extent gets partially deduped (and
       the extents marked as shared) but the underlying extent item is not
       freed or truncated.

   Is there an upper limit to the amount of data duperemove can process?
       Duperemove is fast at reading and cataloging data.  Dedupe runs will
       be memory limited unless the --hashfile option is used.  --hashfile
       allows duperemove to temporarily store duplicated hashes to disk,
       thus removing the large memory overhead and allowing for a far larger
       amount of data to be scanned and deduped.  Realistically though you
       will be limited by the speed of your disks and cpu.  In those
       situations where resources are limited you may have success by
       breaking up the input data set into smaller pieces.

       When using a hashfile, duperemove will only store duplicate hashes in
       memory.  During normal operation the hash tree will make up the
       largest portion of duperemove memory usage.  As of Duperemove v0.11
       hash entries are 88 bytes in size.  If you know the number of
       duplicate blocks in your data set you can get a rough approximation
       of memory usage by multiplying with the hash entry size.

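       For example, in the (hypothetical) worst case where all 8388608 128K
       blocks of a 1TB data set are duplicates, the hash tree would need
       roughly:

              8388608 * 88 = 738197504 or about 704MB of memory.
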
       Actual performance numbers are dependent on hardware - up to date
       testing information is kept on the duperemove wiki (see below for the
       link).

   How large of a hashfile will duperemove create?
       Hashfiles are essentially sqlite3 database files with several tables,
       the largest of which are the files and extents tables.  Each extents
       table entry is about 72 bytes though that may grow as features are
       added.  The size of a files table entry depends on the file path but
       a good estimate is around 270 bytes per file.  The number of extents
       in a data set is directly proportional to file fragmentation level.

       If you know the total number of extents and files in your data set
       then you can calculate the hashfile size as:

              Hashfile Size = Num Hashes * 72 + Num Files * 270

       Using a real world example of 1TB (8388608 128K blocks) of data over
       1000 files:

              8388608 * 72 + 270 * 1000 = 755244720 or about 720MB for 1TB
              spread over 1000 files.

       Note that none of this takes database overhead into account.

NOTES
       Deduplication is currently only supported by the btrfs and xfs
       filesystems.

       The Duperemove project page can be found on GitHub
       (https://github.com/markfasheh/duperemove)

       There is also a wiki (https://github.com/markfasheh/duperemove/wiki)

SEE ALSO
       • hashstats(8)

       • filesystems(5)

       • btrfs(8)

       • xfs(8)

       • fdupes(1)

       • ioctl_fideduperange(2)



duperemove 0.13                  29 Sept 2023                     Duperemove(8)