RMLINT(1)                    rmlint documentation                    RMLINT(1)
2
3
4
NAME
rmlint - find duplicate files and other space waste efficiently
7
9 SYNOPSIS
10 rmlint [TARGET_DIR_OR_FILES ...] [//] [TAGGED_TARGET_DIR_OR_FILES ...]
11 [-] [OPTIONS]
12
13 DESCRIPTION
14 rmlint finds space waste and other broken things on your filesystem.
Its main focus is finding duplicate files and directories.
16
17 It is able to find the following types of lint:
18
19 • Duplicate files and directories (and as a by-product unique files).
20
21 • Nonstripped Binaries (Binaries with debug symbols; needs to be ex‐
22 plicitly enabled).
23
24 • Broken symbolic links.
25
26 • Empty files and directories (also nested empty directories).
27
28 • Files with broken user or group id.
29
30 rmlint itself WILL NOT DELETE ANY FILES. It does however produce exe‐
31 cutable output (for example a shell script) to help you delete the
32 files if you want to. Another design principle is that it should work
well together with other tools like find. Therefore we do not
replicate features of other well-known programs, such as pattern
matching and finding duplicate filenames. However, we provide many
convenience options for common use cases that are hard to build from
scratch with standard tools.
38
39 In order to find the lint, rmlint is given one or more directories to
traverse. If no directories or files are given, the current working
41 directory is assumed. By default, rmlint will ignore hidden files and
42 will not follow symlinks (see Traversal Options). rmlint will first
43 find "other lint" and then search the remaining files for duplicates.
44
rmlint tries to be helpful by guessing which file of a group of
duplicates is the original (i.e. the file that should not be deleted).
It does this by using different sorting strategies that can be
controlled via the -S option. By default it chooses the first-named
path on the commandline. If two duplicates come from the same path, it
will also apply different fallback sort strategies (see the
documentation of the -S option).
52
This behaviour can also be overridden if you know that a certain
directory contains duplicates and another one contains the originals.
In this case you name the original directory after specifying a single
// on the commandline. Everything that comes after it is a preferred
(or "tagged") directory. If there are duplicates from an unpreferred
and from a preferred directory, the preferred one will always count as
the original. Special options can also be used to always keep files in
preferred directories (-k) and to only find duplicates that are present
in both given directories (-m).
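
For example (directory names are illustrative):

$ rmlint backup/ // originals/ -k  # files under originals/ are tagged: preferred as originals and never deleted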
62
We advise new users to have a short look at all options rmlint has to
offer, and maybe test some examples before letting it run on productive
data. WRONG ASSUMPTIONS ARE THE BIGGEST ENEMY OF YOUR DATA. There are
some extended examples at the end of this manual, and the description
of each option that is not self-explanatory also tries to give
examples.
68
69 OPTIONS
70 General Options
71 -T --types="list" (default: defaults)
72 Configure the types of lint rmlint will look for. The list
73 string is a comma-separated list of lint types or lint groups
74 (other separators like semicolon or space also work though).
75
76 One of the following groups can be specified at the beginning of
77 the list:
78
79 • all: Enables all lint types.
80
• defaults: Enables all lint types except nonstripped.
82
83 • minimal: defaults minus emptyfiles and emptydirs.
84
85 • minimaldirs: defaults minus emptyfiles, emptydirs and dupli‐
86 cates, but with duplicatedirs.
87
• none: Disable all lint types.
89
90 Any of the following lint types can be added individually, or
91 deselected by prefixing with a -:
92
93 • badids, bi: Find files with bad UID, GID or both.
94
95 • badlinks, bl: Find bad symlinks pointing nowhere valid.
96
97 • emptydirs, ed: Find empty directories.
98
99 • emptyfiles, ef: Find empty files.
100
101 • nonstripped, ns: Find nonstripped binaries.
102
103 • duplicates, df: Find duplicate files.
104
• duplicatedirs, dd: Find duplicate directories (this is the
same as -D!)
107
WARNING: It is good practice to enclose the list in single or
double quotes. In obscure cases argument parsing might fail in
weird ways, especially when using spaces as separator.
111
112 Example:
113
114 $ rmlint -T "df,dd" # Only search for duplicate files and directories
115 $ rmlint -T "all -df -dd" # Search for all lint except duplicate files and dirs.
116
117 -o --output=spec / -O --add-output=spec (default: -o sh:rmlint.sh -o
118 pretty:stdout -o summary:stdout -o json:rmlint.json)
119 Configure the way rmlint outputs its results. A spec is in the
120 form format:file or just format. A file might either be an ar‐
121 bitrary path or stdout or stderr. If file is omitted, stdout is
122 assumed. format is the name of a formatter supported by this
123 program. For a list of formatters and their options, refer to
124 the Formatters section below.
125
If -o is specified, rmlint's default outputs are overwritten.
With -O the defaults are preserved. Either -o or -O may be
specified multiple times to get multiple outputs, including
multiple outputs of the same format.
130
131 Examples:
132
133 $ rmlint -o json # Stream the json output to stdout
$ rmlint -O csv:/tmp/rmlint.csv # Output an extra csv file to /tmp
135
136 -c --config=spec[=value] (default: none)
137 Configure a format. This option can be used to fine-tune the be‐
138 haviour of the existing formatters. See the Formatters section
139 for details on the available keys.
140
141 If the value is omitted it is set to a value meaning "enabled".
142
143 Examples:
144
145 $ rmlint -c sh:link # Smartly link duplicates instead of removing
146 $ rmlint -c progressbar:fancy # Use a different theme for the progressbar
147
148 -z --perms[=[rwx]] (default: no check)
Only look into a file if it is readable, writable or executable
by the current user. Which of these to check can be given as
argument, as any combination of the letters "rwx".

If no argument is given, "rw" is assumed. Note that r does
basically nothing user-visible since rmlint will ignore
unreadable files anyway. It's just there for the sake of
completeness.
156
157 By default this check is not done.
158
$ rmlint -z rx $(echo $PATH | tr ":" " ") # Look at all executable files in $PATH
161
162 -a --algorithm=name (default: blake2b)
163 Choose the algorithm to use for finding duplicate files. The al‐
164 gorithm can be either paranoid (byte-by-byte file comparison) or
165 use one of several file hash algorithms to identify duplicates.
166 The following hash families are available (in approximate de‐
167 scending order of cryptographic strength):
168
• sha3, blake

• sha

• highway, md

• metro, murmur, xxhash
176
177 The weaker hash functions still offer excellent distribution
178 properties, but are potentially more vulnerable to malicious
179 crafting of duplicate files.
180
181 The full list of hash functions (in decreasing order of checksum
182 length) is:
183
184 512-bit: blake2b, blake2bp, sha3-512, sha512
185
186 384-bit: sha3-384,
187
188 256-bit: blake2s, blake2sp, sha3-256, sha256, highway256,
189 metro256, metrocrc256
190
191 160-bit: sha1
192
193 128-bit: md5, murmur, metro, metrocrc
194
195 64-bit: highway64, xxhash.
196
197 The use of 64-bit hash length for detecting duplicate files is
198 not recommended, due to the probability of a random hash colli‐
199 sion.
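
For example, to pick a different hash function from the list above
(the directory name is illustrative):

$ rmlint -a sha256 photos/  # find duplicates using sha256 instead of the default blake2b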
200
201 -p --paranoid / -P --less-paranoid (default)
202 Increase or decrease the paranoia of rmlint's duplicate algo‐
203 rithm. Use -p if you want byte-by-byte comparison without any
204 hashing.
205
206 • -p is equivalent to --algorithm=paranoid
207
208 • -P is equivalent to --algorithm=highway256
209
210 • -PP is equivalent to --algorithm=metro256
211
212 • -PPP is equivalent to --algorithm=metro
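
For example (the directory name is illustrative):

$ rmlint -p backups/  # compare candidate duplicates byte-by-byte, without hashing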
213
214 -v --loud / -V --quiet
215 Increase or decrease the verbosity. You can pass these options
216 several times. This only affects rmlint's logging on stderr, but
217 not the outputs defined with -o. Passing either option more than
218 three times has no further effect.
219
220 -g --progress / -G --no-progress (default)
221 Show a progressbar with sane defaults.
222
223 Convenience shortcut for -o progressbar -o summary -o sh:rm‐
224 lint.sh -o json:rmlint.json -VVV.
225
226 NOTE: This flag clears all previous outputs. If you want addi‐
227 tional outputs, specify them after this flag using -O.
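
For example, to keep the progressbar but add an extra output (paths
are illustrative):

$ rmlint -g -O csv:report.csv large_dir/  # progressbar plus an additional csv report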
228
229 -D --merge-directories (default: disabled)
Makes rmlint use a special mode where all found duplicates are
collected and checked to see whether whole directory trees are
duplicates. Use with caution: you should always make sure that
the investigated directory is not modified during the run of
rmlint or of its removal scripts.
235
236 IMPORTANT: Definition of equal: Two directories are considered
237 equal by rmlint if they contain the exact same data, no matter
238 how the files containing the data are named. Imagine that rmlint
239 creates a long, sorted stream out of the data found in the di‐
240 rectory and compares this in a magic way to another directory.
241 This means that the layout of the directory is not considered to
242 be important by default. Also empty files will not count as con‐
243 tent. This might be surprising to some users, but remember that
244 rmlint generally cares only about content, not about any other
245 metadata or layout. If you want to only find trees with the same
246 hierarchy you should use --honour-dir-layout / -j.
247
Output is deferred until all duplicates have been found.
Duplicate directories are printed first, followed by any
remaining duplicate files that are isolated or inside of any
original directories.

--rank-by applies to directories too, but 'p' or 'P' (path
index) has no defined (i.e. useful) meaning. Sorting only takes
place when the number of preferred files in the directory
differs.
257
258 NOTES:
259
260 • This option enables --partial-hidden and -@ (--see-symlinks)
261 for convenience. If this is not desired, you should change
262 this after specifying -D.
263
264 • This feature might add some runtime for large datasets.
265
266 • When using this option, you will not be able to use the -c
267 sh:clone option. Use -c sh:link as a good alternative.
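
Example (directory names are illustrative):

$ rmlint -D music/ music_backup/  # also report whole duplicate directory trees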
268
269 -j --honour-dir-layout (default: disabled)
270 Only recognize directories as duplicates that have the same path
271 layout. In other words: All duplicates that build the duplicate
272 directory must have the same path from the root of each respec‐
273 tive directory. This flag makes no sense without --merge-direc‐
274 tories.
275
276 -y --sort-by=order (default: none)
277 During output, sort the found duplicate groups by criteria de‐
278 scribed by order. order is a string that may consist of one or
279 more of the following letters:
280
281 • s: Sort by size of group.
282
283 • a: Sort alphabetically by the basename of the original.
284
285 • m: Sort by mtime of the original.
286
287 • p: Sort by path-index of the original.
288
289 • o: Sort by natural found order (might be different on each
290 run).
291
292 • n: Sort by number of files in the group.
293
The letters may also be written uppercase (similar to -S /
--rank-by) to reverse the sorting. Note that rmlint has to hold
back all results until the end of the run before sorting and
printing.
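
For example, using the letters documented above:

$ rmlint -y sn  # sort groups by size, then by number of files in the group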
298
299 -w --with-color (default) / -W --no-with-color
Use color escapes for pretty output or disable them. If you
pipe rmlint's output to a file, -W is assumed automatically.
302
303 -h --help / -H --show-man
304 Show a shorter reference help text (-h) or the full man page
305 (-H).
306
307 --version
308 Print the version of rmlint. Includes git revision and compile
309 time features. Please include this when giving feedback to us.
310
311 Traversal Options
312 -s --size=range (default: "1")
313 Only consider files as duplicates in a certain size range. The
314 format of range is min-max, where both ends can be specified as
315 a number with an optional multiplier. The available multipliers
316 are:
317
318 • C (1^1), W (2^1), B (512^1), K (1000^1), KB (1024^1), M
319 (1000^2), MB (1024^2), G (1000^3), GB (1024^3),
320
321 • T (1000^4), TB (1024^4), P (1000^5), PB (1024^5), E (1000^6),
322 EB (1024^6)
323
324 The size format is about the same as dd(1) uses. A valid example
325 would be: "100KB-2M". This limits duplicates to a range from 100
326 Kilobyte to 2 Megabyte.
327
328 It's also possible to specify only one size. In this case the
329 size is interpreted as "bigger or equal". If you want to filter
330 for files up to this size you can add a - in front (-s -1M == -s
331 0-1M).
332
333 Edge case: The default excludes empty files from the duplicate
334 search. Normally these are treated specially by rmlint by han‐
335 dling them as other lint. If you want to include empty files as
336 duplicates you should lower the limit to zero:
337
338 $ rmlint -T df --size 0
339
340 -d --max-depth=depth (default: INF)
341 Only recurse up to this depth. A depth of 1 would disable recur‐
342 sion and is equivalent to a directory listing. A depth of 2
343 would also consider all children directories and so on.
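
For example (the path is illustrative):

$ rmlint -d 2 ~/Downloads  # check ~/Downloads and its direct subdirectories only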
344
345 -l --hardlinked (default) / --keep-hardlinked / -L --no-hardlinked
346 Hardlinked files are treated as duplicates by default
347 (--hardlinked). If --keep-hardlinked is given, rmlint will not
348 delete any files that are hardlinked to an original in their re‐
349 spective group. Such files will be displayed like originals,
350 i.e. for the default output with a "ls" in front. The reasoning
351 here is to maximize the number of kept files, while maximizing
352 the number of freed space: Removing hardlinks to originals will
353 not allocate any free space.
354
355 If --no-hardlinked is given, only one file (of a set of
356 hardlinked files) is considered, all the others are ignored;
357 this means, they are not deleted and also not even shown in the
358 output. The "highest ranked" of the set is the one that is con‐
359 sidered.
360
361 -f --followlinks / -F --no-followlinks / -@ --see-symlinks (default)
-f will always follow symbolic links. If filesystem loops occur,
rmlint will detect this. If -F is specified, symbolic links will
be ignored completely; if -@ is specified, rmlint will see
symlinks and treat them like small files containing the path to
their target. The latter is the default behaviour, since it is a
sensible default for --merge-directories.
368
369 -x --no-crossdev / -X --crossdev (default)
370 Stay always on the same device (-x), or allow crossing mount‐
371 points (-X). The latter is the default.
372
373 -r --hidden / -R --no-hidden (default) / --partial-hidden
374 Also traverse hidden directories? This is often not a good idea,
375 since directories like .git/ would be investigated, possibly
leading to the deletion of internal git files, which would in
turn break the repository. With --partial-hidden, hidden files and
378 folders are only considered if they're inside duplicate directo‐
379 ries (see --merge-directories) and will be deleted as part of
380 it.
381
382 -b --match-basename
383 Only consider those files as dupes that have the same basename.
384 See also man 1 basename. The comparison of the basenames is
385 case-insensitive.
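
For example (directory names are illustrative):

$ rmlint -b ~/pics ~/pics_backup  # only files sharing the same basename may form a duplicate group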
386
387 -B --unmatched-basename
388 Only consider those files as dupes that do not share the same
389 basename. See also man 1 basename. The comparison of the base‐
390 names is case-insensitive.
391
392 -e --match-with-extension / -E --no-match-with-extension (default)
393 Only consider those files as dupes that have the same file ex‐
tension. For example, two photos would only match if both are
.png files. The extension is compared case-insensitively, so
.PNG is the same as .png.
397
398 -i --match-without-extension / -I --no-match-without-extension (de‐
399 fault)
400 Only consider those files as dupes that have the same basename
minus the file extension. For example: banana.png and
Banana.jpeg would match, while apple.png and peach.png would
not. The comparison is case-insensitive.
404
405 -n --newer-than-stamp=<timestamp_filename> / -N
406 --newer-than=<iso8601_timestamp_or_unix_timestamp>
407 Only consider files (and their size siblings for duplicates)
408 newer than a certain modification time (mtime). The age barrier
409 may be given as seconds since the epoch or as ISO8601-Timestamp
410 like 2014-09-08T00:12:32+0200.
411
-n expects a file from which it can read the timestamp. After
the rmlint run, the file will be updated with the current
timestamp. If the file does not initially exist, no filtering is
done but the stampfile is still written.
416
417 -N, in contrast, takes the timestamp directly and will not write
418 anything.
419
Note that rmlint will find duplicates newer than the timestamp,
even if the original is older. If you only want to find
duplicates where both original and duplicate are newer than the
timestamp you can use find(1):
424
425 • find -mtime -1 -print0 | rmlint -0 # pass all files younger
426 than a day to rmlint
427
428 Note: you can make rmlint write out a compatible timestamp with:
429
430 • -O stamp:stdout # Write a seconds-since-epoch timestamp to
431 stdout on finish.
432
433 • -O stamp:stdout -c stamp:iso8601 # Same, but write as ISO8601.
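
A typical incremental setup might look like this (stamp.file is an
illustrative name):

$ rmlint -n stamp.file backups/  # first run: stamp.file does not exist, nothing is filtered, stamp is written
$ rmlint -n stamp.file backups/  # later runs: only files newer than the previous run are considered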
434
435 Original Detection Options
436 -k --keep-all-tagged / -K --keep-all-untagged
437 Don't delete any duplicates that are in tagged paths (-k) or
438 that are in non-tagged paths (-K). (Tagged paths are those that
439 were named after //).
440
441 -m --must-match-tagged / -M --must-match-untagged
442 Only look for duplicates of which at least one is in one of the
443 tagged paths. (Paths that were named after //).
444
445 Note that the combinations of -kM and -Km are prohibited by rm‐
446 lint. See https://github.com/sahib/rmlint/issues/244 for more
447 information.
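
For example (backup/ and data/ are illustrative):

$ rmlint backup/ // data/ -k -m  # only report groups that have a copy in data/ and never delete from data/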
448
449 -S --rank-by=criteria (default: pOma)
450 Sort the files in a group of duplicates into originals and du‐
plicates by one or more criteria. Each criterion is defined by a
single letter (except r and x, which expect a regex pattern
after the letter). Multiple criteria may be given as a string,
where the first criterion is the most important. If one
criterion cannot decide between original and duplicate, the next
one is tried.
456
457 • m: keep lowest mtime (oldest) M: keep highest mtime
458 (newest)
459
460 • a: keep first alphabetically A: keep last alphabet‐
461 ically
462
463 • p: keep first named path P: keep last named
464 path
465
466 • d: keep path with lowest depth D: keep path with
467 highest depth
468
469 • l: keep path with shortest basename L: keep path with
470 longest basename
471
472 • r: keep paths matching regex R: keep path not
473 matching regex
474
475 • x: keep basenames matching regex X: keep basenames not
476 matching regex
477
478 • h: keep file with lowest hardlink count H: keep file with
479 highest hardlink count
480
481 • o: keep file with lowest number of hardlinks outside of the
482 paths traversed by rmlint.
483
484 • O: keep file with highest number of hardlinks outside of the
485 paths traversed by rmlint.
486
487 Alphabetical sort will only use the basename of the file and ig‐
488 nore its case. One can have multiple criteria, e.g.: -S am will
489 choose first alphabetically; if tied then by mtime. Note: orig‐
490 inal path criteria (specified using //) will always take first
491 priority over -S options.
492
493 For more fine grained control, it is possible to give a regular
494 expression to sort by. This can be useful when you know a common
495 fact that identifies original paths (like a path component being
496 src or a certain file ending).
497
498 To use the regular expression you simply enclose it in the cri‐
499 teria string by adding <REGULAR_EXPRESSION> after specifying r
500 or x. Example: -S 'r<.*\.bak$>' makes all files that have a .bak
501 suffix original files.
502
Warning: When using r or x, try to make your regex as specific
as possible! Good practice includes adding a $ anchor at the end
of the regex.
506
507 Tips:
508
509 • l is useful for files like file.mp3 vs file.1.mp3 or
510 file.mp3.bak.
511
512 • a can be used as last criteria to assert a defined order.
513
• o/O and h/H are only useful if there are any hardlinks in the
traversed paths.
516
517 • o/O takes the number of hardlinks outside the traversed paths
518 (and thereby minimizes/maximizes the overall number of
519 hardlinks). h/H in contrast only takes the number of hardlinks
520 inside of the traversed paths. When hardlinking files, one
521 would like to link to the original file with the highest outer
522 link count (O) in order to maximise the space cleanup. H does
523 not maximise the space cleanup, it just selects the file with
524 the highest total hardlink count. You usually want to specify
525 O.
526
527 • pOma is the default since p ensures that first given paths
528 rank as originals, O ensures that hardlinks are handled well,
529 m ensures that the oldest file is the original and a simply
530 ensures a defined ordering if no other criteria applies.
531
532 Caching
533 --replay
534 Read an existing json file and re-output it. When --replay is
535 given, rmlint does no input/output on the filesystem, even if
536 you pass additional paths. The paths you pass will be used for
537 filtering the --replay output.
538
539 This is very useful if you want to reformat, refilter or resort
the output you got from a previous run. Usage is simple: just
pass --replay on the second run, with the other options changed
to the new formatters or filters. Pass the .json files of the
previous runs in addition to the paths you ran rmlint on. You
can also merge several previous runs by specifying more than one
.json file; in this case all given files are merged and output
as one big run.
547
548 If you want to view only the duplicates of certain subdirecto‐
549 ries, just pass them on the commandline as usual.
550
551 The usage of // has the same effect as in a normal run. It can
552 be used to prefer one .json file over another. However note that
553 running rmlint in --replay mode includes no real disk traversal,
554 i.e. only duplicates from previous runs are printed. Therefore
555 specifying new paths will simply have no effect. As a security
556 measure, --replay will ignore files whose mtime changed in the
557 meantime (i.e. mtime in the .json file differs from the current
558 one). These files might have been modified and are silently ig‐
559 nored.
560
561 By design, some options will not have any effect. Those are:
562
563 • --followlinks
564
565 • --algorithm
566
567 • --paranoid
568
569 • --clamp-low
570
571 • --hardlinked
572
573 • --write-unfinished
574
575 • ... and all other caching options below.
576
577 NOTE: In --replay mode, a new .json file will be written to rm‐
578 lint.replay.json in order to avoid overwriting rmlint.json.
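
For example:

$ rmlint large_dir/                              # first run; writes rmlint.json
$ rmlint --replay rmlint.json large_dir/ -S MaD  # re-rank the previous results without rescanning the disk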
579
580 -C --xattr
581 Shortcut for --xattr-read, --xattr-write, --write-unfinished.
582 This will write a checksum and a timestamp to the extended at‐
583 tributes of each file that rmlint hashed. This speeds up subse‐
584 quent runs on the same data set. Please note that not all
585 filesystems may support extended attributes and you need write
586 support to use this feature.
587
588 See the individual options below for more details and some exam‐
589 ples.
590
591 --xattr-read / --xattr-write / --xattr-clear
592 Read or write cached checksums from the extended file at‐
593 tributes. This feature can be used to speed up consecutive
594 runs.
595
CAUTION: This could potentially lead to false positives if file
contents are somehow modified without changing the file
modification time. rmlint uses the mtime to determine whether a
cached checksum is outdated. This is not a problem if you use
the clone or reflink operation on a filesystem like btrfs.
There, an outdated checksum entry would simply lead to some
duplicate work done in the kernel but would do no harm
otherwise.
604
605 NOTE: Many tools do not support extended file attributes prop‐
606 erly, resulting in a loss of the information when copying the
607 file or editing it.
608
609 NOTE: You can specify --xattr-write and --xattr-read at the same
610 time. This will read from existing checksums at the start of
611 the run and update all hashed files at the end.
612
613 Usage example:
614
615 $ rmlint large_file_cluster/ -U --xattr-write # first run should be slow.
616 $ rmlint large_file_cluster/ --xattr-read # second run should be faster.
617
618 # Or do the same in just one run:
619 $ rmlint large_file_cluster/ --xattr
620
621 -U --write-unfinished
622 Include files in output that have not been hashed fully, i.e.
623 files that do not appear to have a duplicate. Note that this
624 will not include all files that rmlint traversed, but only the
625 files that were chosen to be hashed.
626
627 This is mainly useful in conjunction with --xattr-write/read.
628 When re-running rmlint on a large dataset this can greatly speed
629 up a re-run in some cases. Please refer to --xattr-read for an
630 example.
631
632 If you want to output unique files, please look into the uniques
633 output formatter.
634
635 Rarely used, miscellaneous options
636 -t --threads=N (default: 16)
637 The number of threads to use during file tree traversal and
638 hashing. rmlint probably knows better than you how to set this
639 value, so just leave it as it is. Setting it to 1 will also not
640 make rmlint a single threaded program.
641
642 -u --limit-mem=size
Apply a maximum amount of memory to use for hashing and
--paranoid. The total amount of memory might still exceed this
limit though, especially when setting it very low. In general,
however, rmlint will consume about this amount of memory plus a
more or less constant extra amount that depends on the data you
are scanning.
649
650 The size-description has the same format as for --size, there‐
651 fore you can do something like this (use this if you have 1GB of
652 memory available):
653
654 $ rmlint -u 512M # Limit paranoid mem usage to 512 MB
655
656 -q --clamp-low=[fac.tor|percent%|offset] (default: 0) / -Q
657 --clamp-top=[fac.tor|percent%|offset] (default: 1.0)
658 The argument can be either passed as factor (a number with a .
659 in it), a percent value (suffixed by %) or as absolute number or
660 size spec, like in --size.
661
Only look at the content of files in the range from low to (and
including) high. This means that if the range is narrower than
-q 0% to -Q 100%, only partial duplicates are searched for. If
the file size is less than the clamp limits, the file is ignored
during traversal. Be careful when using this function; you can
easily get dangerous results for small files.
668
669 This is useful in a few cases where a file consists of a con‐
670 stant sized header or footer. With this option you can just com‐
671 pare the data in between. Also it might be useful for approxi‐
672 mate comparison where it suffices when the file is the same in
673 the middle part.
674
675 Example:
676
$ rmlint -q 10% -Q 512M # Only read the last 90% of a file, but read at max. 512MB
679
680 -Z --mtime-window=T (default: -1)
681 Only consider those files as duplicates that have the same con‐
682 tent and the same modification time (mtime) within a certain
683 window of T seconds. If T is 0, both files need to have the
same mtime. For T=1 they may differ by at most one second, and so on. If the
685 window size is negative, the mtime of duplicates will not be
686 considered. T may be a floating point number.
687
688 However, with three (or more) files, the mtime difference be‐
689 tween two duplicates can be bigger than the mtime window T, i.e.
690 several files may be chained together by the window. Example: If
691 T is 1, the four files fooA (mtime: 00:00:00), fooB (00:00:01),
692 fooC (00:00:02), fooD (00:00:03) would all belong to the same
693 duplicate group, although the mtime of fooA and fooD differs by
694 3 seconds.
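
For example (the directory name is illustrative):

$ rmlint -Z 2 camera_roll/  # only match files with equal content and mtimes within a 2 second window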
695
696 --with-fiemap (default) / --without-fiemap
Enable or disable reading the file extents on rotational disks in
698 order to optimize disk access patterns. If this feature is not
699 available, it is disabled automatically.
700
701 FORMATTERS
702 • csv: Output all found lint as comma-separated-value list.
703
704 Available options:
705
706 • no_header: Do not write a first line describing the column headers.
707
708 • unique: Include unique files in the output.
709
• sh: Output all found lint as a shell script. This formatter is
activated by default.

Available options:
716
717 • cmd: Specify a user defined command to run on duplicates. The com‐
718 mand can be any valid /bin/sh-expression. The duplicate path and
719 original path can be accessed via "$1" and "$2". The command will
720 be written to the user_command function in the sh-file produced by
721 rmlint.
722
• handler: Define a comma-separated list of handlers to try on
duplicate files in that given order until one handler succeeds.
Handlers are just the name of a way of getting rid of the file (see
also the example after this list) and can be any of the following:
727
728 • clone: For reflink-capable filesystems only. Try to clone both
729 files with the FIDEDUPERANGE ioctl(3p) (or BTRFS_IOC_FILE_EX‐
730 TENT_SAME on older kernels). This will free up duplicate ex‐
731 tents. Needs at least kernel 4.2. Use this option when you only
732 have read-only access to a btrfs filesystem but still want to
733 deduplicate it. This is usually the case for snapshots.
734
735 • reflink: Try to reflink the duplicate file to the original. See
736 also --reflink in man 1 cp. Fails if the filesystem does not sup‐
737 port it.
738
739 • hardlink: Replace the duplicate file with a hardlink to the orig‐
740 inal file. The resulting files will have the same inode number.
741 Fails if both files are not on the same partition. You can use ls
742 -i to show the inode number of a file and find -samefile <path>
743 to find all hardlinks for a certain file.
744
745 • symlink: Tries to replace the duplicate file with a symbolic link
746 to the original. This handler never fails.
747
748 • remove: Remove the file using rm -rf. (-r for duplicate dirs).
749 This handler never fails.
750
751 • usercmd: Use the provided user defined command (-c sh:cmd=some‐
752 thing). This handler never fails.
753
754 Default is remove.
755
756 • link: Shortcut for -c sh:handler=clone,reflink,hardlink,symlink.
757 Use this if you are on a reflink-capable system.
758
759 • hardlink: Shortcut for -c sh:handler=hardlink,symlink. Use this if
760 you want to hardlink files, but want to fallback for duplicates
761 that lie on different devices.
762
763 • symlink: Shortcut for -c sh:handler=symlink. Use this as last
764 straw.
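
As referenced in the handler description above, handlers can be
combined; for example (the directory name is illustrative):

$ rmlint -c sh:handler=clone,hardlink big_dir/  # try cloning first, fall back to hardlinking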
765
766 • json: Print a JSON-formatted dump of all found reports. Outputs all
767 lint as a json document. The document is a list of dictionaries,
768 where the first and last element is the header and the footer. Every‐
769 thing between are data-dictionaries.
770
771 Available options:
772
773 • unique: Include unique files in the output.
774
775 • no_header=[true|false]: Print the header with metadata (default:
776 true)
777
778 • no_footer=[true|false]: Print the footer with statistics (default:
779 true)
780
• oneline=[true|false]: Print one json document per line (default:
false). This is useful if you plan to parse the output line-by-line,
e.g. while rmlint is still running.
784
This formatter is extremely useful if you need to script more complex
behaviour that is not directly possible with rmlint's built-in
options. A very handy tool here is jq. Here is an example that
outputs all original files directly from an rmlint run:

$ rmlint -o json | jq -r '.[1:-1][] | select(.is_original) | .path'
791
792 • py: Outputs a python script and a JSON document, just like the json
793 formatter. The JSON document is written to .rmlint.json, executing
794 the script will make it read from there. This formatter is mostly in‐
795 tended for complex use-cases where the lint needs special handling
796 that you define in the python script. Therefore the python script
797 can be modified to do things standard rmlint is not able to do eas‐
798 ily.
799
800 • uniques: Outputs all unique paths found during the run, one path per
801 line. This is often useful for scripting purposes.
802
803 Available options:
804
805 • print0: Do not put newlines between paths but zero bytes.
806
• stamp: Outputs a timestamp of the time rmlint was run. See also the
--newer-than and --newer-than-stamp options.
811
812 Available options:
813
• iso8601=[true|false]: Write an ISO8601-formatted timestamp instead
of seconds since the epoch.
816
817 • progressbar: Shows a progressbar. This is meant for use with stdout
818 or stderr [default].
819
820 See also: -g (--progress) for a convenience shortcut option.
821
822 Available options:
823
824 • update_interval=number: Number of milliseconds to wait between up‐
825 dates. Higher values use less resources (default 50).
826
827 • ascii: Do not attempt to use unicode characters, which might not be
828 supported by some terminals.
829
830 • fancy: Use a more fancy style for the progressbar.
831
832 • pretty: Shows all found items in realtime nicely colored. This for‐
833 matter is activated as default.
834
835 • summary: Shows counts of files and their respective size after the
836 run. Also list all written output files.
837
838 • fdupes: Prints an output similar to the popular duplicate finder
839 fdupes(1). At first a progressbar is printed on stderr. Afterwards
840 the found files are printed on stdout; each set of duplicates gets
841 printed as a block separated by newlines. Originals are highlighted
842 in green. At the bottom a summary is printed on stderr. This is
843 mostly useful for scripts that were set up for parsing fdupes output.
844 We recommend the json formatter for every other scripting purpose.
845
846 Available options:
847
• omitfirst: Same as the -f / --omitfirst option in fdupes(1). Omits
the first line of each set of duplicates (i.e. the original file).
850
851 • sameline: Same as the -1 / --sameline option in fdupes(1). Does not
852 print newlines between files, only a space. Newlines are printed
853 only between sets of duplicates.
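
For example, to mimic fdupes' --sameline output (the directory name is
illustrative):

$ rmlint -o fdupes -c fdupes:sameline some_dir/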
854
855 OTHER STAND-ALONE COMMANDS
856 rmlint --gui
857 Start the optional graphical frontend to rmlint called Shredder.
858
859 This will only work when Shredder and its dependencies were in‐
860 stalled. See also:
861 http://rmlint.readthedocs.org/en/latest/gui.html
862
The GUI has its own set of options; see --gui --help for a
list. These should be placed at the end, i.e. rmlint --gui
[options] when calling it from the commandline.
866
867 rmlint --hash [paths...]
868 Make rmlint work as a multi-threaded file hash utility, similar
869 to the popular md5sum or sha1sum utilities, but faster and with
870 more algorithms. A set of paths given on the commandline or
871 from stdin is hashed using one of the available hash algorithms.
872 Use rmlint --hash -h to see options.
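
For example (file names are illustrative):

$ rmlint --hash big.iso other.iso  # print a checksum for each given path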
873
874 rmlint --equal [paths...]
875 Check if the paths given on the commandline all have equal con‐
876 tent. If all paths are equal and no other error happened, rmlint
877 will exit with an exit code 0. Otherwise it will exit with a
878 nonzero exit code. All other options can be used as normal, but
879 note that no other formatters (sh, csv etc.) will be executed by
880 default. At least two paths need to be passed.
881
Note: This even works for directories and also in combination
with paranoid mode (pass -pp for byte comparison); remember that
rmlint does not care about the layout of the directory, but only
about the content of the files in it.
887
888 By default this will use hashing to compare the files and/or di‐
889 rectories.
890
891 rmlint --dedupe [-r] [-v|-V] <src> <dest>
If the filesystem supports sharing physical storage between
multiple files, and if src and dest have the same content, this
command makes the data in the src file appear in the dest file
by sharing the underlying storage.

This command is similar to cp --reflink=always <src> <dest>,
except that it (a) checks that src and dest have identical data
and (b) makes no changes to dest's metadata.
900
Running with the -r option will enable deduplication of
read-only [btrfs] snapshots (requires root).
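
For example (file names are illustrative):

$ rmlint --dedupe master.iso copy.iso  # make copy.iso share master.iso's storage if their contents match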
903
904 rmlint --is-reflink [-v|-V] <file1> <file2>
905 Tests whether file1 and file2 are reflinks (reference same
906 data). This command makes rmlint exit with one of the following
907 exit codes:
908
909 • 0: files are reflinks
910
911 • 1: files are not reflinks
912
913 • 3: not a regular file
914
915 • 4: file sizes differ
916
917 • 5: fiemaps can't be read
918
919 • 6: file1 and file2 are the same path
920
921 • 7: file1 and file2 are the same file under different mount‐
922 points
923
924 • 8: files are hardlinks
925
926 • 9: files are symlinks
927
928 • 10: files are not on same device
929
930 • 11: other error encountered
931
932 EXAMPLES
933 This is a collection of common use cases and other tricks:
934
935 • Check the current working directory for duplicates.
936
937 $ rmlint
938
939 • Show a progressbar:
940
941 $ rmlint -g
942
943 • Quick re-run on large datasets using different ranking criteria on
944 second run:
945
946 $ rmlint large_dir/ # First run; writes rmlint.json
947
948 $ rmlint --replay rmlint.json large_dir -S MaD
949
950 • Merge together previous runs, but prefer the originals to be from
951 b.json and make sure that no files are deleted from b.json:
952
953 $ rmlint --replay a.json // b.json -k
954
955 • Search only for duplicates and duplicate directories
956
957 $ rmlint -T "df,dd" .
958
959 • Compare files byte-by-byte in current directory:
960
961 $ rmlint -pp .
962
963 • Find duplicates with same basename (excluding extension):
964
965 $ rmlint -e
966
967 • Do more complex traversal using find(1).
968
969 $ find /usr/lib -iname '*.so' -type f | rmlint - # find all duplicate
970 .so files
971
972 $ find /usr/lib -iname '*.so' -type f -print0 | rmlint -0 # as above
973 but handles filenames with newline character in them
974
975 $ find ~/pics -iname '*.png' | ./rmlint - # compare png files only
976
977 • Limit file size range to investigate:
978
979 $ rmlint -s 2GB # Find everything >= 2GB
980
$ rmlint -s 0-2GB # Find everything up to 2GB
982
983 • Only find writable and executable files:
984
985 $ rmlint --perms wx
986
987 • Reflink if possible, else hardlink duplicates to original if possi‐
988 ble, else replace duplicate with a symbolic link:
989
990 $ rmlint -c sh:link
991
992 • Inject user-defined command into shell script output:
993
994 $ rmlint -o sh -c sh:cmd='echo "original:" "$2" "is the same as"
995 "$1"'
996
997 • Use shred to overwrite the contents of a file fully:
998
999 $ rmlint -c 'sh:cmd=shred -un 10 "$1"'
1000
1001 • Use data as master directory. Find only duplicates in backup that are
1002 also in data. Do not delete any files in data:
1003
1004 $ rmlint backup // data --keep-all-tagged --must-match-tagged
1005
• Check whether the directories a, b and c are equal
1007
1008 $ rmlint --equal a b c && echo "Files are equal" || echo "Files are
1009 not equal"
1010
1011 • Test if two files are reflinks
1012
1013 $ rmlint --is-reflink a b && echo "Files are reflinks" || echo "Files
are not reflinks"
1015
1016 • Cache calculated checksums for next run. The checksums will be writ‐
1017 ten to the extended file attributes:
1018
1019 $ rmlint --xattr
1020
1021 • Produce a list of unique files in a folder:
1022
1023 $ rmlint -o uniques
1024
1025 • Produce a list of files that are unique, including original files
1026 ("one of each"):
1027
$ rmlint t -o json -o uniques:unique_files | jq -r '.[1:-1][] | select(.is_original) | .path' | sort > original_files
$ cat unique_files original_files
1031
1032 • Sort files by a user-defined regular expression
1033
1034 # Always keep files with ABC or DEF in their basename,
1035 # dismiss all duplicates with tmp, temp or cache in their names
1036 # and if none of those are applicable, keep the oldest files instead.
1037 $ ./rmlint -S 'x<.*(ABC|DEF).*>X<.*(tmp|temp|cache).*>m' /some/path
1038
1039 • Sort files by adding priorities to several user-defined regular ex‐
1040 pressions:
1041
1042 # Unlike the previous snippet, this one uses priorities:
1043 # Always keep files in ABC, DEF, GHI by following that particular order of
1044 # importance (ABC has a top priority), dismiss all duplicates with
1045 # tmp, temp, cache in their paths and if none of those are applicable,
1046 # keep the oldest files instead.
1047 $ rmlint -S 'r<.*ABC.*>r<.*DEF.*>r<.*GHI.*>R<.*(tmp|temp|cache).*>m' /some/path
1048
1049 PROBLEMS
1050 1. False Positives: Depending on the options you use, there is a very
1051 slight risk of false positives (files that are erroneously detected
1052 as duplicate). The default hash function (blake2b) is very safe but
in theory it is possible for two files to have the same hash. If
1054 you had 10^73 different files, all the same size, then the chance of
1055 a false positive is still less than 1 in a billion. If you're con‐
1056 cerned just use the --paranoid (-pp) option. This will compare all
1057 the files byte-by-byte and is not much slower than blake2b (it may
1058 even be faster), although it is a lot more memory-hungry.
1059
2. File modification during or after an rmlint run: It is possible that
a file that rmlint recognized as a duplicate is modified afterwards, re‐
1062 sulting in a different file. If you use the rmlint-generated shell
1063 script to delete the duplicates, you can run it with the -p option
1064 to do a full re-check of the duplicate against the original before
1065 it deletes the file. When using -c sh:hardlink or -c sh:symlink care
1066 should be taken that a modification of one file will now result in a
1067 modification of all files. This is not the case for -c sh:reflink
1068 or -c sh:clone. Use -c sh:link to minimise this risk.
1069
1070 SEE ALSO
Reading the manpages of these tools might help when working with rmlint:
1072
1073 • find(1)
1074
1075 • rm(1)
1076
1077 • cp(1)
1078
1079 Extended documentation and an in-depth tutorial can be found at:
1080
1081 • http://rmlint.rtfd.org
1082
1083 BUGS
If you find a bug, have a feature request or want to say something
1085 nice, please visit https://github.com/sahib/rmlint/issues.
1086
1087 Please make sure to describe your problem in detail. Always include the
version of rmlint (--version). If you experienced a crash, please
include the output of at least one of the following, obtained with a
debug build of rmlint:
1091
1092 • gdb --ex run -ex bt --args rmlint -vvv [your_options]
1093
1094 • valgrind --leak-check=no rmlint -vvv [your_options]
1095
1096 You can build a debug build of rmlint like this:
1097
1098 • git clone git@github.com:sahib/rmlint.git
1099
1100 • cd rmlint
1101
1102 • scons GDB=1 DEBUG=1
1103
1104 • sudo scons install # Optional
1105
1106 LICENSE
1107 rmlint is licensed under the terms of the GPLv3.
1108
1109 See the COPYRIGHT file that came with the source for more information.
1110
1111 PROGRAM AUTHORS
1112 rmlint was written by:
1113
1114 • Christopher <sahib> Pahl 2010-2017 (https://github.com/sahib)
1115
1116 • Daniel <SeeSpotRun> T. 2014-2017 (https://github.com/SeeSpotRun)
1117
Also see http://rmlint.rtfd.org for other people who helped us.
1119
1120 If you consider a donation you can use Flattr or buy us a beer if we
1121 meet:
1122
1123 https://flattr.com/thing/302682/libglyr
1124
1126 Christopher Pahl, Daniel Thomas
1127
1129 2014-2023, Christopher Pahl & Daniel Thomas
1130
1131
1132
1133
1134 Aug 01, 2023 RMLINT(1)