RMLINT(1)                    rmlint documentation                    RMLINT(1)

NAME
   rmlint - find duplicate files and other space waste efficiently

SYNOPSIS
   rmlint [TARGET_DIR_OR_FILES ...] [//] [TAGGED_TARGET_DIR_OR_FILES ...]
          [-] [OPTIONS]

DESCRIPTION
   rmlint finds space waste and other broken things on your filesystem.
   Its main focus lies on finding duplicate files and directories.

   It is able to find the following types of lint:

   · Duplicate files and directories (and, as a by-product, unique
     files).

   · Nonstripped binaries (binaries with debug symbols; needs to be
     explicitly enabled).

   · Broken symbolic links.

   · Empty files and directories (also nested empty directories).

   · Files with broken user or group id.

   rmlint itself WILL NOT DELETE ANY FILES. It does however produce
   executable output (for example a shell script) to help you delete
   the files if you want to. Another design principle is that it should
   work well together with other tools like find. Therefore we do not
   replicate features of other well-known programs, such as pattern
   matching and finding duplicate filenames. However, we provide many
   convenience options for common use cases that are hard to build from
   scratch with standard tools.

   In order to find the lint, rmlint is given one or more directories
   to traverse. If no directories or files were given, the current
   working directory is assumed. By default, rmlint will ignore hidden
   files and will not follow symlinks (see Traversal Options). rmlint
   will first find "other lint" and then search the remaining files for
   duplicates.

   rmlint tries to be helpful by guessing which file of a group of
   duplicates is the original (i.e. the file that should not be
   deleted). It does this by using different sorting strategies that
   can be controlled via the -S option. By default it chooses the
   first-named path on the commandline. If two duplicates come from the
   same path, it will also apply different fallback sort strategies
   (see the documentation of the -S strategy).

   This behaviour can also be overridden if you know that a certain
   directory contains duplicates and another one originals. In this
   case you write the original directory after specifying a single //
   on the commandline. Everything that comes after is a preferred (or
   "tagged") directory. If there are duplicates from an unpreferred and
   from a preferred directory, the preferred one will always count as
   the original. Special options can also be used to always keep files
   in preferred directories (-k) and to only find duplicates that are
   present in both given directories (-m).

   We advise new users to have a short look at all options rmlint has
   to offer, and maybe test some examples before letting it run on
   productive data. WRONG ASSUMPTIONS ARE THE BIGGEST ENEMY OF YOUR
   DATA. There are some extended examples at the end of this manual,
   and each option that is not self-explanatory also comes with
   examples.

OPTIONS
   General Options
   -T --types="list" (default: defaults)
          Configure the types of lint rmlint will look for. The list
          string is a comma-separated list of lint types or lint groups
          (other separators like semicolon or space also work).

          One of the following groups can be specified at the beginning
          of the list:

          · all: Enables all lint types.

          · defaults: Enables all lint types except nonstripped.

          · minimal: defaults minus emptyfiles and emptydirs.

          · minimaldirs: defaults minus emptyfiles, emptydirs and
            duplicates, but with duplicatedirs.

          · none: Disables all lint types [default].

          Any of the following lint types can be added individually, or
          deselected by prefixing with a -:

          · badids, bi: Find files with bad UID, GID or both.

          · badlinks, bl: Find bad symlinks pointing nowhere valid.

          · emptydirs, ed: Find empty directories.

          · emptyfiles, ef: Find empty files.

          · nonstripped, ns: Find nonstripped binaries.

          · duplicates, df: Find duplicate files.

          · duplicatedirs, dd: Find duplicate directories (this is the
            same as -D!).

          WARNING: It is good practice to enclose the list in single or
          double quotes. In obscure cases argument parsing might fail
          in weird ways, especially when using spaces as separator.

          Example:

          $ rmlint -T "df,dd"       # Only search for duplicate files and directories
          $ rmlint -T "all -df -dd" # Search for all lint except duplicate files and dirs.

   -o --output=spec / -O --add-output=spec (default: -o sh:rmlint.sh -o
   pretty:stdout -o summary:stdout -o json:rmlint.json)
          Configure the way rmlint outputs its results. A spec is in
          the form format:file or just format. A file might either be
          an arbitrary path or stdout or stderr. If file is omitted,
          stdout is assumed. format is the name of a formatter
          supported by this program. For a list of formatters and their
          options, refer to the Formatters section below.

          If -o is specified, rmlint's default outputs are overwritten.
          With -O the defaults are preserved. Either -o or -O may be
          specified multiple times to get multiple outputs, including
          multiple outputs of the same format.

          Examples:

          $ rmlint -o json                # Stream the json output to stdout
          $ rmlint -O csv:/tmp/rmlint.csv # Output an extra csv file to /tmp

   -c --config=spec[=value] (default: none)
          Configure a formatter. This option can be used to fine-tune
          the behaviour of the existing formatters. See the Formatters
          section for details on the available keys.

          If the value is omitted it is set to a value meaning
          "enabled".

          Examples:

          $ rmlint -c sh:link           # Smartly link duplicates instead of removing
          $ rmlint -c progressbar:fancy # Use a different theme for the progressbar

   -z --perms[=[rwx]] (default: no check)
          Only look into a file if it is readable, writable or
          executable by the current user. Which of these to check can
          be given as argument, as a combination of the letters "rwx".

          If no argument is given, "rw" is assumed. Note that r does
          basically nothing user-visible, since rmlint will ignore
          unreadable files anyway. It's just there for the sake of
          completeness.

          By default this check is not done.

          $ rmlint -z rx $(echo $PATH | tr ":" " ") # Look at all executable files in $PATH

   -a --algorithm=name (default: blake2b)
          Choose the algorithm to use for finding duplicate files. The
          algorithm can be either paranoid (byte-by-byte file
          comparison) or use one of several file hash algorithms to
          identify duplicates. The following hash families are
          available (in approximate descending order of cryptographic
          strength):

          · sha3, blake

          · sha

          · highway, md

          · metro, murmur, xxhash

          The weaker hash functions still offer excellent distribution
          properties, but are potentially more vulnerable to malicious
          crafting of duplicate files.

          The full list of hash functions (in decreasing order of
          checksum length) is:

          · 512-bit: blake2b, blake2bp, sha3-512, sha512

          · 384-bit: sha3-384

          · 256-bit: blake2s, blake2sp, sha3-256, sha256, highway256,
            metro256, metrocrc256

          · 160-bit: sha1

          · 128-bit: md5, murmur, metro, metrocrc

          · 64-bit: highway64, xxhash

          The use of a 64-bit hash length for detecting duplicate files
          is not recommended, due to the probability of a random hash
          collision.

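          A back-of-envelope birthday-bound estimate illustrates why
          (this calculation is an illustration, not part of rmlint):
          with an ideal 64-bit hash, the chance of at least one random
          collision among n files is roughly n^2 / 2^65.

          ```shell
          # Rough birthday bound for an ideal 64-bit hash:
          # P(collision) ~= n^2 / 2^65. For ten million hashed files:
          awk 'BEGIN { n = 1e7; printf "%.2g\n", n * n / 2^65 }'
          ```

          With these numbers the estimate comes out to about 2.7e-06,
          and it grows quadratically with the file count.
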
   -p --paranoid / -P --less-paranoid (default)
          Increase or decrease the paranoia of rmlint's duplicate
          algorithm. Use -p if you want byte-by-byte comparison without
          any hashing.

          · -p is equivalent to --algorithm=paranoid

          · -P is equivalent to --algorithm=highway256

          · -PP is equivalent to --algorithm=metro256

          · -PPP is equivalent to --algorithm=metro

   -v --loud / -V --quiet
          Increase or decrease the verbosity. You can pass these
          options several times. This only affects rmlint's logging on
          stderr, but not the outputs defined with -o. Passing either
          option more than three times has no further effect.

   -g --progress / -G --no-progress (default)
          Show a progressbar with sane defaults.

          Convenience shortcut for -o progressbar -o summary -o
          sh:rmlint.sh -o json:rmlint.json -VVV.

          NOTE: This flag clears all previous outputs. If you want
          additional outputs, specify them after this flag using -O.

   -D --merge-directories (default: disabled)
          Makes rmlint use a special mode where all found duplicates
          are collected and checked whether whole directory trees are
          duplicates. Use with caution: you should always make sure
          that the investigated directory is not modified while rmlint
          or its removal scripts are running.

          IMPORTANT: Definition of equal: Two directories are
          considered equal by rmlint if they contain the exact same
          data, no matter how the files containing the data are named.
          Imagine that rmlint creates a long, sorted stream out of the
          data found in the directory and compares this in a magic way
          to another directory. This means that the layout of the
          directory is not considered to be important by default. Also,
          empty files will not count as content. This might be
          surprising to some users, but remember that rmlint generally
          cares only about content, not about any other metadata or
          layout. If you want to only find trees with the same
          hierarchy you should use --honour-dir-layout / -j.

          Output is deferred until all duplicates have been found.
          Duplicate directories are printed first, followed by any
          remaining duplicate files that are isolated or inside of any
          original directories.

          --rank-by applies to directories too, but 'p' or 'P' (path
          index) has no defined (i.e. useful) meaning. Sorting only
          takes place when the number of preferred files in the
          directory differs.

          NOTES:

          · This option enables --partial-hidden and -@
            (--see-symlinks) for convenience. If this is not desired,
            you should change this after specifying -D.

          · This feature might add some runtime for large datasets.

          · When using this option, you will not be able to use the -c
            sh:clone option. Use -c sh:link as a good alternative.

   -j --honour-dir-layout (default: disabled)
          Only recognize directories as duplicates that have the same
          path layout. In other words: all duplicates that build the
          duplicate directory must have the same path from the root of
          each respective directory. This flag makes no sense without
          --merge-directories.

   -y --sort-by=order (default: none)
          During output, sort the found duplicate groups by the
          criteria described by order. order is a string that may
          consist of one or more of the following letters:

          · s: Sort by size of group.

          · a: Sort alphabetically by the basename of the original.

          · m: Sort by mtime of the original.

          · p: Sort by path-index of the original.

          · o: Sort by natural found order (might be different on each
            run).

          · n: Sort by number of files in the group.

          The letters may also be written uppercase (similar to -S /
          --rank-by) to reverse the sorting. Note that rmlint has to
          hold back all results until the end of the run before sorting
          and printing.

   -w --with-color (default) / -W --no-with-color
          Use color escapes for pretty output, or disable them. If you
          pipe rmlint's output to a file, -W is assumed automatically.

   -h --help / -H --show-man
          Show a shorter reference help text (-h) or the full man page
          (-H).

   --version
          Print the version of rmlint. Includes the git revision and
          compile-time features. Please include this when giving
          feedback to us.

   Traversal Options
   -s --size=range (default: 1)
          Only consider files as duplicates in a certain size range.
          The format of range is min-max, where both ends can be
          specified as a number with an optional multiplier. The
          available multipliers are:

          · C (1^1), W (2^1), B (512^1), K (1000^1), KB (1024^1), M
            (1000^2), MB (1024^2), G (1000^3), GB (1024^3),

          · T (1000^4), TB (1024^4), P (1000^5), PB (1024^5), E
            (1000^6), EB (1024^6)

          The size format is about the same as dd(1) uses. A valid
          example would be: "100KB-2M". This limits duplicates to a
          range from 100 Kilobyte to 2 Megabyte.

          It's also possible to specify only one size. In this case the
          size is interpreted as "bigger or equal". If you want to
          filter for files up to this size you can add a - in front
          (-s -1M == -s 0-1M).

          Edge case: The default excludes empty files from the
          duplicate search. Normally these are treated specially by
          rmlint by handling them as other lint. If you want to include
          empty files as duplicates you should lower the limit to zero:

          $ rmlint -T df --size 0

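          For illustration, the bounds of the example range "100KB-2M"
          can be worked out with plain shell arithmetic (this is not an
          rmlint invocation): KB is the binary multiplier (1024), M the
          decimal one (1000^2).

          ```shell
          # "100KB-2M" expands to the byte range 102400-2000000:
          awk 'BEGIN { min = 100 * 1024; max = 2 * 1000^2; print min "-" max }'
          ```
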
   -d --max-depth=depth (default: INF)
          Only recurse up to this depth. A depth of 1 would disable
          recursion and is equivalent to a directory listing. A depth
          of 2 would also consider all children directories, and so on.

   -l --hardlinked (default) / --keep-hardlinked / -L --no-hardlinked
          Hardlinked files are treated as duplicates by default
          (--hardlinked). If --keep-hardlinked is given, rmlint will
          not delete any files that are hardlinked to an original in
          their respective group. Such files will be displayed like
          originals, i.e. for the default output with a "ls" in front.
          The reasoning here is to maximize the number of kept files
          while maximizing the amount of freed space: removing
          hardlinks to originals would not free any space anyway.

          If --no-hardlinked is given, only one file (of a set of
          hardlinked files) is considered; all the others are ignored.
          This means they are not deleted and not even shown in the
          output. The "highest ranked" of the set is the one that is
          considered.

   -f --followlinks / -F --no-followlinks / -@ --see-symlinks (default)
          -f will always follow symbolic links. If filesystem loops
          occur, rmlint will detect this. If -F is specified, symbolic
          links are ignored completely; if -@ is specified, rmlint will
          see symlinks and treat them like small files with the path to
          their target in them. The latter is the default behaviour,
          since it is a sensible default for --merge-directories.

   -x --no-crossdev / -X --crossdev (default)
          Stay always on the same device (-x), or allow crossing
          mountpoints (-X). The latter is the default.

   -r --hidden / -R --no-hidden (default) / --partial-hidden
          Also traverse hidden directories? This is often not a good
          idea, since directories like .git/ would be investigated,
          possibly leading to the deletion of internal git files which
          in turn breaks the repository. With --partial-hidden, hidden
          files and folders are only considered if they're inside
          duplicate directories (see --merge-directories) and will be
          deleted as part of them.

   -b --match-basename
          Only consider those files as dupes that have the same
          basename. See also man 1 basename. The comparison of the
          basenames is case-insensitive.

   -B --unmatched-basename
          Only consider those files as dupes that do not share the same
          basename. See also man 1 basename. The comparison of the
          basenames is case-insensitive.

   -e --match-with-extension / -E --no-match-with-extension (default)
          Only consider those files as dupes that have the same file
          extension. For example, two photos would only match if both
          are a .png. The extension is compared case-insensitively, so
          .PNG is the same as .png.

   -i --match-without-extension / -I --no-match-without-extension
   (default)
          Only consider those files as dupes that have the same
          basename minus the file extension. For example: banana.png
          and Banana.jpeg would be considered, while apple.png and
          peach.png won't. The comparison is case-insensitive.

   -n --newer-than-stamp=<timestamp_filename> /
   -N --newer-than=<iso8601_timestamp_or_unix_timestamp>
          Only consider files (and their size siblings for duplicates)
          newer than a certain modification time (mtime). The age
          barrier may be given as seconds since the epoch or as an
          ISO8601 timestamp like 2014-09-08T00:12:32+0200.

          -n expects a file from which it can read the timestamp. After
          the rmlint run, the file will be updated with the current
          timestamp. If the file does not initially exist, no filtering
          is done but the stampfile is still written.

          -N, in contrast, takes the timestamp directly and will not
          write anything.

          Note that rmlint will find duplicates newer than timestamp,
          even if the original is older. If you only want to find
          duplicates where both original and duplicate are newer than
          timestamp you can use find(1):

          · find -mtime -1 -print0 | rmlint -0 # pass all files younger
            than a day to rmlint

          Note: you can make rmlint write out a compatible timestamp
          with:

          · -O stamp:stdout # Write a seconds-since-epoch timestamp to
            stdout on finish.

          · -O stamp:stdout -c stamp:iso8601 # Same, but write as
            ISO8601.

   Original Detection Options
   -k --keep-all-tagged / -K --keep-all-untagged
          Don't delete any duplicates that are in tagged paths (-k) or
          that are in non-tagged paths (-K). (Tagged paths are those
          that were named after //).

   -m --must-match-tagged / -M --must-match-untagged
          Only look for duplicates of which at least one is in one of
          the tagged paths. (Paths that were named after //).

          Note that the combinations -kM and -Km are prohibited by
          rmlint. See https://github.com/sahib/rmlint/issues/244 for
          more information.

   -S --rank-by=criteria (default: pOma)
          Sort the files in a group of duplicates into originals and
          duplicates by one or more criteria. Each criterion is defined
          by a single letter (except r and x, which expect a regex
          pattern after the letter). Multiple criteria may be given as
          a string, where the first criterion is the most important. If
          one criterion cannot decide between original and duplicate,
          the next one is tried.

          · m: keep lowest mtime (oldest)
            M: keep highest mtime (newest)

          · a: keep first alphabetically
            A: keep last alphabetically

          · p: keep first named path
            P: keep last named path

          · d: keep path with lowest depth
            D: keep path with highest depth

          · l: keep path with shortest basename
            L: keep path with longest basename

          · r: keep paths matching regex
            R: keep paths not matching regex

          · x: keep basenames matching regex
            X: keep basenames not matching regex

          · h: keep file with lowest hardlink count
            H: keep file with highest hardlink count

          · o: keep file with lowest number of hardlinks outside of the
            paths traversed by rmlint.

          · O: keep file with highest number of hardlinks outside of
            the paths traversed by rmlint.

          Alphabetical sort will only use the basename of the file and
          ignore its case. One can have multiple criteria, e.g.: -S am
          will choose first alphabetically; if tied, then by mtime.
          Note: original path criteria (specified using //) will always
          take priority over -S options.

          For more fine-grained control, it is possible to give a
          regular expression to sort by. This can be useful when you
          know a common fact that identifies original paths (like a
          path component being src or a certain file ending).

          To use a regular expression you simply enclose it in the
          criteria string by adding <REGULAR_EXPRESSION> after
          specifying r or x. Example: -S 'r<.*\.bak$>' makes all files
          that have a .bak suffix original files.

          Warning: When using r or x, try to make your regex as
          specific as possible! Good practice includes adding a $
          anchor at the end of the regex.

          Tips:

          · l is useful for files like file.mp3 vs file.1.mp3 or
            file.mp3.bak.

          · a can be used as last criterion to assert a defined order.

          · o/O and h/H are only useful if there are any hardlinks in
            the traversed paths.

          · o/O takes the number of hardlinks outside the traversed
            paths (and thereby minimizes/maximizes the overall number
            of hardlinks). h/H in contrast only takes the number of
            hardlinks inside of the traversed paths. When hardlinking
            files, one would like to link to the original file with the
            highest outer link count (O) in order to maximise the space
            cleanup. H does not maximise the space cleanup, it just
            selects the file with the highest total hardlink count. You
            usually want to specify O.

          · pOma is the default since p ensures that first given paths
            rank as originals, O ensures that hardlinks are handled
            well, m ensures that the oldest file is the original, and a
            simply ensures a defined ordering if no other criterion
            applies.

   Caching
   --replay
          Read an existing json file and re-output it. When --replay is
          given, rmlint does no input/output on the filesystem, even if
          you pass additional paths. The paths you pass will be used
          for filtering the --replay output.

          This is very useful if you want to reformat, refilter or
          resort the output you got from a previous run. Usage is
          simple: just pass --replay on the second run, with the
          options changed to the new formatters or filters. Pass the
          .json files of the previous runs in addition to the paths you
          ran rmlint on. You can also merge several previous runs by
          specifying more than one .json file; in this case it will
          merge all files given and output them as one big run.

          If you want to view only the duplicates of certain
          subdirectories, just pass them on the commandline as usual.

          The usage of // has the same effect as in a normal run. It
          can be used to prefer one .json file over another. However,
          note that running rmlint in --replay mode involves no real
          disk traversal, i.e. only duplicates from previous runs are
          printed. Therefore, specifying paths that were not part of
          those runs will simply have no effect. As a safety measure,
          --replay will ignore files whose mtime changed in the
          meantime (i.e. the mtime in the .json file differs from the
          current one). These files might have been modified and are
          silently ignored.

          By design, some options will not have any effect. Those are:

          · --followlinks

          · --algorithm

          · --paranoid

          · --clamp-low

          · --hardlinked

          · --write-unfinished

          · ... and all other caching options below.

          NOTE: In --replay mode, a new .json file will be written to
          rmlint.replay.json in order to avoid overwriting rmlint.json.

   -C --xattr
          Shortcut for --xattr-read, --xattr-write and
          --write-unfinished. This will write a checksum and a
          timestamp to the extended attributes of each file that rmlint
          hashed. This speeds up subsequent runs on the same data set.
          Please note that not all filesystems support extended
          attributes, and you need write access to use this feature.

          See the individual options below for more details and some
          examples.

   --xattr-read / --xattr-write / --xattr-clear
          Read or write cached checksums from the extended file
          attributes. This feature can be used to speed up consecutive
          runs.

          CAUTION: This could potentially lead to false positives if
          file contents are somehow modified without changing the file
          modification time. rmlint uses the mtime to determine whether
          a checksum is outdated. This is not a problem if you use the
          clone or reflink operation on a filesystem like btrfs. There,
          an outdated checksum entry would simply lead to some
          duplicate work done in the kernel, but would do no harm
          otherwise.

          NOTE: Many tools do not support extended file attributes
          properly, resulting in a loss of the information when copying
          the file or editing it.

          NOTE: You can specify --xattr-write and --xattr-read at the
          same time. This will read existing checksums at the start of
          the run and update all hashed files at the end.

          Usage example:

          $ rmlint large_file_cluster/ -U --xattr-write # First run should be slow.
          $ rmlint large_file_cluster/ --xattr-read     # Second run should be faster.

          # Or do the same in just one run:
          $ rmlint large_file_cluster/ --xattr

   -U --write-unfinished
          Include files in the output that have not been hashed fully,
          i.e. files that do not appear to have a duplicate. Note that
          this will not include all files that rmlint traversed, but
          only the files that were chosen to be hashed.

          This is mainly useful in conjunction with
          --xattr-write/--xattr-read. When re-running rmlint on a large
          dataset this can greatly speed up a re-run in some cases.
          Please refer to --xattr-read for an example.

          If you want to output unique files, please look into the
          uniques output formatter.

   Rarely used, miscellaneous options
   -t --threads=N (default: 16)
          The number of threads to use during file tree traversal and
          hashing. rmlint probably knows better than you how to set
          this value, so just leave it as it is. Setting it to 1 will
          not make rmlint a single-threaded program, either.

   -u --limit-mem=size
          Apply a maximum amount of memory to use for hashing and
          --paranoid. The total amount of memory might still exceed
          this limit, especially when setting it very low. In general,
          however, rmlint will consume about this amount of memory plus
          a more or less constant extra amount that depends on the data
          you are scanning.

          The size description has the same format as for --size, so
          you can do something like this (use this if you have 1GB of
          memory available):

          $ rmlint -u 512M # Limit paranoid mem usage to 512 MB

   -q --clamp-low=[fac.tor|percent%|offset] (default: 0) /
   -Q --clamp-top=[fac.tor|percent%|offset] (default: 1.0)
          The argument can be passed either as a factor (a number with
          a . in it), a percent value (suffixed by %) or as an absolute
          number or size spec, like in --size.

          Only look at the content of files in the range from low to
          (including) high. This means that if the range is less than
          -q 0% to -Q 100%, then only partial duplicates are searched
          for. If the file size is less than the clamp limits, the file
          is ignored during traversal. Be careful when using this
          function; you can easily get dangerous results for small
          files.

          This is useful in a few cases where a file consists of a
          constant-sized header or footer. With this option you can
          compare just the data in between. It might also be useful for
          approximate comparison where it suffices that the file is the
          same in the middle part.

          Example:

          $ rmlint -q 10% -Q 512M # Only read the last 90% of a file, but read at most 512MB

   -Z --mtime-window=T (default: -1)
          Only consider those files as duplicates that have the same
          content and the same modification time (mtime) within a
          window of T seconds. If T is 0, both files need to have the
          same mtime. For T=1 they may differ by one second, and so on.
          If the window size is negative, the mtime of duplicates will
          not be considered. T may be a floating point number.

          However, with three (or more) files, the mtime difference
          between two duplicates can be bigger than the mtime window T,
          i.e. several files may be chained together by the window.
          Example: if T is 1, the four files fooA (mtime: 00:00:00),
          fooB (00:00:01), fooC (00:00:02) and fooD (00:00:03) would
          all belong to the same duplicate group, although the mtimes
          of fooA and fooD differ by 3 seconds.

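          The chaining example above can be reproduced with plain shell
          tools. This sketch only sets up the four files and shows
          their mtimes; the timestamps are arbitrary, rmlint itself is
          not run here, and GNU touch/stat are assumed:

          ```shell
          # Four identical files, each 1 second apart in mtime. With
          # `rmlint -Z 1` every adjacent pair lies inside the window,
          # so all four files would be chained into one group.
          demo=$(mktemp -d)
          printf 'same content' > "$demo/fooA"
          for f in fooB fooC fooD; do cp "$demo/fooA" "$demo/$f"; done
          touch -d '2014-09-08 00:00:00' "$demo/fooA"
          touch -d '2014-09-08 00:00:01' "$demo/fooB"
          touch -d '2014-09-08 00:00:02' "$demo/fooC"
          touch -d '2014-09-08 00:00:03' "$demo/fooD"
          stat -c '%n %Y' "$demo"/foo*  # fooA and fooD differ by 3s
          ```
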
   --with-fiemap (default) / --without-fiemap
          Enable or disable reading the file extents on rotational
          disks in order to optimize disk access patterns. If this
          feature is not available, it is disabled automatically.

FORMATTERS
   · csv: Output all found lint as a comma-separated-value list.

     Available options:

     · no_header: Do not write a first line describing the column
       headers.

     · unique: Include unique files in the output.

   · sh: Output all found lint as a shell script. This formatter is
     activated by default.

     Available options:

     · cmd: Specify a user-defined command to run on duplicates. The
       command can be any valid /bin/sh-expression. The duplicate path
       and original path can be accessed via "$1" and "$2". The command
       will be written to the user_command function in the sh-file
       produced by rmlint.

     · handler: Define a comma-separated list of handlers to try on
       duplicate files in that given order until one handler succeeds.
       Handlers are just the name of a way of getting rid of the file
       and can be any of the following:

       · clone: For reflink-capable filesystems only. Try to clone both
         files with the FIDEDUPERANGE ioctl(3p) (or
         BTRFS_IOC_FILE_EXTENT_SAME on older kernels). This will free
         up duplicate extents. Needs at least kernel 4.2. Use this
         option when you only have read-only access to a btrfs
         filesystem but still want to deduplicate it. This is usually
         the case for snapshots.

       · reflink: Try to reflink the duplicate file to the original.
         See also --reflink in man 1 cp. Fails if the filesystem does
         not support it.

       · hardlink: Replace the duplicate file with a hardlink to the
         original file. The resulting files will have the same inode
         number. Fails if both files are not on the same partition. You
         can use ls -i to show the inode number of a file and
         find -samefile <path> to find all hardlinks for a certain
         file.

       · symlink: Tries to replace the duplicate file with a symbolic
         link to the original. This handler never fails.

       · remove: Remove the file using rm -rf (-r for duplicate dirs).
         This handler never fails.

       · usercmd: Use the provided user-defined command
         (-c sh:cmd=something). This handler never fails.

       Default is remove.

     · link: Shortcut for -c sh:handler=clone,reflink,hardlink,symlink.
       Use this if you are on a reflink-capable system.

     · hardlink: Shortcut for -c sh:handler=hardlink,symlink. Use this
       if you want to hardlink files, but want a fallback for
       duplicates that lie on different devices.

     · symlink: Shortcut for -c sh:handler=symlink. Use this as a last
       straw.

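     To illustrate what the hardlink handler boils down to, here is the
     equivalent operation with plain shell commands (a sketch with
     made-up file names; the script rmlint generates performs
     additional safety checks):

     ```shell
     # Replace a duplicate with a hardlink to the original, then verify
     # that both names share one inode (works only within a partition).
     demo=$(mktemp -d) && cd "$demo"
     printf 'payload' > original.txt
     printf 'payload' > duplicate.txt  # same content, separate inode
     ln -f original.txt duplicate.txt  # what the hardlink handler does
     ls -i original.txt duplicate.txt  # both show the same inode number
     find . -samefile original.txt     # lists both paths
     ```
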
768 · json: Print a JSON-formatted dump of all found reports. Outputs all
769 lint as a json document. The document is a list of dictionaries,
770 where the first and last element is the header and the footer. Every‐
771 thing between are data-dictionaries.
772
773 Available options:
774
775 · unique: Include unique files in the output.
776
777 · no_header=[true|false]: Print the header with metadata (default:
778 true)
779
780 · no_footer=[true|false]: Print the footer with statistics (default:
781 true)
782
783 · oneline=[true|false]: Print one json document per line (default:
784 false) This is useful if you plan to parse the output line-by-line,
785 e.g. while rmlint is sill running.
786
787 This formatter is extremely useful if you need to script more
788 complex behaviour that is not directly possible with rmlint's
789 built-in options. A very handy tool here is jq. Here is an example
790 that outputs all original files directly from a rmlint run:
791
792 $ rmlint -o json | jq -r '.[1:-1][] | select(.is_original) | .path'
793
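The same extraction can be done without jq, for example from Python. The document below is a hypothetical, shortened example of the layout described above (first element header, last element footer, data dictionaries in between); real rmlint output contains more fields.

```python
import json

# Hypothetical document following the described layout: the first
# element is the header, the last is the footer, everything in
# between is a data dictionary describing one piece of lint.
doc = json.loads("""
[
  {"description": "rmlint json-dump of lint files"},
  {"type": "duplicate_file", "path": "/tmp/a", "is_original": true},
  {"type": "duplicate_file", "path": "/tmp/b", "is_original": false},
  {"aborted": false, "total_files": 2}
]
""")

header, *entries, footer = doc

# Equivalent of the jq filter above: keep only the originals' paths.
originals = [e["path"] for e in entries if e.get("is_original")]
print(originals)
```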
794 · py: Outputs a python script and a JSON document, just like the json
795 formatter. The JSON document is written to .rmlint.json, executing
796 the script will make it read from there. This formatter is mostly
797 intended for complex use-cases where the lint needs special handling
798 that you define in the python script. Therefore the python script
799 can be modified to do things standard rmlint is not able to do eas‐
800 ily.
801
802 · uniques: Outputs all unique paths found during the run, one path per
803 line. This is often useful for scripting purposes.
804
805 Available options:
806
807 · print0: Separate paths with zero bytes instead of newlines.
808
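Zero-byte separators keep the output unambiguous even for filenames that contain newlines. A minimal Python sketch of a consumer (the byte string below stands in for what would normally be read from rmlint's stdout):

```python
# Stand-in for the bytes a consumer would read from
# `rmlint -o uniques:print0`; a real script would use
# sys.stdin.buffer.read() instead of a literal.
raw = b"/tmp/plain.txt\x00/tmp/with\nnewline.txt\x00"

# Split on NUL bytes; the trailing delimiter leaves one empty
# field, which is filtered out.
paths = [p.decode() for p in raw.split(b"\x00") if p]
print(paths)
```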
809 · stamp:
810
811 Outputs a timestamp of the time rmlint was run. See also the
812 --newer-than and --newer-than-stamp file options.
813
814 Available options:
815
816 · iso8601=[true|false]: Write ISO8601-formatted timestamps instead
817 of seconds since the epoch.
818
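A script that consumes the stamp file must therefore handle both representations. A small Python sketch, assuming hypothetical stamp contents (both values below denote the same instant):

```python
from datetime import datetime, timezone

def parse_stamp(text):
    """Parse a stamp that is either seconds since the epoch
    or an ISO8601 timestamp."""
    text = text.strip()
    try:
        # Epoch-seconds form, e.g. "1596283200".
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    except ValueError:
        # ISO8601 form, e.g. "2020-08-01T12:00:00+00:00".
        return datetime.fromisoformat(text)

epoch_stamp = parse_stamp("1596283200")
iso_stamp = parse_stamp("2020-08-01T12:00:00+00:00")
print(epoch_stamp, iso_stamp)
```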
819 · progressbar: Shows a progressbar. This is meant for use with stdout
820 or stderr [default].
821
822 See also: -g (--progress) for a convenience shortcut option.
823
824 Available options:
825
826 · update_interval=number: Number of milliseconds to wait between
827 updates. Higher values use less resources (default 50).
828
829 · ascii: Do not attempt to use unicode characters, which might not be
830 supported by some terminals.
831
832 · fancy: Use a more fancy style for the progressbar.
833
834 · pretty: Shows all found items in realtime, nicely colored. This
835 formatter is active by default.
836
837 · summary: Shows counts of files and their respective size after the
838 run. Also lists all written output files.
839
840 · fdupes: Prints an output similar to the popular duplicate finder
841 fdupes(1). At first a progressbar is printed on stderr. Afterwards
842 the found files are printed on stdout; each set of duplicates gets
843 printed as a block separated by newlines. Originals are highlighted
844 in green. At the bottom a summary is printed on stderr. This is
845 mostly useful for scripts that were set up for parsing fdupes output.
846 We recommend the json formatter for every other scripting purpose.
847
848 Available options:
849
850 · omitfirst: Same as the -f / --omitfirst option in fdupes(1). Omits
851 the first line of each set of duplicates (i.e. the original file).
852
853 · sameline: Same as the -1 / --sameline option in fdupes(1). Does not
854 print newlines between files, only a space. Newlines are printed
855 only between sets of duplicates.
856
857 OTHER STAND-ALONE COMMANDS
858 rmlint --gui
859 Start the optional graphical frontend to rmlint called Shredder.
860
861 This will only work when Shredder and its dependencies are
862 installed. See also:
863 http://rmlint.readthedocs.org/en/latest/gui.html
864
865 The gui has its own set of options; see --gui --help for a list.
866 These should be placed at the end, i.e. rmlint --gui [options],
867 when calling it from the commandline.
868
869 rmlint --hash [paths...]
870 Make rmlint work as a multi-threaded file hash utility, similar
871 to the popular md5sum or sha1sum utilities, but faster and with
872 more algorithms. A set of paths given on the commandline or
873 from stdin is hashed using one of the available hash algorithms.
874 Use rmlint --hash -h to see options.
875
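The idea (chunked hashing fanned out over a pool of workers) can be sketched in Python using the standard library's blake2b, which is also rmlint's default algorithm. This illustrates the concept only; it is not rmlint's actual implementation:

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def hash_file(path, algo="blake2b"):
    """Hash one file in chunks so large files do not fill memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

# Create two small throwaway files with identical content to hash.
tmp = tempfile.mkdtemp()
paths = []
for name, data in [("a", b"hello"), ("b", b"hello")]:
    p = os.path.join(tmp, name)
    with open(p, "wb") as f:
        f.write(data)
    paths.append(p)

# Hash all paths concurrently, as rmlint --hash does with its workers.
with ThreadPoolExecutor() as pool:
    digests = dict(pool.map(hash_file, paths))
print(digests)
```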
876 rmlint --equal [paths...]
877 Check if the paths given on the commandline all have equal con‐
878 tent. If all paths are equal and no other error occurred, rmlint
879 will exit with exit code 0. Otherwise it will exit with a
880 nonzero exit code. All other options can be used as normal, but
881 note that no other formatters (sh, csv etc.) will be executed by
882 default. At least two paths need to be passed.
883
884 Note: This even works for directories and also in combination
885 with paranoid mode (pass -pp for byte comparison); remember that
886 rmlint does not care about the layout of the directory, but only
887 about the content of the files in it.
889
890 By default this will use hashing to compare the files and/or
891 directories.
892
893 rmlint --dedupe [-r] [-v|-V] <src> <dest>
894 If the filesystem supports sharing physical storage between
895 multiple files, and if src and dest have the same content, this
896 command makes the data in the src file appear in the dest file
897 by sharing the underlying storage.
898
899 This command is similar to cp --reflink=always <src> <dest>,
900 except that it (a) checks that src and dest have identical data
901 and (b) makes no changes to dest's metadata.
902
903 Running with the -r option will enable deduplication of read-only
904 [btrfs] snapshots (requires root).
905
906 rmlint --is-reflink [-v|-V] <file1> <file2>
907 Tests whether file1 and file2 are reflinks (reference same
908 data). This command makes rmlint exit with one of the following
909 exit codes:
910
911 · 0: files are reflinks
912
913 · 1: files are not reflinks
914
915 · 3: not a regular file
916
917 · 4: file sizes differ
918
919 · 5: fiemaps can't be read
920
921 · 6: file1 and file2 are the same path
922
923 · 7: file1 and file2 are the same file under different mount‐
924 points
925
926 · 8: files are hardlinks
927
928 · 9: files are symlinks
929
930 · 10: files are not on same device
931
932 · 11: other error encountered
933
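A wrapper script can map these exit codes back to messages. The Python table below simply mirrors the list above; the rmlint invocation in the comment is indicative only and is not executed here:

```python
# Mirrors the --is-reflink exit-code table, for use after e.g.
# subprocess.run(["rmlint", "--is-reflink", f1, f2]).returncode
IS_REFLINK_CODES = {
    0: "files are reflinks",
    1: "files are not reflinks",
    3: "not a regular file",
    4: "file sizes differ",
    5: "fiemaps can't be read",
    6: "file1 and file2 are the same path",
    7: "file1 and file2 are the same file under different mountpoints",
    8: "files are hardlinks",
    9: "files are symlinks",
    10: "files are not on same device",
    11: "other error encountered",
}

print(IS_REFLINK_CODES[0])
```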
934 EXAMPLES
935 This is a collection of common use cases and other tricks:
936
937 · Check the current working directory for duplicates.
938
939 $ rmlint
940
941 · Show a progressbar:
942
943 $ rmlint -g
944
945 · Quick re-run on large datasets using different ranking criteria on
946 second run:
947
948 $ rmlint large_dir/ # First run; writes rmlint.json
949
950 $ rmlint --replay rmlint.json large_dir -S MaD
951
952 · Merge together previous runs, but prefer the originals to be from
953 b.json and make sure that no files are deleted from b.json:
954
955 $ rmlint --replay a.json // b.json -k
956
957 · Search only for duplicates and duplicate directories
958
959 $ rmlint -T "df,dd" .
960
961 · Compare files byte-by-byte in current directory:
962
963 $ rmlint -pp .
964
965 · Find duplicates with same basename (excluding extension):
966
967 $ rmlint -e
968
969 · Do more complex traversal using find(1).
970
971 $ find /usr/lib -iname '*.so' -type f | rmlint - # find all duplicate
972 .so files
973
974 $ find /usr/lib -iname '*.so' -type f -print0 | rmlint -0 # as above
975 but handles filenames with newline character in them
976
977 $ find ~/pics -iname '*.png' | ./rmlint - # compare png files only
978
979 · Limit file size range to investigate:
980
981 $ rmlint -s 2GB # Find everything >= 2GB
982
983 $ rmlint -s 0-2GB # Find everything < 2GB
984
985 · Only find writable and executable files:
986
987 $ rmlint --perms wx
988
989 · Reflink if possible, else hardlink duplicates to original if possi‐
990 ble, else replace duplicate with a symbolic link:
991
992 $ rmlint -c sh:link
993
994 · Inject user-defined command into shell script output:
995
996 $ rmlint -o sh -c sh:cmd='echo "original:" "$2" "is the same as"
997 "$1"'
998
999 · Use shred to overwrite the contents of a file fully:
1000
1001 $ rmlint -c 'sh:cmd=shred -un 10 "$1"'
1002
1003 · Use data as master directory. Find only duplicates in backup that are
1004 also in data. Do not delete any files in data:
1005
1006 $ rmlint backup // data --keep-all-tagged --must-match-tagged
1007
1008 · Check whether the directories a, b and c are equal:
1009
1010 $ rmlint --equal a b c && echo "Files are equal" || echo "Files are
1011 not equal"
1012
1013 · Test if two files are reflinks
1014
1015 $ rmlint --is-reflink a b && echo "Files are reflinks" || echo "Files
1016 are not reflinks"
1017
1018 · Cache calculated checksums for next run. The checksums will be writ‐
1019 ten to the extended file attributes:
1020
1021 $ rmlint --xattr
1022
1023 · Produce a list of unique files in a folder:
1024
1025 $ rmlint -o uniques
1026
1027 · Produce a list of files that are unique, including original files
1028 ("one of each"):
1029
1030 $ rmlint t -o json -o uniques:unique_files | jq -r '.[1:-1][] |
1031 select(.is_original) | .path' | sort > original_files
1032 $ cat unique_files original_files
1033
1034 · Sort files by a user-defined regular expression
1035
1036 # Always keep files with ABC or DEF in their basename,
1037 # dismiss all duplicates with tmp, temp or cache in their names
1038 # and if none of those are applicable, keep the oldest files instead.
1039 $ ./rmlint -S 'x<.*(ABC|DEF).*>X<.*(tmp|temp|cache).*>m' /some/path
1040
1041 · Sort files by adding priorities to several user-defined regular
1042 expressions:
1043
1044 # Unlike the previous snippet, this one uses priorities:
1045 # Always keep files in ABC, DEF, GHI by following that particular order of
1046 # importance (ABC has a top priority), dismiss all duplicates with
1047 # tmp, temp, cache in their paths and if none of those are applicable,
1048 # keep the oldest files instead.
1049 $ rmlint -S 'r<.*ABC.*>r<.*DEF.*>r<.*GHI.*>R<.*(tmp|temp|cache).*>m' /some/path
1050
1051 PROBLEMS
1052 1. False Positives: Depending on the options you use, there is a very
1053 slight risk of false positives (files that are erroneously detected
1054 as duplicate). The default hash function (blake2b) is very safe but
1055 in theory it is possible for two files to have the same hash. If
1056 you had 10^73 different files, all the same size, then the chance of
1057 a false positive is still less than 1 in a billion. If you're con‐
1058 cerned just use the --paranoid (-pp) option. This will compare all
1059 the files byte-by-byte and is not much slower than blake2b (it may
1060 even be faster), although it is a lot more memory-hungry.
1061
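To get a feeling for what the default hash function does, here is a short Python sketch using the standard library's blake2b: identical content always yields identical digests, while a false positive would require two different inputs to collide.

```python
import hashlib

# Same bytes hashed twice, plus a one-byte variation.
a = hashlib.blake2b(b"some file content").hexdigest()
b = hashlib.blake2b(b"some file content").hexdigest()
c = hashlib.blake2b(b"some file content!").hexdigest()

# Equal content gives an equal digest; a false positive would need
# two different inputs colliding, which is astronomically unlikely.
assert a == b
assert a != c
print(len(a) // 2, "byte digest")  # blake2b defaults to 64 bytes
```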
1062 2. File modification during or after rmlint run: It is possible that a
1063 file that rmlint recognized as duplicate is modified afterwards,
1064 resulting in a different file. If you use the rmlint-generated
1065 shell script to delete the duplicates, you can run it with the -p
1066 option to do a full re-check of the duplicate against the original
1067 before it deletes the file. When using -c sh:hardlink or -c sh:sym‐
1068 link care should be taken that a modification of one file will now
1069 result in a modification of all files. This is not the case for -c
1070 sh:reflink or -c sh:clone. Use -c sh:link to minimise this risk.
1071
1072 SEE ALSO
1073 Reading the manpages of these tools might help when working with rmlint:
1074
1075 · find(1)
1076
1077 · rm(1)
1078
1079 · cp(1)
1080
1081 Extended documentation and an in-depth tutorial can be found at:
1082
1083 · http://rmlint.rtfd.org
1084
1085 BUGS
1086 If you found a bug, have a feature request or want to say something
1087 nice, please visit https://github.com/sahib/rmlint/issues.
1088
1089 Please make sure to describe your problem in detail. Always include the
1090 version of rmlint (--version). If you experienced a crash, please
1091 include the output of at least one of the following, run against a
1092 debug build of rmlint:
1093
1094 · gdb --ex run -ex bt --args rmlint -vvv [your_options]
1095
1096 · valgrind --leak-check=no rmlint -vvv [your_options]
1097
1098 You can build a debug build of rmlint like this:
1099
1100 · git clone git@github.com:sahib/rmlint.git
1101
1102 · cd rmlint
1103
1104 · scons GDB=1 DEBUG=1
1105
1106 · sudo scons install # Optional
1107
1108 LICENSE
1109 rmlint is licensed under the terms of the GPLv3.
1110
1111 See the COPYRIGHT file that came with the source for more information.
1112
1113 PROGRAM AUTHORS
1114 rmlint was written by:
1115
1116 · Christopher <sahib> Pahl 2010-2017 (https://github.com/sahib)
1117
1118 · Daniel <SeeSpotRun> T. 2014-2017 (https://github.com/SeeSpotRun)
1119
1120 Also see http://rmlint.rtfd.org for other people who helped us.
1121
1122 If you consider a donation you can use Flattr or buy us a beer if we
1123 meet:
1124
1125 https://flattr.com/thing/302682/libglyr
1126
1128 Christopher Pahl, Daniel Thomas
1129
1131 2014-2020, Christopher Pahl & Daniel Thomas
1132
1133
1134
1135
1136 Aug 01, 2020 RMLINT(1)