RMLINT(1)                    rmlint documentation                    RMLINT(1)

NAME
   rmlint - find duplicate files and other space waste efficiently

SYNOPSIS
   rmlint [TARGET_DIR_OR_FILES ...] [//] [TAGGED_TARGET_DIR_OR_FILES ...]
          [-] [OPTIONS]
12
DESCRIPTION
   rmlint finds space waste and other broken things on your filesystem.
   Its main focus lies on finding duplicate files and directories.

   It is able to find the following types of lint:

   · Duplicate files and directories (and, as a result, unique files).

   · Nonstripped binaries (binaries with debug symbols; needs to be
     explicitly enabled).

   · Broken symbolic links.

   · Empty files and directories (also nested empty directories).

   · Files with a broken user or group id.

   rmlint itself WILL NOT DELETE ANY FILES. It does however produce
   executable output (for example a shell script) to help you delete the
   files if you want to. Another design principle is that it should work
   well together with other tools like find. Therefore we do not
   replicate features of other well-known programs, for example pattern
   matching and finding duplicate filenames. However, we provide many
   convenience options for common use cases that are hard to build from
   scratch with standard tools.
38
   In order to find the lint, rmlint is given one or more directories to
   traverse. If no directories or files are given, the current working
   directory is assumed. By default, rmlint will ignore hidden files and
   will not follow symlinks (see Traversal Options). rmlint will first
   find "other lint" and then search the remaining files for duplicates.
44
   rmlint tries to be helpful by guessing which file of a group of
   duplicates is the original (i.e. the file that should not be
   deleted). It does this by using different sorting strategies that can
   be controlled via the -S option. By default it chooses the
   first-named path on the commandline. If two duplicates come from the
   same path, it will also apply different fallback sort strategies (see
   the documentation of the -S strategy).
52
   This behaviour can also be overridden if you know that a certain
   directory contains duplicates and another one originals. In this case
   you write the original directory after specifying a single // on the
   commandline. Everything that comes after is a preferred (or a
   "tagged") directory. If there are duplicates from an unpreferred and
   from a preferred directory, the preferred one will always count as
   the original. Special options can also be used to always keep files
   in preferred directories (-k) and to only find duplicates that are
   present in both given directories (-m).
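
   To make the tagging workflow concrete, a minimal sketch using made-up
   paths (the rmlint call is skipped when the tool is not installed;
   remember that rmlint itself deletes nothing):

```shell
# Hypothetical layout: 'backup' may contain copies of files in 'originals'.
rm -rf /tmp/rmlint_tag_demo
mkdir -p /tmp/rmlint_tag_demo/backup /tmp/rmlint_tag_demo/originals
echo "same bytes" > /tmp/rmlint_tag_demo/originals/report.txt
cp /tmp/rmlint_tag_demo/originals/report.txt /tmp/rmlint_tag_demo/backup/report_copy.txt

# Everything after // is "tagged": files there are preferred as originals.
# -k additionally guarantees nothing inside the tagged path is marked for deletion.
if command -v rmlint >/dev/null 2>&1; then
    rmlint /tmp/rmlint_tag_demo/backup // /tmp/rmlint_tag_demo/originals -k -o summary
fi
```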
62
   We advise new users to have a short look at all options rmlint has to
   offer, and maybe test some examples before letting it run on
   productive data. WRONG ASSUMPTIONS ARE THE BIGGEST ENEMY OF YOUR
   DATA. There are some extended examples at the end of this manual, and
   the description of each option that is not self-explanatory also
   tries to give examples.
68
69 OPTIONS
70 General Options
71 -T --types="list" (default: defaults)
72 Configure the types of lint rmlint will look for. The list
73 string is a comma-separated list of lint types or lint groups
74 (other separators like semicolon or space also work though).
75
76 One of the following groups can be specified at the beginning of
77 the list:
78
79 · all: Enables all lint types.
80
      · defaults: Enables all lint types except nonstripped.
82
83 · minimal: defaults minus emptyfiles and emptydirs.
84
85 · minimaldirs: defaults minus emptyfiles, emptydirs and dupli‐
86 cates, but with duplicatedirs.
87
88 · none: Disable all lint types [default].
89
90 Any of the following lint types can be added individually, or
91 deselected by prefixing with a -:
92
93 · badids, bi: Find files with bad UID, GID or both.
94
95 · badlinks, bl: Find bad symlinks pointing nowhere valid.
96
97 · emptydirs, ed: Find empty directories.
98
99 · emptyfiles, ef: Find empty files.
100
101 · nonstripped, ns: Find nonstripped binaries.
102
103 · duplicates, df: Find duplicate files.
104
105 · duplicatedirs, dd: Find duplicate directories.
106
      WARNING: It is good practice to enclose the list in single or
      double quotes. In obscure cases argument parsing might fail in
      weird ways, especially when using spaces as separator.
110
111 Example:
112
113 $ rmlint -T "df,dd" # Only search for duplicate files and directories
114 $ rmlint -T "all -df -dd" # Search for all lint except duplicate files and dirs.
115
116 -o --output=spec / -O --add-output=spec (default: -o sh:rmlint.sh -o
117 pretty:stdout -o summary:stdout -o json:rmlint.json)
118 Configure the way rmlint outputs its results. A spec is in the
119 form format:file or just format. A file might either be an
120 arbitrary path or stdout or stderr. If file is omitted, stdout
121 is assumed. format is the name of a formatter supported by this
122 program. For a list of formatters and their options, refer to
123 the Formatters section below.
124
      If -o is specified, rmlint's default outputs are overwritten.
      With -O the defaults are preserved. Either -o or -O may be
      specified multiple times to get multiple outputs, including
      multiple outputs of the same format.
129
130 Examples:
131
      $ rmlint -o json                 # Stream the json output to stdout
      $ rmlint -O csv:/tmp/rmlint.csv  # Output an extra csv file to /tmp
134
135 -c --config=spec[=value] (default: none)
136 Configure a format. This option can be used to fine-tune the be‐
137 haviour of the existing formatters. See the Formatters section
138 for details on the available keys.
139
140 If the value is omitted it is set to a value meaning "enabled".
141
142 Examples:
143
144 $ rmlint -c sh:link # Smartly link duplicates instead of removing
145 $ rmlint -c progressbar:fancy # Use a different theme for the progressbar
146
   -z --perms[=[rwx]] (default: no check)
      Only look into a file if it is readable, writable or executable by
      the current user. Which of these are required can be given as an
      argument consisting of the letters "rwx".

      If no argument is given, "rw" is assumed. Note that r does
      basically nothing user-visible, since rmlint will ignore
      unreadable files anyway. It's just there for the sake of
      completeness.

      By default this check is not done.

      $ rmlint -z rx $(echo $PATH | tr ":" " ")  # Look at all executable files in $PATH
160
   -a --algorithm=name (default: blake2b)
      Choose the algorithm to use for finding duplicate files. The
      algorithm can be either paranoid (byte-by-byte file comparison)
      or use one of several file hash algorithms to identify
      duplicates. The following hash families are available (in
      approximate descending order of cryptographic strength):

         sha3, blake
         sha
         highway, md
         metro, murmur, xxhash

      The weaker hash functions still offer excellent distribution
      properties, but are potentially more vulnerable to malicious
      crafting of duplicate files.
179
180 The full list of hash functions (in decreasing order of checksum
181 length) is:
182
183 512-bit: blake2b, blake2bp, sha3-512, sha512
184
      384-bit: sha3-384
186
187 256-bit: blake2s, blake2sp, sha3-256, sha256, highway256,
188 metro256, metrocrc256
189
190 160-bit: sha1
191
192 128-bit: md5, murmur, metro, metrocrc
193
194 64-bit: highway64, xxhash.
195
196 The use of 64-bit hash length for detecting duplicate files is
197 not recommended, due to the probability of a random hash colli‐
198 sion.
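
   As a sketch of picking an algorithm (temporary paths are made up for
   illustration; the rmlint calls run only if the tool is installed):

```shell
# Two identical files in a scratch directory:
rm -rf /tmp/rmlint_hash_demo && mkdir -p /tmp/rmlint_hash_demo
printf 'payload\n' > /tmp/rmlint_hash_demo/a
cp /tmp/rmlint_hash_demo/a /tmp/rmlint_hash_demo/b

if command -v rmlint >/dev/null 2>&1; then
    # Strong hash, sensible for untrusted data:
    rmlint -a sha256 /tmp/rmlint_hash_demo -o summary
    # Byte-by-byte comparison, no hashing at all:
    rmlint -a paranoid /tmp/rmlint_hash_demo -o summary
fi
```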
199
200 -p --paranoid / -P --less-paranoid (default)
201 Increase or decrease the paranoia of rmlint's duplicate algo‐
202 rithm. Use -p if you want byte-by-byte comparison without any
203 hashing.
204
205 · -p is equivalent to --algorithm=paranoid
206
207 · -P is equivalent to --algorithm=highway256
208
209 · -PP is equivalent to --algorithm=metro256
210
211 · -PPP is equivalent to --algorithm=metro
212
213 -v --loud / -V --quiet
214 Increase or decrease the verbosity. You can pass these options
215 several times. This only affects rmlint's logging on stderr, but
216 not the outputs defined with -o. Passing either option more than
217 three times has no further effect.
218
219 -g --progress / -G --no-progress (default)
220 Show a progressbar with sane defaults.
221
222 Convenience shortcut for -o progressbar -o summary -o
223 sh:rmlint.sh -o json:rmlint.json -VVV.
224
225 NOTE: This flag clears all previous outputs. If you want addi‐
226 tional outputs, specify them after this flag using -O.
227
   -D --merge-directories (default: disabled)
      Makes rmlint use a special mode where all found duplicates are
      collected and checked to see whether whole directory trees are
      duplicates. Use with caution: you should always make sure that
      the investigated directory is not modified while rmlint or its
      removal scripts run.
234
235 IMPORTANT: Definition of equal: Two directories are considered
236 equal by rmlint if they contain the exact same data, no matter
237 how the files containing the data are named. Imagine that rmlint
238 creates a long, sorted stream out of the data found in the
239 directory and compares this in a magic way to another directory.
240 This means that the layout of the directory is not considered to
241 be important by default. Also empty files will not count as con‐
242 tent. This might be surprising to some users, but remember that
243 rmlint generally cares only about content, not about any other
244 metadata or layout. If you want to only find trees with the same
245 hierarchy you should use --honour-dir-layout / -j.
246
      Output is deferred until all duplicates have been found.
      Duplicate directories are printed first, followed by any
      remaining duplicate files that are isolated or inside of any
      original directories.
251
      --rank-by applies for directories too, but 'p' or 'P' (path
      index) has no defined (i.e. useful) meaning. Sorting only takes
      place when the number of preferred files in the directory
      differs.
256
257 NOTES:
258
259 · This option enables --partial-hidden and -@ (--see-symlinks)
260 for convenience. If this is not desired, you should change
261 this after specifying -D.
262
263 · This feature might add some runtime for large datasets.
264
265 · When using this option, you will not be able to use the -c
266 sh:clone option. Use -c sh:link as a good alternative.
267
268 -j --honour-dir-layout (default: disabled)
269 Only recognize directories as duplicates that have the same path
270 layout. In other words: All duplicates that build the duplicate
271 directory must have the same path from the root of each respec‐
272 tive directory. This flag makes no sense without --merge-direc‐
273 tories.
274
275 -y --sort-by=order (default: none)
276 During output, sort the found duplicate groups by criteria
277 described by order. order is a string that may consist of one
278 or more of the following letters:
279
280 · s: Sort by size of group.
281
282 · a: Sort alphabetically by the basename of the original.
283
284 · m: Sort by mtime of the original.
285
286 · p: Sort by path-index of the original.
287
288 · o: Sort by natural found order (might be different on each
289 run).
290
291 · n: Sort by number of files in the group.
292
293 The letter may also be written uppercase (similar to -S /
294 --rank-by) to reverse the sorting. Note that rmlint has to hold
295 back all results to the end of the run before sorting and print‐
296 ing.
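
      A small sketch of group sorting (scratch paths are made up; the
      rmlint call is skipped without the tool):

```shell
rm -rf /tmp/rmlint_sort_demo && mkdir -p /tmp/rmlint_sort_demo
echo data > /tmp/rmlint_sort_demo/a
cp /tmp/rmlint_sort_demo/a /tmp/rmlint_sort_demo/b

if command -v rmlint >/dev/null 2>&1; then
    # 'S' prints the largest groups first (uppercase reverses 's');
    # 'a' breaks ties alphabetically by the original's basename.
    rmlint -y Sa /tmp/rmlint_sort_demo -o pretty
fi
```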
297
   -w --with-color (default) / -W --no-with-color
      Use color escapes for pretty output or disable them. If you
      pipe rmlint's output to a file, -W is assumed automatically.
301
302 -h --help / -H --show-man
303 Show a shorter reference help text (-h) or the full man page
304 (-H).
305
306 --version
307 Print the version of rmlint. Includes git revision and compile
308 time features. Please include this when giving feedback to us.
309
310 Traversal Options
   -s --size=range (default: 1)
312 Only consider files as duplicates in a certain size range. The
313 format of range is min-max, where both ends can be specified as
314 a number with an optional multiplier. The available multipliers
315 are:
316
317 · C (1^1), W (2^1), B (512^1), K (1000^1), KB (1024^1), M
318 (1000^2), MB (1024^2), G (1000^3), GB (1024^3),
319
320 · T (1000^4), TB (1024^4), P (1000^5), PB (1024^5), E (1000^6),
321 EB (1024^6)
322
323 The size format is about the same as dd(1) uses. A valid example
324 would be: "100KB-2M". This limits duplicates to a range from 100
325 Kilobyte to 2 Megabyte.
326
      It's also possible to specify only one size. In this case the
      size is interpreted as "bigger or equal". If you want to filter
      for files up to this size you can add a - in front (-s -1M
      == -s 0-1M).
331
332 Edge case: The default excludes empty files from the duplicate
333 search. Normally these are treated specially by rmlint by han‐
334 dling them as other lint. If you want to include empty files as
335 duplicates you should lower the limit to zero:
336
337 $ rmlint -T df --size 0
338
   -d --max-depth=depth (default: INF)
      Only recurse up to this depth. A depth of 1 disables recursion
      and is equivalent to a directory listing. A depth of 2 would
      also consider all child directories, and so on.
343
   -l --hardlinked (default) / --keep-hardlinked / -L --no-hardlinked
      Hardlinked files are treated as duplicates by default
      (--hardlinked). If --keep-hardlinked is given, rmlint will not
      delete any files that are hardlinked to an original in their
      respective group. Such files will be displayed like originals,
      i.e. for the default output with a "ls" in front. The reasoning
      here is to maximize the number of kept files while still
      maximizing the amount of freed space: removing hardlinks to
      originals would not free any space anyway.
353
      If --no-hardlinked is given, only one file (of a set of
      hardlinked files) is considered; all the others are ignored.
      This means they are neither deleted nor even shown in the
      output. The "highest ranked" file of the set is the one that is
      considered.
   -f --followlinks / -F --no-followlinks / -@ --see-symlinks (default)
      -f will always follow symbolic links. If filesystem loops occur,
      rmlint will detect this. If -F is specified, symbolic links will
      be ignored completely; if -@ is specified, rmlint will see
      symlinks and treat them like small files containing the path to
      their target. The latter is the default behaviour, since it is a
      sensible default for --merge-directories.
367
368 -x --no-crossdev / -X --crossdev (default)
369 Stay always on the same device (-x), or allow crossing mount‐
370 points (-X). The latter is the default.
371
   -r --hidden / -R --no-hidden (default) / --partial-hidden
      Also traverse hidden directories? This is often not a good idea,
      since directories like .git/ would be investigated, possibly
      leading to the deletion of internal git files which in turn
      breaks a repository. With --partial-hidden, hidden files and
      folders are only considered if they're inside duplicate
      directories (see --merge-directories) and will be deleted as
      part of them.
380
381 -b --match-basename
382 Only consider those files as dupes that have the same basename.
383 See also man 1 basename. The comparison of the basenames is
384 case-insensitive.
385
386 -B --unmatched-basename
387 Only consider those files as dupes that do not share the same
388 basename. See also man 1 basename. The comparison of the base‐
389 names is case-insensitive.
390
391 -e --match-with-extension / -E --no-match-with-extension (default)
392 Only consider those files as dupes that have the same file
393 extension. For example two photos would only match if they are a
394 .png. The extension is compared case-insensitive, so .PNG is the
395 same as .png.
396
397 -i --match-without-extension / -I --no-match-without-extension
398 (default)
399 Only consider those files as dupes that have the same basename
400 minus the file extension. For example: banana.png and
401 Banana.jpeg would be considered, while apple.png and peach.png
402 won't. The comparison is case-insensitive.
403
404 -n --newer-than-stamp=<timestamp_filename> / -N
405 --newer-than=<iso8601_timestamp_or_unix_timestamp>
406 Only consider files (and their size siblings for duplicates)
407 newer than a certain modification time (mtime). The age barrier
408 may be given as seconds since the epoch or as ISO8601-Timestamp
409 like 2014-09-08T00:12:32+0200.
410
      -n expects a file from which it can read the timestamp. After
      the rmlint run, the file will be updated with the current
      timestamp. If the file does not initially exist, no filtering is
      done but the stampfile is still written.
415
416 -N, in contrast, takes the timestamp directly and will not write
417 anything.
418
      Note that rmlint will find duplicates newer than timestamp, even
      if the original is older. If you only want to find duplicates
      where both the original and the duplicate are newer than
      timestamp, you can use find(1):
423
424 · find -mtime -1 -print0 | rmlint -0 # pass all files younger
425 than a day to rmlint
426
427 Note: you can make rmlint write out a compatible timestamp with:
428
429 · -O stamp:stdout # Write a seconds-since-epoch timestamp to
430 stdout on finish.
431
432 · -O stamp:stdout -c stamp:iso8601 # Same, but write as ISO8601.
433
434 Original Detection Options
435 -k --keep-all-tagged / -K --keep-all-untagged
436 Don't delete any duplicates that are in tagged paths (-k) or
437 that are in non-tagged paths (-K). (Tagged paths are those that
438 were named after //).
439
440 -m --must-match-tagged / -M --must-match-untagged
441 Only look for duplicates of which at least one is in one of the
442 tagged paths. (Paths that were named after //).
443
444 Note that the combinations of -kM and -Km are prohibited by
445 rmlint. See https://github.com/sahib/rmlint/issues/244 for more
446 information.
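
      A sketch of a typical -k/-m cleanup (directory names are
      invented; the rmlint call is skipped when the tool is absent):

```shell
rm -rf /tmp/rmlint_km_demo
mkdir -p /tmp/rmlint_km_demo/clutter /tmp/rmlint_km_demo/archive
echo "keep me" > /tmp/rmlint_km_demo/archive/doc.txt
cp /tmp/rmlint_km_demo/archive/doc.txt /tmp/rmlint_km_demo/clutter/doc_copy.txt
echo "only here" > /tmp/rmlint_km_demo/clutter/unrelated.txt

if command -v rmlint >/dev/null 2>&1; then
    # -k: never list files under the tagged 'archive' as deletable;
    # -m: only report duplicates that have at least one copy in 'archive'.
    rmlint /tmp/rmlint_km_demo/clutter // /tmp/rmlint_km_demo/archive -k -m -o pretty
fi
```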
447
   -S --rank-by=criteria (default: pOma)
      Sort the files in a group of duplicates into originals and
      duplicates by one or more criteria. Each criterion is defined by
      a single letter (except r and x, which expect a regex pattern
      after the letter). Multiple criteria may be given as a string,
      where the first criterion is the most important. If one
      criterion cannot decide between original and duplicate, the next
      one is tried.
456
457 · m: keep lowest mtime (oldest) M: keep highest mtime
458 (newest)
459
460 · a: keep first alphabetically A: keep last alphabet‐
461 ically
462
463 · p: keep first named path P: keep last named
464 path
465
466 · d: keep path with lowest depth D: keep path with
467 highest depth
468
469 · l: keep path with shortest basename L: keep path with
470 longest basename
471
472 · r: keep paths matching regex R: keep path not
473 matching regex
474
475 · x: keep basenames matching regex X: keep basenames not
476 matching regex
477
478 · h: keep file with lowest hardlink count H: keep file with
479 highest hardlink count
480
481 · o: keep file with lowest number of hardlinks outside of the
482 paths traversed by rmlint.
483
484 · O: keep file with highest number of hardlinks outside of the
485 paths traversed by rmlint.
486
487 Alphabetical sort will only use the basename of the file and
488 ignore its case. One can have multiple criteria, e.g.: -S am
489 will choose first alphabetically; if tied then by mtime. Note:
490 original path criteria (specified using //) will always take
491 first priority over -S options.
492
493 For more fine grained control, it is possible to give a regular
494 expression to sort by. This can be useful when you know a common
495 fact that identifies original paths (like a path component being
496 src or a certain file ending).
497
498 To use the regular expression you simply enclose it in the cri‐
499 teria string by adding <REGULAR_EXPRESSION> after specifying r
500 or x. Example: -S 'r<.*\.bak$>' makes all files that have a .bak
501 suffix original files.
502
      Warning: When using r or x, try to make your regex as specific
      as possible! Good practice includes adding a $ anchor at the end
      of the regex.
506
507 Tips:
508
509 · l is useful for files like file.mp3 vs file.1.mp3 or
510 file.mp3.bak.
511
512 · a can be used as last criteria to assert a defined order.
513
      · o/O and h/H are only useful if there are any hardlinks in the
        traversed path.
516
517 · o/O takes the number of hardlinks outside the traversed paths
518 (and thereby minimizes/maximizes the overall number of
519 hardlinks). h/H in contrast only takes the number of hardlinks
520 inside of the traversed paths. When hardlinking files, one
521 would like to link to the original file with the highest outer
522 link count (O) in order to maximise the space cleanup. H does
523 not maximise the space cleanup, it just selects the file with
524 the highest total hardlink count. You usually want to specify
525 O.
526
527 · pOma is the default since p ensures that first given paths
528 rank as originals, O ensures that hardlinks are handled well,
529 m ensures that the oldest file is the original and a simply
530 ensures a defined ordering if no other criteria applies.
531
532 Caching
533 --replay
534 Read an existing json file and re-output it. When --replay is
535 given, rmlint does no input/output on the filesystem, even if
536 you pass additional paths. The paths you pass will be used for
537 filtering the --replay output.
538
      This is very useful if you want to reformat, refilter or resort
      the output you got from a previous run. Usage is simple: just
      pass --replay on the second run, with the options changed to the
      new formatters or filters. Additionally, pass the .json files of
      the previous runs along with the paths you ran rmlint on. You
      can also merge several previous runs by specifying more than one
      .json file; in this case rmlint will merge all given files and
      output them as one big run.
547
548 If you want to view only the duplicates of certain subdirecto‐
549 ries, just pass them on the commandline as usual.
550
      The usage of // has the same effect as in a normal run. It can
      be used to prefer one .json file over another. However, note
      that running rmlint in --replay mode involves no real disk
      traversal, i.e. only duplicates from previous runs are printed.
      Therefore specifying new paths will simply have no effect. As a
      safety measure, --replay will ignore files whose mtime changed
      in the meantime (i.e. the mtime in the .json file differs from
      the current one). These files might have been modified and are
      silently ignored.
560
561 By design, some options will not have any effect. Those are:
562
563 · --followlinks
564
565 · --algorithm
566
567 · --paranoid
568
569 · --clamp-low
570
571 · --hardlinked
572
573 · --write-unfinished
574
575 · ... and all other caching options below.
576
577 NOTE: In --replay mode, a new .json file will be written to
578 rmlint.replay.json in order to avoid overwriting rmlint.json.
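
      A minimal two-run sketch (scratch paths are made up; the whole
      sequence is skipped when rmlint is not installed):

```shell
if command -v rmlint >/dev/null 2>&1; then
    rm -rf /tmp/rmlint_replay_demo && mkdir -p /tmp/rmlint_replay_demo
    echo dup > /tmp/rmlint_replay_demo/x
    cp /tmp/rmlint_replay_demo/x /tmp/rmlint_replay_demo/y

    # First run: scan the disk and record the results as json.
    rmlint /tmp/rmlint_replay_demo -o json:/tmp/rmlint_replay_demo/run.json

    # Second run: no disk traversal; re-sort and re-format the recorded results.
    rmlint --replay /tmp/rmlint_replay_demo/run.json /tmp/rmlint_replay_demo -y s -o pretty
fi
```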
579
580 --xattr-read / --xattr-write / --xattr-clear
581 Read or write cached checksums from the extended file
582 attributes. This feature can be used to speed up consecutive
583 runs.
584
585 CAUTION: This could potentially lead to false positives if file
586 contents are somehow modified without changing the file mtime.
587
      NOTE: Many tools do not support extended file attributes
      properly, resulting in a loss of this information when copying
      the file or editing it. Also, this is a Linux-specific feature
      that does not work on all filesystems, and only if you have
      write permissions to the file.
593
594 Usage example:
595
596 $ rmlint large_file_cluster/ -U --xattr-write # first run.
597 $ rmlint large_file_cluster/ --xattr-read # second run.
598
599 -U --write-unfinished
600 Include files in output that have not been hashed fully, i.e.
601 files that do not appear to have a duplicate. Note that this
602 will not include all files that rmlint traversed, but only the
603 files that were chosen to be hashed.
604
605 This is mainly useful in conjunction with --xattr-write/read.
606 When re-running rmlint on a large dataset this can greatly speed
607 up a re-run in some cases. Please refer to --xattr-read for an
608 example.
609
610 Rarely used, miscellaneous options
   -t --threads=N (default: 16)
      The number of threads to use during file tree traversal and
      hashing. rmlint probably knows better than you how to set this
      value, so just leave it as it is. Note that setting it to 1 will
      not make rmlint a single-threaded program.
616
   -u --limit-mem=size
      Apply a maximum amount of memory to use for hashing and
      --paranoid. The total amount of memory might still exceed this
      limit, especially when setting it very low. In general, however,
      rmlint will consume about this amount of memory plus a more or
      less constant extra amount that depends on the data you are
      scanning.
624
625 The size-description has the same format as for --size, there‐
626 fore you can do something like this (use this if you have 1GB of
627 memory available):
628
629 $ rmlint -u 512M # Limit paranoid mem usage to 512 MB
630
631 -q --clamp-low=[fac.tor|percent%|offset] (default: 0) / -Q
632 --clamp-top=[fac.tor|percent%|offset] (default: 1.0)
633 The argument can be either passed as factor (a number with a .
634 in it), a percent value (suffixed by %) or as absolute number or
635 size spec, like in --size.
636
      Only look at the content of files in the range from low to
      (and including) high. This means, if the range is less than -q
      0% to -Q 100%, then only partial duplicates are searched. If the
      file size is less than the clamp limits, the file is ignored
      during traversal. Be careful when using this function; you can
      easily get dangerous results for small files.
643
644 This is useful in a few cases where a file consists of a con‐
645 stant sized header or footer. With this option you can just com‐
646 pare the data in between. Also it might be useful for approxi‐
647 mate comparison where it suffices when the file is the same in
648 the middle part.
649
650 Example:
651
652 $ rmlint -q 10% -Q 512M # Only read the last 90% of a file, but
653 read at max. 512MB
654
655 -Z --mtime-window=T (default: -1)
656 Only consider those files as duplicates that have the same con‐
657 tent and the same modification time (mtime) within a certain
658 window of T seconds. If T is 0, both files need to have the
659 same mtime. For T=1 they may differ one second and so on. If the
660 window size is negative, the mtime of duplicates will not be
661 considered. T may be a floating point number.
662
663 However, with three (or more) files, the mtime difference
664 between two duplicates can be bigger than the mtime window T,
665 i.e. several files may be chained together by the window. Exam‐
666 ple: If T is 1, the four files fooA (mtime: 00:00:00), fooB
667 (00:00:01), fooC (00:00:02), fooD (00:00:03) would all belong to
668 the same duplicate group, although the mtime of fooA and fooD
669 differs by 3 seconds.
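
      A sketch of the window in action, using GNU touch to fake mtimes
      (scratch paths are made up; rmlint runs only if installed):

```shell
rm -rf /tmp/rmlint_mtime_demo && mkdir -p /tmp/rmlint_mtime_demo
echo same > /tmp/rmlint_mtime_demo/a
cp /tmp/rmlint_mtime_demo/a /tmp/rmlint_mtime_demo/b
touch -d "2020-01-01 00:00:00" /tmp/rmlint_mtime_demo/a
touch -d "2020-01-01 00:00:05" /tmp/rmlint_mtime_demo/b   # 5 seconds apart

if command -v rmlint >/dev/null 2>&1; then
    rmlint -Z 1  /tmp/rmlint_mtime_demo -o pretty  # 1s window: not reported as duplicates
    rmlint -Z 10 /tmp/rmlint_mtime_demo -o pretty  # 10s window: reported as duplicates
fi
```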
670
   --with-fiemap (default) / --without-fiemap
      Enable or disable reading the file extents on rotational disks
      in order to optimize disk access patterns. If this feature is
      not available, it is disabled automatically.
675
676 FORMATTERS
677 · csv: Output all found lint as comma-separated-value list.
678
679 Available options:
680
681 · no_header: Do not write a first line describing the column headers.
682
   · sh: Output all found lint as a shell script. This formatter is
     activated by default.
687
688 Available options:
689
690 · cmd: Specify a user defined command to run on duplicates. The com‐
691 mand can be any valid /bin/sh-expression. The duplicate path and
692 original path can be accessed via "$1" and "$2". The command will
693 be written to the user_command function in the sh-file produced by
694 rmlint.
695
      · handler: Define a comma-separated list of handlers to try on
        duplicate files in the given order until one handler succeeds.
        Handlers are just the name of a way of getting rid of the file
        and can be any of the following:
700
        · clone: For reflink-capable filesystems only. Try to clone
          both files with the FIDEDUPERANGE ioctl(3p) (or
          BTRFS_IOC_FILE_EXTENT_SAME on older kernels). This will free
          up duplicate extents. Needs at least kernel 4.2. Use this
          option when you only have read-only access to a btrfs
          filesystem but still want to deduplicate it. This is usually
          the case for snapshots.
708
709 · reflink: Try to reflink the duplicate file to the original. See
710 also --reflink in man 1 cp. Fails if the filesystem does not sup‐
711 port it.
712
713 · hardlink: Replace the duplicate file with a hardlink to the orig‐
714 inal file. The resulting files will have the same inode number.
715 Fails if both files are not on the same partition. You can use ls
716 -i to show the inode number of a file and find -samefile <path>
717 to find all hardlinks for a certain file.
718
719 · symlink: Tries to replace the duplicate file with a symbolic link
720 to the original. This handler never fails.
721
722 · remove: Remove the file using rm -rf. (-r for duplicate dirs).
723 This handler never fails.
724
725 · usercmd: Use the provided user defined command (-c sh:cmd=some‐
726 thing). Never fails.
727
728 Default is remove.
729
730 · link: Shortcut for -c sh:handler=clone,reflink,hardlink,symlink.
731 Use this if you are on a reflink-capable system.
732
733 · hardlink: Shortcut for -c sh:handler=hardlink,symlink. Use this if
734 you want to hardlink files, but want to fallback for duplicates
735 that lie on different devices.
736
737 · symlink: Shortcut for -c sh:handler=symlink. Use this as last
738 straw.
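
   The handler chain above can be sketched as follows (scratch paths
   are made up; the rmlint call is skipped when the tool is absent):

```shell
rm -rf /tmp/rmlint_sh_demo && mkdir -p /tmp/rmlint_sh_demo
echo twin > /tmp/rmlint_sh_demo/one
cp /tmp/rmlint_sh_demo/one /tmp/rmlint_sh_demo/two

if command -v rmlint >/dev/null 2>&1; then
    # Generate a script that hardlinks duplicates, falling back to symlinks:
    rmlint /tmp/rmlint_sh_demo -c sh:handler=hardlink,symlink \
           -o sh:/tmp/rmlint_sh_demo/cleanup.sh
    # Nothing is deleted yet; review cleanup.sh, then run it yourself:
    #   sh /tmp/rmlint_sh_demo/cleanup.sh
fi
```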
739
740 · json: Print a JSON-formatted dump of all found reports. Outputs all
741 lint as a json document. The document is a list of dictionaries,
742 where the first and last element is the header and the footer. Every‐
743 thing between are data-dictionaries.
744
745 Available options:
746
747 · no_header=[true|false]: Print the header with metadata (default:
748 true)
749
750 · no_footer=[true|false]: Print the footer with statistics (default:
751 true)
752
      · oneline=[true|false]: Print one json document per line
        (default: false). This is useful if you plan to parse the
        output line-by-line, e.g. while rmlint is still running.
756
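   Because oneline mode emits one JSON document per line, plain
   line-oriented tools suffice for parsing. The stream below is a
   hand-written mock (the field names shown are illustrative, not a
   guaranteed schema):

```shell
# Mock of a oneline json stream: header, two entries, footer.
cat > /tmp/rmlint_json_demo.jsonl <<'EOF'
{"description": "rmlint json-dump of lint files"}
{"type": "duplicate_file", "path": "/tmp/a", "is_original": true}
{"type": "duplicate_file", "path": "/tmp/b", "is_original": false}
{"aborted": false, "total_files": 2}
EOF

# Count the non-original duplicates, one document per line:
grep -c '"is_original": false' /tmp/rmlint_json_demo.jsonl   # prints: 1
```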
   · py: Outputs a python script and a JSON document, just like the
     json formatter. The JSON document is written to .rmlint.json;
     executing the script will make it read from there. This formatter
     is mostly intended for complex use-cases where the lint needs
     special handling that you define in the python script. Therefore
     the python script can be modified to do things standard rmlint is
     not able to do easily.
764
   · stamp: Outputs a timestamp of the time rmlint was run. See also
     the --newer-than and --newer-than-stamp options.
769
770 Available options:
771
      · iso8601=[true|false]: Write an ISO8601-formatted timestamp or
        seconds since the epoch?

   · progressbar: Shows a progressbar. This is meant for use with
     stdout or stderr [default].

     See also: -g (--progress) for a convenience shortcut option.

     Available options:

     · update_interval=number: Number of milliseconds to wait between
       updates. Higher values use less resources (default 50).

     · ascii: Do not attempt to use unicode characters, which might not
       be supported by some terminals.

     · fancy: Use a more fancy style for the progressbar.

   · pretty: Shows all found items in realtime, nicely colored. This
     formatter is activated by default.

   · summary: Shows counts of files and their respective size after the
     run. Also lists all written output files.

   · fdupes: Prints an output similar to the popular duplicate finder
     fdupes(1). At first a progressbar is printed on stderr; afterwards
     the found files are printed on stdout. Each set of duplicates gets
     printed as a block separated by newlines. Originals are
     highlighted in green. At the bottom a summary is printed on
     stderr. This is mostly useful for scripts that were set up for
     parsing fdupes output. We recommend the json formatter for every
     other scripting purpose.

     Available options:

     · omitfirst: Same as the -f / --omitfirst option in fdupes(1).
       Omits the first line of each set of duplicates (i.e. the
       original file).

     · sameline: Same as the -1 / --sameline option in fdupes(1). Does
       not print newlines between files, only a space. Newlines are
       printed only between sets of duplicates.

OTHER STAND-ALONE COMMANDS
   rmlint --gui
          Start the optional graphical frontend to rmlint, called
          Shredder.

          This will only work when Shredder and its dependencies are
          installed. See also:
          http://rmlint.readthedocs.org/en/latest/gui.html

          The gui has its own set of options; see --gui --help for a
          list. These should be placed at the end, i.e. rmlint --gui
          [options] when calling it from the commandline.

   rmlint --hash [paths...]
          Make rmlint work as a multi-threaded file hash utility,
          similar to the popular md5sum or sha1sum utilities, but
          faster and with more algorithms. A set of paths given on the
          commandline or from stdin is hashed using one of the
          available hash algorithms. Use rmlint --hash -h to see
          options.

   rmlint --equal [paths...]
          Check if the paths given on the commandline all have equal
          content. If all paths are equal and no other error happened,
          rmlint will exit with exit code 0; otherwise it will exit
          with a nonzero exit code. All other options can be used as
          normal, but note that no other formatters (sh, csv etc.) will
          be executed by default. At least two paths need to be passed.

          Note: This even works for directories and also in combination
          with paranoid mode (pass -pp for byte comparison); remember
          that rmlint does not care about the layout of the directory,
          but only about the content of the files in it.

          By default this will use hashing to compare the files and/or
          directories.

   rmlint --dedupe [-r] [-v|-V] <src> <dest>
          If the filesystem supports sharing physical storage between
          multiple files, and if src and dest have the same content,
          this command makes the data in the src file appear in the
          dest file by sharing the underlying storage.

          This command is similar to cp --reflink=always <src> <dest>,
          except that it (a) checks that src and dest have identical
          data, and (b) makes no changes to dest's metadata.

          Running with the -r option will enable deduplication of
          read-only [btrfs] snapshots (requires root).

   rmlint --is-reflink [-v|-V] <file1> <file2>
          Tests whether file1 and file2 are reflinks (reference the
          same data). Return codes:

          0:  files are reflinks
          1:  files are not reflinks
          3:  not a regular file
          4:  file sizes differ
          5:  fiemaps can't be read
          6:  file1 and file2 are the same path
          7:  file1 and file2 are the same file under different
              mountpoints
          8:  files are hardlinks
          9:  files are symlinks (TODO)
          10: files are not on same device
          11: other error encountered
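
          The return codes above can be dispatched in a small shell
          wrapper; a minimal sketch (the classify_reflink helper name
          and messages are ours, not part of rmlint):

```shell
# Map the exit code of a command such as 'rmlint --is-reflink a b'
# to a human-readable message. Only a few of the documented codes are
# handled here; the rest fall through to the default branch.
classify_reflink() {
    "$@" >/dev/null 2>&1
    case $? in
        0)  echo "files are reflinks" ;;
        1)  echo "files are not reflinks" ;;
        8)  echo "files are hardlinks" ;;
        10) echo "files are not on the same device" ;;
        *)  echo "error or unsupported case" ;;
    esac
}
# Usage (assumes rmlint is installed):
#   classify_reflink rmlint --is-reflink file1 file2
```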

EXAMPLES
   This is a collection of common use cases and other tricks:

   · Check the current working directory for duplicates:

     $ rmlint

   · Show a progressbar:

     $ rmlint -g

   · Quick re-run on large datasets, using different ranking criteria
     on the second run:

     $ rmlint large_dir/ # First run; writes rmlint.json

     $ rmlint --replay rmlint.json large_dir -S MaD

   · Merge together previous runs, but prefer the originals to be from
     b.json and make sure that no files are deleted from b.json:

     $ rmlint --replay a.json // b.json -k

   · Search only for duplicates and duplicate directories:

     $ rmlint -T "df,dd" .

   · Compare files byte-by-byte in the current directory:

     $ rmlint -pp .

   · Find duplicates with the same basename (excluding extension):

     $ rmlint -e

   · Do more complex traversal using find(1):

     $ find /usr/lib -iname '*.so' -type f | rmlint - # find all
       duplicate .so files

     $ find /usr/lib -iname '*.so' -type f -print0 | rmlint -0 # as
       above, but handles filenames with newline characters in them

     $ find ~/pics -iname '*.png' | ./rmlint - # compare png files only

   · Limit the file size range to investigate:

     $ rmlint -s 2GB # Find everything >= 2GB

     $ rmlint -s 0-2GB # Find everything < 2GB

   · Only find writable and executable files:

     $ rmlint --perms wx

   · Reflink duplicates to the original if possible, else hardlink
     them, else replace each duplicate with a symbolic link:

     $ rmlint -c sh:link

   · Inject a user-defined command into the shell script output:

     $ rmlint -o sh -c sh:cmd='echo "original:" "$2" "is the same as" "$1"'

   · Use data as the master directory. Find only duplicates in backup
     that are also in data; do not delete any files in data:

     $ rmlint backup // data --keep-all-tagged --must-match-tagged

   · Check whether the directories a, b and c are equal:

     $ rmlint --equal a b c && echo "Files are equal" || echo "Files
       are not equal"

   · Test whether two files are reflinks:

     $ rmlint --is-reflink a b && echo "Files are reflinks" || echo
       "Files are not reflinks"

PROBLEMS
   1. False Positives: Depending on the options you use, there is a
      very slight risk of false positives (files that are erroneously
      detected as duplicates). The default hash function (blake2b) is
      very safe, but in theory it is possible for two files to have the
      same hash. If you had 10^73 different files, all the same size,
      the chance of a false positive would still be less than 1 in a
      billion. If you're concerned, just use the --paranoid (-pp)
      option. This will compare all the files byte-by-byte and is not
      much slower than blake2b (it may even be faster), although it is
      a lot more memory-hungry.

   2. File modification during or after an rmlint run: It is possible
      that a file that rmlint recognized as a duplicate is modified
      afterwards, resulting in a different file. If you use the
      rmlint-generated shell script to delete the duplicates, you can
      run it with the -p option to do a full re-check of the duplicate
      against the original before it deletes the file. When using -c
      sh:hardlink or -c sh:symlink, care should be taken that a
      modification of one file will then result in a modification of
      all files. This is not the case for -c sh:reflink or -c
      sh:clone. Use -c sh:link to minimise this risk.
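
      The re-check that the generated script performs with -p amounts
      to a byte-by-byte comparison before removal. A minimal sketch of
      that idea (the recheck_rm name is ours, not part of rmlint):

```shell
# Remove a duplicate only if it is still byte-identical to its
# original. 'recheck_rm' is an illustrative helper, not part of
# rmlint or its generated script.
recheck_rm() {
    dup=$1 orig=$2
    if cmp -s "$dup" "$orig"; then
        rm -- "$dup"
    else
        echo "skipped: $dup changed since the rmlint run" >&2
    fi
}
```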

SEE ALSO
   Reading the manpages of these tools might help working with rmlint:

   · find(1)

   · rm(1)

   · cp(1)

   Extended documentation and an in-depth tutorial can be found at:

   · http://rmlint.rtfd.org

BUGS
   If you found a bug, have a feature request or want to say something
   nice, please visit https://github.com/sahib/rmlint/issues.

   Please make sure to describe your problem in detail. Always include
   the version of rmlint (--version). If you experienced a crash,
   please include the output of at least one of the following commands,
   run against a debug build of rmlint:

   · gdb -ex run -ex bt --args rmlint -vvv [your_options]

   · valgrind --leak-check=no rmlint -vvv [your_options]

   You can build a debug build of rmlint like this:

   · git clone git@github.com:sahib/rmlint.git

   · cd rmlint

   · scons DEBUG=1

   · sudo scons install # Optional

LICENSE
   rmlint is licensed under the terms of the GPLv3.

   See the COPYRIGHT file that came with the source for more
   information.

PROGRAM AUTHORS
   rmlint was written by:

   · Christopher <sahib> Pahl 2010-2017 (https://github.com/sahib)

   · Daniel <SeeSpotRun> T. 2014-2017 (https://github.com/SeeSpotRun)

   Also see http://rmlint.rtfd.org for other people that helped us.

   If you would like to make a donation, you can use Flattr, or buy us
   a beer if we meet:

   https://flattr.com/thing/302682/libglyr

AUTHOR
   Christopher Pahl, Daniel Thomas

COPYRIGHT
   2014-2015, Christopher Pahl & Daniel Thomas

                             Feb 18, 2019                     RMLINT(1)