RMLINT(1)                    rmlint documentation                    RMLINT(1)

NAME

       rmlint - find duplicate files and other space waste efficiently

FIND DUPLICATE FILES AND OTHER SPACE WASTE EFFICIENTLY

   SYNOPSIS
       rmlint [TARGET_DIR_OR_FILES ...] [//] [TAGGED_TARGET_DIR_OR_FILES ...]
       [-] [OPTIONS]

   DESCRIPTION
       rmlint finds space waste and other broken things on your filesystem.
       Its main focus is on finding duplicate files and directories.
16
       It is able to find the following types of lint:

       • Duplicate files and directories (and as a by-product unique files).

       • Nonstripped Binaries (Binaries with debug symbols; needs to be
         explicitly enabled).

       • Broken symbolic links.

       • Empty files and directories (also nested empty directories).

       • Files with broken user or group id.
29
       rmlint itself WILL NOT DELETE ANY FILES. It does, however, produce
       executable output (for example a shell script) to help you delete the
       files if you want to. Another design principle is that it should work
       well together with other tools like find. Therefore we do not
       replicate features of other well-known programs, such as pattern
       matching and finding duplicate filenames.  However, we provide many
       convenience options for common use cases that are hard to build from
       scratch with standard tools.
38
       In order to find the lint, rmlint is given one or more directories to
       traverse.  If no directories or files are given, the current working
       directory is assumed.  By default, rmlint will ignore hidden files
       and will not follow symlinks (see Traversal Options).  rmlint will
       first find "other lint" and then search the remaining files for
       duplicates.
44
       rmlint tries to be helpful by guessing which file of a group of
       duplicates is the original (i.e. the file that should not be
       deleted).  It does this by using different sorting strategies that
       can be controlled via the -S option. By default it chooses the
       first-named path on the commandline.  If two duplicates come from the
       same path, it will also apply different fallback sort strategies (see
       the documentation of the -S option).
52
       This behaviour can also be overridden if you know that a certain
       directory contains duplicates and another one originals. In this case
       you write the original directory after specifying a single // on the
       commandline.  Everything that comes after is a preferred (or a
       "tagged") directory.  If there are duplicates from an unpreferred and
       from a preferred directory, the preferred one will always count as
       original. Special options can also be used to always keep files in
       preferred directories (-k) and to only find duplicates that are
       present in both given directories (-m).
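
       A minimal illustration (the directory names are placeholders): tag
       originals as preferred and protect its files from deletion with -k
       (see Original Detection Options below):

          $ rmlint backup // originals -k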

       We advise new users to have a short look at all options rmlint has to
       offer, and maybe test some examples before letting it run on
       productive data.  WRONG ASSUMPTIONS ARE THE BIGGEST ENEMY OF YOUR
       DATA. There are some extended examples at the end of this manual, but
       each option that is not self-explanatory also tries to give examples.
68
69   OPTIONS
70   General Options
71       -T --types="list" (default: defaults)
72              Configure  the  types  of  lint  rmlint  will look for. The list
73              string is a comma-separated list of lint types  or  lint  groups
74              (other separators like semicolon or space also work though).
75
76              One of the following groups can be specified at the beginning of
77              the list:
78
79all: Enables all lint types.
80
81defaults: Enables all lint types, but nonstripped.
82
83minimal: defaults minus emptyfiles and emptydirs.
84
85minimaldirs: defaults minus emptyfiles, emptydirs  and  dupli‐
86                cates, but with duplicatedirs.
87
88none: Disable all lint types [default].
89
90              Any  of  the  following lint types can be added individually, or
91              deselected by prefixing with a -:
92
93badids, bi: Find files with bad UID, GID or both.
94
95badlinks, bl: Find bad symlinks pointing nowhere valid.
96
97emptydirs, ed: Find empty directories.
98
99emptyfiles, ef: Find empty files.
100
101nonstripped, ns: Find nonstripped binaries.
102
103duplicates, df: Find duplicate files.
104
105duplicatedirs, dd: Find duplicate  directories  (This  is  the
106                same -D!)
107
              WARNING: It is good practice to enclose the list in single or
              double quotes. In obscure cases argument parsing might fail in
              weird ways, especially when using spaces as the separator.
111
112              Example:
113
114                 $ rmlint -T "df,dd"        # Only search for duplicate files and directories
115                 $ rmlint -T "all -df -dd"  # Search for all lint except duplicate files and dirs.
116
117       -o  --output=spec  /  -O --add-output=spec (default: -o sh:rmlint.sh -o
118       pretty:stdout -o summary:stdout -o json:rmlint.json)
119              Configure the way rmlint outputs its results. A spec is  in  the
120              form  format:file or just format.  A file might either be an ar‐
121              bitrary path or stdout or stderr.  If file is omitted, stdout is
122              assumed.  format  is  the  name of a formatter supported by this
123              program. For a list of formatters and their  options,  refer  to
124              the Formatters section below.
125
              If -o is specified, rmlint's default outputs are overwritten.
              With -O the defaults are preserved.  Either -o or -O may be
              specified multiple times to get multiple outputs, including
              multiple outputs of the same format.
130
131              Examples:
132
                 $ rmlint -o json                 # Stream the json output to stdout
                 $ rmlint -O csv:/tmp/rmlint.csv  # Output an extra csv file to /tmp
135
136       -c --config=spec[=value] (default: none)
137              Configure a format. This option can be used to fine-tune the be‐
138              haviour  of  the existing formatters. See the Formatters section
139              for details on the available keys.
140
141              If the value is omitted it is set to a value meaning "enabled".
142
143              Examples:
144
145                 $ rmlint -c sh:link            # Smartly link duplicates instead of removing
146                 $ rmlint -c progressbar:fancy  # Use a different theme for the progressbar
147
       -z --perms[=[rwx]] (default: no check)
              Only look at a file if it is readable, writable or executable
              by the current user.  Which of these to check can be given as
              an argument using the letters "rwx".

              If no argument is given, "rw" is assumed. Note that r does
              basically nothing user-visible since rmlint will ignore
              unreadable files anyway.  It's just there for the sake of
              completeness.

              By default this check is not done.

              $ rmlint -z rx $(echo $PATH | tr ":" " ")  # Look at all
              executable files in $PATH
161
162       -a --algorithm=name (default: blake2b)
163              Choose the algorithm to use for finding duplicate files. The al‐
164              gorithm can be either paranoid (byte-by-byte file comparison) or
165              use  one of several file hash algorithms to identify duplicates.
166              The following hash families are available  (in  approximate  de‐
167              scending order of cryptographic strength):
168
169              sha3, blake,
170
171              sha,
172
173              highway, md
174
175              metro, murmur, xxhash
176
177              The  weaker  hash  functions  still offer excellent distribution
178              properties, but are potentially  more  vulnerable  to  malicious
179              crafting of duplicate files.
180
181              The full list of hash functions (in decreasing order of checksum
182              length) is:
183
184              512-bit: blake2b, blake2bp, sha3-512, sha512
185
186              384-bit: sha3-384,
187
188              256-bit:  blake2s,  blake2sp,  sha3-256,   sha256,   highway256,
189              metro256, metrocrc256
190
191              160-bit: sha1
192
193              128-bit: md5, murmur, metro, metrocrc
194
195              64-bit: highway64, xxhash.
196
197              The  use  of 64-bit hash length for detecting duplicate files is
198              not recommended, due to the probability of a random hash  colli‐
199              sion.
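
              For example, to select one of the hash functions listed above
              (scanning the current directory):

                 $ rmlint -a sha256 .   # use sha256 instead of blake2b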
200
       -p --paranoid / -P --less-paranoid (default)
              Increase or decrease the paranoia of rmlint's duplicate
              algorithm.  Use -p if you want byte-by-byte comparison without
              any hashing.

              -p is equivalent to --algorithm=paranoid

              -P is equivalent to --algorithm=highway256

              -PP is equivalent to --algorithm=metro256

              -PPP is equivalent to --algorithm=metro
213
214       -v --loud / -V --quiet
215              Increase  or  decrease the verbosity. You can pass these options
216              several times. This only affects rmlint's logging on stderr, but
217              not the outputs defined with -o. Passing either option more than
218              three times has no further effect.
219
220       -g --progress / -G --no-progress (default)
221              Show a progressbar with sane defaults.
222
223              Convenience shortcut for -o progressbar  -o  summary  -o  sh:rm‐
224              lint.sh -o json:rmlint.json -VVV.
225
226              NOTE:  This  flag clears all previous outputs. If you want addi‐
227              tional outputs, specify them after this flag using -O.
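
              For example, to show the progressbar and still add an extra
              csv output (the file name is arbitrary):

                 $ rmlint -g -O csv:rmlint.csv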
228
229       -D --merge-directories (default: disabled)
230              Makes rmlint use a special mode where all found  duplicates  are
231              collected  and  checked if whole directory trees are duplicates.
              Use with caution: You should always make sure that the
              investigated directory is not modified during the run of
              rmlint or of its removal scripts.
235
236              IMPORTANT: Definition of equal: Two directories  are  considered
237              equal  by  rmlint if they contain the exact same data, no matter
238              how the files containing the data are named. Imagine that rmlint
239              creates  a  long, sorted stream out of the data found in the di‐
240              rectory and compares this in a magic way to  another  directory.
241              This means that the layout of the directory is not considered to
242              be important by default. Also empty files will not count as con‐
243              tent.  This might be surprising to some users, but remember that
244              rmlint generally cares only about content, not about  any  other
245              metadata or layout. If you want to only find trees with the same
246              hierarchy you should use --honour-dir-layout / -j.
247
              Output is deferred until all duplicates have been found.
              Duplicate directories are printed first, followed by any
              remaining duplicate files that are isolated or inside of any
              original directories.
252
253              --rank-by  applies for directories too, but 'p' or 'P' (path in‐
254              dex) has no defined (i.e. useful) meaning.  Sorting  takes  only
255              place  when  the number of preferred files in the directory dif‐
256              fers.
257
258              NOTES:
259
260              • This option enables --partial-hidden and  -@  (--see-symlinks)
261                for  convenience.  If  this  is not desired, you should change
262                this after specifying -D.
263
264              • This feature might add some runtime for large datasets.
265
266              • When using this option, you will not be able  to  use  the  -c
267                sh:clone option.  Use -c sh:link as a good alternative.
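
              A short example combining -D with the sh:link handler from the
              note above (the directory name is hypothetical):

                 $ rmlint -D -c sh:link big_tree/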
268
269       -j --honour-dir-layout (default: disabled)
270              Only recognize directories as duplicates that have the same path
271              layout. In other words: All duplicates that build the  duplicate
272              directory  must have the same path from the root of each respec‐
273              tive directory.  This flag makes no sense without --merge-direc‐
274              tories.
275
276       -y --sort-by=order (default: none)
277              During  output,  sort the found duplicate groups by criteria de‐
278              scribed by order.  order is a string that may consist of one  or
279              more of the following letters:
280
281s: Sort by size of group.
282
283a: Sort alphabetically by the basename of the original.
284
285m: Sort by mtime of the original.
286
287p: Sort by path-index of the original.
288
289o:  Sort  by  natural  found order (might be different on each
290                run).
291
292n: Sort by number of files in the group.
293
294              The letter may also  be  written  uppercase  (similar  to  -S  /
295              --rank-by)  to reverse the sorting. Note that rmlint has to hold
296              back all results to the end of the run before sorting and print‐
297              ing.
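
              For example, to print the groups sorted by their size
              (scanning the current directory):

                 $ rmlint -y s .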
298
       -w --with-color (default) / -W --no-with-color
              Use color escapes for pretty output or disable them.  If you
              pipe rmlint's output to a file, -W is assumed automatically.
302
303       -h --help / -H --show-man
304              Show a shorter reference help text (-h) or  the  full  man  page
305              (-H).
306
307       --version
308              Print  the  version of rmlint. Includes git revision and compile
309              time features. Please include this when giving feedback to us.
310
311   Traversal Options
312       -s --size=range (default: 1 )
313              Only consider files as duplicates in a certain size range.   The
314              format  of range is min-max, where both ends can be specified as
315              a number with an optional multiplier. The available  multipliers
316              are:
317
318C  (1^1),  W  (2^1),  B  (512^1),  K  (1000^1), KB (1024^1), M
319                (1000^2), MB (1024^2), G (1000^3), GB (1024^3),
320
321T (1000^4), TB (1024^4), P (1000^5), PB (1024^5), E  (1000^6),
322                EB (1024^6)
323
324              The size format is about the same as dd(1) uses. A valid example
325              would be: "100KB-2M". This limits duplicates to a range from 100
326              Kilobyte to 2 Megabyte.
327
328              It's  also  possible  to specify only one size. In this case the
329              size is interpreted as "bigger or equal". If you want to  filter
330              for files up to this size you can add a - in front (-s -1M == -s
331              0-1M).
332
333              Edge case: The default excludes empty files from  the  duplicate
334              search.   Normally these are treated specially by rmlint by han‐
335              dling them as other lint. If you want to include empty files  as
336              duplicates you should lower the limit to zero:
337
338              $ rmlint -T df --size 0
339
340       -d --max-depth=depth (default: INF)
341              Only recurse up to this depth. A depth of 1 would disable recur‐
342              sion and is equivalent to a directory  listing.  A  depth  of  2
343              would also consider all children directories and so on.
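
              For example (the path is hypothetical):

                 $ rmlint -d 2 ~/Downloads   # direct children only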
344
       -l --hardlinked (default) / --keep-hardlinked / -L --no-hardlinked
              Hardlinked files are treated as duplicates by default
              (--hardlinked). If --keep-hardlinked is given, rmlint will not
              delete any files that are hardlinked to an original in their
              respective group. Such files will be displayed like originals,
              i.e. for the default output with an "ls" in front.  The
              reasoning here is to maximize the number of kept files while
              maximizing the amount of freed space: removing hardlinks to
              originals does not free any space anyway.
354
355              If --no-hardlinked  is  given,  only  one  file  (of  a  set  of
356              hardlinked  files)  is  considered,  all the others are ignored;
357              this means, they are not deleted and also not even shown in  the
358              output.  The "highest ranked" of the set is the one that is con‐
359              sidered.
360
       -f --followlinks / -F --no-followlinks / -@ --see-symlinks (default)
              -f will always follow symbolic links. If filesystem loops
              occur, rmlint will detect this. If -F is specified, symbolic
              links will be ignored completely; if -@ is specified, rmlint
              will see symlinks and treat them like small files with the
              path to their target in them. The latter is the default
              behaviour, since it is a sensible default for
              --merge-directories.
368
369       -x --no-crossdev / -X --crossdev (default)
370              Stay  always  on  the same device (-x), or allow crossing mount‐
371              points (-X). The latter is the default.
372
373       -r --hidden / -R --no-hidden (default) / --partial-hidden
374              Also traverse hidden directories? This is often not a good idea,
375              since  directories  like  .git/  would be investigated, possibly
376              leading to the deletion of internal  git  files  which  in  turn
377              break  a  repository.   With  --partial-hidden  hidden files and
378              folders are only considered if they're inside duplicate directo‐
379              ries  (see  --merge-directories)  and will be deleted as part of
380              it.
381
382       -b --match-basename
383              Only consider those files as dupes that have the same  basename.
384              See  also  man  1  basename.  The comparison of the basenames is
385              case-insensitive.
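
              For example (the paths are hypothetical):

                 $ rmlint -b ~/Music ~/Backups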
386
387       -B --unmatched-basename
388              Only consider those files as dupes that do not  share  the  same
389              basename.   See also man 1 basename. The comparison of the base‐
390              names is case-insensitive.
391
392       -e --match-with-extension / -E --no-match-with-extension (default)
393              Only consider those files as dupes that have the same  file  ex‐
394              tension.  For  example two photos would only match if they are a
395              .png. The extension is compared case-insensitive, so .PNG is the
396              same as .png.
397
398       -i  --match-without-extension  /  -I  --no-match-without-extension (de‐
399       fault)
400              Only consider those files as dupes that have the  same  basename
401              minus  the  file  extension.  For  example:  banana.png  and Ba‐
402              nana.jpeg would be considered,  while  apple.png  and  peach.png
403              won't. The comparison is case-insensitive.
404
405       -n         --newer-than-stamp=<timestamp_filename>         /         -N
406       --newer-than=<iso8601_timestamp_or_unix_timestamp>
407              Only consider files (and their  size  siblings  for  duplicates)
408              newer than a certain modification time (mtime).  The age barrier
409              may be given as seconds since the epoch or as  ISO8601-Timestamp
410              like 2014-09-08T00:12:32+0200.
411
              -n expects a file from which it can read the timestamp. After
              the rmlint run, the file will be updated with the current
              timestamp.  If the file does not initially exist, no filtering
              is done but the stampfile is still written.
416
417              -N, in contrast, takes the timestamp directly and will not write
418              anything.
419
              Note that rmlint will find duplicates newer than the
              timestamp, even if the original is older.  If you only want to
              find duplicates where both original and duplicate are newer
              than the timestamp you can use find(1):
424
425find -mtime -1 -print0 | rmlint -0 # pass  all  files  younger
426                than a day to rmlint
427
428              Note: you can make rmlint write out a compatible timestamp with:
429
430-O  stamp:stdout   #  Write a seconds-since-epoch timestamp to
431                stdout on finish.
432
433-O stamp:stdout -c stamp:iso8601 # Same, but write as ISO8601.
434
435   Original Detection Options
436       -k --keep-all-tagged / -K --keep-all-untagged
437              Don't delete any duplicates that are in  tagged  paths  (-k)  or
438              that are in non-tagged paths (-K).  (Tagged paths are those that
439              were named after //).
440
441       -m --must-match-tagged / -M --must-match-untagged
442              Only look for duplicates of which at least one is in one of  the
443              tagged paths.  (Paths that were named after //).
444
445              Note  that the combinations of -kM and -Km are prohibited by rm‐
446              lint.  See https://github.com/sahib/rmlint/issues/244  for  more
447              information.
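
              For example (directory names are placeholders; this mirrors
              the "master directory" example in the EXAMPLES section below):

                 $ rmlint backup // data -k -m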
448
       -S --rank-by=criteria (default: pOma)
              Sort the files in a group of duplicates into originals and
              duplicates by one or more criteria. Each criterion is defined
              by a single letter (except r and x, which expect a regex
              pattern after the letter). Multiple criteria may be given as a
              string, where the first criterion is the most important. If
              one criterion cannot decide between original and duplicate,
              the next one is tried.
456
457m: keep lowest mtime (oldest)           M: keep highest  mtime
458                (newest)
459
460a: keep first alphabetically            A: keep last alphabet‐
461                ically
462
463p: keep first named path                 P:  keep  last  named
464                path
465
466d:  keep  path  with  lowest  depth          D: keep path with
467                highest depth
468
469l: keep path with shortest  basename      L:  keep  path  with
470                longest basename
471
472r:  keep  paths  matching  regex             R:  keep path not
473                matching regex
474
475x: keep basenames matching regex        X: keep basenames  not
476                matching regex
477
478h:  keep  file  with  lowest  hardlink count H: keep file with
479                highest hardlink count
480
481o: keep file with lowest number of hardlinks  outside  of  the
482                paths traversed by rmlint.
483
484O:  keep  file with highest number of hardlinks outside of the
485                paths traversed by rmlint.
486
487              Alphabetical sort will only use the basename of the file and ig‐
488              nore its case.  One can have multiple criteria, e.g.: -S am will
489              choose first alphabetically; if tied then by mtime.  Note: orig‐
490              inal  path  criteria (specified using //) will always take first
491              priority over -S options.
492
493              For more fine grained control, it is possible to give a  regular
494              expression to sort by. This can be useful when you know a common
495              fact that identifies original paths (like a path component being
496              src or a certain file ending).
497
498              To  use the regular expression you simply enclose it in the cri‐
499              teria string by adding <REGULAR_EXPRESSION> after  specifying  r
500              or x. Example: -S 'r<.*\.bak$>' makes all files that have a .bak
501              suffix original files.
502
              Warning: When using r or x, try to make your regex as specific
              as possible! Good practice includes adding a $ anchor at the
              end of the regex.
506
507              Tips:
508
509l  is  useful  for  files  like  file.mp3  vs  file.1.mp3   or
510                file.mp3.bak.
511
512a can be used as last criteria to assert a defined order.
513
              • o/O and h/H are only useful if there are any hardlinks in
                the traversed path.
516
517o/O takes the number of hardlinks outside the traversed  paths
518                (and   thereby   minimizes/maximizes  the  overall  number  of
519                hardlinks). h/H in contrast only takes the number of hardlinks
520                inside  of  the  traversed  paths. When hardlinking files, one
521                would like to link to the original file with the highest outer
522                link  count (O) in order to maximise the space cleanup. H does
523                not maximise the space cleanup, it just selects the file  with
524                the  highest total hardlink count. You usually want to specify
525                O.
526
527pOma is the default since p ensures  that  first  given  paths
528                rank  as originals, O ensures that hardlinks are handled well,
529                m ensures that the oldest file is the original  and  a  simply
530                ensures a defined ordering if no other criteria applies.
531
532   Caching
533       --replay
534              Read  an  existing  json file and re-output it. When --replay is
535              given, rmlint does no input/output on the  filesystem,  even  if
536              you  pass  additional paths. The paths you pass will be used for
537              filtering the --replay output.
538
              This is very useful if you want to reformat, refilter or
              resort the output you got from a previous run. Usage is
              simple: Just pass --replay on the second run, with the options
              changed to the new formatters or filters. Pass the .json files
              of the previous runs in addition to the paths you ran rmlint
              on. You can also merge several previous runs by specifying
              more than one .json file; in this case all given files are
              merged and output as one big run.
547
548              If  you  want to view only the duplicates of certain subdirecto‐
549              ries, just pass them on the commandline as usual.
550
551              The usage of // has the same effect as in a normal run.  It  can
552              be used to prefer one .json file over another. However note that
553              running rmlint in --replay mode includes no real disk traversal,
554              i.e.  only  duplicates from previous runs are printed. Therefore
555              specifying new paths will simply have no effect. As  a  security
556              measure,  --replay  will ignore files whose mtime changed in the
557              meantime (i.e. mtime in the .json file differs from the  current
558              one).  These files might have been modified and are silently ig‐
559              nored.
560
561              By design, some options will not have any effect. Those are:
562
563--followlinks
564
565--algorithm
566
567--paranoid
568
569--clamp-low
570
571--hardlinked
572
573--write-unfinished
574
575              • ... and all other caching options below.
576
577              NOTE: In --replay mode, a new .json file will be written to  rm‐
578              lint.replay.json in order to avoid overwriting rmlint.json.
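
              For example (large_dir is a placeholder; compare the EXAMPLES
              section below):

                 $ rmlint large_dir/       # first run; writes rmlint.json
                 $ rmlint --replay rmlint.json large_dir -S MaD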
579
580       -C --xattr
581              Shortcut  for  --xattr-read,  --xattr-write, --write-unfinished.
582              This will write a checksum and a timestamp to the  extended  at‐
583              tributes  of each file that rmlint hashed. This speeds up subse‐
584              quent runs on the same data  set.   Please  note  that  not  all
585              filesystems  may  support extended attributes and you need write
586              support to use this feature.
587
588              See the individual options below for more details and some exam‐
589              ples.
590
591       --xattr-read / --xattr-write / --xattr-clear
592              Read  or  write  cached  checksums  from  the  extended file at‐
593              tributes.  This feature can be  used  to  speed  up  consecutive
594              runs.
595
              CAUTION: This could potentially lead to false positives if
              file contents are somehow modified without changing the file
              modification time.  rmlint uses the mtime to determine whether
              a checksum is outdated. This is not a problem if you use the
              clone or reflink operation on a filesystem like btrfs. There
              an outdated checksum entry would simply lead to some duplicate
              work done in the kernel but would do no harm otherwise.
604
605              NOTE: Many tools do not support extended file  attributes  prop‐
606              erly,  resulting  in  a loss of the information when copying the
607              file or editing it.
608
609              NOTE: You can specify --xattr-write and --xattr-read at the same
610              time.   This  will  read from existing checksums at the start of
611              the run and update all hashed files at the end.
612
613              Usage example:
614
615                 $ rmlint large_file_cluster/ -U --xattr-write   # first run should be slow.
616                 $ rmlint large_file_cluster/ --xattr-read       # second run should be faster.
617
618                 # Or do the same in just one run:
619                 $ rmlint large_file_cluster/ --xattr
620
621       -U --write-unfinished
622              Include files in output that have not been  hashed  fully,  i.e.
623              files  that  do  not  appear to have a duplicate. Note that this
624              will not include all files that rmlint traversed, but  only  the
625              files that were chosen to be hashed.
626
627              This  is  mainly  useful in conjunction with --xattr-write/read.
628              When re-running rmlint on a large dataset this can greatly speed
629              up  a  re-run in some cases. Please refer to --xattr-read for an
630              example.
631
632              If you want to output unique files, please look into the uniques
633              output formatter.
634
635   Rarely used, miscellaneous options
636       -t --threads=N (default: 16)
637              The  number  of  threads  to  use during file tree traversal and
638              hashing.  rmlint probably knows better than you how to set  this
639              value,  so just leave it as it is. Setting it to 1 will also not
640              make rmlint a single threaded program.
641
       -u --limit-mem=size
              Apply a maximum amount of memory to use for hashing and
              --paranoid.  The total amount of memory might still exceed
              this limit though, especially when setting it very low. In
              general rmlint will however consume about this amount of
              memory plus a more or less constant extra amount that depends
              on the data you are scanning.
649
650              The  size-description  has the same format as for --size, there‐
651              fore you can do something like this (use this if you have 1GB of
652              memory available):
653
654              $ rmlint -u 512M  # Limit paranoid mem usage to 512 MB
655
656       -q    --clamp-low=[fac.tor|percent%|offset]    (default:    0)   /   -Q
657       --clamp-top=[fac.tor|percent%|offset] (default: 1.0)
658              The argument can be either passed as factor (a number with  a  .
659              in it), a percent value (suffixed by %) or as absolute number or
660              size spec, like in --size.
661
              Only look at the content of files in the range from low up to
              (and including) high. This means, if the range is less than -q
              0% to -Q 100%, then only partial duplicates are searched. If
              the file size is less than the clamp limits, the file is
              ignored during traversal. Be careful when using this option;
              you can easily get dangerous results for small files.
668
669              This  is  useful  in a few cases where a file consists of a con‐
670              stant sized header or footer. With this option you can just com‐
671              pare  the data in between.  Also it might be useful for approxi‐
672              mate comparison where it suffices when the file is the  same  in
673              the middle part.
674
675              Example:
676
677              $ rmlint -q 10% -Q 512M  # Only read the last 90% of a file, but
678              read at max. 512MB
679
680       -Z --mtime-window=T (default: -1)
681              Only consider those files as duplicates that have the same  con‐
682              tent  and  the  same  modification time (mtime) within a certain
683              window of T seconds.  If T is 0, both files  need  to  have  the
684              same mtime. For T=1 they may differ one second and so on. If the
685              window size is negative, the mtime of  duplicates  will  not  be
686              considered. T may be a floating point number.
687
688              However,  with  three  (or more) files, the mtime difference be‐
689              tween two duplicates can be bigger than the mtime window T, i.e.
690              several files may be chained together by the window. Example: If
691              T is 1, the four files fooA (mtime: 00:00:00), fooB  (00:00:01),
692              fooC  (00:00:02),  fooD  (00:00:03) would all belong to the same
693              duplicate group, although the mtime of fooA and fooD differs  by
694              3 seconds.
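
              For example (the path is hypothetical):

                 $ rmlint -Z 2 ~/photos   # mtimes must match within 2 seconds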
695
696       --with-fiemap (default) / --without-fiemap
697              Enable or disable reading the file extents on rotational disk in
698              order to optimize disk access patterns. If this feature  is  not
699              available, it is disabled automatically.
700
701   FORMATTERS
702csv: Output all found lint as comma-separated-value list.
703
704         Available options:
705
706no_header: Do not write a first line describing the column headers.
707
708unique: Include unique files in the output.
709
710
711
         sh: Output all found lint as a shell script. This formatter is
         activated by default.

         Available options:
716
         • cmd: Specify a user defined command to run on duplicates.  The
           command can be any valid /bin/sh-expression. The duplicate path
           and original path can be accessed via "$1" and "$2".  The command
           will be written to the user_command function in the sh-file
           produced by rmlint.
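
           For example (a variation of the user-defined command example in
           the EXAMPLES section below):

              $ rmlint -o sh -c sh:cmd='echo "dupe:" "$1" "orig:" "$2"'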
722
         • handler: Define a comma-separated list of handlers to try on
           duplicate files in that given order until one handler succeeds.
           Handlers are just the name of a way of getting rid of the file
           and can be any of the following:
727
728clone: For reflink-capable filesystems only. Try  to  clone  both
729             files  with  the  FIDEDUPERANGE  ioctl(3p) (or BTRFS_IOC_FILE_EX‐
730             TENT_SAME on older kernels).  This will  free  up  duplicate  ex‐
731             tents.  Needs at least kernel 4.2.  Use this option when you only
732             have read-only access to a btrfs filesystem  but  still  want  to
733             deduplicate it. This is usually the case for snapshots.
734
735reflink:  Try  to reflink the duplicate file to the original. See
736             also --reflink in man 1 cp. Fails if the filesystem does not sup‐
737             port it.
738
739hardlink: Replace the duplicate file with a hardlink to the orig‐
740             inal file. The resulting files will have  the same inode  number.
741             Fails if both files are not on the same partition. You can use ls
742             -i to show the inode number of a file and find  -samefile  <path>
743             to find all hardlinks for a certain file.
744
745symlink: Tries to replace the duplicate file with a symbolic link
746             to the original. This handler never fails.
747
748remove: Remove the file using rm -rf. (-r  for  duplicate  dirs).
749             This handler never fails.
750
751usercmd:  Use  the provided user defined command (-c sh:cmd=some‐
752             thing). This handler never fails.
753
754           Default is remove.
755
756link: Shortcut  for  -c  sh:handler=clone,reflink,hardlink,symlink.
757           Use this if you are on a reflink-capable system.
758
759hardlink: Shortcut for -c sh:handler=hardlink,symlink.  Use this if
760           you want to hardlink files, but want  to  fallback  for  duplicates
761           that lie on different devices.
762
763symlink:  Shortcut  for  -c  sh:handler=symlink.   Use this as last
764           straw.
765
766json: Print a JSON-formatted dump of all found reports.  Outputs  all
767         lint  as  a  json  document.  The document is a list of dictionaries,
768         where the first and last element is the header and the footer. Every‐
769         thing between are data-dictionaries.
770
771         Available options:
772
773unique: Include unique files in the output.
774
775no_header=[true|false]:  Print  the  header with metadata (default:
776           true)
777
778no_footer=[true|false]: Print the footer with statistics  (default:
779           true)
780
         • oneline=[true|false]: Print one json document per line (default:
           false). This is useful if you plan to parse the output
           line-by-line, e.g. while rmlint is still running.

         This formatter is extremely useful if you're in need of scripting
         more complex behaviour that is not directly possible with rmlint's
         built-in options.  A very handy tool here is jq.  Here is an
         example to output all original files directly from an rmlint run:

         $ rmlint -o json | jq -r '.[1:-1][] | select(.is_original) | .path'
791
792py: Outputs a python script and a JSON document, just like  the  json
793         formatter.   The  JSON document is written to .rmlint.json, executing
794         the script will make it read from there. This formatter is mostly in‐
795         tended  for  complex  use-cases where the lint needs special handling
796         that you define in the python script.  Therefore  the  python  script
797         can  be  modified to do things standard rmlint is not able to do eas‐
798         ily.
799
800uniques: Outputs all unique paths found during the run, one path  per
801         line.  This is often useful for scripting purposes.
802
803         Available options:
804
805print0: Do not put newlines between paths but zero bytes.
806
       • stamp:

         Outputs a timestamp of the time rmlint was run.  See also the
         --newer-than and --newer-than-stamp options.

         Available options:

         • iso8601=[true|false]: Write an ISO8601-formatted timestamp
           instead of seconds since epoch.
816
817progressbar:  Shows  a progressbar. This is meant for use with stdout
818         or stderr [default].
819
820         See also: -g (--progress) for a convenience shortcut option.
821
822         Available options:
823
824update_interval=number: Number of milliseconds to wait between  up‐
825           dates.  Higher values use less resources (default 50).
826
827ascii: Do not attempt to use unicode characters, which might not be
828           supported by some terminals.
829
830fancy: Use a more fancy style for the progressbar.
831
832pretty: Shows all found items in realtime nicely colored.  This  for‐
833         matter is activated as default.
834
835summary:  Shows  counts  of files and their respective size after the
836         run.  Also list all written output files.
837
838fdupes: Prints an output similar  to  the  popular  duplicate  finder
839         fdupes(1).  At  first  a progressbar is printed on stderr. Afterwards
840         the found files are printed on stdout; each set  of  duplicates  gets
841         printed  as  a block separated by newlines. Originals are highlighted
842         in green. At the bottom a summary  is  printed  on  stderr.  This  is
843         mostly useful for scripts that were set up for parsing fdupes output.
844         We recommend the json formatter for every other scripting purpose.
845
846         Available options:
847
         • omitfirst: Same as the -f / --omitfirst option in fdupes(1).
           Omits the first line of each set of duplicates (i.e. the
           original file).

         • sameline: Same as the -1 / --sameline option in fdupes(1). Does
           not print newlines between files, only a space.  Newlines are
           printed only between sets of duplicates.
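
         For example, one might emulate fdupes-style single-line output like
         this (the directory name is a placeholder; formatter options are
         set via -c as described above):

            $ rmlint -o fdupes -c fdupes:sameline some_dir/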
854
855   OTHER STAND-ALONE COMMANDS
856       rmlint --gui
857              Start the optional graphical frontend to rmlint called Shredder.
858
859              This  will only work when Shredder and its dependencies were in‐
860              stalled.                        See                        also:
861              http://rmlint.readthedocs.org/en/latest/gui.html
862
              The gui has its own set of options; see --gui --help for a
              list.  These should be placed at the end, i.e. rmlint --gui
              [options] when calling it from the commandline.
866
867       rmlint --hash [paths...]
868              Make  rmlint work as a multi-threaded file hash utility, similar
869              to the popular md5sum or sha1sum utilities, but faster and  with
870              more  algorithms.   A  set  of paths given on the commandline or
871              from stdin is hashed using one of the available hash algorithms.
872              Use rmlint --hash -h to see options.
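
              For example (the file name is a placeholder):

                 $ rmlint --hash some_large_file.iso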
873
874       rmlint --equal [paths...]
875              Check  if the paths given on the commandline all have equal con‐
876              tent. If all paths are equal and no other error happened, rmlint
877              will  exit  with  an  exit code 0. Otherwise it will exit with a
878              nonzero exit code. All other options can be used as normal,  but
879              note that no other formatters (sh, csv etc.) will be executed by
880              default. At least two paths need to be passed.
881
882              Note: This even works for directories and  also  in  combination
883              with paranoid mode (pass -pp for byte comparison); remember that
884              rmlint does not care about the layout of the directory, but only
885              about the content of the files in it. At least two paths need to
886              be given to the commandline.
887
888              By default this will use hashing to compare the files and/or di‐
889              rectories.
890
       rmlint --dedupe [-r] [-v|-V] <src> <dest>
              If the filesystem supports sharing physical storage between
              multiple files, and if src and dest have the same content,
              this command makes the data in the src file appear in the
              dest file by sharing the underlying storage.

              This command is similar to cp --reflink=always <src> <dest>,
              except that it (a) checks that src and dest have identical
              data, and (b) makes no changes to dest's metadata.

              Running with the -r option will enable deduplication of
              read-only [btrfs] snapshots (requires root).
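
              For example (the file names are placeholders; afterwards the
              second file shares the first file's storage):

                 $ rmlint --dedupe fileA fileB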
903
904       rmlint --is-reflink [-v|-V] <file1> <file2>
905              Tests  whether  file1  and  file2  are  reflinks (reference same
906              data).  This command makes rmlint exit with one of the following
907              exit codes:
908
909              • 0: files are reflinks
910
911              • 1: files are not reflinks
912
913              • 3: not a regular file
914
915              • 4: file sizes differ
916
917              • 5: fiemaps can't be read
918
919              • 6: file1 and file2 are the same path
920
921              • 7:  file1  and  file2 are the same file under different mount‐
922                points
923
924              • 8: files are hardlinks
925
926              • 9: files are symlinks
927
928              • 10: files are not on same device
929
930              • 11: other error encountered
931
932   EXAMPLES
933       This is a collection of common use cases and other tricks:
934
935       • Check the current working directory for duplicates.
936
937         $ rmlint
938
939       • Show a progressbar:
940
941         $ rmlint -g
942
943       • Quick re-run on large datasets using different  ranking  criteria  on
944         second run:
945
946         $ rmlint large_dir/ # First run; writes rmlint.json
947
948         $ rmlint --replay rmlint.json large_dir -S MaD
949
950       • Merge  together  previous  runs,  but prefer the originals to be from
951         b.json and make sure that no files are deleted from b.json:
952
953         $ rmlint --replay a.json // b.json -k
954
955       • Search only for duplicates and duplicate directories
956
957         $ rmlint -T "df,dd" .
958
959       • Compare files byte-by-byte in current directory:
960
961         $ rmlint -pp .
962
963       • Find duplicates with same basename (excluding extension):
964
         $ rmlint -i
966
967       • Do more complex traversal using find(1).
968
969         $ find /usr/lib -iname '*.so' -type f | rmlint - # find all duplicate
970         .so files
971
972         $  find /usr/lib -iname '*.so' -type f -print0 | rmlint -0 # as above
973         but handles filenames with newline character in them
974
975         $ find ~/pics -iname '*.png' | ./rmlint - # compare png files only
976
977       • Limit file size range to investigate:
978
979         $ rmlint -s 2GB    # Find everything >= 2GB
980
981         $ rmlint -s 0-2GB  # Find everything <  2GB
982
983       • Only find writable and executable files:
984
985         $ rmlint --perms wx
986
987       • Reflink if possible, else hardlink duplicates to original  if  possi‐
988         ble, else replace duplicate with a symbolic link:
989
990         $ rmlint -c sh:link
991
992       • Inject user-defined command into shell script output:
993
994         $  rmlint  -o  sh  -c  sh:cmd='echo "original:" "$2" "is the same as"
995         "$1"'
996
997       • Use shred to overwrite the contents of a file fully:
998
999         $ rmlint -c 'sh:cmd=shred -un 10 "$1"'
1000
1001       • Use data as master directory. Find only duplicates in backup that are
1002         also in data. Do not delete any files in data:
1003
1004         $ rmlint backup // data --keep-all-tagged --must-match-tagged
1005
       • Compare if the directories a, b and c are equal
1007
1008         $  rmlint  --equal a b c && echo "Files are equal" || echo "Files are
1009         not equal"
1010
1011       • Test if two files are reflinks
1012
1013         $ rmlint --is-reflink a b && echo "Files are reflinks" || echo "Files
1014         are not reflinks".
1015
1016       • Cache  calculated checksums for next run. The checksums will be writ‐
1017         ten to the extended file attributes:
1018
1019         $ rmlint --xattr
1020
1021       • Produce a list of unique files in a folder:
1022
1023         $ rmlint -o uniques
1024
       • Produce a list of files that are unique, including original files
         ("one of each"):

         $ rmlint t -o json -o uniques:unique_files | jq -r '.[1:-1][] |
         select(.is_original) | .path' | sort > original_files

         $ cat unique_files original_files
1031
1032       • Sort files by a user-defined regular expression
1033
1034                # Always keep files with ABC or DEF in their basename,
1035                # dismiss all duplicates with tmp, temp or cache in their names
1036                # and if none of those are applicable, keep the oldest files instead.
1037                $ ./rmlint -S 'x<.*(ABC|DEF).*>X<.*(tmp|temp|cache).*>m' /some/path
1038
1039       • Sort  files  by adding priorities to several user-defined regular ex‐
1040         pressions:
1041
1042                # Unlike the previous snippet, this one uses priorities:
1043                # Always keep files in ABC, DEF, GHI by following that particular order of
1044                # importance (ABC has a top priority), dismiss all duplicates with
1045                # tmp, temp, cache in their paths and if none of those are applicable,
1046                # keep the oldest files instead.
1047                $ rmlint -S 'r<.*ABC.*>r<.*DEF.*>r<.*GHI.*>R<.*(tmp|temp|cache).*>m' /some/path
1048
1049   PROBLEMS
1050       1. False Positives: Depending on the options you use, there is  a  very
1051          slight  risk of false positives (files that are erroneously detected
1052          as duplicate).  The default hash function (blake2b) is very safe but
1053          in  theory  it  is possible for two files to have then same hash. If
1054          you had 10^73 different files, all the same size, then the chance of
1055          a  false positive is still less than 1 in a billion.  If you're con‐
1056          cerned just use the --paranoid (-pp) option. This will  compare  all
1057          the  files  byte-by-byte and is not much slower than blake2b (it may
1058          even be faster), although it is a lot more memory-hungry.
1059
1060       2. File modification during or after rmlint run: It is possible that  a
1061          file that rmlint recognized as duplicate is modified afterwards, re‐
1062          sulting in a different file.  If you use the rmlint-generated  shell
1063          script  to  delete the duplicates, you can run it with the -p option
1064          to do a full re-check of the duplicate against the  original  before
1065          it deletes the file. When using -c sh:hardlink or -c sh:symlink care
1066          should be taken that a modification of one file will now result in a
1067          modification  of  all files.  This is not the case for -c sh:reflink
1068          or -c sh:clone. Use -c sh:link to minimise this risk.
1069
1070   SEE ALSO
       Reading the manpages of these tools might help when working with rmlint:
1072
1073find(1)
1074
1075rm(1)
1076
1077cp(1)
1078
1079       Extended documentation and an in-depth tutorial can be found at:
1080
1081http://rmlint.rtfd.org
1082
1083   BUGS
       If you found a bug, have a feature request or want to say something
       nice, please visit https://github.com/sahib/rmlint/issues.

       Please make sure to describe your problem in detail. Always include
       the version of rmlint (--version). If you experienced a crash, please
       include at least one of the following outputs, obtained with a debug
       build of rmlint:
1091
1092gdb --ex run -ex bt --args rmlint -vvv [your_options]
1093
1094valgrind --leak-check=no rmlint -vvv [your_options]
1095
1096       You can build a debug build of rmlint like this:
1097
1098git clone git@github.com:sahib/rmlint.git
1099
1100cd rmlint
1101
1102scons GDB=1 DEBUG=1
1103
1104sudo scons install  # Optional
1105
1106   LICENSE
1107       rmlint is licensed under the terms of the GPLv3.
1108
1109       See the COPYRIGHT file that came with the source for more information.
1110
1111   PROGRAM AUTHORS
1112       rmlint was written by:
1113
1114       • Christopher <sahib> Pahl 2010-2017 (https://github.com/sahib)
1115
1116       • Daniel <SeeSpotRun> T.   2014-2017 (https://github.com/SeeSpotRun)
1117
       Also see http://rmlint.rtfd.org for other people that helped us.
1119
1120       If you consider a donation you can use Flattr or buy us a  beer  if  we
1121       meet:
1122
1123       https://flattr.com/thing/302682/libglyr
1124

AUTHOR

       Christopher Pahl, Daniel Thomas

COPYRIGHT
       2014-2021, Christopher Pahl & Daniel Thomas

                                 Jul 23, 2021                        RMLINT(1)