RMLINT(1)                    rmlint documentation                    RMLINT(1)
2
3
4
NAME
rmlint - find duplicate files and other space waste efficiently
7
9 SYNOPSIS
10 rmlint [TARGET_DIR_OR_FILES ...] [//] [TAGGED_TARGET_DIR_OR_FILES ...]
11 [-] [OPTIONS]
12
13 DESCRIPTION
14 rmlint finds space waste and other broken things on your filesystem.
Its main focus is finding duplicate files and directories.
16
17 It is able to find the following types of lint:
18
19 · Duplicate files and directories (and as a by-product unique files).
20
21 · Nonstripped Binaries (Binaries with debug symbols; needs to be
22 explicitly enabled).
23
24 · Broken symbolic links.
25
26 · Empty files and directories (also nested empty directories).
27
28 · Files with broken user or group id.
29
rmlint itself WILL NOT DELETE ANY FILES. It does however produce
executable output (for example a shell script) to help you delete the
files if you want to. Another design principle is that it should work
well together with other tools like find. Therefore we do not replicate
features of other well-known programs, such as pattern matching
and finding duplicate filenames. However, we provide many convenience
options for common use cases that are hard to build from scratch with
standard tools.
38
39 In order to find the lint, rmlint is given one or more directories to
40 traverse. If no directories or files were given, the current working
41 directory is assumed. By default, rmlint will ignore hidden files and
42 will not follow symlinks (see Traversal Options). rmlint will first
43 find "other lint" and then search the remaining files for duplicates.
44
rmlint tries to be helpful by guessing which file in a group of
duplicates is the original (i.e. the file that should not be deleted).
It does this by using different sorting strategies that can be
controlled via the -S option. By default it chooses the first-named
path on the commandline. If two duplicates come from the same path, it
will also apply different fallback sort strategies (see the
documentation of the -S strategy).
52
This behaviour can also be overridden if you know that a certain
directory contains duplicates and another one originals. In this case
you write the original directory after specifying a single // on the
commandline. Everything that comes after is a preferred (or a
"tagged") directory. If there are duplicates from an unpreferred and
from a preferred directory, the preferred one will always count as the
original. Special options can also be used to always keep files in
preferred directories (-k) and to only find duplicates that are present
in both given directories (-m).
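
For example, to tell rmlint that everything after // is the preferred
(original) side (the directory names are only illustrative):

$ rmlint duplicates_dir/ // originals_dir/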
62
We advise new users to have a short look at all options rmlint has to
offer, and maybe test some examples before letting it run on production
data. WRONG ASSUMPTIONS ARE THE BIGGEST ENEMY OF YOUR DATA. There are
some extended examples at the end of this manual, and each option that
is not self-explanatory also tries to give examples.
68
69 OPTIONS
70 General Options
71 -T --types="list" (default: defaults)
72 Configure the types of lint rmlint will look for. The list
73 string is a comma-separated list of lint types or lint groups
74 (other separators like semicolon or space also work though).
75
76 One of the following groups can be specified at the beginning of
77 the list:
78
79 · all: Enables all lint types.
80
· defaults: Enables all lint types except nonstripped.
82
83 · minimal: defaults minus emptyfiles and emptydirs.
84
85 · minimaldirs: defaults minus emptyfiles, emptydirs and dupli‐
86 cates, but with duplicatedirs.
87
88 · none: Disable all lint types [default].
89
90 Any of the following lint types can be added individually, or
91 deselected by prefixing with a -:
92
93 · badids, bi: Find files with bad UID, GID or both.
94
95 · badlinks, bl: Find bad symlinks pointing nowhere valid.
96
97 · emptydirs, ed: Find empty directories.
98
99 · emptyfiles, ef: Find empty files.
100
101 · nonstripped, ns: Find nonstripped binaries.
102
103 · duplicates, df: Find duplicate files.
104
· duplicatedirs, dd: Find duplicate directories (this is the same as
-D!)
107
WARNING: It is good practice to enclose the type list in single or
double quotes. In obscure cases argument parsing might otherwise fail
in weird ways, especially when using spaces as separator.
111
112 Example:
113
114 $ rmlint -T "df,dd" # Only search for duplicate files and directories
115 $ rmlint -T "all -df -dd" # Search for all lint except duplicate files and dirs.
116
117 -o --output=spec / -O --add-output=spec (default: -o sh:rmlint.sh -o
118 pretty:stdout -o summary:stdout -o json:rmlint.json)
119 Configure the way rmlint outputs its results. A spec is in the
120 form format:file or just format. A file might either be an
121 arbitrary path or stdout or stderr. If file is omitted, stdout
122 is assumed. format is the name of a formatter supported by this
123 program. For a list of formatters and their options, refer to
124 the Formatters section below.
125
If -o is specified, rmlint's default outputs are overwritten. With
-O the defaults are preserved. Either -o or -O may be specified
multiple times to get multiple outputs, including multiple outputs of
the same format.
130
131 Examples:
132
133 $ rmlint -o json # Stream the json output to stdout
$ rmlint -O csv:/tmp/rmlint.csv # Output an extra csv file to /tmp
135
136 -c --config=spec[=value] (default: none)
137 Configure a format. This option can be used to fine-tune the be‐
138 haviour of the existing formatters. See the Formatters section
139 for details on the available keys.
140
141 If the value is omitted it is set to a value meaning "enabled".
142
143 Examples:
144
145 $ rmlint -c sh:link # Smartly link duplicates instead of removing
146 $ rmlint -c progressbar:fancy # Use a different theme for the progressbar
147
148 -z --perms[=[rwx]] (default: no check)
Only look into a file if it is readable, writable or executable by
the current user. Which of these checks to apply is given as an
argument consisting of one or more of the letters "rwx".

If no argument is given, "rw" is assumed. Note that r does basically
nothing user-visible, since rmlint will ignore unreadable files
anyway. It's just there for the sake of completeness.
156
157 By default this check is not done.
158
159 $ rmlint -z rx $(echo $PATH | tr ":" " ") # Look at all exe‐
160 cutable files in $PATH
161
162 -a --algorithm=name (default: blake2b)
163 Choose the algorithm to use for finding duplicate files. The
164 algorithm can be either paranoid (byte-by-byte file comparison)
165 or use one of several file hash algorithms to identify dupli‐
166 cates. The following hash families are available (in approxi‐
167 mate descending order of cryptographic strength):
168
sha3, blake

sha

highway, md

metro, murmur, xxhash
176
177 The weaker hash functions still offer excellent distribution
178 properties, but are potentially more vulnerable to malicious
179 crafting of duplicate files.
180
181 The full list of hash functions (in decreasing order of checksum
182 length) is:
183
184 512-bit: blake2b, blake2bp, sha3-512, sha512
185
186 384-bit: sha3-384,
187
188 256-bit: blake2s, blake2sp, sha3-256, sha256, highway256,
189 metro256, metrocrc256
190
191 160-bit: sha1
192
193 128-bit: md5, murmur, metro, metrocrc
194
195 64-bit: highway64, xxhash.
196
197 The use of 64-bit hash length for detecting duplicate files is
198 not recommended, due to the probability of a random hash colli‐
199 sion.
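
For example, to pick a stronger (but typically slower) hash than the
default blake2b:

$ rmlint -a sha512 large_dir/   # large_dir/ is only illustrative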
200
201 -p --paranoid / -P --less-paranoid (default)
202 Increase or decrease the paranoia of rmlint's duplicate algo‐
203 rithm. Use -p if you want byte-by-byte comparison without any
204 hashing.
205
206 · -p is equivalent to --algorithm=paranoid
207
208 · -P is equivalent to --algorithm=highway256
209
210 · -PP is equivalent to --algorithm=metro256
211
212 · -PPP is equivalent to --algorithm=metro
213
214 -v --loud / -V --quiet
215 Increase or decrease the verbosity. You can pass these options
216 several times. This only affects rmlint's logging on stderr, but
217 not the outputs defined with -o. Passing either option more than
218 three times has no further effect.
219
220 -g --progress / -G --no-progress (default)
221 Show a progressbar with sane defaults.
222
223 Convenience shortcut for -o progressbar -o summary -o
224 sh:rmlint.sh -o json:rmlint.json -VVV.
225
226 NOTE: This flag clears all previous outputs. If you want addi‐
227 tional outputs, specify them after this flag using -O.
228
229 -D --merge-directories (default: disabled)
Makes rmlint use a special mode where all found duplicates are
collected and checked if whole directory trees are duplicates.
Use with caution: you should always make sure that the investigated
directories are not modified during rmlint's run or the run of its
removal scripts.
235
236 IMPORTANT: Definition of equal: Two directories are considered
237 equal by rmlint if they contain the exact same data, no matter
238 how the files containing the data are named. Imagine that rmlint
239 creates a long, sorted stream out of the data found in the
240 directory and compares this in a magic way to another directory.
241 This means that the layout of the directory is not considered to
242 be important by default. Also empty files will not count as con‐
243 tent. This might be surprising to some users, but remember that
244 rmlint generally cares only about content, not about any other
245 metadata or layout. If you want to only find trees with the same
246 hierarchy you should use --honour-dir-layout / -j.
247
Output is deferred until all duplicates have been found. Duplicate
directories are printed first, followed by any remaining duplicate
files that are isolated or inside of any original directories.
252
--rank-by applies to directories too, but 'p' or 'P' (path index)
has no defined (i.e. useful) meaning. Sorting only takes place when
the number of preferred files in the directory differs.
257
258 NOTES:
259
260 · This option enables --partial-hidden and -@ (--see-symlinks)
261 for convenience. If this is not desired, you should change
262 this after specifying -D.
263
264 · This feature might add some runtime for large datasets.
265
266 · When using this option, you will not be able to use the -c
267 sh:clone option. Use -c sh:link as a good alternative.
268
269 -j --honour-dir-layout (default: disabled)
270 Only recognize directories as duplicates that have the same path
271 layout. In other words: All duplicates that build the duplicate
272 directory must have the same path from the root of each respec‐
273 tive directory. This flag makes no sense without --merge-direc‐
274 tories.
275
276 -y --sort-by=order (default: none)
277 During output, sort the found duplicate groups by criteria
278 described by order. order is a string that may consist of one
279 or more of the following letters:
280
281 · s: Sort by size of group.
282
283 · a: Sort alphabetically by the basename of the original.
284
285 · m: Sort by mtime of the original.
286
287 · p: Sort by path-index of the original.
288
289 · o: Sort by natural found order (might be different on each
290 run).
291
292 · n: Sort by number of files in the group.
293
294 The letter may also be written uppercase (similar to -S /
295 --rank-by) to reverse the sorting. Note that rmlint has to hold
296 back all results to the end of the run before sorting and print‐
297 ing.
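
For example, to print the largest duplicate groups first and order
groups of equal size alphabetically:

$ rmlint -y Sa big_dir/   # big_dir/ is only illustrative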
298
299 -w --with-color (default) / -W --no-with-color
Use color escapes for pretty output or disable them. If you pipe
rmlint's output to a file, -W is assumed automatically.
302
303 -h --help / -H --show-man
304 Show a shorter reference help text (-h) or the full man page
305 (-H).
306
307 --version
308 Print the version of rmlint. Includes git revision and compile
309 time features. Please include this when giving feedback to us.
310
311 Traversal Options
-s --size=range (default: 1)
313 Only consider files as duplicates in a certain size range. The
314 format of range is min-max, where both ends can be specified as
315 a number with an optional multiplier. The available multipliers
316 are:
317
318 · C (1^1), W (2^1), B (512^1), K (1000^1), KB (1024^1), M
319 (1000^2), MB (1024^2), G (1000^3), GB (1024^3),
320
321 · T (1000^4), TB (1024^4), P (1000^5), PB (1024^5), E (1000^6),
322 EB (1024^6)
323
324 The size format is about the same as dd(1) uses. A valid example
325 would be: "100KB-2M". This limits duplicates to a range from 100
326 Kilobyte to 2 Megabyte.
327
328 It's also possible to specify only one size. In this case the
329 size is interpreted as "bigger or equal". If you want to filter
330 for files up to this size you can add a - in front (-s -1M == -s
331 0-1M).
332
333 Edge case: The default excludes empty files from the duplicate
334 search. Normally these are treated specially by rmlint by han‐
335 dling them as other lint. If you want to include empty files as
336 duplicates you should lower the limit to zero:
337
338 $ rmlint -T df --size 0
339
340 -d --max-depth=depth (default: INF)
341 Only recurse up to this depth. A depth of 1 would disable recur‐
342 sion and is equivalent to a directory listing. A depth of 2
343 would also consider all children directories and so on.
344
345 -l --hardlinked (default) / --keep-hardlinked / -L --no-hardlinked
346 Hardlinked files are treated as duplicates by default
347 (--hardlinked). If --keep-hardlinked is given, rmlint will not
348 delete any files that are hardlinked to an original in their
349 respective group. Such files will be displayed like originals,
350 i.e. for the default output with a "ls" in front. The reasoning
351 here is to maximize the number of kept files, while maximizing
352 the number of freed space: Removing hardlinks to originals will
353 not allocate any free space.
354
If --no-hardlinked is given, only one file (of a set of hardlinked
files) is considered and all the others are ignored; this means they
are neither deleted nor shown in the output. The "highest ranked"
file of the set is the one that is considered.
360
361 -f --followlinks / -F --no-followlinks / -@ --see-symlinks (default)
-f will always follow symbolic links. If filesystem loops occur,
rmlint will detect this. If -F is specified, symbolic links will be
ignored completely; if -@ is specified, rmlint will see symlinks and
treat them like small files containing the path to their target. The
latter is the default behaviour, since it is a sensible default for
--merge-directories.
368
369 -x --no-crossdev / -X --crossdev (default)
Always stay on the same device (-x), or allow crossing mountpoints
(-X). The latter is the default.
372
373 -r --hidden / -R --no-hidden (default) / --partial-hidden
Also traverse hidden directories? This is often not a good idea,
since directories like .git/ would be investigated, possibly leading
to the deletion of internal git files which in turn would break the
repository. With --partial-hidden, hidden files and folders are only
considered if they're inside duplicate directories (see
--merge-directories) and will be deleted as part of them.
381
382 -b --match-basename
383 Only consider those files as dupes that have the same basename.
384 See also man 1 basename. The comparison of the basenames is
385 case-insensitive.
386
387 -B --unmatched-basename
388 Only consider those files as dupes that do not share the same
389 basename. See also man 1 basename. The comparison of the base‐
390 names is case-insensitive.
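
For example (the directory names are only illustrative):

$ rmlint -b pics/ pics_backup/   # dupes must share the same basename
$ rmlint -B pics/ pics_backup/   # dupes must have different basenames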
391
392 -e --match-with-extension / -E --no-match-with-extension (default)
393 Only consider those files as dupes that have the same file
extension. For example, two photos would only match if both are .png
files. The extension is compared case-insensitively, so .PNG is the
same as .png.
397
398 -i --match-without-extension / -I --no-match-without-extension
399 (default)
400 Only consider those files as dupes that have the same basename
401 minus the file extension. For example: banana.png and
402 Banana.jpeg would be considered, while apple.png and peach.png
403 won't. The comparison is case-insensitive.
404
405 -n --newer-than-stamp=<timestamp_filename> / -N
406 --newer-than=<iso8601_timestamp_or_unix_timestamp>
407 Only consider files (and their size siblings for duplicates)
408 newer than a certain modification time (mtime). The age barrier
409 may be given as seconds since the epoch or as ISO8601-Timestamp
410 like 2014-09-08T00:12:32+0200.
411
412 -n expects a file from which it can read the timestamp. After
the rmlint run, the file will be updated with the current timestamp.
414 If the file does not initially exist, no filtering is done but
415 the stampfile is still written.
416
417 -N, in contrast, takes the timestamp directly and will not write
418 anything.
419
420 Note that rmlint will find duplicates newer than timestamp, even
if the original is older. If you only want to find duplicates where
both the original and the duplicate are newer than timestamp, you
423 can use find(1):
424
425 · find -mtime -1 -print0 | rmlint -0 # pass all files younger
426 than a day to rmlint
427
428 Note: you can make rmlint write out a compatible timestamp with:
429
430 · -O stamp:stdout # Write a seconds-since-epoch timestamp to
431 stdout on finish.
432
433 · -O stamp:stdout -c stamp:iso8601 # Same, but write as ISO8601.
434
435 Original Detection Options
436 -k --keep-all-tagged / -K --keep-all-untagged
437 Don't delete any duplicates that are in tagged paths (-k) or
438 that are in non-tagged paths (-K). (Tagged paths are those that
439 were named after //).
440
441 -m --must-match-tagged / -M --must-match-untagged
442 Only look for duplicates of which at least one is in one of the
443 tagged paths. (Paths that were named after //).
444
445 Note that the combinations of -kM and -Km are prohibited by
446 rmlint. See https://github.com/sahib/rmlint/issues/244 for more
447 information.
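
For example, to use data/ as the master directory, keep everything in
it and only report duplicates in backup/ that also exist in data/
(the same combination as in the EXAMPLES section below):

$ rmlint backup/ // data/ -k -m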
448
449 -S --rank-by=criteria (default: pOma)
450 Sort the files in a group of duplicates into originals and
duplicates by one or more criteria. Each criterion is defined by
a single letter (except r and x, which expect a regex pattern
after the letter). Multiple criteria may be given as a string,
where the first criterion is the most important. If one criterion
cannot decide between original and duplicate, the next one is
tried.
457
· m: keep lowest mtime (oldest)          M: keep highest mtime (newest)

· a: keep first alphabetically           A: keep last alphabetically

· p: keep first named path               P: keep last named path

· d: keep path with lowest depth         D: keep path with highest depth

· l: keep path with shortest basename    L: keep path with longest basename

· r: keep paths matching regex           R: keep paths not matching regex

· x: keep basenames matching regex       X: keep basenames not matching regex

· h: keep file with lowest hardlink count
H: keep file with highest hardlink count

· o: keep file with lowest number of hardlinks outside of the paths
traversed by rmlint.

· O: keep file with highest number of hardlinks outside of the paths
traversed by rmlint.
487
488 Alphabetical sort will only use the basename of the file and
489 ignore its case. One can have multiple criteria, e.g.: -S am
490 will choose first alphabetically; if tied then by mtime. Note:
491 original path criteria (specified using //) will always take
492 first priority over -S options.
493
494 For more fine grained control, it is possible to give a regular
495 expression to sort by. This can be useful when you know a common
496 fact that identifies original paths (like a path component being
497 src or a certain file ending).
498
499 To use the regular expression you simply enclose it in the cri‐
500 teria string by adding <REGULAR_EXPRESSION> after specifying r
501 or x. Example: -S 'r<.*\.bak$>' makes all files that have a .bak
502 suffix original files.
503
Warning: When using r or x, try to make your regex as specific as
possible! Good practice includes adding a $ anchor at the end of the
regex.
507
508 Tips:
509
510 · l is useful for files like file.mp3 vs file.1.mp3 or
511 file.mp3.bak.
512
513 · a can be used as last criteria to assert a defined order.
514
· o/O and h/H are only useful if there are any hardlinks in the
traversed paths.
517
518 · o/O takes the number of hardlinks outside the traversed paths
519 (and thereby minimizes/maximizes the overall number of
520 hardlinks). h/H in contrast only takes the number of hardlinks
521 inside of the traversed paths. When hardlinking files, one
522 would like to link to the original file with the highest outer
523 link count (O) in order to maximise the space cleanup. H does
524 not maximise the space cleanup, it just selects the file with
525 the highest total hardlink count. You usually want to specify
526 O.
527
528 · pOma is the default since p ensures that first given paths
529 rank as originals, O ensures that hardlinks are handled well,
530 m ensures that the oldest file is the original and a simply
531 ensures a defined ordering if no other criteria applies.
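
A combined example (the path component "master" and the pattern are
only illustrative): prefer paths containing a directory called master,
then fall back to the oldest mtime, then to alphabetical order:

$ rmlint -S 'r<.*/master/.*>ma' a/ b/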
532
533 Caching
534 --replay
535 Read an existing json file and re-output it. When --replay is
536 given, rmlint does no input/output on the filesystem, even if
537 you pass additional paths. The paths you pass will be used for
538 filtering the --replay output.
539
540 This is very useful if you want to reformat, refilter or resort
541 the output you got from a previous run. Usage is simple: Just
pass --replay on the second run, together with whatever new
formatters or filters you want. Pass the .json files of the previous
runs in addition to the paths you ran rmlint on. You can also merge
several previous runs by specifying more than one .json file; in
this case all given files are merged and output as one big run.
548
549 If you want to view only the duplicates of certain subdirecto‐
550 ries, just pass them on the commandline as usual.
551
552 The usage of // has the same effect as in a normal run. It can
553 be used to prefer one .json file over another. However note that
running rmlint in --replay mode involves no real disk traversal,
555 i.e. only duplicates from previous runs are printed. Therefore
556 specifying new paths will simply have no effect. As a security
557 measure, --replay will ignore files whose mtime changed in the
558 meantime (i.e. mtime in the .json file differs from the current
559 one). These files might have been modified and are silently
560 ignored.
561
562 By design, some options will not have any effect. Those are:
563
564 · --followlinks
565
566 · --algorithm
567
568 · --paranoid
569
570 · --clamp-low
571
572 · --hardlinked
573
574 · --write-unfinished
575
576 · ... and all other caching options below.
577
578 NOTE: In --replay mode, a new .json file will be written to
579 rmlint.replay.json in order to avoid overwriting rmlint.json.
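
For example, to re-rank and re-filter an earlier run without touching
the disk again (the csv filename is only illustrative):

$ rmlint --replay rmlint.json large_dir/ -S MaD -o csv:resorted.csv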
580
581 -C --xattr
Shortcut for --xattr-read, --xattr-write, --write-unfinished.
583 This will write a checksum and a timestamp to the extended
584 attributes of each file that rmlint hashed. This speeds up sub‐
585 sequent runs on the same data set. Please note that not all
586 filesystems may support extended attributes and you need write
587 support to use this feature.
588
589 See the individual options below for more details and some exam‐
590 ples.
591
592 --xattr-read / --xattr-write / --xattr-clear
593 Read or write cached checksums from the extended file
594 attributes. This feature can be used to speed up consecutive
595 runs.
596
597 CAUTION: This could potentially lead to false positives if file
598 contents are somehow modified without changing the file modifi‐
cation time. rmlint uses the mtime to determine whether a cached
checksum is outdated. This is not a problem if you use the clone or
reflink operation on a filesystem like btrfs: there an outdated
checksum entry would simply lead to some duplicate work done in the
kernel but would do no harm otherwise.
605
606 NOTE: Many tools do not support extended file attributes prop‐
607 erly, resulting in a loss of the information when copying the
608 file or editing it.
609
610 NOTE: You can specify --xattr-write and --xattr-read at the same
611 time. This will read from existing checksums at the start of
612 the run and update all hashed files at the end.
613
614 Usage example:
615
616 $ rmlint large_file_cluster/ -U --xattr-write # first run should be slow.
617 $ rmlint large_file_cluster/ --xattr-read # second run should be faster.
618
619 # Or do the same in just one run:
620 $ rmlint large_file_cluster/ --xattr
621
622 -U --write-unfinished
623 Include files in output that have not been hashed fully, i.e.
624 files that do not appear to have a duplicate. Note that this
625 will not include all files that rmlint traversed, but only the
626 files that were chosen to be hashed.
627
628 This is mainly useful in conjunction with --xattr-write/read.
629 When re-running rmlint on a large dataset this can greatly speed
630 up a re-run in some cases. Please refer to --xattr-read for an
631 example.
632
633 If you want to output unique files, please look into the uniques
634 output formatter.
635
636 Rarely used, miscellaneous options
637 -t --threads=N (default: 16)
638 The number of threads to use during file tree traversal and
639 hashing. rmlint probably knows better than you how to set this
640 value, so just leave it as it is. Setting it to 1 will also not
make rmlint a single-threaded program.
642
643 -u --limit-mem=size
Apply a maximum amount of memory to use for hashing and
--paranoid. The total amount of memory might still exceed this limit
646 though, especially when setting it very low. In general rmlint
647 will however consume about this amount of memory plus a more or
648 less constant extra amount that depends on the data you are
649 scanning.
650
651 The size-description has the same format as for --size, there‐
652 fore you can do something like this (use this if you have 1GB of
653 memory available):
654
655 $ rmlint -u 512M # Limit paranoid mem usage to 512 MB
656
657 -q --clamp-low=[fac.tor|percent%|offset] (default: 0) / -Q
658 --clamp-top=[fac.tor|percent%|offset] (default: 1.0)
659 The argument can be either passed as factor (a number with a .
660 in it), a percent value (suffixed by %) or as absolute number or
661 size spec, like in --size.
662
Only look at the content of files in the range from low to
(including) high. This means that if the range is narrower than -q 0%
to -Q 100%, only partial duplicates are searched. If the file size
is less than the clamp limits, the file is ignored during traversal.
Be careful when using this option, as you can easily get dangerous
results for small files.
669
670 This is useful in a few cases where a file consists of a con‐
671 stant sized header or footer. With this option you can just com‐
672 pare the data in between. Also it might be useful for approxi‐
mate comparison where it suffices that the files are the same in
the middle part.
675
676 Example:
677
678 $ rmlint -q 10% -Q 512M # Only read the last 90% of a file, but
679 read at max. 512MB
680
681 -Z --mtime-window=T (default: -1)
682 Only consider those files as duplicates that have the same con‐
683 tent and the same modification time (mtime) within a certain
684 window of T seconds. If T is 0, both files need to have the
same mtime. For T=1 they may differ by one second and so on. If the
686 window size is negative, the mtime of duplicates will not be
687 considered. T may be a floating point number.
688
689 However, with three (or more) files, the mtime difference
690 between two duplicates can be bigger than the mtime window T,
691 i.e. several files may be chained together by the window. Exam‐
692 ple: If T is 1, the four files fooA (mtime: 00:00:00), fooB
693 (00:00:01), fooC (00:00:02), fooD (00:00:03) would all belong to
694 the same duplicate group, although the mtime of fooA and fooD
695 differs by 3 seconds.
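
For example, to only treat files as duplicates if their content
matches and their mtimes differ by at most two seconds (the directory
names are only illustrative):

$ rmlint -Z 2 music/ music_backup/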
696
697 --with-fiemap (default) / --without-fiemap
Enable or disable reading the file extents on rotational disks in
699 order to optimize disk access patterns. If this feature is not
700 available, it is disabled automatically.
701
702 FORMATTERS
703 · csv: Output all found lint as comma-separated-value list.
704
705 Available options:
706
707 · no_header: Do not write a first line describing the column headers.
708
709 · unique: Include unique files in the output.
710
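A minimal example, assuming you want a spreadsheet-friendly report
that also lists unique files (the filename report.csv is arbitrary):

$ rmlint -o csv:report.csv -c csv:unique
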
· sh: Output all found lint as a shell script. This formatter is
activated by default.
715
716 Available options:
717
718 · cmd: Specify a user defined command to run on duplicates. The com‐
719 mand can be any valid /bin/sh-expression. The duplicate path and
720 original path can be accessed via "$1" and "$2". The command will
721 be written to the user_command function in the sh-file produced by
722 rmlint.
723
· handler: Define a comma-separated list of handlers to try on dupli‐
725 cate files in that given order until one handler succeeds. Handlers
726 are just the name of a way of getting rid of the file and can be
727 any of the following:
728
729 · clone: For reflink-capable filesystems only. Try to clone both
730 files with the FIDEDUPERANGE ioctl(3p) (or
731 BTRFS_IOC_FILE_EXTENT_SAME on older kernels). This will free up
732 duplicate extents. Needs at least kernel 4.2. Use this option
733 when you only have read-only access to a btrfs filesystem but
734 still want to deduplicate it. This is usually the case for snap‐
735 shots.
736
737 · reflink: Try to reflink the duplicate file to the original. See
738 also --reflink in man 1 cp. Fails if the filesystem does not sup‐
739 port it.
740
741 · hardlink: Replace the duplicate file with a hardlink to the orig‐
742 inal file. The resulting files will have the same inode number.
743 Fails if both files are not on the same partition. You can use ls
744 -i to show the inode number of a file and find -samefile <path>
745 to find all hardlinks for a certain file.
746
747 · symlink: Tries to replace the duplicate file with a symbolic link
748 to the original. This handler never fails.
749
750 · remove: Remove the file using rm -rf. (-r for duplicate dirs).
751 This handler never fails.
752
753 · usercmd: Use the provided user defined command (-c sh:cmd=some‐
754 thing). This handler never fails.
755
756 Default is remove.
757
758 · link: Shortcut for -c sh:handler=clone,reflink,hardlink,symlink.
759 Use this if you are on a reflink-capable system.
760
761 · hardlink: Shortcut for -c sh:handler=hardlink,symlink. Use this if
762 you want to hardlink files, but want to fallback for duplicates
763 that lie on different devices.
764
765 · symlink: Shortcut for -c sh:handler=symlink. Use this as last
766 straw.
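
For example, to try reflinking first and fall back to hardlinking,
symlinking and finally plain removal:

$ rmlint -c sh:handler=reflink,hardlink,symlink,remove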
767
768 · json: Print a JSON-formatted dump of all found reports. Outputs all
769 lint as a json document. The document is a list of dictionaries,
where the first and last elements are the header and the footer. Every‐
thing in between are data dictionaries.
772
773 Available options:
774
775 · unique: Include unique files in the output.
776
777 · no_header=[true|false]: Print the header with metadata (default:
778 true)
779
780 · no_footer=[true|false]: Print the footer with statistics (default:
781 true)
782
783 · oneline=[true|false]: Print one json document per line (default:
784 false) This is useful if you plan to parse the output line-by-line,
e.g. while rmlint is still running.
786
787 This formatter is extremely useful if you're in need of scripting
more complex behaviour that is not directly possible with rmlint's
789 built-in options. A very handy tool here is jq. Here is an example
790 to output all original files directly from a rmlint run:
791
$ rmlint -o json | jq -r '.[1:-1][] | select(.is_original) | .path'
793
794 · py: Outputs a python script and a JSON document, just like the json
formatter. The JSON document is written to .rmlint.json; executing
the script will make it read from there. This formatter is mostly
797 intended for complex use-cases where the lint needs special handling
798 that you define in the python script. Therefore the python script
799 can be modified to do things standard rmlint is not able to do eas‐
800 ily.
801
802 · uniques: Outputs all unique paths found during the run, one path per
803 line. This is often useful for scripting purposes.
804
805 · stamp:
806
807 Outputs a timestamp of the time rmlint was run. See also the
--newer-than and --newer-than-stamp options.
809
810 Available options:
811
· iso8601=[true|false]: Write an ISO8601-formatted timestamp instead
of seconds since the epoch.
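
For example, to write the finish time of this run to a file that a
later run can pick up with --newer-than-stamp (the filename is
arbitrary):

$ rmlint -O stamp:last_run.stamp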
814
815 · progressbar: Shows a progressbar. This is meant for use with stdout
816 or stderr [default].
817
818 See also: -g (--progress) for a convenience shortcut option.
819
820 Available options:
821
822 · update_interval=number: Number of milliseconds to wait between
823 updates. Higher values use less resources (default 50).
824
825 · ascii: Do not attempt to use unicode characters, which might not be
826 supported by some terminals.
827
828 · fancy: Use a more fancy style for the progressbar.
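
For example, to show a plain ASCII progressbar that redraws only
twice per second:

$ rmlint -g -c progressbar:ascii -c progressbar:update_interval=500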
829
830 · pretty: Shows all found items in realtime nicely colored. This for‐
831 matter is activated as default.
832
833 · summary: Shows counts of files and their respective size after the
834 run. Also list all written output files.
835
836 · fdupes: Prints an output similar to the popular duplicate finder
837 fdupes(1). At first a progressbar is printed on stderr. Afterwards
838 the found files are printed on stdout; each set of duplicates gets
839 printed as a block separated by newlines. Originals are highlighted
840 in green. At the bottom a summary is printed on stderr. This is
841 mostly useful for scripts that were set up for parsing fdupes output.
842 We recommend the json formatter for every other scripting purpose.
843
844 Available options:
845
846 · omitfirst: Same as the -f / --omitfirst option in fdupes(1). Omits
the first line of each set of duplicates (i.e. the original file).
848
849 · sameline: Same as the -1 / --sameline option in fdupes(1). Does not
850 print newlines between files, only a space. Newlines are printed
851 only between sets of duplicates.
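
For example, to feed a script that expects fdupes-style output with
each duplicate set on a single line (the directory name is only
illustrative):

$ rmlint -o fdupes -c fdupes:sameline old_photos/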
852
853 OTHER STAND-ALONE COMMANDS
854 rmlint --gui
855 Start the optional graphical frontend to rmlint called Shredder.
856
857 This will only work when Shredder and its dependencies were
858 installed. See also:
859 http://rmlint.readthedocs.org/en/latest/gui.html
860
861 The gui has its own set of options, see --gui --help for a list.
These should be placed at the end, i.e. rmlint --gui [options]
863 when calling it from commandline.
864
865 rmlint --hash [paths...]
866 Make rmlint work as a multi-threaded file hash utility, similar
867 to the popular md5sum or sha1sum utilities, but faster and with
868 more algorithms. A set of paths given on the commandline or
869 from stdin is hashed using one of the available hash algorithms.
870 Use rmlint --hash -h to see options.
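
A possible invocation, assuming the hash mode accepts the -a
algorithm switch described above (check rmlint --hash -h for the
exact set of options):

$ rmlint --hash -a sha256 *.iso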
871
872 rmlint --equal [paths...]
873 Check if the paths given on the commandline all have equal con‐
874 tent. If all paths are equal and no other error happened, rmlint
875 will exit with an exit code 0. Otherwise it will exit with a
876 nonzero exit code. All other options can be used as normal, but
877 note that no other formatters (sh, csv etc.) will be executed by
878 default. At least two paths need to be passed.
879
880 Note: This even works for directories and also in combination
881 with paranoid mode (pass -pp for byte comparison); remember that
882 rmlint does not care about the layout of the directory, but only
about the content of the files in it.
885
886 By default this will use hashing to compare the files and/or
887 directories.
888
889 rmlint --dedupe [-r] [-v|-V] <src> <dest>
890 If the filesystem supports files sharing physical storage
between multiple files, and if src and dest have the same content,
this command makes the data in the src file appear in the dest file
by sharing the underlying storage.
894
895 This command is similar to cp --reflink=always <src> <dest>
except that it (a) checks that src and dest have identical data, and
(b) makes no changes to dest's metadata.
898
899 Running with -r option will enable deduplication of read-only
900 [btrfs] snapshots (requires root).
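
For example, to let two files with identical content share their
underlying storage (the paths are only illustrative):

$ rmlint --dedupe big.img backup/big.img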
901
902 rmlint --is-reflink [-v|-V] <file1> <file2>
903 Tests whether file1 and file2 are reflinks (reference same
904 data). This command makes rmlint exit with one of the following
905 exit codes:
906
907 · 0: files are reflinks
908
909 · 1: files are not reflinks
910
911 · 3: not a regular file
912
913 · 4: file sizes differ
914
915 · 5: fiemaps can't be read
916
917 · 6: file1 and file2 are the same path
918
919 · 7: file1 and file2 are the same file under different mount‐
920 points
921
922 · 8: files are hardlinks
923
924 · 9: files are symlinks
925
926 · 10: files are not on same device
927
928 · 11: other error encountered
929
930 EXAMPLES
931 This is a collection of common use cases and other tricks:
932
933 · Check the current working directory for duplicates.
934
935 $ rmlint
936
937 · Show a progressbar:
938
939 $ rmlint -g
940
941 · Quick re-run on large datasets using different ranking criteria on
942 second run:
943
944 $ rmlint large_dir/ # First run; writes rmlint.json
945
946 $ rmlint --replay rmlint.json large_dir -S MaD
947
948 · Merge together previous runs, but prefer the originals to be from
949 b.json and make sure that no files are deleted from b.json:
950
951 $ rmlint --replay a.json // b.json -k
952
953 · Search only for duplicates and duplicate directories
954
955 $ rmlint -T "df,dd" .
956
957 · Compare files byte-by-byte in current directory:
958
959 $ rmlint -pp .
960
961 · Find duplicates with same basename (excluding extension):
962
963 $ rmlint -e
964
965 · Do more complex traversal using find(1).
966
967 $ find /usr/lib -iname '*.so' -type f | rmlint - # find all duplicate
968 .so files
969
970 $ find /usr/lib -iname '*.so' -type f -print0 | rmlint -0 # as above
971 but handles filenames with newline character in them
972
973 $ find ~/pics -iname '*.png' | ./rmlint - # compare png files only
974
975 · Limit file size range to investigate:
976
977 $ rmlint -s 2GB # Find everything >= 2GB
978
$ rmlint -s 0-2GB # Find everything up to 2GB
980
981 · Only find writable and executable files:
982
983 $ rmlint --perms wx
984
985 · Reflink if possible, else hardlink duplicates to original if possi‐
986 ble, else replace duplicate with a symbolic link:
987
988 $ rmlint -c sh:link
989
990 · Inject user-defined command into shell script output:
991
992 $ rmlint -o sh -c sh:cmd='echo "original:" "$2" "is the same as"
993 "$1"'
994
995 · Use shred to overwrite the contents of a file fully:
996
997 $ rmlint -c 'sh:cmd=shred -un 10 "$1"'
998
999 · Use data as master directory. Find only duplicates in backup that are
1000 also in data. Do not delete any files in data:
1001
1002 $ rmlint backup // data --keep-all-tagged --must-match-tagged
1003
· Check whether the directories a, b and c are equal
1005
1006 $ rmlint --equal a b c && echo "Files are equal" || echo "Files are
1007 not equal"
1008
1009 · Test if two files are reflinks
1010
1011 $ rmlint --is-reflink a b && echo "Files are reflinks" || echo "Files
1012 are not reflinks".
1013
1014 · Cache calculated checksums for next run. The checksums will be writ‐
1015 ten to the extended file attributes:
1016
1017 $ rmlint --xattr
1018
1019 · Produce a list of unique files in a folder:
1020
1021 $ rmlint -o uniques
1022
1023 · Produce a list of files that are unique, including original files
1024 ("one of each"):
1025
$ rmlint t -o json -o uniques:unique_files | jq -r '.[1:-1][] |
select(.is_original) | .path' | sort > original_files
$ cat unique_files original_files
1029
1030 PROBLEMS
1031 1. False Positives: Depending on the options you use, there is a very
1032 slight risk of false positives (files that are erroneously detected
1033 as duplicate). The default hash function (blake2b) is very safe but
1034 in theory it is possible for two files to have then same hash. If
1035 you had 10^73 different files, all the same size, then the chance of
1036 a false positive is still less than 1 in a billion. If you're con‐
1037 cerned just use the --paranoid (-pp) option. This will compare all
1038 the files byte-by-byte and is not much slower than blake2b (it may
1039 even be faster), although it is a lot more memory-hungry.
1040
1041 2. File modification during or after rmlint run: It is possible that a
1042 file that rmlint recognized as duplicate is modified afterwards,
1043 resulting in a different file. If you use the rmlint-generated
1044 shell script to delete the duplicates, you can run it with the -p
1045 option to do a full re-check of the duplicate against the original
1046 before it deletes the file. When using -c sh:hardlink or -c sh:sym‐
1047 link care should be taken that a modification of one file will now
1048 result in a modification of all files. This is not the case for -c
1049 sh:reflink or -c sh:clone. Use -c sh:link to minimise this risk.
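
For example, to run the generated script with that byte-by-byte
re-check enabled (rmlint.sh is the default script written by rmlint):

$ sh rmlint.sh -p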
1050
1051 SEE ALSO
Reading the manpages of these tools might help when working with rmlint:
1053
1054 · find(1)
1055
1056 · rm(1)
1057
1058 · cp(1)
1059
1060 Extended documentation and an in-depth tutorial can be found at:
1061
1062 · http://rmlint.rtfd.org
1063
1064 BUGS
If you found a bug, have a feature request or want to say something
1066 nice, please visit https://github.com/sahib/rmlint/issues.
1067
1068 Please make sure to describe your problem in detail. Always include the
1069 version of rmlint (--version). If you experienced a crash, please
include the output of at least one of the following commands, run
with a debug build of rmlint:
1072
1073 · gdb --ex run -ex bt --args rmlint -vvv [your_options]
1074
1075 · valgrind --leak-check=no rmlint -vvv [your_options]
1076
1077 You can build a debug build of rmlint like this:
1078
1079 · git clone git@github.com:sahib/rmlint.git
1080
1081 · cd rmlint
1082
1083 · scons GDB=1 DEBUG=1
1084
1085 · sudo scons install # Optional
1086
1087 LICENSE
1088 rmlint is licensed under the terms of the GPLv3.
1089
1090 See the COPYRIGHT file that came with the source for more information.
1091
1092 PROGRAM AUTHORS
1093 rmlint was written by:
1094
1095 · Christopher <sahib> Pahl 2010-2017 (https://github.com/sahib)
1096
1097 · Daniel <SeeSpotRun> T. 2014-2017 (https://github.com/SeeSpotRun)
1098
1099 Also see the http://rmlint.rtfd.org for other people that helped us.
1100
1101 If you consider a donation you can use Flattr or buy us a beer if we
1102 meet:
1103
1104 https://flattr.com/thing/302682/libglyr
1105
AUTHOR
Christopher Pahl, Daniel Thomas
1108
COPYRIGHT
2014-2020, Christopher Pahl & Daniel Thomas
1111
1112
1113
1114
1115 Feb 02, 2020 RMLINT(1)