RMLINT(1)                    rmlint documentation                    RMLINT(1)

NAME
   rmlint - find duplicate files and other space waste efficiently

SYNOPSIS
   rmlint [TARGET_DIR_OR_FILES ...] [//] [TAGGED_TARGET_DIR_OR_FILES ...]
          [-] [OPTIONS]
12
DESCRIPTION
   rmlint finds space waste and other broken things on your filesystem.
   Its main focus lies on finding duplicate files and directories.

   It is able to find the following types of lint:

   · Duplicate files and directories (and, as a result, unique files).

   · Nonstripped binaries (binaries with debug symbols; needs to be
     explicitly enabled).

   · Broken symbolic links.

   · Empty files and directories (also nested empty directories).

   · Files with a broken user or group id.

   rmlint itself WILL NOT DELETE ANY FILES. It does however produce
   executable output (for example a shell script) to help you delete the
   files if you want to. Another design principle is that it should work
   well together with other tools like find. Therefore we do not
   replicate features of other well-known programs, for example pattern
   matching and finding duplicate filenames. However, we provide many
   convenience options for common use cases that are hard to build from
   scratch with standard tools.
38
   In order to find the lint, rmlint is given one or more directories to
   traverse. If no directories or files are given, the current working
   directory is assumed. By default, rmlint will ignore hidden files and
   will not follow symlinks (see Traversal Options). rmlint will first
   find "other lint" and then search the remaining files for duplicates.
44
   rmlint tries to be helpful by guessing which file of a group of
   duplicates is the original (i.e. the file that should not be
   deleted). It does this by using different sorting strategies that can
   be controlled via the -S option. By default it chooses the
   first-named path on the commandline. If two duplicates come from the
   same path, it will also apply different fallback sort strategies (see
   the documentation of the -S strategy).
52
   This behaviour can also be overridden if you know that a certain
   directory contains duplicates and another one originals. In this case
   you write the original directory after specifying a single // on the
   commandline. Everything that comes after is a preferred (or a
   "tagged") directory. If there are duplicates from an unpreferred and
   from a preferred directory, the preferred one will always count as
   the original. Special options can also be used to always keep files
   in preferred directories (-k) and to only find duplicates that are
   present in both given directories (-m).
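
   To make the tagging workflow concrete, a minimal sketch using made-up
   paths (the rmlint call is skipped when the tool is not installed;
   remember that rmlint itself deletes nothing):

```shell
# Hypothetical layout: 'backup' may contain copies of files in 'originals'.
rm -rf /tmp/rmlint_tag_demo
mkdir -p /tmp/rmlint_tag_demo/backup /tmp/rmlint_tag_demo/originals
echo "same bytes" > /tmp/rmlint_tag_demo/originals/report.txt
cp /tmp/rmlint_tag_demo/originals/report.txt /tmp/rmlint_tag_demo/backup/report_copy.txt

# Everything after // is "tagged": files there are preferred as originals.
# -k additionally guarantees nothing inside the tagged path is marked for deletion.
if command -v rmlint >/dev/null 2>&1; then
    rmlint /tmp/rmlint_tag_demo/backup // /tmp/rmlint_tag_demo/originals -k -o summary
fi
```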
62
   We advise new users to have a short look at all options rmlint has to
   offer, and maybe test some examples before letting it run on
   productive data. WRONG ASSUMPTIONS ARE THE BIGGEST ENEMY OF YOUR
   DATA. There are some extended examples at the end of this manual, and
   the description of each option that is not self-explanatory also
   tries to give examples.
68
69 OPTIONS
70 General Options
71 -T --types="list" (default: defaults)
72 Configure the types of lint rmlint will look for. The list
73 string is a comma-separated list of lint types or lint groups
74 (other separators like semicolon or space also work though).
75
76 One of the following groups can be specified at the beginning of
77 the list:
78
79 · all: Enables all lint types.
80
      · defaults: Enables all lint types except nonstripped.
82
83 · minimal: defaults minus emptyfiles and emptydirs.
84
85 · minimaldirs: defaults minus emptyfiles, emptydirs and dupli‐
86 cates, but with duplicatedirs.
87
88 · none: Disable all lint types [default].
89
90 Any of the following lint types can be added individually, or
91 deselected by prefixing with a -:
92
93 · badids, bi: Find files with bad UID, GID or both.
94
95 · badlinks, bl: Find bad symlinks pointing nowhere valid.
96
97 · emptydirs, ed: Find empty directories.
98
99 · emptyfiles, ef: Find empty files.
100
101 · nonstripped, ns: Find nonstripped binaries.
102
103 · duplicates, df: Find duplicate files.
104
105 · duplicatedirs, dd: Find duplicate directories.
106
      WARNING: It is good practice to enclose the list in single or
      double quotes. In obscure cases argument parsing might fail in
      weird ways, especially when using spaces as separator.
110
111 Example:
112
113 $ rmlint -T "df,dd" # Only search for duplicate files and directories
114 $ rmlint -T "all -df -dd" # Search for all lint except duplicate files and dirs.
115
116 -o --output=spec / -O --add-output=spec (default: -o sh:rmlint.sh -o
117 pretty:stdout -o summary:stdout -o json:rmlint.json)
118 Configure the way rmlint outputs its results. A spec is in the
119 form format:file or just format. A file might either be an
120 arbitrary path or stdout or stderr. If file is omitted, stdout
121 is assumed. format is the name of a formatter supported by this
122 program. For a list of formatters and their options, refer to
123 the Formatters section below.
124
      If -o is specified, rmlint's default outputs are overwritten.
      With -O the defaults are preserved. Either -o or -O may be
      specified multiple times to get multiple outputs, including
      multiple outputs of the same format.
129
130 Examples:
131
      $ rmlint -o json                 # Stream the json output to stdout
      $ rmlint -O csv:/tmp/rmlint.csv  # Output an extra csv file to /tmp
134
135 -c --config=spec[=value] (default: none)
136 Configure a format. This option can be used to fine-tune the be‐
137 haviour of the existing formatters. See the Formatters section
138 for details on the available keys.
139
140 If the value is omitted it is set to a value meaning "enabled".
141
142 Examples:
143
144 $ rmlint -c sh:link # Smartly link duplicates instead of removing
145 $ rmlint -c progressbar:fancy # Use a different theme for the progressbar
146
   -z --perms[=[rwx]] (default: no check)
      Only look into a file if it is readable, writable or executable by
      the current user. Which of these are required can be given as an
      argument consisting of the letters "rwx".

      If no argument is given, "rw" is assumed. Note that r does
      basically nothing user-visible, since rmlint will ignore
      unreadable files anyway. It's just there for the sake of
      completeness.

      By default this check is not done.

      $ rmlint -z rx $(echo $PATH | tr ":" " ")  # Look at all executable files in $PATH
160
   -a --algorithm=name (default: blake2b)
      Choose the algorithm to use for finding duplicate files. The
      algorithm can be either paranoid (byte-by-byte file comparison)
      or use one of several file hash algorithms to identify
      duplicates. The following hash families are available (in
      approximate descending order of cryptographic strength):

         sha3, blake
         sha
         highway, md
         metro, murmur, xxhash

      The weaker hash functions still offer excellent distribution
      properties, but are potentially more vulnerable to malicious
      crafting of duplicate files.
179
180 The full list of hash functions (in decreasing order of checksum
181 length) is:
182
183 512-bit: blake2b, blake2bp, sha3-512, sha512
184
      384-bit: sha3-384
186
187 256-bit: blake2s, blake2sp, sha3-256, sha256, highway256,
188 metro256, metrocrc256
189
190 160-bit: sha1
191
192 128-bit: md5, murmur, metro, metrocrc
193
194 64-bit: highway64, xxhash.
195
196 The use of 64-bit hash length for detecting duplicate files is
197 not recommended, due to the probability of a random hash colli‐
198 sion.
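
   As a sketch of picking an algorithm (temporary paths are made up for
   illustration; the rmlint calls run only if the tool is installed):

```shell
# Two identical files in a scratch directory:
rm -rf /tmp/rmlint_hash_demo && mkdir -p /tmp/rmlint_hash_demo
printf 'payload\n' > /tmp/rmlint_hash_demo/a
cp /tmp/rmlint_hash_demo/a /tmp/rmlint_hash_demo/b

if command -v rmlint >/dev/null 2>&1; then
    # Strong hash, sensible for untrusted data:
    rmlint -a sha256 /tmp/rmlint_hash_demo -o summary
    # Byte-by-byte comparison, no hashing at all:
    rmlint -a paranoid /tmp/rmlint_hash_demo -o summary
fi
```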
199
200 -p --paranoid / -P --less-paranoid (default)
201 Increase or decrease the paranoia of rmlint's duplicate algo‐
202 rithm. Use -p if you want byte-by-byte comparison without any
203 hashing.
204
205 · -p is equivalent to --algorithm=paranoid
206
207 · -P is equivalent to --algorithm=highway256
208
209 · -PP is equivalent to --algorithm=metro256
210
211 · -PPP is equivalent to --algorithm=metro
212
213 -v --loud / -V --quiet
214 Increase or decrease the verbosity. You can pass these options
215 several times. This only affects rmlint's logging on stderr, but
216 not the outputs defined with -o. Passing either option more than
217 three times has no further effect.
218
219 -g --progress / -G --no-progress (default)
220 Show a progressbar with sane defaults.
221
222 Convenience shortcut for -o progressbar -o summary -o
223 sh:rmlint.sh -o json:rmlint.json -VVV.
224
225 NOTE: This flag clears all previous outputs. If you want addi‐
226 tional outputs, specify them after this flag using -O.
227
   -D --merge-directories (default: disabled)
      Makes rmlint use a special mode where all found duplicates are
      collected and checked to see whether whole directory trees are
      duplicates. Use with caution: you should always make sure that
      the investigated directory is not modified while rmlint or its
      removal scripts run.
234
235 IMPORTANT: Definition of equal: Two directories are considered
236 equal by rmlint if they contain the exact same data, no matter
237 how the files containing the data are named. Imagine that rmlint
238 creates a long, sorted stream out of the data found in the
239 directory and compares this in a magic way to another directory.
240 This means that the layout of the directory is not considered to
241 be important by default. Also empty files will not count as con‐
242 tent. This might be surprising to some users, but remember that
243 rmlint generally cares only about content, not about any other
244 metadata or layout. If you want to only find trees with the same
245 hierarchy you should use --honour-dir-layout / -j.
246
      Output is deferred until all duplicates have been found.
      Duplicate directories are printed first, followed by any
      remaining duplicate files that are isolated or inside of any
      original directories.
251
      --rank-by applies for directories too, but 'p' or 'P' (path
      index) has no defined (i.e. useful) meaning. Sorting only takes
      place when the number of preferred files in the directory
      differs.
256
257 NOTES:
258
259 · This option enables --partial-hidden and -@ (--see-symlinks)
260 for convenience. If this is not desired, you should change
261 this after specifying -D.
262
263 · This feature might add some runtime for large datasets.
264
265 · When using this option, you will not be able to use the -c
266 sh:clone option. Use -c sh:link as a good alternative.
267
268 -j --honour-dir-layout (default: disabled)
269 Only recognize directories as duplicates that have the same path
270 layout. In other words: All duplicates that build the duplicate
271 directory must have the same path from the root of each respec‐
272 tive directory. This flag makes no sense without --merge-direc‐
273 tories.
274
275 -y --sort-by=order (default: none)
276 During output, sort the found duplicate groups by criteria
277 described by order. order is a string that may consist of one
278 or more of the following letters:
279
280 · s: Sort by size of group.
281
282 · a: Sort alphabetically by the basename of the original.
283
284 · m: Sort by mtime of the original.
285
286 · p: Sort by path-index of the original.
287
288 · o: Sort by natural found order (might be different on each
289 run).
290
291 · n: Sort by number of files in the group.
292
293 The letter may also be written uppercase (similar to -S /
294 --rank-by) to reverse the sorting. Note that rmlint has to hold
295 back all results to the end of the run before sorting and print‐
296 ing.
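
      A small sketch of group sorting (scratch paths are made up; the
      rmlint call is skipped without the tool):

```shell
rm -rf /tmp/rmlint_sort_demo && mkdir -p /tmp/rmlint_sort_demo
echo data > /tmp/rmlint_sort_demo/a
cp /tmp/rmlint_sort_demo/a /tmp/rmlint_sort_demo/b

if command -v rmlint >/dev/null 2>&1; then
    # 'S' prints the largest groups first (uppercase reverses 's');
    # 'a' breaks ties alphabetically by the original's basename.
    rmlint -y Sa /tmp/rmlint_sort_demo -o pretty
fi
```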
297
   -w --with-color (default) / -W --no-with-color
      Use color escapes for pretty output or disable them. If you
      pipe rmlint's output to a file, -W is assumed automatically.
301
302 -h --help / -H --show-man
303 Show a shorter reference help text (-h) or the full man page
304 (-H).
305
306 --version
307 Print the version of rmlint. Includes git revision and compile
308 time features. Please include this when giving feedback to us.
309
310 Traversal Options
   -s --size=range (default: 1)
312 Only consider files as duplicates in a certain size range. The
313 format of range is min-max, where both ends can be specified as
314 a number with an optional multiplier. The available multipliers
315 are:
316
317 · C (1^1), W (2^1), B (512^1), K (1000^1), KB (1024^1), M
318 (1000^2), MB (1024^2), G (1000^3), GB (1024^3),
319
320 · T (1000^4), TB (1024^4), P (1000^5), PB (1024^5), E (1000^6),
321 EB (1024^6)
322
323 The size format is about the same as dd(1) uses. A valid example
324 would be: "100KB-2M". This limits duplicates to a range from 100
325 Kilobyte to 2 Megabyte.
326
      It's also possible to specify only one size. In this case the
      size is interpreted as "bigger or equal". If you want to filter
      for files up to this size you can add a - in front (-s -1M
      == -s 0-1M).
331
332 Edge case: The default excludes empty files from the duplicate
333 search. Normally these are treated specially by rmlint by han‐
334 dling them as other lint. If you want to include empty files as
335 duplicates you should lower the limit to zero:
336
337 $ rmlint -T df --size 0
338
   -d --max-depth=depth (default: INF)
      Only recurse up to this depth. A depth of 1 disables recursion
      and is equivalent to a directory listing. A depth of 2 would
      also consider all child directories, and so on.
343
   -l --hardlinked (default) / --keep-hardlinked / -L --no-hardlinked
      Hardlinked files are treated as duplicates by default
      (--hardlinked). If --keep-hardlinked is given, rmlint will not
      delete any files that are hardlinked to an original in their
      respective group. Such files will be displayed like originals,
      i.e. for the default output with a "ls" in front. The reasoning
      here is to maximize the number of kept files while still
      maximizing the amount of freed space: removing hardlinks to
      originals would not free any space anyway.
353
      If --no-hardlinked is given, only one file (of a set of
      hardlinked files) is considered; all the others are ignored.
      This means they are neither deleted nor even shown in the
      output. The "highest ranked" file of the set is the one that is
      considered.
   -f --followlinks / -F --no-followlinks / -@ --see-symlinks (default)
      -f will always follow symbolic links. If filesystem loops occur,
      rmlint will detect this. If -F is specified, symbolic links will
      be ignored completely; if -@ is specified, rmlint will see
      symlinks and treat them like small files containing the path to
      their target. The latter is the default behaviour, since it is a
      sensible default for --merge-directories.
367
368 -x --no-crossdev / -X --crossdev (default)
369 Stay always on the same device (-x), or allow crossing mount‐
370 points (-X). The latter is the default.
371
   -r --hidden / -R --no-hidden (default) / --partial-hidden
      Also traverse hidden directories? This is often not a good idea,
      since directories like .git/ would be investigated, possibly
      leading to the deletion of internal git files which in turn
      breaks a repository. With --partial-hidden, hidden files and
      folders are only considered if they're inside duplicate
      directories (see --merge-directories) and will be deleted as
      part of them.
380
381 -b --match-basename
382 Only consider those files as dupes that have the same basename.
383 See also man 1 basename. The comparison of the basenames is
384 case-insensitive.
385
386 -B --unmatched-basename
387 Only consider those files as dupes that do not share the same
388 basename. See also man 1 basename. The comparison of the base‐
389 names is case-insensitive.
390
391 -e --match-with-extension / -E --no-match-with-extension (default)
392 Only consider those files as dupes that have the same file
393 extension. For example two photos would only match if they are a
394 .png. The extension is compared case-insensitive, so .PNG is the
395 same as .png.
396
397 -i --match-without-extension / -I --no-match-without-extension
398 (default)
399 Only consider those files as dupes that have the same basename
400 minus the file extension. For example: banana.png and
401 Banana.jpeg would be considered, while apple.png and peach.png
402 won't. The comparison is case-insensitive.
403
404 -n --newer-than-stamp=<timestamp_filename> / -N
405 --newer-than=<iso8601_timestamp_or_unix_timestamp>
406 Only consider files (and their size siblings for duplicates)
407 newer than a certain modification time (mtime). The age barrier
408 may be given as seconds since the epoch or as ISO8601-Timestamp
409 like 2014-09-08T00:12:32+0200.
410
      -n expects a file from which it can read the timestamp. After
      the rmlint run, the file will be updated with the current
      timestamp. If the file does not initially exist, no filtering is
      done but the stampfile is still written.
415
416 -N, in contrast, takes the timestamp directly and will not write
417 anything.
418
      Note that rmlint will find duplicates newer than timestamp, even
      if the original is older. If you only want to find duplicates
      where both the original and the duplicate are newer than
      timestamp, you can use find(1):
423
424 · find -mtime -1 -print0 | rmlint -0 # pass all files younger
425 than a day to rmlint
426
427 Note: you can make rmlint write out a compatible timestamp with:
428
429 · -O stamp:stdout # Write a seconds-since-epoch timestamp to
430 stdout on finish.
431
432 · -O stamp:stdout -c stamp:iso8601 # Same, but write as ISO8601.
433
434 Original Detection Options
435 -k --keep-all-tagged / -K --keep-all-untagged
436 Don't delete any duplicates that are in tagged paths (-k) or
437 that are in non-tagged paths (-K). (Tagged paths are those that
438 were named after //).
439
440 -m --must-match-tagged / -M --must-match-untagged
441 Only look for duplicates of which at least one is in one of the
442 tagged paths. (Paths that were named after //).
443
444 Note that the combinations of -kM and -Km are prohibited by
445 rmlint. See https://github.com/sahib/rmlint/issues/244 for more
446 information.
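
      A sketch of a typical -k/-m cleanup (directory names are
      invented; the rmlint call is skipped when the tool is absent):

```shell
rm -rf /tmp/rmlint_km_demo
mkdir -p /tmp/rmlint_km_demo/clutter /tmp/rmlint_km_demo/archive
echo "keep me" > /tmp/rmlint_km_demo/archive/doc.txt
cp /tmp/rmlint_km_demo/archive/doc.txt /tmp/rmlint_km_demo/clutter/doc_copy.txt
echo "only here" > /tmp/rmlint_km_demo/clutter/unrelated.txt

if command -v rmlint >/dev/null 2>&1; then
    # -k: never list files under the tagged 'archive' as deletable;
    # -m: only report duplicates that have at least one copy in 'archive'.
    rmlint /tmp/rmlint_km_demo/clutter // /tmp/rmlint_km_demo/archive -k -m -o pretty
fi
```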
447
   -S --rank-by=criteria (default: pOma)
      Sort the files in a group of duplicates into originals and
      duplicates by one or more criteria. Each criterion is defined by
      a single letter (except r and x, which expect a regex pattern
      after the letter). Multiple criteria may be given as a string,
      where the first criterion is the most important. If one
      criterion cannot decide between original and duplicate, the next
      one is tried.
456
457 · m: keep lowest mtime (oldest) M: keep highest mtime
458 (newest)
459
460 · a: keep first alphabetically A: keep last alphabet‐
461 ically
462
463 · p: keep first named path P: keep last named
464 path
465
466 · d: keep path with lowest depth D: keep path with
467 highest depth
468
469 · l: keep path with shortest basename L: keep path with
470 longest basename
471
472 · r: keep paths matching regex R: keep path not
473 matching regex
474
475 · x: keep basenames matching regex X: keep basenames not
476 matching regex
477
478 · h: keep file with lowest hardlink count H: keep file with
479 highest hardlink count
480
481 · o: keep file with lowest number of hardlinks outside of the
482 paths traversed by rmlint.
483
484 · O: keep file with highest number of hardlinks outside of the
485 paths traversed by rmlint.
486
487 Alphabetical sort will only use the basename of the file and
488 ignore its case. One can have multiple criteria, e.g.: -S am
489 will choose first alphabetically; if tied then by mtime. Note:
490 original path criteria (specified using //) will always take
491 first priority over -S options.
492
493 For more fine grained control, it is possible to give a regular
494 expression to sort by. This can be useful when you know a common
495 fact that identifies original paths (like a path component being
496 src or a certain file ending).
497
498 To use the regular expression you simply enclose it in the cri‐
499 teria string by adding <REGULAR_EXPRESSION> after specifying r
500 or x. Example: -S 'r<.*\.bak$>' makes all files that have a .bak
501 suffix original files.
502
      Warning: When using r or x, try to make your regex as specific
      as possible! Good practice includes adding a $ anchor at the end
      of the regex.
506
507 Tips:
508
509 · l is useful for files like file.mp3 vs file.1.mp3 or
510 file.mp3.bak.
511
512 · a can be used as last criteria to assert a defined order.
513
      · o/O and h/H are only useful if there are any hardlinks in the
        traversed path.
516
517 · o/O takes the number of hardlinks outside the traversed paths
518 (and thereby minimizes/maximizes the overall number of
519 hardlinks). h/H in contrast only takes the number of hardlinks
520 inside of the traversed paths. When hardlinking files, one
521 would like to link to the original file with the highest outer
522 link count (O) in order to maximise the space cleanup. H does
523 not maximise the space cleanup, it just selects the file with
524 the highest total hardlink count. You usually want to specify
525 O.
526
527 · pOma is the default since p ensures that first given paths
528 rank as originals, O ensures that hardlinks are handled well,
529 m ensures that the oldest file is the original and a simply
530 ensures a defined ordering if no other criteria applies.
531
532 Caching
533 --replay
534 Read an existing json file and re-output it. When --replay is
535 given, rmlint does no input/output on the filesystem, even if
536 you pass additional paths. The paths you pass will be used for
537 filtering the --replay output.
538
      This is very useful if you want to reformat, refilter or resort
      the output you got from a previous run. Usage is simple: just
      pass --replay on the second run, with the options changed to the
      new formatters or filters. Additionally, pass the .json files of
      the previous runs along with the paths you ran rmlint on. You
      can also merge several previous runs by specifying more than one
      .json file; in this case rmlint will merge all given files and
      output them as one big run.
547
548 If you want to view only the duplicates of certain subdirecto‐
549 ries, just pass them on the commandline as usual.
550
      The usage of // has the same effect as in a normal run. It can
      be used to prefer one .json file over another. However, note
      that running rmlint in --replay mode involves no real disk
      traversal, i.e. only duplicates from previous runs are printed.
      Therefore specifying new paths will simply have no effect. As a
      safety measure, --replay will ignore files whose mtime changed
      in the meantime (i.e. the mtime in the .json file differs from
      the current one). These files might have been modified and are
      silently ignored.
560
561 By design, some options will not have any effect. Those are:
562
563 · --followlinks
564
565 · --algorithm
566
567 · --paranoid
568
569 · --clamp-low
570
571 · --hardlinked
572
573 · --write-unfinished
574
575 · ... and all other caching options below.
576
577 NOTE: In --replay mode, a new .json file will be written to
578 rmlint.replay.json in order to avoid overwriting rmlint.json.
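
      A minimal two-run sketch (scratch paths are made up; the whole
      sequence is skipped when rmlint is not installed):

```shell
if command -v rmlint >/dev/null 2>&1; then
    rm -rf /tmp/rmlint_replay_demo && mkdir -p /tmp/rmlint_replay_demo
    echo dup > /tmp/rmlint_replay_demo/x
    cp /tmp/rmlint_replay_demo/x /tmp/rmlint_replay_demo/y

    # First run: scan the disk and record the results as json.
    rmlint /tmp/rmlint_replay_demo -o json:/tmp/rmlint_replay_demo/run.json

    # Second run: no disk traversal; re-sort and re-format the recorded results.
    rmlint --replay /tmp/rmlint_replay_demo/run.json /tmp/rmlint_replay_demo -y s -o pretty
fi
```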
579
580 --xattr-read / --xattr-write / --xattr-clear
581 Read or write cached checksums from the extended file
582 attributes. This feature can be used to speed up consecutive
583 runs.
584
585 CAUTION: This could potentially lead to false positives if file
586 contents are somehow modified without changing the file mtime.
587
      NOTE: Many tools do not support extended file attributes
      properly, resulting in a loss of this information when copying
      the file or editing it. Also, this is a Linux-specific feature
      that does not work on all filesystems, and only if you have
      write permissions to the file.
593
594 Usage example:
595
596 $ rmlint large_file_cluster/ -U --xattr-write # first run.
597 $ rmlint large_file_cluster/ --xattr-read # second run.
598
599 -U --write-unfinished
600 Include files in output that have not been hashed fully, i.e.
601 files that do not appear to have a duplicate. Note that this
602 will not include all files that rmlint traversed, but only the
603 files that were chosen to be hashed.
604
605 This is mainly useful in conjunction with --xattr-write/read.
606 When re-running rmlint on a large dataset this can greatly speed
607 up a re-run in some cases. Please refer to --xattr-read for an
608 example.
609
610 Rarely used, miscellaneous options
   -t --threads=N (default: 16)
      The number of threads to use during file tree traversal and
      hashing. rmlint probably knows better than you how to set this
      value, so just leave it as it is. Note that setting it to 1 will
      not make rmlint a single-threaded program.
616
   -u --limit-mem=size
      Apply a maximum amount of memory to use for hashing and
      --paranoid. The total amount of memory might still exceed this
      limit, especially when setting it very low. In general, however,
      rmlint will consume about this amount of memory plus a more or
      less constant extra amount that depends on the data you are
      scanning.
624
625 The size-description has the same format as for --size, there‐
626 fore you can do something like this (use this if you have 1GB of
627 memory available):
628
629 $ rmlint -u 512M # Limit paranoid mem usage to 512 MB
630
631 -q --clamp-low=[fac.tor|percent%|offset] (default: 0) / -Q
632 --clamp-top=[fac.tor|percent%|offset] (default: 1.0)
633 The argument can be either passed as factor (a number with a .
634 in it), a percent value (suffixed by %) or as absolute number or
635 size spec, like in --size.
636
      Only look at the content of files in the range from low to
      (and including) high. This means, if the range is less than -q
      0% to -Q 100%, then only partial duplicates are searched. If the
      file size is less than the clamp limits, the file is ignored
      during traversal. Be careful when using this function; you can
      easily get dangerous results for small files.
643
644 This is useful in a few cases where a file consists of a con‐
645 stant sized header or footer. With this option you can just com‐
646 pare the data in between. Also it might be useful for approxi‐
647 mate comparison where it suffices when the file is the same in
648 the middle part.
649
650 Example:
651
652 $ rmlint -q 10% -Q 512M # Only read the last 90% of a file, but
653 read at max. 512MB
654
655 -Z --mtime-window=T (default: -1)
656 Only consider those files as duplicates that have the same con‐
657 tent and the same modification time (mtime) within a certain
658 window of T seconds. If T is 0, both files need to have the
659 same mtime. For T=1 they may differ one second and so on. If the
660 window size is negative, the mtime of duplicates will not be
661 considered. T may be a floating point number.
662
663 However, with three (or more) files, the mtime difference
664 between two duplicates can be bigger than the mtime window T,
665 i.e. several files may be chained together by the window. Exam‐
666 ple: If T is 1, the four files fooA (mtime: 00:00:00), fooB
667 (00:00:01), fooC (00:00:02), fooD (00:00:03) would all belong to
668 the same duplicate group, although the mtime of fooA and fooD
669 differs by 3 seconds.
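
      A sketch of the window in action, using GNU touch to fake mtimes
      (scratch paths are made up; rmlint runs only if installed):

```shell
rm -rf /tmp/rmlint_mtime_demo && mkdir -p /tmp/rmlint_mtime_demo
echo same > /tmp/rmlint_mtime_demo/a
cp /tmp/rmlint_mtime_demo/a /tmp/rmlint_mtime_demo/b
touch -d "2020-01-01 00:00:00" /tmp/rmlint_mtime_demo/a
touch -d "2020-01-01 00:00:05" /tmp/rmlint_mtime_demo/b   # 5 seconds apart

if command -v rmlint >/dev/null 2>&1; then
    rmlint -Z 1  /tmp/rmlint_mtime_demo -o pretty  # 1s window: not reported as duplicates
    rmlint -Z 10 /tmp/rmlint_mtime_demo -o pretty  # 10s window: reported as duplicates
fi
```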
670
   --with-fiemap (default) / --without-fiemap
      Enable or disable reading the file extents on rotational disks
      in order to optimize disk access patterns. If this feature is
      not available, it is disabled automatically.
675
676 FORMATTERS
677 · csv: Output all found lint as comma-separated-value list.
678
679 Available options:
680
681 · no_header: Do not write a first line describing the column headers.
682
   · sh: Output all found lint as a shell script. This formatter is
     activated by default.
687
688 Available options:
689
690 · cmd: Specify a user defined command to run on duplicates. The com‐
691 mand can be any valid /bin/sh-expression. The duplicate path and
692 original path can be accessed via "$1" and "$2". The command will
693 be written to the user_command function in the sh-file produced by
694 rmlint.
695
      · handler: Define a comma-separated list of handlers to try on
        duplicate files in the given order until one handler succeeds.
        Handlers are just the name of a way of getting rid of the file
        and can be any of the following:
700
        · clone: For reflink-capable filesystems only. Try to clone
          both files with the FIDEDUPERANGE ioctl(3p) (or
          BTRFS_IOC_FILE_EXTENT_SAME on older kernels). This will free
          up duplicate extents. Needs at least kernel 4.2. Use this
          option when you only have read-only access to a btrfs
          filesystem but still want to deduplicate it. This is usually
          the case for snapshots.
708
709 · reflink: Try to reflink the duplicate file to the original. See
710 also --reflink in man 1 cp. Fails if the filesystem does not sup‐
711 port it.
712
713 · hardlink: Replace the duplicate file with a hardlink to the orig‐
714 inal file. The resulting files will have the same inode number.
715 Fails if both files are not on the same partition. You can use ls
716 -i to show the inode number of a file and find -samefile <path>
717 to find all hardlinks for a certain file.
718
719 · symlink: Tries to replace the duplicate file with a symbolic link
720 to the original. This handler never fails.
721
722 · remove: Remove the file using rm -rf. (-r for duplicate dirs).
723 This handler never fails.
724
725 · usercmd: Use the provided user defined command (-c sh:cmd=some‐
726 thing). Never fails.
727
728 Default is remove.
729
730 · link: Shortcut for -c sh:handler=clone,reflink,hardlink,symlink.
731 Use this if you are on a reflink-capable system.
732
733 · hardlink: Shortcut for -c sh:handler=hardlink,symlink. Use this if
734 you want to hardlink files, but want to fallback for duplicates
735 that lie on different devices.
736
737 · symlink: Shortcut for -c sh:handler=symlink. Use this as last
738 straw.
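
   The handler chain above can be sketched as follows (scratch paths
   are made up; the rmlint call is skipped when the tool is absent):

```shell
rm -rf /tmp/rmlint_sh_demo && mkdir -p /tmp/rmlint_sh_demo
echo twin > /tmp/rmlint_sh_demo/one
cp /tmp/rmlint_sh_demo/one /tmp/rmlint_sh_demo/two

if command -v rmlint >/dev/null 2>&1; then
    # Generate a script that hardlinks duplicates, falling back to symlinks:
    rmlint /tmp/rmlint_sh_demo -c sh:handler=hardlink,symlink \
           -o sh:/tmp/rmlint_sh_demo/cleanup.sh
    # Nothing is deleted yet; review cleanup.sh, then run it yourself:
    #   sh /tmp/rmlint_sh_demo/cleanup.sh
fi
```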
739
740 · json: Print a JSON-formatted dump of all found reports. Outputs all
741 lint as a json document. The document is a list of dictionaries,
742 where the first and last element is the header and the footer. Every‐
743 thing between are data-dictionaries.
744
745 Available options:
746
747 · no_header=[true|false]: Print the header with metadata (default:
748 true)
749
750 · no_footer=[true|false]: Print the footer with statistics (default:
751 true)
752
      · oneline=[true|false]: Print one json document per line
        (default: false). This is useful if you plan to parse the
        output line-by-line, e.g. while rmlint is still running.
756
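   Because oneline mode emits one JSON document per line, plain
   line-oriented tools suffice for parsing. The stream below is a
   hand-written mock (the field names shown are illustrative, not a
   guaranteed schema):

```shell
# Mock of a oneline json stream: header, two entries, footer.
cat > /tmp/rmlint_json_demo.jsonl <<'EOF'
{"description": "rmlint json-dump of lint files"}
{"type": "duplicate_file", "path": "/tmp/a", "is_original": true}
{"type": "duplicate_file", "path": "/tmp/b", "is_original": false}
{"aborted": false, "total_files": 2}
EOF

# Count the non-original duplicates, one document per line:
grep -c '"is_original": false' /tmp/rmlint_json_demo.jsonl   # prints: 1
```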
   · py: Outputs a python script and a JSON document, just like the
     json formatter. The JSON document is written to .rmlint.json;
     executing the script will make it read from there. This formatter
     is mostly intended for complex use-cases where the lint needs
     special handling that you define in the python script. Therefore
     the python script can be modified to do things standard rmlint is
     not able to do easily.
764
   · stamp: Outputs a timestamp of the time rmlint was run. See also
     the --newer-than and --newer-than-stamp options.
769
770 Available options:
771
      · iso8601=[true|false]: Write an ISO8601-formatted timestamp or
        seconds since the epoch?

   · progressbar: Shows a progressbar. This is meant for use with
     stdout or stderr [default].

     See also: -g (--progress) for a convenience shortcut option.

     Available options:

     · update_interval=number: Number of milliseconds to wait between
       updates. Higher values use less resources (default 50).

     · ascii: Do not attempt to use unicode characters, which might not
       be supported by some terminals.

     · fancy: Use a more fancy style for the progressbar.

   · pretty: Shows all found items in realtime, nicely colored. This
     formatter is activated by default.

   · summary: Shows counts of files and their respective size after the
     run. Also lists all written output files.

   · fdupes: Prints an output similar to the popular duplicate finder
     fdupes(1). At first a progressbar is printed on stderr; afterwards
     the found files are printed on stdout. Each set of duplicates gets
     printed as a block separated by newlines. Originals are
     highlighted in green. At the bottom a summary is printed on
     stderr. This is mostly useful for scripts that were set up for
     parsing fdupes output. We recommend the json formatter for every
     other scripting purpose.

     Available options:

     · omitfirst: Same as the -f / --omitfirst option in fdupes(1).
       Omits the first line of each set of duplicates (i.e. the
       original file).

     · sameline: Same as the -1 / --sameline option in fdupes(1). Does
       not print newlines between files, only a space. Newlines are
       printed only between sets of duplicates.

OTHER STAND-ALONE COMMANDS
   rmlint --gui
          Start the optional graphical frontend to rmlint, called
          Shredder.

          This will only work when Shredder and its dependencies are
          installed. See also:
          http://rmlint.readthedocs.org/en/latest/gui.html

          The gui has its own set of options; see --gui --help for a
          list. These should be placed at the end, i.e. rmlint --gui
          [options] when calling it from the commandline.

   rmlint --hash [paths...]
          Make rmlint work as a multi-threaded file hash utility,
          similar to the popular md5sum or sha1sum utilities, but
          faster and with more algorithms. A set of paths given on the
          commandline or from stdin is hashed using one of the
          available hash algorithms. Use rmlint --hash -h to see
          options.

   rmlint --equal [paths...]
          Check if the paths given on the commandline all have equal
          content. If all paths are equal and no other error happened,
          rmlint will exit with exit code 0; otherwise it will exit
          with a nonzero exit code. All other options can be used as
          normal, but note that no other formatters (sh, csv etc.) will
          be executed by default. At least two paths need to be passed.

          Note: This even works for directories and also in combination
          with paranoid mode (pass -pp for byte comparison); remember
          that rmlint does not care about the layout of the directory,
          but only about the content of the files in it.

          By default this will use hashing to compare the files and/or
          directories.

   rmlint --dedupe [-r] [-v|-V] <src> <dest>
          If the filesystem supports sharing physical storage between
          multiple files, and if src and dest have the same content,
          this command makes the data in the src file appear in the
          dest file by sharing the underlying storage.

          This command is similar to cp --reflink=always <src> <dest>,
          except that it (a) checks that src and dest have identical
          data, and (b) makes no changes to dest's metadata.

          Running with the -r option will enable deduplication of
          read-only [btrfs] snapshots (requires root).

   rmlint --is-reflink [-v|-V] <file1> <file2>
          Tests whether file1 and file2 are reflinks (reference the
          same data). Return codes:

          0:  files are reflinks
          1:  files are not reflinks
          3:  not a regular file
          4:  file sizes differ
          5:  fiemaps can't be read
          6:  file1 and file2 are the same path
          7:  file1 and file2 are the same file under different
              mountpoints
          8:  files are hardlinks
          9:  files are symlinks (TODO)
          10: files are not on same device
          11: other error encountered
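
          The return codes above can be dispatched in a small shell
          wrapper; a minimal sketch (the classify_reflink helper name
          and messages are ours, not part of rmlint):

```shell
# Map the exit code of a command such as 'rmlint --is-reflink a b'
# to a human-readable message. Only a few of the documented codes are
# handled here; the rest fall through to the default branch.
classify_reflink() {
    "$@" >/dev/null 2>&1
    case $? in
        0)  echo "files are reflinks" ;;
        1)  echo "files are not reflinks" ;;
        8)  echo "files are hardlinks" ;;
        10) echo "files are not on the same device" ;;
        *)  echo "error or unsupported case" ;;
    esac
}
# Usage (assumes rmlint is installed):
#   classify_reflink rmlint --is-reflink file1 file2
```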

EXAMPLES
   This is a collection of common use cases and other tricks:

   · Check the current working directory for duplicates:

     $ rmlint

   · Show a progressbar:

     $ rmlint -g

   · Quick re-run on large datasets, using different ranking criteria
     on the second run:

     $ rmlint large_dir/ # First run; writes rmlint.json

     $ rmlint --replay rmlint.json large_dir -S MaD

   · Merge together previous runs, but prefer the originals to be from
     b.json and make sure that no files are deleted from b.json:

     $ rmlint --replay a.json // b.json -k

   · Search only for duplicates and duplicate directories:

     $ rmlint -T "df,dd" .

   · Compare files byte-by-byte in the current directory:

     $ rmlint -pp .

   · Find duplicates with the same basename (excluding extension):

     $ rmlint -e

   · Do more complex traversal using find(1):

     $ find /usr/lib -iname '*.so' -type f | rmlint - # find all
       duplicate .so files

     $ find /usr/lib -iname '*.so' -type f -print0 | rmlint -0 # as
       above, but handles filenames with newline characters in them

     $ find ~/pics -iname '*.png' | ./rmlint - # compare png files only

   · Limit the file size range to investigate:

     $ rmlint -s 2GB # Find everything >= 2GB

     $ rmlint -s 0-2GB # Find everything < 2GB

   · Only find writable and executable files:

     $ rmlint --perms wx

   · Reflink duplicates to the original if possible, else hardlink
     them, else replace each duplicate with a symbolic link:

     $ rmlint -c sh:link

   · Inject a user-defined command into the shell script output:

     $ rmlint -o sh -c sh:cmd='echo "original:" "$2" "is the same as" "$1"'

   · Use data as the master directory. Find only duplicates in backup
     that are also in data; do not delete any files in data:

     $ rmlint backup // data --keep-all-tagged --must-match-tagged

   · Check whether the directories a, b and c are equal:

     $ rmlint --equal a b c && echo "Files are equal" || echo "Files
       are not equal"

   · Test whether two files are reflinks:

     $ rmlint --is-reflink a b && echo "Files are reflinks" || echo
       "Files are not reflinks"

PROBLEMS
   1. False Positives: Depending on the options you use, there is a
      very slight risk of false positives (files that are erroneously
      detected as duplicates). The default hash function (blake2b) is
      very safe, but in theory it is possible for two files to have the
      same hash. If you had 10^73 different files, all the same size,
      the chance of a false positive would still be less than 1 in a
      billion. If you're concerned, just use the --paranoid (-pp)
      option. This will compare all the files byte-by-byte and is not
      much slower than blake2b (it may even be faster), although it is
      a lot more memory-hungry.

   2. File modification during or after an rmlint run: It is possible
      that a file that rmlint recognized as a duplicate is modified
      afterwards, resulting in a different file. If you use the
      rmlint-generated shell script to delete the duplicates, you can
      run it with the -p option to do a full re-check of the duplicate
      against the original before it deletes the file. When using -c
      sh:hardlink or -c sh:symlink, care should be taken that a
      modification of one file will then result in a modification of
      all files. This is not the case for -c sh:reflink or -c
      sh:clone. Use -c sh:link to minimise this risk.
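
      The re-check that the generated script performs with -p amounts
      to a byte-by-byte comparison before removal. A minimal sketch of
      that idea (the recheck_rm name is ours, not part of rmlint):

```shell
# Remove a duplicate only if it is still byte-identical to its
# original. 'recheck_rm' is an illustrative helper, not part of
# rmlint or its generated script.
recheck_rm() {
    dup=$1 orig=$2
    if cmp -s "$dup" "$orig"; then
        rm -- "$dup"
    else
        echo "skipped: $dup changed since the rmlint run" >&2
    fi
}
```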

SEE ALSO
   Reading the manpages of these tools might help working with rmlint:

   · find(1)

   · rm(1)

   · cp(1)

   Extended documentation and an in-depth tutorial can be found at:

   · http://rmlint.rtfd.org

BUGS
   If you found a bug, have a feature request or want to say something
   nice, please visit https://github.com/sahib/rmlint/issues.

   Please make sure to describe your problem in detail. Always include
   the version of rmlint (--version). If you experienced a crash,
   please include the output of at least one of the following commands,
   run against a debug build of rmlint:

   · gdb -ex run -ex bt --args rmlint -vvv [your_options]

   · valgrind --leak-check=no rmlint -vvv [your_options]

   You can build a debug build of rmlint like this:

   · git clone git@github.com:sahib/rmlint.git

   · cd rmlint

   · scons DEBUG=1

   · sudo scons install # Optional

LICENSE
   rmlint is licensed under the terms of the GPLv3.

   See the COPYRIGHT file that came with the source for more
   information.

PROGRAM AUTHORS
   rmlint was written by:

   · Christopher <sahib> Pahl 2010-2017 (https://github.com/sahib)

   · Daniel <SeeSpotRun> T. 2014-2017 (https://github.com/SeeSpotRun)

   Also see http://rmlint.rtfd.org for other people that helped us.

   If you would like to make a donation, you can use Flattr, or buy us
   a beer if we meet:

   https://flattr.com/thing/302682/libglyr

AUTHOR
   Christopher Pahl, Daniel Thomas

COPYRIGHT
   2014-2015, Christopher Pahl & Daniel Thomas

                             Feb 18, 2019                     RMLINT(1)