samefile(1)

SAMEFILE(1)                           JS                           SAMEFILE(1)

NAME

       samefile - find identical files

SYNOPSIS

       samefile [-g size] [-l | -r] [-s sep] [-0aiqVvx]

DESCRIPTION

       samefile reads a list of filenames (one filename per line) from stdin.
       For each pair of files with identical contents, it writes a line of
       six fields: the size in bytes, the two filenames, the character ``=''
       if the two files are on the same device or ``X'' otherwise, and the
       link counts of the two files.  The output is sorted in reverse order
       by size as the primary key and by filename as the secondary key.
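       The six-field line format described above can be picked apart as
       sketched below.  This is only an illustration: the names Pair and
       parse_line are hypothetical and not part of samefile, and the sketch
       assumes that no filename contains the separator.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    size: int           # file size in bytes
    first: str          # first filename
    second: str         # second filename
    same_device: bool   # True for "=", False for "X"
    links_first: int    # link count of the first file
    links_second: int   # link count of the second file


def parse_line(line: str, sep: str = "\t") -> Pair:
    """Split one six-field output line (hypothetical helper)."""
    # Raises ValueError if the line does not have exactly six fields,
    # for example when a filename contains the separator.
    size, first, second, dev, n1, n2 = line.rstrip("\n").split(sep)
    return Pair(int(size), first, second, dev == "=", int(n1), int(n2))


print(parse_line("100\tfile1\tfile4\t=\t3\t2"))
```

       Passing a different sep mirrors the effect of the -s option on the
       output.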

OPTIONS

       -0     Indicates that the input list of filenames is NUL terminated,
              for example as generated by implementations of find(1) that
              support the -print0 option.  Without this option, the
              filenames are assumed to be newline terminated.
       
       -a     Do not sort files of the same size alphabetically.

       -g size
              Compare only files whose size is greater than size bytes.
              The default is 0.
       
       -i     Allow files with the same device/i-node pair to be added to the
              binary tree (see INTERNALS).  This might be useful if the
              output will be fed into some other program.  If this option is
              used, the statistics displayed with -v will not contain the
              ``You have a total of x bytes in identical files'' line because
              -i prevents proper calculation of this value.
       
       -l     Do not check whether files with identical contents are hard
              links created by ln(1).  By default, samefile checks whether
              files with identical contents are hard linked and, if they
              are, does not write the name pair to stdout.  Using this option
              gives a slight speedup.  This option is incompatible with the
              -r option.
       
       -q     Do not issue warning messages when open(2) fails.  Such a
              warning usually means that open failed with a 'permission
              denied' error on files or directories for which you have no
              read permission.  Useful if you are not root and want to
              compare your files against files in a system directory like
              /etc.
       
       -r     Report whether identical files are hard linked.  If the files
              of a name pair are hard links created with ln(1), the separator
              string followed by the [bracketed] link count is appended to
              the pair.  This option is incompatible with the -l option.
              Note that this kind of output has only four fields and appears
              unsorted before the actual output of samefile.
       
       -s sep Use the string sep as the output field separator.  The default
              is a tab character.  Useful if filenames contain tab characters
              and the output must be processed by another program, say
              awk(1).
       
       -V     Print the version information and exit.

       -v     Verbose mode.  Write some statistical messages about memory
              usage and work reduction, as well as the sum of the sizes of
              all identical files, to stderr.
       
       -x     Switch off intelligence.  This option prevents samefile from
              being smart.  If files file1, file2 and file3 are identical,
              it will do three comparisons instead of just the two needed
              and write more output.  See the discussion under INTERNALS for
              why this could be useful.  If this option is used, the
              statistics displayed with -v will not contain the ``You have a
              total of x bytes in identical files'' line because -x prevents
              proper calculation of this value.
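       The "3 comparisons instead of just the two needed" remark under -x
       generalizes as sketched below.  This is a simplified model that only
       covers a group of n files that all turn out to be identical; the
       function name comparisons is hypothetical.

```python
from math import comb


def comparisons(n: int, exhaustive: bool = False) -> int:
    """Comparisons made for a group of n identical files (simplified model).

    exhaustive=False models the default: the first file against each other one.
    exhaustive=True models -x: every pair is compared.
    """
    if n < 2:
        return 0
    return comb(n, 2) if exhaustive else n - 1


print(comparisons(3), comparisons(3, exhaustive=True))  # 2 3
print(comparisons(6, exhaustive=True))                  # 15
```

       The value 15 for six files matches the number of lines in the
       samefile -ix listing under INTERNALS.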

INTERNALS

       samefile uses two stages to give optimum performance.

       In the first stage, all non-plain files are skipped (directories,
       devices, FIFOs, sockets, symbolic links), as are files for which
       stat(2) fails and files whose size is less than or equal to the size
       given with -g.  The output of the first stage (the filenames) is
       written into a binary tree with one node for every file size.  The
       checks for hard links are also done at this early stage.  If hard
       links are found and -r is requested, the name pairs are output
       immediately.  The whole list of hard linked name pairs will therefore
       appear before any output of the second stage.

       For any i-node only one filename is added to the binary tree (unless
       -i was requested).
       
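       The first stage can be pictured roughly as follows.  This is a sketch
       under assumptions, not the actual C source: a dictionary stands in
       for the binary tree, lstat is assumed so that symbolic links are seen
       as such, and the names first_stage, min_size and keep_links are
       hypothetical (min_size plays the role of -g, keep_links of -i).

```python
import os
import stat
from collections import defaultdict


def first_stage(names, min_size=0, keep_links=False):
    """Group plain files by size, keeping one name per (device, i-node)."""
    by_size = defaultdict(list)   # stands in for the binary tree
    seen = set()                  # (device, i-node) pairs already recorded
    for name in names:
        try:
            st = os.lstat(name)   # assumption: symbolic links are not followed
        except OSError:
            continue              # stat failed: skip the file
        if not stat.S_ISREG(st.st_mode):
            continue              # directory, device, FIFO, socket, symlink
        if st.st_size <= min_size:
            continue              # size must be greater than min_size
        key = (st.st_dev, st.st_ino)
        if key in seen and not keep_links:
            continue              # another name for this i-node is already kept
        seen.add(key)
        by_size[st.st_size].append(name)
    return dict(by_size)
```

       With the default min_size of 0, empty files never reach the second
       stage, which follows from the "less than or equal to" rule above.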
       In the second stage all files having the same size are compared
       against each other.  The transitivity of equality is used to reduce
       work and output noise (unless -x is requested): if files a, b, and c
       have the same size and samefile finds that a = b and a = c, then it
       will not compare b against c and will output lines only for a = b and
       a = c, not for b and c.  Note, however, that because only the first
       filename per i-node gets into the second stage, the output for a
       group of identical files with different i-node numbers is also
       minimized.  Suppose you have six identical files of size 100 in an
       i-node group consisting of the three i-nodes with numbers 10, 20 and
       30 (the term 'i-node group' has nothing to do with the i-node group
       notion of some file systems; it merely refers to a set of i-nodes
       addressing files with identical contents):
       
       % ls -i
          10 file1     20 file4     30 file6
          10 file2     20 file5
          10 file3
       % ls | samefile
       100     file1   file4   =       3       2
       100     file1   file6   =       3       1
       
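       The pruning rule of the second stage (a = b and a = c, so b is never
       compared against c) can be sketched like this.  It is an illustration
       only: the function name second_stage is hypothetical, and Python's
       filecmp.cmp is swapped in for whatever comparison samefile really
       performs.

```python
import filecmp


def second_stage(names, exhaustive=False):
    """Return (a, b) pairs of same-size files with identical contents.

    names: files already known to have the same size.
    exhaustive=True models -x: every pair is compared.
    """
    remaining = list(names)
    pairs = []
    while remaining:
        a, rest = remaining[0], remaining[1:]
        still_open = []
        for b in rest:
            if filecmp.cmp(a, b, shallow=False):   # compare file contents
                pairs.append((a, b))
                if exhaustive:
                    still_open.append(b)   # -x: b is compared again later
            else:
                still_open.append(b)       # b may match some other file
        remaining = still_open
    return pairs
```

       For three identical files a, b and c this returns (a, b) and (a, c);
       with exhaustive=True it also returns (b, c).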
       The sum of the sizes in the first column is the amount of disk space
       you could gain by making all six files links to a single file or by
       removing all but one of the files.  To be precise, disk space is
       allocated in blocks, so you will probably gain two blocks here rather
       than 200 bytes.  Note that it is not enough to just remove file4 and
       file6 (you would gain only 100 bytes because file5 still exists).
       The proper way is to use the -i option.  The output will look like
       
       100     file1   file2   =       3       3
       100     file1   file3   =       3       3
       100     file1   file4   =       3       2
       100     file1   file5   =       3       2
       100     file1   file6   =       3       1
       
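       Returning to the two-line default output shown earlier (without -i),
       the space estimate is simple arithmetic over the first column.  The
       sketch below is an illustration only; the name reclaimable is
       hypothetical and the block size of 4096 bytes is an assumption that
       depends on the file system.

```python
def reclaimable(sizes, block_size=None):
    """Sum first-column sizes, optionally rounding each up to whole blocks."""
    if block_size is None:
        return sum(sizes)
    # -(-s // block_size) is ceiling division: blocks needed for s bytes
    return sum(-(-s // block_size) * block_size for s in sizes)


print(reclaimable([100, 100]))         # 200
print(reclaimable([100, 100], 4096))   # 8192
```

       The estimate does not apply to the -i output above, where several
       lines refer to names of the same i-node.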
       Removing all files listed in the third field will leave only file1.
       Making all files hard links to file1 is easy: if the fourth field is
       ``='', do a forced hard link.  If you need to know about all
       combinations of identical files, use both the -i and -x options.
       This produces
       
       % ls | samefile -ix
       100     file1   file2   =       3       3
       100     file1   file3   =       3       3
       100     file1   file4   =       3       2
       100     file1   file5   =       3       2
       100     file1   file6   =       3       1
       100     file2   file3   =       3       3
       100     file2   file4   =       3       2
       100     file2   file5   =       3       2
       100     file2   file6   =       3       1
       100     file3   file4   =       3       2
       100     file3   file5   =       3       2
       100     file3   file6   =       3       1
       100     file4   file5   =       2       2
       100     file4   file6   =       2       1
       100     file5   file6   =       2       1
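       The "forced hard link" step mentioned before the listing above could
       be scripted along these lines.  This is a sketch, not a tested tool:
       the names relink and relink_all and the temporary suffix are
       hypothetical, the default tab separator is assumed, and filenames are
       assumed not to contain it.  Try it on copies first.

```python
import os


def relink(target, duplicate):
    """Replace duplicate with a hard link to target, similar to ln -f."""
    if os.path.samefile(target, duplicate):
        return                            # already names for the same i-node
    tmp = duplicate + ".samefile-tmp"     # hypothetical temporary name
    os.link(target, tmp)                  # new link under the temporary name
    os.replace(tmp, duplicate)            # rename it over the duplicate


def relink_all(lines, sep="\t"):
    """Relink the third field to the second when the fourth field is '='."""
    for line in lines:
        size, first, second, dev, n1, n2 = line.rstrip("\n").split(sep)
        if dev == "=":                    # hard links cannot cross devices
            relink(first, second)
```

       Lines whose fourth field is ``X'' are left alone because the two
       files are on different devices.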

EXAMPLES

       Find all identical files in the current working directory:

       % ls | samefile

       Find all identical files in my HOME directory and its subdirectories
       and also tell me if there are hard links:

       % find "$HOME" -type f -print | samefile -r

       Find all identical files in the /usr directory tree that are bigger
       than 10000 bytes and write the result to /tmp/usr (that one is for the
       sysadmin folks; you may want to 'amp' this command, that is, put it in
       the background with the ampersand &, because it takes a few minutes):

       % find /usr -type f -print | samefile -g 10000 >/tmp/usr

DIAGNOSTICS

       You will see a short usage message if you use illegal options.

       malloc - free = xxxx
              I didn't free the memory I've malloc(3)ed.  You found a bug.
              Please report it to the author.

       Allocation failed for 'expr' ...
              Oops! You ran out of virtual memory.  You must have a really
              big filename list.  Try a smaller one or increase the
              resources available to your processes.  For more information
              see ulimit(1) or the equivalent builtin of your shell.

NOTES

       Input filenames must not have leading or trailing white space unless
       the white space is part of the filename.
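       The rule above means that a reader of the name list must split on the
       terminator only and leave white space alone.  A minimal sketch,
       assuming byte-string input; the name split_names is hypothetical, and
       nul=True models the -0 option.

```python
def split_names(data: bytes, nul: bool = False) -> list[bytes]:
    """Split raw input into filenames without touching white space."""
    term = b"\0" if nul else b"\n"
    parts = data.split(term)
    if parts and parts[-1] == b"":
        parts.pop()               # drop the empty piece after the last terminator
    return parts


print(split_names(b" a.txt\nb.txt \n"))       # [b' a.txt', b'b.txt ']
print(split_names(b"x\ny\0z\0", nul=True))    # [b'x\ny', b'z']
```

       The second call shows why NUL termination helps: a filename that
       contains a newline survives intact.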

BUGS

       Are you kidding?  Okay, maybe there is one.  The source has been
       lint(1)ed and all possible care has been taken while coding.  So if
       you find a bug (or miss a feature) contact the

AUTHOR

       Jens Schweikhardt - samefile@schweikhardt.net



                                 7 AUGUST 2005                     SAMEFILE(1)