SAMEFILE(1)                           JS                           SAMEFILE(1)

NAME
       samefile - find identical files

SYNOPSIS
       samefile [-g size] [-l | -r] [-s sep] [-0aiqVvx]

DESCRIPTION
       samefile reads a list of filenames (one filename per line) from stdin.
       For each pair of files with identical contents, it writes a line of
       six fields: the size in bytes, the two filenames, the character ``=''
       if the two files are on the same device or ``X'' otherwise, and the
       link counts of the two files. The output is sorted in reverse order
       by size as the primary key and by filename as the secondary key.
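       The six-field line described above is straightforward to split into
       typed fields. The following is a minimal sketch in Python, assuming
       the default tab separator and filenames that do not themselves
       contain the separator. The names SamefileRecord and parse_line are
       hypothetical and are not part of samefile.

```python
from typing import NamedTuple


class SamefileRecord(NamedTuple):
    """One output line of samefile (hypothetical helper type)."""
    size: int            # size in bytes
    first: str           # first filename of the pair
    second: str          # second filename of the pair
    same_device: bool    # True for "=", False for "X"
    links_first: int     # link count of the first file
    links_second: int    # link count of the second file


def parse_line(line: str, sep: str = "\t") -> SamefileRecord:
    """Split one six-field line; raises ValueError on any other field count."""
    size, first, second, dev, n1, n2 = line.rstrip("\n").split(sep)
    return SamefileRecord(int(size), first, second, dev == "=", int(n1), int(n2))
```

       Lines written for hard links when -r is given have only four fields
       and would need separate handling.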

OPTIONS
       -0     Indicates that the input list of filenames is NUL terminated,
              for example as generated by implementations of find(1) that
              support the -print0 option. Without this option, the
              filenames are assumed to be newline terminated.

       -a     Do not sort files with the same size alphabetically.

       -g size
              Compare only files with a size greater than size bytes. The
              default is 0.

       -i     Allow files with the same device/i-node pair to be added to
              the binary tree. This might be useful if the output will be
              fed into some other program. If this option is used, the
              statistics displayed when using -v will not contain the
              ``You have a total of x bytes in identical files'' line
              because -i prohibits proper calculation of this value.

       -l     Do not check whether files with identical contents are hard
              links created by ln(1). By default, samefile checks whether
              files with identical contents are hard linked and, if they
              are, does not write a name pair to stdout. A slight speedup
              is gained when using this option. This option is
              incompatible with the -r option.

       -q     Do not issue warning messages when open(2) fails. When you
              encounter such a warning, open probably failed due to a
              'permission denied' error on files or directories for which
              you have no read permission. Useful if you are not root and
              want to compare your files against files in a system
              directory like /etc.

       -r     Report whether identical files are hard linked. The separator
              string followed by the [bracketed] link count is appended to
              each name pair if they are hard links created with ln(1).
              This option is incompatible with the -l option. Note that
              this kind of output has only four fields and will appear
              unsorted before the actual output of samefile.

       -s sep Use string sep as the output field separator. The default is
              a tab character. Useful if filenames contain tab characters
              and the output must be processed by another program, say
              awk(1).

       -V     Print the version information and exit.

       -v     Verbose mode. Write some statistical messages about memory
              usage and work reduction, as well as the sum of the sizes of
              all identical files, to stderr.

       -x     Switch off intelligence. This option prevents samefile from
              being smart. If files file1, file2 and file3 are identical,
              it will do 3 comparisons instead of just the two needed and
              write more output. See the discussion under INTERNALS for why
              this could be useful. If this option is used, the statistics
              displayed when using -v will not contain the ``You have a
              total of x bytes in identical files'' line because -x
              prohibits proper calculation of this value.

INTERNALS
       samefile uses two stages to give optimum performance.

       In the first stage, all non-plain files are skipped (directories,
       devices, FIFOs, sockets, symbolic links), as are files for which
       stat(2) fails and files whose size is less than or equal to the size
       given with -g. The output of the first stage (the filenames) is
       written into a binary tree with one node for every file size. It is
       also at this early stage that checks for hard links are done. If
       hard links are found and -r is requested, the name pairs are output
       immediately. The whole list of hard linked name pairs will therefore
       appear before any output of the second stage.

       For any i-node, only one filename will be added to the binary tree
       (unless -i was requested).
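       The first stage described above can be approximated in a few lines
       of Python. This is only a sketch under stated assumptions, not the
       program's actual C code: a dictionary keyed by size stands in for
       the binary tree, lstat is assumed so that symbolic links are
       skipped, and the names first_stage, min_size and keep_same_inode are
       hypothetical (the last two mirror -g and -i).

```python
import os
import stat
from collections import defaultdict


def first_stage(names, min_size=0, keep_same_inode=False):
    """Group candidate filenames by size, one list per distinct size."""
    by_size = defaultdict(list)   # stand-in for the binary tree of sizes
    seen = set()                  # (device, i-node) pairs already recorded
    for name in names:
        try:
            st = os.lstat(name)   # assumption: lstat, so symlinks are skipped
        except OSError:
            continue              # stat failed: skip the file
        if not stat.S_ISREG(st.st_mode):
            continue              # directories, devices, FIFOs, sockets, links
        if st.st_size <= min_size:
            continue              # size must be greater than the threshold
        key = (st.st_dev, st.st_ino)
        if key in seen and not keep_same_inode:
            continue              # keep only the first name per i-node
        seen.add(key)
        by_size[st.st_size].append(name)
    return by_size
```

       Only sizes with more than one recorded name need any content
       comparison in the second stage.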

       In the second stage, all files having the same size are compared
       against each other. The rules of mathematical logic are applied to
       reduce work and output noise (unless -x is requested): if files a, b,
       and c have the same size and samefile finds that a = b and a = c,
       then it will not compare b against c (and will not output a line for
       b and c) but will output lines only for a = b and a = c. Note,
       however, that because only the first filename per i-node gets into
       the second stage, the output for a group of identical files with
       different i-node numbers is also minimized. Suppose you have six
       identical files of size 100 in an i-node group consisting of the
       three i-nodes with numbers 10, 20 and 30 (the term 'i-node group' has
       nothing to do with the i-node group notion of some file systems; it
       merely refers to a set of i-nodes addressing files with identical
       contents):

              % ls -i
              10 file1    20 file4    30 file6
              10 file2    20 file5
              10 file3
              % ls | samefile
              100     file1   file4   =       3       2
              100     file1   file6   =       3       1
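       The work reduction in the second stage can be illustrated with a
       small Python sketch. It is an assumption about one way to realize
       the rule, not the author's implementation: each file is compared
       only against one representative per group of identical files, while
       smart=False mimics -x by comparing every pair. The names
       second_stage, smart and same are hypothetical.

```python
import filecmp
from itertools import combinations


def second_stage(names, smart=True, same=None):
    """Return pairs of identical files among names of equal size."""
    if same is None:
        # Byte-by-byte comparison of file contents.
        same = lambda a, b: filecmp.cmp(a, b, shallow=False)
    if not smart:
        # Like -x: compare every pair and report every match.
        return [(a, b) for a, b in combinations(names, 2) if same(a, b)]
    pairs = []
    representatives = []          # first member of each group found so far
    for name in names:
        for rep in representatives:
            if same(rep, name):
                pairs.append((rep, name))
                break             # a = b and a = c: never compare b with c
        else:
            representatives.append(name)
    return pairs
```

       For three identical files this performs two comparisons in smart
       mode and three otherwise, matching the counts given for -x.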

       The sum of the sizes in the first column of the samefile output is
       the amount of disk space you could gain by making all 6 files links
       to only one file or by removing all but one of the files. To be
       precise, disk space is allocated in blocks, so you will probably gain
       two blocks here rather than 200 bytes. Note that it is not enough to
       just remove file4 and file6 (you would gain only 100 bytes because
       file5 still exists). The proper way is to use the -i option. The
       output will look like

              100     file1   file2   =       3       3
              100     file1   file3   =       3       3
              100     file1   file4   =       3       2
              100     file1   file5   =       3       2
              100     file1   file6   =       3       1

       Removing all files listed in the third field will leave only file1.
       Making all files hard links to file1 is easy: if the fourth field is
       ``='', do a forced hard link. If you need to know about all
       combinations of identical files, use both the -i and -x options.
       This produces

              % ls | samefile -ix
              100     file1   file2   =       3       3
              100     file1   file3   =       3       3
              100     file1   file4   =       3       2
              100     file1   file5   =       3       2
              100     file1   file6   =       3       1
              100     file2   file3   =       3       3
              100     file2   file4   =       3       2
              100     file2   file5   =       3       2
              100     file2   file6   =       3       1
              100     file3   file4   =       3       2
              100     file3   file5   =       3       2
              100     file3   file6   =       3       1
              100     file4   file5   =       2       2
              100     file4   file6   =       2       1
              100     file5   file6   =       2       1
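       The advice above about turning identical files into hard links can
       be sketched in Python. This is an illustration only, meant for
       output produced with -i alone and the default tab separator. It
       plays the role of a forced hard link by creating the link under a
       temporary name and renaming it over the target. The function names
       and the temporary suffix are hypothetical, and the sketch trusts
       that the files are still identical when it runs.

```python
import os


def relink(first, second):
    """Replace second with a hard link to first; return True if changed."""
    if os.path.samefile(first, second):
        return False                     # already the same i-node
    tmp = second + ".relink-tmp"         # hypothetical temporary name
    os.link(first, tmp)                  # new link to first's i-node
    os.replace(tmp, second)              # rename over the old file
    return True


def relink_from_output(lines, sep="\t"):
    """Hard link the third field to the second for lines marked '='."""
    linked = []
    for line in lines:
        size, first, second, dev, n1, n2 = line.rstrip("\n").split(sep)
        if dev != "=":
            continue                     # different devices: cannot hard link
        if relink(first, second):
            linked.append(second)
    return linked
```

       Pairs marked ``X'' are left alone because hard links cannot cross
       devices.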

EXAMPLES
       Find all identical files in the current working directory:

              % ls | samefile

       Find all identical files in my HOME directory and its subdirectories
       and also tell me if there are hard links:

              % find $HOME -type f -print | samefile -r

       Find all identical files in the /usr directory tree that are bigger
       than 10000 bytes and write the result to /tmp/usr (this one is for
       the sysadmin folks; you may want to put the command in the background
       with the ampersand & because it takes a few minutes):

              % find /usr -type f -print | samefile -g 10000 >/tmp/usr
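       The pipelines above pass newline-terminated names. When filenames
       may contain newlines, the -0 option is the safer input format. The
       following Python sketch shows one way to build such a list and hand
       it to samefile. It assumes a samefile binary on the PATH, and the
       helper names regular_files, nul_terminated and run_samefile are
       hypothetical.

```python
import os
import stat
import subprocess


def regular_files(root):
    """Yield paths (as bytes) of regular files below root."""
    for dirpath, _dirnames, filenames in os.walk(os.fsencode(root)):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if stat.S_ISREG(os.lstat(path).st_mode):
                    yield path
            except OSError:
                continue          # file vanished or is not accessible


def nul_terminated(paths):
    """Join paths into the NUL-terminated form expected with -0."""
    return b"".join(p + b"\0" for p in paths)


def run_samefile(root, min_size=0):
    """Run `samefile -0 -g min_size` on the tree and return its stdout."""
    cmd = ["samefile", "-0", "-g", str(min_size)]
    result = subprocess.run(cmd, input=nul_terminated(regular_files(root)),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```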

DIAGNOSTICS
       You will see a short usage message if you use illegal options.

       malloc - free = xxxx
              I didn't free the memory I've malloc(3)ed. You found a bug.
              Please report it to the author.

       Allocation failed for 'expr' ...
              Oops! You ran out of virtual memory. You must have a really
              big filename list. Try to use a smaller one or increase the
              resources available to your processes. For more information
              see ulimit(1) or the equivalent builtin of your shell.

SEE ALSO
       ln(1), find(1), rm(1), df(1)

NOTES
       Input filenames must not have leading or trailing white space unless
       the white space is part of the filename.

BUGS
       Are you kidding? Okay, maybe there is one. The source has been
       lint(1)ed and all possible care has been taken while coding. So if
       you find a bug (or miss a feature), contact the author.

AUTHOR
       Jens Schweikhardt - samefile@schweikhardt.net

                                7 AUGUST 2005                      SAMEFILE(1)