NAME

ua - find sets of identical files (the name comes from the Hungarian word ugyanaz, meaning "the same")

SYNOPSIS

ua [OPTION]... [FILE]...

DESCRIPTION

Given a list of files, ua finds sets of identical ones. ua was designed to take input from find or ls and to produce output that is trivial to process with line-oriented tools such as sed, xargs, awk, wc, and grep. For example, to count the sets of duplicates:

: $ find ~ -type f | ua - | wc -l

or to find the largest such set:

: $ find ~ -type f | ua -ssep - | awk -Fsep '{if (NF>M) { M=NF;S=$0;}} END {print(S);}'

OPTIONS

-i: ignore letter case
-w: ignore whitespace
-n: do not ask the file system for file size
-v: verbose output (reports intermediate steps on stderr), verbose help
-m max: consider only the first max bytes in the hash
-2: perform two-stage hashing: first hash a prefix whose size is set with -m, and discard candidates whose prefix hash is unique
-s sep: separator (default SPACE)
-p: also print the hash value
-b size: set internal buffer size (default 1024)
-h: this help (-vh more verbose help)
-: read file names from stdin, where each line contains one file name (this must also be the last option in the list)

OUTPUT

Each line of the output represents one set of identical files. The columns are the path names, separated by sep (set with -s). When -p is set, the first column is the hash value. Note that if -i or -w is set, the hash value will likely differ from what md5sum reports.
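The output format described above is simple enough to consume from a script. The following is a minimal sketch, not part of ua: the function name parse_ua_output and the sample text are made up for illustration, and the hash column is assumed to be a plain string with no separator in it.

```python
def parse_ua_output(text, sep=" ", has_hash=False):
    """Split ua-style output into (hash, [paths]) tuples, one per line.

    `sep` plays the role of the -s argument and `has_hash` the role of -p.
    Both names are hypothetical; they are not ua options themselves.
    """
    sets = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        columns = line.split(sep)
        if has_hash:
            digest, paths = columns[0], columns[1:]
        else:
            digest, paths = None, columns
        sets.append((digest, paths))
    return sets


# Invented sample, shaped like the output of `ua -p -s, ...`.
sample = (
    "d41d8cd98f00b204e9800998ecf8427e,a.txt,b.txt,c.txt\n"
    "900150983cd24fb0d6963f7d28e17f72,x.h,y.h\n"
)
sets = parse_ua_output(sample, sep=",", has_hash=True)
largest = max(sets, key=lambda s: len(s[1]))
print(len(sets), "sets; largest:", largest[1])
```

Splitting on the separator only works if the separator never occurs inside a path name, so a separator other than the default SPACE is probably safer when paths may contain spaces.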

ALGORITHM

Calculation proceeds in three steps:

: 1. Ask the FS for file size and throw away files with unique byte counts.
: 2. If so requested (-2), calculate a fast hash on a fixed-size prefix (given by -m) of the files with the same byte count and throw away the ones with unique prefix hash values.
: 3. If exactly two matching files are left in a subset after filtering on size and prefix hash, these two are compared byte by byte; otherwise the files go through a full MD5 hash, and those with the same hash are deemed identical.

-w implies -n, since the byte count is irrelevant when whitespace is ignored. The two-stage hashing algorithm first calculates identical sets considering only a fixed-size prefix (thus the -2 option requires -m) and then calculates the final result from these sets. This can be much faster when there are many files of the same size or when files are compared with whitespace ignored. When -w and -m max are both set, max refers to the first max non-whitespace characters.
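The interaction of -i, -w and -m can be illustrated with a small sketch. It assumes that whitespace means the ASCII whitespace characters and that case is folded by lowercasing ASCII letters; which characters ua actually treats this way is not stated here. The function name normalized_md5 is hypothetical.

```python
import hashlib
import string

# Assumption: "whitespace" means the ASCII whitespace characters.
_WHITESPACE = string.whitespace.encode("ascii")


def normalized_md5(data, ignore_case=False, ignore_space=False, limit=None):
    """MD5 of `data` after optional case folding and whitespace removal.

    `limit` is applied after whitespace removal, so it counts only
    non-whitespace bytes, as described for -w combined with -m.
    """
    if ignore_space:
        data = data.translate(None, _WHITESPACE)  # delete whitespace bytes
    if ignore_case:
        data = data.lower()  # lowercases ASCII letters only
    if limit is not None:
        data = data[:limit]
    return hashlib.md5(data).hexdigest()


a = b"int  main ( void )\n{ return 0; }\n"
b = b"INT main(void) {\n\treturn 0;\n}\n"

# The raw contents differ in both size and hash ...
print(len(a) == len(b), normalized_md5(a) == normalized_md5(b))
# ... but they match once case and whitespace are ignored.
print(normalized_md5(a, True, True) == normalized_md5(b, True, True))
```

The two sample buffers have different lengths yet normalize to the same bytes, which is why filtering on file size would wrongly discard matches when -w is in effect.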

EXAMPLES

Get help on usage:

: $ ua -h
: $ ua -vh

Find identical files in the current directory:

: $ ua *
: $ ls | ua -p -

In the first case, the file names are read from the command line, while in the second they are read from the standard input. The latter also prints the hash value.

Compare text files:

: $ ua -iwvb256 f1.txt f2.txt f3.txt

Compares the three files, ignoring letter case and whitespace. Intermediate steps are reported on stderr (-v). Since -w implies -n, the files are not grouped by size. The internal buffer size is reduced to 256, since ignoring whitespace causes data to be moved within the buffer.

Count the sets of identical files under the home directory:

: $ find ~ -type f | ua -2m256 - | wc -l

Considering the large number of files, the calculation is performed with a two-stage hash (-2). Only files whose 256-byte prefix hash matches that of another file are fully hashed.

Find identical header files:

: $ find /usr/include -name '*.h' | ua -b256 -wm256 -2s, -

Ignore whitespace (-w), and therefore use a smaller buffer (-b256). Perform the calculation in two stages (-2), first clustering on the first 256 non-whitespace characters (-m256). Separate the identical files in the output with commas (-s,).

VERSION

1.0. Run ua -h to see whether you have the hashed or the tree version.

AUTHOR

LICENSE

This is free software. You may redistribute copies of it under the terms of the Mozilla Public License <http://www.mozilla.org/MPL/>. There is NO WARRANTY, to the extent permitted by law.