NAME
     samefile - find identical files
     samearchive - find identical files, while keeping archives intact

SYNOPSIS
     samefile [-a | -A | -At | -L | -Z | -Zt] [-g size] [-l | -r] [-m size]
              [-S sep] [-0HiqVvx]
     samearchive [-a | -A | -At | -L | -Z | -Zt] [-g size] [-l | -r]
              [-m size] [-S sep] [-0HiqVv] dir1 dir2 [...]

DESCRIPTION
     These programs read a list of filenames (one filename per line) from
     stdin and write the identical files to stdout. samearchive is written
     for the special case where each directory acts as an archive or backup.
     Its output will only contain filename pairs that have the same relative
     path from the archive base. Therefore the output of samearchive will be
     a subset of that of samefile.

     The output consists of six fields: the size in bytes, two filenames
     (with identical contents), the character = if the two files are on the
     same device (X otherwise), and the link counts of the two files. The
     output is sorted in reverse order with the size as the primary key and
     a secondary key that depends on the user input.

OPTIONS
INTERNALS
     These programs use two stages to give optimum performance.

     In the first stage, all non-plain files are skipped (directories,
     devices, FIFOs, sockets, symbolic links), as well as files for which
     stat(2) fails and files whose size is less than or equal to the size
     given with -g or greater than the size given with -m. When memory is
     full, samefile will try to store a part of the filenames temporarily
     in /tmp/samefile/<pid>. When samefile is not able to do this, it will
     raise the minimum size and remove paths from memory accordingly.

     In the second stage, the filenames that are hard linked are reported
     first, assuming option -r was passed to the program. After this the
     files are compared and identical filenames are reported. For any
     i-node only one filename will be added (unless -i was requested). For
     each two i-nodes that match, n lines will be printed showing the first
     filename of the first i-node matched against all the filenames of the
     second i-node.

     Note however, that because only the first filename per i-node gets
     into the second stage, the output for a group of identical files with
     different i-node numbers is also minimized. Suppose you have six
     identical files of size 100 in an i-node group consisting of the three
     i-nodes with numbers 10, 20 and 30 (the term i-node group is not
     restricted to a single file system - it merely refers to a set of
     i-nodes addressing files with identical contents):

         % ls -i
         10 file1    20 file4    30 file6
         10 file2    20 file5
         10 file3
         % ls | samefile
         100 file1 file4 = 3 2
         100 file1 file6 = 3 1

     The sum of the sizes in the first column is the amount of disk space
     you could gain by making all 6 files links to only one file, or by
     removing all but one of the files. To be precise, disk space is
     allocated in blocks - you will probably gain two blocks here, rather
     than 200 bytes. Note that it is not enough to just remove file4 and
     file6 (you would gain only 100 bytes because file5 still exists). The
     proper way is to use the -i option.
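The i-node group in the example above can be reproduced with plain POSIX tools; a minimal sketch, using a hypothetical scratch directory (samefile itself is not invoked here):

```shell
# Recreate the six-file, three-i-node layout from the example
# above in a temporary directory (POSIX tools only).
dir=$(mktemp -d)
cd "$dir" || exit 1
head -c 100 /dev/zero > file1   # first i-node
ln file1 file2                  # hard links share file1's i-node
ln file1 file3
head -c 100 /dev/zero > file4   # second i-node, identical contents
ln file4 file5
head -c 100 /dev/zero > file6   # third i-node, identical contents
ls -i                           # shows three distinct i-node numbers
```

Piping `ls | samefile` in this directory should then print the two lines shown above (file1/file4 and file1/file6), since only the first filename per i-node enters the second stage.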
     The output will then look like:

         100 file1 file4 = 3 2
         100 file1 file5 = 3 2
         100 file1 file6 = 3 1

     Removing all files listed in the third field will leave only file1.
     Making all files hard links to file1 is easy: if the fourth field is a
     ``='', do a forced hard link. If you need to know about all
     combinations of identical files, then you use both the -i and -x
     options. This produces:

         % ls | samefile -ix
         100 file1 file4 = 3 2
         100 file1 file5 = 3 2
         100 file2 file4 = 3 2
         100 file2 file5 = 3 2
         100 file3 file4 = 3 2
         100 file3 file5 = 3 2
         100 file1 file6 = 3 1
         100 file2 file6 = 3 1
         100 file3 file6 = 3 1
         100 file4 file6 = 2 1
         100 file5 file6 = 2 1

FILES
EXAMPLES
     Find all identical files in the current working directory:

         % ls | samefile -i

     Find all identical files in my HOME directory and subdirectories, and
     also tell me if there are hard links:

         % find $HOME -type f -print | samefile -r

     Find all identical files in the /usr directory tree that are bigger
     than 10000 bytes and write the result to /tmp/usr (that one is for the
     sysadmin folks; you may want to 'amp' this command - put it in the
     background with the ampersand & - because it takes a few minutes):

         % find /usr -type f -print | samefile -g 10000 > /tmp/usr

     Find all identical files within the system archives that live within
     the current working directory:

         % find /path/to/backup/system-* | samearchive system-*

DIAGNOSTICS
     inaccessible: path
         This is probably due to a 'permission denied' error on files or
         directories within the given path for which you have no read
         permission.

     unreadable: path
         The file could be opened for reading yet failed while being read.
         You shouldn't encounter such warnings, but if you do, and receive
         more than a few, this could very well be due to a failing hard
         disk.

     <file.cpp>:<line> message
         You can encounter such errors when you've compiled the port with
         debugging information. Please report such messages to the author
         with some relevant information about how to reproduce this bug.

     memory full: written amount path to disk
         The memory was full and a number of paths were temporarily written
         to disk.

     memory full: changed minimum file size to number
         The memory was full and the program couldn't temporarily write
         paths to disk, so it raised the minimum file size to the given
         number. At a later time you could rerun the program using the
         option -m to check which paths were skipped as a result.

     memory full: aborting... too many files with the same size
         There were just too many files with the same size to fit into
         memory from this point on. Try to split the list up and then run
         the program multiple times.
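The hard-linking procedure described earlier (a forced hard link whenever the fourth field is ``='') can be scripted. A minimal POSIX shell sketch, assuming the six space-separated output fields described in DESCRIPTION and filenames without embedded whitespace (the -0 or -S options would be needed otherwise):

```shell
# link_dupes: read samefile-style lines "size file1 file2 dev n1 n2"
# and turn file2 into a forced hard link to file1 whenever both
# files are on the same device (field 4 is "=").
link_dupes() {
    while read -r size f1 f2 dev n1 n2; do
        if [ "$dev" = "=" ]; then
            ln -f "$f1" "$f2"   # forced hard link, as described above
        fi
    done
}

# Usage sketch with two hypothetical identical files; in practice
# the input line would come from "... | samefile" instead of printf.
tmp=$(mktemp -d)
printf 'same contents\n' > "$tmp/a"
printf 'same contents\n' > "$tmp/b"
printf '14 %s %s = 1 1\n' "$tmp/a" "$tmp/b" | link_dupes
```

Because ln -f removes the existing target before linking, the right-hand file is replaced in place and afterwards shares the left-hand file's i-node.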
SEE ALSO
     samearchive-lite(1), sameln(1), samesame(1), find(1), ls(1)

NOTES
     Input filenames must not have leading or trailing white space, unless
     the white space is part of the filename.

HISTORY
     samefile was first written by Jens Schweikhardt in 1996. It was later
     rewritten by Alex de Kruijff in 2009 in order to improve the
     performance. In addition, the program is now able to handle memory
     allocation problems due to large lists, and it gained some additional
     options.

BUGS
     The list is not sorted properly when using the option -x. This is not
     a bug but a feature. Proper sorting would either consume vast amounts
     of memory or time. The sorting options are there just to control the
     output (i.e. use -Zt if you intend to link with the file that was most
     recently modified; you will find that file on the left).

AUTHOR
     Alex de Kruijff