NAME
huge-count.pl - Count all the bigrams in a huge text without using huge amounts of memory.

SYNOPSIS
huge-count.pl --tokenlist --split 100 destination-dir input

DESCRIPTION
Runs count.pl efficiently on large amounts of data by splitting the data into separate files, counting each file separately, and then merging the counts to get overall results.

Two output files are created. destination-dir/huge-count.output contains the bigram counts after applying --remove and --uremove. destination-dir/complete-huge-count.output provides the bigram counts as if no --uremove or --remove cutoff were provided.

USAGE
huge-count.pl [OPTIONS] DESTINATION [SOURCE]+

INPUT
Required Arguments:

[SOURCE]+
Input to huge-count.pl should be a -
DESTINATION
A complete path to a writable directory to which huge-count.pl can write all intermediate and final output files. If DESTINATION does not exist, a new directory is created; otherwise, the existing directory is simply used for writing the output files.

NOTE: If DESTINATION already exists and the names of some of its existing files clash with the names of the output files created by huge-count, those files will be overwritten without prompting the user.

--tokenlist
This parameter is required. huge-count will call count.pl and print out all the bigrams that count.pl can find.

Optional Arguments:

--split N
This parameter is required. huge-count will divide the bigram token list generated by count.pl into parts, sort each part, and recombine the bigram counts from all these intermediate result files into a single bigram output that shows the bigram counts in SOURCE. Each part created with --split N will contain N lines. The value of N should be chosen such that huge-sort.pl can run efficiently on any part containing N lines of the file that holds all the bigrams. We suggest that N be equal to the number of KB of memory you have. If the computer has 8 GB of RAM, which is 8,000,000 KB, N should be set to 8000000. If N is set too small, split reports that its output file suffixes are exhausted.

--token TOKENFILE
Specify a file containing Perl regular expressions that define the tokenization scheme for counting. This will be provided to count.pl's --token option.

--nontoken NOTOKENFILE
Specify a file containing Perl regular expressions of non-token sequences that are removed prior to tokenization. This will be provided to count.pl's --nontoken option.

--stop STOPFILE
Specify a file of Perl regexes containing the list of stop words to be omitted from the output BIGRAMS. The stop list can be used in two modes: AND mode, declared with '@stop.mode = AND' on the first line of the STOPFILE, or OR mode, declared with '@stop.mode = OR' on the first line of the STOPFILE.
In AND mode, bigrams in which both constituent words are stop words are removed, while in OR mode, bigrams in which either or both constituent words are stop words are removed from the output.

--window W
Tokens appearing within W positions from each other (with at most W-2 intervening words) will form bigrams. Same as count.pl's --window option.

--remove L
Bigrams with counts less than L in the entire SOURCE data are removed from the sample. The counts of the removed bigrams are not counted in any marginal totals. This has the same effect as count.pl's --remove option.

--uremove L
Bigrams with counts more than L in the entire SOURCE data are removed from the sample. The counts of the removed bigrams are not counted in any marginal totals. This has the same effect as count.pl's --uremove option.

--frequency F
Bigrams with counts less than F in the entire SOURCE are not displayed. The counts of the skipped bigrams ARE counted in the marginal totals. In other words, --frequency in huge-count.pl has the same effect as count.pl's --frequency option.

--ufrequency F
Bigrams with counts more than F in the entire SOURCE are not displayed. The counts of the skipped bigrams ARE counted in the marginal totals. In other words, --ufrequency in huge-count.pl has the same effect as count.pl's --ufrequency option.

--newLine
Switches ON the --newLine option in count.pl. This prevents bigrams from spanning across lines.

Other Options:

--help
Displays this message.

--version
Displays the version information.

PROGRAM LOGIC
In summary, a large data file can be provided to huge-count in the form of

a. A single plain file
b. A directory containing several plain files
c. Multiple plain files directly specified as command line arguments

In all these cases, count.pl with --tokenlist is run separately on the SOURCE files or on parts of a SOURCE file, and the intermediate results are written to the DESTINATION directory.
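
The count-separately-then-merge idea can be illustrated with a short Python sketch. This is not huge-count.pl's actual code (which is Perl and delegates to count.pl and huge-sort.pl); the function names count_bigrams and chunked_count, the sample sentence, and the chunk size are all hypothetical, and each chunk merely stands in for one separately counted SOURCE file. The sketch also shows why counting parts independently can drop the bigram that straddles a boundary between two parts.

```python
from collections import Counter


def count_bigrams(tokens):
    """Count adjacent-token bigrams in one chunk (window of 2)."""
    return Counter(zip(tokens, tokens[1:]))


def chunked_count(tokens, chunk_size):
    """Count bigrams chunk by chunk, then merge the partial counts.

    Each chunk is counted in isolation, so only one chunk's counts
    need to be held while it is processed (illustrative assumption).
    """
    total = Counter()
    for start in range(0, len(tokens), chunk_size):
        total.update(count_bigrams(tokens[start:start + chunk_size]))
    return total


# Hypothetical sample data: 10 tokens, split into two chunks of 5.
tokens = "the cat sat on the mat and the cat slept".split()

whole = count_bigrams(tokens)                 # counted in one pass
merged = chunked_count(tokens, chunk_size=5)  # counted per chunk, merged

# Bigrams present in the single-pass count but lost at the chunk boundary.
missing = whole - merged
print(missing)
```

Here the only bigram absent from the merged counts is the one formed by the last token of the first chunk and the first token of the second chunk.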
OUTPUT
After huge-count finishes successfully, DESTINATION will contain -
BUGS
huge-count.pl doesn't consider bigrams at file boundaries. In other words, the results of count.pl and huge-count.pl on the same data file will differ if --newLine is not used, because huge-count.pl runs count.pl on multiple files separately and thus loses track of the bigrams at file boundaries. With --window not specified, there is a loss of one bigram at each file boundary, while with --window W the loss is W bigrams.

The functionality of huge-count with --tokenlist is the same as that of count.pl only if --newLine is used and all files start and end on sentence boundaries. In other words, no sentence should be split across the start or end of any file given to huge-count.

AUTHOR
Amruta Purandare, University of Minnesota, Duluth

Ted Pedersen, University of Minnesota, Duluth
tpederse at umn.edu

Ying Liu, University of Minnesota, Twin Cities
liux0395 at umn.edu

COPYRIGHT
Copyright (c) 2004-2010, Amruta Purandare, Ted Pedersen, and Ying Liu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.