vocabulary -- extract vocabularies from Penn treebank files
vocabulary [-NT ntfile] [-POS posfile] [-word wordfile] [-count] [-binarized]
[-verbose] file1 [file2...]
The arguments file1, file2, etc. are the names of Penn treebank files. If none
are specified, STDIN is used.
-NT ntfile
    Write the non-terminal node vocabulary to ntfile.
-POS posfile
    Write the part of speech vocabulary to posfile.
-word wordfile
    Write the word vocabulary to wordfile.
-count
    Print the frequency counts for each of the categories.
-binarized
    The input files are in binarized format.
-verbose
    Print filenames as they are processed.
Given a list of Penn treebank files, this script extracts the words, parts of
speech, and non-terminal node names and writes each vocabulary to a separate
file in order of frequency.
Giving "-" as the argument for any of ntfile, posfile, or wordfile causes that
vocabulary to be written to STDOUT.
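
The extraction described above can be sketched in Python as follows. This is
only an illustration and not the script's actual implementation. The function
names (extract_vocabularies, format_vocabulary) are hypothetical. The sketch
assumes plain bracketed trees such as "(S (NP (DT the) (NN dog)))", most
frequent item first, and a tab-separated "item, count" layout when counts are
requested. It does not handle the binarized format, and it treats trace
elements such as "(-NONE- *T*-1)" like any other part of speech and word pair.

```python
import re
from collections import Counter

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")
BRACKETS = ("(", ")")


def extract_vocabularies(text):
    """Count non-terminals, parts of speech, and words in bracketed trees.

    Assumption: a node of the form "(LABEL word)" is a part of speech with
    its word, and a labeled node whose first child is another node is a
    non-terminal. Unlabeled wrapper brackets are ignored.
    """
    nts, pos, words = Counter(), Counter(), Counter()
    tokens = TOKEN.findall(text)
    for i, tok in enumerate(tokens):
        if tok != "(" or i + 2 >= len(tokens):
            continue
        label, child = tokens[i + 1], tokens[i + 2]
        if label in BRACKETS:
            continue  # unlabeled wrapper, e.g. the outer "( (S ...) )"
        if child == "(":
            nts[label] += 1
        elif child != ")":
            pos[label] += 1
            words[child] += 1
    return nts, pos, words


def format_vocabulary(counter, with_counts=False):
    """Return one line per item, most frequent first (assumed ordering)."""
    if with_counts:
        return [f"{item}\t{count}" for item, count in counter.most_common()]
    return [item for item, _ in counter.most_common()]


sample = "( (S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT the) (NN cat)))) )"
nts, pos, words = extract_vocabularies(sample)
for line in format_vocabulary(words, with_counts=True):
    print(line)
```

Each returned Counter corresponds to one of the three vocabularies, so a
caller could write each list of lines to its own file, or to standard output.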
W.P. McNeill <billmcn@ssli.ee.washington.edu>