NAME
index.noun, data.noun, index.verb, data.verb, index.adj, data.adj, index.adv, data.adv - WordNet database files

noun.exc, verb.exc, adj.exc, adv.exc - morphology exception lists

sentidx.vrb, sents.vrb - files used by search code to display sentences illustrating the use of some specific verbs

DESCRIPTION
For each syntactic category, two files are needed to represent the contents of the WordNet database - index.pos and data.pos, where pos is one of noun, verb, adj, or adv. The other auxiliary files are used by the WordNet library's searching functions and are needed to run the various WordNet browsers.

Each index file is an alphabetized list of all the words found in WordNet in the corresponding part of speech. On each line, following the word, is a list of byte offsets (synset_offsets) in the corresponding data file, one for each synset containing the word. Words in the index file are in lower case only, regardless of how they were entered in the lexicographer files. This folds various orthographic representations of the word into one line, enabling database searches to be case insensitive. See wninput(5WN) for a detailed description of the lexicographer files.

A data file for a syntactic category contains information corresponding to the synsets that were specified in the lexicographer files, with relational pointers resolved to synset_offsets. Each line corresponds to a synset. Pointers are followed and hierarchies traversed by moving from one synset to another via the synset_offsets.

The exception list files, pos.exc, are used to help the morphological processor find base forms from irregular inflections.

The files sentidx.vrb and sents.vrb contain sentences illustrating the use of specific senses of some verbs. These files are used by the searching software in response to a request for verb sentence frames. Generic sentence frames are displayed when an illustrative sentence is not present.
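Because a synset_offset is a byte offset into a data file, following one amounts to a seek followed by a single line read. The sketch below illustrates that idea on a toy file and is not the WordNet library's own code. The helper names (`read_synset_line`, `build_toy_data_file`), the file name `data.toy`, the line contents, and the 8-digit zero-filled offset width are all assumptions made for the example.

```python
import os
import tempfile


def read_synset_line(data_path, synset_offset):
    """Return the line that starts at byte `synset_offset` in a data file."""
    # Binary mode keeps seek() positions in bytes, matching the byte offsets.
    with open(data_path, "rb") as f:
        f.seek(synset_offset)
        return f.readline().decode("ascii").rstrip("\n")


def build_toy_data_file(path, bodies):
    """Write toy lines, each starting with its own zero-filled byte offset.

    The line contents are made up; real data.pos lines carry more fields.
    Returns the list of byte offsets at which the lines start.
    """
    offsets = []
    position = 0
    with open(path, "wb") as f:
        for body in bodies:
            line = f"{position:08d} {body}\n".encode("ascii")
            f.write(line)
            offsets.append(position)
            position += len(line)
    return offsets


if __name__ == "__main__":
    toy_path = os.path.join(tempfile.mkdtemp(), "data.toy")
    toy_offsets = build_toy_data_file(
        toy_path, ["first toy synset", "second toy synset"]
    )
    # Jump straight to the second line without scanning the first.
    print(read_synset_line(toy_path, toy_offsets[1]))
```

Opening the file in binary mode matters here: text-mode positions are not guaranteed to be plain byte counts, whereas the offsets stored in the index files are.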
The various database files are in ASCII formats that are easily read by both humans and machines. All fields, unless otherwise noted, are separated by one space character, and all lines are terminated by a newline character. Fields enclosed in square brackets may not be present.

See wngloss(7WN) for a glossary of WordNet terminology and a discussion of the database's content and logical organization.

Index File Format
Each index file begins with several lines containing a copyright notice, version number and license agreement. These lines all begin with two spaces and the line number so they do not interfere with the binary search algorithm that is used to look up entries in the index files. All other lines are in the following format. In the field descriptions, number always refers to a decimal integer unless otherwise defined.

lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]
All remaining fields are with respect to senses of lemma in pos.
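As a rough illustration of the index line layout above, the following sketch splits one line into named fields. It is not the WordNet library's parser. It assumes, from the field names alone, that p_cnt gives the number of ptr_symbol fields and that synset_cnt gives the number of synset_offset fields. The sample line, the `IndexEntry` class, and the function names are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexEntry:
    lemma: str
    pos: str
    synset_cnt: int
    ptr_symbols: List[str]
    sense_cnt: int
    tagsense_cnt: int
    synset_offsets: List[int]


def is_header_line(line):
    """Header lines (copyright, version, license) begin with two spaces."""
    return line.startswith("  ")


def parse_index_line(line):
    """Split one non-header index line into its fields."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])  # assumed: number of ptr_symbol fields that follow
    ptr_symbols = fields[4:4 + p_cnt]
    rest = fields[4 + p_cnt:]
    sense_cnt = int(rest[0])
    tagsense_cnt = int(rest[1])
    synset_offsets = [int(token) for token in rest[2:]]
    if len(synset_offsets) != synset_cnt:  # assumed consistency check
        raise ValueError("synset_cnt does not match the number of offsets")
    return IndexEntry(lemma, pos, synset_cnt, ptr_symbols,
                      sense_cnt, tagsense_cnt, synset_offsets)


if __name__ == "__main__":
    # Made-up line for illustration, not taken from a real index file.
    sample = "toyword n 2 2 @ ~ 2 1 00001740 00002137"
    if not is_header_line(sample):
        print(parse_index_line(sample))
```

The offsets are returned in the order they appear on the line, so any ordering the file gives them is preserved.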
Data File Format
Each data file begins with several lines containing a copyright notice, version number and license agreement. These lines all begin with two spaces and the line number. All other lines are in the following format. Integer fields are of fixed length, and are zero-filled.

synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
n    NOUN
v    VERB
a    ADJECTIVE
s    ADJECTIVE SATELLITE
r    ADVERB
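The leading fields of a data line, together with the synset type codes listed above, can be pulled apart along the following lines. This is a hedged sketch, not the WordNet library's parser. It assumes that w_cnt and lex_id are written in hexadecimal and that the gloss is set off by " | ". The pointer and frame fields are returned unparsed because their internal layout is not spelled out in this text. The sample line and all names are invented for the example.

```python
# Synset type codes, as listed in the table above.
SS_TYPE_NAMES = {
    "n": "NOUN",
    "v": "VERB",
    "a": "ADJECTIVE",
    "s": "ADJECTIVE SATELLITE",
    "r": "ADVERB",
}


def parse_data_line(line):
    """Parse the leading fields and the gloss of one non-header data line."""
    head, _, gloss = line.rstrip("\n").partition(" | ")
    fields = head.split()
    synset_offset = int(fields[0])
    lex_filenum = int(fields[1])
    ss_type = fields[2]
    w_cnt = int(fields[3], 16)  # assumption: w_cnt is hexadecimal
    words = []
    i = 4
    for _ in range(w_cnt):
        # Each word is followed by its lex_id (assumed hexadecimal).
        words.append((fields[i], int(fields[i + 1], 16)))
        i += 2
    return {
        "synset_offset": synset_offset,
        "lex_filenum": lex_filenum,
        "ss_type": ss_type,
        "ss_type_name": SS_TYPE_NAMES[ss_type],
        "words": words,
        "rest": fields[i:],  # p_cnt, pointers and frames, left unparsed
        "gloss": gloss,
    }


if __name__ == "__main__":
    # Made-up line for illustration, not taken from a real data file.
    sample = ("00001740 03 n 02 toyword 0 plaything 1 001 @ 00002137 n 0000"
              " | a made-up gloss for illustration")
    print(parse_data_line(sample))
```

Leaving the tail of the line in `rest` keeps the sketch within what the format line itself shows; a fuller parser would need the pointer and frame field definitions.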
Sense Numbers
Senses in WordNet are generally ordered from most to least frequently used, with the most common sense numbered 1. Frequency of use is determined by the number of times a sense is tagged in the various semantic concordance texts. Senses that are not semantically tagged follow the ordered senses. The tagsense_cnt field for each entry in the index.pos files indicates how many of the senses in the list have been tagged.

The cntlist(5WN) file provided with the database lists the number of times each sense is tagged in the semantic concordances. The data from cntlist is used by grind(1WN) to order the senses of each word. When the index.pos files are generated, the synset_offsets are output in sense number order, with sense 1 first in the list. Senses with the same number of semantic tags are assigned unique but consecutive sense numbers. The WordNet OVERVIEW search displays all senses of the specified word, in all syntactic categories, and indicates which of the senses are represented in the semantically tagged texts.

Exception List File Format
Exception lists are alphabetized lists of inflected forms of words and their base forms. The first field of each line is an inflected form, followed by a space-separated list of one or more base forms of the word. There is one exception list file for each syntactic category.

Note that the noun and verb exception lists were automatically generated from a machine-readable dictionary, and contain many words that are not in WordNet. Also, for many of the inflected forms, base forms could be easily derived using the standard rules of detachment programmed into Morphy (see morphy(7WN)). These anomalies are allowed to remain in the exception list files, as they do no harm.

Verb Example Sentences
For some verb senses, example sentences illustrating the use of the verb sense can be displayed.
Each line of the file sentidx.vrb contains a sense_key followed by a space and a comma-separated list of example sentence template numbers, in decimal. The file sents.vrb lists all of the example sentence templates. Each line begins with the template number followed by a space. The rest of the line is the text of a template example sentence, with %s used as a placeholder in the text for the verb. Both files are sorted alphabetically so that the sense_key and template sentence number can be used as indices, via binsrch(3WN), into the appropriate file.

When a request for FRAMES is made, the WordNet search code looks for the sense in sentidx.vrb. If found, the sentence templates listed are retrieved from sents.vrb, and the %s is replaced with the verb. If the sense is not found, the applicable generic sentence frames listed in frames are displayed.

NOTES
Information in the data.pos and index.pos files represents all of the word senses and synsets in the WordNet database. The word, lex_id, and lex_filenum fields together uniquely identify each word sense in WordNet. These can be encoded in a sense_key as described in senseidx(5WN). Each synset in the database can be uniquely identified by combining the synset_offset for the synset with a code for the syntactic category (since it is possible for synsets in different data.pos files to have the same synset_offset).

The WordNet system provides both command line and window-based browser interfaces to the database. Both interfaces utilize a common library of search and morphology code. The source code for the library and interfaces is included in the WordNet package. See wnintro(3WN) for an overview of the WordNet source code.

ENVIRONMENT VARIABLES (UNIX)
REGISTRY (WINDOWS)
FILES
SEE ALSO
grind(1WN), wn(1WN), wnb(1WN), wnintro(3WN), binsrch(3WN), wnintro(5WN), cntlist(5WN), lexnames(5WN), senseidx(5WN), wninput(5WN), morphy(7WN), wngloss(7WN), wngroups(7WN), wnstats(7WN).