| wndb(5WN) - phpMan
WNDB(5WN) WordNet™ File Formats WNDB(5WN)
NAME
index.noun, data.noun, index.verb, data.verb, index.adj, data.adj, index.adv, data.adv -
WordNet database files
noun.exc, verb.exc. adj.exc adv.exc - morphology exception lists
sentidx.vrb, sents.vrb - files used by search code to display sentences illustrating the
use of some specific verbs
DESCRIPTION
For each syntactic category, two files are needed to represent the contents of the WordNet
database - index.pos and data.pos, where pos is noun, verb, adj and adv. The other auxil‐
iary files are used by the WordNet library's searching functions and are needed to run the
various WordNet browsers.
Each index file is an alphabetized list of all the words found in WordNet in the corre‐
sponding part of speech. On each line, following the word, is a list of byte offsets
(synset_offsets) in the corresponding data file, one for each synset containing the word.
Words in the index file are in lower case only, regardless of how they were entered in the
lexicographer files. This folds various orthographic representations of the word into one
line enabling database searches to be case insensitive. See wninput(5WN) for a detailed
description of the lexicographer files
A data file for a syntactic category contains information corresponding to the synsets
that were specified in the lexicographer files, with relational pointers resolved to
synset_offsets. Each line corresponds to a synset. Pointers are followed and hierarchies
traversed by moving from one synset to another via the synset_offsets.
The exception list files, pos.exc, are used to help the morphological processor find base
forms from irregular inflections.
The files sentidx.vrb and sents.vrb contain sentences illustrating the use of specific
senses of some verbs. These files are used by the searching software in response to a
request for verb sentence frames. Generic sentence frames are displayed when an illustra‐
tive sentence is not present.
The various database files are in ASCII formats that are easily read by both humans and
machines. All fields, unless otherwise noted, are separated by one space character, and
all lines are terminated by a newline character. Fields enclosed in italicized square
brackets may not be present.
See wngloss(7WN) for a glossary of WordNet terminology and a discussion of the database's
content and logical organization.
Index File Format
Each index file begins with several lines containing a copyright notice, version number
and license agreement. These lines all begin with two spaces and the line number so they
do not interfere with the binary search algorithm that is used to look up entries in the
index files. All other lines are in the following format. In the field descriptions,
number always refers to a decimal integer unless otherwise defined.
lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]
lemma lower case ASCII text of word or collocation. Collocations are formed by
joining individual words with an underscore (_) character.
pos Syntactic category: n for noun files, v for verb files, a for adjective
files, r for adverb files.
All remaining fields are with respect to senses of lemma in pos.
synset_cnt Number of synsets that lemma is in. This is the number of senses of the
word in WordNet. See Sense Numbers below for a discussion of how sense num‐
bers are assigned and the order of synset_offsets in the index files.
p_cnt Number of different pointers that lemma has in all synsets containing it.
ptr_symbol A space separated list of p_cnt different types of pointers that lemma has
in all synsets containing it. See wninput(5WN) for a list of pointer_sym‐
bols. If all senses of lemma have no pointers, this field is omitted and
p_cnt is 0.
sense_cnt Same as sense_cnt above. This is redundant, but the field was preserved
for compatibility reasons.
tagsense_cnt Number of senses of lemma that are ranked according to their frequency of
occurrence in semantic concordance texts.
synset_offset Byte offset in data.pos file of a synset containing lemma. Each
synset_offset in the list corresponds to a different sense of lemma in
WordNet. synset_offset is an 8 digit, zero-filled decimal integer that can
be used with fseek(3) to read a synset from the data file. When passed to
read_synset(3WN) along with the syntactic category, a data structure con‐
taining the parsed synset is returned.
Data File Format
Each data file begins with several lines containing a copyright notice, version number and
license agreement. These lines all begin with two spaces and the line number. All other
lines are in the following format. Integer fields are of fixed length, and are zero-
filled.
synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss
synset_offset Current byte offset in the file represented as an 8 digit decimal integer.
lex_filenum Two digit decimal integer corresponding to the lexicographer file name con‐
taining the synset. See lexnames(5WN) for the list of filenames and their
corresponding numbers.
ss_type One character code indicating the synset type:
n NOUN
v VERB
a ADJECTIVE
s ADJECTIVE SATELLITE
r ADVERB
w_cnt Two digit hexadecimal integer indicating the number of words in the synset.
word ASCII form of a word as entered in the synset by the lexicographer, with
spaces replaced by underscore characters (_). The text of the word is case
sensitive, in contrast to its form in the corresponding index.pos file,
that contains only lower-case forms. In data.adj, a word is followed by a
syntactic marker if one was specified in the lexicographer file. A syntac‐
tic marker is appended, in parentheses, onto word without any intervening
spaces. See wninput(5WN) for a list of the syntactic markers for adjec‐
tives.
lex_id One digit hexadecimal integer that, when appended onto lemma, uniquely
identifies a sense within a lexicographer file. lex_id numbers usually
start with 0, and are incremented as additional senses of the word are
added to the same file, although there is no requirement that the numbers
be consecutive or begin with 0. Note that a value of 0 is the default, and
therefore is not present in lexicographer files.
p_cnt Three digit decimal integer indicating the number of pointers from this
synset to other synsets. If p_cnt is 000 the synset has no pointers.
ptr A pointer from this synset to another. ptr is of the form:
pointer_symbol synset_offset pos source/target
where synset_offset is the byte offset of the target synset in the data
file corresponding to pos.
The source/target field distinguishes lexical and semantic pointers. It is
a four byte field, containing two two-digit hexadecimal integers. The
first two digits indicates the word number in the current (source) synset,
the last two digits indicate the word number in the target synset. A value
of 0000 means that pointer_symbol represents a semantic relation between
the current (source) synset and the target synset indicated by synset_off‐
set.
A lexical relation between two words in different synsets is represented by
non-zero values in the source and target word numbers. The first and last
two bytes of this field indicate the word numbers in the source and target
synsets, respectively, between which the relation holds. Word numbers are
assigned to the word fields in a synset, from left to right, beginning with
1.
See wninput(5WN) for a list of pointer_symbols, and semantic and lexical
pointer classifications.
frames In data.verb only, a list of numbers corresponding to the generic verb sen‐
tence frames for words in the synset. frames is of the form:
f_cnt + f_num w_num [ + f_num w_num...]
where f_cnt a two digit decimal integer indicating the number of generic
frames listed, f_num is a two digit decimal integer frame number, and w_num
is a two digit hexadecimal integer indicating the word in the synset that
the frame applies to. As with pointers, if this number is 00, f_num
applies to all words in the synset. If non-zero, it is applicable only to
the word indicated. Word numbers are assigned as described for pointers.
Each f_num w_num pair is preceded by a +. See wninput(5WN) for the text
of the generic sentence frames.
gloss Each synset contains a gloss. A gloss is represented as a vertical bar
(|), followed by a text string that continues until the end of the line.
The gloss may contain a definition, one or more example sentences, or both.
Sense Numbers
Senses in WordNet are generally ordered from most to least frequently used, with the most
common sense numbered 1. Frequency of use is determined by the number of times a sense is
tagged in the various semantic concordance texts. Senses that are not semantically tagged
follow the ordered senses. The tagsense_cnt field for each entry in the index.pos files
indicates how many of the senses in the list have been tagged.
The cntlist(5WN) file provided with the database lists the number of times each sense is
tagged in the semantic concordances. The data from cntlist is used by grind(1WN) to order
the senses of each word. When the index.pos files are generated, the synset_offsets are
output in sense number order, with sense 1 first in the list. Senses with the same number
of semantic tags are assigned unique but consecutive sense numbers. The WordNet OVERVIEW
search displays all senses of the specified word, in all syntactic categories, and indi‐
cates which of the senses are represented in the semantically tagged texts.
Exception List File Format
Exception lists are alphabetized lists of inflected forms of words and their base forms.
The first field of each line is an inflected form, followed by a space separated list of
one or more base forms of the word. There is one exception list file for each syntactic
category.
Note that the noun and verb exception lists were automatically generated from a machine-
readable dictionary, and contain many words that are not in WordNet. Also, for many of
the inflected forms, base forms could be easily derived using the standard rules of
detachment programmed into Morphy (See morph(7WN)). These anomalies are allowed to remain
in the exception list files, as they do no harm.
Verb Example Sentences
For some verb senses, example sentences illustrating the use of the verb sense can be dis‐
played. Each line of the file sentidx.vrb contains a sense_key followed by a space and a
comma separated list of example sentence template numbers, in decimal. The file sents.vrb
lists all of the example sentence templates. Each line begins with the template number
followed by a space. The rest of the line is the text of a template example sentence,
with %s used as a placeholder in the text for the verb. Both files are sorted alphabeti‐
cally so that the sense_key and template sentence number can be used as indices, via bin‐
srch(3WN), into the appropriate file.
When a request for FRAMES is made, the WordNet search code looks for the sense in sen‐
tidx.vrb. If found, the sentence template(s) listed is retrieved from sents.vrb, and the
%s is replaced with the verb. If the sense is not found, the applicable generic sentence
frame(s) listed in frames is displayed.
NOTES
Information in the data.pos and index.pos files represents all of the word senses and
synsets in the WordNet database. The word, lex_id, and lex_filenum fields together
uniquely identify each word sense in WordNet. These can be encoded in a sense_key as
described in senseidx(5WN). Each synset in the database can be uniquely identified by
combining the synset_offset for the synset with a code for the syntactic category (since
it is possible for synsets in different data.pos files to have the same synset_offset).
The WordNet system provide both command line and window-based browser interfaces to the
database. Both interfaces utilize a common library of search and morphology code. The
source code for the library and interfaces is included in the WordNet package. See wnin‐
tro(3WN) for an overview of the WordNet source code.
ENVIRONMENT VARIABLES (UNIX)
WNHOME Base directory for WordNet. Default is /usr/local/WordNet-3.0.
WNSEARCHDIR Directory in which the WordNet database has been installed. Default
is WNHOME/dict.
REGISTRY (WINDOWS)
HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
Base directory for WordNet. Default is C:\Program Files\WordNet\3.0.
FILES
index.pos database index files
data.pos database data files
*.vrb files of sentences illustrating the use of verbs
pos.exc morphology exception lists
SEE ALSO
grind(1WN), wn(1WN), wnb(1WN), wnintro(3WN), binsrch(3WN), wnintro(5WN), cntlist(5WN),
lexnames(5WN), senseidx(5WN), wninput(5WN), morphy(7WN), wngloss(7WN), wngroups(7WN),
wnstats(7WN).
WordNet 3.0 Dec 2006 WNDB(5WN)
|