WNDB(5) WordNet™ File Formats WNDB(5)

index.noun, data.noun, index.verb, data.verb, index.adj, data.adj,
index.adv, data.adv - WordNet database files

noun.exc, verb.exc, adj.exc, adv.exc - morphology exception lists

sentidx.vrb, sents.vrb - files used by search code to display sentences
illustrating the use of some specific verbs

For each syntactic category, two files are needed to represent the
contents of the WordNet database: index.pos and data.pos, where pos is
one of noun, verb, adj, or adv. The other auxiliary files are used by the
WordNet library's searching functions and are needed to run the various
WordNet browsers.

Each index file is an alphabetized list of all the words found in
WordNet in the corresponding part of speech. On each line, following the
word, is a list of byte offsets (synset_offsets) in the corresponding
data file, one for each synset containing the word. Words in the index
file are in lower case only, regardless of how they were entered in the
lexicographer files. This folds various orthographic representations
of the word into one line, enabling database searches to be case
insensitive. See wninput(5) for a detailed description of the
lexicographer files.

A data file for a syntactic category contains information corresponding
to the synsets that were specified in the lexicographer files, with
relational pointers resolved to synset_offsets. Each line corresponds
to a synset. Pointers are followed and hierarchies traversed by moving
from one synset to another via the synset_offsets.
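
As a minimal sketch of that traversal, assuming Python and a data file
opened in binary mode, a synset line can be fetched by seeking to its
synset_offset. The function name read_synset_line and the in-memory demo
content are hypothetical and are not part of the WordNet library.

```python
import io

def read_synset_line(data_file, synset_offset):
    """Return the synset line that starts at byte synset_offset.

    data_file must be opened in binary mode so that seek() counts bytes.
    """
    data_file.seek(synset_offset)
    line = data_file.readline().decode("ascii").rstrip("\n")
    # Each data line starts with its own 8 digit, zero-filled byte offset,
    # which gives a cheap check that the offset really begins a synset.
    if line[:8] != f"{synset_offset:08d}":
        raise ValueError(f"no synset starts at offset {synset_offset}")
    return line

# Tiny in-memory stand-in for a data.pos file (made-up content).
first = b"00000000 03 n 01 alpha 0 000 | first made-up gloss\n"
second = b"%08d 03 n 01 beta 0 000 | second made-up gloss\n" % len(first)
demo = io.BytesIO(first + second)
```

Calling read_synset_line(demo, len(first)) returns the second line; an
offset that does not start a line raises ValueError.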

The exception list files, pos.exc, are used to help the morphological
processor find base forms from irregular inflections.

The files sentidx.vrb and sents.vrb contain sentences illustrating the
use of specific senses of some verbs. These files are used by the
searching software in response to a request for verb sentence frames.
Generic sentence frames are displayed when an illustrative sentence is
not present.

The various database files are in ASCII formats that are easily read by
both humans and machines. All fields, unless otherwise noted, are
separated by one space character, and all lines are terminated by a
newline character. Fields enclosed in square brackets may not be
present.

See wngloss(7) for a glossary of WordNet terminology and a discussion
of the database's content and logical organization.

Index File Format
Each index file begins with several lines containing a copyright
notice, version number and license agreement. These lines all begin
with two spaces and the line number so they do not interfere with the
binary search algorithm that is used to look up entries in the index
files. All other lines are in the following format. In the field
descriptions, number always refers to a decimal integer unless
otherwise defined.

    lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset [synset_offset...]

lemma           lower case ASCII text of word or collocation.
                Collocations are formed by joining individual words with
                an underscore (_) character.

pos             Syntactic category: n for noun files, v for verb files,
                a for adjective files, r for adverb files.

All remaining fields are with respect to senses of lemma in pos.

synset_cnt      Number of synsets that lemma is in. This is the number
                of senses of the word in WordNet. See Sense Numbers
                below for a discussion of how sense numbers are assigned
                and the order of synset_offsets in the index files.

p_cnt           Number of different pointers that lemma has in all
                synsets containing it.

ptr_symbol      A space separated list of p_cnt different types of
                pointers that lemma has in all synsets containing it.
                See wninput(5) for a list of pointer_symbols. If all
                senses of lemma have no pointers, this field is omitted
                and p_cnt is 0.

sense_cnt       Same as synset_cnt above. This is redundant, but the
                field was preserved for compatibility reasons.

tagsense_cnt    Number of senses of lemma that are ranked according to
                their frequency of occurrence in semantic concordance
                texts.

synset_offset   Byte offset in data.pos file of a synset containing
                lemma. Each synset_offset in the list corresponds to a
                different sense of lemma in WordNet. synset_offset is
                an 8 digit, zero-filled decimal integer that can be used
                with fseek(3) to read a synset from the data file. When
                passed to read_synset(3) along with the syntactic
                category, a data structure containing the parsed synset
                is returned.
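
The index line layout above can be illustrated with a short Python
sketch. The helper names parse_index_line and find_entry, and the sample
lines, are made up for illustration; the sketch also assumes the file is
sorted in plain ASCII order and, for brevity, searches an in-memory list
of lines rather than the file itself.

```python
import bisect

def parse_index_line(line):
    """Split one index.pos entry into a dict of its fields."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt, p_cnt = int(fields[2]), int(fields[3])
    ptr_symbols = fields[4:4 + p_cnt]  # empty list when p_cnt is 0
    sense_cnt = int(fields[4 + p_cnt])
    tagsense_cnt = int(fields[5 + p_cnt])
    offsets = [int(f) for f in fields[6 + p_cnt:]]
    if len(offsets) != synset_cnt:
        raise ValueError(f"expected {synset_cnt} offsets, found {len(offsets)}")
    return {
        "lemma": lemma, "pos": pos, "synset_cnt": synset_cnt,
        "ptr_symbols": ptr_symbols, "sense_cnt": sense_cnt,
        "tagsense_cnt": tagsense_cnt, "synset_offsets": offsets,
    }

def find_entry(index_lines, lemma):
    """Binary-search sorted index lines; return the parsed entry or None."""
    # Header lines start with spaces, so their first space-delimited token
    # is the empty string, which sorts before every lemma.
    keys = [line.split(" ", 1)[0] for line in index_lines]
    i = bisect.bisect_left(keys, lemma)
    if i < len(keys) and keys[i] == lemma:
        return parse_index_line(index_lines[i])
    return None

# Made-up lines in the documented layout (not copied from a real index file).
sample_index = [
    "  1 made-up header line standing in for the license text",
    "cat n 1 1 @ 1 1 00002000",
    "dog n 2 2 @ ~ 2 1 00001740 00002137",
]
```

Here find_entry(sample_index, "dog") yields two synset_offsets and the
pointer symbols @ and ~, while a lemma that is absent yields None.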

Data File Format
Each data file begins with several lines containing a copyright notice,
version number and license agreement. These lines all begin with two
spaces and the line number. All other lines are in the following
format. Integer fields are of fixed length, and are zero-filled.

    synset_offset lex_filenum ss_type w_cnt word lex_id [word lex_id...] p_cnt [ptr...] [frames...] | gloss

synset_offset   Current byte offset in the file represented as an 8
                digit decimal integer.

lex_filenum     Two digit decimal integer corresponding to the
                lexicographer file name containing the synset. See
                lexnames(5) for the list of filenames and their
                corresponding numbers.

ss_type         One character code indicating the synset type:

                    n    NOUN
                    v    VERB
                    a    ADJECTIVE
                    s    ADJECTIVE SATELLITE
                    r    ADVERB

w_cnt           Two digit hexadecimal integer indicating the number of
                words in the synset.

word            ASCII form of a word as entered in the synset by the
                lexicographer, with spaces replaced by underscore
                characters (_). The text of the word is case sensitive,
                in contrast to its form in the corresponding index.pos
                file, which contains only lower-case forms. In data.adj,
                a word is followed by a syntactic marker if one was
                specified in the lexicographer file. A syntactic marker
                is appended, in parentheses, onto word without any
                intervening spaces. See wninput(5) for a list of the
                syntactic markers for adjectives.

lex_id          One digit hexadecimal integer that, when appended onto
                lemma, uniquely identifies a sense within a
                lexicographer file. lex_id numbers usually start with 0,
                and are incremented as additional senses of the word are
                added to the same file, although there is no requirement
                that the numbers be consecutive or begin with 0. Note
                that a value of 0 is the default, and therefore is not
                present in lexicographer files.

p_cnt           Three digit decimal integer indicating the number of
                pointers from this synset to other synsets. If p_cnt is
                000 the synset has no pointers.

ptr             A pointer from this synset to another. ptr is of the
                form:

                    pointer_symbol synset_offset pos source/target

                where synset_offset is the byte offset of the target
                synset in the data file corresponding to pos.

                The source/target field distinguishes lexical and
                semantic pointers. It is a four byte field, containing
                two two-digit hexadecimal integers. The first two digits
                indicate the word number in the current (source) synset,
                the last two digits indicate the word number in the
                target synset. A value of 0000 means that pointer_symbol
                represents a semantic relation between the current
                (source) synset and the target synset indicated by
                synset_offset.

                A lexical relation between two words in different
                synsets is represented by non-zero values in the source
                and target word numbers. The first and last two bytes
                of this field indicate the word numbers in the source
                and target synsets, respectively, between which the
                relation holds. Word numbers are assigned to the word
                fields in a synset, from left to right, beginning with
                1.

                See wninput(5) for a list of pointer_symbols, and
                semantic and lexical pointer classifications.

frames          In data.verb only, a list of numbers corresponding to
                the generic verb sentence frames for words in the
                synset. frames is of the form:

                    f_cnt + f_num w_num [ + f_num w_num...]

                where f_cnt is a two digit decimal integer indicating
                the number of generic frames listed, f_num is a two
                digit decimal integer frame number, and w_num is a two
                digit hexadecimal integer indicating the word in the
                synset that the frame applies to. As with pointers, if
                this number is 00, f_num applies to all words in the
                synset. If non-zero, it is applicable only to the word
                indicated. Word numbers are assigned as described for
                pointers. Each f_num w_num pair is preceded by a +.
                See wninput(5) for the text of the generic sentence
                frames.

gloss           Each synset contains a gloss. A gloss is represented as
                a vertical bar (|), followed by a text string that
                continues until the end of the line. The gloss may
                contain a definition, one or more example sentences, or
                both.
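
The data line layout above, including the hexadecimal w_cnt, the
source/target split and the optional frames list, can be sketched in
Python as follows. The function name parse_data_line and the sample line
are hypothetical; the sketch assumes that the first vertical bar on the
line starts the gloss and leaves any adjective syntactic marker attached
to the word.

```python
def parse_data_line(line):
    """Parse one data.pos synset line into a dict."""
    body, _, gloss = line.partition("|")
    tok = body.split()
    synset_offset, lex_filenum, ss_type = int(tok[0]), int(tok[1]), tok[2]
    w_cnt = int(tok[3], 16)  # hexadecimal word count
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((tok[i], int(tok[i + 1], 16)))  # (word, lex_id)
        i += 2
    p_cnt = int(tok[i])  # decimal pointer count
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, target_offset, pos, src_tgt = tok[i:i + 4]
        pointers.append({
            "symbol": symbol,
            "synset_offset": int(target_offset),
            "pos": pos,
            "source": int(src_tgt[:2], 16),  # 0 with target 0: semantic pointer
            "target": int(src_tgt[2:], 16),
        })
        i += 4
    frames = []
    if i < len(tok):  # remaining tokens are the data.verb frames list
        f_cnt = int(tok[i])
        i += 1
        for _ in range(f_cnt):
            # tok[i] is the literal "+" that precedes each pair
            frames.append((int(tok[i + 1]), int(tok[i + 2], 16)))
            i += 3
    return {
        "synset_offset": synset_offset, "lex_filenum": lex_filenum,
        "ss_type": ss_type, "words": words, "pointers": pointers,
        "frames": frames, "gloss": gloss.strip(),
    }

# Made-up verb synset line in the documented layout.
sample_data = ('00001740 29 v 02 breathe 0 respire 1 002 @ 00002137 v 0000 '
               '+ 00003000 n 0101 02 + 02 00 + 08 01 | made-up gloss; "an example"')
parsed = parse_data_line(sample_data)
```

In the sample, the first pointer is semantic (source and target are both
0), the second is lexical (word 1 to word 1), frame 2 applies to all
words and frame 8 applies only to word 1.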

Sense Numbers
Senses in WordNet are generally ordered from most to least frequently
used, with the most common sense numbered 1. Frequency of use is
determined by the number of times a sense is tagged in the various
semantic concordance texts. Senses that are not semantically tagged
follow the ordered senses. The tagsense_cnt field for each entry in
the index.pos files indicates how many of the senses in the list have
been tagged.

The cntlist(5) file provided with the database lists the number of
times each sense is tagged in the semantic concordances. The data from
cntlist is used by grind(1) to order the senses of each word. When the
index.pos files are generated, the synset_offsets are output in sense
number order, with sense 1 first in the list. Senses with the same
number of semantic tags are assigned unique but consecutive sense
numbers. The WordNet OVERVIEW search displays all senses of the specified
word, in all syntactic categories, and indicates which of the senses
are represented in the semantically tagged texts.
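
Since tagged senses come first and tagsense_cnt says how many there are,
sense numbers and tagged status can be derived from an index entry alone.
A small Python sketch, with the hypothetical helper name number_senses:

```python
def number_senses(synset_offsets, tagsense_cnt):
    """Return (sense_number, synset_offset, tagged) for each sense.

    Offsets are assumed to be in index-file order, so sense 1 is first
    and the first tagsense_cnt senses are the semantically tagged ones.
    """
    return [
        (number, offset, number <= tagsense_cnt)
        for number, offset in enumerate(synset_offsets, start=1)
    ]

# Made-up offsets: three senses, of which the first two are tagged.
senses = number_senses([1740, 2137, 3000], tagsense_cnt=2)
```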

Exception List File Format
Exception lists are alphabetized lists of inflected forms of words and
their base forms. The first field of each line is an inflected form,
followed by a space separated list of one or more base forms of the
word. There is one exception list file for each syntactic category.
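
A minimal Python sketch of reading this format into a lookup table
follows. The function name load_exceptions and the sample lines are
illustrative assumptions; loading everything into a dict is a
convenience here, not a description of how the WordNet library does it.

```python
def load_exceptions(lines):
    """Map each inflected form to its list of base forms."""
    table = {}
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            table[fields[0]] = fields[1:]
    return table

# Illustrative lines in the documented layout: inflected form, then bases.
sample_exc = ["axes ax axis\n", "geese goose\n"]
exceptions = load_exceptions(sample_exc)
```

With the sample lines, exceptions.get("axes") gives two base forms, and a
word that is not listed gives None, in which case a caller would fall
back to the regular rules of detachment.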

Note that the noun and verb exception lists were automatically
generated from a machine-readable dictionary, and contain many words that
are not in WordNet. Also, for many of the inflected forms, base forms
could be easily derived using the standard rules of detachment
programmed into Morphy (see morphy(7)). These anomalies are allowed to
remain in the exception list files, as they do no harm.

Verb Example Sentences
For some verb senses, example sentences illustrating the use of the
verb sense can be displayed. Each line of the file sentidx.vrb
contains a sense_key followed by a space and a comma separated list of
example sentence template numbers, in decimal. The file sents.vrb
lists all of the example sentence templates. Each line begins with the
template number followed by a space. The rest of the line is the text
of a template example sentence, with %s used as a placeholder in the
text for the verb. Both files are sorted alphabetically so that the
sense_key and template sentence number can be used as indices, via
binsrch(3), into the appropriate file.

When a request for FRAMES is made, the WordNet search code looks for
the sense in sentidx.vrb. If found, the sentence templates listed are
retrieved from sents.vrb, and the %s is replaced with the verb. If the
sense is not found, the applicable generic sentence frames listed in
frames are displayed.
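
The lookup just described can be sketched in Python as below. All three
function names, the sense_key string and the template texts are made up
for illustration; the real sense_key encoding is described in
senseidx(5), and dicts stand in for the binary search over the files.

```python
def load_sentidx(lines):
    """Map sense_key -> list of template numbers (sentidx.vrb layout)."""
    index = {}
    for line in lines:
        key, _, numbers = line.strip().partition(" ")
        if key:
            index[key] = [int(n) for n in numbers.split(",") if n]
    return index

def load_sents(lines):
    """Map template number -> template text (sents.vrb layout)."""
    templates = {}
    for line in lines:
        number, _, text = line.rstrip("\n").partition(" ")
        if number:
            templates[int(number)] = text
    return templates

def example_sentences(sense_key, verb, sentidx, sents):
    """Return filled-in example sentences, or None if the sense is not listed.

    None tells the caller to fall back to the generic sentence frames.
    """
    numbers = sentidx.get(sense_key)
    if numbers is None:
        return None
    # str.replace avoids %-formatting surprises if a template has other % signs.
    return [sents[n].replace("%s", verb) for n in numbers]

# Made-up file contents in the documented layouts.
sentidx = load_sentidx(["made_up_key%2:00:00:: 1,2\n"])
sents = load_sents(["1 They %s the idea\n", "2 Did he %s it yesterday?\n"])
```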

Information in the data.pos and index.pos files represents all of the
word senses and synsets in the WordNet database. The word, lex_id, and
lex_filenum fields together uniquely identify each word sense in
WordNet. These can be encoded in a sense_key as described in senseidx(5).
Each synset in the database can be uniquely identified by combining the
synset_offset for the synset with a code for the syntactic category
(since it is possible for synsets in different data.pos files to have
the same synset_offset).
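
One way to form such a combined identifier is sketched below in Python.
The function name synset_id and the letter-plus-offset string layout are
assumptions made for this example, not a format defined by WordNet. The
sketch also treats the ss_type s (adjective satellite) as belonging to
the adjective file.

```python
def synset_id(ss_type, synset_offset):
    """Combine a syntactic category code with a synset_offset.

    Satellite synsets (ss_type "s") are folded into category "a" so that
    the identifier names the data file the offset refers to.
    """
    pos = "a" if ss_type == "s" else ss_type
    if pos not in {"n", "v", "a", "r"}:
        raise ValueError(f"unknown syntactic category: {ss_type!r}")
    return f"{pos}{synset_offset:08d}"
```

With this scheme the same offset in data.noun and data.verb yields two
distinct identifiers.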

The WordNet system provides both command line and window-based browser
interfaces to the database. Both interfaces utilize a common library
of search and morphology code. The source code for the library and
interfaces is included in the WordNet package. See wnintro(3) for an
overview of the WordNet source code.

WNHOME          Base directory for WordNet. Default is
                /usr/local/WordNet-3.0.

WNSEARCHDIR     Directory in which the WordNet database has been
                installed. Default is WNHOME/dict.
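
The two defaults can be combined as in the following Python sketch. The
function name wordnet_search_dir is hypothetical, and the precedence
shown (WNSEARCHDIR first, then WNHOME/dict) is an assumption read off the
descriptions above rather than taken from the library source.

```python
import os
import posixpath

def wordnet_search_dir(environ=None):
    """Resolve the database directory from WNSEARCHDIR and WNHOME."""
    env = os.environ if environ is None else environ
    home = env.get("WNHOME") or "/usr/local/WordNet-3.0"
    # posixpath keeps the Unix-style defaults intact on any platform.
    return env.get("WNSEARCHDIR") or posixpath.join(home, "dict")
```

Passing a plain dict instead of os.environ makes the resolution easy to
try out without changing the real environment.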

HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
                Base directory for WordNet. Default is
                C:\Program Files\WordNet\3.0.

index.pos       database index files

data.pos        database data files

*.vrb           files of sentences illustrating the use of verbs

pos.exc         morphology exception lists

grind(1), wn(1), wnb(1), wnintro(3), binsrch(3), wnintro(5),
cntlist(5), lexnames(5), senseidx(5), wninput(5), morphy(7),
wngloss(7), wngroups(7), wnstats(7).

WordNet 3.0 Dec 2006 WNDB(5)