WNDB(5)                      WordNet™ File Formats                     WNDB(5)

NAME

       index.noun, data.noun, index.verb, data.verb, index.adj, data.adj,
       index.adv, data.adv - WordNet database files

       noun.exc, verb.exc, adj.exc, adv.exc - morphology exception lists

       sentidx.vrb, sents.vrb - files used by search code to display sentences
       illustrating the use of some specific verbs

DESCRIPTION

       For each syntactic category, two files are needed to represent the
       contents of the WordNet database: index.pos and data.pos, where pos is
       one of noun, verb, adj, or adv. The other auxiliary files are used by
       the WordNet library's searching functions and are needed to run the
       various WordNet browsers.

       Each index file is an alphabetized list of all the words found in
       WordNet in the corresponding part of speech. On each line, following
       the word, is a list of byte offsets (synset_offsets) in the
       corresponding data file, one for each synset containing the word.
       Words in the index file are in lower case only, regardless of how they
       were entered in the lexicographer files. This folds various
       orthographic representations of the word into one line, enabling
       database searches to be case insensitive. See wninput(5) for a
       detailed description of the lexicographer files.

       A data file for a syntactic category contains information
       corresponding to the synsets that were specified in the lexicographer
       files, with relational pointers resolved to synset_offsets. Each line
       corresponds to a synset. Pointers are followed and hierarchies
       traversed by moving from one synset to another via the synset_offsets.

       The exception list files, pos.exc, are used to help the morphological
       processor find base forms from irregular inflections.

       The files sentidx.vrb and sents.vrb contain sentences illustrating the
       use of specific senses of some verbs. These files are used by the
       searching software in response to a request for verb sentence frames.
       Generic sentence frames are displayed when an illustrative sentence is
       not present.

       The various database files are in ASCII formats that are easily read
       by both humans and machines. All fields, unless otherwise noted, are
       separated by one space character, and all lines are terminated by a
       newline character. Fields enclosed in italicized square brackets may
       not be present.

       See wngloss(7) for a glossary of WordNet terminology and a discussion
       of the database's content and logical organization.

   Index File Format
       Each index file begins with several lines containing a copyright
       notice, version number, and license agreement. These lines all begin
       with two spaces and the line number so they do not interfere with the
       binary search algorithm that is used to look up entries in the index
       files. All other lines are in the following format. In the field
       descriptions, number always refers to a decimal integer unless
       otherwise defined.

       lemma  pos  synset_cnt  p_cnt  [ptr_symbol...]  sense_cnt  tagsense_cnt  synset_offset  [synset_offset...]

       lemma          Lower case ASCII text of word or collocation.
                      Collocations are formed by joining individual words
                      with an underscore (_) character.

       pos            Syntactic category: n for noun files, v for verb files,
                      a for adjective files, r for adverb files.

       All remaining fields are with respect to senses of lemma in pos.

       synset_cnt     Number of synsets that lemma is in. This is the number
                      of senses of the word in WordNet. See Sense Numbers
                      below for a discussion of how sense numbers are
                      assigned and the order of synset_offsets in the index
                      files.

       p_cnt          Number of different pointers that lemma has in all
                      synsets containing it.

       ptr_symbol     A space separated list of p_cnt different types of
                      pointers that lemma has in all synsets containing it.
                      See wninput(5) for a list of pointer_symbols. If all
                      senses of lemma have no pointers, this field is omitted
                      and p_cnt is 0.

       sense_cnt      Same as synset_cnt above. This is redundant, but the
                      field was preserved for compatibility reasons.

       tagsense_cnt   Number of senses of lemma that are ranked according to
                      their frequency of occurrence in semantic concordance
                      texts.

       synset_offset  Byte offset in data.pos file of a synset containing
                      lemma. Each synset_offset in the list corresponds to a
                      different sense of lemma in WordNet. synset_offset is
                      an 8 digit, zero-filled decimal integer that can be
                      used with fseek(3) to read a synset from the data file.
                      When passed to read_synset(3) along with the syntactic
                      category, a data structure containing the parsed synset
                      is returned.

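       The index line layout and the use of synset_offset as a seek position
       can be sketched in a few lines of Python. This is an illustrative
       sketch, not the WordNet library's own code: the function names
       parse_index_line and read_synset_line are hypothetical, the sample
       entry is made up, and an in-memory buffer stands in for a data.pos
       file opened in binary mode.

```python
import io

def parse_index_line(line):
    """Split one index.pos entry into named fields; header lines give None."""
    if line.startswith("  "):  # copyright and license lines start with two spaces
        return None
    tokens = line.split()
    lemma, pos = tokens[0], tokens[1]
    synset_cnt, p_cnt = int(tokens[2]), int(tokens[3])
    ptr_symbols = tokens[4:4 + p_cnt]  # empty list when p_cnt is 0
    rest = tokens[4 + p_cnt:]
    return {
        "lemma": lemma,
        "pos": pos,
        "synset_cnt": synset_cnt,
        "ptr_symbols": ptr_symbols,
        "sense_cnt": int(rest[0]),
        "tagsense_cnt": int(rest[1]),
        "synset_offsets": [int(tok) for tok in rest[2:2 + synset_cnt]],
    }

def read_synset_line(data_file, synset_offset):
    """Seek to a byte offset in a binary-mode data file and return that line."""
    data_file.seek(synset_offset)
    return data_file.readline().decode("ascii").rstrip("\n")

# Made-up entry for illustration; not copied from a real index file.
entry = parse_index_line("example n 2 2 @ ~ 2 1 00000030 00000075")
print(entry["synset_offsets"])

# Stand-in for an open data.pos file: the second line starts at byte 11.
fake_data = io.BytesIO(b"first line\nsecond line\n")
print(read_synset_line(fake_data, 11))
```

       Opening the data file in binary mode keeps the seek position in
       bytes, which is what synset_offset counts.
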
   Data File Format
       Each data file begins with several lines containing a copyright
       notice, version number, and license agreement. These lines all begin
       with two spaces and the line number. All other lines are in the
       following format. Integer fields are of fixed length, and are
       zero-filled.

       synset_offset  lex_filenum  ss_type  w_cnt  word  lex_id  [word  lex_id...]  p_cnt  [ptr...]  [frames...]  |  gloss

       synset_offset  Current byte offset in the file represented as an 8
                      digit decimal integer.

       lex_filenum    Two digit decimal integer corresponding to the
                      lexicographer file name containing the synset. See
                      lexnames(5) for the list of filenames and their
                      corresponding numbers.

       ss_type        One character code indicating the synset type:

                      n    NOUN
                      v    VERB
                      a    ADJECTIVE
                      s    ADJECTIVE SATELLITE
                      r    ADVERB

       w_cnt          Two digit hexadecimal integer indicating the number of
                      words in the synset.

       word           ASCII form of a word as entered in the synset by the
                      lexicographer, with spaces replaced by underscore
                      characters (_). The text of the word is case
                      sensitive, in contrast to its form in the corresponding
                      index.pos file, which contains only lower-case forms.
                      In data.adj, a word is followed by a syntactic marker
                      if one was specified in the lexicographer file. A
                      syntactic marker is appended, in parentheses, onto word
                      without any intervening spaces. See wninput(5) for a
                      list of the syntactic markers for adjectives.

       lex_id         One digit hexadecimal integer that, when appended onto
                      lemma, uniquely identifies a sense within a
                      lexicographer file. lex_id numbers usually start with
                      0, and are incremented as additional senses of the word
                      are added to the same file, although there is no
                      requirement that the numbers be consecutive or begin
                      with 0. Note that a value of 0 is the default, and
                      therefore is not present in lexicographer files.

       p_cnt          Three digit decimal integer indicating the number of
                      pointers from this synset to other synsets. If p_cnt is
                      000 the synset has no pointers.

       ptr            A pointer from this synset to another. ptr is of the
                      form:

                      pointer_symbol  synset_offset  pos  source/target

                      where synset_offset is the byte offset of the target
                      synset in the data file corresponding to pos.

                      The source/target field distinguishes lexical and
                      semantic pointers. It is a four byte field containing
                      two two-digit hexadecimal integers. The first two
                      digits indicate the word number in the current (source)
                      synset; the last two digits indicate the word number in
                      the target synset. A value of 0000 means that
                      pointer_symbol represents a semantic relation between
                      the current (source) synset and the target synset
                      indicated by synset_offset.

                      A lexical relation between two words in different
                      synsets is represented by non-zero values in the source
                      and target word numbers. The first and last two bytes
                      of this field indicate the word numbers in the source
                      and target synsets, respectively, between which the
                      relation holds. Word numbers are assigned to the word
                      fields in a synset, from left to right, beginning with
                      1.

                      See wninput(5) for a list of pointer_symbols, and
                      semantic and lexical pointer classifications.

       frames         In data.verb only, a list of numbers corresponding to
                      the generic verb sentence frames for words in the
                      synset. frames is of the form:

                      f_cnt   +   f_num  w_num  [ +   f_num  w_num...]

                      where f_cnt is a two digit decimal integer indicating
                      the number of generic frames listed, f_num is a two
                      digit decimal integer frame number, and w_num is a two
                      digit hexadecimal integer indicating the word in the
                      synset that the frame applies to. As with pointers, if
                      this number is 00, f_num applies to all words in the
                      synset. If non-zero, it is applicable only to the word
                      indicated. Word numbers are assigned as described for
                      pointers. Each f_num  w_num pair is preceded by a +.
                      See wninput(5) for the text of the generic sentence
                      frames.

       gloss          Each synset contains a gloss. A gloss is represented as
                      a vertical bar (|), followed by a text string that
                      continues until the end of the line. The gloss may
                      contain a definition, one or more example sentences, or
                      both.

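       The field descriptions above translate fairly directly into a parser.
       The following Python sketch is one possible reading of the format,
       not the WordNet library's read_synset(3): the name parse_data_line is
       hypothetical and the sample line uses made-up words and offsets. It
       shows the hexadecimal w_cnt, lex_id and w_num fields, the decimal
       p_cnt and f_cnt fields, and the split of source/target into two word
       numbers.

```python
def parse_data_line(line):
    """Parse one data.pos synset line into a dictionary of its fields."""
    fields, _bar, gloss = line.partition("|")  # gloss follows the first "|"
    tok = fields.split()
    w_cnt = int(tok[3], 16)  # two digit hexadecimal
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append((tok[i], int(tok[i + 1], 16)))  # (word, lex_id)
        i += 2
    p_cnt = int(tok[i])  # three digit decimal
    i += 1
    pointers = []
    for _ in range(p_cnt):
        symbol, target_offset, pos, src_tgt = tok[i:i + 4]
        pointers.append({
            "symbol": symbol,
            "offset": int(target_offset),
            "pos": pos,
            "source": int(src_tgt[:2], 16),  # 0 and 0 mean a semantic pointer
            "target": int(src_tgt[2:], 16),
        })
        i += 4
    frames = []
    if i < len(tok):  # frames appear only in data.verb
        f_cnt = int(tok[i])  # two digit decimal
        i += 1
        for _ in range(f_cnt):
            # tok[i] is the literal "+" that precedes each f_num w_num pair
            frames.append((int(tok[i + 1]), int(tok[i + 2], 16)))
            i += 3
    return {
        "synset_offset": int(tok[0]),
        "lex_filenum": int(tok[1]),
        "ss_type": tok[2],
        "words": words,
        "pointers": pointers,
        "frames": frames,
        "gloss": gloss.strip(),
    }

# Made-up synset line for illustration; not copied from a real data file.
sample = ("00001000 29 v 02 alpha 0 beta 1 002 @ 00002000 v 0000 "
          "+ 00003000 n 0201 01 + 02 00 | made-up gloss for illustration")
synset = parse_data_line(sample)
print(synset["words"], synset["frames"])
```

       In the sample, the second pointer has source/target 0201, so it is a
       lexical pointer from word 2 of this synset to word 1 of the target.
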
   Sense Numbers
       Senses in WordNet are generally ordered from most to least frequently
       used, with the most common sense numbered 1. Frequency of use is
       determined by the number of times a sense is tagged in the various
       semantic concordance texts. Senses that are not semantically tagged
       follow the ordered senses. The tagsense_cnt field for each entry in
       the index.pos files indicates how many of the senses in the list have
       been tagged.

       The cntlist(5) file provided with the database lists the number of
       times each sense is tagged in the semantic concordances. The data from
       cntlist is used by grind(1) to order the senses of each word. When the
       index.pos files are generated, the synset_offsets are output in sense
       number order, with sense 1 first in the list. Senses with the same
       number of semantic tags are assigned unique but consecutive sense
       numbers. The WordNet OVERVIEW search displays all senses of the
       specified word, in all syntactic categories, and indicates which of
       the senses are represented in the semantically tagged texts.

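       Because the synset_offsets are listed in sense number order and the
       untagged senses follow the tagged ones, a sense number is just a
       position in the list. The short Python sketch below makes that
       explicit. The function name number_senses is hypothetical, the
       offsets are made up, and treating the first tagsense_cnt entries as
       the tagged ones is an inference from the ordering described above.

```python
def number_senses(synset_offsets, tagsense_cnt):
    """Return (sense_number, synset_offset, is_tagged) for each listed offset."""
    return [
        (number, offset, number <= tagsense_cnt)
        for number, offset in enumerate(synset_offsets, start=1)
    ]

# Made-up offsets: three senses, of which the first two are tagged.
for sense in number_senses([30, 75, 120], tagsense_cnt=2):
    print(sense)
```
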
   Exception List File Format
       Exception lists are alphabetized lists of inflected forms of words and
       their base forms. The first field of each line is an inflected form,
       followed by a space separated list of one or more base forms of the
       word. There is one exception list file for each syntactic category.

       Note that the noun and verb exception lists were automatically
       generated from a machine-readable dictionary, and contain many words
       that are not in WordNet. Also, for many of the inflected forms, base
       forms could be easily derived using the standard rules of detachment
       programmed into Morphy (see morph(7)). These anomalies are allowed to
       remain in the exception list files, as they do no harm.

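       An exception list can be loaded into a lookup table with very little
       code. The Python sketch below assumes only the line layout described
       above; the name load_exceptions is hypothetical and the two sample
       lines are illustrative rather than quoted from a distributed file.

```python
def load_exceptions(lines):
    """Map each inflected form to its list of base forms."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        table[fields[0]] = fields[1:]
    return table

# Illustrative lines in the pos.exc layout: inflected form, then base forms.
exceptions = load_exceptions(["axes ax axis", "geese goose"])
print(exceptions.get("axes"))
```
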
   Verb Example Sentences
       For some verb senses, example sentences illustrating the use of the
       verb sense can be displayed. Each line of the file sentidx.vrb
       contains a sense_key followed by a space and a comma separated list of
       example sentence template numbers, in decimal. The file sents.vrb
       lists all of the example sentence templates. Each line begins with the
       template number followed by a space. The rest of the line is the text
       of a template example sentence, with %s used as a placeholder in the
       text for the verb. Both files are sorted alphabetically so that the
       sense_key and template sentence number can be used as indices, via
       binsrch(3), into the appropriate file.

       When a request for FRAMES is made, the WordNet search code looks for
       the sense in sentidx.vrb. If found, the sentence template(s) listed is
       retrieved from sents.vrb, and the %s is replaced with the verb. If the
       sense is not found, the applicable generic sentence frame(s) listed in
       frames is displayed.
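
       The lookup described above can be sketched in Python as follows. This
       is a simplified illustration, not the WordNet search code: the
       function names are hypothetical, the sense_key and the template text
       are made up, and plain dictionaries replace the binary search over
       the sorted files.

```python
def parse_sentidx_line(line):
    """Return (sense_key, [template numbers]) from one sentidx.vrb line."""
    sense_key, numbers = line.rstrip("\n").split(" ", 1)
    return sense_key, [int(n) for n in numbers.split(",")]

def parse_sents_line(line):
    """Return (template number, template text) from one sents.vrb line."""
    number, template = line.rstrip("\n").split(" ", 1)
    return int(number), template

def fill_template(template, verb):
    """Substitute the verb for the %s placeholder."""
    return template.replace("%s", verb)

# Made-up contents standing in for the two files.
index = dict([parse_sentidx_line("demo%2:00:00:: 1,2")])
templates = dict(parse_sents_line(l) for l in ["1 They %s the package",
                                               "2 Did he %s it?"])

for number in index.get("demo%2:00:00::", []):
    print(fill_template(templates[number], "demo"))
```

       str.replace is used rather than the % operator so that any other
       percent sign in a template is left untouched.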

NOTES

       Information in the data.pos and index.pos files represents all of the
       word senses and synsets in the WordNet database. The word, lex_id, and
       lex_filenum fields together uniquely identify each word sense in
       WordNet. These can be encoded in a sense_key as described in
       senseidx(5). Each synset in the database can be uniquely identified by
       combining the synset_offset for the synset with a code for the
       syntactic category (since it is possible for synsets in different
       data.pos files to have the same synset_offset).

       The WordNet system provides both command line and window-based browser
       interfaces to the database. Both interfaces utilize a common library
       of search and morphology code. The source code for the library and
       interfaces is included in the WordNet package. See wnintro(3) for an
       overview of the WordNet source code.

ENVIRONMENT VARIABLES (UNIX)

       WNHOME              Base directory for WordNet. Default is
                           /usr/local/WordNet-3.0.

       WNSEARCHDIR         Directory in which the WordNet database has been
                           installed. Default is WNHOME/dict.
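
       A program that reads the database files directly might resolve the
       database directory along these lines. This Python sketch only mirrors
       the defaults stated above; the function name wordnet_search_dir is
       hypothetical and the WordNet library may resolve these settings
       differently.

```python
import os

def wordnet_search_dir(environ=os.environ):
    """Return the database directory implied by WNSEARCHDIR and WNHOME."""
    home = environ.get("WNHOME", "/usr/local/WordNet-3.0")
    # WNSEARCHDIR, when set, overrides the WNHOME/dict default.
    return environ.get("WNSEARCHDIR", home + "/dict")

print(wordnet_search_dir())
```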

REGISTRY (WINDOWS)

       HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
                           Base directory for WordNet. Default is
                           C:\Program Files\WordNet\3.0.

FILES

       index.pos           database index files

       data.pos            database data files

       *.vrb               files of sentences illustrating the use of verbs

       pos.exc             morphology exception lists
