SENSEIDX(5) WordNet™ File Formats SENSEIDX(5)

NAME
index.sense, sense.idx - WordNet's sense index

DESCRIPTION
The WordNet sense index provides an alternate method for accessing
synsets and word senses in the WordNet database. It is useful to
applications that retrieve synsets or other information related to a
specific sense in WordNet, rather than all the senses of a word or
collocation. It can also be used with tools like grep and Perl to find
all senses of a word in one or more parts of speech. A specific
WordNet sense, encoded as a sense_key, can be used as an index into
this file to obtain its WordNet sense number, the database byte offset
of the synset containing the sense, and the number of times it has been
tagged in the semantic concordance texts.

Concatenating the lemma and lex_sense fields of a semantically tagged
word (represented in a <wf ... > attribute/value pair) in a semantic
concordance file, using % as the concatenation character, creates the
sense_key for that sense, which can in turn be used to search the sense
index file.
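
As a minimal sketch of the concatenation just described, the following
Python function joins a lemma and a lex_sense value with %. The function
name make_sense_key and the sample values are illustrative assumptions,
not part of the WordNet distribution.

```python
def make_sense_key(lemma: str, lex_sense: str) -> str:
    """Join a lemma and a lex_sense value with '%' to form a sense_key."""
    return f"{lemma}%{lex_sense}"


# Illustrative values, as they might appear on a tagged <wf ...> element.
key = make_sense_key("dog", "1:05:00::")
print(key)  # dog%1:05:00::
```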

A sense_key is the best way to represent a sense in semantic tagging or
other systems that refer to WordNet senses. sense_keys are independent
of WordNet sense numbers and synset_offsets, which vary between
versions of the database. Using the sense index and a sense_key, the
corresponding synset (via the synset_offset) and WordNet sense number
can easily be obtained. A mapping from noun sense_keys in WordNet 1.6
to corresponding 2.0 sense_keys is provided with version 2.0, and is
described in sensemap(5).

See wndb(5) for a thorough discussion of the WordNet database files.

File Format
The sense index file lists all of the senses in the WordNet database
with each line representing one sense. The file is in alphabetical
order, fields are separated by one space, and each line is terminated
with a newline character.

Each line is of the form:

    sense_key synset_offset sense_number tag_cnt

sense_key is an encoding of the word sense. Programs can construct a
sense key in this format and use it as a binary search key into the
sense index file. The format of a sense_key is described below.
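
The binary search can be sketched in Python as follows. This is an
illustration under stated assumptions, not the WordNet library's own
search routine: it assumes the index lines are already in memory and
sorted in plain character order, and the helper name lookup and the
sample lines (including their offsets and counts) are made up.

```python
import bisect


def lookup(lines, sense_key):
    """Binary-search sorted index lines for sense_key.

    Returns the four fields of the matching line as strings, or None.
    """
    probe = sense_key + " "          # the key is followed by one space
    i = bisect.bisect_left(lines, probe)
    if i < len(lines) and lines[i].startswith(probe):
        return lines[i].split()
    return None


# Made-up sample lines; sorted() stands in for the file's ordering.
sample = sorted([
    "dog%1:05:00:: 00002000 1 42",
    "dog%2:38:00:: 00003000 1 2",
    "cat%1:05:00:: 00001000 1 18",
])

print(lookup(sample, "dog%1:05:00::"))
print(lookup(sample, "dog%1:05:01::"))
```

Appending the space to the probe keeps a key from matching a longer key
that merely begins with the same characters.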

synset_offset is the byte offset, in the database "data" file
corresponding to the part of speech encoded in the sense_key, at which
the synset containing the sense is found. synset_offset is an 8 digit,
zero-filled decimal integer, and can be used with fseek(3) to read a
synset from the data file. Passing it to the WordNet library function
read_synset(), along with the syntactic category, returns a data
structure containing the parsed synset.
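
The fseek(3) approach can be sketched in Python with the equivalent
file.seek() call. The function name read_synset_line is hypothetical,
and choosing the data file that matches the part of speech (for example
data.noun, as described in wndb(5)) is left to the caller. The sketch
returns the raw line only and does not parse it.

```python
def read_synset_line(data_path, synset_offset):
    """Seek to synset_offset in a data file and return the raw line there.

    synset_offset may be the zero-filled string from the sense index
    (e.g. "00001740") or an int; int() accepts the leading zeros.
    """
    with open(data_path, "rb") as f:
        f.seek(int(synset_offset))
        raw = f.readline()
    return raw.decode("ascii", errors="replace").rstrip("\n")
```

The file is opened in binary mode so that the offset is interpreted as
a byte position, matching how the index records it.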

sense_number is a decimal integer indicating the sense number of the
word, within the part of speech encoded in sense_key, in the WordNet
database. See wndb(5) for information about how sense numbers are
assigned.

tag_cnt is a decimal integer giving the number of times the sense is
tagged in various semantic concordance texts. A tag_cnt of 0 indicates
that the sense has not been semantically tagged.
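
Splitting a line into the four fields described above might look like
the following sketch. The names SenseIndexEntry and parse_index_line are
assumptions made for illustration, and the sample line uses made-up
values.

```python
from typing import NamedTuple


class SenseIndexEntry(NamedTuple):
    sense_key: str
    synset_offset: int   # byte offset into the matching data file
    sense_number: int
    tag_cnt: int         # 0 means the sense has not been tagged


def parse_index_line(line: str) -> SenseIndexEntry:
    """Split one sense index line on single spaces and convert the numbers."""
    key, offset, number, count = line.rstrip("\n").split(" ")
    return SenseIndexEntry(key, int(offset), int(number), int(count))


entry = parse_index_line("dog%1:05:00:: 00002000 1 42\n")  # made-up values
print(entry.sense_key, entry.synset_offset, entry.sense_number, entry.tag_cnt)
```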

Sense Key Encoding
A sense_key is represented as:

    lemma%lex_sense

where lex_sense is encoded as:

    ss_type:lex_filenum:lex_id:head_word:head_id

lemma is the ASCII text of the word or collocation as found in the
WordNet database index file corresponding to the sense's part of speech
(pos). lemma is in lower case, and collocations are formed by joining
individual words with an underscore (_) character.

ss_type is a one digit decimal integer representing the synset type for
the sense. See Synset Type below for a listing of the numbers
corresponding to each synset type.

lex_filenum is a two digit decimal integer representing the name of the
lexicographer file containing the synset for the sense. See
lexnames(5) for the list of lexicographer file names and their
corresponding numbers.

lex_id is a two digit decimal integer that, when appended onto lemma,
uniquely identifies a sense within a lexicographer file. lex_id
numbers usually start with 00, and are incremented as additional senses
of the word are added to the same file, although there is no
requirement that the numbers be consecutive or begin with 00. Note that
a value of 00 is the default, and therefore is not present in
lexicographer files. Only non-default lex_id values must be explicitly
assigned in lexicographer files. See wninput(5) for information on the
format of lexicographer files.

head_word is only present if the sense is in an adjective satellite
synset. It is the lemma of the first word of the satellite's head
synset.

head_id is a two digit decimal integer that, when appended onto
head_word, uniquely identifies the sense of head_word within a
lexicographer file, as described for lex_id. There is a value in this
field only if head_word is present.

Synset Type
The synset type is encoded as follows:

    1    NOUN
    2    VERB
    3    ADJECTIVE
    4    ADVERB
    5    ADJECTIVE SATELLITE

NOTES
For non-satellite senses the head_word and head_id fields have no
values; however, the field separator character (:) is still present.
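
The encoding can be unpacked with a short Python sketch such as the one
below. The function name parse_sense_key, the dictionary layout, and the
sample keys are illustrative assumptions. The sketch also assumes that
neither lemma nor head_word contains a colon.

```python
# Synset type codes, as listed under Synset Type.
SS_TYPES = {
    1: "NOUN",
    2: "VERB",
    3: "ADJECTIVE",
    4: "ADVERB",
    5: "ADJECTIVE SATELLITE",
}


def parse_sense_key(sense_key):
    """Split a sense_key into its lemma and the five lex_sense fields."""
    lemma, lex_sense = sense_key.rsplit("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return {
        "lemma": lemma,
        "ss_type": SS_TYPES[int(ss_type)],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        # Empty for non-satellite senses; the ':' separators remain.
        "head_word": head_word or None,
        "head_id": int(head_id) if head_id else None,
    }


print(parse_sense_key("dog%1:05:00::"))          # non-satellite sense
print(parse_sense_key("soggy%5:00:00:wet:01"))   # made-up satellite key
```

Because the separators are always present, splitting lex_sense on the
colon yields exactly five fields for satellite and non-satellite senses
alike.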

ENVIRONMENT VARIABLES (UNIX)
WNHOME          Base directory for WordNet. Default is
                /usr/local/WordNet-3.0.

WNSEARCHDIR     Directory in which the WordNet database has been
                installed. Default is WNHOME/dict.
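
A program that wants to locate the database directory could mirror the
documented defaults as in the sketch below. The function name
wordnet_search_dir is hypothetical, and the code only reproduces the
defaults stated above; it is not the WordNet library's own lookup logic.

```python
import os


def wordnet_search_dir(env=None):
    """Return WNSEARCHDIR, falling back to WNHOME/dict and the stated default."""
    if env is None:
        env = os.environ
    wnhome = env.get("WNHOME", "/usr/local/WordNet-3.0")
    return env.get("WNSEARCHDIR", wnhome + "/dict")


# With neither variable set, the documented defaults apply.
print(wordnet_search_dir({}))  # /usr/local/WordNet-3.0/dict
```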

REGISTRY (WINDOWS)
HKEY_LOCAL_MACHINE\SOFTWARE\WordNet\3.0\WNHome
                Base directory for WordNet. Default is
                C:\Program Files\WordNet\3.0.

FILES
index.sense     sense index

SEE ALSO
binsrch(3), wnsearch(3), lexnames(5), wnintro(5), sensemap(5), wndb(5),
wninput(5).

WordNet 3.0 Dec 2006 SENSEIDX(5)