BT_SPLIT_NAMES(1)                    btparse                    BT_SPLIT_NAMES(1)

NAME
bt_split_names - splitting up BibTeX names and lists of names

SYNOPSIS
    bt_stringlist * bt_split_list (char * string,
                                   char * delim,
                                   char * filename,
                                   int    line,
                                   char * description);
    void bt_free_list (bt_stringlist *list);
    bt_name * bt_split_name (char * name,
                             char * filename,
                             int    line,
                             int    name_num);
    void bt_free_name (bt_name * name);

DESCRIPTION
When BibTeX files are used for their original purpose---bibliographic
entries describing scholarly publications---processing lists of names
(authors and editors mostly) becomes important. Although such
name-processing is outside the general-purpose database domain of most of
the btparse library, these splitting functions are provided as a
concession to reality: most BibTeX data files use the BibTeX
conventions for author names, and a library to process that data ought
to be capable of processing the names.

Name-processing comes in two stages: first, split up a list of names
into individual strings; second, split up each name into "parts"
(first, von, last, and jr). The first stage is actually quite general:
you can pick any delimiter (such as 'and', used for lists of names) and
call "bt_split_list()" to divide a string into the substrings that the
delimiter separates. "bt_split_name()", however, is quite specific to
four-part author names written using BibTeX conventions. (These
conventions are described informally in any BibTeX documentation; the
description you will find here is more formal and algorithmic---and thus
harder to understand.)

See bt_format_names for information on turning split-up names back into
strings in a variety of ways.

FUNCTIONS
bt_split_list()
    bt_stringlist * bt_split_list (char * string,
                                   char * delim,
                                   char * filename,
                                   int    line,
                                   char * description)

Splits "string" into substrings delimited by "delim" (a fixed
string). The splitting is done according to the rules used by
BibTeX for splitting up a list of names, in particular:

• delimiters at beginning or end of string are ignored

• delimiters must be surrounded by whitespace

• matching of delimiters is case insensitive

• delimiters at non-zero brace depth are ignored

For instance, if the delimiter is "and", then the string

    Candy and Apples AnD {Green Eggs and Ham}

splits into three substrings: "Candy", "Apples", and "{Green Eggs
and Ham}".
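The four rules above can be sketched in standalone C. This is only an illustration of the rules as stated here, not btparse's implementation: "sketch_list", "sketch_split()", "sketch_free()", and "delim_here()" are hypothetical names, and the sketch assumes single spaces around delimiters, does no trimming, and does not model the warning or the NULL entry produced by successive delimiters.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch; it is NOT bt_stringlist. */
typedef struct
{
    char  *buf;         /* private copy of the input, cut up with NULs */
    int    num_items;   /* number of substrings found */
    char **items;       /* pointers into buf, one per substring */
} sketch_list;

/* Return 1 if s starts with delim (ignoring case) followed by whitespace. */
static int delim_here (const char *s, const char *delim)
{
    size_t i;

    for (i = 0; delim[i] != '\0'; i++)
    {
        if (tolower ((unsigned char) s[i]) != tolower ((unsigned char) delim[i]))
            return 0;
    }
    return isspace ((unsigned char) s[i]) != 0;
}

/* Split a copy of string at every delimiter that sits at brace depth zero
   and has whitespace on both sides.  Returns NULL if allocation fails. */
sketch_list *sketch_split (const char *string, const char *delim)
{
    size_t len, dlen, i;
    int depth = 0;
    sketch_list *list;

    assert (string != NULL && delim != NULL && delim[0] != '\0');
    len = strlen (string);
    dlen = strlen (delim);

    list = malloc (sizeof *list);
    if (list == NULL)
        return NULL;
    list->buf = malloc (len + 1);
    /* each split consumes at least three characters, so this is enough */
    list->items = malloc ((len / 2 + 1) * sizeof *list->items);
    if (list->buf == NULL || list->items == NULL)
    {
        free (list->buf);
        free (list->items);
        free (list);
        return NULL;
    }
    memcpy (list->buf, string, len + 1);
    list->items[0] = list->buf;
    list->num_items = 1;

    for (i = 0; i < len; i++)
    {
        char c = list->buf[i];

        if (c == '{')
            depth++;
        else if (c == '}' && depth > 0)
            depth--;
        else if (depth == 0 && i > 0 && isspace ((unsigned char) c)
                 && delim_here (list->buf + i + 1, delim))
        {
            list->buf[i] = '\0';   /* terminate the current substring */
            i += dlen + 1;         /* i is now the whitespace after the delimiter */
            list->items[list->num_items++] = list->buf + i + 1;
        }
    }
    return list;
}

/* Release everything allocated by sketch_split(). */
void sketch_free (sketch_list *list)
{
    if (list == NULL)
        return;
    free (list->buf);
    free (list->items);
    free (list);
}
```

Because a delimiter only counts when whitespace precedes and follows it, an "and" at the very start or end of the string is never matched, which is how the first rule falls out of the second in this sketch.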

If there are extra delimiters at the extremities of the
string---say, an "and" at the beginning of the string---then they
are included in the first/last string; no warning is currently
printed, but this may change. Successive delimiters ("and and")
result in a warning and a NULL string being added to the list of
substrings. For instance, the string

    and Joe Q. Blow and and Smith, Jr., John

would split into three substrings: "and Joe Q. Blow", NULL (a null
pointer, not the four-character string), and "Smith, Jr., John".

(If these rules seem somewhat odd, don't blame me: I just
implemented BibTeX's observed behaviour and added warning messages
for one of the more obvious and easily-detected mistakes.)

The substrings are returned as a "bt_stringlist" structure:

    typedef struct
    {
       char *  string;
       int     num_items;
       char ** items;
    } bt_stringlist;

There is currently no elegant interface to this structure: you just
have to poke around in it yourself. The fields are:

"string"
    a copy of the "string" parameter passed to "bt_split_list()",
    but with NUL characters replacing the space after each
    substring. (This is safe because delimiters must be surrounded
    by whitespace, which means that each substring is followed by
    whitespace which is not part of the substring.) You probably
    shouldn't fiddle with "string"; it's just there so that
    "bt_free_list()" has something to "free()".

"num_items"
    the number of substrings found in the string passed to
    "bt_split_list()".

"items"
    an array of "num_items" pointers into "string". For instance,
    "items[1]" points to the second substring. Since "string" has
    been mangled with NUL characters, it is safe to treat
    "items[i]" as a regular C string.

"filename", "line", and "description" are all used for
generating warning messages. "filename" and "line" simply
describe where the string came from, and "description" is a
brief (one word) description of the substrings. For instance,
if you are splitting a list of names, supply "name" for
"description"---that way, warnings will refer to "name X"
rather than "substring X".
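Since successive delimiters put a NULL entry in the list, code that walks "items" should check each pointer before treating it as a C string. The following is a minimal sketch of that guard. It uses a local stand-in type, "stringlist_sketch", with the same three fields as "bt_stringlist" so that it compiles without the btparse header; both the stand-in and "count_present()" are hypothetical names, not part of btparse.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in with the same three fields as bt_stringlist
   (an assumption made so the sketch is self-contained). */
typedef struct
{
    char *  string;
    int     num_items;
    char ** items;
} stringlist_sketch;

/* Count the entries that are real substrings, skipping NULL entries
   such as the one left behind by "and and". */
int count_present (const stringlist_sketch *list)
{
    int i;
    int present = 0;

    assert (list != NULL);
    for (i = 0; i < list->num_items; i++)
    {
        if (list->items[i] != NULL)
            present++;
    }
    return present;
}
```

The same NULL check applies to any loop that prints or copies the substrings.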

bt_free_list()
    void bt_free_list (bt_stringlist *list)

Frees a "bt_stringlist" structure as returned by "bt_split_list()".
That is, it frees the copy of the string you passed to
"bt_split_list()", and then frees the structure itself.

bt_split_name()
    bt_name * bt_split_name (char * name,
                             char * filename,
                             int    line,
                             int    name_num)

Splits a single BibTeX-style author name into four parts: first,
von, last, and jr. This handles most, but not all, names written in
the style of the major Western European languages. (Alas!)

A name is split by first dividing into tokens; tokens are separated
by whitespace or commas at brace-level zero. Thus the name

    van der Graaf, Horace Q.

has five tokens, whereas the name

    {Foo, Bar, and Sons}

consists of a single token.
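The tokenizing rule can be checked with a small standalone counter. This is a sketch of the rule as stated above, not the library's tokenizer; "sketch_count_tokens()" is a hypothetical name, and unbalanced closing braces are simply ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count tokens in name, where tokens are separated by whitespace or
   commas that occur at brace depth zero. */
int sketch_count_tokens (const char *name)
{
    int depth = 0;       /* current brace nesting level */
    int in_token = 0;    /* 1 while scanning inside a token */
    int count = 0;
    const char *p;

    assert (name != NULL);
    for (p = name; *p != '\0'; p++)
    {
        int is_sep = (depth == 0)
                     && (*p == ',' || isspace ((unsigned char) *p));

        if (is_sep)
        {
            in_token = 0;            /* separator ends any current token */
            continue;
        }
        if (*p == '{')
            depth++;
        else if (*p == '}' && depth > 0)
            depth--;
        if (!in_token)
        {
            in_token = 1;            /* first character of a new token */
            count++;
        }
    }
    return count;
}
```

With the two names above, the counter gives five for "van der Graaf, Horace Q." and one for "{Foo, Bar, and Sons}", since every comma and space in the second name sits inside braces.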

How tokens are divided into parts depends on the form of the name.
If the name has no commas at brace-level zero (as in the second
example), then it is assumed to be in either "first last" or "first
von last" form. If there are no tokens that start with a lower-case
letter, then "first last" form is assumed: the final token is
the last name, and all other tokens form the first name.
Otherwise, the earliest contiguous sequence of tokens with initial
lower-case letters is taken as the `von' part; if this sequence
includes the final token, then a warning is printed and the final
token is forced to be the `last' part.
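For a name without commas, the paragraph above amounts to finding the first run of lower-case-initial tokens. Here is a minimal sketch that works on an already tokenized name and reports only how many tokens land in each part. "sketch_parts" and "sketch_partition()" are hypothetical names; the sketch assumes that tokens before the `von' run form the first name and tokens after it form the last name, and it records the warning case in a flag instead of printing anything.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Part sizes, in tokens, for a name with no commas. */
typedef struct
{
    int first_len;
    int von_len;
    int last_len;
    int warned;     /* 1 if the von run reached the final token */
} sketch_parts;

/* Does this token start with a lower-case letter? */
static int starts_lower (const char *token)
{
    return islower ((unsigned char) token[0]) != 0;
}

sketch_parts sketch_partition (const char *const *tokens, int n)
{
    sketch_parts p = { 0, 0, 0, 0 };
    int start = 0;   /* index of the first von token */
    int end;         /* index one past the last von token */

    assert (tokens != NULL && n > 0);

    while (start < n && !starts_lower (tokens[start]))
        start++;

    if (start == n)              /* "first last" form: no von part */
    {
        p.first_len = n - 1;
        p.last_len = 1;
        return p;
    }

    end = start;
    while (end < n && starts_lower (tokens[end]))
        end++;

    if (end == n)                /* von run swallowed the final token */
    {
        end = n - 1;             /* force the final token into `last' */
        p.warned = 1;
    }

    p.first_len = start;
    p.von_len = end - start;
    p.last_len = n - end;
    return p;
}
```

For the tokens of "Horace Q. van der Graaf" this yields two first-name tokens, two `von' tokens, and one last-name token.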

If a name has a single comma, then it is assumed to be in "von
last, first" form. A leading sequence of tokens with initial
lower-case letters, if any, forms the `von' part; tokens between
the `von' and the comma form the `last' part; tokens following the
comma form the `first' part. Again, if there are no tokens
following a leading sequence of lower-case tokens, a warning is
printed and the token immediately preceding the comma is taken to
be the `last' part.

If a name has more than two commas, a warning is printed and the
name is treated as though only the first two commas were present.

Finally, if a name has two commas, it is assumed to be in "von
last, jr, first" form. (This is the only way to represent a name
with a `jr' part.) The parsing of the name is the same as for a
one-comma name, except that tokens between the two commas are taken
to be the `jr' part.
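Which of the three forms applies depends only on the number of commas at brace-level zero. The sketch below classifies a name accordingly. The enum and both function names are hypothetical, and the "more than two commas" case is reported through a flag where the library would print a warning.

```c
#include <assert.h>
#include <stddef.h>

/* The three name forms described above (hypothetical enum). */
typedef enum
{
    SKETCH_NO_COMMA,     /* "first last" or "first von last" */
    SKETCH_ONE_COMMA,    /* "von last, first" */
    SKETCH_TWO_COMMAS    /* "von last, jr, first" */
} sketch_form;

/* Count commas that occur outside all braces. */
int sketch_count_commas (const char *name)
{
    int depth = 0;
    int commas = 0;
    const char *p;

    assert (name != NULL);
    for (p = name; *p != '\0'; p++)
    {
        if (*p == '{')
            depth++;
        else if (*p == '}' && depth > 0)
            depth--;
        else if (*p == ',' && depth == 0)
            commas++;
    }
    return commas;
}

/* Pick the form; *too_many is set to 1 when extra commas would be
   ignored, and to 0 otherwise.  too_many may be NULL. */
sketch_form sketch_name_form (const char *name, int *too_many)
{
    int commas = sketch_count_commas (name);

    if (too_many != NULL)
        *too_many = (commas > 2);
    if (commas == 0)
        return SKETCH_NO_COMMA;
    if (commas == 1)
        return SKETCH_ONE_COMMA;
    return SKETCH_TWO_COMMAS;    /* two commas, or more treated as two */
}
```

Under these rules "Smith, Jr., John" is a two-comma name, "van der Graaf, Horace Q." is a one-comma name, and "{Foo, Bar, and Sons}" has no commas that count.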

The one case not properly handled by BibTeX name conventions is a
name with a 'jr' part not separated from the last name by a comma;
for example:

    Henry Ford Jr.
    George Herbert Walker Bush III

Both of these would be incorrectly interpreted by both BibTeX and
bt_split_name(): the "Jr." or "III" token would be taken as the
last name, and the other tokens as a two- or four-part first
name. The workaround is to shoehorn the 'jr' into the last name:

    Henry {Ford Jr.}
    George Herbert Walker {Bush III}

but this will make it impossible to extract the last name on its
own, e.g. to generate "author-year" style citations. This design
flaw may be fixed in a future version of btparse.

The split-up name is returned as a "bt_name" structure:

    typedef struct
    {
       bt_stringlist * tokens;
       char ** parts[BT_MAX_NAMEPARTS];
       int     part_len[BT_MAX_NAMEPARTS];
    } bt_name;

Again, there's no nice interface to this structure; you'll just
have to access the fields individually. They are:

"tokens"
    the name, broken down into a flat list of tokens. See above
    for a description of the "bt_stringlist" structure.

"parts"
    an array of arrays of pointers into the token list. The major
    dimension of this beast is the "name part"; you should index
    this dimension using the "bt_namepart" enum. For instance,
    "parts[BTN_LAST]" is an array of pointers to the tokens
    comprising the last name; "parts[BTN_LAST][1]" is a "char *":
    the second token of the 'last' part; and
    "parts[BTN_LAST][1][0]" is the first character of the second
    token of the 'last' part.

"part_len"
    the length, in tokens, of each part. For instance, you might
    loop over all tokens in the 'first' part as follows (assuming
    "name" is a "bt_name *" returned by "bt_split_name()"):

        int i;

        for (i = 0; i < name->part_len[BTN_FIRST]; i++)
        {
           printf ("token %d of first name: %s\n",
                   i, name->parts[BTN_FIRST][i]);
        }
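The "parts" and "part_len" arrays are usually read together, for example to reassemble one part as a single string. The sketch below joins the tokens of one part with single spaces. It uses a local stand-in, "name_sketch", that holds only these two fields, and stand-in constants (SK_FIRST and friends) in place of the "bt_namepart" values, so that it compiles on its own; all of these names, and "sketch_join_part()", are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the bt_namepart values (order assumed for illustration). */
enum { SK_FIRST, SK_VON, SK_LAST, SK_JR, SK_NUM_PARTS };

/* Local stand-in holding the parts/part_len fields described above. */
typedef struct
{
    char ** parts[SK_NUM_PARTS];
    int     part_len[SK_NUM_PARTS];
} name_sketch;

/* Join the tokens of one part with single spaces into out.
   Returns the number of characters written (excluding the NUL),
   or -1 if out is too small. */
int sketch_join_part (const name_sketch *name, int part,
                      char *out, size_t out_size)
{
    size_t used = 0;
    int i;

    assert (name != NULL && out != NULL && out_size > 0);
    assert (part >= 0 && part < SK_NUM_PARTS);

    out[0] = '\0';
    for (i = 0; i < name->part_len[part]; i++)
    {
        const char *token = name->parts[part][i];
        size_t token_len = strlen (token);
        size_t need = token_len + (i > 0 ? 1 : 0);   /* token plus separator */

        if (used + need >= out_size)    /* no room left for the NUL */
            return -1;
        if (i > 0)
            out[used++] = ' ';
        memcpy (out + used, token, token_len + 1);
        used += token_len;
    }
    return (int) used;
}
```

A part with no tokens, such as the 'jr' part of most names, yields an empty string and a return value of zero.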

bt_free_name()
    void bt_free_name (bt_name * name)

Frees the "bt_name" structure created by "bt_split_name()"
(including the "bt_stringlist" structure inside the "bt_name").

SEE ALSO
btparse, bt_format_names

AUTHOR
Greg Ward <gward@python.net>

btparse, version 0.88                2022-01-21                BT_SPLIT_NAMES(1)