BT_SPLIT_NAMES(1)                   btparse                  BT_SPLIT_NAMES(1)

NAME

       bt_split_names - splitting up BibTeX names and lists of names

SYNOPSIS

          bt_stringlist * bt_split_list (char *   string,
                                         char *   delim,
                                         char *   filename,
                                         int      line,
                                         char *   description);
          void bt_free_list (bt_stringlist *list);
          bt_name * bt_split_name (char *  name,
                                   char *  filename,
                                   int     line,
                                   int     name_num);
          void bt_free_name (bt_name * name);

DESCRIPTION

       When BibTeX files are used for their original purpose (bibliographic
       entries describing scholarly publications), processing lists of names
       (mostly authors and editors) becomes important.  Although such name
       processing lies outside the general-purpose database domain of most of
       the btparse library, these splitting functions are provided as a
       concession to reality: most BibTeX data files use the BibTeX
       conventions for author names, and a library that processes such data
       ought to be able to process the names.

       Name processing happens in two stages: first, split a list of names
       into individual strings; second, split each name into "parts" (first,
       von, last, and jr).  The first stage is quite general:
       "bt_split_list()" divides any string into substrings around a
       delimiter of your choice (such as "and", which BibTeX uses for lists
       of names).  "bt_split_name()", however, is specific to four-part
       author names written using BibTeX conventions.  (These conventions are
       described informally in any BibTeX documentation; the description here
       is more formal and algorithmic, and thus harder to understand.)

       See bt_format_names for information on turning split-up names back into
       strings in a variety of ways.

FUNCTIONS

       bt_split_list()
              bt_stringlist * bt_split_list (char *   string,
                                             char *   delim,
                                             char *   filename,
                                             int      line,
                                             char *   description)

           Splits "string" into substrings delimited by "delim" (a fixed
           string).  The splitting follows the rules BibTeX uses for
           splitting up a list of names, in particular:

           •   a delimiter at the beginning or end of the string is not
               treated as a delimiter

           •   delimiters must be surrounded by whitespace

           •   matching of delimiters is case insensitive

           •   delimiters at non-zero brace depth are ignored

           For instance, if the delimiter is "and", then the string

              Candy and Apples AnD {Green Eggs and Ham}

           splits into three substrings: "Candy", "Apples", and "{Green Eggs
           and Ham}".

           If there are extra delimiters at the extremities of the string
           (say, an "and" at the beginning of the string), they are included
           in the first or last substring; no warning is currently printed,
           but this may change.  Successive delimiters ("and and") result in
           a warning and a NULL entry being added to the list of substrings.
           For instance, the string

              and Joe Q. Blow and and Smith, Jr., John

           would split into three substrings: "and Joe Q. Blow", NULL, and
           "Smith, Jr., John".

           (If these rules seem somewhat odd, don't blame me: I just
           implemented BibTeX's observed behaviour and added a warning
           message for one of the more obvious and easily detected mistakes.)
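           The rules above can be illustrated with a small stand-alone
           sketch.  It is not btparse's implementation: "delim_at()" and
           "split_sketch()" are hypothetical helpers written for this
           example, they print no warnings, and they handle runs of several
           whitespace characters only after a delimiter, not before it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of btparse: returns 1 if `delim` occurs
 * at `s` (ignoring case) with whitespace on both sides.  A delimiter at
 * the very start or end of the string therefore never matches. */
int delim_at(const char *start, const char *s, const char *delim)
{
    size_t i, n = strlen(delim);

    if (s == start || !isspace((unsigned char) s[-1]))
        return 0;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char) s[i]) != tolower((unsigned char) delim[i]))
            return 0;
    return isspace((unsigned char) s[n]) != 0;
}

/* Hypothetical sketch of the splitting rules: splits `buf` in place,
 * stores up to `max` substring pointers in `items`, and returns how many
 * were stored.  An empty substring (successive delimiters) becomes NULL. */
int split_sketch(char *buf, const char *delim, char **items, int max)
{
    size_t dlen;
    int depth = 0, n = 0;
    char *p, *item = buf;

    assert(buf != NULL && delim != NULL && items != NULL);
    dlen = strlen(delim);
    for (p = buf; *p != '\0'; p++)
    {
        if (*p == '{')
            depth++;
        else if (*p == '}' && depth > 0)
            depth--;
        else if (depth == 0 && delim_at(buf, p, delim))
        {
            if (n < max)
                items[n++] = (item < p) ? item : NULL;
            p[-1] = '\0';           /* end the substring before the delimiter */
            item = p + dlen;
            while (isspace((unsigned char) *item))
                item++;             /* skip whitespace after the delimiter */
            p = item - 1;           /* the loop's p++ resumes at item */
        }
    }
    if (n < max)
        items[n++] = item;          /* whatever follows the last delimiter */
    return n;
}
```

           Called on a writable copy of each of the two example strings
           above with the delimiter "and", the sketch stores three entries;
           for the second string the middle entry is NULL.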
       
           The substrings are returned as a "bt_stringlist" structure:

              typedef struct
              {
                 char *  string;
                 int     num_items;
                 char ** items;
              } bt_stringlist;

           There is currently no elegant interface to this structure: you just
           have to poke around in it yourself.  The fields are:

           "string"
               a copy of the "string" parameter passed to "bt_split_list()",
               but with a NUL character replacing the whitespace character
               after each substring.  (This is safe because delimiters must
               be surrounded by whitespace, so each substring is followed by
               whitespace that is not part of the substring.)  You probably
               shouldn't fiddle with "string"; it's just there so that
               "bt_free_list()" has something to "free()".

           "num_items"
               the number of substrings found in the string passed to
               "bt_split_list()".

           "items"
               an array of "num_items" pointers into "string".  For instance,
               "items[1]" points to the second substring.  Since "string" has
               been mangled with NUL characters, it is safe to treat
               "items[i]" as a regular C string, provided it is not NULL (see
               the successive-delimiter case above).

           The "filename", "line", and "description" parameters are all used
           for generating warning messages.  "filename" and "line" simply
           describe where the string came from, and "description" is a brief
           (one-word) description of the substrings.  For instance, if you
           are splitting a list of names, supply "name" for "description";
           that way, warnings will refer to "name X" rather than "substring
           X".
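           As a hedged illustration of poking around in the structure, the
           sketch below walks "items" and skips NULL entries before
           treating an item as a C string.  The typedef is a local copy of
           the one shown above so that the example compiles without
           btparse.h, and "longest_item()" is a hypothetical helper, not a
           btparse function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the structure documented above, so that the sketch
 * compiles without btparse.h. */
typedef struct
{
    char *  string;
    int     num_items;
    char ** items;
} bt_stringlist;

/* Hypothetical helper: returns the length of the longest substring in
 * the list.  NULL entries (empty substrings) are skipped, because they
 * cannot be passed to strlen(). */
size_t longest_item(const bt_stringlist *list)
{
    size_t longest = 0;
    int i;

    assert(list != NULL);
    for (i = 0; i < list->num_items; i++)
    {
        size_t len;

        if (list->items[i] == NULL)
            continue;               /* empty substring: nothing to measure */
        len = strlen(list->items[i]);
        if (len > longest)
            longest = len;
    }
    return longest;
}
```

           In real code the list would come from "bt_split_list()" and be
           released with "bt_free_list()" once you are done with it.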
       
       bt_free_list()
              void bt_free_list (bt_stringlist *list)

           Frees a "bt_stringlist" structure as returned by "bt_split_list()".
           That is, it frees the copy of the string you passed to
           "bt_split_list()", and then frees the structure itself.

       bt_split_name()
              bt_name * bt_split_name (char *  name,
                                       char *  filename,
                                       int     line,
                                       int     name_num)

           Splits a single BibTeX-style author name into four parts: first,
           von, last, and jr.  This handles most, but not quite all, names
           written in the style of the major Western European languages.
           (Alas!)

           A name is split by first dividing it into tokens; tokens are
           separated by whitespace or commas at brace-level zero.  Thus the
           name

              van der Graaf, Horace Q.

           has five tokens, whereas the name

              {Foo, Bar, and Sons}

           consists of a single token.
       
           How tokens are divided into parts depends on the form of the name.
           If the name has no commas at brace-level zero (as in the second
           example), then it is assumed to be in either "first last" or "first
           von last" form.  If no token starts with a lower-case letter, then
           "first last" form is assumed: the final token is the last name,
           and all other tokens form the first name.  Otherwise, the earliest
           contiguous sequence of tokens with initial lower-case letters is
           taken as the `von' part; if this sequence includes the final
           token, then a warning is printed and the final token is forced to
           be the `last' part.

           If a name has a single comma, then it is assumed to be in "von
           last, first" form.  A leading sequence of tokens with initial
           lower-case letters, if any, forms the `von' part; tokens between
           the `von' part and the comma form the `last' part; tokens
           following the comma form the `first' part.  Again, if no tokens
           follow the leading sequence of lower-case tokens, a warning is
           printed and the token immediately preceding the comma is taken to
           be the `last' part.

           If a name has two commas, it is assumed to be in "von last, jr,
           first" form.  (This is the only way to represent a name with a
           `jr' part.)  The parsing of the name is the same as for a
           one-comma name, except that tokens between the two commas are
           taken to be the `jr' part.

           Finally, if a name has more than two commas, a warning is printed
           and the name is treated as though only the first two commas were
           present.
       
           The one case not properly handled by BibTeX name conventions is a
           name with a `jr' part not separated from the last name by a comma;
           for example:

              Henry Ford Jr.
              George Herbert Walker Bush III

           Both of these would be misinterpreted by both BibTeX and
           "bt_split_name()": the "Jr." or "III" token would be taken as the
           last name, and the other tokens as a two- or four-part first
           name.  The workaround is to shoehorn the `jr' part into the last
           name:

              Henry {Ford Jr.}
              George Herbert Walker {Bush III}

           but this makes it impossible to extract the last name on its
           own, e.g. to generate "author-year" style citations.  This design
           flaw may be fixed in a future version of btparse.

           The split-up name is returned as a "bt_name" structure:

              typedef struct
              {
                 bt_stringlist * tokens;
                 char ** parts[BT_MAX_NAMEPARTS];
                 int     part_len[BT_MAX_NAMEPARTS];
              } bt_name;

           Again, there's no nice interface to this structure; you'll just
           have to access the fields individually.  They are:

           "tokens"
               the name, broken down into a flat list of tokens.  See above
               for a description of the "bt_stringlist" structure.

           "parts"
               an array of arrays of pointers into the token list.  The major
               dimension of this beast is the "name part"; you should index
               this dimension using the "bt_namepart" enum.  For instance,
               "parts[BTN_LAST]" is an array of pointers to the tokens
               comprising the last name; "parts[BTN_LAST][1]" is a "char *":
               the second token of the `last' part; and
               "parts[BTN_LAST][1][0]" is the first character of the second
               token of the `last' part.

           "part_len"
               the length, in tokens, of each part.  For instance, you might
               loop over all tokens in the `first' part as follows (assuming
               "name" is a "bt_name *" returned by "bt_split_name()"):

                  int i;

                  for (i = 0; i < name->part_len[BTN_FIRST]; i++)
                  {
                     printf ("token %d of first name: %s\n",
                             i, name->parts[BTN_FIRST][i]);
                  }
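           Building on that loop, a part can be reassembled into a single
           string.  The sketch below is self-contained, so it declares
           local stand-ins for the btparse types; the order of the
           "bt_namepart" values and the value of "BT_MAX_NAMEPARTS" are
           assumptions made for this example, and "join_part()" is a
           hypothetical helper.  See bt_format_names for the library's own
           way of turning split-up names back into strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-ins so that the sketch compiles without btparse.h.  The
 * enum order and the value of BT_MAX_NAMEPARTS are assumptions; real
 * code should use the definitions from btparse.h. */
typedef enum { BTN_FIRST, BTN_VON, BTN_LAST, BTN_JR } bt_namepart;
#define BT_MAX_NAMEPARTS 4

typedef struct
{
    char *  string;
    int     num_items;
    char ** items;
} bt_stringlist;

typedef struct
{
    bt_stringlist * tokens;
    char ** parts[BT_MAX_NAMEPARTS];
    int     part_len[BT_MAX_NAMEPARTS];
} bt_name;

/* Hypothetical helper: joins the tokens of one name part with single
 * spaces into `out` (capacity `size` bytes) and returns the number of
 * characters written, not counting the terminating NUL.  Tokens that do
 * not fit are dropped. */
size_t join_part(const bt_name *name, bt_namepart part,
                 char *out, size_t size)
{
    size_t used = 0;
    int i;

    assert(name != NULL && out != NULL && size > 0);
    out[0] = '\0';
    for (i = 0; i < name->part_len[part]; i++)
    {
        const char *tok = name->parts[part][i];
        size_t len = strlen(tok);

        if (used + len + 2 > size)  /* room for a space, the token, a NUL? */
            break;
        if (i > 0)
            out[used++] = ' ';
        memcpy(out + used, tok, len);
        used += len;
        out[used] = '\0';
    }
    return used;
}
```

           For a name whose `von' part holds the tokens "van" and "der",
           the helper produces "van der"; for an empty part it produces an
           empty string.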
       
       bt_free_name()
              void bt_free_name (bt_name * name)

           Frees the "bt_name" structure created by "bt_split_name()"
           (including the "bt_stringlist" structure inside the "bt_name").

SEE ALSO

       btparse, bt_format_names

AUTHOR

       Greg Ward <gward@python.net>



btparse, version 0.88             2022-01-21                 BT_SPLIT_NAMES(1)