Text::BibTeX::Name(3)    User Contributed Perl Documentation    Text::BibTeX::Name(3)

NAME

Text::BibTeX::Name - interface to BibTeX-style author names

SYNOPSIS

    use Text::BibTeX::Name;

    $name = Text::BibTeX::Name->new();
    $name->split('J. Random Hacker');
    # or:
    $name = Text::BibTeX::Name->new('J. Random Hacker');

    @firstname_tokens = $name->part ('first');
    $lastname = join (' ', $name->part ('last'));

    $format = Text::BibTeX::NameFormat->new();
    # ...customize $format...
    $formatted = $name->format ($format);

DESCRIPTION

"Text::BibTeX::Name" provides an abstraction for BibTeX-style names and
some basic operations on them. A name, in the BibTeX world, consists
of a list of tokens which are divided amongst four parts: `first',
`von', `last', and `jr'.

Tokens are separated by whitespace or commas at brace-level zero. Thus
the name

    van der Graaf, Horace Q.

has five tokens, whereas the name

    {Foo, Bar, and Sons}

consists of a single token. Skip down to "EXAMPLES" for more examples,
or read on if you want to know the exact details of how names are split
into tokens and parts.

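The tokenizing rule above can be sketched in a few lines of plain Perl. This is only an illustration of the rule as stated here, not the C code the module actually uses; the function name tokenize_name is hypothetical, and the sketch assumes the input already contains only single internal spaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::BibTeX): split a name into tokens
# on spaces and commas, but only where the brace depth is zero.
sub tokenize_name {
    my ($name) = @_;
    my @tokens;
    my $current = '';
    my $depth   = 0;

    for my $ch (split //, $name) {
        if    ($ch eq '{') { $depth++ }
        elsif ($ch eq '}') { $depth-- if $depth > 0 }

        if ($depth == 0 && ($ch eq ' ' || $ch eq ',')) {
            # Separator at brace-level zero: close off the current token.
            push @tokens, $current if length $current;
            $current = '';
        }
        else {
            # Anything else, including braces, stays in the token.
            $current .= $ch;
        }
    }
    push @tokens, $current if length $current;
    return @tokens;
}

for my $name ('van der Graaf, Horace Q.', '{Foo, Bar, and Sons}') {
    my @tokens = tokenize_name($name);
    printf "%d token(s): %s\n", scalar @tokens, join(' | ', @tokens);
}
```

Run on the two names above, the sketch reports five tokens for the first and one for the second, matching the description.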
How tokens are divided into parts depends on the form of the name. If
the name has no commas at brace-level zero (as in the second example),
then it is assumed to be in either "first last" or "first von last"
form. If there are no tokens that start with a lower-case letter, then
"first last" form is assumed: the final token is the last name, and all
other tokens form the first name. Otherwise, the earliest contiguous
sequence of tokens with initial lower-case letters is taken as the
`von' part, with the tokens before it forming the `first' part and the
tokens after it the `last' part; if this sequence includes the final
token, then a warning is printed and the final token is forced to be
the `last' part.

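As a rough illustration of the comma-less rules just described, the following plain-Perl sketch divides an already tokenized name into parts. The function name split_no_comma and the warning text are hypothetical, and testing for an initial lower-case letter with /^[a-z]/ is a simplification (ASCII only, no special handling of braces or TeX control sequences); the real work is done by the module's C code.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::BibTeX): divide the tokens of a
# name without commas into first/von/last parts.
sub split_no_comma {
    my @tokens = @_;
    my %part = (first => [], von => [], last => [], jr => []);
    return \%part unless @tokens;

    # Find the earliest contiguous run of tokens with a lowercase initial.
    my ($start, $end);
    for my $i (0 .. $#tokens) {
        if ($tokens[$i] =~ /^[a-z]/) {
            $start = $i unless defined $start;
            $end   = $i;
        }
        elsif (defined $start) {
            last;    # the run has ended
        }
    }

    if (!defined $start) {
        # "first last" form: final token is the last name.
        $part{first} = [ @tokens[0 .. $#tokens - 1] ];
        $part{last}  = [ $tokens[-1] ];
        return \%part;
    }

    if ($end == $#tokens) {
        warn "no token follows the von part; forcing final token into last\n";
        $end--;
    }
    $part{first} = [ @tokens[0 .. $start - 1] ];
    $part{von}   = [ @tokens[$start .. $end] ];
    $part{last}  = [ @tokens[$end + 1 .. $#tokens] ];
    return \%part;
}

my $parts = split_no_comma(qw(Ludwig van Beethoven));
printf "%-5s => (%s)\n", $_, join(', ', @{ $parts->{$_} }) for qw(first von last);
```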
If a name has a single comma, then it is assumed to be in "von last,
first" form. A leading sequence of tokens with initial lower-case
letters, if any, forms the `von' part; tokens between the `von' and the
comma form the `last' part; tokens following the comma form the `first'
part. Again, if there are no tokens following a leading sequence of
lowercase tokens, a warning is printed and the token immediately
preceding the comma is taken to be the `last' part.

If a name has more than two commas, a warning is printed and the name
is treated as though only the first two commas were present.

Finally, if a name has two commas, it is assumed to be in "von last,
jr, first" form. (This is the only way to represent a name with a `jr'
part.) The parsing of the name is the same as for a one-comma name,
except that tokens between the two commas are taken to be the `jr'
part.

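Which of the three forms applies depends only on the number of commas at brace-level zero. A minimal plain-Perl sketch of that decision follows; the function name name_form and the warning text are hypothetical, and this is not how the module's C code is necessarily written.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::BibTeX): count the commas at
# brace-level zero and report which name form the rules above select.
sub name_form {
    my ($name) = @_;
    my ($depth, $commas) = (0, 0);

    for my $ch (split //, $name) {
        if    ($ch eq '{') { $depth++ }
        elsif ($ch eq '}') { $depth-- if $depth > 0 }
        elsif ($ch eq ',' && $depth == 0) { $commas++ }
    }

    if ($commas > 2) {
        # Treated as though only the first two commas were present.
        warn "more than two commas in '$name'\n";
        $commas = 2;
    }
    my @forms = ('first von last', 'von last, first', 'von last, jr, first');
    return $forms[$commas];
}

for my $name ('Ludwig van Beethoven', 'van Beethoven, Ludwig',
              'Doe, Jr., John', '{Foo, Bar, and Sons}') {
    printf "%-25s -> %s\n", $name, name_form($name);
}
```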
The C code that does the actual work of splitting up names takes a
shortcut and makes a few assumptions about whitespace. In particular,
there must be no leading whitespace, no trailing whitespace, no
consecutive whitespace characters in the string, and no whitespace
characters other than space. In other words, all whitespace must
consist of lone internal spaces.

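One way to satisfy these whitespace requirements is to clean up a name string before handing it over. The helper below is a hypothetical sketch, not part of the module, and it makes no claim about what cleanup the module may already perform on its own.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::BibTeX): reduce all whitespace
# in a name to lone internal spaces.
sub normalize_whitespace {
    my ($name) = @_;
    $name =~ s/\s+/ /g;     # tabs, newlines and runs of spaces become one space
    $name =~ s/^ | $//g;    # drop a leading or trailing space
    return $name;
}

my $clean = normalize_whitespace("  J.\tRandom \n Hacker ");
print "[$clean]\n";
```

The cleaned string can then be passed to "new" or "split" as usual.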
EXAMPLES

The strings "John Smith" and "Smith, John" are different
representations of the same name, so split into parts and tokens the
same way, namely as:

    first => ('John')
    von   => ()
    last  => ('Smith')
    jr    => ()

Note that every part is a list of tokens, even if there is only one
token in that part; empty parts get empty token lists. Every token is
just a string. Writing this example in actual code is simple:

    $name = Text::BibTeX::Name->new("John Smith");  # or "Smith, John"
    $name->part ('first');       # returns list ("John")
    $name->part ('last');        # returns list ("Smith")
    $name->part ('von');         # returns list ()
    $name->part ('jr');          # returns list ()

(We'll omit the empty parts in the rest of the examples: just assume
that any unmentioned part is an empty list.) If more than two tokens
are included and there's no comma, all but the final token go to the
first name: thus "John Q. Smith" splits into

    first => ("John", "Q.")
    last  => ("Smith")

and "J. R. R. Tolkien" into

    first => ("J.", "R.", "R.")
    last  => ("Tolkien")

The ambiguous name "Kevin Philips Bong" splits into

    first => ("Kevin", "Philips")
    last  => ("Bong")

which may or may not be the right thing, depending on the particular
person. There's no way to know though, so if this fellow's last name
is "Philips Bong" and not "Bong", the string representation of his name
must disambiguate. One possibility is "Philips Bong, Kevin" which
splits into

    first => ("Kevin")
    last  => ("Philips", "Bong")

Alternatively, "Kevin {Philips Bong}" takes advantage of the fact that
tokens are only split on whitespace at brace-level zero, and becomes

    first => ("Kevin")
    last  => ("{Philips Bong}")

which is fine if your names are destined to be processed by TeX, but
might be problematic in other contexts. Similarly, "St John-Mollusc,
Oliver" becomes

    first => ("Oliver")
    last  => ("St", "John-Mollusc")

which can also be written as "Oliver {St John-Mollusc}":

    first => ("Oliver")
    last  => ("{St John-Mollusc}")

Since tokens are separated purely by whitespace, hyphenated names will
work either way: both "Nigel Incubator-Jones" and "Incubator-Jones,
Nigel" come out as

    first => ("Nigel")
    last  => ("Incubator-Jones")

Multi-token last names with lowercase components -- the "von part" --
work fine: both "Ludwig van Beethoven" and "van Beethoven, Ludwig"
parse (correctly) into

    first => ("Ludwig")
    von   => ("van")
    last  => ("Beethoven")

This allows these European aristocratic names to sort properly, i.e.
van Beethoven under B rather than v. Speaking of aristocratic European
names, "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin" is
handled just fine, and splits into

    first => ("Charles", "Louis", "Xavier", "Joseph")
    von   => ("de", "la")
    last  => ("Vall{\'e}e", "Poussin")

so could be sorted under V rather than d. (Note that the sorting
algorithm in Text::BibTeX::BibSort is a slavish imitation of BibTeX
0.99, and therefore does the wrong thing with these names: the sort key
starts with the "von" part.)

However, capitalized "von parts" don't work so well: "R. J. Van de
Graaff" splits into

    first => ("R.", "J.", "Van")
    von   => ("de")
    last  => ("Graaff")

which is clearly wrong. This name should be represented as "Van de
Graaff, R. J.", which splits into

    first => ("R.", "J.")
    last  => ("Van", "de", "Graaff")

which is probably right. (This particular Van de Graaff was an
American, so he probably belongs under V -- which is where my (British)
dictionary puts him. Other Van de Graaffs' mileages may vary.)

Many names also include a suffix: "Jr.", "III", "fils", and so
forth. These are handled, but with some limitations. If there's a
comma before the suffix (the usual U.S. convention for "Jr."), then the
name should be in last, jr, first form, e.g. "Doe, Jr., John" comes out
(correctly) as

    first => ("John")
    last  => ("Doe")
    jr    => ("Jr.")

but "John Doe, Jr." is ambiguous and is parsed as

    first => ("Jr.")
    last  => ("John", "Doe")

(so don't do it that way). If there's no comma before the suffix --
the usual convention for Roman numerals, and occasionally seen with
"Jr." -- then you're stuck and have to make the suffix part of the last
name. Thus, "Gates III, William H." comes out

    first => ("William", "H.")
    last  => ("Gates", "III")

but "William H. Gates III" is ambiguous, and becomes

    first => ("William", "H.", "Gates")
    last  => ("III")

-- not what you want. Again, the curly-brace trick comes in handy, so
"William H. {Gates III}" splits into

    first => ("William", "H.")
    last  => ("{Gates III}")

There is no way to make a comma-less suffix the "jr" part. (This is an
unfortunate consequence of slavishly imitating BibTeX 0.99.)

Finally, names that aren't really names of people but rather are
organization or company names should be forced into a single token by
wrapping them in curly braces. For example, "Foo, Bar and Sons" should
be written "{Foo, Bar and Sons}", which will split as

    last => ("{Foo, Bar and Sons}")

Of course, if this is one name in a BibTeX "authors" or "editors" list,
this name has to be wrapped in braces anyway (because of the " and "),
but that's another story.

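If organization names arrive unbraced, the wrapping can be automated. The sketch below is a hypothetical convenience, not something the module provides: protect_name adds an outer brace pair unless a single pair already encloses the whole string, and both function names are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if one brace pair encloses the entire string.
sub is_fully_braced {
    my ($s) = @_;
    return 0 unless $s =~ /^\{/ && $s =~ /\}$/;
    my $depth = 0;
    my @chars = split //, $s;
    for my $i (0 .. $#chars) {
        $depth++ if $chars[$i] eq '{';
        $depth-- if $chars[$i] eq '}';
        # Reaching depth zero before the final character means the first
        # brace closed early, as in "{Foo} and {Bar}".
        return 0 if $depth == 0 && $i < $#chars;
    }
    return $depth == 0 ? 1 : 0;
}

# Hypothetical helper: force a name into a single token.
sub protect_name {
    my ($name) = @_;
    return is_fully_braced($name) ? $name : '{' . $name . '}';
}

print protect_name('Foo, Bar and Sons'), "\n";
print protect_name('{Foo, Bar and Sons}'), "\n";
```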
Putting a split-up name back together again in a flexible, customizable
way is the job of another module: see Text::BibTeX::NameFormat.

METHODS

new([ [OPTS,] NAME [, FILENAME, LINE, NAME_NUM]])
    Creates a new "Text::BibTeX::Name" object. If NAME is supplied, it
    must be a string containing a single name, and it will be passed
    to the "split" method for further processing. FILENAME, LINE, and
    NAME_NUM, if present, are all also passed to "split" to allow
    better error messages.

    If the first argument is a hash reference, it is used to define
    configuration values. At the moment the available values are:

    BINMODE
        Set the way Text::BibTeX deals with strings. By default it
        manages strings as bytes. You can set BINMODE to 'utf-8' to get
        NFC normalized UTF-8 strings and you can customise the
        normalization with the NORMALIZATION option.

            Text::BibTeX::Name->new(
                { binmode => 'utf-8', normalization => 'NFD' },
                "Alberto Simões");

split (NAME [, FILENAME, LINE, NAME_NUM])
    Splits NAME (a string containing a single name) into tokens and
    subsequently into the four parts of a BibTeX-style name (first,
    von, last, and jr). (Each part is a list of tokens, and tokens are
    separated by whitespace or commas at brace-depth zero. See above
    for full details on how a name is split into its component parts.)

    The token-lists that make up each part of the name are then stored
    in the "Text::BibTeX::Name" object for later retrieval or
    formatting with the "part" and "format" methods.

part (PARTNAME)
    Returns the list of tokens in part PARTNAME of a name previously
    split with "split". For example, suppose a "Text::BibTeX::Name"
    object is created and initialized like this:

        $name = Text::BibTeX::Name->new();
        # double-quoted, so \\ gives the literal backslash of \'e
        $name->split ("Charles Louis Xavier Joseph de la Vall{\\'e}e Poussin");

    Then this code:

        $name->part ('von');

    would return the list "('de','la')".

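To see all four parts of a name at once, "part" can be called in a loop. This sketch uses only the "new" and "part" methods documented here and assumes the Text::BibTeX distribution is installed; the variable names and output layout are arbitrary.

```perl
use strict;
use warnings;
use Text::BibTeX::Name;

my $name = Text::BibTeX::Name->new('Ludwig van Beethoven');

# Print each part as a token list; empty parts print as ().
for my $partname (qw(first von last jr)) {
    my @tokens = $name->part($partname);
    printf "%-5s => (%s)\n", $partname,
        join(', ', map { qq("$_") } @tokens);
}
```

The printed lists should agree with the split shown for this name in the examples above.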
format (FORMAT)
    Formats a name according to the specifications encoded in FORMAT,
    which should be a "Text::BibTeX::NameFormat" (or descendant)
    object. (In short, it must supply a method "apply" which takes a
    "Text::BibTeX::Name" object as its only argument.) Returns
    the formatted name as a string.

    See Text::BibTeX::NameFormat for full details on formatting names.

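A minimal round trip from string to parts and back might look like the following. It relies only on the calls shown in the synopsis and assumes the Text::BibTeX distribution is installed; the format object is left uncustomized, so the exact output depends on the defaults described in Text::BibTeX::NameFormat.

```perl
use strict;
use warnings;
use Text::BibTeX::Name;
use Text::BibTeX::NameFormat;

my $name   = Text::BibTeX::Name->new('Ludwig van Beethoven');
my $format = Text::BibTeX::NameFormat->new();   # default settings, no customization

# format() hands the name to the format object and returns a string.
my $formatted = $name->format($format);
print "$formatted\n";
```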
SEE ALSO

Text::BibTeX::Entry, Text::BibTeX::NameFormat, bt_split_names.

AUTHOR

Greg Ward <gward@python.net>

COPYRIGHT

Copyright (c) 1997-2000 by Gregory P. Ward. All rights reserved. This
file is part of the Text::BibTeX library. This library is free
software; you may redistribute it and/or modify it under the same terms
as Perl itself.

perl v5.36.0    2022-07-22    Text::BibTeX::Name(3)