BT_LANGUAGE(1)                      btparse                     BT_LANGUAGE(1)

NAME

       bt_language - the BibTeX data language, as recognized by btparse

SYNOPSIS

          # Lexical grammar, mode 1: top-level
          AT                    \@
          NEWLINE               \n
          COMMENT               \%~[\n]*\n
          WHITESPACE            [\ \r\t]+
          JUNK                  ~[\@\n\ \r\t]+

          # Lexical grammar, mode 2: in-entry
          NEWLINE               \n
          COMMENT               \%~[\n]*\n
          WHITESPACE            [\ \r\t]+
          NUMBER                [0-9]+
          NAME                  [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
          LBRACE                \{
          RBRACE                \}
          LPAREN                \(
          RPAREN                \)
          EQUALS                =
          HASH                  \#
          COMMA                 ,
          QUOTE                 \"

          # Lexical grammar, mode 3: strings
          # (very hairy -- see text)

          # Syntactic grammar:
          bibfile : ( entry )*

          entry : AT NAME body

          body : STRING                    # for comment entries
               | ENTRY_OPEN contents ENTRY_CLOSE

          contents : ( NAME | NUMBER ) COMMA fields   # for regular entries
                   | fields                # for macro definition entries
                   | value                 # for preamble entries

          fields : field { COMMA fields }
                 |

          field : NAME EQUALS value

          value : simple_value ( HASH simple_value )*

          simple_value : STRING
                       | NUMBER
                       | NAME

DESCRIPTION

       One of the problems with BibTeX is that there is no formal
       specification of the language.  This means that users exploring the
       arcane corners of the language are largely on their own, and
       programmers implementing their own parsers are completely on their
       own, except for observing the behaviour of the original
       implementation.

       Other parser implementors (Nelson Beebe of "bibclean" fame, in
       particular) have taken the trouble to explain the language accepted by
       their parsers, and in that spirit the following is presented.

       If you are unfamiliar with the arcana of regular and context-free
       languages, you will not have an easy time understanding this.  This is
       not an introduction to the BibTeX language; any LaTeX book would be
       more suitable for learning the data language itself.

LEXICAL GRAMMAR

       The lexical scanner has three distinct modes: top-level, in-entry, and
       string.  Roughly speaking, top-level is the initial mode; we enter
       in-entry mode on seeing an "@" at top-level; and on seeing the "}" or
       ")" that ends the entry, we return to top-level.  We enter string mode
       on seeing a double quote (") or a non-entry-delimiting "{" from
       in-entry mode.  Note that the lexical language is both non-regular
       (because braces must balance) and context-sensitive (because "{" can
       mean different things depending on its syntactic context).  That said,
       we will use regular expressions to describe the lexical elements,
       because they are the starting point used by the lexical scanner
       itself.  The rest of the lexical grammar will be informally explained
       in the text.

       From top-level, the following tokens are recognized according to the
       regular expressions on the right:

          AT                    \@
          NEWLINE               \n
          COMMENT               \%~[\n]*\n
          WHITESPACE            [\ \r\t]+
          JUNK                  ~[\@\n\ \r\t]+

       (Note that this is PCCTS regular expression syntax, which should be
       fairly familiar to users of other regex engines.  One oddity is that a
       character class is negated as "~[...]" rather than "[^...]".)

       On seeing an "at" token at top-level, we enter in-entry mode.
       Whitespace, junk, newlines, and comments are all skipped, with
       newlines and comments each incrementing a line counter.  (Junk is
       explicitly recognized to allow for "bibtex"'s "implicit comment"
       scheme.)
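
       As a rough illustration of the top-level rules above, the following
       Python sketch translates the PCCTS expressions into Python "re"
       syntax and skips input up to the first "@".  The names
       "TOP_LEVEL_TOKENS" and "scan_top_level" are invented for this example
       and are not part of btparse.  Python tries alternatives in order
       rather than preferring the longest match, so this only approximates
       the real scanner.

```python
import re

# Top-level token rules, translated from PCCTS syntax ("~[...]" becomes "[^...]").
# COMMENT is listed before JUNK so that a leading "%" is tried as a comment first.
TOP_LEVEL_TOKENS = [
    ("AT", r"@"),
    ("NEWLINE", r"\n"),
    ("COMMENT", r"%[^\n]*\n"),
    ("WHITESPACE", r"[ \r\t]+"),
    ("JUNK", r"[^@\n \r\t]+"),
]
TOP_LEVEL_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOP_LEVEL_TOKENS)
)


def scan_top_level(text):
    """Skip top-level input; return (offset of the first AT or None, line number)."""
    line = 1
    pos = 0
    while pos < len(text):
        # Every character belongs to some rule, so a match always exists here.
        match = TOP_LEVEL_RE.match(text, pos)
        kind = match.lastgroup
        if kind == "AT":
            return pos, line
        if kind in ("NEWLINE", "COMMENT"):
            line += 1  # both rules consume exactly one newline
        pos = match.end()
    return None, line
```

       Because a comment runs through the end of its line, an "@" inside a
       comment is skipped along with it and does not start an entry.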

       From in-entry mode, we recognize newline, comment, and whitespace
       identically to top-level mode.  In addition, the following tokens are
       recognized:

          NUMBER                [0-9]+
          NAME                  [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
          LBRACE                \{
          RBRACE                \}
          LPAREN                \(
          RPAREN                \)
          EQUALS                =
          HASH                  \#
          COMMA                 ,
          QUOTE                 \"

       At this point, the lexical scanner starts to sound suspiciously like a
       context-free grammar, rather than a collection of independent regular
       expressions.  However, it is necessary to keep this complexity in the
       scanner because certain characters ("{" and "(" in particular) have
       very different lexical meanings depending on the tokens that have
       preceded them in the input stream.

       In particular, "{" and "(" are treated as "entry openers" if they
       follow one "at" and one "name" token, unless the value of the "name"
       token is "comment".  (Note the switch from top-level to in-entry
       between the two tokens.)  In the @comment case, the delimiter is
       considered as starting a string, and we enter string mode.  Otherwise,
       the delimiter is saved, and when we see a corresponding "}" or ")" it
       is considered an "entry closer".  (Braces are balanced for free here
       because the string lexer takes care of counting brace-depth.)

       Anywhere else, "{" is considered as starting a string, and we enter
       string mode.  A double quote (") always starts a string, regardless of
       context.  The other tokens ("name", "number", "equals", "hash", and
       "comma") are recognized unconditionally.
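
       The context dependence described above can be sketched as a small
       Python function.  This is only an illustration: the function name, the
       token representation, and the returned labels are hypothetical, and
       the case-insensitive comparison with "comment" and the handling of "("
       outside the entry-opener position are assumptions of the sketch.

```python
def classify_open_delimiter(delim, prev_tokens):
    """Classify "{" or "(" given the tokens already seen in this entry.

    prev_tokens is a list of (token_type, text) pairs, starting with the AT
    token that began the entry.
    """
    follows_at_name = (
        len(prev_tokens) == 2
        and prev_tokens[0][0] == "AT"
        and prev_tokens[1][0] == "NAME"
    )
    if follows_at_name:
        # Assumption: the entry type is compared without regard to case.
        if prev_tokens[1][1].lower() == "comment":
            return "STRING_START"
        return "ENTRY_OPEN"
    if delim == "{":
        return "STRING_START"
    # The text only says that "{" starts a string elsewhere; reporting a
    # plain LPAREN token for "(" is an assumption of this sketch.
    return "LPAREN"
```

       For example, the "{" after "@article" is an entry opener, while the
       "{" after "title =" inside that entry starts a string.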

       Note that "name" is a catch-all token used for entry types, citation
       keys, field names, and macro names; because BibTeX has slightly
       different (largely undocumented) rules for these various elements, a
       bit of trickery is needed to make things work.  As a starting point,
       consider BibTeX's definition of what's allowed for an entry key: a
       sequence of any characters except

          " # % ' ( ) , = { }

       plus space.  There are several problems with this scheme.  First,
       without specifying the character set from which those "magic 10"
       characters are drawn, it's a bit hard to know just what is allowed.
       Second, allowing "@" characters could lead to confusing BibTeX syntax
       (it doesn't confuse BibTeX, but it might confuse a human reader).
       Finally, allowing certain characters that are special to TeX means that
       BibTeX can generate bogus TeX code: try putting a backslash ("\") or
       tilde ("~") in a citation key.  (This last exception is rather specific
       to the "generating (La)TeX code from a BibTeX database" application,
       but since that's the major application for BibTeX databases, it will
       presumably be the major application for btparse, at least initially.
       Thus, it makes sense to pay attention to this problem.)

       In btparse, then, a name is defined as any sequence of letters, digits,
       and the following characters:

          ! $ & * + - . / : ; < > ? [ ] ^ _ ` |

       This list was derived by removing BibTeX's "magic 10" and the space
       from the set of printable 7-bit ASCII characters (32-126), and then
       further removing "@", "\", and "~".  This means that btparse disallows
       some of the weirder entry keys that BibTeX would accept, such as
       "\foo@bar", but still allows a string with initial digits.  In fact,
       from the above definition it appears that btparse would accept a
       string of all digits as a "name"; this is not the case, though, as the
       lexical scanner recognizes such a digit string as a number first.
       There are two problems here: BibTeX entry keys may in fact be entirely
       numeric, and field names may not begin with a digit.  (Those are two
       of the not-so-obvious differences in BibTeX's handling of keys and
       field names.)  The tricks used to deal with these problems are
       implemented in the parser rather than the lexical scanner, so they are
       described in "SYNTACTIC GRAMMAR" below.
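
       The derivation of the name character set can be checked mechanically.
       The Python sketch below rebuilds the set from the description above
       and shows the number-before-name rule; all identifiers are invented
       for the example, and accepting letters in both cases is an assumption
       (the regular expression above lists only "a-z").

```python
import re

# BibTeX's "magic 10" characters, which may not appear in an entry key.
MAGIC_10 = set("\"#%'(),={}")

# Printable 7-bit ASCII, codes 32 through 126 inclusive.
PRINTABLE = {chr(code) for code in range(32, 127)}

# Remove the magic 10 and the space, then "@", "\" and "~".
NAME_CHARS = PRINTABLE - MAGIC_10 - {" "} - set("@\\~")


def name_punctuation():
    """Return the non-alphanumeric name characters in ASCII order."""
    return "".join(sorted(ch for ch in NAME_CHARS if not ch.isalnum()))


def token_type(text):
    """Classify text as NUMBER or NAME (or None); digit strings are NUMBERs."""
    if re.fullmatch(r"[0-9]+", text):
        return "NUMBER"
    if text and all(ch in NAME_CHARS for ch in text):
        return "NAME"
    return None
```

       Calling "name_punctuation()" yields the nineteen punctuation
       characters listed above, and "token_type" treats "1999" as a number
       but "2nd-ed" as a name.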

       The string lexer recognizes "lbrace", "rbrace", "lparen", and "rparen"
       tokens in order to count brace- or parenthesis-depth.  This is
       necessary so it knows when to accept a string delimited by braces or
       parentheses.  (Note that a parenthesis-delimited string is only allowed
       after @comment; this is not a normal BibTeX construct.)  In addition,
       it converts each non-space whitespace character (newline,
       carriage-return, and tab) to a single space.  (Sequences of whitespace
       are not collapsed; that's the domain of string post-processing, which
       is well removed from the scanner or parser.)  Finally, it accepts a
       double quote (") to delimit quote-delimited strings.  Apart from those
       restrictions, the string lexer accepts anything up to the
       end-of-string delimiter.
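
       A minimal Python sketch of the brace-counting and whitespace
       conversion just described, for brace-delimited strings only.  The
       function name and its return convention (contents without the outer
       braces, plus the index just past the closing brace) are choices made
       for this example, not btparse's interface.

```python
def scan_braced_string(text, start):
    """Scan a brace-delimited string beginning at text[start] == "{".

    Returns (contents, end), where contents excludes the outer braces and
    end is the index just past the closing brace.
    """
    if text[start] != "{":
        raise ValueError("expected '{' at start")
    depth = 1
    out = []
    for i in range(start + 1, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out), i + 1
        # Newline, carriage return, and tab each become one space;
        # runs of whitespace are deliberately not collapsed.
        out.append(" " if ch in "\n\r\t" else ch)
    raise ValueError("unbalanced braces")
```

       Nested braces are kept in the contents, and a tab or newline next to a
       space leaves two spaces in the result, since collapsing is left to
       post-processing.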

SYNTACTIC GRAMMAR

       (The language used to describe the grammar here is the extended
       Backus-Naur Form (EBNF) used by PCCTS.  Terminals are represented by
       uppercase strings, non-terminals by lowercase strings; terminal names
       are the same as those given in the lexical grammar above.  "( foo )*"
       means zero or more repetitions of the "foo" production, and "{ foo }"
       means an optional "foo".)

       A file is just a sequence of zero or more entries:

          bibfile : ( entry )*

       An entry is an at-sign, a name (the "entry type"), and the entry body:

          entry : AT NAME body

       A body is either a string (this alternative is only tried if the entry
       type is "comment") or the entry contents:

          body : STRING                    # for comment entries
               | ENTRY_OPEN contents ENTRY_CLOSE

       ("ENTRY_OPEN" and "ENTRY_CLOSE" are either "{" and "}" or "(" and ")",
       depending on what is seen in the input for a particular entry.)

       There are three possible productions for the "contents" non-terminal.
       Only one applies to any given entry, depending on the entry metatype
       (which in turn depends on the entry type).  Currently, btparse supports
       four entry metatypes: comment, preamble, macro definition, and regular.
       The first two correspond to @comment and @preamble entries; "macro
       definition" is for @string entries; and "regular" is for all other
       entry types.  (The library will be extended to handle @modify and
       @alias entry types, and corresponding "modify" and "alias" metatypes,
       when BibTeX 1.0 is released and the exact syntax is known.)  The
       "metatype" concept is necessary so that all entry types that aren't
       specifically recognized fall into the "regular" metatype.  It's also
       convenient not to have to "strcmp" the entry type all the time.

          contents : ( NAME | NUMBER ) COMMA fields     # for regular entries
                   | fields                # for macro definition entries
                   | value                 # for preamble entries

       Note that the entry key is not just a "NAME", but "( NAME | NUMBER )".
       This is necessary because BibTeX allows all-numeric entry keys, but
       btparse's lexical scanner recognizes such digit strings as "NUMBER"
       tokens.
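
       The mapping from entry type to metatype described above amounts to a
       lookup with a default.  The Python sketch below illustrates the idea;
       the "Metatype" enumeration and "metatype_for" function are
       hypothetical names, and the case-insensitive lookup is an assumption.

```python
from enum import Enum


class Metatype(Enum):
    COMMENT = "comment"
    PREAMBLE = "preamble"
    MACRODEF = "macro definition"
    REGULAR = "regular"


# Only these entry types are specifically recognized.
_SPECIAL_TYPES = {
    "comment": Metatype.COMMENT,
    "preamble": Metatype.PREAMBLE,
    "string": Metatype.MACRODEF,
}


def metatype_for(entry_type):
    """Map an entry type to its metatype; unrecognized types are regular."""
    # Assumption: entry types are matched without regard to case.
    return _SPECIAL_TYPES.get(entry_type.lower(), Metatype.REGULAR)
```

       Computing the metatype once per entry lets later code branch on a
       small enumeration instead of repeating string comparisons.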

       "fields" is a (possibly empty) comma-separated list of fields, with an
       optional single trailing comma:

          fields : field { COMMA fields }
                 |

       A "field" is a single "field = value" assignment:

          field : NAME EQUALS value

       Note that "NAME" here is a restricted version of the "name" token
       described in "LEXICAL GRAMMAR" above.  Any "name" token will be
       accepted by the parser, but it is immediately checked to ensure that it
       doesn't begin with a digit; if it does, an artificial syntax error is
       triggered.  (This is for compatibility with BibTeX, which doesn't allow
       field names to start with a digit.)

       A "value" is a series of simple values joined by "#" characters:

          value : simple_value ( HASH simple_value )*

       A simple value is a string, number, or name (for macro invocations):

          simple_value : STRING
                       | NUMBER
                       | NAME
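
       The "fields", "field", and "value" productions can be sketched as a
       small recursive-descent parser in Python.  This is an illustration
       under assumptions, not btparse's implementation: tokens are taken to
       be (type, text) pairs produced by some earlier scanning step, the
       function names are invented, and the recursion in "fields" is written
       as an equivalent loop.

```python
def parse_simple_value(tokens, i):
    """simple_value : STRING | NUMBER | NAME"""
    if i >= len(tokens) or tokens[i][0] not in ("STRING", "NUMBER", "NAME"):
        raise SyntaxError("expected STRING, NUMBER, or NAME")
    return tokens[i], i + 1


def parse_value(tokens, i):
    """value : simple_value ( HASH simple_value )*"""
    part, i = parse_simple_value(tokens, i)
    parts = [part]
    while i < len(tokens) and tokens[i][0] == "HASH":
        part, i = parse_simple_value(tokens, i + 1)
        parts.append(part)
    return parts, i


def parse_fields(tokens, i):
    """fields : field { COMMA fields } | (empty)"""
    fields = []
    while i < len(tokens) and tokens[i][0] == "NAME":
        name = tokens[i][1]
        # Field names, unlike other names, may not begin with a digit.
        if name[0].isdigit():
            raise SyntaxError("field name may not begin with a digit: " + name)
        if i + 1 >= len(tokens) or tokens[i + 1][0] != "EQUALS":
            raise SyntaxError("expected EQUALS after field name")
        value, i = parse_value(tokens, i + 2)
        fields.append((name, value))
        if i < len(tokens) and tokens[i][0] == "COMMA":
            i += 1  # a trailing comma is fine: the next "fields" may be empty
        else:
            break
    return fields, i
```

       Each function returns its result together with the index of the next
       unconsumed token, so the caller can go on to check for the entry
       closer.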

SEE ALSO

       btparse

AUTHOR

       Greg Ward <gward@python.net>



btparse, version 0.89             2023-07-21                    BT_LANGUAGE(1)