libinn_uwildmat(3)      InterNetNews Documentation      libinn_uwildmat(3)

NAME
uwildmat, uwildmat_simple, uwildmat_poison - Perform wildmat matching

SYNOPSIS
    #include <inn/libinn.h>

    bool uwildmat(const char *text, const char *pattern);

    bool uwildmat_simple(const char *text, const char *pattern);

    enum uwildmat uwildmat_poison(const char *text, const char *pattern);

DESCRIPTION
uwildmat compares text against the wildmat expression pattern,
returning true if and only if the expression matches the text. "@" has
no special meaning in pattern when passed to uwildmat. Both text and
pattern are assumed to be in the UTF-8 character encoding, although
malformed UTF-8 sequences are treated in a way that attempts to be
mostly compatible with single-octet character sets like ISO 8859-1.
(In other words, if you try to match ISO 8859-1 text with these
routines everything should work as expected unless the ISO 8859-1 text
contains valid UTF-8 sequences, which thankfully is somewhat rare.)

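To illustrate the fallback described above, the following sketch shows one
way a matcher could decide how many bytes make up the next character. It is
not INN's code: char_length is a hypothetical helper, and it does not reject
overlong encodings or surrogates. A lead byte that is not followed by the
expected continuation bytes is counted as a one-byte character, which is why
most ISO 8859-1 text would pass through such a routine unchanged.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the character starting at p.
 * Returns the length of a UTF-8 sequence when the lead byte and its
 * continuation bytes line up, or 1 otherwise, so that a stray
 * ISO 8859-1 octet is treated as a single character. */
size_t char_length(const unsigned char *p)
{
    size_t len, i;

    if (p[0] < 0x80)
        return 1;               /* ASCII, including the terminating NUL */
    else if ((p[0] & 0xE0) == 0xC0)
        len = 2;
    else if ((p[0] & 0xF0) == 0xE0)
        len = 3;
    else if ((p[0] & 0xF8) == 0xF0)
        len = 4;
    else
        return 1;               /* stray continuation or invalid lead byte */

    /* Every following byte must look like 10xxxxxx.  The loop stops at
     * the first byte that does not, so it never reads past the NUL. */
    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80)
            return 1;
    return len;
}
```
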
uwildmat_simple is identical to uwildmat except that neither "!" nor
"," has any special meaning and pattern is always treated as a single
pattern. This function exists solely to support legacy interfaces like
NNTP's XPAT command, and should be avoided when implementing new
features.

uwildmat_poison works similarly to uwildmat, except that "@" as the
first character of one of the patterns in the expression (see below)
"poisons" the match if it matches. uwildmat_poison returns
UWILDMAT_MATCH if the expression matches the text, UWILDMAT_FAIL if it
doesn't, and UWILDMAT_POISON if the expression doesn't match because a
poisoned pattern matched the text. These enumeration constants are
defined in the inn/libinn.h header.

WILDMAT EXPRESSIONS
A wildmat expression follows rules similar to those of shell filename
wildcards but with some additions and changes. A wildmat expression is
composed of one or more wildmat patterns separated by commas. Each
character in the wildmat pattern matches a literal occurrence of that
same character in the text, with the exception of the following
metacharacters:

?      Matches any single character (including a single UTF-8
       multibyte character, so "?" can match more than one byte).

*      Matches any sequence of zero or more characters.

\      Turns off any special meaning of the following character; the
       following character will match itself in the text. "\" will
       escape any character, including another backslash or a comma
       that otherwise would separate a pattern from the next pattern
       in an expression. Note that "\" is not special inside a
       character set (no metacharacters are).

[...]  A character set, which matches any single character that falls
       within that set. The presence of a character between the
       brackets adds that character to the set; for example, "[amv]"
       specifies the set containing the characters "a", "m", and "v".
       A range of characters may be specified using "-"; for example,
       "[0-5abc]" is equivalent to "[012345abc]". The order of
       characters is as defined in the UTF-8 character set, and if the
       start character of such a range falls after the ending
       character of the range in that ranking, the results of
       attempting a match with that pattern are undefined.

       In order to include a literal "]" character in the set, it must
       be the first character of the set (possibly following "^"); for
       example, "[]a]" matches either "]" or "a". To include a
       literal "-" character in the set, it must be either the first
       or the last character of the set. Backslashes have no special
       meaning inside a character set, nor do any other of the wildmat
       metacharacters.

[^...] A negated character set. Follows the same rules as a character
       set above, but matches any character not contained in the set.
       So, for example, "[^]-]" matches any character except "]" and
       "-".

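The character set rules above can be sketched as a small membership test.
This is an illustration only, not INN's implementation: set_matches is a
hypothetical helper, it is handed the text just after the opening "[", and
it works on single bytes, so it ignores UTF-8 multibyte characters. Ranges
whose start falls after their end simply match nothing here, whereas the
documented behavior is undefined.

```c
#include <stdbool.h>

/* Hypothetical helper: set points just past the opening "[" of a
 * character set.  Returns true if the byte c is matched by the set. */
bool set_matches(const char *set, unsigned char c)
{
    const unsigned char *p = (const unsigned char *) set;
    bool negated = false;
    bool found = false;

    if (*p == '^') {
        negated = true;
        p++;
    }
    /* A "]" in first position (after any "^") is a literal member. */
    if (*p == ']') {
        if (c == ']')
            found = true;
        p++;
    }
    while (*p != '\0' && *p != ']') {
        if (p[1] == '-' && p[2] != ']' && p[2] != '\0') {
            /* Range such as "0-5": compare against both endpoints. */
            if (c >= p[0] && c <= p[2])
                found = true;
            p += 3;
        } else {
            /* Single member; also covers a leading or trailing "-". */
            if (c == *p)
                found = true;
            p++;
        }
    }
    return negated ? !found : found;
}
```

For instance, set_matches("0-5abc]", '3') is true, and set_matches("^]-]", c)
is false only when c is "]" or "-".
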
In addition, "!" (and possibly "@") have special meaning as the first
character of a pattern; see below.

When matching a wildmat expression against some text, each
comma-separated pattern is matched in order from left to right. In
order to match, the pattern must match the whole text; in regular
expression terminology, it's implicitly anchored at both the beginning
and the end. For example, the pattern "a" matches only the text "a"; it
doesn't match "ab" or "ba" or even "aa". If none of the patterns
match, the whole expression doesn't match. Otherwise, whether the
expression matches is determined entirely by the rightmost matching
pattern; the expression matches the text if and only if the rightmost
matching pattern is not negated.

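Because a backslash can escape a comma, an expression cannot be split into
patterns with a plain search for ",". The sketch below shows one possible
way to find where the first pattern of an expression ends. pattern_length
is a hypothetical name, and the sketch deliberately ignores character sets,
so it should be read as an illustration of the escaping rule rather than as
the way INN splits expressions.

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of the first pattern in expr,
 * that is, the offset of the first comma not preceded by a backslash
 * (or the length of expr if there is no such comma). */
size_t pattern_length(const char *expr)
{
    size_t i = 0;

    while (expr[i] != '\0' && expr[i] != ',') {
        if (expr[i] == '\\' && expr[i + 1] != '\0')
            i++;                /* skip the escaped byte, even a comma */
        i++;
    }
    return i;
}
```
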
For example, consider the text "news.misc". The expression "*" matches
this text, of course, as does "comp.*,news.*" (because the second
pattern matches). "news.*,!news.misc" does not match this text because
both patterns match, meaning that the rightmost takes precedence, and
the rightmost matching pattern is negated. "news.*,!news.misc,*.misc"
does match this text, since the rightmost matching pattern is not
negated.

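The rightmost-match rule can be traced with a small self-contained sketch.
It is a simplification written for this page, not the libinn code:
match_one and expr_matches are hypothetical names, only "*" and "?" are
supported, matching is byte-oriented, and commas are never escaped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified single-pattern matcher: "*" and "?" only.  The pattern is
 * the first plen bytes of pat and must match the whole of text. */
static bool match_one(const char *text, const char *pat, size_t plen)
{
    if (plen == 0)
        return *text == '\0';
    if (*pat == '*') {
        /* Let "*" absorb zero or more bytes, then try the rest. */
        for (;; text++) {
            if (match_one(text, pat + 1, plen - 1))
                return true;
            if (*text == '\0')
                return false;
        }
    }
    if (*text == '\0')
        return false;
    if (*pat != '?' && *pat != *text)
        return false;
    return match_one(text + 1, pat + 1, plen - 1);
}

/* Evaluate a comma-separated expression from left to right; the
 * rightmost pattern that matches decides the result. */
bool expr_matches(const char *text, const char *expr)
{
    bool result = false;        /* no pattern has matched yet */
    const char *p = expr;

    for (;;) {
        size_t len = strcspn(p, ",");
        bool negated = (len > 0 && p[0] == '!');
        const char *body = negated ? p + 1 : p;
        size_t blen = negated ? len - 1 : len;

        if (match_one(text, body, blen))
            result = !negated;  /* a later match overrides earlier ones */
        if (p[len] == '\0')
            break;
        p += len + 1;
    }
    return result;
}
```

With this sketch, expr_matches("news.misc", "news.*,!news.misc") is false and
expr_matches("news.misc", "news.*,!news.misc,*.misc") is true, following the
examples above. The library call takes its arguments in the same order, text
first and expression second.
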
Note that the expression "!news.misc" can't match anything. Either the
pattern doesn't match, in which case no patterns match and the
expression doesn't match, or the pattern does match, in which case,
because it's negated, the expression doesn't match. "*,!news.misc", on
the other hand, is a useful expression that matches anything except
"news.misc".

"!" has significance only as the first character of a pattern; anywhere
else in the pattern, it matches a literal "!" in the text like any
other non-metacharacter.

If the uwildmat_poison interface is used, then "@" behaves the same as
"!" except that if an expression fails to match because the rightmost
matching pattern began with "@", UWILDMAT_POISON is returned instead of
UWILDMAT_FAIL.

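The three-way result can be sketched independently of the pattern matcher.
In the illustration below, each pattern of an expression has already been
tried against the text and only the final decision is shown. The type and
function names (verdict, outcome, decide) are hypothetical and are not part
of libinn; the real constants are the UWILDMAT_* values mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the three documented results. */
enum verdict { VERDICT_FAIL, VERDICT_MATCH, VERDICT_POISON };

/* One pattern of an expression after it has been tried on the text. */
struct outcome {
    char prefix;    /* '!' or '@' if the pattern began with one, else 0 */
    bool matched;   /* whether the rest of the pattern matched the text */
};

/* Walk the patterns left to right; the rightmost match decides. */
enum verdict decide(const struct outcome *o, size_t n)
{
    enum verdict v = VERDICT_FAIL;      /* nothing has matched so far */
    size_t i;

    for (i = 0; i < n; i++) {
        if (!o[i].matched)
            continue;           /* non-matching patterns change nothing */
        if (o[i].prefix == '@')
            v = VERDICT_POISON;
        else if (o[i].prefix == '!')
            v = VERDICT_FAIL;
        else
            v = VERDICT_MATCH;
    }
    return v;
}
```
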
If the uwildmat_simple interface is used, the matching rules are the
same as above except that none of "!", "@", or "," have any special
meaning at all and only match those literal characters.

BUGS
All of these functions internally convert the passed arguments to const
unsigned char pointers. The only reason why they take regular char
pointers instead of unsigned char is for the convenience of INN and
other callers that may not be using unsigned char everywhere they
should. In a future revision, the public interface should be changed
to just take unsigned char pointers.

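The reason for the conversion can be seen in a short sketch. On platforms
where plain char is signed, bytes in the range 0x80 to 0xFF read through a
char pointer are negative, so tests against UTF-8 lead-byte values or range
endpoints would misbehave. starts_non_ascii is a hypothetical helper used
only to show the cast; it is not part of libinn.

```c
#include <stdbool.h>

/* Hypothetical helper: does the first byte of s begin a non-ASCII
 * character? */
bool starts_non_ascii(const char *s)
{
    /* Reinterpret as unsigned so that 0x80..0xFF compare as 128..255
     * even where plain char is a signed type. */
    const unsigned char *p = (const unsigned char *) s;

    return *p >= 0x80;
}
```
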
HISTORY
Written by Rich $alz <rsalz@uunet.uu.net> in 1986, and posted to Usenet
several times since then, most notably in comp.sources.misc in March,
1991.

Lars Mathiesen <thorinn@diku.dk> enhanced the multi-asterisk failure
mode in early 1991.

Rich and Lars increased the efficiency of star patterns and reposted it
to comp.sources.misc in April, 1991.

Robert Elz <kre@munnari.oz.au> added minus sign and close bracket
handling in June, 1991.

Russ Allbery <eagle@eyrie.org> added support for comma-separated
patterns and the "!" and "@" metacharacters to the core wildmat
routines in July, 2000. He also added support for UTF-8 characters,
changed the default behavior to assume that both the text and the
pattern are in UTF-8, and largely rewrote this documentation to expand
and clarify the description of how a wildmat expression matches.

Please note that the interfaces to these functions are named uwildmat
and the like rather than wildmat to distinguish them from the wildmat
function provided by Rich $alz's original implementation. While this
code is heavily based on Rich's original code, it has substantial
differences, including the extension to support UTF-8 characters, and
has noticeable functionality changes. Any bugs present in it aren't
Rich's fault.

SEE ALSO
grep(1), fnmatch(3), regex(3), regexp(3).

INN 2.6.5                      2022-02-18                libinn_uwildmat(3)