uri_string(3)              Erlang Module Definition              uri_string(3)



NAME
       uri_string - URI processing functions.

DESCRIPTION
       This module contains functions for parsing and handling URIs (RFC 3986)
       and form-urlencoded query strings (HTML 5.2).

       Parsing and serializing non-UTF-8 form-urlencoded query strings are
       also supported (HTML 5.0).

       A URI is an identifier consisting of a sequence of characters matching
       the syntax rule named URI in RFC 3986.

       The generic URI syntax consists of a hierarchical sequence of
       components referred to as the scheme, authority, path, query, and
       fragment:

           URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
           hier-part   = "//" authority path-abempty
                       / path-absolute
                       / path-rootless
                       / path-empty
           scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
           authority   = [ userinfo "@" ] host [ ":" port ]
           userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )

           reserved    = gen-delims / sub-delims
           gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
           sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                       / "*" / "+" / "," / ";" / "="

           unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"



       The interpretation of a URI depends only on the characters used and not
       on how those characters are represented in a network protocol.

       The functions implemented by this module cover the following use cases:

         * Parsing URIs into their components and returning a map
           parse/1

         * Recomposing a map of URI components into a URI string
           recompose/1

         * Changing inbound binary and percent-encoding of URIs
           transcode/2

         * Transforming URIs into a normalized form
           normalize/1
           normalize/2

         * Composing form-urlencoded query strings from a list of key-value
           pairs
           compose_query/1
           compose_query/2

         * Dissecting form-urlencoded query strings into a list of key-value
           pairs
           dissect_query/1

         * Decoding percent-encoded triplets
           percent_decode/1

       There are four different encodings present during the handling of URIs:

         * Inbound binary encoding in binaries

         * Inbound percent-encoding in lists and binaries

         * Outbound binary encoding in binaries

         * Outbound percent-encoding in lists and binaries

       Functions with a uri_string() argument accept lists, binaries, and
       mixed lists (lists with binary elements) as input. All functions except
       transcode/2 expect input as lists of Unicode code points, UTF-8 encoded
       binaries, and UTF-8 percent-encoded URI parts ("%C3%B6" corresponds to
       the Unicode character "ö").

       Unless otherwise specified, the return value type and encoding are the
       same as the input type and encoding. That is, binary input returns
       binary output and list input returns list output, but mixed input
       also returns list output.

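       The rule above can be checked with a small sketch that parses the
       same URI given as a list, a binary, and a mixed list, and reports the
       type of the returned host component. The module name
       uri_types_example and the function host_type/1 are hypothetical names
       used only for illustration.

```erlang
-module(uri_types_example).
-export([host_type/1]).

%% Parses URI and reports the Erlang type of the host component.
%% Anything that is not a map with a host key (for example an error
%% tuple) is returned wrapped in {unexpected, ...}.
host_type(URI) ->
    case uri_string:parse(URI) of
        #{host := Host} when is_binary(Host) -> binary;
        #{host := Host} when is_list(Host) -> list;
        Other -> {unexpected, Other}
    end.
```

       Calling host_type/1 with "http://example.com/a" is expected to give
       list, and with <<"http://example.com/a">> to give binary.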
       In lists there is only percent-encoding. In binaries, however, both
       binary encoding and percent-encoding must be considered. transcode/2
       provides the means to convert between the supported encodings: it takes
       a uri_string() and a list of options specifying the inbound and
       outbound encodings.

       RFC 3986 does not mandate any specific character encoding; the encoding
       is usually defined by the protocol or surrounding text. This library
       makes the same assumption: binary encoding and percent-encoding are
       handled as one configuration unit and cannot be set to different
       values.

DATA TYPES
       error() = {error, atom(), term()}

              Error tuple indicating the type of error. Possible values of the
              second component:

                * invalid_character

                * invalid_encoding

                * invalid_input

                * invalid_map

                * invalid_percent_encoding

                * invalid_scheme

                * invalid_uri

                * invalid_utf8

                * missing_value

              The third component is a term providing additional information
              about the cause of the error.

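              A minimal sketch of handling the error() tuple when calling
              parse/1. The module uri_error_example, the function host/1, and
              the atom no_host are hypothetical and not part of uri_string;
              only the shape {error, Reason, Info} comes from the type above.

```erlang
-module(uri_error_example).
-export([host/1]).

%% Returns {ok, Host} if URI parses and has a host component,
%% {error, Reason} with the second component of error() if parsing
%% fails, and {error, no_host} (our own atom) if no host is present.
host(URI) ->
    case uri_string:parse(URI) of
        {error, Reason, _Info} -> {error, Reason};
        #{host := Host} -> {ok, Host};
        #{} -> {error, no_host}
    end.
```

              Matching on the three-element tuple first keeps the error case
              separate from the successful map result.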
       uri_map() =
           #{fragment => unicode:chardata(),
             host => unicode:chardata(),
             path => unicode:chardata(),
             port => integer() >= 0 | undefined,
             query => unicode:chardata(),
             scheme => unicode:chardata(),
             userinfo => unicode:chardata()}

              Map holding the main components of a URI.

       uri_string() = iodata()

              List of unicode codepoints, a UTF-8 encoded binary, or a mix of
              the two, representing an RFC 3986 compliant URI (percent-encoded
              form). A URI is a sequence of characters from a very limited
              set: the letters of the basic Latin alphabet, digits, and a few
              special characters.

EXPORTS
       allowed_characters() -> [{atom(), list()}]

              This is a utility function meant to be used in the shell for
              printing the allowed characters in each major URI component, and
              also in the most important character sets. Note that this
              function does not replace the ABNF rules defined by the
              standards; these character sets are derived directly from the
              aforementioned rules. For more information, see the Uniform
              Resource Identifiers chapter in stdlib's Users Guide.

       compose_query(QueryList) -> QueryString

              Types:

                 QueryList = [{unicode:chardata(), unicode:chardata() | true}]
                 QueryString = uri_string() | error()

              Composes a form-urlencoded QueryString based on a QueryList, a
              list of non-percent-encoded key-value pairs. Form-urlencoding is
              defined in section 4.10.21.6 of the HTML 5.2 specification and
              in section 4.10.22.6 of the HTML 5.0 specification for non-UTF-8
              encodings.

              See also the opposite operation dissect_query/1.

              Example:

              1> uri_string:compose_query([{"foo bar","1"},{"city","örebro"}]).
              "foo+bar=1&city=%C3%B6rebro"
              2> uri_string:compose_query([{<<"foo bar">>,<<"1">>},
              2> {<<"city">>,<<"örebro"/utf8>>}]).
              <<"foo+bar=1&city=%C3%B6rebro">>


       compose_query(QueryList, Options) -> QueryString

              Types:

                 QueryList = [{unicode:chardata(), unicode:chardata() | true}]
                 Options = [{encoding, atom()}]
                 QueryString = uri_string() | error()

              Same as compose_query/1 but with an additional Options parameter
              that controls the encoding ("charset") used by the encoding
              algorithm. There are two supported encodings: utf8 (or unicode)
              and latin1.

              Each character in the entry's name and value that cannot be
              expressed using the selected character encoding is replaced by
              a string consisting of a U+0026 AMPERSAND character (&), a "#"
              (U+0023) character, one or more ASCII digits representing the
              Unicode code point of the character in base ten, and finally a
              ";" (U+003B) character.

              Bytes outside the ranges 0x2A, 0x2D, 0x2E, 0x30 to 0x39, 0x41
              to 0x5A, 0x5F, and 0x61 to 0x7A are percent-encoded (U+0025
              PERCENT SIGN character (%) followed by uppercase ASCII hex
              digits representing the hexadecimal value of the byte).

              See also the opposite operation dissect_query/1.

              Example:

              1> uri_string:compose_query([{"foo bar","1"},{"city","örebro"}],
              1> [{encoding, latin1}]).
              "foo+bar=1&city=%F6rebro"
              2> uri_string:compose_query([{<<"foo bar">>,<<"1">>},
              2> {<<"city">>,<<"東京"/utf8>>}], [{encoding, latin1}]).
              <<"foo+bar=1&city=%26%2326481%3B%26%2320140%3B">>


       dissect_query(QueryString) -> QueryList

              Types:

                 QueryString = uri_string()
                 QueryList =
                     [{unicode:chardata(), unicode:chardata() | true}] |
                     error()

              Dissects an urlencoded QueryString and returns a QueryList, a
              list of non-percent-encoded key-value pairs. Form-urlencoding is
              defined in section 4.10.21.6 of the HTML 5.2 specification and
              in section 4.10.22.6 of the HTML 5.0 specification for non-UTF-8
              encodings.

              See also the opposite operation compose_query/1.

              Example:

              1> uri_string:dissect_query("foo+bar=1&city=%C3%B6rebro").
              [{"foo bar","1"},{"city","örebro"}]
              2> uri_string:dissect_query(<<"foo+bar=1&city=%26%2326481%3B%26%2320140%3B">>).
              [{<<"foo bar">>,<<"1">>},
               {<<"city">>,<<230,157,177,228,186,172>>}]


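              Because compose_query/1 and dissect_query/1 are opposite
              operations, a list of key-value pairs can be sent through both
              as a round trip. The sketch below assumes UTF-8 list input; the
              module query_roundtrip_example and the function roundtrip/1 are
              hypothetical names.

```erlang
-module(query_roundtrip_example).
-export([roundtrip/1]).

%% Composes QueryList into a form-urlencoded string and dissects that
%% string again. Returns {QueryString, DissectedList}, or the error
%% tuple if composing fails.
roundtrip(QueryList) ->
    case uri_string:compose_query(QueryList) of
        {error, _, _} = Error ->
            Error;
        QueryString ->
            {QueryString, uri_string:dissect_query(QueryString)}
    end.
```

              For the list [{"foo bar","1"},{"city","örebro"}] used in the
              examples above, the second element of the result should equal
              the input list.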
       normalize(URI) -> NormalizedURI

              Types:

                 URI = uri_string() | uri_map()
                 NormalizedURI = uri_string() | error()

              Transforms a URI into a normalized form using Syntax-Based
              Normalization as defined by RFC 3986.

              This function implements case normalization, percent-encoding
              normalization, path segment normalization, and scheme-based
              normalization for HTTP(S) with basic support for FTP, SSH, SFTP,
              and TFTP.

              Example:

              1> uri_string:normalize("/a/b/c/./../../g").
              "/a/g"
              2> uri_string:normalize(<<"mid/content=5/../6">>).
              <<"mid/6">>
              3> uri_string:normalize("http://localhost:80").
              "http://localhost/"
              4> uri_string:normalize(#{scheme => "http",port => 80,path => "/a/b/c/./../../g",
              4> host => "localhost-örebro"}).
              "http://localhost-%C3%B6rebro/a/g"


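              One common use of normalization is comparing two URIs for
              equivalence. The sketch below is only an illustration of that
              idea: the module uri_equiv_example and the function
              equivalent/2 are hypothetical, and both arguments are assumed
              to be of the same type (both lists or both binaries), since the
              output type follows the input type.

```erlang
-module(uri_equiv_example).
-export([equivalent/2]).

%% Treats two URIs as equivalent if their normalized forms are equal.
%% If either normalization returns an error tuple, the URIs are
%% reported as not equivalent.
equivalent(A, B) ->
    case {uri_string:normalize(A), uri_string:normalize(B)} of
        {{error, _, _}, _} -> false;
        {_, {error, _, _}} -> false;
        {NormA, NormB} -> NormA =:= NormB
    end.
```

              With the examples above, "http://localhost:80" and
              "http://localhost/" normalize to the same string and so compare
              as equivalent.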
       normalize(URI, Options) -> NormalizedURI

              Types:

                 URI = uri_string() | uri_map()
                 Options = [return_map]
                 NormalizedURI = uri_string() | uri_map() | error()

              Same as normalize/1 but with an additional Options parameter
              that controls whether the normalized URI is returned as a
              uri_map(). There is one supported option: return_map.

              Example:

              1> uri_string:normalize("/a/b/c/./../../g", [return_map]).
              #{path => "/a/g"}
              2> uri_string:normalize(<<"mid/content=5/../6">>, [return_map]).
              #{path => <<"mid/6">>}
              3> uri_string:normalize("http://localhost:80", [return_map]).
              #{scheme => "http",path => "/",host => "localhost"}
              4> uri_string:normalize(#{scheme => "http",port => 80,path => "/a/b/c/./../../g",
              4> host => "localhost-örebro"}, [return_map]).
              #{scheme => "http",path => "/a/g",host => "localhost-örebro"}


       parse(URIString) -> URIMap

              Types:

                 URIString = uri_string()
                 URIMap = uri_map() | error()

              Parses an RFC 3986 compliant uri_string() into a uri_map() that
              holds the parsed components of the URI. If parsing fails, an
              error tuple is returned.

              See also the opposite operation recompose/1.

              Example:

              1> uri_string:parse("foo://user@example.com:8042/over/there?name=ferret#nose").
              #{fragment => "nose",host => "example.com",
                path => "/over/there",port => 8042,query => "name=ferret",
                scheme => "foo",userinfo => "user"}
              2> uri_string:parse(<<"foo://user@example.com:8042/over/there?name=ferret">>).
              #{host => <<"example.com">>,path => <<"/over/there">>,
                port => 8042,query => <<"name=ferret">>,scheme => <<"foo">>,
                userinfo => <<"user">>}


       percent_decode(URI) -> Result

              Types:

                 URI = uri_string() | uri_map()
                 Result =
                     uri_string() |
                     uri_map() |
                     {error, {invalid, {atom(), {term(), term()}}}}

              Decodes all percent-encoded triplets in the input, which can be
              either a uri_string() or a uri_map(). Note that this function
              performs raw decoding and should be used on already parsed URI
              components. Applying it directly to a complete URI can change
              the meaning of that URI.

              If the input encoding is not UTF-8, an error tuple is returned.

              Example:

              1> uri_string:percent_decode(#{host => "localhost-%C3%B6rebro",path => [],
              1> scheme => "http"}).
              #{host => "localhost-örebro",path => [],scheme => "http"}
              2> uri_string:percent_decode(<<"%C3%B6rebro">>).
              <<"örebro"/utf8>>


          Warning:
              Using uri_string:percent_decode/1 directly on a URI is not safe.
              This example shows that each consecutive application of the
              function changes the resulting URI. None of these URIs refers
              to the same resource.

              3> uri_string:percent_decode(<<"http://local%252Fhost/path">>).
              <<"http://local%2Fhost/path">>
              4> uri_string:percent_decode(<<"http://local%2Fhost/path">>).
              <<"http://local/host/path">>



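              Following the warning above, one way to decode safely is to
              parse the URI first and decode a single component afterwards. A
              minimal sketch, assuming the path is the component of interest;
              the module uri_decode_example and the function decoded_path/1
              are hypothetical names.

```erlang
-module(uri_decode_example).
-export([decoded_path/1]).

%% Parses URI first and only then percent-decodes its path component,
%% so an encoded delimiter cannot change how the URI is split into
%% components. Returns the error tuple if parsing fails.
decoded_path(URI) ->
    case uri_string:parse(URI) of
        {error, _, _} = Error ->
            Error;
        #{path := Path} ->
            uri_string:percent_decode(Path)
    end.
```

              The decoded value is application data at this point and should
              not be fed back into parse/1 as part of a URI.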
       recompose(URIMap) -> URIString

              Types:

                 URIMap = uri_map()
                 URIString = uri_string() | error()

              Creates an RFC 3986 compliant URIString (percent-encoded), based
              on the components of URIMap. If the URIMap is invalid, an error
              tuple is returned.

              See also the opposite operation parse/1.

              Example:

              1> URIMap = #{fragment => "nose", host => "example.com", path => "/over/there",
              1> port => 8042, query => "name=ferret", scheme => "foo", userinfo => "user"}.
              #{fragment => "nose",host => "example.com",
                path => "/over/there",port => 8042,query => "name=ferret",
                scheme => "foo",userinfo => "user"}

              2> uri_string:recompose(URIMap).
              "foo://user@example.com:8042/over/there?name=ferret#nose"

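              Since parse/1 and recompose/1 are opposite operations, a single
              component can be changed by updating the parsed map before
              recomposing it. The following is a sketch only; the module
              uri_update_example and the function set_path/2 are hypothetical,
              and NewPath is assumed to be of the same type (list or binary)
              as URI.

```erlang
-module(uri_update_example).
-export([set_path/2]).

%% Parses URI, replaces its path component with NewPath, and
%% recomposes the result. Returns the error tuple from parse/1 or
%% recompose/1 if either step fails.
set_path(URI, NewPath) ->
    case uri_string:parse(URI) of
        {error, _, _} = Error ->
            Error;
        Map ->
            uri_string:recompose(Map#{path => NewPath})
    end.
```

              The other components, such as the query, are carried over
              unchanged from the parsed map.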
       resolve(RefURI, BaseURI) -> TargetURI

              Types:

                 RefURI = BaseURI = uri_string() | uri_map()
                 TargetURI = uri_string() | error()

              Converts a RefURI reference that might be relative to a given
              base URI into the parsed components of the reference's target,
              which can then be recomposed to form the target URI.

              Example:

              1> uri_string:resolve("/abs/ol/ute", "http://localhost/a/b/c?q").
              "http://localhost/abs/ol/ute"
              2> uri_string:resolve("../relative", "http://localhost/a/b/c?q").
              "http://localhost/a/relative"
              3> uri_string:resolve("http://localhost/full", "http://localhost/a/b/c?q").
              "http://localhost/full"
              4> uri_string:resolve(#{path => "path", query => "xyz"}, "http://localhost/a/b/c?q").
              "http://localhost/a/b/path?xyz"


       resolve(RefURI, BaseURI, Options) -> TargetURI

              Types:

                 RefURI = BaseURI = uri_string() | uri_map()
                 Options = [return_map]
                 TargetURI = uri_string() | uri_map() | error()

              Same as resolve/2 but with an additional Options parameter that
              controls whether the target URI is returned as a uri_map().
              There is one supported option: return_map.

              Example:

              1> uri_string:resolve("/abs/ol/ute", "http://localhost/a/b/c?q", [return_map]).
              #{host => "localhost",path => "/abs/ol/ute",scheme => "http"}
              2> uri_string:resolve(#{path => "/abs/ol/ute"}, #{scheme => "http",
              2> host => "localhost", path => "/a/b/c?q"}, [return_map]).
              #{host => "localhost",path => "/abs/ol/ute",scheme => "http"}


       transcode(URIString, Options) -> Result

              Types:

                 URIString = uri_string()
                 Options =
                     [{in_encoding, unicode:encoding()} |
                      {out_encoding, unicode:encoding()}]
                 Result = uri_string() | error()

              Transcodes an RFC 3986 compliant URIString, where Options is a
              list of tagged tuples specifying the inbound (in_encoding) and
              outbound (out_encoding) encodings. in_encoding and out_encoding
              specify both binary encoding and percent-encoding for the input
              and output data. Mixed encoding, where binary encoding is not
              the same as percent-encoding, is not supported. If an argument
              is invalid, an error tuple is returned.

              Example:

              1> uri_string:transcode(<<"foo%00%00%00%F6bar"/utf32>>,
              1> [{in_encoding, utf32},{out_encoding, utf8}]).
              <<"foo%C3%B6bar"/utf8>>
              2> uri_string:transcode("foo%F6bar", [{in_encoding, latin1},
              2> {out_encoding, utf8}]).
              "foo%C3%B6bar"




Ericsson AB                      stdlib 3.14.1                   uri_string(3)