uri_string(3) Erlang Module Definition uri_string(3)

NAME
uri_string - URI processing functions.

DESCRIPTION
This module contains functions for parsing and handling URIs (RFC 3986)
and form-urlencoded query strings (HTML 5.2).

Parsing and serializing non-UTF-8 form-urlencoded query strings are
also supported (HTML 5.0).

A URI is an identifier consisting of a sequence of characters matching
the syntax rule named URI in RFC 3986.

The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment:

    URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
    hier-part   = "//" authority path-abempty
                / path-absolute
                / path-rootless
                / path-empty
    scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    authority   = [ userinfo "@" ] host [ ":" port ]
    userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )

    reserved    = gen-delims / sub-delims
    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

The interpretation of a URI depends only on the characters used and not
on how those characters are represented in a network protocol.

The functions implemented by this module cover the following use cases:

* Parsing a URI into its components and returning a map
    parse/1

* Recomposing a map of URI components into a URI string
    recompose/1

* Changing the inbound binary encoding and percent-encoding of URIs
    transcode/2

* Transforming URIs into a normalized form
    normalize/1
    normalize/2

* Composing form-urlencoded query strings from a list of key-value
  pairs
    compose_query/1
    compose_query/2

* Dissecting form-urlencoded query strings into a list of key-value
  pairs
    dissect_query/1

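The use cases above are often combined. The following is a minimal sketch, not a definitive implementation, that chains parse/1 and dissect_query/1 to read the decoded query parameters of a complete URI. The module name uri_overview and the function query_params/1 are hypothetical names chosen for illustration.

```erlang
-module(uri_overview).
-export([query_params/1]).

%% Hypothetical helper: return the decoded query parameters of a URI,
%% or [] when the URI has no query component.
query_params(URI) ->
    case uri_string:parse(URI) of
        {error, _Reason, _Detail} = Error ->
            %% Pass the error tuple from parse/1 through unchanged.
            Error;
        #{query := Query} ->
            %% The query component is still form-urlencoded here.
            uri_string:dissect_query(Query);
        _NoQuery ->
            []
    end.
```

With this sketch, a URI without a query component yields an empty list, and a parse failure is returned to the caller as the original error tuple.
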
There are four different encodings present during the handling of URIs:

* Inbound binary encoding in binaries

* Inbound percent-encoding in lists and binaries

* Outbound binary encoding in binaries

* Outbound percent-encoding in lists and binaries

Functions with a uri_string() argument accept lists, binaries, and mixed
lists (lists with binary elements) as input. All functions except
transcode/2 expect input as lists of Unicode code points, UTF-8 encoded
binaries, and UTF-8 percent-encoded URI parts ("%C3%B6" corresponds to
the Unicode character "ö").

Unless otherwise specified, the return value type and encoding are the
same as the input type and encoding. That is, binary input returns
binary output and list input returns list output. Mixed input also
returns list output.

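The rule that the output type follows the input type can be observed with a small helper. This is a sketch under the assumption that only the host component is of interest; the module name uri_types and the function host/1 are hypothetical.

```erlang
-module(uri_types).
-export([host/1]).

%% Hypothetical helper: extract the host component of a URI.
%% Returns undefined when parsing fails or no host is present.
host(URI) ->
    case uri_string:parse(URI) of
        #{host := Host} -> Host;
        _ -> undefined
    end.
```

Calling uri_types:host("http://example.com/") is expected to give the list "example.com", while the same URI passed as a binary gives <<"example.com">>.
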
For lists there is only percent-encoding. For binaries, however, both
the binary encoding and the percent-encoding must be considered.
transcode/2 provides the means to convert between the supported
encodings. It takes a uri_string() and a list of options specifying the
inbound and outbound encodings.

RFC 3986 does not mandate any specific character encoding; the encoding
is usually defined by the protocol or the surrounding text. This library
makes the same assumption: binary encoding and percent-encoding are
handled as one configuration unit and cannot be set to different values.

DATA TYPES
error() = {error, atom(), term()}

Error tuple indicating the type of error. Possible values of the
second component:

* invalid_character

* invalid_encoding

* invalid_input

* invalid_map

* invalid_percent_encoding

* invalid_scheme

* invalid_uri

* invalid_utf8

* missing_value

The third component is a term providing additional information
about the cause of the error.

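Because errors are returned as values rather than raised, callers usually match on the three-element tuple. The sketch below shows one possible way to do that for parse/1. The module name uri_errors, the function describe/1, and the tags ok and invalid are hypothetical choices, and the sketch makes no claim about which reason atom a given input produces.

```erlang
-module(uri_errors).
-export([describe/1]).

%% Hypothetical helper: wrap the result of uri_string:parse/1 in a
%% tagged tuple that is convenient to pattern match on.
describe(URI) ->
    case uri_string:parse(URI) of
        {error, Reason, Detail} ->
            %% Reason is one of the atoms listed above; Detail is the
            %% additional term supplied by the library.
            {invalid, Reason, Detail};
        Map when is_map(Map) ->
            {ok, Map}
    end.
```
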
uri_map() =
    #{fragment => unicode:chardata(),
      host => unicode:chardata(),
      path => unicode:chardata(),
      port => integer() >= 0 | undefined,
      query => unicode:chardata(),
      scheme => unicode:chardata(),
      userinfo => unicode:chardata()} |
    #{}

Map holding the main components of a URI.

uri_string() = iodata()

List of unicode codepoints, a UTF-8 encoded binary, or a mix of
the two, representing an RFC 3986 compliant URI (percent-encoded
form). A URI is a sequence of characters from a very limited
set: the letters of the basic Latin alphabet, digits, and a few
special characters.

EXPORTS
compose_query(QueryList) -> QueryString

Types:

    QueryList = [{unicode:chardata(), unicode:chardata()}]
    QueryString = uri_string() | error()

Composes a form-urlencoded QueryString based on a QueryList, a
list of non-percent-encoded key-value pairs. Form-urlencoding is
defined in section 4.10.21.6 of the HTML 5.2 specification and
in section 4.10.22.6 of the HTML 5.0 specification for non-UTF-8
encodings.

See also the opposite operation dissect_query/1.

Example:

    1> uri_string:compose_query([{"foo bar","1"},{"city","örebro"}]).
    "foo+bar=1&city=%C3%B6rebro"
    2> uri_string:compose_query([{<<"foo bar">>,<<"1">>},
    2> {<<"city">>,<<"örebro"/utf8>>}]).
    <<"foo+bar=1&city=%C3%B6rebro">>

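A composed query string is typically placed into a URI with recompose/1. The following sketch assumes a search endpoint at https://example.com/search with parameters q and page; the host, path, parameter names, module name uri_search, and function search_uri/2 are all hypothetical.

```erlang
-module(uri_search).
-export([search_uri/2]).

%% Hypothetical helper: build a search URI from a search term and a
%% page number. Host, path and parameter names are assumptions.
search_uri(Term, Page) ->
    %% compose_query/1 takes non-percent-encoded pairs and encodes them.
    Query = uri_string:compose_query([{"q", Term},
                                      {"page", integer_to_list(Page)}]),
    uri_string:recompose(#{scheme => "https",
                           host => "example.com",
                           path => "/search",
                           query => Query}).
```

For example, uri_search:search_uri("erlang otp", 2) should produce "https://example.com/search?q=erlang+otp&page=2".
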
compose_query(QueryList, Options) -> QueryString

Types:

    QueryList = [{unicode:chardata(), unicode:chardata()}]
    Options = [{encoding, atom()}]
    QueryString = uri_string() | error()

Same as compose_query/1 but with an additional Options parameter
that controls the encoding ("charset") used by the encoding
algorithm. There are two supported encodings: utf8 (or unicode)
and latin1.

Each character in the entry's name and value that cannot be
expressed using the selected character encoding is replaced by
a string consisting of a U+0026 AMPERSAND character (&), a U+0023
NUMBER SIGN character (#), one or more ASCII digits representing
the Unicode code point of the character in base ten, and finally
a U+003B SEMICOLON character (;).

Bytes other than 0x2A, 0x2D, 0x2E, 0x30 to 0x39, 0x41 to 0x5A, 0x5F,
and 0x61 to 0x7A are percent-encoded: a U+0025 PERCENT SIGN
character (%) followed by uppercase ASCII hex digits representing
the hexadecimal value of the byte. As the examples show, a space
(0x20) is written as "+" instead.

See also the opposite operation dissect_query/1.

Example:

    1> uri_string:compose_query([{"foo bar","1"},{"city","örebro"}],
    1> [{encoding, latin1}]).
    "foo+bar=1&city=%F6rebro"
    2> uri_string:compose_query([{<<"foo bar">>,<<"1">>},
    2> {<<"city">>,<<"東京"/utf8>>}], [{encoding, latin1}]).
    <<"foo+bar=1&city=%26%2326481%3B%26%2320140%3B">>

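To compare the two supported encodings side by side, the same pair can be composed with a caller-chosen charset. This is a small sketch; the module name uri_charset and the function city_query/1 are hypothetical, and the city name is written with an explicit code point so that the source file encoding does not matter.

```erlang
-module(uri_charset).
-export([city_query/1]).

%% Hypothetical helper: encode one fixed pair with the given charset.
%% "\x{F6}rebro" is "örebro" with the first character given as the
%% code point U+00F6.
city_query(Encoding) ->
    uri_string:compose_query([{"city", "\x{F6}rebro"}],
                             [{encoding, Encoding}]).
```

Based on the examples in this page, uri_charset:city_query(utf8) should give "city=%C3%B6rebro" and uri_charset:city_query(latin1) should give "city=%F6rebro".
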
dissect_query(QueryString) -> QueryList

Types:

    QueryString = uri_string()
    QueryList =
        [{unicode:chardata(), unicode:chardata()}] | error()

Dissects a urlencoded QueryString and returns a QueryList, a
list of non-percent-encoded key-value pairs. Form-urlencoding is
defined in section 4.10.21.6 of the HTML 5.2 specification and
in section 4.10.22.6 of the HTML 5.0 specification for non-UTF-8
encodings.

See also the opposite operation compose_query/1.

Example:

    1> uri_string:dissect_query("foo+bar=1&city=%C3%B6rebro").
    [{"foo bar","1"},{"city","örebro"}]
    2> uri_string:dissect_query(<<"foo+bar=1&city=%26%2326481%3B%26%2320140%3B">>).
    [{<<"foo bar">>,<<"1">>},
     {<<"city">>,<<230,157,177,228,186,172>>}]

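Since the result is an ordinary list of pairs, standard list functions can be used on it. The sketch below looks up a single parameter with proplists:get_value/2; the module name uri_params and the function param/2 are hypothetical, and returning undefined for an invalid query string is a design choice of this sketch rather than library behavior.

```erlang
-module(uri_params).
-export([param/2]).

%% Hypothetical helper: look up one decoded parameter in a query string.
%% Returns undefined when the key is absent or the query is invalid.
param(Key, QueryString) ->
    case uri_string:dissect_query(QueryString) of
        {error, _Reason, _Detail} ->
            undefined;
        Pairs ->
            %% Key must have the same type (list or binary) as the
            %% keys produced for this input.
            proplists:get_value(Key, Pairs)
    end.
```
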
normalize(URI) -> NormalizedURI

Types:

    URI = uri_string() | uri_map()
    NormalizedURI = uri_string() | error()

Transforms a URI into a normalized form using Syntax-Based
Normalization as defined by RFC 3986.

This function implements case normalization, percent-encoding
normalization, path segment normalization, and scheme-based
normalization for HTTP(S), with basic support for FTP, SSH, SFTP and
TFTP.

Example:

    1> uri_string:normalize("/a/b/c/./../../g").
    "/a/g"
    2> uri_string:normalize(<<"mid/content=5/../6">>).
    <<"mid/6">>
    3> uri_string:normalize("http://localhost:80").
    "http://localhost/"
    4> uri_string:normalize(#{scheme => "http",port => 80,path => "/a/b/c/./../../g",
    4> host => "localhost-örebro"}).
    "http://localhost-%C3%B6rebro/a/g"

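One common use of normalization is to decide whether two differently written URIs refer to the same resource. The following sketch compares normalized forms; the module name uri_equiv and the function equivalent/2 are hypothetical, and both arguments are assumed to be of the same type (both lists or both binaries), since the output type follows the input type.

```erlang
-module(uri_equiv).
-export([equivalent/2]).

%% Hypothetical helper: two URIs are treated as equivalent when their
%% normalized forms are identical. A URI that fails to normalize is
%% never considered equivalent to anything.
equivalent(A, B) ->
    case uri_string:normalize(A) of
        {error, _Reason, _Detail} ->
            false;
        Normalized ->
            Normalized =:= uri_string:normalize(B)
    end.
```
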
normalize(URI, Options) -> NormalizedURI

Types:

    URI = uri_string() | uri_map()
    Options = [return_map]
    NormalizedURI = uri_string() | uri_map()

Same as normalize/1 but with an additional Options parameter that
controls whether the normalized URI is returned as a uri_map().
There is one supported option: return_map.

Example:

    1> uri_string:normalize("/a/b/c/./../../g", [return_map]).
    #{path => "/a/g"}
    2> uri_string:normalize(<<"mid/content=5/../6">>, [return_map]).
    #{path => <<"mid/6">>}
    3> uri_string:normalize("http://localhost:80", [return_map]).
    #{scheme => "http",path => "/",host => "localhost"}
    4> uri_string:normalize(#{scheme => "http",port => 80,path => "/a/b/c/./../../g",
    4> host => "localhost-örebro"}, [return_map]).
    #{scheme => "http",path => "/a/g",host => "localhost-örebro"}

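Returning a map is convenient when only some normalized components are needed. The sketch below extracts a scheme, host, and port triple. The module name uri_origin and the function origin/1 are hypothetical, and reporting an absent port as undefined is an assumption of this sketch.

```erlang
-module(uri_origin).
-export([origin/1]).

%% Hypothetical helper: return {Scheme, Host, Port} of a normalized URI.
%% Port is undefined when the normalized map carries no port key, for
%% example after a default port has been removed by normalization.
origin(URI) ->
    case uri_string:normalize(URI, [return_map]) of
        #{scheme := Scheme, host := Host} = Map ->
            {Scheme, Host, maps:get(port, Map, undefined)};
        _Other ->
            %% Relative references and error tuples end up here.
            undefined
    end.
```
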
parse(URIString) -> URIMap

Types:

    URIString = uri_string()
    URIMap = uri_map() | error()

Parses an RFC 3986 compliant uri_string() into a uri_map() that
holds the parsed components of the URI. If parsing fails, an
error tuple is returned.

See also the opposite operation recompose/1.

Example:

    1> uri_string:parse("foo://user@example.com:8042/over/there?name=ferret#nose").
    #{fragment => "nose",host => "example.com",
      path => "/over/there",port => 8042,query => "name=ferret",
      scheme => "foo",userinfo => "user"}
    2> uri_string:parse(<<"foo://user@example.com:8042/over/there?name=ferret">>).
    #{host => <<"example.com">>,path => <<"/over/there">>,
      port => 8042,query => <<"name=ferret">>,scheme => <<"foo">>,
      userinfo => <<"user">>}

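Components that are absent from the URI are absent from the map, so optional components are best read with a default. The sketch below returns a host and port pair. The module name uri_endpoint, the function endpoint/1, the fallback port 80, and the no_host atom are hypothetical choices for illustration.

```erlang
-module(uri_endpoint).
-export([endpoint/1]).

%% Hypothetical helper: return {ok, {Host, Port}}, falling back to an
%% assumed default port of 80 when the URI does not specify one.
endpoint(URI) ->
    case uri_string:parse(URI) of
        #{host := Host} = Map ->
            {ok, {Host, maps:get(port, Map, 80)}};
        {error, Reason, Detail} ->
            {error, {Reason, Detail}};
        _NoHost ->
            %% A relative reference such as "/over/there" has no host.
            {error, no_host}
    end.
```
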
recompose(URIMap) -> URIString

Types:

    URIMap = uri_map()
    URIString = uri_string() | error()

Creates an RFC 3986 compliant URIString (percent-encoded), based
on the components of URIMap. If the URIMap is invalid, an error
tuple is returned.

See also the opposite operation parse/1.

Example:

    1> URIMap = #{fragment => "nose", host => "example.com", path => "/over/there",
    1> port => 8042, query => "name=ferret", scheme => "foo", userinfo => "user"}.
    #{fragment => "nose",host => "example.com",
      path => "/over/there",port => 8042,query => "name=ferret",
      scheme => "foo",userinfo => "user"}
    2> uri_string:recompose(URIMap).
    "foo://user@example.com:8042/over/there?name=ferret#nose"

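Because parse/1 and recompose/1 are opposite operations, a URI can be edited by updating the parsed map and recomposing it. The following is a minimal sketch that replaces the path; the module name uri_rewrite and the function set_path/2 are hypothetical.

```erlang
-module(uri_rewrite).
-export([set_path/2]).

%% Hypothetical helper: replace the path of a URI and rebuild the
%% string. NewPath should have the same type (list or binary) as URI.
set_path(URI, NewPath) ->
    case uri_string:parse(URI) of
        {error, _Reason, _Detail} = Error ->
            Error;
        Map ->
            %% All other components are carried over unchanged.
            uri_string:recompose(Map#{path => NewPath})
    end.
```

For example, uri_rewrite:set_path("http://example.com/old?x=1", "/new") should return "http://example.com/new?x=1".
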
transcode(URIString, Options) -> Result

Types:

    URIString = uri_string()
    Options =
        [{in_encoding, unicode:encoding()} |
         {out_encoding, unicode:encoding()}]
    Result = uri_string() | error()

Transcodes an RFC 3986 compliant URIString, where Options is a
list of tagged tuples specifying the inbound (in_encoding) and
outbound (out_encoding) encodings. in_encoding and out_encoding
specify both the binary encoding and the percent-encoding of the
input and output data. Mixed encoding, where the binary encoding is
not the same as the percent-encoding, is not supported. If an
argument is invalid, an error tuple is returned.

Example:

    1> uri_string:transcode(<<"foo%00%00%00%F6bar"/utf32>>,
    1> [{in_encoding, utf32},{out_encoding, utf8}]).
    <<"foo%C3%B6bar"/utf8>>
    2> uri_string:transcode("foo%F6bar", [{in_encoding, latin1},
    2> {out_encoding, utf8}]).
    "foo%C3%B6bar"

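Since the other functions in this module expect UTF-8, transcode/2 is typically applied first to input that arrives in another encoding. The sketch below wraps the Latin-1 to UTF-8 case from the example above; the module name uri_latin1 and the function to_utf8/1 are hypothetical.

```erlang
-module(uri_latin1).
-export([to_utf8/1]).

%% Hypothetical helper: convert a Latin-1 encoded URI to UTF-8 so that
%% it can be passed on to parse/1, normalize/1 and the other functions.
to_utf8(URI) ->
    uri_string:transcode(URI, [{in_encoding, latin1},
                               {out_encoding, utf8}]).
```
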
Ericsson AB stdlib 3.8.2.1 uri_string(3)