uri_string(3)              Erlang Module Definition              uri_string(3)

NAME

       uri_string - URI processing functions.

DESCRIPTION

       This module contains functions for parsing and handling URIs (RFC 3986)
       and form-urlencoded query strings (HTML 5.2).

       Parsing and serializing non-UTF-8 form-urlencoded query strings are
       also supported (HTML 5.0).

       A URI is an identifier consisting of a sequence of characters matching
       the syntax rule named URI in RFC 3986.

       The generic URI syntax consists of a hierarchical sequence of
       components referred to as the scheme, authority, path, query, and
       fragment:

           URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
           hier-part   = "//" authority path-abempty
                          / path-absolute
                          / path-rootless
                          / path-empty
           scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
           authority   = [ userinfo "@" ] host [ ":" port ]
           userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )

           reserved    = gen-delims / sub-delims
           gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
           sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                       / "*" / "+" / "," / ";" / "="

           unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

       The interpretation of a URI depends only on the characters used and not
       on how those characters are represented in a network protocol.

       The functions implemented by this module cover the following use cases:

         * Parsing URIs into their components and returning a map
           parse/1

         * Recomposing a map of URI components into a URI string
           recompose/1

         * Changing inbound binary and percent-encoding of URIs
           transcode/2

         * Transforming URIs into a normalized form
           normalize/1
           normalize/2

         * Resolving a URI reference against a base URI
           resolve/2
           resolve/3

         * Composing form-urlencoded query strings from a list of key-value
           pairs
           compose_query/1
           compose_query/2

         * Dissecting form-urlencoded query strings into a list of key-value
           pairs
           dissect_query/1

         * Decoding percent-encoded triplets
           percent_decode/1

       There are four different encodings present during the handling of URIs:

         * Inbound binary encoding in binaries

         * Inbound percent-encoding in lists and binaries

         * Outbound binary encoding in binaries

         * Outbound percent-encoding in lists and binaries

       Functions with a uri_string() argument accept lists, binaries, and
       mixed lists (lists with binary elements) as input. All functions except
       transcode/2 expect input as lists of Unicode code points, UTF-8
       encoded binaries, and UTF-8 percent-encoded URI parts ("%C3%B6"
       corresponds to the Unicode character "ö").

       Unless otherwise specified, the return value type and encoding are the
       same as the input type and encoding. That is, binary input returns
       binary output and list input returns list output, but mixed input
       returns list output.

       For lists there is only percent-encoding. For binaries, however, both
       binary encoding and percent-encoding must be considered. transcode/2
       provides the means to convert between the supported encodings. It
       takes a uri_string() and a list of options specifying the inbound and
       outbound encodings.

       RFC 3986 does not mandate any specific character encoding; the
       encoding is usually defined by the protocol or surrounding text. This
       library makes the same assumption: binary encoding and
       percent-encoding are handled as one configuration unit and cannot be
       set to different values.

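       As a minimal sketch of how the functions listed above combine, the
       helper below parses a URI, replaces the query component in the
       returned map, and recomposes the result. The module name
       uri_roundtrip_sketch and the function set_query/2 are hypothetical
       names chosen for illustration; they are not part of uri_string. The
       sketch assumes list input, so the result is also a list.

```erlang
-module(uri_roundtrip_sketch).
-export([set_query/2]).

%% Replace the query component of URI with Query, keeping all other
%% components. An error tuple from parse/1 is passed through unchanged.
set_query(URI, Query) ->
    case uri_string:parse(URI) of
        {error, _Type, _Info} = Error ->
            Error;
        Map ->
            %% recompose/1 may itself return an error tuple if the
            %% resulting map is not a valid uri_map().
            uri_string:recompose(Map#{query => Query})
    end.
```

       Working on the parsed map rather than on the string avoids manual
       handling of the "?" and "#" delimiters.
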

DATA TYPES

       error() = {error, atom(), term()}

              Error tuple indicating the type of error. Possible values of the
              second component:

                * invalid_character

                * invalid_encoding

                * invalid_input

                * invalid_map

                * invalid_percent_encoding

                * invalid_scheme

                * invalid_uri

                * invalid_utf8

                * missing_value

              The third component is a term providing additional information
              about the cause of the error.

       uri_map() =
           #{fragment => unicode:chardata(),
             host => unicode:chardata(),
             path => unicode:chardata(),
             port => integer() >= 0 | undefined,
             query => unicode:chardata(),
             scheme => unicode:chardata(),
             userinfo => unicode:chardata()}

              Map holding the main components of a URI.

       uri_string() = iodata()

              List of Unicode code points, a UTF-8 encoded binary, or a mix of
              the two, representing an RFC 3986 compliant URI (percent-encoded
              form). A URI is a sequence of characters from a very limited
              set: the letters of the basic Latin alphabet, digits, and a few
              special characters.

EXPORTS

       allowed_characters() -> [{atom(), list()}]

              This is a utility function meant to be used in the shell for
              printing the allowed characters in each major URI component, and
              also in the most important character sets. Note that this
              function does not replace the ABNF rules defined by the
              standards; these character sets are derived directly from the
              aforementioned rules. For more information, see the Uniform
              Resource Identifiers chapter in stdlib's Users Guide.

       compose_query(QueryList) -> QueryString

              Types:

                 QueryList = [{unicode:chardata(), unicode:chardata() | true}]
                 QueryString = uri_string() | error()

              Composes a form-urlencoded QueryString based on a QueryList, a
              list of non-percent-encoded key-value pairs. Form-urlencoding is
              defined in section 4.10.21.6 of the HTML 5.2 specification and
              in section 4.10.22.6 of the HTML 5.0 specification for non-UTF-8
              encodings.

              See also the opposite operation dissect_query/1.

              Example:

              1> uri_string:compose_query([{"foo bar","1"},{"city","örebro"}]).
              "foo+bar=1&city=%C3%B6rebro"
              2> uri_string:compose_query([{<<"foo bar">>,<<"1">>},
              2> {<<"city">>,<<"örebro"/utf8>>}]).
              <<"foo+bar=1&city=%C3%B6rebro">>

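              Because compose_query/1 and dissect_query/1 are opposite
              operations, a round trip can serve as a quick sanity check. The
              sketch below is illustrative only: the module name
              query_roundtrip_sketch and the function roundtrip/1 are
              hypothetical, and the sketch assumes ASCII list input.

```erlang
-module(query_roundtrip_sketch).
-export([roundtrip/1]).

%% Compose a query string from key-value pairs, then dissect it again.
%% Returns {QueryString, DissectedPairs}, or the error tuple if
%% composing fails.
roundtrip(Pairs) ->
    case uri_string:compose_query(Pairs) of
        {error, _Type, _Info} = Error ->
            Error;
        QueryString ->
            {QueryString, uri_string:dissect_query(QueryString)}
    end.
```

              For ASCII list input, the dissected pairs are expected to equal
              the original pairs, with spaces travelling as "+" in between.
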
       compose_query(QueryList, Options) -> QueryString

              Types:

                 QueryList = [{unicode:chardata(), unicode:chardata() | true}]
                 Options = [{encoding, atom()}]
                 QueryString = uri_string() | error()

              Same as compose_query/1 but with an additional Options
              parameter that controls the encoding ("charset") used by the
              encoding algorithm. There are two supported encodings: utf8 (or
              unicode) and latin1.

              Each character in the entry's name and value that cannot be
              expressed using the selected character encoding is replaced by
              a string consisting of a U+0026 AMPERSAND character (&), a "#"
              (U+0023) character, one or more ASCII digits representing the
              Unicode code point of the character in base ten, and finally a
              ";" (U+003B) character.

              Bytes outside the set 0x2A, 0x2D, 0x2E, 0x30 to 0x39, 0x41 to
              0x5A, 0x5F, and 0x61 to 0x7A are percent-encoded (a U+0025
              PERCENT SIGN character (%) followed by uppercase ASCII hex
              digits representing the hexadecimal value of the byte).

              See also the opposite operation dissect_query/1.

              Example:

              1> uri_string:compose_query([{"foo bar","1"},{"city","örebro"}],
              1> [{encoding, latin1}]).
              "foo+bar=1&city=%F6rebro"
              2> uri_string:compose_query([{<<"foo bar">>,<<"1">>},
              2> {<<"city">>,<<"東京"/utf8>>}], [{encoding, latin1}]).
              <<"foo+bar=1&city=%26%2326481%3B%26%2320140%3B">>

       dissect_query(QueryString) -> QueryList

              Types:

                 QueryString = uri_string()
                 QueryList =
                     [{unicode:chardata(), unicode:chardata() | true}] |
                     error()

              Dissects a form-urlencoded QueryString and returns a QueryList,
              a list of non-percent-encoded key-value pairs. Form-urlencoding
              is defined in section 4.10.21.6 of the HTML 5.2 specification
              and in section 4.10.22.6 of the HTML 5.0 specification for
              non-UTF-8 encodings.

              See also the opposite operation compose_query/1.

              Example:

              1> uri_string:dissect_query("foo+bar=1&city=%C3%B6rebro").
              [{"foo bar","1"},{"city","örebro"}]
              2> uri_string:dissect_query(<<"foo+bar=1&city=%26%2326481%3B%26%2320140%3B">>).
              [{<<"foo bar">>,<<"1">>},
               {<<"city">>,<<230,157,177,228,186,172>>}]

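              A common use of dissect_query/1 is looking up a single
              parameter in the query of a full URI. The sketch below combines
              it with parse/1 and proplists:get_value/2. The module name
              query_param_sketch and the function query_param/2 are
              hypothetical and not part of uri_string. List input is assumed,
              so keys are compared as lists.

```erlang
-module(query_param_sketch).
-export([query_param/2]).

%% Return the value of Key in the query component of URI,
%% 'undefined' if the URI has no query or the key is absent,
%% or the error tuple if parsing or dissecting fails.
query_param(URI, Key) ->
    case uri_string:parse(URI) of
        {error, _Type, _Info} = Error ->
            Error;
        #{query := Query} ->
            case uri_string:dissect_query(Query) of
                {error, _, _} = Error ->
                    Error;
                Pairs ->
                    proplists:get_value(Key, Pairs)
            end;
        #{} ->
            %% Parsed successfully, but there is no query component.
            undefined
    end.
```

              Dissecting only the query component, rather than the whole URI,
              keeps "&" and "=" characters in other components from being
              misread as separators.
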
       normalize(URI) -> NormalizedURI

              Types:

                 URI = uri_string() | uri_map()
                 NormalizedURI = uri_string() | error()

              Transforms a URI into a normalized form using Syntax-Based
              Normalization as defined by RFC 3986.

              This function implements case normalization, percent-encoding
              normalization, path segment normalization, and scheme-based
              normalization for HTTP(S), with basic support for FTP, SSH,
              SFTP, and TFTP.

              Example:

              1> uri_string:normalize("/a/b/c/./../../g").
              "/a/g"
              2> uri_string:normalize(<<"mid/content=5/../6">>).
              <<"mid/6">>
              3> uri_string:normalize("http://localhost:80").
              "http://localhost/"
              4> uri_string:normalize(#{scheme => "http",port => 80,path => "/a/b/c/./../../g",
              4> host => "localhost-örebro"}).
              "http://localhost-%C3%B6rebro/a/g"

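              Normalization is typically used to compare URIs that differ
              only in spelling. The sketch below treats two URIs as
              equivalent when their normalized forms are identical. The
              module name uri_equiv_sketch and the function equivalent/2 are
              hypothetical; treating unparsable input as "not equivalent" is
              a design choice of this sketch, not behavior of uri_string.

```erlang
-module(uri_equiv_sketch).
-export([equivalent/2]).

%% True if both URIs normalize to the same string.
%% Any normalization error makes the comparison false.
equivalent(A, B) ->
    case {uri_string:normalize(A), uri_string:normalize(B)} of
        {{error, _, _}, _} -> false;
        {_, {error, _, _}} -> false;
        {NormA, NormB} -> NormA =:= NormB
    end.
```

              Both arguments should be of the same type (both lists or both
              binaries), since the result type follows the input type.
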
       normalize(URI, Options) -> NormalizedURI

              Types:

                 URI = uri_string() | uri_map()
                 Options = [return_map]
                 NormalizedURI = uri_string() | uri_map() | error()

              Same as normalize/1 but with an additional Options parameter
              that controls whether the normalized URI is returned as a
              uri_map(). There is one supported option: return_map.

              Example:

              1> uri_string:normalize("/a/b/c/./../../g", [return_map]).
              #{path => "/a/g"}
              2> uri_string:normalize(<<"mid/content=5/../6">>, [return_map]).
              #{path => <<"mid/6">>}
              3> uri_string:normalize("http://localhost:80", [return_map]).
              #{scheme => "http",path => "/",host => "localhost"}
              4> uri_string:normalize(#{scheme => "http",port => 80,path => "/a/b/c/./../../g",
              4> host => "localhost-örebro"}, [return_map]).
              #{scheme => "http",path => "/a/g",host => "localhost-örebro"}

       parse(URIString) -> URIMap

              Types:

                 URIString = uri_string()
                 URIMap = uri_map() | error()

              Parses an RFC 3986 compliant uri_string() into a uri_map() that
              holds the parsed components of the URI. If parsing fails, an
              error tuple is returned.

              See also the opposite operation recompose/1.

              Example:

              1> uri_string:parse("foo://user@example.com:8042/over/there?name=ferret#nose").
              #{fragment => "nose",host => "example.com",
                path => "/over/there",port => 8042,query => "name=ferret",
                scheme => "foo",userinfo => "user"}
              2> uri_string:parse(<<"foo://user@example.com:8042/over/there?name=ferret">>).
              #{host => <<"example.com">>,path => <<"/over/there">>,
                port => 8042,query => <<"name=ferret">>,scheme => <<"foo">>,
                userinfo => <<"user">>}

       percent_decode(URI) -> Result

              Types:

                 URI = uri_string() | uri_map()
                 Result =
                     uri_string() |
                     uri_map() |
                     {error, {invalid, {atom(), {term(), term()}}}}

              Decodes all percent-encoded triplets in the input, which can be
              either a uri_string() or a uri_map(). Note that this function
              performs raw decoding and should be used on already parsed URI
              components. Applying this function directly to a standard URI
              can effectively change it.

              If the input encoding is not UTF-8, an error tuple is returned.

              Example:

              1> uri_string:percent_decode(#{host => "localhost-%C3%B6rebro",path => [],
              1> scheme => "http"}).
              #{host => "localhost-örebro",path => [],scheme => "http"}
              2> uri_string:percent_decode(<<"%C3%B6rebro">>).
              <<"örebro"/utf8>>


          Warning:
              Using uri_string:percent_decode/1 directly on a URI is not safe.
              This example shows that each consecutive application of the
              function changes the resulting URI. None of these URIs refer to
              the same resource.

              3> uri_string:percent_decode(<<"http://local%252Fhost/path">>).
              <<"http://local%2Fhost/path">>
              4> uri_string:percent_decode(<<"http://local%2Fhost/path">>).
              <<"http://local/host/path">>

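              Following the advice above, a safer pattern is to parse first
              and decode only the component of interest. The sketch below
              does this for the path. The module name path_decode_sketch and
              the function decoded_path/1 are hypothetical names used for
              illustration.

```erlang
-module(path_decode_sketch).
-export([decoded_path/1]).

%% Parse URI and return its path with percent-encoded triplets decoded.
%% Error tuples from parse/1 or percent_decode/1 are returned as-is.
decoded_path(URI) ->
    case uri_string:parse(URI) of
        {error, _Type, _Info} = Error ->
            Error;
        #{path := Path} ->
            %% Decoding happens after the URI has been split, so a decoded
            %% "/" or "?" can no longer change the URI's structure.
            uri_string:percent_decode(Path)
    end.
```

              The decoded value is meant for display or lookup and should not
              be placed back into a URI without encoding it again.
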
       recompose(URIMap) -> URIString

              Types:

                 URIMap = uri_map()
                 URIString = uri_string() | error()

              Creates an RFC 3986 compliant URIString (percent-encoded), based
              on the components of URIMap. If the URIMap is invalid, an error
              tuple is returned.

              See also the opposite operation parse/1.

              Example:

              1> URIMap = #{fragment => "nose", host => "example.com", path => "/over/there",
              1> port => 8042, query => "name=ferret", scheme => "foo", userinfo => "user"}.
              #{fragment => "nose",host => "example.com",
                path => "/over/there",port => 8042,query => "name=ferret",
                scheme => "foo",userinfo => "user"}

              2> uri_string:recompose(URIMap).
              "foo://user@example.com:8042/over/there?name=ferret#nose"

       resolve(RefURI, BaseURI) -> TargetURI

              Types:

                 RefURI = BaseURI = uri_string() | uri_map()
                 TargetURI = uri_string() | error()

              Converts a RefURI reference that might be relative to a given
              base URI into the parsed components of the reference's target,
              which can then be recomposed to form the target URI.

              Example:

              1> uri_string:resolve("/abs/ol/ute", "http://localhost/a/b/c?q").
              "http://localhost/abs/ol/ute"
              2> uri_string:resolve("../relative", "http://localhost/a/b/c?q").
              "http://localhost/a/relative"
              3> uri_string:resolve("http://localhost/full", "http://localhost/a/b/c?q").
              "http://localhost/full"
              4> uri_string:resolve(#{path => "path", query => "xyz"}, "http://localhost/a/b/c?q").
              "http://localhost/a/b/path?xyz"

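              A typical application is turning the links found in a document
              into absolute URIs. The sketch below resolves a list of
              references against one base URI. The module name
              resolve_all_sketch and the function resolve_all/2 are
              hypothetical and only illustrate the call pattern.

```erlang
-module(resolve_all_sketch).
-export([resolve_all/2]).

%% Resolve every reference in Refs against BaseURI.
%% Each element of the result is a target URI or an error tuple.
resolve_all(Refs, BaseURI) ->
    [uri_string:resolve(Ref, BaseURI) || Ref <- Refs].
```

              Errors are kept in the result list rather than raised, so one
              malformed reference does not discard the others.
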
       resolve(RefURI, BaseURI, Options) -> TargetURI

              Types:

                 RefURI = BaseURI = uri_string() | uri_map()
                 Options = [return_map]
                 TargetURI = uri_string() | uri_map() | error()

              Same as resolve/2 but with an additional Options parameter that
              controls whether the target URI is returned as a uri_map().
              There is one supported option: return_map.

              Example:

              1> uri_string:resolve("/abs/ol/ute", "http://localhost/a/b/c?q", [return_map]).
              #{host => "localhost",path => "/abs/ol/ute",scheme => "http"}
              2> uri_string:resolve(#{path => "/abs/ol/ute"}, #{scheme => "http",
              2> host => "localhost", path => "/a/b/c?q"}, [return_map]).
              #{host => "localhost",path => "/abs/ol/ute",scheme => "http"}

       transcode(URIString, Options) -> Result

              Types:

                 URIString = uri_string()
                 Options =
                     [{in_encoding, unicode:encoding()} |
                      {out_encoding, unicode:encoding()}]
                 Result = uri_string() | error()

              Transcodes an RFC 3986 compliant URIString, where Options is a
              list of tagged tuples specifying the inbound (in_encoding) and
              outbound (out_encoding) encodings. in_encoding and out_encoding
              specify both the binary encoding and the percent-encoding of the
              input and output data. Mixed encoding, where the binary encoding
              is not the same as the percent-encoding, is not supported. If an
              argument is invalid, an error tuple is returned.

              Example:

              1> uri_string:transcode(<<"foo%00%00%00%F6bar"/utf32>>,
              1> [{in_encoding, utf32},{out_encoding, utf8}]).
              <<"foo%C3%B6bar"/utf8>>
              2> uri_string:transcode("foo%F6bar", [{in_encoding, latin1},
              2> {out_encoding, utf8}]).
              "foo%C3%B6bar"

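              Since all other functions in this module expect UTF-8, input
              from a Latin-1 source can be transcoded once at the boundary.
              The sketch below wraps that conversion. The module name
              transcode_sketch and the function latin1_to_utf8/1 are
              hypothetical names used for illustration.

```erlang
-module(transcode_sketch).
-export([latin1_to_utf8/1]).

%% Convert a Latin-1 encoded URI (binary encoding and percent-encoding)
%% to UTF-8. Returns the transcoded URI or an error tuple.
latin1_to_utf8(URI) ->
    uri_string:transcode(URI, [{in_encoding, latin1},
                               {out_encoding, utf8}]).
```

              The result can then be passed to parse/1, normalize/1, and the
              other functions that assume UTF-8 input.
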
Ericsson AB                      stdlib 3.14.1                   uri_string(3)