Net::IDN::Standards(3)  User Contributed Perl Documentation  Net::IDN::Standards(3)

Net::IDN::Standards -- Internationalized Domain Names for Applications
(IDNA)

Historically, domain names and host names were restricted to a limited
repertoire of ASCII characters: letters, digits and the hyphen (matching
"/[A-Z0-9-]/i"). Words and names from languages that require additional
characters (such as letters with diacritics) or other scripts could not
be used.
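
As a small illustration of that restriction, the following sketch checks
whether a name stays within the letters-digits-hyphen repertoire. It is
written in Python rather than Perl, and the function name "is_ldh_name"
is made up for this example; it does not check label lengths or hyphen
placement.

```python
import re

# Letters, digits and hyphen only. re.ASCII keeps case-insensitive
# matching from accepting non-ASCII look-alikes such as U+017F.
LDH_LABEL = re.compile(r"[A-Z0-9-]+", re.IGNORECASE | re.ASCII)


def is_ldh_name(name: str) -> bool:
    """Return True if every dot-separated label uses only A-Z, 0-9 and '-'."""
    return all(LDH_LABEL.fullmatch(label) for label in name.split("."))


print(is_ldh_name("example.com"))   # True
print(is_ldh_name("münchen.de"))    # False: 'ü' is outside the repertoire
```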

Internationalized Domain Names (IDNs) extend the character repertoire
for domain names from ASCII to Unicode while maintaining backwards
compatibility with software that only expects and handles ASCII
characters.

To do so, Unicode domain names are converted to ASCII, label by label,
using an ASCII-compatible encoding (ACE) based on Punycode. On the wire,
each converted label starts with "xn--", followed by the Punycode
encoding of the Unicode label. The Unicode version is typically only
shown in applications presenting the domain to the user (hence
Internationalized Domain Names for Applications, IDNA).
Internationalized Resource Identifiers (IRIs), the Unicode version of
URLs, may also include domain names in their Unicode form.
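
The raw conversion can be sketched with the "punycode" codec from
Python's standard library (not with the Net::IDN modules). The helper
names "to_ace_name" and "from_ace_name" are hypothetical, and the sketch
deliberately skips the preparation step, so it is not a complete IDNA
implementation.

```python
ACE_PREFIX = "xn--"


def to_ace_name(name: str) -> str:
    """Punycode-encode each non-ASCII label and add the ACE prefix."""
    labels = []
    for label in name.split("."):
        if label.isascii():
            labels.append(label)  # ASCII labels are left untouched
        else:
            labels.append(ACE_PREFIX + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


def from_ace_name(name: str) -> str:
    """Reverse to_ace_name for labels that carry the ACE prefix."""
    labels = []
    for label in name.split("."):
        if label.lower().startswith(ACE_PREFIX):
            label = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


print(to_ace_name("bücher.example"))            # xn--bcher-kva.example
print(from_ace_name("xn--bcher-kva.example"))   # bücher.example
```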

The IDNA specifications, however, cover not only the actual Punycode
conversion but also extensive rules for the preparation (mapping and/or
validation) of input strings. They typically define two functions,
"ToASCII" and "ToUnicode", which prepare a domain name and convert it to
its ACE form or its Unicode form, respectively.
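
For illustration, Python's built-in "idna" codec applies ToASCII and
ToUnicode as defined by RFC 3490 (IDNA2003) to each label of a name. The
wrapper names "to_ascii" and "to_unicode" below are chosen for this
sketch and are not the Net::IDN interface.

```python
def to_ascii(name: str) -> str:
    """Prepare (nameprep) and convert a domain name to its ACE form."""
    return name.encode("idna").decode("ascii")


def to_unicode(name: str) -> str:
    """Convert a domain name from its ACE form back to Unicode."""
    return name.encode("ascii").decode("idna")


# Preparation lowercases the non-ASCII label before the Punycode step.
print(to_ascii("Bücher.example"))             # xn--bcher-kva.example
print(to_unicode("xn--bcher-kva.example"))    # bücher.example
```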

"The nice thing about standards is that you have so many to
choose from."
-- Andrew S. Tanenbaum

While the actual Punycode conversion is stable, there are different
specifications regarding mapping and/or validation (preparation):

IDNA2003
IDNA2003, which is defined in RFC 3490
(<http://tools.ietf.org/html/rfc3490>) and related documents, was the
original specification for the internationalization of domain names.

However, some issues were subsequently identified with IDNA2003: the
specification was tied to Unicode 3.2 and therefore did not allow
characters added in newer versions of Unicode (unless the
specifications themselves were updated).

Furthermore, a few characters were mapped to other characters or
deleted although they carry meaning in some languages (e.g. 'ß' and 'ς'
were mapped to 'ss' and 'σ'; ZWJ and ZWNJ were always mapped to
nothing, although some scripts like Arabic require them for correct
display).
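
These IDNA2003 behaviours can be observed with Python's standard
library, whose "idna" codec follows RFC 3490 and whose
"unicodedata.ucd_3_2_0" object exposes the Unicode 3.2 database. This is
only a demonstration under those assumptions; the helper name
"idna2003_to_ascii" is invented here.

```python
import unicodedata


def idna2003_to_ascii(name: str) -> str:
    """ToASCII as implemented by Python's IDNA2003 codec."""
    return name.encode("idna").decode("ascii")


# 'ß' is mapped to 'ss', so the distinction is lost.
print(idna2003_to_ascii("straße.example"))      # strasse.example

# Final sigma is mapped to ordinary sigma: both spellings encode alike.
print(idna2003_to_ascii("βόλος.example") ==
      idna2003_to_ascii("βόλοσ.example"))       # True

# ZWNJ (U+200C) is mapped to nothing.
print(idna2003_to_ascii("a\u200cb.example"))    # ab.example

# U+1E9E was added after Unicode 3.2, so it is unassigned ('Cn') there.
print(unicodedata.ucd_3_2_0.category("\u1e9e")) # Cn
```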

IDNA2008
IDNA2008, which is defined in RFC 5890
(<http://tools.ietf.org/html/rfc5890>) and related documents, resolves
the issues found in IDNA2003.

This was done by allowing some characters that IDNA2003 either mapped
to other characters, mapped to nothing or rejected (causing the
preparation to fail). Domain names that use these characters are not
accessible to IDNA2003 implementations, of course.

However, IDNA2008 also disallowed a large number of characters that had
been allowed in IDNA2003 (mostly symbols). An implementation of
IDNA2008 would therefore no longer be able to access domain names such
as "♥.com", which had been registered under IDNA2003.
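
Why symbols fall out of IDNA2008 can be approximated with a rough
sketch. It assumes only the "LetterDigits" general-category rule from
RFC 5892 and ignores the exceptions, contextual rules and stability
checks that a real implementation needs; the function names are
hypothetical. Python's "idna" codec (IDNA2003) is used for contrast.

```python
import unicodedata

# General categories admitted by the LetterDigits rule (RFC 5892, 2.1).
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_letter_digit(ch: str) -> bool:
    """Approximation only: test the general category of one character."""
    return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES


def non_letter_digits(label: str) -> list:
    """Return the characters (other than '-') that fail the rule."""
    return [ch for ch in label if ch != "-" and not is_letter_digit(ch)]


print(non_letter_digits("bücher"))   # []
print(non_letter_digits("♥"))        # ['♥'] (category So, a symbol)

# The IDNA2003 codec, by contrast, still converts the symbol label.
print("♥.com".encode("idna").decode("ascii").startswith("xn--"))  # True
```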

UTS #46
Unicode Technical Standard #46 (UTS #46,
<http://unicode.org/reports/tr46/>) solves this problem by allowing
domain names that are valid in either IDNA2003 or IDNA2008.

This makes UTS #46 well suited for domain lookup (be liberal in
what you accept) but unsuitable for validating domain names prior to
registration (be conservative in what you send).

Claus Faerber <CFAERBER@cpan.org>

perl v5.30.1                  2020-01-30            Net::IDN::Standards(3)