Test::utf8(3)         User Contributed Perl Documentation        Test::utf8(3)

NAME
       Test::utf8 - handy utf8 tests

SYNOPSIS
         # check the string is good
         is_valid_string($string);   # check the string is valid
         is_sane_utf8($string);      # check not double encoded

         # check the string has certain attributes
         is_flagged_utf8($string1);      # has utf8 flag set
         is_within_ascii($string2);      # only has ascii chars in it
         isnt_within_ascii($string3);    # has chars outside the ascii range
         is_within_latin_1($string4);    # only has latin-1 chars in it
         isnt_within_latin_1($string5);  # has chars outside the latin-1 range

DESCRIPTION
       This module is a collection of tests useful for dealing with utf8
       strings in Perl.

       This module has two types of tests: the validity tests check that a
       string is valid and not corrupt, whereas the characteristic tests
       check that a string has a given set of characteristics.

   Validity Tests
       is_valid_string($string, $testname)
           Checks if the string is "valid", i.e. this passes and returns true
           unless the internal utf8 flag has been set on a scalar that isn't
           made up of a valid utf-8 byte sequence.

           This should never happen and, in theory, this test should always
           pass. Unless you (or a module you use) go monkeying around inside
           a scalar using Encode's private functions or XS code, you should
           never end up in a situation where you've got a corrupt scalar. But
           if you do, this function should help you detect the problem.

           To be clear, here's an example of the error case this can detect:

               my $mark = "Mark";
               my $leon = "L\x{e9}on";
               is_valid_string($mark);  # passes, not utf-8
               is_valid_string($leon);  # passes, not utf-8

               my $iloveny = "I \x{2665} NY";
               is_valid_string($iloveny);  # passes, proper utf-8

               my $acme = "L\x{c3}\x{a9}on";
               Encode::_utf8_on($acme);    # (please don't do things like this)
               is_valid_string($acme);     # passes, proper utf-8 byte sequence upgraded

               Encode::_utf8_on($leon);    # (this is why you don't do things like this)
               is_valid_string($leon);     # fails! the byte \x{e9} isn't valid utf-8

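           As a rough illustration of what a corrupt scalar means here, the
           sketch below uses only core Perl. It assumes that utf8::valid,
           which reports whether a scalar's buffer is consistent with its
           utf8 flag, is a fair stand-in for the check; this is not
           necessarily how is_valid_string is implemented, and
           scalar_is_consistent is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode ();   # loaded only so Encode::_utf8_on can corrupt a scalar below

# Hypothetical helper, not part of Test::utf8: true when the scalar's
# buffer is consistent with its utf8 flag. Uses $_[0] to avoid a copy.
sub scalar_is_consistent {
    return utf8::valid($_[0]) ? 1 : 0;
}

my $leon = "L\x{e9}on";                 # four bytes, utf8 flag off
my $before = scalar_is_consistent($leon);

Encode::_utf8_on($leon);                # flag forced on over non-utf-8 bytes
my $after = scalar_is_consistent($leon);

print "before: $before, after: $after\n";   # before: 1, after: 0
```

           The flag is flipped without touching the bytes, so the lone
           \x{e9} byte is left where a utf-8 multi-byte sequence is expected.
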
       is_sane_utf8($string, $name)
           This test fails if the string contains something that looks like it
           might be dodgy utf8, i.e. something that looks like the multi-byte
           sequence for a latin-1 character that perl hasn't been instructed
           to treat as such. Strings that contain no such sequence always
           pass.

           Some examples may help:

               # this will pass as it's a normal latin-1 string
               is_sane_utf8("Hello L\x{e9}on");

               # this will fail because the \x{c3}\x{a9} looks like the
               # utf8 byte sequence for e-acute
               my $string = "Hello L\x{c3}\x{a9}on";
               is_sane_utf8($string);

               # this will pass because the utf8 is correctly interpreted as utf8
               Encode::_utf8_on($string);
               is_sane_utf8($string);

           Obviously this isn't a hundred percent reliable. The edge case
           where this will fail is where you have "\x{c2}" (which is "LATIN
           CAPITAL LETTER A WITH CIRCUMFLEX") or "\x{c3}" (which is "LATIN
           CAPITAL LETTER A WITH TILDE") followed by one of the latin-1
           punctuation symbols.

               # a capital letter A with circumflex surrounded by smart quotes
               # this will fail because it'll see the "\x{c2}\x{94}" and think
               # it's actually the utf8 sequence for the end smart quote
               is_sane_utf8("\x{93}\x{c2}\x{94}");

           However, since this hardly ever comes up, this test is reasonably
           reliable in most cases. Still, take care where dynamic data is
           placed next to latin-1 punctuation, to avoid spurious failures.

           There are two situations that cause this test to fail. Either the
           string contains utf8 byte sequences but hasn't been flagged as
           utf8 (this normally means that you got it from an external source
           like a C library; when Perl needs to store a string internally as
           utf8 it does its own encoding and flagging transparently), or a
           utf8 flagged string contains byte sequences that, when translated
           to characters, themselves look like a utf8 byte sequence. The test
           diagnostics tell you which is the case.

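           The heuristic described above can be sketched with a single
           regular expression. This is a minimal sketch under the assumption
           that "looks like utf8" means a \x{c2} or \x{c3} character followed
           by a character in the range \x{80} to \x{bf}; the module's real
           pattern may differ, and looks_double_encoded is a hypothetical
           helper name.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper, not part of Test::utf8: true when the string holds a
# pair of characters matching the UTF-8 encoding of U+0080..U+00FF.
sub looks_double_encoded {
    my ($string) = @_;
    return $string =~ /[\xC2\xC3][\x80-\xBF]/ ? 1 : 0;
}

my $plain   = "L\x{e9}on";              # one latin-1 character for e-acute
my $encoded = encode('UTF-8', $plain);  # unflagged bytes: L C3 A9 o n
# Encoded twice but decoded once: a flagged string whose characters
# themselves look like a UTF-8 byte sequence.
my $twice   = decode('UTF-8', encode('UTF-8', $encoded));
my $quotes  = "\x{93}\x{c2}\x{94}";     # the unreliable edge case

printf "%-8s %d\n", $_->[0], looks_double_encoded($_->[1])
    for [plain => $plain], [encoded => $encoded],
        [twice => $twice], [quotes => $quotes];
```

           Here $encoded and $twice correspond to the two failure situations,
           while $quotes is flagged only because of the edge case, which is
           why the heuristic cannot be fully reliable.
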
   String Characteristic Tests
       These routines allow you to check the range of characters in a string.
       Note that these routines are blind to the actual encoding perl
       internally uses to store the characters; they just check if the string
       contains only characters that can be represented in the named encoding:

       is_within_ascii
           Tests that a string only contains characters that are in the ASCII
           character set.

       is_within_latin_1
           Tests that a string only contains characters that are in latin-1.

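       The range checks above can be approximated with negated character
       classes. This is a sketch of the idea rather than the module's code;
       within_ascii and within_latin_1 are hypothetical helper names.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Test::utf8: true when no character
# falls outside the named range.
sub within_ascii   { return $_[0] !~ /[^\x00-\x7F]/ ? 1 : 0 }
sub within_latin_1 { return $_[0] !~ /[^\x00-\xFF]/ ? 1 : 0 }

my $mark  = "Mark";
my $leon  = "L\x{e9}on";
utf8::upgrade($leon);            # stored as utf8 internally, same characters
my $heart = "I \x{2665} NY";

for my $pair ([mark => $mark], [leon => $leon], [heart => $heart]) {
    my ($name, $string) = @$pair;
    printf "%-6s ascii: %d, latin-1: %d\n",
        $name, within_ascii($string), within_latin_1($string);
}
```

       Because the checks look at characters rather than storage, $leon is
       still within latin-1 even after utf8::upgrade changes how perl holds
       it internally.
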
       Simply check if a scalar is or isn't flagged as utf8 by perl's
       internals:

       is_flagged_utf8($string, $name)
           Passes if the string is flagged by perl's internals as utf8, fails
           if it's not.

       isnt_flagged_utf8($string, $name)
           The opposite of "is_flagged_utf8", passes if and only if the string
           isn't flagged as utf8 by perl's internals.

           Note: you can refer to this function as "isn't_flagged_utf8" if you
           really want to.

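       To see why the flag is a separate question from the characters a
       string holds, the following sketch uses core Perl's utf8::is_utf8 and
       utf8::upgrade. It illustrates the flag itself and is not taken from
       the module.

```perl
use strict;
use warnings;

my $bytes    = "L\x{e9}on";       # fits in latin-1, so the flag stays off
my $upgraded = $bytes;
utf8::upgrade($upgraded);         # same characters, utf8 storage, flag on
my $wide     = "I \x{2665} NY";   # U+2665 forces utf8 storage, flag on

printf "bytes: %d, upgraded: %d, wide: %d, equal: %d\n",
    utf8::is_utf8($bytes)    ? 1 : 0,
    utf8::is_utf8($upgraded) ? 1 : 0,
    utf8::is_utf8($wide)     ? 1 : 0,
    $bytes eq $upgraded      ? 1 : 0;
# bytes: 0, upgraded: 1, wide: 1, equal: 1
```

       The two copies of the name compare equal even though only one is
       flagged, so a flag test says something about storage, not content.
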
AUTHOR
       Written by Mark Fowler mark@twoshortplanks.com

COPYRIGHT
       Copyright Mark Fowler 2004,2012. All rights reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

BUGS
       None known. Please report any to me via the CPAN RT system. See
       http://rt.cpan.org/ for more details.

SEE ALSO
       Test::DoubleEncodedEntities for testing for double encoded HTML
       entities.

perl v5.34.0                      2022-01-21                     Test::utf8(3)