Test::utf8(3pm)

Test::utf8(3)         User Contributed Perl Documentation        Test::utf8(3)

NAME

       Test::utf8 - handy utf8 tests

SYNOPSIS

         # check the string is good
         is_valid_string($string);   # check the string is valid
         is_sane_utf8($string);      # check not double encoded

         # check the string has certain attributes
         is_flagged_utf8($string1);     # has utf8 flag set
         is_within_ascii($string2);     # only has ascii chars in it
         isnt_within_ascii($string3);   # has chars outside the ascii range
         is_within_latin_1($string4);   # only has latin-1 chars in it
         isnt_within_latin_1($string5); # has chars outside the latin-1 range

DESCRIPTION

       This module is a collection of tests useful for dealing with utf8
       strings in Perl.

       This module has two types of tests: the validity tests check whether a
       string is valid and not corrupt, whereas the characteristic tests
       check that a string has a given set of characteristics.

   Validity Tests
       is_valid_string($string, $testname)
           Checks if the string is "valid", i.e. this passes and returns true
           unless the internal utf8 flag has been set on a scalar that isn't
           made up of a valid utf-8 byte sequence.

           This should never happen and, in theory, this test should always
           pass. Unless you (or a module you use) go monkeying around inside
           a scalar using Encode's private functions or XS code, you should
           never end up with a corrupt scalar.  But if you do, this function
           should help you detect the problem.

           To be clear, here's an example of the error case this can detect:

             my $mark = "Mark";
             my $leon = "L\x{e9}on";
             is_valid_string($mark);  # passes, not utf-8
             is_valid_string($leon);  # passes, not utf-8

             my $iloveny = "I \x{2665} NY";
             is_valid_string($iloveny);    # passes, proper utf-8

             my $acme = "L\x{c3}\x{a9}on";
             Encode::_utf8_on($acme);      # (please don't do things like this)
             is_valid_string($acme);       # passes, proper utf-8 byte sequence upgraded

             Encode::_utf8_on($leon);      # (this is why you don't do things like this)
             is_valid_string($leon);       # fails! the byte \x{e9} isn't valid utf-8

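           The corrupt state described above can be reproduced with core
           Perl alone.  The following is a minimal sketch, not the module's
           implementation: "is_corrupt_scalar" is a hypothetical helper
           built on the core "utf8::is_utf8" and "utf8::valid" functions,
           and the sketch assumes Encode is installed.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Test::utf8): true when the scalar is
# flagged as utf8 but its internal bytes are not well-formed utf-8.
# $_[0] is used directly so the caller's scalar is inspected, not a copy.
sub is_corrupt_scalar {
    return utf8::is_utf8($_[0]) && !utf8::valid($_[0]) ? 1 : 0;
}

my $leon   = "L\x{e9}on";               # latin-1 bytes, utf8 flag off
my $before = is_corrupt_scalar($leon);  # flag is off, so not corrupt

Encode::_utf8_on($leon);                # force the flag on (don't do this)
my $after  = is_corrupt_scalar($leon);  # flagged, but \xe9 "o" is not utf-8

printf "before: %d, after: %d\n", $before, $after;
```

           The helper reports a problem only after the flag has been forced
           on, which is the situation "is_valid_string" is meant to catch.
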
       is_sane_utf8($string, $name)
           This test fails if the string contains something that looks like it
           might be dodgy utf8, i.e. something that looks like the multi-byte
           sequence for a latin-1 character that perl hasn't been instructed
           to treat as such.  Strings that are not utf8 always automatically
           pass.

           Some examples may help:

             # this will pass as it's a normal latin-1 string
             is_sane_utf8("Hello L\x{e9}on");

             # this will fail because the \x{c3}\x{a9} looks like the
             # utf8 byte sequence for e-acute
             my $string = "Hello L\x{c3}\x{a9}on";
             is_sane_utf8($string);

             # this will pass because the utf8 is correctly interpreted as utf8
             Encode::_utf8_on($string);
             is_sane_utf8($string);

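           To show where such suspicious sequences come from, here is a
           rough sketch using Encode's "encode" and "decode".  The
           "looks_double_encoded" helper is hypothetical and deliberately
           simplified (it only looks for "\x{c2}" or "\x{c3}" followed by a
           character in the range "\x{80}" to "\x{bf}"); it is not the
           check the module itself performs.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical, simplified heuristic (not the module's implementation):
# \xC2 or \xC3 followed by \x80-\xBF is what the utf-8 encoding of a
# latin-1 character looks like when it is read one byte per character.
sub looks_double_encoded {
    my ($string) = @_;
    return $string =~ /[\x{c2}\x{c3}][\x{80}-\x{bf}]/ ? 1 : 0;
}

my $chars   = "Hello L\x{e9}on";        # e-acute as a single character
my $bytes   = encode('UTF-8', $chars);  # e-acute becomes \xC3\xA9
my $decoded = decode('UTF-8', $bytes);  # back to the original characters

my @results = map { looks_double_encoded($_) } ($chars, $bytes, $decoded);
print "@results\n";
```

           Only the encoded but never decoded string trips the heuristic;
           the original and the round-tripped strings do not.
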
           Obviously this isn't a hundred percent reliable.  The edge case
           where this will fail is where you have "\x{c2}" (which is "LATIN
           CAPITAL LETTER A WITH CIRCUMFLEX") or "\x{c3}" (which is "LATIN
           CAPITAL LETTER A WITH TILDE") followed by one of the latin-1
           punctuation symbols.

             # a capital letter A with circumflex surrounded by smart quotes
             # this will fail because it'll see the "\x{c2}\x{94}" and think
             # it's actually the utf8 sequence for the end smart quote
             is_sane_utf8("\x{93}\x{c2}\x{94}");

           However, since this rarely comes up, this test is reasonably
           reliable in most cases.  Still, take care where dynamic data is
           placed next to latin-1 punctuation, to avoid spurious failures.

           There are two situations that cause this test to fail.  Either the
           string contains utf8 byte sequences but hasn't been flagged as
           utf8 (this normally means that you got it from an external source
           like a C library; when Perl needs to store a string internally as
           utf8 it does its own encoding and flagging transparently), or a
           utf8 flagged string contains byte sequences that, when translated
           to characters, themselves look like a utf8 byte sequence.  The
           test diagnostics tell you which is the case.

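           The two failure situations can be constructed with Encode, as in
           the sketch below.  The variable names are illustrative only, and
           the sketch merely builds the two kinds of string; it does not
           reproduce the module's diagnostics.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "L\x{e9}on";

# Situation 1: raw utf-8 bytes that were never decoded (flag off).
my $undecoded = encode('UTF-8', $chars);

# Situation 2: text encoded twice but decoded only once, so the flagged
# string holds the characters \x{c3}\x{a9} instead of a single e-acute.
my $double = decode('UTF-8', encode('UTF-8', encode('UTF-8', $chars)));

for my $case ([undecoded => $undecoded], [double => $double]) {
    my ($label, $string) = @$case;
    printf "%-9s flagged=%d length=%d\n",
        $label, utf8::is_utf8($string) ? 1 : 0, length $string;
}
```

           Both strings contain the same five code points and compare equal
           with "eq"; they differ only in whether perl has flagged them as
           utf8.
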
   String Characteristic Tests
       These routines allow you to check the range of characters in a string.
       Note that these routines are blind to the actual encoding perl
       internally uses to store the characters; they just check if the string
       contains only characters that can be represented in the named encoding:

       is_within_ascii
           Tests that a string only contains characters that are in the ASCII
           character set.

       is_within_latin_1
           Tests that a string only contains characters that are in latin-1.

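       As an illustration of what "blind to the internal encoding" means,
       the sketch below checks character ranges with plain regular
       expressions.  "within_ascii" and "within_latin_1" are hypothetical
       stand-ins written for this example, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not the module's own code.
sub within_ascii   { return $_[0] =~ /\A[\x{00}-\x{7f}]*\z/ ? 1 : 0 }
sub within_latin_1 { return $_[0] =~ /\A[\x{00}-\x{ff}]*\z/ ? 1 : 0 }

my $leon     = "L\x{e9}on";      # e-acute is latin-1 but not ASCII
my $upgraded = $leon;
utf8::upgrade($upgraded);        # same characters, utf8 internal storage
my $heart    = "I \x{2665} NY";  # U+2665 is outside latin-1

for my $case ([leon => $leon], [upgraded => $upgraded], [heart => $heart]) {
    my ($label, $string) = @$case;
    printf "%-8s ascii=%d latin-1=%d\n",
        $label, within_ascii($string), within_latin_1($string);
}
```

       The upgraded copy gives the same answers as the original, since
       only the characters matter and not how perl stores them.
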
       The following tests simply check whether a scalar is or isn't flagged
       as utf8 by perl's internals:

       is_flagged_utf8($string, $name)
           Passes if the string is flagged by perl's internals as utf8, fails
           if it's not.

       isnt_flagged_utf8($string, $name)
           The opposite of "is_flagged_utf8", passes if and only if the string
           isn't flagged as utf8 by perl's internals.

           Note: you can refer to this function as "isn't_flagged_utf8" if you
           really want to.

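           For reference, the flag itself can be inspected with the core
           "utf8::is_utf8" function.  This is a small sketch with
           illustrative variable names; it assumes nothing about how the
           tests above are implemented.

```perl
use strict;
use warnings;

my $plain    = "L\x{e9}on";      # stored one byte per character, flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);        # same characters, flag on
my $wide     = "I \x{2665} NY";  # needs utf8 storage, so the flag is on

my %flag = (
    plain    => (utf8::is_utf8($plain)    ? 1 : 0),
    upgraded => (utf8::is_utf8($upgraded) ? 1 : 0),
    wide     => (utf8::is_utf8($wide)     ? 1 : 0),
);

printf "%-8s flagged=%d\n", $_, $flag{$_} for sort keys %flag;
print "plain and upgraded are equal\n" if $plain eq $upgraded;
```

           The flag says how perl stores the string, not what the string
           contains: the plain and upgraded copies are equal as strings.
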

AUTHOR

       Written by Mark Fowler mark@twoshortplanks.com

COPYRIGHT

       Copyright Mark Fowler 2004,2012.  All rights reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

BUGS

       None known.  Please report any to me via the CPAN RT system.  See
       http://rt.cpan.org/ for more details.
