File::BOM(3)          User Contributed Perl Documentation         File::BOM(3)



NAME
       File::BOM - Utilities for handling Byte Order Marks

SYNOPSIS
           use File::BOM qw( :all );

   high-level functions

           # read a file with encoding from the BOM:
           open_bom(FH, $file);
           open_bom(FH, $file, ':utf8'); # the same but with a default encoding

           # get encoding too
           $encoding = open_bom(FH, $file, ':utf8');

           # open a potentially unseekable file:
           ($encoding, $spillage) = open_bom(FH, $file, ':utf8');

           # change encoding of an open handle according to BOM
           $encoding = defuse(*HANDLE);
           ($encoding, $spillage) = defuse(*HANDLE);

           # Decode a string according to leading BOM:
           $unicode = decode_from_bom($string_with_bom);

           # Decode a string and get the encoding:
           ($unicode, $encoding) = decode_from_bom($string_with_bom);

   PerlIO::via interface

           # Read the Right Thing from a unicode file with BOM:
           open(HANDLE, '<:via(File::BOM)', $filename);

           # Writing little-endian UTF-16 file with BOM:
           open(HANDLE, '>:encoding(UTF-16LE):via(File::BOM)', $filename);

   lower-level functions

           # read BOM encoding from a filehandle:
           $encoding = get_encoding_from_filehandle(FH);

           # Get encoding even if FH is unseekable:
           ($encoding, $spillage) = get_encoding_from_filehandle(FH);

           # Get encoding from a known unseekable handle:
           ($encoding, $spillage) = get_encoding_from_stream(FH);

           # get encoding and BOM length from BOM at start of string:
           ($encoding, $offset) = get_encoding_from_bom($string);

   variables

           # print a BOM for a known encoding
           print FH $enc2bom{$encoding};

           # get an encoding from a known BOM
           $enc = $bom2enc{$bom};

DESCRIPTION
       This module provides functions for handling Unicode byte order marks,
       which are found at the beginning of some files and streams.

       For details about what a byte order mark is, see
       <http://www.unicode.org/unicode/faq/utf_bom.html#BOM>

       The intention of File::BOM is for files with BOMs to be readable as
       seamlessly as possible, regardless of the encoding used. To that end,
       several different interfaces are available, as shown in the synopsis
       above.

EXPORTS
       Nothing by default.

   symbols

       * open_bom()
       * defuse()
       * decode_from_bom()
       * get_encoding_from_filehandle()
       * get_encoding_from_stream()
       * get_encoding_from_bom()
       * %bom2enc
       * %enc2bom

   tags

       * :all
           All of the above

       * :subs
           subroutines only

       * :vars
           just %bom2enc and %enc2bom

VARIABLES
   %bom2enc

       Maps Byte Order marks to their encodings.

       The keys of this hash are strings which represent the BOMs; the values
       are their encodings, in a format which is understood by Encode.

       The encodings represented in this hash are: UTF-8, UTF-16BE, UTF-16LE,
       UTF-32BE and UTF-32LE.

   %enc2bom

       A reverse-lookup hash for %bom2enc, with a few aliases used in Encode,
       namely utf8, iso-10646-1 and UCS-2.

       Note that UTF-16, UTF-32 and UCS-4 are not included in this hash,
       mainly because Encode::encode automatically puts BOMs on output. See
       Encode::Unicode.

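       As a minimal sketch of how the two hashes relate, the following
       looks up the BOM bytes for one encoding and then maps them back to
       the encoding name. It assumes File::BOM is installed; the variable
       names and the hex formatting are illustrative only.

```perl
use strict;
use warnings;
use File::BOM qw( :vars );

# Look up the BOM bytes for an encoding name...
my $bom = $enc2bom{'UTF-16LE'};

# ...and map those bytes back to the encoding name.
my $encoding = $bom2enc{$bom};

# Render the BOM as hex so that it is readable on a terminal.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bom;
print "$encoding BOM: $hex\n";
```

       The same lookup can be used to write a BOM by hand, as in the
       synopsis: print FH $enc2bom{$encoding}.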
FUNCTIONS
   open_bom

           $encoding = open_bom(HANDLE, $filename, $default_mode)

           ($encoding, $spill) = open_bom(HANDLE, $filename, $default_mode)

       Opens HANDLE for reading on $filename, setting the mode to the
       appropriate encoding for the BOM stored in the file.

       On failure, a fatal error is raised; see the DIAGNOSTICS section for
       details on how to catch these. This is in order to allow the return
       value(s) to be used for other purposes.

       If the file doesn't contain a BOM, $default_mode is used instead.
       Hence:

           open_bom(FH, 'my_file.txt', ':utf8')

       opens my_file.txt for reading in an appropriate encoding found from the
       BOM in that file, or as a UTF-8 file if none is found.

       In the absence of a $default_mode argument, the following two calls
       should be equivalent:

           open_bom(FH, 'no_bom.txt');

           open(FH, '<', 'no_bom.txt');

       If an undefined value is passed as the handle, a symbol will be
       generated for it like open() does:

           # create filehandle on the fly
           $enc = open_bom(my $fh, $filename, ':utf8');
           $line = <$fh>;

       The filehandle will be positioned to read after the BOM. Unseekable
       files (e.g. fifos) cause a croak unless open_bom is called in list
       context, which captures any spillage read from the handle. Any
       spillage will be automatically decoded from the encoding, if found.

       e.g.

           # croak if my_socket is unseekable
           open_bom(FH, 'my_socket');

           # keep spillage if my_socket is unseekable
           ($encoding, $spillage) = open_bom(FH, 'my_socket');

           # discard any spillage from open_bom
           ($encoding) = open_bom(FH, 'my_socket');

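       The following self-contained sketch exercises open_bom on a regular
       (seekable) file. It assumes File::BOM is installed; the temporary
       file, its contents and the variable names are made up for the
       example.

```perl
use strict;
use warnings;
use Encode qw( encode );
use File::Temp qw( tempfile );
use File::BOM qw( open_bom );

# Create a sample UTF-16LE file that starts with a BOM.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xFF\xFE", encode('UTF-16LE', "caf\x{e9}\n");
close $tmp or die "Couldn't close '$path': $!";

# Let open_bom pick the encoding from the BOM; ':utf8' is only the
# fallback mode for files that have no BOM.
my $encoding = open_bom(my $fh, $path, ':utf8');
my $line = <$fh>;    # already decoded to Perl characters
close $fh;

print "Detected encoding: $encoding\n";
```

       Scalar context is fine here because the file is seekable. For a fifo
       or socket, list context would be needed to capture any spillage.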
   defuse

           $enc = defuse(FH);

           ($enc, $spill) = defuse(FH);

       FH should be a filehandle opened for reading. If a BOM is found, the
       relevant encoding layer will be pushed onto it by binmode. Spillage
       is returned as Unicode, not bytes.

       Any uncaptured spillage will be silently lost. If the handle is
       unseekable, use list context to avoid data loss.

       If no BOM is found, the mode will be unaffected.

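       A minimal sketch of defuse on a handle that is already open, assuming
       File::BOM is installed. The temporary file and the variable names are
       illustrative, and a lexical filehandle is used in place of the
       bareword FH shown above.

```perl
use strict;
use warnings;
use Encode qw( encode );
use File::Temp qw( tempfile );
use File::BOM qw( defuse );

# Create a sample UTF-8 file that starts with a BOM.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBF", encode('UTF-8', "na\x{ef}ve\n");
close $tmp or die "Couldn't close '$path': $!";

# Open as raw bytes so that the BOM is still visible, then let defuse
# consume it and push the matching encoding layer onto the handle.
open(my $fh, '<:raw', $path) or die "Couldn't open '$path': $!";
my $encoding = defuse($fh);
my $line = <$fh>;    # decoded according to the detected encoding
close $fh;

print "Detected encoding: $encoding\n";
```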
   decode_from_bom

           $unicode_string = decode_from_bom($string, $default, $check)

           ($unicode_string, $encoding) = decode_from_bom($string, $default, $check)

       Reads a BOM from the beginning of $string, decodes $string (minus the
       BOM) and returns it to you as a perl unicode string.

       If $string doesn't have a BOM, $default is used instead.

       $check, if supplied, is passed to Encode::decode as the third argument.

       If there's no BOM and no default, the original string is returned and
       encoding is ''.

       See Encode.

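       As an illustration, the sketch below decodes one string that carries
       a BOM and one that does not, falling back to a default encoding for
       the latter. It assumes File::BOM is installed; the sample strings and
       variable names are invented for the example.

```perl
use strict;
use warnings;
use File::BOM qw( decode_from_bom );

# A UTF-16BE byte string: BOM (FE FF) followed by "hi".
my $with_bom = "\xFE\xFF\x00h\x00i";
my ($text, $encoding) = decode_from_bom($with_bom);

# No BOM here, so the supplied default encoding is used instead.
my ($plain, $fallback) = decode_from_bom("caf\xE9", 'iso-8859-1');

print "Decoded from $encoding and from $fallback\n";
```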
   get_encoding_from_filehandle

           $encoding = get_encoding_from_filehandle(HANDLE)

           ($encoding, $spillage) = get_encoding_from_filehandle(HANDLE)

       Returns the encoding found in the given filehandle.

       The handle should be opened in a non-unicode way (e.g. mode '<:bytes')
       so that the BOM can be read in its natural state.

       After calling, the handle will be set to read at a point after the BOM
       (or at the beginning of the file if no BOM was found).

       If called in scalar context, unseekable handles cause a croak().

       If called in list context, unseekable handles will be read byte-by-byte
       and any spillage will be returned. See get_encoding_from_stream().

   get_encoding_from_stream

           ($encoding, $spillage) = get_encoding_from_stream(*FH);

       Read a BOM from an unrewindable source. This means reading the stream
       one byte at a time until either a BOM is found or every possible BOM is
       ruled out. Any non-BOM bytes read from the handle will be returned in
       $spillage.

       If a BOM is found and the spillage contains a partial character
       (judging by the expected character width for the encoding) more bytes
       will be read from the handle to ensure that a complete character is
       returned.

       Spillage is always in bytes, not characters.

       This function is less efficient than get_encoding_from_filehandle, but
       should work just as well on a seekable handle as on an unseekable one.

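       The sketch below shows one way the spillage might be recombined with
       the rest of the stream, using a pipe as the unseekable source. It
       assumes File::BOM is installed; the pipe, the sample data and the
       variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( encode decode );
use File::BOM qw( get_encoding_from_stream );

# A pipe cannot be rewound, so it stands in for a fifo or socket.
pipe(my $reader, my $writer) or die "pipe failed: $!";
binmode $_ for $reader, $writer;
print {$writer} "\xFF\xFE", encode('UTF-16LE', "hi\n");
close $writer or die "Couldn't close pipe: $!";

# Spillage holds any non-BOM bytes consumed while looking for the BOM.
my ($encoding, $spillage) = get_encoding_from_stream($reader);

# Read the remaining bytes and put the spillage back in front of them
# before decoding, since spillage is in bytes, not characters.
my $rest = do { local $/; <$reader> };
$rest = '' unless defined $rest;
close $reader;

my $text = decode($encoding, $spillage . $rest);
print "Detected encoding: $encoding\n";
```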
   get_encoding_from_bom

           ($encoding, $offset) = get_encoding_from_bom($string)

       Returns the encoding and length in bytes of the BOM in $string.

       If there is no BOM, the encoding returned is an empty string and
       $offset is zero.

       To get the data from the string, the following should work:

           use Encode;

           my($encoding, $offset) = get_encoding_from_bom($string);

           if ($encoding) {
               $string = decode($encoding, substr($string, $offset));
           }

PerlIO::via interface
       File::BOM can be used as a PerlIO::via interface.

           open(HANDLE, '<:via(File::BOM)', 'my_file.txt');

           open(HANDLE, '>:encoding(UTF-16LE):via(File::BOM)', 'out_file.txt');
           print HANDLE "foo\n"; # BOM is written to file here

       This method is less prone to errors on non-seekable files as spillage
       is incorporated into an internal buffer, but it doesn't give you any
       information about the encoding being used, or indeed whether or not a
       BOM was present.

       There are a few known problems with this interface, especially
       surrounding seek() and tell(); please see the BUGS section for more
       details.

   Reading

       The via(File::BOM) layer must be added before the handle is read from,
       otherwise any BOM will be missed. If there is no BOM, no decoding will
       be done.

       Because of a limitation in PerlIO::via, read() always works on bytes,
       not characters. BOM decoding will still be done but output will be
       bytes of UTF-8.

           use Encode;

           open(BOM, '<:via(File::BOM)', $file);
           $bytes_read = read(BOM, $buffer, $length);
           $unicode = decode('UTF-8', $buffer, Encode::FB_QUIET);

           # Now $unicode is valid unicode and $buffer contains any left-over bytes

   Writing

       Add the via(File::BOM) layer on top of a unicode encoding layer to
       print a BOM at the start of the output file. This needs to be done
       before any data is written. The BOM is written as part of the first
       print command on the handle, so if you don't print anything to the
       handle, you won't get a BOM.

       There is a "Wide character in print" warning generated when the
       via(File::BOM) layer doesn't receive utf8 on writing. This glitch was
       resolved in perl version 5.8.7, but if your perl version is older than
       that, you'll need to make sure that the via(File::BOM) layer receives
       utf8 like this:

           # This works OK
           open(FH, '>:encoding(UTF-16LE):via(File::BOM):utf8', $filename);

           # This generates warnings with older perls
           open(FH, '>:encoding(UTF-16LE):via(File::BOM)', $filename);

   Seeking

       When seeking with SEEK_SET, the length of any detected BOM is added to
       the position parameter. Thus:

           # Seek to end of BOM (not start of file!)
           seek(FILE_BOM_HANDLE, 0, SEEK_SET);

   Telling

       In order to work correctly with seek(), tell() also returns a position
       adjusted by the length of the BOM.

SEE ALSO
       * Encode
       * Encode::Unicode
       * <http://www.unicode.org/unicode/faq/utf_bom.html#BOM>

DIAGNOSTICS
       The following exceptions are raised via croak()

       * Couldn't read '<filename>': $!
           open_bom() couldn't open the given file for reading

       * Couldn't set binmode of handle opened on '<filename>' to '<mode>': $!
           open_bom() couldn't set the binmode of the handle

       * No string
           decode_from_bom called on an undefined value

       * Unseekable handle: $!
           get_encoding_from_filehandle() or open_bom() called on an
           unseekable file or handle in scalar context.

       * Couldn't read from handle: $!
           _get_encoding_seekable() couldn't read the handle. This function is
           called from get_encoding_from_filehandle(), defuse() and open_bom()

       * Couldn't reset read position: $!
           _get_encoding_seekable couldn't seek to the position after the BOM.

       * Couldn't read byte: $!
           get_encoding_from_stream couldn't read from the handle. This
           function is called from get_encoding_from_filehandle() and
           open_bom() when the handle or file is unseekable.

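       Since these are ordinary croak() exceptions, one way to catch them is
       a block eval, as in the sketch below. It assumes File::BOM is
       installed; the path is hypothetical and is expected not to exist.

```perl
use strict;
use warnings;
use File::BOM qw( open_bom );

# Hypothetical path that should not exist on any system.
my $missing = '/no/such/dir/no_such_file.txt';

# open_bom croaks on failure, so wrap the call in eval and inspect $@.
my $encoding = eval { open_bom(my $fh, $missing, ':utf8') };
my $error = $@;

if ($error) {
    print "open_bom failed: $error";
}
```

       Copying $@ into a lexical straight after the eval keeps the message
       from being clobbered by later code.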
BUGS
       Older versions of PerlIO::via have a few problems with writing, see
       above.

       The current version of PerlIO::via has limitations with regard to seek
       and tell; currently only line-wise seek and tell are supported by this
       module. If read() is used to read partial lines, tell() will still give
       the position of the end of the last line read.

       Under Windows, tell() seems to return erroneous values when reading
       files with Unix line endings.

       Under Windows, warnings may be generated when using the PerlIO::via
       interface to read UTF-16LE and UTF-32LE encoded files. This seems to be
       a bug in the relevant encoding(...) layers.

AUTHOR
       Matt Lawrence <mattlaw@cpan.org>

       With thanks to Mark Fowler and Steve Purkis for additional tests and
       advice.

COPYRIGHT
       Copyright 2005 Matt Lawrence, All Rights Reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.



perl v5.8.8                       2007-04-17                      File::BOM(3)