File::BOM(3)          User Contributed Perl Documentation         File::BOM(3)

NAME

       File::BOM - Utilities for handling Byte Order Marks

SYNOPSIS

           use File::BOM qw( :all );

       high-level functions

           # read a file with encoding from the BOM:
           open_bom(FH, $file);
           open_bom(FH, $file, ':utf8'); # the same but with a default encoding

           # get encoding too
           $encoding = open_bom(FH, $file, ':utf8');

           # open a potentially unseekable file:
           ($encoding, $spillage) = open_bom(FH, $file, ':utf8');

           # change encoding of an open handle according to BOM
           $encoding = defuse(*HANDLE);
           ($encoding, $spillage) = defuse(*HANDLE);

           # Decode a string according to leading BOM:
           $unicode = decode_from_bom($string_with_bom);

           # Decode a string and get the encoding:
           ($unicode, $encoding) = decode_from_bom($string_with_bom);

       PerlIO::via interface

           # Read the Right Thing from a unicode file with BOM:
           open(HANDLE, '<:via(File::BOM)', $filename);

           # Writing little-endian UTF-16 file with BOM:
           open(HANDLE, '>:encoding(UTF-16LE):via(File::BOM)', $filename);

       lower-level functions

           # read BOM encoding from a filehandle:
           $encoding = get_encoding_from_filehandle(FH);

           # Get encoding even if FH is unseekable:
           ($encoding, $spillage) = get_encoding_from_filehandle(FH);

           # Get encoding from a known unseekable handle:
           ($encoding, $spillage) = get_encoding_from_stream(FH);

           # get encoding and BOM length from BOM at start of string:
           ($encoding, $offset) = get_encoding_from_bom($string);

       variables

           # print a BOM for a known encoding
           print FH $enc2bom{$encoding};

           # get an encoding from a known BOM
           $enc = $bom2enc{$bom};

DESCRIPTION

       This module provides functions for handling Unicode byte order marks
       (BOMs), which are found at the beginning of some files and streams.

       For details about what a byte order mark is, see
       <http://www.unicode.org/unicode/faq/utf_bom.html#BOM>

       The intention of File::BOM is for files with BOMs to be readable as
       seamlessly as possible, regardless of the encoding used. To that end,
       several different interfaces are available, as shown in the synopsis
       above.

EXPORTS

       Nothing by default.

       symbols

       * open_bom()
       * defuse()
       * decode_from_bom()
       * get_encoding_from_filehandle()
       * get_encoding_from_stream()
       * get_encoding_from_bom()
       * %bom2enc
       * %enc2bom

       tags

       * :all
           All of the above

       * :subs
           subroutines only

       * :vars
           just %bom2enc and %enc2bom

VARIABLES

       %bom2enc

       Maps byte order marks to their encodings.

       The keys of this hash are strings which represent the BOMs; the values
       are their encodings, in a format which is understood by Encode.

       The encodings represented in this hash are: UTF-8, UTF-16BE, UTF-16LE,
       UTF-32BE and UTF-32LE.

       %enc2bom

       A reverse-lookup hash for %bom2enc, with a few aliases used in Encode,
       namely utf8, iso-10646-1 and UCS-2.

       Note that UTF-16, UTF-32 and UCS-4 are not included in this hash,
       mainly because Encode::encode automatically puts BOMs on output. See
       Encode::Unicode.

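       As a rough illustration of the kind of data these hashes hold, the
       sketch below builds a comparable pair of tables by hand and uses them
       to recognise a BOM at the start of a byte string. The names
       %my_bom2enc, %my_enc2bom and sniff_bom are hypothetical and are not
       part of File::BOM; the byte sequences are the standard BOMs for the
       five encodings listed above. It uses core Perl only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for %bom2enc: BOM byte strings mapped to encoding names.
my %my_bom2enc = (
    "\xEF\xBB\xBF"     => 'UTF-8',
    "\xFE\xFF"         => 'UTF-16BE',
    "\xFF\xFE"         => 'UTF-16LE',
    "\x00\x00\xFE\xFF" => 'UTF-32BE',
    "\xFF\xFE\x00\x00" => 'UTF-32LE',
);

# Reverse lookup in the spirit of %enc2bom (without the extra aliases).
my %my_enc2bom = reverse %my_bom2enc;

# Return (encoding, BOM length) for a byte string, or ('', 0) if no BOM.
# Longer BOMs are tried first so UTF-32LE is not mistaken for UTF-16LE.
sub sniff_bom {
    my ($bytes) = @_;
    for my $bom (sort { length $b <=> length $a } keys %my_bom2enc) {
        return ($my_bom2enc{$bom}, length $bom)
            if substr($bytes, 0, length $bom) eq $bom;
    }
    return ('', 0);
}
```

       The longest-first ordering matters because the UTF-16LE BOM is a
       prefix of the UTF-32LE BOM.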

FUNCTIONS

       open_bom

           $encoding = open_bom(HANDLE, $filename, $default_mode)

           ($encoding, $spill) = open_bom(HANDLE, $filename, $default_mode)

       Opens HANDLE for reading on $filename, setting the mode to the
       appropriate encoding for the BOM stored in the file.

       On failure, a fatal error is raised; see the DIAGNOSTICS section for
       details of the errors. Raising an error, rather than returning a
       status, leaves the return value(s) free to be used for other purposes.

       If the file doesn't contain a BOM, $default_mode is used instead.
       Hence:

           open_bom(FH, 'my_file.txt', ':utf8')

       opens my_file.txt for reading in an appropriate encoding found from the
       BOM in that file, or as a UTF-8 file if none is found.

       In the absence of a $default_mode argument, the following two calls
       should be equivalent:

           open_bom(FH, 'no_bom.txt');

           open(FH, '<', 'no_bom.txt');

       If an undefined value is passed as the handle, a symbol will be
       generated for it, as open() does:

           # create filehandle on the fly
           $enc = open_bom(my $fh, $filename, ':utf8');
           $line = <$fh>;

       The filehandle will be positioned to read from just after the BOM.
       Unseekable files (e.g. fifos) will cause croaking, unless the function
       is called in list context to catch spillage from the handle. Any
       spillage will be automatically decoded from the encoding, if found.

           e.g.

           # croak if my_socket is unseekable
           open_bom(FH, 'my_socket');

           # keep spillage if my_socket is unseekable
           ($encoding, $spillage) = open_bom(FH, 'my_socket');

           # discard any spillage from open_bom
           ($encoding) = open_bom(FH, 'my_socket');

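       Because open_bom() croaks rather than returning false, callers that
       want to recover from a missing or unreadable file can wrap the call in
       eval. The following is a minimal sketch under that assumption;
       slurp_with_bom is a hypothetical helper name, not something exported
       by File::BOM.

```perl
use strict;
use warnings;
use File::BOM qw( open_bom );

# Hypothetical helper: returns the detected encoding and a reference to the
# decoded lines, or dies with a message that includes File::BOM's error text.
sub slurp_with_bom {
    my ($path) = @_;
    my $fh;

    # open_bom() croaks on failure, so trap the error with eval.
    # Scalar context: an unseekable file would croak here too.
    my $encoding = eval { open_bom($fh, $path, ':utf8') };
    die "Could not open '$path': $@" if $@;

    my @lines = <$fh>;
    close $fh;
    return ($encoding, \@lines);
}
```

       A caller would then write something like
       my ($enc, $lines) = slurp_with_bom('my_file.txt'); and receive lines
       that have already been decoded according to the BOM, or as UTF-8 if
       the file has none.
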
       defuse

           $enc = defuse(FH);

           ($enc, $spill) = defuse(FH);

       FH should be a filehandle opened for reading. If a BOM is found, the
       relevant encoding layer is pushed onto it by binmode. Any spillage is
       returned as Unicode, not bytes.

       Any uncaptured spillage will be silently lost. If the handle is
       unseekable, use list context to avoid data loss.

       If no BOM is found, the mode will be unaffected.

       decode_from_bom

           $unicode_string = decode_from_bom($string, $default, $check)

           ($unicode_string, $encoding) = decode_from_bom($string, $default, $check)

       Reads a BOM from the beginning of $string, decodes $string (minus the
       BOM) and returns it to you as a perl unicode string.

       If $string doesn't have a BOM, $default is used instead.

       $check, if supplied, is passed to Encode::decode as the third argument.

       If there's no BOM and no default, the original string is returned and
       the encoding is ''.

       See Encode.

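       A short sketch of decode_from_bom() on in-memory data follows. The
       sample strings are made up for illustration; Encode::encode is used
       only to manufacture bytes that start with a BOM (U+FEFF encoded as the
       first character).

```perl
use strict;
use warnings;
use Encode qw( encode );
use File::BOM qw( decode_from_bom );

# Bytes with a leading UTF-16LE BOM: U+FEFF is encoded as the first character.
my $with_bom = encode('UTF-16LE', "\x{FEFF}caf\x{E9}");

# List context returns the decoded text and the encoding named by the BOM.
my ($text, $encoding) = decode_from_bom($with_bom);

# Bytes without a BOM: the second argument supplies the fallback encoding.
my $no_bom = encode('UTF-8', "na\x{EF}ve");
my ($fallback_text, $fallback_encoding) = decode_from_bom($no_bom, 'UTF-8');
```

       In both cases the returned text is a character string with the BOM,
       if there was one, already removed.
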
       get_encoding_from_filehandle

           $encoding = get_encoding_from_filehandle(HANDLE)

           ($encoding, $spillage) = get_encoding_from_filehandle(HANDLE)

       Returns the encoding found in the given filehandle.

       The handle should be opened in a non-unicode way (e.g. mode '<:bytes')
       so that the BOM can be read in its natural state.

       After calling, the handle will be set to read at a point after the BOM
       (or at the beginning of the file if no BOM was found).

       If called in scalar context, unseekable handles cause a croak().

       If called in list context, unseekable handles will be read byte-by-byte
       and any spillage will be returned. See get_encoding_from_stream().

       get_encoding_from_stream

           ($encoding, $spillage) = get_encoding_from_stream(*FH);

       Reads a BOM from an unrewindable source. This means reading the stream
       one byte at a time until either a BOM is found or every possible BOM is
       ruled out. Any non-BOM bytes read from the handle will be returned in
       $spillage.

       If a BOM is found and the spillage contains a partial character
       (judging by the expected character width for the encoding), more bytes
       will be read from the handle to ensure that a complete character is
       returned.

       Spillage is always in bytes, not characters.

       This function is less efficient than get_encoding_from_filehandle, but
       should work just as well on a seekable handle as on an unseekable one.

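       The byte-at-a-time approach described above can be sketched in core
       Perl as follows. This is an illustration of the idea only and is not
       File::BOM's own code: sniff_stream and %BOM_FOR are hypothetical
       names, and the sketch does not top up a partial character in the
       spillage the way get_encoding_from_stream() is described as doing.

```perl
use strict;
use warnings;

# BOM byte sequences for the five encodings listed under %bom2enc.
my %BOM_FOR = (
    'UTF-8'    => "\xEF\xBB\xBF",
    'UTF-16BE' => "\xFE\xFF",
    'UTF-16LE' => "\xFF\xFE",
    'UTF-32BE' => "\x00\x00\xFE\xFF",
    'UTF-32LE' => "\xFF\xFE\x00\x00",
);

# Hypothetical helper: read a binary handle one byte at a time and return
# (encoding, spillage). Encoding is '' when no BOM is present.
sub sniff_stream {
    my ($fh) = @_;
    my $seen = '';
    my ($encoding, $bom_len) = ('', 0);

    while (1) {
        # Remember an exact match; a longer BOM may still override it.
        for my $enc (keys %BOM_FOR) {
            ($encoding, $bom_len) = ($enc, length $seen)
                if $BOM_FOR{$enc} eq $seen;
        }

        # Stop once no longer BOM could start with the bytes read so far.
        my $n = length $seen;
        my $possible = grep {
            length($_) > $n && substr($_, 0, $n) eq $seen
        } values %BOM_FOR;
        last unless $possible;

        # Pull exactly one more byte; stop at end of input or on error.
        my $byte;
        last unless read($fh, $byte, 1);
        $seen .= $byte;
    }

    # Bytes read beyond the BOM (or all bytes, if no BOM) are spillage.
    return ($encoding, substr($seen, $bom_len));
}
```

       Note how a UTF-16LE BOM forces one extra byte to be read in order to
       rule out UTF-32LE; that extra byte is exactly the kind of spillage the
       list-context return value exists to preserve.
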
       get_encoding_from_bom

           ($encoding, $offset) = get_encoding_from_bom($string)

       Returns the encoding and length in bytes of the BOM in $string.

       If there is no BOM, an empty string is returned and $offset is zero.

       To get the data from the string, the following should work:

           use Encode;

           my($encoding, $offset) = get_encoding_from_bom($string);

           if ($encoding) {
               $string = decode($encoding, substr($string, $offset));
           }

PerlIO::via interface

       File::BOM can be used as a PerlIO::via interface.

           open(HANDLE, '<:via(File::BOM)', 'my_file.txt');

           open(HANDLE, '>:encoding(UTF-16LE):via(File::BOM)', 'out_file.txt');
           print HANDLE "foo\n"; # BOM is written to file here

       This method is less prone to errors on non-seekable files as spillage
       is incorporated into an internal buffer, but it doesn't give you any
       information about the encoding being used, or indeed whether or not a
       BOM was present.

       There are a few known problems with this interface, especially
       surrounding seek() and tell(). Please see the BUGS section for more
       details.

       Reading

       The via(File::BOM) layer must be added before the handle is read from,
       otherwise any BOM will be missed. If there is no BOM, no decoding will
       be done.

       Because of a limitation in PerlIO::via, read() always works on bytes,
       not characters. BOM decoding will still be done but output will be
       bytes of UTF-8.

           use Encode;

           open(BOM, '<:via(File::BOM)', $file);
           $bytes_read = read(BOM, $buffer, $length);
           $unicode = decode('UTF-8', $buffer, Encode::FB_QUIET);

           # Now $unicode is valid unicode and $buffer contains any left-over bytes

       Writing

       Add the via(File::BOM) layer on top of a unicode encoding layer to
       print a BOM at the start of the output file. This needs to be done
       before any data is written. The BOM is written as part of the first
       print command on the handle, so if you don't print anything to the
       handle, you won't get a BOM.

       A "Wide character in print" warning is generated when the
       via(File::BOM) layer doesn't receive utf8 on writing. This glitch was
       resolved in perl version 5.8.7, but if your perl version is older than
       that, you'll need to make sure that the via(File::BOM) layer receives
       utf8 like this:

           # This works OK
           open(FH, '>:encoding(UTF-16LE):via(File::BOM):utf8', $filename);

           # This generates warnings with older perls
           open(FH, '>:encoding(UTF-16LE):via(File::BOM)', $filename);

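       To check that a BOM really was emitted, the output can be reopened in
       raw mode and its first bytes inspected. The sketch below assumes a
       perl new enough not to need the extra :utf8 layer; the scratch file
       from File::Temp and the variable names are illustrative only.

```perl
use strict;
use warnings;
use File::BOM;
use File::Temp qw( tempfile );

# Hypothetical scratch file for the demonstration.
my (undef, $path) = tempfile();

# Encoding layer first, via(File::BOM) stacked on top of it.
open(my $out, '>:encoding(UTF-16LE):via(File::BOM)', $path)
    or die "Cannot write '$path': $!";
print {$out} "foo\n";    # the BOM goes out as part of this first print
close $out or die "Cannot close '$path': $!";

# Reopen without any decoding layers and look at the first two bytes.
open(my $raw, '<:raw', $path) or die "Cannot read '$path': $!";
read($raw, my $head, 2);
close $raw;

print 'first bytes: ', join(' ', map { sprintf '%02X', ord } split //, $head), "\n";
```

       For UTF-16LE the two leading bytes should be the little-endian BOM,
       FF FE.
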
       Seeking

       Seeking with SEEK_SET results in an offset equal to the length of any
       detected BOM being applied to the position parameter. Thus:

           # Seek to end of BOM (not start of file!)
           seek(FILE_BOM_HANDLE, 0, SEEK_SET);

       Telling

       In order to work correctly with seek(), tell() also returns a position
       adjusted by the length of the BOM.

SEE ALSO

       * Encode
       * Encode::Unicode
       * <http://www.unicode.org/unicode/faq/utf_bom.html#BOM>

DIAGNOSTICS

       The following exceptions are raised via croak():

       * Couldn't read '<filename>': $!
           open_bom() couldn't open the given file for reading.

       * Couldn't set binmode of handle opened on '<filename>' to '<mode>': $!
           open_bom() couldn't set the binmode of the handle.

       * No string
           decode_from_bom() was called on an undefined value.

       * Unseekable handle: $!
           get_encoding_from_filehandle() or open_bom() was called on an
           unseekable file or handle in scalar context.

       * Couldn't read from handle: $!
           _get_encoding_seekable() couldn't read the handle. This function is
           called from get_encoding_from_filehandle(), defuse() and open_bom().

       * Couldn't reset read position: $!
           _get_encoding_seekable() couldn't seek to the position after the BOM.

       * Couldn't read byte: $!
           get_encoding_from_stream() couldn't read from the handle. This
           function is called from get_encoding_from_filehandle() and
           open_bom() when the handle or file is unseekable.

BUGS

       Older versions of PerlIO::via have a few problems with writing, see
       above.

       The current version of PerlIO::via has limitations with regard to seek
       and tell; currently only line-wise seek and tell are supported by this
       module. If read() is used to read partial lines, tell() will still give
       the position of the end of the last line read.

       Under Windows, tell() seems to return erroneous values when reading
       files with Unix line endings.

       Under Windows, warnings may be generated when using the PerlIO::via
       interface to read UTF-16LE and UTF-32LE encoded files. This seems to be
       a bug in the relevant encoding(...) layers.

AUTHOR

       Matt Lawrence <mattlaw@cpan.org>

       With thanks to Mark Fowler and Steve Purkis for additional tests and
       advice.

COPYRIGHT

       Copyright 2005 Matt Lawrence, All Rights Reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.



perl v5.8.8                       2007-04-17                      File::BOM(3)