File::BOM(3pm)        User Contributed Perl Documentation       File::BOM(3pm)

NAME

       File::BOM - Utilities for handling Byte Order Marks


SYNOPSIS

           use File::BOM qw( :all );

   high-level functions
           # read a file with encoding from the BOM:
           open_bom(FH, $file);
           open_bom(FH, $file, ':utf8'); # the same but with a default encoding

           # get encoding too
           $encoding = open_bom(FH, $file, ':utf8');

           # open a potentially unseekable file:
           ($encoding, $spillage) = open_bom(FH, $file, ':utf8');

           # change encoding of an open handle according to BOM
           $encoding = defuse(*HANDLE);
           ($encoding, $spillage) = defuse(*HANDLE);

           # Decode a string according to leading BOM:
           $unicode = decode_from_bom($string_with_bom);

           # Decode a string and get the encoding:
           ($unicode, $encoding) = decode_from_bom($string_with_bom);

   PerlIO::via interface
           # Read the Right Thing from a unicode file with BOM:
           open(HANDLE, '<:via(File::BOM)', $filename);

           # Writing little-endian UTF-16 file with BOM:
           open(HANDLE, '>:encoding(UTF-16LE):via(File::BOM)', $filename);

   lower-level functions
           # read BOM encoding from a filehandle:
           $encoding = get_encoding_from_filehandle(FH);

           # Get encoding even if FH is unseekable:
           ($encoding, $spillage) = get_encoding_from_filehandle(FH);

           # Get encoding from a known unseekable handle:
           ($encoding, $spillage) = get_encoding_from_stream(FH);

           # get encoding and BOM length from BOM at start of string:
           ($encoding, $offset) = get_encoding_from_bom($string);

   variables
           # print a BOM for a known encoding
           print FH $enc2bom{$encoding};

           # get an encoding from a known BOM
           $enc = $bom2enc{$bom};


DESCRIPTION

       This module provides functions for handling Unicode byte order marks,
       which are to be found at the beginning of some files and streams.

       For details about what a byte order mark is, see
       <http://www.unicode.org/unicode/faq/utf_bom.html#BOM>

       The intention of File::BOM is for files with BOMs to be readable as
       seamlessly as possible, regardless of the encoding used. To that end,
       several different interfaces are available, as shown in the synopsis
       above.


EXPORTS

       Nothing by default.

   symbols
       •   open_bom()

       •   defuse()

       •   decode_from_bom()

       •   get_encoding_from_filehandle()

       •   get_encoding_from_stream()

       •   get_encoding_from_bom()

       •   %bom2enc

       •   %enc2bom

   tags
       •   :all

           All of the above

       •   :subs

           subroutines only

       •   :vars

           just %bom2enc and %enc2bom


VARIABLES

   %bom2enc
       Maps Byte Order Marks to their encodings.

       The keys of this hash are strings which represent the BOMs; the values
       are their encodings, in a format which is understood by Encode.

       The encodings represented in this hash are: UTF-8, UTF-16BE, UTF-16LE,
       UTF-32BE and UTF-32LE.

   %enc2bom
       A reverse-lookup hash for %bom2enc, with a few aliases used in Encode,
       namely utf8, iso-10646-1 and UCS-2.

       Note that UTF-16, UTF-32 and UCS-4 are not included in this hash,
       mainly because Encode::encode automatically puts BOMs on output. See
       Encode::Unicode.

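       As a minimal sketch of how the two hashes relate (assuming File::BOM
       is installed; the variable names are illustrative only), the following
       looks up the BOM bytes for an encoding and maps them back again:

```perl
use strict;
use warnings;
use File::BOM qw( :vars );

# Look up the BOM bytes for an encoding and show them as hex.
my $bom = $enc2bom{'UTF-16BE'};
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bom;

# Map the BOM bytes back to the encoding name.
my $enc = $bom2enc{$bom};

# Aliases such as 'utf8' exist only in %enc2bom and share the UTF-8 BOM.
my $same = $enc2bom{'utf8'} eq $enc2bom{'UTF-8'};

print "BOM bytes: $hex\n";
print "Encoding:  $enc\n";
```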

FUNCTIONS

   open_bom
           $encoding = open_bom(HANDLE, $filename, $default_mode)

           ($encoding, $spill) = open_bom(HANDLE, $filename, $default_mode)

       Opens HANDLE for reading on $filename, setting the mode to the
       appropriate encoding for the BOM stored in the file.

       On failure, a fatal error is raised; see the DIAGNOSTICS section for
       the exceptions that can be caught. This is in order to allow the
       return value(s) to be used for other purposes.

       If the file doesn't contain a BOM, $default_mode is used instead.
       Hence:

           open_bom(FH, 'my_file.txt', ':utf8')

       opens my_file.txt for reading in an appropriate encoding found from the
       BOM in that file, or as a UTF-8 file if none is found.

       In the absence of a $default_mode argument, the following two calls
       should be equivalent:

           open_bom(FH, 'no_bom.txt');

           open(FH, '<', 'no_bom.txt');

       If an undefined value is passed as the handle, a symbol will be
       generated for it like open() does:

           # create filehandle on the fly
           $enc = open_bom(my $fh, $filename, ':utf8');
           $line = <$fh>;

       The filehandle will be cued up to read after the BOM. Unseekable files
       (e.g. fifos) will cause croaking, unless open_bom is called in list
       context to catch spillage from the handle. Any spillage will be
       automatically decoded from the encoding, if found.

       For example:

           # croak if my_socket is unseekable
           open_bom(FH, 'my_socket');

           # keep spillage if my_socket is unseekable
           ($encoding, $spillage) = open_bom(FH, 'my_socket');

           # discard any spillage from open_bom
           ($encoding) = open_bom(FH, 'my_socket');

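       The following is a small end-to-end sketch rather than a definitive
       recipe. It assumes a seekable temporary file created with File::Temp;
       the sample text and variable names are illustrative only. It shows
       that a BOM in the file takes priority over the default mode:

```perl
use strict;
use warnings;
use Encode qw( encode );
use File::Temp qw( tempfile );
use File::BOM qw( open_bom %enc2bom );

# Create a small UTF-16LE file that starts with a BOM.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} $enc2bom{'UTF-16LE'}, encode('UTF-16LE', "caf\x{e9}\n");
close $tmp or die "close failed: $!";

# The BOM found in the file is used instead of the ':utf8' default.
my $encoding = open_bom(my $fh, $path, ':utf8');
my $line = <$fh>;
close $fh;
chomp $line;

printf "%s: %d characters\n", $encoding, length $line;
```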
   defuse
           $enc = defuse(FH);

           ($enc, $spill) = defuse(FH);

       FH should be a filehandle opened for reading; it will have the relevant
       encoding layer pushed onto it by binmode if a BOM is found. Spillage
       should be Unicode, not bytes.

       Any uncaptured spillage will be silently lost. If the handle is
       unseekable, use list context to avoid data loss.

       If no BOM is found, the mode will be unaffected.

   decode_from_bom
           $unicode_string = decode_from_bom($string, $default, $check)

           ($unicode_string, $encoding) = decode_from_bom($string, $default, $check)

       Reads a BOM from the beginning of $string, decodes $string (minus the
       BOM) and returns it to you as a perl unicode string.

       If $string doesn't have a BOM, $default is used instead.

       $check, if supplied, is passed to Encode::decode as the third argument.

       If there's no BOM and no default, the original string is returned and
       encoding is ''.

       See Encode.

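       A short sketch of both cases, assuming byte strings built with
       Encode::encode (the sample text and variable names are illustrative
       only): one string carries a BOM, the other relies on the default.

```perl
use strict;
use warnings;
use Encode qw( encode );
use File::BOM qw( decode_from_bom %enc2bom );

# A byte string carrying a UTF-16BE BOM followed by encoded text.
my $with_bom = $enc2bom{'UTF-16BE'} . encode('UTF-16BE', "gr\x{fc}n");
my ($text, $encoding) = decode_from_bom($with_bom);

# No BOM present: the supplied default encoding is used instead.
my $no_bom = encode('UTF-8', "gr\x{fc}n");
my ($text2, $encoding2) = decode_from_bom($no_bom, 'UTF-8');

print "Decoded via BOM as $encoding\n";
print "Both strings match\n" if $text eq $text2;
```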
   get_encoding_from_filehandle
           $encoding = get_encoding_from_filehandle(HANDLE)

           ($encoding, $spillage) = get_encoding_from_filehandle(HANDLE)

       Returns the encoding found in the given filehandle.

       The handle should be opened in a non-unicode way (e.g. mode '<:bytes')
       so that the BOM can be read in its natural state.

       After calling, the handle will be set to read at a point after the BOM
       (or at the beginning of the file if no BOM was found).

       If called in scalar context, unseekable handles cause a croak().

       If called in list context, unseekable handles will be read byte-by-byte
       and any spillage will be returned. See get_encoding_from_stream().

   get_encoding_from_stream
           ($encoding, $spillage) = get_encoding_from_stream(*FH);

       Read a BOM from an unrewindable source. This means reading the stream
       one byte at a time until either a BOM is found or every possible BOM is
       ruled out. Any non-BOM bytes read from the handle will be returned in
       $spillage.

       If a BOM is found and the spillage contains a partial character
       (judging by the expected character width for the encoding) more bytes
       will be read from the handle to ensure that a complete character is
       returned.

       Spillage is always in bytes, not characters.

       This function is less efficient than get_encoding_from_filehandle, but
       should work just as well on a seekable handle as on an unseekable one.

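       As an illustrative sketch only, the following uses a pipe as a
       stand-in for an unseekable source (the pipe and the variable names are
       assumptions made for the example). Because spillage is returned as
       bytes, it is prepended to the rest of the stream before decoding:

```perl
use strict;
use warnings;
use Encode qw( decode );
use File::BOM qw( get_encoding_from_stream %enc2bom );

# A pipe cannot be rewound, so it stands in for a socket or fifo here.
pipe(my $reader, my $writer) or die "pipe failed: $!";
binmode $_ for $reader, $writer;
print {$writer} $enc2bom{'UTF-8'}, "hello\n";
close $writer or die "close failed: $!";

# Detect the BOM, keeping any non-BOM bytes that had to be read.
my ($encoding, $spillage) = get_encoding_from_stream($reader);

# Read the remainder and decode spillage and remainder together.
my $rest = do { local $/; <$reader> };
$rest = '' unless defined $rest;
my $data = decode($encoding, $spillage . $rest);
close $reader;

print "Read '$encoding' data: $data";
```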
   get_encoding_from_bom
           ($encoding, $offset) = get_encoding_from_bom($string)

       Returns the encoding and length in bytes of the BOM in $string.

       If there is no BOM, an empty string is returned and $offset is zero.

       To get the data from the string, the following should work:

           use Encode;

           my($encoding, $offset) = get_encoding_from_bom($string);

           if ($encoding) {
               $string = decode($encoding, substr($string, $offset));
           }


PerlIO::via interface

       File::BOM can be used as a PerlIO::via interface.

           open(HANDLE, '<:via(File::BOM)', 'my_file.txt');

           open(HANDLE, '>:encoding(UTF-16LE):via(File::BOM)', 'out_file.txt');
           print HANDLE "foo\n"; # BOM is written to file here

       This method is less prone to errors on non-seekable files as spillage
       is incorporated into an internal buffer, but it doesn't give you any
       information about the encoding being used, or indeed whether or not a
       BOM was present.

       There are a few known problems with this interface, especially
       surrounding seek() and tell(). Please see the BUGS section for more
       details.

   Reading
       The via(File::BOM) layer must be added before the handle is read from,
       otherwise any BOM will be missed. If there is no BOM, no decoding will
       be done.

       Because of a limitation in PerlIO::via, read() always works on bytes,
       not characters. BOM decoding will still be done but output will be
       bytes of UTF-8.

           use Encode;

           open(BOM, '<:via(File::BOM)', $file);
           $bytes_read = read(BOM, $buffer, $length);
           $unicode = decode('UTF-8', $buffer, Encode::FB_QUIET);

           # Now $unicode is valid unicode and $buffer contains any left-over bytes

   Writing
       Add the via(File::BOM) layer on top of a unicode encoding layer to
       print a BOM at the start of the output file. This needs to be done
       before any data is written. The BOM is written as part of the first
       print command on the handle, so if you don't print anything to the
       handle, you won't get a BOM.

       There is a "Wide character in print" warning generated when the
       via(File::BOM) layer doesn't receive utf8 on writing. This glitch was
       resolved in perl version 5.8.7, but if your perl version is older than
       that, you'll need to make sure that the via(File::BOM) layer receives
       utf8 like this:

           # This works OK
           open(FH, '>:encoding(UTF-16LE):via(File::BOM):utf8', $filename);

           # This generates warnings with older perls
           open(FH, '>:encoding(UTF-16LE):via(File::BOM)', $filename);

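       A round-trip sketch of the layer, assuming a temporary file from
       File::Temp (the file contents and variable names are illustrative
       only). It writes through the layer, checks the raw bytes for the
       UTF-16LE BOM, and then reads the text back through the layer:

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use File::BOM ();

my (undef, $path) = tempfile(UNLINK => 1);

# Write: the BOM is emitted together with the first print.
open(my $out, '>:encoding(UTF-16LE):via(File::BOM):utf8', $path)
    or die "Couldn't write '$path': $!";
print {$out} "hello\n";
close $out or die "close failed: $!";

# Inspect the raw bytes to see whether the file starts with FF FE.
open(my $raw, '<:raw', $path) or die "Couldn't read '$path': $!";
my $bytes = do { local $/; <$raw> };
close $raw;
my $starts_with_bom = substr($bytes, 0, 2) eq "\xFF\xFE";

# Read back: the layer consumes the BOM and decodes the text.
open(my $in, '<:via(File::BOM)', $path)
    or die "Couldn't read '$path': $!";
my $line = <$in>;
close $in;

print "BOM present\n" if $starts_with_bom;
print "Read back: $line";
```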
   Seeking
       Seeking with SEEK_SET results in an offset equal to the length of any
       detected BOM being applied to the position parameter. Thus:

           use Fcntl qw( :seek );

           # Seek to end of BOM (not start of file!)
           seek(FILE_BOM_HANDLE, 0, SEEK_SET);

   Telling
       In order to work correctly with seek(), tell() also returns a position
       adjusted by the length of the BOM.


SEE ALSO

       •   Encode

       •   Encode::Unicode

       •   <http://www.unicode.org/unicode/faq/utf_bom.html#BOM>


DIAGNOSTICS

       The following exceptions are raised via croak():

       •   Couldn't read '<filename>': $!

           open_bom() couldn't open the given file for reading.

       •   Couldn't set binmode of handle opened on '<filename>' to '<mode>':
           $!

           open_bom() couldn't set the binmode of the handle.

       •   No string

           decode_from_bom() called on an undefined value.

       •   Unseekable handle: $!

           get_encoding_from_filehandle() or open_bom() called on an
           unseekable file or handle in scalar context.

       •   Couldn't read from handle: $!

           _get_encoding_seekable() couldn't read the handle. This function is
           called from get_encoding_from_filehandle(), defuse() and open_bom().

       •   Couldn't reset read position: $!

           _get_encoding_seekable() couldn't seek to the position after the
           BOM.

       •   Couldn't read byte: $!

           get_encoding_from_stream() couldn't read from the handle. This
           function is called from get_encoding_from_filehandle() and
           open_bom() when the handle or file is unseekable.

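       Since these are ordinary croak() exceptions, one way to catch them is
       a block eval. The sketch below is only an illustration: the file path
       is a hypothetical one that is assumed not to exist, and the variable
       names are made up for the example.

```perl
use strict;
use warnings;
use File::BOM qw( open_bom decode_from_bom );

# open_bom croaks if the file cannot be opened; eval traps the exception.
my $encoding = eval { open_bom(my $fh, '/no/such/dir/missing.txt', ':utf8') };
my $open_error = $@;

# decode_from_bom croaks with "No string" for an undefined argument.
my $text = eval { decode_from_bom(undef) };
my $decode_error = $@;

print "open_bom failed: $open_error" if $open_error;
print "decode_from_bom failed: $decode_error" if $decode_error;
```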

BUGS

       Older versions of PerlIO::via have a few problems with writing, see
       above.

       The current version of PerlIO::via has limitations with regard to seek
       and tell. Currently only line-wise seek and tell are supported by this
       module. If read() is used to read partial lines, tell() will still give
       the position of the end of the last line read.

       Under Windows, tell() seems to return erroneously when reading files
       with Unix line endings.

       Under Windows, warnings may be generated when using the PerlIO::via
       interface to read UTF-16LE and UTF-32LE encoded files. This seems to be
       a bug in the relevant encoding(...) layers.


AUTHOR

       Matt Lawrence <mattlaw@cpan.org>

       With thanks to Mark Fowler and Steve Purkis for additional tests and
       advice.

COPYRIGHT

       Copyright 2005 Matt Lawrence, All Rights Reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.38.0                      2023-07-20                    File::BOM(3pm)