File::BOM(3pm)        User Contributed Perl Documentation       File::BOM(3pm)

NAME
       File::BOM - Utilities for handling Byte Order Marks

SYNOPSIS
           use File::BOM qw( :all );

   high-level functions
           # read a file with encoding from the BOM:
           open_bom(FH, $file);
           open_bom(FH, $file, ':utf8'); # the same but with a default encoding

           # get encoding too
           $encoding = open_bom(FH, $file, ':utf8');

           # open a potentially unseekable file:
           ($encoding, $spillage) = open_bom(FH, $file, ':utf8');

           # change encoding of an open handle according to BOM
           $encoding = defuse(*HANDLE);
           ($encoding, $spillage) = defuse(*HANDLE);

           # Decode a string according to leading BOM:
           $unicode = decode_from_bom($string_with_bom);

           # Decode a string and get the encoding:
           ($unicode, $encoding) = decode_from_bom($string_with_bom);

   PerlIO::via interface
           # Read the Right Thing from a unicode file with BOM:
           open(HANDLE, '<:via(File::BOM)', $filename);

           # Writing little-endian UTF-16 file with BOM:
           open(HANDLE, '>:encoding(UTF-16LE):via(File::BOM)', $filename);

   lower-level functions
           # read BOM encoding from a filehandle:
           $encoding = get_encoding_from_filehandle(FH);

           # Get encoding even if FH is unseekable:
           ($encoding, $spillage) = get_encoding_from_filehandle(FH);

           # Get encoding from a known unseekable handle:
           ($encoding, $spillage) = get_encoding_from_stream(FH);

           # get encoding and BOM length from BOM at start of string:
           ($encoding, $offset) = get_encoding_from_bom($string);

   variables
           # print a BOM for a known encoding
           print FH $enc2bom{$encoding};

           # get an encoding from a known BOM
           $enc = $bom2enc{$bom};

DESCRIPTION
       This module provides functions for handling unicode byte order marks,
       which are to be found at the beginning of some files and streams.

       For details about what a byte order mark is, see
       <http://www.unicode.org/unicode/faq/utf_bom.html#BOM>

       The intention of File::BOM is for files with BOMs to be readable as
       seamlessly as possible, regardless of the encoding used. To that end,
       several different interfaces are available, as shown in the synopsis
       above.

EXPORTS
       Nothing by default.

   symbols
       •   open_bom()

       •   defuse()

       •   decode_from_bom()

       •   get_encoding_from_filehandle()

       •   get_encoding_from_stream()

       •   get_encoding_from_bom()

       •   %bom2enc

       •   %enc2bom

   tags
       •   :all

           All of the above

       •   :subs

           subroutines only

       •   :vars

           just %bom2enc and %enc2bom

VARIABLES
   %bom2enc
       Maps Byte Order Marks to their encodings.

       The keys of this hash are strings which represent the BOMs; the values
       are their encodings, in a format understood by Encode.

       The encodings represented in this hash are: UTF-8, UTF-16BE, UTF-16LE,
       UTF-32BE and UTF-32LE.

   %enc2bom
       A reverse-lookup hash for %bom2enc, with a few aliases used in Encode,
       namely utf8, iso-10646-1 and UCS-2.

       Note that UTF-16, UTF-32 and UCS-4 are not included in this hash,
       mainly because Encode::encode automatically puts BOMs on output for
       those encodings. See Encode::Unicode.

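To make the shape of these two hashes concrete, here is a minimal sketch that builds equivalent tables using only the core Encode module. The names %enc2bom_sketch and %bom2enc_sketch are hypothetical, and this is an illustration of the mapping, not the module's own definition; the aliases mentioned above are left out.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: these tables only mirror the documented shape of
# %enc2bom and %bom2enc; they are not copied from File::BOM.
my @encodings = qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE);

# A BOM is the character U+FEFF encoded in the target encoding.
my %enc2bom_sketch = map { ( $_ => encode($_, "\x{FEFF}") ) } @encodings;

# Reverse lookup: BOM bytes back to the encoding name.
my %bom2enc_sketch = reverse %enc2bom_sketch;

# Show each BOM as hex bytes.
for my $enc (@encodings) {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $enc2bom_sketch{$enc};
    printf "%-8s %s\n", $enc, $hex;
}
```

The explicitly endian encoding names are used because, as noted above, the plain UTF-16 and UTF-32 encoders add a BOM of their own.
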
FUNCTIONS
   open_bom
           $encoding = open_bom(HANDLE, $filename, $default_mode);

           ($encoding, $spill) = open_bom(HANDLE, $filename, $default_mode);

       Opens HANDLE for reading on $filename, setting the mode to the
       appropriate encoding for the BOM stored in the file.

       On failure, a fatal error is raised; see the DIAGNOSTICS section for
       details on how to catch these. Raising errors this way leaves the
       return value(s) free to be used for other purposes.

       If the file doesn't contain a BOM, $default_mode is used instead.
       Hence:

           open_bom(FH, 'my_file.txt', ':utf8');

       opens my_file.txt for reading in an appropriate encoding found from the
       BOM in that file, or as a UTF-8 file if none is found.

       In the absence of a $default_mode argument, the following two calls
       should be equivalent:

           open_bom(FH, 'no_bom.txt');

           open(FH, '<', 'no_bom.txt');

       If an undefined value is passed as the handle, a symbol will be
       generated for it like open() does:

           # create filehandle on the fly
           $enc = open_bom(my $fh, $filename, ':utf8');
           $line = <$fh>;

       The filehandle will be cued up to read after the BOM. Unseekable files
       (e.g. fifos) cause open_bom to croak unless it is called in list
       context to catch spillage from the handle. Any spillage will be
       automatically decoded from the encoding, if found.

       e.g.

           # croak if my_socket is unseekable
           open_bom(FH, 'my_socket');

           # keep spillage if my_socket is unseekable
           ($encoding, $spillage) = open_bom(FH, 'my_socket');

           # discard any spillage from open_bom
           ($encoding) = open_bom(FH, 'my_socket');

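The behaviour described above for seekable files can be approximated with core Perl alone. The following is a sketch under stated assumptions, not the module's implementation: open_with_bom_sketch is a hypothetical helper, it handles only the five encodings listed under %bom2enc, and it ignores unseekable handles and spillage entirely.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper; File::BOM's real open_bom() may work differently.
sub open_with_bom_sketch {
    my ($file, $default_mode) = @_;

    open(my $fh, '<:raw', $file) or die "Couldn't read '$file': $!";

    # The longest BOM considered here is 4 bytes (UTF-32).
    my $read = read($fh, my $head, 4);
    die "Couldn't read from handle: $!" unless defined $read;

    # Try the 4-byte BOMs first: the UTF-32LE BOM begins with the
    # UTF-16LE one, so order matters.
    my ($encoding, $bom_length) = ('', 0);
    for my $enc (qw(UTF-32BE UTF-32LE UTF-8 UTF-16BE UTF-16LE)) {
        my $bom = encode($enc, "\x{FEFF}");
        if (index($head, $bom) == 0) {
            ($encoding, $bom_length) = ($enc, length $bom);
            last;
        }
    }

    # Cue the handle up just past the BOM (offset 0 if there is none).
    seek($fh, $bom_length, 0) or die "Couldn't reset read position: $!";

    # Push the detected encoding, or fall back to the caller's default.
    if ($encoding) {
        binmode($fh, ":encoding($encoding)") or die "binmode failed: $!";
    }
    elsif (defined $default_mode) {
        binmode($fh, $default_mode) or die "binmode failed: $!";
    }
    return ($fh, $encoding);
}
```

Called as `my ($fh, $enc) = open_with_bom_sketch($path, ':utf8')`, it returns a handle positioned after any BOM, plus the detected encoding or an empty string.
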
   defuse
           $enc = defuse(FH);

           ($enc, $spill) = defuse(FH);

       FH should be a filehandle opened for reading. If a BOM is found, the
       relevant encoding layer is pushed onto the handle by binmode. Spillage
       should be Unicode, not bytes.

       Any uncaptured spillage will be silently lost. If the handle is
       unseekable, use list context to avoid data loss.

       If no BOM is found, the mode will be unaffected.

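As an illustration of the call sequence, the sketch below writes a small UTF-8 file with a BOM, opens it as raw bytes, and then lets defuse() push the encoding layer. It assumes File::BOM is installed; the temporary file and its contents are made up for the example.

```perl
use strict;
use warnings;
use File::BOM qw(defuse);
use File::Temp qw(tempfile);

# Write a small UTF-8 file that starts with a BOM (EF BB BF).
my ($out, $path) = tempfile();
binmode($out, ':raw');
print {$out} "\xEF\xBB\xBFna\xC3\xAFve\n";
close($out) or die "close failed: $!";

# Open as raw bytes first, then let defuse() inspect the BOM.
open(my $fh, '<:raw', $path) or die "Couldn't read '$path': $!";

# List context, so that spillage from an unseekable handle is not lost.
my ($encoding, $spillage) = defuse($fh);

# Subsequent reads are decoded according to the detected encoding.
my $line = <$fh>;
```
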
   decode_from_bom
           $unicode_string = decode_from_bom($string, $default, $check);

           ($unicode_string, $encoding) = decode_from_bom($string, $default, $check);

       Reads a BOM from the beginning of $string, decodes $string (minus the
       BOM) and returns it to you as a perl unicode string.

       If $string doesn't have a BOM, $default is used instead.

       $check, if supplied, is passed to Encode::decode as the third argument.

       If there's no BOM and no default, the original string is returned and
       encoding is ''.

       See Encode.

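A short usage sketch of the two cases described above, assuming File::BOM is installed. The input strings are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::BOM qw(decode_from_bom);

# Build a byte string that starts with a UTF-16BE BOM.
my $bytes = encode('UTF-16BE', "\x{FEFF}hello");

# List context returns the decoded text and the detected encoding.
my ($text, $encoding) = decode_from_bom($bytes);

# No BOM and no default: the string comes back as-is, encoding is ''.
my ($same, $none) = decode_from_bom('plain ascii');
```
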
   get_encoding_from_filehandle
           $encoding = get_encoding_from_filehandle(HANDLE);

           ($encoding, $spillage) = get_encoding_from_filehandle(HANDLE);

       Returns the encoding found in the given filehandle.

       The handle should be opened in a non-unicode way (e.g. mode '<:bytes')
       so that the BOM can be read in its natural state.

       After calling, the handle will be set to read at a point after the BOM
       (or at the beginning of the file if no BOM was found).

       If called in scalar context, unseekable handles cause a croak().

       If called in list context, unseekable handles will be read byte-by-byte
       and any spillage will be returned. See get_encoding_from_stream().

   get_encoding_from_stream
           ($encoding, $spillage) = get_encoding_from_stream(*FH);

       Read a BOM from an unrewindable source. This means reading the stream
       one byte at a time until either a BOM is found or every possible BOM is
       ruled out. Any non-BOM bytes read from the handle will be returned in
       $spillage.

       If a BOM is found and the spillage contains a partial character
       (judging by the expected character width for the encoding) more bytes
       will be read from the handle to ensure that a complete character is
       returned.

       Spillage is always in bytes, not characters.

       This function is less efficient than get_encoding_from_filehandle, but
       should work just as well on a seekable handle as on an unseekable one.

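The byte-at-a-time approach described above can be sketched with core Perl as follows. This is an illustration only, not the module's code: sniff_bom_from_stream is a hypothetical helper, and its partial-character top-up covers only the fixed-width UTF-16 and UTF-32 cases, not multi-byte UTF-8 sequences.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper that illustrates the approach; not File::BOM's own code.
sub sniff_bom_from_stream {
    my ($fh) = @_;

    # Each BOM is U+FEFF encoded in the corresponding encoding.
    my %bom_for = map { ( $_ => encode($_, "\x{FEFF}") ) }
        qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE);

    my $seen = '';
    while (1) {
        # Keep reading only while some longer BOM still starts with $seen.
        my $could_grow = grep {
            length($_) > length($seen) && index($_, $seen) == 0
        } values %bom_for;
        last unless $could_grow;

        my $got = read($fh, my $byte, 1);
        die "Couldn't read byte: $!" unless defined $got;
        last if $got == 0;    # end of stream
        $seen .= $byte;
    }

    # Pick the longest BOM that prefixes the bytes read so far.
    my ($encoding, $bom) = ('', '');
    for my $enc (keys %bom_for) {
        my $candidate = $bom_for{$enc};
        next unless index($seen, $candidate) == 0;
        ($encoding, $bom) = ($enc, $candidate)
            if length($candidate) > length($bom);
    }
    my $spillage = substr($seen, length $bom);

    # For fixed-width encodings, top up a partial character.
    my $width = $encoding =~ /^UTF-(16|32)/ ? $1 / 8 : 1;
    while (length($spillage) % $width) {
        my $got = read($fh, my $byte, 1);
        last unless $got;
        $spillage .= $byte;
    }

    return ($encoding, $spillage);
}

# Usage: a UTF-16LE BOM followed by "AB", read from an in-memory handle.
my $data = "\xFF\xFE" . "A\x00B\x00";
open(my $in, '<', \$data) or die "Couldn't open in-memory handle: $!";
my ($encoding, $spillage) = sniff_bom_from_stream($in);
```

In this run the third byte has to be read before UTF-32LE can be ruled out, which is why spillage arises at all: that byte has already left the stream and cannot be pushed back.
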
   get_encoding_from_bom
           ($encoding, $offset) = get_encoding_from_bom($string);

       Returns the encoding and length in bytes of the BOM in $string.

       If there is no BOM, an empty string is returned and $offset is zero.

       To get the data from the string, the following should work:

           use Encode;

           my($encoding, $offset) = get_encoding_from_bom($string);

           if ($encoding) {
               $string = decode($encoding, substr($string, $offset));
           }

PerlIO::via interface
       File::BOM can be used as a PerlIO::via interface.

           open(HANDLE, '<:via(File::BOM)', 'my_file.txt');

           open(HANDLE, '>:encoding(UTF-16LE):via(File::BOM)', 'out_file.txt');
           print HANDLE "foo\n"; # BOM is written to file here

       This method is less prone to errors on non-seekable files as spillage
       is incorporated into an internal buffer, but it doesn't give you any
       information about the encoding being used, or indeed whether or not a
       BOM was present.

       There are a few known problems with this interface, especially
       surrounding seek() and tell(). Please see the BUGS section for more
       details.

   Reading
       The via(File::BOM) layer must be added before the handle is read from,
       otherwise any BOM will be missed. If there is no BOM, no decoding will
       be done.

       Because of a limitation in PerlIO::via, read() always works on bytes,
       not characters. BOM decoding will still be done but output will be
       bytes of UTF-8.

           use Encode;

           open(BOM, '<:via(File::BOM)', $file);
           $bytes_read = read(BOM, $buffer, $length);
           $unicode = decode('UTF-8', $buffer, Encode::FB_QUIET);

           # Now $unicode is valid unicode and $buffer contains any left-over bytes

   Writing
       Add the via(File::BOM) layer on top of a unicode encoding layer to
       print a BOM at the start of the output file. This needs to be done
       before any data is written. The BOM is written as part of the first
       print command on the handle, so if you don't print anything to the
       handle, you won't get a BOM.

       A "Wide character in print" warning is generated when the
       via(File::BOM) layer doesn't receive utf8 on writing. This glitch was
       resolved in perl version 5.8.7, but if your perl version is older than
       that, you'll need to make sure that the via(File::BOM) layer receives
       utf8 like this:

           # This works OK
           open(FH, '>:encoding(UTF-16LE):via(File::BOM):utf8', $filename);

           # This generates warnings with older perls
           open(FH, '>:encoding(UTF-16LE):via(File::BOM)', $filename);

   Seeking
       Seeking with SEEK_SET results in an offset equal to the length of any
       detected BOM being applied to the position parameter. Thus:

           # Seek to end of BOM (not start of file!)
           seek(FILE_BOM_HANDLE, 0, SEEK_SET);

   Telling
       In order to work correctly with seek(), tell() also returns a position
       adjusted by the length of the BOM.

SEE ALSO
       •   Encode

       •   Encode::Unicode

       •   <http://www.unicode.org/unicode/faq/utf_bom.html#BOM>

DIAGNOSTICS
       The following exceptions are raised via croak()

       •   Couldn't read '<filename>': $!

           open_bom() couldn't open the given file for reading

       •   Couldn't set binmode of handle opened on '<filename>' to '<mode>':
           $!

           open_bom() couldn't set the binmode of the handle

       •   No string

           decode_from_bom called on an undefined value

       •   Unseekable handle: $!

           get_encoding_from_filehandle() or open_bom() called on an
           unseekable file or handle in scalar context.

       •   Couldn't read from handle: $!

           _get_encoding_seekable() couldn't read the handle. This function is
           called from get_encoding_from_filehandle(), defuse() and open_bom()

       •   Couldn't reset read position: $!

           _get_encoding_seekable couldn't seek to the position after the BOM.

       •   Couldn't read byte: $!

           get_encoding_from_stream couldn't read from the handle. This
           function is called from get_encoding_from_filehandle() and
           open_bom() when the handle or file is unseekable.

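Since these are ordinary croak() exceptions, one common way to catch them is a block eval. The sketch below assumes File::BOM is installed; the file name is hypothetical and is assumed not to exist, so that open_bom() fails.

```perl
use strict;
use warnings;
use File::BOM qw(open_bom);

# Hypothetical path, assumed not to exist.
my $file = 'no_such_file_for_bom_example.txt';

my ($fh, $encoding);
my $error = '';

# The trailing "1" makes eval return true on success, so that a
# failure can be told apart from a false return value.
eval {
    $encoding = open_bom($fh, $file, ':utf8');
    1;
} or $error = $@ || 'unknown error';

if ($error) {
    # $error holds one of the messages listed above.
    warn "open_bom failed: $error";
}
```
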
BUGS
       Older versions of PerlIO::via have a few problems with writing; see
       above.

       The current version of PerlIO::via has limitations with regard to seek
       and tell; currently only line-wise seek and tell are supported by this
       module. If read() is used to read partial lines, tell() will still give
       the position of the end of the last line read.

       Under Windows, tell() seems to return erroneously when reading files
       with unix line endings.

       Under Windows, warnings may be generated when using the PerlIO::via
       interface to read UTF-16LE and UTF-32LE encoded files. This seems to be
       a bug in the relevant encoding(...) layers.

AUTHOR
       Matt Lawrence <mattlaw@cpan.org>

       With thanks to Mark Fowler and Steve Purkis for additional tests and
       advice.

COPYRIGHT
       Copyright 2005 Matt Lawrence, All Rights Reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.38.0                      2023-07-20                    File::BOM(3pm)