MboxParser(3)         User Contributed Perl Documentation        MboxParser(3)

NAME
Mail::MboxParser - read-only access to UNIX-mailboxes

SYNOPSIS
    use Mail::MboxParser;

    my $parseropts = {
        enable_cache    => 1,
        enable_grep     => 1,
        cache_file_name => 'mail/cache-file',
    };
    my $mb = Mail::MboxParser->new('some_mailbox',
                                   decode     => 'ALL',
                                   parseropts => $parseropts);

    # -----------

    # slurping
    for my $msg ($mb->get_messages) {
        print $msg->header->{subject}, "\n";
        $msg->store_all_attachments(path => '/tmp');
    }

    # iterating
    while (my $msg = $mb->next_message) {
        print $msg->header->{subject}, "\n";
        # ...
    }

    # we forgot to do something with the messages
    $mb->rewind;
    while (my $msg = $mb->next_message) {
        # iterate again
        # ...
    }

    # subscripting one message after the other
    for my $idx (0 .. $mb->nmsgs - 1) {
        my $msg = $mb->get_message($idx);
    }

DESCRIPTION
This module attempts to provide simplified access to standard
UNIX-mailboxes. It offers only a subset of methods to get 'straight to
the point'. More sophisticated things can still be done by invoking any
method from MIME::Tools on the appropriate return values.

Mail::MboxParser is not derived from Mail::Box and thus is not related
to it in any way. It does, however, incorporate some invaluable hints
by the author of Mail::Box, Mark Overmeer.

METHODS
See also the section ERROR-HANDLING much further below.

Beyond that, see the relevant manpages of Mail::MboxParser::Mail,
Mail::MboxParser::Mail::Body and Mail::MboxParser::Mail::Convertable
for a description of the methods for these objects.

new(mailbox, options)
new(scalar-ref, options)
new(array-ref, options)
new(filehandle, options)
    This creates a new MboxParser-object opening the specified
    'mailbox' with either absolute or relative path.

    new() can also take a reference to a variable containing the
    mailbox either as one string (reference to a scalar) or linewise
    (reference to an array), or a filehandle from which to read the
    mailbox.

    The following option(s) may be useful. The value in brackets below
    the key is the default if none is given.

    key:      | value:     | description:
    ==========|============|===============================
    decode    | 'NEVER'    | never decode transfer-encoded
    (NEVER)   |            | data
              |------------|-------------------------------
              | 'BODY'     | will decode body into a human-
              |            | readable format
              |------------|-------------------------------
              | 'HEADER'   | will decode header fields if
              |            | any are encoded
              |------------|-------------------------------
              | 'ALL'      | decode any data
    ==========|============|===============================
    uudecode  | 1          | enable extraction of uuencoded
    (0)       |            | attachments in MIME::Parser
              |------------|-------------------------------
              | 0          | uuencoded attachments are
              |            | treated as plain body text
    ==========|============|===============================
    newline   | 'UNIX'     | UNIXish line-endings
    (AUTO)    |            | ("\n" aka \012)
              |------------|-------------------------------
              | 'WIN'      | Win32 line-endings
              |            | ("\r\n" aka \015\012)
              |------------|-------------------------------
              | 'AUTO'     | try to do autodetection
              |------------|-------------------------------
              | custom     | a user-given value for totally
              |            | borked mailboxes
    ==========|============|===============================
    oldparser | 1          | uses the old (and slower)
    (0)       |            | parser (but guaranteed to show
              |            | the old behaviour)
              |------------|-------------------------------
              | 0          | uses Mail::Mbox::MessageParser
    ==========|============|===============================
    parseropts|            | see "Specifying parser
              |            | options" below
    ==========|============|===============================

    The newline option comes in handy if you have an mbox-file that
    does not conform to your operating system's line-ending
    conventions. One such scenario: you are using the module under
    Windows but deliberately keep mailboxes with UNIX-newlines (or the
    other way round). If you do not give this option, 'AUTO' is
    assumed and some basic tests on the mailbox are performed. This
    autodetection is of course not capable of detecting cases where
    you use something like '#DELIMITER' as line-ending. As yet it can
    only distinguish between UNIX and Win32ish newlines. You may be
    lucky and it even works for Macintosh mailboxes. If you have more
    extravagant wishes, pass a custom value:

        my $mb = Mail::MboxParser->new("mbox", newline => '#DELIMITER');

    You can't use regexes here since internally this relies on the $/
    variable ($INPUT_RECORD_SEPARATOR, that is).
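
The restriction on regexes follows from how Perl treats $/. The following is a minimal sketch in plain Perl, not code taken from Mail::MboxParser; the helper name split_records and the sample data are made up for illustration. It reads records from an in-memory filehandle with $/ set to a literal string.

```perl
use strict;
use warnings;

# Hypothetical helper: split $data into records terminated by the literal
# string $sep, the same way readline does when $/ is set.
sub split_records {
    my ($data, $sep) = @_;
    open(my $fh, '<', \$data) or die "cannot open in-memory handle: $!";
    local $/ = $sep;              # treated as a plain string, never a regex
    my @records;
    while (my $rec = <$fh>) {
        chomp $rec;               # chomp removes the current value of $/
        push @records, $rec;
    }
    close $fh;
    return @records;
}

my @records = split_records("From a#DELIMITERSubject: hi#DELIMITER", '#DELIMITER');
print scalar(@records), " records\n";
```

Passing something like '\d+' as the separator would only match those four literal characters, which is why a pattern cannot be used as a newline value.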

    When passing either a scalar-ref, an array-ref or \*STDIN as first
    argument, an anonymous tmp-file is created to hold the data. This
    is hidden from the user, so there is no need to worry about it.
    Since a tmp-file acts just like an ordinary mailbox-file, no data
    is lost once you have walked through the mailbox-data.
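
How in-memory data can end up behind an ordinary filehandle may be easier to picture with a small sketch. This is an assumption about the general technique, not the module's actual code; data_to_tmp_fh is a hypothetical helper that relies only on core Perl's anonymous temp file support.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's actual code: copy in-memory mailbox
# data (a scalar ref or an array ref of lines) into an anonymous temp file
# and return a read/write handle positioned at the start.
sub data_to_tmp_fh {
    my ($src) = @_;
    open(my $fh, '+>', undef) or die "cannot create anonymous temp file: $!";
    if    (ref $src eq 'SCALAR') { print {$fh} $$src }
    elsif (ref $src eq 'ARRAY')  { print {$fh} @$src }
    else                         { die "expected a scalar or array reference" }
    seek($fh, 0, 0) or die "cannot rewind: $!";
    return $fh;
}

my $tmp   = data_to_tmp_fh(\"From alice\nSubject: one\n\nbody\n");
my $first = <$tmp>;
print $first;
```

Because the handle can be rewound and re-read like any file, walking through the data once does not consume it.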

Specifying parser options
When available, the module will use "Mail::Mbox::MessageParser" to do
the parsing. To get the most speed out of it, you can tweak some of its
options. Arguably, you even have to do that in order to make it use
caching. Options for the parser are given via the parseropts switch,
which expects a reference to a hash as its value. The options you can
specify are:

    enable_cache
        When set to a true value, caching is used but only if you gave
        cache_file_name. There is no default value here!

    cache_file_name
        The file used for caching. This option is mandatory if
        enable_cache is true.

    enable_grep
        When set to a true value (which is the default), the external
        grep(1) is used to speed up parsing. If your system does not
        provide a usable grep implementation, it silently falls back to
        the pure Perl parser.

When the module is unable to create a "Mail::Mbox::MessageParser"
object, it falls back to the old parser in the hope that the
construction of the object then succeeds.
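
The rules above can be sketched as two small helpers. Both make_parseropts and parser_in_use are hypothetical names introduced here for illustration; the only module behaviour they rely on is the documented parseropts keys and the parser() method.

```perl
use strict;
use warnings;

# Hypothetical helper: build a parseropts hash that follows the rules
# above. Caching is only switched on when a cache file name is supplied.
sub make_parseropts {
    my (%args) = @_;
    my %opts = (
        enable_grep => exists $args{enable_grep} ? $args{enable_grep} : 1,
    );
    if (defined $args{cache_file_name}) {
        $opts{enable_cache}    = 1;
        $opts{cache_file_name} = $args{cache_file_name};
    }
    return \%opts;
}

# Hypothetical helper: name the parser an object ended up with, using
# only the documented parser() method (false means the old parser).
sub parser_in_use {
    my ($mb) = @_;
    return $mb->parser ? 'Mail::Mbox::MessageParser' : 'old parser';
}

my $opts = make_parseropts(cache_file_name => 'mail/cache-file');
print join(', ', map { "$_ => $opts->{$_}" } sort keys %$opts), "\n";
```

The resulting hash reference would be passed as the parseropts value to new(), and parser_in_use($mb) could then be called on the returned object to see whether the fallback happened.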

open(source, options)
    Takes exactly the same arguments as new() does, except that it can
    be used to change the characteristics of a mailbox on the fly.

get_messages
    Returns an array containing all messages in the mailbox
    represented as Mail::MboxParser::Mail objects. This method is
    _minimally_ quicker than iterating over the mailbox using
    "next_message" but eats much more memory. Memory usage grows
    linearly with the number of messages, since this method builds one
    huge array containing all of them before returning it.

get_message(n)
    Returns the n-th message (first message has index 0) in a mailbox.
    If the message does not exist, "get_message" returns undef and
    "$mb->error" contains an error-string.
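
A minimal sketch of the check described for get_message follows. fetch_message is a hypothetical wrapper; it assumes only that the object passed in offers the documented get_message() and error() methods.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch message $idx from any object offering the
# documented get_message() and error() methods. Warns and returns
# nothing when the message does not exist.
sub fetch_message {
    my ($mb, $idx) = @_;
    my $msg = $mb->get_message($idx);
    return $msg if defined $msg;
    warn "message $idx not available: ", $mb->error, "\n";
    return;
}
```

Called as fetch_message($mb, 3), it hands back the message object or an empty result, so the caller can test the return value before using it.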

next_message
    This lets you iterate over a mailbox one mail after another. The
    great advantage over "get_messages" is the very low memory
    consumption, which stays at a constant level throughout the
    execution of your script. Secondly, it almost instantly begins
    spitting out Mail::MboxParser::Mail-objects since it doesn't have
    to slurp in all mails before returning them.
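
The streaming style could look roughly like the sketch below. subjects_matching is a hypothetical helper; it assumes only the documented next_message() method and the header hashref with a subject key shown in the synopsis.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the subjects that match $re while holding
# only one message object at a time.
sub subjects_matching {
    my ($mb, $re) = @_;
    my @hits;
    while (my $msg = $mb->next_message) {
        my $subject = $msg->header->{subject};   # header parsed once per message
        push @hits, $subject if defined $subject && $subject =~ $re;
    }
    return @hits;
}
```

Only the matching subject strings are kept, so memory use depends on the number of hits rather than on the size of the mailbox.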

set_pos(n)
rewind
current_pos
    These three methods deal with the position of the internal
    filehandle backing the mailbox. Once you have iterated over the
    whole mailbox using "next_message", MboxParser has reached the end
    of the mailbox and you have to reposition if you want to iterate
    again. You can do this with either "set_pos" or "rewind".

        $mb->rewind;  # equivalent to
        $mb->set_pos(0);

    "current_pos" reveals the current position in the mailbox and can
    be used to return to this position later if you want to do tricky
    things. Note that "current_pos" does *not* return the current line
    but rather the current byte offset as returned by Perl's tell()
    function.

        my $last_pos;
        while (my $msg = $mb->next_message) {
            # ...
            if ($msg->header->{subject} eq 'I was looking for this') {
                $last_pos = $mb->current_pos;
                last; # bail out here and do something else
            }
        }

        # ...
        # ...

        # now continue where we stopped:
        $mb->set_pos($last_pos);
        while (my $msg = $mb->next_message) {
            # ...
        }
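
That a position is a byte offset rather than a line number can be seen with plain tell() and seek(). The sketch below uses an in-memory filehandle and made-up sample data; reread_from_saved_offset is a hypothetical name and nothing here calls into Mail::MboxParser.

```perl
use strict;
use warnings;

# Hypothetical demonstration: remember an offset with tell(), read on,
# then jump back with seek() and read the same line again.
sub reread_from_saved_offset {
    my ($data) = @_;
    open(my $fh, '<', \$data) or die "cannot open in-memory handle: $!";
    my $first  = <$fh>;                  # consume the first line
    my $pos    = tell($fh);              # byte offset, not a line count
    my $second = <$fh>;
    seek($fh, $pos, 0) or die "cannot seek: $!";
    my $again  = <$fh>;                  # same line as $second
    return ($pos, $second, $again);
}

my ($pos, $second, $again) =
    reread_from_saved_offset("From a\nSubject: one\n\nFrom b\nSubject: two\n");
print "offset $pos: $again";
```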

    WARNING: Be very careful with these methods when using the parser
    of "Mail::Mbox::MessageParser". This parser maintains its own state
    and you shouldn't expect it to always be in sync with the state of
    "Mail::MboxParser". If you need finer control over the parsing,
    consider using the public interface described in the manpage of
    Mail::Mbox::MessageParser. Use "parser()" to get the underlying
    parser object.

    This, however, may expose you to the same problems turned around:
    "Mail::MboxParser" may lose sync with its parser when you do that.

    Therefore: just avoid any of the above for now and wait till
    "Mail::Mbox::MessageParser" has a stable interface.

make_index
    You can force the creation of a message-index with this method. The
    message-index is a mapping between the index-number of a message
    (0 .. $mb->nmsgs - 1) and the byte-position of the filehandle. This
    is usually done automatically for you once you call "get_message",
    hence the first call for a particular message will be a little
    slower since the message-index first has to be built. This is,
    however, done rather quickly.

    You can have a peek at the index if you are interested. The
    following produces a nicely padded table (suitable for mailboxes up
    to 9.9999...GB ;-).

        $mb->make_index;
        for (0 .. $mb->nmsgs - 1) {
            printf "%5.5d => %10.10d\n",
                $_, $mb->get_pos($_);
        }

get_pos(n)
    This method takes the index-number of a certain message within the
    mailbox and returns the corresponding position of the filehandle
    that represents the start of that message.

    It is mainly used by "get_message()" and you wouldn't really have
    to bother using it yourself except for statistical purposes as
    demonstrated above along with make_index.
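
Conceptually, such an index is just a list of byte offsets. The sketch below is not the module's implementation; build_index is a hypothetical helper that assumes the common mbox convention that a message starts at a "From " line at the beginning of the file or after a blank line.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the module's implementation: map each message
# number to the byte offset of its "From " separator line.
sub build_index {
    my ($fh) = @_;
    my @index;
    seek($fh, 0, 0) or die "cannot rewind: $!";
    my $prev_blank = 1;                    # start of file counts as a boundary
    while (1) {
        my $pos  = tell($fh);              # offset before reading the line
        my $line = <$fh>;
        last unless defined $line;
        push @index, $pos if $prev_blank && $line =~ /^From /;
        $prev_blank = ($line =~ /^\r?\n?\z/) ? 1 : 0;
    }
    return \@index;
}

my $mbox = "From a\nSubject: one\n\nbody\n\nFrom b\nSubject: two\n\nbody\n";
open(my $in, '<', \$mbox) or die "cannot open in-memory handle: $!";
my $index = build_index($in);
printf "%5.5d => %10.10d\n", $_, $index->[$_] for 0 .. $#$index;
```

Seeking to one of the stored offsets and reading a line yields the separator line of that message, which is what makes random access by message number cheap once the index exists.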

nmsgs
    Returns the number of messages in a mailbox. You could naturally
    also call get_messages in scalar-context, but this one won't create
    new objects. It just counts them and thus it is much quicker and
    won't eat a lot of memory.

parser
    Returns the bare "Mail::Mbox::MessageParser" object. If no such
    object exists, returns "undef".

    You can use this method to check whether the module actually uses
    the old or new parser. If "parser" returns a false value, it is
    using the old parsing routines.

METHODS SHARED BY ALL OBJECTS
error
    Call this immediately after one of the methods above that mention a
    possible error-message.

log
    Internal oddities of sorts are recorded here. Again, only the last
    event is saved.

ERROR-HANDLING
Mail::MboxParser provides a mechanism for you to figure out why some
methods did not function as you expected. There are four classes of
unexpected behavior:

(1) bad arguments
    In this case you called a method with arguments that did not make
    sense, hence you confused Mail::MboxParser. Example:

        $mail->store_entity_body;       # wrong, needs two arguments
        $mail->store_entity_body(0);    # wrong, still needs one more

    In either of the above two cases, you'll get an error message and
    your script will exit. The message will, however, tell you in which
    line of your script this error occurred.

(2) correct arguments but...
    Consider this line:

        $mail->store_entity_body(50, \*FH); # could be wrong

    Obviously you did call store_entity_body with the correct number of
    arguments. That's good because now your script won't just exit.
    Unfortunately, your program can't know in advance whether the
    particular mail ($mail) has a 51st entity.

    So, what to do?

    Just be brave: Write the above line and do the error-checking
    afterwards by calling $mail->error immediately after
    store_entity_body:

        $mail->store_entity_body(50, \*FH);
        if ($mail->error) {
            print "Oops, something went wrong: ", $mail->error, "\n";
        }

    In the description of the available methods above, you always find
    a remark when you could use $mail->error. It always returns a
    string that you can print out and investigate further.

(3) errors that never become visible
    Well, they exist. When you handle MIME-stuff a lot such as
    attachments etc., Mail::MboxParser internally calls a lot of
    methods provided by the MIME::Tools package. These work splendidly
    in most cases, but MIME::Tools may fail to produce something
    sensible if you have a very odd or even screwed up mailbox.

    If this happens you might find information on that when calling
    $mail->log. This will give you the more or less unfiltered
    error-messages produced by MIME::Tools.

    My advice: Ignore them! If there really is something in $mail->log
    it is either because your mails are totally weird (there is
    nothing you can do about that then) or these errors are smoothly
    caught inside Mail::MboxParser, in which case all should be fine
    for you.

(4) the apocalypse
    If nothing seems to work the way it should and $mail->error is
    empty, then the worst case has set in: Mail::MboxParser has a bug.

    Needless to say, there is no way to work around this. In this case
    you should contact me and I'll look into it.

CAVEATS
I have been working hard on making Mail::MboxParser eat less memory and
run as quickly as possible. Due to that, two time- and memory-consuming
tasks are now performed on demand. That is, parsing out the MIME-parts
and turning the raw header into a hash have become closures.

The drawback is that it may get inefficient if you often call

    $mail->header->{field}

In this case you should probably save the return value of $mail->header
(a hashref) into a variable, since the raw header is parsed each time
you call it.
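
The cost of repeated on-demand parsing can be illustrated with a stand-in closure. The sketch below does not use Mail::MboxParser at all; make_header_parser is a hypothetical factory whose closure re-parses a raw header string on every call and counts how often that happens.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an on-demand header: every call to the
# returned closure re-parses the raw text. The second return value is a
# reference to a counter of how many parses took place.
sub make_header_parser {
    my ($raw) = @_;
    my $count = 0;
    my $parse = sub {
        $count++;
        my %h;
        for my $line (split /\n/, $raw) {
            my ($k, $v) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
            $h{lc $k} = $v;
        }
        return \%h;
    };
    return ($parse, \$count);
}

my ($header, $parses) =
    make_header_parser("Subject: hello\nFrom: alice\@example.com\n");

# Two lookups through the closure: two parses.
my @slow = ($header->()->{subject}, $header->()->{from});

# Keep the hashref instead: one parse for any number of lookups.
my $h    = $header->();
my @fast = ($h->{subject}, $h->{from});

print "parses so far: $$parses\n";
```

With a real mail object, the equivalent of the second variant is to store $mail->header in a variable once per message and read all fields from that variable.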

On the other hand, if you have a mailbox of, say, 25MB, and hold each
header of each message in memory, you'll quickly run out of memory. So,
you can now choose between more performance and lower memory usage.

None of this matters if you just parse a mailbox to extract one
header-field (e.g. subject), work with that and exit. In this case it
both needs less memory and is considerably quicker. :-)

BUGS
Some mailers have a fancy idea of how a "To: "- or "Cc: "-line should
look. I have seen things like:

    To: "\"John Doe"\" <john.doe@example.com>

The splitting into name and email, however, does still work here, but
you have to remove these silly double-quotes and backslashes yourself.
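
One way to do that clean-up is sketched below. clean_display_name is a hypothetical helper and not part of the module; it assumes you already have the name part as a plain string.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the stray double-quotes and backslashes
# that some mailers leave in the display name.
sub clean_display_name {
    my ($name) = @_;
    $name =~ tr/"\\//d;           # delete every double-quote and backslash
    $name =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
    return $name;
}

print clean_display_name(q{"\"John Doe"\"}), "\n";
```

Note that this also removes quotes and backslashes that are a legitimate part of a name, so it is only suitable where that loss is acceptable.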

The way of counting and detecting messages now complies with RFC 822.
This is, however, no guarantee that it all works seamlessly. There are
just so many mailboxes that get screwed up by malformed mails.

TODO
Apart from new bugs that almost certainly have been introduced with
this release, the following things still need to be done:

Transfer-Encoding
    Still, only quoted-printable encoding is correctly handled.

Tests
    Clean-up of the test-scripts is desperately needed. At present they
    represent a rather arbitrary selection of tested functions. Some
    are tested several times while others don't show up at all in the
    suites.

THANKS
Thanks to a number of people who gave me invaluable hints that helped
me with Mail::Box, notably Mark Overmeer for his hints on more
object-orientedness.

Kenn Frankel (kenn AT kenn DOT cc) kindly patched the broken
split-header routine and added get_field().

David Coppit for making me aware of "Mail::Mbox::MessageParser" and
designing it the way I needed to make it work for my module.

VERSION
This is version 0.55.

AUTHOR AND COPYRIGHT
Tassilo von Parseval <tassilo.von.parseval@rwth-aachen.de>

Copyright (c) 2001-2005 Tassilo von Parseval. This program is free
software; you can redistribute it and/or modify it under the same terms
as Perl itself.

SEE ALSO
MIME::Entity

Mail::MboxParser::Mail, Mail::MboxParser::Mail::Body,
Mail::MboxParser::Mail::Convertable

Mail::Mbox::MessageParser

perl v5.32.0                      2020-07-28                     MboxParser(3)