URI::Find(3pm)

URI::Find(3)          User Contributed Perl Documentation         URI::Find(3)

NAME

       URI::Find - Find URIs in arbitrary text

SYNOPSIS

         require URI::Find;

         my $finder = URI::Find->new(\&callback);

         my $how_many_found = $finder->find(\$text);

DESCRIPTION

       This module does one thing: it finds URIs and URLs in plain text.  It
       finds them quickly and it finds them all (or at least everything URI.pm
       considers a URI).  It only finds URIs which include a scheme (http://
       or the like); for something a bit less strict, have a look at
       URI::Find::Schemeless.

       For a command-line interface, urifind is provided.

   Public Methods
       new
             my $finder = URI::Find->new(\&callback);

           Creates a new URI::Find object.

           &callback is a function which is called on each URI found.  It is
           passed two arguments: the first is a URI object representing the
           URI found, and the second is the original text of the URI found.
           The return value of the callback will replace the original URI in
           the text.
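
           As a minimal sketch of that replacement behaviour (the sample text
           and the "[link]" placeholder are made up for illustration), a
           callback that returns a fixed string swaps every URI found for
           that string:

             ```perl
             use strict;
             use warnings;
             use URI::Find;

             # Sample text invented for this illustration.
             my $text = "Docs live at http://example.com/docs/ and nowhere else.";

             # The callback receives the URI object and the original matched
             # text; whatever it returns is spliced into $text in place of
             # the match.
             my $finder = URI::Find->new(sub {
                 my($uri, $orig_uri) = @_;
                 return "[link]";
             });

             $finder->find(\$text);
             print "$text\n";
             ```

           With this input the URI should be replaced by "[link]" while the
           surrounding words stay untouched.  Returning $orig_uri instead
           leaves the text as it was.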

       find
             my $how_many_found = $finder->find(\$text);

           $text is a string to search and possibly modify with your callback.
           The return value is the number of URIs found.

           Alternatively, "find" can be called with a replacement function for
           the rest of the text:

             use CGI qw(escapeHTML);
             # ...
             my $how_many_found = $finder->find(\$text, \&escapeHTML);

           This will not only call the callback function for every URL found
           (and perform the replacement instructions therein), but also run
           the rest of the text through "escapeHTML()".  This makes it easier
           to turn plain text which contains URLs into HTML (see EXAMPLES).
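
           The second argument need not come from CGI.  As a sketch, assuming
           a hand-written escaper (escape_amp is a hypothetical helper, not
           part of URI::Find) and made-up sample text:

             ```perl
             use strict;
             use warnings;
             use URI::Find;

             # Hypothetical helper: escapes only ampersands, to keep the
             # sketch short.
             sub escape_amp {
                 my($chunk) = @_;
                 $chunk =~ s/&/&amp;/g;
                 return $chunk;
             }

             my $text = "Tom & Jerry are at http://example.com/show today";

             my $finder = URI::Find->new(sub {
                 my($uri, $orig_uri) = @_;
                 return qq|<a href="$uri">$orig_uri</a>|;
             });

             # Text outside the URIs goes through escape_amp(); the URIs
             # themselves are handed to the callback instead.
             my $how_many_found = $finder->find(\$text, \&escape_amp);

             print "$how_many_found URI(s) found\n";
             print "$text\n";
             ```

           A real HTML escaper would also have to handle <, > and quotes;
           escapeHTML from CGI does so.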

   Protected Methods
       I got a bunch of mail from people asking if I'd add certain features to
       URI::Find.  Most wanted the search to be less restrictive, do more
       heuristics, etc.  Since many of the requests were contradictory, I'm
       letting people create their own custom subclasses to do what they want.

       The following are methods internal to URI::Find which a subclass can
       override to change the way URI::Find acts.  They are only to be called
       inside a URI::Find subclass.  Users of this module are NOT to use these
       methods.

       uri_re
             my $uri_re = $self->uri_re;

           Returns the regex for finding absolute, schemed URIs
           (http://www.foo.com and such).  This, combined with
           schemeless_uri_re(), is what finds candidate URIs.

           Usually this method does not have to be overridden.

       schemeless_uri_re
             my $schemeless_re = $self->schemeless_uri_re;

           Returns the regex for finding schemeless URIs (www.foo.com and
           such) and other things which might be URIs.  By default this will
           match nothing (though it used to try to find schemeless URIs which
           started with "www" and "ftp").

           Many people will want to override this method.  See
           URI::Find::Schemeless for a subclass which does a reasonable job of
           finding URIs which might be missing the scheme.
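
           As a rough sketch of such a subclass, overriding
           schemeless_uri_re might look like the following.  My::WWWFinder
           and its deliberately naive pattern are hypothetical;
           URI::Find::Schemeless is far more thorough and is the better
           choice for real use.

             ```perl
             package My::WWWFinder;    # hypothetical subclass name
             use strict;
             use warnings;
             use parent 'URI::Find';

             # Naive pattern, for illustration only: "www." followed by
             # host-like characters.
             sub schemeless_uri_re {
                 return qr/\bwww\.[A-Za-z0-9.-]+/;
             }

             package main;

             my @found;
             my $finder = My::WWWFinder->new(sub {
                 my($uri, $orig_uri) = @_;
                 push @found, "$uri";    # stringify the URI object
                 return $orig_uri;       # leave the text unchanged
             });

             my $text = "Have a look at www.example.com for details";
             my $how_many_found = $finder->find(\$text);

             print "$_\n" for @found;
             ```

           The candidate matched by the overridden regex is then given a
           scheme by schemeless_to_schemed before the callback is invoked.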

       uric_set
             my $uric_set = $self->uric_set;

           Returns a set matching the 'uric' set defined in RFC 2396, suitable
           for putting into a character set ([]) in a regex.

           You almost never have to override this.
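
           For orientation, RFC 2396 defines uric as reserved | unreserved |
           escaped.  The sketch below builds a character class along those
           lines by hand; the variable names are illustrative, and the
           string URI::Find actually returns may be written differently.

             ```perl
             use strict;
             use warnings;

             # RFC 2396: reserved   = ; / ? : @ & = + $ ,
             #           unreserved = alphanum plus the marks - _ . ! ~ * ' ( )
             #           escaped    = % hex hex (so % is allowed too)
             my $reserved = quotemeta(';/?:@&=+$,');
             my $mark     = quotemeta(q{-_.!~*'()});
             my $uric_set = "A-Za-z0-9${reserved}${mark}%";

             # Anchored test: does a string consist of uric characters only?
             my $uric_re = qr/^[$uric_set]+$/;

             for my $candidate ('//example.com/a?b=c', 'has space') {
                 printf "%-22s %s\n", $candidate,
                     $candidate =~ $uric_re ? 'all uric' : 'contains non-uric';
             }
             ```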

       cruft_set
             my $cruft_set = $self->cruft_set;

           Returns a set of characters which are considered garbage.  Used by
           decruft().

       decruft
             my $uri = $self->decruft($uri);

           Sometimes garbage characters like periods and parentheses get
           accidentally matched along with the URI.  In order for the URI to
           be properly identified, it must sometimes be "decrufted", that is,
           have the garbage characters stripped.

           This method takes a candidate URI and strips off any cruft it
           finds.

       recruft
             my $uri = $self->recruft($uri);

           This method puts back the cruft taken off with decruft().  This is
           necessary because the cruft is destructively removed from the
           string before invoking the user's callback, so it has to be put
           back afterwards.
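
           The effect is visible from the public interface.  In this sketch
           (sample text invented for illustration), the sentence-ending
           period is expected to be treated as cruft: the callback should
           see the URI without it, and the period is put back after the
           replacement.

             ```perl
             use strict;
             use warnings;
             use URI::Find;

             my $text = "The manual is at http://example.com/manual.";

             my @seen;
             my $finder = URI::Find->new(sub {
                 my($uri, $orig_uri) = @_;
                 push @seen, $orig_uri;    # record what the callback was given
                 return "<$orig_uri>";
             });

             $finder->find(\$text);

             print "callback saw: $_\n" for @seen;
             print "$text\n";
             ```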

       schemeless_to_schemed
             my $schemed_uri = $self->schemeless_to_schemed($schemeless_uri);

           This takes a schemeless URI and returns an absolute, schemed URI.
           The standard implementation supplies ftp:// for URIs which start
           with ftp., and http:// otherwise.
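
           The rule just described can be sketched as a standalone function.
           This is an illustration of the documented behaviour, not the
           module's own code, and guess_scheme is a hypothetical name.

             ```perl
             use strict;
             use warnings;

             # Hypothetical stand-in for the documented rule: names starting
             # with "ftp." get ftp://, everything else gets http://.
             sub guess_scheme {
                 my($schemeless_uri) = @_;
                 return $schemeless_uri =~ /^ftp\./
                     ? "ftp://$schemeless_uri"
                     : "http://$schemeless_uri";
             }

             print guess_scheme($_), "\n"
                 for qw(ftp.example.com www.example.com example.org);
             ```

           A subclass wanting a different default scheme would override
           schemeless_to_schemed along the same lines.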

       is_schemed
             $obj->is_schemed($uri);

           Returns true if the given URI is schemed, false if it is
           schemeless.

       badinvo
             __PACKAGE__->badinvo($extra_levels, $msg);

           This is used to complain about bogus subroutine/method invocations.
           The args are optional.

   Old Functions
       The old find_uri() function is still around and it works, but it's
       deprecated.

EXAMPLES

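       Store a list of all URIs (normalized) in the document.

         my @uris;
         my $finder = URI::Find->new(sub {
             my($uri, $orig_uri) = @_;
             push @uris, $uri;
             # Return the original text so the document is left unchanged;
             # otherwise push's return value would replace the URI.
             return $orig_uri;
         });
         $finder->find(\$text);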

       Print the original URI text found and the normalized representation.

         my $finder = URI::Find->new(sub {
             my($uri, $orig_uri) = @_;
             print "The text '$orig_uri' represents '$uri'\n";
             return $orig_uri;
         });
         $finder->find(\$text);

       Check each URI in the document to see if it exists.

         use LWP::Simple;

         my $finder = URI::Find->new(sub {
             my($uri, $orig_uri) = @_;
             if( head $uri ) {
                 print "$orig_uri is okay\n";
             }
             else {
                 print "$orig_uri cannot be found\n";
             }
             return $orig_uri;
         });
         $finder->find(\$text);

       Turn plain text into HTML, with each URI found wrapped in an HTML
       anchor.

         use CGI qw(escapeHTML);
         use URI::Find;

         my $finder = URI::Find->new(sub {
             my($uri, $orig_uri) = @_;
             return qq|<a href="$uri">$orig_uri</a>|;
         });
         $finder->find(\$text, \&escapeHTML);
         print "<pre>$text</pre>";

NOTES

       URI::Find will not find URLs containing Internationalized Domain Names
       or, for that matter, most other non-ASCII characters.  See
       <http://rt.cpan.org/Ticket/Display.html?id=44226>

AUTHOR

       Michael G Schwern <schwern@pobox.com> with insight from Uri Gutman,
       Greg Bacon, Jeff Pinyan, Roderick Schertler and others.

       Roderick Schertler <roderick@argon.org> maintained versions 0.11 to
       0.16.

       Darren Chamberlain wrote urifind.

LICENSE

       Copyright 2000, 2009-2010, 2014, 2016 by Michael G Schwern
       <schwern@pobox.com>.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       See http://www.perlfoundation.org/artistic_license_1_0