URI::Find(3) User Contributed Perl Documentation URI::Find(3)

NAME
URI::Find - Find URIs in arbitrary text

SYNOPSIS
    require URI::Find;

    my $finder = URI::Find->new(\&callback);

    $how_many_found = $finder->find(\$text);

DESCRIPTION
This module does one thing: it finds URIs and URLs in plain text. It
finds them quickly and it finds them all (or at least everything that
URI::URL considers a URI). It only finds URIs which include a scheme
(http:// or the like). For something a bit less strict, have a look at
URI::Find::Schemeless.

For a command-line interface, urifind is provided.

Public Methods
new
    my $finder = URI::Find->new(\&callback);

Creates a new URI::Find object.

&callback is a function which is called on each URI found. It is
passed two arguments: the first is a URI::URL object representing
the URI found, and the second is the original text of that URI.
The return value of the callback replaces the original URI in
the text.

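As a rough illustration of the callback contract described above, the following sketch wraps each URI it finds in square brackets. The sample text and the bracket format are assumptions made for the example only.

```perl
use strict;
use warnings;
use URI::Find;

# Sample input, invented for this sketch.
my $text = 'See http://www.example.com/ for details.';

my $finder = URI::Find->new(sub {
    # First argument: URI::URL object; second: the text as it was matched.
    my ($uri, $orig_uri) = @_;

    # Whatever is returned here replaces the matched text.
    return "[$orig_uri]";
});

my $how_many_found = $finder->find(\$text);
print "$how_many_found URI(s) found: $text\n";
```

A callback that only inspects URIs should return $orig_uri so that the text is left as it was.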
find
    my $how_many_found = $finder->find(\$text);

$text is the string to search and possibly modify with your callback.
It is passed by reference.

Alternatively, "find" can be called with a replacement function for
the rest of the text:

    use CGI qw(escapeHTML);
    # ...
    my $how_many_found = $finder->find(\$text, \&escapeHTML);

This not only calls the callback function for every URL found (and
performs the replacement instructions therein), but also runs the
rest of the text through "escapeHTML()". This makes it easier to
turn plain text which contains URLs into HTML (see example below).

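To make the division of labour between the two functions visible without involving CGI, here is a small sketch in which the replacement function simply upper-cases the non-URI text. The sample text, the angle-bracket markup and the use of uc are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use URI::Find;

# Sample input, invented for this sketch.
my $text = 'See http://www.example.com/ for details.';

# The callback handles the URIs themselves.
my $finder = URI::Find->new(sub {
    my ($uri, $orig_uri) = @_;
    return "<$orig_uri>";
});

# The second argument to find() handles everything that is not a URI.
my $how_many_found = $finder->find(\$text, sub { return uc $_[0] });
print "$text\n";
```

The URI keeps its original case because the callback's return value is not passed through the replacement function.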
Protected Methods
I got a bunch of mail from people asking if I'd add certain features to
URI::Find. Most wanted the search to be less restrictive, do more
heuristics, etc... Since many of the requests were contradictory, I'm
letting people create their own custom subclasses to do what they want.

The following are methods internal to URI::Find which a subclass can
override to change the way URI::Find acts. They are only to be called
inside a URI::Find subclass. Users of this module are NOT to use these
methods.

uri_re
    my $uri_re = $self->uri_re;

Returns the regex for finding absolute, schemed URIs
(http://www.foo.com and such). This, combined with
schemeless_uri_re(), is what finds candidate URIs.

Usually this method does not have to be overridden.

schemeless_uri_re
    my $schemeless_re = $self->schemeless_uri_re;

Returns the regex for finding schemeless URIs (www.foo.com and
such) and other things which might be URIs. By default this will
match nothing (though it used to try to find schemeless URIs which
started with "www" and "ftp").

Many people will want to override this method. See
URI::Find::Schemeless for a subclass which does a reasonable job of
finding URIs which might be missing the scheme.

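A minimal sketch of such an override might look like the following. The package name My::WwwFinder and the pattern itself are assumptions made for this example; the pattern is deliberately simple and is not the one URI::Find::Schemeless uses.

```perl
use strict;
use warnings;

# Hypothetical subclass name, chosen for this sketch.
package My::WwwFinder;
use parent 'URI::Find';

# Treat bare "www." host names as candidate URIs.
sub schemeless_uri_re {
    return qr/\bwww\.[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/;
}

package main;

# Sample input, invented for this sketch.
my $text = 'Visit www.example.com today.';

my @found;
my $finder = My::WwwFinder->new(sub {
    my ($uri, $orig_uri) = @_;
    push @found, "$uri";    # stringify the URI::URL object
    return $orig_uri;       # leave the text unchanged
});

my $how_many_found = $finder->find(\$text);
print "$how_many_found: @found\n";
```

Only the regex is overridden here; the inherited schemeless_to_schemed() is relied on to supply the scheme before the callback runs.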
uric_set
    my $uric_set = $self->uric_set;

Returns a set matching the 'uric' set defined in RFC 2396 suitable
for putting into a character set ([]) in a regex.

You almost never have to override this.

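To show what "suitable for putting into a character set" means in practice, the sketch below builds a character class from a hand-written set. The variable $uric_chars is an assumption transcribed from the RFC 2396 definitions of reserved and unreserved characters plus "%"; it is not copied from URI::Find.

```perl
use strict;
use warnings;

# Assumed character list: RFC 2396 reserved and unreserved characters,
# plus the "%" that introduces an escape. The trailing "-" is literal.
my $uric_chars = q{;/?:@&=+$,A-Za-z0-9_.!~*'()%-};

# Sample input, invented for this sketch.
my $text = 'go to http://www.example.com/a?b=c now';

# The set is interpolated between the brackets of a character class.
my ($run) = $text =~ /(http:[$uric_chars]+)/;
print "$run\n";
```

The match stops at the first character outside the set, which here is the space before "now".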
cruft_set
    my $cruft_set = $self->cruft_set;

Returns a set of characters which are considered garbage. Used by
decruft().

decruft
    my $uri = $self->decruft($uri);

Sometimes garbage characters like periods and parentheses get
accidentally matched along with the URI. In order for the URI to
be properly identified, it must sometimes be "decrufted", that is,
have the garbage characters stripped.

This method takes a candidate URI and strips off any cruft it
finds.

recruft
    my $uri = $self->recruft($uri);

This method puts back the cruft taken off with decruft(). This is
necessary because the cruft is destructively removed from the
string before invoking the user's callback, so it has to be put
back afterwards.

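The decruft/recruft round trip can be pictured with the following sketch. The helpers strip_cruft and restore_cruft and the punctuation list are hypothetical stand-ins written for this example; they are not the module's own implementation, and they only handle trailing cruft.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing punctuation and hand it back
# separately so that it can be restored later.
sub strip_cruft {
    my ($candidate) = @_;
    my $cruft = '';
    if ($candidate =~ s/([).,;'"!?]+)\z//) {
        $cruft = $1;
    }
    return ($candidate, $cruft);
}

# Hypothetical helper: append the saved cruft to the replacement text.
sub restore_cruft {
    my ($replacement, $cruft) = @_;
    return $replacement . $cruft;
}

# A candidate as it might be matched inside "(at http://www.example.com/)."
my ($uri, $cruft) = strip_cruft('http://www.example.com/).');

# Stand-in for whatever the user's callback would return.
my $result = restore_cruft("<$uri>", $cruft);
print "$result\n";
```

Because the punctuation is restored after the replacement, the surrounding sentence keeps its closing parenthesis and period.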
schemeless_to_schemed
    my $schemed_uri = $self->schemeless_to_schemed($schemeless_uri);

This takes a schemeless URI and returns an absolute, schemed URI.
The standard implementation supplies ftp:// for URIs which start
with ftp., and http:// otherwise.

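The rule stated above is small enough to sketch directly. The function add_scheme is a hypothetical stand-in that follows the documented behaviour; it is not the module's code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the documented default rule:
# "ftp." hosts get ftp://, everything else gets http://.
sub add_scheme {
    my ($schemeless) = @_;
    return $schemeless =~ /\Aftp\./
        ? "ftp://$schemeless"
        : "http://$schemeless";
}

print add_scheme('ftp.example.com'), "\n";
print add_scheme('www.example.com'), "\n";
```

A subclass that wanted, say, https:// by default would override schemeless_to_schemed() with a rule of its own along these lines.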
is_schemed
    $obj->is_schemed($uri);

Returns true if the given URI has a scheme, false if it is
schemeless.

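One plausible way to make this distinction is to test for a leading scheme followed by a colon. The helper looks_schemed below is a hypothetical sketch based on the scheme syntax in RFC 3986; the pattern the module actually uses may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own code.
sub looks_schemed {
    my ($candidate) = @_;

    # RFC 3986 scheme: a letter, then letters, digits, "+", "-" or ".",
    # terminated by a colon.
    return $candidate =~ /\A[A-Za-z][A-Za-z0-9+.-]*:/ ? 1 : 0;
}

for my $candidate ('http://www.example.com/', 'www.example.com') {
    printf "%-25s %s\n", $candidate,
        looks_schemed($candidate) ? 'schemed' : 'schemeless';
}
```

A purely syntactic test like this also accepts host:port strings such as localhost:8080, so it is only a first approximation.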
badinvo
    __PACKAGE__->badinvo($extra_levels, $msg);

This is used to complain about bogus subroutine/method invocations.
The args are optional.

Old Functions
The old find_uri() function is still around and it works, but it is
deprecated.

EXAMPLES
Store a list of all URIs (normalized) in the document.

    my @uris;
    my $finder = URI::Find->new(sub {
        my($uri, $orig_uri) = @_;
        push @uris, $uri;
        return $orig_uri;    # leave the text itself unchanged
    });
    $finder->find(\$text);

Print the original URI text found and the normalized representation.

    my $finder = URI::Find->new(sub {
        my($uri, $orig_uri) = @_;
        print "The text '$orig_uri' represents '$uri'\n";
        return $orig_uri;
    });
    $finder->find(\$text);

Check each URI in the document to see if it exists.

    use LWP::Simple;

    my $finder = URI::Find->new(sub {
        my($uri, $orig_uri) = @_;
        if( head $uri ) {
            print "$orig_uri is okay\n";
        }
        else {
            print "$orig_uri cannot be found\n";
        }
        return $orig_uri;
    });
    $finder->find(\$text);

Turn plain text into HTML, with each URI found wrapped in an HTML
anchor.

    use CGI qw(escapeHTML);
    use URI::Find;

    my $finder = URI::Find->new(sub {
        my($uri, $orig_uri) = @_;
        return qq|<a href="$uri">$orig_uri</a>|;
    });
    $finder->find(\$text, \&escapeHTML);
    print "<pre>$text</pre>";

NOTES
URI::Find will not find URLs with Internationalized Domain Names or
pretty much any non-ASCII stuff in them. See
<http://rt.cpan.org/Ticket/Display.html?id=44226>

AUTHOR
Michael G Schwern <schwern@pobox.com> with insight from Uri Gutman,
Greg Bacon, Jeff Pinyan, Roderick Schertler and others.

Roderick Schertler <roderick@argon.org> maintained versions 0.11 to
0.16.

Darren Chamberlain wrote urifind.

LICENSE
Copyright 2000, 2009-2010 by Michael G Schwern <schwern@pobox.com>.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

See http://www.perlfoundation.org/artistic_license_1_0

SEE ALSO
urifind, URI::Find::Schemeless, URI::URL, URI, RFC 3986 Appendix C

perl v5.12.3 2011-03-27 URI::Find(3)