Extracting All the URLs from an HTML Document
Problem
You want to find all the URLs on a web page.
Solution
Do you only want to find links (that is, URLs mentioned in the HREF attribute of an A tag)? Do you also want to find the URLs of embedded objects like images and applets? Or do you want to find all URLs, including ones mentioned in the text of the page?
The last case is the simplest. You can use URI.extract to get all the URLs found in a string, or to get only the URLs with certain schemes. Here we'll extract URLs from some HTML, whether or not they're inside A tags:
require 'uri'

text = %{"My homepage is at <a href="http://www.example.com/">http://www.example.com/</a>, and be sure to check out my weblog at http://www.example.com/blog/. Email me at <a href="mailto:bob@example.com">bob@example.com</a>."}

URI.extract(text)
# => ["http://www.example.com/", "http://www.example.com/",
#     "http://www.example.com/blog/.", "mailto:bob@example.com"]

# Get HTTP(S) links only.
URI.extract(text, ['http', 'https'])
# => ["http://www.example.com/", "http://www.example.com/",
#     "http://www.example.com/blog/."]
If you only want URLs that show up inside certain tags, you need to parse the HTML. Assuming the document is valid, you can do this with any of the parsers in the rexml library. Here's an efficient implementation using REXML's stream parser. It retrieves URLs found in the HREF attributes of A tags and the SRC attributes of IMG tags, but you can customize this behavior by passing a different map to the constructor.
require 'rexml/document'
require 'rexml/streamlistener'
require 'set'
class LinkGrabber
  include REXML::StreamListener
  attr_reader :links

  def initialize(interesting_tags = {'a' => %w{href}, 'img' => %w{src}}.freeze)
    @tags = interesting_tags
    @links = Set.new
  end

  def tag_start(name, attrs)
    @tags[name].each do |uri_attr|
      @links << attrs[uri_attr] if attrs[uri_attr]
    end if @tags[name]
  end

  def parse(text)
    REXML::Document.parse_stream(text, self)
  end
end
grabber = LinkGrabber.new
grabber.parse(text)
grabber.links
# => #<Set: {"http://www.example.com/", "mailto:bob@example.com"}>
Discussion
The URI.extract solution uses regular expressions to find everything that looks like a URL. This is faster and easier to write than a REXML parser, but it will find every absolute URL in the document, including any mentioned in the text and any in the document's initial DOCTYPE. It will not find relative URLs hidden within HREF attributes, since those don't start with an access scheme like "http://".
URI.extract also treats the period at the end of the first sentence ("check out my weblog at…") as though it were part of the URL. URLs contained within English text are often ambiguous in this way. "http://www.example.com/blog/." is a perfectly valid URL and might be correct, but that period is probably just punctuation. Accessing the URL is the only way to know for sure, but it's almost always safe to strip those characters:
END_CHARS = %{.,?!:;}
URI.extract(text, ['http']).collect { |u| END_CHARS.index(u[-1]) ? u.chop : u }
# => ["http://www.example.com/", "http://www.example.com/",
#     "http://www.example.com/blog/"]
The parser solution defines a listener that hears about every tag present in its interesting_tags map. It checks each tag for attributes that tend to contain URLs: "href" for <a> tags and "src" for <img> tags, for instance. Every URL it finds goes into a set. The use of a set here guarantees that the result contains no duplicate URLs. If you want to gather (possibly duplicate) URLs in the order they were found in the document, use a list, the way URI.extract does.
The LinkGrabber solution will not find URLs in the text portions of the document, but it will find relative URLs. Of course, you still need to know how to turn relative URLs into absolute URLs. If the document has a <base> tag, you can use that. Otherwise, the base depends on the original URL of the document. Here's a subclass of LinkGrabber that changes relative links to absolute links if possible. Since it uses URI.join, which returns a URI object, your set will end up containing URI objects instead of strings:
class AbsoluteLinkGrabber < LinkGrabber
  include REXML::StreamListener
  attr_reader :links

  def initialize(original_url = nil,
                 interesting_tags = {'a' => %w{href}, 'img' => %w{src}}.freeze)
    super(interesting_tags)
    @base = original_url
  end

  def tag_start(name, attrs)
    if name == 'base'
      @base = attrs['href']
    end
    super
  end

  def parse(text)
    super
    # If we know of a base URL by the end of the document, use it to
    # change all relative URLs to absolute URLs.
    @links.collect! { |l| URI.join(@base, l) } if @base
  end
end
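Here's a usage sketch (the page URL and markup are invented for illustration). With an original URL supplied, a relative link like "/blog/" comes out as an absolute URI object:

grabber = AbsoluteLinkGrabber.new('http://www.example.com/')
grabber.parse(%{<a href="/blog/">My weblog</a>})
grabber.links
# => #<Set: {#<URI::HTTP http://www.example.com/blog/>}>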
If you want to use the parsing solution, but the web page has invalid HTML that chokes the REXML parsers (which is quite likely), try the techniques mentioned in Recipe 11.5.
Almost 20 HTML tags can have URLs in one or more of their attributes. If you want to collect every URL mentioned in an appropriate part of a web page, here's a big map you can pass in to the constructor of LinkGrabber or AbsoluteLinkGrabber:
URL_LOCATIONS = { 'a' => %w{href},
                  'area' => %w{href},
                  'applet' => %w{classid},
                  'base' => %w{href},
                  'blockquote' => %w{cite},
                  'body' => %w{background},
                  'codebase' => %w{classid},
                  'del' => %w{cite},
                  'form' => %w{action},
                  'frame' => %w{src longdesc},
                  'iframe' => %w{src longdesc},
                  'input' => %w{src usemap},
                  'img' => %w{src longdesc usemap},
                  'ins' => %w{cite},
                  'link' => %w{href},
                  'object' => %w{usemap archive codebase data},
                  'profile' => %w{head},
                  'q' => %w{cite},
                  'script' => %w{src}}.freeze
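As a quick illustration (the markup is invented for the example), the bigger map picks up URLs the default map would miss, such as a <body> background image:

everything = LinkGrabber.new(URL_LOCATIONS)
everything.parse(%{<body background="/images/bg.png">
  <a href="http://www.example.com/">Home</a>
  <img src="/images/me.jpg"/>
</body>})
everything.links
# => #<Set: {"/images/bg.png", "http://www.example.com/", "/images/me.jpg"}>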
See Also