Extracting All the URLs from an HTML Document

Problem

You want to find all the URLs on a web page.

Solution

Do you only want to find links (that is, URLs mentioned in the HREF attribute of an A tag)? Do you also want to find the URLs of embedded objects like images and applets? Or do you want to find all URLs, including ones mentioned in the text of the page?

The last case is the simplest. You can use URI.extract to get all the URLs found in a string, or to get only the URLs with certain schemes. Here we'll extract URLs from some HTML, whether or not they're inside A tags:

require 'uri'

text = %{"My homepage is at <a href="http://www.example.com/">http://www.example.com/</a>, and be sure to check out my weblog at http://www.example.com/blog/. Email me at <a href="mailto:bob@example.com">bob@example.com</a>.}

URI.extract(text)
# => ["http://www.example.com/", "http://www.example.com/",
#     "http://www.example.com/blog/.", "mailto:bob@example.com"]

# Get HTTP(S) links only.
URI.extract(text, ['http', 'https'])
# => ["http://www.example.com/", "http://www.example.com/",
#     "http://www.example.com/blog/."]

If you only want URLs that show up inside certain tags, you need to parse the HTML. Assuming the document is valid, you can do this with any of the parsers in the rexml library. Here's an efficient implementation using REXML's stream parser. It retrieves URLs found in the HREF attributes of A tags and the SRC attributes of IMG tags, but you can customize this behavior by passing a different map to the constructor.

require 'rexml/document'
require 'rexml/streamlistener'
require 'set'

class LinkGrabber
  include REXML::StreamListener
  attr_reader :links

  def initialize(interesting_tags = { 'a' => %w{href},
                                      'img' => %w{src} }.freeze)
    @tags = interesting_tags
    @links = Set.new
  end

  def tag_start(name, attrs)
    @tags[name].each do |uri_attr|
      @links << attrs[uri_attr] if attrs[uri_attr]
    end if @tags[name]
  end

  def parse(text)
    REXML::Document.parse_stream(text, self)
  end
end

grabber = LinkGrabber.new
grabber.parse(text)
grabber.links
# => #<Set: {"http://www.example.com/", "mailto:bob@example.com"}>

Discussion

The URI.extract solution uses regular expressions to find everything that looks like a URL. This is faster and easier to write than a REXML parser, but it will find every absolute URL in the document, including any mentioned in the text and any in the document's initial DOCTYPE. It will not find relative URLs hidden within HREF attributes, since those don't start with an access scheme like "http://".
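For instance, given a made-up snippet (not from the example above) containing one absolute and one relative link, URI.extract sees only the absolute one:

```ruby
require 'uri'

# A relative URL in an HREF attribute is invisible to URI.extract,
# because it has no access scheme for the regular expression to match.
html = %{<a href="http://www.example.com/">Home</a> <a href="/about.html">About</a>}
URI.extract(html)
# => ["http://www.example.com/"]
```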

URI.extract treats the period at the end of the first sentence ("check out my weblog at…") as though it were part of the URL. URLs contained within English text are often ambiguous in this way. "http://www.example.com/blog/." is a perfectly valid URL and might be correct, but that period is probably just punctuation. Accessing the URL is the only way to know for sure, but it's almost always safe to strip those characters:

END_CHARS = %{.,?!:;}
URI.extract(text, ['http']).collect { |u| END_CHARS.index(u[-1]) ? u.chop : u }
# => ["http://www.example.com/", "http://www.example.com/",
#     "http://www.example.com/blog/"]

The parser solution defines a listener that hears about every tag present in its interesting_tags map. It checks each tag for attributes that tend to contain URLs: "href" for <a> tags and "src" for <img> tags, for instance. Every URL it finds goes into a set.

The use of a set here guarantees that the result contains no duplicate URLs. If you want to gather (possibly duplicate) URLs in the order they were found in the document, use a list, the way URI.extract does.
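A minimal standalone sketch of that order-preserving variant (the class name is illustrative, not part of the recipe) replaces the set with an array:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical variant of LinkGrabber that collects URLs into an Array,
# keeping document order and duplicates.
class OrderedLinkGrabber
  include REXML::StreamListener
  attr_reader :links

  def initialize(interesting_tags = { 'a' => %w{href}, 'img' => %w{src} })
    @tags = interesting_tags
    @links = []
  end

  def tag_start(name, attrs)
    (@tags[name] || []).each do |uri_attr|
      @links << attrs[uri_attr] if attrs[uri_attr]
    end
  end

  def parse(text)
    REXML::Document.parse_stream(text, self)
  end
end

grabber = OrderedLinkGrabber.new
grabber.parse('<p><a href="/a">One</a> <a href="/a">Two</a></p>')
grabber.links
# => ["/a", "/a"]
```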

The LinkGrabber solution will not find URLs in the text portions of the document, but it will find relative URLs. Of course, you still need to know how to turn relative URLs into absolute URLs. If the document has a <base> tag, you can use that. Otherwise, the base depends on the original URL of the document.
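The conversion itself is handled by URI.join, which resolves a relative reference against a base URL:

```ruby
require 'uri'

# A path relative to the base is appended to the base's directory.
URI.join('http://www.example.com/blog/', 'atom.xml')
# => #<URI::HTTP http://www.example.com/blog/atom.xml>

# A path starting with "/" is resolved from the site root.
URI.join('http://www.example.com/blog/', '/about.html')
# => #<URI::HTTP http://www.example.com/about.html>
```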

Here's a subclass of LinkGrabber that changes relative links to absolute links if possible. Since it uses URI.join, which returns a URI object, your set will end up containing URI objects instead of strings:

class AbsoluteLinkGrabber < LinkGrabber
  include REXML::StreamListener
  attr_reader :links

  def initialize(original_url = nil,
                 interesting_tags = { 'a' => %w{href},
                                      'img' => %w{src} }.freeze)
    super(interesting_tags)
    @base = original_url
  end

  def tag_start(name, attrs)
    if name == 'base'
      @base = attrs['href']
    end
    super
  end

  def parse(text)
    super
    # If we know of a base URL by the end of the document, use it to
    # change all relative URLs to absolute URLs.
    @links.collect! { |l| URI.join(@base, l) } if @base
  end
end

If you want to use the parsing solution, but the web page has invalid HTML that chokes the REXML parsers (which is quite likely), try the techniques mentioned in Recipe 11.5.

Almost 20 HTML tags can have URLs in one or more of their attributes. If you want to collect every URL mentioned in an appropriate part of a web page, here's a big map you can pass in to the constructor of LinkGrabber or AbsoluteLinkGrabber:

URL_LOCATIONS = { 'a' => %w{href},
                  'area' => %w{href},
                  'applet' => %w{classid},
                  'base' => %w{href},
                  'blockquote' => %w{cite},
                  'body' => %w{background},
                  'codebase' => %w{classid},
                  'del' => %w{cite},
                  'form' => %w{action},
                  'frame' => %w{src longdesc},
                  'iframe' => %w{src longdesc},
                  'input' => %w{src usemap},
                  'img' => %w{src longdesc usemap},
                  'ins' => %w{cite},
                  'link' => %w{href},
                  'object' => %w{usemap archive codebase data},
                  'profile' => %w{head},
                  'q' => %w{cite},
                  'script' => %w{src} }.freeze

See Also
