A Real-World HTTP Client
The first three recipes in this chapter cover different ways of fetching web pages. The techniques they describe work well if you just need to fetch one specific web page, but in the interests of simplicity they omit some details you'll need to consider when writing a web spider, a web browser, or any other serious HTTP client. This recipe creates a library that deals with those details.
Mixed HTTP and HTTPS
Any general client will have to be able to make both HTTP and HTTPS requests, but the simple Net::HTTP methods used in Recipe 14.1 can't be used to make HTTPS requests. Our library will use request objects for everything. If the user requests a URL that uses the "https" scheme, we'll flip the request object's use_ssl switch, as seen in Recipe 14.2.
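For reference, here's a minimal sketch of the Recipe 14.2 technique the library relies on: create a Net::HTTP object and flip its use_ssl switch when the scheme is "https". The URL is only a placeholder.

require 'net/https'
require 'uri'

uri = URI.parse('https://example.com/')    # placeholder URL
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = (uri.scheme == 'https')     # the switch our library will flip
response = http.get(uri.path)
puts response.code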
Redirects
Lots of things can go wrong with an HTTP request: the page might have moved, it might require authentication, or it might simply be gone. Most HTTP errors call for higher-level handling or human intervention, but when a page has moved, a smart client can automatically follow it to its new location.
Our library will automatically follow redirects that provide "Location" fields in their responses. It'll prevent infinite redirect loops by refusing to visit a URL it's already visited, and it'll prevent endless redirect chains by limiting the total number of redirects. After all the redirects have been followed, it'll make the final URI available as a member of the response object.
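Before the full library, here's a bare-bones sketch of just that redirect logic, built on the plain Net::HTTP.get_response from Recipe 14.1. The limit of 10 and the Set of visited URIs mirror what the library below does; unlike the library, this sketch handles only plain HTTP and doesn't record a final_uri.

require 'net/http'
require 'set'

def follow_redirects(url, limit = 10)
  uri = URI.parse(url)
  urls_seen = Set.new
  response = Net::HTTP.get_response(uri)
  # Keep following "Location" headers until we get a non-redirect,
  # revisit a URI we've already seen, or exhaust the redirect budget.
  while response.is_a?(Net::HTTPRedirection) && response['Location'] &&
        urls_seen.size < limit && !urls_seen.member?(uri)
    urls_seen << uri
    uri = URI.parse(response['Location'])
    response = Net::HTTP.get_response(uri)
  end
  [uri, response]
end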
Proxies
People use HTTP proxies to make high-latency connections work faster, to surf anonymously, and to evade censorship. Each client program needs to be told to use a proxy, and it's an easy feature to overlook if you don't use a proxy yourself. Fortunately, it's easy to support proxies in Ruby: the Net::HTTP.Proxy method creates a custom Net::HTTP subclass that works through a particular proxy.
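In case you haven't used it before, here's a minimal sketch of Net::HTTP.Proxy on its own; the proxy host and port are placeholders. The method returns a subclass of Net::HTTP whose instances route every request through the given proxy, and Net::HTTP.Proxy(nil) simply returns a class that makes direct connections, which is why the library below can use it unconditionally.

require 'net/http'

# 'proxy.example.com' and 8080 stand in for a real proxy.
proxied = Net::HTTP.Proxy('proxy.example.com', 8080)
proxied.start('www.example.com') do |http|
  response = http.get('/')
  puts response.code
end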
This library defines a single new method, Net::HTTP.fetch: an all-singing, all-dancing factory for HTTPResponse objects. It silently handles HTTPS URLs (assuming you have net/https installed) and HTTP redirects, and it transparently supports proxies. This might go into a file called http_fetch.rb:
require 'net/http'
require 'set'

class Net::HTTPResponse
  attr_accessor :final_uri
end

module Net
  begin
    require 'net/https'
    HTTPS_SUPPORTED = true
  rescue LoadError
    HTTPS_SUPPORTED = false
  end

  class HTTP
    # Makes an HTTP request and returns the HTTPResponse object.
    # Args: :proxy_host, :proxy_port, :action (:get, :post, etc.),
    #       :data (for :post action), :max_redirects.
    def HTTP.fetch(uri, args={}.freeze, &before_fetching)
      # Process the arguments with default values
      uri = URI.parse(uri) unless uri.is_a? URI
      proxy_host = args[:proxy_host]
      proxy_port = args[:proxy_port] || 80
      action = args[:action] || :get
      data = args[:data]
      max_redirects = args[:max_redirects] || 10
We always work through a proxy class, even if no proxy is specified: a Proxy class created with no proxy_host makes direct HTTP connections. This way, the code works the same whether we're actually going through an HTTP proxy or not:
      # Use a proxy class to create the request object
      proxy_class = Proxy(proxy_host, proxy_port)
      request = proxy_class.new(uri.host, uri.port)
We'll use SSL to handle URLs with the "https" scheme. Note that we don't set any certificate paths here or do any other SSL configuration; if you want to do that, you'll need to pass an appropriate code block into fetch (see below for an example):
      request.use_ssl = true if HTTPS_SUPPORTED and uri.scheme == 'https'
      yield request if block_given?
Now we activate the request and get an HTTPResponse object back:
      response = request.send(action, uri.path, data)
Our HTTPResponse object might be a document, it might be an error, or it might be a redirect. If it's a redirect, we can make things easier for the caller of this method by following the redirect. This piece of the method finds the redirected URL and sends it into a recursive fetch call, after making sure we aren't stuck in an infinite loop or an endless chain of redirects:
      urls_seen = args[:_urls_seen] || Set.new
      if response.is_a?(Net::HTTPRedirection)            # Redirect
        if urls_seen.size < max_redirects && response['Location']
          urls_seen << uri
          new_uri = URI.parse(response['Location'])
          return response if urls_seen.member? new_uri   # Infinite redirect loop
          # Request the new location just as we did the old one.
          new_args = args.dup
          puts "Redirecting to #{new_uri}" if $DEBUG
          new_args[:_urls_seen] = urls_seen
          response = HTTP.fetch(new_uri, new_args, &before_fetching)
        end
      else                                               # No redirect
        response.final_uri = uri
      end
      return response
    end
  end
end
That's pretty dense code, but it ties a lot of functionality into a single method with a relatively simple API. Here's a simple example, in which Net::HTTP.fetch silently follows an HTTP redirect. Note that the final_uri is different from the original URI:
response = Net::HTTP.fetch("http://google.com/")
puts "#{response.final_uri} body is #{response.body.size} bytes."
# http://www.google.com/ body is 2444 bytes.
With fetch, redirects work even through proxies. This example accesses the Google homepage through a public HTTP proxy in Singapore. When it requests "http://google.com/", it's redirected to "http://www.google.com/", as in the previous example. But when Google notices that the request is coming from a Singapore IP address, it sends another redirect:
response = Net::HTTP.fetch("http://google.com/",
                           :proxy_host => "164.78.252.199")
puts "#{response.final_uri} body is #{response.body.size} bytes."
# http://www.google.com.sg/ body is 2853 bytes.
There are HTTPS proxies as well. This code uses an HTTPS proxy in the U.S. to make a secure connection to "https://paypal.com/". It's redirected to "https://paypal.com/us/". The second request is secured the same way as the one that caused the redirect. Note that this code will only work if you have the Ruby SSL library installed:
response = Net::HTTP.fetch("https://paypal.com/",
                           :proxy_host => "209.40.194.8") do |request|
  request.ca_path = "/etc/ssl/certs/"
  request.verify_mode = OpenSSL::SSL::VERIFY_PEER
end
puts "#{response.final_uri} body is #{response.body.size} bytes."
# https://paypal.com/us/ body is 16978 bytes.
How does this work? The code block is actually called twice: once before requesting "https://paypal.com/" and once before requesting "https://paypal.com/us/". This is what fetch's code block is for: it's run on the request object before each request is actually made. If the code block were only called once, the second request wouldn't have access to any certificates.
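One way to convince yourself of this is to pass in a block that does nothing but announce the host it's about to contact. Since the block runs before every request in the redirect chain, it fires once per hop (the exact hosts depend on how you're redirected):

response = Net::HTTP.fetch("http://google.com/") do |request|
  puts "About to fetch something from #{request.address}"
end
# About to fetch something from google.com
# About to fetch something from www.google.com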
Net::HTTP.fetch will follow redirects served by the web server, but it won't follow redirects contained in the META tags of an HTML document. To follow those redirects, you'll have to parse the document as HTML.
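If you need that, one rough approach (a sketch, not part of the library) is to check the body of the final response for a META refresh tag and feed any URL you find back into fetch. A real client would use an HTML parser rather than a regular expression:

# Returns the target of a META refresh tag, or nil if there isn't one.
def meta_refresh_url(response)
  return nil unless response['Content-Type'].to_s =~ /html/i
  if response.body =~ /<meta[^>]+http-equiv\s*=\s*["']?refresh["']?[^>]*url\s*=\s*["']?([^"'>\s]+)/i
    $1
  end
end

# Usage sketch:
# response = Net::HTTP.fetch("http://example.com/old-page/")
# target = meta_refresh_url(response)
# response = Net::HTTP.fetch(target) if target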
See Also
- Recipe 14.1, "Grabbing the Contents of a Web Page"
- Recipe 14.2, "Making an HTTPS Web Request"
- Recipe 14.3, "Customizing HTTP Request Headers"
- Several web sites have lists of public HTTP and HTTPS proxies (for instance, http://www.samair.ru/proxy/ and http://tools.rosinstrument.com/proxy/); if you want to set up a proxy on your local network, Squid is a good choice (http://www.squid-cache.org/)