Grabbing the Contents of a Web Page

2017-11-03 09:05:07

Problem

You want to display or process a specific web page.

Solution

The simplest solution is to use the open-uri library. It lets you open a web page as though it were a file. This code fetches the oreilly.com homepage and prints out the first part of it:

    require 'open-uri'
    # Use URI.open: as of Ruby 3.0, Kernel#open no longer accepts URLs.
    puts URI.open('http://www.oreilly.com/').read(200)

For more complex applications, you'll need to use the net/http library. Use Net::HTTP.get_response to make an HTTP request and get the response as a Net::HTTPResponse object containing the response code, headers, and body.

    require 'net/http'
    response = Net::HTTP.get_response('www.oreilly.com', '/about/')
    response.code                # => "200"
    response.body.size           # => 21835
    response['Content-type']     # => "text/html; charset=ISO-8859-1"
    puts response.body[0,200]

Rather than passing in the hostname, port, and path as separate arguments, it's usually easier to create URI objects from URL strings and pass those into the Net::HTTP methods.

    require 'net/http'
    require 'uri'
    Net::HTTP.get(URI.parse("http://www.oreilly.com"))
    Net::HTTP.get_response(URI.parse("http://www.oreilly.com/about/"))
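To see why a URI object can stand in for the separate hostname, port, and path arguments, it may help to look at what URI.parse extracts from a URL string. This is a small sketch that makes no network request; the variable name uri is only illustrative.

```ruby
require 'uri'

# Parse a URL string into a URI object and inspect its parts.
uri = URI.parse('http://www.oreilly.com/about/')

puts uri.scheme   # http
puts uri.host     # www.oreilly.com
puts uri.port     # 80 (the default port for the http scheme)
puts uri.path     # /about/
```

Because the object already carries the host, port, and path, passing it to Net::HTTP.get or Net::HTTP.get_response avoids splitting the URL by hand.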

Discussion

If you just want the text of the page, use get. If you also want the response code or the values of the HTTP response headers, use get_response.

The get_response method returns an instance of some subclass of Net::HTTPResponse, which contains all information about an HTTP response. There's one subclass for every response code defined in the HTTP standard; for instance, HTTPOK for the 200 response code, HTTPMovedPermanently for the 301 response code, and HTTPNotFound for the 404 response code. There's also an HTTPUnknownResponse subclass for any response codes not defined in HTTP.

The only difference between these subclasses is the class name and the code member. You can check the response code of an HTTP response by comparing specific classes with is_a?, or by checking the result of HTTPResponse#code, which returns a String:

    puts "Success!" if response.is_a? Net::HTTPOK
    # Success!

    puts case response.code[0] # Check the first byte of the response code.
      when ?1 then "Status code indicates an HTTP informational response."
      when ?2 then "Status code indicates success."
      when ?3 then "Status code indicates redirection."
      when ?4 then "Status code indicates client error."
      when ?5 then "Status code indicates server error."
      else "Non-standard status code."
    end
    # Status code indicates success.
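As an alternative to inspecting the first character of the code, the class check can be widened: net/http groups the per-code classes under family classes such as Net::HTTPSuccess and Net::HTTPClientError. The following is a minimal sketch of that idea. The helper name response_family is hypothetical, and the responses are built by hand, without any network request, purely to exercise it.

```ruby
require 'net/http'

# Hypothetical helper: classify a response by the family class it inherits from.
def response_family(response)
  case response
  when Net::HTTPInformation then :informational
  when Net::HTTPSuccess     then :success
  when Net::HTTPRedirection then :redirection
  when Net::HTTPClientError then :client_error
  when Net::HTTPServerError then :server_error
  else :unknown
  end
end

# Hand-built responses (HTTP version, code, message); no request is made.
ok      = Net::HTTPOK.new('1.1', '200', 'OK')
missing = Net::HTTPNotFound.new('1.1', '404', 'Not Found')

puts response_family(ok)       # success
puts response_family(missing)  # client_error
```

In real use you would pass in the object returned by Net::HTTP.get_response rather than constructing one yourself.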

You can get the value of an HTTP response header by treating HTTPResponse as a hash, passing the header name into HTTPResponse#[]. The only difference from a real Hash is that the names of the headers are case-insensitive. Like a hash, HTTPResponse supports the iteration methods #each, #each_key, and #each_value:

    response['Server']
    # => "Apache/1.3.34 (Unix) PHP/4.3.11 mod_perl/1.29"
    response['SERVER']
    # => "Apache/1.3.34 (Unix) PHP/4.3.11 mod_perl/1.29"

    response.each_key { |key| puts key }
    # x-cache
    # p3p
    # content-type
    # date
    # server
    # transfer-encoding
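The hash-like behavior can be tried without contacting a server. The sketch below builds a response by hand and sets two headers on it; the header values are made up for illustration, and a real response would arrive with its headers already populated.

```ruby
require 'net/http'

# Build a response by hand (no network) and set two headers on it.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Type'] = 'text/html'
response['Server']       = 'ExampleServer/1.0'   # made-up value

# Lookups ignore the case of the header name.
puts response['content-type']   # text/html
puts response['SERVER']         # ExampleServer/1.0

# Copy every header into an ordinary Hash; names come back in lowercase.
headers = {}
response.each_header { |name, value| headers[name] = value }
p headers.keys.sort             # ["content-type", "server"]
```

Copying into a plain Hash loses the case-insensitive lookup, so it is mainly useful for logging or serializing the headers.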

If you make a request by calling Net::HTTP.get_response with no code block, Ruby will read the body of the web page into a string, which you can fetch with the HTTPResponse#body method. If you like, you can process the body as you read it, one segment at a time, by passing a code block to HTTPResponse#read_body:

    Net::HTTP.get_response('www.oreilly.com', '/about/') do |response|
      response.read_body do |segment|
        puts "Received segment of #{segment.size} byte(s)!"
      end
    end
    # Received segment of 614 byte(s)!
    # Received segment of 1024 byte(s)!
    # Received segment of 848 byte(s)!
    # Received segment of 1024 byte(s)!
    # …

Note that you can only call read_body once per request. Also, there are no guarantees that a segment won't end in the middle of an HTML tag name or some other inconvenient place, so this is best for applications where you're not handling the web page as structured data: for instance, when you're simply piping it to some other source.
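One way the piping case might look is sketched below: each segment is written to an output stream as it arrives, so the whole body never has to sit in memory at once. The method name stream_page and the output filename are hypothetical, not part of net/http.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: copy a page to any object that responds to #write,
# one segment at a time, and return the number of bytes copied.
def stream_page(uri, out)
  bytes = 0
  Net::HTTP.get_response(uri) do |response|
    response.read_body do |segment|
      out.write(segment)
      bytes += segment.bytesize
    end
  end
  bytes
end

# Example usage (performs a real request, so it is left commented out):
#   File.open('about.html', 'wb') do |file|
#     stream_page(URI.parse('http://www.oreilly.com/about/'), file)
#   end
```

The helper never inspects the segments, so it does not matter where the segment boundaries fall.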
