Converting HTML Documents from the Web into Text
Problem
You want to get a text summary of a web site.
Solution
The open-uri library is the easiest way to grab the content of a web page; it lets you open a URL as though it were a file:
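require 'open-uri'
# On Ruby versions before 2.5, call open(...) instead of URI.open(...)
example = URI.open('http://www.example.com/')
# => #<StringIO:0x...>
html = example.read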
As with a file, the read method returns a string. You can chain a series of sub and gsub calls to clean the markup into a more readable format:
plain_text = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1').gsub(/<.*?>/m, ' ').
  gsub(%r{(\n\s*){2}}, "\n\n")
Finally, you can use the standard CGI library to unescape HTML entities like &lt; into their ASCII equivalents (<):
require 'cgi'
plain_text = CGI.unescapeHTML(plain_text)
The final product:
puts plain_text
# Example Web Page
#
# You have reached this web page by typing "example.com",
# "example.net",
# or "example.org" into your web browser.
# These domain names are reserved for use in documentation and are not available
# for registration. See RFC
# 2606 , Section 3.
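The steps above can also be tried offline. The following is a minimal sketch, not the recipe's own code: SAMPLE_PAGE is a made-up document standing in for the string returned by read, and html_to_text is a hypothetical helper name that simply wraps the three substitutions and the unescaping step.

```ruby
require 'cgi'

# Hypothetical sample standing in for the string returned by read.
SAMPLE_PAGE = <<~HTML
  <html>
  <head><title>Sample title</title></head>
  <body>
  <h1>Example Web Page</h1>
  <p>Fish &amp; chips cost &lt; 10 pounds.</p>
  </body>
  </html>
HTML

# Hypothetical helper wrapping the steps from the Solution.
def html_to_text(html)
  text = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1') # unwrap the <body> element
  text = text.gsub(/<.*?>/m, ' ')                    # replace every tag with a space
  text = text.gsub(/(\n\s*){2}/, "\n\n")             # collapse runs of blank lines
  CGI.unescapeHTML(text)                             # decode &amp;, &lt;, and similar
end

puts html_to_text(SAMPLE_PAGE)
```

Note that the first substitution only removes the <body> tags themselves. Text outside the body, such as the page title, survives into the output, which is why the title shows up in the result above.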
Discussion
The open-uri library lets you access the contents of web pages and FTP sites with the same interface used for local files. Older Ruby versions did this by extending the open method itself; since Ruby 3.0 you call URI.open instead.
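To illustrate that shared interface without touching the network, here is a small sketch that assumes a Ruby version where open-uri provides URI.open. The helper name read_source and the temporary file are hypothetical, chosen only for the demonstration.

```ruby
require 'open-uri'
require 'tempfile'

# Hypothetical helper: read anything URI.open understands.
# Strings that look like URLs are fetched; other strings are
# handed to the ordinary file-opening code.
def read_source(source)
  URI.open(source) { |io| io.read }
end

# Write a throwaway local file so the example runs offline.
LOCAL_PAGE = Tempfile.new(['page', '.html'])
LOCAL_PAGE.write('<p>Hello from disk</p>')
LOCAL_PAGE.close

puts read_source(LOCAL_PAGE.path)
# The same call would accept a URL such as 'http://www.example.com/'.
```

Because anything that does not look like a URL falls through to the regular open, avoid passing untrusted strings to a helper like this.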
The simple regular expression substitutions above do nothing but remove HTML tags and clean up excess whitespace. They work well for well-formatted HTML, but the web is full of mean and ugly HTML, so you may want to take a more involved approach. Let's define an HTMLSanitizer class to do our dirty business.
An HTMLSanitizer will start off with some HTML, and through a series of search-and-replace operations transform it into plain text. Different HTML tags will be handled differently. The contents of some HTML tags should simply be removed in a plaintext rendering. For example, you probably don't want to see the contents of <head> and <script> tags. Other tags affect what the rendition should look like; for instance, a <p> tag should be represented as a blank line:
require 'open-uri'
require 'cgi'

class HTMLSanitizer
  attr_accessor :html

  @@ignore_tags = ['head', 'script', 'frameset']
  @@inline_tags = ['span', 'strong', 'i', 'u']
  @@block_tags = ['p', 'div', 'ul', 'ol']
The next two methods define the skeleton of our HTML sanitizer:
  def initialize(source='')
    begin
      @html = URI.open(source).read
    rescue Errno::ENOENT, Errno::ENAMETOOLONG
      # If it's not a file, assume it's an HTML string. A long HTML
      # string can fail with ENAMETOOLONG rather than ENOENT.
      @html = source
    end
  end

  def plain_text
    # remove pre-existing blank spaces between tags since we will
    # be adding spaces on our own
    @plain_text = @html.gsub(/\s*(<.*?>)/m, '\1')

    handle_ignore_tags
    handle_inline_tags
    handle_block_tags
    handle_all_other_tags

    return CGI.unescapeHTML(@plain_text)
  end
Now we need to fill in the handle_ methods called by HTMLSanitizer#plain_text. These methods perform search-and-replace operations on the @plain_text instance variable, gradually transforming it from HTML into plain text. Because we are modifying @plain_text in place, we will need to use String#gsub! instead of String#gsub.
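The difference between the two methods is easy to see in isolation. This is a small sketch with hypothetical helper names (strip_tags_copy and strip_tags_in_place) that are not part of HTMLSanitizer:

```ruby
# Returns a new string; the argument is left untouched.
def strip_tags_copy(text)
  text.gsub(/<.*?>/m, '')
end

# Modifies the argument itself. Returns the same string object,
# or nil when nothing matched.
def strip_tags_in_place(text)
  text.gsub!(/<.*?>/m, '')
end

sample = +'<b>bold</b> text'       # unary plus gives an unfrozen string
puts strip_tags_copy(sample)       # prints "bold text"
puts sample                        # still prints "<b>bold</b> text"

strip_tags_in_place(sample)
puts sample                        # now prints "bold text"
p strip_tags_in_place(sample)      # prints nil: no tags left to replace
```

The nil return value is the main thing to watch for: chaining another call onto the result of gsub! fails when no substitution was made, which is why the handle_ methods call it for its side effect only.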
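  private

  # Matches an opening tag, its contents, and the closing tag
  def tag_regex(tag)
    %r{<#{tag}.*?>(.*?)</#{tag}>}mi
  end

  # Drop these tags along with everything inside them
  def handle_ignore_tags
    @@ignore_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), '') }
  end

  # Keep the contents, followed by a space
  def handle_inline_tags
    @@inline_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), '\1 ') }
  end

  # Keep the contents on a line of their own
  def handle_block_tags
    @@block_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), "\n\\1\n") }
  end

  def handle_all_other_tags
    # Treat <br> tags as line breaks
    @plain_text.gsub!(/<br\s*\/?>/mi, "\n")
    # Replace any remaining tag with a space
    @plain_text.gsub!(/<.*?>/m, ' ')
    # Collapse runs of blank lines into a single blank line
    @plain_text.gsub!(/(\n\s*){2}/, "\n\n")
  end
end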
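To see what the three tag categories do to a document, here is a standalone sketch of the same substitutions. The constants, the render helper, and the SAMPLE string are hypothetical stand-ins for the class variables and instance state of HTMLSanitizer, and the initial whitespace cleanup from plain_text is left out.

```ruby
IGNORE_TAGS = %w[head script frameset]
INLINE_TAGS = %w[span strong i u]
BLOCK_TAGS  = %w[p div ul ol]

# Matches an opening tag, its contents, and the closing tag.
def tag_regex(tag)
  %r{<#{tag}.*?>(.*?)</#{tag}>}mi
end

# Hypothetical helper: apply the three category rules in order.
def render(html)
  text = html.dup
  IGNORE_TAGS.each { |tag| text.gsub!(tag_regex(tag), '') }          # drop tag and contents
  INLINE_TAGS.each { |tag| text.gsub!(tag_regex(tag), '\1 ') }       # keep contents plus a space
  BLOCK_TAGS.each  { |tag| text.gsub!(tag_regex(tag), "\n\\1\n") }   # put contents on their own line
  text
end

SAMPLE = '<head><title>t</title></head>' \
         '<p>Hello <strong>big</strong>world</p><div>Bye</div>'

p render(SAMPLE)
# => "\nHello big world\n\nBye\n"
```

The block replacement is written "\n\\1\n" because a bare \1 inside double quotes would be read as a character escape rather than a backreference.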
To use this class, simply initialize it with a URL and call the plain_text method:
puts HTMLSanitizer.new('http://slashdot.org/').plain_text
# Stories
# Slash Boxes
# Comments
#
# Slashdot
#
# News for nerds, stuff that matters
#
# Login
#
# Why Login? Why Subscribe?
# …
See Also
- Recipe 14.1, "Grabbing the Contents of a Web Page"
- For a more sophisticated text renderer, parse the HTML document with the techniques described in Recipe 11.2, "Extracting Data from a Document's Tree Structure," or Recipe 11.5, "Parsing Invalid Markup"