Converting HTML Documents from the Web into Text
Problem
You want to get a text summary of a web site.
Solution
The open-uri library is the easiest way to grab the content of a web page; it lets you open a URL as though it were a file:
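require 'open-uri'
# On Ruby 3.0 and later, URLs are opened with URI.open rather than a bare open.
example = URI.open('http://www.example.com/')
# => an IO-like object (a StringIO or a Tempfile)
html = example.read
As with a file, the read method returns a string. You can chain a series of sub and gsub calls to clean the markup into a more readable format:
plain_text =
  html.sub(%r{<body.*?>(.*?)</body>}mi, '\1').gsub(/<.*?>/m, ' ').
  gsub(%r{(\n\s*){2}}, "\n\n")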
Finally, you can use the standard CGI library to unescape HTML entities like &lt; into their ASCII equivalents (<):
require 'cgi'
plain_text = CGI.unescapeHTML(plain_text)
The final product:
puts plain_text
# Example Web Page
#
# You have reached this web page by typing "example.com",
# "example.net",
# or "example.org" into your web browser.
# These domain names are reserved for use in documentation and are not available
# for registration. See RFC
# 2606 , Section 3.
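To try these substitutions without a network connection, the same steps can be run against an inline HTML string. This is only a minimal sketch: the sample markup and the html_to_text helper name are made up for illustration and are not part of the recipe.

```ruby
require 'cgi'

# Hypothetical helper: applies the Solution's substitutions to an HTML string.
def html_to_text(html)
  text = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1') # unwrap the <body> element
  text = text.gsub(/<.*?>/m, ' ')                     # replace every tag with a space
  text = text.gsub(/(\n\s*){2}/, "\n\n")              # collapse runs of blank lines
  CGI.unescapeHTML(text)                              # turn &lt; into <, &amp; into &
end

# Made-up sample document with extra blank lines and two entities.
sample = "<html><body><h1>Prices</h1>\n\n\n<p>Tea &amp; cake: &lt; 5 euros</p></body></html>"
puts html_to_text(sample)
```

Because each tag is replaced by a space, the result keeps some leading and trailing spaces where tags used to be; call strip on individual lines if that matters.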
The open-uri library lets you access the contents of web pages and FTP sites with the same open-style interface used for local files. The simple regular expression substitutions above do nothing but remove HTML tags and clean up excess whitespace. They work well for well-formatted HTML, but the web is full of mean and ugly HTML, so you may want to take a more involved approach.
Let's define an HTMLSanitizer class to do our dirty business. An HTMLSanitizer starts off with some HTML and, through a series of search-and-replace operations, transforms it into plain text. Different HTML tags are handled differently. The contents of some HTML tags should simply be removed in a plain-text rendering: for example, you probably don't want to see the contents of <head> and <script> tags. Other tags affect what the rendition should look like: for instance, a <p> tag should be represented as a blank line:
require 'open-uri'
require 'cgi'
class HTMLSanitizer
  attr_accessor :html
  @@ignore_tags = ['head', 'script', 'frameset']
  @@inline_tags = ['span', 'strong', 'i', 'u']
  @@block_tags = ['p', 'div', 'ul', 'ol']
The next two methods define the skeleton of our HTML sanitizer:
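  def initialize(source='')
    begin
      # URI.open handles URLs and falls back to a regular file open otherwise.
      @html = URI.open(source).read
    rescue Errno::ENOENT, Errno::ENAMETOOLONG
      # If it's not a file or a URL, assume it's an HTML string.
      # (A long HTML string raises ENAMETOOLONG rather than ENOENT.)
      @html = source
    end
  end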
  def plain_text
    # remove pre-existing blank spaces between tags since we will
    # be adding spaces on our own
    @plain_text = @html.gsub(/\s*(<.*?>)/m, '\1')
    handle_ignore_tags
    handle_inline_tags
    handle_block_tags
    handle_all_other_tags
    return CGI.unescapeHTML(@plain_text)
  end
Now we need to fill in the handle_ methods called by HTMLSanitizer#plain_text. These methods perform search-and-replace operations on the @plain_text instance variable, gradually transforming it from HTML into plain text. Because we are modifying @plain_text in place, we need to use String#gsub! instead of String#gsub.
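  private
  # Matches an opening tag, its contents, and the closing tag. The \b keeps
  # a short name like i from also matching a longer one like img.
  def tag_regex(tag)
    %r{<#{tag}\b.*?>(.*?)</#{tag}>}mi
  end
  # Drop these tags together with everything inside them.
  def handle_ignore_tags
    @@ignore_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), '') }
  end
  # Keep the contents of inline tags, followed by a space.
  def handle_inline_tags
    @@inline_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), '\1 ') }
  end
  # Put the contents of block tags on their own line.
  def handle_block_tags
    @@block_tags.each { |tag| @plain_text.gsub!(tag_regex(tag), "\n\\1\n") }
  end
  def handle_all_other_tags
    @plain_text.gsub!(/<br\b.*?>/mi, "\n")   # line breaks become newlines
    @plain_text.gsub!(/<.*?>/m, ' ')         # any remaining tag becomes a space
    @plain_text.gsub!(/(\n\s*){2}/, "\n\n")  # collapse runs of blank lines
  end
end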
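The difference between String#gsub and String#gsub!, which the handle_ methods rely on, can be seen in a short sketch. The sample string is made up. One detail worth knowing is that gsub! returns nil when it makes no substitution, so its return value should not be chained.

```ruby
text = "<b>bold</b> move"

copy = text.gsub(/<.*?>/, '')     # returns a new string; text is unchanged
puts text                          # still contains the tags
puts copy

result = text.gsub!(/<.*?>/, '')  # modifies text itself and returns it
puts text

# With nothing left to replace, gsub! returns nil rather than the string.
p text.gsub!(/<.*?>/, '')
```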
To use this class, simply initialize it with a URL and call the plain_text method:
puts HTMLSanitizer.new('http://slashdot.org/').plain_text
# Stories
# Slash Boxes
# Comments
#
# Slashdot
#
# News for nerds, stuff that matters
#
# Login
#
# Why Login? Why Subscribe?
# …
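The kind of pattern that tag_regex builds can also be tried on its own. The sketch below uses a standalone method named tag_pattern (a hypothetical name, for illustration only) with a word boundary after the tag name so that i does not also match img. It shows one limitation of non-greedy matching: when tags of the same name are nested, the match ends at the first closing tag.

```ruby
# Standalone pattern in the style of tag_regex, for illustration only.
def tag_pattern(tag)
  %r{<#{tag}\b.*?>(.*?)</#{tag}>}mi
end

# Sibling tags: each <div> is matched separately and put on its own line.
flat = "<div>one</div><div>two</div>"
p flat.gsub(tag_pattern('div'), "\n\\1\n")

# Nested tags: the match stops at the first </div>, so an opening
# and a closing tag are left behind in the result.
nested = "<div>outer <div>inner</div> tail</div>"
p nested.gsub(tag_pattern('div'), "\n\\1\n")
```

In the HTMLSanitizer class, leftover tags like these are removed later by the catch-all substitution in handle_all_other_tags, so the text survives even though the line breaks may not fall exactly where the nesting suggests.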
Discussion
See Also