Compressing Whitespace in an XML Document

Problem

When REXML parses a document, it respects the original whitespace of the document's text nodes. You want to make the document smaller by compressing extra whitespace.

Solution

Parse the document by creating a REXML::Document out of it. Within the Document constructor, tell the parser to compress all runs of whitespace characters:

    require 'rexml/document'

    text = %{<doc><a>Some      whitespace</a>    <b>Some   more</b></doc>}

    REXML::Document.new(text, { :compress_whitespace => :all }).to_s
    # => "<doc><a>Some whitespace</a> <b>Some more</b></doc>"

Discussion

Sometimes whitespace within a document is significant, but usually (as with HTML) it can be compressed without changing the meaning of the document. The resulting document takes up less space on the disk and requires less bandwidth to transmit.
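To get a rough sense of the savings, you can compare the serialized size of a document parsed with and without compression. This is only a sketch: the sample string below is a made-up input, not data from a real document, and the savings on real files will depend on how much redundant whitespace they contain.

```ruby
require 'rexml/document'

# Hypothetical sample input with deliberately wasteful whitespace.
xml = "<doc><a>Some      whitespace</a>    <b>Some   more</b></doc>"

# Default parse: whitespace in text nodes is respected.
original = REXML::Document.new(xml).to_s

# Same input, with runs of whitespace compressed everywhere.
compressed = REXML::Document.new(xml, { :compress_whitespace => :all }).to_s

puts "original:   #{original.bytesize} bytes"
puts "compressed: #{compressed.bytesize} bytes"
```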

Whitespace compression doesn't have to be all-or-nothing. REXML gives you two ways to configure it. Instead of passing :all as the value for :compress_whitespace, you can pass in a list of tag names. Whitespace will be compressed only in those tags:

    # text is the string from the Solution:
    # %{<doc><a>Some      whitespace</a>    <b>Some   more</b></doc>}
    REXML::Document.new(text, { :compress_whitespace => %w{a} }).to_s
    # => "<doc><a>Some whitespace</a>    <b>Some   more</b></doc>"

You can also switch it around: pass in :respect_whitespace and a list of tag names whose whitespace you don't want compressed; whitespace in every other tag is compressed. This is useful if you know that whitespace is significant within certain parts of your document.

    REXML::Document.new(text, { :respect_whitespace => %w{a} }).to_s
    # => "<doc><a>Some      whitespace</a> <b>Some more</b></doc>"
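As an illustration of the "significant whitespace" case, consider a document that mixes ordinary prose with preformatted content. The element names page, p, and pre below are hypothetical, chosen only for this sketch; the point is that listing pre under :respect_whitespace protects its layout while the rest of the document is compressed.

```ruby
require 'rexml/document'

# Hypothetical document: loosely spaced prose next to preformatted text.
xml = "<page><p>Loosely    spaced   prose</p>" \
      "<pre>x  =  1\n    y  =  2</pre></page>"

# Respect whitespace only inside <pre>; compress it everywhere else.
doc = REXML::Document.new(xml, { :respect_whitespace => %w{pre} })

prose        = doc.root.elements['p'].text
preformatted = doc.root.elements['pre'].text

puts prose.inspect
puts preformatted.inspect
```

The prose comes back with single spaces, while the preformatted text keeps its original spacing and line break.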

What about text nodes containing only whitespace? These are often inserted by XML pretty-printers, and they can usually be discarded without altering the meaning of a document. If you add :ignore_whitespace_nodes => :all to the parser configuration, REXML will simply decline to create text nodes that contain nothing but whitespace characters. Here's a comparison of :compress_whitespace alone, and in conjunction with :ignore_whitespace_nodes:

    text = %{<doc><a>Some   text</a>\n  <b>Some   more</b>\n\n</doc>}

    REXML::Document.new(text, { :compress_whitespace => :all }).to_s
    # => "<doc><a>Some text</a>\n <b>Some more</b>\n</doc>"

    REXML::Document.new(text, { :compress_whitespace => :all,
                                :ignore_whitespace_nodes => :all }).to_s
    # => "<doc><a>Some text</a><b>Some more</b></doc>"

By itself, :compress_whitespace shouldn't make a document less human-readable, but :ignore_whitespace_nodes almost certainly will.
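One way to see what :ignore_whitespace_nodes actually removes is to count the text nodes that sit directly under the root element. The sketch below uses a made-up pretty-printed input; with compression alone the indentation survives as three whitespace-only text nodes, and with :ignore_whitespace_nodes those nodes are never created.

```ruby
require 'rexml/document'

# Hypothetical pretty-printed input: indentation and newlines between tags.
pretty = "<doc>\n  <a>Some   text</a>\n  <b>Some   more</b>\n</doc>"

kept = REXML::Document.new(pretty, { :compress_whitespace => :all })
dropped = REXML::Document.new(pretty, { :compress_whitespace => :all,
                                        :ignore_whitespace_nodes => :all })

# Count the text nodes that are direct children of <doc>.
kept_count    = kept.root.children.count { |node| node.is_a?(REXML::Text) }
dropped_count = dropped.root.children.count { |node| node.is_a?(REXML::Text) }

puts "text nodes under root, compressed only:    #{kept_count}"
puts "text nodes under root, whitespace ignored: #{dropped_count}"
puts dropped.to_s
```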

See Also
