Substituting XML Entities

Problem

You've parsed a document that contains internal XML entities. You want to replace the entity references in the document with their values.

Solution

To perform entity substitution on a specific text element, call its value method. If it's the first text element of its parent, you can call text on the parent instead.

Here's a simple document that defines and uses two entities in a single text node. We can substitute the values for those entity references without changing the document itself:

    require 'rexml/document'

    str = %{<!DOCTYPE doc [
      <!ENTITY product 'Stargaze'>
      <!ENTITY version '2.3'>
    ]>
    <doc>
     &product; v&version; is the most advanced astronomy product on the market.
    </doc>}

    doc = REXML::Document.new str

    # value (and text, on the parent) return the text with entity values substituted in.
    doc.root.children[0].value
    # => "\n Stargaze v2.3 is the most advanced astronomy product on the market.\n"
    doc.root.text
    # => "\n Stargaze v2.3 is the most advanced astronomy product on the market.\n"

    # to_s and write leave the entity references in place.
    doc.root.children[0].to_s
    # => "\n &product; v&version; is the most advanced astronomy product on the market.\n"

    doc.root.write
    # <doc>
    #  &product; v&version; is the most advanced astronomy product on the market.
    # </doc>

Discussion

Internal XML entities are often used to factor out data that changes a lot, like dates or version numbers. But REXML only provides a convenient way to perform substitution on a single text node. What if you want to perform substitutions throughout the entire document?

When you call Document#write to send a document to some IO object, it ends up calling Text#to_s on each text node. As seen in the Solution, this method presents a "normalized" view of the data, one where entities are displayed instead of having their values substituted in.

We could write our own version of Document#write that presents an "unnormalized" view of the document, one with entity values substituted in, but that would be a lot of work. We could hack Text#to_s to work more like Text#value, or hack Text#write to call the value method instead of to_s. But it's less intrusive to do the entity replacement outside of the write method altogether. Here's a class that wraps any IO object and performs entity replacement on all the text that comes through it:

    require 'delegate'
    require 'rexml/text'

    class EntitySubstituter < DelegateClass(IO)
      def initialize(io, document, filter=nil)
        @document = document
        @filter = filter
        super(io)
      end

      # Substitute entity values into each chunk before passing it to the real IO.
      def <<(s)
        super(REXML::Text::unnormalize(s, @document.doctype, @filter))
      end
    end

    output = EntitySubstituter.new($stdout, doc)
    doc.write(output)
    # <!DOCTYPE doc [
    #   <!ENTITY product "Stargaze">
    #   <!ENTITY version "2.3">
    # ]>
    # <doc>
    #  Stargaze v2.3 is the most advanced astronomy product on the market.
    # </doc>

Because it processes the entire output of Document#write, this code will replace all entity references in the document. This includes any references found in attribute values, which may or may not be what you want.
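If replacing every reference is too aggressive, the optional filter argument may help. The following is a minimal sketch, not part of the recipe itself: it calls REXML::Text.unnormalize directly on a hand-written markup string (the `markup` string and the variable names are illustrative assumptions) to show that references inside attribute values are replaced too, and that entity names listed in the filter are left alone.

```ruby
require 'rexml/document'

# A document whose doctype declares the two entities used in this recipe.
doc = REXML::Document.new(%{<!DOCTYPE doc [
  <!ENTITY product 'Stargaze'>
  <!ENTITY version '2.3'>
]>
<doc/>})

# Hypothetical serialized markup, with a reference inside an attribute value.
markup = %{<doc name="&product;">&product; v&version;</doc>}

# With no filter, every declared reference is replaced, attributes included.
everything = REXML::Text.unnormalize(markup, doc.doctype)
# => "<doc name=\"Stargaze\">Stargaze v2.3</doc>"

# Entity names listed in the filter are passed through untouched.
keep_version = REXML::Text.unnormalize(markup, doc.doctype, ['version'])
# => "<doc name=\"Stargaze\">Stargaze v&version;</doc>"
```

Passing the same array as the third argument to EntitySubstituter.new should have the same effect on the wrapped output, since the wrapper hands its filter straight to unnormalize.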

If you create a Text object manually, or set the value of an existing object, REXML assumes that you're giving it unnormalized text, and normalizes it. This can be problematic if your text contains strings that happen to be the values of entities:

    text_node = doc.root.children[0]
    text_node.value = "&product; v&version; has a catalogue of 2.3 " +
      "million celestial objects."
    doc.write
    # <!DOCTYPE doc [
    #   <!ENTITY product "Stargaze">
    #   <!ENTITY version "2.3">
    # ]>
    # <doc>&product; v&version; has a catalogue of &version; million celestial objects.</doc>

To avoid this, you can create a "raw" text node:

    text_node.raw = true
    doc.write
    # <!DOCTYPE doc [
    #   <!ENTITY product "Stargaze">
    #   <!ENTITY version "2.3">
    # ]>
    # <doc>&product; v&version; has a catalogue of 2.3 million celestial objects.</doc>

    text_node.value
    # => "Stargaze v2.3 has a catalogue of 2.3 million celestial objects."
    text_node.to_s
    # => "&product; v&version; has a catalogue of 2.3 million celestial objects."
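A text node can also be marked raw at the moment it is created. The sketch below assumes the four-argument form of REXML::Text.new (string, respect_whitespace, parent, raw); the document and the sentence are made up for illustration.

```ruby
require 'rexml/document'

doc = REXML::Document.new(%{<!DOCTYPE doc [
  <!ENTITY version '2.3'>
]>
<doc/>})

# The fourth argument (raw = true) tells REXML the string is already
# normalized, so the literal "2.3" is not rewritten as "&version;".
raw_text = REXML::Text.new("v&version; lists 2.3 million objects.",
                           false, nil, true)
doc.root.add(raw_text)

raw_text.to_s
# => "v&version; lists 2.3 million objects."
raw_text.value
# => "v2.3 lists 2.3 million objects."
```

Adding the node to the document matters here: value looks up entity definitions through the node's parent, so a detached node has no doctype to consult.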

In addition to entities you define, REXML automatically processes five named character entities: the ones for left and right angle brackets, single and double quotes, and the ampersand. Each is replaced with the corresponding ASCII character.
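As a quick check of the predefined set, this small sketch calls REXML::Text.unnormalize without a doctype, so only the built-in entities are available. The input strings and variable names are illustrative.

```ruby
require 'rexml/document'

# No doctype is passed, so only the five predefined entities are known.
predefined = REXML::Text.unnormalize("&lt;b&gt; &quot;x&quot; &apos;y&apos; &amp;")
# => "<b> \"x\" 'y' &"

# A name outside that set is passed through unchanged.
unknown = REXML::Text.unnormalize("&copy; 2006")
# => "&copy; 2006"
```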

    str = %{<!DOCTYPE doc [
      <!ENTITY year '2006'>
    ]>
    <doc>&copy; &year; Komodo Dragon &amp; Bob Productions</doc>}

    doc = REXML::Document.new str
    text_node = doc.root.children[0]

    text_node.value
    # => "&copy; 2006 Komodo Dragon & Bob Productions"
    text_node.to_s
    # => "&copy; &year; Komodo Dragon &amp; Bob Productions"

"&copy;" is an HTML character entity representing the copyright symbol, but REXML doesn't know that. It only knows about the five XML character entities. Also, REXML only knows about internal entities: ones whose values are defined within the same document that uses them. It won't resolve external entities.

See Also
