Substituting XML Entities
Problem
You've parsed a document that contains internal XML entities. You want to replace the entity references in the document with their values.
Solution
To perform entity substitution on a specific text element, call its value method. If it's the first text element of its parent, you can call text on the parent instead.
Here's a simple document that defines and uses two entities in a single text node. We can substitute those entities for their values without changing the document itself:
require 'rexml/document'
str = %{
]>
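As a minimal sketch of the calls described above, the following uses a small stand-in document. The entity names product and version come from this recipe; the entity values and the sentence text are assumptions chosen for illustration.

```ruby
require 'rexml/document'

# Stand-in document: entity values and text are assumed, not taken from the recipe.
str = %{<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY product 'Stargaze'>
  <!ENTITY version '2.3'>
]>
<doc>&product; v&version; is an astronomy product.</doc>}

doc = REXML::Document.new(str)
text_node = doc.root.children[0]

substituted = text_node.value   # entity values substituted in
same_text   = doc.root.text     # shortcut for the value of the first text child
normalized  = text_node.to_s    # entity references left as written
```

Neither call modifies the document: value and text only return a substituted copy of the node's text.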
Discussion
Internal XML entities are often used to factor out data that changes a lot, like dates or version numbers. But REXML only provides a convenient way to perform substitution on a single text node. What if you want to perform substitutions throughout the entire document?
When you call Document#write to send a document to some IO object, it ends up calling Text#to_s on each text node. As seen in the Solution, this method presents a "normalized" view of the data, one where entities are displayed instead of having their values substituted in.
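The behavior described here can be observed by writing a document into a String instead of an IO object. This is only a sketch: the one-entity document is an assumed example.

```ruby
require 'rexml/document'

# Assumed example document with a single internal entity.
doc = REXML::Document.new(%{<!DOCTYPE doc [
  <!ENTITY version '2.3'>
]>
<doc>Release &version;</doc>})

written = +''        # Document#write only needs an object that responds to <<
doc.write(written)
# written still contains "Release &version;", not "Release 2.3"
```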
We could write our own version of Document#write that presents an "unnormalized" view of the document, one with entity values substituted in, but that would be a lot of work. We could hack Text#to_s to work more like Text#value, or hack Text#write to call the value method instead of to_s. But it's less intrusive to do the entity replacement outside of the write method altogether. Here's a class that wraps any IO object and performs entity replacement on all the text that comes through it:
require 'delegate'
require 'rexml/text'

class EntitySubstituter < DelegateClass(IO)
  def initialize(io, document, filter=nil)
    @document = document
    @filter = filter
    super(io)
  end

  # Substitute entity values into each string before passing it to the wrapped IO.
  def <<(s)
    super(REXML::Text::unnormalize(s, @document.doctype, @filter))
  end
end

output = EntitySubstituter.new($stdout, doc)
doc.write(output)
#
#
# ]>
#
Because it processes the entire output of Document#write, this code will replace all entity references in the document. This includes any references found in attribute values, which may or may not be what you want.
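To see the effect on attribute values, here is a sketch that repeats the wrapper class so it runs on its own and captures the output in a StringIO rather than $stdout. The document, the name attribute, and the entity values are assumptions made for illustration. The last part passes a filter array, which Text.unnormalize treats as a list of entity names to leave alone.

```ruby
require 'delegate'
require 'stringio'
require 'rexml/document'

# Same wrapper as in the recipe, repeated so this sketch is self-contained.
class EntitySubstituter < DelegateClass(IO)
  def initialize(io, document, filter = nil)
    @document = document
    @filter = filter
    super(io)
  end

  def <<(s)
    super(REXML::Text.unnormalize(s, @document.doctype, @filter))
  end
end

# Assumed example: an entity reference in both an attribute and a text node.
doc = REXML::Document.new(%{<!DOCTYPE doc [
  <!ENTITY product 'Stargaze'>
  <!ENTITY version '2.3'>
]>
<doc name="&product;">&product; v&version;</doc>})

buffer = StringIO.new
doc.write(EntitySubstituter.new(buffer, doc))
expanded = buffer.string          # references replaced in text and attribute

# Entity names listed in the filter are left as references.
filtered_buffer = StringIO.new
doc.write(EntitySubstituter.new(filtered_buffer, doc, ['version']))
filtered = filtered_buffer.string
```

If substituting inside attributes is not wanted, the filter only helps per entity name, not per location, so a different approach would be needed.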
If you create a Text object manually, or set the value of an existing object, REXML assumes that you're giving it unnormalized text, and normalizes it. This can be problematic if your text contains strings that happen to be the values of entities:
text_node = doc.root.children[0]
text_node.value = "&product; v&version; has a catalogue of 2.3 " +
"million celestial objects."
doc.write
#
#
# ]>
#
To avoid this, you can create a "raw" text node:
text_node.raw = true
doc.write
#
#
# ]>
#
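The difference between a normalized and a raw text node can also be checked without writing the whole document. This sketch assumes a one-entity document whose version value is 2.3 and inspects the node with to_s, the method Document#write relies on.

```ruby
require 'rexml/document'

# Assumed example document; only the entity value matters here.
doc = REXML::Document.new(%{<!DOCTYPE doc [
  <!ENTITY version '2.3'>
]>
<doc>placeholder</doc>})

text_node = doc.root.children[0]
text_node.value = 'a catalogue of 2.3 million objects'
normalized = text_node.to_s   # "2.3" matches the entity value and is rewritten as a reference

text_node.raw = true
raw_text = text_node.to_s     # a raw node is written exactly as given
```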
In addition to entities you define, REXML automatically processes five named character entities: the ones for left and right angle brackets, single and double quotes, and the ampersand. Each is replaced with the corresponding ASCII character.
str = %{
]>
"&copy;" is an HTML character entity representing the copyright symbol, but REXML doesn't know that. It only knows about the five XML character entities. Also, REXML only knows about internal entities: ones whose values are defined within the same document that uses them. It won't resolve external entities.
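A short sketch of this behavior, using an assumed one-line document with no DOCTYPE: the five predefined references are replaced, while the HTML-only reference is passed through unchanged.

```ruby
require 'rexml/document'

# Assumed example text that uses all five predefined entities plus &copy;.
doc = REXML::Document.new(
  '<doc>&copy; &lt;tag&gt; &amp; &quot;double&quot; &apos;single&apos;</doc>'
)
text_node = doc.root.children[0]

expanded = text_node.value
# => "&copy; <tag> & \"double\" 'single'"
```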
See Also
- The section "Text Nodes" of the REXML tutorial (http://www.germane-software.com/software/rexml/docs/tutorial.html#id2248004)