Indexing Unstructured Text with SimpleSearch
Problem
You want to index a number of texts and do quick keyword searches on them.
Solution
Use the SimpleSearch library, available in the SimpleSearch gem.
Here's how to create and save an index:
require 'rubygems'
require 'search/simple'

# Each Content takes the text, a unique name, and a last-modified time.
contents = Search::Simple::Contents.new
contents << Search::Simple::Content.new('In the beginning God created the heavens…',
                                        'Genesis.txt', Time.now)
contents << Search::Simple::Content.new('Call me Ishmael…',
                                        'MobyDick.txt', Time.now)
contents << Search::Simple::Content.new('Marley was dead to begin with…',
                                        'AChristmasCarol.txt', Time.now)

# Build the index and serialize it to the file 'index_file'.
searcher = Search::Simple::Searcher.load(contents, 'index_file')
Here's how to load and search an existing index:
require 'rubygems'
require 'search/simple'

searcher = nil
# The index file holds two marshaled objects; read them back in order.
File.open('index_file', 'rb') do |f|
  searcher = Search::Simple::Searcher.new(Marshal.load(f), Marshal.load(f),
                                          'index_file')
end

searcher.find_words(['begin']).results.collect { |result| result.name }
# => ["AChristmasCarol.txt", "Genesis.txt"]
Discussion
SimpleSearch is a library that makes it easy to do fast keyword searching on unstructured text documents. The index itself is represented by a Searcher object, and each document you feed it is a Content object.
To create an index, you must first construct a number of Content objects and a Contents object to contain them. A Content object contains a piece of text, a unique identifier for that text (often a filename, though it could also be a database ID or a URL), and the time at which the text was last modified. Searcher.load transforms a Contents object into a searchable index that gets serialized to disk with Marshal.
The indexer analyzes the text you give it, removes stop words (like "a"), reduces words to their roots (so "beginning" becomes "begin"), and stores every remaining word in binary data structures. Given a set of words to find and a set of words to exclude, SimpleSearch uses these structures to quickly find the matching documents.
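To make the indexing steps concrete, here is a minimal pure-Ruby sketch of the same idea: drop stop words, stem what's left, and record which documents contain each root. It is not SimpleSearch's implementation. The names ToyIndex, STOP_WORDS, crude_stem, and terms are invented for illustration, the stemmer is deliberately crude, and the rule "match any wanted word, then subtract documents containing an unwanted word" is an assumption about how such a search might behave.

```ruby
require 'set'

# Toy illustration only; NOT how SimpleSearch is implemented.
STOP_WORDS = Set.new(%w[a an and the of in to is was with me])

# Hypothetical, very crude stemmer: strips a few common English suffixes.
def crude_stem(word)
  word.sub(/(ning|ing|ed|s)\z/, '')
end

# Lowercase the text, split it into words, drop stop words, stem the rest.
def terms(text)
  text.downcase.scan(/[a-z]+/)
      .reject { |word| STOP_WORDS.include?(word) }
      .map { |word| crude_stem(word) }
end

class ToyIndex
  def initialize
    # Maps each stemmed word to the set of document names containing it.
    @postings = Hash.new { |hash, term| hash[term] = Set.new }
  end

  def add(name, text)
    terms(text).each { |term| @postings[term] << name }
  end

  # Documents containing any wanted word and none of the unwanted words.
  def find(wanted, unwanted = [])
    (docs_for(wanted) - docs_for(unwanted)).sort
  end

  private

  def docs_for(words)
    words.map { |word| crude_stem(word.downcase) }
         .reduce(Set.new) { |docs, term| docs | @postings.fetch(term, Set.new) }
  end
end

index = ToyIndex.new
index.add('Genesis.txt', 'In the beginning God created the heavens')
index.add('MobyDick.txt', 'Call me Ishmael')
index.add('AChristmasCarol.txt', 'Marley was dead to begin with')

index.find(['begin'])
# => ["AChristmasCarol.txt", "Genesis.txt"]
index.find(['begin'], ['god'])
# => ["AChristmasCarol.txt"]
```

Because "beginning" and "begin" reduce to the same root, a search for either one finds both documents. A real indexer would use a proper stemming algorithm and a more compact on-disk representation.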
Here's how to add some new documents to an existing index:
# Reopen the Searcher class to add a method that indexes more documents.
class Search::Simple::Searcher
  def add_contents(contents)
    Search::Simple::Searcher.create_indices(contents, @dict,
                                            @document_vectors)
    dump                             # Re-serialize the file
  end
end

contents = Search::Simple::Contents.new
contents << Search::Simple::Content.new('A spectre is haunting Europe…',
                                        'TheCommunistManifesto.txt',
                                        Time.now)
searcher.add_contents(contents)

searcher.find_words(['spectre']).results[0].name
# => "TheCommunistManifesto.txt"
SimpleSearch doesn't support incremental indexing itself, which is why the code above reopens the Searcher class just to append documents. If you update or delete a document, you must recreate the entire index from scratch.
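One way to recreate the index is to rebuild it from whatever documents currently exist. The sketch below assumes the documents are .txt files in a single directory. The helper names collect_documents and rebuild_index are hypothetical, and the only SimpleSearch calls used are the ones shown in the Solution. Deleting the old index file first is a precaution, on the assumption that a leftover file could otherwise be reused.

```ruby
# Sketch only: assumes every document is a .txt file in one directory.
# collect_documents and rebuild_index are hypothetical helper names.

# Gather a [text, name, mtime] triple for every .txt file in dir.
def collect_documents(dir)
  Dir.glob(File.join(dir, '*.txt')).sort.map do |path|
    [File.read(path), File.basename(path), File.mtime(path)]
  end
end

# Discard the old index and build a new one from the current files.
def rebuild_index(dir, index_file)
  # Loaded here so collect_documents can be used without the gem installed.
  require 'rubygems'
  require 'search/simple'

  # Precaution (assumption): make sure no stale index file gets reused.
  File.delete(index_file) if File.exist?(index_file)

  contents = Search::Simple::Contents.new
  collect_documents(dir).each do |text, name, mtime|
    contents << Search::Simple::Content.new(text, name, mtime)
  end
  Search::Simple::Searcher.load(contents, index_file)
end
```

Calling rebuild_index('texts', 'index_file') after a document changes or disappears would then return a fresh Searcher, where 'texts' is a placeholder directory name.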
See Also
- The SimpleSearch home page (http://www.chadfowler.com/SimpleSearch/)
- The sample application within the SimpleSearch gem: search-simple.rb
- Recipe 13.2, "Serializing Data with Marshal"
- For a more sophisticated indexer, see Recipe 13.5, "Indexing Structured Text with Ferret"