Indexing Structured Text with Ferret

Problem

You want to perform searches on structured text. For instance, you might want to search just the headline of a news story, or just the body.

Solution

The Ferret library can tokenize and search structured data. It's a pure Ruby port of Java's Lucene library, and it's available as the ferret gem.

Here's how to create and populate an index with Ferret. I'll create a searchable index of useful Ruby packages, stored as a set of binary files in the ruby_packages/ directory.

require 'rubygems'
require 'ferret'

PACKAGE_INDEX_DIR = 'ruby_packages/'
Dir.mkdir(PACKAGE_INDEX_DIR) unless File.directory? PACKAGE_INDEX_DIR
index = Ferret::Index::Index.new(:path => PACKAGE_INDEX_DIR,
                                 :default_search_field => 'name|description')

index << { :name => 'SimpleSearch',
           :description => 'A simple indexing library.',
           :supports_structured_data => false,
           :complexity => 2 }
index << { :name => 'Ferret',
           :description => 'A Ruby port of the Lucene library. More powerful than SimpleSearch',
           :supports_structured_data => true,
           :complexity => 5 }

By default, queries against this index will search the "name" and "description" fields, but you can search against any field:

index.search_each('library') do |doc_id, score|
  puts index.doc(doc_id).field('name').data
end
# SimpleSearch
# Ferret

index.search_each('description:powerful AND supports_structured_data:true') do |doc_id, score|
  puts index.doc(doc_id).field("name").data
end
# Ferret

index.search_each("complexity:<5") do |doc_id, score|
  puts index.doc(doc_id).field("name").data
end
# SimpleSearch

Discussion

When should you use Ferret instead of SimpleSearch? SimpleSearch is good for unstructured data like plain text. Ferret excels at searching structured data, the kind you find in databases.

Relational databases are good at finding exact field matches, but not very good at locating keywords within large strings. Ferret works best when you need full text search but you want to keep some of the document structure. I've also had great success using Ferret[6] to bring together data from disparate sources (some in databases, some not) into one structured, searchable index.

[6] Actually, I was using Lucene. Same idea.
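
To make the contrast with a relational database concrete, here is a minimal pure-Ruby sketch of a field-aware inverted index, the general kind of data structure that full-text search engines build. It is not Ferret's implementation: the FieldIndex class and its add and search methods are hypothetical names invented for this illustration, and the tokenizer is a naive lowercase word split.

```ruby
# A toy field-aware inverted index. Illustration only: FieldIndex is a
# hypothetical class, not part of Ferret.
class FieldIndex
  def initialize
    # Maps field name -> token -> list of IDs of documents containing the token.
    @postings = Hash.new { |h, field| h[field] = Hash.new { |t, tok| t[tok] = [] } }
  end

  # Naive tokenizer: lowercase the text and pull out runs of word characters.
  def tokenize(text)
    text.downcase.scan(/\w+/)
  end

  # Record each distinct token of each field under the given document ID.
  def add(doc_id, fields)
    fields.each do |field, text|
      tokenize(text).uniq.each { |token| @postings[field][token] << doc_id }
    end
  end

  # Returns the IDs of documents whose given field contains the keyword.
  def search(field, keyword)
    token = keyword.downcase
    return [] unless @postings.key?(field) && @postings[field].key?(token)
    @postings[field][token]
  end
end

idx = FieldIndex.new
idx.add(1, :headline => "Lizardoids Control the Media, Sources Say",
           :story    => "Don't count on reading this story in your local paper")
idx.add(2, :headline => "Where Are My Pants? An Editorial",
           :story    => "This is an outrage. The lizardoids have gone too far!")

p idx.search(:headline, "lizardoids")   # [1]
p idx.search(:story, "lizardoids")      # [2]
```

A keyword lookup here is a pair of hash lookups rather than a scan of every string, and because the postings are kept per field, the same keyword can match in the headline of one document and the story of another.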

There are two things you can do with Ferret: add text to the index, and query the index. Ferret offers you a lot of control over both activities. I'll briefly cover the most interesting features.

You can feed an index by passing in a hash of field names to values, or you can feed it fully formed Ferret::Document objects. The second approach gives you more control over which fields you'd like to index. Here, I'll create an index of news stories taken from a hypothetical database:

# This include will cut down on the length of the Field:: constants below.
include Ferret::Document

def index_story(index, db_id, headline, story)
  doc = Document.new
  doc << Field.new("db_id", db_id, Field::Store::YES, Field::Index::NO)
  doc << Field.new("headline", headline, Field::Store::YES, Field::Index::TOKENIZED)
  doc << Field.new("story", story, Field::Store::NO, Field::Index::TOKENIZED)
  index << doc
end

STORY_INDEX_DIR = 'news_stories/'
Dir.mkdir(STORY_INDEX_DIR) unless File.directory? STORY_INDEX_DIR
index = Ferret::Index::Index.new(:path => STORY_INDEX_DIR)

index_story(index, 1, "Lizardoids Control the Media, Sources Say",
            "Don't count on reading this story in your local paper anytime soon, because …")
index_story(index, 2, "Where Are My Pants? An Editorial",
            "This is an outrage. The lizardoids have gone too far! …")

In this case, I'm storing the database ID in the Document, but I'm not indexing it. I don't want anyone to search on it, but I need some way of tying a Document in the index to a record in the database. That way, when someone does a search, I can print out the headline and provide a link to the original story.

I treat the body of the story exactly the opposite way: the words get indexed, but the original text is not stored and can't be recovered from the Document object. I'm not going to be displaying the text of the story along with my search results, and the text is already in the database, so why store it again in the index?
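
The difference between storing a field and indexing it can be sketched without Ferret at all. The following pure-Ruby toy is an illustration under stated assumptions, not Ferret's API: ToyIndex and its :store and :index options are hypothetical names, standing in for the Field::Store and Field::Index settings used above.

```ruby
# Toy model of "stored" versus "indexed" fields. Illustration only:
# ToyIndex and its :store / :index options are hypothetical, not Ferret's API.
class ToyIndex
  def initialize
    @stored  = {}                              # doc_id -> { field => original value }
    @terms   = Hash.new { |h, k| h[k] = [] }   # [field, token] -> [doc_id, ...]
    @next_id = 0
  end

  # fields is an array of [name, value, options] triples.
  def add(fields)
    doc_id = @next_id
    @next_id += 1
    @stored[doc_id] = {}
    fields.each do |name, value, opts|
      # Stored fields keep their original value for later display.
      @stored[doc_id][name] = value if opts[:store]
      # Indexed fields contribute searchable tokens, but not their text.
      next unless opts[:index]
      value.to_s.downcase.scan(/\w+/).uniq.each { |tok| @terms[[name, tok]] << doc_id }
    end
    doc_id
  end

  # IDs of documents whose field contains the keyword (empty if none).
  def search(field, keyword)
    @terms.fetch([field, keyword.downcase], [])
  end

  # Only the stored fields of a document can be read back.
  def doc(doc_id)
    @stored[doc_id]
  end
end

index = ToyIndex.new
index.add([["db_id",    2, { :store => true,  :index => false }],
           ["headline", "Where Are My Pants? An Editorial", { :store => true, :index => true }],
           ["story",    "This is an outrage.", { :store => false, :index => true }]])

hit = index.search("story", "outrage").first
p index.doc(hit)               # db_id and headline come back; the story text does not
p index.search("db_id", "2")   # [] because db_id was never indexed
```

The story can be found by its words, yet only the database ID and headline can be read back from the hit, which is exactly the trade-off described above.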

The simplest way to search a Ferret index is with Index#search_each, as demonstrated in the Solution. This takes a query and a code block. For each document that matched the search query, it yields the document ID and a number between 0 and 1, representing the quality of the match.

You can get more information about the search results by calling search instead of search_each. This gives you a Ferret::Search::TopDocs object that contains the search results, as well as useful information like how many documents were matched. Call each on a TopDocs object and it'll act just as if you'd called search_each.

Here's some code that does a search and prints the results:

def search_news(index, query)
  results = index.search(query)
  puts "#{results.size} article(s) matched:"
  results.each do |doc_id, score|
    story = index.doc(doc_id)
    puts " #{story.field("headline").data} (score: #{score})"
    puts " http://www.example.com/news/#{story.field("db_id").data}"
    puts
  end
end

search_news(index, "pants editorial")
# 1 article(s) matched:
#  Where Are My Pants? An Editorial (score: 0.0908329636861293)
#  http://www.example.com/news/2

You can weight the fields differently to fine-tune the results. This query makes a match in the headline count twice as much as a match in the story:

search_news(index, "headline:lizardoids^1 OR story:lizardoids^0.5")
# 2 article(s) matched:
#  Lizardoids Control the Media, Sources Say (score: 0.195655948031232)
#  http://www.example.com/news/1
#
#  Where Are My Pants? An Editorial (score: 0.0838525491562421)
#  http://www.example.com/news/2

Queries can be strings or Ferret::Search::Query objects. Pass in a string, and it just gets parsed and turned into a Query. The main advantage of creating your own Query objects is that you can put a user-friendly interface on your search functionality, instead of making people always construct Ferret queries by hand. The weighted_query method defined below takes a single keyword and creates a Query object equivalent to the rather complicated weighted query given above:

def weighted_query(term)
  query = Ferret::Search::BooleanQuery.new
  query << term_clause("headline", term, 1)
  query << term_clause("story", term, 0.5)
  # Return the query explicitly rather than relying on the return value of <<.
  return query
end

# Builds a boosted single-term clause for one field.
def term_clause(field, term, weight)
  t = Ferret::Search::TermQuery.new(Ferret::Index::Term.new(field, term))
  t.boost = weight
  return Ferret::Search::BooleanClause.new(t)
end
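
If string queries are good enough for your interface, a lighter-weight alternative is to generate the query string instead of composing Query objects. The sketch below is an assumption-laden convenience, not part of Ferret: weighted_query_string is a hypothetical helper, it relies only on the field:term^weight syntax shown earlier, and it does no escaping of special query characters, so it is only suitable for simple keywords.

```ruby
# Builds a query string such as
# "headline:lizardoids^1 OR story:lizardoids^0.5" from a single keyword.
# weighted_query_string is a hypothetical helper, not a Ferret method.
# It does not escape special query characters in the term.
def weighted_query_string(term, weights = { "headline" => 1, "story" => 0.5 })
  weights.map { |field, weight| "#{field}:#{term}^#{weight}" }.join(" OR ")
end

puts weighted_query_string("lizardoids")
# headline:lizardoids^1 OR story:lizardoids^0.5
```

The resulting string can be passed anywhere a string query is accepted, for instance as the second argument to search_news.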

Ferret can be clumsy to use. It's got a lot of features to learn, and sometimes it seems like you spend all your time composing small objects into bigger objects (as in weighted_query above, which creates instances of four different classes). This is partly because Ferret is so flexible, and partly because the API comes mainly from Java. But nothing else works as well for searching structured text.

See Also
