Programming Microsoft Outlook and Microsoft Exchange, Second Edition (DV-MPS Programming)
Site Server supports full-text and property searching of multiple data sources. These data sources include HTTP sites, file systems, Exchange Server, and ODBC-compliant databases. Site Server gathers content for searching by using a flexible and powerful distributed process called a crawler. Once the crawler has indexed the data sources you specify, the newly created catalogs are compiled and propagated to the search servers you designate. Because you can have multiple search servers, you can improve performance by load balancing users' queries across them.
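To make the load-balancing idea concrete, here is a minimal Python sketch of round-robin query distribution. It is only an illustration of the concept, not Site Server code: the host names and the make_round_robin helper are hypothetical, and a real deployment might balance at the network or Web-server level instead.

```python
from itertools import cycle

# Hypothetical host names; substitute the search servers that receive your catalogs.
SEARCH_SERVERS = ["search01", "search02", "search03"]

def make_round_robin(servers):
    """Return a function that hands out servers in rotating order."""
    rotation = cycle(servers)

    def next_server():
        return next(rotation)

    return next_server

# Each incoming query asks for the next server in the rotation.
pick_server = make_round_robin(SEARCH_SERVERS)
assignments = [pick_server() for _ in range(5)]
```

Round-robin is the simplest policy to reason about. It assumes every search server holds the same propagated catalog, so any server can answer any query.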
At the Microsoft Windows NT level, Site Server runs two services: Gatherer and Search. The Gatherer service crawls the content, extracts the information, compiles the catalog, and then propagates the catalog to the required hosts. The Search service allows you to search the catalog for data.
When crawling content, the Gatherer service extracts the full-text content, properties, and links. One advantage of the Gatherer service is that it contains built-in filters for Office documents. This means the Gatherer service can open Office documents and pull out custom properties as well as built-in Office document properties such as Author or Last Save Time. The Gatherer service retrieves these properties by using a plug-in filter. Certain filters, such as Office document filters, come with Site Server. However, because the plug-ins implement the IFilter interface documented in the Platform Software Development Kit (SDK), you can create your own filters for the specific types of documents you use. The Adobe PDF filter, available from Adobe's Web site, is an example of such a custom filter. This filter allows Site Server to index PDF files stored in data sources that Site Server can crawl.
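The plug-in idea is easier to see in code. The following Python sketch models filters as functions registered by file extension, each returning the full text and a dictionary of properties. This is a conceptual model only: the real IFilter interface is a COM interface with a different shape, and every name here (register_filter, extract, the toy .memo format) is made up for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# A filter takes raw bytes and returns (full_text, properties).
FilterFn = Callable[[bytes], Tuple[str, Dict[str, str]]]

# Registry of installed filters, keyed by lowercase file extension.
FILTERS: Dict[str, FilterFn] = {}

def register_filter(extension: str, fn: FilterFn) -> None:
    """Install a filter for one document type."""
    FILTERS[extension.lower()] = fn

def plain_text_filter(data: bytes) -> Tuple[str, Dict[str, str]]:
    """Plain text has content but no properties."""
    return data.decode("utf-8", errors="replace"), {}

def headered_text_filter(data: bytes) -> Tuple[str, Dict[str, str]]:
    """Toy format: 'Key: value' lines, a blank line, then the body."""
    text = data.decode("utf-8", errors="replace")
    header, _, body = text.partition("\n\n")
    props = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            props[key.strip()] = value.strip()
    return body, props

def extract(filename: str, data: bytes) -> Optional[Tuple[str, Dict[str, str]]]:
    """Pick a filter by extension; in this sketch, unknown types are skipped."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    fn = FILTERS.get(ext)
    return fn(data) if fn is not None else None

register_filter("txt", plain_text_filter)
register_filter("memo", headered_text_filter)
```

Adding support for a new document type in this model means registering one more function, which mirrors how installing an additional filter extends what the crawler can index.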
Once you've finished indexing the content, the real fun begins. Now you're ready to search the content. Site Server is a powerful search product. Its Search service supports wildcard, free-text, and regular expression searches. You might be thinking to yourself that no user would want to use a regular expression search. But don't think of Site Server's features from the perspective of a user performing a search. Instead, think of the applications you can build that leverage these powerful features.
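To show how wildcard and regular expression searches differ, here is a small Python sketch that runs both against an in-memory set of documents. It does not use Site Server's query syntax or object model. The sample documents and the function names are invented, and Python's fnmatch and re modules stand in for the Search service.

```python
import fnmatch
import re

# Invented sample content standing in for an indexed catalog.
DOCS = {
    "memo1": "Budget review for the Exchange migration",
    "memo2": "Outlook forms and public folders",
    "memo3": "Invoice INV-2041 was exchanged for INV-2042",
}

def wildcard_search(pattern, docs):
    """Return ids of documents containing a word that matches a * or ? pattern."""
    word_re = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if any(word_re.match(word) for word in text.split())
    )

def regex_search(pattern, docs):
    """Return ids of documents whose text matches a regular expression."""
    rx = re.compile(pattern)
    return sorted(doc_id for doc_id, text in docs.items() if rx.search(text))

wildcard_hits = wildcard_search("exchang*", DOCS)
regex_hits = regex_search(r"INV-\d{4}", DOCS)
```

The regular expression case is the one an application is more likely to issue than a person: a program that looks for invoice numbers, part codes, or other structured strings can build such a pattern on the user's behalf.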
The Search service is also multilingual. Site Server determines the language of the document it's crawling. The Search service then uses the correct word-breaking module, which identifies the individual words in a document, and the correct word-stemming module, which identifies grammatical variations of each word. (For example, word stemming relates fly, flying, flown, and flew to one another.) Site Server also ignores noise words such as "the," "a," or "do," that is, words that are unlikely to carry searchable information. Site Server provides noise word lists, which you can edit and extend by using a text editor.
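The three steps described above (word breaking, noise word removal, and word stemming) can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how Site Server's language modules work: the word breaker is a single regular expression, and the stemmer is a hand-written lookup table that covers only the fly example from the text.

```python
import re

# Noise words mentioned in the text; a real list is longer and language specific.
NOISE_WORDS = {"the", "a", "do"}

# Hand-written stem table for illustration only.
STEM_TABLE = {"flying": "fly", "flown": "fly", "flew": "fly"}

def break_words(text):
    """Split text into lowercase words (a naive, English-only word breaker)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def index_terms(text):
    """Return the terms that would be indexed: noise removed, variants stemmed."""
    terms = []
    for word in break_words(text):
        if word in NOISE_WORDS:
            continue
        terms.append(STEM_TABLE.get(word, word))
    return terms

terms = index_terms("The birds flew; a plane is flying.")
```

Because both "flew" and "flying" reduce to the same term, a query for any form of the word can match documents that contain a different form.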
If you crawl documents in more than one language, you can have Site Server search in multiple languages or you can specify a single language for the search. In your search applications, you can even detect the language of the person attempting the search and automatically default to that language when querying the Site Server catalog.
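One common way for a Web application to detect a user's language is to read the Accept-Language header that the browser sends. The Python sketch below assumes that approach. The function names and the set of catalog languages are hypothetical, and passing the chosen language to the catalog query is left out because that step depends on the Site Server object model.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without q= gets full weight
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # Sort by descending quality, then by original position.
        weighted.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(weighted)]

def choose_search_language(header, catalog_languages, default="en"):
    """Pick the first preferred language that the catalog covers."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]  # "fr-ca" -> "fr"
        if primary in catalog_languages:
            return primary
    return default

chosen = choose_search_language("fr-CA,fr;q=0.8,en;q=0.5", {"en", "de"})
```

Falling back to a default language keeps the search usable when the header is missing or names only languages the catalog does not cover.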
All the capabilities we've just discussed are available in the object model that Site Server provides for searching. To help you get more acquainted with some of the finer capabilities of Site Server Search, I included the Site Server help files on the companion CD. You'll find a lot of useful information about regular expression, free-text, and word-stemming searches. I highly recommend that you look at the documentation and give some of these types of searches a try.