Mining Google Web Services: Building Applications with the Google API

One of the most important techniques you have at your disposal for creating a truly efficient Google Web Services environment is database caching. Google doesn't set any limits on the time you can store the data you retrieve, so it's theoretically possible that you'll only have to make each query once. Practically, you'll have to set a time limit on the data to ensure your data store doesn't get too old. However, depending on your research field, the storage time could be considerable. Even in the world of computer research, storage times of a month aren't uncommon.

The big question about this technique is whether it actually provides a benefit. Using a cache implies that someone makes the same query more than once. The "Caching as a Practical Performance Enhancing Technique" section of Chapter 5 discusses many of the issues surrounding caching. However, there's a significant difference in usage between a desktop application and a Web application.

You have to know how people use the Web site to make any assumptions about the benefits of caching. In many cases, you might not know the search habits of the people using your site. Public Web sites with ill-defined goals will experience the biggest problems because you can't narrow the focus of the search in any way. A boutique Web site that sells candles has a strong focus so you won't have a problem. On the other hand, a personal Web site that discusses every topic imaginable isn't a good candidate for caching because it's public and you can't narrow the focus of searches ”people could literally search for anything.

It's also important to consider the size and scope of potential searches. Even if you intend to follow the Google licensing agreement, popular sites can stretch the limits on the number of searches per day. The only way to ensure your site meets the requirements is to narrow the search focus. Narrowing the focus does help you comply with license limitations and also means that users will receive more accurate search results for your particular site topic. A user can enter a generally understood term for your profession and the search narrowing additions you provide ensure the search results reflect the user's intent.

Note  

You can run into problems with the 1,000 search daily limit on popular sites. At some point, you won't be able to narrow the focus of your searches any more without losing search capability. When this problem occurs, it's time to ask Google for an extension on the limit or find some other means of satisfying the user's search needs (including redirecting the user to the Google search site).

Caching can have additional benefits for Web applications. For example, you can use caching to reduce the number of Google requests for a site search because many people will request the same topics. A site search narrows the search scope to the local site by definition, so you can restrict the search criteria in that way and make caching worthwhile.

The most important consideration is that a local search encourages the user to look at additional content. When a user searches for information on the Google Web site, their attention might drift to other locations, making your site less effective ”keeping the user on site means that they won't be as likely to move somewhere else. With all these pros and cons in mind, let's look at an example. The following sections describe how to create a basic cached application using PHP.

Setting Up the Database

Before you can begin using PHP with MySQL, you need a database. The \Chapter 07\Data folder of the source code located on the Sybex Web site contains a Structure Query Language (SQL) script called GoogleData.SQL. Copy this file into the \MySQL\bin folder of your server ( assuming you used a default setup). At the command line, type MySQL MySQL < GoogleData.SQL and press Enter. Your system will pause for a moment and return to the command prompt. That's all you need to do to create the database for this example.

You can use the MySQL utility to verify the presence of the database, table, and data at the command line. Simply type USE GoogleData; and press Enter. If the GoogleData database is present, the utility will use that database. Type SELECT * FROM DataStore and press Enter. You'll see a lot of data stream by if the script successfully created the table and filled it with data. An easier way to achieve the same results on a Windows system is to use the WinMySQL admin utility shown in Figure 7.3. Simply select the Databases tab and you'll see the database, table, and associated fields.

Figure 7.3: The WinMySQLadm in utility validates the success of the script on a Windows machine.

It's important to understand how Google returns values and how you store them. In many cases, all you'll really need is the title, snippet, URL, and possibly the cached size of the data in the database. The database could also include a scanned date and other information that you provide as value-added content. This content won't affect your Google search ”it's information you add after the fact to make the content of the database more valuable to you.

Depending on how you create the database and the fields you plan to use as keys, you'll run into a number of problems. For example, you can use the title, URL, and search criteria as primary keys for the database and still end up with duplicate values from Google Web Services. Consequently, when you perform a search for an item online, Google Web Services might return ten entries, but a refresh from the database will show only nine entries. The difference is that the original Google search returned a duplicate value that MySQL didn't store in the database. In reality, this feature is handy for automatically reducing the size of the result set to just the unique sites that Google finds.

Another issue to consider is the matter of how to identify the search criteria. The example defines each set of criteria as unique ”even when it comes to the order of words. For example, a search for Visual Basic isn't the same as Basic Visual. Even though the words are the same, the order of the words is different. Try the terms using Google's advanced search and you'll notice that the order of the words does make a difference to Google, but does it make a difference to you? In some cases, you'll want to break the search phrase into component words, sort the words, and use the sorted version as the search criteria in the database. Using this technique, the user would receive the same results every time the same set of words appears as a search criteria, regardless of word order.

To expand on this idea, you might also want to define words that your search will ignore. Google generally ignores common terms such as if , but you could go further than that when efficient searching is important. A custom search engine can ignore common jargon terms in an effort to maintain the efficiency of a search. On the other hand, you might want to expand a search. When a user enters terms that can appear in more than one way, the search engine could expand them to locate other instances. For example, a search could look for both VB and Visual Basic when a user enters VB as part of the search criteria. Your database design must consider all of these search options to ensure you retrieve the same information from the database as Google provided during the live search.

Writing the Sample Application

The example performs a number of tasks inherent in all cached applications. First, it must determine the technique for processing the request. If the data exists in the database, then the application uses that source. If the data doesn't exist or the user wants to force a refresh, then the application has to make the request from Google. Second, when the application does make a request to Google Web Services, it places this information into the database. However, when using the database, the application builds an array that roughly simulates the array of result values returned by Google Web Services. Finally, the application uses the array (no matter the source) to display the information on screen. Listing 7.2 shows the first of these three steps.

Listing 7.2: Determining How to Process a Request

if ($SearchTerm != null) { // Create a database connection. mysql_connect("localhost") or die ("Cannot connect to the database."); // Select the database. mysql_select_db("GoogleData") or die ("The database doesn't exist or is inaccessible."); // Obtain the data. $Output = mysql_query("SELECT * FROM DataStore ". "WHERE SearchCriteria = '$SearchTerm'"); // Verify the data is in the database. if (mysql_num_rows ($Output) == 0 or $DoRefresh) { // Get the data. $result = GetData(); // Display the total results. print_r("<p>Total Estimated Results: "); print_r($result[estimatedTotalResultsCount]); print_r("</p>"); // Get the result array. $Results = $result[resultElements]; } else { for ($Counter = 0; $Counter < mysql_num_rows ($Output); $Counter++) { // Get the display data. $Row = mysql_fetch_row($Output) or die ("No Data in Query Row."); // Get the current row of data. $Temp = array($Counter => array("title" => $Row[1], "snippet" => $Row[2], "URL" => $Row[3], "cachedSize" => $Row[4])); // Add this data to the result set. $Results = array_merge($Results, $Temp); // If this is the first pass through the data, // include the scanned date. if ($Counter == 0) print_r("<label>Scanning Date: {$Row[5]}</label></br>"); } } // Display the actual results. print_r("<label>Actual Results: "); print_r(count($Results)); print_r("</label>"); ?>

 

You'll find the complete source code for this example in the \Chapter 07\CachedSearch folder of the source code located on the Sybex Web site.

The application begins by creating a connection to MySQL using mysql_connect() . This connection remains active throughout the application. The code then uses the mysql_select_db() function to select the caching database. Both of these calls display a special message if they fail. At this point, the code uses mysql_query() to look for the data. A failure here simply means the data doesn't appear in the database. When the call fails or the user specifically requests a refresh ( $DoRefresh is true) the code calls on GetData() , explained later in this section, to obtain the information from Google. Notice that the code displays the estimated number of results on return from the call. This information isn't available (or necessary) when working directly with the database. Finally, the code retrieves the result elements and places them in $Results .

When the database does contain the required information, the code must retrieve it one row at a time using mysql_fetch_row() . Unfortunately, the presentation of the data in the database is nothing like the data retrieved from Google, so the application has to build an array that approximates the Google result elements. The array_merge() function works nicely for this purpose.

One additional problem in using the database is that the data might be too old. The example handles this problem by displaying a scanned date when it retrieves data from the database. The user can choose to search online by checking the Force Refresh option. The final step is to display the actual number of results that appear on screen.

The application uses a special function, GetData() to retrieve the information from Google Web Services. This function isn't strictly necessary for the example, but placing the code in a separate location does become valuable when working with complex applications. Listing 7.3 shows the GetData() function.

Listing 7.3: Getting the Data and Updating the Database

function GetData() { global $SearchTerm; global $ProductType; global $StartIndex; // Create an instance of the SOAP client. This client must point // to the Google search site. Don't attempt to create a client // that uses a proxy because it won't work with Google. $soapclient = new soapclient("http://api.google.com/search/beta2"); // Uncomment the next line to see debug messages // $soapclient->debug_flag = 1; // Set up an array with the parameters use for the call. Make // sure you include your license key or the call will definitely // fail. $params = array( 'key' => 'Your License Key', ... Other Arguments ... 'oe' => ''); // Invoke the method. Include the method name, the list of // parameters, the namespace, and the SOAP action. $result = $soapclient->call("doGoogleSearch", $params, "urn:GoogleSearch", "urn:GoogleSearch"); // Get the result array. $Results = $result[resultElements]; // Get the scanning time. $ScanTime = strftime("%y/%m/%d %H:%M:%S"); // Remove the existing record (if any). mysql_query("DELETE FROM DataStore ". "WHERE SearchCriteria = '$SearchTerm'"); // Process each result array element. for ($Counter = 0; $Counter < count($Results); $Counter++) { // Obtain the update values. $IndSite = $Results[$Counter]; $Title = $IndSite[title]; $URL = ($IndSite[URL]); $CachedSize = $IndSite[cachedSize]; // Some sites don't include a snippet, so // check for the snippet length. if (strlen($IndSite[snippet]) == 0) // In many cases, sites that don't include // a snippet will include a summary. if (strlen($IndSite[summary]) > 0) $Snippet = $IndSite[summary]; else $Snippet = "N/A"; else $Snippet = $IndSite[snippet]; // Create the query string. $UpdateQuery = 'INSERT INTO DataStore (SearchCriteria, Title, URL, Snippet, Size, Scanned) VALUES ("' . $SearchTerm . '" , "' . $Title . '" , "' . $Snippet . '" , "' . $URL . '" , "' . $CachedSize . '" , "' . $ScanTime . '")'; // Update the database. mysql_query($UpdateQuery); } return $result; }

 

The function begins with a declaration of global variables . Make sure you include this declaration. PHP won't tell you when they aren't included and debugging for this particular program can become difficult.

As with other examples in the chapter, the next step is to create a SOAP client, define the arguments to access the Web service, and make the call to Google Web Services. The results appear in $result , which is returned to the caller for processing.

It's important to keep track of when you last scanned the data for a particular search. Otherwise, the data in your application becomes old and unusable. This application creates the $ScanTime variable and places the information in the database as part of a record insertion.

The code updates the database at this point. The first step is to delete any old records from the DataStore table to ensure the old data doesn't remain behind. The code then processes the results one element at a time. It places the data in variables, creates an update statement, and finally uses mysql_query() to perform the update.

The actual display code for this application is the same as the display code shown in Listing 7.1. The technique for displaying the data doesn't change, just the means for obtaining it. Figure 7.4 shows typical output from this application.

Figure 7.4: Using cached search data has a number of benefits for online users.

Категории