Validation Using a Lookup Table
10.29.1 Problem
You need to check values to make sure they're listed in a lookup table.
10.29.2 Solution
Issue queries to see if the values are in the table. But the way you do this depends on the number of input values and on the size of the table.
10.29.3 Discussion
To validate input values against the contents of a lookup table, you can use techniques somewhat similar to those shown in Recipe 10.28 on checking ENUM and SET columns. However, whereas ENUM and SET columns are limited to a maximum of 65,536 and 64 member values respectively, a lookup table can have an essentially unlimited number of values. You may not want to read them all into memory.
Validation of input values against the contents of a lookup table can be done several ways, as illustrated in the following discussion. The tests shown in the examples perform comparisons against values exactly as they are stored in the lookup table. To perform case-insensitive comparisons, remember to convert all values to a consistent lettercase.
10.29.4 Issue Individual Queries
For one-shot operations, you can test a value by checking whether it's listed in the lookup table. The following query returns true (nonzero) a value that is present and false otherwise:
$valid = $dbh->selectrow_array ( "SELECT COUNT(*) FROM $tbl_name WHERE val = ?", undef, $val);
This kind of test may be suitable for purposes such as checking a value submitted in a web form, but is inefficient for validating large datasets. It has no memory for the results of previous tests for values that have been seen before; consequently, you'll end up issuing a query for every single input value.
10.29.5 Construct a Hash from the Entire Lookup Table
If you're going to perform bulk validation of a large set of values, it's more efficient to pull the lookup values into memory, save them in a data structure, and check each input value against the contents of that structure. Using an in-memory lookup avoids the overhead of running a query for each value.
First, run a query to retrieve all the lookup table values and construct a hash from them:
my %members; # hash for lookup values my $sth = $dbh->prepare ("SELECT val FROM $tbl_name"); $sth->execute ( ); while (my ($val) = $sth->fetchrow_array ( )) { $members{$val} = 1; }
Then check each value by performing a hash key existence test:
$valid = exists ($members{$val});
This reduces the database traffic to a single query. However, for a large lookup table, that may still be a lot of traffic, and you may not want to hold the entire table in memory.
10.29.6 Use a Hash as a Cache of Already-Seen Lookup Values
Another lookup technique is to mix use of individual queries with a hash that stores lookup value existence information. This approach can be useful if you have a very large lookup table. Begin with an empty hash:
my %members; # hash for lookup values
Then, for each value to be tested, check whether or not it's present in the hash. If not, issue a query to see if the value is present in the lookup table, and record the result of the query in the hash. The validity of the input value is determined by the value associated with the key, not by the existence of the key:
if (!exists ($members{$val})) # haven't seen this value yet { my $count = $dbh->selectrow_array ( "SELECT COUNT(*) FROM $tbl_name WHERE val = ?", undef, $val); # store true/false to indicate whether value was found $members{$val} = ($count > 0); } $valid = $members{$val};
For this method, the hash acts as a cache, so that you run a lookup query for any given value only once, no matter how many times it occurs in the input. For datasets that have a reasonable number of repeated values, this approach avoids issuing a separate query for every single value, while requiring an entry in the hash only for each unique value. It thus stands between the other two approaches in terms of the tradeoff between database traffic and program memory requirements for the hash.
Note that the hash is used in a somewhat different manner for this method than for the previous method. Previously, the existence of the input value as a key in the hash determined the validity of the value, and the value associated with the hash key was irrelevant. For the hash-as-cache method, the meaning of key existence in the hash changes from "it's valid" to "it's been tested before." For each key, the value associated with it indicates whether the input value is present in the lookup table. (If you store as keys only those values that are found to be in the lookup table, you'll issue a query for each instance of an invalid value in the input dataset, which is inefficient.)