Validating and Transforming Data

10.21.1 Problem

You need to make sure the data values contained in a file are legal.

10.21.2 Solution

Check them, possibly rewriting them into a more suitable format.

10.21.3 Discussion

Earlier recipes in this chapter show how to work with the structural characteristics of files, by reading lines and busting them up into separate columns. It's important to be able to do that, but sometimes you need to work with the data content of a file, not just its structure:

This section begins a set of recipes that describe validation and reformatting techniques that are useful in these kinds of situations. Techniques covered here for checking values include direct comparison, pattern matching, and validation against information in a database. It's not unusual for certain validation operations to come up over and over, in which case you'll probably find it useful to to construct a library of functions. Packaging validation operations as library routines makes it easier to write utilities based on them, and the utilities make it easier to perform command-line operations on entire files so you can avoid editing them yourself.

10.21.4 Writing an Input-Processing Loop

Many of the validation recipes shown in the new few sections are typical of those that you perform within the context of a program that reads a file and checks individual column values. The general form of such a file-processing utility can be written like this:

#! /usr/bin/perl -w # loop.pl - Typical input-processing loop # Assumes tab-delimited, linefeed-terminated input lines. use strict; while (<>) # read each line { chomp; # split line at tabs, preserving all fields my @val = split (/ /, $_, 10000); for my $i (0 .. @val - 1) # iterate through columns in line { # ... test $val[$i] here ... } } exit (0);

The while( ) loop reads each input line and breaks it into fields. Inside the loop, each line is broken into fields. Then the inner for( ) loop iterates through the fields in the line, allowing each one to be processed in sequence. If you're not applying a given test uniformly to all the fields, you'd replace the for( ) loop with separate column-specific tests.

This loop assumes tab-delimited, linefeed-terminated input, an assumption that is shared by most of the utilities discussed throughout the rest of this chapter. To use these programs with datafiles in other formats, you may be able to convert them into tab-delimited format using the cvt_file.pl script discussed in Recipe 10.19.

10.21.5 Putting Common Tests in Libraries

For a test that you perform often, it may be useful to package it as a library function. This makes the operation easy to perform, and also gives it a name that's likely to make the meaning of the operation clearer than the comparison code itself. For example, the following test performs a pattern match to check that $val consists entirely of digits (optionally preceded by a plus sign), then makes sure the value is greater than zero:

$valid = ($val =~ /^+?d+$/ && $val > 0);

In other words, the test looks for strings that represent positive integers. To make the test easier to use and its intent clearer, you might put it into a function that is used like this:

$valid = is_positive_integer ($val);

The function itself can be defined as follows:

sub is_positive_integer { my $s = shift; return ($s =~ /^+?d+$/ && $s > 0); }

Then put the function definition into a library file so that multiple scripts can use it easily. The Cookbook_Utils.pm module file in the lib directory of the recipes distribution is an example of a library file that contains a number of validation functions. Take a look through it to see which functions may be useful to you in your own programs (or as a model for writing your own library files). To gain access to this module from within a script, include a use statement like this:

use Cookbook_Utils;

You must of course install the module file in a directory where Perl will find it. For details on library installation, see Recipe 2.4.

A significant benefit of putting a collection of utility routines into a library file is that you can use it for all kinds of programs. It's rare for a data manipulation problem to be completely unique. If you can pick and choose at least a few validation routines from a library, it's possible to reduce the amount of code you need to write, even for highly specialized programs.

Категории