Using Temporary Tables for Data Transformation
10.35.1 Problem
You want to preprocess input data for MySQL, but you don't have access to external utilities for doing so.
10.35.2 Solution
Load the data into a temporary table, reformat it using SQL statements, then copy the records into the final destination table.
10.35.3 Discussion
To work with information that must be checked or transformed before it's ready to be added to a table, it's sometimes helpful to load a datafile into a temporary table first for validation purposes. (It's generally easier to work with a dataset that is isolated into its own table rather than combined with other records.) After you have made sure that the temporary table's contents are satisfactory, copy its rows to the main table and then drop it. (Note that the use of "temporary" in this context doesn't necessarily imply that you must use the keyword TEMPORARY when creating the table.[6] If you process the table in multiple phases over the course of several server connections, you'll need to create a non-TEMPORARY table, then drop it explicitly when you're done with it.)
[6] CREATE TEMPORARY TABLE is discussed in Recipe 3.25.
The following example illustrates how to use a temporary table to solve a common problem: loading data into a table when the values do not have the format required by the table structure. Suppose you have a table main that contains three columns, name, date, and value, where date is a DATE column requiring values in ISO format (CCYY-MM-DD). Suppose also that you're given a datafile newdata.txt to be imported into the table, but the contents look like this:
name1 01/01/99 38 name2 12/31/00 40 name3 02/28/01 42 name4 01/02/03 44
Here the dates are in MM/DD/YY format and must be converted to ISO format to be stored as DATE values in MySQL. One way to do this would be to run the file through the cvt_date.pl script shown earlier in the chapter:
% cvt_date.pl --iformat=us --add-century newdata.txt >tmp
Then you can load the tmp file into the main table. But this task also can be accomplished entirely in MySQL with no external utilities by importing the data into a temporary table and using SQL to perform the reformatting operations. Here's how:
- Create an empty table in which to load the test data. The following statements create the table tmp as an empty copy of main and add a cdate column to hold the dates from the datafile as character strings:
mysql> CREATE TABLE tmp SELECT * FROM main WHERE 1 < 0; mysql> ALTER TABLE tmp ADD cdate CHAR(8);
- Load the datafile into the temporary table, storing the date values in the cdate column rather than in date:
mysql> LOAD DATA LOCAL INFILE 'newdata.txt' INTO TABLE tmp (name,cdate,value);
- Transform the cdate values from MM/DD/YY format to YY-MM-DD format and store the results in the date column:
mysql> UPDATE tmp -> SET date = CONCAT(RIGHT(cdate,2),'-',LEFT(cdate,2),'-',MID(cdate,4,2));
MySQL will convert the two-digit years to four-digit years automatically, so the original MM/DD/YY values in the cdate column end up in the date column as ISO values in CCYY-MM-DD format. The following query shows what the original cdate values and the transformed date values look like after the UPDATE statement has been performed:
mysql> SELECT cdate, date FROM tmp; +----------+------------+ | cdate | date | +----------+------------+ | 01/01/99 | 1999-01-01 | | 12/31/00 | 2000-12-31 | | 02/28/01 | 2001-02-28 | | 01/02/03 | 2003-01-02 | +----------+------------+
- Finally, copy the records from tmp to main (using the transformed date values rather than the original cdate values) and drop the temporary table:
mysql> INSERT INTO main (name, date, value) -> SELECT name, date, value FROM tmp; mysql> DROP TABLE tmp;
This procedure assumes that MySQL's automatic conversion of two-digit years to four digits produces the correct century values. This means that the year part of the values must correspond to years in the range from 1970 to 2069. If that's not true, you'd need to convert the year values some other way. (See Recipe 10.30.)
The procedure also assumes that the cdate values are always exactly eight characters so that LEFT( ), MID( ), and RIGHT( ) can be used to extract the pieces. If this assumption is invalid, you'd have to modify the conversion procedure. One possibility would be to use SUBSTRING_INDEX( ) to break apart the strings at the / delimiters:
mysql> UPDATE tmp -> SET date = -> CONCAT(SUBSTRING_INDEX(cdate,'/',-1),'-', -> SUBSTRING_INDEX(cdate,'/',1),'-', -> SUBSTRING_INDEX(SUBSTRING_INDEX(cdate,'/',2),'/',-1));
Another application for post-import processing is name splitting. If you import values consisting of a first name, a space, and a last name into a column full_name, you can reformat the column into separate first_name and last_name columns with these statements:
UPDATE tbl_name SET first_name = SUBSTRING_INDEX(full_name,' ',1); UPDATE tbl_name SET last_name = SUBSTRING_INDEX(full_name,' ',-1);
However, this task can easily become more difficult if any of the names have middle initials, or trailing words like Jr. or Sr. If that's the case, you're probably better off preprocessing the names prior to import, using a pattern matching utility that's smarter about breaking full names into components.