A LOAD DATA Diagnostic Utility
10.38.1 Problem
LOAD DATA or mysqlimport indicates a nonzero warning count when you load a datafile into MySQL, but you have no idea which rows or columns were problematic.
10.38.2 Solution
Run the file through a utility that diagnoses which data values caused the warnings.
10.38.3 Discussion
As a bulk loader, LOAD DATA is very efficient; it can run many times faster than a set of INSERT statements that adds the same rows. However, LOAD DATA also is not very informative. It returns only a message that indicates the number of records processed, and a few other status counts. For example, in the previous section, we generated a datafile managers.txt to use with guess_table.pl for guessing the structure of the baseball1.com managers table. If you create that table using the resulting CREATE TABLE statement and then load the datafile into it, you will observe the following result:
mysql> LOAD DATA LOCAL INFILE 'managers.txt' INTO TABLE managers -> IGNORE 1 LINES; Query OK, 2841 rows affected (0.06 sec) Records: 2841 Deleted: 0 Skipped: 0 Warnings: 5082
Evidently, there were a quite a few problems with the file. Unfortunately, the message produced by LOAD DATA doesn't tell you anything about which rows and columns caused them. The mysqlimport program is similarly terse, because its message is the same as the one returned by LOAD DATA.
We'll revisit this example at the end of the section, but first consider LOAD DATA's output style. On the one hand, the minimal-report approach is the right one to take. If warning information were to be returned to the client, it potentially could include a diagnostic message for each input row, or even for each column! This might be overwhelming and certainly would entirely defeat the high-efficiency nature of LOAD DATA. On the other hand, more information about the source of errors could be useful for fixing the file to eliminate the warnings.
It's on the MySQL development "to do" list to allow LOAD DATA errors to be logged to another table so that you can get extended diagnostic information. In the meantime, you can use the load_diag.pl utility included in the transfer directory of the recipes distribution. load_diag.pl is useful for "pre-flighting" a datafile to get an idea of how well the file will load into the table you intend it for, and to pinpoint problems so that you can clean up the file before loading it into MySQL "for real."
load_diag.pl also can help you identify patterns of problems for situations in which it may be beneficial to write a preprocessing filter. Suppose you periodically receive files containing data to be loaded into a given MySQL table. The more frequently this occurs, the more highly motivated you'll be to automate as much of the data transfer process as possible. This may involve writing a filter to convert data values from the format in which you receive them to a format more appropriate for MySQL. Running the datafiles through load_diag.pl can help you assess which columns tend to be problematic and thereby assist you in determining where to concentrate your efforts in creating a transformation program for rewriting the files so they will load cleanly into MySQL.
To run load_diag.pl, specify the name of the database and table you intend to load the datafile into, as well as the name of the file itself:
% load_diag.pl db_name tbl_name file_name
load_diag.pl won't actually load anything into the table named on the command line, but it needs to know what the table is so that it can create a temporary table that has the same column structure to use for testing.
Initially, load_diag.pl loads the entire datafile into the temporary table to see if there are any warnings. If not, there's nothing else to do, so load_diag.pl drops the temporary table and exits. Otherwise, it loads each line of the datafile into the table individually to determine which lines caused problems, using the following procedure:
- It writes the line to a temporary file and issues a LOAD DATA statement to load the file into the table. If the warning count is zero, the line is assumed to be okay.
- If the warning count for the line is nonzero, load_diag.pl examines each of its columns in turn by using a series of single-column LOAD DATA statements to find out which ones generate warnings.
- If a column-specific warning occurs and the data value is empty, load_diag.pl determines whether the warning goes away by loading a NULL value instead. It does this because if a datafile contains empty values, you can often get better results by loading NULL than by loading empty strings. (For example, if you load an empty string into an INT column, MySQL converts the value to 0 and issues a warning.) If a datafile turns out to have a significant reduction in warnings when loading NULL rather than empty strings, you may find it useful to run the file through to_null.pl before loading it.
- It's also possible for warnings to occur if a line contains fewer or more columns than the number of columns in the table, so load_diag.pl checks that, too.
load_diag.pl prints diagnostic information about its findings while testing each input line, then prints a summary report after the entire file has been processed. The report indicates the number of lines in the file, how many warnings the initial full-file load caused, and the number of lines that had too few or too many columns. The report also includes a list that shows for each column how many values were missing, the number of warnings that occurred, how many of those warnings occurred for empty values, and the number of empty-value warnings that went away by loading NULL instead.
As you might guess, all this activity means that load_diag.pl isn't nearly as efficient as LOAD DATA. In fact, it has the potential to exercise your server rather heavily! But its goal is to provide maximal information, not minimal execution time. (Note too that if your MySQL server has logging enabled, using load_diag.pl with large datafiles can cause the logs to grow quickly.)
To see how load_diag.pl works, assume you have a simple table named diag_test that contains string, date, and number columns:
CREATE TABLE diag_test ( str CHAR(10), date DATE, num INT );
Assume you also have a datafile named diag_sample.dat that you plan to load into the table:
str1 01-20-2001 97 str2 02-28-2002 03-01-2002 64 extra junk
To see if the file will have any problems loading, check it like this:
% load_diag.pl cookbook diag_test diag_sample.dat line 1: 1 warning column 2 (date): bad value = (01-20-2001) line 2: 2 warnings too few columns column 2 (date): bad value = (02-28-2002) column 3 (num): missing from input line column 3 (num): bad value = ( ) (inserting NULL worked better) line 3: 1 warning excess number of columns Number of lines in file: 3 Warnings found when loading entire file: 4 Lines containing too few column values: 1 Lines containing excess column values: 1 Warnings per column: Column Times Total Warnings for Improved missing warnings empty columns with NULL str 0 0 0 0 date 0 2 0 0 num 1 1 1 1
It appears that the dates don't load very well. That's not surprising, because they appear to be in U.S. format and should be rewritten in ISO format. Converting empty fields to N may also be beneficial, and you can get rid of the extra column value in line 3. Using some of the utilities developed earlier in this chapter, perform all those transformations, writing the result to a temporary file:
% yank_col.pl --columns=1-3 diag_sample.dat | cvt_date.pl --iformat=us --oformat=iso | to_null.pl > tmp
The tmp file produced by that command looks like this:
str1 2001-01-20 97 str2 2002-02-28 N N 2002-03-01 64
Using load_diag.pl to check the new file produces the following result:
% load_diag.pl cookbook diag_test tmp File loaded with no warnings, no per-record tests performed
This indicates that if you load tmp into the diag_test table, you should get good results, and indeed that is true:
mysql> LOAD DATA LOCAL INFILE 'tmp' INTO TABLE diag_test; Records: 3 Deleted: 0 Skipped: 0 Warnings: 0
Clearly, that's a lot of messing around just to make a three-line file load into MySQL better. But the point of the example is to illustrate that the feedback load_diag.pl provides can help you figure out what's wrong with a datafile so that you can clean it up.
In addition to the required arguments that name the database, table, and datafile, load_diag.pl understands several options:
--columns= name1,name2,name3,...
By default, load_diag.pl assumes the datafile contains columns that correspond in number and order to the columns in the table. If that is not true, use this option to specify the names of the columns that are present in the file, and in what order.
--labels
This option indicates that the datafile contains an initial row of labels that should be skipped. (Loading labels into a table typically results in spurious warnings.)
--skip-full-load
Skip the initial test that loads the entire datafile.
--tmp-table= tbl_name
Specify the name to use for the temporary table. The default is _load_diag_n, where n is load_diag.pl's process ID.
If necessary, you can also specify standard connection parameter options like --user or --host. Any options must precede the database name argument.
Use of load_diag.pl is subject to the following constraints and limitations:
- The input must be in tab-delimited, linefeed-terminated format.
- Record loading is performed with the LOCAL option of the LOAD DATA statement. LOCAL capability requires MySQL 3.22.15 or higher (and, as of 3.23.49, requires that your MySQL distribution not have been built with that capability disabled).
- When load_diag.pl creates the temporary table, it omits any indexes that are present in the original table. This results in faster record loading time (particularly for the initial test that loads the entire datafile). On the other hand, not using indexes means that load_diag.pl won't find warnings that result from duplicate key values on unique indexes.
Returning to the example with which this section began, what about all those warnings that resulted from loading the managers.txt file into the managers table? load_diag.pl identifies them all as being due to missing or empty columns at the end of some of the lines:
% load_diag.pl --labels cookbook managers managers.txt line 2: 2 warnings column 14 (postwins): bad value = ( ) (inserting NULL worked better) column 15 (postlosses): bad value = ( ) (inserting NULL worked better) line 3: 2 warnings column 14 (postwins): bad value = ( ) (inserting NULL worked better) column 15 (postlosses): bad value = ( ) (inserting NULL worked better) ... line 2839: 2 warnings column 14 (postwins): bad value = ( ) (inserting NULL worked better) column 15 (postlosses): bad value = ( ) (inserting NULL worked better) line 2842: 2 warnings column 14 (postwins): bad value = ( ) (inserting NULL worked better) column 15 (postlosses): bad value = ( ) (inserting NULL worked better) Number of lines in file: 2842 Warnings found when loading entire file: 5082 Lines containing too few column values: 416 Lines containing excess column values: 0 Warnings per column: Column Times Total Warnings for Improved missing warnings empty columns with NULL lahmanid 0 0 0 0 year 0 0 0 0 team 0 0 0 0 lg 0 0 0 0 div 0 0 0 0 g 0 0 0 0 w 0 0 0 0 l 0 0 0 0 pct 0 0 0 0 std 0 0 0 0 half 0 0 0 0 mgrorder 0 0 0 0 plyrmgr 16 0 0 0 postwins 416 2533 2533 2533 postlosses 416 2533 2533 2533
From this result, we can determine that 416 lines were missing the postwins and postlosses columns (and 16 of those were missing the plyrmgr column as well). The remaining errors were due to lines for which the postwins and postlosses columns were present but empty. The entire-file warning count of 5082 can be accounted for as the the number of plyrmgr values that were missing, plus the total warnings from the postwins and postlosses columns (16+2533+2533 = 5082).
The Total warnings value for the plyrmgr column is zero because it's a CHAR column, and thus loading empty values into it is legal. The Total warnings value for postwins and postlosses is nonzero because they are INT columns and loading empty values into them result in a conversion-to-zero operations. All of these problems are of the sort that can be made to go away by converting empty or missing values to N. Run the file through yank_col.pl to force each line to have 15 columns, and run the result through to_null.pl to convert empty values to N:
% yank_col.pl --columns=1-15 managers.txt | to_null.pl > tmp
Then see what load_diag.pl has to say about the resulting file:
% load_diag.pl --labels cookbook managers tmp File loaded with no warnings, no per-record tests performed
If you load tmp into the managers table, no problems should occur:
mysql> LOAD DATA LOCAL INFILE 'tmp' INTO TABLE managers IGNORE 1 LINES;
Query OK, 2841 rows affected (0.13 sec)
Records: 2841 Deleted: 0 Skipped: 0 Warnings: 0
Категории