Extracting and Rearranging Datafile Columns

2017-11-03 09:05:05

10.20.1 Problem

You want to pull out columns from a datafile or rearrange them into a different order.

10.20.2 Solution

Use a utility that can produce columns from a file on demand.

10.20.3 Discussion

cvt_file.pl serves as a tool that converts entire files from one format to another. Another common datafile operation is to manipulate its columns. This is necessary, for example, when importing a file into a program that doesn't understand how to extract or rearrange input columns for itself. Perhaps you want to omit columns from the middle of a file so you can use it with LOAD DATA, which cannot skip over columns in the middle of data lines. Or perhaps you have a version of mysqlimport older than 3.23.17, which doesn't support the --columns option that allows you to indicate the order in which table columns appear in the file. To work around these problems, you can rearrange the datafile instead.

Recall that this chapter began with a description of a scenario involving a 12-column CSV file somedata.csv from which only columns 2, 11, 5, and 9 were needed. You can convert the file to tab-delimited format like this:

% cvt_file.pl --iformat=csv somedata.csv > somedata.txt

But then what? If you just want to knock out a short script to extract those specific four columns, that's fairly easy: write a loop that reads input lines and writes only the columns you want in the proper order. Assuming input in tab-delimited, linefeed-terminated format, a simple Perl program to pull out the four columns can be written like this:

#! /usr/bin/perl -w # yank_4col.pl - 4-column extraction example # Extracts column 2, 11, 5, and 9 from 12-column input, in that order. # Assumes tab-delimited, linefeed-terminated input lines. use strict; while (<>) { chomp; my @in = split (/ /, $_); # split at tabs # extract columns 2, 11, 5, and 9 print join (" ", $in[1], $in[10], $in[4], $in[8]) . " "; } exit (0);

Run the script as follows to read the file containing 12 data columns and write output that contains only the four columns in the desired order:

% yank_4col.pl somedata.txt > tmp

But yank_4col.pl is a special purpose script, useful only within a highly limited context. With just a little more work, it's possible to write a more general utility yank_col.pl that allows any set of columns to be extracted. With such a tool, you'd specify the column list on the command line like this:

% yank_col.pl --columns=2,11,5,9 somedata.txt > tmp

Because the script doesn't use a hardcoded column list, it can be used to pull out an arbitrary set of columns in any order. Columns can be specified as a comma-separated list of column numbers or column ranges. (For example, --columns=1,4-7,10 means columns 1, 4, 5, 6, 7, and 10.) yank_col.pl looks like this:

#! /usr/bin/perl -w # yank_col.pl - extract columns from input # Example: yank_col.pl --columns=2,11,5,9 filename # Assumes tab-delimited, linefeed-terminated input lines. use strict; use Getopt::Long; $Getopt::Long::ignorecase = 0; # options are case sensitive my $prog = "yank_col.pl"; my $usage = < $help, # print help message "columns=s" => $columns # specify column list ) or die "$usage "; die "$usage " if defined $help; my @col_list = split (/,/, $columns) if defined ($columns); @col_list or die "$usage "; # nonempty column list is required # make sure column specifiers are positive integers, and convert from # 1-based to 0-based values my @tmp; for (my $i = 0; $i < @col_list; $i++) { if ($col_list[$i] =~ /^d+$/) # single column number { die "Column specifier $col_list[$i] is not a positive integer " unless $col_list[$i] > 0; push (@tmp, $col_list[$i] - 1); } elsif ($col_list[$i] =~ /^(d+)-(d+)$/) # column range m-n { my ($begin, $end) = ($1, $2); die "$col_list[$i] is not a valid column specifier " unless $begin > 0 && $end > 0 && $begin <= $end; while ($begin <= $end) { push (@tmp, $begin - 1); ++$begin; } } else { die "$col_list[$i] is not a valid column specifier "; } } @col_list = @tmp; while (<>) # read input { chomp; my @val = split (/ /, $_, 10000); # split, preserving all fields # extract desired columns, mapping undef to empty string (can # occur if an index exceeds number of columns present in line) @val = map { defined ($_) ? $_ : "" } @val[@col_list]; print join (" ", @val) . " "; } exit (0);

The input processing loop converts each line to an array of values, then pulls out from the array the values corresponding to the requested columns. To avoid looping though the array, it uses Perl's notation that allows a list of subscripts to be specified all at once to request multiple array elements. For example, if @col_list contains the values 2, 6, and 3, these two expressions are equivalent:

($val[2] , $val[6], $val[3]) @val[@col_list]

What if you want to extract columns from a file that's not in tab-delimited format, or produce output in another format? In that case, combine yank_col.pl with cvt_file.pl. Suppose you want to pull out all but the password column from the colon-delimited /etc/passwd file and write the result in CSV format. Use cvt_file.pl both to preprocess /etc/passwd into tab-delimited format for yank_col.pl and to post-process the extracted columns into CSV format:

% cvt_file.pl --idelim=":" /etc/passwd | yank_col.pl --columns=1,3-7 | cvt_file.pl --oformat=csv > passwd.csv

If you don't want to type all of that as one long command, use temporary files for the intermediate steps:

% cvt_file.pl --idelim=":" /etc/passwd > tmp1 % yank_col.pl --columns=1,3-7 tmp1 > tmp2 % cvt_file.pl --oformat=csv tmp2 > passwd.csv % rm tmp1 tmp2

Forcing split() to Return Every Field

The Perl split( ) function is extremely useful, but normally it doesn't return trailing empty fields. This means that if you write out only as many fields as split( ) returns, output lines may not have the same number of fields as input lines. To avoid this problem, pass a third argument to indicate the maximum number of fields to return. This forces split( ) to return as many fields as are actually present on the line, or the number requested, whichever is smaller. If the value of the third argument is large enough, the practical effect is to cause all fields to be returned, empty or not. Scripts shown in this chapter use a field count value of 10,000:

# split line at tabs, preserving all fields my @val = split (/ /, $_, 10000);

In the (unlikely?) event that an input line has more fields than that, it will be truncated. If you think that will be a problem, you can bump up the number even higher.

Forcing split() to Return Every Field

Категории