Perl Best Practices

19.8. Benchmarking

Don't optimize code: benchmark it.

It's natural to think that a single expression like:

keys %{ { map {$_=>1} @_ } }

will be more efficient than two statements:

my %seen; return grep {!$seen{$_}++} @_;

But, unless you are deeply familiar with the internals of the Perl interpreter[*], intuitions about the relative performance of two constructs are exactly that: unconscious guesses.

[*] In which case you already have far more serious personal issues to deal with.

The only way to know for sure which of two (or more) alternatives will perform better is to actually time each of them. The standard Benchmark module makes that easy, as Example 19-8 illustrates.

Example 19-8. Benchmarking the uniqueness functions

# A short list of not-quite-unique values...
our @data = qw( do re me fa so la ti do );

# Various candidates...
sub unique_via_anon {
    return keys %{ { map {$_=>1} @_ } };
}

sub unique_via_grep {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

sub unique_via_slice {
    my %uniq;
    @uniq{@_} = ();
    return keys %uniq;
}

# Compare the current set of data in @data
sub compare {
    my ($title) = @_;
    print "\n[$title]\n";

    # Create a comparison table of the various timings, making sure that
    # each test runs at least 10 CPU seconds...
    use Benchmark qw( cmpthese );
    cmpthese -10, {
        anon  => 'my @uniq = unique_via_anon(@data)',
        grep  => 'my @uniq = unique_via_grep(@data)',
        slice => 'my @uniq = unique_via_slice(@data)',
    };

    return;
}

compare('8 items, 10% repetition');

# Two copies of the original data...
@data = (@data) x 2;
compare('16 items, 56% repetition');

# One hundred copies of the original data...
@data = (@data) x 50;
compare('800 items, 99% repetition');

The cmpthese() subroutine is passed a number, followed by a reference to a hash of tests. The number specifies either the exact number of times to run each test (if the number is positive) or the minimum number of CPU seconds to run each test for (if the number is negative). Typical values are around 10,000 repetitions or 10 CPU seconds, but the module will warn you if a test is too short to produce an accurate benchmark.
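
For example, the two ways of specifying that first argument might look like this (a sketch reusing the @data package variable and the candidate subroutines from Example 19-8):

use Benchmark qw( cmpthese );

# Positive count: run each test exactly 10,000 times...
cmpthese 10_000, {
    grep  => 'my @uniq = unique_via_grep(@data)',
    slice => 'my @uniq = unique_via_slice(@data)',
};

# Negative count: run each test for at least 10 CPU seconds...
cmpthese -10, {
    grep  => 'my @uniq = unique_via_grep(@data)',
    slice => 'my @uniq = unique_via_slice(@data)',
};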

The keys of the test hash are the names of your tests, and the corresponding values specify the code to be tested. Those values can be either strings (which are eval'd to produce executable code) or subroutine references (which are called directly).

Specifying your tests as strings usually produces results that are more accurate. Subroutine references have to be re-invoked each time each test is repeated, which adds a subroutine call overhead to every test and tends to obscure small relative differences in performance. So benchmarking via eval'd strings is the better practice here (despite the general exhortations against them in Chapter 8).
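If you want to see that overhead for yourself, one rough way is to benchmark the same candidate both ways (a sketch that again reuses unique_via_grep() and @data from Example 19-8):

use Benchmark qw( cmpthese );

cmpthese -10, {
    # String test: eval'd by Benchmark, so each repetition runs just the code itself...
    grep_string  => 'my @uniq = unique_via_grep(@data)',

    # Coderef test: each repetition also pays for an extra subroutine call...
    grep_coderef => sub { my @uniq = unique_via_grep(@data) },
};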

Unfortunately, eval'd code can see only those lexical variables in the scope where the eval is executed, not those in the scope where the string is created. That means the string eval inside cmpthese( ) cannot see lexicals in the scope where cmpthese( ) was called. So for accurate benchmarking, it's necessary to also ignore the strong admonitions in Chapter 5, and pass data to Benchmark tests using package variables (e.g. our @data) rather than lexicals.
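For example (a sketch only; unique_via_grep() is the candidate from Example 19-8), a lexical array simply isn't visible inside the eval'd test string, so the first test below would typically end up referring to an unrelated, empty package variable of the same name and quietly benchmark calls on an empty list, whereas the genuine package variable behaves as intended:

use Benchmark qw( cmpthese );

my  @lexical_data = qw( do re me fa so la ti do );   # invisible to the eval'd strings
our @package_data = qw( do re me fa so la ti do );   # reachable via the symbol table

cmpthese -1, {
    # The eval inside cmpthese() sees an unrelated (and empty) @lexical_data...
    lexical => 'my @uniq = unique_via_grep(@lexical_data)',

    # ...but it can always find the package variable by name...
    package => 'my @uniq = unique_via_grep(@package_data)',
};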

Of course, if you use anonymous subroutines for your tests, they can see any lexicals in their declaration scope, in which case you can avoid both globals and eval, the only problem being that your results may be badly skewed by the extra subroutine call costs.
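A sketch of that coderef style (again using the candidates from Example 19-8) looks like this:

use Benchmark qw( cmpthese );

my @data = qw( do re me fa so la ti do );    # a lexical is fine here

cmpthese -10, {
    # Each closure captures the lexical @data directly, so no package
    # variables or eval'd strings are needed; the trade-off is that every
    # repetition now includes the cost of an extra subroutine call...
    grep  => sub { my @uniq = unique_via_grep(@data)  },
    slice => sub { my @uniq = unique_via_slice(@data) },
};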

The benchmarking code in Example 19-8 would print out something like the following:

[8 items, 10% repetition]
         Rate  anon  grep slice
anon  28234/s    --  -24%  -47%
grep  37294/s   32%    --  -30%
slice 53013/s   88%   42%    --

[16 items, 56% repetition]
         Rate  anon  grep slice
anon  21283/s    --  -28%  -51%
grep  29500/s   39%    --  -32%
slice 43535/s  105%   48%    --

[800 items, 99% repetition]
        Rate  anon  grep slice
anon   536/s    --  -65%  -89%
grep  1516/s  183%    --  -69%
slice 4855/s  806%  220%    --

Each of the tables printed has a separate row for each named test. The first column lists the absolute speed of each candidate in repetitions per second, while the remaining columns allow you to compare the relative performance of any two tests. For example, in the final test, tracing across the grep row to the anon column reveals that the grepped solution was 183% faster than the anonymous-hash solution (1516/s versus 536/s, or about 2.8 times as fast). Tracing further across the same row indicates that grepping was 69% slower (-69% faster) than slicing.

Overall, the indication from the three tests is that the slicing-based solution is consistently the fastest for this particular set of data on this particular machine. It also appears that, as the data set grows, slicing scales much better than either of the other two approaches.

However, those two conclusions are effectively drawn from only three data points (namely, the three benchmarking runs). To get a more definitive comparison of the three methods, you'd also need to test other possibilities, such as a long list of non-repeating items or a short list with nothing but repetitions.
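Those extra cases are easy to bolt onto Example 19-8, reusing its compare() subroutine and its @data package variable (the particular data sets below are just illustrative guesses):

# A long list with no repetitions at all...
@data = ( 1 .. 10_000 );
compare('10000 items, all unique');

# A short list with nothing but repetitions...
@data = ('do') x 8;
compare('8 items, all identical');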

Better still, test on the real data that you'll actually be "unique-ing".

For example, if that data is a sorted list of a quarter million words, with only minimal repetitions, and which has to remain sorted, then test exactly that:

# slurp() is assumed to come from the Perl6::Slurp CPAN module
# (or any equivalent file-slurping helper)...
our @data = slurp '/usr/share/biglongwordlist.txt';

use Benchmark qw( cmpthese );

cmpthese 10, {
    # Note: the non-grepped solutions need a post-uniqification re-sort...
    anon  => 'my @uniq = sort(unique_via_anon(@data))',
    grep  => 'my @uniq = unique_via_grep(@data)',
    slice => 'my @uniq = sort(unique_via_slice(@data))',
};

Not surprisingly, this benchmark indicates that the grepped solution is markedly superior on a large sorted data set:

       s/iter  anon slice  grep
anon     4.28    --   -3%  -46%
slice    4.15    3%    --  -44%
grep     2.30   86%   80%    --

Perhaps more interestingly, the grepped solution still benchmarks as being marginally faster when the two hash-based approaches aren't re-sorted. This suggests that the better scalability of the sliced solution that was seen in the earlier benchmark is a localized phenomenon, and is eventually undermined by the growing costs of allocation, hashing, and bucket-overflows as the sliced hash grows very large.

Above all, that last example demonstrates that benchmarks only benchmark the cases you actually benchmark, and that useful conclusions about performance can only be drawn by benchmarking real data.
