Statistics Hacks: Tips & Tools for Measuring the World and Beating the Odds

While entire species of creatures become extinct every day, occasionally new species are identified that were previously unknown. Surprisingly, statistical tools, not biological tools, can do the trick.

A few years back, a new species, a type of possum, was identified. The new species was named Trichosurus cunninghamii. Trichosurus means, um...possum (I guess), and the cunninghamii part refers to its discoverer, Ross Cunningham, a statistician at Australian National University. If you'd like to have a species named for you, here's how statistics can help.

Identifying Species with Statistics

There is a family of statistical analyses that looks at a bunch of variables and finds naturally occurring groupings among them. Typically, the groupings or clusters of variables are identified on the basis of the correlations among them [Hack #11].
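To make "correlations among them" concrete, here is a minimal sketch of the Pearson correlation between pairs of variables. The measurements are made up for illustration (they are not real possum data), and the function name pearson_r is my own. Variables that rise and fall together get a correlation near 1 and are candidates for the same cluster.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements (mm) for five animals; invented for this sketch.
ear_length = [48.0, 50.0, 52.0, 55.0, 57.0]
foot_length = [60.0, 62.0, 63.0, 67.0, 69.0]
chest_width = [30.0, 26.0, 31.0, 27.0, 29.0]

r_ear_foot = pearson_r(ear_length, foot_length)    # strongly related pair
r_ear_chest = pearson_r(ear_length, chest_width)   # weakly related pair
print(round(r_ear_foot, 2), round(r_ear_chest, 2))
```

In this invented data, ear length and foot length would land in the same grouping, while chest width would not.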

One procedure that uses this strategy attempts to find underlying dimensions or invisible, giant basic variables that account for a bunch of less important variables. This procedure is factor analysis, and elsewhere we see how it can, among other things, be used to identify writers' styles [Hack #65].

Statistics is full of similar techniques that can identify dimensions, underlying causes, and groupings. The goal of identifying groupings is of greatest use to biologically inclined statisticians who wish to identify new species.

For some group of animals to technically be a separate species, it must share a unique set of biological characteristics that make it distinct from similar animals. Sure, animals within the same family all look a little different from each other, but then, people look a lot different from each other and we are all one species (my Uncle Frank being perhaps the exception that proves the rule).

If a group of animals, such as Dr. Cunningham's possums, has more in common with each other than with the other creatures in their species, they might be candidates for consideration as a species in their own right. Statistics can determine the point at which a group is more alike, and more different from the rest of the species, than chance alone would produce.

Using Cunningham's discovery as a model, there are a few steps to follow for you to make your own discovery.

Collect some data

This possum existed in Australia near people for more than 200 years and no one noticed. To be fair, it looked an awful lot like the other possums, the most common of which was Trichosurus caninus, now called the short-eared possum.

It was assumed for some time that there was really just this one species of the little guys. Part of Dr. Cunningham's job was to collect and organize descriptive data for the wildlife around him. Consequently, he had a ton of very specific quantitative descriptions of various possum parts (eyes, ears, nose, and throat) and measurements of other physical characteristics.

Choose a statistical method

Cunningham's choice was a technique similar to factor analysis but with a more imposing name: canonical variate analysis. You can use any method that uses the variability in scores to create distinct groupings. Some of those are discussed in this book (such as factor analysis, mentioned earlier in this hack), but there are many other procedures that would work.

If you are really statistically savvy, it will help you to know that canonical variate analysis is functionally the same as discriminant analysis or multivariate analysis of variance (MANOVA), two other procedures that create linear composites of variables with the goal of conceptually defining two or more distinctly different groups.
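For the statistically savvy, a "linear composite" can be sketched with the two-group, two-variable version of discriminant analysis (Fisher's linear discriminant). This is an illustration only, not Cunningham's actual analysis: the data are invented, the helper names are mine, and the groups are assumed to be known in advance. The weights are computed as the inverse of the pooled within-group scatter matrix times the difference in group means.

```python
def mean_vector(rows):
    """Column means of a list of (x, y) measurements."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def scatter_2x2(rows, mean):
    """Sum of outer products of deviations from the group mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_weights(group_a, group_b):
    """Weights w = Sw^-1 (mean_a - mean_b) for two variables."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    sa, sb = scatter_2x2(group_a, ma), scatter_2x2(group_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

# Hypothetical (ear length, chest width) pairs in mm; invented for this sketch.
group_a = [(56, 28), (56, 30), (58, 28), (58, 30)]
group_b = [(48, 29), (48, 31), (50, 29), (50, 31)]

weights = fisher_weights(group_a, group_b)

def composite(row):
    """Linear composite: weighted sum of the two measurements."""
    return weights[0] * row[0] + weights[1] * row[1]

scores_a = [composite(r) for r in group_a]
scores_b = [composite(r) for r in group_b]
print(weights, min(scores_a), max(scores_b))
```

In this made-up data, ear length gets by far the larger weight, and the composite scores of the two groups do not overlap.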

Cunningham used this statistical procedure to examine the descriptive data for this presumably single species (you know, these Trichosurus caninus fellers) and demonstrated that there were likely two different species.

Select a hypothesis and analyze the data

Statisticians test hypotheses, so you should begin your analysis with a guess about whether there is or is not a distinction between the groups of participants who supplied your data.

In the example of our hero, Cunningham assumed that there were two different groups of critters that accounted for the data. Then, the procedure (using a computer for the calculations, of course) identified which variables worked best as key distinguishing characteristics between the theoretical groups.

The difference between using this tool, canonical variate analysis, and something like regression is that, when using variables to make predictions in regression, the researcher has some known data about scores of actual subjects: which "group" they belong to [Hack #13]. Here, the procedure works blindly without knowing what the correct answer is. Instead, it finds groups that can be made the most different with the variables at hand.
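To get a feel for what "working blindly" means, here is a rough sketch that splits unlabeled measurements into two groups without being told which animal belongs where. Note that it uses a simple two-cluster k-means on a single variable, which is a different and much simpler technique than canonical variate analysis. The ear lengths are invented and the function name is mine.

```python
def two_means_1d(values, iterations=20):
    """Split values into two clusters by alternating assignment and mean update."""
    if min(values) == max(values):
        raise ValueError("need at least two distinct values")
    low, high = min(values), max(values)  # deterministic starting centers
    for _ in range(iterations):
        # Assign each value to the nearer of the two current centers.
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each center to the mean of its group.
        new_low = sum(group_low) / len(group_low)
        new_high = sum(group_high) / len(group_high)
        if (new_low, new_high) == (low, high):
            break  # centers stopped moving
        low, high = new_low, new_high
    return group_low, group_high

# Hypothetical ear lengths (mm) with no group labels; invented for this sketch.
ear_lengths = [50, 57, 48, 59, 47, 56, 51, 58, 49]

short_eared, long_eared = two_means_1d(ear_lengths)
print(sorted(short_eared), sorted(long_eared))
```

The gap between the two clusters here is deliberately obvious. Real data would be messier, and a real analysis would use all the variables at once.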

Here are the variables Cunningham used:

  • Head length

  • Skull width

  • Eye size

  • Ear length

  • Body length (from tip of nose to tip of uncurled tail)

  • Tail length

  • Chest width

  • Foot length

While other variables were considered, Cunningham chose these because they were eventually found to be most important in distinguishing one species from another and also because they were characteristics that would probably be unaffected by environment.

Interpret results

The last step in any statistical analysis is to describe and understand whatever you found. For discovering species, you need to be able to describe that new species in enough detail to differentiate it from other, similar species.

The procedures used by Cunningham identified a series of different equations that weighted each of the biological variables differently, to find the combination that best identified two separate groups. These equations (which the procedure labels variates) are similar to regression equations, with the outcome or criterion variable determining which group a possum belongs to.

Here's the single best equation that accounted for an astonishing 89 percent of the variability on these characteristics for all the possums in his database:

(head length × .44) + (skull width × .07) + (eye size × .05) + (ear length × .82) + (body length × .35) + (tail length × .72) + (chest width × .16) + (foot length × .70)

I've provided the standardized weights from the study, so we can compare them to each other. The larger weights indicate the possum parts that differed the most between the mathematically chosen two groups of possums.

In this data, you could find two groups of possums that differed the most based on ear length, tail length, and foot length. The amount of variability explained was so large that, statistically, Cunningham concluded that the mathematically identified groupings were real. The two groups of possums found in the data were actually two different species of possum, and the species could be defined by their ear length and a couple of other variables. The larger the weights in the equation shown earlier, the more the two species differed on these body parts.
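As a quick check on that reading of the weights, here is a small sketch that ranks them by size and scores a single animal on the variate. The weights are the standardized ones from the equation. The animal's standardized measurements (z-scores) are hypothetical, and the names WEIGHTS and variate_score are mine.

```python
# Standardized weights as given in the equation in this hack.
WEIGHTS = {
    "head length": 0.44,
    "skull width": 0.07,
    "eye size": 0.05,
    "ear length": 0.82,
    "body length": 0.35,
    "tail length": 0.72,
    "chest width": 0.16,
    "foot length": 0.70,
}

def variate_score(z_scores):
    """Weighted sum of an animal's standardized measurements."""
    return sum(WEIGHTS[name] * z_scores[name] for name in WEIGHTS)

# Rank variables from largest to smallest weight.
ranked = sorted(WEIGHTS, key=WEIGHTS.get, reverse=True)
top_three = ranked[:3]
print(top_three)

# Hypothetical possum: one standard deviation above average on ear and
# foot length, exactly average on everything else.
possum = {name: 0.0 for name in WEIGHTS}
possum["ear length"] = 1.0
possum["foot length"] = 1.0
score = variate_score(possum)
print(round(score, 2))
```

The three largest weights belong to ear length, tail length, and foot length, which matches the variables singled out above.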

Two Possum Species

Table 6-20 shows the official descriptions of the two possum species first identified as such by our statistician and his mathematics. Notice the names are even based on the key predictors found in the statistical analysis!

Table 6-20. Two common Australian possums

Characteristic   Trichosurus caninus    Trichosurus cunninghamii
Common name      Short-eared possum     Mountain brushtail possum
Habitat          Lives in the north     Lives in the south
Ears             Shorter ears           Longer ears
Feet             Smaller feet           Larger feet
Head             Bigger head            Smaller head
Tail             Longer tail            Smaller tail

So, start collecting your own data on those odd, stinky bugs you find on your screen door and you are well on your way to greatness and immortality. Is there one species of stink bug or two? You tell me.

See Also

  • I first learned about this approach to identifying species in this fine article: Hall, P. (2003). Chance, 16, 1.
