Sunday, January 2, 2011

Simple Statistics 101: How To Write a Perl Script that performs a Hypothesis Test

The hypothesis test.

It's something many of us have, and will, encounter on a regular basis in many courses, and in professional settings.

This initial blog post is, simply put, an easy way to show the potential power that just a little bit of programming knowledge can have on data analysis and manipulation. After the jump is a sample script I wrote in Perl that accepts an input of values, and a test point, and prints out summary statistics.

#!/usr/bin/perl

use URI;
use URI::Escape;
use URI::Split qw(uri_split uri_join);

print "Enter numbers, seperated by commas: \n";
my \$list = <>;
print "Enter test point for z-score:\n";
my \$hypothesis = <>;

my @nums = split(/,/, \$list);
\$total = scalar(@nums);

my \$sum = 0;
foreach(@nums) {
chomp;
my \$num = \$_;
\$sum = \$sum + \$num;
}

my \$mean = \$sum / \$total;
my \$varsum = 0;
foreach(@nums) {
chomp;
my \$num2 = \$_;
\$varsum = \$varsum + (\$num2-\$mean)**2;
}

my \$popvar = \$varsum / \$total;
my \$sampvar = \$varsum / (\$total - 1);
my \$popsd = \$popvar**0.5;
my \$poperr = \$popsd / (\$total**0.5);
my \$sampsd = \$sampvar**0.5;
my \$samperr = \$sampsd / (\$total**0.5);
my \$zpop = (\$hypothesis - \$mean) / \$poperr;
my \$zsamp = (\$hypothesis - \$mean) / \$samperr;

my \$m = sprintf("%.3f", \$mean);
my \$p = sprintf("%.4f", \$popsd);
my \$s = sprintf("%.4f", \$sampsd);
my \$zp = sprintf("%.4f", \$zpop);
my \$zs = sprintf("%.4f", \$zsamp);

print "Sample:\t\$total\nMean:\t\$m\nPop. St.Dev\t\$p\nPop Z-score:\t\$zp\nSample St.Dev\t\$s\nSample Z-score:\t\$zs\n";

Not always the cleanest, but it is effective. Let's analyze what is occurring in every block of code:
1) Accepts the list of data points, and the hypothesis (example, testing to see if a track coach's claim of improving his athlete's 100 meter times or not, the null can be set at 0.0 seconds).
2) Splits the numbers, counts the total and sums.
3) Determines mean, and then determines the sum of squares.
4) Determines two separate lists of statistics: one if the list is a population, one if the list is a sample (though this is easy enough to write into the input fields, this script outputs both).
5) Prints results, and sample t-scores.

Of course, this is hardly credible without a trial run, so let's test this out.

\$ perl hyptestcalc.pl
Enter numbers, seperated by commas:
0.2,0.2,0,-0.1,0,0.2,0.1,0,-0.2,0.1,0.3,0.4,0,-0.2,0.1,0.1
Enter test point for z-score:
0
Sample: 16
Mean: 0.075
Pop. St.Dev 0.1601
Pop Z-score: -1.8741
Sample St.Dev 0.1653
Sample Z-score: -1.8146

In this case, we continued on the "track coach" theme. Using a 16 athlete sample, we saw the mean improvement in times for the group was 0.075 seconds, with a standard deviation in the sample of 0.1653 seconds. This results in a Sample z-score of -1.8146.

Using our t-table, given degrees of freedom of 15, we would see that 0 falls below the 1-tailed t-test of -1.753 (alpha = 0.05), but inside alpha=0.025 (test score of -2.131). Therefore, it would make it logical to say that the probability that this track coach improves the times of his clients is between 95% and 97.5%, and probably around 96%.

In the meantime, any new readers of my blog can feel free to contribute all sorts of ideas in the mix, whether is be a specific analysis piece, how to write certain blocks of code in various data-intensive languages like R and Perl, sports analysis, and many other topics.