Thursday, January 6, 2011

How Much of an Increase Does the Dow Need to Gain Employment?

The Dow being up is big news to many people in America. It signifies a strengthening economy, and a possible end of hard times for millions of people.

However, over the long run, the Dow Jones goes up. On January 2nd, 1929, the DJI closed at 307.01. On January 3rd, 2011, the DJI closed at 11,670.75. Despite a stretch of time that included the Great Depression and the recent recession, the DJI has managed to climb at an annual rate of 4.54%, out-pacing inflation at 3.16%.

So it is clear that, over time, the Dow is a good, though not risk-free, investment for an individual. A person who put the equivalent of $1 in 2011 dollars into the DJI in 1929 would have about $2.97 now.

How does this affect Main Street, though? Namely, how is the annual performance of the Dow reflected in unemployment rates?

While one might immediately assume that growth in the Dow equals growth in employment, that is not necessarily the case. Dow growth is a positive sign, but it needs to be strong enough to keep pace with an ever-growing workforce. How does one determine that level?

Now, understanding the behavior of the Dow is easy: it generally follows a curve of the form a*(1+r)^n, where a = starting value, r = rate of growth, and n = years since the starting point.
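
As a quick sanity check on the numbers above, the annualized rate and the inflation-adjusted value of $1 can be reproduced in a couple of lines of R (the closing values and dates are the ones quoted above; the 3.16% inflation figure is the average used earlier):

start <- 307.01                  # DJI close, January 2, 1929
end <- 11670.75                  # DJI close, January 3, 2011
n <- 82                          # years between the two closes
r <- (end/start)^(1/n) - 1       # annualized growth rate, roughly 4.5%
((1 + r)/1.0316)^n               # real growth of $1, roughly 3, in line with the ~$2.97 above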

So first, to get a feel for the data, I elected to view the growth of the Dow as a function of two variables: years since 1970, and the unemployment rate for that year. The data are found in this table, with fields of year, years since 1970, average Dow Jones index for the year, and average unemployment rate.
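
For anyone who wants to reproduce the fit, here is a minimal sketch of how it might be set up in R. The file name and starting values are my own placeholders, and the variable names (dow, yr70, unemp) are the ones that appear in the formula in the summary below:

dowdata <- read.csv("dow_unemp.csv", header=TRUE)   # hypothetical file name
dow <- dowdata$dow
yr70 <- dowdata$yr70
unemp <- dowdata$unemp
# Exponential fit of the Dow on years since 1970 and the unemployment rate;
# starting values are rough guesses and may need adjustment to converge
nlsdow <- nls(dow ~ c*exp(a*yr70 + b*unemp) + c,
              start=list(a=0.05, b=-1, c=800),
              control=nls.control(maxiter=20000))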

My first equation is as follows:

> summary(nlsdow)

Formula: dow ~ c * exp(a * yr70 + b * unemp) + c

Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.091785 0.004272 21.483 <2e-16 ***
b -16.314535 2.978873 -5.477 3.21e-06 ***
c 951.696519 134.844962 7.058 2.38e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1034 on 37 degrees of freedom

Number of iterations to convergence: 8
Achieved convergence tolerance: 8.044e-06

Or, ignoring the additive constant, dow is approximately 951.7*exp(0.091785*(yr-1970) - 16.315*u_rate). This is logical: the more years out, the more the DJI grows, and high unemployment is a good sign of a bad economy.

Now, we have an idea of the form for a regression using the unemployment rate as a dependent variable (after all, unemployment is a lagging indicator, which makes it a better candidate for a dependent variable). Since the initial equation takes the form Y = a*exp(b*X1 + c*X2), we can turn this into b*X1 + c*X2 = ln(Y/a) = ln(Y) - ln(a) = ln(Y) + A, where A = -ln(a).

Solving for X2 gives us [ln(Y) + A - b*X1]/c, and of course we can change this into u_rate = p*ln(DJI) + q*(year since 1970) + r, where p = 1/c, q = -b/c, and r = -ln(a)/c.

So, simply put, one can regress for the unemployment rate by regressing against the natural log of the Dow Jones Index and the year since the particular starting point (in this case, 1970).

Fortunately, transforming a variable is easy in R. To create a log-transformed copy of a variable, simply enter:

> lnDow<-log(dow)

You are now all set to run the regression. Let's try it:

> lmU<-lm(unemp ~ yr70 + lnDow)
> summary(lmU)

Call:
lm(formula = unemp ~ yr70 + lnDow)

Residuals:
Min 1Q Median 3Q Max
-0.0184135 -0.0058837 0.0002107 0.0038651 0.0209755

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3038799 0.0347165 8.753 1.52e-10 ***
yr70 0.0028846 0.0004947 5.831 1.06e-06 ***
lnDow -0.0376083 0.0055375 -6.792 5.38e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.009397 on 37 degrees of freedom
Multiple R-squared: 0.5931, Adjusted R-squared: 0.5711
F-statistic: 26.96 on 2 and 37 DF, p-value: 5.976e-08

I see a lot of highly significant coefficients. Though the R-squared leaves a bit to be desired, this model still seems to be telling us a good deal.

The coefficients we care about, though, are for yr70 and lnDow. To get a point estimate of the growth rate needed to sustain employment, we need to know how much lnDow has to increase each year for its effect to offset the coefficient of yr70, keeping unemployment stable. Simple algebra says that this means:

0.0028846*(delta yr70) = 0.0376083*(delta lnDow),

so a delta of one year means we solve

0.0376083*(delta lnDow) = 0.0028846, or delta lnDow = 0.0767.

To find the point-estimate percentage increase needed, we simply compute exp(0.0767) - 1 = 7.972%.
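
That arithmetic can be checked straight from the fitted lm object; the coefficient names below are the ones shown in the summary above:

b_year <- coef(lmU)["yr70"]        # 0.0028846
b_lndow <- coef(lmU)["lnDow"]      # -0.0376083
delta_lnDow <- -b_year / b_lndow   # about 0.0767
exp(delta_lnDow) - 1               # about 0.080, i.e. roughly 8% growth per year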

While that is a high bar, the good news is that the DJI achieved this mark in 17 of the 39 years in the sample, and averaged 6.5% growth a year from the 1970 average to the 2009 average, even including the downturn after 2007.

So if you were to ask me what my break-even point for a "Main Street approved" economy is? Watch for the DJI to crack 8% growth during the year.

Monday, January 3, 2011

The Logit Model and Baseball: Projecting Chance of Making the Hall of Fame

It is the favorite time of year for many a sports nerd like myself: the time when the Baseball Writers Association of America will make their picks for the Hall of Fame, and when the blogosphere is best equipped to mock and ridicule the inconsistent logic of many esteemed writers.

It is also a favorite time for anyone who has ever said "I cannot believe they voted for...", or how someone was robbed (see Whitaker, Lou).

Last year's epic battle, in my mind, was the one about Edgar Martinez. Supporters cited batting numbers comparable to legends' while playing in a mediocre hitting park, and a career that shows no signs of PED use. His detractors cited that he was a Designated Hitter, and that his career was short.

When all was said and done, Martinez received a mere 36.2% of the vote, less than half of what one needs to reach the Hall of Fame. So, knowing this, what would Edgar's chance of reaching Cooperstown be?

Would you believe a 69.09% chance?

It seems counter-intuitive that, when one has yet to convince almost two-thirds of the voting base of his greatness six years after his career ended, anything would change so rapidly. However, it happens constantly: only 2 men from 1976-97 received a higher share of the vote on their first ballot and still missed out on the writers' election. One of them, Jim Bunning, eventually gained entry through the Veterans Committee.

As we saw from yesterday's post, the logit model can provide a powerful probability estimator given a dummy dependent variable. In this case, we test whether someone reached the Hall of Fame (y=1) or not (y=0).

To perform this analysis, I looked at all Hall of Fame votes from 1976-1997, and took the percent share of the vote obtained by all players on their first ballot, excluding those who received less than 5% of the vote (indicating essentially zero probability of being elected to the Hall by the writers, and only a small chance of election by the Veterans' Committee). Through this process, I obtained a data sheet of 59 players, as can be seen here.

Right away, one can make some general observations. Thirty-three of the fifty-nine players listed were eventually elected to the Hall, a 55.93% success rate. Additionally, 3 more were elected by the Veterans' Committee, leaving the total success rate of the group at 61.02%. Simply clearing the first obstacle of making it past the first ballot seems to bode well for the eventual success of candidates.

However, this analysis is imperfect. The success rate includes players who were elected on the first ballot, and had no resistance in making the Hall. Once again, though, we can easily run a logit model regression on the data. Yesterday I showed a sample script that can run all the needed commands for you, but today I'll simply enter these in, command by command, and show you another useful feature: creating a function.
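
Since I'll skip straight to the summaries, here is a minimal sketch of how the two fits might be set up for anyone following along, using the same nls approach as yesterday's script. The file name and starting values are my own placeholders; the variable names (hof, hof_bbwaa, initial) are the ones that appear in the summaries below:

hofdata <- read.csv("hof_firstballot.csv", header=TRUE)   # hypothetical file name
hof <- hofdata$hof                # 1 = in the Hall (BBWAA or Veterans), 0 = not
hof_bbwaa <- hofdata$hof_bbwaa    # 1 = elected by the BBWAA, 0 = not
initial <- hofdata$initial        # share of the vote on the first ballot
probabilityHOF <- nls(hof ~ 1/(1+exp(-1*(A+B*initial))),
                      start=list(A=-2, B=5), control=nls.control(maxiter=20000))
probHOFbbwaa <- nls(hof_bbwaa ~ 1/(1+exp(-1*(A+B*initial))),
                    start=list(A=-2, B=5), control=nls.control(maxiter=20000))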

Yesterday I showed you the format for writing a logit model regression, so let's skip right to a summary. This regresses share of the first ballot vote to probability of eventually making the Hall of Fame:

> summary(probabilityHOF)

Formula: hof ~ 1/(1 + exp(-1 * (A + B * initial)))

Parameters:
Estimate Std. Error t value Pr(>|t|)
A -3.0108 0.7995 -3.766 0.000395 ***
B 10.5387 3.0206 3.489 0.000942 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2891 on 57 degrees of freedom

Number of iterations to convergence: 6
Achieved convergence tolerance: 8.11e-06

And this, in turn, summarizes the probability of being elected by the BBWAA:

> summary(probHOFbbwaa)

Formula: hof_bbwaa ~ 1/(1 + exp(-1 * (A + B * initial)))

Parameters:
Estimate Std. Error t value Pr(>|t|)
A -3.5384 0.8171 -4.330 6.1e-05 ***
B 10.4765 2.5601 4.092 0.000136 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2479 on 57 degrees of freedom

Number of iterations to convergence: 19
Achieved convergence tolerance: 5.858e-06

The first thing to stand out is that the equations are very similar. The coefficients are strongly significant in both regressions, and a bit stronger in the second one, which measures whether or not a player would be elected by the sportswriters.

This is fine if we want to hard-code values, but we would rather not: we would rather simply enter the value in question and have R return a probability for us. This can be done by defining a function.

As shown on the DePauw website, a function is written as:
name = function(argument1, argument2, ...) { [expression] }

Since we have the coefficients from our previous equations at our disposal, let's define functions. First, a function to determine the probability of being elected into the Hall of Fame by the sportswriters:

> xChance=function(x) {
+ 1/(1+exp(-1*(-3.5384+10.4765*x))) }

Then, a second one, to determine the probability of reaching the Hall via the BBWAA or the Veterans' Committee:

> xHOF=function(x) {
+ 1/(1+exp(-1*(-3.0108+10.5387*x))) }

So of course, the last question is, how do you use a function? Well, that's the simplest part of all, and will help us obtain our answer. All you need to do is write in this format:
FnName(value)

So, let's try both of these equations where x=.362, or Edgar Martinez's share of votes in 2010:

> xChance(0.362)
[1] 0.5631837
> xHOF(0.362)
[1] 0.6908742

Well, these probabilities are moderately encouraging for my fellow Edgar fans. Even without the Veterans Committee option, Edgar currently stands at better than a coin-flip chance of reaching the Hall, at 56.32%. With the Veterans Committee included, that probability rises to 69.09%.

So where are the break-even (50-50) points for both equations? For just the BBWAA vote, it is at around 33.8% of the initial vote. For overall Hall of Fame chances, it is at around 28.6%.
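
Those break-even points come straight from the coefficients: the probability is exactly 50% when A + B*x = 0, so x = -A/B for each model.

3.5384/10.4765   # about 0.338, the break-even first-ballot share for the BBWAA-only model
3.0108/10.5387   # about 0.286, the break-even share for overall Hall of Fame chances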

After the results come out on the fifth, we can revisit this post and project the chances of ballot newcomers like Jeff Bagwell and Larry Walker. For now, rest assured that Barry Larkin has a 91.9% chance of reaching Cooperstown, and Robbie Alomar is almost a stone-cold lock. Heck, even Fred McGriff has a 32.2% chance of reaching the Hall someday given his first-ballot performance.

Sunday, January 2, 2011

Logistic Regression Analysis in R-programming

This time last year, I had a very vague knowledge of logistic analysis, an analysis that is now a cornerstone of my profession.

Many times as an analyst, I come across the question of whether a person acted upon an advertisement (y=1) or not (y=0), and the various attributes (independent variables) that factored into that outcome.

For a broader audience, and to summarize this concept further, let's take the role of a college professor, who teaches the same course to a total of 210 students over the course of a school year, and wants to determine the point where a student has an 80% probability of getting at least a B on the final exam.

After looking over the 210 students, the professor formed this Excel file, with two very simple variables: an independent variable of hours spent studying, and a dependent variable of whether or not a B grade was achieved.

Could we use a linear regression here? It would show us a positive or negative correlation and...not much else. Since we would have a cluster of 121 points at y=1, and 89 at y=0, this would not really describe much of anything.

Enter logistic regression.

What is logistic regression? Per Wikipedia, it is "used for prediction of the probability of occurrence of an event by fitting data to a logit function (logistic curve)." Essentially, it transforms the data into a model where the dependent variable becomes a probability prediction.

The form of a logit model is:
f(z) = 1/[1 + exp(-z)], where z = B0 + B1X1 + B2X2 + ... + BnXn. In our example, we only have one independent variable X1, so z = A + B*X.
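
Incidentally, R ships with this pair of functions built in: plogis() is exactly 1/[1 + exp(-z)], and qlogis() is its inverse, the log-odds. They are handy for spot-checking the algebra:

plogis(0)     # 0.5: z = 0 corresponds to a 50% probability
qlogis(0.8)   # log(4), about 1.386: the value z = A + B*x must reach for an 80% chance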

Now that we have the gritty covered, how about the actual analysis?

Well, this can be performed with a script in R. The best part of all, a "lengthy" version of this only requires 14 lines!

#!/usr/bin/env Rscript

bgrade<-read.csv("C:/Users/Joseph/Documents/Blogfolder/bgrade.csv", header=TRUE)
Yreal<-bgrade$b_grade
Xreal<-bgrade$studytime
nlsProb<-nls(Yreal ~ 1/(1+exp(-1*(A+B*Xreal))),start=list(A=-0.5,B=0.05),control=nls.control(maxiter=20000))
Coeffs<-coefficients(nlsProb)
xAchieve<-NULL
A = Coeffs[[1]][[1]]
B = Coeffs[[2]][[1]]
for(i in 1:20) {
xAchieve[i] = 1/(1+exp(-1*(A + B*i)))
}
xTime = (log(4)-A)/B

Let's analyze what is happening.
Line 1: Shebang line. Lets the system know you're about to use R.
Line 3: Reads the .csv file into a data frame (and indicates headers in the file).
Lines 4+5: Assign our dependent (Yreal) and independent (Xreal) fields to vectors.
Line 6: Fits the equation. This is the "kicker" here, and it's important to know two things:
  1. You SHOULD have a general idea of how the equation should fit, or else it may not converge. In this example, I assumed a less than 50% chance of achieving a B would occur when x=0, and that as x grew, so would y (hence A = -0.5, B = 0.05).
  2. control=nls.control is an important control here. The default number of iterations is 50, though I often spike it to 20,000 (I've never had a major runtime issue as a result, so I highly suggest using as many iterations as you need).
Line 7: Writes the coefficients to a variable; this will allow us to determine where the 80% breakthrough point is without leaving our script.
Line 8: Declares an empty vector for later use.
Lines 9+10: Write the coefficients to scalar variables for use in the for loop.
Lines 11-13: The for loop. Builds a nifty vector (xAchieve) where one can look up any integer from 1-20 (in the form xAchieve[i]) later, or print out the entire vector to map x hours of studying to a certain percent chance of achievement.
Line 14: Simply the logit model solved for x with Y set at 0.8, or 80% (our breakthrough point): setting 1/(1+exp(-1*(A+B*x))) = 0.8 gives A + B*x = log(4), so x = (log(4)-A)/B.

From here, it's as simple as opening up an R GUI, or starting R from the command prompt (whatever your preference; I'm a Windows user and prefer the GUI), and typing:

source("(path)/bgrade.r")

Now it's all written out! Let's take a look at some of our key outputs.
> summary(nlsProb)

Formula: Yreal ~ 1/(1 + exp(-1 * (A + B * Xreal)))

Parameters:
Estimate Std. Error t value Pr(>|t|)
A -1.30595 0.30481 -4.284 2.80e-05 ***
B 0.16816 0.03072 5.475 1.26e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4433 on 208 degrees of freedom

Number of iterations to convergence: 4
Achieved convergence tolerance: 7.859e-06

> xAchieve
[1] 0.2427264 0.2749548 0.3097123 0.3467622 0.3857673 0.4262967 0.4678409 0.5098359 0.5516924 0.5928293 0.6327044 0.6708427 0.7068557 0.7404530 0.7714447 0.7997367
[17] 0.8253195 0.8482544 0.8686572 0.8866832
> xTime
[1] 16.00978

What do these three commands mean?
The first shows the summary of our logit model. While my assumptions were directionally right, my estimates were off: A = -1.30595, and B = 0.16816.

The second is our vector xAchieve. The first data point shows that a student who studies for only 1 hour can be expected to score a B or higher only 24.27% of the time. Yikes. At the other end, a student who studies for 20 hours can be expected to score a B or higher 88.67% of the time.

The third is the "answer" to our initial question of where the 80% breakthrough point is. As our array shows us reasonably, and xTime shows us more precisely, it is right around 16 hours. Apparently this professor does not believe in grade inflation.
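
As an aside, R also has a built-in way to fit a logistic regression: glm() with family=binomial, which uses maximum likelihood rather than the least-squares fit that nls performs. It is worth running as a cross-check; the estimates will not match the nls output exactly, but the 80% point should land in the same neighborhood as the 16 hours found above:

glmProb <- glm(Yreal ~ Xreal, family=binomial)
summary(glmProb)
(log(4) - coef(glmProb)[1]) / coef(glmProb)[2]   # 80% point, by the same algebra as line 14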

After you are done with the data, a simple command:

rm(list=ls())

Cleans up your workspace for more analysis.

Clearly, the logit model makes robust analysis possible. It was used recently by The College Board to reach a powerful conclusion: that an 1180 on the old SAT was the "breakthrough" point where a student had a 65% chance of achieving just a 2.7 GPA in their freshman year (a telling statistic in an age of grade inflation and more students attending college than ever).

The logit model is also a powerful analysis for marketing and advertising: measuring various aspects of an advertisement, the way it was published, and various other factors to determine how each independent variable affects the one major question of whether or not it causes the individual to act upon the advertisement. While an important analysis for a statistician, knowledge of the logit model is valuable for all sorts of professions, from biology and academia to business and marketing.

Simple Statistics 101: How To Write a Perl Script that performs a Hypothesis Test

The hypothesis test.

It's something many of us have, and will, encounter on a regular basis in many courses, and in professional settings.

This initial blog post is, simply put, an easy way to show the potential power that just a little bit of programming knowledge can have on data analysis and manipulation. After the jump is a sample script I wrote in Perl that accepts an input of values, and a test point, and prints out summary statistics.

#!/usr/bin/perl

use strict;
use warnings;

print "Enter numbers, seperated by commas: \n";
my $list = <>;
print "Enter test point for z-score:\n";
my $hypothesis = <>;

my @nums = split(/,/, $list);
my $total = scalar(@nums);

my $sum = 0;
foreach(@nums) {
chomp;
my $num = $_;
$sum = $sum + $num;
}

my $mean = $sum / $total;
my $varsum = 0;
foreach(@nums) {
chomp;
my $num2 = $_;
$varsum = $varsum + ($num2-$mean)**2;
}

my $popvar = $varsum / $total;
my $sampvar = $varsum / ($total - 1);
my $popsd = $popvar**0.5;
my $poperr = $popsd / ($total**0.5);
my $sampsd = $sampvar**0.5;
my $samperr = $sampsd / ($total**0.5);
my $zpop = ($hypothesis - $mean) / $poperr;
my $zsamp = ($hypothesis - $mean) / $samperr;

my $m = sprintf("%.3f", $mean);
my $p = sprintf("%.4f", $popsd);
my $s = sprintf("%.4f", $sampsd);
my $zp = sprintf("%.4f", $zpop);
my $zs = sprintf("%.4f", $zsamp);

print "Sample:\t$total\nMean:\t$m\nPop. St.Dev\t$p\nPop Z-score:\t$zp\nSample St.Dev\t$s\nSample Z-score:\t$zs\n";

Not always the cleanest, but it is effective. Let's analyze what is occurring in every block of code:
1) Accepts the list of data points and the hypothesis (for example, to test a track coach's claim that he improves his athletes' 100 meter times, the null can be set at 0.0 seconds).
2) Splits the numbers, counts the total and sums.
3) Determines mean, and then determines the sum of squares.
4) Determines two separate sets of statistics: one treating the list as a population, one treating it as a sample (this could be made an input option, but the script simply outputs both).
5) Prints the results, including both test statistics.

Of course, this is hardly credible without a trial run, so let's test this out.

$ perl hyptestcalc.pl
Enter numbers, separated by commas:
0.2,0.2,0,-0.1,0,0.2,0.1,0,-0.2,0.1,0.3,0.4,0,-0.2,0.1,0.1
Enter test point for z-score:
0
Sample: 16
Mean: 0.075
Pop. St.Dev 0.1601
Pop Z-score: -1.8741
Sample St.Dev 0.1653
Sample Z-score: -1.8146

In this case, we continued on the "track coach" theme. Using a 16 athlete sample, we saw the mean improvement in times for the group was 0.075 seconds, with a standard deviation in the sample of 0.1653 seconds. This results in a Sample z-score of -1.8146.

Using our t-table with 15 degrees of freedom, the test statistic of -1.8146 falls beyond the one-tailed critical value of -1.753 (alpha = 0.05), but not beyond -2.131 (alpha = 0.025). Therefore, it is logical to say that the confidence with which we can claim this track coach improves the times of his athletes is between 95% and 97.5%, and probably around 96%.
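
For readers who would rather check this in R than against a printed t-table, the same numbers drop out of t.test(); note the sign flip, since the Perl script computes (hypothesis - mean)/SE while t.test reports (mean - hypothesis)/SE:

times <- c(0.2, 0.2, 0, -0.1, 0, 0.2, 0.1, 0, -0.2, 0.1, 0.3, 0.4, 0, -0.2, 0.1, 0.1)
mean(times)   # 0.075
sd(times)     # about 0.1653, the sample standard deviation
t.test(times, mu=0, alternative="greater")   # one-sided p-value falls between 0.025 and 0.05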

In the meantime, any new readers of my blog should feel free to throw all sorts of ideas into the mix, whether it be a specific analysis piece, how to write certain blocks of code in data-intensive languages like R and Perl, sports analysis, or many other topics.