Sunday, May 15, 2011

How to play Nim on the command line

A bit of a deviation from pure statistics tonight: while attempting to learn the Python programming language for work, I wrote something interesting. The program is simple: a normal-play subtraction game (Nim).

The object of the game is simple: given a pile of N elements (stones, we will call them), each player in turn removes between 1 and k elements (where k is some number greater than or equal to 2 and less than N). The winner is whichever player takes the last element from the pile.

The winning strategy of this game is simple: if the number of elements in the pile, modulo the maximum move plus one, is zero, then the player has a winning strategy whenever the opponent is about to move, and a losing one when it is his own turn (example: with a maximum move of 2 and a pile of 9, the player is in a winning position if the opponent is moving, since 9 mod 3 = 0). Therefore, if at the start of the game N mod (k+1) = 0, a player would want to go second in order to have a winning strategy, and first otherwise.
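In code, checking for a winning position is a one-line modulo test (a quick illustration, separate from the full program below):

def player_to_move_loses(pile, max_remove):
    #True when the pile is a multiple of (max_remove + 1): whatever the player about to
    #move takes, the opponent can restore the multiple and eventually take the last stone.
    return pile % (max_remove + 1) == 0

print(player_to_move_loses(9, 2))   #True: 9 % 3 == 0, so you want your opponent moving here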

How can one write a program in Python that can simulate this? Well, it's easy, and took me just 84 lines of code. First, the full program:

#!C:/Python32

#Joe Regan
#nims.py
#5/15/11

import math
import random

def play_nims(pile,maxRemove):
    print("Starting with ",pile," stones, select up to ",maxRemove," stones")
    while(pile>0):
        p1move=int(input("Select stones to remove "))
        while(p1move<1 or p1move>maxRemove or p1move>pile): #re-ask until the move is legal
            p1move=int(input("Select stones to remove "))
        pile=pile-p1move
        print("Player 1 took ",p1move," stones, and ",pile," stones remain")
        if(pile==0):
            print("Player 1 wins!")
        else:
            p2move=int(input("Select stones to remove "))
            while(p2move<1 or p2move>maxRemove or p2move>pile): #re-ask until the move is legal
                p2move=int(input("Select stones to remove "))
            pile=pile-p2move
            print("Player 2 took ",p2move," stones, and ",pile," stones remain")
            if(pile==0):
                print("Player 2 wins!")


def play_nims_comp(pile,maxRemove):
    print("Starting with ",pile," stones, select up to ",maxRemove," stones")
    turn=int(input("Enter 1 to go first, any other number to go second: ")) #allow user to decide in what order he wants to go
    while(pile>0):
        if(turn==1):
            move=int(input("Select stones to remove "))
            while(move<1 or move>maxRemove or move>pile): #re-ask until the move is legal
                move=int(input("Select stones to remove "))
            pile=pile-move
            print("Player took ",move," stones, and ",pile," stones remain")
            if(pile==0):
                print("Player wins!")
            else:
                if(pile%(maxRemove+1)==0):
                    CPUmove=random.randint(1,maxRemove) #losing position: take anything and wait for a mistake
                else:
                    CPUmove=pile%(maxRemove+1) #winning position: leave a multiple of (maxRemove+1)
                pile=pile-CPUmove
                print("CPU took ",CPUmove," stones, and ",pile," stones remain")
                if(pile==0):
                    print("Player loses!")
        else:
            if(pile%(maxRemove+1)==0):
                CPUmove=random.randint(1,maxRemove) #losing position: take anything and wait for a mistake
            else:
                CPUmove=pile%(maxRemove+1) #winning position: leave a multiple of (maxRemove+1)
            pile=pile-CPUmove
            print("CPU took ",CPUmove," stones, and ",pile," stones remain")
            if(pile==0):
                print("Player loses!")
            else:
                move=int(input("Select stones to remove "))
                while(move<1 or move>maxRemove or move>pile): #re-ask until the move is legal
                    move=int(input("Select stones to remove "))
                pile=pile-move
                print("Player took ",move," stones, and ",pile," stones remain")
                if(pile==0):
                    print("Player wins!")


#Allow the user to select which game they would like to play
CPUorHuman=int(input("Let's play nim! First, select if you're playing a human (1) or the computer (2): "))
while(CPUorHuman!=1 and CPUorHuman!=2):
    CPUorHuman=int(input("Let's play nim! First, select if you're playing a human (1) or the computer (2): "))
params=input("Enter the starting pile and max move, separated by commas: ").split(",")
P=int(params[0])
M=int(params[1])
while(P<=M):
    params=input("Enter the starting pile and max move, separated by commas: ").split(",")
    P=int(params[0])
    M=int(params[1])
if(CPUorHuman==1):
    play_nims(P,M)
else:
    play_nims_comp(P,M)


Now let's break it down into its functional pieces.
#!C:/Python32

#Joe Regan
#nims.py
#5/15/11

import math
import random


Here's where I write the shebang, telling the system that I want to use Python. After some documentation, I import the math package (redundant here, since the modular arithmetic is handled by Python's built-in % operator) and the random package (which will prove highly useful for the CPU logic).

def play_nims(pile,maxRemove):
    print("Starting with ",pile," stones, select up to ",maxRemove," stones")
    while(pile>0):
        p1move=int(input("Select stones to remove "))
        while(p1move<1 or p1move>maxRemove or p1move>pile): #re-ask until the move is legal
            p1move=int(input("Select stones to remove "))
        pile=pile-p1move
        print("Player 1 took ",p1move," stones, and ",pile," stones remain")
        if(pile==0):
            print("Player 1 wins!")
        else:
            p2move=int(input("Select stones to remove "))
            while(p2move<1 or p2move>maxRemove or p2move>pile): #re-ask until the move is legal
                p2move=int(input("Select stones to remove "))
            pile=pile-p2move
            print("Player 2 took ",p2move," stones, and ",pile," stones remain")
            if(pile==0):
                print("Player 2 wins!")


This is the function used in a human v. human game. Since the computer isn't playing here, turn order doesn't matter, and the logic is simple: we pass the function the size of the starting pile and the maximum move. While there are still elements in the pile, the players take turns.

First, player 1 makes his move (and is asked again until he gives a valid response), and his move is subtracted from the pile. If he takes the last stone, player 1 wins; if not, the same process repeats for player 2. If player 2 does not win either, we go back to the top of the while loop.

def play_nims_comp(pile,maxRemove):
    print("Starting with ",pile," stones, select up to ",maxRemove," stones")
    turn=int(input("Enter 1 to go first, any other number to go second: ")) #allow user to decide in what order he wants to go
    while(pile>0):
        if(turn==1):
            move=int(input("Select stones to remove "))
            while(move<1 or move>maxRemove or move>pile): #re-ask until the move is legal
                move=int(input("Select stones to remove "))
            pile=pile-move
            print("Player took ",move," stones, and ",pile," stones remain")
            if(pile==0):
                print("Player wins!")
            else:
                if(pile%(maxRemove+1)==0):
                    CPUmove=random.randint(1,maxRemove) #losing position: take anything and wait for a mistake
                else:
                    CPUmove=pile%(maxRemove+1) #winning position: leave a multiple of (maxRemove+1)
                pile=pile-CPUmove
                print("CPU took ",CPUmove," stones, and ",pile," stones remain")
                if(pile==0):
                    print("Player loses!")
        else:
            if(pile%(maxRemove+1)==0):
                CPUmove=random.randint(1,maxRemove) #losing position: take anything and wait for a mistake
            else:
                CPUmove=pile%(maxRemove+1) #winning position: leave a multiple of (maxRemove+1)
            pile=pile-CPUmove
            print("CPU took ",CPUmove," stones, and ",pile," stones remain")
            if(pile==0):
                print("Player loses!")
            else:
                move=int(input("Select stones to remove "))
                while(move<1 or move>maxRemove or move>pile): #re-ask until the move is legal
                    move=int(input("Select stones to remove "))
                pile=pile-move
                print("Player took ",move," stones, and ",pile," stones remain")
                if(pile==0):
                    print("Player wins!")


This is far more complex. First, after passing in the game parameters again, we ask the player (who is now opposing the computer, so order does matter) whether he wants to go first or second. If the player goes first, the same logic as before is followed for the human. The second player, a.k.a. the computer, is designed to play an optimal game: if the player leaves a pile of M stones where M mod (k+1) > 0, the computer simply takes M mod (k+1), putting itself in a winning position (and it will eventually win). If M mod (k+1) = 0, the computer selects a random amount between 1 and k and remains in a losing position unless the human player makes an error. If the player decides to go second, the same steps run in the reverse order. The CPU's decision logic is isolated in the sketch below.
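The CPU's decision appears twice in the function, once for each turn order; it could just as easily be pulled out into a small helper like this sketch (cpu_move is a name I'm introducing here, not part of the program above):

import random

def cpu_move(pile, max_remove):
    #If the pile is a losing position (a multiple of max_remove + 1), take a random legal
    #amount and hope for a mistake; otherwise take the remainder, leaving the opponent a
    #multiple of (max_remove + 1).
    remainder = pile % (max_remove + 1)
    if remainder == 0:
        return random.randint(1, min(max_remove, pile))
    return remainder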

CPUorHuman=int(input("Let's play nim! First, select if you're playing a human (1) or the computer (2): "))
while(CPUorHuman!=1 and CPUorHuman!=2):
    CPUorHuman=int(input("Let's play nim! First, select if you're playing a human (1) or the computer (2): "))
params=input("Enter the starting pile and max move, separated by commas: ").split(",")
P=int(params[0])
M=int(params[1])
while(P<=M):
    params=input("Enter the starting pile and max move, separated by commas: ").split(",")
    P=int(params[0])
    M=int(params[1])
if(CPUorHuman==1):
    play_nims(P,M)
else:
    play_nims_comp(P,M)


Here's where the user interacts with the command line to set the game options. First, the user decides whether they are playing a human (1) or the computer (2), and is asked until they answer with a 1 or a 2. From there, the user is asked to provide a starting pile and a maximum move, separated by a comma, and is asked again if the pile is less than or equal to the maximum move. Finally, we reach the if statement, where the program decides whether to call the human v. human or the human v. computer game.

So how does this look? Let's play the computer, start with 30 stones, and allow a maximum move of 4.

Let's play nim! First, select if you're playing a human (1) or the computer (2): 2
Enter the starting pile and max move, separated by commas: 30,4
Starting with 30 stones, select up to 4 stones
Enter 1 to go first, any other number to go second: 2
CPU took 4 stones, and 26 stones remain
Select stones to remove 1
Player took 1 stones, and 25 stones remain
CPU took 1 stones, and 24 stones remain
Select stones to remove 4
Player took 4 stones, and 20 stones remain
CPU took 4 stones, and 16 stones remain
Select stones to remove 1
Player took 1 stones, and 15 stones remain
CPU took 4 stones, and 11 stones remain
Select stones to remove 1
Player took 1 stones, and 10 stones remain
CPU took 2 stones, and 8 stones remain
Select stones to remove 3
Player took 3 stones, and 5 stones remain
CPU took 3 stones, and 2 stones remain
Select stones to remove 2
Player took 2 stones, and 0 stones remain
Player wins!


As you can see, I elected to go second (since 30 mod (4+1) = 0, I had a winning strategy as long as the computer moved first), and won the game by following this basic rule.

So, if you have Python installed on your own system, try it out and see if you can keep your friends stumped.

Monday, February 21, 2011

Does Spending on Education Improve SAT scores?

In case you have been living under a rock recently (as I seemingly have, given my lack of posting), a big teachers' union battle is playing out daily in Wisconsin. While the Republican-controlled government wants to increase teachers' contributions to their benefits packages, the unions and many teachers are fighting hard against this.

While we in the private sector have had to contribute to our retirement and benefits packages for years, this seems novel to many of these teachers. In their defense, though, many independent sources argue that spending more money on education would improve American students' performance, which is very clearly an issue. However, does the data support this?

My interest was piqued upon coming across a post on this blog, which highlights state spending and state SAT scores. While the source does show SAT scores and spending, he clearly underestimates the effect of participation rate (which can lead to sample bias) on the individual states' scores. Even the biggest defender of educational spending cuts would not argue Kentucky has a better system than Massachusetts or Vermont.

So I set myself the task of analyzing each individual section score, and the aggregate total of all three, as dependent on each state's spending and participation rate. In order to do this, I created this spreadsheet.

First I looked at the subject dear to my heart (and one we should be focusing on more in this country): mathematics. I started with a simple linear model:

> lmMath<-lm(math ~ partrate + spend)
> summary(lmMath)

Call:
lm(formula = math ~ partrate + spend)

Residuals:
Min 1Q Median 3Q Max
-64.453 -15.305 -2.256 13.458 39.785

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.565e+02 1.583e+01 35.15 <2e-16 ***


At first glance, math scores seem (minimally) correlated with educational spending, but far more correlated with participation rate. To find potentially better variables, we can fit models with quadratic versions of participation rate and spend, and then with natural logarithmic transformations.
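The transformed predictors can be created along these lines (a sketch; I'm assuming partrate and spend already exist as numeric vectors in the workspace, matching the variable names in the output):

partrate_sq <- partrate^2     #quadratic terms
spend_sq <- spend^2
partrate_ln <- log(partrate)  #natural-log transforms
spend_ln <- log(spend)

With those in place, here is how the candidate models shake out: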

> summary(lm(math ~ partrate + partrate_sq + spend + spend_sq))

Call:
lm(formula = math ~ partrate + partrate_sq + spend + spend_sq)

Residuals:
Min 1Q Median 3Q Max
-57.029 -8.593 -1.308 12.901 30.822

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.238e+02 6.494e+01 8.065 2.37e-10 ***
partrate -2.343e+02 4.543e+01 -5.158 5.16e-06 ***
partrate_sq 1.454e+02 5.662e+01 2.569 0.0135 *
spend 1.286e-02 1.235e-02 1.041 0.3031
spend_sq -5.762e-07 5.865e-07 -0.982 0.3310
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.83 on 46 degrees of freedom
Multiple R-squared: 0.7871, Adjusted R-squared: 0.7686
F-statistic: 42.53 on 4 and 46 DF, p-value: 6.712e-15

> summary(lm(math ~ partrate + partrate_ln + spend + spend_ln))

Call:
lm(formula = math ~ partrate + partrate_ln + spend + spend_ln)

Residuals:
Min 1Q Median 3Q Max
-49.374 -8.838 -0.935 13.856 34.057

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -463.83562 909.66724 -0.510 0.61256
partrate -5.60859 32.95449 -0.170 0.86561
partrate_ln -29.31128 8.09637 -3.620 0.00073 ***
spend -0.01099 0.01116 -0.985 0.32971
spend_ln 116.34990 111.24720 1.046 0.30109
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.67 on 46 degrees of freedom
Multiple R-squared: 0.8114, Adjusted R-squared: 0.795
F-statistic: 49.47 on 4 and 46 DF, p-value: 4.278e-16

> summary(lm(math ~ partrate_sq + spend_sq))

Call:
lm(formula = math ~ partrate_sq + spend_sq)

Residuals:
Min 1Q Median 3Q Max
-68.990 -18.437 -2.103 19.048 48.723

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.536e+02 9.767e+00 56.687 <2e-16 ***

> summary(lm(math ~ partrate_ln + spend_ln))

Call:
lm(formula = math ~ partrate_ln + spend_ln)

Residuals:
Min 1Q Median 3Q Max
-48.2752 -9.7688 -0.1039 12.6402 35.6967

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 428.039 127.745 3.351 0.00158 **
partrate_ln -31.017 2.218 -13.984 <2e-16 ***


It seems like all but spend_ln had some correlation in one of the models, so let's throw all 6 variables into one and see what happens:

> summary(lm(math ~ partrate + partrate_ln + partrate_sq + spend + spend_ln + spend_sq))

Call:
lm(formula = math ~ partrate + partrate_ln + partrate_sq + spend +
spend_ln + spend_sq)

Residuals:
Min 1Q Median 3Q Max
-47.376 -9.845 -1.071 11.698 33.868

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.978e+03 4.881e+03 -1.020 0.3133
partrate 2.468e+01 1.288e+02 0.192 0.8489
partrate_ln -3.316e+01 1.471e+01 -2.254 0.0292 *
partrate_sq -1.863e+01 9.837e+01 -0.189 0.8507
spend -1.312e-01 1.287e-01 -1.020 0.3134
spend_ln 7.033e+02 6.358e+02 1.106 0.2746
spend_sq 2.974e-06 3.164e-06 0.940 0.3524
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.87 on 44 degrees of freedom
Multiple R-squared: 0.8158, Adjusted R-squared: 0.7907
F-statistic: 32.48 on 6 and 44 DF, p-value: 1.287e-14


Yikes, I muddied the waters. Let's look at just the log of participation rate:
> summary(lm(math ~ partrate_ln))

Call:
lm(formula = math ~ partrate_ln)

Residuals:
Min 1Q Median 3Q Max
-47.6408 -11.2827 -0.2747 13.3237 35.7253

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 491.382 4.198 117.06 <2e-16 ***


This simple model fits the best of any we have seen. Is there any way to incorporate spend into this model?

No matter what, I couldn't find a way to wedge spend into SAT Math scores. The amount of spending on education seemed to have minimal effect, if any, on performance, and one can estimate a state's average SAT Math score rather well with the formula:

SAT_math = 491.382 - 30.795*ln(participation_rate).
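As a quick illustration of how to use this formula (with participation rate expressed as a fraction, not a percent), a state where 10% of students take the SAT would be predicted to average roughly 562 in Math:

sat_math <- function(p) { 491.382 - 30.795*log(p) }  #p = SAT participation rate as a fraction
sat_math(0.10)   #about 562: low-participation states look strong largely through self-selection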

But does this hold true for Reading and Writing?

The good news for the more-spending crowd is that spend was a positive, significant variable when paired with the untransformed participation rate, as well as when the log of spend was regressed on score together with the square of participation rate.

The bad news is that the log transformation of participation rate indicates a much stronger relationship than either of those models, with an adjusted R-squared of 87.02%.

SAT_reading = 485.670 - 31.665*ln(participation_rate).

Writing shares the same characteristic, with the possibility that spending improves test scores, though the sample bias in the states' scores makes this tough to decipher.

SAT_writing = 475.065 - 30.059*ln(participation_rate).

For total SAT scores this still holds true: the log of participation rate trumps all other variables in terms of statistical significance, and the model takes the form:

SAT_total = 1452.116 - 92.519*ln(participation_rate).

For full disclosure, though, leaving participation_rate untransformed does provide some evidence that spending improves SAT scores:

> summary(lm(total ~ partrate + spend_ln))

Call:
lm(formula = total ~ partrate + spend_ln)

Residuals:
Min 1Q Median 3Q Max
-167.0151 -37.5985 0.6148 32.2768 111.0931

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 825.01 393.85 2.095 0.0415 *
partrate -367.49 27.34 -13.442 <2e-16 ***
spend_ln 98.63 43.20 2.283 0.0269 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 55.8 on 48 degrees of freedom
Multiple R-squared: 0.7943, Adjusted R-squared: 0.7858
F-statistic: 92.7 on 2 and 48 DF, p-value: < 2.2e-16


This suggests a relationship of SAT_total = 825.01 - 367.49*participation_rate + 98.63*ln(spend). Given this, we can use SAT participation rate and per-pupil spending to derive a ranking of the best-performing states, as shown below.
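The ranking itself comes from the spreadsheet, but it could be reproduced along these lines (a sketch; the data frame sat and its columns state, total, partrate, and spend are my guesses at the spreadsheet's layout). One natural reading is to rank states by how far their actual totals exceed what participation and spending alone would predict:

pred <- 825.01 - 367.49*sat$partrate + 98.63*log(sat$spend)  #model's expected total score
sat$state[order(sat$total - pred, decreasing=TRUE)]          #states ranked by over-performance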



So, it seems like both sides can lay claims to evidence that back their opinions. Readers, what do you think?

Thursday, January 6, 2011

How Much Increase does the Dow need to gain employment?

The Dow being up is big news to many people in America. It signifies a strengthening economy, and a possible end of hard times for millions of people.

However, the Dow Jones tends to go up over time. On January 2nd, 1929, the DJI closed at 307.01. On January 3rd, 2011, the DJI closed at 11,670.75. Despite a stretch that included the Great Depression and the recent recession, the DJI has managed to climb at an annual rate of 4.54%, out-pacing inflation at 3.16%.

So it is clear that, over time, the Dow is a good, though not risk-free, investment for an individual. A person who put the equivalent of $1 in 2011 dollars into the DJI in 1929 would have $2.97 now.
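These figures are easy to verify (a quick sketch in R, using the two closing values quoted above and the 82 years between them):

(11670.75/307.01)^(1/82) - 1   #annualized nominal growth of the DJI, about 4.54%
(1.0454/1.0316)^82             #real growth of $1 invested in 1929, about 2.97 in 2011 dollars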

How does this affect Main Street, however? Namely, how is the annual performance of the Dow reflected in unemployment rates?

While one might immediately assume that growth in the Dow equals growth in employment, that is not necessarily the case. Growth in the Dow is a positive sign, but it has to be large enough to keep pace with an ever-growing workforce. How does one determine that level?

Now, understanding the behavior of the Dow is easy: it generally follows a curve of the form a*(1+r)^n, where a = starting value, r = annual rate of growth, and n = years since the starting point.

So first, to get a feel for the data, I elected to model the Dow as a function of two variables: years since 1970 and the unemployment rate for the year, using this table, which has fields for year, years since 1970, average Dow Jones index for the year, and average unemployment rate.

My first equation is as follows:

> summary(nlsdow)

Formula: dow ~ c * exp(a * yr70 + b * unemp) + c

Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.091785 0.004272 21.483 <2e-16 ***
b -16.314535 2.978873 -5.477 3.21e-06 ***
c 951.696519 134.844962 7.058 2.38e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1034 on 37 degrees of freedom

Number of iterations to convergence: 8
Achieved convergence tolerance: 8.044e-06

Or, dow = 951.7*exp(0.091785*(yr-1970) - 16.315*u_rate) + 951.7. This is logical: the further out in years, the more the DJI grows, and high unemployment is a sign of a bad economy.

Now we have an idea of the form for a regression that uses the unemployment rate as the dependent variable (after all, unemployment is a lagging indicator, which makes it the better candidate for a dependent variable). Dropping the additive constant for simplicity, the fitted equation takes the form Y = a*exp(b*X1 + c*X2), which we can turn into b*X1 + c*X2 = ln(Y/a) = ln(Y) - ln(a) = ln(Y) + A, where A = -ln(a).

Solving for X2 gives us [ln(Y) + A - b*X1]/c, and of course we can change this into u_rate = p*ln(DJI) + q*(year since 1970) + r, where p = 1/c, q = -b/c, and r = -ln(a)/c.

So, simply put, one can regress for the unemployment rate by regressing against the natural log of the Dow Jones Index and the year since the particular starting point (in this case, 1970).

Fortunately, transforming a variable array is easy in R. To write a variable in log space, simply enter:

> lnDow<-log(dow)

You are now all set to run the regression. Let's try it:

> lmU<-lm(unemp ~ yr70 + lnDow)
> summary(lmU)

Call:
lm(formula = unemp ~ yr70 + lnDow)

Residuals:
Min 1Q Median 3Q Max
-0.0184135 -0.0058837 0.0002107 0.0038651 0.0209755

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3038799 0.0347165 8.753 1.52e-10 ***
yr70 0.0028846 0.0004947 5.831 1.06e-06 ***
lnDow -0.0376083 0.0055375 -6.792 5.38e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.009397 on 37 degrees of freedom
Multiple R-squared: 0.5931, Adjusted R-squared: 0.5711
F-statistic: 26.96 on 2 and 37 DF, p-value: 5.976e-08

I see a lot of highly significant coefficients. Though the R-squared leaves a bit to be desired, there seems to be a good deal being said through this function.

The coefficients we care about, though, are for yr70 and lnDow. To get a point estimate of the growth rate needed to sustain employment, we need to know how much lnDow has to increase to equal the coefficient of yr70, and effectively balance it out to keep unemployment stable. Simple algebra says that this means:

.0028846(yr70) = .0376083(lnDow),

so a delta of one year means we solve

.0376083(delta_lnDow) = .0028846, or delta_lnDow=.0767.

To find the point-estimate percentage increase needed, we simply solve exp(.0767) - 1 = 7.972%.
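The same point estimate can be pulled straight from the fitted model object (a sketch, using the lmU fit above):

b <- coef(lmU)                        #named coefficients: (Intercept), yr70, lnDow
delta_lnDow <- -b["yr70"]/b["lnDow"]  #increase in ln(Dow) needed to offset one more year, ~0.0767
exp(delta_lnDow) - 1                  #~0.0797, i.e. roughly 8% annual growth in the Dow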

While high, the good news is that the DJI achieved this mark in 17 of the 39 years in the sample, and averaged 6.5% growth a year from the average of 1970 to the average of 2009, including the downturn after 2007.

So if you were to ask me what my break-even point is for a "main street approved" economy? Watch for the DJI to crack 8% growth during the year.

Monday, January 3, 2011

The Logit Model and Baseball: Projecting Chance of Making the Hall of Fame

It is the favorite time of year for many a sports nerd like myself: the time when the Baseball Writers Association of America will make their picks for the Hall of Fame, and when the blogosphere is best equipped to mock and ridicule the inconsistent logic of many esteemed writers.

It is also a favorite time for anyone who has ever said "I cannot believe they voted for...", or how someone was robbed (see Whitaker, Lou).

Last year's epic battle, in my mind, was the one about Edgar Martinez. Supporters cited batting numbers comparable to legends' while playing in a mediocre hitting park, and a career that shows no signs of PED use. His detractors countered that he was a Designated Hitter, and that his career was short.

When all was said and done, Martinez received a mere 36.2% of the vote, less than half of what one needs to reach the Hall of Fame. So, knowing this, what would Edgar's chances of reaching Cooperstown be?

Would you believe a 69.09% chance?

It seems counter-intuitive that, when someone has yet to convince almost two-thirds of the voting base of his greatness 6 years after his career ended, anything would change so rapidly. However, it happens constantly: only 2 men from 1976-97 received a higher share of the vote on their first ballot and still missed out on the writers' election. One of them, Jim Bunning, eventually gained access through the Veterans Committee.

As we saw from yesterday's post, the logit model can provide a powerful probability estimator given a dummy dependent variable. In this case, we test whether someone reached the Hall of Fame (y=1) or not (y=0).

To perform this analysis, I looked at all Hall of Fame votes from 1976-1997, and took the percent share of the vote obtained by all players on their first ballot, excluding those who received less than 5% of the vote (indicating a probability of being elected to the Hall of 0, and a small chance of being elected by the veterans' committee). Through this process, I obtained a data sheet of 59 players, as can be seen here.

Right away, one can make some general observations. Thirty-three of the fifty-nine players listed were eventually elected to the Hall, a 55.93% success rate. Additionally, 3 more were elected by the Veterans' Committee, leaving the total success rate of the group at 61.02%. Simply clearing the first obstacle of making it past the first ballot seems to bode well for the eventual success of candidates.

However, this analysis is imperfect. The success rate includes players who were elected on the first ballot, and had no resistance in making the Hall. Once again, though, we can easily run a logit model regression on the data. Yesterday I showed a sample script that can run all the needed commands for you, but today I'll simply enter these in, command by command, and show you another useful feature: creating a function.

Yesterday I showed you the format for writing a logit model regression, so let's skip right to a summary. This regresses share of the first ballot vote to probability of eventually making the Hall of Fame:

> summary(probabilityHOF)

Formula: hof ~ 1/(1 + exp(-1 * (A + B * initial)))

Parameters:
Estimate Std. Error t value Pr(>|t|)
A -3.0108 0.7995 -3.766 0.000395 ***
B 10.5387 3.0206 3.489 0.000942 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2891 on 57 degrees of freedom

Number of iterations to convergence: 6
Achieved convergence tolerance: 8.11e-06

And this, in turn, summarizes the probability of being elected by the BBWAA:

> summary(probHOFbbwaa)

Formula: hof_bbwaa ~ 1/(1 + exp(-1 * (A + B * initial)))

Parameters:
Estimate Std. Error t value Pr(>|t|)
A -3.5384 0.8171 -4.330 6.1e-05 ***
B 10.4765 2.5601 4.092 0.000136 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2479 on 57 degrees of freedom

Number of iterations to convergence: 19
Achieved convergence tolerance: 5.858e-06

The first thing that stands out is that the two equations are very similar. The coefficients are strong in both regressions, though a bit stronger in the second one, which measures whether or not a player would be elected by the sports writers.

This is fine if we want to hard-code the math, but we would rather not: we would rather simply enter a value and have R return a probability for us. This can be done easily using a function.

As shown at the Depauw website, writing a function is performed by:
name=function(argument 1, argument 2, etc...) { [expression] }

Since we have the coefficients available at our disposal from our previous equations, let's define functions. First, a function to determine the probability of being elected into the Hall of Fame by the sports writers:

> xChance=function(x) {
+ 1/(1+exp(-1*(-3.5384+10.4765*x))) }

Then, a second one, to determine the probability of reaching the Hall via the BBWAA or the Veterans' Committee:

> xHOF=function(x) {
+ 1/(1+exp(-1*(-3.0108+10.5387*x))) }

So of course, the last question is, how do you use a function? Well, that's the simplest part of all, and will help us obtain our answer. All you need to do is write in this format:
FnName(value)

So, let's try both of these equations where x=.362, or Edgar Martinez's share of votes in 2010:

> xChance(0.362)
[1] 0.5631837
> xHOF(0.362)
[1] 0.6908742

Well, these probabilities are moderately encouraging for my fellow Edgar fans. Even without the Veterans Committee option, Edgar currently stands at better than a coin-flip chance of reaching the Hall, at 56.32%. With the Veterans Committee, this probability jumps to 69.09%.

So where are the break-even (50-50) points for both equations? For just the BBWAA vote, it is at around 33.8% of the initial vote. For overall Hall of Fame chances, it is at around 28.6%.
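Those break-even shares fall straight out of the coefficients: the logit curve crosses 0.5 exactly where A + B*x = 0, i.e. at x = -A/B. A quick check:

3.5384/10.4765   #~0.338, the first-ballot share needed for a 50-50 shot at a BBWAA election
3.0108/10.5387   #~0.286, the share needed for a 50-50 shot at the Hall overall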

After the results come out on the fifth, we can revisit this post and project the chances of ballot newcomers like Jeff Bagwell and Larry Walker. For now, rest assured that Barry Larkin has a 91.9% chance of reaching Cooperstown, and Robbie Alomar is almost a stone cold lock. Heck, even Fred McGriff has a 32.2% chance of reaching the Hall someday given his first ballot performance.

Sunday, January 2, 2011

Logistic Regression Analysis in R-programming

This time last year, I had a very vague knowledge of Logistic Analysis, an analysis that is now a cornerstone to my profession.

Many times as an analyst, I come across the question of whether a person acted upon an advertisement (y=1) or not (y=0), and the various attributes (independent variables) that factor into that outcome.

For a broader audience, and to summarize this concept further, let's take the role of a college professor who teaches the same course to a total of 210 students over the course of a school year and wants to determine the point at which a student has an 80% probability of getting at least a B on the final exam.

After looking over the 210 students, the professor formed this Excel file, with two very simple variables: an independent variable of hours spent studying, and a dependent variable of whether or not a B grade was achieved.

Could we use a linear regression here? It would show us a positive or negative correlation and...not much else. Since we would have a cluster of 121 points at y=1, and 89 at y=0, this would not really describe much of anything.

Enter logistic regression.

What is logistic regression? Per wikipedia, it is "used for prediction of the probability of occurrence of an event by fitting data to a logit function logistic curve." Essentially, it is transforming the data into a model where the dependent variable becomes a probability prediction.

The form of a logit model is:
f(z) = 1/[1+exp(-z)], where z = B0 + B1X1 + B2X2 + ... + BnXn. In the previous example, we only have one independent variable X1, so z = A + BX.
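In R, that curve is a one-liner (the same form the nls() call below will fit):

f <- function(z) { 1/(1+exp(-z)) }   #maps any real-valued z to a probability between 0 and 1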

Now that we have the gritty covered, how about the actual analysis?

Well, this can be performed with a script in R. The best part of all, a "lengthy" version of this only requires 14 lines!

#!/usr/bin/env Rscript

bgrade<-read.csv("C:/Users/Joseph/Documents/Blogfolder/bgrade.csv", header=TRUE)
Yreal<-bgrade$b_grade
Xreal<-bgrade$studytime
nlsProb<-nls(Yreal ~ 1/(1+exp(-1*(A+B*Xreal))),start=list(A=-0.5,B=0.05),control=nls.control(maxiter=20000))
Coeffs<-coefficients(nlsProb)
xAchieve<-NULL
A = Coeffs[[1]][[1]]
B = Coeffs[[2]][[1]]
for(i in 1:20) {
  xAchieve[i] = 1/(1+exp(-1*(A + B*i)))
}
xTime = (log(4)-A)/B

Let's analyze what is happening.
Line 1: Shebang line. Lets the system know you're about to use R.
Line 3: Reads the .csv file into a data frame (and indicates headers in the file).
Lines 4+5: Assigns our independent and dependent fields to variable arrays.
Line 6: Determines equation. This is the "kicker" here, and it's important to know two things:
  1. You SHOULD have a general idea of how the equation should fit, or else it may not converge. In this example, I assumed a less than 50% chance of achieving a B when x=0, and that as x grew, so would y (hence A = -0.5, B = 0.05)
  2. control=nls.control is an important control here. The default amount of iterations is 50, though I often spike it to 20,000 (I've never had a major runtime issue as a result, so I highly suggest using as many iterations as you need).
Line 7: Writes the coefficients to a variable; this will allow us to determine where the 80% breakthrough point is without leaving our script.
Line 8: Declare an empty array for later use.
Lines 9+10: Write coefficients to variables to use in for loop.
Lines 11-13: For loop. Makes a nifty array (xAchieve) where one can input any integer from 1-20 (in the form xAchieve[i]) later, or print out the entire array to map x hours of studying to a certain percent chance of achievement.
Line 14: Simply the logit model, now solving for x with a Y set at 0.8, or 80% (determining our breakthrough point).
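To spell out that last line: setting Y = 0.8 in the logit form gives exp(-(A+B*x)) = 1/0.8 - 1 = 1/4, so A + B*x = log(4) and x = (log(4)-A)/B. Once the script has run, a one-line sanity check (a sketch) is to plug xTime back into the fitted curve:

1/(1+exp(-1*(A + B*xTime)))   #should return 0.8, the target probability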

From here, it's as simple as opening up an R GUI interface, or starting R from the command prompt (whatever your preference; I'm a Windows user and prefer the GUI), and typing:

source("(path)/bgrade.r")

Now it's all written out! Let's take a look at some of our key outputs.
> summary(nlsProb)

Formula: Yreal ~ 1/(1 + exp(-1 * (A + B * Xreal)))

Parameters:
Estimate Std. Error t value Pr(>|t|)
A -1.30595 0.30481 -4.284 2.80e-05 ***
B 0.16816 0.03072 5.475 1.26e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4433 on 208 degrees of freedom

Number of iterations to convergence: 4
Achieved convergence tolerance: 7.859e-06

> xAchieve
[1] 0.2427264 0.2749548 0.3097123 0.3467622 0.3857673 0.4262967 0.4678409 0.5098359 0.5516924 0.5928293 0.6327044 0.6708427 0.7068557 0.7404530 0.7714447 0.7997367
[17] 0.8253195 0.8482544 0.8686572 0.8866832
> xTime
[1] 16.00978

What do these three commands mean?
The first shows the summary of our logit model. While my assumptions were right, my starting estimates were off: A=-1.30595, and B=0.16816.

The second is our array xAchieve. The first data point shows that a student who studies for only 1 hour can be expected to score a B or higher only 24.27% of the time. Yikes. At the other end, a student who studies for 20 hours can be expected to score a B or higher 88.67% of the time.

The third is the "answer" to our initial question of where the 80% breakthrough point is. As our array shows us reasonably, and xTime shows us more precisely, it is right around 16 hours. Apparently this professor does not believe in grade inflation.

After you are done with the data, a simple command:

rm(list=ls())

Cleans up your workspace for more analysis.

Clearly, the logit model enables some robust analysis. It was used recently by The College Board to reach a striking conclusion: that an 1180 on the old SAT was the "breakthrough" point at which a student had a 65% chance of achieving just a 2.7 GPA in their freshman year (a telling statistic in an age of grade inflation and more students attending college than ever).

The logit model is also a powerful tool for marketing and advertising: it can measure various aspects of an advertisement, the way it was published, and other factors to determine how each independent variable affects the one major question of whether or not the individual acts upon the advertisement. While it is an important analysis for a statistician, knowledge of the logit model is valuable in all sorts of professions, from biology and academia to business and marketing.