Sunday, January 2, 2011

Logistic Regression Analysis in R-programming

This time last year, I had a very vague knowledge of Logistic Analysis, an analysis that is now a cornerstone to my profession.

Many times as an analyst, I come across the various question of whether or not a person acted upon an advertisement (y=1), or not (y=0), and the various attributes (independent variables) that factored into the equation.

For a more broad audience, and to summarize this concept further, let's take the role of a college professor, who teaches the same course to a total of 210 students over the course of a school year, and wants to determine the point where a student has a 80% probability of getting at least a B on the final exam.

After looking over the 210 students, the professor formed this excel file, with two very simple variables: an independent variable of hours spent studying, and a dependent variable of whether or not a B-grade was achieved.

Could we use a linear regression here? It would show us a positive or negative correlation and...not much else. Since we would have a cluster of 121 points at y=1, and 89 at y=0, this would not really describe much of anything.

Enter logistic regression.

What is logistic regression? Per wikipedia, it is "used for prediction of the probability of occurrence of an event by fitting data to a logit function logistic curve." Essentially, it is transforming the data into a model where the dependent variable becomes a probability prediction.

The form of a logit model is:
f(z) = 1/[1-exp(-z)], where z = B0 + B1X1 + B2X2 + ... + BnXn. In the previous example, we only have one independent variable X1, so z = A + BX.

Now that we have the gritty covered, how about the actual analysis?

Well, this can be performed with a script in R. The best part of all, a "lengthy" version of this only requires 14 lines!

#!/usr/bin/env Rscript

nlsProb<-nls(Yreal ~ 1/(1+exp(-1*(A+B*Xreal))),start=list(A=-0.5,B=0.05),control=nls.control(maxiter=20000))
Coeffs<-coefficients(nlsProb)
xAchieve<-NULL
A = Coeffs[[1]][[1]]
B = Coeffs[[2]][[1]]
for(i in 1:20) {
xAchieve[i] = 1/(1+exp(-1*(A + B*i)))
}
xTime = (log(4)-A)/B

Let's analyze what is happening.
Line 1: Shebang line. Let's the system know you're about to use R.
Line 3: Reads the .csv file into a data frame (and indicates headers in the file).
Lines 4+5: Assigns our independent and dependent fields to variable arrays.
Line 6: Determines equation. This is the "kicker" here, and it's important to know two things:
1. You SHOULD have a general idea of how the equation should fit, or else it may not reduce. In this example, I assumed a less than 50% chance of achieving a B would occur when x=0, and as x grew, so would y (hence A = -0.5, B = 0.05)
2. control=nls.control is an important control here. The default amount of iterations is 50, though I often spike it to 20,000 (I've never had a major runtime issue as a result, so I highly suggest using as many iterations as you need).
Line 7: Writes coefficients to variable, this will allow us to determine where the 90% breakthrough point is without leaving our script.
Line 8: Declare an empty array for later use.
Lines 9+10: Write coefficients to variables to use in for loop.
Lines 11+13: For loop. Makes a nifty array (xAchieve) where one can input any integer from 1-20 (in the form xAchieve[i]) later, or print out the entire array to map x hours of studying to a certain percent chance of achievement.
Line 14: Simply the logit model, now solving for x with a Y set at 0.8, or 80% (determining our breakthrough point).

From here, it's as simple as opening up an R GUI interface, or starting R from the command prompt (whatever your preference, I'm a Windows user and prefer the GUI) and type:

Now it's all written out! Let's take a look at some of our key outputs.
> summary(nlsProb)

Formula: Yreal ~ 1/(1 + exp(-1 * (A + B * Xreal)))

Parameters:
Estimate Std. Error t value Pr(>|t|)
A -1.30595 0.30481 -4.284 2.80e-05 ***
B 0.16816 0.03072 5.475 1.26e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4433 on 208 degrees of freedom

Number of iterations to convergence: 4
Achieved convergence tolerance: 7.859e-06

> xAchieve
[1] 0.2427264 0.2749548 0.3097123 0.3467622 0.3857673 0.4262967 0.4678409 0.5098359 0.5516924 0.5928293 0.6327044 0.6708427 0.7068557 0.7404530 0.7714447 0.7997367
[17] 0.8253195 0.8482544 0.8686572 0.8866832
> xTime
[1] 16.00978

What do these three commands mean?
The first shows the summary of our logit model. While my assumptions where right, my estimations were off. A=-1.30595, and B=0.16816.

The second is our array xAchieve. The first data point is a student that only studies for 1 hour can be expected to score a B or higher only 24.27% of the time. Yikes. At the other end, a student that studies for 20 hours can be expected to score a B or higher 88.67% of the time.

The third is the "answer" to our initial question of where the 80% breakthrough point is. As our array shows us reasonably, and xTime shows us more precisely, it is right around 16 hours. Apparently this professor does not believe in grade inflation.

After you are done with the data, a simple command:

rm(list=ls())

Cleans up your workspace for more analysis.

Clearly, the ability to perform robust analysis is powerful using the logit model. It was used recently by The College Board to make a powerful conclusion: that an 1180 on the old-SAT was the "breakthrough" point where a student had a 65% chance of achieving just a 2.7 GPA in their freshman year (a telling statistic in an age of grade inflation and more students attending college than ever).

The logit model also is a powerful analysis for marketing and advertising, to measure various aspects of an advertisement, the way it was published, and various other factors to determine how each independent variable affects the one major question: whether or not it causes the individual to act upon the advertisement or not. While an important analysis for a statistician, knowledge of the logit model is valuable for all sorts of professions, from biology and academia, to business and marketing.