PSYC 480 -- Dr. King

Two Multiple Regression Problems

# tresselt.txt
# James Tresselt, Psyc 497, Fall 1995
# Most of these data were obtained from the Office of Institutional
# Research at Coastal Carolina University and are from a random
# sample of freshmen who were admitted in Fall 1990 and who were
# still present at the end of Spring 1991. Variables are:
#   Gender: Female/Male
#   SATV: SAT Verbal Score
#   SATQ: SAT Quantitative Score
#   SATT: SAT Total = SATV + SATQ
#   GPA91: college GPA at the end of Spring semester 1991
#   HSGPA: high school grade point average (James did not get this
#          from OIR; they gave him class rank; I have made up GPAs
#          to replace these in a way that I thought was reasonable)
#   orient: did the student attend freshman orientation? (I made
#           this up out of thin air; 0=no, 1=yes)
#   sex: a dummy coded version of Gender (0=Female, 1=Male)
#
Gender   SATV  SATQ  SATT  GPA91  HSGPA orient sex 
Female    330   290   620  3.692  2.945     1   0
Female    480   390   870  2.800  3.216     1   0
Female    340   520   860  2.125  2.920     0   0
Male      540   530  1070  1.308  2.655     0   1
Female    570   550  1120  1.400  2.768     1   0
Male      310   430   740  2.094  3.209     1   1

> file = "http://ww2.coastal.edu/kingw/psyc480/data/tresselt.txt"
> tres = read.table(file=file, header=TRUE)
> summary(tres)
    Gender         SATV            SATQ            SATT            GPA91      
 Female:133   Min.   :240.0   Min.   :260.0   Min.   : 560.0   Min.   :0.071  
 Male  :118   1st Qu.:350.0   1st Qu.:400.0   1st Qu.: 770.0   1st Qu.:1.800  
              Median :390.0   Median :450.0   Median : 840.0   Median :2.308  
              Mean   :402.4   Mean   :453.1   Mean   : 855.5   Mean   :2.392  
              3rd Qu.:440.0   3rd Qu.:505.0   3rd Qu.: 930.0   3rd Qu.:3.000  
              Max.   :700.0   Max.   :690.0   Max.   :1290.0   Max.   :4.000  
     HSGPA           orient            sex        
 Min.   :1.500   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.450   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :2.751   Median :1.0000   Median :0.0000  
 Mean   :2.772   Mean   :0.5259   Mean   :0.4701  
 3rd Qu.:3.099   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :4.000   Max.   :1.0000   Max.   :1.0000  
> tres$Gender = NULL   # so that we can use cor()
> cor(tres)
             SATV       SATQ       SATT      GPA91      HSGPA     orient         sex
SATV    1.0000000 0.45392716 0.84725863  0.3054033  0.2109411 0.18724464 -0.03826070
SATQ    0.4539272 1.00000000 0.85789620  0.2632519  0.1777090 0.03724552  0.13808155
SATT    0.8472586 0.85789620 1.00000000  0.3330467  0.2275793 0.13017919  0.06025216
GPA91   0.3054033 0.26325191 0.33304673  1.0000000  0.4685921 0.15339061 -0.19648261
HSGPA   0.2109411 0.17770900 0.22757925  0.4685921  1.0000000 0.17010389 -0.26315190
orient  0.1872446 0.03724552 0.13017919  0.1533906  0.1701039 1.00000000  0.03108083
sex    -0.0382607 0.13808155 0.06025216 -0.1964826 -0.2631519 0.03108083  1.00000000

We want to try to predict GPA91, and we have four numeric predictors to do it
with: SATV, SATQ, SATT, and HSGPA.

A rule: Your predictors should NOT be highly correlated with each other. Highly
correlated predictors create a problem called "multicollinearity," which inflates
the standard errors of the coefficients and so makes it very difficult to find
significant predictors. As a kind of rule of thumb, once the correlation between
two predictors starts to get up around 0.7, you should start to worry. When it
gets to 0.8, you should start thinking about dropping one of those predictors.
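
As a rough sketch of how that rule of thumb could be applied mechanically, the
lines below flag every pair of predictors whose correlation reaches a cutoff.
The correlations are typed in (rounded) from the cor(tres) output above; the
names preds, cutoff, idx, and flagged are just illustrative, and the 0.7 cutoff
is the "start to worry" level, not a hard statistical limit.

```r
# Correlations among the four numeric predictors, rounded from cor(tres)
preds = c("SATV", "SATQ", "SATT", "HSGPA")
r = matrix(c(1.000, 0.454, 0.847, 0.211,
             0.454, 1.000, 0.858, 0.178,
             0.847, 0.858, 1.000, 0.228,
             0.211, 0.178, 0.228, 1.000),
           nrow = 4, byrow = TRUE, dimnames = list(preds, preds))

cutoff = 0.7   # the "start to worry" level from the rule of thumb

# Look only above the diagonal so that each pair is listed once
idx = which(abs(r) >= cutoff & upper.tri(r), arr.ind = TRUE)
flagged = data.frame(var1 = preds[idx[, "row"]],
                     var2 = preds[idx[, "col"]],
                     r    = r[idx])
print(flagged)
```

With the full data you could build r directly from cor() on the predictor
columns instead of typing it in.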

On the other hand, you WANT your predictors to be correlated with the response.
A predictor that is not correlated with the response is not going to be a
useful predictor.

HSGPA is not highly correlated with anything, but its highest correlation is
with the response, GPA91. It should be a satisfactory predictor.

The variables orient and sex are dummy coded, and we don't know how to use those
yet, so we'll ignore them.

That leaves the SAT measures. Both SATV and SATQ are highly correlated with
SATT (no surprise, since SATT = SATV + SATQ), but, somewhat surprisingly, they
are not highly correlated with each other (r = 0.45). They are both correlated
with the response, although not impressively. This means we can use SATT, but
if we do, we cannot use either SATV or SATQ. Or we can use SATV and SATQ
together and leave out SATT, which is what we will do. We will enter these
predictors into an additive model, which means we will add them into the
regression formula and not look for interactions.
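
To see what goes wrong if all three SAT measures are entered at once, here is
a small sketch that uses ONLY the six rows shown at the top of this handout
(the names six and lm.all are made up for the illustration). Because SATT is
exactly SATV + SATQ, R cannot estimate a separate coefficient for it and
reports NA. The other numbers from a six-row fit mean nothing; the point is
just the NA.

```r
# The six sample rows printed above, not the full data set
six = read.table(header = TRUE, text = "
Gender   SATV  SATQ  SATT  GPA91  HSGPA orient sex
Female    330   290   620  3.692  2.945     1   0
Female    480   390   870  2.800  3.216     1   0
Female    340   520   860  2.125  2.920     0   0
Male      540   530  1070  1.308  2.655     0   1
Female    570   550  1120  1.400  2.768     1   0
Male      310   430   740  2.094  3.209     1   1
")

# SATT carries no information beyond SATV and SATQ
all(six$SATT == six$SATV + six$SATQ)

# Enter all three anyway and look at the coefficients
lm.all = lm(GPA91 ~ SATV + SATQ + SATT, data = six)
coef(lm.all)   # the SATT coefficient comes back as NA
```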

Here are the commands you want to start with.

> lm.out = lm(GPA91 ~ SATV + SATQ + HSGPA, data=tres)
> summary(lm.out)

1) What is the largest residual?
2) What is the typical magnitude of a residual?
3) What percentage of the total variability in GPA91 is accounted for by the
   three predictors?
4) Is that significantly better than just using the mean of GPA91 as a prediction
   for everyone? Cite the results of a statistical test.
5) Are all predictors significant at the alpha=.05 level? If not, which are not?
6) What is the regression equation (using all three predictors)?
7) Interpret the coefficient for HSGPA.
8) What would be the predicted GPA91 for someone with SATV=500, SATQ=500, and
   HSGPA=2.500?
9) Which is the most important of these predictors? I.e., which one is accounting
   for the most change in GPA91?
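
Everything these questions ask for is printed by summary(), but the same
pieces can also be pulled out of the fitted model one at a time. The sketch
below shows where each one lives. It is fit to ONLY the six sample rows shown
at the top of this handout (six, lm.six, s, and new are illustrative names),
so its numbers will not match what you get from the full data set.

```r
# The six sample rows printed above, not the full data set
six = read.table(header = TRUE, text = "
Gender   SATV  SATQ  SATT  GPA91  HSGPA orient sex
Female    330   290   620  3.692  2.945     1   0
Female    480   390   870  2.800  3.216     1   0
Female    340   520   860  2.125  2.920     0   0
Male      540   530  1070  1.308  2.655     0   1
Female    570   550  1120  1.400  2.768     1   0
Male      310   430   740  2.094  3.209     1   1
")

lm.six = lm(GPA91 ~ SATV + SATQ + HSGPA, data = six)
s = summary(lm.six)

range(residuals(lm.six))   # smallest and largest residual
s$sigma                    # residual standard error (typical residual size)
s$r.squared                # proportion of variability accounted for
s$fstatistic               # F value with its numerator and denominator df
coef(s)                    # estimates, standard errors, t values, p-values

# Predicted GPA91 for a new case
new = data.frame(SATV = 500, SATQ = 500, HSGPA = 2.5)
predict(lm.six, newdata = new)
```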

Here are my answers, but if you can't answer those questions without looking at
my answers, you have a problem!

1) -1.99110
2) 0.7094
3) 27.45%
4) Yes, F(3,247) = 31.15, p < .001.
5) No, SATQ, p = 0.06.
6) GPA91.hat = -0.9063065 + 0.0018435 * SATV + 0.0012254 * SATQ + 0.7218107 * HSGPA
7) For every 1 point increase in HSGPA, GPA91 is predicted to increase by 0.7218107
   points, holding SATV and SATQ constant.
8) > -0.9063065 + 0.0018435 * 500 + 0.0012254 * 500 + 0.7218107 * 2.5
   [1] 2.43267
9) It's a trick question. You don't know how to figure this out yet. You cannot
   determine how important a predictor is by looking at the coefficients, or anything
   else in the coefficients table. You have to calculate something called a beta
   coefficient. It's an easy calculation, but we'll cover it later.
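
The F test in answer 4 can be checked by hand from R-squared, the sample size,
and the number of predictors. The sketch below takes R-squared = 0.2745 from
answer 3, n = 251 from the summary() output (133 females + 118 males), and
k = 3 predictors. The variable names are illustrative.

```r
r2 = 0.2745   # R-squared from answer 3
n  = 251      # number of students (133 + 118)
k  = 3        # number of predictors: SATV, SATQ, HSGPA

df1 = k           # numerator degrees of freedom
df2 = n - k - 1   # denominator degrees of freedom

# F = (explained variability per predictor) / (unexplained variability per df)
f.stat = (r2 / df1) / ((1 - r2) / df2)
p.val  = pf(f.stat, df1, df2, lower.tail = FALSE)

c(df1 = df1, df2 = df2)
round(f.stat, 2)
p.val < .001
```

The degrees of freedom come out as 3 and 247, and F agrees with the value in
the summary() output to two decimal places.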

# bodyfat.txt
# Data retrieved from: http://lib.stat.cmu.edu/datasets/bodyfat
# The curious reader should see that source for extensive info about this
# data set. Body density was determined using an underwater weighing method,
# and percent body fat was then determined by formula from density. The goal
# is to find a less expensive, less moist, and less naked method of finding
# percent body fat from various body measurements. The variables are:
# density: determined from underwater weighing (in g/cc?)
# fat: percent body fat from Siri's (1956) equation
# age: in years at last birthday
# weight: in pounds apparently to the nearest quarter pound
# height: in inches apparently to the nearest quarter inch
# neck: circumference (cm)
# chest: circumference (cm)
# abdom: circumference (cm)
# hip: circumference (cm)
# thigh: circumference (cm)
# knee: circumference (cm)
# ankle: circumference (cm)
# biceps: (extended) circumference (cm)
# forearm: circumference (cm)
# wrist: circumference (cm)
# Subjects were 252 men.
#
density fat age weight height neck chest abdom   hip thigh knee ankle biceps forearm wrist
1.0708 12.3  23 154.25  67.75 36.2  93.1  85.2  94.5  59.0 37.3  21.9   32.0    27.4  17.1
1.0853  6.1  22 173.25  72.25 38.5  93.6  83.0  98.7  58.7 37.3  23.4   30.5    28.9  18.2
1.0414 25.3  22 154.00  66.25 34.0  95.8  87.9  99.2  59.6 38.9  24.0   28.8    25.2  16.6
1.0751 10.4  26 184.75  72.25 37.4 101.8  86.4 101.2  60.1 37.3  22.8   32.4    29.4  18.2
1.0340 28.7  24 184.25  71.25 34.4  97.3 100.0 101.9  63.2 42.2  24.0   32.2    27.7  17.7
1.0502 20.9  24 210.25  74.75 39.0 104.5  94.4 107.8  66.0 42.0  25.6   35.7    30.6  18.8

Choose any three of these variables (not density!) and see how well you can
predict fat. You would be wise to start with a summary(). There are some very
bizarre body measurements in this dataset. I suspect there are some recording
errors. You might want to toss out suspicious cases.
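
One way to toss out suspicious cases is to build a TRUE/FALSE flag and keep
only the rows where it is FALSE. The sketch below uses ONLY the six rows shown
above, and the plausibility limits in it are made up for illustration (they
are not from the data source), so pick your own after looking at summary().
None of these six rows gets flagged.

```r
# The six sample rows printed above, not the full data set
six = read.table(header = TRUE, text = "
density fat age weight height neck chest abdom   hip thigh knee ankle biceps forearm wrist
1.0708 12.3  23 154.25  67.75 36.2  93.1  85.2  94.5  59.0 37.3  21.9   32.0    27.4  17.1
1.0853  6.1  22 173.25  72.25 38.5  93.6  83.0  98.7  58.7 37.3  23.4   30.5    28.9  18.2
1.0414 25.3  22 154.00  66.25 34.0  95.8  87.9  99.2  59.6 38.9  24.0   28.8    25.2  16.6
1.0751 10.4  26 184.75  72.25 37.4 101.8  86.4 101.2  60.1 37.3  22.8   32.4    29.4  18.2
1.0340 28.7  24 184.25  71.25 34.4  97.3 100.0 101.9  63.2 42.2  24.0   32.2    27.7  17.7
1.0502 20.9  24 210.25  74.75 39.0 104.5  94.4 107.8  66.0 42.0  25.6   35.7    30.6  18.8
")

summary(six[, c("fat", "weight", "height")])

# HYPOTHETICAL limits: zero or negative body fat, or an adult under 4 feet tall
suspicious = six$fat <= 0 | six$height < 48

six[suspicious, ]           # look at the flagged cases before dropping them
clean = six[!suspicious, ]  # keep everything else
nrow(clean)
```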

Some predictors cannot be used together because they are highly correlated
with each other, and some are not worth using because they are not correlated
with the response. Be sure to check for both.
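
A sketch of both checks, again using ONLY the six rows shown above (so the
correlations themselves are not to be trusted; run the same lines on the full
data frame). The first part ranks the candidate predictors by the size of
their correlation with fat, and the second looks at the correlations among one
possible set of three. The names r.fat, ranked, and pick are illustrative.

```r
# The six sample rows printed above, not the full data set
six = read.table(header = TRUE, text = "
density fat age weight height neck chest abdom   hip thigh knee ankle biceps forearm wrist
1.0708 12.3  23 154.25  67.75 36.2  93.1  85.2  94.5  59.0 37.3  21.9   32.0    27.4  17.1
1.0853  6.1  22 173.25  72.25 38.5  93.6  83.0  98.7  58.7 37.3  23.4   30.5    28.9  18.2
1.0414 25.3  22 154.00  66.25 34.0  95.8  87.9  99.2  59.6 38.9  24.0   28.8    25.2  16.6
1.0751 10.4  26 184.75  72.25 37.4 101.8  86.4 101.2  60.1 37.3  22.8   32.4    29.4  18.2
1.0340 28.7  24 184.25  71.25 34.4  97.3 100.0 101.9  63.2 42.2  24.0   32.2    27.7  17.7
1.0502 20.9  24 210.25  74.75 39.0 104.5  94.4 107.8  66.0 42.0  25.6   35.7    30.6  18.8
")

r = cor(six)

# Correlation of each candidate predictor with the response (density excluded)
r.fat  = r[, "fat"]
r.fat  = r.fat[!(names(r.fat) %in% c("fat", "density"))]
ranked = sort(abs(r.fat), decreasing = TRUE)
round(ranked, 2)

# Correlations among one possible set of three predictors
pick = c("abdom", "age", "forearm")
round(r[pick, pick], 2)
```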

The best you will be able to do is R-sqr = 0.749, and that's using all the
predictors. If you can pick just three that get you close to that, you've
done a good job of picking.

> file = "http://ww2.coastal.edu/kingw/psyc480/data/bodyfat.txt"
> bfat = read.table(file=file, header=TRUE)

I chose age, abdom, and forearm and got R-sqr = 0.6741. Can you do better?

> summary(lm(fat~abdom+age+forearm,data=bfat))

Residual standard error: 4.807 on 248 degrees of freedom
Multiple R-squared:  0.6741,	Adjusted R-squared:  0.6701 
F-statistic:   171 on 3 and 248 DF,  p-value: < 2.2e-16