PSYC 480 -- Dr. King

Two Multiple Regression Problems

# tresselt.txt
# James Tresselt, Psyc 497, Fall 1995
# Most of these data were obtained from the Office of Institutional
# Research at Coastal Carolina University and are from a random
# sample of freshmen who were admitted in Fall 1990 and who were
# still present at the end of Spring 1991. Variables are:
#   Gender: Female/Male
#   SATV:   SAT Verbal Score
#   SATQ:   SAT Quantitative Score
#   SATT:   SAT Total = SATV + SATQ
#   GPA91:  college GPA at the end of Spring semester 1991
#   HSGPA:  high school grade point average (James did not get this
#           from OIR; they gave him class rank; I have made up GPAs
#           to replace these in a way that I thought was reasonable)
#   orient: did the student attend freshman orientation? (I made
#           this up out of thin air; 0=no, 1=yes)
#   sex:    a dummy coded version of Gender (0=Female, 1=Male)
#
Gender  SATV  SATQ  SATT  GPA91  HSGPA  orient  sex
Female   330   290   620  3.692  2.945       1    0
Female   480   390   870  2.800  3.216       1    0
Female   340   520   860  2.125  2.920       0    0
Male     540   530  1070  1.308  2.655       0    1
Female   570   550  1120  1.400  2.768       1    0
Male     310   430   740  2.094  3.209       1    1

> file = "http://ww2.coastal.edu/kingw/psyc480/data/tresselt.txt"
> tres = read.table(file=file, header=T)
> summary(tres)
    Gender         SATV            SATQ            SATT            GPA91
 Female:133   Min.   :240.0   Min.   :260.0   Min.   : 560.0   Min.   :0.071
 Male  :118   1st Qu.:350.0   1st Qu.:400.0   1st Qu.: 770.0   1st Qu.:1.800
              Median :390.0   Median :450.0   Median : 840.0   Median :2.308
              Mean   :402.4   Mean   :453.1   Mean   : 855.5   Mean   :2.392
              3rd Qu.:440.0   3rd Qu.:505.0   3rd Qu.: 930.0   3rd Qu.:3.000
              Max.   :700.0   Max.   :690.0   Max.   :1290.0   Max.   :4.000
     HSGPA           orient            sex
 Min.   :1.500   Min.   :0.0000   Min.   :0.0000
 1st Qu.:2.450   1st Qu.:0.0000   1st Qu.:0.0000
 Median :2.751   Median :1.0000   Median :0.0000
 Mean   :2.772   Mean   :0.5259   Mean   :0.4701
 3rd Qu.:3.099   3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :4.000   Max.   :1.0000   Max.   :1.0000
> tres$Gender = NULL     # so that we can use cor()
> cor(tres)
             SATV       SATQ       SATT      GPA91      HSGPA     orient         sex
SATV    1.0000000 0.45392716 0.84725863  0.3054033  0.2109411 0.18724464 -0.03826070
SATQ    0.4539272 1.00000000 0.85789620  0.2632519  0.1777090 0.03724552  0.13808155
SATT    0.8472586 0.85789620 1.00000000  0.3330467  0.2275793 0.13017919  0.06025216
GPA91   0.3054033 0.26325191 0.33304673  1.0000000  0.4685921 0.15339061 -0.19648261
HSGPA   0.2109411 0.17770900 0.22757925  0.4685921  1.0000000 0.17010389 -0.26315190
orient  0.1872446 0.03724552 0.13017919  0.1533906  0.1701039 1.00000000  0.03108083
sex    -0.0382607 0.13808155 0.06025216 -0.1964826 -0.2631519 0.03108083  1.00000000

We want to try to predict GPA91, and we have four numeric predictors to do
it with: SATV, SATQ, SATT, and HSGPA.

A rule: Your predictors should NOT be highly correlated with each other.
Highly correlated predictors create a problem called "multicollinearity,"
which inflates the standard errors of the coefficients and makes it very
difficult to find significant predictors. As a kind of rule of thumb, once
the correlation between two predictors starts to get up around 0.7, you
should start to worry. When it gets to 0.8, you should start thinking about
dropping one of those predictors.

On the other hand, you WANT your predictors to be correlated with the
response. A predictor that is not correlated with the response is not going
to be a useful predictor.

HSGPA is not highly correlated with anything, but its highest correlation is
with the response, GPA91. It should be a satisfactory predictor. The
variables orient and sex are dummy coded, and we don't know how to use those
yet, so we'll ignore them.

That leaves the SAT measures. Both SATV and SATQ are highly correlated with
SATT, but, somewhat surprisingly, are not highly correlated with each other.
They are both correlated with the response, although not impressively. This
means we can use SATT, but if we do, we cannot use either SATV or SATQ.
Alternatively, we could use SATV and SATQ together and leave out SATT.
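The rule of thumb above can be applied mechanically. Here is a minimal sketch that types in the predictor correlations from the cor() output above (rounded as printed) and lists every pair at or above a cutoff. The names vars, cutoff, high, and flagged are made up for this illustration, and the 0.8 cutoff is just the handout's rule of thumb, not a hard statistical threshold.

```r
# Correlations among the four numeric predictors, copied from cor(tres) above.
vars <- c("SATV", "SATQ", "SATT", "HSGPA")
r <- matrix(c(
  1.0000000, 0.4539272, 0.8472586, 0.2109411,
  0.4539272, 1.0000000, 0.8578962, 0.1777090,
  0.8472586, 0.8578962, 1.0000000, 0.2275793,
  0.2109411, 0.1777090, 0.2275793, 1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

cutoff <- 0.8   # rule-of-thumb value from the text (an assumption, not a law)

# Look only above the diagonal so each pair is reported once.
high <- which(abs(r) >= cutoff & upper.tri(r), arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r    = r[high],
  stringsAsFactors = FALSE
)
print(flagged)
```

With the full data frame loaded, the same idea should work by replacing the typed-in matrix with cor(tres[, vars]).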
We will enter these predictors into an additive model, which means we will
add them into the regression formula and not look for interactions. Here are
the commands you want to start with.

> lm.out = lm(GPA91 ~ SATV + SATQ + HSGPA, data=tres)
> summary(lm.out)

1) What is the largest residual?
2) What is the typical magnitude of a residual?
3) What percentage of the total variability in GPA91 is accounted for by
   the three predictors?
4) Is that significantly better than just using the mean of GPA91 as a
   prediction for everyone? Cite the results of a statistical test.
5) Are all predictors significant at the alpha=.05 level? If not, which
   are not?
6) What is the regression equation (using all three predictors)?
7) Interpret the coefficient for HSGPA.
8) What would be the predicted GPA91 for someone with SATV=500, SATQ=500,
   and HSGPA=2.500?
9) Which is the most important of these predictors? I.e., which one is
   accounting for the most change in GPA91?

Here are my answers, but if you can't answer those questions without looking
at my answers, you have a problem!

1) -1.99110
2) 0.7094
3) 27.45%
4) Yes, F(3,247) = 31.15, p < .001.
5) No. SATQ is not, p = 0.06.
6) GPA91.hat = -0.9063065 + 0.0018435 * SATV + 0.0012254 * SATQ + 0.7218107 * HSGPA
7) For every 1 point increase in HSGPA, GPA91 is predicted to increase by
   0.7218107, with SATV and SATQ held constant.
8) > -0.9063065 + 0.0018435 * 500 + 0.0012254 * 500 + 0.7218107 * 2.5
   [1] 2.43267
9) It's a trick question. You don't know how to figure this out yet. You
   cannot determine how important a predictor is by looking at the
   coefficients, or anything else in the coefficients table. You have to
   calculate something called a beta coefficient. It's an easy calculation,
   but we'll cover it later.

# bodyfat.txt
# Data retrieved from: http://lib.stat.cmu.edu/datasets/bodyfat
# The curious reader should see that source for extensive info about this
# data set.
# Body density was determined using an underwater weighing method,
# and percent body fat was then determined by formula from density. The goal
# is to find a less expensive, less moist, and less naked method of finding
# percent body fat from various body measurements. The variables are:
#   density: determined from underwater weighing (in g/cc?)
#   fat:     percent body fat from Siri's (1956) equation
#   age:     in years at last birthday
#   weight:  in pounds apparently to the nearest quarter pound
#   height:  in inches apparently to the nearest quarter inch
#   neck:    circumference (cm)
#   chest:   circumference (cm)
#   abdom:   circumference (cm)
#   hip:     circumference (cm)
#   thigh:   circumference (cm)
#   knee:    circumference (cm)
#   ankle:   circumference (cm)
#   biceps:  (extended) circumference (cm)
#   forearm: circumference (cm)
#   wrist:   circumference (cm)
# Subjects were 252 men.
#
density   fat age weight height neck chest abdom   hip thigh knee ankle biceps forearm wrist
 1.0708  12.3  23 154.25  67.75 36.2  93.1  85.2  94.5  59.0 37.3  21.9   32.0    27.4  17.1
 1.0853   6.1  22 173.25  72.25 38.5  93.6  83.0  98.7  58.7 37.3  23.4   30.5    28.9  18.2
 1.0414  25.3  22 154.00  66.25 34.0  95.8  87.9  99.2  59.6 38.9  24.0   28.8    25.2  16.6
 1.0751  10.4  26 184.75  72.25 37.4 101.8  86.4 101.2  60.1 37.3  22.8   32.4    29.4  18.2
 1.0340  28.7  24 184.25  71.25 34.4  97.3 100.0 101.9  63.2 42.2  24.0   32.2    27.7  17.7
 1.0502  20.9  24 210.25  74.75 39.0 104.5  94.4 107.8  66.0 42.0  25.6   35.7    30.6  18.8

Choose any three of these variables (not density!) and see how well you can
predict fat. You would be wise to start with a summary(). There are some very
bizarre body measurements in this dataset. I suspect there are some recording
errors. You might want to toss out suspicious cases. Some predictors cannot
be used together because they are highly correlated with each other, and
some will not be useful because they are not correlated with the response.
Be sure to check for this. The best you will be able to do is
R-sqr = 0.749, and that's using all the predictors.
If you can pick just three that get you close to that, you've done a good
job of picking.

> file = "http://ww2.coastal.edu/kingw/psyc480/data/bodyfat.txt"
> bfat = read.table(file=file, header=T)

I chose age, abdom, and forearm and got R-sqr = 0.6741. Can you do better?

> summary(lm(fat~abdom+age+forearm,data=bfat))

Residual standard error: 4.807 on 248 degrees of freedom
Multiple R-squared:  0.6741,	Adjusted R-squared:  0.6701
F-statistic:   171 on 3 and 248 DF,  p-value: < 2.2e-16
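
It can help to see how the numbers in that last block of output hang together. The sketch below recomputes the residual degrees of freedom, the adjusted R-squared, and the F statistic from the printed R-squared, using the standard formulas. The variable names (n, p, r2, df_resid, adj_r2, f_stat) are made up for this illustration, and because the printed R-squared is rounded to four places, the results should agree with the output only to within rounding.

```r
n  <- 252      # number of subjects in bodyfat.txt
p  <- 3        # number of predictors: abdom, age, forearm
r2 <- 0.6741   # Multiple R-squared as printed (rounded)

# Residual degrees of freedom: cases minus predictors minus the intercept.
df_resid <- n - p - 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F: explained variability per predictor over
# unexplained variability per residual degree of freedom.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(df_resid)
print(round(adj_r2, 3))
print(round(f_stat, 1))
```

When you try your own three predictors, the same arithmetic applies with your own R-squared plugged in for r2.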