You will not be submitting this. It is a self-quiz by which you can test your knowledge. If you don't ace this, you should not go on. You will need to know all of this to understand what's coming.
Here is an exercise to hone your regression skills. You're going to have to do the R for yourself. I'm not going to show you the R output, but I will give you the commands. I'll leave the command prompts off so you can copy and paste.
After clearing your workspace...
rm(list=ls()) # you can also use a menu to do this
... do this to get the data.
data(mtcars) # built-in dataset
help(mtcars) # will probably open a new window

The data were extracted from the 1974 Motor Trend US magazine, and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

A data frame with 32 observations on 11 (numeric) variables.

mpg    Miles/(US) gallon
cyl    Number of cylinders (4, 6, or 8)
disp   Displacement (cu.in.)
hp     Gross horsepower
drat   Rear axle ratio
wt     Weight of the car (in 1000 lbs)
qsec   1/4 mile time
vs     Engine (0 = V-shaped, 1 = straight, i.e., V engine vs. in-line engine)
am     Transmission (0 = automatic, 1 = manual)
gear   Number of forward gears (3, 4, or 5)
carb   Number of carburetors (1, 2, 3, 4, 6, or 8)
You can close the Help window that has opened once you're done staring at it. Get a correlation matrix. To make it easier to read, we'll round off the correlations to two decimal places.
round(cor(mtcars), 2)
Questions 1-5. What are the following correlations? 1) mpg with disp: 2) disp with hp: 3) hp with wt: 4) gear with cyl: 5) gear with hp:
We are interested in the correlations that are in the first column of the correlation matrix, which are correlations of the other variables with mpg, or gas mileage of the car in miles per gallon of gasoline.
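If the full matrix is hard to scan, one option is to pull out just the mpg column. This is only a convenience sketch, not a required step, and the names r and r.mpg are my own.

```r
data(mtcars)
r = cor(mtcars)                           # full 11 x 11 correlation matrix
r.mpg = r[rownames(r) != "mpg", "mpg"]    # the other 10 variables correlated with mpg
round(sort(r.mpg), 2)                     # sorted from most negative to most positive
```

Keep in mind that "strongest" refers to the absolute value of the correlation, not its sign.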
6) Higher values of which of the following are associated with increased gas mileage in this sample? (Be careful answering this question. Make sure you understand how these variables are coded.) A) higher engine displacement (i.e., bigger engine) B) higher horsepower (i.e., more powerful engine) C) having an automatic transmission D) none of the above is associated with increased gas mileage in this sample
7) Which of the following variables has the strongest relationship to gas mileage? A) hp (horsepower) B) disp (engine displacement) C) wt (weight of the car) D) am (automatic transmission)
8) Is that relationship (question 7) linear? A) yes B) no, but it's close to linear C) no, it's not even close D) can't say from the information we have so far
with(mtcars, scatter.smooth(mpg ~ wt))
9) Based on the scatterplot, which of the following is a correct description of the relationship between mpg and wt? A) strong, negative, linear B) strong, positive, linear C) strong, negative, nonlinear D) weak, negative, nonlinear
A side note: The curve in the scatterplot is said to be "decelerating" because its slope decreases as it goes from left to right across the graph. If the slope were increasing from left to right, the curve would be called "accelerating." (For the mathematicians in the audience, I should say in both those cases the absolute value of the slope. If the curve becomes flatter from left to right, it's decelerating, and if it becomes steeper from left to right, it's accelerating.) Such curves can often be flattened out with a log transform. Try with(mtcars,scatter.smooth(mpg~log(wt))) and see what happens. We will now continue our analysis without the log transform.
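If you'd like a number to go with that picture, one rough check (my own suggestion, not part of the exercise) is to compare the linear correlation of mpg with wt against the correlation of mpg with log(wt). The names r.raw and r.log are made up for this sketch.

```r
data(mtcars)
r.raw = with(mtcars, cor(mpg, wt))        # correlation on the original scale
r.log = with(mtcars, cor(mpg, log(wt)))   # correlation after log-transforming wt
round(c(raw = r.raw, log = r.log), 3)     # larger absolute value = closer to a straight line
```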
We are going to continue with a linear regression analysis because that's what we know how to do, even though we know the relationship isn't linear.
lm.out = lm(mpg ~ wt, data=mtcars)
summary(lm.out)
10) The correct regression equation is: A) mpg = 37.285 - 5.345 * wt B) mpg = 37.285 - 5.345 * wt.hat C) mpg.hat = 37.285 - 5.345 * wt D) mpg.hat = 37.285 - 5.345 * wt.hat
11) Use the regression equation to make a prediction for a car weighing 4000 pounds. (Be careful that you understand how weight is coded.) A) mpg = 15.905 B) mpg.hat = 15.905 C) mpg.hat = 37.285 D) the result comes out negative so makes no sense
12) For each additional 1000 pounds of car weight, how does the equation predict gas mileage will change? A) decrease by 5.345 mpg B) increase by 5.345 mpg C) increase by 37.285 mpg D) the regression equation makes no such prediction
13) What percentage of the variability in gas mileage is accounted for by the weight of the car in this model? A) 91.38% B) 37.285% C) 75.28% D) none of the above is correct
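Once you've worked questions 11 and 13 by hand, you can check yourself in R. Here is a minimal sketch, assuming the model is refit under the name fit (my name, chosen so that lm.out isn't disturbed). Remember that wt is coded in thousands of pounds.

```r
data(mtcars)
fit = lm(mpg ~ wt, data=mtcars)
new.car = data.frame(wt = 4)              # 4000 lbs, coded in 1000s of lbs
pred = predict(fit, newdata=new.car)      # predicted mpg from the fitted line
pred
r.sq = summary(fit)$r.squared             # proportion of variability in mpg accounted for
round(100 * r.sq, 2)                      # the same thing as a percentage
```

The value from predict() may differ from your hand calculation in the last decimal place, because R uses unrounded coefficients.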
plot(mpg ~ wt, data=mtcars)
abline(lm.out)
14) The line we have just plotted on the scatterplot is the: A) least squares regression line B) most squares regression line C) LOWESS line D) Maginot line
par(mfrow=c(2,2))
plot(lm.out, 1:4)
15) The Residuals vs. Fitted plot indicates that there is a serious problem with: A) nonnormality B) nonlinearity C) heterogeneity of variance D) all of the above are correct
16) Were there any cases that had undue influence in this regression analysis? If so, identify the car. A) Chrysler Imperial B) Toyota Corolla C) yes, but it was some car other than A or B D) no, there were no influential cases
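The Cook's distance plot labels only a few points. If you want the actual values, cooks.distance() returns one per car. A sketch, using fit as a stand-in name for the mpg ~ wt model and cd as a made-up name for the result:

```r
data(mtcars)
fit = lm(mpg ~ wt, data=mtcars)
cd = cooks.distance(fit)                  # one Cook's D per car, named by row
head(sort(cd, decreasing=TRUE), 3)        # the three largest values
```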
Okay, now let's see how mpg is related to horsepower (hp). The R is up to you. See if you can answer the following questions.
17) The relationship between mpg and hp is (positive / negative).
18) The relationship between mpg and hp is (linear / nonlinear accelerating / nonlinear decelerating).
19) The relationship between mpg and hp is (weak / strong).
20) The linear correlation between mpg and hp is _____.
21) The least squares regression equation relating mpg to hp is _____.
22) A car with 500 hp would be predicted to have a gas mileage of _____ mpg.
23) Looking at the scatterplot again, could you make a more reasonable prediction? _____ mpg.
24) From the regression equation, we would predict that for every additional 1 hp produced by the engine the gas mileage should decrease by _____ mpg.
25) The percentage of variability explained in mpg by its linear regression relationship to hp is _____%.
26) Here's one we didn't do above. See if you can get it. The residuals from the linear regression relationship range from -5.712 to 8.236 mpg. A typical residual has a magnitude of _____ mpg (ignoring the sign).
27) The residuals vs. fitted plot shows that the relationship is _____.
28) The Cook's distance plot shows that, in this analysis, one car has a Cook's D of about 1.0. That car is the _____.
Now we'll do it again, except this time we'll look at the relationship between mpg and engine size (disp).
29) The relationship between mpg and disp is (positive / negative).
30) The relationship between mpg and disp is (linear / nonlinear accelerating / nonlinear decelerating).
31) The relationship between mpg and disp is (weak / strong).
32) The linear correlation between mpg and disp is _____.
33) The least squares regression equation relating mpg to disp is _____.
34) A car with 500 cubic inch displacement would be predicted to have a gas mileage of _____ mpg.
35) Looking at the scatterplot again, could you make a more reasonable prediction? _____ mpg.
36) From the regression equation, we would predict that for every additional 1 cubic inch of engine displacement the gas mileage should decrease by _____ mpg.
37) The percentage of variability explained in mpg by its linear regression relationship to disp is _____%.
Follow-up question) Have you noticed, in the case of simple linear regression (i.e., one predictor), the value of multiple R-squared is just the correlation squared? _____.
38) The residuals from the linear regression relationship range from -4.892 to 7.231 mpg. A typical residual has a magnitude of _____ mpg (ignoring the sign).
39) The residuals vs. fitted plot shows that the relationship is _____.
40) The Cook's distance plot shows that, in this analysis, the (which car) has the largest Cook's D, but it is less than 0.5, so we're not going to worry about it.
41) The Scale-Location plot shows that we have a clear violation of (which assumption).
42) The Normal Q-Q plot shows that the residuals are _____, which is a violation of the normality assumption.
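About the follow-up question after number 37: you don't have to take it on faith. Here is a quick check, sketched for the mpg ~ disp model with names (fit, r, r.sq) that I made up.

```r
data(mtcars)
fit = lm(mpg ~ disp, data=mtcars)
r = with(mtcars, cor(mpg, disp))          # linear correlation
r.sq = summary(fit)$r.squared             # multiple R-squared from the regression
all.equal(r^2, r.sq)                      # should be TRUE with a single predictor
```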
Two more things you should remember. First, you can get confidence intervals around the regression coefficients by doing this.
confint(lm.out)
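In case you're wondering where those intervals come from, each one is the coefficient estimate plus or minus a critical t value times the standard error. Here is a sketch that rebuilds the default 95% intervals by hand. I'm assuming the mpg ~ wt model here, and est, se, t.crit, and manual are just my names.

```r
data(mtcars)
fit = lm(mpg ~ wt, data=mtcars)
est = coef(summary(fit))[, "Estimate"]
se = coef(summary(fit))[, "Std. Error"]
t.crit = qt(0.975, df=df.residual(fit))   # two-tailed 95%, residual df = n - 2
manual = cbind(lower = est - t.crit * se, upper = est + t.crit * se)
manual
confint(fit)                              # should agree with the hand calculation
```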
Second (and this is something I told you a long time ago), you can do a t-test by regression. Compare the output of these commands.
t.test(mpg ~ am, data=mtcars, var.equal=TRUE)
lm.out = lm(mpg ~ am, data=mtcars)
summary(lm.out)
confint(lm.out)
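If you'd rather not compare by eye, you can pull the matching pieces out of each result. This is a sketch under the assumption that the results are stored as tt and fit (my names, not anything R requires).

```r
data(mtcars)
tt = t.test(mpg ~ am, data=mtcars, var.equal=TRUE)   # pooled-variance t-test
fit = lm(mpg ~ am, data=mtcars)
t.lm = coef(summary(fit))["am", "t value"]
p.lm = coef(summary(fit))["am", "Pr(>|t|)"]
c(t.test = unname(tt$statistic), lm = t.lm)          # same size, opposite sign
c(t.test = tt$p.value, lm = p.lm)                    # same p-value
```

The sign flips because t.test() subtracts the manual mean from the automatic mean, while the regression slope is the manual mean minus the automatic mean.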
How did you do? Are you ready for multiple regression?