| Table of Contents | Function Reference | Function Finder | R Project |

HIERARCHICAL REGRESSION

Preliminaries

I am not well read on the topic of hierarchical regression, so for details, especially on theory, you'd be well advised to consult another source. I've found a textbook by Keith (2006) to be very helpful. It is well written and gets to the point without getting bogged down in a lot of mathematical finery. What theory I do present here is adapted from that source. I'm also going to borrow a data set from him. Keith, by the way, refers to hierarchical regression as sequential regression. Other resources I found helpful were Petrocelli (2003), who discussed common problems with and misuses of hierarchical regression, and Lewis (2007), who compared hierarchical with stepwise regression.

I felt it necessary to create this tutorial to support another on Unbalanced Factorial Designs. As we'll find out, Type I sums of squares, the R default in ANOVA, does something similar to hierarchical regression, but not quite exactly. There is a difference in the error terms that are used, which I'll discuss below.

Keith, T. Z. (2006). Multiple Regression and Beyond (1st ed.). Boston: Pearson/Allyn and Bacon. (I understand there is now a second edition available.)

Lewis, M. (2007). Stepwise versus hierarchical regression: Pros and cons. Paper presented at the annual meeting of the Southwest Educational Research Association, February 7, 2007, San Antonio. (Available online.)

Petrocelli, J. V. (2003). Hierarchical multiple regression in counseling research: Common problems and possible remedies. Measurement and Evaluation in Counseling and Development, 36, 9-22.

Some Data

Timothy Keith, in his textbook Multiple Regression and Beyond, uses a data set that is well suited to demonstrating hierarchical regression and how it is different from standard multiple regression. These data are available at Dr. Keith's website.
Click the link above, then scroll down until you come to a boldfaced section called "Shorter version of the NELS data (updated 5/9/15)." At the end of that paragraph is a link that begins "n=1000...". Click it and download the zip file it links to. Drag that file to someplace where it is convenient to work with (I use my Desktop), then unzip it. That will create a folder with the ungainly name "n=1000stud-par-shorter-5-19-15-update". Open this folder and find a file with the equally ungainly name "n=1000, stud & par shorter all miss blank.csv". Change the name of that file to Keith.csv and drag it to your working directory (Rspace). You can throw the rest of the folder away now. (Not that there isn't anything interesting in it, but nothing we'll be using.)

The file is a CSV file with missing values left as blanks. Fortunately, R should be able to deal with this in a CSV file by automatically converting the blank data values to NA. Load the data as follows.

```
> NELS = read.csv("Keith.csv")
> summary(NELS)   # output not shown
```

I'm not going to show the summary output because there is a LOT of it! We want to work with only five of those many, many variables, so the easiest thing to do would be to extract them into a more manageable data frame. That's going to require some very careful typing, or some moderately careful copying and pasting.

```
> grades1 = NELS[,c("f1txhstd","byses","bygrads","f1cncpt2","f1locus2")]
> dim(grades1)
[1] 1000    5
> summary(grades1)
    f1txhstd         byses              bygrads        f1cncpt2       
 Min.   :28.94   Min.   :-2.41400   Min.   :0.50   Min.   :-2.30000  
 1st Qu.:43.85   1st Qu.:-0.56300   1st Qu.:2.50   1st Qu.:-0.45000  
 Median :51.03   Median :-0.08650   Median :3.00   Median :-0.06000  
 Mean   :50.92   Mean   :-0.03121   Mean   :2.97   Mean   : 0.03973  
 3rd Qu.:59.04   3rd Qu.: 0.53300   3rd Qu.:3.50   3rd Qu.: 0.52000  
 Max.   :69.16   Max.   : 1.87400   Max.   :4.00   Max.   : 1.35000  
 NA's   :77                         NA's   :17     NA's   :59        
    f1locus2       
 Min.   :-2.16000  
 1st Qu.:-0.36250  
 Median : 0.07000  
 Mean   : 0.04703  
 3rd Qu.: 0.48000  
 Max.   : 1.43000  
 NA's   :60        
```

If you're getting that summary table, then you're ready to go.

```
> rm(NELS)   # if you don't want it cluttering up your workspace
```

Some Preliminary Data Analysis

NELS stands for National Educational Longitudinal Study. NELS was a study begun in 1988, when a large, nationally representative sample of 8th graders was tested, and then retested again in 10th and 12th grades. These aren't the complete NELS data but 1000 cases chosen, I assume, at random. We will attempt to duplicate the analysis in Keith's textbook (2006), which appears to have been done with SPSS.

Let's begin by duplicating the descriptive statistics (Figure 5.1, page 75). The minimum and maximum we already have. We also already have the means, but we'll get them again, as it only takes two seconds. Okay, maybe three. Because our data frame is all numeric, we can use a shortcut instead of doing this variable by variable.

```
> apply(grades1, 2, mean, na.rm=T)   # margin 2 is the column margin
   f1txhstd       byses     bygrads    f1cncpt2    f1locus2 
50.91808234 -0.03120500  2.97039674  0.03973433  0.04703191 
> apply(grades1, 2, sd, na.rm=T)
 f1txhstd     byses   bygrads  f1cncpt2  f1locus2 
9.9415281 0.7787984 0.7522191 0.6728771 0.6235697 
> sampsize = function(x) sum(!is.na(x))
> apply(grades1, 2, sampsize)
f1txhstd    byses  bygrads f1cncpt2 f1locus2 
     923     1000      983      941      940 
```

Because of the missing values, we couldn't use length() to get the sample sizes. We had to write a short function that counts up the number of values that are not missing.

Now we'll try to duplicate Table 5.1 on page 76, which is a correlation matrix. Given that we have multiple missing values in the data, we're going to have to figure out how to choose cases to use in calculating the correlations. My first inclination is to use all pairwise complete cases.
```
> cor(grades1, use="pairwise.complete")
          f1txhstd     byses   bygrads  f1cncpt2  f1locus2
f1txhstd 1.0000000 0.4387503 0.4948496 0.1791069 0.2474076
byses    0.4387503 1.0000000 0.3352870 0.1383291 0.2032299
bygrads  0.4948496 0.3352870 1.0000000 0.1645225 0.2462188
f1cncpt2 0.1791069 0.1383291 0.1645225 1.0000000 0.5898973
f1locus2 0.2474076 0.2032299 0.2462188 0.5898973 1.0000000
```

These correlations are close to the ones in the table, but not quite right. Perhaps it would be more sensible to use complete observations across the board, as these are the ones that will be included in the regression analysis.

```
> cor(grades1, use="complete.obs")
          f1txhstd     byses   bygrads  f1cncpt2  f1locus2
f1txhstd 1.0000000 0.4299589 0.4977062 0.1727001 0.2477357
byses    0.4299589 1.0000000 0.3254019 0.1322685 0.1942161
bygrads  0.4977062 0.3254019 1.0000000 0.1670206 0.2283377
f1cncpt2 0.1727001 0.1322685 0.1670206 1.0000000 0.5854103
f1locus2 0.2477357 0.1942161 0.2283377 0.5854103 1.0000000
```

That did it! I'm going to fix those variable names now because, seriously? The variables, in order, are standardized achievement scores, family socioeconomic status, previous grades, a self esteem score, and a locus of control score. (Locus of control tells how you feel your life is governed: internally by yourself, called internal locus of control, or externally by forces beyond your control, called external locus of control. Higher scores generally indicate a more external locus of control on these scales, and I assume the same is true in this case, although I'm not certain.)

```
> colnames(grades1) = c("std.achieve.score","SES","prev.grades","self.esteem","locus.of.control")
> head(grades1)
  std.achieve.score    SES prev.grades self.esteem locus.of.control
1             61.71 -0.563         3.8       -0.10            -0.14
2             46.98  0.123         2.5       -0.45            -0.58
3             50.48  0.229         2.8        0.33            -0.59
4             56.48  0.687         3.5       -0.02             0.07
5             55.32  0.633         3.3       -0.09            -0.85
6             39.67  0.992         2.5       -0.28             0.07
```

I feel much better now!
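Since "complete.obs" drops any row with a missing value anywhere, it's worth knowing how many cases actually survive that listwise deletion; with this data set it is the 887 cases that also end up in the regression analysis below. A minimal sketch with a made-up toy data frame (with the real data you would call complete.cases() on grades1):

```r
## complete.cases() flags rows containing no NAs; summing the flags
## gives the N used by use="complete.obs" (and later by lm()).
toy = data.frame(x = c(1, NA, 3, 4), y = c(1, 2, NA, 4))
sum(complete.cases(toy))   # 2 complete rows here; 887 for grades1
```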
The first analysis is a standard regression analysis (shown in Figure 5.2, page 77). I'm going to use a trick in the regression model formula, DV~., which means DV as a function of all other variables.

```
> lm.out = lm(std.achieve.score ~ ., data=grades1)
> summary(lm.out)

Call:
lm(formula = std.achieve.score ~ ., data = grades1)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.8402  -5.3648   0.2191   5.6748  20.3933 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       35.5174     1.2256  28.981   <2e-16 ***
SES                3.6903     0.3776   9.772   <2e-16 ***
prev.grades        5.1502     0.3989  12.910   <2e-16 ***
self.esteem        0.2183     0.5007   0.436    0.663    
locus.of.control   1.5538     0.5522   2.814    0.005 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.041 on 882 degrees of freedom
  (113 observations deleted due to missingness)
Multiple R-squared:  0.3385,	Adjusted R-squared:  0.3355 
F-statistic: 112.8 on 4 and 882 DF,  p-value: < 2.2e-16

> confint(lm.out)   # confidence intervals for the regression coefficients
                      2.5 %    97.5 %
(Intercept)      33.1120344 37.922735
SES               2.9491568  4.431499
prev.grades       4.3672598  5.933147
self.esteem      -0.7643607  1.201002
locus.of.control  0.4700688  2.637597
```

Spot on, so far. Good job, SPSS! We have not yet duplicated the ANOVA table or the beta coefficients.

```
> print(anova(lm.out), digits=7)
Analysis of Variance Table

Response: std.achieve.score
                  Df   Sum Sq   Mean Sq   F value     Pr(>F)    
SES                1 15938.64 15938.645 246.49535 < 2.22e-16 ***
prev.grades        1 12344.62 12344.616 190.91274 < 2.22e-16 ***
self.esteem        1   391.62   391.623   6.05655  0.0140453 *  
locus.of.control   1   512.00   512.001   7.91823  0.0050028 ** 
Residuals        882 57031.03    64.661                         
```

The anova() broke the regression SS into four parts. If we sum them, we get the regression SS in the SPSS table. That seems like a lot of work, so let's get the total SS and then get the regression SS by subtraction. The first thing I'm going to do is write a function to calculate sum of squares when there are missing values.
```
> SS = function(x) sum(x^2, na.rm=T) - sum(x, na.rm=T)^2 / sum(!is.na(x))
> SS(grades1$std.achieve.score)
[1] 91124.93
```

Fail! Why? Because achievement scores were included in this calculation that were not included in the regression analysis. The perils of missing values! There are 77 missing achievement scores, more than for any of the other variables, but apparently there were some achievement scores present that were paired with missing values in other variables. Since those were not included in the regression analysis, we don't want them in this calculation either. The simplest thing to do is to create a new data frame with cases that have missing values omitted. We will then have only the complete cases that were used in the regression. (This might not have been a bad idea right from the get go!)

```
> grades2 = na.omit(grades1)
> SS(grades2$std.achieve.score)   # total SS
[1] 86217.92
```

Better.

```
> SS(grades2$std.achieve.score) - 57031.03   # regression SS
[1] 29186.89
```

R and SPSS disagree a tad on the rounding, but I think we can let that slide. Now let's get the beta coefficients. To do that, we're going to use the scale() function to convert the entire numeric data frame to z-scores. The output is a matrix, so we have to convert it to a data frame before we can use it in the regression analysis.

```
> grades3 = scale(grades2)
> grades3 = as.data.frame(grades3)
> coef(lm(std.achieve.score ~ SES + prev.grades + self.esteem + locus.of.control, data=grades3))
      (Intercept)               SES       prev.grades       self.esteem 
    -3.264896e-16      2.854706e-01      3.802391e-01      1.474298e-02 
 locus.of.control 
     9.683904e-02 
```

Spot on, although that may have been a little more work than some of you were hoping for. Still, I think it's easier than hunting and pecking your way through a dozen menus to get it done!

Just A Touch Of Theory

Before we do the hierarchical regression, let's talk a little about what the hierarchical regression does that the standard regression does not.
I think Keith's exposition on this is so clear that I'm going to essentially paraphrase it, and I'll refer you to his textbook for more. (He covers hierarchical regression only briefly, but it is an excellent summary.)

Standard regression enters all of the predictor variables into the regression at once, and each predictor is treated as if it were entered last. That is, each predictor variable in the regression table above is controlled for all of the other predictors in the table. Thus, we can say that the effect of SES is 3.6903 when prev.grades, self.esteem, and locus.of.control are held constant, or are held at their mean values.

The following diagram, which I have redrawn from Keith (Figure 5.7, page 83), shows what this means most clearly. In the little regression universe that we've created for ourselves, the predictors can potentially influence the response over several pathways. For example, SES can have a direct effect on achievement scores, indicated by the arrow that goes straight to the response, but it can also have effects that follow indirect pathways. It can have an effect on locus of control, which then affects achievement scores. It can have an effect on self esteem, which can then act directly on achievement scores, but can also act through locus of control. Following the other arrows around (other than the arrows showing direct effects) will show us how WE THINK each of these predictors might have an indirect effect on achievement scores.

Standard regression reveals only the direct effects of each predictor. Because SES is treated as if it is entered last in a standard regression, all of the indirect effects of SES have already been pruned away, having been handed over to one of the other predictors, before SES gets to attempt to claim variability in the response. Thus, SES is left only with its direct pathway.
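This pruning can be seen in miniature with simulated data (the variable names and coefficients here are entirely made up, not from the NELS data): when the indirect-pathway partner is in the model, a predictor keeps only its direct effect; entered alone, it carries the total effect.

```r
## Simulated regression universe: ses -> grades -> scores, plus a
## direct ses -> scores pathway with coefficient 2.
set.seed(1)
ses    = rnorm(500)
grades = 0.5 * ses + rnorm(500)               # indirect pathway
scores = 2 * ses + 3 * grades + rnorm(500)    # direct effect of ses is 2
coef(lm(scores ~ ses))["ses"]            # total effect, near 2 + 3*0.5 = 3.5
coef(lm(scores ~ ses + grades))["ses"]   # direct effect only, near 2
```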
However, if SES were entered alone into a regression analysis (as the only predictor, that is), we would get to see its total effect on achievement scores, which is its effect through both direct and indirect pathways, because it wouldn't be competing with the other predictors for response variability. That's what a hierarchical regression does. If SES were entered first into a hierarchical regression, it would get to claim all the response variability it could without having to compete with the other predictors. In other words, we wouldn't be seeing just the direct effect of SES on achievement scores, we'd be seeing the total effect. Then when we enter Previous Grades second, we get to see its total effect, provided none of that effect goes through SES and, therefore, has already been claimed. Then we enter Self Esteem third, and we see its total effect, provided none of that goes through either Previous Grades or SES. And so on.

In other words, standard regression reveals direct effects only. Hierarchical regression reveals total effects, provided we have entered the predictors in the correct order. And that is a BIG provided! Order of entry is the biggest catch in hierarchical regression. How do we know what order to use? Keith recommends drawing a diagram like the one above. Such a diagram should make it clear in what order the variables should be entered. If you can't draw such a diagram, then you probably shouldn't be using hierarchical regression.

Note: I have to admit that it helped me a lot to have already written the tutorial on Simple Mediation Analysis when I was attempting to understand this!

Now For The Hierarchical Regression

Here are the hierarchical (or sequential, as Keith calls it) multiple regression results (Figure 5.3, page 79) as SPSS (I assume) calculated them. The variables were entered one at a time in the order shown in the table. We already have some of these results.
We need to recall the total sum of squares, 86217.92, and we need to recall the ANOVA table we produced earlier.

```
> print(anova(lm.out), digits=7)
Analysis of Variance Table

Response: std.achieve.score
                  Df   Sum Sq   Mean Sq   F value     Pr(>F)    
SES                1 15938.64 15938.645 246.49535 < 2.22e-16 ***
prev.grades        1 12344.62 12344.616 190.91274 < 2.22e-16 ***
self.esteem        1   391.62   391.623   6.05655  0.0140453 *  
locus.of.control   1   512.00   512.001   7.91823  0.0050028 ** 
Residuals        882 57031.03    64.661                         
```

We can get the change in R-squared values by dividing each SS by the total.

```
> c(15938.64, 12344.62, 391.62, 512.00) / 86217.92
[1] 0.184864585 0.143179283 0.004542211 0.005938441
```

Of course, the R Square column is just the cumulative sum of those.

```
> cumsum(c(15938.64, 12344.62, 391.62, 512.00) / 86217.92)
[1] 0.1848646 0.3280439 0.3325861 0.3385245
```

And if you want to be really anal about it, the R column is the square root of that.

```
> sqrt(cumsum(c(15938.64, 12344.62, 391.62, 512.00) / 86217.92))
[1] 0.4299588 0.5727511 0.5767028 0.5818286
```

Notice that our ANOVA table does not give the same tests, however (except for the last one). Why? The answer is that the ANOVA table tests are done with a common error term, namely, the one shown in the table. The tests in the SPSS table can be obtained from the ANOVA table for any term by pooling all the terms below it with the error term. For example, the test for prev.grades would be:

```
> ((12344.62)/1) / ((391.62+512.00+57031.03)/(1+1+882))   # F value
[1] 188.3613
> pf(188.3613, 1, 884, lower=F)   # p-value
[1] 5.310257e-39
```

We could also do that like this, but ONLY the prev.grades test is available in this output.

```
> anova(lm(std.achieve.score ~ SES + prev.grades, data=grades2))
Analysis of Variance Table

Response: std.achieve.score
             Df Sum Sq Mean Sq F value    Pr(>F)    
SES           1  15939 15938.6  243.20 < 2.2e-16 ***
prev.grades   1  12345 12344.6  188.36 < 2.2e-16 ***
Residuals   884  57935    65.5                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Which tests are correct, the tests in R's ANOVA table, or the tests in the SPSS table?

What Is The Correct Error Term?

According to every textbook I've ever read, and that's at least a couple, the SPSS tests are correct. This is not a trivial point, because there are multiple forums and R blogs online that advise doing a hierarchical regression like this.

```
> model1 = lm(std.achieve.score ~ 1, data=grades2)
> model2 = lm(std.achieve.score ~ SES, data=grades2)
> model3 = lm(std.achieve.score ~ SES + prev.grades, data=grades2)
> model4 = lm(std.achieve.score ~ SES + prev.grades + self.esteem, data=grades2)
> anova(model1, model2, model3, model4, lm.out)
Analysis of Variance Table

Model 1: std.achieve.score ~ 1
Model 2: std.achieve.score ~ SES
Model 3: std.achieve.score ~ SES + prev.grades
Model 4: std.achieve.score ~ SES + prev.grades + self.esteem
Model 5: std.achieve.score ~ SES + prev.grades + self.esteem + locus.of.control
  Res.Df   RSS Df Sum of Sq        F    Pr(>F)    
1    886 86218                                    
2    885 70279  1   15938.6 246.4953 < 2.2e-16 ***
3    884 57935  1   12344.6 190.9127 < 2.2e-16 ***
4    883 57543  1     391.6   6.0566  0.014045 *  
5    882 57031  1     512.0   7.9182  0.005003 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Notice that this exactly duplicates the results that are in the ANOVA table and not those in the SPSS table. To duplicate the results that are in the SPSS table, you would have to make these model comparisons two models at a time. For example...

```
> anova(model2, model3)
Analysis of Variance Table

Model 1: std.achieve.score ~ SES
Model 2: std.achieve.score ~ SES + prev.grades
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    885 70279                                  
2    884 57935  1     12345 188.36 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

...would give the correct test for prev.grades. Or just...

```
> anova(model3)   # not shown
```

...for the prev.grades test ONLY. So who says SPSS is right? Well, pretty much everybody!
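Those two-at-a-time comparisons get tedious by hand, so here is one way to automate them: fit the models in order of entry and compare each to the model just before it, so that each test uses the error term from its own step rather than the final pooled error. A sketch on a built-in data set, under the assumption that the variables are listed in the intended order of entry:

```r
## Sequential (hierarchical) tests with per-step error terms:
## anova(smaller, larger) tests the variable added at each step.
steps = c("mpg ~ 1", "mpg ~ wt", "mpg ~ wt + hp", "mpg ~ wt + hp + qsec")
fits  = lapply(steps, function(f) lm(as.formula(f), data = mtcars))
for (i in 2:length(fits)) {
    tab = anova(fits[[i - 1]], fits[[i]])
    cat(steps[i], ": F =", round(tab$F[2], 3),
        " p =", signif(tab$"Pr(>F)"[2], 4), "\n")
}
```

For the first step (one predictor against the intercept-only model), this F is identical to the overall F of the one-predictor regression, which is a handy sanity check.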
The formula given for comparing two nested regression analyses in every textbook I've consulted is this one, where for some unaccountable reason L stands for the full model and S for the reduced model. (Maybe it means "larger" and "smaller." Didn't think of that!)

F = [(R²L - R²S) / (kL - kS)] / [(1 - R²L) / (N - kL - 1)]

Here R²L and R²S are the R-squareds for the larger and smaller models, kL and kS are the numbers of predictors in each, and the test has (kL - kS) and (N - kL - 1) degrees of freedom. That this is the equivalent of the SPSS tests is easily demonstrated. To test the effect of adding self.esteem to the model, for example, we would need to know that the change in R-squared was .0045422, that the change in df was 1, that R-squared for the full model, including self.esteem, was .33259, and that N was 887, after dropping all the cases with missing values. So...

```
> .0045422 / ((1-.33259)/(887-3-1))   # F-value
[1] 6.009443
```

Since there is no function in R that automates a hierarchical regression, the lesson here is BE CAREFUL WITH THE ERROR TERM.

Entering Variables In Blocks

Variables don't have to be entered one at a time into a hierarchical regression. They can also be entered in blocks. This might happen, for example, if the researcher wanted to enter all psychological variables at one time, or if the researcher was uncertain about the order in which they should be entered. Let's say in the current example we wanted to enter self.esteem and locus.of.control at the same time. (Frankly, I was unconvinced by Keith's argument concerning the order in which these variables should be entered.) The increase in R-squared can be obtained by summing the SSes in the ANOVA table and dividing by the total SS.

```
> (391.62 + 512.00) / 86217.92
[1] 0.01048065
```

Which is just the sum of the individual increases in R-squared. The test on this would be the following.

```
> anova(model3, lm.out)
Analysis of Variance Table

Model 1: std.achieve.score ~ SES + prev.grades
Model 2: std.achieve.score ~ SES + prev.grades + self.esteem + locus.of.control
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    884 57935                                  
2    882 57031  2    903.62 6.9874 0.0009754 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Model 3 was the model before this block of variables was entered, i.e., the model that included SES and prev.grades, and lm.out was the model after this block of variables was entered. This is in agreement with the SPSS output that is in Figure 5.9 in Keith. So, good, SPSS got it right again!

Question

I couldn't help noticing how much the shaded portion of the NELS diagram resembles a Simple Mediation Analysis, or rather, the diagram one would draw to represent such an analysis. As I said above, I'm not convinced Keith got self.esteem and locus.of.control in the correct order in the hierarchical regression. I believe the order should be reversed. I believe locus.of.control, being something that is learned from your environment, is the causally prior (or at least more important) variable. (Not being an expert on either of these constructs, I doubt that my "belief" should carry much weight, and I'm just playing devil's advocate!)

So the mediation analysis people should forgive me, but I decided to play a little fast and loose with a mediation analysis. My "hypothesis" (such as it is) is that, if locus of control is entered as the mediator, it will completely sap self-esteem of its effect on achievement scores (complete mediation), because it's locus of control that's having the effect on scores and not self-esteem. That relationship would not be seen in the hierarchical regression as we've done it, because by the time locus of control got into the model, self-esteem had already grabbed a large (and undeserved) piece of the pie. Locus of control was short-circuited, so to speak. On the other hand, if the roles of self-esteem and locus of control were reversed in the mediation analysis, i.e., self-esteem entered as the mediator, we'd see virtually no mediation.

I did those two analyses using the mediate() function that is discussed in the mediation analysis tutorial. The results are in the following diagrams (created by that function).
The numbers on the lines are regression coefficients from the various models used to do the mediation analysis (see that tutorial). For the record, these were the commands used. The mediate() function is in the tutdata folder that you may have downloaded to accompany these tutorials.

```
> with(grades2, mediate(self.esteem, std.achieve.score, locus.of.control,
+      names=c("Self Esteem","Achievement Score","Locus of Control")))
       a        b        c       cp  sobel.z        p 
0.540276 3.579580 2.557419 0.623457 5.371740 0.000000 
> with(grades2, mediate(locus.of.control, std.achieve.score, self.esteem,
+      names=c("Locus of Control","Achievement Score","Self Esteem")))
       a        b        c       cp  sobel.z        p 
0.634315 0.623457 3.975048 3.579580 1.045782 0.295662 
```

It appears my "hypothesis" was confirmed. Pretty close, anyway. My question is, what does it mean? Does it mean I'm right and these variables were entered in the wrong order in the hierarchical regression? Does it mean the opposite? Or do the two analyses bear no relationship to one another? I don't know the answers to these questions, and I suspect they are not to be found in the above analyses. That's why we are constantly warned that techniques like these (hierarchical regression, mediation analysis, path analysis) are not useful for exploratory analyses. These are correlational data, and as we know, we can twist correlational data around our little fingers and make them do just about anything we want. Also, there is no guarantee that these are the only variables, or even the most important variables, that we should be looking at.

This is one case where the hard work should definitely come BEFORE the analysis. It's up to us to be familiar enough with the phenomena under investigation that we can form reasonable and rational theories of causation. We CANNOT get those from correlational data. Correlational data can be consistent or inconsistent with our previously formed theories, but that's all.
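As an aside, readers who don't have the tutorial's mediate() function handy can piece the same quantities together from three ordinary lm() fits; the Sobel z is the product of the a and b paths divided by its approximate standard error. A sketch with simulated data (the variable roles here are made up: x the predictor, m the proposed mediator, y the response):

```r
## a and b paths and the Sobel test from plain lm() fits.
set.seed(42)
x = rnorm(300)
m = 0.6 * x + rnorm(300)              # x -> m  (path a)
y = 0.5 * m + 0.2 * x + rnorm(300)    # m -> y  (path b) plus direct c'
s1 = coef(summary(lm(m ~ x)))["x", ]  # Estimate, Std. Error, t, p
s2 = coef(summary(lm(y ~ m + x)))["m", ]
a = s1["Estimate"]; se.a = s1["Std. Error"]
b = s2["Estimate"]; se.b = s2["Std. Error"]
sobel.z = (a * b) / sqrt(a^2 * se.b^2 + b^2 * se.a^2)
unname(sobel.z)   # well above 2 here, so the indirect pathway is reliable
```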
There is a much quoted warning (concerning path analysis, but equally applicable to any of these similar techniques) from Everitt and Dunn (1991) that is worth repeating here: "However convincing, respectable and reasonable a path diagram... may appear, any causal inferences extracted are rarely more than a form of statistical fantasy."

Everitt, B. S. and Dunn, G. (1991). Applied Multivariate Data Analysis. London: Edward Arnold.

created 2016 March 29