PSYC 480 -- Dr. King Sequential (Hierarchical) Regression Practice Do at least some of these problems. One of them will be the basis for the last graded exercise. Tired of typing those long commands to get data from the website? Yeah, me too! So I've finally broken down and written a function to do it. You do have get it from the website, but if you keep it in your workspace, it will be much easier to load datasets. > source("http://ww2.coastal.edu/kingw/psyc480/functions/getdata.R") Once you have this function, you can load datasets with it, or you can still do it the old fashioned way. The defaults for getData() are header=TRUE, sep="", stringsAsFactors=TRUE, and row.names=NULL. All of those things can be changed as necessary. The dataset will be brought in as a data frame called myData. 1) aggression.txt Leigh Ann Waslien's Data, Psyc 497, Fall 1999. Variables: physag = score on a physical aggression scale verbag = score on a verbal aggression scale anger = score on an anger scale hostil = score on a hostility scale > getData("aggression.txt") > summary(myData) physag verbag anger hostil Min. :10.00 Min. : 5.00 Min. : 7.00 Min. : 8.0 1st Qu.:14.00 1st Qu.:11.00 1st Qu.:10.50 1st Qu.:11.0 Median :18.00 Median :14.00 Median :13.00 Median :16.0 Mean :19.89 Mean :15.14 Mean :14.83 Mean :16.4 3rd Qu.:22.50 3rd Qu.:17.50 3rd Qu.:16.50 3rd Qu.:20.5 Max. :45.00 Max. :28.00 Max. :33.00 Max. :32.0 We've talked about these data before. My "theory" is that hostility is a trait. There are hostile people and there are nonhostile people. On the other hand, anger is a state. People become angry for a while, then it passes. Anger results in aggression, first verbal aggression, and then sometimes excalating into physical aggression. Test the following causal chain using sequential regression. hostil -> anger -> verbag -> physag I.e., test physag ~ hostil + anger + verbag. One of those predictors will drop out. There is a theory of emotion that says we don't really become angry until after we start yelling. The more we yell, the angrier we get, until finally we might escalate to throwing punches. Test that theory using sequential regression. Here is a quick answer, which you should not look at until you give it a try for yourself. But in case you need a clue for the remaining problems, which will not have answers... > summary(aov(physag ~ hostil + anger + verbag, data=AGGR)) Df Sum Sq Mean Sq F value Pr(>F) hostil 1 227.0 227.0 7.352 0.0108 * anger 1 1015.8 1015.8 32.898 2.61e-06 *** verbag 1 1.5 1.5 0.048 0.8272 Residuals 31 957.2 30.9 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > summary(aov(physag~hostil+verbag+anger,data=myData)) Df Sum Sq Mean Sq F value Pr(>F) hostil 1 227.0 227.0 7.352 0.0108 * verbag 1 187.2 187.2 6.063 0.0196 * anger 1 830.1 830.1 26.883 1.26e-05 *** Residuals 31 957.2 30.9 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Of course, there is more to do than this. You need change statistics. Hold your breath. When you drop that HUGE anger term into Residuals, what do you think is going to happen? What would a mediation analysis say about this. Is verbal aggression a mediator between hostility and anger? For the mediation analysis, assume this is a pilot study and set alpha=.10. Of course, we've just engaged in a massive fishing expedition. We've had the whole fleet out trawling for significance. What do you have to say about that? 2) internet.txt Some researchers on social media use and perceived social isolation have proposed that social internet use causes perceived social isolation because it replaces genuine social contact. I want to test an additional idea. Be aware, when you fetch this dataset with getData(), you're old myData object will be written over. If you want to save it, you'll have to rename it (copy it into a new data frame with a different name, e.g., AGGR = myData). > getData("internet.txt") > summary(myData) Lone Soc.Anx GIU LIU Min. : 6.00 Min. : 3.000 Min. : 7.00 Min. : 0.000 1st Qu.:11.00 1st Qu.: 6.500 1st Qu.:14.00 1st Qu.: 1.000 Median :14.00 Median : 9.000 Median :17.00 Median : 4.000 Mean :15.09 Mean : 8.405 Mean :17.16 Mean : 5.748 3rd Qu.:18.00 3rd Qu.:10.000 3rd Qu.:20.00 3rd Qu.: 9.000 Max. :31.00 Max. :17.000 Max. :27.00 Max. :26.000 SIU Min. :20.00 1st Qu.:31.50 Median :38.00 Mean :38.72 3rd Qu.:45.50 Max. :66.00 Data from Shelley Stoecker and Christina Rukenbrod Psyc 497 Spring 2006. Scores on a loneliness scale and social anxiety scale and three scores from an Internet usage survey. Lone: score on the loneliness survey Soc.Anx: score on the social anxiety survey GIU: general internet usage LIU: leisure internet usage SIU: social internet usage I propose that social anxiety leads to a lot of Internet usage, which then replaces genuine social interaction, which then causes loneliness. I have no idea which of those types of Internet usage will be important, or which should be entered first, so enter the three kinds of Internet usage as a block in a sequential regression. Here are some R commands you might want to consider. > summary(aov(Lone ~ Soc.Anx + GIU + LIU + SIU, data=myData)) > model.1 = lm(Lone ~ Soc.Anx, data=myData) # you can also use aov() here... > model.2 = lm(Lone ~ Soc.Anx + GIU + LIU + SIU, data=myData) # and here > anova(model.1) > anova(model.1, model.2) Of course, you're also going to be checking correlations and missing values and get descriptive statistics, sample sizes, and so forth. What does summary(model.2) do and why isn't it right in this case? It wouldn't hurt to do this, however. > par(mfrow=c(2,2)) > plot(model.2,1:4) What is the shape of the distribution of the residuals? 3) shyness.txt Data are from Elizabeth Ostop, Psyc 497 Spring 2010. Variables are: SAD: score on Social Avoidance and Distress Scale (high=more distress) Shyness: score on Cheek and Buss Shyness Scale (high=more shyness) LOC: score on Rotter's Locus of Control Scale (high=external) Age: in years Sex: gender coded 0=Female, 1=Male There is a positive correlation between locus of control (LOC) and shyness (Shyness) in this dataset, r=0.232, p=.0497. I claim that this effect is not direct but is mediated through social anxiety and distress (SAD). People who are external (high LOC score) see the outcome of their interactions with others as beyond their control. This causes them to experience social anxiety, and this in turn leads them to be shy. Test my theory using sequential regression. (We've already done this one with mediation analysis, which would actually be the better analysis here.) > getData("shyness.txt") 4) loneliness.txt We've done this twice as a mediation. Do it using sequential regression. 5) cars.csv Guess what I've just discovered. I didn't even know how my own function works! > CARS = getData("cars.csv", sep=",") # in CSV files, the separator is a comma Now you have two new data frames in your workspace, CARS and myData, which are identical. The old myData has been overwritten. There are a lot of variables that supposedly influence gas mileage in cars, but I claim it's all weight! Everything that increases the gas mileage of a car does it by increasing the weight of the car. > summary(CARS) mpg disp hp wt am Min. :10.40 Min. : 71.1 Min. : 52.0 Min. :1.513 Min. :0.0000 1st Qu.:15.43 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:2.581 1st Qu.:0.0000 Median :19.20 Median :196.3 Median :123.0 Median :3.325 Median :0.0000 Mean :20.09 Mean :230.7 Mean :146.7 Mean :3.217 Mean :0.4062 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.610 3rd Qu.:1.0000 Max. :33.90 Max. :472.0 Max. :335.0 Max. :5.424 Max. :1.0000 These data are from cars tested in 1974 Motor Trend magazine. The variables are: mpg - gas mileage in miles per gallon disp - engine displacement in cubic inches hp - gross horsepower (not sure why it's "gross" wt - weight of the car in THOUSANDS of pounds am - whether or not the car has a automatic transmission (1=no) How would you test my theory? Can you do it with... > summary(aov(mpg ~ disp + hp + wt, data=CARS)) Or would this be better... > summary(lm(mpg ~ disp + hp + wt, data=CARS)) Do you think there is a direct effect of displacement (engine size) on gas mileage, or is it mediated through horsepower and weight? Draw the diagram. Is the result consistent with my theory? 6) television.csv This is not only a CSV file, but we also want the first column of it used as row names, so that requires an additional option. > TV = getData("television.csv", sep=",", row.names=1) Be sure to check for missing values and deal with them appropriately. > summary(TV) life.expectancy people.per.TV people.per.physician life.expectancy.female Min. :51.50 Min. : 1.30 Min. : 226.0 Min. :53.00 1st Qu.:61.00 1st Qu.: 3.35 1st Qu.: 472.2 1st Qu.:63.00 Median :69.50 Median : 6.30 Median : 990.5 Median :72.00 Mean :67.04 Mean : 51.98 Mean : 3997.7 Mean :69.58 3rd Qu.:73.38 3rd Qu.: 23.00 3rd Qu.: 3193.2 3rd Qu.:77.25 Max. :79.00 Max. :592.00 Max. :36660.0 Max. :82.00 NA's :2 life.expectancy.male Min. :50.00 1st Qu.:59.75 Median :66.00 Mean :64.50 3rd Qu.:69.50 Max. :76.00 > dim(TV) [1] 40 5 I got these data off the Internet. (I don't remember where.) > cor(TV, use="pairwise.complete") The life expectancy variables are so strongly correlated that we might as well toss out two of them. > TV$life.expectancy.male = NULL > TV$life.expectancy.female = NULL > pairs(TV) # close any open graphics window before doing this The next thing you'll discover is that the relationship between life.expectancy and the other two variables is strongly nonlinear. > par(mfrow=c(2,2)) > with(TV, scatter.smooth(people.per.TV, life.expectancy)) > with(TV, scatter.smooth(people.per.TV, log(life.expectancy))) > with(TV, scatter.smooth(log(people.per.TV), life.expectancy)) > with(TV, scatter.smooth(log(people.per.TV), log(life.expectancy))) It looks like the power function is very slightly better than the logarithmic one, so I'm going to log everything (which is easy). > TV.log = log(TV) # convert all variables to logs > pairs(TV.log) # close any open graphics windows before doing this (to reset) Do the relationships now appear to be reasonably linear? > cor(TV.log, use="complete.obs") How bad is variance inflation if we use both people.per.TV and people.per.physician as predictors? Now remember, our variables are logged!!! We want to see how life.expectancy is related to people.per.TV and people.per.physician. How people.per.physician is related to life.expectancy should be obvious, but why is people.per.TV related to life.expectancy? What is people.per.TV a proxy for? (Recall, a proxy is a variable that takes the place of another variable that would be more difficult to measure. Thus, people.per.TV is a proxy for wealth. The more people.per.TV, the poorer the country. Wouldn't you say?) Time to get rid of those countries with missing values. > TV.log = na.omit(TV.log) I suggest: people.per.TV -> people.per.physician -> life.expectancy Before you do this, think about it. What do you expect? Fetch the mediate() function and do this as a mediation analysis. But think about it. What's the difference between mediation analysis and sequential regression? > with(TV.log, mediate(x=people.per.TV,y=life.expectancy,m=people.per.physician)) Think about this. Why?