PSYC 480 -- Dr. King

Sequential (Hierarchical) Regression Practice

Do at least some of these problems. One of them will be the basis for the last
graded exercise.

Tired of typing those long commands to get data from the website? Yeah, me too!
So I've finally broken down and written a function to do it. You do have get it
from the website, but if you keep it in your workspace, it will be much easier
to load datasets.

> source("http://ww2.coastal.edu/kingw/psyc480/functions/getdata.R")

Once you have this function, you can load datasets with it, or you can still do
it the old fashioned way. The defaults for getData() are header=TRUE, sep="",
stringsAsFactors=TRUE, and row.names=NULL. All of those things can be changed
as necessary. The dataset will be brought in as a data frame called myData.

1) aggression.txt
Leigh Ann Waslien's Data, Psyc 497, Fall 1999.
Variables:
    physag = score on a physical aggression scale
    verbag = score on a verbal aggression scale
    anger = score on an anger scale
    hostil = score on a hostility scale

> getData("aggression.txt")
> summary(myData)
     physag          verbag          anger           hostil    
 Min.   :10.00   Min.   : 5.00   Min.   : 7.00   Min.   : 8.0  
 1st Qu.:14.00   1st Qu.:11.00   1st Qu.:10.50   1st Qu.:11.0  
 Median :18.00   Median :14.00   Median :13.00   Median :16.0  
 Mean   :19.89   Mean   :15.14   Mean   :14.83   Mean   :16.4  
 3rd Qu.:22.50   3rd Qu.:17.50   3rd Qu.:16.50   3rd Qu.:20.5  
 Max.   :45.00   Max.   :28.00   Max.   :33.00   Max.   :32.0 

We've talked about these data before. My "theory" is that hostility is a trait.
There are hostile people and there are nonhostile people. On the other hand,
anger is a state. People become angry for a while, then it passes. Anger results
in aggression, first verbal aggression, and then sometimes excalating into
physical aggression. Test the following causal chain using sequential regression.

hostil -> anger -> verbag -> physag

I.e., test physag ~ hostil + anger + verbag.

One of those predictors will drop out. There is a theory of emotion that says
we don't really become angry until after we start yelling. The more we yell,
the angrier we get, until finally we might escalate to throwing punches. Test
that theory using sequential regression.

Here is a quick answer, which you should not look at until you give it a try
for yourself. But in case you need a clue for the remaining problems, which will
not have answers...

> summary(aov(physag ~ hostil + anger + verbag, data=AGGR))
            Df Sum Sq Mean Sq F value   Pr(>F)    
hostil       1  227.0   227.0   7.352   0.0108 *  
anger        1 1015.8  1015.8  32.898 2.61e-06 ***
verbag       1    1.5     1.5   0.048   0.8272    
Residuals   31  957.2    30.9                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(aov(physag~hostil+verbag+anger,data=myData))
            Df Sum Sq Mean Sq F value   Pr(>F)    
hostil       1  227.0   227.0   7.352   0.0108 *  
verbag       1  187.2   187.2   6.063   0.0196 *  
anger        1  830.1   830.1  26.883 1.26e-05 ***
Residuals   31  957.2    30.9                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Of course, there is more to do than this. You need change statistics. Hold
your breath. When you drop that HUGE anger term into Residuals, what do you
think is going to happen?

What would a mediation analysis say about this. Is verbal aggression a
mediator between hostility and anger? For the mediation analysis, assume
this is a pilot study and set alpha=.10.

Of course, we've just engaged in a massive fishing expedition. We've had the
whole fleet out trawling for significance. What do you have to say about that?


2) internet.txt
Some researchers on social media use and perceived social isolation have
proposed that social internet use causes perceived social isolation because
it replaces genuine social contact. I want to test an additional idea.

Be aware, when you fetch this dataset with getData(), you're old myData object
will be written over. If you want to save it, you'll have to rename it (copy it
into a new data frame with a different name, e.g., AGGR = myData).

> getData("internet.txt")
> summary(myData)
      Lone          Soc.Anx            GIU             LIU        
 Min.   : 6.00   Min.   : 3.000   Min.   : 7.00   Min.   : 0.000  
 1st Qu.:11.00   1st Qu.: 6.500   1st Qu.:14.00   1st Qu.: 1.000  
 Median :14.00   Median : 9.000   Median :17.00   Median : 4.000  
 Mean   :15.09   Mean   : 8.405   Mean   :17.16   Mean   : 5.748  
 3rd Qu.:18.00   3rd Qu.:10.000   3rd Qu.:20.00   3rd Qu.: 9.000  
 Max.   :31.00   Max.   :17.000   Max.   :27.00   Max.   :26.000  
      SIU       
 Min.   :20.00  
 1st Qu.:31.50  
 Median :38.00  
 Mean   :38.72  
 3rd Qu.:45.50  
 Max.   :66.00 

Data from Shelley Stoecker and Christina Rukenbrod Psyc 497 Spring 2006.
Scores on a loneliness scale and social anxiety scale and three scores from
an Internet usage survey.
    Lone: score on the loneliness survey
    Soc.Anx: score on the social anxiety survey
    GIU: general internet usage
    LIU: leisure internet usage
    SIU: social internet usage

I propose that social anxiety leads to a lot of Internet usage, which then
replaces genuine social interaction, which then causes loneliness. I have no
idea which of those types of Internet usage will be important, or which should
be entered first, so enter the three kinds of Internet usage as a block in a
sequential regression.

Here are some R commands you might want to consider.
> summary(aov(Lone ~ Soc.Anx + GIU + LIU + SIU, data=myData))
> model.1 = lm(Lone ~ Soc.Anx, data=myData)   # you can also use aov() here...
> model.2 = lm(Lone ~ Soc.Anx + GIU + LIU + SIU, data=myData)   # and here
> anova(model.1)
> anova(model.1, model.2)

Of course, you're also going to be checking correlations and missing values
and get descriptive statistics, sample sizes, and so forth.

What does summary(model.2) do and why isn't it right in this case?

It wouldn't hurt to do this, however.

> par(mfrow=c(2,2))
> plot(model.2,1:4)

What is the shape of the distribution of the residuals?


3) shyness.txt
Data are from Elizabeth Ostop, Psyc 497 Spring 2010. Variables are:
    SAD: score on Social Avoidance and Distress Scale (high=more distress)
    Shyness: score on Cheek and Buss Shyness Scale (high=more shyness)
    LOC: score on Rotter's Locus of Control Scale (high=external)
    Age: in years
    Sex: gender coded 0=Female, 1=Male

There is a positive correlation between locus of control (LOC) and shyness
(Shyness) in this  dataset, r=0.232, p=.0497. I claim that this effect is not
direct but is mediated through social anxiety and distress (SAD). People who
are external (high LOC score) see the outcome of their interactions with others
as beyond their control. This causes them to experience social anxiety, and
this in turn leads them to be shy. Test my theory using sequential regression.
(We've already done this one with mediation analysis, which would actually be
the better analysis here.)

> getData("shyness.txt")


4) loneliness.txt
We've done this twice as a mediation. Do it using sequential regression.


5) cars.csv
Guess what I've just discovered. I didn't even know how my own function works!

> CARS = getData("cars.csv", sep=",")   # in CSV files, the separator is a comma

Now you have two new data frames in your workspace, CARS and myData, which are
identical. The old myData has been overwritten.

There are a lot of variables that supposedly influence gas mileage in cars,
but I claim it's all weight! Everything that increases the gas mileage of a
car does it by increasing the weight of the car.

> summary(CARS)
      mpg             disp             hp              wt              am        
 Min.   :10.40   Min.   : 71.1   Min.   : 52.0   Min.   :1.513   Min.   :0.0000  
 1st Qu.:15.43   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:2.581   1st Qu.:0.0000  
 Median :19.20   Median :196.3   Median :123.0   Median :3.325   Median :0.0000  
 Mean   :20.09   Mean   :230.7   Mean   :146.7   Mean   :3.217   Mean   :0.4062  
 3rd Qu.:22.80   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.610   3rd Qu.:1.0000  
 Max.   :33.90   Max.   :472.0   Max.   :335.0   Max.   :5.424   Max.   :1.0000

These data are from cars tested in 1974 Motor Trend magazine.
The variables are:
    mpg - gas mileage in miles per gallon
    disp - engine displacement in cubic inches
    hp - gross horsepower (not sure why it's "gross"
    wt - weight of the car in THOUSANDS of pounds
    am - whether or not the car has a automatic transmission (1=no)

How would you test my theory? Can you do it with...
> summary(aov(mpg ~ disp + hp + wt, data=CARS))

Or would this be better...
> summary(lm(mpg ~ disp + hp + wt, data=CARS))

Do you think there is a direct effect of displacement (engine size) on gas
mileage, or is it mediated through horsepower and weight? Draw the diagram.

Is the result consistent with my theory?


6) television.csv
This is not only a CSV file, but we also want the first column of it used as
row names, so that requires an additional option.

> TV = getData("television.csv", sep=",", row.names=1)

Be sure to check for missing values and deal with them appropriately.

> summary(TV)
 life.expectancy people.per.TV    people.per.physician life.expectancy.female
 Min.   :51.50   Min.   :  1.30   Min.   :  226.0      Min.   :53.00         
 1st Qu.:61.00   1st Qu.:  3.35   1st Qu.:  472.2      1st Qu.:63.00         
 Median :69.50   Median :  6.30   Median :  990.5      Median :72.00         
 Mean   :67.04   Mean   : 51.98   Mean   : 3997.7      Mean   :69.58         
 3rd Qu.:73.38   3rd Qu.: 23.00   3rd Qu.: 3193.2      3rd Qu.:77.25         
 Max.   :79.00   Max.   :592.00   Max.   :36660.0      Max.   :82.00         
                 NA's   :2                                                   
 life.expectancy.male
 Min.   :50.00       
 1st Qu.:59.75       
 Median :66.00       
 Mean   :64.50       
 3rd Qu.:69.50       
 Max.   :76.00       
                     
> dim(TV)
[1] 40  5

I got these data off the Internet. (I don't remember where.)

> cor(TV, use="pairwise.complete")

The life expectancy variables are so strongly correlated that we might as well toss
out two of them.

> TV$life.expectancy.male = NULL
> TV$life.expectancy.female = NULL
> pairs(TV)   # close any open graphics window before doing this

The next thing you'll discover is that the relationship between life.expectancy
and the other two variables is strongly nonlinear.

> par(mfrow=c(2,2))
> with(TV, scatter.smooth(people.per.TV, life.expectancy))
> with(TV, scatter.smooth(people.per.TV, log(life.expectancy)))
> with(TV, scatter.smooth(log(people.per.TV), life.expectancy))
> with(TV, scatter.smooth(log(people.per.TV), log(life.expectancy)))

It looks like the power function is very slightly better than the logarithmic
one, so I'm going to log everything (which is easy).

> TV.log = log(TV)   # convert all variables to logs
> pairs(TV.log)   # close any open graphics windows before doing this (to reset)

Do the relationships now appear to be reasonably linear?

> cor(TV.log, use="complete.obs")

How bad is variance inflation if we use both people.per.TV and people.per.physician
as predictors?

Now remember, our variables are logged!!! We want to see how life.expectancy is
related to people.per.TV and people.per.physician. How people.per.physician is
related to life.expectancy should be obvious, but why is people.per.TV related
to life.expectancy? What is people.per.TV a proxy for? (Recall, a proxy is a
variable that takes the place of another variable that would be more difficult
to measure. Thus, people.per.TV is a proxy for wealth. The more people.per.TV,
the poorer the country. Wouldn't you say?)

Time to get rid of those countries with missing values.

> TV.log = na.omit(TV.log)

I suggest: people.per.TV -> people.per.physician -> life.expectancy

Before you do this, think about it. What do you expect? Fetch the mediate()
function and do this as a mediation analysis. But think about it. What's the
difference between mediation analysis and sequential regression?

> with(TV.log, mediate(x=people.per.TV,y=life.expectancy,m=people.per.physician))

Think about this. Why?