PSYC 480 -- SPRING 2022 -- Dr. King

Graded Exercise on Simple Linear, Nonlinear, and Multiple Regression

Can No Longer Be Submitted

Type your first and last name and email address in the boxes indicated. No name, no credit! Enter your CCU email address. Without this I cannot return your exam to you.

First Name: Last Name:

Your Coastal e-mail address: CCU emails ONLY!

STOP! There is a good chance ITS will not send me your answers if you don't fill in the boxes above, so fill them in now.

Further Instructions. DO NOT press the Enter key after answering a question. This will cause the HTML form to attempt to submit your answers. Don't blame me. I think it's stupid. Blame the people who designed HTML forms. Bottom line, keep your hands off the Enter key while this window is in focus in your browser.

Still More Instructions. When entering a numeric answer that comes from the given analysis, enter it into the box EXACTLY as R has printed it out. When entering a numeric answer that you've had to calculate yourself, enter it with at least two accurate decimal places, unless it's an integer, in which case enter it as such.

Problem One

Here is my analysis of the airpassengers.txt data, which is available at the website. This dataset gives the number of international airline passengers IN THOUSANDS from January 1949 to December 1960. That info is in the "AirPass" variable. Months are numbered consecutively from 1 to 144 (12 years). Those numbers are in the variable "months." You can read a little more about this dataset at the website by using the blue data retrieval box if you want to.

> rm(list=ls())
> file = "http://ww2.coastal.edu/kingw/psyc480/data/airpassengers.txt"
> AP = read.table(file=file, header=T, row.names=1)
>
> head(AP)
      AirPass months
Jan49     112      1
Feb49     118      2
Mar49     132      3
Apr49     129      4
May49     121      5
Jun49     135      6
>
> dim(AP)
[1] 144   2
>
> summary(AP)
    AirPass          months      
 Min.   :104.0   Min.   :  1.00  
 1st Qu.:180.0   1st Qu.: 36.75  
 Median :265.5   Median : 72.50  
 Mean   :280.3   Mean   : 72.50  
 3rd Qu.:360.5   3rd Qu.:108.25  
 Max.   :622.0   Max.   :144.00 
>
> cor(AP)
          AirPass    months
AirPass 1.0000000 0.9239254
months  0.9239254 1.0000000
>
> with(AP, scatter.smooth(AirPass ~ months))   # graph appears below
>
> lm.out = lm(AirPass ~ months, data=AP)
> summary(lm.out)

Call:
lm(formula = AirPass ~ months, data = AP)

Residuals:
    Min      1Q  Median      3Q     Max 
-93.858 -30.727  -5.757  24.489 164.999 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 87.65278    7.71635   11.36   <2e-16 ***
months       2.65718    0.09233   28.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 46.06 on 142 degrees of freedom
Multiple R-squared:  0.8536,	Adjusted R-squared:  0.8526 
F-statistic: 828.2 on 1 and 142 DF,  p-value: < 2.2e-16

> confint(lm.out)
               2.5 %     97.5 %
(Intercept) 72.39902 102.906537
months       2.47466   2.839708
>
> par(mfrow=c(2,2))
> plot(lm.out, 1:4)
[Figures: airpass_scatter (scatterplot of AirPass vs. months with a smoothed line); airpass_diagnostics (the four regression diagnostic plots from plot(lm.out, 1:4))]

You should know what these commands do by now, so I'm not explaining them.

1) What is the smallest value in the AirPass variable?

2) Judging from the summary information given for the data frame, what would you say the shape of the distribution of the AirPass variable is?





3) In the regression analysis, which variable was the response variable (DV)?





4) What kind of regression was done?





5) What is the value of the correlation coefficient for the correlation between months and AirPass?

6) From this regression model, how many more international air passengers would we expect (predict) each month?

7) Make a prediction for the value of the AirPass variable in December 1961.

8) What is obvious from the first graph (on the left)?





9) The line on the graph is not the least squares regression line but rather R's best guess as to what the regression line or curve looks like, done by a computer-intensive smoothing technique. This line is called a(n):





10) What are the second graphs (on the right) called?





11) In the graph labeled Residuals vs Fitted, the points seem to fan out from left to right. What does this mean?





12) Which point was influential in this analysis by the D > 0.5 criterion?





Problem Two

The following analysis is of the data in the file wine.txt, which you have seen before, and which you can retrieve from the website. These data show average consumption of wine for 19 countries (in liters per capita) and yearly deaths from heart disease (per 100,000 population). Wine consumption is measured as liters of alcohol consumed from drinking wine per person. The obvious question is whether or not there is a relationship between wine drinking and death from heart disease. (Data are from M.H. Criqui, M.D., Dept. of Family and Preventive Medicine, UC San Diego, and were reported in the New York Times, 28 Dec 1994.)

> rm(list=ls())   # clears your workspace
> file = "http://ww2.coastal.edu/kingw/psyc480/data/wine.txt"
> WINE = read.table(file=file, header=T, row.names=1)
> WINE
               wine hd.deaths
Australia       2.5       211
Austria         3.9       167
Belgium         2.9       131
Canada          2.4       191
Denmark         2.9       220
Finland         0.8       297
France          9.1        71
Iceland         0.8       211
Ireland         0.7       300
Italy           7.9       107
Netherlands     1.8       167
New.Zealand     1.9       266
Norway          0.8       227
Spain           6.5        86
Sweden          1.6       207
Switzerland     5.8       115
United.Kingdom  1.3       285
United.States   1.2       199
West.Germany    2.7       172
> 
> par(mfrow=c(2,2))
> with(WINE, scatter.smooth(hd.deaths ~ wine, main="linear"))
> with(WINE, scatter.smooth(hd.deaths ~ log(wine), main="logarithmic"))
> with(WINE, scatter.smooth(log(hd.deaths) ~ wine, main="exponential"))
> with(WINE, scatter.smooth(log(hd.deaths) ~ log(wine), main="power"))
> # These graphs are on the left below.
>
> lm.out = lm(log(hd.deaths) ~ wine, data=WINE)
> summary(lm.out)

Call:
lm(formula = log(hd.deaths) ~ wine, data = WINE)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32397 -0.12238 -0.02119  0.18010  0.23573 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.63010    0.06529  86.237  < 2e-16 ***
wine        -0.14860    0.01679  -8.852 8.96e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1787 on 17 degrees of freedom
Multiple R-squared:  0.8217,	Adjusted R-squared:  0.8112 
F-statistic: 78.36 on 1 and 17 DF,  p-value: 8.96e-08

> plot(lm.out, 1:4)
> # These graphs are on the right below.
> 
> ## Regression equation
> ## log(hd.deaths.hat) = 5.6301 - 0.1486 * wine
> ## hd.deaths.hat = exp(5.6301 - 0.1486 * wine)   # NOTE: exp() is the antilog of log()
> ## -or- hd.deaths.hat = 278.6914 * exp(-0.1486 * wine)   # NOTE: exp is e to the power of
>
> ## Prediction for the United States
> exp(5.6301 - 0.1486 * 1.2)
[1] 233.1728
[Figures: wine_scatter (the four scatterplots: linear, logarithmic, exponential, power); wine_diagnostics (the four regression diagnostic plots from plot(lm.out, 1:4))]
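
As a side note, the same prediction could be obtained with R's predict() function instead of by hand. A minimal sketch, assuming the lm.out object fitted above; it should agree with the hand calculation to within rounding, since summary() prints rounded coefficients:

log.pred = predict(lm.out, newdata=data.frame(wine=1.2))   # predicted log(hd.deaths) for the U.S.
exp(log.pred)                                              # back-transform to deaths per 100,000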

13) Were base-10 logs used in this analysis, and if not does that make a difference?





14) The last thing I did in this analysis was to calculate a prediction of hd.deaths for the United States. (Yes, the calculation was done correctly!) From this, is it possible to say whether the point for the United States on the scatterplot is above, below, or on the regression line? (The observed value for the United States is given in the table at the top of the output.)





15) Using the same two numbers you used to answer question 14, calculate a residual for the United States.

16) What proportion of the variability in log(hd.deaths) is accounted for by the relationship to wine? (Be careful. It says proportion, not percentage.)

17) In the first image above (on the left) are four scatterplots of the possible models for these data. Based on these scatterplots, which model would you say is best?





18) Which model was actually calculated in the regression I did?





19) Which of the following is a correct regression equation from the model that was actually calculated? (Hint: If you know what this kind of relationship looks like, you'll know there is only one possible right answer among the following alternatives.)





20) In the second graphs above (on the right), which is a test of the assumption of "no influential points"?





21) In the second graphs above (on the right), which is a test of the assumption of "linear relationship"?





22) In the second graphs above (on the right), which is a test of the assumption of "normal distribution of residuals"?





23) Based on this analysis alone, what would you conclude about the relationship between drinking wine and dying of heart disease?





24) Notice that there is not a single country from Africa, Asia, or South America in the sample. What would you say about that?





Problem Three

Examine the following graph carefully. This is from a very famous study testing the theory that antipsychotic drugs alleviate the symptoms of schizophrenia by blocking dopamine receptors in the brain. The horizontal axis shows the affinity of the drug for dopamine receptors, i.e., how well the drug binds to and, thereby, blocks dopamine receptors. The trick on this axis is that less affinity is to the right, and more affinity (better binding) is towards the left. The vertical axis shows the typical effective daily dose of the drug (the best dose for alleviating symptoms). Higher doses are at the top. Take careful note of the scale on these axes. Then answer the following questions. (Note: drugs you might recognize from other classes are chlorpromazine or Thorazine, thioridazine or Mellaril, and haloperidol or Haldol.)

[Figure: dopamine_schizo_drugs (typical effective daily dose of antipsychotic drugs plotted against their affinity for dopamine receptors)]

25) What kind of relationship is being shown on this graph?





26) Be very careful answering this question. Remember: the horizontal axis is a little screwy. How would you describe the relationship between drug affinity for the receptor and effective dose of the drug?





27) Find the point for chlorpromazine (top center of graph). How would you describe the residual for chlorpromazine, compared to other drugs?





Problem Four

Here is another graph showing an effect that you should be familiar with. Once again, look at the axes carefully. Then answer the questions that follow.

[Figure: brain_body_mass (brain mass plotted against body mass)]

28) What kind of relationship is being shown on this graph?





29) The point for "modern man" is well above the regression line. Does that mean we have bigger brains than the model predicts or smaller brains?





30) Was this a balanced design?





Problem Five

[Figure: sparrows_scatter (scatterplot with a smoothed regression line: nesting pairs of house sparrows vs. density of pedestrian foot traffic)]

Above is a scatterplot (with a smoothed regression line) of data from a study by Fernandez-Juricic, E., A. Sallent, R. Sanz, and I. Rodriguez-Prieto. 2003. Testing the risk-disturbance hypothesis in a fragmented landscape: non-linear responses of house sparrows to humans. Condor, 105, 316-326. The researchers observed the number of nesting house sparrows (sp$pairs) in public parks in Madrid, Spain, and correlated that with the density of human foot traffic (sp$pedestrians) in the park. They found that, at first, as the density of foot traffic increased, nesting also increased, possibly because humans are a source of food for the birds. Eventually, however, the trend reversed, and nesting declined as human foot traffic continued to increase. Thus, as the authors themselves say in the title of their article, the relationship is nonlinear.
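
One common way such a rise-then-fall (inverted-U) trend is modeled is by adding a squared term to the regression. A minimal sketch, assuming the data are already in a data frame called sp with the columns named above (this is only an illustration, not the authors' analysis):

quad.out = lm(pairs ~ pedestrians + I(pedestrians^2), data=sp)   # quadratic (second-degree) model
summary(quad.out)   # a negative coefficient on the squared term produces the inverted-U shape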

Answer question 31 based on this information.


31) Could the relationship between nesting pairs of sparrows and pedestrian foot traffic be made linear by a log transform of one or both variables?





Problem Six

> rm(list=ls())
> file = "shyness.txt"   # I have this data file on my Desktop; it's also at the website
> shy = read.table(file=file, header=T)
> head(shy)
  SAD Shyness LOC Age Sex
1   5      39  13  22   1
2  14      36   9  23   0
3  16      48  18  21   0
4   1      36   3  19   0
5  13      57  19  21   0
6  13      40  16  20   0
> summary(shy)
      SAD            Shyness           LOC             Age             Sex        
 Min.   : 0.000   Min.   : 7.00   Min.   : 1.00   Min.   :18.00   Min.   :0.0000  
 1st Qu.: 3.000   1st Qu.:27.75   1st Qu.: 9.00   1st Qu.:19.00   1st Qu.:0.0000  
 Median : 6.000   Median :32.50   Median :11.00   Median :21.00   Median :0.0000  
 Mean   : 6.764   Mean   :32.22   Mean   :10.72   Mean   :22.38   Mean   :0.2917  
 3rd Qu.:10.250   3rd Qu.:36.00   3rd Qu.:13.00   3rd Qu.:22.00   3rd Qu.:1.0000  
 Max.   :17.000   Max.   :59.00   Max.   :19.00   Max.   :60.00   Max.   :1.0000
> dim(shy)
[1] 72  5

These data are from Elizabeth Ostop (Psyc 497 Spring 2010). The variables are:

  • SAD: score on Social Avoidance and Distress Scale (high=more distress)
  • Shyness: score on Cheek and Buss Shyness Scale (high=more shyness)
  • LOC: score on Rotter's Locus of Control Scale (high=external)
  • Age: in years
  • Sex: gender coded 0=Female, 1=Male

We are interested in the relationship between Shyness (as the response) and SAD and LOC (as the explanatory variables or predictors). Our hypotheses are that both predictors are positively related to the response. The reasoning behind SAD is obvious. The more social anxiety and distress a person has, the shyer s/he will be. The reasoning behind LOC is a bit more complex, but not much. Locus of control describes how a person believes her destiny or fate is determined. People who are "internal" (low scores on this scale) believe they are in control of their own fate. People who are "external" (high scores on this scale) believe their fate is determined by external forces. An "internal" person should have confidence in herself in social situations because she believes she can handle it, come what may. A person who is external, on the other hand, believes the outcome of a social interaction is beyond his control and, therefore, is nervous and shy in social situations. Here is the regression analysis. (The interaction is not significant. I checked, p=0.266. I also checked to make sure both predictors are related to the response. Both have a significant positive correlation with Shyness: LOC, r=0.2322, p=0.0497; SAD, r=0.4748, p<.001.) The null hypothesis for both predictors would be that the regression coefficient (slope) is 0.
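
A minimal sketch of how the preliminary checks described above might be run (not necessarily the exact commands I used):

int.out = lm(Shyness ~ LOC * SAD, data=shy)   # model including the LOC:SAD interaction
summary(int.out)                              # the p-value on the LOC:SAD line tests the interaction
cor.test(~ Shyness + LOC, data=shy)           # r = 0.2322, p = 0.0497 reported above
cor.test(~ Shyness + SAD, data=shy)           # r = 0.4748, p < .001 reported above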

> cor(shy)
               SAD    Shyness        LOC         Age         Sex
SAD     1.00000000 0.47481350 0.32425026  0.09381815  0.02515235
Shyness 0.47481350 1.00000000 0.23221044  0.07019303  0.09958598
LOC     0.32425026 0.23221044 1.00000000  0.13700285  0.03268946
Age     0.09381815 0.07019303 0.13700285  1.00000000 -0.01684174
Sex     0.02515235 0.09958598 0.03268946 -0.01684174  1.00000000
>
> lm.out = lm(Shyness ~ LOC + SAD, data=shy)
> summary(lm.out)

Call:
lm(formula = Shyness ~ LOC + SAD, data = shy)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.675  -3.663   1.037   3.674  19.120 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.7609     2.8208   8.778 7.57e-13 ***
LOC           0.2047     0.2610   0.784 0.435589    
SAD           0.7787     0.1945   4.004 0.000155 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.507 on 69 degrees of freedom
Multiple R-squared:  0.2323,	Adjusted R-squared:   0.21 
F-statistic: 10.44 on 2 and 69 DF,  p-value: 0.0001095

32) What kind of analysis is this?





33) What proportion of the variability in Shyness scores is accounted for ("explained") by this regression model?

34) What is the correct decision regarding the null hypothesis for LOC?





35) Overall, is this regression model better at predicting Shyness scores than just using the mean Shyness score as a prediction for everyone, and how do you know (cite a relevant test)?





36) What is the regression equation?





37) Predict a Shyness score for someone with LOC = 11 and SAD = 6.

38) Calculate a variance inflation factor for the two predictors. Based on VIF, what can you say about the impact of variance inflation in this analysis?





Here is some additional information about the data. Using this information, answer question 39.

> apply(shy[,1:3], 2, sd)   # standard deviations
     SAD  Shyness      LOC 
4.842573 8.445721 3.608371

39) What can you say about the relative importance of SAD and LOC in predicting Shyness?





Here is an additional analysis of this problem.

> aov.out = aov(Shyness ~ LOC + SAD, data=shy)
> summary(aov.out)
            Df Sum Sq Mean Sq F value   Pr(>F)    
LOC          1    273   273.1   4.846 0.031052 *  
SAD          1    903   903.3  16.031 0.000155 ***
Residuals   69   3888    56.3                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

40) Why is LOC a significant predictor here when it was not when the lm() function was used to do the analysis?





Problem Seven

# loneliness.txt
# These data are from a Psyc 497 project (Parris Claytor, Fall 2011).
# The subjects were given three tests, one of embarrassability ("embarrass"),
# one of sense of emotional isolation ("emotiso"), and one of sense of social
# isolation ("socialiso"). One case was deleted (by me) because of missing
# values on all variables. Higher values on these variables mean more.

> file = "http://ww2.coastal.edu/kingw/psyc480/data/loneliness.txt"
> lone = read.table(file=file, header=T)
> summary(lone)
   embarrass         emotiso        socialiso 
 Min.   : 32.00   Min.   : 0.00   Min.   : 0  
 1st Qu.: 52.75   1st Qu.: 5.75   1st Qu.: 2  
 Median : 65.00   Median :11.00   Median : 6  
 Mean   : 64.90   Mean   :12.07   Mean   : 7  
 3rd Qu.: 76.25   3rd Qu.:16.00   3rd Qu.:10  
 Max.   :111.00   Max.   :40.00   Max.   :31

We wish to know how emotional isolation (emotiso), the response, is related to embarrassability (embarrass) and social isolation (socialiso), the predictors. NOTE: be sure to check for an interaction.
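
A minimal sketch of one way to set up this analysis (run it yourself; the questions below refer to the output you get):

lm.out = lm(emotiso ~ embarrass * socialiso, data=lone)   # main effects plus the interaction
summary(lm.out)    # coefficients, residual summary, R-squared
confint(lm.out)    # 95% confidence intervals for the coefficients
plot(lm.out, 4)    # Cook's Distance (see the hint in question 48)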

41) Is the interaction statistically significant? If yes, type the p-value in the box. If no, type "no" (no quotes) in the box.

In either case, the interaction is to be retained in the model.

42) What is the median of the residuals?

43) In this model, what is the effect (coefficient) of embarrass when socialiso = 0?

44) In this model, what is the effect of embarrass when socialiso is at its median value?

45) This model accounts for what proportion of the variability in emotiso?

46-47) If this model is correct, then we can be 95% confident that the true coefficient (i.e., the value in the population) for socialiso lies between what two values?
LL = UL =

48) In this model, which case number has the largest Cook's Distance? Hint: plot(lm.out, 4).

49) To the nearest tenth (one decimal place), what is this Cook's Distance?

50) Is this something we should be concerned about (yes/no)?



Message or Note (if any; keep it short!):


FINISHED:

CAUTION: If you click the submit button before indicating you are finished,
your answers will be reset and you'll have to do the whole thing over again!

============================
============================