Type your first and last name and your CCU email address in the boxes indicated. No name, no credit! Without your CCU email address I cannot return your exam to you.
First Name: Last Name:
Your Coastal e-mail address: CCU emails ONLY!
STOP! There is a good chance ITS will not send me your answers if you don't fill in the boxes above, so fill them in now.
Further Instructions. DO NOT press the Enter key after answering a question. Pressing Enter will cause the HTML form to try to submit your answers. Don't blame me. I think it's stupid. Blame the people who designed HTML forms. Bottom line: keep your hands off the Enter key while this window is in focus in your browser.
Still More Instructions. When entering a numeric answer that comes from the given analysis, enter it into the box EXACTLY as R has printed it out. When entering a numeric answer that you've had to calculate, enter it with at least two accurate decimal places, unless it's an integer, in which case enter it as such.
Problem One
Here is my analysis of the airpassengers.txt data, which is available at the website. This dataset gives the number of international airline passengers IN THOUSANDS from January 1949 to December 1960. Those counts are in the "AirPass" variable. Months are numbered consecutively from 1 to 144 (12 years); those numbers are in the variable "months." You can read a little more about this dataset at the website by using the blue data retrieval box if you want to.
> rm(list=ls())
> file = "http://ww2.coastal.edu/kingw/psyc480/data/airpassengers.txt"
> AP = read.table(file=file, header=T, row.names=1)
> 
> head(AP)
      AirPass months
Jan49     112      1
Feb49     118      2
Mar49     132      3
Apr49     129      4
May49     121      5
Jun49     135      6
> 
> dim(AP)
[1] 144   2
> 
> summary(AP)
    AirPass          months      
 Min.   :104.0   Min.   :  1.00  
 1st Qu.:180.0   1st Qu.: 36.75  
 Median :265.5   Median : 72.50  
 Mean   :280.3   Mean   : 72.50  
 3rd Qu.:360.5   3rd Qu.:108.25  
 Max.   :622.0   Max.   :144.00  
> 
> cor(AP)
          AirPass    months
AirPass 1.0000000 0.9239254
months  0.9239254 1.0000000
> 
> with(AP, scatter.smooth(AirPass ~ months))   # graph appears below
> 
> lm.out = lm(AirPass ~ months, data=AP)
> summary(lm.out)

Call:
lm(formula = AirPass ~ months, data = AP)

Residuals:
    Min      1Q  Median      3Q     Max 
-93.858 -30.727  -5.757  24.489 164.999 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 87.65278    7.71635   11.36   <2e-16 ***
months       2.65718    0.09233   28.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 46.06 on 142 degrees of freedom
Multiple R-squared:  0.8536,    Adjusted R-squared:  0.8526 
F-statistic: 828.2 on 1 and 142 DF,  p-value: < 2.2e-16

> confint(lm.out)
                2.5 %     97.5 %
(Intercept) 72.39902 102.906537
months       2.47466   2.839708
> 
> par(mfrow=c(2,2))
> plot(lm.out, 1:4)
You should know what these commands do by now, so I'm not explaining them.
1) What is the smallest value in the AirPass variable?
2) Judging from the summary information given for the data frame, what would you say the shape of the distribution of the AirPass variable is?
3) In the regression analysis, which variable was the response variable (DV)?
4) What kind of regression was done?
5) What is the value of the correlation coefficient for the correlation between months and AirPass?
6) From this regression model, we would expect (predict) how many more international air passengers each month?
7) Make a prediction for the value of the AirPass variable in December 1961.
8) What is obvious from the first graph (on the left)?
9) The line on the graph is not the least squares regression line but rather R's best guess as to what the regression line or curve looks like, done by a computer-intensive smoothing technique. This line is called a(n):
10) What are the second graphs (on the right) called?
11) In the graph labeled Residuals vs Fitted, the points seem to fan out from left to right. What does this mean?
12) Which point was influential in this analysis by the D > 0.5 criterion?
Problem Two
The following analysis is of the data in the file wine.txt, which you have seen before, and which you can retrieve from the website. These data show average consumption of wine for 19 countries (in liters per capita) and yearly deaths from heart disease (per 100,000 population). Wine consumption is measured as liters of alcohol consumed from drinking wine per person. The obvious question is whether or not there is a relationship between wine drinking and death from heart disease. (Data are from M.H. Criqui, M.D., Dept. of Family and Preventive Medicine, UC San Diego, and were reported in the New York Times, 28 Dec 1994.)
> rm(list=ls())   # clears your workspace
> file = "http://ww2.coastal.edu/kingw/psyc480/data/wine.txt"
> WINE = read.table(file=file, header=T, row.names=1)
> WINE
               wine hd.deaths
Australia       2.5       211
Austria         3.9       167
Belgium         2.9       131
Canada          2.4       191
Denmark         2.9       220
Finland         0.8       297
France          9.1        71
Iceland         0.8       211
Ireland         0.7       300
Italy           7.9       107
Netherlands     1.8       167
New.Zealand     1.9       266
Norway          0.8       227
Spain           6.5        86
Sweden          1.6       207
Switzerland     5.8       115
United.Kingdom  1.3       285
United.States   1.2       199
West.Germany    2.7       172
> 
> par(mfrow=c(2,2))
> with(WINE, scatter.smooth(hd.deaths ~ wine, main="linear"))
> with(WINE, scatter.smooth(hd.deaths ~ log(wine), main="logarithmic"))
> with(WINE, scatter.smooth(log(hd.deaths) ~ wine, main="exponential"))
> with(WINE, scatter.smooth(log(hd.deaths) ~ log(wine), main="power"))
> # These graphs are on the left below.
> 
> lm.out = lm(log(hd.deaths) ~ wine, data=WINE)
> summary(lm.out)

Call:
lm(formula = log(hd.deaths) ~ wine, data = WINE)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32397 -0.12238 -0.02119  0.18010  0.23573 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.63010    0.06529  86.237  < 2e-16 ***
wine        -0.14860    0.01679  -8.852 8.96e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1787 on 17 degrees of freedom
Multiple R-squared:  0.8217,    Adjusted R-squared:  0.8112 
F-statistic: 78.36 on 1 and 17 DF,  p-value: 8.96e-08

> plot(lm.out, 1:4)
> # These graphs are on the right below.
> 
> ## Regression equation
> ## log(hd.deaths.hat) = 5.6301 - 0.1486 * wine
> ## hd.deaths.hat = exp(5.6301 - 0.1486 * wine)          # NOTE: exp() is the antilog of log()
> ## -or- hd.deaths.hat = 278.6914 * exp(-0.1486 * wine)  # NOTE: exp is e to the power of
> 
> ## Prediction for the United States
> exp(5.6301 - 0.1486 * 1.2)
[1] 233.1728
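As an aside, the same prediction can also be obtained from the fitted model object with predict(). This is just a sketch, not part of the analysis above; the name new.point is made up for illustration, and the result should agree with the 233.17 computed by hand above.

> ## Sketch only: predict() returns the prediction on the log scale,
> ## so exp() is needed to back-transform it to hd.deaths.
> new.point = data.frame(wine = 1.2)           # wine value for the United States
> exp(predict(lm.out, newdata = new.point))    # should agree with 233.1728 above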
13) Were base-10 logs used in this analysis, and if not does that make a difference?
14) The last thing I did in this analysis was to calculate a prediction of hd.deaths for the United States. (Yes, the calculation was done correctly!) From this, is it possible to say whether the point for the United States on the scatterplot is above, below, or on the regression line? (The observed value for the United States is given in the table at the top of the output.)
15) Using the same two numbers you used to answer question 14, calculate a residual for the United States.
16) What proportion of the variability in log(hd.deaths) is accounted for by the relationship to wine? (Be careful. It says proportion, not percentage.)
17) In the first image above (on the left) are four scatterplots of the possible models for these data. Based on these scatterplots, which model would you say is best?
18) Which model was actually calculated in the regression I did?
19) Which of the following is a correct regression equation from the model that was actually calculated? (Hint: If you know what this kind of relationship looks like, you'll know there is only one possible right answer among the following alternatives.)
20) In the second graphs above (on the right), which is a test of the assumption of "no influential points"?
21) In the second graphs above (on the right), which is a test of the assumption of "linear relationship"?
22) In the second graphs above (on the right), which is a test of the assumption of "normal distribution of residuals"?
23) Based on this analysis alone, what would you conclude about the relationship between drinking wine and dying of heart disease?
24) Notice that there is not a single country from Africa, Asia, or South America in the sample. What would you say about that?
Problem Three
Examine the following graph carefully. It is from a very famous study testing the theory that antipsychotic drugs alleviate the symptoms of schizophrenia by blocking dopamine receptors in the brain. The horizontal axis shows the affinity of the drug for dopamine receptors, i.e., how well the drug binds to and, thereby, blocks dopamine receptors. The trick on this axis is that less affinity is to the right, and more affinity (better binding) is toward the left. The vertical axis shows the typical effective daily dose of the drug (the best dose for alleviating symptoms). Higher doses are at the top. Take careful note of the scale on these axes. Then answer the following questions. (Note: drugs you might recognize from other classes are chlorpromazine or Thorazine, thioridazine or Mellaril, and haloperidol or Haldol.)
25) What kind of relationship is being shown on this graph?
26) Be very careful answering this question. Remember: the horizontal axis is a little screwy. How would you describe the relationship between drug affinity for the receptor and effective dose of the drug?
27) Find the point for chlorpromazine (top center of graph). How would you describe the residual for chlorpromazine, compared to other drugs?
Problem Four
Here is another graph showing an effect that you should be familiar with. Once again, look at the axes carefully. Then answer the questions that follow.
28) What kind of relationship is being shown on this graph?
29) The point for "modern man" is well above the regression line. Does that mean we have bigger brains than the model predicts or smaller brains?
30) Was this a balanced design?
Problem Five
At left is a scatterplot (with a smoothed regression line) of data from a study by Fernandez-Juricic, E., A. Sallent, R. Sanz, and I. Rodriguez-Prieto. 2003. Testing the risk-disturbance hypothesis in a fragmented landscape: non-linear responses of house sparrows to humans. Condor, 105, 316-326. The researchers observed the number of nesting house sparrows (sp$pairs) in public parks in Madrid, Spain, and correlated that with the density of human foot traffic (sp$pedestrians) in the park. They found that, at first, as the density of foot traffic increased, nesting also increased, possibly because humans are a source of food for the birds. Eventually, however, the trend reversed, and nesting declined as human foot traffic continued to increase. Thus, as the authors themselves say in the title of their article, the relationship is nonlinear.
Answer question 31 based on this information.
31) Could the relationship between nesting pairs of sparrows and pedestrian foot traffic be made linear by a log transform of one or both variables?
Problem Six
> rm(list=ls())
> file = "shyness.txt"   # I have this data file on my Desktop; it's also at the website
> shy = read.table(file=file, header=T)
> head(shy)
  SAD Shyness LOC Age Sex
1   5      39  13  22   1
2  14      36   9  23   0
3  16      48  18  21   0
4   1      36   3  19   0
5  13      57  19  21   0
6  13      40  16  20   0
> summary(shy)
      SAD            Shyness           LOC             Age             Sex        
 Min.   : 0.000   Min.   : 7.00   Min.   : 1.00   Min.   :18.00   Min.   :0.0000  
 1st Qu.: 3.000   1st Qu.:27.75   1st Qu.: 9.00   1st Qu.:19.00   1st Qu.:0.0000  
 Median : 6.000   Median :32.50   Median :11.00   Median :21.00   Median :0.0000  
 Mean   : 6.764   Mean   :32.22   Mean   :10.72   Mean   :22.38   Mean   :0.2917  
 3rd Qu.:10.250   3rd Qu.:36.00   3rd Qu.:13.00   3rd Qu.:22.00   3rd Qu.:1.0000  
 Max.   :17.000   Max.   :59.00   Max.   :19.00   Max.   :60.00   Max.   :1.0000  
> dim(shy)
[1] 72  5
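If you don't have the file on your Desktop, it can presumably be read directly from the website like the other data files in this exam. This is a hedged sketch; the exact URL is an assumption based on where the other data files live:

> ## Assumed URL (follows the pattern of the other psyc480 data files):
> file = "http://ww2.coastal.edu/kingw/psyc480/data/shyness.txt"
> shy = read.table(file=file, header=T)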
These data are from Elizabeth Ostop (Psyc 497, Spring 2010). The variables are SAD (social anxiety and distress), Shyness, LOC (locus of control), Age, and Sex (coded 0/1).
We are interested in the relationship between Shyness (as the response) and SAD and LOC (as the explanatory variables or predictors). Our hypotheses are that both predictors are positively related to the response. The reasoning behind SAD is obvious: the more social anxiety and distress a person has, the shyer s/he will be. The reasoning behind LOC is a bit more complex, but not much. Locus of control describes how a person believes her destiny or fate is determined. People who are "internal" (low scores on this scale) believe they are in control of their own fate. People who are "external" (high scores on this scale) believe their fate is determined by external forces. An "internal" person should have confidence in herself in social situations because she believes she can handle it, come what may. A person who is "external," on the other hand, believes the outcome of a social interaction is beyond his control and, therefore, is nervous and shy in social situations. Here is the regression analysis. (The interaction is not significant; I checked, p=0.266. I also checked to make sure both predictors are related to the response. Both have a significant positive correlation with Shyness: LOC, r=0.2322, p=0.0497; SAD, r=0.4748, p<.001. A sketch of how these checks could be run appears after the regression output below.) The null hypothesis for both predictors is that the regression coefficient (slope) is 0.
> cor(shy)
               SAD    Shyness        LOC         Age         Sex
SAD     1.00000000 0.47481350 0.32425026  0.09381815  0.02515235
Shyness 0.47481350 1.00000000 0.23221044  0.07019303  0.09958598
LOC     0.32425026 0.23221044 1.00000000  0.13700285  0.03268946
Age     0.09381815 0.07019303 0.13700285  1.00000000 -0.01684174
Sex     0.02515235 0.09958598 0.03268946 -0.01684174  1.00000000
> 
> lm.out = lm(Shyness ~ LOC + SAD, data=shy)
> summary(lm.out)

Call:
lm(formula = Shyness ~ LOC + SAD, data = shy)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.675  -3.663   1.037   3.674  19.120 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.7609     2.8208   8.778 7.57e-13 ***
LOC           0.2047     0.2610   0.784 0.435589    
SAD           0.7787     0.1945   4.004 0.000155 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.507 on 69 degrees of freedom
Multiple R-squared:  0.2323,    Adjusted R-squared:  0.21 
F-statistic: 10.44 on 2 and 69 DF,  p-value: 0.0001095
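For reference, the preliminary checks mentioned above (the interaction test and the two correlations) could be run with standard commands like these. This is just a sketch; the actual commands and their output are not part of the exam analysis.

> ## Sketch only: standard ways to run the checks described above.
> summary(lm(Shyness ~ LOC * SAD, data=shy))   # LOC*SAD adds the LOC:SAD interaction term
> with(shy, cor.test(Shyness, LOC))            # correlation of LOC with the response
> with(shy, cor.test(Shyness, SAD))            # correlation of SAD with the response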
32) What kind of analysis is this?
33) What proportion of the variability in Shyness scores is accounted for ("explained") by this regression model?
34) What is the correct decision regarding the null hypothesis for LOC?
35) Overall, is this regression model better at predicting Shyness scores than just using the mean Shyness score as a prediction for everyone, and how do you know (cite a relevant test)?
36) What is the regression equation?
37) Predict a Shyness score for someone with LOC = 11 and SAD = 6.
38) Calculate a variance inflation factor for the two predictors. Based on VIF, what can you say about the impact of variance inflation in this analysis?
Here is some additional information about the data. Using this information, answer question 39.

> apply(shy[,1:3], 2, sd)   # standard deviations
     SAD  Shyness      LOC 
4.842573 8.445721 3.608371 
39) What can you say about the relative importance of SAD and LOC in predicting Shyness?
Here is an additional analysis of this problem.
> aov.out = aov(Shyness ~ LOC + SAD, data=shy)
> summary(aov.out)
            Df Sum Sq Mean Sq F value   Pr(>F)    
LOC          1    273   273.1   4.846 0.031052 *  
SAD          1    903   903.3  16.031 0.000155 ***
Residuals   69   3888    56.3                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
40) Why is LOC a significant predictor here when it was not when the lm() function was used to do the analysis?
Problem Seven
# loneliness.txt
# These data are from a Psyc 497 project (Parris Claytor, Fall 2011).
# The subjects were given three tests, one of embarrassability ("embarrass"),
# one of sense of emotional isolation ("emotiso"), and one of sense of social
# isolation ("socialiso"). One case was deleted (by me) because of missing
# values on all variables. Higher values on these variables mean more.

> file = "http://ww2.coastal.edu/kingw/psyc480/data/loneliness.txt"
> lone = read.table(file=file, header=T)
> summary(lone)
   embarrass         emotiso        socialiso  
 Min.   : 32.00   Min.   : 0.00   Min.   : 0   
 1st Qu.: 52.75   1st Qu.: 5.75   1st Qu.: 2   
 Median : 65.00   Median :11.00   Median : 6   
 Mean   : 64.90   Mean   :12.07   Mean   : 7   
 3rd Qu.: 76.25   3rd Qu.:16.00   3rd Qu.:10   
 Max.   :111.00   Max.   :40.00   Max.   :31   
We wish to know how emotional isolation (emotiso), the response, is related to embarrassability (embarrass) and social isolation (socialiso), the predictors. NOTE: be sure to check for an interaction.
41) Is the interaction statistically significant? If yes, type the p-value in the box. If no, type "no" (no quotes) in the box.
In either case, the interaction is to be retained in the model.
42) What is the median of the residuals?
43) In this model, what is the effect (coefficient) of embarrass when socialiso = 0?
44) In this model, what is the effect of embarrass when socialiso is at its median value?
45) This model accounts for what proportion of the variability in emotiso?
46-47) If this model is correct, then we can be 95% confident that the true coefficient (i.e., the value in the population) for socialiso lies between what two values? LL = UL =
48) In this model, which case number has the largest Cook's Distance? Hint: plot(lm.out, 4).
49) To the nearest tenth (one decimal place), what is this Cook's Distance?
50) Is this something we should be concerned about (yes/no)?
CAUTION: If you click the submit button before indicating you are finished, your answers will be reset and you'll have to do the whole thing over again!