Psyc 480 -- Dr. King

Simple Correlation and Regression

Instructions. FOR THIS EXERCISE, WHEN YOU TYPE AN ANSWER INTO A BOX, IF THE ANSWER IS A WORD OR TWO, TYPE IT IN ALL LOWER CASE LETTERS WITH NO UNNECESSARY SPACES. IF THE ANSWER IS A WHOLE NUMBER, SUCH AS NUMBER OF SUBJECTS, DEGREES OF FREEDOM, AND SO FORTH, TYPE IT AS A WHOLE NUMBER. IF IT'S A DECIMAL NUMBER, TYPE IT ROUNDED TO THREE (3) DECIMAL PLACES, NO MORE, NO LESS, EVEN IF THOSE DECIMAL PLACES ARE ZEROS. EXAMPLE: 250.1 WOULD BE WRONG. CORRECT WOULD BE 250.100. IF THE NUMBER IS LESS THAN 1, SUCH AS A P-VALUE OR A CORRELATION COEFFICIENT, TYPE IT WITH A ZERO BEFORE THE DECIMAL POINT. EXAMPLE: 0.050, NOT .05 OR .050. IF THE P-VALUE IS ALL ZEROS WHEN ROUNDED TO THREE DECIMAL PLACES, THEN TYPE IT THAT WAY. FOR EXAMPLE, p-value = 0.000416 WOULD BE TYPED 0.000, BUT IF IT WERE p-value = 0.000516, IT WOULD BE TYPED 0.001. IF YOU CLICK THE CHECK MARK BUTTON AND IT SAYS YOUR ANSWER IS WRONG, IT MIGHT BE WRONG FOR TWO REASONS: IT'S JUST WRONG, OR YOU FORMATTED IT INCORRECTLY. THE Get Answer BUTTON WILL TELL YOU.

Further Instructions: Go back and read the instructions if you skipped them.

Still More Instructions: Each answer box is accompanied by two buttons, a check mark button which will just tell you if your answer is right or wrong, and a Get Answer button which will give you the answer (and occasionally an explanation of the answer). I would advise you to practice getting the right answers before clicking the Get Answer button.

The good news is, R is going to do almost all your calculations for you in this correlation and regression material, which will last for the rest of the semester.

I am going to leave the command prompts off of R commands in this exercise so that you can copy and paste them into R if you're having difficulty getting them typed correctly. I will try to do things the same in R Console as you would do them in Rweb, but there are unavoidable differences.

If you need to ask me a question about R, about why a certain command isn't working or about what an error or a warning message means, DO NOT send me a screenshot or a photo of your computer screen! Copy and paste the command and the error or warning you got into an email, and I'll get you straightened out. With any luck.

Today we are going to look at Franz Messerli's data on chocolate consumption and Nobel prizes. To begin with anyway.

Before we begin, however, if you want an education on how such results are dealt with in the popular press (usually pretty stupidly), just google "Franz Messerli chocolate data". One of the better and more interesting articles is this one, which tells a little history of how this bit of research came about.

Messerli's hypothesis was developed unusually clearly in his introduction. (If you've read many journal articles, you know that's not always the case!) Certain nutrients in chocolate have been shown to improve cognitive function. Therefore, populations that eat more chocolate should be, on the average, smarter than those that eat less chocolate. Smarter populations will have more people winning Nobel Prizes. He states, "Total number of Nobel laureates per capita could serve as a surrogate end point reflecting the proportion with superior cognitive function and thereby give us some measure of the overall cognitive function of a given country."

Number of Nobel laureates from a country is easy data to get. He got it off the Internet. Other measures of cognitive function by country would be difficult to get, although some of those might be better measures. The usual term that is used here for these kinds of measures, measures used in place of some other measure that is better but unavailable, is proxy. Number of Nobel laureates per capita in a country is a proxy for the average level of cognitive functioning in that country.

This study has been widely criticized for an obvious confound that Messerli mentions only briefly and then discounts near the end of the discussion section in his article, socioeconomic status. How might that work? People with money can not only afford better education but better health care and a host of other things that might result in them being "smarter." People with money can also afford chocolate, which is not cheap. That makes socioeconomic status a potential common cause, or confound, in this relationship. Remember why variables are correlated: spurious relationship (unlikely for such a strong relationship), direct cause and effect, mediated or indirect cause and effect, and common cause by a third variable. Messerli points this out at the beginning of his discussion section where he states, "Of course, a correlation between X and Y does not prove causation."

Generally, in correlational research like this, where we have reason to believe that a cause-and-effect relationship MIGHT exist between the explanatory variable and the response variable, we are very careful about how we talk about the result when we find the hypothesized correlation. To say the hypothesis was "confirmed" is considered too strong, and even "supported" is frowned upon by most people. Instead, we say the results are "consistent" with the hypothesis. Messerli thought there might be a correlation between chocolate consumption and cognitive function because eating chocolate has been shown to have this effect in other studies. His results were "consistent" with this idea. They certainly do not "prove" it, and even the word "confirmed" is too strong here. Applied strictly to a properly stated hypothesis, "confirmed" might be okay, but his data were only "consistent" with the idea that consumption and laureates are correlated because eating chocolate makes you smart.

Question 1. Examine Figure 1 in Messerli's article. What kind of graph is this? (One word. Remember, all lower case.)

Question 2. If we were to draw a line on this graph that best fits the points (flags) on the graph, what would we call this line? (Two words.)

The data can be obtained from the website this way. We are reading them into a data frame with a very simple(minded) name, so make sure there is nothing in your workspace that will interfere with this, and don't create anything else that has the same name as the data frame.

----------R commands----------
file = "http://ww2.coastal.edu/kingw/psyc480/data/chocolate.txt"
X = read.table(header=T, file=file)
attach(X)   # this command makes it unnecessary to use with(), $, or data=
##### but using attach() is risky!
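Since attach() is risky, here is a minimal sketch of the two safer alternatives it is saving us from. Both give the same result without attaching anything.

----------R commands----------
mean(X$consumption)          # extract the variable with $
with(X, mean(consumption))   # or evaluate the expression inside the data frame with with()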

Just to reassure ourselves that the file has been downloaded...

----------R commands----------
ls()
summary(X)

The summary command should reveal the following result.

----------R output----------
  consumption       laureates   
 Min.   : 0.800   Min.   : 0.1  
 1st Qu.: 3.700   1st Qu.: 2.7  
 Median : 4.600   Median : 9.1  
 Mean   : 5.752   Mean   :11.1  
 3rd Qu.: 8.600   3rd Qu.:17.3  
 Max.   :12.000   Max.   :30.3

Messerli says in his article that there were 22 countries in his sample, but if you count the flags on his scatterplot (carefully, because some of them are partially hidden), there are 23. It's the kind of mistake you should not make when you're writing a research paper like this! Nevertheless, there it is.

Question 3. From Messerli's graph, which country has the highest chocolate consumption? (Don't capitalize.)

Question 4. Messerli's hypothesis would then lead us to expect that this country also has the highest number of Nobel laureates. Does it? (yes or no)

Question 5. Find the point (flag) for the United States on Messerli's graph. How many countries in this data set had lower annual chocolate consumption than the U.S.? (Count carefully. Some of the flags are hidden behind others.)

Question 6. Of these countries with lower chocolate consumption, how many also have fewer Nobel laureates?

Questions 7-10. Fill in the following contingency table by counting flags on Messerli's graph.

                                        fewer Nobel laureates     more Nobel laureates
                                        than the United States    than the United States

lower chocolate consumption
than the United States                          ____                      ____

higher chocolate consumption
than the United States                          ____                      ____

(Each cell is a number of countries.)
Do you see the relationship now? Of course, reducing the data to frequency data like this throws away a lot of useful numeric information, which ordinarily we would not want to do. We could test the frequency table with a chi-square test to see if these observed frequencies are significantly different from the ones expected under the assumption of independence (chisq = 8.814, df = 1, p = 0.003), but correlation and regression analysis on the numeric data is a much more powerful technique.
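By the way, once you have filled in the table, R can run that chi-square test for you. This is just a sketch: the four 1s are placeholders, which you should replace with your counts from Questions 7-10, reading across the rows of the table above.

----------R commands----------
# replace the four placeholder 1s with your counts from Questions 7-10
tab = matrix(c(1, 1, 1, 1), nrow=2, byrow=T,
             dimnames=list(c("lower.consumption", "higher.consumption"),
                           c("fewer.laureates", "more.laureates")))
chisq.test(tab)   # note: by default R applies a continuity correction to 2x2 tables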

Old news! These data are ten to twenty years old, so I got to wondering if we would see the same relationship in newer data. The Internet is a wonderful place! I found two websites that allowed me to update most of the data. Here are the new data.

# https://en.wikipedia.org/wiki/List_of_countries_by_Nobel_laureates_per_capita
# https://www.statista.com/statistics/819288/worldwide-chocolate-consumption-by-country/
# chocolate consumption for 2017 and Nobel laureates for 2018
#
               consumption   laureates  scientific
China                  0.1       0.064       0.035
Japan                  1.2       2.202       1.887
Portugal               NA        1.943       0.972
Greece                 NA        1.795       NA
Brazil                 1.2       NA          0.047
Poland                 5.7       4.986       1.312
Spain                  NA        1.724       0.431
Italy                  NA        3.373       2.193
Canada                 NA        6.765       5.953
Belgium                5.6       8.697       5.218
Australia              4.9       4.844       4.440
Netherlands            5.1      11.707      11.121
United.States          4.4      11.721      10.711
France                 4.3      10.664       5.979
Sweden                 6.6      30.052      17.029
Finland                5.4       9.021       5.413
Austria                8.1      25.138      20.567
Denmark                4.9      24.329      17.378
Ireland                7.9      14.572       4.163
Norway                 5.8      24.284      14.944
United.Kingdom         7.6      19.429      16.373
Germany                7.9      13.245      11.180
Switzerland            8.8      32.771      29.260
Russia                 4.8       1.598       1.111
New.Zealand            5.0       6.316       6.316
South.Africa           0.9       1.742       0.697
Czech.Republic         4.9       4.706       2.823
Slovakia               5.2       NA          NA

The Nobel Prize data come from the same website Messerli used, but his chocolate websites either no longer exist or no longer have the data we need. So I found the statista.com website. This website, however, was missing data on five of the countries on Messerli's list: Portugal, Greece, Spain, Italy, and Canada. That's what the value of NA means in these lists: "not available" or "missing". NA is R's recognized missing value code. I added four new countries to take their place: Russia, New Zealand, South Africa, and Czech Republic.

Notice that when a country name has a space in it, I filled that space with a period. It's a good idea not to leave spaces in the names of things in R. While R can handle spaces under most circumstances, there are a few situations where they will cause problems. So spaces are usually filled either with periods or with an underscore character.
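In fact, R's own make.names() function does exactly this kind of repair. A quick illustration:

----------R commands----------
make.names("United States")   # R converts the space to a period: "United.States"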

There is also a new variable in this data set: scientific. Wikipedia has Nobel laureates per 10 million population for all prizes, and also just for the scientific prizes, Chemistry, Physics, Physiology or Medicine, and the Nobel Memorial Prize in Economic Sciences. I thought both might be interesting. (Did Messerli tell us which one he used?)

So we need to go back to R and get rid of that old data.

----------R Commands----------
detach(X)   ### very important that you do this first!
rm(X)

Removing an object from the workspace does not detach it from the search path, so make sure you do the detach() command first. Now, how do we get the new data in? Easy.

----------R commands----------
file = "http://ww2.coastal.edu/kingw/psyc480/data/chocolate_new.txt"
X = read.table(header=T, row.names=1, file=file)   # column 1, the country names, will be row names
head(X)   # see! custom row names!
attach(X)

Now...

----------R Commands----------
summary(X)   # just type this; what follows is output!

----------R Output----------
  consumption      laureates        scientific    
 Min.   :0.100   Min.   : 0.064   Min.   : 0.035  
 1st Qu.:4.600   1st Qu.: 2.495   1st Qu.: 1.456  
 Median :5.100   Median : 7.731   Median : 5.316  
 Mean   :5.057   Mean   :10.680   Mean   : 7.598  
 3rd Qu.:6.200   3rd Qu.:14.240   3rd Qu.:11.165  
 Max.   :8.800   Max.   :32.771   Max.   :29.260  
 NA's   :5       NA's   :2        NA's   :2 

Notice a new element at the bottom of these summaries, the number of NAs, or missing values, in each variable. Those missing values are going to cause us some minor difficulties, but you might as well get used to them, as most real world data sets have missing values for one reason or another.

Missing value codes are not always NA in other software packages. Some people use zeros in their data as missing values, a very bad idea. (Why?) Some people use ridiculous values like -99 or -999. The R convention of using NA makes a lot more sense.
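If you ever receive data that use one of those ridiculous codes, recoding it to NA takes one line. A minimal sketch, using a hypothetical vector that codes missing values as -99:

----------R commands----------
scores = c(12, -99, 15, 18, -99)   # hypothetical data with -99 as the missing value code
scores[scores == -99] = NA         # recode every -99 to R's NA
summary(scores)                    # now the NAs are counted, just like in the summary above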

If there were no warnings after attaching, then do this.

----------R Commands----------
cor(consumption, laureates)

Notice the answer is NA, or missing. What's up with that? The same thing will happen if you try this.

----------R Commands----------
mean(consumption)

R is smart enough to know that you cannot calculate a mean when you have missing values. Unless you somehow get rid of them. The solution for functions like mean(), var(), sd(), and so forth is to use an option called na.rm=T, which means "remove missing values equals TRUE."

----------R Commands----------
mean(consumption, na.rm=T)

Question 11. What is the mean of consumption with the missing values removed? Remember to round your answer to 3 decimal places.

The solution for the cor() function is a little different, because there are multiple ways to remove missing values when using cor(). To demonstrate them, we have to learn something new. (Oh boy!)

The cor() function can be used on entire data frames, as long as all the variables in the data frame are numeric. Try this.

----------R Commands----------
cor(X)

The output is a correlation matrix, but it is filled with NAs. There are two useful ways to get rid of them. This is the first one.

----------R Commands----------
cor(X, use="pairwise")

This will give you correlations between pairs of scores as long as that pair is complete. If it's not, that pair will be thrown out. For example, when calculating the consumption-laureates correlation, Spain was thrown out because Spain has no value for consumption. But Spain was used for the laureates-scientific correlation, because that pair of scores is complete (both scores are there). Now try this.

----------R Commands----------
cor(X, use="complete")

The result is a little different. That's because this option uses a case only if the entire case is complete. So when calculating the laureates-scientific correlation, Spain was thrown out because it has no value for consumption. Spain is not a complete case, and so it is not used in any of the correlations. For now it doesn't matter which of those we use. Eventually it will.
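If you want to see exactly which countries count as complete cases (the ones use="complete" keeps), base R has a function for that. A quick sketch:

----------R commands----------
complete.cases(X)        # TRUE for each country with no missing values
sum(complete.cases(X))   # how many countries use="complete" actually uses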

By the way, I hope you remember some very basic facts about correlations that you probably learned in Psyc 101 and, if not, certainly in your first stat course.

Notice that the correlations are still fairly high, up around 0.7, but not as high as Messerli found. Messerli's correlation was highly significant (p < 0.0001). Are ours (with the newer data)? Fortunately, there's a function for that.

----------R Commands----------
cor.test(consumption, laureates)

See the output in R. I'm not giving it to you! The cor.test() function uses pairwise complete cases (i.e., only the two scores actually involved in the correlation have to be there). The whole case does not have to be complete.

Question 12. What is the correlation between consumption and laureates? Remember to round to three decimal places. The correlation is given at the bottom of the output, in case you haven't found it.

Question 13. Is this correlation significantly different from zero (yes or no)?

Question 14. What is the p-value?

Question 15. Somewhat harder question. How many pairs of scores were used in calculating this correlation?

Notice there is also a 95% confidence interval for the true value of the correlation in the population (whatever the population might be in this case). You should remember what that means.

Let's move on to the regression analysis. We'll begin with the scatterplot.

----------R Commands----------
plot(laureates ~ consumption)
[figure: scatterplot]

Here is a series of yes-or-no questions for you to answer about the scatterplot.

Question 16. Is the relationship between laureates and consumption positive?

Question 17. Is it strong?

Question 18. Is it linear?

Let's get the regression line. Leave that graphics window open.

----------R Commands----------
lm.out = lm(laureates ~ consumption)
summary(lm.out)

----------R Output----------
Call:
lm(formula = laureates ~ consumption)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5873 -5.8784 -0.1248  2.9259 13.4870 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.1608     3.9912  -0.792 0.438174    
consumption   2.9888     0.7003   4.268 0.000416 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.241 on 19 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.4894,	Adjusted R-squared:  0.4625 
F-statistic: 18.21 on 1 and 19 DF,  p-value: 0.000416

To draw the regression line on the scatterplot, the scatterplot has to be visible on your screen. If you closed the graphics window, as I did, you'll have to redraw the scatterplot before you can draw the regression line.

----------R Commands----------
plot(laureates ~ consumption)   # to redraw the scatterplot if necessary
abline(lm.out)
[figure: scatterplot with regression line]

Before we go any further, let me remind you what a residual is. For any point on the scatterplot, its residual is its vertical distance from the regression line: residual = y - y.hat. I'm going to be talking about residuals in the next few paragraphs, so you need to be clear on what they are.
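If you'd rather see the numbers than eyeball them, R will hand you the predicted values and residuals from the model we just fit:

----------R commands----------
fitted(lm.out)      # the y.hat values, one for each country used in the fit
residuals(lm.out)   # y - y.hat for each of those countries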

Question 19. On the scatterplot I've blackened in one of the points. (Were you wondering about it?) Does that point have a positive or a negative residual?

Question 20. As nearly as you can estimate it from the graph (nearest whole number), what is the value of that residual?

One thing jumps out at me immediately on the scatterplot. In the lower left corner, the points are above the regression line (positive residuals). Towards the middle of the graph, most of the points are below the regression line (negative residuals). Then in the upper right corner, most of the residuals go positive again. That's the sign of a nonlinear trend. There is even a special graph, called a residuals plot, that is used to visualize this.

----------R Commands----------
plot(lm.out, 1)
[figure: residuals plot]

We very much prefer that red line to be straight and flat. Still, it's not too bad. We'll see much worse eventually. Another thing I notice on the scatterplot is that the points appear to be clustered close to the line for small consumption values (lower left) but spread out as consumption values become larger (towards the upper right). That is, the residuals tend to get bigger as we go left to right across the graph. We can also plot that.

----------R Commands----------
plot(lm.out, 3)
[figure: scale-location plot]

This is called a scale-location plot, and for some reason in R it doesn't plot the actual residuals but the square root of the absolute values of the standardized residuals. Doesn't matter. It amounts to the same thing. Again, what we want to see is a red line that is straight and flat. That it jumps up towards the right on that plot indicates that the residuals are getting larger as we go to the right. That's a problem called (get ready for this) heteroskedasticity. Remember in ANOVA we had an assumption called homogeneity of variance, which states that all groups should have the same variance in the population. Homogeneity of variance is actually a special case of homoskedasticity, which states that, in a regression analysis, the y-values (response or DV) should have the same variance (in the population) at all values of x (predictor or IV). We're going to continue to call that a homogeneity of variance assumption, and when it's violated, we'll call that heterogeneity of variance.
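If you're curious, you can calculate the quantity on that plot's vertical axis yourself. A sketch, using the rstandard() function to get the standardized residuals:

----------R commands----------
sqrt(abs(rstandard(lm.out)))   # the values on the scale-location plot's vertical axis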

In fact, regression analysis makes four important assumptions: linearity, normality of the residuals, homogeneity of variance (homoskedasticity), and independence of the observations.

These assumptions can be tested graphically by drawing what are called diagnostic plots, of which you have already seen two. It is possible to draw all four of them at once by doing this. First, close the graphics window if it is open. Then...

----------R Commands----------
par(mfrow=c(2,2))   # opens the graphics window again
plot(lm.out, 1:4)
[figure: diagnostic plots]

There is our residuals plot in the upper left corner, which is our graphical test of linearity. In the upper right corner, we have a Normal Q-Q plot of the residuals. You should remember, on such a plot, we want to see the points lining up in a straight line from the lower left to the upper right. We're not doing too badly on that one. It's not going to be perfect, of course, because samples are noisy. In the lower left corner, we see the scale-location plot, which evaluates the homogeneity assumption. (On the residuals plot, if the points fan out from left to right, that tells us the same thing, that we have heterogeneity of variance.) The new graph here is the one at lower right. This is called a Cook's Distance plot.

Cook's distance is a measure of how influential a point is in a regression analysis. The bigger Cook's D is, the more influential the point is. When we look at a Cook's distance plot, what we like to see is something that looks like a newly mowed lawn - all the lines having more or less the same height. If one or more lines stand out, then we need to look at the vertical axis to get Cook's D for that point. Here we have three points that R has flagged for us (by labeling them). However, when we look at the vertical axis for a value of Cook's D, we notice it is less than 0.5 for all these points. Once Cook's D reaches a value of 0.5, we start worrying about having a point that is too influential. If Cook's D is 1.0 or more, then we are beyond just worrying. Something has to be done about that! Here we are okay. The largest Cook's D appears to be just a little more than 0.2.
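You don't have to squint at the axis, by the way. The cooks.distance() function will give you the actual values behind the plot:

----------R commands----------
round(sort(cooks.distance(lm.out)), 3)   # Cook's D for each country, smallest to largest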

Let's not worry about assumptions right now. Let's look at that regression output again.

----------R Output----------
Call:
lm(formula = laureates ~ consumption)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5873 -5.8784 -0.1248  2.9259 13.4870 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.1608     3.9912  -0.792 0.438174    
consumption   2.9888     0.7003   4.268 0.000416 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.241 on 19 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared:  0.4894,	Adjusted R-squared:  0.4625 
F-statistic: 18.21 on 1 and 19 DF,  p-value: 0.000416

First of all, we'll look at the summary of the residuals. The median of the residuals is a little suspicious, but not too bad.

Question 21. What value would be ideal for the median of the residuals?

The largest residual is almost 13.5, a large value considering that the range of the DV values (laureates) is only about 0 to 32.8.

Question 22. Is the point that has that residual above or below the regression line?

Question 23. What value would be a typical magnitude for a residual (ignoring the sign of the residual)?

Question 24. What percentage of the variability in laureates is being "explained" (or accounted for) by the predictor, consumption?
% (two decimal places)

Question 25. Is this regression model significantly better than just using the mean of laureates as the prediction for all values of laureates (yes or no)?

Now let's take a look at the real meat of this output, the regression coefficients and the regression equation.

Question 26. What is the y-intercept of the regression equation?

Question 27. Is the y-intercept significantly different from zero (yes or no)?

Question 28. What is the slope of the regression equation?

Question 29. Is the slope significantly different from zero (yes or no)?

So we would write the regression equation as...

laureates.hat = -3.161 + 2.989 * consumption

where laureates.hat stands for predicted values of laureates. Remember, the regression equation does not calculate observed values of y but predicted values for any given x. Here is a way to interpret the regression coefficients: the slope says that each additional kilogram of annual per capita chocolate consumption raises the predicted number of laureates by about 2.989, and the intercept is the predicted number of laureates when consumption is zero (not a meaningful value here, since it's negative).

Question 30. If a country consumes 5 kg of chocolate per capita annually, what is the predicted number of laureates in that country?
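Once you've worked Question 30 out by hand, R's predict() function can check your arithmetic. A sketch, using the lm.out model we fitted above:

----------R commands----------
predict(lm.out, newdata=data.frame(consumption=5))   # laureates.hat when consumption = 5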

Now here's a hard question. We've just predicted number of laureates from a value for consumption of chocolate. Suppose we had a country with 20 laureates. Could we use the regression equation to predict how much chocolate is eaten in that country? Think about it!

The answer is absolutely not! Look at the regression equation. Is there a place to plug in number of laureates? There is not. There is only a place for laureates.hat, the predicted number of laureates. Look again. Is there a place to solve for predicted chocolate consumption? No, there is not. There is no consumption.hat in the equation, only observed values of consumption. Make a note of this now. The regression equation does not work backwards. You can go from x to y.hat. You cannot go from y to x.hat. Those variables are not in the equation.

So, one final question. Suppose I have a country where the predicted number of laureates is 20. Can I calculate the observed chocolate consumption in that country from the regression equation? Sure! Because both laureates.hat and consumption occur as variables in the equation.
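To see how, just solve the equation for consumption: 20 = -3.161 + 2.989 * consumption, so consumption = (20 + 3.161) / 2.989, which is about 7.749 kg per capita.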

For practice, try doing laureates ~ scientific. I'll give you the regression equation so you can check.
laureates.hat = 1.535 + 1.203 * scientific
Did you notice any problems with assumptions in this analysis?
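In case you want to check your work on that practice problem, here is a minimal sketch of the commands (lm.out2 is just a name I chose):

----------R commands----------
lm.out2 = lm(laureates ~ scientific)
summary(lm.out2)
par(mfrow=c(2,2))    # then look at the diagnostic plots
plot(lm.out2, 1:4)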

One more thing. Don't forget!

----------R Commands----------
detach(X)

If you do forget, it's not a disaster. You haven't broken anything. R will detach everything when you shut it down. But you WILL have a serious problem if you continue working and read in another dataset with variables that have the same names, or name a new data frame X. Best to detach and then clear your workspace if you are going to continue working.