Psyc 480 -- Dr. King

One More Example of Simple ANOVA (and some new stuff) - Test Yourself

[image: a chess game]

What is it that makes experts better than the rest of us at a task? Is it some general ability that they excel in, or is it a task-specific skill that they have developed through practice? Chess has been the subject of psychological studies of expertise for 75 years. In chess, expertise can be easily quantified and recognized, which makes it an ideal venue for such studies. One of the first researchers to take advantage of this was a Dutch psychologist and chess master named Adriaan de Groot (1946, trans. 1965). He found that chess masters do not have better memories than anyone else, except when it comes to chess-related material such as board positions. His study was expanded upon by Chase and Simon (1973; Perception in chess. Cognitive Psychology, 4, 55-81). They studied novice, average, and expert chess players, examining their ability to reproduce chess positions that they were able to examine only briefly (a few seconds). Two types of positions were used: positions that arose from a real game of chess, and positions that were produced by randomly placing pieces on the board. The measured variable was how well the subjects were able to recall the positions they saw. The following data display some of the same effects seen by Chase and Simon. Only data from the real chess positions are presented here. The null hypothesis is that novice, average, and expert chess players have the same level of skill at remembering these chess positions after being briefly exposed to them and, therefore, can recall and reproduce such positions with equal ability (equal mean recall scores).


IMPORTANT INSTRUCTION: If you are using Firefox, never ever click the box that says "Prevent this page..." etc. That will stop the check boxes from working and prevent you from completing this exercise.


The following data show the number of test positions successfully recalled by players at each level of ability.

novice
 36 27 57 32 37 59 32 65 29

average
 71 39 44 77 68 69 42 40 49

expert
 65 70 96 98 96 73 67 70 87

Our analysis is going to be ANOVA, and for ANOVA we can't have each group in a separate variable (no matter what stat software we're using). We need the DV (the measured values, those numbers) in one variable, and the IV (the names of the groups) in another variable. Furthermore, our IV variable has to have a group name for every data value in the DV variable. That is, the IV variable and the DV variable have to be the same length.

Data have to be entered. It's tedious, and no one likes to do it, but it's a fact of life if you're doing statistics, kind of like cutting your toenails. This is a small dataset, so putting it into a spreadsheet or a script would be overkill, in my opinion. I recommend the following.

> recall.scores = scan()
1: 36 27 57 32 37 59 32 65 29     # put a space between each value (or copy and paste)
10: 71 39 44 77 68 69 42 40 49    # press Enter only after entering a group
19: 65 70 96 98 96 73 67 70 87    # press Enter twice to terminate data entry
28: 
Read 27 items
> summary(recall.scores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  27.00   39.50   65.00   59.07   70.50   98.00

DO NOT put commas between the data values--spaces only in scan(). Now for the IV. We could use scan(), but instead we're going to use a function called rep(), which means repeat.

> groups = rep(c("novice","average","expert"), each=9)   # times=c(9,9,9) would also work
> groups   # not yet a factor, a character variable (quoted values)
 [1] "novice"  "novice"  "novice"  "novice"  "novice"  "novice"  "novice"  "novice" 
 [9] "novice"  "average" "average" "average" "average" "average" "average" "average"
[17] "average" "average" "expert"  "expert"  "expert"  "expert"  "expert"  "expert" 
[25] "expert"  "expert"  "expert" 
> groups = factor(groups)   # declare it to be a factor
> groups   # notice the difference in the output format
 [1] novice  novice  novice  novice  novice  novice  novice  novice  novice  average
[11] average average average average average average average average expert  expert 
[21] expert  expert  expert  expert  expert  expert  expert 
Levels: average expert novice
> summary(groups)
average  expert  novice 
      9       9       9 

It looks like we got what we wanted. We have a factor, or grouping variable, or categorical variable, with three levels (i.e., three groups). Why are the levels listed out of order at the end of that last output? Because R puts factor levels in alphabetical order by default. The values themselves are in the correct order in the variable, and that's what we need. We could now put these two variables into a data frame, but we don't need to. As long as we're confident that we're not going to mess them up accidentally, we can leave them in the workspace.
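(By the way, if the alphabetical ordering ever bothers you, you can set the level order yourself when you create the factor. Here is a minimal sketch, for reference only; don't actually do it now, because the regression output later in this exercise assumes the default alphabetical ordering.)

> factor(groups, levels=c("novice","average","expert"))   # explicit level order; result not saved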

> ls()
[1] "getData"       "groups"        "recall.scores" "SS"

This means we don't have to ask R nicely to see them; they're right there in the workspace, and we're allowed to do anything we want with them. We don't need any of the $ or with() business.

Now, use tapply() to fill in the following table. Remember, you don't need with() or $. All you need is tapply(DV,IV,function). You're not going to type "DV,IV,function" of course. You're going to type the actual names of the variables and the function you want to use.
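For example, here is a sketch of the pattern you'll be typing (output omitted so you can fill in the table yourself):

> tapply(recall.scores, groups, mean)     # group means
> tapply(recall.scores, groups, var)      # group variances
> tapply(recall.scores, groups, length)   # group sample sizes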

Questions 1-9: Complete the following table of descriptive statistics. Please round correctly to three decimal places. Your answers can be off by 0.001 and still be considered correct. Click the ✓ button in each box to check your answers.

          novice    average   expert
mean      1)        2)        3)
variance  4)        5)        6)
n         7)        8)        9)

Check for homogeneity of variance by eyeballing the variances. Do they look frighteningly different to you? No, they don't to me either, so I think we are safe with the homogeneity assumption. They are different, of course, but that's because these are samples, and samples are what? The assumption refers to variances in the population.

What if we did think homogeneity of variance had been violated? What if the sample variances were too different to pass our comfort test? R provides two solutions. One is a version of the single-factor ANOVA that does not assume homogeneity.

> oneway.test(recall.scores ~ groups)   # no data frame so data= unnecessary

	One-way analysis of means (not assuming equal variances)

data:  recall.scores and groups
F = 16.5575, num df = 2.000, denom df = 15.967, p-value = 0.0001278

Notice that this test penalizes us for having unequal sample variances by decreasing the error degrees of freedom (denom df). What would denom df be if no penalty were applied? From this ANOVA result, would we retain or reject the null hypothesis of equal means?

The second solution is to use a so-called nonparametric ANOVA test called the Kruskal-Wallis test. (Note: For unequal variances, the oneway.test() function is better. The Kruskal-Wallis test is used mostly when the normality assumption has been violated.)

> kruskal.test(recall.scores ~ groups)   # no data frame so data= unnecessary

	Kruskal-Wallis rank sum test

data:  recall.scores by groups
Kruskal-Wallis chi-squared = 15.6284, df = 2, p-value = 0.000404

From this ANOVA result, would we retain or reject the null hypothesis of equal means?

To do the ANOVA, we would ordinarily need the data in the form of a data frame. That's unnecessary in this case because the variables are right there in our workspace. We can get to them directly. Thus, in the aov() function, you will not need the data= option. You'll just need DV tilde IV. Do the ANOVA using aov() and fill in the following ANOVA summary table.
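Here is a minimal sketch of the calls, with the output omitted. (The name aov.out is just one I made up for the model object; call it whatever you like.)

> aov.out = aov(recall.scores ~ groups)   # once again, no data frame so no data=
> summary(aov.out)                        # prints the ANOVA summary table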

Questions 10-19: Complete the following ANOVA summary table. Enter the answers into the boxes EXACTLY as R prints them out, EXCEPT for the p-value, which should be in correct APA Style. If the p-value is very small, write p < .001 in the box. (There are spaces on each side of the less than sign.) If it is not that small, give the exact p-value to three decimal places with a leading 0 before the decimal point. For example, p = 0.015. R does not give a Total line, so you'll have to get that yourself by adding down the columns. Why is the MS.total box not filled in?

            df     SS     MS     F      p
groups      10)    11)    12)    13)    14)
Residuals   15)    16)    17)
Total       18)    19)

Question 20. Calculate a value for eta-squared to three decimal places.
eta-squared =

Question 21. How would you evaluate this effect size?





Question 22. Are post hoc tests justified here? Why or why not?




Justified or not, we're doing them! Perform Tukey HSD pairwise comparisons and put the p-values in the boxes EXACTLY as they appear in the R output, even if the p-value is really small. Then indicate whether the Tukey test found a significant difference between the means that are being compared by writing "significant" or "nonsignificant" in the box.
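A minimal sketch, assuming you saved your aov() result as aov.out as in the sketch above (output omitted):

> TukeyHSD(aov.out)   # all pairwise comparisons with Tukey-adjusted p-values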

                 p adj    significance?
expert-average   23)      24)
novice-average   25)      26)
novice-expert    27)      28)

The Tukey HSD test is a very conservative test. The Fisher LSD test is a much less conservative test. If a difference between two means is found to be significant by the Tukey test, it will surely be significant by the Fisher LSD test. So the only question, Fisher LSD-wise, concerns one of the above comparisons. The Fisher LSD test is fine as long as you are doing no more than how many pairwise comparisons? And those comparisons have to be...? Which means what?

You don't know this yet, but ANOVA is actually a special case of regression analysis. The R function for doing regression is lm(), which stands for linear model. It has the same syntax as the aov() function. Let's try it.

> lm.out = lm(recall.scores ~ groups)   # once again, no data frame so no data=
> summary(lm.out)

Call:
lm(formula = recall.scores ~ groups)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.444 -12.000  -6.444  15.500  23.444 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    55.444      4.875  11.373 3.77e-11 ***
groupsexpert   24.778      6.895   3.594  0.00146 ** 
groupsnovice  -13.889      6.895  -2.014  0.05530 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.63 on 24 degrees of freedom
Multiple R-squared:  0.5736,	Adjusted R-squared:  0.538 
F-statistic: 16.14 on 2 and 24 DF,  p-value: 3.614e-05

Let's see what we have here. We'll start at the very bottom. On the last line of this output is the ANOVA result, the same ANOVA result we got above with aov(). On the line above that is something called Multiple R-squared. (Ignore the adjusted version.) Multiple R-squared is the proportion of variability explained by the regression relationship. Which is what statistic in ANOVA speak? On the third line from the bottom is a number called Residual standard error. If you square that, you'll discover it is equal to MS.error. So residual standard error is another name for what in ANOVA speak?
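You can check both of those claims right at the console. A quick sketch (compare the results to what you entered in the ANOVA summary table above):

> 14.63^2                        # residual standard error squared; compare to MS.error
> 16.14 * 2 / (16.14 * 2 + 24)   # bonus: Multiple R-squared recovered from F and the two dfs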

Above that we have a table of Coefficients with significance tests. Let's figure out what that is. You've seen the number 55.444 before. In fact, you typed it into a box somewhere above. Okay, I'll give you a few seconds to find it. Correct! It is the mean recall score for the group called "average." It is labeled in this table as (Intercept). Below that in the same column, the Estimate column, are two other numbers you've seen before, although it may take a little more diligent searching (through your R Console window) to find them. One is labeled groupsexpert, and the other groupsnovice. Okay, hint: look at the output of the Tukey HSD test you did above.

24.778 is the difference between the mean of the expert group and the mean of the average group, while -13.889 is the difference between the mean of the novice group and the mean of the average group.

Here's what you're looking at in the Coefficients table. R has chosen one of the groups to be a baseline group. Since we didn't specify otherwise, R chose the group that comes first alphabetically, which is "average." The first line, labeled (Intercept), which you'll understand eventually, shows the mean of the baseline group, and then there is a t-test testing the null hypothesis that, in the population, the mean score of average chess players on this task is 0. It's not a very interesting hypothesis. It's the next two lines that are interesting.
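If you're curious how R codes this baseline scheme, you can peek at the design matrix it built for the regression. A sketch, assuming the lm.out object from above:

> head(model.matrix(lm.out))   # an (Intercept) column of 1s, plus 0/1 indicator columns for expert and novice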

The second line, labeled groupsexpert, gives the difference between the means of the recall.scores variable for the expert group and the baseline group (average), and then there is a t-test on that difference. Guess what. The t-test is a Fisher LSD test. The third line, labeled groupsnovice, gives the difference between the means of the recall.scores variable for the novice group and the baseline group (average), and gives the result of a Fisher LSD test on that difference. Notice that the difference does not quite reach statistical significance, p = 0.05530. There is no test between the two groups that are not baseline groups, expert and novice, but since that is the largest of the mean differences, it shouldn't be hard to guess that those means are also significantly different. In fact, the difference between those two means is 24.778-(-13.889)=38.667, with a standard error of 6.895 (because it's a balanced design), so the t-value is 38.667/6.895=5.608.
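If you'd like a p-value to go with that t-value, it's an ordinary t statistic on the error degrees of freedom, which is 24 here. A sketch:

> 2 * pt(-5.608, df=24)   # two-tailed p-value for t = 5.608 on 24 df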

So the Fisher LSD tests are not telling us anything different from what the Tukey HSD test did, although we got close to a significant difference between the "average" and "novice" groups. But as my grad school stat professor, Dr. Tom Wickens, used to say, "Close only counts in horseshoes and hand grenades."

It's important that you understand the output of the lm() function because you're going to be seeing a lot of it in the not too distant future. See if you can answer the following questions from that output without looking back at the previous six paragraphs.

For the following questions, enter the answers EXACTLY as they appear in the lm() output.

Question 29. What is the value of eta-squared for the ANOVA recall.scores ~ groups?

Question 30. What is the value of the pooled standard deviation?

Questions 31-35. What is the result of the Fisher LSD test for the comparison average vs. novice?
31) mean difference =
32) standard error =
33) t-value =
34) p-value =
35) "reject" or "fail to reject":

Following is a graph. Here's how I drew it for those of you who want to know. (If you don't want to know, that's okay too. Just skip to the graph and the questions.) At the very least, you need to know that the solid points are the group means.

# First I'll relevel the IV to make "novice" the baseline group.
# That will make it appear as the first box in the boxplots.
> groups = relevel(groups, ref="novice")
# Then I'll draw the graph, labeling the X-axis as "skill".
> boxplot(recall.scores ~ groups, xlab="skill")   # again, no data frame so no data=
# I'll label the y-axis "score".
> title(ylab="score")
# Finally I'll draw a line graph (type="b") over top of this,
# plotting the means as solid points (pch=16).
> points(x=c(1,2,3), y=tapply(recall.scores,groups,mean), pch=16, type="b")
[graph: side-by-side boxplots of recall score by skill group, with the group means plotted as solid points connected by a line]

If you don't remember what a boxplot is, there is an explanation here.

Answer the following questions based on this graph.

Question 36. If the boxes are all approximately the same height, this indicates that the variability in all three groups is similar. This means that what ANOVA assumption has been met?




Question 37. In all of the boxes, the means are greater than the medians. This indicates that the scores may be (hint: remember from your first stat course that means are pulled into the long tail of a skewed distribution):




Question 38. If scores are not normally distributed, then the parametric ANOVA (the one we did above) may not be the correct test. What test could be used in its place?




Question 39. The means show a clear increasing trend from novice to average to expert. Yet the difference between the means of the novice group and the average group was not significant. But suppose those two groups really are different (in the population). Then by finding the difference not to be significant in the samples, we have (hint: you'll also have to recall this from your first stat course, or the review material--when we fail to reject a false null hypothesis, that's called...):




Psychologists have some sort of neurotic fixation on bar graphs, whereas the R people (rightly) believe that bar graphs are for frequencies, not means. Nevertheless, to keep the psychologists happy, here is a bar graph showing the means of the three chess skill groups. (I'm not going to show you how to do this because I don't want to encourage the practice of using bar graphs for means!) Notice the bars are marked with asterisks. This is a standard way of showing which groups are significantly different. Means with the same number of asterisks are not significantly different. Means with different numbers of asterisks are significantly different.

[graph: bar graph of the three group means, with bars marked by asterisks to indicate significant differences]

Question 40. What is shown in the boxplots above that is not shown by this bar graph?




Boxplots are sometimes drawn horizontally rather than vertically (as above). SPSS does this, for example. So...

> boxplot(recall.scores ~ groups, xlab="score", ylab="skill", horizontal=T)
> points(y=c(1,2,3), x=tapply(recall.scores,groups,mean), pch=16)
[graph: horizontal boxplots of recall score by skill group, with the group means shown as solid points]

The means are still shown by the solid points. By the way, means are a nonstandard part of boxplots, so if you plot them, you have to explain in a note or the figure caption what they are.

Question 41. What information is contained in the horizontally formatted boxplots that is not in the vertically formatted boxplots?




Here are some general questions on analysis of variance, etc.

Question 42. Suppose we do an ANOVA and find that F = 1. What does this imply about the variability between the groups (variability in the group means)?




Question 43. If we have a true (randomized, designed) experiment, this means:




Question 44. What is the F-ratio?




Question 45. Why do we set the alpha level at .05?




If you want to do this again (more practice is good!), click the following button to erase all your answers.