What is it that makes experts better than the rest of us at a task? Is it some general ability that they excel in, or is it a task-specific skill that they have developed through practice? Chess has been the subject of psychological studies of expertise for 75 years. In chess, expertise can be easily quantified and recognized, which makes it an ideal venue for such studies. One of the first researchers to take advantage of this was a Dutch psychologist and chess master named Adriaan de Groot (1946, trans. 1965). He found that chess masters do not have better memories than anyone else, except when it comes to chess-related material such as board positions.

His work was expanded upon by Chase and Simon (1973, Perception in chess, Cognitive Psychology, 4, 55-81). They studied novice, average, and expert chess players, examining their ability to reproduce chess positions that they were able to examine only briefly (a few seconds). Two types of positions were used: positions that arose from a real game of chess, and positions that were produced by randomly placing pieces on the board. The measured variable was how well the subjects were able to recall the positions they saw.

The following data display some of the same effects seen by Chase and Simon. Only data from the real chess positions are presented here. The null hypothesis is that novice, average, and expert chess players have the same level of skill at remembering these chess positions after being briefly exposed to them and, therefore, can recall and reproduce such positions with equal ability (equal mean recall scores).
The following data show the number of test positions successfully recalled by players at each level of ability.
novice:  36 27 57 32 37 59 32 65 29
average: 71 39 44 77 68 69 42 40 49
expert:  65 70 96 98 96 73 67 70 87
Our analysis is going to be an ANOVA, and for ANOVA we can't have each group in a separate variable (no matter what stat software we're using). We need the DV (the measured values, those numbers) in one variable, and the IV (the names of the groups) in another variable. Furthermore, the IV variable has to have a group name for every data value in the DV variable. That is, the IV variable and the DV variable have to be the same length.
Data have to be entered. It's tedious, and no one likes to do it, but it's a fact of life if you're doing statistics, kind of like cutting your toenails. This is a small dataset, so putting it into a spreadsheet or a script would be overkill, in my opinion. I recommend the following.
> recall.scores = scan()
1: 36 27 57 32 37 59 32 65 29     # put a space between each value (or copy and paste)
10: 71 39 44 77 68 69 42 40 49    # press Enter only after entering a group
19: 65 70 96 98 96 73 67 70 87    # press Enter twice to terminate data entry
28: 
Read 27 items
> summary(recall.scores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  27.00   39.50   65.00   59.07   70.50   98.00 
DO NOT put commas between the data values--spaces only in scan(). Now for the IV. We could use scan(), but instead we're going to use a function called rep(), which means repeat.
> groups = rep(c("novice","average","expert"), each=9)    # times=c(9,9,9) would also work
> groups    # not yet a factor, a character variable (quoted values)
 [1] "novice"  "novice"  "novice"  "novice"  "novice"  "novice"  "novice"  "novice" 
 [9] "novice"  "average" "average" "average" "average" "average" "average" "average"
[17] "average" "average" "expert"  "expert"  "expert"  "expert"  "expert"  "expert" 
[25] "expert"  "expert"  "expert" 
> groups = factor(groups)    # declare it to be a factor
> groups    # notice the difference in the output format
 [1] novice  novice  novice  novice  novice  novice  novice  novice  novice  average
[11] average average average average average average average average expert  expert 
[21] expert  expert  expert  expert  expert  expert  expert 
Levels: average expert novice
> summary(groups)
average  expert  novice 
      9       9       9 
It looks like we got what we wanted. We have a factor, or grouping variable, or categorical variable, with three levels (i.e., three groups). Why are the levels listed out of order at the end of that last output? Because R sorts the levels into alphabetical order. The values themselves are in the correct order in the variable, and that's what we need. We could now put these two variables into a data frame, but we don't need to. As long as we're confident that we're not going to mess them up accidentally, we can leave them in the workspace.
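(If you ever do want them in a data frame, it's a one-liner. I'm not running it here, and the name chess is just my invention, so our workspace stays as shown below.)

> chess = data.frame(recall.scores, groups)    # hypothetical; bundles DV and IV into one data frame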
> ls()
[1] "getData"       "groups"        "recall.scores" "SS"
This means we don't have to ask R nicely to see them; they're right there in the workspace, and we're allowed to do anything we want with them. We don't need any of the $ or with() business.
Now, use tapply() to fill in the following table. Remember, you don't need with() or $. All you need is tapply(DV,IV,function). You're not going to type "DV,IV,function" of course. You're going to type the actual names of the variables and the function you want to use.
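If you want to see the pattern first without spoiling any answers, try it with a harmless function like length(), which just counts the scores in each group:

> tapply(recall.scores, groups, length)
average  expert  novice 
      9       9       9 

Then swap in the function the table actually asks for.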
Questions 1-9: Complete the following table of descriptive statistics. Please round correctly to three decimal places. Your answers can be off by 0.001 and still be considered correct. Click the ✓ button in each box to check your answers.
Check for homogeneity of variance by eyeballing the variances. Do they look frighteningly different to you? No, they don't to me either, so I think we are safe with the homogeneity assumption. They are different, of course, but that's because these are samples, and samples are what? The assumption refers to variances in the population.
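By the way, eyeballing is easier if you let tapply() do the arithmetic. Same pattern as above, this time with var(). I'll leave the output for you to generate, since you may already have these numbers handy:

> tapply(recall.scores, groups, var)    # sample variance of each group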
What if we did think homogeneity of variance had been violated? What if the sample variances were too different to pass our comfort test? R provides two solutions. One is a version of the single-factor ANOVA that does not assume homogeneity.
> oneway.test(recall.scores ~ groups)    # no data frame so data= unnecessary

        One-way analysis of means (not assuming equal variances)

data:  recall.scores and groups
F = 16.5575, num df = 2.000, denom df = 15.967, p-value = 0.0001278
Notice that this test penalizes us for having unequal sample variances by decreasing error degrees of freedom (denom df). What would it be with no penalty applied? From this ANOVA result, would we retain or reject the null hypothesis of equal means?
The second solution is to use the Kruskal-Wallis test, a so-called nonparametric ANOVA. (Note: for unequal variances, the oneway.test() function is the better choice. The Kruskal-Wallis test is used mostly when the normality assumption has been violated.)
> kruskal.test(recall.scores ~ groups)    # no data frame so data= unnecessary

        Kruskal-Wallis rank sum test

data:  recall.scores by groups
Kruskal-Wallis chi-squared = 15.6284, df = 2, p-value = 0.000404
From this ANOVA result, would we retain or reject the null hypothesis of equal means?
To do the ANOVA, we would ordinarily need the data in the form of a data frame. That's unnecessary in this case because the variables are right there in our workspace. We can get to them directly. Thus, in the aov() function, you will not need the data= option. You'll just need DV tilde IV. Do the ANOVA using aov() and fill in the following ANOVA summary table.
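As a sketch, the calls look like this. (Storing the result under the name aov.out is my choice, not a requirement; any legal name will do.)

> aov.out = aov(recall.scores ~ groups)    # no data= needed; variables are in the workspace
> summary(aov.out)                         # prints the ANOVA summary table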
Questions 10-19: Complete the following ANOVA summary table. Enter the answers into the boxes EXACTLY as R prints them out, EXCEPT for the p-value, which should be in correct APA Style. If the p-value is very small, write p < .001 in the box. (There are spaces on each side of the less than sign.) If it is not that small, give the exact p-value to three decimal places with a leading 0 before the decimal point. For example, p = 0.015. R does not give a Total line, so you'll have to get that yourself by adding down the columns. Why is the MS.total box not filled in?
Question 20. Calculate a value for eta-squared to three decimal places. eta-squared =
Question 21. How would you evaluate this effect size? A) trivial B) small C) moderate D) large E) very large
Question 22. Are post hoc tests justified here? Why or why not? A) yes because the ANOVA found a significant effect of groups B) yes because of the magnitude of the eta-squared value C) no because the effect of groups was not significant in the ANOVA D) no because there are only three levels of the IV (three groups)
Justified or not, we're doing them! Perform Tukey HSD pairwise comparisons and put the p-values in the boxes EXACTLY as they appear in the R output, even if the p-value is really small. Then indicate whether the Tukey test found a significant difference between the means that are being compared by writing "significant" or "nonsignificant" in the box.
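If you stored your ANOVA as aov.out (as in the sketch above), the Tukey test is one line:

> TukeyHSD(aov.out)    # all pairwise comparisons with Tukey's adjustment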
The Tukey HSD test is a very conservative test. The Fisher LSD test is a much less conservative test. If a difference between two means is found to be significant by the Tukey test, it will surely be significant by the Fisher LSD test. So the only question, Fisher LSD-wise, concerns one of the above comparisons. The Fisher LSD test is fine as long as you are doing no more than how many pairwise comparisons? And those comparisons have to be...? Which means what?
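(For the record, R will also do unadjusted pairwise t-tests, which amounts to the Fisher LSD procedure, via pairwise.t.test(). A sketch, output omitted so you can try it yourself:)

> pairwise.t.test(recall.scores, groups, p.adjust.method="none")    # pooled SD, no adjustment = Fisher LSD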
You don't know this yet, but ANOVA is actually a special case of regression analysis. The R function for doing regression is lm(), which stands for linear model. It has the same syntax as the aov() function. Let's try it.
> lm.out = lm(recall.scores ~ groups)    # once again, no data frame so no data=
> summary(lm.out)

Call:
lm(formula = recall.scores ~ groups)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.444 -12.000  -6.444  15.500  23.444 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    55.444      4.875  11.373 3.77e-11 ***
groupsexpert   24.778      6.895   3.594  0.00146 ** 
groupsnovice  -13.889      6.895  -2.014  0.05530 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.63 on 24 degrees of freedom
Multiple R-squared:  0.5736,    Adjusted R-squared:  0.538 
F-statistic: 16.14 on 2 and 24 DF,  p-value: 3.614e-05
Let's see what we have here. We'll start at the very bottom. On the last line of this output is the ANOVA result, the same ANOVA result we got above with aov(). On the line above that is something called Multiple R-squared. (Ignore the adjusted version.) Multiple R-squared is the proportion of variability explained by the regression relationship. Which is what statistic in ANOVA speak? On the third line from the bottom is a number called Residual standard error. If you square that, you'll discover it is equal to MS.error. So residual standard error is another name for what in ANOVA speak?
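You can check that squaring claim right at the console. (It agrees with MS.error only to within rounding, because 14.63 is itself a rounded value.)

> 14.63^2    # compare to MS.error in your ANOVA table
[1] 214.0369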
Above that we have a table of Coefficients with significance tests. Let's figure out what that is. You've seen the number 55.444 before. In fact, you typed it into a box somewhere above. Okay, I'll give you a few seconds to find it. Correct! It is the mean recall score for the group called "average." It is labeled in this table as (Intercept). Below that in the same column, the Estimate column, are two other numbers you've seen before, although it may take a little more diligent searching (through your R Console window) to find them. One is labeled groupsexpert, and the other groupsnovice. Okay, hint: look at the output of the Tukey HSD test you did above.
24.778 is the difference between the mean of the expert group and the mean of the average group, while -13.889 is the difference between the mean of the novice group and the mean of the average group.
Here's what you're looking at in the Coefficients table. R has chosen one of the groups to be a baseline group. Since we didn't specify otherwise, R chose the group that comes first alphabetically, which is "average." The first line, labeled (Intercept) for reasons you'll understand eventually, shows the mean of the baseline group, and then there is a t-test testing the null hypothesis that, in the population, the mean score of average chess players on this task is 0. It's not a very interesting hypothesis. It's the next two lines that are interesting.
The second line, labeled groupsexpert, gives the difference between the means of the recall.scores variable for the expert group and the baseline group (average), and then there is a t-test on that difference. Guess what. The t-test is a Fisher LSD test. The third line, labeled groupsnovice, gives the difference between the means of the recall.scores variable for the novice group and the baseline group (average), and gives the result of a Fisher LSD test on that difference. Notice that the difference does not quite reach statistical significance, p = 0.05530. There is no test between the two groups that are not baseline groups, expert and novice, but since that is the largest of the mean differences, it shouldn't be hard to guess that those means are also significantly different. In fact, the difference between those two means is 24.778-(-13.889)=38.667, with a standard error of 6.895 (because it's a balanced design), so the t-value is 38.667/6.895=5.608.
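If you'd like R to hand you that expert-vs-novice test directly, one way (a sketch; groups2 is a name I'm making up so we don't disturb groups) is to relevel a copy of the factor and refit. The t-value on the groups2novice line should agree, up to sign, with the 5.608 we just computed.

> groups2 = relevel(groups, ref="expert")    # make "expert" the baseline group
> summary(lm(recall.scores ~ groups2))       # groups2novice line now tests novice vs. expert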
So the Fisher LSD tests are not telling us anything different from what the Tukey HSD test did, although we got close to a significant difference between the "average" and "novice" groups. But as my grad school stat professor, Dr. Tom Wickens, used to say, "Close only counts in horseshoes and hand grenades."
It's important that you understand the output of the lm() function because you're going to be seeing a lot of it in the not too distant future. See if you can answer the following questions from that output without looking back at the previous six paragraphs.
For the following questions, enter the answers EXACTLY as they appear in the lm() output.
Question 29. What is the value of eta-squared for the ANOVA recall.scores ~ groups?
Question 30. What is the value of the pooled standard deviation?
Questions 31-35. What is the result of the Fisher LSD test for the comparison average vs. novice? 31) mean difference = 32) standard error = 33) t-value = 34) p-value = 35) "reject" or "fail to reject":
Following is a graph. Here's how I drew it for those of you who want to know. (If you don't want to know, that's okay too. Just skip to the graph and the questions.) At the very least, you need to know that the solid points are the group means.
# First I'll relevel the IV to make "novice" the baseline group.
# That will make it appear as the first box in the boxplots.
> groups = relevel(groups, ref="novice")
# Then I'll draw the graph, labeling the X-axis as "skill".
> boxplot(recall.scores ~ groups, xlab="skill")    # again, no data frame so no data=
# I'll label the y-axis "score".
> title(ylab="score")
# Finally I'll draw a line graph (type="b") over top of this,
# plotting the means as solid points (pch=16).
> points(x=c(1,2,3), y=tapply(recall.scores,groups,mean), pch=16, type="b")
If you don't remember what a boxplot is, there is an explanation here.
Answer the following questions based on this graph.
Question 36. If the boxes are all approximately the same height, this indicates that the variability in all three groups is similar. This means that what ANOVA assumption has been met? A) independent subjects B) normal distributions C) homogeneity of variance D) all of the above have been met
Question 37. In all of the boxes, the means are greater than the medians. This indicates that the scores may be (hint: remember from your first stat course that means are pulled into the long tail of a skewed distribution): A) normally distributed B) positively skewed (skewed to the right) C) negatively skewed (skewed to the left) D) bimodal
Question 38. If scores are not normally distributed, then the parametric ANOVA (the one we did above) may not be the correct test. What test could be used in its place? A) Games-Howell test B) Levene test C) Tukey-Kramer procedure D) Kruskal-Wallis test
Question 39. The means show a clear increasing trend from novice to average to expert. Yet the difference between the means of the novice group and the average group was not significant. But suppose those two groups really are different (in the population). Then by finding the difference not to be significant in the samples, we have (hint: you'll also have to recall this from your first stat course, or the review material--when we fail to reject a false null hypothesis, that's called...): A) made a Type I error B) made a Type II error C) obviously done the wrong test D) really screwed up
Psychologists have some sort of neurotic fixation on bar graphs, whereas the R people (rightly) believe that bar graphs are for frequencies, not means. Nevertheless, to keep the psychologists happy, here is a bar graph showing the means of the three chess skill groups. (I'm not going to show you how to do this because I don't want to encourage the practice of using bar graphs for means!) Notice the bars are marked with asterisks. This is a standard way of showing which groups are significantly different. Means with the same number of asterisks are not significantly different. Means with different numbers of asterisks are significantly different.
Question 40. What is shown in the boxplots above that is not shown by this bar graph? A) the medians B) an indication of variability within the groups C) an indication of how scores within the groups are distributed D) all of the above are correct
Boxplots are sometimes drawn horizontally rather than vertically (as above). SPSS does this, for example. So...
> boxplot(recall.scores ~ groups, xlab="score", ylab="skill", horizontal=T)
> points(y=c(1,2,3), x=tapply(recall.scores,groups,mean), pch=16)
The means are still shown by the solid points. By the way, means are a nonstandard part of boxplots, so if you plot them, you have to explain in a note or the figure caption what they are.
Question 41. What information is contained in the horizontally formatted boxplots that is not in the vertically formatted boxplots? A) the medians B) standard deviations C) interquartile ranges D) none
Here are some general questions on analysis of variance, etc.
Question 42. Suppose we do an ANOVA and find that F = 1. What does this imply about the variability between the groups (variability in the group means)? A) there's a lot of it! B) there is none; the means are all the same C) variability in the group means is nothing but random error D) none of the above is correct
Question 43. If we have a true (randomized, designed) experiment, this means: A) subjects were randomly assigned to groups B) there are no confounds C) we do not need post hoc tests after the ANOVA is completed D) all of the above are correct
Question 44. What is the F-ratio? A) SS.treatment / SS.error B) MS.treatment / MS.error C) SS.treatment / MS.error D) MS.treatment / df.treatment
Question 45. Why do we set the alpha level at .05? A) that value makes the probability of a Type I error as small as possible B) the difference between the means must be 5% or more for significance C) it insures that any effects that are found to be significant are large enough to be meaningful D) there really is no good reason; it is an entirely arbitrary cutoff
If you want to do this again (more practice is good!), click the following button to erase all your answers.