Psyc 480 -- Dr. King

More R Practice Using Data From Brooks (1987)

It is essential that you learn the basics of how to use R. So let's take this opportunity to practice some more using the data in the Brooks (1987) article. You can do this exercise using this webpage or your own installation of R. To use R, you need to do ONLY ONE of the following: 1) install it on your computer from the R-project website (you may have to fiddle with your security settings to allow your Mac to install something not from the App Store), or 2) find a computer on campus that already has R installed, or 3) go to RStudio Cloud. For today's exercise, however, you can do the entire thing on this webpage without using R at all. Next time that will no longer be true, so be sure you have access to R one way or another. Personally, I would find an R installation somewhere and work through this.

You may not know some of this R stuff, but that's okay. You're learning!

These are the data that Brooks gives in his article for 322 students who have taken his Intro Stats course.

grades  male   female
A = 4    11%      23%
B = 3    24%      33%
C = 2    48%      39%
D = 1    10%       4%
F = 0     7%       1%
       n=154    n=168

The percentages aren't useful to us. We cannot use them in a hypothesis test, for example. They are descriptive statistics only. Their usefulness lies in the fact that we can convert them to raw frequencies (you may know them as "obtained frequencies" or "observed frequencies"), which we can use in a hypothesis test. 11% of 154 is 17, so there were 17 male students who got As in his course, and so on. I had to fool around with these figures a little to get tests that came close to matching what Brooks reported in his article because, for example, 39% of 168 can be either 66 or 65. I ended up using 65. Even so, I couldn't exactly duplicate Brooks' hypothesis tests, and we'll talk about why that might be when the time comes.

Here are those percentages converted into observed frequencies. (Can you do this?)

grades  male   female
A = 4     17       39
B = 3     37       56
C = 2     74       65
D = 1     15        7
F = 0     11        1
       n=154    n=168
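If you're wondering how the conversion goes, it's just arithmetic, and R is happy to do it. Here's a sketch (the vector name male.pct is mine, not anything in the Brooks data file) that reproduces the male column by simple rounding. The female column can't be reproduced this way; as noted above, I had to adjust a couple of its entries so they summed to 168.

```r
# percentages from the table above, male column only
male.pct = c(A = 11, B = 24, C = 48, D = 10, F = 7)
round(male.pct / 100 * 154)   # convert to observed frequencies
#  A  B  C  D  F
# 17 37 74 15 11
```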

We could put these data into graphic displays, either histograms or stem-and-leaf displays, if we wanted to, but I think we can see the result plainly in the frequency table. On the whole, female students did better than male students in Dr. Brooks' stat course.

I have put these data into a data file, which I have uploaded to my website. You can load the data file into R as a data frame directly from the website. When downloading data from the website directly into R, you have to type VERY CAREFULLY. Most students manage to get this wrong on their first few attempts, some even on subsequent attempts, so if you get it to work on the first try, congratulations! Here's what you have to type into R. (If you're not using R, don't worry about it.) The name of the file that's at the website is Brooks.txt.

> file = "http://ww2.coastal.edu/kingw/psyc480/data/Brooks.txt"
> Brooks = read.table(file=file, header=T, stringsAsFactors=T)

I know those commands work (provided you have an Internet connection), so if you get a long, cryptic error message, you mistyped them. Try again. Common errors: 1) it's psyc480, not psych480, 2) it's kingw, not king, 3) the file URL has to be in quotes, 4) capitalization has to be EXACTLY as you see it above. Attention to detail is required to use a command line program such as R. (Hint: If, no matter how often you try it, you just can't get it, those commands can be copied and pasted into R. Just don't copy the command prompts.) These commands produce no output in your R console, so if you get no response from R to these commands, that's good! In R, silence is golden. It means what you asked R to do worked.

Okay, let's get on with it. Here's how this webpage works (if you're not actually using R). I will ask you to find something out about the data, or calculate something from the data, and you will type the appropriate command into the box I supply for you. If you get the command right, the page will show you the R output. If you get it wrong, you'll get an error message. There is one catch. DO NOT TYPE SPACES IN YOUR COMMANDS. R ignores spaces in commands, but I'm not that good a programmer, SO LEAVE THE SPACES OUT! Spaces will cause your answer to be considered an error (even if it isn't). Also, in R you would press the Enter key to execute your commands. Here you have to click it. If after several earnest efforts, you can't get the right answer, click the Answer button, then type the correct command into the box to see the R output. (One more thing. You'll have to cut me some slack on the formatting. JavaScript uses a proportionally spaced font in alert boxes, so getting the spacing right was a PITA!)

Imagine you've just typed the commands above to read in the data from the website, and now you want to check your workspace to be sure R did what it was told. How would you do it? (I'll do the first one for you. You can just click Enter.)

A data frame is a rectangular array of data, much like you might type it into SPSS (or a spreadsheet). When you read a file in from the website using read.table(), it comes in as a data frame. For now, it will be sufficient to have you believe that each row in the data frame is a subject, and each column in the data frame is a variable with some information about that subject. The data frame is too long to print in your R Console window, but you can see the top of it by typing head(Brooks). How would you find out the size of this data frame in rows and columns?
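If it helps to see one, here's a toy data frame (made-up numbers, not the Brooks data) built right at the command line. Each row is a subject, each column is a variable:

```r
# a tiny hypothetical data frame: three subjects, two variables
toy = data.frame(grade = c(4, 2, 3), sex = c(0, 1, 1))
head(toy)   # shows the top of the data frame (here, all of it -- it's tiny)
```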

How would you get descriptive statistics on both variables in this data frame?

Using the information in the last answer, what can you say about the coding of the sex variable?

In the sex variable, 0 = female and 1 = male. Using the information from a previous answer, what proportion of subjects in this sample were male? Hint: this was in the required calculations, so you should know this. If you don't remember, think about what the mean is.

Using information given at the top of this exercise, how could you calculate the previous answer at the command prompt? (Hint: using R, how could you turn your shiny new thousand-dollar computer into a two-dollar calculator?)

What is the sum of the grade variable in the Brooks data frame? Hint: R will not let you look inside a data frame unless you ask it politely for permission. This is to prevent you from accidentally screwing up your data (and there are also other advantages). If you want to see inside a data frame, you really need to ask nicely. One way (not the way used here) is to begin by typing the name of the data frame followed by a $ followed by the name of the variable you wish to work with. It would look like this: sum(Brooks$grade). Just typing sum(grade) would give you a "not found" error, because there is nothing called "grade" in your workspace. (Try it!) Another way to get inside a data frame is to use the with() function. You type with(name.of.data.frame,command.you.wish.to.execute). In this case, that would look like with(Brooks,???) and in place of the question marks the command you wish to execute. I'll let you figure that out. And please remember, no spaces. Also remember to close all open parentheses.
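Here are the two ways of asking politely, sketched on a toy data frame so as not to give away the answer for the Brooks data (toy and score are invented names):

```r
toy = data.frame(score = c(1, 2, 3))
sum(toy$score)          # dollar-sign method: 6
with(toy, sum(score))   # with() method: also 6
# sum(score) by itself fails with "object 'score' not found"
```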

Using the same method you used in the previous question, what is the mean of the grade variable?

What is the standard deviation of the grade variable?

How (if you were actually using R) could you have gotten that last answer without retyping the whole command?

In this dataset, which variable are we considering to be the independent variable (IV) and which the dependent variable (DV)?

Hopefully, you got that last question correct, because identifying the IV and DV correctly is an essential skill in statistics. For the time being, the DV will always be a measured variable, something we measured about our subjects. It should, therefore, be a numeric variable, and we only have one of those: grade. The IV is a categorical variable that creates the groups. In these data, our groups are male and female, and the variable that creates those groups is sex. Therefore, sex is the IV.

Now for one of my favorite jokes in this class. We want to see the DV broken down by sex. Use the tapply() function to get separate means of the grade variable for males and females. Hint: use with() to get inside the data frame, then the tapply() function works like this: tapply(name.of.DV,name.of.IV,function.you.wish.to.apply). Give it a try. No spaces!

The two group means are labelled 0 and 1. That's the problem with dummy coding. You have to remember the codes: 0=female, 1=male.

At the command line, calculate the difference between the two means without having to remember the numeric values of the means. (There are several ways to do this. I'm only showing one in the answer. I won't lie to you. This is a tough one. I'm using a function called diff(), which finds the difference between two numbers produced by another function, such as tapply.)
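Here is that diff()-on-tapply() pattern sketched on a toy data frame (toy, y, and g are invented names, not part of the Brooks data). The same pattern, wrapped in with(), works on a data frame you've read in.

```r
toy = data.frame(y = c(1, 3, 2, 6), g = c(0, 0, 1, 1))
means = tapply(toy$y, toy$g, mean)   # group means: group 0 is 2, group 1 is 4
diff(means)                          # difference between the two means: 2
```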

Now get the standard deviations for each group using tapply(). Hint: almost exactly like getting the mean except the function is sd instead of mean.

Given that the difference between the means is about one-half and the standard deviations are around one, would you say this effect size is small, moderate, or large? Why?

Now for the t-test. First, what null hypothesis shall we test?

How do you do the pooled-variance t-test in R? Hint: you can no longer type something like t.test(female,male,var.eq=T), because female and male are not variables that are in your workspace. You have to use what is called a formula interface when the data are in a data frame. It works like this: t.test(DV~IV,data=name.of.data.frame,var.eq=T). The little squiggly thing between DV and IV is a tilde (upper left corner of keyboard just under the Esc key). Of course, you won't actually type "DV~IV", you'll use the names of your variables, and also the name of your data frame after data=. Try it. Formula interface is optional with the t-test if you have the variables in your workspace, but it is not with more complex procedures such as ANOVA and regression, so get used to it. (Hint: That last option can be typed in several different ways: var.eq=T, var.eq=TRUE, var.equal=T, and so on. We are using the first of those.)

(NOTE: the data=Brooks,var.eq=T part can come in either order. I.e., var.eq=T,data=Brooks also works in R, but not on this webpage.)
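If you'd like to see the formula interface in action before trying it on the Brooks data, here it is on a toy data frame (all names invented):

```r
toy = data.frame(y = c(1, 3, 2, 6, 4, 8), g = c(0, 0, 0, 1, 1, 1))
t.test(y ~ g, data = toy, var.eq = T)   # pooled-variance t-test of DV ~ IV
```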

Here is the output of the t.test() function. (There is no way I'm going to try to put this into a JavaScript alert box!!)

	Two Sample t-test

data:  grade by sex
t = 4.9572, df = 320, p-value = 1.162e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3155932 0.7309436
sample estimates:
mean in group 0 mean in group 1 
       2.744048        2.220779

Should the null hypothesis be retained or rejected? What's the p-value?

By the way, compare this result to the one Brooks reported in his article. According to the article, t = 5.30, but we got t = 4.96. Why is that? I had to reconstruct the data from his percentages in the article, and those percentages are rounded off to whole numbers. So in a few spots, I had to pick between two possible frequencies. For example, females getting a C could have been either 65 or 66. I chose 65. Honestly, I don't think that's the problem. I think (and I'm just guessing) that Brooks did his calculations by hand. It was 1987 after all, when "by hand" meant by hand calculator. Desktop computers were relatively new, few people had access to statistical software, and those who did found it inconvenient to use for something as simple as a t-test. I'd be willing to bet that Brooks did these calculations by hand. And I'd be willing to bet that he rounded off perhaps a little too severely and, therefore, got an answer that was slightly inaccurate. It doesn't really matter whether the t-value is 5.30 or 4.96. The p-value is still tiny. So I'm not being critical. I'm just pointing out why I think our result in R doesn't quite agree with the result given in the article.

The difference between the sample means was 0.523, but this is probably not the difference between the population means, because samples are noisy. Given that these are adequate random samples of the population, however, we can be 95% confident that the true mean difference in the population falls between what two values?

What would happen to those confidence limits if you changed the CI to a 99% confidence interval? The option in the t.test() function that sets this is conf.level=.99. Go ahead and try it. (I'll show only the relevant part of the output in the answer.) (Further hint: You can also enter the option as conf=.99. That will work in R but not on this webpage. R will generally give you a pass on the syntax as long as it can recognize what option you want, but this isn't universally true, so care is warranted with many functions. If you want to see the "official" version of the options, type help(t.test) at the command prompt. Be forewarned, however. Some of these help pages are not very helpful to beginners!)

Given information that you've already seen, how would you calculate a more accurate value of Cohen's d at the command line? (Be careful with parentheses!)

Brooks also did a chi-square test on the grade frequencies. A chi-square test is done when you have a categorical grouping variable for an IV and a categorical variable for the DV (instead of a numeric variable), or when you just have two categorical variables and you want to see how they are related. Can we get the grade frequencies from our data frame? Easy as pie!

> with(Brooks, table(grade, sex))
     sex
grade  0  1
    0  1 11
    1  7 15
    2 65 74
    3 56 37
    4 39 17

(Technical aside. Technically, we should declare these variables as factors or categorical variables when we do this: with(Brooks, table(factor(grade), factor(sex))). R doesn't really care though.)

That table is called a cross-tabulation, and the entries in the table are frequencies. For example, there were 74 males who got a C in statistics in this sample. What can we do with that table now? How can we operate on it in R? The answer is, we can do nothing with it because we didn't assign it to an object in our workspace. Check your workspace. Do you see anything there that might be this table? Nope! No assignment was done.

> freqtab = with(Brooks, table(grade, sex))   # the name freqtab is arbitrary
> ls()
[1] "Brooks"  "freqtab"

NOW the table is in the workspace, and NOW we can operate on it. Once we have that cross-tabulation stored as an object in the workspace, the chi-square test (of independence) is easy. The null hypothesis is: these two variables are independent of one another or have no relationship to one another in the population.

> chisq.test(freqtab)

	Pearson's Chi-squared test

data:  freqtab
X-squared = 23.786, df = 4, p-value = 8.816e-05

Should the null hypothesis be retained or rejected? What's the p-value?

How does this value of chi-squared compare to the one obtained by Brooks?

In which cells of the cross-tabulation do we see the largest effects? I.e., where are the observed frequencies most different from the expected frequencies (frequencies expected to occur if the null hypothesis is true)? This can also be easily seen in R by looking at the standardized residuals. (A "residual" measures how far an observed frequency is from its expected frequency; a standardized residual scales that difference so it can be read much like a z-score.)

> chisq.test(freqtab)$residuals
     sex
grade          0          1
    0 -2.1025203  2.1960109
    1 -1.3218170  1.3805929
    2 -0.8832510  0.9225256
    3  1.0735751 -1.1213126
    4  1.8098141 -1.8902892

The numbers in this table are very similar to z-scores. What is a large z-score? Two (or negative two) is a large z-score. Thus, we see the biggest deviation from what the null predicts in the two cells that represent F grades. Women got far fewer Fs than the null would predict, z=-2.10, while men got far more, z=2.20. The As are pretty far out of kilter as well, with women getting more than expected and men getting fewer than expected.

Is this result consistent with what the numerical analysis revealed?

HairEyeColor is a built-in dataset that is a cross-tabulation of the hair and eye color of statistics students. How would you see that table?

> HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8

Note that this table has frequencies for men and women in separate tables. Thus, this is a three-dimensional cross-tabulation. We can't do a chi-square test on it. The chi-square test of independence is used for two-dimensional crosstabs. However, we can combine the two tables as follows.

> HEC = HairEyeColor[,,1] + HairEyeColor[,,2]

You do not need to know how this works. Just take my word for it! Now, how would you see the resulting table?

> HEC
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16
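(For the curious only: the bracket expression used above pulls out the male slice, HairEyeColor[,,1], and the female slice, HairEyeColor[,,2], and adds them cell by cell. An equivalent way to collapse over the Sex dimension is apply(), telling R to sum while keeping the first two dimensions, Hair and Eye. The name HEC2 here is mine, just to avoid clobbering HEC.)

```r
HEC2 = apply(HairEyeColor, c(1, 2), sum)   # sum over Sex, keep Hair x Eye
HEC2                                       # same table as HEC above
```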

How would you now test the null hypothesis that hair and eye color are unrelated in statistics students?

> chisq.test(HEC)

	Pearson's Chi-squared test

data:  HEC
X-squared = 138.29, df = 9, p-value < 2.2e-16

What is your conclusion? Are hair and eye color unrelated?

Press the reset button if you want to erase your answers and do this again.