Psyc 480 - Dr. King

Some Problems That Will Allow You to Practice Your R Skills

Problem 1: Adult PTSD and childhood sexual abuse in women.

These data are those obtained by Rodriguez, N., Ryan, S. W., Vande Kemp, H., & Foy, D. W. (1997). Posttraumatic stress disorder in adult female survivors of childhood sexual abuse: A comparison study. Journal of Consulting and Clinical Psychology, 65, 53-59. Specifically, these are adult PTSD scores from women who were sexually abused or not abused as children. Is there a relationship between the two variables? That is, is it true that one of these groups is more likely to develop, or displays greater, PTSD as adults? Is this a true experiment or a quasi-experiment? What will we be able to conclude when the analysis is done?

Abused
 9.71  6.17 15.16 11.31  9.95  9.84  5.98 11.11  6.26  7.04 13.89 14.98 10.82
13.91 18.17 12.92 15.08  8.31  9.28  9.29 11.93 13.52  8.54 18.99 17.20 10.37
 8.48 14.08 13.45  9.37 16.13 12.07 10.84 17.41 14.90 12.74 15.42 17.64 11.89
 7.33 10.77 15.25 11.08  7.62 11.13

NotAbused
 6.14  0.74  3.46  6.91  4.54  6.34  7.30 -3.35  4.47  4.02  6.22  5.79  8.54
-0.38  4.75  6.39  6.83 10.47  1.11  7.40  6.17  5.49 10.91  4.84  3.63 -0.14
 6.57 -3.12 -0.47  6.84  7.13

You'll want to calculate appropriate summary (descriptive) statistics, do a t-test, and calculate an effect size.

The mathematics behind the t-test makes a couple of important assumptions. One is that sampling is from normal distributions of scores. A way to graphically evaluate this is with histograms.

> hist(Abused)
> hist(NotAbused)
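
If you'd like to see the two histograms side by side (entirely optional), base R will split the plotting window for you:

> par(mfrow=c(1,2))   # one row, two columns of plots
> hist(Abused)
> hist(NotAbused)
> par(mfrow=c(1,1))   # put the window back the way it was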

A better way is by drawing a normal probability plot.

> qqnorm(Abused)
> qqline(Abused)
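
Give the NotAbused group the same treatment:

> qqnorm(NotAbused)
> qqline(NotAbused)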

What you want to see are the points following the line fairly closely. It won't be perfect, of course, because these are sample data, and samples are what? If the points bow away from the line, that indicates skew. If the points fall below the line at the bottom left or rise above it at the upper right, that indicates a problem with normality in the tails of the distribution. The t-test is most sensitive to this assumption when the design is unbalanced. Is this design unbalanced?

The other important assumption is homogeneity of variance, which means the samples come from populations that have the same variance. For now, we'll test that by eyeballing the sample variances or standard deviations. They won't be identical, of course, because these are samples, and samples are what? But they should be "reasonably close." More on exactly what that means later.
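
Once you have the data entered (see below), the var() function will give you the sample variances to eyeball:

> var(Abused); var(NotAbused)   # sd() would give the standard deviations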

Answers

Come on now! Don't look at this until you've tried it for yourself.

The data can be entered into your own version of R (or on a university computer, or in RStudioCloud) using the scan() method demonstrated in the video (because the data values are separated by spaces rather than commas). Type a name for your variable, then an equals sign, which means you're going to assign a value or values to that variable, then type scan() and press Enter (or Return on a Mac). When the 1: prompt appears, copy and paste the data. It would look something like this.

> Abused = scan()
1:  9.71  6.17 15.16 11.31  9.95  9.84  5.98 11.11  6.26  7.04 13.89 14.98 10.82
13.91 18.17 12.92 15.08  8.31  9.28  9.29 11.93 13.52  8.54 18.99 17.20 10.37
 8.48 14.08 13.45  9.37 16.13 12.07 10.84 17.41 14.90 12.74 15.42 17.64 11.89
 7.33 10.77 15.25 11.08  7.62 11.13

At this point, to terminate the data entry, you press Enter twice (or more). Then do the same thing for the NotAbused variable. It will look something like this on your screen, although it may be somewhat different depending upon how wide you have your R window set.

> Abused = scan()
1:  9.71  6.17 15.16 11.31  9.95  9.84  5.98 11.11  6.26  7.04 13.89 14.98 10.82
14: 13.91 18.17 12.92 15.08  8.31  9.28  9.29 11.93 13.52  8.54 18.99 17.20 10.37
27:  8.48 14.08 13.45  9.37 16.13 12.07 10.84 17.41 14.90 12.74 15.42 17.64 11.89
40:  7.33 10.77 15.25 11.08  7.62 11.13
46: 
Read 45 items
> NotAbused = scan()
1:  6.14  0.74  3.46  6.91  4.54  6.34  7.30 -3.35  4.47  4.02  6.22  5.79  8.54
14: -0.38  4.75  6.39  6.83 10.47  1.11  7.40  6.17  5.49 10.91  4.84  3.63 -0.14
27:  6.57 -3.12 -0.47  6.84  7.13
32: 
Read 31 items
> 

At this point, if you're insecure about it, you can check your workspace with ls(), as follows.

> ls()
[1] "Abused"    "NotAbused"

I cleared my workspace before I began (there is a menu item for that under the Workspace menu on the Mac, and under the Misc menu in Windows; it says Remove all objects--a thing in your workspace is called an "object"), so the two variables I've just created are the only objects in my workspace. A three-number summary (mean, standard deviation, and sample size for each group) could be obtained as follows.

> mean(Abused); mean(NotAbused)   # two functions on one line separated by ;
[1] 11.94067
[1] 4.694839
> sd(Abused); sd(NotAbused)
[1] 3.439882
[1] 3.519886
> length(Abused); length(NotAbused)
[1] 45
[1] 31

You could also do these functions one at a time rather than having two on a line. That's entirely up to you. Simpler is usually better when you're learning.

If you want to get really fancy (THIS IS ENTIRELY OPTIONAL: don't even look at it if you don't care about being really fancy!), you could do this. This creates what is called a "named vector" of the group means and stores it in your workspace. I'm just showing off, so you don't have to know this.

> means = c(mean(Abused), mean(NotAbused))
> names(means) = c("Abused", "NotAbused")
> means
   Abused NotAbused 
11.940667  4.694839
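
And while I'm still showing off, a named vector lets you pull out values by name, so the difference between the means is:

> means["Abused"] - means["NotAbused"]
  Abused 
7.245828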

OKAY, START LOOKING AGAIN. These are independent groups, so the significance test is a t-test for independent groups. We will use the pooled-variance t-test, because the standard deviations are very similar, and therefore we can assume that the homogeneity of variance assumption has been met.

> t.test(Abused, NotAbused, var.eq=T)

	Two Sample t-test

data:  Abused and NotAbused
t = 8.9397, df = 74, p-value = 2.162e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 5.630820 8.860835
sample estimates:
mean of x mean of y 
11.940667  4.694839

The t-value is huge and the p-value is minuscule, so there is clearly a significant difference between the means of these two samples. Had we stated a null hypothesis (no difference in the population means), we'd now be rejecting it.
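
Incidentally, if you had left off var.eq=T, R would have done the Welch test by default, which does not assume homogeneity of variance. Feel free to try it and compare:

> t.test(Abused, NotAbused)

With standard deviations as similar as these, the conclusion would not change.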

An effect size measure (Cohen's d) is a little trickier. Just eyeballing it, we can say that this is a large effect. Notice that the difference between the means is more than 7, while the standard deviations are about 3.5. That tells us we are going to get a Cohen's d greater than 2, and that is a whopping great effect! How would we get an accurate value for Cohen's d? There is no Cohen's d function in the base R packages (another oversight in my opinion), so we need a way to calculate it.

If you've done a pooled-variance t-test (yes, you have, var.eq=T), then there is a quick way to get Cohen's d. Write this into your notes (which you are surely keeping, right?).

d = t * sqrt(N / (n1 * n2))

Where t is the calculated t-value, n1 and n2 are the group sizes, and N is the total number of subjects in both groups combined (n1 + n2). Watch the parentheses; without them, R would divide N by n1 and then multiply by n2, which is not what we want. This formula works because d is the difference between the means divided by the pooled standard deviation, while t is that same difference divided by the pooled standard deviation times sqrt(1/n1 + 1/n2), and sqrt(1/n1 + 1/n2) is equal to sqrt(N / (n1 * n2)). Thus...

> 8.9397*sqrt(76/(45*31))
[1] 2.086616

...and that's a pretty whopping big Cohen's d! We may see these data again later. There is an unaccounted-for confound. Can you think what it might be?
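
One more optional flourish before moving on. If you'd rather get Cohen's d from the raw scores (and check the shortcut formula while you're at it), you could write a small function of your own. The name cohens_d below is my own invention, not something built into R:

> cohens_d = function(x, y) {
+     n1 = length(x); n2 = length(y)
+     sp = sqrt(((n1-1)*var(x) + (n2-1)*var(y)) / (n1+n2-2))   # pooled SD
+     (mean(x) - mean(y)) / sp
+ }
> cohens_d(Abused, NotAbused)

You should get the same answer the shortcut gave, give or take a little rounding in the t-value.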

Problem 2: Drinking Among College Students and Gender

# AUDIT stands for Alcohol Use Disorders Identification Test. Scores on this
# test greater than 8 supposedly indicate the possible existence of "problem
# drinking." Kellie Dunlap used the AUDIT as the dependent measure in her Psyc
# 497 project (Fall 2008). 70 CCU students were scored on the test. Other
# information she collected from her subjects was gender and home state.
# variables:
#    sex - gender of subject
#    south - whether or not the subject was from a state south of the
#            Mason-Dixon line
#    AUDIT - AUDIT score

sex: F
17  3 19  9 14  9  9  5 16 10  4  3 10  3  2 10 20  5  2  6  4  8  4  4  1
 2  1 21 14  1  4 11 10  8  7  0 18  1  9  6  0  9  8  4 14  4 11 22  0 17
sex: M
11  7  9 10  6 13  1  8 13 11 26 13 19  6  1 20 17 12 15 24

Problem 3: Drinking Among College Students and Home State

Same data source but broken into groups by "south" this time.

south: No
17 11 19 14  6  9 10 20  2  4  8  4  1  8 21 14  4 26 11  0 19  1  4 11 22
17 24
south: Yes
 3  7  9  9 10 13  9  5 16  4  3 10  3  2 10  5  6  4  1  2  1  1 13 11 13
10  8  7 18  6  1  9  6  0  9  8  4 14 20 17 12  0 15

Problem 4.

Kellie Dunlap's data are available at the website in a data file called audit.txt (also audit.csv). It can be read directly into R as a data frame. Enough said.
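
Okay, a little more said, in case you get stuck: assuming you've downloaded audit.csv into your R working directory, reading it would go something like this (read.table("audit.txt", header=TRUE) would do the same job for the text version, provided that file has a header row).

> audit = read.csv("audit.csv")
> summary(audit)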