PSYC 480 -- Dr. King Explanation of big_tables.RData > ls() [1] "HEC" "pierced" "race" "seatbelts" [5] "tattoos" "thoughts" "twain" "UCB" [9] "wages" "wages.by.race" HEC - the R built-in dataset HairEyeColor (4x4x2). pierced - data from the following published study (data frame) # Koch, J. R., Roberts, A. E., Armstrong, M. L., & Owen, D. C. (2007). # Frequencies and relations of body piercing and sexual experiences in college # students. Psychological Reports, 101, 159-162. race - data from the following published study (data frame) # Giving the death penalty in Florida by race of victim and race of defendant. # Source: Radelet, M. L. & Pierce, G. L. (1991). Florida Law Review, 43, 1-34. seatbelts - data from Brittany Parker's Psyc 497 project, Sp 2009 (data frame) tattoos - data from the following published study (data frame) # Koch, J. R., Roberts, A. E., Armstrong, M. L., & Owen, D. C. (2005). # College students, tattoos, and sexual activity. Psychological Reports, # 97, 887-890. thoughts - data from the Handbook of Small Data Sets (3x4) # Original source: D. Goldberg (1972). The detection of psychiatric illness by # questionnaire. London: Oxford Univ. Press. twain - data from the Handbook of Small Data Sets (2x13) # Brinegar, C. S. (1963). Mark Twain and the Q.C.S. letters -- a statistical # test of authorship. Journal of the Royal Statistical Association, 58, 85-96. # The data set was copied from the Handbook of Small Data Sets (Hand et al.). # Ten letters signed Quintus Curtius Snodgrass were published in the New # Orleans Crescent in 1861. No person by that name has ever been identified. # A common believe among literary scholars is that the letters were actually # written by Mark Twain. This analysis compares the distribution of word # lengths in the QCS letters to a known sample of Mark Twain's writings. UCB - the R built-in dataset UCBAdmissions (2x6x2) wages - data from a random sample of the Current Population Survey 1985 wages.by.race - a contingency table created from "wages" (3x2) # Race: Race (1=Other, 2=Hispanic, 3=White) # A median split was performed on the Wage (dollars/hr) variable. =============================================================================== Answers HECfemale can be analyzed the same way HECmale was in class. Here is something interesting to note. > chisq.test(HECfemale)$residuals Eye Hair Brown Blue Hazel Green Black 3.4943330 -2.283882 -0.9557686 -1.388104709 Brown 1.3745338 -2.505670 1.7415949 -0.043296136 Red 0.4155985 -1.764121 0.6699724 1.742395019 Blond -4.9070016 6.351499 -2.0010649 -0.007895916 The blond hair and blue eyed cell is much more out of agreement with the expected (null) prediction for women than for men. Why do you think? ------------------------------------------------------------------------------- pierced - save for log-linear analysis (LLA) race - save for LLA ------------------------------------------------------------------------------- seatbelts - I'll create and examine a 4x2 contingency table here. > seatbelts.by.vehicle = xtabs(~Vehicle+Seatbelt, data=seatbelts) > seatbelts.by.vehicle Seatbelt Vehicle no yes car 44 107 pickup 47 61 SUV 22 79 van 13 27 > props = prop.table(seatbelts.by.vehicle, margin=1) > props Seatbelt Vehicle no yes car 0.2913907 0.7086093 pickup 0.4351852 0.5648148 SUV 0.2178218 0.7821782 van 0.3250000 0.6750000 It appears that SUV drivers were most likely to use their seatbelts, followed by car drivers, followed by van drivers, and that pickup drivers were the least likely to use their seatbelts. > chisq.test(seatbelts.by.vehicle) Pearson's Chi-squared test data: seatbelts.by.vehicle X-squared = 12.0587, df = 3, p-value = 0.007185 The relationship between these two variables is statistically significant. [Note: the chi-square test must be done on the observed frequencies. It cannot be done on proportions or percentages.] > chisq.test(seatbelts.by.vehicle)$residuals Seatbelt Vehicle no yes car -0.51691101 0.35053047 pickup 2.22539720 -1.50909830 SUV -1.74010053 1.18000631 van 0.11268723 -0.07641607 Only one cell was way off from what the null predicted. Pickup drivers were much more likely not to wear seatbelts than was predicted by the null. There are several ways to draw a graph, some better than others. > barplot(seatbelts.by.vehicle, beside=T, legend=T) Not the best. It would be best to have vehicle type on the x-axis. > barplot(t(seatbelts.by.vehicle), beside=T, legend=T) Better. This graph compares what we really want to compare: yes vs. no within vehicle type. A graph of likelihoods would be a graph of the second column of the prop.table. > barplot(props[,2], ylim=c(0,1), axis.lty=1) However, I think this graph would look better if we rearranged vehicle type on the x-axis so that they were arranged in order from most likely to least likely to wear seatbelts. > barplot(c(.78,.71,.68,.56), ylim=c(0,1), axis.lty=1, + names.arg=c("SUV","Car","Van","Pickup"), + ylab="Likelihood of Driver Wearing Seatbelts", cex.lab=1.2) ------------------------------------------------------------------------------- tattoos - save for LLA ------------------------------------------------------------------------------- thoughts - suicidal thoughts by severity of mental illness, as described in class > thoughts answer severity def.has crossed.mind dont.think def.not normal 1 3 5 90 mild 15 21 18 43 severe 36 21 8 34 The question the subjects was answering was: Have you recently found the idea of taking your own life kept coming into your mind? The available answers were: definitely has, has crossed my mind, I don't think so, definitely not. It's pretty clear just from looking at the table that this is more likely to happen in those with more severe illness. A graph shows this most clearly. > barplot(t(thoughts), beside=T, legend=T) Proportions show it very clearly as well. > prop.table(thoughts, margin=1) answer severity def.has crossed.mind dont.think def.not normal 0.01010101 0.03030303 0.05050505 0.90909091 mild 0.15463918 0.21649485 0.18556701 0.44329897 severe 0.36363636 0.21212121 0.08080808 0.34343434 > .363636/.010101 [1] 36 Patients with severe mental illness were 36 times more likely to answer this question "definitely has" than were normal controls (and 7 times more likely to answer "has crossed my mind"). > chisq.test(thoughts) Pearson's Chi-squared test data: thoughts X-squared = 91.253, df = 6, p-value < 2.2e-16 The relationship is highly significant. > sqrt(91.253/(2*sum(thoughts))) [1] 0.3932761 Cramer's phi is fairly large. > chisq.test(thoughts)$residuals answer severity def.has crossed.mind dont.think def.not normal -3.9380390 -3.1141057 -1.6752471 4.5357680 mild -0.5074487 1.6126791 2.4452103 -1.6074840 severe 4.4403358 1.5177994 -0.7451381 -2.9446041 The residuals also tell the story very clearly. ------------------------------------------------------------------------------- twain - We did this in class. An interesting article on this controversy can be found here: http://www.jstor.org/stable/2923498 ------------------------------------------------------------------------------- UCB - This can be analyzed as two 2x6 tables, similar to the way we analyzed the HEC table above. ------------------------------------------------------------------------------- wages.by.race > wages.by.race wage.median.split race above at.or.below Other 31 36 Hispanic 7 20 White 228 212 A couple of things should be mentioned. First, it seems clear to me that this table was created to compare Hispanics and Whites, as evidenced by the fact that all other races are lumped together into Others. Second, the variable "Wage" in this dataset is numerical (dollars per hour). It was turned into a categorical variable by a technique called median split, in which each subject is classified as above vs. at or below the median Wage. (At or above vs. below can also be used.) This sacrifices a lot of information in the Wage variable, thus making the analysis much less powerful, and is generally a bad idea. It's done when a variable is poorly behaved for analysis as a numerical variable (i.e., it might be strongly skewed or bimodal). It's also done when a teacher needs an example of contingency table for class. > prop.table(wages.by.race, margin=1) wage.median.split race above at.or.below Other 0.4626866 0.5373134 Hispanic 0.2592593 0.7407407 White 0.5181818 0.4818182 It's pretty clear that Whites are doing better than Hispanics. Whites are 2.0 times more likely to have an hourly wage that is above the median than are Hispanics. > chisq.test(wages.by.race) Pearson's Chi-squared test data: wages.by.race X-squared = 7.2068, df = 2, p-value = 0.02723 The relationship is significant. The effect size is not impressive (but then we have to remember that a lot of the Wage information has been thrown away by the median split). > sqrt(7.2068/sum(wages.by.race)) # Cramer's phi [1] 0.1161718 > chisq.test(wages.by.race)$residuals wage.median.split race above at.or.below Other -0.4110271 0.4094905 Hispanic -1.7586111 1.7520368 White 0.5960291 -0.5938010 The only cells that come close to strongly deviating from expected are those for Hispanics. There are fewer Hispanics above the median and more at or below the median than expected by the hypothesis of independence. -------------------------------------------------------------------------------