MAKING R WORK FOR YOU In the course of working through these tutorials, you may have noticed that, just in places, R is a twinkling shy of what might be referred to as perfection. This occurs mostly in three ways: functions with arcane names, functions with goofy syntax, and functions that don't exist, such as ones for sum of squares and standard error of the mean. In this tutorial we're going to take care of a lot of that. My intention is to someday bundle together a lot of what is done in this tutorial and offer it as an optional package that can be downloaded from CRAN. Someday. In the meantime, I'm going to show how to create your own bundle of custom functions, which can be continually modified, and which you can use without cluttering up your workspace with custom functions. Before you read this tutorial, you should read
Writing Your Own Functions and Scripts.
> setwd("Rspace") # very important (reasonably important anyway!) > rm(list=ls()) # a clean workspace is essential Optional Packages Before we get to writing our own stuff, it is worthwhile to have one of the
optional packages available at CRAN. Frankly, I think this should be part of
the default installation. One Note: When I installed it, I got a warning about
version numbers. I installed it anyway and haven't had a problem, but you might
want to consider upgrading to the latest version of R first.
> install.packages("car") # see the Package Management tutorialWhen you do this, R will open a list of mirror sites for you, from which you should choose one near you. Once you do that, the rest happens automatically. Let's see what we're getting.
> library(car) # attach to the search path in position 2 > ls(pos=2) # see the contents [1] "Adler" "AMSsurvey" "Angell" [4] "Anova" "Anscombe" "av.plot" [7] "av.plots" "avPlot" "avPlots" [10] "basicPower" "basicPowerAxis" "Baumann" [13] "bc" "bcPower" "bcPowerAxis" [16] "Bfox" "Blackmore" "Boot" [19] "bootCase" "box.cox" "box.cox.powers" [22] "box.cox.var" "box.tidwell" "boxCox" [25] "boxCoxVariable" "Boxplot" "boxTidwell" [28] "Burt" "CanPop" "carWeb" [31] "ceres.plot" "ceres.plots" "ceresPlot" [34] "ceresPlots" "Chile" "Chirot" [37] "compareCoefs" "confidence.ellipse" "confidenceEllipse" [40] "contr.Helmert" "contr.Sum" "contr.Treatment" [43] "cookd" "Cowles" "cr.plot" [46] "cr.plots" "crp" "crPlot" [49] "crPlots" "data.ellipse" "dataEllipse" [52] "Davis" "DavisThin" "deltaMethod" [55] "densityPlot" "Depredations" "dfbetaPlots" [58] "dfbetasPlots" "Duncan" "durbin.watson" [61] "durbinWatsonTest" "dwt" "ellipse" [64] "Ericksen" "estimateTransform" "Florida" [67] "Freedman" "Friendly" "gamLine" [70] "Ginzberg" "Greene" "Guyer" [73] "Hartnagel" "hccm" "Highway1" [76] "Identify3d" "infIndexPlot" "influenceIndexPlot" [79] "influencePlot" "inverseResponsePlot" "invResPlot" [82] "invTranEstimate" "invTranPlot" "KosteckiDillon" [85] "Leinhardt" "levene.test" "leveneTest" [88] "leverage.plot" "leverage.plots" "leveragePlot" [91] "leveragePlots" "lht" "linear.hypothesis" [94] "linearHypothesis" "linearHypothesis.default" "loessLine" [97] "logit" "makeHypothesis" "Mandel" [100] "Manova" "marginalModelPlot" "marginalModelPlots" [103] "matchCoefs" "mcPlot" "mcPlots" [106] "Migration" "mmp" "mmps" [109] "Moore" "Mroz" "ncv.test" [112] "ncvTest" "nextBoot" "OBrienKaiser" [115] "Ornstein" "outlier.test" "outlierTest" [118] "panel.car" "Pottery" "powerTransform" [121] "Prestige" "printHypothesis" "probabilityAxis" [124] "qq.plot" "qqp" "qqPlot" [127] "quantregLine" "Quartet" "recode" [130] "Recode" "regLine" "residualPlot" [133] "residualPlots" "Robey" "Sahlins" [136] "Salaries" "scatter3d" "scatterplot" [139] "scatterplot.matrix" "scatterplotMatrix" "showLabels" [142] "sigmaHat" "SLID" "slp" [145] "Soils" "some" "sp" [148] "spm" "spread.level.plot" "spreadLevelPlot" [151] "States" "subsets" "symbox" [154] "testTransform" "Transact" "UN" [157] "USPop" "vif" "Vocab" [160] "wcrossprod" "WeightLoss" "which.names" [163] "whichNames" "Womenlf" "Wong" [166] "Wool" "yjPower" "yjPowerAxis"That is a lot of stuff! Many of those items are data sets, such as Adler, for example. Help pages are available if you want details. There are also a few functions here that appear to be available in the base packages, for example, Anova and Boxplot. The function name is capitalized, however, so as not to mask the similar function in base R. The "car" versions work somewhat differently. Just a few things here worth pointing out are Anova, which allows Type II
and Type III tests for unbalanced factorial designs (Type II is default, in
fact), leveneTest, which computes a Levene's test for homogeneity of variance,
Recode and recode, which allow more convenient recoding of a numeric vector
than is available in base R, and vif, which calculates variance inflation
factors for linear models. In fact, there is so much good stuff here that I
will just have to leave you to investigate it for yourself. This package was
developed to accompany the very excellent R manual "An R Companion to Applied
Regression" (2nd ed.) by Fox and Weisberg (2011), which I can highly recommend
for an introduction to much of what is in "car".
> detach("package:car")I'm not suggesting that this is the only valuable optional package at CRAN, and may not even be the most valuable one to you, but I have found it to be almost indispensable. A Few Simple One-Liner Custom Functions That Work on Vectors To begin, make sure your workspace is clean, and I mean spotless! I would also recommend if you have a .RData file in your current working directory that you do not want to lose, you should load it and rename it. Then once again, clean out your workspace. Your workspace should be character(0)! You may recall that the length() function has
a shortcoming. It counts missing values, and there is nothing you can do to
stop it. Thus, if you want to know how many nonmissing values there are in a
vector--how many values actually went into a calculation--the length() function may not be your best bet. Here is a simple
replacement. The workings of this function was explained in a previous tutorial.
> sampsize = function(X) length(X) - sum(is.na(X))And test it. (NOTE: I don't mean these tests to be rigorous, but these functions should work. Here I just want to show that they give an answer. If anyone sees a fault in any of these functions, please let me know.) > ls() [1] "sampsize" > length(rivers) [1] 141 > sampsize(rivers) [1] 141 > length(islands) [1] 48 > sampsize(islands) [1] 48 > rivers[c(10:15)] = NA # won't hurt the permanent version > length(rivers) [1] 141 > sampsize(rivers) [1] 135 >rm(rivers)There is no sum of squares function in R. This is not too much of an oversight, because you rarely would need such a function in most statistical work. But you sure need one when you're first learning statistics, AND when you're trying to teach it! > SS = function(X) sum(X^2, na.rm=TRUE) - sum(X, na.rm=TRUE)^2 / (length(X) - sum(is.na(X)))This should work on any numeric vector, even one with missing values. > ls() [1] "sampsize" "SS" > SS(rivers) [1] 34147177 > rivers[10:20] = NA > SS(rivers) [1] 33177355 > rm(rivers) > SS(c(1,2,3)) # because we know the answer to this one! [1] 2 > SS(c(1,2,NA)) [1] 0.5Likewise, a function for standard error of the mean. > sem = function(X) sqrt(var(X, na.rm=TRUE) / (length(X) - sum(is.na(X))))And test it. > ls() [1] "sampsize" "sem" "SS" > sqrt(var(rivers)/length(rivers)) [1] 41.59143 > sem(rivers) [1] 41.59143 > rivers[10:20] = NA > sem(rivers) [1] 44.47893 > rm(rivers) > sem(c(1,2,3)) [1] 0.5773503 > sqrt(1/3) [1] 0.5773503It annoys me that the name and syntax of the tapply() function are so goofy. So I'm going to write a simple one-line replacement. (The default argument labels in tapply are X, INDEX, and FUN, and my students never get it). > bygroups = function(data, groups, FUN=mean) tapply(data, groups, FUN)This should do the means by default but can also be used with other functions. > ls() [1] "bygroups" "sampsize" "sem" "SS" > head(chickwts) weight feed 1 179 horsebean 2 160 horsebean 3 136 horsebean 4 227 horsebean 5 217 horsebean 6 168 horsebean > with(chickwts, tapply(weight, feed, mean)) casein horsebean linseed meatmeal soybean sunflower 323.5833 160.2000 218.7500 276.9091 246.4286 328.9167 > with(chickwts, bygroups(data=weight, groups=feed)) casein horsebean linseed meatmeal soybean sunflower 323.5833 160.2000 218.7500 276.9091 246.4286 328.9167 > with(chickwts, tapply(weight, feed, sd)) casein horsebean linseed meatmeal soybean sunflower 64.43384 38.62584 52.23570 64.90062 54.12907 48.83638 > with(chickwts, bygroups(data=weight, groups=feed, FUN=sd)) casein horsebean linseed meatmeal soybean sunflower 64.43384 38.62584 52.23570 64.90062 54.12907 48.83638Finally, when I summarize a numeric variable, more often than not I want a mean, a standard deviation, and a sample size. The summary() function gives me only one of those things. > summarize = function(X) c(mean=mean(X, na.rm=TRUE), SD=sd(X, na.rm=TRUE), N=sampsize(X))And test it. Notice that I took a shortcut here. This function depends upon the presence of the sampsize() function. > ls() [1] "bygroups" "sampsize" "sem" "SS" "summarize" > mean(rivers) [1] 591.1844 > sd(rivers) [1] 493.8708 > length(rivers) [1] 141 > summarize(rivers) mean SD N 591.1844 493.8708 141.0000 > rivers[10:20] = NA > summarize(rivers) mean SD N 599.2231 507.1378 130.0000 > rm(rivers)Now check your workspace to make sure it has NOTHING in it but these five functions. Then save and clear your workspace. > ls() [1] "bygroups" "sampsize" "sem" "SS" "summarize" > getwd() # just checking! [1] "/Users/billking/Rspace" > save.image("myfunctions.RData") > rm(list=ls()) > ls() character(0)I will show you how these functions will be loaded and used later. For now, on to more complex stuff! Making a Modification or Two I'm not entirely happy with our sampsize() function. First of all, I don't like the name. I also don't like that, as the length() function does, it does things its way and offers us no opportunity to modify that. What I want is a length function with an na.rm option. Open a script window (File > New Script in Windows or File > New Document on
a Mac), and type this.
size = function(X, na.rm=TRUE) { ifelse(na.rm, length(X) - sum(is.na(X)), length(X)) }Then choose Edit > Run All in Windows, or highlight all lines of the script and choose Edit > Execute on a Mac. This should put a function in your workspace called size(). The function first tests na.rm. If it's true, the second line is executed, and the function works like sampsize. If it's false, then the third line is executed, and the function works like length. The default is true. If you want false, you have to set the option when you use the function. > size(rivers) [1] 141 > size(1:100) [1] 100 > size(c(1:98,NA,NA)) [1] 98 > length(c(1:98,NA,NA)) [1] 100 > size(c(1:98,NA,NA), na.rm=F) [1] 100Check your workspace. The new function should be the ONLY thing there. If not, make it so. Then... > load("myfunctions.RData") > rm(sampsize) > save.image("myfunctions.RData")...and your myfunctions.RData file is modified. What's the problem? Our summarize() function depends upon the presence of sampsize, and we've just undone that. It was kind of a questionable thing to do in the first place, so we're going to fix it by amending the summarize() function. Once again in the script window (delete anything that is still there)... summarize = function(X) { c(MEAN=mean(X, na.rm=T), MEDIAN=median(X, na.rm=T), SD=sd(X, na.rm=T), N=length(X)-sum(is.na(X)), NAs=sum(is.na(X))) }Run/Execute it. Then test it. > summarize(rivers) MEAN MEDIAN SD N NAs 591.1844 425.0000 493.8708 141.0000 0.0000 > summarize(c(1:100)) MEAN MEDIAN SD N NAs 50.50000 50.50000 29.01149 100.00000 0.00000 > summarize(c(1:98,NA,NA)) MEAN MEDIAN SD N NAs 49.50000 49.50000 28.43413 98.00000 2.00000 > summarize(c(1:98,NA,NA,1000)) MEAN MEDIAN SD N NAs 59.10101 50.00000 99.62936 99.00000 2.00000Looks like it works, and when we created it, it overwrote the old version that was in the workspace. Now we need to save the workspace again to update the myfunctions file. Then clear it and on to the next section with thee! > save.image("myfunctions.RData") # good idea to ls() it first! > rm(list=ls()) Error Bars, Critical r, and Cohen's d In this and future sections of this tutorial, I am going to be typing my functions into a script window. That way they will be much easier to fix if (when!) I screw them up. Reminder: The script window is reached in Windows via File > New Script, and on a Mac via File > New Document. There is no facility in R for plotting error bars on bar graphs (or any
other kind of graph). Putting on my R apologist hat, I would say fine, because
bar graphs are for frequencies, not for means! But not everyone sees it that
way, and I must respect their right to be mistaken, so I'm told. So here is an
error.bars() function. If you want to just copy and
paste it into R to create the function, I'm okay with that.
error.bars = function (x, y, e, cbl = 0.03) { num.groups = length(x) for (i in 1:num.groups) { arrows(x[i], y[i] - e[i], x[i], y[i] + e[i], angle = 90, length = cbl, code = 3) } }The function must be sent three vectors: x = a vector of x-coordinates where you want error bars plotted, y = a vector of y-coordinates where you want error bars centered (same order as the x-coords), and e = a vector of errors (usually standard errors) that give half the length of the error bars. Bar graphs are tricky in R, because they don't plot bars where you might
expect to find them. Let me show you.
> summary(chickwts) # some data to demonstrate weight feed Min. :108.0 casein :12 1st Qu.:204.5 horsebean:10 Median :258.0 linseed :12 Mean :261.3 meatmeal :11 3rd Qu.:323.5 soybean :14 Max. :423.0 sunflower:12 > means = with(chickwts, tapply(weight, feed, mean)) # get and save means > barplot(means) > barplot(means, ylim=c(0,500)) # make room for error bars > sds = with(chickwts, tapply(weight,feed, sd)) # get and save SDs > lengths = with(chickwts, tapply(weight, feed, length)) # get and save Ns > sems = sds / sqrt(lengths) # get sems > barplot(means, ylim=c(0,500)) -> bp # I always forget this! > bp # this is where the bars are plotted! [,1] [1,] 0.7 [2,] 1.9 [3,] 3.1 [4,] 4.3 [5,] 5.5 [6,] 6.7 > error.bars(x=bp, y=means, e=sems, cbl=.1) # draw the error barsSave the output of your barplot command. That will store the x-coords of the bars on the bar graph. The cbl=.1 option controls the length of the crossbar on the error bars (default value .03). Now we've made a mess, and we must clean it up! > ls() [1] "bp" "error.bars" "lengths" "means" "sds" "sems" > rm(bp, lengths, means, sds, sems) > ls() [1] "error.bars"R does a good job of calculating correlations and correlation matrices, but it gives no indication of when these values are statistically significant. I could always look it up in a table, but hey! What do I have a computer for? Here is a simple function for r-critcal based on the relationship between r and student's t. r.crit = function (df, alpha = 0.05) { t1 = qt(p = alpha, df = df, lower.tail = FALSE) r1 = sqrt(t1^2/(t1^2 + df)) t2 = qt(p = alpha/2, df = df, lower.tail = FALSE) r2 = sqrt(t2^2/(t2^2 + df)) output = c(df, alpha, r1, r2) names(output) = c("df", "alpha", "1-tail", "2-tail") output = round(output, 4) output }And here's an example of how it is used. > cor(USArrests) Murder Assault UrbanPop Rape Murder 1.00000000 0.8018733 0.06957262 0.5635788 Assault 0.80187331 1.0000000 0.25887170 0.6652412 UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412 Rape 0.56357883 0.6652412 0.41134124 1.0000000 > dim(USArrests) [1] 50 4 > r.crit(df=48, alpha=.01) # default alpha is .05 df alpha 1-tail 2-tail 48.0000 0.0100 0.3281 0.3610Finally, Cohen's d, there is no R function for Cohen's d. We certainly can't have that! Cohens.d = function(group1, group2) { group1 = group1[!is.na(group1)] # strip NAs from group 1 group2 = group2[!is.na(group2)] # strip NAs from group 2 SSE = var(group1)*(length(group1)-1)+var(group2)*(length(group2)-1) dfE = length(group1)+length(group2)-2 MSE = SSE/dfE pooled.s = sqrt(MSE) diff = abs(mean(group1)-mean(group2)) t = diff/(pooled.s*sqrt(1/length(group1)+1/length(group2))) c(diff=diff,t=t,Cohens.d=abs(mean(group1)-mean(group2))/pooled.s) }Send it two vectors that you have used or are about to use as the two groups to be compared in a t-test. > casein = chickwts$weight[chickwts$feed=="casein"] > horsebean = chickwts$weight[chickwts$feed=="horsebean"] > Cohens.d(horsebean, casein) diff t Cohens.d 163.383333 7.019741 3.005674And now we've got some cleaning up to do! Following that, we're going to (carefully!) add these functions to the ones we programmed above, and then we're going to save the whole bunch of them. > ls() [1] "casein" "Cohens.d" "error.bars" "horsebean" "r.crit" > rm(casein, horsebean) > ls() # can't be too safe! [1] "Cohens.d" "error.bars" "r.crit" > load("myfunctions.RData") # add the previous ones > ls() # really! you can't be too safe! [1] "bygroups" "Cohens.d" "error.bars" "r.crit" "sampsize" "sem" [7] "SS" "summarize" > save.image("myfunctions.RData") # save all of them > rm(list=ls())A note: You caught me! Notice the workspace listings above contain sampsize and not size. Why's that? Because the 3rd section of this tutorial is the last section I wrote, so when I was writing this and future sections of this tutorial, it was still sampsize, size having not yet been created. Now on to even more complex stuff! Some Functions Related to ANOVA With apologies to Red Green, what's the difference between Type III sums
of squares and a trampoline? You take your shoes off to jump on a trampoline!
Type III sums of squares are the bane of ANOVA with unbalanced designs. At
least, you'd get that impression from listening to the R community. (See the
tutorial on unbalanced designs for details.) But sometimes you just got to
have them. Bill Venables, a staunch critic of Type III, let the cat out of
the box as to how to get them in R. Here is a function.
aovIII = function (formula, data = NULL) { options(contrasts = c("contr.sum", "contr.poly")) aovIII.out <<- aov(formula, data) print(drop1(aovIII.out, . ~ ., test = "F")) options(contrasts = c("contr.treatment", "contr.poly")) }The syntax is the same as that of the aov() function in so far as feeding it a formula and data are concerned, but there is no need to save the output, as you're going to see it all. Nevertheless, the function creates an aovIII.out object in your workspace, which contains the output as if a Type I ANOVA had been calculated. If you already have such an object there and don't wish it to be harmed, CHANGE ITS NAME BEFORE RUNNING THIS FUNCTION! > data(genotype, package="MASS") # get some unbalanced data > summary(genotype) # weight of crossfostered mice by litter and mother Litter Mother Wt A:17 A:16 Min. :36.30 B:15 B:14 1st Qu.:48.20 I:14 I:16 Median :54.00 J:15 J:15 Mean :53.97 3rd Qu.:60.30 Max. :69.80 > with(genotype, table(Litter,Mother)) # check the cell frequencies Mother Litter A B I J A 5 3 4 5 B 4 5 4 2 I 3 3 5 3 J 4 3 3 5 > aovIII(Wt ~ Mother * Litter, data=genotype) # the Type III tests Single term deletions Model: Wt ~ Mother * Litter Df Sum of Sq RSS AIC F value Pr(>F) <none> 2440.8 257.04 Mother 3 671.74 3112.6 265.87 4.1282 0.01142 * Litter 3 27.66 2468.5 251.73 0.1700 0.91612 Mother:Litter 9 824.07 3264.9 256.79 1.6881 0.12005 > summary(aovIII.out) # the Type I tests Df Sum Sq Mean Sq F value Pr(>F) Mother 3 771.6 257.20 4.742 0.00587 ** Litter 3 63.6 21.21 0.391 0.76000 Mother:Litter 9 824.1 91.56 1.688 0.12005 Residuals 45 2440.8 54.24 > rm(genotype, aov.outIII)These are the data used by Bill Venables in his famous Exegeses, in which he discusses unbalanced designs (among other things). One caution: I've never really tested this method with repeated measures factors. I know it works if all factors are between groups. When I was teaching the senior research course here at CCU, I had all my students put their data into a spreadsheet and send it to me, so I could use R to check their statistical analysis. No matter how many times I instructed them on the matter, I just couldn't get them to put repeated measures data into a long-format data frame. So I wrote a function that would analyze the data in short format (single factor designs only). Later, I expanded that function for teaching purposes. Here is that function. (Don't ask me to explain it! I'm not sure I remember!) rmsaov.test = function(matname) { s <- dim(matname)[1] df.subjects <- s - 1 a <- dim(matname)[2] df.treatment <- a - 1 total.df <- a * s - 1 total.SS <- var(as.vector(matname)) * total.df treatment.SS <- var(margin.table(matname, 2)) * df.treatment/s treatment.MS <- treatment.SS/df.treatment subjects.SS <- var(margin.table(matname, 1)) * df.subjects/a subjects.MS <- subjects.SS/df.subjects error.SS <- total.SS - treatment.SS - subjects.SS df.error <- df.treatment * df.subjects error.MS <- error.SS/df.error STATISTIC <- treatment.MS/error.MS PARAMETER <- c(df.treatment, df.error) PVAL <- pf(STATISTIC, df.treatment, df.error, lower.tail = FALSE) DNAME <- deparse(substitute(matname)) METHOD <- "Oneway ANOVA with repeated measures" names(STATISTIC) <- "F" names(PARAMETER) <- c("df treat", "df error") OUT <- list(statistic = STATISTIC, parameter = PARAMETER, p.value = PVAL, method = METHOD, data.name = DNAME, MS.treatment = treatment.MS, MS.error = error.MS, MS.subjects = subjects.MS) class(OUT) <- "htest" print(OUT) print(c(MS.error = error.MS, MS.treat = treatment.MS, MS.subjs = subjects.MS)) cat("\n") cat("Treatment Means\n") print(apply(matname, 2, mean)) cat("\n") cat("Compound Symmetry Check (Variance-Covariance Matrix)\n") covmat <- cov(matname) print(covmat) cat("\n") cat("Estimated Greenhouse-Geisser Epsilon and Adjusted p-value\n") D <- a^2 * (mean(diag(covmat)) - mean(covmat))^2 N1 <- sum(covmat^2) N2 <- 2 * a * sum(apply(covmat, 1, mean)^2) N3 <- a^2 * mean(covmat)^2 epsilon <- D/((a - 1) * (N1 - N2 + N3)) p.adj <- pf(treatment.MS/error.MS, df.treatment * epsilon, df.error * epsilon, lower.tail = FALSE) GGE <- c(epsilon, p.adj) names(GGE) = c("epsilon", "p-adjust") print(GGE) }As you can see, the function takes a single matrix as an argument. The matrix should have the repeated measures data in columns (each time measurement or each repeated measurement on the DV gets its own column). Each row of the matrix corresponds to data from one subject. Here are two examples of how to use this function. The first one involves
fetching some data from a help page for Friedman's test (a nonparametric
repeated measures ANOVA). The data are base running times for three different
methods of rounding first base. (The help page contains some errors, but see
for more details.)
> example(friedman.test) # output not shown; but there's a lot! > ls() [1] "aovIII" "rmsaov.test" "RoundingTimes" "wb" > head(RoundingTimes) # the data matrix we are after Round Out Narrow Angle Wide Angle 1 5.40 5.50 5.55 2 5.85 5.70 5.75 3 5.20 5.60 5.50 4 5.55 5.50 5.40 5 5.90 5.85 5.70 6 5.45 5.55 5.60 > class(RoundingTimes) # it must be a matrix [1] "matrix" > rmsaov.test(RoundingTimes) Oneway ANOVA with repeated measures data: RoundingTimes F = 6.2883, df treat = 2, df error = 42, p-value = 0.004084 MS.error MS.treat MS.subjs 0.007451299 0.046856061 0.200887446 Treatment Means Round Out Narrow Angle Wide Angle 5.543182 5.534091 5.459091 Compound Symmetry Check (Variance-Covariance Matrix) Round Out Narrow Angle Wide Angle Round Out 0.07387987 0.06310065 0.06327922 Narrow Angle 0.06310065 0.06747294 0.06705628 Wide Angle 0.06327922 0.06705628 0.07443723 Estimated Greenhouse-Geisser Epsilon and Adjusted p-value epsilon p-adjust 0.773501463 0.008439799The output is self-explanatory, but for more information see the tutorial on repeated measures designs. The second example is from the anorexia data that we have looked at before. > rm(RoundingTimes, wb) # clean up a bit first > data(anorexia, package="MASS") # fetch the data > head(anorexia) Treat Prewt Postwt 1 Cont 80.7 80.2 2 Cont 89.4 80.1 3 Cont 91.8 86.4 4 Cont 74.0 86.3 5 Cont 78.1 76.1 6 Cont 88.3 78.1 > class(anorexia) # this is no good! [1] "data.frame" > anor = as.matrix(anorexia[,2:3]) # convert the relevant columns to a matrix > rmsaov.test(anor) Oneway ANOVA with repeated measures data: anor F = 8.6293, df treat = 1, df error = 71, p-value = 0.004458 MS.error MS.treat MS.subjs 31.86892 275.00694 59.55305 Treatment Means Prewt Postwt 82.40833 85.17222 Compound Symmetry Check (Variance-Covariance Matrix) Prewt Postwt Prewt 26.85796 13.84207 Postwt 13.84207 64.56401 Estimated Greenhouse-Geisser Epsilon and Adjusted p-value epsilon p-adjust 1.000000000 0.004457718That's it for ANOVA-related functions. Now we must enhance our functions file with the two new ones we've created. > ls() [1] "anor" "anorexia" "aovIII" "rmsaov.test" > rm(anor, anorexia) # get rid of the junk > ls() # check! [1] "aovIII" "rmsaov.test" > load("myfunctions.RData") # add the functions in myfunctions > ls() # check again! [1] "aovIII" "bygroups" "Cohens.d" "error.bars" "r.crit" "rmsaov.test" [7] "sampsize" "sem" "SS" "summarize" > save.image("myfunctions.RData") # save a new myfunctions file > rm(list=ls()) # clean up > ls() character(0)One more new function and we'll call it a day. Linear Contrasts There is no easy way in the base R packages to test complex linear contrasts.
So we're going to create one. The method of calculation will be based on that in
the very excellent statistics book The Statistical Sleuth (3rd ed.) by Ramsey
and Schafer (2013).
t.contrast = function(means,codes,n,spool,dfe) { g = sum(means*codes) seg = spool*sqrt(sum(codes^2/n)) t = g/seg p.value = pt(abs(t),dfe,lower.tail=FALSE)*2 d = abs(g/spool) output <<- c(g=g,SE=seg,t=t,df=dfe,p.value=p.value,d=d) round(output,4) }Executing this function will write an object to your workspace called "output", so be careful about that. If you don't want that to happen, modify the script so that the line that begins output <<- is output <- (a single arrowhead on the assignment arrow). This is how the function works. Linear contrasts are used as an alternative
to ANOVA and are sometimes referred to as planned comparisons. ANOVA is a fine
tool, but it tests an alternative hypothesis that is very general, i.e., any
pattern of means other than all means equal. Thus, ANOVA is most appropriate
when you don't really know much about what to expect from your treatments. If
you have specific hypotheses, for example, group A will have a higher mean
than group B, and groups B and C will be the same, then planned comparisons are
more appropriate.
> ls() # your workspace should contain only the function [1] "t.contrast" > data(chickwts) # weights of chickens by type of feed they received > summary(chickwts) weight feed Min. :108.0 casein :12 1st Qu.:204.5 horsebean:10 Median :258.0 linseed :12 Mean :261.3 meatmeal :11 3rd Qu.:323.5 soybean :14 Max. :423.0 sunflower:12I don't really know much about feeding chickens other than that I had to do it as a kid and didn't care for it. Here we have 71 chickens in six groups by the type of feed they are receiving. The DV is weight. Let's pretend we (I) know enough about chicken feed to have the hypothesis that casein and sunflower are the best feeds, and are about equally good. The other four are not as good and are about equally not as good. So we're expecting the casein and sunflower means to be high and roughly equal, and the other four means to be low and roughly equal. In contrast codes (I'll have to refer you to an outside source here), that hypothesis would be expressed as follows: (1/2, -1/4, -1/4, -1/4, -1/4, 1/2). That will have the effect of combining the first and last group into one big "megagroup" and the middle four groups into another "megagroup," and then those two "megagroups" will be tested against each other. We need to feed our t.contrast function five pieces of information: a vector
of group means, a vector of contrast codes, a vector of group sizes (can be one
number if the design is balanced), the pooled standard deviation calculated over
all groups, and the degrees of freedom for error, which is total number of
subjects minus the number of groups they've been divided into if the design is
between groups. Let's start getting that stuff.
> means = with(chickwts, tapply(weight,feed,mean)) > codes = c(1/2, -1/4, -1/4, -1/4, -1/4, 1/2) # best entered as fractions > n = c(12, 10, 12, 11, 14, 12) # from the summary above > aov(weight~feed, data=chickwts) # easiest way to get spool Call: aov(formula = weight ~ feed, data = chickwts) Terms: feed Residuals Sum of Squares 231129.2 195556.0 Deg. of Freedom 5 65 Residual standard error: 54.85029 > spool = 54.85029 # it's the residual s.e. > dfe = 65And now we're ready to run the function. > t.contrast(means, codes, n, spool, dfe) g SE t df p.value d 100.6781 13.7969 7.2972 65.0000 0.0000 1.8355g is the difference in means between the two "megagroups," SE is the standard error (estimated) of that difference, t is the corresponding test statistic, df and p.value should be obvious, and d is Cohen's d for the difference g. We could (and probably should) continue with further contrasts, but I'll leave that to you as an exercise (as authors always say in these sorts of situations when they really don't want to do something). Now let's add the function to our file. First, make sure there is nothing
in your workspace except this function.
> ls() [1] "chickwts" "codes" "dfe" "means" "n" [6] "output" "spool" "t.contrast" > rm(chickwts, codes, dfe, means, n, output, spool) > ls() [1] "t.contrast"And we made quite a mess while testing the function, so we had to clean it up. Otherwise, all that junk would be going into the myfunctions file. > load("myfunctions.RData") > ls() [1] "aovIII" "bygroups" "Cohens.d" "error.bars" "r.crit" "rmsaov.test" [7] "sampsize" "sem" "SS" "summarize" "t.contrast" > save.image("myfunctions.RData") > rm(list=ls()) There is one more thing I want to do to prepare us for the next section of
this tutorial.
> data(anorexia, package="MASS") # we'll use this below > save.image("anorexia.RData") # a saved project > rm(anorexia) # clean up > palm483S = rnorm(50) # some junk for the workspaceWe are ready to move on. Putting It All Together and Using It We now have eleven shiny new functions in a file called myfunctions.RData. In my case, I have this file saved to my Rspace directory, because that's usually where I go to do my R work. You should adjust the advice here to fit your own working habits. When I start R, I immediately change my working directory to Rspace. Then I examine my workspace, and if there is stuff there I don't want, I remove it. I load anything from Rspace that I might be working on. Then I'm ready to go to work. Suppose I wanted to use one or more of the functions we've just created.
It's an RData file, so I could load() it, but
that would put all those functions in my workspace, and I don't want that.
So instead of loading it, I'm going to attach it to the search path. My
work flow might look like this.
> setwd("Rspace") # don't do this if you're already in Rspace > ls() [1] "palm483S" > rm(palm483S) > attach("myfunctions.RData")Those functions are now in my search path, and if I call one of them, R will be able to find it. > search() [1] ".GlobalEnv" "file:myfunctions.RData" "tools:RGUI" [4] "package:stats" "package:graphics" "package:grDevices" [7] "package:utils" "package:datasets" "package:methods" [10] "Autoloads" "package:base" > ls(pos=2) [1] "aovIII" "bygroups" "Cohens.d" "error.bars" "r.crit" "rmsaov.test" [7] "sampsize" "sem" "SS" "summarize" "t.contrast"Just in case you didn't believe me! > dir() # see what's in Rspace [1] "anorexia.RData" "myfunctions.RData" "UCB.RData" > load("anorexia.RData") # load the previously saved project I want > ls() # nobody here but us data sets [1] "anorexia" > head(anorexia) Treat Prewt Postwt 1 Cont 80.7 80.2 2 Cont 89.4 80.1 3 Cont 91.8 86.4 4 Cont 74.0 86.3 5 Cont 78.1 76.1 6 Cont 88.3 78.1 > dim(anorexia) [1] 72 3 > with(anorexia, cor(Prewt, Postwt)) # do some analysis [1] 0.3324062 > r.crit(df=70) # use one of our custom functions df alpha 1-tail 2-tail 70.0000 0.0500 0.1954 0.2319 > bygroups(data=anorexia$Postwt, groups=anorexia$Treat) # use another one CBT Cont FT 85.69655 81.10769 90.49412 > summarize(anorexia$Prewt) # caught me again; this is the old one mean SD N 82.408333 5.182466 72.000000 > detach("file:myfunctions.RData") # done with it; get it out of the search pathIt's as easy as that! Review of Steps For Modifying Your Functions File You may want to add a function of your own to this file, or delete one. Or you may want to modify one of these functions. Just to collect the steps together in a neat outline, here's how you would do it. Make sure the file is not attached to your search path before you begin. To delete a function (or a data object that snuck in there somehow):
To add a function:
To modify a function:
I should also note that you can add other objects, such as data frames, vectors, matrices, etc., to this file if you want to keep them there for some reason. I have a long list of data frames in mine that I use in my teaching. Sending the file to my students by e-mail is a convenient way to get a large number of data files to them. (No, I won't send it to you. Some of the data files were collected by students and are not for public use. Sorry.) An Alternative Method If you don't really need all your scripts and data files in one big file such as myfunctions.RData, then it may be more convenient to just save them a script at a time in your working directory (Rspace). Open a script window, type the script, then save it with a convenient name and .R as the extension. Be sure R is saving it to your working directory. This is not the default in the script editor, at least not on the Mac, so make the necessary changes in the dialogue box before saving. Then when you need to use it, you just source it in.
> setwd("Rspace") > source("size.R") > size(c(NA,1:5,NA,NA,NA)) [1] 5If you need to modify the script at some point, you don't need to go through all the gyrations above. Just open the script in a script window, make whatever changes you want, and save it. The disadvantage is, if you have a large number of scripts, you might end up making a mess of your working directory. Then again, you could always create a subdirectory. Call it, oh, say, "functions". created 2016 January 24; modified 2016 April 7 |