| Table of Contents
| Function Reference
| Function Finder
| R Project |
MAKING R WORK FOR YOU
In the course of working through these tutorials, you may have noticed that,
just in places, R is a twinkling shy of what might be referred to as
perfection. This occurs mostly in three ways: functions with arcane names,
functions with goofy syntax, and functions that don't exist, such as ones for
sum of squares and standard error of the mean. In this tutorial we're going
to take care of a lot of that.
My intention is to someday bundle together a lot of what is done in this
tutorial and offer it as an optional package that can be downloaded from
CRAN. Someday. In the meantime, I'm going to show how to create your own bundle
of custom functions, which can be continually modified, and which you can use
without cluttering up your workspace with custom functions.
Before you read this tutorial, you should read
Writing Your Own Functions and Scripts.
> setwd("Rspace") # very important (reasonably important anyway!)
> rm(list=ls()) # a clean workspace is essential
Optional Packages
Before we get to writing our own stuff, it is worthwhile to have one of the
optional packages available at CRAN. Frankly, I think this should be part of
the default installation. One Note: When I installed it, I got a warning about
version numbers. I installed it anyway and haven't had a problem, but you might
want to consider upgrading to the latest version of R first.
> install.packages("car") # see the Package Management tutorial
When you do this, R will open a list of mirror sites for you, from which you
should choose one near you. Once you do that, the rest happens automatically.
Let's see what we're getting.
> library(car) # attach to the search path in position 2
> ls(pos=2) # see the contents
[1] "Adler" "AMSsurvey" "Angell"
[4] "Anova" "Anscombe" "av.plot"
[7] "av.plots" "avPlot" "avPlots"
[10] "basicPower" "basicPowerAxis" "Baumann"
[13] "bc" "bcPower" "bcPowerAxis"
[16] "Bfox" "Blackmore" "Boot"
[19] "bootCase" "box.cox" "box.cox.powers"
[22] "box.cox.var" "box.tidwell" "boxCox"
[25] "boxCoxVariable" "Boxplot" "boxTidwell"
[28] "Burt" "CanPop" "carWeb"
[31] "ceres.plot" "ceres.plots" "ceresPlot"
[34] "ceresPlots" "Chile" "Chirot"
[37] "compareCoefs" "confidence.ellipse" "confidenceEllipse"
[40] "contr.Helmert" "contr.Sum" "contr.Treatment"
[43] "cookd" "Cowles" "cr.plot"
[46] "cr.plots" "crp" "crPlot"
[49] "crPlots" "data.ellipse" "dataEllipse"
[52] "Davis" "DavisThin" "deltaMethod"
[55] "densityPlot" "Depredations" "dfbetaPlots"
[58] "dfbetasPlots" "Duncan" "durbin.watson"
[61] "durbinWatsonTest" "dwt" "ellipse"
[64] "Ericksen" "estimateTransform" "Florida"
[67] "Freedman" "Friendly" "gamLine"
[70] "Ginzberg" "Greene" "Guyer"
[73] "Hartnagel" "hccm" "Highway1"
[76] "Identify3d" "infIndexPlot" "influenceIndexPlot"
[79] "influencePlot" "inverseResponsePlot" "invResPlot"
[82] "invTranEstimate" "invTranPlot" "KosteckiDillon"
[85] "Leinhardt" "levene.test" "leveneTest"
[88] "leverage.plot" "leverage.plots" "leveragePlot"
[91] "leveragePlots" "lht" "linear.hypothesis"
[94] "linearHypothesis" "linearHypothesis.default" "loessLine"
[97] "logit" "makeHypothesis" "Mandel"
[100] "Manova" "marginalModelPlot" "marginalModelPlots"
[103] "matchCoefs" "mcPlot" "mcPlots"
[106] "Migration" "mmp" "mmps"
[109] "Moore" "Mroz" "ncv.test"
[112] "ncvTest" "nextBoot" "OBrienKaiser"
[115] "Ornstein" "outlier.test" "outlierTest"
[118] "panel.car" "Pottery" "powerTransform"
[121] "Prestige" "printHypothesis" "probabilityAxis"
[124] "qq.plot" "qqp" "qqPlot"
[127] "quantregLine" "Quartet" "recode"
[130] "Recode" "regLine" "residualPlot"
[133] "residualPlots" "Robey" "Sahlins"
[136] "Salaries" "scatter3d" "scatterplot"
[139] "scatterplot.matrix" "scatterplotMatrix" "showLabels"
[142] "sigmaHat" "SLID" "slp"
[145] "Soils" "some" "sp"
[148] "spm" "spread.level.plot" "spreadLevelPlot"
[151] "States" "subsets" "symbox"
[154] "testTransform" "Transact" "UN"
[157] "USPop" "vif" "Vocab"
[160] "wcrossprod" "WeightLoss" "which.names"
[163] "whichNames" "Womenlf" "Wong"
[166] "Wool" "yjPower" "yjPowerAxis"
That is a lot of stuff! Many of those items are data sets, such as Adler, for
example. Help pages are available if you want details. There are also a few
functions here that appear to be available in the base packages, for
example, Anova and Boxplot. The function name is capitalized, however, so as
not to mask the similar function in base R. The "car" versions work somewhat
differently.
Just a few things here worth pointing out are Anova, which allows Type II
and Type III tests for unbalanced factorial designs (Type II is default, in
fact), leveneTest, which computes a Levene's test for homogeneity of variance,
Recode and recode, which allow more convenient recoding of a numeric vector
than is available in base R, and vif, which calculates variance inflation
factors for linear models. In fact, there is so much good stuff here that I
will just have to leave you to investigate it for yourself. This package was
developed to accompany the very excellent R manual "An R Companion to Applied
Regression" (2nd ed.) by Fox and Weisberg (2011), which I can highly recommend
for an introduction to much of what is in "car".
> detach("package:car")
I'm not suggesting that this is the only valuable optional package at CRAN,
and may not even be the most valuable one to you, but I have found it to be
almost indispensable.
A Few Simple One-Liner Custom Functions That Work on Vectors
To begin, make sure your workspace is clean, and I mean spotless! I would
also recommend if you have a .RData file in your current working directory
that you do not want to lose, you should load it and rename it. Then once
again, clean out your workspace. Your workspace should be character(0)!
You may recall that the length() function has
a shortcoming. It counts missing values, and there is nothing you can do to
stop it. Thus, if you want to know how many nonmissing values there are in a
vector--how many values actually went into a calculation--the length() function may not be your best bet. Here is a simple
replacement. The workings of this function was explained in a previous tutorial.
> sampsize = function(X) length(X) - sum(is.na(X))
And test it. (NOTE: I don't mean these tests to be rigorous, but these
functions should work. Here I just want to show that they give an answer. If
anyone sees a fault in any of these functions, please let me know.)
> ls()
[1] "sampsize"
> length(rivers)
[1] 141
> sampsize(rivers)
[1] 141
> length(islands)
[1] 48
> sampsize(islands)
[1] 48
> rivers[c(10:15)] = NA # won't hurt the permanent version
> length(rivers)
[1] 141
> sampsize(rivers)
[1] 135
>rm(rivers)
There is no sum of squares function in R. This is not too much of an oversight,
because you rarely would need such a function in most statistical work. But you
sure need one when you're first learning statistics, AND when you're trying to
teach it!
> SS = function(X) sum(X^2, na.rm=TRUE) - sum(X, na.rm=TRUE)^2 / (length(X) - sum(is.na(X)))
This should work on any numeric vector, even one with missing values.
> ls()
[1] "sampsize" "SS"
> SS(rivers)
[1] 34147177
> rivers[10:20] = NA
> SS(rivers)
[1] 33177355
> rm(rivers)
> SS(c(1,2,3)) # because we know the answer to this one!
[1] 2
> SS(c(1,2,NA))
[1] 0.5
Likewise, a function for standard error of the mean.
> sem = function(X) sqrt(var(X, na.rm=TRUE) / (length(X) - sum(is.na(X))))
And test it.
> ls()
[1] "sampsize" "sem" "SS"
> sqrt(var(rivers)/length(rivers))
[1] 41.59143
> sem(rivers)
[1] 41.59143
> rivers[10:20] = NA
> sem(rivers)
[1] 44.47893
> rm(rivers)
> sem(c(1,2,3))
[1] 0.5773503
> sqrt(1/3)
[1] 0.5773503
It annoys me that the name and syntax of the tapply()
function are so goofy. So I'm going to write a simple one-line replacement. (The
default argument labels in tapply are X, INDEX, and FUN, and my students never
get it).
> bygroups = function(data, groups, FUN=mean) tapply(data, groups, FUN)
This should do the means by default but can also be used with other functions.
> ls()
[1] "bygroups" "sampsize" "sem" "SS"
> head(chickwts)
weight feed
1 179 horsebean
2 160 horsebean
3 136 horsebean
4 227 horsebean
5 217 horsebean
6 168 horsebean
> with(chickwts, tapply(weight, feed, mean))
casein horsebean linseed meatmeal soybean sunflower
323.5833 160.2000 218.7500 276.9091 246.4286 328.9167
> with(chickwts, bygroups(data=weight, groups=feed))
casein horsebean linseed meatmeal soybean sunflower
323.5833 160.2000 218.7500 276.9091 246.4286 328.9167
> with(chickwts, tapply(weight, feed, sd))
casein horsebean linseed meatmeal soybean sunflower
64.43384 38.62584 52.23570 64.90062 54.12907 48.83638
> with(chickwts, bygroups(data=weight, groups=feed, FUN=sd))
casein horsebean linseed meatmeal soybean sunflower
64.43384 38.62584 52.23570 64.90062 54.12907 48.83638
Finally, when I summarize a numeric variable, more often than not I want a
mean, a standard deviation, and a sample size. The
summary() function gives me only one of those things.
> summarize = function(X) c(mean=mean(X, na.rm=TRUE), SD=sd(X, na.rm=TRUE), N=sampsize(X))
And test it. Notice that I took a shortcut here. This function depends upon
the presence of the sampsize() function.
> ls()
[1] "bygroups" "sampsize" "sem" "SS" "summarize"
> mean(rivers)
[1] 591.1844
> sd(rivers)
[1] 493.8708
> length(rivers)
[1] 141
> summarize(rivers)
mean SD N
591.1844 493.8708 141.0000
> rivers[10:20] = NA
> summarize(rivers)
mean SD N
599.2231 507.1378 130.0000
> rm(rivers)
Now check your workspace to make sure it has NOTHING in it but these five
functions. Then save and clear your workspace.
> ls()
[1] "bygroups" "sampsize" "sem" "SS" "summarize"
> getwd() # just checking!
[1] "/Users/billking/Rspace"
> save.image("myfunctions.RData")
> rm(list=ls())
> ls()
character(0)
I will show you how these functions will be loaded and used later. For now,
on to more complex stuff!
Making a Modification or Two
I'm not entirely happy with our sampsize()
function. First of all, I don't like the name. I also don't like that, as the
length() function does, it does things its way and
offers us no opportunity to modify that. What I want is a length function with
an na.rm option.
Open a script window (File > New Script in Windows or File > New Document on
a Mac), and type this.
size = function(X, na.rm=TRUE)
{
ifelse(na.rm,
length(X) - sum(is.na(X)),
length(X))
}
Then choose Edit > Run All in Windows, or highlight all lines of the script
and choose Edit > Execute on a Mac. This should put a function in your
workspace called size(). The function first tests
na.rm. If it's true, the second line is executed, and the function works like
sampsize. If it's false, then the third line is executed, and the function
works like length. The default is true. If you want false, you have to set the
option when you use the function.
> size(rivers)
[1] 141
> size(1:100)
[1] 100
> size(c(1:98,NA,NA))
[1] 98
> length(c(1:98,NA,NA))
[1] 100
> size(c(1:98,NA,NA), na.rm=F)
[1] 100
Check your workspace. The new function should be the ONLY thing there. If not,
make it so. Then...
> load("myfunctions.RData")
> rm(sampsize)
> save.image("myfunctions.RData")
...and your myfunctions.RData file is modified. What's the problem? Our
summarize() function depends upon the presence of
sampsize, and we've just undone that. It was kind of a questionable thing to
do in the first place, so we're going to fix it by amending the
summarize() function. Once again in the script window (delete anything
that is still there)...
summarize = function(X)
{
c(MEAN=mean(X, na.rm=T),
MEDIAN=median(X, na.rm=T),
SD=sd(X, na.rm=T),
N=length(X)-sum(is.na(X)),
NAs=sum(is.na(X)))
}
Run/Execute it. Then test it.
> summarize(rivers)
MEAN MEDIAN SD N NAs
591.1844 425.0000 493.8708 141.0000 0.0000
> summarize(c(1:100))
MEAN MEDIAN SD N NAs
50.50000 50.50000 29.01149 100.00000 0.00000
> summarize(c(1:98,NA,NA))
MEAN MEDIAN SD N NAs
49.50000 49.50000 28.43413 98.00000 2.00000
> summarize(c(1:98,NA,NA,1000))
MEAN MEDIAN SD N NAs
59.10101 50.00000 99.62936 99.00000 2.00000
Looks like it works, and when we created it, it overwrote the old version that
was in the workspace. Now we need to save the workspace again to update the
myfunctions file. Then clear it and on to the next section with thee!
> save.image("myfunctions.RData") # good idea to ls() it first!
> rm(list=ls())
Error Bars, Critical r, and Cohen's d
In this and future sections of this tutorial, I am going to be typing my
functions into a script window. That way they will be much easier to fix if
(when!) I screw them up. Reminder: The script window is reached in Windows via
File > New Script, and on a Mac via File > New Document.
There is no facility in R for plotting error bars on bar graphs (or any
other kind of graph). Putting on my R apologist hat, I would say fine, because
bar graphs are for frequencies, not for means! But not everyone sees it that
way, and I must respect their right to be mistaken, so I'm told. So here is an
error.bars() function. If you want to just copy and
paste it into R to create the function, I'm okay with that.
error.bars = function (x, y, e, cbl = 0.03)
{
num.groups = length(x)
for (i in 1:num.groups) {
arrows(x[i], y[i] - e[i], x[i], y[i] + e[i], angle = 90,
length = cbl, code = 3)
}
}
The function must be sent three vectors: x = a vector of x-coordinates where
you want error bars plotted, y = a vector of y-coordinates where you want
error bars centered (same order as the x-coords), and e = a vector of errors
(usually standard errors) that give half the length of the error bars.
Bar graphs are tricky in R, because they don't plot bars where you might
expect to find them. Let me show you.
> summary(chickwts) # some data to demonstrate
weight feed
Min. :108.0 casein :12
1st Qu.:204.5 horsebean:10
Median :258.0 linseed :12
Mean :261.3 meatmeal :11
3rd Qu.:323.5 soybean :14
Max. :423.0 sunflower:12
> means = with(chickwts, tapply(weight, feed, mean)) # get and save means
> barplot(means)
> barplot(means, ylim=c(0,500)) # make room for error bars
> sds = with(chickwts, tapply(weight,feed, sd)) # get and save SDs
> lengths = with(chickwts, tapply(weight, feed, length)) # get and save Ns
> sems = sds / sqrt(lengths) # get sems
> barplot(means, ylim=c(0,500)) -> bp # I always forget this!
> bp # this is where the bars are plotted!
[,1]
[1,] 0.7
[2,] 1.9
[3,] 3.1
[4,] 4.3
[5,] 5.5
[6,] 6.7
> error.bars(x=bp, y=means, e=sems, cbl=.1) # draw the error bars
Save the output of your barplot command. That will store the x-coords of the
bars on the bar graph. The cbl=.1 option controls the length of the crossbar
on the error bars (default value .03). Now we've made a mess, and we must
clean it up!
> ls()
[1] "bp" "error.bars" "lengths" "means" "sds" "sems"
> rm(bp, lengths, means, sds, sems)
> ls()
[1] "error.bars"
R does a good job of calculating correlations and correlation matrices, but
it gives no indication of when these values are statistically significant. I
could always look it up in a table, but hey! What do I have a computer for?
Here is a simple function for r-critcal based on the relationship between r
and student's t.
r.crit = function (df, alpha = 0.05)
{
t1 = qt(p = alpha, df = df, lower.tail = FALSE)
r1 = sqrt(t1^2/(t1^2 + df))
t2 = qt(p = alpha/2, df = df, lower.tail = FALSE)
r2 = sqrt(t2^2/(t2^2 + df))
output = c(df, alpha, r1, r2)
names(output) = c("df", "alpha", "1-tail", "2-tail")
output = round(output, 4)
output
}
And here's an example of how it is used.
> cor(USArrests)
Murder Assault UrbanPop Rape
Murder 1.00000000 0.8018733 0.06957262 0.5635788
Assault 0.80187331 1.0000000 0.25887170 0.6652412
UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
Rape 0.56357883 0.6652412 0.41134124 1.0000000
> dim(USArrests)
[1] 50 4
> r.crit(df=48, alpha=.01) # default alpha is .05
df alpha 1-tail 2-tail
48.0000 0.0100 0.3281 0.3610
Finally, Cohen's d, there is no R function for Cohen's d. We certainly can't
have that!
Cohens.d = function(group1, group2)
{
group1 = group1[!is.na(group1)] # strip NAs from group 1
group2 = group2[!is.na(group2)] # strip NAs from group 2
SSE = var(group1)*(length(group1)-1)+var(group2)*(length(group2)-1)
dfE = length(group1)+length(group2)-2
MSE = SSE/dfE
pooled.s = sqrt(MSE)
diff = abs(mean(group1)-mean(group2))
t = diff/(pooled.s*sqrt(1/length(group1)+1/length(group2)))
c(diff=diff,t=t,Cohens.d=abs(mean(group1)-mean(group2))/pooled.s)
}
Send it two vectors that you have used or are about to use as the two groups
to be compared in a t-test.
> casein = chickwts$weight[chickwts$feed=="casein"]
> horsebean = chickwts$weight[chickwts$feed=="horsebean"]
> Cohens.d(horsebean, casein)
diff t Cohens.d
163.383333 7.019741 3.005674
And now we've got some cleaning up to do! Following that, we're going to
(carefully!) add these functions to the ones we programmed above, and then
we're going to save the whole bunch of them.
> ls()
[1] "casein" "Cohens.d" "error.bars" "horsebean" "r.crit"
> rm(casein, horsebean)
> ls() # can't be too safe!
[1] "Cohens.d" "error.bars" "r.crit"
> load("myfunctions.RData") # add the previous ones
> ls() # really! you can't be too safe!
[1] "bygroups" "Cohens.d" "error.bars" "r.crit" "sampsize" "sem"
[7] "SS" "summarize"
> save.image("myfunctions.RData") # save all of them
> rm(list=ls())
A note: You caught me! Notice the workspace listings above contain sampsize
and not size. Why's that? Because the 3rd section of this tutorial is the last
section I wrote, so when I was writing this and future sections of this tutorial,
it was still sampsize, size having not yet been created. Now on to even more
complex stuff!
Some Functions Related to ANOVA
With apologies to Red Green, what's the difference between Type III sums
of squares and a trampoline? You take your shoes off to jump on a trampoline!
Type III sums of squares are the bane of ANOVA with unbalanced designs. At
least, you'd get that impression from listening to the R community. (See the
tutorial on unbalanced designs for details.) But sometimes you just got to
have them. Bill Venables, a staunch critic of Type III, let the cat out of
the box as to how to get them in R. Here is a function.
aovIII = function (formula, data = NULL)
{
options(contrasts = c("contr.sum", "contr.poly"))
aovIII.out <<- aov(formula, data)
print(drop1(aovIII.out, . ~ ., test = "F"))
options(contrasts = c("contr.treatment", "contr.poly"))
}
The syntax is the same as that of the aov()
function in so far as feeding it a formula and data are concerned, but there
is no need to save the output, as you're going to see it
all. Nevertheless, the function creates an aovIII.out object in your workspace,
which contains the output as if a Type I ANOVA had been calculated.
If you already have such an object there and don't wish it to be harmed,
CHANGE ITS NAME BEFORE RUNNING THIS FUNCTION!
> data(genotype, package="MASS") # get some unbalanced data
> summary(genotype) # weight of crossfostered mice by litter and mother
Litter Mother Wt
A:17 A:16 Min. :36.30
B:15 B:14 1st Qu.:48.20
I:14 I:16 Median :54.00
J:15 J:15 Mean :53.97
3rd Qu.:60.30
Max. :69.80
> with(genotype, table(Litter,Mother)) # check the cell frequencies
Mother
Litter A B I J
A 5 3 4 5
B 4 5 4 2
I 3 3 5 3
J 4 3 3 5
> aovIII(Wt ~ Mother * Litter, data=genotype) # the Type III tests
Single term deletions
Model:
Wt ~ Mother * Litter
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 2440.8 257.04
Mother 3 671.74 3112.6 265.87 4.1282 0.01142 *
Litter 3 27.66 2468.5 251.73 0.1700 0.91612
Mother:Litter 9 824.07 3264.9 256.79 1.6881 0.12005
> summary(aovIII.out) # the Type I tests
Df Sum Sq Mean Sq F value Pr(>F)
Mother 3 771.6 257.20 4.742 0.00587 **
Litter 3 63.6 21.21 0.391 0.76000
Mother:Litter 9 824.1 91.56 1.688 0.12005
Residuals 45 2440.8 54.24
> rm(genotype, aov.outIII)
These are the data used by Bill Venables in his famous Exegeses,
in which he discusses unbalanced designs (among other things). One caution:
I've never really tested this method with repeated measures factors. I know it
works if all factors are between groups.
When I was teaching the senior research course here at CCU, I had all my
students put their data into a spreadsheet and send it to me, so I could use
R to check their statistical analysis. No matter how many times I instructed
them on the matter, I just couldn't get them to put repeated measures data into
a long-format data frame. So I wrote a function that would analyze the data in
short format (single factor designs only). Later, I expanded that function for
teaching purposes. Here is that function. (Don't ask me to explain it! I'm
not sure I remember!)
rmsaov.test = function(matname)
{
s <- dim(matname)[1]
df.subjects <- s - 1
a <- dim(matname)[2]
df.treatment <- a - 1
total.df <- a * s - 1
total.SS <- var(as.vector(matname)) * total.df
treatment.SS <- var(margin.table(matname, 2)) * df.treatment/s
treatment.MS <- treatment.SS/df.treatment
subjects.SS <- var(margin.table(matname, 1)) * df.subjects/a
subjects.MS <- subjects.SS/df.subjects
error.SS <- total.SS - treatment.SS - subjects.SS
df.error <- df.treatment * df.subjects
error.MS <- error.SS/df.error
STATISTIC <- treatment.MS/error.MS
PARAMETER <- c(df.treatment, df.error)
PVAL <- pf(STATISTIC, df.treatment, df.error, lower.tail = FALSE)
DNAME <- deparse(substitute(matname))
METHOD <- "Oneway ANOVA with repeated measures"
names(STATISTIC) <- "F"
names(PARAMETER) <- c("df treat", "df error")
OUT <- list(statistic = STATISTIC, parameter = PARAMETER,
p.value = PVAL, method = METHOD, data.name = DNAME, MS.treatment = treatment.MS,
MS.error = error.MS, MS.subjects = subjects.MS)
class(OUT) <- "htest"
print(OUT)
print(c(MS.error = error.MS, MS.treat = treatment.MS, MS.subjs = subjects.MS))
cat("\n")
cat("Treatment Means\n")
print(apply(matname, 2, mean))
cat("\n")
cat("Compound Symmetry Check (Variance-Covariance Matrix)\n")
covmat <- cov(matname)
print(covmat)
cat("\n")
cat("Estimated Greenhouse-Geisser Epsilon and Adjusted p-value\n")
D <- a^2 * (mean(diag(covmat)) - mean(covmat))^2
N1 <- sum(covmat^2)
N2 <- 2 * a * sum(apply(covmat, 1, mean)^2)
N3 <- a^2 * mean(covmat)^2
epsilon <- D/((a - 1) * (N1 - N2 + N3))
p.adj <- pf(treatment.MS/error.MS, df.treatment * epsilon,
df.error * epsilon, lower.tail = FALSE)
GGE <- c(epsilon, p.adj)
names(GGE) = c("epsilon", "p-adjust")
print(GGE)
}
As you can see, the function takes a single matrix as an argument. The matrix
should have the repeated measures data in columns (each time measurement or
each repeated measurement on the DV gets its own column). Each row of the
matrix corresponds to data from one subject.
Here are two examples of how to use this function. The first one involves
fetching some data from a help page for Friedman's test (a nonparametric
repeated measures ANOVA). The data are base running times for three different
methods of rounding first base. (The help page contains some errors, but see
for more details.)
> example(friedman.test) # output not shown; but there's a lot!
> ls()
[1] "aovIII" "rmsaov.test" "RoundingTimes" "wb"
> head(RoundingTimes) # the data matrix we are after
Round Out Narrow Angle Wide Angle
1 5.40 5.50 5.55
2 5.85 5.70 5.75
3 5.20 5.60 5.50
4 5.55 5.50 5.40
5 5.90 5.85 5.70
6 5.45 5.55 5.60
> class(RoundingTimes) # it must be a matrix
[1] "matrix"
> rmsaov.test(RoundingTimes)
Oneway ANOVA with repeated measures
data: RoundingTimes
F = 6.2883, df treat = 2, df error = 42, p-value = 0.004084
MS.error MS.treat MS.subjs
0.007451299 0.046856061 0.200887446
Treatment Means
Round Out Narrow Angle Wide Angle
5.543182 5.534091 5.459091
Compound Symmetry Check (Variance-Covariance Matrix)
Round Out Narrow Angle Wide Angle
Round Out 0.07387987 0.06310065 0.06327922
Narrow Angle 0.06310065 0.06747294 0.06705628
Wide Angle 0.06327922 0.06705628 0.07443723
Estimated Greenhouse-Geisser Epsilon and Adjusted p-value
epsilon p-adjust
0.773501463 0.008439799
The output is self-explanatory, but for more information see the tutorial on
repeated measures designs. The second example is from the anorexia data that
we have looked at before.
> rm(RoundingTimes, wb) # clean up a bit first
> data(anorexia, package="MASS") # fetch the data
> head(anorexia)
Treat Prewt Postwt
1 Cont 80.7 80.2
2 Cont 89.4 80.1
3 Cont 91.8 86.4
4 Cont 74.0 86.3
5 Cont 78.1 76.1
6 Cont 88.3 78.1
> class(anorexia) # this is no good!
[1] "data.frame"
> anor = as.matrix(anorexia[,2:3]) # convert the relevant columns to a matrix
> rmsaov.test(anor)
Oneway ANOVA with repeated measures
data: anor
F = 8.6293, df treat = 1, df error = 71, p-value = 0.004458
MS.error MS.treat MS.subjs
31.86892 275.00694 59.55305
Treatment Means
Prewt Postwt
82.40833 85.17222
Compound Symmetry Check (Variance-Covariance Matrix)
Prewt Postwt
Prewt 26.85796 13.84207
Postwt 13.84207 64.56401
Estimated Greenhouse-Geisser Epsilon and Adjusted p-value
epsilon p-adjust
1.000000000 0.004457718
That's it for ANOVA-related functions. Now we must enhance our functions file
with the two new ones we've created.
> ls()
[1] "anor" "anorexia" "aovIII" "rmsaov.test"
> rm(anor, anorexia) # get rid of the junk
> ls() # check!
[1] "aovIII" "rmsaov.test"
> load("myfunctions.RData") # add the functions in myfunctions
> ls() # check again!
[1] "aovIII" "bygroups" "Cohens.d" "error.bars" "r.crit" "rmsaov.test"
[7] "sampsize" "sem" "SS" "summarize"
> save.image("myfunctions.RData") # save a new myfunctions file
> rm(list=ls()) # clean up
> ls()
character(0)
One more new function and we'll call it a day.
Linear Contrasts
There is no easy way in the base R packages to test complex linear contrasts.
So we're going to create one. The method of calculation will be based on that in
the very excellent statistics book The Statistical Sleuth (3rd ed.) by Ramsey
and Schafer (2013).
t.contrast = function(means,codes,n,spool,dfe)
{
g = sum(means*codes)
seg = spool*sqrt(sum(codes^2/n))
t = g/seg
p.value = pt(abs(t),dfe,lower.tail=FALSE)*2
d = abs(g/spool)
output <<- c(g=g,SE=seg,t=t,df=dfe,p.value=p.value,d=d)
round(output,4)
}
Executing this function will write an object to your workspace called "output",
so be careful about that. If you don't want that to happen, modify the script
so that the line that begins output <<- is output <- (a single arrowhead on
the assignment arrow).
This is how the function works. Linear contrasts are used as an alternative
to ANOVA and are sometimes referred to as planned comparisons. ANOVA is a fine
tool, but it tests an alternative hypothesis that is very general, i.e., any
pattern of means other than all means equal. Thus, ANOVA is most appropriate
when you don't really know much about what to expect from your treatments. If
you have specific hypotheses, for example, group A will have a higher mean
than group B, and groups B and C will be the same, then planned comparisons are
more appropriate.
> ls() # your workspace should contain only the function
[1] "t.contrast"
> data(chickwts) # weights of chickens by type of feed they received
> summary(chickwts)
weight feed
Min. :108.0 casein :12
1st Qu.:204.5 horsebean:10
Median :258.0 linseed :12
Mean :261.3 meatmeal :11
3rd Qu.:323.5 soybean :14
Max. :423.0 sunflower:12
I don't really know much about feeding chickens other than that I had to do it
as a kid and didn't care for it. Here we have 71 chickens in six groups by the
type of feed they are receiving. The DV is weight. Let's pretend we (I) know
enough about chicken feed to have the hypothesis that casein and sunflower are
the best feeds, and are about equally good. The other four are not as good and
are about equally not as good. So we're expecting the casein and sunflower means
to be high and roughly equal, and the other four means to be low and roughly
equal. In contrast codes (I'll have to refer you to an outside source here),
that hypothesis would be expressed as follows: (1/2, -1/4, -1/4, -1/4, -1/4, 1/2).
That will have the effect of combining the first and last group into one big
"megagroup" and the middle four groups into another "megagroup," and then
those two "megagroups" will be tested against each other.
We need to feed our t.contrast function five pieces of information: a vector
of group means, a vector of contrast codes, a vector of group sizes (can be one
number if the design is balanced), the pooled standard deviation calculated over
all groups, and the degrees of freedom for error, which is total number of
subjects minus the number of groups they've been divided into if the design is
between groups. Let's start getting that stuff.
> means = with(chickwts, tapply(weight,feed,mean))
> codes = c(1/2, -1/4, -1/4, -1/4, -1/4, 1/2) # best entered as fractions
> n = c(12, 10, 12, 11, 14, 12) # from the summary above
> aov(weight~feed, data=chickwts) # easiest way to get spool
Call:
aov(formula = weight ~ feed, data = chickwts)
Terms:
feed Residuals
Sum of Squares 231129.2 195556.0
Deg. of Freedom 5 65
Residual standard error: 54.85029
> spool = 54.85029 # it's the residual s.e.
> dfe = 65
And now we're ready to run the function.
> t.contrast(means, codes, n, spool, dfe)
g SE t df p.value d
100.6781 13.7969 7.2972 65.0000 0.0000 1.8355
g is the difference in means between the two "megagroups," SE is the standard
error (estimated) of that difference, t is the corresponding test statistic,
df and p.value should be obvious, and d is Cohen's d for the difference g.
We could (and probably should) continue with further contrasts, but I'll
leave that to you as an exercise (as authors always say in these sorts of
situations when they really don't want to do something).
Now let's add the function to our file. First, make sure there is nothing
in your workspace except this function.
> ls()
[1] "chickwts" "codes" "dfe" "means" "n"
[6] "output" "spool" "t.contrast"
> rm(chickwts, codes, dfe, means, n, output, spool)
> ls()
[1] "t.contrast"
And we made quite a mess while testing the function, so we had to clean it up.
Otherwise, all that junk would be going into the myfunctions file.
> load("myfunctions.RData")
> ls()
[1] "aovIII" "bygroups" "Cohens.d" "error.bars" "r.crit" "rmsaov.test"
[7] "sampsize" "sem" "SS" "summarize" "t.contrast"
> save.image("myfunctions.RData")
> rm(list=ls())
There is one more thing I want to do to prepare us for the next section of
this tutorial.
> data(anorexia, package="MASS") # we'll use this below
> save.image("anorexia.RData") # a saved project
> rm(anorexia) # clean up
> palm483S = rnorm(50) # some junk for the workspace
We are ready to move on.
Putting It All Together and Using It
We now have eleven shiny new functions in a file called myfunctions.RData. In
my case, I have this file saved to my Rspace directory, because that's usually
where I go to do my R work. You should adjust the advice here to fit your own
working habits.
When I start R, I immediately change my working directory to Rspace. Then I
examine my workspace, and if there is stuff there I don't want, I remove it.
I load anything from Rspace that I might be working on. Then I'm ready to go
to work.
Suppose I wanted to use one or more of the functions we've just created.
It's an RData file, so I could load() it, but
that would put all those functions in my workspace, and I don't want that.
So instead of loading it, I'm going to attach it to the search path. My
work flow might look like this.
> setwd("Rspace") # don't do this if you're already in Rspace
> ls()
[1] "palm483S"
> rm(palm483S)
> attach("myfunctions.RData")
Those functions are now in my search path, and if I call one of them, R will
be able to find it.
> search()
[1] ".GlobalEnv" "file:myfunctions.RData" "tools:RGUI"
[4] "package:stats" "package:graphics" "package:grDevices"
[7] "package:utils" "package:datasets" "package:methods"
[10] "Autoloads" "package:base"
> ls(pos=2)
[1] "aovIII" "bygroups" "Cohens.d" "error.bars" "r.crit" "rmsaov.test"
[7] "sampsize" "sem" "SS" "summarize" "t.contrast"
Just in case you didn't believe me!
> dir() # see what's in Rspace
[1] "anorexia.RData" "myfunctions.RData" "UCB.RData"
> load("anorexia.RData") # load the previously saved project I want
> ls() # nobody here but us data sets
[1] "anorexia"
> head(anorexia)
Treat Prewt Postwt
1 Cont 80.7 80.2
2 Cont 89.4 80.1
3 Cont 91.8 86.4
4 Cont 74.0 86.3
5 Cont 78.1 76.1
6 Cont 88.3 78.1
> dim(anorexia)
[1] 72 3
> with(anorexia, cor(Prewt, Postwt)) # do some analysis
[1] 0.3324062
> r.crit(df=70) # use one of our custom functions
df alpha 1-tail 2-tail
70.0000 0.0500 0.1954 0.2319
> bygroups(data=anorexia$Postwt, groups=anorexia$Treat) # use another one
CBT Cont FT
85.69655 81.10769 90.49412
> summarize(anorexia$Prewt) # caught me again; this is the old one
mean SD N
82.408333 5.182466 72.000000
> detach("file:myfunctions.RData") # done with it; get it out of the search path
It's as easy as that!
Review of Steps For Modifying Your Functions File
You may want to add a function of your own to this file, or delete one. Or
you may want to modify one of these functions. Just to collect the steps
together in a neat outline, here's how you would do it. Make sure the file
is not attached to your search path before you begin.
To delete a function (or a data object that snuck in there somehow):
- Change to the working directory that has the myfunctions.RData file in it.
- Clear your workspace.
- load("myfunctions.RData")
- Remove the undesired function or functions from your workspace using
rm().
- save.image("myfunctions.RData")
- Clear your workspace.
To add a function:
- Change to the working directory that has the myfunctions.RData file in it.
- Clear your workspace.
- Write the new function in the script editor (or in a text editor) and
read it into R (Execute or Run All from the script window, or copy and paste).
I.e., get it into your workspace, and make sure it's the ONLY thing in your
workspace.
- Make sure the new function doesn't have the same name as an old one, or it
will be overwritten at the next step.
- load("myfunctions.RData")
- save.image("myfunctions.RData")
- Clear your workspace.
To modify a function:
- Change to the working directory that has the myfunctions.RData file in it.
- Clear your workspace.
- load("myfunctions.RData")
- Type the name of the function you wish to modify at the command prompt
without parentheses. This will print the definition of the function to your
Console. Example:
> aovIII
function (formula, data = NULL, subset=NULL)
{
options(contrasts = c("contr.sum", "contr.poly"))
aov.out <<- aov(formula, data, subset)
print(drop1(aov.out, . ~ ., test = "F"))
options(contrasts = c("contr.treatment", "contr.poly"))
}
- Copy and paste the function definition to your script window and insert
the function name and an = sign before the word function. (NOTE: You can also
try fix(aovIII), which may or may not work. In
this case, DO NOT type in the function name =.)
- Make the desired mods. For example, if you don't want aovIII to save the
aov.out object to your workspace, modify the aov.out <<- line to aov.out <-.
- Read the modified function into R (or close the window opened by fix and
choose Save).
- save.image("myfunctions.RData")
- Clear your workspace.
I should also note that you can add other objects, such as data frames,
vectors, matrices, etc., to this file if you want to keep them there for
some reason. I have a long list of data frames in mine that I use in my
teaching. Sending the file to my students by e-mail is a convenient way to
get a large number of data files to them. (No, I won't send it to you.
Some of the data files were collected by students and are not for public
use. Sorry.)
An Alternative Method
If you don't really need all your scripts and data files in one big file
such as myfunctions.RData, then it may be more convenient to just save them a
script at a time in your working directory (Rspace). Open a script window,
type the script, then save it with a convenient name and .R as the extension.
Be sure R is saving it to your working directory. This is not the default in
the script editor, at least not on the Mac, so make the necessary changes in
the dialogue box before saving.
Then when you need to use it, you just source it in.
> setwd("Rspace")
> source("size.R")
> size(c(NA,1:5,NA,NA,NA))
[1] 5
If you need to modify the script at some point, you don't need to go
through all the gyrations above. Just open the script in a script window,
make whatever changes you want, and save it. The disadvantage is, if you
have a large number of scripts, you might end up making a mess of your
working directory. Then again, you could always create a subdirectory. Call
it, oh, say, "functions".
created 2016 January 24; modified 2016 April 7
| Table of Contents
| Function Reference
| Function Finder
| R Project |
|