
FUNCTIONS AND SCRIPTS
Functions
One of the advantages of using a scriptable statistics language like R is, if
you don't like the way it does something, you can change it. Or if there is a
function missing you'd like to have, write it. Writing basic functions is not
difficult. If you can calculate it at the command line, you can write a
function to calculate it.
There is no function in the R base packages to calculate the standard error
of the mean. So let's create one. The standard error of the mean is calculated
from a sample (I should say estimated from a sample) by taking the square root
of the sample variance divided by the sample size. So from the command line...
> setwd("Rspace") # if you've created this directory
> rm(list=ls()) # clean out the workspace
> ls()
character(0)
> nums = rnorm(25, mean=100, sd=15) # create a data vector to work with
> mean(nums) # calculate the mean
[1] 97.07936 # your results will differ
> sd(nums) # and the standard deviation (sample)
[1] 12.92470
> length(nums) # and the length or "sample size"
[1] 25
> sem = sqrt(var(nums)/length(nums)) # this is how the sem is calculated
> sem
[1] 2.584941
So we know how to calculate the sem at the command line. Automating that by
creating an "sem( )" function is a piece of cake...
> rm(sem) # get rid of the object we created above
> ?sem
No documentation for 'sem' in specified packages and libraries:
you could try 'help.search("sem")'
> sem = function(x)
+ {
+ sqrt(var(x)/length(x))
+ }
First, we checked to make sure "sem" was not already used as a keyword by
asking for a help page. (That's no guarantee, but it's a good check.) Then we
typed "sem=function(x)", which requires a bit of explanation. The name we
desire for our function is "sem", so that's the first thing we type. Then we
tell R we want to define this as a function by typing "=function". The sem is
going to be calculated on a data object--a vector--so we have to pass the data
to the function, and that is the point of "(x)". This tells R to expect one
argument to be passed to the function. It doesn't have to be called "x". This
is just a dummy variable, so call it "fred" if you want, as long as you call it
the same thing throughout the function definition.
After you hit the Enter key, R will see that you are defining a function,
and it will give you the + prompt, meaning "tell me more." Type an open curly
brace and hit Enter again. (A more common practice is to type this on the first
line, but it doesn't matter, and I learned it otherwise.) Then type the
calculations needed to get the standard error. Spacing is optional, but I think
it makes it a bit easier to understand if you use some indenting here. Hit
Enter. Type a closed curly brace and hit Enter again. Your function has been
defined and is now in your workspace to be used whenever you want...
> ls()
[1] "nums" "sem"
And it will stay in your workspace for whatever working directory you are in
PROVIDED you save your workspace when you quit R. You use the function just like
you use any other function in R...
> sem(nums)
[1] 2.584941
> with(PlantGrowth, tapply(weight, group, sem))
ctrl trt1 trt2
2.584941 2.584941 2.584941
So next week you fire up R, you see "sem" in your workspace, and you wonder
what it is (if you're anything like me). Easy enough to find out...
> class(sem)
[1] "function"
> sem
function(x)
{
sqrt(var(x)/length(x))
}
Just like any other data object, typing its name without an argument prints it
out.
I don't like it. Our sem function is good enough, but if there are missing
values in the data vector, sem( ) will choke...
> nums[20] = NA
> sem(nums)
Error in var(nums) : missing observations in cov/cor
So let's fix it...
> rm(sem) # out with the old...
> ls()
[1] "nums"
> sem = function(x)
+ {
+ n = sum(x,na.rm=T)/mean(x,na.rm=T)
+ sqrt(var(x,na.rm=T)/n)
+ }
> ls()
[1] "nums" "sem"
> sem(nums)
[1] 2.641737
By the way, we couldn't use the length( ) function in this calculation
because it has no "na.rm=" option.
The length( ) function counts NAs as data values and doesn't tell you.
(Which is why we couldn't use it above--it would have given the wrong value for
n.) Let's create another function for sample size that reports on NAs...
> ?samp.size
No documentation for 'samp.size' in specified packages and libraries:
you could try 'help.search("samp.size")'
> samp.size = function(x)
+ {
+ n = length(x) - sum(is.na(x))
+ nas = sum(is.na(x))
+ out = c(n, nas)
+ names(out) = c("", "NAs")
+ out
+ }
> ls()
[1] "nums" "samp.size" "sem"
> samp.size(nums)
NAs
24 1
Now samp.size( ) returns a vector the first element of which is the number
of nonmissing values in the data object we feed into it. So...
> sqrt(var(nums,na.rm=T)/samp.size(nums)[1])
2.641737
...you can use it like this.
If you ask me, R has some annoying idiosyncrasies. Take the tapply( )
function for example. What could "tapply" possibly mean? And who came up with
that convoluted syntax? Don't like it? Then change it!
> ?calculate
No documentation for 'calculate' in specified packages and libraries:
you could try 'help.search("calculate")'
> calculate = function(FUN, of, by)
+ {
+ tapply(of, by, FUN)
+ }
> ls()
[1] "calculate" "nums" "samp.size" "sem"
> with(PlantGrowth, tapply(weight, group, sem))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
> with(PlantGrowth, calculate(sem, of=weight, by=group))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
Which makes more sense to you?! You should also know that these one-liners can
be entered all on one line...
> rm(calculate)
> ls()
[1] "nums" "samp.size" "sem"
> calculate = function(FUN, of, by) tapply(of, by, FUN)
> with(PlantGrowth, calculate(sem, of=weight, by=group))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
You don't even need the curly braces, although you can use them if you want. I
usually do.
Another R function that annoys the crap out of me is summary( ) when
applied to a numerical vector. I want a standard deviation, and I want a
sample size...
> ?describe
No documentation for 'describe' in specified packages and libraries:
you could try 'help.search("describe")'
> describe = function(x)
+ {
+ m=mean(x,na.rm=T)
+ s=sd(x,na.rm=T)
+ N=sum(is.na(x))
+ n=length(x)-N
+ se=s/sqrt(n)
+ out=c(m,s,se,n,N)
+ names(out)=c("mean","sd","sem","n","NAs")
+ round(out,4)
+ }
> ls()
[1] "calculate" "describe" "nums" "samp.size" "sem"
> describe(nums)
mean sd sem n NAs
96.5680 12.9418 2.6417 24.0000 1.0000
That's better! And don't forget to SAVE YOUR WORKSPACE when you quit if you want
to keep these functions.
Scripts
A script is just a plain text file with R commands in it. You can prepare a
script in any text editor, such as vim, TextWrangler, or Notepad. You can also
prepare a script in a word processor, like Word, Writer, TextEdit, or WordPad,
PROVIDED you save the script in plain text (ascii) format. In Windows, this will
append a ".txt" file extension to the file. Drop the script into your working
directory, and then read it into R using the source( ) function.
In a moment you're going to see a link. Click on it and a text page will
appear with a sample script on it. Use your browser to save this page to your
desktop. Then move the saved file into your R working directory. Or, if you
really want to be adventurous, type the script into a text editor like Notepad,
save it in your working directory, and you are ready to go. Okay, here is the
link...
Link To Sample Script
Now that you've got it in your working directory one way or another, do
this in R...
> source("sample_script.txt") # Don't forget those quotes!
A note: R does not like spaces in script names, so don't put spaces in your
script names! Now, what didn't happen that you expected to happen? Go back to
the link and read the script again if you have to. What happened to the mean of
"y" and the mean of "x"?
The script has created the variables "x" and "y" in your workspace (and has
erased any old objects you had by that name--sorry). You can see them with the
ls( ) function. Executing a script does everything typing those commands
in the Console would do, EXCEPT print things to the Console. Do this...
> x
[1] 22 39 50 25 18
> mean(x)
[1] 30.8
See? It's there. But if you want to be sure a script will print it to the
Console, you should use the print( ) function...
> print(x)
[1] 22 39 50 25 18
> print(mean(x))
[1] 30.8
When you're working in the Console, the print( ) is understood (implicit)
when you type a command or data object name. This is not necessarily so in a
script.
A script is a good way to keep track of what you're doing. If you have a long
analysis, and you want to be able to recreate it later, a good idea is to type
it into a script. If you're working in the Windows R GUI (and I believe also in
the Mac R GUI), there is even a built-in script editor. To get to it, pull down
the File menu and choose New Script. A window will open in which you can type
your script. Type this script into the open window...
with(PlantGrowth, tapply(weight, group, mean))
with(PlantGrowth, aov(weight ~ group)) -> aov.out
summary.aov(aov.out)
summary.lm(aov.out)
Hit the Enter key after the last line. Now, in the editor window, pull down the
Edit menu and choose Run All. The script should execute in your R Console. Pull
down the File Menu and choose Save As... Give the file a nice name, like
"script2.txt". R will NOT save it by default with a file extension, so be sure
you give it one. Close the editor window. Now, in the R Console, do this...
> source("script2.txt")
Nothing happens! Why not? Actually, something did happen. The "aov.out" object
was created in your workspace. However, nothing was echoed to your Console
because you didn't tell it to print( ). Go to File and choose New Script.
In the script editor, pull down File and choose Open Script... In the Open
Script dialog that appears, change Files Of Type to all files. Then choose to
open "script2.txt". Edit it to look like this...
print(with(PlantGrowth, tapply(weight, group, mean)))
with(PlantGrowth, aov(weight ~ group)) -> aov.out
print(summary.aov(aov.out))
print(summary.lm(aov.out))
Pull down File and choose Save. Close the script editor window(s). And
FINALLY...
> source("script2.txt")
Scripts! Nothing to it, right?
Return to the Table of Contents
|