R Tutorials Top Banner

FUNCTIONS AND SCRIPTS


Functions

One of the advantages of using a scriptable statistics language like R is, if you don't like the way it does something, you can change it. Or if there is a function missing you'd like to have, write it. Writing basic functions is not difficult. If you can calculate it at the command line, you can write a function to calculate it.

There is no function in the R base packages to calculate the standard error of the mean. So let's create one. The standard error of the mean is calculated from a sample (I should say estimated from a sample) by taking the square root of the sample variance divided by the sample size. So from the command line...

> setwd("Rspace")                      # if you've created this directory
> rm(list=ls())                        # clean out the workspace
> ls()
character(0)
> nums = rnorm(25, mean=100, sd=15)    # create a data vector to work with
> mean(nums)                           # calculate the mean
[1] 97.07936                           # your results will differ
> sd(nums)                             # and the standard deviation (sample)
[1] 12.92470
> length(nums)                         # and the length or "sample size"
[1] 25
> sem = sqrt(var(nums)/length(nums))   # this is how the sem is calculated
> sem
[1] 2.584941
So we know how to calculate the sem at the command line. Automating that by creating an "sem( )" function is a piece of cake...
> rm(sem)                              # get rid of the object we created above
> ?sem
No documentation for 'sem' in specified packages and libraries:
you could try 'help.search("sem")'
> sem = function(x)
+ {
+      sqrt(var(x)/length(x))
+ }
First, we checked to make sure "sem" was not already used as a keyword by asking for a help page. (That's no guarantee, but it's a good check.) Then we typed "sem=function(x)", which requires a bit of explanation. The name we desire for our function is "sem", so that's the first thing we type. Then we tell R we want to define this as a function by typing "=function". The sem is going to be calculated on a data object--a vector--so we have to pass the data to the function, and that is the point of "(x)". This tells R to expect one argument to be passed to the function. It doesn't have to be called "x". This is just a dummy variable, so call it "fred" if you want, as long as you call it the same thing throughout the function definition.

After you hit the Enter key, R will see that you are defining a function, and it will give you the + prompt, meaning "tell me more." Type an open curly brace and hit Enter again. (A more common practice is to type this on the first line, but it doesn't matter, and I learned it otherwise.) Then type the calculations needed to get the standard error. Spacing is optional, but I think it makes it a bit easier to understand if you use some indenting here. Hit Enter. Type a closed curly brace and hit Enter again. Your function has been defined and is now in your workspace to be used whenever you want...

> ls()
[1] "nums" "sem"
And it will stay in your workspace for whatever working directory you are in PROVIDED you save your workspace when you quit R. You use the function just like you use any other function in R...
> sem(nums)
[1] 2.584941
> with(PlantGrowth, tapply(weight, group, sem))
    ctrl     trt1     trt2 
2.584941 2.584941 2.584941

So next week you fire up R, you see "sem" in your workspace, and you wonder what it is (if you're anything like me). Easy enough to find out...

> class(sem)
[1] "function"
> sem
function(x)
{
     sqrt(var(x)/length(x))
}
Just like any other data object, typing its name without an argument prints it out.

I don't like it. Our sem function is good enough, but if there are missing values in the data vector, sem( ) will choke...

> nums[20] = NA
> sem(nums)
Error in var(nums) : missing observations in cov/cor
So let's fix it...
> rm(sem)                              # out with the old...
> ls()
[1] "nums"
> sem = function(x)
+ {
+      n = sum(x,na.rm=T)/mean(x,na.rm=T)
+      sqrt(var(x,na.rm=T)/n)
+ }
> ls()
[1] "nums" "sem" 
> sem(nums)
[1] 2.641737
By the way, we couldn't use the length( ) function in this calculation because it has no "na.rm=" option.

The length( ) function counts NAs as data values and doesn't tell you. (Which is why we couldn't use it above--it would have given the wrong value for n.) Let's create another function for sample size that reports on NAs...

> ?samp.size
No documentation for 'samp.size' in specified packages and libraries:
you could try 'help.search("samp.size")'
> samp.size = function(x)
+ {
+      n = length(x) - sum(is.na(x))
+      nas = sum(is.na(x))
+      out = c(n, nas)
+      names(out) = c("", "NAs")
+      out
+ }
> ls()
[1] "nums"      "samp.size" "sem"      
> samp.size(nums)
    NAs 
 24   1
Now samp.size( ) returns a vector the first element of which is the number of nonmissing values in the data object we feed into it. So...
> sqrt(var(nums,na.rm=T)/samp.size(nums)[1])
         
2.641737
...you can use it like this.

If you ask me, R has some annoying idiosyncrasies. Take the tapply( ) function for example. What could "tapply" possibly mean? And who came up with that convoluted syntax? Don't like it? Then change it!

> ?calculate
No documentation for 'calculate' in specified packages and libraries:
you could try 'help.search("calculate")'
> calculate = function(FUN, of, by)
+ {
+      tapply(of, by, FUN)
+ }
> ls()
[1] "calculate" "nums"      "samp.size" "sem"      
> with(PlantGrowth, tapply(weight, group, sem))
     ctrl      trt1      trt2 
0.1843897 0.2509823 0.1399540 
> with(PlantGrowth, calculate(sem, of=weight, by=group))
     ctrl      trt1      trt2 
0.1843897 0.2509823 0.1399540
Which makes more sense to you?! You should also know that these one-liners can be entered all on one line...
> rm(calculate)
> ls()
[1] "nums"      "samp.size" "sem"      
> calculate = function(FUN, of, by) tapply(of, by, FUN)
> with(PlantGrowth, calculate(sem, of=weight, by=group))
     ctrl      trt1      trt2 
0.1843897 0.2509823 0.1399540
You don't even need the curly braces, although you can use them if you want. I usually do.

Another R function that annoys the crap out of me is summary( ) when applied to a numerical vector. I want a standard deviation, and I want a sample size...

> ?describe
No documentation for 'describe' in specified packages and libraries:
you could try 'help.search("describe")'
> describe = function(x)
+ {
+    m=mean(x,na.rm=T)
+    s=sd(x,na.rm=T)
+    N=sum(is.na(x))
+    n=length(x)-N
+    se=s/sqrt(n)
+    out=c(m,s,se,n,N)
+    names(out)=c("mean","sd","sem","n","NAs")
+    round(out,4)
+ }
> ls()
[1] "calculate" "describe"  "nums"      "samp.size" "sem"      
> describe(nums)
   mean      sd     sem       n     NAs 
96.5680 12.9418  2.6417 24.0000  1.0000
That's better! And don't forget to SAVE YOUR WORKSPACE when you quit if you want to keep these functions.


Scripts

A script is just a plain text file with R commands in it. You can prepare a script in any text editor, such as vim, TextWrangler, or Notepad. You can also prepare a script in a word processor, like Word, Writer, TextEdit, or WordPad, PROVIDED you save the script in plain text (ascii) format. In Windows, this will append a ".txt" file extension to the file. Drop the script into your working directory, and then read it into R using the source( ) function.

In a moment you're going to see a link. Click on it and a text page will appear with a sample script on it. Use your browser to save this page to your desktop. Then move the saved file into your R working directory. Or, if you really want to be adventurous, type the script into a text editor like Notepad, save it in your working directory, and you are ready to go. Okay, here is the link...

Link To Sample Script

Now that you've got it in your working directory one way or another, do this in R...

> source("sample_script.txt")          # Don't forget those quotes!
A note: R does not like spaces in script names, so don't put spaces in your script names! Now, what didn't happen that you expected to happen? Go back to the link and read the script again if you have to. What happened to the mean of "y" and the mean of "x"?

The script has created the variables "x" and "y" in your workspace (and has erased any old objects you had by that name--sorry). You can see them with the ls( ) function. Executing a script does everything typing those commands in the Console would do, EXCEPT print things to the Console. Do this...

> x
[1] 22 39 50 25 18
> mean(x)
[1] 30.8
See? It's there. But if you want to be sure a script will print it to the Console, you should use the print( ) function...
> print(x)
[1] 22 39 50 25 18
> print(mean(x))
[1] 30.8
When you're working in the Console, the print( ) is understood (implicit) when you type a command or data object name. This is not necessarily so in a script.

A script is a good way to keep track of what you're doing. If you have a long analysis, and you want to be able to recreate it later, a good idea is to type it into a script. If you're working in the Windows R GUI (and I believe also in the Mac R GUI), there is even a built-in script editor. To get to it, pull down the File menu and choose New Script. A window will open in which you can type your script. Type this script into the open window...

with(PlantGrowth, tapply(weight, group, mean))
with(PlantGrowth, aov(weight ~ group)) -> aov.out
summary.aov(aov.out)
summary.lm(aov.out)

Hit the Enter key after the last line. Now, in the editor window, pull down the Edit menu and choose Run All. The script should execute in your R Console. Pull down the File Menu and choose Save As... Give the file a nice name, like "script2.txt". R will NOT save it by default with a file extension, so be sure you give it one. Close the editor window. Now, in the R Console, do this...

> source("script2.txt")
Nothing happens! Why not? Actually, something did happen. The "aov.out" object was created in your workspace. However, nothing was echoed to your Console because you didn't tell it to print( ). Go to File and choose New Script. In the script editor, pull down File and choose Open Script... In the Open Script dialog that appears, change Files Of Type to all files. Then choose to open "script2.txt". Edit it to look like this...

print(with(PlantGrowth, tapply(weight, group, mean)))
with(PlantGrowth, aov(weight ~ group)) -> aov.out
print(summary.aov(aov.out))
print(summary.lm(aov.out))

Pull down File and choose Save. Close the script editor window(s). And FINALLY...
> source("script2.txt")
Scripts! Nothing to it, right?


Return to the Table of Contents