R Tutorials--Functions and Scripts

FUNCTIONS AND SCRIPTS

Functions

One of the advantages of using a scriptable statistics language like R is, if you don't like the way it does something, you can change it. Or if there is a function missing you'd like to have, write it. Writing basic functions is not difficult. If you can calculate it at the command line, you can write a function to calculate it.

There is no function in the R base packages to calculate the standard error of the mean. So let's create one. The standard error of the mean is calculated from a sample (I should say estimated from a sample) by taking the square root of the sample variance divided by the sample size. So from the command line...

> setwd("Rspace")                      # if you've created this directory
> rm(list=ls())                        # clean out the workspace
> ls()
character(0)
> nums = rnorm(25, mean=100, sd=15)    # create a data vector to work with
> mean(nums)                           # calculate the mean
[1] 97.07936                           # your results will differ
> sd(nums)                             # and the standard deviation (sample)
[1] 12.92470
> length(nums)                         # and the length or "sample size"
[1] 25
> sem = sqrt(var(nums)/length(nums))   # this is how the sem is calculated
> sem
[1] 2.584941

Important note: Your results may be different, because the data vector was filled with randomly generated numbers. By the way, STUDENTS, say "es ee em," NOT "sem" as if you were starting to say "semi" or "semolina."

So we know how to calculate the sem ("es ee em") at the command line. Automating this by creating an "sem( )" function is a piece of cake.

> rm(sem)                              # get rid of the object we created above
> ?sem                                 # check to see if something by this name already exists
No documentation for 'sem' in specified packages and libraries:
you could try 'help.search("sem")'
> sem = function(x)                    # create an sem function; x is a dummy variable
+ {                                    # begin the function definition with an open curly brace
+      sqrt(var(x)/length(x))          # enter the necessary calculations
+ }                                    # close the function definition

Another important note: If you are working on a Mac, or in R Studio, the command editor will close the curly braces as soon as you open them. Thus, as soon as you type {, the } will also appear. If you just hit the Enter key at this point, your function is done. You've just defined an empty function. You'll have to erase that closed curly brace and then remember to type it again at the end to get what you want. Annoying! Now, what did we do above?

First, we checked to make sure "sem" was not already used as a keyword by asking for a help page. (That's no guarantee, but it's a good check.) Then we typed "sem=function(x)", which requires a bit of explanation. The name we desire for our function is "sem", so that's the first thing we type. Then we tell R we want to define this as a function by typing "=function". The sem is going to be calculated on a data object--a vector in this case--so we have to pass the data to the function, and that is the point of "(x)". This tells R to expect one argument to be passed to the function. It doesn't have to be called "x". This is just a dummy variable, so call it "fred" if you want, as long as you call it the same thing throughout the function definition.

After you hit the Enter key, R will see that you are defining a function, and it will give you the + prompt, meaning "tell me more." Type an open curly brace and hit Enter again. (A more common practice is to type this on the first line, but it doesn't matter, and I learned it otherwise.) Then type the calculations needed to get the standard error. Spacing is optional, but I think it makes it a bit easier to understand if you use some indenting here. Hit Enter. Type a closed curly brace and hit Enter again. Your function has been defined and is now in your workspace to be used whenever you want.

> ls()
[1] "nums" "sem"

And it will stay in your workspace for whatever working directory you are in PROVIDED you save your workspace when you quit R. You use the function just like you use any other function in R.

> sem(nums)
[1] 2.584941
> PlantGrowth      # PlantGrowth is a built-in data frame; output not shown
> with(PlantGrowth, tapply(weight, group, sem))
     ctrl      trt1      trt2 
0.1843897 0.2509823 0.1399540

So next week you fire up R, you see "sem" in your workspace, and you wonder what it is (if you're like me). Easy enough to find out.

> class(sem)
[1] "function"
> sem
function(x)
{
     sqrt(var(x)/length(x))
}

Just like any other object in your workspace, typing its name without an argument, or without parentheses in the case of a function, prints it out.

I don't like it. Our sem function is good enough, but if there are missing values in the data vector, sem( ) will choke.

> nums[20] = NA                        # create a missing value
> sem(nums)
[1] NA
> var(nums)                            # here's part of the problem
[1] NA

So let's fix it. The var() function can be fixed with a simple option that drops NAs from the calculation. Sadly, the length() function cannot. So I will resort to using a function called is.na() and it's negation !is.na(). To see what these do, try is.na(nums) and !is.na(nums). Summing the result of this function has the effect of counting up the number of TRUE responses, i.e., the number of missings for is.na, and the number of not missings for !is.na.

> rm(sem)                              # out with the old...
> ls()
[1] "nums"
> sem = function(x)
+ {
+      n = sum(!is.na(x))
+      sqrt(var(x,na.rm=T)/n)
+ }
> ls()
[1] "nums" "sem" 
> sem(nums)                            # once again, your result may be different...
[1] 2.641737

By the way, we couldn't use the length() function in this calculation because it has no "na.rm=" option. The line n=sum(!is.na(x)) tests each value of the vector to see if it's missing. If it is NOT (the ! means NOT), then it returns TRUE for that position in the vector. Finally, the values returned as TRUE are counted with sum(), because TRUE sums as 1 when you sum a logical vector.

The length() function counts NAs as data values and doesn't tell you. (Which is why we couldn't use it above--it would have given the wrong value for n.) Let's create another function for sample size that reports on NAs.

> ?samp.size
No documentation for 'samp.size' in specified packages and libraries:
you could try 'help.search("samp.size")'
> samp.size = function(x)
+ {
+      n = length(x) - sum(is.na(x))
+      nas = sum(is.na(x))
+      out = c(n, nas)
+      names(out) = c("n", "  NA")
+      out                             # this is the only thing that will be printed
+ }
> ls()
[1] "nums"      "samp.size" "sem"      
> samp.size(nums)
   n   NA 
  24    1
> nums[c(3,7,12)] = NA
> samp.size(nums)
   n   NA 
  21    4

Now samp.size() returns a vector the first element of which is the number of nonmissing values in the data object we feed into it. So...

> sqrt(var(nums,na.rm=T)/samp.size(nums)[1])     # your result may differ
2.954936

...you can use it like this. The little trick samp.size(nums)[1] picks up just the first value in the samp.size vector, which is n.

If you ask me, R has some annoying idiosyncrasies. Take the tapply() function for example. What could "tapply" possibly mean? And who came up with that convoluted syntax? Don't like it? Then change it!

> ?calculate
No documentation for 'calculate' in specified packages and libraries:
you could try 'help.search("calculate")'
> calculate = function(FUN, of, by)
+ {
+      tapply(of, by, FUN)
+ }
> ls()
[1] "calculate" "nums"      "samp.size" "sem"      
> with(PlantGrowth, tapply(weight, group, sem))
     ctrl      trt1      trt2 
0.1843897 0.2509823 0.1399540 
> with(PlantGrowth, calculate(sem, of=weight, by=group))
     ctrl      trt1      trt2 
0.1843897 0.2509823 0.1399540

Which makes more sense to you?! You should also know that these one-liners can be entered all on one line.

> rm(calculate)
> ls()
[1] "nums"      "samp.size" "sem"      
> calculate = function(FUN, of, by)    tapply(of, by, FUN)
> with(PlantGrowth, calculate(sem, of=weight, by=group))
     ctrl      trt1      trt2 
0.1843897 0.2509823 0.1399540

You don't even need the curly braces, although you can use them if you want. I usually do.

Another R function that annoys the crap out of me is summary() when applied to a numeric vector. I want a standard deviation, and I want a sample size! As we saw above, a function will print out the last defined thing in the function definition (unless you tell it to do otherwise), so we will use that in the following function definition to have the function print out the value of round(out,4). Go ahead and type the comments into the function definition as well. It's good programming practice if you think you might need a reminder later of what the heck it is you've done here!

> ?describe
No documentation for 'describe' in specified packages and libraries:
you could try 'help.search("describe")'
> describe = function(x)
+ {
+    m=mean(x,na.rm=T)
+    s=sd(x,na.rm=T)
+    N=sum(is.na(x))     # number of missing values
+    n=length(x)-N       # number of nonmissing values
+    se=s/sqrt(n)
+    out=c(m,s,se,n,N)
+    names(out)=c("mean","sd","sem","n","NAs")
+    round(out,4)
+ }
> ls()
[1] "calculate" "describe"  "nums"      "samp.size" "sem"      
> describe(nums)
    mean       sd      sem        n      NAs 
103.3923  13.5412   2.9549  21.0000   4.0000

That's better! And don't forget to SAVE YOUR WORKSPACE when you quit if you want to keep these functions.

Scripts

A script is just a plain text file with R commands in it. You can prepare a script in any text editor, such as vim, TextWrangler, or Notepad. You can also prepare a script in a word processor, like Word, Writer, TextEdit, or WordPad, PROVIDED you save the script in plain text (ascii) format. This should (!) append a ".txt" file extension to the file. Drop the script into your working directory, and then read it into R using the source() function.

In a moment you're going to see a link. Click on it and a text page will appear with a sample script on it. Use your browser to save this page to your desktop. Then move the saved file into your R working directory. Or, if you really want to be adventurous, type the script into a text editor like Notepad, save it in your working directory, and you are ready to go. Okay, here is the link...

Link To Sample Script

Now that you've got it in your working directory one way or another, do this in R.

> source(file = "sample_script.txt")   # Don't forget those quotes!

A note: This may not have worked. And the reason for that is, your script may not have had the name "sample_script.txt". Depending upon what editor you used to prepare it, and what operating system you saved it in, you may have gotten additional elements appended to the name of your script. (Windows is especially notorious for doing this!) It will be particularly difficult for you to tell, if you have not set the option that makes your OS "always display file extensions." Pardon me for griping, but what moron decided it was a good idea to have an OS not display a full file name? I mean come on! How stupid is it possible to get? Here's what they should have said when they were first thinking about this. "Hey! Don't understand what a file extension is? Go use a typewriter!" Anyway, if you make sure the file has the correct name, R will read it. If the file is in your working directory, type dir() at the command prompt, and R will show you the full file name. End of rant!

Also, R does not like spaces in script names, so don't put spaces in your script names! (In newer versions of R, this is no longer an issue.) Now, what didn't happen that you expected to happen? Go back to the link and read the script again if you have to. What happened to the mean of "y" and the mean of "x"?

The script has created the variables "x" and "y" in your workspace (and has erased any old objects you had by that name--sorry). You can see them with the ls( ) function. Executing a script does everything typing those commands in the Console would do, EXCEPT print things to the Console. Do this.

> x
[1] 22 39 50 25 18
> mean(x)
[1] 30.8

See? It's there. But if you want to be sure a script will print it to the Console, you should use the print() function.

> print(x)
[1] 22 39 50 25 18
> print(mean(x))
[1] 30.8

When you're working in the Console, the print() is understood (implicit) when you type a command or data object name. This is not necessarily so in a script.

A script is a good way to keep track of what you're doing. If you have a long analysis, and you want to be able to recreate it later, a good idea is to type it into a script. If you're working in the Windows R GUI (also in the Mac R GUI), there is even a built-in script editor. To get to it, pull down the File menu and choose New Script (New Document on a Mac). A window will open in which you can type your script. Type this script into the open window. (Hint: You can copy and paste it.)

with(PlantGrowth, tapply(weight, group, mean))
with(PlantGrowth, aov(weight ~ group)) -> aov.out
summary.aov(aov.out)
summary.lm(aov.out)

Hit the Enter key after the last line. Now, in the editor window, pull down the Edit menu and choose Run All. (On a Mac, highlight all the lines of the script and choose Execute.) The script should execute in your R Console. Pull down the File Menu and choose Save As... Give the file a nice name, like "script2.txt". R will NOT save it by default with a file extension, so be sure you give it one. (Note: On my Mac, the script editor in R will not let me save the script with a .txt extension. It insists that I use .R. Fine!) Close the editor window. Now, in the R Console, do this.

> source(file = "script2.txt")    # or source(file = "script2.R") if that's how you saved it

Nothing happens! Why not? Actually, something did happen. The "aov.out" object was created in your workspace. However, nothing was echoed to your Console because you didn't tell it to print(). Go to File and choose New Script (New Document on a Mac). In the script editor, pull down File and choose Open Script... (Open Document... on a Mac). In the Open Script dialog that appears, change Files Of Type to all files (not necessary on a Mac). Then choose to open "script2.txt" (or "script2.R", whatever!). Edit it to look like this.

print(with(PlantGrowth, tapply(weight, group, mean)))
with(PlantGrowth, aov(weight ~ group)) -> aov.out
print(summary.aov(aov.out))
print(summary.lm(aov.out))

Pull down File and choose Save. Close the script editor window(s). And FINALLY...

> source(file = "script2.txt")     # or source(file = "script2.R") if necessary

Scripts! Nothing to it, right? :)

revised 2016 January 13