| Table of Contents
| Function Reference
| Function Finder
| R Project |
FUNCTIONS AND SCRIPTS
Functions
One of the advantages of using a scriptable statistics language like R is, if
you don't like the way it does something, you can change it. Or if there is a
function missing you'd like to have, write it. Writing basic functions is not
difficult. If you can calculate it at the command line, you can write a
function to calculate it.
There is no function in the R base packages to calculate the standard error
of the mean. So let's create one. The standard error of the mean is calculated
from a sample (I should say estimated from a sample) by taking the square root
of the sample variance divided by the sample size. So from the command line...
> setwd("Rspace") # if you've created this directory
> rm(list=ls()) # clean out the workspace
> ls()
character(0)
> nums = rnorm(25, mean=100, sd=15) # create a data vector to work with
> mean(nums) # calculate the mean
[1] 97.07936 # your results will differ
> sd(nums) # and the standard deviation (sample)
[1] 12.92470
> length(nums) # and the length or "sample size"
[1] 25
> sem = sqrt(var(nums)/length(nums)) # this is how the sem is calculated
> sem
[1] 2.584941
Important note: Your results may be different, because the data vector was
filled with randomly generated numbers. By the way, STUDENTS, say "es ee em,"
NOT "sem" as if you were starting to say "semi" or "semolina."
So we know how to calculate the sem ("es ee em") at the command line.
Automating this by creating an "sem( )" function is a piece of cake.
> rm(sem) # get rid of the object we created above
> ?sem # check to see if something by this name already exists
No documentation for 'sem' in specified packages and libraries:
you could try 'help.search("sem")'
> sem = function(x) # create an sem function; x is a dummy variable
+ { # begin the function definition with an open curly brace
+ sqrt(var(x)/length(x)) # enter the necessary calculations
+ } # close the function definition
Another important note: If you are working on a Mac, or in R Studio, the command
editor will close the curly braces as soon as you open them. Thus, as soon as
you type {, the } will also appear. If you just hit the Enter key at this point,
your function is done. You've just defined an empty function. You'll have to
erase that closed curly brace and then remember to type it again at the end to
get what you want. Annoying! Now, what did we do above?
First, we checked to make sure "sem" was not already used as a keyword by
asking for a help page. (That's no guarantee, but it's a good check.) Then we
typed "sem=function(x)", which requires a bit of explanation. The name we
desire for our function is "sem", so that's the first thing we type. Then we
tell R we want to define this as a function by typing "=function". The sem is
going to be calculated on a data object--a vector in this case--so we have to
pass the data
to the function, and that is the point of "(x)". This tells R to expect one
argument to be passed to the function. It doesn't have to be called "x". This
is just a dummy variable, so call it "fred" if you want, as long as you call it
the same thing throughout the function definition.
After you hit the Enter key, R will see that you are defining a function,
and it will give you the + prompt, meaning "tell me more." Type an open curly
brace and hit Enter again. (A more common practice is to type this on the first
line, but it doesn't matter, and I learned it otherwise.) Then type the
calculations needed to get the standard error. Spacing is optional, but I think
it makes it a bit easier to understand if you use some indenting here. Hit
Enter. Type a closed curly brace and hit Enter again. Your function has been
defined and is now in your workspace to be used whenever you want.
> ls()
[1] "nums" "sem"
And it will stay in your workspace for whatever working directory you are in
PROVIDED you save your workspace when you quit R. You use the function just like
you use any other function in R.
> sem(nums)
[1] 2.584941
> PlantGrowth # PlantGrowth is a built-in data frame; output not shown
> with(PlantGrowth, tapply(weight, group, sem))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
So next week you fire up R, you see "sem" in your workspace, and you wonder
what it is (if you're like me). Easy enough to find out.
> class(sem)
[1] "function"
> sem
function(x)
{
sqrt(var(x)/length(x))
}
Just like any other object in your workspace, typing its name without an argument,
or without parentheses in the case of a function, prints it out.
I don't like it. Our sem function is good enough, but if there are missing
values in the data vector, sem( ) will choke.
> nums[20] = NA # create a missing value
> sem(nums)
[1] NA
> var(nums) # here's part of the problem
[1] NA
So let's fix it. The var() function can be fixed
with a simple option that drops NAs from the calculation. Sadly, the
length() function cannot. So I will
resort to using a function called is.na() and it's
negation !is.na(). To see what these do, try
is.na(nums) and
!is.na(nums). Summing the result of this
function has the effect of counting up the number of TRUE responses, i.e., the
number of missings for is.na, and the number of not missings for !is.na.
> rm(sem) # out with the old...
> ls()
[1] "nums"
> sem = function(x)
+ {
+ n = sum(!is.na(x))
+ sqrt(var(x,na.rm=T)/n)
+ }
> ls()
[1] "nums" "sem"
> sem(nums) # once again, your result may be different...
[1] 2.641737
By the way, we couldn't use the length() function
in this calculation because it has no "na.rm=" option. The line n=sum(!is.na(x)) tests each value
of the vector to see if it's missing. If it is NOT (the ! means NOT), then it
returns TRUE for that position in the vector. Finally, the values returned as
TRUE are counted with sum(), because TRUE sums as
1 when you sum a logical vector.
The length() function counts NAs as data values
and doesn't tell you.
(Which is why we couldn't use it above--it would have given the wrong value for
n.) Let's create another function for sample size that reports on NAs.
> ?samp.size
No documentation for 'samp.size' in specified packages and libraries:
you could try 'help.search("samp.size")'
> samp.size = function(x)
+ {
+ n = length(x) - sum(is.na(x))
+ nas = sum(is.na(x))
+ out = c(n, nas)
+ names(out) = c("n", " NA")
+ out # this is the only thing that will be printed
+ }
> ls()
[1] "nums" "samp.size" "sem"
> samp.size(nums)
n NA
24 1
> nums[c(3,7,12)] = NA
> samp.size(nums)
n NA
21 4
Now samp.size() returns a vector the first element
of which is the number of nonmissing values in the data object we feed into it.
So...
> sqrt(var(nums,na.rm=T)/samp.size(nums)[1]) # your result may differ
2.954936
...you can use it like this. The little trick
samp.size(nums)[1] picks up just
the first value in the samp.size vector, which is n.
If you ask me, R has some annoying idiosyncrasies. Take the tapply()
function for example. What could "tapply" possibly mean? And who came up with
that convoluted syntax? Don't like it? Then change it!
> ?calculate
No documentation for 'calculate' in specified packages and libraries:
you could try 'help.search("calculate")'
> calculate = function(FUN, of, by)
+ {
+ tapply(of, by, FUN)
+ }
> ls()
[1] "calculate" "nums" "samp.size" "sem"
> with(PlantGrowth, tapply(weight, group, sem))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
> with(PlantGrowth, calculate(sem, of=weight, by=group))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
Which makes more sense to you?! You should also know that these one-liners can
be entered all on one line.
> rm(calculate)
> ls()
[1] "nums" "samp.size" "sem"
> calculate = function(FUN, of, by) tapply(of, by, FUN)
> with(PlantGrowth, calculate(sem, of=weight, by=group))
ctrl trt1 trt2
0.1843897 0.2509823 0.1399540
You don't even need the curly braces, although you can use them if you want. I
usually do.
Another R function that annoys the crap out of me is summary() when
applied to a numeric vector. I want a standard deviation, and I want a
sample size! As we saw above, a function will print out the last defined
thing in the function
definition (unless you tell it to do otherwise), so we will use that in the
following function definition to have the function print out the value of
round(out,4). Go ahead and type the comments into
the function definition as
well. It's good programming practice if you think you might need a reminder
later of what the heck it is you've done here!
> ?describe
No documentation for 'describe' in specified packages and libraries:
you could try 'help.search("describe")'
> describe = function(x)
+ {
+ m=mean(x,na.rm=T)
+ s=sd(x,na.rm=T)
+ N=sum(is.na(x)) # number of missing values
+ n=length(x)-N # number of nonmissing values
+ se=s/sqrt(n)
+ out=c(m,s,se,n,N)
+ names(out)=c("mean","sd","sem","n","NAs")
+ round(out,4)
+ }
> ls()
[1] "calculate" "describe" "nums" "samp.size" "sem"
> describe(nums)
mean sd sem n NAs
103.3923 13.5412 2.9549 21.0000 4.0000
That's better! And don't forget to SAVE YOUR WORKSPACE when you quit if you want
to keep these functions.
Scripts
A script is just a plain text file with R commands in it. You can prepare a
script in any text editor, such as vim, TextWrangler, or Notepad. You can also
prepare a script in a word processor, like Word, Writer, TextEdit, or WordPad,
PROVIDED you save the script in plain text (ascii) format. This should (!)
append a ".txt" file extension to the file. Drop the script into your working
directory, and then read it into R using the
source() function.
In a moment you're going to see a link. Click on it and a text page will
appear with a sample script on it. Use your browser to save this page to your
desktop. Then move the saved file into your R working directory. Or, if you
really want to be adventurous, type the script into a text editor like Notepad,
save it in your working directory, and you are ready to go. Okay, here is the
link...
Link To Sample Script
Now that you've got it in your working directory one way or another, do
this in R.
> source(file = "sample_script.txt") # Don't forget those quotes!
A note: This may not have worked. And the reason for that is, your script may
not have had the name "sample_script.txt". Depending upon what editor you used
to prepare it, and what operating system you saved it in, you may have gotten
additional elements appended to the name of your script. (Windows is especially
notorious for doing this!) It will be particularly difficult for you to tell, if
you have not set the option that makes your OS "always display file extensions."
Pardon me for griping, but what moron decided it was a good idea to have an OS
not display a full file name? I mean come on! How stupid is it possible to get?
Here's what they should have said when they were first thinking about this.
"Hey! Don't understand what a file extension
is? Go use a typewriter!" Anyway, if you make sure the file has the correct name,
R will read it. If the file is in your working directory, type dir() at the
command prompt, and R will show you the full file name. End of rant!
Also, R does not like spaces in script names, so don't put spaces in your
script names! (In newer versions of R, this is no longer an issue.)
Now, what didn't happen that you expected to happen? Go back to
the link and read the script again if you have to. What happened to the mean of
"y" and the mean of "x"?
The script has created the variables "x" and "y" in your workspace (and has
erased any old objects you had by that name--sorry). You can see them with the
ls( ) function. Executing a script does everything typing those commands
in the Console would do, EXCEPT print things to the Console. Do this.
> x
[1] 22 39 50 25 18
> mean(x)
[1] 30.8
See? It's there. But if you want to be sure a script will print it to the
Console, you should use the print() function.
> print(x)
[1] 22 39 50 25 18
> print(mean(x))
[1] 30.8
When you're working in the Console, the print()
is understood (implicit)
when you type a command or data object name. This is not necessarily so in a
script.
A script is a good way to keep track of what you're doing. If you have a
long analysis, and you want to be able to recreate it later, a good idea is to
type it into a script. If you're working in the Windows R GUI (also in the Mac
R GUI), there is even a built-in script editor. To get to it, pull down the
File menu and choose New Script (New Document on a Mac). A window will open in
which you can type your script. Type this script into the open window. (Hint:
You can copy and paste it.)
with(PlantGrowth, tapply(weight, group, mean))
with(PlantGrowth, aov(weight ~ group)) -> aov.out
summary.aov(aov.out)
summary.lm(aov.out)
Hit the Enter key after the last line. Now, in the editor window, pull down the
Edit menu and choose Run All. (On a Mac, highlight all the lines of the script
and choose Execute.) The script should execute in your R Console. Pull
down the File Menu and choose Save As... Give the file a nice name, like
"script2.txt". R will NOT save it by default with a file extension, so be sure
you give it one. (Note: On my Mac, the script editor in R will not let me save
the script with a .txt extension. It insists that I use .R. Fine!)
Close the editor window. Now, in the R Console, do this.
> source(file = "script2.txt") # or source(file = "script2.R") if that's how you saved it
Nothing happens! Why not? Actually, something did happen. The "aov.out" object
was created in your workspace. However, nothing was echoed to your Console
because you didn't tell it to print(). Go to File
and choose New Script (New Document on a Mac).
In the script editor, pull down File and choose Open Script... (Open
Document... on a Mac). In the Open
Script dialog that appears, change Files Of Type to all files (not necessary on
a Mac). Then choose to
open "script2.txt" (or "script2.R", whatever!). Edit it to look like this.
print(with(PlantGrowth, tapply(weight, group, mean)))
with(PlantGrowth, aov(weight ~ group)) -> aov.out
print(summary.aov(aov.out))
print(summary.lm(aov.out))
Pull down File and choose Save. Close the script editor window(s). And
FINALLY...
> source(file = "script2.txt") # or source(file = "script2.R") if necessary
Scripts! Nothing to it, right? :)
revised 2016 January 13
| Table of Contents
| Function Reference
| Function Finder
| R Project |
|