QUICK AND DIRTY GUIDE TO R
(just about enough to cover a first stats course).
by William B. King, Ph.D.
Coastal Carolina University
© William B. King, Ph.D.
-------------------------------------------------------------------------------
Best used with R in front of you and running -- for simplicity I have done away
with certain annoying statistical formalisms. This is a plain text document.
You should be able to save it or print it with little difficulty. It should
print to about 25 pages.
-------------------------------------------------------------------------------
Getting it.
Download R from www.r-project.org (click on the CRAN link on the left side of
the page under Downloads). As of this moment, the current version is 3.2.3. It
is available for Windows, OS X, and Linux. Get the .exe file for Windows, the
.dmg file for OS X, and for Linux the best way to install is via your package
manager. In Ubuntu, for example, use Synaptic, or from a terminal...
sudo apt-get install r-base
R will not run on a Chromebook or a tablet such as an iPad unless you have an
account on a server running R Studio. If you don't know what that means, then
you don't have one!
-------------------------------------------------------------------------------
Starting R.
In Windows, double-click the desktop shortcut. In OS X, double-click the R icon
in your /Applications folder (or drag a shortcut to the Dock). In Linux, start
a terminal session, and type R and hit the Enter key. Here's what R looks like
when it starts. (This is an older version, but newer versions look the same.)
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
[R.app GUI 1.65 (6833) x86_64-apple-darwin10.8.0]
[Workspace restored from /Users/billking/.RData]
[History restored from /Users/billking/.Rapp.history]
> |
The little window in which this appears is called the R Console. The > is the
command prompt, and the vertical bar (may be a solid block, may or may not be
blinking) is the cursor.
-------------------------------------------------------------------------------
The R prompt.
R is command-line driven. The prompt looks like a greater than sign, >. When
you see it, start typing. (Much faster than hunting and clicking once you get
used to it!) Press Enter when you have finished typing the command. If the
command is not complete, R will issue a continuation prompt, which is a plus
sign, +. You can type long commands on several lines to make it easier to keep
track of syntax. You can also type them into a text editor and then cut and
paste them into R. (Don't type the command prompts. Indent if the command
continues to multiple lines.)
If your command line starts with + instead of >, that means you are continuing
a command you started typing on the previous line. You are not typing a new
command. If that's not what you want, press the Escape key (upper left corner
of the keyboard) to get back to the command prompt. Write that on the back of
your hand. You're going to need it sooner or later!!! (STUDENTS: Seriously, I'm
not going to tell you again how to do this. Make a note, or know where to look
it up! Using R will require you to exercise your memory a bit. Try it. It's
good for you!)
-------------------------------------------------------------------------------
Important info for command line newbies.
Never look above the current command prompt for new information. That's all
history, and it will never change. New information will be printed ONLY below
the command you are currently typing. (That even catches me every once in
a while, and I'm a command line guy from way back!)
-------------------------------------------------------------------------------
Case sensitive.
R commands are case sensitive. If you get a "function not found" error, check
your spelling and your capitalization. R functions are generally, but not
always, typed in lower case letters. On the other hand, spacing usually
doesn't matter. Leave spaces wherever you like (except within names of things
or inside of numbers or data values) to make your input easier to read.
-------------------------------------------------------------------------------
R functions.
R "commands" are called functions. The basic form of a function is:
> function.name(arguments, options)
Example (go ahead and type these and press Enter--output is not shown here):
> mean(rivers)
> mean(rivers, trim=.1)
> mean(rivers, trim=.2)
Obviously, you've just calculated the mean of something. In the first case,
it's the ordinary arithmetic mean. In the second case, the mean was calculated
after trimming off the highest 10% and lowest 10% of the values in the data.
In the third case, 20% of the values were trimmed off the high and low ends of
the distribution of values. The default value for trim= is 0 (do not trim), and
so the value of trim= does not have to be given if you just want the plain old
mean. When an argument has a default value, I'll refer to it as an option.
Try misspelling "rivers", or try typing it with a capital R, "Rivers". You'll
get an "object not found" error. Try spelling mean with a capital M. You'll get
a "function not found" error. R will demand proper spelling and capitalization.
These are the two errors you will see most often in R. When you see them, check
your spelling and capitalization. (STUDENTS: Check it BEFORE you ask me! R
demands attention to details. If you're not used to that, get used to it!)
Even if there are no arguments or options, the parentheses are still mandatory.
If you leave off the parentheses, R will print out the source code for the
function you have entered. This can be startling when you're not expecting it,
but it is not harmful. Just retype the command correctly.
Example:
> mean # press Enter here
> mean rivers # always remember to press Enter
> meen(rivers)
> mean(Rivers)
> mean(rivers) # finally we get it right!
Now what happens if you type mean() and press Enter? Another error. Why?
I've been using R for 12 years, and I get error messages all the time. You will,
too. Don't let it bother you. Just fix the mistake and move on.
-------------------------------------------------------------------------------
Interactive mode.
R can be used by writing programs into script files and reading them in, but
it's easier to use R in interactive mode, typing individual commands at the
command prompt.
Notes or comments can be added to the R session by typing a hash or pound
symbol, #. Anything on the line following this symbol will be ignored by R.
(STUDENTS: You do not have to type things on the handouts that follow a #.)
-------------------------------------------------------------------------------
Keep it clean!
The "workspace" is an area in memory where R keeps your stuff. It is best not
to clutter up your workspace with old stuff you no longer need. If you have
created a data object called myData (for example), in your workspace, and you
want to delete it before saving your workspace or continuing on to a new
problem, you can do it this way.
> myData = 17 # create the data object
> ls() # look at it in your workspace
[1] "myData" # this is an output line--no prompt!
> myData # see the value(s) in myData
[1] 17 # another output line
> rm(myData) # erase it -- remove(myData) also works
> ls() # see that it is GONE
character(0) # R's way of saying "empty"
The function ls() lists the contents of your workspace. The function rm(), or
remove(), erases things from your workspace.
Important note: R does not send things to the trash. If you erase something,
it's GONE! Be careful.
-------------------------------------------------------------------------------
Quitting R.
> quit() # remember, the parentheses are mandatory!
Note: best to know this right away! Also, q() works. When R asks to save your
workspace, pick "no" until you know what this means. If you have entered data
into R and want it saved for your next R session, pick "yes". R will create a
data file on your hard drive that will be loaded automatically next time you
start R.
-------------------------------------------------------------------------------
Simple data input.
Small data sets can be entered using the c() function. This produces a data
vector. Name it anything you want, but don't put spaces or dashes in the name.
Use underscores or periods instead.
> my.data = c( 22, 38, 12, 23, 29, 18, 16, 24 )
Note: spaces after the commas and before and after parentheses are optional.
> my.data=c(22,38,12,23,29,18,16,24) # exactly the same as previous command
To examine what you've entered, just type its name.
> my.data # this implicitly invokes print(my.data)
[1] 22 38 12 23 29 18 16 24
To change value i in the vector, address it as my.data[i]. For example...
> my.data[2] # see the second value in my.data
[1] 38
> my.data[2] = 28 # change its value
> my.data # see the changed data vector
[1] 22 28 12 23 29 18 16 24
One caution about making changes to data--there is no "undo." Be careful!
-------------------------------------------------------------------------------
An important note on assignment.
When you create a data set in R and save it, the thing that gets placed in the
workspace with your data in it is called an "object." Objects are created by
assignment. There are three ways to do assignment. One is with the equals sign
as we've seen.
> my.data = c(22, 38, 12, 23, 29, 18, 16, 24)
Notice that there is no output to the screen. This means R did what you told it
to do and doesn't feel compelled to keep chatting with you about it. (Unlike
most software these days, R is not unnecessarily chatty.) If you create data
by assignment, and R doesn't print anything to the Console, that means "mission
accomplished." If you want to see it, you have to ask.
> my.data
[1] 22 38 12 23 29 18 16 24
On the other hand, if you forget to give your data object a name, R will spill
the data out to the Console, and that will be the end of them. Nothing is
created in the workspace. Don't make this mistake. Name your data object!
> c(22, 38, 12, 23, 29, 18, 16, 24) # nothing is placed in the workspace
[1] 22 38 12 23 29 18 16 24
You can name data objects anything you want as long as the name starts with a
letter, consists of letters, numbers, periods, and underscores, and has no
spaces in it. I would advise you against using T and F as names for data
objects. Other than that...
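Why avoid T and F? They are built-in shortcuts for TRUE and FALSE, and naming a
data object T or F hides the shortcut. A quick demonstration:

```r
T           # by default, T is a shortcut for TRUE
isTRUE(T)   # TRUE
T = 0       # perfectly legal, but now T no longer means TRUE
isTRUE(T)   # FALSE -- any code relying on the shortcut is now broken
rm(T)       # remove your T and the built-in shortcut comes back
isTRUE(T)   # TRUE again
```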
Also remember that data object names are case sensitive. Thus, if you create
my.data, and then ask to see My.data or my.Data or MY.DATA, what's going to
happen?
> My.data
Error: object 'My.data' not found
-------------------------------------------------------------------------------
Another way to enter data from the keyboard.
> rm(my.data)
> my.data = scan() # press the Enter key here
1: 22 # type the first data value and press Enter
2: 38 # type the second data value and press Enter
3: 12 # type the third data value and press Enter
4: 23 # etc.
5: 29
6: 18
7: 16
8: 24
9: # enter a blank data value to terminate data entry
Read 8 items
>
This method is particularly useful for entering longer data vectors. (STUDENTS:
Don't just type scan() and start entering data. Type a name for your data, an
equals sign, and then scan(). Otherwise, when you terminate data entry, your
entered values will spill out to the Console and be lost. You've been warned!)
-------------------------------------------------------------------------------
R output.
You may have noticed above that R not only printed out the values you stored
in your vector, but it also printed out a 1 inside square brackets. When R is
printing out a vector, it indexes all the values, but only the index for the
first value on a line is shown. Thus, [1] means this line begins with data value
number 1. The numbers in square brackets are not part of the data, and they may
be different on your screen depending upon how wide you have the R Console set
to. Try this:
> rivers
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870
[16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280
[31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600
[46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350
[61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
[76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735
[91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377
[106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540
[121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529
[136] 500 720 270 430 671 1770
The first line of this output starts with value number 1, the second line with
value number 16, the third line with value number 31, and so on. All of the
values are indexed, so the value of 310 in the next to the last line is value
number 123. It can be retrieved like this:
> rivers[123]
[1] 310
Don't be confused by the fact that R called it value [1] when it printed it. It
was the first value printed, so it is called [1]. If you want values 10 through
20 from the rivers vector, you can get them this way:
> rivers[10:20] # a colon between two numbers means "through"
[1] 600 330 336 280 315 870 906 202 329 290 1000
The printed values are considered a new vector and are indexed accordingly.
If you want a discontinuous set of values, enter the indexes in a vector created
with the c() function:
> rivers[c(10,21,35,102,116)] # and mind your parentheses and brackets
Square brackets ALWAYS contain indexes, or something that evaluates to indexes.
For example:
> rivers[15]
> rivers[(10+20)/2]
> rivers[rivers > 1000] # prints values for which rivers > 1000 is true
> values = c(10,21,35,102,116)
> rivers[values]
Important note: Leaving the index blank means gimme everything.
> rivers[ ]
Now, what do these commands do?
> head(rivers)
> tail(rivers)
> tail(rivers, n=15)
-------------------------------------------------------------------------------
Simple frequency distributions.
Clear your workspace before continuing.
X f
-----------
7 3
6 0
5 2
4 7
3 5
2 1
1 1
To put this simple frequency distribution into a vector, do this:
> X = 7:1 # same thing as X = c(7, 6, 5, 4, 3, 2, 1)
> f = c(3, 0, 2, 7, 5, 1, 1)
> the.data = rep(X, f) # rep() says repeat each element of X f times
> the.data
[1] 7 7 7 5 5 4 4 4 4 4 4 4 3 3 3 3 3 2 1
If you want to view your data as a table similar to the one above, do this.
> data.frame(X, f)
To save the data in this format:
> data.table = data.frame(X, f)
Your workspace should now look like this (good to keep track):
> ls()
[1] "data.table" "f" "the.data" "X"
-------------------------------------------------------------------------------
Grouped frequency tables.
For large data sets (such as rivers), a simple frequency table would be more
than a bit unwieldy.
> table(rivers) # top numbers are data values; bottom numbers are frequencies
You can also try it this way.
> as.data.frame(table(rivers))
See what I mean? Thus, we resort to grouping. First, some standard notation. An
interval denoted as (10,20] is an interval extending from 10 to 20, which does
not include 10, called the lower bound, but does include 20, called the upper
bound. Including the upper bound and not the lower bound in an interval is
default behavior in R. This behavior is called "closed on the right." This can
be changed with an option.
Step 1: Decide what intervals you want. This may help.
> summary(rivers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
135.0 310.0 425.0 591.2 680.0 3710.0
We will define the following intervals: (99,199], (199,299], (299,399], and so
on to (899,999], (999,1999], (1999,2999], (2999,3999]. Why are the intervals
written this way?
> bounds = c(99, 199, 299, 399, 499, 599, 699, 799, 899, 999, 1999, 2999, 3999)
Step 2: Use the cut() function to do the grouping. Create a temporary storage
place for the result.
> temp = cut(rivers, bounds) # examine this result at your own risk!
Step 3: Create the grouped frequency table.
> table(temp) # one possible format
> as.data.frame(table(temp)) # another possible format
Notice that R did some rounding that we didn't ask for. Can we stop that? Of
course! I'll start again, close the intervals on the left rather than the right,
and stop R from rounding in the interval labels. I'll also replace "temp" with
"X" just to make the result look like it does in a textbook (more or less).
> rm(bounds, temp)
> bounds = c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000)
> X = cut(rivers, bounds, right=FALSE, dig.lab=4)
> table(X) # output not shown
> as.data.frame(table(X))
X Freq
1 [100,200) 1
2 [200,300) 28
3 [300,400) 35
4 [400,500) 18
5 [500,600) 12
6 [600,700) 13
7 [700,800) 8
8 [800,900) 5
9 [900,1000) 4
10 [1000,2000) 13
11 [2000,3000) 3
12 [3000,4000) 1
Note: the right=FALSE (or right=F) option closes the bins on the left, and
dig.lab=4 tells R to maintain 4 digits in the interval labels. (The default is
three.) If we have a river that is exactly 600 miles long (there are three), in
which of the above intervals will we find it?
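We can check the answer directly. This is a quick sketch: sum() counts the
TRUEs in a logical vector, and cut() will bin a single value the same way it
binned the whole vector.

```r
sum(rivers == 600)      # counts the rivers that are exactly 600 miles long
bounds = c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000)
cut(600, bounds, right=FALSE, dig.lab=4)   # with the bins closed on the left,
                                           # 600 falls in [600,700)
```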
You can also get a stem-and-leaf display. (Try fooling with the scale= option
to see what effect that has. It might help if you try sort(rivers) first.)
> stem(rivers, scale=2) # output not shown
-------------------------------------------------------------------------------
Histogram.
> hist(rivers)
Note: Bars in R histograms include the right limit but not the left (closed on
the right). This can be changed, of course.
> hist(rivers, right=F) # F is short for FALSE
> hist(rivers, breaks=20, right=F) # use about 20 breaks (bars)
> hist(rivers, breaks=bounds, right=F) # user defined bounds as breaks
The last command prints a frequency density histogram, because the intervals
are not all equally wide. Thus, an ordinary frequency histogram would be
misleading.
-------------------------------------------------------------------------------
Boxplot.
> boxplot(rivers)
> boxplot(rivers, ylab="River Length in Miles")
> title(main="Length of Major North American Rivers")
-------------------------------------------------------------------------------
Frequency polygon.
There really is no easy way to plot frequency polygons using just the base R
packages. There are optional packages that can be downloaded that have this
functionality (such as ggplot2). However, here is a quick and very dirty
frequency polygon from the base packages.
> hist(rivers, breaks=20, right=F) -> my.hist
> my.hist
> data.frame(interval.midpoints=my.hist$mids, counts=my.hist$counts)
> plot(my.hist$mids, my.hist$counts, type="b", # press Enter
+ xlab="Interval Midpoints", ylab="Frequency")
-------------------------------------------------------------------------------
First intermission.
Time to clean up and get some popcorn.
> ls()
[1] "bounds"     "data.table" "f"          "my.hist"    "the.data"   "X"
> rm(bounds, my.hist)
> ls() # the popcorn is up to you
[1] "data.table" "f" "the.data" "X"
-------------------------------------------------------------------------------
Categorical (nominal) data.
> colors = c("red", "black", "brown", "blonde")
> freqs = c( 8 , 22 , 30 , 18 ) # spaces don't matter
> hair.color = rep(colors, freqs)
> hair.color # output not shown
> table(hair.color)
hair.color
black blonde brown red
22 18 30 8
> barplot(table(hair.color))
> barplot(table(hair.color), col=c("black","yellow","brown","red"))
> barplot(table(hair.color), col=c("black","yellow","brown","red"),
+ ylab="Frequency")
Note: The order of the bars can be changed, too.
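One way to change the order (a sketch; the name ordered.hair is my own): turn
the data into a factor whose levels are listed in the order you want the bars
drawn.

```r
hair.color = rep(c("red", "black", "brown", "blonde"), c(8, 22, 30, 18))
ordered.hair = factor(hair.color, levels=c("blonde", "brown", "red", "black"))
table(ordered.hair)           # the table (and hence the bars) follow our order
barplot(table(ordered.hair))  # bars now run blonde, brown, red, black
```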
The example above uses one function embedded within another. You don't have to
do it this way if you're not comfortable with that.
> temp.table = table(hair.color)
> barplot(temp.table)
> rm(temp.table) # if you no longer need it
> rm(colors, freqs) # clean up
-------------------------------------------------------------------------------
Summation.
> sum(the.data) # sigma X
[1] 77
> sum(the.data) ^ 2 # (sigma X) quantity squared; i.e., 77^2
[1] 5929
> sum(the.data ^ 2) # sigma (X squared)
[1] 359
Note: the squared values themselves can be returned with...
> the.data ^ 2
[1] 49 49 49 25 25 16 16 16 16 16 16 16 9 9 9 9 9 4 1
> x = c(2, 3, 5)
> y = c(3, 8, 9)
> sum(x * y) # sum of the crossproducts
[1] 75
> sum(x) * sum(y) # sigma X times sigma y
[1] 200
There is no function in R for the sum of squares. However, it can easily be
calculated using the arithmetic functions in R:
> sum(the.data ^ 2) - sum(the.data) ^ 2 / length(the.data)
[1] 46.94737
And knowing that allows us to write a function for sum of squares.
> SS = function(X) sum(X^2) - sum(X)^2/length(X) # X is a dummy variable
> SS(the.data)
[1] 46.94737
This function will remain in your workspace until you remove it.
Here are some other built-in math functions in R:
> sqrt() # square root
> log() # returns the natural log
> log10() # base-10 logs
> sin() # and all the other trig functions
> abs() # absolute value
And there are many, many, many more! These functions will work on individual
values or on complete vectors.
> sqrt(144)
[1] 12
> my.data = c(22, 38, 12, 23, 29, 18, 16, 24)
> sqrt(my.data)
[1] 4.690416 6.164414 3.464102 4.795832 5.385165 4.242641 4.000000 4.898979
-------------------------------------------------------------------------------
Descriptive statistics.
> ls() # this is what we have to work with
[1] "hair.color" "my.data" "the.data"
**sample size**
> length(the.data)
[1] 19
> length(hair.color)
[1] 78
> length(my.data)
[1] 8
**sample median**
> median(the.data)
[1] 4
**sample mean**
> mean(the.data)
[1] 4.052632
**range**
> range(the.data) # reports the min and max values
[1] 1 7
**IQR**
> IQR(the.data)
[1] 1.5
Note: This may not be the "textbook" IQR. R can use 9 different methods to
calculate Q1 and Q3. By default it uses a different one than most elementary
textbooks do. See below for a fix.
**sample variance**
> var(the.data) # there is no function for pop. variance
[1] 2.608187
**sum of squares**
> var(the.data) * (length(the.data) - 1) # explain how this works!
[1] 46.94737
> sum(the.data ^ 2) - sum(the.data) ^ 2 / length(the.data)
[1] 46.94737
Note: there is no built-in function, but it is easy enough to create one using
either one of the above calculations.
**sample standard deviation**
> sd(the.data) # there is no function for pop. sd
[1] 1.614988
**more or less all at once**
> summary(the.data)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.000 4.000 4.053 4.500 7.000
Note: these quartiles are technically correct for continuous data. The
"textbook" method would give Q3 = 5, which is correct for discrete data. To
get this...
> quantile(the.data, c(.25,.75), type=2) # textbook quartiles!
25% 75%
3 5
To get the "textbook" IQR...
> IQR(the.data, type=2)
[1] 2
-------------------------------------------------------------------------------
Unit normal distribution.
Note: R has a built-in function for converting to z-scores, but it uses the
sample standard deviation rather than the population SD (sigma), and assumes it
has been given an entire sample of scores. While this makes perfect sense, it's
not what most intro stat books want. Instead, you can use R as a calculator.
If mu = 25 and sigma = 8, convert these scores to z-scores: 18, 28, 38, 21.
**one at a time**
> (18 - 25) / 8
[1] -0.875
**or all at once**
> scores = c(18, 28, 38, 21)
> z = (scores - 25) / 8 # assignment to variable z is optional
> z
[1] -0.875 0.375 1.625 -0.500
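For the record, the built-in function alluded to in the note above is scale().
A quick sketch of why it doesn't match the textbook formula: it standardizes
using the sample mean and sample sd of whatever vector you hand it, not mu and
sigma.

```r
scores = c(18, 28, 38, 21)
scale(scores)                          # z-scores from the SAMPLE mean and sd
(scores - mean(scores)) / sd(scores)   # the same thing, done by hand
(scores - 25) / 8                      # the textbook z-scores, using mu and sigma
```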
**probabilities under the normal curve**
> pnorm(0.375) # area at or below z = .375
[1] 0.6461698
> 1 - pnorm(0.375) # area above z
[1] 0.3538302
> pnorm(0.375, lower.tail=FALSE) # I bet you can figure this one out
[1] 0.3538302
> pnorm(z) # for all of the values stored in z at once
[1] 0.1907870 0.6461698 0.9479187 0.3085375
**the inverse z function**
> qnorm(.9) # the z with 90% of area at or below it
[1] 1.281552
> qnorm(.95, lower.tail=FALSE) # the z with 95% of area at or above it
[1] -1.644854
> rm(scores)
-------------------------------------------------------------------------------
Single sample z-test.
Isn't built in -- must be calculated by hand. Or you can write your own
function to calculate it. There are also optional packages that contain
z-test functions. For example, the UsingR package contains a function,
simple.z.test(), that does single-sample tests.
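If you want to roll your own, here is a minimal sketch. The function name and
arguments below are my own invention, not part of base R or of UsingR.

```r
z.test = function(x, mu, sigma) {            # home-made single-sample z-test
    z = (mean(x) - mu) / (sigma / sqrt(length(x)))
    p = 2 * pnorm(-abs(z))                   # two-tailed p-value
    c(z = z, p.value = p)
}
z.test(c(18, 28, 38, 21), mu=25, sigma=8)    # z = 0.3125
```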
-------------------------------------------------------------------------------
Single sample t-test.
> the.data # just to be sure it's still there
[1] 7 7 7 5 5 4 4 4 4 4 4 4 3 3 3 3 3 2 1
> t.test(the.data, mu=5) # H-null says mu=5, two-tailed
One Sample t-test
data: the.data
t = -2.557, df = 18, p-value = 0.01981
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
3.274232 4.831031
sample estimates:
mean of x
4.052632
One-tailed tests follow...
> t.test(the.data, mu=5, alternative="greater")
One Sample t-test
data: the.data
t = -2.557, df = 18, p-value = 0.99
alternative hypothesis: true mean is greater than 5
95 percent confidence interval:
3.410155 Inf
sample estimates:
mean of x
4.052632
> t.test(the.data, mu=5, alternative="less")
One Sample t-test
data: the.data
t = -2.557, df = 18, p-value = 0.009904
alternative hypothesis: true mean is less than 5
95 percent confidence interval:
-Inf 4.695109
sample estimates:
mean of x
4.052632
-------------------------------------------------------------------------------
Independent t-test.
> x1 = c(18, 25, 17, 20, 23)
> x2 = c(20, 30, 22, 25, 28, 30)
**pooled variance**
> t.test(x1, x2, var.equal=TRUE)
Two Sample t-test
data: x1 and x2
t = -2.2395, df = 9, p-value = 0.05188
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.51953398 0.05286732
sample estimates:
mean of x mean of y
20.60000 25.83333
**unpooled variance (the default)**
> t.test(x1, x2)
Welch Two Sample t-test
data: x1 and x2
t = -2.2903, df = 8.995, p-value = 0.04776
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.40273450 -0.06393217
sample estimates:
mean of x mean of y
20.60000 25.83333
**one-tailed (R always subtracts group 1 - group 2)**
> t.test(x1, x2, var.equal=T, alternative="less")
Two Sample t-test
data: x1 and x2
t = -2.2395, df = 9, p-value = 0.02594
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.9497217
sample estimates:
mean of x mean of y
20.60000 25.83333
**the formula interface (a more sophisticated way)**
You will be using the formula interface quite a lot in R. The basic format of
a formula is DV ~ IV (name of dependent variable tilde name of independent
variable). Thus, to use a formula interface, all values of the DV must be in
one vector, and all values of the IV in another. For each value of the DV,
there is a value of the IV telling what group it is from.
> all.scores = c(x1, x2) # combines x1 and x2 into one vector
> all.scores # the DV vector
[1] 18 25 17 20 23 20 30 22 25 28 30
> grp = c("x1", "x2") # the group names
> n = c(5, 6) # the group sizes
> groups = rep(grp, n) # create the labels
> groups # the IV vector
[1] "x1" "x1" "x1" "x1" "x1" "x2" "x2" "x2" "x2" "x2" "x2"
> t.test(all.scores ~ groups) # "all.scores by (tilde) groups"
Welch Two Sample t-test
data: all.scores by groups
t = -2.2903, df = 8.995, p-value = 0.04776
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.40273450 -0.06393217
sample estimates:
mean in group x1 mean in group x2
20.60000 25.83333
-------------------------------------------------------------------------------
Tests for homogeneity of variance.
There are several tests in R for homogeneity of variance. The one most commonly
used is the Bartlett test. It is slightly, but not much, better than the F-max
test, which is not available in R.
> bartlett.test(all.scores ~ groups)
Bartlett test of homogeneity of variances
data: all.scores by groups
Bartlett's K-squared = 0.1994, df = 1, p-value = 0.6552
Also...
> var.test(x1, x2) # significance test for two variances
F test to compare two variances
data: x1 and x2
F = 0.636, num df = 4, denom df = 5, p-value = 0.6818
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.08608992 5.95601427
sample estimates:
ratio of variances
0.6360225
> var.test(all.scores ~ groups) # the formula interface also works
-------------------------------------------------------------------------------
Dependent (related samples) t-test.
> before = c(18, 17, 20, 19, 22, 24)
> after = c(20, 22, 19, 21, 25, 25)
> t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = -2.4495, df = 5, p-value = 0.05797
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.09887128 0.09887128
sample estimates:
mean of the differences
-2
Note: R always subtracts first group - second group. One-tailed tests can be
specified as in the independent t-test with the option alternative=. Use
alternative="less" if you think "before" mean is less than "after" mean.
Otherwise, use alternative="greater".
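For the data above, the one-tailed version looks like this (output not shown;
since t is negative here, the one-tailed p-value comes out to half the
two-tailed one):

```r
before = c(18, 17, 20, 19, 22, 24)
after  = c(20, 22, 19, 21, 25, 25)
t.test(before, after, paired=TRUE, alternative="less")  # H1: before mean < after mean
```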
-------------------------------------------------------------------------------
Second intermission.
Well, we've made quite a mess of our workspace, haven't we? Examine your
workspace, then remove everything EXCEPT hair.color, x1, and x2, which we will
be using below.
If you've had all you can take for now and want to shut down and resume later,
just answer "yes" when R asks to save your workspace. Your data will be there
waiting for you next time you start up. (Important exception: This is not true
if you've changed the working directory. If you don't know what the working
directory is, then you probably haven't changed it. You should be good to go.)
-------------------------------------------------------------------------------
ANOVA (oneway, between subjects)
> x1 # still there (if not, re-enter it)
[1] 18 25 17 20 23
> x2 # still there
[1] 20 30 22 25 28 30
> x3 = c(35, 27, 27, 30, 40, 33) # a new level of X
> all.scores = c(x1, x2, x3) # combined into a single DV data vector
> grp = c("x1", "x2", "x3") # the group names
> n = c(5, 6, 6) # the group sizes
> groups = rep(grp, n) # labels for each score (the IV vector)
> results = aov(all.scores ~ factor(groups))
> summary(results)
                Df Sum Sq Mean Sq F value   Pr(>F)
factor(groups)   2 358.20  179.10  9.5691 0.002402 **
Residuals       14 262.03   18.72
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notes: The formula interface is mandatory for the aov() function. To get the
ANOVA table, store the output of the aov() function and then use the summary()
function to see the table. The function oneway.test() also does a oneway ANOVA,
but the full ANOVA table is not produced. By default, oneway.test() applies a
Welch-like correction for nonhomogeneity of variance. R can get very snotty if
you don't correctly label the factor. If it is coded numerically, you MUST
convert it to a factor before doing the ANOVA. Check the degrees of freedom in
the ANOVA summary table. If they are not right, then something has gone wrong
with your coding.
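Here is a sketch of what that coding problem looks like (the names num.groups,
bad, and good are my own):

```r
all.scores = c(18, 25, 17, 20, 23, 20, 30, 22, 25, 28, 30, 35, 27, 27, 30, 40, 33)
num.groups = rep(c(1, 2, 3), c(5, 6, 6))    # the same three groups, coded 1, 2, 3
bad  = aov(all.scores ~ num.groups)         # codes treated as a NUMERIC predictor: 1 df
good = aov(all.scores ~ factor(num.groups)) # codes treated as group labels: 2 df
summary(good)                               # check the df line before you trust it
```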
The Bartlett test is available to check the homogeneity assumption.
> bartlett.test(all.scores ~ groups)
Bartlett test of homogeneity of variances
data: all.scores by groups
Bartlett's K-squared = 0.6503, df = 2, p-value = 0.7224
The Tukey HSD test is available for post hoc testing of pairwise comparisons.
> TukeyHSD(results) # a commonly used post hoc test
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = all.scores ~ factor(groups))
$`factor(groups)`
diff lwr upr p adj
x2-x1 5.233333 -1.6231312 12.08980 0.1493412
x3-x1 11.400000 4.5435354 18.25646 0.0017968
x3-x2 6.166667 -0.3707158 12.70405 0.0656473
Advantage of the aov() function--a post hoc (Tukey) test can be done.
Advantage of the oneway.test() function--it can correct for nonhomogeneity.
> oneway.test(all.scores ~ groups)
One-way analysis of means (not assuming equal variances)
data: all.scores and groups
F = 9.4638, num df = 2.000, denom df = 9.299, p-value = 0.005727
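If you tell oneway.test() to assume equal variances, it reproduces the F-test from aov() above, which is a handy way to check your work. A sketch using the same data:

```r
x1 = c(18, 25, 17, 20, 23)
x2 = c(20, 30, 22, 25, 28, 30)
x3 = c(35, 27, 27, 30, 40, 33)
all.scores = c(x1, x2, x3)
groups = rep(c("x1", "x2", "x3"), c(5, 6, 6))
oneway.test(all.scores ~ factor(groups), var.equal=TRUE)
# F = 9.5691 on 2 and 14 df, matching the ANOVA summary table above
```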
Clean up: Remove everything EXCEPT hair.color from your workspace.
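One quick way to do that clean-up (a sketch; rm() with the list= option takes a character vector of object names):

```r
# suppose the workspace contains hair.color plus leftovers from above
hair.color = c("red", "black", "brown")  # stand-in; yours is already there
x1 = c(18, 25, 17, 20, 23)               # a leftover object
rm(list = setdiff(ls(), "hair.color"))   # remove everything EXCEPT hair.color
ls()                                     # only "hair.color" should remain
```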
-------------------------------------------------------------------------------
Correlation.
> x = rnorm(100, mean=50, sd=10) # 100 normally distributed random numbers
> y = rnorm(100, mean=75, sd=20) # 100 more (quick and dirty data!)
> plot(x, y) # scatterplot
> cor(x, y)
[1] 0.00620442 # Pearson's r is the default
> cor(x, y, method="spearman") # Spearman rho (don't capitalize it!)
[1] -0.002076208
> cor.test(x, y) # correlation t-test
Pearson's product-moment correlation
data: x and y
t = 0.0614, df = 98, p-value = 0.9511
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1904458 0.2023759
sample estimates:
cor
0.00620442
Note: One-tailed tests are also possible by using the alternative= option. Use
alternative="less" if you think the correlation is less than zero. Otherwise,
use alternative="greater".
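For example (quick and dirty random data again, so the p-value will vary from run to run; with the true correlation zero, the one-tailed test should usually come out nonsignificant):

```r
x = rnorm(100, mean=50, sd=10)
y = rnorm(100, mean=75, sd=20)
cor.test(x, y, alternative="greater")  # one-tailed: H1 is correlation > 0
```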
-------------------------------------------------------------------------------
Regression.
> lm(y ~ x) # linear regression requires formula interface
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
74.43864 0.01112
> my.lm.analysis = lm(y ~ x) # storing the results gives more output
> summary(my.lm.analysis)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-38.663 -14.562 -0.173 13.702 50.507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.43864 9.07285 8.205 9.27e-13 ***
x 0.01112 0.18097 0.061 0.951
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.45 on 98 degrees of freedom
Multiple R-Squared: 3.849e-05, Adjusted R-squared: -0.01017
F-statistic: 0.003773 on 1 and 98 DF, p-value: 0.9511
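A common follow-up is to draw the fitted regression line on the scatterplot, which the abline() function will do straight from a stored lm() result. A sketch (same quick and dirty random data, so your numbers will differ):

```r
x = rnorm(100, mean=50, sd=10)
y = rnorm(100, mean=75, sd=20)
my.lm.analysis = lm(y ~ x)
plot(x, y)               # scatterplot first
abline(my.lm.analysis)   # then overlay the fitted regression line
```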
> rm(x, y, my.lm.analysis) # clean up
-------------------------------------------------------------------------------
Chi-square goodness of fit.
> ls() # this should be where you are
[1] "hair.color"
> hair.color # still there? (if not, re-enter it)
[1] "red" "red" "red" "red" "red" "red" "red" "red"
[9] "black" "black" "black" "black" "black" "black" "black" "black"
[17] "black" "black" "black" "black" "black" "black" "black" "black"
[25] "black" "black" "black" "black" "black" "black" "brown" "brown"
[33] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown"
[41] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown"
[49] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown"
[57] "brown" "brown" "brown" "brown" "blonde" "blonde" "blonde" "blonde"
[65] "blonde" "blonde" "blonde" "blonde" "blonde" "blonde" "blonde" "blonde"
[73] "blonde" "blonde" "blonde" "blonde" "blonde" "blonde"
> table(hair.color) # observed frequencies
hair.color
black blonde brown red
22 18 30 8
> chisq.test(table(hair.color))
Alternatively, you can do this one step at a time...
> temp.table = table(hair.color)
> chisq.test(temp.table)
Chi-squared test for given probabilities
data: temp.table
X-squared = 12.8718, df = 3, p-value = 0.004922
> chisq.test(c(22,18,30,8)) # also works from a vector of OFs
Chi-squared test for given probabilities
data: c(22, 18, 30, 8)
X-squared = 12.8718, df = 3, p-value = 0.004922
Note: The default is equal expected frequencies. If expected frequencies are
unequal, enter them as proportions or probabilities.
> expected = c(1/4, 1/4, 1/3, 1/6) # best to use fractions if possible
> sum(expected) # must sum exactly to 1, of course
[1] 1
> observed = c(22, 18, 30, 8) # observed must be frequency counts
> chisq.test(observed, p=expected)
Chi-squared test for given probabilities
data: observed
X-squared = 2.9744, df = 3, p-value = 0.3956
> rm(hair.color, temp.table, expected, observed)
-------------------------------------------------------------------------------
Chi-square test of independence.
If the data have already been crosstabulated, this is easy. Otherwise, you'll
need to learn to use the xtabs() and table() functions for crosstabulation.
Here are some crosstabulated data that we will put into a table.
                        variable one
                    dems   repubs   other
variable    male     22      35      12
two       female     28      30      18
> row1 = c(22, 35, 12)
> row2 = c(28, 30, 18)
> my.table = rbind(row1, row2) # binds rows into a table (matrix)
> my.table
[,1] [,2] [,3]
row1 22 35 12
row2 28 30 18
> chisq.test(my.table) # submit a table name and R knows what to do
Pearson's Chi-squared test
data: my.table
X-squared = 1.9713, df = 2, p-value = 0.3732
Note: For 2x2 tables, R does the Yates correction by default. To prevent this:
> chisq.test(my.table, correct=F)
To see more information, store the results of the test...
> chisq.results = chisq.test(my.table)
> chisq.results
> chisq.results $ expected # expected frequencies
> chisq.results $ residuals # Pearson residuals (unsquared chi values) in each cell
> rm(row1, row2, my.table, chisq.results) # as always, clean up
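If your data are raw (one row per case) rather than already crosstabulated, the table() function will build the table for you. A small sketch with made-up raw data (sex and party are hypothetical names):

```r
# hypothetical raw data: one sex and one party recorded per respondent
sex   = c("male", "male", "female", "female", "female", "male")
party = c("dems", "repubs", "dems", "other", "dems", "dems")
my.table = table(sex, party)   # crosstabulate
my.table                       # then submit to chisq.test() as above
```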
-------------------------------------------------------------------------------
Getting help.
> help(function.name) # opens a somewhat helpful help page
> help(t.test) # ?t.test is a shortcut
> help.search("distribution") # looks for anything to do with quoted terms
> apropos("mean") # returns names of objects containing "mean"
Note: The R manuals are included with the download in both PDF and html format.
Look for them in the root R directory.
Note: In Windows and OS X, the help() function opens a new window, which can be
closed like any other window. In Linux, help pages are opened right in the
terminal session. To close the help page, press the Q key.
-------------------------------------------------------------------------------
Reading in a data frame from a csv file.
The best way I've found to handle large data files is to enter them into a
spreadsheet and then save them in comma separated format. The file should have
column names, one column per variable, and one row per case, and NOTHING else.
Don't use any fancy formatting. Don't leave any blank cells. Don't use any data
or variable names with spaces in them. Good idea: keep the names short!
If you drop a copy of this csv file into your working directory, you can read
it into a data frame as follows:
> dataframe.name = read.csv(file = "filename.csv")
It's best to keep the names of data frames in memory SHORT, so maybe...
> df = read.csv(file = "filename.csv")
If you put row names (see next topic) in the first column of the csv file, then
do this:
> df = read.csv(file = "filename.csv", row.names=1)
If the csv file is not in your working directory, set file= to the complete
(or relative) pathname of the file. If the file is on the Internet, set file= to
the complete url of the file. (Older versions of R could not read https URLs
directly; if https gives an error, try http instead.)
If you want to try reading one off the Internet, use these commands.
> url = "http://ww2.coastal.edu/kingw/statistics/R-tutorials/text/caffeine.csv"
> caffeine = read.csv(file=url)
If this was NOT successful, you got some sort of error message, probably fairly
cryptic. If it was successful, R was silent, and in your workspace you'll find
a data frame called caffeine.
> ls()
[1] "caffeine" "url"
The following functions are available to examine it.
> names(caffeine) # get the names of the variables
[1] "group" "dose" "tapping"
> dim(caffeine) # get the size of the data frame
[1] 30 3
> head(caffeine) # see the header and top six rows of data
group dose tapping
1 caff0ml 0 242
2 caff0ml 0 245
3 caff0ml 0 244
4 caff0ml 0 248
5 caff0ml 0 247
6 caff0ml 0 248
> summary(caffeine) # summarize all the variables
group dose tapping
caff0ml :10 Min. : 0 Min. :242.0
caff100ml:10 1st Qu.: 0 1st Qu.:245.0
caff200ml:10 Median :100 Median :246.5
Mean :100 Mean :246.5
3rd Qu.:200 3rd Qu.:248.0
Max. :200 Max. :252.0
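Another handy inspection function is str(), which shows the structure of a data frame: each variable's type and its first few values. A sketch using a tiny made-up data frame (with the caffeine data loaded, str(caffeine) works the same way):

```r
# str() gives a compact one-line-per-variable view of any data frame
df = data.frame(group = c("a", "a", "b"), score = c(242, 245, 244))
str(df)   # reports the dimensions, then each variable's type and values
```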
-------------------------------------------------------------------------------
Inline data entry for shorter data frames.
It's not hard to type data vectors into R using the scan() function, but if you
want to enter an entire data frame at one go, that's another matter. You can do
it, though, by opening a script window (File > New Script in Windows; File > New
Document on a Mac) and typing exactly the following (using your own data, of
course).
groceries = read.table(header=T, text="
items storeA storeB storeC storeD
lettuce 1.17 1.78 1.29 1.29
potatoes 1.77 1.98 1.99 1.99
milk 1.49 1.69 1.79 1.59
eggs 0.65 0.99 0.69 1.09
bread 1.58 1.70 1.89 1.89
cereal 3.13 3.15 2.99 3.09
ground.beef 2.09 1.88 2.09 2.49
tomato.soup 0.62 0.65 0.65 0.69
laundry.detergent 5.89 5.99 5.99 6.99
aspirin 4.46 4.84 4.99 5.15
")
It's probably best to do spacing with the spacebar if you're in Windows. I
sometimes have trouble getting the Windows version of R to recognize tabs as
white space (even though the help page says it should). It doesn't seem to
matter on a Mac. Once you get it typed, execute the script, and the data are in
your workspace in an object called "groceries". (To execute a script, in
Windows, go to the Edit menu, and choose Run all. On a Mac, highlight the whole
thing in the script window with your mouse, go to the Edit menu, and choose
Execute.) Hint: You can also use a text editor and copy and paste the text
file into R at a command prompt. You may have to use this method in Linux.
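For completeness, here is what entry of a single vector with scan() looks like. Typed interactively, scan() reads numbers until you enter a blank line; non-interactively it can also read from a quoted string via the text= option:

```r
# interactive use: type  x = scan()  then enter numbers, blank line to finish
# non-interactive sketch of the same thing:
x = scan(text = "18 25 17 20 23")
x
# [1] 18 25 17 20 23
```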
-------------------------------------------------------------------------------
Working with data frames.
Clear your workspace. Then do this.
> data(mtcars)
This will put a data frame in your workspace that came with R when you installed
it. To get some information on what is in this data frame, use the help file.
> help(mtcars)
Here's what the data frame looks like.
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The column on the left is technically not part of the data but a column of
row names. Usually these will just be numbers unless you've gone to the trouble
of creating a unique row name for every row in the table. Each row represents a
single case (roughly, a "subject") in the data. Thus, the first row in this
data frame contains all the info on the Mazda RX4. (I would not put blank spaces
in these row names, but R can handle it if you know what you're doing.)
Each subsequent column contains one variable upon which the "subject" has been
measured. Each of these columns will have a name at the top. The first such
column has the name "mpg" and contains gas mileage info for each of the cars.
Data frames can be very long, thousands of cases or more, so it is often not
convenient to print the whole thing to the Console. In that case, the following
functions will be helpful.
> head(mtcars) # see the first six rows of data
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> dim(mtcars) # size of data frame in rows and columns
[1] 32 11
> names(mtcars) # see the variable names (column headers)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
> summary(mtcars) # summarize all the variables in the data
All the variables in this data frame are numeric, as indicated by the fact that
each variable was given a numerical summary. This will not always be the case.
For example:
> summary(chickwts)
weight feed
Min. :108.0 casein :12
1st Qu.:204.5 horsebean:10
Median :258.0 linseed :12
Mean :261.3 meatmeal :11
3rd Qu.:323.5 soybean :14
Max. :423.0 sunflower:12
This built-in data frame contains two variables, one of which is numeric
(weight), and one of which is categorical (feed). Categorical variables are
summarized with frequency tables. Similarly, in the caffeine data frame that
you read from the Internet in the last topic, there were three variables, one
categorical (group), and two numeric (dose, tapping).
> mtcars["Honda Civic", "mpg"] # get gas mileage for a Honda Civic
[1] 30.4
> mtcars["Honda Civic", ] # get all info on the Honda Civic
mpg cyl disp hp drat wt qsec vs am gear carb
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
The data frame can also be addressed using numeric indexes, if you know what
they are.
> mtcars[19, 1] # Honda Civic is row 19; mpg is column 1
[1] 30.4
Here is an important rule about data frames: R will not look inside a data
frame unless you give it permission to do so. Thus:
> mean(mpg)
Error in mean(mpg) : object 'mpg' not found
This fails because the variable mpg is inside the data frame, and R cannot see
it there until you tell it to look there. There are several ways to do this.
Here are two.
> with(mtcars, mean(mpg)) # syntax: with(dataframe.name, function(variable))
[1] 20.09062
> mean(mtcars$mpg) # syntax: function(dataframe.name$variable)
[1] 20.09062
Is there a difference in gas mileage depending upon transmission type? Sounds
like a t-test to me.
> t.test(mpg ~ am, data=mtcars)
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
Is there a difference in gas mileage as a function of the number of forward
gears? This sounds like an ANOVA. First, we'll examine the group means and sds.
> with(mtcars, tapply(mpg, gear, mean))
3 4 5
16.10667 24.53333 21.38000
> with(mtcars, tapply(mpg, gear, sd))
3 4 5
3.371618 5.276764 6.658979
The tapply() function (pronounced tee-apply, not tap-plea) is one of the more
cryptically named R functions. Its syntax is tapply(DV, IV, function). This
will break down the DV into groups defined by the IV and calculate the function
within each of these groups.
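Any function that makes sense on a vector can go in that third slot. For example, length() recovers the group sizes:

```r
data(mtcars)
with(mtcars, tapply(mpg, gear, length))   # how many cars in each gear group
#  3  4  5
# 15 12  5
```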
Now we'll get a graphic display of the groups.
> boxplot(mpg ~ gear, data=mtcars)
> title(xlab="Number of Forward Gears", ylab="Gas Mileage in Miles Per Gallon")
Here's the ANOVA. One trick here is that the IV, gear, is coded numerically.
This will throw the aov() function into a conniption unless we are sure to
declare gear to be a factor. (This declaration would not be necessary if gear
were coded categorically.)
> results = aov(mpg ~ factor(gear), data=mtcars)
> summary(results)
Df Sum Sq Mean Sq F value Pr(>F)
factor(gear) 2 483.2 241.62 10.9 0.000295 ***
Residuals 29 642.8 22.17
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
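As in the earlier oneway ANOVA topic, the stored aov() result can be handed to TukeyHSD() for post hoc pairwise comparisons (refitting the model here so the sketch stands alone):

```r
data(mtcars)
results = aov(mpg ~ factor(gear), data=mtcars)
TukeyHSD(results)   # pairwise comparisons: 4-3, 5-3, and 5-4 gears
```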
Is there a correlation between gas mileage and engine displacement?
> plot(mpg ~ disp, data=mtcars) # scatterplot
> with(mtcars, cor(mpg, disp))
[1] -0.8475514
> with(mtcars, cor.test(mpg, disp))
Pearson's product-moment correlation
data: mpg and disp
t = -8.7472, df = 30, p-value = 9.38e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9233594 -0.7081376
sample estimates:
cor
-0.8475514
How about between engine displacement and horsepower? You do it!
Is there a relationship between transmission type and number of forward gears,
both considered as factors?
> my.table = with(mtcars, table(am, gear))
> my.table
gear
am 3 4 5
0 15 4 0
1 0 8 5
> chisq.test(my.table)
Pearson's Chi-squared test
data: my.table
X-squared = 20.9447, df = 2, p-value = 2.831e-05
Warning message:
In chisq.test(my.table) : Chi-squared approximation may be incorrect
The warning is telling us that some of the expected frequencies are less than
five, possibly making the chi-square approximation inaccurate.
> chisq.test(my.table) $ expected
gear
am 3 4 5
0 8.90625 7.125 2.96875
1 6.09375 4.875 2.03125
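When a warning like this appears, one standard fallback (not otherwise covered in this guide) is Fisher's exact test, which is also in base R and needs no chi-square approximation:

```r
data(mtcars)
my.table = with(mtcars, table(am, gear))
fisher.test(my.table)   # exact p-value for the same 2x3 table
```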
-------------------------------------------------------------------------------
Housekeeping.
The workspace is an area in the computer's memory where R stores stuff you've
created in the Console or have read into R from an external file.
> ls() # shows the names of objects you've created
> remove(object.name) # deletes the named object; rm() also works
Note: it's best to keep your workspace cleaned up by deleting unused objects.
The working directory is the directory (folder) on your hard drive where R
expects to find things, such as data files, that you ask for.
> getwd() # get the name of the working directory
> setwd("dirname") # set (change) the working directory
> dir() # get a dir listing of the working directory
Note: In Windows it's best to use the pulldown menus to change directories
rather than attempt to navigate through the obtuse Windows file system. There
are also pulldown menus in OS X, but there are no menus in Linux.
If your R session becomes long, and you've typed in a lot of data that you don't
want to risk losing if the power goes out, you can do the following:
> save.image()
This will save a copy of your workspace to the working directory. It does the
same thing as choosing "yes" when you quit R. This will write over any data
file that has been saved previously, so be careful about that. The created file
will be named .RData, which will be an invisible file in unixy operating systems
such as OS X and Linux.
If you want to save a lot of these without writing over old ones, just give the
file a name.
> save.image(file = "mydata.RData")
You can create as many of these as you want, but be aware that they will not be
automatically loaded next time you start R. To load the file after R has been
started, do this.
> load(file = "mydata.RData")
This will add the contents of mydata.RData to anything you already have in your
workspace. Be aware that if there are any variable or data names in common, the
version in the workspace will be overwritten by the version in mydata.RData.
There is one more place you ought to be aware of in R called the search path.
You can see it as follows:
> search() # results will differ in Windows vs. OS X
The search path contains the places R will look for stuff when you ask for a
function or a data set. The first entry is ".GlobalEnv", the global environment,
which is R's name for the workspace. When you ask for a function or data, R will
look in the workspace first. Then it will look through each of those other
locations in turn. If it doesn't find what you asked for, you will get a "not
found" error.
It's possible to add and remove things from the search path. This can be a
fairly dangerous practice for newbies, however, so I'm not going to tell you
how to do it at this time.
-------------------------------------------------------------------------------
revised 2016 January 12; updated 2016 April 7
-------------------------------------------------------------------------------