| Table of Contents
| Function Reference
| Function Finder
| R Project |
DATA FRAMES
Preamble and Editorial
There is plenty to say about data frames because they are the primary data
structure in R. Some of what follows is essential knowledge. Some of it will be
satisfactorily learned for now if you remember that "R can do that." I will try
to point out which parts are which. Set aside some time. This is a long one! If
you break this one up into multiple sessions, always save your workspace when
you quit.
> setwd("Rspace") # if you have this directory
> rm(list=ls()) # clear the workspace
A Note About Data Management. We can hardly discuss data frames without talking
about data management. How do you get your data in? How do you edit them once
they're in? I'm sorry to report that this is one area in which R is particularly
poor. The facilities in R for data management are, to say the least, clumsy and
inadequate. On top of that, there doesn't seem to be any move afoot to
improve them.
If I were a programmer, this is where I'd be working to improve R. The single
most common "excuse" I've heard from people for not adopting R is lack of data
management tools. Now don't get me wrong. R does contain very powerful data
management tools, and you can accomplish just about any data management task
from within R. It's just not the way most people want to work with their data.
Most people (so they tell me anyway) find a command line interface a clumsy way
to manipute a (large) data set. I get that.
People working with other statistics software packages are used to a
spreadsheet-like interface for entering and editing data. I've worked with that
interface in SPSS, and I personally find it clunky and awkward. I'd much rather
use a modern spreadsheet to manage my data, and that's what I do in R. For some
reason, other people want the spreadsheet interface integrated, even if it's
"clunky and awkward."
R does have a spreadsheet-like data editor. It is invoked by functions such
as edit(), fix(), data.entry(), and maybe a couple
others. I don't use these functions, and I'm not going to discuss them. Here's
why. They just flat out don't work on my system. They're not awkward or clumsy,
they generate error messages! I am at the moment sitting beside a Windows XP
computer running R 3.1.2, and the functions are working there, but they
are awkward and illogical. So even when these data management functions do work,
they are just not convenient (or particularly safe!) ways to manage data.
Bottom line--you are probably going to end up using a spreadsheet or some
other third-party software to manage larger data sets. I will show you a
little of how to do that in this tutorial and the next one. I should also
say that R can be set up to work with data base management software such as
SQL, whatever that is. I don't know how to do that, and I've read mixed
reviews of its effectiveness. It also sounds like you better be running
Windows if you want to make it work, but I haven't really looked into it,
and don't plan to. Final note: R keeps data in RAM, so if you plan to work
with really, really large data sets, you're going to have to interact with
some sort of data base software, or have lots and lots of RAM. I have 4 GB
in my system and have worked with data sets that have tens of thousands of
cases and scores of variables. Having all the data in RAM makes R very fast.
However, available RAM is the limiting
factor in how large a data set you can work with entirely within R.
Definition and Examples (essential)
A data frame is a table, or two-dimensional array-like structure, in which
each column contains measurements on one variable, and each row contains one
case. As we shall see, a "case" is not necessarily the same as an experimental
subject or unit, although they are often the same. Technically, in R a data
frame is a list of column vectors, although there is only one reason why you
might need to remember such an arcane thing. Unlike an array, the data you
store in the columns of a data frame can be of various types. I.e., one column
might be a numeric variable, another might be a factor, and a third might be
a character variable. All columns have to be the same length (contain the same
number of data items, although some of those data items may be missing values).
Let's say we've collected data on one response variable or DV from 15
subjects, who were divided into three experimental groups called control
("contr"), treatment one ("treat1"), and treatment two ("treat2"). We might be
tempted to table the data as follows.
contr treat1 treat2
---------------------------
22 32 30
18 35 28
25 30 25
25 42 22
20 31 33
---------------------------
While this is a perfectly acceptable table, it is NOT a data frame, because
values on our one response variable have been divided into three columns (and so
have values on the grouping or independent variable). A data frame has the name
of the variable at the top of the column, and values of that variable in the
column under the variable name. So the data above should be tabled as follows.
scores group
----------------
22 contr
18 contr
25 contr
25 contr
20 contr
32 treat1
35 treat1
30 treat1
42 treat1
31 treat1
30 treat2
28 treat2
25 treat2
22 treat2
33 treat2
----------------
This is a proper data frame. It does not matter in what
order the columns appear, as long as each column contains values of one
variable, and every recorded value of that variable is in that column.
In a previous tutorial we used the data object "women" as an example of a
data frame.
> women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
In this data frame we have two numeric variables and no real explanatory
variables (IVs) or response variables (DVs). Notice when R prints out a data
frame, it numbers the rows. These numbers are for convenience only and are not
part of the data, and I'll have much more to say about them shortly.
We can refer to any value, or subset of values, in this data frame using the
already familiar notation.
> women[12,2] # row 12, column 2; note the square brackets
[1] 150
> women[8,] # row 8, all columns (blank index means "all")
height weight
8 65 135
> women[1:5,] # rows 1 through 5, all columns
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
> women[,2] # all rows, column 2
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
> women[c(1,3,7,13),] # rows 1, 3, 7, and 13, all columns
height weight
1 58 115
3 60 120
7 64 132
13 70 154
> women[c(1,3,7,13),1] # rows 1, 3, 7, and 13, column 1
[1] 58 60 64 70
Here's the catch. Those index numbers do NOT necessarily correspond to the
numbers you see printed at the left of the data frame. This can be confusing,
and it is something you need to keep in mind. I will explain in a moment.
Another built-in data object that is a data frame is "warpbreaks". This data
frame contains 54 cases, so I will print out only every third one. I do this
with the sequence function, since this function creates a vector just as the
c() function did in the above examples.
> warpbreaks[seq(from=1, to=54, by=3),]
breaks wool tension
1 26 A L
4 25 A L
7 51 A L
10 18 A M
13 17 A M
16 35 A M
19 36 A H
22 18 A H
25 28 A H
28 27 B L
31 19 B L
34 41 B L
37 42 B M
40 16 B M
43 21 B M
46 20 B H
49 17 B H
52 15 B H
In this data frame we have one numeric variable (number of breaks), and two
categorical variables (type of wool and tension on the wool). We don't have to
look at the data frame itself to get this information. We can also use the
str() function, which displays a breakdown of
the structure of a data frame (or other data object).
> str(warpbreaks)
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
Here are two more handy functions for finding out what's in a data frame.
> head(warpbreaks) # see the first six rows of data
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L
> summary(warpbreaks) # see a summary of each of the variables
breaks wool tension
Min. :10.00 A:27 L:18
1st Qu.:18.25 B:27 M:18
Median :26.00 H:18
Mean :28.15
3rd Qu.:34.00
Max. :70.00
Another example is the data object "sleep".
> sleep
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
7 3.7 1 7
8 0.8 1 8
9 0.0 1 9
10 2.0 1 10
11 1.9 2 1
12 0.8 2 2
13 1.1 2 3
14 0.1 2 4
15 -0.1 2 5
16 4.4 2 6
17 5.5 2 7
18 1.6 2 8
19 4.6 2 9
20 3.4 2 10
Here we have two variables, the change in sleep time a subject got ("extra"),
and what drug the subject received ("group"). There is also a subject identifier
(ID), indicating that the first 10 cases and last 10 cases are the same subjects.
In this data frame, the first variable
(the dependent variable, DV, response variable, etc.) is numeric and the
second (the independent variable, IV, explanatory variable, grouping variable,
etc.) is categorical, even though the categorical variable is coded as a
number. Once again, it does not matter in what order the columns occur. Put the
IV in the first column, the DV in the third column, and the subject ID between
them, if you want.
However, if categorical variables are coded as numbers (a common practice),
R will not know this until you tell it. Has R been told that "group" is a factor
in this data frame? The str() is one handy way to
find out.
> str(sleep)
'data.frame': 20 obs. of 3 variables:
$ extra: num 0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
$ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
In this case, the fact that "group" is a factor is stored internally in the
data frame, but that will not always be the case. So it's worth taking a look
to make sure things you intend to be factors are being interpreted as
factors by R. You can do this with str(),
but you can also do it with summary(),
because numeric variables and factors are summarized differently.
> summary(sleep)
extra group ID
Min. :-1.600 1:10 1 :2
1st Qu.:-0.025 2:10 2 :2
Median : 0.950 3 :2
Mean : 1.540 4 :2
3rd Qu.: 3.400 5 :2
Max. : 5.500 6 :2
(Other):8
Notice that numeric variables ("extra") are summarized with numerical summary
statistics, while factors are summarized with a frequency table. In these data,
there are 10 subjects in group 1 and 10 subjects in group 2. There are also
two subjects with ID 1, two subjects with ID 2, etc.
An Ambiguous Case (essential)
Entering data into a data frame sometimes involves making a tough decision as
to what your variables are. The following example is from a built-in data object
called "anorexia". This data set is not in the libraries that are loaded by
default when R starts, so to see it, the first thing we need to do is attach the
correct library to the search path. Let's see how that works.
> search()
[1] ".GlobalEnv" "tools:RGUI" "package:stats"
[4] "package:graphics" "package:grDevices" "package:utils"
[7] "package:datasets" "package:methods" "Autoloads"
[10] "package:base"
This is the default search path, the one you have right after R starts. (It
will be a little different in different operating systems.) We want to see an
object in the MASS library (or package), which is not currently in the search
path. So to get it into the search path, do this.
> library(MASS)
> search()
[1] ".GlobalEnv" "package:MASS" "tools:RGUI"
[4] "package:stats" "package:graphics" "package:grDevices"
[7] "package:utils" "package:datasets" "package:methods"
[10] "Autoloads" "package:base"
Notice we have added "package:MASS" to the search path in position 2. This means
if we request an R object, R will look first in the global environment (the
workspace), and if the object is not found there, R will look next in MASS, then
in RGUI, then in stats, and so on, until the object either is found or R runs
out of places to look for it. The "anorexia" data frame is 72 cases long, so to
conserve space we will look at only every fifth row of it.
> data(anorexia) # put a copy in your workspace
> anorexia[seq(from=1, to=72, by=5),]
Treat Prewt Postwt
1 Cont 80.7 80.2
6 Cont 88.3 78.1
11 Cont 77.6 77.4
16 Cont 77.3 77.3
21 Cont 85.5 88.3
26 Cont 89.0 78.8
31 CBT 79.9 76.4
36 CBT 80.5 82.1
41 CBT 70.0 90.9
46 CBT 84.2 83.9
51 CBT 83.3 85.2
56 FT 83.8 95.2
61 FT 79.6 76.7
66 FT 81.6 77.8
71 FT 86.0 91.7
The data frame contains data from women who underwent treatment for anorexia. In
the first column we have the treatment variable ("Treat"). The second column
contains the pretreatment body weight in pounds ("Prewt"). The third column
contains the posttreatment body weight in pounds ("Postwt"). So where is the
ambiguity?
Here's the awkward question. In our analysis of these data, do we wish to
treat weight as two variables ("Prewt" and "Postwt") each measured once on each
subject, or as one variable (call it "weight") measured twice on each subject?
The data frame is currently arranged as if the plan was for an analysis of
covariance, with "Postwt" being the response, "Treat" the explanatory variable,
and "Prewt" the covariate. Prewt and Postwt are treated as two variables.
If the plan was for a repeated measures ANOVA, then the data frame is wrong,
because in this case, "weight" is ONE variable measured twice ("pre" and "post")
on each woman. In this analysis, we would also need to add a "subject" identifier
to the data frame as well, since each subject would have two lines, a "pre" line
and a "post" line. (NOTE: There is an optional package that can be downloaded
from CRAN which will do repeated measures ANOVA on data in this format.
Google EZANOVA for details. The package is called ez.)
It's not a disaster. The data frame is easy enough to rearrange on the fly,
and we will do so below.
FYI, this is how we get the MASS package out of the search path if
we no longer need it, which we don't. (Don't remove "anorexia" from your
workspace, however.)
> detach("package:MASS")
Creating a Data Frame in R (essential)
The easiest way--and the usual way--of getting a data frame into the R
workspace is to read it in from a file. We will do that below (a few sections
from now).
Sometimes it becomes necessary to create one at the console, however. Here are
the steps involved:
- Type each variable into a vector.
- Use the data.frame() function to
create a data frame from the vectors.
You may remember these data from the "Objects" tutorial.
name age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
Let's make a data frame of this. The method used here is a somewhat unforgiving
method of entering data. I will make an intentional mistake and show you how to
correct it below. However, the values for each of the variables have to remain
aligned. I.e., Bob is age 21, 70 in. tall, weighs 180 lbs., etc. If you get the
data values out of order in any given vector, or if you leave one out, for now
all I can say is, "Start again!"
> ls()
[1] "anorexia"
> name = scan(what="character")
1: Bob Fred Barb Sue Jeff # Remember: press Enter twice to end data entry.
6:
Read 5 items
> age = scan()
1: 21 18 18 24 20
6:
Read 5 items
> hgt = scan()
1: 70 67 64 66 72
6:
Read 5 items
> wgt = scan() # I am making a mistake intentionally here.
1: 180 156 128 1118 202
6:
Read 5 items
> race = scan(what="character") # One value here is being recorded as missing, NA.
1: Cauc Af.Am NA Cauc Asian
6:
Read 5 items
> year = scan(what="character")
1: Jr Fr Fr Sr So
6:
Read 5 items
> SAT = scan() # One value here is being recorded as missing, NA.
1: 1080 1210 840 NA 880
6:
Read 5 items
> my.data = data.frame(name, age, hgt, wgt, race, year, SAT)
> my.data
name age hgt wgt race year SAT
1 Bob 21 70 180 Cauc Jr 1080
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 <NA> Fr 840
4 Sue 24 66 1118 Cauc Sr NA
5 Jeff 20 72 202 Asian So 880
Tah dah! It's as "simple" as that. You wouldn't want to have to do that with a
large data set, however, and that's why we'll learn how to read them in from a
file. DON'T clean up your workspace. We will carry this
example over into the next section.
> ls() # Messy! But leave it that way for now.
[1] "age" "anorexia" "hgt" "my.data" "name" "race"
[7] "SAT" "wgt" "year"
Accessing Information Inside a Data Frame (essential)
First, let's look at a few functions that allow us to get general
information about a data frame...
> dim(my.data) # Get size in rows by columns.
[1] 5 7
> names(my.data) # Get the names of variables in the data frame.
[1] "name" "age" "hgt" "wgt" "race" "year" "SAT"
> str(my.data) # See the internal structure of the data frame.
'data.frame': 5 obs. of 7 variables:
$ name: Factor w/ 5 levels "Barb","Bob","Fred",..: 2 3 1 5 4
$ age : num 21 18 18 24 20
$ hgt : num 70 67 64 66 72
$ wgt : num 180 156 128 1118 202
$ race: Factor w/ 3 levels "Af.Am","Asian",..: 3 1 NA 3 2
$ year: Factor w/ 4 levels "Fr","Jr","So",..: 2 1 1 4 3
$ SAT : num 1080 1210 840 NA 880
These are self-explanatory, with the exception of
str(). First, notice that our character
variables were entered into the data frame as factors. This is standard in R,
but it may not be what you want. Second, notice on the lines giving info about
factors that there are strange numbers at the ends of those lines. You don't
have to worry about these. What R is telling you is that factors are coded
internally in R as numbers. R will keep it all straight for you, so don't
sweat the details.
The summary() function is VERY useful here.
> summary(my.data)
name age hgt wgt race year
Barb:1 Min. :18.0 Min. :64.0 Min. : 128.0 Af.Am:1 Fr:2
Bob :1 1st Qu.:18.0 1st Qu.:66.0 1st Qu.: 156.0 Asian:1 Jr:1
Fred:1 Median :20.0 Median :67.0 Median : 180.0 Cauc :2 So:1
Jeff:1 Mean :20.2 Mean :67.8 Mean : 356.8 NA's :1 Sr:1
Sue :1 3rd Qu.:21.0 3rd Qu.:70.0 3rd Qu.: 202.0
Max. :24.0 Max. :72.0 Max. :1118.0
SAT
Min. : 840
1st Qu.: 870
Median : 980
Mean :1002
3rd Qu.:1112
Max. :1210
NA's :1
Let's take a look. There is a variable called "name", which R is summarizing
as a factor. We probably don't really want that, because it's not a grouping
variable, but for now no harm no foul. There is a variable called "age",
which is numeric, a variable called "hgt", which is numeric, and a variable
called "wgt", which is numeric. Do you see any problems here?
The "age" and "hgt" variables look entirely reasonable as far as min and
max values are concerned, but wgt does not. Maximum wgt is 1118 lbs. Really?
Something clearly went wrong here, and we are going to have to track it down
and fix it!
The variables "race" and "year" are factors or categorical variables. See
any problems? Yes, there is a missing value (NA) in "race" that didn't occur
in the original data table. Something else we're going to have to fix.
Finally, "SAT", also numeric, has a missing value that we are going to have
to track down. This is the advantage of using summary().
It shows which variables have values missing, and you can look at the
range of the numeric variables and see if there is anything suspicious, like a
body weight of 1118 lbs.
There are four ways to get at the data inside a data frame, and this is NOT
one of them.
> SAT
[1] 1080 1210 840 NA 880
That only seemed to work, because remember when you created the data frame, you
started by putting a vector called "SAT" into the workspace. THAT'S what you're
seeing now! You are NOT seeing the SAT variable from inside the data frame.
R looks in your workspace FIRST, so that is the "SAT" that it came up with.
Confusing, right? So that we don't remain confused, let's erase all those
vectors EXCEPT "age", which we will keep to illustrate
something that you will need to remember about R.
> ls()
[1] "age" "anorexia" "hgt" "my.data" "name" "race"
[7] "SAT" "wgt" "year"
> rm(hgt, name, race, SAT, wgt, year) ##### DON'T erase my.data!
> ls()
[1] "age" "anorexia" "my.data"
Now if we try to see SAT as we did above...
> SAT
Error: object 'SAT' not found
...we get an error. R will not look inside data frames for variables unless
you tell it to. Here are the four ways to do that.
- by using $
- by using with()
- by using data=
- by using attach()
A data frame is a list of column vectors. We can extract items from inside
it by using the usual list indexing device, $. To do this, type the name of the
data frame, a dollar sign, and the name of the variable you want to work
with. You can leave spaces around the $ if you want to. Or not.
> my.data $ SAT
[1] 1080 1210 840 NA 880
> my.data$SAT
[1] 1080 1210 840 NA 880
This can certainly be a nuisance, because it will mean that in some commands
you have to type the data frame name multiple times. An example is the command
that calculates a correlation.
> cor(my.data$hgt, my.data$wgt)
[1] -0.2531835
In this case, you can use the with() function
to tell R where to get the data.
> with(my.data, cor(hgt, wgt)) # syntax: with(dataframe.name, function(arguments))
[1] -0.2531835
It doesn't save much typing in this example, but there are cases where that will
save a LOT of typing! Notice the syntax of this function. You type the name of
the data frame first, followed by a comma, followed by the function you want
to execute, then you close the parentheses on with().
As we will learn later, some functions, especially significance tests, take
what's called a formula interface. When that's the case, there is (almost) always
a data= option to specify the name of the data frame where the variables are to be
found. I'll just show you an example for now. We'll have plenty of time to
examine the formula interface later. For now, all you need to be aware of is the
tilde, which is always present in a formula. In this case, the formula starts
with the tilde (the squiggly line), which is unusual.
> cor.test( ~ hgt + wgt, data=my.data) #syntax: function(formula, data=dataframe.name)
Pearson's product-moment correlation
data: hgt and wgt
t = -0.4533, df = 3, p-value = 0.6811
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9281289 0.8100218
sample estimates:
cor
-0.2531835
Finally, there is the dreaded attach()
function. This attaches the data frame to your search path (in position 2) so
that R will know to look there for data objects that are referenced by name.
Some people use this device routinely when working with data frames, but it
can cause problems, and we are about to see one.
> attach(my.data)
The following object(s) are masked _by_ .GlobalEnv :
age
Say what? When an object is masked (or shadowed) by the global environment,
that means there is a data object in the workspace that has this name AND
there is a variable inside the data frame that has this name. I can now ask
directly for any variable inside the data frame EXCEPT age.
> SAT # same as my.data$SAT (well, almost)
[1] 1080 1210 840 NA 880
> mean(wgt) # same as mean(my.data$wgt)
[1] 356.8
> table(year) # same as table(my.data$SAT)
year
Fr Jr So Sr
2 1 1 1
> age
[1] 21 18 18 24 20
You might think you are seeing my.data$age here, but YOU ARE NOT! You're seeing
"age" from the workspace, BECAUSE THAT'S WHERE R LOOKS FIRST. In this case,
both copies of "age" are the same, but that won't always be true.
> age = 112 # modifies the first copy it finds
> age
[1] 112
> my.data$age
[1] 21 18 18 24 20
The assignment changed the value of "age" in the workspace, because that is
the first "age" R saw, but did not change the value of "age" in the data frame.
If we remove age from the workspace, R will then search inside the data frame
for it, because the data frame is attached in position 2 of the search path.
> rm(age)
> age
[1] 21 18 18 24 20
The lesson is, when you get one of these masking (or shadowing) conflicts,
WATCH OUT! Be extra careful to know which version of the variable you're
working with. This has tripped up many an R user, including me. This is why
you want to keep your workspace as clean as possible. The best strategy here
is to remove the masking variable from the workspace. If you want to keep it,
at least rename it and then remove the conflicting version from the workspace.
You'll eventually be sorry if you don't!
One more lesson...
> detach(my.data)
When you're done with an attached data frame, ALWAYS detach it. This will
remove it from the search path so that R will no longer look inside it for
variables. You'll have to go back to using $ to reference variables inside the
data frame after it is detached. This isn't necessary if you're going to quit
your R session right away. Quitting detaches everything that was attached. But
if you're going to continue working, detach data frames you no longer need.
Otherwise, your search path will get messy, and you'll get more and more
masking conflicts as other objects are attached.
DON'T erase my.data. We still need it.
Data Frame Indexing and Row Names (critical)
This will cost you BIGTIME eventually if you don't pay close attention!
This drove me nuts for awhile until I figured out what was happening.
> ls() # Still there?
[1] "anorexia" "my.data"
> my.data
name age hgt wgt race year SAT
1 Bob 21 70 180 Cauc Jr 1080
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 <NA> Fr 840
4 Sue 24 66 1118 Cauc Sr NA
5 Jeff 20 72 202 Asian So 880
Let's talk about those line numbers at the leftmost verge of the printed data
frame. THEY ARE NOT NUMBERS. Let me repeat that. THEY ARE NOT NUMBERS. They are
row names. They are character values. So the rows and columns of this data
frame are NAMED as follows:
> dimnames(my.data)
[[1]]
[1] "1" "2" "3" "4" "5"
[[2]]
[1] "name" "age" "hgt" "wgt" "race" "year" "SAT"
What's the big deal?
Look at the printed data frame. Suppose we wanted to extract Barb's weight.
That's the value in row 3 and column 4, so we could get it this way.
> my.data[3,4] # Remember to use square brackets for indexing.
[1] 128
"Yeah, so?" We could also get it this way...
> my.data[3,"wgt"]
[1] 128
...and this way...
> my.data["3","wgt"]
[1] 128
Those last two ways seem to be the same, BUT THEY ARE NOT!!!
Let's sort the data frame using the age variable. Sorting a data frame is
done using the order() function. Remember how
it worked when we sorted a vector? If a call to the
order() function is put in place of the row
index, the data frame will be sorted on whatever variable is specified inside
that function. You will have to use the full name of the variable; i.e., you
will have to use the $ notation. (Why?) Otherwise, R will be looking in your
workspace for a variable called "age", not finding it, and giving a "not
found" error. It happens to me a lot, so you might as well just get used to it!
> my.data[order(my.data$age),]
name age hgt wgt race year SAT
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 Af.Am Fr 840
5 Jeff 20 72 202 Asian So 880
1 Bob 21 70 180 Cauc Jr 1080
4 Sue 24 66 1118 Cauc Sr 1340
Observe the row names! They have also been sorted, haven't they? Let's save
this into a new data object so we can play with it a bit.
> my.data[order(my.data$age),] -> my.data.sorted # Did you remember up arrow?
> my.data.sorted
name age hgt wgt race year SAT
2 Fred 18 67 156 Af.Am Fr 1210
3 Barb 18 64 128 Af.Am Fr 840
5 Jeff 20 72 202 Asian So 880
1 Bob 21 70 180 Cauc Jr 1080
4 Sue 24 66 1118 Cauc Sr 1340
Now let's try to extract Barb's weight from this new data frame.
> my.data.sorted[3,4] ### Wrong!
[1] 202
> my.data.sorted[3,"wgt"] ### Also wrong!
[1] 202
> my.data.sorted["3","wgt"] ### Correct!
[1] 128
> my.data.sorted[2,4] ### Also correct!
[1] 128
Confused yet?
Here's what you have to remember. Those numbers that often print out on the
left side of a data frame ARE NOT NUMBERS. They're row names--character values.
So data frames
have both row and column names, whether you like it or not! The point becomes
clearer when we give the rows actual names. Let's erase the names from my.data
and then re-enter them as row names.
> rm(my.data.sorted) # Get rid of that first.
> my.data$name = NULL # This is how you erase a variable from a data frame.
> my.data
age hgt wgt race year SAT
1 21 70 180 Cauc Jr 1080
2 18 67 156 Af.Am Fr 1210
3 18 64 128 <NA> Fr 840
4 24 66 1118 Cauc Sr NA
5 20 72 202 Asian So 880
> rownames(my.data) = c("Bob","Fred","Barb","Sue","Jeff")
> my.data
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 <NA> Fr 840
Sue 24 66 1118 Cauc Sr NA
Jeff 20 72 202 Asian So 880
> my.data["Barb", "wgt"] # Makes getting Barb's weight a lot easier!
[1] 128
Notice the numbers are gone now because we have actual row names. And OF COURSE
they sort with the rest of the data frame, just as the "number" row names did
above.
> my.data[order(my.data$age),]
age hgt wgt race year SAT
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 <NA> Fr 840
Jeff 20 72 202 Asian So 880
Bob 21 70 180 Cauc Jr 1080
Sue 24 66 1118 Cauc Sr NA
It would be absolutely silly if they didn't! Just remember: Data frames ALWAYS
have row names. Sometimes those row names just happen to look like numbers.
It's the row names that print out to your console when you ask to see the
data frame, or any part of it, and NOT the index numbers. (R Studio shows you
both when you ask to View a data frame.)
(NOTE: All row names have to be unique. You can't have two Barbs, for
obvious reasons.)
Don't remove my.data yet. We still need it.
Modifying a Data Frame (pretty important)
Rule number one with a bullet:
- NEVER MODIFY AN ATTACHED DATA FRAME!
While this isn't strictly against the law, it's a bad idea and can get very
confusing as to exactly what it is you've modified. I could try to explain it,
but I'm not sure I understand it myself. So just don't do it! (An attached
data frame is a copy of the data frame in the workspace, not the actual data
frame in the workspace. Modifications will be made to the actual data frame in
the workspace, but not to the attached copy.)
The time will come when you want to change a data frame in some way. Here are
some examples of how to do that. You may have noticed that Sue (in the my.data
data frame) is a wee bit on the chunky side. This was an innocent mistake. I
really didn't do that on purpose. How do we fix it? The value was supposed to
be 118, but let's change it to 135 just for kicks.
> ls() # Still there?
[1] "my.data"
> my.data
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 <NA> Fr 840
Sue 24 66 1118 Cauc Sr NA
Jeff 20 72 202 Asian So 880
> my.data["Sue", "wgt"] = 135
> my.data
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 <NA> Fr 840
Sue 24 66 135 Cauc Sr NA
Jeff 20 72 202 Asian So 880
That's all there is to it. Use any kind of indexing you like. Let's use
numerical indexing to give Sue her correct weight, and while we're at it,
let's fix those missing values, too.
> my.data[4,3] = 118
> my.data[3, "race"] = "Af.Am"
> my.data["Sue", 6] = 1340
> my.data
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
Just remember that "wgt" is now in column 3, since the row names don't count
as a column.
Now, I have a confession to make. I neglected to detach my.data before I
made those changes. Here are the consequences.
> SAT # sees the attached copy
[1] 1080 1210 840 NA 880
> my.data$SAT # sees the copy in the workspace
[1] 1080 1210 840 1340 880
> wgt
[1] 180 156 128 1118 202
> race
[1] Cauc Af.Am <NA> Cauc Asian
Levels: Af.Am Asian Cauc
Ack! The attached copy has not been changed. But the copy in the workspace
has been changed. Here's the fix. (Won't work if you weren't as stupid as I
was and didn't have my.data attached while you were making those modifications.)
> detach(my.data) # will just give an error if not attached
> attach(my.data) # previous attached copy tossed; new attached copy created
> SAT
[1] 1080 1210 840 1340 880
> detach(my.data)
I have to warn you about modifying data frames. It's always a good idea to
make a backup copy in the workspace first. Because there are some commands that
modify data frames that, if they go wrong, can really screw things up! But
let's live dangerously. Suppose we wanted "wgt" to be in kilograms instead of
pounds. Easy enough...
> my.data$wgt / 2.2
[1] 81.81818 70.90909 58.18182 53.63636 91.81818
> my.data # Nothing has changed yet. Why not?
age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
> my.data$wgt / 2.2 -> my.data$wgt # Aha! It has to be stored back into my.data.
> my.data
age hgt wgt race year SAT
Bob 21 70 81.81818 Cauc Jr 1080
Fred 18 67 70.90909 Af.Am Fr 1210
Barb 18 64 58.18182 Af.Am Fr 840
Sue 24 66 53.63636 Cauc Sr 1340
Jeff 20 72 91.81818 Asian So 880
> my.data$wgt = round(my.data$wgt, 1) # A little rounding for good measure.
> my.data
age hgt wgt race year SAT
Bob 21 70 81.8 Cauc Jr 1080
Fred 18 67 70.9 Af.Am Fr 1210
Barb 18 64 58.2 Af.Am Fr 840
Sue 24 66 53.6 Cauc Sr 1340
Jeff 20 72 91.8 Asian So 880
Now that we've rounded them off, we've lost the original weight data in
pounds.
> my.data$wgt * 2.2
[1] 179.96 155.98 128.04 117.92 201.96
We could have avoided this by making a backup copy of my.data first, or by
putting the new weight in kilograms into a new column in the data frame.
Let's see how to create a new column in the data frame.
> my.data$IQ = c(115, 122, 100, 144, 96)
> my.data
age hgt wgt race year SAT IQ
Bob 21 70 81.8 Cauc Jr 1080 115
Fred 18 67 70.9 Af.Am Fr 1210 122
Barb 18 64 58.2 Af.Am Fr 840 100
Sue 24 66 53.6 Cauc Sr 1340 144
Jeff 20 72 91.8 Asian So 880 96
Just name it and assign values to the name in a vector. The new vector has to be
the same length as the other variables already in the data frame.
> ls()
[1] "anorexia" "my.data"
Keep all of that. We're going to be referring to my.data in the next
tutorial.
Missing Values (kinda important, so listen up!)
Do this.
> data(Cars93, package="MASS") # Get data from MASS without attaching MASS first.
> str(Cars93) # Lots of output not shown!
This is a data frame with 93 observations on 27 variables. You can see what the
variables represent by looking at the help page for this data set: help(Cars93, package="MASS"). We're interested in the
variable "Luggage.room"
in particular, which is the trunk space in cubic feet, to the nearest cubic
foot.
> attach(Cars93)
> summary(Luggage.room)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
6.00 12.00 14.00 13.89 15.00 22.00 11.00
This is a numeric variable, so we get the summary we are accustomed to by now.
But what are those NAs? Whether we like it or not, data sets often have missing
values, and we need to know how to deal with them. R's standard code for missing
values is "NA", for "not available". The number associated with NA is a
frequency. There are 11 cases in this data frame in which "Luggage.room" is a
missing value. If you looked at the help page, you know why.
Some functions fail to work when there are missing values, but this can
(almost always) be fixed with a simple option.
> mean(Luggage.room)
[1] NA
> mean(Luggage.room, na.rm=TRUE)
[1] 13.89024
> mean(Luggage.room, na.rm=T)
[1] 13.89024
There is no mean when some of the values are missing, so the "na.rm" option
removes them when set to TRUE (must be all caps, but the shorter form T also
works provided you haven't assigned another value to it). If you want to clean
the data set by removing casewise all cases with missing values on any
variable (not always a good idea!), use the na.omit()
function.
> na.omit(Cars93) # Output not shown.
I will not reproduce the output here because it is extensive, but it is also
instructive, so take a look at it. Scroll the console window backwards to see
all of it. Of course, to use this cleaned data frame, you would have to
assign it to a new data object.
The which() function does not work to
identify which of the values are missing. Use is.na( ) instead.
> which(Luggage.room == NA)
integer(0)
> is.na(Luggage.room)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
[23] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[56] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[67] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[89] TRUE FALSE FALSE FALSE FALSE
> which(is.na(Luggage.room))
[1] 16 17 19 26 36 56 57 66 70 87 89
Finally, some data sets come with other codes for missing values. 999 is a
common missing value code, as are blank spaces. Blanks are a very bad idea. If
you find a data set with blanks in it, it may have to be edited in a text
editor or spreadsheet before the file can be read into R. It depends on how the
file is formatted. In some cases, R will automatically assign NA to blank
values, but in other cases it will not. Other missing value codes are not a
problem, as they can be recoded.
> ifelse(is.na(Luggage.room), 999, Luggage.room) -> temp
> temp
[1] 11 15 14 17 13 16 17 21 14 18 14 13 14 13 16 999 999
[18] 20 999 15 14 17 11 13 14 999 16 11 11 15 12 12 13 12
[35] 18 999 18 21 10 11 8 12 14 11 12 9 14 15 14 9 19
[52] 22 16 13 14 999 999 12 15 6 15 11 14 12 14 999 14 14
[69] 16 999 17 8 17 13 13 16 18 14 12 10 15 14 10 11 13
[86] 15 999 10 999 14 15 14 15
> # first we'll mess it up
> # and then we'll fix it
> ifelse(temp == 999, NA, temp) -> fixed
> fixed
[1] 11 15 14 17 13 16 17 21 14 18 14 13 14 13 16 NA NA 20 NA 15 14 17 11
[24] 13 14 NA 16 11 11 15 12 12 13 12 18 NA 18 21 10 11 8 12 14 11 12 9
[47] 14 15 14 9 19 22 16 13 14 NA NA 12 15 6 15 11 14 12 14 NA 14 14 16
[70] NA 17 8 17 13 13 16 18 14 12 10 15 14 10 11 13 15 NA 10 NA 14 15 14
[93] 15
The ifelse() function is very handy for
recoding a data vector, so let me take a moment to explain it. Inside the
parentheses, the first thing you give is a test. In the second of these
commands above, where we are going from the messed up variable back to "fixed",
the test was "if any value of temp is equal to 999". Notice the double equals
sign meaning "equal". (I still get this wrong a lot!) The second thing you give
is how to recode those values, and finally you tell what to do with the values
that don't pass the test. So the whole command reads like this: "If any value
of temp is equal to 999, assign it the value NA, else assign it the value that
is currently in temp."
In the first instance of the function, we had to use is.na, since nothing can
really be "equal to" something that is not available! Try these, and say them
in words as you're typing them.
> ifelse(fixed == 10, 0, 100) # Output not shown.
> ifelse(fixed > 10, 100, 0) # Output not shown.
> ifelse(fixed > 10, "big", "small") # Output not shown.
If you stored that last one, it would create a character vector.
Don't forget to clean up your workspace and search path!
> rm(Cars93) # and anything else other than anorexia and my.data
> detach(Cars93) # removing it does not detach it!
Inline Data Entry In R (optional)
(NOTE: This may or may not work. I've just tested it in R 3.1.2 on both a
Mac running Snow Leopard and a Windows XP machine, and it worked in both cases.
Some of my students claim to have problems with it, especially in R Studio.
I've been unable to duplicate those problems.)
Those of you who are old enough to have used SPSS in a version where you
had to type commands into a batch file for execution may remember inline
data entry. You typed BEGIN DATA (as I recall), typed your data into a table-like
format, and then typed END DATA. Is there anything like that in R? Sort of.
Open a script window: File > New Script in Windows, or File > New Document on
a Mac. In that script window, type exactly this. Include a blank line at the end.
You can create white space by either tabbing or spacing on a Mac, but in Windows
you must create white space by spacing with the spacebar. (The help page
suggests otherwise, but I have been unable to get the Windows version to
recognize tabs as white space.) You can edit freely as you are typing.
new.dataframe = read.table(header=T, text="
name age hgt wgt race year SAT
Bob 21 70 180 Cauc Jr 1080
Fred 18 67 156 Af.Am Fr 1210
Barb 18 64 128 Af.Am Fr 840
Sue 24 66 118 Cauc Sr 1340
Jeff 20 72 202 Asian So 880
")
Then, in Windows, go to the Edit menu, and choose Run all. On a Mac,
highlight the whole thing in the script window with your mouse, go to the Edit
menu, and choose Execute. That should put a data frame in your workspace called
new.dataframe. Check it to make sure it's sound. (NOTE: In Linux, R scripts are
created in a text editor such as vim or gedit, saved, and then read into R by
using the source() function at the Console command
prompt.)
On my Mac, the script looked like this. (In Windows the script window is
much plainer, and the lines are not numbered.)
You can save the contents of this window as an R script, which you can always
reopen and modify if necessary.
Further hint: I just got those data into R by copying and pasting the
command and data above directly into the R Console. So that worked. It led
me to wonder if I could copy and paste an HTML table-formatted object from a
web page. I just tried it, and it caused R to crash, so I can't recommend it!
(That's only the second time in 12 years that R has crashed on me.)
However, here's what did work. I copied the contents of the table on the
web page and pasted it into a text editor (I used TextWrangler on a Mac). Then
I added the necessary R commands in the text editor and copied and pasted it
into the R Console. Sure beats typing in that data myself!
Subsetting a Data Frame (optional)
We will use a data frame called USArrests for this exercise.
> data(USArrests)
> head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Here is another useful function for looking at a data frame. The head() function shows the first six lines of data
(cases) inside a data frame. There is also a tail() function that shows the last six lines, and
the number of lines shown can be changed with an option (see the help
pages).
In this case we have a data frame with row names set to state names and
containing variables that give the crime rates (per 100,000 population) for
Murder, Assault, and Rape, as well as the percentage of the population that
lives in urban areas. These data are from 1973 so are not current.
Because state names are used as row names, to see the data for any state,
all we have to do is be able to spell the name of the state.
> USArrests["Pennsylvania",] # No column index, so all columns displayed.
Murder Assault UrbanPop Rape
Pennsylvania 6.3 106 72 14.9
We do not have to figure out what the index number would be for that row.
Thus, explicit row names can be very handy. To display the entire row of data
for PA, we just leave out the column index, but THE COMMA STILL HAS TO BE
THERE! Otherwise, you are trying to index a two-dimensional data object using
only one index, and R will tell you to knock it off!
Let's answer the following questions from these data.
- Which state has the lowest murder rate?
- Which states have murder rates less than 4.0?
- Which states are in the top quartile for urban population?
> min(USArrests$Murder) # What is the minimum murder rate?
[1] 0.8
> which(USArrests$Murder == 0.8) # Which line of the data is that?
[1] 34
> USArrests[34,] # Give me the data from that line.
Murder Assault UrbanPop Rape
North Dakota 0.8 45 44 7.3
> USArrests[USArrests$Murder==min(USArrests$Murder),] # All at once (showing off).
Murder Assault UrbanPop Rape
North Dakota 0.8 45 44 7.3
>
> which(USArrests$Murder < 4.0) # Gives the result in a vector.
[1] 7 12 15 19 23 29 34 39 41 44 45 49
> USArrests[which(USArrests$Murder < 4.0),] # Use that vector as an index.
Murder Assault UrbanPop Rape
Connecticut 3.3 110 77 11.1
Idaho 2.6 120 54 14.2
Iowa 2.2 56 57 11.3
Maine 2.1 83 51 7.8
Minnesota 2.7 72 66 14.9
New Hampshire 2.1 57 56 9.5
North Dakota 0.8 45 44 7.3
Rhode Island 3.4 174 87 8.3
South Dakota 3.8 86 45 12.8
Utah 3.2 120 80 22.9
Vermont 2.2 48 32 11.2
Wisconsin 2.6 53 66 10.8
>
> summary(USArrests$UrbanPop)
Min. 1st Qu. Median Mean 3rd Qu. Max.
32.00 54.50 66.00 65.54 77.75 91.00
> USArrests[which(USArrests$UrbanPop >= 77.75),]
Murder Assault UrbanPop Rape
Arizona 8.1 294 80 31.0
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Florida 15.4 335 80 31.9
Hawaii 5.3 46 83 20.2
Illinois 10.4 249 83 24.0
Massachusetts 4.4 149 85 16.3
Nevada 12.2 252 81 46.0
New Jersey 7.4 159 89 18.8
New York 11.1 254 86 26.1
Rhode Island 3.4 174 87 8.3
Texas 12.7 201 80 25.5
Utah 3.2 120 80 22.9
Suppose we wanted to work with data only from these states. How can we
extract them from the data frame and make a new data frame that contains only
those states? I'm glad you asked.
> subset(USArrests, subset=(UrbanPop >= 77.75)) -> high.urban
> high.urban
Murder Assault UrbanPop Rape
Arizona 8.1 294 80 31.0
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Florida 15.4 335 80 31.9
Hawaii 5.3 46 83 20.2
Illinois 10.4 249 83 24.0
Massachusetts 4.4 149 85 16.3
Nevada 12.2 252 81 46.0
New Jersey 7.4 159 89 18.8
New York 11.1 254 86 26.1
Rhode Island 3.4 174 87 8.3
Texas 12.7 201 80 25.5
Utah 3.2 120 80 22.9
The subset() function does the trick. The
syntax is a little squirrelly, so let me go through it. The first thing you
give is the name of the data frame. That is followed by the subset= option.
Then inside of parentheses (which actually aren't necessary) give the test
that defines the subset. Store the output into a new data object so that you
can then work with it. Functions that take a data= option can also take a
subset option, so it's a useful thing to know.
Of course you can also use subset()
without an assignment if you just want to display the results. This
eliminates the need to do the fancy indexing tricks above. Or you can use the
fancy indexing tricks with an assignment to get the same result stored in a
new data object. Whatever paddles your canoe. In R there are generally multiple
ways to accomplish things, and this is a good example.
You can clean up your workspace now, but KEEP anorexia and my.data.
Stacking and Unstacking (optional)
Suppose someone has retained your services as a data analyst and gives you
his data (from an Excel file or something) in this format.
contr treat1 treat2
---------------------------
22 32 30
18 35 28
25 30 25
25 42 22
20 31 33
---------------------------
If you're working for free, you can yell at him and make him do it the right
way, but if you're being paid, you probably shouldn't. Here's how to
deal with it. First, let's get these data into a "data frame" in this format,
and I will leave out the command prompts so that you can just copy and paste
these three lines directly into R.
### start copying here
wrong.data = data.frame(contr = c(22,18,25,25,20),
treat1 = c(32,35,30,42,31),
treat2 = c(30,28,25,22,33))
### stop copying here and paste into R
> wrong.data
contr treat1 treat2
1 22 32 30
2 18 35 28
3 25 30 25
4 25 42 22
5 20 31 33
Now do this.
> stack(wrong.data) -> correct.data
> correct.data
values ind
1 22 contr
2 18 contr
3 25 contr
4 25 contr
5 20 contr
6 32 treat1
7 35 treat1
8 30 treat1
9 42 treat1
10 31 treat1
11 30 treat2
12 28 treat2
13 25 treat2
14 22 treat2
15 33 treat2
> colnames(correct.data) = c("scores","groups")
> head(correct.data)
scores groups
1 22 contr
2 18 contr
3 25 contr
4 25 contr
5 20 contr
6 32 treat1
And there you go. Now you have a proper data frame.
There is also an unstack() function that
does the reverse of this, and it will work automatically on a data frame that
has been created by stack(), but otherwise is
a little trickier to use. You probably won't have to use it much, so I'll
refer you to the help page if you ever need it (and good luck to you!).
You can remove these data objects. We won't use them again.
Going From Wide to Long and Long to Wide (eventually you'll probably need to know this)
I mentioned this above under "An Ambiguous Case." There are two kinds of data
frames in R, and in most statistical software: wide ones and long ones. If you
deleted the "anorexia" data frame from your workspace, it's easy enough to get
back. Here's how to fetch
the "anorexia" data again (and we'll do it without attaching the MASS
package this time).
> data(anorexia, package="MASS")
What we are about to do is a little confusing until you get some experience
with it, so it will be necessary to be able to see what's happening. The
anorexia data frame is too long to print to a single console screen without
causing it to scroll, so I'm going to cut it down to only nine cases, three
from each group. This will help us to see the difference between wide and long
data frames without constantly scrolling the console window.
> anor = anorexia[c(1,2,3,27,28,29,56,57,58),]
> anor
Treat Prewt Postwt
1 Cont 80.7 80.2
2 Cont 89.4 80.1
3 Cont 91.8 86.4
27 CBT 80.5 82.2
28 CBT 84.9 85.6
29 CBT 81.5 81.4
56 FT 83.8 95.2
57 FT 83.3 94.3
58 FT 86.0 91.5
I also shortened up the name of our data frame, because we're going to be
typing it a lot.
This is a wide data frame (wide format). It's wide because each line of
the data frame
contains information on ONE SUBJECT, even though that subject was measured
multiple times (twice) on weight (Prewt, Postwt). So all the data for each
subject goes on ONE LINE, even though we could interpret this as a repeated
measures design, or longitudinal data.
In a long format data frame, each value of weight (if we consider that as a
single variable) would define a case. So each of
these subjects would have two lines in such a data frame, one for the subject's
Prewt, and one for her Postwt. A wide data frame would be used, for example, in
analysis of covariance. A long data frame would be used in repeated measures
analysis of variance. Do we have to retype the data frame to get from wide to
long? Fortunately not! Because R has a function called reshape() that will do the work for us.
It is not an easy function to understand, however (and don't count on the
help page being a whole lot of help!). So let me illustrate it, and then I will
explain what's happening.
> reshape(data=anor, direction="long",
+ varying=c("Prewt","Postwt"), v.names="Weight",
+ idvar="subject", ids=row.names(anor),
+ timevar="PrePost", times=c("Prewt","Postwt")
+ ) -> anor.long
> anor.long
Treat PrePost Weight subject
1.Prewt Cont Prewt 80.7 1
2.Prewt Cont Prewt 89.4 2
3.Prewt Cont Prewt 91.8 3
27.Prewt CBT Prewt 80.5 27
28.Prewt CBT Prewt 84.9 28
29.Prewt CBT Prewt 81.5 29
56.Prewt FT Prewt 83.8 56
57.Prewt FT Prewt 83.3 57
58.Prewt FT Prewt 86.0 58
1.Postwt Cont Postwt 80.2 1
2.Postwt Cont Postwt 80.1 2
3.Postwt Cont Postwt 86.4 3
27.Postwt CBT Postwt 82.2 27
28.Postwt CBT Postwt 85.6 28
29.Postwt CBT Postwt 81.4 29
56.Postwt FT Postwt 95.2 56
57.Postwt FT Postwt 94.3 57
58.Postwt FT Postwt 91.5 58
In this example, the first argument I gave to the
reshape() function was the name of the data frame to be reshaped,
and that was given in the data= argument. Then I specified the direction=
argument as "long" so that the data frame would be converted TO a long
format.
In the second line of this command, I specified varying= as a vector of
variable names in anor that correspond to the repeated measures or longitudinal
measures (the time-varying variables). These values will be given in one column
in the new data frame, so I named that new column using the v.names= argument.
A long data frame needs two things that a wide one does not have. One of
those things is a column identifying the subject (case or experimental unit)
from which the data in a row of the data frame come from. This is necessary
because each subject will have multiple rows of data in a long data frame. So
I used the idvar= argument to specify the name of this new column that would
identify the subjects. I then used ids= to specify how the subjects were to
be named. I told it to use the row names from anor, which is a sensible thing
to do.
The other thing a long format data frame needs that a wide one does not is
a variable giving the condition (or time) in which the subject is being
measured for this particular row of data. In the wide format, this information
is in the column (variable) names, but that will no longer be true in the long
format. We need to know which measure is Prewt and which measure is Postwt for
each subject, since these will be on different rows of the data frame in long
format. I named this new variable using the timevar= argument, and I gave its
possible values in a vector using the times= argument. The order in which those
values should be listed is the same as the order in which the corresponding
columns occur in the wide data frame.
Finally, I closed the parentheses on the
reshape() function and assigned the output to a new data object.
Done! Whew!
This can also be made to work if you have more than one repeated measures
variable, in which case all I can say is may the saints be with you! Surely
there must be an easier syntax for this!!
If the data frame results from a
reshape() command, then it can be converted back very simply.
All you have to do is this.
> reshape(anor.long)
Treat subject Prewt Postwt
1.Prewt Cont 1 80.7 80.2
2.Prewt Cont 2 89.4 80.1
3.Prewt Cont 3 91.8 86.4
27.Prewt CBT 27 80.5 82.2
28.Prewt CBT 28 84.9 85.6
29.Prewt CBT 29 81.5 81.4
56.Prewt FT 56 83.8 95.2
57.Prewt FT 57 83.3 94.3
58.Prewt FT 58 86.0 91.5
The row names have gone a little screwy, but all the correct information is
there. This isn't very useful actually, because we already have the data in
wide format in the data frame anor, which we were smart enough not to
overwrite. So let's see how to convert from long to wide the hard way.
First, we will get rid of those ridiculous row names.
> rownames(anor.long) <- as.character(1:18) # Just do it!
> anor.long
Treat PrePost Weight subject
1 Cont Prewt 80.7 1
2 Cont Prewt 89.4 2
3 Cont Prewt 91.8 3
4 CBT Prewt 80.5 27
5 CBT Prewt 84.9 28
6 CBT Prewt 81.5 29
7 FT Prewt 83.8 56
8 FT Prewt 83.3 57
9 FT Prewt 86.0 58
10 Cont Postwt 80.2 1
11 Cont Postwt 80.1 2
12 Cont Postwt 86.4 3
13 CBT Postwt 82.2 27
14 CBT Postwt 85.6 28
15 CBT Postwt 81.4 29
16 FT Postwt 95.2 56
17 FT Postwt 94.3 57
18 FT Postwt 91.5 58
And now for the reshaping. I won't bother storing it.
> reshape(data=anor.long, direction="wide",
+ v.names=c("Weight"),
+ idvar="subject",
+ timevar="PrePost"
+ )
Treat subject Weight.Prewt Weight.Postwt
1 Cont 1 80.7 80.2
2 Cont 2 89.4 80.1
3 Cont 3 91.8 86.4
4 CBT 27 80.5 82.2
5 CBT 28 84.9 85.6
6 CBT 29 81.5 81.4
7 FT 56 83.8 95.2
8 FT 57 83.3 94.3
9 FT 58 86.0 91.5
We didn't quite recover the original table, but then we probably didn't really
want to. The first two arguments name the data frame we are reshaping and tell
the direction we are reshaping TO. The next argument, v.names=, gives the name of
the time-varying variable that will be split into two (or more) columns. The
idvar= argument gives the name of the variable that is the subject identifier.
Finally, the timevar= argument gives the name of the variable that contains the
conditions under which the longitidinal information was collected; i.e., there
were two weights, a Prewt and a Postwt. Notice these values were used to name
the two new columns of Weight data. Want a pneumonic to help you remember all
that? Yeah, me too!
Clean up. Get rid of everything EXCEPT my.data.
Working With Spreadsheets and CSV Files (optional)
Even if you can get the spreadsheet-like data editing interface in R to
work for you, it's still really no great shakes. Even when I'm in Windows
(where it works), I use a spreadsheet to manage my data files, especially
larger ones. I'm going to type the data in my.data into a spreadsheet. I use
OpenOffice Calc. You can use whatever.
At this point, I can copy and paste any one of those columns into scan(). That's handy, but it's not why I created the
spreadsheet. (Notice that Calc wouldn't let me type SAT into a column header.
It kept insisting it was Sat, abbreviation for Saturday. I HATE software that
thinks it knows what I want! That's why I don't use Excel, but the open source
spreadsheets are getting just as bad. Don't presume you know what I want. JUST
DO WHAT I'M TELLING YOU!!! Gasp! I am so sick and tired of software--and
operating systems--being written for morons! My computer is not a phone, it's
not an iPad, it's a computer. Stop turning it into a toy! And if I want SAT to
be Sat, I'll damn well type it that way! There! That will do
no good whatsoever, but at least I vented.)
Now I'm going to save that as a CSV file. (And once again, my computer will
nag the crap out of me--in Excel especially. "Are you really sure you want to
do this? You're going to lose your formatting." Just do what I tell
you to do and SHUT UP! Anyone who doesn't know what a CSV file is and that it
contains no formatting can use a damn typewriter!) To do that I pulled down the
File menu (you're on your own with that idiot ribbon bar in Excel!), I chose
Save As..., and in the dialog box I entered mydata.csv as the file name, I
specified where to save (Desktop--I'll deal with it from there), and I chose
File type: Text CSV. I chose to edit the filter settings because who knows
how they might have them set? Then I clicked Save. Then it nagged me, and I
clicked Keep Current Format (because there was no choice that said Do What I
Tell You--I'm The Human Here!). In the filter settings I made sure the Field
delimiter was set to a comma (which is what it should always be because, hey,
CSV means comma separated values), and I removed the Text delimiter. Then I
clicked OK. Then I clicked away another warning popup. (See how hard they make
this because every idiot has to be able to use a computer these days?)
Here's what the CSV file actually looks like, and if you don't want to have
to deal with a nagging spreadsheet, you can just type this into a plain text
editor. You can even save it with a .txt extension. R won't care. ("THANK YOU"
to the people who write the R software for not treating me like I'm
feebleminded!)
rownames,age,wgt,race,year,"SAT"
Bob,21,180,Cauc,Jr,1080
Fred,18,156,Af.Am,Fr,1210
Barb,18,128,Af.Am,Fr,840
Sue,24,118,Cauc,Sr,1340
Jeff,20,202,Asian,So,880
Drop this file into your working directory (Rspace), and then read it into R
like this.
> my.newdata = read.csv(file="mydata.csv", row.names="rownames")
> # notice there is no annoying message telling you this has been done!
> my.newdata
age wgt race year SAT
Bob 21 180 Cauc Jr 1080
Fred 18 156 Af.Am Fr 1210
Barb 18 128 Af.Am Fr 840
Sue 24 118 Cauc Sr 1340
Jeff 20 202 Asian So 880
Yay! R even dealt with those annoying quotes around SAT. Since "rownames" was
the first column in the file, you could also have set that option as row.names=1. Now suppose somehow your CSV file gets some
whitespace in it. This could happen due to mistyping in the spreadsheet, or
because you typed it that way intentionally into a text editor. (It would be
easier in that case just to leave out the commas and use read.table). (NOTE:
SPSS data files tend to have variable names and value labels padded with white
space, an idiot programming practice if ever there was one!) Do this.
If the file looks something like this...
rownames, age, wgt, race, year, SAT
Bob, 21, 180, Cauc, Jr, 1080
Fred, 18, 156, Af.Am, Fr, 1210
Barb, 18, 128, Af.Am, Fr, 840
Sue, 24, 118, Cauc, Sr, 1340
Jeff, 20, 202, Asian, So, 880
Do this...
> my.newdata = read.csv(file="mydata.csv", row.names=1, strip.white=T)
I personally think using a spreadsheet for a data file this small would be like
driving in a tack with a sledge hammer, but it's up to you. A spreadsheet comes
in very handy for dealing with large data files, however.
Before you quit today, clean everything out of your workspace EXCEPT my.data,
then save the workspace when you quit.
revised 2016 January 21
| Table of Contents
| Function Reference
| Function Finder
| R Project |
|