R OBJECTS*
* Emphasis on the first syllable. It's a noun, not a verb. Objects
are the things you store in your workspace.
Data
Your data is the information upon which you wish to do a statistical
analysis. By the way, the word "data" is plural, so ordinarily you would not
say "data is" or "data was." Correct are "data are" and "data were." I'm not
the grammar police, but I will object (verb) to errors on that one!
(STUDENTS: Your English teacher may have told you that it is now acceptable to
use "data" in the singular. Not in my class it ain't. It just sounds illiterate
to me. Would you use "phenomena" or "criteria" in the singular? Well, you
shouldn't! Let me explain something to you that may help you should you ever end
up in graduate school. Your graduate adviser may be quite a bit older than you
are, perhaps old enough to be old school, maybe
even an old geezer like me. Listen to the way he or she uses the word "data."
Because here's an important lesson about grad school. However your graduate
adviser does something, that's the correct way to do it!)
Maintaining a data set is one of the most important things a statistician
needs to know how to do. Most statistical software requires that the data set
be in a very specific format, called a data table or, in R, a data frame (one
word or two, take your pick). Data frames will be covered in detail in a
future tutorial.
This is where R truly shines. R is much more flexible in that it does not
require that you use the data frame format for your data. If it is more
convenient to keep your data in a contingency table, or a list, or a matrix,
or a single vector, you can do so. This flexibility has a price--more to
learn. In the end, however, it makes R a much more convenient and flexible
way to analyze data sets, especially simple ones.
In the behavioral and social sciences, the unit of analysis is usually a
subject, human or animal. In the more general case, subjects are called "cases"
or "observations" or "experimental units." I prefer cases. There will
come a time when we have to distinguish between subjects and cases, so you
should not think of these two terms as being exactly equivalent.
Let's say you've collected data from five subjects: Bob, Fred, Barb, Sue,
and Jeff. From each subject you have collected information about age, height,
weight, race, year in school (they are all college students), and SAT score.
Your cases are Bob, Fred, Barb, Sue, and Jeff. Age, height, weight, race,
year in school, and SAT score are called variables. You would ordinarily
put this information into a data frame as follows:
name  age  hgt  wgt  race   year  SAT
Bob    21   70  180  Cauc   Jr    1080
Fred   18   67  156  Af.Am  Fr    1210
Barb   18   64  128  Af.Am  Fr     840
Sue    24   66  118  Cauc   Sr    1340
Jeff   20   72  202  Asian  So     880
Notice that the cases, or subjects, go into rows in this table, and each
variable has its own column. This is the standard form for maintaining a
data table (data frame). It looks a lot like a spreadsheet, and in fact, using
spreadsheet software is a very good way to manage data. (Just don't succumb to
the urge to do any fancy formatting. Headers and data and that's all!) The first
row in this table is called the header. It contains the variable names. Having a
header row is optional but usually a good idea.
I call your attention to the fact that we have two fundamentally
different kinds of variables in this data frame. Some are numbers, like age
and weight. These are called numeric variables. Other variables
contain just the names of categories that the subject falls into. Race is an
example of such a variable, called a categorical variable. It's
absolutely essential that you be able to distinguish these two types of
variables. You can't do statistics otherwise! R will recognize the difference
automatically. You don't need to tell it which is which, UNLESS you've coded
your categories with numbers.
Categorical variables are often
called factors in R. Just to make matters a bit more confusing, examine
the "year" variable. What would you call it, numeric or categorical? If
those were your only choices, you'd have to call it categorical. In fact, in
this variable the categories have a natural order to them: Fr, So, Jr, Sr.
Sometimes such a categorical variable is called an ordered factor
in R. To get R to recognize a factor as ordered, you have to declare it as
such.
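Here is a minimal sketch of what "declaring it" might look like, using the year values from the table above. The object names year and year.ord are invented for this illustration; factor(), levels(), and is.ordered() are standard R functions.

```r
# The "year" column from the table above, typed in as character values.
year = c("Jr", "Fr", "Fr", "Sr", "So")

# Declare it an ordered factor, listing the levels in their natural order.
year.ord = factor(year, levels = c("Fr", "So", "Jr", "Sr"), ordered = TRUE)

levels(year.ord)        # the categories, in the order we declared them
is.ordered(year.ord)    # TRUE
year.ord < "Jr"         # FALSE TRUE TRUE FALSE TRUE: order comparisons now work
```

Without levels=, R would sort the categories alphabetically (Fr, Jr, So, Sr), which is not the order you want here.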
You may be more familiar with the terms nominal, ordinal, interval, and ratio
variables. Nominal variables and categorical variables are roughly the same
thing. Factors are usually nominal. However, ordered factors are ordinal.
Numeric variables are either interval or ratio variables, and it usually
doesn't matter which. One more catch to all this--examine the column labeled
"name" in the table above. Is this a variable? I suppose it is since its value
is different for everyone. Usually when we think of categorical variables or
factors, we are thinking of variables that have relatively few possible values,
variables that define groups (hence also called grouping variables).
The values of such a variable are called levels. The levels of year, for
example, are Fr, So, Jr, Sr. When a variable has a different value for everyone,
like the subject's name or address for example, it's often called a character
variable. You will see R make this distinction, and it's a useful one, so
remember it.
You get data into R by creating data objects, so let's see how that is
done.
Assignment
In R you create things, called "objects", by a process called assignment.
Start an R session and set the working directory to Rspace. Also, clear the
workspace.
> setwd("Rspace") # There is a menu item for this in the GUI, btw.
> rm(list=ls()) # Or use the menus to do this.
If you don't know what this means or have forgotten to create the
Rspace directory, you can find out how in the tutorial called
Preliminaries.
There are three ways to assign data to an object name in R (actually four,
but one is rarely used). Here is one way.
> x = 7
This should not be read as "x equals 7", which will result in confusion
later. Instead, the single equals sign means "takes the value" or "is
assigned the value." R is not usually picky about spacing, so all of
the following are equivalent.
> x=7 # "x is assigned the value 7"
> x = 7 # "x is assigned the value 7" again
> x= 7 # and again
> x   =   7 # and again
> x = # Press Enter here.
+ 7 # Don't type the +. It's already there.
Use spacing to make your typed input look "pretty." Or not. It's (generally) up
to you. There are a few situations where R will get uppity about spacing, but
usually it is not an issue. DON'T, however, be so silly as to put a space in
the middle of the name of something. That would be bad!
Here is another way to do assignment.
> x <- 7
And here is one place where R insists on the correct spacing. The "arrow"
assignment operator is actually two symbols, a less than sign and a dash or
minus (not an underline character no matter what it might look like in your
browser). THERE CANNOT BE A SPACE BETWEEN THEM! Why would anyone want to use two
symbols instead of one if they do the same thing? You'll see!
In the meantime, I find it convenient to leave spaces on each side of this
arrow operator. It has saved me some sorrow! That way you're less likely to make
critical spacing errors when using an arrow. For example, suppose your fingers
get all crossed up, and you type this: x < -7.
Type it and see what happens. Huh? Usually not a problem--just retype it
correctly. But I've learned the hard way that a mistake like that in the wrong
place can have painful consequences! (Not that one so much, but cases where I
meant to type x < -7 and typed
x <- 7 or x<-7 instead. What would that do?)
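If you would rather check your answer than guess, here is a short sketch of the difference. One misplaced space turns an assignment into a comparison, and the reverse.

```r
x = 9

x < -7     # a comparison: "is x less than negative 7?" Prints FALSE; x is untouched.
x          # still 9

x <- 7     # an assignment: x now holds 7, and nothing prints.
x          # 7
```

The second mistake is the painful one, because it silently replaces whatever was stored in x.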
Now look at the object called "x" in your workspace.
> ls() # the "show me" function
[1] "x"
> x # print out the value of x
[1] 7
We will use the third kind of assignment to overwrite this value.
> 9 -> x # arrow always points to the variable name
> x
[1] 9
Three things to note here. First, R is perfectly willing to let you be
stupid and overwrite things you have in your workspace. There is no
warning. If you assign something to an object name that already exists,
the old object is gone! Second, the arrow assignment works from either
direction. The equal sign does not! When using =, you must give the
object name first followed by the value you wish to assign to it.
Third, notice that when you do an assignment, nothing prints to the console.
R creates the data object in your workspace and remains silent. If your
intention is to do assignment, thus creating a data object in your workspace, and
you see the data spilling onto the console after pressing Enter, then chances
are you've forgotten to give your new data object a name (and therefore have
not created a new data object). This is particularly painful when reading in
a file or using a more complex "data-creating" function such as scan(). You can spend quite a long time typing in your data,
press the Enter key, and see it all spill out onto the console. At that point,
it's lost! You have to start over. Be careful! When your intention is to create
a data object in your workspace, make sure you assign it a name.
Objects
The following data objects exist in R:
- vectors
- lists
- arrays
- matrices
- tables
- data frames
Some of these are more important than others. And there are more, but these
are the ones we need to know about for now. Let's begin at the beginning.
Object and Variable Names
R doesn't care much what you name things, whether they are variables or
complete data objects. As noted in the last tutorial, however, DO NOT put
spaces or dashes in your names. Thus, all of these are acceptable (and
different) object or variable names:
- x
- X
- x2
- x.2
- x_2
- myData
- MyData
- my_data
- my.data
- my.data.from.the.learning.experiment
- fred
- Fred
- FRED
- Rutherford.B.Hayes
Be creative! But if you make your object names too long, you'll be sorry,
because you'll be typing them a lot! Another warning: It is generally safest to
confine yourself to letters, numbers, dots, and underline characters and to
start your variable names with a letter (required). No dashes! Verboten!
Try to avoid using
names that are also functions in R, like "mean" for example, although R will
usually work around this. The only names I would seriously warn you against are T
and F. Avoid these as variable names because, as we will see later, R uses them to
mean true and false. If you assign them another value, that could cause trouble.
Then, instead of true and false, you've got Fred and Ethel, and that's just not
right!
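A small sketch of the kind of trouble meant here. It overwrites T on purpose and then removes the copy, so nothing is left behind in your workspace.

```r
isTRUE(T)    # TRUE, because T starts out as shorthand for TRUE
T = 0        # R allows this without complaint...
isTRUE(T)    # FALSE: anything that relies on T now gets 0 instead
rm(T)        # removing our copy uncovers the built-in T again
isTRUE(T)    # TRUE
```

The full words TRUE and FALSE are reserved and cannot be reassigned, which is one reason to spell them out in your own commands.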
Where The Heck Did That Come From?
Remember, R has a large number of built-in data objects. Some of them
will be used below to illustrate the various kinds of R data objects.
For example, here is a data object containing the lengths of major North
American rivers (in miles).
> rivers
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315
[15] 870 906 202 329 290 1000 600 505 1450 840 1243 890 350 407
[29] 286 280 525 720 390 250 327 230 265 850 210 630 260 230
[43] 360 730 600 306 390 420 291 710 340 217 281 352 259 250
[57] 470 680 570 350 300 560 900 625 332 2348 1171 3710 2315 2533
[71] 780 280 410 460 260 255 431 350 760 618 338 981 1306 500
[85] 696 605 250 411 1054 735 233 435 490 310 460 383 375 1270
[99] 545 445 1885 380 300 380 377 425 276 210 800 420 350 360
[113] 538 1100 1205 314 237 610 360 540 1038 424 310 300 444 301
[127] 268 620 215 652 900 525 246 360 529 500 720 270 430 671
[141] 1770
(The output on your screen may be slightly different, depending upon how wide
you have your R Console window set to. The data values will be the same, but
the numbers in square brackets may be different.)
In this R output, everything is numbered, but only the number of the first item
on each output line is printed. Thus, the value 1205 (third line from the
bottom, three items in--may be different on your screen) is item number 115 in
this output. These index numbers are NOT PART OF THE DATA THEMSELVES! This will be
made clearer in the following section. The object "rivers" is a vector,
so...
Vectors
One kind of vector consists of numbers, as was the case just above for the
vector "rivers". This is called a numeric vector, cleverly enough. Any item
in this vector can be addressed by using its index number.
> rivers[115] # "show item 115 in vector rivers"
[1] 1205
The index number must be enclosed within square brackets. Notice R prints it
out as item [1], but within the "rivers" vector it is item [115]. Don't get
hung up over this. It happens because R considers this output also to be a
new vector. This can be very useful, as we'll see. It means that, unlike other
statistical software, R will allow you to use the output of a command as input
for further calculations. (If this isn't working for you, by the way, it
probably means that you are using a very old version of R. Try putting a copy
of the "rivers" vector in your workspace first: data(rivers). This should make the vector available no
matter what.)
If you want to see items 10 through 20 in "rivers" do this.
> rivers[10:20] # a colon between two numbers means "through"
[1] 600 330 336 280 315 870 906 202 329 290 1000
In R, a colon has two meanings. This is one of them. When two numbers
are separated by a colon, it means "through" as in "10 through 20". Try this.
> 10:20 # output not shown
Since no function is specified to operate on these numbers, R assumes you meant
print(10:20). So one meaning of colon is "through",
and it will be a while before you have to worry about what the second meaning is
(interaction). On the other hand, in R square brackets have only ONE meaning:
index. Inside of square brackets you will always find index numbers, or something
that evaluates to index numbers. For a simple example, you can create a vector
of index numbers using the c() function. If you want
to see items 18, 104, and 168, do this.
> rivers[c(18, 104, 168)] # c() "combines" these values into a vector
[1] 329 380 NA
> rivers[18, 104, 168] # This will NOT work. So stop doing it!
Error in rivers[18, 104, 168] : incorrect number of dimensions
"NA" means not available, or missing. The "rivers" vector is only 141 items
long, so you just asked for something that doesn't exist. The point is, to see
specific items within a vector, enter a vector of index numbers inside the
square brackets. You can also use relational operators (about which more later)
to pick out certain items from a vector. If you just want to see the data values
for rivers with lengths greater than 500 miles, do this.
> rivers[rivers > 500]
[1] 735 524 1459 600 870 906 1000 600 505 1450 840 1243 890 525 720
[16] 850 630 730 600 710 680 570 560 900 625 2348 1171 3710 2315 2533
[31] 780 760 618 981 1306 696 605 1054 735 1270 545 1885 800 538 1100
[46] 1205 610 540 1038 620 652 900 525 529 720 671 1770
I will tell you how to find out which rivers those are in a later tutorial. In
the meantime, here's how that works. The expression
rivers > 500 evaluates to TRUE or FALSE for each value of the rivers
vector. Try it. Type rivers > 500 at a command
prompt and see what happens. When used as indexes, TRUE means "include it"
and FALSE means "don't include it."
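A short sketch of the same idea in pieces. It creates no new objects; sum() is used here only as a convenient way to count the TRUEs, since R treats each TRUE as 1 when adding.

```r
(rivers > 500)[1:6]            # TRUE FALSE FALSE FALSE TRUE FALSE: one answer per river
rivers[1:6]                    # 735 320 325 392 524 450: compare with the line above
sum(rivers > 500)              # 57 rivers are longer than 500 miles
length(rivers[rivers > 500])   # 57 again, counted from the extracted values
```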
Suppose you just wanted to see the last 50 values in the rivers vector. You
could figure out how long the vector is and then calculate the appropriate
indexes, but fortunately there are special functions for seeing the beginning
and end of data objects.
> head(rivers, n=50) # first 50; output not shown
> tail(rivers, n=50) # last 50; output not shown
Question: Why are the values in the output vector produced by tail() numbered 1 to 50? (Answer: Output produces a new
vector or, if it is stored by assigning it a name, a new data object.)
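For the curious, here is what "calculate the appropriate indexes" would look like done by hand. This is only a sketch to show that head() and tail() give the same values with less arithmetic.

```r
length(rivers)        # 141 items in all
141 - 50 + 1          # 92, so the last 50 items are numbers 92 through 141
identical(rivers[92:141], tail(rivers, n = 50))   # TRUE
identical(rivers[1:50], head(rivers, n = 50))     # TRUE
```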
One way to create a vector is to use the c()
function (short for concatenate, or combine).
> x = c(12, 14, 15, 17, 19, 8, 10)
> x
[1] 12 14 15 17 19 8 10
Once again, R isn't picky about spacing. None of the spaces in the above
command needs to be there. Or you can put more in if you like. I won't mention
this again. I assume if you get curious about some special case, you will
experiment and find the answer for yourself.
If the values you wish to enter into a vector are consecutive, then
this is sufficient:
> x = 100:200 # x = c(100:200) also works (but not in older versions of R)
> x
[1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
[19] 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
[37] 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
[55] 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171
[73] 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189
[91] 190 191 192 193 194 195 196 197 198 199 200
And remember (also the last time I'll mention this), the old "x" has been
overwritten, gone, history, is no more, irretrievable! Be careful or sooner or
later you're going to overwrite something in your workspace that you didn't
mean to. You've been warned!
Vectors can also contain words or character values. When you enter
these values, they must be in double or single quotes.
> x = c("Bob","Carol","Ted","Alice")
> x
[1] "Bob" "Carol" "Ted" "Alice"
Two vectors can also be concatenated into one with the concatenate function as
follows.
> y = c("John","Joy","Fred","Frances")
> z = c(x, y)
> z
[1] "Bob" "Carol" "Ted" "Alice" "John" "Joy" "Fred"
[8] "Frances"
What would have happened if, instead, you had done this?
> z2 = c("x", "y")
> z2
It's worth finding out, so don't just sit there wondering. Type! One thing I
had a bit of trouble getting used to in R is when to put things in quotes and
when not to. The basic rule is: If it's an already defined object, don't quote
it. If you want to refer to the values inside already existing x and y vectors,
don't quote. If it's a new character value (i.e., a string--someone's or
something's name), use quotes. R assumes anything not in quotes is an object
name (an already defined vector, list, dataframe, etc.), and it will hunt for
that object in the search path. If it doesn't find it, you will be told so.
> Joy # Print out the value of object Joy.
Error: object "Joy" not found
> "Joy" # Print out "Joy".
[1] "Joy"
> y[2] # Print out the second value in vector y.
[1] "Joy"
> Joy = 5 # Create a new object named Joy.
> Joy
[1] 5
> z[Joy] # you tell me what it will do
In other words, use quotes when you want the name (word) itself. Don't use
quotes when you want the value or values stored in a data object with that
name.
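The rule can be checked with a few lines. This sketch reuses Joy and y from above and adds the exists() function, which asks R whether an object with a given name can be found. "Ethel" is just a name assumed not to be in your workspace.

```r
Joy = 5
y = c("John", "Joy", "Fred", "Frances")

"Joy"             # quoted: the word itself
Joy               # unquoted: the value stored in the object named Joy, which is 5
y[2] == "Joy"     # TRUE: the second value in y is the word "Joy"
exists("Joy")     # TRUE: there is an object by that name
exists("Ethel")   # FALSE, assuming you have not created one
```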
Now do this.
> islands # Only the first four lines of output are shown here.
      Africa   Antarctica         Asia    Australia
       11506         5500        16988         2968
Axel Heiberg       Baffin        Banks       Borneo
          16          184           23          280
...
This is called a named vector. Here is how to create one.
> x = c("Robert Culp","Natalie Wood","Elliott Gould","Dyan Cannon")
> x # The values are not named yet.
[1] "Robert Culp" "Natalie Wood" "Elliott Gould" "Dyan Cannon"
> names(x) = c("Bob","Carol","Ted","Alice")
> x # And now they are.
            Bob           Carol             Ted           Alice
  "Robert Culp"  "Natalie Wood" "Elliott Gould"   "Dyan Cannon"
> x[Alice] # This is not correct! Why not?
Error: object "Alice" not found
> x["Alice"]
Alice
"Dyan Cannon"
> Alice = 4
> x[Alice] # Same thing as x[4].
Alice
"Dyan Cannon"
Confusing, right? You'll get used to it. This is a helpful example to study
and play around with. (STUDENTS: That means study it and play around with it!)
The vector "x" now contains the names of the actors in the movie "Bob and
Carol, Ted and Alice." The names() function
was used to label these values with the names of the characters they played in
the movie. Then we used the name of the character to retrieve the name of the
actor. Dyan Cannon could also have been referred to as x[4]. Try it. (I have a
very funny story about this movie, but this is not the place for it!)
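A few more things to try with the same named vector. This is a sketch that rebuilds x exactly as above, so it does not disturb your workspace; which() is a standard function that reports the positions where a condition is TRUE.

```r
x = c("Robert Culp", "Natalie Wood", "Elliott Gould", "Dyan Cannon")
names(x) = c("Bob", "Carol", "Ted", "Alice")

names(x)                     # just the labels
x[c("Ted", "Alice")]         # several values at once, by name
names(x)[4]                  # "Alice": the label on item 4
which(names(x) == "Alice")   # 4: the position that carries the label "Alice"
```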
In the "islands" vector, the data values are the size of the land mass in
thousands of square miles. Each data value is named with the name of the land
mass. Thus, to retrieve the area of Cuba, we do not need to know which of the
data values is Cuba. We can retrieve the value by name. The name is put inside
of square brackets just as if it were an index number, and it is quoted.
> islands["Cuba"] # islands[12] would also work, if we'd only known!
Cuba
43
Cuba has a land area of 43,000 square miles. Suppose you wanted to work with
this data vector, but you wanted the land areas in square kilometers instead of
square miles. The following procedure will allow this. First, use the data() function to write a copy of "islands" to
your workspace. Then do the conversion. The converted values can either be
stored back into the "islands" vector, in which case the old values are
overwritten, or they can be stored into a new vector with a new name.
> data(islands) # writes a copy to your workspace
> ls()
[1] "Alice" "islands" "Joy" "x" "y" "z" "z2"
> km_islands = islands * 2.59 # probably the best way
> km_islands["Cuba"]
Cuba
111.37
> islands = islands * 2.59 # overwrites the original data values
> islands["Cuba"] # the original data in miles are GONE!
Cuba
111.37
And finally...
> ls()
[1] "Alice" "islands" "Joy" "km_islands" "x"
[6] "y" "z" "z2"
> rm(list=ls()) # clean up!
> ls()
character(0)
Vectors are used a lot in R. You should take some time to understand them.
Lists
Lists are collections of other R objects gathered into one place. To create a
list, use the list() function.
> x=1:10 # a vector
> y=matrix(1:12,nrow=3) # a matrix
> z="Bill" # a character variable
> my.list=list(x,y,z) # create the list
> my.list # view the list
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
[[3]]
[1] "Bill"
The output of a lot of R functions is actually composed of lists. Notice that
items in a list are indexed by values inside double brackets. Thus...
> my.list[[3]] # The third item in my.list.
[1] "Bill"
To name the items
in the list...
> names(my.list) = c("my.vector","my.matrix","my.name")
> my.list
$my.vector
[1] 1 2 3 4 5 6 7 8 9 10
$my.matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
$my.name
[1] "Bill"
In R, the $ is used for list indexing. That is, it allows you to pull elements
out of lists by name. First type the name of the list, followed by $, followed
by the name of the item in the list. For example...
> my.list$my.name
[1] "Bill"
Kinda trivial in this case, but it won't be when you have a much longer list.
That's enough on lists for now.
> ls()
[1] "my.list" "x" "y" "z"
> rm(my.list,x,y,z) # Don't forget to clean up!
There is one more thing you should remember about lists. Data frames are
actually lists. In fact, this is probably the most important thing you need
to remember about lists!
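A brief sketch of both points, under the assumption that you have cleaned up as instructed. The name small.list is made up for this example; "women" is one of R's built-in data frames.

```r
small.list = list(my.vector = 1:10, my.name = "Bill")   # names can be given up front

small.list[["my.name"]]   # double brackets also accept a name; same as small.list$my.name
small.list[[1]][3]        # the third value of the first item: 3
class(small.list[1])      # "list": single brackets give back a smaller list, not the item

is.list(women)            # TRUE: a data frame really is a list
women$height[1:3]         # so $ pulls out a column by name: 58 59 60
rm(small.list)            # clean up
```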
Matrices and Arrays
Essentially, these are both table-like objects. You saw how to create a
matrix using the matrix() function in the last section on
lists. Inside of this function you need to name the vector that you want
matrixized (not really a word, I don't think), and you need to tell either how
many rows or how many columns you want in the matrix. The matrix will be filled
down the columns, as in the following example. To fill across the rows, set the
byrow= option to TRUE. That's really enough for now.
Except maybe for extracting values from one. The syntax is
my.matrix[row,col], as follows.
> y = matrix(1:16, nrow=4) # First we need a matrix! With 4 rows.
> class(y) # y is an object of class "matrix" (R 4.0.0 and later also report "array")
[1] "matrix"
> y
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
> y[3,2]
[1] 7
> y = matrix(1:16, nrow=4, byrow=TRUE) # fill across rows instead of down columns
> y
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
> y[3,2]
[1] 10
> y = matrix(1:16, ncol=4, byrow=F) # back to the original, but using ncol=
Remember this! When indexing a matrix (or any table-like object), always put
the row index first followed by the column index, and
always put the indexes inside of square brackets. Notice
our matrix has no row names or column names. The notation [1,] means "row one,
all columns". To recall an entire row or an entire column of a matrix (or an
array or a table), do this.
> y[1,] # all values in row 1
[1] 1 5 9 13
> y[,3] # all values in column 3
[1] 9 10 11 12
More later on matrices, including how to name the rows and columns.
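To tie the pieces together, here is a sketch comparing the two ways of filling a matrix and pulling out more than one row and column at a time. The name yr is invented for this example, and t() is R's transpose function, which is not used elsewhere in this tutorial.

```r
y  = matrix(1:16, nrow = 4)                 # filled down the columns
yr = matrix(1:16, nrow = 4, byrow = TRUE)   # filled across the rows

dim(y)                  # 4 4: rows first, then columns
identical(yr, t(y))     # TRUE: here, filling by row gives the transpose
y[2:3, c(1, 4)]         # rows 2 and 3, columns 1 and 4
rm(yr)                  # clean up the extra object
```

Vectors of index numbers work inside the brackets of a matrix just as they do for a plain vector, one vector for rows and one for columns.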
An array is like a matrix, except it can have more than two dimensions. In
other words, a matrix is just a two-dimensional array.
> y = array(1:16, dim=c(4,2,2))
> y
, , 1
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
, , 2
     [,1] [,2]
[1,]    9   13
[2,]   10   14
[3,]   11   15
[4,]   12   16
The array() function creates arrays. The
"dim" option gives the number of rows, columns, and layers, in that order. Of
course, this would be more useful if we were putting real data into the array
rather than just the numbers 1 to 16. It was just a quick example. To put
real data into a matrix or an array, simply put the data into a vector, and
replace "1:16" with the name of the vector in the matrix() or array() function.
> x = c(1.26, 3.89, 4.20, 0.76, 2.22, 6.01, 5.29, 1.93, 3.27)
> y = matric(x, nrow=3) # Hey! Everybody makes mistakes!
Error: could not find function "matric"
> y = matrix(x, nrow=3)
> y
     [,1] [,2] [,3]
[1,] 1.26 0.76 5.29
[2,] 3.89 2.22 1.93
[3,] 4.20 6.01 3.27
Don't forget to clean up.
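Before leaving arrays, here is a sketch of how indexing extends to a third dimension. It rebuilds the 4 x 2 x 2 array from above; the only change from matrix indexing is one more index for the layer.

```r
y = array(1:16, dim = c(4, 2, 2))

dim(y)        # 4 2 2: rows, columns, layers
y[3, 2, 2]    # row 3, column 2, layer 2: 15
y[, , 1]      # all of layer 1, which comes back as an ordinary 4 x 2 matrix
```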
Tables
If the function to create a matrix is matrix(), and the function to create an array is
array(), I bet you can guess what function is
used to create a table. It's used quite a bit differently, however. The table() function is used to create frequency tables
or crosstabulations from raw data contained in a vector or a data frame. The
result is something that looks, in many cases, very much like a matrix or an
array, and behaves very much like one as well. For now, we will confine
ourselves to one relatively simple example. First, we have to create some raw
data.
> y = rnorm(100, mean=100, sd=15) # 100 normally distributed random nos.
> y = round(y, 0) # Rounded to zero decimal places.
Once again, don't worry about the syntax of these statements. I'm just using
them to create some data to put into a table. Since the values in the y vector
are random, everyone's results here will be different. To view a frequency
table (badly formatted, but...small steps!), do this.
> table(y)
y
 64  69  73  74  77  79  80  81  82  84  85  86  87  88  89  90  91  92  93
  1   1   1   1   4   4   2   1   1   2   1   1   1   3   1   1   1   2   1
 94  95  96  97  98  99 100 101 102 103 104 105 106 107 109 110 111 112 113
  4   4   3   3   5   2   6   3   1   5   4   2   2   2   1   2   1   4   3
114 116 117 118 119 120 123 125 129
  2   2   1   1   2   1   1   2   1
The top row of numbers contains the data values, which we can see range from 64
to 129, and the bottom row of numbers gives the frequencies. The data value
(i.e., y-value) of 100, for example, occurs 6 times in the data vector. (Once
again, your result will be different.) Tables, of course, just like everything
else in R, can be stored and then used for further analysis...
> table(y) -> myTable # Store it.
> barplot(myTable)
> ls()
[1] "myTable" "y"
> rm(myTable, y) # And remember to clean up.
This table is (was!) one-dimensional. The "HairEyeColor" object we were playing
with in a previous tutorial was a multidimensional table of frequencies, also
called a crosstabulation. The table() function will also
create crosstabulations.
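As a small sketch of a crosstabulation, give table() two vectors of the same length instead of one. The values are the race and year columns from the five students at the top of this tutorial; the object name xt is made up for the example.

```r
race = c("Cauc", "Af.Am", "Af.Am", "Cauc", "Asian")
year = c("Jr", "Fr", "Fr", "Sr", "So")

xt = table(race, year)    # rows are races, columns are years
xt                        # each cell counts the students in that combination
xt["Af.Am", "Fr"]         # 2: indexed like a matrix, row first, then column
dim(xt)                   # 3 4: three races by four years
```

With only five students most cells are zero, but the layout is the same one you would get from a much larger data set.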
Data Frames
Data frames are so important that I will devote an entire tutorial just
to them. For now, if you want to see a few, try this. The output will not
be shown. Look at your screen.
> women # average weight of women by height
> USArrests # crime statistics; scroll to see it all
> head(USArrests) # just the first six rows of data
> chickwts # chicken weights by feed type
The basic structure of a data frame is illustrated here. It's basically a table
(in fact, it's a list of column vectors) in which each variable goes in its own
column and each case goes in its own row.
Usually, data frames are read into the R workspace from external files,
which may have been created using a spreadsheet. Small ones can be typed in
at the command line, however. Let's use the data at the beginning of this
tutorial to see how that would work.
> myFirstDataframe = data.frame( # Press Enter to start a new line.
+ name=c("Bob","Fred","Barb","Sue","Jeff"),
+ age=c(21,18,18,24,20), hgt=c(70,67,64,66,72),
+ wgt=c(180,156,128,118,202),
+ race=c("Cauc","Af.Am","Af.Am","Cauc","Asian"),
+ year=c("Jr","Fr","Fr","Sr","So"),
+ SAT=c(1080,1210,840,1340,880)) # End with double close parenthesis. Why?
> myFirstDataframe
  name age hgt wgt  race year  SAT
1  Bob  21  70 180  Cauc   Jr 1080
2 Fred  18  67 156 Af.Am   Fr 1210
3 Barb  18  64 128 Af.Am   Fr  840
4  Sue  24  66 118  Cauc   Sr 1340
5 Jeff  20  72 202 Asian   So  880
That's probably not something you're going to want to do too very often! In
fact, I'd almost be willing to bet you got at least one comma, one quote, or
one parenthesis out of place, and the whole thing failed because of that. I've
gotten e-mails from a number of people telling me they couldn't get this to
work, but I just tested it by copying and pasting, and it does work. Try
highlighting the following text, copy, and then paste it into R at the
command prompt. (May not work in RStudio.)
# Begin copying here.
myFirstDataframe = data.frame(
name=c("Bob","Fred","Barb","Sue","Jeff"),
age=c(21,18,18,24,20), hgt=c(70,67,64,66,72),
wgt=c(180,156,128,118,202),
race=c("Cauc","Af.Am","Af.Am","Cauc","Asian"),
year=c("Jr","Fr","Fr","Sr","So"),
SAT=c(1080,1210,840,1340,880))
# End copying here.
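Once a data frame exists, the indexing tools from the earlier sections all apply to it. Here is a sketch using a cut-down, three-student version of the data; the name small.df is invented so that it does not overwrite your myFirstDataframe.

```r
small.df = data.frame(
  name = c("Bob", "Fred", "Barb"),
  age  = c(21, 18, 18),
  SAT  = c(1080, 1210, 840))

dim(small.df)          # 3 3: rows (cases) first, then columns (variables)
small.df$age           # one column, pulled out by name as from any list
small.df[2, "SAT"]     # row 2, column "SAT": 1210
mean(small.df$SAT)     # a column is a vector, so vector functions work on it
```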
Last Word
Further details as needed on these data objects will be covered in
future tutorials. For now, you should get the general idea.
revised 2016 January 18