R Tutorials--Objects

R OBJECTS^*

^* Emphasis on the first syllable. It's a noun, not a verb. Objects are the things you store in your workspace.

Data

Your data are the information upon which you wish to do a statistical analysis. By the way, the word "data" is plural, so ordinarily you would not say "data is" or "data was." Correct are "data are" and "data were." I'm not the grammar police, but I will object (verb) to errors on that one!

(STUDENTS: Your English teacher may have told you that it is now acceptable to use "data" in the singular. Not in my class it ain't. It just sounds illiterate to me. Would you use "phenomena" or "criteria" in the singular? Well, you shouldn't! Let me explain something to you that may help you should you ever end up in graduate school. Your graduate adviser may be quite a bit older than you are, perhaps old enough to be old school, maybe even an old geezer like me. Listen to the way he or she uses the word "data." Because here's an important lesson about grad school. However your graduate adviser does something, that's the correct way to do it!)

Maintaining a data set is one of the most important things a statistician needs to know how to do. Most statistical software requires that the data set be in a very specific format, called a data table or, in R, a data frame (one word or two, take your pick). Data frames will be covered in detail in a future tutorial.

This is where R truly shines. R is much more flexible in that it does not require that you use the data frame format for your data. If it is more convenient to keep your data in a contingency table, or a list, or a matrix, or a single vector, you can do so. This flexibility has a price--more to learn. In the end, however, it makes R a much more convenient and flexible way to analyze data sets, especially simple ones.

In the behavioral and social sciences, the unit of analysis is usually a subject, human or animal. In the more general case, subjects are called "cases" or "observations" or "experimental units." I prefer cases. There will come a time when we have to distinguish between subjects and cases, so you should not think of these two terms as being exactly equivalent.

Let's say you've collected data from five subjects: Bob, Fred, Barb, Sue, and Jeff. From each subject you have collected information about age, height, weight, race, year in school (they are all college students), and SAT score. Your cases are Bob, Fred, Barb, Sue, and Jeff. Age, height, weight, race, year in school, and SAT score are called variables. You would ordinarily put this information into a data frame as follows:

name     age  hgt  wgt  race year   SAT 
Bob       21   70  180  Cauc   Jr  1080
Fred      18   67  156 Af.Am   Fr  1210
Barb      18   64  128 Af.Am   Fr   840
Sue       24   66  118  Cauc   Sr  1340
Jeff      20   72  202 Asian   So   880

Notice that the cases, or subjects, go into rows in this table, and each variable has its own column. This is the standard form for maintaining a data table (data frame). It looks a lot like a spreadsheet, and in fact, using spreadsheet software is a very good way to manage data. (Just don't succumb to the urge to do any fancy formatting. Headers and data and that's all!) The first row in this table is called the header. It contains the variable names. Having a header row is optional but usually a good idea.

I call your attention to the fact that we have two fundamentally different kinds of variables in this data frame. Some are numbers, like age and weight. These are called numeric variables. Other variables contain just the names of categories that the subject falls into. Race is an example of such a variable, called a categorical variable. It's absolutely essential that you be able to distinguish these two types of variables. You can't do statistics otherwise! R will recognize the difference automatically. You don't need to tell it which is which, UNLESS you've coded your categories with numbers.

Categorical variables are often called factors in R. Just to make matters a bit more confusing, examine the "year" variable. What would you call it, numeric or categorical? If those were your only choices, you'd have to call it categorical. In fact, in this variable the categories have a natural order to them: Fr, So, Jr, Sr. Sometimes such a categorical variable is called an ordered factor in R. To get R to recognize a factor as ordered, you have to declare it as such.

You may be more familiar with the terms nominal, ordinal, interval, and ratio variables. Nominal variables and categorical variables are roughly the same thing. Factors are usually nominal. However, ordered factors are ordinal. Numeric variables are either interval or ratio variables, and it usually doesn't matter which. One more catch to all this--examine the column labeled "name" in the table above. Is this a variable? I suppose it is since its value is different for everyone. Usually when we think of categorical variables or factors, we are thinking of variables that have relatively few possible values, variables that define groups (hence also called grouping variables). The values of such a variable are called levels. The levels of year, for example, are Fr, So, Jr, Sr. When a variable has a different value for everyone, like the subject's name or address for example, it's often called a character variable. You will see R make this distinction, and it's a useful one, so remember it.
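As a minimal sketch of these distinctions, the vectors below hold a nominal factor, an ordered factor, and a character variable. The values come from the table above; the object names race.f, year.f, and name.c are made up for this illustration and are removed at the end.

```r
# Values are taken from the five-subject table; the object names are hypothetical.
race.f = factor(c("Cauc", "Af.Am", "Af.Am", "Cauc", "Asian"))   # a nominal factor
levels(race.f)                     # levels are sorted alphabetically by default
# [1] "Af.Am" "Asian" "Cauc"

# An ordered factor has to be declared, with the levels listed in their natural order.
year.f = factor(c("Jr", "Fr", "Fr", "Sr", "So"),
                levels = c("Fr", "So", "Jr", "Sr"), ordered = TRUE)
is.ordered(year.f)
# [1] TRUE
year.f[1] > year.f[2]              # Jr comes after Fr, so the comparison makes sense
# [1] TRUE

name.c = c("Bob", "Fred", "Barb", "Sue", "Jeff")   # a character variable
class(name.c)
# [1] "character"

rm(race.f, year.f, name.c)         # clean up
```

Comparisons such as > are only meaningful for the ordered factor. Asking the same question of race.f would produce NA and a warning, since its levels have no order.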

You get data into R by creating data objects, so let's see how that is done.

Assignment

In R you create things, called "objects", by a process called assignment. Start an R session and set the working directory to Rspace. Also, clear the workspace.

> setwd("Rspace")       # There is a menu item for this in the GUI, btw.
> rm(list=ls())         # Or use the menus to do this.

If you don't know what this means or have forgotten to create the Rspace directory, you can find out how in the tutorial called Preliminaries.

There are three ways to assign data to an object name in R (actually four, but one is rarely used). Here is one way.

> x = 7

This should not be read as "x equals 7", which will result in confusion later. Instead, the single equals sign means "takes the value" or "is assigned the value." R is not usually picky about spacing, so all of the following are equivalent.

> x=7                                  # "x is assigned the value 7"
> x = 7                                # "x is assigned the value 7" again
> x=           7                       # and again
> x            =             7         # and again
> x =                                  # Press Enter here.
+ 7                                    # Don't type the +. It's already there.

Use spacing to make your typed input look "pretty." Or not. It's (generally) up to you. There are a few situations where R will get uppity about spacing, but usually it is not an issue. DON'T, however, be so silly as to put a space in the middle of the name of something. That would be bad!

Here is another way to do assignment.

> x <- 7

And here is one place where R insists on the correct spacing. The "arrow" assignment operator is actually two symbols, a less than sign and a dash or minus (not an underline character no matter what it might look like in your browser). THERE CANNOT BE A SPACE BETWEEN THEM! Why would anyone want to use two symbols instead of one if they do the same thing? You'll see!

In the meantime, I find it convenient to leave spaces on each side of this arrow operator. It has saved me some sorrow! That way you're less likely to make critical spacing errors when using an arrow. For example, suppose your fingers get all crossed up, and you type this: x < -7. Type it and see what happens. Huh? Usually not a problem--just retype it correctly. But I've learned the hard way that a mistake like that in the wrong place can have painful consequences! (Not that one so much, but cases where I meant to type x < -7 and typed x <- 7 or x<-7 instead. What would that do?)
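For anyone who would rather check the answer than guess, here is a small sketch of the difference between the comparison and the assignment. The object name w is hypothetical, chosen so that nothing else in the workspace is disturbed.

```r
w = 5          # w is a made-up name used only for this illustration
w < -7         # with the space, a comparison: "is w less than negative 7?"
# [1] FALSE
w              # the comparison did not change w
# [1] 5
w<-7           # with no spaces, R reads this as the assignment w <- 7
w              # the old value has been silently replaced
# [1] 7
rm(w)          # clean up
```

The danger is the last case: a test that was meant to ask a question ends up overwriting the data instead, and R says nothing about it.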

Now look at the object called "x" in your workspace.

> ls()                                 # the "show me" function
[1] "x"
> x                                    # print out the value of x
[1] 7

We will use the third kind of assignment to overwrite this value.

> 9 -> x                               # arrow always points to the variable name
> x
[1] 9

Three things to note here. First, R is perfectly willing to let you be stupid and overwrite things you have in your workspace. There is no warning. If you assign something to an object name that already exists, the old object is gone! Second, the arrow assignment works from either direction. The equal sign does not! When using =, you must give the object name first followed by the value you wish to assign to it.

Third, notice that when you do an assignment, nothing prints to the console. R creates the data object in your workspace and remains silent. If your intention is to do assignment, thus creating a data object in your workspace, and you see the data spilling onto the console after pressing Enter, then chances are you've forgotten to give your new data object a name (and therefore have not created a new data object). This is particularly painful when reading in a file or using a more complex "data-creating" function such as scan(). You can spend quite a long time typing in your data, press the Enter key, and see it all spill out onto the console. At that point, it's lost! You have to start over. Be careful! When your intention is to create a data object in your workspace, make sure you assign it a name.

Objects

The following data objects exist in R:

vectors
lists
arrays
matrices
tables
data frames

Some of these are more important than others. And there are more, but these are the ones we need to know about for now. Let's begin at the beginning.

Object and Variable Names

R doesn't care much what you name things, whether they are variables or complete data objects. As noted in the last tutorial, however, DO NOT put spaces or dashes in your names. Thus, all of these are acceptable (and different) object or variable names:

x
X
x2
x.2
x_2
myData
MyData
my_data
my.data
my.data.from.the.learning.experiment
fred
Fred
FRED
Rutherford.B.Hayes

Be creative! But if you make your object names too long, you'll be sorry, because you'll be typing them a lot! Another warning: It is generally safest to confine yourself to letters, numbers, dots, and underline characters and to start your variable names with a letter (required). No dashes! Verboten! Try to avoid using names that are also functions in R, like "mean" for example, although R will usually work around this. The only names I would seriously warn you against are T and F. Avoid these as variable names because, as we will see later, R uses them to mean true and false. If you assign them another value, that could cause trouble. Then, instead of true and false, you've got Fred and Ethel, and that's just not right!
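The warning about T and F can be seen in a few lines. This is only a sketch: T and F are ordinary objects that start out holding TRUE and FALSE, whereas the full words TRUE and FALSE are reserved and cannot be assigned to.

```r
identical(T, TRUE)     # at the start of a session, T holds the value TRUE
# [1] TRUE
T = "Fred"             # R lets you overwrite it without any warning
identical(T, TRUE)     # anything that relies on T meaning "true" is now broken
# [1] FALSE
rm(T)                  # removing your copy makes the built-in T visible again
identical(T, TRUE)
# [1] TRUE
```

Spelling out TRUE and FALSE in your own commands sidesteps the problem entirely.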

Where The Heck Did That Come From?

Remember, R has a large number of built-in data objects. Some of them will be used below to illustrate the various kinds of R data objects. For example, here is a data object containing the lengths of major North American rivers (in miles).

> rivers
  [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280  315
 [15]  870  906  202  329  290 1000  600  505 1450  840 1243  890  350  407
 [29]  286  280  525  720  390  250  327  230  265  850  210  630  260  230
 [43]  360  730  600  306  390  420  291  710  340  217  281  352  259  250
 [57]  470  680  570  350  300  560  900  625  332 2348 1171 3710 2315 2533
 [71]  780  280  410  460  260  255  431  350  760  618  338  981 1306  500
 [85]  696  605  250  411 1054  735  233  435  490  310  460  383  375 1270
 [99]  545  445 1885  380  300  380  377  425  276  210  800  420  350  360
[113]  538 1100 1205  314  237  610  360  540 1038  424  310  300  444  301
[127]  268  620  215  652  900  525  246  360  529  500  720  270  430  671
[141] 1770

(The output on your screen may be slightly different, depending upon how wide your R Console window is. The data values will be the same, but the numbers in square brackets may be different.)

In this R output, everything is numbered, but only the number of the first item on each output line is printed. Thus, the value 1205 (third line from the bottom, three items in--may be different on your screen) is item number 115 in this output. These index numbers are NOT PART OF THE DATA THEMSELVES! This will be made clearer in the following section. The object "rivers" is a vector, so...

Vectors

One kind of vector consists of numbers, as was the case just above for the vector "rivers". This is called a numeric vector, cleverly enough. Any item in this vector can be addressed by using its index number.

> rivers[115]                          # "show item 115 in vector rivers"
[1] 1205

The index number must be enclosed within square brackets. Notice R prints it out as item [1], but within the "rivers" vector it is item [115]. Don't get hung up over this. It happens because R considers this output also to be a new vector. This can be very useful, as we'll see. It means that, unlike other statistical software, R will allow you to use the output of a command as input for further calculations. (If this isn't working for you, by the way, it probably means that you are using a very old version of R. Try putting a copy of the "rivers" vector in your workspace first: data(rivers). This should make the vector available no matter what.)
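A short sketch of what "output as input" means in practice: the value pulled out by an index can go straight into a calculation without being stored first. The conversion factor of 1.609 kilometers per mile is an approximation supplied here for the illustration.

```r
rivers[115]                    # the value itself, a vector of length one
# [1] 1205
rivers[115] * 1.609            # the same river in kilometers (approximately)
# [1] 1938.845
rivers[115] - rivers[1]        # how much longer item 115 is than item 1
# [1] 470
```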

If you want to see items 10 through 20 in "rivers" do this.

> rivers[10:20]                        # a colon between two numbers means "through"
 [1]  600  330  336  280  315  870  906  202  329  290 1000

In R, a colon has two meanings. This is one of them. When two numbers are separated by a colon, it means "through" as in "10 through 20". Try this.

> 10:20                                # output not shown

Since no function is specified to operate on these numbers, R assumes you meant print(10:20). So one meaning of colon is "through", and it will be a while before you have to worry about what the second meaning is (interaction). On the other hand, in R square brackets have only ONE meaning: index. Inside of square brackets you will always find index numbers, or something that evaluates to index numbers. For a simple example, you can create a vector of index numbers using the c() function. If you want to see items 18, 104, and 168, do this.

> rivers[c(18, 104, 168)]              # c() "combines" these values into a vector
[1] 329 380  NA
> rivers[18, 104, 168]                 # This will NOT work. So stop doing it!
Error in rivers[18, 104, 168] : incorrect number of dimensions

"NA" means not available, or missing. The "rivers" vector is only 141 items long, so you just asked for something that doesn't exist. The point is, to see specific items within a vector, enter a vector of index numbers inside the square brackets. You can also use relational operators (about which more later) to pick out certain items from a vector. If you just want to see the data values for rivers with lengths greater than 500 miles, do this.

> rivers[rivers > 500]
 [1]  735  524 1459  600  870  906 1000  600  505 1450  840 1243  890  525  720
[16]  850  630  730  600  710  680  570  560  900  625 2348 1171 3710 2315 2533
[31]  780  760  618  981 1306  696  605 1054  735 1270  545 1885  800  538 1100
[46] 1205  610  540 1038  620  652  900  525  529  720  671 1770

I will tell you how to find out which rivers those are in a later tutorial. In the meantime, here's how that works. The expression rivers > 500 evaluates to TRUE or FALSE for each value of the rivers vector. Try it. Type rivers > 500 at a command prompt and see what happens. When used as indexes, TRUE means "include it" and FALSE means "don't include it."
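To see the TRUE/FALSE mechanism on a scale small enough to check by eye, the sketch below uses only the first seven values of "rivers". Nothing is stored, so the workspace is left alone.

```r
rivers[1:7]                    # the first seven river lengths
# [1]  735  320  325  392  524  450 1459
rivers[1:7] > 500              # one TRUE or FALSE for each value
# [1]  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE

# Typing the TRUEs and FALSEs by hand picks out the same items.
rivers[1:7][c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)]
# [1]  735  524 1459

sum(rivers > 500)              # TRUE counts as 1, so this counts the long rivers
# [1] 57
```

The count of 57 agrees with the number of values printed above for rivers[rivers > 500].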

Suppose you just wanted to see the last 50 values in the rivers vector. You could figure out how long the vector is and then calculate the appropriate indexes, but fortunately there are special functions for seeing the beginning and end of data objects.

> head(rivers, n=50)                   # first 50; output not shown
> tail(rivers, n=50)                   # last 50; output not shown

Question: Why are the values in the output vector produced by tail() numbered 1 to 50? (Answer: Output produces a new vector or, if it is stored by assigning it a name, a new data object.)
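For comparison, here is a sketch of the do-it-yourself route mentioned above, using length() to find out how long the vector is and then building the indexes by hand.

```r
length(rivers)                             # how many items are in the vector
# [1] 141

# Items 92 through 141 are the last 50. The parentheses are needed because
# the colon is evaluated before the subtraction.
identical(rivers[(length(rivers) - 49):length(rivers)], tail(rivers, n=50))
# [1] TRUE
```

It works, but tail() is a good deal easier to type and much harder to get wrong.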

One way to create a vector is to use the c() function (short for concatenate, or combine).

> x = c(12, 14, 15, 17, 19, 8, 10)
> x
[1] 12 14 15 17 19  8 10

Once again, R isn't picky about spacing. None of the spaces in the above command needs to be there. Or you can put more in if you like. I won't mention this again. I assume if you get curious about some special case, you will experiment and find the answer for yourself.

If the values you wish to enter into a vector are consecutive, then this is sufficient:

> x = 100:200     # x = c(100:200) also works (but not in older versions of R)
> x
  [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
 [19] 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
 [37] 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
 [55] 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171
 [73] 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189
 [91] 190 191 192 193 194 195 196 197 198 199 200

And remember (also the last time I'll mention this), the old "x" has been overwritten, gone, history, is no more, irretrievable! Be careful or sooner or later you're going to overwrite something in your workspace that you didn't mean to. You've been warned!

Vectors can also contain words or character values. When you enter these values, they must be in double or single quotes.

> x = c("Bob","Carol","Ted","Alice")
> x
[1] "Bob"   "Carol" "Ted"   "Alice"

Two vectors can also be concatenated into one with the concatenate function as follows.

> y = c("John","Joy","Fred","Frances")
> z = c(x, y)
> z
[1] "Bob"     "Carol"   "Ted"     "Alice"   "John"    "Joy"     "Fred"   
[8] "Frances"

What would have happened if, instead, you had done this?

> z2 = c("x", "y")
> z2

It's worth finding out, so don't just sit there wondering. Type! One thing I had a bit of trouble getting used to in R is when to put things in quotes and when not to. The basic rule is: If it's an already defined object, don't quote it. If you want to refer to the values inside already existing x and y vectors, don't quote. If it's a new character value (i.e., a string--someone's or something's name), use quotes. R assumes anything not in quotes is an object name (an already defined vector, list, dataframe, etc.), and it will hunt for that object in the search path. If it doesn't find it, you will be told so.

> Joy                        # Print out the value of object Joy.
Error: object "Joy" not found
> "Joy"                      # Print out "Joy".
[1] "Joy"
> y[2]                       # Print out the second value in vector y.
[1] "Joy"
> Joy = 5                    # Create a new object named Joy.
> Joy
[1] 5
> z[Joy]                     # you tell me what it will do

In other words, use quotes when you want the name (word) itself. Don't use quotes when you want the value or values stored in a data object with that name.

Now do this.

> islands                    # Only the first four lines of output are shown here.
          Africa       Antarctica             Asia        Australia 
           11506             5500            16988             2968 
    Axel Heiberg           Baffin            Banks           Borneo 
              16              184               23              280 
...

This is called a named vector. Here is how to create one.

> x = c("Robert Culp","Natalie Wood","Elliott Gould","Dyan Cannon")
> x                          # The values are not named yet.
[1] "Robert Culp"   "Natalie Wood"  "Elliott Gould" "Dyan Cannon"
> names(x) = c("Bob","Carol","Ted","Alice")
> x                          # And now they are.
            Bob           Carol             Ted           Alice 
  "Robert Culp"  "Natalie Wood" "Elliott Gould"   "Dyan Cannon"
> x[Alice]                   # This is not correct! Why not?
Error: object "Alice" not found
> x["Alice"]
        Alice 
"Dyan Cannon"
> Alice = 4
> x[Alice]                   # Same thing as x[4].
        Alice 
"Dyan Cannon"

Confusing, right? You'll get used to it. This is a helpful example to study and play around with. (STUDENTS: That means study it and play around with it!)

The vector "x" now contains the names of the actors in the movie "Bob and Carol, Ted and Alice." The names() function was used to label these values with the names of the characters they played in the movie. Then we used the name of the character to retrieve the name of the actor. Dyan Cannon could also have been referred to as x[4]. Try it. (I have a very funny story about this movie, but this is not the place for it!)

In the "islands" vector, the data values are the size of the land mass in thousands of square miles. Each data value is named with the name of the land mass. Thus, to retrieve the area of Cuba, we do not need to know which of the data values is Cuba. We can retrieve the value by name. The name is put inside of square brackets just as if it were an index number, and it is quoted.

> islands["Cuba"]            # islands[12] would also work, if we'd only known!
Cuba
  43

Cuba has a land area of 43,000 square miles. Suppose you wanted to work with this data vector, but you wanted the land areas in square kilometers instead of square miles. The following procedure will allow this. First, use the data() function to write a copy of "islands" to your workspace. Then do the conversion. The converted values can either be stored back into the "islands" vector, in which case the old values are overwritten, or they can be stored into a new vector with a new name.

> data(islands)                          # writes a copy to your workspace
> ls()
[1] "Alice"   "islands" "Joy"     "x"       "y"       "z"       "z2"
> km_islands = islands * 2.59            # probably the best way
> km_islands["Cuba"]
  Cuba 
111.37
> islands = islands * 2.59               # overwrites the original data values
> islands["Cuba"]                        # the original data in miles are GONE!
  Cuba 
111.37

And finally...

> ls()
[1] "Alice"      "islands"    "Joy"        "km_islands" "x"         
[6] "y"          "z"          "z2"
> rm(list=ls())                          # clean up!
> ls()
character(0)

Vectors are used a lot in R. You should take some time to understand them.

Lists

Lists are collections of other R objects gathered into one place. To create a list, use the list() function.

> x=1:10                          # a vector
> y=matrix(1:12,nrow=3)           # a matrix
> z="Bill"                        # a character variable
> my.list=list(x,y,z)             # create the list
> my.list                         # view the list
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

[[3]]
[1] "Bill"

The output of a lot of R functions is actually composed of lists. Notice that items in a list are indexed by values inside double brackets. Thus...

> my.list[[3]]                    # The third item in my.list.
[1] "Bill"

To name the items in the list...

> names(my.list) = c("my.vector","my.matrix","my.name")
> my.list
$my.vector
 [1]  1  2  3  4  5  6  7  8  9 10

$my.matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

$my.name
[1] "Bill"

In R, the $ is used for list indexing. That is, it allows you to pull elements out of lists by name. First type the name of the list, followed by $, followed by the name of the item in the list. For example...

> my.list$my.name
[1] "Bill"

Kinda trivial in this case, but it won't be when you have a much longer list. That's enough on lists for now.

> ls()
[1] "my.list" "x"       "y"       "z"      
> rm(my.list,x,y,z)                      # Don't forget to clean up!

There is one more thing you should remember about lists. Data frames are actually lists. In fact, this is probably the most important thing you need to remember about lists!
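A quick way to convince yourself of this is to treat a built-in data frame exactly like the list above. The sketch uses "women", one of the data sets that comes with R, and stores nothing in the workspace.

```r
is.list(women)                 # a data frame really is a list
# [1] TRUE
names(women)                   # the items in the list are the columns
# [1] "height" "weight"
women$height[1:3]              # pull out a column by name with $
# [1] 58 59 60
women[[2]][1:3]                # or by position with double brackets
# [1] 115 117 120
```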

Matrices and Arrays

Essentially, these are both table-like objects. You saw how to create a matrix using the matrix() function in the last section on lists. Inside of this function you need to name the vector that you want matrixized (not really a word, I don't think), and you need to tell either how many rows or how many columns you want in the matrix. The matrix will be filled down the columns, as in the following example. To fill across the rows, set the byrow= option to TRUE. That's really enough for now. Except maybe for extracting values from one. The syntax is my.matrix[row,col], as follows.

> y = matrix(1:16, nrow=4)        # First we need a matrix! With 4 rows.
> class(y)                        # "matrix" here; R 4.0.0 and later print "matrix" "array"
[1] "matrix"
> y
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
> y[3,2]
[1] 7
> y = matrix(1:16, nrow=4, byrow=TRUE) # fill across rows instead of down columns
> y
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
> y[3,2]
[1] 10
> y = matrix(1:16, ncol=4, byrow=FALSE) # back to the original, but using ncol=

Remember this! When indexing a matrix (or any table-like object), always put the row index first followed by the column index, and always put the indexes inside of square brackets. Notice our matrix has no row names or column names. The notation [1,] means "row one, all columns". To recall an entire row or an entire column of a matrix (or an array or a table), do this.

> y[1,]                                # all values in row 1
[1]  1  5  9 13
> y[,3]                                # all values in column 3
[1]  9 10 11 12

More later on matrices, including how to name the rows and columns.

An array is like a matrix, except it can have more than two dimensions. In other words, a matrix is just a two-dimensional array.

> y = array(1:16, dim=c(4,2,2))
> y
, , 1

     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

, , 2

     [,1] [,2]
[1,]    9   13
[2,]   10   14
[3,]   11   15
[4,]   12   16

The array() function creates arrays. The "dim" option gives the number of rows, columns, and layers, in that order. Of course, this would be more useful if we were putting real data into the array rather than just the numbers 1 to 16. It was just a quick example. To put real data into a matrix or an array, simply put the data into a vector, and replace "1:16" with the name of the vector in the matrix() or array() function.

> x = c(1.26, 3.89, 4.20, 0.76, 2.22, 6.01, 5.29, 1.93, 3.27)
> y = matric(x, nrow=3)                # Hey! Everybody makes mistakes!
Error: could not find function "matric"
> y = matrix(x, nrow=3)
> y
     [,1] [,2] [,3]
[1,] 1.26 0.76 5.29
[2,] 3.89 2.22 1.93
[3,] 4.20 6.01 3.27
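Pulling values out of an array works the same way as for a matrix, with one more index for the layer. The sketch below rebuilds the 4 x 2 x 2 array of the numbers 1 to 16 under the hypothetical name arr, so that y is left untouched.

```r
arr = array(1:16, dim=c(4,2,2))    # arr is a made-up name for this illustration
dim(arr)                           # rows, columns, layers
# [1] 4 2 2
arr[3,2,1]                         # row 3, column 2, layer 1
# [1] 7
arr[1,2,2]                         # row 1, column 2, layer 2
# [1] 13
arr[,,2]                           # all of layer 2, which comes back as a 4 x 2 matrix
rm(arr)                            # clean up
```

Leaving an index position empty means "all of them," just as y[1,] means "row one, all columns" for a matrix.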

Don't forget to clean up.

Tables

If the function to create a matrix is matrix(), and the function to create an array is array(), I bet you can guess what function is used to create a table. It's used quite a bit differently, however. The table() function is used to create frequency tables or crosstabulations from raw data contained in a vector or a data frame. The result is something that looks, in many cases, very much like a matrix or an array, and behaves very much like one as well. For now, we will confine ourselves to one relatively simple example. First, we have to create some raw data.

> y = rnorm(100, mean=100, sd=15)        # 100 normally distributed random nos.
> y = round(y, 0)                        # Rounded to zero decimal places.

Once again, don't worry about the syntax of these statements. I'm just using them to create some data to put into a table. Since the values in the y vector are random, everyone's results here will be different. To view a frequency table (badly formatted, but...small steps!), do this.

> table(y)
y
 64  69  73  74  77  79  80  81  82  84  85  86  87  88  89  90  91  92  93 
  1   1   1   1   4   4   2   1   1   2   1   1   1   3   1   1   1   2   1 
 94  95  96  97  98  99 100 101 102 103 104 105 106 107 109 110 111 112 113 
  4   4   3   3   5   2   6   3   1   5   4   2   2   2   1   2   1   4   3 
114 116 117 118 119 120 123 125 129 
  2   2   1   1   2   1   1   2   1

The top row of numbers contains the data values, which we can see range from 64 to 129, and the bottom row of numbers gives the frequencies. The data value (i.e., y-value) of 100, for example, occurs 6 times in the data vector. (Once again, your result will be different.) Tables, of course, just like everything else in R, can be stored and then used for further analysis...

> table(y) -> myTable             # Store it.
> barplot(myTable)
> ls()
[1] "myTable" "y"
> rm(myTable, y)                  # And remember to clean up.

This table is (was!) one-dimensional. The "HairEyeColor" object we were playing with in a previous tutorial was a multidimensional table of frequencies, also called a crosstabulation. The table() function will also create crosstabulations.
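As a small sketch of a crosstabulation, give table() two vectors of the same length instead of one. The values below are the race and year columns from the five-subject example at the top of this tutorial; the object name xt is made up for the illustration.

```r
race = c("Cauc", "Af.Am", "Af.Am", "Cauc", "Asian")
year = c("Jr", "Fr", "Fr", "Sr", "So")
xt = table(race, year)             # rows are levels of race, columns are levels of year
xt
#        year
# race    Fr Jr So Sr
#   Af.Am  2  0  0  0
#   Asian  0  0  1  0
#   Cauc   0  1  0  1
xt["Af.Am", "Fr"]                  # indexed like a matrix: row first, then column
# [1] 2
rm(race, year, xt)                 # clean up
```

Notice that the columns come out in alphabetical order (Fr, Jr, So, Sr), not in the natural order of the school years. R has no way of knowing the natural order unless it is declared.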

Data Frames

Data frames are so important that I will devote an entire tutorial just to them. For now, if you want to see a few, try this. The output will not be shown. Look at your screen.

> women                           # average weight of women by height
> USArrests                       # crime statistics; scroll to see it all
> head(USArrests)                 # just the first six rows of data
> chickwts                        # chicken weights by feed type

The basic structure of a data frame is illustrated here. It's basically a table (in fact, it's a list of column vectors) in which each variable goes in its own column and each case goes in its own row.

Usually, data frames are read into the R workspace from external files, which may have been created using a spreadsheet. Small ones can be typed in at the command line, however. Let's use the data at the beginning of this tutorial to see how that would work.

> myFirstDataframe = data.frame(       # Press Enter to start a new line.
+    name=c("Bob","Fred","Barb","Sue","Jeff"),
+    age=c(21,18,18,24,20), hgt=c(70,67,64,66,72),
+    wgt=c(180,156,128,118,202),
+    race=c("Cauc","Af.Am","Af.Am","Cauc","Asian"),
+    year=c("Jr","Fr","Fr","Sr","So"),
+    SAT=c(1080,1210,840,1340,880))    # End with double close parenthesis. Why?
> myFirstDataframe
  name age hgt wgt  race year  SAT
1  Bob  21  70 180  Cauc   Jr 1080
2 Fred  18  67 156 Af.Am   Fr 1210
3 Barb  18  64 128 Af.Am   Fr  840
4  Sue  24  66 118  Cauc   Sr 1340
5 Jeff  20  72 202 Asian   So  880

That's probably not something you're going to want to do very often! In fact, I'd almost be willing to bet you got at least one comma, one quote, or one parenthesis out of place, and the whole thing failed because of that. I've gotten e-mails from a number of people telling me they couldn't get this to work, but I just tested it by copying and pasting, and it does work. Try highlighting the following text, copy, and then paste it into R at the command prompt. (May not work in RStudio.)

# Begin copying here.
myFirstDataframe = data.frame(
    name=c("Bob","Fred","Barb","Sue","Jeff"),
    age=c(21,18,18,24,20), hgt=c(70,67,64,66,72),
    wgt=c(180,156,128,118,202),
    race=c("Cauc","Af.Am","Af.Am","Cauc","Asian"),
    year=c("Jr","Fr","Fr","Sr","So"),
    SAT=c(1080,1210,840,1340,880))
# End copying here.
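One thing worth checking after building a data frame this way is how R has classified each column. Beginning with R 4.0.0, data.frame() leaves quoted values as character variables by default, while earlier versions turned them into factors automatically. The sketch below uses a cut-down version of the data under the hypothetical name d and sets the types explicitly, so the result is the same in either case.

```r
# A three-column version of the data; d is a made-up name for this illustration.
d = data.frame(name=c("Bob","Fred","Barb","Sue","Jeff"),
               race=c("Cauc","Af.Am","Af.Am","Cauc","Asian"),
               age=c(21,18,18,24,20))
sapply(d, class)                   # what R decided; depends on your R version

d$race = factor(d$race)            # declare race to be a factor
d$name = as.character(d$name)      # names are just labels, so keep them as character
levels(d$race)
# [1] "Af.Am" "Asian" "Cauc"
class(d$name)
# [1] "character"
mean(d$age)                        # numeric columns are ready for arithmetic
# [1] 20.2
rm(d)                              # clean up
```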

Last Word

Further details as needed on these data objects will be covered in future tutorials. For now, you should get the general idea.

revised 2016 January 18