R Tutorials Top Banner

MORE DESCRIPTIVE STATISTICS


Categorical Data

You summarize categorical data basically by counting up frequencies and by calculating proportions and percentages.

Categorical data are commonly encountered in three forms: a frequency table or crosstabulation, a flat table, or a case-by-case data frame. Let's begin with the last of these. Copy and paste the following lines ALL AT ONCE into R. That is, highlight these lines with your mouse, hit Ctrl-C on your keyboard, click at a command prompt in R, and hit Ctrl-V on your keyboard, and hit Enter if necessary, i.e., if R hasn't returned to a command prompt. On the Mac, use Command-C and Command-V. This will execute these lines as a script and create a data frame called "ucb" in your workspace. WARNING: Your workspace will also be cleared, so save anything you don't want to lose first.

# Begin copying here.
rm(list=ls())
gender = rep(c("female","male"),c(1835,2691))
admitted = rep(c("yes","no","yes","no"),c(557,1278,1198,1493))
dept = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
           c(89,17,202,131,94,24,19,8,391,244,299,317))
dept2 = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
            c(512,353,120,138,53,22,313,207,205,279,138,351))
department = c(dept,dept2)
ucb = data.frame(gender,admitted,department)
rm(gender,admitted,dept,dept2,department)
ls()
# End copying here.

Data sets that are purely categorical are not economically represented in case-by-case data frames, and so the built-in data sets that are purely categorical come in the form of tables (contingency tables or crosstabulations). We have just taken the data from one of these (the "UCBAdmissions" built-in data set) and turned it into a case-by-case data frame. It's the classic University of California, Berkeley, admissions data from 1973 describing admissions into six different graduate programs broken down by gender.

First, we want some general information about the data frame...

> str(ucb)                             # Examine the structure.
'data.frame':   4526 obs. of  3 variables:
 $ gender    : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
 $ admitted  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ department: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
> ucb[seq(1,4526,400),]                # Look at every 400th row.
     gender admitted department
1    female      yes          A
401  female      yes          D
801  female       no          C
1201 female       no          D
1601 female       no          F
2001   male      yes          A
2401   male      yes          B
2801   male      yes          C
3201   male       no          A
3601   male       no          C
4001   male       no          D
4401   male       no          F
There are 4526 cases or observations or people represented in this data frame, with three variables observed on each one: gender, admitted, and department the person applied to. All three variables are coded as factors, but this doesn't matter. For our present purposes, categorical variables and factors are the same thing.

What would we like to know? Here's a good start...

> summary(ucb)
    gender     admitted   department
 female:1835   no :2771   A:933     
 male  :2691   yes:1755   B:585     
                          C:918     
                          D:792     
                          E:584     
                          F:714
We now have frequency tables for each of the variables. We could get the same information for any individual variable using the table( ) function...
> table(ucb$gender)

female   male 
  1835   2691 
>
> table(ucb$admitted)

  no  yes 
2771 1755
>
> table(ucb$department)

  A   B   C   D   E   F 
933 585 918 792 584 714
These tables can easily be turned into relative frequency tables using the prop.table( ) function...
> table(ucb$department) -> dept.table  # Requires a table as an argument.
> prop.table(dept.table)               # Calculate proportions.

        A         B         C         D         E         F 
0.2061423 0.1292532 0.2028281 0.1749890 0.1290323 0.1577552 
>
> prop.table(dept.table) * 100         # Or calculate percentages.

       A        B        C        D        E        F 
20.61423 12.92532 20.28281 17.49890 12.90323 15.77552
As we see, 20.6% of the applicants applied to department A, 12.9% to department B, 20.3% to department C, etc.

The prop.table( ) function requires a table object as its argument, so the data must first be tabled before being prop.tabled.

Contingency tables or crosstabs can be produced using either the table( ) or xtabs( ) function. Table is easier, so I'll illustrate that first...

> with(ucb, table(gender, admitted))   # or table(ucb$gender, ucb$admitted)
        admitted
gender     no  yes
  female 1278  557
  male   1493 1198
Within the table( ) function, the first variable named will be in the rows of the contingency table, and the second variable named will be in the columns. If a third variable is named, it will form separate layers or strata of a three dimensional contingency table.

When using prop.table( ) on a multidimensional table, it's necessary to specify which marginal sums you want to use to calculate the proportions. To use the row sums, specify 1; to use the column sums, specify 2...

> with(ucb, table(gender, admitted)) -> gen.adm.table
> prop.table(gen.adm.table, 1)         # With respect to row marginal sums.
        admitted
gender          no       yes
  female 0.6964578 0.3035422
  male   0.5548123 0.4451877
>
> prop.table(gen.adm.table, 2)         # With respect to column marginal sums.
        admitted
gender          no       yes
  female 0.4612053 0.3173789
  male   0.5387947 0.6826211
The name of this option is "margin=", so this form could also have been used...
> prop.table(my.table, margin=1)
        admitted
gender          no       yes
  female 0.6964578 0.3035422
  male   0.5548123 0.4451877
Since "margin=" is the first thing prop.table( ) expects after the table name, either form will work here.

The xtabs( ) function works quite a bit differently. It uses a formula interface. The formula interface is used most often in model building and significance testing, so we'll see it a lot, and it's discussed in detail in another tutorial. Formulas can become quite complex, but their most basic form is as follows...

DV ~ IV1 + IV2 + IV3 + ...
First, a dependent or response variable is specified, followed by a tilde (in the upper left corner of most keyboards), which is read as "is a function of" or "is modeled by", and finally a list of independent or explanatory variables separated by plus signs. For xtabs( ) there is no DV (in a case-by-case data frame), so we just leave it out...
> with(ucb, xtabs(~ gender + admitted))
        admitted
gender     no  yes
  female 1278  557
  male   1493 1198
Instead of using with( ) to give the name of the data frame, we could also have used the data= option, since we are using a formula interface in the function...
> xtabs(~ gender + admitted, data=ucb)
        admitted
gender     no  yes
  female 1278  557
  male   1493 1198
The resulting table could also have been stored and operated on with other functions. Here are some examples...
> xtabs(~ gender + admitted, data=ucb) -> gen.adm.table
> prop.table(gen.adm.table, 1)         # Get proportions relative to row sums.
        admitted
gender          no       yes
  female 0.6964578 0.3035422
  male   0.5548123 0.4451877
>
> addmargins(gen.adm.table)            # Add marginal sums to table.
        admitted
gender     no  yes  Sum
  female 1278  557 1835
  male   1493 1198 2691
  Sum    2771 1755 4526
>
> margin.table(gen.adm.table, 1)       # Collapse over admitted (row marginals).
gender
female   male 
  1835   2691
>
> margin.table(gen.adm.table, 2)       # Collapse over sex (column marginals).
admitted
  no  yes 
2771 1755
Six different crosstabulations of the entire data set are possible, depending upon the order in which we list the variables...
> with(ucb, table(gender, department, admitted))
, , admitted = no

        department
gender     A   B   C   D   E   F
  female  19   8 391 244 299 317
  male   313 207 205 279 138 351

, , admitted = yes

        department
gender     A   B   C   D   E   F
  female  89  17 202 131  94  24
  male   512 353 120 138  53  22

> with(ucb, table(admitted, department, gender))
, , gender = female

        department
admitted   A   B   C   D   E   F
     no   19   8 391 244 299 317
     yes  89  17 202 131  94  24

, , gender = male

        department
admitted   A   B   C   D   E   F
     no  313 207 205 279 138 351
     yes 512 353 120 138  53  22
Etc.

A flat table is produced from the data frame by using the ftable( ) function. Use the "col.vars=" option to control which variable goes in the columns...

> ftable(ucb)                          # Last variable in columns.
                department   A   B   C   D   E   F
gender admitted                                   
female no                   19   8 391 244 299 317
       yes                  89  17 202 131  94  24
male   no                  313 207 205 279 138 351
       yes                 512 353 120 138  53  22
>
> ftable(ucb, col.vars="gender")       # Gender in columns.
                    gender female male
admitted department                   
no       A                     19  313
         B                      8  207
         C                    391  205
         D                    244  279
         E                    299  138
         F                    317  351
yes      A                     89  512
         B                     17  353
         C                    202  120
         D                    131  138
         E                     94   53
         F                     24   22
>
> ftable(ucb, col.vars="admitted")     # Admitted in columns.
                  admitted  no yes
gender department                 
female A                    19  89
       B                     8  17
       C                   391 202
       D                   244 131
       E                   299  94
       F                   317  24
male   A                   313 512
       B                   207 353
       C                   205 120
       D                   279 138
       E                   138  53
       F                   351  22
>
> ftable(ucb, col.vars="admitted") -> my.table
> prop.table(my.table, 1)              # prop.table() works here, too.
                  admitted         no        yes
gender department                               
female A                   0.17592593 0.82407407
       B                   0.32000000 0.68000000
       C                   0.65935919 0.34064081
       D                   0.65066667 0.34933333
       E                   0.76081425 0.23918575
       F                   0.92961877 0.07038123
male   A                   0.37939394 0.62060606
       B                   0.36964286 0.63035714
       C                   0.63076923 0.36923077
       D                   0.66906475 0.33093525
       E                   0.72251309 0.27748691
       F                   0.94101877 0.05898123
Flat tables can also be made from contingency tables, the order of the row variables can be changed, and multiple column variables can be specified...
> with(ucb, table(admitted,department,gender)) -> my.table      # A 3D contingency table.
> ftable(my.table)
                    gender female male
admitted department                   
no       A                     19  313
         B                      8  207
         C                    391  205
         D                    244  279
         E                    299  138
         F                    317  351
yes      A                     89  512
         B                     17  353
         C                    202  120
         D                    131  138
         E                     94   53
         F                     24   22
>
> ftable(my.table, col.vars="admitted")
                  admitted  no yes
department gender                 
A          female           19  89
           male            313 512
B          female            8  17
           male            207 353
C          female          391 202
           male            205 120
D          female          244 131
           male            279 138
E          female          299  94
           male            138  53
F          female          317  24
           male            351  22
>
> ftable(my.table, row.vars=c("gender","department"), col.vars="admitted")
                  admitted  no yes
gender department                 
female A                    19  89
       B                     8  17
       C                   391 202
       D                   244 131
       E                   299  94
       F                   317  24
male   A                   313 512
       B                   207 353
       C                   205 120
       D                   279 138
       E                   138  53
       F                   351  22
>
>  ftable(my.table, row.vars="department", col.vars=c("gender","admitted"))
           gender   female     male    
           admitted     no yes   no yes
department                             
A                       19  89  313 512
B                        8  17  207 353
C                      391 202  205 120
D                      244 131  279 138
E                      299  94  138  53
F                      317  24  351  22
Use the "row.vars=" option to control the order in which the row variables occur. If more than one variable name is to be used in either the "row.vars" or "col.vars" option, use vector notation to specify them. (HINT: In R, everything is a vector, even if it has only one element! The notation row.vars="department" is just a shortcut for row.vars=c("department").)

You can also get a VERY useful and more efficient data frame from a contingency table as follows...

> as.data.frame(my.table) -> my.df.table
> my.df.table
   admitted department gender Freq
1        no          A female   19
2       yes          A female   89
3        no          B female    8
4       yes          B female   17
5        no          C female  391
6       yes          C female  202
7        no          D female  244
8       yes          D female  131
9        no          E female  299
10      yes          E female   94
11       no          F female  317
12      yes          F female   24
13       no          A   male  313
14      yes          A   male  512
15       no          B   male  207
16      yes          B   male  353
17       no          C   male  205
18      yes          C   male  120
19       no          D   male  279
20      yes          D   male  138
21       no          E   male  138
22      yes          E   male   53
23       no          F   male  351
24      yes          F   male   22
There are only 24 possible unique cases in these data (2 genders times 2 admit statuses times 6 departments). So why put this information into a data frame with 4500+ rows in it? The form above is much more efficient. It lists the possible unique cases in the first three columns (every possible combination of factors) and then gives a frequency in the last Freq column.

When there is a Freq column in the data frame, you CANNOT use table( ) to get crosstabulations. You must use xtabs( ), AND you must specify Freq as the DV.

> xtabs(Freq ~ gender + admitted, data=my.df.table)
        admitted
gender     no  yes
  female 1278  557
  male   1493 1198
We'll look at graphics for categorical data in the next tutorial.

> rm(list=ls())

Numerical Data

Imagine a number line stretching all the way across the universe from negative infinity to positive infinity. Somewhere along this number line is your little mound of data, and you have to tell someone how to find it. What information would it be useful to give this person?

First, you might want to tell this person how big the mound is. Is she to look for a sizeable mound of data or just a little speck? Second, you might want to tell this person about where to look. Where is the mound centered on the number line? Third, you might want to tell this person how spread out to expect the mound to be. Are the data all in a nice, compact mound, or is it spread out all over the place? Finally, you might want to tell this person what shape to expect the mound to be.

The data frame "faithful" consists of 272 observations on the Old Faithful geyser in Yellowstone National Park. Each observation consists of two variables: "eruptions" (how long an eruption lasted in minutes), and "waiting" (how long in minutes the geyser was quiet before that eruption)...

> data(faithful)
> str(faithful)
'data.frame':   272 obs. of  2 variables:
 $ eruptions: num  3.60 1.80 3.33 2.28 4.53 ...
 $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
I'm going to attach this so that we can easily look at the variables inside. Remind me to detach it when we're done!
> attach(faithful)
We'll begin with the "waiting" variable. Let's see if we can characterize it.

First, we want to know how many values are in this variable. If this variable is our mound of data somewhere on that number line, how big a mound is it?

> length(waiting)
[1] 272
I don't suppose you were much surprised by that. We already knew that this data frame contains 272 cases, so each variable must be 272 cases long. There is a slight catch, however. The length( ) function counts missing values (NAs) as cases, and there is no way to stop it from doing so. That is, there is no na.rm= option for this function. In fact, there are no options at all. It will not even reveal the presence of missing values, and for this reason the length( ) function can give a misleading result when missing values are present.

The following command will tell you how many of the values in a variable are NAs (missing values)...

> sum(is.na(waiting))
[1] 0
It does this as follows. The is.na( ) function tests every value in a vector for missingness, returning either TRUE or FALSE for each value. Remember, in R, TRUEs add as ones and FALSEs add as zeroes, so when we sum the result of testing for missingness, we get the number of TRUEs, i.e., the number of missing values. It's kind of a nuisance to remember that, so I propose a new function, one that doesn't actually exist in R.

Here's one of the nicest things about R. If it doesn't do something you want it to do, you can write a function that does it. We'll talk about this in detail in a future tutorial, but for right now the basic idea is this. If you can calculate it at the command line, you can write a function to do it. In this case...

> length(waiting) - sum(is.na(waiting))
[1] 272
...would give us the number of nonmissing cases in "waiting". So do this (and be VERY CAREFUL with your typing)...
> sampsize = function(x) length(x) - sum(is.na(x))
> sampsize(waiting)
[1] 272
> ls()
[1] "faithful" "sampsize"
The first line creates a function called "sampsize" and gives it a mathematical definition (i.e., tells how to calculate it). The variable "x" is called a dummy variable, or a place holder. When we actually use the function, as we did in the second line, we put in the name of the vector we want "sampsize" calculated for. This takes the place of "x" in the calculation. Notice that an object called "sampsize" has been added to your workspace. It will be there until you remove it (or neglect to save the workspace at the end of this R session). Better yet, if you go back to your default working directoy and save it there, it will load automatically every time you start R. We'll do that at the end of this tutorial.

The second thing we want to know about our little mound of data is where it is located on our number line. Where is it centered? In other words, we want a measure of center or of central tendency. There are three that are commonly used: mean, median, and mode. The mode is not very useful and is rarely used in statistical calculations, so there is no R function for it. Mean and median, on'the other hand, are straightforward...

> mean(waiting)
[1] 70.89706
> median(waiting)
[1] 76
Now we know to hunt around somewhere in the seventies for our data.

How spread out should we expect it to be? This is given to us by a measure of spread, also called a measure of variability or a measure of dispersion. There are several of these in common usage: the range, the interquartile range, the variance, and the standard deviation. Here's how to get them...

> range(waiting)
[1] 43 96
> IQR(waiting)
[1] 24
> var(waiting)
[1] 184.8233
> sd(waiting)
[1] 13.59497
There are several things you should be aware of here. First, range( ) does not actually give the range (difference between the maximum and minimum values), it gives the minimum and maximum value. So the values in "waiting" range from a minimum of 43 to a maximum of 96.

Second, the interquartile range (IQR) is defined as the difference between the value at the 3rd quartile and the value at the 1st quartile. However, there is not universal agreement on how these values should be calculated, and different software packages do it differently. R allows nine different methods to calculate these values, and the one used by default is NOT the one you were probably taught in elementary statiistics. So the result given by IQR( ) is not the one you might get if you do the calculation by hand (or on a TI-84 calculator). It should be very close though. If your instructor insists that you get the "calculator" value, do the calculation this way...

> quantile(waiting, c(.75,.25), type=2)
75% 25% 
 82  58
...and then subtract those two values (82-58=24). In this case, the answer is the same. While we're here, this would be a good time to mention percentiles. Do you need the 65th percentile of the "waiting" values. Here it is...
> quantile(waiting, .65)
65% 
 79
The value waiting = 79 is at the 65th percentile.

The last two things you need to be aware of is that the variance and standard deviation calculated above are the sample values. That is, they are calculated as estimates from a sample of the population values (the n − 1 method). There are no functions for the population variance and standard deviation, although they are easily enough calculated if you need them.

You may remember from your elementary stats course a statistic called the standard error of the mean, or SEM. Technically, SEM is a measure of error and not of variability, but you may need to calculate it nevertheless. There is no R function for it, so let's write one.

-----This is optional. You can skip this if you want.-----

The standard error of the sample mean of "waiting" can be calculated like this...

> sqrt(var(waiting)/length(waiting))
[1] 0.8243164
...the square root of the (variance divided by the sample size). This calculation will choke on missing values, however. Let's see that by adding a missing value to "waiting"...
> waiting -> wait2
> wait2[100] <- NA
> sqrt(var(wait2)/length(wait2))
[1] NA
It's the var( ) function that coughed up the hairball, so...
> sqrt(var(wait2, na.rm=T)/length(wait2))
[1] 0.8248208
...seems to fix things. But it doesn't. This isn't quite the correct answer, and that is because length( ) gave us the wrong sample size. (It counted the case that was missing.) We could try this...
> sqrt(var(wait2, na.rm=T)/sampsize(wait2))
[1] 0.8263412
...but that depends upon the existence of the "sampsize" function. If that somehow gets erased, this calculation will fail. Here's one of the disadvantages of using R: You have to know what you're doing. Unlike other statistical packages, such as SPSS, to use R you occasionally have to think. It appears some of that is necessary here...
> sem = function(x) sqrt(var(x, na.rm=T)/(length(x)-sum(is.na(x))))
> sem(wait2)
[1] 0.8263412
> ls()
[1] "faithful" "sampsize" "sem"      "wait2"
We now have a perfectly good function for calculating SEM in our workspace, and we will not let it get away!

--------This is the end of the optional material.--------

Before we move on, I should remind you that the summary( ) function will give you quite a bit of the above information in one go...

> summary(waiting)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   43.0    58.0    76.0    70.9    82.0    96.0
It will tell you if there are missing values, too...
> summary(wait2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  43.00   58.00   76.00   70.86   82.00   96.00    1.00
The number under NA's is a frequency. There is 1 missing value in "wait2".

The last thing we want to know (for the time being anyway) about our little mound of data is what its shape is. Is it a nice mound-shaped mound, or is it deformed in some way, skewed or bimodal or worse? This will involve looking at some sort of frequency distribution.

The table( ) function will do that as well for a numerical variable as it will for a categorical variable, but the result may not be pretty (try this with eruptions--I dare you!)...

> table(waiting)
waiting
43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70 
 1  3  5  4  3  5  5  6  5  7  9  6  4  3  4  7  6  4  3  4  3  2  1  1  2  4 
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96 
 5  1  7  6  8  9 12 15 10  8 13 12 14 10  6  6  2  6  3  6  1  1  2  1  1
Looking at this reveals that we have one value of 43, three values of 45, five values of 46, and so on. Surely there's a better way! We could try a grouped frequency distribution...
> cut(waiting, breaks=10) -> wait3
> table(wait3)
wait3
(42.9,48.3] (48.3,53.6] (53.6,58.9] (58.9,64.2] (64.2,69.5] (69.5,74.8] 
         16          28          26          24           9          23 
(74.8,80.1] (80.1,85.4] (85.4,90.7] (90.7,96.1] 
         62          55          23           6
The cut( ) function cuts a numerical variable into class intervals, the number of class intervals given (approximately) by the breaks= option. (R has a bit of a mind of it's own, so if you pick a clumbsy number of breaks, R will fix that for you.) The notation (x,y] means the class interval goes from x to y, with x not being included in the interval and y being included.

Another disadvantage of using R is that it is intended to be utilitarian. The output will be useful but not necessarily pretty. We can pretty that up a little bit with a trick...

> as.data.frame(table(wait3))
         wait3 Freq
1  (42.9,48.3]   16
2  (48.3,53.6]   28
3  (53.6,58.9]   26
4  (58.9,64.2]   24
5  (64.2,69.5]    9
6  (69.5,74.8]   23
7  (74.8,80.1]   62
8  (80.1,85.4]   55
9  (85.4,90.7]   23
10 (90.7,96.1]    6

You can also specify the break points yourself in a vector, if you are that anal retentive...

> cut(waiting, breaks=seq(40,100,5)) -> wait4
> as.data.frame(table(wait4))
      wait4 Freq
1   (40,45]    4
2   (45,50]   22
3   (50,55]   33
4   (55,60]   24
5   (60,65]   14
6   (65,70]   10
7   (70,75]   27
8   (75,80]   54
9   (80,85]   55
10  (85,90]   23
11  (90,95]    5
12 (95,100]    1
We are getting a hint of a bimodal distribution of values here.

A better way of seeing a grouped frequency distribution is by creating a stem-and-leaf display...

> stem(waiting)

  The decimal point is 1 digit(s) to the right of the |

  4 | 3
  4 | 55566666777788899999
  5 | 00000111111222223333333444444444
  5 | 555555666677788889999999
  6 | 00000022223334444
  6 | 555667899
  7 | 00001111123333333444444
  7 | 555555556666666667777777777778888888888888889999999999
  8 | 000000001111111111111222222222222333333333333334444444444
  8 | 55555566666677888888999
  9 | 00000012334
  9 | 6

> stem(waiting, scale=.5)

  The decimal point is 1 digit(s) to the right of the |

  4 | 355566666777788899999
  5 | 00000111111222223333333444444444555555666677788889999999
  6 | 00000022223334444555667899
  7 | 00001111123333333444444555555556666666667777777777778888888888888889
  8 | 00000000111111111111122222222222233333333333333444444444455555566666
  9 | 000000123346
Use the "scale=" option to determine how many lines are in the display. The value of scale= is not actually the number of lines, so this may take some fiddling. The first of those displays clearly shows the bimodal structure of this variable.

Oftentimes, we are faced with summarizing numerical data by group membership, or as indexed by a factor. The tapply( ) and by( ) functions are most useful here...

> detach(faithful)
> rm(faithful, wait2, wait3, wait4)         # Out with the old...
> as.data.frame(state.x77) -> states
> states$region <- state.region
> head(states)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
           region
Alabama     South
Alaska       West
Arizona      West
Arkansas    South
California   West
Colorado     West

The "state.x77" data set in R is a collection of information about the U.S. states (in 1977), but it is in the form of a matrix. We converted it to a data frame. All the variables are numerical, so we created a factor in the data frame by adding information about census region, from the built-in data set state.region. Now we can do interesting things like break down life expectancy by regions of the country. But hold on there! Some very foolish person has named that variable "Life Exp", with a space in the middle of the name. How do we deal with that?

> names(states)
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"    
[6] "HS Grad"    "Frost"      "Area"       "region"    
> by(data=states[4], IND=states[9], FUN=mean)
region: Northeast
[1] 71.26444
------------------------------------------------------------ 
region: South
[1] 69.70625
------------------------------------------------------------ 
region: North Central
[1] 71.76667
------------------------------------------------------------ 
region: West
[1] 71.23462
>
> tapply(X=states[,4], IND=states[,9], FUN=mean)
    Northeast         South North Central          West 
     71.26444      69.70625      71.76667      71.23462
That's one way. There are a couple things to notice here. The arguments to by( ) are listed in the default order, so you don't really have to name them. The same is true of tapply( ). Notice also that you can be a lot more loosey goosey with the indexing in by( ) than in tapply( ). You only have to give the column number that you want from the data frame.

Of course, we could also rename the column and use the new, more sensible name...

> names(states)[4]                     # That's the one we want!
[1] "Life Exp"
> names(states)[4] <- "Life.Exp"
> tapply(states$Life.Exp, states$region, mean)
    Northeast         South North Central          West 
     71.26444      69.70625      71.76667      71.23462
Now, before we quit and clean up, let's save those functions in our default working directory.
> ls()
[1] "sampsize" "sem"      "states"  
> rm(states)                           # Do some cleaning up first.
> setwd("..")                          # Go back one directory level from Rspace.
> getwd()                              # Check to be sure you're in the right place!
[1] "C:/Documents and Settings/kingw/My Documents"
> ### I'm in WinXP at the moment. This should be your home directory in
> ### any other OS. Now assuming you have both sampsize and sem...
> load(".RData")                       # To preserve anything already there...
> ls()
[1] "age.out"       "m.ex"          "m.ex.minussex" "outcome.out"  
[5] "respiratory"   "sampsize"      "seizure"       "sem"          
[9] "visit.matrix"
> rm(age.out,m.ex,m.ex.minussex,outcome.out,
+    respiratory,seizure,visit.matrix) # I don't want any of that old junk!
> ls()                                 # Check to see functions are still there.
[1] "sampsize" "sem"
> quit("yes")                          # Quit saving workspace.
The next time you start R, those functions will be loaded automatically.

On to graphical summaries!


revised 2010 August 3
Return to the Table of Contents