MORE DESCRIPTIVE STATISTICS
Categorical Data: Crosstabulations
Categorical data are summarized by counting up frequencies and
by calculating proportions and percentages.
Categorical data are commonly encountered in three forms: a frequency table
or crosstabulation, a flat table, or a case-by-case data frame. Let's begin
with the first of these.
In the default download of R, there is a data set called UCBAdmissions,
which gives admissions data for the six largest graduate departments at UC
Berkeley in 1973.
You might want to enlarge your R Console window a bit so you can see all the
data at once.
> setwd(Rspace) # Okay. Everybody forgets once in a while. Give me a break!
Error in setwd(Rspace) : object 'Rspace' not found
> setwd("Rspace") # if you've created this directory
> UCBAdmissions
, , Dept = A
Gender
Admit Male Female
Admitted 512 89
Rejected 313 19
, , Dept = B
Gender
Admit Male Female
Admitted 353 17
Rejected 207 8
, , Dept = C
Gender
Admit Male Female
Admitted 120 202
Rejected 205 391
, , Dept = D
Gender
Admit Male Female
Admitted 138 131
Rejected 279 244
, , Dept = E
Gender
Admit Male Female
Admitted 53 94
Rejected 138 299
, , Dept = F
Gender
Admit Male Female
Admitted 22 24
Rejected 351 317
This is a three-dimensional table, specifically a 2x2x6 contingency table. In
R, the dimensions are recognized in this order: rows (the "Admit" variable in
this table), columns (the "Gender" variable in this table), and layers (the
"Dept" variable in this table). The numbers in the table are frequencies, for
example, how many male applicants admitted to department A (512). We don't have
information on individual subjects, but if we did, the first one of them would
be someone who was admitted, would be male, and would have applied to dept. A,
and that would be it. Thus, you can see that the frequency table, or crosstabs,
contains complete information on our subjects.
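If you'd like to see how R itself reports those dimensions (rows first,
columns second, layers third), dim() will show you.
> dim(UCBAdmissions)
[1] 2 2 6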
How would we create this table ourselves? Well, we would start with all of
those numbers in a vector. The tricky part is getting them in the correct order,
and to do that we need to know in what order R is going to stick them into the
table. R will pick a layer of the table, beginning with the first one (A), and
fill down the columns, then move to the next layer and fill down the columns,
and so on. Thus, when the admissions office at Berkeley sends us these numbers,
we want to create the following vector from them. (Use
scan() to enter the vector if you prefer. I'm going to use c() because it will keep things clearer, hopefully.)
> OF = c(512, 313, 89, 19, # Dept A, males adm, males rej, females adm, females rej
+ 353, 207, 17, 8, # Dept B
+ 120, 205, 202, 391, # Dept C
+ 138, 279, 131, 244, # Dept D
+ 53, 138, 94, 299, # Dept E
+ 22, 351, 24, 317) # Dept F
Yes, you can bet I spaced it just like that, and I included the comments. When I
come back and look at this later, I want to know exactly what I did! Another
confession: If I get one of those numbers wrong, I don't want to have to retype
the whole gosh darned thing! So I typed it into a script window (no command
prompts typed there), and then copied and pasted it into R, a good way to deal
with long commands, and also the way I'm going to deal with the next one! (To
get a script window, go to File > New Script in Windows, or File > New Document
on a Mac, or open a text editor in Linux.) So my
entire script looks like this, which you can copy and paste if your fingers are
broken and you can't type today:
OF = c(512, 313, 89, 19, # Dept A, males adm, males rej, females adm, females rej
353, 207, 17, 8, # Dept B
120, 205, 202, 391, # Dept C
138, 279, 131, 244, # Dept D
53, 138, 94, 299, # Dept E
22, 351, 24, 317) # Dept F
UCB = array(data=OF, dim=c(2,2,6), dimnames=list(
Admit=c("admitted","rejected"),
Gender=c("male","female"),
Dept=c("A","B","C","D","E","F")
))
I could also have used array() and dimnames() separately, as follows.
UCB = array(data=OF, dim=c(2,2,6))
dimnames(UCB) = list(
Admit=c("admitted","rejected"),
Gender=c("male","female"),
Dept=c("A","B","C","D","E","F"))
Print out (to the Console) the table this creates in your workspace and
confirm it is the same as the one above.
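Or, if you'd rather have R do the checking, here's one quick way. (This
compares the frequencies cell by cell; it ignores the dimnames, which we
typed in lowercase.)
> all(UCB == UCBAdmissions)
[1] TRUE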
Is this table arranged the way you really want it? I.e., do you really want
Admit in the rows, Gender in the columns, and Dept in the layers? Maybe you do,
but then again, maybe you don't. I think the most interesting problem here is
to compare admission by gender.
If that's what we're interested in, then why don't we just collapse the
table across (or over) departments? That would be easy enough to do, but we
need to
remember that, in R, each dimension of the table has a number, and that rows
are 1, columns are 2, and layers are 3. Then we can use the margin.table() function to collapse.
> margin.table(UCB, margin=c(1,2)) # omit the "margin" that you want to collapse over
Gender
Admit male female
admitted 1198 557
rejected 1493 1278
> margin.table(UCB, margin=c(2,1)) # put Gender in the rows and Admit in the columns
Admit
Gender admitted rejected
male 1198 1493
female 557 1278
> margin.table(UCB, margin=c(2,1)) -> Admit.by.Gender # store that second one
The name of the command is more sensible than it may seem at first blush. If
you can imagine what the 3D table looks like in 3D, it would resemble a box of
crackers. The box is divided into 2x2x6=24 little cells, and each cell has a
frequency in it. At the end of the box there is a margin (also above and in
front). We're going to fill that margin with sums by adding frequencies across
the length of the box (across departments). Those sums are called marginal sums or
marginal frequencies, and they are in this case what you saw in the first of
those margin.tables above.
The other marginal frequencies are obtained easily.
> margin.table(UCB, margin=c(1,3)) # the margin in front of the box
Dept
Admit A B C D E F
admitted 601 370 322 269 147 46
rejected 332 215 596 523 437 668
> margin.table(UCB, margin=c(2,3)) # the margin above the box
Dept
Gender A B C D E F
male 825 560 325 417 191 373
female 108 25 593 375 393 341
If we examine Admit.by.Gender, we can see without much further analysis that
something appears to be fishy. About twice as many males as females were
admitted, but the numbers of males and females rejected are about the same.
That's a little easier to visualize if we convert the frequencies to
proportions. The Admit.by.Gender table has two margins, the row margin (1) and
the column margin (2).
> prop.table(Admit.by.Gender, margin=2) # proportions with respect to column marginals
Admit
Gender admitted rejected
male 0.6826211 0.5387947
female 0.3173789 0.4612053
> prop.table(Admit.by.Gender, margin=1) # proportions with respect to row marginals
Admit
Gender admitted rejected
male 0.4451877 0.5548123
female 0.3035422 0.6964578
From the first table, 68% of those admitted were male, and 32% were female,
while 54% of those rejected were male, and 46% were female. From the second
table, 45% of males were admitted, 55% rejected, while 30% of
females were admitted, 70% rejected.
This is a famous problem in statistics. When we throw out the Dept
information, it appears that males are being favored. What do we see if we do
not throw out the Dept information? Now I think it becomes convenient to
rearrange the whole table, and that can be done with a function we already
know. I'm going to put Admit (1 in the original table) in the rows, Dept (3 in
the original table) in the columns, and Gender (2 in the original table) in the
layers.
> margin.table(UCB, margin=c(1,3,2))
, , Gender = male
Dept
Admit A B C D E F
admitted 512 353 120 138 53 22
rejected 313 207 205 279 138 351
, , Gender = female
Dept
Admit A B C D E F
admitted 89 17 202 131 94 24
rejected 19 8 391 244 299 317
> margin.table(UCB, margin=c(1,3,2)) -> temp
> prop.table(temp, margin=c(2,3))
, , Gender = male
Dept
Admit A B C D E F
admitted 0.6206061 0.6303571 0.3692308 0.3309353 0.2774869 0.05898123
rejected 0.3793939 0.3696429 0.6307692 0.6690647 0.7225131 0.94101877
, , Gender = female
Dept
Admit A B C D E F
admitted 0.8240741 0.68 0.3406408 0.3493333 0.2391858 0.07038123
rejected 0.1759259 0.32 0.6593592 0.6506667 0.7608142 0.92961877
62% of male applicants to Dept A were admitted, while 82% of female
applicants were. 63% of male applicants to Dept B were admitted, while 68% of
female applicants were. And so on. Department by department, the women mostly
fared as well as or better than the men, even though the collapsed table
seemed to show just the opposite. This is called Simpson's paradox.
For percentages instead of proportions, multiply by 100.
> temp2 = prop.table(temp, margin=c(2,3)) * 100
> round(temp2, 1)
, , Gender = male
Dept
Admit A B C D E F
admitted 62.1 63 36.9 33.1 27.7 5.9
rejected 37.9 37 63.1 66.9 72.3 94.1
, , Gender = female
Dept
Admit A B C D E F
admitted 82.4 68 34.1 34.9 23.9 7
rejected 17.6 32 65.9 65.1 76.1 93
By the way, I think it's poor form that R doesn't give me the .0 when that's
what the number rounds to.
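If the missing .0 bothers you as much as it bothers me, the format() function
will pad numbers out to a fixed number of decimal places, at the cost of
converting them to character strings.
> format(63, nsmall=1)
[1] "63.0"
> format(round(temp2, 1), nsmall=1) # works on the whole table; output not shown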
> ls()
[1] "Admit.by.Gender" "OF" "temp" "temp2"
[5] "UCB"
> rm(Admit.by.Gender, temp, temp2)
Categorical Data: Flat Tables
The ftable() function produces flat tables. As you can see, they contain the
same frequency information as the crosstabs.
> ftable(UCB)
Dept A B C D E F
Admit Gender
admitted male 512 353 120 138 53 22
female 89 17 202 131 94 24
rejected male 313 207 205 279 138 351
female 19 8 391 244 299 317
> ftable(UCB, col.vars="Gender")
Gender male female
Admit Dept
admitted A 512 89
B 353 17
C 120 202
D 138 131
E 53 94
F 22 24
rejected A 313 19
B 207 8
C 205 391
D 279 244
E 138 299
F 351 317
> ftable(UCB, row.vars=c("Dept","Gender"), col.vars="Admit")
Admit admitted rejected
Dept Gender
A male 512 313
female 89 19
B male 353 207
female 17 8
C male 120 205
female 202 391
D male 138 279
female 131 244
E male 53 138
female 94 299
F male 22 351
female 24 317
> prop.table(ftable(UCB, row.vars=c("Dept","Gender"), col.vars="Admit"), 1)
Admit admitted rejected
Dept Gender
A male 0.62060606 0.37939394
female 0.82407407 0.17592593
B male 0.63035714 0.36964286
female 0.68000000 0.32000000
C male 0.36923077 0.63076923
female 0.34064081 0.65935919
D male 0.33093525 0.66906475
female 0.34933333 0.65066667
E male 0.27748691 0.72251309
female 0.23918575 0.76081425
F male 0.05898123 0.94101877
female 0.07038123 0.92961877
Categorical Data: Data Frames
R often gives a table like the following and calls it a data frame.
> as.data.frame.table(UCB)
Admit Gender Dept Freq
1 admitted male A 512
2 rejected male A 313
3 admitted female A 89
4 rejected female A 19
5 admitted male B 353
6 rejected male B 207
7 admitted female B 17
8 rejected female B 8
9 admitted male C 120
10 rejected male C 205
11 admitted female C 202
12 rejected female C 391
13 admitted male D 138
14 rejected male D 279
15 admitted female D 131
16 rejected female D 244
17 admitted male E 53
18 rejected male E 138
19 admitted female E 94
20 rejected female E 299
21 admitted male F 22
22 rejected male F 351
23 admitted female F 24
24 rejected female F 317
Is this a true data frame? Well, it will function like one, and such tables are
used in procedures that we will encounter in the future. The definition of data
frame I gave in an earlier tutorial was a table of cases by
variables. And cases are usually our experimental units or subjects. Here we
have...
> sum(OF)
[1] 4526
...cases or subjects or applicants to Berkeley grad programs. A case-by-case
data frame, therefore, would have 4,526 rows and 3 columns. We would construct
such a table if we wished to keep track of every single applicant, by name
perhaps. "Fred Smith applied to one of these programs in 1974. What were his
particulars?" We could find that information in our case-by-case data frame.
The following script creates such a data frame from the UCB data. I don't
expect that you will be itching to type this line by line, so just copy and
paste it into R, and the data frame will appear in your workspace.
### begin copying here
gender = rep(c("female","male"),c(1835,2691))
admitted = rep(c("yes","no","yes","no"),c(557,1278,1198,1493))
dept = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
c(89,17,202,131,94,24,19,8,391,244,299,317))
dept2 = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
c(512,353,120,138,53,22,313,207,205,279,138,351))
department = c(dept,dept2)
UCBdf = data.frame(gender,admitted,department)
rownames(UCBdf)[3101] = "Fred.Smith"
rm(gender,admitted,dept,dept2,department)
### end copying here and paste into R
It's a data frame, so...
> ls()
[1] "OF" "UCB" "UCBdf"
> dim(UCBdf)
[1] 4526 3
> summary(UCBdf)
gender admitted department
female:1835 no :2771 A:933
male :2691 yes:1755 B:585
C:918
D:792
E:584
F:714
I didn't give everyone in the data frame his or her name, but I did identify
Fred.Smith, and so we can recall his information by name. (Everyone else has
to be recalled by his or her secret code number.)
> UCBdf["Fred.Smith",]
gender admitted department
Fred.Smith male no A
Fred was male, applied to Dept A, and was not admitted. Too bad, Fred.
We can summarize the data in this rather largish data frame as follows.
> table(UCBdf$gender) # frequencies by gender
female male
1835 2691
>
> table(UCBdf$admitted) # frequencies by admitted
no yes
2771 1755
>
> table(UCBdf$department) # frequencies by department
A B C D E F
933 585 918 792 584 714
>
> with(UCBdf, table(gender, admitted)) # a crosstabulation
admitted
gender no yes
female 1278 557
male 1493 1198
>
> with(UCBdf, table(gender, admitted)) -> admitted.by.gender
> prop.table(admitted.by.gender, margin=1)
admitted
gender no yes
female 0.6964578 0.3035422
male 0.5548123 0.4451877
Some of this information we already had from summary(),
but I wanted to demonstrate the table() function.
The xtabs() function can also be used to do
crosstabulations. It takes a formula interface. The syntax of the formula is
a tilde followed by the variables you wish to crosstabulate, separated by +
signs.
> xtabs( ~ admitted + department, data=UCBdf)
department
admitted A B C D E F
no 332 215 596 523 437 668
yes 601 370 322 269 147 46
> xtabs( ~ admitted + department + gender, data=UCBdf)
, , gender = female
department
admitted A B C D E F
no 19 8 391 244 299 317
yes 89 17 202 131 94 24
, , gender = male
department
admitted A B C D E F
no 313 207 205 279 138 351
yes 512 353 120 138 53 22
You can store the table generated by either one of these functions as an object
in your workspace and then operate on it with other functions, most of which
are already familiar to you. All of these functions require table objects as
an argument.
> xtabs( ~ admitted + department, data=UCBdf) -> admitted.by.department
> prop.table(admitted.by.department, margin=1)
department
admitted A B C D E F
no 0.11981234 0.07758932 0.21508481 0.18874053 0.15770480 0.24106821
yes 0.34245014 0.21082621 0.18347578 0.15327635 0.08376068 0.02621083
> prop.table(admitted.by.department, margin=2)
department
admitted A B C D E F
no 0.35584137 0.36752137 0.64923747 0.66035354 0.74828767 0.93557423
yes 0.64415863 0.63247863 0.35076253 0.33964646 0.25171233 0.06442577
> margin.table(admitted.by.department, margin=1)
admitted
no yes
2771 1755
> addmargins(admitted.by.department)
department
admitted A B C D E F Sum
no 332 215 596 523 437 668 2771
yes 601 370 322 269 147 46 1755
Sum 933 585 918 792 584 714 4526
We'll look at graphics for categorical data in the next tutorial.
> ls()
[1] "admitted.by.department" "admitted.by.gender"
[3] "OF" "UCB"
[5] "UCBdf"
> rm(admitted.by.department,admitted.by.gender)
> ls()
[1] "OF" "UCB" "UCBdf"
> save.image("UCB.RData") # just in case - you never know!
> rm(list=ls()) # clear the workspace
Numeric Data
Imagine a number line stretching all the way across the universe from
negative infinity to positive infinity. Somewhere along this number line is
your little mound of data, and you have to tell someone how to find it. What
information would it be useful to give this person?
First, you might want to tell this person how big the mound is. Is she to
look for a sizeable mound of data or just a little speck? Second, you might
want to tell this person about where to look. Where is the mound centered on
the number line? Third, you might want to tell this person how spread out to
expect the mound to be. Are the data all in a nice, compact mound, or are
they spread out all over the place? Finally, you might want to tell this person
what shape to expect the mound to be.
The built-in data frame "faithful" consists of 272 observations on the
Old Faithful geyser in Yellowstone National Park. Each observation consists
of two variables: "eruptions" (how long an eruption lasted in minutes), and
"waiting" (how long in minutes the geyser was quiet before that eruption).
> data(faithful) # put a copy in your workspace
> str(faithful)
'data.frame': 272 obs. of 2 variables:
$ eruptions: num 3.60 1.80 3.33 2.28 4.53 ...
$ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
I'm going to attach this so that we can easily look at the variables inside.
Remind me to detach it when we're done! I sometimes forget.
> attach(faithful)
We'll begin with the "waiting" variable. Let's see if we can characterize
it.
First, we want to know how many values are in this variable. If this variable
is our mound of data somewhere on that number line, how big a mound is it?
> length(waiting)
[1] 272
I don't suppose you were much surprised by that. We already knew that this
data frame contains 272 cases, so each variable must be 272 cases long. There
is a slight catch, however. The length()
function counts missing values (NAs) as cases, and there is no way to stop it
from doing so. That is, there is no na.rm= option for this function. In fact,
there are no options at all. It will not even reveal the presence of missing
values, and for this reason the length()
function can give a misleading result when missing values are present.
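A quick demonstration on a made-up vector shows the problem.
> length(c(18, NA, 25)) # the NA is counted as a case
[1] 3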
The following command will tell you how many of the values in a variable
are NAs (missing values).
> sum(is.na(waiting))
[1] 0
It does this as follows. The is.na()
function tests every value in a vector for missingness, returning either TRUE
or FALSE for each value. Remember, in R, TRUEs add as ones and FALSEs add as
zeroes, so when we sum the result of testing for missingness, we get the
number of TRUEs, i.e., the number of missing values. It's kind of a nuisance
to remember that, so I propose a new function, one that doesn't actually
exist in R.
Here's one of the nicest things about R. If it doesn't do something you want
it to do, you can write a function that does it. We'll talk about this in detail
in a future tutorial, but for right now the basic idea is this. If you can
calculate it at the command line, you can write a function to do it. In this
case...
> length(waiting) - sum(is.na(waiting))
[1] 272
...would give us the number of nonmissing cases in "waiting". So do this (and be
VERY CAREFUL with your typing).
> sampsize = function(x) length(x) - sum(is.na(x)) # this creates the function
> sampsize(waiting) # which we use like any other function
[1] 272
> ls()
[1] "faithful" "sampsize"
The first line creates a function called "sampsize" and gives it a mathematical
definition (i.e., tells how to calculate it). The variable "x" is called a
dummy variable, or a place holder. When we actually use the function, as we
did in the second line, we put in the name of the vector we want "sampsize"
calculated for. This takes the place of "x" in the calculation. Notice that an
object called "sampsize" has been added to your workspace. It will be there
until you remove it (or neglect to save the workspace at the end of this R
session).
Of course, the summary() function also reports
missing values (but not the length of the vector!). If there are no missing
values, summary won't mention them.
> summary(waiting) # no mention of NAs; therefore, there are none
Min. 1st Qu. Median Mean 3rd Qu. Max.
43.0 58.0 76.0 70.9 82.0 96.0
The second thing we want to know about our little mound of data is where it
is located on our number line. Where is it centered? In other words, we want a
measure of center or of central tendency. There are three that are commonly
used: mean, median, and mode. The mode is not very useful and is rarely used in
statistical calculations, so there is no R function for it. Mean and median,
on the other hand, are straightforward.
> mean(waiting)
[1] 70.89706
> median(waiting)
[1] 76
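As for the mode, if you ever do need it, tabulate the values and ask which
frequency is largest. (This is just a handy idiom, not a built-in function.)
> names(which.max(table(waiting)))
[1] "78"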
Now we know to hunt around somewhere in the seventies for our data.
How spread out should we expect it to be? This is given to us by a measure
of spread, also called a measure of variability or a measure of dispersion.
There are several of these in common usage: the range, the interquartile range,
the variance, and the standard deviation, to name a few. Here's how to get them.
> range(waiting)
[1] 43 96
> IQR(waiting)
[1] 24
> var(waiting)
[1] 184.8233
> sd(waiting)
[1] 13.59497
There are several things you should be aware of here. First, range() does not
actually give the range (the difference between the maximum and minimum
values); it gives the minimum and maximum values themselves. So the values in
"waiting" range from a minimum of 43 to a maximum of 96.
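If you want the range as a single number, wrap range() in diff().
> diff(range(waiting))
[1] 53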
Second, the interquartile range (IQR) is defined as the difference between
the value at the 3rd quartile and the value at the 1st quartile. However, there
is not universal agreement on how these values should be calculated, and
different software packages do it differently. R allows nine different methods
to calculate these values, and the one used by default is NOT the one you
were probably taught in elementary statistics. So the result given by IQR()
may not be the one you would get if you did the calculation by hand (or on a
TI-84 calculator). It should be very close, though. If your instructor
insists that you get the "calculator" value, do the calculation this way.
> IQR(waiting, type=2)
[1] 24
In this case, the answer is the same. The first quartile is at the 25th
percentile, and the third quartile is at the 75th percentile. What are those
values? And what if you also wanted the 35th and 65th percentile for some reason?
> quantile(waiting, c(.25,.35,.65,.75), type=2) # or leave out type= for the default type
25% 35% 65% 75%
58 65 79 82
The value waiting = 79 is at the 65th percentile.
The last thing you need to be aware of is that the variance and
standard deviation calculated above are the sample values. That is, they are
calculated as estimates of the population values from a sample (the
n − 1 method). There are no functions for the population
variance and standard deviation, although they are easily enough calculated if
you need them.
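For example, something like this would work, in the same spirit as the
"sampsize" function above. (A sketch: multiply the sample variance by
(n-1)/n. Like length(), it will choke on missing values, an issue we'll
get to shortly.)
> pop.var = function(x) var(x) * (length(x) - 1) / length(x)
> pop.sd = function(x) sqrt(pop.var(x))
> pop.sd(waiting)
[1] 13.56996
> rm(pop.var, pop.sd) # tidying up so the workspace listings below still match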
You may remember from your elementary stats course a statistic called the
standard error of the mean, or SEM. Technically, SEM is a measure of error and
not of variability, but you may need to calculate it nevertheless. There is no
R function for it, so let's write one.
-----This is optional. You can skip this if you want.-----
The (estimated) standard error of the sample mean of "waiting" can be
calculated like this.
> sqrt(var(waiting)/length(waiting))
[1] 0.8243164
I.e., it's the square root of (the variance divided by the sample size). This
calculation will choke on missing values, however. Let's see that by adding
a missing value to "waiting".
> waiting -> wait2
> wait2[100] <- NA
> sqrt(var(wait2)/length(wait2))
[1] NA
It's the var() function that coughed up the
hairball, so...
> sqrt(var(wait2, na.rm=T)/length(wait2))
[1] 0.8248208
...seems to fix things. But it doesn't. This isn't quite the correct answer,
and that is because length() gave us the
wrong sample size. (It counted the case that was missing.) We could try this...
> sqrt(var(wait2, na.rm=T)/sampsize(wait2))
[1] 0.8263412
...but that depends upon the existence of the "sampsize" function. If that
somehow gets erased, this calculation will fail. Here's one of the disadvantages
of using R: You have to know what you're doing. Unlike other statistical
packages, such as SPSS, to use R you occasionally have to think. It appears
some of that thinking is necessary here.
> sem = function(x) sqrt(var(x, na.rm=T)/(length(x)-sum(is.na(x))))
> sem(wait2)
[1] 0.8263412
> ls()
[1] "faithful" "sampsize" "sem" "wait2"
We now have a perfectly good function for calculating SEM in our workspace, and
we will not let it get away!
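Applied to the original variable, which has no missing values, it agrees with
the direct calculation we did at the top of this optional section.
> sem(waiting)
[1] 0.8243164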
--------This is the end of the optional material.--------
Before we move on, I should remind you that the
summary() function will give you quite a bit of the above
information in one go.
> summary(waiting)
Min. 1st Qu. Median Mean 3rd Qu. Max.
43.0 58.0 76.0 70.9 82.0 96.0
It will tell you if there are missing values, too.
> summary(wait2) # if you did the optional section
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
43.00 58.00 76.00 70.86 82.00 96.00 1.00
The number under NA's is a frequency, a whole number, in spite of the decimal
zeros. There is 1 missing value in "wait2".
The last thing we want to know (for the time being anyway) about our little
mound of data is what its shape is. Is it a nice mound-shaped mound, or is it
deformed in some way, skewed or bimodal or worse? This will involve looking at
some sort of frequency distribution.
The table() function will do that as well
for a numerical variable as it will for a categorical variable, but the result
may not be pretty (try this with eruptions--I dare you!).
> table(waiting)
waiting
43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70
1 3 5 4 3 5 5 6 5 7 9 6 4 3 4 7 6 4 3 4 3 2 1 1 2 4
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96
5 1 7 6 8 9 12 15 10 8 13 12 14 10 6 6 2 6 3 6 1 1 2 1 1
Looking at this reveals that we have one value of 43, three values of 45, five
values of 46, and so on. Surely there's a better way! We could try a grouped
frequency distribution.
> cut(waiting, breaks=10) -> wait3
> table(wait3)
wait3
(42.9,48.3] (48.3,53.6] (53.6,58.9] (58.9,64.2] (64.2,69.5] (69.5,74.8]
16 28 26 24 9 23
(74.8,80.1] (80.1,85.4] (85.4,90.7] (90.7,96.1]
62 55 23 6
The cut() function cuts a numerical variable
into class intervals, the number of class intervals given (approximately) by the
breaks= option. (R has a bit of a mind of its own, so if you pick a clumsy
number of breaks, R will fix that for you.) The notation (x,y] means the class
interval goes from x to y, with x not being included in the interval and y
being included.
Another disadvantage of using R is that it is intended to be utilitarian.
The output will be useful but not necessarily pretty. We can pretty that up a
little bit with a trick.
> as.data.frame(table(wait3))
wait3 Freq
1 (42.9,48.3] 16
2 (48.3,53.6] 28
3 (53.6,58.9] 26
4 (58.9,64.2] 24
5 (64.2,69.5] 9
6 (69.5,74.8] 23
7 (74.8,80.1] 62
8 (80.1,85.4] 55
9 (85.4,90.7] 23
10 (90.7,96.1] 6
You can also specify the break points yourself in a vector, if you are
anal retentive enough.
> cut(waiting, breaks=seq(from=40,to=100,by=5)) -> wait4
> as.data.frame(table(wait4))
wait4 Freq
1 (40,45] 4
2 (45,50] 22
3 (50,55] 33
4 (55,60] 24
5 (60,65] 14
6 (65,70] 10
7 (70,75] 27
8 (75,80] 54
9 (80,85] 55
10 (85,90] 23
11 (90,95] 5
12 (95,100] 1
We are getting a hint of a bimodal distribution of values here.
A better way of seeing a grouped frequency distribution is by creating a
stem-and-leaf display.
> stem(waiting)
The decimal point is 1 digit(s) to the right of the |
4 | 3
4 | 55566666777788899999
5 | 00000111111222223333333444444444
5 | 555555666677788889999999
6 | 00000022223334444
6 | 555667899
7 | 00001111123333333444444
7 | 555555556666666667777777777778888888888888889999999999
8 | 000000001111111111111222222222222333333333333334444444444
8 | 55555566666677888888999
9 | 00000012334
9 | 6
> stem(waiting, scale=.5)
The decimal point is 1 digit(s) to the right of the |
4 | 355566666777788899999
5 | 00000111111222223333333444444444555555666677788889999999
6 | 00000022223334444555667899
7 | 00001111123333333444444555555556666666667777777777778888888888888889
8 | 00000000111111111111122222222222233333333333333444444444455555566666
9 | 000000123346
Use the "scale=" option to determine how many lines are in the display. The
value of scale= is not actually the number of lines, so this may take some
fiddling. The first of those displays clearly shows the bimodal structure of
this variable.
Summarizing Numeric Data by Groups
Oftentimes, we are faced with summarizing numerical data by group membership,
or as indexed by a factor. The tapply() and
by() functions are most useful here.
> detach(faithful)
> rm(faithful, wait2, wait3, wait4) # Out with the old...
> states = as.data.frame(state.x77)
> states$region = state.region
> head(states)
Population Income Illiteracy Life Exp Murder HS Grad Frost
Alabama 3615 3624 2.1 69.05 15.1 41.3 20
Alaska 365 6315 1.5 69.31 11.3 66.7 152
Arizona 2212 4530 1.8 70.55 7.8 58.1 15
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
California 21198 5114 1.1 71.71 10.3 62.6 20
Colorado 2541 4884 0.7 72.06 6.8 63.9 166
Area region
Alabama 50708 South
Alaska 566432 West
Arizona 113417 West
Arkansas 51945 South
California 156361 West
Colorado 103766 West
The "state.x77" data set in R is a collection of information about the U.S.
states (in 1977), but it is in the form of a matrix. We converted it to a data
frame. All the variables are numeric, so we created a factor in the data frame
by adding information about census region, from the built-in data set
state.region. Now we can do interesting things like break down life expectancy
by regions of the country. But hold on there! Some very foolish person has
named that variable "Life Exp", with a space in the middle of the name. How do
we deal with that?
> names(states)
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder"
[6] "HS Grad" "Frost" "Area" "region"
The easiest way is to rename the column and use the new, more sensible name.
> names(states)[4] # That's the one we want!
[1] "Life Exp"
> names(states)[4] <- "Life.Exp"
> by(data=states$Life.Exp, IND=states$region, FUN=mean)
states$region: Northeast
[1] 71.26444
---------------------------------------------------------
states$region: South
[1] 69.70625
---------------------------------------------------------
states$region: North Central
[1] 71.76667
---------------------------------------------------------
states$region: West
[1] 71.23462
>
> ### with(states, by(data=Life.Exp, IND=region, FUN=mean)) will do the same thing
>
> tapply(states$Life.Exp, states$region, mean)
Northeast South North Central West
71.26444 69.70625 71.76667 71.23462
Two notes: The arguments for by() are in the
default order, but I named them anyway. The syntax, with default order, for
tapply() is DV, IV, function (but those are not the
approved labels for those arguments, which are X, INDEX, and FUN). There is no
formula interface available for these functions, which is a bummer. The good
thing about them is that the data do not have to be in a data frame. They will
work with data vectors in the workspace, or even columns of a matrix
(although that would be silly, since a matrix holds values of only one type
and so cannot contain a grouping factor).
If you want a function with a formula interface, there is one. Its
shortcoming is that the data must be in a data frame. (Another shortcoming is
that its name makes no more sense than tapply, but then the function wasn't
originally developed for this purpose, I suspect.)
> aggregate(Life.Exp ~ region, FUN=mean, data=states) # syntax: DV ~ IV, FUN=, data=
region Life.Exp
1 Northeast 71.26444
2 South 69.70625
3 North Central 71.76667
4 West 71.23462
Obviously, with any of these methods, you can substitute another
function for "mean" and get SDs or variances or whatever.
> with(states, tapply(Life.Exp, region, sd))
Northeast South North Central West
0.7438769 1.0221994 1.0367285 1.3519715
You can even use custom functions that you've added to your workspace.
> with(states, tapply(Life.Exp, region, sem))
Northeast South North Central West
0.2479590 0.2555499 0.2992778 0.3749694
What if we want the DV broken down by two indexes, or two grouping variables?
We don't have a second grouping variable in our data frame, so let's create
one. I'll show you the steps I followed, because there are some catches.
> median(states$Frost) # I'm going to do a median split of Frost
[1] 114.5
> cut(states$Frost, breaks=c(0,114.5,188)) # ALWAYS LOOK FIRST
[1] (0,114] (114,188] (0,114] (0,114] (0,114] (114,188] (114,188]
[8] (0,114] (0,114] (0,114] <NA> (114,188] (114,188] (114,188]
[15] (114,188] (0,114] (0,114] (0,114] (114,188] (0,114] (0,114]
[22] (114,188] (114,188] (0,114] (0,114] (114,188] (114,188] (114,188]
[29] (114,188] (114,188] (114,188] (0,114] (0,114] (114,188] (114,188]
[36] (0,114] (0,114] (114,188] (114,188] (0,114] (114,188] (0,114]
[43] (0,114] (114,188] (114,188] (0,114] (0,114] (0,114] (114,188]
[50] (114,188]
Levels: (0,114] (114,188]
> ### Notice the NA in position 11. What's up with that? Hint: The interval is (0,114].
> cut(states$Frost, breaks=c(-1,114.5,188))
[1] (-1,114] (114,188] (-1,114] (-1,114] (-1,114] (114,188] (114,188]
[8] (-1,114] (-1,114] (-1,114] (-1,114] (114,188] (114,188] (114,188]
[15] (114,188] (-1,114] (-1,114] (-1,114] (114,188] (-1,114] (-1,114]
[22] (114,188] (114,188] (-1,114] (-1,114] (114,188] (114,188] (114,188]
[29] (114,188] (114,188] (114,188] (-1,114] (-1,114] (114,188] (114,188]
[36] (-1,114] (-1,114] (114,188] (114,188] (-1,114] (114,188] (-1,114]
[43] (-1,114] (114,188] (114,188] (-1,114] (-1,114] (-1,114] (114,188]
[50] (114,188]
Levels: (-1,114] (114,188]
> cut(states$Frost, breaks=c(-1,114.5,188), labels=c("low","high"))
[1] low high low low low high high low low low low high high high
[15] high low low low high low low high high low low high high high
[29] high high high low low high high low low high high low high low
[43] low high high low low low high high
Levels: low high
> ### Yay! At last!
> cut(states$Frost, breaks=c(-1,114.5,188), labels=c("low","high")) -> states$coldness
> summary(states$coldness)
low high
25 25
> ### Looking good! Although I suspect there is a substantial relationship
> ### between the region variable and the coldness variable, which we will
> ### ignore.
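> ### Incidentally, another fix for that stray NA, instead of fiddling with
> ### the bottom break, is cut()'s include.lowest= option, which puts the
> ### lowest break value into the first interval.
> summary(cut(states$Frost, breaks=c(0,114.5,188), labels=c("low","high"), include.lowest=TRUE))
 low high 
  25   25 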
Now that we have our second grouping variable, we may proceed.
> with(states, by(Life.Exp, list(coldness,region), mean)) # output not shown
> with(states, tapply(Life.Exp, list(coldness,region), mean)) # much nicer output
Northeast South North Central West
low 71.19000 69.70625 71.635 71.9420
high 71.28571 NA 71.793 70.7925
> ### What's up with that NA?
> with(states, table(coldness,region)) ### no cases in that cell!
region
coldness Northeast South North Central West
low 2 16 2 5
high 7 0 10 8
> aggregate(Life.Exp ~ region + coldness, FUN=mean, data=states)
region coldness Life.Exp
1 Northeast low 71.19000
2 South low 69.70625
3 North Central low 71.63500
4 West low 71.94200
5 Northeast high 71.28571
6 North Central high 71.79300
7 West high 70.79250
On to graphical summaries!
revised 2016 January 23