
MORE DESCRIPTIVE STATISTICS
Categorical Data
You summarize categorical data basically by counting up frequencies and
by calculating proportions and percentages.
Categorical data are commonly encountered in three forms: a frequency table
or crosstabulation, a flat table, or a case-by-case data frame. Let's begin
with the last of these. Copy and paste the following lines ALL AT ONCE into R.
That is, highlight these lines with your mouse, hit Ctrl-C on your keyboard,
click at a command prompt in R, and hit Ctrl-V on your keyboard, and hit Enter
if necessary, i.e., if R hasn't returned to a command prompt. On the Mac, use
Command-C and Command-V. This will execute these lines as a script and create
a data frame called "ucb" in your workspace. WARNING: Your workspace will also
be cleared, so save anything you don't want to lose first.
# Begin copying here.
rm(list=ls())
gender = rep(c("female","male"),c(1835,2691))
admitted = rep(c("yes","no","yes","no"),c(557,1278,1198,1493))
dept = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
c(89,17,202,131,94,24,19,8,391,244,299,317))
dept2 = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
c(512,353,120,138,53,22,313,207,205,279,138,351))
department = c(dept,dept2)
ucb = data.frame(gender,admitted,department)
rm(gender,admitted,dept,dept2,department)
ls()
# End copying here.
Data sets that are purely categorical are not economically represented in
case-by-case data frames, and so the built-in data sets that are purely
categorical come in the form of tables (contingency tables or
crosstabulations). We have just taken the data from one of these (the
"UCBAdmissions" built-in data set) and turned it into a case-by-case data
frame. It's the classic University of California, Berkeley, admissions data
from 1973 describing admissions into six different graduate programs broken
down by gender.
First, we want some general information about the data frame...
> str(ucb) # Examine the structure.
'data.frame': 4526 obs. of 3 variables:
$ gender : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
$ admitted : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
$ department: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
> ucb[seq(1,4526,400),] # Look at every 400th row.
gender admitted department
1 female yes A
401 female yes D
801 female no C
1201 female no D
1601 female no F
2001 male yes A
2401 male yes B
2801 male yes C
3201 male no A
3601 male no C
4001 male no D
4401 male no F
There are 4526 cases or observations or people represented in this data frame,
with three variables observed on each one: gender, admitted, and department the
person applied to. All three variables are coded as factors, but this doesn't
matter. For our present purposes, categorical variables and factors are the
same thing.
What would we like to know? Here's a good start...
> summary(ucb)
gender admitted department
female:1835 no :2771 A:933
male :2691 yes:1755 B:585
C:918
D:792
E:584
F:714
We now have frequency tables for each of the variables. We could get the same
information for any individual variable using the table( ) function...
> table(ucb$gender)
female male
1835 2691
>
> table(ucb$admitted)
no yes
2771 1755
>
> table(ucb$department)
A B C D E F
933 585 918 792 584 714
These tables can easily be turned into relative frequency tables using the
prop.table( ) function...
> table(ucb$department) -> dept.table # Requires a table as an argument.
> prop.table(dept.table) # Calculate proportions.
A B C D E F
0.2061423 0.1292532 0.2028281 0.1749890 0.1290323 0.1577552
>
> prop.table(dept.table) * 100 # Or calculate percentages.
A B C D E F
20.61423 12.92532 20.28281 17.49890 12.90323 15.77552
As we see, 20.6% of the applicants applied to department A, 12.9% to department
B, 20.3% to department C, etc.
The prop.table( ) function requires a table
object as its argument, so the data must first be tabled before being
prop.tabled.
Contingency tables or crosstabs can be produced using either the table( ) or xtabs( ) function. Table is easier, so I'll
illustrate that first...
> with(ucb, table(gender, admitted)) # or table(ucb$gender, ucb$admitted)
admitted
gender no yes
female 1278 557
male 1493 1198
Within the table( ) function, the first
variable named will be in the rows of the contingency table, and the second
variable named will be in the columns. If a third variable is named, it will
form separate layers or strata of a three dimensional contingency table.
When using prop.table( ) on a
multidimensional table, it's necessary to specify which marginal sums you want
to use to calculate the proportions. To use the row sums, specify 1; to use the
column sums, specify 2...
> with(ucb, table(gender, admitted)) -> gen.adm.table
> prop.table(gen.adm.table, 1) # With respect to row marginal sums.
admitted
gender no yes
female 0.6964578 0.3035422
male 0.5548123 0.4451877
>
> prop.table(gen.adm.table, 2) # With respect to column marginal sums.
admitted
gender no yes
female 0.4612053 0.3173789
male 0.5387947 0.6826211
The name of this option is "margin=", so this form could also have been used...
> prop.table(my.table, margin=1)
admitted
gender no yes
female 0.6964578 0.3035422
male 0.5548123 0.4451877
Since "margin=" is the first thing prop.table( )
expects after the table name, either form will work here.
The xtabs( ) function works quite a bit
differently. It uses a formula interface. The formula interface is used most
often in model building and significance testing, so we'll see it a lot, and
it's discussed in detail in another tutorial. Formulas can become quite
complex, but their most basic form is as follows...
DV ~ IV1 + IV2 + IV3 + ...
First, a dependent or response variable is specified, followed by a tilde (in
the upper left corner of most keyboards), which is read as "is a function of"
or "is modeled by", and finally a list of independent or explanatory variables
separated by plus signs. For xtabs( ) there is
no DV (in a case-by-case data frame), so we just leave it out...
> with(ucb, xtabs(~ gender + admitted))
admitted
gender no yes
female 1278 557
male 1493 1198
Instead of using with( ) to give the name of
the data frame, we could also have used the data= option, since we are using a
formula interface in the function...
> xtabs(~ gender + admitted, data=ucb)
admitted
gender no yes
female 1278 557
male 1493 1198
The resulting table could also have been stored and operated on with other
functions. Here are some examples...
> xtabs(~ gender + admitted, data=ucb) -> gen.adm.table
> prop.table(gen.adm.table, 1) # Get proportions relative to row sums.
admitted
gender no yes
female 0.6964578 0.3035422
male 0.5548123 0.4451877
>
> addmargins(gen.adm.table) # Add marginal sums to table.
admitted
gender no yes Sum
female 1278 557 1835
male 1493 1198 2691
Sum 2771 1755 4526
>
> margin.table(gen.adm.table, 1) # Collapse over admitted (row marginals).
gender
female male
1835 2691
>
> margin.table(gen.adm.table, 2) # Collapse over sex (column marginals).
admitted
no yes
2771 1755
Six different crosstabulations of the entire data set are possible,
depending upon the order in which we list the variables...
> with(ucb, table(gender, department, admitted))
, , admitted = no
department
gender A B C D E F
female 19 8 391 244 299 317
male 313 207 205 279 138 351
, , admitted = yes
department
gender A B C D E F
female 89 17 202 131 94 24
male 512 353 120 138 53 22
> with(ucb, table(admitted, department, gender))
, , gender = female
department
admitted A B C D E F
no 19 8 391 244 299 317
yes 89 17 202 131 94 24
, , gender = male
department
admitted A B C D E F
no 313 207 205 279 138 351
yes 512 353 120 138 53 22
Etc.
A flat table is produced from the data frame by using the
ftable( ) function. Use the "col.vars="
option to control which variable goes in the columns...
> ftable(ucb) # Last variable in columns.
department A B C D E F
gender admitted
female no 19 8 391 244 299 317
yes 89 17 202 131 94 24
male no 313 207 205 279 138 351
yes 512 353 120 138 53 22
>
> ftable(ucb, col.vars="gender") # Gender in columns.
gender female male
admitted department
no A 19 313
B 8 207
C 391 205
D 244 279
E 299 138
F 317 351
yes A 89 512
B 17 353
C 202 120
D 131 138
E 94 53
F 24 22
>
> ftable(ucb, col.vars="admitted") # Admitted in columns.
admitted no yes
gender department
female A 19 89
B 8 17
C 391 202
D 244 131
E 299 94
F 317 24
male A 313 512
B 207 353
C 205 120
D 279 138
E 138 53
F 351 22
>
> ftable(ucb, col.vars="admitted") -> my.table
> prop.table(my.table, 1) # prop.table() works here, too.
admitted no yes
gender department
female A 0.17592593 0.82407407
B 0.32000000 0.68000000
C 0.65935919 0.34064081
D 0.65066667 0.34933333
E 0.76081425 0.23918575
F 0.92961877 0.07038123
male A 0.37939394 0.62060606
B 0.36964286 0.63035714
C 0.63076923 0.36923077
D 0.66906475 0.33093525
E 0.72251309 0.27748691
F 0.94101877 0.05898123
Flat tables can also be made from contingency tables, the order of the row
variables can be changed, and multiple column variables can be specified...
> with(ucb, table(admitted,department,gender)) -> my.table # A 3D contingency table.
> ftable(my.table)
gender female male
admitted department
no A 19 313
B 8 207
C 391 205
D 244 279
E 299 138
F 317 351
yes A 89 512
B 17 353
C 202 120
D 131 138
E 94 53
F 24 22
>
> ftable(my.table, col.vars="admitted")
admitted no yes
department gender
A female 19 89
male 313 512
B female 8 17
male 207 353
C female 391 202
male 205 120
D female 244 131
male 279 138
E female 299 94
male 138 53
F female 317 24
male 351 22
>
> ftable(my.table, row.vars=c("gender","department"), col.vars="admitted")
admitted no yes
gender department
female A 19 89
B 8 17
C 391 202
D 244 131
E 299 94
F 317 24
male A 313 512
B 207 353
C 205 120
D 279 138
E 138 53
F 351 22
>
> ftable(my.table, row.vars="department", col.vars=c("gender","admitted"))
gender female male
admitted no yes no yes
department
A 19 89 313 512
B 8 17 207 353
C 391 202 205 120
D 244 131 279 138
E 299 94 138 53
F 317 24 351 22
Use the "row.vars=" option to control the order in which the row variables
occur. If more than one variable name is to be used in either the "row.vars" or
"col.vars" option, use vector notation to specify them. (HINT: In R,
everything is a vector, even if it has only one element! The notation
row.vars="department" is just a shortcut for row.vars=c("department").)
You can also get a VERY useful and more efficient data frame from a
contingency table as follows...
> as.data.frame(my.table) -> my.df.table
> my.df.table
admitted department gender Freq
1 no A female 19
2 yes A female 89
3 no B female 8
4 yes B female 17
5 no C female 391
6 yes C female 202
7 no D female 244
8 yes D female 131
9 no E female 299
10 yes E female 94
11 no F female 317
12 yes F female 24
13 no A male 313
14 yes A male 512
15 no B male 207
16 yes B male 353
17 no C male 205
18 yes C male 120
19 no D male 279
20 yes D male 138
21 no E male 138
22 yes E male 53
23 no F male 351
24 yes F male 22
There are only 24 possible unique cases in these data (2 genders times 2
admit statuses times 6 departments). So why put this information into a data
frame with 4500+ rows in it? The form above is much more efficient. It lists
the possible unique cases in the first three columns (every possible
combination of factors) and then gives a frequency in the last Freq column.
When there is a Freq column in the data frame, you CANNOT use
table( ) to get crosstabulations. You must use
xtabs( ), AND you must specify Freq as the DV.
> xtabs(Freq ~ gender + admitted, data=my.df.table)
admitted
gender no yes
female 1278 557
male 1493 1198
We'll look at graphics for categorical data in the next tutorial.
> rm(list=ls())
Numerical Data
Imagine a number line stretching all the way across the universe from
negative infinity to positive infinity. Somewhere along this number line is
your little mound of data, and you have to tell someone how to find it. What
information would it be useful to give this person?
First, you might want to tell this person how big the mound is. Is she to
look for a sizeable mound of data or just a little speck? Second, you might
want to tell this person about where to look. Where is the mound centered on
the number line? Third, you might want to tell this person how spread out to
expect the mound to be. Are the data all in a nice, compact mound, or is it
spread out all over the place? Finally, you might want to tell this person
what shape to expect the mound to be.
The data frame "faithful" consists of 272 observations on the Old Faithful
geyser in Yellowstone National Park. Each observation consists of two
variables: "eruptions" (how long an eruption lasted in minutes), and "waiting"
(how long in minutes the geyser was quiet before that eruption)...
> data(faithful)
> str(faithful)
'data.frame': 272 obs. of 2 variables:
$ eruptions: num 3.60 1.80 3.33 2.28 4.53 ...
$ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
I'm going to attach this so that we can easily look at the variables inside.
Remind me to detach it when we're done!
> attach(faithful)
We'll begin with the "waiting" variable. Let's see if we can characterize
it.
First, we want to know how many values are in this variable. If this variable
is our mound of data somewhere on that number line, how big a mound is it?
> length(waiting)
[1] 272
I don't suppose you were much surprised by that. We already knew that this
data frame contains 272 cases, so each variable must be 272 cases long. There
is a slight catch, however. The length( )
function counts missing values (NAs) as cases, and there is no way to stop it
from doing so. That is, there is no na.rm= option for this function. In fact,
there are no options at all. It will not even reveal the presence of missing
values, and for this reason the length( )
function can give a misleading result when missing values are present.
The following command will tell you how many of the values in a variable
are NAs (missing values)...
> sum(is.na(waiting))
[1] 0
It does this as follows. The is.na( )
function tests every value in a vector for missingness, returning either TRUE
or FALSE for each value. Remember, in R, TRUEs add as ones and FALSEs add as
zeroes, so when we sum the result of testing for missingness, we get the
number of TRUEs, i.e., the number of missing values. It's kind of a nuisance
to remember that, so I propose a new function, one that doesn't actually
exist in R.
Here's one of the nicest things about R. If it doesn't do something you want
it to do, you can write a function that does it. We'll talk about this in detail
in a future tutorial, but for right now the basic idea is this. If you can
calculate it at the command line, you can write a function to do it. In this
case...
> length(waiting) - sum(is.na(waiting))
[1] 272
...would give us the number of nonmissing cases in "waiting". So do this (and be
VERY CAREFUL with your typing)...
> sampsize = function(x) length(x) - sum(is.na(x))
> sampsize(waiting)
[1] 272
> ls()
[1] "faithful" "sampsize"
The first line creates a function called "sampsize" and gives it a mathematical
definition (i.e., tells how to calculate it). The variable "x" is called a
dummy variable, or a place holder. When we actually use the function, as we
did in the second line, we put in the name of the vector we want "sampsize"
calculated for. This takes the place of "x" in the calculation. Notice that an
object called "sampsize" has been added to your workspace. It will be there
until you remove it (or neglect to save the workspace at the end of this R
session). Better yet, if you go back to your default working directoy and save
it there, it will load automatically every time you start R. We'll do that at
the end of this tutorial.
The second thing we want to know about our little mound of data is where it
is located on our number line. Where is it centered? In other words, we want a
measure of center or of central tendency. There are three that are commonly
used: mean, median, and mode. The mode is not very useful and is rarely used in
statistical calculations, so there is no R function for it. Mean and median, on'the other hand, are straightforward...
> mean(waiting)
[1] 70.89706
> median(waiting)
[1] 76
Now we know to hunt around somewhere in the seventies for our data.
How spread out should we expect it to be? This is given to us by a measure
of spread, also called a measure of variability or a measure of dispersion.
There are several of these in common usage: the range, the interquartile range,
the variance, and the standard deviation. Here's how to get them...
> range(waiting)
[1] 43 96
> IQR(waiting)
[1] 24
> var(waiting)
[1] 184.8233
> sd(waiting)
[1] 13.59497
There are several things you should be aware of here. First, range( ) does not actually give the range
(difference between the maximum and minimum values), it gives the minimum and
maximum value. So the values in "waiting" range from a minimum of 43 to a
maximum of 96.
Second, the interquartile range (IQR) is defined as the difference between
the value at the 3rd quartile and the value at the 1st quartile. However, there
is not universal agreement on how these values should be calculated, and
different software packages do it differently. R allows nine different methods
to calculate these values, and the one used by default is NOT the one you were
probably taught in elementary statiistics. So the result given by IQR( ) is not the one you might get if you do the
calculation by hand (or on a TI-84 calculator). It should be very close though.
If your instructor insists that you get the "calculator" value, do the
calculation this way...
> quantile(waiting, c(.75,.25), type=2)
75% 25%
82 58
...and then subtract those two values (82-58=24). In this case, the answer is
the same. While we're here, this would be a good time to mention percentiles.
Do you need the 65th percentile of the "waiting" values. Here it is...
> quantile(waiting, .65)
65%
79
The value waiting = 79 is at the 65th percentile.
The last two things you need to be aware of is that the variance and
standard deviation calculated above are the sample values. That is, they are
calculated as estimates from a sample of the population values (the
n − 1 method). There are no functions for the population
variance and standard deviation, although they are easily enough calculated if
you need them.
You may remember from your elementary stats course a statistic called the
standard error of the mean, or SEM. Technically, SEM is a measure of error and
not of variability, but you may need to calculate it nevertheless. There is no
R function for it, so let's write one.
-----This is optional. You can skip this if you want.-----
The standard error of the sample mean of "waiting" can be calculated like
this...
> sqrt(var(waiting)/length(waiting))
[1] 0.8243164
...the square root of the (variance divided by the sample size). This
calculation will choke on missing values, however. Let's see that by adding
a missing value to "waiting"...
> waiting -> wait2
> wait2[100] <- NA
> sqrt(var(wait2)/length(wait2))
[1] NA
It's the var( ) function that coughed up the
hairball, so...
> sqrt(var(wait2, na.rm=T)/length(wait2))
[1] 0.8248208
...seems to fix things. But it doesn't. This isn't quite the correct answer,
and that is because length( ) gave us the
wrong sample size. (It counted the case that was missing.) We could try this...
> sqrt(var(wait2, na.rm=T)/sampsize(wait2))
[1] 0.8263412
...but that depends upon the existence of the "sampsize" function. If that
somehow gets erased, this calculation will fail. Here's one of the disadvantages
of using R: You have to know what you're doing. Unlike other statistical
packages, such as SPSS, to use R you occasionally have to think. It appears
some of that is necessary here...
> sem = function(x) sqrt(var(x, na.rm=T)/(length(x)-sum(is.na(x))))
> sem(wait2)
[1] 0.8263412
> ls()
[1] "faithful" "sampsize" "sem" "wait2"
We now have a perfectly good function for calculating SEM in our workspace, and
we will not let it get away!
--------This is the end of the optional material.--------
Before we move on, I should remind you that the
summary( ) function will give you quite a bit of the above
information in one go...
> summary(waiting)
Min. 1st Qu. Median Mean 3rd Qu. Max.
43.0 58.0 76.0 70.9 82.0 96.0
It will tell you if there are missing values, too...
> summary(wait2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
43.00 58.00 76.00 70.86 82.00 96.00 1.00
The number under NA's is a frequency. There is 1 missing value in "wait2".
The last thing we want to know (for the time being anyway) about our little
mound of data is what its shape is. Is it a nice mound-shaped mound, or is it
deformed in some way, skewed or bimodal or worse? This will involve looking at
some sort of frequency distribution.
The table( ) function will do that as well
for a numerical variable as it will for a categorical variable, but the result
may not be pretty (try this with eruptions--I dare you!)...
> table(waiting)
waiting
43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70
1 3 5 4 3 5 5 6 5 7 9 6 4 3 4 7 6 4 3 4 3 2 1 1 2 4
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96
5 1 7 6 8 9 12 15 10 8 13 12 14 10 6 6 2 6 3 6 1 1 2 1 1
Looking at this reveals that we have one value of 43, three values of 45, five
values of 46, and so on. Surely there's a better way! We could try a grouped
frequency distribution...
> cut(waiting, breaks=10) -> wait3
> table(wait3)
wait3
(42.9,48.3] (48.3,53.6] (53.6,58.9] (58.9,64.2] (64.2,69.5] (69.5,74.8]
16 28 26 24 9 23
(74.8,80.1] (80.1,85.4] (85.4,90.7] (90.7,96.1]
62 55 23 6
The cut( ) function cuts a numerical variable
into class intervals, the number of class intervals given (approximately) by the
breaks= option. (R has a bit of a mind of it's own, so if you pick a clumbsy
number of breaks, R will fix that for you.) The notation (x,y] means the class
interval goes from x to y, with x not being included in the interval and y
being included.
Another disadvantage of using R is that it is intended to be utilitarian.
The output will be useful but not necessarily pretty. We can pretty that up a
little bit with a trick...
> as.data.frame(table(wait3))
wait3 Freq
1 (42.9,48.3] 16
2 (48.3,53.6] 28
3 (53.6,58.9] 26
4 (58.9,64.2] 24
5 (64.2,69.5] 9
6 (69.5,74.8] 23
7 (74.8,80.1] 62
8 (80.1,85.4] 55
9 (85.4,90.7] 23
10 (90.7,96.1] 6
You can also specify the break points yourself in a vector, if you are that
anal retentive...
> cut(waiting, breaks=seq(40,100,5)) -> wait4
> as.data.frame(table(wait4))
wait4 Freq
1 (40,45] 4
2 (45,50] 22
3 (50,55] 33
4 (55,60] 24
5 (60,65] 14
6 (65,70] 10
7 (70,75] 27
8 (75,80] 54
9 (80,85] 55
10 (85,90] 23
11 (90,95] 5
12 (95,100] 1
We are getting a hint of a bimodal distribution of values here.
A better way of seeing a grouped frequency distribution is by creating a
stem-and-leaf display...
> stem(waiting)
The decimal point is 1 digit(s) to the right of the |
4 | 3
4 | 55566666777788899999
5 | 00000111111222223333333444444444
5 | 555555666677788889999999
6 | 00000022223334444
6 | 555667899
7 | 00001111123333333444444
7 | 555555556666666667777777777778888888888888889999999999
8 | 000000001111111111111222222222222333333333333334444444444
8 | 55555566666677888888999
9 | 00000012334
9 | 6
> stem(waiting, scale=.5)
The decimal point is 1 digit(s) to the right of the |
4 | 355566666777788899999
5 | 00000111111222223333333444444444555555666677788889999999
6 | 00000022223334444555667899
7 | 00001111123333333444444555555556666666667777777777778888888888888889
8 | 00000000111111111111122222222222233333333333333444444444455555566666
9 | 000000123346
Use the "scale=" option to determine how many lines are in the display. The
value of scale= is not actually the number of lines, so this may take some
fiddling. The first of those displays clearly shows the bimodal structure of
this variable.
Oftentimes, we are faced with summarizing numerical data by group membership,
or as indexed by a factor. The tapply( ) and
by( ) functions are most useful here...
> detach(faithful)
> rm(faithful, wait2, wait3, wait4) # Out with the old...
> as.data.frame(state.x77) -> states
> states$region <- state.region
> head(states)
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
region
Alabama South
Alaska West
Arizona West
Arkansas South
California West
Colorado West
The "state.x77" data set in R is a collection of information about the U.S.
states (in 1977), but it is in the form of a matrix. We converted it to a data
frame. All the variables are numerical, so we created a factor in the data frame
by adding information about census region, from the built-in data set
state.region. Now we can do interesting things like break down life expectancy
by regions of the country. But hold on there! Some very foolish person has
named that variable "Life Exp", with a space in the middle of the name. How do
we deal with that?
> names(states)
[1] "Population" "Income" "Illiteracy" "Life Exp" "Murder"
[6] "HS Grad" "Frost" "Area" "region"
> by(data=states[4], IND=states[9], FUN=mean)
region: Northeast
[1] 71.26444
------------------------------------------------------------
region: South
[1] 69.70625
------------------------------------------------------------
region: North Central
[1] 71.76667
------------------------------------------------------------
region: West
[1] 71.23462
>
> tapply(X=states[,4], IND=states[,9], FUN=mean)
Northeast South North Central West
71.26444 69.70625 71.76667 71.23462
That's one way. There are a couple things to notice here. The arguments to
by( ) are listed in the default order, so you
don't really have to name them. The same is true of
tapply( ). Notice also that you can be a lot more loosey goosey
with the indexing in by( ) than in
tapply( ). You only have to give the column
number that you want from the data frame.
Of course, we could also rename the column and use the new, more sensible
name...
> names(states)[4] # That's the one we want!
[1] "Life Exp"
> names(states)[4] <- "Life.Exp"
> tapply(states$Life.Exp, states$region, mean)
Northeast South North Central West
71.26444 69.70625 71.76667 71.23462
Now, before we quit and clean up, let's save those functions in our default
working directory.
> ls()
[1] "sampsize" "sem" "states"
> rm(states) # Do some cleaning up first.
> setwd("..") # Go back one directory level from Rspace.
> getwd() # Check to be sure you're in the right place!
[1] "C:/Documents and Settings/kingw/My Documents"
> ### I'm in WinXP at the moment. This should be your home directory in
> ### any other OS. Now assuming you have both sampsize and sem...
> load(".RData") # To preserve anything already there...
> ls()
[1] "age.out" "m.ex" "m.ex.minussex" "outcome.out"
[5] "respiratory" "sampsize" "seizure" "sem"
[9] "visit.matrix"
> rm(age.out,m.ex,m.ex.minussex,outcome.out,
+ respiratory,seizure,visit.matrix) # I don't want any of that old junk!
> ls() # Check to see functions are still there.
[1] "sampsize" "sem"
> quit("yes") # Quit saving workspace.
The next time you start R, those functions will be loaded automatically.
On to graphical summaries!
revised 2010 August 3
Return to the Table of Contents
|