MORE DESCRIPTIVE STATISTICS

Categorical Data: Crosstabulations

Categorical data are summarized by counting up frequencies and by calculating proportions and percentages. Categorical data are commonly encountered in three forms: a frequency table or crosstabulation, a flat table, or a case-by-case data frame. Let's begin with the first of these. In the default download of R, there is a data set called UCBAdmissions,
which gives admissions data for the six largest graduate departments at UC Berkeley in 1973.
You might want to enlarge your R Console window a bit so you can see all the
data at once.
> setwd(Rspace)     # Okay. Everybody forgets once in awhile. Give me a break!
Error in setwd(Rspace) : object 'Rspace' not found
> setwd("Rspace")   # if you've created this directory
> UCBAdmissions
, , Dept = A

          Gender
Admit      Male Female
  Admitted  512     89
  Rejected  313     19

, , Dept = B

          Gender
Admit      Male Female
  Admitted  353     17
  Rejected  207      8

, , Dept = C

          Gender
Admit      Male Female
  Admitted  120    202
  Rejected  205    391

, , Dept = D

          Gender
Admit      Male Female
  Admitted  138    131
  Rejected  279    244

, , Dept = E

          Gender
Admit      Male Female
  Admitted   53     94
  Rejected  138    299

, , Dept = F

          Gender
Admit      Male Female
  Admitted   22     24
  Rejected  351    317

This is a three-dimensional table, specifically a 2x2x6 contingency table. In R, the dimensions are recognized in this order: rows (the "Admit" variable in this table), columns (the "Gender" variable), and layers (the "Dept" variable). The numbers in the table are frequencies: for example, how many male applicants were admitted to department A (512). We don't have information on individual subjects, but if we did, the first of them would be someone who was admitted, would be male, and would have applied to Dept. A, and that would be it. Thus, you can see that the frequency table, or crosstabs, contains complete information on our subjects. How would we create this table ourselves? Well, we would start with all of
those numbers in a vector. The tricky part is getting them in the correct order,
and to do that we need to know in what order R is going to stick them into the
table. R will pick a layer of the table, beginning with the first one (A), and
fill down the columns, then move to the next layer and fill down the columns,
and so on. Thus, when the admissions office at Berkeley sends us these numbers,
we want to create the following vector from them. (Use
scan() to enter the vector if you prefer. I'm going to use c() because it will keep things clearer, hopefully.)
> OF = c(512, 313, 89, 19,    # Dept A: males adm, males rej, females adm, females rej
+        353, 207, 17, 8,     # Dept B
+        120, 205, 202, 391,  # Dept C
+        138, 279, 131, 244,  # Dept D
+        53, 138, 94, 299,    # Dept E
+        22, 351, 24, 317)    # Dept F

Yes, you can bet I spaced it just like that, and I included the comments. When I come back and look at this later, I want to know exactly what I did! Another confession: if I get one of those numbers wrong, I don't want to have to retype the whole gosh darned thing! So I typed it into a script window (no command prompts typed there), and then copied and pasted it into R, a good way to deal with long commands, and also the way I'm going to deal with the next one! (To get a script window, go to File > New Script in Windows, or File > New Document on a Mac, or open a text editor in Linux.) So my entire script looks like this, which you can copy and paste if your fingers are broken and you can't type today:

OF = c(512, 313, 89, 19,    # Dept A: males adm, males rej, females adm, females rej
       353, 207, 17, 8,     # Dept B
       120, 205, 202, 391,  # Dept C
       138, 279, 131, 244,  # Dept D
       53, 138, 94, 299,    # Dept E
       22, 351, 24, 317)    # Dept F
UCB = array(data=OF, dim=c(2,2,6), dimnames=list(
      Admit=c("admitted","rejected"),
      Gender=c("male","female"),
      Dept=c("A","B","C","D","E","F") ))

I could also have used array() and dimnames() separately, as follows.

UCB = array(data=OF, dim=c(2,2,6))
dimnames(UCB) = list(
      Admit=c("admitted","rejected"),
      Gender=c("male","female"),
      Dept=c("A","B","C","D","E","F"))

Print out (to the Console) the table this creates in your workspace and confirm that it is the same as the one above. Is this table arranged the way you really want it? I.e., do you really want Admit in the rows, Gender in the columns, and Dept in the layers? Maybe you do, but then again, maybe you don't. I think the most interesting problem here is to compare admission by gender. If that's what we're interested in, then why don't we just collapse the table
across (or over) departments. That would be easy enough to do, but we need to
remember that, in R, each dimension of the table has a number, and that rows
are 1, columns are 2, and layers are 3. Then we can use the margin.table() function to collapse.
> margin.table(UCB, margin=c(1,2))   # omit the "margin" that you want to collapse over
          Gender
Admit      male female
  admitted 1198    557
  rejected 1493   1278
> margin.table(UCB, margin=c(2,1))   # put Gender in the rows and Admit in the columns
        Admit
Gender   admitted rejected
  male       1198     1493
  female      557     1278
> margin.table(UCB, margin=c(2,1)) -> Admit.by.Gender   # store that second one

The name of the command is more sensible than it may seem at first blush. If you can imagine what the 3D table looks like in 3D, it would resemble a box of crackers. The box is divided into 2x2x6 = 24 little cells, and each cell has a frequency in it. At the end of the box there is a margin (also above and in front). We're going to fill that margin with sums by adding frequencies across the length of the box (across departments). Those sums are called marginal sums or marginal frequencies, and in this case they are what you saw in the first of those margin.tables above.

> margin.table(UCB, margin=c(1,3))   # the margin in front of the box
          Dept
Admit        A   B   C   D   E   F
  admitted 601 370 322 269 147  46
  rejected 332 215 596 523 437 668
> margin.table(UCB, margin=c(2,3))   # the margin above the box
        Dept
Gender     A   B   C   D   E   F
  male   825 560 325 417 191 373
  female 108  25 593 375 393 341

If we examine Admit.by.Gender, we can see without much further analysis that something appears to be fishy. About twice as many males as females were admitted, but the numbers of males and females rejected are about the same. That's a little easier to visualize if we convert the frequencies to proportions. The Admit.by.Gender table has two margins, the row margin (1) and the column margin (2).
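In passing: if you leave the margin= argument out of prop.table() entirely, every cell gets divided by the grand total instead of by a row or column sum, which is sometimes what you want. (A quick sketch, assuming the Admit.by.Gender table we just stored.)

```r
## no margin= at all: each cell divided by the grand total (4526)
prop.table(Admit.by.Gender)
sum(prop.table(Admit.by.Gender))   # these proportions sum to 1 over the whole table
```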
> prop.table(Admit.by.Gender, margin=2)   # proportions with respect to column marginals
        Admit
Gender    admitted  rejected
  male   0.6826211 0.5387947
  female 0.3173789 0.4612053
> prop.table(Admit.by.Gender, margin=1)   # proportions with respect to row marginals
        Admit
Gender    admitted  rejected
  male   0.4451877 0.5548123
  female 0.3035422 0.6964578

From the first table, 68% of those admitted were male and 32% were female, while 54% of those rejected were male and 46% were female. From the second table, 45% of males were admitted and 55% rejected, while 30% of females were admitted and 70% rejected. This is a famous problem in statistics. When we throw out the Dept
information, it appears that males are being favored. What do we see if we do
not throw out the Dept information? Now I think it becomes convenient to
rearrange the whole table, and that can be done with a function we already
know. I'm going to put Admit (1 in the original table) in the rows, Dept (3 in
the original table) in the columns, and Gender (2 in the original table) in the
layers.
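As an aside: because all three dimensions are being kept here, nothing actually gets summed, so this use of margin.table() is really just a rearrangement. R also has a function built for exactly that purpose, aperm() ("array permute"), which should give the same result:

```r
## permute the dimensions of UCB: Admit in rows, Dept in columns, Gender in layers
## (the same rearrangement as margin.table(UCB, margin=c(1,3,2)), with no summing)
UCB2 = aperm(UCB, c(1, 3, 2))
dim(UCB2)   # 2 6 2
```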
> margin.table(UCB, margin=c(1,3,2))
, , Gender = male

          Dept
Admit        A   B   C   D   E   F
  admitted 512 353 120 138  53  22
  rejected 313 207 205 279 138 351

, , Gender = female

          Dept
Admit        A   B   C   D   E   F
  admitted  89  17 202 131  94  24
  rejected  19   8 391 244 299 317

> margin.table(UCB, margin=c(1,3,2)) -> temp
> prop.table(temp, margin=c(2,3))
, , Gender = male

          Dept
Admit              A         B         C         D         E          F
  admitted 0.6206061 0.6303571 0.3692308 0.3309353 0.2774869 0.05898123
  rejected 0.3793939 0.3696429 0.6307692 0.6690647 0.7225131 0.94101877

, , Gender = female

          Dept
Admit              A    B         C         D         E          F
  admitted 0.8240741 0.68 0.3406408 0.3493333 0.2391858 0.07038123
  rejected 0.1759259 0.32 0.6593592 0.6506667 0.7608142 0.92961877

62% of male applicants to Dept A were admitted, while 82% of female applicants were. 63% of male applicants to Dept B were admitted, while 68% of female applicants were. And so on. It's called Simpson's paradox. For percentages instead of proportions, multiply by 100.
> temp2 = prop.table(temp, margin=c(2,3)) * 100
> round(temp2, 1)
, , Gender = male

          Dept
Admit         A  B    C    D    E    F
  admitted 62.1 63 36.9 33.1 27.7  5.9
  rejected 37.9 37 63.1 66.9 72.3 94.1

, , Gender = female

          Dept
Admit         A  B    C    D    E  F
  admitted 82.4 68 34.1 34.9 23.9  7
  rejected 17.6 32 65.9 65.1 76.1 93

By the way, I think it's poor form that R doesn't give me the .0 when that's what the number rounds to.

> ls()
[1] "Admit.by.Gender" "OF"              "temp"            "temp2"
[5] "UCB"
> rm(Admit.by.Gender, temp, temp2)

Categorical Data: Flat Tables

These are flat tables. As you can see, they contain the same frequency
information as the crosstabs.
> ftable(UCB)
                Dept    A   B   C   D   E   F
Admit    Gender
admitted male         512 353 120 138  53  22
         female        89  17 202 131  94  24
rejected male         313 207 205 279 138 351
         female        19   8 391 244 299 317
> ftable(UCB, col.vars="Gender")
              Gender male female
Admit    Dept
admitted A            512     89
         B            353     17
         C            120    202
         D            138    131
         E             53     94
         F             22     24
rejected A            313     19
         B            207      8
         C            205    391
         D            279    244
         E            138    299
         F            351    317
> ftable(UCB, row.vars=c("Dept","Gender"), col.vars="Admit")
            Admit admitted rejected
Dept Gender
A    male              512      313
     female             89       19
B    male              353      207
     female             17        8
C    male              120      205
     female            202      391
D    male              138      279
     female            131      244
E    male               53      138
     female             94      299
F    male               22      351
     female             24      317
> prop.table(ftable(UCB, row.vars=c("Dept","Gender"), col.vars="Admit"), 1)
            Admit   admitted   rejected
Dept Gender
A    male         0.62060606 0.37939394
     female       0.82407407 0.17592593
B    male         0.63035714 0.36964286
     female       0.68000000 0.32000000
C    male         0.36923077 0.63076923
     female       0.34064081 0.65935919
D    male         0.33093525 0.66906475
     female       0.34933333 0.65066667
E    male         0.27748691 0.72251309
     female       0.23918575 0.76081425
F    male         0.05898123 0.94101877
     female       0.07038123 0.92961877

Categorical Data: Data Frames

R often gives a table like the following and calls it a data frame.
> as.data.frame.table(UCB)
      Admit Gender Dept Freq
1  admitted   male    A  512
2  rejected   male    A  313
3  admitted female    A   89
4  rejected female    A   19
5  admitted   male    B  353
6  rejected   male    B  207
7  admitted female    B   17
8  rejected female    B    8
9  admitted   male    C  120
10 rejected   male    C  205
11 admitted female    C  202
12 rejected female    C  391
13 admitted   male    D  138
14 rejected   male    D  279
15 admitted female    D  131
16 rejected female    D  244
17 admitted   male    E   53
18 rejected   male    E  138
19 admitted female    E   94
20 rejected female    E  299
21 admitted   male    F   22
22 rejected   male    F  351
23 admitted female    F   24
24 rejected female    F  317

Is this a true data frame? Well, it will function like one, and such tables are used in procedures that we will encounter in the future. The definition of data frame I gave in an earlier tutorial was a table of cases by variables. And cases are usually our experimental units or subjects. Here we have...

> sum(OF)
[1] 4526

...cases, or subjects, or applicants to Berkeley grad programs. A case-by-case data frame, therefore, would have 4,526 rows and 3 columns. We would construct such a table if we wished to keep track of every single applicant, by name perhaps. "Fred Smith applied to one of these programs in 1973. What were his particulars?" We could find that information in our case-by-case data frame. The following script creates such a data frame from the UCB data. I don't
expect that you will be itching to type this line by line, so just copy and
paste it into R, and the data frame will appear in your workspace.
### begin copying here
gender = rep(c("female","male"), c(1835,2691))
admitted = rep(c("yes","no","yes","no"), c(557,1278,1198,1493))
dept = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
           c(89,17,202,131,94,24,19,8,391,244,299,317))
dept2 = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
            c(512,353,120,138,53,22,313,207,205,279,138,351))
department = c(dept,dept2)
UCBdf = data.frame(gender,admitted,department)
rownames(UCBdf)[3101] = "Fred.Smith"
rm(gender,admitted,dept,dept2,department)
### end copying here and paste into R

It's a data frame, so...

> ls()
[1] "OF"    "UCB"   "UCBdf"
> dim(UCBdf)
[1] 4526    3
> summary(UCBdf)
    gender     admitted   department
 female:1835   no :2771   A:933
 male  :2691   yes:1755   B:585
                          C:918
                          D:792
                          E:584
                          F:714

I didn't give everyone in the data frame his or her name, but I did identify Fred.Smith, and so we can recall his information by name. (Everyone else has to be recalled by his or her secret code number.)

> UCBdf["Fred.Smith",]
           gender admitted department
Fred.Smith   male       no          A

Fred was male, applied to Dept A, and was not admitted. Too bad, Fred. We can summarize the data in this rather largish data frame as follows.
> table(UCBdf$gender)       # frequencies by gender

female   male
  1835   2691

> table(UCBdf$admitted)     # frequencies by admitted

  no  yes
2771 1755

> table(UCBdf$department)   # frequencies by department

  A   B   C   D   E   F
933 585 918 792 584 714

> with(UCBdf, table(gender, admitted))   # a crosstabulation
        admitted
gender     no  yes
  female 1278  557
  male   1493 1198

> with(UCBdf, table(gender, admitted)) -> admitted.by.gender
> prop.table(admitted.by.gender, margin=1)
        admitted
gender          no       yes
  female 0.6964578 0.3035422
  male   0.5548123 0.4451877

Some of this information we already had from summary(), but I wanted to demonstrate the table() function. The xtabs() function can also be used to do crosstabulations. It takes a formula interface. The syntax of the formula is a tilde followed by the variables you wish to crosstabulate, separated by + signs.

> xtabs( ~ admitted + department, data=UCBdf)
        department
admitted   A   B   C   D   E   F
     no  332 215 596 523 437 668
     yes 601 370 322 269 147  46
> xtabs( ~ admitted + department + gender, data=UCBdf)
, , gender = female

        department
admitted   A   B   C   D   E   F
     no   19   8 391 244 299 317
     yes  89  17 202 131  94  24

, , gender = male

        department
admitted   A   B   C   D   E   F
     no  313 207 205 279 138 351
     yes 512 353 120 138  53  22

You can store the table generated by either one of these functions as an object in your workspace and then operate on it with other functions, most of which are already familiar to you. All of these functions require table objects as an argument.
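(If you're ever unsure whether the object you've stored really is a table, class() will tell you. A quick sketch, assuming the UCBdf data frame from above; the xtabs() result also carries the class "xtabs", by the way.)

```r
## table() and xtabs() both produce objects of class "table"
tab = with(UCBdf, table(gender, admitted))
class(tab)      # "table"
is.table(tab)   # TRUE
```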
> xtabs( ~ admitted + department, data=UCBdf) -> admitted.by.department
> prop.table(admitted.by.department, margin=1)
        department
admitted          A          B          C          D          E          F
     no  0.11981234 0.07758932 0.21508481 0.18874053 0.15770480 0.24106821
     yes 0.34245014 0.21082621 0.18347578 0.15327635 0.08376068 0.02621083
> prop.table(admitted.by.department, margin=2)
        department
admitted          A          B          C          D          E          F
     no  0.35584137 0.36752137 0.64923747 0.66035354 0.74828767 0.93557423
     yes 0.64415863 0.63247863 0.35076253 0.33964646 0.25171233 0.06442577
> margin.table(admitted.by.department, margin=1)
admitted
  no  yes
2771 1755
> addmargins(admitted.by.department)
        department
admitted    A    B    C    D    E    F  Sum
     no   332  215  596  523  437  668 2771
     yes  601  370  322  269  147   46 1755
     Sum  933  585  918  792  584  714 4526

We'll look at graphics for categorical data in the next tutorial.

> ls()
[1] "admitted.by.department" "admitted.by.gender"
[3] "OF"                     "UCB"
[5] "UCBdf"
> rm(admitted.by.department, admitted.by.gender)
> ls()
[1] "OF"    "UCB"   "UCBdf"
> save.image("UCB.RData")   # just in case - you never know!
> rm(list=ls())             # clear the workspace

Numeric Data

Imagine a number line stretching all the way across the universe from negative infinity to positive infinity. Somewhere along this number line is your little mound of data, and you have to tell someone how to find it. What information would it be useful to give this person? First, you might want to tell this person how big the mound is. Is she to look for a sizeable mound of data or just a little speck? Second, you might want to tell this person about where to look. Where is the mound centered on the number line? Third, you might want to tell this person how spread out to expect the mound to be. Are the data all in a nice, compact mound, or are they spread out all over the place? Finally, you might want to tell this person what shape to expect the mound to be. The built-in data frame "faithful" consists of 272 observations on the
Old Faithful geyser in Yellowstone National Park. Each observation consists
of two variables: "eruptions" (how long an eruption lasted in minutes), and
"waiting" (how long in minutes the geyser was quiet before that eruption).
> data(faithful)   # put a copy in your workspace
> str(faithful)
'data.frame':   272 obs. of  2 variables:
 $ eruptions: num  3.60 1.80 3.33 2.28 4.53 ...
 $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

I'm going to attach this so that we can easily look at the variables inside. Remind me to detach it when we're done! I sometimes forget.

> attach(faithful)

We'll begin with the "waiting" variable. Let's see if we can characterize it. First, we want to know how many values are in this variable. If this variable
is our mound of data somewhere on that number line, how big a mound is it?
> length(waiting)
[1] 272

I don't suppose you were much surprised by that. We already knew that this data frame contains 272 cases, so each variable must be 272 cases long. There is a slight catch, however. The length() function counts missing values (NAs) as cases, and there is no way to stop it from doing so. That is, there is no na.rm= option for this function. In fact, there are no options at all. It will not even reveal the presence of missing values, and for this reason the length() function can give a misleading result when missing values are present. The following command will tell you how many of the values in a variable
are NAs (missing values).
> sum(is.na(waiting))
[1] 0

It does this as follows. The is.na() function tests every value in a vector for missingness, returning either TRUE or FALSE for each value. Remember, in R, TRUEs add as ones and FALSEs add as zeroes, so when we sum the result of testing for missingness, we get the number of TRUEs, i.e., the number of missing values. It's kind of a nuisance to remember that, so I propose a new function, one that doesn't actually exist in R. Here's one of the nicest things about R. If it doesn't do something you want
it to do, you can write a function that does it. We'll talk about this in detail
in a future tutorial, but for right now the basic idea is this. If you can
calculate it at the command line, you can write a function to do it. In this
case...
> length(waiting) - sum(is.na(waiting))
[1] 272

...would give us the number of nonmissing cases in "waiting". So do this (and be VERY CAREFUL with your typing).

> sampsize = function(x) length(x) - sum(is.na(x))   # this creates the function
> sampsize(waiting)                                  # which we use like any other function
[1] 272
> ls()
[1] "faithful" "sampsize"

The first line creates a function called "sampsize" and gives it a mathematical definition (i.e., tells how to calculate it). The variable "x" is called a dummy variable, or a placeholder. When we actually use the function, as we did in the second line, we put in the name of the vector we want "sampsize" calculated for. This takes the place of "x" in the calculation. Notice that an object called "sampsize" has been added to your workspace. It will be there until you remove it (or neglect to save the workspace at the end of this R session). Of course, the summary() function also reports
missing values (but not the length of the vector!). If there are no missing
values, summary won't mention them.
> summary(waiting)   # no mention of NAs; therefore, there are none
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   43.0    58.0    76.0    70.9    82.0    96.0

The second thing we want to know about our little mound of data is where it is located on our number line. Where is it centered? In other words, we want a measure of center or of central tendency. There are three that are commonly used: mean, median, and mode. The mode is not very useful and is rarely used in statistical calculations, so there is no R function for it. Mean and median, on the other hand, are straightforward.

> mean(waiting)
[1] 70.89706
> median(waiting)
[1] 76

Now we know to hunt around somewhere in the seventies for our data. How spread out should we expect it to be? This is given to us by a measure
of spread, also called a measure of variability or a measure of dispersion.
There are several of these in common usage: the range, the interquartile range,
the variance, and the standard deviation, to name a few. Here's how to get them.
> range(waiting)
[1] 43 96
> IQR(waiting)
[1] 24
> var(waiting)
[1] 184.8233
> sd(waiting)
[1] 13.59497

There are several things you should be aware of here. First, range() does not actually give the range (the difference between the maximum and minimum values); it gives the minimum and maximum values themselves. So the values in "waiting" range from a minimum of 43 to a maximum of 96. Second, the interquartile range (IQR) is defined as the difference between
the value at the 3rd quartile and the value at the 1st quartile. However, there
is not universal agreement on how these values should be calculated, and
different software packages do it differently. R allows nine different methods
to calculate these values, and the one used by default is NOT the one you were
probably taught in elementary statistics. So the result given by IQR() is not the one you might get if you do the
calculation by hand (or on a TI-84 calculator). It should be very close though.
If your instructor insists that you get the "calculator" value, do the
calculation this way.
> IQR(waiting, type=2)
[1] 24

In this case, the answer is the same. The first quartile is at the 25th percentile, and the third quartile is at the 75th percentile. What are those values? And what if you also wanted the 35th and 65th percentiles for some reason?

> quantile(waiting, c(.25,.35,.65,.75), type=2)   # or leave out type= for the default type
25% 35% 65% 75%
 58  65  79  82

The value waiting = 79 is at the 65th percentile. The last two things you need to be aware of are that the variance and standard deviation calculated above are the sample values. That is, they are calculated as estimates of the population values from a sample (the n − 1 method). There are no functions for the population variance and standard deviation, although they are easily enough calculated if you need them. You may remember from your elementary stats course a statistic called the standard error of the mean, or SEM. Technically, SEM is a measure of error and not of variability, but you may need to calculate it nevertheless. There is no R function for it, so let's write one.

-----This is optional. You can skip this if you want.-----

The (estimated) standard error of the sample mean of "waiting" can be
calculated like this.
> sqrt(var(waiting)/length(waiting))
[1] 0.8243164

I.e., it's the square root of the (variance divided by the sample size). This calculation will choke on missing values, however. Let's see that by adding a missing value to "waiting".

> waiting -> wait2
> wait2[100] <- NA
> sqrt(var(wait2)/length(wait2))
[1] NA

It's the var() function that coughed up the hairball, so...

> sqrt(var(wait2, na.rm=T)/length(wait2))
[1] 0.8248208

...seems to fix things. But it doesn't. This isn't quite the correct answer, and that is because length() gave us the wrong sample size. (It counted the case that was missing.) We could try this...

> sqrt(var(wait2, na.rm=T)/sampsize(wait2))
[1] 0.8263412

...but that depends upon the existence of the "sampsize" function. If that somehow gets erased, this calculation will fail. Here's one of the disadvantages of using R: you have to know what you're doing. Unlike other statistical packages, such as SPSS, to use R you occasionally have to think. It appears some of that thinking is necessary here.

> sem = function(x) sqrt(var(x, na.rm=T)/(length(x)-sum(is.na(x))))
> sem(wait2)
[1] 0.8263412
> ls()
[1] "faithful" "sampsize" "sem"      "wait2"

We now have a perfectly good function for calculating SEM in our workspace, and we will not let it get away!

--------This is the end of the optional material.--------

Before we move on, I should remind you that the
summary() function will give you quite a bit of the above
information in one go.
> summary(waiting)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   43.0    58.0    76.0    70.9    82.0    96.0

It will tell you if there are missing values, too.

> summary(wait2)   # if you did the optional section
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  43.00   58.00   76.00   70.86   82.00   96.00    1.00

The number under NA's is a frequency, a whole number, in spite of the decimal zeros. There is 1 missing value in "wait2". The last thing we want to know (for the time being anyway) about our little mound of data is what its shape is. Is it a nice mound-shaped mound, or is it deformed in some way, skewed or bimodal or worse? This will involve looking at some sort of frequency distribution. The table() function will do that as well
for a numerical variable as it will for a categorical variable, but the result
may not be pretty (try this with eruptions--I dare you!).
> table(waiting)
waiting
43 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 65 66 67 68 69 70
 1  3  5  4  3  5  5  6  5  7  9  6  4  3  4  7  6  4  3  4  3  2  1  1  2  4
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 96
 5  1  7  6  8  9 12 15 10  8 13 12 14 10  6  6  2  6  3  6  1  1  2  1  1

Looking at this reveals that we have one value of 43, three values of 45, five values of 46, and so on. Surely there's a better way! We could try a grouped frequency distribution.

> cut(waiting, breaks=10) -> wait3
> table(wait3)
wait3
(42.9,48.3] (48.3,53.6] (53.6,58.9] (58.9,64.2] (64.2,69.5] (69.5,74.8]
         16          28          26          24           9          23
(74.8,80.1] (80.1,85.4] (85.4,90.7] (90.7,96.1]
         62          55          23           6

The cut() function cuts a numerical variable into class intervals, the number of class intervals given (approximately) by the breaks= option. (R has a bit of a mind of its own, so if you pick a clumsy number of breaks, R will fix that for you.) The notation (x,y] means the class interval goes from x to y, with x not being included in the interval and y being included. Another disadvantage of using R is that it is intended to be utilitarian.
The output will be useful but not necessarily pretty. We can pretty that up a
little bit with a trick.
> as.data.frame(table(wait3))
         wait3 Freq
1  (42.9,48.3]   16
2  (48.3,53.6]   28
3  (53.6,58.9]   26
4  (58.9,64.2]   24
5  (64.2,69.5]    9
6  (69.5,74.8]   23
7  (74.8,80.1]   62
8  (80.1,85.4]   55
9  (85.4,90.7]   23
10 (90.7,96.1]    6

You can also specify the break points yourself in a vector, if you are
anal retentive enough.
> cut(waiting, breaks=seq(from=40,to=100,by=5)) -> wait4
> as.data.frame(table(wait4))
      wait4 Freq
1   (40,45]    4
2   (45,50]   22
3   (50,55]   33
4   (55,60]   24
5   (60,65]   14
6   (65,70]   10
7   (70,75]   27
8   (75,80]   54
9   (80,85]   55
10  (85,90]   23
11  (90,95]    5
12 (95,100]    1

We are getting a hint of a bimodal distribution of values here. A better way of seeing a grouped frequency distribution is by creating a
stem-and-leaf display.
> stem(waiting)

  The decimal point is 1 digit(s) to the right of the |

  4 | 3
  4 | 55566666777788899999
  5 | 00000111111222223333333444444444
  5 | 555555666677788889999999
  6 | 00000022223334444
  6 | 555667899
  7 | 00001111123333333444444
  7 | 555555556666666667777777777778888888888888889999999999
  8 | 000000001111111111111222222222222333333333333334444444444
  8 | 55555566666677888888999
  9 | 00000012334
  9 | 6

> stem(waiting, scale=.5)

  The decimal point is 1 digit(s) to the right of the |

  4 | 355566666777788899999
  5 | 00000111111222223333333444444444555555666677788889999999
  6 | 00000022223334444555667899
  7 | 00001111123333333444444555555556666666667777777777778888888888888889
  8 | 00000000111111111111122222222222233333333333333444444444455555566666
  9 | 000000123346

Use the "scale=" option to determine how many lines are in the display. The value of scale= is not actually the number of lines, so this may take some fiddling. The first of those displays clearly shows the bimodal structure of this variable.

Summarizing Numeric Data by Groups

Oftentimes, we are faced with summarizing numerical data by group membership,
or as indexed by a factor. The tapply() and
by() functions are most useful here.
> detach(faithful)
> rm(faithful, wait2, wait3, wait4)   # Out with the old...
> states = as.data.frame(state.x77)
> states$region = state.region
> head(states)
           Population Income Illiteracy Life Exp Murder HS Grad Frost
Alabama          3615   3624        2.1    69.05   15.1    41.3    20
Alaska            365   6315        1.5    69.31   11.3    66.7   152
Arizona          2212   4530        1.8    70.55    7.8    58.1    15
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
California      21198   5114        1.1    71.71   10.3    62.6    20
Colorado         2541   4884        0.7    72.06    6.8    63.9   166
             Area region
Alabama     50708  South
Alaska     566432   West
Arizona    113417   West
Arkansas    51945  South
California 156361   West
Colorado   103766   West

The "state.x77" data set in R is a collection of information about the U.S.
states (in 1977), but it is in the form of a matrix. We converted it to a data
frame. All the variables are numeric, so we created a factor in the data frame
by adding information about census region, from the built-in data set
state.region. Now we can do interesting things like break down life expectancy
by regions of the country. But hold on there! Some very foolish person has
named that variable "Life Exp", with a space in the middle of the name. How do
we deal with that?
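(For the record, a name with a space in it can still be used directly if you wrap it in backticks, as in states$`Life Exp`--but that gets old fast.)

```r
## backticks let you refer to a syntactically awkward name without renaming it
mean(states$`Life Exp`)
with(states, mean(`Life Exp`))   # same thing
```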
> names(states) [1] "Population" "Income" "Illiteracy" "Life Exp" "Murder" [6] "HS Grad" "Frost" "Area" "region" The easiest way is to rename the column and use the new, more sensible name.
> names(states)[4]   # That's the one we want!
[1] "Life Exp"
> names(states)[4] <- "Life.Exp"
> by(data=states$Life.Exp, IND=states$region, FUN=mean)
states$region: Northeast
[1] 71.26444
---------------------------------------------------------
states$region: South
[1] 69.70625
---------------------------------------------------------
states$region: North Central
[1] 71.76667
---------------------------------------------------------
states$region: West
[1] 71.23462
> ### with(states, by(data=Life.Exp, IND=region, FUN=mean)) will do the same thing
> tapply(states$Life.Exp, states$region, mean)
    Northeast         South North Central          West
     71.26444      69.70625      71.76667      71.23462

Two notes: The arguments for by() are in the default order, but I named them anyway. The syntax, with default order, for tapply() is DV, IV, function (but those are not the approved labels for those arguments, which are X, INDEX, and FUN). There is no formula interface available for these functions, which is a bummer. The good thing about them is that the data do not have to be in a data frame. They will work with data vectors in the workspace, or even columns of a matrix (although that would be silly, since a matrix must be all numeric variables). If you want a function with a formula interface, there is one. Its
shortcoming is that the data must be in a data frame. (Another shortcoming is
that its name makes no more sense than tapply, but then the function wasn't
developed for this purpose, I don't suspect.)
> aggregate(Life.Exp ~ region, FUN=mean, data=states)   # syntax: DV ~ IV, FUN=, data=
         region Life.Exp
1     Northeast 71.26444
2         South 69.70625
3 North Central 71.76667
4          West 71.23462

Obviously, with any of these methods, you can substitute another function for "mean" and get SDs or variances or whatever.

> with(states, tapply(Life.Exp, region, sd))
    Northeast         South North Central          West
    0.7438769     1.0221994     1.0367285     1.3519715

You can even use custom functions that you've added to your workspace.

> with(states, tapply(Life.Exp, region, sem))
    Northeast         South North Central          West
    0.2479590     0.2555499     0.2992778     0.3749694

What if we want the DV broken down by two indexes, or two grouping variables? We don't have a second grouping variable in our data frame, so let's create one. I'll show you the steps I followed, because there are some catches.

> median(states$Frost)   # I'm going to do a median split of Frost
[1] 114.5
> cut(states$Frost, breaks=c(0,114.5,188))   # ALWAYS LOOK FIRST
 [1] (0,114]   (114,188] (0,114]   (0,114]   (0,114]   (114,188] (114,188]
 [8] (0,114]   (0,114]   (0,114]   <NA>      (114,188] (114,188] (114,188]
[15] (114,188] (0,114]   (0,114]   (0,114]   (114,188] (0,114]   (0,114]
[22] (114,188] (114,188] (0,114]   (0,114]   (114,188] (114,188] (114,188]
[29] (114,188] (114,188] (114,188] (0,114]   (0,114]   (114,188] (114,188]
[36] (0,114]   (0,114]   (114,188] (114,188] (0,114]   (114,188] (0,114]
[43] (0,114]   (114,188] (114,188] (0,114]   (0,114]   (0,114]   (114,188]
[50] (114,188]
Levels: (0,114] (114,188]
> ### Notice the NA in position 11. What's up with that? Hint: The interval is (0,114].
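By the way, cut() has an include.lowest= argument that closes the first interval at its lower end, which would also rescue the state with zero Frost:

```r
## include.lowest=TRUE makes the first interval [0,114.5] instead of (0,114.5],
## so a state with Frost = 0 is no longer turned into an NA
cut(states$Frost, breaks=c(0, 114.5, 188), include.lowest=TRUE,
    labels=c("low","high")) -> coldness2
summary(coldness2)   # no NAs this time
```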
> cut(states$Frost, breaks=c(-1,114.5,188))
 [1] (-1,114]  (114,188] (-1,114]  (-1,114]  (-1,114]  (114,188] (114,188]
 [8] (-1,114]  (-1,114]  (-1,114]  (-1,114]  (114,188] (114,188] (114,188]
[15] (114,188] (-1,114]  (-1,114]  (-1,114]  (114,188] (-1,114]  (-1,114]
[22] (114,188] (114,188] (-1,114]  (-1,114]  (114,188] (114,188] (114,188]
[29] (114,188] (114,188] (114,188] (-1,114]  (-1,114]  (114,188] (114,188]
[36] (-1,114]  (-1,114]  (114,188] (114,188] (-1,114]  (114,188] (-1,114]
[43] (-1,114]  (114,188] (114,188] (-1,114]  (-1,114]  (-1,114]  (114,188]
[50] (114,188]
Levels: (-1,114] (114,188]
> cut(states$Frost, breaks=c(-1,114.5,188), labels=c("low","high"))
 [1] low  high low  low  low  high high low  low  low  low  high high high
[15] high low  low  low  high low  low  high high low  low  high high high
[29] high high high low  low  high high low  low  high high low  high low
[43] low  high high low  low  low  high high
Levels: low high
> ### Yay! At last!
> cut(states$Frost, breaks=c(-1,114.5,188), labels=c("low","high")) -> states$coldness
> summary(states$coldness)
 low high
  25   25
> ### Looking good! Although I suspect there is a substantial relationship
> ### between the region variable and the coldness variable, which we will
> ### ignore.

Now that we have our second grouping variable, we may proceed.

> with(states, by(Life.Exp, list(coldness,region), mean))       # output not shown
> with(states, tapply(Life.Exp, list(coldness,region), mean))   # much nicer output
     Northeast    South North Central    West
low   71.19000 69.70625        71.635 71.9420
high  71.28571       NA        71.793 70.7925
> ### What's up with that NA?
> with(states, table(coldness, region))   ### no cases in that cell!
        region
coldness Northeast South North Central West
    low          2    16             2    5
    high         7     0            10    8
> aggregate(Life.Exp ~ region + coldness, FUN=mean, data=states)
         region coldness Life.Exp
1     Northeast      low 71.19000
2         South      low 69.70625
3 North Central      low 71.63500
4          West      low 71.94200
5     Northeast     high 71.28571
6 North Central     high 71.79300
7          West     high 70.79250

On to graphical summaries!

revised 2016 January 23