R Tutorials--Demonstration

A WORK-ALONG DEMONSTRATION OF R

Preliminaries

The purpose of this tutorial is to show you a bit--and I mean just a bit--of what R can do and how it works. If you have already downloaded and installed R, you can work along with me. If not, you should be able to get a pretty good idea of what R is about just by reading through this tutorial. However, I will not be explaining much of what I'm doing, so if you've never used R before, much of this may seem pretty cryptic to you. My intent here is not to teach you R but just to show you a little of what it can do. It will also give you some practice at typing and executing R commands and will show you what R looks like on your computer.

We will work on four problems. Do any or all of them, but do at least the first one.

Weight change in anorexic women associated with three different forms of therapy.
Risk factors associated with low infant birth weight.
Some factors associated with car insurance claims.
Data on cars on sale in the United States in 1993.

You will not have to enter any of these data sets, as they come with the default installation of R. They are in an R library that does not load automatically when R starts, so our first job will be getting access to the data. I will repeat the command for doing so at the beginning of each section, but if you are doing the entire tutorial at one sitting, you have to execute the library() command only once. In fact, I will remind you not to do otherwise.

When you start R on your computer, it will open a window that looks something like this. This is the Mac version--it will look slightly, but not dramatically, different in Windows, and will be recognizable in Linux.

The window into which you type your commands is called the R Console. On the Mac and in Windows, this runs inside of a graphical user interface (GUI). In Linux, it runs in a terminal window. In the Mac and Windows version (but not in Linux) there are a few menus at the top of the window (in the menu bar on a Mac), and these are useful for a few things, mostly housekeeping tasks, but the commands that do your statistical work will be typed at the command prompt.

The greater than sign (>) near the bottom of the window is the command prompt. You don't type this part--R supplies it. If the command extends onto two or more lines, this prompt will change to a plus sign (+), called the continuation prompt. Don't type the plus sign prompts either. With respect to that, an IMPORTANT NOTE: It will happen eventually (take my word for it!) that you will get stuck at a continuation prompt, and nothing you do will get you back to the command prompt. Don't panic! Press the Escape (Esc) key at the top left of your keyboard. That will abort the command currently being typed and bring you back to >.

To see what that looks like, try this.

> ls()

Type a lower case L ("el"), not a one, a lower case S, open and close parentheses, and then press the Enter (or Return) key. This command lists the contents of a place in memory called the workspace (the computer's memory, not your memory). If there is nothing there, the output will say character(0). Now suppose you neglected to type the close parenthesis, a common mistake. (The Mac command editor will add it for you automatically, so it's harder to make this mistake on a Mac.) R will recognize that the command is incomplete and give you a continuation prompt (+) so that you can complete it. Type the close parenthesis and press Enter. Be sure to press the Enter key after typing a command. Otherwise, R will just sit there waiting for you to do so, and R has infinite patience!

> ls(                   # no close parenthesis; the command is incomplete
+ )                     # so complete it and press Enter

R is usually insensitive to spacing, so add spacing wherever it helps you to make things clear, or leave it out if you want to save key strokes. R is case sensitive, however, so if you capitalize something that shouldn't be, or fail to when you should, you will get an error message of some sort. If you get a "function not found" or "object not found" error, that is the first thing you should check. Then check your spelling and punctuation. In R, attention to details is crucial!

In R, as in many programming languages, comments start with pound symbols (#). Anything following a pound symbol is ignored by R and is there purely for the purpose of instruction or annotation. You don't need to type these bits. And did I mention that R is, in fact, a full featured programming language? You don't need to be intimidated by that. R has such a large number of built-in statistical functions that most people use it simply for interactive data analysis and graphics. It does mean, however, that R will be very picky about syntax. You will have to close all open parentheses, get your commas and quotes in the right places, and things like that. Annoying, I agree, and another common source of errors, but also very, very powerful!

One thing you may find handy--commands can be copied and pasted into R. Don't copy the command prompts, however.

Problem One: Weight Change in Anorexic Women

Begin like this...

> remove(list=ls())             # Type only the bits after the > prompt. This clears your workspace.
> library("MASS")               # This loads an optional library of commands and data files.
> data(anorexia)                # This puts a copy of the anorexia data in your workspace.
> ls()                          # This produces a list of the contents of the your workspace.
[1] "anorexia"                  ### No prompt here. This is an output line.
> str(anorexia)                 # This asks to see the structure of the anorexia object.
'data.frame':   72 obs. of  3 variables:
 $ Treat : Factor w/ 3 levels "CBT","Cont","FT": 2 2 2 2 2 2 2 2 2 2 ...
 $ Prewt : num  80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...
 $ Postwt: num  80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 ...

The first of these commands cleaned out your workspace, deleting anything you or someone else stored there from a previous analysis. I hope you got all those parentheses in the right place! WARNING: R does not ask if you "really want to do this." It just does what you tell it to do. If you really didn't mean to, too bad! The information in your workspace is gone, and it does not go the trash either. It's gone! So be careful. This is a good old fashioned command line interface, and you're expected to know what you're doing.

The second command loaded the "MASS" library, which contains the data set. The third command placed a copy of the data set "anorexia" into your workspace. The fourth command asked for a list of items stored in the workspace. The output produced by R is labeled [1]. The fifth command asked to see the structure of this data object. It is a data frame, which is a table-like object containing cases (or observations) in the rows and variables in the columns. There are 72 cases each measured on 3 variables. Those variables are Treat, a factor with 3 levels (CBT is cognitive behavioral therapy, Cont is a no-treatment control, and FT is family therapy), Prewt, a numeric variable (pretreatment weight in pounds), and Postwt, also a numeric variable (posttreatment weight in pounds).

For more information, or to view the source for these data, request the help page by typing help(anorexia) at a command prompt. Help information will open in its own window in Windows and on the Mac. Close this window as you would any window on your system. In Linux, help information will open in the terminal. Press q (lower case Q) to return to the command prompt.

We continue by examining the data. Don't worry about memorizing what all the command do now. This is just for show. And remember, you don't have to type the stuff after the # symbol.

> summary(anorexia)             # Get a data summary.
  Treat        Prewt           Postwt      
 CBT :29   Min.   :70.00   Min.   : 71.30  
 Cont:26   1st Qu.:79.60   1st Qu.: 79.33  
 FT  :17   Median :82.30   Median : 84.05  
           Mean   :82.41   Mean   : 85.17  
           3rd Qu.:86.00   3rd Qu.: 91.55  
           Max.   :94.90   Max.   :103.60  
> attach(anorexia)              # Make the variables inside anorexia visible to R.
> by(anorexia, Treat, summary)  # Summarize each treatment group separately.
Treat: CBT
  Treat        Prewt           Postwt     
 CBT :29   Min.   :70.00   Min.   : 71.3  
 Cont: 0   1st Qu.:80.40   1st Qu.: 81.9  
 FT  : 0   Median :82.60   Median : 83.9  
           Mean   :82.69   Mean   : 85.7  
           3rd Qu.:85.00   3rd Qu.: 90.9  
           Max.   :94.90   Max.   :103.6  
------------------------------------------------------------ 
Treat: Cont
  Treat        Prewt           Postwt     
 CBT : 0   Min.   :70.50   Min.   :73.00  
 Cont:26   1st Qu.:77.72   1st Qu.:77.58  
 FT  : 0   Median :80.65   Median :80.70  
           Mean   :81.56   Mean   :81.11  
           3rd Qu.:85.88   3rd Qu.:84.67  
           Max.   :91.80   Max.   :89.60  
------------------------------------------------------------ 
Treat: FT
  Treat        Prewt           Postwt     
 CBT : 0   Min.   :73.40   Min.   : 75.2  
 Cont: 0   1st Qu.:80.50   1st Qu.: 90.7  
 FT  :17   Median :83.30   Median : 92.5  
           Mean   :83.23   Mean   : 90.5  
           3rd Qu.:86.00   3rd Qu.: 95.2  
           Max.   :94.20   Max.   :101.6
>                                 # Just press Enter here to put in a blank line.
> tapply(Postwt, Treat, mean)     # Another way to get means by groups.
     CBT     Cont       FT 
85.69655 81.10769 90.49412 
> tapply(Postwt, Treat, sd)       # Sample standard deviations by groups.
     CBT     Cont       FT 
8.351924 4.744253 8.475072 
> tapply(Postwt, Treat, length)   # Group sizes.
 CBT Cont   FT 
  29   26   17

If you're relatively new to computers (i.e., under the age of 50), here's something you will have to get used to about command line interfaces. Once something is printed to the screen, it's history, gone, done with. It will never be updated, changed or revised. ALL NEW INFORMATION OR CHANGES TO OLD INFORMATION WILL BE PRINTED OUT BELOW WHERE YOU TYPE A NEW COMMAND. So if you change something, you will have to issue a new command to see that change. What has already scrolled by is ancient history.

Another thing you'll have to get used to is that when things go well in R, R will often be silent. I.e., it will not tell you that it did what you told it to do. You will have to ask it to show you. Here is an example.

> x = 7                      # (Press Enter!) Assign the value 7 to x. Notice there is no output.
> print(x)                   # View the value of x, after you remember to press Enter.
[1] 7
> x                          # Equivalent to the last command--an "implicit print."
[1] 7
> x = 8                      # A new value for x.
> x                          # View it.
[1] 8
> rm(x)                      # Erase x. Done! No confirmatory message is printed.
> x
Error: object 'x' not found
>

Now let's draw a nice graph of the anorexia data. The graph will be drawn in a separate graphics window (called a "graphics device" in R-speak). If no such window is open, R will open one. If one is already open, R will use it and erase whatever graph is already there. So if you wanted a copy of that old graph, well, too late now! You should have saved it while you had the chance! You can manipulate this graphics window just like any other window on your system. You can move it around, resize it, and close it by clicking on the proper button. Closing the window will also erase the graph, so if you want to keep it, take steps to do so before writing over it or closing it. I'll tell you how to do this in a future tutorial.

One thing I'll warn you about in advance though. When a new graphics device window opens, it steals focus from the R Console. To type more commands, you'll have to click in the R Console window with your mouse. (Wow! You get to use the mouse!)

> Change = Postwt - Prewt              # Create a new variable called Change.
> boxplot(Change ~ Treat)              # The symbol before Treat is a tilde. Read "Change by Treat."
> title(main = "Weight Gain in Pounds by Treatment")
> title(xlab = "Type of treatment")
> title(ylab = "Weight gain in pounds")

When you are done admiring the graphics device window, you can click it shut in the normal way for your operating system.

Now let's open a smaller graphics window and look at a scatterplot.

> windows(4,4,10)       ### Windows only!
> ### On a Mac, type quartz("Scatterplot",4,4,10)
> ### On Linux just skip this step.
> scatter.smooth(Prewt, Postwt)
> title(main="Postweight by Preweight Scatterplot")
> abline(a=0, b=1, col="red")

The scatter.smooth() function draws a scatterplot with a smoothed, nonparametric (lowess) regression line on the plot. I have also added a red "no change" line with the abline() function (intercept=0 and slope=1). This window can also be closed in the way you would close any window on your desktop. A later tutorial will deal with how to modify and dress up graphics for publication, but R does a fairly decent job of drawing graphics just by default, as you can see.

It appears the treatment effect is nonlinear, with lower preweight women benefiting from treatment more than higher preweight women. In fact, it appears that women with a preweight of 80 lbs. or more benefited hardly at all, on the average. An analysis of variance is, therefore, suspect.
Nevertheless, for the sake of illustration, here's how it would be done.

> aov.out = aov(Change ~ Treat)        # Change tilde Treat, read "Change by Treat"
> summary(aov.out)
            Df Sum Sq Mean Sq F value   Pr(>F)   
Treat        2  614.6   307.3  5.4223 0.006499 **
Residuals   69 3910.7    56.7                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
>
> TukeyHSD(aov.out)                    # A Tukey HSD post hoc test.
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Change ~ Treat)

$Treat
              diff       lwr       upr     p adj
Cont-CBT -3.456897 -8.327276  1.413483 0.2124428
FT-CBT    4.257809 -1.250554  9.766173 0.1607461
FT-Cont   7.714706  2.090124 13.339288 0.0045127

The aov() function accepts a formula that tells what analysis to do. The formula has the basic form DV ~ IV (DV tilde IV), which means "DV as explained by or predicted by or modeled by IV." Treatment contrasts are also available. (The ones done here don't make a lot of sense, but with a little more work, we could get them to make sense. The tests being done here are equivalent to Fisher LSD pairwise comparisons between the CBT group and the control group, and between the CBT group and the FT group.)

> summary.lm(aov.out)

Call:
aov(formula = Change ~ Treat)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.565  -4.543  -1.007   3.846  17.893 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    3.007      1.398   2.151   0.0350 *
TreatCont     -3.457      2.033  -1.700   0.0936 .
TreatFT        4.258      2.300   1.852   0.0684 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 7.528 on 69 degrees of freedom
Multiple R-squared: 0.1358,     Adjusted R-squared: 0.1108 
F-statistic: 5.422 on 2 and 69 DF,  p-value: 0.006499

We can also do a Kruskal-Wallis nonparametric ANOVA.

> kruskal.test(Change ~ Treat)

        Kruskal-Wallis rank sum test

data:  Change by Treat 
Kruskal-Wallis chi-squared = 9.0475, df = 2, p-value = 0.01085

That should get you started and your appetite whetted. Let's move on.

> ls()                                 # View contents of workspace.
[1] "anorexia" "aov.out"  "Change"
> detach(anorexia)                     # ALWAYS detach your data frame after attaching!
> rm(anorexia, aov.out, Change)        # Short for remove( )

But first you should execute the commands above to clean up a bit. It's a good idea to keep your workspace clean by erasing (removing) things you no longer need. This will come back to bite you eventually if you don't. Also, keep your search path clean (explained in a future tutorial) by detaching anything you've attached to it.

Problem Two: Risk Factors Associated With Low Infant Birthweight

We begin as before. You should not execute the library("MASS") command if you are still running the same R session. Try typing search(), and if you see "package:MASS" among the listed packages, it's already loaded.

> library("MASS")            # Only if you have quit R since the last example.
> data(birthwt)              # Add this dataset to your workspace.
> ls()                       # List the contents of your workspace.
[1] "birthwt"
> str(birthwt)               # View the structure of the dataset.
'data.frame':   189 obs. of  10 variables:
 $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : int  2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

I will begin by recoding some of the variables to make them factors. This could have been done when the data set was first prepared, but it wasn't. If you've jumped the gun and have already attached the dataframe, detach it. NEVER modify an attached dataframe. (It's not against the law, and it really doesn't even hurt anything, but you'll end up becoming terribly confused! Because...? Mods to data frames don't affect the attached copy, which is what you'll be working from.)

> ### low: birthwt below 2.5 kg, 0=no, 1=yes; we will leave this as is
> ### age: age of mother in years; a numeric variable; no need to recode
> ### lwt: weight of mother in pounds at last menstrual period; numeric; no need to recode
> ### race: R thinks this is an integer, so we want to (very carefully!!) recode
> value.labels = c("white","black","other")
> birthwt$race = factor(birthwt$race, labels=value.labels)
> ### Note: birthwt$race means the race variable inside the birthwt data frame
> ### smoke: smoking status during pregnancy (0="no", 1="yes"); also needs recoding
> value.labels = c("no","yes")         # this could be done in one step, if you're brave enough!
> birthwt$smoke = factor(birthwt$smoke, labels=value.labels)
> ### ptl: number of previous premature labors; no need to recode
> ### ht: history of hypertension (0="no", 1="yes")
> birthwt$ht = factor(birthwt$ht, labels=value.labels)
> ### ui: presence of uterine irritability (0="no", 1="yes")
> birthwt$ui = factor(birthwt$ui, labels=value.labels)
> ### ftv: number of physician visits during first trimester
> ### bwt: birth weight of baby in grams; change to kilograms
> birthwt$bwt = birthwt$bwt / 1000
> summary(birthwt)
      low              age             lwt           race    smoke    
 Min.   :0.0000   Min.   :14.00   Min.   : 80.0   white:96   no :115  
 1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   black:26   yes: 74  
 Median :0.0000   Median :23.00   Median :121.0   other:67            
 Mean   :0.3122   Mean   :23.24   Mean   :129.8                       
 3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0                       
 Max.   :1.0000   Max.   :45.00   Max.   :250.0                       
      ptl           ht        ui           ftv              bwt       
 Min.   :0.0000   no :177   no :161   Min.   :0.0000   Min.   :0.709  
 1st Qu.:0.0000   yes: 12   yes: 28   1st Qu.:0.0000   1st Qu.:2.414  
 Median :0.0000                       Median :0.0000   Median :2.977  
 Mean   :0.1958                       Mean   :0.7937   Mean   :2.945  
 3rd Qu.:0.0000                       3rd Qu.:1.0000   3rd Qu.:3.487  
 Max.   :3.0000                       Max.   :6.0000   Max.   :4.990

Notice that summary is a very intelligent command. It knows to summarize numbers as numbers and factors as factors. It will get even smarter in the future. Let's crosstabulate some of the categorical variables (factors).

> attach(birthwt)                 # Makes the variables inside birthwt visible without the $ thing.
> table(smoke, low)               # Hence, we don't need to type birthwt$smoke, etc.
     low
smoke  0  1
  no  86 29
  yes 44 30
>                                 # Inserting a blank line now and then makes things easier to read.
> xtabs(~ smoke + low)            # An alternate form that does the same thing.
     low
smoke  0  1
  no  86 29
  yes 44 30
>
> table(smoke, race)              # Remember, the space after the comma is optional.
     race
smoke white black other
  no     44    16    55
  yes    52    10    12
>
> table(smoke,ht,ui,low)          # A four-dimensional contingency table.
, , ui = no, low = 0

     ht
smoke no yes
  no  75   3
  yes 36   2

, , ui = yes, low = 0

     ht
smoke no yes
  no   8   0
  yes  6   0

, , ui = no, low = 1

     ht
smoke no yes
  no  18   4
  yes 20   3

, , ui = yes, low = 1

     ht
smoke no yes
  no   7   0
  yes  7   0

> big.table = table(smoke,ht,ui,low)        # Store output rather than printing it.
> ftable(big.table)                         # A more convenient way to view this.
              low  0  1
smoke ht  ui           
no    no  no      75 18
          yes      8  7
      yes no       3  4
          yes      0  0
yes   no  no      36 20
          yes      6  7
      yes no       2  3
          yes      0  0

> flat.table = ftable(big.table)            # Store output so we can work with it.
> prop.table(flat.table, 1)                 # As proportions rather than raw freqs.
              low         0         1
smoke ht  ui                         
no    no  no      0.8064516 0.1935484
          yes     0.5333333 0.4666667
      yes no      0.4285714 0.5714286
          yes           NaN       NaN
yes   no  no      0.6428571 0.3571429
          yes     0.4615385 0.5384615
      yes no      0.4000000 0.6000000
          yes           NaN       NaN

Let's look at smoking by race. Some of these commands are long, so I broke them in the middle by hitting Enter and continuing on the next line. (On a Mac you'll have to erase the close parenthesis first.) Remember not to type the + signs, as R will supply them. And spacing is optional. I added some spacing to make the command easier to read. Watch your commas and quotes, too. One missed or misplaced comma will make the whole command wrong!

> smoke.by.race = table(smoke, race)
> barplot(smoke.by.race, beside=T,
+         main="Smoking by Race",
+         ylab="Freq", xlab="Race",
+         col=c("black","lightgray"))
> legend(x=3.5, y=45, legend=c("nonsmokers","smokers"),
+         fill=c("black","lightgray"))
> results = chisq.test(smoke.by.race)
> results

        Pearson's Chi-squared test

data:  smoke.by.race 
X-squared = 21.779, df = 2, p-value = 1.865e-05

> results$expected                          # View the expected frequencies.
     race
smoke   white    black    other
  no  58.4127 15.82011 40.76720
  yes 37.5873 10.17989 26.23280
> results$residuals                         # View the (standardized) residuals in each cell.
     race
smoke       white       black       other
  no  -1.88578277  0.04522852  2.22912825
  yes  2.35084895 -0.05638265 -2.77886927

Are any of the factors related to low birthweight? We'll find out by using logistic regression, an advanced technique incorporated into the default R packages you get at download.

> glm.out = glm(low ~ race + smoke + ht + ui, family=binomial(logit))
> summary(glm.out)

Call:
glm(formula = low ~ race + smoke + ht + ui, family = binomial(logit))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6618  -0.7922  -0.4825   1.1413   2.1016  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.0919     0.3800  -5.505 3.69e-08 ***
raceblack     1.0638     0.4990   2.132  0.03301 *  
raceother     1.0834     0.4131   2.622  0.00873 ** 
smokeyes      1.0940     0.3797   2.881  0.00397 ** 
htyes         1.3594     0.6297   2.159  0.03087 *  
uiyes         1.0059     0.4385   2.294  0.02179 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 211.17  on 183  degrees of freedom
AIC: 223.17

Number of Fisher Scoring iterations: 4

Apparently they all are! Don't forget to clean up your workspace and search path.

> detach(birthwt)                 ### Always detach after you attach!
> ls()                            # List contents of workspace. Or you could do this from a menu.
[1] "big.table"     "birthwt"       "flat.table"    "glm.out"      
[5] "results"       "smoke.by.race"
> rm(list=ls())                   # Completely clear the workspace. You can also do this from a menu.
>ls()
character(0)

Notice that when I was naming things, I did not put spaces or dashes in those names. R wouldn't like that! Use a period instead. Notice also that the chisq.test( ) and glm( ) commands produced no output when they were executed, because that output was stored in results and glm.out, respectively. We then had to ask to see the results, in one case by using summary(glm.out). This is typically how R works.

Problem Three: Some Factors Associated With Car Insurance Claims

As before, don't execute the library() command if you are still in the same active R session (MASS still in the search path).

> library("MASS")                 # Not necessary if still in search path.
> data(Insurance)
> str(Insurance)
'data.frame':   64 obs. of  5 variables:
 $ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Group   : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ...
 $ Age     : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
 $ Holders : int  197 264 246 1680 284 536 696 3582 133 286 ...
 $ Claims  : int  38 35 20 156 63 84 89 400 19 52 ...

> summary(Insurance)
 District    Group       Age        Holders            Claims      
 1:16     <1l   :16   <25  :16   Min.   :   3.00   Min.   :  0.00  
 2:16     1-1.5l:16   25-29:16   1st Qu.:  46.75   1st Qu.:  9.50  
 3:16     1.5-2l:16   30-35:16   Median : 136.00   Median : 22.00  
 4:16     >2l   :16   >35  :16   Mean   : 364.98   Mean   : 49.23  
                                 3rd Qu.: 327.50   3rd Qu.: 55.50  
                                 Max.   :3582.00   Max.   :400.00

District is the location of the policy holder, with 4 being in a major city. Group has to do with the size of the car's engine in liters. Age is obvious. Holders and Claims are both counts.

There is no replication within the cells, so a fully crossed 3-way ANOVA would leave us without an error term. Examination of the data reveals an apparent absence of interactions. So we will request a 3-way ANOVA without testing the interactions.

> aov.out = aov(I(Claims/Holders*100) ~ District + Age + Group, data=Insurance)
> ### Note: We made the DV--claims per 100 holders--on the fly.
> summary(aov.out)
            Df  Sum Sq Mean Sq F value    Pr(>F)    
District     3   88.87   29.62  0.6014 0.6168748    
Age          3  517.63  172.54  3.5030 0.0214000 *  
Group        3  974.59  324.86  6.5955 0.0007036 ***
Residuals   54 2659.78   49.26                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> TukeyHSD(aov.out, which=c("Age","Group"))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = I(Claims/Holders * 100) ~ District + Age + Group, data = Insurance)

$Age
                 diff        lwr       upr     p adj
25-29-<25   -2.858071  -9.435711  3.719568 0.6593722
30-35-<25   -5.016151 -11.593791  1.561488 0.1928569
>35-<25     -7.748445 -14.326084 -1.170805 0.0148258
30-35-25-29 -2.158080  -8.735719  4.419560 0.8204004
>35-25-29   -4.890374 -11.468013  1.687266 0.2117055
>35-30-35   -2.732294  -9.309933  3.845346 0.6903616

$Group
                  diff        lwr       upr     p adj
1-1.5l-<1l    2.237368 -4.3402712  8.815008 0.8039620
1.5-2l-<1l    7.146119  0.5684797 13.723759 0.0282532
>2l-<1l       9.879534  3.3018946 16.457174 0.0011507
1.5-2l-1-1.5l 4.908751 -1.6688885 11.486390 0.2088731
>2l-1-1.5l    7.642166  1.0645263 14.219805 0.0166593
>2l-1.5-2l    2.733415 -3.8442246  9.311054 0.6900880

It appears that younger drivers and drivers of cars with larger engines are more likely to file claims.

My advice would be not to do something so silly as to use greater than, less than, and minus signs in your group names. Just a thought! But notice even when something so silly as mathematical operators are included in group names, R can handle it if you're careful. I would also not take those results at face value, as an examination of the model assumptions reveals heterogeneity of variance and nonnormality. Perhaps a transform of the response could fix that.

Yes, those graphs were also created in R. They show clearly that the homogeneity of variance and normal distribution assumptions have been violated.

> ls()
[1] "aov.out"        "Insurance"
> rm(list=ls())                        ### Clean up the mess we've made.

There is no need to detach Insurance because it was never attached. Check the search path to confirm this if you don't remember. And by the way, deleting it does not detach it.

Problem Four: Some Cars Sold In the United States in 1993

> library("MASS")                      ### Don't do this if it's already done.
> ### If you accidentally attach something twice, no harm no foul.
> ### Just detach it once.
> data(Cars93)
> str(Cars93)
'data.frame':   93 obs. of  27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

What would be interesting to look at here?

> summary(Cars93)
    Manufacturer     Model         Type      Min.Price         Price      
 Chevrolet: 8    100    : 1   Compact:16   Min.   : 6.70   Min.   : 7.40  
 Ford     : 8    190E   : 1   Large  :11   1st Qu.:10.80   1st Qu.:12.20  
 Dodge    : 6    240    : 1   Midsize:22   Median :14.70   Median :17.70  
 Mazda    : 5    300E   : 1   Small  :21   Mean   :17.13   Mean   :19.51  
 Pontiac  : 5    323    : 1   Sporty :14   3rd Qu.:20.30   3rd Qu.:23.30  
 Buick    : 4    535i   : 1   Van    : 9   Max.   :45.40   Max.   :61.90  
 (Other)  :57    (Other):87                                               
   Max.Price       MPG.city      MPG.highway                  AirBags  
 Min.   : 7.9   Min.   :15.00   Min.   :20.00   Driver & Passenger:16  
 1st Qu.:14.7   1st Qu.:18.00   1st Qu.:26.00   Driver only       :43  
 Median :19.6   Median :21.00   Median :28.00   None              :34  
 Mean   :21.9   Mean   :22.37   Mean   :29.09                          
 3rd Qu.:25.3   3rd Qu.:25.00   3rd Qu.:31.00                          
 Max.   :80.0   Max.   :46.00   Max.   :50.00                          
                                                                       
 DriveTrain  Cylinders    EngineSize      Horsepower         RPM      
 4WD  :10   3     : 3   Min.   :1.000   Min.   : 55.0   Min.   :3800  
 Front:67   4     :49   1st Qu.:1.800   1st Qu.:103.0   1st Qu.:4800  
 Rear :16   5     : 2   Median :2.400   Median :140.0   Median :5200  
            6     :31   Mean   :2.668   Mean   :143.8   Mean   :5281  
            8     : 7   3rd Qu.:3.300   3rd Qu.:170.0   3rd Qu.:5750  
            rotary: 1   Max.   :5.700   Max.   :300.0   Max.   :6500  
                                                                      
  Rev.per.mile  Man.trans.avail Fuel.tank.capacity   Passengers   
 Min.   :1320   No :32          Min.   : 9.20      Min.   :2.000  
 1st Qu.:1985   Yes:61          1st Qu.:14.50      1st Qu.:4.000  
 Median :2340                   Median :16.40      Median :5.000  
 Mean   :2332                   Mean   :16.66      Mean   :5.086  
 3rd Qu.:2565                   3rd Qu.:18.80      3rd Qu.:6.000  
 Max.   :3755                   Max.   :27.00      Max.   :8.000  
                                                                  
     Length        Wheelbase         Width        Turn.circle   
 Min.   :141.0   Min.   : 90.0   Min.   :60.00   Min.   :32.00  
 1st Qu.:174.0   1st Qu.: 98.0   1st Qu.:67.00   1st Qu.:37.00  
 Median :183.0   Median :103.0   Median :69.00   Median :39.00  
 Mean   :183.2   Mean   :103.9   Mean   :69.38   Mean   :38.96  
 3rd Qu.:192.0   3rd Qu.:110.0   3rd Qu.:72.00   3rd Qu.:41.00  
 Max.   :219.0   Max.   :119.0   Max.   :78.00   Max.   :45.00  
                                                                
 Rear.seat.room   Luggage.room       Weight         Origin              Make   
 Min.   :19.00   Min.   : 6.00   Min.   :1695   USA    :48   Acura Integra: 1  
 1st Qu.:26.00   1st Qu.:12.00   1st Qu.:2620   non-USA:45   Acura Legend : 1  
 Median :27.50   Median :14.00   Median :3040                Audi 100     : 1  
 Mean   :27.83   Mean   :13.89   Mean   :3073                Audi 90      : 1  
 3rd Qu.:30.00   3rd Qu.:15.00   3rd Qu.:3525                BMW 535i     : 1  
 Max.   :36.00   Max.   :22.00   Max.   :4105                Buick Century: 1  
 NA's   : 2.00   NA's   :11.00                               (Other)      :87

Some of the variables have missing values, viz., Rear.seat.room and Luggage.room. In R, missing values are denoted by NA, or "not available". I will begin by removing those cases.

> Cars = na.omit(Cars93)          # Writing the result to a new data frame.
> dim(Cars)                       # Get dimensions of remaining data frame.
[1] 82 27

Eleven cases have been eliminated. It might also be desirable to use the variable "Make" as row names for the data frame rather than have it as a variable. Your wish is R's fondest desire (provided you know the command.)

> rownames(Cars) = Cars$Make
> Cars$Make = NULL                # Delete the variable from the data frame.
> Cars[1:6,1:7]                   # See some of the result (6 rows and 7 columns).
              Manufacturer   Model    Type Min.Price Price Max.Price MPG.city
Acura Integra        Acura Integra   Small      12.9  15.9      18.8       25
Acura Legend         Acura  Legend Midsize      29.2  33.9      38.7       18
Audi 90               Audi      90 Compact      25.9  29.1      32.3       20
Audi 100              Audi     100 Midsize      30.8  37.7      44.6       19
BMW 535i               BMW    535i Midsize      23.7  30.0      36.2       22
Buick Century        Buick Century Midsize      14.2  15.7      17.3       22

The advantage of this is that you can refer to rows of the data frame by the make of automobile. If you wanted to see all info on the Buick Century, for example, it's as simple as this. (Otherwise, you'd have to refer to this row of data by a number, that off the top of your head you may not know.)

> Cars["Buick Century", ]
              Manufacturer   Model    Type Min.Price Price Max.Price MPG.city
Buick Century        Buick Century Midsize      14.2  15.7      17.3       22
              MPG.highway     AirBags DriveTrain Cylinders EngineSize
Buick Century          31 Driver only      Front         4        2.2
              Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
Buick Century        110 5200         2565              No               16.4
              Passengers Length Wheelbase Width Turn.circle Rear.seat.room
Buick Century          6    189       105    69          41             28
              Luggage.room Weight Origin
Buick Century           16   2880    USA

Now we'll look at a correlation matrix of most of the numeric variables. This will illustrate how in R you can use one command to do many things at once. Doing so makes the command kinda complicated, so be very careful with your typing. The first part of the command rounds the results to 3 decimal places. Nested inside of that command is one that calculates the correlation matrix. And nested inside of that is a restriction on the Cars dataframe that specifies only certain variables to be included.

> round(cor(Cars[,c(4,5,6,7,8,12,13,14,17,18,19,20,21,25)]),3)
                   Min.Price  Price Max.Price MPG.city MPG.highway EngineSize
Min.Price              1.000  0.971     0.910   -0.663      -0.664      0.705
Price                  0.971  1.000     0.983   -0.622      -0.626      0.642
Max.Price              0.910  0.983     1.000   -0.565      -0.572      0.567
MPG.city              -0.663 -0.622    -0.565    1.000       0.945     -0.747
MPG.highway           -0.664 -0.626    -0.572    0.945       1.000     -0.666
EngineSize             0.705  0.642     0.567   -0.747      -0.666      1.000
Horsepower             0.795  0.787     0.750   -0.704      -0.687      0.772
RPM                   -0.092 -0.035     0.011    0.327       0.234     -0.524
Fuel.tank.capacity     0.736  0.707     0.656   -0.797      -0.739      0.821
Passengers             0.327  0.274     0.222   -0.463      -0.414      0.526
Length                 0.611  0.552     0.485   -0.725      -0.617      0.852
Wheelbase              0.676  0.628     0.566   -0.671      -0.570      0.859
Width                  0.544  0.495     0.438   -0.694      -0.588      0.890
Weight                 0.777  0.737     0.677   -0.836      -0.777      0.917
                   Horsepower    RPM Fuel.tank.capacity Passengers Length
Min.Price               0.795 -0.092              0.736      0.327  0.611
Price                   0.787 -0.035              0.707      0.274  0.552
Max.Price               0.750  0.011              0.656      0.222  0.485
MPG.city               -0.704  0.327             -0.797     -0.463 -0.725
MPG.highway            -0.687  0.234             -0.739     -0.414 -0.617
EngineSize              0.772 -0.524              0.821      0.526  0.852
Horsepower              1.000  0.024              0.796      0.241  0.643
RPM                     0.024  1.000             -0.244     -0.378 -0.440
Fuel.tank.capacity      0.796 -0.244              1.000      0.451  0.794
Passengers              0.241 -0.378              0.451      1.000  0.646
Length                  0.643 -0.440              0.794      0.646  1.000
Wheelbase               0.663 -0.405              0.793      0.649  0.911
Width                   0.689 -0.489              0.772      0.516  0.885
Weight                  0.860 -0.348              0.900      0.510  0.880
                   Wheelbase  Width Weight
Min.Price              0.676  0.544  0.777
Price                  0.628  0.495  0.737
Max.Price              0.566  0.438  0.677
MPG.city              -0.671 -0.694 -0.836
MPG.highway           -0.570 -0.588 -0.777
EngineSize             0.859  0.890  0.917
Horsepower             0.663  0.689  0.860
RPM                   -0.405 -0.489 -0.348
Fuel.tank.capacity     0.793  0.772  0.900
Passengers             0.649  0.516  0.510
Length                 0.911  0.885  0.880
Wheelbase              1.000  0.850  0.879
Width                  0.850  1.000  0.869
Weight                 0.879  0.869  1.000

That would be a lot prettier if some of those variable names were shorter! (We could make them shorter if we wanted to go to the trouble.) Now we will do a linear multiple regression using the lm(), or "linear model" function, attempting to predict gas mileage from four predictors. Then we will use the step() function to do backwards regression to eliminate the nonsignificant terms. Complicated stuff, but all incorporated into the default R packages.

> lm.model = lm(MPG.city ~ EngineSize + Horsepower + Weight + Origin, data=Cars)
> summary(lm.model)

Call:
lm(formula = MPG.city ~ EngineSize + Horsepower + Weight + Origin, 
    data = Cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7646 -1.4445 -0.1286  1.1505 13.2322 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.910e+01  3.048e+00  16.110  < 2e-16 ***
EngineSize     1.220e+00  9.173e-01   1.330    0.187    
Horsepower    -5.836e-05  1.384e-02  -0.004    0.997    
Weight        -9.967e-03  1.890e-03  -5.273 1.19e-06 ***
Originnon-USA  1.265e+00  7.917e-01   1.598    0.114    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 3.083 on 77 degrees of freedom
Multiple R-squared: 0.7113,     Adjusted R-squared: 0.6963 
F-statistic: 47.42 on 4 and 77 DF,  p-value: < 2.2e-16

> step(lm.model, direction="backward")
Start:  AIC=189.48
MPG.city ~ EngineSize + Horsepower + Weight + Origin

             Df Sum of Sq    RSS    AIC
- Horsepower  1 0.0001691 731.78 187.48
- EngineSize  1     16.82 748.60 189.34
                    731.78 189.48
- Origin      1     24.28 756.06 190.15
- Weight      1    264.26 996.04 212.76

Step:  AIC=187.48
MPG.city ~ EngineSize + Weight + Origin

             Df Sum of Sq     RSS     AIC
- EngineSize  1     16.84  748.62  187.34
                     731.78  187.48
- Origin      1     26.76  758.54  188.42
- Weight      1    387.47 1119.25  220.32

Step:  AIC=187.34
MPG.city ~ Weight + Origin

         Df Sum of Sq     RSS     AIC
- Origin  1     15.62  764.24  187.04
                 748.62  187.34
- Weight  1   1588.44 2337.06  278.69

Step:  AIC=187.04
MPG.city ~ Weight

         Df Sum of Sq     RSS     AIC
                 764.24  187.04
- Weight  1   1770.16 2534.40  283.34

Call:
lm(formula = MPG.city ~ Weight, data = Cars)

Coefficients:
(Intercept)       Weight  
   47.76860     -0.00826

Of the original predictors, only Weight remains after the backward regression.

> quit("no")

The end. We don't need to clean up here because we are telling R to quit without saving the workspace. This will cause R to start up next time with the same workspace it had when we started this time. Thus, our curent workspace will be lost. No harm there. Our search path will also be cleaned up every time R restarts.

revised 2016 January 14