R Tutorials Top Banner

A WORK-ALONG DEMONSTRATION OF R


Preliminaries

The purpose of this tutorial is to show you a bit--and I mean just a bit--of what R can do and how it works. If you have already downloaded and installed R, you can work along with me. If not, you should be able to get a pretty good idea of what R is about just by reading through this tutorial. However, I will not be explaining much of what I'm doing, so if you've never used R before, much of this may seem pretty cryptic to you. My intent here is not to teach you R but just to show you a little of what it can do. It will also give you some practice at typing and executing R commands and will show you what R looks like on your computer.

We will work on four problems. Do any or all of them, but do at least the first one.

  • Weight change in anorexic women associated with three different forms of therapy.
  • Risk factors associated with low infant birth weight.
  • Some factors associated with car insurance claims.
  • Data on cars on sale in the United States in 1993.

You will not have to enter any of these data sets, as they come with the default installation of R. They are in an R library that does not load automatically when R starts, so our first job will be getting access to the data. I will repeat the command for doing so at the beginning of each section, but if you are doing the entire tutorial at one sitting, you have to execute the library( ) command only once. In fact, I will remind you not to do otherwise.

When you start R on your computer, it will open a window that looks something like this. This is the Windows version--it will look slightly, but not dramatically, different on the Mac, and will be recognizable in Linux.

R Console

The window into which you type your commands is called the R Console. On the Mac and in Windows, this runs inside of a graphical user interface (GUI). In Linux, it runs in a terminal window. The greater than sign (>) is the command prompt. You don't type this part--R supplies it. If the command extends onto two or more lines, this prompt will change to a plus sign (+). Don't type the plus sign prompts either.

To see what that looks like, try this:

> ls()

Type a lower case L ("el"), not a one, a lower case S, an open and close parentheses, and then press the Enter (or Return) key. This command lists the contents of a place in memory called the workspace (the computer's memory, not your memory). If there is nothing there, the output will say character(0). Now suppose you neglected to type the close parenthesis, a common mistake. R will recognize that the command is incomplete and give you a continuation prompt (+) so that you can complete it. Type the close parenthesis and press Enter. Be sure to press the Enter key after typing a command. Otherwise, R will just sit there waiting for you to do so, and R has infinite patience!

> ls(
+ )

R is usually insensitive to spacing, so add spacing wherever it helps you to make things clear, or leave it out if you want to save key strokes. R is case sensitive, however, so if you capitalize something that shouldn't be, or fail to when you should, you will get an error message of some sort. If you get a "function not found" or "object not found" error, that is the first thing you should check. Then check your spelling and punctuation. In R, attention to details is crucial.

In R, as in many programming languages, comments start with pound symbols (#). Anything following a pound symbol is ignored by R and is there purely for the purpose of instruction or annotation. You don't need to type these bits. And did I mention that R is, in fact, a full featured programming language? You don't need to be intimidated by that. R has such a large number of built-in statistical functions that most people use it simply for interactive data analysis and graphics. It does mean, however, that R will be very picky about syntax. You will have to close all open parentheses, get your commas and quotes in the right places, and things like that. Annoying, I agree, but also very, very powerful!

One thing you may find handy--commands can be copied and pasted into R. Don't copy the command prompts, however.


Weight Change in Anorexic Women

Begin like this...

> remove(list=ls())               ### You type only the bits after the > prompt.
> library("MASS")
> data(anorexia)
> ls()
[1] "anorexia"                    ### No prompt here. This is an output line.
> str(anorexia)
'data.frame':   72 obs. of  3 variables:
 $ Treat : Factor w/ 3 levels "CBT","Cont","FT": 2 2 2 2 2 2 2 2 2 2 ...
 $ Prewt : num  80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...
 $ Postwt: num  80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 ...
The first of these commands cleaned out your workspace, deleting anything you or someone else stored there from a previous analysis. I hope you got all those parentheses in the right place! WARNING: R does not ask you if you "really want to do this." It just does what you tell it to do. If you really didn't mean to, too bad! The information in your workspace is gone, and it does not go the trash either. It's gone! So be careful. This is a good old fashioned command line interface, and you're expected to know what you're doing.

The second command loaded the "MASS" library, which contains the data set. The third command placed a copy of the data set "anorexia" into your workspace. The fourth command asked for a list of items stored in the workspace. The output produced by R is labeled [1]. The fifth command asked to see the structure of this data object. It is a data frame, which is a table-like object containing cases (or observations) in the rows and variables in the columns. There are 72 cases each measured on 3 variables. Those variables are Treat, a factor with 3 levels (CBT is cognitive behavioral therapy, Cont is a no-treatment control, and FT is family therapy), Prewt, a numerical variable (pretreatment weight in pounds), and Postwt, also a numerical variable (posttreatment weight in pounds).

For more information, or to view the source for these data, request the help page by typing help(anorexia) at a command prompt. Help information will open in its own window in Windows and on the Mac. Close this window as you would any window on your system. In Linux, help information will open in the terminal. Press q (lower case Q) to return to the command prompt.

We continue by examining the data. Don't worry about memorizing what all the command do now. This is just for show. And remember, you don't have to type the stuff after the # symbol.

> summary(anorexia)          # Get a data summary.
  Treat        Prewt           Postwt      
 CBT :29   Min.   :70.00   Min.   : 71.30  
 Cont:26   1st Qu.:79.60   1st Qu.: 79.33  
 FT  :17   Median :82.30   Median : 84.05  
           Mean   :82.41   Mean   : 85.17  
           3rd Qu.:86.00   3rd Qu.: 91.55  
           Max.   :94.90   Max.   :103.60  
> attach(anorexia)           # Make the variables inside anorexia visible to R.
> by(anorexia, Treat, summary)    # Summarize each treatment group separately.
Treat: CBT
  Treat        Prewt           Postwt     
 CBT :29   Min.   :70.00   Min.   : 71.3  
 Cont: 0   1st Qu.:80.40   1st Qu.: 81.9  
 FT  : 0   Median :82.60   Median : 83.9  
           Mean   :82.69   Mean   : 85.7  
           3rd Qu.:85.00   3rd Qu.: 90.9  
           Max.   :94.90   Max.   :103.6  
------------------------------------------------------------ 
Treat: Cont
  Treat        Prewt           Postwt     
 CBT : 0   Min.   :70.50   Min.   :73.00  
 Cont:26   1st Qu.:77.72   1st Qu.:77.58  
 FT  : 0   Median :80.65   Median :80.70  
           Mean   :81.56   Mean   :81.11  
           3rd Qu.:85.88   3rd Qu.:84.67  
           Max.   :91.80   Max.   :89.60  
------------------------------------------------------------ 
Treat: FT
  Treat        Prewt           Postwt     
 CBT : 0   Min.   :73.40   Min.   : 75.2  
 Cont: 0   1st Qu.:80.50   1st Qu.: 90.7  
 FT  :17   Median :83.30   Median : 92.5  
           Mean   :83.23   Mean   : 90.5  
           3rd Qu.:86.00   3rd Qu.: 95.2  
           Max.   :94.20   Max.   :101.6
>
> tapply(Postwt, Treat, mean)     # Another way to get means by groups.
     CBT     Cont       FT 
85.69655 81.10769 90.49412 
> tapply(Postwt, Treat, sd)       # Sample standard deviations by groups.
     CBT     Cont       FT 
8.351924 4.744253 8.475072 
> tapply(Postwt, Treat, length)   # Group sizes.
 CBT Cont   FT 
  29   26   17

If you're relatively new to computers (i.e., under the age of 40), here's something you will have to get used to about command line interfaces. Once something is printed to the screen, it's history, gone, done with. It will never be updated, changed or revised. ALL NEW INFORMATION OR CHANGES TO OLD INFORMATION WILL BE PRINTED OUT BELOW WHERE YOU TYPE A NEW COMMAND. So if you change something, you will have to issue a new command to see that change. What has already scrolled by is ancient history.

Another thing you'll have to get used to is that when things go well in R, R will often be silent. I.e., it will not tell you that it did what you told it to do. You will have to ask it to show you. Here is an example.

> x = 7            # Assign the value 7 to x. Notice there is no output.
> x                # View the value of x, after you remember to press Enter.
[1] 7
> x = 8            # A new value for x.
> x                # View it.
[1] 8
> rm(x)            # Erase x. Done! No confirmatory message is printed.
> x
Error: object 'x' not found
>

Now let's draw a nice graph of the anorexia data. The graph will be drawn in a separate graphics window (called a "graphics device" in R-speak). If no such window is open, R will open one. If one is already open, R will use it and erase whatever graph is already there. So if you wanted a copy of that old graph, well, too late now! You can manipulate this graphics window just like any other window on your system. You can move it around, resize it, and close it by clicking on the proper button. Closing the window will also erase the graph, so if you want to keep it, take steps to do so before writing over it or closing it. I'll tell you how to do this in a future tutorial.

> Change = Postwt - Prewt              # Create a variable called Change.
> boxplot(Change ~ Treat)              # The symbol before Treat is a tilde.
> title(main = "Weight Gain in Pounds by Treatment")
> title(xlab = "Type of treatment")
> title(ylab = "Weight gain in pounds")
Boxplots
When you are done admiring the graphics device window, you can click it shut in the normal way for your operating system.

Scatterplot

When a graphics window opens, by the way, it removes focus from the R Console window, so you will have to click in the console to continue working there. Now let's open a smaller graphics window and look at a scatterplot...

> windows(4,4,10)       ### Windows only!
> ### On a Mac, type quartz("Q",4,4,10)
> ### On Linux just skip this step.
> scatter.smooth(Prewt, Postwt)
> title(main="Postweight by Preweight Scatterplot")
> abline(a=0, b=1, col="red")
The scatter.smooth( ) function draws a scatterplot with a smoothed, nonparametric (lowess) regression line on the plot. I have also added a red "no change" line with the abline( ) function (intercept=0 and slope=1). This window can also be closed in the way you would close any window on your desktop. A later tutorial will deal with how to modify and dress up graphics for publication, but R does a fairly decent job of drawing graphics just by default, as you can see.

It appears the treatment effect is nonlinear, with lower preweight women benefiting from treatment more than higher preweight women. An analysis of variance is, therefore, suspect. Nevertheless, for the sake of illustration...

> aov.out = aov(Change ~ Treat)        # Change tilde Treat
> summary(aov.out)
            Df Sum Sq Mean Sq F value   Pr(>F)   
Treat        2  614.6   307.3  5.4223 0.006499 **
Residuals   69 3910.7    56.7                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
> TukeyHSD(aov.out)                    # A Tukey HSD post hoc test.
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Change ~ Treat)

$Treat
              diff       lwr       upr     p adj
Cont-CBT -3.456897 -8.327276  1.413483 0.2124428
FT-CBT    4.257809 -1.250554  9.766173 0.1607461
FT-Cont   7.714706  2.090124 13.339288 0.0045127
The aov( ) function accepts a formula that tells what analysis to do. The formula has the basic form DV ~ IV (DV tilde IV), which means "DV as explained by or predicted by or modeled by IV." Treatment contrasts are also available...
> summary.lm(aov.out)

Call:
aov(formula = Change ~ Treat)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.565  -4.543  -1.007   3.846  17.893 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    3.007      1.398   2.151   0.0350 *
TreatCont     -3.457      2.033  -1.700   0.0936 .
TreatFT        4.258      2.300   1.852   0.0684 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 7.528 on 69 degrees of freedom
Multiple R-squared: 0.1358,     Adjusted R-squared: 0.1108 
F-statistic: 5.422 on 2 and 69 DF,  p-value: 0.006499
We can also do a Kruskal-Wallis nonparametric ANOVA...
> kruskal.test(Change ~ Treat)

        Kruskal-Wallis rank sum test

data:  Change by Treat 
Kruskal-Wallis chi-squared = 9.0475, df = 2, p-value = 0.01085
That should get you started and your appetite whetted. Let's move on.
> ls()                                 # View contents of workspace.
[1] "anorexia" "aov.out"  "Change"
> detach(anorexia)                     # ALWAYS detach after attaching!
> rm(anorexia, aov.out, Change)        # Short for remove( )
But first you should execute the commands above to clean up a bit. It's a good idea to keep your workspace clean by erasing (removing) things you no longer need. This will come back to bite you eventually if you don't. Also, keep your search path clean (explained in a future tutorial) by detaching anything you've attached to it.


Risk Factors Associated With Low Infant Birthweight

We begin as before. You should not execute the library("MASS") command if you are still running the same R session. Try typing search( ), and if you see "package:MASS" among the listed packages, it's already loaded.

> library("MASS")       # Only if you have quit R since the last example.
> data(birthwt)         # Add this dataset to your workspace.
> ls()                  # List the contents of your workspace.
[1] "birthwt"
> str(birthwt)          # View the structure of the dataset.
'data.frame':   189 obs. of  10 variables:
 $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
 $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
 $ race : int  2 3 1 1 1 3 1 3 1 1 ...
 $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
 $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
 $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
 $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
I will begin by recoding some of the variables to make them factors. This could have been done when the data set was first prepared, but it wasn't. If you've jumped the gun and have already attached the dataframe, detach it. NEVER modify an attached dataframe.
> ### low: birthwt below 2.5 kg, 0=no, 1=yes; we will leave this as is
> ### age: age of mother in years; no need to recode
> ### lwt: weight of mother in pounds at last menstrual period; no need to recode
> ### race: R thinks this is an integer, so we want to recode
> birthwt$race = factor(birthwt$race, labels=c("white","black","other"))
> ### Note: birthwt$race means the race variable inside the birthwt data frame
> ### smoke: smoking status during pregnancy; also needs recoding
> birthwt$smoke = factor(birthwt$smoke, labels=c("no","yes"))
> ### ptl: number of previous premature labors; no need to recode
> ### ht: history of hypertension (0="no", 1="yes")
> birthwt$ht = factor(birthwt$ht, labels=c("no","yes"))
> ### ui: presence of uterine irritability (0="no", 1="yes")
> birthwt$ui = factor(birthwt$ui, labels=c("no","yes"))
> ### ftv: number of physician visits during first trimester
> ### bwt: birth weight of baby in grams; change to kilograms
> birthwt$bwt = birthwt$bwt / 1000
> summary(birthwt)
      low              age             lwt           race    smoke    
 Min.   :0.0000   Min.   :14.00   Min.   : 80.0   white:96   no :115  
 1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   black:26   yes: 74  
 Median :0.0000   Median :23.00   Median :121.0   other:67            
 Mean   :0.3122   Mean   :23.24   Mean   :129.8                       
 3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0                       
 Max.   :1.0000   Max.   :45.00   Max.   :250.0                       
      ptl           ht        ui           ftv              bwt       
 Min.   :0.0000   no :177   no :161   Min.   :0.0000   Min.   :0.709  
 1st Qu.:0.0000   yes: 12   yes: 28   1st Qu.:0.0000   1st Qu.:2.414  
 Median :0.0000                       Median :0.0000   Median :2.977  
 Mean   :0.1958                       Mean   :0.7937   Mean   :2.945  
 3rd Qu.:0.0000                       3rd Qu.:1.0000   3rd Qu.:3.487  
 Max.   :3.0000                       Max.   :6.0000   Max.   :4.990
Notice that summary is a very intelligent command. It knows to summarize numbers as numbers and factors as factors. It will get even smarter in the future. Let's crosstabulate some of the categorical variables (factors).
> attach(birthwt)            # Makes the variables inside birthwt visible.
> table(smoke, low)          # Hence, we don't need to type birthwt$smoke, etc.
     low
smoke  0  1
  no  86 29
  yes 44 30
> xtabs(~ smoke + low)       # An alternate form that does the same thing.
     low
smoke  0  1
  no  86 29
  yes 44 30
> table(smoke, race)         # Remember, the space after the comma is optional.
     race
smoke white black other
  no     44    16    55
  yes    52    10    12
> table(smoke,ht,ui,low)     # A four-dimensional contingency table.
, , ui = no, low = 0

     ht
smoke no yes
  no  75   3
  yes 36   2

, , ui = yes, low = 0

     ht
smoke no yes
  no   8   0
  yes  6   0

, , ui = no, low = 1

     ht
smoke no yes
  no  18   4
  yes 20   3

, , ui = yes, low = 1

     ht
smoke no yes
  no   7   0
  yes  7   0

> big.table = table(smoke,ht,ui,low)   # Store output rather than printing it.
> ftable(big.table)                    # A more convenient way to view this.
              low  0  1
smoke ht  ui           
no    no  no      75 18
          yes      8  7
      yes no       3  4
          yes      0  0
yes   no  no      36 20
          yes      6  7
      yes no       2  3
          yes      0  0

> flat.table = ftable(big.table)       # Store output so we can work with it.
> prop.table(flat.table, 1)            # As proportions rather than raw freqs.
              low         0         1
smoke ht  ui                         
no    no  no      0.8064516 0.1935484
          yes     0.5333333 0.4666667
      yes no      0.4285714 0.5714286
          yes           NaN       NaN
yes   no  no      0.6428571 0.3571429
          yes     0.4615385 0.5384615
      yes no      0.4000000 0.6000000
          yes           NaN       NaN
Smoking barplot Let's look at smoking by race. Some of these commands are long, so I broke them in the middle by hitting Enter and continuing on the next line. Remember not to type the + signs, as R will supply them. And spacing is optional. I added some spacing to make the command easier to read. Watch your commas and quotes, too. One missed or misplaced comma will make the whole command wrong!
> smoke.by.race = table(smoke, race)
> barplot(smoke.by.race, beside=T,
+         main="Smoking by Race",
+         ylab="Freq", xlab="Race",
+         col=c("black","lightgray"))
> legend(x=3.5, y=45, legend=c("nonsmokers","smokers"),
+         fill=c("black","lightgray"))
> results = chisq.test(smoke.by.race)
> results

        Pearson's Chi-squared test

data:  smoke.by.race 
X-squared = 21.779, df = 2, p-value = 1.865e-05

> results$expected                          # View the expected values.
     race
smoke   white    black    other
  no  58.4127 15.82011 40.76720
  yes 37.5873 10.17989 26.23280
> results$residuals                         # View the residuals in each cell.
     race
smoke       white       black       other
  no  -1.88578277  0.04522852  2.22912825
  yes  2.35084895 -0.05638265 -2.77886927
Are any of the factors related to low birthweight? We'll find out by using logistic regression, an advanced technique incorporated into the default R packages you get at download.
> glm.out = glm(low ~ race + smoke + ht + ui, family=binomial(logit))
> summary(glm.out)

Call:
glm(formula = low ~ race + smoke + ht + ui, family = binomial(logit))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6618  -0.7922  -0.4825   1.1413   2.1016  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.0919     0.3800  -5.505 3.69e-08 ***
raceblack     1.0638     0.4990   2.132  0.03301 *  
raceother     1.0834     0.4131   2.622  0.00873 ** 
smokeyes      1.0940     0.3797   2.881  0.00397 ** 
htyes         1.3594     0.6297   2.159  0.03087 *  
uiyes         1.0059     0.4385   2.294  0.02179 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 211.17  on 183  degrees of freedom
AIC: 223.17

Number of Fisher Scoring iterations: 4
Apparently they all are! Don't forget to clean up your workspace and search path.
> detach(birthwt)                 ### Always detach after you attach!
> ls()                            # List contents of workspace.
[1] "big.table"     "birthwt"       "flat.table"    "glm.out"      
[5] "results"       "smoke.by.race"
> rm(list=ls())                   # Completely clear the workspace.
>ls()
character(0)
Notice that when I was naming things, I did not put spaces or dashes in those names. R wouldn't like that! Use a period instead. Notice also that the chisq.test( ) and glm( ) commands produced no output when they were executed, because that output was stored in results and glm.out, respectively. We then had to ask to see the results, in one case by using summary(glm.out). This is typically how R works.


Some Factors Associated With Car Insurance Claims

As before, don't execute the library( ) command if you are still in the same active R session.

> library("MASS")                 # Not necessary if still in search path.
> data(Insurance)
> str(Insurance)
'data.frame':   64 obs. of  5 variables:
 $ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Group   : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ...
 $ Age     : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
 $ Holders : int  197 264 246 1680 284 536 696 3582 133 286 ...
 $ Claims  : int  38 35 20 156 63 84 89 400 19 52 ...

> summary(Insurance)
 District    Group       Age        Holders            Claims      
 1:16     <1l   :16   <25  :16   Min.   :   3.00   Min.   :  0.00  
 2:16     1-1.5l:16   25-29:16   1st Qu.:  46.75   1st Qu.:  9.50  
 3:16     1.5-2l:16   30-35:16   Median : 136.00   Median : 22.00  
 4:16     >2l   :16   >35  :16   Mean   : 364.98   Mean   : 49.23  
                                 3rd Qu.: 327.50   3rd Qu.: 55.50  
                                 Max.   :3582.00   Max.   :400.00
District is the location of the policy holder, with 4 being in a major city. Group has to do with the size of the car's engine in liters. Age is obvious. Holders and Claims are both counts.

There is no replication within the cells, so a fully crossed 3-way ANOVA would leave us without an error term. Examination of the data reveals an apparent absence of interactions. So we will request a 3-way ANOVA without testing the interactions.

> aov.out = aov(I(Claims/Holders*100) ~ District + Age + Group, data=Insurance)
> ### Note: We made the DV--claims per 100 holders--on the fly.
> summary(aov.out)
            Df  Sum Sq Mean Sq F value    Pr(>F)    
District     3   88.87   29.62  0.6014 0.6168748    
Age          3  517.63  172.54  3.5030 0.0214000 *  
Group        3  974.59  324.86  6.5955 0.0007036 ***
Residuals   54 2659.78   49.26                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> TukeyHSD(aov.out, which=c("Age","Group"))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = I(Claims/Holders * 100) ~ District + Age + Group, data = Insurance)

$Age
                 diff        lwr       upr     p adj
25-29-<25   -2.858071  -9.435711  3.719568 0.6593722
30-35-<25   -5.016151 -11.593791  1.561488 0.1928569
>35-<25     -7.748445 -14.326084 -1.170805 0.0148258
30-35-25-29 -2.158080  -8.735719  4.419560 0.8204004
>35-25-29   -4.890374 -11.468013  1.687266 0.2117055
>35-30-35   -2.732294  -9.309933  3.845346 0.6903616

$Group
                  diff        lwr       upr     p adj
1-1.5l-<1l    2.237368 -4.3402712  8.815008 0.8039620
1.5-2l-<1l    7.146119  0.5684797 13.723759 0.0282532
>2l-<1l       9.879534  3.3018946 16.457174 0.0011507
1.5-2l-1-1.5l 4.908751 -1.6688885 11.486390 0.2088731
>2l-1-1.5l    7.642166  1.0645263 14.219805 0.0166593
>2l-1.5-2l    2.733415 -3.8442246  9.311054 0.6900880
My advice would be not to do something so silly as to use greater than, less than, and minus signs in your group names. Just a thought! But notice even when something so silly as mathematical operators are included in group names, R can handle it if you're careful. I would also not take those results at face value, as an examination of the model assumptions reveals heterogeneity of variance and nonnormality. Perhaps a transform of the response could fix that!

Diagnostics

Yes, those graphs were also created in R. They show clearly that the homogeneity of variance and normal distribution assumptions have been violated.

> ls()
[1] "aov.out"        "Insurance"
> rm(list=ls())                        ### Clean up the mess we've made.

Some Cars Sold In the United States in 1993

> library("MASS")                      ### Don't do this if it's already done.
> data(Cars93)
> str(Cars93)
'data.frame':   93 obs. of  27 variables:
 $ Manufacturer      : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
 $ Model             : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
 $ Type              : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
 $ Min.Price         : num  12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
 $ Price             : num  15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
 $ Max.Price         : num  18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
 $ MPG.city          : int  25 18 20 19 22 22 19 16 19 16 ...
 $ MPG.highway       : int  31 25 26 26 30 31 28 25 27 25 ...
 $ AirBags           : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
 $ DriveTrain        : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
 $ Cylinders         : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
 $ EngineSize        : num  1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
 $ Horsepower        : int  140 200 172 172 208 110 170 180 170 200 ...
 $ RPM               : int  6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
 $ Rev.per.mile      : int  2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
 $ Man.trans.avail   : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
 $ Fuel.tank.capacity: num  13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
 $ Passengers        : int  5 5 5 6 4 6 6 6 5 6 ...
 $ Length            : int  177 195 180 193 186 189 200 216 198 206 ...
 $ Wheelbase         : int  102 115 102 106 109 105 111 116 108 114 ...
 $ Width             : int  68 71 67 70 69 69 74 78 73 73 ...
 $ Turn.circle       : int  37 38 37 37 39 41 42 45 41 43 ...
 $ Rear.seat.room    : num  26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
 $ Luggage.room      : int  11 15 14 17 13 16 17 21 14 18 ...
 $ Weight            : int  2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
 $ Origin            : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
 $ Make              : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...

What would be interesting to look at here?

> summary(Cars93)
    Manufacturer     Model         Type      Min.Price         Price      
 Chevrolet: 8    100    : 1   Compact:16   Min.   : 6.70   Min.   : 7.40  
 Ford     : 8    190E   : 1   Large  :11   1st Qu.:10.80   1st Qu.:12.20  
 Dodge    : 6    240    : 1   Midsize:22   Median :14.70   Median :17.70  
 Mazda    : 5    300E   : 1   Small  :21   Mean   :17.13   Mean   :19.51  
 Pontiac  : 5    323    : 1   Sporty :14   3rd Qu.:20.30   3rd Qu.:23.30  
 Buick    : 4    535i   : 1   Van    : 9   Max.   :45.40   Max.   :61.90  
 (Other)  :57    (Other):87                                               
   Max.Price       MPG.city      MPG.highway                  AirBags  
 Min.   : 7.9   Min.   :15.00   Min.   :20.00   Driver & Passenger:16  
 1st Qu.:14.7   1st Qu.:18.00   1st Qu.:26.00   Driver only       :43  
 Median :19.6   Median :21.00   Median :28.00   None              :34  
 Mean   :21.9   Mean   :22.37   Mean   :29.09                          
 3rd Qu.:25.3   3rd Qu.:25.00   3rd Qu.:31.00                          
 Max.   :80.0   Max.   :46.00   Max.   :50.00                          
                                                                       
 DriveTrain  Cylinders    EngineSize      Horsepower         RPM      
 4WD  :10   3     : 3   Min.   :1.000   Min.   : 55.0   Min.   :3800  
 Front:67   4     :49   1st Qu.:1.800   1st Qu.:103.0   1st Qu.:4800  
 Rear :16   5     : 2   Median :2.400   Median :140.0   Median :5200  
            6     :31   Mean   :2.668   Mean   :143.8   Mean   :5281  
            8     : 7   3rd Qu.:3.300   3rd Qu.:170.0   3rd Qu.:5750  
            rotary: 1   Max.   :5.700   Max.   :300.0   Max.   :6500  
                                                                      
  Rev.per.mile  Man.trans.avail Fuel.tank.capacity   Passengers   
 Min.   :1320   No :32          Min.   : 9.20      Min.   :2.000  
 1st Qu.:1985   Yes:61          1st Qu.:14.50      1st Qu.:4.000  
 Median :2340                   Median :16.40      Median :5.000  
 Mean   :2332                   Mean   :16.66      Mean   :5.086  
 3rd Qu.:2565                   3rd Qu.:18.80      3rd Qu.:6.000  
 Max.   :3755                   Max.   :27.00      Max.   :8.000  
                                                                  
     Length        Wheelbase         Width        Turn.circle   
 Min.   :141.0   Min.   : 90.0   Min.   :60.00   Min.   :32.00  
 1st Qu.:174.0   1st Qu.: 98.0   1st Qu.:67.00   1st Qu.:37.00  
 Median :183.0   Median :103.0   Median :69.00   Median :39.00  
 Mean   :183.2   Mean   :103.9   Mean   :69.38   Mean   :38.96  
 3rd Qu.:192.0   3rd Qu.:110.0   3rd Qu.:72.00   3rd Qu.:41.00  
 Max.   :219.0   Max.   :119.0   Max.   :78.00   Max.   :45.00  
                                                                
 Rear.seat.room   Luggage.room       Weight         Origin              Make   
 Min.   :19.00   Min.   : 6.00   Min.   :1695   USA    :48   Acura Integra: 1  
 1st Qu.:26.00   1st Qu.:12.00   1st Qu.:2620   non-USA:45   Acura Legend : 1  
 Median :27.50   Median :14.00   Median :3040                Audi 100     : 1  
 Mean   :27.83   Mean   :13.89   Mean   :3073                Audi 90      : 1  
 3rd Qu.:30.00   3rd Qu.:15.00   3rd Qu.:3525                BMW 535i     : 1  
 Max.   :36.00   Max.   :22.00   Max.   :4105                Buick Century: 1  
 NA's   : 2.00   NA's   :11.00                               (Other)      :87
Some of the variables have missing values, viz., Rear.seat.room and Luggage.room. In R, missing values are denoted by NA, or "not available". I will begin by removing those cases.
> Cars = na.omit(Cars93)
> dim(Cars)                       # Get dimensions of remaining dataframe.
[1] 82 27
Eleven cases have been eliminated. Now we'll look at a correlation matrix of most of the numerical variables. This will illustrate how in R you can use one command to do many things at once. Doing so makes the command kinda complicated, so be very careful with your typing. The first part of the command rounds the results to 3 decimal places. Nested inside of that command is one that calculates the correlation matrix. And nested inside of that is a restriction on the Cars dataframe that specifies only certain variables to be included.
> round(cor(Cars[,c(4,5,6,7,8,12,13,14,17,18,19,20,21,25)]),3)
                   Min.Price  Price Max.Price MPG.city MPG.highway EngineSize
Min.Price              1.000  0.971     0.910   -0.663      -0.664      0.705
Price                  0.971  1.000     0.983   -0.622      -0.626      0.642
Max.Price              0.910  0.983     1.000   -0.565      -0.572      0.567
MPG.city              -0.663 -0.622    -0.565    1.000       0.945     -0.747
MPG.highway           -0.664 -0.626    -0.572    0.945       1.000     -0.666
EngineSize             0.705  0.642     0.567   -0.747      -0.666      1.000
Horsepower             0.795  0.787     0.750   -0.704      -0.687      0.772
RPM                   -0.092 -0.035     0.011    0.327       0.234     -0.524
Fuel.tank.capacity     0.736  0.707     0.656   -0.797      -0.739      0.821
Passengers             0.327  0.274     0.222   -0.463      -0.414      0.526
Length                 0.611  0.552     0.485   -0.725      -0.617      0.852
Wheelbase              0.676  0.628     0.566   -0.671      -0.570      0.859
Width                  0.544  0.495     0.438   -0.694      -0.588      0.890
Weight                 0.777  0.737     0.677   -0.836      -0.777      0.917
                   Horsepower    RPM Fuel.tank.capacity Passengers Length
Min.Price               0.795 -0.092              0.736      0.327  0.611
Price                   0.787 -0.035              0.707      0.274  0.552
Max.Price               0.750  0.011              0.656      0.222  0.485
MPG.city               -0.704  0.327             -0.797     -0.463 -0.725
MPG.highway            -0.687  0.234             -0.739     -0.414 -0.617
EngineSize              0.772 -0.524              0.821      0.526  0.852
Horsepower              1.000  0.024              0.796      0.241  0.643
RPM                     0.024  1.000             -0.244     -0.378 -0.440
Fuel.tank.capacity      0.796 -0.244              1.000      0.451  0.794
Passengers              0.241 -0.378              0.451      1.000  0.646
Length                  0.643 -0.440              0.794      0.646  1.000
Wheelbase               0.663 -0.405              0.793      0.649  0.911
Width                   0.689 -0.489              0.772      0.516  0.885
Weight                  0.860 -0.348              0.900      0.510  0.880
                   Wheelbase  Width Weight
Min.Price              0.676  0.544  0.777
Price                  0.628  0.495  0.737
Max.Price              0.566  0.438  0.677
MPG.city              -0.671 -0.694 -0.836
MPG.highway           -0.570 -0.588 -0.777
EngineSize             0.859  0.890  0.917
Horsepower             0.663  0.689  0.860
RPM                   -0.405 -0.489 -0.348
Fuel.tank.capacity     0.793  0.772  0.900
Passengers             0.649  0.516  0.510
Length                 0.911  0.885  0.880
Wheelbase              1.000  0.850  0.879
Width                  0.850  1.000  0.869
Weight                 0.879  0.869  1.000
That would be a lot prettier if some of those variable names were shorter! Now we will do a linear multiple regression using the lm( ), or "linear model" function. Then we will use the step( ) function to do backwards regression to eliminate the nonsignificant terms. Complicated stuff, but all incorporated into the default R packages.
> lm.model = lm(MPG.city ~ EngineSize + Horsepower + Weight + Origin, data=Cars)
> summary(lm.model)

Call:
lm(formula = MPG.city ~ EngineSize + Horsepower + Weight + Origin, 
    data = Cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7646 -1.4445 -0.1286  1.1505 13.2322 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.910e+01  3.048e+00  16.110  < 2e-16 ***
EngineSize     1.220e+00  9.173e-01   1.330    0.187    
Horsepower    -5.836e-05  1.384e-02  -0.004    0.997    
Weight        -9.967e-03  1.890e-03  -5.273 1.19e-06 ***
Originnon-USA  1.265e+00  7.917e-01   1.598    0.114    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 3.083 on 77 degrees of freedom
Multiple R-squared: 0.7113,     Adjusted R-squared: 0.6963 
F-statistic: 47.42 on 4 and 77 DF,  p-value: < 2.2e-16

> step(lm.model, direction="backward")
Start:  AIC=189.48
MPG.city ~ EngineSize + Horsepower + Weight + Origin

             Df Sum of Sq    RSS    AIC
- Horsepower  1 0.0001691 731.78 187.48
- EngineSize  1     16.82 748.60 189.34
                    731.78 189.48
- Origin      1     24.28 756.06 190.15
- Weight      1    264.26 996.04 212.76

Step:  AIC=187.48
MPG.city ~ EngineSize + Weight + Origin

             Df Sum of Sq     RSS     AIC
- EngineSize  1     16.84  748.62  187.34
                     731.78  187.48
- Origin      1     26.76  758.54  188.42
- Weight      1    387.47 1119.25  220.32

Step:  AIC=187.34
MPG.city ~ Weight + Origin

         Df Sum of Sq     RSS     AIC
- Origin  1     15.62  764.24  187.04
                 748.62  187.34
- Weight  1   1588.44 2337.06  278.69

Step:  AIC=187.04
MPG.city ~ Weight

         Df Sum of Sq     RSS     AIC
                 764.24  187.04
- Weight  1   1770.16 2534.40  283.34

Call:
lm(formula = MPG.city ~ Weight, data = Cars)

Coefficients:
(Intercept)       Weight  
   47.76860     -0.00826
Of the original predictors, only Weight remains after the backward regression.
> quit("no")
The end. We don't need to clean up here because we are telling R to quit without saving the workspace. This will cause R to start up next time with the same workspace it had when we started this time.


revised 2010 July 25
Return to the Table of Contents