A WORK-ALONG DEMONSTRATION OF R
Preliminaries
The purpose of this tutorial is to show you a bit--and I mean just a bit--of
what R can do and how it works. If you have already downloaded and installed R,
you can work along with me. If not, you should be able to get a pretty good idea
of what R is about just by reading through this tutorial. However, I will not be
explaining much of what I'm doing, so if you've never used R before, much of
this may seem pretty cryptic to you. My intent here is not to teach you R but
just to show you a little of what it can do. It will also give you some
practice at typing and executing R commands and will show you what R looks like
on your computer.
We will work on four problems. Do any or all of them, but do at least the
first one.
- Weight change in anorexic women associated with three different forms
of therapy.
- Risk factors associated with low infant birth weight.
- Some factors associated with car insurance claims.
- Data on cars on sale in the United States in 1993.
You will not have to enter any of these data sets, as they come with the
default installation of R. They are in an R library that does not load
automatically when R starts, so our first job will be getting access to the
data. I will repeat the command for doing so at the beginning of each section,
but if you are doing the entire tutorial at one sitting, you have to execute
the library() command only once. In fact, I will remind you not to run it again.
When you start R on your computer, it will open a console window. The Mac version
looks slightly, but not dramatically, different from the Windows version, and the
Linux version will be recognizable too.
The window into which you type your commands is called the R Console. On the
Mac and in Windows, this runs inside of a graphical user interface (GUI). In
Linux, it runs in a terminal window. In the Mac and Windows version (but not in
Linux) there are a few menus at the top of the window (in the menu bar on a Mac),
and these are useful for a few things, mostly housekeeping tasks, but the commands
that do your statistical work will be typed at the command prompt.
The greater than sign (>) near the bottom of the window is the
command prompt. You don't type this part--R supplies it. If the command extends
onto two or more lines, this prompt will change to a plus sign (+), called the
continuation prompt. Don't
type the plus-sign prompts either. Which brings up an IMPORTANT NOTE: It
will happen eventually (take my word for it!) that you will get stuck at a
continuation prompt, and nothing you do will get you back to the command prompt.
Don't panic! Press the Escape (Esc) key at the top left of your keyboard. That
will abort the command currently being typed and bring you back to >.
To see what that looks like, try this.
> ls()
Type a lower case L ("el"), not a one, a lower case S, open and close
parentheses, and then press the Enter (or Return) key. This command lists the
contents of a place in memory called the workspace (the computer's memory, not
your memory). If there is nothing there, the output will say
character(0). Now
suppose you neglected to type the close parenthesis, a common mistake. (The Mac
command editor will add it for you automatically, so it's harder to make this
mistake on a Mac.) R will
recognize that the command is incomplete and give you a continuation prompt (+)
so that you can complete it. Type the close parenthesis and press Enter. Be
sure to press the Enter key after typing a command. Otherwise, R will just sit
there waiting for you to do so, and R has infinite patience!
> ls( # no close parenthesis; the command is incomplete
+ ) # so complete it and press Enter
R is usually insensitive to spacing, so add spacing wherever it helps you to
make things clear, or leave it out if you want to save key strokes. R is case
sensitive, however, so if you capitalize something that shouldn't be, or fail
to when you should, you will get an error message of some sort. If you get a
"function not found" or "object not found" error, that is the first thing you
should check. Then check your spelling and punctuation. In R, attention to
details is crucial!
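To see what a case error looks like, try typing the ls() command with capital letters
(the exact wording of the error message may vary a bit between versions of R).
> LS() # Oops! R's function is ls(), all lower case.
Error: could not find function "LS"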
In R, as in many programming languages, comments start with pound symbols
(#). Anything following a pound symbol is ignored by R and is there purely for
the purpose of instruction or annotation. You don't need to type these
bits. And did I mention that R is, in fact, a full-featured programming
language? You don't need to be intimidated by that. R has such a large number
of built-in statistical functions that most people use it simply for interactive
data analysis and graphics. It does mean, however, that R will be very picky
about syntax. You will have to close all open parentheses, get your commas and
quotes in the right places, and things like that. Annoying, I agree, and another
common source of errors, but also very, very powerful!
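Here is the sort of complaint a missing comma produces (again, the exact error text
may differ slightly in your version of R).
> mean(c(1 2 3)) # Oops! Forgot the commas between the numbers.
Error: unexpected numeric constant in "mean(c(1 2"
> mean(c(1, 2, 3)) # Commas where they belong. The mean of 1, 2, and 3 is 2.
[1] 2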
One thing you may find handy--commands can be copied and pasted into R. Don't
copy the command prompts, however.
Problem One: Weight Change in Anorexic Women
Begin like this...
> remove(list=ls()) # Type only the bits after the > prompt. This clears your workspace.
> library("MASS") # This loads an optional library of commands and data files.
> data(anorexia) # This puts a copy of the anorexia data in your workspace.
> ls() # This produces a list of the contents of your workspace.
[1] "anorexia" ### No prompt here. This is an output line.
> str(anorexia) # This asks to see the structure of the anorexia object.
'data.frame': 72 obs. of 3 variables:
$ Treat : Factor w/ 3 levels "CBT","Cont","FT": 2 2 2 2 2 2 2 2 2 2 ...
$ Prewt : num 80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...
$ Postwt: num 80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 ...
The first of these commands cleaned out your workspace, deleting anything you
or someone else stored there from a previous analysis. I hope you got all those
parentheses in the right place! WARNING: R does not ask if you "really want
to do this." It just does what you tell it to do. If you really didn't mean to,
too bad! The information in your workspace is gone, and it does not go to the trash
either. It's gone! So be careful. This is a good old-fashioned command-line
interface, and you're expected to know what you're doing.
The second command loaded the "MASS" library, which contains the data set.
The third command placed a copy of the data set "anorexia" into your workspace.
The fourth command asked for a list of items stored in the workspace. The
output produced by R is labeled [1]. The fifth command asked to see the
structure of this data object. It is a data frame, which is a table-like object
containing cases (or observations) in the rows and variables in the columns.
There are 72 cases each measured on 3 variables. Those variables are Treat, a
factor with 3 levels (CBT is cognitive behavioral therapy, Cont is a
no-treatment control, and FT is family therapy), Prewt, a numeric variable
(pretreatment weight in pounds), and Postwt, also a numeric variable
(posttreatment weight in pounds).
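Another quick way to get a feel for a data frame is to look at its first few rows
with the head() function. Based on the values shown by str() above, the output
should look something like this.
> head(anorexia) # The first six rows of the data frame.
  Treat Prewt Postwt
1  Cont  80.7   80.2
2  Cont  89.4   80.1
3  Cont  91.8   86.4
4  Cont  74.0   86.3
5  Cont  78.1   76.1
6  Cont  88.3   78.1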
For more information, or to view the source for these data, request the help
page by typing help(anorexia) at a command prompt.
Help information will open in its own window in Windows and on the Mac. Close
this window as you would any window on your system. In Linux, help
information will open in the terminal. Press q (lower case Q) to return to the
command prompt.
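Here is what that looks like at the prompt. The question-mark form is just a
shortcut for help().
> help(anorexia) # Opens the help page for the anorexia data set.
> ?anorexia # Exactly the same thing, with a little less typing.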
We continue by examining the data. Don't worry about memorizing what all the
commands do now. This is just for show. And remember, you don't have to type the
stuff after the # symbol.
> summary(anorexia) # Get a data summary.
Treat Prewt Postwt
CBT :29 Min. :70.00 Min. : 71.30
Cont:26 1st Qu.:79.60 1st Qu.: 79.33
FT :17 Median :82.30 Median : 84.05
Mean :82.41 Mean : 85.17
3rd Qu.:86.00 3rd Qu.: 91.55
Max. :94.90 Max. :103.60
> attach(anorexia) # Make the variables inside anorexia visible to R.
> by(anorexia, Treat, summary) # Summarize each treatment group separately.
Treat: CBT
Treat Prewt Postwt
CBT :29 Min. :70.00 Min. : 71.3
Cont: 0 1st Qu.:80.40 1st Qu.: 81.9
FT : 0 Median :82.60 Median : 83.9
Mean :82.69 Mean : 85.7
3rd Qu.:85.00 3rd Qu.: 90.9
Max. :94.90 Max. :103.6
------------------------------------------------------------
Treat: Cont
Treat Prewt Postwt
CBT : 0 Min. :70.50 Min. :73.00
Cont:26 1st Qu.:77.72 1st Qu.:77.58
FT : 0 Median :80.65 Median :80.70
Mean :81.56 Mean :81.11
3rd Qu.:85.88 3rd Qu.:84.67
Max. :91.80 Max. :89.60
------------------------------------------------------------
Treat: FT
Treat Prewt Postwt
CBT : 0 Min. :73.40 Min. : 75.2
Cont: 0 1st Qu.:80.50 1st Qu.: 90.7
FT :17 Median :83.30 Median : 92.5
Mean :83.23 Mean : 90.5
3rd Qu.:86.00 3rd Qu.: 95.2
Max. :94.20 Max. :101.6
> # Just press Enter here to put in a blank line.
> tapply(Postwt, Treat, mean) # Another way to get means by groups.
CBT Cont FT
85.69655 81.10769 90.49412
> tapply(Postwt, Treat, sd) # Sample standard deviations by groups.
CBT Cont FT
8.351924 4.744253 8.475072
> tapply(Postwt, Treat, length) # Group sizes.
CBT Cont FT
29 26 17
If you're relatively new to computers (i.e., under the age of 50), here's
something you will have to get used to about command line interfaces. Once
something is printed to the screen, it's history, gone, done with. It will
never be updated, changed or revised. ALL NEW INFORMATION OR CHANGES TO OLD
INFORMATION WILL BE PRINTED OUT BELOW WHERE YOU TYPE A NEW COMMAND. So
if you change something, you will have to issue a new command to see that
change. What has already scrolled by is ancient history.
Another thing you'll have to get used to is that when things go well in R,
R will often be silent. That is, it will not tell you that it did what you told it
to do. You will have to ask it to show you. Here is an example.
> x = 7 # (Press Enter!) Assign the value 7 to x. Notice there is no output.
> print(x) # View the value of x, after you remember to press Enter.
[1] 7
> x # Equivalent to the last command--an "implicit print."
[1] 7
> x = 8 # A new value for x.
> x # View it.
[1] 8
> rm(x) # Erase x. Done! No confirmatory message is printed.
> x
Error: object 'x' not found
>
Now let's draw a nice graph of the anorexia data. The graph will be drawn in
a separate graphics window (called a "graphics device" in R-speak). If no such
window is open, R will open one. If one is already open, R will use it and
erase whatever graph is already there. So if you wanted a copy of that old
graph, well, too late now! You should have saved it while you had the chance!
You can manipulate this graphics window just like
any other window on your system. You can move it around, resize it, and close
it by clicking on the proper button. Closing the window will also erase the
graph, so if you want to keep it, take steps to do so before writing over it or
closing it. I'll tell you how to do this in a future tutorial.
One thing I'll warn you about in advance though. When a new graphics device
window opens, it steals focus from the R Console. To type more commands, you'll
have to click in the R Console window with your mouse. (Wow! You get to use the
mouse!)
> Change = Postwt - Prewt # Create a new variable called Change.
> boxplot(Change ~ Treat) # The symbol before Treat is a tilde. Read "Change by Treat."
> title(main = "Weight Gain in Pounds by Treatment")
> title(xlab = "Type of treatment")
> title(ylab = "Weight gain in pounds")
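By the way, the three title() commands are not the only way to label a plot. If you
prefer, the titles can all be supplied as arguments to boxplot() itself, producing
the same graph with one (longer) command.
> boxplot(Change ~ Treat, main="Weight Gain in Pounds by Treatment",
+         xlab="Type of treatment", ylab="Weight gain in pounds")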
When you are done admiring the graphics device window, you can click it shut in
the normal way for your operating system.
Now let's open a smaller graphics window and look at a scatterplot.
> windows(4,4,10) ### Windows only!
> ### On a Mac, type quartz("Scatterplot",4,4,10)
> ### On Linux just skip this step.
> scatter.smooth(Prewt, Postwt)
> title(main="Postweight by Preweight Scatterplot")
> abline(a=0, b=1, col="red")
The scatter.smooth() function draws a
scatterplot with a smoothed, nonparametric (lowess) regression line on the
plot. I have also added a red "no change" line with the
abline() function (intercept=0 and slope=1). This window can also
be closed in the way you would close any window on your desktop. A later
tutorial will deal with how to modify and dress up graphics for publication,
but R does a fairly decent job of drawing graphics just by default, as you can
see.
It appears the treatment effect is nonlinear, with lower preweight women
benefiting from treatment more than higher preweight women. In fact, it appears
that women with a preweight of 80 lbs. or more benefited hardly at all, on the
average. An analysis of variance is, therefore, suspect.
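If you're curious, you can see that relationship more directly by plotting the
change scores against the pretreatment weights. This is just a quick side trip;
we won't pursue it further here.
> scatter.smooth(Prewt, Change) # Weight change as a function of starting weight.
> abline(h=0, lty="dashed") # A dashed reference line at "no change."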
Nevertheless, for the sake of illustration, here's how the ANOVA would be done.
> aov.out = aov(Change ~ Treat) # Change tilde Treat, read "Change by Treat"
> summary(aov.out)
Df Sum Sq Mean Sq F value Pr(>F)
Treat 2 614.6 307.3 5.4223 0.006499 **
Residuals 69 3910.7 56.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> TukeyHSD(aov.out) # A Tukey HSD post hoc test.
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Change ~ Treat)
$Treat
diff lwr upr p adj
Cont-CBT -3.456897 -8.327276 1.413483 0.2124428
FT-CBT 4.257809 -1.250554 9.766173 0.1607461
FT-Cont 7.714706 2.090124 13.339288 0.0045127
The aov() function accepts a formula that
tells what analysis to do. The formula has the basic form DV ~ IV (DV tilde
IV), which means "DV as explained by or predicted by or modeled by IV."
Treatment contrasts are also available.
(The ones done here don't make a lot of sense, but
with a little more work, we could get them to make sense. The tests being done
here are equivalent to Fisher LSD pairwise comparisons between the CBT group
and the control group, and between the CBT group and the FT group.)
> summary.lm(aov.out)
Call:
aov(formula = Change ~ Treat)
Residuals:
Min 1Q Median 3Q Max
-12.565 -4.543 -1.007 3.846 17.893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.007 1.398 2.151 0.0350 *
TreatCont -3.457 2.033 -1.700 0.0936 .
TreatFT 4.258 2.300 1.852 0.0684 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.528 on 69 degrees of freedom
Multiple R-squared: 0.1358, Adjusted R-squared: 0.1108
F-statistic: 5.422 on 2 and 69 DF, p-value: 0.006499
We can also do a Kruskal-Wallis nonparametric ANOVA.
> kruskal.test(Change ~ Treat)
Kruskal-Wallis rank sum test
data: Change by Treat
Kruskal-Wallis chi-squared = 9.0475, df = 2, p-value = 0.01085
That should get you started and your appetite whetted. Let's move on.
> ls() # View contents of workspace.
[1] "anorexia" "aov.out" "Change"
> detach(anorexia) # ALWAYS detach your data frame after attaching!
> rm(anorexia, aov.out, Change) # Short for remove( )
But first you should execute the commands above to clean up a bit. It's a good
idea to keep your workspace clean by erasing (removing) things you no longer
need. This will come back to bite you eventually if you don't. Also, keep your
search path clean (explained in a future tutorial) by detaching anything
you've attached to it.
Problem Two: Risk Factors Associated With Low Infant Birthweight
We begin as before. You should not execute the library("MASS") command if you are still running the same
R session. Try typing search(), and if you see
"package:MASS" among the listed packages, it's already loaded.
> library("MASS") # Only if you have quit R since the last example.
> data(birthwt) # Add this dataset to your workspace.
> ls() # List the contents of your workspace.
[1] "birthwt"
> str(birthwt) # View the structure of the dataset.
'data.frame': 189 obs. of 10 variables:
$ low : int 0 0 0 0 0 0 0 0 0 0 ...
$ age : int 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
$ race : int 2 3 1 1 1 3 1 3 1 1 ...
$ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
$ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
$ ht : int 0 0 0 0 0 0 0 0 0 0 ...
$ ui : int 1 0 0 1 1 0 0 0 0 0 ...
$ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
$ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
I will begin by recoding some of the variables to make them factors. This could
have been done when the data set was first prepared, but it wasn't. If you've
jumped the gun and have already attached the dataframe, detach it. NEVER modify
an attached dataframe. (It's not against the law, and it really doesn't even hurt
anything, but you'll end up terribly confused. Why? Because modifications to the
data frame don't affect the attached copy, which is what you'll be working from.)
> ### low: birthwt below 2.5 kg, 0=no, 1=yes; we will leave this as is
> ### age: age of mother in years; a numeric variable; no need to recode
> ### lwt: weight of mother in pounds at last menstrual period; numeric; no need to recode
> ### race: R thinks this is an integer, so we want to (very carefully!!) recode
> value.labels = c("white","black","other")
> birthwt$race = factor(birthwt$race, labels=value.labels)
> ### Note: birthwt$race means the race variable inside the birthwt data frame
> ### smoke: smoking status during pregnancy (0="no", 1="yes"); also needs recoding
> value.labels = c("no","yes") # this could be done in one step, if you're brave enough!
> birthwt$smoke = factor(birthwt$smoke, labels=value.labels)
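> ### (If you're feeling brave, the "one step" version of that recoding would be:
> ###  birthwt$smoke = factor(birthwt$smoke, labels=c("no","yes"))
> ### Same result, just without the separate value.labels step.)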
> ### ptl: number of previous premature labors; no need to recode
> ### ht: history of hypertension (0="no", 1="yes")
> birthwt$ht = factor(birthwt$ht, labels=value.labels)
> ### ui: presence of uterine irritability (0="no", 1="yes")
> birthwt$ui = factor(birthwt$ui, labels=value.labels)
> ### ftv: number of physician visits during first trimester
> ### bwt: birth weight of baby in grams; change to kilograms
> birthwt$bwt = birthwt$bwt / 1000
> summary(birthwt)
low age lwt race smoke
Min. :0.0000 Min. :14.00 Min. : 80.0 white:96 no :115
1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 black:26 yes: 74
Median :0.0000 Median :23.00 Median :121.0 other:67
Mean :0.3122 Mean :23.24 Mean :129.8
3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0
Max. :1.0000 Max. :45.00 Max. :250.0
ptl ht ui ftv bwt
Min. :0.0000 no :177 no :161 Min. :0.0000 Min. :0.709
1st Qu.:0.0000 yes: 12 yes: 28 1st Qu.:0.0000 1st Qu.:2.414
Median :0.0000 Median :0.0000 Median :2.977
Mean :0.1958 Mean :0.7937 Mean :2.945
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:3.487
Max. :3.0000 Max. :6.0000 Max. :4.990
Notice that summary is a very intelligent command. It knows to summarize
numbers as numbers and factors as factors. It will get even smarter in the
future. Let's crosstabulate some of the categorical variables (factors).
> attach(birthwt) # Makes the variables inside birthwt visible without the $ thing.
> table(smoke, low) # Hence, we don't need to type birthwt$smoke, etc.
low
smoke 0 1
no 86 29
yes 44 30
> # Inserting a blank line now and then makes things easier to read.
> xtabs(~ smoke + low) # An alternate form that does the same thing.
low
smoke 0 1
no 86 29
yes 44 30
>
> table(smoke, race) # Remember, the space after the comma is optional.
race
smoke white black other
no 44 16 55
yes 52 10 12
>
> table(smoke,ht,ui,low) # A four-dimensional contingency table.
, , ui = no, low = 0
ht
smoke no yes
no 75 3
yes 36 2
, , ui = yes, low = 0
ht
smoke no yes
no 8 0
yes 6 0
, , ui = no, low = 1
ht
smoke no yes
no 18 4
yes 20 3
, , ui = yes, low = 1
ht
smoke no yes
no 7 0
yes 7 0
> big.table = table(smoke,ht,ui,low) # Store output rather than printing it.
> ftable(big.table) # A more convenient way to view this.
low 0 1
smoke ht ui
no no no 75 18
yes 8 7
yes no 3 4
yes 0 0
yes no no 36 20
yes 6 7
yes no 2 3
yes 0 0
> flat.table = ftable(big.table) # Store output so we can work with it.
> prop.table(flat.table, 1) # As proportions rather than raw freqs.
low 0 1
smoke ht ui
no no no 0.8064516 0.1935484
yes 0.5333333 0.4666667
yes no 0.4285714 0.5714286
yes NaN NaN
yes no no 0.6428571 0.3571429
yes 0.4615385 0.5384615
yes no 0.4000000 0.6000000
yes NaN NaN
Let's look at smoking by race. Some of these commands are long, so I broke them
in the middle by hitting Enter and continuing on the next line. (On a Mac
you'll have to erase the close parenthesis first.) Remember not to
type the + signs, as R will supply them. And spacing is optional. I added some
spacing to make the command easier to read. Watch your commas and quotes, too.
One missed or misplaced comma will make the whole command wrong!
> smoke.by.race = table(smoke, race)
> barplot(smoke.by.race, beside=T,
+ main="Smoking by Race",
+ ylab="Freq", xlab="Race",
+ col=c("black","lightgray"))
> legend(x=3.5, y=45, legend=c("nonsmokers","smokers"),
+ fill=c("black","lightgray"))
> results = chisq.test(smoke.by.race)
> results
Pearson's Chi-squared test
data: smoke.by.race
X-squared = 21.779, df = 2, p-value = 1.865e-05
> results$expected # View the expected frequencies.
race
smoke white black other
no 58.4127 15.82011 40.76720
yes 37.5873 10.17989 26.23280
> results$residuals # View the (standardized) residuals in each cell.
race
smoke white black other
no -1.88578277 0.04522852 2.22912825
yes 2.35084895 -0.05638265 -2.77886927
Are any of the factors related to low birthweight? We'll find out by using
logistic regression, an advanced technique incorporated into the default R
packages you get at download.
> glm.out = glm(low ~ race + smoke + ht + ui, family=binomial(logit))
> summary(glm.out)
Call:
glm(formula = low ~ race + smoke + ht + ui, family = binomial(logit))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6618 -0.7922 -0.4825 1.1413 2.1016
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.0919 0.3800 -5.505 3.69e-08 ***
raceblack 1.0638 0.4990 2.132 0.03301 *
raceother 1.0834 0.4131 2.622 0.00873 **
smokeyes 1.0940 0.3797 2.881 0.00397 **
htyes 1.3594 0.6297 2.159 0.03087 *
uiyes 1.0059 0.4385 2.294 0.02179 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 211.17 on 183 degrees of freedom
AIC: 223.17
Number of Fisher Scoring iterations: 4
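Since this is a logistic regression, those coefficients are on the log odds scale.
If you'd like to see them as odds ratios, an optional extra step, exponentiate them.
> exp(coef(glm.out)) # Convert the log odds ratios to odds ratios.
You should see odds ratios of roughly 2.9 for raceblack, 3.0 for raceother, 3.0 for
smokeyes, 3.9 for htyes, and 2.7 for uiyes (and about 0.12 for the intercept), all
calculated from the coefficients shown above.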
Apparently they all are! Don't forget to clean up your workspace and search
path.
> detach(birthwt) ### Always detach after you attach!
> ls() # List contents of workspace. Or you could do this from a menu.
[1] "big.table" "birthwt" "flat.table" "glm.out"
[5] "results" "smoke.by.race"
> rm(list=ls()) # Completely clear the workspace. You can also do this from a menu.
> ls()
character(0)
Notice that when I was naming things, I did not put spaces or dashes in those
names. R wouldn't like that! Use a period instead. Notice also that the
chisq.test( ) and
glm( ) commands produced no output when they were executed, because
that output was stored in results and glm.out, respectively. We then had to ask to see the results,
in one case by using summary(glm.out). This is
typically how R works.
Problem Three: Some Factors Associated With Car Insurance Claims
As before, don't execute the library()
command if you are still in the same active R session (MASS still in the search
path).
> library("MASS") # Not necessary if still in search path.
> data(Insurance)
> str(Insurance)
'data.frame': 64 obs. of 5 variables:
$ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Group : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ...
$ Age : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
$ Holders : int 197 264 246 1680 284 536 696 3582 133 286 ...
$ Claims : int 38 35 20 156 63 84 89 400 19 52 ...
> summary(Insurance)
District Group Age Holders Claims
1:16 <1l :16 <25 :16 Min. : 3.00 Min. : 0.00
2:16 1-1.5l:16 25-29:16 1st Qu.: 46.75 1st Qu.: 9.50
3:16 1.5-2l:16 30-35:16 Median : 136.00 Median : 22.00
4:16 >2l :16 >35 :16 Mean : 364.98 Mean : 49.23
3rd Qu.: 327.50 3rd Qu.: 55.50
Max. :3582.00 Max. :400.00
District is the location of the policy holder, with 4 being in a major city.
Group has to do with the size of the car's engine in liters. Age is obvious.
Holders and Claims are both counts.
There is no replication within the cells, so a fully crossed 3-way ANOVA
would leave us without an error term. Examination of the data reveals an
apparent absence of interactions. So we will request a 3-way ANOVA without
testing the interactions.
> aov.out = aov(I(Claims/Holders*100) ~ District + Age + Group, data=Insurance)
> ### Note: We made the DV--claims per 100 holders--on the fly.
> summary(aov.out)
Df Sum Sq Mean Sq F value Pr(>F)
District 3 88.87 29.62 0.6014 0.6168748
Age 3 517.63 172.54 3.5030 0.0214000 *
Group 3 974.59 324.86 6.5955 0.0007036 ***
Residuals 54 2659.78 49.26
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> TukeyHSD(aov.out, which=c("Age","Group"))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = I(Claims/Holders * 100) ~ District + Age + Group, data = Insurance)
$Age
diff lwr upr p adj
25-29-<25 -2.858071 -9.435711 3.719568 0.6593722
30-35-<25 -5.016151 -11.593791 1.561488 0.1928569
>35-<25 -7.748445 -14.326084 -1.170805 0.0148258
30-35-25-29 -2.158080 -8.735719 4.419560 0.8204004
>35-25-29 -4.890374 -11.468013 1.687266 0.2117055
>35-30-35 -2.732294 -9.309933 3.845346 0.6903616
$Group
diff lwr upr p adj
1-1.5l-<1l 2.237368 -4.3402712 8.815008 0.8039620
1.5-2l-<1l 7.146119 0.5684797 13.723759 0.0282532
>2l-<1l 9.879534 3.3018946 16.457174 0.0011507
1.5-2l-1-1.5l 4.908751 -1.6688885 11.486390 0.2088731
>2l-1-1.5l 7.642166 1.0645263 14.219805 0.0166593
>2l-1.5-2l 2.733415 -3.8442246 9.311054 0.6900880
It appears that younger drivers and drivers of cars with larger engines are more
likely to file claims.
My advice would be not to do something so silly as to use greater than, less
than, and minus signs in your group names. Just a thought! But notice that even when
something as silly as mathematical operators appears in group names, R can
handle it if you're careful. I would also not take those results at face value,
as an examination of the model assumptions reveals heterogeneity of variance
and nonnormality. Perhaps a transform of the response could fix that.
Diagnostic graphs, also created in R, show clearly that the homogeneity of
variance and normal distribution assumptions have been violated.
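The commands for graphs like those aren't shown above, but a quick way to get
standard diagnostic plots from the fitted model (a sketch using aov.out, which is
still in the workspace) is this.
> plot(aov.out, which=1) # Residuals vs. fitted values: look for unequal spread.
> plot(aov.out, which=2) # Normal Q-Q plot: points should hug the line if the residuals are normal.
When you've had a look, clean up as usual.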
> ls()
[1] "aov.out" "Insurance"
> rm(list=ls()) ### Clean up the mess we've made.
There is no need to detach Insurance because it was never attached. Check the
search path to confirm this if you don't remember. And by the way, deleting it
does not detach it.
Problem Four: Some Cars Sold In the United States in 1993
> library("MASS") ### Don't do this if it's already done.
> ### If you accidentally attach something twice, no harm no foul.
> ### Just detach it once.
> data(Cars93)
> str(Cars93)
'data.frame': 93 obs. of 27 variables:
$ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
$ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
$ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
$ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
$ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
$ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
$ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
$ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
$ Passengers : int 5 5 5 6 4 6 6 6 5 6 ...
$ Length : int 177 195 180 193 186 189 200 216 198 206 ...
$ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
$ Width : int 68 71 67 70 69 69 74 78 73 73 ...
$ Turn.circle : int 37 38 37 37 39 41 42 45 41 43 ...
$ Rear.seat.room : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
$ Luggage.room : int 11 15 14 17 13 16 17 21 14 18 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
$ Origin : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
$ Make : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
What would be interesting to look at here?
> summary(Cars93)
Manufacturer Model Type Min.Price Price
Chevrolet: 8 100 : 1 Compact:16 Min. : 6.70 Min. : 7.40
Ford : 8 190E : 1 Large :11 1st Qu.:10.80 1st Qu.:12.20
Dodge : 6 240 : 1 Midsize:22 Median :14.70 Median :17.70
Mazda : 5 300E : 1 Small :21 Mean :17.13 Mean :19.51
Pontiac : 5 323 : 1 Sporty :14 3rd Qu.:20.30 3rd Qu.:23.30
Buick : 4 535i : 1 Van : 9 Max. :45.40 Max. :61.90
(Other) :57 (Other):87
Max.Price MPG.city MPG.highway AirBags
Min. : 7.9 Min. :15.00 Min. :20.00 Driver & Passenger:16
1st Qu.:14.7 1st Qu.:18.00 1st Qu.:26.00 Driver only :43
Median :19.6 Median :21.00 Median :28.00 None :34
Mean :21.9 Mean :22.37 Mean :29.09
3rd Qu.:25.3 3rd Qu.:25.00 3rd Qu.:31.00
Max. :80.0 Max. :46.00 Max. :50.00
DriveTrain Cylinders EngineSize Horsepower RPM
4WD :10 3 : 3 Min. :1.000 Min. : 55.0 Min. :3800
Front:67 4 :49 1st Qu.:1.800 1st Qu.:103.0 1st Qu.:4800
Rear :16 5 : 2 Median :2.400 Median :140.0 Median :5200
6 :31 Mean :2.668 Mean :143.8 Mean :5281
8 : 7 3rd Qu.:3.300 3rd Qu.:170.0 3rd Qu.:5750
rotary: 1 Max. :5.700 Max. :300.0 Max. :6500
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers
Min. :1320 No :32 Min. : 9.20 Min. :2.000
1st Qu.:1985 Yes:61 1st Qu.:14.50 1st Qu.:4.000
Median :2340 Median :16.40 Median :5.000
Mean :2332 Mean :16.66 Mean :5.086
3rd Qu.:2565 3rd Qu.:18.80 3rd Qu.:6.000
Max. :3755 Max. :27.00 Max. :8.000
Length Wheelbase Width Turn.circle
Min. :141.0 Min. : 90.0 Min. :60.00 Min. :32.00
1st Qu.:174.0 1st Qu.: 98.0 1st Qu.:67.00 1st Qu.:37.00
Median :183.0 Median :103.0 Median :69.00 Median :39.00
Mean :183.2 Mean :103.9 Mean :69.38 Mean :38.96
3rd Qu.:192.0 3rd Qu.:110.0 3rd Qu.:72.00 3rd Qu.:41.00
Max. :219.0 Max. :119.0 Max. :78.00 Max. :45.00
Rear.seat.room Luggage.room Weight Origin Make
Min. :19.00 Min. : 6.00 Min. :1695 USA :48 Acura Integra: 1
1st Qu.:26.00 1st Qu.:12.00 1st Qu.:2620 non-USA:45 Acura Legend : 1
Median :27.50 Median :14.00 Median :3040 Audi 100 : 1
Mean :27.83 Mean :13.89 Mean :3073 Audi 90 : 1
3rd Qu.:30.00 3rd Qu.:15.00 3rd Qu.:3525 BMW 535i : 1
Max. :36.00 Max. :22.00 Max. :4105 Buick Century: 1
NA's : 2.00 NA's :11.00 (Other) :87
Some of the variables have missing values, viz., Rear.seat.room and
Luggage.room. In R, missing values are denoted by NA, or "not available". I will
begin by removing those cases.
> Cars = na.omit(Cars93) # Writing the result to a new data frame.
> dim(Cars) # Get dimensions of remaining data frame.
[1] 82 27
Eleven cases have been eliminated. It might also be desirable to use the
variable "Make" as row names for the data frame rather than have it as a
variable. Your wish is R's fondest desire (provided you know the command).
> rownames(Cars) = Cars$Make
> Cars$Make = NULL # Delete the variable from the data frame.
> Cars[1:6,1:7] # See some of the result (6 rows and 7 columns).
Manufacturer Model Type Min.Price Price Max.Price MPG.city
Acura Integra Acura Integra Small 12.9 15.9 18.8 25
Acura Legend Acura Legend Midsize 29.2 33.9 38.7 18
Audi 90 Audi 90 Compact 25.9 29.1 32.3 20
Audi 100 Audi 100 Midsize 30.8 37.7 44.6 19
BMW 535i BMW 535i Midsize 23.7 30.0 36.2 22
Buick Century Buick Century Midsize 14.2 15.7 17.3 22
The advantage of this is that you can refer to rows of the data frame by the
make of automobile. If you wanted to see all info on the Buick Century, for
example, it's as simple as this. (Otherwise, you'd have to refer to this row
of data by a number, which you probably don't know off the top of your head.)
> Cars["Buick Century", ]
Manufacturer Model Type Min.Price Price Max.Price MPG.city
Buick Century Buick Century Midsize 14.2 15.7 17.3 22
MPG.highway AirBags DriveTrain Cylinders EngineSize
Buick Century 31 Driver only Front 4 2.2
Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
Buick Century 110 5200 2565 No 16.4
Passengers Length Wheelbase Width Turn.circle Rear.seat.room
Buick Century 6 189 105 69 41 28
Luggage.room Weight Origin
Buick Century 16 2880 USA
Now we'll look at a correlation matrix of
most of the numeric variables. This will illustrate how in R you can use one
command to do many things at once. Doing so makes the command kinda complicated,
so be very careful with your typing. The first part of the command rounds the
results to 3 decimal places. Nested inside of that command is one that
calculates the correlation matrix. And nested inside of that is a restriction
on the Cars dataframe that specifies only certain variables to be included.
> round(cor(Cars[,c(4,5,6,7,8,12,13,14,17,18,19,20,21,25)]),3)
Min.Price Price Max.Price MPG.city MPG.highway EngineSize
Min.Price 1.000 0.971 0.910 -0.663 -0.664 0.705
Price 0.971 1.000 0.983 -0.622 -0.626 0.642
Max.Price 0.910 0.983 1.000 -0.565 -0.572 0.567
MPG.city -0.663 -0.622 -0.565 1.000 0.945 -0.747
MPG.highway -0.664 -0.626 -0.572 0.945 1.000 -0.666
EngineSize 0.705 0.642 0.567 -0.747 -0.666 1.000
Horsepower 0.795 0.787 0.750 -0.704 -0.687 0.772
RPM -0.092 -0.035 0.011 0.327 0.234 -0.524
Fuel.tank.capacity 0.736 0.707 0.656 -0.797 -0.739 0.821
Passengers 0.327 0.274 0.222 -0.463 -0.414 0.526
Length 0.611 0.552 0.485 -0.725 -0.617 0.852
Wheelbase 0.676 0.628 0.566 -0.671 -0.570 0.859
Width 0.544 0.495 0.438 -0.694 -0.588 0.890
Weight 0.777 0.737 0.677 -0.836 -0.777 0.917
Horsepower RPM Fuel.tank.capacity Passengers Length
Min.Price 0.795 -0.092 0.736 0.327 0.611
Price 0.787 -0.035 0.707 0.274 0.552
Max.Price 0.750 0.011 0.656 0.222 0.485
MPG.city -0.704 0.327 -0.797 -0.463 -0.725
MPG.highway -0.687 0.234 -0.739 -0.414 -0.617
EngineSize 0.772 -0.524 0.821 0.526 0.852
Horsepower 1.000 0.024 0.796 0.241 0.643
RPM 0.024 1.000 -0.244 -0.378 -0.440
Fuel.tank.capacity 0.796 -0.244 1.000 0.451 0.794
Passengers 0.241 -0.378 0.451 1.000 0.646
Length 0.643 -0.440 0.794 0.646 1.000
Wheelbase 0.663 -0.405 0.793 0.649 0.911
Width 0.689 -0.489 0.772 0.516 0.885
Weight 0.860 -0.348 0.900 0.510 0.880
Wheelbase Width Weight
Min.Price 0.676 0.544 0.777
Price 0.628 0.495 0.737
Max.Price 0.566 0.438 0.677
MPG.city -0.671 -0.694 -0.836
MPG.highway -0.570 -0.588 -0.777
EngineSize 0.859 0.890 0.917
Horsepower 0.663 0.689 0.860
RPM -0.405 -0.489 -0.348
Fuel.tank.capacity 0.793 0.772 0.900
Passengers 0.649 0.516 0.510
Length 0.911 0.885 0.880
Wheelbase 1.000 0.850 0.879
Width 0.850 1.000 0.869
Weight 0.879 0.869 1.000
That would be a lot prettier if some of those variable names were shorter! (We
could make them shorter if we wanted to go to the trouble.) Now
we will do a linear multiple regression using the lm(), or "linear model" function, attempting to
predict gas mileage from four predictors. Then we will use
the step() function to do backwards regression
to eliminate the nonsignificant terms. Complicated stuff, but all incorporated
into the default R packages.
> lm.model = lm(MPG.city ~ EngineSize + Horsepower + Weight + Origin, data=Cars)
> summary(lm.model)
Call:
lm(formula = MPG.city ~ EngineSize + Horsepower + Weight + Origin,
data = Cars)
Residuals:
Min 1Q Median 3Q Max
-7.7646 -1.4445 -0.1286 1.1505 13.2322
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.910e+01 3.048e+00 16.110 < 2e-16 ***
EngineSize 1.220e+00 9.173e-01 1.330 0.187
Horsepower -5.836e-05 1.384e-02 -0.004 0.997
Weight -9.967e-03 1.890e-03 -5.273 1.19e-06 ***
Originnon-USA 1.265e+00 7.917e-01 1.598 0.114
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.083 on 77 degrees of freedom
Multiple R-squared: 0.7113, Adjusted R-squared: 0.6963
F-statistic: 47.42 on 4 and 77 DF, p-value: < 2.2e-16
> step(lm.model, direction="backward")
Start: AIC=189.48
MPG.city ~ EngineSize + Horsepower + Weight + Origin
Df Sum of Sq RSS AIC
- Horsepower 1 0.0001691 731.78 187.48
- EngineSize 1 16.82 748.60 189.34
731.78 189.48
- Origin 1 24.28 756.06 190.15
- Weight 1 264.26 996.04 212.76
Step: AIC=187.48
MPG.city ~ EngineSize + Weight + Origin
Df Sum of Sq RSS AIC
- EngineSize 1 16.84 748.62 187.34
731.78 187.48
- Origin 1 26.76 758.54 188.42
- Weight 1 387.47 1119.25 220.32
Step: AIC=187.34
MPG.city ~ Weight + Origin
Df Sum of Sq RSS AIC
- Origin 1 15.62 764.24 187.04
748.62 187.34
- Weight 1 1588.44 2337.06 278.69
Step: AIC=187.04
MPG.city ~ Weight
Df Sum of Sq RSS AIC
764.24 187.04
- Weight 1 1770.16 2534.40 283.34
Call:
lm(formula = MPG.city ~ Weight, data = Cars)
Coefficients:
(Intercept) Weight
47.76860 -0.00826
Of the original predictors, only Weight remains after the backward regression.
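One last note: step() doesn't just print its progress; it also returns the final
fitted model. If you want to keep that model around for further work, assign the
result to a name. (This is an optional extra step; trace=0 just suppresses the
step-by-step printout.)
> lm.final = step(lm.model, direction="backward", trace=0)
> lm.final # An implicit print of the chosen model.
Call:
lm(formula = MPG.city ~ Weight, data = Cars)
Coefficients:
(Intercept)       Weight
   47.76860     -0.00826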
> quit("no")
The end. We don't need to clean up here because we are telling R to quit
without saving the workspace. This will cause R to start up next time with
the same workspace it had when we started this time. Thus, our current workspace
will be lost. No harm there. Our search path will also be cleaned up every time
R restarts.
revised 2016 January 14