
A WORK-ALONG DEMONSTRATION OF R
Preliminaries
The purpose of this tutorial is to show you a bit--and I mean just a bit--of
what R can do and how it works. If you have already downloaded and installed R,
you can work along with me. If not, you should be able to get a pretty good idea
of what R is about just by reading through this tutorial. However, I will not be
explaining much of what I'm doing, so if you've never used R before, much of
this may seem pretty cryptic to you. My intent here is not to teach you R but
just to show you a little of what it can do. It will also give you some
practice at typing and executing R commands and will show you what R looks like
on your computer.
We will work on four problems. Do any or all of them, but do at least the
first one.
- Weight change in anorexic women associated with three different forms
of therapy.
- Risk factors associated with low infant birth weight.
- Some factors associated with car insurance claims.
- Data on cars on sale in the United States in 1993.
You will not have to enter any of these data sets, as they come with the
default installation of R. They are in an R library that does not load
automatically when R starts, so our first job will be getting access to the
data. I will repeat the command for doing so at the beginning of each section,
but if you are doing the entire tutorial at one sitting, you have to execute
the library( ) command only once. In fact, I
will remind you not to do otherwise.
When you start R on your computer, it will open a window that looks something
like this. This is the Windows version--it will look slightly, but not
dramatically, different on the Mac, and will be recognizable in Linux.
The window into which you type your commands is called the R Console. On the
Mac and in Windows, this runs inside of a graphical user interface (GUI). In
Linux, it runs in a terminal window. The greater than sign (>) is the
command prompt. You don't type this part--R supplies it. If the command extends
onto two or more lines, this prompt will change to a plus sign (+). Don't
type the plus sign prompts either.
To see what that looks like, try this:
> ls()
Type a lower case L ("el"), not a one, a lower case S, an open and close
parentheses, and then press the Enter (or Return) key. This command lists the
contents of a place in memory called the workspace (the computer's memory, not
your memory). If there is nothing there, the output will say character(0). Now
suppose you neglected to type the close parenthesis, a common mistake. R will
recognize that the command is incomplete and give you a continuation prompt (+)
so that you can complete it. Type the close parenthesis and press Enter. Be
sure to press the Enter key after typing a command. Otherwise, R will just sit
there waiting for you to do so, and R has infinite patience!
> ls(
+ )
R is usually insensitive to spacing, so add spacing wherever it helps you to
make things clear, or leave it out if you want to save key strokes. R is case
sensitive, however, so if you capitalize something that shouldn't be, or fail
to when you should, you will get an error message of some sort. If you get a
"function not found" or "object not found" error, that is the first thing you
should check. Then check your spelling and punctuation. In R, attention to
details is crucial.
In R, as in many programming languages, comments start with pound symbols
(#). Anything following a pound symbol is ignored by R and is there purely for
the purpose of instruction or annotation. You don't need to type these
bits. And did I mention that R is, in fact, a full featured programming
language? You don't need to be intimidated by that. R has such a large number
of built-in statistical functions that most people use it simply for interactive
data analysis and graphics. It does mean, however, that R will be very picky
about syntax. You will have to close all open parentheses, get your commas and
quotes in the right places, and things like that. Annoying, I agree, but also
very, very powerful!
One thing you may find handy--commands can be copied and pasted into R. Don't
copy the command prompts, however.
Weight Change in Anorexic Women
Begin like this...
> remove(list=ls()) ### You type only the bits after the > prompt.
> library("MASS")
> data(anorexia)
> ls()
[1] "anorexia" ### No prompt here. This is an output line.
> str(anorexia)
'data.frame': 72 obs. of 3 variables:
$ Treat : Factor w/ 3 levels "CBT","Cont","FT": 2 2 2 2 2 2 2 2 2 2 ...
$ Prewt : num 80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...
$ Postwt: num 80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 ...
The first of these commands cleaned out your workspace, deleting anything you
or someone else stored there from a previous analysis. I hope you got all those
parentheses in the right place! WARNING: R does not ask you if you "really want
to do this." It just does what you tell it to do. If you really didn't mean to,
too bad! The information in your workspace is gone, and it does not go the trash
either. It's gone! So be careful. This is a good old fashioned command line
interface, and you're expected to know what you're doing.
The second command loaded the "MASS" library, which contains the data set.
The third command placed a copy of the data set "anorexia" into your workspace.
The fourth command asked for a list of items stored in the workspace. The
output produced by R is labeled [1]. The fifth command asked to see the
structure of this data object. It is a data frame, which is a table-like object
containing cases (or observations) in the rows and variables in the columns.
There are 72 cases each measured on 3 variables. Those variables are Treat, a
factor with 3 levels (CBT is cognitive behavioral therapy, Cont is a
no-treatment control, and FT is family therapy), Prewt, a numerical variable
(pretreatment weight in pounds), and Postwt, also a numerical variable
(posttreatment weight in pounds).
For more information, or to view the source for these data, request the help
page by typing help(anorexia) at a command prompt.
Help information will open in its own window in Windows and on the Mac. Close
this window as you would any window on your system. In Linux, help
information will open in the terminal. Press q (lower case Q) to return to the
command prompt.
We continue by examining the data. Don't worry about memorizing what all the
command do now. This is just for show. And remember, you don't have to type the
stuff after the # symbol.
> summary(anorexia) # Get a data summary.
Treat Prewt Postwt
CBT :29 Min. :70.00 Min. : 71.30
Cont:26 1st Qu.:79.60 1st Qu.: 79.33
FT :17 Median :82.30 Median : 84.05
Mean :82.41 Mean : 85.17
3rd Qu.:86.00 3rd Qu.: 91.55
Max. :94.90 Max. :103.60
> attach(anorexia) # Make the variables inside anorexia visible to R.
> by(anorexia, Treat, summary) # Summarize each treatment group separately.
Treat: CBT
Treat Prewt Postwt
CBT :29 Min. :70.00 Min. : 71.3
Cont: 0 1st Qu.:80.40 1st Qu.: 81.9
FT : 0 Median :82.60 Median : 83.9
Mean :82.69 Mean : 85.7
3rd Qu.:85.00 3rd Qu.: 90.9
Max. :94.90 Max. :103.6
------------------------------------------------------------
Treat: Cont
Treat Prewt Postwt
CBT : 0 Min. :70.50 Min. :73.00
Cont:26 1st Qu.:77.72 1st Qu.:77.58
FT : 0 Median :80.65 Median :80.70
Mean :81.56 Mean :81.11
3rd Qu.:85.88 3rd Qu.:84.67
Max. :91.80 Max. :89.60
------------------------------------------------------------
Treat: FT
Treat Prewt Postwt
CBT : 0 Min. :73.40 Min. : 75.2
Cont: 0 1st Qu.:80.50 1st Qu.: 90.7
FT :17 Median :83.30 Median : 92.5
Mean :83.23 Mean : 90.5
3rd Qu.:86.00 3rd Qu.: 95.2
Max. :94.20 Max. :101.6
>
> tapply(Postwt, Treat, mean) # Another way to get means by groups.
CBT Cont FT
85.69655 81.10769 90.49412
> tapply(Postwt, Treat, sd) # Sample standard deviations by groups.
CBT Cont FT
8.351924 4.744253 8.475072
> tapply(Postwt, Treat, length) # Group sizes.
CBT Cont FT
29 26 17
If you're relatively new to computers (i.e., under the age of 40), here's
something you will have to get used to about command line interfaces. Once
something is printed to the screen, it's history, gone, done with. It will
never be updated, changed or revised. ALL NEW INFORMATION OR CHANGES TO OLD
INFORMATION WILL BE PRINTED OUT BELOW WHERE YOU TYPE A NEW COMMAND. So
if you change something, you will have to issue a new command to see that
change. What has already scrolled by is ancient history.
Another thing you'll have to get used to is that when things go well in R,
R will often be silent. I.e., it will not tell you that it did what you told it
to do. You will have to ask it to show you. Here is an example.
> x = 7 # Assign the value 7 to x. Notice there is no output.
> x # View the value of x, after you remember to press Enter.
[1] 7
> x = 8 # A new value for x.
> x # View it.
[1] 8
> rm(x) # Erase x. Done! No confirmatory message is printed.
> x
Error: object 'x' not found
>
Now let's draw a nice graph of the anorexia data. The graph will be drawn in
a separate graphics window (called a "graphics device" in R-speak). If no such
window is open, R will open one. If one is already open, R will use it and
erase whatever graph is already there. So if you wanted a copy of that old
graph, well, too late now! You can manipulate this graphics window just like
any other window on your system. You can move it around, resize it, and close
it by clicking on the proper button. Closing the window will also erase the
graph, so if you want to keep it, take steps to do so before writing over it or
closing it. I'll tell you how to do this in a future tutorial.
> Change = Postwt - Prewt # Create a variable called Change.
> boxplot(Change ~ Treat) # The symbol before Treat is a tilde.
> title(main = "Weight Gain in Pounds by Treatment")
> title(xlab = "Type of treatment")
> title(ylab = "Weight gain in pounds")
When you are done admiring the graphics device window, you can click it shut in
the normal way for your operating system.
When a graphics window opens, by the way, it removes focus from the R Console
window, so you will have to click in the console to continue working there. Now
let's open a smaller graphics window and look at a scatterplot...
> windows(4,4,10) ### Windows only!
> ### On a Mac, type quartz("Q",4,4,10)
> ### On Linux just skip this step.
> scatter.smooth(Prewt, Postwt)
> title(main="Postweight by Preweight Scatterplot")
> abline(a=0, b=1, col="red")
The scatter.smooth( ) function draws a
scatterplot with a smoothed, nonparametric (lowess) regression line on the
plot. I have also added a red "no change" line with the
abline( ) function (intercept=0 and slope=1). This window can also
be closed in the way you would close any window on your desktop. A later
tutorial will deal with how to modify and dress up graphics for publication,
but R does a fairly decent job of drawing graphics just by default, as you can
see.
It appears the treatment effect is nonlinear, with lower preweight women
benefiting from treatment more than higher preweight women. An analysis of
variance is, therefore, suspect. Nevertheless, for the sake of illustration...
> aov.out = aov(Change ~ Treat) # Change tilde Treat
> summary(aov.out)
Df Sum Sq Mean Sq F value Pr(>F)
Treat 2 614.6 307.3 5.4223 0.006499 **
Residuals 69 3910.7 56.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> TukeyHSD(aov.out) # A Tukey HSD post hoc test.
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Change ~ Treat)
$Treat
diff lwr upr p adj
Cont-CBT -3.456897 -8.327276 1.413483 0.2124428
FT-CBT 4.257809 -1.250554 9.766173 0.1607461
FT-Cont 7.714706 2.090124 13.339288 0.0045127
The aov( ) function accepts a formula that
tells what analysis to do. The formula has the basic form DV ~ IV (DV tilde
IV), which means "DV as explained by or predicted by or modeled by IV."
Treatment contrasts are also available...
> summary.lm(aov.out)
Call:
aov(formula = Change ~ Treat)
Residuals:
Min 1Q Median 3Q Max
-12.565 -4.543 -1.007 3.846 17.893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.007 1.398 2.151 0.0350 *
TreatCont -3.457 2.033 -1.700 0.0936 .
TreatFT 4.258 2.300 1.852 0.0684 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.528 on 69 degrees of freedom
Multiple R-squared: 0.1358, Adjusted R-squared: 0.1108
F-statistic: 5.422 on 2 and 69 DF, p-value: 0.006499
We can also do a Kruskal-Wallis nonparametric ANOVA...
> kruskal.test(Change ~ Treat)
Kruskal-Wallis rank sum test
data: Change by Treat
Kruskal-Wallis chi-squared = 9.0475, df = 2, p-value = 0.01085
That should get you started and your appetite whetted. Let's move on.
> ls() # View contents of workspace.
[1] "anorexia" "aov.out" "Change"
> detach(anorexia) # ALWAYS detach after attaching!
> rm(anorexia, aov.out, Change) # Short for remove( )
But first you should execute the commands above to clean up a bit. It's a good
idea to keep your workspace clean by erasing (removing) things you no longer
need. This will come back to bite you eventually if you don't. Also, keep your
search path clean (explained in a future tutorial) by detaching anything
you've attached to it.
Risk Factors Associated With Low Infant Birthweight
We begin as before. You should not execute the library("MASS") command if you are still running the same
R session. Try typing search( ), and if you see
"package:MASS" among the listed packages, it's already loaded.
> library("MASS") # Only if you have quit R since the last example.
> data(birthwt) # Add this dataset to your workspace.
> ls() # List the contents of your workspace.
[1] "birthwt"
> str(birthwt) # View the structure of the dataset.
'data.frame': 189 obs. of 10 variables:
$ low : int 0 0 0 0 0 0 0 0 0 0 ...
$ age : int 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
$ race : int 2 3 1 1 1 3 1 3 1 1 ...
$ smoke: int 0 0 1 1 1 0 0 0 1 1 ...
$ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
$ ht : int 0 0 0 0 0 0 0 0 0 0 ...
$ ui : int 1 0 0 1 1 0 0 0 0 0 ...
$ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
$ bwt : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
I will begin by recoding some of the variables to make them factors. This could
have been done when the data set was first prepared, but it wasn't. If you've
jumped the gun and have already attached the dataframe, detach it. NEVER modify
an attached dataframe.
> ### low: birthwt below 2.5 kg, 0=no, 1=yes; we will leave this as is
> ### age: age of mother in years; no need to recode
> ### lwt: weight of mother in pounds at last menstrual period; no need to recode
> ### race: R thinks this is an integer, so we want to recode
> birthwt$race = factor(birthwt$race, labels=c("white","black","other"))
> ### Note: birthwt$race means the race variable inside the birthwt data frame
> ### smoke: smoking status during pregnancy; also needs recoding
> birthwt$smoke = factor(birthwt$smoke, labels=c("no","yes"))
> ### ptl: number of previous premature labors; no need to recode
> ### ht: history of hypertension (0="no", 1="yes")
> birthwt$ht = factor(birthwt$ht, labels=c("no","yes"))
> ### ui: presence of uterine irritability (0="no", 1="yes")
> birthwt$ui = factor(birthwt$ui, labels=c("no","yes"))
> ### ftv: number of physician visits during first trimester
> ### bwt: birth weight of baby in grams; change to kilograms
> birthwt$bwt = birthwt$bwt / 1000
> summary(birthwt)
low age lwt race smoke
Min. :0.0000 Min. :14.00 Min. : 80.0 white:96 no :115
1st Qu.:0.0000 1st Qu.:19.00 1st Qu.:110.0 black:26 yes: 74
Median :0.0000 Median :23.00 Median :121.0 other:67
Mean :0.3122 Mean :23.24 Mean :129.8
3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:140.0
Max. :1.0000 Max. :45.00 Max. :250.0
ptl ht ui ftv bwt
Min. :0.0000 no :177 no :161 Min. :0.0000 Min. :0.709
1st Qu.:0.0000 yes: 12 yes: 28 1st Qu.:0.0000 1st Qu.:2.414
Median :0.0000 Median :0.0000 Median :2.977
Mean :0.1958 Mean :0.7937 Mean :2.945
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:3.487
Max. :3.0000 Max. :6.0000 Max. :4.990
Notice that summary is a very intelligent command. It knows to summarize
numbers as numbers and factors as factors. It will get even smarter in the
future. Let's crosstabulate some of the categorical variables (factors).
> attach(birthwt) # Makes the variables inside birthwt visible.
> table(smoke, low) # Hence, we don't need to type birthwt$smoke, etc.
low
smoke 0 1
no 86 29
yes 44 30
> xtabs(~ smoke + low) # An alternate form that does the same thing.
low
smoke 0 1
no 86 29
yes 44 30
> table(smoke, race) # Remember, the space after the comma is optional.
race
smoke white black other
no 44 16 55
yes 52 10 12
> table(smoke,ht,ui,low) # A four-dimensional contingency table.
, , ui = no, low = 0
ht
smoke no yes
no 75 3
yes 36 2
, , ui = yes, low = 0
ht
smoke no yes
no 8 0
yes 6 0
, , ui = no, low = 1
ht
smoke no yes
no 18 4
yes 20 3
, , ui = yes, low = 1
ht
smoke no yes
no 7 0
yes 7 0
> big.table = table(smoke,ht,ui,low) # Store output rather than printing it.
> ftable(big.table) # A more convenient way to view this.
low 0 1
smoke ht ui
no no no 75 18
yes 8 7
yes no 3 4
yes 0 0
yes no no 36 20
yes 6 7
yes no 2 3
yes 0 0
> flat.table = ftable(big.table) # Store output so we can work with it.
> prop.table(flat.table, 1) # As proportions rather than raw freqs.
low 0 1
smoke ht ui
no no no 0.8064516 0.1935484
yes 0.5333333 0.4666667
yes no 0.4285714 0.5714286
yes NaN NaN
yes no no 0.6428571 0.3571429
yes 0.4615385 0.5384615
yes no 0.4000000 0.6000000
yes NaN NaN
Let's look at smoking by race. Some of these commands are long, so I broke them
in the middle by hitting Enter and continuing on the next line. Remember not to
type the + signs, as R will supply them. And spacing is optional. I added some
spacing to make the command easier to read. Watch your commas and quotes, too.
One missed or misplaced comma will make the whole command wrong!
> smoke.by.race = table(smoke, race)
> barplot(smoke.by.race, beside=T,
+ main="Smoking by Race",
+ ylab="Freq", xlab="Race",
+ col=c("black","lightgray"))
> legend(x=3.5, y=45, legend=c("nonsmokers","smokers"),
+ fill=c("black","lightgray"))
> results = chisq.test(smoke.by.race)
> results
Pearson's Chi-squared test
data: smoke.by.race
X-squared = 21.779, df = 2, p-value = 1.865e-05
> results$expected # View the expected values.
race
smoke white black other
no 58.4127 15.82011 40.76720
yes 37.5873 10.17989 26.23280
> results$residuals # View the residuals in each cell.
race
smoke white black other
no -1.88578277 0.04522852 2.22912825
yes 2.35084895 -0.05638265 -2.77886927
Are any of the factors related to low birthweight? We'll find out by using
logistic regression, an advanced technique incorporated into the default R
packages you get at download.
> glm.out = glm(low ~ race + smoke + ht + ui, family=binomial(logit))
> summary(glm.out)
Call:
glm(formula = low ~ race + smoke + ht + ui, family = binomial(logit))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6618 -0.7922 -0.4825 1.1413 2.1016
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.0919 0.3800 -5.505 3.69e-08 ***
raceblack 1.0638 0.4990 2.132 0.03301 *
raceother 1.0834 0.4131 2.622 0.00873 **
smokeyes 1.0940 0.3797 2.881 0.00397 **
htyes 1.3594 0.6297 2.159 0.03087 *
uiyes 1.0059 0.4385 2.294 0.02179 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 211.17 on 183 degrees of freedom
AIC: 223.17
Number of Fisher Scoring iterations: 4
Apparently they all are! Don't forget to clean up your workspace and search
path.
> detach(birthwt) ### Always detach after you attach!
> ls() # List contents of workspace.
[1] "big.table" "birthwt" "flat.table" "glm.out"
[5] "results" "smoke.by.race"
> rm(list=ls()) # Completely clear the workspace.
>ls()
character(0)
Notice that when I was naming things, I did not put spaces or dashes in those
names. R wouldn't like that! Use a period instead. Notice also that the
chisq.test( ) and
glm( ) commands produced no output when they were executed, because
that output was stored in results and glm.out, respectively. We then had to ask to see the results,
in one case by using summary(glm.out). This is
typically how R works.
Some Factors Associated With Car Insurance Claims
As before, don't execute the library( )
command if you are still in the same active R session.
> library("MASS") # Not necessary if still in search path.
> data(Insurance)
> str(Insurance)
'data.frame': 64 obs. of 5 variables:
$ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Group : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ...
$ Age : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
$ Holders : int 197 264 246 1680 284 536 696 3582 133 286 ...
$ Claims : int 38 35 20 156 63 84 89 400 19 52 ...
> summary(Insurance)
District Group Age Holders Claims
1:16 <1l :16 <25 :16 Min. : 3.00 Min. : 0.00
2:16 1-1.5l:16 25-29:16 1st Qu.: 46.75 1st Qu.: 9.50
3:16 1.5-2l:16 30-35:16 Median : 136.00 Median : 22.00
4:16 >2l :16 >35 :16 Mean : 364.98 Mean : 49.23
3rd Qu.: 327.50 3rd Qu.: 55.50
Max. :3582.00 Max. :400.00
District is the location of the policy holder, with 4 being in a major city.
Group has to do with the size of the car's engine in liters. Age is obvious.
Holders and Claims are both counts.
There is no replication within the cells, so a fully crossed 3-way ANOVA
would leave us without an error term. Examination of the data reveals an
apparent absence of interactions. So we will request a 3-way ANOVA without
testing the interactions.
> aov.out = aov(I(Claims/Holders*100) ~ District + Age + Group, data=Insurance)
> ### Note: We made the DV--claims per 100 holders--on the fly.
> summary(aov.out)
Df Sum Sq Mean Sq F value Pr(>F)
District 3 88.87 29.62 0.6014 0.6168748
Age 3 517.63 172.54 3.5030 0.0214000 *
Group 3 974.59 324.86 6.5955 0.0007036 ***
Residuals 54 2659.78 49.26
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> TukeyHSD(aov.out, which=c("Age","Group"))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = I(Claims/Holders * 100) ~ District + Age + Group, data = Insurance)
$Age
diff lwr upr p adj
25-29-<25 -2.858071 -9.435711 3.719568 0.6593722
30-35-<25 -5.016151 -11.593791 1.561488 0.1928569
>35-<25 -7.748445 -14.326084 -1.170805 0.0148258
30-35-25-29 -2.158080 -8.735719 4.419560 0.8204004
>35-25-29 -4.890374 -11.468013 1.687266 0.2117055
>35-30-35 -2.732294 -9.309933 3.845346 0.6903616
$Group
diff lwr upr p adj
1-1.5l-<1l 2.237368 -4.3402712 8.815008 0.8039620
1.5-2l-<1l 7.146119 0.5684797 13.723759 0.0282532
>2l-<1l 9.879534 3.3018946 16.457174 0.0011507
1.5-2l-1-1.5l 4.908751 -1.6688885 11.486390 0.2088731
>2l-1-1.5l 7.642166 1.0645263 14.219805 0.0166593
>2l-1.5-2l 2.733415 -3.8442246 9.311054 0.6900880
My advice would be not to do something so silly as to use greater than, less
than, and minus signs in your group names. Just a thought! But notice even when
something so silly as mathematical operators are included in group names, R can
handle it if you're careful. I would also not take those results at face value,
as an examination of the model assumptions reveals heterogeneity of variance
and nonnormality. Perhaps a transform of the response could fix that!
Yes, those graphs were also created in R. They show clearly that the
homogeneity of variance and normal distribution assumptions have been
violated.
> ls()
[1] "aov.out" "Insurance"
> rm(list=ls()) ### Clean up the mess we've made.
Some Cars Sold In the United States in 1993
> library("MASS") ### Don't do this if it's already done.
> data(Cars93)
> str(Cars93)
'data.frame': 93 obs. of 27 variables:
$ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
$ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
$ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
$ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
$ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
$ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
$ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
$ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
$ Passengers : int 5 5 5 6 4 6 6 6 5 6 ...
$ Length : int 177 195 180 193 186 189 200 216 198 206 ...
$ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
$ Width : int 68 71 67 70 69 69 74 78 73 73 ...
$ Turn.circle : int 37 38 37 37 39 41 42 45 41 43 ...
$ Rear.seat.room : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
$ Luggage.room : int 11 15 14 17 13 16 17 21 14 18 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
$ Origin : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
$ Make : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
What would be interesting to look at here?
> summary(Cars93)
Manufacturer Model Type Min.Price Price
Chevrolet: 8 100 : 1 Compact:16 Min. : 6.70 Min. : 7.40
Ford : 8 190E : 1 Large :11 1st Qu.:10.80 1st Qu.:12.20
Dodge : 6 240 : 1 Midsize:22 Median :14.70 Median :17.70
Mazda : 5 300E : 1 Small :21 Mean :17.13 Mean :19.51
Pontiac : 5 323 : 1 Sporty :14 3rd Qu.:20.30 3rd Qu.:23.30
Buick : 4 535i : 1 Van : 9 Max. :45.40 Max. :61.90
(Other) :57 (Other):87
Max.Price MPG.city MPG.highway AirBags
Min. : 7.9 Min. :15.00 Min. :20.00 Driver & Passenger:16
1st Qu.:14.7 1st Qu.:18.00 1st Qu.:26.00 Driver only :43
Median :19.6 Median :21.00 Median :28.00 None :34
Mean :21.9 Mean :22.37 Mean :29.09
3rd Qu.:25.3 3rd Qu.:25.00 3rd Qu.:31.00
Max. :80.0 Max. :46.00 Max. :50.00
DriveTrain Cylinders EngineSize Horsepower RPM
4WD :10 3 : 3 Min. :1.000 Min. : 55.0 Min. :3800
Front:67 4 :49 1st Qu.:1.800 1st Qu.:103.0 1st Qu.:4800
Rear :16 5 : 2 Median :2.400 Median :140.0 Median :5200
6 :31 Mean :2.668 Mean :143.8 Mean :5281
8 : 7 3rd Qu.:3.300 3rd Qu.:170.0 3rd Qu.:5750
rotary: 1 Max. :5.700 Max. :300.0 Max. :6500
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers
Min. :1320 No :32 Min. : 9.20 Min. :2.000
1st Qu.:1985 Yes:61 1st Qu.:14.50 1st Qu.:4.000
Median :2340 Median :16.40 Median :5.000
Mean :2332 Mean :16.66 Mean :5.086
3rd Qu.:2565 3rd Qu.:18.80 3rd Qu.:6.000
Max. :3755 Max. :27.00 Max. :8.000
Length Wheelbase Width Turn.circle
Min. :141.0 Min. : 90.0 Min. :60.00 Min. :32.00
1st Qu.:174.0 1st Qu.: 98.0 1st Qu.:67.00 1st Qu.:37.00
Median :183.0 Median :103.0 Median :69.00 Median :39.00
Mean :183.2 Mean :103.9 Mean :69.38 Mean :38.96
3rd Qu.:192.0 3rd Qu.:110.0 3rd Qu.:72.00 3rd Qu.:41.00
Max. :219.0 Max. :119.0 Max. :78.00 Max. :45.00
Rear.seat.room Luggage.room Weight Origin Make
Min. :19.00 Min. : 6.00 Min. :1695 USA :48 Acura Integra: 1
1st Qu.:26.00 1st Qu.:12.00 1st Qu.:2620 non-USA:45 Acura Legend : 1
Median :27.50 Median :14.00 Median :3040 Audi 100 : 1
Mean :27.83 Mean :13.89 Mean :3073 Audi 90 : 1
3rd Qu.:30.00 3rd Qu.:15.00 3rd Qu.:3525 BMW 535i : 1
Max. :36.00 Max. :22.00 Max. :4105 Buick Century: 1
NA's : 2.00 NA's :11.00 (Other) :87
Some of the variables have missing values, viz., Rear.seat.room and
Luggage.room. In R, missing values are denoted by NA, or "not available". I will
begin by removing those cases.
> Cars = na.omit(Cars93)
> dim(Cars) # Get dimensions of remaining dataframe.
[1] 82 27
Eleven cases have been eliminated. Now we'll look at a correlation matrix of
most of the numerical variables. This will illustrate how in R you can use one
command to do many things at once. Doing so makes the command kinda complicated,
so be very careful with your typing. The first part of the command rounds the
results to 3 decimal places. Nested inside of that command is one that
calculates the correlation matrix. And nested inside of that is a restriction
on the Cars dataframe that specifies only certain variables to be included.
> round(cor(Cars[,c(4,5,6,7,8,12,13,14,17,18,19,20,21,25)]),3)
Min.Price Price Max.Price MPG.city MPG.highway EngineSize
Min.Price 1.000 0.971 0.910 -0.663 -0.664 0.705
Price 0.971 1.000 0.983 -0.622 -0.626 0.642
Max.Price 0.910 0.983 1.000 -0.565 -0.572 0.567
MPG.city -0.663 -0.622 -0.565 1.000 0.945 -0.747
MPG.highway -0.664 -0.626 -0.572 0.945 1.000 -0.666
EngineSize 0.705 0.642 0.567 -0.747 -0.666 1.000
Horsepower 0.795 0.787 0.750 -0.704 -0.687 0.772
RPM -0.092 -0.035 0.011 0.327 0.234 -0.524
Fuel.tank.capacity 0.736 0.707 0.656 -0.797 -0.739 0.821
Passengers 0.327 0.274 0.222 -0.463 -0.414 0.526
Length 0.611 0.552 0.485 -0.725 -0.617 0.852
Wheelbase 0.676 0.628 0.566 -0.671 -0.570 0.859
Width 0.544 0.495 0.438 -0.694 -0.588 0.890
Weight 0.777 0.737 0.677 -0.836 -0.777 0.917
Horsepower RPM Fuel.tank.capacity Passengers Length
Min.Price 0.795 -0.092 0.736 0.327 0.611
Price 0.787 -0.035 0.707 0.274 0.552
Max.Price 0.750 0.011 0.656 0.222 0.485
MPG.city -0.704 0.327 -0.797 -0.463 -0.725
MPG.highway -0.687 0.234 -0.739 -0.414 -0.617
EngineSize 0.772 -0.524 0.821 0.526 0.852
Horsepower 1.000 0.024 0.796 0.241 0.643
RPM 0.024 1.000 -0.244 -0.378 -0.440
Fuel.tank.capacity 0.796 -0.244 1.000 0.451 0.794
Passengers 0.241 -0.378 0.451 1.000 0.646
Length 0.643 -0.440 0.794 0.646 1.000
Wheelbase 0.663 -0.405 0.793 0.649 0.911
Width 0.689 -0.489 0.772 0.516 0.885
Weight 0.860 -0.348 0.900 0.510 0.880
Wheelbase Width Weight
Min.Price 0.676 0.544 0.777
Price 0.628 0.495 0.737
Max.Price 0.566 0.438 0.677
MPG.city -0.671 -0.694 -0.836
MPG.highway -0.570 -0.588 -0.777
EngineSize 0.859 0.890 0.917
Horsepower 0.663 0.689 0.860
RPM -0.405 -0.489 -0.348
Fuel.tank.capacity 0.793 0.772 0.900
Passengers 0.649 0.516 0.510
Length 0.911 0.885 0.880
Wheelbase 1.000 0.850 0.879
Width 0.850 1.000 0.869
Weight 0.879 0.869 1.000
That would be a lot prettier if some of those variable names were shorter! Now
we will do a linear multiple regression using the lm( ), or "linear model" function. Then we will use
the step( ) function to do backwards regression
to eliminate the nonsignificant terms. Complicated stuff, but all incorporated
into the default R packages.
> lm.model = lm(MPG.city ~ EngineSize + Horsepower + Weight + Origin, data=Cars)
> summary(lm.model)
Call:
lm(formula = MPG.city ~ EngineSize + Horsepower + Weight + Origin,
data = Cars)
Residuals:
Min 1Q Median 3Q Max
-7.7646 -1.4445 -0.1286 1.1505 13.2322
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.910e+01 3.048e+00 16.110 < 2e-16 ***
EngineSize 1.220e+00 9.173e-01 1.330 0.187
Horsepower -5.836e-05 1.384e-02 -0.004 0.997
Weight -9.967e-03 1.890e-03 -5.273 1.19e-06 ***
Originnon-USA 1.265e+00 7.917e-01 1.598 0.114
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.083 on 77 degrees of freedom
Multiple R-squared: 0.7113, Adjusted R-squared: 0.6963
F-statistic: 47.42 on 4 and 77 DF, p-value: < 2.2e-16
> step(lm.model, direction="backward")
Start: AIC=189.48
MPG.city ~ EngineSize + Horsepower + Weight + Origin
Df Sum of Sq RSS AIC
- Horsepower 1 0.0001691 731.78 187.48
- EngineSize 1 16.82 748.60 189.34
731.78 189.48
- Origin 1 24.28 756.06 190.15
- Weight 1 264.26 996.04 212.76
Step: AIC=187.48
MPG.city ~ EngineSize + Weight + Origin
Df Sum of Sq RSS AIC
- EngineSize 1 16.84 748.62 187.34
731.78 187.48
- Origin 1 26.76 758.54 188.42
- Weight 1 387.47 1119.25 220.32
Step: AIC=187.34
MPG.city ~ Weight + Origin
Df Sum of Sq RSS AIC
- Origin 1 15.62 764.24 187.04
748.62 187.34
- Weight 1 1588.44 2337.06 278.69
Step: AIC=187.04
MPG.city ~ Weight
Df Sum of Sq RSS AIC
764.24 187.04
- Weight 1 1770.16 2534.40 283.34
Call:
lm(formula = MPG.city ~ Weight, data = Cars)
Coefficients:
(Intercept) Weight
47.76860 -0.00826
Of the original predictors, only Weight remains after the backward regression.
> quit("no")
The end. We don't need to clean up here because we are telling R to quit
without saving the workspace. This will cause R to start up next time with
the same workspace it had when we started this time.
revised 2010 July 25
Return to the Table of Contents
|