
SIMPLE NONLINEAR CORRELATION AND REGRESSION
Preliminaries
Some new tools will be introduced in this tutorial for dealing with
nonlinear relationships between a single numerical response variable and a
single numerical explanatory variable. In most cases, this involves finding
some way to transform one or both of the variables to make the relationship
linear. Thereafter, the methods follow those of
Simple Linear Correlation and Regression. You might want to review that
tutorial first if you are not comfortable with those methods.
Although it isn't strictly necessary at this time, as regression problems
become more complex, it will become increasingly necessary for you to read the
Model Formulae tutorial, as the model to be fit and
tested by R will have to be specified by a model formula. You can probably skip
it for now, but you'll have to read it if you go any further than this.
In this tutorial we are going to use a data set called "pressure"...
> data(pressure)
> str(pressure)
'data.frame': 19 obs. of 2 variables:
$ temperature: num 0 20 40 60 80 100 120 140 160 180 ...
$ pressure : num 2e-04 0.0012 0.006 0.03 0.09 0.27 0.75 1.85 4.2 8.8 ...
> summary(pressure)
temperature pressure
Min. : 0 Min. : 0.0002
1st Qu.: 90 1st Qu.: 0.1800
Median :180 Median : 8.8000
Mean :180 Mean :124.3367
3rd Qu.:270 3rd Qu.:126.5000
Max. :360 Max. :806.0000
The two variables are "temperature" in degrees Celsius, and the vapor "pressure"
of mercury in millimeters of mercury. This built-in data set is derived from the
Handbook of Chemistry and Physics, CRC Press (1973). A brief explanation may be
in order for my less physical-science-oriented colleagues. When a liquid is
placed in an evacuated, closed chamber (i.e., in a "vacuum"), it begins to
evaporate. This vaporization will continue until the vapor and the liquid reach
a dynamic equilibrium (the number of particles of liquid that go into the vapor
is equal to the number of particles of vapor that go into the liquid). At this
point, the pressure exerted by the vapor on the liquid is called the vapor
pressure (or equilibrium vapor pressure) of the substance. Vapor pressure has a
complex relationship with temperature, and so far as I know, there is no law in
theory that specifies this relationship (at least not in the case of mercury).
Since we are dealing with a liquid/vapor phase transition, and not just a gas,
the simple gas laws you learned back in high school chemistry don't apply. The
vapor pressure of mercury and its relationship to temperature has been an
important question, for example, in the calibration of scientific instruments
like mercury thermometers and barometers.
To bring this problem into the 21st century, I am going to begin by
converting the units of measurement to SI units. This will also avoid some
problems with log transforms (notice the minimum value of "temperature" is 0)...
> pressure$temperature = pressure$temperature + 273.15
> ### temperature is now in degrees kelvin
> pressure$pressure = pressure$pressure * .1333
> ### pressure is now in kiloPascals
> summary(pressure)
temperature pressure
Min. :273.1 Min. :2.666e-05
1st Qu.:363.1 1st Qu.:2.399e-02
Median :453.1 Median :1.173e+00
Mean :453.1 Mean :1.657e+01
3rd Qu.:543.1 3rd Qu.:1.686e+01
Max. :633.1 Max. :1.074e+02
These are linear transformations and will not affect the nature of any
relationship we might find. Now, it annoys me to have to type something like
pressure$pressure, so I would like to attach this data frame. But that produces
an annoying warning of a conflict, because one of the variables in the data
frame has the same name as the data frame. So...
> pres = pressure$pressure
> temp = pressure$temperature
> rm(pressure)
> ls()
[1] "pres" "temp"
...and we are finally ready to go.
First Thoughts: Log Transformations
We begin, as always, with a scatterplot...
> par(mfrow=c(1,4)) # one row of four graphs
> plot(pres ~ temp, main="Vapor Pressure\nof Mercury",
+ xlab="Temperature (degrees Kelvin)",
+ ylab="Pressure (kPascals)")
When I see something like that (leftmost graph), I immediately think
"exponential relationship," so a log(y) transform might be in order. This can be
done in two ways: by actually transforming the variable to create a new variable
by logpres=log(pres), or by having R plot the relationship with a logarithmic
axis...
> plot(pres ~ temp, main="Vapor Pressure\nof Mercury",
+ xlab="Temperature (degrees Kelvin)",
+ ylab="Pressure (kPascals)", log="y")
No (second graph), that appears to have overcorrected. So let's try a power
function (log(y) and log(x) transforms)...
> plot(pres ~ temp, main="Vapor Pressure\nof Mercury",
+ xlab="Temperature (degrees Kelvin)",
+ ylab="Pressure (kPascals)", log="xy")
That (third graph) appears to be a little closer but still a bit bowed. A log(x)
transform (last graph) is just silly but illustrated just to show it...
> plot(pres ~ temp, main="Vapor Pressure\nof Mercury",
+ xlab="Temperature (degrees Kelvin)",
+ ylab="Pressure (kPascals)", log="x")
Close the graphics window now so that it will reset to the default one-graph per
window partition. (A trick: After par( ) opened a graphics window, I
resized it by hand to a long, skinny window. I admit, I also had to touch up
those graphs a bit in Paint Shop Pro, as saving them resulted in some spurious
lines and words on the right hand side of the graphic. I don't know why.)
Although we know it's wrong from the scatterplots, let's fit linear models
to the exponential and power transforms just to illustrate the process...
> ls() # just checking
[1] "pres" "temp"
> par(mfrow=c(1,2)) # one row of two graphs
> lm.out1 = lm(log(pres) ~ temp) # the exponential model
> lm.out1
Call:
lm(formula = log(pres) ~ temp)
Coefficients:
(Intercept) temp
-18.95245 0.03979
> plot(lm.out1$fitted, lm.out1$resid) # evaluated via a residual plot
>
> lm.out2 = lm(log(pres) ~ log(temp)) # the power model
> lm.out2
Call:
lm(formula = log(pres) ~ log(temp))
Coefficients:
(Intercept) log(temp)
-108.11 17.61
> plot(lm.out2$fitted, lm.out2$resid) # evaluated via a residual plot
The residual plots show a clear nonrandom pattern, betraying the fact that we
have the wrong models here.
Polynomial Regression
A polynomial of sufficient order will fit any scatterplot perfectly. If the
order of the polynomial is one less than the number of points to be fitted, the
fit will be perfect. However, it will certainly not be useful or have any
explanatory power. Let's see what we can do with lower order polynomials.
A glance at the scatterplot is enough to convince me this is not a parabola,
so let's start with a third order polynomial...
> ls() # always good to know what's there
[1] "lm.out1" "lm.out2" "pres" "temp"
> options(show.signif.stars=F) # turn off annoying significance stars
> lm.out3 = lm(pres ~ temp + I(temp^2) + I(temp^3))
> summary(lm.out3)
Call:
lm(formula = pres ~ temp + I(temp^2) + I(temp^3))
Residuals:
Min 1Q Median 3Q Max
-4.3792 -2.7240 -0.3315 2.5918 6.1815
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.758e+02 6.580e+01 -7.232 2.92e-06
temp 3.804e+00 4.628e-01 8.219 6.16e-07
I(temp^2) -9.912e-03 1.050e-03 -9.442 1.06e-07
I(temp^3) 8.440e-06 7.703e-07 10.957 1.48e-08
Residual standard error: 3.414 on 15 degrees of freedom
Multiple R-squared: 0.9892, Adjusted R-squared: 0.987
F-statistic: 456.4 on 3 and 15 DF, p-value: 5.89e-15
> par(mfrow=c(1,1)) # unnecessary if this has already been reset
> plot(lm.out3$fitted, lm.out3$resid) ### output not shown
Let me explain the model formula first. Three terms are entered into the model
as predictors: temp, temp squared, and temp cubed. The squared and cubed terms
MUST be entered using "I", which in a model formula means "as is." This is
because the caret symbol inside a formula does not have its usual mathematical
meaning. If you want it to be interpreted mathematically, which we do here, it
must be protected by "I", as in the term "I(temp^2)". This gives the same result
as if we had created a new variable, tempsquared=temp^2, and entered tempsquared
into the regression formula. WARNING: If you have read the tutoral on model
formulae, you might be tempted to think that pres~poly(temp,3) would have
produced the same result. IT DOES NOT!
All three estimated coefficients are making a signficant contribution to the
fit (and the intercept is significantly different from zero as well), and any
social scientist would be impressed by an R2-adjusted of 0.987, but
the residual plot shows we do not have the right model. To see why, do this...
> plot(pres~temp)
> curve(-475.8+3.804*x-.009912*x^2+.00000844*x^3,add=T)
This plot is shown above right, and you can see the cubic regression equation
has distinct inflection points. Our mercury vapor data do not. Increasing the
order of the polynomial is a straighforward extension of the above, and it will
improve the fit, but it will not result in the correct model.
Box-Cox Tranformations
When you get desperate enough to find an answer, you'll eventually start
reading the literature! An article I found from Industrial & Engineering
Chemistry Reseach (Huber et al., 2006, Ind. Eng. Chem. Res., 45 (21), 7351-7361)
suggested to me that a reciprocal transform of the temperature might be worth a
try...
> plot(pres ~ I(1/temp)) # output not shown
Clearly, that alone is not sufficient. (Note: Once again, the reciprocal of
"temp" had to be protected inside the formula, because in a formula, 1 is not
the numeral one. It stands for the intercept, which would make no sense in
this context. We could also have done temprecip=1/temp and used temprecip in
the formula, i.e., plot(pres~temprecip).)
Box-Cox transformations allow us to find an optimum transformation of the
response variable using maximum-likelihood methods...
> library("MASS")
> par(mfrow=c(1,2))
> boxcox(pres ~ I(1/temp)) # basic box-cox plot
> boxcox(pres ~ I(1/temp), lambda=seq(-.2,.2,.01)) # finer resolution on lambda
> par(mfrow=c(1,1))
The plots suggest we adopt a lambda value very close to zero, which is to say,
we should use a log transform of the response variable...
> plot(log(pres) ~ I(1/temp))
Now that is a scatterplot I can live with! (At right.)
On with the modeling...
> ls()
[1] "lm.out1" "lm.out2" "lm.out3" "pres" "temp"
> lm.out4 = lm(log(pres) ~ I(1/temp))
> summary(lm.out4)
Call:
lm(formula = log(pres) ~ I(1/temp))
Residuals:
Min 1Q Median 3Q Max
-0.074562 -0.029292 -0.002077 0.021496 0.151712
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.626e+01 4.453e-02 365.1 <2e-16
I(1/temp) -7.307e+03 1.833e+01 -398.7 <2e-16
Residual standard error: 0.04898 on 17 degrees of freedom
Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
F-statistic: 1.59e+05 on 1 and 17 DF, p-value: < 2.2e-16
I'd say we might be on to something here! It appears we have a model regression
equation of...
log(pres) = 16.26 - 7307 / temp
And a look at the regression diagnostic plots...
> par(mfrow=c(2,2))
> plot(lm.out4)
> par(mfrow=c(1,1))
...suggests we are still not dead-bang right on, but we are awfully gosh darn
close! It might be profitable to toss out case 4 and try again.
Making Predictions From a Model
Predictions for new values of the explanatory variables are made using the
predict( ) function. The syntax of this function makes absolutely no sense
to me whatsoever, but I can tell you how to use it to make predictions from a
model. First some reference values..
| Values from Handbook of Chemistry and Physics, 76e |
| Temp (degrees K) | 400 | 450 | 500 | 550 | 600 |
| Pres (kPa) | 0.138 | 1.045 | 5.239 | 19.438 | 57.64 |
...and now we will make predictions from our model for those tabled values of
temperature...
> new.temps = seq(400,600,50) # put the new values in a vector
> predict(lm.out4, list(temp=new.temps))
1 2 3 4 5
-2.00803385 0.02159221 1.64529305 2.97377556 4.08084431
The first thing you feed the predict( ) function is the name of the data
object in which you've stored your model. Then give it a list of vectors
of new values for the explanatory variable(s) in the format you see above. (If
that syntax makes any sense to anyone, I'd like to hear why!) The output is not
pressure in kPa, because that is not what was on the left side of our model
formula. The output is log(pres) values. To get them back to actual pressures,
we have take the antilog of them...
> exp(predict(lm.out4, list(temp=new.temps)))
1 2 3 4 5
0.1342524 1.0218270 5.1825284 19.5656516 59.1954283
Those values are not so terribly far from the ones in the table. Nevertheless,
I have probably just been expelled from the fraternity of chemical
engineers!
Return to the Table of Contents
|