Psyc 480 -- Dr. King

Introduction to Nonlinear Regression

It's not often that a paper appears in a psychological journal that revolutionizes the entire field, but that's just what happened in 1959 when Lloyd and Margaret Peterson published a paper with the innocuous title Short-Term Retention of Individual Verbal Items in the Journal of Experimental Psychology (vol. 58, no. 3, pgs. 193-198). Today, if you say "Peterson and Peterson, 1959," to (properly educated) psychologists, they will know immediately what you're talking about.

Imagine this. I say three letters to you, for example, CXF. Do you think you could remember that trigram of letters for 18 seconds? Your answer is probably, "Sure, who couldn't?" What if I gave you something else to think about during those 18 seconds? What Peterson and Peterson showed in their experiment was that under certain circumstances the chance that a person can remember a three-letter trigram for 18 seconds is less than 10%.

In 1959, when we thought of memory as something that lasted for days, weeks, or even years, the possibility that someone with a normal brain (okay, college students) could forget something so simple in as little as 18 seconds seemed almost unbelievable. What Peterson and Peterson had demonstrated was the existence of a form of memory called short-term memory, and today short-term memory is studied in animals and in humans, with normal as well as abnormal brains. Short-term memory is impaired in people diagnosed as schizophrenic, for example, and that has turned out to be a significant clue as to what's happening in the brains of those individuals.

Note: To be completely fair to the brains of college students, I should point out that the explanation of the above result is not as simple as simple memory trace decay. There are a good many websites that will explain this to you in some detail. If you're interested, one of the better ones is Dr. Kihlstrom's website at UC Berkeley.

Peterson and Peterson gave normal, healthy, young adults three-letter trigrams to remember, then asked them to immediately begin counting backwards by threes from a randomly chosen number (the "distractor task"). At the end of 3 sec., 6 sec., 9 sec., 12 sec., 15 sec., or 18 sec. (the "retention interval") a red light came on, which signalled the subject to attempt to recall the trigram. All three letters had to be recalled in the correct order to be considered a correct answer, and the subject had 15 seconds to do it (although they found, if recall hadn't occurred in half that time, it almost certainly wasn't going to).

Here are the results they got. (I took these numbers from a graph in their paper, so they may not be exactly accurate, but they're close enough. These results are averaged over many subjects and many trials per subject.)

retention interval         3 sec  6 sec  9 sec  12 sec  15 sec  18 sec
chance of correct recall    79%    65%    32%    22%     11%      9%

Fire up R, clear the workspace, and enter these data as follows.

RI = c(3, 6, 9, 12, 15, 18)
recall = c(79, 65, 32, 22, 11, 9)

Now plot these points on a scatterplot with RI as the explanatory variable and recall as the response. (You should know how to do this by now. If not, go back and redo some of the previous material. Hint: There is nothing to attach here. The variables are already in the workspace.)

Now we'll do the linear regression.

lm.out = lm(recall ~ RI)   # we do not need data= because the variables are in the workspace
summary(lm.out)

Questions

1) What percentage of the variability in recall is accounted for by the retention interval?
(two decimal places)

2) What is the typical magnitude of a residual?

3) If recall were asked for immediately, what is the prediction for percent correct recall?
%

4) Does this sound like a reasonable "real world" answer to you?
(yes/no)

5) For each one second increase in the length of the retention interval, predicted percent recall in this model changes by how much?

Let's look at the regression diagnostics. Begin by closing the graphics window, if it's still open.

par(mfrow=c(2,2))
plot(lm.out, 1:4)

Questions

6) Does the relationship between recall and RI appear to be linear?
(yes/no)

7) Are the residuals reasonably normally distributed?
(yes/no)

8) Has homogeneity of variance reasonably been met?
(yes/no)

9) Are there any influential points?
(yes/no)

Although R-squared is quite high, we do not have the correct model. The problem is, this relationship is not linear. How can we fix that in our regression analysis?

You should recall from your basic math classes that curves are described by polynomials that have squared, cubed, and higher order terms in them. That is:

y = ax^2 + bx + c

describes a curve, not a straight line. That curve can change direction once. I.e., it can start by going down and then bottom out and go back up again, or it can start by going up, top out, and go back down again. In general, curves described by polynomials can change directions one less time than the highest exponent in the polynomial. (Can but not necessarily will.)

The curve described by

y = ax^3 + bx^2 + cx + d

can change directions twice. A quartic curve can change directions three times, and so on. These equations can also be used to describe curves that don't change direction at all over the range we care about. For example, a quadratic polynomial describes a parabola. See the following graph.

[Figure: a parabola, with the portion of the x-axis from 0 to 3 shaded]
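If you'd like to draw a parabola like this yourself, R's curve() function will do it (curve() is explained in more detail further down). This is just a sketch: the coefficients are arbitrary, chosen so the turning point (x = 4) falls outside the 0-to-3 range.

curve(x^2 - 8*x + 20, from=0, to=8)    # an arbitrary parabola that bottoms out at x = 4
abline(v=c(0, 3), lty=2)               # mark off the range of the IV, x = 0 to 3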

But suppose the range of the values that make up our IV is only 0 to 3 (shaded). That bit of the parabola never changes direction. So the traditional way of describing all curvilinear relationships was with polynomials. We might start with all terms up to the 4th power and then drop them out one by one until all remaining terms are significant. It would look something like this. (Note: In R, when you include the IV raised to a power in the regression formula, it has to be "isolated" by putting it inside I(). You'll see what I mean.)

> summary(lm(recall ~ RI + I(RI^2) + I(RI^3) + I(RI^4)))   # use I() to 'isolate'

Call:
lm(formula = recall ~ RI + I(RI^2) + I(RI^3) + I(RI^4))

Residuals:
      1       2       3       4       5       6 
-0.3968  1.9841 -3.9683  3.9683 -1.9841  0.3968 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.333333  54.916965   0.898    0.534
RI          23.070547  29.971938   0.770    0.582
I(RI^2)     -5.382716   5.229296  -1.029    0.491
I(RI^3)      0.368999   0.361247   1.021    0.493
I(RI^4)     -0.008230   0.008573  -0.960    0.513

Residual standard error: 6.299 on 1 degrees of freedom
Multiple R-squared:  0.9907,	Adjusted R-squared:  0.9534 
F-statistic: 26.56 on 4 and 1 DF,  p-value: 0.1444

The percentage of variability explained has increased to 99.07%, which is certainly impressive. Nevertheless, this model totally sucks. None of the coefficients is significantly different from zero. The prediction for immediate recall is only 49.333% correct. The F test does not reveal this model to be significantly better than using just the mean of recall as the predicted value for all retention intervals.

So we start dropping out terms one at a time starting with the highest order term (the 4th power term), and eventually we get to this.
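The intermediate step, dropping the 4th-power term and refitting the cubic model, would look like this. I haven't shown its output here; run it yourself and check whether the new highest-order term is significant before dropping it too.

summary(lm(recall ~ RI + I(RI^2) + I(RI^3)))   # cubic model: 4th-power term dropped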

> summary(lm(recall ~ RI + I(RI^2)))

Call:
lm(formula = recall ~ RI + I(RI^2))

Residuals:
      1       2       3       4       5       6 
-3.4286  8.0571 -4.7429  0.1714 -1.2000  1.1429 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 113.2000    10.4278  10.856  0.00167 **
RI          -11.1381     2.2741  -4.898  0.01629 * 
I(RI^2)       0.2937     0.1060   2.770  0.06956 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.829 on 3 degrees of freedom
Multiple R-squared:  0.976,	Adjusted R-squared:  0.9601 
F-statistic: 61.11 on 2 and 3 DF,  p-value: 0.003708

This model is significantly better than using just mean(recall) as the predictor for every retention interval (p=0.004), R-squared is still pretty impressive, and the coefficient for the squared term is at least nearly significant (p=0.070). It's still making a goofy prediction for immediate recall, however (113.2% correct). Nevertheless, in the olden days, this is the model they probably would have settled for. Let's draw this curve on top of the scatterplot. Close the graphics window, if it's open, and do this.

plot(recall ~ RI)
curve(113.2 - 11.14*x + 0.294*x^2, from=3, to=18, add=T)

Note: In R, the curve() function draws a graph. The equation of the line or curve on the graph is given first, and the IV is always x (lower case x). The options from= and to= give the range of x over which to plot. If you want the graph added to a graphics window that is already open, the option add=T does it.

Not a bad fit, is it? It accounts for the curve in the points pretty well. Of course, as psychologists, we're not happy, because there is no theoretical reason that forgetting should follow a quadratic trend.

If the trend appears to be monotonic (i.e., does not change direction), and appears to be either accelerating or decelerating (i.e., the instantaneous slope of the curve is either increasing or decreasing - hope you remember your Math 133!), then a better (and far easier) way to do this is to assume we have one of the following three relationships.

    logarithmic:    y = a + b*log(x)
    exponential:    log(y) = a + b*x
    power:          log(y) = a + b*log(x)

Yes, there are a lot of logarithms in nonlinear regression. If you don't remember what a logarithm is, watch this video: Intro to Base 10 Logs. (I do not endorse any political candidate you may see an ad for at the beginning of this video. Google execs should be hung up by their thumbs for ruining YouTube like this!)

Some rules for logarithms you may find handy:

    log(a*b) = log(a) + log(b)
    log(a/b) = log(a) - log(b)
    log(a^b) = b * log(a)
    log10(10^a) = a   and   10^log10(a) = a   (logs and antilogs undo each other)
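You can verify these rules in R, and at the same time see the difference between log() (natural logs) and log10() (base-10 logs), which comes up again in a moment. The numbers here are arbitrary.

log10(4 * 25)                  # same as log10(4) + log10(25)
log10(4) + log10(25)
log10(2^10)                    # same as 10 * log10(2)
10 * log10(2)
log(100)                       # natural log (base e): about 4.605, NOT 2
log10(100)                     # base-10 log: 2
10^log10(73)                   # taking the antilog undoes the log: back to 73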

These relationships are theoretically meaningful, although I don't want to get into why. (Read Dr. Kihlstrom's website.) So how do we test for them? It's simple. (I'll show only the last three lines of the output to save space. You can see the rest on your computer screen.) (Another note. In R, log() is not base 10 logs but base e logs or natural logs. Once again, Math 133. In R, log10() is base 10 logs.)

> summary(lm(recall ~ log(RI)))   # logarithmic - take log of the IV
Residual standard error: 6.839 on 4 degrees of freedom
Multiple R-squared:  0.956,	Adjusted R-squared:  0.945 
F-statistic: 86.97 on 1 and 4 DF,  p-value: 0.0007359

> summary(lm(log(recall) ~ RI))   # exponential - take log of the DV
Residual standard error: 0.1478 on 4 degrees of freedom
Multiple R-squared:  0.9782,	Adjusted R-squared:  0.9727 
F-statistic: 179.5 on 1 and 4 DF,  p-value: 0.0001796

> summary(lm(log(recall) ~ log(RI)))   # power - take log of both
Residual standard error: 0.3155 on 4 degrees of freedom
Multiple R-squared:  0.9006,	Adjusted R-squared:  0.8758 
F-statistic: 36.25 on 1 and 4 DF,  p-value: 0.003833

It looks like the exponential model is best (although I doubt it is significantly better than the logarithmic model). This is what Peterson and Peterson claimed in their article, that the forgetting curve is exponential, and it is what makes the most theoretical sense. In some contexts, it's called exponential decay, and some psychologists have proposed exactly that: memory traces decay exponentially, like radioactive substances. But as you might have learned in cognitive psychology, trace decay theory is no longer generally accepted.
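If you'd like to see what those three fits actually look like, here's a sketch that refits each model, back-transforms the exponential and power fits from log units, and draws all three curves on the scatterplot. The lty= settings and the legend are just cosmetic choices of mine.

plot(recall ~ RI)
b = coef(lm(recall ~ log(RI)))                            # logarithmic model
curve(b[1] + b[2]*log(x), from=3, to=18, add=T, lty=2)
b = coef(lm(log(recall) ~ RI))                            # exponential model (fit on log(recall))
curve(exp(b[1] + b[2]*x), from=3, to=18, add=T, lty=1)
b = coef(lm(log(recall) ~ log(RI)))                       # power model (fit on logs of both)
curve(exp(b[1] + b[2]*log(x)), from=3, to=18, add=T, lty=3)
legend("topright", legend=c("logarithmic", "exponential", "power"), lty=c(2, 1, 3))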

In R, the log() function does natural logs, and although calculus and physics people may like those, psychologists are usually more comfortable with base-10 logs, or common logs. So let's redo the exponential model using base-10 logs. You'll see that it makes no difference in terms of R-squared or of how significant things are. The function for base-10 logs is log10().

> lm.out = lm(log10(recall) ~ RI)
> summary(lm.out)

Call:
lm(formula = log10(recall) ~ RI)

Residuals:
       1        2        3        4        5        6 
-0.04187  0.07897 -0.02325  0.01957 -0.07591  0.04249 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.145046   0.059755    35.9 3.59e-06 ***
RI          -0.068516   0.005115   -13.4  0.00018 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06419 on 4 degrees of freedom
Multiple R-squared:  0.9782,	Adjusted R-squared:  0.9727 
F-statistic: 179.5 on 1 and 4 DF,  p-value: 0.0001796

First, a minor detail. We have only six cases in our data set, so there are only six residuals, and summarizing them would be silly. Under Residuals in the regression output, R therefore just lists them all.

Second, a major detail. You have to remember in an exponential regression we are no longer predicting y. We are predicting log10(y). So what is the prediction for x = 0, i.e., for immediate recall? It is...

log10(y.hat) = 2.145046       # the intercept is the prediction for x=0
y.hat = 10^2.145046           # antilogs, which means raise 10 to that power
y.hat = 139.65% correct
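If you'd rather not do the antilog by hand, R's predict() function will give the same answer. The 10^ at the front back-transforms the prediction from log units to percent correct.

10^predict(lm.out, newdata=data.frame(RI=0))   # predicted percent correct at RI = 0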

Does that make sense? You can't get 139% correct, but what would you expect the percentage correct to be for immediate recall? I'd expect it to be pretty close to 100%, or 10^2. So I would expect the intercept in this regression to be 2. We can see in the regression output that the intercept is significantly different from 0 (p < .001), but is it significantly different from 2? Here's how you can find out.

> confint(lm.out)
                  2.5 %      97.5 %
(Intercept)  1.97914062  2.31095126
RI          -0.08271649 -0.05431612

This function gives us 95% confidence intervals around the regression coefficients.

Questions 10-12. We are 95% confident that the true value of the intercept in the population is no less than and no more than . Since this interval includes 2, we conclude that the observed intercept of 2.145 (is/is not) significantly different from 2.

13) If the intercept is at the lower limit of the above interval, what is the predicted percent correct recall for immediate recall?
%

So the regression equation is...

log10(y.hat) = 2.145 - 0.0685*x
where y.hat is the predicted value of percent correct recall and x is the retention interval in seconds.

Suppose we wanted to plot the regression curve on the scatterplot. The abline() function is no longer going to cut it. We are going to have to use curve(), and to use curve() we have to solve the regression equation for y.hat.

log10(y.hat) = 2.145 - 0.0685*x
10^log10(y.hat) = 10^(2.145 - 0.0685*x)
y.hat = 10^2.145 * 10^(-0.0685*x)
y.hat = 139.64 * 10^(-0.0685*x)

NOTE: If you can't do the math, look at the regression coefficients in the R output, and you'll see a simple way to get this equation. It's...

y.hat = 10^intercept * 10^(slope * x)
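In fact, you can let R do the arithmetic by pulling the coefficients out of the model object with coef(). This is just a sketch of an alternative to typing the numbers in by hand (remember that lm.out currently holds the log10 model).

b = coef(lm.out)                             # b[1] is the intercept, b[2] is the slope
10^b[1]                                      # predicted percent correct at RI = 0
curve(10^(b[1] + b[2]*x), from=3, to=18)     # the same curve, built from the coefficients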

Now we can plot that line on the scatterplot. Close the graphics window if it is still open so we can start fresh.

plot(recall ~ RI)
curve(139.64 * 10^(-0.0685*x), from=3, to=18, add=T)
title(main="Exponential Model")
[Figure: scatterplot of recall vs. RI with the fitted exponential curve, titled "Exponential Model"]

Not bad! The regression diagnostics suggest we may have problems though.

# close the graphics window
par(mfrow=c(2,2))
plot(lm.out, 1:4)
[Figure: regression diagnostic plots for the exponential model]

Still, the exponential model is theoretically correct, and theory trumps statistics. We shall let the model stand!

Many years ago, I attempted to replicate this experiment in an experimental psychology class I was teaching at another university. Ten students in the class did 8 trials at each of the retention intervals. Here are the results. The numbers in this table are the number of trigrams recalled correctly by each subject at each of the retention intervals.

        3 sec  6 sec  9 sec  12 sec  15 sec  18 sec
S1       8      2      2       3       3       2
S2       6      6      3       4       6       3
S3       4      6      3       2       5       6
S4       7      5      4       7       5       4
S5       5      6      3       5       4       4
S6       7      5      6       6       4       3
S7       8      6      6       6       7       5
S8       4      4      4       4       4       4
S9       6      4      4       5       1       2
S10      7      5      7       4       3       4

Follow these steps. (One way to do the data entry and the means in R is sketched just after this list.)

  • Calculate the mean at the bottom of each of those columns.
  • Convert the mean to percent correct by dividing by 8 then multiplying by 100.
  • Create two variables with c(), one containing the retention intervals and one containing the percent correct for each retention interval.
  • Repeat the above analysis.
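Here is one way the first three steps might look in R. This is only a sketch: it types the table in as a matrix by rows and uses colMeans(), but you could just as easily compute the six means by hand and enter them with c(). Double-check the numbers against the table above before you trust the result.

# each row is a subject, each column a retention interval (3, 6, 9, 12, 15, 18 sec)
trials = matrix(c(8, 2, 2, 3, 3, 2,
                  6, 6, 3, 4, 6, 3,
                  4, 6, 3, 2, 5, 6,
                  7, 5, 4, 7, 5, 4,
                  5, 6, 3, 5, 4, 4,
                  7, 5, 6, 6, 4, 3,
                  8, 6, 6, 6, 7, 5,
                  4, 4, 4, 4, 4, 4,
                  6, 4, 4, 5, 1, 2,
                  7, 5, 7, 4, 3, 4), nrow=10, byrow=TRUE)
RI = c(3, 6, 9, 12, 15, 18)                  # retention intervals
recall = colMeans(trials) / 8 * 100          # column means converted to percent correct
recall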

Questions

14) Is the best model still exponential? (Don't forget to test the linear model.)
(yes/no)

15) If not, which model is best, linear, logarithmic, or power?

16) What is the R-squared value for this model?

17) Is there an influential point in this model? If so, for which retention interval?
sec.

18) Why do you think these subjects did so much better at the longer retention intervals than Peterson and Peterson's subjects did?

Hint: Here's the way the experiment was conducted. Students paired off. One student in each pair served as the experimenter and the other as the subject. The experimenter's job was to read the trigrams and 3-digit numbers and then enforce the distractor task by making sure the subject counted backwards as rapidly as possible. The subject's job was to listen to the trigram and then count backwards as rapidly as possible until cued to recall, at which time the subject repeated (or attempted to repeat) the trigram.

Answer: The experimenters weren't very good at enforcing the distractor task, so the subjects had some opportunity to rehearse the trigram while counting.

Some More Problems

19) Unregulated population growth is said to be exponential growth. Using the dataset scpop.txt from the website, do the following problem.

# scpop.txt
# Population of South Carolina.
# Data are derived from the U.S. census. Population is in thousands. 2020 was
# left blank to see how accurately it could be predicted from the regression.
# The correct value is 5,118,425. Is the increase exponential? How good is the
# exponential model? What accounts for the blips here and there where the points
# move off the regression line?
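Here's a sketch of how you might get started. I'm assuming the file has a header row and that the columns are named year and pop; check with names() after you read it in and adjust accordingly. If the blank 2020 entry causes trouble on the read, look at the file in a text editor to see how the missing value is coded.

sc = read.table("scpop.txt", header=TRUE)          # assumes a header row; check names(sc)
plot(pop ~ year, data=sc)                          # does the growth look exponential?
lm.sc = lm(log10(pop) ~ year, data=sc)             # exponential model: log of the DV
summary(lm.sc)
10^predict(lm.sc, newdata=data.frame(year=2020))   # predicted 2020 population (in thousands)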

Answers: yes, R-sqr = 0.9888, the biggest blip is due to a little disagreement called the Civil War

20) Here's another classic that, had you solved it about 400 years ago, would have made you famous for all time.

# planets.txt
# Here is a classic problem in nonlinear regression. What is the relationship
# between a planet's distance from the sun and its orbital period? This is
# Kepler's 3rd law of planetary motion, and it took Kepler about ten years to
# find it in the early 17th century. Using R, you can find it in ten minutes.
# Which model works best: linear, logarithmic, exponential, or power? The distance
# variable is in millions of kilometers. The period variable is in Earth days.
# You will get a better version of Kepler's law if you transform the data in
# such a way that both distance and period equal 1 for Earth. (I.e., divide
# all the distances by 149.6 and all the periods by 365.26. The units will then
# be AUs and Earth years.) Planet names can be read in as row names.
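A sketch of one way to set this problem up. I'm assuming the columns are named distance and period, as the comments above describe them; check with names() after reading the file. Note the row.names=1 option, which reads the planet names in as row names.

planets = read.table("planets.txt", header=TRUE, row.names=1)   # planet names as row names
planets$distance = planets$distance / 149.6        # convert to AUs
planets$period = planets$period / 365.26           # convert to Earth years
summary(lm(period ~ distance, data=planets))                    # linear
summary(lm(period ~ log(distance), data=planets))               # logarithmic
summary(lm(log(period) ~ distance, data=planets))               # exponential
summary(lm(log(period) ~ log(distance), data=planets))          # power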

Answers: later

21) So now that you know, how could you have solved the nonlinearity problem in the Guber.txt data?

Answer: log transform the percent variable.

22) The recordtimes.txt data were slightly nonlinear. Does a log transform of one or both variables help?

Answer: I don't know. I didn't do it. Try it and see!

23) Can the nonlinearity in the wine.txt data be repaired with a log transform?

Answer: Sure can! Which one is up to you to discover.