If you understand what you're doing, you don't need a formula sheet. If you need a formula sheet, you don't understand what you're doing. You need to understand what you're doing in order to follow along. For example, if you don't know what the variance is (SS/df), then you're lost from the get-go. Memorizing SS/df is not understanding. WHY is it SS/df?
Don't be a passive reader. You're a psychology major. Why shouldn't you be a passive reader? What does cognitive theory or learning theory have to say about that? Passive readers don't learn! Get out your calculator and follow along with the calculations I'm about to do. Statistics is not a spectator sport! Fill in the boxes with your answers and then click the CHECK button. Keep trying until you get it right. Most of them are pretty simple. Round decimal answers to three decimal places. I will be here during class and lab times and office hours to answer any questions you may have. There is nothing to submit with this exercise.
Note: If while working through this lecture, your browser asks you if you want to prevent this page from creating additional dialogs, DO NOT check that box! If you do, the page will stop working. Just tell your browser to mind its own business! (Sadly, there is no check box for that.) Also, I have not fixed the width of this window, so you can make it as wide or as narrow as you want by changing the width of your browser window. Don't make it too skinny though, or you'll ruin some of the formatting. I prefer to keep the window fairly wide.
A variable is a collection of values not all of which are the same.
X = c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)
This notation will indicate that we have created a variable called X from the values contained within the parentheses. The notation c() means collect or combine these values into a variable. (Technically, it means concatenate, but we don't need to get that technical.) In this case, these are scores from subjects who were administered the digit span subtest of the Wechsler Adult Intelligence Scale (WAIS). Scores on the digit span test can range from 0 to 30. A typical score is 20. (The data were collected by Scott Keats as part of his Psyc 497 project, Fall 2001.)
n = 10 # (note) n = length(X) to be explained at a later date
N or n is the "length" of the variable. In other words, it is the number of values in the variable, or the number of subjects in the sample (usually). You get this by counting. For the time being, we will make no distinction between n and N. Later we will. So pay attention!
The following notation will indicate the sum of the values in the variable X (ΣX or "sigma X").
sum(X) = 16 + 20 + 14 + 21 + 20 + 18 + 13 + 15 + 17 + 18
The following notation will indicate that we square all the values in X before summing (ΣX²). Do NOT call this the "sum of squares" (see below). It is also not "sigma X quantity squared" (sum first then square, (ΣX)²). We will denote that as sum(X)^2. Note: The hat or caret symbol, ^, denotes raising to the power of. Here it is the sum of the X values, raised to the power of 2.
sumsq(X) = sum(X^2) # sum of the squared values ("sigma X squared")
= 16^2 + 20^2 + 14^2 + 21^2 + 20^2 + 18^2 + 13^2 + 15^2 + 17^2 + 18^2
= 256 + 400 + 196 + 441 + 400 + 324 + 169 + 225 + 289 + 324
The following notation will indicate taking the sample mean of X. The sample mean can be considered a typical value or a representative value in the variable X. The mean does not actually have to be a value found in the variable, but it has to be within the range of those values. In this example, the values in X range from 13 to 21. Therefore, the mean MUST be between those two values. (Why?) The mean is usually symbolized by M or by X-bar (X with a line over it, X̄). APA prefers M.
mean(X) = sum(X) / n # the arithmetic mean = 172 / 10
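By the way, if you have R on your computer, you can use it to check your calculator work as we go. One caution: sumsq() is our notation, not a built-in R function, so this little sketch defines it first. Nothing here is required; it's just another way to check yourself.

X <- c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)  # the digit span scores
n <- length(X)                 # number of values in the variable: 10
sum(X)                         # sigma X: 172
sumsq <- function(v) sum(v^2)  # our notation; NOT built into R
sumsq(X)                       # sigma X squared: should match your total above
sum(X)^2                       # sum first, THEN square: a very different (much bigger) number
mean(X)                        # sum(X) / n = 17.2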
The following notation will indicate that we are finding the sample median value of X.
median(X)
We do this by sorting the values into ascending order...
sort(X) 13 14 15 16 17 18 18 20 20 21
...then we find the median ("middle value") by finding a cut point that puts 50% of the values below and 50% above the cut. Here that cut point would be 17.5. Thus, median(X) = 17.5. The median is also known as the 50th percentile or the 2nd quartile. The median is sometimes symbolized by Md: Md = 17.5.
Note: If there is an odd number of values, the median will be a value in the variable; 50% of the remaining values, excluding the actual median, will be above it and 50% below it. If there is an even number of values, as in the present example, the median may, or may not, fall between two of the values in the variable. Examples: What is the median of c(1,2,3,4,5)? What is the median of c(101,102,103,208)? What is the median of c(101,102,102,208)? What is the median of c(1,1,1,1,1,813)?
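R also has a built-in median() function, so after you've worked those examples out by hand, here's a quick sketch you can use to check your answers.

median(c(1, 2, 3, 4, 5))       # odd number of values: the middle one
median(c(101, 102, 103, 208))  # even number: the cut point between the two middle values
median(c(101, 102, 102, 208))
median(c(1, 1, 1, 1, 1, 813))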
The mean and the median are "averages" of the values in a variable. Note: Many people, and spreadsheets, use the word "average" to indicate the same as "mean." So be careful when you hear someone saying "average." Make sure you know what average they are talking about.
In X, the mean and median have approximately the same values. This will not be true for all variables. Sometimes they won't even be close! Example: What are the mean and median of c(101,102,103,208)? Which of those do you think is the better "average," or in other words, the better representative of values in the variable?
In a normal distribution of values (the "bell curve"), it is always true that M = Md. A normal distribution is not the only case when M = Md, however. What are others?
dev(X) = X - mean(X) = (16-17.2), (20-17.2), (14-17.2), (21-17.2), (20-17.2), (18-17.2), (13-17.2), (15-17.2), (17-17.2), (18-17.2) = -1.2, 2.8, -3.2, 3.8, 2.8, 0.8, -4.2, -2.2, -0.2, 0.8
This operation is called zero-centering. To zero-center a variable, we subtract the mean of the variable from each of its values. When a variable is zero-centered, its sum and mean become zero. (Check it and see! Another term for this is "mean-centering," but I find that more confusing. The new values are not centered around 17.2. They're centered around zero.) The zero-centered values are sometimes called "deviations" or "deviations from the mean." They represent the distance that each value lies away from the sample mean. Clearly, negative deviations represent values that lie below the sample mean, and positive deviations represent values that lie above the sample mean.
The squared deviations are very important statistically. (I'll do them for you, but of course you're going to work them out for yourself with a calculator. Right?)
dev(X)^2 = (-1.2)^2, 2.8^2, (-3.2)^2, 3.8^2, 2.8^2, ... etc. = 1.44, 7.84, 10.24, 14.44, 7.84, 0.64, 17.64, 4.84, 0.04, 0.64
If we square each of the deviations and then add them up, we get a very important statistic called the sum of squares (short for "sum of the squared deviations"). Notice that sum of squares is NOT sumsq, but rather...
SS(X) = sum((X - mean(X))^2) = sum(dev(X)^2) = 1.44 + 7.84 + 10.24 + 14.44 + 7.84 + 0.64 + 17.64 + 4.84 + 0.04 + 0.64
That's a lot of parentheses to keep track of. You should remember what the definition of sum of squares is (sum of the squared deviations from the mean), but when you need to calculate it, remember this shortcut method. (And I mean, REMEMBER IT! If you have to look this up every time you have to calculate it, you're going to be looking it up A LOT!)

SS = ΣX² - (ΣX)² / n

In our notation that would be...
SS = sumsq - sum^2 / n # computational definition of SS
SS(X) = sumsq(X) - sum(X)^2 / n = 3024 - 172^2 / 10
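If you'd like to see that the shortcut really does give the same answer as the definition, here's a quick R sketch (base R only; sum(X^2) plays the role of sumsq(X)).

X <- c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)
n <- length(X)
SS.def  <- sum((X - mean(X))^2)     # definitional: sum of the squared deviations from the mean
SS.comp <- sum(X^2) - sum(X)^2 / n  # computational shortcut: sumsq - sum^2 / n
SS.def
SS.comp                             # the two agree; that's the point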
By the way, when you do this calculation, don't do 3024-172^2=-26560 and then -26560 / 10 = -2656.0. That's a common mistake, and it's wrong. Remember the correct order of operations when doing arithmetic. You should know immediately that you got it wrong in this case because, for one thing, the SS can never be negative. It's a sum of squared values, and squared values are always positive.
Check your calculator! Enter this calculation just as you see it here.
3024-172^2/10
Your calculator should give you the right answer (65.6), but not all of them do! Some calculators aren't smart enough to do order of operations correctly, and if yours isn't, then you need a new calculator! YOU'VE BEEN WARNED!
SS is a statistical measure of variability, one of the most important concepts in all of statistics (hence boldface). Variability occurs when values in a variable are not all the same. What would SS be if all the values were the same? Not hard to figure out! All of the values would be right at the mean, so the deviations would be 0. Squaring 0 is still 0, adding up a bunch of zeroes is still zero, so SS = 0 if there is no variability. If some of the values in the variable are different from the mean, however, then SS > 0. The more different they are from the mean, the bigger the SS becomes.
It is our job as psychologists to explain why values in a variable are not all the same, i.e., to explain why there is variability. We want to know why people score differently on IQ tests, why everyone doesn't get the same GPA, why people score differently on personality tests. Some people are introverted, some are extraverted. Why is that? Some people are happy, some are depressed. Why is that? Some people have an aptitude for mechanical tasks, some don't. Why is that? That's what we do. If everybody were exactly the same, there would be no psychology.
So it behooves us as psychologists to have good measures of variability. Before we can explain it, we need to know how much of it there is. The mean might tell us what people are typically like, or what they are like on the average, but some people are going to be a lot different than the average, and that's when things get interesting.
The SS is a measure of variability. It is the sum of the squared differences between how people really are and how they are on the average. We square the differences because X - mean(X) always adds up to zero, for any variable you might imagine. Not useful. Squared numbers are always positive, and positive things add up to something positive.
Statisticians sometimes refer to SS as a measure of "error." Some psychologists have criticized them for doing so, saying that not being average should in no way be considered an error. These psychologists don't understand statistics. For that matter, they don't even understand psychology! When subjects deviate from the mean, that is not the subject's error. It's OUR error if we don't understand why it happens. We are supposed to understand variability in human behavior, traits, characteristics, performance, etc. That's our job. If we don't, then we haven't done our job. If the mean is 17.2, and Fred's score is 20, and we don't know why Fred's score is so high, that's not Fred's fault! We're not doing our job very well!
X = c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)
mean(X) = 17.2
SS(X) = 65.6
X[2] = 20
This notation (square brackets) denotes the second value in X. Similarly, X[3] would denote the third value, X[7] the seventh value, etc. (In stat books, these so-called indexing numbers are usually given as subscripts of X, for example, X₂, but bear with me. There is method to my madness!)
In X, why is SS > 0? Not statistically why, psychologically why. Why did the second subject, X[2], get a score of 20, almost 3 points above the mean? Why isn't he or she at the mean? Why aren't they all at the mean? We might make some guesses about that, but guesses are not understanding, no matter how good they are. Right now we have no information about why there is variability in X. Thus, all of that variability within X is "unexplained variability" or "error variability." I sometimes refer to it as "noise."
To be more completely correct, the deviations from the mean are the "error." When we square them and add them up to get SS, we have a measure of "squared error." A lot of statistical techniques are dedicated to minimizing squared error and are, therefore, referred to as "least squared error" techniques, or "least squares" techniques. Examples follow.
Although the SS is used frequently in statistical calculations, as you'll find out if you don't remember from your previous course, it's not the measure of variability that you would report. If your graduate adviser asks you how variable the scores are from the research project she has you working on, you would not tell her the SS. For one thing, SS is in the wrong units. If your original measures are in IQ points, the SS is in IQ points squared, and what the heck does that mean? For another thing, SS is a sum, so it just keeps getting bigger as you test more subjects, even if those subjects are getting scores quite close to the mean. Variability is not increasing, but SS is.
Somehow, we need to correct SS for the number of subjects we have, or the number of values that we are calculating it from. We could do that the same way we correct the sum of the scores to get a useful average--divide by n.
If our scores are in IQ points, when we add them up we get the total number of IQ points accumulated by our subjects. Then we divide by n subjects, and we get IQ points per subject, a useful average. Dividing by n corrects the sum for the number of subjects we have because we always end up with a number that is "per subject." It's like gas mileage: miles divided by gallons = miles per gallon. It doesn't matter how far we've driven. Divide by the number of gallons of gas we've used and we've got miles per gallon. 100 miles on 5 gallons and 1000 miles on 50 gallons is the same MPG. Means are always comparable across groups regardless of group sizes, because the means have been "corrected" for group size. I.e., they are "per subject." If one group has a higher mean than another group, it has nothing to do with the number of subjects in the group, so look for another explanation!
An Aside on Percentages
A similar idea would be percentage. "Percent" literally means per 100. Suppose we have the following variable.
gender = c(male, male, female, male, female, female, female, male, male, male)
This variable is a little different from the one we've been dealing with in that it has no numbers in it. A variable that has numbers in it is called a "numeric variable." The gender variable is called a "categorical variable" because it doesn't measure people on some sort of scale, it just places them into a set of categories. We can't really calculate the mean or median of a categorical variable. Or can we?
In the special case, and in this case ONLY, where there are only two categories, we can "dummy code" the categories with zeros and ones. It's entirely arbitrary which category is assigned which code. The result of dummy coding our gender variable might look like this.
gender = c(0, 0, 1, 0, 1, 1, 1, 0, 0, 0)
Now we have numbers. That doesn't make it a numeric variable, it's still a categorical variable, but we can do some arithmetic on those dummy codes. (Once again, we can do this ONLY when there are two categories!) For example...
n = 10 # we can get that from either version of the variable
sum(gender) = 0 + 0 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + 0 = 4
mean(gender) = 4 / 10
Is that mean at all useful? It is! In fact, it's a proportion. You may remember the definition of a proportion from your previous stat class.
p = f / N
f is the frequency of a certain category of subjects, and N is the number of subjects. Thus, if we counted up the number of females in the gender variable, we would find f = 4. And if we determined the length of the entire variable, we would find N = 10. Therefore, the proportion of females in the gender variable is 4 / 10 = 0.4, which is exactly the math that was done by sum() and mean() after the variable was dummy coded. The mean of a dummy coded variable is the proportion of people in the variable who were coded 1.
A proportion times 100 is a percentage. Thus, 0.4 * 100 = 40% of the subjects in the gender variable are female. If we had a much larger group, say 1000 people instead of 10, we would undoubtedly have many more female subjects in the group, let's say 400. We could compare the composition of the two groups by calculating proportions or percentages. In the larger group, 400 / 1000 = 0.4, and 0.4 * 100 = 40%, so the composition of the two groups is the same, even though one of the groups is much larger than the other.
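Here's roughly what the dummy-coding arithmetic looks like in R. (A sketch only; ifelse() is base R, the "male"/"female" labels are just for illustration, and which category gets the 1 is, as I said, entirely arbitrary.)

gender <- c("male", "male", "female", "male", "female",
            "female", "female", "male", "male", "male")
dummy <- ifelse(gender == "female", 1, 0)  # females coded 1, males coded 0
sum(dummy)         # f, the frequency of females: 4
mean(dummy)        # the proportion of females: 0.4
mean(dummy) * 100  # as a percentage: 40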
We can also calculate percentages from numeric variables.
Spoiler alert! All of the subjects we've looked at so far from Scott Keats's 497 project were marijuana smokers (or claimed to be regular users of marijuana). Suppose we run a second group of subjects, cigarette (but not marijuana) smokers let's say. To save time, we give them only twenty of the items from the digit span test. They get a mean of 13.1 items correct. Did they do better or worse than our marijuana smokers? 17.2 out of a possible 30 is (figure it out!) 57.3% of items correct. 13.1 out of a possible 20 is 65.5% of items correct. Looks like the cigarette smokers did a bit better than the marijuana smokers. (Or did they?)
A good bit of statistics is devoted to this sort of thing; i.e., finding ways to compare groups that are not the same size or were given tests that were not the same length. And we not only want to compare their average performance, we also want to compare variability. So let's get back to that.
Aside over.
Back to Variability
If we divide the sum of the squared deviations by n, we get "squared deviation per subject," or the average of the squared deviations. This number is called the variance. The variance is the mean squared deviation.
Var(X) = SS(X) / n
However, that's not quite the way the variance is usually calculated. When calculated that way from sample data, it usually underestimates the "true variance," which is to say the variance in the population. It turns out, and can be mathematically proved (but that's beyond the scope of this course), that the correct division is not by n but by n - 1. This value, n - 1, is called the "degrees of freedom" for a variable. Thus...
df = n - 1 # in the case of a single group of values
...and then the definition of sample variance (remember this!) is...
var = SS / df # our definition of variance
var(X) = SS(X) / df = SS(X) / (n - 1) = 65.6 / 9
...for the variable we've been working with. This is a more useful measure of variability in a sample, and one we'll use frequently, but it's not the best one. It does not keep increasing as we add more subjects (it is "per subject"), but it's still in the wrong units, IQ points squared for example. That's easy enough to fix. Just take the square root (sqrt) of it. That puts this value in the same units of measurement as the original scores are. (The units for X are number of digit span items correct, and what would the square of that mean?)
sd = sqrt(var) = sqrt(SS / df) # the definition of standard deviation
sd(X) = sqrt(var(X)) = sqrt(SS(X) / (n - 1)) = sqrt(65.6 / 9)
Note. I calculated this as sqrt(65.6/9) and not sqrt(7.289) because 7.289 was a rounded value. Taking the square root of it would accumulate even more rounding error. The safest way to calculate is from unrounded values, hence sqrt(65.6/9). Further note. It wouldn't have mattered in this case. You would still have gotten the right answer after rounding to three decimal places. That won't always be the case, however.
This number is called the standard deviation (the sample standard deviation to be entirely correct), and it can be thought of as a typical or average deviation from the mean. This is the value you would report if you were also reporting the mean. Thus, our group of n = 10 subjects has a mean digit span score of M = 17.2 items correct and a standard deviation of SD = 2.700 items correct. That's called a "three-number summary" (n, M, sd), and it's a pretty good summary of a variable under some circumstances, but not under others. We'll discuss those circumstances later.
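A handy fact for checking your work: R's built-in var() and sd() functions divide by n - 1, just like our definitions do. A quick sketch:

X <- c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)
n  <- length(X)
SS <- sum((X - mean(X))^2)  # 65.6
SS / (n - 1)                # sample variance by hand
var(X)                      # R's var() divides by n - 1, so it agrees
sqrt(SS / (n - 1))          # sample standard deviation by hand: 2.700 (rounded)
sd(X)                       # R's sd() agrees too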
Notice that I have used an upper case V in Var when we divide by n, which we will almost never do, and a lower case v in var when we divide by df, which we will almost always do. The latter definition is the one you should remember. We are almost always dealing with samples, and so we want the n-1 version of the standard deviation, the sample standard deviation. If you're unsure whether to divide by n or n-1, divide by n-1 and you'll almost always be right.
When we calculate deviations, why do we calculate them from the mean? Why don't we use the median? Or some other number? Let's try using the median and see what happens.
X - median(X) = -1.5, 2.5, -3.5, 3.5, 2.5, 0.5, -4.5, -2.5, -0.5, 0.5
sum(X - median(X)) =
Notice that the deviations around the median do not sum to zero. They usually won't. In fact, the only time they will is when Md = M.
(X - median(X))^2 = 2.25, 6.25, 12.25, 12.25, 6.25, 0.25, 20.25, 6.25, 0.25, 0.25 # squared deviations from the median
sum((X - median(X))^2) = # sum of squared deviations from the median
Notice also that the sum of the squared deviations around the median is greater than the sum of squares (around the mean) that we calculated above. That will also always be true, unless Md = M. In fact, choose any number you want other than the mean to use in calculating the deviations, and the sum of the squared deviations around that number will always be larger than SS. Try a few.
The SS cannot be smaller than it is when calculated using the mean. Thus, the SS is a "least squares" statistic. So are the variance and standard deviation, which are calculated from the SS. When we use the SS in our statistical calculations, we are using a "least squares technique."
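Don't take my word for the least-squares property. Here's a little R experiment (a sketch; sapply() is base R): calculate the sum of squared deviations around a whole range of candidate centers and see which candidate wins.

X <- c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)
ssq.around <- function(center) sum((X - center)^2)  # squared deviations around any number you like
candidates <- seq(13, 21, by = 0.1)                 # a grid of candidate centers
results <- sapply(candidates, ssq.around)
candidates[which.min(results)]  # the winner is the mean, 17.2
ssq.around(mean(X))             # the SS, 65.6
ssq.around(median(X))           # larger than the SS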
Another way to summarize the variable X would be to sort the values into ascending order, and then stack them up if there are equal or similar values, and leave gaps if there are values missing, like this.
               18 20
13 14 15 16 17 18 20 21
This is called a histogram. Histograms are more useful when there are a lot of values in the variable. Here it's a little silly, isn't it? Well, maybe not.
All of the scores in X came from subjects who were marijuana smokers. The person who collected these data (Scott) also collected ten scores from people who were not marijuana smokers. Let's put those into a variable called Y.
Y = c(18, 22, 21, 17, 20, 17, 23, 20, 22, 21)
It appears that these scores are somewhat higher, on the average, than the scores from the smokers. We can demonstrate this visually easily enough by drawing two histograms.
               18 20
13 14 15 16 17 18 20 21   marijuana smokers, M = 17.2
---------------------------------
17    20 21 22
17 18 20 21 22 23         marijuana nonsmokers, M = ?
It appears that a typical score from the nonsmokers would be higher than a typical score from the smokers. We can confirm this by calculating the means. We already know the mean for the smokers is 17.2.
sum(Y) =
mean(Y) = sum(Y) / n
It also looks like the scores from the nonsmokers might be somewhat less variable (spread out) than the scores from the smokers. We know how to do that. Let's get calculating! (Do it! Believe me, you need the practice!!)
sumsq(Y) =
SS(Y) = sumsq(Y) - sum(Y)^2 / n
var(Y) = SS(Y) / (n - 1)
sd(Y) = sqrt(var(Y))
So as we suspected from a visual inspection of the histograms, a typical deviation from the mean for the nonsmokers is a little smaller than that for the smokers. Why do you think that might be?
If I may step out of the path of our speeding statistical train for a moment, there's an interesting point to be made here. We psychologists are often fascinated by differences between means, but we consider differences between standard deviations to be uninteresting noise in the data. That's a mistake. So I'll repeat the question. Why do you think the variability of the smokers' scores is higher than that of the nonsmokers' scores? Think about it! I'll be asking you again.
Okay, everybody back in front of the train! Rule number 1 of statistics: samples are noisy. Are we really so interested in these 10 marijuana smokers and these 10 nonsmokers? I'm not! What I'm interested in is what's true of people in general. In our samples, the marijuana smokers are scoring lower, on the average, than the nonsmokers are. Is that true in the general case? (Statisticians call the general case the population.) Suppose we went out and found 10 more smokers and 10 more nonsmokers and we tested them. Would we see this same difference in the group means? Or is it, pardon the metaphor, just so much smoke?
The difference between the two group means is almost 3 points (2.9 items correct to be more accurate). There are three reasons why this difference might exist. (This is important! Memorize these three reasons!) 1) The difference is real. That is, it reflects a difference that actually exists in the general case (in the populations). 2) The difference is due to a confound, some way other than marijuana smoking in which the groups differ. 3) The difference is nothing more than random noise produced by the sampling process.
If you want a simple demonstration of this, take a new coin out of your pocket and toss it 100 times. Record the number of times it lands heads side up and the number of times it lands tails side up. In theory (i.e., in the general case), it should be 50 heads and 50 tails. If it is, I'll buy you lunch! (On second thought, no I won't, because if 1000 people do this, 80 of them should get 50:50, and I'm not buying lunch for 80 people.)
Tossing the coin 100 times is sampling all possible tosses of the coin. If the coin is balanced, and it probably very nearly is, then it should be equally likely to land heads as tails. Even if it is exactly balanced, however, you won't get 50 heads and 50 tails. (Well, 80 out of 1000 of you will.) Tossing a coin, like sampling, is a random process, and random processes generate random noise. Count on it. You can't stop it. So the differences in group means might be nothing more than random noise.
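If you don't feel like tossing a real coin 100 times, R will simulate it for you. A quick sketch (the seed is arbitrary; change it and you'll get different counts, which is exactly the point):

set.seed(1)                         # arbitrary starting point for the random numbers
rbinom(10, size = 100, prob = 0.5)  # ten "experiments" of 100 fair tosses each; the head counts bounce around 50
dbinom(50, size = 100, prob = 0.5)  # probability of exactly 50 heads: about 0.08, i.e., about 80 in 1000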
How can we ever tell? Statistics!
Let's make this simple and look at just one of our groups, the smokers. Their mean digit span score was 17.2. I said a typical score on the digit span test is 20, by which I meant that 20 is the mean score in the general case or in the population. So why didn't the smokers get a mean of 20? Same three reasons.
That random noise thing is really annoying. How can we rule that out? One thing we could try is to do the study again with another 10 marijuana smokers. If we see the same thing, that's suspicious. So we do it again. And then we do it again. And then we do it again. And then we do it again. And then we do it again. If we keep seeing the same result, if the result replicates as we say in the biz, then we can have confidence that this is not just noise. One thing about noise is it's noisy (random). It's not going to be the same every time. If we get the same (or similar) result every time, that's not noise.
Who has that kind of time or money? Experiments are expensive and time consuming! A granting agency is not going to give you money to do the same experiment over and over. (Which is not to say they shouldn't. They just won't.)
If the result replicates, then we say the result is reliable, but we don't have the time or money to do all these attempts at replication. So we fall back and look instead for statistical reliability. Time to calculate again.
sem = sd / sqrt(n) # definition of standard error of the mean
sem(X) = sd(X) / sqrt(n) = 2.700 / sqrt(10)
This is called the standard error of the mean. If the sampling was done well, the sem (don't say "sem," say "standard error of the mean" or "ess eee emm") is a measure of how much random error is influencing the sample mean. If the sampling was done well, then the sample mean is probably within one sem of the true (population) mean. It is very likely to be within two sems, and almost surely within 3 sems.
Let's call the hypothesized true mean mu (say "mew," not "moo"). Then...
t = (mean(X) - mu) / sem(X) = (17.2 - 20) / 0.854
Our sample mean is more than three standard errors below the hypothesized true mean. That's very unlikely to be random noise. It's not impossible that it's random noise. We can never entirely rule out random error. It's just very unlikely, and therefore we conclude that it isn't. (CAUTION: This method works best with larger samples. Ten isn't a large sample, but 3 standard errors is a lot, so I think we're safe in our conclusions.)
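Here's that arithmetic in R, just to confirm the numbers (a quick sketch; remember that sd() already divides by n - 1):

X  <- c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)
mu <- 20                        # the hypothesized true (population) mean
sem <- sd(X) / sqrt(length(X))  # standard error of the mean: about 0.854
t   <- (mean(X) - mu) / sem     # about -3.28, more than 3 standard errors below mu
sem
t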
Another way of saying this is to say it appears our sample of marijuana smokers was drawn from a population of marijuana smokers that does not have a mean of 20. It "appears" to be? Why so wishy-washy? Get used to it! Statistics do not reveal the truth. Statistics, when done properly, reveal only what is likely to be true. To do statistics, and to do science in general, you're going to have to get used to being uncertain. At least statistics allows us to quantify our uncertainty.
Every statistic (something calculated from a sample) has a standard error. The only one you need to know how to calculate is the standard error of the mean. The difference between a sample statistic and what we think the true value might be in the population (the population parameter) in units of standard errors is called t.
t = (statistic - parameter) / standard error
Note: In fact, we are using the estimated standard error, because we calculated it from the sample standard deviation and not the population value, which we don't know. Minor technical point which we will gloss over.
You may (certainly should!) remember that what we're getting around to doing here is called a "single-sample t-test." The single-sample t-test is one of the simplest examples of a statistical significance test or hypothesis test. Here are the formal steps in such a procedure.
1) State a hypothesis. If you're only going to state one, it should be a null hypothesis. Hypotheses are always statements about the general case or the population, never about the sample. The null hypothesis states an exact value for a population parameter.
H0: mu = 20
If we want, we can also formally state the alternative hypothesis. What is the alternative to mu = 20?
H1: mu is not equal to 20
2) At the end of all this statistical mumbo jumbo, we are going to make a decision about the null hypothesis. We're either going to reject it, or we're going to fail to reject it. We need to state some sort of decision criterion right now! No fair waiting until after the significance test has been done and then deciding on a decision criterion. There is a fancy statistical term for that. It's called cheating!
We usually establish a criterion by setting what's called an alpha level. This is a controversial idea, and we'll discuss that controversy in due time. In the meantime, you may (certainly should!) remember that we usually set alpha = .05. What does that mean? Later!
Now we can get a critical value for t. We need to know the degrees of freedom to do that, and we already know df = n - 1 for a single group, so df = 9. Now we go to a table. I'll just give you the right answer, this time.
t.crit = 2.26
For now, just remember that critical values of t are usually somewhere between 2 and 3, if we have any kind of reasonable sample size. Thus, we need to find a result in our sample that is 2 to 3 standard errors away from what the null hypothesis states before we can start wondering if we've found anything important.
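By the way, you don't really need a table. R's qt() function will hand you the critical value; the 1 - alpha/2 is there because a two-tailed test puts alpha/2 in each tail. A quick sketch:

alpha <- 0.05
qt(1 - alpha/2, df = 9)   # critical t for df = 9: about 2.26
qt(1 - alpha/2, df = 18)  # the one we'll need later for two groups of 10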
When we calculate a t value from our data, it has to be more extreme than the critical value, either positive or negative, in order to reject the null.
3) NOW we go out and do the experiment and collect the data. In other words, we have planned our statistical analysis IN ADVANCE. Once we have the data, we can calculate a t value, which we have already done. (We kind of cheated!:)
t.calc = -3.279
The t.crit tells us that our sample mean has to be at least, if not more than, 2.26 standard errors away from mu in order to reject H0. In fact, we're 3.28 standard errors away from (below) mu.
4) Make a decision. Since t.calc is a more extreme value than t.crit, we decide to reject the null hypothesis. That doesn't mean the null hypothesis is false! These statistical significance tests never PROVE anything. Rather, our result is not consistent with the null hypothesis according to the decision criterion that we have established. Let me ask you this. Would it have been possible to establish a decision criterion that would have made this result consistent with the null hypothesis? If so, how? (Answer: yes, just set alpha to a smaller value. In this case, it would have to be a lot smaller, but alpha = .005 would do it. If we had set alpha = .005, we would be failing to reject H0 right now.)
5) State a conclusion in simple English (or whatever your native language is, and this is surely not statistics). We conclude that the mean digit span score for marijuana smokers is not 20. We might even say at this point that it is less than 20, but that would be playing a little fast and loose with the procedure.
Notice that our conclusion is a statement about the general case (population). We can also state it in reference to the sample, in which case we would say that the sample mean score on the digit span test was significantly different from (or significantly below) 20. When we use the term "significance" or "significant difference," we are always talking about the sample. What does it mean?
First of all, use the word correctly! Data are not significant. Results are not significant. Experiments are not significant. Differences are significant, and those differences are in the sample. When we say a difference is significant, we don't mean that it is important, or useful, or even that it's correct! What we're saying is that the difference we saw is PROBABLY not due to chance. (And the word "probably" is crucial. We can never entirely rule out random chance.) We now feel reasonably confident that the difference between the sample mean of 17.2 and 20 is either real or due to a confound. It is probably not random error, provided we've done our sampling correctly. Have we? (Answer: Seriously?)
You may recall (certainly should recall) that there is such a thing as two-sample t tests, or t tests for independent samples. We have two independent samples in this case, smokers and nonsmokers, and by independent I mean the subjects in the two groups are not paired up or matched in any way. (How might we have paired them up or matched them?)
In the two-sample t test, the sample statistic we test is the difference in the two group means: M1 - M2. It's entirely arbitrary (for the time being) which mean we call M1 and which M2, so let's call the larger one M1. That way we'll get a positive number out of M1 - M2. (I will not call them MX and MY because we may not always call our variables X and Y).
M1 - M2 = 20.1 - 17.2 = 2.9
The population parameter that we test this against is the difference between the population means, mu1 - mu2. We almost always set this to zero when we state our null hypothesis. That is, our null hypothesis says there is no difference between these two conditions, or that marijuana smoking has no effect on digit span scores.
H0: mu1 - mu2 = 0
Thus, the numerator of our two-sample t test reduces to M1 - M2.
(M1 - M2) - (mu1 - mu2) = M1 - M2
when mu1 - mu2 = 0. In the denominator of the two-sample test, therefore, we need the standard error of M1 - M2 as a measure of how much that difference between the sample means is influenced by random error.
se of mean difference = standard deviation * sqrt(1/n1 + 1/n2)
But hold on a minute! Which standard deviation? We have two of them and they are different. There are two strategies for dealing with this. We're going to use one called pooling. In statistics, when you hear about something being pooled, think "averaged." We're going to calculate an average SD for the two groups. But it's not going to be as easy as you might be hoping. It is NOT going to be (SD1 + SD2) / 2.
To pool the standard deviations, we begin by getting a pooled variance. Here is the general definition of pooled variance, good for any number of groups.
var.pooled = sum(SSes) / sum(dfs)
Calculate a SS for each group and add those up. Calculate degrees of freedom for each group and add those up. OR remember this general rule for degrees of freedom.
df.error = N - k
In this case, N is the total number of subjects we have in all our groups and k is the number of groups. So here is your mantra: "error degrees of freedom equals total number of subjects minus number of groups they are divided into." This will work for us for some time to come, no matter how many groups there are. Thus...
var.pooled = sum(SSes) / (N - k) = (40.9 + 65.6) / (20 - 2) = 106.5 / 18
The pooled variance will always be somewhere between the two group variances, or within the range of multiple group variances in the case of more than two groups. Why should that be?
Now, the pooled SD is exactly what you might think it is.
sd.pooled = sqrt(var.pooled) = sqrt(5.917)
This is the number we use to calculate the standard error of the mean differences. Because we have pooled the variances to get it, the t test we are doing is called a pooled-variance t test. There is an alternative in which the variances are not pooled, but we'll have to discuss that at another time.
se of mean difference = sd.pooled * sqrt(1/n1 + 1/n2) = 2.432 * sqrt(1/10 + 1/10)
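Here's the whole pooling chain in R so you can check your arithmetic (a sketch, using the X and Y variables we've already defined):

X <- c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)  # smokers
Y <- c(18, 22, 21, 17, 20, 17, 23, 20, 22, 21)  # nonsmokers
SS.X <- sum((X - mean(X))^2)                # 65.6
SS.Y <- sum((Y - mean(Y))^2)                # 40.9
n1 <- length(X); n2 <- length(Y)
var.pooled <- (SS.X + SS.Y) / (n1 + n2 - 2) # sum of SSes over sum of dfs (N - k = 18)
sd.pooled  <- sqrt(var.pooled)              # about 2.432
se.diff    <- sd.pooled * sqrt(1/n1 + 1/n2) # standard error of the mean difference, about 1.088
se.diff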
But hold on a minute! Aren't we getting a little ahead of ourselves? We seem to have left out a couple steps. In particular, we have failed to establish a decision criterion. IN THIS COURSE, if we fail to explicitly state a decision criterion for a significance test, it will ALWAYS BE alpha=.05. Degrees of freedom for this test is the error degrees of freedom, or N - k = 18. Now we can go to a table... Oh, the heck with tables! I'll just give you the answer.
t.crit = 2.101
Our t.calc is going to have to be more extreme than that, either positive or negative, to allow us to reject H0: mu1 - mu2 = 0.
t.calc = 2.9 / 1.088
Since t.calc > t.crit, we reject H0. Now what does all that statistical mumbo jumbo mean? What is our conclusion?
We conclude that the marijuana smokers and nonsmokers were sampled from populations that have different means. Or, we conclude that the sample means from the smokers and nonsmokers were significantly different. We DO NOT conclude that we reject the null hypothesis. We've already done that.
The fact that the sample means were significantly different does not mean that this was a large effect, or even an effect large enough that anyone should care about it. Statistical significance does NOT tell us about effect size. Very large effects can be nonsignificant (never say insignificant or unsignificant!), and very small effects can be significant. We need a statistic that quantifies effect size.
In the case of two groups of numeric values, the usual measure of effect size is Cohen's d.
Cohens.d = (M1 - M2) / sd.pooled = 2.9 / 2.432
Cohen said his measure of effect size was only an approximate indication of how big the effect is, so there is no point in carrying a lot of decimal places. How do we interpret it?
The statistic d is a measure of how big the difference between the means is in units of standard deviations. Here the difference between the means is 1.2 SDs. That's big! It's reasonable to say that when Cohen's d is in the vicinity of 1, that's a large effect. (Say large, not big.) If Cohen's d is in the vicinity of 0.5, that would be considered an effect of moderate size. If Cohen's d is in the vicinity of 0.25, that's a small effect.
In your last class, you may have been given ranges for these effect sizes. E.g., if d is between 0.25 and 0.5, that's a small effect. Cohen would be spinning in his grave if he knew such a thing was being done! Cohen said his effect size statistic gives only a rough idea of how large the effect is. You're going to have to use your best judgment, because defining rigid ranges for effect sizes is pure silliness. Or at least Cohen would have thought so!
You may have noticed something about the formula for Cohen's d. It is part of the formula for the two-sample t test. Doing a little algebra will reveal...
t = d * sqrt((n1 * n2) / (n1 + n2))
That is, the value of the test statistic t is dependent upon two things: the effect size and the sample sizes. The larger these are, the larger t will be. The lesson is clear. If you want to find small effects to be significant, use a lot of subjects!
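A quick numerical check of that relationship, using numbers we already have (a sketch; the pooled SD is sqrt(106.5/18) from above, kept unrounded as far as practical):

M1 <- 20.1; M2 <- 17.2
sd.pooled <- sqrt(106.5 / 18)
d <- (M1 - M2) / sd.pooled       # Cohen's d, about 1.19
n1 <- 10; n2 <- 10
d * sqrt((n1 * n2) / (n1 + n2))  # about 2.67, which matches our calculated t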
So it appears that marijuana smokers score lower on the digit span test than nonsmokers do, and that this difference is large. Is that right? Not exactly. Here's something you should always remember when thinking about statistical results. Distributions overlap (almost always). Look at the histograms again.
               18 20
13 14 15 16 17 18 20 21   marijuana smokers, M = 17.2
---------------------------------
17    20 21 22
17 18 20 21 22 23         marijuana nonsmokers, M = 20.1
Now tell me again that smokers score lower than nonsmokers do. It's not always true, is it? Some of the smokers did better than some of the nonsmokers. The two groups scored differently on the average, but it is definitely not true that a marijuana smoker will always score lower than a nonsmoker. Yet this is the way people, especially nontechnical people or lay people, tend to think. You are no longer a lay person, so be careful about the way you conceptualize these results. It is very important to realize that some marijuana smokers did well on the test, and some nonsmokers not so well. Why?
We know nothing about anything going on within those groups. We might make some very reasonable guesses as to why some smokers did well and some did quite poorly, but guessing is not knowing. Therefore, any variability inside the groups is unexplained variability. We measured that variability by calculating SS within the groups. If we add those SSes, we have a measure of how much unexplained variability we have across all of the groups.
unexplained variability = sum(SSes) = SS1 + SS2 + ...
Hopefully, you saw immediately that we used that number in calculating the pooled variance, and then we used the pooled variance to calculate the standard error for the difference between the means. Unexplained variability is error variability. In this case, unexplained or error variability is...
SS(X) + SS(Y) = 65.6 + 40.9 =
106.5 what, you may be tempted to ask. Let's not go there! 106.5 is all you need to know. Let's pretend for a moment that we don't know nuttin' about no smokers or nonsmokers. All we have is 20 scores on the digit span test.
c(X, Y) # combine X and Y into one variable
16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21
ALL = c(X, Y) # call the new variable ALL
sum(ALL) = 373 # how can you get this without adding them all over again?
sumsq(ALL) = 7105 # this too?
n(ALL) = 20 # right?
SS(ALL) =
We might reasonably call this the total variability in all of our scores considered as one big group and, okay, let's call it that.
total variability = SS(all scores combined into one big group) = 148.55 # in this example
If we didn't know anything more about those 20 subjects, then all of that variability would be unexplained. But we do know something about them, don't we? The first 10 are marijuana smokers and the last 10 are not. So we like to think that part of that total variability is explained. Part of the total variability can be explained by the fact that some of these people smoke marijuana and some do not.
That is not quite the same thing as saying BECAUSE some are marijuana smokers and some are not, so be careful about that. We can't really say what's causing the difference between these two groups because of the potential for a confound existing between the groups. Thus, as statisticians, we are using the word "explained" in a somewhat peculiar way. We are not in a position to talk about cause and effect here. We are merely saying that a certain part of the total variability is not due to random error within the groups.
When we break the scores out into groups and calculate SS(X) + SS(Y), we discover it comes to 106.5, which is less than the total variability of 148.55. What happened to the rest of it? The rest of it is due to the difference in the means between the groups. In other words, if it isn't unexplained, then it must be explained (i.e., not due to random error within the groups). Thus...
explained variability = total variability - unexplained variability = 148.55 - 106.5 = 42.05
We often express this as a proportion of variability explained, or PVE.
PVE = explained variability / total variability = 42.05 / 148.55
This can also be expressed as a percentage: 28.3% of the total variability is "explained" by the difference in the means between the groups, which we would like to assume is an effect of the independent variable (the variable that creates the groups). That assumption is unjustified, so I'll not say we ARE assuming that, just that we would LIKE to.
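Here's the whole partition of variability in R (a sketch; ALL is just X and Y combined, as above):

X <- c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)
Y <- c(18, 22, 21, 17, 20, 17, 23, 20, 22, 21)
ALL <- c(X, Y)
SS.total <- sum((ALL - mean(ALL))^2)                      # total variability: 148.55
SS.unexp <- sum((X - mean(X))^2) + sum((Y - mean(Y))^2)   # unexplained (within groups): 106.5
SS.exp   <- SS.total - SS.unexp                           # explained (between groups): 42.05
SS.exp / SS.total                                         # PVE: about 0.283, or 28.3%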
Here's another way to calculate explained variability that may give a little more insight into what it is. I said the group means are representative scores from the groups, so let's let them do that job. If we choose a person at random from one of the groups and had to guess her score, the best we could do without further information would be to guess the group mean. If we had to do that for all the people in the group, guessing the mean would minimize our squared guessing error. Let's go a step further and just substitute the group mean for every score in the group. Then the 20 scores in ALL would look like this.
17.2 17.2 17.2 17.2 17.2 17.2 17.2 17.2 17.2 17.2 20.1 20.1 20.1 20.1 20.1 20.1 20.1 20.1 20.1 20.1
Notice this has eliminated all within group (unexplained) variability but has maintained the between group (explained) variability. If we were to calculate the SS of those 20 numbers, what do you think we would get? If you didn't say explained variability, you're not thinking about it. Explained variability is all that's left.
sum(ALL) = 373 # we already had this (confirm by 10 * 17.2 + 10 * 20.1)
373 / 20 = 18.65 # let's call this the grand mean (GM)
Now instead of doing (17.2 - 18.65)^2 ten times, we can do it once and then multiply by 10. Similarly, we will also do (20.1 - 18.65)^2 and multiply that by 10. Add those two results together and what do we have?
10 * (17.2 - 18.65)^2 + 10 * (20.1 - 18.65)^2 =
Thus, explained variability can be calculated as sum(n * (M - GM)^2) where we do the n * (M - GM)^2 for each group and then add up the results. You may remember that formula from your first stats class. It was part of a procedure called analysis of variance (ANOVA). You do NOT need to memorize it now.
sum(n * (M - GM)^2)
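And here's the same explained variability calculated directly from the group means, sum(n * (M - GM)^2), as a quick R sketch:

X <- c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)
Y <- c(18, 22, 21, 17, 20, 17, 23, 20, 22, 21)
GM <- mean(c(X, Y))                            # the grand mean, 18.65
10 * (mean(X) - GM)^2 + 10 * (mean(Y) - GM)^2  # 42.05, the explained variability again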
Remember this! A probability is always a number between zero and one. If I ask you for a probability, and you give me a number that is not between 0 and 1, it's not only wrong, it counts off double!
Probability expresses the likelihood that an event will occur on a scale of 0 to 1, where 0 means no chance, and 1 means the event is certain. Probability is sometimes expressed as percent chance,
percent chance = (100 * p)%
We won't be dealing much with probabilities in this course, except in one very important application: p-values. When you run a hypothesis test using software, the software is not going to give you a critical value of the test statistic. It's going to give you a p-value. Here is an example using R to do the two-sample t test we did above.
> t.test(Y, X, var.eq=T)

        Two Sample t-test

data:  Y and X
t = 2.6659, df = 18, p-value = 0.01575
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6145934 5.1854066
sample estimates:
mean of x mean of y
     20.1      17.2
What do we have here? First, a technical detail. As I mentioned above, the t test can be done by pooling the variances or without pooling the variances. In R, the default is not to pool. If we want the pooled variance t test, we have to tell it so by using the option var.eq=T. So that's what that means.
The groups Y and X have already been defined. I entered them so that the group with the largest mean was entered first. This is entirely arbitrary, and the only thing that would be changed had I entered X first is the sign of t.
The t test result is: t = 2.6659, df = 18, p-value = 0.01575. Notice it's very similar to the result we got above, which is reassuring. (We got t = 2.665 because we had a bit of rounding error in our calculations.) Degrees of freedom for the test is 18, which you should realize is N - k. Then there's something called a p-value. What's that about?
The p-value is not a simple thing. If someone gives you a simple definition of p, it's wrong! For example, p is NOT the probability that the null hypothesis is true, although that is a definition which many poorly informed psychologists will often give. The p-value expresses the probability, if the experiment were repeated in exactly the same way but with different samples, that we would get a result as extreme or more extreme than the present one, IF the null hypothesis is true. (Yes, you need to remember that.)
The alpha level that we have set expresses the maximum value of p that we are willing to accept and still reject H0. In other words, if p < alpha, H0 is rejected. If p > alpha, H0 is not rejected. Another way to think of it is: if p < alpha, then the result is not consistent with H0 according to the decision criterion we have established.
Note. What if p = alpha? Don't worry about it. It won't happen. But if it does, H0 would be rejected.
Memorize this right now! Small p-values are good. If the p-value is LESS than alpha, we reject the null hypothesis. DON'T be one of those students who gets this wrong all semester!! And some of you will.
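If you're curious where the software gets that p-value, it comes from the calculated t and the degrees of freedom. Here's a sketch using R's pt() function (the factor of 2 is there because the test is two-tailed):

t.calc <- 2.6659               # the two-sample t from the R output above
df <- 18
p <- 2 * pt(-abs(t.calc), df)  # two-tailed p-value: about 0.016
p
alpha <- 0.05
p < alpha                      # TRUE, so H0 is rejected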
The variance of the scores in the smoking group, X, was 7.289. The variance of the scores in the nonsmoking group, Y, was 4.544. Why were they different? I told you I'd ask you again, because it's an important question. I didn't tell you I'd give you the answer! But it's something you have to think about if you want to do this right.
Psychologists often don't do it right, which is why many statisticians are of the opinion that psychologists should not be allowed to practice statistics (or at least should not be allowed to teach it). This is ironic considering that many of the statistical techniques being used every day by psychologists and statisticians alike were developed by psychologists! Are we psychologists getting dumber? Apparently! I say it's time we reverse that trend.
Here's the most important part of that question. Is the difference between the variances real? That is, does the difference between the variances of these two samples reflect a difference that exists in the general case or populations? Or is it just random noise? If it is a real difference, then we've done the wrong thing by using a pooled-variance t test.
The pooled-variance t test is one of the simpler of statistical hypothesis tests, but that does not mean it's simple! To fully understand the theory behind the pooled-variance t test would require a knowledge of advanced calculus. So we won't go there. However, in developing the pooled-variance t test, the people who do understand advanced calculus had to make a number of assumptions in order to get the mathematics to work out correctly. We do need to know those assumptions. Here they are. 1) The samples were selected at random from the populations of interest. 2) The subjects (observations) are independent of one another. 3) The populations from which the samples were drawn are normally distributed. 4) The populations have equal variances.
Let's key in on that last one for a moment. That does not mean that the sample variances will be the same. Samples are noisy, after all. Nevertheless, they should be reasonably similar. Are they? Well, I don't know.
There is a significance test to determine if sample variances are significantly different, of course, because a lot of statisticians just sit around at home in their underwear making up new significance tests. There are significance tests for everything! I've done this one, and these are not significantly different. However, here's another important rule of statistics: Not significantly different does not mean not different. So is there any reason to believe these variances should be different? I believe there is. But I'm not going to tell you. Think about it.
What about the normal distribution assumption? What's a normal distribution? Our groups are too small to reliably check this assumption. So is there any reason to believe that the distributions (in the populations) might not be normal? Once again, I believe there is, especially for the smokers.
Were the subjects tested independently of one another? Or was the performance of one subject allowed to influence the performance of others? I sure hope they were tested independently! This is a matter of good experimental technique. I can't tell you that they were, but we will give the experimenter the benefit of the doubt.
Were the samples selected at random? Almost surely not! So it seems we've pretty much crapped out of these assumptions. We'll have to discuss what to do in such cases and why it may not be so important.
Before we leave assumptions, however, there is one more I'd like to address. An assumption that you often see tagged onto the above list is a dependent variable (the measured variable) measured on an interval or ratio scale of measurement. That is NOT an assumption of the t test. Scale of measurement may influence our ability to interpret the result of the test, but it is not an assumption of the test itself.
The lesson I'd like you to leave this section with is this. If you are pooling variances, if you are using a pooled variance technique, then you are making the assumption of equal variances in the populations. This assumption is often called homogeneity of variance, and we will encounter it often. Know what it is!
If you don't remember what a normal distribution is from your first stat course, go to Wikipedia and read about it. Briefly, a normal distribution is a bell-shaped, symmetrical distribution of numeric values in which Md = M and roughly 95% of the values fall within two standard deviations of the center (mean).
The three-number summary (n, M, SD) is most appropriate when used for values that are normally distributed, or nearly so. If the distribution is "goofy looking," i.e., strongly skewed, bimodal, or otherwise oddly shaped, then the three-number summary is less appropriate.
The importance of the normal distribution stems from the central limit theorem, which states that sample means will be normally distributed, or nearly so, when the samples are large enough. Thus, repeated samples from the same population will have means that are normally distributed, or nearly so, when n = 30 or more, even if the population itself is not a normal distribution. The standard deviation of the sample means will be well estimated by the standard error of the mean (sem) calculated from such a sample.
This often justifies the use of the t test, for example, if the groups are large and equal in size, even if the population that is sampled is not normal. When our treatment groups are equal in size, we are said to have a balanced design. Whether or not the design is balanced can be an important factor in determining how we do, or interpret, our statistical analysis, as you will certainly find out before too much longer!
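Don't take the central limit theorem entirely on faith; you can watch it happen. Here's a small R simulation sketch. (The exponential population is just an arbitrary example of a decidedly non-normal, strongly skewed distribution, and the seed is arbitrary too.)

set.seed(42)                                    # arbitrary
pop.sample <- function(n) rexp(n, rate = 1)     # a strongly skewed population (mean 1, SD 1)
means <- replicate(5000, mean(pop.sample(30)))  # 5000 sample means, each from a sample of n = 30
hist(means)    # roughly normal, even though the population is anything but
mean(means)    # close to the population mean of 1
sd(means)      # close to the population SD divided by sqrt(30)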
Normally distributed sample means also means the following way of estimating a population mean is justified.
95% CI approximately equals M +/- 2 * sem
That is, provided the central limit theorem is in force, we can be 95% confident that the true population mean is within two standard errors of the sample mean. This interval of plus and minus two standard errors around the sample mean is called a 95% confidence interval, or confidence estimate, for the population mean.
Thus, if we can assume the central limit theorem is in force, we can estimate the true population mean of digit span scores for marijuana smokers as follows.
mean(X) = 17.2
sd(X) = 2.699794
n = 10
sem(X) = 0.8537499
mean(X) - 2 * sem(X) = 15.49 # lower limit of 95% CI
mean(X) + 2 * sem(X) = 18.91 # upper limit of 95% CI
Compare this to the "correct" confidence interval calculated by the single-sample t test in R.
> t.test(X, mu=20)

        One Sample t-test

data:  X
t = -3.2796, df = 9, p-value = 0.009535
alternative hypothesis: true mean is not equal to 20
95 percent confidence interval:
 15.26868 19.13132
sample estimates:
mean of x
     17.2
R's confidence limits are a little farther apart because the sample size is small, just 10. In neither case, however, does the confidence interval include the value of 20. What does this imply? Answer: If we are reasonably confident that the true population mean is between 15.49 and 18.91, then we are confident that the true population mean is not 20.
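Where does the difference come from? R uses the exact t multiplier for df = 9 (about 2.26) rather than the round number 2. A quick sketch:

X <- c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18)
sem <- sd(X) / sqrt(length(X))
mean(X) + c(-2, 2) * sem                 # the approximate CI: 15.49 to 18.91
mean(X) + c(-1, 1) * qt(0.975, 9) * sem  # the exact CI: 15.27 to 19.13, matching t.test()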
X = c(16, 20, 14, 21, 20, 18, 13, 15, 17, 18) # defining a variable
n = 10 # how many values in the variable
sum(X) # add them
sumsq(X) # square them first then add the squared scores
mean(X) = sum(X) / n # sample mean
median(X) # sample median is the value in the middle of the sorted scores
dev(X) = X - mean(X) # deviation scores (zero centering)
SS(X) = sumsq(X) - sum(X)^2 / n # sum of squares
df = n - 1 # degrees of freedom for a single group of scores
df = N - k # for multiple groups of scores; notice this is k*(n-1) when the groups are equal in size
var(X) = SS(X) / df # sample variance
sd(X) = sqrt(var(X)) # sample standard deviation
sem(X) = sd(X) / sqrt(n) # standard error of the mean
t = (mean(X) - mu) / sem(X) # single-sample t
var.pooled = sum(SSes) / sum(dfs) # pooled variance
sd.pooled = sqrt(var.pooled) # pooled standard deviation
d = (M1 - M2) / sd.pooled # Cohen's d
unexplained variability = SS1 + SS2 + ... # unexplained variability
total variability # SS of all the scores considered as one big group
explained variability = total variability - unexplained variability
PVE = explained variability / total variability # proportion of variability explained (*100 = percent)
M +/- 2 * sem approximates a 95% CI
Here again are the scores from the group that were not marijuana smokers.

Y = c(18, 22, 21, 17, 20, 17, 23, 20, 22, 21)
1) Calculate a three-number summary for this variable. Add the median to this summary.
2) Test the null hypothesis that the population mean mu = 20.
3) Calculate an approximate 95% CI for the population mean.
Answers
> Y = c(18, 22, 21, 17, 20, 17, 23, 20, 22, 21) # n = 10
> mean(Y)
[1] 20.1
> sd(Y)
[1] 2.13177
> t.test(Y,mu=20)

        One Sample t-test

data:  Y
t = 0.14834, df = 9, p-value = 0.8853
alternative hypothesis: true mean is not equal to 20
95 percent confidence interval:
 18.57502 21.62498
sample estimates:
mean of x
     20.1

> sd(Y) / sqrt(10) # sem
[1] 0.6741249
> mean(Y) - 2 * 0.6741249 # lower limit of CI
[1] 18.75175
> mean(Y) + 2 * 0.6741249 # upper limit of CI
[1] 21.44825