QUICK AND DIRTY GUIDE TO R
(just about enough to cover a first stats course).
by William B. King, Ph.D.
Coastal Carolina University
© William B. King, Ph.D.
-------------------------------------------------------------------------------
Best used with R in front of you and running -- for simplicity I have done away
with certain annoying statistical formalisms. This is a plain text document.
You should be able to save it or print it with little difficulty. It should
print to about 25 pages.
-------------------------------------------------------------------------------
Getting it.
Download R from www.r-project.org (click on the CRAN link on the left side of
the page under Downloads). As of this moment, the current version is 3.2.3. It
is available for Windows, OS X, and Linux. Get the .exe file for Windows, the
.dmg file for OS X, and for Linux the best way to install is via your package
manager. In Ubuntu, for example, use Synaptic, or from a terminal...
sudo apt-get install r-base
R will not run on a Chromebook or a tablet such as an iPad unless you have an
account on a server running R Studio. If you don't know what that means, then
you don't have one!
-------------------------------------------------------------------------------
Starting R.
In Windows, double-click the desktop shortcut. In OS X, double-click the R icon
in your /Applications folder (or drag a shortcut to the Dock). In Linux, start
a terminal session, and type R and hit the Enter key. Here's what R looks like
when it starts. (This is an older version, but newer versions look the same.)
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
[R.app GUI 1.65 (6833) x86_64-apple-darwin10.8.0]
[Workspace restored from /Users/billking/.RData]
[History restored from /Users/billking/.Rapp.history]
> |
The little window in which this appears is called the R Console. The > is the
command prompt, and the vertical bar (may be a solid block, may or may not be
blinking) is the cursor.
-------------------------------------------------------------------------------
The R prompt.
R is command-line driven. The prompt looks like a greater than sign, >. When
you see it, start typing. (Much faster than hunting and clicking once you get
used to it!) Press Enter when you have finished typing the command. If the
command is not complete, R will issue a continuation prompt, which is a plus
sign, +. You can type long commands on several lines to make it easier to keep
track of syntax. You can also type them into a text editor and then cut and
paste them into R. (Don't type the command prompts. Indent if the command
continues to multiple lines.)
If your command line starts with + instead of >, that means you are continuing
a command you started typing on the previous line. You are not typing a new
command. If that's not what you want, press the Escape key (upper left corner
of the keyboard) to get back to the command prompt. Write that on the back of
your hand. You're going to need it sooner or later!!! (STUDENTS: Seriously, I'm
not going to tell you again how to do this. Make a note, or know where to look
it up! Using R will require you to exercise your memory a bit. Try it. It's
good for you!)
-------------------------------------------------------------------------------
Important info for command line newbies.
Never look above the current command prompt for new information. That's all
history, and it will never change. New information will be printed ONLY below
the command you are currently typing. (That even catches me every once in
a while, and I'm a command line guy from way back!)
-------------------------------------------------------------------------------
Case sensitive.
R commands are case sensitive. If you get a "function not found" error, check
your spelling and your capitalization. R functions are generally, but not
always, typed in lower case letters. On the other hand, spacing usually
doesn't matter. Leave spaces wherever you like (except within names of things
or inside of numbers or data values) to make your input easier to read.
-------------------------------------------------------------------------------
R functions.
R "commands" are called functions. The basic form of a function is:
> function.name(arguments, options)
Example (go ahead and type these and press Enter--output is not shown here):
> mean(rivers)
> mean(rivers, trim=.1)
> mean(rivers, trim=.2)
Obviously, you've just calculated the mean of something. In the first case,
it's the ordinary arithmetic mean. In the second case, the mean was calculated
after trimming off the highest 10% and lowest 10% of the values in the data.
In the third case, 20% of the values were trimmed off the high and low ends of
the distribution of values. The default value for trim= is 0 (do not trim), and
so the value of trim= does not have to be given if you just want the plain old
mean. When an argument has a default value, I'll refer to it as an option.
Try misspelling "rivers", or try typing it with a capital R, "Rivers". You'll
get an "object not found" error. Try spelling mean with a capital M. You'll get
a "function not found" error. R will demand proper spelling and capitalization.
These are the two errors you will see most often in R. When you see them, check
your spelling and capitalization. (STUDENTS: Check it BEFORE you ask me! R
demands attention to details. If you're not used to that, get used to it!)
Even if there are no arguments or options, the parentheses are still mandatory.
If you leave off the parentheses, R will print out the source code for the
function you have entered. This can be startling when you're not expecting it,
but it is not harmful. Just retype the command correctly.
Example:
> mean # press Enter here
> mean rivers # always remember to press Enter
> meen(rivers)
> mean(Rivers)
> mean(rivers) # finally we get it right!
Now what happens if you type mean() and press Enter? Another error. Why?
I've been using R for 12 years, and I get error messages all the time. You will,
too. Don't let it bother you. Just fix the mistake and move on.
-------------------------------------------------------------------------------
Interactive mode.
R can be used by writing programs into script files and reading them in, but
it's easier to use R in interactive mode, typing individual commands at the
command prompt.
Notes or comments can be added to the R session by typing a hash or pound
symbol, #. Anything on the line following this symbol will be ignored by R.
(STUDENTS: You do not have to type things on the handouts that follow a #.)
-------------------------------------------------------------------------------
Keep it clean!
The "workspace" is an area in memory where R keeps your stuff. It is best not
to clutter up your workspace with old stuff you no longer need. If you have
created a data object called myData (for example), in your workspace, and you
want to delete it before saving your workspace or continuing on to a new
problem, you can do it this way.
> myData = 17 # create the data object
> ls() # look at it in your workspace
[1] "myData" # this is an output line--no prompt!
> myData # see the value(s) in myData
[1] 17 # another output line
> rm(myData) # erase it -- remove(myData) also works
> ls() # see that it is GONE
character(0) # R's way of saying "empty"
The function ls() lists the contents of your workspace. The function rm(), or
remove(), erases things from your workspace.
Important note: R does not send things to the trash. If you erase something,
it's GONE! Be careful.
-------------------------------------------------------------------------------
Quitting R.
> quit() # remember, the parentheses are mandatory!
Note: best to know this right away! Also, q() works. When R asks to save your
workspace, pick "no" until you know what this means. If you have entered data
into R and want it saved for your next R session, pick "yes". R will create a
data file on your hard drive that will be loaded automatically next time you
start R.
-------------------------------------------------------------------------------
Simple data input.
Small data sets can be entered using the c() function. This produces a data
vector. Name it anything you want, but don't put spaces or dashes in the name.
Use underscores or periods instead.
> my.data = c( 22, 38, 12, 23, 29, 18, 16, 24 )
Note: spaces after the commas and before and after parentheses are optional.
> my.data=c(22,38,12,23,29,18,16,24) # exactly the same as previous command
To examine what you've entered, just type its name.
> my.data # this implicitly invokes print(my.data)
[1] 22 38 12 23 29 18 16 24
To change value i in the vector, address it as my.data[i]. For example...
> my.data[2] # see the second value in my.data
[1] 38
> my.data[2] = 28 # change its value
> my.data # see the changed data vector
[1] 22 28 12 23 29 18 16 24
One caution about making changes to data--there is no "undo." Be careful!
-------------------------------------------------------------------------------
An important note on assignment.
When you create a data set in R and save it, the thing that gets placed in the
workspace with your data in it is called an "object." Objects are created by
assignment. There are three ways to do assignment. One is with the equals sign
as we've seen.
> my.data = c(22, 38, 12, 23, 29, 18, 16, 24)
Notice that there is no output to the screen. This means R did what you told it
to do and doesn't feel compelled to keep chatting with you about it. (Unlike
most software these days, R is not unnecessarily chatty.) If you create data
by assignment, and R doesn't print anything to the Console, that means "mission
accomplished." If you want to see it, you have to ask.
> my.data
[1] 22 38 12 23 29 18 16 24
On the other hand, if you forget to give your data object a name, R will spill
the data out to the Console, and that will be the end of them. Nothing is
created in the workspace. Don't make this mistake. Name your data object!
> c(22, 38, 12, 23, 29, 18, 16, 24) # nothing is placed in the workspace
[1] 22 38 12 23 29 18 16 24
You can name data objects anything you want as long as the name starts with a
letter, consists of letters, numbers, periods, and underscores, and has no
spaces in it. I would advise you against using T and F as names for data
objects. Other than that...
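Why avoid T and F? They are built-in shortcuts for TRUE and FALSE, and naming a
data object T or F hides the shortcut. A quick demonstration:

```r
T           # by default, T is a shortcut for TRUE
isTRUE(T)   # TRUE
T = 0       # perfectly legal, but now T no longer means TRUE
isTRUE(T)   # FALSE -- any code relying on the shortcut is now broken
rm(T)       # remove your T and the built-in shortcut comes back
isTRUE(T)   # TRUE again
```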
Also remember that data object names are case sensitive. Thus, if you create
my.data, and then ask to see My.data or my.Data or MY.DATA, what's going to
happen?
> My.data
Error: object 'My.data' not found
-------------------------------------------------------------------------------
Another way to enter data from the keyboard.
> rm(my.data)
> my.data = scan() # press the Enter key here
1: 22 # type the first data value and press Enter
2: 38 # type the second data value and press Enter
3: 12 # type the third data value and press Enter
4: 23 # etc.
5: 29
6: 18
7: 16
8: 24
9: # enter a blank data value to terminate data entry
Read 8 items
>
This method is particularly useful for entering longer data vectors. (STUDENTS:
Don't just type scan() and start entering data. Type a name for your data, an
equals sign, and then scan(). Otherwise, when you terminate data entry, your
entered values will spill out to the Console and be lost. You've been warned!)
-------------------------------------------------------------------------------
R output.
You may have noticed above that R not only printed out the values you stored
in your vector, but it also printed out a 1 inside square brackets. When R is
printing out a vector, it indexes all the values, but only the index for the
first value on a line is shown. Thus, [1] means this line begins with data value
number 1. The numbers in square brackets are not part of the data, and they may
be different on your screen depending upon how wide you have the R Console set
to. Try this:
> rivers
[1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870
[16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280
[31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600
[46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350
[61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260
[76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735
[91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377
[106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540
[121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529
[136] 500 720 270 430 671 1770
The first line of this output starts with value number 1, the second line with
value number 16, the third line with value number 31, and so on. All of the
values are indexed, so the value of 310 in the next to the last line is value
number 123. It can be retrieved like this:
> rivers[123]
[1] 310
Don't be confused by the fact that R called it value [1] when it printed it. It
was the first value printed, so it is called [1]. If you want values 10 through
20 from the rivers vector, you can get them this way:
> rivers[10:20] # a colon between two numbers means "through"
[1] 600 330 336 280 315 870 906 202 329 290 1000
The printed values are considered a new vector and are indexed accordingly.
If you want a discontinuous set of values, enter the indexes in a vector created
with the c() function:
> rivers[c(10,21,35,102,116)] # and mind your parentheses and brackets
Square brackets ALWAYS contain indexes, or something that evaluates to indexes.
For example:
> rivers[15]
> rivers[(10+20)/2]
> rivers[rivers > 1000] # prints values for which rivers > 1000 is true
> values = c(10,21,35,102,116)
> rivers[values]
Important note: Leaving the index blank means gimme everything.
> rivers[ ]
Now, what do these commands do?
> head(rivers)
> tail(rivers)
> tail(rivers, n=15)
-------------------------------------------------------------------------------
Simple frequency distributions.
Clear your workspace before continuing.
X f
-----------
7 3
6 0
5 2
4 7
3 5
2 1
1 1
To put this simple frequency distribution into a vector, do this:
> X = 7:1 # same thing as X = c(7, 6, 5, 4, 3, 2, 1)
> f = c(3, 0, 2, 7, 5, 1, 1)
> the.data = rep(X, f) # rep() says repeat each element of X f times
> the.data
[1] 7 7 7 5 5 4 4 4 4 4 4 4 3 3 3 3 3 2 1
If you want to view your data as a table similar to the one above, do this.
> data.frame(X, f)
To save the data in this format:
> data.table = data.frame(X, f)
Your workspace should now look like this (good to keep track):
> ls()
[1] "data.table" "f" "the.data" "X"
-------------------------------------------------------------------------------
Grouped frequency tables.
For large data sets (such as rivers), a simple frequency table would be more
than a bit unwieldy.
> table(rivers) # top numbers are data values; bottom numbers are frequencies
You can also try it this way.
> as.data.frame(table(rivers))
See what I mean? Thus, we resort to grouping. First, some standard notation. An
interval denoted as (10,20] is an interval extending from 10 to 20, which does
not include 10, called the lower bound, but does include 20, called the upper
bound. Including the upper bound and not the lower bound in an interval is
default behavior in R. This behavior is called "closed on the right." This can
be changed with an option.
Step 1: Decide what intervals you want. This may help.
> summary(rivers)
Min. 1st Qu. Median Mean 3rd Qu. Max.
135.0 310.0 425.0 591.2 680.0 3710.0
We will define the following intervals: (99,199], (199,299], (299,399], and so
on to (899,999], (999,1999], (1999,2999], (2999,3999]. Why are the intervals
written this way?
> bounds = c(99, 199, 299, 399, 499, 599, 699, 799, 899, 999, 1999, 2999, 3999)
Step 2: Use the cut() function to do the grouping. Create a temporary storage
place for the result.
> temp = cut(rivers, bounds) # examine this result at your own risk!
Step 3: Create the grouped frequency table.
> table(temp) # one possible format
> as.data.frame(table(temp)) # another possible format
Notice that R did some rounding that we didn't ask for. Can we stop that? Of
course! I'll start again, close the intervals on the left rather than the right,
and stop R from rounding in the interval labels. I'll also replace "temp" with
"X" just to make the result look like it does in a textbook (more or less).
> rm(bounds, temp)
> bounds = c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000)
> X = cut(rivers, bounds, right=FALSE, dig.lab=4)
> table(X) # output not shown
> as.data.frame(table(X))
X Freq
1 [100,200) 1
2 [200,300) 28
3 [300,400) 35
4 [400,500) 18
5 [500,600) 12
6 [600,700) 13
7 [700,800) 8
8 [800,900) 5
9 [900,1000) 4
10 [1000,2000) 13
11 [2000,3000) 3
12 [3000,4000) 1
Note: the right=FALSE (or right=F) option closes the bins on the left, and
dig.lab=4 tells R to maintain 4 digits in the interval labels. (The default is
three.) If we have a river that is exactly 600 miles long (there are three), in
which of the above intervals will we find it?
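We can check the answer directly. This is a quick sketch: sum() counts the
TRUEs in a logical vector, and cut() will bin a single value the same way it
binned the whole vector.

```r
sum(rivers == 600)      # counts the rivers that are exactly 600 miles long
bounds = c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000)
cut(600, bounds, right=FALSE, dig.lab=4)   # with the bins closed on the left,
                                           # 600 falls in [600,700)
```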
You can also get a stem-and-leaf display. (Try fooling with the scale= option
to see what effect that has. It might help if you try sort(rivers) first.)
> stem(rivers, scale=2) # output not shown
-------------------------------------------------------------------------------
Histogram.
> hist(rivers)
Note: Bars in R histograms include the right limit but not the left (closed on
the right). This can be changed, of course.
> hist(rivers, right=F) # F is short for FALSE
> hist(rivers, breaks=20, right=F) # use about 20 breaks (bars)
> hist(rivers, breaks=bounds, right=F) # user defined bounds as breaks
The last command prints a frequency density histogram, because the intervals
are not all equally wide. Thus, an ordinary frequency histogram would be
misleading.
-------------------------------------------------------------------------------
Boxplot.
> boxplot(rivers)
> boxplot(rivers, ylab="River Length in Miles")
> title(main="Length of Major North American Rivers")
-------------------------------------------------------------------------------
Frequency polygon.
There really is no easy way to plot frequency polygons using just the base R
packages. There are optional packages that can be downloaded that have this
functionality (such as ggplot2). However, here is a quick and very dirty
frequency polygon from the base packages.
> hist(rivers, breaks=20, right=F) -> my.hist
> my.hist
> data.frame(interval.midpoints=my.hist$mids, counts=my.hist$counts)
> plot(my.hist$mids, my.hist$counts, type="b", # press Enter
+ xlab="Interval Midpoints", ylab="Frequency")
-------------------------------------------------------------------------------
First intermission.
Time to clean up and get some popcorn.
> ls()
[1] "bounds"     "data.table" "f"          "my.hist"    "the.data"   "X"
> rm(bounds, my.hist)
> ls() # the popcorn is up to you
[1] "data.table" "f" "the.data" "X"
-------------------------------------------------------------------------------
Categorical (nominal) data.
> colors = c("red", "black", "brown", "blonde")
> freqs = c( 8 , 22 , 30 , 18 ) # spaces don't matter
> hair.color = rep(colors, freqs)
> hair.color # output not shown
> table(hair.color)
hair.color
black blonde brown red
22 18 30 8
> barplot(table(hair.color))
> barplot(table(hair.color), col=c("black","yellow","brown","red"))
> barplot(table(hair.color), col=c("black","yellow","brown","red"),
+ ylab="Frequency")
Note: The order of the bars can be changed, too.
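One way to change the order (a sketch; the name ordered.hair is my own): turn
the data into a factor whose levels are listed in the order you want the bars
drawn.

```r
hair.color = rep(c("red", "black", "brown", "blonde"), c(8, 22, 30, 18))
ordered.hair = factor(hair.color, levels=c("blonde", "brown", "red", "black"))
table(ordered.hair)           # the table (and hence the bars) follow our order
barplot(table(ordered.hair))  # bars now run blonde, brown, red, black
```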
The example above uses one function embedded within another. You don't have to
do it this way if you're not comfortable with that.
> temp.table = table(hair.color)
> barplot(temp.table)
> rm(temp.table) # if you no longer need it
> rm(colors, freqs) # clean up
-------------------------------------------------------------------------------
Summation.
> sum(the.data) # sigma X
[1] 77
> sum(the.data) ^ 2 # (sigma X) quantity squared; i.e., 77^2
[1] 5929
> sum(the.data ^ 2) # sigma (X squared)
[1] 359
Note: the squared values themselves can be returned with...
> the.data ^ 2
[1] 49 49 49 25 25 16 16 16 16 16 16 16 9 9 9 9 9 4 1
> x = c(2, 3, 5)
> y = c(3, 8, 9)
> sum(x * y) # sum of the crossproducts
[1] 75
> sum(x) * sum(y) # sigma X times sigma y
[1] 200
There is no function in R for the sum of squares. However, it can easily be
calculated using the arithmetic functions in R:
> sum(the.data ^ 2) - sum(the.data) ^ 2 / length(the.data)
[1] 46.94737
And knowing that allows us to write a function for sum of squares.
> SS = function(X) sum(X^2) - sum(X)^2/length(X) # X is a dummy variable
> SS(the.data)
[1] 46.94737
This function will remain in your workspace until you remove it.
Here are some other built-in math functions in R:
> sqrt() # square root
> log() # returns the natural log
> log10() # base-10 logs
> sin() # and all the other trig functions
> abs() # absolute value
And there are many, many, many more! These functions will work on individual
values or on complete vectors.
> sqrt(144)
[1] 12
> my.data = c(22, 38, 12, 23, 29, 18, 16, 24)
> sqrt(my.data)
[1] 4.690416 6.164414 3.464102 4.795832 5.385165 4.242641 4.000000 4.898979
-------------------------------------------------------------------------------
Descriptive statistics.
> ls() # this is what we have to work with
[1] "hair.color" "my.data" "the.data"
**sample size**
> length(the.data)
[1] 19
> length(hair.color)
[1] 78
> length(my.data)
[1] 8
**sample median**
> median(the.data)
[1] 4
**sample mean**
> mean(the.data)
[1] 4.052632
**range**
> range(the.data) # reports the min and max values
[1] 1 7
**IQR**
> IQR(the.data)
[1] 1.5
Note: This may not be the "textbook" IQR. R can use 9 different methods to
calculate Q1 and Q3. By default it uses a different one than most elementary
textbooks do. See below for a fix.
**sample variance**
> var(the.data) # there is no function for pop. variance
[1] 2.608187
**sum of squares**
> var(the.data) * (length(the.data) - 1) # explain how this works!
[1] 46.94737
> sum(the.data ^ 2) - sum(the.data) ^ 2 / length(the.data)
[1] 46.94737
Note: there is no built-in function, but it is easy enough to create one using
either one of the above calculations.
**sample standard deviation**
> sd(the.data) # there is no function for pop. sd
[1] 1.614988
**more or less all at once**
> summary(the.data)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.000 4.000 4.053 4.500 7.000
Note: these quartiles are technically correct for continuous data. The
"textbook" method would give Q3 = 5, which is correct for discrete data. To
get this...
> quantile(the.data, c(.25,.75), type=2) # textbook quartiles!
25% 75%
3 5
To get the "textbook" IQR...
> IQR(the.data, type=2)
[1] 2
-------------------------------------------------------------------------------
Unit normal distribution.
Note: R has a built-in function for converting to z-scores, but it uses the
sample standard deviation rather than the population SD (sigma), and assumes it
has been given an entire sample of scores. While this makes perfect sense, it's
not what most intro stat books want. Instead, you can use R as a calculator.
If mu = 25 and sigma = 8, convert these scores to z-scores: 18, 28, 38, 21.
**one at a time**
> (18 - 25) / 8
[1] -0.875
**or all at once**
> scores = c(18, 28, 38, 21)
> z = (scores - 25) / 8 # assignment to variable z is optional
> z
[1] -0.875 0.375 1.625 -0.500
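For the record, the built-in function alluded to in the note above is scale().
A quick sketch of why it doesn't match the textbook formula: it standardizes
using the sample mean and sample sd of whatever vector you hand it, not mu and
sigma.

```r
scores = c(18, 28, 38, 21)
scale(scores)                          # z-scores from the SAMPLE mean and sd
(scores - mean(scores)) / sd(scores)   # the same thing, done by hand
(scores - 25) / 8                      # the textbook z-scores, using mu and sigma
```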
**probabilities under the normal curve**
> pnorm(0.375) # area at or below z = .375
[1] 0.6461698
> 1 - pnorm(0.375) # area above z
[1] 0.3538302
> pnorm(0.375, lower.tail=FALSE) # I bet you can figure this one out
[1] 0.3538302
> pnorm(z) # for all of the values stored in z at once
[1] 0.1907870 0.6461698 0.9479187 0.3085375
**the inverse z function**
> qnorm(.9) # the z with 90% of area at or below it
[1] 1.281552
> qnorm(.95, lower.tail=FALSE) # the z with 95% of area at or above it
[1] -1.644854
> rm(scores)
-------------------------------------------------------------------------------
Single sample z-test.
Isn't built in -- must be calculated by hand. Or you can write your own
function to calculate it. There are also optional packages that contain
z-test functions. For example, the UsingR package contains a function,
simple.z.test(), that does single-sample tests.
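If you want to roll your own, here is a minimal sketch. The function name and
arguments below are my own invention, not part of base R or of UsingR.

```r
z.test = function(x, mu, sigma) {            # home-made single-sample z-test
    z = (mean(x) - mu) / (sigma / sqrt(length(x)))
    p = 2 * pnorm(-abs(z))                   # two-tailed p-value
    c(z = z, p.value = p)
}
z.test(c(18, 28, 38, 21), mu=25, sigma=8)    # z = 0.3125
```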
-------------------------------------------------------------------------------
Single sample t-test.
> the.data # just to be sure it's still there
[1] 7 7 7 5 5 4 4 4 4 4 4 4 3 3 3 3 3 2 1
> t.test(the.data, mu=5) # H-null says mu=5, two-tailed
One Sample t-test
data: the.data
t = -2.557, df = 18, p-value = 0.01981
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
3.274232 4.831031
sample estimates:
mean of x
4.052632
One-tailed tests follow...
> t.test(the.data, mu=5, alternative="greater")
One Sample t-test
data: the.data
t = -2.557, df = 18, p-value = 0.99
alternative hypothesis: true mean is greater than 5
95 percent confidence interval:
3.410155 Inf
sample estimates:
mean of x
4.052632
> t.test(the.data, mu=5, alternative="less")
One Sample t-test
data: the.data
t = -2.557, df = 18, p-value = 0.009904
alternative hypothesis: true mean is less than 5
95 percent confidence interval:
-Inf 4.695109
sample estimates:
mean of x
4.052632
-------------------------------------------------------------------------------
Independent t-test.
> x1 = c(18, 25, 17, 20, 23)
> x2 = c(20, 30, 22, 25, 28, 30)
**pooled variance**
> t.test(x1, x2, var.equal=TRUE)
Two Sample t-test
data: x1 and x2
t = -2.2395, df = 9, p-value = 0.05188
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.51953398 0.05286732
sample estimates:
mean of x mean of y
20.60000 25.83333
**unpooled variance (the default)**
> t.test(x1, x2)
Welch Two Sample t-test
data: x1 and x2
t = -2.2903, df = 8.995, p-value = 0.04776
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.40273450 -0.06393217
sample estimates:
mean of x mean of y
20.60000 25.83333
**one-tailed (R always subtracts group 1 - group 2)**
> t.test(x1, x2, var.equal=T, alternative="less")
Two Sample t-test
data: x1 and x2
t = -2.2395, df = 9, p-value = 0.02594
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.9497217
sample estimates:
mean of x mean of y
20.60000 25.83333
**the formula interface (a more sophisticated way)**
You will be using the formula interface quite a lot in R. The basic format of
a formula is DV ~ IV (name of dependent variable tilde name of independent
variable). Thus, to use a formula interface, all values of the DV must be in
one vector, and all values of the IV in another. For each value of the DV,
there is a value of the IV telling what group it is from.
> all.scores = c(x1, x2) # combines x1 and x2 into one vector
> all.scores # the DV vector
[1] 18 25 17 20 23 20 30 22 25 28 30
> grp = c("x1", "x2") # the group names
> n = c(5, 6) # the group sizes
> groups = rep(grp, n) # create the labels
> groups # the IV vector
[1] "x1" "x1" "x1" "x1" "x1" "x2" "x2" "x2" "x2" "x2" "x2"
> t.test(all.scores ~ groups) # "all.scores by (tilde) groups"
Welch Two Sample t-test
data: all.scores by groups
t = -2.2903, df = 8.995, p-value = 0.04776
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-10.40273450 -0.06393217
sample estimates:
mean in group x1 mean in group x2
20.60000 25.83333
-------------------------------------------------------------------------------
Tests for homogeneity of variance.
There are several tests in R for homogeneity of variance. The one most commonly
used is the Bartlett test. It is slightly, but not much, better than the F-max
test, which is not available in R.
> bartlett.test(all.scores ~ groups)
Bartlett test of homogeneity of variances
data: all.scores by groups
Bartlett's K-squared = 0.1994, df = 1, p-value = 0.6552
Also...
> var.test(x1, x2) # significance test for two variances
F test to compare two variances
data: x1 and x2
F = 0.636, num df = 4, denom df = 5, p-value = 0.6818
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.08608992 5.95601427
sample estimates:
ratio of variances
0.6360225
> var.test(all.scores ~ groups) # the formula interface also works
-------------------------------------------------------------------------------
Dependent (related samples) t-test.
> before = c(18, 17, 20, 19, 22, 24)
> after = c(20, 22, 19, 21, 25, 25)
> t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = -2.4495, df = 5, p-value = 0.05797
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.09887128 0.09887128
sample estimates:
mean of the differences
-2
Note: R always subtracts first group - second group. One-tailed tests can be
specified as in the independent t-test with the option alternative=. Use
alternative="less" if you think "before" mean is less than "after" mean.
Otherwise, use alternative="greater".
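For the data above, the one-tailed version looks like this (output not shown;
since t is negative here, the one-tailed p-value comes out to half the
two-tailed one):

```r
before = c(18, 17, 20, 19, 22, 24)
after  = c(20, 22, 19, 21, 25, 25)
t.test(before, after, paired=TRUE, alternative="less")  # H1: before mean < after mean
```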
-------------------------------------------------------------------------------
Second intermission.
Well, we've made quite a mess of our workspace, haven't we? Examine your
workspace, then remove everything EXCEPT hair.color, x1, and x2, which we will
be using below.
If you've had all you can take for now and want to shut down and resume later,
just answer "yes" when R asks to save your workspace. Your data will be there
waiting for you next time you start up. (Important exception: This is not true
if you've changed the working directory. If you don't know what the working
directory is, then you probably haven't changed it. You should be good to go.)
-------------------------------------------------------------------------------
ANOVA (oneway, between subjects)
> x1 # still there (if not, re-enter it)
[1] 18 25 17 20 23
> x2 # still there
[1] 20 30 22 25 28 30
> x3 = c(35, 27, 27, 30, 40, 33) # a new level of X
> all.scores = c(x1, x2, x3) # combined into a single DV data vector
> grp = c("x1", "x2", "x3") # the group names
> n = c(5, 6, 6) # the group sizes
> groups = rep(grp, n) # labels for each score (the IV vector)
> results = aov(all.scores ~ factor(groups))
> summary(results)
                Df Sum Sq Mean Sq F value   Pr(>F)
factor(groups)   2 358.20  179.10  9.5691 0.002402 **
Residuals       14 262.03   18.72
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notes: The formula interface is mandatory for the aov() function. To get the
ANOVA table, store the output of the aov() function and then use the summary()
function to see the table. The function oneway.test() also does a oneway ANOVA,
but the full ANOVA table is not produced. By default, oneway.test() applies a
Welch-like correction for nonhomogeneity of variance. R can get very snotty if
you don't correctly label the factor. If it is coded numerically, you MUST
convert it to a factor before doing the ANOVA. Check the degrees of freedom in
the ANOVA summary table. If they are not right, then something has gone wrong
with your coding.
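Here is a sketch of what that coding problem looks like (the names num.groups,
bad, and good are my own):

```r
all.scores = c(18, 25, 17, 20, 23, 20, 30, 22, 25, 28, 30, 35, 27, 27, 30, 40, 33)
num.groups = rep(c(1, 2, 3), c(5, 6, 6))    # the same three groups, coded 1, 2, 3
bad  = aov(all.scores ~ num.groups)         # codes treated as a NUMERIC predictor: 1 df
good = aov(all.scores ~ factor(num.groups)) # codes treated as group labels: 2 df
summary(good)                               # check the df line before you trust it
```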
The Bartlett test is available to check the homogeneity assumption.
> bartlett.test(all.scores ~ groups)
Bartlett test of homogeneity of variances
data: all.scores by groups
Bartlett's K-squared = 0.6503, df = 2, p-value = 0.7224
The Tukey HSD test is available for post hoc testing of pairwise comparisons.
> TukeyHSD(results) # a commonly used post hoc test
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = all.scores ~ factor(groups))
$`factor(groups)`
diff lwr upr p adj
x2-x1 5.233333 -1.6231312 12.08980 0.1493412
x3-x1 11.400000 4.5435354 18.25646 0.0017968
x3-x2 6.166667 -0.3707158 12.70405 0.0656473
Advantage of the aov() function--a post hoc (Tukey) test can be done.
Advantage of the oneway.test() function--it can correct for nonhomogeneity.
> oneway.test(all.scores ~ groups)
One-way analysis of means (not assuming equal variances)
data: all.scores and groups
F = 9.4638, num df = 2.000, denom df = 9.299, p-value = 0.005727
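If you tell oneway.test() to assume equal variances, it reproduces the F-test from aov() above, which is a handy way to check your work. A sketch using the same data:

```r
x1 = c(18, 25, 17, 20, 23)
x2 = c(20, 30, 22, 25, 28, 30)
x3 = c(35, 27, 27, 30, 40, 33)
all.scores = c(x1, x2, x3)
groups = rep(c("x1", "x2", "x3"), c(5, 6, 6))
oneway.test(all.scores ~ factor(groups), var.equal=TRUE)
# F = 9.5691 on 2 and 14 df, matching the ANOVA summary table above
```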
Clean up: Remove everything EXCEPT hair.color from your workspace.
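One quick way to do that clean-up (a sketch; rm() with the list= option takes a character vector of object names):

```r
# suppose the workspace contains hair.color plus leftovers from above
hair.color = c("red", "black", "brown")  # stand-in; yours is already there
x1 = c(18, 25, 17, 20, 23)               # a leftover object
rm(list = setdiff(ls(), "hair.color"))   # remove everything EXCEPT hair.color
ls()                                     # only "hair.color" should remain
```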
-------------------------------------------------------------------------------
Correlation.
> x = rnorm(100, mean=50, sd=10) # 100 normally distributed random numbers
> y = rnorm(100, mean=75, sd=20) # 100 more (quick and dirty data!)
> plot(x, y) # scatterplot
> cor(x, y)
[1] 0.00620442 # Pearson's r is the default
> cor(x, y, method="spearman") # Spearman rho (don't capitalize it!)
[1] -0.002076208
> cor.test(x, y) # correlation t-test
Pearson's product-moment correlation
data: x and y
t = 0.0614, df = 98, p-value = 0.9511
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1904458 0.2023759
sample estimates:
cor
0.00620442
Note: One-tailed tests are also possible by using the alternative= option. Use
alternative="less" if you think the correlation is less than zero. Otherwise,
use alternative="greater".
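For example (quick and dirty random data again, so the p-value will vary from run to run; with the true correlation zero, the one-tailed test should usually come out nonsignificant):

```r
x = rnorm(100, mean=50, sd=10)
y = rnorm(100, mean=75, sd=20)
cor.test(x, y, alternative="greater")  # one-tailed: H1 is correlation > 0
```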
-------------------------------------------------------------------------------
Regression.
> lm(y ~ x) # linear regression requires formula interface
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
74.43864 0.01112
> my.lm.analysis = lm(y ~ x) # storing the results gives more output
> summary(my.lm.analysis)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-38.663 -14.562 -0.173 13.702 50.507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.43864 9.07285 8.205 9.27e-13 ***
x 0.01112 0.18097 0.061 0.951
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.45 on 98 degrees of freedom
Multiple R-Squared: 3.849e-05, Adjusted R-squared: -0.01017
F-statistic: 0.003773 on 1 and 98 DF, p-value: 0.9511
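A common follow-up is to draw the fitted regression line on the scatterplot, which the abline() function will do straight from a stored lm() result. A sketch (same quick and dirty random data, so your numbers will differ):

```r
x = rnorm(100, mean=50, sd=10)
y = rnorm(100, mean=75, sd=20)
my.lm.analysis = lm(y ~ x)
plot(x, y)               # scatterplot first
abline(my.lm.analysis)   # then overlay the fitted regression line
```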
> rm(x, y, my.lm.analysis) # clean up
-------------------------------------------------------------------------------
Chi-square goodness of fit.
> ls() # this should be where you are
[1] "hair.color"
> hair.color # still there? (if not, re-enter it)
[1] "red" "red" "red" "red" "red" "red" "red" "red"
[9] "black" "black" "black" "black" "black" "black" "black" "black"
[17] "black" "black" "black" "black" "black" "black" "black" "black"
[25] "black" "black" "black" "black" "black" "black" "brown" "brown"
[33] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown"
[41] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown"
[49] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown"
[57] "brown" "brown" "brown" "brown" "blonde" "blonde" "blonde" "blonde"
[65] "blonde" "blonde" "blonde" "blonde" "blonde" "blonde" "blonde" "blonde"
[73] "blonde" "blonde" "blonde" "blonde" "blonde" "blonde"
> table(hair.color) # observed frequencies
hair.color
black blonde brown red
22 18 30 8
> chisq.test(table(hair.color))
Alternatively, you can do this one step at a time...
> temp.table = table(hair.color)
> chisq.test(temp.table)
Chi-squared test for given probabilities
data: temp.table
X-squared = 12.8718, df = 3, p-value = 0.004922
> chisq.test(c(22,18,30,8)) # also works from a vector of OFs
Chi-squared test for given probabilities
data: c(22, 18, 30, 8)
X-squared = 12.8718, df = 3, p-value = 0.004922
Note: The default is equal expected frequencies. If expected frequencies are
unequal, enter them as proportions or probabilities.
> expected = c(1/4, 1/4, 1/3, 1/6) # best to use fractions if possible
> sum(expected) # must sum exactly to 1, of course
[1] 1
> observed = c(22, 18, 30, 8) # observed must be frequency counts
> chisq.test(observed, p=expected)
Chi-squared test for given probabilities
data: observed
X-squared = 2.9744, df = 3, p-value = 0.3956
> rm(hair.color, temp.table, expected, observed)
-------------------------------------------------------------------------------
Chi-square test of independence.
If the data have already been crosstabulated, this is easy. Otherwise, you'll
need to learn to use the xtabs() and table() functions for crosstabulation.
Here are some crosstabulated data that we will put into a table.
                        variable one
                    dems   repubs   other
variable    male     22      35      12
two       female     28      30      18
> row1 = c(22, 35, 12)
> row2 = c(28, 30, 18)
> my.table = rbind(row1, row2) # binds rows into a table (matrix)
> my.table
[,1] [,2] [,3]
row1 22 35 12
row2 28 30 18
> chisq.test(my.table) # submit a table name and R knows what to do
Pearson's Chi-squared test
data: my.table
X-squared = 1.9713, df = 2, p-value = 0.3732
Note: For 2x2 tables, R does the Yates correction by default. To prevent this:
> chisq.test(my.table, correct=F)
To see more information, store the results of the test...
> chisq.results = chisq.test(my.table)
> chisq.results
> chisq.results $ expected # expected frequencies
> chisq.results $ residuals # Pearson residuals (unsquared chi values) in each cell
> rm(row1, row2, my.table, chisq.results) # as always, clean up
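If your data are raw (one row per case) rather than already crosstabulated, the table() function will build the table for you. A small sketch with made-up raw data (sex and party are hypothetical names):

```r
# hypothetical raw data: one sex and one party recorded per respondent
sex   = c("male", "male", "female", "female", "female", "male")
party = c("dems", "repubs", "dems", "other", "dems", "dems")
my.table = table(sex, party)   # crosstabulate
my.table                       # then submit to chisq.test() as above
```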
-------------------------------------------------------------------------------
Getting help.
> help(function.name) # opens a somewhat helpful help page
> help(t.test) # ?t.test is a shortcut
> help.search("distribution") # looks for anything to do with quoted terms
> apropos("mean") # returns names of objects containing "mean"
Note: The R manuals are included with the download in both PDF and html format.
Look for them in the root R directory.
Note: In Windows and OS X, the help() function opens a new window, which can be
closed like any other window. In Linux, help pages are opened right in the
terminal session. To close the help page, press the Q key.
-------------------------------------------------------------------------------
Reading in a data frame from a csv file.
The best way I've found to handle large data files is to enter them into a
spreadsheet and then save them in comma separated format. The file should have
column names, one column per variable, and one row per case, and NOTHING else.
Don't use any fancy formatting. Don't leave any blank cells. Don't use any data
or variable names with spaces in them. Good idea: keep the names short!
If you drop a copy of this csv file into your working directory, you can read
it into a data frame as follows:
> dataframe.name = read.csv(file = "filename.csv")
It's best to keep the names of data frames in memory SHORT, so maybe...
> df = read.csv(file = "filename.csv")
If you put row names (see next topic) in the first column of the csv file, then
do this:
> df = read.csv(file = "filename.csv", row.names=1)
If the csv file is not in your working directory, set file= to the complete
(or relative) pathname of the file. If the file is on the Internet, set file= to
the complete url of the file. (Older versions of R could not read https URLs
directly; if https gives an error, try http instead.)
If you want to try reading one off the Internet, use these commands.
> url = "http://ww2.coastal.edu/kingw/statistics/R-tutorials/text/caffeine.csv"
> caffeine = read.csv(file=url)
If this was NOT successful, you got some sort of error message, probably fairly
cryptic. If it was successful, R was silent, and in your workspace you'll find
a data frame called caffeine.
> ls()
[1] "caffeine" "url"
The following functions are available to examine it.
> names(caffeine) # get the names of the variables
[1] "group" "dose" "tapping"
> dim(caffeine) # get the size of the data frame
[1] 30 3
> head(caffeine) # see the header and top six rows of data
group dose tapping
1 caff0ml 0 242
2 caff0ml 0 245
3 caff0ml 0 244
4 caff0ml 0 248
5 caff0ml 0 247
6 caff0ml 0 248
> summary(caffeine) # summarize all the variables
group dose tapping
caff0ml :10 Min. : 0 Min. :242.0
caff100ml:10 1st Qu.: 0 1st Qu.:245.0
caff200ml:10 Median :100 Median :246.5
Mean :100 Mean :246.5
3rd Qu.:200 3rd Qu.:248.0
Max. :200 Max. :252.0
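Another handy inspection function is str(), which shows the structure of a data frame: each variable's type and its first few values. A sketch using a tiny made-up data frame (with the caffeine data loaded, str(caffeine) works the same way):

```r
# str() gives a compact one-line-per-variable view of any data frame
df = data.frame(group = c("a", "a", "b"), score = c(242, 245, 244))
str(df)   # reports the dimensions, then each variable's type and values
```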
-------------------------------------------------------------------------------
Inline data entry for shorter data frames.
It's not hard to type data vectors into R using the scan() function, but if you
want to enter an entire data frame at one go, that's another matter. You can do
it, though, by opening a script window (File > New Script in Windows; File > New
Document on a Mac) and typing exactly the following (using your own data, of
course).
groceries = read.table(header=T, text="
items storeA storeB storeC storeD
lettuce 1.17 1.78 1.29 1.29
potatoes 1.77 1.98 1.99 1.99
milk 1.49 1.69 1.79 1.59
eggs 0.65 0.99 0.69 1.09
bread 1.58 1.70 1.89 1.89
cereal 3.13 3.15 2.99 3.09
ground.beef 2.09 1.88 2.09 2.49
tomato.soup 0.62 0.65 0.65 0.69
laundry.detergent 5.89 5.99 5.99 6.99
aspirin 4.46 4.84 4.99 5.15
")
It's probably best to do spacing with the spacebar if you're in Windows. I
sometimes have trouble getting the Windows version of R to recognize tabs as
white space (even though the help page says it should). It doesn't seem to
matter on a Mac. Once you get it typed, execute the script, and the data are in
your workspace in an object called "groceries". (To execute a script, in
Windows, go to the Edit menu, and choose Run all. On a Mac, highlight the whole
thing in the script window with your mouse, go to the Edit menu, and choose
Execute.) Hint: You can also use a text editor and copy and paste the text
file into R at a command prompt. You may have to use this method in Linux.
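For completeness, here is what entry of a single vector with scan() looks like. Typed interactively, scan() reads numbers until you enter a blank line; non-interactively it can also read from a quoted string via the text= option:

```r
# interactive use: type  x = scan()  then enter numbers, blank line to finish
# non-interactive sketch of the same thing:
x = scan(text = "18 25 17 20 23")
x
# [1] 18 25 17 20 23
```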
-------------------------------------------------------------------------------
Working with data frames.
Clear your workspace. Then do this.
> data(mtcars)
This will put a data frame in your workspace that came with R when you installed
it. To get some information on what is in this data frame, use the help file.
> help(mtcars)
Here's what the data frame looks like.
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The column on the left is technically not part of the data but a column of
row names. Usually these will just be numbers unless you've gone to the trouble
of creating a unique row name for every row in the table. Each row represents a
single case (roughly, a "subject") in the data. Thus, the first row in this
data frame contains all the info on the Mazda RX4. (I would not put blank spaces
in these row names, but R can handle it if you know what you're doing.)
Each subsequent column contains one variable upon which the "subject" has been
measured. Each of these columns will have a name at the top. The first such
column has the name "mpg" and contains gas mileage info for each of the cars.
Data frames can be very long, thousands of cases or more, so it is often not
convenient to print the whole thing to the Console. In that case, the following
functions will be helpful.
> head(mtcars) # see the first six rows of data
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> dim(mtcars) # size of data frame in rows and columns
[1] 32 11
> names(mtcars) # see the variable names (column headers)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
> summary(mtcars) # summarize all the variables in the data
All the variables in this data frame are numeric, as indicated by the fact that
each variable was given a numerical summary. This will not always be the case.
For example:
> summary(chickwts)
weight feed
Min. :108.0 casein :12
1st Qu.:204.5 horsebean:10
Median :258.0 linseed :12
Mean :261.3 meatmeal :11
3rd Qu.:323.5 soybean :14
Max. :423.0 sunflower:12
This built-in data frame contains two variables, one of which is numeric
(weight), and one of which is categorical (feed). Categorical variables are
summarized with frequency tables. Similarly, in the caffeine data frame that
you read from the Internet in the last topic, there were three variables, one
categorical (group), and two numeric (dose, tapping).
> mtcars["Honda Civic", "mpg"] # get gas mileage for a Honda Civic
[1] 30.4
> mtcars["Honda Civic", ] # get all info on the Honda Civic
mpg cyl disp hp drat wt qsec vs am gear carb
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
The data frame can also be addressed using numeric indexes, if you know what
they are.
> mtcars[19, 1] # Honda Civic is row 19; mpg is column 1
[1] 30.4
Here is an important rule about data frames: R will not look inside a data
frame unless you give it permission to do so. Thus:
> mean(mpg)
Error in mean(mpg) : object 'mpg' not found
This fails because the variable mpg is inside the data frame, and R cannot see
it there until you tell it to look there. There are several ways to do this.
Here are two.
> with(mtcars, mean(mpg)) # syntax: with(dataframe.name, function(variable))
[1] 20.09062
> mean(mtcars$mpg) # syntax: function(dataframe.name$variable)
[1] 20.09062
Is there a difference in gas mileage depending upon transmission type? Sounds
like a t-test to me.
> t.test(mpg ~ am, data=mtcars)
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
Is there a difference in gas mileage as a function of the number of forward
gears? This sounds like an ANOVA. First, we'll examine the group means and sds.
> with(mtcars, tapply(mpg, gear, mean))
3 4 5
16.10667 24.53333 21.38000
> with(mtcars, tapply(mpg, gear, sd))
3 4 5
3.371618 5.276764 6.658979
The tapply() function (pronounced tee-apply, not tap-plea) is one of the more
cryptically named R functions. Its syntax is tapply(DV, IV, function). This
will break down the DV into groups defined by the IV and calculate the function
within each of these groups.
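Any function that makes sense on a vector can go in that third slot. For example, length() recovers the group sizes:

```r
data(mtcars)
with(mtcars, tapply(mpg, gear, length))   # how many cars in each gear group
#  3  4  5
# 15 12  5
```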
Now we'll get a graphic display of the groups.
> boxplot(mpg ~ gear, data=mtcars)
> title(xlab="Number of Forward Gears", ylab="Gas Mileage in Miles Per Gallon")
Here's the ANOVA. One trick here is that the IV, gear, is coded numerically.
This will throw the aov() function into a conniption unless we are sure to
declare gear to be a factor. (This declaration would not be necessary if gear
were coded categorically.)
> results = aov(mpg ~ factor(gear), data=mtcars)
> summary(results)
Df Sum Sq Mean Sq F value Pr(>F)
factor(gear) 2 483.2 241.62 10.9 0.000295 ***
Residuals 29 642.8 22.17
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
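As in the earlier oneway ANOVA topic, the stored aov() result can be handed to TukeyHSD() for post hoc pairwise comparisons (refitting the model here so the sketch stands alone):

```r
data(mtcars)
results = aov(mpg ~ factor(gear), data=mtcars)
TukeyHSD(results)   # pairwise comparisons: 4-3, 5-3, and 5-4 gears
```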
Is there a correlation between gas mileage and engine displacement?
> plot(mpg ~ disp, data=mtcars) # scatterplot
> with(mtcars, cor(mpg, disp))
[1] -0.8475514
> with(mtcars, cor.test(mpg, disp))
Pearson's product-moment correlation
data: mpg and disp
t = -8.7472, df = 30, p-value = 9.38e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9233594 -0.7081376
sample estimates:
cor
-0.8475514
How about between engine displacement and horsepower? You do it!
Is there a relationship between transmission type and number of forward gears,
both considered as factors?
> my.table = with(mtcars, table(am, gear))
> my.table
gear
am 3 4 5
0 15 4 0
1 0 8 5
> chisq.test(my.table)
Pearson's Chi-squared test
data: my.table
X-squared = 20.9447, df = 2, p-value = 2.831e-05
Warning message:
In chisq.test(my.table) : Chi-squared approximation may be incorrect
The warning is telling us that some of the expected frequencies are less than
five, possibly making the chi-square approximation inaccurate.
> chisq.test(my.table) $ expected
gear
am 3 4 5
0 8.90625 7.125 2.96875
1 6.09375 4.875 2.03125
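When a warning like this appears, one standard fallback (not otherwise covered in this guide) is Fisher's exact test, which is also in base R and needs no chi-square approximation:

```r
data(mtcars)
my.table = with(mtcars, table(am, gear))
fisher.test(my.table)   # exact p-value for the same 2x3 table
```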
-------------------------------------------------------------------------------
Housekeeping.
The workspace is an area in the computer's memory where R stores stuff you've
created in the Console or have read into R from an external file.
> ls() # shows the names of objects you've created
> remove(object.name) # deletes the named object; rm() also works
Note: it's best to keep your workspace cleaned up by deleting unused objects.
The working directory is the directory (folder) on your hard drive where R
expects to find things, such as data files, that you ask for.
> getwd() # get the name of the working directory
> setwd("dirname") # set (change) the working directory
> dir() # get a dir listing of the working directory
Note: In Windows it's best to use the pulldown menus to change directories
rather than attempt to navigate through the obtuse Windows file system. There
are also pulldown menus in OS X, but there are no menus in Linux.
If your R session becomes long, and you've typed in a lot of data that you don't
want to risk losing if the power goes out, you can do the following:
> save.image()
This will save a copy of your workspace to the working directory. It does the
same thing as choosing "yes" when you quit R. This will write over any data
file that has been saved previously, so be careful about that. The created file
will be named .RData, which will be an invisible file in unixy operating systems
such as OS X and Linux.
If you want to save a lot of these without writing over old ones, just give the
file a name.
> save.image(file = "mydata.RData")
You can create as many of these as you want, but be aware that they will not be
automatically loaded next time you start R. To load the file after R has been
started, do this.
> load(file = "mydata.RData")
This will add the contents of mydata.RData to anything you already have in your
workspace. Be aware that if there are any variable or data names in common, the
version in the workspace will be overwritten by the version in mydata.RData.
There is one more place you ought to be aware of in R called the search path.
You can see it as follows:
> search() # results will differ in Windows vs. OS X
The search path contains the places R will look for stuff when you ask for a
function or a data set. The first entry is ".GlobalEnv", the global environment,
which is R's name for the workspace. When you ask for a function or data, R will
look in the workspace first. Then it will look through each of those other
locations in turn. If it doesn't find what you asked for, you will get a "not
found" error.
It's possible to add and remove things from the search path. This can be a
fairly dangerous practice for newbies, however, so I'm not going to tell you
how to do it at this time.
-------------------------------------------------------------------------------
revised 2016 January 12; updated 2016 April 7
-------------------------------------------------------------------------------