PSYC 480 -- Dr. King

A Quick Introduction to R

1) You can get a copy for your own computer FREE -- not a stripped down copy,
not a copy that will expire in 30 days -- the real deal.

   a) There is a link to the R-project website at the course webpage if you
      want to download it. There are versions for Linux, Mac, and Windows.
      (Sorry, no Chromebook version.)
   b) All of these work the same. If you can use one, you can use them all.
      The Windows and Mac versions look somewhat different, but they work the
      same once you're in the R Console.
   c) You may have to change your security settings on a Mac, since this is
      not a download from the App Store. (There are versions on the App Store
      for iPhone and iPad. I DO NOT recommend them.)
   d) I also don't recommend that you install RStudio on your computer. For
      one thing, it's unnecessary.

2) R is installed on all the computers in the library and Bryan Information
Commons and also in most computer labs around campus. You can also put a
version on a flash drive that will run on any Windows computer. (Sorry, this
doesn't work with Macs.)

3) Start it by clicking (or double clicking) the desktop icon. (Windows will
create this icon for you automatically. On a Mac, you have to drag an icon
into your dock from the Applications folder.)

Important note: on university computers, ALWAYS start R from the desktop icon
("shortcut" in Windows speak). NEVER go to the Start menu!!!

4) The start-up screen will look something like this. (This is an old
version.)

================================================================
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin10.8.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[R.app GUI 1.65 (6833) x86_64-apple-darwin10.8.0]

>
================================================================

5) Behold! A good old-fashioned command line program! Not much to click here.
You're going to have to type your commands at the command prompt, which is
the greater-than sign (>) at the bottom of the R Console window.

6) R is spelling and case sensitive. If you get a "not found" error, that's
probably why. It's also syntax sensitive. Errors in syntax will usually
result in getting a "continuation prompt" (example below).

7) When you type a command in R, don't type the >. R supplies that for you.
Also, always remember to press Enter (Return on a Mac) when you finish typing
the command. Otherwise, R will not know what to do, and it will wait
patiently for further instructions. And R has infinite patience!

> rivers   # Note: a built-in data set (length of N. American rivers in miles)
  [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280  315  870
 [16]  906  202  329  290 1000  600  505 1450  840 1243  890  350  407  286  280
 [31]  525  720  390  250  327  230  265  850  210  630  260  230  360  730  600
 [46]  306  390  420  291  710  340  217  281  352  259  250  470  680  570  350
 [61]  300  560  900  625  332 2348 1171 3710 2315 2533  780  280  410  460  260
 [76]  255  431  350  760  618  338  981 1306  500  696  605  250  411 1054  735
 [91]  233  435  490  310  460  383  375 1270  545  445 1885  380  300  380  377
[106]  425  276  210  800  420  350  360  538 1100 1205  314  237  610  360  540
[121] 1038  424  310  300  444  301  268  620  215  652  900  525  246  360  529
[136]  500  720  270  430  671 1770

(Note: in R, # means note or comment.
R ignores everything after #, and you don't have to type it if you are
following along in your own R installation.)

Some errors follow. R is case sensitive and spelling sensitive. If you get a
"not found" error, check that first. R commands are usually lower case, but
not always.

> Rivers            # capitalized when it shouldn't be
Error: object 'Rivers' not found
> median(rivers)    # one of many built-in statistical functions
[1] 425
> medium(rivers)    # misspelled
Error: could not find function "medium"

8) R's statistical functions always have parentheses at the end, even if
there is nothing inside them. The Mac R editor will close the parentheses for
you. The Windows editor will not. Always remember to type the close
parenthesis.

> median(rivers     # Note: failed to close the parentheses (syntax error)
+

The plus sign (when supplied by R at the beginning of a line) does not mean
add. It is the continuation prompt. R is telling you that you didn't finish
typing the command and wants you to give it more. DO NOT JUST START TYPING
NEW STUFF. Press the Esc key (upper left corner of keyboard) to return to the
command prompt and start again. (Write this on the back of your hand! You're
going to need it!!! Press Esc to abort any command in the middle.)

9) On the other hand, R rarely cares about spacing. (Just don't put spaces in
the names of things. R will probably not understand South Carolina. It will
understand SouthCarolina, South.Carolina, or South_Carolina.)

> median ( rivers ) # add spacing except in the middle of the names of things
[1] 425

10) Stuff that you enter into R (data, etc.) will be found in your
"workspace." To see the contents of your workspace, use the ls() command.

> ls()              # and press Enter; you always have to press Enter
character(0)

The parentheses are required even though there is nothing inside them. The
output appears on the next line. In this case, "character(0)" is R's way of
telling you that your workspace is empty.
11) Entering your own data can be done in many different ways. For now, I'll
show you two of the simplest. If you want to enter a "bunch of numbers" that
you got from a group of subjects, you will be entering it into a "vector."
(Don't let the fancy term intimidate you. It just means a bunch of numbers.)
One way to enter numeric data is to use the c() function ("combine").

> nonsmokers = c(18,22,21,17,20,17,23,20,22,21)   # Note: spacing is optional

   a) The single = sign is the assignment operator. It creates things in your
      workspace ("stores" them).
   b) If you want to "store" or "save" something in your workspace, give it a
      name, type =, and then type what you want "stored" with that name.
   c) If there is no = sign, you are NOT creating anything in your workspace
      (very few exceptions to this).

12) The numbers inside c() must be separated by commas. You can put a space
after each comma if that makes you happy.

13) Now the ls() function will reveal that this "data object" has been added
to your workspace. The [1] is a place marker. You can ignore it.

> ls()
[1] "nonsmokers"

14) If at any time you want to see what is in "nonsmokers" (or anything else
in your workspace), just type its name at the command prompt and press Enter.
Again, ignore the [1]. It's not part of your data.

> nonsmokers
 [1] 18 22 21 17 20 17 23 20 22 21

But notice it must be spelled and capitalized correctly!

> Nonsmokers
Error: object 'Nonsmokers' not found

15) The summary() function can also be handy for summarizing any data vector.

> summary(nonsmokers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  17.00   18.50   20.50   20.10   21.75   23.00

16) ALWAYS CHECK to make sure the data have been entered correctly! R won't
make a mistake, but you might. The most common reason for getting a wrong
answer when using statistics software is entering the data incorrectly. And a
mistake is a mistake. You don't get credit for a wrong answer because you
entered data incorrectly.
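One possible way to carry out the check in item 16 is sketched below. It is
only a suggestion, not a required procedure. length(), sum(), range(), and
sort() are ordinary base R functions, and the totals mentioned in the comments
were worked by hand from the ten nonsmoker scores in item 11.

```r
# Re-enter the nonsmoker scores exactly as in item 11
nonsmokers = c(18,22,21,17,20,17,23,20,22,21)

length(nonsmokers)   # how many scores were entered; should match your n (10 here)
sum(nonsmokers)      # compare with a calculator total of the raw data (201 here)
range(nonsmokers)    # smallest and largest values; catches typos like 220 for 22
sort(nonsmokers)     # sorted copy, easy to compare with a sorted data sheet
```

None of these commands uses =, so nothing in your workspace is changed by
running them.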
17) If you made a mistake entering data, start again! (There are ways to fix
mistakes, but you need more experience with R before they will make sense.)

18) Another way to enter numbers is with the scan() function. This is more
like what you would do if you were typing these numbers into a spreadsheet.
DON'T just type scan(). Give your data object a name by typing the name you
want followed by = followed by scan(). Remember: no =, no assignment.

> smokers = scan()     # press Enter here
1: 16                  # press Enter ONCE after entering each value
2: 20
3: 14
4: 21
5: 20
6: 18
7: 13
8: 15
9: 17
10: 18
11:                    # press Enter again to terminate data entry
Read 10 items
> smokers
 [1] 16 20 14 21 20 18 13 15 17 18

It will also work this way.

> smokers=scan()
1: 16 20 14 21 20 18 13 15 17 18     # press Enter (notice: no commas!)
11:                                  # press Enter
Read 10 items

19) Here are some things you can do with your numbers now that they are
entered. The commands are called "functions" and all R functions are followed
by open and close parentheses, even when there is nothing inside them, such
as ls(). Hint: where have you seen this notation before?
> sum(smokers)                   # add 'em up
[1] 172
> smokers^2                      # square them all
 [1] 256 400 196 441 400 324 169 225 289 324
> sum(smokers^2)                 # square and then sum
[1] 3024
> sum(smokers)^2                 # sum and then square
[1] 29584
> length(smokers)                # get n
[1] 10
> sqrt(10)                       # square root function
[1] 3.162278
> 2024 - 29584 / 10              # calculate SS (incorrectly)
[1] -934.4
> 3024 - 29584 / 10              # calculate SS (correctly)
[1] 65.6
> sum(smokers^2) - sum(smokers)^2 / length(smokers)     # all at once
[1] 65.6
> median(smokers)                # statistical functions
[1] 17.5
> mean(smokers)                  # sample mean
[1] 17.2
> var(smokers)                   # sample variance
[1] 7.288889
> sd(smokers)                    # sample standard deviation
[1] 2.699794
> t.test(smokers, mu=20)         # single-sample t-test

        One Sample t-test

data:  smokers
t = -3.2796, df = 9, p-value = 0.009535
alternative hypothesis: true mean is not equal to 20
95 percent confidence interval:
 15.26868 19.13132
sample estimates:
mean of x
     17.2

> t.test(nonsmokers, smokers, var.eq=T)     # t-test (pooled variance)

        Two Sample t-test

data:  nonsmokers and smokers
t = 2.6659, df = 18, p-value = 0.01575
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6145934 5.1854066
sample estimates:
mean of x mean of y
     20.1      17.2

20) To clear things from your workspace, use rm(). There is a menu option for
clearing the entire workspace, but use it cautiously. There is no going back!
Once something has been removed, it's GONE!!!

> rm(smokers, nonsmokers)        # Hope you meant it! There is no getting it back!
> ls()
character(0)

-------------------------------------------------------------------------------
This should be enough for the first exposure. You will want to PRACTICE.
Always remember and don't ever forget: you cannot learn to play the piano by
watching me play the piano.

Here are some data you can practice with (from the review).
Adjective
10 7 14 7 13 11 13 11 13 11 14 22 18 13 15 17 10 12 12 15
or
10,7,14,7,13,11,13,11,13,11,14,22,18,13,15,17,10,12,12,15

Control
10 14 5 11 12 19 15 14 10 10 18 21 21 18 16 22 22 22 18 15
or
10,14,5,11,12,19,15,14,10,10,18,21,21,18,16,22,22,22,18,15

Counting
6 10 7 9 6 7 6 9 4 6 7 9 7 7 5 7 7 5 4 7
or
6,10,7,9,6,7,6,9,4,6,7,9,7,7,5,7,7,5,4,7

Imagery
23 19 10 12 11 10 16 12 10 11 15 16 14 20 17 18 15 20 19 22
or
23,19,10,12,11,10,16,12,10,11,15,16,14,20,17,18,15,20,19,22

Rhyming
6 6 10 8 10 3 5 7 7 7 6 10 6 8 10 7 7 9 9 4
or
6,6,10,8,10,3,5,7,7,7,6,10,6,8,10,7,7,9,9,4

Special note: c() can also be used to combine already existing vectors.

All = c(Adjective, Control, Counting, Imagery, Rhyming)   # spaces optional
-------------------------------------------------------------------------------

21) I have put a large number of the datasets we work with in class up at the
website. For example, Scott Keats' data is in a file called marijuana.txt. To
get it, do the following, typing VERY CAREFULLY!

FOLLOWING IS THE STANDARD PROCEDURE FOR FETCHING DATA FROM THE WEBSITE:
-----------------------------------------------------------------------

> file = "http://ww2.coastal.edu/kingw/psyc480/data/marijuana.txt"
> MJ = read.table(file=file, header=T, stringsAsFactors=T)

Notice this produced no output. R is not chatty. When you tell R to do
something, it just does it. If something goes wrong, you will get an error or
a warning. Here, you didn't ask for output, you asked for an assignment to
your workspace. If you're a Luddite (like me) and don't trust that R did what
you told it to do, you can always check.

> ls()
[1] "file" "MJ"

22) The data will come in a form called a "data frame," which is a
rectangular array of data with each variable in a column and each subject
(for now) in a row. You must give your data frame a name. Otherwise, no
assignment occurs. We have named it MJ. To see the contents of MJ, as always
you can just type its name at the prompt.
> MJ
       group score
1     smoker    16
2     smoker    20
3     smoker    14
4     smoker    21
5     smoker    20
6     smoker    18
7     smoker    13
8     smoker    15
9     smoker    17
10    smoker    18
11 nonsmoker    18
12 nonsmoker    22
13 nonsmoker    21
14 nonsmoker    17
15 nonsmoker    20
16 nonsmoker    17
17 nonsmoker    23
18 nonsmoker    20
19 nonsmoker    22
20 nonsmoker    21

Many data frames will be too big to see on a single screen, and they will
scroll by too quickly to be inspected. You can look at just the first few
lines of the data frame, if you want.

> head(MJ)
   group score
1 smoker    16
2 smoker    20
3 smoker    14
4 smoker    21
5 smoker    20
6 smoker    18

23) Here are some functions for getting information about a data frame.

> dim(MJ)           # how big (rows by columns)
[1] 20  2
> names(MJ)         # names of variables (column names)
[1] "group" "score"
> summary(MJ)
       group        score
 nonsmoker:10   Min.   :13.00
 smoker   :10   1st Qu.:17.00
                Median :19.00
                Mean   :18.65
                3rd Qu.:21.00
                Max.   :23.00

Notice the difference in the way R summarizes numeric variables vs.
categorical variables. (Unlike in SPSS, you don't have to tell R which is
which. R is smart enough to figure it out, usually.)

24) R will not look inside a data frame without your permission. (This is for
your protection. It makes it hard to accidentally alter your data. There is
also another reason: you can have two or more active data frames at one time.
Try that in SPSS! In R, anything that's in your workspace is active.) For
example, if you want to see the mean of values in the variable "score", this
will not work.

> mean(score)
Error in mean(score) : object 'score' not found

25) One way to give R permission to look inside your data frame is to prefix
the name of the variable with the name of the data frame and a $ sign.

> MJ $ score        # the spaces are optional
 [1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21
> mean(MJ$score)
[1] 18.65
> var(MJ$score)
[1] 7.818421

A second way to do it is with with().

> with(MJ, mean(score))
[1] 18.65

It's a good idea to keep your data frame names SHORT!
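If the website cannot be reached, a data frame with the same layout as MJ can
be built by hand. What follows is only a sketch under that assumption:
data.frame(), factor(), and rep() are base R functions that this handout does
not otherwise cover, and the scores are the smoker and nonsmoker values typed
in earlier.

```r
# Hand-built stand-in for MJ (this overwrites any MJ already in your workspace)
smokers    = c(16,20,14,21,20,18,13,15,17,18)
nonsmokers = c(18,22,21,17,20,17,23,20,22,21)

MJ = data.frame(
  group = factor(rep(c("smoker", "nonsmoker"), each=10)),  # 10 labels of each kind
  score = c(smokers, nonsmokers)                           # smokers first, as in the file
)

dim(MJ)                  # rows by columns
mean(MJ$score)           # data frame name, $ sign, variable name
with(MJ, mean(score))    # the same request using with()
```

Both ways of asking for the mean should agree with the 18.65 shown in item 25.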
26) When your data are in a data frame (almost always!), getting statistics
with the data broken into groups is just a tad tricky. One way to do it is to
use the by() function. The syntax is by(DV, IV, function). The IV must be a
categorical (nominal or "group naming") variable.

> by(MJ$score, MJ$group, mean)      # calculate means by group
MJ$group: nonsmoker
[1] 20.1
-----------------------------------------------------------------
MJ$group: smoker
[1] 17.2
> by(MJ$score, MJ$group, sd)        # calculate standard deviations by group
MJ$group: nonsmoker
[1] 2.13177
-----------------------------------------------------------------
MJ$group: smoker
[1] 2.699794

27) Another somewhat more cryptically named function, tapply(), works the
same. The output is formatted differently. (I prefer, and will use,
tapply().)

> tapply(MJ$score, MJ$group, var)   # compare the output to that of by()
nonsmoker    smoker
 4.544444  7.288889

28) Statistical significance tests on data in a data frame require a "formula
interface": DV ~ IV, data=. The ~ is a tilde, top left of keyboard, just
under the esc key. You have to know your DV from your IV to do this!

> t.test(score ~ group, data=MJ, var.eq=T)   # (DV ~ IV, data=, other options)

        Two Sample t-test

data:  score by group
t = 2.6659, df = 18, p-value = 0.01575
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6145934 5.1854066
sample estimates:
mean in group nonsmoker    mean in group smoker
                   20.1                    17.2

29) In the t.test() function, var.eq= is called an option. It modifies the
way the t.test() function works. Without it, R would not pool the variances
for the t-test. (Try it.) Options are used all the time in R to modify the
way functions work.
> mean(rivers, trim=.1)   # trim off upper and lower 10% of scores (trimmed mean)
[1] 490.9469
> t.test(score ~ group, data=MJ, alternative="greater")   # one-tailed test

        Welch Two Sample t-test

data:  score by group
t = 2.6659, df = 17.081, p-value = 0.008123
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 1.008154      Inf
sample estimates:
mean in group nonsmoker    mean in group smoker
                   20.1                    17.2

Note: for directional tests, you have to know that R will see the levels of
your IV in alphabetical order (unless you've told it otherwise). Thus, the
subtraction will be nonsmoker - smoker, and we would expect the answer to be
"greater" than zero.

30) ANOVA. The function for doing ANOVA is aov(). The syntax is the same as
the t.test() function (formula interface required for ANOVA).

> aov(score ~ group, data=MJ)
Call:
   aov(formula = score ~ group, data = MJ)

Terms:
                 group Residuals
Sum of Squares   42.05    106.50      <- explained and unexplained var.
Deg. of Freedom      1        18

Residual standard error: 2.43242      <- pooled standard deviation
Estimated effects may be unbalanced

31) However, the aov() function produces so much output that it really needs
to be stored and then looked at with various R functions. You can name the
output of a statistical test anything you want, but I'm not very creative, so
I call my ANOVA output aov.out.

> aov.out = aov(score ~ group, data=MJ)   # btw, this overwrites any old aov.out
> summary(aov.out)                        # anova summary table
            Df Sum Sq Mean Sq F value Pr(>F)
group        1  42.05   42.05   7.107 0.0158 *
Residuals   18 106.50    5.92
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> TukeyHSD(aov.out)                       # kinda pointless here (why?)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = score ~ group, data = MJ)

$group
                 diff       lwr        upr     p adj
smoker-nonsmoker -2.9 -5.185407 -0.6145934 0.0157501

32) Let's get a couple more datasets.
> file = "http://ww2.coastal.edu/kingw/psyc480/data/firemen.txt"
> PRIN = read.table(file=file, header=T, stringsAsFactors=T)
> file = "http://ww2.coastal.edu/kingw/psyc480/data/tresselt.txt"
> TRES = read.table(file=file, header=T, stringsAsFactors=T)
> ls()
[1] "file" "MJ"   "PRIN" "TRES"

Question: Why is there only one "file" in the workspace?

These are Tom Prin's firemen data and James Tresselt's data on CCU freshmen.
They collected these data as part of their Psyc 497 projects.

> summary(PRIN)
     Rotter              Area    Risk
 Min.   : 4.00   Charleston:25   A:26
 1st Qu.: 8.00   Horry     :25   B:33
 Median :10.00   NYC       :25   C:16
 Mean   : 9.96
 3rd Qu.:12.00
 Max.   :16.00
> with(PRIN, tapply(Rotter, Risk, mean))
        A         B         C
 8.961538 10.212121 11.062500
> aov.out = aov(Rotter ~ Risk, data=PRIN)   # weighted means oneway ANOVA
> summary(aov.out)
            Df Sum Sq Mean Sq F value Pr(>F)
Risk         2   47.5  23.733   2.919 0.0604 .
Residuals   72  585.4   8.131
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> TukeyHSD(aov.out)       # not justified since null was not rejected
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Rotter ~ Risk, data = PRIN)

$Risk
         diff         lwr      upr     p adj
B-A 1.2505828 -0.53883983 3.040005 0.2227360
C-A 2.1009615 -0.06728435 4.269207 0.0595346
C-B 0.8503788 -1.22841605 2.929174 0.5924549

> aov.out = aov(Rotter ~ Area * Risk, data=PRIN)   # factorial ANOVA
> summary(aov.out)
            Df Sum Sq Mean Sq F value Pr(>F)
Area         2   13.5   6.760   0.805  0.451
Risk         2   41.5  20.767   2.474  0.092 .
Area:Risk    4   23.9   5.968   0.711  0.587
Residuals   66  554.0   8.393
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

REMEMBER: R does Type I sums of squares by default, probably not what you
want here. To get the Type III sums of squares, you need to download a
special function that I've written and have put at the website. We'll discuss
this when the time comes to cover factorial ANOVA.
(Another way to do it would be to download and install an optional package
that will do Type III sums of squares, such as "car".)

33) Regression. The function for regression analysis is lm(), which stands
for linear model. The syntax is the same.

> summary(TRES)
    Gender         SATV            SATQ            SATT            GPA91
 Female:133   Min.   :240.0   Min.   :260.0   Min.   : 560.0   Min.   :0.071
 Male  :118   1st Qu.:350.0   1st Qu.:400.0   1st Qu.: 770.0   1st Qu.:1.800
              Median :390.0   Median :450.0   Median : 840.0   Median :2.308
              Mean   :402.4   Mean   :453.1   Mean   : 855.5   Mean   :2.392
              3rd Qu.:440.0   3rd Qu.:505.0   3rd Qu.: 930.0   3rd Qu.:3.000
              Max.   :700.0   Max.   :690.0   Max.   :1290.0   Max.   :4.000
     HSGPA           orient            sex
 Min.   :1.500   Min.   :0.0000   Min.   :0.0000
 1st Qu.:2.450   1st Qu.:0.0000   1st Qu.:0.0000
 Median :2.751   Median :1.0000   Median :0.0000
 Mean   :2.772   Mean   :0.5259   Mean   :0.4701
 3rd Qu.:3.099   3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :4.000   Max.   :1.0000   Max.   :1.0000
> lm.out = lm(GPA91 ~ SATT, data=TRES)   # simple regression (one predictor)
> summary(lm.out)

Call:
lm(formula = GPA91 ~ SATT, data = TRES)

Residuals:
     Min       1Q   Median       3Q      Max
-2.24529 -0.59698 -0.01629  0.56758  1.80136

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5711142  0.3303657   1.729   0.0851 .
SATT        0.0021283  0.0003818   5.574 6.47e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7822 on 249 degrees of freedom
Multiple R-squared:  0.1109,    Adjusted R-squared:  0.1073
F-statistic: 31.06 on 1 and 249 DF,  p-value: 6.473e-08

> lm.out = lm(GPA91 ~ SATT + HSGPA, data=TRES)   # multiple regression
> summary(lm.out)

Call:
lm(formula = GPA91 ~ SATT + HSGPA, data = TRES)

Residuals:
     Min       1Q   Median       3Q      Max
-1.98933 -0.45321  0.03021  0.46506  1.71818

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.9197904  0.3599112  -2.556   0.0112 *
SATT         0.0015258  0.0003552   4.296 2.50e-05 ***
HSGPA        0.7236731  0.0970904   7.454 1.51e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7084 on 248 degrees of freedom
Multiple R-squared:  0.2736,    Adjusted R-squared:  0.2678
F-statistic: 46.71 on 2 and 248 DF,  p-value: < 2.2e-16

Now, what is in your workspace? (Every time you use =, you put something in
your workspace.)

> ls()
[1] "aov.out" "file"    "lm.out"  "MJ"      "PRIN"    "TRES"

34) I am an old command line guy from way back, but here's something that
even catches me up occasionally. In the R Console, NOTHING above the current
command line ever changes. That's history. If you want to see new info, you
have to ask for it with a new command. Never look above the command line
where you typed a command to see output of that command.

35) One final note. To save your workspace without quitting, do this. (This
will work on university computers ONLY if you started R from the desktop
icon.)

> save.image()                      # save an image of the workspace

Note: R works almost exactly the same on a Mac as it does in Windows. Here is
one spot where there is a minor difference. On a Mac, save.image() creates an
invisible file. (If you don't know what that is, don't worry about it.) To
create a visible named file (Mac and Windows), do this.

> save.image("FredsData.RData")     # save it with a customized name

On the lab computers, this will create a file (icon on your desktop) that
contains your data. You can use your own name instead of Fred. Copy this file
to a flash drive if you want to take your data with you.

36) The function used to quit R is very cryptic and will tax your memory.
(You can also quit like you quit any other program in your operating system.)

> quit()

You will be asked if you want to save your workspace. If you do, your data
will be there waiting for you the next time you start R (provided no one else
has messed with it).

37) Have questions? ASK! I am happy to help. For more on R, check out the R
Tutorials linked to at the website.
38) Datasets at the website that you can practice with:

Brooks.txt    (with a dummy coded sex variable)
firemen.txt   (more than two groups, so no t-tests)
Eysenck.txt   (more than two groups, so no t-tests)
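
For practice that does not depend on the website, here is one possible way to
work with the five practice vectors listed just before item 21. Treat it as a
sketch rather than a required method: the name Group is made up for
illustration, and factor() and rep() are base R functions that this handout
does not otherwise cover.

```r
# The five practice vectors, entered with c()
Adjective = c(10,7,14,7,13,11,13,11,13,11,14,22,18,13,15,17,10,12,12,15)
Control   = c(10,14,5,11,12,19,15,14,10,10,18,21,21,18,16,22,22,22,18,15)
Counting  = c(6,10,7,9,6,7,6,9,4,6,7,9,7,7,5,7,7,5,4,7)
Imagery   = c(23,19,10,12,11,10,16,12,10,11,15,16,14,20,17,18,15,20,19,22)
Rhyming   = c(6,6,10,8,10,3,5,7,7,7,6,10,6,8,10,7,7,9,9,4)

# Combine the existing vectors into one DV, as in the special note
All = c(Adjective, Control, Counting, Imagery, Rhyming)

# Hypothetical IV: 20 group labels for each condition, in the same order as All
Group = factor(rep(c("Adjective","Control","Counting","Imagery","Rhyming"), each=20))

tapply(All, Group, mean)        # group means
aov.out = aov(All ~ Group)      # oneway ANOVA; overwrites any old aov.out
summary(aov.out)                # anova summary table
```

Because All and Group sit loose in the workspace rather than inside a data
frame, no data= option is needed in aov() here.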