PSYC 480 -- Dr. King

Introduction to R

Some things you should know about R:

1) Get it from www.r-project.org under the CRAN link at the left.
   a) On the left side of the page, find "Downloads, Packages"
   b) Click on the CRAN link
   c) The next page is a page of mirror sites from which R can be downloaded - choose one in the USA (if that's where you are)
   d) On the next page under "Download and Install R," choose your operating system: "Download R for MacOS X" or "Download R for Windows" (if you are using Linux, the easiest way to get R installed is via your package manager)
   e) For Mac people, unless you are running a really old version of OS X (Tiger or older), download the "latest version" of R
      i) at the time of this writing, that is R-2.14.1.pkg
      ii) it will probably download to your Downloads folder, depending upon how you have your computer set up
      iii) it is not a simple drag-and-drop installation
      iv) double-click the R-2.14.1.pkg icon and follow the on-screen instructions
      v) when the installation is complete, drag a copy of the program icon from your Applications folder to the Dock
   f) For Windows people, click the "base" link
      i) click the "Download R 2.14.1 for Windows" link
      ii) an installer called R-2.14.1-win.exe will appear on your desktop or in your Downloads folder, depending upon how you have your computer set up
      iii) double-click this installer icon
      iv) the installation will put a shortcut on your desktop

2) After installing it, start R by clicking the shortcut on your desktop or the R icon you have dragged into your Dock.

3) If you are in Windows, there is one thing I would change (although this is optional). Start R, pull down the Edit menu, and select GUI preferences. When that window opens, on the very first line you will see the following choice: Single or multiple windows _MDI _SDI. I prefer SDI (which oddly enough stands somehow for multiple windows). You can switch back and forth and see which one suits you. Notice the font is set for Courier New.
DO NOT CHANGE THIS, even though you may think Courier is not a very attractive font. In fact, you probably shouldn't fool with very many of these settings. When you are done with this, close the preferences window, then shut down R and restart it. (Click on the red X at the upper right of the R window and choose not to save the workspace.)

4) R is a good old-fashioned command line program. That means you control it by typing commands at the command prompt, which is a greater than symbol at the bottom of the window, >. There are a few menus at the top of the R Console window, but for the most part they are not very useful.

5) Typing in R is case sensitive. That means "mydata", "MyData", and "MYDATA" refer to different data sets. Those of you who are used to Windows' case insensitivity may find this takes some getting used to. You also have to spell things correctly. So using R takes some attention to detail.

6) The commands you'll type are always followed by parentheses, even if there is nothing inside those parentheses. For example (go ahead and type this)...

> ls()

When you are done typing your command, press the Enter (or Return) key. There is nothing to click. I'll explain what this command does shortly, but to get a different result, try typing it without the parentheses (and pressing Enter--always press Enter). Don't worry if this happens to you every once in a while. It happens to me, too. Just go back and type the command correctly. (Typing a command without the parentheses prints out the source code for that command, or in other words, how the command is programmed to work. Unless you are a programmer, this will not be useful to you. It does no harm, however. Just retype the command correctly and move on!)

7) Anything following a pound symbol in R (#) is a comment or a note. R will ignore it, and when you see it on my handouts, you don't have to type that part. It is a good way to keep track of what you're doing, however.

> ls()     # this is a note to myself and R will ignore it
> # this is also a note to myself and R is still ignoring it!
> # you can use notes like this to annotate your work so you can
> # remember what you've done (new line needs a new #)
> ### you can use as many as you want; the first one makes it a note

8) There are three places in R that you have to be aware of: 1) the workspace, 2) the working directory, and 3) the search path.

9) Any data or other information you enter into R is stored in the workspace as an "object". To see the objects you've created in your workspace, type:

> ls()       # and remember to press the Enter key; objects() does the same thing
> dir()      # this shows you the contents of your working directory
> search()   # this shows you the search path; don't worry what that is

The ls() command stands for "list the objects in the workspace." Right now R is probably saying "character(0)", which is R's rather cryptic way of saying there is nothing in the workspace. You haven't created any data objects.

10) The workspace is in RAM. What this means to you is that if the power goes off or your laptop battery dies, all your hard work is lost! You may want to save it every once in a while. Let's create a data object, check to see that it has been created, and then save it. But first, if ls() does not give character(0) but lists some data objects, erase them so that you are starting fresh. Do this in Windows by going to the Misc menu and choosing Remove all objects. (You'll be asked if you really mean it, then you'll see the command that does it appear at the command prompt.) On a Mac, go to the Workspace menu and choose Clear Workspace. (You'll also be asked if you really mean it, but nothing will appear at the command prompt.) If you want to be a command line purist (we never touch the mouse!), execute this command at the command prompt:

> rm(list=ls())     # don't do this unless you really mean it!

That's really not as cryptic as it looks.
What you're saying is remove (rm) the list of objects that would be returned by the ls() command. In other words, remove everything! You will NOT be asked if you really mean it. It will just happen. So you'd better mean it!

This is a characteristic of command lines, by the way. They don't mollycoddle you. They assume you know what you're doing and do what you tell them to do. So once again, a little care can save you from disaster! If you ask for something to be removed, it will be removed, and you won't be asked if you really mean to. It will also not be sent to the trash. It will be gone! If you give a new data object a name that already is being used for a data object in your workspace, the old data object will be overwritten without a warning. So be careful.

Now let's create something and save it.

> x = 8

You have just created a data object called x. Notice that R doesn't inform you of this. It just does it. In R, when things go well, R is usually silent. When things go badly, you'll get an error message. For example...

> y = .8.
Error: unexpected symbol in "y=.8."

Too many decimal points there! And R is telling you it doesn't understand what you've just asked it to do. Expect to be told that a lot! Typing errors are common, and MOST OF THE TIME they are harmless. By the way, in R an error means the command was not executed. A warning means the command was executed, but the result may not be what you expected.

> ls()     # list the contents of your workspace
[1] "x"

R is now telling you that you have a data object called x in your workspace. (Never mind the 1 inside the brackets for now.) If you want to see what you have stored in a data object, just ask to print it:

> print(x)     # and remember to press the Enter key; I won't tell you again
[1] 8

Your data object, x, has the value 8 in it (and once again, ignore the [1]). There is a shortcut for doing this. You don't really need to use the print() function. All you need to do is type the name of the data object.

> x     # no parentheses; it's not a command but you still need to press Enter
[1] 8

Now let's save it before the power goes out.

> save.image()     # and of course you've remembered to press Enter!

Done! R has written your workspace to your computer's hard drive. When you quit R...

> quit()

R will ask if you want to save the workspace. If you say yes, R will execute the save.image() command, and any old workspace will be overwritten. If you say no, R will quit and leave the old workspace intact. The next time you start R, if there is a workspace file present, R will load it automatically. So choose "yes" (save the workspace). R will quit. Now start it up again. Notice at the bottom of the introductory message it says "[Previously saved workspace restored]". Confirm this by asking to see its contents.

> ls()
[1] "x"
> x
[1] 8

The data object you created in the previous session has been restored. In short, if you want to save your work at the end of a session (and erase any previous work that may have been saved), choose "yes" when R asks if you want to save the workspace. You can accomplish the same thing as follows:

> quit("yes")

R will quit and save the workspace without nagging you about it.

11) Understanding what happens to the workspace when you quit R is a bit confusing at first, but important. Let's suppose you entered a hundred data values into R, saved them, and then accidentally erased them with a careless rm(). Is there any way to get them back? There are, in fact, two ways, which I will demonstrate. We'll use a data set containing only four values to save some time. Just type the commands as you see them, and don't worry too much about the syntax just now.

> ls()
[1] "x"
> rm(x)                  # get rid of that
> ls()                   # confirm that it's gone
character(0)
> x = c(8, 10, 11, 9)    # put four values into x; think of c() as a container
> x                      # look at them
[1]  8 10 11  9
> save.image()           # save them
> rm(x)
> ls()                   # oh no! what have I done?
character(0)

Fortunately, you have saved the data to your working directory with the save.image() command. There are two ways to get them back. The first is simply to load them using the load() command.

> load(".RData")     # workspace is by default saved in a file called .RData
> x
[1]  8 10 11  9

Now we'll remove them again.

> rm(x)
> ls()
character(0)
> x
Error: object 'x' not found

Another way to get them back is to quit R without saving the workspace, and then to restart R.

> quit("no")     # "no" means do not save the workspace

### R is being restarted here ###

> x
[1]  8 10 11  9

Think about what just happened here and you will be well on your way to understanding how R deals with the workspace when it quits. The previously saved workspace has been restored. All this depends upon your having saved the workspace, however. Without that, rm() would have permanently erased your data! On the other hand, the save.image() command will write over any previously saved workspace. So what will this do?

> rm(x)
> save.image()
> quit("no")

Can you recover the values you saved in x now? No, you cannot! Think of your workspace as a document you are preparing. Every time you modify it and save it, the old version is overwritten (erased). If you quit and save, the new version is saved and the old version erased. If you quit without saving, the new version is lost and the old version retained. It's like any other computer program actually. It just takes some getting used to in R because things happen without warning or nagging. R assumes that what you've asked it to do is what you wanted it to do. It will rarely try to second guess you or ask you "are you sure?" So THINK before you act!

12) Where did it get saved to? R wrote your workspace to the working directory. This is a folder somewhere on your computer where R writes things (unless you tell it to write elsewhere) and also expects to find things you tell it to read or load.
To find out the name of this folder, do this:

> getwd()     # this means get (the name of the) working directory
[1] "C:/Documents and Settings/kingw/My Documents"

On the computer I am using at the moment, an old Windows XP computer, the working directory is "My Documents". This has changed in Vista and Windows 7, and is also different in Mac OS X. What YOU are probably seeing at this moment is the name of your user folder. This is called the default working directory, the one R uses unless you tell it to do otherwise. Don't worry about the details right now, but eventually you'll need to know how to find this folder and place files in it so that R can read them.

13) To see the contents of this working directory, do this:

> dir(all=T)     # just dir() will work, too; output not shown

This will return what in the good old days we called a directory listing. In other words, it returns a list of files in the folder that is your working directory. The bit inside the parentheses tells R to list all files, even the invisible ones that your operating system (Windows or OS X) is trying to hide from you. (This is essentially the same thing as double-clicking the folder icon to see its contents in Windows or OS X.)

14) Here is something you may find confusing for a while if you've never used a command line before. Nothing above the command prompt will ever be updated on your screen. Anything above the current command prompt is history. If you change the value of x, for example, and want to see the new value, you have to ask.

> x = 8     # any old values stored in x have been erased
> x
[1] 8
> x = 28
> x
[1] 28

Never look above the current command prompt to see something new. That's old news. If you want to see new, changed, or revised data, you have to ask!

15) Spacing is almost always optional in R. Put spaces in, leave them out, R doesn't care. Thus, all of these are the same:

> x=28
> x= 28
> x =28
> x = 28
> x    =    28

There are a few exceptions to this.
You can't put a space in the middle of a number, of course.

> x = 2 8
Error: unexpected numeric constant in "x = 2 8"

You also can't put a space in the name of something. So if we were creating a data object called "xyz" no spaces would be permitted inside that name.

> xyz = 112      # works
> x yz = 112     # does not work
Error: unexpected symbol in "x yz"

There is one other place where you cannot put spaces as well, but you don't need to know about that right now.

16) The last place you need to be aware of in R is the search path. The search path is where R looks for things you ask it to find, such as data objects and commands. To see the search path, do this:

> search()
[1] ".GlobalEnv"        "package:stats"     "package:graphics"
[4] "package:grDevices" "package:utils"     "package:datasets"
[7] "package:methods"   "Autoloads"         "package:base"

".GlobalEnv" (the global environment) is your workspace. All those packages are places R stores the commands you'll be using. If the data object or command you are referring to at the command prompt isn't at one of those places, R won't be able to find it, and it will give you an error message. For example:

> x
[1] 28
> y              # no such data object
Error: object 'y' not found
> x/7
[1] 4
> y/7
Error: object 'y' not found
> log(28)        # natural logarithm
[1] 3.332205
> beans(28)      # no such command
Error: could not find function "beans"

17) I suppose I should tell you, in R commands are called functions. All functions have the same form: a function name followed by parentheses with arguments and options typed inside them. For example, the seq() function creates sequences of numbers. Let's play with it a bit.

> seq(1, 10)                   # with unnamed arguments
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(from=1, to=10)           # with named arguments
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(from=1, to=10, by=2)     # by twos (an option)
[1] 1 3 5 7 9

As long as the arguments to the function are named, they can occur in any order:

> seq(from=100, to=150, by=5)
 [1] 100 105 110 115 120 125 130 135 140 145 150
> seq(by=5, to=150, from=100)
 [1] 100 105 110 115 120 125 130 135 140 145 150

If the arguments are not named, they must be given in the default order. You can find out what this is by asking for help for the function.

> help(seq)     # opens a help page (not shown)

Don't expect these help pages to be very helpful until you get more experience with R, however. Some of the arguments will have default values (also given on the help page). If you don't specify a value for an argument, R will use the default, if there is one.

> seq(to=10)     # from=1 by default
 [1]  1  2  3  4  5  6  7  8  9 10

Of course, you can also just name everything, even though you may not have to, and get the same result. That's what I usually do, because I can't always remember what the defaults are and often don't want to bother looking them up!

> seq(from=1, to=10, by=1)     # same result as above
 [1]  1  2  3  4  5  6  7  8  9 10

18) In R, you can name your data objects almost anything you want. You should avoid using the names of built-in functions, however. And don't use T or F. Also, never put spaces in the names of things. R thinks you're typing something entirely new when it sees a space. Another bad thing is to use a dash (-). R will think you mean to subtract. It's best to stick to letters and numbers, and always start with a letter. If you want a space, use a period. For example, South.Carolina, not South Carolina. R may understand that space, but it may not, and it's best just to avoid spaces in the names of things. And remember, R is case sensitive. In R, South.Carolina is an entirely different state from south.carolina.
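
The naming advice in item 18 can be tried out with a short sketch. The object names and values below are made up purely for illustration. The handout does not say why T and F should be avoided; the likely reason, shown here, is that they are ordinary names that normally stand for TRUE and FALSE, so reusing them hides that meaning.

```r
# Hypothetical objects, used only to illustrate the naming rules
south.carolina = 10
South.Carolina = 20       # a different object: R is case sensitive

# Compare the two objects; they hold different values
same = south.carolina == South.Carolina
same

# T normally stands for TRUE, but it is an ordinary name and can be
# overwritten (TRUE itself cannot be)
T = 0
t.overwritten = isTRUE(T)     # our T is 0, so it no longer means TRUE
rm(T)                         # remove our T; the built-in one is visible again
t.restored = isTRUE(T)

t.overwritten
t.restored
```

When you are done, rm(south.carolina, South.Carolina, same, t.overwritten, t.restored) will clear these practice objects out of the workspace.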

19) When R prints something to the screen like above, it usually means nothing was stored in a data object. Thus, those sequences of numbers might be pretty to look at on the screen, but you're not going to be able to use them. To use them, you have to store them into a data object. This procedure is called assignment.

> y = seq(to=10)
> y
 [1]  1  2  3  4  5  6  7  8  9 10

Notice nothing was printed when the sequence was created. If you want to see the contents of your new data object, you have to ask. (There is a way around this, but it's rarely used.) As the sequence of numbers has now been stored into a data object, we can now do stuff with it.

> sum(y)      # find the sum of the numbers we've stored in y
[1] 55
> mean(y)     # I hope you can figure this one out!
[1] 5.5
> sd(y)       # the sample standard deviation (i.e., the n-1 version)
[1] 3.027650

20) It's a good idea when you're done using something and don't need it anymore to erase it from your workspace. Don't clutter up your workspace with crap!

> rm(x,y,xyz)     # remove(x,y,xyz) will also work

Notice once again that R doesn't tell you it has done what you've asked. It's not a naggy program like most menu-driven programs are. It doesn't talk much. It just does what you tell it to.

> ls()
character(0)

21) You can't learn to play the piano by watching me play one. To learn this stuff, you have to do it. There is no substitute for sitting down with R and using it. There is hardly anything you can do to hurt it, so feel free to play around. There's just one thing you should know.

> seq(10,20,3
+

R is very fussy about syntax. (It's actually a programming language, and a very good one!) In the example above, I have not completed the command by typing the close parenthesis. R recognizes that I have not typed a complete command, so it has given me a "continuation prompt", which is a + sign. It's saying it wants more. Just finish typing the command and press Enter.

+ )
[1] 10 13 16 19
>

Occasionally, you'll get stuck.
(It happens to me, too.) R won't be happy with anything you type and will just keep giving you continuation prompts (+). In this case, terminate the command by pressing the Esc key (upper left corner of your keyboard), and start again.

22) Now let's do something useful. Let's put some data into R and do some statistical analysis. There are three ways to get data into R: 1) from an external file that you've created, probably in spreadsheet software like MS Excel, 2) by copying and pasting it from another document, for example, a webpage, or 3) by typing it in directly at the command prompt. Today we'll use the third method. We'll use scores from a 50-point statistics exam:

39 40 28 44 43 41 36 33 40 49 48 37 33 38 49 45 43 37
26 47 30 45 48 38 45 43 25 28 37 41 29 46 49 31 37 31

Most stat software makes you enter data in one form only, called a data table or dataframe (in R-speak). This is a spreadsheet-like format with cases in the rows and variables in the columns. R is much more flexible and allows you to enter data in a number of different ways, some of which are much more convenient when you are dealing with data sets like the one above.

The basic data structure in R is called a "vector", which is just a collection of numbers, or words, or logical values (TRUE and FALSE). You've already created a vector in R. You did it with the c() function. The output of seq() is also a vector (just a "string" of values--vector is a strange and intimidating word, but it really means something very simple). It's too bad the exam scores don't fall into a regular sequence, because that means we are going to have to do a little more work to get them into a vector.

There are two (common) methods of entering numerical data into a vector. The first is to use the c() function, which means concatenate or combine. This is something of a nuisance, so to make things easy I will enter only the first four values of the above data.
(Imagine doing this for all 36 data values!)

> c(39, 40, 28, 44)     # the first four data values, just to illustrate
[1] 39 40 28 44

R responded by printing out the values, which means they weren't stored into a data object (a "variable name"), and that's because we didn't ask.

> grades = c(39, 40, 28, 44)     # spacing is optional
> grades
[1] 39 40 28 44
> mean(grades)                   # now that it's stored, we can calculate on it
[1] 37.75

That would be a tedious way to enter a lot of data values because of all those commas! So a better way is to use scan(). This function reads data values that you type from the keyboard, pressing Enter after each value, and pressing Enter twice after the last value.

> grades = scan()     # scan() means scan the keyboard for input
1: 39
2: 40
3: 28
4: 44
5: 43
6: 41
7: 36
8: 33
9: 40
10: 49
11: 48
12: 37
13: 33
14: 38
15: 49
16: 45
17: 43
18: 37
19: 26
20: 47
21: 30
22: 45
23: 48
24: 38
25: 45
26: 43
27: 25
28: 28
29: 37
30: 41
31: 29
32: 46
33: 49
34: 31
35: 37
36: 31
37: 
Read 36 items
> grades
 [1] 39 40 28 44 43 41 36 33 40 49 48 37 33 38 49 45 43 37 26 47 30 45 48 38
[25] 45 43 25 28 37 41 29 46 49 31 37 31
>

And there they are. Simple as that! Just remember to press Enter twice after the last value to terminate data entry. Now I can calculate a variety of descriptive statistics. Just type the name of the function you want to calculate, and put the name of the vector in the parentheses.

> length(grades)     # how many are there? (sample size or n)
[1] 36
> mean(grades)       # arithmetic mean - try sum(grades) / length(grades)
[1] 38.86111
> median(grades)     # median
[1] 39.5
> var(grades)        # sample variance
[1] 50.86587
> sd(grades)         # sample standard deviation
[1] 7.132031
> IQR(grades)        # not your granddad's interquartile range
[1] 12
> range(grades)      # range prints the min and max values
[1] 25 49
> min(grades)        # minimum value
[1] 25
> max(grades)        # maximum value
[1] 49
> summary(grades)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.00   33.00   39.50   38.86   45.00   49.00 

I can also look at the distribution in a number of different ways.

> stem(grades, scale=.5)     # a stem-and-leaf display (with an option)

  The decimal point is 1 digit(s) to the right of the |

  2 | 56889
  3 | 01133
  3 | 67777889
  4 | 00113334
  4 | 5556788999

> hist(grades)              # a histogram (not shown in this handout)
> hist(grades, right=F)     # there are all kinds of fancy options available

You will notice when you plot a graph in R that it opens another window, called a "graphics device," to draw in, if one is not already open. This steals the focus from the R Console window, so to type more commands in R you will have to click on the R Console window to return focus there. (I have to admit, this "feature" of R drives me NUTS!)

There are a multitude of statistical functions built into R. Just about any statistical procedure you can imagine has been programmed either into the base package (which is what you now have) or into an optional package that can be obtained for free from CRAN (The Comprehensive R Archive Network). There are a few interesting exceptions, but we'll cope with those when the time comes. Right now I will illustrate just one--the single-sample t test. If I had hypothesized in advance that the mean grade would be 35, I could run a hypothesis test very easily to see if that hypothesis was confirmed.

> t.test(grades, mu=35)     # single-sample t-test, two-tailed

        One Sample t-test

data:  grades
t = 3.2483, df = 35, p-value = 0.002564
alternative hypothesis: true mean is not equal to 35
95 percent confidence interval:
 36.44798 41.27424
sample estimates:
mean of x 
 38.86111 

This has been a very brief introduction to R. Stay tuned! We have just barely scratched the surface of what R can do.

23) One more thing. I mentioned optional packages just a moment ago. How do you get them? There is one you will almost certainly need eventually, so let's install it now. This will require that you have a connection to the Internet.
R will not establish that connection for you, but it will make use of the connection if it is already established. So if you are on a wireless network that requires you to enter a password, you'll have to do that in another program, such as your browser (Firefox, Internet Explorer, Safari, etc.). On the other hand, if your computer has already detected your wireless router or Ethernet connection, you're good to go.

> installed.packages()     # output not shown

The output is extensive, and basically shows what packages have already been installed. There are several lists, all of which contain the same package names. We're looking for a package called "car" (which is short for Companion to Applied Regression). Unless you've already installed "car," you won't find it in any of the lists, because it is not one of the packages installed by default. So we need to get it if we want to use the functions and data sets it contains. Do this.

> install.packages("car")

This will (probably) open a list of mirror sites. Pick one near you. R will take care of the rest. In fairly short order, the package will be installed and the command prompt will be returned. You still won't be able to use it, however, because it is not in the search path. To add it to the search path, do this.

> library("car")

Some stuff will be printed to the console, and then "car" will be ready to use.
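
Rather than reading through the long installed.packages() listing by eye, you can ask R directly whether a package is there. This is only a sketch: the helper name is.installed is made up for this handout and is not part of R. It relies on the package names being the row names of the table that installed.packages() returns.

```r
# is.installed is a hypothetical helper, not a built-in R function
is.installed = function(pkg) {
  # installed.packages() returns a table with one row per package;
  # the row names are the package names
  pkg %in% rownames(installed.packages())
}

is.installed("stats")     # a package that comes with R
is.installed("car")       # TRUE only if you have already installed it

# Install only when missing, then add the package to the search path.
# These two lines are commented out because they need an Internet connection:
# if (!is.installed("car")) install.packages("car")
# library("car")
```

After library("car") has run, search() should show "package:car" in the search path.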

Functions You've Learned About in This Tutorial

ls()   dir()   search()   rm()   rm(list=ls())
x = 8   x = c(8, 10, 11, 9)   print(x)   x
save.image()   quit()   load(".RData")   getwd()   dir(all=T)
seq(from= , to= , by= )   help()
sum()   mean()   sd()   scan()   length()   median()   var()   IQR()
range()   min()   max()   summary()   stem()   hist()   t.test()
installed.packages()   install.packages()   library()

An Additional Tutorial (For Practice)

> ls()                       # list contents of workspace
character(0)
> dir()                      # see files in your working directory (not shown)
> getwd()                    # find out what the working directory is
[1] "C:/Documents and Settings/kingw/My Documents"
> search()                   # see the search path
[1] ".GlobalEnv"        "package:stats"     "package:graphics"
[4] "package:grDevices" "package:utils"     "package:datasets"
[7] "package:methods"   "Autoloads"         "package:base"
> help(median)               # see a help page (these can be pretty cryptic!)
> 58 + 32 + 28 + 17.7        # do some arithmetic at the command line
[1] 135.7
> 456 / 28.3
[1] 16.11307
> 43 * 23.3333
[1] 1003.332
> 16^3
[1] 4096
> sqrt(28)
[1] 5.291503
> log(28)
[1] 3.332205
> data(rivers)               # load a built-in data vector
> ls()                       # see it in your workspace
[1] "rivers"
> rivers                     # print it to the screen
  [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280  315  870
 [16]  906  202  329  290 1000  600  505 1450  840 1243  890  350  407  286  280
 [31]  525  720  390  250  327  230  265  850  210  630  260  230  360  730  600
 [46]  306  390  420  291  710  340  217  281  352  259  250  470  680  570  350
 [61]  300  560  900  625  332 2348 1171 3710 2315 2533  780  280  410  460  260
 [76]  255  431  350  760  618  338  981 1306  500  696  605  250  411 1054  735
 [91]  233  435  490  310  460  383  375 1270  545  445 1885  380  300  380  377
[106]  425  276  210  800  420  350  360  538 1100 1205  314  237  610  360  540
[121] 1038  424  310  300  444  301  268  620  215  652  900  525  246  360  529
[136]  500  720  270  430  671 1770
> sort(rivers)                     # sort from low to high (output not shown)
> sorted.rivers = sort(rivers)     # store the sorted values
> sorted.rivers                    # see that (not shown)
> ls()                             # output not shown
> rm(sorted.rivers)          # remove it from your workspace
> ls()                       # output not shown
> mean(rivers)               # get some descriptive statistics (mean)
[1] 591.1844
> median(rivers)             # median
[1] 425
> sd(rivers)                 # sample standard deviation
[1] 493.8708
> summary(rivers)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  135.0   310.0   425.0   591.2   680.0  3710.0 
> hist(rivers)               # draw a histogram (not shown)
> hist(rivers, breaks=20)    # draw a histogram with about 20 bars

Note: help windows and graphics windows can be closed with the mouse by clicking them away just like you would close any other window.

> quit("no")                 # quit R without saving the workspace
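
For a little more practice in a fresh session, the single-sample t test from item 22 can be checked by hand using only functions from this tutorial plus sqrt(), abs(), and pt() (pt() gives probabilities from the t distribution and has not been covered yet). This is a sketch of the usual textbook formula, not a description of how t.test() is programmed internally; the object names are made up for illustration.

```r
# The 36 exam scores from item 22
grades = c(39, 40, 28, 44, 43, 41, 36, 33, 40, 49, 48, 37,
           33, 38, 49, 45, 43, 37, 26, 47, 30, 45, 48, 38,
           45, 43, 25, 28, 37, 41, 29, 46, 49, 31, 37, 31)

n = length(grades)                      # sample size
se = sd(grades) / sqrt(n)               # estimated standard error of the mean
t.by.hand = (mean(grades) - 35) / se    # distance from 35 in standard errors
p.by.hand = 2 * pt(-abs(t.by.hand), df = n - 1)   # two-tailed p-value

t.by.hand
p.by.hand

# Compare with what t.test() reports
result = t.test(grades, mu = 35)
result$statistic
result$p.value
```

The hand-calculated values should agree with the t and p-value that t.test() printed in item 22.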