QUICK AND DIRTY GUIDE TO R for PSYC 497 students (14 March 2009) by William B. King, Ph.D. Coastal Carolina University ------------------------------------------------------------------------------- Getting it. Download R from www.r-project.org (click on the CRAN link on the left side of the page under downloads). As of this moment, the current version is 2.8.1. It is available for Windows, OS X, and Linux. Get the .exe file for Windows, the .dmg file for OS X, and for Linux the best way to install is via your package manager. In Ubuntu, for example, use Synaptic, or from a terminal... sudo apt-get install r-base ------------------------------------------------------------------------------- The R prompt. R is command-line driven. The prompt looks like a greater than sign, >. When you see it, start typing. (Much faster than hunting and clicking once you get used to it!) Press Enter when you have finished typing the command. If the command is not complete, R will issue a continuation prompt, which is a plus sign, +. You can type long commands on several lines to make it easier to keep track of syntax. You can also type them into a text editor and then cut and paste them into R. ------------------------------------------------------------------------------- Case sensitive. R commands are case sensitive. If you get a function not found error, check your spelling and your capitalization. R functions are generally, but not always, typed in lower case letters. On the other hand, spacing usually doesn't matter. Leave spaces wherever you like (except within names of things) to make your input easier to read. ------------------------------------------------------------------------------- R functions. R "commands" are called functions. The basic form of a function is: > function.name( arguments, options ) Even if there are no arguments or options, the parentheses are still mandatory. If you leave off the parentheses, R will print out the source code for the function you have entered. This can be startling when you're not expecting it, but it is not harmful. Just retype the command correctly. ------------------------------------------------------------------------------- Starting R. In Windows, double-click the desktop shortcut. In OS X, double-click the R icon in your /Applications folder (or drag a shortcut to the Dock). In Linux, start a terminal session, and type R and hit the Enter key. Here's what R looks like when it starts. R version 2.8.1 (2008-12-22) Copyright (C) 2008 The R Foundation for Statistical Computing ISBN 3-900051-07-0 R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. [Previously saved workspace restored] > | The > is the command prompt, and the vertical bar (may be a solid block, may or may not be blinking) is the cursor. ------------------------------------------------------------------------------- Quitting R. > quit() # remember, the parentheses are mandatory! Note: best to know this right away! Also, q() works. When R asks to save your workspace, pick "no" until you know what this means. If you have entered data into R and want it saved for your next R session, pick "yes". It is best not to clutter up your workspace with old stuff you no longer need. If you have created a data object called myData in your workspace, and you want to delete it before saving your workspace, do this: > rm( myData ) # remove( myData ) also works ------------------------------------------------------------------------------- Interactive mode. R can be used by writing programs into scripts and reading them in. It's easier to use R in interactive mood, typing individual commands at the command prompt. Notes or comments can be added to the R session by typing a hash or pound symbol, #. Anything on the line following this symbol will be ignored by R. ------------------------------------------------------------------------------- Getting your data file in. Once your have created your data file in a spreadsheet and have saved it as a csv file, you should place that file in R's working directory (folder). To find out what folder that is, type this: > getwd() [1] "/Users/billking" In OS X, the working directory will be your home folder. The same is true in Linux (in the lab that is the psychlab folder). In Windows, it depends upon which version you are running. In XP the home folder is My Documents. In Vista it is the users home folder. A few cautions about creating the data file: 1) Do not put spaces in the file name. R won't like that. 2) Do not put spaces in variable names. 3) If you are typing character values into the file, for example "male" and "female", be very careful about your spelling and capitalization, and don't leave spaces at the beginning or end of the word. R will consider each of these a different value: "male", "Male", " male", "Male ". 4) Make a backup copy of your data file. 5) Give the file a sensible name. 497data isn't sensible, because half the people in the class will name their file the same. Try King497data.csv. I hope I don't really have to add this note, but use your own name! Once the file is the working directory, laod it into R this way: > myData = read.csv( "King497data.csv", header=TRUE, comment="#" ) This will read your data into a data object called myData. No output will be produced if all goes well. R is normally silent when things work and gives error messages when things go wrong. To confirm that the file is in memory, do this: > ls() [1] "describe" "eda" "myData" This "list" command will show all the objects R has in the workspace (memory). To find out the names of the variables in your file (in case you forgot): > names( myData ) [1] "Age" "Gender" "Race" "UnUpCl" "Class2" "ASACL" [7] "Anxiety" "Depress" "Suicide" "SubAb" "SelfEs" "InterProb" [13] "FamProb" "AcadProb" "CarProb" "ASACat" At this point, should attach your data file so that R can see the variables inside it: > attach( myData ) Now you can refer to the variables directly by name: > Age [1] 23 19 18 18 19 18 20 18 18 18 18 18 19 19 18 19 19 18 19 18 19 19 18 25 [25] 21 19 19 20 19 21 19 19 18 20 19 18 19 18 18 18 19 17 18 18 18 20 21 18 [49] 23 21 18 19 23 17 18 19 20 18 19 23 17 19 18 18 21 18 18 22 20 25 19 21 [73] 18 19 28 18 20 20 20 18 19 18 21 21 20 19 19 18 20 19 18 20 18 20 18 18 [97] 18 18 19 19 19 23 20 18 19 19 19 18 18 19 19 18 20 20 18 21 18 18 18 21 [121] 21 18 18 18 20 19 18 18 19 20 20 21 18 18 17 21 20 > mean( Age ) [1] 19.20438 > range( Age ) [1] 17 28 Note: some terminaology. In R-speak, myData is called a data frame. Each variable inside the data frame is called a vector. Thus, Age is a vector. ------------------------------------------------------------------------------- R output. You may have noticed above that R not only printed out the values you stored in your variable, but it also printed out a 1 inside square brackets. When R is printing out a vector, it indexes all the values, but only the index for the first value on a line is shown. Thus, [1] means this line begins with data value number 1. Try this (note: the data vector rivers is built-in to R): > rivers [1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870 [16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280 [31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600 [46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350 [61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260 [76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735 [91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377 [106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540 [121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529 [136] 500 720 270 430 671 1770 The first line of this output starts with value number 1, the second line with value number 16, the third line with value number 31, and so on. All of the values are indexed, so the value of 310 in the next to the last line is value number 123. It can be retrieved like this: rivers[ 123 ] [1] 310 Don't be confused by the fact that R called it value 1 when it printed it. It was the first value printed, so it is called 1. If you want values 10 through 20 from the rivers vector, you can get them this way: > rivers[ 10:20 ] [1] 600 330 336 280 315 870 906 202 329 290 1000 The printed values are considered a new vector and are indexed accordingly. ------------------------------------------------------------------------------- Histogram. > hist( rivers ) Note: bars in R histograms include the right limit but not the left. > hist( rivers, right=FALSE ) # now includes the left, not the right ------------------------------------------------------------------------------- Categorical (nominal) data. In the data frame I read in, Class2 is a categorical variable containing the subjects' class standing. There are two ways to get summary information: > table( Class2 ) Class2 Freshman Sophomore Junior Senior NA 65 42 18 12 0 > summary( Class2 ) Freshman Sophomore Junior Senior NA 65 42 18 12 0 NA, by the way, means "not available" and is the R missing value code. You may or may not see this in the output of the commands above. To crosstabulate two categorical variables, such as Race and Class2: > xtabs( ~ Race + Class2 ) Class2 Race Freshman Sophomore Junior Senior NA Caucasian 50 32 13 11 0 NonCaucas 15 10 5 1 0 To save this table into a data object so it can be manipulated with further commands, just set it equal to something: > myTable = xtabs( ~ Race + Class2 ) ------------------------------------------------------------------------------- Descriptive statistics. **sample size** > length( Age ) # won't work if there are NAs [1] 137 > length( rivers ) [1] 141 **sample median** > median( Age ) [1] 19 **sample mean** > mean( Age ) # use mean( Age, na.rm=T ) if there are NAs [1] 19.20438 **range** > range( Age ) # reports the min and max values [1] 17 28 **IQR** > IQR( Age ) # interquartile range [1] 2 Note: this may not be the "textbook" IQR. R can use 9 different methods to calculate Q1 and Q3. By default it uses a different one than most elementary textbooks do. **sample variance** > var( Age ) [1] 2.781451 **sample standard deviation** > sd( Age ) [1] 1.667768 **more or less all at once** > summary( Age ) Min. 1st Qu. Median Mean 3rd Qu. Max. 17.0 18.0 19.0 19.2 20.0 28.0 **descriptive statistics by groups** The basic syntax is by( DV, IV, function ). E.g.: > by( Age, Class2, mean ) Class2: Freshman [1] 18.21538 ---------------------------------------------------------- Class2: Sophomore [1] 19.57143 ---------------------------------------------------------- Class2: Junior [1] 20.55556 ---------------------------------------------------------- Class2: Senior [1] 21.25 > by( Depress, Class2, mean ) Class2: Freshman [1] 51.67692 ---------------------------------------------------------- Class2: Sophomore [1] 51.83333 ---------------------------------------------------------- Class2: Junior [1] 49.16667 ---------------------------------------------------------- Class2: Senior [1] 51.83333 ------------------------------------------------------------------------------- Single sample t-test. H0: the Anxiety variable has a mean of 35 > t.test( Anxiety, mu=35 ) One Sample t-test data: Anxiety t = -2.773, df = 136, p-value = 0.006333 alternative hypothesis: true mean is not equal to 35 95 percent confidence interval: 26.64689 33.60129 sample estimates: mean of x 30.12409 ------------------------------------------------------------------------------- Independent t-test. H0: there is no difference in mean anxiety by gender > t.test( Anxiety ~ Gender, var.equal=TRUE ) Two Sample t-test data: Anxiety by Gender t = 3.8968, df = 135, p-value = 0.0001528 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 6.484899 19.850607 sample estimates: mean in group F mean in group M 35.79487 22.62712 Note: R always subtracts first group - second group, and usually considers groups to be first or second by alphabetical order. To specify a one-tailed test, use the option alternative="greater" or alternative="less". A nice graph: > boxplot( Anxiety ~ Gender ) ------------------------------------------------------------------------------- Dependent (related samples) t-test. H0: the means of interpersonal problems and family problems are the same > t.test( InterProb, FamProb, paired=TRUE ) Paired t-test data: InterProb and FamProb t = 3.2828, df = 136, p-value = 0.001307 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.873559 3.520602 sample estimates: mean of the differences 2.197080 ------------------------------------------------------------------------------- ANOVA (oneway, between subjects) H0: there are no differences among the means of depression scores by class > results = aov( Depress ~ Class2 ) > summary( results) Df Sum Sq Mean Sq F value Pr(>F) Class2 3 104.9 35.0 0.334 0.8008 Residuals 133 13922.2 104.7 A post hoc test is pointless here, but just to illustrate how it's done: > TukeyHSD( results ) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = Depress ~ Class2) $Class2 diff lwr upr p adj Sophomore-Freshman 1.564103e-01 -5.113637 5.426457 0.9998341 Junior-Freshman -2.510256e+00 -9.600318 4.579805 0.7935768 Senior-Freshman 1.564103e-01 -8.207358 8.520178 0.9999584 Junior-Sophomore -2.666667e+00 -10.165931 4.832597 0.7914229 Senior-Sophomore -4.973799e-14 -8.713357 8.713357 1.0000000 Senior-Junior 2.666667e+00 -7.253927 12.587260 0.8971568 A nice graph: > boxplot( Depress ~ Class2 ) ------------------------------------------------------------------------------- Correlation. H0: there is no correlation between depression and anxiety scores > cor.test( Anxiety, Depress ) Pearson's product-moment correlation data: Anxiety and Depress t = 3.5145, df = 135, p-value = 0.0006001 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.1280281 0.4360677 sample estimates: cor 0.2895268 Note: one-tailed tests are also possible by using the alternative= option. Use alternative="less" if you think the correlation is less than zero. Otherwise, use alternative="greater". A scatterplot: > plot( Anxiety, Depress ) > scatter.smooth( Anxiety, Depress ) ------------------------------------------------------------------------------- Regression. Note: for the following example, x and y are previously defined numerical variables. > lm( y ~ x ) Call: lm(formula = y ~ x) Coefficients: (Intercept) x 74.43864 0.01112 > my.lm.analysis = lm( y ~ x ) # storing the results gives more output > summary( my.lm.analysis ) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -38.663 -14.562 -0.173 13.702 50.507 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 74.43864 9.07285 8.205 9.27e-13 *** x 0.01112 0.18097 0.061 0.951 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 17.45 on 98 degrees of freedom Multiple R-Squared: 3.849e-05, Adjusted R-squared: -0.01017 F-statistic: 0.003773 on 1 and 98 DF, p-value: 0.9511 ------------------------------------------------------------------------------- Chi-square goodness of fit. H0: there are equal numbers of freshman, sophomores, juniors, and seniors > class.table = table( Class2 ) > class.table Class2 Freshman Sophomore Junior Senior 65 42 18 12 > chisq.test( class.table ) Chi-squared test for given probabilities data: class.table X-squared = 51.5255, df = 3, p-value = 3.78e-11 ------------------------------------------------------------------------------- Chi-square test of independence. If the data have already been crosstabulated, this is easy. Otherwise, you'll need to learn to use the xtabs() and table() functions for crosstabulation (see below). variable one dems repubs other male 22 35 12 variable two female 28 30 18 > male = c( dems=22, repubs=35, other=12 ) > female = c( dems=28, repubs=30, other=18 ) > crosstabs = rbind( male, female ) > crosstabs dems repubs other male 22 35 12 female 28 30 18 > chisq.test( crosstabs ) Pearson's Chi-squared test data: crosstabs X-squared = 1.9713, df = 2, p-value = 0.3732 Note: for 2x2 tables, R does the Yates correction by default. To prevent this: > chisq.test( crosstabs, correct=F ) To see more information, store the results of the test... > chisq.results = chisq.test( crosstabs ) > chisq.results > chisq.results $ expected # expected frequencies > chisq.results $ residuals # unsquared chi values in each cell A nice barplot: > barplot( crosstabs, beside=TRUE, legend=TRUE ) ------------------------------------------------------------------------------- Getting help. > help( function.name ) # opens a somewhat helpful help page > help( t.test ) # ?t.test is a shortcut > help.search( "distribution" ) # looks for anything to do with quoted terms Note: the R manuals are included with the download in both PDF and html format. Look for them in the root R directory. Note: in Windows and OS X, the help() function opens a new window, which can be closed like any other window. In Linux, help pages are opened right in the terminal session. To close the help page, press the Q key. > args( function.name ) This is sometimes also helpful to see the various arguments and options a function will take. ------------------------------------------------------------------------------- Detach. Always detach attached data frames. If you are going to quit, quitting will detach the data frame. > detach( myData ) ------------------------------------------------------------------------------- More on reading in a dataframe from a csv file. The best way I've found to handle large data files is to enter them into a spreadsheet and then save them in comma separated format. The file should have column names, one column per variable, and one row per case, and NOTHING else. If you drop a copy of it into your working directory, you can read it into a dataframe as follows: > dataframe.name = read.csv( "filename.csv", header=T, comment="#" ) It's best to keep the names of dataframes in memory SHORT, so maybe... > df = read.csv( "filename.csv" ) Then the following functions are available: > names( df ) # get the names of the variables > dim( df ) # get the size of the dataframe > head( df ) # see the header and top six lines of data > summary( df ) # summarize all the variables > mean( df ) # find the mean of the numerical variables > str( df ) # gives a lot of info about the dataframe If you don't want to use attach, the following tips will be useful. The easiest way to work with the variables in the dataframe is to use the $ notation: > mean( df $ scores) # find the mean of the variable named scores Statistical tests can be done by specifying the name of the dataframe in a data option, but only if the formula interface is used: > t.test( scores ~ groups, data=df ) ------------------------------------------------------------------------------- Housekeeping. > ls() # shows the names of objects you've created > remove( object.name ) # deletes the named object; rm() also works Note: it's best to keep your workspace cleaned up by deleting unused objects. > dir() # get a dir listing of the working directory > getwd() # get the name of the working directory > setwd() # set (change) the working directory Note: in Windows it's best to use the pulldown menus to change directories rather than attempt to navigate through the file system. It's best not to clutter up the root R directory with a lot of saved work files. Make a new directory in My Documents (in Windows) and switch to it when you start R. > save( myData, file="mydata.rda" ) Note: this is the easiest way to save data, as a data file in a proprietary file format. The file will be saved in the working directory. To read it back in at some later work session: ------------------------------------------------------------------------------- SOME MORE ADVANCED MATERIAL ------------------------------------------------------------------------------- A further tutotial on dataframes. Since dataframes are the primary data structure in R, you may find some additional information on them helpful. Use the following tutorial when it becomes necessary. Most statistical software, like SAS, SPSS, and Minitab, use a particular data structure called a dataset, a rectangular table in which the columns are variables and the rows are cases or observations. Often, cases (rows) correspond to subjects, but this is not always true. A short example is given here. As you can see, the dataset can contain a mix of categorical and numerical variables: gender class score1 score2 subject1 male freshman 28.3 33.8 subject2 female junior 32.0 30.8 subject3 female senior 26.7 40.2 ... R also uses a similar data structure called a dataframe. Unlike the other software packages, however, R can use many other data structures, such as vectors, matrices, arrays, and so forth, but for now we do not need to know about them, except for just a bit about vectors. A vector is about the simplest form for R data objects. In R, even single values are considered vectors. All of the following assignments create vectors, as per the comment (#): > a = 6 # a vector containing one element > a = c( 6, 4, 8, 3 ) # a vector containing four elements > a = 1:10 # a vector containing the values 1 to 10 > a = seq( from=1, to=12, by=3 ) # a vector containing 1, 4, 7, and 10 > a = "Fred" # a vector containing one character value > a = c( "Fred", "Ginger" ) # ditto with two character values > remove( a ) # erasing unnecessary junk Important: The columns of dataframes are vectors! For educational purposes, R comes with a large number of built-in datasets (not all of which are in the form of dataframes). We will use one called "USArrests". Type the following command: > ls() The parentheses are manditory, and so is lower case. This will show you the contents of your workspace, which is an area of the computer's RAM memory that contains data objects you have created. If you have not created any, and the workspace is empty, you will see the following: > ls() character(0) That's R's way of telling you, "Nothing to see here." To place a copy of the "USArrests" dataframe in your workspace, do this: > data( USArrests ) If you execute the ls() function again, you should see that your workspace includes this data object. Now let's find out some things about this dataframe. First, let's find out how big it is. In R, spacing usually doesn't matter, so any of the following forms of the command will work. Try them all: > dim(USArrests) > dim (USArrests) > dim ( USArrests ) > dim( USArrests ) This will print out two numbers (actually 3, but the one in square brackets doesn't count). These numbers tell you how many rows there are in the dataframe, not counting the row of variable labels at the top, and the number of columns in the dataframe, not counting the column of row names at the left. To see the names of the columns, and hence the variable names, do this: > colnames( USArrests ) Or this will give the same result: > names( USArrests ) So now you know the dataframe contains information on four variables, and these variables are called "Murder", "Assault", "UrbanPop", and "Rape". To see the row names, i.e., the names assigned to the cases or subjects in the dataframe, do this: > rownames( USArrests ) Now you know there is a row in the dataframe for every state in the country. To see both column names and row names at once, do this: > dimnames( USArrests ) To see the first half dozen lines of the dataframe and get a bit better idea of its structure, do this: > head( USArrests ) The first thing you'll see is a line of variable names or column names, also called the header. Then you will see a row of data for the first case, which is the state of Alabama. And so on. The numbers, by the way, give the number of arrests in 1973 per 100,000 population in each state for murder, assault, and rape, and it also gives the percentage of the population living in urban areas within the state. A reminder: the spacing in these commands is optional. I'm using extra spaces to make the commands more easily readable, but to save typing, you could have typed that last command as head(USArrests). Every entry in this table-like structure is indexed by two numbers, the row number and the column number. The first entry in the table is the murder rate for the state of Alabama. This value is in the first row and the first column, and it can be referred to as follows: > USArrests[ 1, 1 ] Take special note of the square brackets. Also make a note of the fact that, when using index numbers (sometimes called subscripts) like this, the first number is ALWAYS the number of the row, and the second number is the number of the column. The space after the comma is optional--as are all the spaces. You can also refer to the values in the dataframe by row name and column name (if these exist). To do so, do this: > USArrests[ "Alabama", "Murder" ] Once again, the name of the row must come first followed by the name of the column, and the names must be quoted. I should point out that in an R dataframe row names and column names always exists. If you don't create them yourself, R will create them for you. Let's create a new dataframe to see how this works. For now, never mind exactly what the following commands are doing. Just type them in as you see them here: > myData = matrix( c( 2, 5, 3, 4, 6, 7, 3, 2, 18, 22, 17, 10 ), ncol=3 ) > myData = as.data.frame( myData ) > myData As you can see, R has supplied row and column names for the dataframe. The row names appear to be numbers, but they are not! They are character values. I.e., the name of the first row is "1", a character like "A" or "Fred", and not the number 1. You can confirm this by asking for rownames(myData). Notice the row names are quoted. R always quotes character values and never quotes numbers. The column names are V1, V2, and V3, which stand for "Variable 1", etc. Now let's name the rows and columns. Do this: > colnames( myData ) = c( "Quiz1", "Quiz2", "Exam" ) > rownames( myData ) = c( "Jeff", "Barbara", "Rachel", "Bob" ) > myData As you can see, row names and column names are assigned by using the rownames() and colnames() functions. Without an assignment, these functions simply print out what is already there. With an assignment, they change what is there. "Bob" is upset, however, because he prefers to go by "Robb". This is easily enough changed: > rownames( myData )[4] = "Robb" > myData Frankly, I found this syntax a little confusing at first, but it makes sense once you realize how R does things. For now, just remember it! You can think of rownames(myData) as "containing" the row names for the dataframe, and rownames(myData)[4] as being the fourth row name. Assigning a new value to rownames(myData)[4] changes its value. Now let's add a column to myData. There are two ways to do this, so we will add two columns to illustrate them both: > myData[4] = c( 24, 28, 30, 20 ) > myData This is a bit mysterious, but it is worth remembering. If you give the name of the dataframe only one index (or subscript) inside the square brackets, R will assume you are referring to a column. Let's name it: > colnames( myData )[4] = "Paper" > myData Be careful about the indexes here. If you had used 3 instead of 4, column 3 would have been replaced. Now let's add a fifth column: > myData $ Major = c( "Psych", "Biology", "Math", "Business" ) > myData The dollar sign notation is a way of referring to any column (variable) inside a dataframe. For example: > myData$Exam Once again, there are optional spaces around the $, so to make this easier to read, you could also have typed this: > myData $ Exam Now back to our new column. Since the variable "Major" did not already exist, R created it and stuck it into a new column in the dataframe. Notice that, using this method, we did not even have to do an extra step to name the column. Let's get rid of this mess we've made: > rm( myData ) # remove( myData ) would do the same > ls() ***** Intermission: Things You've Learned So Far ***** 1) The structure of a dataframe: rows and columns with cases in the rows and variables in the columns. The columns of a dataframe are vectors, so you also should have learned a little about vectors! 2) How to copy a built-in dataset into your workspace: data(). 3) How to see the contents of your workspace: ls() 4) How to find out the size of a dataframe: dim() 5) How to find out the column names: colnames(), names(), or dimnames() 6) How to find out the row names: rownames() or dimnames() 7) How to see the top six cases in the dataframe: head() 8) What a header is: the variable names at the top of the dataframe. 9) How to pull any single value out of the dataframe by using index numbers: dataframe.name[ row.number, col.number ] 10) How to pull any single value out of the dataframe by using row names and column names: dataframe.name[ "row.name", "col.name" ] 11) How to change the column names: colnames() = a vector of quoted names 12) How to change the row names: rownames() = a vector of quoted names 13) How to change a single row name or column name: look it up if you've forgotten! 14) How to add a column to a dataframe: two methods, which once again you should review if you've already forgotten them. 15) How to refer to a variable inside a dataframe using the $ notation. ***** Changing Values Inside a Dataframe ***** Suppose you need to change a value inside the dataframe. I bet by now you can guess how to do it. I have one caution. Later we will learn that dataframes can be "attached" using the attach() function. Never mind for right now what this means. Just heed this warning: NEVER MODIFY AN ATTACHED DATAFRAME! If the dataframe is attached, use the detach() function to detach it before you try to modify it. The world will not come to an end if you forget this, but things can get very confusing if you are careless about it. Let's look at the percentage of people living in urban areas in Iowa: > USArrests[ "Iowa", "UrbanPop" ] [1] 57 I have included the output of the command this time, because we will need it later. Let's change that value to 33: > USArrests[ "Iowa", "UrbanPop" ] = 33 > USArrests[ "Iowa", "UrbanPop" ] [1] 33 It's as simple as that. You could also have used the row index number and column index number (subscripts), if you knew them, in place of the row name and column name. Now, of course, we should put back the correct value: > USArrests[ "Iowa", "UrbanPop" ] = 57 > USArrests[ "Iowa", "UrbanPop" ] [1] 57 ***** Looking At Subsets of Dataframes ***** Now for some more complex stuff--looking at subsets of dataframes. This can get tricky, so follow along closely! Let's begin by looking at just one row, say that for Pennsylvania: > USArrests[ "Pennsylvania", ] # make sure you include the comma You could also have used the row number, if you'd known it! > USArrests[ 38, ] # once again, the comma must be there This shows the value of having meaningful row names. However, in a moment we will have to fall back on the row numbers. The thing to pay close attention to is the comma after the row index inside the brackets. The comma followed by a blank column index tells R you want to see the entire row (all columns). If you want to see one column, all you have to do is name (or number) it: > USArrests[ "UrbanPop" ] > USArrests[3] The comma is optional, but if you use it, it should come before the column index: > USArrests[ ,"UrbanPop" ] > USArrests[ ,3] Notice how the result is different from the one you got above (without the comma). Technically, if you use the comma, the result is a data object called a vector. If you don't use the comma, the result is a new dataframe. To see the murder rate column, any of these would work (I'm setting you up here, so pay close attention to what's happening): > USArrests[ "Murder" ] > USArrests[ ,"Murder" ] > USArrests[1] > USArrests[ ,1] Now suppose you want to see rows 10 through 20. In R, the notation 10:20 means "10 through 20", so we can do this: > USArrests[ 10:20, ] Either of the following would do the same: > USArrests[ c(10,11,12,13,14,15,16,17,18,19,20), ] > USArrests[ seq( from = 10, to = 20, by = 1 ), ] Suppose you want to see only rows in which the murder rate is greater than 12. This can be achieved as follows: > USArrests[ Murder > 12, ] Oops! That didn't quite work, did it? The idea was right, but there is a technicality. When R was searching for the variable called Murder, it did not know to look inside the dataframe. In technical "R-speak", the dataframe is not in the search path. I know what you're thinking! "But, but, but, it worked above when we executed USArrests["Murder"]." There is a subtle but crucially important difference. When we quoted the word "Murder", we were naming a column. When the word Murder was left unquoted, we were asking R to look at the values stored in this variable. This is going to confuse you for awhile, so you might want to repeat that to yourself a few times. I still get error messages from not using quotes when I should or using them when I shouldn't. A NAME IN QUOTES IS JUST A NAME. A name in quotes IS a value. The value of "Murder" is "Murder", and it is the name of the first column in the dataframe. A NAME NOT IN QUOTES REFERS TO THE VALUES STORED INSIDE THE DATA OBJECT. Thus, Murder refers to a data object, and to the values stored therein. To find those values, R must know where to find Murder. The first place R will look is in the workspace. You can verify that no Murder exists there by using ls(). Then R will start looking through attached data objects and libraries. To see exactly where R will look, execute this command, which will show you the search path: > search() R does not yet know to look inside USArrests. We can tell it to do so as follows: > attach( USArrests ) This adds the dataframe "USArrests" to the search path, as you can verify as follows: > search() So now, after searching the workspace (".GlobalEnv") for Murder, R will look inside USArrests, and there it will find what it is looking for. Now our command will work: > USArrests[ Murder > 12, ] If we want just rows with murder rates higher than 12 and assault rates greater than or equal to 255, we can do this: > USArrests[ Murder > 12 & Assault >= 255, ] Question: What happens if you get the greater than or equal to sign the wrong way around, i.e., => instead of >=? If we want to work with this subset of rows, we can store them in a new data object: > newData = USArrests[ Murder > 12 & Assault >= 255, ] > newData ***** The Subset Function ***** Subsetting a dataframe can also be accomplished using the subset() function. Here are some examples: > subset( USArrests, select = UrbanPop ) > subset( USArrests, subset = Murder > 12 ) > subset( USArrests, subset = Murder > 12 & Assault >= 255 ) > subset( USArrests, subset = Murder > 12 & Assault >= 255, select = Rape ) Inside the subset() function, the option select= selects columns, while the option subset= selects rows. The subset= option must be set to a logical statement using >, <, >=, <=, == (equals), or != (not equals). It cannot be set to something like subset="Pennsylvania". The dataframe does not have to be attached for this to work. Of course, any of these newly created subsets can be stored in a new data object for further analysis. > just.UrbanPop = subset( USArrests, select = UrbanPop ) > head( just.UrbanPop ) > rm( just.UrbanPop ) # always clean up ***** Sorting Dataframes ***** Another thing you often want to do with a dataframe is sort it. Sorting a single variable is easy (with the dataframe attached): > sort( Murder ) > sort( Murder, decreasing = TRUE ) However, sorting an entire dataframe is a bit more complicated, as the rest of the columns must be sorted as well in order to keep them properly matched up with the Murder column. To do it, you use the order() function as follows: > SO = order( Murder ) > USArrests[ SO, ] SO is just a dummy variable, and we could have called it anything we wanted. I called it SO for "sort order". We could have accomplished the same thing by nesting the order() function inside the call to USArrests: > USArrests[ order( Murder ), ] Notice that we have not changed the dataframe itself: > USArrests To do that, we would have had to store the output of USArrests[ SO, ] into a new data object: > USArrests.sorted = USArrests[ SO, ] > USArrests.sorted > rm( USArrests.sorted ) To sort from highest to lowest, do this: > SO = order( Murder, decreasing = TRUE ) > USArrests[ SO, ] Or this would also work: > USArrests[ order( Murder, decreasing = TRUE ), ] ***** Pruning Out Unnecessary Variables ***** Unlike most statistical software, such as SAS and SPSS, R stores the workspace in RAM memory. Thus, there is a limit to the size of a dataframe that R can handle. That limit is still pretty big if you have a respectable amount of RAM memory in your computer, but to avoid taxing RAM memory, you may want to prune out unwanted variables. Suppose we do not need the UrbanPop variable in our dataframe. We can toss it out as follows: > detach( USArrests ) # never modify an attached dataframe! > USArrests[3] = NULL # or USArrests[ "UrbanPop" ] = NULL > attach( USArrests ) > USArrests Column three is no more (and "Rape" has been moved into the number 3 position). Be careful with this, because there is no recovering from it. Once R throws something away, it's gone for good! Of course, it would be wise to store a copy of the dataframe before altering it in some unrecoverable way. ***** Recoding Entire Variables ***** We changed single values above. Now lets change all the values in a variable. Remember the warning: NEVER MODIFY AN ATTACHED DATAFRAME. Suppose we wanted murder rates per 1,000,000 population instead of 100,000. This would involve multiplying the Murder variable by 10. Easy enough: > detach( USArrests ) > USArrests $ Murder = 10 * USArrests $ Murder > attach( USArrests ) > USArrests Here is another way to do the same thing using the transform() function. First, I'll put the dataframe back to the way it was, then I'll use transform() to make the same change we just made above: > detach( USArrests ) > USArrests $ Murder = USArrests $ Murder / 10 > USArrests = transform( USArrests, Murder = 10 * Murder ) > attach( USArrests ) > USArrests WARNING: I was just reminded of a valuable lesson here (the hard way, I admit). Be VERY CAREFUL when you are modifying your dataset. Make a backup copy first! I made a capitalization error in one of the above commands, and I destroyed the murder rate data! Thus, it would have been a good idea to insert the following command after the detach() command above: > USArrests.bak = USArrests That way, if you screw it up as badly as I did, you can recover your data from the backup copy in your workspace. Below I will show you how to save a copy to your hard drive as well. This may be necessary if you have a very large data set that is hogging up most of your computer's memory. Any mathematical transformation can be done this way. But suppose we want to do something a bit more complicated. Suppose we want to convert murder rate to a categorical variable with values of "high" and "low". Once this is done, if we overwrite the Murder variable with these new values, the old values will not be recoverable, so I suggest we create a new variable. One way to do it is this: > median( Murder ) # splitting at the median is a good idea [1] 72.5 > which( Murder == 72.5 ) # are any values right at the median? (no) integer(0) > > detach( USArrests ) > USArrests.bak = USArrests # not taking any chances this time!!! > > USArrests $ Murd.cat = ifelse( USArrests $ Murder > 72.5, "high", "low" ) > attach( USArrests ) > USArrests We have created a new column with the new variable Murd.cat in it. We did it with the ifelse() function. Read what it's done as follows: "If Murder is greater than 72.5 (the median), then assign Murd.cat the value "high", otherwise assign Murd.cat the value "low". > Murd.cat ***** Character Variables versus Factors ***** Character variables are names of things, like subjects' names, or subjects' addresses. Generally, we use a character variable when all the values of the variable will be different. A factor is a categorical variable that can be used as an independent or dependent variable. Generally, factors have only a few different values, called levels. Do this: > str( USArrests ) 'data.frame': 50 obs. of 4 variables: $ Murder : num 132 100 81 88 90 79 33 59 154 174 ... $ Assault : int 236 263 294 190 276 204 110 238 335 211 ... $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ... $ Murd.cat: chr "high" "high" "high" "high" ... I have reproduced the output so we can look at it in some detail. The str() function reveals the internal structure of a dataframe (and other data objects) in more detail than any of the functions we've seen so far. First it tells us that USArrests is, in fact, a 'data.frame'. Then it tells us the size: 50 observations (rows) by 4 variables (columns). Then it lists the name of each variable and what kind it is: Murder is numerical, Assault is integer (also numerical), Rape is numerical, and Murd.cat is a character variable. The first several values of each variable are also given. Before we go on, do this: > summary( Murd.cat ) We don't really want Murd.cat to be a character variable. We want it to be a factor, since it has only two values, and we might want to use it as an independent or dependent variable in our analysis. To change it, do this: > detach( USArrests ) > USArrests $ Murd.cat = factor( USArrests $ Murd.cat ) > attach( USArrests ) > str( USArrests ) Murd.cat is now a factor with two levels (values), "high" and "low". Internally, to save space, factors are stored as numbers, in this case 1 and 2. This will come in very handy later. (It means we do not have to recode if want to enter this variable into a regression analysis, for example.) Now that Murd.cat is a factor, we get a different result from: > summary( Murd.cat ) This time the output is a frequency table. There are 25 values of "high" and 25 values of "low". ***** Missing Values ***** It is a fact of life that data values sometimes turn up missing. Someone does not answer all questions on a survey, a subject does not show up for retesting, equipment fails, and so on. In R, missing values should be recorded as NA, for "not available". There are no missing values in the USArrests dataframe, and we don't really have to go into detail right now about how to deal with NAs. You should be aware of the possibility, however. When values are missing, special measures may have to be taken to deal with that circumstance. ***** A Caution About Attaching Dataframes ***** Do this: > detach( USArrests ) > Murder = 6 > ls() > attach( USArrests ) What's that all about? The USArrests dataframe contains a variable called Murder. But so does the Global Environment (your workspace), because you created it with the command Murder=6. R is notifying you that a conflict of sorts exists between these two variables. You are attaching USArrests because you want to be able to address the variables directly, but there is already a Murder variable in the workspace. If you try to work with the Murder variable inside USArrests by just typing Murder, R will see the value in your workspace first: > Murder This is a problem that must be resolved. One way to resolve it is to delete the Murder variable from the workspace. We don't need it anyway: > rm( Murder ) > Murder Now R is seeing the Murder variable inside USArrests. When you attach a dataframe, and you get a warning like the one you saw above, this is most likely the problem. Do something about it before you try to work with the variables inside the dataframe. ***** Intermission: More Things You've Learned ***** 16) Never modify an attached dataframe! 17) How to modify single values inside a dataframe. 18) Several ways to look at subsets of a dataframe. 19) How to sort a dataframe on a single variable using order(). 20) How to prune out unnecessary variables. 21) How to recode a variable (two methods). 22) The difference between a character variable and a factor. 23) How to turn a character (or numerical) variable into a factor. 24) Missing values happen. In R, the missing value code is NA. 25) Because R searches in so many places for the variables you are working with, conflicts can occur. Be careful. ***** How To Save a Dataframe ***** After all that, you may want to make a permanent copy of your dataframe. This involves knowing the difference between the workspace and working directory. The workspace is the area of RAM memory in which R stores the objects you've created, but it only stores them there temporarily. If the power goes off while you are working...poof! Workspace--and all your unsaved data objects--gone! When you quit R, it will ask you if you want to save your workspace. If you want to keep the objects you've created and your modifications to them, then the obvious answer is "yes". At this point, the workspace will be saved to your working directory. The working directory is a folder or directory on your hard drive (or other permanent storage medium) in which R looks by default for files you ask it to load or writes objects you ask it to save. To find out what your working directory is, do this: > getwd() The answer will be different depending upon what operating system you are working in (Windows, Mac OS X, or Linux). But this is where R is going to store the stuff you save. You may know that you can change the location of the working directory using setwd(), but this is for another time. In Windows and on the Mac, there are also pull down menus at the top of the R Console that allow you to set the working directory. On the Mac, it is under the Misc menu. In Windows it is under File (and is called Change dir...). If you don't know about pathnames in your directory structure, you may want to use these menus instead of trying to use setwd()! If you are in a long R session, and it is storming outside, you may want to save your workspace once in awhile before you actually quit R. The command to do so (can also be done from the menus in Windows and OS X) is: > save.image() You may want to use ls() first and get rid of old garbage by using rm(). It's amazing how much junk you can create during an R session. The save.image() function will save your entire workspace in a file called ".RData". The dot makes it an invisible file in OS X and Linux, but not in Windows. If you want a different name, just give it inside the function call: > save.image( file = "myWorkspace.RData" ) Make sure it has the .RData extension, however. This also allows you to save a separate workspace file for every project or problem you are working on. If you are not in the default working directory, i.e., the one R starts up in, you may find it necessary to load the workspace the next time you are working in this directory. This may be true even if you are working in the default working directory but have saved a workspace image under a nonstandard name. To load a workspace image (DON'T do this now; just make a note of it), do one of these: > load( ".RData" ) > load( "myWorkspace.RData" ) This will load the stored workspace image and ADD its contents to whatever is already in your workspace. To save a data object like a dataframe into a file of its own is almost as easy. Let's save the modified USArrests dataframe into a file called arrests.rda: > save( USArrests, file = "arrests.rda") You can see that the file has, in fact, been saved, by asking for a directory listing of the working directory: > dir() The save() function saves files in an encoded ("binary") format that is not human readable. It is an efficient way to save data objects, but neither you nor another software package will be able to read the file. Now let's remove the USArrests object from the workspace and then reload it from the saved file. FIRST, since USArrests is attached, we will begin by detaching it. ALWAYS remember to detach an attached dataframe. > detach( USArrests ) > ls() # Now you see it... > rm( USArrests ) > ls() # ...and now you don't. > load( file = "arrests.rda") > ls() # And there it is again. > USArrests Notice the data object has the same name you gave it while working with it and saving it, which is different from the filename you created for it. The dataframe is known as "USArrests" in the workspace, but it's known as "arrests.rda" on the hard drive. There is a good reason for this, but I'm not going to tell you what it is right now. (Okay, a hint: you can use save() to save several data objects at once.) Now we will save a copy of "USArrests" in human readable form, i.e., as a text file. You can read this file in a text editor or a word processor. You can load it into a spreadsheet to edit it. You can even load it into other statistics software, such as SPSS. The file format we will use is called CSV, short for comma separated variables. Every spreadsheet program can read these files, every text editor will open them, and you can even view them with a browser such as Firefox, provided you use the file extension ".txt". > write.table( USArrests, file = "arrests.txt", sep = "," ) > dir() If you go to your working directory in Finder or Explorer and click on this file, it will open in a text editor like TextEdit or Notepad. Now we will remove USArrests from the working directory and reload it from this file. In this case, reloading will involve making an assignment, which was not the case above, so observe and make note of the difference: > rm( USArrests ) > ls() # Gone! > USArrests = read.table( file = "arrests.txt", sep = ",") > ls() It's also possible to save these files in other text formats, such as tab separated variables (sep = "\t"), but that is for another time. Question: What happens if you forget to the do the assignment? (I sometimes do this on purpose to make sure the file is being read properly.) ***** More Stuff You Have Learned ***** 26) The difference between the workspace and the working directory. 27) How to find out what the working directory is: getwd() 28) How to save your workspace: save.image() or save.image("filename.RData") 29) What the .RData file is. 30) How to load a workspace image: load() 31) How to save a dataframe: save() 32) How to see a directory listing of the current working directory: dir() 33) How to load a dataframe from a file you previously saved: load() 34) How to write a dataframe to a CSV file: write.table() 35) How to load a dataframe from a CSV file: read.table() ***** Dataframes With A Freq Column ***** You can remove "USArrests"--and all the other junk we've created--from your workspace now. We're done with it. Instead, we are now doing to look at a data object called "UCBAdmissions": > rm( USArrests, USArrests.bak, USArrests.sorted, newData, SO ) > # a warning will result if any of these objects have already been removed > data( UCBAdmissions ) > ls() > UCBAdmissions What's this? Clearly, it is not a dataframe. In fact, "UCBAdmissions" is an array, or a three-dimensional contingency table. Applicants to graduate school at Berkeley in the six largest departments in 1973 are classified by gender and whether or not the applicant was admitted. In total, 4526 applicants are described in the table. This is not a convenient form in which to examine the data. When contingency tables become multidimensional like this, they get awkward. An easier form in which to visualize the data is as a flat table: > ftable( UCBAdmissions ) This is a much more compact form for the data summary, but it is still not a proper dataframe. To convert "UCBAdmissions" to a dataframe, do this: > UCB = as.data.frame( UCBAdmissions ) > UCB Now we have a dataframe, but it is not a dataframe in which each row represents a case or a single subject. There were 4526 subjects, but there are only 24 rows in the dataframe. This is because many of the subjects are the same, as far as we're concerned. There are 6 departments, 2 genders, and 2 admit statuses, which means there are 6x2x2=24 different combinations of these factor levels. So why create a dataframe with 4526 rows, when one with 24 rows will do? The final column, called "Freq", tells how often each of these combinations of factor levels occurs among the 4526 applicants to Berkeley. We will have reason to analyze this dataframe later in the course. For now, let's just get rid of it: > rm( UCB, UCBAdmissions ) -------------------------------------------------------------------------------