QUICK AND DIRTY GUIDE TO R (just about enough to cover a first stats course). by William B. King, Ph.D. Coastal Carolina University ------------------------------------------------------------------------------- Best used with R in front of you and running -- for simplicity I have done away with certain annoying statistical formalisms. ------------------------------------------------------------------------------- Getting it. Download R from www.r-project.org (click on the CRAN link on the left side of the page under downloads). As of this moment, the current version is 2.8.1. It is available for Windows, OS X, and Linux. Get the .exe file for Windows, the .dmg file for OS X, and for Linux the best way to install is via your package manager. In Ubuntu, for example, use Synaptic, or from a terminal... sudo apt-get install r-base ------------------------------------------------------------------------------- The R prompt. R is command-line driven. The prompt looks like a greater than sign, >. When you see it, start typing. (Much faster than hunting and clicking once you get used to it!) Press Enter when you have finished typing the command. If the command is not complete, R will issue a continuation prompt, which is a plus sign, +. You can type long commands on several lines to make it easier to keep track of syntax. You can also type them into a text editor and then cut and paste them into R. ------------------------------------------------------------------------------- Case sensitive. R commands are case sensitive. If you get a function not found error, check you spelling and your capitalization. R functions are generally, but not always, typed in lower case letters. On the other hand, spacing usually doesn't matter. Leave spaces wherever you like (except within names of things) to make your input easier to read. ------------------------------------------------------------------------------- R functions. R "commands" are called functions. The basic form of a function is: > function.name( arguments, options ) Even if there are no arguments or options, the parentheses are still mandatory. If you leave off the parentheses, R will print out the source code for the function you have entered. This can be startling when you're not expecting it, but it is not harmful. Just retype the command correctly. ------------------------------------------------------------------------------- Starting R. In Windows, double-click the desktop shortcut. In OS X, double-click the R icon in your /Applications folder (or drag a shortcut to the Dock). In Linux, start a terminal session, and type R and hit the Enter key. Here's what R looks like when it starts. R version 2.8.1 (2008-12-22) Copyright (C) 2008 The R Foundation for Statistical Computing ISBN 3-900051-07-0 R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. [Previously saved workspace restored] > | The > is the command prompt, and the vertical bar (may be a solid block, may or may not be blinking) is the cursor. ------------------------------------------------------------------------------- Quitting R. > quit() # remember, the parentheses are mandatory! Note: best to know this right away! Also, q() works. When R asks to save your workspace, pick "no" until you know what this means. If you have entered data into R and want it saved for your next R session, pick "yes". It is best not to clutter up your workspace with old stuff you no longer need. If you have created a data object called myData in your workspace, and you want to delete it before saving your workspace, do this: > rm( myData ) # remove( myData ) also works ------------------------------------------------------------------------------- Interactive mode. R can be used by writing programs into scripts and reading them in. It's easier to use R in interactive mood, typing individual commands at the command prompt. Notes or comments can be added to the R session by typing a hash or pound symbol, #. Anything on the line following this symbol will be ignored by R. ------------------------------------------------------------------------------- Simple data input. Small data sets can be entered using the c() function. This produces a data vector. Name it anything you want, but don't put spaces or dashes in the name. Use underline characters or periods instead. > my.data = c( 22, 38, 12, 23, 29, 18, 16, 24 ) Note: spaces after the commas and before and after parentheses are optional. > my.data=c(22,38,12,23,29,18,16,24) To examine what you've entered, just type its name. > my.data # this implicitly invokes print(my.data) [1] 22 38 12 23 29 18 16 24 To change value i in the list, address it as my.data[i]. For example... > my.data[2] = 28 > my.data [1] 22 28 12 23 29 18 16 24 Further instructions for reading in data from files and entering data at the keyboard will appear below. ------------------------------------------------------------------------------- R output. You may have noticed above that R not only printed out the values you stored in your vector, but it also printed out a 1 inside square brackets. When R is printing out a vector, it indexes all the values, but only the index for the first value on a line is shown. Thus, [1] means this line begins with data value number 1. Try this: > rivers [1] 735 320 325 392 524 450 1459 135 465 600 330 336 280 315 870 [16] 906 202 329 290 1000 600 505 1450 840 1243 890 350 407 286 280 [31] 525 720 390 250 327 230 265 850 210 630 260 230 360 730 600 [46] 306 390 420 291 710 340 217 281 352 259 250 470 680 570 350 [61] 300 560 900 625 332 2348 1171 3710 2315 2533 780 280 410 460 260 [76] 255 431 350 760 618 338 981 1306 500 696 605 250 411 1054 735 [91] 233 435 490 310 460 383 375 1270 545 445 1885 380 300 380 377 [106] 425 276 210 800 420 350 360 538 1100 1205 314 237 610 360 540 [121] 1038 424 310 300 444 301 268 620 215 652 900 525 246 360 529 [136] 500 720 270 430 671 1770 The first line of this output starts with value number 1, the second line with value number 16, the third line with value number 31, and so on. All of the values are indexed, so the value of 310 in the next to the last line is value number 123. It can be retrieved like this: rivers[ 123 ] [1] 310 Don't be confused by the fact that R called it value 1 when it printed it. It was the first value printed, so it is called 1. If you want values 10 through 20 from the rivers vector, you can get them this way: > rivers[ 10:20 ] [1] 600 330 336 280 315 870 906 202 329 290 1000 The printed values are considered a new vector and are indexed accordingly. ------------------------------------------------------------------------------- Simple frequency distributions. To put a simple frequency distribution into a vector, do this: X f ----------- 7 3 6 0 5 2 4 7 3 5 2 1 1 1 > X = 7:1 # same thing as X = c( 7, 6, 5, 4, 3, 2, 1 ) > f = c( 3, 0, 2, 7, 5, 1, 1 ) > the.data = rep( X, f ) # rep() is the repeat function > the.data [1] 7 7 7 5 5 4 4 4 4 4 4 4 3 3 3 3 3 2 1 ------------------------------------------------------------------------------- A simple (and grouped) frequency table (kinda). > table( the.data ) # use your variable name inside the ( ) the.data 1 2 3 4 5 7 # the X "column" 1 1 5 7 2 3 # the f "column" Note: there are no 6's, so R doesn't report them. Note: if you want it in columns, do this: > as.data.frame( table( the.data )) > table( my.data ) # not much to see here my.data 12 16 18 22 23 24 28 29 1 1 1 1 1 1 1 1 **grouping (a bit tricky)** > bins = seq( from=10, to=30, by=5) > # bins = c( 10, 15, 20, 25, 30 ) would give the same result > table( cut( my.data, bins )) (10,15] (15,20] (20,25] (25,30] # closed on the right by default 1 2 3 2 Or, since doing more than one thing at a time makes most students nervous... > bins = c(10, 15, 20, 25, 30) > temp.table = cut( my.data, bins, right=FALSE ) > table(temp.table) Note: the right=FALSE (or right=F) option closes the bins on the left. > stem( my.data ) # stem-and-leaf display is better anyway! The decimal point is 1 digit(s) to the right of the | 1 | 2 1 | 68 2 | 234 2 | 89 ------------------------------------------------------------------------------- Histogram. > hist( the.data ) Note: bars in R histograms include the right limit but not the left. > hist( the.data, right=F ) # now includes the left, not the right This makes it much clearer... > bins = seq( .5, 7.5, 1 ) # creates a sequence from .5 to 7.5 by 1 > hist( the.data, breaks=bins ) ------------------------------------------------------------------------------- Categorical (nominal) data. > colors = c( "red", "black", "brown", "blonde" ) > freqs = c( 8 , 22 , 30 , 18 ) # spaces don't matter > hair.color = rep( colors, freqs ) > table( hair.color ) hair.color black blonde brown red 22 18 30 8 > barplot( table( hair.color )) > barplot( table( hair.color ), col=c( "black","yellow","brown","red" )) Note: the order of the bars can be changed, too. The example above uses one function embedded within another. You don't have to do it this way. > temp.table = table( hair.color ) > barplot( temp.table ) > rm( temp.table ) # if you no longer need it ------------------------------------------------------------------------------- Summation. > sum( the.data ) # sigma X [1] 77 > sum( the.data ) ^ 2 # (sigma X) quantity squared [1] 5929 > sum( the.data ^ 2 ) # sigma (X squared) [1] 359 Note: the squared values themselves can be returned with... > the.data ^ 2 [1] 49 49 49 25 25 16 16 16 16 16 16 16 9 9 9 9 9 4 1 > x = c( 2, 3, 5 ) > y = c( 3, 8, 9 ) > sum( x * y ) # sum of the crossproducts [1] 75 > sum(x) * sum(y) # sigma X times sigma y [1] 200 There is no function in R for the sum of squares. However, it can easily be calculated using the arithmetic functions in R: > sum( the.data ^ 2 ) - sum( the.data ) ^ 2 / length( the.data ) Here are some other built-in math functions in R: > sqrt() > log() # returns the natural log > log10() # base-10 logs > sin() # and all the other trig functions > abs() # absolute value And there are many, many, many more! ------------------------------------------------------------------------------- Descriptive statistics. **sample size** > length( the.data ) [1] 19 > length( hair.color ) [1] 78 **sample median** > median( the.data ) [1] 4 **sample mean** > mean( the.data ) [1] 4.052632 **range** > range( the.data ) # reports the min and max values [1] 1 7 **IQR** > IQR( the.data ) [1] 1.5 Note: this may not be the "textbook" IQR. R can use 9 different methods to calculate Q1 and Q3. By default it uses a different one than most elementary textbooks do. See below for a fix. **sample variance** > var( the.data ) # there is no function for pop. variance [1] 2.608187 **sum of squares** > var( the.data ) * ( length( the.data ) - 1 ) [1] 46.94737 Note: there is no built-in function, but it is easy enough to create one. **sample standard deviation** > sd( the.data ) # there is no function for pop. sd [1] 1.614988 **more or less all at once** > summary( the.data ) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.000 3.000 4.000 4.053 4.500 7.000 Note: these quartiles are technically correct for continuous data. The "textbook" method would give Q3 = 5, which is correct for discrete data. To get this... > quantile( the.data, c( .25,.75 ), type=2 ) # textbook quartiles! 25% 75% 3 5 ------------------------------------------------------------------------------- Unit normal distribution. Note: R has a built-in function for converting to z-scores, but it is too confusing for most students, I'd bet. Instead, they can use R as a calculator. If mu = 25 and sigma = 8, convert these scores to z-scores: 18, 28, 38, 21. **one at a time** > (18 - 25) / 8 [1] -0.875 **or all at once** > scores = c( 18, 28, 38, 21 ) > z = (scores - 25) / 8 # assignment to variable z is optional > z [1] -0.875 0.375 1.625 -0.500 **probabilities under the normal curve** > pnorm( 0.375 ) # area at or below z [1] 0.6461698 > 1 - pnorm( 0.375 ) # area above z [1] 0.3538302 > pnorm( z ) # for all of the values stored in z at once [1] 0.1907870 0.6461698 0.9479187 0.3085375 **the inverse z function** > qnorm( .9 ) # the z with 90% of area at or below it [1] 1.281552 ------------------------------------------------------------------------------- Single sample z-test. Isn't built in -- must be calculated by hand. ------------------------------------------------------------------------------- Single sample t-test. > the.data # just to be sure it's still there [1] 7 7 7 5 5 4 4 4 4 4 4 4 3 3 3 3 3 2 1 > t.test( the.data, mu=5 ) # H-null says mu=5, two-tailed One Sample t-test data: the.data t = -2.557, df = 18, p-value = 0.01981 alternative hypothesis: true mean is not equal to 5 95 percent confidence interval: 3.274232 4.831031 sample estimates: mean of x 4.052632 One-tailed tests follow... > t.test( the.data, mu=5, alternative="greater" ) One Sample t-test data: the.data t = -2.557, df = 18, p-value = 0.99 alternative hypothesis: true mean is greater than 5 95 percent confidence interval: 3.410155 Inf sample estimates: mean of x 4.052632 > t.test( the.data, mu=5, alternative="less" ) One Sample t-test data: the.data t = -2.557, df = 18, p-value = 0.009904 alternative hypothesis: true mean is less than 5 95 percent confidence interval: -Inf 4.695109 sample estimates: mean of x 4.052632 ------------------------------------------------------------------------------- Independent t-test. > x1 = c( 18, 25, 17, 20, 23 ) > x2 = c( 20, 30, 22, 25, 28, 30 ) **pooled variance** > t.test( x1, x2, var.equal=TRUE ) Two Sample t-test data: x1 and x2 t = -2.2395, df = 9, p-value = 0.05188 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -10.51953398 0.05286732 sample estimates: mean of x mean of y 20.60000 25.83333 **unpooled variance (the default)** > t.test( x1, x2 ) Welch Two Sample t-test data: x1 and x2 t = -2.2903, df = 8.995, p-value = 0.04776 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -10.40273450 -0.06393217 sample estimates: mean of x mean of y 20.60000 25.83333 **one-tailed (R always subtracts group 1 - group 2)** > t.test( x1, x2, var.equal=T, alternative="less" ) Two Sample t-test data: x1 and x2 t = -2.2395, df = 9, p-value = 0.02594 alternative hypothesis: true difference in means is less than 0 95 percent confidence interval: -Inf -0.9497217 sample estimates: mean of x mean of y 20.60000 25.83333 **the formula interface (a more sophisticated way)** > all.scores = c( x1, x2 ) # combines x1 and x2 into one vector > all.scores [1] 18 25 17 20 23 20 30 22 25 28 30 > grp = c( "x1", "x2" ) # the group names > n = c( 5, 6 ) # the groups sizes > groups = rep( grp, n ) # create the labels > groups [1] "x1" "x1" "x1" "x1" "x1" "x2" "x2" "x2" "x2" "x2" "x2" > t.test( all.scores ~ groups ) # "all.scores by (tilde) groups" Welch Two Sample t-test data: all.scores by groups t = -2.2903, df = 8.995, p-value = 0.04776 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -10.40273450 -0.06393217 sample estimates: mean in group x1 mean in group x2 20.60000 25.83333 **a test for homogeneity of variance (better than Hartley F-max)** > bartlett.test( list( x1, x2 )) Bartlett test of homogeneity of variances data: list(x1, x2) Bartlett's K-squared = 0.1994, df = 1, p-value = 0.6552 > bartlett.test(all.scores ~ groups) # also works with formula interface Bartlett test of homogeneity of variances data: all.scores by groups Bartlett's K-squared = 0.1994, df = 1, p-value = 0.6552 Also... > var.test( x1, x2 ) # significance test for two variances F test to compare two variances data: x1 and x2 F = 0.636, num df = 4, denom df = 5, p-value = 0.6818 alternative hypothesis: true ratio of variances is not equal to 1 95 percent confidence interval: 0.08608992 5.95601427 sample estimates: ratio of variances 0.6360225 > var.test( all.scores ~ groups ) # the formula interface also works ------------------------------------------------------------------------------- Dependent (related samples) t-test. > before = c( 18, 17, 20, 19, 22, 24 ) > after = c( 20, 22, 19, 21, 25, 25 ) > t.test( before, after, paired=TRUE ) Paired t-test data: before and after t = -2.4495, df = 5, p-value = 0.05797 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -4.09887128 0.09887128 sample estimates: mean of the differences -2 Note: R always subtracts first group - second group. One-tailed tests can be specified as in the independent t-test with the option alternative=. Use alternative="less" if you think "before" mean is less than "after" mean. Otherwise, use alternative="greater". ------------------------------------------------------------------------------- ANOVA (oneway, between subjects) > x1 # still there (if not, re-enter it) [1] 18 25 17 20 23 > x2 # still there [1] 20 30 22 25 28 30 > x3 = c( 35, 27, 27, 30, 40, 33 ) # a new level of X > all.scores = c( x1, x2, x3 ) # combined into a single data vector > grp = c( "x1", "x2", "x3" ) # the group names > n = c( 5, 6, 6 ) # the group sizes > groups = rep( grp, n ) # labels for each score > results = aov( all.scores ~ factor( groups )) > summary( results ) Df Sum Sq Mean Sq F value Pr(>F) groups 2 358.20 179.10 9.5691 0.002402 ** Residuals 14 262.03 18.72 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Notes: the formula interface is mandatory for the aov() function. To get the ANOVA table, store the output of the aov() function and then use the summary() function to see the table. The function oneway.test() also does a oneway ANOVA, but the full ANOVA table is not produced. By default, oneway.test() applies a Welch-like correction for nonhomogeneity of variance. R can get very snotty if you don't correctly label the factor. If it is coded numerically, you MUST convert it to a factor before doing the ANOVA. > TukeyHSD( results ) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = all.scores ~ factor(groups)) $`factor(groups)` diff lwr upr p adj x2-x1 5.233333 -1.6231312 12.08980 0.1493412 x3-x1 11.400000 4.5435354 18.25646 0.0017968 x3-x2 6.166667 -0.3707158 12.70405 0.0656473 Advantage of the aov() function--a post hoc (Tukey) test can be done. Advantage of the oneway.test() function--it can correct for nonhomogeneity. > oneway.test( all.scores ~ groups ) One-way analysis of means (not assuming equal variances) data: all.scores and groups F = 9.4638, num df = 2.000, denom df = 9.299, p-value = 0.005727 ------------------------------------------------------------------------------- Correlation. > x = rnorm( 100, 50, 10 ) # 100 normally distributed random numbers > y = rnorm( 100, 75, 20 ) # 100 more (quick and dirty data!) > plot( x, y ) # scatterplot > cor( x, y ) [1] 0.00620442 # Pearson's r is the default > cor( x, y, method="spearman" ) # Spearman rho (don't capitalize it!) [1] -0.002076208 > cor.test( x, y ) # correlation t-test Pearson's product-moment correlation data: x and y t = 0.0614, df = 98, p-value = 0.9511 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.1904458 0.2023759 sample estimates: cor 0.00620442 Note: one-tailed tests are also possible by using the alternative= option. Use alternative="less" if you think the correlation is less than zero. Otherwise, use alternative="greater". ------------------------------------------------------------------------------- Regression. > lm( y ~ x ) # linear regression requires formula interface Call: lm(formula = y ~ x) Coefficients: (Intercept) x 74.43864 0.01112 > my.lm.analysis = lm( y ~ x ) # storing the results gives more output > summary( my.lm.analysis ) Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -38.663 -14.562 -0.173 13.702 50.507 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 74.43864 9.07285 8.205 9.27e-13 *** x 0.01112 0.18097 0.061 0.951 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 17.45 on 98 degrees of freedom Multiple R-Squared: 3.849e-05, Adjusted R-squared: -0.01017 F-statistic: 0.003773 on 1 and 98 DF, p-value: 0.9511 ------------------------------------------------------------------------------- Chi-square goodness of fit. > hair.color # still there? (if not, re-enter it) [1] "red" "red" "red" "red" "red" "red" "red" "red" [9] "black" "black" "black" "black" "black" "black" "black" "black" [17] "black" "black" "black" "black" "black" "black" "black" "black" [25] "black" "black" "black" "black" "black" "black" "brown" "brown" [33] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown" [41] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown" [49] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown" [57] "brown" "brown" "brown" "brown" "blonde" "blonde" "blonde" "blonde" [65] "blonde" "blonde" "blonde" "blonde" "blonde" "blonde" "blonde" "blonde" [73] "blonde" "blonde" "blonde" "blonde" "blonde" "blonde" > chisq.test( table( hair.color )) Alternatively, you can do this one step at a time... > temp.table = table( hair.color ) > chisq.test( temp.table ) Chi-squared test for given probabilities data: table(hair.color) X-squared = 12.8718, df = 3, p-value = 0.004922 > chisq.test( c( 22,18,30,8 )) # also works from raw frequencies Chi-squared test for given probabilities data: c(22, 18, 30, 8) X-squared = 12.8718, df = 3, p-value = 0.004922 Note: the default is equal expected frequencies. If expected frequencies are unequal, enter them as proportions or probabilities. > expected = c( 1/4, 1/4, 1/3, 1/6 ) # best to use fractions if possible > sum( expected ) # must sum to 1, of course [1] 1 > observed = c( 22, 18, 30, 8 ) # observed must be frequency counts > chisq.test( observed, p=expected ) Chi-squared test for given probabilities data: observed X-squared = 2.9744, df = 3, p-value = 0.3956 ------------------------------------------------------------------------------- Chi-square test of independence. If the data have already been crosstabulated, this is easy. Otherwise, you'll need to learn to use the xtabs() and table() functions for crosstabulation (see below). variable one dems repubs other male 22 35 12 variable two female 28 30 18 > row1 = c( 22, 35, 12 ) > row2 = c( 28, 30, 18 ) > crosstabs = rbind( row1, row2 ) # binds rows into a table > crosstabs [,1] [,2] [,3] row1 22 35 12 row2 28 30 18 > chisq.test( crosstabs ) Pearson's Chi-squared test data: crosstabs X-squared = 1.9713, df = 2, p-value = 0.3732 Note: for 2x2 tables, R does the Yates correction by default. To prevent this: > chisq.test( crosstabs, correct=F ) To see more information, store the results of the test... > chisq.results = chisq.test( crosstabs ) > chisq.results > chisq.results $ expected # expected frequencies > chisq.results $ residuals # unsquared chi values in each cell ------------------------------------------------------------------------------- Getting help. > help( function.name ) # opens a somewhat helpful help page > help( t.test ) # ?t.test is a shortcut > help.search( "distribution" ) # looks for anything to do with quoted terms Note: the R manuals are included with the download in both PDF and html format. Look for them in the root R directory. Note: in Windows and OS X, the help() function opens a new window, which can be closed like any other window. In Linux, help pages are opened right in the terminal session. To close the help page, press the Q key. ------------------------------------------------------------------------------- Reading in a dataframe from a csv file. The best way I've found to handle large data files is to enter them into a spreadsheet and then save them in comma separated format. The file should have column names, one column per variable, and one row per case, and NOTHING else. If you drop a copy of it into your working directory, you can read it into a dataframe as follows: > dataframe.name = read.csv( "filename.csv" ) It's best to keep the names of dataframes in memory SHORT, so maybe... > df = read.csv( "filename.csv" ) Then the following functions are available: > names( df ) # get the names of the variables > dim( df ) # get the size of the dataframe > head( df ) # see the header and top six lines of data > summary( df ) # summarize all the variables > mean( df ) # find the mean of the numerical variables > str( df ) # gives a lot of info about the dataframe The easiest way to work with the variables in the dataframe is to use the $ notation: > mean( df $ scores) # find the mean of the variable named scores Statistical tests can be done by specifying the name of the dataframe in a data option, but only if the formula interface is used: > t.test( scores ~ groups, data=df ) ------------------------------------------------------------------------------- Housekeeping. > ls() # shows the names of objects you've created > remove( object.name ) # deletes the named object; rm() also works Note: it's best to keep your workspace cleaned up by deleting unused objects. > dir() # get a dir listing of the working directory > getwd() # get the name of the working directory > setwd() # set (change) the working directory Note: in Windows it's best to use the pulldown menus to change directories rather than attempt to navigate through the file system. It's best not to clutter up the root R directory with a lot of saved work files. Make a new directory in My Documents (in Windows) and switch to it when you start R. > save( the.data, file="thedata.rda" ) Note: this is the easiest way to save data, as a data vector in a proprietary file format. The file will be saved in the working directory. To read it back in at some later work session: > load( file = "thedata.rda" ) # rda is short for R Data Entire dataframes can be stored this way as well: > df = USArrests # gets a dataframe to work with > head( df ) # see? > save( df, file = "df.rda" ) > rm( df ) > ls() # it's gone! > load( file = "df.rda" ) > ls() # and it's back again! There are many other commands for saving and loading data files, too numerous to detail here. Here is a short list: > read.table() # reads in a text file with values separated # by white space (spaces, tabs, etc.) > library("foreign") # loads an optional library > read.spss() # read in an SPSS .sav file The syntax of these commands is somewhat more complex than for read.csv(). ------------------------------------------------------------------------------- SOME MORE ADVANCED MATERIAL ------------------------------------------------------------------------------- A further tutotial on dataframes. Since dataframes are the primary data structure in R, you may find some additional information on them helpful. Use the following tutorial when it becomes necessary. Most statistical software, like SAS, SPSS, and Minitab, use a particular data structure called a dataset, a rectangular table in which the columns are variables and the rows are cases or observations. Often, cases (rows) correspond to subjects, but this is not always true. A short example is given here. As you can see, the dataset can contain a mix of categorical and numerical variables: gender class score1 score2 subject1 male freshman 28.3 33.8 subject2 female junior 32.0 30.8 subject3 female senior 26.7 40.2 ... R also uses a similar data structure called a dataframe. Unlike the other software packages, however, R can use many other data structures, such as vectors, matrices, arrays, and so forth, but for now we do not need to know about them, except for just a bit about vectors. A vector is about the simplest form for R data objects. In R, even single values are considered vectors. All of the following assignments create vectors, as per the comment (#): > a = 6 # a vector containing one element > a = c( 6, 4, 8, 3 ) # a vector containing four elements > a = 1:10 # a vector containing the values 1 to 10 > a = seq( from=1, to=12, by=3 ) # a vector containing 1, 4, 7, and 10 > a = "Fred" # a vector containing one character value > a = c( "Fred", "Ginger" ) # ditto with two character values > remove( a ) # erasing unnecessary junk Important: The columns of dataframes are vectors! For educational purposes, R comes with a large number of built-in datasets (not all of which are in the form of dataframes). We will use one called "USArrests". Type the following command: > ls() The parentheses are manditory, and so is lower case. This will show you the contents of your workspace, which is an area of the computer's RAM memory that contains data objects you have created. If you have not created any, and the workspace is empty, you will see the following: > ls() character(0) That's R's way of telling you, "Nothing to see here." To place a copy of the "USArrests" dataframe in your workspace, do this: > data( USArrests ) If you execute the ls() function again, you should see that your workspace includes this data object. Now let's find out some things about this dataframe. First, let's find out how big it is. In R, spacing usually doesn't matter, so any of the following forms of the command will work. Try them all: > dim(USArrests) > dim (USArrests) > dim ( USArrests ) > dim( USArrests ) This will print out two numbers (actually 3, but the one in square brackets doesn't count). These numbers tell you how many rows there are in the dataframe, not counting the row of variable labels at the top, and the number of columns in the dataframe, not counting the column of row names at the left. To see the names of the columns, and hence the variable names, do this: > colnames( USArrests ) Or this will give the same result: > names( USArrests ) So now you know the dataframe contains information on four variables, and these variables are called "Murder", "Assault", "UrbanPop", and "Rape". To see the row names, i.e., the names assigned to the cases or subjects in the dataframe, do this: > rownames( USArrests ) Now you know there is a row in the dataframe for every state in the country. To see both column names and row names at once, do this: > dimnames( USArrests ) To see the first half dozen lines of the dataframe and get a bit better idea of its structure, do this: > head( USArrests ) The first thing you'll see is a line of variable names or column names, also called the header. Then you will see a row of data for the first case, which is the state of Alabama. And so on. The numbers, by the way, give the number of arrests in 1973 per 100,000 population in each state for murder, assault, and rape, and it also gives the percentage of the population living in urban areas within the state. A reminder: the spacing in these commands is optional. I'm using extra spaces to make the commands more easily readable, but to save typing, you could have typed that last command as head(USArrests). Every entry in this table-like structure is indexed by two numbers, the row number and the column number. The first entry in the table is the murder rate for the state of Alabama. This value is in the first row and the first column, and it can be referred to as follows: > USArrests[ 1, 1 ] Take special note of the square brackets. Also make a note of the fact that, when using index numbers (sometimes called subscripts) like this, the first number is ALWAYS the number of the row, and the second number is the number of the column. The space after the comma is optional--as are all the spaces. You can also refer to the values in the dataframe by row name and column name (if these exist). To do so, do this: > USArrests[ "Alabama", "Murder" ] Once again, the name of the row must come first followed by the name of the column, and the names must be quoted. I should point out that in an R dataframe row names and column names always exists. If you don't create them yourself, R will create them for you. Let's create a new dataframe to see how this works. For now, never mind exactly what the following commands are doing. Just type them in as you see them here: > myData = matrix( c( 2, 5, 3, 4, 6, 7, 3, 2, 18, 22, 17, 10 ), ncol=3 ) > myData = as.data.frame( myData ) > myData As you can see, R has supplied row and column names for the dataframe. The row names appear to be numbers, but they are not! They are character values. I.e., the name of the first row is "1", a character like "A" or "Fred", and not the number 1. You can confirm this by asking for rownames(myData). Notice the row names are quoted. R always quotes character values and never quotes numbers. The column names are V1, V2, and V3, which stand for "Variable 1", etc. Now let's name the rows and columns. Do this: > colnames( myData ) = c( "Quiz1", "Quiz2", "Exam" ) > rownames( myData ) = c( "Jeff", "Barbara", "Rachel", "Bob" ) > myData As you can see, row names and column names are assigned by using the rownames() and colnames() functions. Without an assignment, these functions simply print out what is already there. With an assignment, they change what is there. "Bob" is upset, however, because he prefers to go by "Robb". This is easily enough changed: > rownames( myData )[4] = "Robb" > myData Frankly, I found this syntax a little confusing at first, but it makes sense once you realize how R does things. For now, just remember it! You can think of rownames(myData) as "containing" the row names for the dataframe, and rownames(myData)[4] as being the fourth row name. Assigning a new value to rownames(myData)[4] changes its value. Now let's add a column to myData. There are two ways to do this, so we will add two columns to illustrate them both: > myData[4] = c( 24, 28, 30, 20 ) > myData This is a bit mysterious, but it is worth remembering. If you give the name of the dataframe only one index (or subscript) inside the square brackets, R will assume you are referring to a column. Let's name it: > colnames( myData )[4] = "Paper" > myData Be careful about the indexes here. If you had used 3 instead of 4, column 3 would have been replaced. Now let's add a fifth column: > myData $ Major = c( "Psych", "Biology", "Math", "Business" ) > myData The dollar sign notation is a way of referring to any column (variable) inside a dataframe. For example: > myData$Exam Once again, there are optional spaces around the $, so to make this easier to read, you could also have typed this: > myData $ Exam Now back to our new column. Since the variable "Major" did not already exist, R created it and stuck it into a new column in the dataframe. Notice that, using this method, we did not even have to do an extra step to name the column. Let's get rid of this mess we've made: > rm( myData ) # remove( myData ) would do the same > ls() ***** Intermission: Things You've Learned So Far ***** 1) The structure of a dataframe: rows and columns with cases in the rows and variables in the columns. The columns of a dataframe are vectors, so you also should have learned a little about vectors! 2) How to copy a built-in dataset into your workspace: data(). 3) How to see the contents of your workspace: ls() 4) How to find out the size of a dataframe: dim() 5) How to find out the column names: colnames(), names(), or dimnames() 6) How to find out the row names: rownames() or dimnames() 7) How to see the top six cases in the dataframe: head() 8) What a header is: the variable names at the top of the dataframe. 9) How to pull any single value out of the dataframe by using index numbers: dataframe.name[ row.number, col.number ] 10) How to pull any single value out of the dataframe by using row names and column names: dataframe.name[ "row.name", "col.name" ] 11) How to change the column names: colnames() = a vector of quoted names 12) How to change the row names: rownames() = a vector of quoted names 13) How to change a single row name or column name: look it up if you've forgotten! 14) How to add a column to a dataframe: two methods, which once again you should review if you've already forgotten them. 15) How to refer to a variable inside a dataframe using the $ notation. ***** Changing Values Inside a Dataframe ***** Suppose you need to change a value inside the dataframe. I bet by now you can guess how to do it. I have one caution. Later we will learn that dataframes can be "attached" using the attach() function. Never mind for right now what this means. Just heed this warning: NEVER MODIFY AN ATTACHED DATAFRAME! If the dataframe is attached, use the detach() function to detach it before you try to modify it. The world will not come to an end if you forget this, but things can get very confusing if you are careless about it. Let's look at the percentage of people living in urban areas in Iowa: > USArrests[ "Iowa", "UrbanPop" ] [1] 57 I have included the output of the command this time, because we will need it later. Let's change that value to 33: > USArrests[ "Iowa", "UrbanPop" ] = 33 > USArrests[ "Iowa", "UrbanPop" ] [1] 33 It's as simple as that. You could also have used the row index number and column index number (subscripts), if you knew them, in place of the row name and column name. Now, of course, we should put back the correct value: > USArrests[ "Iowa", "UrbanPop" ] = 57 > USArrests[ "Iowa", "UrbanPop" ] [1] 57 ***** Looking At Subsets of Dataframes ***** Now for some more complex stuff--looking at subsets of dataframes. This can get tricky, so follow along closely! Let's begin by looking at just one row, say that for Pennsylvania: > USArrests[ "Pennsylvania", ] # make sure you include the comma You could also have used the row number, if you'd known it! > USArrests[ 38, ] # once again, the comma must be there This shows the value of having meaningful row names. However, in a moment we will have to fall back on the row numbers. The thing to pay close attention to is the comma after the row index inside the brackets. The comma followed by a blank column index tells R you want to see the entire row (all columns). If you want to see one column, all you have to do is name (or number) it: > USArrests[ "UrbanPop" ] > USArrests[3] The comma is optional, but if you use it, it should come before the column index: > USArrests[ ,"UrbanPop" ] > USArrests[ ,3] Notice how the result is different from the one you got above (without the comma). Technically, if you use the comma, the result is a data object called a vector. If you don't use the comma, the result is a new dataframe. To see the murder rate column, any of these would work (I'm setting you up here, so pay close attention to what's happening): > USArrests[ "Murder" ] > USArrests[ ,"Murder" ] > USArrests[1] > USArrests[ ,1] Now suppose you want to see rows 10 through 20. In R, the notation 10:20 means "10 through 20", so we can do this: > USArrests[ 10:20, ] Either of the following would do the same: > USArrests[ c(10,11,12,13,14,15,16,17,18,19,20), ] > USArrests[ seq( from = 10, to = 20, by = 1 ), ] Suppose you want to see only rows in which the murder rate is greater than 12. This can be achieved as follows: > USArrests[ Murder > 12, ] Oops! That didn't quite work, did it? The idea was right, but there is a technicality. When R was searching for the variable called Murder, it did not know to look inside the dataframe. In technical "R-speak", the dataframe is not in the search path. I know what you're thinking! "But, but, but, it worked above when we executed USArrests["Murder"]." There is a subtle but crucially important difference. When we quoted the word "Murder", we were naming a column. When the word Murder was left unquoted, we were asking R to look at the values stored in this variable. This is going to confuse you for awhile, so you might want to repeat that to yourself a few times. I still get error messages from not using quotes when I should or using them when I shouldn't. A NAME IN QUOTES IS JUST A NAME. A name in quotes IS a value. The value of "Murder" is "Murder", and it is the name of the first column in the dataframe. A NAME NOT IN QUOTES REFERS TO THE VALUES STORED INSIDE THE DATA OBJECT. Thus, Murder refers to a data object, and to the values stored therein. To find those values, R must know where to find Murder. The first place R will look is in the workspace. You can verify that no Murder exists there by using ls(). Then R will start looking through attached data objects and libraries. To see exactly where R will look, execute this command, which will show you the search path: > search() R does not yet know to look inside USArrests. We can tell it to do so as follows: > attach( USArrests ) This adds the dataframe "USArrests" to the search path, as you can verify as follows: > search() So now, after searching the workspace (".GlobalEnv") for Murder, R will look inside USArrests, and there it will find what it is looking for. Now our command will work: > USArrests[ Murder > 12, ] If we want just rows with murder rates higher than 12 and assault rates greater than or equal to 255, we can do this: > USArrests[ Murder > 12 & Assault >= 255, ] Question: What happens if you get the greater than or equal to sign the wrong way around, i.e., => instead of >=? If we want to work with this subset of rows, we can store them in a new data object: > newData = USArrests[ Murder > 12 & Assault >= 255, ] > newData ***** The Subset Function ***** Subsetting a dataframe can also be accomplished using the subset() function. Here are some examples: > subset( USArrests, select = UrbanPop ) > subset( USArrests, subset = Murder > 12 ) > subset( USArrests, subset = Murder > 12 & Assault >= 255 ) > subset( USArrests, subset = Murder > 12 & Assault >= 255, select = Rape ) Inside the subset() function, the option select= selects columns, while the option subset= selects rows. The subset= option must be set to a logical statement using >, <, >=, <=, == (equals), or != (not equals). It cannot be set to something like subset="Pennsylvania". The dataframe does not have to be attached for this to work. Of course, any of these newly created subsets can be stored in a new data object for further analysis. > just.UrbanPop = subset( USArrests, select = UrbanPop ) > head( just.UrbanPop ) > rm( just.UrbanPop ) # always clean up ***** Sorting Dataframes ***** Another thing you often want to do with a dataframe is sort it. Sorting a single variable is easy (with the dataframe attached): > sort( Murder ) > sort( Murder, decreasing = TRUE ) However, sorting an entire dataframe is a bit more complicated, as the rest of the columns must be sorted as well in order to keep them properly matched up with the Murder column. To do it, you use the order() function as follows: > SO = order( Murder ) > USArrests[ SO, ] SO is just a dummy variable, and we could have called it anything we wanted. I called it SO for "sort order". We could have accomplished the same thing by nesting the order() function inside the call to USArrests: > USArrests[ order( Murder ), ] Notice that we have not changed the dataframe itself: > USArrests To do that, we would have had to store the output of USArrests[ SO, ] into a new data object: > USArrests.sorted = USArrests[ SO, ] > USArrests.sorted > rm( USArrests.sorted ) To sort from highest to lowest, do this: > SO = order( Murder, decreasing = TRUE ) > USArrests[ SO, ] Or this would also work: > USArrests[ order( Murder, decreasing = TRUE ), ] ***** Pruning Out Unnecessary Variables ***** Unlike most statistical software, such as SAS and SPSS, R stores the workspace in RAM memory. Thus, there is a limit to the size of a dataframe that R can handle. That limit is still pretty big if you have a respectable amount of RAM memory in your computer, but to avoid taxing RAM memory, you may want to prune out unwanted variables. Suppose we do not need the UrbanPop variable in our dataframe. We can toss it out as follows: > detach( USArrests ) # never modify an attached dataframe! > USArrests[3] = NULL # or USArrests[ "UrbanPop" ] = NULL > attach( USArrests ) > USArrests Column three is no more (and "Rape" has been moved into the number 3 position). Be careful with this, because there is no recovering from it. Once R throws something away, it's gone for good! Of course, it would be wise to store a copy of the dataframe before altering it in some unrecoverable way. ***** Recoding Entire Variables ***** We changed single values above. Now lets change all the values in a variable. Remember the warning: NEVER MODIFY AN ATTACHED DATAFRAME. Suppose we wanted murder rates per 1,000,000 population instead of 100,000. This would involve multiplying the Murder variable by 10. Easy enough: > detach( USArrests ) > USArrests $ Murder = 10 * USArrests $ Murder > attach( USArrests ) > USArrests Here is another way to do the same thing using the transform() function. First, I'll put the dataframe back to the way it was, then I'll use transform() to make the same change we just made above: > detach( USArrests ) > USArrests $ Murder = USArrests $ Murder / 10 > USArrests = transform( USArrests, Murder = 10 * Murder ) > attach( USArrests ) > USArrests WARNING: I was just reminded of a valuable lesson here (the hard way, I admit). Be VERY CAREFUL when you are modifying your dataset. Make a backup copy first! I made a capitalization error in one of the above commands, and I destroyed the murder rate data! Thus, it would have been a good idea to insert the following command after the detach() command above: > USArrests.bak = USArrests That way, if you screw it up as badly as I did, you can recover your data from the backup copy in your workspace. Below I will show you how to save a copy to your hard drive as well. This may be necessary if you have a very large data set that is hogging up most of your computer's memory. Any mathematical transformation can be done this way. But suppose we want to do something a bit more complicated. Suppose we want to convert murder rate to a categorical variable with values of "high" and "low". Once this is done, if we overwrite the Murder variable with these new values, the old values will not be recoverable, so I suggest we create a new variable. One way to do it is this: > median( Murder ) # splitting at the median is a good idea [1] 72.5 > which( Murder == 72.5 ) # are any values right at the median? (no) integer(0) > > detach( USArrests ) > USArrests.bak = USArrests # not taking any chances this time!!! > > USArrests $ Murd.cat = ifelse( USArrests $ Murder > 72.5, "high", "low" ) > attach( USArrests ) > USArrests We have created a new column with the new variable Murd.cat in it. We did it with the ifelse() function. Read what it's done as follows: "If Murder is greater than 72.5 (the median), then assign Murd.cat the value "high", otherwise assign Murd.cat the value "low". > Murd.cat ***** Character Variables versus Factors ***** Character variables are names of things, like subjects' names, or subjects' addresses. Generally, we use a character variable when all the values of the variable will be different. A factor is a categorical variable that can be used as an independent or dependent variable. Generally, factors have only a few different values, called levels. Do this: > str( USArrests ) 'data.frame': 50 obs. of 4 variables: $ Murder : num 132 100 81 88 90 79 33 59 154 174 ... $ Assault : int 236 263 294 190 276 204 110 238 335 211 ... $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ... $ Murd.cat: chr "high" "high" "high" "high" ... I have reproduced the output so we can look at it in some detail. The str() function reveals the internal structure of a dataframe (and other data objects) in more detail than any of the functions we've seen so far. First it tells us that USArrests is, in fact, a 'data.frame'. Then it tells us the size: 50 observations (rows) by 4 variables (columns). Then it lists the name of each variable and what kind it is: Murder is numerical, Assault is integer (also numerical), Rape is numerical, and Murd.cat is a character variable. The first several values of each variable are also given. Before we go on, do this: > summary( Murd.cat ) We don't really want Murd.cat to be a character variable. We want it to be a factor, since it has only two values, and we might want to use it as an independent or dependent variable in our analysis. To change it, do this: > detach( USArrests ) > USArrests $ Murd.cat = factor( USArrests $ Murd.cat ) > attach( USArrests ) > str( USArrests ) Murd.cat is now a factor with two levels (values), "high" and "low". Internally, to save space, factors are stored as numbers, in this case 1 and 2. This will come in very handy later. (It means we do not have to recode if want to enter this variable into a regression analysis, for example.) Now that Murd.cat is a factor, we get a different result from: > summary( Murd.cat ) This time the output is a frequency table. There are 25 values of "high" and 25 values of "low". ***** Missing Values ***** It is a fact of life that data values sometimes turn up missing. Someone does not answer all questions on a survey, a subject does not show up for retesting, equipment fails, and so on. In R, missing values should be recorded as NA, for "not available". There are no missing values in the USArrests dataframe, and we don't really have to go into detail right now about how to deal with NAs. You should be aware of the possibility, however. When values are missing, special measures may have to be taken to deal with that circumstance. ***** A Caution About Attaching Dataframes ***** Do this: > detach( USArrests ) > Murder = 6 > ls() > attach( USArrests ) What's that all about? The USArrests dataframe contains a variable called Murder. But so does the Global Environment (your workspace), because you created it with the command Murder=6. R is notifying you that a conflict of sorts exists between these two variables. You are attaching USArrests because you want to be able to address the variables directly, but there is already a Murder variable in the workspace. If you try to work with the Murder variable inside USArrests by just typing Murder, R will see the value in your workspace first: > Murder This is a problem that must be resolved. One way to resolve it is to delete the Murder variable from the workspace. We don't need it anyway: > rm( Murder ) > Murder Now R is seeing the Murder variable inside USArrests. When you attach a dataframe, and you get a warning like the one you saw above, this is most likely the problem. Do something about it before you try to work with the variables inside the dataframe. ***** Intermission: More Things You've Learned ***** 16) Never modify an attached dataframe! 17) How to modify single values inside a dataframe. 18) Several ways to look at subsets of a dataframe. 19) How to sort a dataframe on a single variable using order(). 20) How to prune out unnecessary variables. 21) How to recode a variable (two methods). 22) The difference between a character variable and a factor. 23) How to turn a character (or numerical) variable into a factor. 24) Missing values happen. In R, the missing value code is NA. 25) Because R searches in so many places for the variables you are working with, conflicts can occur. Be careful. ***** How To Save a Dataframe ***** After all that, you may want to make a permanent copy of your dataframe. This involves knowing the difference between the workspace and working directory. The workspace is the area of RAM memory in which R stores the objects you've created, but it only stores them there temporarily. If the power goes off while you are working...poof! Workspace--and all your unsaved data objects--gone! When you quit R, it will ask you if you want to save your workspace. If you want to keep the objects you've created and your modifications to them, then the obvious answer is "yes". At this point, the workspace will be saved to your working directory. The working directory is a folder or directory on your hard drive (or other permanent storage medium) in which R looks by default for files you ask it to load or writes objects you ask it to save. To find out what your working directory is, do this: > getwd() The answer will be different depending upon what operating system you are working in (Windows, Mac OS X, or Linux). But this is where R is going to store the stuff you save. You may know that you can change the location of the working directory using setwd(), but this is for another time. In Windows and on the Mac, there are also pull down menus at the top of the R Console that allow you to set the working directory. On the Mac, it is under the Misc menu. In Windows it is under File (and is called Change dir...). If you don't know about pathnames in your directory structure, you may want to use these menus instead of trying to use setwd()! If you are in a long R session, and it is storming outside, you may want to save your workspace once in awhile before you actually quit R. The command to do so (can also be done from the menus in Windows and OS X) is: > save.image() You may want to use ls() first and get rid of old garbage by using rm(). It's amazing how much junk you can create during an R session. The save.image() function will save your entire workspace in a file called ".RData". The dot makes it an invisible file in OS X and Linux, but not in Windows. If you want a different name, just give it inside the function call: > save.image( file = "myWorkspace.RData" ) Make sure it has the .RData extension, however. This also allows you to save a separate workspace file for every project or problem you are working on. If you are not in the default working directory, i.e., the one R starts up in, you may find it necessary to load the workspace the next time you are working in this directory. This may be true even if you are working in the default working directory but have saved a workspace image under a nonstandard name. To load a workspace image (DON'T do this now; just make a note of it), do one of these: > load( ".RData" ) > load( "myWorkspace.RData" ) This will load the stored workspace image and ADD its contents to whatever is already in your workspace. To save a data object like a dataframe into a file of its own is almost as easy. Let's save the modified USArrests dataframe into a file called arrests.rda: > save( USArrests, file = "arrests.rda") You can see that the file has, in fact, been saved, by asking for a directory listing of the working directory: > dir() The save() function saves files in an encoded ("binary") format that is not human readable. It is an efficient way to save data objects, but neither you nor another software package will be able to read the file. Now let's remove the USArrests object from the workspace and then reload it from the saved file. FIRST, since USArrests is attached, we will begin by detaching it. ALWAYS remember to detach an attached dataframe. > detach( USArrests ) > ls() # Now you see it... > rm( USArrests ) > ls() # ...and now you don't. > load( file = "arrests.rda") > ls() # And there it is again. > USArrests Notice the data object has the same name you gave it while working with it and saving it, which is different from the filename you created for it. The dataframe is known as "USArrests" in the workspace, but it's known as "arrests.rda" on the hard drive. There is a good reason for this, but I'm not going to tell you what it is right now. (Okay, a hint: you can use save() to save several data objects at once.) Now we will save a copy of "USArrests" in human readable form, i.e., as a text file. You can read this file in a text editor or a word processor. You can load it into a spreadsheet to edit it. You can even load it into other statistics software, such as SPSS. The file format we will use is called CSV, short for comma separated variables. Every spreadsheet program can read these files, every text editor will open them, and you can even view them with a browser such as Firefox, provided you use the file extension ".txt". > write.table( USArrests, file = "arrests.txt", sep = "," ) > dir() If you go to your working directory in Finder or Explorer and click on this file, it will open in a text editor like TextEdit or Notepad. Now we will remove USArrests from the working directory and reload it from this file. In this case, reloading will involve making an assignment, which was not the case above, so observe and make note of the difference: > rm( USArrests ) > ls() # Gone! > USArrests = read.table( file = "arrests.txt", sep = ",") > ls() It's also possible to save these files in other text formats, such as tab separated variables (sep = "\t"), but that is for another time. Question: What happens if you forget to the do the assignment? (I sometimes do this on purpose to make sure the file is being read properly.) ***** More Stuff You Have Learned ***** 26) The difference between the workspace and the working directory. 27) How to find out what the working directory is: getwd() 28) How to save your workspace: save.image() or save.image("filename.RData") 29) What the .RData file is. 30) How to load a workspace image: load() 31) How to save a dataframe: save() 32) How to see a directory listing of the current working directory: dir() 33) How to load a dataframe from a file you previously saved: load() 34) How to write a dataframe to a CSV file: write.table() 35) How to load a dataframe from a CSV file: read.table() ***** Dataframes With A Freq Column ***** You can remove "USArrests"--and all the other junk we've created--from your workspace now. We're done with it. Instead, we are now doing to look at a data object called "UCBAdmissions": > rm( USArrests, USArrests.bak, USArrests.sorted, newData, SO ) > # a warning will result if any of these objects have already been removed > data( UCBAdmissions ) > ls() > UCBAdmissions What's this? Clearly, it is not a dataframe. In fact, "UCBAdmissions" is an array, or a three-dimensional contingency table. Applicants to graduate school at Berkeley in the six largest departments in 1973 are classified by gender and whether or not the applicant was admitted. In total, 4526 applicants are described in the table. This is not a convenient form in which to examine the data. When contingency tables become multidimensional like this, they get awkward. An easier form in which to visualize the data is as a flat table: > ftable( UCBAdmissions ) This is a much more compact form for the data summary, but it is still not a proper dataframe. To convert "UCBAdmissions" to a dataframe, do this: > UCB = as.data.frame( UCBAdmissions ) > UCB Now we have a dataframe, but it is not a dataframe in which each row represents a case or a single subject. There were 4526 subjects, but there are only 24 rows in the dataframe. This is because many of the subjects are the same, as far as we're concerned. There are 6 departments, 2 genders, and 2 admit statuses, which means there are 6x2x2=24 different combinations of these factor levels. So why create a dataframe with 4526 rows, when one with 24 rows will do? The final column, called "Freq", tells how often each of these combinations of factor levels occurs among the 4526 applicants to Berkeley. We will have reason to analyze this dataframe later in the course. For now, let's just get rid of it: > rm( UCB, UCBAdmissions ) ------------------------------------------------------------------------------- And finally, crosstabulation and working with contingency data. A note: I have changed my mind from what I've told people in the past, that it is easiest to work with contingency data in contingency table form. That's fine for a 2-way table, but for higher dimensional tables, I now advise to turn them into dataframes before trying to work with them. -------------------------------------------------------------- 2-way tables -------------------------------------------------------------- We will work directly from the contingency table. The following table crosstabulates people from Caithness, Scotland, by eye color (rows) and hair color (columns). > library( MASS ) # load an optional library called MASS > data( caith ) # copy the data into your workspace > caith # look at it fair red medium dark black blue 326 38 241 110 3 light 688 116 584 188 4 medium 343 84 909 412 26 dark 98 48 403 681 85 As luck would have it, whoever created this dataset stored it as a dataframe. It's not a proper dataframe, however, and should be stored as a table, array, or matrix. So... > caith = as.matrix( caith ) # turn "dataframe" into contingency table Now we can work on it using the various commands that work on contingency tables. For example, if we wish to name the dimensions of the matrix... > dimnames( caith ) = list( + Eyes = c( "blue", "light", "medium", "dark" ), + Hair = c( "fair", "red", "medium", "dark", "black" )) > caith Hair Eyes fair red medium dark black blue 326 38 241 110 3 light 688 116 584 188 4 medium 343 84 909 412 26 dark 98 48 403 681 85 (Note: since the actual color names are already there, it seems like there ought to be an easier way to do this. If there is, no one is telling me about it!) To see table margins... > addmargins( caith ) Hair Eyes fair red medium dark black Sum blue 326 38 241 110 3 718 light 688 116 584 188 4 1580 medium 343 84 909 412 26 1774 dark 98 48 403 681 85 1315 Sum 1455 286 2137 1391 118 5387 > > margin.table( caith ) # overall N [1] 5387 > margin.table( caith, 1 ) # the row marginal sums (index 1) Eyes blue light medium dark 718 1580 1774 1315 > margin.table( caith, 2 ) # column marginal sums (index 2) Hair fair red medium dark black 1455 286 2137 1391 118 The apply() function also works here, but you must specify the function sum... > apply( caith, 1, sum ) blue light medium dark 718 1580 1774 1315 To get proportions or percentages across the rows... > prop.table( caith, 1 ) Hair Eyes fair red medium dark black blue 0.45403900 0.05292479 0.3356546 0.1532033 0.004178273 light 0.43544304 0.07341772 0.3696203 0.1189873 0.002531646 medium 0.19334837 0.04735062 0.5124014 0.2322435 0.014656144 dark 0.07452471 0.03650190 0.3064639 0.5178707 0.064638783 > prop.table( caith, 1 ) * 100 Hair Eyes fair red medium dark black blue 45.403900 5.292479 33.56546 15.32033 0.4178273 light 43.544304 7.341772 36.96203 11.89873 0.2531646 medium 19.334837 4.735062 51.24014 23.22435 1.4656144 dark 7.452471 3.650190 30.64639 51.78707 6.4638783 To calculate proportions relative to the column sums, use index 2, and to calculate them relative to overall N, use no index. To see an interesting graphical form of the contingency table, try this... > mosaicplot( caith, shade=TRUE ) # output not shown A side-by-side barplot can be produced like this... > barplot( caith, beside=TRUE, legend=TRUE ) # not shown To test for independence of the two factors... > chisq.test( caith ) -> results > results Pearson's Chi-squared test data: caith X-squared = 1240.039, df = 12, p-value < 2.2e-16 > results $ expected # see the expected frequencies Hair Eyes fair red medium dark black blue 193.9280 38.11918 284.8275 185.3978 15.72749 light 426.7496 83.88342 626.7793 407.9785 34.60924 medium 479.1479 94.18303 703.7383 458.0720 38.85873 dark 355.1745 69.81437 521.6549 339.5517 28.80453 > > results $ residuals # see the residuals (unsquared chis) Hair Eyes fair red medium dark black blue 9.483979 -0.01930262 -2.596906 -5.537407 -3.209321 light 12.646503 3.50663997 -1.708741 -10.890844 -5.203033 medium -6.219798 -1.04927862 7.737531 -2.152635 -2.062785 dark -13.646052 -2.61077971 -5.195102 18.529854 10.470584 -------------------------------------------------------------- Multiway tables -------------------------------------------------------------- We will work with the HairEyeColor data, which is built in to R as a 3-way contingency table. To see it in various tabular forms... > HairEyeColor # as a contingency table > as.data.frame( HairEyeColor ) # as a dataframe with Freq column > ftable( HairEyeColor ) # as a flat contingency table > ftable( HairEyeColor, col.vars="Hair" ) # with a different column variable While the data are in multidimensional contingency table form, there is a quick and dirty way to test the mutual independence of all factors in the table... > summary( HairEyeColor ) # this works only for 3-way and up Number of cases in table: 592 Number of factors: 3 Test for independence of all factors: Chisq = 164.92, df = 24, p-value = 5.321e-23 Chi-squared approximation may be incorrect This is like a 3-way test of independence and is calculated using log linear analysis. Note: the warning message means that some of the expected frequencies are less than 5. Now we will convert to a dataframe and work from there... > hec = as.data.frame( HairEyeColor ) > head( hec ) # first six rows Hair Eye Sex Freq 1 Black Brown Male 32 2 Brown Brown Male 53 3 Red Brown Male 10 4 Blond Brown Male 3 5 Black Blue Male 11 6 Brown Blue Male 50 To crosstabulate any two variables in the dataframe... > xtabs( Freq ~ Hair + Eye, data=hec ) Eye Hair Brown Blue Hazel Green Black 68 20 15 5 Brown 119 84 54 29 Red 26 17 14 14 Blond 7 94 10 16 The first named variable will be in the rows, and the second named variable will be in the columns. Of course, this table can be saved to a data object and manipulated with all the commands given under 2-way tables above. To subset the data... > xtabs( Freq ~ Hair + Eye, data=hec, subset=Sex=="Male" ) Eye Hair Brown Blue Hazel Green Black 32 11 10 3 Brown 53 50 25 15 Red 10 10 7 7 Blond 3 30 5 8 To collapse over a single variable, just do crosstabs over the rest of them. In the example above, we "collapsed over sex." (No snickering please!) For a 4-way table, you can collapse over a variable by crosstabulating over the other three. > ti = as.data.frame( Titanic ) > head( ti ) > xtabs( Freq ~ Class + Gender + Sex, data=ti ) Note: if the dataframe does not include a Freq column, but contains a row for every single case, do the formula for xtabs() the same way but leave the Freq position blank. E.g., xtabs( ~ var.1 + var.2, data = data.frame.name ). In this case, the table function will also work: table( var.1, var.2 ), this assuming the dataframe is attached. (Otherwise, use with() or the $ notation.) > HEC.Male = xtabs( Freq ~ Hair + Eye, data=hec, subset=Sex=="Male" ) Now we treat this just like any other 2-way table... > prop.table( HEC.Male, 1 ) * 100 Eye Hair Brown Blue Hazel Green Black 57.142857 19.642857 17.857143 5.357143 Brown 37.062937 34.965035 17.482517 10.489510 Red 29.411765 29.411765 20.588235 20.588235 Blond 6.521739 65.217391 10.869565 17.391304 > chisq.test( HEC.Male ) -> results Warning message: In chisq.test(HEC.Male) : Chi-squared approximation may be incorrect > results Pearson's Chi-squared test data: HEC.Male X-squared = 41.2803, df = 9, p-value = 4.447e-06 > results $ expected Eye Hair Brown Blue Hazel Green Black 19.67025 20.27240 9.433692 6.623656 Brown 50.22939 51.76703 24.089606 16.913978 Red 11.94265 12.30824 5.727599 4.021505 Blond 16.15771 16.65233 7.749104 5.440860 > results $ residuals Eye Hair Brown Blue Hazel Green Black 2.7800288 -2.0593949 0.1843792 -1.4079851 Brown 0.3909276 -0.2455931 0.1854875 -0.4653869 Red -0.5621403 -0.6579360 0.5316647 1.4852600 Blond -3.2733341 3.2709053 -0.9875644 1.0971354 -------------------------------------------------------------------------------