R Tutorials--Simple Data Entry

SIMPLE DATA ENTRY AND DESCRIPTION

A Couple Tips

One reason people don't like command-line programs is that, if you make a mistake while typing a long command, you have to start over from scratch. Not so in R. Suppose you were trying to set your working directory to "Rspace", and you accidentally typed this.

> setwf("Rspace")                 # Type this into R.
Error: could not find function "setwf"

There is no "setwf( )" function in R, and R will cheerfully tell you the function was not found. Go ahead and see for yourself. Now, instead of retyping the whole line, R will allow you to recall it to the command line and edit it. To recall the previously typed command, just press the up arrow key. You can then use the right and left arrow keys to move through the command line. The Backspace and Delete keys (on a Windows keyboard) can be used to erase the errors. Then make corrections and press Enter. Your cursor does not even need to be at the end of the line when you press Enter. Try it.

> ### Press up arrow key here...
> setwd("Rspace")                 # Edit command using arrow keys, press Enter.
> getwd()
[1] "C:/Documents and Settings/kingw/My Documents/Rspace"

If you continue pressing the up arrow key, R will bring older and older commands to the command line. Thus, if you did something five commands ago, and you want to do it again, press the up arrow key five times to recall the command, then press Enter.

NOTE: If you don't know what Rspace is, or R is telling you it cannot change the working directory, then you probably didn't create the Rspace directory back in a previous tutorial. Use getwd() to find out what your working directory is, then using your operating system (Finder or whatever), create a folder inside that location and name it Rspace.
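
If you would rather not leave R, the folder can also be created from the console. What follows is only a minimal sketch, assuming you are allowed to write to your current working directory; the variable name target is made up for illustration.

```r
# Sketch: make sure an "Rspace" folder exists inside the current
# working directory, then switch to it.
target = file.path(getwd(), "Rspace")      # 'target' is a hypothetical name
if (!file.exists(target)) dir.create(target)
setwd(target)
getwd()                                    # should now end in "Rspace"
```

If dir.create() complains, the most likely cause is a permissions problem, and creating the folder through your operating system is the safer route.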

Here's another tip, and one you might be a bit miffed I didn't tell you earlier. You can copy and paste stuff into R. For example, suppose I told you to execute the following command. (And I am telling you to do that now!)

> boxplot(log(islands), main="Boxplot of Islands", ylab="log(land area)")

You're saying to yourself, "Oh man! I don't want to type all that, and I'm gonna get commas in the wrong place, and come on!" You don't have to type it. With your mouse--yes, that's right, your mouse!--highlight the line on this webpage (not including the command prompt or > symbol). Then either go to the Edit menu of your browser and choose Copy, or press Ctrl-C on your keyboard (hold down the Ctrl key and press c and then release both--on a Mac it's Command-C). Now, go into the R Console window and either pull down the Edit menu (in Windows) and choose Paste, or (with the cursor at a command prompt) press Ctrl-V (Command-V on a Mac). Either one will paste the command at the command prompt. Then press Enter.

Note to my Mac friends: On older Mac keyboards, the Command key is the one to the left of the space bar with the little flowery thing on it.

Now you know. Of course, you will have to type your own commands eventually, and that one wasn't a particularly long one.

Creating a Vector

Using built-in data objects is fine and dandy for demonstration purposes, but eventually you're going to want to enter and analyze your own data. If the data set is small, you can do this easily from within R. The following data were collected by a student doing his senior research project here at CCU. The numbers represent the number of items recalled correctly on a digit span task, supposedly a measure of short-term memory. The explanatory variable ("IV") was whether or not the subject admitted to regularly smoking marijuana.

smokers     16 20 14 21 20 18 13 15 17 18
nonsmokers  18 22 21 17 20 17 23 20 22 21

It might seem a little silly to go to the trouble of formally entering such a small data set into a data frame or a spreadsheet and then reading it into R, when the whole thing can be typed into an R Console session in just a few seconds. The thing you need to realize is that all these scores are ON THE SAME VARIABLE, the response variable, and therefore, they need to go into the same data object or vector. So...

> scores = c(16,20,14,21,20,18,13,15,17,18,18,22,21,17,20,17,23,20,22,21)
> scores
 [1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21
> summary(scores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   17.00   19.00   18.65   21.00   23.00

The scores have been entered into a vector using the c() function. Since that was an assignment statement, it wrote nothing to the screen. Then we asked to see the scores, a good check, since (confession) it took me three tries to get the scores typed in correctly. (STUDENTS: ALWAYS double check your data entry! The most common reason for getting a wrong answer with statistical software or a calculator is entering the data incorrectly.) Then the summary() function was used to produce a preliminary descriptive summary.
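
Beyond eyeballing the printed vector, a few quick checks can catch many typing mistakes. This is just a sketch of the kind of thing I mean, not a complete validation routine; the expected count of 20 comes from the table above (ten smokers plus ten nonsmokers).

```r
scores = c(16,20,14,21,20,18,13,15,17,18,18,22,21,17,20,17,23,20,22,21)
length(scores)     # should match the number of scores in your source (20 here)
range(scores)      # smallest and largest value; look for impossible scores
sum(scores)        # compare with a total added up by hand or on a calculator
```

None of these will catch two values swapped with each other, so a visual check against the original data is still worth doing.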

That's probably the most annoying way to get data into a vector--all those commas! So here is a more convenient way when typing data at the keyboard. First, remove the "scores" vector. Then recreate it using scan(). The scan() function allows you to type in numbers one at a time, hitting Enter after each one, rather than putting commas between them.

> rm(scores)
> scores = scan()            # be sure to do the assignment!
1: 16                        # press Enter
2: 20                        # press Enter
3: 14                        # etc.
4: 21
5: 20
6: 18
7: 13
8: 15
9: 17
10: 18
11: 18
12: 22
13: 21
14: 17
15: 20
16: 17
17: 23
18: 20
19: 22
20: 21
21:                          # press Enter here to end data input
Read 20 items
> scores
 [1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21

This is handy when you're using a numeric keypad. But it gets better. You don't have to hit the Enter key after each data value. You only have to leave some white space between values.

> rm(scores)
> scores = scan()
1: 16 20 14 21 20 18 13 15             # press Enter
9: 17 18 18 22 21 17 20 17 23 20       # press Enter
19:  22 21                             # press Enter
21:                                    # press Enter to end data input
Read 20 items
> scores
 [1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21

The Enter key can be hit at any time to start a new line. Items entered into scan() must be separated by white space: a space or spaces, a tab, a newline, a carriage return. It also doesn't matter whether you make the assignment with =, as I did, or with a left (<-) or right (->) arrow. Better still, you can copy and paste the numbers from this webpage. (Although this kinda defeats the purpose of showing you how to enter data from the keyboard!)

> rm(scores)
> scores = scan()
1: 16 20 14 21 20 18 13 15 17 18       # Copied and pasted from above.
11: 18 22 21 17 20 17 23 20 22 21      # Copied and pasted from above.
21:                                    # Remember to hit Enter to end entry.
Read 20 items
> scores
 [1] 16 20 14 21 20 18 13 15 17 18 18 22 21 17 20 17 23 20 22 21

You can also copy and paste comma-separated values, but not into scan() with its default settings, which expect white space between values. Copy comma-separated values into c(). However, you can copy and paste a spreadsheet column (but not a row) into the scan() function. (NOTE: Some of my students have had a problem getting RStudio to play nice with spreadsheets, especially with Excel. I haven't had a problem with OpenOffice Calc. It always works with just plain R, however.)
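
For comma-separated values there is also a possible workaround that the tutorial does not otherwise rely on: scan() has a sep= argument that names the separator. The sketch below uses the text= argument so that it runs without any typing at the prompt; the name csv.line is made up for illustration.

```r
# Sketch: read comma-separated numbers with scan() by naming the separator.
csv.line = "16,20,14,21,20,18,13,15,17,18"   # hypothetical pasted text
smokers = scan(text=csv.line, sep=",")
smokers
```

For anything you are typing by hand, c() or plain scan() with white space is probably still less trouble.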

Now, about that summary--what we want, of course, is a summary by groups, and not of all the scores at once. You can probably think of one way to do this.

> summary(scores[1:10])                # Summarize scores 1 to 10.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   15.25   17.50   17.20   19.50   21.00 
> summary(scores[11:20])               # Summarize scores 11 to 20.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.00   18.50   20.50   20.10   21.75   23.00

Another way to do it is to create a second vector with group names (i.e., values of the explanatory variable) in it and to use that to extract scores by group.

> groups = rep(c("smoker","nonsmoker"), times=c(10,10))
> tapply(X=scores, INDEX=groups, FUN=summary)    # Similar to by() function.
$nonsmoker
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.00   18.50   20.50   20.10   21.75   23.00 

$smoker
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   15.25   17.50   17.20   19.50   21.00

The syntax of the tapply() function can be put into words like this: "Apply the summary function to scores by groups." The by() function does something similar, but the output format is a bit different.

> by(data=scores, INDICES=groups, FUN=summary)
groups: nonsmoker
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.00   18.50   20.50   20.10   21.75   23.00 
---------------------------------------------------------- 
groups: smoker
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.00   15.25   17.50   17.20   19.50   21.00
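
Any function that takes a vector can be handed to tapply() in place of summary. Here is a short sketch that gets the group means, standard deviations, and sample sizes. It also builds the grouping vector with each=, which for equal group sizes gives the same result as the times= version used above. The names group.means, group.sds, and group.ns are mine.

```r
scores = c(16,20,14,21,20,18,13,15,17,18,18,22,21,17,20,17,23,20,22,21)
groups = rep(c("smoker","nonsmoker"), each=10)   # 10 of each, in that order

group.means = tapply(scores, groups, mean)       # mean score by group
group.sds   = tapply(scores, groups, sd)         # standard deviation by group
group.ns    = tapply(scores, groups, length)     # number of scores by group
group.means
group.sds
group.ns
```

Note that the groups come back in alphabetical order (nonsmoker first), not in the order they were entered.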

Now might be a good time to mention this. The summary() function is very versatile, and its output will depend upon what you are asking for a summary of, as we will have ample opportunity to see. When a numeric vector is summarized, the output is the minimum, 1st quartile, median, mean, 3rd quartile, and maximum. There is a qualification. The quartiles are calculated assuming the vector contains a continuous numerical variable. The variable in this example is not continuous. Therefore, the quartiles may not come out to have the same values as if you'd used the method you were taught in elementary statistics to calculate them. We will return to this in a future tutorial. For now, I'll simply say that R can use any of nine different methods to calculate these values.
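
If you are curious now, here is a small sketch using the smokers' scores. The quantile() function takes a type= argument (1 through 9) that selects the method, and fivenum() gives Tukey's five-number summary, which is often closer to the textbook hand method. Which type matches the method you were taught is something I am leaving as an assumption for you to check against your own textbook.

```r
smokers = c(16,20,14,21,20,18,13,15,17,18)

q.default = quantile(smokers, probs=c(0.25, 0.75))          # type=7, the default
q.type2   = quantile(smokers, probs=c(0.25, 0.75), type=2)  # a discontinuous method
q.default
q.type2
fivenum(smokers)    # min, lower hinge, median, upper hinge, max
```

The default quartiles agree with the summary() output for the smokers shown earlier (15.25 and 19.50), while type=2 and the hinges from fivenum() give 15 and 20.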

> rm(list=ls())                        # Clean up.

Entering Categorical Data

There is no way to get around it. Entering categorical data, or character values, is a pain in the posterior! However, once they are entered, R handles them in a much more versatile way than any other statistical software I have ever used. For example, if you are going to use a categorical variable (entered as character values) in a regression analysis, you do not have to recode. R will do the appropriate recoding for you.

There are some cautions about entering character values that it will be very healthy to know about right up front. Suppose we enter the following vector into R. (You might want to copy and paste it!)

> gender = c("male","female","female","male ","Male","female","female","mail")
> summary(gender)            # R thinks this is a character variable until you declare it otherwise.
   Length     Class      Mode 
        8 character character 
> table(gender)              # summary(factor(gender)) will work; try it!
gender
female   mail   male   Male  male  
     4      1      1      1      1
> gender = factor(gender)
> summary(gender)
female   mail   male   Male  male  
     4      1      1      1      1

First we learn that summary() is not so useful for summarizing a character vector. So I used table() instead, which gives me a frequency table. (The summary() function will do the same thing if you first declare the variable to be a factor, as demonstrated.) Notice I got what appears to be an unintended result. First, there is a misspelling. R doesn't know you can't spell, so it assumes this is what you (I) intended. There is also a case where "Male" was capitalized, and R, being case sensitive, counted that as a different value from the uncapitalized "male"s. The one that can really be a puzzler is the difference between "male" and "male ". This can be a real mystery, in fact, when you've entered data using another program, like a spreadsheet, and then read it into R. Moral of the story: BE CAREFUL TYPING CHARACTER DATA! If you put a space on the beginning or end of a value, R will assume you mean it to be that way. And of course, by now I don't even have to mention the spelling and capitalization issues.
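
If the damage is already done, some of it can be repaired after the fact. The following is a minimal sketch, assuming a version of R recent enough to have trimws() (3.2.0 or later, as far as I know). Stray spaces and capitalization can be fixed mechanically, but the misspelling has to be found and fixed by hand. The name gender.clean is made up for illustration.

```r
gender = c("male","female","female","male ","Male","female","female","mail")

gender.clean = tolower(trimws(gender))           # strip stray spaces, force lower case
gender.clean[gender.clean == "mail"] = "male"    # fix the known misspelling by hand
gender.clean = factor(gender.clean)
table(gender.clean)
```

Always run table() on the original values first, as above, so you know what needs fixing before you start changing things.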

Here is another way you can go wrong entering character data.

> country = c("England","Russia","United States","England","England")
> table(country)
country
      England        Russia United States 
            3             1             1

What I am attempting to illustrate is that some data entry methods in R assume that white space separates variable values. So suppose you have a value like United States. There are some cases in which R will read that as two values, "United" and "States". If you are typing values into a vector, the necessary quotes will take care of it. However, it might be a good idea not to put spaces inside data values. You can type a period into what would otherwise be a space, "United.States", and that will never cause a problem.

Now let's use scan() to enter the same values.

> rm(country)
> country = scan(what="character")     # necessary to specify what= if not numeric
1: England
2: Russia
3: United States
5: England
6: England
7: 
Read 6 items
> table(country)
country
England  Russia  States  United 
      3       1       1       1

And there it is! Now you see the problem. Let's do it right.

> rm(country)
> country = scan(what="char")          # good enough
1: England
2: Russia
3: United.States
4: England
5: England
6: 
Read 5 items
> table(country)
country
      England        Russia United.States 
            3             1             1

The default data type for scan() is numeric. Using scan() to enter character data is very convenient because you can avoid typing commas and quotes, but you do have to remember to specify that you are entering character data by using the what= option. (All you have to do is give it a sample of the kind of data you are entering, so what="xyz" would have worked just fine.)
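
If you really do need spaces inside your values, one option that I believe works is to tell scan() that only a newline separates values, using sep="\n". The sketch below uses the text= argument, with made-up text standing in for what you would type, so that it runs without keyboard input.

```r
# Sketch: one value per line, spaces inside a value are kept.
typed = "England\nRussia\nUnited States\nEngland\nEngland"   # hypothetical input
country = scan(text=typed, what="character", sep="\n")
country
table(country)
```

The price is that you are back to one value per line, so the period trick is probably still the easier habit.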

Large data sets, however, will probably be typed into a spreadsheet and then read into R. In this case, you will have to be careful how you tell R the file is formatted. More about that when we get to reading and writing external files.

One more thing about character data...

> summary(country)
   Length     Class      Mode 
        5 character character 
> country = factor(country)
> summary(country)
      England        Russia United.States 
            3             1             1

Until you declare your entered vector to be a factor, R will consider it character data. Sometimes that is what you want, but usually not. If you mean it to be a factor, use factor() to declare it as such.
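
A few functions are handy for checking what you ended up with. This short sketch re-creates the country vector as a factor so that it runs on its own.

```r
country = factor(c("England","Russia","United.States","England","England"))

is.factor(country)     # TRUE once the vector has been declared a factor
levels(country)        # the distinct values, in alphabetical order by default
nlevels(country)       # how many distinct values there are
as.integer(country)    # the integer codes R stores behind the scenes
```

The integer codes are what make it possible for R to do the recoding for you in something like a regression analysis.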

> rm(list=ls())                        # Clean up.

Entering Tabled Data

Sometimes you have data that someone has already done the work of putting into a table for you. (This happens especially with problems out of a textbook.) The following data occur in "A Handbook of Small Data Sets" by Hand et al. (1994).

24. Snoring and heart disease (on page 18 of Hand et al.)

Norton, P.G. and Dunn, E.V. (1985) Snoring as a risk factor for disease: an
epidemiological survey. British Medical Journal, 291, 630-632.

                                    Snore
                                    nearly      Snore
Heart        Non-    Occasional     every       every
disease    snorers    snorers       night       night
-------    -------------------------------------------
yes           24          35          21          30
no          1355         603         192         224
------------------------------------------------------

These data can be entered into a matrix, an array, or a table. I prefer to enter them into a matrix, so that's what I'm going to illustrate here, along with a few pointers for making things look a little neater when R prints it out.

> row1 = c(24, 35, 21, 30)
> row2 = c(1355, 603, 192, 224)
> snoring.table = rbind(row1, row2)    # rbind() binds rows into a matrix
> snoring.table
     [,1] [,2] [,3] [,4]
row1   24   35   21   30
row2 1355  603  192  224
> dimnames(snoring.table) = list("heart.disease" = c("yes","no"),
+                           "snore.status" = c("nonsnorer","occasional",
+                                  "nearly.every.night","every.night"))
> snoring.table
             snore.status
heart.disease nonsnorer occasional nearly.every.night every.night
          yes        24         35                 21          30
          no       1355        603                192         224

First, I entered the table row by row into separate vectors. Then I used the rbind(), or "row bind", function to bind the rows into a matrix. (There is also a cbind() function, if you prefer to enter your matrices column by column.) Then I added names to the various dimensions of the table, making liberal use of the Enter key and space bar so the screen did not scroll as I was typing. Notice the row names were entered first followed by the column names. The same method would be used to name the dimensions in an array or a table. It's worth taking a few minutes to examine the syntax of the dimnames() function. Notice it takes a list of the variable names, and the individual levels of each variable are assigned via vectors typed within the list. In other words, it takes a list of vectors. Tricky!
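
For comparison, here is a sketch of another way to get the same table in one step, using matrix() with byrow=TRUE so the numbers can be typed in reading order, and supplying the dimension names at the same time. It is not better, just different; use whichever you find easier to proofread.

```r
snoring.table = matrix(c(  24,  35,  21,  30,
                         1355, 603, 192, 224),
                       nrow=2, byrow=TRUE,       # fill row by row, not column by column
                       dimnames=list("heart.disease" = c("yes","no"),
                                     "snore.status" = c("nonsnorer","occasional",
                                            "nearly.every.night","every.night")))
snoring.table
```

Without byrow=TRUE, matrix() fills down the columns, which would scramble this table.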

I don't like this table, because it's customary (but not required) to put the explanatory variable in the rows and the response variable in the columns of a contingency table. So I'm going to flip it using the t(), for "transpose matrix", function.

> snoring.table = t(snoring.table)
> snoring.table
                    heart.disease
snore.status         yes   no
  nonsnorer           24 1355
  occasional          35  603
  nearly.every.night  21  192
  every.night         30  224

Better! Notice also I avoided putting spaces into my variable names. This is a good practice, although since the names had to be quoted anyway in the dimnames command, it is not strictly necessary. Also, you should ignore the fact that I am always using the = assignment. If you prefer the arrow assignment, by all means use it. I'm using = because it is rendered better by my browser.

Now let's look at a few functions for extracting information from this table/matrix.

> dim(snoring.table)              # no. of rows by no. of columns
[1] 4 2

> dimnames(snoring.table)         # We already know this, but what the heck?
$snore.status
[1] "nonsnorer"          "occasional"         "nearly.every.night"
[4] "every.night"       

$heart.disease
[1] "yes" "no"

> snoring.table[1,]               # Look at row 1.
 yes   no 
  24 1355 
> snoring.table[,2]               # Look at column 2.
         nonsnorer         occasional nearly.every.night        every.night 
              1355                603                192                224 
> snoring.table[3,2]              # Look at the entry in row 3 and column 2.
[1] 192

> addmargins(snoring.table)       # Show row and column sums.
                    heart.disease
snore.status         yes   no  Sum
  nonsnorer           24 1355 1379
  occasional          35  603  638
  nearly.every.night  21  192  213
  every.night         30  224  254
  Sum                110 2374 2484

> prop.table(snoring.table, margin=1)  # Get proportions relative to row sums.
                    heart.disease
snore.status                yes        no
  nonsnorer          0.01740392 0.9825961
  occasional         0.05485893 0.9451411
  nearly.every.night 0.09859155 0.9014085
  every.night        0.11811024 0.8818898

> prop.table(snoring.table, margin=2)  # Get proportions relative to column sums.
                    heart.disease
snore.status               yes         no
  nonsnorer          0.2181818 0.57076664
  occasional         0.3181818 0.25400168
  nearly.every.night 0.1909091 0.08087616
  every.night        0.2727273 0.09435552

> prop.table(snoring.table)            # Get proportions relative to overall sum.
                    heart.disease
snore.status                 yes         no
  nonsnorer          0.009661836 0.54549114
  occasional         0.014090177 0.24275362
  nearly.every.night 0.008454106 0.07729469
  every.night        0.012077295 0.09017713

> chisq.test(snoring.table)            # You were wondering, weren't you?

        Pearson's Chi-squared test

data:  snoring.table 
X-squared = 72.7821, df = 3, p-value = 1.082e-15
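
In case prop.table() seems like magic, here is a sketch of what it amounts to, along with indexing by name instead of by number. The table is rebuilt first so the example runs on its own; row.props and col.props are names I made up.

```r
snoring.table = matrix(c(24, 1355,
                         35,  603,
                         21,  192,
                         30,  224),
                       ncol=2, byrow=TRUE,
                       dimnames=list("snore.status" = c("nonsnorer","occasional",
                                            "nearly.every.night","every.night"),
                                     "heart.disease" = c("yes","no")))

snoring.table["occasional", "yes"]      # index by row and column name

# Row proportions: each entry divided by its row total.
row.props = snoring.table / rowSums(snoring.table)
all.equal(row.props, prop.table(snoring.table, margin=1))

# Column proportions: each entry divided by its column total.
col.props = sweep(snoring.table, 2, colSums(snoring.table), "/")
all.equal(col.props, prop.table(snoring.table, margin=2))
```

Indexing by name is a little more typing, but it keeps working if the table is later transposed or reordered.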

It's also easy enough to turn those proportions into percentages.

> prop.table(snoring.table, margin=1)*100
                    heart.disease
snore.status               yes       no
  nonsnorer           1.740392 98.25961
  occasional          5.485893 94.51411
  nearly.every.night  9.859155 90.14085
  every.night        11.811024 88.18898

Just multiply the entire prop.table by 100. And finally...

> rm(list=ls())                        # clean up

I have one quibble at this point, and you may have stumbled upon it as well. The function addmargins() has no period. The function prop.table() does have a period. Is there any logic behind that? Not that I'm aware of. But it's a small enough price to pay for a free software package this powerful, so who's complaining?

We'll look at the special case of creating a data frame in a future tutorial.

revised 2016 January 19