DOING ARITHMETIC IN R

A Caveat

Before we begin, let me issue a caveat. These tutorials are not a complete reference manual for the R language--far from it! R can do many more things than you have seen or will see outlined in these brief tutorials. My dilemma is to balance the desire for completeness with the need for brevity. I want to include things that you MAY find handy someday, if for no other reason than to illustrate what R is capable of. On the other hand, too much detail all at once can be overwhelming. So use your best judgment as to what you think you need to know. If something doesn't look useful to you right now, skip it! You can always come back if you find eventually that you need to know it. On the other hand, don't get carried away with skipping stuff either. Most of this material you do need to know in order to use R effectively. Okay, on with it!

More On R Functions

We will begin this tutorial by looking at two functions that create vectors
of numerical values in a regular sequence. We've already seen that this
can be done as follows.
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> 10:1                                 # And backwards too.
 [1] 10  9  8  7  6  5  4  3  2  1

Recall, in this context, that the colon means "through", as in "1 through 10". Those values are printed to the console (screen), and that's the end of it. We have not stored them in the workspace by assigning them to an object, and therefore, we cannot use them in any further calculations. This is not terribly useful, but we shall get to usefulness shortly.

There is another way to produce this same sequence, which is by using the
seq() function.
> seq(from=1, to=10, by=1)
 [1]  1  2  3  4  5  6  7  8  9 10

The syntax should be self-explanatory. The function has created a regular sequence of integers "from" 1 "to" 10 "by" adding 1 at each step. The seq() function is more flexible than the colon operator because the function can be made to step by any amount you want, whereas the colon operator can only step by one (up or down).

> seq(from=1, to=10, by=2)
[1] 1 3 5 7 9

It can even step by fractional values.

> seq(from=1, to=10, by=2.5)
[1] 1.0 3.5 6.0 8.5

So here is the point I want to illustrate about R functions. In this case, every argument inside the seq() function is named: from, to, and by. It's not required to use these names.

> seq(1,10,2)
[1] 1 3 5 7 9

R will understand what the arguments are, AS LONG AS they are given in the correct order: first the "from" value, then the "to" value, and finally the "by" value. You can find out what order the arguments should be in by going to the help page and looking at the syntax statement at the top of the page.

> help(seq)

seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)

On the other hand, if you don't remember what order the arguments are supposed to be in, and you don't want to take the time to look it up, then they can be given IN ANY ORDER, as long as they are labelled.

> seq(to=10, by=2, from=1)
[1] 1 3 5 7 9

This is a general behavior for R functions. Arguments can be given without labels, as long as they are in the correct order, or they can be given in any order with labels. The correct order, as well as the correct labels, can be found by going to the help page for the function. The Mac command editor and RStudio also show the syntax in the lower margin of the Console window as the command is being typed.

Of course, just having something printed to the screen is not often useful.
If we are creating this sequence to be used in subsequent calculations, we would
want it stored in an object.
> my.seq = seq(to=10, by=2, from=1)
> my.seq / 3
[1] 0.3333333 1.0000000 1.6666667 2.3333333 3.0000000
> class(my.seq)
[1] "numeric"
> is.vector(my.seq)
[1] TRUE

Once the sequence is stored, it can be manipulated mathematically, in this case, divided by 3. The class() function tells you this object you've created is numeric (i.e., has numbers in it). The is.vector() function is a question: Is this a vector? TRUE, "my.seq" is a vector.

One final point should be made here. The values in "my.seq" have not been
changed by the arithmetic we did on them.
> my.seq
[1] 1 3 5 7 9
> my.seq = my.seq / 3
> my.seq
[1] 0.3333333 1.0000000 1.6666667 2.3333333 3.0000000

That doesn't happen unless the results of the division are stored back into the "my.seq" object. As a general rule, when R prints something to the screen, nothing in the workspace has been altered. That is, whatever values you have stored in objects are still the same. On the other hand, when an assignment is made, R generally does not print anything to the screen until you ask it to, but changes are made in the workspace. This is worth remembering!

Another function that creates regular sequences is the rep() function, which stands for "repeat."
> rep(x=1, times=10)                   # With the arguments labelled.
 [1] 1 1 1 1 1 1 1 1 1 1
> rep(1,10)                            # Without the arguments labelled.
 [1] 1 1 1 1 1 1 1 1 1 1

What the heck good could that possibly be? A few examples should illustrate.

> my.seq = 1:5                         # Create a fresh sequence.
> rep(my.seq, times=3)                 # Repeat the whole vector 3 times.
 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
> rep(my.seq, each=3)                  # Repeat each element of the vector 3 times.
 [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
> rep(my.seq, times=c(3,2,4,0,1))      # See the explanation below.
 [1] 1 1 1 2 2 3 3 3 3 5

If the first argument is a vector, it will be repeated "times" times, or each element will be repeated "each" times. If both the first argument and the times= argument are vectors, then things start to get really useful. The "times" vector specifies how many times each element of the first vector should appear in the result. In the example above, 1 appears 3 times, 2 appears 2 times, 3 appears 4 times, 4 is dropped entirely (0 times), and 5 appears once.

And this is useful how? Suppose you have a vector of measurements, the first five of
which are from men and the second five of which are from women.
> height = c(70, 72, 67, 66, 75, 64, 66, 68, 63, 65)

You can create a vector of gender labels for the measurements as follows.

> gender = rep(c("male","female"), times=c(5,5))
> gender
[1] "male"   "male"   "male"   "male"   "male"   "female" "female" "female"
[9] "female" "female"

Now you can use the gender vector as a grouping variable (or indexing vector) for the height vector to get means by gender. Or do a t-test.

> by(FUN=mean, IND=gender, data=height)
gender: female
[1] 65.2
------------------------------------------------------------
gender: male
[1] 70
>                                      # Just press Enter to put a space here.
> t.test(height ~ gender)              # height tilde (by) gender

        Welch Two Sample t-test

data:  height by gender
t = -2.588, df = 6.039, p-value = 0.04108
...                                    # Some output omitted here.

Or suppose you have the following frequency distribution, and you want to put all the data into a single vector.
> X = 12:4
> freq = c(3,2,0,3,3,5,3,0,1)
> all.in.one = rep(X, freq)
> all.in.one
 [1] 12 12 12 11 11  9  9  9  8  8  8  7  7  7  7  7  6  6  6  4
> mean(all.in.one)
[1] 8.3

Or, following the instructions in your elementary statistics book, you can also get the mean of X as follows.

> sum(X * freq) / sum(freq)            # STUDENTS: Explain why this works.
[1] 8.3

This, in my opinion, makes R an unparalleled teaching tool for elementary stat courses. Formulas can be worked at the command line using command line arithmetic, thus giving students (at least those who bothered to learn anything at all in their algebra course) an appreciation for how the formulas work that they are unlikely to get by running the numbers through a calculator.

> sum(all.in.one^2) - sum(all.in.one)^2 / length(all.in.one)         # sum of squares (SS)
[1] 100.2
> sum(all.in.one^2) - sum(all.in.one)^2 / length(all.in.one) -> SS   # store it in SS
> SS / (length(all.in.one)-1)          # sample variance
[1] 5.273684
> sqrt(SS / (length(all.in.one)-1))    # sample standard deviation
[1] 2.29645
> sd(all.in.one)                       # same answer from the built-in function
[1] 2.29645

Enough of this for now. We will use both of these functions, seq() and rep(), in due time. You should clean up your workspace now.

R As a Simple Calculator

R will do simple arithmetic from the command line. Several examples should
suffice to illustrate.
> 18 + 12                              # addition
[1] 30
> 18 - 12                              # subtraction
[1] 6
> 18 * 12                              # multiplication
[1] 216
> 18 / 12                              # division
[1] 1.5
> 18 %/% 12                            # just the integer part of the quotient
[1] 1
> 18 %% 12                             # just the remainder part (modulo)
[1] 6
> 18 ^ 12                              # exponentiation (raising to a power)
[1] 1.156831e+15

In the last case, the answer was such a large number that R printed it in scientific notation. Don't ignore the exponent part! This answer is not 1.1568, as many of my students often try to claim. It is 1.156831 times 10 raised to the 15th power. In other words, to the seven significant digits that R displays by default, it is:

1.156831 x 1,000,000,000,000,000 = 1,156,831,000,000,000

R prints all very large and very small numbers in scientific notation. You will need to know how it works. If you don't (including cases where the exponent of 10 is negative), here is a link to a Wikipedia page that explains it.

Scientific notation at Wikipedia (STUDENTS: Required reading!)

Of course, R obeys the usual rules for order of operations, and it uses
parentheses for grouping operations (but never square brackets or curly braces,
which are used for other purposes).
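The precedence rules can be sketched in a few lines of R. This is only an illustrative example using base R operators; the variable names are made up, and the ordering noted in the comments follows R's documented operator precedence (see help(Syntax)).

```r
# Precedence among the operators used here, from highest to lowest:
#   ^   then unary minus   then %% and %/%   then * and /   then + and -
a  <- 18 - 12 / 3     # division first: 18 - 4
b  <- (18 - 12) / 3   # parentheses first: 6 / 3
c1 <- -2^2            # exponent before unary minus: -(2^2)
c2 <- (-2)^2          # parentheses force the negation first
d  <- 2^3^2           # ^ groups right to left: 2^(3^2)
e  <- 18 %% 12 * 2    # %% binds tighter than *: (18 %% 12) * 2

print(c(a, b, c1, c2, d, e))
```

Run as written, this prints the values 14, 2, -4, 4, 512, and 12. When in doubt about how R will group an expression, adding parentheses costs nothing and makes the intent obvious.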
> 18 - 12 / 3
[1] 14
> (18 - 12) / 3
[1] 2

And R recognizes certain goofs, like trying to divide by zero, and points them out.

> 18 / 0
[1] Inf
> 0 / 0
[1] NaN
> "eighteen" / 12
Error in "eighteen"/12 : non-numeric argument to binary operator

Technically, eighteen divided by zero is undefined, but most computer software will tell you it is infinity ("Inf" in R-speak). Zero divided by zero is not a number ("NaN" in R-speak). Trying to divide a word by a number is just silliness, of course.

There are also more advanced operators, such as those that manipulate matrices, but I'll leave those to be investigated by the few readers who may be interested in such things. These operators will also work with complex numbers, but once again, that's beyond the scope of these tutorials. See help("+") for more details.

Mathematical Functions

The following examples illustrate SOME of the mathematical functions
available in R.
> log(10)                              # natural log (base e)
[1] 2.302585
> exp(2.302585)                        # antilog, e raised to a power
[1] 10
> log10(100)                           # base 10 logs; log(100, base=10) is the same
[1] 2
> sqrt(88)                             # square root
[1] 9.380832
> factorial(8)                         # factorial
[1] 40320
> choose(12,8)                         # combinations (binomial coefficients)
[1] 495
> round(log(10), digits=3)             # rounding; round(log(10),3) also works
[1] 2.303
> signif(log(10), digits=3)            # significant digits (2.30, but R prints it without the trailing zero)
[1] 2.3
> runif(5)                             # five uniform random numbers between 0 and 1 (yours will differ)
[1] 0.3088106 0.6893187 0.5312068 0.2848143 0.4390779
> rnorm(5)                             # random numbers from a normal distribution; N(0,1)
[1]  0.39655158 -0.90683680  0.70820865 -0.06417678  0.25064385
> abs(18 / -12)                        # absolute value
[1] 1.5

There are others, such as the standard trig functions (cos(x), sin(x), and tan(x), along with their inverses: acos(x), asin(x), atan(x)), and the hyperbolic trig functions, as well as many more advanced math functions. Try help(gamma) for an example.

R Can Get You a Date (Kind Of)

Here are a couple of R functions for working with dates.
> date()
[1] "Wed Jul 28 12:48:18 2010"
> difftime("2008-07-05","1992-08-15")
Time difference of 5803 days

In fact, there is an entire add-on package that does nothing but deal with dates written in different formats. More on add-on packages in a future tutorial.

Vectorized Arithmetic

Now for some good stuff! In R arithmetic, one or more of the arguments can be
a vector. Here is an example of how useful this can be.
> height.inches = c(68,65,70,71,69)
> height.cm = height.inches * 2.54
> height.cm
[1] 172.72 165.10 177.80 180.34 175.26

When a vector is operated on by a single value, the single value operates on each value of the vector in turn. Math functions work the same way.

> log(height.inches)
[1] 4.219508 4.174387 4.248495 4.262680 4.234107

When a vector operates on a vector, the operation is done term by term, which is to say, the first term of one operates on the first term of the other, the second term on the second term, and so on.

> correction = c(1,0,0,-1,-2)
> height.inches + correction
[1] 69 65 70 70 67

There are special functions designed to work specifically on vectors. Here are a few.

> max(height.inches)                   # maximum value
[1] 71
> min(height.inches)                   # minimum value
[1] 65
> sum(height.inches)                   # sum
[1] 343
> mean(height.inches)                  # arithmetic mean
[1] 68.6
> median(height.inches)                # median
[1] 69
> range(height.inches)                 # range (actually min and max in one)
[1] 65 71
> var(height.inches)                   # sample variance
[1] 5.3
> sd(height.inches)                    # sample standard deviation
[1] 2.302173
> length(height.inches)                # number of values in the vector
[1] 5

These functions can be very useful for teaching purposes. They take the tedium out of calculating the sum of squares, for example, for which there is no R function.

> sum(height.inches^2) - sum(height.inches)^2 / length(height.inches)
[1] 21.2

A student who can do this will not soon forget how the SS is calculated!

Sorting, Ranking, and Ordering Vectors

If you've already erased height.inches, recreate it. Then use the following
functions to sort it in increasing then in decreasing order.
> sort(height.inches)                  # Increasing order is the default.
[1] 65 68 69 70 71
> sort(height.inches, decreasing=TRUE) # Or just sort(height.inches, T).
[1] 71 70 69 68 65

To find the ranks that correspond to values in a vector, use the rank() function.

> height.inches
[1] 68 65 70 71 69
> rank(height.inches)                  # Rank 1 is the minimum value.
[1] 2 1 4 5 3

The order() function is tricky but very useful. Let's see it at work, and then I will explain what it is doing.

> height.inches
[1] 68 65 70 71 69
> sort(height.inches)
[1] 65 68 69 70 71
> order(height.inches)
[1] 2 1 5 3 4
> ord = order(height.inches)
> height.inches[ord]
[1] 65 68 69 70 71

In the example above, the first command simply printed out the "height.inches" vector so you could get another look at it. The sort() function was then used to rearrange it into ascending order. A useful thing to know, sometimes, is how the vector was rearranged to achieve that sorting. That's what the order() function tells you. The output of that function says essentially, "Put the second item first, the first item second, the fifth item third, the third item fourth, and the fourth item last." The output of the order() function can also be used to sort a vector, as is shown by the last two commands. This is useful when you want to sort two vectors in the same order, i.e., keeping the values in the two vectors properly paired up. If you don't see it now, don't worry, but we will also be using this function to sort a data frame by one of its variables, so you will see it eventually.

Relational and Logical Operations

Values can be compared using the following operations:
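A minimal sketch of the standard relational operators (<, <=, >, >=, ==, !=) and logical operators (&, |, !) in base R follows. The vector x and the result names are made up purely for illustration.

```r
x <- c(3, 5, 7)            # made-up example vector

lt <- x <  5               # less than                 -> TRUE  FALSE FALSE
le <- x <= 5               # less than or equal to     -> TRUE  TRUE  FALSE
gt <- x >  5               # greater than              -> FALSE FALSE TRUE
ge <- x >= 5               # greater than or equal to  -> FALSE TRUE  TRUE
eq <- x == 5               # equal to (two equals signs, not one)
ne <- x != 5               # not equal to

# Logical operators combine comparisons element by element.
both   <- x > 3 & x < 7    # AND: TRUE only where both sides are TRUE
either <- x < 4 | x > 6    # OR:  TRUE where at least one side is TRUE
negate <- !(x == 5)        # NOT: flips TRUE and FALSE

# Show all the results together, one row per comparison.
print(rbind(lt, le, gt, ge, eq, ne, both, either, negate))
```

Note that a single equals sign is assignment in R, so a test for equality always needs the doubled form.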
Here are some examples of how these might be used. Once again, I will use
the "height.inches" vector to illustrate, so recreate it if you've erased it.
> height.inches
[1] 68 65 70 71 69
> height.inches >= 70
[1] FALSE FALSE  TRUE  TRUE FALSE
> height.inches == 70
[1] FALSE FALSE  TRUE FALSE FALSE
> height.inches != 70
[1]  TRUE  TRUE FALSE  TRUE  TRUE
> which(height.inches >= 70)
[1] 3 4
> all(height.inches <= 72)
[1] TRUE
> any(height.inches <= 65)
[1] TRUE
> which(height.inches <= 65)
[1] 2

First, the "height.inches" vector is printed out so you have a copy to look at. The second command compares each value of the vector to 70 and returns a logical result: FALSE if the value is not greater than or equal to 70, TRUE if it is. The third command returns TRUE only if the value is exactly equal to 70, and the fourth command returns TRUE only if the value is not equal to 70. The fifth command asks a question: which values in the vector are greater than or equal to 70? The answer is items 3 and 4. The sixth command asks if all values in the vector are less than or equal to 72. The answer is TRUE, meaning yes. The seventh command asks if any of the values are equal to or less than 65, and once again the answer is TRUE, meaning yes. The last command asks which ones. The answer is that item 2 in the vector is less than or equal to 65.

These functions are useful for subsetting the data, for example, in situations where you might want to look at cases where the subjects were six feet tall or more. Play with these commands a bit, and you will get used to them.

Remember when we did this?
> rivers[rivers > 500]
 [1]  735  524 1459  600  870  906 1000  600  505 1450  840 1243  890  525
[15]  720  850  630  730  600  710  680  570  560  900  625 2348 1171 3710
[29] 2315 2533  780  760  618  981 1306  696  605 1054  735 1270  545 1885
[43]  800  538 1100 1205  610  540 1038  620  652  900  525  529  720  671
[57] 1770

Now you know how to find out which rivers those are.

> which(rivers > 500)
 [1]   1   5   7  10  15  16  20  21  22  23  24  25  26  31  32  38  40  44
[19]  45  50  58  59  62  63  64  66  67  68  69  70  71  79  80  82  83  85
[37]  86  89  90  98  99 101 109 113 114 115 118 120 121 128 130 131 132 135
[55] 137 140 141

One final note: In R, logical values can be added. TRUEs add as ones, and FALSEs add as zeros. Here are a couple examples.

> sum(height.inches >= 70)
[1] 2
> sum(height.inches <= 69)
[1] 3
> sum(rivers > 500)
[1] 57

So there are two cases in the height.inches vector in which the height is equal to or greater than 70, and three cases in which the height is less than or equal to 69. There are 57 rivers with lengths greater than 500 miles.

Finally, don't forget to clean up!

revised 2016 January 19