R Tutorials--Describing Data Graphically

DESCRIBING DATA GRAPHICALLY

Introduction

The graphical procedures in R are extremely powerful. I'm told there are people who use R not so much for data analysis as for its ability to produce top notch publication quality graphics. I will only scratch the surface of these capabilities here. A later tutorial will fill in a few more of the details.

Don't make the mistake of assuming that this tutorial is in any fashion a complete summary of R graphics capabilities. It is a brief overview of how to get a mostly quick-and-dirty graph to visualize your data.

R graphics functions can be grouped into three types:

High level plotting functions that will create a more or less complete graph, often with axis labels, titles, and so forth.
Low level plotting functions that allow additional information to be added to an existing graph, or that allow graphs to be drawn from scratch.
Interactive graphics functions that allows the R user to extract information from an existing graph, or to label points and so on.

I'm going to do something unusual, and perhaps ill-advised, and cover the low level functions first, so that you will be ready to use them in conjunction with the high level functions when we get to those. If you just want the quick and dirty approach, then skip this first section (for now).

Low Level Plotting Functions

For years I told my students, "I can draw anything I want in an R graphics window. I can draw a clown face in an R graphics window if I want to." Finally a student challenged me to do it. It took me the better part of a day to figure out!

To conserve space, I'm not going to reproduce the output of every single example in this tutorial. If you have R open and are following along, you can see it on your own screen.

High level plotting functions open a graphics device (window) automatically, but the low level functions do not. So to get a graph and some axes to work with, the following command will get us started without actually drawing a graph.

> plot(1:100, 1:100, type="n", xlab="", ylab="")

The plot() function is high level, opening a graphics window and drawing labeled axes, but in this case we've asked it not to plot anything with the type="n" option. Now we have a palette. Let's paint on it.

One thing we can do is plot a curve from an algebraic equation. Let's use the equation y = 0.01 x².

> curve(x^2/100, add=TRUE)        # add=T adds the curve to an existing graph

We may want to add some text to the graph, which tells our intended audience just what it is we've plotted.

> text(x=80, y=50, "This is a graph of")
> text(x=80, y=45, "the equation")
> text(x=80, y=37, expression(y == frac(1,100) * x^2))

The text() function takes, first, arguments that give x,y-coordinates at which the text will be centered (and this can take some careful eyeballing or some trial and error), and then it takes quoted text or a mathematical expression. The syntax for the expression() function is an art form in itself (similar to LaTeX), and I have not mastered it, but it can be used to produce some very fancy mathematical expressions. There are also options for controlling font face and size as well as spacing, etc.

Next let's draw some points on this curve.

> points(x=c(20, 60, 90), y=c(4, 36, 81), pch=6)      # or points(x<-c(20,60,90), y=x^2/100, pch=6)

The first vector gives the x-coordinates of the desired points, the second vector gives the y-coordinates (which can be calculated on the fly), and the "pch=" option gives the point character to use. There are about twenty different point characters to choose from. To see some of them, do this.

> points(x=rep(100,10), y=seq(0,90,10), pch=seq(1,20,2))

You can experiment for yourself to find out what the rest of them look like.

Now let's draw a straight line through a couple of those points, say the one at (20, 4) and the one at (90, 81). The draw-a-straight-line function is abline(), and in this case it's arguments are "a=the y-intercept" and "b=the slope" of the desired line.

> abline(a=-18, b=1.1, col="red")

And just to be showy, we made it red with col="red". We can also draw horizontal and vertical lines with this function.

> abline(h=20, lty=2)             # abline(h=20, lty="dashed") also works
> abline(v=20, lty=3)             # abline(v=20, lty="dotted") also works

The "lty=" option specifies the line type (1=solid, 2=dashed, 3=dotted). You can also change the color of these lines with col=, and the width of the lines with lwd= options. Try repeating that last command but set lty=1 and lwd=3.

We can also draw lines and/or points using the lines() function.

> lines(x=c(40, 40, 60, 60), y=c(80, 100, 100, 80), type="b")
> lines(x=c(40, 60), y=c(80, 80), type="l")      # type="lower case L", not "one"

Once again, the first vector gives the x-coordinates, the second vector the y-coordinates, and the "type=" option tells whether you want just points (type="p"), just lines (type="l"), or both (type="b"). Note: For just lines, use a lower case L. This example shows that type="l" and type="b" behave a bit differently in terms of where the line begins and terminates.

Finally, at least as far as this tutorial is concerned, titles and axis labels can be added using the title() function. (NOTE: If you already have axis labels (they are set by default in the plot() function, but we set them to blank by setting xlab="" and ylab=""), you can try a little trick that SOMETIMES works to erase them. Try writing over them in the background color of the graph, in this case, white. This brings up an IMPORTANT POINT. Be careful when you're drawing a complex graph, because it's generally true that, when you make a mistake, you start again! It's advisable to use the script window to write out the commands for a complex graph.)

> title(main="A Drawing To Put On the Refrigerator!")
> title(xlab="This is the x-axis", col.lab="green", cex.lab=1.5)

This example also gives a little taste of how various options can be used to control colors, fonts, text sizes, and so forth. We'll do more of this in a future tutorial. The col.lab= option sets the color of the label, while the cex.lab= option controls it's size. "cex" stands for character expansion factor, so setting cex.lab=1.5 makes the label 1.5 times it's normal size. And now let's have a look at our masterpiece. Our First Graph

Beautiful! Okay, so it's a little first-graderish as R graphics go. There are entire books on R graphics, and I am but a novice! Here is the entire script if you just now have decided you want to see this happen on your own monitor.

# start copying here
plot(1:100, 1:100, type="n", xlab="", ylab="")
curve(x^2/100, add=TRUE)
text(x=80, y=50, "This is a graph of")
text(x=80, y=45, "the equation")
text(x=80, y=37, expression(y == frac(1,100) * x^2))
points(x=c(20,60,90), y=c(4,36,81), pch=6)
points(x=rep(100,10), y=seq(0,90,10), pch=seq(1,20,2))
abline(a=-18, b=1.1, col="red")
abline(h=20, lty=2)
abline(v=20, lty=3)
lines(x=c(40,40,60,60), y=c(80,100,100,80), type="b")
lines(x=c(40,60), y=c(80,80), type="l")
title(main="A Drawing To Put On the Refrigerator!")
title(xlab="This is the x-axis", col.lab="green", cex.lab=1.5)
# stop copying here and paste to your R Console

You can also paste this into a script window (File > New Document on a Mac, File > New Script in Windows) and experiment with it. Learn by doing!

High Level Plotting Functions

Usually, we don't want to fuss that much. We just want to see a graph of some data we're examining. If we want to dress it up for publication, THEN we'll worry about the low-level functions and various options.

The basic high level plotting function is plot(), and it works differently depending upon what you're asking it to plot. The basic syntax is plot(x, y, ...), where x is a vector of x-coordinates, y is a vector of y-coordinates, and ... represents further refinements and options, as will be illustrated.

> data(faithful)
> attach(faithful)
> names(faithful)
[1] "eruptions" "waiting"  
> plot(x=waiting, y=eruptions)         # x is num., y is num., plot is a scatterplot
> detach(faithful)
> rm(faithful)
>
>
> data(ToothGrowth)
> attach(ToothGrowth)
> names(ToothGrowth)
[1] "len"  "supp" "dose"
> plot(x=supp, y=len)                  # x is factor, y is num., plot is boxplots
> plot(x=factor(dose), y=len)          # coercing dose to a factor
> detach(ToothGrowth)
> rm(ToothGrowth)
>
>
> data(sunspots)
> class(sunspots)
[1] "ts"
> plot(sunspots)                       # x is time series, y missing, plot is a
> rm(sunspots)                         #     time-series plot
>
>
> data(UCBAdmissions)
> class(UCBAdmissions)
[1] "table"
> plot(UCBAdmissions)                  # x is table, y missing, plot is a mosaic plot
> rm(UCBAdmissions)
>
>
> data(mtcars)
> str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17.0 18.6 19.4 17.0 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
> plot(mtcars)                         # x is dataframe of num. vars., y missing,
> rm(mtcars)                           #     plot is a scatterplot matrix
>

And so on. I think I've made my point. As we go through the individual data analyses in future tutorials, we will see these various plots again, and we will dress them up a bit. So for now, let me just illustrate a few other things R can do.

Note: Plotting can also be done with a formula interface: plot(y ~ x).

Pie Charts and Bar Graphs

When a single categorical variable is being graphed, the customary way is to use a pie chart or a bar graph. Statisticians are somewhat biased against pie charts, and I suppose for good reason, but I'll illustrate them anyway, just in case you have a hankerin' to flout good statistical practice.

The data set UCBAdmissions, which we were using above, is the Berkeley admissions data we used in a different form in a previous tutorial. The data set is a 3-D table, and we need a 1-D table to illustrate a basic piechart and barplot, so...

> margin.table(UCBAdmissions, 3)       # Collapse over dimensions 1 and 2.
Dept
  A   B   C   D   E   F 
933 585 918 792 584 714 
> margin.table(UCBAdmissions,3) -> Department
> pie(Department)
> barplot(Department, xlab="Department", ylab="frequency")

I think you can see why the bar graph is preferred. It's much easier to read. The pie() function in R is limited because, as I mentioned above, many statisticians (including the R folks) consider pie charts to be poor statistical practice. However, if you want something flashy like a 3D exploded pie chart, you can get it by installing an optional graphics package called "plotrix", which contains a function called pie3D(), which has an "explode" option. It just goes to show, if you want it, someone has probably written an R package that will do it! To see an example of an exploded pie chart produced with this package, try this link. In fact, I recommend the plotrix package if you want some useful extensions to the basic R graphics capabilities.

If you want to look at two categorical variables at once, a stacked barplot, or better yet, a side-by-side barplot is usually the way to go.

> margin.table(UCBAdmissions, c(1,3)) -> Admit.by.Dept
> barplot(Admit.by.Dept)
> barplot(Admit.by.Dept, beside=T, ylim=c(0,1000), legend=T,
+         main="Admissions by Department")

Notice a stacked barplot is the default for some odd reason. To change that, set the "beside=" option to TRUE. Also, I dressed up the second barplot a bit by adding a main title, and by changing the limits on the y-axis to (almost!) make enough room for a legend. I need to adjust the font size a bit in the legend, and maybe change its location, but that's a future tutorial!

Histograms

When you have one numeric variable to look at, a histogram is appropriate. I'll use the "faithful" data set again to illustrate.

> data(faithful)                       # This is optional.
> attach(faithful)
> hist(waiting)

It doesn't get much more straightforward than that! And by the way, in case you're wondering, I resized the graphic by resizing the graphics device window before saving it. There are better ways, but that works pretty well in a pinch.

If you want more or fewer bars, you can refine your plot by using the "breaks=" option and defining your own breakpoints.

> range(waiting)
[1] 43 96
> hist(waiting, breaks=seq(from=40, to=100, by=10))

By default, R includes the right limit (right side of the bar) but not the left limit in the intervals. Usually, I prefer it the other way around, so I change it with the "right=" option, which by default is TRUE.

> hist(waiting, breaks=seq(40,100,10), right=F)

There are many, many other options as well, which you can examine by looking at the help page for this function: type ?hist.

R also incorporates many functions for data smoothing, including kernel density smoothing of histograms. If you'd rather see a smooth curve than a boxy histogram, it can be done as follows.

> plot(density(waiting))
> # Or, getting fancier...
> hist(waiting, prob=T)
> lines(density(waiting))
> detach(faithful)

The density() function does kernel density smoothing, which can be refined by adjusting the options of the function. To plot the smoothed curve on top of a histogram, set the "prob=" option to TRUE inside the hist() function. This plots densities rather than frequencies. Also, use lines() rather than plot() to draw the smoothed curve. This low level graphics function will add the smoothed curve to the histogram rather than drawing a new plot and thereby erasing the histogram.

Numerical Summaries by Groups

When you have a numerical variable indexed by a categorical variable or factor, you might want a group-by-group summary in graphical form. The primary way R offers to achieve this is side-by-side boxplots.

> data(chickwts)                  # Weight gain by type of diet.
> str(chickwts)
'data.frame':   71 obs. of  2 variables:
 $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
> attach(chickwts)
> plot(feed, weight)              # boxplot(weight ~ feed) will also work
> title(main="Body Weight of Chicks by Type of Diet")

One somewhat annoying thing about boxplots is that they don't display means. If you're doing a parametric analysis such as t-test or ANOVA, those are tests on means, so your graphic ought to display means. You can add means by using the low-level points() function. (Watch your screen, I won't reproduce this here.)

> means = tapply(weight, feed, mean)
> points(x=1:6, y=means, pch=16)       # pch=16 is a filled circle
> detach(chickwts)

The function boxplot(), which takes a formula interface, can also be used. Here is the example copied and pasted off the "chickwts" help page.

> boxplot(weight ~ feed, data = chickwts, col = "lightgray",
+     varwidth = TRUE, notch = TRUE, main = "chickwt data",
+     ylab = "Weight at six weeks (gm)")
Warning message:
In bxp(list(stats = c(216, 271.5, 342, 373.5, 404, 108, 136, 151.5,  :
  some notches went outside hinges ('box'): maybe set notch=FALSE

Notice that several options are set, including an option to color the boxes, the "varwidth=" option, which sets the width of the box according to the sample size, the "notch=" option, which gives a confidence interval around the median, and options to print a main title and y-axis label. The procedure generated a warning message, which you will understand when you look at the graphic (which I have not reproduced here).

Scatterplots

For examining the relationship between two numerical variables, you can't beat a scatterplot. R has several functions for producing them, two of which will be demonstrated here.

> data(mammals, package="MASS")
> str(mammals)
'data.frame':   62 obs. of  2 variables:
 $ body : num    3.38   0.48   1.35 465.00  36.33 ...
 $ brain: num   44.5  15.5   8.1 423.0 119.5 ...
> attach(mammals)
> plot(log(body), log(brain))          # plot(x=body, y=brain, log="xy") is similar (try it)
> scatter.smooth(log(body), log(brain))
> detach(mammals)

Some explanations are in order. First, I didn't want to attach the MASS package to the search path, so I used an option when I copied the "mammals" data frame that told R to look for it there. The data frame contains brain and body weights from 62 species of land mammals. Second, to produce a linear plot, I had to do a log transform on both variables, and I did that "on the fly." Third, the two functions produced the same scatterplot, but the scatter.smooth() function also plots a smoothed, nonparametric regression line on the plot. This line is computed using the loess technique and is called the "loess line" (locally weighted scatterplot smoothing, sometimes also called "lowess", although I understand some sources use the two acronyms differently). Both functions have options that allow the plots to be modified in several ways.

Interacting With Plots

R supplies several functions that allow you to interact with the graphics window, including functions that allow you to identify and label points on the graph. See the help pages for the locator() and identify() functions for details. I'll discuss these briefly in a later tutorial.

Remember to clean up your workspace!

revised 2016 January 31