
DESCRIBING DATA GRAPHICALLY
Introduction
The graphical procedures in R are extremely powerful. I'm told there are some
people who use R not so much for data analysis as for its ability to produce
top-notch, publication-quality graphics. I will only scratch the surface of
these capabilities here. A later tutorial will fill in a few more of the details.
R graphics functions can be grouped into three types:
- High level plotting functions that will create a more or less complete
graph, often with axis labels, titles, and so forth.
- Low level plotting functions that allow additional information to be
added to an existing graph, or that allow graphs to be drawn from
scratch.
- Interactive graphics functions that allow you to extract information
from an existing graph, or to label points and so on.
I'm going to do something unusual, and perhaps ill-advised, and cover the
low level functions first, so that you will be ready to use them in conjunction
with the high level functions when we get to those. If you just want the quick
and dirty approach, then skip this first section (for now).
Low Level Plotting Functions
Just about anything can be drawn into a graphics window in R if you are
clever enough. I'm not that clever, so I'll keep it simple. To conserve space,
I'm also not going to reproduce the output of every single example. If you have
R open and are following along, you can see it on your own screen.
High level plotting functions open a graphics device (window) automatically,
but the low level functions do not. So to get a graph and some axes to work
with, the following command will get us started without actually drawing a
graph...
> plot(1:100, 1:100, type="n")
The plot( ) function is high level, opening a
graphics window and drawing labeled axes, but in this case we've asked it not
to plot anything with the 'type="n"' option. Now we have a canvas. Let's paint
on it.
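For the curious, the same blank canvas can be assembled entirely from low level pieces, which is what "drawn from scratch" means in the list above. Here is a minimal sketch using base graphics only. The axis limits are my own choice, picked to mimic the plot( ) call above.

```r
# Build an empty plotting region by hand instead of using plot(..., type="n").
plot.new()                                        # start a fresh, empty page
plot.window(xlim = c(1, 100), ylim = c(1, 100))   # set up the coordinate system
axis(1)                                           # x-axis along the bottom
axis(2)                                           # y-axis along the left side
box()                                             # frame around the plotting region
usr <- par("usr")                                 # limits in use: xmin, xmax, ymin, ymax
usr
```

Notice that the limits R reports are a bit wider than the ones requested. By default R pads each axis slightly so that points on the edge are not clipped.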
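One thing we can do is plot a curve from an algebraic equation. Let's say
the equation is y = 0.01 x^2...
> curve(x^2/100, add=TRUE)
The "add=TRUE" option matters here. Without it, curve( ) behaves like a high
level function and starts a new plot of its own instead of drawing on ours.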
We may want to add some text to the graph, which tells our intended audience
just what it is we've plotted...
> text(80, 50, "This is a graph of")
> text(80, 45, "the equation")
> text(80, 37, expression(y == frac(1,100) * x^2))
The text( ) function takes, first, arguments
that give x,y-coordinates at which the text will be centered (and this can take
some careful eyeballing or some trial and error), and then it takes quoted text
or a mathematical expression. The syntax for the expression( ) function is an art form in itself
(similar to LaTeX), and I have not mastered it, but it can be used to produce
some very fancy mathematical expressions. There are also options for
controlling font face and size as well as spacing, etc.
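To make those options a little more concrete, here is a small sketch of text( ) with a few of them. It uses only base R. The coordinates, the label strings, and the value of r2 are made up for illustration.

```r
plot(1:10, 1:10, type = "n")
# Plain text, drawn larger (cex) and in bold (font = 2).
text(5, 9, "Bigger, bolder text", cex = 1.5, font = 2)
# pos = 4 places the text to the right of the point instead of centering it.
points(2, 7, pch = 16)
text(2, 7, "label to the right", pos = 4)
# In an expression, [ ] gives a subscript, ^ a superscript, == an equals sign.
text(5, 5, expression(bar(x) == frac(1, n) * sum(x[i], i == 1, n)))
# bquote() substitutes the value of an R variable into an expression.
r2 <- 0.87   # made-up number, for illustration only
label <- bquote(R^2 == .(r2))
text(5, 3, label)
```

The full list of what can go inside expression( ) is on the help page ?plotmath.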
Next let's draw some points on this curve...
> points(x=c(20, 60, 90), y=c(4, 36, 81), pch=6)
The first vector gives the x-coordinates of the desired points, the second
vector gives the y-coordinates, and the "pch=" option gives the point character
to use. There are 26 built-in point characters (pch values 0 through 25) to
choose from. To see some of them, do this...
> points(x=rep(100,10), y=seq(0,90,10), pch=seq(1,20,2))
You can experiment for yourself to find out what the rest of them look like.
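If you'd rather see them all at once, something like the following sketch will lay out symbols 0 through 25 in a grid with their numbers. The grid layout (seven per row) and the gray fill are my own arbitrary choices.

```r
# Show plotting symbols 0 through 25 in a grid, each labeled with its number.
pch_values <- 0:25
cols <- pch_values %% 7      # column position, 0 to 6
rows <- pch_values %/% 7     # row position, 0 to 3
plot(cols, rows, type = "n", axes = FALSE, xlab = "", ylab = "",
     xlim = c(-0.5, 6.5), ylim = c(3.5, -0.5), main = "pch = 0 to 25")
# bg fills symbols 21 to 25; the other symbols ignore it.
points(cols, rows, pch = pch_values, cex = 2, bg = "gray")
text(cols, rows, labels = pch_values, pos = 1, offset = 1)
```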
Now let's draw a straight line through a couple of those points, say the one
at (20, 4) and the one at (90, 81). The draw-a-straight-line function is abline( ), and in this case its arguments are
"a=the y-intercept" and "b=the slope" of the desired line...
> abline(a=-18, b=1.1, col="red")
And what the heck? Just to be showy, let's make it red. We can also draw
horizontal and vertical lines with this function...
> abline(h=20, lty=2) # abline(h=20, lty="dashed") also works
> abline(v=20, lty=3) # abline(v=20, lty="dotted") also works
The "lty=" option specifies the line type. (1=solid, 2=dashed, 3=dotted.) You
can also change the color of these lines with col=, and the width of the lines
with lwd= options. Try repeating that last command but set lty=1 and lwd=3.
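In case you are wondering where a=-18 and b=1.1 came from, they are just the intercept and slope of the line through (20, 4) and (90, 81). Here is a sketch of the arithmetic. The helper line_through( ) is something I made up for this example. It is not an R function.

```r
# Slope and intercept of the line through (x1, y1) and (x2, y2).
# line_through() is a helper written for this example, not part of R.
line_through <- function(x1, y1, x2, y2) {
  b <- (y2 - y1) / (x2 - x1)   # slope: rise over run
  a <- y1 - b * x1             # intercept: solve y1 = a + b * x1 for a
  c(a = a, b = b)
}
ab <- line_through(20, 4, 90, 81)
ab
plot(1:100, 1:100, type = "n")
points(c(20, 90), c(4, 81), pch = 6)
abline(a = ab["a"], b = ab["b"], col = "red", lwd = 3)
```

This breaks down for a vertical line (x1 equal to x2), which is one reason abline( ) has the separate "v=" option.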
We can also draw lines and/or points using the
lines( ) function...
> lines(x=c(40, 40, 60, 60), y=c(80, 100, 100, 80), type="b")
> lines(x=c(40, 60), y=c(80, 80), type="l")
Once again, the first vector gives the x-coordinates, the second vector the
y-coordinates, and the "type=" option tells whether you want just points
(type="p"), just lines (type="l"), or both (type="b"). Note: for just lines,
use a lower case L. This example shows that type="l" and type="b" behave a bit
differently in terms of where the line begins and terminates.
Finally (at least as far as this tutorial is concerned!), titles and axis
labels can be added using the title( )
function. We already have axis labels (plot( ) sets them by default from the
expressions it was given, so both currently read 1:100), and I'll use a little
trick that SOMETIMES works to erase one of them. I'll write over it in the
background color of the graph...
> title(main="A Drawing To Put On the Refrigerator!")
> title(xlab="1:100", col.lab="white")
> title(xlab="This is the x-axis", col.lab="black")
This example also gives a little taste of how various options can be used to
control colors, fonts, text sizes, and so forth. We'll do more of this in a
future tutorial. The "xlab=" option was used to overwrite the existing x-axis
label with itself written in white, and then to add a new label written in black. The
same thing could have been done on the y-axis using the "ylab=" option. And now
let's have a look at our masterpiece...
Beautiful! Okay, so it's a little first-graderish as R graphics go. There are
entire books on R graphics, and I am but a beginner! Here is the entire script
if you just now have decided you want to see this happen on your own
monitor.
# Start copying here.
plot(1:100, 1:100, type="n")
curve(x^2/100, add=TRUE)   # add=TRUE draws on the existing plot
text(80, 50, "This is a graph of")
text(80, 45, "the equation")
text(80, 37, expression(y == frac(1,100) * x^2))
points(c(20,60,90), c(4,36,81), pch=6)
points(rep(100,10), seq(0,90,10), pch=seq(1,20,2))
abline(a=-18, b=1.1, col="red")
abline(h=20, lty=2)
abline(v=20, lty=3)
lines(c(40,40,60,60), c(80,100,100,80), type="b")
lines(c(40,60), c(80,80), type="l")
title(main="A Drawing To Put On the Refrigerator!")
title(xlab="1:100", col.lab="white")   # white out the default x-axis label
title(xlab="This is the x-axis", col.lab="black")
# Stop copying here and paste to your R Console.
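The write-over-it-in-white trick is fragile, since it depends on the background color and on matching the old label exactly. A sturdier approach, sketched below, is to ask plot( ) not to draw the labels in the first place and then add them once with title( ). The y-axis label text here is my own invention.

```r
# Leave the labels off when the plot is first drawn, then add them once.
plot(1:100, 1:100, type = "n", xlab = "", ylab = "")
xy <- curve(x^2 / 100, add = TRUE)   # curve() invisibly returns the points drawn
title(main = "A Drawing To Put On the Refrigerator!",
      xlab = "This is the x-axis", ylab = "This is the y-axis")
```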
High Level Plotting Functions
Usually, we don't want to fuss that much. We just want to see a graph of
some data we're examining. If we want to dress it up for publication, THEN we'll
worry about the low-level functions and various options.
The basic high level plotting function is
plot( ), and it works differently depending upon what you're asking
it to plot. The basic syntax is plot(x, y, ...), where x is a vector of
x-coordinates, y is a vector of y-coordinates, and ... represents further
refinements and options, as will be illustrated...
> data(faithful)
> attach(faithful)
> names(faithful)
[1] "eruptions" "waiting"
> plot(waiting, eruptions) # x is num., y is num., plot is scatterplot
> detach(faithful)
> rm(faithful)
>
>
> data(ToothGrowth)
> attach(ToothGrowth)
> names(ToothGrowth)
[1] "len" "supp" "dose"
> plot(supp, len) # x is factor, y is num., plot is boxplots
> plot(factor(dose), len) # coercing dose to a factor
> detach(ToothGrowth)
> rm(ToothGrowth)
>
>
> data(sunspots)
> class(sunspots)
[1] "ts"
> plot(sunspots) # x is time series, y missing, plot is a
> rm(sunspots) # time-series plot
>
>
> data(UCBAdmissions)
> class(UCBAdmissions)
[1] "table"
> plot(UCBAdmissions) # x is table, y missing, plot is a mosaic plot
> rm(UCBAdmissions)
>
>
> data(mtcars)
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17.0 18.6 19.4 17.0 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> plot(mtcars) # x is dataframe of num. vars., y missing,
> rm(mtcars) # plot is a scatterplot matrix
>
And so on. I think I've made my point.
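What decides the kind of plot is the class of the thing you hand to plot( ). The sketch below checks the classes of the objects used above, and also shows two ways of getting the same plots without attach( ): the with( ) function and the formula interface. All of the data sets come with R.

```r
# The class of the first argument decides what kind of plot you get.
classes <- sapply(list(faithful = faithful, sunspots = sunspots,
                       UCBAdmissions = UCBAdmissions,
                       supp = ToothGrowth$supp),
                  function(obj) class(obj)[1])
classes
# with() evaluates an expression inside a data frame, no attach() needed.
with(faithful, plot(waiting, eruptions))
# The formula interface does the same job: y ~ x, plus a data= argument.
plot(len ~ supp, data = ToothGrowth)          # factor x: boxplots
plot(eruptions ~ waiting, data = faithful)    # numeric x: scatterplot
```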
As we go through the individual data analyses in future tutorials, we will
see these various plots again, and we will dress them up a bit. So for now, let
me just illustrate a few other things R can do.
Piecharts and Barplots
When a single categorical variable is being graphed, the customary way is to
use a piechart or a barplot. Statisticians are somewhat biased against
piecharts, and I suppose for good reason, but I'll illustrate them anyway, just
in case you have a hankerin' to flout good statistical practice.
The data set UCBAdmissions, which we were using above, is the Berkeley
admissions data we used in a different form in a previous tutorial. The data set
is a 3-D table, and we need a 1-D table to illustrate a basic piechart and
barplot, so...
> margin.table(UCBAdmissions, 3) # Collapse over dimensions 1 and 2.
Dept
A B C D E F
933 585 918 792 584 714
> margin.table(UCBAdmissions,3) -> Department
> pie(Department)
> barplot(Department, xlab="Department", ylab="frequency")
The pie( ) function in R is limited because,
as I mentioned above, many statisticians (including the R folks) consider pie
charts to be poor statistical practice. However, if you want something flashy
like a 3D exploded pie chart, you can get it by installing an optional graphics
package called "plotrix", which contains a function called
pie3D( ), which has an "explode" option. It just goes to show, if
you want it, someone has probably written an R package that will do it!
In fact, I recommend the plotrix package if you want some useful extensions to
the basic R graphics capabilities.
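Even the basic pie( ) function will take custom labels. Here is a sketch that adds percentages to the slices and then draws the same percentages as a sorted barplot. It uses only base R, and the label format is my own choice.

```r
Department <- margin.table(UCBAdmissions, 3)       # applicants per department
pct <- round(100 * prop.table(Department), 1)      # percentages of the total
pie_labels <- paste0(names(Department), " (", pct, "%)")
pie(Department, labels = pie_labels, main = "Applicants by Department")
# The same information as a barplot of percentages, largest first.
barplot(sort(pct, decreasing = TRUE), xlab = "Department", ylab = "percent")
```

Comparing the two pictures is a fair illustration of why statisticians prefer the barplot. The differences between departments B and E, for example, are much easier to judge from bar heights than from slice angles.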
If you want to look at two categorical variables at once, a stacked barplot,
or better yet, a side-by-side barplot is usually the way to go...
> margin.table(UCBAdmissions, c(1,3)) -> Admit.by.Dept
> barplot(Admit.by.Dept)
> barplot(Admit.by.Dept, beside=TRUE, ylim=c(0,1000), legend.text=TRUE,
+ main="Admissions by Department")
Notice a stacked barplot is the default. To change that, set the "beside="
option to TRUE. Also, I dressed up the second barplot a bit by adding a main
title, and by changing the limits on the y-axis to make room for a legend. I
need to adjust the font size a bit in the legend, and maybe change its location,
but that's a future tutorial!
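For the impatient, here is one way it might be done, as a sketch. The "args.legend=" option of barplot( ) passes a list of settings along to the legend( ) function. The particular location and sizes below are just my guesses at what looks reasonable.

```r
Admit.by.Dept <- margin.table(UCBAdmissions, c(1, 3))
mids <- barplot(Admit.by.Dept, beside = TRUE, ylim = c(0, 1000),
                legend.text = TRUE,
                args.legend = list(x = "topright", cex = 0.8, bty = "n"),
                main = "Admissions by Department",
                xlab = "Department", ylab = "frequency")
# barplot() returns the bar midpoints, handy for writing counts above the bars.
counts <- as.vector(Admit.by.Dept)
text(as.vector(mids), counts, labels = counts, pos = 3, cex = 0.7)
```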
Histograms
When you have one numerical variable to look at, a histogram is appropriate.
I'll use the "faithful" data set again to illustrate...
> data(faithful) # This is really optional.
> attach(faithful)
> hist(waiting)
It doesn't get much more straightforward than that! And by the way, in case
you're wondering, I resized the graphic by resizing the graphics device window
before saving it. There are better ways, but that works in a pinch.
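One of those better ways is to open a file device of a known size, draw the plot, and then close the device. Here is a sketch using pdf( ). The file goes to a temporary location here, so substitute your own file name, and the title and axis label are my own.

```r
# Write the histogram to a PDF of a chosen size (in inches).
# The file goes to a temporary location; substitute your own file name.
outfile <- tempfile(fileext = ".pdf")
pdf(outfile, width = 6, height = 4)
hist(faithful$waiting, main = "Waiting Time Between Eruptions",
     xlab = "waiting (min)")
dev.off()    # close the device so the file is actually written
outfile      # where the file ended up
```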
If you want more or fewer bars, you can refine your plot by using the
"breaks=" option and defining your own breakpoints...
> range(waiting)
[1] 43 96
> hist(waiting, breaks=seq(40,100,10))
By default, R includes the right limit (right side of the bar) but not the left
limit in the intervals. Usually, I prefer it the other way around, so I change
it with the "right=" option, which by default is TRUE...
> hist(waiting, breaks=seq(40,100,10), right=FALSE)
There are many, many other options as well, which you can examine by looking at
the help page for this function: ?hist.
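To see exactly what the "right=" option changes, you can ask hist( ) for the bin counts without drawing anything. This is a sketch using the same breakpoints as above.

```r
brks <- seq(40, 100, 10)
# plot = FALSE returns the bin counts without drawing anything.
h_right <- hist(faithful$waiting, breaks = brks, plot = FALSE)
h_left  <- hist(faithful$waiting, breaks = brks, right = FALSE, plot = FALSE)
# Counts can differ only where observations fall exactly on a breakpoint.
rbind(right.closed = h_right$counts, left.closed = h_left$counts)
# Either way, every observation lands in exactly one bin.
sum(h_right$counts) == length(faithful$waiting)
```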
R also incorporates many functions for data smoothing, including kernel
density smoothing of histograms. If you'd rather see a smooth curve than a
boxy histogram, it can be done as follows...
> plot(density(waiting))
> # Or, getting fancier...
> hist(waiting, prob=TRUE)
> lines(density(waiting))
> detach(faithful)
The density( ) function does kernel density
smoothing, which can be refined by adjusting the options of the function. To
plot the smoothed curve on top of a histogram, set the "prob=" option to TRUE
inside the hist( ) function. This plots
densities rather than frequencies. Also, use lines( ) rather than plot( ) to plot the smoothed curve. This low level
graphics function will add the smoothed curve to the histogram rather than
drawing a new plot and thereby erasing the histogram.
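The option you will most likely want to adjust is the bandwidth, which controls how smooth the curve is. Here is a sketch using the "adjust=" option, which multiplies the default bandwidth. The multipliers 0.5 and 2 are arbitrary choices for the demonstration.

```r
w <- faithful$waiting
hist(w, prob = TRUE, main = "Effect of the bandwidth", xlab = "waiting")
lines(density(w))                              # default bandwidth
lines(density(w, adjust = 0.5), lty = 2)       # half the default: wigglier
lines(density(w, adjust = 2), lty = 3)         # twice the default: smoother
legend("topleft", legend = c("default", "adjust = 0.5", "adjust = 2"),
       lty = 1:3, bty = "n")
d <- density(w)
d$bw                                           # the bandwidth actually used
```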
Numerical Summaries by Groups
When you have a numerical variable indexed by a categorical variable or
factor, you might want a group-by-group summary in graphical form. The primary
way R offers to achieve this is side-by-side boxplots...
> data(chickwts) # Weight gain by type of diet.
> str(chickwts)
'data.frame': 71 obs. of 2 variables:
$ weight: num 179 160 136 227 217 168 108 124 143 140 ...
$ feed : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
> attach(chickwts)
> plot(feed, weight)
> title(main="Body Weight of Chicks by Type of Diet")
> detach(chickwts)
The function boxplot( ), which takes a formula
interface, can also be used. Here is the example copied and pasted off the
"chickwts" help page...
> boxplot(weight ~ feed, data = chickwts, col = "lightgray",
+ varwidth = TRUE, notch = TRUE, main = "chickwt data",
+ ylab = "Weight at six weeks (gm)")
Warning message:
In bxp(list(stats = c(216, 271.5, 342, 373.5, 404, 108, 136, 151.5, :
some notches went outside hinges ('box'): maybe set notch=FALSE
Notice that several options are set, including an option to color the boxes, the
"varwidth=" option, which sets the width of the box according to the sample
size, the "notch=" option, which gives a confidence interval around the median,
and options to print a main title and y-axis label. The procedure generated a
warning message, which you will understand when you look at the graphic (which
I have not reproduced here).
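If you would like the numbers behind the picture, or want to know which boxes caused the warning, the sketch below may help. With "plot=FALSE", boxplot( ) returns its calculations instead of drawing. The check for notches that go past the hinges is my own reading of what the warning means.

```r
# plot = FALSE returns the numbers behind the boxplot without drawing it.
b <- boxplot(weight ~ feed, data = chickwts, plot = FALSE)
colnames(b$stats) <- b$names
b$stats   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
b$n       # sample size in each group
# b$conf holds the lower and upper ends of each notch. A notch that extends
# past a hinge is what the warning is complaining about.
outside <- b$conf[1, ] < b$stats[2, ] | b$conf[2, ] > b$stats[4, ]
b$names[outside]
```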
Scatterplots
For examining the relationship between two numerical variables, you can't
beat a scatterplot. R has several functions for producing them, two of which
will be demonstrated here...
> data(mammals, package="MASS")
> str(mammals)
'data.frame': 62 obs. of 2 variables:
$ body : num 3.38 0.48 1.35 465.00 36.33 ...
$ brain: num 44.5 15.5 8.1 423.0 119.5 ...
> attach(mammals)
> plot(log(body), log(brain))
> scatter.smooth(log(body), log(brain))
> detach(mammals)
Some explanations are in order. First, I didn't want to attach the MASS package
to the search path, so I used an option when I copied the "mammals" data frame
that told R to look for it there. The data frame contains brain and body
weights from 62 species of land mammals. Second, to produce a linear plot, I
had to do a log transform on both variables, and I did that "on the fly."
Third, the two functions produced the same scatterplot, but the scatter.smooth( ) function also plots a smoothed,
nonparametric regression line on the plot. This line is computed using the
loess technique (local regression) and is called the "loess line." You will
also see the name "lowess" (locally weighted scatterplot smoothing), and I
understand some sources use the two terms differently. R has separate loess( )
and lowess( ) functions. Both plot( ) and scatter.smooth( ) have options that
allow the plots to be modified in several ways.
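The low level functions from the first section come in handy here. The sketch below draws the scatterplot and then adds a least-squares line with abline( ) and a lowess curve with lines( ). It assumes the MASS package is installed, which it normally is, and it skips everything if not. The colors and line types are arbitrary.

```r
if (requireNamespace("MASS", quietly = TRUE)) {
  mammals <- MASS::mammals          # fetch the data frame without attaching MASS
  plot(log(brain) ~ log(body), data = mammals,
       main = "Brain vs. body weight (log scales)")
  # Least-squares straight line; abline() accepts a fitted lm object.
  fit <- lm(log(brain) ~ log(body), data = mammals)
  abline(fit, col = "red")
  # lowess() returns x and y coordinates that lines() can draw directly.
  lines(lowess(log(mammals$body), log(mammals$brain)), lty = 2)
  legend("topleft", legend = c("least squares", "lowess"),
         col = c("red", "black"), lty = 1:2, bty = "n")
  coef(fit)   # intercept and slope on the log scale
}
```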
Interacting With Plots
R supplies several functions that allow you to interact with the graphics
window, including functions that allow you to identify and label points on the
graph. See the help pages for the locator( )
and identify( ) functions for details. I'll
discuss these briefly in a later tutorial.
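In the meantime, here is a small sketch if you want to try them. Both functions wait for mouse clicks, so the code only does anything in an interactive session. The choice of the mtcars data and the axis labels are mine.

```r
# Both functions wait for mouse clicks, so only run them interactively
# with a screen graphics device.
car_names <- rownames(mtcars)
if (interactive()) {
  plot(mtcars$wt, mtcars$mpg,
       xlab = "weight (1000 lb)", ylab = "miles per gallon")
  # Click near points to label them. How you stop (Esc, right-click, or a
  # menu item) depends on the graphics device.
  picked <- identify(mtcars$wt, mtcars$mpg, labels = car_names)
  print(car_names[picked])
  # Click twice anywhere on the plot to read off two pairs of coordinates.
  print(locator(n = 2))
}
```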
Remember to clean up your workspace!
revised 2010 August 4