DESCRIBING DATA GRAPHICALLY

Introduction

The graphical procedures in R are extremely powerful. I'm told there are people who use R not so much for data analysis as for its ability to produce top notch publication quality graphics. I will only scratch the surface of these capabilities here. A later tutorial will fill in a few more of the details. Don't make the mistake of assuming that this tutorial is in any fashion a complete summary of R graphics capabilities. It is a brief overview of how to get a mostly quick-and-dirty graph to visualize your data.

R graphics functions can be grouped into three types:
I'm going to do something unusual, and perhaps ill-advised, and cover the low level functions first, so that you will be ready to use them in conjunction with the high level functions when we get to those. If you just want the quick and dirty approach, then skip this first section (for now).

Low Level Plotting Functions

For years I told my students, "I can draw anything I want in an R graphics window. I can draw a clown face in an R graphics window if I want to." Finally a student challenged me to do it. It took me the better part of a day to figure out!

To conserve space, I'm not going to reproduce the output of every single example in this tutorial. If you have R open and are following along, you can see it on your own screen.

High level plotting functions open a graphics device (window) automatically,
but the low level functions do not. So to get a graph and some axes to work
with, the following command will get us started without actually drawing a
graph.
> plot(1:100, 1:100, type="n", xlab="", ylab="")

The plot() function is high level, opening a graphics window and drawing labeled axes, but in this case we've asked it not to plot anything with the type="n" option. Now we have a palette. Let's paint on it.

One thing we can do is plot a curve from an algebraic equation. Let's use
the equation y = 0.01 x^2.
> curve(x^2/100, add=TRUE)     # add=T adds the curve to an existing graph

We may want to add some text to the graph, which tells our intended audience just what it is we've plotted.

> text(x=80, y=50, "This is a graph of")
> text(x=80, y=45, "the equation")
> text(x=80, y=37, expression(y == frac(1,100) * x^2))

The text() function takes, first, arguments that give x,y-coordinates at which the text will be centered (and this can take some careful eyeballing or some trial and error), and then it takes quoted text or a mathematical expression. The syntax for the expression() function is an art form in itself (similar to LaTeX), and I have not mastered it, but it can be used to produce some very fancy mathematical expressions. There are also options for controlling font face and size as well as spacing, etc.

Next let's draw some points on this curve.
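As a quick check before plotting, the y-coordinates of points on the curve can be worked out from the equation itself. This is only a small sketch; the variable names x and y are my own choice here and will end up in your workspace.

```r
# x-coordinates of three points we would like to mark on the curve
x <- c(20, 60, 90)

# the curve is y = x^2/100, so the matching y-coordinates are
y <- x^2 / 100
y
```

For these three x values the result is 4, 36, and 81, which are the y-coordinates used in the command that follows.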
> points(x=c(20, 60, 90), y=c(4, 36, 81), pch=6)     # or points(x<-c(20,60,90), y=x^2/100, pch=6)

The first vector gives the x-coordinates of the desired points, the second vector gives the y-coordinates (which can be calculated on the fly), and the "pch=" option gives the point character to use. There are about twenty different point characters to choose from. To see some of them, do this.

> points(x=rep(100,10), y=seq(0,90,10), pch=seq(1,20,2))

You can experiment for yourself to find out what the rest of them look like.

Now let's draw a straight line through a couple of those points, say the one at (20, 4) and the one at (90, 81). The draw-a-straight-line function is abline(), and in this case its arguments are "a=the y-intercept" and "b=the slope" of the desired line.
> abline(a=-18, b=1.1, col="red")

And just to be showy, we made it red with col="red". We can also draw horizontal and vertical lines with this function.

> abline(h=20, lty=2)     # abline(h=20, lty="dashed") also works
> abline(v=20, lty=3)     # abline(v=20, lty="dotted") also works

The "lty=" option specifies the line type (1=solid, 2=dashed, 3=dotted). You can also change the color of these lines with the col= option, and the width of the lines with the lwd= option. Try repeating that last command but set lty=1 and lwd=3.

We can also draw lines and/or points using the lines() function.
> lines(x=c(40, 40, 60, 60), y=c(80, 100, 100, 80), type="b")
> lines(x=c(40, 60), y=c(80, 80), type="l")     # type="lower case L", not "one"

Once again, the first vector gives the x-coordinates, the second vector the y-coordinates, and the "type=" option tells whether you want just points (type="p"), just lines (type="l"), or both (type="b"). Note: For just lines, use a lower case L. This example shows that type="l" and type="b" behave a bit differently in terms of where the line begins and terminates.

Finally, at least as far as this tutorial is concerned, titles and axis
labels can be added using the title() function. (NOTE: If you already have axis labels, there is a little trick that SOMETIMES works to erase them. The plot() function sets axis labels by default, although we blanked ours by setting xlab="" and ylab="". Try writing over the old labels in the background color of the graph, in this case, white. This brings up an IMPORTANT POINT. Be careful when you're drawing a complex graph, because it's generally true that, when you make a mistake, you start again! It's advisable to use the script window to write out the commands for a complex graph.)
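Here is a rough sketch of that overwriting trick, on a throwaway graph of its own so it doesn't disturb the drawing we're building. The label text is made up for the illustration, and I'm assuming the background really is white. As noted, this only SOMETIMES works: on some graphics devices a faint ghost of the old label remains.

```r
# hypothetical label text, used only for this illustration
old_label <- "A label we regret"
new_label <- "A better label"

# a throwaway graph that already has an x-axis label
plot(1:10, 1:10, xlab = old_label, ylab = "")

# write over the old label in the background color (assumed white) ...
title(xlab = old_label, col.lab = "white")

# ... and then write the new label in its place
title(xlab = new_label)
```

If the result looks smudged, the safer route is the one recommended above: fix the command in your script and redraw the graph from the start.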
> title(main="A Drawing To Put On the Refrigerator!")
> title(xlab="This is the x-axis", col.lab="green", cex.lab=1.5)

This example also gives a little taste of how various options can be used to control colors, fonts, text sizes, and so forth. We'll do more of this in a future tutorial. The col.lab= option sets the color of the label, while the cex.lab= option controls its size. "cex" stands for character expansion factor, so setting cex.lab=1.5 makes the label 1.5 times its normal size.

And now let's have a look at our masterpiece.

# start copying here
plot(1:100, 1:100, type="n", xlab="", ylab="")
curve(x^2/100, add=TRUE)
text(x=80, y=50, "This is a graph of")
text(x=80, y=45, "the equation")
text(x=80, y=37, expression(y == frac(1,100) * x^2))
points(x=c(20,60,90), y=c(4,36,81), pch=6)
points(x=rep(100,10), y=seq(0,90,10), pch=seq(1,20,2))
abline(a=-18, b=1.1, col="red")
abline(h=20, lty=2)
abline(v=20, lty=3)
lines(x=c(40,40,60,60), y=c(80,100,100,80), type="b")
lines(x=c(40,60), y=c(80,80), type="l")
title(main="A Drawing To Put On the Refrigerator!")
title(xlab="This is the x-axis", col.lab="green", cex.lab=1.5)
# stop copying here and paste to your R Console

You can also paste this into a script window (File > New Document on a Mac, File > New Script in Windows) and experiment with it. Learn by doing!

High Level Plotting Functions

Usually, we don't want to fuss that much. We just want to see a graph of some data we're examining. If we want to dress it up for publication, THEN we'll worry about the low-level functions and various options.

The basic high level plotting function is
plot(), and it works differently depending upon what you're asking
it to plot. The basic syntax is plot(x, y, ...), where x is a vector of
x-coordinates, y is a vector of y-coordinates, and ... represents further
refinements and options, as will be illustrated.
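What plot() draws depends on the class of the thing you hand it. As a small sketch, here is a look at the classes of the built-in data objects used in the session that follows. I'm using the $ notation here rather than attach(), which is just my choice for keeping the example self-contained.

```r
# class() reports what kind of object plot() is being given
class(faithful$waiting)    # a numeric vector
class(ToothGrowth$supp)    # a factor
class(sunspots)            # a time series
class(UCBAdmissions)       # a table
class(mtcars)              # a data frame
```

Numeric against numeric gives a scatterplot, a factor on the x-axis gives boxplots, a time series gives a time-series plot, a table gives a mosaic plot, and a data frame of numeric variables gives a scatterplot matrix.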
> data(faithful)
> attach(faithful)
> names(faithful)
[1] "eruptions" "waiting"
> plot(x=waiting, y=eruptions)     # x is num., y is num., plot is a scatterplot
> detach(faithful)
> rm(faithful)

> data(ToothGrowth)
> attach(ToothGrowth)
> names(ToothGrowth)
[1] "len"  "supp" "dose"
> plot(x=supp, y=len)              # x is factor, y is num., plot is boxplots
> plot(x=factor(dose), y=len)      # coercing dose to a factor
> detach(ToothGrowth)
> rm(ToothGrowth)

> data(sunspots)
> class(sunspots)
[1] "ts"
> plot(sunspots)                   # x is time series, y missing, plot is a
> rm(sunspots)                     # time-series plot

> data(UCBAdmissions)
> class(UCBAdmissions)
[1] "table"
> plot(UCBAdmissions)              # x is table, y missing, plot is a mosaic plot
> rm(UCBAdmissions)

> data(mtcars)
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17.0 18.6 19.4 17.0 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> plot(mtcars)                     # x is dataframe of num. vars., y missing,
> rm(mtcars)                       # plot is a scatterplot matrix

Note: Plotting can also be done with a formula interface: plot(y ~ x).

Pie Charts and Bar Graphs

When a single categorical variable is being graphed, the customary way is to use a pie chart or a bar graph. Statisticians are somewhat biased against pie charts, and I suppose for good reason, but I'll illustrate them anyway, just in case you have a hankerin' to flout good statistical practice.

The data set UCBAdmissions, which we were using above, is the Berkeley
admissions data we used in a different form in a previous tutorial. The data set
is a 3-D table, and we need a 1-D table to illustrate a basic piechart and
barplot, so...
> margin.table(UCBAdmissions, 3)     # Collapse over dimensions 1 and 2.
Dept
  A   B   C   D   E   F 
933 585 918 792 584 714 
> margin.table(UCBAdmissions,3) -> Department
> pie(Department)
> barplot(Department, xlab="Department", ylab="frequency")

If you want to look at two categorical variables at once, a stacked barplot,
or better yet, a side-by-side barplot is usually the way to go.
> margin.table(UCBAdmissions, c(1,3)) -> Admit.by.Dept
> barplot(Admit.by.Dept)
> barplot(Admit.by.Dept, beside=T, ylim=c(0,1000), legend=T,
+         main="Admissions by Department")

Histograms

When you have one numeric variable to look at, a histogram is appropriate.
I'll use the "faithful" data set again to illustrate.
> data(faithful)     # This is optional.
> attach(faithful)
> hist(waiting)

If you want more or fewer bars, you can refine your plot by using the
"breaks=" option and defining your own breakpoints.
> range(waiting)
[1] 43 96
> hist(waiting, breaks=seq(from=40, to=100, by=10))

By default, R includes the right limit (right side of the bar) but not the left limit in the intervals. Usually, I prefer it the other way around, so I change it with the "right=" option, which by default is TRUE.

> hist(waiting, breaks=seq(40,100,10), right=F)

There are many, many other options as well, which you can examine by looking at the help page for this function: type ?hist.

R also incorporates many functions for data smoothing, including kernel
density smoothing of histograms. If you'd rather see a smooth curve than a
boxy histogram, it can be done as follows.
> plot(density(waiting))
> # Or, getting fancier...
> hist(waiting, prob=T)
> lines(density(waiting))
> detach(faithful)

Numerical Summaries by Groups
When you have a numerical variable indexed by a categorical variable or
factor, you might want a group-by-group summary in graphical form. The primary
way R offers to achieve this is side-by-side boxplots.
> data(chickwts)     # Weight gain by type of diet.
> str(chickwts)
'data.frame': 71 obs. of 2 variables:
 $ weight: num 179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
> attach(chickwts)
> plot(feed, weight)     # boxplot(weight ~ feed) will also work
> title(main="Body Weight of Chicks by Type of Diet")
> means = tapply(weight, feed, mean)
> points(x=1:6, y=means, pch=16)     # pch=16 is a filled circle
> detach(chickwts)

The function boxplot(), which takes a formula interface, can also be used. Here is the example copied and pasted off the "chickwts" help page.

> boxplot(weight ~ feed, data = chickwts, col = "lightgray",
+         varwidth = TRUE, notch = TRUE, main = "chickwt data",
+         ylab = "Weight at six weeks (gm)")
Warning message:
In bxp(list(stats = c(216, 271.5, 342, 373.5, 404, 108, 136, 151.5, :
  some notches went outside hinges ('box'): maybe set notch=FALSE

Notice that several options are set, including an option to color the boxes, the "varwidth=" option, which sets the width of the box according to the sample size, the "notch=" option, which gives a confidence interval around the median, and options to print a main title and y-axis label. The procedure generated a warning message, which you will understand when you look at the graphic (which I have not reproduced here).

Scatterplots

For examining the relationship between two numerical variables, you can't
beat a scatterplot. R has several functions for producing them, two of which
will be demonstrated here.
> data(mammals, package="MASS")
> str(mammals)
'data.frame': 62 obs. of 2 variables:
 $ body : num 3.38 0.48 1.35 465.00 36.33 ...
 $ brain: num 44.5 15.5 8.1 423.0 119.5 ...
> attach(mammals)
> plot(log(body), log(brain))     # plot(x=body, y=brain, log="xy") is similar (try it)
> scatter.smooth(log(body), log(brain))
> detach(mammals)

Interacting With Plots

R supplies several functions that allow you to interact with the graphics window, including functions that allow you to identify and label points on the graph. See the help pages for the locator() and identify() functions for details. I'll discuss these briefly in a later tutorial.

Remember to clean up your workspace!

revised 2016 January 31