Learning ggplot2

March 31, 2012

I have always been content with R's base graphics, which are indeed very powerful. However, I have been hearing a lot of praise for ggplot2 around the internet, and after looking through some of the graphs it can generate, I was really impressed. After going through some basic ggplot2 tutorials, I think it would be a great tool to add to my R arsenal. So I decided to write a blog post documenting a step-by-step venture into the world of ggplot2. Most of the examples here are very preliminary plots, but the difference between those generated by the base graphics system and by ggplot2 will be apparent.

To follow this tutorial, I am assuming that you have the ggplot2 package installed in R. I am using the UScereal data from the MASS package. To see a list of the datasets available in the currently loaded packages:

> data()

So, the first step is to load the two packages in R.

> library(ggplot2)
> library(MASS)

Now, let's load the UScereal data.

> data(UScereal)

The dataset is basically a table of cereal brands by properties. Each row is a cereal and each column is one of its properties. There are 11 properties:

> colnames(UScereal)
[1] "mfr" "calories" "protein" "fat" "sodium" "fibre" "carbo" "sugars" "shelf" "potassium" "vitamins"

Here mfr (manufacturer) and vitamins are categorical, stored as factors in R. All the other columns are numeric. Now, let's do some preliminary analysis. The first thing to do is look at the distribution of these variables using histograms. Let's start with the histogram of calories using ggplot2.
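
To double-check which columns are factors and which are numeric, a quick sketch in base R could look like this. It assumes only that the MASS package is available; the variable names col_is_factor, factor_cols and numeric_cols are mine, purely for illustration.

```r
library(MASS)  # provides the UScereal data frame
data(UScereal)

# TRUE for every column that is stored as a factor
col_is_factor <- sapply(UScereal, is.factor)

factor_cols  <- names(UScereal)[col_is_factor]
numeric_cols <- names(UScereal)[!col_is_factor]

print(factor_cols)
print(numeric_cols)
```

Calling str(UScereal) gives a similar overview in a single call.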

> qplot(calories, data=UScereal)
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Histogram of Calories

Here, ggplot2 also prints a message because no binwidth was given; the graph is still produced, since ggplot2 picks a binwidth automatically (the range divided by 30). Let's see how the graph looks if we set a binwidth of 30.

> qplot(calories, data=UScereal, binwidth=30)

Histogram with binwidth of 30
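
The same histogram can also be written with the fuller ggplot() interface, which I believe recent ggplot2 releases recommend over qplot(). This is only a minimal sketch, assuming ggplot2 and MASS are installed; the name p_hist is just for illustration.

```r
library(ggplot2)
library(MASS)  # provides the UScereal data frame

# Map calories to the x axis and draw a histogram with bins 30 calories wide
p_hist <- ggplot(UScereal, aes(x = calories)) +
  geom_histogram(binwidth = 30)

print(p_hist)
```

Here the data and the variable mapping go into ggplot(), and the plot type is added as a separate layer with +.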

Now, what if we want to see a separate histogram of calories for each manufacturer? We could do this with base graphics by setting par(mfrow=c(nrows,ncols)) and drawing each histogram in turn, but ggplot2 has a much easier way.

> qplot(calories, data=UScereal,binwidth=30,facets=.~mfr)

Histogram of calories, split into facets by manufacturer

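Here the 'facets' option draws multiple graphs based on a categorical variable. The above command splits the graph by mfr, and the position of the variable relative to the ~ specifies the arrangement: the formula reads rows ~ columns, so .~mfr lays the panels out side by side in a single row, while mfr~. would stack them in a single column.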
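
To see the effect of the two formula orientations, here is a small sketch using ggplot() with facet_grid(). It assumes ggplot2 and MASS are installed; base_hist, p_side_by_side and p_stacked are names I made up for this example.

```r
library(ggplot2)
library(MASS)  # provides the UScereal data frame

# Histogram of calories, reused for both layouts
base_hist <- ggplot(UScereal, aes(x = calories)) +
  geom_histogram(binwidth = 30)

# . ~ mfr : a single row with one panel per manufacturer
p_side_by_side <- base_hist + facet_grid(. ~ mfr)

# mfr ~ . : a single column with one panel per manufacturer
p_stacked <- base_hist + facet_grid(mfr ~ .)

print(p_side_by_side)
print(p_stacked)
```

The side-by-side layout makes it easier to compare bar heights across manufacturers, while the stacked layout lines up the calorie axis.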

Now, let's look at the relationship between the sugar content and calories of the cereals.

> qplot(sugars,calories, data=UScereal)

Relationship between calories and sugar content of Cereals

Notice that the same qplot command drew a scatterplot here, whereas in the previous examples it drew a histogram. The difference is that when we give qplot two variables, it assumes a scatterplot; when we give it only one, it draws a histogram. In other words, ggplot2 chooses the type of graph based on which variables are supplied. But we do have the freedom to choose the kind of plot we want through the "geom" option in qplot. For example, if we want to see the kernel density estimate of the calories, we can simply add geom="density" to the same qplot command that we used to generate the first histogram.

> qplot(calories, data=UScereal, geom="density")

Density Plot of Calories
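
For comparison, here is a rough sketch of the same idea in base graphics and with the ggplot() interface. It assumes ggplot2 and MASS are installed; dens and p_dens are illustrative names of my own.

```r
library(ggplot2)
library(MASS)  # provides the UScereal data frame

# Base R: compute the kernel density estimate and plot it
dens <- density(UScereal$calories)
plot(dens, main = "Density of calories (base graphics)")

# ggplot2: draw a density estimate of the same variable as a layer
p_dens <- ggplot(UScereal, aes(x = calories)) +
  geom_density()
print(p_dens)
```

In base graphics the estimate is computed first and then plotted, whereas ggplot2 does both inside the geom_density() layer.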

Now, if we want to look at the relationship between calories and sugar content broken down by both manufacturer and vitamin content, we can use a simple command:

> qplot(sugars,calories, data=UScereal, facets=mfr~vitamins)
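
With two factors in the facet formula, some manufacturer and vitamin combinations may contain few or no cereals, which shows up as sparse or empty panels. A quick way to check is a cross-tabulation, sketched here with base R's table() (the name panel_counts is just for illustration).

```r
library(MASS)  # provides the UScereal data frame

# Number of cereals for each manufacturer (rows) and vitamin level (columns)
panel_counts <- table(mfr = UScereal$mfr, vitamins = UScereal$vitamins)
print(panel_counts)

# A count of 0 means the corresponding panel in the faceted plot has no points
```

Each cell of this table corresponds to one panel of the mfr~vitamins grid.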

Now, let's look at the relationship between protein content, sugars, fat, and the manufacturer of these cereals in just one graph. I used point size for the fat content and point color for the manufacturer.

> qplot(protein,sugars, data=UScereal, size=fat,color=mfr)

Finally, let's label this graph:

> qplot(protein,sugars, data=UScereal, size=fat,color=mfr,xlab="grams of protein in one portion",ylab="grams of sugar in one portion",main="Protein content Vs. sugar content of common US cereals")
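
The labelled plot can also be sketched with ggplot() and labs(), where the aesthetics (position, size, colour) are declared in aes() and the labels are added as a separate step. This assumes ggplot2 and MASS are installed; p_labeled is only an illustrative name.

```r
library(ggplot2)
library(MASS)  # provides the UScereal data frame

# Point size encodes fat, point colour encodes the manufacturer
p_labeled <- ggplot(UScereal,
                    aes(x = protein, y = sugars, size = fat, colour = mfr)) +
  geom_point() +
  labs(x = "grams of protein in one portion",
       y = "grams of sugar in one portion",
       title = "Protein content vs. sugar content of common US cereals")

print(p_labeled)
```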

In the scatterplot examples above, where we did not specify any geom, qplot automatically used points, but there are many other types of graphs that can be generated using the geom option. For example, we can generate a line graph by specifying geom="line":

> qplot(protein,sugars, data=UScereal, size=fat,color=mfr,geom="line",xlab="grams of protein in one portion",ylab="grams of sugar in one portion",main="Protein content Vs. sugar content of common US cereals")
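
Geoms can also be combined as layers on the same plot. As a rough sketch (assuming ggplot2 and MASS are installed, with p_lines as an illustrative name), the following draws one line per manufacturer and overlays the points, keeping the fat-based sizing on the points only:

```r
library(ggplot2)
library(MASS)  # provides the UScereal data frame

p_lines <- ggplot(UScereal, aes(x = protein, y = sugars, colour = mfr)) +
  geom_line() +                  # one line per manufacturer, via the colour mapping
  geom_point(aes(size = fat))    # points sized by fat content

print(p_lines)
```

Mapping size only inside geom_point() keeps the lines at a constant width, which I find easier to read.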

Obviously, this is not an extensive tutorial on this amazing package. There are many visually stimulating graphs that you can generate with it. I will try to keep this page updated as I venture further into the beautiful world of R.

Categories: Graphics, R