Disclaimer: This document consists of notes taken from Advanced Data Analysis (STAT 527) and is not my own work.
The following examples demonstrate basic functionality of R as well as various data types available in R.
# Arithmetic
2 * 10
## [1] 20
1 + 2
## [1] 3
# Order of operations
1 + 5 * 10
## [1] 51
(1 + 5) * 10
## [1] 60
# Exponents
2^10
## [1] 1024
9^(1/2)
## [1] 3
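Operator precedence also matters for exponents: `^` is evaluated before unary minus and before division, which is why `9^(1/2)` above needs its parentheses. A minimal sketch, with values chosen only for illustration:

```r
# ^ is evaluated before unary minus, so this is -(2^2)
-2^2
## [1] -4
(-2)^2
## [1] 4
# ^ is evaluated before /, so this is (9^1) / 2, not a square root
9^1/2
## [1] 4.5
# sqrt() gives the same result as raising to the power 1/2
sqrt(9)
## [1] 3
```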
A vector is an ordered collection of values, similar to a column in a spreadsheet. In R, the elements of a vector can be indexed by position.
# Vector
c(1, 2, 3, 4)
## [1] 1 2 3 4
c(1:5, 10)
## [1] 1 2 3 4 5 10
# Using seq to create a sequence
seq(from=1, to=10, by=2)
## [1] 1 3 5 7 9
seq(1, 10, by=2)
## [1] 1 3 5 7 9
seq(1, 10, length=11)
## [1] 1.0 1.9 2.8 3.7 4.6 5.5 6.4 7.3 8.2 9.1 10.0
seq(1, 10)
## [1] 1 2 3 4 5 6 7 8 9 10
# Creating sequences using :
1:5
## [1] 1 2 3 4 5
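The `length=` argument used above is matched to the full argument name `length.out`, which requests a given number of evenly spaced values instead of a step size. A small sketch with illustrative values:

```r
# Four evenly spaced values from 1 to 10
seq(1, 10, length.out = 4)
## [1]  1  4  7 10
# seq_len(n) is a safe way to write 1:n (it returns an empty vector when n is 0)
seq_len(5)
## [1] 1 2 3 4 5
# Number of elements in a vector
length(seq(1, 10, by = 2))
## [1] 5
```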
Data of any type (e.g. sequences, integers, strings, data frames) can be assigned to a variable using the <- operator. Variables created in this way are stored in memory and can be operated on and referenced by calling the variable name.
# Assign a vector to variable a
a <- 1:5
a
## [1] 1 2 3 4 5
b <- seq(15, 3, length=5)
b
## [1] 15 12 9 6 3
c <- a * b
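Arithmetic on vectors is applied element by element, so `c` above holds the pairwise products of `a` and `b`. The sketch below repeats the assignments so that it runs on its own; the name `products` is used here (instead of `c`) only to avoid confusion with the `c()` function:

```r
a <- 1:5
b <- seq(15, 3, length.out = 5)
# Element-wise product: 1*15, 2*12, 3*9, 4*6, 5*3
products <- a * b
products
## [1] 15 24 27 24 15
# A single number is recycled across every element
a * 2
## [1]  2  4  6  8 10
```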
Base R has many functions available for performing routine tasks such as computing the mean of a vector or summing a vector. Help documentation for a particular function or package can be viewed by putting a ? in front of its name, e.g. ?sum or ?datasets.
a
## [1] 1 2 3 4 5
sum(a)
## [1] 15
mean(a)
## [1] 3
sd(a)
## [1] 1.581139
prod(a)
## [1] 120
var(a)
## [1] 2.5
min(a)
## [1] 1
max(a)
## [1] 5
median(a)
## [1] 3
range(a)
## [1] 1 5
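Several of these summaries are related to one another. As a quick check, assuming the same vector `a <- 1:5`, the standard deviation is the square root of the variance, and `summary()` reports several statistics in one call:

```r
a <- 1:5
# sd() is the square root of var()
sqrt(var(a))
## [1] 1.581139
all.equal(sd(a), sqrt(var(a)))
## [1] TRUE
# Minimum, quartiles, mean, and maximum in one call
summary(a)
```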
Lists, vectors, data frames and matrices can all be subsetted using various techniques.
# Create a vector ranging from 0 to 100 by 10
a <- seq(0, 100, by=10)
# Index/subset the first element of the vector
a[1]
## [1] 0
# Index the first 3 elements of the vector
a[1:3]
## [1] 0 10 20
# Index the first and fourth elements
a[c(1,4)]
## [1] 0 30
# Reassign the value of the first element
a[1] <- 7
# Evaluate elements of a vector
a > 50
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
# Subset the elements for which the condition evaluates to TRUE
a[a > 50]
## [1] 60 70 80 90 100
# Similarly, which() returns the positions of the elements that satisfy the condition
which(a > 50)
## [1] 7 8 9 10 11
# Negate evaluation
!(a > 50)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
# Select all elements NOT greater than 50
a[!(a > 50)]
## [1] 7 10 20 30 40 50
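The examples above use a vector, but the same bracket notation extends to other objects. A minimal sketch using small made-up objects (`m`, `df`, and `lst` are hypothetical names, not part of the course notes):

```r
# Matrix: index by [row, column]
m <- matrix(1:6, nrow = 2)   # filled column by column
m[2, 3]
## [1] 6
m[, 1]                       # all rows of the first column
## [1] 1 2

# Data frame: index by [row, column] or by column name with $
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
df$x
## [1] 1 2 3
df[df$x > 1, "y"]            # rows where x > 1, column y
## [1] "b" "c"

# List: [[ ]] extracts one element, [ ] returns a smaller list
lst <- list(num = 1:3, txt = "hello")
lst[["txt"]]
## [1] "hello"
length(lst["num"])
## [1] 1
```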
Comparisons return a logical (TRUE/FALSE) vector when evaluated.
a
## [1] 7 10 20 30 40 50 60 70 80 90 100
# Extract elements where expression evaluates to TRUE
a[(a == 55)]
## numeric(0)
a[(a != 55)]
## [1] 7 10 20 30 40 50 60 70 80 90 100
a[(a > 50)]
## [1] 60 70 80 90 100
a[(a < 50)]
## [1] 7 10 20 30 40
a[(a >= 50)]
## [1] 50 60 70 80 90 100
a[(a <= 50)]
## [1] 7 10 20 30 40 50
# Set membership: test whether each value appears in a
c(10, 14, 40, 60, 99) %in% a
## [1] TRUE FALSE TRUE TRUE FALSE
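Because `TRUE` and `FALSE` are treated as 1 and 0 in arithmetic, comparisons can also be used to count matches. A short sketch, recreating the vector `a` from above so that it runs on its own:

```r
a <- c(7, seq(10, 100, by = 10))
# How many elements are greater than 50?
sum(a > 50)
## [1] 5
# What proportion of elements are at most 50? (6 of 11)
mean(a <= 50)
## [1] 0.5454545
```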
Logical operators combine TRUE/FALSE values and return TRUE/FALSE values.
a
## [1] 7 10 20 30 40 50 60 70 80 90 100
# Subset values within a certain range using and (&)
a[(a > 50) & (a <=90)]
## [1] 60 70 80 90
# Subset values that satisfy either condition using or (|)
a[(a < 50) | (a > 100)]
## [1] 7 10 20 30 40
a[(a < 50) | !(a > 100)]
## [1] 7 10 20 30 40 50 60 70 80 90 100
a[(a >= 50) & !(a <= 90)]
## [1] 100
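The operators `&` (and), `|` (or), and `!` (not) work element by element on logical vectors. The sketch below lists all four combinations of two logical values; `x` and `y` are illustrative names:

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
x & y    # TRUE only where both are TRUE
## [1]  TRUE FALSE FALSE FALSE
x | y    # TRUE where at least one is TRUE
## [1]  TRUE  TRUE  TRUE FALSE
!x       # flips each value
## [1] FALSE FALSE  TRUE  TRUE
# De Morgan's law: not (x and y) equals (not x) or (not y)
identical(!(x & y), !x | !y)
## [1] TRUE
```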
NA (not available) means the value is missing. Most calculations involving NA return NA by default.
NA + 8
## [1] NA
3 * NA
## [1] NA
mean(c(1, 2, NA))
## [1] NA
# Some functions have the ability to ignore NA
mean(c(NA, 1, 2), na.rm=TRUE)
## [1] 1.5
sum(c(NA, 1, 2))
## [1] NA
sum(c(NA, 1, 2), na.rm=TRUE)
## [1] 3
# Evaluating NA
a <- c(NA, 1:5, NA)
a
## [1] NA 1 2 3 4 5 NA
is.na(a)
## [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE
!is.na(a)
## [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE
# Subset all values that are not NA
a[!is.na(a)]
## [1] 1 2 3 4 5
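One consequence of this rule is that comparing a value to `NA` with `==` also returns `NA`, which is why `is.na()` is used to test for missing values. A brief sketch using the same vector:

```r
a <- c(NA, 1:5, NA)
# Comparing with == does not detect missing values
NA == NA
## [1] NA
# Count the missing values
sum(is.na(a))
## [1] 2
# Positions of the missing values
which(is.na(a))
## [1] 1 7
```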
This section is intended as an introduction to ggplot2 and some of its capabilities. In brief, ggplot2 requires a data.frame object as input. You then define plot layers that stack on top of each other, and each layer has visual/text elements whose aesthetics (e.g. color, size, opacity) are mapped to columns of the data. In this way, a simple set of commands can be combined to produce extremely informative displays.
In the example that follows, we consider the mpg dataset, which consists of fuel economy data from 1999 and 2008 for 38 popular models of car.
# Install ggplot2 if it is not already installed
if(!require(ggplot2)){install.packages("ggplot2")}
## Loading required package: ggplot2
# Load the ggplot2 package
library(ggplot2)
# The mpg dataset ships with ggplot2 and is available once the package is loaded
# Let's take a look at the first few rows of data using head()
head(mpg)
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
# Inspect the data type of each column
str(mpg)
## 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
# Compute summary statistics on each column/data type
summary(mpg)
##      manufacturer                 model         displ            year
##  dodge     :37    caravan 2wd        : 11   Min.   :1.600   Min.   :1999
##  toyota    :34    ram 1500 pickup 4wd: 10   1st Qu.:2.400   1st Qu.:1999
##  volkswagen:27    civic              :  9   Median :3.300   Median :2004
##  ford      :25    dakota pickup 4wd  :  9   Mean   :3.472   Mean   :2004
##  chevrolet :19    jetta              :  9   3rd Qu.:4.600   3rd Qu.:2008
##  audi      :18    mustang            :  9   Max.   :7.000   Max.   :2008
##  (Other)   :74    (Other)            :177
##       cyl               trans    drv          cty             hwy
##  Min.   :4.000   auto(l4)  :83   4:103   Min.   : 9.00   Min.   :12.00
##  1st Qu.:4.000   manual(m5):58   f:106   1st Qu.:14.00   1st Qu.:18.00
##  Median :6.000   auto(l5)  :39   r: 25   Median :17.00   Median :24.00
##  Mean   :5.889   manual(m6):19           Mean   :16.86   Mean   :23.44
##  3rd Qu.:8.000   auto(s6)  :16           3rd Qu.:19.00   3rd Qu.:27.00
##  Max.   :8.000   auto(l6)  : 6           Max.   :35.00   Max.   :44.00
##                  (Other)   :13
##  fl             class
##  c:  1   2seater   : 5
##  d:  5   compact   :47
##  e:  8   midsize   :41
##  p: 52   minivan   :11
##  r:168   pickup    :33
##          subcompact:35
##          suv       :62
# Plot hwy mpg against engine displacement adding titles and axis labels
ggplot(mpg) + geom_point(aes(x=displ, y=hwy)) + labs(title="Highway MPG vs Displacement") +
xlab("Engine displacement (liters)") + ylab("Highway MPG")
The ggplot() function sets up the data frame to be used while the geom_point() function specifies the type of plot (i.e. geom) to use. The aes() function inside geom_point() is used to map plot attributes (i.e. \(x\) and \(y\) values) to columns within the data frame. We can map other plot attributes to columns using aes(); for example, we can map the color of each point to the class of vehicle.
ggplot(mpg) + geom_point(aes(x=displ, y=hwy, color=class)) + labs(title="Highway MPG vs Displacement") +
xlab("Engine displacement (liters)") + ylab("Highway MPG")
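Note the difference between mapping and setting an aesthetic. Inside `aes()`, `color=class` ties color to a column and produces a legend; outside `aes()`, a fixed value such as `color="blue"` applies to every point. A minimal sketch (the color and the names `p_mapped` and `p_fixed` are arbitrary choices for illustration):

```r
library(ggplot2)
# Color mapped to a column: one color per vehicle class, with a legend
p_mapped <- ggplot(mpg) + geom_point(aes(x=displ, y=hwy, color=class))
# Color set to a constant: every point is blue, no legend
p_fixed <- ggplot(mpg) + geom_point(aes(x=displ, y=hwy), color="blue")
p_mapped
p_fixed
```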
We can go a step further and map the size of each point to the number of cylinders and the shape of each point to the drive type (i.e. front-wheel, rear-wheel, or four-wheel drive). In this way, we can encode a large amount of information (dimensions) into each plot (although this is not always desirable).
ggplot(mpg) + geom_point(aes(x=displ, y=hwy, color=class, shape=drv, size=cyl)) +
labs(title="Highway MPG vs Displacement") + xlab("Engine displacement (liters)") +
ylab("Highway MPG")
Small multiples (also known as faceting, trellis charts, lattice charts, grid charts, or panel charts) are a series or grid of small, similar graphics or charts. They are useful when we want to stratify data and compare groups to one another. Typically, each facet shows a different subset of the data, which makes facets well suited to exploring conditional relationships.
# Start by creating a basic scatterplot
p <- ggplot(mpg) + geom_point(aes(x=displ, y=hwy))
p1 <- p + facet_grid(. ~ cyl) # Columns are cyl categories
p2 <- p + facet_grid(drv ~ .) # Rows are drv categories
p3 <- p + facet_grid(drv ~ cyl) # Rows are drv categories, columns are cyl categories
p4 <- p + facet_wrap(~ class) # Wrap plots by class category
# Plot all plots in one figure
library(gridExtra)
## Loading required package: grid
grid.arrange(p1, p2, p3, p4, ncol=2)
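`facet_wrap()` also accepts arguments that control the panel layout. As a sketch, `ncol` fixes the number of columns and `scales="free_y"` lets each panel use its own y-axis range (the specific values and the names `p5` and `p6` are illustrative choices):

```r
library(ggplot2)
p <- ggplot(mpg) + geom_point(aes(x=displ, y=hwy))
# Arrange the class panels in 4 columns
p5 <- p + facet_wrap(~ class, ncol=4)
# Give each panel its own y-axis range (panels become harder to compare directly)
p6 <- p + facet_wrap(~ class, scales="free_y")
p5
p6
```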
Sometimes plots can be enhanced by adding noise, for example when plotting categorical data or large numbers of observations. Because categorical (or discrete) variables take on only a finite set of values, many data points land on exactly the same position and hide one another. One solution is to add just enough noise that the data points separate while still maintaining their categorical relationships.
# Overplotted data
ggplot(mpg) + geom_point(aes(x=cty, y=hwy))
In the above plot, some of the data overlap. We can avoid this issue by adding jitter to each point and changing the transparency (alpha) of each point so that hidden data points are revealed.
ggplot(mpg) + geom_point(aes(x=cty, y=hwy), position = "jitter", alpha=0.5)
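To see how much overlap there is, we can compare the number of points drawn with the number of distinct (`cty`, `hwy`) positions. A rough sketch using base R's `unique()`; the variable names are illustrative:

```r
library(ggplot2)
# Keep only the two plotted columns
xy <- as.data.frame(mpg[, c("cty", "hwy")])
# Number of points drawn versus number of distinct positions
n_points <- nrow(xy)
n_positions <- nrow(unique(xy))
c(points = n_points, distinct_positions = n_positions)
# Points hidden behind another point in the un-jittered plot
n_points - n_positions
```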
We can also improve plots by rearranging the order in which data appear.
ggplot(mpg) + geom_point(aes(x=class, y=hwy))
Suppose we want to reorder the classes in ascending order of highway MPG.
ggplot(mpg) + geom_point(aes(x=reorder(class, hwy), y=hwy))
Great, we reordered the vehicle classes by highway MPG, but a lot of the data are still obscured. Let's try adding jitter and changing transparency.
ggplot(mpg) + geom_point(aes(x=reorder(class, hwy), y=hwy), position="jitter", alpha=0.5)
Suppose we are more interested in comparing the distribution of highway MPG between vehicle classes. Boxplots would be more appropriate.
ggplot(mpg) + geom_boxplot(aes(x=reorder(class, hwy), y=hwy))
That’s good, but we would like to see how the data points are distributed within the quartiles. Let's add a geom_point() layer to show this.
# Specify how much jitter you would like to add
ggplot(mpg) + geom_boxplot(aes(x=reorder(class, hwy), y=hwy)) +
geom_point(aes(x=reorder(class, hwy), hwy), alpha=0.5, position=position_jitter(width=0.1)) +
labs(title="Vehicle class vs highway MPG") + xlab("Vehicle class") + ylab("Highway MPG")
In the above examples we reordered the vehicle classes based on mean highway MPG (mean is used by default); however, we can just as easily reorder based on other statistics by specifying which FUN to use.
# Order vehicle class by median MPG
ggplot(mpg) + geom_boxplot(aes(x=reorder(class, hwy, FUN=median), y=hwy), alpha=0.5) +
geom_point(aes(x=reorder(class, hwy, FUN=median), y=hwy), alpha=0.5, position=position_jitter(width=0.1)) +
labs(title="Ordered by median MPG")
# Order vehicle class by the MPG standard deviation
ggplot(mpg) + geom_boxplot(aes(x=reorder(class, hwy, FUN=sd), y=hwy), alpha=0.5) +
geom_point(aes(x=reorder(class, hwy, FUN=sd), y=hwy), alpha=0.5, position=position_jitter(width=0.1)) +
labs(title="Ordered by MPG standard deviation")
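To see what `reorder()` does without a plot, the sketch below applies it to a small made-up data set (the names `grp` and `val` are hypothetical) in which the mean and the median give different orderings:

```r
grp <- c("a", "a", "a", "b", "b", "b")
val <- c(1, 1, 10, 3, 3, 3)
# Group means: a = 4, b = 3
tapply(val, grp, mean)
# Group medians: a = 1, b = 3
tapply(val, grp, median)
# Levels are sorted by the chosen statistic, smallest first
levels(reorder(grp, val))                 # mean is the default
## [1] "b" "a"
levels(reorder(grp, val, FUN = median))
## [1] "a" "b"
```

Because ggplot2 draws the categories of a discrete axis in level order, changing `FUN` changes the left-to-right order of the boxplots.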