Disclaimer: This document contains notes taken from Advanced Data Analysis (STAT 527) and is not my own work.

Learning objectives

  1. Identify a function or operation and describe its use.
  2. Apply functions and operations to achieve a specific result.
  3. Predict answers of calculations written in R.
  4. Use R’s functions to get help and numerically summarize data.
  5. Apply ggplot() to organize and reveal patterns visually.
  6. Explain what each plotting option does.

R building blocks

Basic arithmetic operations

The following examples demonstrate basic functionality of R as well as various data types available in R.

# Arithmetic
2 * 10
## [1] 20
1 + 2
## [1] 3
# Order of operations
1 + 5 * 10
## [1] 51
(1 + 5) * 10
## [1] 60
# Exponents
2^10
## [1] 1024
9^(1/2)
## [1] 3
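
The parentheses in 9^(1/2) matter. As a short sketch of how precedence changes a result, the lines below show that exponentiation is evaluated before both unary minus and division; sqrt() is included only as a cross-check.

```r
# Exponentiation is evaluated before unary minus
-2^2
## [1] -4
(-2)^2
## [1] 4
# Without parentheses the exponent is 1, and the result is then halved
9^1/2
## [1] 4.5
# sqrt() is equivalent to raising to the power 1/2
sqrt(9)
## [1] 3
```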

Vectors

A vector is an ordered collection of values of the same type, similar to a column in a spreadsheet. Its elements can be accessed by position (index).

# Vector
c(1, 2, 3, 4)
## [1] 1 2 3 4
c(1:5, 10)
## [1]  1  2  3  4  5 10
# Using seq to create a sequence
seq(from=1, to=10, by=2)
## [1] 1 3 5 7 9
seq(1, 10, by=2)
## [1] 1 3 5 7 9
seq(1, 10, length=11)
##  [1]  1.0  1.9  2.8  3.7  4.6  5.5  6.4  7.3  8.2  9.1 10.0
seq(1, 10)
##  [1]  1  2  3  4  5  6  7  8  9 10
# Creating sequences using :
1:5
## [1] 1 2 3 4 5
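
In the seq() calls above, length= is matched to the full argument name length.out. When a length is given, the step is (to - from) / (length.out - 1), which is why 11 points between 1 and 10 are spaced 0.9 apart. A small sketch with a different length (the variable name x is only for illustration):

```r
# length.out is the full name of the argument abbreviated as length above
x <- seq(1, 10, length.out = 4)
x
## [1]  1  4  7 10
# Step size implied by the endpoints and the number of points
(10 - 1) / (4 - 1)
## [1] 3
# seq_len(n) builds the sequence 1, 2, ..., n
seq_len(5)
## [1] 1 2 3 4 5
```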

Assignment and variables

Data of any type (e.g. sequences, integers, strings, data.frames) can be assigned to variables using the <- operator. Variables created in this way are stored in memory and can be operated on and referenced by calling the variable name.

# Assign a vector to variable a
a <- 1:5
a
## [1] 1 2 3 4 5
b <- seq(15, 3, length=5)
b
## [1] 15 12  9  6  3
# Multiply a and b element by element and store the result in c
c <- a * b
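
Arithmetic on vectors works element by element. The sketch below recreates a and b from above so that it runs on its own, and also shows a single number being recycled across a whole vector.

```r
a <- 1:5
b <- seq(15, 3, length.out = 5)
# Element-wise product: 1*15, 2*12, 3*9, 4*6, 5*3
a * b
## [1] 15 24 27 24 15
# A single number is recycled across the whole vector
a + 10
## [1] 11 12 13 14 15
```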

Basic functions

Base R has many functions available for performing routine tasks such as computing the mean of a vector, summing a vector, and several other tasks. Help documentation for a particular function or topic can be viewed by putting a ? in front of its name, e.g. ?sum or ?datasets.

a
## [1] 1 2 3 4 5
sum(a)
## [1] 15
mean(a)
## [1] 3
sd(a)
## [1] 1.581139
prod(a)
## [1] 120
var(a)
## [1] 2.5
min(a)
## [1] 1
max(a)
## [1] 5
median(a)
## [1] 3
range(a)
## [1] 1 5
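
To see what var() and sd() compute, the sketch below rebuilds both by hand for the same vector. It uses the sample formulas, which divide by n - 1 rather than n; the names n and v are only for illustration.

```r
a <- 1:5
n <- length(a)
# Sample variance: squared deviations from the mean, divided by n - 1
v <- sum((a - mean(a))^2) / (n - 1)
v
## [1] 2.5
# The standard deviation is the square root of the variance
sqrt(v)
## [1] 1.581139
# Minimum, quartiles, median, mean and maximum in one call
summary(a)
```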

Extracting subsets

Lists, vectors, data frames and matrices can all be subsetted using various techniques.

# Create a vector ranging from 0 to 100 by 10
a <- seq(0, 100, by=10)

# Index/subset the first element of the vector
a[1]
## [1] 0
# Index the first 3 elements of the vector
a[1:3]
## [1]  0 10 20
# Index the first and fourth elements
a[c(1,4)]
## [1]  0 30
# Reassign the value of the first element
a[1] <- 7

# Evaluate elements of a vector
a > 50
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
# Keep only the elements for which the condition evaluates to TRUE
a[a > 50]
## [1]  60  70  80  90 100
# which() returns the positions (indices) of the elements that satisfy the condition
which(a > 50)
## [1]  7  8  9 10 11
# Negate evaluation
!(a > 50)
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
# Select all elements NOT greater than 50
a[!(a > 50)]
## [1]  7 10 20 30 40 50
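
The positions returned by which() can be used as an index, which gives the same result as indexing with the logical vector directly. The sketch below also shows a negative index, which drops elements rather than selecting them; idx is a name chosen for illustration.

```r
a <- seq(0, 100, by = 10)
a[1] <- 7
# Indexing by the positions from which() matches logical indexing
idx <- which(a > 50)
identical(a[idx], a[a > 50])
## [1] TRUE
# A negative index drops elements instead of selecting them
a[-1]
##  [1]  10  20  30  40  50  60  70  80  90 100
# length() confirms how many elements remain
length(a[-1])
## [1] 10
```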

Comparison functions

Comparisons return a logical (TRUE/FALSE) vector when evaluated.

a
##  [1]   7  10  20  30  40  50  60  70  80  90 100
# Extract elements where expression evaluates to TRUE
a[(a == 55)]
## numeric(0)
a[(a != 55)]
##  [1]   7  10  20  30  40  50  60  70  80  90 100
a[(a > 50)]
## [1]  60  70  80  90 100
a[(a < 50)]
## [1]  7 10 20 30 40
a[(a >= 50)]
## [1]  50  60  70  80  90 100
a[(a <= 50)]
## [1]  7 10 20 30 40 50
# Set membership: test whether each value on the left appears in a
c(10, 14, 40, 60, 99) %in% a
## [1]  TRUE FALSE  TRUE  TRUE FALSE
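
Because a comparison produces a logical vector, it can be summarized as well as used for subsetting. A brief sketch, recreating a so that it runs on its own:

```r
a <- c(7, seq(10, 100, by = 10))
# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(a > 50)
## [1] 5
# any() and all() collapse a logical vector to a single value
any(a == 55)
## [1] FALSE
all(a != 55)
## [1] TRUE
```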

Boolean operators

Boolean operators combine TRUE/FALSE values element by element and return TRUE/FALSE values.

a
##  [1]   7  10  20  30  40  50  60  70  80  90 100
# Subset values within a certain range using and
a[(a > 50) & (a <= 90)]
## [1] 60 70 80 90
# Subset values that satisfy either condition using or
a[(a < 50) | (a > 100)]
## [1]  7 10 20 30 40
a[(a < 50) | !(a > 100)]
##  [1]   7  10  20  30  40  50  60  70  80  90 100
a[(a >= 50) & !(a <= 90)]
## [1] 100
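
Negation interacts with and/or in a predictable way: negating an "and" gives the "or" of the negated parts. The sketch below checks this on the range used above; inside and outside are names chosen for illustration.

```r
a <- c(7, seq(10, 100, by = 10))
inside <- (a > 50) & (a <= 90)
# Negating an "and" is the same as "or"-ing the negated parts
outside <- (a <= 50) | (a > 90)
identical(!inside, outside)
## [1] TRUE
a[outside]
## [1]   7  10  20  30  40  50 100
```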

Missing values

NA (not available) indicates that a value is missing. Most calculations involving NA return NA by default.

NA + 8
## [1] NA
3 * NA
## [1] NA
mean(c(1, 2, NA))
## [1] NA
# Some functions have the ability to ignore NA
mean(c(NA, 1, 2), na.rm=TRUE)
## [1] 1.5
sum(c(NA, 1, 2))
## [1] NA
sum(c(NA, 1, 2), na.rm=TRUE)
## [1] 3
# Evaluating NA
a <- c(NA, 1:5, NA)
a
## [1] NA  1  2  3  4  5 NA
is.na(a)
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
!is.na(a)
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
# Subset all values that are not NA
a[!is.na(a)]
## [1] 1 2 3 4 5
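
Missing values have to be detected with is.na() because a comparison with NA is itself NA. The same logical vector can be used to count the missing values, as in this short sketch:

```r
a <- c(NA, 1:5, NA)
# Comparing with == cannot detect NA: the result is itself NA
NA == NA
## [1] NA
# Count the missing values
sum(is.na(a))
## [1] 2
# Proportion of missing values
mean(is.na(a))
## [1] 0.2857143
```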

Plotting with ggplot2

This section introduces ggplot2 and some of its capabilities. ggplot2 takes a data.frame as input; you then define plot layers that stack on top of each other, and within each layer, columns of the data are mapped to aesthetics (position, color, size, opacity). In this way, a simple set of commands can be combined to produce very informative displays.

In the example that follows, we consider a dataset mpg consisting of fuel economy data from 1999 and 2008 for 38 popular models of car.

# Install ggplot2 if it is not already installed
if(!require(ggplot2)){install.packages("ggplot2")}
## Loading required package: ggplot2
# Load the ggplot2 package
library(ggplot2)

# The mpg dataset ships with ggplot2 and is available once the package is loaded
# Let's take a look at the first few rows of data using head()
head(mpg)
##   manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
# Inspect the data type of each column
str(mpg)
## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
# Compute summary statistics on each column/data type
summary(mpg)
##      manufacturer                 model         displ            year     
##  dodge     :37    caravan 2wd        : 11   Min.   :1.600   Min.   :1999  
##  toyota    :34    ram 1500 pickup 4wd: 10   1st Qu.:2.400   1st Qu.:1999  
##  volkswagen:27    civic              :  9   Median :3.300   Median :2004  
##  ford      :25    dakota pickup 4wd  :  9   Mean   :3.472   Mean   :2004  
##  chevrolet :19    jetta              :  9   3rd Qu.:4.600   3rd Qu.:2008  
##  audi      :18    mustang            :  9   Max.   :7.000   Max.   :2008  
##  (Other)   :74    (Other)            :177                                 
##       cyl               trans    drv          cty             hwy       
##  Min.   :4.000   auto(l4)  :83   4:103   Min.   : 9.00   Min.   :12.00  
##  1st Qu.:4.000   manual(m5):58   f:106   1st Qu.:14.00   1st Qu.:18.00  
##  Median :6.000   auto(l5)  :39   r: 25   Median :17.00   Median :24.00  
##  Mean   :5.889   manual(m6):19           Mean   :16.86   Mean   :23.44  
##  3rd Qu.:8.000   auto(s6)  :16           3rd Qu.:19.00   3rd Qu.:27.00  
##  Max.   :8.000   auto(l6)  : 6           Max.   :35.00   Max.   :44.00  
##                  (Other)   :13                                          
##  fl             class   
##  c:  1   2seater   : 5  
##  d:  5   compact   :47  
##  e:  8   midsize   :41  
##  p: 52   minivan   :11  
##  r:168   pickup    :33  
##          subcompact:35  
##          suv       :62
# Plot hwy mpg against engine displacement adding titles and axis labels
ggplot(mpg) + geom_point(aes(x=displ, y=hwy)) + labs(title="Highway MPG vs Displacement") +
  xlab("Engine displacement (liters)") + ylab("Highway MPG")
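
As an alternative sketch, the title and both axis labels can be set in a single labs() call instead of separate xlab() and ylab() calls. The plot is stored in a variable (p, a name chosen for illustration) and drawn by printing it.

```r
library(ggplot2)

# Same plot as above, with the title and both axis labels set in one labs() call
p <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  labs(
    title = "Highway MPG vs Displacement",
    x = "Engine displacement (liters)",
    y = "Highway MPG"
  )
p
```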

The ggplot() function sets up the data frame to be used, while the geom_point() function specifies the type of plot (i.e. the geom) to draw. The aes() function inside geom_point() maps plot attributes (e.g. the \(x\) and \(y\) values) to columns within the data frame. We can map other plot attributes with aes() as well; for example, we can map the color of each point to the class of vehicle.

ggplot(mpg) + geom_point(aes(x=displ, y=hwy, color=class)) + labs(title="Highway MPG vs Displacement") +
  xlab("Engine displacement (liters)") + ylab("Highway MPG")
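
It helps to distinguish mapping an attribute to a column from setting it to a fixed value. In the sketch below (variable names are illustrative), color placed inside aes() varies with the data and produces a legend, while color placed outside aes() applies one constant value to every point.

```r
library(ggplot2)

# Mapping: color depends on a column, so it goes inside aes() and gets a legend
p_mapped <- ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = class))

# Setting: one fixed color for every point, so it goes outside aes()
p_fixed <- ggplot(mpg) + geom_point(aes(x = displ, y = hwy), color = "blue")

p_mapped
p_fixed
```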

We can go a step further and map the size of each point to the number of cylinders and the shape of each point to the drive type (f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive). In this way, we can encode a large amount of information (dimensions) into each plot (although this is not always desirable).

ggplot(mpg) + geom_point(aes(x=displ, y=hwy, color=class, shape=drv, size=cyl)) +
  labs(title="Highway MPG vs Displacement") + xlab("Engine displacement (liters)") +
  ylab("Highway MPG")

Faceting

Small multiples (also called facets, trellis charts, lattice charts, grid charts, or panel charts) are a series or grid of small, similar graphics. They are useful when we want to stratify the data and compare groups to one another. Each facet typically shows a different subset of the data, which makes faceting well suited to exploring conditional relationships.

# Start by creating a basic scatterplot
p <- ggplot(mpg) + geom_point(aes(x=displ, y=hwy))

p1 <- p + facet_grid(. ~ cyl)   # Columns are cyl categories
p2 <- p + facet_grid(drv ~ .)   # Rows are drv categories
p3 <- p + facet_grid(drv ~ cyl) # Rows are drv categories, columns are cyl categories
p4 <- p + facet_wrap(~ class)   # Wrap plots by class category

# Plot all plots in one figure
library(gridExtra)
## Loading required package: grid
grid.arrange(p1, p2, p3, p4, ncol=2)
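
facet_wrap() also accepts arguments that control the panel layout and whether panels share axes. The sketch below uses nrow and scales; the variable names are illustrative. Free scales make each panel easier to read but make comparisons across panels harder, so whether to use them is a judgment call.

```r
library(ggplot2)

p <- ggplot(mpg) + geom_point(aes(x = displ, y = hwy))

# Lay the class panels out in two rows
p_rows <- p + facet_wrap(~ class, nrow = 2)

# Let each panel choose its own y-axis range
p_free <- p + facet_wrap(~ class, scales = "free_y")

p_rows
p_free
```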

Improving plots

Sometimes plots can be enhanced by adding noise, for example when plotting categorical data or large numbers of data points. Because categorical (or integer-valued) variables take on only a finite number of values, many points land on the same coordinates and obscure one another. One solution is to add just enough noise that the data points separate while still maintaining their categorical relationships.

# Overlapping data points
ggplot(mpg) + geom_point(aes(x=cty, y=hwy))

In the above plot, some of the data overlap. We can avoid this issue by adding jitter to each point and changing the transparency (alpha) of each point so that hidden data points are revealed.

ggplot(mpg) + geom_point(aes(x=cty, y=hwy), position = "jitter", alpha=0.5)
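
The amount of overlap can be checked numerically, and the jitter can be made explicit and reproducible. The following is a sketch only: the width, height and seed values are arbitrary choices, and n_overlap and p_jit are illustrative names.

```r
library(ggplot2)

# Count rows that share a (cty, hwy) pair with an earlier row
n_overlap <- sum(duplicated(mpg[, c("cty", "hwy")]))
n_overlap

# Control the amount of noise and fix the seed so the plot is reproducible
p_jit <- ggplot(mpg) +
  geom_point(
    aes(x = cty, y = hwy),
    position = position_jitter(width = 0.3, height = 0.3, seed = 1),
    alpha = 0.5
  )
p_jit
```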

We can also improve plots by rearranging the order in which data appear.

ggplot(mpg) + geom_point(aes(x=class, y=hwy))

Suppose we want to order the vehicle classes by ascending highway MPG.

ggplot(mpg) + geom_point(aes(x=reorder(class, hwy), y=hwy))
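
To see what reorder() does outside of a plot, the sketch below computes the mean highway MPG per class and compares it with the level order of the reordered factor. The names class_means and ordered_class are illustrative.

```r
library(ggplot2)

# Mean highway MPG per class
class_means <- tapply(mpg$hwy, mpg$class, mean)

# reorder() returns a factor whose levels are sorted by that statistic
ordered_class <- reorder(mpg$class, mpg$hwy)
levels(ordered_class)

# The means, taken in level order, are non-decreasing
class_means[levels(ordered_class)]
```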

Great, we reordered the vehicle classes by highway MPG, but a lot of the data points are still obscured. Let's try adding jitter and changing the transparency.

ggplot(mpg) + geom_point(aes(x=reorder(class, hwy), y=hwy), position="jitter", alpha=0.5)

Suppose we are more interested in comparing the distribution of highway MPG between vehicle classes. Boxplots would be more appropriate.

ggplot(mpg) + geom_boxplot(aes(x=reorder(class, hwy), y=hwy))

That’s good, but we would like to see how the data points are distributed within the quartiles. Let's add a geom_point() layer to show this.

# Specify how much jitter you would like to add
ggplot(mpg) + geom_boxplot(aes(x=reorder(class, hwy), y=hwy)) +
  geom_point(aes(x=reorder(class, hwy), hwy), alpha=0.5, position=position_jitter(width=0.1)) +
  labs(title="Vehicle class vs highway MPG") + xlab("Vehicle class") + ylab("Highway MPG")

In the above examples we reordered the vehicle classes by mean highway MPG (reorder() uses the mean by default); however, we can just as easily reorder by another statistic by specifying which FUN to use.

# Order vehicle class by median MPG
# The same FUN is used in both layers so the boxes and points share one ordering
ggplot(mpg) + geom_boxplot(aes(x=reorder(class, hwy, FUN=median), y=hwy), alpha=0.5) +
  geom_point(aes(x=reorder(class, hwy, FUN=median), y=hwy), alpha=0.5, position=position_jitter(width=0.1)) +
  labs(title="Ordered by median MPG")

# Order vehicle class by the MPG standard deviation
# The same FUN is used in both layers so the boxes and points share one ordering
ggplot(mpg) + geom_boxplot(aes(x=reorder(class, hwy, FUN=sd), y=hwy), alpha=0.5) +
  geom_point(aes(x=reorder(class, hwy, FUN=sd), y=hwy), alpha=0.5, position=position_jitter(width=0.1)) +
  labs(title="Ordered by MPG standard deviation")
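
When several layers use the same mapping, it can be declared once in ggplot() and every layer inherits it. The sketch below does this for the median ordering. Hiding the boxplot's own outlier markers with outlier.shape = NA is an optional choice made here because every point is already drawn by geom_point(); p_med is an illustrative name.

```r
library(ggplot2)

# A mapping declared once in ggplot() is inherited by every layer,
# so the boxes and the points share one ordering
p_med <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  geom_point(alpha = 0.5, position = position_jitter(width = 0.1)) +
  labs(title = "Ordered by median MPG", x = "Vehicle class", y = "Highway MPG")
p_med
```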