Disclaimer: These notes are based on Advanced Data Analysis by Erick B. Erhardt.

Inference for a population mean

The population mean \(\mu\) is inferred from the sample mean \(\bar{Y}\). If we were to draw another sample from the population and recalculate the mean (\(\bar{Y}\)), we would likely end up with a slightly different estimate of the population mean (\(\mu\)). If we continued to do this we would obtain a distribution of estimates for \(\mu\). The standard deviation of this distribution is known as the standard error of the mean: the average deviation of our estimates from the true population mean (\(\mu\)). Since we don't often draw hundreds of samples from the population, we estimate the standard error of the mean from a single sample using the following equation:

\[SE_\bar{Y} = \frac{s}{\sqrt{n}}\]

Let’s do a quick experiment to show how this works.

n <- 100     # sample size
N <- 10000   # population size
population <- rnorm(N, mean=50, sd=20)

# Take a sample of size n from the population
samp <- sample(population, n)
# Standard deviation of the sample
sd(samp)
## [1] 20.99045
# Estimate standard error of the mean (SEM) using equation
sd(samp)/sqrt(length(samp))
## [1] 2.099045
# Draw multiple (1000) samples from the population and compute the standard error of the
# mean directly
samps <- matrix(sample(population, n*1000, replace=TRUE), ncol=1000)
samp.means <- colMeans(samps)

# Show standard error 
sd(samp.means)
## [1] 2.026521
hist(samp.means, freq=FALSE,
     main = paste0("Distribution of sample means, sd=",
                   round(sd(samp.means), digits = 2)))
rug(samp.means)
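
As a sanity check, the true standard error can be computed from the population itself, since we simulated it; both estimates above should be close to this value:

# True SE of the mean: population SD over sqrt(n), roughly 20/sqrt(100) = 2
sd(population)/sqrt(n)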

t-distribution

The Student's t-distribution is a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown. Below, the t-distribution is plotted (black) for 1 and 5 degrees of freedom alongside the standard normal distribution (red).

# Standard normal for reference (red)
x <- seq(-8, 8, length=1000)
plot(x, dnorm(x), type="l", col="red")
# t-distributions with 1 and 5 degrees of freedom (black)
points(x, dt(x, 1), type="l")
points(x, dt(x, 5), type="l")
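
Adding a higher-df curve and a legend (a small addition to the original plot) makes the convergence to the normal explicit:

# With more degrees of freedom the t-distribution approaches the normal
points(x, dt(x, 30), type="l", col="blue")
legend("topright", legend=c("normal", "t, df=1", "t, df=5", "t, df=30"),
       col=c("red", "black", "black", "blue"), lty=1)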

Confidence interval

A confidence interval is a range of values within which you expect to find the true population mean (\(\mu\)) a stated percentage of the time. For example, if you were to re-conduct the experiment 100 times, a 95% confidence interval would contain the true population mean in roughly 95 of the 100 experiments. The proportion of the time the confidence interval does not contain the population mean is known as the error rate and is referred to as \(\alpha\). For example, \(\alpha=0.05\) gives a 95% confidence interval with a 5% error rate.

The upper value of the confidence interval is computed by \[U=\bar{Y} + t_{crit} \times SE_{\bar{Y}}\] and the lower by \[L=\bar{Y} - t_{crit} \times SE_{\bar{Y}}\]

\(t_{crit}\) is usually found from a table or computed with statistical software.
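
In R, for example, the two-sided critical value comes from qt(); a minimal sketch for \(\alpha=0.05\) and 10 degrees of freedom:

# Two-sided: put alpha/2 in each tail
alpha <- 0.05
qt(1 - alpha/2, df = 10)
## [1] 2.228139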

Assumptions

There are two assumptions that must be met in order to accurately infer population parameters from sample statistics. The first is that the sample must be drawn randomly from the population (i.e. a random sample). The second is that the population frequency curve is normal. There are a variety of ways to assess the second assumption of normality. One of them is the bootstrap method.

The bootstrap method is as follows: given a sample S of size n, repeatedly draw resamples of size n from S with replacement. Calculate the mean of each resample. Plot the resulting distribution of resample means and assess its normality visually.

bootstrap.norm.assessment <- function(samp, N=1e4){
  # Draw N resamples of size n from the sample (with replacement)
  n <- length(samp)
  sub.samps <- matrix(sample(samp, n*N, replace=TRUE), ncol=N)
  samp.means <- colMeans(sub.samps)
  # Plot sample means
  hist(samp.means, freq=FALSE, main="Assessment of normality by bootstrapping")
  points(density(samp.means), type="l")
  
  # Overlay the normal distribution implied by the sample mean and SEM
  
  x <- seq(min(samp.means), max(samp.means), length=1000)
  points(x, dnorm(x, mean=mean(samp), sd=(sd(samp)/sqrt(length(samp)))), type="l", col="red")
  rug(samp.means)
}

# Test the bootstrap function on a small, strongly right-skewed sample
skewed.samp <- rgamma(10, shape=0.5, scale=20)
bootstrap.norm.assessment(skewed.samp)
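
For contrast, running the same assessment on a sample drawn from a normal population (an illustrative addition; any normal sample will do) should produce a histogram that tracks the red normal curve much more closely:

# A normal sample of the same small size, for comparison
normal.samp <- rnorm(10, mean=50, sd=20)
bootstrap.norm.assessment(normal.samp)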

Example: Age at First Heart Transplant

We are interested in calculating the mean age at first heart transplant from a population of patients.

  1. Define the population parameter: \(\mu\) is the mean age at first heart transplant for the population of patients.
  2. Calculate summary statistics from sample: n, \(\bar{Y}\), s, \(SE_{\bar{Y}}\), df, \(t_{crit}\)
  3. Calculate 95% confidence intervals: \(U = \bar{Y} + t_{crit} \times SE_{\bar{Y}}\) and \(L = \bar{Y} - t_{crit}\times SE_{\bar{Y}}\). From the output below we see that the lower limit of the CI is 45.7 years and the upper limit is 56.8 years.
  4. Summarize in words: We are 95% confident that the population mean (\(\mu\)) age at first transplant is 51.3 \(\pm\) 5.55, that is, between 45.7 and 56.8 years of age.
  5. Check assumptions: Check the normality assumption using bootstrap or other method.
# Here's the data (ages for a sample)
(ages <- c(54, 42, 51, 54, 49, 56, 33, 58, 54, 64, 49))
##  [1] 54 42 51 54 49 56 33 58 54 64 49
# Calculate n
(n <- length(ages))
## [1] 11
# Calculate mean
(y <- mean(ages))
## [1] 51.27273
# Calculate standard deviation
(s <- sd(ages))
## [1] 8.25943
# Calculate standard error of the mean (SEM)
(SEM <- s/sqrt(n))
## [1] 2.490312
# Calculate degrees of freedom
(df <- n - 1)
## [1] 10
# Calculate t-crit
alpha <- 0.05
(t.crit <- qt(1 - alpha/2, df = df))
## [1] 2.228139
# Calculate upper CI
(U <- y + t.crit*SEM)
## [1] 56.82149
# Calculate lower CI
(L <- y - t.crit*SEM)
## [1] 45.72397
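
The same interval can be obtained in a single step from t.test(), whose default confidence level is 95%:

# Extract the 95% CI directly from t.test()
t.test(ages)$conf.int
## [1] 45.72397 56.82149
## attr(,"conf.level")
## [1] 0.95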

Check normality assumption

bootstrap.norm.assessment(ages)

Hypothesis testing with one-sample t-test

In a one-sample t-test we are interested in testing whether the population mean (\(\mu\)) is different from some prespecified value (\(\mu_{o}\)). For example, suppose we are interested in testing whether or not the mean age at first heart transplant is different from the hypothesized value of \(\mu_{o}=50\).

First we frame our hypotheses: the null hypothesis states that there is no difference between the population mean and the prespecified value (\(H_{o}: \mu = \mu_{o}\)). The alternative hypothesis states that there is a difference (\(H_{A}: \mu \ne \mu_{o}\)). Before beginning our experiment we must select the significance level (\(\alpha\)), which sets the error rate. Typically we use an \(\alpha\) of 0.05 to test at the 95% confidence level. Next, we calculate the test statistic

\[t_{s} = \frac{\bar{Y} - \mu_{o}}{SE_{\bar{Y}}}\]

Once we have calculated \(t_{s}\) we can determine the probability of obtaining a value at least as extreme from a t-distribution with \(n-1\) degrees of freedom. This probability is known as the P-value. If the P-value is less than \(\alpha\) we reject the null hypothesis in favor of the alternative.

The assumptions of the t-test are that the data are from a random sample and that the population frequency curve is normal.
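
Before reaching for t.test(), it can be instructive to compute the statistic by hand from the summary values above (mu0 and t.s are names introduced here for illustration); the results should match the t.test() output that follows:

# Hand computation of the test statistic and p-value
mu0 <- 50
(t.s <- (y - mu0)/SEM)    # ~0.511, matching the t reported below
# Two-sided p-value: probability of a t at least this extreme under the null
2*pt(-abs(t.s), df = df)  # ~0.62, matching the p-value reported below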

# Test whether the population age at first transplant is different from 50 years
(t.summary <- t.test(ages, mu=50))
## 
##  One Sample t-test
## 
## data:  ages
## t = 0.5111, df = 10, p-value = 0.6204
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  45.72397 56.82149
## sample estimates:
## mean of x 
##  51.27273
# Assess normality of sample data
bootstrap.norm.assessment(ages)

The output from t.test() gives exactly the same CI that we calculated previously. We note that the hypothesized mean of 50 is included in this interval, and the t value produced is low (0.51) compared to the \(t_{crit} = 2.23\) needed to reach significance. As a result the p-value is greater than \(\alpha\), suggesting that there is no statistically significant difference between \(\mu\) and \(\mu_{o}\).

The normality assumption seems reasonable based on the output from the bootstrap method.

Example: Meteorites

Suppose we have a predicted cooling rate of \(\mu=0.54\) degrees per million years for a meteorite and a series of observed cooling rates. We are interested in testing whether or not the observed cooling rates support the theorized cooling rate of 0.54 degrees/million years. To test this, we will use a one sample t-test.

# Plot sample data
obs.cooling.rates <- c(5.6, 2.7, 6.2, 2.9, 1.5, 4.0, 4.3, 3.0, 3.6, 2.2, 6.7, 3.8)
hist(obs.cooling.rates, freq=FALSE)
points(density(obs.cooling.rates), type="l")
rug(obs.cooling.rates)

# Run a t-test to test the hypotheses of mu = 0.54
(t.summary <- t.test(obs.cooling.rates, mu=0.54))
## 
##  One Sample t-test
## 
## data:  obs.cooling.rates
## t = 7.2176, df = 11, p-value = 1.714e-05
## alternative hypothesis: true mean is not equal to 0.54
## 95 percent confidence interval:
##  2.858002 4.891998
## sample estimates:
## mean of x 
##     3.875
# Assess normality
bootstrap.norm.assessment(obs.cooling.rates)

From the results above we can see that the hypothesis of \(\mu_{o}=0.54\) is not supported by the data. The observed mean is 3.88 degrees/million years with a CI of 2.86 - 4.89 and a p-value of \(1.7 \times 10^{-5}\), suggesting the observed difference is statistically significant.

Example: Weights of canned tomatoes

All of the t-tests we have conducted up to this point have been two-sided: we do not care whether the population mean \(\mu\) is less than or greater than the hypothesized mean \(\mu_{o}\), so the rejection region lies on both sides of the t-distribution, with 2.5% in each tail for a test at the 95% confidence level. With a one-sided t-test, however, we are only interested in whether the population mean is greater than, or only whether it is less than, the hypothesized mean. In a lower one-sided test the alternative hypothesis states that the population mean is less than the hypothesized value (\(H_{A}: \mu < \mu_{o}\), with \(H_{o}: \mu \ge \mu_{o}\)), and the entire rejection region sits in the lower tail. In this next example we are interested in testing whether the contents of a sample of tomato cans are less than the amount reported by the supplier, who claims to supply 20 oz of tomatoes per can.
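
As a quick illustration (using the 13 degrees of freedom of the tomato data below), the one-sided critical value comes from qt() with all of \(\alpha\) placed in the lower tail:

# Lower one-sided critical value at alpha = 0.05 with df = 13
alpha <- 0.05
qt(alpha, df = 13)
## [1] -1.770933

We reject \(H_{o}\) only if the observed \(t_{s}\) falls below this value; as the output below shows, \(t = -0.93\) does not.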

tomato <- c(20.5, 18.5, 20.0, 19.5, 19.5, 21.0, 17.5, 22.5, 20.0, 19.5, 18.5, 20.0, 18.0, 20.5)

# Plot sample data
hist(tomato, freq=FALSE)
points(density(tomato), type="l")
rug(tomato)

# One-sided t-test
(t.summary <- t.test(tomato, mu=20, alternative="less"))
## 
##  One Sample t-test
## 
## data:  tomato
## t = -0.9287, df = 13, p-value = 0.185
## alternative hypothesis: true mean is less than 20
## 95 percent confidence interval:
##      -Inf 20.29153
## sample estimates:
## mean of x 
##  19.67857
bootstrap.norm.assessment(tomato)

According to the bootstrap method the normality assumption seems to hold. Notice the p-value is 0.185, suggesting there is no statistically significant evidence that the cans contain less than 20 oz. Also notice the CI is reported with a lower bound of -Inf and an upper bound of 20.29, which includes the \(\mu_{o}\) of 20 oz.
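
As a final check, the one-sided p-value is just the lower-tail area of the t-distribution at the observed statistic, recoverable from the stored result:

# Lower-tail area at the observed t; should match the reported p-value of 0.185
pt(t.summary$statistic, df = 13)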