# Chapter 5

## Statistical Inference

Statistical inference is the process of drawing conclusions about an entire population based on information from a sample.

### Parameter vs. Statistic

A parameter is a number that summarizes data from an entire population.

A statistic is a number that summarizes data from a sample.
| | parameter | statistic |
|-|-|-|
| mean | $\mu$ | $\bar{x}$ |
| standard deviation | $\sigma$ | $s$ |
| variance | $\sigma^2$ | $s^2$ |
### Example

Suppose you were interested in the number of hours that Rowan students spend studying on Sundays. You take a random sample of $n = 100$ students, and the average time they study on Sunday is $\bar{x} = 3.2$ [hrs].

We use $\bar{x} = 3.2$ [hrs] as our best estimate for $\mu$.
### Variability of Sample Statistics

We normally think of a parameter as a fixed value. Sample statistics vary from sample to sample.
### Sampling Distribution

A sampling distribution is the distribution of sample statistics computed for different samples of the same size from the same population.

The mean of the sample means is $\mu$. For a random sample of size $n$, the variance of the sample mean is:

$$\text{var}(\bar{x}) = {\sigma^2 \over n}$$

so the standard error of $\bar{x}$ is ${\sigma \over \sqrt{n}}$.
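A quick simulation can verify this (a minimal sketch, using a chi-square population with $\text{df} = 10$, so $\sigma^2 = 20$): the variance of many sample means comes out close to $\sigma^2 / n$.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100          # sample size
sigma2 = 20      # variance of a chi-square population with df = 10

# means of 10,000 different samples of size n
sample_means = rng.chisquare(df=10, size=(10_000, n)).mean(axis=1)

print(np.var(sample_means))  # close to sigma2 / n
print(sigma2 / n)            # 0.2
```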
### Central Limit Theorem

If $\bar{x}$ is the mean of a random sample of size $n$ taken from a population with mean $\mu$ and finite variance $\sigma^2$, then the limiting form of the distribution of

$$z = {\sqrt{n} (\bar{x} - \mu )\over \sigma}$$

as $n \to \infty$ is the standard normal distribution. This approximation generally holds for $n \ge 30$. If $n < 30$, the approximation is good so long as the population is not too different from a normal distribution.
### Unbiased Estimator

A statistic, $\hat{\theta}$, is said to be an unbiased estimator of the parameter, $\theta$, if:

$$E[\hat{\theta}] = \theta$$

or

$$E[\hat{\theta} - \theta] = 0$$

The mean:

$$\bar{x} = {1\over n} \sum_{i=1}^{n} x_i$$

is an unbiased estimator of $\mu$.
Proof:

$$E[\bar{x}] = E\left[ {1\over n} \sum_{i=1}^n x_i\right]$$

$$= {1\over n} E[x_1 + x_2 + x_3 + \cdots + x_n]$$

$$= {1\over n} \left[ E[x_1] + E[x_2] + \cdots + E[x_n]\right]$$

$$= {1\over n} [\mu + \mu + \cdots + \mu]$$

$$= {1\over n} [n\mu] = \mu$$
### Confidence Interval for $\mu$ if $\sigma$ is known:

If our sample size is "large", then the CLT tells us that:

$${\sqrt{n} (\bar{x} - \mu) \over \sigma} \sim N(0,1) \text{ as } n \to \infty$$

$$1 - \alpha = P\left(-z_{\alpha \over 2} \le {\bar{x} - \mu \over \sigma/\sqrt{n}} \le z_{\alpha \over 2}\right)$$

A $100(1 - \alpha)\%$ confidence interval for $\mu$ is:

$$\bar{x} \pm z_{\alpha \over 2} {\sigma \over \sqrt{n}}$$
90% CI: $z_{\alpha \over 2} = 1.645$

95% CI: $z_{\alpha \over 2} = 1.96$

99% CI: $z_{\alpha \over 2} = 2.576$
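These critical values come from the standard normal quantile function; a minimal sketch with scipy:

```python
from scipy import stats

for conf_level in (0.90, 0.95, 0.99):
    alpha = 1 - conf_level
    z = stats.norm.ppf(1 - alpha / 2)   # z_{alpha/2}
    print(f"{conf_level:.0%} CI: z = {z:.3f}")
# 90% CI: z = 1.645
# 95% CI: z = 1.960
# 99% CI: z = 2.576
```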
### Example

In a random sample of 75 Rowan students, the sample mean height was 67 inches. Suppose the population standard deviation is known to be $\sigma = 7$ inches. Construct a 95% confidence interval for the mean height of *all* Rowan students.

$$\bar{x} \pm z_{\alpha \over 2} {\sigma \over \sqrt{n}}$$

$$\bar{x} = 67$$

$$z_{\alpha \over 2} = 1.96$$

$$\sigma = 7$$

$$n = 75$$

A 95% CI for $\mu$:

$$67 \pm 1.96 \left({7\over\sqrt{75}}\right) = (65.4, 68.6)$$
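The same interval in a few lines of Python (a sketch using the numbers above):

```python
import numpy as np
from scipy import stats

x_bar, sigma, n = 67, 7, 75
z = stats.norm.ppf(0.975)              # 1.96 for a 95% CI
margin = z * sigma / np.sqrt(n)

print(x_bar - margin, x_bar + margin)  # roughly (65.4, 68.6)
```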
#### Interpretation

We are 95% confident that the mean height of all Rowan students is somewhere between 65.4 and 68.6 inches.

From the sample, we found that $\bar{x} = 67$ inches. Using the confidence interval, we are saying that we are 95% confident that $\mu$ is somewhere between 65.4 and 68.6 inches.

A limitation of the $z$ confidence interval is that $\sigma$ is unlikely to be known.
### Confidence interval for $\mu$ if $\sigma$ is unknown:

If $\sigma$ is unknown, we estimate the standard error, ${\sigma \over \sqrt{n}}$, as ${s \over \sqrt{n}}$.

When we estimate the standard error, the statistic no longer follows a normal distribution. Instead,

$${\bar{x} - \mu \over {s \over \sqrt{n}}}$$

follows a t-distribution with $n-1$ degrees of freedom.

A $100(1 - \alpha)\%$ confidence interval for $\mu$ when $\sigma$ is unknown is:

$$\bar{x} \pm t^* {s\over \sqrt{n}}$$

Where $t^*$ is a critical value chosen from the t-distribution with $n-1$ degrees of freedom. $t^*$ varies based on sample size and desired confidence level.
### Example

A research engineer for a tire manufacturer is investigating tire life for a new rubber compound and has built 115 tires and tested them to end-of-life in a road test. The sample mean and standard deviation are 60139.7 and 3645.94 kilometers, respectively.

Find a 90% confidence interval for the mean life of all such tires.

$$\bar{x} \pm t^* {s\over\sqrt{n}}$$

$$\bar{x} = 60139.7$$

$$s = 3645.94$$

$$n = 115$$

$$t^* = \texttt{t\_crit\_value(115, 0.90)} = 1.658$$

$$60139.7 \pm 1.658 {3645.94 \over \sqrt{115}} = (59576.0, 60703.4)$$
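`t_crit_value` above is course pseudocode; with scipy, the critical value and interval might look like this (a sketch using the summary statistics above):

```python
import numpy as np
from scipy import stats

x_bar, s, n = 60139.7, 3645.94, 115
t_star = stats.t.ppf(1 - 0.10 / 2, df=n - 1)   # ~1.658 for a 90% CI
margin = t_star * s / np.sqrt(n)

print(x_bar - margin, x_bar + margin)
```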
### Width of a Confidence Interval

$$\bar{x} \pm t_{\alpha \over 2} {s \over \sqrt{n}}$$

As sample size increases, the width of the confidence interval decreases, and $\bar{x}$ becomes a better approximation of $\mu$:

$$\lim_{n\to\infty} {s \over \sqrt{n}} = 0$$

$$\lim_{n\to\infty} P(|\bar{x} - \mu| < \varepsilon) = 1$$

Where $\varepsilon > 0$.
### One-Sided Confidence Intervals

A one-sided $100(1-\alpha)\%$ confidence interval for $\mu$ (here, an upper bound) uses $t_\alpha$ rather than $t_{\alpha \over 2}$:

$$\left(-\infty, \bar{x} + t_\alpha {s \over \sqrt{n}}\right)$$
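For example, an upper 95% confidence bound (a sketch, reusing the tire numbers from above):

```python
import numpy as np
from scipy import stats

x_bar, s, n = 60139.7, 3645.94, 115
t_a = stats.t.ppf(1 - 0.05, df=n - 1)   # one-sided critical value t_alpha

print(x_bar + t_a * s / np.sqrt(n))     # upper bound of (-inf, ub)
```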
### Confidence Intervals in Python

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

conf_intervals = []
iterations = 100

def tvalue(sample_size, conf_level):
    # two-sided critical value from the t-distribution with n - 1 df
    return stats.t.ppf(1 - (1 - conf_level)/2, sample_size - 1)

for i in range(iterations):
    # sample from a chi-square population with df=10, so mu = 10
    sample = np.random.chisquare(df=10, size=100)
    sample_mean = np.mean(sample)
    std = np.std(sample, ddof=1)   # sample standard deviation
    t_value = tvalue(100, .95)
    lb = sample_mean - t_value*(std / np.sqrt(100))
    ub = sample_mean + t_value*(std / np.sqrt(100))
    conf_intervals.append((lb, ub))

plt.figure(figsize=(15, 5))

# intervals that miss the true mean (10) are drawn in red
for j, (lb, ub) in enumerate(conf_intervals):
    if 10 < lb or 10 > ub:
        plt.plot([j, j], [lb, ub], 'o-', color='red')
    else:
        plt.plot([j, j], [lb, ub], 'o-', color='green')

plt.show()
```


## Hypothesis Testing

Many problems require that we decide whether to accept or reject a statement about some parameter.

##### Hypothesis

A claim that we want to test or investigate.

##### Hypothesis Test

A statistical test that is used to determine whether results from a sample are convincing enough to allow us to conclude something about the population.

We use sample evidence to back up claims about a population.
##### Null Hypothesis

The claim that there is no effect or no difference $(H_0)$.

##### Alternative Hypothesis

The claim for which we seek evidence $(H_a)$.

#### Using $H_0$ and $H_a$

Does the average Rowan student spend more than \$300 each semester on books?

In a sample of 226 Rowan students, the mean cost of a student's textbooks was \$344 with a standard deviation of \$106.

$H_0$: $\mu = 300$.

$H_a$: $\mu > 300$.

$H_0$ and $H_a$ are statements about population parameters, not sample statistics.

In general, the null hypothesis is a statement of equality $(=)$, while the alternative hypothesis is a statement of inequality $(<, >, \ne)$.
#### Possible outcomes of a hypothesis test

1. Reject the null hypothesis
    - Rejecting $H_0$ means we have enough evidence to support the alternative hypothesis
1. Fail to reject the null hypothesis
    - Not enough evidence to support the alternative hypothesis
### Assessing Whether the Sample Data Support the Claim

If we assume that the null hypothesis is true, what is the probability of observing sample data as extreme as or more extreme than what we observed?

In the Rowan example, we found that $\bar{x} = 344$.
### One-Sample T-Test for a Mean

To test a hypothesis regarding a single mean, there are two main parametric options:

1. z-test
1. t-test

The z-test requires knowledge of the population standard deviation. Since $\sigma$ is unlikely to be known, we will use a t-test.

To test $H_0$: $\mu = \mu_0$ against its alternative $H_a$: $\mu \ne \mu_0$, use the t-statistic:

$$t^* = {\bar{x} - \mu_0 \over {s \over \sqrt{n}}}$$
##### P-Value

A measure of inconsistency between the null hypothesis and the sample data.

##### Significance Level $(\alpha)$

$\alpha$ for a test of hypothesis is a boundary below which we conclude that a p-value shows statistically significant evidence against the null.

Common $\alpha$ levels are 0.01, 0.05, and 0.10.

The lower the $\alpha$, the stronger the evidence required to reject $H_0$. If the p-value is less than $\alpha$, reject $H_0$; if the p-value is greater than $\alpha$, fail to reject $H_0$.
#### Steps of a Hypothesis Test

1. State $H_0$ and $H_a$
1. Calculate the test statistic
1. Find the p-value
1. Reject or fail to reject $H_0$
1. Write the conclusion in the context of the problem
### Example

A researcher is interested in testing a particular brand of batteries and whether its battery life exceeds 40 hours.

A random sample of $n=70$ batteries has a mean life of $\bar{x} = 40.5$ hours with $s = 1.75$ hours. Let $\alpha = 0.05$.

$H_0$: $\mu = 40$

$H_a$: $\mu > 40$

$$t^* = {\bar{x} - \mu_0 \over {s \over \sqrt{n}}}$$

$$t^* = {40.5 - 40 \over {1.75 \over \sqrt{70}}} = 2.39$$

Find the p-value from the t-distribution with $n - 1 = 69$ degrees of freedom:

$$P(t_{69} \ge 2.39) \approx 0.0098$$

```python
>>> from scipy.stats import t

>>> t_stat = 2.39   # the observed t-statistic
>>> df = 70 - 1     # degrees of freedom, n - 1

>>> round(t.sf(t_stat, df), 4)  # one-sided p-value
0.0098
```
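If the raw measurements are available rather than just summary statistics, `scipy.stats.ttest_1samp` performs the same test in one call. A sketch, with simulated data standing in for the real sample:

```python
import numpy as np
from scipy import stats

# simulated stand-in for the 70 observed battery lifetimes
rng = np.random.default_rng(1)
battery_life = rng.normal(loc=40.5, scale=1.75, size=70)

# one-sided test of H0: mu = 40 against Ha: mu > 40
result = stats.ttest_1samp(battery_life, popmean=40, alternative='greater')
print(result.statistic, result.pvalue)
```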

If in fact $H_0$ is true, the probability of observing a test statistic that is as extreme or more extreme than $t^* = 2.39$ is about $0.0098$. That is to say, the sample is very unlikely to occur under $H_0$. Since the p-value is less than $\alpha$, $H_0$ is rejected.

Sample evidence suggests that the mean battery life of this particular brand exceeds 40 hours.
##### Type 1 Error

When $H_0$ is rejected despite it being true.

The probability that a type 1 error occurs is $\alpha$.

##### Type 2 Error

When $H_0$ is not rejected despite it being false.
### NOTE:

Our group of subjects should be representative of the entire population of interest.

Because we cannot impose an experiment on an entire population, we are often forced to examine a small sample, and we hope that the sample statistics, $\bar{x}$ and $s^2$, are good estimates of the population parameters, $\mu$ and $\sigma^2$.
### Example

The effects of caffeine on the body have been well studied. In one experiment, a group of 20 male college students were trained in a particular tapping movement and to tap at a rapid rate. They were randomly divided into caffeine and non-caffeine groups and given approximately two cups of coffee (either 200 [mg] of caffeine or decaf). After a two-hour period, the tapping rate was measured.

The population of interest is male college-aged students.

The question of interest: is the mean tap rate of the caffeinated group different than that of the non-caffeinated group?

Let $\mu_c$ be the mean of the caffeinated group, and $\mu_d$ be the mean of the non-caffeinated (decaf) group.

$H_0$: $\mu_c = \mu_d$

$H_a$: $\mu_c \ne \mu_d$
## Two-Sample T-Test

To test:

$H_0$: $\mu_1 = \mu_2$

$H_a$: $\mu_1 \ne \mu_2$

Use the following statistic:

$$t^* = {(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2) \over s_p \sqrt{{1\over n_1} + {1\over n_2}}}$$

Where:

$$s_p^2 = {(n_1 -1)s_1^2 + (n_2 - 1)s_2^2 \over n_1 + n_2 - 2}$$

$t^*$ follows a t-distribution with $n_1 + n_2 - 2$ degrees of freedom under $H_0$. Thus, the p-value is $P(t_{n_1 + n_2 - 2} \ge |t^*|)$ for a one-sided test, and twice that for a two-sided test.

##### Assumptions:

The two samples are independent, and both populations are normally distributed with the same variance.
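scipy's `ttest_ind` implements exactly this pooled-variance test when `equal_var=True` (the default). A minimal sketch with simulated groups (the means and standard deviation below are made-up values for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group1 = rng.normal(loc=248, scale=2.3, size=10)
group2 = rng.normal(loc=245, scale=2.3, size=10)

# pooled two-sample t-test (equal_var=True is the default)
result = stats.ttest_ind(group1, group2, equal_var=True)
print(result.statistic, result.pvalue)
```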
### Example

$H_0$: $\mu_c = \mu_d$

$H_a$: $\mu_c \ne \mu_d$

$$s_p^2 = {(n_1 -1)s_1^2 + (n_2 - 1)s_2^2 \over n_1 + n_2 -2}$$

$$s_p^2 = {(10 -1)(5.73) + (10 - 1)(4.9) \over 18} = 5.315$$

$$s_p = \sqrt{5.315} \approx 2.31$$

Plugging the two sample means and $s_p$ into the two-sample statistic gives $t^* = 3.394$.

Find the p-value:

$$2P(t_{n_1 + n_2 - 2} \ge |3.394|) = 0.00326$$

Since the p-value $< \alpha$, we reject $H_0$.

Sample evidence suggests that the mean tap rate for the caffeinated group is different than that for the non-caffeinated group.
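This two-sided p-value can be checked directly against the t-distribution with $n_1 + n_2 - 2 = 18$ degrees of freedom:

```python
from scipy.stats import t

print(2 * t.sf(3.394, df=18))   # two-sided p-value, about 0.0033
```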
### Example

The thickness of a plastic film in mils on a substrate material is thought to be influenced by the temperature at which the coating is applied. A completely randomized experiment is carried out. 11 substrates are coated at 125$^\circ$F, resulting in a sample mean coating thickness of $\bar{x}_1 = 103.5$ and a sample standard deviation of $s_1 = 10.2$. Another 13 substrates are coated at 150$^\circ$F, where $\bar{x}_2 = 99.7$ and $s_2 = 15.1$. It is suspected that raising the process temperature would reduce the mean coating thickness. Does the data support this claim? Use $\alpha = 0.01$.

| | 125$^\circ$F | 150$^\circ$F |
|-|-|-|
|$\bar{x}$| 103.5 | 99.7 |
|$s$| 10.2 | 15.1 |
|$n$| 11 | 13 |

$H_0$: $\mu_1 = \mu_2$

$H_a$: $\mu_1 > \mu_2$ (reduced thickness at the higher temperature)

$$t^* = {(\bar{x}_1 - \bar{x}_2) \over s_p \sqrt{{1\over n_1} + {1\over n_2}}}$$

$$s_p^2 = {(11 - 1)(10.2)^2 + (13-1)(15.1)^2 \over 11 + 13 - 2} = 171.66$$

$$s_p = 13.1$$

$$t^* = {(103.5 - 99.7) \over 13.1 \sqrt{{1\over11} + {1\over13}}} = 0.71$$

Find the p-value:

$$P(t_{n_1 + n_2 - 2} \ge 0.71) = 0.243$$

Since the p-value is greater than $\alpha$, we fail to reject $H_0$. That is to say, sample evidence does not suggest that raising the process temperature would reduce the mean coating thickness.
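Since only summary statistics are given, scipy's `ttest_ind_from_stats` is a convenient check (a sketch; `alternative='greater'` matches $H_a$: $\mu_1 > \mu_2$):

```python
from scipy import stats

result = stats.ttest_ind_from_stats(
    mean1=103.5, std1=10.2, nobs1=11,   # 125 F group
    mean2=99.7,  std2=15.1, nobs2=13,   # 150 F group
    equal_var=True, alternative='greater')

print(result.statistic, result.pvalue)  # about 0.71 and 0.243
```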
## Practical vs. Statistical Significance

More samples is not always better.

* Waste of resources
* Statistical significance $\ne$ practical significance

### Example

Consider an SAT score improvement study.

\$600 study plan: $x_{11}, x_{12}, \cdots, x_{1n}$

Traditional study plan: $x_{21}, x_{22}, \cdots, x_{2n}$

Test for

$H_0$: $\mu_1 = \mu_2$

$H_a$: $\mu_1 \ne \mu_2$

Test statistic:

$$t^* = {\bar{x}_1 - \bar{x}_2 \over s_p \sqrt{{1\over n_1} + {1\over n_2}}}$$

Suppose that $\mu_1 - \mu_2 = 1$ point. As $n \to \infty$, $\bar{x}_1 - \bar{x}_2 \xrightarrow{p} \mu_1 - \mu_2$ and $s_p^2 \to \sigma^2$, while the denominator $s_p \sqrt{{1\over n_1} + {1\over n_2}} \to 0$. So $|t^*| \to \infty$, and even a practically meaningless 1-point difference will eventually be declared statistically significant.
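A simulation makes the point (a sketch, assuming SAT-like scores with a made-up standard deviation of 100 and a true difference of 1 point):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

for n in (100, 10_000, 1_000_000):
    plan = rng.normal(loc=1001, scale=100, size=n)         # $600 study plan
    traditional = rng.normal(loc=1000, scale=100, size=n)  # traditional plan
    result = stats.ttest_ind(plan, traditional, equal_var=True)
    print(n, result.pvalue)
# by n = 1,000,000 the test flags even this tiny,
# practically meaningless difference as highly significant
```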