Notes on stat - sampling | Underlake Reading Room

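Review notes for STAT 332, with a loosely related background recap attached.

The essence of 332 is deriving each model and then plugging in the data as needed… so these notes are organized by model.
Definition of a model: it relates a parameter to a response.
The instructor did not go into much detail, so about half of these notes are my own summaries rather than proven results, and they probably contain mistakes.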

basic

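The two basic inference tools are the p-value and the confidence interval.

CI:
Assume the estimator is $\tilde{\theta} \sim N(\theta, V(\tilde{\theta}))$ .

$$est \pm c*(s.e.)$$

which equals $\hat{\theta} \pm c* \sqrt{V(\tilde{\theta})}$ .
If $\sigma$ is unknown, use $\hat{\theta} \pm c* \sqrt{\hat{V}(\tilde{\theta})}$ .
( $\sigma$ is the known standard deviation; when it is known, take c from $C \sim N(0,1)$ .
S is the standard deviation estimated from the data when $\sigma$ is unknown; in that case take c from the t distribution.)

est is the value of the estimate.
c is the critical value for the chosen confidence level.
s.e. is the standard error. In the sampling section it is the standard deviation term multiplied by the finite population correction $\sqrt{1 - \frac{n}{N}}$ .
The basic form is s.e. = $\frac{\sigma}{\sqrt{n}}$ .

The exact form depends on the model, so check each model's formula.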

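p-value:

$$d = \frac{est - H_{0value}}{s.e.} = \frac{\hat{\theta} - \theta_0}{\sqrt{V(\tilde{\theta})}}$$

given that the estimator is $\tilde{\theta} \sim N(\theta, V(\tilde{\theta}))$ .
Likewise, $D \sim N(0,1)$ when $\sigma$ is known, or $D \sim t_{n-q+c}$ when $\sigma$ is unknown. Here q is the number of unknown mean parameters and c the number of constraints on them (not the critical value); for a single mean this gives $n - 1$ .

$\theta_0$ is the hypothesized value under $H_0$ .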

| $H_0$ | $H_a$ | p-value |
| --- | --- | --- |
| $\theta = \theta_0$ | $\theta \neq \theta_0$ | $2Pr(D > \lvert d \rvert)$ |
| $\theta \geq \theta_0$ | $\theta < \theta_0$ | $Pr(D < d)$ |
| $\theta \leq \theta_0$ | $\theta > \theta_0$ | $Pr(D > d)$ |

If no significance level is given, use these defaults:
p > 0.1 : No evidence to reject $H_0$ .
0.05 < p < 0.1 : There is weak evidence to reject $H_0$ .
0.01 < p < 0.05 : There is evidence to reject $H_0$ .
p < 0.01 : There is very strong evidence to reject $H_0$ .

The exact form depends on the model, so check each model's formula.
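The three rows of the p-value table can be sketched as one small R helper. This is only an illustration: the function name `p_value` and its arguments are made up for these notes, and `df = Inf` is used to stand for the $N(0,1)$ case.

```r
# p-value from an observed discrepancy d, for the three alternatives in the table.
# df = Inf gives the N(0,1) case (sigma known); a finite df gives the t case.
p_value <- function(d, alternative = c("two.sided", "less", "greater"), df = Inf) {
  alternative <- match.arg(alternative)
  switch(alternative,
    two.sided = 2 * pt(abs(d), df, lower.tail = FALSE),  # 2 Pr(D > |d|)
    less      = pt(d, df),                               # Pr(D < d)
    greater   = pt(d, df, lower.tail = FALSE)            # Pr(D > d)
  )
}

p_value(1.96)                  # two-sided, normal case: about 0.05
p_value(-2.43, "less", df = 6) # one-sided, t with 6 degrees of freedom
```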

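Side note: with two groups of data, the shared variance (the square of the standard deviation) is estimated by the pooled variance,

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1+n_2-2}$$

Here $s_1^2$ is the sample variance of the first group and $n_1$ is its size.
See the table in the background recap.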
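As a quick numerical check of the pooled variance formula, here is a minimal R sketch. It borrows the two groups that appear later in the Model5 example; the variable names are illustrative only.

```r
# Two groups of responses (the same values used in the Model5 example).
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58, 60)

n1 <- length(grp1); n2 <- length(grp2)
s1_sq <- var(grp1)   # sample variance of group 1 (divides by n1 - 1)
s2_sq <- var(grp2)   # sample variance of group 2

# Pooled variance: each group's variance weighted by its degrees of freedom.
sp_sq <- ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
sp_sq   # 10.25
```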

models

Side note:

$Y_j$ is the response of unit j. It is a random quantity.
$\mu$ is the study parameter: not random, but unknown. Usually $\mu$ denotes a mean and $\pi$ a proportion.
$R_j$ is the error term. It gives the distribution of responses about $\mu$ . The $R_j$ are always assumed independent.

Gauss’s theorem:
Any linear combination of independent normal random variables is normal.
Central limit theorem:
Let $Y_1,...,Y_n$ be a sequence of independent random variables with $E(Y_i) = \mu$ and $V(Y_i) = \sigma^2 < \infty$ for every i.
Then for large n we have, approximately, $\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})$ .
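A small simulation can make the central limit theorem concrete. The sketch below is not from the course; the exponential distribution, the seed, and the sizes `n` and `reps` are arbitrary choices for illustration.

```r
set.seed(332)   # arbitrary seed, only for reproducibility
n    <- 30      # sample size per replicate (assumed)
reps <- 10000   # number of simulated samples (assumed)

# Exponential(1) responses: mean 1, variance 1, clearly non-normal.
ybar <- replicate(reps, mean(rexp(n, rate = 1)))

# The sample means should centre on mu = 1 with variance close to sigma^2 / n.
c(mean = mean(ybar), var = var(ybar), theory_var = 1 / n)
```

A histogram of `ybar` (for example `hist(ybar)`) should look roughly bell-shaped even though the individual responses are skewed.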

Model1:

Because $R_i$ is normal, $Y_i$ is normal too.

$$Y_i = \mu + R_i$$

where $R_{i} \sim N(0,\sigma^2)$

This can also be written as $Y_i \sim N(\mu , \sigma^2)$ .

with confidence interval: $\mu: \bar{y} \pm \frac{c*s}{\sqrt{n}}$
degree: $n - 1$ (t distribution)

discrepancy: $d = \frac{\bar{y} - \mu_0}{\frac{s}{\sqrt{n}}}$
distribution: $D \sim t_{n-1}$

$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$ is the sample standard deviation.
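The Model1 interval and discrepancy can be computed by hand in R and compared with `t.test()`. This is a sketch under assumptions: the data are borrowed from the Model5 example, and the hypothesized mean `mu0 = 50` is invented for illustration.

```r
y   <- c(50, 53, 52, 58)   # illustrative responses (grp1 from the Model5 example)
mu0 <- 50                  # hypothesized mean, an assumption for this sketch

n    <- length(y)
ybar <- mean(y)
s    <- sd(y)                           # sqrt(sum((y - ybar)^2) / (n - 1))
cval <- qt(0.975, df = n - 1)           # c for a 95% interval
ci   <- ybar + c(-1, 1) * cval * s / sqrt(n)

d <- (ybar - mu0) / (s / sqrt(n))       # discrepancy
p <- 2 * pt(abs(d), df = n - 1, lower.tail = FALSE)   # two-sided p-value

ci; d; p
t.test(y, mu = mu0)                     # should report the same interval and p-value
```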

___

Model2A:

Independent groups that share the same variance.

$Y_{ij}$ is the response of unit j in group i.

$$Y_{ij} = \mu_i + R_{ij}$$

where $R_{ij} \sim N(0,\sigma^2)$

with confidence interval: $\mu_1: \hat{\mu_1} \pm \frac{c*S_1}{\sqrt{n_1}}$
degree: $n_1 - 1$
OR
with confidence interval: $\mu_1 - \mu_2: \hat{\mu_1} - \hat{\mu_2} \pm c*\hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ , where $\hat{\sigma} = s_p$ is the pooled standard deviation
degree: $n_1 + n_2 - 2$

discrepancy: $d = \frac{\hat{\mu_1} - \hat{\mu_2} - \mu_0}{\hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
distribution: $D \sim t_{n_1 + n_2 - 2}$
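For Model2A, the interval for $\mu_1 - \mu_2$ can be sketched in R as follows. The data are the two groups from the Model5 example, the hypothesized difference is taken to be 0, and `t.test(..., var.equal = TRUE)` serves as a cross-check.

```r
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58, 60)
n1 <- length(grp1); n2 <- length(grp2)

# Pooled standard deviation plays the role of sigma-hat.
sp <- sqrt(((n1 - 1) * var(grp1) + (n2 - 1) * var(grp2)) / (n1 + n2 - 2))
se <- sp * sqrt(1 / n1 + 1 / n2)

est  <- mean(grp1) - mean(grp2)          # mu1-hat minus mu2-hat
cval <- qt(0.975, df = n1 + n2 - 2)      # c for a 95% interval
ci   <- est + c(-1, 1) * cval * se
d    <- (est - 0) / se                   # discrepancy for H0: mu1 - mu2 = 0

ci; d
t.test(grp1, grp2, var.equal = TRUE)     # same interval and t statistic
```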

___

Model2B:

Independent groups that have different variances.

$Y_{ij}$ is the response of unit j in group i.

Because the variances differ, $\sigma_i^2$ denotes the variance of group i.

$$Y_{ij} = \mu_i + R_{ij}$$

where $R_{ij} \sim N(0,\sigma_i^2)$

with confidence interval: $\mu_1: \hat{\mu_1} \pm \frac{c*S_1}{\sqrt{n_1}}$
degree: $n_1 - 1$
OR
with confidence interval: $\mu_1 - \mu_2: \hat{\mu_1} - \hat{\mu_2} \pm c*\sqrt{\frac{\hat{\sigma_1^2}}{n_1} + \frac{\hat{\sigma_2^2}}{n_2}}$
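degree: $n_1 + n_2 - 2$ (an approximation: with unequal variances the discrepancy is not exactly t distributed, and the Welch-Satterthwaite degrees of freedom are never larger than this)

discrepancy: $d = \frac{\hat{\mu_1} - \hat{\mu_2} - \mu_0}{\sqrt{\frac{\hat{\sigma_1^2}}{n_1} + \frac{\hat{\sigma_2^2}}{n_2}}}$
distribution: $D \sim t_{n_1 + n_2 - 2}$ (approximately)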
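The effect of the degrees of freedom under Model2B can be seen in a short R sketch. The data are the unequal-sized groups from the Model6 example, used purely for illustration; the Welch-Satterthwaite formula is the approximation that R's default `t.test()` uses, not something stated in the course notes.

```r
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58)        # unequal sizes (the Model6 data), purely illustrative
n1 <- length(grp1); n2 <- length(grp2)

v1 <- var(grp1) / n1         # sigma1-hat^2 / n1
v2 <- var(grp2) / n2         # sigma2-hat^2 / n2
se <- sqrt(v1 + v2)
d  <- (mean(grp1) - mean(grp2) - 0) / se      # discrepancy for H0: mu1 = mu2

df_notes <- n1 + n2 - 2                       # degrees of freedom used in these notes
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-values under each choice of degrees of freedom.
c(notes = 2 * pt(abs(d), df_notes, lower.tail = FALSE),
  welch = 2 * pt(abs(d), df_welch, lower.tail = FALSE))
```

The Welch value is the more conservative of the two, since fewer degrees of freedom give heavier tails.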

___

Model3:

Model3 tests the difference (abbreviated d) between the two measurements within each pair.

$$Y_{di} = \mu_d + R_{di}$$

where $R_{di} \sim N(0,\sigma_d^2)$

with confidence interval: $\mu_d: \bar{y_d} \pm \frac{c*S_d}{\sqrt{n_d}}$
degree: $n_d - 1$ .

discrepancy: $d = \frac{\bar{y_d} - \mu_0}{\frac{\hat{\sigma_d}}{\sqrt{n_d}}}$
distribution: $D \sim t_{n_d-1}$
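Model3 reduces to Model1 applied to the within-pair differences, which the following R sketch illustrates. The pairing of `before` and `after` is hypothetical (the numbers are reused from the Model5 example, which is not actually paired data).

```r
# Hypothetical paired data: two measurements on each of four units.
before <- c(50, 53, 52, 58)
after  <- c(62, 55, 58, 60)

y_d  <- after - before                 # one difference per pair: 12, 2, 6, 2
n_d  <- length(y_d)
s_d  <- sd(y_d)                        # standard deviation of the differences
cval <- qt(0.975, df = n_d - 1)        # c for a 95% interval
ci   <- mean(y_d) + c(-1, 1) * cval * s_d / sqrt(n_d)
d    <- (mean(y_d) - 0) / (s_d / sqrt(n_d))   # discrepancy for H0: mu_d = 0

ci; d
t.test(after, before, paired = TRUE)   # same interval and t statistic
```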

___

Model4:

where we have n outcomes, and each outcome is binary. Y is the number of successes.

$$\frac{Y}{n} \sim N(\pi,\frac{\pi(1-\pi)}{n})$$

(approximately, for large n)

with confidence interval: $\pi: \hat{\pi} \pm c*\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$
critical value from: $C \sim N(0,1)$

discrepancy: $d = \frac{\hat{\pi} - \pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}$
distribution: $D \sim N(0,1)$
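A minimal R sketch of the Model4 interval and test follows. The counts `y` and `n` and the hypothesized value `pi0` are invented for illustration.

```r
# Hypothetical data: y successes out of n binary outcomes.
y   <- 42
n   <- 100
pi0 <- 0.5      # hypothesized proportion under H0 (assumed)

pi_hat <- y / n
cval   <- qnorm(0.975)                                  # c for a 95% interval
ci     <- pi_hat + c(-1, 1) * cval * sqrt(pi_hat * (1 - pi_hat) / n)

# The discrepancy uses pi0, not pi_hat, in the standard error.
d <- (pi_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)         # -1.6
p <- 2 * pnorm(abs(d), lower.tail = FALSE)              # two-sided p-value

ci; d; p
```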

Side note:

$y_1,...,y_n$ are the sample data: non-random, the values we actually collected.
$Y_1,...,Y_n$ are the random variables of which the data are realizations.
A statistic is a function of the sample data, $\hat{\theta}$ . When the sample data change, $\hat{\theta}$ changes too.
We can treat $\hat{\theta}$ as a realization of the random variable $\tilde{\theta}$ ; $\tilde{\theta}$ is called the estimator.

Going from $\hat{\theta}$ to $\tilde{\theta}$ means going from $y_1,...,y_n$ to $Y_1,...,Y_n$ .
Example:
In Model2A, $\hat{\mu_1}$ = $\bar{y_{1+}}$ (statistic), so $\tilde{\mu_1}$ = $\bar{Y_{1+}}$ (estimator).

___

Model5:

Completely randomized design.

$$Y_{ij} = \mu + \tau_i + R_{ij}$$

where $R_{ij} \sim N(0,\sigma^2)$
and $i = 1,2,...,t$ (# of treatments), $j = 1,2,...,r$ (# of replicates per treatment).

The number of units is $t \cdot r$ .
overall mean: $\mu$ . group average: $\mu + \tau_i$ for group i. treatment effect: $\tau_i$ for group i.

$R_{ij}$ is the distribution of values about the deterministic part of the model.

Constraints: $\tau_1 + \tau_2 + ... + \tau_t = 0$ .

Example: group 1 has average 65 and group 2 has average 75.
The average of the two groups combined is 70. $\hat{\mu}$ + $\hat{\tau_1}$ is 65 and $\hat{\mu}$ + $\hat{\tau_2}$ is 75, so $\hat{\tau_1}$ is -5 and $\hat{\tau_2}$ is 5.

When there are only two treatments (t = 2):

$$\tilde{\tau_1} \sim N(\tau_1,\frac{\sigma^2}{2r}), \quad \tilde{\mu} \sim N(\mu,\frac{\sigma^2}{2r})$$

with confidence interval: $\tau_1: \hat{\tau_1} \pm c*\sqrt{\frac{\hat{\sigma}^2}{2r}}$
degree: $n - q + c = 2r - 3 + 1 = 2r - 2$ (t distribution)

with confidence interval: $\mu: \hat{\mu} \pm c*\sqrt{\frac{\hat{\sigma}^2}{2r}}$
degree: $n - q + c = 2r - 2$ (t distribution)

Why the denominator is 2r:

$$V(\tilde{\tau_1}) = V(\bar{Y_{1+}} - \bar{Y_{++}})$$

= $V(\bar{Y_{1+}} - (\bar{Y_{1+}} + \bar{Y_{2+}})/2 )$
= $V(1/2\bar{Y_{1+}} - 1/2\bar{Y_{2+}})$
= 1/4 $V(\bar{Y_{1+}})$ + 1/4 $V(\bar{Y_{2+}})$ (the groups are independent, so the variances add)
= 1/4 $\frac{\sigma^2}{r}$ + 1/4 $\frac{\sigma^2}{r}$
= $\frac{\sigma^2}{2r}$

When t is not 2 the variance changes as well, so take care.

Note: $V(\bar{Y_{1+}}) = \sigma^2 / r$ , i.e. $\sigma^2$ divided by the number of units in group 1.

discrepancy:

grp1 =  c(50,53,52,58)
grp2 = c(62,55,58,60)

options(contrasts = c('contr.sum','contr.poly'))
Y = c(grp1,grp2)
x = as.factor(c(rep('1',4),rep('2',4))) # makes a discrete variable
model = lm(Y~x) # builds the model
summary(model)

estimate:
intercept: $\hat{\mu}$ .
x1: $\hat{\tau_1}$ .

residual standard error: $\hat{\sigma}$ .
p-value: the intercept row tests $H_0:\mu = 0$ against $H_a:\mu \neq 0$ .
The x1 row tests $H_0:\tau_1 = 0$ against $H_a:\tau_1 \neq 0$ .

If we want to estimate the difference between two treatments:

$$\theta : \hat{\theta} \pm c*SE$$

s.e. = $\hat{\sigma}\sqrt{\frac{1}{r} + \frac{1}{r}} = \hat{\sigma}\sqrt{\frac{2}{r}}$ , which equals $\frac{\hat{\sigma}}{\sqrt{2}}$ for the example above, where r = 4.
with confidence interval: $\tau_1 - \tau_2: \hat{\tau_1} - \hat{\tau_2} \pm c*\sqrt{\frac{\hat{\sigma}^2}{2}}$
degree: $n - q + c$ (t distribution)

p-value: $H_0 : \tau_1 = \tau_2$ , $H_a: \tau_1 \neq \tau_2$ .
discrepancy:

$$d = \frac{est - H_{0value}}{s.e.}$$

we have $d = \frac{\hat{\tau_1} - \hat{\tau_2} - \tau_0}{\frac{\hat{\sigma}}{\sqrt{2}}}$ , with $\tau_0 = 0$ under this $H_0$ .
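The interval and discrepancy for $\tau_1 - \tau_2$ can be worked through in R with the two groups from the example above. This is a sketch, not the course's code: it reads $\hat{\tau_2}$ off the sum-to-zero constraint, and it writes the standard error as $\hat{\sigma}\sqrt{2/r}$, which for r = 4 is the same as $\hat{\sigma}/\sqrt{2}$.

```r
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58, 60)
r    <- length(grp1)                       # replicates per treatment (4)

old <- options(contrasts = c("contr.sum", "contr.poly"))  # sum-to-zero coding
Y   <- c(grp1, grp2)
x   <- factor(rep(c("1", "2"), each = r))
fit <- lm(Y ~ x)
options(old)                               # restore the previous contrast setting

tau1_hat  <- unname(coef(fit)[2])          # -2.75
tau2_hat  <- -tau1_hat                     # from the constraint tau1 + tau2 = 0
sigma_hat <- summary(fit)$sigma            # residual standard error

se_diff <- sigma_hat * sqrt(2 / r)         # equals sigma_hat / sqrt(2) when r = 4
cval    <- qt(0.975, df = df.residual(fit))   # n - q + c = 8 - 3 + 1 = 6
ci      <- (tau1_hat - tau2_hat) + c(-1, 1) * cval * se_diff
d       <- (tau1_hat - tau2_hat - 0) / se_diff

ci; d
```

The discrepancy here is the same t statistic as the equal-variance two-sample comparison of the two group means, because $\hat{\tau_1} - \hat{\tau_2}$ is the difference of the group averages.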

___

Model6:

Unbalanced CRD: the groups have different numbers of observations.

$$Y_{ij} = \mu + \tau_i + R_{ij}$$

where $R_{ij} \sim N(0,\sigma^2)$
and $i = 1,2,...,t$ (# of treatments), $j = 1,2,...,r_i$ (# of replicates for treatment i).

Constraints: $\sum_{i = 1}^{t} r_i \tau_i = 0$ .

grp1 =  c(50,53,52,58)
grp2 =  c(62,55,58)
Y = c(grp1,grp2)
x = as.factor(c(rep('1',4),rep('2',3)))

# Group averages
grp_av = tapply(Y,x,mean,na.rm = TRUE)
mu = mean(Y) # overall mean of all observations

# Treatment effects: group average minus overall mean
tao1 = (grp_av - mean(Y))[1]
tao2 = (grp_av - mean(Y))[2]

# Estimated sigma (residual standard error)
sigma = summary(lm(Y~x))$sigma

# Values
sigma; tao1; tao2; mu
# 3.447221, -2.178571, 2.904762, 55.42857

anova(lm(Y~x))
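The weighted constraint for the unbalanced design can be checked numerically with the same data. The sketch below also recomputes $\hat{\sigma}$ by hand using $n - q + c = 7 - 3 + 1 = 5$ degrees of freedom; the variable names are illustrative.

```r
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58)
Y <- c(grp1, grp2)
x <- factor(rep(c("1", "2"), times = c(4, 3)))

r_i     <- tapply(Y, x, length)            # replicates per treatment: 4 and 3
tau_hat <- tapply(Y, x, mean) - mean(Y)    # treatment effects

sum(r_i * tau_hat)   # weighted constraint: zero up to rounding
sum(tau_hat)         # the unweighted sum is not zero in the unbalanced case

W         <- sum((Y - ave(Y, x))^2)        # residual sum of squares
sigma_hat <- sqrt(W / (length(Y) - nlevels(x)))
sigma_hat            # 3.447221, matching summary(lm(Y ~ x))$sigma
```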

___

Model7:

Randomized block design.
A block here is a group of similar units; for example, in Model3 each pair of measurements being compared forms a block.

$$Y_{ij} = \mu + \tau_i + \beta_j + R_{ij}$$

where $R_{ij} \sim N(0,\sigma^2)$
and $i = 1,2,...,t$ (# of treatments), $j = 1,2,...,r$ (# of blocks, one replicate of each treatment per block).

$\beta_j$ is the $j^{th}$ block effect.
Constraints: $\sum_{i = 1}^{t} \tau_i = 0$ . $\sum_{j = 1}^{r} \beta_j = 0$ .
By least squares: $\hat{\beta_j} = \bar{y_{+j}} - \bar{y_{++}}$ .

Data=read.table("blocked.csv",sep=",",header=TRUE)

options(contrasts = c('contr.sum','contr.poly'))
attach(Data) # after attach, the column names can be used directly as variables,
# e.g. Treatment instead of Data$Treatment.
# Treatment, Block and Value are the column names here.
Treatment = as.factor(Treatment)
Block = as.factor(Block)
Model = lm(Value~Treatment+Block)

#To look at the output, we type:
summary(Model)
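Since `blocked.csv` is not included with these notes, here is a self-contained sketch with a small made-up data set. It checks that, in a balanced design with sum-to-zero contrasts, the `lm()` coefficients agree with the least squares formulas (treatment or block average minus overall average). All values in `Data` are hypothetical.

```r
# Hypothetical data: 2 treatments (A, B) observed once in each of 3 blocks.
Data <- data.frame(
  Treatment = factor(rep(c("A", "B"), each = 3)),
  Block     = factor(rep(1:3, times = 2)),
  Value     = c(10, 12, 15, 14, 17, 18)
)

old   <- options(contrasts = c("contr.sum", "contr.poly"))
Model <- lm(Value ~ Treatment + Block, data = Data)
options(old)                                # restore the previous contrast setting

# Least squares estimates computed directly from the averages.
tau_hat  <- tapply(Data$Value, Data$Treatment, mean) - mean(Data$Value)
beta_hat <- tapply(Data$Value, Data$Block, mean) - mean(Data$Value)

coef(Model)   # (Intercept) = mu-hat, Treatment1 = tau_1-hat, Block1/Block2 = beta_1/beta_2-hat
tau_hat
beta_hat      # the last effect in each set follows from the sum-to-zero constraint
```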

Model8:

Factorial design.
Example: radiation alone is 1/4 effective and chemo alone is also 1/4 effective, but used together they are 5/6 effective.
The design is used to find out whether two factors have this kind of relationship (interaction).

$$Y_{ijk} = \mu + \tau_{ij} + R_{ijk}$$

where $R_{ijk} \sim N(0,\sigma^2)$
and $i = 1,2,...,l_1$ (# of levels of factor 1), $j = 1,2,...,l_2$ (# of levels of factor 2), $k = 1,2,...,r$ .

by least squares: $W = \sum_{ijk} r_{ijk}^2 + \lambda(\sum_{ij}\tau_{ij})$
we get $\hat{\mu} = \bar{y_{+++}}$ , $\hat{\tau_{ij}} = \bar{y_{ij+}} - \bar{y_{+++}}$ , $\hat{\sigma^2} = \frac{W}{rl_1l_2 - l_1l_2 - 1 + 1}$ .
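The Model8 estimates can be computed directly from cell averages. The sketch below uses a made-up 2 x 2 factorial with r = 2 replicates per treatment combination, and compares the hand-computed $\hat{\sigma}$ with `lm()` fitted with the interaction term; the data and names are assumptions for illustration.

```r
# Hypothetical 2 x 2 factorial with r = 2 replicates per treatment combination.
A <- factor(rep(c(1, 1, 2, 2), each = 2))   # level of factor 1 (index i)
B <- factor(rep(c(1, 2, 1, 2), each = 2))   # level of factor 2 (index j)
y <- c(10, 12,  14, 16,  13, 15,  25, 27)

l1 <- nlevels(A); l2 <- nlevels(B); r <- 2

mu_hat   <- mean(y)                          # ybar_{+++}
cell_avg <- tapply(y, list(A, B), mean)      # ybar_{ij+}, an l1 x l2 table
tau_hat  <- cell_avg - mu_hat                # tau-hat_{ij}
sum(tau_hat)                                 # constraint: sums to zero

W          <- sum((y - ave(y, A, B))^2)      # residual sum of squares
sigma2_hat <- W / (r * l1 * l2 - l1 * l2 - 1 + 1)

# Cross-check against lm() with the interaction included.
fit <- lm(y ~ A * B)
c(by_hand = sqrt(sigma2_hat), lm = summary(fit)$sigma)
```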

Model9:

Factorial randomized block design.

$$Y_{ijk} = \mu + \tau_{ij} + \beta_k + R_{ijk}$$

$\beta_k$ is the block effect.
Constraints: $\sum_{ij} \tau_{ij} = 0$ . $\sum_{k} \beta_k = 0$ .

where $R_{ijk} \sim N(0,\sigma^2)$
and $i = 1,2,...,l_1$ (# of levels of factor 1), $j = 1,2,...,l_2$ (# of levels of factor 2), $k = 1,2,...,r$ .

by least squares: $W = \sum_{ijk} r_{ijk}^2 + \lambda_1(\sum_{ij}\tau_{ij}) + \lambda_2(\sum_{k}\beta_{k})$
we get $\hat{\mu} = \bar{y_{+++}}$ , $\hat{\tau_{ij}} = \bar{y_{ij+}} - \bar{y_{+++}}$ , $\hat{\beta_k} = \bar{y_{++k}} - \bar{y_{+++}}$ , $\hat{\sigma^2} = \frac{W}{rl_1l_2 - l_1l_2 - r - 1 + 2}$ .

anova

$$SS_{TOT} = SS_{TRT} + SS_{RES}$$

anova(model)
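The sum of squares decomposition can be verified by hand and compared with the `anova()` table. This sketch reuses the two balanced groups from the Model5 example; the names `ss_tot`, `ss_trt` and `ss_res` are illustrative.

```r
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58, 60)
Y <- c(grp1, grp2)
x <- factor(rep(c("1", "2"), each = 4))

ss_tot <- sum((Y - mean(Y))^2)           # total variation about the overall mean
ss_trt <- sum((ave(Y, x) - mean(Y))^2)   # variation of group averages about the overall mean
ss_res <- sum((Y - ave(Y, x))^2)         # variation within groups

c(ss_tot, ss_trt + ss_res)               # 122 and 60.5 + 61.5 = 122

anova(lm(Y ~ x))                         # the "Sum Sq" column holds SS_TRT and SS_RES
```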

___

sampling

SRS (simple random sampling)
We cannot collect data on every unit in the population, so we take a sample.

The biggest difference is that the formulas include a finite population correction.
n is the sample size and N is the population size.

Example: Model1

confidence interval: $\mu: \bar{y} \pm \frac{c*S}{\sqrt{n}}$
We have error $E = \frac{c*\sigma}{\sqrt{n}}$ , solve for n.

Using SRS:
confidence interval: $\mu: \bar{y} \pm \sqrt{1 - \frac{n}{N}} \frac{c*\hat{\sigma}}{\sqrt{n}}$
We have error $E = \sqrt{1 - \frac{n}{N}} \frac{c*\hat{\sigma}}{\sqrt{n}}$ , solve for n.
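Solving the margin-of-error equation for n gives $n = (c\sigma/E)^2$ without the correction and $n = 1 / (E^2/(c^2\hat{\sigma}^2) + 1/N)$ with it. A small R helper makes this concrete; the function name `srs_sample_size` and the example inputs are made up for illustration.

```r
# Sample size needed so that the margin of error is at most E.
# N = Inf drops the finite population correction (the plain Model1 case).
srs_sample_size <- function(E, sigma, cval = qnorm(0.975), N = Inf) {
  n0 <- (cval * sigma / E)^2      # solves E = c * sigma / sqrt(n)
  1 / (1 / n0 + 1 / N)            # solves E = sqrt(1 - n/N) * c * sigma / sqrt(n)
}

srs_sample_size(E = 2, sigma = 10, cval = 1.96)             # 96.04
srs_sample_size(E = 2, sigma = 10, cval = 1.96, N = 1000)   # about 87.6
```

In practice the result is rounded up to a whole number of units. The finite population correction always lowers the required n.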

Example: Model4

with confidence interval: $\pi: \hat{\pi} \pm c*\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$

Using SRS:
confidence interval: $\pi: \hat{\pi} \pm c*\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}\sqrt{1 - \frac{n}{N}}$

We have error $E = \sqrt{1 - \frac{n}{N}} \frac{c*\hat{\sigma}}{\sqrt{n}}$ , solve for n.
Note: $\hat{\sigma^2} = \hat{\pi}(1-\hat{\pi})$
Often we replace $\hat{\sigma^2}$ by 1/4 (the worst case, $\pi = 1/2$ ).
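For a proportion, the worst-case substitution $\hat{\sigma^2} = 1/4$ lets n be chosen before any data are collected. The sketch below is illustrative: `prop_sample_size` and its arguments are invented names, and the margin 0.03 and population size 10000 are arbitrary.

```r
# Sample size for estimating a proportion to within E under SRS.
# pi_guess = 0.5 is the worst case, giving sigma^2 = 1/4.
prop_sample_size <- function(E, N = Inf, pi_guess = 0.5, cval = qnorm(0.975)) {
  sigma2 <- pi_guess * (1 - pi_guess)
  n0 <- cval^2 * sigma2 / E^2            # size needed without the correction
  ceiling(1 / (1 / n0 + 1 / N))          # apply the correction, round up to whole units
}

prop_sample_size(E = 0.03, N = 10000, cval = 1.96)                  # 965
prop_sample_size(E = 0.03, N = 10000, pi_guess = 0.3, cval = 1.96)  # smaller, since 0.3 * 0.7 < 1/4
```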

How to choose the sample size:

Take a small sample, estimate $\sigma$ .
Find n using formula.
Perform a large study with n units.

regression

regression sampling
requires a linear relationship between x and y.

$$Y_i = \alpha + \beta(x_i - \bar{x}) + R_i$$

where $R_{i} \sim N(0,\sigma^2)$

by least squares: $W = \sum_{i} r_{i}^2 = \sum_{i} (y_i - \alpha - \beta(x_i - \bar{x}))^2$
we get $\hat{\alpha} = \bar{y}$ , $\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{s_{xy}}{s_{xx}}$ , $\hat{\sigma_r}^2 = \frac{W}{n - 2}$ (two mean parameters, $\alpha$ and $\beta$ , so $n - q + c = n - 2$ ).
where $S_{xy} = \sum_i (y_i - \bar{y})(x_i - \bar{x})$ , $s_{xy} = \frac{S_{xy}}{n - 1}$ , $S_{xx}= \sum_i (x_i - \bar{x})^2$ , $s_{xx} = \frac{S_{xx}}{n - 1}$ .

estimators: $\tilde{\alpha}, \tilde{\beta}, \tilde{\mu_x}, \tilde{\mu_y}$ are all unbiased. $\tilde{\mu_{reg}}$ is a biased estimator for $\mu_y$ .

confidence interval: $EST \pm c SE = \hat{\mu_{reg}} \pm c \sqrt{1 - n/N} \frac{\hat{\sigma_r}}{\sqrt{n}}$

attach(women)
# assume we want to know the mean height
# assume we know the mean weight
mean(height) # treated as unknown
mean(weight) # known

# using SRSWOR, we take a sample of size 5 and use its mean as our estimate of the mean height:
# (the outputs below come from R's pre-3.6.0 sampler; on newer versions,
#  run RNGkind(sample.kind = "Rounding") first to reproduce them)
set.seed(1)
sample_heights = sample(height,5) # n = 5
mean(sample_heights)
# [1] 63.4, this is \hat{\mu_{h}}
sd(sample_heights)
# [1] 3.209361, this is \hat{\sigma_{SRS}}

we can build SRS CI: $\hat{\mu_{h}} \pm c \sqrt{1 - \frac{n}{N}} \frac{\hat{\sigma_{SRS}}}{\sqrt{n}}$

sample_weights = c(123,129,135,146,120)
mean(sample_weights)
# [1] 130.6
# We note that there is a linear relationship between height and weight.
plot(weight,height)

sample_weights = sample_weights-mean(sample_weights)
summary(lm(sample_heights~sample_weights))

The regression estimate is given by:

$$\hat{\mu_{reg}} = \hat{\alpha} + \hat{\beta}(\mu_{weight} - \bar{x}) = 63.4 + 0.31(136.7333 - 130.6) = 65.3$$

build regression CI: $\hat{\mu_{reg}} \pm c\sqrt{1 - \frac{n}{N}}\frac{\hat{\sigma_{r}}}{\sqrt{n}}$

The regression estimate (65.3) is closer than the SRS estimate (63.4) to the true mean height, which is 65 in this data set.
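The regression estimate and its interval can be reproduced without relying on the random number generator by looking up the five sampled women from their weights. This is a sketch: the choice of `qt(0.975, n - 2)` for c is an assumption (the notes do not say which distribution to use here), and `summary(fit)$sigma` is the residual standard error that `lm()` computes with n - 2 degrees of freedom.

```r
sample_weights <- c(123, 129, 135, 146, 120)                 # from the example above
sample_heights <- women$height[match(sample_weights, women$weight)]

n    <- length(sample_heights)
N    <- nrow(women)                                          # population size, 15
mu_x <- mean(women$weight)                                   # known population mean of x

x_c <- sample_weights - mean(sample_weights)                 # centred x
fit <- lm(sample_heights ~ x_c)
alpha_hat <- unname(coef(fit)[1])                            # equals mean(sample_heights)
beta_hat  <- unname(coef(fit)[2])                            # S_xy / S_xx
sigma_r   <- summary(fit)$sigma                              # residual standard error

mu_reg <- alpha_hat + beta_hat * (mu_x - mean(sample_weights))
cval   <- qt(0.975, df = n - 2)                              # assumed choice of c
ci_reg <- mu_reg + c(-1, 1) * cval * sqrt(1 - n / N) * sigma_r / sqrt(n)

c(srs = mean(sample_heights), reg = mu_reg, truth = mean(women$height))
ci_reg
```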

___

ratio

ratio estimation
requires a linear relationship between x and y.
requires an intercept of zero.

$$Y_i = \beta x_i + R_i$$

where $R_{i} \sim N(0,x_i \sigma^2)$

Note: the variance is multiplied by $x_i$ , so in the computation we work with $\frac{x_i}{\sqrt{x_i}}$ and $\frac{y_i}{\sqrt{x_i}}$ .

by least squares we have $\hat{\beta} = \frac{\bar{y}}{\bar{x}}$ and $\hat{\sigma_{ratio}}^2 = \frac{W}{n - 1}$ . $\hat{\mu_{ratio}} = \frac{\bar{y}}{\bar{x}}\mu_x$ .

with confidence interval: $\hat{\mu_{ratio}} \pm c \sqrt{1 - \frac{n}{N}} \frac{\hat{\sigma_{ratio}}}{\sqrt{n}}$

attach(women) # same data set

set.seed(1)
sample_heights = sample(height,5)
mean(sample_heights)
#[1] 63.4 # SRS estimate for \mu_y
sd(sample_heights)
#[1] 3.209361
sample_weights = c(123,129,135,146,120)
mean(sample_weights)
#[1] 130.6

plot(weight,height) # linear

Sqrt_weights = sqrt(sample_weights) # \sqrt{x_i}
sample_weights = sample_weights/Sqrt_weights # \frac{x_i}{sqrt{x_i}}

sample_heights = sample_heights/Sqrt_weights # \frac{y_i}{\sqrt{x_i}}
summary(lm(sample_heights~sample_weights-1))

The ratio estimate is given by: $\hat{\mu_{ratio}} = \hat{\beta}\mu_x = \frac{\bar{y}}{\bar{x}}\mu_x = \frac{63.4}{130.6}\mu_x$ .

= $0.48545\mu_x$ . Substituting $\mu_x = 136.7333$ gives 0.48545(136.7333) = 66.4

build ratio CI: $\hat{\mu_{ratio}} \pm c \sqrt{1 - \frac{n}{N}} \frac{\hat{\sigma_{ratio}}}{\sqrt{n}}$

The ratio estimate (66.4) is closer than the SRS estimate (63.4) to the true mean height of 65. The ratio estimator is biased.
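The claim that least squares on the rescaled variables gives $\hat{\beta} = \bar{y}/\bar{x}$ can be checked numerically. The sketch below looks up the five sampled women from their weights so it does not depend on the random number generator; it covers only the point estimate, not the interval.

```r
sample_weights <- c(123, 129, 135, 146, 120)                 # from the example above
sample_heights <- women$height[match(sample_weights, women$weight)]
mu_x <- mean(women$weight)                                   # known population mean of x

beta_hat <- mean(sample_heights) / mean(sample_weights)      # ybar / xbar

# Same slope from least squares on the rescaled variables, with no intercept:
# y_i / sqrt(x_i) regressed on x_i / sqrt(x_i) = sqrt(x_i).
sw  <- sqrt(sample_weights)
fit <- lm(I(sample_heights / sw) ~ 0 + sw)
unname(coef(fit))                                            # matches beta_hat

mu_ratio <- beta_hat * mu_x                                  # about 66.4
mu_ratio
```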

For a proportion:

build the CI: $\hat{\theta} \pm c\frac{1}{\hat{\pi}}\sqrt{1 - \frac{n}{N}}\frac{\hat{\sigma_{ratio}}}{\sqrt{n}}$ where $\hat{\sigma_{ratio}}^2 = \sum\frac{(y_i - \hat{\theta} z_i)^2}{n - 1}$ .

$\hat{\theta}$ is the quantity we want for the group of interest (its share of the whole population).

___

stratified

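Work with subpopulations (strata); the strata are independent of each other.

$$\mu = \frac{N_1\mu_1+ N_2\mu_2 + ... + N_H\mu_H}{N} = \sum_{i = 1}^{H}w_i\mu_i$$

Here w is the weight, $w_i = N_i/N$ , which is the stratum's share of the whole population.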

confidence interval: $\hat{\mu} \pm c \sqrt{ \sum_{i=1}^H w_i^2\frac{\sigma_i^2}{n_i}(1 - \frac{n_i}{N_i}) }$
where $C \sim N(0,1)$ .

$$\pi = \sum_{i = 1}^{H}w_i\pi_i$$

confidence interval: $\hat{\pi} \pm c \sqrt{ \sum_{i=1}^H w_i^2\frac{\sigma_i^2}{n_i}(1 - \frac{n_i}{N_i}) }$
where $C \sim N(0,1)$ , $\sigma_i^2 = \pi_i(1 - \pi_i)$ .
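The stratified estimate and its interval can be sketched in R from per-stratum summaries. Every number below is hypothetical, and the sample standard deviations `s_h` stand in for the unknown $\sigma_i$ .

```r
# Hypothetical summary data for H = 2 strata.
N_h    <- c(600, 400)     # stratum population sizes
n_h    <- c(30, 20)       # stratum sample sizes
ybar_h <- c(10, 20)       # stratum sample means
s_h    <- c(2, 3)         # stratum sample standard deviations

w_h     <- N_h / sum(N_h)                                 # weights: 0.6 and 0.4
mu_hat  <- sum(w_h * ybar_h)                              # 14
var_hat <- sum(w_h^2 * s_h^2 / n_h * (1 - n_h / N_h))     # 0.114
ci      <- mu_hat + c(-1, 1) * qnorm(0.975) * sqrt(var_hat)

mu_hat; var_hat; ci
```

For a proportion the same code applies with `ybar_h` replaced by the stratum proportions and `s_h^2` by $\hat{\pi_i}(1 - \hat{\pi_i})$ .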