
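Notes on stat - sampling

Review notes for STAT 332, with a loosely related background recap attached.

The essence of 332 is deriving each model and then plugging in data as needed, so these notes are organized by model.
Definition of a model: it relates a parameter to a response.
The lectures did not go into detail, so about half of these notes are my own summary rather than proofs, and they probably contain mistakes.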

basic

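The two basic tools for judging a hypothesis are the p-value and the confidence interval.


CI:
assume the estimator is $\tilde{\theta} \sim N(\theta, V(\tilde{\theta}))$.

$$est \pm c \cdot (s.e.)$$

which equals $\hat{\theta} \pm c \cdot \sqrt{V(\tilde{\theta})}$.
If $\sigma$ is unknown, use $\hat{\theta} \pm c \cdot \sqrt{\hat{V}(\tilde{\theta})}$.
($\sigma$ is the known standard deviation; when it is known, take $C \sim N(0,1)$.
$S$ is the standard deviation estimated when $\sigma$ is unknown; in that case use the t distribution.)

est is the value of the estimate.
c is the critical value corresponding to the confidence level.
SE is the standard error. In sampling it is the standard deviation term multiplied by a sampling correction factor.
The basic form is s.e. $= \frac{\sigma}{\sqrt{n}}$.

The exact form depends on the model's formula.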


p-value:

$$d = \frac{est - H_{0value}}{s.e.} = \frac{\hat{\theta} - \theta_0}{\sqrt{V(\tilde{\theta})}}$$

given the estimator is $\tilde{\theta} \sim N(\theta, V(\tilde{\theta}))$.
As before, $D \sim N(0,1)$ when $\sigma$ is known, or $D \sim t_{n-1+c}$ when $\sigma$ is unknown.

The $H_0$ value is the hypothesized value.

| $H_0$ | $H_a$ | P value |
|---|---|---|
| $\theta = \theta_0$ | $\theta \neq \theta_0$ | $2Pr(D > \lvert d \rvert)$ |
| $\theta \geq \theta_0$ | $\theta < \theta_0$ | $Pr(D < d)$ |
| $\theta \leq \theta_0$ | $\theta > \theta_0$ | $Pr(D > d)$ |

If no significance level is given, the default reading is:
p > 0.1 : No evidence to reject $H_0$.
0.1 > p > 0.05 : There is weak evidence to reject $H_0$.
0.05 > p > 0.01 : There is evidence to reject $H_0$.
p < 0.01 : There is strong evidence to reject $H_0$.

The exact form depends on the model's formula.

Side: the variance (the square of the standard deviation) when there are two groups of values, aka the pooled variance:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1+n_2-2}$$

Here $s_1^2$ is the variance of the first group and $n_1$ is its size.
See also the table in the background recap.





models

Side:

  • $Y_j$ is the response of unit j. It is a random quantity.
  • $\mu$ is the study parameter, not random but unknown. Usually $\mu$ denotes a mean and $\pi$ a proportion.
  • $R_j$ is the error term. It gives the distribution of responses about $\mu$. The error terms are always independent.

Gauss’s theorem:
Any linear combination of independent normal random variables is normal.
Central limit theorem:
Let $Y_1,...,Y_n$ be a sequence of random variables with $E(Y_i) = \mu$ and $V(Y_i) = \sigma^2 < \infty$ for every i, and let all $Y_i$ be independent.
Then, for large n, we have approximately $\bar{Y} \sim N(\mu, \frac{\sigma^2}{n})$.


Model1:

Because $R_i$ is normal, $Y_i$ is also normal.

$$Y_i = \mu + R_i$$

where $R_i \sim N(0,\sigma^2)$

This can also be written as $Y_i \sim N(\mu, \sigma^2)$.


with confidence interval: $\mu: \bar{y} \pm \frac{c \cdot s}{\sqrt{n}}$
degree: $n - 1$ (t distribution)

discrepancy: $d = \frac{\bar{y} - \mu_0}{\frac{s}{\sqrt{n}}}$
distribution: $D \sim t_{n-1}$

$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$, i.e. the sample standard deviation.
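To make the Model 1 formulas concrete, here is a minimal R sketch that computes the interval and the discrepancy by hand and cross-checks them against `t.test`. The data are the first group from the Model 5 example later in these notes; the null value `mu0 = 50` and the 95% level are assumptions picked only for illustration.

```r
# Model 1 by hand: one-sample t interval and test
y   <- c(50, 53, 52, 58)          # grp1 from the Model 5 example
mu0 <- 50                         # hypothetical H0 value (illustration only)
n   <- length(y)

s  <- sd(y)                       # sqrt(sum((y - mean(y))^2) / (n - 1))
se <- s / sqrt(n)
cc <- qt(0.975, df = n - 1)       # multiplier c for a 95% interval

ci <- mean(y) + c(-1, 1) * cc * se
d  <- (mean(y) - mu0) / se        # discrepancy
p  <- 2 * pt(-abs(d), df = n - 1) # two-sided p-value

# Cross-check against R's built-in one-sample t test
tt <- t.test(y, mu = mu0)
print(ci); print(d); print(p)
```

The hand-computed values should agree with `tt$conf.int`, `tt$statistic`, and `tt$p.value`.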


___

Model2A:

Independent groups which have same variance.

$Y_{ij}$ is the response of unit j in group i.
$$Y_{ij} = \mu_i + R_{ij}$$

where $R_{ij} \sim N(0,\sigma^2)$

with confidence interval: $\mu_1: \hat{\mu}_1 \pm \frac{c \cdot S_1}{\sqrt{n_1}}$
degree: $n_1 - 1$
OR
with confidence interval: $\mu_1 - \mu_2: \hat{\mu}_1 - \hat{\mu}_2 \pm c \cdot \hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
degree: $n_1 + n_2 - 2$

discrepancy: $d = \frac{\hat{\mu}_1 - \hat{\mu}_2 - \mu_0}{\hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
distribution: $D \sim t_{n_1 + n_2 - 2}$
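A short R sketch of the Model 2A difference interval, assuming $\hat{\sigma}$ is the pooled standard deviation $s_p$ from the basic section. It reuses the two groups from the Model 5 example; the 95% level and the null value 0 are choices made here for illustration.

```r
# Model 2A by hand: equal-variance two-sample interval and test
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58, 60)
n1 <- length(grp1); n2 <- length(grp2)

# Pooled variance s_p^2
sp2 <- ((n1 - 1) * var(grp1) + (n2 - 1) * var(grp2)) / (n1 + n2 - 2)
se  <- sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
df  <- n1 + n2 - 2

est <- mean(grp1) - mean(grp2)
ci  <- est + c(-1, 1) * qt(0.975, df) * se
d   <- est / se                   # discrepancy for H0: mu_1 - mu_2 = 0
p   <- 2 * pt(-abs(d), df)

# Cross-check against the pooled two-sample t test
tt <- t.test(grp1, grp2, var.equal = TRUE)
print(sp2); print(ci); print(d); print(p)
```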



___

Model2B:

Independent groups which have different variance.

$Y_{ij}$ is the response of unit j in group i.

Because the variances differ, $\sigma_i^2$ denotes the variance of each group.


$$Y_{ij} = \mu_i + R_{ij}$$

where $R_{ij} \sim N(0,\sigma_i^2)$

with confidence interval: $\mu_1: \hat{\mu}_1 \pm \frac{c \cdot S_1}{\sqrt{n_1}}$
degree: $n_1 - 1$
OR
with confidence interval: $\mu_1 - \mu_2: \hat{\mu}_1 - \hat{\mu}_2 \pm c \cdot \sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}}$
degree: $n_1 + n_2 - 2$

discrepancy: $d = \frac{\hat{\mu}_1 - \hat{\mu}_2 - \mu_0}{\sqrt{\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}}}$
distribution: $D \sim t_{n_1 + n_2 - 2}$
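The unequal-variance discrepancy can be sketched in R as below, again with the two groups from the Model 5 example. One caveat worth knowing: R's default `t.test` uses the same standard error but the Welch-Satterthwaite degrees of freedom, not the $n_1 + n_2 - 2$ written in these notes, so the p-values can differ slightly.

```r
# Model 2B by hand: unequal-variance standard error and discrepancy
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58, 60)
n1 <- length(grp1); n2 <- length(grp2)

se <- sqrt(var(grp1) / n1 + var(grp2) / n2)
d  <- (mean(grp1) - mean(grp2)) / se      # H0: mu_1 - mu_2 = 0

df_notes <- n1 + n2 - 2                   # degrees of freedom as in the notes
p_notes  <- 2 * pt(-abs(d), df_notes)

tt       <- t.test(grp1, grp2)            # var.equal = FALSE is the default
df_welch <- unname(tt$parameter)          # Welch-Satterthwaite df used by R
print(c(d = d, df_notes = df_notes, df_welch = df_welch))
print(c(p_notes = p_notes, p_welch = tt$p.value))
```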



___

Model3:

Model 3 tests the difference (abbreviated d) between the two measurements within each pair.


$$Y_{di} = \mu_d + R_{di}$$

where $R_{di} \sim N(0,\sigma_d^2)$

with confidence interval: $\mu_d: \bar{y}_d \pm \frac{c \cdot S_d}{\sqrt{n_d}}$
degree: $n_d - 1$.

discrepancy: $d = \frac{\bar{y}_d - \mu_0}{\frac{\hat{\sigma}_d}{\sqrt{n_d}}}$
distribution: $D \sim t_{n_d-1}$
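As a small illustration of the paired model, the sketch below treats two hypothetical measurement vectors as pairs (the pairing is invented for this example, not taken from the course data), works on their differences, and compares the result with `t.test(..., paired = TRUE)`.

```r
# Model 3 by hand: paired differences (hypothetical pairing, illustration only)
before <- c(50, 53, 52, 58)
after  <- c(62, 55, 58, 60)

y_d <- before - after              # one difference per pair
n_d <- length(y_d)
se  <- sd(y_d) / sqrt(n_d)

ci <- mean(y_d) + c(-1, 1) * qt(0.975, n_d - 1) * se
d  <- mean(y_d) / se               # H0: mu_d = 0
p  <- 2 * pt(-abs(d), n_d - 1)

tt <- t.test(before, after, paired = TRUE)
print(ci); print(d); print(p)
```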



___

Model4:

where we have n outcomes, and each outcome is binary.


$$\frac{Y}{n} \sim N(\pi,\frac{\pi(1-\pi)}{n})$$

with confidence interval: $\pi: \hat{\pi} \pm c \cdot \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$
distribution: $C \sim N(0,1)$

discrepancy: $d = \frac{\hat{\pi} - \pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}$
distribution: $D \sim N(0,1)$
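A minimal numeric sketch of Model 4 in R. The counts and the null value are hypothetical, chosen so the arithmetic is easy to follow; the interval uses $\hat{\pi}$ in the standard error while the discrepancy uses the hypothesized value $\pi_0$.

```r
# Model 4 by hand: normal-approximation interval and test for a proportion
n   <- 100     # hypothetical number of binary outcomes
y   <- 40      # hypothetical number of successes
pi0 <- 0.5     # hypothetical H0 value

pi_hat <- y / n
cc     <- qnorm(0.975)                                   # 95% multiplier
ci     <- pi_hat + c(-1, 1) * cc * sqrt(pi_hat * (1 - pi_hat) / n)

d <- (pi_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)          # uses pi0, not pi_hat
p <- 2 * pnorm(-abs(d))
print(ci); print(d); print(p)
```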


Side:

  • $y_1,...,y_n$ are the sample data, non-random: the data we actually collected.
  • $Y_1,...,Y_n$ are the random variables; $y_1,...,y_n$ are their realizations.
  • A statistic $\hat{\theta}$ is a function of the sample data. When the sample data change, $\hat{\theta}$ changes too.
  • We can treat $\hat{\theta}$ as a realization of the random variable $\tilde{\theta}$; $\tilde{\theta}$ is called the estimator.

Going from $\hat{\theta}$ to $\tilde{\theta}$ means replacing $y_1,...,y_n$ with $Y_1,...,Y_n$.
Example:
in Model 2A, $\hat{\mu}_1 = \bar{y}_{1+}$ (statistic), so $\tilde{\mu}_1 = \bar{Y}_{1+}$ (estimator).



___

Model5:

completely randomized design.


$$Y_{ij} = \mu + \tau_i + R_{ij}$$

where $R_{ij} \sim N(0,\sigma^2)$
and $i = 1,2,...,t$ (# of treatments), $j = 1,2,...,r$ (# of replicates per treatment).

the number of units is $t \cdot r$.
overall mean: $\mu$. group average: $\mu + \tau_i$ for group i. treatment effect: $\tau_i$ for group i.

$R_{ij}$ gives the distribution of values about the deterministic part of the model.

Constraints: $\tau_1 + \tau_2 + ... + \tau_t = 0$.

Example: the average of group 1 is 65 and the average of group 2 is 75.
The average of both groups combined is 70. $\mu + \tau_1$ is 65 and $\mu + \tau_2$ is 75, so $\hat{\tau}_1$ is -5 and $\hat{\tau}_2$ is 5.

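When there are only 2 treatments ($t = 2$):

$$\tilde{\tau}_1 \sim N(\tau_1,\frac{\sigma^2}{2r}), \quad \tilde{\mu} \sim N(\mu,\frac{\sigma^2}{2r})$$

with confidence interval: $\tau_1: \hat{\tau}_1 \pm c \cdot \sqrt{\frac{\hat{\sigma}^2}{2r}}$
degree: $n - q + c$ (t distribution), where $q$ is the number of parameters and $c$ in this expression is the number of constraints

with confidence interval: $\mu: \hat{\mu} \pm c \cdot \sqrt{\frac{\hat{\sigma}^2}{2r}}$
degree: $n - q + c$ (t distribution)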

Why the denominator is 2r:

$V(\tilde{\tau}_1) = V(\bar{Y}_{1+} - \bar{Y}_{++})$

$= V(\bar{Y}_{1+} - (\bar{Y}_{1+} + \bar{Y}_{2+})/2)$
$= V(\frac{1}{2}\bar{Y}_{1+} - \frac{1}{2}\bar{Y}_{2+})$
$= \frac{1}{4}V(\bar{Y}_{1+}) + \frac{1}{4}V(\bar{Y}_{2+})$ (the two group means are independent, so the variances add)
$= \frac{1}{4}\frac{\sigma^2}{r} + \frac{1}{4}\frac{\sigma^2}{r}$
$= \frac{\sigma^2}{2r}$

Note that when t is not 2, this variance changes as well.

Note: $V(\bar{Y}_{1+}) = \sigma^2 /$ (# of samples in group 1) $= \frac{\sigma^2}{r}$.

discrepancy:

grp1 =  c(50,53,52,58)
grp2 = c(62,55,58,60)

options(contrasts = c('contr.sum','contr.poly'))
Y = c(grp1,grp2)
x = as.factor(c(rep('1',4),rep('2',4))) # makes a discrete variable
model = lm(Y~x) # builds the model
summary(model)

estimate:
intercept: $\hat{\mu}$.
x1: $\hat{\tau}_1$.

residual standard error: $\hat{\sigma}$.
p-value: the intercept row assumes $H_0: \mu = 0$, $H_a: \mu \neq 0$.
the x1 row assumes $H_0: \tau_1 = 0$, $H_a: \tau_1 \neq 0$.

if we want to estimate the difference between two treatments:

$$\theta: \hat{\theta} \pm c \cdot SE$$

s.e. $= \hat{\sigma}\sqrt{\frac{2}{r}}$, which equals $\frac{\hat{\sigma}}{\sqrt{2}}$ in the example above ($r = 4$).
with confidence interval: $\tau_1 - \tau_2: \hat{\tau}_1 - \hat{\tau}_2 \pm c \cdot \sqrt{\frac{\hat{\sigma}^2}{2}}$
degree: $n - q + c$ (t distribution)

p-value: $H_0: \tau_1 = \tau_2$, $H_a: \tau_1 \neq \tau_2$.
discrepancy:

$$d = \frac{est - H_{0value}}{s.e.}$$

we have $d = \frac{\hat{\tau}_1 - \hat{\tau}_2 - \tau_0}{\frac{\hat{\sigma}}{\sqrt{2}}}$.
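As a cross-check on the Model 5 quantities, this R sketch recomputes them by hand for the grp1/grp2 example and compares them with `lm`. Two choices are assumptions made here, not taken from the notes: the sum-to-zero coding is passed through `lm`'s `contrasts` argument instead of `options()`, and the degrees of freedom $n - q + c$ are read as 8 observations, 3 parameters, and 1 constraint.

```r
# Model 5 (balanced CRD, t = 2) by hand
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58, 60)
Y <- c(grp1, grp2)
x <- factor(rep(c("1", "2"), each = 4))
r <- 4                                   # replicates per treatment

mu_hat   <- mean(Y)                      # overall mean
tau1_hat <- mean(grp1) - mu_hat          # treatment effect of group 1
tau2_hat <- mean(grp2) - mu_hat          # equals -tau1_hat under the constraint

# Residual sum of squares W; df read as n - q + c = 8 - 3 + 1 = 6 (assumption)
W         <- sum((grp1 - mean(grp1))^2) + sum((grp2 - mean(grp2))^2)
sigma_hat <- sqrt(W / 6)

se_tau1 <- sigma_hat / sqrt(2 * r)       # s.e. of tau1_hat (and of mu_hat)
se_diff <- sigma_hat * sqrt(2 / r)       # s.e. of tau1_hat - tau2_hat
d       <- (tau1_hat - tau2_hat) / se_diff
p       <- 2 * pt(-abs(d), df = 6)

# Same fit through lm with sum-to-zero contrasts
fit <- lm(Y ~ x, contrasts = list(x = "contr.sum"))
print(coef(summary(fit)))
print(c(d = d, p = p))
```

Here the intercept row of the `lm` output should match `mu_hat` and the `x1` row should match `tau1_hat` with standard error `se_tau1`.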



___

Model6:

Unbalanced CRD: the groups have different numbers of observations.


$$Y_{ij} = \mu + \tau_i + R_{ij}$$

where $R_{ij} \sim N(0,\sigma^2)$
and $i = 1,2,...,t$ (# of treatments), $j = 1,2,...,r_i$ (# of replicates for treatment i).

Constraints: $\sum_{i = 1}^{t} r_i \tau_i = 0$.

grp1 =  c(50,53,52,58)
grp2 = c(62,55,58)
Y = c(grp1,grp2)
x = as.factor(c(rep('1',4),rep('2',3)))

#Group Averages
grp_av = tapply(Y,x,mean,na.rm = T)
mu = mean(Y)

#Treatment Effects
tao1 = (grp_av - mean(Y))[1]
tao2 = (grp_av - mean(Y))[2]

#Estimated Sigma
sigma = summary(lm(Y~x))$sigma

#Values
sigma; tao1; tao2; mu
#3.447221, -2.178571, 2.904762, 55.42857

anova(lm(Y~x))


___

Model7:

randomized block design.
A block here is a group of units; for example, in Model 3 each pair of measurements whose difference is compared is one block.


$$Y_{ij} = \mu + \tau_i + \beta_j + R_{ij}$$

where $R_{ij} \sim N(0,\sigma^2)$
and $i = 1,2,...,t$ (# of treatments), $j = 1,2,...,r$ (# of blocks, i.e. replicates of each treatment).

$\beta_j$ is the $j^{th}$ block effect.

Constraints: $\sum_{i = 1}^{t} \tau_i = 0$. $\sum_{j = 1}^{r} \beta_j = 0$.
By least squares: $\hat{\beta}_j = \bar{y}_{+j} - \bar{y}_{++}$.

Data=read.table("blocked.csv",sep=",",header=T)

options(contrasts = c('contr.sum','contr.poly'))
attach(Data) # after attach, the column names can be used directly as variables,
# e.g. Treatment instead of Data$Treatment.
# Treatment, Block, and Value are all column names here.
Treatment = as.factor(Treatment)
Block = as.factor(Block)
Model = lm(Value~Treatment+Block)

#To look at the output, we type:
summary(Model)




Model8:

factorial design.
Example: radiation alone has an effect of 1/4 and chemo alone also has an effect of 1/4, but used together they have an effect of 5/6.
The design is used to check whether two factors have this kind of relationship (interaction).

$$Y_{ijk} = \mu + \tau_{ij} + R_{ijk}$$

where $R_{ijk} \sim N(0,\sigma^2)$
and $i = 1,2,...,l_1$ (# of levels of factor 1), $j = 1,2,...,l_2$ (# of levels of factor 2), $k = 1,2,...,r$.

by least squares: $W = \sum_{ijk} r_{ijk}^2 + \lambda(\sum_{ij}\tau_{ij})$
we get $\hat{\mu} = \bar{y}_{+++}$, $\hat{\tau}_{ij} = \bar{y}_{ij+} - \bar{y}_{+++}$, $\hat{\sigma}^2 = \frac{W}{rl_1l_2 - l_1l_2 - 1 + 1}$.




Model9:

factorial randomized block design.

$$Y_{ijk} = \mu + \tau_{ij} + \beta_k + R_{ijk}$$
$\beta_k$ is the block effect.

Constraints: $\sum_{ij} \tau_{ij} = 0$. $\sum_{k} \beta_k = 0$.

where $R_{ijk} \sim N(0,\sigma^2)$
and $i = 1,2,...,l_1$ (# of levels of factor 1), $j = 1,2,...,l_2$ (# of levels of factor 2), $k = 1,2,...,r$.

by least squares: $W = \sum_{ijk} r_{ijk}^2 + \lambda_1(\sum_{ij}\tau_{ij}) + \lambda_2(\sum_{k}\beta_{k})$
we get $\hat{\mu} = \bar{y}_{+++}$, $\hat{\tau}_{ij} = \bar{y}_{ij+} - \bar{y}_{+++}$, $\hat{\beta}_k = \bar{y}_{++k} - \bar{y}_{+++}$, $\hat{\sigma}^2 = \frac{W}{rl_1l_2 - l_1l_2 - r - 1 + 2}$.




anova

$$SS_{TOT} = SS_{TRT} + SS_{RES}$$
anova(model)
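The decomposition can be verified numerically. The sketch below, using the grp1/grp2 data from Model 5, computes the three sums of squares by hand and compares them with the `Sum Sq` column that `anova` reports; the variable names are made up for this example.

```r
# Check SS_TOT = SS_TRT + SS_RES on the Model 5 example data
grp1 <- c(50, 53, 52, 58)
grp2 <- c(62, 55, 58, 60)
Y <- c(grp1, grp2)
x <- factor(rep(c("1", "2"), each = 4))

grp_mean <- ave(Y, x)                    # each unit's own group mean
ss_tot <- sum((Y - mean(Y))^2)           # total variation about the overall mean
ss_trt <- sum((grp_mean - mean(Y))^2)    # variation explained by treatments
ss_res <- sum((Y - grp_mean)^2)          # residual variation within groups

tab <- anova(lm(Y ~ x))
print(tab)
print(c(ss_tot = ss_tot, ss_trt = ss_trt, ss_res = ss_res))
```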




___

sampling

SRS
We cannot collect data on the whole population, so we take a sample.

The biggest difference is that the formulas now include a finite population correction.
n is the sample size and N is the population size.


Example: model 1

confidence interval: $\mu: \bar{y} \pm \frac{c \cdot S}{\sqrt{n}}$
We have error $E = \frac{c \cdot \sigma}{\sqrt{n}}$, solve for n.

Using SRS:
confidence interval: $\mu: \bar{y} \pm \sqrt{1 - \frac{n}{N}} \frac{c \cdot \hat{\sigma}}{\sqrt{n}}$
We have error $E = \sqrt{1 - \frac{n}{N}} \frac{c \cdot \hat{\sigma}}{\sqrt{n}}$, solve for n.


Example: model 4

with confidence interval: $\pi: \hat{\pi} \pm c \cdot \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$

Using SRS:
confidence interval: $\pi: \hat{\pi} \pm c \cdot \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}\sqrt{1 - \frac{n}{N}}$

We have error $E = \sqrt{1 - \frac{n}{N}} \frac{c \cdot \hat{\sigma}}{\sqrt{n}}$, solve for n.
Note: $\hat{\sigma}^2 = \hat{\pi}(1-\hat{\pi})$.
Often we replace $\hat{\sigma}^2$ by 1/4 (the worst case is $\pi = 1/2$).


How to choose the sample size:

  1. Take a small sample, estimate σ\sigma.
  2. Find n using formula.
  3. Perform a large study with n units.
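Step 2 can be sketched in R. Squaring $E = \sqrt{1 - \frac{n}{N}} \frac{c \cdot \sigma}{\sqrt{n}}$ gives $E^2 = c^2\sigma^2(\frac{1}{n} - \frac{1}{N})$, so $n = 1 / (\frac{E^2}{c^2\sigma^2} + \frac{1}{N})$. The function name `srs_sample_size`, the normal multiplier, and the example numbers are all assumptions made for illustration.

```r
# Sample size for a target margin of error E under SRS.
# n = 1 / (E^2 / (c^2 * sigma^2) + 1 / N); N = Inf drops the correction.
srs_sample_size <- function(E, sigma, N = Inf, conf = 0.95) {
  cc <- qnorm(1 - (1 - conf) / 2)   # assumption: normal multiplier for c
  n0 <- (cc * sigma / E)^2          # size needed with no finite population correction
  ceiling(1 / (1 / n0 + 1 / N))     # round up to a whole unit
}

# Worst case for a proportion: sigma^2 = 1/4, so sigma = 0.5
n_inf <- srs_sample_size(E = 0.05, sigma = 0.5)            # very large population
n_fpc <- srs_sample_size(E = 0.05, sigma = 0.5, N = 1000)  # population of 1000
print(c(n_inf = n_inf, n_fpc = n_fpc))
```

With a finite population the required n is noticeably smaller, which is the practical payoff of the correction.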



regression

regression sampling
require linear relationship between x and y.

$$Y_i = \alpha + \beta(x_i - \bar{x}) + R_i$$

where $R_i \sim N(0,\sigma^2)$

by least squares: $W = \sum_{i} r_{i}^2 = \sum_{i} (y_i - \alpha - \beta(x_i - \bar{x}))^2$
we get $\hat{\alpha} = \bar{y}$, $\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{s_{xy}}{s_{xx}}$, $\hat{\sigma}_r^2 = \frac{W}{n - 1}$.
where $S_{xy} = \sum_i (y_i - \bar{y})(x_i - \bar{x})$, $s_{xy} = \frac{S_{xy}}{n - 1}$, $S_{xx} = \sum_i (x_i - \bar{x})^2$, $s_{xx} = \frac{S_{xx}}{n - 1}$.

estimators: those for $\alpha, \beta, \mu_x, \mu_y$ are all unbiased. $\tilde{\mu}_{reg}$ is a biased estimator for $\mu_y$.

confidence interval: $EST \pm c \cdot SE = \hat{\mu}_{reg} \pm c \sqrt{1 - n/N} \frac{\hat{\sigma}_r}{\sqrt{n}}$

attach(women)
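# assume we want to know the mean height
# assume we know the mean weight
mean(height) # unknown in practice
mean(weight) # known

# using SRSWOR, we take a sample of size 5 and use this as our estimate for the height:
# (the outputs shown below come from R < 3.6; in newer versions call
#  RNGkind(sample.kind = "Rounding") before set.seed(1) to reproduce them)
set.seed(1)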
sample_heights = sample(height,5) # n = 5
mean(sample_heights)
# [1] 63.4, this is \hat{\mu_{h}}
sd(sample_heights)
# [1] 3.209361, this is \hat{\sigma_{SRS}}

we can build the SRS CI: $\hat{\mu}_{h} \pm c \sqrt{1 - \frac{n}{N}} \frac{\hat{\sigma}_{SRS}}{\sqrt{n}}$

sample_weights = c(123,129,135,146,120)
mean(sample_weights)
# [1] 130.6
# We note that there is a linear relationship between height and weight.
plot(weight,height)

sample_weights = sample_weights-mean(sample_weights)
summary(lm(sample_heights~sample_weights))


The regression estimate is given by:

$$\hat{\mu}_{height} = \hat{\alpha} + \hat{\beta}(\mu_{weight} - \bar{x}) = 63.4 + 0.31(136.7333 - 130.6) = 65.3$$

build the regression CI: $\hat{\mu}_{reg} \pm c\sqrt{1 - \frac{n}{N}}\frac{\hat{\sigma}_{r}}{\sqrt{n}}$

The regression estimate is closer to the true value than the SRS estimate.
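The whole regression-estimate calculation can be reproduced without relying on the random seed. The sketch below starts from the five sampled weights listed in the notes and looks up their matching heights in the built-in `women` data; the normal multiplier for c and the 95% level are assumptions made here.

```r
# Regression estimate of mean height, by hand
x_s  <- c(123, 129, 135, 146, 120)              # sampled weights from the notes
y_s  <- women$height[match(x_s, women$weight)]  # their matching heights in `women`
n    <- length(y_s); N <- nrow(women)
mu_x <- mean(women$weight)                      # known population mean of x

b      <- sum((y_s - mean(y_s)) * (x_s - mean(x_s))) / sum((x_s - mean(x_s))^2)
mu_reg <- mean(y_s) + b * (mu_x - mean(x_s))    # alpha_hat + beta_hat * (mu_x - xbar)

W       <- sum((y_s - mean(y_s) - b * (x_s - mean(x_s)))^2)
sigma_r <- sqrt(W / (n - 1))                    # divisor n - 1 as in the notes
fpc     <- sqrt(1 - n / N)                      # finite population correction
cc      <- qnorm(0.975)                         # assumption: normal multiplier

ci_reg <- mu_reg    + c(-1, 1) * cc * fpc * sigma_r / sqrt(n)
ci_srs <- mean(y_s) + c(-1, 1) * cc * fpc * sd(y_s) / sqrt(n)
print(c(srs = mean(y_s), reg = mu_reg, truth = mean(women$height)))
print(rbind(ci_srs, ci_reg))
```

Because $\hat{\sigma}_r$ is much smaller than the SRS standard deviation in this sample, the regression interval is far narrower than the SRS one.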



___

ratio

ratio estimation
require linear relationship between x and y.
require an intercept of zero.

$$Y_i = \beta x_i + R_i$$

where $R_{i} \sim N(0,x_i \sigma^2)$

Note: the variance is multiplied by $x_i$, so in the computation we work with $\frac{x_i}{\sqrt{x_i}}$ and $\frac{y_i}{\sqrt{x_i}}$.

by least squares we have $\hat{\beta} = \frac{\bar{y}}{\bar{x}}$ and $\hat{\sigma}_{ratio}^2 = \frac{W}{n - 1}$. $\hat{\mu}_{ratio} = \frac{\bar{y}}{\bar{x}}\mu_x$.

with confidence interval: $\hat{\mu}_{ratio} \pm c \sqrt{1 - \frac{n}{N}} \frac{\hat{\sigma}_{ratio}}{\sqrt{n}}$

attach(women) # same data set

set.seed(1)
sample_heights = sample(height,5)
mean(sample_heights)
#[1] 63.4 # SRS estimate for \mu_y
sd(sample_heights)
#[1] 3.209361
sample_weights = c(123,129,135,146,120)
mean(sample_weights)
#[1] 130.6

plot(weight,height) # linear

Sqrt_weights = sqrt(sample_weights) # \sqrt{x_i}
sample_weights = sample_weights/Sqrt_weights # \frac{x_i}{sqrt{x_i}}

sample_heights = sample_heights/Sqrt_weights # \frac{y_i}{\sqrt{x_i}}
summary(lm(sample_heights~sample_weights-1))

The ratio estimate is given by: $\hat{\mu}_{height} = \hat{\beta} \cdot \mu_{weight} = \frac{\bar{y}}{\bar{x}} \cdot \mu_{weight} = \frac{63.4}{130.6}\mu_{weight}$

$= 0.48545\,\mu_{weight}$. Substituting $\mu_{weight} = 136.7333$ gives $0.48545(136.7333) = 66.4$.

build the ratio CI: $\hat{\mu}_{ratio} \pm c \sqrt{1 - \frac{n}{N}} \frac{\hat{\sigma}_{ratio}}{\sqrt{n}}$

The ratio estimate is closer to the true value than the SRS estimate. The ratio estimator is biased.


In the proportion case:

build the ratio CI: $\hat{\theta} \pm c \cdot \frac{1}{\hat{\pi}}\sqrt{1 - \frac{n}{N}}\frac{\hat{\sigma}_{ratio}}{\sqrt{n}}$ where $\hat{\sigma}_{ratio}^2 = \frac{\sum (y_i - \hat{\theta} z_i)^2}{n - 1}$.

$\hat{\theta}$ is the estimate for the group of interest (a subgroup that makes up a proportion of the whole population).


___

stratified

Estimate within subpopulations (strata); the subpopulations are independent of each other.

$$\mu = \frac{N_1\mu_1+ N_2\mu_2 + ... + N_H\mu_H}{N} = \sum_{i = 1}^{H}w_i\mu_i$$

Here $w_i$ is the weight, i.e. the share of the whole population that falls in stratum i.

confidence interval: $\hat{\mu} \pm c \sqrt{ \sum_{i=1}^H w_i^2\frac{\sigma_i^2}{n_i}(1 - \frac{n_i}{N_i}) }$
where $C \sim N(0,1)$.

$$\pi = \sum_{i = 1}^{H}w_i\pi_i$$

confidence interval: $\hat{\pi} \pm c \sqrt{ \sum_{i=1}^H w_i^2\frac{\sigma_i^2}{n_i}(1 - \frac{n_i}{N_i}) }$
where $C \sim N(0,1)$, $\sigma_i^2 = \pi_i(1 - \pi_i)$.
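A small numeric sketch of the stratified mean and its interval in R. All stratum sizes, sample sizes, means, and standard deviations below are hypothetical numbers invented for illustration, with the sample standard deviations standing in for $\sigma_i$.

```r
# Stratified estimate of a mean with two hypothetical strata
N_h    <- c(600, 400)   # stratum population sizes (hypothetical)
n_h    <- c(30, 20)     # stratum sample sizes (hypothetical)
ybar_h <- c(10, 20)     # stratum sample means (hypothetical)
s_h    <- c(3, 4)       # stratum sample standard deviations (hypothetical)

w      <- N_h / sum(N_h)                              # weights w_i = N_i / N
mu_hat <- sum(w * ybar_h)                             # weighted mean
v_hat  <- sum(w^2 * s_h^2 / n_h * (1 - n_h / N_h))    # variance with fpc per stratum
ci     <- mu_hat + c(-1, 1) * qnorm(0.975) * sqrt(v_hat)
print(c(mu_hat = mu_hat, se = sqrt(v_hat)))
print(ci)
```

For a proportion the same code applies with `ybar_h` replaced by the stratum proportions and `s_h^2` by $\hat{\pi}_i(1 - \hat{\pi}_i)$.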