为stat331附录的estimation相关代码.
Simple Linear Regression
我们通过Least Squares Estimation来estimate和.
(省略证明的步骤)
Maximum Likelihood Estimation和Least Squares Estimation估算的和是一样的.
当n大于50时,这两个没什么区别.
注:
同理
1 | #-------------------------------------------------------------------------- |
Confidence Intervals
注意confidence intervals分为知道sigma和不知道sigma.
知道sigma用normal distribution,不知道用t distribution.
通过MLE的distribution得知
在知道sigma的情况下,95%的
1.96是Z常数.
不知道sigma的情况下,我们使用,95%的
q由r计算.写作
1 | qt(p = 0.05/2, df = n-2, lower.tail = FALSE) |
虽然老师讲了很多的代码但不如直接用r自带的confint built-in function好了.
1 |
|
Hypothesis Testing
1 |
|
Multiple Linear Regression
也就是多个系数的model.
data:image/s3,"s3://crabby-images/8a957/8a957d58239fde3945038cc2008847e3d0bfe270" alt=""
data:image/s3,"s3://crabby-images/1ca8d/1ca8d2188f815836e5216726c1d38f096b8ccc29" alt=""
其中有系数是categorical的,需要特殊处理.
data:image/s3,"s3://crabby-images/39832/39832cd809f73bb343eccd613a6c0c26ed073943" alt=""
1 |
|
其中系数也可能有nolinear的情况.
data:image/s3,"s3://crabby-images/0e047/0e04700375b4ab6ad9f832b003a70e2887236cd4" alt=""
插播stat444老师的名言:
all models are false but some are useful.
在Assignment1中的例子,按照scatter plot看出跟exponential像,所以plot exponential.
1 | log.sigma2.y = log(sigma2.y) |
1 | plot(Week.x, sigma2.y , pch=19, cex.axis = 1.5, cex.lab = 1.5, ylim=c(0,36000)) |
从R-squared中看出数据比较大(偏差比较大),model可以再优化.
ANOVA
(这个在stat系列中有记载过,再次备份一下)
ANOVA: How much of the variability in Y is explained by our model?
data:image/s3,"s3://crabby-images/281c2/281c2e70f00d4fb05e78892cf5e4db07c2a19a2e" alt=""
data:image/s3,"s3://crabby-images/4cd4b/4cd4b3a8b1b829cf76fbb0fbfe4279294dc62a81" alt=""
data:image/s3,"s3://crabby-images/fd8b3/fd8b39e226d810b7682a3688b327504bb7af41f2" alt=""
data:image/s3,"s3://crabby-images/f8bb3/f8bb33fa9572d7dc4429a84fb8abb92d747e2dc0" alt=""
1 | ### LECTURE 11 |
1 |
|
data:image/s3,"s3://crabby-images/72eb9/72eb9cbaaba56b0f6a72a7328e80aeeab3d2acf2" alt=""
data:image/s3,"s3://crabby-images/6b818/6b81831c28aea3e6c35f60f301ff2b0d67addb2b" alt=""
1 |
|
Goodness of Fit
测试一个model是否合适可以通过测试以下数据:
,where.Interpretation: proportion of variability explained by the model.
Problem: will never decrease when adding more variables.
因此可以测试Adjusted.
data:image/s3,"s3://crabby-images/67c8c/67c8ccf55587864694501f9d1769f3a641787175" alt=""
此外也可以测试AIC.
data:image/s3,"s3://crabby-images/92e7d/92e7df68143fcf5224442514cd8df4f897fb8636" alt=""
1 |
|
Automatic Selection
一个model中有多个系数,其中建立model的方式有 将所有的系数一个一个加入 和 一个一个减去 两种,分为Forward Selection和Backward Elimination.
Forward Selection:
data:image/s3,"s3://crabby-images/87e64/87e64de994e53ff599e3308a181ef778ae4dbc56" alt=""
data:image/s3,"s3://crabby-images/e500c/e500c149e2a87816412626d3cd422d3953a847a9" alt=""
data:image/s3,"s3://crabby-images/58c90/58c909378f8527a5ebee5c01ecd8ef2239768a7a" alt=""
Backward Elimination:
data:image/s3,"s3://crabby-images/0cd35/0cd35a4591da7f44656922543268518c6fe2a78c" alt=""
data:image/s3,"s3://crabby-images/67805/67805e1af3b05a141171833e87febd4335fc26f0" alt=""
还有一种是combine了forward和backward,Stepwise Selection.
data:image/s3,"s3://crabby-images/64231/64231215ff334aa2c938d95a897f5dc5a7bb25b1" alt=""
1 | # ------ automatic selection ------------------ |
Cross-validation
cross validation的思想是把数据分组,一部分作为training set,来fit model,一部分作为test set,用来测试fit的model的误差.
k-fold cross validation是其中的一种:将k-1组作为traning set.
data:image/s3,"s3://crabby-images/4899b/4899be5808a94c781cfe4d0254a36ef73b5f600d" alt=""
leave-one-out cross validation则除了一个数据外将全部数据作为traning set,再计算其中的误差.
load data and fit models from lec 14:
1 |
|
1 |
|
1 |
|
k-fold cross validation stat444加强版:
这里也包括了LOOCV.
一般k-fold有更好的bais和variance trace off,但LOOCV会更快.
1 | # A function to generate the indices of the k-fold sets |
Linear Regression Assumptions:
- Linearity
- Independence
- Normality
- Equal variance (homoskedasticity)
code for checking them:
data:image/s3,"s3://crabby-images/65fef/65fefb79a2a1fe3d482a042a7a81cc398334da04" alt=""
1 |
|
data:image/s3,"s3://crabby-images/4d4d4/4d4d44d86f3c6c17bd7a7ec6a0c4752b8b24ff2f" alt=""
1 |
|
在不满足的情况下修改model:
- Linearity
- Instead include log(xj)
- Instead use a quadratic model (xj and xj2)
- Independence
- Estimates are still unbiased but standard errors are broken
=⇒ replace SEs with robust alternatives (sandwich form, GEE) - Explicitly model the dependence structure • Mixed effects models
- Normality
Violations of normality might not be a big deal
- E.g. if we have a large sample size
However, Normality is required for valid prediction intervals - Could consider transforming Y
E.g. model log(Y)
This again changes interpretations!
Might not be a problem if we’re interested in prediction - Could consider other regression approaches: GLMs, etc. (not covered in this course)
- Equal variance (homoskedasticity)
If our errors are heteroskedastic, we have a few options:
- Transform outcome (see above)
Variance stabilizing transform - Weighted Least Squares
- Bootstrap (time permitting…)
Weighted Least Squares:
1 | ## load packages |
1 |
|
stat444加强版选择weight:
data:image/s3,"s3://crabby-images/65d8d/65d8d318949d12024b031bfdb1bcaa0cc1eb617b" alt=""