权重
有时我们希望模型比某些数据点或示例更重要。通过在学习模型时指定输入数据的权重,可以实现这一点。通常有两种情况我们可能会在示例中使用非均匀权重:
-
分析权重:反映不同观察的不同精确度。例如,如果分析每个观测值是来自地理区域的平均结果的数据,则分析权重与估计方差的倒数成比例。通过在给定观察数量的情况下提供比例权重来处理数据中的平均值时非常有用。资源
-
抽样权重(反向概率权重 - IPW):一种统计技术,用于计算标准化为不同于收集数据的人口的统计数据。具有不同采样群体和目标推断群体(目标群体)的研究设计在应用中是常见的。处理缺少值的数据时很有用。资源
lm()
函数进行分析加权。对于抽样权重,survey
包用于构建调查设计对象并运行 svyglm()
。默认情况下,survey
包使用抽样权重。 (注意:lm()
和 svyglm()
与家族 gaussian()
将产生相同的点估计,因为它们都通过最小化加权最小二乘来求解系数。它们在如何计算标准误差方面不同。)
测试数据
data <- structure(list(lexptot = c(9.1595012302023, 9.86330744180814,
8.92372556833205, 8.58202430280175, 10.1133857229336), progvillm = c(1L,
1L, 1L, 1L, 0L), sexhead = c(1L, 1L, 0L, 1L, 1L), agehead = c(79L,
43L, 52L, 48L, 35L), weight = c(1.04273509979248, 1.01139605045319,
1.01139605045319, 1.01139605045319, 0.76305216550827)), .Names = c("lexptot",
"progvillm", "sexhead", "agehead", "weight"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
分析权重
lm.analytic <- lm(lexptot ~ progvillm + sexhead + agehead,
data = data, weight = weight)
summary(lm.analytic)
输出
Call:
lm(formula = lexptot ~ progvillm + sexhead + agehead, data = data,
weights = weight)
Weighted Residuals:
1 2 3 4 5
9.249e-02 5.823e-01 0.000e+00 -6.762e-01 -1.527e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.016054 1.744293 5.742 0.110
progvillm -0.781204 1.344974 -0.581 0.665
sexhead 0.306742 1.040625 0.295 0.818
agehead -0.005983 0.032024 -0.187 0.882
Residual standard error: 0.8971 on 1 degrees of freedom
Multiple R-squared: 0.467, Adjusted R-squared: -1.132
F-statistic: 0.2921 on 3 and 1 DF, p-value: 0.8386
采样重量(IPW)
library(survey)
data$X <- 1:nrow(data) # Create unique id
# Build survey design object with unique id, ipw, and data.frame
des1 <- svydesign(id = ~X, weights = ~weight, data = data)
# Run glm with survey design object
prog.lm <- svyglm(lexptot ~ progvillm + sexhead + agehead, design=des1)
输出
Call:
svyglm(formula = lexptot ~ progvillm + sexhead + agehead, design = des1)
Survey design:
svydesign(id = ~X, weights = ~weight, data = data)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.016054 0.183942 54.452 0.0117 *
progvillm -0.781204 0.640372 -1.220 0.4371
sexhead 0.306742 0.397089 0.772 0.5813
agehead -0.005983 0.014747 -0.406 0.7546 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.2078647)
Number of Fisher Scoring iterations: 2