Weights
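Sometimes we want the model to give more weight to some data points or examples than to others. This can be done by specifying weights for the input data when fitting the model. There are generally two scenarios in which we might use non-uniform weights over the examples: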
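- Analytic weights: reflect the different levels of precision of different observations. For example, when each observation is the average result from a geographic area, the analytic weight is proportional to the inverse of the estimated variance of that average. This is useful when working with averages in the data, because each average can be given a weight proportional to the number of observations behind it.
- Sampling weights (inverse probability weights, IPW): a statistical technique for calculating statistics standardized to a population different from the one in which the data were collected. Study designs in which the sampled population differs from the population of target inference (the target population) are common in applications. This is useful when working with data that have missing values.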
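The link between analytic weights and the number of observations behind an average can be checked with a small sketch. The data below are hypothetical and not part of the original example. Because the variance of a mean of n observations is proportional to 1/n, an inverse-variance weight is proportional to n. When the predictor is constant within each region, a weighted fit on the regional means should reproduce the coefficients of the fit on the individual observations.

```r
# Hypothetical individual-level data: three regions of different sizes,
# with a predictor x that is constant within each region.
individual <- data.frame(
  region = rep(c("A", "B", "C"), times = c(2, 5, 10)),
  x      = rep(c(1, 2, 3),       times = c(2, 5, 10)),
  y      = c(2.1, 1.7,
             3.9, 4.2, 4.4, 3.6, 4.0,
             5.2, 6.1, 5.8, 6.3, 5.5, 6.0, 5.9, 6.4, 5.7, 6.2)
)

# Collapse to one row per region: mean predictor, mean outcome, and the
# number of observations behind each mean.
regional <- data.frame(
  x = as.vector(tapply(individual$x, individual$region, mean)),
  y = as.vector(tapply(individual$y, individual$region, mean)),
  n = as.vector(tapply(individual$y, individual$region, length))
)

fit_individual <- lm(y ~ x, data = individual)             # full data
fit_weighted   <- lm(y ~ x, data = regional, weights = n)  # means, weighted by n
fit_unweighted <- lm(y ~ x, data = regional)               # means, all weighted equally

# Compare the three sets of coefficients side by side.
rbind(individual = coef(fit_individual),
      weighted   = coef(fit_weighted),
      unweighted = coef(fit_unweighted))
```

The weighted fit on three rows agrees with the fit on all seventeen rows, while the unweighted fit treats the two-observation region as being as informative as the ten-observation one.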
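The lm() function performs analytic weighting. For sampling weights, the survey package is used to build a survey design object and run svyglm(). By default, the survey package uses sampling weights. (Note: lm() and svyglm() with family gaussian() produce the same point estimates, because both solve for the coefficients by minimizing the weighted sum of squared residuals. They differ in how they compute standard errors.)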
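The weighted least squares criterion mentioned in the note has a closed-form solution: the coefficients solve the weighted normal equations (X'WX) b = X'Wy, where W is the diagonal matrix of weights. The following is a minimal sketch in base R, using the same five observations as the test data in this section. The object names d, X, W and beta_manual are introduced here for illustration only. It covers the point estimates; it does not reproduce the standard errors of either function.

```r
# Same five observations as the test data used in this section.
d <- data.frame(
  lexptot   = c(9.1595012302023, 9.86330744180814, 8.92372556833205,
                8.58202430280175, 10.1133857229336),
  progvillm = c(1, 1, 1, 1, 0),
  sexhead   = c(1, 1, 0, 1, 1),
  agehead   = c(79, 43, 52, 48, 35),
  weight    = c(1.04273509979248, 1.01139605045319, 1.01139605045319,
                1.01139605045319, 0.76305216550827)
)

X <- model.matrix(~ progvillm + sexhead + agehead, data = d)  # design matrix, with intercept
y <- d$lexptot
W <- diag(d$weight)                                           # diagonal weight matrix

# Solve the weighted normal equations (X'WX) b = X'Wy.
beta_manual <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)

# Reference fit: lm() with the same weights.
fit <- lm(lexptot ~ progvillm + sexhead + agehead, data = d, weights = weight)

# The two columns should agree up to numerical precision.
cbind(manual = drop(beta_manual), lm = unname(coef(fit)))
```

Any routine that minimizes this same criterion with the same weights arrives at these coefficients, which is why only the standard errors distinguish the two approaches.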
Test Data
data <- structure(list(lexptot = c(9.1595012302023, 9.86330744180814,
8.92372556833205, 8.58202430280175, 10.1133857229336), progvillm = c(1L,
1L, 1L, 1L, 0L), sexhead = c(1L, 1L, 0L, 1L, 1L), agehead = c(79L,
43L, 52L, 48L, 35L), weight = c(1.04273509979248, 1.01139605045319,
1.01139605045319, 1.01139605045319, 0.76305216550827)), .Names = c("lexptot",
"progvillm", "sexhead", "agehead", "weight"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
Analytic Weights
# Fit the model with analytic weights via lm()'s weights argument
lm.analytic <- lm(lexptot ~ progvillm + sexhead + agehead,
                  data = data, weights = weight)
summary(lm.analytic)
Output
Call:
lm(formula = lexptot ~ progvillm + sexhead + agehead, data = data,
    weights = weight)

Weighted Residuals:
         1          2          3          4          5
 9.249e-02  5.823e-01  0.000e+00 -6.762e-01 -1.527e-16

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.016054   1.744293   5.742    0.110
progvillm   -0.781204   1.344974  -0.581    0.665
sexhead      0.306742   1.040625   0.295    0.818
agehead     -0.005983   0.032024  -0.187    0.882

Residual standard error: 0.8971 on 1 degrees of freedom
Multiple R-squared:  0.467,  Adjusted R-squared:  -1.132
F-statistic: 0.2921 on 3 and 1 DF,  p-value: 0.8386
Sampling Weights (IPW)
library(survey)
data$X <- 1:nrow(data) # Create unique id
# Build survey design object with unique id, ipw, and data.frame
des1 <- svydesign(id = ~X, weights = ~weight, data = data)
# Run glm with survey design object
prog.lm <- svyglm(lexptot ~ progvillm + sexhead + agehead, design = des1)
# Print the coefficient table with design-based standard errors
summary(prog.lm)
Output
Call:
svyglm(formula = lexptot ~ progvillm + sexhead + agehead, design = des1)

Survey design:
svydesign(id = ~X, weights = ~weight, data = data)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.016054   0.183942  54.452   0.0117 *
progvillm   -0.781204   0.640372  -1.220   0.4371
sexhead      0.306742   0.397089   0.772   0.5813
agehead     -0.005983   0.014747  -0.406   0.7546
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.2078647)

Number of Fisher Scoring iterations: 2
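Apart from the regression above, the idea behind inverse probability weights can be illustrated with a weighted mean. The population, sample and inclusion probabilities below are hypothetical and chosen only to keep the arithmetic easy to follow. Each sampled unit receives a weight of 1/p, where p is its probability of being sampled, so that units from under-sampled groups count for more.

```r
# Hypothetical population of 10 units: 4 in group A (y = 10), 6 in group B (y = 20).
population <- data.frame(
  group = rep(c("A", "B"), times = c(4, 6)),
  y     = rep(c(10, 20),   times = c(4, 6))
)
pop_mean <- mean(population$y)

# Hypothetical sample: 2 units from each group, so the inclusion
# probabilities differ between groups (A: 2/4, B: 2/6).
sampled <- data.frame(
  group = c("A", "A", "B", "B"),
  y     = c(10, 10, 20, 20),
  p     = c(2/4, 2/4, 2/6, 2/6)
)

# Inverse probability weight: each sampled unit stands for 1/p population units.
sampled$ipw <- 1 / sampled$p

naive_mean <- mean(sampled$y)                        # ignores the unequal sampling
ipw_mean   <- weighted.mean(sampled$y, sampled$ipw)  # reweights to the population

c(population = pop_mean, naive = naive_mean, ipw = ipw_mean)
```

The unweighted sample mean over-represents group A, whereas the weighted mean recovers the population mean in this constructed case.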