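Remove highly correlated features
Highly correlated features can increase a model's variance, and removing one feature from each correlated pair may help reduce it. There are many ways to detect correlation. Here is one: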
library(purrr) # provides keep() and re-exports the %>% pipe
# select the variables that can be correlated (numeric columns only)
toCorrelate <- mtcars %>% keep(is.numeric)
# calculate the correlation matrix
correlationMatrix <- cor(toCorrelate)
# the matrix is symmetric, so zero out the upper triangle to count each pair only once
correlationMatrix[upper.tri(correlationMatrix)] <- 0
# zero out the diagonal too, since every feature correlates perfectly with itself
diag(correlationMatrix) <- 0
# flag features that are highly correlated with another feature at the +/- 0.85 level
apply(correlationMatrix, 2, function(x) any(abs(x) >= 0.85))
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
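I would want to see what mpg is correlated with, then decide what to keep and what to toss. The same goes for cyl and disp. Alternatively, I might need to combine some of the strongly correlated features.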
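One way to follow up on those flags is to list the actual pairs behind them. The sketch below is only one possible approach, using base R and the same 0.85 cutoff; the name highPairs is illustrative and not part of the code above.

```r
# Sketch: list the feature pairs behind each TRUE flag (base R only).
# highPairs is a hypothetical name introduced for this example.
correlationMatrix <- cor(mtcars)  # every mtcars column is numeric

# Keep only the lower triangle so each pair appears once and
# self-correlations on the diagonal are ignored.
correlationMatrix[upper.tri(correlationMatrix, diag = TRUE)] <- 0

# Row/column positions of the entries at or beyond the cutoff.
hits <- which(abs(correlationMatrix) >= 0.85, arr.ind = TRUE)

# One row per highly correlated pair: the flagged feature (column),
# its partner (row), and the correlation between them.
highPairs <- data.frame(
  feature     = colnames(correlationMatrix)[hits[, "col"]],
  partner     = rownames(correlationMatrix)[hits[, "row"]],
  correlation = correlationMatrix[hits],
  row.names   = NULL
)

print(highPairs)
```

Each row names a flagged feature and the partner it is strongly correlated with, which makes it easier to decide which member of a pair to drop or whether the two are candidates for combining.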