Removing Low-Variance Features

Created: November-22, 2018

This is a very basic feature selection technique.

The underlying idea is that a constant feature (one with zero variance) cannot be used to find any interesting patterns and can be removed from the dataset.

A heuristic approach to feature elimination is therefore to first remove all features whose variance falls below some (low) threshold.
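The heuristic can be sketched directly in NumPy. This is only an illustration: the helper `drop_low_variance` and the toy data are hypothetical names introduced here, and the sketch assumes the population variance and a strict "greater than" comparison against the threshold.

```python
import numpy as np

def drop_low_variance(X, threshold=0.0):
    """Hypothetical helper: keep the columns of X whose variance exceeds threshold."""
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)      # population variance of each column
    keep = variances > threshold   # boolean mask of the columns to keep
    return X[:, keep], keep

# Toy data (made up for illustration): the middle column is constant.
toy = [[1.0, 5.0, 0.2],
       [2.0, 5.0, 0.4],
       [3.0, 5.0, 0.9]]

reduced, keep = drop_low_variance(toy)
print(keep.tolist())   # [True, False, True]
print(reduced.shape)   # (3, 2)
```

With the default threshold of 0, only the constant column is dropped; raising the threshold removes progressively more features.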

Building on the example from the scikit-learn documentation, suppose we start with

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

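Here there are 3 Boolean features, each with 6 instances. Suppose we want to remove those that are constant in at least 80% of the instances. Boolean features are Bernoulli random variables, whose variance is p(1 - p), so these features need to have a variance below 0.8 * (1 - 0.8) = 0.16. We can therefore use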

from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
# Output:
# array([[0, 1],
#        [1, 0],
#        [0, 0],
#        [1, 1],
#        [1, 0],
#        [1, 1]])

Note that the first feature has been removed.
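To see why only the first column is dropped, the per-feature variances can be checked by hand. The following is a small sketch using NumPy only; the variable names `p`, `bernoulli_var` and `kept` are introduced here for illustration.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
              [0, 1, 1], [0, 1, 0], [0, 1, 1]])

p = X.mean(axis=0)             # fraction of ones in each column: 1/6, 4/6, 3/6
bernoulli_var = p * (1 - p)    # variance of a Boolean (Bernoulli) feature
threshold = 0.8 * (1 - 0.8)    # 0.16

# The closed form agrees with the population variance computed by NumPy.
print(np.allclose(bernoulli_var, X.var(axis=0)))   # True

kept = (bernoulli_var > threshold).tolist()
print(kept)   # [False, True, True]
```

The first column is 0 in five of the six rows, so its variance is 5/36 (about 0.139), which is below 0.16. The other two columns have variances of 2/9 and 1/4 and are kept.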

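This method should be used with caution, because low variance does not necessarily mean that a feature is uninteresting. Consider the following example, in which we build a dataset with 3 features: the first two contain normally distributed variables, and the third contains a uniformly distributed variable.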

from sklearn.feature_selection import VarianceThreshold
import numpy as np

# generate dataset
np.random.seed(0)

feat1 = np.random.normal(loc=0, scale=.1, size=100) # normal dist. with mean=0 and std=.1
feat2 = np.random.normal(loc=0, scale=10, size=100) # normal dist. with mean=0 and std=10
feat3 = np.random.uniform(low=0, high=10, size=100) # uniform dist. in the interval [0,10)
data = np.column_stack((feat1,feat2,feat3))

data[:5]
# Output:
# array([[  0.17640523,  18.83150697,   9.61936379],
#        [  0.04001572, -13.47759061,   2.92147527],
#        [  0.0978738 , -12.70484998,   2.4082878 ],
#        [  0.22408932,   9.69396708,   1.00293942],
#        [  0.1867558 , -11.73123405,   0.1642963 ]]) 

np.var(data, axis=0)
# Output: array([  1.01582662e-02,   1.07053580e+02,   9.07187722e+00])

sel = VarianceThreshold(threshold=0.1)
sel.fit_transform(data)[:5]
# Output:
# array([[ 18.83150697,   9.61936379],
#        [-13.47759061,   2.92147527],
#        [-12.70484998,   2.4082878 ],
#        [  9.69396708,   1.00293942],
#        [-11.73123405,   0.1642963 ]])

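The first feature has now been removed because its variance is small, while the third feature (the least interesting one) has been kept. In this case it would be more appropriate to consider the coefficient of variation, since it is independent of scaling.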
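The dependence on scale can be illustrated with a minimal sketch. It assumes a hypothetical strictly positive feature measured in two units (`metres` and `centimetres` are made-up names), and defines the coefficient of variation as the standard deviation divided by the absolute mean. That ratio is only well behaved when the mean is clearly away from zero, which is why a positive feature is used here.

```python
import numpy as np

# Hypothetical feature: 101 evenly spaced values between 1 and 2.
metres = np.linspace(1.0, 2.0, 101)
centimetres = metres * 100          # the same feature in a different unit

def coefficient_of_variation(x):
    """Standard deviation divided by the absolute mean (undefined for mean 0)."""
    return np.std(x) / abs(np.mean(x))

threshold = 0.1

# The variance grows with the square of the unit, so the decision flips.
print(np.var(metres) > threshold)        # False
print(np.var(centimetres) > threshold)   # True

# The coefficient of variation is unchanged by the rescaling.
cv_m = coefficient_of_variation(metres)
cv_cm = coefficient_of_variation(centimetres)
print(np.isclose(cv_m, cv_cm))           # True
```

A variance threshold would keep or drop this feature depending only on the unit it is recorded in, whereas a cut-off on the coefficient of variation gives the same answer in both cases.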