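Classification in scikit-learn
1. Bagged Decision Trees
Bagging performs best with algorithms that have high variance. A popular example is decision trees, which are often built without pruning.
The example below uses BaggingClassifier with the Classification and Regression Trees algorithm (DecisionTreeClassifier). A total of 100 trees are created.
Dataset used: Pima Indians diabetes dataset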
# Bagged Decision Trees for Classification
import pandas
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# first 8 columns are the input features, the last column is the class label
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
seed = 7
# KFold now takes n_splits, and random_state is only accepted together with shuffle=True
kfold = model_selection.KFold(n_splits=num_folds, shuffle=True, random_state=seed)
cart = DecisionTreeClassifier()
num_trees = 100
# base learner is passed positionally: its keyword was renamed from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
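Running the example gives a cross-validated estimate of model accuracy. The value below is from the original run; the exact figure depends on the scikit-learn version and on how the folds are drawn.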
0.770745044429
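To make the bagging idea concrete, the following is a minimal hand-rolled sketch using only the standard library: each ensemble member is trained on a bootstrap sample (drawn with replacement), and predictions are combined by majority vote. The toy one-dimensional dataset, the midpoint-threshold learner `fit_stump`, and all function names are illustrative assumptions; this is not how BaggingClassifier is implemented internally.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Draw len(rows) rows with replacement, so some rows repeat and others are left out.
    return [rng.choice(rows) for _ in range(len(rows))]

def fit_stump(sample):
    # Toy base learner (hypothetical): threshold halfway between the two class means.
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        # Degenerate sample containing a single class: always predict that class.
        only = 1 if xs1 else 0
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0

def bagged_predict(models, x):
    # Majority vote over the ensemble members.
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(7)
# Made-up (feature, label) pairs; class 1 has the larger feature values.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
predictions = [bagged_predict(models, x) for x, _ in data]
accuracy = sum(p == y for p, (_, y) in zip(predictions, data)) / len(data)
print(accuracy)
```

Because every stump sees a slightly different resample, the individual thresholds differ; the vote averages out that variance, which is why bagging suits high-variance learners such as unpruned trees.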
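2. Random Forest
Random forest is an extension of bagged decision trees.
Samples of the training dataset are drawn with replacement, but the trees are built in a way that reduces the correlation between the individual classifiers. Specifically, rather than greedily choosing the best split point over all features, only a random subset of features is considered for each split.
You can build a random forest model for classification using the RandomForestClassifier class.
The example below builds a random forest for classification with 100 trees, with each split point chosen from a random selection of 3 features.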
# Random Forest Classification
import pandas
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# first 8 columns are the input features, the last column is the class label
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
seed = 7
num_trees = 100
max_features = 3
# KFold now takes n_splits, and random_state is only accepted together with shuffle=True
kfold = model_selection.KFold(n_splits=num_folds, shuffle=True, random_state=seed)
# random_state added so repeated runs build the same forest
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Running the example provides a mean estimate of classification accuracy.
0.770727956254
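The per-split feature sampling described in this section can be sketched as follows. This is a simplified, standard-library illustration under stated assumptions: the tiny dataset, the helper names `gini` and `best_split`, and the exhaustive threshold search are all made up for the example and do not mirror RandomForestClassifier's internals. Here `max_features` plays the same role as the parameter of that name in the listing above.

```python
import random

def gini(labels):
    # Gini impurity of a list of 0/1 class labels.
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split(rows, labels, feature_indices):
    # Search only the given features for the (feature, threshold) pair
    # with the lowest weighted Gini impurity.
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

rng = random.Random(7)
# Made-up data: 6 rows, 3 features, binary labels.
rows = [[1, 9, 3], [2, 8, 1], [3, 7, 4], [6, 2, 2], [7, 1, 5], [8, 3, 0]]
labels = [0, 0, 0, 1, 1, 1]
n_features = len(rows[0])
max_features = 2

# A plain CART split would search range(n_features); a random forest
# draws a fresh random subset of features for every split.
subset = rng.sample(range(n_features), max_features)
score, feature, threshold = best_split(rows, labels, subset)
print(subset, feature, threshold, score)
```

Restricting each split to a random subset means different trees often split on different features, which is what lowers the correlation between them.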
3. AdaBoost
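AdaBoost was perhaps the first successful boosting ensemble algorithm. It generally works by weighting the instances in the dataset by how easy or difficult they are to classify, so that the algorithm pays more or less attention to them when constructing subsequent models.
You can build an AdaBoost model for classification using the AdaBoostClassifier class.
The example below constructs 30 decision trees in sequence using the AdaBoost algorithm.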
# AdaBoost Classification
import pandas
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# first 8 columns are the input features, the last column is the class label
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
seed = 7
num_trees = 30
# KFold now takes n_splits, and random_state is only accepted together with shuffle=True
kfold = model_selection.KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Running the example provides a mean estimate of classification accuracy.
0.76045796309
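The instance weighting described in this section can be illustrated with a single round of the classic (discrete) AdaBoost update, written with the standard library only. This is a sketch of the textbook formulation; the function name `adaboost_reweight` and the five-instance example are assumptions, and scikit-learn's AdaBoostClassifier uses its own variant rather than this exact code. The sketch also assumes the weak learner's weighted error is strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, correct):
    # One round of the classic (discrete) AdaBoost weight update.
    # weights: current instance weights, summing to 1
    # correct: True where the current weak learner classified the instance correctly
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)  # vote strength of this learner
    # Shrink the weights of correctly classified instances, grow the others.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

# Five equally weighted instances; the weak learner gets the last one wrong.
weights = [0.2] * 5
correct = [True, True, True, True, False]
new_weights, alpha = adaboost_reweight(weights, correct)
print(new_weights, alpha)
```

After the update the single misclassified instance carries half of the total weight, so the next learner in the sequence is pushed to get it right.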
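4. Stochastic Gradient Boosting
Stochastic gradient boosting (also called gradient boosting machines) is one of the most sophisticated ensemble techniques. It is also proving to be perhaps one of the best techniques available for improving performance via ensembles.
You can build a gradient boosting model for classification using the GradientBoostingClassifier class.
The example below demonstrates stochastic gradient boosting for classification with 100 trees.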
# Stochastic Gradient Boosting Classification
import pandas
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# first 8 columns are the input features, the last column is the class label
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
seed = 7
num_trees = 100
# KFold now takes n_splits, and random_state is only accepted together with shuffle=True
kfold = model_selection.KFold(n_splits=num_folds, shuffle=True, random_state=seed)
# with the default subsample=1.0 every tree is fit on all rows
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Running the example provides a mean estimate of classification accuracy.
0.764285714286
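In GradientBoostingClassifier, the row subsampling that gives stochastic gradient boosting its name is controlled by the `subsample` parameter, which defaults to 1.0 (no subsampling). The sketch below sets it to 0.8 as an illustration. It assumes a recent scikit-learn with the `model_selection` module, and it uses synthetic data from `make_classification` as a stand-in for the Pima dataset so that it runs without a download; the sample size, fold count, tree count, and subsample value are arbitrary choices, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 300 rows, 8 features, 2 classes.
X, y = make_classification(n_samples=300, n_features=8, random_state=7)
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

# subsample < 1.0 fits each tree on a random fraction of the training rows,
# which is what makes the boosting "stochastic".
model = GradientBoostingClassifier(n_estimators=50, subsample=0.8, random_state=7)
scores = cross_val_score(model, X, y, cv=kfold)
print(scores.mean())
```

Comparing this score against a run with `subsample=1.0` on the same folds is a simple way to check whether subsampling helps on a given dataset.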
Source: http://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/