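Data preparation

The class imbalance problem

In this problem, for example, there are a little over 400 samples of class 0 but more than 1,500 samples of class 1. Training a classifier directly on this data causes trouble: a classifier that simply labels everything as 1 already achieves a high accuracy, even though it has learned nothing about class 0.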
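To make the problem concrete, here is a small sketch with made-up labels. The counts of 400 and 1,500 are assumed round numbers based on the figures above, not the actual dataset. A "classifier" that always answers 1 reaches about 79% accuracy while never recognizing a single class-0 sample.

```python
# Hypothetical labels with roughly the class counts described above
y_true = [0] * 400 + [1] * 1500

# A degenerate "classifier" that always predicts the majority class
y_pred = [1] * len(y_true)

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for class 0: fraction of true class-0 samples predicted as 0
n_class0 = sum(t == 0 for t in y_true)
recall_class0 = sum(t == 0 and p == 0
                    for t, p in zip(y_true, y_pred)) / n_class0

print('Accuracy: {:.3f}'.format(accuracy))
print('Recall for class 0: {:.3f}'.format(recall_class0))
```

This is why accuracy alone is a poor yardstick here, and why the F-measure and AUC are reported later on.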

SMOTE oversampling

```python
import collections

# First split the data into a training set and a test set
# (xx is the feature matrix and yy the labels)
from sklearn.model_selection import train_test_split
train_xx, test_xx, train_yy_before, test_yy = train_test_split(xx, yy, test_size=0.3)

# Use SMOTE oversampling to fix the class imbalance (training set only)
from imblearn.over_sampling import SMOTE
smo = SMOTE()
train_xx, train_yy = smo.fit_resample(train_xx, train_yy_before)

print('Original data: {}'.format(collections.Counter(yy)))
print('Test set: {}'.format(collections.Counter(test_yy)))
print('Training set: {}'.format(collections.Counter(train_yy_before)))
print('Oversampled training set: {}'.format(collections.Counter(train_yy)))
```
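For intuition about what SMOTE does to the minority class: it picks a minority sample, picks one of its nearest minority-class neighbors, and creates a synthetic point somewhere on the line segment between the two. The following is a simplified NumPy sketch of that idea only, not imblearn's implementation. The function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances between minority samples
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of every sample
    neighbors = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                # random minority sample
        nb = neighbors[base, rng.integers(k)] # one of its neighbors
        gap = rng.random()                    # position on the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nb] - X_minority[base])
    return synthetic

# Toy minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(X_min, n_new=6, k=2)
print(new_points.shape)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies.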

EasyEnsembleClassifier

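This ensemble classifier is said to handle class imbalance, but in practice it felt pretty useless, and in the end I switched back to SMOTE. Using EasyEnsembleClassifier is simple: just pass your original classifier to it as a parameter: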

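```python
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score

# Wrap the original classifier in EasyEnsembleClassifier.
# Note: newer imbalanced-learn releases (0.10 and later, as far as I know)
# call this parameter estimator instead of base_estimator.
ee_ada = EasyEnsembleClassifier(n_estimators=20, base_estimator=AdaBoostClassifier())
# ee_ada = AdaBoostClassifier()
ee_ada.fit(train_xx, train_yy)

print("Accuracy: {}, F-measure: {}".format(
    ee_ada.score(test_xx, test_yy),
    f1_score(test_yy, ee_ada.predict(test_xx))))
# plot_AUC is the helper function defined at the end of this post
plot_AUC(ee_ada, test_xx, test_yy)
```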

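Specific classifiers

AdaBoost
XGBoost
RandomForestClassifier (there is both a regressor and a classifier; don't mix them up)
GradientBoostingClassifier
DecisionTreeClassifier
BernoulliNB
SVM
VotingClassifier (an ensemble classifier; supports soft or hard voting)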
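The difference between soft and hard voting mentioned for VotingClassifier can be illustrated with a few lines of NumPy. The probabilities below are made up for a single sample and three hypothetical classifiers. They are chosen so that the two schemes disagree.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample from three classifiers;
# columns are [P(class 0), P(class 1)]
probas = np.array([
    [0.45, 0.55],   # classifier A leans slightly towards class 1
    [0.40, 0.60],   # classifier B leans slightly towards class 1
    [0.95, 0.05],   # classifier C is very confident in class 0
])

# Hard voting: each classifier casts one vote for its most likely class
votes = probas.argmax(axis=1)
hard_result = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_result = mean_proba.argmax()

print('Hard voting picks class', hard_result)
print('Soft voting picks class', soft_result)
```

Hard voting follows the two-to-one majority, while soft voting lets the one very confident classifier outweigh the two lukewarm ones. Soft voting requires every member to provide predict_proba.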

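They are all simple to use and follow the same four steps:

Import the relevant packages
Instantiate the classifier
Train with fit(x, y), where y holds the labels
Predict with predict()

For tree-based models such as RandomForestClassifier and GradientBoostingClassifier, the feature_importances_ attribute shows how much each variable contributes to the classification. Not every classifier in the list provides it; BernoulliNB, SVM, and VotingClassifier do not.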
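As a minimal sketch of this workflow, here is RandomForestClassifier run end to end. The data comes from make_classification and is only a synthetic stand-in, since the post's own xx and yy are not shown. The variable names are illustrative.

```python
# 1. Import the relevant packages
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the post's dataset)
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 2. Instantiate the classifier
model = RandomForestClassifier(n_estimators=30, random_state=0)

# 3. Train with fit(x, y), where y holds the labels
model.fit(X_train, y_train)

# 4. Predict
y_pred = model.predict(X_test)

# Tree-based models expose feature_importances_:
# one value per feature, and the values sum to 1
importances = model.feature_importances_
print(importances)
```

Swapping in another classifier from the list usually only changes the import and the instantiation line.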

Here is example code for VotingClassifier:

```python
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Soft-voting ensemble of three boosted models
clf = VotingClassifier(
    estimators=[
        ('xgboost',
         XGBClassifier(use_label_encoder=False,
                       eval_metric=['logloss', 'auc', 'error'])),
        ('adaboost', AdaBoostClassifier()),
        # ('random_tree', RandomForestClassifier(n_estimators=30)),
        ('gradient_boost', GradientBoostingClassifier()),
        # ('decision', tree.DecisionTreeClassifier()),
    ],
    voting='soft')

# Wrap the voting ensemble in EasyEnsembleClassifier
ee_clf = EasyEnsembleClassifier(n_estimators=20, base_estimator=clf)
ee_clf.fit(train_xx, train_yy)

print("Accuracy: {}, F-measure: {}".format(
    ee_clf.score(test_xx, test_yy),
    f1_score(test_yy, ee_clf.predict(test_xx))))
plot_AUC(ee_clf, test_xx, test_yy)
```

The imports used along the way:

```python
from sklearn import metrics, svm, tree
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import f1_score, accuracy_score

from imblearn.ensemble import EasyEnsembleClassifier

import xgboost
from xgboost import XGBClassifier
```

Finally, the function that plots the ROC curve and its AUC:

```python
import matplotlib.pyplot as plt
from sklearn import metrics

def plot_AUC(model, X_test, y_test):
    """Plot the ROC curve of a fitted binary classifier and show its AUC."""
    # Predicted probability of the positive class (column 1)
    probs = model.predict_proba(X_test)
    preds = probs[:, 1]
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)

    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')  # chance-level diagonal
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
```
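For reference, metrics.auc computes the area under the (fpr, tpr) points with the trapezoidal rule. The sketch below reproduces that calculation by hand in NumPy. The helper name `auc_trapezoid` and the ROC points are hypothetical and serve only as an illustration.

```python
import numpy as np

def auc_trapezoid(fpr, tpr):
    """Area under a ROC curve given as points, using the trapezoidal rule."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    widths = np.diff(fpr)                    # horizontal step between points
    mean_heights = (tpr[1:] + tpr[:-1]) / 2  # average height over each step
    return float(np.sum(widths * mean_heights))

# Hypothetical ROC points, sorted by increasing false positive rate
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
roc_auc = auc_trapezoid(fpr, tpr)
print('AUC = %0.2f' % roc_auc)

# The chance-level diagonal drawn in red by plot_AUC has an area of 0.5
diagonal_auc = auc_trapezoid([0.0, 1.0], [0.0, 1.0])
```

An AUC close to 0.5 therefore means the classifier ranks samples no better than chance, which is the comparison the dashed diagonal makes visible.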