KFold/StratifiedKFold/GroupKFold
1. sklearn.model_selection.KFold
1.1 Looping over KFold().split(X) to get the splits
1.2 Passing KFold() as the cv argument of cross_validate
2. sklearn.model_selection.StratifiedKFold
3. sklearn.model_selection.GroupKFold
1. sklearn.model_selection.KFold
1.1 Looping over KFold().split(X) to get the splits
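from sklearn.model_selection import KFold

X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
'''
KFold ignores the distribution of the labels (y).
shuffle: whether to shuffle the data before splitting it into folds.
random_state: used only when shuffle=True; fixing it keeps the folds identical across runs.
'''
kf = KFold(n_splits=5, shuffle=False)
for train, test in kf.split(X):  # y is not needed here (and is not defined yet)
    print(train, test)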
'''
[2 3 4 5 6 7 8 9] [0 1]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]
'''
kf = KFold(n_splits=5, shuffle=True)
# No random_state is set, so the folds change from run to run; the output below is one example.
for train, test in kf.split(X):
    print(train, test)
'''
[0 1 2 4 5 6 7 9] [3 8]
[1 2 3 4 5 7 8 9] [0 6]
[0 1 3 4 6 7 8 9] [2 5]
[0 1 2 3 5 6 8 9] [4 7]
[0 2 3 4 5 6 7 8] [1 9]
'''
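The random_state note above can be checked with a minimal sketch: two KFold objects built with the same seed should produce identical folds. The seed value 42 and the helper name fold_indices are illustrative choices, not part of the original example.

```python
from sklearn.model_selection import KFold

X = list(range(10))

def fold_indices(seed):
    """Return the test indices of every fold as plain lists (hypothetical helper)."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in kf.split(X)]

# Build the folds twice with the same seed and compare them.
first = fold_indices(42)
second = fold_indices(42)
print(first == second)

# Shuffling only changes which samples share a fold:
# every sample still lands in exactly one test fold.
print(sorted(i for fold in first for i in fold) == X)
```

Both checks should print True. Without random_state, the first comparison would generally fail because each KFold object shuffles differently.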
1.2 Passing KFold() as the cv argument of cross_validate
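This subsection names the usage without showing it, so here is a minimal sketch of what it might look like. The iris dataset, the LogisticRegression model, and the seed are assumptions picked only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Assumed toy data and model, chosen only for illustration.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# The iris rows are sorted by class, so shuffle before splitting.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_validate uses the splitter to generate the folds and fits one model per fold.
results = cross_validate(model, X, y, cv=cv, scoring="accuracy")

print(results["test_score"])         # one accuracy value per fold
print(results["test_score"].mean())  # average over the 5 folds
```

Passing a splitter object instead of an integer gives explicit control over shuffling and seeding, which a bare cv=5 does not expose.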
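2. sklearn.model_selection.StratifiedKFold
Purpose: after splitting, the training and test sets keep approximately the same class distribution as the original data. In other words: class proportions in the original labels ≈ class proportions in the training labels ≈ class proportions in the validation labels.
Related post: [sklearn] Model ensembling with stacking: the cv parameter of StackingClassifier/StackingRegressor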
from sklearn.model_selection import StratifiedKFold

X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
# Class 0 has only 4 samples, fewer than n_splits=5, so scikit-learn warns
# and one test fold ends up without any class-0 sample.
skf = StratifiedKFold(n_splits=5, shuffle=False)
for train, test in skf.split(X, y):
    print(train, test)
'''
[1 2 3 5 6 7 8 9] [0 4]
[0 2 3 4 6 7 8 9] [1 5]
[0 1 3 4 5 7 8 9] [2 6]
[0 1 2 4 5 6 8 9] [3 7]
[0 1 2 3 4 5 6 7] [8 9]
'''
skf = StratifiedKFold(n_splits=5, shuffle=True)
for train, test in skf.split(X, y):
    print(train, test)
'''
[0 1 2 4 5 6 7 8] [3 9]
[0 1 3 4 6 7 8 9] [2 5]
[1 2 3 4 5 6 8 9] [0 7]
[0 2 3 4 5 6 7 9] [1 8]
[0 1 2 3 5 7 8 9] [4 6]
'''
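To see the class-proportion claim more directly, the following sketch measures the share of class-1 labels in every training and test part. The label array (8 samples of class 0, 12 of class 1), n_splits=4, and the seed are assumptions chosen so that the classes divide evenly across the folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed toy labels: 8 samples of class 0 and 12 of class 1 (40% / 60%).
y = np.array([0] * 8 + [1] * 12)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

ratios = []
for train, test in skf.split(X, y):
    # With 0/1 labels, the mean is the fraction of class-1 samples.
    train_ratio = y[train].mean()
    test_ratio = y[test].mean()
    ratios.append((train_ratio, test_ratio))
    print(f"train={train_ratio:.2f} test={test_ratio:.2f}")
```

In this setup every line should report 0.60 for both parts, matching the 60% share in the full label array. When a class count does not divide evenly by n_splits, the proportions are only approximately preserved.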
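3. sklearn.model_selection.GroupKFold
In the scikit-learn version used here, n_splits is the only constructor parameter.
Purpose: guarantee that samples from the same group never appear in both the training set and the test set. In other words, all samples of a group go either to the training set or to the test set.
Why it matters: if samples from one group are used for both training and testing, the model can learn that group's characteristics and score well on the test set, yet perform poorly on groups it has never seen.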
from sklearn.model_selection import GroupKFold

X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
groups = [1, 1, 1, 2, 3, 3, 4, 4, 5, 5]
gkf = GroupKFold(n_splits=5)
for train, test in gkf.split(X, y, groups=groups):
    print(train, test)
'''
[3 4 5 6 7 8 9] [0 1 2]
[0 1 2 3 4 5 6 7] [8 9]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 4 5 6 7 8 9] [3]
'''
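The no-overlap guarantee can be verified with a short sketch that reuses the same X, y, and groups and compares the group ids on each side of every split. The variable name overlaps is introduced here only for illustration.

```python
from sklearn.model_selection import GroupKFold

X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
groups = [1, 1, 1, 2, 3, 3, 4, 4, 5, 5]

gkf = GroupKFold(n_splits=5)

overlaps = []
for train, test in gkf.split(X, y, groups=groups):
    train_groups = {groups[i] for i in train}
    test_groups = {groups[i] for i in test}
    # An empty intersection means no group is split across train and test.
    overlaps.append(train_groups & test_groups)
    print(sorted(test_groups), train_groups & test_groups)
```

Each line should show the group held out for testing followed by an empty set, confirming that no group appears on both sides of a split.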