This blog post is about the new interface of the cross-validation module in scikit-learn, which will become available very soon, as the work on PR #4294 is nearing completion and will hopefully get merged shortly.

The two main features of the new `model_selection` module are:

- Grouping together all the classes and functions related to model selection and evaluation, and
- Data-independent CV classes which, instead of taking data / data-dependent parameters at initialisation, expose a new `split(X, y, labels)` method which generates the splits based on the chosen strategy.

The second feature is of considerable importance to a lot of people, as it enhances the usability of CV objects. An important benefit of this second feature is that nested cross-validation can now be performed easily.
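As a quick sketch of what this data-independent design looks like in practice (using `KFold` on a tiny made-up array; the pattern is the same for the other CV classes):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The CV object is constructed with only its strategy parameters --
# no data is needed at initialisation time.
cv = KFold(5)

# The data is supplied later, when the splits are generated.
for train_index, test_index in cv.split(X, y):
    print("train:", train_index, "test:", test_index)
```

Because no data is baked into `cv`, the same object can be reused on any dataset, which is exactly what allows a meta-estimator like `GridSearchCV` to call `split` internally on each subset of the data it works with.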

To read more about nested cross-validation, refer to my previous blog post.

This paper by Cawley and Talbot, published in JMLR in 2010, also elaborates on the importance of nested cross-validation for model selection.
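To recap the structure before we use the new API for it, here is a bare-bones, hand-rolled sketch of nested cross-validation on synthetic data (`Ridge` and plain `KFold` are used purely for illustration; the data and parameter values are made up):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + 0.1 * rng.randn(60)

outer_scores = []
for train_idx, test_idx in KFold(5).split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    # Inner loop: select alpha using ONLY the outer training data.
    best_alpha, best_mean = None, -np.inf
    for alpha in [0.1, 1.0, 10.0]:
        inner_scores = [
            Ridge(alpha=alpha)
            .fit(X_train[in_tr], y_train[in_tr])
            .score(X_train[in_val], y_train[in_val])
            for in_tr, in_val in KFold(3).split(X_train)]
        if np.mean(inner_scores) > best_mean:
            best_alpha, best_mean = alpha, np.mean(inner_scores)
    # Outer loop: evaluate the selected model on data the
    # hyperparameter search never saw.
    final = Ridge(alpha=best_alpha).fit(X_train, y_train)
    outer_scores.append(final.score(X[test_idx], y[test_idx]))

print(len(outer_scores))  # -> 5, one held-out score per outer fold
```

The key point is that the outer test fold is never touched during hyperparameter selection, which is what makes the outer scores unbiased.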

Let us now work with the diabetes dataset and use the new API to build and evaluate different settings of the SVR using nested cross-validation. (In case you are wondering what SVRs are, this article explains them nicely. Do check it out!)

The diabetes dataset consists of 442 samples, each with 10 features as the data and the diabetes score as the target. It is a simple regression problem.
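We can quickly verify those numbers with a small sanity-check sketch:

```python
from sklearn.datasets import load_diabetes

dataset = load_diabetes()
print(dataset['data'].shape)    # -> (442, 10)
print(dataset['target'].shape)  # -> (442,)
```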

Let us *hypothetically* assume that this data was compiled from 10 patients, with multiple samples from each patient at different times. With this assumption, let us label the data samples based on the patient id (arbitrarily chosen), ranging from 1 to 10.

```python
>>> import numpy as np
>>> from sklearn.datasets import load_diabetes
>>> diabetes_dataset = load_diabetes()
>>> X, y = diabetes_dataset['data'], diabetes_dataset['target']
>>> # Store the labels as an array so they can be indexed by the
>>> # fold indices later on.
>>> labels = np.array([1]*50 + [2]*45 + [3]*60 + [4]*10 + [5]*25 + [6]*155
...                   + [7]*20 + [8]*10 + [9]*20 + [10]*47)
```

Now this *hypothetical* assumption has a *hypothetical* consequence. The samples tend to group/cluster around each patient, and there is a possibility that any model trained on such a dataset might overfit on those groups (patients) and predict the target well only for those patients whose data was used for training the model.

(To clarify why this is different compared to the regular overfitting problem, any model built on such a dataset could perform well on unseen data from the old patients (whose data was used in training), but perform poorly on unseen data from new patients (whose data was not used for training the model). Hence testing such a model even on the unseen data from old patients could potentially give us a biased estimate of the model's performance.)

So it becomes necessary to evaluate the model by holding out one patient's data and observing how the model trained on the rest of the patients' data generalizes to this held-out patient. (This gives us an unbiased estimate of the model's performance.)
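To make the idea concrete, here is a tiny hand-rolled sketch of the "hold out one patient at a time" scheme using plain numpy (the patient ids below are made up for illustration):

```python
import numpy as np

# Made-up labelling: 10 samples from 3 hypothetical patients.
patient_ids = np.array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3])

for patient in np.unique(patient_ids):
    test_mask = patient_ids == patient       # this patient's samples
    train_idx = np.flatnonzero(~test_mask)   # everyone else's samples
    test_idx = np.flatnonzero(test_mask)
    print("hold out patient %d: train on %d samples, test on %d"
          % (patient, len(train_idx), len(test_idx)))
```

Each patient's samples appear either entirely in the training set or entirely in the test set, never both.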

This can be easily done by using scikit-learn's `LeaveOneLabelOut` cross-validator.

**NOTE** - Nested cross-validation could be illustrated without this slightly convoluted hypothetical example. However, to appreciate the real benefit of having data independence in the CV iterator, I felt it would be better to show how we can flexibly choose different CV strategies for the inner and outer loops, even when one of them uses labels to generate the splits.
- Another similar example would be data collected from multiple similar instruments. Here the samples would be labelled with the instrument id.

To perform model selection we must explore the hyperparameter space of our model and choose the setting that has the best unbiased score (which means that our model generalizes well outside the training data).

Now let us do a grid search over a range of values for the three important hyperparameters `epsilon`, `C` and `gamma`. (Note that `gamma` is the rbf kernel's parameter and not a hyperparameter per se.)

```python
>>> epsilon_range = [0.1, 1, 10, 100, 1000]
>>> C_range = [0.1, 1, 10, 100]
>>> gamma_range = np.logspace(-2, 2, 5)
>>> parameter_grid = {'C': C_range, 'gamma': gamma_range, 'epsilon': epsilon_range}
```

Let us import `LeaveOneLabelOut`, `KFold`, `GridSearchCV` and `cross_val_score` from the new `model_selection` module...

```python
>>> from sklearn.svm import SVR
>>> from sklearn.model_selection import (
...     GridSearchCV, LeaveOneLabelOut, KFold, cross_val_score)
```

Let us use `LeaveOneLabelOut` for the inner CV and construct a `GridSearchCV` object with our `parameter_grid`...

```python
>>> inner_cv = LeaveOneLabelOut()
>>> grid_search = GridSearchCV(SVR(kernel='rbf'),
...                            param_grid=parameter_grid,
...                            cv=inner_cv)
```

And use `KFold` with `n_folds=5` for the outer CV...

```python
>>> outer_cv = KFold(n_folds=5)
```

We now use `cross_val_score` to evaluate the grid search on each fold of the outer split, so that we can analyse the best models for variance in their scores or parameters. This gives us a picture of how much we can trust the best model(s).

```python
>>> cross_val_score(
...     grid_search, X=X, y=y,
...     fit_params={'labels': labels},
...     cv=outer_cv)
array([ 0.40955022,  0.55578469,  0.4796581 ,  0.43532192,  0.55993554])
```

Good, the scores seem similar-ish, with a variance of less than +/- 0.1.

Let us do a little more analysis to find out what the best parameters are at each fold... This allows us to check whether there is any variance in the model parameters between the different folds...

```python
>>> for i, (tuning_set, validation_set) in enumerate(outer_cv.split(X, y)):
...     X_tuning_set, y_tuning_set, labels_tuning_set = (
...         X[tuning_set], y[tuning_set], labels[tuning_set])
...     grid_search.fit(X_tuning_set, y_tuning_set, labels_tuning_set)
...     print("The best params for fold %d are %s,"
...           " the best inner CV score is %s,"
...           " The final validation score for the best model "
...           "of this fold is %s\n"
...           % (i + 1, grid_search.best_params_, grid_search.best_score_,
...              grid_search.score(X[validation_set],
...                                y[validation_set])))
The best params for fold 1 are {'epsilon': 10, 'C': 100, 'gamma': 1.0}, the best inner CV score is 0.446247290221, The final validation score for the best model of this fold is 0.409550217773
The best params for fold 2 are {'epsilon': 10, 'C': 100, 'gamma': 10.0}, the best inner CV score is 0.454263206641, The final validation score for the best model of this fold is 0.555784686655
The best params for fold 3 are {'epsilon': 0.1, 'C': 100, 'gamma': 10.0}, the best inner CV score is 0.456009154539, The final validation score for the best model of this fold is 0.479658102225
The best params for fold 4 are {'epsilon': 10, 'C': 100, 'gamma': 10.0}, the best inner CV score is 0.465195173573, The final validation score for the best model of this fold is 0.43532192105
The best params for fold 5 are {'epsilon': 0.1, 'C': 100, 'gamma': 10.0}, the best inner CV score is 0.44172440761, The final validation score for the best model of this fold is 0.559935541672
```

So the 5 best models are similar, and hence we choose `epsilon` as 10, `gamma` as 10 and `C` as 100 for our final model.

The main thing to note here is how the new API makes it easy to pass `inner_cv` to `GridSearchCV`, making it really simple to perform nested CV using two different types of CV strategies in just 2 lines of code.


```python
>>> grid_search = GridSearchCV(SVR(kernel='rbf'),
...                            param_grid=parameter_grid,
...                            cv=LeaveOneLabelOut())
>>> cross_val_score(
...     grid_search, X=X, y=y,
...     fit_params={'labels': labels},
...     cv=KFold(n_folds=5))
array([ 0.40955022,  0.55578469,  0.4796581 ,  0.43532192,  0.55993554])
```

EDIT (30th October 2015): The model_selection module has been merged along with the documentation!