Cross-validation: KFold and StratifiedKFold with examples

Shahad Mahmud

4 years ago


In this post, we are going to discuss KFold and StratifiedKFold with some practical examples. When working with a supervised machine learning model, we have data with features and labels (targets). During training, we give the model one portion of the data and keep another portion to test it. Thus, we need to split the dataset into training and testing subsets.

The simplest approach is to split the dataset into two portions, such as an 80%-20% split: 80% of the data is used to train the model and the remaining 20% is used to test its performance. The data can be split randomly or at a fixed position, for example the first 80% for training and the last 20% for testing. A problem with this approach is that the model never trains on the held-out 20% of the data, so it might miss important patterns that appear only in that portion. We can use cross-validation to overcome this problem.
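As a rough sketch of the single 80%-20% split described above, the snippet below shows both a random split and a fixed-position split. The toy arrays, the variable names, and the seed value are assumptions made for illustration; they are not from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Random 80%-20% split; random_state (an assumed seed) makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fixed-position split: first 80% for training, last 20% for testing.
split_at = int(0.8 * len(X))
X_train_fixed, X_test_fixed = X[:split_at], X[split_at:]

print(X_train.shape, X_test.shape)              # (8, 2) (2, 2)
print(X_train_fixed.shape, X_test_fixed.shape)  # (8, 2) (2, 2)
```

In both cases the two test samples are never seen during training, which is the limitation cross-validation addresses.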

In a nutshell, cross-validation partitions the dataset into n splits (folds). In each iteration, n-1 splits are used for training and the remaining split is used for testing. The model cycles through the dataset n times, holding out a different split each time. As a result, every data point is used for both training and testing. Cross-validation also gives a more reliable estimate of a model’s performance, particularly on new, previously unseen data points.
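To make the "every data point is used for both training and testing" claim concrete, here is a minimal sketch that counts how often each index lands in a training or test split. The 10-sample toy array and the counter names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples, 1 feature.
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)
test_counts = np.zeros(len(X), dtype=int)
train_counts = np.zeros(len(X), dtype=int)

# Count how many times each sample is used for testing and for training.
for train_index, test_index in kf.split(X):
    test_counts[test_index] += 1
    train_counts[train_index] += 1

print(test_counts)   # [1 1 1 1 1 1 1 1 1 1]
print(train_counts)  # [4 4 4 4 4 4 4 4 4 4]
```

With n = 5, each sample is tested exactly once and trained on in the other n-1 = 4 iterations.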

KFold and StratifiedKFold are commonly used for cross-validation tasks. Let’s discuss the differences between these.

KFold

As the name suggests, this method splits the dataset into k consecutive folds or groups. As a result, the process is frequently referred to as k-fold cross-validation. When a specific number is chosen for k, it may be used in place of k in the name of the method; for example, k=5 gives 5-fold cross-validation.

When using scikit-learn’s KFold API, we can specify the number of folds to use, whether to shuffle the data before splitting, and a random state. By default, shuffle is set to False, so the dataset is split consecutively. As an example, consider splitting a dataset of 15 data points into 5 folds with the default settings.
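A minimal sketch of such a split is shown here. The placeholder data (the integers 0 to 14) and the variable names are assumptions, since only the indices matter for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical placeholder data: 15 samples, 1 feature.
X = np.arange(15).reshape(-1, 1)

# 5 folds; shuffle defaults to False, so the folds are consecutive.
kf = KFold(n_splits=5)
folds = list(kf.split(X))

for train_index, test_index in folds:
    print("Train index:", train_index, "Test index:", test_index)
```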

It has the following output:

Train index: [ 3  4  5  6  7  8  9 10 11 12 13 14] Test index: [0 1 2]
Train index: [ 0  1  2  6  7  8  9 10 11 12 13 14] Test index: [3 4 5]
Train index: [ 0  1  2  3  4  5  9 10 11 12 13 14] Test index: [6 7 8]
Train index: [ 0  1  2  3  4  5  6  7  8 12 13 14] Test index: [ 9 10 11]
Train index: [ 0  1  2  3  4  5  6  7  8  9 10 11] Test index: [12 13 14]

So, we can see that it has created 5 consecutive folds or groups of indexes. For the first run, the first 3 data points are used for testing, and the remaining 12 data points form the training dataset. For each iteration, a consecutive set of data points is used as the test dataset and the remaining data points are used to train the model. Now let’s set shuffle to True. The random state has an effect only when shuffle is set to True, so we also set a value for the random state.
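A sketch of the shuffled variant might look like this. The seed used in the original example is not stated, so random_state=42 here is an assumed value and the exact indices this snippet prints may differ from the listing in this post. The placeholder data is again hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical placeholder data: 15 samples, 1 feature.
X = np.arange(15).reshape(-1, 1)

# shuffle=True randomizes which samples go into each fold;
# random_state=42 is an assumed seed, not the one from the original post.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(kf.split(X))

for train_index, test_index in folds:
    print("Train index:", train_index, "Test index:", test_index)
```

Whatever seed is chosen, every index still appears in exactly one test fold.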

The output is as follows:

Train index: [ 0  1  3  4  5  6  7  8  9 10 11 13] Test index: [ 2 12 14]
Train index: [ 0  1  2  3  4  5  6  7  9 10 12 14] Test index: [ 8 11 13]
Train index: [ 0  1  2  4  5  6  8  9 11 12 13 14] Test index: [ 3  7 10]
Train index: [ 1  2  3  5  7  8  9 10 11 12 13 14] Test index: [0 4 6]
Train index: [ 0  2  3  4  6  7  8 10 11 12 13 14] Test index: [1 5 9]

Clearly, we can see that this time the test sets are generated randomly, and each training set consists of the remaining data points. The random_state makes the shuffle reproducible, so the same folds are produced on every run. It can be any integer you like.

StratifiedKFold

First, let’s understand stratification. Stratification is the process of arranging data so that each fold is an accurate representation of the whole. For example, if a multiclass classification problem contains 4 classes with 25% of the data each, the folds should be divided so that each one follows the original class distribution.

Now, what is StratifiedKFold? It is a variation of KFold that returns stratified folds: each fold preserves, as closely as possible, the percentage of samples of each class found in the full dataset. While using the StratifiedKFold API, we can specify the number of folds to use, whether to shuffle the data, and a random state. By default, shuffle is set to False and the data is not shuffled. With shuffle=True, the samples within each class are shuffled once before being split into folds, and random_state controls that shuffling so the result is reproducible.

Now, let’s create an imbalanced dataset and test StratifiedKFold. We take 15 data points for binary classification. Of these, 10 data points have label = 1 and 5 data points have label = 0, so the imbalance ratio is 2:1. After creating the dataset, we split it with StratifiedKFold.
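The setup above could be sketched as follows. The feature values are placeholders and random_state=42 is an assumed seed (the original seed is not stated), so the exact indices printed may differ from the listing in this post, although the per-fold class ratio will not.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 15 samples: indices 0-9 have label 1, indices 10-14 have label 0 (2:1 ratio).
y = np.array([1] * 10 + [0] * 5)
X = np.arange(15).reshape(-1, 1)  # hypothetical placeholder features

# shuffle=True shuffles within each class; the seed is an assumed value.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(X, y))

for train_index, test_index in folds:
    print("Train index:", train_index, "Test index:", test_index)
```

Note that, unlike KFold, the split method of StratifiedKFold needs the labels y in order to stratify.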

In this dataset, the data points at indices 10, 11, 12, 13, and 14 have target 0. Running the code, we get the following output.

Train index: [ 0  1  2  3  4  6  7  8 10 11 13 14] Test index: [ 5  9 12]
Train index: [ 0  2  3  5  6  7  8  9 11 12 13 14] Test index: [ 1  4 10]
Train index: [ 0  1  2  3  4  5  7  9 10 12 13 14] Test index: [ 6  8 11]
Train index: [ 0  1  2  4  5  6  8  9 10 11 12 13] Test index: [ 3  7 14]
Train index: [ 1  3  4  5  6  7  8  9 10 11 12 14] Test index: [ 0  2 13]

We can see that each training set has 12 data points. Of these, 8 belong to class 1 and 4 belong to class 0, which maintains the same 2:1 ratio as the whole dataset. The ratio is also maintained in the test sets, each of which has 2 data points from class 1 and 1 data point from class 0.

That’s all for this post. I hope this post will help you to enrich your understanding of KFold and StratifiedKFold. I am a learner who is learning new things and trying to share with others. Let me know your thoughts on this post. You can get more machine learning-related posts here.