Understand sklearn.model_selection.train_test_split() with Examples – Scikit-Learn Tutorial

By | May 13, 2022

The sklearn.model_selection.train_test_split() function lets us split a dataset into a train set and a test set easily. In this tutorial, we will use examples to show you how to use it correctly.

sklearn.model_selection.train_test_split()

It is defined as:

sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

It splits arrays or matrices into random train and test subsets.

Here are some important parameters to note:

test_size: float or int; we usually use a float. A float must be between 0.0 and 1.0 and represents the proportion of the whole dataset that goes into the test subset. An int represents the absolute number of test samples. For example, we can set test_size = 0.1

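train_size: float or int. It works the same way as test_size, but for the train subset: a float is a proportion and an int is an absolute number of samples. If we set only one of the two, the other is set to the complement, so train_size = 0.9 has the same effect as test_size = 0.1.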

random_state: sets the random seed used for shuffling, for example: 42. Passing the same integer gives the same split every time we run the code.

shuffle: whether to shuffle the data before splitting. It is True by default.
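To make these parameters concrete, here is a minimal sketch on a small toy array (the array and the variable names are our own, chosen only for illustration). It compares a float test_size with an int test_size, checks that the same random_state gives the same split, and shows what shuffle=False does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 samples, 2 features.
data = np.arange(20).reshape(10, 2)

# A float test_size is a proportion: 0.2 of 10 samples gives 2 test samples.
train_a, test_a = train_test_split(data, test_size=0.2, random_state=42)

# An int test_size is an absolute count: exactly 3 test samples.
train_b, test_b = train_test_split(data, test_size=3, random_state=42)

# Using the same random_state again reproduces the same split.
train_c, test_c = train_test_split(data, test_size=0.2, random_state=42)
same_split = np.array_equal(test_a, test_c)

# With shuffle=False the order is kept, so the test set is the tail of the data.
train_d, test_d = train_test_split(data, test_size=0.2, shuffle=False)

print(train_a.shape, test_a.shape)
print(train_b.shape, test_b.shape)
print(same_split)
print(test_d)
```

With shuffle=False, the last rows of the array always end up in the test set, which can be useful for ordered data such as time series.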

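What does it return?

This function returns a list of subsets: a train part and a test part for each input array. If we pass one array, we get two subsets (train set and test set). If we pass two arrays, we get four.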
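The number of returned subsets depends on how many arrays we pass in: each input array contributes one train part and one test part. A small sketch, using toy arrays a, b and c that we made up for this check:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Three toy arrays of the same length (assumed data, for illustration only).
a = np.arange(10)
b = np.arange(10) * 10
c = np.arange(10) * 100

# One input array gives 2 outputs: [a_train, a_test].
parts_one = train_test_split(a, test_size=0.3, random_state=0)

# Three input arrays give 6 outputs, in the order
# [a_train, a_test, b_train, b_test, c_train, c_test].
parts_three = train_test_split(a, b, c, test_size=0.3, random_state=0)

print(len(parts_one))
print(len(parts_three))
```

All input arrays must have the same number of samples, otherwise the function raises an error.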

How to use this function?

Here we will use an example to show you how to use it. For example:

import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.random([300, 200])

X_train, X_test = train_test_split(x, train_size=0.9, random_state=42)
print(type(X_train))
print(X_train.shape)
print(type(X_test))
print(X_test.shape)

In this example, x is the whole dataset, which contains 300 samples with 200 features each. We split it randomly into a train set (90%) and a test set (10%).

Running this code, we will get:

<class 'numpy.ndarray'>
(270, 200)
<class 'numpy.ndarray'>
(30, 200)

Here we can see that the train set (X_train, 270 samples) and the test set (X_test, 30 samples) are returned.

Now look at the example code below, which splits two arrays at once:

import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.random([300, 200])
y = np.arange(300)

X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.9, random_state=42)

print(type(X_train))
print(X_train.shape)
print(type(X_test))
print(X_test.shape)

print(type(y_train))
print(y_train.shape)
print(type(y_test))
print(y_test.shape)
print(y_test)

In this code, we split two arrays, x and y, at the same time. The function returns 4 subsets.

X_train and X_test are split from the dataset x.

y_train and y_test are split from the dataset y. Both arrays are split with the same random indices, so each row in X_train still matches its entry in y_train.

Running this code, we will see:

<class 'numpy.ndarray'>
(270, 200)
<class 'numpy.ndarray'>
(30, 200)
<class 'numpy.ndarray'>
(270,)
<class 'numpy.ndarray'>
(30,)
[203 266 152   9 233 226 196 109   5 175 237  57 218  45 182 221 289 211
 148 165  78 113 249 250 104  42 281 295 157 238]
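We can also check that x and y stay aligned after the split. The sketch below is a small variation of the example above: as an assumption for this check only, we write each sample's index into the first column of x, so we can compare it with y after splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.arange(300)
x = np.random.random([300, 200])
# For this check only: store each sample's index in column 0 of x.
x[:, 0] = y

X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.9, random_state=42)

# Each row of X_train / X_test should still carry the index found in y_train / y_test.
train_aligned = np.array_equal(X_train[:, 0], y_train)
test_aligned = np.array_equal(X_test[:, 0], y_test)

# No sample should appear in both the train set and the test set.
overlap = np.intersect1d(y_train, y_test)

print(train_aligned, test_aligned)
print(overlap.size)
```

This is why we should pass features and labels to train_test_split() in one call instead of splitting them separately.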

Moreover, if your data is stored in an Excel or CSV file, we can use Python pandas to get the train and test sets.

Python Create Train, Test and Validation Set From Pandas Dataframe: A Beginner Guide
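As a quick sketch of the pandas case: train_test_split() also accepts a pandas DataFrame and returns DataFrames. The small table below is a made-up stand-in for data we would normally load with pd.read_csv() or pd.read_excel(); the column names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table standing in for data loaded from a CSV or Excel file.
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": [v * 0.5 for v in range(10)],
    "label": [0, 1] * 5,
})

# Rows are split; columns and the original index labels are kept.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(type(train_df))
print(train_df.shape, test_df.shape)
print(sorted(test_df.index))
```

The returned DataFrames keep the original index, so we can call reset_index(drop=True) on them if we want a fresh 0-based index.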
