The sklearn.model_selection.train_test_split() function lets us split a dataset into a train set and a test set easily. In this tutorial, we will use examples to show you how to use it correctly.
sklearn.model_selection.train_test_split()
It is defined as:
sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
It will split arrays or matrices into random train and test subsets.
Here are some important parameters we should notice:
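test_size: float or int. If it is a float, it should be between 0.0 and 1.0 and represents the proportion of the whole dataset to put in the test subset, for example test_size = 0.1. If it is an int, it is the absolute number of test samples. If both test_size and train_size are None, test_size defaults to 0.25.
train_size: float or int. It works the same way as test_size, but for the train subset. We only need to set one of the two; the other is set to its complement, i.e. train_size = 1.0 - test_size.
random_state: controls the shuffling applied before the split. Pass an int, for example 42, to get the same split on every run.
shuffle: whether to shuffle the data before splitting. It is True by default.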
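To make these parameters more concrete, here is a minimal sketch. The small array named data and the chosen sizes are only assumptions for illustration; they compare a float test_size, an int test_size and shuffle=False.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A tiny illustrative dataset: 10 samples with 2 features each
data = np.arange(20).reshape(10, 2)

# float test_size: a proportion of the dataset (20% of 10 samples = 2)
train_a, test_a = train_test_split(data, test_size=0.2, random_state=42)

# int test_size: an absolute number of test samples
train_b, test_b = train_test_split(data, test_size=3, random_state=42)

# shuffle=False: the original order is kept, so the last rows form the test set
train_c, test_c = train_test_split(data, test_size=0.2, shuffle=False)

print(train_a.shape, test_a.shape)
print(train_b.shape, test_b.shape)
print(test_c)
```

With shuffle=False the split is deterministic, so random_state is not needed in the third call.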
How about the return?
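This function returns a list that contains a train subset and a test subset for each input array, so its length is twice the number of arrays we pass in. With one array, we get two subsets: the train set and the test set.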
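As a quick check of how the return value grows with the number of inputs, the sketch below passes three arrays. The names features, labels and weights are hypothetical and only serve this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.random.random([100, 5])
labels = np.arange(100)
weights = np.ones(100)  # hypothetical per-sample weights

# Three input arrays give six outputs, ordered as train/test pairs
parts = train_test_split(features, labels, weights, test_size=0.1, random_state=42)

print(len(parts))
for part in parts:
    print(part.shape)
```

The outputs come in the same order as the inputs: the train and test parts of the first array, then those of the second array, and so on.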
How to use this function?
Here we will use an example to show you how to use it. For example:
import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.random([300, 200])

X_train, X_test = train_test_split(x, train_size=0.9, random_state=42)

print(type(X_train))
print(X_train.shape)
print(type(X_test))
print(X_test.shape)
In this example, x is the whole set, which contains 300 samples. We will split it into a train set (0.9) and a test set (0.1) randomly.
Running this code, we will get:
<class 'numpy.ndarray'>
(270, 200)
<class 'numpy.ndarray'>
(30, 200)
Here we can see that the train set (X_train) and the test set (X_test) are returned.
Look at the example code below:
import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.random([300, 200])
y = np.arange(300)

X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.9, random_state=42)

print(type(X_train))
print(X_train.shape)
print(type(X_test))
print(X_test.shape)
print(type(y_train))
print(y_train.shape)
print(type(y_test))
print(y_test.shape)
print(y_test)
In this code, we split two datasets, x and y, at the same time. The function returns 4 subsets.
X_train and X_test are split from the dataset x.
y_train and y_test are split from the dataset y. The same rows are selected from both, so each sample in X_train or X_test keeps its matching value in y_train or y_test.
Running this code, we will see:
<class 'numpy.ndarray'>
(270, 200)
<class 'numpy.ndarray'>
(30, 200)
<class 'numpy.ndarray'>
(270,)
<class 'numpy.ndarray'>
(30,)
[203 266 152 9 233 226 196 109 5 175 237 57 218 45 182 221 289 211 148 165 78 113 249 250 104 42 281 295 157 238]
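Because y in the example above is np.arange(300), each value in y_test is also the row index of the matching sample in x. The sketch below uses that fact to check three things: x and y are split on the same rows, the same random_state gives the same split again, and the train and test sets do not overlap. The variable names aligned, reproducible and disjoint are just for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.random([300, 200])
y = np.arange(300)

X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.9, random_state=42)

# y holds the original row indices, so indexing x with them should rebuild the subsets
aligned = np.array_equal(x[y_test], X_test) and np.array_equal(x[y_train], X_train)

# Calling the function again with the same random_state should give the same split
_, _, _, y_test_again = train_test_split(x, y, train_size=0.9, random_state=42)
reproducible = np.array_equal(y_test, y_test_again)

# No sample should appear in both the train set and the test set
disjoint = len(np.intersect1d(y_train, y_test)) == 0

print(aligned, reproducible, disjoint)
```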
Moreover, if your data is stored in an Excel or CSV file, you can use Python pandas to get the train and test sets.
Python Create Train, Test and Validation Set From Pandas Dataframe: A Beginner Guide
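As a small illustration of the pandas case, train_test_split() also accepts a DataFrame and returns DataFrames. The sketch below builds a tiny DataFrame in memory instead of reading a file, and the column names feature_a, feature_b and label are hypothetical. With real data, you would usually load the DataFrame first, for example with pd.read_csv().

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny in-memory DataFrame standing in for data loaded from a CSV or Excel file
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "label": [0, 1] * 5,
})

# The split keeps whole rows together and returns two DataFrames
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(type(train_df))
print(train_df.shape)
print(test_df.shape)
```

The original row index is kept in both results, so we can still trace each row back to df.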