Python Create Train, Test and Validation Set From Pandas Dataframe: A Beginner Guide

April 15, 2020

We often read data from a file into a pandas dataframe. How can we then extract train, test and validation sets to train a deep learning model? In this tutorial, we will discuss this topic.

Preliminary

First, we should read data from a file, such as a csv or excel file.

To read data from csv, you can read this tutorial:

A Beginner Guide to Python Pandas Read CSV

To read data from excel, you can read this tutorial:

Python Pandas read_excel() – Reading Excel File for Beginners

Here we use a simple example that reads a tab-separated file into a dataframe.

import pandas as pd
import numpy as np

df = pd.read_csv("test_member.csv", sep = '\t')
print(df)

The dataframe is:

   No     Name  Age
0   1      Tom   24
1   2     Kate   22
2   3    Alexa   34
3   4     Kate   23
4   5     John   45
5   6     Lily   41
6   7    Bruce   23
7   8      Lin   33
8   9    Brown   31
9  10  Alibama   20
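
If you do not have test_member.csv at hand, the same dataframe can be built in memory. This is a minimal sketch that simply copies the column names and values printed above:

```python
import pandas as pd

# Build the example dataframe directly from the values printed above,
# so the rest of the tutorial can be followed without the csv file.
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})
print(df)
```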

How to extract train, test and validation sets?

To extract train, test and validation sets, you should first shuffle the dataframe by rows.

df = df.sample(len(df))
print(df)

The randomized dataframe is:

   No     Name  Age
5   6     Lily   41
7   8      Lin   33
3   4     Kate   23
1   2     Kate   22
9  10  Alibama   20
4   5     John   45
6   7    Bruce   23
2   3    Alexa   34
0   1      Tom   24
8   9    Brown   31

To learn more about how to randomize a pandas dataframe, you can read:

Understand pandas.DataFrame.sample(): Randomize DataFrame By Row
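
The shuffle above gives a different row order on every run. If you want the split to be reproducible, one option is to pass random_state to sample(). In the sketch below, the seed 42 and the small two-column dataframe are only illustrative assumptions:

```python
import pandas as pd

# A small stand-in dataframe (illustrative values only).
df = pd.DataFrame({
    "No": range(1, 11),
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})

# frac=1 returns all rows in random order;
# random_state fixes that order so every run gives the same shuffle.
shuffled = df.sample(frac=1, random_state=42)

# Optionally renumber the index 0..n-1 after shuffling.
shuffled_reset = shuffled.reset_index(drop=True)

print(shuffled)
print(shuffled_reset)
```

Keeping the original index (as in shuffled) makes it easy to trace each row back to the source file, while resetting it is convenient if later code relies on a 0..n-1 index.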

Suppose the sizes of the train, test and validation sets are in the ratio 8:1:1. We can extract them as follows:

Train set

First, we calculate the lengths of the train and test sets. The validation set will take the remaining rows.

length = len(df)

train_len = int(0.8 * length)
print(train_len)
test_len = int(0.1 * length)
print(test_len)
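
Because int() rounds down, the three sizes do not always follow 8:1:1 exactly. The sketch below uses plain integers to show how the validation set picks up the remainder. The helper name split_sizes and the row counts 10 and 25 are hypothetical, chosen only for illustration:

```python
def split_sizes(length, train_frac=0.8, test_frac=0.1):
    """Return (train_len, test_len, dev_len) for a dataset with `length` rows."""
    train_len = int(train_frac * length)   # rounded down
    test_len = int(test_frac * length)     # rounded down
    dev_len = length - train_len - test_len  # whatever is left over
    return train_len, test_len, dev_len

print(split_sizes(10))  # (8, 1, 1)
print(split_sizes(25))  # (20, 2, 3)
```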

Then we can extract the train set:

train_set = df.iloc[:train_len]
print(train_set)

The train set is:

   No     Name  Age
5   6     Lily   41
7   8      Lin   33
3   4     Kate   23
1   2     Kate   22
9  10  Alibama   20
4   5     John   45
6   7    Bruce   23
2   3    Alexa   34

Extract test set

test_set = df.iloc[train_len: train_len + test_len]
print(test_set)

The test set is:

   No Name  Age
0   1  Tom   24

Extract validation set

dev_set = df.iloc[train_len + test_len:]
print(dev_set)

The validation set is:

   No   Name  Age
8   9  Brown   31
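
Putting the steps together, a helper like the following could do the whole split in one call. This is only a sketch: the function name split_dataframe, its parameters and the seed value are hypothetical choices for illustration, not part of pandas.

```python
import pandas as pd

def split_dataframe(df, train_frac=0.8, test_frac=0.1, seed=None):
    """Shuffle df by rows and return (train_set, test_set, dev_set)."""
    # Shuffle all rows; pass a seed to get the same split on every run.
    shuffled = df.sample(frac=1, random_state=seed)
    train_len = int(train_frac * len(shuffled))
    test_len = int(test_frac * len(shuffled))
    # Slice by position: train first, then test, the rest is validation.
    train_set = shuffled.iloc[:train_len]
    test_set = shuffled.iloc[train_len:train_len + test_len]
    dev_set = shuffled.iloc[train_len + test_len:]
    return train_set, test_set, dev_set

# Usage with a small stand-in dataframe (illustrative values only).
df = pd.DataFrame({"No": range(1, 11)})
train_set, test_set, dev_set = split_dataframe(df, seed=0)
print(len(train_set), len(test_set), len(dev_set))  # 8 1 1
```

Since the three slices are taken from one shuffled dataframe, no row can appear in more than one set.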
