Python Create Train, Test and Validation Set From Pandas Dataframe: A Beginner Guide

By | April 15, 2020

We often read data from a file into a pandas dataframe. How do we then extract train, test and validation sets to train a deep learning model? In this tutorial, we will discuss this topic.

Preliminary

First, we should read data from a file, such as a CSV or Excel file.

To read data from a CSV file, you can read this tutorial:

A Beginner Guide to Python Pandas Read CSV

To read data from an Excel file, you can read this tutorial:

Python Pandas read_excel() – Reading Excel File for Beginners

Here we use a simple example to illustrate how to create a dataframe.

import pandas as pd
import numpy as np

df = pd.read_csv("test_member.csv", sep = '\t')
print(df)

The dataframe is:

   No     Name  Age
0   1      Tom   24
1   2     Kate   22
2   3    Alexa   34
3   4     Kate   23
4   5     John   45
5   6     Lily   41
6   7    Bruce   23
7   8      Lin   33
8   9    Brown   31
9  10  Alibama   20
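
If you do not have test_member.csv at hand, the same dataframe can be rebuilt from an in-memory string. This is only a sketch: the names rows and tsv_text are made up for illustration, and the values are copied from the output above.

```python
import io

import pandas as pd

# Rows copied from the dataframe printed above; the first tuple is the header.
rows = [
    ("No", "Name", "Age"),
    (1, "Tom", 24),
    (2, "Kate", 22),
    (3, "Alexa", 34),
    (4, "Kate", 23),
    (5, "John", 45),
    (6, "Lily", 41),
    (7, "Bruce", 23),
    (8, "Lin", 33),
    (9, "Brown", 31),
    (10, "Alibama", 20),
]

# Join the values with tabs so the text looks like a tab-separated file.
tsv_text = "\n".join("\t".join(str(value) for value in row) for row in rows)

# read_csv accepts a file-like object, so StringIO stands in for the file.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df)
```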

How to extract train, test and validation sets?

To extract train, test and validation sets, you should first shuffle the rows of the dataframe.

df = df.sample(len(df))
print(df)

The randomized dataframe is shown below. The row order is random, so your output will differ from run to run.

   No     Name  Age
5   6     Lily   41
7   8      Lin   33
3   4     Kate   23
1   2     Kate   22
9  10  Alibama   20
4   5     John   45
6   7    Bruce   23
2   3    Alexa   34
0   1      Tom   24
8   9    Brown   31
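
If you need the same shuffled order on every run, one option is to pass random_state to sample(). The sketch below uses a small stand-in dataframe and an arbitrary seed of 42; both are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for the dataframe loaded above (same columns, fewer rows).
df = pd.DataFrame({
    "No": [1, 2, 3, 4],
    "Name": ["Tom", "Kate", "Alexa", "John"],
    "Age": [24, 22, 34, 45],
})

# frac=1 returns every row in random order; random_state fixes that order.
shuffled = df.sample(frac=1, random_state=42)

# Optionally replace the shuffled index labels with 0..n-1.
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```

Without reset_index(drop=True), the original row labels are kept, as in the output above.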

 

To learn more about how to randomize a pandas dataframe, you can read:

Understand pandas.DataFrame.sample(): Randomize DataFrame By Row

We will split the data into train, test and validation sets in the ratio 8:1:1. We can extract them as follows:

Train set

First, we calculate the lengths of the train and test sets. The validation set will take the remaining rows.

length = len(df)

# 80% of the rows go to the train set and 10% to the test set
train_len = int(0.8 * length)
print(train_len)
test_len = int(0.1 * length)
print(test_len)
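
Because int() drops the fractional part, train_len + test_len can be smaller than 90% of the rows when the length is not a multiple of 10. The validation set then absorbs the extra rows. The helper below, split_lengths, is a hypothetical name used only to illustrate the arithmetic.

```python
def split_lengths(length, train_ratio=0.8, test_ratio=0.1):
    """Return (train, test, validation) sizes for a dataset of `length` rows."""
    train_len = int(train_ratio * length)  # int() truncates toward zero
    test_len = int(test_ratio * length)
    dev_len = length - train_len - test_len  # validation takes the remainder
    return train_len, test_len, dev_len


print(split_lengths(10))  # (8, 1, 1)
print(split_lengths(25))  # (20, 2, 3)
```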

Then we can extract the train set:

train_set = df[0: train_len]
print(train_set)

The train set is:

   No     Name  Age
5   6     Lily   41
7   8      Lin   33
3   4     Kate   23
1   2     Kate   22
9  10  Alibama   20
4   5     John   45
6   7    Bruce   23
2   3    Alexa   34

Extract test set

test_set = df[train_len: train_len+test_len]
print(test_set)

The test set is:

   No Name  Age
0   1  Tom   24

Extract validation set

dev_set = df[train_len+test_len:]
print(dev_set)

The validation set is:

   No   Name  Age
8   9  Brown   31
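
The steps above can be gathered into a single function. This is a minimal sketch, not the only way to do it: split_dataframe and its parameters are hypothetical names, the sample dataframe is a stand-in, and iloc is used so that rows are selected explicitly by position.

```python
import pandas as pd


def split_dataframe(df, train_ratio=0.8, test_ratio=0.1, seed=None):
    """Shuffle df and split it into train, test and validation parts."""
    # Shuffle all rows; a fixed seed makes the split repeatable.
    shuffled = df.sample(frac=1, random_state=seed)

    train_len = int(train_ratio * len(shuffled))
    test_len = int(test_ratio * len(shuffled))

    # iloc slices by position, independent of the shuffled index labels.
    train_set = shuffled.iloc[:train_len]
    test_set = shuffled.iloc[train_len:train_len + test_len]
    dev_set = shuffled.iloc[train_len + test_len:]  # remaining rows
    return train_set, test_set, dev_set


# Stand-in data with 10 rows, like the example in this tutorial.
df = pd.DataFrame({
    "No": range(1, 11),
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})

train_set, test_set, dev_set = split_dataframe(df, seed=0)
print(len(train_set), len(test_set), len(dev_set))  # 8 1 1
```

Every row lands in exactly one of the three sets, because the slices do not overlap and together cover the whole shuffled dataframe.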
