We often read data from a file into a pandas dataframe. How do we then extract train, test and validation sets to train a deep learning model? In this tutorial, we will walk through one way to do it.
Preliminary
We first need to read data from a file, such as a CSV or Excel file.
To read data from a CSV file, you can read this tutorial:
A Beginner Guide to Python Pandas Read CSV
To read data from an Excel file, you can read this tutorial:
Python Pandas read_excel() – Reading Excel File for Beginners
Here we use a simple example to illustrate how to create a dataframe.
- import pandas as pd
- import numpy as np
- df = pd.read_csv("test_member.csv", sep = '\t')
- print(df)
The dataframe is:
- No Name Age
- 0 1 Tom 24
- 1 2 Kate 22
- 2 3 Alexa 34
- 3 4 Kate 23
- 4 5 John 45
- 5 6 Lily 41
- 6 7 Bruce 23
- 7 8 Lin 33
- 8 9 Brown 31
- 9 10 Alibama 20
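If you do not have test_member.csv at hand, the same dataframe can be built directly in memory. This is only a sketch for following along: the column names and values are copied from the output above, and the file itself is not needed.

```python
import pandas as pd

# Same rows as the printed dataframe above, built in memory
# so the rest of the tutorial can run without test_member.csv.
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})
print(df.shape)  # (10, 3)
```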
How to extract train, test and validation set?
To extract train, test and validation sets, you should first shuffle the rows of the dataframe.
- df = df.sample(len(df))
- print(df)
The randomized dataframe is (the row order will differ on each run):
- No Name Age
- 5 6 Lily 41
- 7 8 Lin 33
- 3 4 Kate 23
- 1 2 Kate 22
- 9 10 Alibama 20
- 4 5 John 45
- 6 7 Bruce 23
- 2 3 Alexa 34
- 0 1 Tom 24
- 8 9 Brown 31
To learn more about how to randomize a pandas dataframe, you can read:
Understand pandas.DataFrame.sample(): Randomize DataFrame By Row
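Because the shuffle is random, each run produces a different split. If you need the same split every time, one option is to pass random_state to sample(). The sketch below uses a small stand-in dataframe and an arbitrary seed of 42; both are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"No": range(1, 11)})  # stand-in data for illustration

# frac=1 samples every row exactly once, which is a full shuffle.
# random_state fixes the order so the shuffle is repeatable.
shuffled = df.sample(frac=1, random_state=42)
shuffled_again = df.sample(frac=1, random_state=42)
print(shuffled.equals(shuffled_again))  # True

# Optionally renumber the index 0..n-1 after shuffling.
shuffled_reset = shuffled.reset_index(drop=True)
```

Keeping the original index (no reset_index) makes it easy to trace each row back to the source file, which is what the outputs in this tutorial show.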
Suppose the train, test and validation sets should be split in the ratio 8:1:1. We can extract them as follows:
Extract train set
We first calculate the lengths of the train and test sets.
- length = len(df)
- train_len = int(0.8 * length)
- print(train_len)
- test_len = int(0.1 * length)
- print(test_len)
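Since int() truncates, the lengths do not always add up to the full dataframe. The sketch below wraps the same arithmetic in a helper (split_sizes is a hypothetical name, not a pandas function) and gives whatever is left over to the validation set.

```python
def split_sizes(length, train_frac=0.8, test_frac=0.1):
    """Return (train_len, test_len, dev_len) for a dataset of `length` rows."""
    train_len = int(train_frac * length)  # int() truncates toward zero
    test_len = int(test_frac * length)
    dev_len = length - train_len - test_len  # remainder goes to validation
    return train_len, test_len, dev_len

print(split_sizes(10))  # (8, 1, 1)
print(split_sizes(25))  # (20, 2, 3)
```

With 25 rows, 10% is 2.5, so the test set gets 2 rows and the validation set gets the remaining 3.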
Then we can extract the train set:
- train_set = df[0: train_len]
- print(train_set)
The train set is:
- No Name Age
- 5 6 Lily 41
- 7 8 Lin 33
- 3 4 Kate 23
- 1 2 Kate 22
- 9 10 Alibama 20
- 4 5 John 45
- 6 7 Bruce 23
- 2 3 Alexa 34
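Note that the index labels in the output are shuffled, yet the slice still returns the first 8 rows. With an integer index, a slice such as df[0: train_len] selects rows by position, not by label. The sketch below, using stand-in data and an arbitrary seed, compares it with the more explicit iloc form.

```python
import pandas as pd

# Stand-in data, shuffled so that index labels are out of order.
df = pd.DataFrame({"No": range(1, 11)}).sample(frac=1, random_state=0)

by_slice = df[0:8]     # slicing style used in this tutorial
by_iloc = df.iloc[:8]  # explicit positional equivalent

# Both select the first 8 rows of the shuffled dataframe.
same = by_slice.equals(by_iloc)
print(same)
```

Using iloc makes the positional intent obvious to readers, which can help avoid confusion with label-based selection through loc.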
Extract test set
- test_set = df[train_len: train_len+test_len]
- print(test_set)
The test set is:
- No Name Age
- 0 1 Tom 24
Extract validation set
- dev_set = df[train_len+test_len:]
- print(dev_set)
The validation set is:
- No Name Age
- 8 9 Brown 31
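The steps above can be collected into one reusable function. This is a minimal sketch under the same 8:1:1 assumption; split_dataframe and its parameters are hypothetical names chosen for illustration, not part of pandas.

```python
import pandas as pd

def split_dataframe(df, train_frac=0.8, test_frac=0.1, seed=None):
    """Shuffle df by rows and return (train_set, test_set, dev_set)."""
    shuffled = df.sample(frac=1, random_state=seed)  # full row shuffle
    train_len = int(train_frac * len(shuffled))
    test_len = int(test_frac * len(shuffled))
    train_set = shuffled.iloc[:train_len]
    test_set = shuffled.iloc[train_len:train_len + test_len]
    dev_set = shuffled.iloc[train_len + test_len:]  # everything left over
    return train_set, test_set, dev_set

df = pd.DataFrame({"No": range(1, 11)})  # stand-in data for illustration
train_set, test_set, dev_set = split_dataframe(df, seed=42)
print(len(train_set), len(test_set), len(dev_set))  # 8 1 1
```

The three sets never overlap because they are consecutive, non-overlapping slices of the same shuffled dataframe.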