We often read data from a file into a pandas dataframe. How can we then extract train, test and validation sets from it to train a deep learning model? This tutorial walks through one simple way to do it.
Preliminary
First, we read the data from a file, such as a CSV or Excel file.
To read data from a CSV file, you can read this tutorial:
A Beginner Guide to Python Pandas Read CSV
To read data from an Excel file, you can read this tutorial:
Python Pandas read_excel() – Reading Excel File for Beginners
Here we use a simple example to illustrate how to create a dataframe.
import pandas as pd
import numpy as np
df = pd.read_csv("test_member.csv", sep = '\t')
print(df)
The dataframe is:
   No     Name  Age
0   1      Tom   24
1   2     Kate   22
2   3    Alexa   34
3   4     Kate   23
4   5     John   45
5   6     Lily   41
6   7    Bruce   23
7   8      Lin   33
8   9    Brown   31
9  10  Alibama   20
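If you do not have test_member.csv at hand, the same dataframe can be built in memory. This is only a sketch for following along: the values are copied from the output above, and building the dataframe from a dictionary is our substitute for reading the file.

```python
import pandas as pd

# In-memory copy of the example data shown above, so no file is needed.
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})
print(df)
```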
How to extract train, test and validation sets?
Before extracting the train, test and validation sets, you should shuffle the rows of the dataframe so that each set is a random sample of the data.
df = df.sample(len(df))
print(df)
The randomized dataframe is (the row order will differ on every run):
   No     Name  Age
5   6     Lily   41
7   8      Lin   33
3   4     Kate   23
1   2     Kate   22
9  10  Alibama   20
4   5     John   45
6   7    Bruce   23
2   3    Alexa   34
0   1      Tom   24
8   9    Brown   31
To learn more about how to randomize a pandas dataframe, you can read:
Understand pandas.DataFrame.sample(): Randomize DataFrame By Row
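Because the shuffle is random, the three sets change on every run. If you need the same split each time, one option is to pass a fixed random_state to sample(). The sketch below assumes that is acceptable for your experiment; the seed 42 and the small stand-in dataframe are arbitrary choices of ours.

```python
import pandas as pd

# Stand-in for the example dataframe (only the "No" column is kept here).
df = pd.DataFrame({"No": range(1, 11)})

# frac=1 returns every row in random order; random_state makes the order repeatable.
shuffled = df.sample(frac=1, random_state=42)

# Optional: renumber the rows 0..n-1 after shuffling.
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```

Resetting the index is optional. The slicing used in this tutorial works on row positions, so it behaves the same with or without it.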
Suppose the sizes of the train, test and validation sets follow the ratio 8:1:1. We can extract them as follows:
Extract train set
First, we calculate the lengths of the train and test sets.
length = len(df)
train_len = int(0.8 * length)
print(train_len)
test_len = int(0.1 * length)
print(test_len)
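With the 10 rows of this example, the code above prints 8 and 1. Note that int() truncates, so when the row count is not a multiple of 10 the leftover rows end up in the validation set, which takes everything after the test slice. A minimal sketch of that arithmetic follows; the helper name split_lengths and its default ratios are our own illustration, not part of pandas.

```python
def split_lengths(length, train_ratio=0.8, test_ratio=0.1):
    """Return (train_len, test_len, dev_len) for a dataset with `length` rows."""
    train_len = int(train_ratio * length)  # truncated toward zero
    test_len = int(test_ratio * length)    # truncated toward zero
    dev_len = length - train_len - test_len  # validation set gets the remainder
    return train_len, test_len, dev_len

print(split_lengths(10))  # (8, 1, 1)
print(split_lengths(25))  # (20, 2, 3)
```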
Then we can extract the train set:
train_set = df[0: train_len]
print(train_set)
The train set is:
   No     Name  Age
5   6     Lily   41
7   8      Lin   33
3   4     Kate   23
1   2     Kate   22
9  10  Alibama   20
4   5     John   45
6   7    Bruce   23
2   3    Alexa   34
Extract test set
test_set = df[train_len: train_len+test_len]
print(test_set)
The test set is:
   No Name  Age
0   1  Tom   24
Extract validation set
dev_set = df[train_len+test_len:]
print(dev_set)
The validation set is:
   No   Name  Age
8   9  Brown   31
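The steps in this tutorial can be collected into one small function. The following is a sketch under a few assumptions: the name split_dataframe, its parameters and the seed argument are our own, and it uses iloc for explicit positional slicing, which selects the same rows as the df[a:b] slices above.

```python
import pandas as pd

def split_dataframe(df, train_ratio=0.8, test_ratio=0.1, seed=None):
    """Shuffle df, then slice it into train, test and validation sets."""
    # Shuffle all rows; pass a seed to make the split repeatable.
    shuffled = df.sample(frac=1, random_state=seed)
    train_len = int(train_ratio * len(shuffled))
    test_len = int(test_ratio * len(shuffled))
    train_set = shuffled.iloc[:train_len]
    test_set = shuffled.iloc[train_len:train_len + test_len]
    # The validation set takes whatever rows remain.
    dev_set = shuffled.iloc[train_len + test_len:]
    return train_set, test_set, dev_set

# Stand-in data with the same number of rows as the example.
df = pd.DataFrame({"No": range(1, 11),
                   "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20]})
train_set, test_set, dev_set = split_dataframe(df, seed=0)
print(len(train_set), len(test_set), len(dev_set))  # 8 1 1
```

Since the three slices are taken from one shuffled dataframe, no row appears in more than one set.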