Split IMDB Movie Review Dataset (aclImdb) into Train, Test and Validation Sets: A Step-by-Step Guide for NLP Beginners

admin

5 years ago

The aclImdb dataset consists of many small txt files, which we have combined into one big txt file (aclImdb-all.txt). In this tutorial, we will show you how to split this file into train, test and validation sets for model training.

Read all lines from aclImdb-all.txt

We can open this file and read all of its lines into a list. Here is an example.

import random

aclImdb_file = 'aclImdb-all.txt'
# Read every line of the combined file into a list
with open(aclImdb_file, 'r', encoding = 'utf-8') as f:
    all_lines = f.readlines()
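One detail worth guarding against: if the last line of aclImdb-all.txt has no trailing newline, it will be joined to whichever line follows it once the lines are shuffled and written back with writelines(). The sketch below is one way to handle this, assuming blank lines can safely be dropped. normalize_lines is a hypothetical helper name, not part of the original code.

```python
def normalize_lines(lines):
    """Drop blank lines and make sure every remaining line ends with a newline."""
    cleaned = []
    for line in lines:
        text = line.rstrip('\r\n')  # remove any existing line ending
        if text.strip():            # skip lines that are empty or whitespace only
            cleaned.append(text + '\n')
    return cleaned

# Small in-memory example: the last item has no trailing newline
sample = ['first review\n', '\n', 'second review']
cleaned_sample = normalize_lines(sample)
print(cleaned_sample)
```

In this tutorial it could be applied as all_lines = normalize_lines(all_lines) right after reading the file.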

Shuffle all lines

Before splitting, we should shuffle all lines so that each of the train, test and validation sets is a random sample of the whole file.

total = len(all_lines)
random.shuffle(all_lines)

To learn more about how Python random.shuffle() works, see this tutorial:

Understand Python random.shuffle(): Randomize a Sequence
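random.shuffle() gives a different order on every run, so the three output files change each time the script is executed. If you want a reproducible split, one option is to shuffle with a seeded generator. This is a minimal sketch. The helper name shuffled_copy and the seed value 42 are assumptions for illustration only.

```python
import random

def shuffled_copy(lines, seed=42):
    """Return a shuffled copy of lines; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, the global random state is untouched
    result = list(lines)       # copy, so the input list keeps its original order
    rng.shuffle(result)
    return result

sample = ['line %d\n' % i for i in range(10)]
first = shuffled_copy(sample)
second = shuffled_copy(sample)
print(first == second)  # True: same seed, same order
```

Unlike random.shuffle(all_lines), which shuffles in place, this version returns a new list and leaves the original one unchanged.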

Generate train, test and validation set

We will create the three sets with the ratio train:test:validation = 8:1:1, then slice the shuffled list to get the train, test and validation lines.

train_len = int(total*0.8)
test_len = int(total*0.1)

train_lines = all_lines[0:train_len]
test_lines = all_lines[train_len: (train_len + test_len)]
dev_lines = all_lines[(train_len + test_len):]
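The same slicing can be wrapped in a function and checked on a small list before running it on the real file. The sketch below assumes a hypothetical helper called split_lines. Note that int() rounds down and the validation set takes everything that is left, so any lines lost to rounding end up in the validation set.

```python
def split_lines(lines, train_ratio=0.8, test_ratio=0.1):
    """Split lines into train, test and validation lists by position."""
    total = len(lines)
    train_len = int(total * train_ratio)
    test_len = int(total * test_ratio)
    train = lines[:train_len]
    test = lines[train_len:train_len + test_len]
    dev = lines[train_len + test_len:]  # the remainder becomes the validation set
    return train, test, dev

sample = ['line %d\n' % i for i in range(50000)]
train, test, dev = split_lines(sample)
print(len(train), len(test), len(dev))

# Every line lands in exactly one set, in the original order
assert train + test + dev == sample
```

The lines should be shuffled before calling this function, because the split only depends on position in the list.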

Save train, test and validation lines

We will write each list of lines to its own file to create the train, test and validation sets.

def saveFile(file, lines):
    with open(file, 'w', encoding = 'utf-8') as f:
        f.writelines(lines)
        
train_file = 'aclImdb-train.txt'
test_file = 'aclImdb-test.txt'
dev_file = 'aclImdb-dev.txt'
saveFile(train_file, train_lines)
saveFile(test_file, test_lines)
saveFile(dev_file, dev_lines)

Running this code creates three files:

aclImdb-train.txt: train set, 40,000 lines

aclImdb-test.txt: test set, 5,000 lines

aclImdb-dev.txt: validation set, 5,000 lines
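To confirm the line counts listed above, you can count the lines in each saved file. This is a minimal sketch. count_lines is a hypothetical helper, and the demo uses a temporary file so the snippet runs on its own.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a text file without loading it all into memory."""
    with open(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)

# Demo on a small temporary file with three lines
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'demo.txt')
    with open(demo_file, 'w', encoding='utf-8') as f:
        f.writelines(['a\n', 'b\n', 'c\n'])
    demo_count = count_lines(demo_file)

print(demo_count)
```

After running the tutorial code, count_lines('aclImdb-train.txt'), count_lines('aclImdb-test.txt') and count_lines('aclImdb-dev.txt') should match the numbers listed above.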