The aclImdb dataset consists of many small text files, which we have combined into one large text file (aclImdb-all.txt). In this tutorial, we will show you how to split this file into train, test and validation sets for model training.
Read all lines from aclImdb-all.txt
We can open this file and read all of its lines. Here is an example:
import random

aclImdb_file = 'aclImdb-all.txt'
all_lines = []
with open(aclImdb_file, 'r', encoding = 'utf-8') as f:
    all_lines = f.readlines()
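One detail worth checking after reading: f.readlines() keeps the trailing newline of each line, but the last line of a file may not end with one. Once the lines are shuffled and written back with writelines(), such a line would be fused with whichever line follows it. Below is a minimal sketch of a guard, assuming this could affect your copy of the file. The helper name ensure_newlines is hypothetical and not part of the original script.

```python
def ensure_newlines(lines):
    """Return a copy of lines in which every entry ends with a newline."""
    return [line if line.endswith('\n') else line + '\n' for line in lines]

# Example: the last entry has no trailing newline, as can happen at end of file.
sample = ['review one\n', 'review two\n', 'review three']
fixed = ensure_newlines(sample)
```

In the script, this would be applied once, right after reading: all_lines = ensure_newlines(all_lines).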
Shuffle all lines
Before generating the train, test and validation sets, we should shuffle all lines so that each set is a random sample of the data.
total = len(all_lines)
random.shuffle(all_lines)
To learn how to use Python random.shuffle(), see this tutorial:
Understand Python random.shuffle(): Randomize a Sequence
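Note that random.shuffle() produces a different order on every run, so the contents of the three sets change each time the script is executed. If a repeatable split is wanted, one option is to shuffle with a seeded generator. This is only a sketch and not part of the original script; the seed value 42 and the helper name shuffled_copy are arbitrary choices.

```python
import random

def shuffled_copy(lines, seed=42):
    """Return a shuffled copy of lines; the same seed gives the same order."""
    rng = random.Random(seed)   # private generator, leaves the global one untouched
    result = list(lines)        # copy, so the caller's list keeps its order
    rng.shuffle(result)         # shuffles in place, like random.shuffle()
    return result

demo = ['line %d\n' % i for i in range(10)]
first = shuffled_copy(demo)
second = shuffled_copy(demo)    # same seed, so the same order as first
```

Unlike random.shuffle(all_lines), this version returns a new list, so the result has to be assigned, for example all_lines = shuffled_copy(all_lines).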
Generate train, test and validation set
We will split the data at a ratio of train:test:validation = 8:1:1, then get the train, test and validation lines.
train_len = int(total*0.8)
test_len = int(total*0.1)
train_lines = all_lines[0:train_len]
test_lines = all_lines[train_len: (train_len + test_len)]
dev_lines = all_lines[(train_len + test_len):]
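The slicing above can also be wrapped in a small function, which makes the set sizes easy to check. This is a sketch under the same 8:1:1 assumption; split_lines and its parameter names are hypothetical. As in the code above, the validation set takes whatever remains after the train and test slices, so no line is lost when the total is not divisible by 10.

```python
def split_lines(lines, train_ratio=0.8, test_ratio=0.1):
    """Split lines into (train, test, dev); dev receives the remainder."""
    total = len(lines)
    train_len = int(total * train_ratio)
    test_len = int(total * test_ratio)
    train = lines[:train_len]
    test = lines[train_len:train_len + test_len]
    dev = lines[train_len + test_len:]
    return train, test, dev

# 50,000 items stand in for the lines of aclImdb-all.txt.
train, test, dev = split_lines(list(range(50000)))
print(len(train), len(test), len(dev))  # 40000 5000 5000

# With 23 items the remainder goes to dev: 18 train, 2 test, 3 dev.
small_train, small_test, small_dev = split_lines(list(range(23)))
```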
Save train, test and validation lines
We will save these lines to three files, one each for the train, test and validation sets.
def saveFile(file, lines):
    with open(file, 'w', encoding = 'utf-8') as f:
        f.writelines(lines)

train_file = 'aclImdb-train.txt'
test_file = 'aclImdb-test.txt'
dev_file = 'aclImdb-dev.txt'
saveFile(train_file, train_lines)
saveFile(test_file, test_lines)
saveFile(dev_file, dev_lines)
Finally, we get three files:
aclImdb-train.txt: train set, 40,000 lines
aclImdb-test.txt: test set, 5,000 lines
aclImdb-dev.txt: validation set, 5,000 lines
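To confirm the result, the lines in each output file can be counted and compared with the original total. The following is a sketch; count_lines is a hypothetical helper, and the demonstration uses a small temporary file so that it runs without the dataset.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a UTF-8 text file without loading it all into memory."""
    with open(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)

# Demonstration on a small temporary file with three lines.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.writelines(['a\n', 'b\n', 'c\n'])
    tmp_path = tmp.name

demo_count = count_lines(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
```

For the real files, the counts for aclImdb-train.txt, aclImdb-test.txt and aclImdb-dev.txt should add up to the number of lines in aclImdb-all.txt.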