Python Combine IMDB Movie Review Dataset (aclImdb) to One Text File: A Step Guide

The aclImdb dataset contains 50,000 labeled text files. To make the dataset easier to use, we can combine these small text files into one big file. This tutorial shows how to do it.

Python combine text files

It is easy to combine several text files into one big file. Here is a tutorial:

Best Practice to Python Combine Multiple Text Files into One Text File

To combine the aclImdb text files, we need some extra processing.

Combine aclImdb labeled text files

We import some libraries first.

import glob
import re
import os

Then we list the directories that contain the labeled reviews.

dir = []
dir.append(r"E:\dataset\Large Movie Review Dataset\aclImdb\train\pos")
dir.append(r"E:\dataset\Large Movie Review Dataset\aclImdb\train\neg")
dir.append(r"E:\dataset\Large Movie Review Dataset\aclImdb\test\pos")
dir.append(r"E:\dataset\Large Movie Review Dataset\aclImdb\test\neg")
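The four paths above differ only in the split (train or test) and the sentiment folder (pos or neg), so the list can also be built from a single root folder. The snippet below is a minimal sketch of that idea; the root value "aclImdb" and the name dirs are assumptions for illustration (dirs is used instead of dir to avoid hiding Python's built-in dir function).

```python
import os

# Assumed location of the extracted dataset; change it to your own path.
root = "aclImdb"

# Build train/pos, train/neg, test/pos, test/neg in that order.
dirs = [
    os.path.join(root, split, sentiment)
    for split in ("train", "test")
    for sentiment in ("pos", "neg")
]

for d in dirs:
    print(d)
```

Because os.path.join picks the right separator for the current system, the same code works on Windows, Linux, and macOS.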

We need to remove some invalid strings and characters from the text files.

def removeAllInvalid(text):
    text = text.replace("<br />", " ")
    pattern = re.compile(r'[\r\n\t]{1,}')
    text = re.sub(pattern, ' ', text)
    pattern = re.compile(r'[ ]{2,}')
    text = re.sub(pattern, ' ', text)
    return text.strip()

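The removeAllInvalid function replaces HTML line breaks (<br />), newlines, and tabs with a single space, then collapses repeated spaces, so each review fits on one line. Removing tabs also matters because a tab is used later to separate the text from its label.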
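To see what the cleaning step does, here is a small check on a made-up review string. The function is repeated so the snippet runs on its own; the sample text is an assumption, not taken from the dataset.

```python
import re

def removeAllInvalid(text):
    # Replace HTML line breaks with a space.
    text = text.replace("<br />", " ")
    # Replace runs of carriage returns, newlines, and tabs with one space.
    text = re.sub(r'[\r\n\t]{1,}', ' ', text)
    # Collapse runs of two or more spaces into one.
    text = re.sub(r'[ ]{2,}', ' ', text)
    return text.strip()

# Made-up sample containing <br /> tags, a tab, and a trailing newline.
sample = "I loved it.<br /><br />Great cast,\tgreat story.\n"
cleaned = removeAllInvalid(sample)
print(cleaned)  # I loved it. Great cast, great story.
```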

Get text label

The label is stored in the file name, which has the form id_rating.txt (for example, 0_9.txt). We can get it with this function.

def getLabel(absolute):
    basename = os.path.basename(absolute)
    info = os.path.splitext(basename)
    filename = info[0]
    fileInfo = filename.split("_")
    return fileInfo[1]
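Note that the value returned by getLabel is the part after the underscore, as a string. If a binary pos/neg label is preferred, one option is to read it from the name of the parent folder. The sketch below shows both; getSentiment is a hypothetical helper that is not part of the original code, and the sample path is made up.

```python
import os

def getLabel(absolute):
    # "0_9.txt" -> "0_9" -> ["0", "9"] -> "9"
    basename = os.path.basename(absolute)
    filename = os.path.splitext(basename)[0]
    return filename.split("_")[1]

def getSentiment(absolute):
    # Hypothetical helper: returns the parent folder name, e.g. "pos" or "neg".
    return os.path.basename(os.path.dirname(absolute))

# Made-up path that follows the dataset layout.
sample = os.path.join("aclImdb", "train", "pos", "0_9.txt")
print(getLabel(sample))      # 9
print(getSentiment(sample))  # pos
```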

Finally, we read the text files one by one and write the content and label of each file to a single file, one review per line.

file_big = 'aclImdb-all.txt'
with open(file_big, 'w', encoding = 'utf-8') as fnew:
    for d in dir:
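        # os.path.join keeps the pattern portable across operating systems
        files = glob.glob(os.path.join(d, '*.txt'))
        for f in files:
            with open(f, 'r', encoding='utf-8') as fold:
                content = fold.read()
            content = removeAllInvalid(content)
            # get the label from the file name
            label = getLabel(f)
            # one line per review: text, a tab, then the label
            fnew.write(content + "\t" + label + "\n")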

Running this code combines all labeled text files of aclImdb into aclImdb-all.txt, which contains 50,000 lines.
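Once the combined file exists, it can be read back by splitting each line at its last tab. The snippet below is a minimal sketch of that step. loadCombined is a hypothetical helper, and the two demo lines are made up so the example runs without the real dataset. It assumes the label written to each line is an integer, as produced by getLabel.

```python
import os
import tempfile

def loadCombined(path):
    # Hypothetical helper: returns (texts, labels) from a tab-separated file.
    texts, labels = [], []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split at the last tab: text on the left, label on the right.
            text, label = line.rsplit("\t", 1)
            texts.append(text)
            labels.append(int(label))
    return texts, labels

# Write a tiny demo file in the same format as aclImdb-all.txt.
demo_path = os.path.join(tempfile.gettempdir(), "aclImdb-demo.txt")
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write("A wonderful film.\t9\n")
    f.write("Dull and far too long.\t2\n")

texts, labels = loadCombined(demo_path)
print(len(texts), labels)  # 2 [9, 2]
os.remove(demo_path)
```

To load the real file, call loadCombined('aclImdb-all.txt') instead of using the demo path.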