The aclImdb dataset contains 50,000 labeled text files, one review per file. To make the dataset easier to use, we can combine these small text files into a single large one. In this tutorial, we will show you how to do it.
Python combine text files
It is easy to combine several text files into a single large one. Here is a tutorial:
Best Practice to Python Combine Multiple Text Files into One Text File
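As a rough illustration of the general idea, here is a minimal sketch that concatenates every .txt file in a directory into one output file. The function name combine_text_files and the temporary demo files are assumptions made for this example; they are not taken from the linked tutorial.

```python
import glob
import os
import tempfile

def combine_text_files(src_dir, out_path):
    """Append the content of every .txt file in src_dir to out_path."""
    # Sort the paths so the output order is reproducible.
    paths = sorted(glob.glob(os.path.join(src_dir, "*.txt")))
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, "r", encoding="utf-8") as src:
                # Write each file's content followed by a newline.
                out.write(src.read().strip() + "\n")
    return len(paths)

# Demo with two small files in a temporary directory (hypothetical data).
tmp = tempfile.mkdtemp()
for name, text in [("a.txt", "first file"), ("b.txt", "second file")]:
    with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
        f.write(text)

combined = os.path.join(tmp, "combined.out")
count = combine_text_files(tmp, combined)
print(count)
```

The output file uses a different extension than the inputs, so it can never be picked up by the *.txt pattern on a later run.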
To combine the aclImdb text files, we need some extra processing steps.
Combine aclImdb labeled text files
We import some libraries first.
import glob
import re
import os
Then we list the directories we will traverse.
dir = []
dir.append(r"E:\dataset\Large Movie Review Dataset\aclImdb\train\pos")
dir.append(r"E:\dataset\Large Movie Review Dataset\aclImdb\train\neg")
dir.append(r"E:\dataset\Large Movie Review Dataset\aclImdb\test\pos")
dir.append(r"E:\dataset\Large Movie Review Dataset\aclImdb\test\neg")
We need to remove some invalid strings and characters from the text files.
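def removeAllInvalid(text):
    # Replace HTML line breaks with a space
    text = text.replace("<br />", " ")
    # Replace runs of carriage returns, newlines and tabs with a space
    pattern = re.compile(r'[\r\n\t]{1,}')
    text = re.sub(pattern, ' ', text)
    # Collapse repeated spaces into one
    pattern = re.compile(r'[ ]{2,}')
    text = re.sub(pattern, ' ', text)
    return text.strip()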
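The removeAllInvalid function replaces HTML line breaks (<br />), tabs and newlines with spaces, then collapses repeated spaces. This matters because the combined file uses a tab to separate the text from its label and a newline to separate reviews. It also gives your model cleaner input.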
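To see what the cleaning step does, here is a small self-contained check. The function is repeated in a compact form so the snippet runs on its own, and the sample review text is made up for this example.

```python
import re

def removeAllInvalid(text):
    # Same logic as the function above, written compactly.
    text = text.replace("<br />", " ")
    text = re.sub(r'[\r\n\t]{1,}', ' ', text)
    text = re.sub(r'[ ]{2,}', ' ', text)
    return text.strip()

# Hypothetical review text containing HTML breaks, a tab and a newline.
sample = "A great film.<br /><br />Loved it!\tTruly.\n"
cleaned = removeAllInvalid(sample)
print(cleaned)  # A great film. Loved it! Truly.
```

After cleaning, the text contains no tabs or newlines, so it is safe to store as one field on one line.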
Get text label
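The label is stored in the file name. aclImdb files are named id_rating.txt (for example, 0_9.txt), so the part after the underscore is the rating of the review. We can get it with this function.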
def getLabel(absolute):
    # e.g. "0_9.txt"
    basename = os.path.basename(absolute)
    info = os.path.splitext(basename)
    # e.g. "0_9"
    filename = info[0]
    fileInfo = filename.split("_")
    # The second part is the label, e.g. "9"
    return fileInfo[1]
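Here is a quick check of getLabel on a sample path. The snippet also sketches one possible way to turn the returned rating into a binary sentiment label. The helper toSentiment is hypothetical and not part of the original code; it assumes the dataset's convention that positive reviews have a rating of 7 or more and negative reviews a rating of 4 or less.

```python
import os

def getLabel(absolute):
    # Same logic as the function above, written compactly.
    basename = os.path.basename(absolute)
    filename = os.path.splitext(basename)[0]
    return filename.split("_")[1]

def toSentiment(label):
    # Hypothetical helper: map a rating string to "pos" or "neg".
    # Assumes labeled reviews are rated >= 7 (positive) or <= 4 (negative).
    return "pos" if int(label) >= 7 else "neg"

# Build the sample path with os.path.join so it works on any platform.
path = os.path.join("aclImdb", "train", "pos", "0_9.txt")
label = getLabel(path)
print(label, toSentiment(label))  # 9 pos
```

Note that getLabel returns a string, so convert it with int() before comparing ratings.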
Finally, we read the text files one by one and write the content and label of each file into a single file, one review per line.
file_big = 'aclImdb-all.txt'
with open(file_big, 'w', encoding = 'utf-8') as fnew:
    for d in dir:
        # find all .txt files in this directory
        files = glob.glob(os.path.join(d, '*.txt'))
        for f in files:
            with open(f, 'r', encoding = 'utf-8') as fold:
                content = fold.read()
            content = removeAllInvalid(content)
            # get label
            label = getLabel(f)
            # one review per line: text, a tab, then the label
            fnew.write(content + "\t" + label + "\n")
Running this code combines all labeled text files of aclImdb into aclImdb-all.txt, which contains 50,000 lines, one review per line.
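Once the combined file exists, it can be read back line by line. The sketch below is one possible way to do it. The function name load_combined and the two sample lines are assumptions made for this example; to read the real file, pass the path of aclImdb-all.txt instead.

```python
import os
import tempfile

def load_combined(path):
    """Return a list of (text, label) pairs from a tab-separated file."""
    pairs = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # The label is the last tab-separated field on the line.
            text, label = line.rsplit("\t", 1)
            pairs.append((text, label))
    return pairs

# Write a tiny sample file in the same format (hypothetical data).
sample_path = os.path.join(tempfile.mkdtemp(), "aclImdb-sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("A great film. Loved it!\t9\n")
    f.write("Dull and far too long.\t2\n")

pairs = load_combined(sample_path)
print(len(pairs))  # 2
```

Splitting from the right with rsplit keeps the parsing correct as long as the label is the final field on each line.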