An Introduction to IMDB Movie Review Dataset (aclImdb): A Small Sentiment Analysis Dataset

admin

5 years ago

aclImdb is a small imdb movie review dataset, which is good choice to build an experimental model for sentiment analysis. In this tutorial, we will introduce some basci feartures for sentiment analysis beginners.

File Name Format

Each file is named as: [id]_[rating].txt

where id is the unique file id, rating is the movie star rated by users.

For example:

12487_10.txt, which means this number of this file is 12487 and the star is 10, this review represents strong positive.

In this dataset, 6-10 star is positive and 1-5 is negative.

This movie dataset is mainly to used in binary sentiment classification. However, it is also can be used to multiple sentiment classification, for example 10 classes.

File Composition

The statistical files of this dataset is in this table.

	Class	Polarity	Files	File Format
Labeled	Train	POS	12,500	id_[6-10].txt
	Train	NEG	12,500	id_[1-5].txt
	TESTt	POS	12,500	id_[6-10].txt
	TESTt	NEG	12500	id_[1-5].txt
Unlabeled	Train	–	50,000	id_[1-10].txt
Unlabeled	Test	–	50,000	id_[1-10].txt

How to use this imdb review dataset

This dataset is often used in supervised sentiment classification, to use this dataset, we should process it first. We should merge all train and test files to a big text file. Then we will split train, test and validation set by 8:1:1 randomly.

To combine all labeled train and test files, you can read:

Python Combine IMDB Moview Review Dataset (aclImdb) to One Text File: A Step Guide

To generate train, test and validation set, you can read:

Split IMDB Movie Review Dataset (aclImdb) into Train, Test and Validation Set: A Step Guide for NLP Beginners

Statistics of labeled reivew

We have counted the number of each labeled class, here is the table.

Class	Count
1	10,122
2	4,586
3	4,961
4	5,331
5	0
6	0
7	4,803
8	5,859
9	4,607
10	9,731

From this table, we can find the rating 5 and 6 is zero, which means we can use this dataset for 8 classification problem in sentiment analysis.