Tutorial Example

An Introduction to IMDB Movie Review Dataset (aclImdb): A Small Sentiment Analysis Dataset

aclImdb is a small imdb movie review dataset, which is good choice to build an experimental model for sentiment analysis. In this tutorial, we will introduce some basci feartures for sentiment analysis beginners.

File Name Format

Each file is named as: [id]_[rating].txt

where id is the unique file id, rating is the movie star rated by users.

For example:

12487_10.txt, which means this number of this file is 12487 and the star is 10, this review represents strong positive.

In this dataset, 6-10 star is positive and 1-5 is negative.

This movie dataset is mainly to used in binary sentiment classification. However, it is also can be used to multiple sentiment classification, for example 10 classes.

File Composition

The statistical files of this dataset is in this table.

Class Polarity Files File Format
Labeled Train POS 12,500 id_[6-10].txt
NEG 12,500 id_[1-5].txt
TESTt POS 12,500 id_[6-10].txt
NEG 12500 id_[1-5].txt
Unlabeled Train 50,000 id_[1-10].txt
Test 50,000 id_[1-10].txt

How to use this imdb review dataset

This dataset is often used in supervised sentiment classification, to use this dataset, we should process it first. We should merge all train and test files to a big text file. Then we will split train, test and validation set by 8:1:1 randomly.

To combine all labeled train and test files, you can read:

Python Combine IMDB Moview Review Dataset (aclImdb) to One Text File: A Step Guide

To generate train, test and validation set, you can read:

Split IMDB Movie Review Dataset (aclImdb) into Train, Test and Validation Set: A Step Guide for NLP Beginners

Statistics of labeled reivew

We have counted the number of each labeled class, here is the table.

Class Count
1 10,122
2 4,586
3 4,961
4 5,331
5 0
6 0
7 4,803
8 5,859
9 4,607
10 9,731

From this table, we can find the rating 5 and 6 is zero, which means we can use this dataset for 8 classification problem in sentiment analysis.