aclImdb is a small imdb movie review dataset, which is good choice to build an experimental model for sentiment analysis. In this tutorial, we will introduce some basci feartures for sentiment analysis beginners.
File Name Format
Each file is named as: [id]_[rating].txt
where id is the unique file id, rating is the movie star rated by users.
For example:
12487_10.txt, which means this number of this file is 12487 and the star is 10, this review represents strong positive.
In this dataset, 6-10 star is positive and 1-5 is negative.
This movie dataset is mainly to used in binary sentiment classification. However, it is also can be used to multiple sentiment classification, for example 10 classes.
File Composition
The statistical files of this dataset is in this table.
Class | Polarity | Files | File Format | |
Labeled | Train | POS | 12,500 | id_[6-10].txt |
NEG | 12,500 | id_[1-5].txt | ||
TESTt | POS | 12,500 | id_[6-10].txt | |
NEG | 12500 | id_[1-5].txt | ||
Unlabeled | Train | – | 50,000 | id_[1-10].txt |
Test | – | 50,000 | id_[1-10].txt |
How to use this imdb review dataset
This dataset is often used in supervised sentiment classification, to use this dataset, we should process it first. We should merge all train and test files to a big text file. Then we will split train, test and validation set by 8:1:1 randomly.
To combine all labeled train and test files, you can read:
Python Combine IMDB Moview Review Dataset (aclImdb) to One Text File: A Step Guide
To generate train, test and validation set, you can read:
Statistics of labeled reivew
We have counted the number of each labeled class, here is the table.
Class | Count |
1 | 10,122 |
2 | 4,586 |
3 | 4,961 |
4 | 5,331 |
5 | 0 |
6 | 0 |
7 | 4,803 |
8 | 5,859 |
9 | 4,607 |
10 | 9,731 |
From this table, we can find the rating 5 and 6 is zero, which means we can use this dataset for 8 classification problem in sentiment analysis.