Implement an LDA Model Using Gensim – A Beginner's Guide – Gensim Tutorial

LDA (Latent Dirichlet Allocation) is an unsupervised method for grouping documents into a given number of topics. In this tutorial, we will introduce how to build an LDA model using Python gensim.

Preliminary

We should import some libraries first.

from nltk.tokenize import RegexpTokenizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel

Load text documents

Before we can use a gensim LDA model to classify documents, we need to load our text documents.

data_file = '../data/imdb/train.ss'
docs = []
with open(data_file, 'rb') as f:
    for line in f:
        line = line.decode('utf-8', 'ignore')
        fields = line.strip().split('\t\t')
        # The review text is the fourth tab-separated field.
        data = fields[3].lower()
        data = data.replace('<sssss>', '')
        docs.append(data)
print(len(docs))
print(docs[0][:500])

Here docs is a Python list that contains our documents. You can modify this code to load your own documents; a sketch of one common alternative follows.
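
For example, if your documents were stored as one plain text file per document, you could load them like this. This is a minimal sketch, assuming a hypothetical directory my_documents full of .txt files (not part of the original dataset):

import os

docs_dir = './my_documents'  # hypothetical directory, one document per .txt file
docs = []
for file_name in sorted(os.listdir(docs_dir)):
    if file_name.endswith('.txt'):
        with open(os.path.join(docs_dir, file_name), encoding='utf-8') as f:
            docs.append(f.read().lower())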

In this example, we have loaded 67426 documents.

67426
i excepted a lot from this movie , and it did deliver .  there is some great buddhist wisdom in this movie .  the real dalai lama is a very interesting person , and i think there is a lot of wisdom in buddhism .  the music , of course , sounds like because it is by philip glass .  this adds to the beauty of the movie .  whereas other biographies of famous people tend to get very poor this movie always stays focused and gives a good and honest portrayal of the dalai lama .  all things being equal

Split the documents into tokens

After we have loaded the documents into a Python list, we also need to split them into tokens (words). In this tutorial, we will use nltk's RegexpTokenizer to do the splitting.

Here is an example:

tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

Remove some words we do not need

In order to improve the quality of the topics, we should remove tokens that carry little meaning, such as numbers, one-character words, and stop words (see the stop-word sketch after the code below).

Here is an example:

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
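
The code above does not remove stop words. As a minimal sketch using nltk's stopwords corpus (which must be downloaded once), we could filter them out like this:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop word list
stop_words = set(stopwords.words('english'))

# Remove common English stop words such as "the", "and", "of".
docs = [[token for token in doc if token not in stop_words] for doc in docs]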

Build dictionary and corpus

We now have the tokenized documents, so we can use them to create a dictionary and a corpus.

# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 10% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.1)

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
print(corpus[0])

In this tutorial, we have filtered out words that occur in fewer than 20 documents, or in more than 10% of the documents.

Running this code, we may get a result as follows:

Number of unique tokens: 25080
Number of documents: 67426
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 2)]
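
Each (token_id, count) pair in corpus[0] records how often a dictionary token appears in the first document. To see which words the ids stand for, we can map them back through the dictionary:

# Map the first few (token_id, count) pairs back to readable words.
print([(dictionary[token_id], count) for token_id, count in corpus[0][:5]])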

Use dictionary and corpus to build LDA model

We can use gensim LdaModel to create an LDA model from the dictionary and corpus. Here is an example:

from gensim.models import LdaModel
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# dictionary.id2token is built lazily; access one entry to force it to load.
temp = dictionary[0]
id2word = dictionary.id2token

model_name = "./imdb-"+str(num_topics)+".lda"
model = LdaModel(
            corpus=corpus,
            id2word=id2word,
            chunksize=chunksize,
            alpha='auto',
            eta='auto',
            iterations=iterations,
            num_topics=num_topics,
            passes=passes,
            minimum_probability=0.0,
            eval_every=eval_every
        )
print(model.show_topic(0))  # Inspect the top words of topic 0.
# Save the trained model to disk.
model.save(model_name)

Here model.save() writes the trained LDA model to disk so that we can reload it later.
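
To reuse the model in a later session, we can load it back with LdaModel.load(). For example:

from gensim.models import LdaModel

# Load the model we saved above.
model = LdaModel.load("./imdb-10.lda")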

Print the topic distribution of documents

After we have created an LDA model using gensim, we can inspect the topic distribution of a document with the code below:

for index, score in sorted(model[corpus[0]], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic ID: {} Topic: {}".format(score, index, model.print_topic(index, 10)))

This code displays the topic distribution of the first document, sorted from the most probable topic to the least.

Running this code, we will see:

Score: 0.37222424149513245	 Topic ID: 2 Topic: 0.006*"war" + 0.005*"human" + 0.004*"american" + 0.003*"perhaps" + 0.003*"itself" + 0.003*"often" + 0.003*"events" + 0.003*"given" + 0.003*"viewer" + 0.003*"reality"
Score: 0.17407076060771942	 Topic ID: 0 Topic: 0.005*"minutes" + 0.004*"saw" + 0.004*"maybe" + 0.004*"said" + 0.004*"seeing" + 0.004*"half" + 0.004*"nice" + 0.004*"let" + 0.004*"bond" + 0.004*"am"
Score: 0.14683006703853607	 Topic ID: 7 Topic: 0.007*"picture" + 0.006*"beautiful" + 0.006*"score" + 0.005*"oscar" + 0.005*"perfect" + 0.005*"wonderful" + 0.004*"direction" + 0.004*"brilliant" + 0.004*"cinematography" + 0.004*"cinema"
Score: 0.0867297500371933	 Topic ID: 4 Topic: 0.009*"book" + 0.008*"series" + 0.007*"harry" + 0.006*"battle" + 0.006*"fight" + 0.005*"earth" + 0.005*"alien" + 0.005*"fi" + 0.005*"sci" + 0.005*"evil"
Score: 0.07743233442306519	 Topic ID: 9 Topic: 0.024*"allen" + 0.014*"western" + 0.014*"woody" + 0.011*"paris" + 0.011*"mann" + 0.008*"jim" + 0.008*"keaton" + 0.008*"leone" + 0.008*"cooper" + 0.007*"adams"
Score: 0.07302337139844894	 Topic ID: 6 Topic: 0.008*"father" + 0.006*"mother" + 0.006*"son" + 0.005*"home" + 0.005*"friend" + 0.005*"brother" + 0.004*"finds" + 0.004*"women" + 0.004*"david" + 0.004*"town"
Score: 0.03492188826203346	 Topic ID: 5 Topic: 0.011*"dead" + 0.008*"thriller" + 0.007*"house" + 0.007*"killer" + 0.007*"night" + 0.006*"murder" + 0.006*"police" + 0.006*"michael" + 0.005*"violence" + 0.005*"blood"
Score: 0.015267780050635338	 Topic ID: 1 Topic: 0.010*"humor" + 0.010*"hilarious" + 0.009*"laugh" + 0.009*"kids" + 0.009*"jokes" + 0.008*"laughs" + 0.006*"comedies" + 0.006*"steve" + 0.005*"sex" + 0.005*"nick"
Score: 0.009802566841244698	 Topic ID: 3 Topic: 0.013*"jones" + 0.012*"robert" + 0.012*"james" + 0.010*"ford" + 0.010*"williams" + 0.009*"de" + 0.008*"spielberg" + 0.008*"grant" + 0.007*"henry" + 0.007*"scott"
Score: 0.0096972631290555	 Topic ID: 8 Topic: 0.018*"dr" + 0.015*"peter" + 0.010*"agent" + 0.009*"president" + 0.009*"bruce" + 0.008*"jackson" + 0.007*"douglas" + 0.007*"willis" + 0.007*"lee" + 0.007*"spider"

We can see that the first document belongs mainly to topic 2, with a probability of 0.372, the largest among all 10 topics.
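
We can score an unseen document the same way: tokenize it, convert it to a bag of words with the same dictionary, and pass it to the model. A minimal sketch (the review text here is made up for illustration):

new_doc = "a beautiful film with a wonderful score and great direction"  # hypothetical review
new_tokens = tokenizer.tokenize(new_doc.lower())
new_bow = dictionary.doc2bow(new_tokens)

# Topic distribution of the unseen document.
print(model[new_bow])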