LDA (Latent Dirichlet Allocation) is an unsupervised method for grouping documents into a fixed number of topics. In this tutorial, we will introduce how to build an LDA model using Python gensim.
Preliminary
We should import some libraries first.
from nltk.tokenize import RegexpTokenizer
from gensim import corpora, models
import os
Load text documents
Before we can use the gensim LDA model to classify documents, we first need to load our text documents.
data_file = '../data/imdb/train.ss'
docs = []
with open(data_file, 'rb') as f:
    for line in f:
        data = ""
        line = line.decode('utf-8', 'ignore')
        line = line.strip().split('\t\t')
        #print(line)
        data = line[3].lower()
        data = data.replace('<sssss>', '')
        docs.append(data)
print(len(docs))
print(docs[0][:500])
Here docs is a Python list that contains our documents; you can modify this code to load your own documents.
In this example, we have loaded 67426 documents.
67426 i excepted a lot from this movie , and it did deliver . there is some great buddhist wisdom in this movie . the real dalai lama is a very interesting person , and i think there is a lot of wisdom in buddhism . the music , of course , sounds like because it is by philip glass . this adds to the beauty of the movie . whereas other biographies of famous people tend to get very poor this movie always stays focused and gives a good and honest portrayal of the dalai lama . all things being equal
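The code above is tailored to the IMDB train.ss file. If your own documents are stored one per plain-text file, a minimal sketch such as the following could be used instead (the directory name ../data/my_corpus is only a hypothetical placeholder):

# A minimal sketch for loading one document per .txt file.
# The directory '../data/my_corpus' is only a placeholder example.
import os

docs = []
corpus_dir = '../data/my_corpus'
for file_name in sorted(os.listdir(corpus_dir)):
    if not file_name.endswith('.txt'):
        continue
    with open(os.path.join(corpus_dir, file_name), encoding='utf-8', errors='ignore') as f:
        docs.append(f.read().lower())
print(len(docs))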
Split the documents into tokens
After loading the documents into a Python list, we need to split each of them into tokens (words). In this tutorial, we will use nltk's RegexpTokenizer for this.
Here is some example code:
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.
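Optionally, we may also lemmatize the tokens so that inflected forms such as "movies" and "movie" are counted as one term. Here is a small sketch using NLTK's WordNetLemmatizer (it assumes the WordNet data has been downloaded with nltk.download('wordnet')):

# Optional: lemmatize tokens so inflected forms map to one term.
# Requires the WordNet data: nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]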
Remove some words we do not need
To improve the quality of the topics, we should remove tokens that carry little meaning, such as numbers, very short words, and stop words.
Here is some example code:
# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
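The code above does not yet remove stop words. One possible way to drop them is with NLTK's English stop word list, assuming the stop word data has been downloaded with nltk.download('stopwords'):

# Optional: remove common English stop words with NLTK.
# Requires the stop word data: nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
docs = [[token for token in doc if token not in stop_words] for doc in docs]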
Build dictionary and corpus
Now that each document is a list of tokens, we can use it to create a dictionary and a bag-of-words corpus.
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 10% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.1)

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
print(corpus[0])
In this tutorial, we have filtered out words that occur in fewer than 20 documents or in more than 10% of the documents.
Running this code, we may get a result as follows:
Number of unique tokens: 25080 Number of documents: 67426 [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 2)]
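Each (id, count) pair in corpus[0] is a token id from the dictionary together with the number of times that token occurs in the first document. To check which words the ids stand for, we can map them back through the dictionary:

# Map token ids in the first bag-of-words vector back to words.
print([(dictionary[token_id], count) for token_id, count in corpus[0]])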
Use dictionary and corpus to build LDA model
We can use gensim's LdaModel to create an LDA model from the dictionary and corpus. Here is an example:
from gensim.models import LdaModel

num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, it takes too much time.

# Access the dictionary once so that id2token is populated.
temp = dictionary[0]
id2word = dictionary.id2token

model_name = "./imdb-" + str(num_topics) + ".lda"

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    minimum_probability=0.0,
    eval_every=eval_every
)

model.show_topic(0)

# Save the model.
model.save(model_name)
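The model.show_topic(0) call above returns the top words and probabilities of topic 0 but does not print anything. To get a quick preview of all topics at once, we could, for example, print them like this:

# Print the top 10 words of every topic.
for topic_id, topic in model.print_topics(num_topics=num_topics, num_words=10):
    print(topic_id, topic)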
The model.save(model_name) call at the end writes the trained LDA model to disk.
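A saved model can be loaded back later with LdaModel.load(), for example:

# Load a previously saved LDA model from disk.
from gensim.models import LdaModel

loaded_model = LdaModel.load(model_name)
print(loaded_model.num_topics)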
Print the topic distribution of documents
After we have created an LDA model with gensim, we can inspect the topic distribution of a document with the code below:
for index, score in sorted(model[corpus[0]], key=lambda tup: -1 * tup[1]):
    print("Score: {}\t Topic ID: {} Topic: {}".format(score, index, model.print_topic(index, 10)))
This code displays the topic distribution of the first document, sorted from the most to the least probable topic.
Running it, we will see:
Score: 0.37222424149513245   Topic ID: 2 Topic: 0.006*"war" + 0.005*"human" + 0.004*"american" + 0.003*"perhaps" + 0.003*"itself" + 0.003*"often" + 0.003*"events" + 0.003*"given" + 0.003*"viewer" + 0.003*"reality"
Score: 0.17407076060771942   Topic ID: 0 Topic: 0.005*"minutes" + 0.004*"saw" + 0.004*"maybe" + 0.004*"said" + 0.004*"seeing" + 0.004*"half" + 0.004*"nice" + 0.004*"let" + 0.004*"bond" + 0.004*"am"
Score: 0.14683006703853607   Topic ID: 7 Topic: 0.007*"picture" + 0.006*"beautiful" + 0.006*"score" + 0.005*"oscar" + 0.005*"perfect" + 0.005*"wonderful" + 0.004*"direction" + 0.004*"brilliant" + 0.004*"cinematography" + 0.004*"cinema"
Score: 0.0867297500371933   Topic ID: 4 Topic: 0.009*"book" + 0.008*"series" + 0.007*"harry" + 0.006*"battle" + 0.006*"fight" + 0.005*"earth" + 0.005*"alien" + 0.005*"fi" + 0.005*"sci" + 0.005*"evil"
Score: 0.07743233442306519   Topic ID: 9 Topic: 0.024*"allen" + 0.014*"western" + 0.014*"woody" + 0.011*"paris" + 0.011*"mann" + 0.008*"jim" + 0.008*"keaton" + 0.008*"leone" + 0.008*"cooper" + 0.007*"adams"
Score: 0.07302337139844894   Topic ID: 6 Topic: 0.008*"father" + 0.006*"mother" + 0.006*"son" + 0.005*"home" + 0.005*"friend" + 0.005*"brother" + 0.004*"finds" + 0.004*"women" + 0.004*"david" + 0.004*"town"
Score: 0.03492188826203346   Topic ID: 5 Topic: 0.011*"dead" + 0.008*"thriller" + 0.007*"house" + 0.007*"killer" + 0.007*"night" + 0.006*"murder" + 0.006*"police" + 0.006*"michael" + 0.005*"violence" + 0.005*"blood"
Score: 0.015267780050635338   Topic ID: 1 Topic: 0.010*"humor" + 0.010*"hilarious" + 0.009*"laugh" + 0.009*"kids" + 0.009*"jokes" + 0.008*"laughs" + 0.006*"comedies" + 0.006*"steve" + 0.005*"sex" + 0.005*"nick"
Score: 0.009802566841244698   Topic ID: 3 Topic: 0.013*"jones" + 0.012*"robert" + 0.012*"james" + 0.010*"ford" + 0.010*"williams" + 0.009*"de" + 0.008*"spielberg" + 0.008*"grant" + 0.007*"henry" + 0.007*"scott"
Score: 0.0096972631290555   Topic ID: 8 Topic: 0.018*"dr" + 0.015*"peter" + 0.010*"agent" + 0.009*"president" + 0.009*"bruce" + 0.008*"jackson" + 0.007*"douglas" + 0.007*"willis" + 0.007*"lee" + 0.007*"spider"
We can see that the first document is assigned mainly to topic 2, with a probability of about 0.372, the highest among all 10 topics.
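The same idea can be applied to the whole corpus. Here is a small sketch that records the dominant topic of every document using gensim's get_document_topics():

# Find the dominant topic of every document in the corpus.
dominant_topics = []
for bow in corpus:
    topic_dist = model.get_document_topics(bow, minimum_probability=0.0)
    best_topic, best_score = max(topic_dist, key=lambda tup: tup[1])
    dominant_topics.append(best_topic)

print(dominant_topics[:10])  # Dominant topic id of the first 10 documents.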