N-gram models are widely used in NLP. In this tutorial, we will show how to create word-level and sentence-level n-grams with Python. You can use the example code as a starting point for your own NLP research.
What are n-grams?
N-grams can be built at different levels.
At the sentence level, the units are words. Take this sentence:
“this is a good blog site.”
1-grams (unigrams) can be: this, is, a, good, blog, site, .
2-grams (bigrams) can be: this is, is a, a good, good blog, blog site, site .
3-grams (trigrams) can be: this is a, is a good, a good blog, good blog site, blog site .
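The sentence-level lists above can be reproduced with a sliding window over the tokens. This is only a minimal sketch: the function name sentence_ngrams and the regex tokenizer are our own assumptions for illustration, chosen so that the final period becomes its own token as in the lists above.

```python
import re

def sentence_ngrams(sentence, num=2):
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    # Slide a window of `num` tokens across the token list.
    return [" ".join(tokens[i:i + num]) for i in range(len(tokens) - num + 1)]

print(sentence_ngrams("this is a good blog site.", 2))
print(sentence_ngrams("this is a good blog site.", 3))
```

If the sentence has fewer than num tokens, this sketch returns an empty list.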
However, if we apply n-grams at the word level, the units are characters instead of words:
For the word: this
1-grams: t, h, i, s
2-grams: th, hi, is
3-grams: thi, his
How to get word-level n-grams?
Create a Python function to extract word-level n-grams:
def extract_word_ngrams(word, num = 3):
    word = word.strip()
    word = word.lower()
    # pad the word with '#' if it is shorter than num
    if len(word) < num:
        word = format(word, '#^' + str(num))
    grams = []
    wlen = len(word)
    for i in range(wlen - num + 1):
        w = word[i:i + num]
        grams.append(w)
    return grams
In this function, we should pay attention to the case where the word is shorter than num.
For example:
num = 3, word = go
We should pad the word to length 3, which gives go#
To learn how to pad a string to a given length, you can read this tutorial:
Best Practice to Pad Python String up to Specific Length – Python Tutorial
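To see what the padding step does, here is a small sketch. The helper name pad_center is hypothetical and only wraps the same format() call used in extract_word_ngrams; ljust() and rjust() are shown as alternatives in case we prefer to pad on one side only.

```python
def pad_center(word, num, fill="#"):
    # Format spec: fill character, '^' for centering, then the target width.
    return format(word, fill + "^" + str(num))

print(pad_center("go", 3))      # go#
print(pad_center("i", 3))       # #i#
print(pad_center("python", 3))  # python (already long enough, unchanged)
print("go".ljust(3, "#"))       # go#  (pad on the right only)
print("go".rjust(3, "#"))       # #go  (pad on the left only)
```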
How to use this function?
grams = extract_word_ngrams(word='python', num = 3)
print(grams)
With num = 3, the word 'python' is split into:
['pyt', 'yth', 'tho', 'hon']
Extract word-level n-grams in a sentence with Python
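import nltk

def extract_sentence_ngrams(sentence, num = 3):
    # split the sentence into words and punctuation tokens
    words = nltk.word_tokenize(sentence)
    grams = []
    for w in words:
        # extract the n-grams of each word separately
        w_grams = extract_word_ngrams(w, num)
        grams.append(w_grams)
    return grams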
We split the sentence into a list of words, then extract the n-grams of each word.
data = 'i like writting.'
grams = extract_sentence_ngrams(data, 3)
print(grams)
The n-grams for this sentence are:
[['#i#'], ['lik', 'ike'], ['wri', 'rit', 'itt', 'tti', 'tin', 'ing'], ['#.#']]
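The result is a nested list with one inner list per word. If a flat list or n-gram counts are more convenient for later processing, one possible approach is sketched below. It starts from the output shown above; the variable names are our own, and using itertools.chain and collections.Counter is just one option, not part of the original function.

```python
from collections import Counter
from itertools import chain

# The nested result printed above, copied here so the example runs on its own.
nested = [['#i#'], ['lik', 'ike'], ['wri', 'rit', 'itt', 'tti', 'tin', 'ing'], ['#.#']]

# Flatten the per-word lists into a single list of n-grams.
flat = list(chain.from_iterable(nested))
print(flat)

# Count how often each n-gram occurs.
counts = Counter(flat)
print(counts.most_common(3))
```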