One-hot encoding is commonly used in deep learning; n-grams must be encoded as vectors before they can be used to train a model. In this tutorial, we will introduce how to encode n-grams as one-hot vectors.
What is one-hot encoding?
Consider the sentence “i like writing” as our vocabulary.
The size of this vocabulary is 3.
Encoding each word of this sentence gives:
i: [1, 0, 0]
like: [0, 1, 0]
writing: [0, 0, 1]
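To make this concrete, here is a minimal sketch in plain Python (the names vocab and one_hot are our own, not from the tutorial) that builds these vectors:

vocab = ['i', 'like', 'writing']
# each word gets a vector with 1 at its own index and 0 elsewhere
one_hot = {w: [1 if j == i else 0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}
print(one_hot['like'])  # [0, 1, 0]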
Encode n-grams to one-hot encoding
As to the sentence “i like writing”, its trigrams (3-grams) are:
[‘#i#’, ‘lik’, ‘ike’, ‘wri’, ‘rit’, ‘iti’, ‘tin’, ‘ing’]
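The tutorial does not show the code behind this list, but the following sketch reproduces it (the function char_trigrams and the pad parameter are our own assumptions): words shorter than n are padded with '#' on both sides, then a window of length n slides over each word.

def char_trigrams(sentence, n=3, pad='#'):
    grams = []
    for word in sentence.split():
        # pad words shorter than n so they still yield one n-gram
        if len(word) < n:
            word = pad + word + pad
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(char_trigrams('i like writing'))
# ['#i#', 'lik', 'ike', 'wri', 'rit', 'iti', 'tin', 'ing']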
To create word-level n-grams, you can read this tutorial:
A Simple Guide to Flatten Python List for Beginners – NLTK Tutorial
Here we will encode these n-grams as one-hot vectors with numpy.
Import library
import numpy as np
Prepare n-grams
# trigrams of "i like writing"
grams = ['#i#', 'lik', 'ike', 'wri', 'rit', 'iti', 'tin', 'ing']
Create a function to encode n-grams
def get_one_hot(targets, nb_classes):
    # np.eye(nb_classes) is an identity matrix; row k is the one-hot vector for class k.
    # Indexing it with the flattened targets selects the matching rows.
    res = np.eye(nb_classes)[np.array(targets).reshape(-1)]
    # restore the original shape of targets (expected to be a numpy array),
    # with one extra axis of size nb_classes
    return res.reshape(list(targets.shape) + [nb_classes])
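The trick is that np.eye(nb_classes) is an identity matrix whose row k is exactly the one-hot vector for class k, so indexing it with an integer array selects all the needed rows at once. A quick check of this idea (our own example, not from the tutorial):

print(np.eye(4)[np.array([2, 0])])
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]]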
Get the one-hot encoding of n-grams
words_count = len(grams)
one_hots = get_one_hot(np.arange(words_count), words_count)
print(one_hots)
The one-hot encoding result is:
[[1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]]
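Each row above corresponds to one trigram in grams, in order. If you need to look up the vector for a specific trigram, one simple way (a hypothetical helper, not part of the tutorial) is to zip them into a dictionary:

# map each trigram to its one-hot vector
gram_to_vec = dict(zip(grams, one_hots))
print(gram_to_vec['lik'])  # [0. 1. 0. 0. 0. 0. 0. 0.]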