Encode Word N-grams to One-Hot Encoding Dynamically with Numpy – Deep Learning Tutorial

September 19, 2019

Encoding word n-grams to one-hot vectors is simple with numpy; you can read the tutorial below to see how to implement it.

Encode Word N-grams to One-Hot Encoding with Numpy – Deep Learning Tutorial

However, this approach needs a large amount of memory. For example, if the vocabulary size is 500,000, the one-hot encoding matrix is 500,000 × 500,000; even at one byte per entry that is roughly 250 GB, which may not fit in memory. We call this the static method.

In this tutorial, we will introduce a new way to encode n-grams to one-hot vectors: it creates the one-hot matrix dynamically, only for the grams you actually need, and therefore requires very little memory. We call this the dynamic method.

Prepare n-grams

For the sentence 'i like writing', we will use its character 3-grams to create one-hot encodings.

grams = ['#i#', 'lik', 'ike', 'wri', 'rit', 'iti', 'tin', 'ing']
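
The list above appears to come from sliding a window of three characters over each word, padding words shorter than three characters with '#'. Here is a minimal sketch of that assumption; the helper name char_ngrams is our own, not from the original post.

def char_ngrams(sentence, n=3, pad='#'):
    # Hypothetical helper: the original post does not show how the grams
    # were generated; this reproduces the list above under the padding
    # assumption described in the lead-in.
    grams = []
    for word in sentence.split():
        if len(word) < n:
            word = word.center(n, pad)  # e.g. 'i' -> '#i#'
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(char_ngrams('i like writing'))
# ['#i#', 'lik', 'ike', 'wri', 'rit', 'iti', 'tin', 'ing']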

Select grams to create one-hot encoding

import numpy as np

index = np.array([0, 2, 4, 1, 3])

We select the grams at positions [0, 2, 4, 1, 3] to create one-hot encodings.
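
For reference, those positions pick out the following grams:

print([grams[i] for i in index])
# ['#i#', 'ike', 'rit', 'lik', 'wri']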

Create a function to build one-hot encodings from gram positions

def make_one_hot(data, vocab_size):
    # Comparing the (len(data), 1) column of positions against the
    # (vocab_size,) row 0..vocab_size-1 broadcasts to a boolean matrix
    # with a single True per row, which we cast to integers.
    return (np.arange(vocab_size) == data[:, None]).astype(int)
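
To see why this works, note that data[:, None] turns the index array into a column, and comparing it against np.arange(vocab_size) broadcasts to a 2-D boolean grid. A small check:

print(np.arange(5) == np.array([0, 2])[:, None])
# [[ True False False False False]
#  [False False  True False False]]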

Create one-hot encoding dynamically

one_hots = make_one_hot(index, vocab_size=len(grams))
print(one_hots)

The one-hot encoding result is:

[[1 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0]]
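
Going the other way, np.argmax recovers the original positions from the one-hot rows, so each row can be mapped back to its gram:

positions = np.argmax(one_hots, axis=1)
print(positions)                       # [0 2 4 1 3]
print([grams[i] for i in positions])   # ['#i#', 'ike', 'rit', 'lik', 'wri']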

Compare the two methods

In this example, the vocabulary size is 8.

Static method: always needs an 8 × 8 matrix, one row for every entry in the vocabulary, whether or not every gram is used.

Dynamic method: needs at most an 8 × 8 matrix (when all grams are encoded) and as little as a 1 × 8 row (when a single gram is encoded).
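
As an illustration of the minimum case, encoding a single gram position with the make_one_hot function above produces only one row:

print(make_one_hot(np.array([5]), vocab_size=8))
# [[0 0 0 0 0 1 0 0]]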
