Encoding word n-grams to one-hot vectors is simple with NumPy; you can read the tutorial below to learn how to implement it.
Encode Word N-grams to One-Hot Encoding with Numpy – Deep Learning Tutorial
However, this method needs a large amount of memory. For example, if the vocabulary size is 500,000, the one-hot encoding matrix is 500,000 * 500,000, which may fail to build if your memory is limited. We call this the static method.
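As a minimal sketch (not from the tutorial linked above), the static method can be implemented by building the full identity matrix up front, for example with np.eye():

import numpy as np

# Static method (sketch): build the full vocabulary-sized one-hot matrix up front.
# For a vocabulary of size V this allocates a V * V matrix, which is the memory
# problem described above (500,000 * 500,000 entries for a large vocabulary).
vocab_size = 8
static_one_hots = np.eye(vocab_size, dtype=int)
print(static_one_hots.shape)  # (8, 8)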
In this tutorial, we will introduce a new way to encode n-grams to one-hot vectors: it creates a one-hot matrix dynamically and needs only a little memory. We call this the dynamic method.
Prepare n-grams
For the sentence 'i like writing', we will use its 3-grams to create one-hot encodings.
grams = ['#i#', 'lik', 'ike', 'wri', 'rit', 'iti', 'tin', 'ing']
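These grams can be reproduced programmatically. Here is a minimal sketch, assuming words shorter than three characters are padded with '#' on both sides and longer words are split with a sliding window of width 3:

def word_to_3grams(word):
    # Assumed padding rule: words shorter than 3 characters get '#'
    # on both sides, so 'i' becomes '#i#'.
    if len(word) < 3:
        word = '#' + word + '#'
    # Slide a window of width 3 over the word.
    return [word[i:i+3] for i in range(len(word) - 2)]

grams = []
for word in 'i like writing'.split():
    grams += word_to_3grams(word)
print(grams)
# ['#i#', 'lik', 'ike', 'wri', 'rit', 'iti', 'tin', 'ing']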
Select grams to create one-hot encoding
import numpy as np

index = np.array([0, 2, 4, 1, 3])
We select the grams at positions [0, 2, 4, 1, 3] to create one-hot encodings.
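To make the selection concrete, these positions refer to the following grams:

selected = [grams[i] for i in index]
print(selected)
# ['#i#', 'ike', 'rit', 'lik', 'wri']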
Create a function to create one-hot encodings by gram positions
def make_one_hot(data, vocab_size):
    # Compare each position in data against [0, 1, ..., vocab_size - 1];
    # broadcasting yields a (len(data), vocab_size) boolean matrix,
    # which we cast to integers.
    return (np.arange(vocab_size) == data[:, None]).astype(int)
Create one-hot encoding dynamically
one_hots = make_one_hot(index, vocab_size=len(grams))
print(one_hots)
The one-hot encoding result is:
[[1 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 0 0]
 [0 1 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0]]
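Because the matrix is built only from the positions we pass in, we can encode as few as one gram at a time, which is where the minimum memory usage below comes from. For example:

one_hot = make_one_hot(np.array([5]), vocab_size=len(grams))
print(one_hot)        # [[0 0 0 0 0 1 0 0]]
print(one_hot.shape)  # (1, 8)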
Compare the two methods
In this example, the vocabulary size is 8.

Static method: always needs an 8 * 8 matrix.

Dynamic method: needs at most an 8 * 8 matrix and as little as a 1 * 8 matrix; in this example we only built a 5 * 8 matrix for the five selected positions.
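To see why this matters at scale, here is a rough memory estimate for the 500,000-word vocabulary mentioned above, assuming 1 byte per matrix entry:

vocab_size = 500_000

# Static method: the full vocab_size * vocab_size matrix.
static_bytes = vocab_size * vocab_size
print(static_bytes / 1e9)  # 250.0 -> about 250 GB

# Dynamic method: one batch of, say, 5 selected grams.
batch_bytes = 5 * vocab_size
print(batch_bytes / 1e6)   # 2.5 -> about 2.5 MB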