Word embeddings can be created with Word2Vec or GloVe; both are commonly used in NLP. In this tutorial, we will introduce how to create word embeddings from text using GloVe. If you want to use Word2Vec instead, you can read:
Best Practice to Create Word Embeddings Using Word2Vec – Word2Vec Tutorial
How to create word embeddings using GloVe?
Step 1: Download GloVe source code
You can download it from: https://github.com/stanfordnlp/GloVe , for example the file GloVe-1.2.zip
Step 2: Unpack the files
unzip GloVe-1.2.zip
Step 3: Compile the source
cd GloVe-1.2 && make
Step 4: Download corpus file and unpack it
You can download the text8 corpus from: http://mattmahoney.net/dc/text8.zip and unpack it in the GloVe-1.2 folder.
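Before training, it can help to sanity-check the corpus. text8 is one long line of lowercase, space-separated tokens; a minimal Python sketch to count tokens and distinct words (the helper name is ours, and the filename assumes text8 was unpacked as in Step 4):

```python
def corpus_stats(path):
    """Count tokens and distinct words in a whitespace-tokenized corpus file."""
    tokens = 0
    vocab = set()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            words = line.split()
            tokens += len(words)
            vocab.update(words)
    return tokens, len(vocab)

if __name__ == "__main__":
    # "text8" assumed to be in the current directory, per Step 4
    total, distinct = corpus_stats("text8")
    print(total, "tokens,", distinct, "distinct words")
```

Words that occur fewer than VOCAB_MIN_COUNT times (5 by default, see demo.sh below) will be dropped from the vocabulary, so the trained vocabulary will be smaller than the distinct-word count.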
Step 5: Edit demo.sh
#!/bin/bash

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

CORPUS=text8                                  # the corpus file
VOCAB_FILE=vocab.txt                          # the vocabulary file
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors                             # the word embeddings file
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5                             # the min count of a word in vocabulary file
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
if [[ $? -eq 0 ]]
then
  $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
  if [[ $? -eq 0 ]]
  then
    $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
    if [[ $? -eq 0 ]]
    then
      $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
      if [[ $? -eq 0 ]]
      then
        if [ "$1" = 'matlab' ]; then
          matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
        elif [ "$1" = 'octave' ]; then
          octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
        else
          python eval/python/evaluate.py
        fi
      fi
    fi
  fi
fi
You should edit:
CORPUS: the path and name of your corpus file
VECTOR_SIZE: the dimensionality of the word embedding vectors
WINDOW_SIZE: the context window size (number of words on each side of the target word)
Step 6: Execute demo.sh
./demo.sh
What is X_MAX?
It is the cutoff of the weighting function f(x) that GloVe applies to each word/context co-occurrence count when computing the cost: counts below X_MAX are down-weighted by (x / x_max)^alpha, while counts at or above X_MAX get a weight of 1. This keeps very frequent co-occurrences from dominating the squared error.
For example:
/* Calculate cost, save diff for gradients */
diff = 0;
for (b = 0; b < vector_size; b++)
    diff += W[b + l1] * W[b + l2];  // dot product of word and context word vector
diff += W[vector_size + l1] + W[vector_size + l2] - log(cr.val);  // add separate bias for each word
fdiff = (cr.val > x_max) ? diff : pow(cr.val / x_max, alpha) * diff;  // multiply weighting function (f) with diff
// Check for NaN and inf() in the diffs.
if (isnan(diff) || isnan(fdiff) || isinf(diff) || isinf(fdiff)) {
fprintf(stderr,"Caught NaN in diff for kdiff for thread. Skipping update");
continue;
}
cost[id] += 0.5 * fdiff * diff; // weighted squared error
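The weighting function from the snippet above can be written on its own. With the demo.sh default X_MAX=10 and GloVe's default alpha of 0.75, a Python sketch:

```python
def glove_weight(x, x_max=10.0, alpha=0.75):
    """GloVe weighting f(x): down-weight rare co-occurrence counts,
    cap the weight of frequent ones at 1."""
    if x >= x_max:
        return 1.0
    return (x / x_max) ** alpha
```

For example, a pair that co-occurs once gets weight (1/10)^0.75 (about 0.18), while any pair co-occurring 10 or more times gets the full weight of 1, so the model is not pulled toward fitting noisy, rarely-seen pairs as hard as well-attested ones.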