Word embeddings can be created with Word2Vec or GloVe; both are commonly used in NLP. In this tutorial, we will introduce how to create word embeddings from text using GloVe. If you want to use Word2Vec instead, you can read:
Best Practice to Create Word Embeddings Using Word2Vec – Word2Vec Tutorial
How to create word embeddings using GloVe?
Step 1: Download GloVe source code
You can download it from: https://github.com/stanfordnlp/GloVe , for example the file GloVe-1.2.zip
Step 2: Unpack the files
unzip GloVe-1.2.zip
Step 3: Compile the source
cd GloVe-1.2 && make
Step 4: Download corpus file and unpack it
You can download the text8 corpus from: http://mattmahoney.net/dc/text8.zip and unpack it in the GloVe-1.2 folder.
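Before training, it can help to sanity-check the corpus. text8 is one long line of lowercase, space-separated tokens; a minimal Python sketch to count tokens and distinct words (the helper name is ours, and the filename assumes text8 was unpacked as in Step 4):

```python
def corpus_stats(path):
    """Count tokens and distinct words in a whitespace-tokenized corpus file."""
    tokens = 0
    vocab = set()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            words = line.split()
            tokens += len(words)
            vocab.update(words)
    return tokens, len(vocab)

if __name__ == "__main__":
    # "text8" assumed to be in the current directory, per Step 4
    total, distinct = corpus_stats("text8")
    print(total, "tokens,", distinct, "distinct words")
```

Words that occur fewer than VOCAB_MIN_COUNT times (5 by default, see demo.sh below) will be dropped from the vocabulary, so the trained vocabulary will be smaller than the distinct-word count.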
Step 5: Edit demo.sh
#!/bin/bash

# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python

CORPUS=text8                                  # the corpus file
VOCAB_FILE=vocab.txt                          # the vocabulary file
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors                             # the word embeddings file
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5                             # the min count of a word in vocabulary file
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10

$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
if [[ $? -eq 0 ]]
then
  $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
  if [[ $? -eq 0 ]]
  then
    $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
    if [[ $? -eq 0 ]]
    then
      $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
      if [[ $? -eq 0 ]]
      then
        if [ "$1" = 'matlab' ]; then
          matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
        elif [ "$1" = 'octave' ]; then
          octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
        else
          python eval/python/evaluate.py
        fi
      fi
    fi
  fi
fi
You should edit:
CORPUS: the path and name of your corpus file
VECTOR_SIZE: the dimensionality of the word embedding vectors
WINDOW_SIZE: the context window size (number of words on each side of the target word)
Step 6: Execute demo.sh
./demo.sh
What is X_MAX?
It is the cutoff of the weighting function f(x) that GloVe applies to each word/context co-occurrence count when computing the cost: counts below X_MAX are down-weighted by (x / x_max)^alpha, while counts at or above X_MAX get a weight of 1. This keeps very frequent co-occurrences from dominating the squared error.
For example:
/* Calculate cost, save diff for gradients */
diff = 0;
for (b = 0; b < vector_size; b++)
    diff += W[b + l1] * W[b + l2];  // dot product of word and context word vector
diff += W[vector_size + l1] + W[vector_size + l2] - log(cr.val);  // add separate bias for each word
fdiff = (cr.val > x_max) ? diff : pow(cr.val / x_max, alpha) * diff;  // multiply weighting function (f) with diff
// Check for NaN and inf() in the diffs.
if (isnan(diff) || isnan(fdiff) || isinf(diff) || isinf(fdiff)) {
fprintf(stderr,"Caught NaN in diff for kdiff for thread. Skipping update");
continue;
}
cost[id] += 0.5 * fdiff * diff; // weighted squared error
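The weighting function from the snippet above can be written on its own. With the demo.sh default X_MAX=10 and GloVe's default alpha of 0.75, a Python sketch:

```python
def glove_weight(x, x_max=10.0, alpha=0.75):
    """GloVe weighting f(x): down-weight rare co-occurrence counts,
    cap the weight of frequent ones at 1."""
    if x >= x_max:
        return 1.0
    return (x / x_max) ** alpha
```

For example, a pair that co-occurs once gets weight (1/10)^0.75 (about 0.18), while any pair co-occurring 10 or more times gets the full weight of 1, so the model is not pulled toward fitting noisy, rarely-seen pairs as hard as well-attested ones.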