If you’re stuck, feel free to ask me for guidance at firstname.lastname@example.org.
Word embeddings are a recent yet powerful trend in deep learning research in natural language processing. They allow us to capture semantic meaning using familiar linear algebra tools. More specifically, vector representations of words make better inputs for neural networks than one-hot encodings.
Essentially, word embeddings place the words of a given vocabulary in a continuous vector space. The dimensions of word vectors typically range in the hundreds.
Word embeddings themselves are the product of feed-forward neural networks, usually trained by maximizing the log-likelihood on a given training dataset. What this means is that by using different corpora of text, we can even generate domain-specific word embeddings (finance, food, medicine, etc.).
A way to visualize word embeddings is through analogies. There are certain “hello world” examples of word vectors. For example, with trained word embeddings, we can find the vectors nearest to “kitten”. More specifically, since we are working with vectors, we define distance as the cosine distance between two vectors.
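To make the distance metric concrete, here is a minimal sketch of cosine distance in NumPy. The three-dimensional vectors below are made-up stand-ins; real embeddings have hundreds of dimensions.

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity (the normalized dot product).
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional vectors, hand-picked purely for illustration.
cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.8, 0.2, 0.35])
car = np.array([0.1, 0.9, 0.5])

print(cosine_distance(cat, kitten))  # small: similar words point the same way
print(cosine_distance(cat, car))     # larger: unrelated words diverge
```

A distance of 0 means the vectors point in exactly the same direction, which is why a word is always at distance 0 from itself.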
A more visual example is comparing the vectors of “king”, “queen”, “boy”, and “girl”. We can see that the vector “boy” - “girl” is similar to “king” - “queen”. We can create even more analogies by instead phrasing it as “king is to boy as queen is to girl”.
Another example that we can derive using simple vector algebra is
“Steve Jobs is to Apple as Bill Gates is to Microsoft”.
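The analogy arithmetic can be sketched directly with vectors: subtract one side of the analogy, add the other, and look for the nearest word. The tiny vocabulary and hand-picked vectors below are assumptions chosen so the offsets line up; a trained model learns such offsets from data.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hand-crafted toy embeddings: the last two components encode a
# "gender" offset consistently across the pairs.
vocab = {
    'king':  np.array([1.0, 1.0, 0.0]),
    'queen': np.array([1.0, 0.0, 1.0]),
    'boy':   np.array([0.2, 1.0, 0.0]),
    'girl':  np.array([0.2, 0.0, 1.0]),
}

# "king" - "boy" + "girl" should land near "queen".
target = vocab['king'] - vocab['boy'] + vocab['girl']
best = max((w for w in vocab if w != 'king'),
           key=lambda w: cosine_similarity(vocab[w], target))
print(best)  # → queen
```

Real embedding libraries expose the same idea through a nearest-neighbor query over the whole vocabulary rather than a hand-rolled `max`.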
A commonly accepted implementation for generating trained word embeddings is Word2Vec, which is what we will use as a reference in this tutorial. Word2Vec generates word embeddings through one of two models: continuous bag-of-words (CBOW), which predicts a word from its surrounding context, and skip-gram, which predicts the surrounding context from a word. As you may notice, the two models are essentially inverses of each other, which makes the intuition behind how Word2Vec generates word embeddings easy to grasp.
Training word embeddings with a given dataset is particularly easy using gensim, a Python package that abstracts the implementation of the Word2Vec neural network. It is the most commonly used Python package for generating word embeddings. However, Google’s TensorFlow is also a good choice if you want a deeper understanding of the underlying deep learning logic, for which the TensorFlow Word2Vec Tutorial is an excellent resource.
In this tutorial, we will instead use pre-trained word embeddings from Google (trained on Google News). Using trusted pre-trained models lets us play with word vectors right away and prototype with deep learning faster, since such models have already been shown to work well in practice.
First, we’ll change directory to a new folder named
project that will contain our model (Magnitude vectors), our Python code and our (optional) server/UI.
mkdir project && cd project
Next, we’ll create a virtual environment named venv to track our Python dependencies and to prevent our project from interfering with other projects on our computer. Read my article here for a quick tutorial on how to set up virtualenv.
virtualenv -p python3 venv
We know we’ll be using the Magnitude word embeddings, so we’ll install the package as well as NLTK (a common natural language library with some helpful tools) and NumPy (a library that makes numerical computation easy in Python).
pip install pymagnitude nltk numpy
Finally we’ll save these dependencies for future reference.
pip freeze > requirements.txt
Let’s create a file named engine.py that will contain our recommendation engine. For now, we’ll implement a recommend function that will return a list of recommendations in descending order of relevance. We’ll begin by importing all the necessary packages and adding a quick stub to implement later.
import nltk
import numpy as np
from pymagnitude import Magnitude

def recommend(item, items):
    pass
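One possible way to fill in that stub is to rank candidates by cosine similarity to the query. The sketch below operates on precomputed NumPy vectors rather than the Magnitude model, and the “menu” items and their two-dimensional vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def recommend(item_vector, item_vectors):
    # Score every candidate against the query vector, then sort
    # so the most similar items come first.
    scored = [(name, cosine_similarity(item_vector, vec))
              for name, vec in item_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical precomputed vectors for a few items.
menu = {
    'cheeseburger':  np.array([0.9, 0.1]),
    'veggie burger': np.array([0.8, 0.3]),
    'milkshake':     np.array([0.1, 0.9]),
}
print(recommend(np.array([1.0, 0.0]), menu))
```

In the real engine, the vectors would come from querying the Magnitude model instead of a hand-written dictionary.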
There’s no point making anything that can’t be shared. In this section, we’ll wrap our model and engine logic in a Flask server that can be deployed for the world to see.
Let’s first install Flask and add it to our requirements.
pip install Flask && pip freeze > requirements.txt
We’ll create a file named
main.py and start by importing Flask and our recommendation engine.
from flask import Flask
from engine import recommend
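A minimal sketch of such a server follows. The `/recommend` route and its `item` query parameter are assumptions, and an inline stand-in replaces the engine’s recommend function so the snippet runs on its own; in main.py you would keep the `from engine import recommend` import instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Inline stand-in for `from engine import recommend`, used here only
# so the sketch is self-contained.
def recommend(item, items):
    return [i for i in items if i != item]

@app.route('/recommend')
def recommend_route():
    # e.g. GET /recommend?item=burger — parameter name is an assumption.
    item = request.args.get('item', '')
    items = ['burger', 'fries', 'shake']  # hypothetical catalog
    return jsonify(recommend(item, items))

# To serve locally: `flask run`, or call app.run() from a __main__ guard.
```

Returning JSON keeps the endpoint easy to consume from any UI or script.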
It would be cool to visualize the word vectors. Sadly, we humans are mostly incapable of visualizing anything in 300 dimensions.
Instead, we can use a process called dimensionality reduction, which will allow us to reduce our 300 dimensions to regular 2D vectors (without losing too much information) that we can visualize.
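One common dimensionality-reduction technique is principal component analysis (PCA), which can be sketched in a few lines of NumPy via the SVD. The random 300-dimensional vectors below are stand-ins for real embeddings.

```python
import numpy as np

def reduce_to_2d(vectors):
    # PCA via SVD: center the data, then project onto the two
    # directions of greatest variance.
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Pretend these are 300-dimensional word vectors (random stand-ins).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 300))
points = reduce_to_2d(embeddings)
print(points.shape)  # → (10, 2), ready to scatter-plot
```

Each row of the result is an (x, y) point, so the reduced vectors can be fed straight into a scatter plot with the words as labels.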
A lot of “legacy” natural language processing code uses the gensim package, which has a different API from our faster pymagnitude package. In order to interface with the pymagnitude model, we can write a wrapper class that exposes the same API as gensim:
class Word2Vec:
    def __init__(self, vectors):
        self.vectors = vectors
        self.layer1_size = self.vectors.dim

    def __getitem__(self, word):
        return self.vectors.query(word)

    def __contains__(self, word):
        return word in self.vectors

    def dim(self):
        return self.vectors.dim
Using this, we can wrap our magnitude model as follows:
vectors = Magnitude('google_news.magnitude')
w2v = Word2Vec(vectors)
And use it like
gensim as follows:
cat_vector = w2v['cat']