A Gentle Introduction to Word Embeddings

How to Encode Lexical Semantics

Prerequisites

  1. Some knowledge of Python and some familiarity with the command line
  2. A high-level understanding of how neural networks learn (i.e. what “training” is)

If you’re stuck, feel free to ask me for guidance at helpme@kirubarajan.com.

What Are Word Embeddings?

Word embeddings are a recent yet powerful development in deep learning research for natural language processing. They allow us to capture semantic meaning using familiar linear algebra tools. More specifically, vector representations of words give neural networks much richer inputs than one-hot encodings do.

Essentially, word embeddings place the words of a given vocabulary in a continuous, high-dimensional vector space. The dimensionality is typically in the hundreds; the pre-trained vectors we use later in this tutorial have 300 dimensions.

Word embeddings themselves are the product of feed-forward neural networks, usually trained by maximizing the log-likelihood on a given training dataset. This means that by training on different corpora of text, we can even generate domain-specific word embeddings (finance, food, medicine, etc.).

A good way to build intuition for word embeddings is through nearest neighbors and analogies, the “hello world” examples of word vectors. For example, with trained word embeddings, we can find that the nearest vector to “cat” is “kitten”. More precisely, since we are working with vectors, we define distance as the cosine distance between two vectors.
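To make this concrete, here is a small sketch of cosine distance using NumPy; the three-dimensional vectors below are made up purely for illustration (real embeddings have hundreds of dimensions):

import numpy as np

def cosine_distance(a, b):
    # 1 - cos(angle between a and b): near 0 for similar directions, near 1 for unrelated ones
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy "embeddings" purely for illustration
cat = np.array([0.8, 0.1, 0.3])
kitten = np.array([0.7, 0.2, 0.3])
car = np.array([0.1, 0.9, 0.2])

print(cosine_distance(cat, kitten))  # small: "cat" and "kitten" point in similar directions
print(cosine_distance(cat, car))     # larger: "cat" and "car" do not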

A more visual example is comparing the vectors of “boy” and “girl” with those of “king” and “queen”. We can see that the vector “boy” - “girl” is similar to “king” - “queen”. We can phrase this as an analogy: “king is to boy as queen is to girl”.

Another example that we can derive using simple vector algebra is “Steve Jobs is to Apple as Bill Gates is to Microsoft”.
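If you want to try these analogies yourself once the model is set up later in the tutorial, a hedged sketch using pymagnitude might look like the following; the file name google_news.magnitude and the exact results are assumptions that depend on the vectors you download:

from pymagnitude import Magnitude

vectors = Magnitude('google_news.magnitude')

# "king" - "boy" + "girl" should land near "queen"
print(vectors.most_similar(positive=['king', 'girl'], negative=['boy'], topn=1))

# "Steve_Jobs" - "Apple" + "Microsoft" should land near "Bill_Gates"
# (the Google News vectors store multi-word names with underscores)
print(vectors.most_similar(positive=['Steve_Jobs', 'Microsoft'], negative=['Apple'], topn=1))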

How Do We Create Word Embeddings?

A widely used method for generating trained word embeddings is Word2Vec, which is what we will use as a reference in this tutorial. Word2Vec generates word embeddings through one of two models, which differ in what they are trained to predict:

  1. Continuous Bag of Words (CBOW): predicts a given missing word in a sentence/phrase based on its context (faster but less specific)
  2. Skip-gram: given a word, predicts the words that will co-appear near it (slower but works better for infrequent words)

If you notice, each model is essentially the inverse of the other. This makes it easier to understand the intuition behind how Word2Vec generates word embeddings.

Implementation

Training word embeddings on a given dataset is particularly easy using gensim, a Python package that abstracts away the implementation of the Word2Vec neural network; it is the most commonly used Python package for generating word embeddings. However, Google’s TensorFlow is also a good choice if you want to dig into the underlying deep learning logic, for which the TensorFlow Word2Vec Tutorial is an excellent resource.
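As a point of reference, a minimal gensim training sketch might look like the following; the toy corpus and parameter values are placeholders, and the sg flag switches between the two models described above (sg=0 for Continuous Bag of Words, sg=1 for Skip-gram):

from gensim.models import Word2Vec

# a toy corpus: in practice this would be many tokenized sentences
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'kitten', 'slept', 'on', 'the', 'mat'],
]

# sg=0 trains CBOW, sg=1 trains Skip-gram (parameter names follow gensim 4.x)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv['cat'])  # the learned 50-dimensional vector for "cat"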

In this tutorial, we will instead use pre-trained word embeddings from Google (trained on the Google News corpus). Using trusted pre-trained models lets us quickly play with word vectors and prototype with deep learning faster, since such models have already been shown to work well in practice.

Tutorial

First, we’ll create a new folder named project (and change into it) that will contain our model (Magnitude vectors), our Python code and our (optional) server/UI.

mkdir project && cd project

Next, we’ll create a virtual environment named "venv" to track our Python dependencies and to prevent our project from interfering with other projects on our computer. Read my article here for a quick tutorial on how to set up virtualenv.

virtualenv -p python3 venv
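To make sure the packages below are installed into this environment rather than globally, activate it first (on macOS/Linux; on Windows the script lives under venv\Scripts):

source venv/bin/activate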

We know we’ll be using the Magnitude word embeddings, so we’ll install the pymagnitude package as well as NLTK (a common natural language library with some helpful tools) and NumPy (a library for fast numerical computation in Python).

pip install pymagnitude nltk numpy

Finally we’ll save these dependencies for future reference.

pip freeze > requirements.txt

Let’s create the file that will contain our recommendation engine.

touch engine.py

For now, we’ll define a recommend function that will return a list of recommendations in descending order of relevance. We’ll begin by importing all the necessary packages and adding a quick stub to fill in.

import nltk
import numpy as np
from pymagnitude import Magnitude

def recommend(item, items):
    pass
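To sketch where this stub is headed (one possible approach, not the only one): load the Magnitude vectors once, embed each item by averaging its word vectors, and rank the candidate items by cosine similarity to the query item. The file name google_news.magnitude is assumed to match whatever pre-trained file you downloaded, and nltk.word_tokenize needs the punkt tokenizer data (nltk.download('punkt')) the first time it runs.

vectors = Magnitude('google_news.magnitude')

def embed(text):
    # represent a piece of text as the average of its word vectors
    tokens = nltk.word_tokenize(text)
    return np.mean(vectors.query(tokens), axis=0)

def recommend(item, items):
    # rank the candidate items by cosine similarity to the query item
    query = embed(item)
    def similarity(other):
        candidate = embed(other)
        return np.dot(query, candidate) / (np.linalg.norm(query) * np.linalg.norm(candidate))
    return sorted(items, key=similarity, reverse=True)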

(Optional) Displaying Our Recommender in a Website

There’s no point making anything that can’t be shared. In this section, we’ll wrap our model and engine logic in a Flask server that can be deployed for the world to see.

Let’s first install Flask and add it to our requirements.

pip install Flask && pip freeze > requirements.txt

We’ll create a file named main.py and start by importing Flask and our recommendation engine.

from flask import Flask
from engine import recommend
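Continuing main.py, a minimal (hypothetical) route might wrap recommend so it can be called over HTTP; the route name and query parameters below are assumptions rather than anything fixed by the engine, and we also need request and jsonify from flask:

from flask import jsonify, request

app = Flask(__name__)

@app.route('/recommend')
def recommend_route():
    # e.g. GET /recommend?item=pizza&items=sushi&items=pasta
    item = request.args.get('item')
    items = request.args.getlist('items')
    return jsonify(recommendations=recommend(item, items))

if __name__ == '__main__':
    app.run(debug=True)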

(Optional) Visualizing Word Embeddings

It would be cool to visualize the word vectors. Sadly, we humans are mostly incapable of visualizing 300 dimensions.

Instead, we can use a process called dimensionality reduction, which lets us turn our 300-dimensional vectors into regular 2D vectors (without losing too much information) that we can plot.
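As a hedged sketch (assuming scikit-learn and matplotlib are installed, neither of which is in our requirements yet), principal component analysis (PCA) is one simple way to project the 300-dimensional vectors down to two dimensions; t-SNE is a popular alternative:

import matplotlib.pyplot as plt
from pymagnitude import Magnitude
from sklearn.decomposition import PCA

vectors = Magnitude('google_news.magnitude')
words = ['king', 'queen', 'boy', 'girl', 'cat', 'kitten']

# project the 300-dimensional vectors down to 2 dimensions
points = PCA(n_components=2).fit_transform(vectors.query(words))

# scatter the 2D points and label each one with its word
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()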

(Optional) Wrapping Magnitude as Gensim

A lot of “legacy” natural language processing code still uses the gensim package, which has a different API from our faster pymagnitude package. In order to interface with code that expects gensim, we can write a wrapper class that exposes the same API as a gensim model:

class Word2Vec:
    def __init__(self, vectors):
        self.vectors = vectors
        # older gensim models expose the vector dimensionality as layer1_size
        self.layer1_size = self.vectors.dim

    def __getitem__(self, word):
        # allows lookups like w2v['cat']
        return self.vectors.query(word)

    def __contains__(self, word):
        # allows membership checks like 'cat' in w2v
        return word in self.vectors

    def dim(self):
        return self.vectors.dim

Using this, we can wrap our magnitude model as follows:

vectors = Magnitude('google_news.magnitude')
w2v = Word2Vec(vectors)

And use it like gensim as follows:

cat_vector = w2v['cat']
