September 2017
By Roland Meertens


Roland Meertens is a research engineer implementing tools that help with the translation of technical documents. His background in Artificial Intelligence has given him an understanding of neural network techniques. You can get a glimpse of his many interesting projects on his blog Pinch of Intelligence.


Natural language processing: Understanding human meaning

In only a few years, embeddings have changed the world of natural language processing. How far have we come in teaching an artificial brain to understand natural language?

For many decades, computer scientists have been trying to teach computers to understand human language. It is a difficult task: sentences that are easy for us humans to understand can be incredibly complex for machines. A major reason is that humans see the meaning behind words. We know how to put a word in context, how to reason with it, and how to use it with relevance. But computer scientists have struggled to teach this sort of deep understanding to a computer. Only recently has a new technique emerged that promises unprecedented advances: embeddings. In this article, I try to shed some light on this new approach to natural language processing.

The problem with one-hot vectors

Many computer programs rely on self-learning algorithms: mathematical problem solvers that follow a "numbers in, numbers out" approach. For example, we might predict the weather by feeding an algorithm six numbers: temperature, air pressure, cloud cover, yesterday's temperature, wind speed, and the length of the day. Visualizing six dimensions is difficult, but applying learning algorithms to them is easy. And having these numbers is extremely useful, as it allows you to apply mathematical formulas to predict the weather.

To use self-learning algorithms in natural language processing, we have to convert words to numbers. One way to represent a word is to create a so-called "one-hot vector". This vector consists of zeros and a single one. To simplify the idea, let's pretend we only have three words in our system: "machine learning rules". The one-hot vectors for these words would be:

machine: [1 0 0]

learning:  [0 1 0]

rules:  [0 0 1]

These vectors can already be used in many machine-learning algorithms!
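In code, building such vectors takes only a few lines. This is a minimal sketch for the three-word vocabulary above; the helper name is illustrative, not part of any library:

```python
# A minimal sketch of one-hot encoding for the three-word vocabulary
# used above. The helper name is illustrative, not from word2vec.

vocabulary = ["machine", "learning", "rules"]

def one_hot(word, vocabulary):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("machine", vocabulary))   # [1, 0, 0]
print(one_hot("learning", vocabulary))  # [0, 1, 0]
print(one_hot("rules", vocabulary))     # [0, 0, 1]
```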

Unfortunately, there is a major issue with this representation: it is very sparse. As you can easily imagine, creating a more comprehensive word base will soon raise other problems:

  1. You need a lot of memory: for a long text with a large vocabulary, you store one long vector per word, and almost every entry is a zero.
  2. The difference between "cat" and "kitten" is as big as the difference between "cat" and "refrigerator": every pair of distinct words differs in exactly two positions of the vector, so no word is more similar to any other.
  3. Analyzing the sentiment of a sentence is reduced to checking whether certain words occur in it or not, without any notion of their meaning.
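The second point can be made concrete: under a one-hot encoding, every pair of distinct words differs in the same two positions, so all distances come out equal. A small sketch, using an illustrative three-word vocabulary:

```python
# Sketch illustrating point 2: with one-hot vectors, every pair of
# distinct words is equally distant, so "cat" is no closer to "kitten"
# than to "refrigerator". The vocabulary and helpers are illustrative.

vocabulary = ["cat", "kitten", "refrigerator"]

def one_hot(word):
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(squared_distance(one_hot("cat"), one_hot("kitten")))        # 2
print(squared_distance(one_hot("cat"), one_hot("refrigerator")))  # 2
```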

Introducing embeddings

In 2013, Google research scientist Tomas Mikolov developed the word2vec program. Mikolov and his team took on this challenge using neural networks. They based their research on the assumption that in order to know what a word means, you need to know the contexts in which the word appears.

This assumption can be interpreted in two ways: predict a word based on its context, or predict the context based on the word (in word2vec these are known as the continuous bag-of-words and skip-gram models, respectively).

Take a look at the following examples:

1. XXX barks XXX  

2. The dog barks XXX

The word "barks" can be used in many contexts: dogs bark, other animals bark, and a person can bark at somebody by speaking very loudly. For the first example, you have to ask yourself: in what kind of situations can something bark, and what properties can a bark have? Knowing that the middle word is "barks" limits the possible surrounding words immensely.

The second example also leaves limited possibilities: dogs can bark loudly, but they can also bark at somebody. Here you have to ask yourself: given "The dog barks", what can follow?

Word2vec trains a neural network – or artificial brain – in this manner. First, it generates a random set of numbers for each word: the embedding. Next, it takes pairs of words in which, given the first word, the second word should be predicted (e.g. "dog" and "barks"). It feeds the embedding of the first word through its network to predict which word will appear next to it (see Figure 1).

Figure 1: Example of a word embedding using the word2vec model

You now have a neural network that, given a word vector, predicts how likely it is that a certain word surrounds that word vector. For example, let's say our input is "dog":

0.3 -0.8 0.9 -0.3

The network now predicts the following surrounding words:

  • barks
  • jumps
  • refrigerator

If we teach the neural network that in this case the right answer is "barks", the network learns that "jumps" and "refrigerator" were wrong. It is thus less likely to predict these words the next time it receives the same input. At the same time, the network also learns that "dog" is a likely value to surround the embedding for "barks". This way, the neural network and the embedding are simultaneously trained.
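The training loop described above can be sketched in a few lines. This is a toy version in the spirit of word2vec's skip-gram model with a plain softmax output; the real word2vec uses efficiency tricks such as negative sampling, and the vocabulary, sizes, and learning rate below are made up for illustration:

```python
# Toy sketch of the training loop described above: given a word's
# embedding, predict a surrounding word, and nudge both the embedding
# and the prediction weights toward the right answer.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "barks", "jumps", "refrigerator"]
V, D = len(vocab), 4                        # vocabulary size, embedding size

W_in = rng.normal(scale=0.1, size=(V, D))   # one embedding per word
W_out = rng.normal(scale=0.1, size=(D, V))  # prediction ("brain") weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_pair(center, context, lr=0.5):
    """Make `context` more likely to be predicted near `center`."""
    i, j = vocab.index(center), vocab.index(context)
    h = W_in[i].copy()                   # the embedding of the input word
    p = softmax(h @ W_out)               # predicted surrounding words
    grad = p.copy()
    grad[j] -= 1.0                       # cross-entropy gradient
    W_in[i] -= lr * (W_out @ grad)       # update the embedding ...
    W_out[:] -= lr * np.outer(h, grad)   # ... and the prediction weights

for _ in range(200):                     # teach the network: "dog" -> "barks"
    train_pair("dog", "barks")

probs = softmax(W_in[vocab.index("dog")] @ W_out)
print(vocab[int(np.argmax(probs))])      # most likely neighbour of "dog"
```

After training on the single pair ("dog", "barks"), the network predicts "barks" as the most likely neighbour of "dog", while "jumps" and "refrigerator" have become less likely.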

After we have trained our network and the embeddings, we can use them for other tasks. This is like training a brain to perform well on one task, then removing the part of the brain that performs this task and replacing it with a part trained for another task. For example, the "predict-word part" can be replaced by a "predict the part-of-speech tag" brain, or a "sentiment of this word" brain. This is the most exciting aspect of neural networks: you can take a trained part and use it to train another part.

Evaluating this method

By now you are probably wondering if this artificial brain is able to understand natural language. This, of course, depends on what we mean by "understanding". One way to evaluate how well embeddings work is by trying to calculate with them.

For a human, the following riddles provide little challenge:

  • What Paris is to France, Amsterdam is to XXX (answer: the Netherlands)
  • Large and larger, small and XXX (answer: smaller)

With embeddings trained by word2vec, you can perform some of these calculations. Take the embedding for "larger", subtract the vector of "large", and add the vector of "small". You will end up with a vector that is close to the vector of "smaller".
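This vector arithmetic can be illustrated with tiny made-up embeddings. Real word2vec vectors have hundreds of dimensions and are learned from data; the two-dimensional vectors below are hand-picked so that the analogy works:

```python
# Sketch of the analogy arithmetic described above, using hand-made
# two-dimensional embeddings (not trained vectors).
import numpy as np

embeddings = {
    "large":   np.array([1.0, 0.1]),
    "larger":  np.array([1.0, 0.9]),
    "small":   np.array([-1.0, 0.1]),
    "smaller": np.array([-1.0, 0.9]),
    "cat":     np.array([0.3, -0.4]),
}

def nearest(vector, exclude=()):
    """Find the word whose embedding is closest in cosine similarity."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))

# larger - large + small  ≈  smaller
result = embeddings["larger"] - embeddings["large"] + embeddings["small"]
print(nearest(result, exclude={"larger", "large", "small"}))  # smaller
```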

Figure 2: Semantic relations in word2vec


Mikolov created a test set with 8869 semantic relations and 10675 syntactic relations, which he used as a benchmark to test the accuracy of the model. This evaluation gave valuable insight into how to create good embeddings. What remains unclear is: how much data do we need? And which way do we train our network (predict context given word, or predict word given context)? There are also other ways to train embeddings, but I will not cover these in this article.

Another way to evaluate the model is to see whether similar words end up close together. If words share a lot of properties (e.g. "cat" and "dog" have more in common than "cat" and "hovercraft"), they should be closer together. Figure 3 shows what happens if we bring these vectors down to two dimensions. Take a few minutes to assess how well you think this model is trained. Single letters are grouped in the upper left corner, and numbers are clustered right next to them. There is a cluster for countries and another one for pronouns. Based on this figure, we can conclude that the representation was successful.

Figure 3: A two-dimensional representation of vectors in word2vec
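A plot like Figure 3 requires projecting the high-dimensional embeddings down to two coordinates. Such visualizations are often produced with t-SNE; as a simpler illustration of the idea, here is a minimal PCA projection, with random vectors standing in for trained embeddings:

```python
# Minimal sketch of projecting embeddings to two dimensions with PCA.
# Plots like Figure 3 are often made with t-SNE instead; PCA is the
# simplest way to show the idea. The embeddings are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 50))   # 100 words, 50 dimensions

centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ vt[:2].T           # 2-d coordinates to plot

print(points_2d.shape)  # (100, 2)
```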



Embeddings can represent words while taking up less memory than one-hot vectors, allowing for a much larger vocabulary. They enable us to use existing machine-learning algorithms in natural language processing.

Nowadays embeddings are one of the most important components of many new natural language processing applications. Examples of such applications include:

  • Neural machine translation
    Before a word is fed to the translation engine, it is replaced by its embedding. Although neural machine translation needs a lot of data, its output shows a high level of accuracy. In large part, this is because the computer "understands" each word better. The bigger the embeddings are (that is, the more dimensions they have), the better the translation – but the more data is needed to train them.
  • Sentiment analysis
    Does the user who wrote a certain review feel positive or negative about the product? There is a big difference between a user writing "This is a good product" and "This is not a good product". But if you only search for specific keywords (like "good"), you might miss the actual meaning. By turning each word into an embedding and "reading" the sentence word by word, you can capture the sentiment much more accurately.
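A very naive sketch of this idea: average per-word scores (standing in for embeddings) over the sentence and threshold the result. The scores below are hand-made for illustration; a real system would learn a classifier on top of trained embeddings:

```python
# Toy sentiment sketch: average hand-made per-word scores (stand-ins
# for embeddings) and threshold the mean. Real systems train a
# classifier on top of genuine embeddings instead.

embeddings = {
    "this": 0.0, "is": 0.0, "a": 0.0,
    "good": 1.0, "not": -1.5, "product": 0.0,
}

def sentiment(sentence):
    words = sentence.lower().split()
    score = sum(embeddings[w] for w in words) / len(words)
    return "positive" if score > 0 else "negative"

print(sentiment("This is a good product"))      # positive
print(sentiment("This is not a good product"))  # negative
```

Note that the hand-picked negative score for "not" is what lets this toy version handle the negated sentence; a keyword search for "good" alone would classify both sentences the same way.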

Although embeddings have only been around for a relatively short time, they have shaken up the world of natural language processing. While in the past many applications were inaccurate because word representations were too sparse, embeddings have made a whole range of new applications possible.


Further reading