Implementing Euclidean Distance Matrix Calculations From Scratch In Python

Distance matrices are a useful data structure that stores pairwise information about how the vectors in a dataset relate to one another. In machine learning they are used for tasks like hierarchical clustering, such as building phylogenetic trees to study genetic ancestry, and in natural language processing (NLP) for exploring the relationships between words (with word embeddings like Word2Vec, GloVe, fastText, etc.). Here, we will briefly go over how to implement a function in Python that efficiently computes the pairwise distances for a set (or sets) of vectors.

  • If you are interested in following along, fire up IPython in a terminal session (or create a new Jupyter Notebook). Also be sure that you have the Numpy package installed.

Brief review of Euclidean distance

Recall that the squared Euclidean distance between any two vectors a and b is simply the sum of the squared component-wise differences. (We skip the final step of taking the square root just to keep the examples simple.)

[Figure: squared Euclidean distance between two vectors, d^2(a, b) = sum_i (a_i - b_i)^2]

We can naively implement this calculation with vanilla python like this:

# two 500-component vectors whose components each differ by exactly 1
a = [i + 1 for i in range(0, 500)]
b = [i for i in range(0, 500)]
# sum the squared component-wise differences
dist_squared = sum([(a_i - b_i)**2 for a_i, b_i in zip(a, b)])
dist_squared
500

In fact, we could implement all of the math we are going to work through this way, but it would be slow and tedious. Numpy, the definitive numerical library for Python, gives us fast implementations for everything we need here. To illustrate the speed advantage, let's use the same vectors as numpy arrays, perform an identical calculation, and then compare the two approaches with %timeit.

import numpy as np
a_numpy = np.array(a)
b_numpy = np.array(b)
dist_squared = np.sum(np.square(a_numpy - b_numpy))
dist_squared
500

# using pure python
%timeit dist_squared = sum([(a_i - b_i)**2 for a_i, b_i in zip(a, b)])
119 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# using numpy
%timeit dist_squared = np.sum(np.square(a_numpy - b_numpy))
6.32 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

As you can see, the Numpy version is roughly 20x faster than our original implementation!

There is an equivalent formulation of squared Euclidean distance for vectors that uses dot products:

[Figure: dot-product formulation of squared distance, d^2(a, b) = a·a - 2(a·b) + b·b]

Keep this in the back of your mind as we will be extending this vector formulation to matrices in our final distance matrix implementation.
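
As a quick sanity check (not part of the final implementation), we can confirm on the numpy arrays from above that the dot-product formulation gives the same answer as the component-wise sum. The helper names here are just for illustration.

a_dot_a = a_numpy.dot(a_numpy)
b_dot_b = b_numpy.dot(b_numpy)
a_dot_b = a_numpy.dot(b_numpy)
# a·a - 2(a·b) + b·b matches the squared distance computed earlier
a_dot_a - 2*a_dot_b + b_dot_b
500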

The Distance Matrix

[Figure: a 3 x 3 matrix A of observations and its 3 x 3 distance matrix D]

Suppose that we have a group of three observations where each observation is a vector with three components. We can write this set of observations as a 3 x 3 matrix A where each row represents one observation.

The distance matrix for A, which we will call D, is also a 3 x 3 matrix where each element in the matrix represents the result of a distance calculation for two of the rows (vectors) in A. Note that D is symmetrical and has all zeros on its diagonal (the distance between a vector and itself is zero).
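
If you want to convince yourself of these two properties, here is a quick, purely illustrative check using a small random matrix (the values are arbitrary; any set of row vectors would do):

rng = np.random.default_rng(0)
X = rng.random((3, 3))
D_check = np.array([[np.sum(np.square(x - y)) for y in X] for x in X])
np.allclose(D_check, D_check.T)     # True: D is symmetric
np.allclose(np.diag(D_check), 0.0)  # True: zeros on the diagonal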

[Figure: matrices A (3 x 3) and B (2 x 3) and their 3 x 2 distance matrix D]

What if I have two groups of observations that I want to compare distances for? We can get a distance matrix in this case as well. Let's keep our first matrix A and compare it with a new 2 x 3 matrix B. Here, our new distance matrix D is 3 x 2. In general, the distance matrix between a matrix of size M x K and a matrix of size N x K has size M x N.

With most of the background covered, let's clearly state the problem we want to solve. We want a function in Python that takes two matrices as arguments and returns a distance matrix. Since our examples have been using squared distance, we will also add the option to return the squared distance if desired.
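
Before building the vectorized version, here is a minimal loop-based sketch of that function (the name distance_matrix_naive is just for illustration). It makes the goal concrete and gives us something to check the fast implementation against later.

def distance_matrix_naive(A, B, squared=False):
    # one row of D per vector (row) in A, one column per vector (row) in B
    D = np.zeros((A.shape[0], B.shape[0]))
    for i, a_row in enumerate(A):
        for j, b_row in enumerate(B):
            D[i, j] = np.sum(np.square(a_row - b_row))
    return D if squared else np.sqrt(D)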

Set up

First, let’s create the sample matrices A and B from above to use as test data.

A = np.array([[1,2,3],[2,3,4],[0,1,2]])
A
array([[1, 2, 3],
       [2, 3, 4],
       [0, 1, 2]])      
B = np.array([[1,2,3],[4,3,2]])
B
array([[1, 2, 3],
       [4, 3, 2]])

Implementation

Now let's revisit the alternate distance formulation from above and look at how it can be applied to our two matrices A and B.

The distance matrix D, our goal, can be constructed from three matrices that follow the dot-product formula above. Take a moment to make sure you see the pattern. Now, let's construct the first matrix of dot products for A.

M = A.shape[0]
N = B.shape[0]

A_dots = (A*A).sum(axis=1).reshape((M,1))*np.ones(shape=(1,N))
A_dots
array([[14., 14.],
       [29., 29.],
       [ 5.,  5.]])

First we find the number of rows M in A, which is 3, and the number of rows N in B, which is 2. To make A_dots we first compute the dot product of each row of A with itself: (A*A).sum(axis=1). We then reshape the output into a column with .reshape((M, 1)) and repeat that column to match the number of rows in B by multiplying by np.ones(shape=(1,N)). The matrix of dot products for B is constructed in a similar way.
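
As a brief aside, multiplying by np.ones is just one way to replicate the column; NumPy broadcasting can do the same replication implicitly. The sketch below is only for comparison and is not used in the rest of the post.

# broadcast_to stretches the (M, 1) column across N columns without copying
A_dots_alt = np.broadcast_to((A*A).sum(axis=1).reshape((M, 1)), (M, N))
np.array_equal(A_dots, A_dots_alt)
True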

B_dots = (B*B).sum(axis=1)*np.ones(shape=(M,1))
B_dots
array([[14., 29.],
       [14., 29.],
       [14., 29.]])

The only thing to note here is that in our final matrix, B is represented along the columns, so its dot products are also arranged column-wise. The last matrix of dot products is constructed with:

-2*A.dot(B.T)
array([[-28, -32],
       [-40, -50],
       [-16, -14]])

Putting it all together…

D_squared = A_dots + B_dots - 2*A.dot(B.T)
D_squared
array([[ 0., 11.],
       [ 3.,  8.],
       [ 3., 20.]])
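
To recover the actual (non-squared) distances, we just take an element-wise square root. With floating point arithmetic a tiny negative value can occasionally sneak into D_squared, so the final function below clips those to zero before the square root.

np.sqrt(D_squared)
array([[0.        , 3.31662479],
       [1.73205081, 2.82842712],
       [1.73205081, 4.47213595]])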

Final function

And here is the code wrapped into a function with a nice Numpy-style docstring.

def distance_matrix(A, B, squared=False):
    """
    Compute all pairwise distances between vectors in A and B.

    Parameters
    ----------
    A : np.array
        shape should be (M, K)
    B : np.array
        shape should be (N, K)
    squared : bool, optional
        If True, return the matrix of squared distances instead of
        taking the square root. Default is False.

    Returns
    -------
    D : np.array
        A matrix D of shape (M, N). Each entry D[i, j] represents the
        distance between row i in A and row j in B.

    See also
    --------
    A more generalized version of the distance matrix is available from
    scipy (https://www.scipy.org) using scipy.spatial.distance_matrix,
    which also gives a choice for p-norm.
    """
    M = A.shape[0]
    N = B.shape[0]

    assert A.shape[1] == B.shape[1], (
        f"The number of components for vectors in A ({A.shape[1]}) "
        f"does not match that of B ({B.shape[1]})!"
    )

    A_dots = (A*A).sum(axis=1).reshape((M,1))*np.ones(shape=(1,N))
    B_dots = (B*B).sum(axis=1)*np.ones(shape=(M,1))
    D_squared = A_dots + B_dots - 2*A.dot(B.T)

    if not squared:
        zero_mask = np.less(D_squared, 0.0)
        D_squared[zero_mask] = 0.0
        return np.sqrt(D_squared)

    return D_squared
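
Calling it with squared=True on our test matrices reproduces the D_squared matrix we assembled step by step above:

distance_matrix(A, B, squared=True)
array([[ 0., 11.],
       [ 3.,  8.],
       [ 3., 20.]])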

Conclusion

And there you have it! An efficient function for computing distance matrices in Python using Numpy. Before I leave you, I should note that SciPy has a built-in function (scipy.spatial.distance_matrix) for computing distance matrices as well. You should find that the results of either implementation are identical.

distance_matrix(A, B)
array([[0.        , 3.31662479],
       [1.73205081, 2.82842712],
       [1.73205081, 4.47213595]])

import scipy.spatial
scipy.spatial.distance_matrix(A, B)
array([[0.        , 3.31662479],
       [1.73205081, 2.82842712],
       [1.73205081, 4.47213595]])
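
And if you would rather not eyeball the two outputs, a quick programmatic check confirms they match:

np.allclose(distance_matrix(A, B), scipy.spatial.distance_matrix(A, B))
True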