A crash course in linear algebra with PyTorch¶
These are the linear algebra concepts we want to get an understanding of:
Vectors
Matrices
Matrix / vector multiplication
We're going to examine each of these concepts with the help of PyTorch, which will also get us familiar with the 'Tensor' object and how it can be used to represent each of the above.
Vectors¶
Basically, a $\textit{vector}$ is a list of numbers. That's it. It's merely a convenient way to represent multiple properties at the same time.
A little more specifically, when we think of a vector (ie. a list of numbers), we think of it in the context of all the other vectors (ie. lists of numbers) that have the same length. Note that it's straightforward to compare two vectors of the same length: if we line them up beside each other, every number in vector $a$ has a corresponding number in vector $b$. We call the set of all vectors of a given length a $\textit{vector space}$, where the fixed length itself is called the $\textit{dimension}$.
It's appropriate to think of these concepts as an abstraction over your intuition about 3-dimensional space. For that reason, we'll continue this discussion with vectors of length 3 (or less), so that there is always a spatial interpretation.
Concretely then, our '3-dimensional vector space' is made up of every list of numbers with length 3. As in,
[1, 2, 3]
[1, 2, 3]
Is a 3-D vector. And so is:
[0.5, 0.6, 0.7]
[0.5, 0.6, 0.7]
Conventionally, in 3 dimensions, the three entries in a vector correspond to the $x$, $y$, and $z$ coordinates, spatially. So we can actually visualize those two vectors:
import matplotlib.pyplot as plt
def plot_vectors(vectors: list[list[float]]):
    # Every arrow starts at the origin.
    zeroes = [0] * len(vectors)
    x, y, z = zeroes, zeroes, zeroes
    # The arrow tips are given by the x, y, and z entries of each vector.
    u, v, w = (
        [vector[0] for vector in vectors],
        [vector[1] for vector in vectors],
        [vector[2] for vector in vectors],
    )
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.quiver(x, y, z, u, v, w)
    ax.set_xlim([0, max(u)])
    ax.set_ylim([0, max(v)])
    ax.set_zlim([0, max(w)])
    plt.show()
plot_vectors([
    [0.5, 0.6, 0.7],
    [1, 2, 3],
])
Distance and length¶
Given the straightforward spatial interpretation of 3-dimensional vectors, naturally, we can talk about the $\textit{distance}$ between them. Probably the most intuitive measurement of distance between the two vectors we drew above is something like 'how far apart are the tips of their arrows?'. We'll use a measurement called 'Euclidean distance', which you already know in 2 dimensions as the 'Pythagorean theorem':
def euclidean_distance(a: list[float], b: list[float]) -> float:
    # Square root of the sum of squared differences, entry by entry.
    return sum([
        (a[i] - b[i]) ** 2
        for i in range(len(a))
    ]) ** 0.5
euclidean_distance(
    [0.5, 0.6, 0.7],
    [1, 2, 3],
)
2.7386127875258306
We can also talk about the 'length' of a vector as its distance from $[0, 0, 0]$ - this is called its $\textit{magnitude}$:
def magnitude(v: list[float]) -> float:
    # A vector's length is its distance from the origin.
    return euclidean_distance(v, [0] * len(v))
print("[1, 2, 3] has length:", magnitude([1, 2, 3]))
print("[0.5, 0.6, 0.7] has length:", magnitude([0.5, 0.6, 0.7]))
[1, 2, 3] has length: 3.7416573867739413
[0.5, 0.6, 0.7] has length: 1.0488088481701514
In fact, 'magnitude' is the fundamental concept behind distance - it's best to look at distance as the magnitude of the difference between two vectors. The notion of distance itself is also a bit more general than the formula above: 'Euclidean distance' is a special case of what's called a norm.
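To see this concretely, here's a quick check reusing the euclidean_distance and magnitude functions we just wrote (difference is just a throwaway name for the entry-wise difference of our two vectors):
a = [0.5, 0.6, 0.7]
b = [1, 2, 3]
# The entry-wise difference between the two vectors.
difference = [a[i] - b[i] for i in range(len(a))]
print(euclidean_distance(a, b))  # 2.7386127875258306
print(magnitude(difference))     # the same value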
Norms¶
A norm is any measurement that exhibits properties that make it useful as a distance. The only 'norms' that you'll probably ever need to think about look like $ (|x_1|^p + |x_2|^p + \ldots + |x_n|^p)^{\frac{1}{p}} $ for some $p \geq 1$ (so you might notice, Euclidean distance is the '$p = 2$' norm).
One other $p$-norm that you might find interesting is when $p=1$, or the 'Manhattan norm'. Can you see where it gets its name from?
def p_norm(vector: list[float], p: int = 2) -> float:
    # abs() keeps negative entries from cancelling each other out.
    return sum(
        abs(x) ** p
        for x in vector
    ) ** (1 / p)
print("Euclidean magnitude:", p_norm([1, 2, 3], p=2))
print("Manhattan magnitude:", p_norm([1, 2, 3], p=1))
Euclidean magnitude: 3.7416573867739413
Manhattan magnitude: 6.0
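One aside about the abs() in p_norm above: the absolute values matter as soon as a vector has negative entries, since otherwise positive and negative entries would partially cancel. A quick check:
# Without abs(), the p=1 'norm' of this vector would come out as 2 instead of 6.
print(p_norm([1, -2, 3], p=1))  # 6.0
print(p_norm([1, -2, 3], p=2))  # ~3.742, the same magnitude as [1, 2, 3]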
Dot products¶
Sometimes however, we're only interested in the $\textit{directional}$ distance between two vectors - in other words, a measurement of the angle between them. To measure this, we'll introduce a common operation in linear algebra - the $\textit{dot product}$:
def dot_product(a: list[float], b: list[float]) -> float:
    # Multiply the two vectors entry by entry, then add everything up.
    return sum([
        a[i] * b[i]
        for i in range(len(a))
    ])
In plain English, the dot product is the sum of the pairwise products of two vectors. Intuitively, it is a measurement of how 'aligned' two vectors are. If we 'normalize' by dividing by the magnitude of each input vector, it gives us a purely directional measurement of the two vectors' alignment:
def angular_similarity(a: list[float], b: list[float]) -> float:
    # The dot product, normalized so that only direction matters.
    return dot_product(a, b) / (magnitude(a) * magnitude(b))
To illustrate the difference between these various measurements, consider a third vector, $[2, 4, 6]$.
plot_vectors([
    [0.5, 0.6, 0.7],
    [1, 2, 3],
    [2, 4, 6],
])
Notice that $[2, 4, 6]$ points in the same direction as $[1, 2, 3]$; it's just twice as long.
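(A quick check with our magnitude function confirms the 'twice as long' part.)
# Scaling every entry of a vector by 2 exactly doubles its magnitude.
print(magnitude([1, 2, 3]))  # 3.7416573867739413
print(magnitude([2, 4, 6]))  # ~7.483, exactly double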
from itertools import pairwise
vectors = [
    [0.5, 0.6, 0.7],
    [1, 2, 3],
    [2, 4, 6],
]
operations = [
    euclidean_distance,
    dot_product,
    angular_similarity,
]
# Apply each measurement to consecutive pairs of vectors.
for op in operations:
    print(f"---- {op.__name__} ----")
    for a, b in pairwise(vectors):
        print(f"{a}, {b}:", op(a, b))
    print("\n")
---- euclidean_distance ----
[0.5, 0.6, 0.7], [1, 2, 3]: 2.7386127875258306
[1, 2, 3], [2, 4, 6]: 3.7416573867739413

---- dot_product ----
[0.5, 0.6, 0.7], [1, 2, 3]: 3.8
[1, 2, 3], [2, 4, 6]: 28

---- angular_similarity ----
[0.5, 0.6, 0.7], [1, 2, 3]: 0.9683296637314887
[1, 2, 3], [2, 4, 6]: 1.0
Since they're in exactly the same direction, $[1, 2, 3]$ and $[2, 4, 6]$ have perfect 'angular similarity'.
Introducing PyTorch¶
From now on, instead of using the native Python 'list' type to represent vectors, we'll use the PyTorch 'tensor'. At this point, you can regard PyTorch tensors as simply a convenient data structure for vectors. Indeed, the operations from above are straightforward in PyTorch:
import torch
def torch_dot_product(a: torch.Tensor, b: torch.Tensor):
    return a.dot(b)

def torch_euclidean_distance(a: torch.Tensor, b: torch.Tensor):
    # For a vector argument, torch.norm computes the Euclidean (p=2) norm.
    return torch.norm(a - b)

def torch_angular_similarity(a: torch.Tensor, b: torch.Tensor):
    return a.dot(b) / (torch.norm(a) * torch.norm(b))
a = torch.tensor([0.5, 0.6, 0.7])
b = torch.tensor([1, 2, 3], dtype=torch.float32)
print("dot product:", torch_dot_product(a, b))
print("euclidean distance:", torch_euclidean_distance(a, b))
print("angular distance:", torch_angular_similarity(a, b))
dot product: tensor(3.8000)
euclidean distance: tensor(2.7386)
angular similarity: tensor(0.9683)
Note that the outputs here are still tensors, each containing just a single item.
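As an aside, if you ever want the plain Python number back out of one of these single-item tensors, PyTorch's .item() method will give you that:
# .item() converts a single-element tensor into an ordinary Python number.
print(torch_dot_product(a, b).item())  # ~3.8, now a regular float rather than a tensor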
It's handy to use the shape property to understand the shape of our tensors:
print("shape of a", a.shape)
print("shape of a * b :", a.dot(b).shape)
shape of a : torch.Size([3])
shape of a . b : torch.Size([])
It can get a little confusing to refer properly to the length of a tensor's shape versus the actual values in each of the shape's positions.
Remember:
- A tensor's rank is the length of its shape. So a vector has rank 1, and a number (ie. a scalar) has rank 0.
- In PyTorch, the term dimension or axis refers to which element of the shape list we are talking about.
- Unfortunately, this conflicts with the mathematical definition of dimension - ie. we would like to say that the vector $[1, 2, 3]$ is 3-dimensional. In PyTorch, this would be referred to as the size along that axis.
- So, at least when working with PyTorch, it would serve you to forget any other definitions of 'dimension'.
b = torch.tensor([1, 2, 3])
print("So we say that", b, "is a RANK", len(b.shape), "tensor, with a SIZE in the 0th dimension of:", b.size(0))
So we say that tensor([1, 2, 3]) is a RANK 1 tensor, with a SIZE in the 0th dimension of: 3
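The same vocabulary carries over to higher ranks. Here's a quick illustrative example (T is just a made-up name for a rank-2 tensor, the kind of object we'll meet properly as a 'matrix' in the next section):
# A rank-2 tensor: its shape has two entries, and each axis has its own size.
T = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])
print("rank:", len(T.shape))            # 2
print("size along axis 0:", T.size(0))  # 2 (number of rows)
print("size along axis 1:", T.size(1))  # 3 (number of columns)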
Matrices¶
If a vector is just a list of numbers, then a $\textit{matrix}$ is just a list of vectors. The major motivation for building such a list of lists is actually related to the 'dot-product' operation.
Consider that the dot-product of two vectors is a single number. What if we were interested in some kind of operation that produces not just single numbers, but new vectors? Let's consider 'stacking' two dot-products, so instead of two separate dot products like
$$ [1, 2, 3] \cdot [2, 3, 4] = 20 $$
and
$$ [4, 5, 6] \cdot [2, 3, 4] = 47 $$
we combined them into something like:
\begin{equation} \begin{bmatrix} 1, & 2, & 3 \\ 4, & 5, & 6 \\ \end{bmatrix} \cdot [2, 3, 4] = \begin{bmatrix} 20 \\ 47 \\ \end{bmatrix} \end{equation}
As written, this is a 'matrix multiplication' where the 'matrix' is the stack of vectors:
\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ \end{bmatrix}
And this is basically the definition of 'matrix multiplication' - a stack of dot products. PyTorch actually has an operator for 'matrix multiply':
M = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])
v = torch.tensor([2, 3, 4])
M @ v
tensor([20, 47])
It's important to note that in matrix multiplication, order matters: when a vector is on the right side of a matrix, the dot products happen with the rows of the matrix, but when a vector is on the left, the dot products happen with the columns.
Check it out:
print(
    "M @ [2, 3, 4] = ", M @ torch.tensor([2, 3, 4]),
    "which is equal to: [\n",
    " [1, 2, 3] . [2, 3, 4] = ", torch.tensor([1, 2, 3]).dot(torch.tensor([2, 3, 4])), ",\n",
    " [4, 5, 6] . [2, 3, 4] = ", torch.tensor([4, 5, 6]).dot(torch.tensor([2, 3, 4])), ",\n",
    "]"
)
M @ [2, 3, 4] = tensor([20, 47]) which is equal to: [
  [1, 2, 3] . [2, 3, 4] = tensor(20) ,
  [4, 5, 6] . [2, 3, 4] = tensor(47) ,
]
whereas
print(
    "[2, 3] @ M = ", torch.tensor([2, 3]) @ M,
    "which is equal to: [\n",
    " [1, 4] . [2, 3] = ", torch.tensor([1, 4]).dot(torch.tensor([2, 3])), ",\n",
    " [2, 5] . [2, 3] = ", torch.tensor([2, 5]).dot(torch.tensor([2, 3])), ",\n",
    " [3, 6] . [2, 3] = ", torch.tensor([3, 6]).dot(torch.tensor([2, 3])), ",\n",
    "]"
)
[2, 3] @ M = tensor([14, 19, 24]) which is equal to: [
  [1, 4] . [2, 3] = tensor(14) ,
  [2, 5] . [2, 3] = tensor(19) ,
  [3, 6] . [2, 3] = tensor(24) ,
]
Did you notice that we actually had to change the size of the vector we used to multiply with on the left? $M$ only has 2 elements in each of its columns, so if we're going to compute a dot product with each column, we can only use a vector of length 2. Also, notice that the shape of the output vector changes depending on which side of the matrix the vector sits.
Generally, we refer to the number of rows in a matrix with the number $m$ and the number of columns with the number $n$, and say that a given matrix is $m$ by $n$. You can think of an $m$ by $n$ matrix as a function that maps vectors of length $n$ to vectors of length $m$ when it multiplies a vector on the right, and vice-versa when it multiplies a vector on the left.
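Here's a quick shape check of that claim, reusing the 2 by 3 matrix $M$ from above:
# M is 2 by 3: on the right it maps length-3 vectors to length-2 vectors,
# and on the left it maps length-2 vectors to length-3 vectors.
print((M @ torch.tensor([2, 3, 4])).shape)  # torch.Size([2])
print((torch.tensor([2, 3]) @ M).shape)     # torch.Size([3])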
We often deal with 'square' matrices where $m = n$, so in this case, the size of the input and output vectors is the same. You can interpret matrices like this as 'moving points around' in space (where that space has as many coordinates as the vectors are long). When $m \neq n$, you can imagine that the matrix is also somehow either 'embedding' a smaller space within a bigger one (think of a circle being embedded into 3-d), or 'projecting' a bigger space into a smaller one (think of a 3-d object casting a shadow).
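To make those pictures a little more concrete, here's a small sketch. Both matrices below are illustrative examples made up for this purpose (they aren't defined anywhere above): a square matrix that just 'moves points around' by rotating the plane, and a non-square matrix that casts a 3-d point's 'shadow' by dropping its z coordinate.
# A square (2 by 2) matrix: a quarter-turn rotation of the plane.
rotate_90 = torch.tensor([
    [0, -1],
    [1,  0],
])
print(rotate_90 @ torch.tensor([1, 0]))  # tensor([0, 1]): the x-axis lands on the y-axis

# A non-square (2 by 3) matrix: projects 3-d points onto the xy-plane.
project_xy = torch.tensor([
    [1, 0, 0],
    [0, 1, 0],
])
print(project_xy @ torch.tensor([1, 2, 3]))  # tensor([1, 2])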
It's actually the case that if you came up with all the ways to move points around in space that 'make sense' (ways that keep the origin fixed and keep straight lines straight, rather than 'ripping' space apart or bending it), you would be able to describe all of these operations using a matrix. This is why matrices are a special tool for us.
Matrix-matrix multiplication¶
We've justified the usefulness of matrices as tools to manipulate vectors with. To justify why we might want to combine two matrices, consider a scenario where we want to apply two operations to a vector, one after the other. If you define 'matrix-to-matrix' multiplication in a smart way, it turns out that you can combine the two matrices into one, such that you only need to multiply the vector with one new matrix.
Let's have a look:
M1 = torch.tensor([
    [1, 2, 3],
    [4, 5, 6],
])
M2 = torch.tensor([
    [1, 2],
    [3, 4],
    [5, 6],
])
v = torch.tensor([1, 1, 1])
M1 @ v
tensor([ 6, 15])
w = M2 @ (M1 @ v)
w
tensor([ 36, 78, 120])
But note that the @ operator can also multiply two matrices!
M3 = M2 @ M1
M3
tensor([[ 9, 12, 15],
        [19, 26, 33],
        [29, 40, 51]])
And this matrix actually operates identically to first applying M1, then M2:
M3 @ v
tensor([ 36, 78, 120])
There are a few things to dive into in what we just demonstrated, but first, let's try to work out how you might go about defining the @ operator on two matrices.
It will help to break apart $M1$ and $M2$ into row vectors:
M1_0 = torch.tensor([1, 2, 3])
M1_1 = torch.tensor([4, 5, 6])
M2_0 = torch.tensor([1, 2])
M2_1 = torch.tensor([3, 4])
M2_2 = torch.tensor([5, 6])
At the same time, we're going to try and figure out what a matrix would have to look like to behave the same as $M1$ then $M2$. Let's call the hypothetical new matrix '$A$'.
Just by looking at the shapes of $M1$ and $M2$, we know that this $A$ is going to need to be a $3 \times 3$ matrix, since it's supposed to turn vectors of length 3 into other vectors of length 3.
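In fact, the M3 that PyTorch computed for us a moment ago has exactly that shape:
print(M3.shape)  # torch.Size([3, 3])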
Think of $A$ in three rows, like:
\begin{equation} \begin{bmatrix} A_0 \\ A_1 \\ A_2 \\ \end{bmatrix} \end{equation}
and imagine that we're trying to figure out what each of those rows looks like so that we can get the same result as M2 @ (M1 @ v).
Thinking of the first row specifically, we want to find $A_0$ such that $ A_0 \cdot v $ is the first entry in $w$.
Let's expand M2 @ (M1 @ v) in terms of the rows of $M1$ and $M2$.
M1 @ v is a vector with two entries: $[M1\_0 \cdot v, \hspace{2mm} M1\_1 \cdot v]$.
And the full expression M2 @ (M1 @ v) expands to $M2 \hspace{2.5mm} [M1\_0 \cdot v, \hspace{2mm} M1\_1 \cdot v]$, or:
\begin{equation} \begin{bmatrix} M2\_0 \cdot [M1\_0 \cdot v, \hspace{2mm} M1\_1 \cdot v] \\ M2\_1 \cdot [M1\_0 \cdot v, \hspace{2mm} M1\_1 \cdot v] \\ M2\_2 \cdot [M1\_0 \cdot v, \hspace{2mm} M1\_1 \cdot v] \\ \end{bmatrix} \end{equation}
Focusing on the first row, remember that we're looking for a first row in our matrix $A$ that does the same thing as what's expanded above.
So we want an $A_0$ where $A_0 \cdot v $ = $ M2\_0 \cdot [M1\_0 \cdot v, \hspace{2mm} M1\_1 \cdot v] $.
We're going to have to tear apart these vectors a little to sort this out, so we'll start using some notation to refer to actual numbers within $M1$ and $M2$. We'll identify the first entry in $M1\_0$ as $M1\_0_0$, the second as $M1\_0_1$, and the third as $M1\_0_2$. To make this explicit, $M1\_0_0 = 1$, $M1\_0_1 = 2$, and $M1\_0_2 = 3$.
Now, $ M2\_0 \cdot [M1\_0 \cdot v, \hspace{2mm} M1\_1 \cdot v] $ is just $ M2\_0_0 (M1\_0 \cdot v) + M2\_0_1 (M1\_1 \cdot v) $, and if we group the terms by the entries of $v$, it expands out to
$$ [M2\_0_0 M1\_0_0 + M2\_0_1 M1\_1_0, \hspace{2mm} M2\_0_0 M1\_0_1 + M2\_0_1 M1\_1_1, \hspace{2mm} M2\_0_0 M1\_0_2 + M2\_0_1 M1\_1_2] \cdot v $$
and this is exactly what we were looking for: a single vector to take the dot product of $v$ with as the first row of $A$!
It's a bit messy as written above, but with one small change it will simplify considerably. Instead of looking at the rows of $M1$, what if we broke $M1$ into columns, like $M1 = [C0, C1, C2]$, or explicitly: \begin{equation} \begin{bmatrix} 1 \\ 4 \\ \end{bmatrix} \begin{bmatrix} 2 \\ 5 \\ \end{bmatrix} \begin{bmatrix} 3 \\ 6 \\ \end{bmatrix} \end{equation}
Now our expression for $A_0$ is just
$$ [M2\_0 \cdot C0, \hspace{2mm} M2\_0 \cdot C1, \hspace{2mm} M2\_0 \cdot C2 ] $$
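Plugging in the actual numbers as a sanity check: $M2\_0 = [1, 2]$, so $A_0 = [1 \cdot 1 + 2 \cdot 4, \hspace{2mm} 1 \cdot 2 + 2 \cdot 5, \hspace{2mm} 1 \cdot 3 + 2 \cdot 6] = [9, 12, 15]$, which is exactly the first row of the M3 we computed with the @ operator earlier.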
We did the algebra for one row of $A$ above, but you can trace through everything to get the complete picture for $A$. If you haven't followed along completely, that's also just fine. All that really matters is this:
To multiply two matrices $M2$ and $M1$ with $M2$ on the left, $M2$ must have the same number of columns as $M1$ has rows. The resultant matrix is then defined so that the number in the $i$th row and $j$th column is equal to the dot product of the $i$th row of $M2$ with the $j$th column of $M1$.
# The columns of M1
C_0 = torch.tensor([1, 4])
C_1 = torch.tensor([2, 5])
C_2 = torch.tensor([3, 6])
# The rows of M2
R_0 = torch.tensor([1, 2])
R_1 = torch.tensor([3, 4])
R_2 = torch.tensor([5, 6])
# Entry (i, j) of the product is the ith row of M2 dotted with the jth column of M1.
torch.tensor([
    [R_0.dot(C_0), R_0.dot(C_1), R_0.dot(C_2)],
    [R_1.dot(C_0), R_1.dot(C_1), R_1.dot(C_2)],
    [R_2.dot(C_0), R_2.dot(C_1), R_2.dot(C_2)],
])
tensor([[ 9, 12, 15],
        [19, 26, 33],
        [29, 40, 51]])
Bingo.
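As one last sanity check, we can assemble that matrix by hand, give it the name $A$ from our derivation, and ask PyTorch (via torch.equal) whether it matches M2 @ M1 exactly:
# Hand-assembled from the rule above; torch.equal confirms it matches the @ operator.
A = torch.tensor([
    [R_0.dot(C_0), R_0.dot(C_1), R_0.dot(C_2)],
    [R_1.dot(C_0), R_1.dot(C_1), R_1.dot(C_2)],
    [R_2.dot(C_0), R_2.dot(C_1), R_2.dot(C_2)],
])
print(torch.equal(A, M2 @ M1))  # True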
Conclusions¶
This should be all the linear algebra you need to get started running neural networks. In particular, hopefully it has given you some intuition behind the computations that are necessary to perform matrix multiplications, and why they lend themselves so easily to parallel computation. That said, you should note that the above is just a theoretical overview of matrix multiplication - because it is such a common operation, there has been a lot of algorithmic research into computing them in faster and faster ways. Check out https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm to read more about different approaches.
All the best in your matrix-multiplying endeavors.
- Liam