Practical Guide to PCA¶
Principal component analysis¶
I'd like to present a practical take on principal component analysis that doesn't shy away from some technical detail.
Let's take a look at what applying PCA looks like on a dataset, and what motivations we might have for doing so.
We'll use the 'Iris' dataset, from the scikit-learn documentation itself: https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html#sphx-glr-auto-examples-decomposition-plot-pca-iris-py.
Let's set things up.
from itertools import combinations
import math
import matplotlib.pyplot as plt
import numpy as np
from sklearn import (
datasets,
decomposition,
preprocessing,
)
np.random.seed(5)
iris = datasets.load_iris()
X = iris.data
y = iris.target
In case you're unfamiliar, the Iris dataset is a small dataset of measurements of flowers, where we're typically trying to classify each sample as a particular kind of iris.
print("flower measurements:")
print("shape: ", X.shape)
print("example:", X[0])
print()
print("target:")
print("shape:", y.shape)
print("possible values:", set(y))
flower measurements:
shape:  (150, 4)
example: [5.1 3.5 1.4 0.2]

target:
shape: (150,)
possible values: {0, 1, 2}
The feature labels are actually "Sepal Length, Sepal Width, Petal Length and Petal Width", but we won't worry too much about which is which. Also note that there are 3 kinds of Iris we might classify into.
Motivation¶
To begin the discussion, let's consider trying to visualize the data set in 2 dimensions. Since there are 4 feature dimensions, we have to pick 2 of these features at a time to create a 2-D scatterplot:
dimension_combinations = list(combinations(range(X.shape[1]), 2))
n_rows = 3
n_cols = math.ceil(len(dimension_combinations) / n_rows)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5))
for ax, dims in zip(axes.flatten(), dimension_combinations):
x_dim, y_dim = dims
ax.scatter(X[:, x_dim], X[:, y_dim], c=y)
ax.set_xlabel(iris.feature_names[x_dim])
ax.set_ylabel(iris.feature_names[y_dim])
pass
plt.tight_layout()
Based on what we see here, it looks like we could drop every feature except 'petal length' and 'petal width' and still cleanly separate at least one type of iris. Intuitively, if we can make classifications based on only two of the features, why bother including the others in our model?
Dimensionality reduction is the term for transforming our data into a new representation with lower dimensionality, i.e. fewer features. Informally, lower dimensionality makes a dataset 'easier to reason about'; more concretely, it means smaller models that are easier to train.
In this example, we could apply a really straightforward dimensionality reduction by just dropping two of the dimensions. In practice, however, dimensionality reduction will usually give us a whole new feature set that doesn't resemble the original one.
Additionally, we need an algorithmic way of arriving at the new features, rather than relying on visual inspection.
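Before we get to PCA, here's a minimal sketch of the naive 'just drop some columns' version, keeping only the petal measurements (columns 2 and 3 of `iris.feature_names`):
# Naive dimensionality reduction: keep only petal length and petal width
# (columns 2 and 3 in the iris feature ordering).
X_petal_only = X[:, [2, 3]]
print(X_petal_only.shape)  # (150, 2): same samples, half the features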
Enter PCA¶
Principal component analysis is just such an algorithm for arriving at a lower-dimensional feature set for a given dataset.
You can find a somewhat terse general overview of the PCA algorithm on wikipedia: https://en.wikipedia.org/wiki/Principal_component_analysis, but let's try to distill this into something a bit more intuitive.
Change of basis¶
Imagine the entire data set as an $(n \times f)$ matrix $X$, where $n$ is the number of samples and $f$ is the number of features. We're seeking an $(f \times t)$ transformation $W$ that will take us to a new $(n \times t)$ representation of the data $T = X W$, where $t$ is the number of features we're going to project into.
Consider the representation of one sample $X_i$ under the new basis: $T_i$. (So now we're representing $X_i$ in only $t$ dimensions, i.e. the length of $T_i$ is $t$.) Let's label the columns of $W$, so that $W = \big( (W^1) \ldots (W^t) \big)$. Then $T_i$ looks like $( W^1 \cdot X_i, W^2 \cdot X_i, \ldots, W^t \cdot X_i )$.
Intuitively then, each column of $W$ represents a line or direction onto which we project the data; these directions are known as the 'principal components' of the transformation. One important assumption in PCA is that these directions are orthogonal to each other, which is what lets us apply linear-algebraic techniques to actually solve for $W$.
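Before moving on, here's a minimal sketch of the change of basis as plain matrix arithmetic. The matrix `W_example` below is just a made-up pair of orthonormal directions in the 4-dimensional feature space, for illustration only, not the actual principal components:
# A made-up (f x t) basis-change matrix: two orthonormal columns in feature space.
W_example = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
    [0.0, 0.0],
])
# T = X W maps each 4-dimensional sample to a 2-dimensional representation.
T_example = X @ W_example
print(T_example.shape)  # (150, 2)
# Each entry of a transformed sample is a dot product with a column of W.
print(T_example[0])
print([X[0] @ W_example[:, k] for k in range(W_example.shape[1])])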
An algorithm¶
To see how we might compute $W$, let's consider the covariance matrix of our data, $\frac{1}{n - 1} X^T X$ (assuming a mean of 0 for each original feature, which is reasonable since we can always normalize first).
Now imagine the covariance matrix of the transformed data, $\frac{1}{n - 1} T^T T$, and the properties we'd like it to have: ideally we want no covariance between features, since we're trying to eliminate redundant features. Practically, that means $T^T T$ should be diagonal, i.e. with 0s everywhere except on the main diagonal.
At this point, we know we're looking for a set of $t$ orthogonal 'basis change' vectors such that the covariance matrix of the data after the basis change is diagonal. PCA employs one more assumption: we successively look for 'basis change' vectors that maximize the variance of the transformed data along each new dimension. The intuition is that we want to capture or 'explain' as much of the variance in our data as possible with each dimension we add. In other words, going back to our columns of $W$, $\big( (W^1) \ldots (W^t) \big)$: to pick the first basis-change direction, we maximize $\sum_i (W^1 \cdot X_i)^2$. After we've picked that direction, we maximize $\sum_i (W^2 \cdot X_i)^2$ (subject to $W^2$ being orthogonal to $W^1$), and so on.
We'll hand-wave over the ensuing linear algebra, but there are really nice derivations of the fact that the columns of $W$ satisfying the above criteria, our 'basis-change' directions, are exactly the eigenvectors of our original covariance matrix $\frac{1}{n - 1} X^T X$. You can interpret an eigenvector as a line or direction that the matrix sort of 'leaves alone' when it performs its mapping, at least in terms of direction. Crucially though, the eigenvectors of a matrix are something we can compute in a straightforward way, so this more or less solves the problem of performing PCA.
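If you want to check that claim numerically, here's a minimal sketch that builds the covariance matrix by hand and asks numpy for its eigendecomposition (it normalizes the data itself, duplicating a step we'll do properly below). The eigenvalues, sorted in decreasing order, should line up with the explained variances scikit-learn reports later on:
# Sketch: compute principal directions directly from the covariance matrix.
X_centered = (X - X.mean(axis=0)) / X.std(axis=0)
cov = (X_centered.T @ X_centered) / (X_centered.shape[0] - 1)  # (f x f)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices
order = np.argsort(eigenvalues)[::-1]  # eigh returns ascending order; flip it
print(eigenvalues[order])        # variance captured along each direction
print(eigenvectors[:, order])    # columns are the basis-change directions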
Sklearn Implementation¶
Let's take a look at the scikit-learn implementation. Keep in mind that, as discussed above, we need to normalize our data at some point in the PCA process, and the scikit-learn implementation does not do this!
So we'll normalize the data ourselves.
Normalization¶
X_normed = (X - X.mean(axis=0)) / X.std(axis=0)
There is actually a built-in class for this:
scaler = preprocessing.StandardScaler()
np.array_equal(scaler.fit_transform(X), X_normed)
True
With that out of the way...
pca_model = decomposition.PCA(
n_components=3,
)
pca_model = pca_model.fit(X_normed)
Basis change mapping¶
The model's `components_` attribute is the change-of-basis matrix $W$; note that scikit-learn stores one principal component per row, so this is the transpose of the $(f \times t)$ matrix from our earlier notation (which is why the checks below talk about rows of `W`):
W = pca_model.components_
W
array([[ 0.52106591, -0.26934744,  0.5804131 ,  0.56485654],
       [ 0.37741762,  0.92329566,  0.02449161,  0.06694199],
       [-0.71956635,  0.24438178,  0.14212637,  0.63427274]])
Orthogonality of rows of W:¶
[
np.dot(W[i], W[j])
for i, j in combinations(range(W.shape[0]), 2)
]
[1.5526542332574833e-18, 1.56268127367244e-16, 1.5004754434276003e-16]
Orthonormality of rows of W:¶
[np.linalg.norm(W[i]) for i in range(W.shape[0])]
[0.9999999999999999, 0.9999999999999997, 1.0000000000000002]
Transformed data:¶
X2 = pca_model.transform(X_normed)
print(X_normed[0])
print("under transform is:")
print(X2[0])
print("which should be equal to:", [
np.dot(X_normed[0], W[i])
for i in range(W.shape[0])
])
[-0.90068117  1.01900435 -1.34022653 -1.3154443 ]
under transform is:
[-2.26470281  0.4800266  -0.12770602]
which should be equal to: [-2.26470280880759, 0.4800265965209855, -0.12770602230015266]
Covariance of the transformed data:¶
X2.transpose() @ X2
array([[ 4.37774672e+02,  1.77635684e-13, -4.80726570e-14],
       [ 1.77635684e-13,  1.37104571e+02, -2.21975216e-14],
       [-4.80726570e-14, -2.21975216e-14,  2.20135313e+01]])
Note that this is approximately diagonal (we've left off the $\frac{1}{n - 1}$ factor here). Also note that the coefficients on the diagonal are decreasing, since the principal components are picked in order of decreasing variance. We might use this property to decide when we've captured enough variance with a smaller number of principal components.
Alternatively, scikit-learn exposes the eigenvalues corresponding to each of the eigenvectors; these are exactly the amount of variance in the original data set captured along each new basis dimension.
pca_model.explained_variance_
array([2.93808505, 0.9201649 , 0.14774182])
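As a quick consistency check, these should be (up to floating-point noise) the diagonal entries of the transformed covariance matrix above, divided by $n - 1$:
# The explained variances are the diagonal of T^T T / (n - 1).
n_samples = X2.shape[0]
print(np.diag(X2.T @ X2) / (n_samples - 1))
print(pca_model.explained_variance_)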
In fact, this is probably the best way to assess how many components we should target in the transformed space. Let's look at the explained variance when we project into all 4 dimensions:
decomposition.PCA(n_components=4).fit(X_normed).explained_variance_
array([2.93808505, 0.9201649 , 0.14774182, 0.02085386])
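scikit-learn also exposes this as a fraction of the total variance via the `explained_variance_ratio_` attribute, which makes the 'how many components do we need?' question a one-liner with a cumulative sum. A small sketch:
# Fraction of total variance captured by each component, and cumulatively.
full_pca = decomposition.PCA(n_components=4).fit(X_normed)
print(full_pca.explained_variance_ratio_)
print(np.cumsum(full_pca.explained_variance_ratio_))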
We can probably achieve a good split over our iris classes just sticking to the first two dimensions after PCA.
X2 = decomposition.PCA(n_components=2).fit_transform(X_normed)
fig, axes = plt.subplots(1, figsize=(15, 5))
axes.scatter(
X2[:, 0],
X2[:, 1],
c=y,
)
And let's try with 3:
X2 = decomposition.PCA(n_components=3).fit_transform(X_normed)
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot(projection='3d')
ax.scatter(
X2[:, 0],
X2[:, 1],
X2[:, 2],
c=y,
)
Revising some assumptions¶
Let's take a look at why it makes sense to maximize the variance along each transformed dimension. Intuitively, a direction with very little variance is one along which the samples barely differ, so projecting onto it tells us almost nothing that distinguishes one sample from another. By keeping the highest-variance directions, we preserve as much of the spread of the data, and hence as much of its distinguishing structure, as we can with the dimensions we keep.
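Here's a small sketch to make that concrete: project the data onto the highest-variance direction and onto the lowest-variance direction, and compare how well the classes stay separated in each 1-D view (the vertical jitter is just to make overlapping points visible):
# Compare 1-D projections onto the first (highest variance) and last
# (lowest variance) principal components.
X_all = decomposition.PCA(n_components=4).fit_transform(X_normed)
fig, axes = plt.subplots(1, 2, figsize=(15, 3))
for ax, component in zip(axes, [0, 3]):
    jitter = np.random.uniform(-0.1, 0.1, size=X_all.shape[0])
    ax.scatter(X_all[:, component], jitter, c=y)
    ax.set_xlabel(f"projection onto component {component + 1}")
    ax.set_yticks([])
plt.tight_layout()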
Summary¶
To summarize what was presented here:
- dimensionality reduction is a process through which we transform a data set into an approximately equivalent data set with a smaller number of features
- PCA, or principal component analysis, is one method of dimensionality reduction
Hopefully this has given you some intuition for how dimensionality reduction is useful, and for the math behind how PCA achieves it.