
Probability and Linear Algebra Refresher

What This Is

A working refresher of the math that shows up most often in the rest of the academy. Not a textbook — a pointer document with the minimum needed to read the topic pages without stalling.

If you can already explain what a matrix multiplication does, what a probability distribution is, and what an expectation is, you can skip this page. If those words feel loose, spend an hour here before moving into deep learning.

Every concept is paired with a runnable NumPy snippet. The point is to recognize the object, not to derive it.

When You Use It

  • pages start throwing around Wx + b, softmax, or "covariance" and you want a solid mental picture
  • attention, transformers, and linear regression are all starting to look like the same thing and you want to confirm that
  • you want to read papers without pausing every three lines
  • you are about to compute a gradient and the chain rule is not automatic yet

Part 1 — Linear Algebra

Vectors

A vector is a list of numbers with a consistent interpretation:

import numpy as np
x = np.array([2.0, -1.0, 0.5])

Two main operations:

  • addition: x + y adds component-wise
  • scaling: 3 * x multiplies every entry by 3

The length of a vector is its norm. The most common is the Euclidean norm (L2):

norm = np.sqrt(np.sum(x ** 2))     # or: np.linalg.norm(x)

The dot product multiplies two vectors and sums the result:

y = np.array([1.0, 0.0, -2.0])      # same length as x
np.dot(x, y)                        # same as (x * y).sum(); here 1.0

The dot product measures alignment: positive if the two vectors point the same way, zero if they are perpendicular, negative if they point opposite ways.
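To make the sign rule concrete, here is a small sketch using hand-picked 2-D vectors. The vector names and the cosine_similarity helper are illustrative assumptions, not something the academy defines.

```python
import numpy as np

a = np.array([1.0, 0.0])
same = np.array([2.0, 0.0])        # points the same way as a
perp = np.array([0.0, 3.0])        # perpendicular to a
opposite = np.array([-1.0, 0.0])   # points the opposite way

dot_same = np.dot(a, same)           # 2.0, positive
dot_perp = np.dot(a, perp)           # 0.0
dot_opposite = np.dot(a, opposite)   # -1.0, negative

def cosine_similarity(u, v):
    """Dot product divided by both norms, so vector length no longer matters."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cos_same = cosine_similarity(a, same)   # 1.0: perfectly aligned
```

Dividing by the norms is what turns "alignment scaled by length" into a pure angle measure between -1 and 1.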

Matrices

A matrix is a rectangular grid of numbers:

W = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
W.shape       # (3, 2) → 3 rows, 2 columns

Matrix-vector multiplication transforms a vector:

x = np.array([1.0, 1.0])       # shape (2,)
W @ x                          # shape (3,)

Each entry of the output is a dot product of one row of W with x. This is exactly the computation a linear layer performs.
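The row-by-row view can be checked directly. This sketch recomputes W @ x as one dot product per row and then adds a bias b (a made-up value) to form the Wx + b of a linear layer.

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # shape (3, 2)
x = np.array([1.0, 1.0])            # shape (2,)
b = np.array([0.5, 0.5, 0.5])       # hypothetical bias, shape (3,)

# One dot product per row of W
row_by_row = np.array([np.dot(row, x) for row in W])
direct = W @ x                      # [3., 7., 11.]

y = W @ x + b                       # linear layer output, shape (3,)
```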

Matrix-matrix multiplication A @ B: the number of columns of A must equal the number of rows of B. The result has shape (rows_of_A, cols_of_B). Thinking about shapes first prevents most bugs.

A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
(A @ B).shape          # (3, 2)

Transpose

Swap rows and columns:

W.T.shape              # (2, 3) if W was (3, 2)

In deep learning formulas, transposes mostly appear when you go from "row of a batch" to "column of features" or in the attention formula QK^T.

Identity, Inverse, Rank

  • identity matrix I: I @ x = x. The "do nothing" matrix.
  • inverse A^(-1): A @ A^(-1) = I. Exists only if A is square and non-singular (full rank).
  • rank: how many independent columns A has. Low-rank matrices carry less information than their size suggests — LoRA is built on this idea.

You rarely compute inverses directly in machine learning. Numerical methods (LU, QR, SVD) are used instead.
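As an illustration of that advice, the sketch below solves a small linear system with np.linalg.solve instead of forming the inverse, and checks rank with np.linalg.matrix_rank. The matrices are arbitrary examples.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x_solve = np.linalg.solve(A, b)     # factorizes A; no explicit inverse is formed
x_inv = np.linalg.inv(A) @ b        # same answer here, but slower and usually less accurate

rank_full = np.linalg.matrix_rank(A)                  # 2: both columns are independent

# A (3, 2) matrix built from a single direction has rank 1
low_rank = np.outer([1.0, 2.0, 3.0], [1.0, 0.5])
rank_low = np.linalg.matrix_rank(low_rank)            # 1
```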

The Singular Value Decomposition (SVD)

Every real matrix can be written as A = U Σ V^T:

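U, s, Vt = np.linalg.svd(A)

  • U and V are orthogonal matrices (rotations, possibly combined with a reflection)
  • Σ is diagonal with non-negative singular values, sorted from largest to smallest; NumPy returns them as the vector s
  • large singular values carry most of the information; small ones are often close to noise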

This is the engine behind PCA, low-rank approximation, matrix completion, and many compression methods. Dimensionality Reduction uses it directly.
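A quick way to build trust in the decomposition is to multiply the three factors back together. The sketch below does this on a random (5, 3) matrix; full_matrices=False asks NumPy for the compact form so the shapes line up without padding.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

# Compact SVD: U is (5, 3), s is (3,), Vt is (3, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_rebuilt = U @ np.diag(s) @ Vt
reconstruction_ok = np.allclose(A, A_rebuilt)

columns_orthonormal = np.allclose(U.T @ U, np.eye(3))   # columns of U are orthonormal
sorted_descending = bool(np.all(np.diff(s) <= 0))       # largest singular value first
```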

Eigenvalues And Eigenvectors

If A @ v = λ v for a nonzero vector v, then v is an eigenvector of A and λ is its eigenvalue. For symmetric matrices (which covariance matrices are), the eigenvectors can be chosen to form an orthonormal basis. For a covariance matrix specifically, each eigenvalue is the variance of the data along its eigenvector. PCA is exactly this decomposition.

You will not compute eigenvalues by hand in the academy. You should recognize the idea.
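To recognize the idea in code, here is a sketch on synthetic 2-D data (the stretch factors 3.0 and 0.5 are arbitrary). It checks the defining equation and confirms that the largest eigenvalue of the covariance matrix equals the variance of the data projected onto its eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data, stretched along the first axis
X = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])

C = np.cov(X, rowvar=False)               # (2, 2) symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # eigh is for symmetric matrices; eigenvalues ascending

v = eigvecs[:, -1]                        # eigenvector with the largest eigenvalue
defining_equation_holds = np.allclose(C @ v, eigvals[-1] * v)

projected = X @ v                         # coordinate of each point along v
variance_along_v = projected.var(ddof=1)  # np.cov also divides by n - 1
```

The direction v is what PCA would call the first principal component of X.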

Shapes Drive Deep Learning

When reading deep learning code, track shapes obsessively. A typical PyTorch transformer has tensors that look like:

(batch, sequence, embedding_dim)
(batch, num_heads, sequence, head_dim)
(batch, num_classes)

Most bugs are shape mismatches. The smallest habit that fixes this: annotate every forward method with a comment showing the shape after each step.
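The annotation habit looks roughly like this. The sketch uses NumPy rather than PyTorch, and every size (batch, sequence, num_heads, and so on) is a made-up toy value.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, sequence, embedding_dim = 4, 6, 8      # hypothetical sizes
num_heads, head_dim, num_classes = 2, 4, 3    # num_heads * head_dim == embedding_dim

x = rng.normal(size=(batch, sequence, embedding_dim))    # (batch, sequence, embedding_dim)

# Split the embedding into heads, then move the head axis forward
heads = x.reshape(batch, sequence, num_heads, head_dim)  # (batch, sequence, num_heads, head_dim)
heads = heads.transpose(0, 2, 1, 3)                      # (batch, num_heads, sequence, head_dim)

# Pool over the sequence and project to class scores
W = rng.normal(size=(embedding_dim, num_classes))        # (embedding_dim, num_classes)
pooled = x.mean(axis=1)                                  # (batch, embedding_dim)
logits = pooled @ W                                      # (batch, num_classes)

assert logits.shape == (batch, num_classes)
```

Putting a shape assert at the end of a block like this catches a mismatch where it is introduced rather than several layers later.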

Part 2 — Probability

A Distribution Is A Budget

A probability distribution over a finite set of outcomes is a list of non-negative numbers that sum to 1:

p = np.array([0.1, 0.6, 0.3])      # valid distribution over 3 outcomes
p.sum()                             # 1.0

This is the object a classifier outputs. softmax turns arbitrary real numbers into a valid distribution:

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()
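A short usage sketch for the function above, repeated here so the block runs on its own. The logits are arbitrary; the last two lines show why subtracting the max is safe and useful.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])
probs = softmax(logits)                 # non-negative, sums to 1, largest logit wins

# Adding the same constant to every logit does not change the result
shifted = softmax(logits + 100.0)

# Very large logits would overflow np.exp without the max shift
large = softmax(np.array([1000.0, 1001.0, 1002.0]))
```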

For continuous values (height, weight, pixel intensity), a distribution is described by a density function. You do not need to be fluent with densities; recognizing that "the model's prediction is a distribution" is enough.

Expectation

The average of a function under a distribution:

E[f(X)] = Σ p_i * f(x_i)         # discrete

In code, for a finite distribution:

x = np.array([0, 1, 2])
p = np.array([0.1, 0.6, 0.3])
expectation = (p * x).sum()      # = 1.2

Most "loss" quantities are expectations. Training minimizes the average loss over the training set, which is an estimate of the expected loss under the data distribution.

Variance And Standard Deviation

How spread out the distribution is:

Var(X) = E[(X - E[X])^2]
Std(X) = sqrt(Var(X))

In practice you estimate them from samples with np.var(x) and np.std(x). Both divide by n by default; pass ddof=1 to divide by n - 1 instead. You will see them everywhere: batch normalization, weight initialization, calibration.
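For a finite distribution the definition can be applied directly. The sketch below reuses the three-outcome distribution from the expectation example and compares the exact variance with an estimate from samples.

```python
import numpy as np

x = np.array([0, 1, 2])
p = np.array([0.1, 0.6, 0.3])

mean = (p * x).sum()                      # 1.2
variance = (p * (x - mean) ** 2).sum()    # E[(X - E[X])^2] = 0.36
std = np.sqrt(variance)                   # 0.6

# Estimate the same quantity from samples drawn from the distribution
rng = np.random.default_rng(0)
samples = rng.choice(x, p=p, size=100_000)
sample_variance = np.var(samples)         # close to 0.36
```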

Independence And Correlation

Two variables are independent if knowing one tells you nothing about the other. They are uncorrelated if their covariance is zero. Independence implies uncorrelated, but uncorrelated does not imply independent.

Correlation coefficient (Pearson's r) ranges from -1 to 1:

np.corrcoef(x, y)[0, 1]

Feature engineering and feature selection lean heavily on correlation; leakage diagnostics do too.
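The gap between "uncorrelated" and "independent" is easiest to see with a constructed example. In this sketch y is completely determined by x, yet their Pearson correlation is close to zero because the relationship is not linear.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)       # symmetric around 0
y = x ** 2                         # fully determined by x, so clearly dependent

r_xy = np.corrcoef(x, y)[0, 1]     # near 0: no linear relationship

# The dependence shows up once x is transformed nonlinearly
r_abs = np.corrcoef(np.abs(x), y)[0, 1]   # strongly positive
```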

Conditional Probability And Bayes' Rule

P(A | B) = P(A and B) / P(B)
P(A | B) = P(B | A) * P(A) / P(B)     # Bayes' rule

Bayes' rule is the backbone of probabilistic modeling. It turns "how likely is the evidence given a hypothesis" into "how likely is the hypothesis given the evidence".

The practical form you will meet in machine learning: the posterior over parameters is proportional to the likelihood times the prior. MAP estimation (maximum a posteriori) picks the θ that minimizes -log p(data | θ) - log p(θ), usually by gradient descent. That objective is exactly loss + regularizer: the negative log-likelihood is the loss and the negative log-prior is the regularizer.
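A worked instance of Bayes' rule, using the classic diagnostic-test setup. All three input probabilities are invented for illustration; A is "has the condition" and B is "test is positive".

```python
p_disease = 0.01              # prior P(A), hypothetical
p_pos_given_disease = 0.95    # likelihood P(B | A), hypothetical
p_pos_given_healthy = 0.05    # false positive rate P(B | not A), hypothetical

# P(B): total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' rule: P(A | B) = P(B | A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos   # about 0.16
```

Even with a fairly accurate test, the posterior stays low because the prior is small. That is the "evidence given hypothesis" versus "hypothesis given evidence" distinction in numbers.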

Likelihood And Cross-Entropy

The likelihood of a dataset under a model is the probability the model assigns to the observed labels. Training a classifier by cross-entropy is equivalent to maximizing the likelihood of the training data.

This is why cross-entropy is the default loss for classification — it is the natural loss from a probabilistic viewpoint.
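The equivalence can be checked numerically. In this sketch the predicted probabilities are made up; cross-entropy is computed as the average negative log-probability of the observed labels and compared with the negative log-likelihood of the whole dataset.

```python
import numpy as np

# Hypothetical model outputs: one row of class probabilities per example
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])              # observed classes

# Probability the model assigned to each observed label
p_observed = probs[np.arange(len(labels)), labels]    # [0.7, 0.8, 0.4]

likelihood = p_observed.prod()                        # 0.7 * 0.8 * 0.4
cross_entropy = -np.log(p_observed).mean()            # average negative log-likelihood

# Same quantity written the other way around
same_quantity = np.isclose(cross_entropy, -np.log(likelihood) / len(labels))
```

Because the logarithm is monotonic, lowering the cross-entropy and raising the likelihood are the same optimization.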

The Normal (Gaussian) Distribution

p(x) = (1 / sqrt(2π σ²)) exp(-(x - μ)² / (2 σ²))

Two parameters: mean μ and variance σ². It shows up in:

  • weight initialization schemes (He, Xavier)
  • noise models in regression
  • the central limit theorem, which explains why sums of many independent factors look Gaussian even when the factors themselves do not
  • variational methods, diffusion models, and many generative models

You do not need to compute densities by hand. You do need to recognize ~ N(μ, σ²) on sight.
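For readers who want to see the formula run once, here is a direct transcription into NumPy. The function name normal_pdf and the grid used for the numerical check are assumptions made for this sketch.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2); sigma2 is the variance."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

peak = normal_pdf(0.0, 0.0, 1.0)          # 1 / sqrt(2π), the height at the mean

# A density integrates to 1; check with a simple Riemann sum
grid = np.linspace(-8.0, 8.0, 10_001)
dx = grid[1] - grid[0]
density = normal_pdf(grid, 0.0, 1.0)
area = (density * dx).sum()               # close to 1

# Mass within one standard deviation of the mean, roughly 0.68
within_one_sigma = (density[np.abs(grid) <= 1.0] * dx).sum()
```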

Sampling

Drawing a value from a distribution. In NumPy:

rng = np.random.default_rng(seed=0)
rng.normal(0, 1, size=100)         # 100 samples from N(0, 1); the second argument is the standard deviation σ, not σ²
rng.choice([0, 1, 2], p=[0.1, 0.6, 0.3], size=10)

In generative models, sampling is how you produce new outputs. In bootstrap resampling, it is how you estimate uncertainty. In stochastic gradient descent, it is how you pick a batch each step.
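Two of those uses fit in a few lines. This is a minimal sketch on synthetic data: a bootstrap estimate of the standard error of the mean, and a mini-batch drawn without replacement. The dataset, the 1000 resamples, and the batch size of 32 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)    # hypothetical dataset

# Bootstrap: resample with replacement and recompute the statistic each time
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(1000)
])
standard_error = boot_means.std()      # spread of the resampled means

# Mini-batch: pick 32 distinct indices for one gradient step
batch_idx = rng.choice(data.size, size=32, replace=False)
batch = data[batch_idx]
```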

Part 3 — Calculus Just Enough To Follow

You do not need to solve integrals. You need to recognize what a gradient is and why it points uphill.

Derivative In One Line

The derivative of f(x) at a point is the slope of f there. Positive = going up, negative = going down, zero = flat.

Gradient For Many Variables

If f(w₁, w₂, ..., wₙ) is a function of many variables, the gradient is a vector whose i-th entry is the partial derivative with respect to wᵢ. It points in the direction of steepest increase. The Loss and Gradients Intuition page walks through this.
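One way to make the gradient tangible is to approximate it with finite differences. The sketch below uses a simple bowl-shaped function whose exact gradient is 2w; numerical_gradient is a helper written for this example, not a library function.

```python
import numpy as np

def f(w):
    return (w ** 2).sum()             # exact gradient is 2 * w

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference estimate of each partial derivative."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

w = np.array([1.0, -2.0, 0.5])
grad = numerical_gradient(f, w)       # approximately [2, -4, 1]

# The gradient points uphill, so stepping against it lowers f
w_down = w - 0.1 * grad
```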

Chain Rule

If z = g(f(x)), then dz/dx = g'(f(x)) * f'(x). The chain rule is how gradients flow through stacked layers. Backpropagation is one application of the chain rule, executed efficiently by reusing intermediate results.

You do not need to compute chain rules by hand. PyTorch does it for you. But the next time loss.backward() succeeds, you should know that the chain rule is what made it possible.
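A two-function example shows the rule at work. The functions f and g here are arbitrary picks; the analytic derivative from the chain rule is compared against a finite-difference estimate.

```python
def f(x):
    return 3.0 * x + 1.0        # f'(x) = 3

def g(u):
    return u ** 2               # g'(u) = 2u

x = 2.0

# Chain rule: dz/dx = g'(f(x)) * f'(x) = 2 * f(x) * 3
analytic = 2.0 * f(x) * 3.0     # 2 * 7 * 3 = 42

# Finite-difference check of the composed function
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)
```

Autograd systems apply the same multiplication of local derivatives, one layer at a time, working backward from the loss.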

Part 4 — A Reading List In Five Rows

Use this when a paper or topic page outpaces your math background:

| Concept | Academy page that revisits it | External reference when the academy page is not enough |
| --- | --- | --- |
| matrices and linear layers | What Is A Model | Strang, Introduction to Linear Algebra, ch. 2 |
| SVD and PCA | Dimensionality Reduction | Strang ch. 7; Bishop ch. 12 |
| probability and likelihood | Loss and Gradients Intuition | MacKay, Information Theory, Inference and Learning Algorithms, ch. 2 |
| gradients and backprop | PyTorch Training Loops | 3Blue1Brown neural network series (video) |
| calibration and Bayes rule | Calibration and Thresholds | Gelman et al., Bayesian Data Analysis, ch. 1–2 |

The Beyond-The-Academy page has longer off-ramps into the same territory.

What To Inspect

When a formula on a later page confuses you:

  • write out the shapes of every tensor or vector involved
  • plug in tiny numbers (2 × 2 matrices, distributions over 3 outcomes) and compute by hand or in NumPy
  • compare what you get to what the code produces
  • ask whether the formula is a linear combination, a dot product, a softmax, a likelihood, or an expectation — one of those five covers most of deep learning

Failure Pattern

Memorizing formulas instead of shapes. A formula you cannot rewrite with small numbers will not stick. If "softmax" or "cross-entropy" still feels abstract after reading a page, open NumPy and try it on a 3-element vector.

Another failure: treating the math as separate from the code. The Q K^T / sqrt(d_k) in the attention formula is Q @ K.transpose(-2, -1) / math.sqrt(d_k) in PyTorch. They are the same object. Seeing that makes the rest of the academy read faster.
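The same correspondence holds in NumPy. Below is a minimal single-head sketch of softmax(Q K^T / sqrt(d_k)) V on a 3-token toy sequence; the sizes and random inputs are assumptions, and it leaves out masking, batching, and multiple heads.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, d_k = 3, 4                            # hypothetical toy sizes
Q = rng.normal(size=(seq, d_k))            # (seq, d_k)
K = rng.normal(size=(seq, d_k))            # (seq, d_k)
V = rng.normal(size=(seq, d_k))            # (seq, d_k)

scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq): one dot product per token pair
weights = softmax(scores, axis=-1)         # (seq, seq): each row is a distribution
output = weights @ V                       # (seq, d_k): weighted average of value rows
```

Every line is one of the building blocks from this page: a dot product, a softmax, and a linear combination.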

Common Mistakes

  • mixing up W @ x and W * x (matrix multiplication vs element-wise)
  • forgetting the transpose in QK^T
  • treating probability and odds as the same thing
  • applying softmax twice by accident
  • confusing variance with standard deviation when reading code
  • reading a tensor's shape after an error instead of before
  • treating independence as the same as uncorrelated

Practice

  1. In NumPy, compute W @ x where W is (3, 2) and x is (2,). Print the shape and values.
  2. Implement softmax in 5 lines. Verify it sums to 1.
  3. Sample 10,000 values from N(0, 1) and plot a histogram. Confirm mean ≈ 0, std ≈ 1.
  4. Compute the Pearson correlation between two vectors you construct to be linearly related plus noise.
  5. Compute the SVD of a random (10, 4) matrix. Zero out the two smallest singular values and reconstruct. Compare to the original.
  6. Pick one formula from the attention page and rewrite it in NumPy on a 3-token toy sequence.

Runnable Example

No runnable web example for this page. Each code block above runs in any Python environment with NumPy installed; a few reuse variables such as x or A defined in an earlier block.

Longer Connection

Continue with What Is A Model for the function-with-knobs picture, Loss and Gradients Intuition for the training loop viewed as downhill motion, and Beyond The Academy for longer external textbooks when the reading list above runs out.