Probability and Linear Algebra Refresher¶
What This Is¶
A working refresher of the math that shows up most often in the rest of the academy. Not a textbook — a pointer document with the minimum needed to read the topic pages without stalling.
If you can already explain what a matrix multiplication does, what a probability distribution is, and what an expectation is, you can skip this page. If those words feel loose, spend an hour here before moving into deep learning.
Every concept is paired with a runnable NumPy snippet. The point is to recognize the object, not to derive it.
When You Use It¶
- pages start throwing around Wx + b, softmax, or "covariance" and you want a solid mental picture
- attention, transformers, and linear regression are all starting to look like the same thing and you want to confirm that
- you want to read papers without pausing every three lines
- you are about to compute a gradient and the chain rule is not automatic yet
Part 1 — Linear Algebra¶
Vectors¶
A vector is a list of numbers with a consistent interpretation:
import numpy as np
x = np.array([2.0, -1.0, 0.5])
Two main operations:
- addition: x + y adds component-wise
- scaling: 3 * x multiplies every entry by 3
The length of a vector is its norm. The most common is the Euclidean norm (L2):
norm = np.sqrt(np.sum(x ** 2)) # or: np.linalg.norm(x)
The dot product multiplies two vectors and sums the result:
np.dot(x, y) # same as (x * y).sum()
The dot product measures alignment: positive if the two vectors point the same way, zero if they are perpendicular, negative if they point opposite ways.
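The three cases above can be checked on small vectors. This is a minimal sketch; the vectors and the cosine helper are made up for illustration.

```python
import numpy as np

a = np.array([1.0, 0.0])
same = np.array([2.0, 0.5])       # points roughly the same way as a
perp = np.array([0.0, 3.0])       # perpendicular to a
opposite = np.array([-1.0, 0.2])  # points roughly the opposite way

print(np.dot(a, same))      # 2.0, positive
print(np.dot(a, perp))      # 0.0
print(np.dot(a, opposite))  # -1.0, negative

def cosine(u, v):
    # Dividing by both norms removes the effect of length and leaves
    # only the alignment, a number between -1 and 1.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, same))
```

The raw dot product grows with the length of the vectors; the cosine version is what "similarity" usually means when embeddings are compared.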
Matrices¶
A matrix is a rectangular grid of numbers:
W = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
W.shape # (3, 2) → 3 rows, 2 columns
Matrix-vector multiplication transforms a vector:
x = np.array([1.0, 1.0]) # shape (2,)
W @ x # shape (3,)
Each entry of the output is a dot product of one row of W with x. This is exactly the computation a linear layer performs.
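To see the "one dot product per row" claim directly, compute the product both ways. A small sketch; the bias b is an assumption added here to complete the Wx + b form.

```python
import numpy as np

W = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # shape (3, 2)
x = np.array([1.0, 1.0])     # shape (2,)
b = np.zeros(3)              # hypothetical bias, shape (3,)

# One dot product per row of W
by_rows = np.array([np.dot(row, x) for row in W])
direct = W @ x

print(by_rows)                       # values 3, 7, 11
print(np.allclose(by_rows, direct))  # True

y = W @ x + b                        # the linear-layer form Wx + b, shape (3,)
```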
Matrix-matrix multiplication A @ B: the number of columns of A must equal the number of rows of B. The result has shape (rows_of_A, cols_of_B). Thinking about shapes first prevents most bugs.
A = np.random.randn(3, 4)
B = np.random.randn(4, 2)
(A @ B).shape # (3, 2)
Transpose¶
Swap rows and columns:
W.T.shape # (2, 3) if W was (3, 2)
In deep learning formulas, transposes mostly appear when you go from "row of a batch" to "column of features" or in the attention formula QK^T.
Identity, Inverse, Rank¶
- identity matrix I: I @ x = x. The "do nothing" matrix.
- inverse A^(-1): A @ A^(-1) = I. Only exists if A is square and non-degenerate.
- rank: how many linearly independent columns A has. Low-rank matrices carry less information than their size suggests — LoRA is built on this idea.
You rarely compute inverses directly in machine learning. Numerical methods (LU, QR, SVD) are used instead.
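As a rough illustration of solving without an explicit inverse, and of rank, here is a sketch on made-up matrices. np.linalg.solve and np.linalg.matrix_rank are standard NumPy calls; the numbers are arbitrary.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Solve A @ x = b through a factorization, without forming A^(-1)
x_solve = np.linalg.solve(A, b)

# Same answer via the explicit inverse; fine on a 2 x 2 toy,
# generally slower and less accurate on large matrices
x_inv = np.linalg.inv(A) @ b

print(np.allclose(A @ x_solve, b))   # True
print(np.allclose(x_solve, x_inv))   # True

# Rank: the second column is twice the first, so there is only
# one independent direction even though the shape is (3, 2)
low_rank = np.array([[1.0, 2.0],
                     [2.0, 4.0],
                     [3.0, 6.0]])
rank = np.linalg.matrix_rank(low_rank)
print(rank)                          # 1
```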
The Singular Value Decomposition (SVD)¶
Every real matrix can be written as A = U Σ V^T:
U, s, Vt = np.linalg.svd(A)
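- U and V are orthogonal matrices (rotations, possibly with a reflection)
- Σ (returned as the vector s) is a diagonal of non-negative singular values, sorted from largest to smallest
- big singular values carry most of the information; small ones are often close to noise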
This is the engine behind PCA, low-rank approximation, matrix completion, and many compression methods. Dimensionality Reduction uses it directly.
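A short sketch of what "low-rank approximation" means in practice, on a random matrix chosen only for illustration. It rebuilds A exactly from the three factors, then keeps only the top k singular values and checks that the error is the size of the dropped ones.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# full_matrices=False gives the compact shapes: U (6, 4), s (4,), Vt (4, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_full = U @ np.diag(s) @ Vt                 # exact reconstruction
print(np.allclose(A, A_full))                # True

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation

# The Frobenius-norm error equals the norm of the dropped singular values
err = np.linalg.norm(A - A_k)
expected = np.sqrt(np.sum(s[k:] ** 2))
print(np.isclose(err, expected))             # True
```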
Eigenvalues And Eigenvectors¶
If A @ v = λ v for a nonzero vector v, then v is an eigenvector and λ is its eigenvalue. For symmetric matrices (which covariance matrices are), the eigenvectors can be chosen to form an orthonormal basis. For a covariance matrix, each eigenvalue is the variance of the data along its eigenvector. PCA is exactly this.
You will not compute eigenvalues by hand in the academy. You should recognize the idea.
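To recognize the idea in code, here is a minimal sketch on synthetic 2-D data (the data and its scaling are assumptions for illustration). It checks that the eigenvalues of the covariance matrix match the variance of the data projected onto each eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 points, stretched far more along the first axis
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
Xc = X - X.mean(axis=0)                  # center the data

C = np.cov(Xc, rowvar=False)             # (2, 2), symmetric
eigvals, eigvecs = np.linalg.eigh(C)     # eigh is for symmetric matrices;
                                         # eigenvalues come back in ascending order,
                                         # eigenvectors are the columns

# A @ v = λ v, checked for all columns at once
print(np.allclose(C @ eigvecs, eigvecs * eigvals))    # True

# Variance of the data along each eigenvector equals the eigenvalue
projected = Xc @ eigvecs                 # (500, 2)
var_along = projected.var(axis=0, ddof=1)
print(np.allclose(var_along, eigvals))   # True
```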
Shapes Drive Deep Learning¶
When reading deep learning code, track shapes obsessively. A typical PyTorch transformer has tensors that look like:
(batch, sequence, embedding_dim)
(batch, num_heads, sequence, head_dim)
(batch, num_classes)
Most bugs are shape mismatches. The smallest habit that fixes this: annotate every forward with a comment showing the shape after each step.
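The annotation habit might look like the sketch below. It uses NumPy rather than PyTorch to stay consistent with the rest of this page, and the layer sizes, the mean pooling, and the function name forward are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, d_model, num_classes = 4, 5, 8, 3    # hypothetical sizes

def forward(x, W_hidden, W_out):
    # x:                               (batch, seq, d_model)
    h = np.maximum(x @ W_hidden, 0.0)  # (batch, seq, d_model)  linear + ReLU
    pooled = h.mean(axis=1)            # (batch, d_model)       average over the sequence
    logits = pooled @ W_out            # (batch, num_classes)
    return logits

x = rng.normal(size=(batch, seq, d_model))
W_hidden = rng.normal(size=(d_model, d_model))
W_out = rng.normal(size=(d_model, num_classes))

logits = forward(x, W_hidden, W_out)
assert logits.shape == (batch, num_classes)
```

The trailing assert turns the last comment into something the interpreter checks for you.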
Part 2 — Probability¶
A Distribution Is A Budget¶
A probability distribution over a finite set of outcomes is a list of non-negative numbers that sum to 1:
p = np.array([0.1, 0.6, 0.3]) # valid distribution over 3 outcomes
p.sum() # 1.0
This is the object a classifier outputs. softmax turns arbitrary real numbers into a valid distribution:
def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()
For continuous values (height, weight, pixel intensity), a distribution is described by a density function. You do not need to be fluent with densities; recognizing that "the model's prediction is a distribution" is enough.
Expectation¶
The average of a function under a distribution:
E[f(X)] = Σ p_i * f(x_i) # discrete
In code, for a finite distribution:
x = np.array([0, 1, 2])
p = np.array([0.1, 0.6, 0.3])
expectation = (p * x).sum() # = 1.2
Most "loss" quantities are expectations. Training minimizes the expected loss over the training distribution.
Variance And Standard Deviation¶
How spread out the distribution is:
Var(X) = E[(X - E[X])^2]
Std(X) = sqrt(Var(X))
In practice you compute them from a sample with np.var(x) and np.std(x). Both divide by n by default; pass ddof=1 for the n - 1 version. You will see them everywhere: batch normalization, weight initialization, calibration.
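For a finite distribution given as probabilities, the definition can be applied directly. A small sketch that reuses the three-outcome distribution from the expectation section and then compares against samples (the sample size is an arbitrary choice).

```python
import numpy as np

x = np.array([0, 1, 2])
p = np.array([0.1, 0.6, 0.3])

mean = (p * x).sum()                  # E[X] = 1.2
var = (p * (x - mean) ** 2).sum()     # E[(X - E[X])^2] = 0.36
std = np.sqrt(var)                    # 0.6

# The same numbers, estimated from samples drawn from p
rng = np.random.default_rng(0)
samples = rng.choice(x, p=p, size=100_000)
print(var, np.var(samples))           # close, not identical
```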
Independence And Correlation¶
Two variables are independent if knowing one tells you nothing about the other. They are uncorrelated if their covariance is zero. Independence implies uncorrelated, but uncorrelated does not imply independent.
Correlation coefficient (Pearson's r) ranges from -1 to 1:
np.corrcoef(x, y)[0, 1]
Feature engineering and feature selection lean heavily on correlation; leakage diagnostics do too.
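The gap between "uncorrelated" and "independent" is easy to see on a constructed example. In this sketch y is fully determined by x, yet the covariance is zero because the relationship is symmetric rather than linear.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2                            # y depends entirely on x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r = np.corrcoef(x, y)[0, 1]

print(cov_xy)   # 0.0
print(r)        # 0.0 (up to floating-point error)
```

Pearson's r only measures linear association, so a zero here says nothing about other kinds of dependence.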
Conditional Probability And Bayes' Rule¶
P(A | B) = P(A and B) / P(B)
P(A | B) = P(B | A) * P(A) / P(B) # Bayes' rule
Bayes' rule is the backbone of probabilistic modeling. It turns "how likely is the evidence given a hypothesis" into "how likely is the hypothesis given the evidence".
The practical form you will meet in machine learning: the posterior over parameters is proportional to the likelihood times the prior. MAP estimation (maximum a posteriori) picks the θ that minimizes -log p(data | θ) - log p(θ), usually by gradient descent. That objective has exactly the shape loss + regularizer.
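A worked instance of Bayes' rule with plain numbers. The spam-filter framing and every probability below are invented for illustration only.

```python
# Hypothesis: the message is spam. Evidence: it contains a particular word.
p_spam = 0.2               # prior P(spam), assumed
p_word_given_spam = 0.5    # likelihood P(word | spam), assumed
p_word_given_ham = 0.05    # P(word | not spam), assumed

# P(word), summing over both hypotheses
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_word, 3))             # 0.14
print(round(p_spam_given_word, 3))  # 0.714
```

The evidence moves the belief from 0.2 to about 0.71: the word is ten times more likely under "spam" than under "not spam", but the low prior keeps the posterior well short of certainty.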
Likelihood And Cross-Entropy¶
The likelihood of a dataset under a model is the probability the model assigns to the observed labels. Training a classifier by cross-entropy is equivalent to maximizing the likelihood of the training data.
This is why cross-entropy is the default loss for classification — it is the natural loss from a probabilistic viewpoint.
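The equivalence can be verified numerically. A minimal sketch with made-up predictions for three examples: the mean cross-entropy equals the negative log-likelihood divided by the number of examples.

```python
import numpy as np

# Hypothetical predicted distributions: 3 examples, 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])          # observed classes

# Probability the model assigned to each observed label
p_correct = probs[np.arange(len(labels)), labels]   # 0.7, 0.8, 0.4

likelihood = p_correct.prod()                # 0.7 * 0.8 * 0.4 = 0.224
cross_entropy = -np.log(p_correct).mean()    # mean over examples

print(np.isclose(cross_entropy, -np.log(likelihood) / len(labels)))  # True
```

Because the log is monotonic, pushing the cross-entropy down and pushing the likelihood up are the same optimization.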
The Normal (Gaussian) Distribution¶
p(x) = (1 / sqrt(2π σ²)) exp(-(x - μ)² / (2 σ²))
Two parameters: mean μ and variance σ². It shows up in:
- weight initialization schemes (He, Xavier)
- noise models in regression
- the central limit theorem, which explains why sums of many independent factors look Gaussian even when the factors themselves do not
- variational methods, diffusion models, and many generative models
You do not need to compute densities by hand. You do need to recognize ~ N(μ, σ²) on sight.
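If you want to see the formula as code once, here is a direct transcription. The helper name normal_pdf and the grid used for the numerical check are choices made for this sketch.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    # (1 / sqrt(2π σ²)) * exp(-(x - μ)² / (2 σ²))
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

xs = np.linspace(-8.0, 8.0, 16001)
dx = xs[1] - xs[0]

area = normal_pdf(xs).sum() * dx   # Riemann sum; a density integrates to 1
peak = normal_pdf(0.0)             # 1 / sqrt(2π), about 0.3989

print(area)
print(peak)
```

Note that a density value is not a probability: it can exceed 1 when σ is small. Only areas under the curve are probabilities.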
Sampling¶
Drawing a value from a distribution. In NumPy:
rng = np.random.default_rng(seed=0)
rng.normal(0, 1, size=100) # 100 samples from N(0, 1)
rng.choice([0, 1, 2], p=[0.1, 0.6, 0.3], size=10)
In generative models, sampling is how you produce new outputs. In bootstrap resampling, it is how you estimate uncertainty. In stochastic gradient descent, it is how you pick a batch each step.
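As one concrete case, bootstrap resampling can be sketched in a few lines. The dataset here is synthetic and the number of resamples (2000) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.normal(5.0, 2.0, size=200)    # stand-in for a real dataset

# Resample the data with replacement many times and record the mean each time
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2000)
])

# The spread of those means estimates the uncertainty of the sample mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(data.mean(), (low, high))
```

The interval (low, high) is a rough 95% percentile interval for the mean, obtained without any formula for the standard error.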
Part 3 — Calculus Just Enough To Follow¶
You do not need to solve integrals. You need to recognize what a gradient is and why it points uphill.
Derivative In One Line¶
The derivative of f(x) at a point is the slope of f there. Positive = going up, negative = going down, zero = flat.
Gradient For Many Variables¶
If f(w₁, w₂, ..., wₙ) is a function of many variables, the gradient is a vector whose i-th entry is the partial derivative with respect to wᵢ. It points in the direction of steepest increase. The Loss and Gradients Intuition page walks through this.
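A quick way to build trust in a gradient is to compare it against finite differences. The function f below is an arbitrary example, and numerical_grad is a helper written for this sketch.

```python
import numpy as np

def f(w):
    return w[0] ** 2 + 3.0 * w[0] * w[1]

def grad_f(w):
    # Partial derivatives worked out by hand: [2*w0 + 3*w1, 3*w0]
    return np.array([2.0 * w[0] + 3.0 * w[1], 3.0 * w[0]])

def numerical_grad(f, w, eps=1e-6):
    # Central difference along each coordinate in turn
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

w = np.array([1.0, 2.0])
g = grad_f(w)                                     # values 8, 3
print(np.allclose(g, numerical_grad(f, w), atol=1e-5))   # True

# A small step along the gradient goes uphill, against it goes downhill
print(f(w + 1e-3 * g) > f(w) > f(w - 1e-3 * g))   # True
```

Gradient descent is the second half of that last line, repeated.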
Chain Rule¶
If z = g(f(x)), then dz/dx = g'(f(x)) * f'(x). The chain rule is how gradients flow through stacked layers. Backpropagation is one application of the chain rule, executed efficiently by reusing intermediate results.
You do not need to compute chain rules by hand. PyTorch does it for you. But the next time loss.backward() succeeds, you should know that the chain rule is what made it possible.
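For one concrete check of the rule, take f(x) = x² and g(u) = sin(u), both picked arbitrarily for this sketch, and compare the chain-rule derivative against a finite-difference estimate.

```python
import numpy as np

def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

def g(u):
    return np.sin(u)

def g_prime(u):
    return np.cos(u)

x = 0.7

# Chain rule: dz/dx = g'(f(x)) * f'(x)
analytic = g_prime(f(x)) * f_prime(x)

# Finite-difference estimate of the same derivative
eps = 1e-6
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)

print(analytic, numeric)   # agree to several decimal places
```

Notice that f(x) is needed to evaluate g'. Storing intermediate values like this during the forward pass is the "reusing intermediate results" part of backpropagation.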
Part 4 — A Reading List In Five Rows¶
Use this when a paper or topic page outpaces your math background:
| Concept | Academy page that revisits it | External reference when the academy page is not enough |
|---|---|---|
| matrices and linear layers | What Is A Model | Strang, Introduction to Linear Algebra, ch. 2 |
| SVD and PCA | Dimensionality Reduction | Strang ch. 7; Bishop ch. 12 |
| probability and likelihood | Loss and Gradients Intuition | MacKay, Information Theory, Inference and Learning Algorithms, ch. 2 |
| gradients and backprop | PyTorch Training Loops | 3Blue1Brown neural network series (video) |
| calibration and Bayes rule | Calibration and Thresholds | Gelman et al., Bayesian Data Analysis, ch. 1–2 |
The Beyond-The-Academy page has longer off-ramps into the same territory.
What To Inspect¶
When a formula on a later page confuses you:
- write out the shapes of every tensor or vector involved
- plug in tiny numbers (2 × 2 matrices, distributions over 3 outcomes) and compute by hand or in NumPy
- compare what you get to what the code produces
- ask whether the formula is a linear combination, a dot product, a softmax, a likelihood, or an expectation — one of those five covers most of deep learning
Failure Pattern¶
Memorizing formulas instead of shapes. A formula you cannot rewrite with small numbers will not stick. If "softmax" or "cross-entropy" still feels abstract after reading a page, open NumPy and try it on a 3-element vector.
Another failure: treating the math as separate from the code. The Q K^T / sqrt(d_k) in the attention formula is Q @ K.transpose(-2, -1) / math.sqrt(d_k) in PyTorch. They are the same object. Seeing that makes the rest of the academy read faster.
Common Mistakes¶
- mixing up W @ x and W * x (matrix multiplication vs element-wise)
- forgetting the transpose in QK^T
- treating probability and odds as the same thing
- applying softmax twice by accident
- confusing variance with standard deviation when reading code
- reading a tensor's shape after an error instead of before
- treating independence as the same as uncorrelated
Practice¶
- In NumPy, compute W @ x where W is (3, 2) and x is (2,). Print the shape and values.
- Implement softmax in 5 lines. Verify it sums to 1.
- Sample 10,000 values from N(0, 1) and plot a histogram. Confirm mean ≈ 0, std ≈ 1.
- Compute the Pearson correlation between two vectors you construct to be linearly related plus noise.
- Compute the SVD of a random (10, 4) matrix. Zero out the two smallest singular values and reconstruct. Compare to the original.
- Pick one formula from the attention page and rewrite it in NumPy on a 3-token toy sequence.
Runnable Example¶
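One possible end-to-end sketch that strings together the objects from this page: a linear layer, softmax, and the cross-entropy of a single label. It is illustrative only; the sizes, the random weights, the input, and the label are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

W = rng.normal(size=(3, 2))      # (num_classes, num_features)
b = np.zeros(3)                  # (num_classes,)
x = np.array([1.0, -0.5])        # (num_features,)
label = 1                        # index of the observed class

logits = W @ x + b               # (3,)  linear layer
probs = softmax(logits)          # (3,)  valid distribution
loss = -np.log(probs[label])     # cross-entropy for this one example

print(logits.shape, probs.sum())
print(loss)
```

Training would adjust W and b to push this loss down, averaged over many examples.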
Longer Connection¶
Continue with What Is A Model for the function-with-knobs picture, Loss and Gradients Intuition for the training loop viewed as downhill motion, and Beyond The Academy for longer external textbooks when the reading list above runs out.