SVM, Kernels, and Learning Theory¶
This pack stays close to margins, feature maps, regularization, and sample-complexity reasoning. The prompts are adapted from publicly available official course materials and rewritten into academy form.
Use this pack when the weakness is geometric reasoning, not library syntax.
Use This Pack Cold¶
Do not read this page straight through.
- Spend 45 to 90 seconds per prompt before revealing anything.
- Say the decision first, then the reason.
- If you need more than three sentences, the answer is not cold yet.
- If two prompts in a row feel shaky, stop and route back before finishing the pack.
Cold answer shape for this pack:
- feature-map prompts: name the transformed view and why the method stays linear
- margin or regularization prompts: name the effect first, then the bias-variance cost
- theory prompts: write the bound or principle, then state what it does not prove
QK01. Bent Boundary, Linear Parameters¶
Question: Why can a classifier with decision function \( f(x)=\theta_0+\theta_1 x+\theta_2 x^2 \) still be trained with a linear method?
Cold target: Name the feature map and the parameterization claim.
Reveal Academy Answer
- Because the model is nonlinear in the raw input, but linear in the trainable parameters.
- After the feature map \(\phi(x)=[1,x,x^2]\), the decision function is just \(\theta^\top \phi(x)\).
- Linear methods operate on the transformed features.
Why it matters: The whole point of feature maps is richer geometry without changing the parameterization style.
Source family: Stanford CS229 linear regression with feature maps and MIT OCW 6.867 SVM feature-map questions
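The feature-map claim in QK01 can be sketched in a few lines of plain Python. The parameters and input values below are hypothetical, chosen only to make the arithmetic visible:

```python
# Minimal sketch: a model that is nonlinear in the raw input x but linear in
# the parameters theta, via the feature map phi(x) = [1, x, x^2].
# theta and the test inputs are assumed example values, not from any dataset.

def phi(x):
    """Feature map lifting a scalar input into [1, x, x^2]."""
    return [1.0, x, x * x]

def decision(theta, x):
    """f(x) = theta^T phi(x): linear in theta, quadratic in x."""
    return sum(t, f) if False else sum(t * f for t, f in zip(theta, phi(x)))

theta = [1.0, -2.0, 1.0]     # hypothetical parameters: f(x) = (x - 1)^2
print(decision(theta, 3.0))  # 1 - 6 + 9 = 4.0
```

Any linear trainer (least squares, perceptron, linear SVM) that sees only `phi(x)` fits this curved boundary without modification.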
QK02. Bigger Margin, Different Kernel¶
Question: Kernel A yields a larger geometric margin than Kernel B on the same training set. Does that prove Kernel A will generalize better?
Cold target: Say no first, then name the missing comparison rule.
Reveal Academy Answer
- No.
- A margin is informative, but margins across different kernels are not a complete model-comparison rule.
- Generalization also depends on model class complexity, regularization strength, data fit, and mismatch between the kernel and the true pattern.
Why it matters: One attractive statistic should not replace an honest validation comparison.
Source family: MIT OCW 6.867 SVM and margin questions
QK03. What Happens When C Gets Very Large?¶
Question: In a soft-margin SVM, what is the qualitative effect of making \(C\) extremely large?
Cold target: Name the direction of the tradeoff and the main risk.
Reveal Academy Answer
- The model penalizes margin violations much more aggressively.
- That pushes the optimizer toward fitting the training points more strictly, often at the cost of a smaller effective margin and higher variance.
- In noisy data, very large \(C\) can overfit.
Why it matters: \(C\) is not just a knob for accuracy. It is a bias-variance and robustness tradeoff.
Source family: MIT OCW 6.867 SVM and margin questions and Stanford CS229 regularization/model-selection notes
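The direction of the tradeoff is visible in the soft-margin objective itself. This is a sketch evaluated at a fixed hyperplane, with assumed toy data and weights, only to show how \(C\) reweights violations:

```python
# Soft-margin SVM objective for a *fixed* (w, b), to show how C reweights
# margin violations. Data, w, b, and the C values are hypothetical.

def hinge(y, score):
    """Margin violation: max(0, 1 - y * score)."""
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, data, C):
    """(1/2)||w||^2 + C * sum of hinge losses over the training set."""
    reg = 0.5 * sum(wj * wj for wj in w)
    score = lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b
    return reg + C * sum(hinge(y, score(x)) for x, y in data)

data = [([2.0], 1), ([0.5], 1), ([-1.0], -1)]  # one point inside the margin
w, b = [1.0], 0.0
# The single violation costs 0.5 per unit of C, so large C dominates the
# regularizer and pushes the optimizer to eliminate violations at any cost.
print(soft_margin_objective(w, b, data, C=1.0))    # 0.5 + 0.5 = 1.0
print(soft_margin_objective(w, b, data, C=100.0))  # 0.5 + 50.0 = 50.5
```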
QK04. Hinge Loss Versus Zero-One Loss¶
Question: Why do SVMs optimize hinge loss instead of optimizing classification error directly?
Cold target: Name the optimization problem first, then the surrogate reason.
Reveal Academy Answer
- Zero-one loss is discontinuous and hard to optimize directly.
- Hinge loss is a convex surrogate that still rewards correct classification with margin.
- It provides a practical optimization problem while keeping pressure on confident separation.
Why it matters: Good ML often works by replacing the intractable ideal objective with a useful surrogate.
Source family: MIT OCW 6.867 SVM and margin questions
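The surrogate argument is easy to see numerically. A hypothetical illustration with a few assumed scores:

```python
# Zero-one loss is flat almost everywhere (no gradient signal) and jumps at 0;
# hinge loss is convex and keeps penalizing until the margin reaches 1.
# The scores below are assumed example values.

def zero_one(y, score):
    """1 for a misclassification (ties counted as errors), 0 otherwise."""
    return 1.0 if y * score <= 0 else 0.0

def hinge(y, score):
    """Convex surrogate: max(0, 1 - y * score); penalizes small margins too."""
    return max(0.0, 1.0 - y * score)

for score in [-1.0, 0.5, 2.0]:
    print(zero_one(1, score), hinge(1, score))
# score=-1.0: zero-one 1.0, hinge 2.0  (both penalize the error)
# score= 0.5: zero-one 0.0, hinge 0.5  (correct but inside the margin)
# score= 2.0: zero-one 0.0, hinge 0.0  (correct with margin)
```

The middle row is the key difference: zero-one loss is already satisfied, while hinge loss still pushes for confident separation.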
QK05. Finite-Hypothesis PAC Bound¶
Question: Suppose the hypothesis class has size \(|H|=500\). Under the standard finite-hypothesis PAC bound, how does required sample size scale with \(\varepsilon\) and \(\delta\)?
Cold target: Write the bound and state the scaling in words.
Reveal Academy Answer
- The standard bound is \[ m \ge \frac{1}{\varepsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right). \]
- So the sample size grows linearly with \(1/\varepsilon\) and logarithmically with both \(|H|\) and \(1/\delta\).
- More accuracy is expensive; more confidence or a larger finite class is costly too, but only logarithmically.
Why it matters: This is the simplest formal picture of why harder guarantees demand more data.
Source family: CMU 10-601 learning-theory questions
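The scaling claim can be checked by plugging numbers into the bound. The \(\varepsilon\) and \(\delta\) values below are assumed for illustration; only \(|H|=500\) comes from the prompt:

```python
import math

# Smallest m satisfying the finite-hypothesis (realizable) PAC bound
# m >= (1/eps) * (ln|H| + ln(1/delta)). eps and delta are assumed values.

def pac_sample_size(H_size, eps, delta):
    """Ceiling of the right-hand side of the bound."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

m_base = pac_sample_size(500, eps=0.1, delta=0.05)
m_acc  = pac_sample_size(500, eps=0.05, delta=0.05)    # halve eps: m roughly doubles
m_conf = pac_sample_size(500, eps=0.1, delta=0.0005)   # 100x confidence: only +ln(100)/eps
print(m_base, m_acc, m_conf)
```

The printed triple makes the asymmetry concrete: tightening accuracy is the expensive direction, while buying much more confidence adds only a modest number of examples.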
QK06. RBF Kernel Or Linear Baseline?¶
Question: You have 80 examples, 200 normalized features, and the current linear SVM is underfitting visibly. Is the first nonlinear move an RBF kernel or more linear feature engineering?
Cold target: Pick the safer next move and name the validation condition.
Reveal Academy Answer
- An RBF kernel is a reasonable first nonlinear move, but only under a fixed validation rule.
- With just 80 examples, the risk of overfitting is real, so compare it against the linear baseline honestly.
- If the signal looks local or curved, RBF may help; if the problem is mostly additive and data are scarce, careful features plus linear regularization may still win.
Why it matters: Nonlinear capacity is a hypothesis to test, not an automatic upgrade.
Source family: MIT OCW 6.867 SVM feature-map questions and Stanford CS229 model-selection notes
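For reference, the kernel under discussion is small enough to write out. A sketch of the standard RBF form, with an assumed `gamma` that would in practice be chosen by validation:

```python
import math

# RBF kernel k(x, z) = exp(-gamma * ||x - z||^2). Similarity decays with
# squared distance, which is why it captures local, curved structure.
# gamma here is an assumed hyperparameter, not a recommended value.

def rbf_kernel(x, z, gamma=1.0):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))  # distant points -> near 0
```

With 80 examples and 200 features, `gamma` effectively controls how local the decision rule is, which is exactly the capacity that the fixed validation rule must police.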
QK07. Who Are The Support Vectors?¶
Question: Which training points matter most in a hard-margin SVM solution?
Cold target: Name the points and why they control the boundary.
Reveal Academy Answer
- The support vectors.
- These are the points lying exactly on the margin boundary; they alone determine the separating hyperplane.
- Moving far-away points often changes nothing, while moving a support vector can change the decision boundary substantially.
Why it matters: This is the geometric reason SVMs focus on critical boundary cases rather than every point equally.
Source family: MIT OCW 6.867 SVM and margin questions
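The geometric claim can be illustrated without solving an SVM: for a fixed hyperplane, the candidate support vectors are the points attaining the minimal geometric margin. The hyperplane and points below are assumed for illustration; a real SVM solves for \((w, b)\):

```python
import math

# Geometric margin of a point x with respect to the hyperplane w.x + b = 0:
# |w.x + b| / ||w||. (w, b) and the points are hypothetical example values.

def geometric_margin(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wj * wj for wj in w))

w, b = [1.0, 1.0], -1.0
points = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
margins = [geometric_margin(w, b, x) for x in points]
closest = min(margins)
print([x for x, m in zip(points, margins) if abs(m - closest) < 1e-9])
# -> [[0.0, 0.0], [1.0, 1.0]]: the two margin-attaining points; moving
#    [3.0, 3.0] slightly would leave this boundary unchanged.
```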
QK08. Strong Training Accuracy, Weak Theory Story¶
Question: A flexible kernel classifier gets nearly perfect training accuracy on a tiny dataset. What is the first theory-aware concern?
Cold target: Name the diagnosis and the first corrective move.
Reveal Academy Answer
- High variance.
- A rich hypothesis class can fit small datasets easily without learning a stable pattern.
- The right next step is not to celebrate the fit, but to evaluate under a fixed split and control complexity with regularization or simpler models.
Why it matters: Theory is useful when it stops you from mistaking memorization for learning.
Source family: CMU 10-601 learning-theory questions and MIT OCW 6.867 SVM themes
What To Do After This Pack¶
Use your miss pattern, not your mood:
- 0-1 misses: move to SVM and Advanced Clustering and compare one linear baseline against one nonlinear alternative under one fixed rule.
- 2-3 misses: rerun SVM Margins and Kernels, then retake this pack cold.
- 4+ misses or repeated confusion about validation claims: stop and repair the comparison rule first in Cross-Validation.
If QK05 or QK08 felt slow, write the finite-hypothesis bound once from memory before leaving this page.