University-Sourced Solved Questions¶
This pack contains academy-style solved questions adapted from publicly available official university materials. The problems are rewritten for the academy, not copied verbatim from the original exams or handouts.
This is the mixed foundations pack. Use it when you want a broad cold check before routing into a narrower pack.
Use This Pack Cold¶
Do not read the answers on the first pass.
- Spend 45 to 90 seconds per prompt.
- Answer in one short paragraph, a few bullets, or a few lines of algebra.
- Say the decision or final object first, then the reason.
- If two prompts in the same block feel weak, stop and route out immediately.
Cold answer shape:
- optimization or algebra prompts: write the optimizer, update, or bound first, then the condition that makes it valid
- model-choice prompts: choose one family first, then give one reason and one failure risk
- geometry or representation prompts: say what the method captures, then what it does not prove
If you want more concentrated repetition after this page, continue with:
- Validation, Leakage, and Model Choice
- SVM, Kernels, and Learning Theory
- Unsupervised Learning and Representation
- Deep Learning and Checkpoints
- Timed Checkpoint Sheets
Optimization And Linear Models¶
Q01. Simplex Log-Optimization¶
Question: Let (s_1,\dots,s_d > 0). Maximize [ f(x)=\sum_{i=1}^d s_i \log x_i ] subject to (x_i \ge 0) and (\sum_i x_i = 1). What is the optimizer?
Cold target: Write the optimizer and the proportionality argument.
Reveal Academy Answer
- Form the Lagrangian: (\mathcal{L}(x,\lambda)=\sum_i s_i \log x_i + \lambda(\sum_i x_i - 1)).
- Set derivatives to zero: (\partial \mathcal{L}/\partial x_i = s_i/x_i + \lambda = 0).
- So (x_i = -s_i/\lambda), meaning all coordinates are proportional to (s_i).
- Enforce the simplex constraint: (\sum_i x_i = 1) gives (-1/\lambda = 1/\sum_j s_j).
- Therefore [ x_i^* = \frac{s_i}{\sum_j s_j}. ]
- The objective is strictly concave on the simplex, so this maximizer is unique.
Why it matters: This is the normalization logic behind many probability and mixture-weight updates.
Source family: Cornell CS4786 placement-style optimization
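The closed form above can be sanity-checked numerically. This is a minimal sketch with a made-up score vector `s`; the Dirichlet samples are just random simplex points to compare against.

```python
import numpy as np

def simplex_log_argmax(s):
    """Maximizer of sum_i s_i * log(x_i) over the probability simplex."""
    s = np.asarray(s, dtype=float)
    return s / s.sum()

def objective(x, s):
    return float(np.sum(s * np.log(x)))

s = np.array([3.0, 1.0, 6.0])      # hypothetical positive scores
x_star = simplex_log_argmax(s)     # normalization: [0.3, 0.1, 0.6]

# No other simplex point should score higher than x_star.
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.dirichlet(np.ones(3))  # a random point on the simplex
    assert objective(x, s) <= objective(x_star, s) + 1e-12
```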
Q02. Rayleigh Quotient On The Unit Sphere¶
Question: Maximize (x^\top M x) subject to (|x|_2=1). What vector solves this, and what is the maximum value?
Cold target: Name the vector and the value.
Reveal Academy Answer
- If (M) is symmetric, this is the Rayleigh quotient.
- The maximizing direction is any eigenvector corresponding to the largest eigenvalue of (M).
- The maximum value is (\lambda_{\max}(M)).
- If (M) is not symmetric, replace it with ((M+M^\top)/2), because the antisymmetric part cancels in (x^\top M x).
Why it matters: This is the same eigenvector logic that shows up in PCA and spectral methods.
Source family: Cornell CS4786 placement-style optimization
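The eigenvector claim can be checked the same way. A small sketch with a random symmetrized matrix; `eigh` returns eigenvalues in ascending order, so the last column is the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
M = (A + A.T) / 2                    # symmetrize, as in the answer above

eigvals, eigvecs = np.linalg.eigh(M) # ascending eigenvalues, orthonormal vectors
lam_max = eigvals[-1]
v = eigvecs[:, -1]                   # unit eigenvector for the largest eigenvalue

# The quadratic form at v equals lambda_max ...
assert np.isclose(v @ M @ v, lam_max)

# ... and no random unit vector does better.
for _ in range(1000):
    x = rng.standard_normal(4)
    x /= np.linalg.norm(x)
    assert x @ M @ x <= lam_max + 1e-10
```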
Q03. Normal Equations And Uniqueness¶
Question: For ordinary least squares with design matrix (X \in \mathbb{R}^{n \times d}) and target vector (y), when is the solution unique, and what is it?
Cold target: State the rank condition and the closed-form answer.
Reveal Academy Answer
- Any (w) satisfying (X^\top X w = X^\top y) minimizes the squared loss.
- The solution is unique exactly when (X^\top X) is invertible.
- That is equivalent to the columns of (X) being linearly independent, i.e. (\mathrm{rank}(X)=d).
- In that case, [ w^* = (X^\top X)^{-1} X^\top y. ]
Why it matters: This is the clean version of the baseline linear-regression solve before regularization or iterative training.
Source family: UC Berkeley CS189 OLS and least-squares materials
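When the rank condition holds, the closed form should agree with a library least-squares solver. A quick sketch on random data, where the columns of `X` are almost surely independent:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.standard_normal((n, d))   # rank(X) = d almost surely
y = rng.standard_normal(n)

# Closed form from the normal equations (valid because X^T X is invertible here).
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# The library least-squares solver should agree.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_closed, w_lstsq)
```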
Q04. OLS Cost And Gradient Descent Update¶
Question: What is the computational cost of solving the normal equations directly, and what is one gradient-descent update for OLS?
Cold target: Give the total cost and one update step.
Reveal Academy Answer
- Computing (X^\top X) costs (O(nd^2)).
- Computing (X^\top y) costs (O(nd)).
- Solving the resulting (d \times d) linear system costs (O(d^3)).
- Total direct cost: (O(nd^2 + d^3)).
- For loss (J(w)=|Xw-y|_2^2), the gradient is [ \nabla J(w)=2X^\top(Xw-y). ]
- So one gradient step is [ w \leftarrow w - 2\eta X^\top (Xw-y). ]
Why it matters: This is the split between closed-form solves and iterative optimization.
Source family: UC Berkeley CS189 OLS and least-squares materials
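The gradient step above can be run against the closed form. In this sketch the step size `eta` and iteration count are illustrative values chosen to converge on this random problem:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on J(w) = ||Xw - y||_2^2 with the update from the answer above.
w = np.zeros(d)
eta = 1e-3
for _ in range(5000):
    w = w - 2 * eta * X.T @ (X @ w - y)

# The iterates converge to the normal-equations solution.
assert np.allclose(w, w_closed, atol=1e-6)
```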
Q05. “Linear In What?”¶
Question: Is the model [ h(x)=\theta_0+\theta_1 x+\theta_2 x^2+\theta_3 x^3 ] a linear model or a nonlinear model?
Cold target: Say what stays linear and what does not.
Reveal Academy Answer
- It is nonlinear in the raw input (x), but linear in the parameters (\theta).
- So it is still a linear regression model after a feature map (\phi(x)=[1,x,x^2,x^3]).
- The key question is not “is the curve bent?” but “does the model stay linear in the trainable parameters?”
Why it matters: Feature maps let simple linear methods fit richer patterns.
Source family: Stanford CS229 linear regression with feature maps
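"Linear in the parameters" means the ordinary least-squares machinery applies unchanged after the feature map. A sketch with a hypothetical cubic ground truth (the coefficients in `theta_true` are made up):

```python
import numpy as np

theta_true = np.array([1.0, -2.0, 0.5, 0.3])  # hypothetical cubic coefficients

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=200)
Phi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)  # phi(x) = [1, x, x^2, x^3]
y = Phi @ theta_true                                       # noiseless for a clean check

# Plain OLS in feature space recovers the nonlinear curve's parameters exactly.
theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
assert np.allclose(theta_hat, theta_true)
```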
Q06. Bigger Polynomial, Better Training, Unclear Test¶
Question: Model A is linear. Model B is quadratic. On the training set, Model B fits better. What can you conclude about the test set?
Cold target: Reject the training-only claim and name the real uncertainty.
Reveal Academy Answer
- You can conclude almost nothing from training fit alone.
- The higher-order model has more flexibility, so lower training error is expected.
- Test performance depends on the true data-generating process and the amount of data.
- If the underlying pattern is truly curved and you have enough data, the quadratic model may win.
- If data are limited or the true pattern is close to linear, the quadratic model may overfit and lose on test.
Why it matters: This is the core move behind bias-variance reasoning and validation discipline.
Source family: CMU 10-601 regression and overfitting questions
Generative, Discriminative, And Bayesian Reasoning¶
Q07. Naive Bayes Posterior Under A Fixed Prior¶
Question:
A classifier has features urgent, coupon, and cash. Suppose
- (p(\text{spam})=0.15)
- (p(\text{urgent} \mid \text{spam})=0.2)
- (p(\text{coupon} \mid \text{spam})=0.8)
- (p(\text{cash} \mid \text{spam})=0.6)
- (p(\text{urgent} \mid \text{regular})=0.7)
- (p(\text{coupon} \mid \text{regular})=0.2)
- (p(\text{cash} \mid \text{regular})=0.2)
An email contains urgent and cash, but not coupon. Which class wins under Naive Bayes?
Cold target: Compute both class scores and name the winner.
Reveal Academy Answer
- Spam score: [ 0.15 \cdot 0.2 \cdot (1-0.8) \cdot 0.6 = 0.0036 ]
- Regular score: [ 0.85 \cdot 0.7 \cdot (1-0.2) \cdot 0.2 = 0.0952 ]
- The regular score is much larger.
- So the email should be classified as regular.
Why it matters: This is the cleanest way to see how priors and conditional likelihoods combine.
Source family: CMU 10-601 Naive Bayes exam questions
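The arithmetic above fits in a few lines, which is also how you would check it cold:

```python
# Naive Bayes class scores for the email above (has urgent, has cash, no coupon).
p_spam = 0.15
p_regular = 1 - p_spam

spam_score = p_spam * 0.2 * (1 - 0.8) * 0.6        # urgent, not-coupon, cash | spam
regular_score = p_regular * 0.7 * (1 - 0.2) * 0.2  # urgent, not-coupon, cash | regular

# spam_score is about 0.0036, regular_score about 0.0952, so regular wins.
assert regular_score > spam_score
```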
Q08. Why Lower The Spam Prior On Purpose?¶
Question: Why might an engineer intentionally choose a spam prior that is lower than the observed fraction of spam messages?
Cold target: Name the policy goal before the probability story.
Reveal Academy Answer
- If false positives are especially costly, you want the classifier to be more conservative about calling something spam.
- Lowering the spam prior shifts the decision rule toward “regular” unless the evidence is strong.
- This is not a claim about the world being different; it is a policy choice about asymmetric error costs.
Why it matters: Priors are often operating-policy choices, not only empirical frequencies.
Source family: CMU 10-601 Naive Bayes exam questions
Q09. GDA Or Logistic Regression First?¶
Question: You have a small dataset and the class-conditional distributions look roughly Gaussian with similar covariance. Would you try GDA or logistic regression first?
Cold target: Pick the first model and give the small-data reason.
Reveal Academy Answer
- Try GDA first.
- When the Gaussian assumptions are approximately right and data are limited, GDA can have lower variance because it uses a stronger model.
- Logistic regression is often safer when you have more data or suspect the generative assumptions are wrong.
- The real answer is empirical: compare them on the same split.
Why it matters: This is the practical tradeoff between model bias and data efficiency.
Source family: Stanford CS229 generative algorithms and MIT 6.867 generative-versus-discriminative themes
Q10. Decision Tree Versus Logistic Regression¶
Question: When does a decision tree beat logistic regression, and what is the usual price you pay?
Cold target: Name the winning pattern and the price.
Reveal Academy Answer
- A tree wins when the real rule is interaction-heavy, split-based, or highly non-additive.
- Logistic regression is stronger when the signal is mostly additive and the boundary is simple.
- The usual price of a tree is higher overfitting risk, especially if depth and splitting are not controlled.
Why it matters: This is a recurring “simpler linear baseline versus more flexible structure” decision.
Source family: CMU 10-601 decision-tree comparisons
Margins, Kernels, And Learning Theory¶
Q11. Separable In A Feature Map, Not On The Line¶
Question: Give three one-dimensional points that are not linearly separable in (x), but are separable after the feature map (\phi(x)=[x,x^2]).
Cold target: Give the three points and the separating quadratic.
Reveal Academy Answer
- Use (x=1) with label (+1), (x=2) with label (-1), and (x=3) with label (+1).
- On the real line, one threshold cannot separate positive, negative, positive in that order.
- In feature space, use the quadratic decision function [ x^2 - 4x + 3.5. ]
- It is positive at (x=1) and (x=3), but negative at (x=2), so these three points become separable.
Why it matters: This is the concrete intuition behind kernels and nonlinear feature maps.
Source family: MIT OCW 6.867 SVM feature-map questions
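The separating function above is linear in feature space: with weights `w = [-4, 1]` and bias `b = 3.5`, the map `w . phi(x) + b` is exactly `x^2 - 4x + 3.5`. A sketch checking the signs:

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0])
labels = np.array([+1, -1, +1])

# In phi(x) = [x, x^2], a linear function reproduces the quadratic above.
w, b = np.array([-4.0, 1.0]), 3.5
phi = np.stack([points, points**2], axis=1)
scores = phi @ w + b               # [0.5, -0.5, 0.5]

# Signs match the labels, so the mapped points are linearly separable.
assert np.all(np.sign(scores) == labels)
```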
Q12. Does A Bigger Margin Across Two Kernels Guarantee Better Test Error?¶
Question: Two kernels trained on the same dataset produce different geometric margins. Does the larger margin automatically mean better test performance?
Cold target: Say no first, then state why cross-kernel margin comparison is weak.
Reveal Academy Answer
- No.
- Margin is useful, but comparing margins across different feature maps or kernels does not directly determine test error.
- Generalization depends on the whole hypothesis class, the data distribution, model mismatch, and regularization, not only one raw margin number.
Why it matters: This prevents overclaiming from a single geometric statistic.
Source family: MIT OCW 6.867 SVM and margin questions
Q13. Boolean Hypothesis Space Sample Complexity¶
Question: Suppose emails are described by 10 binary features. Your hypothesis space is all Boolean functions on those 10 bits. How many samples are sufficient to get error at most (0.1) with confidence (0.95) under the standard finite-hypothesis PAC bound?
Cold target: Write the hypothesis size, the bound, and the rough sample count.
Reveal Academy Answer
- The number of Boolean functions on 10 bits is [ |H| = 2^{2^{10}}. ]
- The finite-hypothesis PAC bound gives [ m \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right). ]
- With (\varepsilon=0.1) and (\delta=0.05), [ m \ge 10\left(2^{10}\ln 2 + \ln 20\right). ]
- Numerically, [ m \approx 10(709.78 + 2.996) \approx 7128. ]
Why it matters: Huge hypothesis spaces demand either lots of data or stronger inductive bias.
Source family: CMU 10-601 learning-theory questions
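The numeric steps above can be reproduced directly:

```python
import math

eps, delta = 0.1, 0.05
ln_H = 2**10 * math.log(2)          # ln|H| for |H| = 2^(2^10)

# Finite-hypothesis PAC bound: m >= (1/eps) * (ln|H| + ln(1/delta)).
m = (1 / eps) * (ln_H + math.log(1 / delta))
# m is about 7127.8, so roughly 7128 samples suffice.
```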
Q14. Cross-Validation Or Test Set?¶
Question: You are comparing five hyperparameter settings. Should you use the test set repeatedly to pick the best one?
Cold target: Say no first, then name the honest selection boundary.
Reveal Academy Answer
- No.
- Use cross-validation or a validation split to choose among hyperparameters.
- Keep the test set locked until the selection rule is finished.
- Otherwise the test set becomes part of model selection and stops being an honest final estimate.
Why it matters: This is one of the academy's most important anti-leakage rules.
Source family: UC Berkeley CS189 cross-validation themes and Stanford CS229 model-selection notes
Q15. Bias-Variance Diagnosis From Curves¶
Question: Training error is very low, validation error is much higher, and the gap stays large as the model gets more flexible. What is the first diagnosis?
Cold target: Name the diagnosis and one defensible next move.
Reveal Academy Answer
- This is primarily a high-variance pattern.
- The model is fitting the training set too closely relative to what the validation set supports.
- First follow-ups: simplify the model, regularize more, get more data, or stop earlier.
- Do not celebrate the low training loss by itself.
Why it matters: This is the visual pattern behind overfitting.
Source family: UC Berkeley CS189 bias-variance materials and Stanford CS229 regularization/model-selection notes
Unsupervised Learning And Representation¶
Q16. One K-Means Update By Hand¶
Question: Run one k-means update on the five points [ (0,0), (1,0), (0,2), (4,0), (5,0) ] with initial centers [ c_1=(0,0), \quad c_2=(5,0). ] What are the updated centers after one assignment-and-mean step?
Cold target: Show the assignment split and the two updated means.
Reveal Academy Answer
- Points assigned to (c_1): ((0,0), (1,0), (0,2)).
- Points assigned to (c_2): ((4,0), (5,0)).
- New center for cluster 1: [ \left(\frac{0+1+0}{3}, \frac{0+0+2}{3}\right)=\left(\frac{1}{3}, \frac{2}{3}\right). ]
- New center for cluster 2: [ \left(\frac{4+5}{2}, \frac{0+0}{2}\right)=(4.5,0). ]
Why it matters: K-means just alternates between assignment and averaging. If you cannot do one step by hand, the code will stay opaque.
Source family: Stanford CS229 unsupervised-learning materials and Cornell CS4786 clustering themes
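The hand computation above is one assignment step plus one averaging step. A minimal sketch of exactly that update:

```python
import numpy as np

points = np.array([[0, 0], [1, 0], [0, 2], [4, 0], [5, 0]], dtype=float)
centers = np.array([[0, 0], [5, 0]], dtype=float)

# Assignment step: nearest center for each point.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
assign = dists.argmin(axis=1)       # [0, 0, 0, 1, 1]

# Update step: mean of each cluster.
new_centers = np.array([points[assign == k].mean(axis=0) for k in range(2)])
# new_centers is [[1/3, 2/3], [4.5, 0]], matching the hand computation.
```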
Q17. First Principal Component Direction¶
Question: Your points are stretched mostly along the line (x_2 \approx x_1). What should the first principal component look like?
Cold target: Name the first direction and the orthogonal second one.
Reveal Academy Answer
- The first principal component should point along the direction of greatest variance.
- Here that direction is approximately [ \frac{1}{\sqrt{2}}(1,1). ]
- The second component is orthogonal to it, approximately [ \frac{1}{\sqrt{2}}(1,-1). ]
Why it matters: PCA is about variance direction, not label separation.
Source family: UC Berkeley CS189 PCA materials and Stanford CS229 PCA notes
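A quick empirical check: build a hypothetical cloud stretched along the diagonal (the spreads 3.0 and 0.3 are illustrative), then read off the top eigenvector of the covariance.

```python
import numpy as np

rng = np.random.default_rng(5)
t = rng.standard_normal(500) * 3.0       # large spread along x2 = x1
noise = rng.standard_normal(500) * 0.3   # small spread across it
X = np.stack([t + noise, t - noise], axis=1)

# PCA via the eigendecomposition of the sample covariance.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue

# Up to sign, pc1 should align with (1,1)/sqrt(2).
target = np.array([1, 1]) / np.sqrt(2)
assert abs(pc1 @ target) > 0.99
```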
Q18. EM Responsibility Intuition¶
Question: A two-component Gaussian mixture has equal mixing weights and equal variance. One component is centered at 0, the other at 4. For the point (x=3.5), which component gets higher responsibility, and what happens if both variances shrink while the means stay fixed?
Cold target: Name the winning component and the variance effect.
Reveal Academy Answer
- The component centered at 4 gets higher responsibility, because 3.5 is much closer to 4 than to 0.
- If both variances shrink, the relative preference becomes more extreme.
- So the responsibilities get sharper, not flatter.
Why it matters: This is the intuition behind why EM assignments can harden when components become tighter.
Source family: Stanford CS229 mixture-of-Gaussians and EM materials and Cornell CS4786 clustering themes
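Both claims can be verified from the responsibility formula for equal weights and equal variance, where the priors and normalizing constants cancel:

```python
from math import exp

def responsibilities(x, mu1, mu2, sigma):
    """Posterior over two equal-weight, equal-variance Gaussian components."""
    l1 = exp(-(x - mu1) ** 2 / (2 * sigma**2))
    l2 = exp(-(x - mu2) ** 2 / (2 * sigma**2))
    return l1 / (l1 + l2), l2 / (l1 + l2)

r = responsibilities(3.5, mu1=0.0, mu2=4.0, sigma=1.0)
r_tight = responsibilities(3.5, mu1=0.0, mu2=4.0, sigma=0.5)

# The component at 4 wins, and shrinking the variance sharpens the preference.
assert r[1] > r[0]
assert r_tight[1] > r[1]
```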
Updates, Networks, And Model Selection¶
Q19. Perceptron Update¶
Question: A perceptron has current weight vector (w=[-1, 0.5]), bias (b=0), learning rate (\eta=1), and sees an example (x=[2,1]) with label (y=+1). If the example is misclassified, what is the update?
Cold target: Write the update rule and the updated parameters.
Reveal Academy Answer
- Perceptron update: [ w \leftarrow w + \eta y x, \qquad b \leftarrow b + \eta y. ]
- So [ w' = [-1,0.5] + [2,1] = [1,1.5]. ]
- And [ b' = 0 + 1 = 1. ]
Why it matters: This is the cleanest update rule in linear classification, and it shows up later as a building block for larger optimization stories.
Source family: Stanford CS229 linear-classifier materials
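The update above, including the misclassification check that the question asks you to assume:

```python
import numpy as np

w = np.array([-1.0, 0.5])
b, eta = 0.0, 1.0
x, y = np.array([2.0, 1.0]), +1

# y * (w . x + b) <= 0 means the example is misclassified; here it is -1.5.
if y * (w @ x + b) <= 0:
    w = w + eta * y * x
    b = b + eta * y

# After the update: w = [1, 1.5], b = 1.0, matching the hand computation.
```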
Q20. Early Stopping Versus Final Epoch¶
Question: Training loss keeps dropping for 20 epochs, but validation loss bottoms out at epoch 6 and rises after that. Which checkpoint should you keep, and what is the first diagnosis?
Cold target: Name the checkpoint and the diagnosis.
Reveal Academy Answer
- Keep the best-validation checkpoint, not the final epoch.
- The first diagnosis is overfitting or excess variance after epoch 6.
- The next move is to keep the split fixed and change one thing: stronger regularization, earlier stopping, or a smaller model.
- Reporting the final epoch as the winner would hide the actual validation story.
Why it matters: Checkpoint choice is part of model selection, not a file-management detail.
Source family: Stanford CS229 regularization/model-selection notes and UC Berkeley CS189 bias-variance themes
What To Do After This Pack¶
Do not just reread. Route by the block that broke:
- misses in Q01-Q06: return to Classical ML Workflow and repair the algebra, fit, and validation basics.
- misses in Q07-Q10: go narrower with Validation, Leakage, and Model Choice or the matching generative-versus-discriminative topic before retaking this pack.
- misses in Q11-Q15: move into SVM, Kernels, and Learning Theory and retake only that narrower pack next.
- misses in Q16-Q18: route into Unsupervised Learning and Representation or Clustering and Low-Dimensional Views.
- misses in Q19-Q20: return to PyTorch Training Loops or ResNet, BERT, and Fine-Tuning depending on whether the weakness is loop mechanics or checkpoint choice.
Exit rule:
- 0-2 misses: move to one narrower pack for speed.
- 3-5 misses: route back to the smallest runnable layer before taking another mixed pack.
- 6+ misses: stop here and rebuild one block at a time; this pack is too broad for the current gap.