Regression Metrics and Diagnostics¶
Scenario: Predicting House Prices¶
You're building a model to predict home sale prices based on features like size and location. Errors in thousands of dollars matter—use regression metrics to evaluate accuracy and diagnose where the model fails, ensuring reliable price estimates for buyers and sellers.
What This Is¶
Classification has accuracy, precision, and recall. Regression has its own set of metrics — and its own ways to lie to you. This topic covers how to measure regression quality and how to diagnose where a model fails.
When You Use It¶
- evaluating any model that predicts a continuous value
- choosing between MSE, MAE, and R² for model selection
- diagnosing whether errors are systematic or randomly scattered
- communicating model quality to stakeholders in interpretable units
Metric Comparison¶
| Metric | Formula Intuition | Sensitive To | Best For |
|---|---|---|---|
| MSE | average of squared errors | outliers (heavily) | penalizing large mistakes |
| RMSE | √MSE — same units as target | outliers | interpretable error magnitude |
| MAE | average of absolute errors | moderate outlier sensitivity | robust central error |
| R² | 1 − (MSE / variance of y) | scale-free | comparing across datasets |
| MAPE | % error relative to true value | small true values (divides by y) | business percentage targets |
| Median AE | median of absolute errors | resistant to outliers | when the median error matters more than the mean |
When Each Metric Misleads¶
- MSE/RMSE: one huge outlier can dominate the entire score
- MAE: hides the fact that some predictions are extremely wrong
- R²: can be negative (model worse than predicting the mean), and does not tell you where errors concentrate
- MAPE: explodes when true values are near zero
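The outlier sensitivity above is easy to verify numerically. A minimal sketch (the numbers are invented for illustration): corrupting a single prediction multiplies MSE by orders of magnitude while MAE grows far more modestly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_good = np.array([2.8, 5.3, 2.0, 8.1, 4.2])   # all errors are small
y_outl = y_good.copy()
y_outl[3] = 27.0                               # one prediction off by 20

for name, pred in [("no outlier", y_good), ("one outlier", y_outl)]:
    mse = mean_squared_error(y_true, pred)
    mae = mean_absolute_error(y_true, pred)
    print(f"{name}: MSE={mse:.2f}  MAE={mae:.2f}")
```

Here MSE jumps from about 0.34 to about 80 (over 200×), while MAE rises from 0.48 to 4.26 (about 9×): the squared term lets one mistake dominate the score.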
Minimal Example¶
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    median_absolute_error,
    root_mean_squared_error,  # requires scikit-learn >= 1.4
)
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.3, 2.0, 8.1, 4.2])
print(f"MSE: {mean_squared_error(y_true, y_pred):.4f}")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.4f}")
print(f"MAE: {mean_absolute_error(y_true, y_pred):.4f}")
print(f"Median AE: {median_absolute_error(y_true, y_pred):.4f}")
print(f"R²: {r2_score(y_true, y_pred):.4f}")
Residual Analysis — The Real Diagnostic¶
Metrics give you one number. Residuals tell you the story.
import matplotlib.pyplot as plt
residuals = y_true - y_pred
# Residual plot (predictions vs errors): look for patterns, fan shapes, systematic bias
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.show()
What residual patterns mean¶
| Pattern | Diagnosis |
|---|---|
| Random scatter around zero | ✅ the model is unbiased |
| Fan shape (errors grow with predictions) | heteroscedasticity — consider log-transforming the target |
| Curved pattern | the model misses a nonlinear relationship |
| Cluster of large errors in one region | the model fails on a specific subgroup |
| All residuals positive or negative | systematic bias — the model consistently over/under-predicts |
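The fan shape in the table can be reproduced on synthetic data (the data-generating process below is invented for illustration): when noise scales with the target, the magnitude of the residuals grows with the prediction, and a cheap numeric proxy for that is the correlation between |residual| and the predicted value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(500, 1))
# multiplicative noise: error scale grows with the target (the classic fan shape)
y = 3.0 * X.ravel() * (1 + 0.2 * rng.standard_normal(500))

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# positive correlation between |residual| and prediction signals heteroscedasticity
corr = np.corrcoef(model.predict(X), np.abs(residuals))[0, 1]
print(f"corr(|residual|, prediction) = {corr:.2f}")
```

A correlation near zero suggests homoscedastic errors; a clearly positive value, as here, suggests trying a log transform of the target.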
The Diagnostic Ladder¶
- Compute metrics — MSE, MAE, R² for the overall picture
- Plot residuals vs predictions — look for patterns
- Plot residuals vs features — find where the model fails
- Check residuals by group — are errors worse for a subpopulation?
- Compare against a baseline — does the model beat DummyRegressor(strategy="mean")?
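Step 4 of the ladder, checking errors by group, can be sketched with pandas. The column names and values here are invented placeholders for a real validation frame:

```python
import pandas as pd

# hypothetical validation results with a grouping feature
df = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B", "C"],
    "y_true": [200, 250, 400, 380, 420, 900],
    "y_pred": [210, 240, 390, 400, 410, 700],
})
df["abs_err"] = (df["y_true"] - df["y_pred"]).abs()
# mean absolute error per subgroup reveals where the model fails
print(df.groupby("neighborhood")["abs_err"].mean())
```

In this toy frame, neighborhoods A and B show small errors while C carries an error of 200: a subgroup failure the overall MAE would hide.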
Baseline Pattern¶
from sklearn.dummy import DummyRegressor
# assumes X_train, y_train, X_valid, y_valid, and model_pred already exist in your pipeline
dummy = DummyRegressor(strategy="mean")
dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_valid)
print(f"Dummy MAE: {mean_absolute_error(y_valid, dummy_pred):.3f}")
print(f"Model MAE: {mean_absolute_error(y_valid, model_pred):.3f}")
If the model barely beats the dummy, the features are probably too weak — not the model.
When To Log-Transform The Target¶
If the target spans orders of magnitude (e.g., house prices from $50K to $5M), predicting in log space often helps:
import numpy as np
# assumes model, X_train, y_train, and X_valid come from your pipeline
y_log = np.log1p(y_train)           # log1p handles zero targets safely
model.fit(X_train, y_log)
pred_log = model.predict(X_valid)
pred_original = np.expm1(pred_log)  # invert back to the original price scale
Check whether residuals become more uniform after the transform.
Failure Pattern¶
Reporting only R² without checking residuals. An R² of 0.85 sounds good, but if all the errors concentrate on high-value predictions, the model is systematically failing where it matters most.
Another failure: using MAPE on data with zeros or near-zero values, which produces infinite or misleading percentages.
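The MAPE failure is easy to reproduce with toy numbers: a single near-zero true value dominates the average percentage even when its absolute error is tiny.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100.0, 200.0, 0.5])  # one near-zero true value
y_pred = np.array([110.0, 190.0, 2.5])  # absolute error of only 2.0 on it

# per-point percentage errors: 10%, 5%, and 400% -> mean is dominated by the last
print(mean_absolute_percentage_error(y_true, y_pred))
```

The result is about 1.38, i.e. a reported 138% average error, even though two of the three predictions are within 10% of the truth.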
Common Mistakes¶
- comparing MSE across datasets with different target scales (use R² or normalize)
- forgetting that R² can be negative — it just means the model is worse than the mean
- treating a low MAE as proof of a good model when the residuals show systematic patterns
- optimizing for MSE when the business cares about MAE (or vice versa)
Practice¶
- Compute MSE, MAE, and R² for a regression model and explain what each tells you.
- Plot residuals versus predictions and describe the pattern you see.
- Add one outlier to the dataset and show how MSE changes compared to MAE.
- Compare a model against DummyRegressor and explain whether the model adds value.
- Apply a log transform to the target, retrain, and check whether residuals improve.
- Explain when you would prefer MAE over MSE for model selection.
Case Study: Regression in Financial Forecasting¶
Financial models use RMSE to evaluate stock price predictions, where large errors are penalized heavily. This helps prioritize models that avoid catastrophic mispredictions in volatile markets.
Expanded Quick Quiz¶
Why use RMSE instead of MSE?
Answer: RMSE is in the same units as the target, making errors more interpretable (e.g., dollars instead of squared dollars).
What does a negative R² mean?
Answer: The model performs worse than simply predicting the mean of the target.
How does MAE differ from MSE?
Answer: MAE is less sensitive to outliers, providing a robust measure of central error tendency.
In the house price scenario, why plot residuals?
Answer: To check for systematic errors (e.g., underpredicting expensive homes), guiding model improvements.
Progress Checkpoint¶
- [ ] Computed multiple regression metrics (MSE, MAE, R², RMSE).
- [ ] Plotted residuals vs. predictions to diagnose patterns.
- [ ] Analyzed outlier impact on metrics.
- [ ] Compared model against dummy regressor.
- [ ] Interpreted results for model selection.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "SVM Margins and Kernels" in the Classical ML track. Share your residual analysis in the academy Discord!
Further Reading¶
- Scikit-Learn Regression Metrics Guide.
- "Interpreting Regression Metrics" tutorials.
- Residual analysis best practices.
Runnable Example¶
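An end-to-end sketch on synthetic house-price data (all values invented): fit a model, compute interpretable metrics on a held-out split, and compare against the dummy baseline from earlier in the topic.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(50, 400, size=(300, 1))            # house size in square meters
y = 1500 * X.ravel() + rng.normal(0, 30_000, 300)  # price with additive noise

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)

for name, est in [("model", model), ("dummy", dummy)]:
    pred = est.predict(X_valid)
    print(f"{name}: MAE={mean_absolute_error(y_valid, pred):,.0f}  "
          f"R²={r2_score(y_valid, pred):.3f}")
```

The model's MAE lands far below the dummy's and its R² is close to 1 on this synthetic data; on real data, a much smaller gap would tell you the features carry little signal.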
Longer Connection¶
Continue with Evaluation Metrics Deep Dive for the classification-side counterpart, and Learning Curves and Bias-Variance for diagnosing whether the model needs more data or more capacity.