Hyperparameter Tuning

Scenario: Optimizing a Fraud Detection Model

You're building a fraud detection system for online transactions. Your random forest model has decent accuracy but flags too many false positives. Use hyperparameter tuning to find the best balance of depth and regularization, ensuring the model generalizes well without overfitting to the training fraud patterns.

Learning Objectives

By the end of this module (40-50 minutes), you should be able to:

  • Set up honest hyperparameter tuning with cross-validation and pipelines.
  • Choose between grid and randomized search based on search space size.
  • Interpret tuning results and select robust parameters.
  • Apply the one-standard-error rule for simpler models.
  • Diagnose when tuning isn't the bottleneck (e.g., feature issues).

Prerequisites: Cross-validation basics; scikit-learn pipelines. Difficulty: Intermediate-Advanced.

What This Is

Hyperparameter tuning is a controlled search over model settings without leaking information across the validation boundary.

The deeper point is that tuning is not about finding the most extreme settings. It is about finding the smallest change that gives a repeatable improvement.

When You Use It

  • comparing a few candidate settings honestly
  • tuning regularization, tree depth, or similar controls
  • improving a baseline without changing the whole workflow

Tooling

  • Pipeline
  • GridSearchCV
  • RandomizedSearchCV
  • validation_curve
  • ParameterGrid
  • HalvingGridSearchCV
  • HalvingRandomSearchCV
  • StandardScaler

Library Notes

  • Pipeline keeps preprocessing tied to the model so each fold stays honest.
  • GridSearchCV is best when the search space is small and you want to inspect every candidate.
  • RandomizedSearchCV is better when the space is larger or you want a fast first pass.
  • validation_curve is useful when you want to inspect one parameter at a time instead of tuning several knobs at once.
  • ParameterGrid helps you reason about the search space before the run starts.
  • HalvingGridSearchCV and HalvingRandomSearchCV spend fewer resources on weak candidates and are useful when the full search would be too expensive.
  • StandardScaler should usually live inside the pipeline for linear and distance-based models.

What To Tune First

Start with the parameters that control capacity:

  • C for linear and margin-based models
  • max_depth, min_samples_leaf, or similar controls for tree models
  • regularization or shrinkage knobs before secondary preprocessing choices

If the first pass is inconclusive, add one interacting parameter only after you can explain why it belongs in the search.

Honest Tuning Protocol

Treat tuning as a controlled decision process:

  1. lock the split or CV design first
  2. choose one primary metric
  3. define a budget for candidates, not an unlimited search
  4. search only inside the training boundary
  5. compare the tuned winner against the untuned baseline
  6. evaluate once on the locked holdout after selection

If the tuned model cannot beat the untuned baseline honestly, the lesson is often about the representation or the split, not the grid size.
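The six steps above can be sketched end to end. This is a minimal illustration on toy data: the dataset, the grid, and the metric are stand-ins to swap for your own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, cross_val_score, train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy stand-in data; imbalanced to mimic a fraud-style queue
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

# 1. lock the split and the CV design first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# 2-4. one metric, a small explicit budget, search only on training data
search = GridSearchCV(
    pipeline,
    {"model__C": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="average_precision",
)
search.fit(X_train, y_train)

# 5. compare the tuned winner against the untuned baseline
baseline = cross_val_score(
    pipeline, X_train, y_train, cv=cv, scoring="average_precision"
).mean()
print(f"baseline {baseline:.3f} vs tuned {search.best_score_:.3f}")

# 6. evaluate once on the locked holdout after selection
holdout = average_precision_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"holdout {holdout:.3f}")
```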

Minimal Example

from sklearn.model_selection import GridSearchCV

# "model" is any estimator and "cv" any splitter you have already locked in
search = GridSearchCV(model, {"C": [0.1, 1.0, 10.0]}, cv=cv, scoring="roc_auc")

Worked Pattern

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# "cv" is your locked splitter; X_train, y_train stay inside the training boundary
search = GridSearchCV(
    pipeline,
    {"model__C": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="average_precision",
    return_train_score=True,
)
search.fit(X_train, y_train)

The important part is not the exact grid. It is that preprocessing stays inside the pipeline and the search happens only inside the training boundary.

What To Read After Fitting

Read these outputs before you celebrate a winner:

  • best_params_
  • best_score_
  • best_estimator_
  • cv_results_

cv_results_ matters because it shows the whole candidate table, not just the winner. That makes it easier to see whether the gain is broad or just a one-point spike.
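One convenient way to read the candidate table is to load `cv_results_` into a DataFrame. A self-contained sketch (the column names follow scikit-learn's `cv_results_` keys; the toy data and grid are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    cv=5,
    return_train_score=True,
).fit(X, y)

# the full candidate table, one row per setting
table = pd.DataFrame(search.cv_results_)
cols = ["param_C", "mean_test_score", "std_test_score", "mean_train_score"]
print(table[cols].sort_values("mean_test_score", ascending=False))
```

Scanning `std_test_score` next to `mean_test_score` is what separates a broad gain from a one-point spike.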

One-Parameter Check

from sklearn.model_selection import validation_curve

train_scores, valid_scores = validation_curve(
    pipeline,
    X_train,
    y_train,
    param_name="model__C",
    param_range=[0.01, 0.1, 1.0, 10.0],
    cv=cv,
    scoring="average_precision",
)

Use this when you want to answer one question first:

  • is the model under-regularized
  • is the model over-regularized
  • is the gain broad enough to matter

If the curve is flat, more tuning may not be the right next move.
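A self-contained version of that check, with the fold-averaging that answers "is the curve flat?" made explicit (toy data; the spread threshold that counts as "flat" is a judgment call for your task):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=400, random_state=0)
param_range = [0.01, 0.1, 1.0, 10.0]
train_scores, valid_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, y,
    param_name="C", param_range=param_range, cv=5,
)

# average over folds; one row per candidate value of C
valid_means = valid_scores.mean(axis=1)
for c, m in zip(param_range, valid_means):
    print(f"C={c:<5} mean validation score {m:.3f}")

# a flat curve shows up as a tiny spread between best and worst mean
print("spread:", valid_means.max() - valid_means.min())
```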

Search Helper

from sklearn.model_selection import ParameterGrid

grid = {"model__C": [0.1, 1.0], "model__penalty": ["l2"]}
print(len(ParameterGrid(grid)))  # candidate count before the run
list(ParameterGrid(grid))        # the candidates themselves

This is useful when you want to sanity-check the search size before spending time on the run.

Search Space Design Under Budget

A good search space is narrow enough to teach you something.

  • use log-scale sweeps for regularization and learning-rate style parameters
  • tune the one or two capacity controls most likely to matter before secondary knobs
  • use coarse-to-fine search instead of a huge first grid
  • keep a clear reason for every parameter in the search

Bad search spaces usually share one symptom: the candidate table is large, but none of the choices would be easy to defend to a teammate.
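A log-scale, coarse-to-fine sweep can be written in two lines. The exact ranges below are illustrative, not recommendations:

```python
import numpy as np

# coarse pass: wide, log-spaced candidates for a regularization knob
coarse = np.logspace(-3, 3, 7)   # 0.001, 0.01, ..., 1000
print(coarse)

# fine pass: zoom in around the coarse winner (say C ~ 1.0) with tighter spacing
fine = np.logspace(-1, 1, 5)     # 0.1 ... 10
print(fine)
```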

One-Standard-Error Rule

If several candidates are close, prefer the simplest candidate whose score is within one standard error of the best mean.

Practical version:

# "table" is a DataFrame of candidates with mean_score and sem_score columns
best_mean = table["mean_score"].max()
best_sem = table.loc[table["mean_score"].idxmax(), "sem_score"]
safe = table[table["mean_score"] >= best_mean - best_sem]

Then choose the simplest row inside safe, not automatically the row with the very top mean. This protects you from over-reading small tuning differences.
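The candidate table can be built directly from `cv_results_`. In this sketch the standard error is approximated as the fold standard deviation divided by the square root of the fold count; fold scores are not fully independent, so treat it as a rough guide rather than a formal test:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, random_state=0)
n_splits = 5
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=n_splits,
).fit(X, y)

table = pd.DataFrame({
    "C": search.cv_results_["param_C"],
    "mean_score": search.cv_results_["mean_test_score"],
    "sem_score": search.cv_results_["std_test_score"] / np.sqrt(n_splits),
})

best_mean = table["mean_score"].max()
best_sem = table.loc[table["mean_score"].idxmax(), "sem_score"]
safe = table[table["mean_score"] >= best_mean - best_sem]

# simplest candidate in the safe set: here, the smallest C (strongest regularization)
print(safe.sort_values("C").iloc[0])
```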

Split And Scoring Must Match The Task

Search quality depends on the scorer and the splitter:

  • imbalanced queue: optimize average_precision or a threshold-aware metric, not plain accuracy
  • grouped data: use GroupKFold or StratifiedGroupKFold
  • time-aware data: use an ordered splitter, not shuffled CV

If the search uses the wrong split or the wrong metric, the best parameters are only best for the wrong problem.
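For the grouped case, the splitter and the group labels both go to the search. A minimal sketch with made-up groups (e.g. 40 customers with 10 rows each):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

X, y = make_classification(n_samples=400, random_state=0)
groups = np.repeat(np.arange(40), 10)  # hypothetical group ids, one per row

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    cv=GroupKFold(n_splits=5),  # no group ever straddles train and validation
    scoring="average_precision",
)
search.fit(X, y, groups=groups)  # groups flow through fit to the splitter
print(search.best_params_)
```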

What To Watch For

  • a grid that is so wide it becomes hard to interpret
  • a best setting that barely beats the default
  • a tuning run that changes the validation story only by chance
  • a pipeline that accidentally leaks preprocessing information
  • a large train-validation gap hidden behind one average score
  • a search that takes longer to explain than the gain is worth

The important signal is not "did the score move?" It is "did the score move in a way I can defend?"

Halving Search

Use halving search when:

  • the grid is large enough that full search is expensive
  • you want to eliminate weak candidates early
  • you can accept a more aggressive search strategy

Use it carefully:

  • keep the split fixed
  • compare it against a smaller ordinary search first
  • check whether the winner is stable enough to justify the shortcut
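A minimal halving sketch on toy data. Note the experimental enable import, which scikit-learn still requires before the halving estimators can be imported; the grid and `factor` here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import HalvingGridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=600, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keep the split fixed

search = HalvingGridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=cv,
    factor=3,        # keep roughly the top third of candidates each round
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Comparing this winner against a small ordinary grid on the same `cv` is the stability check the list above asks for.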

What To Try

  • tune C for logistic regression
  • tune max_depth or min_samples_leaf for a tree model
  • compare a small grid with a randomized search on the same metric
  • inspect one parameter with validation_curve before tuning two at once
  • use ParameterGrid to reason about the search space before the run
  • try halving search only when the full search would be too slow

Failure Pattern

Scaling or imputing on the full dataset before the search begins. Preprocessing must stay inside the pipeline so each fold is treated honestly.

Another failure pattern is making the grid too wide. A search that is too big becomes a time sink and often rewards luck more than understanding.

Another failure pattern is tuning several knobs at once before you know which one matters. If you cannot explain why a parameter belongs in the search, it probably should not be there yet.

Another failure pattern is trusting the best score without checking the spread, the training score, and the candidate table.

Another common counterexample is a wide search where one extreme candidate wins by 0.002 on validation but loses on the weakest data slice, inflates the training score, or falls outside the one-standard-error safety zone. That is not a robust win.

Inspection Habits

  • compare the best score with the baseline score, not just the neighboring candidates
  • check whether the train score rises much faster than the validation score
  • inspect whether one parameter dominates the result
  • prefer the smallest setting that gives a repeatable gain
  • read the whole candidate table before announcing a winner

If a smaller setting is nearly as good as the best one, the smaller setting is often the more defensible choice.

Practice

  1. Tune one hyperparameter grid for logistic regression.
  2. Tune one small tree-based grid.
  3. Explain why the search happens only inside the training boundary.
  4. Name one setting you would not tune on the first pass.
  5. Explain what a small but consistent gain means compared with a one-off large jump.
  6. Describe how you would decide whether RandomizedSearchCV is enough.
  7. State what you would lock before the second tuning pass.
  8. Explain when a default setting is already good enough.
  9. Use validation_curve to decide whether one parameter is worth tuning further.
  10. Explain what best_estimator_ and cv_results_ each tell you after a search.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

Run the same idea in the browser:

Inspect the best parameter choice and the validation metrics after the search finishes.

Common Tricks

When tuning a linear model, start with regularization before branching into more exotic preprocessing. In many tabular tasks, a strong C sweep gets most of the useful signal quickly.

For tree models, a small structural sweep often gives more insight than trying to tune every available knob at once. The point is to learn what matters, not to search the entire model space.

If the model family is already strong with defaults, a short RandomizedSearchCV pass can be better than a giant grid. The goal is a defendable improvement, not the biggest possible search.

If the validation curve is flat, stop tuning that parameter and inspect the feature representation or the split instead.
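The short randomized pass mentioned above can look like this. The range and budget are illustrative; `loguniform` samples C on a log scale, matching the log-sweep advice earlier:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=10,                     # small, explicit candidate budget
    cv=5,
    scoring="average_precision",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```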

Questions To Ask

  1. Which parameter is most likely to change the result in a meaningful way?
  2. Is the grid small enough to inspect after the run?
  3. Does the training score suggest overfitting at the winning point?
  4. Would a validation curve give you a clearer answer than a full grid search?
  5. Is the next gain likely to come from the model family, the features, or the split?
  6. Would a smaller model with a similar score be easier to defend?

Case Study: Tuning in Production ML

In production systems like recommendation engines, hyperparameter tuning ensures models perform reliably under varying data distributions. Companies like Netflix use careful tuning to avoid overfitting to historical patterns, maintaining accuracy as user behavior evolves.

Expanded Quick Quiz

Why must preprocessing stay inside the pipeline during tuning?

Answer: To prevent data leakage; each CV fold should be processed independently to simulate unseen data.

When should you use RandomizedSearchCV over GridSearchCV?

Answer: When the search space is large; randomized sampling is more efficient for exploring many parameters.

What does the one-standard-error rule help with?

Answer: It encourages simpler models by choosing settings within one standard error of the best score, avoiding overfitting to noise.

In the fraud detection scenario, why tune hyperparameters?

Answer: To optimize the model's sensitivity-specificity trade-off, reducing false positives while maintaining fraud detection accuracy.

Progress Checkpoint

  • [ ] Set up a GridSearchCV with a pipeline and tuned one parameter.
  • [ ] Analyzed cv_results_ to understand parameter effects.
  • [ ] Applied the one-standard-error rule to select a simpler model.
  • [ ] Answered quiz questions without peeking.

Milestone: Complete this to unlock "Model Evaluation and Metrics" in the Classical ML track. Share your tuning results in the academy Discord!

Further Reading

  • Scikit-Learn Model Selection docs.
  • "Practical Bayesian Optimization of Machine Learning Algorithms" for advanced tuning.
  • Blog posts on hyperparameter tuning best practices.

Longer Connection

Continue with scikit-learn Validation and Tuning for a fuller tuning and calibration workflow.