Hyperparameter Tuning

Scenario: Optimizing a Fraud Detection Model

You're building a fraud detection system for online transactions. Your random forest model has decent accuracy but flags too many false positives. Use hyperparameter tuning to find the best balance of depth and regularization, ensuring the model generalizes well without overfitting to the training fraud patterns.

Learning Objectives

By the end of this module (40-50 minutes), you should be able to:

  • Set up honest hyperparameter tuning with cross-validation and pipelines.
  • Choose between grid and randomized search based on search space size.
  • Interpret tuning results and select robust parameters.
  • Apply the one-standard-error rule for simpler models.
  • Diagnose when tuning isn't the bottleneck (e.g., feature issues).

Prerequisites: Cross-validation basics; scikit-learn pipelines. Difficulty: Intermediate-Advanced.

What This Is

Hyperparameter tuning is a controlled search over model settings without leaking information across the validation boundary.

The deeper point is that tuning is not about finding the most extreme settings. It is about finding the smallest change that gives a repeatable improvement.

When You Use It

  • comparing a few candidate settings honestly
  • tuning regularization, tree depth, or similar controls
  • improving a baseline without changing the whole workflow

Tooling

  • Pipeline
  • GridSearchCV
  • RandomizedSearchCV
  • validation_curve
  • ParameterGrid
  • HalvingGridSearchCV
  • HalvingRandomSearchCV
  • StandardScaler

Library Notes

  • Pipeline keeps preprocessing tied to the model so each fold stays honest.
  • GridSearchCV is best when the search space is small and you want to inspect every candidate.
  • RandomizedSearchCV is better when the space is larger or you want a fast first pass.
  • validation_curve is useful when you want to inspect one parameter at a time instead of tuning several knobs at once.
  • ParameterGrid helps you reason about the search space before the run starts.
  • HalvingGridSearchCV and HalvingRandomSearchCV spend fewer resources on weak candidates and are useful when the full search would be too expensive.
  • StandardScaler should usually live inside the pipeline for linear and distance-based models.

What To Tune First

Start with the parameters that control capacity:

  • C for linear and margin-based models
  • max_depth, min_samples_leaf, or similar controls for tree models
  • regularization or shrinkage knobs before secondary preprocessing choices

If the first pass is inconclusive, add one interacting parameter only after you can explain why it belongs in the search.

Honest Tuning Protocol

Treat tuning as a controlled decision process:

  1. lock the split or CV design first
  2. choose one primary metric
  3. define a budget for candidates, not an unlimited search
  4. search only inside the training boundary
  5. compare the tuned winner against the untuned baseline
  6. evaluate once on the locked holdout after selection

If the tuned model cannot beat the untuned baseline honestly, the lesson is often about the representation or the split, not the grid size.
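The six steps above can be sketched end to end. This is a minimal illustration on toy data: the dataset, the grid, and the metric are stand-ins to swap for your own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, cross_val_score, train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy stand-in data; imbalanced to mimic a fraud-style queue
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

# 1. lock the split and the CV design first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# 2-4. one metric, a small explicit budget, search only on training data
search = GridSearchCV(
    pipeline,
    {"model__C": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="average_precision",
)
search.fit(X_train, y_train)

# 5. compare the tuned winner against the untuned baseline
baseline = cross_val_score(
    pipeline, X_train, y_train, cv=cv, scoring="average_precision"
).mean()
print(f"baseline {baseline:.3f} vs tuned {search.best_score_:.3f}")

# 6. evaluate once on the locked holdout after selection
holdout = average_precision_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"holdout {holdout:.3f}")
```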

Minimal Example

from sklearn.model_selection import GridSearchCV

# "model" is any estimator and "cv" any splitter you have already locked in
search = GridSearchCV(model, {"C": [0.1, 1.0, 10.0]}, cv=cv, scoring="roc_auc")

Worked Pattern

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# "cv" is your locked splitter; X_train, y_train stay inside the training boundary
search = GridSearchCV(
    pipeline,
    {"model__C": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="average_precision",
    return_train_score=True,
)
search.fit(X_train, y_train)

The important part is not the exact grid. It is that preprocessing stays inside the pipeline and the search happens only inside the training boundary.

What To Read After Fitting

Read these outputs before you celebrate a winner:

  • best_params_
  • best_score_
  • best_estimator_
  • cv_results_

cv_results_ matters because it shows the whole candidate table, not just the winner. That makes it easier to see whether the gain is broad or just a one-point spike.
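One convenient way to read the candidate table is to load `cv_results_` into a DataFrame. A self-contained sketch (the column names follow scikit-learn's `cv_results_` keys; the toy data and grid are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    cv=5,
    return_train_score=True,
).fit(X, y)

# the full candidate table, one row per setting
table = pd.DataFrame(search.cv_results_)
cols = ["param_C", "mean_test_score", "std_test_score", "mean_train_score"]
print(table[cols].sort_values("mean_test_score", ascending=False))
```

Scanning `std_test_score` next to `mean_test_score` is what separates a broad gain from a one-point spike.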

One-Parameter Check

from sklearn.model_selection import validation_curve

train_scores, valid_scores = validation_curve(
    pipeline,
    X_train,
    y_train,
    param_name="model__C",
    param_range=[0.01, 0.1, 1.0, 10.0],
    cv=cv,
    scoring="average_precision",
)

Use this when you want to answer one question first:

  • is the model under-regularized
  • is the model over-regularized
  • is the gain broad enough to matter

If the curve is flat, more tuning may not be the right next move.
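A self-contained version of that check, with the fold-averaging that answers "is the curve flat?" made explicit (toy data; the spread threshold that counts as "flat" is a judgment call for your task):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=400, random_state=0)
param_range = [0.01, 0.1, 1.0, 10.0]
train_scores, valid_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, y,
    param_name="C", param_range=param_range, cv=5,
)

# average over folds; one row per candidate value of C
valid_means = valid_scores.mean(axis=1)
for c, m in zip(param_range, valid_means):
    print(f"C={c:<5} mean validation score {m:.3f}")

# a flat curve shows up as a tiny spread between best and worst mean
print("spread:", valid_means.max() - valid_means.min())
```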

Search Helper

from sklearn.model_selection import ParameterGrid

grid = {"model__C": [0.1, 1.0], "model__penalty": ["l2"]}
print(len(ParameterGrid(grid)))  # candidate count before the run
list(ParameterGrid(grid))        # the candidates themselves

This is useful when you want to sanity-check the search size before spending time on the run.

Search Space Design Under Budget

A good search space is narrow enough to teach you something.

  • use log-scale sweeps for regularization and learning-rate style parameters
  • tune the one or two capacity controls most likely to matter before secondary knobs
  • use coarse-to-fine search instead of a huge first grid
  • keep a clear reason for every parameter in the search

Bad search spaces usually share one symptom: the candidate table is large, but none of the choices would be easy to defend to a teammate.
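A log-scale, coarse-to-fine sweep can be written in two lines. The exact ranges below are illustrative, not recommendations:

```python
import numpy as np

# coarse pass: wide, log-spaced candidates for a regularization knob
coarse = np.logspace(-3, 3, 7)   # 0.001, 0.01, ..., 1000
print(coarse)

# fine pass: zoom in around the coarse winner (say C ~ 1.0) with tighter spacing
fine = np.logspace(-1, 1, 5)     # 0.1 ... 10
print(fine)
```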

One-Standard-Error Rule

If several candidates are close, prefer the simplest candidate whose score is within one standard error of the best mean.

Practical version:

# "table" is a DataFrame of candidates with mean_score and sem_score columns
best_mean = table["mean_score"].max()
best_sem = table.loc[table["mean_score"].idxmax(), "sem_score"]
safe = table[table["mean_score"] >= best_mean - best_sem]

Then choose the simplest row inside safe, not automatically the row with the very top mean. This protects you from over-reading small tuning differences.
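The candidate table can be built directly from `cv_results_`. In this sketch the standard error is approximated as the fold standard deviation divided by the square root of the fold count; fold scores are not fully independent, so treat it as a rough guide rather than a formal test:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, random_state=0)
n_splits = 5
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=n_splits,
).fit(X, y)

table = pd.DataFrame({
    "C": search.cv_results_["param_C"],
    "mean_score": search.cv_results_["mean_test_score"],
    "sem_score": search.cv_results_["std_test_score"] / np.sqrt(n_splits),
})

best_mean = table["mean_score"].max()
best_sem = table.loc[table["mean_score"].idxmax(), "sem_score"]
safe = table[table["mean_score"] >= best_mean - best_sem]

# simplest candidate in the safe set: here, the smallest C (strongest regularization)
print(safe.sort_values("C").iloc[0])
```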

Split And Scoring Must Match The Task

Search quality depends on the scorer and the splitter:

  • imbalanced queue: optimize average_precision or a threshold-aware metric, not plain accuracy
  • grouped data: use GroupKFold or StratifiedGroupKFold
  • time-aware data: use an ordered splitter, not shuffled CV

If the search uses the wrong split or the wrong metric, the best parameters are only best for the wrong problem.
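For the grouped case, the splitter and the group labels both go to the search. A minimal sketch with made-up groups (e.g. 40 customers with 10 rows each):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

X, y = make_classification(n_samples=400, random_state=0)
groups = np.repeat(np.arange(40), 10)  # hypothetical group ids, one per row

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    cv=GroupKFold(n_splits=5),  # no group ever straddles train and validation
    scoring="average_precision",
)
search.fit(X, y, groups=groups)  # groups flow through fit to the splitter
print(search.best_params_)
```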

What To Watch For

  • a grid that is so wide it becomes hard to interpret
  • a best setting that barely beats the default
  • a tuning run that changes the validation story only by chance
  • a pipeline that accidentally leaks preprocessing information
  • a large train-validation gap hidden behind one average score
  • a search that takes longer to explain than the gain is worth

The important signal is not "did the score move?" It is "did the score move in a way I can defend?"

Halving Search

Use halving search when:

  • the grid is large enough that full search is expensive
  • you want to eliminate weak candidates early
  • you can accept a more aggressive search strategy

Use it carefully:

  • keep the split fixed
  • compare it against a smaller ordinary search first
  • check whether the winner is stable enough to justify the shortcut
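A minimal halving sketch on toy data. Note the experimental enable import, which scikit-learn still requires before the halving estimators can be imported; the grid and `factor` here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import HalvingGridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=600, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keep the split fixed

search = HalvingGridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=cv,
    factor=3,        # keep roughly the top third of candidates each round
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Comparing this winner against a small ordinary grid on the same `cv` is the stability check the list above asks for.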

What To Try

  • tune C for logistic regression
  • tune max_depth or min_samples_leaf for a tree model
  • compare a small grid with a randomized search on the same metric
  • inspect one parameter with validation_curve before tuning two at once
  • use ParameterGrid to reason about the search space before the run
  • try halving search only when the full search would be too slow

Failure Pattern

Scaling or imputing on the full dataset before the search begins. Preprocessing must stay inside the pipeline so each fold is treated honestly.

Another failure pattern is making the grid too wide. A search that is too big becomes a time sink and often rewards luck more than understanding.

Another failure pattern is tuning several knobs at once before you know which one matters. If you cannot explain why a parameter belongs in the search, it probably should not be there yet.

Another failure pattern is trusting the best score without checking the spread, the training score, and the candidate table.

Another common counterexample is a wide search where one extreme candidate wins by 0.002 on validation but loses on the weakest data slice, inflates the training score, or falls outside the one-standard-error safety zone. That is not a robust win.

Inspection Habits

  • compare the best score with the baseline score, not just the neighboring candidates
  • check whether the train score rises much faster than the validation score
  • inspect whether one parameter dominates the result
  • prefer the smallest setting that gives a repeatable gain
  • read the whole candidate table before announcing a winner

If a smaller setting is nearly as good as the best one, the smaller setting is often the more defensible choice.

Practice

  1. Tune one hyperparameter grid for logistic regression.
  2. Tune one small tree-based grid.
  3. Explain why the search happens only inside the training boundary.
  4. Name one setting you would not tune on the first pass.
  5. Explain what a small but consistent gain means compared with a one-off large jump.
  6. Describe how you would decide whether RandomizedSearchCV is enough.
  7. State what you would lock before the second tuning pass.
  8. Explain when a default setting is already good enough.
  9. Use validation_curve to decide whether one parameter is worth tuning further.
  10. Explain what best_estimator_ and cv_results_ each tell you after a search.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

Run the same idea in the browser:

Inspect the best parameter choice and the validation metrics after the search finishes.

Common Tricks

When tuning a linear model, start with regularization before branching into more exotic preprocessing. In many tabular tasks, a strong C sweep gets most of the useful signal quickly.

For tree models, a small structural sweep often gives more insight than trying to tune every available knob at once. The point is to learn what matters, not to search the entire model space.

If the model family is already strong with defaults, a short RandomizedSearchCV pass can be better than a giant grid. The goal is a defendable improvement, not the biggest possible search.

If the validation curve is flat, stop tuning that parameter and inspect the feature representation or the split instead.
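The short randomized pass mentioned above can look like this. The range and budget are illustrative; `loguniform` samples C on a log scale, matching the log-sweep advice earlier:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=10,                     # small, explicit candidate budget
    cv=5,
    scoring="average_precision",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```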

Questions To Ask

  1. Which parameter is most likely to change the result in a meaningful way?
  2. Is the grid small enough to inspect after the run?
  3. Does the training score suggest overfitting at the winning point?
  4. Would a validation curve give you a clearer answer than a full grid search?
  5. Is the next gain likely to come from the model family, the features, or the split?
  6. Would a smaller model with a similar score be easier to defend?

Case Study: Tuning in Production ML

In production systems like recommendation engines, hyperparameter tuning ensures models perform reliably under varying data distributions. Companies like Netflix use careful tuning to avoid overfitting to historical patterns, maintaining accuracy as user behavior evolves.

Expanded Quick Quiz

Why must preprocessing stay inside the pipeline during tuning?

Answer: To prevent data leakage; each CV fold should be processed independently to simulate unseen data.

When should you use RandomizedSearchCV over GridSearchCV?

Answer: When the search space is large; randomized sampling is more efficient for exploring many parameters.

What does the one-standard-error rule help with?

Answer: It encourages simpler models by choosing settings within one standard error of the best score, avoiding overfitting to noise.

In the fraud detection scenario, why tune hyperparameters?

Answer: To optimize the model's sensitivity-specificity trade-off, reducing false positives while maintaining fraud detection accuracy.

Progress Checkpoint

  • [ ] Set up a GridSearchCV with a pipeline and tuned one parameter.
  • [ ] Analyzed cv_results_ to understand parameter effects.
  • [ ] Applied the one-standard-error rule to select a simpler model.
  • [ ] Answered quiz questions without peeking.

Milestone: Complete this to unlock "Model Evaluation and Metrics" in the Classical ML track. Share your tuning results in the academy Discord!

Further Reading

  • Scikit-Learn Model Selection docs.
  • "Practical Bayesian Optimization of Machine Learning Algorithms" for advanced tuning.
  • Blog posts on hyperparameter tuning best practices.

Longer Connection

Continue with scikit-learn Validation and Tuning for a fuller tuning and calibration workflow.