Baseline-First Task Solving

What This Is

When a task is open-ended, the fastest good workflow is usually to build one safe baseline, lock the split, and then add only one or two justified improvements. That sounds simple, but the discipline matters because the first decent score is often the only reliable anchor you get before the task starts to drift.

The point is not to be minimal for its own sake. The point is to create a reference that is hard to fool. A baseline tells you whether a later gain is real, whether the split is stable, and whether you should keep spending time on the task.

A good baseline-first workflow does three jobs at once:

  • it gives you an honest floor to compare against
  • it reveals whether the data split is trustworthy
  • it makes later changes easier to explain, because every improvement has a reference point

When You Use It

  • starting a new timed evaluation task
  • deciding what to try before tuning
  • checking whether the problem is imbalanced, leaky, or simply hard
  • comparing a simple linear model against a stronger follow-up before investing more time

Tooling

  • train_test_split for the first honest holdout
  • StratifiedKFold or GroupKFold when the split needs extra protection
  • DummyClassifier as the floor for classification
  • DummyRegressor as the floor for regression
  • make_pipeline or Pipeline so preprocessing and modeling stay together
  • StandardScaler when features are on very different scales
  • cross_validate for a more stable score table
  • classification_report and confusion_matrix for reading failure patterns
  • balanced_accuracy_score when classes are uneven
  • one strong baseline such as LogisticRegression
  • one higher-capacity follow-up model
  • a metric table for ranking candidates
  • a short report that records the selection rule

Split Choice Protocol

Choose the split before you choose the model ladder.

  • use a random stratified split when rows are independent and the deployment task is IID
  • use a grouped split when the same person, case, document, or device can appear more than once
  • use an ordered split when prediction happens later than the training rows
  • use a validation-plus-locked-holdout setup when you expect hidden or private scoring later

The rule is simple: pick the hardest split you can defend operationally, then build the baseline inside that boundary.
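The four split families can be sketched with toy arrays. The data here is placeholder data invented for illustration; only the splitter choices mirror the rules above.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupKFold, TimeSeriesSplit

# Toy data: 12 time-ordered rows, 2 classes, 4 customer groups.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)

# IID rows: random stratified split, frozen with a fixed seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Repeated customers: a grouped split keeps each customer on one side only.
gkf = GroupKFold(n_splits=4)
train_idx, test_idx = next(iter(gkf.split(X, y, groups=groups)))
assert set(groups[train_idx]).isdisjoint(groups[test_idx])

# Ordered prediction: every test row comes after every training row.
tss = TimeSeriesSplit(n_splits=3)
train_idx, test_idx = list(tss.split(X))[-1]
assert train_idx.max() < test_idx.min()
```

The validation-plus-locked-holdout case is the same `train_test_split` call applied once more to the training side, with the first holdout never touched until the end.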

Counterexample:

  • a random split can make a large tree ensemble look better than it is because repeated customer patterns leak across rows
  • the same model can lose its advantage once the split is grouped by customer or ordered by time
  • if the split changes the winner, the lesson is about evaluation design before it is about architecture

Minimal Example

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model_ladder = {
    # Floor: predicts from the class distribution alone, ignoring the features.
    "dummy_prior": DummyClassifier(strategy="prior"),
    # Strong baseline: scaling and model are fitted and judged together.
    "scaled_logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=4000)),
}

That pattern is useful because the preprocessing and the model are evaluated together. A model that needs scaling should not be judged without it, and a baseline should be simple enough that you can explain why it won or lost.

If the task is regression, the same idea becomes a DummyRegressor floor followed by a simple model that you trust before moving to a larger search.
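A minimal sketch of that regression ladder, on synthetic data invented for illustration. Ridge is one reasonable choice for the simple model; plain LinearRegression plays the same role.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy regression data with a real linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

regression_ladder = {
    "dummy_mean": DummyRegressor(strategy="mean"),            # the floor
    "scaled_ridge": make_pipeline(StandardScaler(), Ridge()),  # the simple model
}

for name, model in regression_ladder.items():
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))  # R^2 on the holdout
```

The dummy R^2 sits near zero by construction, so any simple model that cannot clearly beat it is a sign the features carry little signal.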

Worked Pattern

import pandas as pd

rows = [
    {"model": "dummy_prior", "roc_auc": 0.500, "avg_precision": 0.121},
    {"model": "scaled_logistic", "roc_auc": 0.781, "avg_precision": 0.366},
]

leaderboard = pd.DataFrame(rows).sort_values(["roc_auc", "avg_precision"], ascending=False)

Why this works:

  • the dummy model gives you the floor
  • the linear baseline gives you a real signal check
  • the leaderboard tells you which changes are worth keeping
  • the stable split makes it much harder to fool yourself with a lucky run

The minimum artifact to keep after the first pass is:

  • one baseline table with the dummy floor and the strong baseline side by side
  • one confusion view or slice note for the chosen baseline
  • one written decision that says promote, defer, or stop

Common tricks:

  • keep the baseline and the follow-up in the same model ladder so the comparison is explicit
  • if a later model wins only barely, check the weak slice before you promote it
  • if the task is imbalanced, rank by a metric that respects the minority class instead of plain accuracy
  • if the split is small, use cross-validation to confirm that the first win is not a one-off

Applied pattern:

from sklearn.model_selection import cross_validate

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=4000))
scores = cross_validate(pipe, X, y, cv=5, scoring=["roc_auc", "average_precision"])

That pattern is useful when a single split feels too noisy. cross_validate returns per-fold scores for every metric you request, so you can see whether the model is broadly decent or only good on one metric or one lucky fold.

20-Minute Baseline Routine

Use this routine when a new task appears and you do not want to waste the first hour:

  1. freeze one split
  2. run a dummy model
  3. run one strong linear baseline
  4. choose one metric rule
  5. add only one higher-capacity follow-up
  6. stop and read the failure pattern before adding more

That routine is intentionally narrow. It prevents the common mistake of turning uncertainty into uncontrolled tuning.
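The whole routine fits in one short script. This is a sketch on synthetic data; the random forest stands in for "one higher-capacity follow-up" and is an assumed choice, not a prescription.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: freeze one split (imbalanced toy task).
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Steps 2, 3, 5: dummy floor, strong linear baseline, one follow-up.
ladder = {
    "dummy_prior": DummyClassifier(strategy="prior"),
    "scaled_logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=4000)),
    "forest": RandomForestClassifier(random_state=0),  # assumed follow-up choice
}

# Step 4: one metric rule, here ROC AUC on the frozen split.
scores = {}
for name, model in ladder.items():
    model.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(sorted(scores.items(), key=lambda kv: -kv[1]))

# Step 6: read the failure pattern of the current best before adding more.
best = max(scores, key=scores.get)
print(classification_report(y_te, ladder[best].predict(X_te), digits=3))
```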

If you have more time, the next useful step is not many experiments. It is one careful ablation that explains why the first stronger model helped.

For a classification task, a strong first pass often looks like this:

  1. DummyClassifier(strategy="prior")
  2. make_pipeline(StandardScaler(), LogisticRegression(max_iter=4000))
  3. one larger model only after the split and metric look sane

For a regression task, replace the floor with DummyRegressor(strategy="mean") and compare against one simple regression baseline before anything more complex.

Baseline Ladders By Modality

The first ladder should match the data type.

  • tabular: dummy floor -> scaled linear model -> one tree or boosting family
  • text: majority or unigram floor -> unigram logistic -> bigram or stronger representation
  • vision: flattened-pixel or tiny feature baseline -> small CNN -> stronger backbone only if the shift gap survives
  • sequential or time-aware tabular: naive lag or recent-window rule -> linear or tree model on lag features -> higher-capacity follow-up only after the ordered split is stable

The mistake to avoid is importing a model family from another modality before the task has earned it.
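The text ladder, for example, can be written down in a few lines. This is a sketch; CountVectorizer is one simple unigram/bigram representation, and the tiny corpus in the test exists only to show the shape of the ladder.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Text ladder: majority floor -> unigram logistic -> bigram logistic.
text_ladder = {
    "majority": DummyClassifier(strategy="most_frequent"),
    "unigram_logistic": make_pipeline(
        CountVectorizer(), LogisticRegression(max_iter=4000)
    ),
    "bigram_logistic": make_pipeline(
        CountVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=4000)
    ),
}
```

DummyClassifier ignores the input features, so it accepts raw strings directly; the pipelines keep the vectorizer inside the split boundary, exactly like the scaler in the tabular ladder.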

Failure Pattern

Starting with tuning or architecture changes before the baseline exists, then losing track of what actually improved the result.

Another failure is moving the split after seeing the score. That converts debugging into score hunting.

Other failure patterns:

  • using raw accuracy on a strongly imbalanced problem and reading it as progress
  • fitting preprocessing on the full dataset before the split, which leaks information
  • comparing models that were not evaluated on the same split
  • letting a high-variance model be the first thing you trust
  • calling the baseline "bad" before you know what the dummy floor is
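The preprocessing-leak pattern in particular is easy to show. The safe version keeps the scaler inside a pipeline, so cross-validation refits it on each training fold instead of letting it see the held-out rows; the data here is synthetic and only illustrates the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

# Leaky version (do NOT do this): the scaler sees every row before the split.
# X_scaled = StandardScaler().fit_transform(X)
# cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Safe version: the scaler is refit on the training fold inside each CV split.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=4000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```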

Failure checks:

  • if the dummy score is already suspiciously high, inspect class imbalance and label leakage
  • if the linear baseline beats the stronger model on most slices, do not celebrate the higher-capacity model yet
  • if classification_report shows one class collapsing, inspect the confusion matrix instead of the aggregate score
  • if cross-validation scores swing widely, reduce complexity before increasing it

Inspection habit:

from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix

cm = confusion_matrix(y_true, y_pred)
report = classification_report(y_true, y_pred, digits=3)
balanced = balanced_accuracy_score(y_true, y_pred)

The point of these checks is not to generate more numbers. It is to make the first model failure visible before you spend time on a second model.

Questions To Ask After The First Jump

When the first stronger model beats the baseline, ask:

  1. is the gain large enough to matter
  2. did the weak slice improve too, or only the overall metric
  3. is the split still clean
  4. would this gain survive a hidden holdout
  5. should the next step be another model, or a report and stop

Those questions are what turn a baseline into a workflow instead of just a first script.

Extra questions that save time:

  • did scaling actually matter, or was the gain from the model itself
  • did the model improve the rare class, or only the easy majority class
  • do the errors cluster in one group, one time window, or one text pattern
  • is the improvement bigger than the run-to-run noise
  • do you need a better split before you need a better model

Promote, Defer, Or Reject

Use a simple decision rule after the first stronger model runs:

  • promote the change if it beats the baseline on the primary metric, improves or preserves the weak slice, and survives a second sanity check such as CV, a grouped split, or a holdout
  • defer the change if it helps only a little and the gain is smaller than the expected evaluation noise
  • reject the change if it wins only on an easier split, hurts the weak slice, or makes the workflow harder to explain than the score gain justifies

That rule is what keeps a baseline ladder from turning into unprincipled search.
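The rule can be sketched as a small helper. `decide` is a hypothetical function, and the pooled fold-level standard deviation is one simple stand-in for "expected evaluation noise", not a formal test.

```python
import numpy as np

def decide(baseline_scores, candidate_scores, weak_slice_ok=True):
    """Hypothetical promote/defer/reject rule from per-fold CV scores."""
    baseline_scores = np.asarray(baseline_scores)
    candidate_scores = np.asarray(candidate_scores)
    gain = candidate_scores.mean() - baseline_scores.mean()
    # Crude noise estimate: pooled fold-level standard deviation.
    noise = np.sqrt((baseline_scores.std() ** 2 + candidate_scores.std() ** 2) / 2)
    if gain <= 0 or not weak_slice_ok:
        return "reject"   # no real win, or the weak slice got worse
    if gain <= noise:
        return "defer"    # gain smaller than the evaluation noise
    return "promote"

print(decide([0.78, 0.79, 0.77], [0.84, 0.85, 0.83]))    # promote
print(decide([0.78, 0.79, 0.77], [0.785, 0.79, 0.78]))   # defer
```

A real check such as a grouped split or locked holdout should still back the decision; the helper only makes the rule explicit enough to write down in the report.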

Useful Scikit-Learn Patterns

train_test_split is the fastest way to create the first holdout, but it should be frozen before you compare models. For classification, stratify=y is the common protection when you want the class proportions to stay similar across the split.
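A minimal sketch of that frozen, stratified split, on toy labels with a 90/10 imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 90/10 imbalance; stratify=y keeps that ratio on both sides of the split.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(100, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # frozen seed
)
print(y_tr.mean(), y_te.mean())  # both stay at 0.10
```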

DummyClassifier is more than a joke baseline. Its strategy choices let you separate different questions:

  • prior shows the class distribution floor
  • most_frequent shows the majority-class floor
  • stratified shows a random-but-class-aware floor
  • uniform shows a pure random floor
  • constant is useful when you want to test a specific non-majority prediction
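The strategies are easy to compare side by side on an imbalanced toy target; only the features are fake, the accuracies come straight from the class proportions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# 80/20 imbalance; the features are placeholders the dummy model ignores.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))

for strategy in ["prior", "most_frequent", "stratified", "uniform"]:
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    print(strategy, round(clf.score(X, y), 2))

# constant needs the target value spelled out:
clf = DummyClassifier(strategy="constant", constant=1).fit(X, y)
print("constant=1", clf.score(X, y))  # 0.2: always predicting the minority class
```

On this target, prior and most_frequent both score 0.80, stratified lands near 0.68, uniform near 0.50, which is exactly the spread of floors the strategy choices are meant to expose.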

Pipeline and make_pipeline keep preprocessing attached to the estimator, which is the cleanest way to avoid leakage. If scaling, encoding, or feature selection happens before the split, the comparison is no longer honest.

cross_validate is a better fit than a single score when you want both stability and visibility. It is especially useful if one split feels too small, too noisy, or too dependent on the random seed.

confusion_matrix and classification_report turn a metric into an explanation. They help you see whether the model is failing on one class, one slice, or one kind of mistake.

balanced_accuracy_score is a strong default when the classes are uneven because it stops a majority-class shortcut from looking better than it is.

learning_curve is useful when the first baseline seems weak but you do not know whether you need more data, a different model, or a better feature set. If the training and validation curves are both poor, the issue is usually underfitting or weak features. If training is strong but validation is not, the issue is usually generalization.
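A minimal learning_curve sketch on synthetic data; reading the two mean curves against each other is the whole diagnostic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, random_state=0)

# Score the model at growing training-set sizes, cross-validated.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=4000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    shuffle=True, random_state=0,
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(n, round(tr, 3), round(va, 3))
# Both curves low        -> underfitting or weak features.
# Train high, val low    -> generalization gap: more data or regularization.
```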

Library Notes

  • DummyClassifier is useful even when it looks silly because it exposes the floor immediately.
  • DummyRegressor plays the same role for regression tasks.
  • LogisticRegression is often the best first strong baseline because it is fast, stable, and easy to interpret.
  • StandardScaler matters whenever feature scales differ a lot.
  • Pipeline and make_pipeline protect you from leakage when preprocessing must happen before the model.
  • cross_validate is helpful when you need a more stable baseline decision than one split can give.
  • train_test_split is fine for the first pass, but only if you freeze the split before ranking models.

Practice

  1. Run a dummy baseline and one strong baseline on the same split.
  2. Add one follow-up model only if the baseline is already stable.
  3. State one clear metric rule for selecting the winner.
  4. Write down the one failure pattern you expect to inspect before trying a second improvement.
  5. Explain what would make you stop after the first jump instead of chasing a smaller gain.
  6. Name one sign that your follow-up model is only learning noise.
  7. Say which metric would be your primary selection rule and why.
  8. Show how you would check whether the rare class got better.
  9. Explain when cross_validate would be safer than a single split.
  10. Describe one preprocessing step that must stay inside a pipeline.

Runnable Example

Open the baseline-first timed-task example in AI Academy and run it from the platform.

Questions To Ask

  1. What is the honest floor?
  2. Which feature family would you trust first?
  3. Which slice is most likely to break the model?
  4. What would count as a real improvement versus a tiny metric fluctuation?
  5. If the second model is slightly better, is it better enough to justify its complexity?
  6. Which metric would you trust if the classes are imbalanced?
  7. What would the confusion matrix need to show before you call the baseline acceptable?

Longer Connection