Baseline-First Task Solving¶
What This Is¶
When a task is open-ended, the fastest good workflow is usually to build one safe baseline, lock the split, and then add only one or two justified improvements. That sounds simple, but the discipline matters because the first decent score is often the only reliable anchor you get before the task starts to drift.
The point is not to be minimal for its own sake. The point is to create a reference that is hard to fool. A baseline tells you whether a later gain is real, whether the split is stable, and whether you should keep spending time on the task.
A good baseline-first workflow does three jobs at once:
- it gives you an honest floor to compare against
- it reveals whether the data split is trustworthy
- it makes later changes easier to explain, because every improvement has a reference point
When You Use It¶
- starting a new timed evaluation task
- deciding what to try before tuning
- checking whether the problem is imbalanced, leaky, or simply hard
- comparing a simple linear model against a stronger follow-up before investing more time
Tooling¶
- `train_test_split` for the first honest holdout
- `StratifiedKFold` or `GroupKFold` when the split needs extra protection
- `DummyClassifier` as the floor for classification
- `DummyRegressor` as the floor for regression
- `make_pipeline` or `Pipeline` so preprocessing and modeling stay together
- `StandardScaler` when features are on very different scales
- `cross_validate` for a more stable score table
- `classification_report` and `confusion_matrix` for reading failure patterns
- `balanced_accuracy_score` when classes are uneven
- one strong baseline such as `LogisticRegression`
- one higher-capacity follow-up model
- a metric table for ranking candidates
- a short report that records the selection rule
Split Choice Protocol¶
Choose the split before you choose the model ladder.
- use a random stratified split when rows are independent and the deployment task is IID
- use a grouped split when the same person, case, document, or device can appear more than once
- use an ordered split when prediction happens later than the training rows
- use a validation-plus-locked-holdout setup when you expect hidden or private scoring later
The rule is simple: pick the hardest split you can defend operationally, then build the baseline inside that boundary.
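As a sketch of the grouped option, assuming a hypothetical dataset where each customer contributes several rows, `GroupKFold` keeps every group on exactly one side of the boundary:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
# Hypothetical data: 100 rows belonging to 20 customers, 5 rows each.
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)
groups = np.repeat(np.arange(20), 5)

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # No customer appears on both sides of the split boundary.
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```

If a model's score drops noticeably when you swap a random split for this grouped one, that drop is the leakage the counterexample below describes.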
Counterexample:
- a random split can make a large tree ensemble look better than it is because repeated customer patterns leak across rows
- the same model can lose its advantage once the split is grouped by customer or ordered by time
- if the split changes the winner, the lesson is about evaluation design before it is about architecture
Minimal Example¶
```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model_ladder = {
    "dummy_prior": DummyClassifier(strategy="prior"),
    "scaled_logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=4000)),
}
```
That pattern is useful because the preprocessing and the model are evaluated together. A model that needs scaling should not be judged without it, and a baseline should be simple enough that you can explain why it won or lost.
If the task is regression, the same idea becomes a DummyRegressor floor followed by a simple model that you trust before moving to a larger search.
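A minimal sketch of that regression version, assuming `Ridge` as an illustrative simple model rather than a prescription:

```python
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same ladder shape for regression: a mean floor, then one simple model.
regression_ladder = {
    "dummy_mean": DummyRegressor(strategy="mean"),
    "scaled_ridge": make_pipeline(StandardScaler(), Ridge()),
}
```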
Worked Pattern¶
```python
import pandas as pd

rows = [
    {"model": "dummy_prior", "roc_auc": 0.500, "avg_precision": 0.121},
    {"model": "scaled_logistic", "roc_auc": 0.781, "avg_precision": 0.366},
]
leaderboard = pd.DataFrame(rows).sort_values(["roc_auc", "avg_precision"], ascending=False)
```
Why this works:
- the dummy model gives you the floor
- the linear baseline gives you a real signal check
- the leaderboard tells you which changes are worth keeping
- the stable split makes it much harder to fool yourself with a lucky run
The minimum artifact to keep after the first pass is:
- one baseline table with the dummy floor and the strong baseline side by side
- one confusion view or slice note for the chosen baseline
- one written decision that says promote, defer, or stop
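One lightweight way to keep that artifact is a small record next to the leaderboard; the field names here are illustrative, not a required schema:

```python
# Hypothetical minimal decision record kept after the first pass.
decision_record = {
    "split": "stratified 80/20, random_state=0, frozen",
    "floor": {"model": "dummy_prior", "roc_auc": 0.500},
    "baseline": {"model": "scaled_logistic", "roc_auc": 0.781},
    "weak_slice_note": "minority-class recall still needs inspection",
    "decision": "promote",  # one of: promote, defer, stop
}
```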
Common trick:
- keep the baseline and the follow-up in the same model ladder so the comparison is explicit
- if a later model wins only barely, check the weak slice before you promote it
- if the task is imbalanced, rank by a metric that respects the minority class instead of plain accuracy
- if the split is small, use cross-validation to confirm that the first win is not a one-off
Applied pattern:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X and y are the frozen training features and labels from the split above.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=4000))
scores = cross_validate(pipe, X, y, cv=5, scoring=["roc_auc", "average_precision"])
```
That pattern is useful when a single split feels too noisy. `cross_validate` returns a score per fold for every metric you request, so you can see whether the model is broadly decent or only good on one metric.
20-Minute Baseline Routine¶
Use this routine when a new task appears and you do not want to waste the first hour:
- freeze one split
- run a dummy model
- run one strong linear baseline
- choose one metric rule
- add only one higher-capacity follow-up
- stop and read the failure pattern before adding more
That routine is intentionally narrow. It prevents the common mistake of turning uncertainty into uncontrolled tuning.
If you have more time, the next useful step is not many experiments. It is one careful ablation that explains why the first stronger model helped.
For a classification task, a strong first pass often looks like this:
- `DummyClassifier(strategy="prior")`
- `make_pipeline(StandardScaler(), LogisticRegression(max_iter=4000))`
- one larger model only after the split and metric look sane
For a regression task, replace the floor with DummyRegressor(strategy="mean") and compare against one simple regression baseline before anything more complex.
Baseline Ladders By Modality¶
The first ladder should match the data type.
- tabular: dummy floor -> scaled linear model -> one tree or boosting family
- text: majority or unigram floor -> unigram logistic -> bigram or stronger representation
- vision: flattened-pixel or tiny feature baseline -> small CNN -> stronger backbone only if the shift gap survives
- sequential or time-aware tabular: naive lag or recent-window rule -> linear or tree model on lag features -> higher-capacity follow-up only after the ordered split is stable
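For the text ladder, a minimal sketch on a hypothetical tiny corpus (the point is the ladder shape, not the data) might look like this:

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with binary sentiment-style labels.
texts = ["good service", "bad delay", "great support", "terrible delay",
         "good support", "bad service", "great delivery", "terrible service"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

text_ladder = {
    "majority_floor": DummyClassifier(strategy="most_frequent"),
    "unigram_logistic": make_pipeline(CountVectorizer(),
                                      LogisticRegression(max_iter=1000)),
    "bigram_logistic": make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                                     LogisticRegression(max_iter=1000)),
}
for name, model in text_ladder.items():
    model.fit(texts, labels)
```

Each rung only earns its place if it beats the rung below it on the frozen split.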
The mistake to avoid is importing a model family from another modality before the task has earned it.
Failure Pattern¶
Starting with tuning or architecture changes before the baseline exists, then losing track of what actually improved the result.
Another failure is moving the split after seeing the score. That converts debugging into score hunting.
Other failure patterns:
- using raw accuracy on a strongly imbalanced problem and reading it as progress
- fitting preprocessing on the full dataset before the split, which leaks information
- comparing models that were not evaluated on the same split
- letting a high-variance model be the first thing you trust
- calling the baseline "bad" before you know what the dummy floor is
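The preprocessing-leak failure above has a simple fix: split first, and keep the transform inside the pipeline so it is fit only on training rows. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# Leaky version (do NOT do this): the scaler's statistics include test rows.
# X_scaled = StandardScaler().fit_transform(X)

# Honest version: split first, scaler lives inside the pipeline,
# so it is fit only on the training fold.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
```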
Failure checks:
- if the dummy score is already suspiciously high, inspect class imbalance and label leakage
- if the linear baseline beats the stronger model on most slices, do not celebrate the higher-capacity model yet
- if `classification_report` shows one class collapsing, inspect the confusion matrix instead of the aggregate score
- if cross-validation scores swing widely, reduce complexity before increasing it
Inspection habit:
```python
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             confusion_matrix)

# y_true and y_pred are the labels and predictions from the frozen holdout.
cm = confusion_matrix(y_true, y_pred)
report = classification_report(y_true, y_pred, digits=3)
balanced = balanced_accuracy_score(y_true, y_pred)
```
The point of these checks is not to generate more numbers. It is to make the first model failure visible before you spend time on a second model.
Questions To Ask After The First Jump¶
When the first stronger model beats the baseline, ask:
- is the gain large enough to matter
- did the weak slice improve too, or only the overall metric
- is the split still clean
- would this gain survive a hidden holdout
- should the next step be another model, or a report and stop
Those questions are what turn a baseline into a workflow instead of just a first script.
Extra questions that save time:
- did scaling actually matter, or was the gain from the model itself
- did the model improve the rare class, or only the easy majority class
- do the errors cluster in one group, one time window, or one text pattern
- is the improvement bigger than the run-to-run noise
- do you need a better split before you need a better model
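One way to make the noise question concrete is to compare the mean gain against the fold-to-fold spread of the baseline. A sketch, with `RandomForestClassifier` standing in for an illustrative higher-capacity follow-up:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
follow_up = RandomForestClassifier(n_estimators=100, random_state=0)

base_scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
new_scores = cross_val_score(follow_up, X, y, cv=5, scoring="roc_auc")

gain = new_scores.mean() - base_scores.mean()
noise = base_scores.std()
# Treat the gain as real only when it clears the fold-to-fold spread.
print(f"gain={gain:.3f} noise={noise:.3f} clears_noise={gain > noise}")
```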
Promote, Defer, Or Reject¶
Use a simple decision rule after the first stronger model runs:
- promote the change if it beats the baseline on the primary metric, improves or preserves the weak slice, and survives a second sanity check such as CV, a grouped split, or a holdout
- defer the change if it helps only a little and the gain is smaller than the expected evaluation noise
- reject the change if it wins only on an easier split, hurts the weak slice, or makes the workflow harder to explain than the score gain justifies
That rule is what keeps a baseline ladder from turning into unprincipled search.
Useful Scikit-Learn Patterns¶
`train_test_split` is the fastest way to create the first holdout, but it should be frozen before you compare models. For classification, `stratify=y` is the common protection when you want the class proportions to stay similar across the split.
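A minimal sketch of that protection on imbalanced toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 zeros, 10 ones.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Stratification keeps the minority proportion at 10% on both sides.
print(y_train.mean(), y_test.mean())
```

Without `stratify=y`, a small test split can end up with almost no minority rows, which makes every later metric unreliable.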
`DummyClassifier` is more than a joke baseline. Its strategy choices let you separate different questions:
- `prior` shows the class distribution floor
- `most_frequent` shows the majority-class floor
- `stratified` shows a random-but-class-aware floor
- `uniform` shows a pure random floor
- `constant` is useful when you want to test a specific non-majority prediction
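Those floors can be compared directly on an imbalanced toy problem; with an 80/20 class mix, the prior and majority strategies both land at 0.80 accuracy:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy imbalanced labels: 80 zeros, 20 ones. X is ignored by the dummies.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))

floors = {}
for strategy in ["prior", "most_frequent", "stratified", "uniform"]:
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X, y)
    floors[strategy] = clf.score(X, y)  # accuracy on the same rows
```

Any real model that does not clearly beat these numbers has not yet demonstrated signal.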
`Pipeline` and `make_pipeline` keep preprocessing attached to the estimator, which is the cleanest way to avoid leakage. If scaling, encoding, or feature selection happens before the split, the comparison is no longer honest.
`cross_validate` is a better fit than a single score when you want both stability and visibility. It is especially useful if one split feels too small, too noisy, or too dependent on the random seed.
`confusion_matrix` and `classification_report` turn a metric into an explanation. They help you see whether the model is failing on one class, one slice, or one kind of mistake.
`balanced_accuracy_score` is a strong default when the classes are uneven because it stops a majority-class shortcut from looking better than it is.
`learning_curve` is useful when the first baseline seems weak but you do not know whether you need more data, a different model, or a better feature set. If the training and validation curves are both poor, the issue is usually underfitting or weak features. If training is strong but validation is not, the issue is usually generalization.
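A minimal sketch of that diagnostic on synthetic data, reading the two curves side by side:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scores at four training-set sizes, each cross-validated over 5 folds.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 4), cv=5,
)

# A persistent gap between the curves points at generalization problems;
# two low curves point at underfitting or weak features.
for size, tr, va in zip(train_sizes,
                        train_scores.mean(axis=1),
                        val_scores.mean(axis=1)):
    print(f"n={size} train={tr:.3f} val={va:.3f}")
```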
Library Notes¶
- `DummyClassifier` is useful even when it looks silly because it exposes the floor immediately.
- `DummyRegressor` plays the same role for regression tasks.
- `LogisticRegression` is often the best first strong baseline because it is fast, stable, and easy to interpret.
- `StandardScaler` matters whenever feature scales differ a lot.
- `Pipeline` and `make_pipeline` protect you from leakage when preprocessing must happen before the model.
- `cross_validate` is helpful when you need a more stable baseline decision than one split can give.
- `train_test_split` is fine for the first pass, but only if you freeze the split before ranking models.
Practice¶
- Run a dummy baseline and one strong baseline on the same split.
- Add one follow-up model only if the baseline is already stable.
- State one clear metric rule for selecting the winner.
- Write down the one failure pattern you expect to inspect before trying a second improvement.
- Explain what would make you stop after the first jump instead of chasing a smaller gain.
- Name one sign that your follow-up model is only learning noise.
- Say which metric would be your primary selection rule and why.
- Show how you would check whether the rare class got better.
- Explain when `cross_validate` would be safer than a single split.
- Describe one preprocessing step that must stay inside a pipeline.
Runnable Example¶
Open the baseline-first timed-task example in AI Academy and run it from the platform.
Questions To Ask¶
- What is the honest floor?
- Which feature family would you trust first?
- Which slice is most likely to break the model?
- What would count as a real improvement versus a tiny metric fluctuation?
- If the second model is slightly better, is it better enough to justify its complexity?
- Which metric would you trust if the classes are imbalanced?
- What would the confusion matrix need to show before you call the baseline acceptable?