Feature Selection

Scenario: Optimizing a Credit Scoring Model

You're building a credit risk model with hundreds of applicant features, but many are irrelevant or correlated. Use feature selection to identify the key predictors, making the model faster to train and easier to interpret while reducing overfitting, so it supports better loan decisions.

What This Is

Feature selection finds which features actually help the model and which ones are noise, redundant, or actively harmful. Unlike dimensionality reduction (which transforms features), feature selection keeps the original features and drops the rest.

When You Use It

  • too many features slow down training or cause overfitting
  • you suspect some features are leaking the target
  • you want a simpler, more interpretable model
  • you need to explain which inputs matter to a stakeholder

The Three Families

| Family   | How It Works                                                   | Speed   | Accounts for Model? |
|----------|----------------------------------------------------------------|---------|---------------------|
| Filter   | rank features by a statistical score, independent of the model | fastest | no                  |
| Wrapper  | train models on different feature subsets, pick the best       | slowest | yes                 |
| Embedded | the model learns feature importance during training            | medium  | yes                 |

Filter Methods — Start Here

Filter methods score each feature independently. They are fast and model-agnostic.

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# ANOVA F-test (linear relationships)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)

# Mutual information (captures nonlinear relationships)
selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected_mi = selector_mi.fit_transform(X_train, y_train)

Which score to use

  • f_classif / f_regression: fast, assumes linear relationship, good first check
  • mutual_info_classif / mutual_info_regression: slower, captures nonlinear signal, needs more data
  • chi2: for non-negative features (e.g., word counts)
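Of the three scores above, only chi2 has no example yet. A minimal sketch with a toy non-negative count matrix (chi2 rejects negative values, which is why it suits word counts):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy count matrix: 6 samples x 4 non-negative features
X = np.array([
    [3, 0, 1, 2],
    [4, 1, 0, 2],
    [0, 5, 2, 1],
    [1, 4, 3, 0],
    [3, 0, 0, 2],
    [0, 5, 1, 1],
])
y = np.array([0, 0, 1, 1, 0, 1])

# Keep the 2 features with the highest chi-squared statistic
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (6, 2)
```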

Reading the scores

scores = selector.scores_
feature_ranking = sorted(zip(feature_names, scores), key=lambda x: -x[1])
for name, score in feature_ranking[:10]:
    print(f"  {name:>25}: {score:.2f}")
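Besides ranking scores, you usually need to know which original columns survived. A self-contained sketch (using synthetic data from make_classification; the feature names are made up for illustration) that maps the selector's boolean mask back to names:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# get_support() returns a boolean mask over the original columns
selected = [name for name, keep in zip(feature_names, selector.get_support())
            if keep]
print(selected)
```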

Correlation-Based Pruning

When two features are highly correlated, one is usually redundant:

import numpy as np

corr_matrix = df[feature_cols].corr().abs()
# Keep only the upper triangle so each feature pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
df_pruned = df.drop(columns=to_drop)

Wrapper Methods — Best Subset

Wrapper methods train models on different subsets and pick the one that performs best.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    direction="forward",
    cv=5,
)
sfs.fit(X_train, y_train)
selected_mask = sfs.get_support()

  • direction="forward": start empty, add features one by one
  • direction="backward": start full, remove features one by one

Wrapper methods are slow but find feature combinations that work together.

Embedded Methods — Model-Based

Some models learn feature importance as part of training:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

# Permutation importance (more reliable)
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0)

Why permutation importance is better

Built-in feature_importances_ can be biased toward high-cardinality features. Permutation importance measures the actual impact on validation performance and is model-agnostic.
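Putting the comparison into a runnable sketch (synthetic data via make_classification; the sizes and random seeds are arbitrary), this ranks features by mean score drop under permutation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, n_informative=2,
                           random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# importances_mean: average drop in validation score when a column is shuffled
result = permutation_importance(rf, X_valid, y_valid,
                                n_repeats=10, random_state=0)
order = result.importances_mean.argsort()[::-1]
for i in order:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```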

The Selection Ladder

  1. Correlation pruning to remove obvious redundancy
  2. Filter methods (SelectKBest) for a fast first pass
  3. Permutation importance to validate which features actually help the model
  4. Sequential selection only when you need the best possible small subset

Failure Pattern

Selecting features on the full dataset before splitting. If the selection step sees validation data, it can pick features that overfit to the specific split.
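The standard fix is to put the selection step inside a Pipeline, so cross-validation refits it on each training fold and the validation fold never leaks in. A sketch with synthetic data (the step names "select" and "clf" are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Selection lives inside the pipeline, so each CV fold refits SelectKBest
# on its own training portion only
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```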

Another failure: trusting tree-based feature_importances_ on one-hot-encoded features, where importance is split across the dummy columns.
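One common workaround is to sum importances across the dummy columns that came from the same categorical feature. A minimal sketch, assuming the hypothetical "prefix_value" naming convention that one-hot encoders typically produce (the names and numbers below are made up):

```python
from collections import defaultdict

# Hypothetical one-hot column names and their importances
importances = {
    "income": 0.30,
    "state_CA": 0.04,
    "state_NY": 0.03,
    "state_TX": 0.02,
    "age": 0.25,
}

# Sum importance across dummies that share the same prefix
grouped = defaultdict(float)
for name, imp in importances.items():
    grouped[name.split("_")[0]] += imp

print(dict(grouped))
```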

Common Mistakes

  • running SelectKBest on the entire dataset including test data
  • using filter methods alone when feature interactions matter
  • dropping a feature because its individual score is low, even though it helps in combination
  • confusing correlation with causation when reading importance scores

Practice

  1. Apply SelectKBest with f_classif and mutual_info_classif and compare which features are selected.
  2. Remove highly correlated features and check whether model performance changes.
  3. Compare built-in feature importance against permutation importance for a random forest.
  4. Use forward sequential selection with 5-fold CV and report the best feature subset.
  5. Explain why feature selection must happen after the train/test split, not before.

Case Study: Feature Selection in Genomics

Genomics researchers use feature selection to identify key genes from thousands of candidates, reducing noise and focusing on biomarkers. This speeds up analysis and improves model accuracy in disease prediction.

Expanded Quick Quiz

Why use filter methods first?

Answer: They are fast and model-agnostic, providing a quick way to rank features without training.

What's the advantage of wrapper methods?

Answer: They account for feature interactions by training models on subsets, finding the best combination.

How does embedded selection work?

Answer: The model learns feature importance during training (e.g., via coefficients or tree splits).

In the credit scoring scenario, why select features?

Answer: To focus on relevant predictors, avoiding overfitting and improving model interpretability for lenders.

Progress Checkpoint

  • [ ] Applied filter methods (SelectKBest with f_classif and mutual_info).
  • [ ] Used wrapper methods (e.g., sequential selection) with CV.
  • [ ] Checked embedded importance from a tree-based model.
  • [ ] Compared selected features across methods.
  • [ ] Answered quiz questions without peeking.

Milestone: Complete this to unlock "Honest Splits and Baselines" in the Classical ML track. Share your feature ranking in the academy Discord!

Further Reading

  • Scikit-Learn Feature Selection Guide.
  • "Feature Selection for Machine Learning" papers.
  • Permutation importance tutorials.

Connection

Continue with Dimensionality Reduction for an alternative approach that transforms features instead of dropping them, and Hyperparameter Tuning for the full selection-and-tuning workflow.