Feature Selection¶
Scenario: Optimizing a Credit Scoring Model¶
You're building a credit risk model with hundreds of applicant features, but many are irrelevant or correlated. Use feature selection to identify the key predictors, improving training speed and interpretability while reducing overfitting, so the model supports better loan decisions.
What This Is¶
Feature selection finds which features actually help the model and which ones are noise, redundant, or actively harmful. Unlike dimensionality reduction (which transforms features), feature selection keeps the original features and drops the rest.
When You Use It¶
- too many features slow down training or cause overfitting
- you suspect some features are leaking the target
- you want a simpler, more interpretable model
- you need to explain which inputs matter to a stakeholder
The Three Families¶
| Family | How It Works | Speed | Accounts for Model? |
|---|---|---|---|
| Filter | rank features by a statistical score, independent of the model | fastest | no |
| Wrapper | train models with different feature subsets, pick the best | slowest | yes |
| Embedded | the model learns feature importance during training | medium | yes |
Filter Methods — Start Here¶
Filter methods score each feature independently. They are fast and model-agnostic.
```python
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# ANOVA F-test (linear relationships)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)

# Mutual information (captures nonlinear relationships)
selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected_mi = selector_mi.fit_transform(X_train, y_train)
```
Which score to use¶
- `f_classif` / `f_regression`: fast, assumes a linear relationship, good first check
- `mutual_info_classif` / `mutual_info_regression`: slower, captures nonlinear signal, needs more data
- `chi2`: for non-negative features (e.g., word counts)
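The difference matters in practice. A minimal synthetic sketch (all data made up for illustration): a feature that predicts the target through its absolute value has almost no linear correlation, so the F-test barely distinguishes it from noise, while mutual information ranks it clearly above the noise column.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
x_nonlinear = rng.uniform(-1, 1, 2000)        # symmetric around zero
noise = rng.uniform(-1, 1, 2000)              # unrelated to the target
y = (np.abs(x_nonlinear) > 0.5).astype(int)   # target depends on |x|, not x

X = np.column_stack([x_nonlinear, noise])
f_scores, _ = f_classif(X, y)                  # near zero for both columns
mi_scores = mutual_info_classif(X, y, random_state=0)

# Mutual information separates the informative column from the noise column.
print("F-test:", f_scores)
print("MI:    ", mi_scores)
```

If a filter pass with `f_classif` drops a feature you expected to matter, rerunning with mutual information is a cheap sanity check before discarding it.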
Reading the scores¶
```python
scores = selector.scores_
feature_ranking = sorted(zip(feature_names, scores), key=lambda x: -x[1])
for name, score in feature_ranking[:10]:
    print(f"{name:>25}: {score:.2f}")
```
Correlation-Based Pruning¶
When two features are highly correlated, one is usually redundant:
```python
import numpy as np

corr_matrix = df[feature_cols].corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
```
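To see the pruning in action, here is a quick check on a toy frame (column names invented for illustration) where one column is just a rescaled copy of another:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50, 10, 500),
    "age": rng.normal(40, 12, 500),
})
df["income_eur"] = df["income"] * 0.9          # rescaled duplicate of income
feature_cols = ["income", "age", "income_eur"]

corr_matrix = df[feature_cols].corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
print(to_drop)  # → ['income_eur']
```

Because correlation is scale-invariant, a linearly rescaled column correlates perfectly with the original and gets flagged, while the unrelated column survives.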
Wrapper Methods — Best Subset¶
Wrapper methods train models on different subsets and pick the one that performs best.
```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=10,
    direction="forward",
    cv=5,
)
sfs.fit(X_train, y_train)
selected_mask = sfs.get_support()
```
- `direction="forward"`: start empty, add features one by one
- `direction="backward"`: start full, remove features one by one
Wrapper methods are slow but find feature combinations that work together.
Embedded Methods — Model-Based¶
Some models learn feature importance as part of training:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
importances = rf.feature_importances_

# Permutation importance (more reliable)
result = permutation_importance(rf, X_valid, y_valid, n_repeats=10, random_state=0)
```
Why permutation importance is better¶
Built-in feature_importances_ can be biased toward high-cardinality features. Permutation importance measures the actual impact on validation performance and is model-agnostic.
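A self-contained sketch of the idea, on synthetic data where only the first feature carries signal (all names and sizes are illustrative): permutation importance assigns essentially all the credit to that feature and near-zero, noisy values to the rest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)                  # only feature 0 is informative

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column on the validation set and measure the score drop
result = permutation_importance(rf, X_va, y_va, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Note that the importance is computed on held-out data: a feature only scores well if shuffling it actually hurts validation performance.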
The Selection Ladder¶
- Correlation pruning to remove obvious redundancy
- Filter methods (`SelectKBest`) for a fast first pass
- Permutation importance to validate which features actually help the model
- Sequential selection only when you need the best possible small subset
Failure Pattern¶
Selecting features on the full dataset before splitting. If the selection step sees validation data, it can pick features that overfit to the specific split.
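One way to avoid this failure is to put the selection step inside a `Pipeline`, so cross-validation refits it on each fold's training portion only. A minimal sketch, with the dataset and `k` chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           random_state=0)

# Selection lives inside the pipeline, so each CV fold refits
# SelectKBest on its own training portion and never sees held-out data.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Running `SelectKBest` once on all of `X` before `cross_val_score` would leak validation data into the selection and inflate the scores.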
Another failure: trusting tree-based feature_importances_ on one-hot-encoded features, where importance is split across the dummy columns.
Common Mistakes¶
- running `SelectKBest` on the entire dataset including test data
- using filter methods alone when feature interactions matter
- dropping a feature because its individual score is low, even though it helps in combination
- confusing correlation with causation when reading importance scores
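The third mistake above is easy to reproduce with an XOR-style target (a synthetic sketch): each feature alone is statistically independent of the target, so both F-scores are tiny, yet the pair together predicts it perfectly.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2)).astype(float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)  # XOR of the two features

f_scores, _ = f_classif(X, y)                      # both scores near zero
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
print("F-scores:", f_scores, "| tree accuracy:", acc)
```

A filter method would happily drop both columns here; a wrapper or embedded method that evaluates combinations would keep them.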
Practice¶
- Apply `SelectKBest` with `f_classif` and `mutual_info_classif` and compare which features are selected.
- Remove highly correlated features and check whether model performance changes.
- Compare built-in feature importance against permutation importance for a random forest.
- Use forward sequential selection with 5-fold CV and report the best feature subset.
- Explain why feature selection must happen after the train/test split, not before.
Case Study: Feature Selection in Genomics¶
Genomics researchers use feature selection to identify key genes from thousands of candidates, reducing noise and focusing on biomarkers. This speeds up analysis and improves model accuracy in disease prediction.
Expanded Quick Quiz¶
Why use filter methods first?
Answer: They are fast and model-agnostic, providing a quick way to rank features without training.
What's the advantage of wrapper methods?
Answer: They account for feature interactions by training models on subsets, finding the best combination.
How does embedded selection work?
Answer: The model learns feature importance during training (e.g., via coefficients or tree splits).
In the credit scoring scenario, why select features?
Answer: To focus on relevant predictors, avoiding overfitting and improving model interpretability for lenders.
Progress Checkpoint¶
- [ ] Applied filter methods (SelectKBest with f_classif and mutual_info).
- [ ] Used wrapper methods (e.g., sequential selection) with CV.
- [ ] Checked embedded importance from a tree-based model.
- [ ] Compared selected features across methods.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Honest Splits and Baselines" in the Classical ML track. Share your feature ranking in the academy Discord!
Further Reading¶
- Scikit-Learn Feature Selection Guide.
- "Feature Selection for Machine Learning" papers.
- Permutation importance tutorials.
Runnable Example¶
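A self-contained run on scikit-learn's built-in breast-cancer dataset, comparing a logistic regression on all 30 features against the top 10 by F-score (the split ratio and `k` are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

def score(k):
    # Scaling and selection are fit on the training split only
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    return pipe.fit(X_train, y_train).score(X_test, y_test)

print(f"all 30 features: {score(30):.3f}")
print(f"top 10 features: {score(10):.3f}")
```

On this dataset the 10-feature model typically stays close to the full model's accuracy, which is the point: a third of the inputs carry most of the signal.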
Longer Connection¶
Continue with Dimensionality Reduction for an alternative approach that transforms features instead of dropping them, and Hyperparameter Tuning for the full selection-and-tuning workflow.