Text Representations and Order¶
What This Is¶
Text representation decides how much meaning survives before the classifier sees the input. If you ignore order, a sentence like “not good” can look too similar to “good”. If you keep too little context, the model learns from tokens but misses phrases, negation, and local structure.
This topic is about choosing the smallest representation that still preserves the signal. In practice, that usually means starting with word counts, then checking whether bigrams, character n-grams, or TF-IDF are enough before reaching for heavier models.
When You Use It¶
- building a fast text baseline
- checking whether word order matters
- handling negation, short phrases, or noisy spelling
- comparing bag-of-words, n-grams, and TF-IDF
- deciding whether a simple linear model is already strong enough
Most Useful Functions¶
- `CountVectorizer` for word counts and n-grams
- `TfidfVectorizer` for TF-IDF-weighted text baselines
- `HashingVectorizer` for large-scale or streaming text
- `get_feature_names_out()` for reading back the learned vocabulary
- `inverse_transform()` for checking which features fired on a sample
- `build_analyzer()` for understanding how text becomes tokens
Representation Capability Table¶
| Representation | Keeps Well | Misses Or Weakens | Good First Use |
|---|---|---|---|
| unigrams | topic words, coarse sentiment cues, short keyword tasks | negation, phrase order, multiword meaning | fast baseline |
| word n-grams | short local phrases, negation, local order | long-distance structure, spelling variation across forms | order-sensitive short text |
| char n-grams | spelling, prefixes/suffixes, noisy queries, typos | explicit word-level semantics and longer phrase structure | noisy or short text |
| TF-IDF | same structure as the chosen analyzer, but with better term weighting | it does not add new expressive power beyond the analyzer | counts are too dominated by common terms |
Important rule:
- TF-IDF changes weighting, not the basic capability of the analyzer
- if the failure is about missing order, switching from counts to TF-IDF will not solve it by itself
Core Pattern¶
Start simple, then add only the missing structure.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Fit the vocabulary on training text only, then reuse it for validation.
vectorizer = CountVectorizer(ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_texts)
X_valid = vectorizer.transform(valid_texts)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
That baseline is useful because it tells you whether the problem is already visible to a bag-of-words model.
Applied Examples¶
1. Add bigrams when order matters¶
CountVectorizer(ngram_range=(1, 2))
This is the first upgrade when a unigram model misses phrases such as “not good”, “out of stock”, or “credit card denied”. scikit-learn’s feature extraction guide notes that unigrams cannot capture phrases or word-order dependence, while n-grams can.
2. Use character n-grams for noisy text¶
CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))
Character n-grams are often stronger when text is noisy, misspelled, or short. The char_wb analyzer keeps n-grams inside word boundaries and pads word edges with spaces, which helps preserve local spelling patterns without mixing across words.
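To see the boundary padding concretely, here is a minimal sketch using `build_analyzer()`:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
analyzer = vec.build_analyzer()

# char_wb pads each word with a space on both sides, so the trigrams
# of "good" include the word-initial " go" and word-final "od ".
print(analyzer("good"))  # [' go', 'goo', 'ood', 'od ']
```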
3. Use TF-IDF when repeated common words are too dominant¶
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, min_df=2)
X_train = vectorizer.fit_transform(train_texts)
TfidfVectorizer is the practical default when raw counts overweight frequent terms. It is equivalent to CountVectorizer followed by TfidfTransformer, so you can think of it as “count first, then downweight broad terms.”
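The equivalence can be checked directly (toy texts invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

texts = ["good good good", "not good", "bad service"]

# Direct TF-IDF features.
tfidf = TfidfVectorizer().fit_transform(texts)

# Counts first, then TF-IDF weighting: the same result.
counts = CountVectorizer().fit_transform(texts)
via_transformer = TfidfTransformer().fit_transform(counts)

print(np.allclose(tfidf.toarray(), via_transformer.toarray()))  # True
```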
4. Use a sparse, scalable representation for large text collections¶
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(ngram_range=(1, 2), alternate_sign=False)
X_train = vectorizer.transform(train_texts)
This is useful when you want fixed-size features without storing a vocabulary. The tradeoff is that collisions can happen, and you cannot inspect a learned vocabulary the same way you can with CountVectorizer.
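A minimal sketch of the fixed-size property (the `n_features` value here is an arbitrary choice for the example):

```python
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2**10, alternate_sign=False)

# No fit step is needed: the feature space is fixed by hashing,
# so new documents never change the matrix width.
X_small = vec.transform(["one short text"])
X_large = vec.transform(["many", "more", "documents", "with", "new", "words"])
print(X_small.shape[1], X_large.shape[1])  # 1024 1024
```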
What To Tune First¶
- `ngram_range`: start with `(1, 1)`, then try `(1, 2)`
- `analyzer`: word first, then `char_wb` if spelling or morphology matters
- `min_df`: raise it if the vocabulary is too noisy
- `max_df`: lower it if corpus-wide filler terms dominate
- `binary=True`: use this when only presence matters, not repetition
- `sublinear_tf=True`: use this when repeated words should grow more slowly
- `stop_words`: use carefully; removing the wrong words can destroy signal
Tokenizer And Analyzer Audit¶
Before changing the model, inspect what the vectorizer is actually doing.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
analyzer = vectorizer.build_analyzer()
# Default settings lowercase the text and drop punctuation.
print(analyzer("Not good!!!"))  # ['not', 'good']
That audit catches simple but costly mistakes:
- default lowercasing may merge cues you meant to preserve
- the default token pattern drops punctuation and one-character tokens
- stop-word removal can accidentally remove negation or domain-specific cues
- word analysis can miss useful morphology that char n-grams would catch
If you cannot explain how the text became features, you should not trust the error analysis yet.
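The stop-word pitfall above is easy to demonstrate: scikit-learn's built-in English stop-word list includes "not", so enabling it silently removes negation.

```python
from sklearn.feature_extraction.text import CountVectorizer

# With the built-in English stop-word list, "not" is removed, so
# "not good" and "good" produce identical feature vectors.
vec = CountVectorizer(stop_words="english")
analyzer = vec.build_analyzer()
print(analyzer("not good"))  # ['good']
```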
Common Mistakes¶
- using unigram bag-of-words on a task where negation changes the label
- jumping to a complex model before testing a simple n-gram change
- treating stop-word removal as always safe
- forgetting that the same word can matter differently in different phrases
- using TF-IDF when the main issue is actually missing order
- fitting the vectorizer on all data before splitting, which leaks vocabulary information
- assuming more features automatically means better text understanding
Inspection Habits¶
Use these checks before you trust the representation.
vectorizer.get_feature_names_out()
vectorizer.inverse_transform(X_valid[:3])
Look at the learned tokens and the tokens active in a few mistakes. If a failed example contains the right words but still gets the wrong score, the representation may be too weak. If the active features are garbage, the tokenizer or preprocessing is probably the problem.
Good inspection questions:
- Which n-grams actually appear in the feature set?
- Are the features capturing the phrase that caused the error?
- Are there too many near-duplicate tokens from punctuation or casing?
- Do the top features look like real language or accidental artifacts?
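These questions are easiest to answer with `inverse_transform`, which recovers the features that fired for each row. A minimal sketch with invented texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["not good", "very good"]
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)

# inverse_transform maps each row back to its active feature names,
# which is the fastest way to audit a single mistake.
rows = [sorted(row) for row in vec.inverse_transform(X)]
print(rows)  # [['good', 'not', 'not good'], ['good', 'very', 'very good']]
```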
Coefficient And Error Autopsy¶
For linear text models, inspect both the coefficients and the features that fired on a specific mistake.
feature_names = vectorizer.get_feature_names_out()
# Global view: the strongest coefficients in each direction.
top_positive = feature_names[model.coef_[0].argsort()[-5:]]
top_negative = feature_names[model.coef_[0].argsort()[:5]]
# Local view: the features that fired on one misclassified sample.
active_features = vectorizer.inverse_transform(X_valid[error_index])[0]
This gives you two different debugging views:
- coefficient view: what the model generally thinks is useful
- active-feature view: what the model actually used on one mistake
If the coefficients look sensible but the mistake still fails, the representation may be missing context. If the active features are junk, the analyzer or preprocessing is probably the real problem.
Failure Checks¶
- If the validation gain disappears when you switch from unigrams to bigrams, the model was probably not using order in a stable way.
- If `char_wb` helps much more than word n-grams, spelling, morphology, or short noisy strings are probably driving the task.
- If TF-IDF helps only a little, the bottleneck may be representation coverage, not feature weighting.
- If the train score rises but validation stays flat, the vocabulary may be too task-specific or the split may be leaking patterns.
Useful Tricks¶
- Start with `CountVectorizer`, then compare against `TfidfVectorizer` before changing the classifier.
- Keep the classifier simple while you test representation changes.
- Use bigrams as the first order-aware upgrade instead of changing everything at once.
- Try `char_wb` when the task has typos, inflections, short queries, or many rare forms.
- Use `binary=True` when the presence of a cue matters more than how often it repeats.
- Keep an eye on sparse matrices; high-dimensional text features are normal, not a bug.
Runnable Pattern¶
Open the matching example in AI Academy and compare the unigram and bigram variants. Watch for the specific cases where order changes the answer.
from sklearn.feature_extraction.text import CountVectorizer
texts = ["not good", "very good", "good enough"]
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())
That tiny pattern already shows why “words only” and “words plus order” are not the same representation.
Practice¶
- Fit a unigram baseline and then a unigram-plus-bigram baseline. Which mistakes disappear?
- Inspect the feature names for a noisy text task. Which tokens look useful, and which look like clutter?
- Explain when `char_wb` is a better first choice than word n-grams.
- Describe one case where `TfidfVectorizer` is better than raw counts.
- Say what information is lost when a bag-of-words model ignores order.
- Show how `inverse_transform()` can help debug a false positive.
- Name one reason `HashingVectorizer` is useful and one reason it is harder to inspect.
- Explain why `max_df` and `min_df` can act like vocabulary filters.
Questions To Ask¶
- Is the error about missing order, noisy spelling, or both?
- Would bigrams fix the issue, or do you need character n-grams?
- Are common words drowning out the useful signal?
- Can I explain this model’s mistakes by reading the active n-grams?
- Is the current representation still the smallest one that keeps the important information?
Library Notes¶
- `CountVectorizer` is the clearest first check for word-based text features.
- `TfidfVectorizer` is often the best next step when raw counts are too blunt.
- `HashingVectorizer` trades interpretability for fixed-size scalability.
- `char_wb` is a strong option when word boundaries matter but exact words are fragile.
- `get_feature_names_out()` and `inverse_transform()` are not extras; they are part of debugging the representation.
- `ngram_range=(1, 2)` is the simplest way to test whether local order matters.