Text Representations and Order¶
What This Is¶
Text representation decides how much meaning survives before the classifier sees the input. If you ignore order, a sentence like “not good” can look too similar to “good”. If you keep too little context, the model learns from tokens but misses phrases, negation, and local structure.
This topic is about choosing the smallest representation that still preserves the signal. In practice, that usually means starting with word counts, then checking whether bigrams, character n-grams, or TF-IDF are enough before reaching for heavier models.
When You Use It¶
- building a fast text baseline
- checking whether word order matters
- handling negation, short phrases, or noisy spelling
- comparing bag-of-words, n-grams, and TF-IDF
- deciding whether a simple linear model is already strong enough
Most Useful Functions¶
- `CountVectorizer` for word counts and n-grams
- `TfidfVectorizer` for TF-IDF-weighted text baselines
- `HashingVectorizer` for large-scale or streaming text
- `get_feature_names_out()` for reading back the learned vocabulary
- `inverse_transform()` for checking which features fired on a sample
- `build_analyzer()` for understanding how text becomes tokens
Representation Capability Table¶
| Representation | Keeps Well | Misses Or Weakens | Good First Use |
|---|---|---|---|
| unigrams | topic words, coarse sentiment cues, short keyword tasks | negation, phrase order, multiword meaning | fast baseline |
| word n-grams | short local phrases, negation, local order | long-distance structure, spelling variation across forms | order-sensitive short text |
| char n-grams | spelling, prefixes/suffixes, noisy queries, typos | explicit word-level semantics and longer phrase structure | noisy or short text |
| TF-IDF | same structure as the chosen analyzer, but with better term weighting | it does not add new expressive power beyond the analyzer | counts are too dominated by common terms |
Important rule:
- TF-IDF changes weighting, not the basic capability of the analyzer
- if the failure is about missing order, switching from counts to TF-IDF will not solve it by itself
Core Pattern¶
Start simple, then add only the missing structure.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Fit the vocabulary on training text only, then reuse it for validation.
vectorizer = CountVectorizer(ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_texts)
X_valid = vectorizer.transform(valid_texts)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
That baseline is useful because it tells you whether the problem is already visible to a bag-of-words model.
Applied Examples¶
1. Add bigrams when order matters¶
CountVectorizer(ngram_range=(1, 2))
This is the first upgrade when a unigram model misses phrases such as “not good”, “out of stock”, or “credit card denied”. scikit-learn’s feature extraction guide notes that unigrams cannot capture phrases or word-order dependence, while n-grams can.
2. Use character n-grams for noisy text¶
CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))
Character n-grams are often stronger when text is noisy, misspelled, or short. The char_wb analyzer keeps n-grams inside word boundaries and pads word edges with spaces, which helps preserve local spelling patterns without mixing across words.
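To see the boundary padding concretely, here is a minimal sketch using `build_analyzer()`:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
analyzer = vec.build_analyzer()

# char_wb pads each word with a space on both sides, so the trigrams
# of "good" include the word-initial " go" and word-final "od ".
print(analyzer("good"))  # [' go', 'goo', 'ood', 'od ']
```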
3. Use TF-IDF when repeated common words are too dominant¶
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, min_df=2)
X_train = vectorizer.fit_transform(train_texts)
TfidfVectorizer is the practical default when raw counts overweight frequent terms. It is equivalent to CountVectorizer followed by TfidfTransformer, so you can think of it as “count first, then downweight broad terms.”
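The equivalence can be checked directly (toy texts invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

texts = ["good good good", "not good", "bad service"]

# Direct TF-IDF features.
tfidf = TfidfVectorizer().fit_transform(texts)

# Counts first, then TF-IDF weighting: the same result.
counts = CountVectorizer().fit_transform(texts)
via_transformer = TfidfTransformer().fit_transform(counts)

print(np.allclose(tfidf.toarray(), via_transformer.toarray()))  # True
```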
4. Use a sparse, scalable representation for large text collections¶
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(ngram_range=(1, 2), alternate_sign=False)
X_train = vectorizer.transform(train_texts)
This is useful when you want fixed-size features without storing a vocabulary. The tradeoff is that collisions can happen, and you cannot inspect a learned vocabulary the same way you can with CountVectorizer.
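A minimal sketch of the fixed-size property (the `n_features` value here is an arbitrary choice for the example):

```python
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2**10, alternate_sign=False)

# No fit step is needed: the feature space is fixed by hashing,
# so new documents never change the matrix width.
X_small = vec.transform(["one short text"])
X_large = vec.transform(["many", "more", "documents", "with", "new", "words"])
print(X_small.shape[1], X_large.shape[1])  # 1024 1024
```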
What To Tune First¶
- `ngram_range`: start with `(1, 1)`, then try `(1, 2)`
- `analyzer`: word first, then `char_wb` if spelling or morphology matters
- `min_df`: raise it if the vocabulary is too noisy
- `max_df`: lower it if corpus-wide filler terms dominate
- `binary=True`: use this when only presence matters, not repetition
- `sublinear_tf=True`: use this when repeated words should grow more slowly
- `stop_words`: use carefully; removing the wrong words can destroy signal
Tokenizer And Analyzer Audit¶
Before changing the model, inspect what the vectorizer is actually doing.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
analyzer = vectorizer.build_analyzer()
# Default settings lowercase the text and drop punctuation.
print(analyzer("Not good!!!"))  # ['not', 'good']
That audit catches simple but costly mistakes:
- default lowercasing may merge cues you meant to preserve
- the default token pattern drops punctuation and one-character tokens
- stop-word removal can accidentally remove negation or domain-specific cues
- word analysis can miss useful morphology that char n-grams would catch
If you cannot explain how the text became features, you should not trust the error analysis yet.
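The stop-word pitfall above is easy to demonstrate: scikit-learn's built-in English stop-word list includes "not", so enabling it silently removes negation.

```python
from sklearn.feature_extraction.text import CountVectorizer

# With the built-in English stop-word list, "not" is removed, so
# "not good" and "good" produce identical feature vectors.
vec = CountVectorizer(stop_words="english")
analyzer = vec.build_analyzer()
print(analyzer("not good"))  # ['good']
```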
Common Mistakes¶
- using unigram bag-of-words on a task where negation changes the label
- jumping to a complex model before testing a simple n-gram change
- treating stop-word removal as always safe
- forgetting that the same word can matter differently in different phrases
- using TF-IDF when the main issue is actually missing order
- fitting the vectorizer on all data before splitting, which leaks vocabulary information
- assuming more features automatically means better text understanding
Inspection Habits¶
Use these checks before you trust the representation.
vectorizer.get_feature_names_out()
vectorizer.inverse_transform(X_valid[:3])
Look at the learned tokens and the tokens active in a few mistakes. If a failed example contains the right words but still gets the wrong score, the representation may be too weak. If the active features are garbage, the tokenizer or preprocessing is probably the problem.
Good inspection questions:
- Which n-grams actually appear in the feature set?
- Are the features capturing the phrase that caused the error?
- Are there too many near-duplicate tokens from punctuation or casing?
- Do the top features look like real language or accidental artifacts?
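These questions are easiest to answer with `inverse_transform`, which recovers the features that fired for each row. A minimal sketch with invented texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["not good", "very good"]
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)

# inverse_transform maps each row back to its active feature names,
# which is the fastest way to audit a single mistake.
rows = [sorted(row) for row in vec.inverse_transform(X)]
print(rows)  # [['good', 'not', 'not good'], ['good', 'very', 'very good']]
```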
Coefficient And Error Autopsy¶
For linear text models, inspect both the coefficients and the features that fired on a specific mistake.
feature_names = vectorizer.get_feature_names_out()
# Global view: the strongest coefficients in each direction.
top_positive = feature_names[model.coef_[0].argsort()[-5:]]
top_negative = feature_names[model.coef_[0].argsort()[:5]]
# Local view: the features that fired on one misclassified sample.
active_features = vectorizer.inverse_transform(X_valid[error_index])[0]
This gives you two different debugging views:
- coefficient view: what the model generally thinks is useful
- active-feature view: what the model actually used on one mistake
If the coefficients look sensible but the mistake still fails, the representation may be missing context. If the active features are junk, the analyzer or preprocessing is probably the real problem.
Failure Checks¶
- If the validation gain disappears when you switch from unigrams to bigrams, the model was probably not using order in a stable way.
- If `char_wb` helps much more than word n-grams, spelling, morphology, or short noisy strings are probably driving the task.
- If TF-IDF helps only a little, the bottleneck may be representation coverage, not feature weighting.
- If the train score rises but validation stays flat, the vocabulary may be too task-specific or the split may be leaking patterns.
Useful Tricks¶
- Start with `CountVectorizer`, then compare against `TfidfVectorizer` before changing the classifier.
- Keep the classifier simple while you test representation changes.
- Use bigrams as the first order-aware upgrade instead of changing everything at once.
- Try `char_wb` when the task has typos, inflections, short queries, or many rare forms.
- Use `binary=True` when the presence of a cue matters more than how often it repeats.
- Keep an eye on sparse matrices; high-dimensional text features are normal, not a bug.
Runnable Pattern¶
Open the matching example in AI Academy and compare the unigram and bigram variants. Watch for the specific cases where order changes the answer.
from sklearn.feature_extraction.text import CountVectorizer
texts = ["not good", "very good", "good enough"]
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())
That tiny pattern already shows why “words only” and “words plus order” are not the same representation.
Practice¶
- Fit a unigram baseline and then a unigram-plus-bigram baseline. Which mistakes disappear?
- Inspect the feature names for a noisy text task. Which tokens look useful, and which look like clutter?
- Explain when `char_wb` is a better first choice than word n-grams.
- Describe one case where `TfidfVectorizer` is better than raw counts.
- Say what information is lost when a bag-of-words model ignores order.
- Show how `inverse_transform()` can help debug a false positive.
- Name one reason `HashingVectorizer` is useful and one reason it is harder to inspect.
- Explain why `max_df` and `min_df` can act like vocabulary filters.
Questions To Ask¶
- Is the error about missing order, noisy spelling, or both?
- Would bigrams fix the issue, or do you need character n-grams?
- Are common words drowning out the useful signal?
- Can I explain this model’s mistakes by reading the active n-grams?
- Is the current representation still the smallest one that keeps the important information?
Library Notes¶
- `CountVectorizer` is the clearest first check for word-based text features.
- `TfidfVectorizer` is often the best next step when raw counts are too blunt.
- `HashingVectorizer` trades interpretability for fixed-size scalability.
- `char_wb` is a strong option when word boundaries matter but exact words are fragile.
- `get_feature_names_out()` and `inverse_transform()` are not extras; they are part of debugging the representation.
- `ngram_range=(1, 2)` is the simplest way to test whether local order matters.