How Spam Filters Are Tested and Scored

Filter accuracy claims mean nothing without a methodology. This page explains how spam filters are evaluated - the TREC Spam Track's one-at-a-time model, the 1-ROCA% and lam% metrics, the feedback regimes - and the public corpora researchers use, with the caveats that make any single number unreliable.

Last checked: June 21, 2026

Every spam filter ships with an accuracy number, and almost none of them mean what they appear to mean. “99.9% accurate” is meaningless without answers to: measured on which mail, with what feedback, counting which errors, and against what baseline? The field’s answer to that problem is a body of shared evaluation methodology and public corpora - most importantly the TREC Spam Track - that makes filter comparisons reproducible and honest about their limits.

This page explains how filters are measured and what the standard datasets are. It is written for senders too, because the central lesson of the evaluation literature is the same one that runs through this whole library: there is no single accuracy number and no universal threshold - performance is a curve, a trade-off, and a property of a corpus and a configuration.

Figure: A labeled ham-plus-spam corpus runs through the classifier into a confusion matrix; the false-positive cell - good mail called spam - is the one filters most fear, and the metrics quantify both error directions.

The 60-second version

The TREC Spam Track is the reference methodology: messages presented one at a time, in chronological order, with the filter giving a binary verdict and a spamminess score before seeing the next.
The primary metric is 1-ROCA% (area above the ROC curve) - lower is better - chosen because a single accuracy figure hides the ham/spam trade-off.
Two raw error rates matter: ham misclassification (hm%) = good mail called spam, and spam misclassification (sm%) = spam called good. They trade off against each other.
lam% combines both into one number without pre-judging which error is worse.
Standard corpora include TREC 2007 (trec07p), the SpamAssassin public corpus, and the Enron dataset - each with important caveats (Enron has no spam labels at all).
Live network lookups corrupt offline tests - a previous downloader may have already reported the corpus to DCC/Razor/Pyzor.
Every reported number is a property of its corpus; none is a guarantee about your mail.

The TREC Spam Track methodology

The TREC (Text REtrieval Conference, NIST) Spam Track defined how to evaluate a filter realistically. Its model: chronologically ordered email messages are “presented one at a time to the filter,” which yields “(a) a binary spam/ham judgment and (b) a numeric spamminess score for each message before seeing the next.”

Two design decisions make it rigorous:

Chronological order. Presenting mail in time order preserves realistic temporal signal and prevents look-ahead - “a filter trained on future messages and tested on past ones would have an artificial advantage.” One-at-a-time presentation enforces this.
Multiple feedback regimes. Real users don’t label every message, so TREC tests four:

Feedback mode	What it models
Immediate	The ideal user - correct label given right after each classification
Delayed	Feedback for the first N messages, then none (most mail classified blind)
Partial	Only some recipients’ messages are ever labelled (users who never report errors)
Active / on-line	The filter may request labels up to a fixed quota

The headline finding from these regimes is intuitive but important: “Delayed and partial feedback degrade filter performance.” A filter that looks excellent with perfect feedback degrades as feedback dries up - which is the real-world condition. The best active-learning strategy was “uncertainty scheduling,” requesting labels “only for those messages whose score is near the filter’s threshold.”
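The uncertainty-scheduling idea can be sketched in a few lines. This is an illustration only, not any TREC participant's implementation: the function name `select_label_requests`, the threshold, the margin, and the quota are all hypothetical values chosen for the example.

```python
def select_label_requests(scores, threshold=0.5, margin=0.1, quota=2):
    """Return indices of messages whose labels the filter would request.

    Messages are visited in arrival order. A label is requested only when
    the score lies within `margin` of `threshold` and quota remains.
    All parameter values here are assumptions for illustration.
    """
    requested = []
    for i, score in enumerate(scores):
        if len(requested) >= quota:
            break  # label budget exhausted
        if abs(score - threshold) <= margin:
            requested.append(i)  # filter is unsure, so a label is most useful
    return requested


scores = [0.02, 0.48, 0.97, 0.55, 0.51, 0.10]
print(select_label_requests(scores))  # [1, 3]
```

Confident scores (0.02, 0.97) never consume quota; the budget goes to the messages nearest the decision boundary, where a correct label changes the most.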

The evaluation toolkit also constrained filters to make runs comparable: no network resources during evaluation, 1 GB temp disk, 1 GB RAM, and an amortized 2 seconds per message. Each filter had to implement five mandatory operations (initialize, classify, train-ham, train-spam, finalize), and the toolkit drove the filter only through those.
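The one-at-a-time model with immediate feedback can be sketched as a small harness. This is a minimal sketch, not the TREC toolkit itself: the `SpamFilter` interface, the `KeywordFilter` toy classifier, and `run_immediate_feedback` are hypothetical names that only mirror the five operations listed above.

```python
from abc import ABC, abstractmethod


class SpamFilter(ABC):
    """Hypothetical interface mirroring the five operations named above."""

    @abstractmethod
    def initialize(self): ...
    @abstractmethod
    def classify(self, message): ...  # returns (is_spam, spamminess_score)
    @abstractmethod
    def train_ham(self, message): ...
    @abstractmethod
    def train_spam(self, message): ...
    @abstractmethod
    def finalize(self): ...


class KeywordFilter(SpamFilter):
    """Toy filter: counts how often each token was seen in spam vs ham."""

    def initialize(self):
        self.spam_counts, self.ham_counts = {}, {}

    def classify(self, message):
        tokens = message.lower().split()
        spam_hits = sum(self.spam_counts.get(t, 0) for t in tokens)
        ham_hits = sum(self.ham_counts.get(t, 0) for t in tokens)
        score = (spam_hits + 1) / (spam_hits + ham_hits + 2)  # smoothed ratio
        return score > 0.5, score

    def train_ham(self, message):
        for t in message.lower().split():
            self.ham_counts[t] = self.ham_counts.get(t, 0) + 1

    def train_spam(self, message):
        for t in message.lower().split():
            self.spam_counts[t] = self.spam_counts.get(t, 0) + 1

    def finalize(self):
        pass


def run_immediate_feedback(filt, stream):
    """stream: iterable of (message, is_spam) pairs in chronological order."""
    filt.initialize()
    results = []
    for message, is_spam in stream:
        verdict, score = filt.classify(message)  # judged before the label is seen
        results.append((verdict, score, is_spam))
        (filt.train_spam if is_spam else filt.train_ham)(message)  # then feedback
    filt.finalize()
    return results


stream = [
    ("cheap pills now", True),
    ("meeting agenda attached", False),
    ("cheap pills discount", True),
    ("agenda for meeting", False),
]
results = run_immediate_feedback(KeywordFilter(), stream)
print([verdict for verdict, _, _ in results])  # [False, False, True, False]
```

The first spam is missed because the filter has seen nothing yet. A delayed or partial regime would simply skip the training call for some messages, which is why those regimes score worse.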

The metrics: why one number isn’t enough

The reason filters are measured on a curve rather than a single accuracy figure is a fundamental tension TREC states plainly: “There is a natural tension between ham and spam misclassification percentages. A filter may improve one at the expense of the other.” Move the threshold to catch more spam and you block more legitimate mail; relax it to protect legitimate mail and you let more spam through.
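A toy threshold sweep makes the tension visible. The scores and labels below are invented for illustration, and `error_rates` is a hypothetical helper, not part of any evaluation toolkit.

```python
def error_rates(scores, labels, threshold):
    """Return (hm%, sm%) when messages scoring >= threshold are called spam."""
    ham = [s for s, is_spam in zip(scores, labels) if not is_spam]
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    hm = 100.0 * sum(s >= threshold for s in ham) / len(ham)   # ham called spam
    sm = 100.0 * sum(s < threshold for s in spam) / len(spam)  # spam called ham
    return hm, sm


# Invented spamminess scores: four ham messages, then four spam messages.
scores = [0.05, 0.20, 0.40, 0.60, 0.35, 0.70, 0.90, 0.95]
labels = [False, False, False, False, True, True, True, True]

for t in (0.3, 0.5, 0.8):
    hm, sm = error_rates(scores, labels, t)
    print(f"threshold={t:.1f}  hm%={hm:.0f}  sm%={sm:.0f}")
# threshold=0.3  hm%=50  sm%=0
# threshold=0.5  hm%=25  sm%=25
# threshold=0.8  hm%=0  sm%=50
```

The same filter, with the same scores, produces three different "accuracy" stories depending only on where the threshold sits.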

Metric	Definition	Direction
hm% (ham misclassification)	Fraction of all ham classified as spam (false-positive rate)	Lower is better
sm% (spam misclassification)	Fraction of all spam classified as ham (false-negative rate)	Lower is better
1-ROCA%	Area above the ROC curve, as a percentage. Probabilistically: the chance a random spam scores lower than a random ham	Lower is better
lam% (logistic average misclassification)	logit⁻¹(½ logit(hm%) + ½ logit(sm%)) - a geometric mean of the odds of each error	Lower is better
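The definitions in the table can be written out directly. This is a sketch of the formulas as stated, not the TREC toolkit's code; the function names and the toy data are assumptions, and the pairwise 1-ROCA% computation is quadratic, so it suits small examples only.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def lam_percent(hm_pct, sm_pct):
    """Logistic average of the two error rates, both given as percentages.

    Undefined when either rate is exactly 0% or 100%; this sketch does not
    handle that case.
    """
    hm, sm = hm_pct / 100.0, sm_pct / 100.0
    return 100.0 * inv_logit(0.5 * logit(hm) + 0.5 * logit(sm))


def one_minus_roca_percent(scores, labels):
    """Share of (ham, spam) pairs in which the spam scores lower than the ham.

    Tied pairs count as half an error.
    """
    ham = [s for s, is_spam in zip(scores, labels) if not is_spam]
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    bad = 0.0
    for h in ham:
        for s in spam:
            if s < h:
                bad += 1.0
            elif s == h:
                bad += 0.5
    return 100.0 * bad / (len(ham) * len(spam))


# Invented scores: four ham messages, then four spam messages.
scores = [0.05, 0.20, 0.40, 0.60, 0.35, 0.70, 0.90, 0.95]
labels = [False, False, False, False, True, True, True, True]

print(one_minus_roca_percent(scores, labels))  # 12.5
print(round(lam_percent(1.0, 4.0), 2))         # 2.01
```

With hm% = 1 and sm% = 4, the arithmetic mean would be 2.5%, while lam% is about 2.01%: averaging on the log-odds scale is what makes it a geometric mean of the two error odds.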

The 1-ROCA% metric is the primary one precisely because it summarizes the filter’s performance across all threshold settings at once - it does not commit to a single operating point. lam% complements it by combining both error types “without imposing an a priori weighting of ham vs. spam errors.” TREC also computes 95% bootstrap confidence intervals per measure, per corpus, because differences smaller than the noise floor are not real differences.
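A confidence interval of this kind can be approximated with a plain percentile bootstrap. The sketch below is an assumption about the general technique, not a reproduction of TREC's procedure, which may differ in detail; `bootstrap_ci`, `hm_percent`, the resample count, and the toy data are all hypothetical.

```python
import random


def hm_percent(scores, labels, threshold=0.5):
    """Percentage of ham scoring at or above the threshold (called spam)."""
    ham = [s for s, is_spam in zip(scores, labels) if not is_spam]
    return 100.0 * sum(s >= threshold for s in ham) / len(ham)


def bootstrap_ci(scores, labels, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over resampled messages."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    pairs = list(zip(scores, labels))
    stats = []
    while len(stats) < n_resamples:
        sample = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        sample_labels = [is_spam for _, is_spam in sample]
        if all(sample_labels) or not any(sample_labels):
            continue  # skip resamples that lost all ham or all spam
        sample_scores = [s for s, _ in sample]
        stats.append(metric(sample_scores, sample_labels))
    stats.sort()
    lower = stats[int(n_resamples * alpha / 2)]
    upper = stats[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Invented scores: four ham messages, then four spam messages.
scores = [0.05, 0.20, 0.40, 0.60, 0.35, 0.70, 0.90, 0.95]
labels = [False, False, False, False, True, True, True, True]

low, high = bootstrap_ci(scores, labels, hm_percent)
print(f"hm% 95% interval: {low:.1f} to {high:.1f}")
```

On a corpus this small the interval is very wide, which is the point: two filters whose intervals overlap heavily have not been shown to differ.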

To make these concrete: the best 1-ROCA% on the public trec07p corpus under immediate feedback was 0.0055% (University of Waterloo). In the TREC 2005 evaluation, CRM114’s best configuration reached a 1-ROCA% of 0.019 on the FULL corpus - “best of all 44 filter configurations tested.” Those numbers are tiny, and they are also corpus-specific artifacts; the same filter on different mail performs differently. That is the whole point of reporting the methodology alongside the number.

The standard corpora

A filter is only as meaningful as the mail it was tested on. Three datasets recur in the literature, each with a different character and different traps.

TREC 2007 corpora

Corpus	Composition	Total
trec07p (public)	25,220 ham + 50,199 spam	75,419
MrX3 (private)	8,082 ham + 153,893 spam	161,975

trec07p was a milestone: “the first TREC public corpus that contains exclusively ham and spam sent to the same server within the same time period” - all messages delivered to one server from April 8 through July 6, 2007, including honeypot accounts. Same server, same window means the ham and spam are genuinely comparable, not stitched together from different sources (a flaw that inflates results when the classifier learns the source rather than the spamminess).

SpamAssassin public corpus

A widely used offline development set, totaling 6,047 messages at roughly 31% spam, deliberately graded by difficulty:

Subset	Count	Character
spam	500	“all received from non-spam-trap sources”
spam_2	1,397	more recent spam
easy_ham	2,500	“frequently do not contain any spammish signatures (like HTML etc)”
easy_ham_2	1,400	more recent ham
hard_ham	250	“closer in many respects to typical spam: use of HTML… ‘spammish-sounding’ phrases”

The hard_ham subset is the interesting one - it is exactly the kind of legitimate mail most at risk of a false positive, which is why it exists as a separate, harder test. The corpus reproduces “all headers in full,” with some obfuscation (some hostnames replaced with spamassassin.taint.org, which has a valid MX record).
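The corpus totals quoted above follow from the subset counts in the table; a short check of the arithmetic (the variable names are illustrative):

```python
# Subset sizes as listed in the table above.
subsets = {
    "spam": 500,
    "spam_2": 1397,
    "easy_ham": 2500,
    "easy_ham_2": 1400,
    "hard_ham": 250,
}

total = sum(subsets.values())
spam_total = sum(n for name, n in subsets.items() if name.startswith("spam"))
ham_total = total - spam_total

spam_share = 100.0 * spam_total / total
hard_ham_share = 100.0 * subsets["hard_ham"] / ham_total

print(total, spam_total)          # 6047 1897
print(round(spam_share, 1))       # 31.4
print(round(hard_ham_share, 1))   # 6.0
```

Only about 6% of the ham is hard_ham, so an overall ham error rate on this corpus can look good while the hard subset fares much worse. Reporting hm% on hard_ham separately avoids that.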

Enron Email Dataset

The Enron corpus is “the only substantial collection of ‘real’ email that is public” - about 0.5M messages from ~150 users (mostly Enron senior management), released via the FERC investigation and prepared by CMU’s CALO project (~1.7 GB, no attachments).

There is one thing you must know about it, and it is easy to get wrong: Enron is a ham corpus. It is NOT a spam corpus and contains no spam labels. Used correctly in spam research, it supplies the legitimate side of a dataset, mixed with a separate spam corpus to build a balanced set. A filter trained on Enron ham alone “will reflect senior Enron management vocabulary and may not generalise.” (A 2026 update to the CMU page also notes a forensic flaw that allowed message impersonation in the original archive, though it “probably does not affect NLP uses of the corpus.”)
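Building a balanced set from a ham-only corpus and a separate spam corpus can be sketched as follows. The function name and the in-memory message lists are hypothetical; loading either corpus from disk is left out because the layout depends on the distribution you download.

```python
import random


def build_balanced_set(ham_messages, spam_messages, seed=0):
    """Combine a ham-only corpus with a separate spam corpus, 1:1.

    Returns a shuffled list of (message, is_spam) pairs. Shuffling discards
    chronological order, so the result suits batch experiments, not a
    TREC-style one-at-a-time run.
    """
    rng = random.Random(seed)
    n = min(len(ham_messages), len(spam_messages))
    ham = rng.sample(list(ham_messages), n)    # downsample the larger side
    spam = rng.sample(list(spam_messages), n)
    labelled = [(m, False) for m in ham] + [(m, True) for m in spam]
    rng.shuffle(labelled)
    return labelled


ham_corpus = ["q3 forecast attached", "lunch on friday?", "re: pipeline capacity"]
spam_corpus = ["cheap pills now", "you have won a prize"]
dataset = build_balanced_set(ham_corpus, spam_corpus)
print(len(dataset), sum(is_spam for _, is_spam in dataset))  # 4 2
```

A set stitched together this way carries the risk described for mixed-source corpora: the classifier may learn which source a message came from instead of how spammy it is.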

The trap that invalidates offline tests: live lookups

This caveat is critical and easy to overlook. The SpamAssassin corpus readme warns directly: “Relying on data from public networked blacklists like DNSBLs, Razor, DCC or Pyzor for identification of these messages is not recommended, as a previous downloader of this corpus might have reported them!” Because the collaborative networks (see DCC, Pyzor, and Razor) are global and stateful, a public corpus that thousands of researchers have downloaded has almost certainly been reported into those networks - so live lookups against it produce results that have nothing to do with the messages’ original receipt. The corpus also warns, bluntly, “do NOT send these emails into a live email system,” because doing so generates bounces to the original senders.

The practical rule: reproducible offline evaluation requires a frozen corpus and no live network state. Any benchmark that mixes a static corpus with live reputation lookups is measuring the wrong thing.
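One way to enforce the no-live-state rule in a Python test harness is to make network calls fail loudly during the run. The sketch below is a best-effort guard under stated assumptions: `no_network` and `NetworkAccessError` are hypothetical names, and patching the standard `socket` module does not stop subprocesses or extensions that reach the network by other means. Running the evaluation in an environment with no network access at all is the stronger option.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when code under evaluation tries to reach the network."""


# Standard-library entry points for connections and DNS lookups.
_BLOCKED = ("socket", "create_connection", "getaddrinfo",
            "gethostbyname", "gethostbyname_ex")


@contextmanager
def no_network():
    """Temporarily replace socket entry points with a function that raises."""
    def blocked(*args, **kwargs):
        raise NetworkAccessError("network access attempted during offline evaluation")

    saved = {name: getattr(socket, name) for name in _BLOCKED}
    for name in _BLOCKED:
        setattr(socket, name, blocked)
    try:
        yield
    finally:
        for name, original in saved.items():
            setattr(socket, name, original)  # always restore the real functions


with no_network():
    try:
        socket.gethostbyname("example.org")  # a DNSBL-style lookup would land here
    except NetworkAccessError as exc:
        print("blocked:", exc)
```

Failing loudly is the design choice that matters: a lookup that silently returns today's network state would contaminate the result without anyone noticing.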

CEAS and the move to live evaluation

The Conference on Email and Anti-Spam (CEAS) was the venue where much of this evaluation work was published - including the foundational “Spam Corpus Creation for TREC” (Cormack & Lynam, CEAS 2005) and key active-learning papers (Sculley, CEAS 2007). TREC 2007’s epilogue announced that CEAS 2008 would “host a laboratory evaluation modeled after the spam track” plus a Live Challenge - “a real-time version of the task using a live email feed rather than an archival corpus” - an explicit acknowledgment that frozen corpora, however carefully built, cannot fully capture filtering against live, evolving mail. CEAS is now defunct: its homepage survives only as a frames stub for the 2008 conference, and individual papers are not retrievable from it.

What evaluation teaches a sender

The methodology reinforces, from a different direction, the operating principles in the rest of this library:

No universal threshold exists. Filters are measured across the whole ROC curve precisely because each deployment picks its own operating point on the ham/spam trade-off. A “score” only means something against a stated configuration.
False positives are the headline error. hm% (good mail called spam) is reported as a first-class metric, and the SpamAssassin corpus dedicates a whole graded subset (hard_ham) to the legitimate mail most likely to be misclassified.
Results are corpus-bound. Every number on this page is a property of a specific dataset and setup. The first systematic benchmark (Androutsopoulos et al., in Bayesian spam filtering) carried the same warning, and it still holds.
Live state matters. That offline corpora get “contaminated” by collaborative-network reports is the flip side of the fact that your real-world reputation is global and persistent - which is why consistent, consensual sending compounds over time.

What this means for you, and what Egressif does

The evaluation literature is the antidote to spam-score theater. It shows, with methodology, that filter performance is a trade-off curve, not a single grade; that the error filters most fear is misclassifying good mail; and that any accuracy claim is only as meaningful as the corpus behind it. For a sender, the takeaway is to distrust universal “spam score” promises and focus on the signals that travel across every receiver and every corpus: authentication, consistency, and genuine consent.

Egressif builds on the controllable, deterministic layers rather than chasing a number that does not exist. We keep authentication aligned, sending identity consistent, and lists clean - the inputs that hold up regardless of which receiver, which engine, or which operating point on the curve your mail meets. We will not quote you a universal threshold or an inbox guarantee, because the measurement science says neither exists; we make sure the things that are measurable and controllable are working for you.
