Spam corpora and filter evaluation
Filter accuracy claims mean nothing without a methodology. This explains how spam filters are evaluated - the TREC Spam Track's one-at-a-time model, the 1-ROCA% and lam% metrics, the feedback regimes - and the public corpora researchers use, with the caveats that make any single number unreliable.
Last checked: June 21, 2026
Every spam filter ships with an accuracy number, and almost none of them mean what they appear to mean. “99.9% accurate” is meaningless without answers to: measured on which mail, with what feedback, counting which errors, and against what baseline? The field’s answer to that problem is a body of shared evaluation methodology and public corpora - most importantly the TREC Spam Track - that makes filter comparisons reproducible and honest about their limits.
This page explains how filters are measured and what the standard datasets are. It is written for senders too, because the central lesson of the evaluation literature is the same one that runs through this whole library: there is no single accuracy number and no universal threshold - performance is a curve, a trade-off, and a property of a corpus and a configuration.
The 60-second version
- The TREC Spam Track is the reference methodology: messages presented one at a time, in chronological order, with the filter giving a binary verdict and a spamminess score before seeing the next.
- The primary metric is 1-ROCA% (area above the ROC curve) - lower is better - chosen because a single accuracy figure hides the ham/spam trade-off.
- Two raw error rates matter: ham misclassification (hm%) = good mail called spam, and spam misclassification (sm%) = spam called good. They trade off against each other.
- lam% combines both into one number without pre-judging which error is worse.
- Standard corpora include TREC 2007 (trec07p), the SpamAssassin public corpus, and the Enron dataset - each with important caveats (Enron has no spam labels at all).
- Live network lookups corrupt offline tests - a previous downloader may have already reported the corpus to DCC/Razor/Pyzor.
- Every reported number is a property of its corpus; none is a guarantee about your mail.
The TREC Spam Track methodology
The TREC (Text REtrieval Conference, NIST) Spam Track defined how to evaluate a filter realistically. Its model: chronologically ordered email messages are “presented one at a time to the filter,” which yields “(a) a binary spam/ham judgment and (b) a numeric spamminess score for each message before seeing the next.”
Two design decisions make it rigorous:
- Chronological order. Presenting mail in time order preserves realistic temporal signal and prevents look-ahead - “a filter trained on future messages and tested on past ones would have an artificial advantage.” One-at-a-time presentation enforces this.
- Multiple feedback regimes. Real users don’t label every message, so TREC tests four:
| Feedback mode | What it models |
|---|---|
| Immediate | The ideal user - correct label given right after each classification |
| Delayed | Feedback for the first N messages, then none (most mail classified blind) |
| Partial | Only some recipients’ messages are ever labelled (users who never report errors) |
| Active / on-line | The filter may request labels up to a fixed quota |
The headline finding from these regimes is intuitive but important: “Delayed and partial feedback degrade filter performance.” A filter that looks excellent with perfect feedback degrades as feedback dries up - which is the real-world condition. The best active-learning strategy was “uncertainty scheduling,” requesting labels “only for those messages whose score is near the filter’s threshold.”
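The uncertainty-scheduling idea can be sketched in a few lines. This is an illustration of the principle only, not any TREC participant's code: the function name `select_label_requests`, the 0.5 threshold, the 0.1 margin, and the quota of 3 are all assumptions chosen for the example.

```python
def select_label_requests(scores, threshold=0.5, margin=0.1, quota=3):
    """Return the indices of messages whose labels the filter would request.

    Messages are visited in arrival order. A label is requested only when the
    spamminess score lies within `margin` of `threshold`, and only while the
    label quota has not been used up. All default values are hypothetical.
    """
    requested = []
    for index, score in enumerate(scores):
        if len(requested) >= quota:
            break  # quota exhausted: every later message is classified blind
        if abs(score - threshold) <= margin:
            requested.append(index)
    return requested


# Hypothetical spamminess scores for seven messages, in arrival order.
scores = [0.02, 0.48, 0.97, 0.55, 0.51, 0.10, 0.45]
print(select_label_requests(scores))  # prints [1, 3, 4]
```

Confident verdicts (0.02, 0.97) never spend quota, so the limited labels go to the messages where a correction is most likely to move the decision boundary.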
The evaluation toolkit also constrained filters so that runs were comparable: no network resources during evaluation, 1 GB of temporary disk, 1 GB of RAM, and an amortized 2 seconds per message. Each filter had to implement five mandatory operations (initialize, classify, train-ham, train-spam, finalize).
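To make the one-at-a-time model concrete, here is a minimal sketch of an immediate-feedback run. It is not the TREC toolkit: `ToyFilter`, its token-counting score, and `run_immediate_feedback` are invented for illustration, and only the five method names mirror the operations listed above.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical filter exposing the five operations named above."""

    def initialize(self):
        self.ham_tokens, self.spam_tokens = Counter(), Counter()

    def classify(self, message):
        # Toy spamminess: share of past token occurrences seen in spam,
        # with add-one smoothing so an untrained filter scores 0.5.
        tokens = message.lower().split()
        spam_hits = sum(self.spam_tokens[t] for t in tokens) + 1
        ham_hits = sum(self.ham_tokens[t] for t in tokens) + 1
        score = spam_hits / (spam_hits + ham_hits)
        return ("spam" if score > 0.5 else "ham"), score

    def train_ham(self, message):
        self.ham_tokens.update(message.lower().split())

    def train_spam(self, message):
        self.spam_tokens.update(message.lower().split())

    def finalize(self):
        pass  # nothing to release in this toy example


def run_immediate_feedback(filter_, stream):
    """Classify each message BEFORE its gold label is revealed to the filter."""
    filter_.initialize()
    results = []
    for message, gold in stream:  # stream must already be in chronological order
        verdict, score = filter_.classify(message)
        results.append((gold, verdict, score))
        # Immediate feedback: the correct label arrives right after the verdict.
        if gold == "spam":
            filter_.train_spam(message)
        else:
            filter_.train_ham(message)
    filter_.finalize()
    return results


stream = [
    ("cheap pills online", "spam"),
    ("meeting notes attached", "ham"),
    ("cheap pills today", "spam"),
    ("notes from the meeting", "ham"),
]
for gold, verdict, score in run_immediate_feedback(ToyFilter(), stream):
    print(gold, verdict, round(score, 2))
```

The first spam is missed because the filter has seen nothing yet. That cold-start error is part of the measured result, which is exactly what training on the whole corpus up front would hide. Modelling delayed or partial feedback would amount to skipping the training call for some messages.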
The metrics: why one number isn’t enough
The reason filters are measured on a curve rather than a single accuracy figure is a fundamental tension TREC states plainly: “There is a natural tension between ham and spam misclassification percentages. A filter may improve one at the expense of the other.” Move the threshold to catch more spam and you block more legitimate mail; relax it to protect legitimate mail and you let more spam through.
| Metric | Definition | Direction |
|---|---|---|
| hm% (ham misclassification) | Fraction of all ham classified as spam (false-positive rate) | Lower is better |
| sm% (spam misclassification) | Fraction of all spam classified as ham (false-negative rate) | Lower is better |
| 1-ROCA% | Area above the ROC curve, as a percentage. Probabilistically: the chance a random spam scores lower than a random ham | Lower is better |
| lam% (logistic average misclassification) | logit⁻¹(½ logit(hm%) + ½ logit(sm%)) - a geometric mean of the odds of each error | Lower is better |
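The definitions in the table translate directly into code. The sketch below is a plain reading of those definitions, not the TREC toolkit's implementation; the function names are assumptions, both classes are assumed to be present, and lam is left undefined when either error rate is exactly 0 or 1.

```python
import math


def misclassification_rates(gold, verdicts):
    """Return (hm, sm) as fractions: ham called spam, spam called ham."""
    ham_total = sum(1 for g in gold if g == "ham")
    spam_total = len(gold) - ham_total
    ham_errors = sum(1 for g, v in zip(gold, verdicts) if g == "ham" and v == "spam")
    spam_errors = sum(1 for g, v in zip(gold, verdicts) if g == "spam" and v == "ham")
    return ham_errors / ham_total, spam_errors / spam_total


def logit(p):
    return math.log(p / (1 - p))


def lam(hm, sm):
    """Logistic average misclassification: inverse logit of the mean logit."""
    return 1 / (1 + math.exp(-(logit(hm) + logit(sm)) / 2))


def one_minus_roca(gold, scores):
    """Fraction of (spam, ham) pairs in which the spam scores lower; ties count half."""
    spam = [s for g, s in zip(gold, scores) if g == "spam"]
    ham = [s for g, s in zip(gold, scores) if g == "ham"]
    bad_pairs = sum((s < h) + 0.5 * (s == h) for s in spam for h in ham)
    return bad_pairs / (len(spam) * len(ham))


# Hypothetical five-message run; the verdict is "spam" when the score exceeds 0.5.
gold = ["spam", "ham", "spam", "ham", "spam"]
scores = [0.9, 0.2, 0.4, 0.6, 0.8]
verdicts = ["spam" if s > 0.5 else "ham" for s in scores]

hm, sm = misclassification_rates(gold, verdicts)
print(f"hm% = {100 * hm:.1f}, sm% = {100 * sm:.1f}")
print(f"lam% = {100 * lam(hm, sm):.1f}")
print(f"1-ROCA% = {100 * one_minus_roca(gold, scores):.1f}")
```

Note that hm%, sm%, and lam% depend on the 0.5 cut-off used to produce the verdicts, while 1-ROCA% is computed from the scores alone and would not change if the threshold moved.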
The 1-ROCA% metric is the primary one precisely because it summarizes the filter’s performance across all threshold settings at once - it does not commit to a single operating point. lam% complements it by combining both error types “without imposing an a priori weighting of ham vs. spam errors.” TREC also computes 95% bootstrap confidence intervals per measure, per corpus, because differences smaller than the noise floor are not real differences.
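A confidence interval of this kind can be approximated with a percentile bootstrap. The sketch below is one common way to do it and may differ from the procedure the TREC toolkit uses; `bootstrap_ci`, the resample count, the seed, and the twelve scores are all assumptions made for the example.

```python
import random


def one_minus_roca(gold, scores):
    """Fraction of (spam, ham) pairs in which the spam scores lower; ties count half."""
    spam = [s for g, s in zip(gold, scores) if g == "spam"]
    ham = [s for g, s in zip(gold, scores) if g == "ham"]
    bad_pairs = sum((s < h) + 0.5 * (s == h) for s in spam for h in ham)
    return bad_pairs / (len(spam) * len(ham))


def bootstrap_ci(gold, scores, metric, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric`, resampling messages with replacement."""
    if len(set(gold)) < 2:
        raise ValueError("need both ham and spam to compute the metric")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(gold)
    estimates = []
    while len(estimates) < resamples:
        sample = [rng.randrange(n) for _ in range(n)]
        resampled_gold = [gold[i] for i in sample]
        if len(set(resampled_gold)) < 2:
            continue  # a resample with one class cannot be scored; draw again
        estimates.append(metric(resampled_gold, [scores[i] for i in sample]))
    estimates.sort()
    lower_index = int(resamples * alpha / 2)
    return estimates[lower_index], estimates[resamples - 1 - lower_index]


# Hypothetical scores for six spam and six ham messages.
gold = ["spam"] * 6 + ["ham"] * 6
scores = [0.95, 0.90, 0.85, 0.70, 0.55, 0.30, 0.60, 0.40, 0.25, 0.20, 0.10, 0.05]

point = one_minus_roca(gold, scores)
low, high = bootstrap_ci(gold, scores, one_minus_roca)
print(f"1-ROCA% = {100 * point:.1f} (95% interval {100 * low:.1f} to {100 * high:.1f})")
```

With only twelve messages the interval is wide, which is the practical point: two filters whose intervals overlap heavily on the same corpus have not been shown to differ.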
To make these concrete: the best 1-ROCA% on the public trec07p corpus under immediate feedback was 0.0055% (University of Waterloo). In the TREC 2005 evaluation, CRM114’s best configuration reached a 1-ROCA% of 0.019 on the FULL corpus - “best of all 44 filter configurations tested.” Those numbers are tiny, and they are also corpus-specific artifacts; the same filter on different mail performs differently. That is the whole point of reporting the methodology alongside the number.
The standard corpora
A filter is only as meaningful as the mail it was tested on. Three datasets recur in the literature, each with a different character and different traps.
TREC 2007 corpora
| Corpus | Composition | Total |
|---|---|---|
| trec07p (public) | 25,220 ham + 50,199 spam | 75,419 |
| MrX3 (private) | 8,082 ham + 153,893 spam | 161,975 |
trec07p was a milestone: “the first TREC public corpus that contains exclusively ham and spam sent to the same server within the same time period” - all messages delivered to one server from April 8 through July 6, 2007, including honeypot accounts. Same server, same window means the ham and spam are genuinely comparable, not stitched together from different sources (a flaw that inflates results when the classifier learns the source rather than the spamminess).
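The composition figures in the table also show why a bare accuracy number needs a baseline. As a small worked example (the function name `always_spam_baseline` is an assumption; the counts are the ones in the table), consider a "filter" that labels every message spam:

```python
# Ham and spam counts copied from the table above.
CORPORA = {
    "trec07p": {"ham": 25_220, "spam": 50_199},
    "MrX3": {"ham": 8_082, "spam": 153_893},
}


def always_spam_baseline(ham, spam):
    """Accuracy, hm% and sm% of a degenerate filter that calls everything spam."""
    total = ham + spam
    return {
        "total": total,
        "accuracy_pct": 100 * spam / total,  # every spam is "right", every ham is wrong
        "hm_pct": 100.0,                     # all legitimate mail is misclassified
        "sm_pct": 0.0,                       # no spam gets through
    }


for name, counts in CORPORA.items():
    result = always_spam_baseline(**counts)
    print(
        f"{name}: total={result['total']}, "
        f"accuracy={result['accuracy_pct']:.1f}%, "
        f"hm%={result['hm_pct']:.0f}, sm%={result['sm_pct']:.0f}"
    )
```

On trec07p this useless filter is about 66.6% "accurate", and on MrX3 about 95.0%, while blocking every legitimate message. Accuracy alone therefore says little on a spam-heavy corpus, which is why hm% and sm% are reported separately.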
SpamAssassin public corpus
A widely used offline development set, totaling 6,047 messages at roughly 31% spam, deliberately graded by difficulty:
| Subset | Count | Character |
|---|---|---|
| spam | 500 | “all received from non-spam-trap sources” |
| spam_2 | 1,397 | more recent spam |
| easy_ham | 2,500 | “frequently do not contain any spammish signatures (like HTML etc)” |
| easy_ham_2 | 1,400 | more recent ham |
| hard_ham | 250 | “closer in many respects to typical spam: use of HTML… ‘spammish-sounding’ phrases” |
The hard_ham subset is the interesting one - it is exactly the kind of legitimate mail most at risk of a false positive, which is why it exists as a separate, harder test. The corpus reproduces “all headers in full,” with some obfuscation (some hostnames replaced with spamassassin.taint.org, which has a valid MX record).
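A short calculation shows why hard_ham deserves separate reporting. The subset sizes below come from the table; the false-positive counts are invented purely for illustration and are not results from any real filter.

```python
# Subset sizes from the table above.
SUBSETS = {"spam": 500, "spam_2": 1_397, "easy_ham": 2_500, "easy_ham_2": 1_400, "hard_ham": 250}

# HYPOTHETICAL false-positive counts per ham subset, made up for this example.
FALSE_POSITIVES = {"easy_ham": 2, "easy_ham_2": 1, "hard_ham": 20}

total = sum(SUBSETS.values())
spam_total = sum(n for name, n in SUBSETS.items() if name.startswith("spam"))
ham_total = total - spam_total

print(f"total messages: {total}, spam share: {100 * spam_total / total:.1f}%")
print(f"hard_ham share of all ham: {100 * SUBSETS['hard_ham'] / ham_total:.1f}%")

# Aggregate hm% versus hm% on each ham subset.
overall_hm = 100 * sum(FALSE_POSITIVES.values()) / ham_total
print(f"overall hm%: {overall_hm:.2f}")
for name, errors in FALSE_POSITIVES.items():
    print(f"  {name} hm%: {100 * errors / SUBSETS[name]:.2f}")
```

Because hard_ham is only about 6% of the ham, the made-up filter above shows an aggregate hm% of roughly 0.55 while misclassifying 8% of the hard messages. Reporting per-subset rates keeps that from disappearing into the average.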
Enron Email Dataset
The Enron corpus is “the only substantial collection of ‘real’ email that is public” - about 0.5M messages from ~150 users (mostly Enron senior management), released via the FERC investigation and prepared by CMU’s CALO project (~1.7 GB, no attachments).
There is one thing you must know about it, and it is easy to get wrong: Enron is a ham corpus. It is NOT a spam corpus and contains no spam labels. Used correctly in spam research, it supplies the legitimate side of a dataset, mixed with a separate spam corpus to build a balanced set. A filter trained on Enron ham alone “will reflect senior Enron management vocabulary and may not generalise.” (A 2026 update to the CMU page also notes a forensic flaw that allowed message impersonation in the original archive, though it “probably does not affect NLP uses of the corpus.”)
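Mixing a ham-only corpus with a separate spam corpus might look like the sketch below. The function name `mix_corpora`, the seed, and the placeholder messages are assumptions; loading real messages from disk is deliberately left out.

```python
import random


def mix_corpora(ham_messages, spam_messages, seed=0):
    """Build a balanced, labelled set from a ham-only and a spam-only source."""
    rng = random.Random(seed)  # fixed seed so the same mix can be rebuilt later
    size = min(len(ham_messages), len(spam_messages))
    ham = rng.sample(ham_messages, size)
    spam = rng.sample(spam_messages, size)
    labelled = [(m, "ham") for m in ham] + [(m, "spam") for m in spam]
    rng.shuffle(labelled)  # note: this discards any chronological ordering
    return labelled


# Placeholder strings standing in for messages from the two sources.
ham_source = [f"ham message {i}" for i in range(5)]
spam_source = [f"spam message {i}" for i in range(3)]

dataset = mix_corpora(ham_source, spam_source)
print(len(dataset), sum(1 for _, label in dataset if label == "spam"))
```

A set built this way has two limits worth stating next to any result. Shuffling removes the time order that the TREC methodology depends on, and because the ham and spam come from different sources, a classifier can score well by learning the source rather than the spamminess.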
The trap that invalidates offline tests: live lookups
This caveat is critical and easy to overlook. The SpamAssassin corpus readme warns directly: “Relying on data from public networked blacklists like DNSBLs, Razor, DCC or Pyzor for identification of these messages is not recommended, as a previous downloader of this corpus might have reported them!” Because the collaborative networks (see DCC, Pyzor, and Razor) are global and stateful, a public corpus that thousands of researchers have downloaded has almost certainly been reported into those networks - so live lookups against it produce results that have nothing to do with the messages’ original receipt. The corpus also warns, bluntly, “do NOT send these emails into a live email system,” because doing so generates bounces to the original senders.
The practical rule: reproducible offline evaluation requires a frozen corpus and no live network state. Any benchmark that mixes a static corpus with live reputation lookups is measuring the wrong thing.
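One way to enforce the "no live network state" half of that rule in a Python test harness is to make network calls fail loudly during a run. This is a sketch under stated assumptions: `no_network` and `NetworkBlocked` are hypothetical names, and the guard only covers in-process calls to `socket.getaddrinfo` and `socket.socket.connect`. It does not stop subprocesses or UDP traffic sent without `connect`, so a DNSBL query issued from a child process would still get through.

```python
import socket
from contextlib import contextmanager
from unittest import mock


class NetworkBlocked(RuntimeError):
    """Raised when code under evaluation tries to reach the network."""


def _blocked(*args, **kwargs):
    raise NetworkBlocked("network access is disabled during offline evaluation")


@contextmanager
def no_network():
    """Block in-process name resolution and outbound connects for the duration."""
    with mock.patch.object(socket, "getaddrinfo", _blocked), \
         mock.patch.object(socket.socket, "connect", _blocked):
        yield  # both patches are undone automatically on exit


with no_network():
    try:
        socket.getaddrinfo("example.com", 80)
    except NetworkBlocked as exc:
        print("blocked:", exc)
```

Wrapping the whole evaluation loop in `with no_network():` turns an accidental live lookup into an immediate error instead of a silently contaminated score.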
CEAS and the move to live evaluation
The Conference on Email and Anti-Spam (CEAS) was the venue where much of this evaluation work was published - including the foundational “Spam Corpus Creation for TREC” (Cormack & Lynam, CEAS 2005) and key active-learning papers (Sculley, CEAS 2007). TREC 2007’s epilogue announced that CEAS 2008 would “host a laboratory evaluation modeled after the spam track” plus a Live Challenge - “a real-time version of the task using a live email feed rather than an archival corpus” - an explicit acknowledgment that frozen corpora, however careful, cannot fully capture filtering against live, evolving mail. CEAS is now defunct: its homepage survives only as a frames stub for the 2008 conference, and individual papers are not retrievable from it.
What evaluation teaches a sender
The methodology reinforces, from a different direction, the operating principles in the rest of this library:
- No universal threshold exists. Filters are measured across the whole ROC curve precisely because each deployment picks its own operating point on the ham/spam trade-off. A “score” only means something against a stated configuration.
- False positives are the headline error. hm% (good mail called spam) is reported as a first-class metric, and the SpamAssassin corpus dedicates a whole graded subset (hard_ham) to the legitimate mail most likely to be misclassified.
- Results are corpus-bound. Every number on this page is a property of a specific dataset and setup. The first systematic benchmark (Androutsopoulos et al., in Bayesian spam filtering) carried the same warning, and it still holds.
- Live state matters. That offline corpora get “contaminated” by collaborative-network reports is the flip side of the fact that your real-world reputation is global and persistent - which is why consistent, consensual sending compounds over time.
What this means for you, and what Egressif does
The evaluation literature is the antidote to spam-score theater. It shows, with methodology, that filter performance is a trade-off curve, not a single grade; that the error filters most fear is misclassifying good mail; and that any accuracy claim is only as meaningful as the corpus behind it. For a sender, the takeaway is to distrust universal “spam score” promises and focus on the signals that travel across every receiver and every corpus: authentication, consistency, and genuine consent.
Egressif builds on the controllable, deterministic layers rather than chasing a number that does not exist. We keep authentication aligned, sending identity consistent, and lists clean - the inputs that hold up regardless of which receiver, which engine, or which operating point on the curve your mail meets. We will not quote you a universal threshold or an inbox guarantee, because the measurement science says neither exists; we make sure the things that are measurable and controllable are working for you.
Related references
- How receiver-side spam filtering actually works: A spam filter is not one test with one threshold. It is a layered pipeline - connection and reputation checks, authentication, statistical content analysis, collaborative checksum networks, and rules engines - whose signals combine into a single decision. This page walks the whole chain so you can see where a legitimate message can go wrong.
- Bayesian and statistical spam filtering: Statistical filters do not match keywords - they learn token probabilities from a receiver's own ham and spam, then combine the most telling tokens with Bayes' rule. This is the history, the math, and the operational caveats, ending with what it means for a legitimate sender.
- Fuzzy hashing and near-duplicate detection: A single changed character defeats an exact hash, so bulk-mail detection needs hashing that tolerates variation. This explains the difference between exact and fuzzy hashing, how DCC and Rspamd implement near-duplicate detection, and why "bulk" is not the same as "spam."
- DCC, Pyzor, and Razor - collaborative checksum networks: DCC, Pyzor, and Vipul's Razor let many receivers pool what they see, so a message sent in bulk is recognized as bulk even when each copy is personalized. Here is how each one works, how they differ, what they catch and miss, and the whitelist rule every legitimate bulk sender depends on.
- Apache SpamAssassin architecture: SpamAssassin classifies mail by running many rules, each contributing a positive or negative score, and tagging messages whose total crosses a configurable threshold. Here is the scoring model, the plugin architecture, the network tests, the Bayes subsystem, and how training really works.
- Rspamd architecture: Rspamd is an event-driven filtering framework that sits between the MTA and the internet, runs dozens of modules in parallel, sums named symbols into a score, and maps that score to an action. Here is the pipeline, the scoring and action model, fuzzy storage, and why it is fast.