Apache SpamAssassin architecture
SpamAssassin classifies mail by running many rules, each contributing a positive or negative score, and tagging messages whose total crosses a configurable threshold. Here is the scoring model, the plugin architecture, the network tests, the Bayes subsystem, and how training really works.
Last checked: June 22, 2026
Apache SpamAssassin is “an extensible email filter used to identify spam,” and the word that matters is extensible. It is not a single algorithm - it is a framework that runs a large body of independent tests against a message’s headers and content, “to classify email using advanced statistical methods,” and then adds up what they found. Understanding it well means understanding one idea deeply: no single rule decides anything; the total does.
This page is written for senders. SpamAssassin is one of the two engines (with Rspamd) you are most likely to meet on the receiving side, and because it exposes its reasoning in headers you can actually read, knowing its model tells you precisely how the ham-protective and spam-like signals you generate get weighed.
The 60-second version
- SpamAssassin runs many rules; each rule that fires contributes a score - positive for spam-like, negative for ham-like.
- Rules come in types - header, body, rawbody, uri, full, and meta rules that combine other rules with boolean/arithmetic logic.
- The scores sum into one number. By default, 5.0 or higher is tagged as spam - but that default is explicitly described as “quite aggressive,” and operators routinely raise it.
- Default scores are not hand-picked: they are generated by a perceptron (a neural net trained with back-propagation) over a labeled corpus, in four “score sets” for the Bayes/network on-off combinations.
- It records its reasoning in X-Spam-* headers on every message, including the exact tests that fired and the total score.
- Tests span five families: header analysis, body/content, authentication, Bayesian statistics, and network/distributed checks.
- The Bayes plugin is one rule among many - and it needs at least 200 spam + 200 ham before it will score.
- Rules are kept current out-of-band with sa-update, which installs GPG-signed rule channels.
- It is written in Perl, runs as a standalone script, a spamd/spamc daemon (recommended for speed), or an embedded library, and is configured through layered .pre and .cf files.
The scoring model
This is the heart of it. From the documentation: by default, “all messages with a calculated score of 5.0 or higher are tagged as spam.” Each rule that fires “contributes its configured score (positive = spam-like; negative = ham-like); the total is the message’s spam score.” So a passing DKIM signature or a recognized welcome-list subject can subtract points, partly offsetting spam-like rules that also fired. The verdict is a signed sum, not a single tripwire.
X-Spam-Status: Yes, score=7.3 required=5.0
tests=BAYES_99,HTML_IMAGE_ONLY_28,RDNS_NONE,DKIM_VALID
autolearn=no
The header above (format per the docs) is the engine showing its work: the total score, the required threshold it was compared against, the named tests that fired, and the autolearn outcome.
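Because the header is plain text, a sender can pull the numbers out of a test message mechanically. The following is a minimal Python sketch, not part of SpamAssassin: the function name `parse_spam_status` is hypothetical, and it assumes the header value follows the format shown above, with folded continuation lines.

```python
import re


def parse_spam_status(value: str) -> dict:
    """Split an X-Spam-Status header value into its named parts.

    Assumes the documented layout: a Yes/No verdict followed by
    key=value fields (score, required, tests, autolearn).
    """
    # Header values may be folded across lines; collapse all whitespace,
    # then rejoin a test list that was folded after a comma.
    flat = " ".join(value.split())
    flat = re.sub(r",\s+", ",", flat)

    verdict = flat.split(",", 1)[0].strip()
    fields = dict(re.findall(r"(\w+)=(\S+)", flat))

    # Treat an empty list or the literal "none" as "no tests fired".
    tests = [t for t in fields.get("tests", "").split(",") if t and t != "none"]

    return {
        "is_spam": verdict.lower() == "yes",
        "score": float(fields["score"]),
        "required": float(fields["required"]),
        "tests": tests,
        "autolearn": fields.get("autolearn"),
    }


example = """Yes, score=7.3 required=5.0
 tests=BAYES_99,HTML_IMAGE_ONLY_28,RDNS_NONE,DKIM_VALID
 autolearn=no"""

status = parse_spam_status(example)
print(status["score"], status["required"], status["tests"])
```

The margin between `score` and `required` is usually the more useful number than the verdict alone, since it shows how many points of spam-like signal separate a message from the other side of the line.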
The headers SpamAssassin writes
| Header | What it carries |
|---|---|
| X-Spam-Status | (Yes\|No), score=nn required=nn tests=... autolearn=(ham\|spam\|no\|unavailable\|failed) - written on every message |
| X-Spam-Flag | Set to YES on messages scoring at or above the threshold |
| X-Spam-Level | A row of * characters, one per full score point - easy to filter on |
| X-Spam-Checker-Version | SpamAssassin version and the host that scanned the message |
Two operational details follow. First, when a message is tagged as spam, SpamAssassin “creates a new report message and attaches the original as a message/rfc822 MIME part” - the original is preserved intact. Second, and important for trust: “before header modification and addition, all headers beginning with X-Spam- are removed to prevent spammer mischief.” You cannot pre-stamp your own X-Spam-Status: No and have it survive - the receiver strips inbound X-Spam-* before doing its own analysis.
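The stripping step is easy to picture in code. This is an illustrative Python sketch of the idea only, using the standard `email` module; it is not SpamAssassin's own (Perl) implementation, and the function name `strip_spam_headers` is hypothetical.

```python
from email import message_from_string


def strip_spam_headers(raw: str):
    """Remove every inbound header whose name begins with X-Spam-."""
    msg = message_from_string(raw)
    # Collect the names first so the message is not mutated while iterating.
    forged = {name for name in msg.keys() if name.lower().startswith("x-spam-")}
    for name in forged:
        del msg[name]  # deletes all occurrences, matching case-insensitively
    return msg


raw = (
    "X-Spam-Status: No, score=-9.9 required=5.0\n"
    "X-Spam-Flag: NO\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
cleaned = strip_spam_headers(raw)
print(cleaned.keys())
```

Whatever a sender pre-stamps is gone before the receiver's own analysis writes fresh headers.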
Why the threshold is not a universal number
The 5.0 is a default, not a standard - and the documentation itself calls it “quite aggressive,” “suitable for a single-user setup,” and advises that “if you’re an ISP installing SpamAssassin, you should probably set the default to be more conservative, like 8.0 or 10.0.” It goes further on the destructive end: “It is not recommended to automatically delete or discard messages marked as spam… but if you choose to do so, only delete messages with an exceptionally high score such as 15.0 or higher.” Scores can also be loaded per-site and per-user, from files or from SQL/LDAP, and overridden at multiple levels. So “below 5.0” is meaningful only against an unmodified install; real deployments raise, lower, and re-weight freely. Any tool that promises “your SpamAssassin score” against a fixed 5.0 line is describing one possible configuration, not a fact about every receiver.
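To make the point concrete, here is a small Python sketch that compares one score against the three thresholds quoted above. The function name `is_tagged_spam` is hypothetical; the only behavior assumed is the documented "score of N or higher is tagged" comparison.

```python
def is_tagged_spam(score: float, required: float = 5.0) -> bool:
    """Tag when the total reaches the configured threshold (at or above)."""
    return score >= required


score = 7.3  # the total from the example header earlier on this page
for required in (5.0, 8.0, 10.0):
    print(f"required={required}: tagged={is_tagged_spam(score, required)}")
```

The same 7.3 is spam on an unmodified install and ham on an ISP-style 8.0 or 10.0 configuration, which is why a score means little without the `required` value next to it.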
Rule types and meta rules
A “rule” is not one thing. SpamAssassin’s configuration grammar defines several rule types by what part of the message the pattern runs against, and they behave differently:
| Rule type | Runs against |
|---|---|
| header | a named header field (or exists:/eval: tests on headers) |
| body | the visible body text, HTML-stripped and whitespace-normalized |
| rawbody | the body before HTML decoding - useful for catching markup tricks |
| uri | URIs extracted from the message |
| full | the entire raw message, headers and body together |
| meta | a boolean/arithmetic combination of other rules |
| eval | a named Perl function (used for plugin checks like SPF, DCC, Razor2) |
The interesting one is meta, because it is how SpamAssassin expresses “this combination is suspicious even though no single part is.” A meta rule scores only when its logical expression over other rules is true:
meta FOO_META TEST1 && !(TEST2 || TEST3)
meta BAR_META (3 * TEST1 - 2 * TEST2) > 0
meta URI_META (__HAS_LINK_A + __HAS_LINK_B + __HAS_LINK_C) >= 2
That last form leans on a convention: rules whose names begin with a double underscore (__HAS_LINK_A) are sub-rules - they carry no score of their own and exist only to be referenced by meta rules. Combined with tflags ... multiple (which lets one pattern match repeatedly, up to a maxhits cap), this lets a rule set count occurrences and trigger on thresholds. The point for a sender: the spam-like and ham-like signals you generate are not evaluated in isolation; a meta rule can react to a pattern of them.
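The three meta expressions above can be read as ordinary boolean and arithmetic logic over hit counts. The Python sketch below mirrors them one for one; it is an illustration, not SpamAssassin's parser, and it assumes each rule's value is its number of hits (0 when it did not fire). The function name `evaluate_meta_rules` is hypothetical.

```python
def evaluate_meta_rules(hits: dict) -> dict:
    """Evaluate the three example meta rules against a map of rule -> hit count."""

    def h(name: str) -> int:
        # A rule that did not fire counts as 0.
        return hits.get(name, 0)

    return {
        # meta FOO_META TEST1 && !(TEST2 || TEST3)
        "FOO_META": bool(h("TEST1") and not (h("TEST2") or h("TEST3"))),
        # meta BAR_META (3 * TEST1 - 2 * TEST2) > 0
        "BAR_META": (3 * h("TEST1") - 2 * h("TEST2")) > 0,
        # meta URI_META (__HAS_LINK_A + __HAS_LINK_B + __HAS_LINK_C) >= 2
        "URI_META": (h("__HAS_LINK_A") + h("__HAS_LINK_B") + h("__HAS_LINK_C")) >= 2,
    }


fired = {"TEST1": 1, "__HAS_LINK_A": 1, "__HAS_LINK_C": 1}
print(evaluate_meta_rules(fired))
```

Note that neither `__HAS_LINK_A` nor `__HAS_LINK_C` scores on its own; only the combination trips `URI_META`.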
The plugin set: what SpamAssassin checks
SpamAssassin’s “modular architecture… allows other technologies to be quickly wielded against spam.” Its default install ships 29 plugins, which fall into five test families plus URI checks and a miscellaneous group:
| Family | Representative default plugins |
|---|---|
| Header analysis | HeaderEval, RelayEval, MIMEHeader |
| Body / content | BodyEval, HTMLEval, MIMEEval, ImageInfo |
| Authentication | SPF, DKIM, DMARC |
| Bayesian statistics | Bayes, AutoLearnThreshold |
| Network / distributed | DNSEval, URIDNSBL, AskDNS, Pyzor, Razor2, SpamCop, HashBL |
| URI / URL checks | URIDetail, URIEval, HTTPSMismatch, FreeMail |
| Miscellaneous | VBounce, WLBLEval, WelcomeListSubject, ReplaceTags, Check |
The authentication plugins (SPF, DKIM, DMARC) are the ones a sender most directly influences - they convert your published records into score adjustments. The network family is where SpamAssassin reaches out to the collaborative checksum networks: Pyzor, Razor2, DNS blocklists via URIDNSBL/DNSEval, and SpamCop.
Network and collaborative tests
SpamAssassin integrates the checksum networks as plugins, with their own timeouts and quirks:
- DCC is described in the plugin docs as “a system of servers collecting and counting checksums of millions of mail messages,” using checksums that are “fuzzy and ignore aspects of messages.” Crucially, DCC is disabled by default in init.pre “because it is not open source.” Its match thresholds (dcc_body_max, dcc_fuz1_max, dcc_fuz2_max) default to 999999 - DCC’s “MANY” count - and dcc_timeout defaults to 8 seconds. If a dccifd socket is present, SpamAssassin prefers it over dccproc.
- Razor2 “calculates a signature for each part of a multipart message” and returns a 0–100 confidence; check_razor2() fires when confidence reaches the configured min_cf. It defaults to a 5-second timeout and forks for asynchronous operation (razor_fork 1, though it is experimental on Windows where the default is 0).
These are score inputs like any other rule - a DCC or Razor hit adds points, it does not by itself reject the message.
How the default scores are assigned
A natural question is where a number like “BAYES_99 = 3.5” comes from. The answer is that the default scores are not chosen by hand. Per the project wiki: “The scores are assigned using a neural network trained with error back propagation (Perceptron),” optimized “in terms of minimizing the number of false positives and false negatives” against a large labeled corpus assembled from volunteer mass-checks. (Older 2.x SpamAssassin used a genetic algorithm instead.) This is why two facts surprise people:
- Scores come in sets of four. A rule can list four scores; which one applies depends on whether Bayes and network tests are enabled - score set 0 (neither), 1 (network only), 2 (Bayes only), 3 (both). The optimizer tunes each combination separately, because a rule’s usefulness changes when other signals are present.
- The learn-rule scores can look “wrong.” BAYES_80 can carry a higher score than BAYES_99. The wiki explains this directly: the rules are independent, and because a genuinely spammy message tends to trip other rules too, the optimizer can afford to lower a high-confidence Bayes rule’s own score to protect against false positives - the message still crosses the threshold on the sum.
One operational corollary: a rule with a score of exactly 0 is not run at all, which is how the default config ships paid or off-by-default blocklist rules in a disabled state.
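The score-set numbering described above maps onto two on/off switches. The Python sketch below shows that mapping and the zero-score corollary; the function names and the four example scores are hypothetical, chosen only for illustration. It also relies on the configuration rule that a single listed score applies to every set.

```python
def score_set_index(bayes_enabled: bool, network_enabled: bool) -> int:
    """0 = neither, 1 = network only, 2 = Bayes only, 3 = both."""
    return (2 if bayes_enabled else 0) + (1 if network_enabled else 0)


def active_score(scores: list, bayes_enabled: bool, network_enabled: bool) -> float:
    """Pick the score that applies under the current configuration."""
    if len(scores) == 1:
        # A single listed score is used for every score set.
        return scores[0]
    return scores[score_set_index(bayes_enabled, network_enabled)]


def rule_is_run(scores: list, bayes_enabled: bool, network_enabled: bool) -> bool:
    """A rule whose active score is exactly 0 is not run at all."""
    return active_score(scores, bayes_enabled, network_enabled) != 0


# Hypothetical four-score rule: disabled unless network tests are on.
example_scores = [0, 1.2, 0, 2.0]
print(active_score(example_scores, bayes_enabled=True, network_enabled=True))
print(rule_is_run(example_scores, bayes_enabled=True, network_enabled=False))
```

This is why the same rule name can contribute different points on two installs running identical rule files: the switches select a different column.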
The Bayes subsystem
SpamAssassin includes a Bayesian classifier (the Bayes plugin) as one scoring component. It “tries to identify spam by looking at what are called tokens; words or short character sequences that are commonly found in spam or ham.” (The general theory - tokenization, the f(w)/Fisher math, training corpora - is covered in Bayesian and statistical spam filtering.) What is specific to SpamAssassin:
- It will not score until trained. “The bayesian classifier can only score new messages if it already has 200 known spams and 200 known hams.” Below that, the Bayes rules simply do not contribute.
- Training is via sa-learn, separately for each class: sa-learn --mbox --spam spam-file and sa-learn --mbox --ham ham-file. The docs stress: “It is important to do both.”
- Do not train on the wrong data. A direct warning: “Do not train Bayes on different mail streams or public spam corpora. These methods will mislead Bayes into believing certain tokens are spammy or hammy when they are not.” This is why a SpamAssassin Bayes score is local to its deployment.
- Mistake-driven training is the intended workflow: feed missed spam to sa-learn --spam, and false positives to sa-learn --ham.
- Header handling: sa-learn ignores standard SpamAssassin headers and will decapsulate an email that was attached as a report; bayes_ignore_header in local.cf excludes misleading upstream headers from token extraction.
There is one distinction that trips people up: training is not reporting. Running sa-learn updates only the local Bayes database. Reporting a message to the DCC/Pyzor/Razor networks is a separate action - spamassassin -r < message - and the autolearn= value in X-Spam-Status reflects whether automatic learning happened, not whether anything was reported upstream.
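The training rules in this section reduce to two small pieces of logic: a gate on corpus size, and a choice of `sa-learn` flag per class. The Python sketch below captures both without running anything. The function names are hypothetical, and the command builder only assembles the `--mbox`, `--spam`, and `--ham` invocations quoted above.

```python
MIN_SPAM = 200  # known spams required before Bayes will score
MIN_HAM = 200   # known hams required before Bayes will score


def bayes_can_score(nspam: int, nham: int) -> bool:
    """Bayes contributes nothing until both classes reach the minimum."""
    return nspam >= MIN_SPAM and nham >= MIN_HAM


def sa_learn_command(kind: str, mbox_path: str) -> list:
    """Build (but do not run) the sa-learn command for one class of mail."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--mbox", f"--{kind}", mbox_path]


print(bayes_can_score(500, 199))
print(sa_learn_command("ham", "false-positives.mbox"))
```

Neither function touches the checksum networks; reporting remains the separate `spamassassin -r` action.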
Configuration and how the engine is run
SpamAssassin loads configuration in a defined order: .pre files first (lexical order), then .cf files (lexical order) - e.g. init.pre → 10_default_prefs.cf → 20_body_tests.cf → 50_scores.cf. Global defaults live in directories like /var/lib/spamassassin/<version> or /usr/share/spamassassin (first existing wins); site overrides live in /etc/mail/spamassassin; and per-user preferences in ~/.spamassassin/user_prefs. Scores and rules can additionally be loaded from SQL or LDAP, and extra plugins are pulled in with loadplugin directives in .pre files. (A taint-security note: you cannot use PERL5LIB to relocate SpamAssassin’s modules, because Perl’s taint checks forbid it.)
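The load order is what decides which setting wins when two files disagree, so it helps to see it as a sort. This Python sketch reproduces the ".pre first, then .cf, each in lexical order" rule for the files of a single directory. The function name `load_order` is hypothetical, and it assumes plain ASCII filenames so that Python's string ordering matches the lexical order the documentation describes.

```python
def load_order(filenames: list) -> list:
    """Return config files in the order described: .pre files, then .cf files."""
    pre_files = sorted(name for name in filenames if name.endswith(".pre"))
    cf_files = sorted(name for name in filenames if name.endswith(".cf"))
    return pre_files + cf_files


files = ["50_scores.cf", "local.cf", "init.pre", "20_body_tests.cf", "10_default_prefs.cf"]
for name in load_order(files):
    print(name)
```

Numeric prefixes sort ahead of names that start with a letter, so a file such as `local.cf` is read after the numbered files in the same directory.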
It runs three ways:
- The spamassassin script - a standalone filter.
- spamd + spamc - a persistent daemon and a thin client; spamc is “faster than spamassassin” because it reuses the long-running spamd process rather than re-initializing Perl per message. This is the recommended mode for any real volume.
- An embedded library (Mail::SpamAssassin) inside another application.
The exact spamd wire protocol and transport are not specified in the overview documentation, so this page does not assert them.
Keeping rules current: sa-update
Rules and scores are not frozen at install time - SpamAssassin ships sa-update to “automate the process of downloading and installing new rules and configuration, based on channels.” The mechanics are worth knowing because they explain how a sender’s environment can shift under them without a software upgrade:
- Channels are DNS-anchored. The default channel is updates.spamassassin.org; a channel’s latest version is published as a DNS TXT record, and an update is fetched over HTTP from mirrors listed in a MIRRORED.BY file.
- Updates are GPG-signed by default. sa-update only trusts archives signed by “release trusted” keys (the standard SpamAssassin release key and its sub-key are trusted out of the box); if GPG is disabled, only SHA-512/SHA-256 integrity checks remain, which “does not offer any form of security regarding whether or not the downloaded archive is legitimate.”
- Plugins are blocked unless explicitly allowed. Downloaded loadplugin lines are commented out by default, because “plugins can execute unrestricted code on your system, even possibly as root.”
- It does not reload the scanner. sa-update returns exit code 0 only when it actually installed an update, so the documented idiom is sa-update && service spamassassin reload.
The practical upshot for a sender: the exact rule that scores your mail today may be replaced tomorrow by a routine sa-update, which is one more reason no fixed “SpamAssassin score” is portable across time, let alone across deployments.
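The update-then-reload idiom can be expressed as a short wrapper that reloads only when `sa-update` exits with 0. This is a Python sketch under stated assumptions: the function name `update_and_reload` is hypothetical, the reload command is the one quoted above and may differ on a given system, and the `run` parameter exists only so the logic can be exercised without touching a real install.

```python
import subprocess


def update_and_reload(run=subprocess.run) -> bool:
    """Run sa-update; reload the scanner only if an update was installed.

    Returns True when a reload was triggered, False otherwise.
    """
    result = run(["sa-update"])
    if result.returncode == 0:
        # Exit code 0 means an update was actually installed.
        run(["service", "spamassassin", "reload"])
        return True
    # Any non-zero exit code: nothing new was installed, so do not reload.
    return False
```

Calling `update_and_reload()` from a scheduled job has the same effect as the shell one-liner, with the advantage that the non-zero case can be logged or alerted on separately.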
A realistic picture of SpamAssassin’s place
SpamAssassin is mature, Apache-licensed, and process-oriented: in the classic model each scan spins up a fresh interpreter context, which is part of why spamd exists and why throughput is modest compared to event-driven engines (the Rspamd comparison is on its own page). It is most accurately seen not as “the spam filter” but as a flexible scoring host that pulls together authentication results, content heuristics, a local Bayesian opinion, and the collaborative networks into one tunable number. The reason it never auto-deletes on a single rule is the same one the whole stack obeys - the asymmetric cost of a false positive.
What this means for you, and what Egressif does
SpamAssassin is unusually honest about its reasoning - it writes the firing tests and the score into headers you can read. That transparency is a gift to a sender: it makes clear that the way to a good score is to make the ham-protective rules fire and the spam-like ones stay quiet. The authentication plugins (SPF, DKIM, DMARC) are exactly the levers you own, and a clean pass on each subtracts from the total.
Egressif focuses on those controllable inputs - aligned SPF, DKIM, and DMARC, consistent sending identity, and disciplined list hygiene - so that when a message reaches a SpamAssassin install, the authentication rules score in your favor and there is little spam-like signal to accumulate. We do not tune anyone else’s 50_scores.cf or train their Bayes database, and we will not pretend a “5.0” is a universal line. We make sure the parts you control arrive clean, which over time is what builds the sending reputation those rules ultimately reward.
Related references
- How receiver-side spam filtering actually works: A spam filter is not one test with one threshold. It is a layered pipeline - connection and reputation checks, authentication, statistical content analysis, collaborative checksum networks, and rules engines - whose signals combine into a single decision. This page walks the whole chain so you can see where a legitimate message can go wrong.
- Bayesian and statistical spam filtering: Statistical filters do not match keywords - they learn token probabilities from a receiver's own ham and spam, then combine the most telling tokens with Bayes' rule. This is the history, the math, and the operational caveats, ending with what it means for a legitimate sender.
- Fuzzy hashing and near-duplicate detection: A single changed character defeats an exact hash, so bulk-mail detection needs hashing that tolerates variation. This explains the difference between exact and fuzzy hashing, how DCC and Rspamd implement near-duplicate detection, and why "bulk" is not the same as "spam."
- DCC, Pyzor, and Razor - collaborative checksum networks: DCC, Pyzor, and Vipul's Razor let many receivers pool what they see, so a message sent in bulk is recognized as bulk even when each copy is personalized. Here is how each one works, how they differ, what they catch and miss, and the whitelist rule every legitimate bulk sender depends on.
- Rspamd architecture: Rspamd is an event-driven filtering framework that sits between the MTA and the internet, runs dozens of modules in parallel, sums named symbols into a score, and maps that score to an action. Here is the pipeline, the scoring and action model, fuzzy storage, and why it is fast.
- Greylisting, tarpitting, and rate controls: Before any content is scored, receivers can slow or defer suspicious connections. Greylisting (standardized in RFC 6647) returns a temporary failure that legitimate senders retry and most spamware abandons. Here is the mechanism, the RFC's recommended implementation, the failure modes, and where tarpitting and rate limiting fit.