Apache SpamAssassin architecture

SpamAssassin classifies mail by running many rules, each contributing a positive or negative score, and tagging messages whose total crosses a configurable threshold. This page covers the scoring model, the plugin architecture, the network tests, the Bayes subsystem, and how training actually works.

Last checked: June 22, 2026

Apache SpamAssassin is “an extensible email filter used to identify spam,” and the word that matters is extensible. It is not a single algorithm - it is a framework that runs a large body of independent tests against a message’s headers and content, “to classify email using advanced statistical methods,” and then adds up what they found. Understanding it well means understanding one idea deeply: no single rule decides anything; the total does.

This page is written for senders. SpamAssassin is one of the two engines (with Rspamd) you are most likely to meet on the receiving side, and because it exposes its reasoning in headers you can actually read, knowing its model tells you precisely how the ham-protective and spam-like signals you generate get weighed.

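[Figure: a MESSAGE (headers, body) is scored by the rule families HEADER +1.2, BODY +0.8, URI +0.5, NETWORK +2.5, and BAYES -1.5; the SUM of 3.5 is COMPARED against the default 5.0 to yield HAM or SPAM.]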
Scores shown are illustrative. Each rule contributes a signed score; the sum is compared to SpamAssassin’s configurable default of 5.0 - below it is ham, at or above it is spam.

The 60-second version

  • SpamAssassin runs many rules; each rule that fires contributes a score - positive for spam-like, negative for ham-like.
  • Rules come in types - header, body, rawbody, uri, full, and meta rules that combine other rules with boolean/arithmetic logic.
  • The scores sum into one number. By default, 5.0 or higher is tagged as spam - but that default is explicitly described as “quite aggressive,” and operators routinely raise it.
  • Default scores are not hand-picked: they are generated by a perceptron (a neural net trained with back-propagation) over a labeled corpus, in four “score sets” for the Bayes/network on-off combinations.
  • It records its reasoning in X-Spam-* headers on every message, including the exact tests that fired and the total score.
  • Tests span five families: header analysis, body/content, authentication, Bayesian statistics, and network/distributed checks.
  • The Bayes plugin is one rule among many - and it needs at least 200 spam + 200 ham before it will score.
  • Rules are kept current out-of-band with sa-update, which installs GPG-signed rule channels.
  • It is written in Perl, runs as a standalone script, a spamd/spamc daemon (recommended for speed), or an embedded library, and is configured through layered .pre and .cf files.

The scoring model

This is the heart of it. From the documentation: by default, “all messages with a calculated score of 5.0 or higher are tagged as spam.” Each rule that fires “contributes its configured score (positive = spam-like; negative = ham-like); the total is the message’s spam score.” So a passing DKIM signature or a recognized welcome-list subject can subtract points, partly offsetting spam-like rules that also fired. The verdict is a signed sum, not a single tripwire.
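
The signed-sum verdict described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not SpamAssassin's own (Perl) implementation; the rule names and score values are hypothetical, and the function name verdict is an assumption made for this example.

```python
from typing import Dict, Tuple

# Hypothetical rule names and scores, chosen only to illustrate the arithmetic.
FIRED: Dict[str, float] = {
    "EXAMPLE_HEADER_RULE": 1.2,
    "EXAMPLE_BODY_RULE": 0.8,
    "EXAMPLE_URI_RULE": 0.5,
    "EXAMPLE_NETWORK_RULE": 2.5,
    "EXAMPLE_HAM_RULE": -1.5,  # negative scores are ham-like
}


def verdict(fired: Dict[str, float], required: float = 5.0) -> Tuple[float, str]:
    """Sum the signed scores of the rules that fired and compare to the threshold."""
    total = round(sum(fired.values()), 3)  # rounding hides floating-point noise
    return total, ("spam" if total >= required else "ham")


print(verdict(FIRED))  # (3.5, 'ham')
```

Note that the comparison is "at or above": a total of exactly 5.0 against a required value of 5.0 is tagged as spam.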

X-Spam-Status: Yes, score=7.3 required=5.0
  tests=BAYES_99,HTML_IMAGE_ONLY_28,RDNS_NONE,DKIM_VALID
  autolearn=no

The header above (format per the docs) is the engine showing its work: the total score, the required threshold it was compared against, the named tests that fired, and the autolearn outcome.
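
Because the header is plain text, it is easy to read programmatically. The sketch below parses an X-Spam-Status value in the format shown above; the function name parse_spam_status and the returned dictionary keys are assumptions made for this example, and real-world headers may carry extra fields that this sketch simply ignores.

```python
import re
from typing import Any, Dict


def parse_spam_status(value: str) -> Dict[str, Any]:
    """Parse the value of an X-Spam-Status header into its named parts."""
    flat = " ".join(value.split())  # unfold continuation lines into one line
    head = re.match(r"(Yes|No),\s*score=(-?[\d.]+)\s+required=(-?[\d.]+)", flat)
    if head is None:
        raise ValueError("not an X-Spam-Status value in the expected format")
    # The test list runs until the next "key=" field or the end of the value.
    tests = re.search(r"tests=(.*?)(?=\s+\w+=|$)", flat)
    autolearn = re.search(r"autolearn=(\w+)", flat)
    names = tests.group(1).split(",") if tests else []
    return {
        "is_spam": head.group(1) == "Yes",
        "score": float(head.group(2)),
        "required": float(head.group(3)),
        "tests": [n.strip() for n in names if n.strip()],
        "autolearn": autolearn.group(1) if autolearn else None,
    }


example = (
    "Yes, score=7.3 required=5.0\n"
    "  tests=BAYES_99,HTML_IMAGE_ONLY_28,RDNS_NONE,DKIM_VALID\n"
    "  autolearn=no"
)
print(parse_spam_status(example)["tests"])
```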

The headers SpamAssassin writes

Header                    What it carries
X-Spam-Status             (Yes|No), score=nn required=nn tests=... autolearn=(ham|spam|no|unavailable|failed) - written on every message
X-Spam-Flag               Set to YES on messages scoring at or above the threshold
X-Spam-Level              A row of * characters, one per full score point - easy to filter on
X-Spam-Checker-Version    SpamAssassin version and the host that scanned the message

Two operational details follow. First, when a message is tagged as spam, SpamAssassin “creates a new report message and attaches the original as a message/rfc822 MIME part” - the original is preserved intact. Second, and important for trust: “before header modification and addition, all headers beginning with X-Spam- are removed to prevent spammer mischief.” You cannot pre-stamp your own X-Spam-Status: No and have it survive - the receiver strips inbound X-Spam-* before doing its own analysis.
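
The header-stripping behavior can be illustrated with Python's standard email package. This is a sketch of the idea only, not SpamAssassin's own code; the function name strip_inbound_spam_headers and the sample message are assumptions made for this example.

```python
from email import message_from_string
from email.message import Message


def strip_inbound_spam_headers(msg: Message) -> Message:
    """Remove every header whose name begins with X-Spam- (case-insensitive)."""
    for name in list(msg.keys()):
        if name.lower().startswith("x-spam-"):
            del msg[name]  # deletes all occurrences; no error if already gone
    return msg


raw = (
    "From: sender@example.com\n"
    "X-Spam-Status: No, score=-9.9 required=5.0\n"
    "X-Spam-Flag: NO\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
cleaned = strip_inbound_spam_headers(message_from_string(raw))
print(cleaned.keys())  # ['From', 'Subject']
```

The pre-stamped X-Spam-Status: No is gone before any scoring would begin, which is why forging it gains a sender nothing.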

Why the threshold is not a universal number

The 5.0 is a default, not a standard - and the documentation itself calls it “quite aggressive,” “suitable for a single-user setup,” and advises that “if you’re an ISP installing SpamAssassin, you should probably set the default to be more conservative, like 8.0 or 10.0.” It goes further on the destructive end: “It is not recommended to automatically delete or discard messages marked as spam… but if you choose to do so, only delete messages with an exceptionally high score such as 15.0 or higher.” Scores can also be loaded per-site and per-user, from files or from SQL/LDAP, and overridden at multiple levels. So “below 5.0” is meaningful only against an unmodified install; real deployments raise, lower, and re-weight freely. Any tool that promises “your SpamAssassin score” against a fixed 5.0 line is describing one possible configuration, not a fact about every receiver.
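
To see why the same message can land differently at two receivers, here is a minimal sketch of a threshold policy. The function name disposition, its parameter names, and the three outcome labels are hypothetical; the 5.0, 8.0, and 15.0 values are the ones quoted from the documentation above. Automatic discarding is included only to show where the 15.0 figure would sit, since the documentation advises against it.

```python
def disposition(score: float, required: float = 5.0, discard_at: float = 15.0) -> str:
    """Map a total score to an action under one receiver's local policy."""
    if score >= discard_at:
        return "discard"  # only for exceptionally high scores, if used at all
    if score >= required:
        return "tag"      # mark as spam but still deliver or quarantine
    return "deliver"


score = 7.3
print(disposition(score))                # single-user default: tag
print(disposition(score, required=8.0))  # more conservative ISP setting: deliver
```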

Rule types and meta rules

A “rule” is not one thing. SpamAssassin’s configuration grammar defines several rule types by what part of the message the pattern runs against, and they behave differently:

Rule type    Runs against
header       a named header field (or exists:/eval: tests on headers)
body         the visible body text, HTML-stripped and whitespace-normalized
rawbody      the body before HTML decoding - useful for catching markup tricks
uri          URIs extracted from the message
full         the entire raw message, headers and body together
meta         a boolean/arithmetic combination of other rules
eval         a named Perl function (used for plugin checks like SPF, DCC, Razor2)

The interesting one is meta, because it is how SpamAssassin expresses “this combination is suspicious even though no single part is.” A meta rule scores only when its logical expression over other rules is true:

meta   FOO_META   TEST1 && !(TEST2 || TEST3)
meta   BAR_META   (3 * TEST1 - 2 * TEST2) > 0
meta   URI_META   (__HAS_LINK_A + __HAS_LINK_B + __HAS_LINK_C) >= 2

That last form leans on a convention: rules whose names begin with a double underscore (__HAS_LINK_A) are sub-rules - they carry no score of their own and exist only to be referenced by meta rules. Combined with tflags ... multiple (which lets one pattern match repeatedly, up to a maxhits cap), this lets a rule set count occurrences and trigger on thresholds. The point for a sender: the spam-like and ham-like signals you generate are not evaluated in isolation; a meta rule can react to a pattern of them.
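
The three meta expressions above can be mirrored in Python to make the logic concrete. This is a sketch under the assumption that each referenced rule is represented by its hit count (0 when it did not fire); the function name eval_metas and the hits mapping are hypothetical, and SpamAssassin itself evaluates meta rules in Perl.

```python
from typing import Dict


def eval_metas(hits: Dict[str, int]) -> Dict[str, bool]:
    """Evaluate the three example meta rules against per-rule hit counts."""
    def h(name: str) -> int:
        return hits.get(name, 0)  # a rule that did not fire counts as 0

    return {
        "FOO_META": bool(h("TEST1") and not (h("TEST2") or h("TEST3"))),
        "BAR_META": (3 * h("TEST1") - 2 * h("TEST2")) > 0,
        "URI_META": (h("__HAS_LINK_A") + h("__HAS_LINK_B") + h("__HAS_LINK_C")) >= 2,
    }


# TEST1 and TEST2 fired, plus two of the three unscored sub-rules.
print(eval_metas({"TEST1": 1, "TEST2": 1, "__HAS_LINK_A": 1, "__HAS_LINK_C": 1}))
```

With those hits, FOO_META is false (TEST2 fired), BAR_META is true (3 - 2 > 0), and URI_META is true (two sub-rules fired).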

The plugin set: what SpamAssassin checks

SpamAssassin’s “modular architecture… allows other technologies to be quickly wielded against spam.” Its default install ships 29 plugins, which fall into five functional families, plus a set of URI checks and a miscellaneous group:

Family                   Representative default plugins
Header analysis          HeaderEval, RelayEval, MIMEHeader
Body / content           BodyEval, HTMLEval, MIMEEval, ImageInfo
Authentication           SPF, DKIM, DMARC
Bayesian statistics      Bayes, AutoLearnThreshold
Network / distributed    DNSEval, URIDNSBL, AskDNS, Pyzor, Razor2, SpamCop, HashBL
URI / URL checks         URIDetail, URIEval, HTTPSMismatch, FreeMail
Miscellaneous            VBounce, WLBLEval, WelcomeListSubject, ReplaceTags, Check

The authentication plugins (SPF, DKIM, DMARC) are the ones a sender most directly influences - they convert your published records into score adjustments. The network family is where SpamAssassin reaches out to external services: the collaborative checksum networks (Pyzor, Razor2), DNS blocklists via URIDNSBL/DNSEval, and SpamCop.

Network and collaborative tests

SpamAssassin integrates the checksum networks as plugins, with their own timeouts and quirks:

  • DCC is described in the plugin docs as “a system of servers collecting and counting checksums of millions of mail messages,” using checksums that are “fuzzy and ignore aspects of messages.” Crucially, DCC is disabled by default in init.pre “because it is not open source.” Its match thresholds (dcc_body_max, dcc_fuz1_max, dcc_fuz2_max) default to 999999 - DCC’s “MANY” count - and dcc_timeout defaults to 8 seconds. If a dccifd socket is present, SpamAssassin prefers it over dccproc.
  • Razor2 “calculates a signature for each part of a multipart message” and returns a 0–100 confidence; check_razor2() fires when confidence reaches the configured min_cf. It defaults to a 5-second timeout and forks for asynchronous operation (razor_fork 1, though it is experimental on Windows where the default is 0).

These are score inputs like any other rule - a DCC or Razor hit adds points, it does not by itself reject the message.

How the default scores are assigned

A natural question is where a number like “BAYES_99 = 3.5” comes from. The answer is that the default scores are not chosen by hand. Per the project wiki: “The scores are assigned using a neural network trained with error back propagation (Perceptron),” optimized “in terms of minimizing the number of false positives and false negatives” against a large labeled corpus assembled from volunteer mass-checks. (Older 2.x SpamAssassin used a genetic algorithm instead.) This is why two facts surprise people:

  • Scores come in sets of four. A rule can list four scores; which one applies depends on whether Bayes and network tests are enabled - score set 0 (neither), 1 (network only), 2 (Bayes only), 3 (both). The optimizer tunes each combination separately, because a rule’s usefulness changes when other signals are present.
  • The learn-rule scores can look “wrong.” BAYES_80 can carry a higher score than BAYES_99. The wiki explains this directly: the rules are independent, and because a genuinely spammy message tends to trip other rules too, the optimizer can afford to lower a high-confidence Bayes rule’s own score to protect against false positives - the message still crosses the threshold on the sum.

One operational corollary: a rule with a score of exactly 0 is not run at all, which is how the default config ships paid or off-by-default blocklist rules in a disabled state.
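
The score-set numbering and the zero-score corollary can be sketched as follows. The function names and the four example values are hypothetical; the mapping of set numbers to the Bayes/network combinations is the one listed above, and the single-score case assumes that a rule listing only one score uses it in every set.

```python
from typing import Sequence


def score_set_index(bayes_enabled: bool, network_enabled: bool) -> int:
    """0 = neither, 1 = network only, 2 = Bayes only, 3 = both."""
    return (2 if bayes_enabled else 0) + (1 if network_enabled else 0)


def effective_score(scores: Sequence[float], bayes_enabled: bool, network_enabled: bool) -> float:
    """Pick the score that applies to a rule under the current configuration."""
    if len(scores) == 1:
        return scores[0]  # assumption: one listed score applies to all four sets
    return scores[score_set_index(bayes_enabled, network_enabled)]


def rule_is_run(score: float) -> bool:
    """A rule whose effective score is exactly 0 is not run at all."""
    return score != 0


example_scores = [0.0, 1.5, 0.8, 2.1]  # hypothetical values for sets 0..3
chosen = effective_score(example_scores, bayes_enabled=True, network_enabled=False)
print(chosen, rule_is_run(chosen))  # 0.8 True
```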

The Bayes subsystem

SpamAssassin includes a Bayesian classifier (the Bayes plugin) as one scoring component. It “tries to identify spam by looking at what are called tokens; words or short character sequences that are commonly found in spam or ham.” (The general theory - tokenization, the f(w)/Fisher math, training corpora - is covered in Bayesian and statistical spam filtering.) What is specific to SpamAssassin:

  • It will not score until trained. “The bayesian classifier can only score new messages if it already has 200 known spams and 200 known hams.” Below that, the Bayes rules simply do not contribute.
  • Training is via sa-learn, separately for each class: sa-learn --mbox --spam spam-file and sa-learn --mbox --ham ham-file. The docs stress: “It is important to do both.”
  • Do not train on the wrong data. A direct warning: “Do not train Bayes on different mail streams or public spam corpora. These methods will mislead Bayes into believing certain tokens are spammy or hammy when they are not.” This is why a SpamAssassin Bayes score is local to its deployment.
  • Mistake-driven training is the intended workflow: feed missed spam to sa-learn --spam, and false positives to sa-learn --ham.
  • Header handling: sa-learn ignores standard SpamAssassin headers and will decapsulate an email that was attached as a report; bayes_ignore_header in local.cf excludes misleading upstream headers from token extraction.
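
Two of the points above, the 200 + 200 gate and mistake-driven training, can be sketched together. The function names, the "missed_spam" and "false_positive" labels, and the file paths are hypothetical; the sketch only builds the sa-learn command line and does not run it.

```python
from typing import List

MIN_SPAM = 200  # known spams required before Bayes will score
MIN_HAM = 200   # known hams required before Bayes will score


def bayes_can_score(nspam: int, nham: int) -> bool:
    """Bayes rules contribute only once both classes have enough training."""
    return nspam >= MIN_SPAM and nham >= MIN_HAM


def correction_command(mistake: str, path: str) -> List[str]:
    """Build the sa-learn invocation that corrects one classification mistake."""
    if mistake == "missed_spam":       # spam that was delivered as ham
        return ["sa-learn", "--spam", path]
    if mistake == "false_positive":    # ham that was tagged as spam
        return ["sa-learn", "--ham", path]
    raise ValueError(f"unknown mistake type: {mistake}")


print(bayes_can_score(350, 120))  # False: not enough ham yet
print(correction_command("false_positive", "mail/legit.eml"))
```

Training heavily on one class alone never opens the gate, which is the practical meaning of "It is important to do both."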

There is one distinction that trips people up: training is not reporting. Running sa-learn updates only the local Bayes database. Reporting a message to the DCC/Pyzor/Razor networks is a separate action - spamassassin -r < message - and the autolearn= value in X-Spam-Status reflects whether automatic learning happened, not whether anything was reported upstream.

Configuration and how the engine is run

SpamAssassin loads configuration in a defined order: .pre files first (lexical order), then .cf files (lexical order) - e.g. init.pre, then 10_default_prefs.cf, 20_body_tests.cf, 50_scores.cf. Global defaults live in directories like /var/lib/spamassassin/<version> or /usr/share/spamassassin (first existing wins); site overrides live in /etc/mail/spamassassin; and per-user preferences in ~/.spamassassin/user_prefs. Scores and rules can additionally be loaded from SQL or LDAP, and extra plugins are pulled in with loadplugin directives in .pre files. (A taint-security note: you cannot use PERL5LIB to relocate SpamAssassin’s modules, because Perl’s taint checks forbid it.)
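
The ".pre first, then .cf, each in lexical order" rule is simple enough to sketch. This example deliberately ignores the layering of default, site, and user directories, and it uses Python's default string ordering as a stand-in for the lexical order described above; the function name load_order is an assumption made for this example.

```python
from typing import Iterable, List


def load_order(filenames: Iterable[str]) -> List[str]:
    """Order config files within one directory: all .pre files, then all .cf files."""
    names = list(filenames)
    pre_files = sorted(n for n in names if n.endswith(".pre"))
    cf_files = sorted(n for n in names if n.endswith(".cf"))
    return pre_files + cf_files


print(load_order(["50_scores.cf", "init.pre", "20_body_tests.cf", "10_default_prefs.cf"]))
# ['init.pre', '10_default_prefs.cf', '20_body_tests.cf', '50_scores.cf']
```

The numeric prefixes on the .cf names are what make the lexical sort meaningful: a later file can override a setting made by an earlier one.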

It runs three ways:

  1. The spamassassin script - a standalone filter.
  2. spamd + spamc - a persistent daemon and a thin client; spamc is “faster than spamassassin” because it reuses the long-running spamd process rather than re-initializing Perl per message. This is the recommended mode for any real volume.
  3. An embedded library (Mail::SpamAssassin) inside another application.

The exact spamd wire protocol and transport are not specified in the overview documentation, so this library does not assert them.

Keeping rules current: sa-update

Rules and scores are not frozen at install time - SpamAssassin ships sa-update to “automate the process of downloading and installing new rules and configuration, based on channels.” The mechanics are worth knowing because they explain how a sender’s environment can shift under them without a software upgrade:

  • Channels are DNS-anchored. The default channel is updates.spamassassin.org; a channel’s latest version is published as a DNS TXT record, and an update is fetched over HTTP from mirrors listed in a MIRRORED.BY file.
  • Updates are GPG-signed by default. sa-update only trusts archives signed by “release trusted” keys (the standard SpamAssassin release key and its sub-key are trusted out of the box); if GPG is disabled, only SHA-512/SHA-256 integrity checks remain, which “does not offer any form of security regarding whether or not the downloaded archive is legitimate.”
  • Plugins are blocked unless explicitly allowed. Downloaded loadplugin lines are commented out by default, because “plugins can execute unrestricted code on your system, even possibly as root.”
  • It does not reload the scanner. sa-update returns exit code 0 only when it actually installed an update, so the documented idiom is sa-update && service spamassassin reload.
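
The sa-update && reload idiom from the last point can be expressed as a small wrapper. This is a sketch only: the function name update_then_reload and the injectable run parameter are assumptions made so the logic can be exercised without touching a real system, and the reload command is the one quoted above, which will differ between init systems.

```python
import subprocess
from typing import Callable, Sequence


def update_then_reload(
    run: Callable[[Sequence[str]], int] = subprocess.call,
    reload_cmd: Sequence[str] = ("service", "spamassassin", "reload"),
) -> bool:
    """Reload the scanner only when sa-update reports that it installed an update."""
    if run(["sa-update"]) != 0:
        return False  # nothing new was installed (or an error occurred): no reload
    return run(list(reload_cmd)) == 0


# Dry run with a fake runner that records commands instead of executing them.
calls = []

def fake_run(cmd: Sequence[str]) -> int:
    calls.append(list(cmd))
    return 0  # pretend every command succeeded

print(update_then_reload(run=fake_run), calls)
```

Treating every non-zero exit code as "do not reload" is the conservative choice; distinguishing "no update" from a genuine failure is left to the sa-update documentation.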

The practical upshot for a sender: the exact rule that scores your mail today may be replaced tomorrow by a routine sa-update, which is one more reason no fixed “SpamAssassin score” is portable across time, let alone across deployments.

A realistic picture of SpamAssassin’s place

SpamAssassin is mature, Apache-licensed, and process-oriented: in the classic model each scan spins up a fresh interpreter context, which is part of why spamd exists and why throughput is modest compared to event-driven engines (the Rspamd comparison is on its own page). It is most accurately seen not as “the spam filter” but as a flexible scoring host that pulls together authentication results, content heuristics, a local Bayesian opinion, and the collaborative networks into one tunable number. The reason it never auto-deletes on a single rule is the same one the whole stack obeys - the asymmetric cost of a false positive.

What this means for you, and what Egressif does

SpamAssassin is unusually honest about its reasoning - it writes the firing tests and the score into headers you can read. That transparency is a gift to a sender: it makes clear that the way to a good score is to make the ham-protective rules fire and the spam-like ones stay quiet. The authentication plugins (SPF, DKIM, DMARC) are exactly the levers you own, and a clean pass on each subtracts from the total.

Egressif focuses on those controllable inputs - aligned SPF, DKIM, and DMARC, consistent sending identity, and disciplined list hygiene - so that when a message reaches a SpamAssassin install, the authentication rules score in your favor and there is little spam-like signal to accumulate. We do not tune anyone else’s 50_scores.cf or train their Bayes database, and we will not pretend a “5.0” is a universal line. We make sure the parts you control arrive clean, which over time is what builds the sending reputation those rules ultimately reward.
