Rspamd architecture
Rspamd is an event-driven filtering framework that sits between the MTA and the internet, runs dozens of modules in parallel, sums named symbols into a score, and maps that score to an action. Here is the pipeline, the scoring and action model, fuzzy storage, and why it is fast.
Last checked: June 22, 2026
Rspamd describes itself as “a high-performance email processing framework designed as an independent layer between your Mail Transfer Agent (MTA) and the internet.” That positioning is the whole design philosophy: it does not live inside the MTA’s delivery path. “Operating outside MTA internal flows, Rspamd provides security isolation while delivering comprehensive message analysis, spam filtering, and policy enforcement.” It looks at a message, recommends an action, and lets the MTA carry it out.
This page is written for senders. Rspamd is, alongside SpamAssassin, one of the two engines you are most likely to meet on the receiving side, and it is increasingly the default at scale. Its model - parallel modules, summed symbols, and a graded action decision - tells you exactly how the signals you generate translate into pass, greylist, header-tag, or reject.
The 60-second version
- Rspamd runs a four-stage pipeline: pre-filters → main filters (in parallel) → post-filters → action decision.
- Each module contributes named symbols with weights (positive or negative); the weights sum to a score - the same additive model as SpamAssassin, different vocabulary.
- The score maps to an action - the documented set is no action, greylist, add header, rewrite subject, soft reject, and reject - not a single spam/not-spam flag.
- Its statistics module is a Bayesian classifier using OSB tokens combined with the inverse chi-square distribution - the Robinson/Fisher lineage in production.
- It is event-driven and asynchronous (C core + Lua), so one worker handles 100+ concurrent messages; typical scan time is 50–200 ms.
- It ships native fuzzy hashing, a neural-network module, greylisting, rate limiting, reputation, and multimap - things SpamAssassin needs external plugins for.
- Redis is a core dependency for statistics, learning, rate limits, and caching.
- The exact action score thresholds are configurable defaults, not universal numbers.
The four-stage pipeline
The documentation lays out an explicit four-stage flow, and the parallelism in the middle stage is the key to both its speed and its layered logic:
| Stage | What runs | Purpose |
|---|---|---|
| 1. Pre-filters | Whitelisting, basic policy checks | Execute first; can short-circuit and skip the rest |
| 2. Main filters | Authentication (SPF/DKIM/DMARC), content analysis, RBL lookups, statistical classifiers - in parallel | The bulk of the analysis |
| 3. Post-filters | Composites, neural networks, final scoring adjustments | Combine and refine the symbols |
| 4. Action decision | Map cumulative score to an action | pass, add headers, greylist, or reject |
Pre-filters can stop processing early - a whitelisted sender never pays for the full main-filter stage. The main filters run concurrently rather than in sequence, which is why authentication, content, blocklists, and the Bayesian classifier all resolve quickly even though they involve network round-trips.
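To make the flow concrete, here is a minimal Python sketch of the four stages. It is not Rspamd code: the check functions, symbol names, weights, and thresholds are all invented for illustration, and `asyncio.gather` merely stands in for the event-driven concurrency of the real main-filter stage.

```python
import asyncio

# Hypothetical main-filter checks. Each returns {symbol_name: weight}.
# The sleeps stand in for network round-trips (DNS, Redis, HTTP).
async def check_auth(msg):
    await asyncio.sleep(0.01)
    return {"DKIM_ALLOW": -1.0} if msg.get("dkim_valid") else {}

async def check_rbl(msg):
    await asyncio.sleep(0.01)
    return {"RBL_LISTED": 4.0} if msg.get("ip_listed") else {}

async def check_content(msg):
    await asyncio.sleep(0.01)
    return {"BAYES_SPAM": 3.0} if "free money" in msg.get("body", "") else {}

def pre_filter(msg, allowlist):
    # Stage 1: an allowlisted sender short-circuits everything else.
    return msg.get("sender") in allowlist

def post_filter(symbols):
    # Stage 3: a made-up composite that fires when two symbols co-occur.
    if "RBL_LISTED" in symbols and "BAYES_SPAM" in symbols:
        symbols = dict(symbols, LISTED_AND_SPAMMY=1.0)
    return symbols

def decide(score, add_header_at=5.0, reject_at=7.0):
    # Stage 4: illustrative thresholds only; real ones are per deployment.
    if score >= reject_at:
        return "reject"
    if score >= add_header_at:
        return "add header"
    return "no action"

async def scan(msg, allowlist=frozenset()):
    if pre_filter(msg, allowlist):
        return {"action": "no action", "score": 0.0, "symbols": {}}
    # Stage 2: the main filters run concurrently, not one after another.
    results = await asyncio.gather(
        check_auth(msg), check_rbl(msg), check_content(msg)
    )
    symbols = {}
    for found in results:
        symbols.update(found)
    symbols = post_filter(symbols)
    score = sum(symbols.values())
    return {"action": decide(score), "score": score, "symbols": symbols}
```

Calling `asyncio.run(scan(message))` with a plain dict returns the recommended action alongside the symbols that produced it. Because the three checks wait concurrently, the main stage takes roughly as long as the slowest check rather than the sum of all three.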
Symbols, weights, and the score
Rspamd’s scoring is structurally identical to SpamAssassin’s additive model, with renamed parts. “Each analysis module fires and contributes named symbols (analogous to SA rules); cumulative symbol score determines the final action.” A symbol carries a weight; weights can be negative (a valid DKIM signature, a fuzzy whitelist hit) or positive (a blocklist match, a high Bayes probability). The running total is what gets mapped to an action.
Symbols on a message (illustrative names; weights are configurable)
+5.0 FUZZY_DENIED confirmed fuzzy-hash match
+0.1 DKIM_TRACE (informational)
-1.0 DKIM_ALLOW valid aligned DKIM signature
-0.2 MIME_GOOD well-formed MIME
----
+3.9 -> compared against this deployment's action thresholds
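The same arithmetic as a short Python sketch, using the illustrative symbols and weights from the listing above (they are examples, not defaults to rely on):

```python
# Illustrative symbols and weights copied from the listing above.
symbols = {
    "FUZZY_DENIED": 5.0,   # confirmed fuzzy-hash match
    "DKIM_TRACE": 0.1,     # informational
    "DKIM_ALLOW": -1.0,    # valid aligned DKIM signature
    "MIME_GOOD": -0.2,     # well-formed MIME
}

# The message score is simply the sum of every fired symbol's weight.
# Rounding hides binary floating-point noise in the decimal weights.
score = round(sum(symbols.values()), 2)
```

Negative weights subtract from the total in the same pass, which is why authentication and reputation symbols can offset a content hit.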
The action model: a ladder, not a flag
This is what most distinguishes Rspamd from a yes/no filter. The cumulative score is compared against a series of thresholds, and Rspamd recommends the matching action rather than a binary verdict. Its protocol documentation enumerates the full set:
| Action | Meaning |
|---|---|
| no action | message is likely ham - deliver normally |
| greylist | defer with a temporary failure so the sender must retry |
| add header | suspicious - deliver but add a spam header the MTA or client can sort on |
| rewrite subject | suspicious - deliver but rewrite the subject (e.g. prefix [SPAM]) |
| soft reject | temporary rejection, “for example, due to rate limit exhausting” |
| reject | refuse the message outright |
Rspamd returns this as structured data, so the MTA can act on the recommendation and on the evidence behind it. A scan result looks like:
{ "action": "add header",
"score": 5.2,
"required_score": 7,
"symbols": { "FORGED_SENDER": { "score": 5 },
"DATE_IN_PAST": { "score": 0.1 },
"DKIM_ALLOW": { "score": -1 } } }
Note that score (5.2) is below required_score (7) here, yet the action is add header. Each action has its own threshold, and the message crossed the lower “add header” boundary while staying under “reject.” That ladder is why greylist and soft reject exist as score-driven actions at all. A borderline message can be deferred rather than accepted or refused, buying time for reputation systems and forcing the cheap retry test (see Greylisting, tarpitting, and rate controls). A rate-limited or temporarily suspect sender can get a soft reject instead of a permanent one. All of this comes from the same scoring machinery.
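A consumer of that result can read both the recommendation and the evidence. The sketch below parses the example above with Python's standard `json` module; the `summarize` helper is a hypothetical name, and a real response carries more fields than this abridged sample.

```python
import json

# The abridged scan result shown above, as a JSON string.
RESULT = """
{ "action": "add header",
  "score": 5.2,
  "required_score": 7,
  "symbols": { "FORGED_SENDER": { "score": 5 },
               "DATE_IN_PAST": { "score": 0.1 },
               "DKIM_ALLOW": { "score": -1 } } }
"""

def summarize(raw):
    """Return (action, below_required_score, symbols ranked by impact)."""
    result = json.loads(raw)
    ranked = sorted(
        result["symbols"].items(),
        key=lambda item: abs(item[1]["score"]),
        reverse=True,
    )
    below_required = result["score"] < result["required_score"]
    return result["action"], below_required, [name for name, _ in ranked]

action, below_required, ranked_symbols = summarize(RESULT)
```

Ranking by absolute weight puts the symbols that moved the score most at the front, whichever direction they pushed it.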
A caution this library holds to: the exact score numbers that separate these actions are configurable deployment settings. The documentation describes the action levels and exposes the live values at the controller’s /actions endpoint, but it does not fix universal defaults. So, as with SpamAssassin’s 5.0, there is no universal “reject above N” line - each operator sets its own.
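The ladder itself is easy to sketch. The thresholds below are hypothetical values chosen only so that a score of 5.2 lands on add header, as in the example result; they are not Rspamd defaults, and treating a score equal to a threshold as crossing it is also an assumption.

```python
# Hypothetical thresholds for illustration; each operator sets its own.
EXAMPLE_THRESHOLDS = {
    "greylist": 4.0,
    "add header": 5.0,
    "reject": 7.0,
}

def action_for(score, thresholds=EXAMPLE_THRESHOLDS):
    """Return the action with the highest threshold the score reaches."""
    ladder = sorted(thresholds.items(), key=lambda item: item[1], reverse=True)
    for action, limit in ladder:
        if score >= limit:  # inclusive comparison is an assumption
            return action
    return "no action"
```

Under these made-up numbers, `action_for(5.2)` yields add header while `action_for(3.9)` yields no action. Moving any one threshold changes the outcome without touching a single symbol weight, which is why a score is only meaningful next to the deployment's own settings.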
What Rspamd checks: the module set
Rspamd advertises “60+ analysis modules.” The documented capabilities span the same families as the rest of this library, plus several that are bundled rather than bolted on:
| Module / capability | Notes from the docs |
|---|---|
| Email authentication | spf, dkim/dkim_signing, dmarc, arc - “SPF, DKIM (signing+validation), DMARC, ARC with caching” |
| Bayesian statistics | Built-in OSB classifier; not compatible with SpamAssassin’s Bayes database - must be retrained from scratch (below) |
| Neural networks | neural - “post-process messages using neural network classification” - runs in the post-filter stage (requires Redis) |
| Fuzzy hashing | Native fuzzy_check: shingles for text, blake2b for attachments, HTML structure since 3.14.0 (below) |
| Real-time blocklists | rbl - “50+ preconfigured RBLs, SURBL, URIBL with parallel DNS queries” |
| greylisting | “allows to delay suspicious messages” (requires Redis) |
| ratelimit | “implements leaked bucket algorithm for ratelimiting” (requires Redis) |
| reputation | “manages reputation evaluation based on various rules” (replaced the old ip_score module) |
| multimap | “a complex module that operates with different types of maps” - match senders, IPs, URLs, etc. against lists |
| whitelist | flexible allow/block “based on SPF/DKIM/DMARC combinations” |
| URL filtering, antivirus, AI/ML services | phishing, antivirus, external integrations |
The notable contrast with SpamAssassin: greylisting, rate limiting, reputation, neural networks, ARC, multimap, and native fuzzy hashing are all in the box, where SpamAssassin reaches DCC/Razor through external plugins and does not ship greylisting, rate limiting, or a neural module in its default set.
Under the hood there are two kinds of module. A small set of C modules is statically linked for speed - the default filters line is just chartable, dkim, regexp, fuzzy_check, where regexp is the core engine that evaluates regular-expression rules and embedded Lua. Everything else is a Lua module, loaded dynamically at startup and reloaded on reconfiguration; the docs note Lua modules “are very close to C modules in terms of performance,” which is why most new functionality (including multimap, ratelimit, reputation, greylisting, and neural) is written in Lua. Several modules - anything stateful - “require Redis,” which is the recurring reason Redis is treated as a core dependency rather than an optional add-on.
The neural module
Rspamd bundles a neural-network classifier that runs in the post-filter stage and “adapt[s] to your mail patterns.” It is a refinement layer on top of the symbol model rather than a replacement for it - the symbols produced by the main filters become features the network can weigh. (The overview documentation states its existence and stage; deeper architectural specifics live in the dedicated module documentation, which this library does not assert beyond what is quoted.)
Statistics: the OSB Bayesian classifier
Rspamd’s statistical classifier is Bayesian - “based on the Bayesian theorem, which combines probabilities to assess the likelihood of a message belonging to a particular class” - but two design choices distinguish it from a textbook word-counting filter, and both trace straight to the history covered in Bayesian and statistical spam filtering:
- OSB tokens, not single words. The default tokenizer is osb (Orthogonal Sparse Bigram), which “goes beyond considering single words as tokens and instead takes into account combinations of words, taking into consideration their positions.” Rspamd uses a window of 5 tokens, so “the number of tokens being approximately 5 times larger than the number of words.” This is the same insight CRM114 demonstrated - short word tuples outperform isolated words.
- Inverse chi-square combination. Rspamd combines token probabilities with “the inverse chi-square distribution” - i.e. Robinson’s Fisher-based method, not Graham’s naive product. The Robinson/Fisher math on the Bayesian page is exactly what is running here.
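Both ideas can be sketched in a few lines of Python. This is a textbook illustration, not Rspamd's implementation: `osb_pairs` is a simplified pairing scheme (the real tokenizer hashes its tokens and differs in detail), and `fisher_combine` is the classic Robinson/Fisher combination as popularized by SpamBayes-style filters. All function names are hypothetical.

```python
import math

def osb_pairs(tokens, window=5):
    """Pair each token with the following ones inside a sliding window,
    recording the gap so word position is part of the feature."""
    pairs = []
    for i, head in enumerate(tokens):
        for gap in range(1, window):
            j = i + gap
            if j >= len(tokens):
                break
            pairs.append((head, tokens[j], gap))
    return pairs

def chi2_survival(x2, dof):
    """P(X >= x2) for a chi-square variable with an even number of
    degrees of freedom (closed-form series)."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_combine(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)
    into one value in [0, 1]; 0.5 means the evidence cancels out."""
    n = len(probs)
    spam_evidence = 1.0 - chi2_survival(
        -2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n
    )
    ham_evidence = 1.0 - chi2_survival(
        -2.0 * sum(math.log(p) for p in probs), 2 * n
    )
    return (1.0 + spam_evidence - ham_evidence) / 2.0
```

Unlike a naive product of probabilities, this combination stays near 0.5 when tokens disagree and only moves toward 0 or 1 when the evidence is consistent.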
The operational defaults matter for a sender:
| Setting | Default | Meaning |
|---|---|---|
| backend | redis | statistics live in Redis (recommended/default since 2.0) |
| min_learns | 200 | needs 200 learned spam and 200 ham before it classifies |
| min_tokens | 11 | a message needs enough tokens to be worth classifying |
| tokenizer | osb | the only supported tokenizer |
That min_learns = 200 is a close echo of SpamAssassin’s 200/200 rule - an untrained classifier stays silent rather than guessing. Rspamd also tokenizes the Subject and a configurable set of headers (classify_headers) plus meta-tokens like message size and attachment count, and because it only learns the headers it is told to, “there is no need to remove any additional headers (e.g., X-Spam) before the learning process.” It supports per_user statistics (when invoked at final delivery) and, since 3.13, multi-class classifiers for categories like newsletter, transactional, and phishing alongside the binary spam/ham model. As the docs warn, its database is not compatible with SpamAssassin’s - a migration means retraining from scratch.
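The two gates from the settings table amount to a simple readiness check. The sketch below uses the table's defaults; the function name is hypothetical, and whether the real comparison is inclusive is an assumption.

```python
def classifier_ready(spam_learns, ham_learns, token_count,
                     min_learns=200, min_tokens=11):
    """True when both classes are trained enough and the message carries
    enough tokens. Inclusive comparisons are an assumption."""
    trained = spam_learns >= min_learns and ham_learns >= min_learns
    return trained and token_count >= min_tokens
```

For a sender, the practical reading is that a freshly deployed or lightly trained receiver contributes no Bayes symbol at all, so the other modules carry the whole decision.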
Fuzzy storage
Rspamd’s native fuzzy subsystem is one of its strongest differentiators (the near-duplicate theory is on its own page). The architecture worth knowing here:
- Text uses the shingles algorithm - overlapping word trigrams, 32 hashes per shingle - producing a similarity score rather than an exact match, so templated campaigns with minor per-recipient variations are caught.
- Attachments and images use exact blake2b digests - identical files are matched precisely.
- HTML structure fuzzy hashing (since 3.14.0) matches “DOM structure, layout, and link patterns - independent of text content,” weighted as structure shingles 50%, CTA domains 30%, all domains 15%, structural features 5%. A clever anti-phishing detail: if the DOM is identical but the call-to-action domains differ, similarity is heavily penalized (x0.3), exposing a cloned-brand phishing page even though its layout is a perfect copy.
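The three matching styles above can be contrasted in a short Python sketch. It is a simplification under stated assumptions: text similarity is plain Jaccard overlap of word-trigram sets rather than Rspamd's hashed shingles, the HTML function takes four pre-computed component similarities in the range 0 to 1, and the exact condition that triggers the x0.3 penalty is a guess. Only `hashlib.blake2b` is a real library call; the other names are hypothetical.

```python
import hashlib

def word_trigrams(text):
    """Set of overlapping three-word windows, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def text_similarity(a, b):
    """Jaccard overlap of trigram sets: 1.0 for identical text, lower as
    the texts diverge. A stand-in for shingle comparison."""
    sa, sb = word_trigrams(a), word_trigrams(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def attachment_digest(data):
    """Exact-match digest: any changed byte yields a different value."""
    return hashlib.blake2b(data).hexdigest()

def html_similarity(structure, cta_domains, all_domains, features):
    """Weighted blend using the percentages quoted above."""
    score = (0.50 * structure + 0.30 * cta_domains
             + 0.15 * all_domains + 0.05 * features)
    # Assumed trigger: identical DOM but differing call-to-action domains.
    if structure == 1.0 and cta_domains < 1.0:
        score *= 0.3
    return score
```

Two copies of a template that differ only in the recipient's name still share most of their trigrams, so the text score stays well above zero, while the same one-word change would alter a blake2b digest completely.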
How fuzzy weight becomes a score
Each stored hash has a weight (“hits”) that “accumulates as users report the same content,” and Rspamd converts weight to score with a hyperbolic-tangent curve:
symbol_score = tanh((weight - max_score) / max_score) x metric_weight
The effect is deliberate smoothing: the score is 0 below the threshold, partial at the threshold, and full at twice the threshold. As the docs put it, this “prevents a single report from triggering the maximum score while ensuring well-confirmed spam gets full weight.” (The max_score parameter is being renamed to hits_limit; both names are currently accepted.) Standard flags distinguish FUZZY_DENIED (flag 1, confirmed spam), FUZZY_PROB (flag 2, probable spam), and FUZZY_WHITE (flag 3, legitimate content - a negative-weight whitelist). By default Rspamd uses fuzzy feeds from rspamd.com over UDP port 11335; if usage is blocked, a zero-weight FUZZY_BLOCKED symbol appears and does not affect processing. The default hash algorithm is mumhash.
Performance design
Rspamd’s reason for existing is throughput, and the numbers from its documentation are concrete:
| Metric | Value |
|---|---|
| Concurrent messages per worker | 100+ |
| Typical scan time | 50–200 ms per message (incl. network) |
| Throughput | 5–10 messages/sec per worker core (~500K–1M/day) |
| Memory | 50–100 MB per worker process |
The design choices behind those numbers: an event-driven core with non-blocking DNS, Redis, and HTTP; Hyperscan for fast regular-expression execution on x86_64; and a worker model split into proxy (protocol translation, load balancing), normal (message scanning), and controller (web UI and management API) roles. Rspamd’s own migration documentation claims “10–100x faster processing” versus SpamAssassin; treat that as the project’s own comparison.
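The per-day figure in the table is a rounded multiple of the per-second rate. A quick check, assuming a worker core sustains the quoted rate around the clock:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def messages_per_day(rate_per_second):
    """Messages scanned in a day at a constant per-second rate."""
    return rate_per_second * SECONDS_PER_DAY

low = messages_per_day(5)    # 432,000
high = messages_per_day(10)  # 864,000
```

So “~500K–1M/day” rounds up from roughly 432K–864K, and it assumes constant load; real mail flow is burstier than that.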
How it talks to the MTA
Rspamd “communicates results to your MTA via HTTP/JSON API or Milter protocol, recommending an action without directly handling mail delivery.” The default ports:
| Port | Worker / use |
|---|---|
| 11332 | Milter listener (e.g. Postfix smtpd_milters = inet:localhost:11332) |
| 11333 | Normal worker - HTTP scan API (/checkv2) |
| 11334 | Controller worker - web UI and management |
| 11335 | Fuzzy storage (UDP) |
Because Rspamd only recommends an action, the MTA stays in control of delivery - the “independent layer” philosophy in practice. Redis underpins the stateful parts: “Statistics storage, learning data, rate limiting, and caching - all backed by Redis,” with the production quick-start installing redis-server alongside Rspamd and Redis HA supported.
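For the HTTP path, a client posts the raw message to the normal worker and reads back JSON. The sketch below uses only Python's standard library and the port and endpoint from the table; the host, the timeout, and the function names are assumptions, and it omits the optional request headers a real integration would normally send.

```python
import json
import urllib.request

def build_scan_request(raw_message, host="localhost", port=11333):
    """Build a POST to the scan endpoint with the raw message as the body."""
    url = f"http://{host}:{port}/checkv2"
    return urllib.request.Request(url, data=raw_message, method="POST")

def scan_message(raw_message, host="localhost", port=11333, timeout=5.0):
    """Send the message and return the decoded JSON result.
    Needs a reachable Rspamd normal worker; raises on network errors."""
    request = build_scan_request(raw_message, host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would pass the complete message, headers and body, as bytes, then act on the `action` field of the returned dict. What to do with that recommendation remains the MTA's decision.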
Rspamd vs. SpamAssassin at a glance
| Dimension | SpamAssassin | Rspamd |
|---|---|---|
| Core language | Perl | C core + Lua rules/plugins |
| Architecture | Process-per-message (spamd/spamc) | Event-driven, 100+ concurrent/worker |
| Throughput | ~0.5–1 msg/sec/core | ~5–10 msg/sec/core |
| Memory | 30–50 MB | 50–100 MB/worker |
| Scoring | Named rules → score; default tag at 5.0 | Named symbols → score → graded actions |
| Bayes | Plugin; 200+200 minimum | Built-in; incompatible with SA’s DB |
| Fuzzy hashing | External DCC/Razor plugins (DCC off by default) | Native fuzzy_check (shingles + blake2b + HTML) |
| Greylist / rate limit / neural / ARC | Not in default set | All bundled |
| Redis | Not native (SQL/LDAP) | Core dependency |
Both figures for SpamAssassin’s throughput and memory come from Rspamd’s own comparison table; treat the head-to-head numbers as the project’s framing.
What this means for you, and what Egressif does
Rspamd’s action model is the practical thing for a sender to internalize: your score does not just decide spam-or-not, it decides deliver vs. greylist vs. tag vs. rewrite-subject vs. soft-reject vs. reject. That makes the deterministic, sender-controlled signals - aligned SPF/DKIM/DMARC (which Rspamd validates and caches), a stable identity that earns whitelist and reputation symbols, and disciplined lists that avoid fuzzy and complaint signals - directly responsible for which rung of that ladder you land on. And because the ladder includes deferral rather than only deletion, a borderline score rarely means a lost message - which is the architecture quietly honoring the asymmetric cost of a false positive.
Egressif keeps those inputs clean and consistent so that, across an Rspamd deployment, the negative-weight symbols (valid DKIM, good reputation, fuzzy whitelist) fire for you and the positive-weight ones stay quiet, pulling your score toward pass. We cannot set another operator’s action thresholds or train their Bayesian or neural models, and we do not claim a universal number. We make sure the symbols you generate add up in your favor.
Related references
- How receiver-side spam filtering actually works A spam filter is not one test with one threshold. It is a layered pipeline - connection and reputation checks, authentication, statistical content analysis, collaborative checksum networks, and rules engines - whose signals combine into a single decision. This page walks the whole chain so you can see where a legitimate message can go wrong.
- Bayesian and statistical spam filtering Statistical filters do not match keywords - they learn token probabilities from a receiver's own ham and spam, then combine the most telling tokens with Bayes' rule. This is the history, the math, and the operational caveats, ending with what it means for a legitimate sender.
- Fuzzy hashing and near-duplicate detection A single changed character defeats an exact hash, so bulk-mail detection needs hashing that tolerates variation. This explains the difference between exact and fuzzy hashing, how DCC and Rspamd implement near-duplicate detection, and why "bulk" is not the same as "spam."
- DCC, Pyzor, and Razor - collaborative checksum networks DCC, Pyzor, and Vipul's Razor let many receivers pool what they see, so a message sent in bulk is recognized as bulk even when each copy is personalized. Here is how each one works, how they differ, what they catch and miss, and the whitelist rule every legitimate bulk sender depends on.
- Apache SpamAssassin architecture SpamAssassin classifies mail by running many rules, each contributing a positive or negative score, and tagging messages whose total crosses a configurable threshold. Here is the scoring model, the plugin architecture, the network tests, the Bayes subsystem, and how training really works.
- Greylisting, tarpitting, and rate controls Before any content is scored, receivers can slow or defer suspicious connections. Greylisting (standardized in RFC 6647) returns a temporary failure that legitimate senders retry and most spamware abandons. Here is the mechanism, the RFC's recommended implementation, the failure modes, and where tarpitting and rate limiting fit.