Fuzzy hashing and near-duplicate detection
A single changed character defeats an exact hash, so bulk-mail detection needs hashing that tolerates variation. This page explains the difference between exact and fuzzy hashing, how DCC and Rspamd implement near-duplicate detection, and why "bulk" is not the same as "spam."
Last checked: June 22, 2026
A cryptographic hash is designed to do the opposite of what bulk-mail detection needs. Change one byte of the input - add a space, swap a comma - and the output is completely different. That is exactly the property you want for verifying a download, and exactly the property that makes plain hashes useless against spam, because every copy of a spam campaign is personalized: a different recipient name, a unique unsubscribe token, a shuffled phrase. Fuzzy hashing (also called near-duplicate or locality-sensitive hashing) is the family of techniques built to survive that variation - to produce the same or similar fingerprint for messages that are alike but not byte-for-byte identical.
This page explains the concept and the two well-documented production approaches. It is written for senders, because near-duplicate detection is the layer most likely to group your legitimate bulk mail with everyone else’s bulk mail - and the design answer to that is consent and whitelisting, not evasion.
The 60-second version
- Exact hashing (a normal cryptographic digest) breaks on any single-character change, so spammers defeat it for free by personalizing each copy.
- Fuzzy hashing is built to ignore the parts that vary and fingerprint the parts that stay constant across a campaign.
- DCC uses fuzzy checksums that “include values that are constant across common variations in bulk messages, including ‘personalizations’.”
- Rspamd uses a shingles algorithm for text - overlapping word trigrams, each hashed many ways - and reports a similarity score based on how many shingles match.
- Not everything is fuzzy: Rspamd matches attachments and images by exact blake2b digest, because identical files are the right signal there.
- Detecting near-duplicates detects bulk, not spam. Wanted bulk mail (newsletters you subscribed to) is near-duplicate too - which is why these signals must feed a score alongside whitelists, never a standalone block.
Why exact hashes fail
Imagine a campaign of one million messages. The body is a template; only a greeting and a tracking link differ per recipient. To a cryptographic hash, those are one million completely unrelated inputs producing one million unrelated digests. A receiver that fingerprinted messages with an exact hash would see each copy exactly once and conclude none of them is bulk.
DCC’s documentation states the problem and the fix in one breath: “Because simplistic checksums of spam would not be effective, the main DCC checksums are fuzzy and ignore aspects of messages.” The whole reason a separate class of hashing exists is that the obvious approach - hash the bytes - is trivially defeated by variation the sender introduces anyway.
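The avalanche behavior is easy to reproduce. The following is a minimal sketch, assuming SHA-256 as the exact hash (the text does not name one) and two made-up message bodies that differ only in the greeting:

```python
import hashlib


def exact_digest(body: str) -> str:
    """Return the SHA-256 hex digest of a message body (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


# Two hypothetical copies of one campaign; only the recipient name differs.
copy_a = "Hi Alice, claim your free prize today and win big now"
copy_b = "Hi Bob, claim your free prize today and win big now"

digest_a = exact_digest(copy_a)
digest_b = exact_digest(copy_b)

# Count hex positions where the two digests happen to agree by chance.
matching_positions = sum(a == b for a, b in zip(digest_a, digest_b))

print(digest_a)
print(digest_b)
print("digests equal:", digest_a == digest_b)
print("matching hex positions out of 64:", matching_positions)
```

A receiver that keyed its counters on these digests would file the two copies under unrelated fingerprints, which is exactly the failure described above.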
| Property | Exact hash | Fuzzy / near-duplicate hash |
|---|---|---|
| One-character change | Completely different output | Same or similar output |
| Goal | Prove two inputs are identical | Estimate how similar two inputs are |
| Defeated by | Any personalization | Substantial rewriting of content |
| Right tool for | Matching identical files (attachments, images) | Matching templated campaigns with per-recipient variation |
Making “similar” a number: shingles and Jaccard
Before the implementations, it helps to see the idea they share. The classic way to measure how alike two documents are is to break each into shingles - short, overlapping windows of words - and then ask what fraction of shingles they have in common. That fraction is the Jaccard similarity: the size of the intersection over the size of the union.
J(A, B) = |A intersect B| / |A union B|
A = "claim your free prize today and win big now"
B = "claim your free reward today and win big now" (one word changed)
3-word shingles of A: {claim your free, your free prize, free prize today,
prize today and, today and win, and win big, win big now}
3-word shingles of B: {claim your free, your free reward, free reward today,
reward today and, today and win, and win big, win big now}
shared = 4 (claim your free, today and win, and win big, win big now)
union = 10
J = 4 / 10 = 0.40
A single substitution in a nine-word string knocks the similarity down to 0.40 because the string is so short: the changed word sits inside three of its seven shingles. In a real campaign the body is hundreds of words long, so one personalized greeting or unsubscribe token disturbs a handful of shingles out of hundreds and the Jaccard similarity barely moves off 1.0. That is the whole trick: overlapping windows localize an edit instead of letting it ripple through the entire fingerprint.
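The arithmetic in the worked example can be checked in a few lines. This is a sketch, not any product's implementation: the function names are made up for illustration, tokenization is plain whitespace splitting, and the 200-word "template" is synthetic.

```python
def word_shingles(text, size=3):
    """Return the set of overlapping word windows of the given size."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


shingles_a = word_shingles("claim your free prize today and win big now")
shingles_b = word_shingles("claim your free reward today and win big now")

print(len(shingles_a & shingles_b), len(shingles_a | shingles_b))  # 4 10
print(jaccard(shingles_a, shingles_b))  # 0.4

# The same one-word edit inside a synthetic 200-word body barely registers.
template_words = [f"word{i}" for i in range(200)]
personalized_words = list(template_words)
personalized_words[100] = "alice"

long_similarity = jaccard(
    word_shingles(" ".join(template_words)),
    word_shingles(" ".join(personalized_words)),
)
print(round(long_similarity, 3))  # 0.97
```

In the long body the edit still touches exactly three shingles, but now out of 198, so 195 are shared and the union is 201.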
Comparing full shingle sets is expensive, so locality-sensitive hashing (LSH) estimates J without storing the sets. The well-known recipe (minhashing) hashes each shingle several different ways and keeps the minimum value under each hash; the chance that two documents share a given minimum equals their Jaccard similarity, so a fixed-size signature approximates the overlap. Rspamd’s documented design is exactly this shape - “32 hashes per shingle,” with a similarity score that “reflects how many shingles match.” The defining property of any such scheme is that the output changes a little when the input changes a little, which is precisely what a cryptographic hash refuses to do.
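A minhash signature can be sketched as follows. Everything here is an assumption for illustration: the seeded hash is built from Python's `hashlib.blake2b` with a per-seed salt, the signature length of 128 is arbitrary, and none of it is taken from Rspamd's or DCC's source.

```python
import hashlib


def word_shingles(text, size=3):
    """Return the set of overlapping word windows of the given size."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def seeded_hash(shingle, seed):
    """64-bit hash of a shingle under one member of a family of seeded hashes."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles, num_hashes=128):
    """Keep the minimum hash value of the shingle set under each seeded hash."""
    if not shingles:
        raise ValueError("need at least one shingle")
    return [min(seeded_hash(s, seed) for s in shingles) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Synthetic 200-word template and a copy with one word personalized.
template_words = [f"word{i}" for i in range(200)]
personalized_words = list(template_words)
personalized_words[100] = "alice"

sig_template = minhash_signature(word_shingles(" ".join(template_words)))
sig_personal = minhash_signature(word_shingles(" ".join(personalized_words)))

print(estimate_jaccard(sig_template, sig_personal))
```

The two shingle sets have a true Jaccard similarity of 195/201, about 0.97, and the printed estimate should land near that value. Only the fixed-length signatures need to be stored or exchanged, not the shingle sets themselves.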
Approach 1 - DCC’s fuzzy checksums
DCC (Distributed Checksum Clearinghouses) takes the “ignore what varies” route. Its fuzzy checksums “include values that are constant across common variations in bulk messages, including ‘personalizations’,” and the dcc(8) manual puts the design goal plainly: “The fuzzy checksums are designed to ignore only differences that do not affect meanings.” The algorithm is also a moving target: “The fuzzy checksums are changed as spam evolves. Since DCC started being used in late 2000, the fuzzy checksums have been modified several times.”
What is documented is the set of checksums a DCC client computes per message - even though the precise fuzzy-filtering algorithm behind Fuz1/Fuz2 is not disclosed. The dcc(8) man page lists them:
| Checksum | What it covers |
|---|---|
| IP | IP address of the SMTP client |
| env_From | SMTP envelope sender |
| From | From: header line |
| Message-ID | Message-ID: header line |
| Received | last Received: header line |
| substitute | a header chosen by the DCC client |
| Body | the SMTP body ignoring white-space |
| Fuz1 | a filtered or “fuzzy” body checksum |
| Fuz2 | a second, differently-filtered fuzzy body checksum |
Body is already whitespace-insensitive (a near-exact match), while Fuz1 and Fuz2 are the genuinely fuzzy layers; the man page notes they are “omitted if the message body is empty or contains too little of the right kind of information for the checksum to be computed.” A privacy consequence falls out of the design: a DCC server “accumulates counts of cryptographic checksums of messages but not the messages themselves,” and the env_To checksum is never sent to servers at all.
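To make "ignoring white-space" concrete, here is a sketch of a whitespace-insensitive body digest. It is not DCC's code: SHA-256 is an assumed stand-in for whatever digest DCC uses, the function name is hypothetical, and the undisclosed Fuz1/Fuz2 filtering is deliberately not attempted.

```python
import hashlib


def body_digest_ignoring_whitespace(body: str) -> str:
    """Digest a body after removing every whitespace character (illustrative only)."""
    stripped = "".join(body.split())
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


# Same words, different line wrapping and spacing.
wrapped = "Claim your free prize\r\n   today and win big now\r\n"
unwrapped = "Claim your  free prize today and win big now"
# One word changed: whitespace-insensitivity alone does not survive this.
reworded = "Claim your free reward today and win big now"

print(body_digest_ignoring_whitespace(wrapped) == body_digest_ignoring_whitespace(unwrapped))
print(body_digest_ignoring_whitespace(wrapped) == body_digest_ignoring_whitespace(reworded))
```

The first comparison matches and the second does not, which is why a checksum of this kind counts only as near-exact and the fuzzy layers are needed on top of it.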
The important conceptual point is what DCC measures. It counts how many recipients across its network reported a message with a given fuzzy checksum - that is, how bulk the message is - not whether the content is objectionable. DCC’s own framing: “DCC does not ‘list’ domain names or IP addresses, but detects bulk mail messages.” (The full DCC, Pyzor, and Razor comparison lives on its own page.)
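The counting idea can be sketched with a toy in-memory clearinghouse. This is an assumption-laden illustration, not DCC's protocol: the class name, the `report` and `is_bulk` methods, and the threshold value are all hypothetical.

```python
from collections import Counter


class ToyClearinghouse:
    """Stores only checksum -> recipient count, never message content."""

    def __init__(self, bulk_threshold):
        self.bulk_threshold = bulk_threshold
        self.counts = Counter()

    def report(self, checksum, recipients=1):
        """Add the number of recipients that saw this checksum; return the new total."""
        self.counts[checksum] += recipients
        return self.counts[checksum]

    def is_bulk(self, checksum):
        """Bulk means 'seen by many recipients'; it says nothing about content."""
        return self.counts[checksum] >= self.bulk_threshold


clearinghouse = ToyClearinghouse(bulk_threshold=1000)

# Three hypothetical receivers report the same fuzzy checksum.
for seen_by in (400, 350, 300):
    clearinghouse.report("fuzzy-checksum-of-campaign", recipients=seen_by)
clearinghouse.report("fuzzy-checksum-of-personal-note")

print(clearinghouse.is_bulk("fuzzy-checksum-of-campaign"))
print(clearinghouse.is_bulk("fuzzy-checksum-of-personal-note"))
```

Nothing in the counter distinguishes a subscribed newsletter from a spam run; both would cross the threshold, which is the point the section makes about bulk versus spam.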
Approach 2 - Rspamd’s shingles algorithm
Rspamd’s fuzzy_check takes the probabilistic route for text. The documented pipeline:
- Shingling. “Text is split into overlapping word sequences (trigrams/3-grams).” A shingle is a short window of consecutive words; overlapping them means a small edit only disturbs a few shingles, not all of them.
- Hashing. “Each sequence is hashed using multiple hash functions (32 hashes per shingle).”
- Similarity scoring. “When a new message arrives, its shingles are compared against stored patterns. The similarity score reflects how many shingles match, allowing detection of messages that are similar but not identical.”
This is the defining behavior of locality-sensitive hashing: small input changes produce small output changes, so “how much do these two messages overlap?” becomes a countable quantity. Rspamd states the payoff directly - fuzzy hashing is “particularly effective against spam campaigns where the same message template is sent to many recipients with minor variations.”
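The comparison step can be sketched as a lookup over stored signatures. The assumptions are loud ones: signatures are taken as already-computed tuples of 32 integers, similarity is the fraction of positions that agree, and the function names, storage layout, and 0.5 cut-off are hypothetical rather than Rspamd's actual internals.

```python
def signature_similarity(sig_a, sig_b):
    """Fraction of positions at which two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def best_stored_match(signature, stored, min_similarity=0.5):
    """Return (pattern_id, similarity) for the closest stored pattern, or (None, best)."""
    best_id, best_score = None, 0.0
    for pattern_id, pattern_sig in stored.items():
        score = signature_similarity(signature, pattern_sig)
        if score > best_score:
            best_id, best_score = pattern_id, score
    if best_score >= min_similarity:
        return best_id, best_score
    return None, best_score


# Hypothetical stored patterns, 32 values each.
stored_patterns = {
    "campaign-1": tuple(range(32)),
    "campaign-2": tuple(range(100, 132)),
}

# An incoming message whose signature differs from campaign-1 in 4 of 32 positions.
incoming = tuple(v if i % 8 else -1 for i, v in enumerate(range(32)))

print(best_stored_match(incoming, stored_patterns))  # ('campaign-1', 0.875)
```

A message with no close relative in storage comes back as `(None, ...)`, so a weak overlap never turns into a match by default.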
Why overlapping trigrams survive edits (illustrative)
"claim your free prize now"
-> [claim your free] [your free prize] [free prize now]
"claim your free reward now" (one word changed)
-> [claim your free] [your free reward] [free reward now]
The first shingle still matches. In a body hundreds of words long, nearly
all of them would, so the similarity score stays high and the campaign is
still recognized.
Not everything should be fuzzy
A subtle, well-engineered detail: Rspamd does not fuzzy-hash attachments and images. “Unlike text, attachments and images use exact matching via blake2b digests. A hash is computed for the entire content and matched exactly against stored hashes.” The reasoning is sound - an identical attached PDF or image across many messages is itself a strong, clean bulk signal, and exact matching is faster and unambiguous. Fuzzy matching is the right tool for prose that gets reworded; exact matching is the right tool for binary payloads that are copied verbatim.
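Exact matching of a binary payload is a one-line digest plus a set lookup. A minimal sketch, assuming Python's `hashlib.blake2b` with its default 64-byte digest; the digest size Rspamd uses, the storage structure, and the sample bytes are assumptions for illustration.

```python
import hashlib


def attachment_digest(content: bytes) -> str:
    """BLAKE2b hex digest of the entire attachment content."""
    return hashlib.blake2b(content).hexdigest()


# Hypothetical store of digests for attachments already reported as bulk.
reported_payload = b"%PDF-1.4 hypothetical invoice payload"
known_attachment_digests = {attachment_digest(reported_payload)}


def is_known_attachment(content: bytes) -> bool:
    """True only when the content is byte-for-byte identical to a stored one."""
    return attachment_digest(content) in known_attachment_digests


print(is_known_attachment(reported_payload))
print(is_known_attachment(reported_payload + b" "))
```

The second lookup fails because one appended byte changes the digest, which is acceptable here: payloads that are copied verbatim are precisely the case exact matching is meant for.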
Hashing structure, not just words
Since version 3.14.0, Rspamd also fuzzy-hashes HTML structure - “DOM structure, layout, and link patterns - independent of text content.” This catches campaigns that change all the words but reuse the same template skeleton. It is built from weighted components (structure shingles 50%, call-to-action domains 30%, all domains 15%, structural features like tag and link counts 5%), with each tag normalized to a token like tagname[.class][@domain]. Notably, tracking classes such as utm_* and analytics_* are filtered out and dynamic classes (GUIDs, timestamps) ignored, so cosmetic per-message noise does not break the match.
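The tag normalization can be sketched with the standard library's HTML parser. This is a loose illustration of the `tagname[.class][@domain]` token shape only: keeping just the first surviving class, the regular expressions for "tracking" and "dynamic" classes, and the sample HTML are all assumptions, not Rspamd's rules.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed heuristics: tracking prefixes named in the text, plus GUID-like or
# long-digit (timestamp-like) class names treated as dynamic.
TRACKING_CLASS = re.compile(r"^(utm_|analytics_)")
DYNAMIC_CLASS = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-|\d{10,}", re.IGNORECASE)


class StructureTokenizer(HTMLParser):
    """Collects one normalized token per opening tag, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        token = tag
        classes = [
            name
            for name in (attributes.get("class") or "").split()
            if not TRACKING_CLASS.match(name) and not DYNAMIC_CLASS.search(name)
        ]
        if classes:
            token += "." + classes[0]
        domain = urlparse(attributes.get("href") or "").hostname
        if domain:
            token += "@" + domain
        self.tokens.append(token)


def structure_tokens(html: str) -> list:
    """Return the normalized tag tokens for an HTML fragment."""
    parser = StructureTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


html_a = '<div class="hero utm_spring"><a class="btn" href="https://shop.example/deal?id=1">Claim your prize</a></div>'
html_b = '<div class="hero utm_summer"><a class="btn analytics_77" href="https://shop.example/deal?id=2">Collect your reward</a></div>'

print(structure_tokens(html_a))  # ['div.hero', 'a.btn@shop.example']
print(structure_tokens(html_a) == structure_tokens(html_b))  # True
```

Both fragments reduce to the same token sequence even though every visible word and every tracking class differs, so shingling over these tokens would still recognize the shared skeleton.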
Two details are worth pulling out. First, there is a minimum-complexity gate - the docs require at least min_html_tags tags (default 10), at least 2 links, and a DOM depth of at least 3 - so trivial HTML is not fingerprinted on a structure too small to be distinctive. Second, the call-to-action domains carry heavy weight for a reason: if a phishing message copies a brand’s HTML byte-for-byte but points its buttons at a different domain, “CTA mismatch heavily penalizes similarity (x0.3),” so a cloned layout with swapped links does not register as the same template. This structural matching is detailed on the Rspamd architecture page.
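The gate and the weights quoted above can be put together in a short sketch. The numbers (10 tags, 2 links, depth 3, the 50/30/15/5 split, the 0.3 factor) come from the text; how Rspamd actually composes them is not stated there, so the weighted sum followed by a multiplicative penalty, and every function and key name, are assumptions.

```python
# Component weights as quoted in the text.
WEIGHTS = {
    "structure": 0.50,
    "cta_domains": 0.30,
    "all_domains": 0.15,
    "features": 0.05,
}
CTA_MISMATCH_FACTOR = 0.3


def passes_complexity_gate(tag_count, link_count, dom_depth, min_html_tags=10):
    """Only HTML with enough tags, links, and nesting is worth fingerprinting."""
    return tag_count >= min_html_tags and link_count >= 2 and dom_depth >= 3


def combined_similarity(components, cta_domains_match):
    """Weighted sum of per-component similarities (each 0..1), penalized on CTA mismatch."""
    score = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    if not cta_domains_match:
        score *= CTA_MISMATCH_FACTOR
    return score


# A hypothetical phishing clone: identical layout, buttons pointing elsewhere.
clone = {"structure": 1.0, "cta_domains": 0.0, "all_domains": 0.5, "features": 1.0}

print(passes_complexity_gate(tag_count=42, link_count=5, dom_depth=6))
print(passes_complexity_gate(tag_count=4, link_count=1, dom_depth=2))
print(combined_similarity(clone, cta_domains_match=False))
print(combined_similarity(clone, cta_domains_match=True))
```

Under these assumptions the clone scores well below a third of what the same layout would score with matching call-to-action domains, which is the behavior the quoted penalty is meant to produce.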
Confidence grows with reports
A single sighting of a fuzzy hash is weak evidence - it might be a false match or a one-off. Both count-based and weight-based systems address this by accumulating evidence across reports. Rspamd attaches a weight (“hits”) to each stored hash that “accumulates as users report the same content,” and its scoring uses a hyperbolic-tangent curve so the score “increases gradually from threshold to 2x threshold,” explicitly preventing “a single report from triggering the maximum score while ensuring well-confirmed spam gets full weight.” Pyzor similarly returns how many times a digest “has been reported as spam or whitelisted as not-spam.” The pattern is the same everywhere: near-duplicate detection earns confidence from consensus, not from one observation.
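The shape of such a curve can be sketched with `math.tanh`. This is not Rspamd's formula: the function name, the choice to return zero at or below the threshold, and the steepness constant (Euler's number, picked so the curve is nearly saturated at twice the threshold) are assumptions made to match the described behavior.

```python
import math


def fuzzy_score(hits, threshold, max_score):
    """Ramp from 0 at `threshold` toward `max_score` near 2 * `threshold`."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if hits <= threshold:
        return 0.0
    # progress is 0 at the threshold and 1 at twice the threshold.
    progress = (hits - threshold) / threshold
    return max_score * math.tanh(math.e * progress)


# Hypothetical settings: threshold of 10 accumulated hits, maximum score of 5.
for hits in (5, 10, 11, 15, 20, 100):
    print(hits, round(fuzzy_score(hits, threshold=10, max_score=5.0), 3))
```

One report past the threshold earns only a fraction of the maximum, while a hash with twice the threshold's weight is already close to full score and cannot exceed it.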
The unavoidable caveat: bulk is not spam
This is the most important thing for a sender to understand, and it is a design principle, not an accident. Wanted bulk mail looks identical to unwanted bulk mail from a near-duplicate standpoint. A double-opt-in newsletter sent to 200,000 subscribers is exactly as “bulk,” exactly as templated, and exactly as fingerprint-able as a spam campaign of the same size.
The systems are built knowing this:
- DCC says outright that it detects bulk, and “only mail targets can say whether a message is solicited.”
- Pyzor includes an explicit whitelist-as-not-spam operation alongside report-as-spam.
- Rspamd reserves a dedicated whitelist flag (flag 3, FUZZY_WHITE) for legitimate content, contributing a negative weight.
So a near-duplicate match is never meant to be a verdict. It is a score input that should be combined with whitelists for known wanted-bulk senders and with other signals (authentication results, complaint rates). The way a legitimate bulk sender stays on the right side of this layer is not to look less bulk - it is to be genuinely wanted: real consent, easy unsubscribe, consistent identity, so recipients and their providers whitelist rather than report. The collaborative networks that turn these fingerprints into a shared consensus - and the whitelist mechanism each one ships - are covered in detail in DCC, Pyzor, and Razor, and the reason a misfire here is treated so seriously is the subject of false positives and ham protection.
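The "score input, not verdict" idea can be sketched as a sum of weighted signals. Only FUZZY_WHITE is a name taken from the text; the other symbol names, every weight, and the threshold are hypothetical values chosen for illustration.

```python
def total_score(symbols):
    """Sum the weights of all signals that fired for a message."""
    return sum(symbols.values())


def verdict(symbols, flag_threshold=6.0):
    """A single cut-off on the combined score (threshold value is an assumption)."""
    return "flag" if total_score(symbols) >= flag_threshold else "deliver"


# The same near-duplicate match, in two different contexts.
wanted_newsletter = {
    "FUZZY_BULK_MATCH": 5.0,   # hypothetical symbol: fingerprint seen widely
    "FUZZY_WHITE": -5.0,       # whitelist hit contributes a negative weight
    "AUTH_ALIGNED": -0.5,      # hypothetical symbol: authentication passes and aligns
}
unwanted_campaign = {
    "FUZZY_BULK_MATCH": 5.0,
    "HIGH_COMPLAINT_RATE": 4.0,  # hypothetical symbol: recipients are reporting it
}

print(total_score(wanted_newsletter), verdict(wanted_newsletter))
print(total_score(unwanted_campaign), verdict(unwanted_campaign))
```

The bulk signal is identical in both cases; what moves the outcome is everything around it, which is why the fingerprint alone should never be a block.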
A note on naming the techniques
You will see specific named schemes (for example, Nilsimsa and ssdeep) referenced elsewhere as classic locality-sensitive hashing algorithms. This page intentionally does not describe their internals, because they are outside the primary sources verified for this library; what is documented here - DCC’s fuzzy checksums and Rspamd’s shingle algorithm - is enough to understand the principle. The same restraint applies even within the systems we do document: DCC publishes the list of checksums it computes but not the exact filtering that turns a body into Fuz1 and Fuz2, and the introduction page for Pyzor does not specify how its digest is built (the parts Pyzor’s protocol documentation does disclose are covered on the DCC, Pyzor, and Razor page). Where an algorithm’s internals are not published by a primary source, this library names the technique and stops there rather than guess. The unifying idea behind all of them is identical: a hash whose output changes a little when the input changes a little, so that “similar” becomes measurable.
What this means for you, and what Egressif does
Fuzzy hashing is the layer a legitimate bulk sender most often trips over, and the trap is to misread it as a content problem. It is not. Your newsletter is supposed to be templated and bulk; that is what a newsletter is. What separates wanted bulk from spam, in the eyes of these systems, is entirely outside the hash: genuine consent, recipients who do not report you, and senders who get whitelisted.
Egressif treats this as a consent-and-consistency problem, which is where it actually lives. We keep your sending identity stable and your authentication aligned so receivers can recognize and whitelist you, and we hold list hygiene to the standard that keeps complaint rates low - because near-duplicate detection will always group your bulk mail with everyone else’s, and the only durable answer is to be the bulk sender recipients actually want. That is ultimately a reputation question: a high fuzzy count plus a whitelist entry is delivery, while the same count plus complaints is trouble.
Related references
- How receiver-side spam filtering actually works: A spam filter is not one test with one threshold. It is a layered pipeline - connection and reputation checks, authentication, statistical content analysis, collaborative checksum networks, and rules engines - whose signals combine into a single decision. This page walks the whole chain so you can see where a legitimate message can go wrong.
- Bayesian and statistical spam filtering: Statistical filters do not match keywords - they learn token probabilities from a receiver's own ham and spam, then combine the most telling tokens with Bayes' rule. This is the history, the math, and the operational caveats, ending with what it means for a legitimate sender.
- DCC, Pyzor, and Razor - collaborative checksum networks: DCC, Pyzor, and Vipul's Razor let many receivers pool what they see, so a message sent in bulk is recognized as bulk even when each copy is personalized. Here is how each one works, how they differ, what they catch and miss, and the whitelist rule every legitimate bulk sender depends on.
- Apache SpamAssassin architecture: SpamAssassin classifies mail by running many rules, each contributing a positive or negative score, and tagging messages whose total crosses a configurable threshold. Here is the scoring model, the plugin architecture, the network tests, the Bayes subsystem, and how training really works.
- Rspamd architecture: Rspamd is an event-driven filtering framework that sits between the MTA and the internet, runs dozens of modules in parallel, sums named symbols into a score, and maps that score to an action. Here is the pipeline, the scoring and action model, fuzzy storage, and why it is fast.
- Greylisting, tarpitting, and rate controls: Before any content is scored, receivers can slow or defer suspicious connections. Greylisting (standardized in RFC 6647) returns a temporary failure that legitimate senders retry and most spamware abandons. Here is the mechanism, the RFC's recommended implementation, the failure modes, and where tarpitting and rate limiting fit.