
DCC, Pyzor, and Razor: Checksum Spam Networks

DCC, Pyzor, and Vipul's Razor let many receivers pool what they see, so a message sent in bulk is recognized as bulk even when each copy is personalized. Here is how each one works, how they differ, what they catch and miss, and the whitelist rule every legitimate bulk sender depends on.

Last checked: June 22, 2026

A single mailbox cannot tell whether a message is bulk. It sees the message once. But if thousands of independent mailboxes could compare notes - “I got this too, and so did I, and I” - bulk mail would stand out instantly. That is the entire idea behind collaborative checksum networks, and DCC states it in one sentence: “The idea of DCC is that if mail recipients could compare the mail they receive, they could recognize unsolicited bulk mail.”

Three systems built on this idea are still referenced and deployed: DCC, Pyzor, and Vipul’s Razor. They are often lumped together, but they measure different things and answer different questions. This page is the comparison, written for senders - because if you send legitimate bulk mail, these networks will see you, and understanding them tells you exactly why consent and whitelisting are the only durable answer.

Bulkiness comes from consensus across many receivers; the network returns a high count and the local filter scores it - a count measures bulk, not whether a message is wanted.

The 60-second version

All three are collaborative: detection power comes from aggregating reports across many independent receivers, not from any one mailbox’s judgment.
DCC counts. It reports fuzzy checksums to servers and tells you how many recipients saw the same message - a measure of bulkiness, not badness.
Pyzor counts digests. A client makes a digest “likely to uniquely identify” a message and asks a server how many times it has been reported as spam (or whitelisted as not-spam).
Razor signs. It builds signatures “that efficiently spot mutating spam content” and returns a 0–100 confidence per message part.
Every one of them has a whitelist / not-spam path, because all three detect bulk, and wanted bulk looks identical to unwanted bulk.
They are best used to feed a score, combined with whitelists and other signals - never as a standalone block.

DCC - Distributed Checksum Clearinghouses

DCC is “an anti-spam content filter that runs on a variety of operating systems,” whose counts “can be used by SMTP servers and mail user agents to detect and reject or filter spam or unsolicited bulk mail.” The mechanism is a count exchange: “A DCC server totals reports of checksums of messages from clients and answers queries about the total counts for checksums of mail messages. A DCC client reports the checksums for a mail message to a server and is told the total number of recipients of mail with each checksum.”

The checksums are fuzzy by necessity (see Fuzzy hashing and near-duplicate detection) - they “include values that are constant across common variations in bulk messages, including ‘personalizations’,” and they change over time as spam evolves.
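
The count exchange described above can be modelled in a few lines. This is a minimal in-memory sketch, not DCC's protocol or code: the class name `ToyClearinghouse` and its methods are hypothetical, and the checksum is treated as an opaque string that some fuzzy function has already produced.

```python
from collections import Counter


class ToyClearinghouse:
    """Hypothetical stand-in for a count server: it stores checksums, never messages."""

    def __init__(self):
        self._totals = Counter()

    def report(self, checksum, recipients=1):
        """Add a client's report and return the running total for that checksum."""
        self._totals[checksum] += recipients
        return self._totals[checksum]

    def query(self, checksum):
        """Return the current total without adding to it."""
        return self._totals[checksum]


if __name__ == "__main__":
    server = ToyClearinghouse()
    # Three independent receivers see copies that map to the same fuzzy checksum.
    for _ in range(3):
        total = server.report("fuzzy-checksum-of-campaign")
    print(total, server.query("never-seen"))
```

No single receiver knows the message is bulk; the total only becomes meaningful once independent reports for the same checksum accumulate.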

The checksums DCC actually computes

A DCC client does not compute one checksum per message; it computes several, and the dcc(8) manual lists them. A server “accumulates counts of cryptographic checksums of messages but not the messages themselves” - it stores fingerprints, never content.

Checksum	What it covers
`IP`	IP address of the SMTP client
`env_From`	SMTP envelope sender
`From`	`From:` header line
`Message-ID`	`Message-ID:` header line
`Received`	last `Received:` header line
`substitute`	a header line chosen by the DCC client
`Body`	the SMTP body ignoring white-space
`Fuz1`	a filtered or “fuzzy” body checksum
`Fuz2`	a second, differently-filtered fuzzy body checksum

The Fuz1/Fuz2 layers are the ones “designed to ignore only differences that do not affect meanings,” and they are “omitted if the message body is empty or contains too little of the right kind of information.” A privacy note worth stating: the env_To checksum (the recipient) is never sent to servers, and a client should not report checksums of mail it knows to be private.
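
To make the "several checksums per message" idea concrete, here is a rough sketch that derives a few of the simpler rows of the table from a raw message. It is an illustration under stated assumptions, not DCC's algorithm: the function name `toy_checksums` is hypothetical, SHA-256 is only a stand-in hash, the header normalization is a guess, and the `Fuz1`/`Fuz2` layers are left out because their filtering is not described here.

```python
import hashlib
from email import message_from_string


def _digest(text):
    # SHA-256 is a stand-in chosen for this sketch, not a claim about DCC internals.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_checksums(raw_message):
    """Return a dict of per-field checksums for a simple, non-multipart message."""
    msg = message_from_string(raw_message)
    sums = {}
    for name in ("From", "Message-ID"):
        value = msg.get(name)
        if value is not None:
            sums[name] = _digest(value.strip())
    body = msg.get_payload()
    if isinstance(body, str):
        # "Body" row: the body with all white-space ignored.
        sums["Body"] = _digest("".join(body.split()))
    # The recipient is deliberately not checksummed here, mirroring the
    # privacy note that env_To is never sent to servers.
    return sums
```

Two copies of a mailing that differ only in `Message-ID` and white-space would share a `Body` checksum in this sketch while their `Message-ID` checksums differ, which is why a client reports several checksums rather than one.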

What DCC actually measures

This is the crux and the most misunderstood part: DCC measures bulk, not spam. In its own words: “Spam is unsolicited bulk mail, and only mail targets can say whether a message is solicited.” And: “DCC does not ‘list’ domain names or IP addresses, but detects bulk mail messages.” A high DCC count means many people received this - which is true of a spam blast and equally true of a popular newsletter. The manual is explicit that the bulk category is full of legitimate mail: “bulk messages include legitimate mail such as order confirmations from merchants, legitimate mailing lists, and empty or test messages.”

The count a client gets back is usually just a number of recipients, but three special values carry extra meaning:

Count	Meaning
a number	how many recipients of messages with this checksum have been reported
`MANY`	the largest value the field holds - “definitely bulk, but not necessarily unsolicited”
`OK` / `OK2`	the checksum has been marked “good” or “half-good” by DCC servers (a server-side allow signal)

Because servers accept reports “from as many targets as possible, including sources that cannot be trusted,” a single angry user could report a message a million times - and the manual shrugs this off precisely because “much legitimate mail is bulk”: the count is not a spam verdict, so inflating it does not turn ham into spam on its own. The decision still rests on the local whitelist and threshold.

That is why DCC rejection “generally requires a whitelist of solicited bulk mail sources.” Recipients who want bulk mail must whitelist it “by adding your IP address, SMTP envelope sender, RFC 2369 SMTP List-* headers, or other characteristics of your mail to their whiteclnt files.” And pointedly: “The opinions of bulk mail senders about whether their messages are spam are irrelevant.” You do not get to declare yourself wanted; the recipient does.
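
The division of labour between the returned count and the recipient's own whitelist can be sketched as a small decision function. Everything named here is hypothetical: `local_verdict`, the string labels it returns, and the threshold of 100 are illustrative choices, and treating `OK2` the same as `OK` is a simplification.

```python
MANY = "MANY"
OK = "OK"
OK2 = "OK2"


def local_verdict(count, sender, whitelist, threshold=100):
    """Combine a server count with the recipient's own whitelist.

    count is an int or one of the special values MANY, OK, OK2.
    whitelist is a set of senders the recipient has chosen to accept.
    """
    if sender in whitelist:
        # The recipient's decision wins, however bulky the message is.
        return "whitelisted"
    if count in (OK, OK2):
        # Server-side allow signal.
        return "server-ok"
    if count == MANY or count >= threshold:
        # Bulk, but still only an input to local scoring, not a spam verdict.
        return "bulk"
    return "not-bulk"
```

Note that the whitelist check comes first: a newsletter with a count of `MANY` is still delivered when the recipient has whitelisted its sender, which is the behaviour the quoted rules describe.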

DCC’s network and operational rules

The client/server exchange is tiny: “a single pair of UDP/IP datagrams of about 150 bytes” per message, often less than a single DNS query. Public servers answer on UDP port 6277.
Servers “flood” (exchange) only the checksums of bulk mail among themselves; effectiveness rises as more servers are connected.
A separate facility, DCC Reputations, automatically computes reputations for sending bulk mail; a reputation “expires automatically a week to 30 days after the last bulk email reported.”

There is an operational rule that matters specifically to anyone running mail at scale. The public DCC servers are for “anonymous DCC clients handling fewer than 100,000 mail messages per day.” Above that, “you should use your own, probably private DCC server.” And DCC’s license is blunt about commercial use: selling the bandwidth and administration of the public servers to third parties “has always been wrong… Blunt words for that include theft and stealing. Vendors of ‘spam appliances’ or services including DCC such as ‘managed email’ must provide DCC servers of their own or contract for DCC services from others.” The DCC software license “is free only to organizations that participate in the global DCC network.” (ISPs filtering mail for their own users are covered.)

Pyzor

Pyzor is “a collaborative, networked system to detect and block spam using digests of messages.” The flow is digest-and-ask: “Using Pyzor client a short digest is generated that is likely to uniquely identify the email message,” then that digest is sent to a server to do one of three things:

1. check how many times it has been reported as spam or whitelisted as not-spam;
2. report the message as spam;
3. whitelist the message as not-spam.

That third operation is the design-level answer for wanted bulk: Pyzor has an explicit not-spam whitelist path, not just a report-spam path. The whole system is GPL, so “people are free to host their own independent servers,” and there is “a well-maintained and actively used public server available (courtesy of SpamExperts) at: public.pyzor.org:24441.” The reference implementation lives at github.com/SpamExperts/pyzor, described as “a Python implementation of a spam-blocking networked system that use spam signatures to identify them.” (Pyzor’s own history notes it “initially started out to be merely a Python implementation of Razor,” but was rebuilt with a new, open protocol because Razor’s server was not open source.)
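
The three operations can be modelled as a toy server that keeps two counters per digest. This is a sketch of the idea only: `ToyDigestServer` and `DigestRecord` are hypothetical names, and nothing here reflects Pyzor's wire protocol or storage.

```python
from dataclasses import dataclass


@dataclass
class DigestRecord:
    reports: int = 0      # times the digest was reported as spam
    whitelists: int = 0   # times the digest was whitelisted as not-spam


class ToyDigestServer:
    """Hypothetical in-memory model of check / report / whitelist."""

    def __init__(self):
        self._records = {}

    def _record(self, digest):
        return self._records.setdefault(digest, DigestRecord())

    def report(self, digest):
        self._record(digest).reports += 1

    def whitelist(self, digest):
        self._record(digest).whitelists += 1

    def check(self, digest):
        """Return (reports, whitelists); unknown digests yield (0, 0)."""
        rec = self._records.get(digest, DigestRecord())
        return rec.reports, rec.whitelists
```

Because `check` returns both counters, a client can see that a digest has been reported many times and also whitelisted, and leave the weighing of the two to local policy.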

How the Pyzor digest is built

Pyzor’s introduction page only calls the digest “a short digest… likely to uniquely identify the email message,” but its protocol documentation actually spells the recipe out, and it is a clean illustration of why a near-duplicate digest is not just a hash of the bytes. Pyzor explains the motive first: “Simply hashing the entire message is an ineffective method… because message headers will differ when the content does not, and because spammers will often try to make a message unique by injecting random/unrelated text.” So the version 2.0 digest deliberately samples and normalizes:

Pyzor 2.0 digest construction
  1. Discard all message headers.
  2. If the message is more than 4 lines long:
       - discard the first 20% of the lines
       - use the next 3 lines
       - discard the next 40% of the lines
       - use the next 3 lines
       - discard the rest
  3. Remove any 'word' (whitespace-separated run) 10+ characters long.
  4. Remove anything that looks like an email address (X@Y).
  5. Remove anything that looks like a URL.
  6. Remove anything that looks like an HTML tag.
  7. Remove all whitespace.
  8. Discard any line shorter than 8 characters.
  9. Hash what remains.

Every step is aimed at the same thing: keep the stable middle of the prose and throw away the parts a spammer personalizes or randomizes. Discarding the headers keeps per-recipient header changes out of the hash entirely; sampling at fixed relative positions in the body means text near the edges, such as a tracking link in a footer, falls outside the sampled lines and does not move the digest; stripping long tokens, addresses, and URLs removes the obvious per-recipient variables; dropping short lines removes layout noise. Pyzor itself flags this as “an easy-to-understand explanation, rather than a technical one,” so the exact field handling beyond this is not asserted here.
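
The listed steps translate fairly directly into code. The sketch below reads the list literally and is not Pyzor's implementation: the function names, the regular expressions, the way the two offsets are computed, and the choice of SHA-1 as the final hash are all assumptions of this example, and step 1 (discarding headers) is assumed to have been done by the caller.

```python
import hashlib
import re

LONG_WORD = re.compile(r"\S{10,}")                    # step 3
EMAIL = re.compile(r"\S+@\S+")                        # step 4
URL = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://\S+")    # step 5
HTML_TAG = re.compile(r"<[^>]*>")                     # step 6
WHITESPACE = re.compile(r"\s+")                       # step 7


def sample_lines(lines):
    """Step 2: keep two 3-line windows from bodies longer than 4 lines."""
    n = len(lines)
    if n <= 4:
        return list(lines)
    first = n * 20 // 100                # discard the first 20%
    second = first + 3 + n * 40 // 100   # then skip a further 40%
    return lines[first:first + 3] + lines[second:second + 3]


def normalize(line):
    """Steps 3 to 7, applied in the listed order."""
    for pattern in (LONG_WORD, EMAIL, URL, HTML_TAG, WHITESPACE):
        line = pattern.sub("", line)
    return line


def toy_digest(body):
    """Digest of a header-less body, following the listed steps literally."""
    kept = [normalize(line) for line in sample_lines(body.splitlines())]
    kept = [line for line in kept if len(line) >= 8]   # step 8
    return hashlib.sha1("".join(kept).encode("utf-8")).hexdigest()  # step 9
```

With this sketch, two copies of a ten-line body that differ only in the greeting on the first line and in a tracking URL produce the same digest, while a change to one of the sampled sentences produces a different one.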

Vipul’s Razor (Razor2)

Razor is “a distributed, collaborative, spam detection and filtering network based on user submissions of spam.” Where DCC counts and Pyzor digests, Razor signs: “Detection is done with signatures that efficiently spot mutating spam content,” and “user input is validated through reputation assignments” so that submissions from untrusted reporters carry less weight.

Operationally, via the SpamAssassin plugin: Razor2 “calculates a signature for each part of a multipart message and then compares those signatures to a database of known spam signatures.” The server “returns a confidence value (0–100) for each part of the message. The part with the highest confidence value is used as the confidence value for the message.” A SpamAssassin rule fires either when check_razor2() sees confidence at or above the configured min_cf, or via check_razor2_range(<engine>,<min>,<max>), which targets a specific signature engine (the plugin documents engine numbers 4 and 8) and a confidence band - so a deployment can weight different Razor engines differently. The plugin defaults to a 5-second timeout and forks a separate process for asynchronous operation (razor_fork default 1), and it requires the Razor2::Client::Agent Perl module.
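
The per-part logic reduces to "take the maximum, then compare". The following is a sketch of that reasoning, not the plugin's code: the function names are hypothetical, engine selection is omitted, and treating the range as inclusive at both ends is an assumption.

```python
def message_confidence(part_confidences):
    """The message-level confidence is the highest per-part value (0 if no parts)."""
    return max(part_confidences, default=0)


def fires_at_min_cf(part_confidences, min_cf):
    """Model of a rule that fires at or above a configured minimum confidence."""
    return message_confidence(part_confidences) >= min_cf


def fires_in_range(part_confidences, low, high):
    """Model of a confidence-band rule; inclusive bounds are an assumption."""
    return low <= message_confidence(part_confidences) <= high
```

For a three-part message whose parts come back at 12, 87, and 40, the message-level confidence is 87, so a band of 51 to 100 would match while a band of 0 to 50 would not.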

The “reputation assignments” point is what separates Razor from a raw count: a submission’s weight depends on the trust the network has built up for the reporter, so a single hostile or careless reporter cannot manufacture a confident signature the way a flood of identical DCC reports might inflate a count. It is signature-plus-reputation rather than signature alone.

Razor’s status in 2026

This is worth stating carefully because it is a “legacy but alive” situation. The open-source Razor client codebase was last updated 2013-06-05 on SourceForge (the project registered there in 2000). By contrast, the Razor network/server infrastructure is still operational: as recently as 2026, users filing domain delisting requests receive responses from Cloudmark support. So the network persists while the client code is unmaintained - classify Razor2 as legacy software running against a live service. (The precise corporate ownership of the network is not documented in a primary source we could verify, so this library does not assert it.)

How they differ, side by side

Dimension	DCC	Pyzor	Razor2
Core technique	Fuzzy checksum, count of recipients	Message digest + report counts	Signatures for mutating content
What the server returns	Total recipient count per checksum	Times reported spam / whitelisted not-spam	Confidence 0–100 per part; max part used
What it primarily measures	Bulkiness	Reported-spam consensus on a digest	Signature match confidence
Whitelist / not-spam path	Local whiteclnt files (recipient-maintained)	Explicit whitelist-as-not-spam operation	Reputation-validated submissions
Stored server-side	Cryptographic checksums, not messages	Digests, not messages	Signatures, not messages
Algorithm disclosure	Checksum families listed; `Fuz1`/`Fuz2` filtering not published	Digest construction documented (2.0 protocol)	Signature scheme not publicly documented
Bad-reporter resistance	Weak per report (counts can be inflated); leans on whitelists/scoring	Counts + not-spam whitelist	Reputation weighting of submitters
Transport	UDP, ~150-byte exchange, port 6277	Public server `public.pyzor.org:24441`	Forked agent, 5s default timeout
Code status (2026)	Active (v2.3.169, Mar 2024)	Active (public server maintained)	Client code unmaintained since 2013; network still live
License note	Free only to network participants; vendors must run their own servers	GPL; self-hosting encouraged	Open-source plugin; network run externally

What collaborative networks catch - and miss

Catch. Their strength is high-volume, templated campaigns. A message blasted to enough recipients accumulates a high DCC count, a high Pyzor report count, or a confident Razor signature - even when each copy is personalized, because the fingerprints are fuzzy/signature-based rather than exact. The aggregation across thousands of independent reporters is what makes a consensus reliable when any single report would be weak.

Miss. Three structural blind spots follow directly from the design:

Low-volume and targeted mail. A spear-phishing message sent to one person is, by definition, not bulk, so a count-based system has nothing to count.
Genuinely novel content. Detection lags reporting - a brand-new campaign is only caught once enough recipients have reported it. There is an inherent warm-up.
Wanted bulk (the false-positive risk). This is the big one. A double-opt-in newsletter is as bulk as any spam. Without whitelisting it will accumulate exactly the same signals. That is not a flaw to fix; it is why every one of these systems ships a whitelist/not-spam path and why the documented best practice is to feed a score, combined with curated whitelists and other signals (authentication failures, complaint rate), rather than to block on a count alone. (See false positives and ham protection.)

In practice these networks are rarely consulted directly - they are wired in as one input to a scoring engine. SpamAssassin reaches them through its DCC, Pyzor, and Razor2 plugins (see Apache SpamAssassin architecture), and Rspamd ships its own native fuzzy_check plus a bundled DCC module (see Rspamd architecture). Either way, a checksum hit is a weighted symbol, not a verdict.
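
"A weighted symbol, not a verdict" can be shown with a toy scorer. This is a generic sketch, not SpamAssassin's or Rspamd's scoring code: the symbol names, the weights, and the threshold are all hypothetical values picked for illustration.

```python
# Hypothetical symbols and weights; real engines define their own.
WEIGHTS = {
    "CHECKSUM_BULK": 1.5,        # a collaborative network says "many recipients"
    "DIGEST_REPORTED": 2.0,      # a digest has been reported as spam
    "AUTH_FAILED": 3.0,          # authentication failure
    "SENDER_WHITELISTED": -5.0,  # recipient-maintained whitelist entry
}


def total_score(symbols, weights=WEIGHTS):
    """Sum the weights of the symbols that fired; unknown symbols count as 0."""
    return sum(weights.get(name, 0.0) for name in symbols)


def is_spam(symbols, threshold=5.0, weights=WEIGHTS):
    """A message is flagged only when the combined score reaches the threshold."""
    return total_score(symbols, weights) >= threshold
```

With these made-up numbers, a bulk hit plus a reported digest scores 3.5 and stays under the threshold, the same two hits plus an authentication failure reach 6.5, and a whitelisted sender with the same bulk signals ends up well below zero.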

What this means for you, and what Egressif does

If you send legitimate bulk mail, these three networks will see you - that is not avoidable and not a problem in itself. Bulk is what a newsletter is. The decisive factor is entirely outside the checksum: are you wanted? A high count plus a recipient whitelist entry is delivery; a high count plus complaints is trouble - which is to say it resolves into a reputation question, not a hashing one. The sender’s job is to be the bulk sender recipients and their providers choose to whitelist.

Egressif works on the inputs that earn that whitelist: a stable sending identity (consistent IPs, envelope sender, and List-* headers that receivers can pin a whitelist to), aligned authentication, and list hygiene that keeps complaint rates low. We do not - and cannot - tell another operator’s DCC server that your mail is solicited; only their recipients can. What we can do is make your mail the kind that gets whitelisted rather than reported.
