SpamAssassin rules for matching Bayes-circumventing spam.

I'd put this in , but they don't seem to speak English there. Or anything with a Latin alphabet, even.

I've been quite happy with SpamAssassin, as I may have mentioned previously.

It worked great by itself for a while, and it kept working at precisely the same efficiency… that is, catching the same percentage of spam: above 90%, for sure, though I haven't run the numbers any time recently, and doing so is no joke considering the sheer volume of email I get. (That's 10428 messages since 1 December 2003, of which 4765 were correctly marked as spam and some other number were missed. It's that other number that's hard to track down, since I don't keep missed spam separate for very long; I look at what kept it from getting caught and alter the system a bit to account for that, if possible.) So SA was still catching the same percentage of spam, but the part it missed, instead of being 3 messages a day, was more like 20. I shit you not.

So, I hunkered down and got my head around how to use SA's Bayesian stuff. That's still working pretty well, for the most part… except for emails like this one:

From tvsmnbhbbo@hongkong.com  Mon Dec 29 00:31:19 2003
Return-Path: 
Delivered-To: gr@eclipsed.net
Received: from icicle.pobox.com (icicle.pobox.com [207.8.214.2])
        by uriel.eclipsed.net (Postfix) with ESMTP id 2973549701
        for ; Mon, 29 Dec 2003 00:31:18 -0500 (EST)
Received: from icicle.pobox.com (localhost [127.0.0.1])
        by icicle.pobox.com (Postfix) with ESMTP id 2E9A6C0977
        for ; Mon, 29 Dec 2003 00:31:16 -0500 (EST)
Delivered-To: rosenkoetter@pobox.com
Received: from colander (localhost [127.0.0.1])
        by icicle.pobox.com (Postfix) with ESMTP id DE0ECC096F;
        Mon, 29 Dec 2003 00:31:15 -0500 (EST)
Received: from X1 (unknown [218.89.64.181])
        by icicle.pobox.com (Postfix) with SMTP id 3C5CEC0902;
        Mon, 29 Dec 2003 00:31:06 -0500 (EST)
Received: from [218.89.64.181] by
Message-Id: <20031229053116.2E9A6C0977@icicle.pobox.com>
Date: Mon, 29 Dec 2003 00:31:16 -0500 (EST)
From: tvsmnbhbbo@hongkong.com
To: undisclosed-recipients: ;
X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp) on uriel.eclipsed.net
X-Spam-Status: No, hits=-1.3 required=4.1 tests=BAYES_01,HTML_MESSAGE,NO_REAL_NAME autolearn=no version=2.60
X-Spam-Level:
Status: RO
Content-Length: 1440
Lines: 41

IP with HTTP;
        Mon, 29 Dec 2003 03:26:22 -0200
From: "Blanche" 
To: rosemp@pobox.com
Subject: Re: QUZS, over the eyes
Mime-Version: 1.0
X-Mailer: mPOP Web-Mail 2.19
X-Originating-IP: [
IP]
Date: Mon, 29 Dec 2003 04:28:22 -0100
Reply-To: "Hendrickson Blanche" 
Content-Type: multipart/alternative;
        boundary="--ALT--TUHE57977555257899"
Message-Id: 

----ALT--TUHE57977555257899
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 8bit

chantry fill amaze kilohm luzon perceive they're mold doff
leachate dust candelabra supplant bandgap evolutionary manifest thomas
confirmation cauliflower garland cal desmond hued sanskrit airfare theme chuckle

----ALT--TUHE57977555257899
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 8bit

----ALT--TUHE57977555257899
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 8bit

<HTML><HEAD>
<BODY>
<p>Fr</dirt>ee Ca</reactionary>ble* TV</p>
<a href="http://www.
/cable/">
<img border="0" src="http://www.
/fiter.jpg"></a>
acute edgerton infusion cdc cacao cinematic langmuir vineyard satiate beecham mali snore vicksburg burgess perpetual buss bulky guatemala page aboveground disgustful <BR>
transmissible bravado cavalier faucet phase associable eagle corrugate clapeyron wightman bestir bankrupt annuli forfeit tap broth emile coulter orville architecture breakwater loathe resourceful incipient baldwin patina <BR>

</BODY>
</HTML>

----ALT--TUHE57977555257899--

Note that this is very obviously spam… but because of the incredibly low Bayes score (the middle of the Bayes range is “do nothing”, low-scoring emails are probably real email, and high-scoring emails are probably spam), it ended up with a negative overall score.

SUCK!

I just came back from being mostly away from computers for a week. I'd been reading my email while I was away, and just dumping the spam that SA missed into an mbox. That was one hundred fifty-nine messages in a week, and all of them were like this one.

So, without further ado, here are the SA rules I just wrote that, on a test volume of about 700 messages, 159 of them spam, get zero false positives and one false negative (because that one spammer seems to have seeded his random character generator slightly better… or because he actually broke something; which of the two isn't totally clear):

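# Mail claiming to come from the mPOP Web-Mail client
header __MPOP_MUA X-Mailer =~ /\bmPOP Web-Mail\b/
# Subject like "Re: QUZS, over" (or the literal, unexpanded %RND_UC_CHAR[2-8] token)
header __ANTIBAYES_SUBJECT Subject =~ /\bRe: ([A-Z]+|%RND_UC_CHAR\[2-8\]),\s+\p{IsGraph}+\b/
# Message-Id with 7 capital letters, a hyphen, and 13 digits before the @
header __ANTIBAYES_MESSAGEID MESSAGEID =~ /\b[A-Z]{7}-[0-9]{13}@\b/
# Fire only when all three sub-rules match
meta ANTIBAYES_SPAM (__MPOP_MUA && __ANTIBAYES_SUBJECT && __ANTIBAYES_MESSAGEID)
describe ANTIBAYES_SPAM Matches MO for those attempting to circumvent Bayes filters.
score ANTIBAYES_SPAM 1.01 1.01 3.01 3.01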
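
For anyone who wants to poke at the three header patterns outside of SA, here is a rough Python sketch of the same logic. A few assumptions: Python's `re` has no `\p{IsGraph}`, so `[!-~]` (printable ASCII, no spaces) stands in for it; the Message-Id value is made up, since the real one is missing from the sample above; and SA's `MESSAGEID` pseudo-header also looks at Resent-Message-Id and X-Message-Id, which this ignores. The function name `antibayes_spam` is mine, not anything SA provides.

```python
import re

# Python translations of the three SA sub-rules.
MPOP_MUA = re.compile(r"\bmPOP Web-Mail\b")
# [!-~] approximates Perl's \p{IsGraph} for ASCII-only headers.
SUBJECT = re.compile(r"\bRe: ([A-Z]+|%RND_UC_CHAR\[2-8\]),\s+[!-~]+\b")
MESSAGE_ID = re.compile(r"\b[A-Z]{7}-[0-9]{13}@\b")


def antibayes_spam(headers):
    """Mimic the meta rule: True only if all three header patterns match."""
    return bool(
        MPOP_MUA.search(headers.get("X-Mailer", ""))
        and SUBJECT.search(headers.get("Subject", ""))
        and MESSAGE_ID.search(headers.get("Message-Id", ""))
    )


# X-Mailer and Subject come from the sample spam; the Message-Id is hypothetical.
sample = {
    "X-Mailer": "mPOP Web-Mail 2.19",
    "Subject": "Re: QUZS, over the eyes",
    "Message-Id": "<ABCDEFG-1234567890123@example.com>",
}

# A made-up legitimate message for contrast.
legit = {
    "X-Mailer": "Mutt/1.4.1i",
    "Subject": "Re: lunch, tomorrow?",
    "Message-Id": "<20031229053116.GA1234@example.org>",
}

print(antibayes_spam(sample))
print(antibayes_spam(legit))
```

Requiring all three matches at once is what keeps the false-positive rate down: any one of these patterns alone could plausibly show up in real mail.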

Note that the scores still need tuning (not “some” but “a lot”, probably). The basic premise is that if you're not using the Bayes rules, this is just a very strong indicator of spam, and shouldn't override things too much. If you're using Bayes, its value needs to be cranked way up (possibly higher than the 3.01 given in the rule) to counteract what Bayes is going to tell you about the message.
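
To put numbers on "cranked way up", here is a small Python sketch of the arithmetic for the sample message. The hits and required values come straight from its X-Spam-Status header (hits=-1.3, required=4.1). The four values on a `score` line are, in order: Bayes off and network tests off, Bayes off and network tests on, Bayes on and network tests off, both on; so 3.01 is what applies whenever Bayes is in use. The sketch assumes no other rule's score changes and that a message counts as spam once its total reaches the required value.

```python
from decimal import Decimal

# Values from the sample message's X-Spam-Status header.
hits = Decimal("-1.3")
required = Decimal("4.1")

# ANTIBAYES_SPAM score when Bayes is enabled (third and fourth score values).
rule_score = Decimal("3.01")

# Total the sample would get if ANTIBAYES_SPAM had fired on it.
new_total = hits + rule_score
is_spam = new_total >= required

# Smallest rule score that would push this particular message to the threshold.
min_score = required - hits

print(new_total)  # 1.71
print(is_spam)    # False
print(min_score)  # 5.4
```

So for this message, 3.01 is not enough on its own; something around 5.4 would be, at least as long as Bayes keeps handing out BAYES_01.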

Potential future additions involve parsing the text/html MIME part to remove the (bogus, just there so that Bayes won't work) HTML tags, and matching on the strings there (in the emails I've got, there are three, maybe four different messages here, and they repeat days later, so this probably would be functionally useful). Perhaps a more general aid would be to look for in-word tags and, if you see more than four or five in a message, raise a flag. That's a bit more complex than I know how to do with a single regexp, though, and I haven't read up on how to add full-fledged tests as if they were SA-included subs just yet.
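
Here is a rough Python sketch of the in-word tag idea, just to show the shape of it. This is not an SA plugin, and the function names and the default threshold are my own placeholders (the threshold follows the "more than four or five" guess above). It treats any tag with a letter directly on both sides as "in-word", which will also catch legitimate markup inside a word, hence the threshold rather than flagging on the first hit.

```python
import re

# A tag (opening or closing) with a letter immediately before and after it,
# e.g. the "</dirt>" in "Fr</dirt>ee".
INWORD_TAG = re.compile(r"(?<=[A-Za-z])</?[A-Za-z][^>]*>(?=[A-Za-z])")


def count_inword_tags(html):
    """Count tags that sit in the middle of a word."""
    return len(INWORD_TAG.findall(html))


def strip_inword_tags(html):
    """Remove in-word tags so the hidden words can be matched as plain strings."""
    return INWORD_TAG.sub("", html)


def looks_obfuscated(html, threshold=4):
    """Raise a flag when a message has more than `threshold` in-word tags."""
    return count_inword_tags(html) > threshold


# The obfuscated line from the sample spam.
line = "<p>Fr</dirt>ee Ca</reactionary>ble* TV</p>"

print(count_inword_tags(line))  # 2
print(strip_inword_tags(line))  # <p>Free Cable* TV</p>
print(looks_obfuscated(line))   # False
```

Note that the one sample line only has two in-word tags, so a threshold of four would not trip on it alone; either the threshold needs to come down or the stripped text needs to be matched against known strings.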
