I'd put this in
I've been quite happy with SpamAssassin, as I may have mentioned previously.
It worked great by itself for a while, and it kept working at precisely the same efficiency… that is, catching the same percentage of spam: above 90%, for sure, though I haven't run the numbers any time recently, and doing so is no joke considering the sheer volume of email I get (10428 messages since 1 December 2003, of which 4765 were correctly marked as spam and some other number were missed; it's that other number that's hard to track down, since I don't keep missed spam separate for very long; I look at what kept it from getting caught and alter the system a bit to account for that, if possible). So SA was still catching the same percentage of spam, only instead of the missed remainder being 3 messages a day, it was more like 20 messages. I shit you not.
So, I hunkered down and got my head around how to use SA's Bayesian stuff. That's still working pretty well, for the most part… except for
From tvsmnbhbbo@hongkong.com Mon Dec 29 00:31:19 2003 Return-Path:Delivered-To: gr@eclipsed.net Received: from icicle.pobox.com (icicle.pobox.com [207.8.214.2]) by uriel.eclipsed.net (Postfix) with ESMTP id 2973549701 for ; Mon, 29 Dec 2003 00:31:18 -0500 (EST) Received: from icicle.pobox.com (localhost [127.0.0.1]) by icicle.pobox.com (Postfix) with ESMTP id 2E9A6C0977 for ; Mon, 29 Dec 2003 00:31:16 -0500 (EST) Delivered-To: rosenkoetter@pobox.com Received: from colander (localhost [127.0.0.1]) by icicle.pobox.com (Postfix) with ESMTP id DE0ECC096F; Mon, 29 Dec 2003 00:31:15 -0500 (EST) Received: from X1 (unknown [218.89.64.181]) by icicle.pobox.com (Postfix) with SMTP id 3C5CEC0902; Mon, 29 Dec 2003 00:31:06 -0500 (EST) Received: from [218.89.64.181] by Message-Id: <20031229053116.2E9A6C0977@icicle.pobox.com> Date: Mon, 29 Dec 2003 00:31:16 -0500 (EST) From: tvsmnbhbbo@hongkong.com To: undisclosed-recipients: ; X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp) on uriel.eclipsed.net X-Spam-Status: No, hits=-1.3 required=4.1 tests=BAYES_01,HTML_MESSAGE,NO_REAL_NAME autolearn=no version=2.60 X-Spam-Level: Status: RO Content-Length: 1440 Lines: 41 IP with HTTP; Mon, 29 Dec 2003 03:26:22 -0200 From: "Blanche" To: rosemp@pobox.com Subject: Re: QUZS, over the eyes Mime-Version: 1.0 X-Mailer: mPOP Web-Mail 2.19 X-Originating-IP: [ IP] Date: Mon, 29 Dec 2003 04:28:22 -0100 Reply-To: "Hendrickson Blanche" Content-Type: multipart/alternative; boundary="--ALT--TUHE57977555257899" Message-Id: ----ALT--TUHE57977555257899 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit chantry fill amaze kilohm luzon perceive they're mold doff leachate dust candelabra supplant bandgap evolutionary manifest thomas confirmation cauliflower garland cal desmond hued sanskrit airfare theme chuckle ----ALT--TUHE57977555257899 Content-Type: text/html; charset=us-ascii Content-Transfer-Encoding: 8bit ----ALT--TUHE57977555257899 Content-Type: text/html; 
charset=us-ascii Content-Transfer-Encoding: 8bit <HTML><HEAD> <BODY> <p>Fr</dirt>ee Ca</reactionary>ble* TV</p> <a href="http://www. /cable/"> <img border="0" src="http://www. /fiter.jpg"></a> acute edgerton infusion cdc cacao cinematic langmuir vineyard satiate beecham mali snore vicksburg burgess perpetual buss bulky guatemala page aboveground disgustful <BR> transmissible bravado cavalier faucet phase associable eagle corrugate clapeyron wightman bestir bankrupt annuli forfeit tap broth emile coulter orville architecture breakwater loathe resourceful incipient baldwin patina <BR> </BODY> </HTML> ----ALT--TUHE57977555257899--
Note that this is very obviously spam… but because of the incredibly low Bayes score (the middle of the Bayes range is “do nothing”, low-scoring emails are probably real email, and high-scoring emails are probably spam), it ended up with a negative overall score.
SUCK!
I just came back from being mostly away from computers for a week. I'd been reading my email while I was away, and just dumping the spam that SA missed in an mbox. That was one hundred fifty-nine messages in a week, and all of them were like this one.
So, without further ado, here are the SA rules I just wrote that, on a test volume of about 700 messages, 159 of them spam, get zero false positives and one false negative (because that one spammer seems to have seeded his random character generator slightly better… or because he actually broke something; which, isn't totally clear):
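# Sub-rules (the leading double underscore means they carry no score of their own)
header __MPOP_MUA X-Mailer =~ /\bmPOP Web-Mail\b/
# Subject is "Re: ", a run of capitals (or the spammer's unexpanded template), a comma, then a word
header __ANTIBAYES_SUBJECT Subject =~ /\bRe: ([A-Z]+|%RND_UC_CHAR\[2-8\]),\s+\p{IsGraph}+\b/
# Message-ID is seven capitals, a hyphen, then thirteen digits before the @
header __ANTIBAYES_MESSAGEID MESSAGEID =~ /\b[A-Z]{7}-[0-9]{13}@\b/
meta ANTIBAYES_SPAM (__MPOP_MUA && __ANTIBAYES_SUBJECT && __ANTIBAYES_MESSAGEID)
describe ANTIBAYES_SPAM Matches MO for those attempting to circumvent Bayes filters.
# Four scores: no Bayes/no net tests, no Bayes/net tests, Bayes/no net tests, Bayes/net tests
score ANTIBAYES_SPAM 1.01 1.01 3.01 3.01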
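For anyone who wants to poke at the three patterns outside of SA, here is a rough Python sketch of the same logic. It is an approximation, not what SA actually runs: Python's `re` has no `\p{IsGraph}`, so `[!-~]` (printable, non-space ASCII) stands in for it, and the Message-Id value is made up to fit the pattern, since the one in the sample above didn't survive. The function name and the dict layout are mine, purely for illustration.

```python
import re

# Approximations of the three sub-rules. [!-~] replaces Perl's \p{IsGraph},
# which Python's re module does not support (ASCII-only stand-in).
MPOP_MUA = re.compile(r"\bmPOP Web-Mail\b")
SUBJECT = re.compile(r"\bRe: ([A-Z]+|%RND_UC_CHAR\[2-8\]),\s+[!-~]+\b")
MESSAGE_ID = re.compile(r"\b[A-Z]{7}-[0-9]{13}@\b")


def antibayes_spam(headers):
    """Mimic the meta rule: all three sub-rules have to hit."""
    return bool(
        MPOP_MUA.search(headers.get("X-Mailer", ""))
        and SUBJECT.search(headers.get("Subject", ""))
        and MESSAGE_ID.search(headers.get("Message-Id", ""))
    )


spam_headers = {
    "X-Mailer": "mPOP Web-Mail 2.19",
    "Subject": "Re: QUZS, over the eyes",
    # Hypothetical value shaped like the rule expects (7 capitals, 13 digits).
    "Message-Id": "<ABCDEFG-1234567890123@example.com>",
}

ham_headers = {
    "X-Mailer": "Mutt/1.4",  # hypothetical legitimate mailer
    "Subject": "Re: lunch, maybe Tuesday",
    "Message-Id": "<20031229053116.2E9A6C0977@icicle.pobox.com>",
}

print(antibayes_spam(spam_headers))
print(antibayes_spam(ham_headers))
```

Requiring all three hits is what keeps the false positives down: any one of these patterns alone would be far too loose to score.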
Note that the scores still need tuning (not “some” but “a lot”, probably). The four values on the score line apply, in order, to running without Bayes or network tests, without Bayes but with network tests, with Bayes but without network tests, and with both. The basic premise is that if you're not using the Bayes rules, this is a very strong indicator of spam that still shouldn't override things too much. If you're using Bayes, its value needs to be cranked way up (possibly higher than it is there) to counteract what Bayes is going to tell you about the message.
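To put a number on "possibly higher than it is there", here is a small Python sketch of the arithmetic. It assumes SA's usual score-set ordering for a four-value score line (no Bayes/no network tests, no Bayes/network tests, Bayes/no network tests, Bayes/network tests) and takes hits and required straight from the X-Spam-Status header of the sample message. The helper name is mine. I don't know from the header whether network tests were on for that message, but it doesn't matter here, since both Bayes columns hold the same value.

```python
# Scores from the `score` line, indexed by score set:
# 0 = no Bayes, no network tests; 1 = no Bayes, network tests;
# 2 = Bayes, no network tests;    3 = Bayes and network tests.
ANTIBAYES_SCORES = (1.01, 1.01, 3.01, 3.01)


def pick_score(scores, use_bayes, use_net):
    """Select the score for the active score set."""
    return scores[(2 if use_bayes else 0) + (1 if use_net else 0)]


# From the sample's X-Spam-Status: hits=-1.3 required=4.1
hits_before = -1.3
required = 4.1

rule_score = pick_score(ANTIBAYES_SCORES, use_bayes=True, use_net=False)
new_total = hits_before + rule_score
still_missed = new_total < required

# Smallest rule score that would have lifted this message to the threshold.
needed = required - hits_before

print(round(new_total, 2), still_missed, round(needed, 2))
```

With Bayes on, the rule adds 3.01 to the sample's -1.3, which comes to about 1.71 and is still under the 4.1 threshold. On that one message the rule would have needed to be worth roughly 5.4 by itself.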
Potential future additions involve parsing the text/html MIME part to remove the (bogus, just there so that Bayes won't work) HTML tags, and matching on the strings there (in the emails I've got, there are three, maybe four different messages here, and they repeat days later, so this probably would be functionally useful). Perhaps a more general aid would be to look for in-word tags and, if you see more than four or five in a message, raise a flag. That's a bit more complex than I know how to do with a single regexp, though, and I haven't read up on how to add full-fledged tests as if they were SA-included subs just yet.
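Outside of SA, the in-word tag idea is easy enough to sketch in Python. This is only a rough illustration of the heuristic, not an SA plugin: the function names are mine, the threshold default of 4 comes from "more than four or five" above, and "in-word" is taken to mean a tag with a letter immediately on each side.

```python
import re

# A tag counts as "in-word" when a letter sits directly before and after it,
# as in "Fr</dirt>ee". Tags next to spaces, punctuation, or other tags don't count.
IN_WORD_TAG = re.compile(r"(?<=[A-Za-z])</?[A-Za-z][^>]*>(?=[A-Za-z])")


def count_in_word_tags(html):
    """Count tags that split a word in two."""
    return len(IN_WORD_TAG.findall(html))


def strip_in_word_tags(html):
    """Remove the word-splitting tags so the hidden text can be matched."""
    return IN_WORD_TAG.sub("", html)


def looks_obfuscated(html, threshold=4):
    """Raise a flag when a message has more in-word tags than the threshold."""
    return count_in_word_tags(html) > threshold


snippet = "<p>Fr</dirt>ee Ca</reactionary>ble* TV</p>"
print(count_in_word_tags(snippet))
print(strip_in_word_tags(snippet))
print(looks_obfuscated(snippet))
```

On the snippet from the sample message this finds the two bogus closing tags and leaves "Free Cable* TV" inside the paragraph. Legitimate HTML does occasionally put a tag mid-word (bolding part of a word, say), which is the reason for counting against a threshold rather than flagging on the first hit.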