I wanted to see which of SpamAssassin's tests were actually catching spam for me, statistically speaking. Comparing that to which catch ham, and then modifying the weights of those rules, is my next step after Bayes. This is a continuous battle: whenever the amount of spam in my inbox gets to averaging more than about seven or eight messages a day, I find a new way to decrease it. This isn't because the old ways have gotten less effective (the hit rate on catching spam goes ever up, and I've had five false positives, total, in five months of use), but because I just keep getting more and more spam, so the same 5-10% that slips through adds up to more messages over time.
The last way was actually using the Bayesian functionality. That was huge, and should be plenty for a few more months. But I want to be ready when it comes time to go reranking tests. So I wrote this Perl script:
Note: for the full text of the copyright notice, download the real Perl script. I avoid any direct reference to my real name in actual posts to lj so that no web spiders get ahold of identifying information (barring image recognition, which I know isn't good enough to work on my ljpic) in conjunction with posts here, since I don't want weblog posts to be held against me by, for instance, my employer. Don't pretend that you can get away with ignoring the copyright notice because it appears in a modified form here. Thanks.
#!/usr/bin/perl -w
#
# copyright 2003, [REDACTED]
# Feel free to use, modify, and redistribute this as long as you like
# with the caveat that you must retain the above copyright notice. I
# wouldn't mind hearing from you ([REDACTED]) if you find it
# useful or modify/extend it.
#
# Feed me a bunch of SpamAssassin-tagged messages in mbox format like
# so:
#
# ./sa-analyze.pl < mbox
#
# Then make decisions about your rule-weighting.
use strict;
# We want to keep statistics both by message and over all the test output
# so that we can look at both how many and which tests a given message
# triggered (probably to judge whether to feed that message to sa-learn
# as spam) and at how tests fared over the corpus (probably to change
# the ranking of that individual test based on the spam a given user
# actually gets).
my (%messages, %tests);
my ($hits, $testline) = (0, '');
my $test;
my $total_hits = 0;
my $messageid = 'null';
while (<>) {
    # When we see a new mbox From line, we're looking at a new
    # message, so stash what we've got and start anew.
    #
    # The exception is the first and last message. On the first message,
    # we don't yet have anything to store. On the last we won't see another
    # From line. So we store nothing if we don't have a $messageid yet,
    # and we store once more after we run out of input.
    if (/^From /) {
        if ($messageid ne 'null') {
            $messages{$messageid}{hits} = $hits;
            # Keep the list of tests this message triggered; the length
            # of the list tells us how many.
            $messages{$messageid}{tests} = [split(/,/, $testline)];
        }
        # Reset so the next message doesn't inherit this one's values.
        ($hits, $testline, $messageid) = (0, '', 'null');
    }
    # When we see a Message-ID, we want to start storing statistics on it.
    # Unfortunately some MTAs say "Message-ID" and some say "Message-Id".
    if (/^Message-ID: /i) {
        $messageid = (split())[1];
    }
    # When we see an X-Spam-Status line, we want to parse it into
    # temporary vars which we'll eventually dump in the relevant hashes
    # (by incrementing in %tests here and by throwing the results in
    # %messages when we get to the next message).
    #
    # Note that if the header is folded across several lines, only the
    # tests on its first line are seen here.
    if (/^X-Spam-Status: /) {
        ($hits, $testline) = (split())[2,4];
        # Need to drop the tags; we know what they are because of our
        # data structures.
        $hits =~ s/hits=//;
        $testline =~ s/tests=//;
        $total_hits += $hits;
        foreach $test (split(/,/, $testline)) {
            $tests{$test}++;
        }
    }
}
# Store information from the last message.
if ($messageid ne 'null') {
    $messages{$messageid}{hits} = $hits;
    $messages{$messageid}{tests} = [split(/,/, $testline)];
}
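# Average over the messages we actually stored (the last one included).
my $message_count = scalar(keys %messages);
print "Average hits: ", ($message_count ? $total_hits / $message_count : 0), "\n";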
print "Test statistics:\n";
# Sorting based on value using an inline function... this is totally
# grotty syntax, if you ask me, but it's the Right Perl Way to do this.
# See perldoc -f sort.
foreach $test (sort {$tests{$b} <=> $tests{$a}} keys %tests) {
    print " $test: $tests{$test}\n";
}
# So, right now, we're not doing jack with %messages, really. If I end up
# wanting to, it'd be pretty easy to throw in a CLI to query for information
# about given rules (or rules above a given frequency) through %messages.
I'm sure I'm not the first person to want this information, and I'll bet you can make spamd log it as it processes messages, but I felt like doing something active rather than digging through documentation. There's plenty of room for expansion here (as the final comment suggests), and there are definitely things I could be doing better. Like it says, go right ahead and let me know if you have any suggestions.
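As a rough sketch of the query idea from that final comment: suppose %messages mapped each Message-ID to its score and to an array reference of the test names it triggered. That layout, the sample data, and the sub name messages_for_rule are all assumptions for illustration, not something the script promises. A lookup by rule could then be as small as this:

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical sample data: Message-ID => { hits, tests }. The rule names
# and Message-IDs are made up for illustration.
my %messages = (
    '<a@example.com>' => { hits => 12.3, tests => ['RULE_A', 'RULE_B'] },
    '<b@example.com>' => { hits => 6.1,  tests => ['RULE_B'] },
    '<c@example.com>' => { hits => 7.4,  tests => ['RULE_A', 'RULE_C'] },
);

# Return the sorted Message-IDs of every message that triggered $rule.
# Call this in list context.
sub messages_for_rule {
    my ($rule, $messages) = @_;
    my @ids = grep {
        my $id = $_;
        grep { $_ eq $rule } @{ $messages->{$id}{tests} };
    } keys %$messages;
    return sort @ids;
}

foreach my $id (messages_for_rule('RULE_A', \%messages)) {
    print "$id: $messages{$id}{hits} hits\n";
}
```

Wrapping that in a loop over @ARGV would give the "CLI" more or less for free.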
I won't be recommending the test weights that I get out of this to anyone else, and I wouldn't suggest that you do either. The point is to take the test weights that SA uses by default (which I presume to be decided based on a statistical analysis of a much larger corpus of spam than I've got) and localize them to the spam that you actually get.
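To make the localization step a little more concrete, here is a minimal sketch of the spam-versus-ham comparison. It assumes you've already tallied per-test counts over a spam corpus and over a ham corpus (say, by running the script on each mbox separately); the sub name hit_rates, the rule names, and the counts are all hypothetical.

```perl
#!/usr/bin/perl -w
use strict;

# For each test, compute the fraction of spam and of ham messages it hit.
# Takes two hash refs of per-test counts and the two corpus sizes.
sub hit_rates {
    my ($spam, $ham, $n_spam, $n_ham) = @_;
    my %seen = map { $_ => 1 } (keys %$spam, keys %$ham);
    my %rates;
    foreach my $test (keys %seen) {
        $rates{$test} = {
            spam => $n_spam ? ($spam->{$test} || 0) / $n_spam : 0,
            ham  => $n_ham  ? ($ham->{$test}  || 0) / $n_ham  : 0,
        };
    }
    return \%rates;
}

# Made-up counts for illustration only.
my %spam_tests = (RULE_A => 80, RULE_B => 40);
my %ham_tests  = (RULE_B => 30, RULE_C => 5);
my $rates = hit_rates(\%spam_tests, \%ham_tests, 100, 100);

# Tests that fire on ham nearly as often as on spam are the first
# candidates for a lower local score.
foreach my $test (sort keys %$rates) {
    printf "%-8s spam %5.1f%%  ham %5.1f%%\n",
        $test, 100 * $rates->{$test}{spam}, 100 * $rates->{$test}{ham};
}
```

A test that shows up in a large share of your ham is a candidate for a lower weight, and one that only ever hits your spam is a candidate for a higher one. How far to move each weight is still a judgment call that depends on your own mail.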