FF Sparks (Casual)

Antispam Woes


So, the nice antispam filtering system I had built for snafu has come tumbling down in ruin, as the load of spam that the system gets has been driving it to the point that it literally kills the machine it's on (load between 20.0 and 43.5). Users were unhappy, because the system was unusably slow.

So I tried bidding farewell to SpamAssassin, and checking out DSpam. DSpam was written in C rather than Perl, and supposedly designed to handle bayesian filtering for an entire site. Well, DSpam /definitely/ didn't kill things as badly. But even with what I could give it as a corpus, it caught maybe 0.5% of the spam on the system. Everyone's mailboxes were flooded, and even the training didn't seem to help much. Users were unhappy, because they couldn't get their real mail, and because training things seemed irritating to them.

I am now trying to build a separate machine which will handle the spam-filtering. Unfortunately, Postfix, for some ungodly unknown reason, does not want to pick up LMTP from anything but localhost. (Please note, I've created a test LMTP feed /specifically/ for this, which should pick up the LMTP from that one remote machine just fine.) Plus, I'm not confident that spamassassin will not kill /that/ machine as well, even though it's more powerful.

So, does anyone know of any antispam package which can handle a small ISP (about 30 users) which gets an inordinate amount of spam due to hosting several domains? I can't afford to spend $2000/yr on licensing one of the commercial packages, and I'm really at my wits' end about this. I need working antispam on Noderunner, but I need noderunner working also, so that I'm not completely buried. :(

Advice/input very, very, VERY appreciated.



First: postfix isn't the latest and greatest in mailer technology. If you want the nicest to work with, I would strongly suggest considering exim or zmailer. Zmailer in particular is a beast which can chew down vast quantities of mail, and keeps it in a nice unified directory for handling before sending. Exim permits you to define custom routers through which you can invoke spam checking techniques.

As for the actual spam detection:
first, use some heuristics. Anything which refers to v1agr@ is a pretty obvious no-no. Various forms of MIME attachment are also easy warning signs; I leave the details once more to your fruitful imagination, and by now you've caught a third of spam really easily.
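A minimal sketch of that kind of cheap first pass, assuming Python and its standard `email` module. The regular expression, the extension list, and the function name `cheap_spam_signs` are made up for illustration and would need tuning against your own spam:

```python
import re
from email import message_from_string

# Hypothetical patterns; tune these against your own spam corpus.
OBFUSCATED_WORDS = re.compile(r"v[1i!|]agr[a@]", re.IGNORECASE)
RISKY_EXTENSIONS = (".exe", ".scr", ".pif", ".bat")


def cheap_spam_signs(raw_message: str) -> list[str]:
    """Return the reasons a message looks like obvious spam (empty if none)."""
    msg = message_from_string(raw_message)
    reasons = []

    # Check the subject line for obfuscated spellings.
    subject = msg.get("Subject", "")
    if OBFUSCATED_WORDS.search(subject):
        reasons.append("obfuscated drug name in subject")

    # Walk every MIME part: flag risky attachment names and text bodies.
    for part in msg.walk():
        filename = part.get_filename() or ""
        if filename.lower().endswith(RISKY_EXTENSIONS):
            reasons.append(f"risky attachment: {filename}")
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            if OBFUSCATED_WORDS.search(payload.decode("latin-1")):
                reasons.append("obfuscated drug name in body")
    return reasons
```

Anything that comes back with a non-empty list can be rejected or tagged before the expensive filter ever sees it.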

Second: blanket bayesian filtering is possible, but you need help from your users, because they all have different types of mail which they expect. Alternatively, you can sit there for a few thousand messages, applying common sense, and it will work okay. But be aware that anything which chews through meg after meg of crap will slow down anything, so resign yourself to that right now.

Third: ask yourself if you're tackling this from the right end. You might very easily find that some kind of way of whitelisting mail sources gives you a massive headstart on the efficiency of your testing algorithm. Exim, too, is particularly nice about things like checking source lists, ensuring compliance with various things like SMTP, reverse DNS, and a whole host of related things. Exim also has some of the nicest and most self-explanatory configurations I've ever seen, so if you don't want to pretend you're the ultimate sysadmin, and you really have a life outside tweaking sendmail.cf, I recommend it.
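To illustrate the whitelisting idea, here is a rough sketch in Python of skipping the expensive scan for trusted sources. The networks listed are documentation ranges used as placeholders, and `needs_full_scan` is a hypothetical helper, not part of Exim or any other mailer:

```python
import ipaddress

# Hypothetical trusted networks (documentation ranges, not real senders).
TRUSTED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def needs_full_scan(client_ip: str) -> bool:
    """Return False when the connecting host is whitelisted, True otherwise."""
    address = ipaddress.ip_address(client_ip)
    return not any(address in network for network in TRUSTED_NETWORKS)
```

Mail from a whitelisted host goes straight to delivery, so only the remainder costs any filtering CPU.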
With regards to Postfix, I realize it's not the glitziest package out there -- lord knows my qmail-zealous friends have hours and hours worth of explanations about why djb's code is the One True Path to salvation and suchnot -- but I know Postfix's code well enough to modify it myself, and I know how to tweak the config files pretty adeptly by this point. It's not so much 'latest and greatest' as 'comfortable and sufficient.' I mean, really, it's really not Postfix that's sucking performance-wise, but SpamAssassin. Postfix handles the mail load fine, with almost no CPU impact. It's SpamAssassin which, er, assassinates my system load.

I admit I've never really looked at Zmailer, but I have to confess to a dislike of Exim. Exim's system of filter definitions always drove me absolutely batty when I used it years ago, and while I realize Exim's an incredibly powerful mailer, I just can't quite get past that loathing.

First: I should mention that I have a fairly significant collection of header_check and body_check heuristics already, so most of the really obvious stuff gets caught even before being passed on to SpamAssassin. Anyone who relies /solely/ on a single spam filter is asking for trouble, and you can nail so many things right at the server before passing it on to the second stage.

Second: Yep, pretty aware that no matter what, I'm going to lose CPU on spam scanning. And the Bayesian systems, while they have their benefits, lose out on the training aspect. I have a lot of users who don't want to train systems, who just want the spam to Simply Not Be In Their Mailbox. These include people like, say, my parents -- my father, despite years of efforts on my part, will blindly click on any attachment he receives -- who are not ideal targets to teach Bayesian training methods to. :)

Third: Yeah, though I've yet to wrap my head around an efficient, effective way to build a workable whitelist for mail sources for such a diverse set of users. My Postfix is set up to look for a number of things, including various RFC compliance and presence on blacklists or whitelists, but it still is insufficient; enough mail gets through to SpamAssassin to drive the system load up over 40.
Lots of folks run into this problem with SpamAssassin. Try running it in daemon mode (spamd, I think it is called?). This will save you quite a lot of resources.
I /am/ running it in daemon (spamd/spamc) mode. I dare not even /imagine/ what my system load would be like running it in pure filter mode.

Well, I would say take a closer look at your configuration, but knowing the people you host they're probably just sucking down that much mail. :)

I would seriously consider looking around the net and finding some IP filters folks recommend for filtering out Asian netblocks commonly used for spam. Yeah, I know that may drop some valid e-mails, but it sounds like unless you're willing to do some serious pre-processing you may as well set yourself up a farm of e-mail servers.
The spam problem with my employer has been driving me insane. We have three ISPs and a total of over 13,000 mailboxes spread over 600 domains to consider. The previous admins had tried Postini, SpamAssassin, and a home-grown thing called ICNoSpam (they probably intended the pun; that's just one more of their crimes).

In the end, there were only two viable options for us. For a small number of mailboxes, paying Everyone.com fifty cents a month to honeypot, spam scrub, and virus filter your email is a good option. The disadvantages are obvious - no local control at all, and, well, you have to pay a hell of a lot for it.

The other option was to get a machine from Barracuda Networks. They make three: the 200 ($1200 for 1000 mailboxes), the 300 ($2000 for 2000 mailboxes), and the 400 ($4000 for 10,000 mailboxes). The mailbox counts are estimates only, and I've found that the machines can handle twice those numbers in a lot of cases; also, the cost of the machine isn't 'metered', you just get the hardware and the update service. They do white/blacklisting, rate limiting, SpamAssassin (very optimized), Bayesian classifiers, and virus filtering via ClamAV.

There are several downsides. The Barracuda is real money and there's an annual subscription of a couple hundred dollars to their update service to keep it running. Also, it's essentially a closed-source platform (it runs Linux but they don't hand out root to the owner of the box). It is, however, much cheaper than Everyone or any of those other things.
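For a rough sense of the numbers, here is the arithmetic as a small Python sketch. It assumes the fifty cents a month is per mailbox and that "a couple hundred dollars" means $200 a year; both are guesses, so treat the output as ballpark only:

```python
# Prices come from the figures quoted above. The per-mailbox reading of
# "fifty cents a month" and the $200/year subscription are assumptions.
HOSTED_PER_MAILBOX_MONTH = 0.50
APPLIANCES = {  # model: (purchase price, estimated mailbox capacity)
    "200": (1200, 1000),
    "300": (2000, 2000),
    "400": (4000, 10000),
}
ASSUMED_SUBSCRIPTION_PER_YEAR = 200


def hosted_cost(mailboxes: int, years: int) -> float:
    """Total cost of the hosted service over the given number of years."""
    return mailboxes * HOSTED_PER_MAILBOX_MONTH * 12 * years


def appliance_cost(model: str, years: int) -> int:
    """Purchase price plus the assumed yearly update subscription."""
    price, _capacity = APPLIANCES[model]
    return price + ASSUMED_SUBSCRIPTION_PER_YEAR * years
```

Under those assumptions, `hosted_cost(1000, 1)` is 6000.0 against `appliance_cost("200", 1)` at 1400, while for 30 mailboxes the hosted service is only 180.0 a year. That is why the hosted route only makes sense for a small number of mailboxes.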

In the end, my conclusion was that the Barracuda was exactly the kind of system I would have had to write myself, had we not got it from someone else, and it's basically the only thing that stems the tide for us. I went from 300 spam a day to about five. Also, I didn't think that writing a spam-scrubber would be much fun, so it was a big win to be able to get it off the shelf.

YMMV. I'm considering picking up a B200 to filter the virtual domains on Big Panda. Since Big Panda has far fewer than 1000 mailboxes, odds are good that there will be a lot of space left over. Even so, it might be a fast solution to a nasty problem, and it doesn't mean handing control over to Brightmail or Everyone.
This may go against your fiber, but I opted for a qmail toaster, running squirrel against a sql database for virtual domain/mailboxing instead of actually giving out a shell account for each mailbox. The neat thing is, with 75+ mail accounts each receiving buttloads of mail, on top of running a LambdaMOO with a 3d space system and over 200 space objects moving (typical load of this game is ~0.70), and with the toaster running on the same box, my load is just under 1.0.

There are multiple built-in spam engines running (SpamAssassin spamd, SpamCop plugins, spam filtering using ORDB, etc.), and I haven't noticed any problems with machine slowness so far.

-- ZC
LambdaMOO with a discrete space system? Dude, I so have to see this. Is it open to the public? Can I have a peek at the code?