Managing Trac Spam
2 November 2006
Ah, the days when only early adopters were using Trac. It was an insider project that only a few knew about, and even fewer used. Spammers weren't yet aware of it, or of how well suited it is to spamvertising fake handbags and Viagra.
Those days are clearly over (and have been for quite a while, in fact). Popular projects such as Ruby on Rails, Django, and WordPress are using Trac. And all the while spam has gone from being absent, to merely annoying, to making Trac sites a major pain.
The situation even inspired the creation of a “law”:
Here's the first draft of Rafe's law:
An Internet service cannot be considered truly successful until it has attracted spammers.
The latest spam problem I'm seeing is Trac spam. Trac is a software development tool, but because comments on open issues are posted on public Web pages, spammers looking to boost their PageRank have created scripts to spam any open Trac installation they can find. The fact that spammers have discovered it is a good indication that Trac is the emerging favorite in the world of free bug tracking tools.
How is this law useful? Let's say you've created a new service, like Ning, or a new blog publishing tool, like Mephisto. How do you know if it's a success? Just consider Rafe's Law. If the spammers care about your service, you've made it.
In response to the spam problems, we created the SpamFilter plugin earlier this year. It uses hooks introduced in the 0.10 release of Trac, and until now it relied mostly on the Akismet service to keep the spammers out. That worked okay for a couple of months, but has recently broken down completely because Akismet started producing way too many false positives. And false positives are pretty bad when they mean you're not letting users and contributors report issues or submit patches.
SpamFilter, Revision 2
I was hoping that the Akismet problems would go away, as they did a couple of times before, but this time they didn't. We obviously needed a new strategy for keeping the bad guys under control. Some Trac sites switched to requiring registration, others started investigating the use of CAPTCHAs. Those are definitely valid approaches to the problem, but I like the default no-registration-required policy of Trac, and don't like the accessibility problems of CAPTCHAs.
So I sat down and gave the SpamFilter plugin a complete redesign. It would no longer rely on a single filter (which would be Akismet, most of the time) to determine whether a submission was spam or ham. Rather, it would collect scores from all the different installed filter strategies, and the total score (dubbed “karma”) would then decide whether the submission should be accepted or rejected. A couple of filter strategies were added in the process, so that we now provide:
- IP blacklisting: Uses dnspython to ask a configurable set of DNS servers whether the IP address of the submitter is blacklisted.
- IP throttling: The number of submissions per hour from a single IP address can be limited.
- Session check: If the user has a session, and has her name and/or email set (on the “Settings” screen), the submission gets rewarded with positive karma.
- External links check: A large number of external links in a submission can give you negative karma.
- Bayesian filtering: The SpamBayes-based filter strategy is now actually implemented. After some initial training it provides pretty darn good results.
- Regular expressions: You put any number of regular expressions on the BadContent wiki page on your site, and any submission matching one of them gets negative karma.
- Akismet: Finally, the now infamous Akismet filter. I'm still hoping the service will improve again and the number of false positives goes down to some reasonable value.
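To give an idea of how small an individual strategy can be, here is a minimal sketch of the IP throttling idea in Python. It is not the plugin's actual code: the class name `IPThrottle`, its parameters, and the penalty value are all made up for illustration, and it keeps its state in memory rather than in the Trac database.

```python
import time
from collections import defaultdict, deque


class IPThrottle:
    """Hypothetical strategy: penalize an IP that posts too often.

    All names and defaults here are illustrative assumptions,
    not taken from the SpamFilter plugin.
    """

    def __init__(self, max_posts=10, window=3600, penalty=-3):
        self.max_posts = max_posts   # allowed submissions per window
        self.window = window         # window length in seconds
        self.penalty = penalty       # karma given once the limit is exceeded
        self._seen = defaultdict(deque)  # ip -> timestamps of recent posts

    def karma(self, ip, now=None):
        """Record one submission from `ip` and return its karma."""
        now = time.time() if now is None else now
        stamps = self._seen[ip]
        # Forget submissions that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        stamps.append(now)
        return self.penalty if len(stamps) > self.max_posts else 0
```

The point is that a strategy does not say "spam" or "ham" by itself. It only hands back a number, and something else decides what to do with it.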
The best thing, though, is how these work together to produce an overall karma score for any single submission. So even if your post matches some pattern on the BadContent page, if SpamBayes gives you a high ham probability, and you have entered your name on the settings page, your submission has a good chance of being accepted. And every Trac site can easily tune how much weight the different filtering strategies get in the process.
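Roughly, the scoring works like the following Python sketch. This is an illustration under simplifying assumptions, not the plugin's code: the function names, the weights, the `min_karma` threshold, and the dictionary used to represent a submission are all hypothetical, and the checks are toy versions of three of the strategies listed above.

```python
import re

LINK_RE = re.compile(r"https?://", re.IGNORECASE)


def session_check(submission):
    """+1 if the submitter has set a name or email, else 0."""
    return 1 if submission.get("name") or submission.get("email") else 0


def external_links(submission, max_links=4):
    """-1 if the content contains more than max_links external links."""
    count = len(LINK_RE.findall(submission["content"]))
    return -1 if count > max_links else 0


def bad_content(submission, patterns=(r"\bviagra\b", r"fake handbags")):
    """-1 if any (hypothetical) bad-content pattern matches."""
    text = submission["content"]
    hit = any(re.search(p, text, re.IGNORECASE) for p in patterns)
    return -1 if hit else 0


def total_karma(submission, strategies):
    """Sum the weighted verdicts of all (check, weight) pairs."""
    return sum(weight * check(submission) for check, weight in strategies)


def is_accepted(submission, strategies, min_karma=0):
    """Accept the submission if its total karma reaches the threshold."""
    return total_karma(submission, strategies) >= min_karma


# Illustrative weights; a real site would tune these.
strategies = [(session_check, 6), (external_links, 2), (bad_content, 5)]
```

With these made-up weights, a post that mentions viagra from a user who has set her name ends up at 6 - 5 = 1 and is accepted, while the same text from an anonymous submitter ends up at -5 and is rejected. Changing a weight changes how much a single strategy can sway the result, which is the kind of tuning each site gets to do.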
(I'm not saying this is a grand new concept, by the way… systems such as SpamAssassin have worked that way for a long time. But for the SpamFilter plugin, it's a huge step forward.)
In addition to the karma system, SpamFilter v0.2 has gained an admin interface. It's usable out of the box with Trac 0.11dev (the current development version which has an integrated web-based administration interface), or with Trac 0.10 and the WebAdmin plugin.
This interface provides a convenient way to configure, monitor, and train the SpamFilter. Oh, I didn't mention that yet, actually: yes, SpamFilter now supports training, which of course is required to make Bayesian filtering usable at all. You can now also report false positives and false negatives back to Akismet, so that the service has a chance to improve.
All in all, I think this represents a major step forward in terms of managing spam on Trac sites. I encourage anyone plagued by problems with Trac spam to give it a try.