Spam and captchas
If you remember that weekend I coded up a website just to prove that I could, you'll remember that I put a lot of focus on GBDB's ease of use for casual users. Well, it turns out I made it too easy to use, because one little spambot decided to fill its queue up with over 30 pages of the worst kind of spam. (Nashville spam, FYI.)
On the Right Glue I've already had to deal with spammers, albeit on a much smaller scale. The particular scheme I use here is called a honeypot: a hidden form field that humans never see but bots find irresistible. Any submission that arrives with that field filled in obviously didn't come from a human.
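For the curious, a honeypot check can be sketched in a few lines. This is an illustration in Python, not the actual code behind this site; the field name `website` and the function name `looks_like_bot` are made up for the example, and the field is assumed to be hidden from human visitors with CSS.

```python
from typing import Mapping

# Hypothetical name of the hidden field. Human visitors never see it
# (it is assumed to be hidden with CSS), so they leave it empty.
HONEYPOT_FIELD = "website"


def looks_like_bot(form: Mapping[str, str]) -> bool:
    """Return True when the hidden honeypot field was filled in."""
    value = form.get(HONEYPOT_FIELD, "")
    return value.strip() != ""


# A human leaves the hidden field blank; a naive bot fills in everything.
human = {"comment": "Nice post!", "website": ""}
bot = {"comment": "Cheap deals", "website": "http://spam.example"}

print(looks_like_bot(human))
print(looks_like_bot(bot))
```

The weakness is easy to see from the sketch: a bot written for one specific form can simply leave the hidden field alone.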
But GBDB was being targeted repeatedly by a real spambot tailored to the site. Adding a honeypot wouldn't be effective because the bot was already designed to fill in only the fields it needed. I needed something bigger, something that bots cannot figure out (not even KevBots).
The title of this post should make it obvious what scheme I ended up choosing, but I'd like to cover my thought processes in making my decision before I declare it formally.
As I've pointed out already, one of my highest-priority goals in designing GBDB was to make it easy to use and understand, so that users don't have to think hard about how to navigate it or use its features. Most spam-prevention schemes run counter to that goal: they require users to go through extra steps, to slow down, or both.
Simply slowing a bot down doesn't actually prevent the spam from filling up the queue, so a rate-limiting scheme wouldn't do me any good. I needed to stop the bots before they could submit any filthy Nashville information to my database.
The other idea is to add an extra step to the process of using the site so that humans can distinguish themselves from bots. The typical approach is a captcha, which is usually an image or sound that a human can decipher but a computer cannot. You've probably seen one of these before, as they are terribly common.
I don't like captchas because they add to the mental effort of using the site. But then I considered the primary uses for GBDB: viewing GBs, voting on GBs and submitting GBs. Those are the three most important features GBDB has, and, importantly, that is also their order of importance: it's most important to view GBs, then to vote on them, and finally to add new ones.
The spambot wasn't voting and (presumably) wasn't viewing either. So the two most important features of the site need no spam protection beyond what little they already have. Ninety-nine percent of the time, users are never going to see a captcha, because only a small minority of them will have any anecdotes to add. Moreover, the users who do add anecdotes are putting considerable effort into writing their stories, so the extra burden of solving a captcha is minor in comparison.
Thus I could easily add a captcha to the submission form knowing that it would not significantly impact GBDB's ease of use.
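The reasoning above boils down to a small rule: only the submit action pays the captcha tax. Here is a rough sketch of that rule in Python, not GBDB's actual code. The names `CAPTCHA_REQUIRED` and `allow_request` are invented for the example, and `verify_captcha` is a stand-in for whatever captcha service does the real checking.

```python
from typing import Callable, Mapping

# The three actions named in the post, in order of importance.
# Only submitting new entries is gated behind a captcha.
CAPTCHA_REQUIRED = {"view": False, "vote": False, "submit": True}


def allow_request(
    action: str,
    form: Mapping[str, str],
    verify_captcha: Callable[[Mapping[str, str]], bool],
) -> bool:
    """Decide whether a request may proceed.

    verify_captcha is a placeholder for the real captcha check; it is
    only called for actions that require one.
    """
    if action not in CAPTCHA_REQUIRED:
        raise ValueError(f"unknown action: {action}")
    if not CAPTCHA_REQUIRED[action]:
        return True
    return verify_captcha(form)


# Stand-in verifier that rejects everything, as if the captcha were unsolved.
def unsolved(form: Mapping[str, str]) -> bool:
    return False


print(allow_request("view", {}, unsolved))
print(allow_request("submit", {}, unsolved))
```

Viewing and voting never touch the verifier, so most visitors never know a captcha exists.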
I ended up deciding on reCAPTCHA because it is ridiculously easy to deploy in PHP (GBDB's source language). Once it was in place, I changed its theme so that it fits more reasonably within GBDB's kaleidoscope of grey. It looks like it was made for the form!
My only issue with reCAPTCHA is that its client-side output is not HTML5-compliant, so I can no longer correctly claim that GBDB is a fully compliant website. Good thing I don't give a fuck about compliance.