talideon.com

September 14, 2006 at 2:56AM

How I kill spam here

Still phlegmy and stuff, so here’s a post I wrote ages ago and never bothered putting online. My brain’s still too addled to do anything truly productive. Sorry!

I thought I might as well mention how I get rid of comment spam on this site. The main point behind it is to show why you should always give your real email address when you make comments here. After all, I’m not the sort of scumbag who’d sell it on to some asshole spammer.

I keep things pretty low-tech. I’ve looked at the kind of spam I’ve gotten in the past and the vast majority of it can be caught simply by analysing the comment body and commenter homepage for certain keywords. The rest is ok for weeding by hand.

First of all, I’ve a table called moderated. This stores a list of email addresses I’ve moderated positively or negatively. If your email address is in this table and I’ve positively moderated it, you bypass any further analysis of your comment. So far I haven’t had to negatively moderate against anyone. Negative moderation using this table is a rather weak mechanism and I expect to kill it off some time soon; it’s more useful for allowing commenters who’ve given proper comments through straight away without any extra work.

Next, it does keyword filtering. The piece of SQL I use is this:

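-- Delete the last day's comments that match a spam keyword,
-- skipping commenters who have been positively moderated.
DELETE
FROM   comments
WHERE  date_posted > DATE_SUB(NOW(), INTERVAL 1 DAY)
  AND  email NOT IN (SELECT email FROM moderated WHERE banned = 'No')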
  AND (body LIKE '%xanax%'
  OR   body LIKE '%cialis%'
  OR   body LIKE '%tramadol%'
  OR   body LIKE '%phentermine%'
  OR   body LIKE '%viagra%'
  OR   body LIKE '%carisoprodol%'
  OR   body LIKE '%wood blinds%'
  OR   body LIKE '%adipex%'
  OR   body LIKE '%penis enlargement%'
  OR   body LIKE '%horse penetration%'
  OR   body LIKE '%kournikova gallery%'
  OR   body LIKE '%austin waterproof%'
  OR   body LIKE '%clindamycin%'
  OR   body LIKE '%lisinopril%'
  OR   body LIKE '%actonel%'
  OR   body LIKE '%bbw porn%'
  OR   homepage LIKE '%poker%'
  OR   homepage LIKE '%casino%'
  OR   homepage LIKE '%gambling%'
  OR   homepage LIKE '%roulette%'
  OR   homepage LIKE '%blackjack%'
  OR   homepage LIKE '%earrings%');

Simplistic, I know, but it gets rid of an awful lot of junk. Sure, I might get a couple of false positives, but before I do the delete, I copy the matching comments to the spam table, which I periodically check by hand. The above query is built dynamically from a keywords list and executes once every six hours. I may lower this to running every three hours.
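For illustration only, here is a rough Python sketch of how a query like the one above could be built from a keywords list, with the copy to the spam table done before the delete. It is not the code this site runs. The function names, the sample keyword lists, the use of SQLite through Python's sqlite3 module, and the assumption that spam has the same columns as comments are all made up for the example. The one-day date filter is left out because DATE_SUB and NOW() are MySQL functions.

```python
import sqlite3

# Hypothetical keyword lists; the real list is longer (see the query above).
BODY_KEYWORDS = ["xanax", "cialis", "viagra"]
HOMEPAGE_KEYWORDS = ["poker", "casino"]


def build_keyword_filter(body_keywords, homepage_keywords):
    """Return an OR-ed chain of LIKE tests plus the matching parameters."""
    clauses = ["body LIKE ?"] * len(body_keywords)
    clauses += ["homepage LIKE ?"] * len(homepage_keywords)
    params = [f"%{kw}%" for kw in list(body_keywords) + list(homepage_keywords)]
    return " OR ".join(clauses), params


def sweep_spam(conn, body_keywords, homepage_keywords):
    """Copy keyword matches to the spam table, then delete them from comments."""
    keyword_sql, params = build_keyword_filter(body_keywords, homepage_keywords)
    if not params:
        return 0  # nothing to match on, so leave the comments alone
    where = (
        "email NOT IN (SELECT email FROM moderated WHERE banned = 'No') "
        f"AND ({keyword_sql})"
    )
    with conn:  # one transaction: the copy and the delete commit together
        conn.execute(f"INSERT INTO spam SELECT * FROM comments WHERE {where}", params)
        deleted = conn.execute(f"DELETE FROM comments WHERE {where}", params)
    return deleted.rowcount
```

Passing the keywords as bound parameters rather than pasting them into the SQL string means a keyword containing a quote character can't break the query.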

The rest I periodically moderate by hand. If a commenter gives what looks like a dud email address--your.mother@altavista.com and me@nowhere.com come to mind--I delete the contents of the email field. I’d really prefer commenters gave no email address at all rather than a dud one, because duds just clog up the moderated table with rubbish, and that’s just not cool.

Finally, I check if there’re any new commenters who’ve submitted ham rather than spam. Those have their email addresses copied to the moderated table so their later comments can escape moderation.

Now, the moderated table has one last purpose: to determine whether, when displayed, any links in the comments get a rel="nofollow" or not. If you’re not in moderated or you’ve supplied no email address, no Googlejuice for you.
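As a rough sketch of that last rule (in Python, and not the code this site runs), the decision could look something like this. The function name render_link and the idea of passing the positively moderated addresses in as a set are assumptions made for the example.

```python
from html import escape


def render_link(url, text, commenter_email, moderated_emails):
    """Render a comment link, adding rel="nofollow" unless the commenter is trusted.

    moderated_emails is assumed to be the set of positively moderated addresses.
    """
    trusted = bool(commenter_email) and commenter_email in moderated_emails
    rel = "" if trusted else ' rel="nofollow"'
    return f'<a href="{escape(url, quote=True)}"{rel}>{escape(text)}</a>'
```

An empty email address never counts as trusted, even if an empty string somehow ended up in the moderated set.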

Update (2008-04-23): It’s been a long time coming, but any POST requests with a dodgy or non-existent Referer header will get binned too.
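A minimal sketch of what such a check might look like in Python. What counts as "dodgy" isn't spelled out above, so this assumes it means a Referer that doesn't point back at this site. The function name, the SITE_HOST constant, and the acceptance of a www. prefix are all assumptions for the example.

```python
from urllib.parse import urlsplit

SITE_HOST = "talideon.com"  # assumption: comments are posted from pages on this host


def referer_looks_ok(method, referer, site_host=SITE_HOST):
    """Return False for POST requests whose Referer is missing or points elsewhere."""
    if method.upper() != "POST":
        return True  # only POST requests are checked
    if not referer:
        return False  # non-existent Referer header
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        return False  # malformed URL
    return host in (site_host, "www." + site_host)
```

A request that fails the check would be binned before the comment is ever stored.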

Comments

1 On September 14, 2006 at 9:20, Revence 27 wrote:

Neat scheme. Glad you show how smart plain simplicity can be.

I’ve never had to fight spam, though. I’ll save my thinking for then, and I’ll definitely use these, too.

And your written English is awful American! Nooo!

2 On September 14, 2006 at 9:51, Revence 27 wrote:

Having given the idea a few neurones (on the back-burner), I’ve got this one:

How about a list of spamming domains? Open contribution, like a wiki. If someone is submitted 5+ times, he/she’s a spammer. And you can serve it to whoever requests, in some terse format. And sites can’t get too many. Bloggers can have a “Report” button for the job. And then we can bait the spammers.

What do you think?

3 On September 14, 2006 at 21:16, Keith wrote:

And your written English is awful American!

How? I can’t see anything particularly American in how I write.

How about a list of spamming domains?

Did that, didn’t work. I found it more effective to check the UserAgent string to see if it’s a known spider. If so, I didn’t show the comment. I found it more effective to remove the audience for the comment rather than block every spammer: there’s little or nothing a spammer can do to force me to give that back to them.

4 On September 15, 2006 at 18:24, Revence 27 wrote:

You wrote American! Look at this ... (you removed the guide for marking up? Frig!): “... if there’s any new commenters there ...” That’s supposed to be “... if there are any new commenters there ...”

That, and other things. You are not alone, anyway.

Sure, your thingies get much spam pruned out. And I really like them, but ... First of all, if I use clever AJaX to fetch the stuff on my homepage, I can skip your trick. As it gains usage, your trick, so will the very cheap evasion. If I use images for sensitive words, or I use some flash thingamies ... If the spammer is only even half-awake, he would post stuff here for decades.

On this one of checking the UserAgent ... well, see it this way: I once wrote a UserAgent that changed names for every single request it ever made. It got the names randomly from the latest file (or “GoogleBot” when that can’t happen). Somebody stop me!

And, I am not looking at this list of spamming domains as a single-use thing -- it’s not a Unix tool! It can be used for other things (like analysing regexps of domain names that spam, so you can automate the weeding better). I was only hoping someone who has a real use for it (like you) would champion its creation. And, still, it could be used very effectively for this job, couldn’t it?

Do you also see that you could use the domain-checking trick on top of whatever other tricks you may have? Better as a first-line-of-defence, though, since it is 99% correct (which, in fact, is why I like the idea).

Also, will your trick detect this: “Better that v1 a 9ra! Toad S E M E N!!! From the African Warrior Frog!!! Hurrry Nooww!!!! Starts at one shilling, and it’s completely free for the first 100 years!!!”

Did it get caught?

You missed the point I was trying to make, there. The goal of using domain names was inspired by the fact that nobody is going to have 20,000 of them auto-generated! The spammers are a tiny bit of the net, and we can get all their domains, until they are driven into the ground by costs of having to maintain false identities. That’s why the list is the most promising thing.

I was supposed to post this before, but MTN won’t let me post too many kilobytes. And, besides, my Mum won’t be giving me any free units soon. But I tried.

What do you think?

5 On September 16, 2006 at 22:49, Keith wrote:

You wrote American! Look at this...

That was a typo, not an instance of me writing like a Yankistani.

The check for the user agent is there so Yahoo!’s spider, Google’s spider, &c. won’t see any unmoderated content: it puts a brick wall between the spammer and the things they’re trying to game--the search engines--without having to remove the accumulated spam or moderate everything continuously.
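Roughly, and only as a Python sketch rather than the code that runs here, the idea looks like this. The spider tokens, the function names, and the "moderated" flag on each comment are all assumptions made for the example.

```python
# Illustrative substrings of well-known crawler user agents; a real list would be longer.
KNOWN_SPIDERS = ("googlebot", "slurp", "msnbot")


def is_known_spider(user_agent):
    """Return True if the User-Agent string contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_SPIDERS)


def visible_comments(comments, user_agent):
    """Hide unmoderated comments from spiders; everyone else sees them all."""
    if is_known_spider(user_agent):
        return [c for c in comments if c["moderated"]]
    return list(comments)
```

A spammer faking a spider's user agent only ends up seeing less, which is the point: the check protects what the search engines see, not what gets posted.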

And, I am not looking at this list of spamming domains as a single-use thing -- it’s not a Unix tool!

What you’re looking for is Akismet.

Also, will your trick detect this:

Of course not. We’re not dealing with email spam here. With weblog spam, they’re trying to game search engines, not get people to click through. That makes it easier to deal with by using a keyword list.

The spammers are a tiny bit of the net, and we can get all their domains, until they are driven into the ground by costs of having to maintain false identities.

A lot of them take advantage of free hosting services to host their content and shift it around as soon as the hosts find them. That’s why I’ve found the whole domain name thing is pointless (aside from analysing the URL for keywords).

6 On September 18, 2006 at 18:20, Revence 27 wrote:

Yeah, I see. You’re right. The only use my “trick” could have is to shield the victims of 1000000-posts-a-second by bots. And that’s not your kind.

You do write like a Yankistani. That’s not even the only one in there. “Yankistani” rocketh! Like “Yirankee”.

7 On September 19, 2006 at 6:28, Keith wrote:

The only use my “trick” could have is to shield the victims of 1000000-posts-a-second by bots.

Actually, your trick wouldn’t be too effective against them either. For that, you’d need to do comment throttling. Yours requires analysis after the fact, which isn’t all that useful. I only get a few hundred a day, so keywords manage to prune most of the junk. If it gets bad, I’ll start using CFAkismet.

And I’m puzzled: what about the way I write makes it so un-Irish and so American? You’re the first person who’s ever described it as such.