Spam prevention and statistics

I don't believe the spam problem will be solved quickly, as the commercial logic behind it is simple. Sending mail costs next to nothing, which justifies it almost regardless of the return. Spammers are outlaws today and will be tomorrow, and that is not going to stop them anytime soon. So we need to live with spam.

The goal now is to keep spam out of my inbox with a false positive rate as low as possible. The tolerable false positive rate is a matter of taste, but mine is quite high despite the amount of spam I get.

Address Exposure

Always be careful where you enter your mail address. I am an opponent of faking e-mail addresses or extending them with markers like "nospam" and the like. That would be a clear RFC violation, as a reply or bounce to the address can not be guaranteed to arrive. I'd rather avoid giving e-mail addresses to all those nifty little websites just to download some little tool. For this purpose I wrote an anonymous, receive-only, web-based mailer. It is called wasteland.rfc822.org and needs no registration.

Spamtrap

Currently I am trying to establish effective spamtraps: getting hundreds of e-mail addresses into the spammers' dictionaries so that the spam ends up in a single mailbox, and using catch-alls to also receive spam from dictionary attacks. I made my first attempt on the 13th of April 2005, and by the 16th of May I already had 982 mails in that inbox. A lot of those were virus bounces that got delivered because of the catch-all. Nevertheless there is still a fair amount of "valid" spam in that mailbox. Right now I am trying to broaden my domain/hostname repertoire to catch more mail addresses. I guess spammers work through a dictionary that is sorted in some way. If the trap addresses are spread throughout those dictionaries, there is a fair chance that the spamtrap gets the spam before the spammer tries the real e-mail addresses, which will by then be protected by an RBL I plan on putting on top of the spamtrap.

One of the interesting things is web harvesters, which crawl the web for mail addresses. For this purpose I wrote a little cgi-bin which inserts generated mailto: links into web pages via server-side includes. These e-mail addresses are based on the time/date and the IP address of the client requesting the page. When spam arrives for such an address, it reveals the harvester's IP address and the time the address was harvested. One problem I need to solve right now is that somewhere along the way the mail address sometimes gets converted to lowercase, which breaks the base64 encoding. I am currently looking for other ways to encode that information into the mail address.
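One encoding that survives lowercasing is plain hexadecimal, since hex digits mean the same in either case. The following is only a sketch of that idea, not the cgi-bin described above: it assumes IPv4 clients and Unix timestamps, and the function names and the domain trap.example.org are made up for illustration.

```python
import ipaddress
import time

def encode_tag(client_ip, timestamp):
    """Pack an IPv4 address and a Unix timestamp into lowercase hex digits.

    The first 8 digits are the address, the rest is the timestamp
    (8 digits for any 32-bit timestamp).
    """
    ip_int = int(ipaddress.IPv4Address(client_ip))
    return f"{ip_int:08x}{timestamp:08x}"

def decode_tag(tag):
    """Reverse encode_tag; the result is the same for upper- or lowercase input."""
    tag = tag.lower()
    ip_int = int(tag[:8], 16)
    timestamp = int(tag[8:], 16)
    return str(ipaddress.IPv4Address(ip_int)), timestamp

def trap_address(client_ip, timestamp=None, domain="trap.example.org"):
    """Build a trap address for one page request (domain is a placeholder)."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{encode_tag(client_ip, timestamp)}@{domain}"
```

Decoding the local part of a trapped mail with decode_tag then gives back the harvester's address and the time of the page request. Base32 would be a more compact option with the same case-insensitive property.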

Local MDA filtering

You might want to start filtering spam at the point where it is delivered into your mailbox. This can easily be done with spamassassin or bogofilter. I tried spamassassin quite early and it was not to my taste. It runs a lot of tests, but the results were poor and there was very little room to trade false positives against false negatives. I went for bogofilter and trained it by feeding it a couple of thousand spam mails I had manually sorted over the years. It became effective very quickly, so I stayed with bogofilter.

This is the part of my .procmailrc that feeds mail into bogofilter and sorts the spam into monthly folders.

    :0fw
    | bogofilter -e -p

    :0:
    * ^X-Bogosity: Yes
    bogus/bogus-`date +%Y%m`

Filtering like this causes minimal false positives, and you are always encouraged to train bogofilter to your needs. Even when it produces false positives, they are not lost: just have a look at your bogus mailbox of the month.

MX filtering/refusal

Filtering on the MX, i.e. refusing mail delivery, is a very effective way of saying no to spam. This may not necessarily be legal in some countries when you provide mail services for others, so one might want to just tag mails and let the users themselves decide whether to drop the mail in their bogofilter/procmail setup.

I am running a lot of different techniques to detect bogus mail and to deny its delivery. Some of them are probably not RFC conformant, and some of them might be too restrictive. Everybody has to decide which false positive rate is acceptable.

HELO/EHLO checking

RFC 821 or 2821 says the HELO/EHLO SMTP handshake SHOULD contain a valid hostname. It explicitly says you should not refuse mail based on checking it, which means this measure is not RFC conformant. Nevertheless I do it. My point is that a host may send anything in the HELO/EHLO but definitely should not send MY IP address or MY hostname, which is obviously a severe protocol violation. So I use the Postfix feature "check_helo_access" in "smtpd_recipient_restrictions" and gave it this list:

    195.71.99.218           550 You are obviously using a bogus helo/ehlo
    #
    # These hosts are in my "mynetworks" so we dont check this file
    #
    localhost               550 You are obviously using a bogus helo/ehlo
    127.0.0.1               550 You are obviously using a bogus helo/ehlo
    localhost.localdomain   550 You are obviously using a bogus helo/ehlo
    gt.owl.de               550 You are obviously using a bogus helo/ehlo
    rfc822.org              550 You are obviously using a bogus helo/ehlo
    uucico.de               550 You are obviously using a bogus helo/ehlo
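The logic behind this table can be sketched in a few lines of Python. This is only an illustration of the idea, not how Postfix performs the lookup: the function name is hypothetical, and stripping brackets and a trailing dot is an assumption about how a client might disguise the name.

```python
# Names and addresses that only this server itself may legitimately claim.
OWN_IDENTITIES = {
    "195.71.99.218",
    "localhost",
    "127.0.0.1",
    "localhost.localdomain",
    "gt.owl.de",
    "rfc822.org",
    "uucico.de",
}

def helo_verdict(helo_name):
    """Return a 550 reply if the client claims to be us, else "DUNNO".

    "DUNNO" mirrors the Postfix access table result meaning "no decision,
    continue with the next restriction".
    """
    # Normalise case, an address literal like [195.71.99.218] and a trailing dot.
    name = helo_name.strip().strip("[]").rstrip(".").lower()
    if name in OWN_IDENTITIES:
        return "550 You are obviously using a bogus helo/ehlo"
    return "DUNNO"
```

A foreign host announcing itself as rfc822.org gets the 550, while any other name passes on to the remaining checks.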

Client reverse DNS checking

This is a feature which denies mail based on the reverse DNS of the client's IP address. Most spam today gets sent out from dynamic dialup ranges. A typical user will use their provider's relay/mailserver to send out mail. Power users should do so too, as an MX on a dynamic IP address is a big security problem.

Over the years I collected the IP addresses, or rather hostnames, I got spam from. For this I wrote myself a little Perl script which looked through the "Received:" lines of my spam, detecting the first host that is not one of mine. I started listing hostnames, or rather groups of hostnames, after I got 2-3 spam mails from their domain. There are domains which I blacklisted completely and whitelisted only their MX, as the amount of spam was immense but the syntax of the hostnames was too irregular.

I fed this regular expression list into Postfix with check_client_access regexp:/etc/postfix/dialup_client_access.

    #
    #
    # gtso-d9b8cc1a.pool.mediaways.net
    # gtso-c3477532.dsl.mediaWays.net
    #
    /^....-[0-9a-f]*\.pool\.mediaways\.net/    550 20040302 We do not accept mail from dialup ip ranges
    /^....-[0-9a-f]*\.dsl\.mediaways\.net/     550 20040330 We do not accept mail from dialup ip ranges
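Before adding a new pattern to such a list it helps to try it against the sample hostnames from the comments. A small sketch in Python, assuming case-insensitive matching (which is the default for Postfix regexp tables); the function name is hypothetical.

```python
import re

# (pattern, action) pairs copied from the table above.
RULES = [
    (re.compile(r"^....-[0-9a-f]*\.pool\.mediaways\.net", re.IGNORECASE),
     "550 20040302 We do not accept mail from dialup ip ranges"),
    (re.compile(r"^....-[0-9a-f]*\.dsl\.mediaways\.net", re.IGNORECASE),
     "550 20040330 We do not accept mail from dialup ip ranges"),
]

def client_verdict(hostname):
    """Return the action of the first matching rule, or None if nothing matches."""
    for pattern, action in RULES:
        if pattern.search(hostname):
            return action
    return None
```

Both sample hostnames from the comments match their rule, including the mixed-case "mediaWays" one, while a provider's regular mail relay would fall through with None.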

My current list includes around 460 regular expressions, which match the dialup ranges of all large carriers worldwide. Very little maintenance is necessary, as this is not a moving target. From time to time I extend the list a little bit.

Sender domain checking

When you get mail, the envelope from address must be a valid address or the null originator <>. So I use the Postfix restriction "reject_unknown_sender_domain" to refuse sender domains which do not exist.

Sender domain MX checking

The sender's domain should have an MX, and this MX should be reachable for me. I don't want to always do callbacks by trying to connect to the remote mailserver, but at least I can tell that the MX will not be reachable if it is located in RFC 1918 address space, e.g. 192.168/16, 10/8 etc. So I am using "check_sender_mx_access" with this list:

    0.0.0.0/8            REJECT Domain MX in broadcast network
    10.0.0.0/8           REJECT Domain MX in RFC 1918 private network
    127.0.0.0/8          REJECT Domain MX in loopback network
    169.254.0.0/16       REJECT Domain MX in link local network
    172.16.0.0/12        REJECT Domain MX in RFC 1918 private network
    192.0.2.0/24         REJECT Domain MX in TEST-NET network
    192.168.0.0/16       REJECT Domain MX in RFC 1918 private network
    224.0.0.0/4          REJECT Domain MX in class D multicast network
    240.0.0.0/5          REJECT Domain MX in class E reserved network
    248.0.0.0/5          REJECT Domain MX in reserved network
    #
    205.158.62.0/24      REJECT 20040710 Network is dead spam return
    218.106.116.162/32   REJECT 20040629 MX for sender is dead spam return address
    209.25.147.75/32     REJECT 20040710 MX for sender is dead spam return address
    219.129.20.247/32    REJECT 20040710 MX for sender is dead spam return address
    218.16.121.18/32     REJECT 20040710 MX for sender is dead spam return address
    209.202.218.12/32    REJECT 20040710 MX for sender is dead spam return address
    202.104.237.157/32   REJECT 20040710 MX for sender is dead spam return address
    222.47.94.97/32      REJECT 20040710 MX for sender is dead spam return address
    66.139.78.239/32     REJECT 20040722 MX for sender is dead spam return address
    216.145.48.35/32     REJECT 20050329 MX says "nomail" to not receive mail
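What such a lookup does for a single MX address can be sketched with Python's ipaddress module. Only the reserved networks from the top of the list are included here, and the function name is hypothetical.

```python
import ipaddress

# Reserved and private networks from the first part of the list above.
REJECT_NETWORKS = [
    (ipaddress.ip_network("0.0.0.0/8"), "Domain MX in broadcast network"),
    (ipaddress.ip_network("10.0.0.0/8"), "Domain MX in RFC 1918 private network"),
    (ipaddress.ip_network("127.0.0.0/8"), "Domain MX in loopback network"),
    (ipaddress.ip_network("169.254.0.0/16"), "Domain MX in link local network"),
    (ipaddress.ip_network("172.16.0.0/12"), "Domain MX in RFC 1918 private network"),
    (ipaddress.ip_network("192.0.2.0/24"), "Domain MX in TEST-NET network"),
    (ipaddress.ip_network("192.168.0.0/16"), "Domain MX in RFC 1918 private network"),
    (ipaddress.ip_network("224.0.0.0/4"), "Domain MX in class D multicast network"),
    (ipaddress.ip_network("240.0.0.0/5"), "Domain MX in class E reserved network"),
    (ipaddress.ip_network("248.0.0.0/5"), "Domain MX in reserved network"),
]

def mx_verdict(mx_ip):
    """Return "REJECT <reason>" if the MX address is in a listed network, else None."""
    address = ipaddress.ip_address(mx_ip)
    for network, reason in REJECT_NETWORKS:
        if address in network:
            return "REJECT " + reason
    return None
```

A sender domain whose MX resolves to 10.1.2.3 is rejected as unreachable, while an MX on a public address passes this particular check.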

One further idea is to add the "bogon" networks, which contain unallocated address space, hijacked address space etc.

Greylisting

Greylisting is a method for delaying mail delivery. Spam senders are often "fire and forget" type senders. That means they try the delivery once and, regardless of whether the result code is 4xx or 5xx, they don't retry, as this would cost more resources on the spammer's side. Greylisting means you always refuse the first attempt to deliver a mail with a temporary result code. The "first attempt" is identified by a combination of sender, recipient, client IP etc.

Greylisting actually brought my spam down to a tenth of its original amount, as you can see from the graph.
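The greylisting decision described above can be sketched as follows. This is a minimal in-memory illustration, not the implementation running on my MX: the class name, the reply text and the 300 second minimum delay are assumptions, and a real setup would also persist and expire its entries.

```python
class Greylist:
    """Temporarily refuse the first delivery attempt of each unknown triplet."""

    def __init__(self, min_delay=300):
        self.min_delay = min_delay   # seconds a sender has to wait before retrying
        self.first_seen = {}         # triplet -> time of the first attempt

    def check(self, sender, recipient, client_ip, now):
        """Return an SMTP-style verdict for one delivery attempt at time `now`."""
        triplet = (sender.lower(), recipient.lower(), client_ip)
        # Remember when this triplet was first seen (unchanged on later attempts).
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.min_delay:
            return "450 Greylisted, please try again later"
        return "OK"
```

In use, `now` would be time.time(). A real mailserver retries after the 450 and gets through once the delay has passed, while a "fire and forget" sender never comes back.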

RBL

RBLs are "Realtime Blocking Lists", usually published via DNS. These RBLs may list IP addresses of senders or sender domains. RBLs are a hard business. You can't control what ends up on an RBL or what its listing rules are, so selecting the right RBLs is a hard decision. I tried using ORBS, ORDB and other open relay RBLs. These tend to get sued by some spammer one day and disappear, or cause outrage by ignoring their own listing policy.

Today I am using ix.dnsbl.manitu.net, which is a list generated from spamtraps. It lists the originating IP address as soon as a spam mail gets into the trap and delists it after 48h. So even if a legitimate MX gets listed, its mail would not keep bouncing forever: the listing expires without any action by the owner. I also use the combined Spamhaus SBL/XBL list. These lists contain open proxies, virus/worm/trojan-infected PCs, known spammers etc.

I used to use the rfc-ignorant.org lists until large corporations got listed. The listings themselves were strictly within their listing policies and acceptable from my perspective, but they got me into too much hassle.
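A DNS-based list is queried by reversing the octets of the client's IPv4 address, appending the list's zone and looking up the resulting name: if it resolves, the address is listed. A minimal sketch in Python, with hypothetical function names and a replaceable resolver so the logic can be tried without network access:

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up, e.g. 1.2.3.4 -> 4.3.2.1.<zone>."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone="ix.dnsbl.manitu.net", resolver=socket.gethostbyname):
    """Return True if the list has an A record for this address."""
    try:
        resolver(dnsbl_query_name(ip, zone))
    except OSError:
        # A failed lookup (typically NXDOMAIN) means "not listed".
        return False
    return True
```

Note that this sketch treats every lookup failure as "not listed", so a DNS outage lets mail through rather than blocking it.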

Statistics

To check the effectiveness of my MX changes to prevent spam, I wrote some little Perl scripts that go through my mail folders and count the number of spam mails. As a comparison I did this for my work e-mail address, which is basically unfiltered, and my home e-mail address, which is heavily filtered. One can now compare the trends of spam. To extract the information from a mailbox I use the script getmailstat. It parses all mails and decodes the RFC 822 date of the first Received: line. Simply cat all your spam mailboxes into this Perl script and save the output. Then I use gnuplot with this config to produce the graphs.
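The counting step can be sketched in Python as well. This is not getmailstat itself: the function names are hypothetical, it reads mbox files by path instead of standard input, and it counts per month, which is an assumption about the granularity of the graphs.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime

def received_date(message):
    """Return the date of the first (topmost) Received: header, or None."""
    received = message.get("Received")
    if received is None or ";" not in received:
        return None
    # The date follows the last semicolon of a Received: header.
    date_text = received.rsplit(";", 1)[1].strip()
    try:
        return parsedate_to_datetime(date_text)
    except (TypeError, ValueError):
        return None

def count_per_month(messages):
    """Count messages per "YYYY-MM", skipping those without a usable date."""
    counts = Counter()
    for message in messages:
        when = received_date(message)
        if when is not None:
            counts[when.strftime("%Y-%m")] += 1
    return counts

def print_stats(paths):
    """Print one "YYYY-MM count" line per month for the given mbox files."""
    counts = Counter()
    for path in paths:
        counts.update(count_per_month(mailbox.mbox(path)))
    for month in sorted(counts):
        print(month, counts[month])
```

Calling print_stats with the paths of the monthly spam folders produces two-column output that gnuplot can read directly.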