Wednesday, 7 December 2011

How Do They Do IT: Spam Filters

The current mechanisms for blocking junk email fall into one of the following categories:
  1. User defined
  2. Black lists
  3. White lists
  4. Bayesian filtering

User Defined
We are all probably aware that our email client has the ability to help block junk email depending on what we tell it.  When email appears in your inbox, if you think it is junk, you can typically block that particular sender.  This assumes that all email from that sender will be junk and so it is a rather blunt instrument. 

The counterpart to blocking a sender is to tag a sender as being "safe".  All email from this sender will be classed as appropriate and will not be classed as junk email by any of the other methods, if they are operating.  This matters because the other methods can produce false positives, meaning that you can miss legitimate emails because they have been automatically diverted to a junk email folder.  Many people would rather not have this happen and so make use of the "safe sender" functionality rather than blocking emails.

In Microsoft's Exchange-based email systems, users can monitor the Spam Confidence Level (SCL) score being assigned by the email server.  If the threshold for the SCL is too low or too high you can then ask your administrator to adjust the level.  This is always a tricky balance between receiving too many junk emails and missing legitimate emails.  It is also not usually something for a general user, as setting up your email client to monitor SCL scores is not a trivial task.
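
As a rough illustration of what monitoring the SCL involves, the sketch below reads the score from a message's headers and compares it with a threshold.  It assumes the server stamps the score into an X-MS-Exchange-Organization-SCL header; the threshold value and the function names are hypothetical, chosen only for this example.

```python
from email import message_from_string

# Header that Exchange commonly stamps on messages; assumed to be present here.
SCL_HEADER = "X-MS-Exchange-Organization-SCL"
# Hypothetical threshold; in practice the administrator sets this.
JUNK_THRESHOLD = 4


def scl_of(raw_message):
    """Return the SCL as an int, or None if the header is absent or malformed."""
    msg = message_from_string(raw_message)
    value = msg.get(SCL_HEADER)
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


def is_junk(raw_message, threshold=JUNK_THRESHOLD):
    """Treat a message as junk when its SCL is above the threshold."""
    scl = scl_of(raw_message)
    return scl is not None and scl > threshold


raw = (
    "From: someone@example.com\n"
    "X-MS-Exchange-Organization-SCL: 6\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
print(scl_of(raw), is_junk(raw), is_junk(raw, threshold=7))
```

Raising the threshold lets more borderline messages into the inbox; lowering it sends more of them to junk, which is the balance described above.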

Black Lists
This is what you might imagine: a centralised list of known spammers.  Your email server (or potentially your email client) can refer to this list and block accordingly.  One of the most popular forms is the Domain Name System blacklist, also known as a DNSBL or DNS blacklist.
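
A DNSBL is usually consulted with an ordinary DNS lookup: the octets of the sending IP address are reversed, the list's zone name is appended, and the result is resolved.  If the name resolves, the address is listed.  The sketch below assumes an IPv4 address and uses a made-up zone, dnsbl.example.org; a real deployment would substitute the zone of whichever list it trusts.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the DNSBL has an entry for this address.

    A failed lookup is treated as "not listed"; note that this also
    covers network errors, which a production filter would handle separately.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# Hypothetical zone name, for illustration only.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```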

An issue here is the proliferation of DNSBLs: it is difficult to decide which to use. Rather like anti-virus checkers, people tend to gravitate to the better known names that they feel they can trust. A considerable benefit of most DNSBLs is that they tend to include "zombie" machines, which spammers use to get around the simple user defined email blockers.

Recent developments have included listing email addresses that have sent mail to "honeypots" and ISPs that knowingly host spammers.  However, organisations such as the Electronic Frontier Foundation (EFF) have expressed some concern about blacklists.  These concerns are not so much about the technologies as about the specific policies implemented by those compiling the lists.

White Lists
Still the most common form is the user defined white list as described above.  However, increasingly ISPs are supplying their customers with white lists, usually through an email client that is provided by the ISP.  The ISP supplied white lists typically comprise email addresses of companies who apply to the ISP to be included as safe senders.

White lists can operate in one of two ways.  They can let through only those on the list or, alternatively, the list prevents other junk email methods from deleting the message.
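
The two modes can be sketched as follows.  The mode names "exclusive" and "override", the spam_score input and the 0.5 threshold are all assumptions made for this illustration, not terms taken from any particular product.

```python
def classify(sender, spam_score, white_list, mode, threshold=0.5):
    """Decide where a message goes under the two white list modes.

    "exclusive": only senders on the white list reach the inbox.
    "override":  listed senders always reach the inbox; everyone else is
                 judged by the score from the other filters.
    """
    sender = sender.lower()
    if mode == "exclusive":
        return "inbox" if sender in white_list else "junk"
    if mode == "override":
        if sender in white_list:
            return "inbox"
        return "junk" if spam_score >= threshold else "inbox"
    raise ValueError(f"unknown mode: {mode}")


white_list = {"newsletter@example.com"}

# A listed sender survives a high spam score in override mode.
print(classify("newsletter@example.com", 0.9, white_list, "override"))
# An unlisted sender is rejected in exclusive mode even with a low score.
print(classify("friend@example.net", 0.1, white_list, "exclusive"))
```

The first mode never lets junk through from unknown senders but also blocks every legitimate stranger; the second only protects listed senders from false positives.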

The concern about allowing commercial organisations to pay for inclusion on a white list is that they can effectively pay to avoid spam filters.  The business models used to determine payment try to guard against this.  For example, the ISP will charge depending on the number of complaints received.  The ISPs argue that charging this way means that the funds can be used to invest in further spam filtering.  It's not clear if this actually happens.

There are some non-commercial white list providers.  Inclusion on these lists is allowed only if the sender passes certain tests.  For example, they must not allow unchecked relay of SMTP messages, which is a classic attack vector for spammers.  Personally, I would recommend using one of these white lists.

Bayesian Filtering
This is a statistically-based technique using, you guessed it, Bayesian probability.  This approach determines how likely a given proposition is, i.e. whether an email is spam.  The probability is determined using "evidence", i.e. it is learned from experience.
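
For a single piece of evidence, such as one word appearing in a message, Bayes' rule can be worked through directly.  The figures in the example below are invented purely to show the arithmetic; in a real filter they would be estimated from mail the user has already marked as spam or not spam.

```python
def p_spam_given_word(p_word_given_spam, p_word_given_ham, p_spam):
    """Bayes' rule: P(spam | word) from how often the word appears in each class."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    evidence = numerator + p_word_given_ham * p_ham  # P(word) over all mail
    return numerator / evidence


# Illustrative numbers only: the word appears in 40% of spam, 1% of
# legitimate mail, and half of all mail is assumed to be spam.
posterior = p_spam_given_word(0.40, 0.01, 0.5)
print(round(posterior, 3))  # about 0.976
```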

One particular form used in spam filtering is known as a "naive Bayesian classifier", which simply means that every feature you look for as evidence of spam is considered independent of every other feature.  This would appear to restrict the ability of the system to learn about combinations of content that increase the likelihood of a message being spam.  However, it is fast and has surprisingly high accuracy.
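
A minimal sketch of such a classifier is shown below.  It treats each word as an independent feature, adds one to every count (Laplace smoothing) so that unseen words do not zero out the result, and compares log-probabilities for the two classes.  The class name, the whitespace tokenisation and the tiny training set are assumptions made for illustration; it also assumes at least one training message of each kind.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learn from one message already labelled "spam" or "ham"."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocabulary = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Prior: the fraction of training messages with this label.
        score = math.log(self.message_counts[label] / sum(self.message_counts.values()))
        # Naive step: each word contributes independently of the others.
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + vocabulary))
        return score

    def is_spam(self, text):
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("buy cheap watches now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday", "ham")

print(spam_filter.is_spam("cheap pills"))
print(spam_filter.is_spam("monday meeting"))
```

Because each word is scored on its own, training and classification need only simple counts, which is where the speed comes from.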

Other forms use combinations of content as well as typical traffic patterns.  For example, you may receive many emails with the word Viagra but you rarely send them.  Hence, if you see a high proportion of email with a particular word passing across your network the likelihood of it being spam is raised.

One cannot rely totally on Bayesian filtering as it is susceptible to "poisoning", where spammers send email containing large amounts of text that is unlikely to be classed as spam.  Hence, whilst individual words might raise an alert, when looked at as a whole, the message receives a lower spam score than would otherwise be the case.
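
The effect of poisoning can be seen with a small calculation.  The sketch below combines per-word spam probabilities under the independence assumption, in the manner popularised by Paul Graham's "A Plan for Spam"; the probabilities themselves are invented for the example and are not taken from any real filter.

```python
import math


def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities, assuming the words are independent."""
    spam_product = math.prod(word_probs)
    ham_product = math.prod(1.0 - p for p in word_probs)
    return spam_product / (spam_product + ham_product)


# Two strongly "spammy" words (illustrative values).
plain = [0.95, 0.90]
# The same message padded with five innocent-looking words.
poisoned = plain + [0.20] * 5

print(round(combined_spam_probability(plain), 2))     # roughly 0.99
print(round(combined_spam_probability(poisoned), 2))  # roughly 0.14
```

The spammy words are still present in the padded message, but the innocent words outweigh them in the combined score.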

The volumes of spam email are extraordinary.  Between 70% and 80% of all email sent is spam.  As none of the current methods described here are completely effective, there is still scope for much further research in this area.