Spam filtering

Preface

I wrote this article on a potential method of spam filtering back in May of 2002, before Paul Graham’s rather more famous article on the same subject. He had the advantage over me of being much better known, and of providing an actual implementation of his ideas, something which my schedule (coursework and final examinations for my degree) did not permit me to do at the time. He also, I suspect, received rather more spam than I did: carrying out a statistical analysis would have been hard with the limited corpus I had available.

In fact, neither I nor Paul Graham was the first to suggest Bayesian classification of spam, as I discovered whilst researching the subject: cf. Androutsopoulos et al. (PDF).

I am, of course, glad that someone implemented the idea of Bayesian spam classification, because the amount of spam I receive has been increasing with time. Bayesian classification works well enough, however, that the amount reaching my inbox is lower than ever.

I’m keeping this piece online just to prove that I had the same idea, and that it was a good one!

Background

Spam is evil.

My main account is, fortunately, pretty much free of spam. My Hotmail account, on the other hand, receives an unsolicited bulk email message approximately once every quarter of an hour. Although I can’t do much about this (Hotmail’s blocking options aren’t really comprehensive), it does have a side benefit: it gives me a decent corpus of data to work from towards the holy grail of correctly identifying (and discarding) all spam.

Obviously, this should be done without stopping the genuine emails that I want to receive. The aim is therefore to achieve the lowest possible false acceptance rate (spam wrongly accepted as genuine) and false rejection rate (genuine mail wrongly rejected as spam).

Many complex spam-blocking programs are available, using combinations of sender blacklists, common spam phrases, heuristics and the like. My goal, however, is to make a simple system that produces effective results through what I believe is a novel method.

The idea

My recent study of spoken language processing suggested to me a method by which this aim might be achieved, through the application of probabilistic methods.

First, for each word that occurs in spam messages, the probability of the message being spam given the presence of the word is evaluated:

  • Take the set of spam messages, S, and genuine messages, G, from the corpus of all messages, M.
  • Discard common words (a, the, of, etc.)
  • For each word w that appears in a spam message, count the frequency of appearance in S and G.
  • From this, derive the probability that the word appears given that it is a spam message P(w|S), and the probability that the word appears given that it is a genuine message P(w|G).
  • Bayes’ theorem is then used to find the probability that a message is spam given the presence of a word w: P(S|w) = P(w|S)·P(S) / (P(w|S)·P(S) + P(w|G)·P(G)), where P(S) and P(G) are the overall proportions of spam and genuine messages in the corpus.

These words and their spam probabilities are stored in an indicator table. I suggest that only words providing a strong indication of whether a message is spam or genuine be stored and used.
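
As a rough sketch of how such a table might be built (the function name, the stop-word list, and the 0.1/0.9 cut-offs below are illustrative assumptions on my part, not fixed parts of the proposal):

    from collections import Counter

    STOP_WORDS = {"a", "the", "of", "and", "to"}  # common words to discard (illustrative)

    def build_indicator_table(spam_msgs, genuine_msgs, low=0.1, high=0.9):
        """Estimate P(S|w) for each word w and keep only the strong indicators.

        spam_msgs / genuine_msgs: lists of messages, each a list of lower-case words.
        Returns a mapping from word to P(S|w).
        """
        n_spam, n_genuine = len(spam_msgs), len(genuine_msgs)
        # Priors P(S) and P(G), taken as the corpus proportions.
        p_s = n_spam / (n_spam + n_genuine)
        p_g = 1.0 - p_s

        # For each word, count how many spam / genuine messages it appears in.
        spam_counts = Counter(w for m in spam_msgs for w in set(m) if w not in STOP_WORDS)
        genuine_counts = Counter(w for m in genuine_msgs for w in set(m) if w not in STOP_WORDS)

        table = {}
        # Following the article, only words appearing in at least one spam message are considered.
        for word, count in spam_counts.items():
            p_w_s = count / n_spam                      # P(w|S)
            p_w_g = genuine_counts[word] / n_genuine    # P(w|G)
            # Bayes' theorem: P(S|w) = P(w|S)P(S) / (P(w|S)P(S) + P(w|G)P(G))
            p_s_w = (p_w_s * p_s) / (p_w_s * p_s + p_w_g * p_g)
            # Keep only words that point strongly towards spam or towards genuine mail.
            if p_s_w <= low or p_s_w >= high:
                table[word] = p_s_w
        return table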

Next, we can calculate the probability that the message is spam by finding the mean word spam probability:

P(S|M) = (1/n) Σ P(S|wᵢ),  i = 1, …, n

where w₁, …, wₙ are the words of the message M that appear in the indicator table; words not in the table are simply skipped at this stage.
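
A minimal sketch of this scoring step, assuming the indicator table is a simple mapping from word to P(S|w) as in the sketch above:

    def score_message(words, table):
        """Mean P(S|w) over the message's words found in the indicator table.

        words: the message as a list of lower-case words.
        table: mapping from word to P(S|w), as built during training.
        Returns a value in [0, 1], or None if no indicator words were found.
        """
        probs = [table[w] for w in words if w in table]  # unknown words are skipped
        if not probs:
            return None
        return sum(probs) / len(probs)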

Finally, comparing this value against a threshold indicates to what degree the message is likely to be spam. Different thresholds can be used to decide whether to accept, reject, or quarantine the message.
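
A sketch of this final decision, with two illustrative thresholds (the 0.5 and 0.9 figures are assumptions for the example, not values from the proposal):

    def classify(score, quarantine_above=0.5, reject_above=0.9):
        """Turn a spam score into an accept / quarantine / reject decision.

        score: mean P(S|w) for the message, or None if it contained no indicator words.
        The two threshold values are illustrative only.
        """
        if score is None:
            return "accept"        # nothing to judge the message on
        if score >= reject_above:
            return "reject"
        if score >= quarantine_above:
            return "quarantine"
        return "accept"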

Testing

This simply takes the form of applying the process to the messages in the test corpus and evaluating the following (a rough sketch in code is given after the list):

  • The false acceptance rate
  • The false rejection rate
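
A rough sketch of that evaluation, reusing the hypothetical score_message and classify functions sketched above, and counting both “reject” and “quarantine” verdicts as rejections:

    def evaluate(test_spam, test_genuine, table):
        """Compute false acceptance and false rejection rates over a test corpus.

        test_spam / test_genuine: lists of messages, each a list of lower-case words.
        """
        def rejected(msg):
            return classify(score_message(msg, table)) != "accept"

        false_accepts = sum(1 for m in test_spam if not rejected(m))     # spam let through
        false_rejects = sum(1 for m in test_genuine if rejected(m))      # genuine mail blocked

        return false_accepts / len(test_spam), false_rejects / len(test_genuine)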

I intend to make a proof-of-concept program to test my proposal in the near future.