by Viper
21. June 2009 08:20
I have received a lot of messages asking for more details about how a Bayesian spam filter works. When I posted Bayesian Spam Filter Trainer, I did not add references to the wonderful literature that describes what Bayes' theorem is and how it is used to detect spam. So before I discuss how you can refine the process for Twitter, let me give you a list of reference material.
How does it work?
I will quickly give an overview. For details, I definitely recommend reading the articles mentioned above. The backbone of the Bayesian approach is a pair of corpora, one good and one spam, used to train the filter. In simple terms, we take a huge chunk of text, split it into individual words, remove the words that are of no interest, calculate word frequencies, and from those calculate the probability of each word appearing in good or spam text.
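The overview above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the code behind the service: the names `tokenize`, `train`, `spam_probability` and the tiny `STOP_WORDS` list are made up for the example, and the scoring is plain naive Bayes with add-one smoothing and equal priors, which may differ from the formula in the articles or in the C# library.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real filter would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "is", "in", "for"}

def tokenize(text):
    """Split text into lowercase words and drop the uninteresting ones."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def train(messages):
    """Count how often each word appears across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

def spam_probability(text, good_counts, spam_counts):
    """Naive Bayes estimate of P(spam | text), assuming equal priors."""
    vocab = set(good_counts) | set(spam_counts)
    # Add-one smoothing; the extra +1 reserves a slot for unseen words.
    good_total = sum(good_counts.values()) + len(vocab) + 1
    spam_total = sum(spam_counts.values()) + len(vocab) + 1
    log_odds = 0.0
    for word in tokenize(text):
        p_spam = (spam_counts[word] + 1) / spam_total
        p_good = (good_counts[word] + 1) / good_total
        log_odds += math.log(p_spam / p_good)
    # Clamp so that math.exp cannot overflow on very long inputs.
    log_odds = max(-50.0, min(50.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Train one counter on the good corpus and one on the spam corpus, then treat a message as spam when `spam_probability` goes above a threshold of your choice (0.5 is the natural starting point with equal priors).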
Small context makes it harder
Using a standard corpus to train the filter for Twitter messages works for the most part. I noticed a success rate of close to 90%, which is not bad. The issue with Twitter messages is that you only have 140 characters to establish meaningful context. Of those 140, a good 20 to 30 characters are taken by the shortened URLs that spammers add to advertise their products or to trick you into visiting some affiliate redirect site. So we have about 100 characters available to detect spam. A lot of the time, a good message and a spam message look very similar. That causes spam to fall through the cracks, or good messages to show up as false positives.
Refine the filter
Here are some of the refinements that I added to my spam filter service.
- The filter has been trained with live Twitter messages. The more data you use to train a Bayesian filter, the better the results you will get.
- The results are reviewed manually and messages are classified as good or spam.
- The sender's screen name is included in the corpus. This is based on the discussion in A Plan For Spam, which stresses how important it is to include an email's headers, sender name, and so on in the corpus. The same applies to a Twitter corpus, where you do not have elaborate metadata available. Including the screen name makes sure that a spammer's account ends up in the spam corpus. So in edge cases where the message body is ambiguous, the screen name may break the tie and lead to the correct classification.
- Links included in a message help clean up the corpus. The service does some post-processing on the messages in the background. It checks each message for links, expands short URLs into the actual URLs, and then gets some metadata about the target site. It looks at information such as title, keywords, and description to establish whether the user is attempting to redirect to adult content, key loggers, or other questionable sites. Based on the results, the message is reclassified.
- There is always something new to learn. The spam filter can never have enough training. The service continuously updates the corpora and reloads the new statistical data.
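Two of the refinements above, the screen name and the link metadata check, can be sketched as follows. Everything here is an assumption made for illustration: the function names `tweet_tokens` and `reclassify`, the `sender:` prefix, and the `BLOCKED_TERMS` list are hypothetical, and the page metadata is assumed to have been fetched already into a plain dictionary (expanding the short URL and downloading the page are left out).

```python
import re

# Hypothetical list of terms that mark a landing page as questionable.
BLOCKED_TERMS = {"adult", "keylogger", "casino"}

def tweet_tokens(screen_name, text):
    """Tokenize a tweet and add the sender as its own prefixed token."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # The "sender:" prefix keeps a screen name from colliding with an ordinary word.
    return ["sender:" + screen_name.lower()] + words

def reclassify(label, page_meta, blocked_terms=BLOCKED_TERMS):
    """Return "spam" if the target page's metadata contains a blocked term."""
    fields = (page_meta.get(key, "") for key in ("title", "keywords", "description"))
    meta_words = set(re.findall(r"[a-z0-9]+", " ".join(fields).lower()))
    # Otherwise keep whatever label the Bayesian filter assigned.
    return "spam" if meta_words & blocked_terms else label
```

Because the sender token goes through training like any other word, an account that keeps posting spam builds up a high spam probability of its own, which is what lets it break ties on ambiguous messages.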
These are some of the techniques that I have deployed in the current spam filter service. As I collect more data, these techniques will be refined further. Any suggestions are most welcome. Like my spam service, I am still learning.
by Viper
20. June 2009 20:11
Download SpamTrainer Binaries
Download SpamTrainer Source
As more and more people tweet, spam is growing as well. Every time I search for a topic, almost half of the messages seem to fall into one of the following categories:
- Somebody is trying to sell something
- Somebody is posting links to get you to affiliate web sites to make some money
- Job agencies are posting jobs
- .... and more
This week I decided to try a Bayesian spam filter, the kind used by most email servers, on Twitter messages. While searching around I found Bayesian Spam Filter for C#, which gave me a good starting point. Without making any changes or training with any additional corpus, I was able to get very good filtering results: I observed close to 90% spam detection. I studied the messages that fell through the cracks and also studied the false positives. Based on those observations, I figured that the issue is the very limited context of 140 characters in Twitter. A lot of good and spam Twitter messages look pretty much the same. So the key to improving the results was to train the filter with Twitter messages and not rely only on a corpus taken from emails or similar sources. I decided to build an application that I could use to generate a corpus of Twitter messages classified as spam and good.
How does it work
- Start the application.
- Enter a search term and click on the "More Data" button.
- The application will do an initial classification of the messages. All spam messages are displayed in orange or light blue.
- Double-click on any message to change its classification.
- Once you are satisfied with the results, click on the "Accept" button and the results are saved in the appropriate good and spam files.
- You can load the new corpus results by clicking on "Reload Corpus".
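If you would rather script the same workflow, the review, accept, and reload steps can be sketched like this. It is not the SpamTrainer source: the function names `toggle`, `accept`, and `load_corpus` and the file names `good.txt` and `spam.txt` are assumptions made for the example, since the trainer's actual file layout is not described here.

```python
from pathlib import Path

def toggle(label):
    """Flip a message's classification, as a double-click does in the trainer."""
    return "good" if label == "spam" else "spam"

def accept(classified, corpus_dir):
    """Append each (text, label) pair to good.txt or spam.txt in corpus_dir."""
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    for text, label in classified:
        # One message per line; whitespace inside a message is flattened.
        line = " ".join(text.split())
        with open(corpus_dir / f"{label}.txt", "a", encoding="utf-8") as handle:
            handle.write(line + "\n")

def load_corpus(corpus_dir):
    """Read both corpus files back, returning an empty list for a missing file."""
    corpus = {}
    for label in ("good", "spam"):
        path = Path(corpus_dir) / f"{label}.txt"
        corpus[label] = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    return corpus
```

Appending instead of overwriting means every accepted batch grows the corpus, so the next reload trains on everything collected so far.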
Spam Filter Service
I have created a service that you can use to classify your text if you do not want to build one of your own. The following link provides more details about the service.
Spam Filter Service