Bayesian Training

This guide explains how to train the Bayesian plugin using the Classification screen. It also shows how to use your existing emails in Outlook Express to train the plugin.

Note: some of the features mentioned here are only available in the very latest (v1.08a) version of the plugin.

Quick Index

1. Overview

2. Training Bayesian
2.1 Training Bayesian: Using classification window
2.2 Training Bayesian: Using existing emails (Outlook/Outlook Express)

3. Options
3.1 Recommended Options

1. Overview

The Bayesian plugin is a semi-intelligent solution for recognising spam. Over time, it learns from incoming email messages and the words they contain, marking each word with a spam probability and a clean probability. It then filters emails based on these probabilities, producing an overall score for each email.
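
As a rough illustration of the idea, the sketch below gives each word a spam ratio and combines the ratios into one score per message. It is not the plugin's actual code: the function names, the neutral value of 50 for unseen words, the clamping, and the naive Bayes style combination are all assumptions made for this example.

```python
from math import prod


def word_spam_ratio(spam_count, clean_count):
    """Percentage (0-100) of a word's sightings that were in spam."""
    total = spam_count + clean_count
    if total == 0:
        return 50.0  # assumption: a word never seen before is neutral
    return 100.0 * spam_count / total


def message_score(ratios):
    """Combine per-word ratios (0-100) into one message score (0-100)."""
    # Assumption: clamp each ratio so one extreme word cannot zero the product.
    probs = [min(max(r / 100.0, 0.01), 0.99) for r in ratios]
    spam = prod(probs)
    clean = prod(1.0 - p for p in probs)
    return 100.0 * spam / (spam + clean)
```

With this sketch, a message whose words are all neutral scores 50, and several strongly "spammy" words together push the score higher than any one of them would alone.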

2. Training Bayesian

There are two ways to train the Bayesian filter. The first and most user-friendly way is to use the classification window. Once you have received a few emails, use the classification window to tell the Bayesian plugin which of your messages are spam and which are clean. The more messages you classify, the more accurate the filter becomes.

The second way to train the filter is to use your existing emails. It's a bit more complicated than using the classification screen, but if you want results fast, it's the way to go.

At the moment, the only way to do this is by using Outlook Express. If you can first convert your emails to Outlook Express, you can follow the guide below (2.2).

2.1 Training Bayesian: Using classification window

Select Plugins -> Bayesian Filter from the right-click menu on the SpamPal tray icon (the umbrella near the taskbar clock).

Each email processed by the plugin is copied here so that it can be reclassified if necessary.

A red icon means the email was classified as spam; a green one means the message is clean.

Functionality of the buttons:

Spam - Mark the currently selected emails as spam
Clean - Mark the currently selected emails as clean
Directory -

Selected - Remove the currently selected emails
All - Remove all emails

Close - Close Classification window (same as using the top right red cross)

Shortcuts:

You can click on the column headers to sort your emails in ascending or descending order.
You can click on a column and drag it left or right to rearrange the column layout.
You can highlight an email and then use the Delete key to remove it, instead of clicking the Remove button.
You can highlight an email and then double-click it to toggle the spam/clean status of the message.
You can click on the Maximize button (top right) to maximize the reclassify screen.
You can resize the reclassify screen by dragging the bottom right corner of the window.

2.2 Training Bayesian: Using existing emails

If you are already using Outlook Express, start at Step 2.

This method doesn't require any additional programs aside from Outlook Express.

Step 1: Import from Outlook to Outlook Express. With Outlook open, start Outlook Express and select the Import option.

Follow the directions for importing from Outlook.

Step 2: In Outlook Express, first create two folders: one called train-spam, which will contain a copy of your spam emails, and one called train-clean for all good (non-spam) emails.

To do this, right-click on Local Folders and select New Folder from the menu:

Now create your training folder for spam emails:

Now create your training folder for non-spam/clean emails:

Now select all the spam emails in your inbox (or other folders) and copy them into the spam folder you have just created (i.e. the train-spam folder).

To do this, select the emails you want and then right-click on the highlighted emails:

Select Copy to Folder and then select the train-spam folder:

Now select all the non-spam emails in your inbox (or other folders) and copy them into the "clean" folder you have just created (i.e. the train-clean folder).

To do this, select the emails you want and then right-click on the highlighted emails:

Select Copy to Folder and then select the train-clean folder:

Step 3: Export the train-spam and train-clean folders to folders on your hard drive. First, create two folders (one called train-spam and one called train-clean) on your desktop.

Move/resize the Outlook Express window so that you can see the newly created folders and can also see the entire Outlook Express window.

In Outlook Express, open the train-spam folder and click on the first message. Press Ctrl+A, or choose Select All from the menu, to select all of the messages.

Drag and drop the selected emails to the train-spam folder on your desktop.

This will create a separate .eml file for each email message.

Do the same for the train-clean folder in Outlook Express (select all, then drag to the train-clean folder on your desktop).
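
Before importing, you may want to check that the export produced what you expect. The Python sketch below is not part of SpamPal; it simply counts the .eml files in a folder and reads their subject lines using Python's standard email module. The function name is made up for this example, and you would pass it the path of your own train-spam or train-clean desktop folder.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path


def summarize_eml_folder(folder):
    """Return (count, subjects) for the .eml files directly inside folder."""
    subjects = []
    for path in sorted(Path(folder).glob("*.eml")):
        with path.open("rb") as fh:
            # Only the headers are needed to read the subject line.
            msg = BytesParser(policy=policy.default).parse(fh, headersonly=True)
        subjects.append(str(msg["Subject"] or "(no subject)"))
    return len(subjects), subjects
```

If the count is much lower than the number of messages you dragged out of Outlook Express, repeat the drag and drop before moving on to the import.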

Step 4: Now import the two desktop folders into the Bayesian filter (via the Import tab in the Bayesian options panel):

Start with the spam emails: click the Spam button and browse to the train-spam folder on your desktop.

Now repeat the import process for the clean emails:

In the Bayesian options Import panel, click the Clean button and then browse to the train-clean folder on your desktop.

Now that your spam and clean emails have been imported, you can tidy up: delete the train-clean and train-spam folders from your desktop and also from Outlook Express.

3. Options

Spam Threshold
Increasing this reduces the number of false positives (clean email tagged as spam); decreasing it makes the filter tag more email as spam. Any word with a ratio below this threshold is considered a clean word.
Default value 90

Learning Threshold
Any word with a ratio greater than or equal to this is added to the database classed as spam.
Default value 99
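
Read literally, the two threshold descriptions above give the following rules for a single word. This is only a sketch of the wording on this page, not the plugin's code; the function name and return labels are invented, and the page does not say how a ratio that falls between the two thresholds is treated.

```python
def word_role(ratio, spam_threshold=90, learning_threshold=99):
    """Apply the two documented threshold rules to one word ratio (0-100)."""
    if ratio < spam_threshold:
        return "clean"  # below the spam threshold: a clean word
    if ratio >= learning_threshold:
        return "learned-as-spam"  # added to the database classed as spam
    # Assumption: the page leaves this band undefined, so it is only flagged.
    return "not specified"
```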

Limit message processing
Only process the first part of an email (see Amount of message to process)

Amount of message to process (kb)
Limits how much of each individual email is processed. Set this to avoid timeouts when the plugin takes too long to process large emails.

Note: If the Bayesian plugin marks emails as spam but they still arrive unfiltered in your inbox, you may need to make sure that the amount of message to filter for a message preview in the POP3 Port Properties matches the amount of message to filter for a full message. It may also be a good idea to raise these values to 128k. See this page for more details about POP3 Port Properties.
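
The effect of the two "limit" options can be pictured as a simple truncation of the raw message. This sketch assumes that 1 kb means 1024 bytes and that the cut is made on the raw message bytes; neither detail is confirmed by this page, and the function name is invented for the example.

```python
def limit_message(raw, limit_enabled, amount_kb):
    """Return the part of a raw message (bytes) that would be processed."""
    if not limit_enabled:
        return raw  # "Limit message processing" unticked: use everything
    # Assumption: 1 kb = 1024 bytes, counted from the start of the message.
    return raw[: amount_kb * 1024]
```

With the limit ticked and the amount set to 48, a 100,000-byte message is cut down to its first 49,152 bytes, while short messages pass through unchanged.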

Store message database on disk
If enabled, the Bayesian plugin keeps the messages available for reclassification not only between mail sessions but also when you reboot or shut down SpamPal. Messages are only removed when you highlight an email and click the Remove button (or use the Delete key), or when the number of days set in the option below is reached.

Keep messages in the database (days)
This is the number of days that the classify window keeps your messages for. Change this to 1, for example, and it'll only keep your messages for 1 day.

Word Count
The number of significant words examined during classification of an email.
Checking fewer words makes the filter more "trigger-happy"; checking more words means that more spam words must be present for an email to be classified as spam.
Default value 10
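
One way to picture "significant words" is to rank the words in a message by how far their ratio lies from a neutral 50 and keep only the top few. That definition of "significant" is an assumption made for this sketch, as are the function name and the sample words; the page does not say how the plugin ranks words.

```python
def most_significant(ratios, word_count=10):
    """Pick the word_count words whose ratio is farthest from a neutral 50."""
    ranked = sorted(
        ratios.items(),
        # Largest distance from 50 first; ties broken alphabetically.
        key=lambda item: (-abs(item[1] - 50.0), item[0]),
    )
    return ranked[:word_count]
```

For example, given the ratios {"lottery": 99, "meeting": 5, "the": 50, "offer": 80} and a word count of 2, the sketch keeps "lottery" and "meeting", because a strongly clean word counts as evidence just as much as a strongly spammy one.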

Min/Max word length
Set the minimum and maximum size of word that is used during filtering

Word expiry
Every word is tagged with the time it was last encountered. This threshold ensures that words that haven't occurred recently are removed from the database.
If a word has not appeared for X days (the word expiry value), its spam and clean occurrence counts are each decremented once per day until they reach zero. When both counts reach zero, the word is removed from the database.
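
The expiry rule can be sketched as a once-a-day pass over the word database. The WordStats record, the dictionary layout, and the choice to start decrementing on the day after the expiry period are assumptions for illustration, not the plugin's real storage format.

```python
from dataclasses import dataclass


@dataclass
class WordStats:
    spam: int             # times the word was seen in spam
    clean: int            # times the word was seen in clean mail
    days_since_seen: int  # days since the word last appeared


def age_one_day(db, word_expiry_days=90):
    """Advance the database by one day, decaying and removing stale words."""
    for word in list(db):  # copy the keys so entries can be deleted safely
        stats = db[word]
        stats.days_since_seen += 1
        if stats.days_since_seen > word_expiry_days:
            stats.spam = max(0, stats.spam - 1)
            stats.clean = max(0, stats.clean - 1)
            if stats.spam == 0 and stats.clean == 0:
                del db[word]
```

In this sketch a stale word with counts of 2 (spam) and 1 (clean) survives one daily pass with counts of 1 and 0, and is removed on the next pass.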

Minimum word occurrence for filtering
Sets the minimum number of times a word has to appear before it is used in filtering. A low setting will make the plugin more "trigger-happy", letting it mark emails based on less data.

Incoming words are case-sensitive
If unselected, all new email will be converted to lower case before filtering.
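
The word length, minimum occurrence, and case options can be thought of as filters applied to each word before it is used. In the sketch below the regular expression used to split text into words is a guess, the counts dictionary stands in for the plugin's database, and the default values are the recommended settings from section 3.1.

```python
import re


def filter_words(text, counts, min_len=4, max_len=32,
                 min_occurrence=5, case_sensitive=False):
    """Return the words from text that would take part in filtering."""
    if not case_sensitive:
        text = text.lower()  # option unticked: fold everything to lower case
    # Assumption: words are runs of letters, digits, $, ' and -.
    words = re.findall(r"[A-Za-z0-9$'-]+", text)
    return [
        w for w in words
        if min_len <= len(w) <= max_len and counts.get(w, 0) >= min_occurrence
    ]
```

Given counts of {"free": 10, "offer": 2, "meeting": 7}, the text "FREE offer at the Meeting" yields "free" and "meeting": "offer" has been seen too rarely, and "at" and "the" are too short.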

Create log file
Turns on/off logging

Learn (don't mark spam)
The plugin will do everything it normally does, except that it does not mark any email as spam. This lets the filter "learn" your email without an initial period in which it might mark a lot of email as spam before it "knows" your email. Don't forget to turn this option off when you think the filter has seen enough of your email ;-)

Assume whitelisted email is clean
Selecting this means that the plugin will score any whitelisted email as zero, i.e. perfectly clean.

Learn from whitelisted emails
Whether words found in whitelisted emails are added to the database (Use in conjunction with the above).
Example: You may be subscribed to a mailing list about spam (containing words that would be scored as spam) that you have whitelisted. If whitelisted email is considered clean then the words in these emails would be added to the database as clean. This option allows you to stop that happening.

Include headers in filtering
Select this if you want to include all of the email's headers in the Bayesian filtering.

Add "X-Bayesian-Words" header
Controls whether to add the "X-Bayesian-Words" header, which lists the interesting words that were found (and their scores). N.B. The "X-Bayesian-Result" header will always be added.
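
If you save a filtered message to disk, you can inspect these headers yourself. The sketch below uses Python's standard email module to pull out the two header values as raw text. The function name is invented, and because this page does not describe the format of the header values, the sketch makes no attempt to parse them.

```python
from email import message_from_string


def bayesian_headers(raw_message):
    """Return the raw values of the plugin's headers (None if absent)."""
    msg = message_from_string(raw_message)
    return {
        name: msg.get(name)
        for name in ("X-Bayesian-Result", "X-Bayesian-Words")
    }
```

A value of None for "X-Bayesian-Words" simply means the header was not added, for example because the option above is unticked.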

Learn from SpamPal and other plugins
If selected, the plugin will learn using results from SpamPal and all other enabled plugins

Ignore
Maintenance of the list of words that the plugin will ignore.

Functionality of the buttons:

Add
Add the word from the edit box into the list

Remove
Remove the selected words

Remove all
Empty the list

Reset
Revert back to the state of the list before the configuration window was opened

Default
Load the default ignore words

3.1 Recommended Options

Some users have found that these options produce very good results:

Thresholds:
- Spam threshold = 90
- Learning threshold = 98
- Limit message processing = Ticked
- Amount of message to process = 48

Words:
- Word count = 35
- Min. word length = 4
- Max. word length = 32
- Word expiry (days) = 90
- Minimum word occurrence for filtering = 5

Options:
- Create log file = Ticked
- Learn (don't mark spam) = Unticked
- Assume whitelist email is clean = Ticked
- Learn from whitelisted email = Ticked
- Include headers in filtering = Unticked
- Add 'X-Bayesian-Words' header = Ticked
- Learn from SP and other plugins = Ticked
