This guide shows how to train the Bayesian plugin using the Classification screen.
It also shows how to use Outlook Express, with your existing emails, to train the Bayesian plugin.
Note: some of the features mentioned here are only available in the very latest (v1.08a) version of the plugin.
2. Training Bayesian
2.1 Training Bayesian: Using classification window
2.2 Training Bayesian: Using existing emails (Outlook/Outlook Express)
3. Options
3.1 Recommended Options
The Bayesian plugin is a semi-intelligent solution for recognising spam. Over time, it learns from incoming email messages and the words they contain, marking each word with a spam probability and a clean probability.
It filters emails based on these probabilities, producing an overall score for the whole email.
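As a rough illustration of the idea, the Python sketch below derives a per-word spam probability from word counts and combines several of them into one score. It follows the general approach used by many Bayesian spam filters and is not taken from the plugin itself; the function names, the sample counts, the clamping to 0.01-0.99 and the combining formula are all assumptions.

```python
from collections import Counter


def word_spam_probability(word, spam_counts, clean_counts, n_spam, n_clean):
    """Estimate how 'spammy' a word is from its frequency in each class."""
    spam_freq = spam_counts[word] / max(n_spam, 1)
    clean_freq = clean_counts[word] / max(n_clean, 1)
    if spam_freq + clean_freq == 0:
        return 0.5  # word never seen before: treat it as neutral
    return spam_freq / (spam_freq + clean_freq)


def combine(probabilities):
    """Combine per-word probabilities into one overall score (naive Bayes style)."""
    spam = 1.0
    clean = 1.0
    for p in probabilities:
        p = min(max(p, 0.01), 0.99)  # assumption: clamp so no single word decides alone
        spam *= p
        clean *= 1.0 - p
    return spam / (spam + clean)


# Hypothetical training data: word counts from 10 spam and 10 clean emails.
spam_counts = Counter({"viagra": 8, "offer": 6, "meeting": 1})
clean_counts = Counter({"meeting": 9, "offer": 2})

probs = [
    word_spam_probability(w, spam_counts, clean_counts, 10, 10)
    for w in ["viagra", "offer"]
]
score = combine(probs)  # close to 1.0 means "probably spam"
```

The more emails you classify, the larger the counts become, which is why the per-word probabilities (and so the overall score) get more reliable with training.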
2. Training Bayesian
There are two ways to train the Bayesian filter. The first and most user-friendly way is to use the classification window. Once you have received a few emails, you use the classification window to tell the Bayesian plugin which of your messages are spam and which are clean. The more messages you classify, the more accurate the filter will become.
The second way of training the filter is to use your existing emails. It's a bit more complicated than just using the classification screen, but if you want results fast, it's the way to go.
At the moment, the only way to do this is by using Outlook Express. So if you can convert your emails to Outlook Express first, you can follow the guide below (2.2).
2.1 Training Bayesian: Using classification window
Select Plugins -> Bayesian Filter from the right-click menu on the SpamPal tray icon (the umbrella near the taskbar clock).
Each email that is processed by the plugin is copied here so that it can be reclassified if necessary.
A red icon means the email was classified as spam; a green one means the message is clean.
Functionality of the buttons:
Spam - Mark the currently selected emails as spam
Clean - Mark the currently selected emails as clean
Directory -
Selected - Remove the currently selected emails
All - Remove all emails
Close - Close the Classification window (same as using the top-right red cross)
Shortcuts:
You can click on the column headers to sort your emails in ascending or descending order.
You can click on a column and drag it (left or right) to arrange the column layout as you require.
You can highlight an email and then use the Delete key to remove it, instead of clicking the remove button.
You can highlight an email and then double-click on it to toggle the Spam/Clean status of the message.
You can click on the Maximize button (top right) to maximize the reclassify screen.
You can resize the reclassify screen by dragging the bottom-right corner of the window.
2.2 Training Bayesian: Using existing emails (Outlook/Outlook Express)
If you are using Outlook Express, start at Step 2.
This method doesn't require any additional programs aside from Outlook Express.
Step 1:
Import from Outlook to Outlook Express. With Outlook open, open Outlook Express and select the Import option.
Follow the directions for importing from Outlook.
Step 2:
In Outlook Express, you first need to create two folders: one called train-spam, which will contain a copy of your spam emails, and one called train-clean for all good (non-spam) email.
To do this, right-click on your Local Folders and select New Folder from the menu.
Now create your training folder for spam emails.
Now create your training folder for non-spam/clean emails.
Now select all the spam emails in your inbox (or other folders) and copy them into the spam folder you have just created (i.e. the train-spam folder).
To do this, select the emails you want and then right-click over the highlighted emails.
Select Copy to Folder and then select the train-spam folder.
Now select all the non-spam emails in your inbox (or other folders) and copy them into the "clean" folder you have just created (i.e. the train-clean folder).
To do this, select the emails you want and then right-click over the highlighted emails.
Select Copy to Folder and then select the train-clean folder.
Step 3:
Export the train-spam and train-clean folders to folders on your hard drive. First, create two folders (one called train-spam and one called train-clean) on your desktop.
Move/resize the Outlook Express window so that you can see the newly created folders and can also see the entire Outlook Express window.
In Outlook Express, open the train-spam folder and click on the first message. Press CTRL-A, or choose Select All from the menu, to select all of the messages.
Drag and drop the selected emails to the train-spam folder on your desktop.
This creates a separate .eml file for each email message.
Do the same for the train-clean folder in Outlook Express (select all, then drag to the train-clean folder on your desktop).
Step 4:
Now you need to import the two desktop folders into the Bayesian filter (via the Import tab in the Bayesian options panel).
Start with the spam emails: click the spam button and browse to the train-spam folder on your desktop.
Now repeat the import process for the clean emails: in the Bayesian options Import panel, click the clean button and browse to the train-clean folder on your desktop.
Now that your spam and clean emails have been imported, you can tidy up: delete the train-clean and train-spam folders on your desktop, and also delete them from Outlook Express.
3. Options
Spam Threshold
Increasing this reduces the number of false classifications; decreasing it makes the filter tag more email as spam. Any word with a ratio below this threshold is considered a clean word.
Default value: 90
Learning Threshold
Any word with a ratio greater than or equal to this is added to the database classed as spam.
Default value: 99
Limit message processing
Only process the first part of an email (see Amount of message to process).
Amount of message to process (kb)
Limit the amount of an individual email that is processed. Set this to avoid timeouts when the plugin takes too long to process large emails.
You may need to make sure that the amount of message to filter for a message preview in the POP3 Port Properties matches the amount of message to filter for a full message. It may also be a good idea to raise these values to 128k.
See this page for more details about POP3 Port Properties.
Store message database on disk
If enabled, the Bayesian plugin will keep the messages to be reclassified not only between mail sessions but also if you reboot or shut down SpamPal. Messages will only be removed when you highlight an email and click the Remove button (or use the Delete key), or when the number of days set in the option below is reached.
Keep messages in the database (days)
This is the number of days that the classify window keeps your messages for. Change this to 1, for example, and it will only keep your messages for 1 day.
Word Count
The number of significant words examined during classification of an email.
Checking fewer words makes the filter more "trigger-happy"; checking more words means more spam words need to be present for an email to be classified as spam.
Default value: 10
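The guide does not define what makes a word "significant". One common reading, used in many Bayesian filters, is that the words whose spam probability lies farthest from the neutral value 0.5 carry the most information. The sketch below picks the top word_count words on that assumption; the function name and sample probabilities are hypothetical, and the plugin may rank words differently.

```python
def most_significant(word_probabilities, word_count=10):
    """Return the word_count words whose spam probability is farthest from 0.5."""
    ranked = sorted(
        word_probabilities.items(),
        key=lambda item: abs(item[1] - 0.5),  # distance from "neutral"
        reverse=True,
    )
    return ranked[:word_count]


# Hypothetical per-word spam probabilities for one email.
probabilities = {"free": 0.99, "the": 0.5, "meeting": 0.05, "click": 0.9}
top_two = most_significant(probabilities, word_count=2)
# top_two == [("free", 0.99), ("meeting", 0.05)]
```

With a small word count, one or two strongly spammy words dominate the result, which matches the "trigger-happy" behaviour described above.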
Min/Max word length
Set the minimum and maximum length of the words used during filtering.
Word expiry
Every word is tagged with the time it was last encountered. This threshold ensures that words that haven't occurred recently are removed from the database.
If a word has not appeared for X days (the word expiry), its spam count and its clean count are each decremented once per day until they reach zero. When both reach zero, the word is removed from the database.
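The expiry rule described above can be sketched as a small daily maintenance pass. This is an illustration only: the WordEntry structure, the function name and the in-memory dictionary are assumptions, and the plugin's real database format is not documented here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class WordEntry:
    spam_count: int
    clean_count: int
    last_seen: date


def expire_words(database, today, expiry_days=90):
    """One daily pass: age words not seen for more than expiry_days."""
    for word in list(database):  # copy the keys so entries can be deleted
        entry = database[word]
        if (today - entry.last_seen).days <= expiry_days:
            continue  # seen recently enough, leave it alone
        entry.spam_count = max(entry.spam_count - 1, 0)
        entry.clean_count = max(entry.clean_count - 1, 0)
        if entry.spam_count == 0 and entry.clean_count == 0:
            del database[word]


# Hypothetical database contents.
db = {
    "old": WordEntry(1, 0, date(2024, 1, 1)),
    "stale": WordEntry(3, 2, date(2024, 1, 1)),
    "fresh": WordEntry(1, 1, date(2024, 6, 1)),
}
expire_words(db, today=date(2024, 6, 10), expiry_days=90)
# "old" is removed, "stale" drops to 2/1, "fresh" is untouched.
```

Because the counts shrink by one per day rather than being wiped at once, a frequently seen word takes much longer to disappear than a rare one.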
Minimum word occurrence for filtering
Sets the minimum number of times a word has to appear before it is used in filtering. A low setting will make the plugin more "trigger-happy", letting it mark emails based on less data.
Incoming words are case-sensitive
If unselected, all new email will be converted to lower case before filtering.
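To show how the case-sensitivity and word-length options interact, here is a minimal tokenizer sketch. It is not the plugin's own tokenizer: the function name, the regular expression that decides what counts as a word, and the default lengths (taken from the recommended options in this guide) are assumptions.

```python
import re


def tokenize(text, case_sensitive=False, min_length=4, max_length=32):
    """Split text into words, honouring the case and length options."""
    if not case_sensitive:
        text = text.lower()  # "FREE" and "free" become the same word
    # Assumption: a word is a run of letters, digits, $, ' or -.
    words = re.findall(r"[A-Za-z0-9$'-]+", text)
    return [w for w in words if min_length <= len(w) <= max_length]


lowered = tokenize("FREE offer for you NOW")
# lowered == ["free", "offer"]
kept_case = tokenize("FREE offer for you NOW", case_sensitive=True)
# kept_case == ["FREE", "offer"]
```

Case-sensitive filtering treats "FREE" and "free" as different words, which can help spot shouting spam but spreads the counts over more database entries.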
Create log file
Turns logging on or off.
Learn (don't mark spam)
The plugin will do everything it normally does, except that it does not mark any email as spam. This lets the filter "learn" your email without an initial period in which it might mark a lot of email as spam before it "knows" your email. Don't forget to turn this option off when you think the filter has seen enough of your email ;-)
Assume whitelisted email is clean
Selecting this means that the plugin will score any whitelisted email as zero, i.e. perfectly clean.
Learn from whitelisted emails
Whether words found in whitelisted emails are added to the database (use in conjunction with the above).
Example: You may be subscribed to a mailing list about spam (containing words that would be scored as spam) that you have whitelisted. If whitelisted email is considered clean, then the words in these emails would be added to the database as clean. This option allows you to stop that happening.
Include headers in filtering
Select this if you want to include all of the email's headers in the Bayesian filtering.
Add "X-Bayesian-Words" header
Whether to add the "X-Bayesian-Words" header, which lists the interesting words that were found (and their scores). N.B. The "X-Bayesian-Result" header will always be added.
Learn from SpamPal and other plugins
If selected, the plugin will learn using results from SpamPal and all other enabled plugins.
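If you want to see what the plugin wrote into a message, the X-Bayesian-* headers mentioned above can be read from a saved message with Python's standard email package. The sketch below only collects headers by name; the function name is hypothetical, and the header values in the sample are placeholders because this guide does not document their exact format.

```python
from email import message_from_string


def bayesian_headers(raw_message):
    """Return all X-Bayesian-* headers of a raw message as a dict."""
    message = message_from_string(raw_message)
    return {
        name: value
        for name, value in message.items()
        if name.lower().startswith("x-bayesian-")
    }


# Placeholder message: the header values are NOT real plugin output.
sample = (
    "Subject: hello\n"
    "X-Bayesian-Result: placeholder\n"
    "X-Bayesian-Words: placeholder\n"
    "\n"
    "Message body\n"
)
headers = bayesian_headers(sample)
```

Here headers contains only the two X-Bayesian entries; the Subject line is ignored.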
Ignore
Maintenance of the list of words that the plugin will ignore.
Functionality of the buttons:
Add
Add the word from the edit box into the list
Remove
Remove the selected words
Remove all
Empty the list
Reset
Revert back to the state of the list before the configuration window
was opened
Default
Load the default ignore words
3.1 Recommended Options
Some users have found that these options produce
very good results:
Thresholds:
- Spam threshold = 90
- Learning threshold = 98
- Limit message processing = Ticked
- Amount of message to process = 48
Words:
- Word count = 35
- Min. word length = 4
- Max. word length = 32
- Word expiry (days) = 90
- Minimum word occurrence for filtering = 5
Options:
- Create log file = Ticked
- Learn (don't mark spam) = Unticked
- Assume whitelisted email is clean = Ticked
- Learn from whitelisted emails = Ticked
- Include headers in filtering = Unticked
- Add "X-Bayesian-Words" header = Ticked
- Learn from SpamPal and other plugins = Ticked