spamoracle.conf - SpamOracle configuration file format
The spamoracle.conf file is a configuration file governing
the operation of the spamoracle(1) e-mail classification tool. By
default, spamoracle(1) looks for the configuration file at
$HOME/.spamoracle.conf, but an alternate location can be
specified with the -config flag.
Important note: most of the configuration parameters should
not be modified lightly, as this may result in completely wrong e-mail
classification. Familiarity with Graham's filtering algorithm, as described
in the paper referenced at the end of this page, is recommended to fully
understand the effect of the parameters.
The spamoracle.conf file is composed of lines of the form
variable = value. Lines starting with a # sign are
treated as comments and ignored. Blank lines are ignored.
Depending on the type of the variable (see the list of variables
below), the value part takes one of the following forms:
- string
- A sequence of characters. Blanks (spaces, tabs) at the beginning and the
end of the string are ignored. Alternatively, the string can be enclosed
in double quotes ("), in which case spaces are not trimmed. Inside
quoted strings, backslashes (\) and double quotes (") must be
escaped with a backslash, as in \\ or \".
- boolean
- Either on, yes, true, or 1 to activate the
boolean option, or off, no, false, or 0 to
deactivate it.
- integer
- A decimal integer.
- float
- A decimal floating-point number.
- regexp
- A regular expression in emacs(1) syntax. The repetition operators
are *, +, and ?. Alternation is written \| and
grouping is written \(...\). Character classes are written
between brackets [...] as usual. A single dot denotes any
character except newline. Regular expressions are case-insensitive.
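The line format and the string and boolean forms described above can be illustrated with a minimal Python sketch of a reader for such files. This is not spamoracle's own parser: the function names, the sample path, and the choice to treat the boolean keywords as case-sensitive are assumptions made for the example.

```python
TRUE_WORDS = {"on", "yes", "true", "1"}
FALSE_WORDS = {"off", "no", "false", "0"}


def parse_value(raw):
    """Decode the value part of a `variable = value` line as a string."""
    raw = raw.strip()  # blanks at both ends of the value are ignored
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        out, chars = [], iter(raw[1:-1])
        for ch in chars:
            # Inside quotes, a backslash escapes the character that follows.
            out.append(next(chars, "") if ch == "\\" else ch)
        return "".join(out)
    return raw


def parse_bool(text):
    """Map a boolean keyword to True/False (case-sensitive: an assumption)."""
    if text in TRUE_WORDS:
        return True
    if text in FALSE_WORDS:
        return False
    raise ValueError(f"not a boolean value: {text!r}")


def parse_config(text):
    """Return a dict of variable name -> decoded string value."""
    settings = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comments are ignored
        name, sep, raw = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'variable = value': {line!r}")
        settings[name.strip()] = parse_value(raw)
    return settings


# Hypothetical configuration text, used only to exercise the sketch.
SAMPLE = r'''# Example settings
database_file = "/home/alice/my mail/spam.db"
html_retain_tags = off

num_meaningful_words = 15
low_freq_limit = 0.01
'''

config = parse_config(SAMPLE)
```

Values come back as strings; a caller would convert them according to the declared type of each variable, for example `parse_bool(config["html_retain_tags"])` or `int(config["num_meaningful_words"])`.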
- database_file
- (type string, default value $HOME/.spamoracle.db)
The location of the file that contains the database of word frequencies used
by spamoracle(1).
- html_retain_tags
- (type boolean, default value false)
In HTML-formatted e-mails and attachments, the names of HTML tags are
normally not treated as words and are ignored for the word frequency
calculations. If the html_retain_tags parameter is set to
true, HTML tags (such as img or b) are treated as
words and included in the computation of word frequencies.
- html_tag_attributes
- (type regexp, default value
a/href\|img/src\|img/alt\|frame/src\|font/face\|font/color)
This regular expression matches pairs of HTML tags and HTML attributes
written as tag/attribute. When scanning
HTML-formatted e-mails and attachments, attributes to HTML tags are
normally ignored, unless the tag/attribute pair matches the regular
expression html_tag_attributes. If the tag/attribute pair matches
this regexp, the value of the attribute (for instance, the URL for the
a/href attribute) is scanned for words.
- (type regexp, default value from:\|subject:)
A regular expression determining which headers of an e-mail message are
scanned for words.
- alternative_favor_html
- (type boolean, default value true)
Determines how multipart/alternative messages are treated. If this parameter
is set, and one part of the alternative is of type text/html, this part is
scanned and all other parts are ignored. In all other cases, all parts of
the alternative are scanned.
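The rule above can be sketched in Python as follows. The function name and the representation of a message as a list of MIME type strings are hypothetical, and picking the first text/html part when several are present is an assumption, since the text does not cover that case.

```python
def parts_to_scan(part_types, alternative_favor_html=True):
    """Return the indices of the multipart/alternative parts to scan.

    part_types is a list of MIME type strings, one per alternative part.
    """
    if alternative_favor_html:
        for index, mime_type in enumerate(part_types):
            if mime_type.lower() == "text/html":
                # Scan only the HTML part; ignore every other alternative.
                return [index]
    # Parameter unset, or no HTML part present: scan all alternatives.
    return list(range(len(part_types)))
```

For example, `parts_to_scan(["text/plain", "text/html"])` returns `[1]`, while the same call with `alternative_favor_html=False` returns `[0, 1]`.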
- (type string, default value X-Spam)
The name of the header that spamoracle mark adds to incoming e-mail
messages, with the results of the spam/non-spam classification.
- (type string, default value X-Attachments)
The name of the header that spamoracle mark adds to incoming e-mail
messages, with the one-line summary of attachment types, names and
character sets. The generation of this header can be turned off with the
summarize_attachment parameter.
- summarize_attachment
- (type boolean, default value true)
If this parameter is set, spamoracle mark generates a one-line
summary of the attachments of the incoming messages, and inserts this
summary in the message headers. Setting this parameter to false
disables the generation of this extra header.
- num_meaningful_words
- (type integer, default value 15)
Maximum number of "meaningful" words that are retained for
computing the spam probability. During mail analysis, spamoracle
extracts all words of the message, and retains those whose spam frequency
(frequency of occurrence in spam messages) is closest to 1 or to 0. At
most num_meaningful_words such "meaningful" words are
retained.
- max_repetitions
- (type integer, default value 2)
Maximum number of times a given word can occur in the set of
"meaningful" words retained for computing the spam probability.
The default value of 2 means that at most 2 occurrences of the same word
will be retained.
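One way to picture how num_meaningful_words and max_repetitions interact is the following Python sketch. It is an illustration under stated assumptions, not spamoracle's actual code: "closest to 1 or to 0" is modelled as distance from 0.5, words missing from the frequency table are simply skipped, and the function and argument names are hypothetical.

```python
from collections import Counter


def select_meaningful(words, spam_freq, num_meaningful_words=15, max_repetitions=2):
    """Pick the words whose spam frequency is most extreme.

    words     -- the words of the message, in order of appearance
    spam_freq -- dict mapping a word to its spam frequency in [0, 1]
    """
    seen = Counter()
    candidates = []
    for word in words:
        if word not in spam_freq:
            continue  # assumption: words absent from the table are skipped
        if seen[word] >= max_repetitions:
            continue  # keep at most max_repetitions occurrences of a word
        seen[word] += 1
        # Distance from 0.5 is large when the frequency is near 0 or near 1.
        candidates.append((abs(spam_freq[word] - 0.5), word))
    # Most extreme first; the sort is stable, so ties keep message order.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in candidates[:num_meaningful_words]]
```

With the default max_repetitions of 2, a word repeated twenty times in a message contributes at most two entries to the retained set.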
- low_freq_limit
- (type float, default value 0.01)
- high_freq_limit
- (type float, default value 0.99)
The spam frequency of a word is computed as the number of occurrences in
spam divided by the number of occurrences in all messages. This ratio is then
clipped to the interval [low_freq_limit, high_freq_limit],
so that words that are extremely rare or extremely common in spam do not
bias the probability computation too much. The default values of 0.01 and
0.99 are adequate for a corpus of a few thousand e-mails. For larger
corpora (e.g. 10000 e-mails), the values 0.001 and 0.999 may give better
results.
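The clipping described above amounts to the following computation, shown here as a small Python sketch. The function and argument names are hypothetical, and the handling of a word that never occurs is an assumption.

```python
def spam_frequency(spam_count, total_count, low_freq_limit=0.01, high_freq_limit=0.99):
    """Ratio of spam occurrences to all occurrences, clipped to the limits."""
    if total_count <= 0:
        # Assumption: a word with no occurrences has no defined frequency.
        raise ValueError("total_count must be positive")
    ratio = spam_count / total_count
    # Clip to [low_freq_limit, high_freq_limit].
    return min(max(ratio, low_freq_limit), high_freq_limit)
```

A word seen 40 times and only in spam gets 0.99 rather than 1.0 with the default limits, and a word never seen in spam gets 0.01 rather than 0.0, so no single word can force the result on its own.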
- min_meaningful_words
- (type integer, default value 5)
Minimum number of "meaningful" words below which spamoracle
mark refuses to classify the e-mail and outputs "unknown"
status. This happens with very short e-mails, or e-mails that consist
exclusively of links and pictures.
- good_mail_prob
- (type float, default value 0.2)
Spam probability below which the e-mail is classified as non-spam.
- spam_mail_prob
- (type float, default value 0.8)
Spam probability above which the e-mail is classified as spam. Messages
whose probability falls between good_mail_prob and
spam_mail_prob are classified as "unknown".
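Taken together, min_meaningful_words, good_mail_prob and spam_mail_prob define a three-way decision, sketched below in Python. The function name and the returned label strings are illustrative only, and treating a probability exactly equal to a threshold as "unknown" is an assumption drawn from the words "below" and "above".

```python
def classify(spam_probability, meaningful_count,
             min_meaningful_words=5, good_mail_prob=0.2, spam_mail_prob=0.8):
    """Return "spam", "non-spam" or "unknown" for one message."""
    if meaningful_count < min_meaningful_words:
        return "unknown"  # too few meaningful words to decide
    if spam_probability < good_mail_prob:
        return "non-spam"
    if spam_probability > spam_mail_prob:
        return "spam"
    return "unknown"  # probability falls between the two thresholds
```

For instance, `classify(0.95, 15)` returns `"spam"`, while `classify(0.95, 3)` returns `"unknown"` because only three meaningful words were found.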
Xavier Leroy <Xavier.Leroy@inria.fr>
spamoracle(1)
http://www.paulgraham.com/spam.html (Paul Graham's seminal
paper)