recoll.conf - main personal configuration file for Recoll
This file defines the index configuration for the Recoll full-text
search system.
The system-wide configuration file is normally located inside
/usr/[local]/share/recoll/examples. Any parameter set in the common file may
be overridden by setting it in the specific index configuration file, by
default: $HOME/.recoll/recoll.conf
All recoll commands will accept a -c option or use the
$RECOLL_CONFDIR environment variable to specify a non-default index
configuration directory.
A short extract of the file might look as follows:
# Space-separated list of directories to index.
topdirs = ~/docs /usr/share/doc
[~/somedirectory-with-utf8-txt-files]
defaultcharset = utf-8
There are three kinds of lines:
- Comment or empty.
- Parameter assignment.
- Section definition.
Empty lines or lines beginning with # are ignored.
Assignment lines are in the form 'name = value'. In the following
description, they also have a type, which is mostly indicative. The two
non-obvious ones are 'fn': file path, and 'dfn': directory path.
Section lines allow redefining a parameter for a directory
subtree. Some of the parameters used for indexing are looked up
hierarchically from the more to the less specific. Not all parameters can be
meaningfully redefined; this is specified for each parameter in the next section.
The tilde character (~) is expanded in file names to the name of
the user's home directory.
Some 'string' values are lists, which is only indicated by their
description. In this case white space is used for separation, and elements
with embedded spaces can be quoted with double-quotes.
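As a small illustration of the list quoting rule above, the following fragment sets a list with two elements. The second path is a hypothetical placeholder, chosen only because it contains a space:

```ini
# Two list elements; the second one contains a space and is double-quoted.
topdirs = ~/docs "/mnt/shared/My Documents"
```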
- topdirs =
string
- Space-separated list of files or directories to recursively index. You can
use symbolic links in the list: they will be followed, independently of
the value of the followLinks variable. The default value is ~ :
recursively index $HOME.
- monitordirs =
string
- Space-separated list of files or directories to monitor for updates. When
running the real-time indexer, this allows monitoring only a subset of the
whole indexed area. The elements must be included in the tree defined by
the 'topdirs' members.
- skippedNames
= string
- File and directory names which should be ignored. White space separated
list of wildcard patterns (simple ones, not paths, must contain no '/'
characters). Have a look at the default configuration for the initial
value, some entries may not suit your situation. The easiest way to see
it is through the GUI Index configuration "local parameters"
panel.
The list in the default configuration does not exclude hidden
directories (names beginning with a dot), which means that it may index
quite a few things that you do not want. On the other hand, email user
agents like Thunderbird usually store messages in hidden directories,
and you probably want this indexed. One possible solution is to have
".*" in "skippedNames", and add things like
"~/.thunderbird" "~/.evolution" to
"topdirs".
Not even the file names are indexed for patterns in this list.
See the "noContentSuffixes" variable for an alternative
approach which indexes the file names. Can be redefined for any
subtree.
- skippedNames-
= string
- List of name patterns to remove from the default skippedNames list. Allows
modifying the list in the local configuration without copying it.
- skippedNames+
= string
- List of name patterns to add to the default skippedNames list. Allows
modifying the list in the local configuration without copying it.
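The hidden-directory approach suggested in the skippedNames description could be sketched as follows. This is only an illustration, assuming that mail is stored under ~/.thunderbird and ~/.evolution; adjust the paths to your own setup:

```ini
# Add a pattern for hidden names to the default skippedNames list,
# without copying the whole default list.
skippedNames+ = .*

# Hidden directories listed explicitly in topdirs are still indexed.
topdirs = ~ ~/.thunderbird ~/.evolution
```

Using skippedNames+ rather than skippedNames keeps the default patterns in effect, so later changes to the default list are not lost.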
- onlyNames =
string
- Regular file name filter patterns. This is normally empty. If set, only
the file names not in skippedNames and matching one of the patterns will
be considered for indexing. Can be redefined per subtree. Does not apply
to directories.
- noContentSuffixes
= string
- List of name endings (not necessarily dot-separated suffixes) for which we
don't try MIME type identification, and don't uncompress or index content.
Only the names will be indexed. This complements the now obsoleted
recoll_noindex list from the mimemap file, which will go away in a future
release (the move from mimemap to recoll.conf allows editing the list
through the GUI). This is different from skippedNames because these are
name ending matches only (not wildcard patterns), and the file name itself
gets indexed normally. This can be redefined for subdirectories.
- noContentSuffixes-
= string
- List of name endings to remove from the default noContentSuffixes
list.
- noContentSuffixes+
= string
- List of name endings to add to the default noContentSuffixes list.
- skippedPaths
= string
- Absolute paths we should not go into. Space-separated list of wildcard
expressions for absolute filesystem paths (for files or directories). The
variable must be defined at the top level of the configuration file, not
in a subsection.
Any value in the list must be textually consistent with the
values in topdirs: no attempts are made to resolve symbolic links. In
practice, if, as is frequently the case, /home is a link to /usr/home,
your default topdirs will have a single entry '~' which will be
translated to '/home/yourlogin'. In this case, any skippedPaths entry
should start with '/home/yourlogin', not with '/usr/home/yourlogin'.
The index and configuration directories will automatically be
added to the list.
The expressions are matched using 'fnmatch(3)' with the
FNM_PATHNAME flag set by default. This means that '/' characters must be
matched explicitly. You can set 'skippedPathsFnmPathname' to 0 to
disable the use of FNM_PATHNAME (meaning that '/*/dir3' will match
'/dir1/dir2/dir3').
The default value contains the usual mount point for removable
media to remind you that it is in most cases a bad idea to have Recoll
work on these. Explicitly adding '/media/xxx' to the 'topdirs' variable
will override this.
- skippedPathsFnmPathname
= bool
- Set to 0 to override use of FNM_PATHNAME for matching skipped paths.
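The effect of FNM_PATHNAME on skippedPaths patterns can be illustrated with a short fragment. The paths are hypothetical and assume that topdirs covers /home/me:

```ini
# With FNM_PATHNAME in effect (the default), '*' does not match '/'.
# The second pattern skips /home/me/proj1/cache,
# but not /home/me/proj1/sub/cache.
skippedPaths = /home/me/tmp /home/me/*/cache

# Uncomment to let '*' match '/' as well, so that the second pattern
# would also skip /home/me/proj1/sub/cache.
#skippedPathsFnmPathname = 0
```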
- nowalkfn =
string
- File name which will cause its parent directory to be skipped. Any
directory containing a file with this name will be skipped as if it was
part of the skippedPaths list. Ex: .recoll-noindex
- daemSkippedPaths
= string
- skippedPaths equivalent specific to real time indexing. This enables
having parts of the tree which are initially indexed but not monitored. If
daemSkippedPaths is not set, the daemon uses skippedPaths.
- zipUseSkippedNames
= bool
- Use skippedNames inside Zip archives. Fetched directly by the rclzip.py
handler. Skip the patterns defined by skippedNames inside Zip archives.
Can be redefined for subdirectories. See
https://www.recoll.org/faqsandhowtos/FilteringOutZipArchiveMembers.html
- zipSkippedNames
= string
- Space-separated list of wildcard expressions for names that should be
ignored inside zip archives. This is used directly by the zip handler. If
zipUseSkippedNames is not set, zipSkippedNames defines the patterns to be
skipped inside archives. If zipUseSkippedNames is set, the two lists are
concatenated and used. Can be redefined for subdirectories. See
https://www.recoll.org/faqsandhowtos/FilteringOutZipArchiveMembers.html
- followLinks =
bool
- Follow symbolic links during indexing. The default is to ignore symbolic
links to avoid multiple indexing of linked files. No effort is made to
avoid duplication when this option is set to true. This option can be set
individually for each of the 'topdirs' members by using sections. It
cannot be changed below the 'topdirs' level. Links in the 'topdirs' list
itself are always followed.
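A minimal sketch of setting followLinks for a single 'topdirs' member, using a section as described above. The directory names are placeholders:

```ini
topdirs = ~/docs ~/projects

# Follow symbolic links only while indexing ~/projects.
# ~/docs keeps the default (links are ignored).
[~/projects]
followLinks = 1
```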
- indexedmimetypes
= string
- Restrictive list of indexed MIME types. Normally not set (in which case
all supported types are indexed). If it is set, only the types from the
list will have their contents indexed. The names will be indexed anyway if
indexallfilenames is set (default). MIME type names should be taken from
the mimemap file (the values may be different from xdg-mime or file -i
output in some cases). Can be redefined for subtrees.
- excludedmimetypes
= string
- List of excluded MIME types. Lets you exclude some types from indexing.
MIME type names should be taken from the mimemap file (the values may be
different from xdg-mime or file -i output in some cases). Can be redefined
for subtrees.
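For illustration, restricting content indexing to a few types might look like the fragment below. The type names are assumed to match the ones used in your mimemap file; check that file before relying on them:

```ini
# Only index the contents of these types. The names of other files are
# still indexed if indexallfilenames is set.
indexedmimetypes = application/pdf text/html text/plain
```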
- nomd5types =
string
- MIME types for which we don't compute a md5 hash. md5 checksums are used
only for deduplicating results, and can be very expensive to compute on
multimedia or other big files. This list lets you turn off md5 computation
for selected types. It is global (no redefinition for subtrees). At the
moment, it only has an effect for external handlers (exec and execm). The
file types can be specified by listing either MIME types (e.g. audio/mpeg)
or handler names (e.g. rclaudio.py).
- compressedfilemaxkbs
= int
- Size limit for compressed files. We need to decompress these in a
temporary directory for identification, which can be wasteful in some
cases. Limit the waste. Negative means no limit. 0 results in no
processing of any compressed file. Default 100 MB.
- textfilemaxmbs
= int
- Size limit for text files. Mostly for skipping monster logs. Default 20
MB. Use a value of -1 to disable.
- textfilepagekbs
= int
- Page size for text files. If this is set, text/plain files will be divided
into documents of approximately this size. This will reduce memory usage
at index time and help with loading data in the preview window at query
time. Particularly useful with very big files, such as application or
system logs. Also see textfilemaxmbs and compressedfilemaxkbs.
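The two text file parameters can be combined. The values below are arbitrary and only meant to show the units (megabytes for the limit, kilobytes for the page size):

```ini
# Skip text files bigger than 50 MB entirely.
textfilemaxmbs = 50
# Split the remaining text/plain files into documents of about 1000 KB.
textfilepagekbs = 1000
```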
- textunknownasplain
= bool
- Process unknown text/xxx files as text/plain. Allows indexing misc. text
files identified as text/whatever by 'file' or 'xdg-mime' without having
to explicitly set config entries for them. This works fine for indexing
(although it will also cause the processing of a lot of useless files), but
the documents indexed this way will be opened by the desktop viewer, even
if text/plain has a specific editor.
- indexallfilenames
= bool
- Index the file names of unprocessed files. Index the names of files the
contents of which we don't index because of an excluded or unsupported
MIME type.
- usesystemfilecommand
= bool
- Use a system mechanism as last resort to guess a MIME type. Depending on
platform and version, a compile-time configuration will decide if this
actually executes a command or uses libmagic. This last-resort
identification (if the suffix-based one failed) is generally useful, but
will cause the indexing of many bogus extension-less 'text' files. Also
see 'systemfilecommand'.
- systemfilecommand
= string
- Command to use for guessing the MIME type if the internal methods fail.
This is ignored on Windows or with Recoll 1.38+ if compiled with libmagic
enabled (the default). Otherwise, this should be a "file -i"
workalike. The file path will be added as a last parameter to the command
line. "xdg-mime" works better than the traditional
"file" command, and is now the configured default (with a
hard-coded fallback to "file").
- processwebqueue
= bool
- Decide if we process the Web queue. The queue is a directory where the
Recoll Web browser plugins create the copies of visited pages.
- membermaxkbs
= int
- Size limit for archive members. This is passed to the MIME handlers in the
environment as RECOLL_FILTER_MAXMEMBERKB.
- indexStripChars
= bool
- Decide if we store character case and diacritics in the index. If we do,
searches sensitive to case and diacritics can be performed, but the index
will be bigger, and some marginal weirdness may sometimes occur. The
default is a stripped index. When using multiple indexes for a search,
this parameter must be defined identically for all. Changing the value
implies an index reset.
- indexStoreDocText
= bool
- Decide if we store the documents' text content in the index. Storing the
text allows extracting snippets from it at query time, instead of building
them from index position data.
Newer Xapian index formats have rendered our use of positions
list unacceptably slow in some cases. The last Xapian index format with
good performance for the old method is Chert, which is default for 1.2,
still supported but not default in 1.4 and will be dropped in 1.6.
The stored document text is translated from its original
format to UTF-8 plain text, but not stripped of upper-case, diacritics,
or punctuation signs. Storing it increases the index size by 10-20%
typically, but also allows for nicer snippets, so it may be worth
enabling it even if not strictly needed for performance if you can
afford the space.
The variable only has an effect when creating an index,
meaning that the xapiandb directory must not exist yet. Its exact effect
depends on the Xapian version.
For Xapian 1.4, if the variable was set to 0, we used to use
the Chert format and not store the text. If the variable was 1, Glass
was used, and the text stored. We don't do this any more: storing the
text has proved to be the much better option, and dropping this
possibility simplifies the code.
So now, the index format for a new index is always the
default, but the variable still controls if the text is stored or not,
and the abstract generation method. With Xapian 1.4 and later, and the
variable set to 0, abstract generation may be very slow, but this
setting may still be useful to save space if you do not use abstract
generation at all, by using the appropriate setting in the GUI, and/or
avoiding the Python API or recollq options which would trigger it.
- nonumbers =
bool
- Decides if terms will be generated for numbers. For example,
"123", "1.5e6", or "192.168.1.4" would not be indexed if
nonumbers is set ("value123" would still be). Numbers are often
quite interesting to search for, and this should probably not be set
except for special situations, e.g. scientific documents with huge amounts
of numbers in them, where setting nonumbers will reduce the index size.
This can only be set for a whole index, not for a subtree.
- notermpositions
= bool
- Do not store term positions. Term positions allow for phrase and proximity
searches, but make the index much bigger. In some special circumstances,
you may want to dispense with them.
- dehyphenate =
bool
- Determines if we index 'coworker' also when the input is 'co-worker'. This
is new in version 1.22, and on by default. Setting the variable to off
allows restoring the previous behaviour.
- indexedpunctuation
= string
- String of UTF-8 punctuation characters to be indexed as words. The
resulting terms will then be searchable. For example, by setting the
parameter to "%€" (without the double quotes), you would
be able to search separately for "100%" or
"100€". Note that "100%" or "100 %"
would be indexed in the same way: the characters are their own word
separators.
- backslashasletter
= bool
- Process backslash as a normal letter. This may make sense for people
wanting to index TeX commands as such but is not of much general use.
- underscoreasletter
= bool
- Process underscore as normal letter. This makes sense in so many cases
that one wonders if it should not be the default.
- maxtermlength
= int
- Maximum term length in Unicode characters. Words longer than this will be
discarded. The default is 40 and used to be hard-coded, but it can now be
adjusted. You may need an index reset if you change the value.
- nocjk =
bool
- Decides if specific East Asian (Chinese Korean Japanese) characters/word
splitting is turned off. This will save a small amount of CPU if you have
no CJK documents. If your document base does include such text but you are
not interested in searching it, setting nocjk may be a significant time
and space saver.
- cjkngramlen =
int
- This lets you adjust the size of n-grams used for indexing CJK text. The
default value of 2 is probably appropriate in most cases. A value of 3
would allow more precision and efficiency on longer words, but the index
will be approximately twice as large.
- hangultagger
= string
- External tokenizer for Korean Hangul. This allows using a
language-specific processor for extracting terms from Korean text, instead of the
generic n-gram term generator. See
https://www.recoll.org/pages/recoll-korean.html for instructions.
- chinesetagger
= string
- External tokenizer for Chinese. This allows using the language specific
Jieba tokenizer for extracting meaningful terms from Chinese text, instead
of the generic n-gram term generator. See
https://www.recoll.org/pages/recoll-chinese.html for instructions.
- indexstemminglanguages
= string
- Languages for which to create stemming expansion data. Stemmer names can
be found by executing 'recollindex -l', or this can also be set from a
list in the GUI. The values are full language names, e.g. english,
french...
- defaultcharset
= string
- Default character set. This is used for files which do not contain a
character set definition (e.g.: text/plain). Values found inside files,
e.g. a 'charset' tag in HTML documents, will override it. If this is not
set, the default character set is the one defined by the NLS environment
($LC_ALL, $LC_CTYPE, $LANG), or ultimately iso-8859-1 (cp-1252 in fact).
If for some reason you want a general default which does not match your
LANG and is not 8859-1, use this variable. This can be redefined for any
sub-directory.
- unac_except_trans
= string
- A list of characters, encoded in UTF-8, which should be handled specially
when converting text to unaccented lowercase. For example, in Swedish, the
letter a with diaeresis has full alphabet citizenship and should not be
turned into an a. Each element in the space-separated list has the special
character as first element and the translation following. The handling of
both the lowercase and upper-case versions of a character should be
specified, as membership in the list will turn off both standard accent
and case processing. The value is global and affects both indexing and
querying. We also convert a few confusing Unicode characters (quotes,
hyphen) to their ASCII equivalent to avoid "invisible" search
failures.
Examples:
Swedish:
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl åå Åå ’' ❜' ʼ' ‐-
unac_except_trans = ää Ää öö Öö üü Üü ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl ’' ❜' ʼ' ‐-
[...] a German ß
unac_except_trans = ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl ’' ❜' ʼ' ‐-
[...] are not performed by unac, but it is unlikely that someone would type the
composed forms in a search.
unac_except_trans = ßss œoe Œoe æae Æae ﬀff ﬁfi ﬂfl ’' ❜' ʼ' ‐-
- maildefcharset
= string
- Overrides the default character set for email messages which don't specify
one. This is mainly useful for readpst (libpst) dumps, which are utf-8 but
do not say so.
- localfields =
string
- Set fields on all files (usually of a specific fs area). Syntax is the
usual: name = value ; attr1 = val1 ; [...]. The value is empty, so this needs an
initial semicolon. This is useful, e.g., for setting the rclaptg field
for application selection inside mimeview.
- testmodifusemtime
= bool
- Use mtime instead of ctime to test if a file has been modified. The time
is used in addition to the size, which is always used. Setting this can
reduce re-indexing on systems where extended attributes are used (by some
other application), but not indexed, because changing extended attributes
only affects ctime. Notes: - This may prevent detection of change in some
marginal file rename cases (the target would need to have the same size
and mtime). - You should probably also set noxattrfields to 1 in this
case, except if you still prefer to perform xattr indexing, for example if
the local file update pattern makes it of value (as in general, there is a
risk for pure extended attributes updates without file modification to go
undetected). Perform a full index reset after changing this.
- noxattrfields
= bool
- Disable extended attributes conversion to metadata fields. This probably
needs to be set if testmodifusemtime is set.
- metadatacmds
= string
- Define commands to gather external metadata, e.g. tmsu tags. There can be
several entries, separated by semi-colons, each defining which field name
the data goes into and the command to use. Don't forget the initial
semi-colon. All the field names must be different. You can use aliases in
the "field" file if necessary. As a not too pretty hack conceded
to convenience, any field name beginning with "rclmulti" will be
taken as an indication that the command returns multiple field values
inside a text blob formatted as a recoll configuration file
("fieldname = fieldvalue" lines). The rclmultixx name will be
ignored, and field names and values will be parsed from the data. Example:
metadatacmds = ; tags = tmsu tags %f; rclmulti1 = cmdOutputsConf %f
- cachedir =
dfn
- Top directory for Recoll data. Recoll data directories are normally
located relative to the configuration directory (e.g. ~/.recoll/xapiandb,
~/.recoll/mboxcache). If 'cachedir' is set, the directories are stored
under the specified value instead (e.g. if cachedir is ~/.cache/recoll,
the default dbdir would be ~/.cache/recoll/xapiandb). This affects dbdir,
webcachedir, mboxcachedir, aspellDicDir, which can still be individually
specified to override cachedir. Note that if you have multiple
configurations, each must have a different cachedir, there is no automatic
computation of a subpath under cachedir.
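A sketch of how cachedir interacts with the individual directory parameters, based on the description above. The second path is a hypothetical shared location:

```ini
# Store Recoll data outside of the configuration directory.
cachedir = ~/.cache/recoll

# dbdir is not set here: following the description above, it would
# default to ~/.cache/recoll/xapiandb.

# An individually set directory overrides cachedir for that data only.
mboxcachedir = ~/.cache/recoll-shared/mboxcache
```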
- maxfsoccuppc
= int
- Maximum file system occupation over which we stop indexing. The value is a
percentage, corresponding to what the "Capacity" df output
column shows. The default value is 0, meaning no checking. This parameter
is only checked when the indexer starts; it will not change the behaviour
of a running process.
- dbdir =
dfn
- Xapian database directory location. This will be created on first
indexing. If the value is not an absolute path, it will be interpreted as
relative to cachedir if set, or the configuration directory (-c argument
or $RECOLL_CONFDIR). If nothing is specified, the default is then
~/.recoll/xapiandb/
- idxstatusfile
= fn
- Name of the scratch file where the indexer process updates its status.
Default: idxstatus.txt inside the configuration directory.
- mboxcachedir
= dfn
- Directory location for storing mbox message offsets cache files. This is
normally 'mboxcache' under cachedir if set, or else under the
configuration directory, but it may be useful to share a directory between
different configurations.
- mboxcacheminmbs
= int
- Minimum mbox file size over which we cache the offsets. There is really no
sense in caching offsets for small files. The default is 5 MB.
- mboxmaxmsgmbs
= int
- Maximum mbox member message size in megabytes. Size over which we assume
that the mbox format is bad or we misinterpreted it, at which point we
just stop processing the file.
- webcachedir =
dfn
- Directory where we store the archived web pages after they are processed.
This is only used by the Web history indexing code. Note that this is
different from webdownloadsdir which tells the indexer where the web pages
are stored by the browser, before they are indexed and stored into
webcachedir. Default: cachedir/webcache if cachedir is set, else
$RECOLL_CONFDIR/webcache
- webcachemaxmbs
= int
- Maximum size in MB of the Web archive. This is only used by the web
history indexing code. Default: 40 MB. Reducing the size will not
physically truncate the file.
- webqueuedir =
fn
- The path to the Web indexing queue. This used to be hard-coded in the old
plugin as ~/.recollweb/ToIndex so there would be no need or possibility to
change it, but the WebExtensions plugin now downloads the files to the
user Downloads directory, and a script moves them to webqueuedir. The
script reads this value from the config so it has become possible to
change it.
- webdownloadsdir
= fn
- The path to the browser add-on download directory. This tells the indexer
where the Web browser add-on stores the web page data. The data is then
moved by a script to webqueuedir, then processed, and finally stored in
webcachedir for future previews.
- webcachekeepinterval
= string
- Page recycle interval. By default, only one instance of a URL is kept in
the cache. This can be changed by setting this to a value determining at
what frequency we keep multiple instances ('day', 'week', 'month',
entries.
- aspellDicDir
= dfn
- Aspell dictionary storage directory location. The aspell dictionary
(aspdict.(lang).rws) is normally stored in the directory specified by
cachedir if set, or under the configuration directory.
- filtersdir =
dfn
- Directory location for executable input handlers. If RECOLL_FILTERSDIR is
set in the environment, we use it instead. Defaults to
$prefix/share/recoll/filters. Can be redefined for subdirectories.
- iconsdir =
dfn
- Directory location for icons. The only reason to change this would be if
you want to change the icons displayed in the result list. Defaults to
$prefix/share/recoll/images
- idxflushmb =
int
- Threshold (megabytes of new data) where we flush from memory to disk
index. Setting this allows some control over memory usage by the indexer
process. A value of 0 means no explicit flushing, which lets Xapian
perform its own thing, meaning flushing every $XAPIAN_FLUSH_THRESHOLD
documents created, modified or deleted: as memory usage depends on average
document size, not only document count, the Xapian approach is not very
useful, and you should let Recoll manage the flushes. The program compiled
value is 0. The configured default value (from this file) is now 50 MB,
and should be ok in many cases. You can set it as low as 10 to conserve
memory, but if you are looking for maximum speed, you may want to
experiment with values between 20 and 200. In my experience, values beyond
this are always counterproductive. If you find otherwise, please drop me a
note.
- filtermaxseconds
= int
- Maximum external filter execution time in seconds. Default 1200 (20 minutes).
Set to 0 for no limit. This is mainly to avoid infinite loops in
PostScript files (loop.ps).
- filtermaxmbytes
= int
- Maximum virtual memory space for filter processes (setrlimit(RLIMIT_AS)),
in megabytes. Note that this includes any mapped libs (there is no
reliable Linux way to limit the data space only), so we need to be a bit
generous here. Anything over 2000 will be ignored on 32-bit machines. The
high default value is needed because of Java-based handlers (pdftk) which
need a lot of VM (most of it text), esp. pdftk when executed from Python
rclpdf.py. You can use a much lower value if you don't need Java.
- thrQSizes =
string
- Task queue depths for each stage and threading configuration control.
There are three internal queues in the indexing pipeline stages (file data
extraction, terms generation, index update). This parameter defines the
queue depths for each stage (three integer values). In practice, deep
queues have not been shown to increase performance. The first value is
also used to control threading autoconfiguration or disabling
multithreading. If the first queue depth is set to 0 Recoll will set the
queue depths and thread counts based on the detected number of CPUs. The
arbitrarily chosen values are as follows (depth,nthread). 1 CPU -> no
threading. Less than 4 CPUs: (2, 2) (2, 2) (2, 1). Less than 6: (2, 4),
(2, 2), (2, 1). Else (2, 5), (2, 3), (2, 1). If the first queue depth is
set to -1, multithreading will be disabled entirely. The second and third
values are ignored in both these cases.
- thrTCounts =
string
- Number of threads used for each indexing stage. If the first entry in
thrQSizes is not 0 or -1, these three values define the number of threads
used for each stage (file data extraction, term generation, index update).
It makes no sense to use a value other than 1 for the last stage because
updating the Xapian index is necessarily single-threaded (and protected by
a mutex).
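As a rough sketch of the two threading parameters working together, the following fragment sets explicit queue depths and thread counts. The numbers are arbitrary illustrations, not recommended values:

```ini
# Explicit configuration: the first thrQSizes value is neither 0 nor -1,
# so thrTCounts is used.
thrQSizes = 2 2 2
# Threads for: file data extraction, term generation, index update.
# The last stage stays at 1 because index updates are single-threaded.
thrTCounts = 4 2 1

# Alternative: disable multithreading entirely. The second and third
# values are ignored in this case.
#thrQSizes = -1 -1 -1
```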
- thrTmpDbCnt =
int
- Number of temporary indexes used during incremental or full indexing. If
not set to zero, this defines how many temporary indexes we use during
indexing. These temporary indexes are merged into the main one at the end
of the operation. Using multiple indexes and a final merge can
significantly improve indexing performance when the single-threaded Xapian
index updates become a bottleneck. How useful this is depends on the type
of input and CPU. See the manual for more details.
- loglevel =
int
- Log file verbosity 1-6. A value of 2 will print only errors and warnings.
3 will print information like document updates, 4 is quite verbose and 6
very verbose.
- logfilename =
fn
- Log file destination. Use 'stderr' (default) to write to the console.
- idxloglevel =
int
- Override loglevel for the indexer.
- idxlogfilename
= fn
- Override logfilename for the indexer.
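The general and indexer-specific log settings could be combined as in the fragment below. The log file path is a hypothetical example:

```ini
# General setting: errors and warnings only, written to the console.
loglevel = 2
logfilename = stderr

# The indexer logs document updates too, into its own file.
idxloglevel = 3
idxlogfilename = /tmp/recollindex.log
```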
- helperlogfilename
= fn
- Destination file for external helpers standard error output. The external
program error output is left alone by default, e.g. going to the terminal
when the recoll[index] program is executed from the command line. Use
/dev/null or a file inside a non-existent directory to completely suppress
the output.
- daemloglevel
= int
- Override loglevel for the indexer in real time mode. The default is to use
the idx... values if set, else the log... values.
- daemlogfilename
= fn
- Override logfilename for the indexer in real time mode. The default is to
use the idx... values if set, else the log... values.
- pyloglevel =
int
- Override loglevel for the python module.
- pylogfilename
= fn
- Override logfilename for the python module.
- idxnoautopurge
= bool
- Do not purge data for deleted or inaccessible files. This can be overridden
by recollindex command line options and may be useful if some parts of the
document set may predictably be inaccessible at times, so that you would
only run the purge after making sure that everything is there.
- orgidxconfdir
= dfn
- Original location of the configuration directory. This is used exclusively
for movable datasets. Locating the configuration directory inside the
directory tree makes it possible to provide automatic query time path
translations once the data set has moved (for example, because it has been
mounted on another location).
- curidxconfdir
= dfn
- Current location of the configuration directory. Complement orgidxconfdir
for movable datasets. This should be used if the configuration directory
has been copied from the dataset to another location, either because the
dataset is readonly and an r/w copy is desired, or for performance
reasons. This records the original moved location before copy, to allow
path translation computations. For example if a dataset originally indexed
as '/home/me/mydata/config' has been mounted to '/media/me/mydata', and
the GUI is running from a copied configuration, orgidxconfdir would be
'/home/me/mydata/config', and curidxconfdir (as set in the copied
configuration) would be
- idxrundir =
dfn
- Indexing process current directory. The input handlers sometimes leave
temporary files in the current directory, so it makes sense to have
recollindex chdir to some temporary directory. If the value is empty, the
current directory is not changed. If the value is (literal) tmp, we use
the temporary directory as set by the environment (RECOLL_TMPDIR else
TMPDIR else /tmp). If the value is an absolute path to a directory, we go
there.
- checkneedretryindexscript
= fn
- Script used to heuristically check if we need to retry indexing files
which previously failed. The default script checks the modified dates on
/usr/bin and /usr/local/bin. A relative path will be looked up in the
filters dirs, then in the path. Use an absolute path to do otherwise.
- recollhelperpath
= string
- Additional places to search for helper executables. This is used, e.g., on
Windows by the Python code, and on Mac OS by the bundled recoll.app
(because I could find no reliable way to tell launchd to set the PATH).
The example below is for Windows. Use ':' as entry separator for Mac and
Ux-like systems, ';' is for Windows only.
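A sketch of what such values could look like, using the separators described above. All directories are placeholders, not paths that Recoll requires:

```ini
# Windows: entries separated by ';'.
recollhelperpath = c:/someprog/bin;c:/someother/bin

# Mac OS and Unix-like systems: entries separated by ':'.
#recollhelperpath = /opt/local/bin:/usr/local/bin
```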
- idxabsmlen =
int
- Length of abstracts we store while indexing. Recoll stores an abstract for
each indexed file. The text can come from an actual 'abstract' section in
the document or will just be the beginning of the document. It is stored
in the index so that it can be displayed inside the result lists without
decoding the original file. The idxabsmlen parameter defines the size of
the stored abstract. The default value is 250 bytes. The search interface
gives you the choice to display this stored text or a synthetic abstract
built by extracting text around the search terms. If you always prefer the
synthetic abstract, you can reduce this value and save a little
space.
- idxmetastoredlen
= int
- Truncation length of stored metadata fields. This does not affect indexing
(the whole field is processed anyway), just the amount of data stored in
the index for the purpose of displaying fields inside result lists or
previews. The default value is 150 bytes which may be too low if you have
custom fields.
- idxtexttruncatelen
= int
- Truncation length for all document texts. Only index the beginning of
documents. This is not recommended except if you are sure that the
interesting keywords are at the top and have severe disk space
issues.
- idxsynonyms =
fn
- Name of the index-time synonyms file. This is only used to issue
multi-word single terms for multi-word synonyms so that phrase and
proximity searches work for them (ex: applejack "apple jack").
The feature will only have an effect for querying if the query-time and
index-time synonym files are the same.
- idxniceprio =
int
- "nice" process priority for the indexing processes. Default: 19
(lowest). Appeared with 1.26.5. Prior versions were fixed at 19.
- noaspell =
bool
- Disable aspell use. The aspell dictionary generation takes time, and some
combinations of aspell version, language, and local terms, result in
aspell crashing, so it sometimes makes sense to just disable the
thing.
- aspellLanguage
= string
- Language definitions to use when creating the aspell dictionary. The value
must match a set of aspell language definition files. You can type
"aspell dicts" to see a list. The default, if this is not set, is
to use the NLS environment to guess the value. The values are the 2-letter
language codes (e.g. 'en', 'fr'...).
- aspellAddCreateParam
= string
- Additional option and parameter to aspell dictionary creation command.
Some aspell packages may need an additional option (e.g. on Debian Jessie:
--local-data-dir=/usr/lib/aspell). See Debian bug 772415.
- aspellKeepStderr
= bool
- Set this to have a look at aspell dictionary creation errors. There are
always many, so this is mostly for debugging.
- monauxinterval
= int
- Auxiliary database update interval. The real time indexer only updates the
auxiliary databases (stemdb, aspell) periodically, because it would be too
costly to do it for every document change. The default period is one
hour.
- monixinterval
= int
- Minimum interval (seconds) between processings of the indexing queue. The
real time indexer does not process each event when it comes in, but lets
the queue accumulate, to diminish overhead and to aggregate multiple
events affecting the same file. Default: 30 seconds.
- mondelaypatterns
= string
- Timing parameters for the real time indexing. Definitions for files which
get a longer delay before reindexing is allowed. This is for fast-changing
files, that should only be reindexed once in a while. A list of
wildcardPattern:seconds pairs. The patterns are matched with
fnmatch(pattern, path, 0). You can quote entries containing white space
with double quotes (quote the whole entry, not the pattern). The default
is empty. Example:
mondelaypatterns = *.log:20 "*with spaces.*:30"
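The value format described above can be explored with a short Python sketch. It is an approximation, not the indexer's code: shlex stands in for Recoll's own quoting rules, Python's fnmatch.fnmatchcase stands in for the C fnmatch() call with flags 0, and returning the first matching entry is an assumption. Both function names are hypothetical.

```python
import fnmatch
import shlex
from typing import List, Optional, Tuple


def parse_delay_patterns(value: str) -> List[Tuple[str, int]]:
    """Parse a list of wildcardPattern:seconds pairs."""
    pairs = []
    for entry in shlex.split(value):  # honours double-quoted entries
        # Split on the last ':' so that the pattern may itself contain ':'.
        pattern, _, seconds = entry.rpartition(":")
        pairs.append((pattern, int(seconds)))
    return pairs


def delay_for(path: str, pairs: List[Tuple[str, int]]) -> Optional[int]:
    """Return the delay of the first pattern matching path, else None."""
    for pattern, seconds in pairs:
        # As with fnmatch() flags 0, '*' also matches '/' characters.
        if fnmatch.fnmatchcase(path, pattern):
            return seconds
    return None
```

With the example value, a path such as /var/log/app.log would get the 20 second delay, and a file whose name contains "with spaces." would get 30 seconds.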
- monioniceclass
= int
- ionice class for the indexing process. Despite the misleading name, and on
platforms where this is supported, this affects all indexing processes,
not only the real time/monitoring ones. The default value is 3 (use lowest
"Idle" priority).
- monioniceclassdata
= string
- ionice class level parameter if the class supports it. The default is
empty, as the default "Idle" class has no levels.
- autodiacsens
= bool
- Auto-trigger diacritics sensitivity (raw index only). If the index is not
stripped, this decides whether diacritics sensitivity is triggered
automatically when the search term has accented characters (not in
unac_except_trans). Otherwise you need to use the query language and the
"D" modifier to specify diacritics sensitivity. Default is no.
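The trigger condition can be approximated as follows. This is a sketch under assumptions: "accented" is taken to mean that a character decomposes (Unicode NFD) into a base character plus a combining mark, and the unac_except_trans exceptions are simplified to a plain set of characters, which is not the real format of that parameter. The function name is hypothetical.

```python
import unicodedata
from typing import AbstractSet


def has_diacritics(term: str, exceptions: AbstractSet[str] = frozenset()) -> bool:
    """True if term contains an accented character not listed in exceptions."""
    for ch in term:
        if ch in exceptions:
            # Simplified stand-in for characters covered by unac_except_trans.
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        if any(unicodedata.combining(c) for c in decomposed):
            return True
    return False
```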
- autocasesens
= bool
- Auto-trigger case sensitivity (raw index only). If the index is not
stripped (see indexStripChars), this decides whether character case
sensitivity is triggered automatically when the search term has upper-case
characters in any but the first position. Otherwise you need to use the
query language and the "C" modifier to specify character-case sensitivity.
Default is yes.
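The "any but the first position" rule is easy to misread, so here is a minimal sketch of it. The function name is hypothetical and the code only mirrors the rule as worded above.

```python
def triggers_case_sensitivity(term: str) -> bool:
    """True if term has an upper-case character after the first position."""
    return any(ch.isupper() for ch in term[1:])
```

Under this rule a capitalized word such as "Paris" stays case-insensitive, while "PaRis" or "USA" would trigger a case-sensitive search.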
- maxTermExpand
= int
- Maximum query expansion count for a single term (e.g.: when using
wildcards). This only affects queries, not indexing. We used to not limit
this at all (except for filenames where the limit was too low at 1000),
but it is unreasonable with a big index. Default 10000.
- maxXapianClauses
= int
- Maximum number of clauses we add to a single Xapian query. This only
affects queries, not indexing. In some cases, the result of term expansion
can be multiplicative, and we want to avoid eating all the memory. Default
50000.
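To see why expansion can be multiplicative, consider a query where several terms each expand to many index terms and every combination yields a clause. The sketch below is only arithmetic under that assumption; how Recoll actually counts clauses is not described here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def estimated_clauses(expansions_per_term: Sequence[int]) -> int:
    """Worst-case clause count if every combination of expansions is used."""
    return math.prod(expansions_per_term)


def within_limit(expansions_per_term: Sequence[int], max_clauses: int = 50000) -> bool:
    """Compare the estimate with a maxXapianClauses-style limit."""
    return estimated_clauses(expansions_per_term) <= max_clauses
```

Two terms expanding to 40 variants each stay far below the default limit, but adding a third such term already exceeds it.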
- snippetMaxPosWalk
= int
- Maximum number of positions we walk while populating a snippet for the
result list. The default of 1,000,000 may be insufficient for very big
documents. The consequence would be snippets with missing words, which
could alter their meaning.
- thumbnailercmd
= string
- Command to use for generating thumbnails. If set, this should be a path to
a command or script followed by its constant arguments. Four arguments
will be appended before execution: the document URL, MIME type, target
icon SIZE (e.g. 128), and output file PATH. The command should generate a
thumbnail from these values. E.g. if the MIME is video, a script could
use: ffmpegthumbnailer -iURL -oPATH -sSIZE.
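A thumbnailer script receiving the four appended arguments might be organized as sketched below. The function names and exit codes are hypothetical choices; the ffmpegthumbnailer invocation follows the example in the text, and nothing here is taken from an actual Recoll script.

```python
import subprocess
from typing import List, Optional


def build_thumbnail_command(url: str, mime: str, size: str,
                            out_path: str) -> Optional[List[str]]:
    """Return the command to run for this MIME type, or None if unsupported."""
    if mime.startswith("video/"):
        return ["ffmpegthumbnailer", "-i" + url, "-o" + out_path, "-s" + size]
    return None


def main(argv: List[str]) -> int:
    """argv is: script name, URL, MIME type, SIZE, PATH (in that order)."""
    if len(argv) != 5:
        return 2  # wrong number of arguments
    command = build_thumbnail_command(*argv[1:5])
    if command is None:
        return 1  # no thumbnailer known for this MIME type
    return subprocess.run(command).returncode
```

A real script would end with sys.exit(main(sys.argv)), and thumbnailercmd would be set to the script's path.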
- stemexpandphrases
= bool
- Default to applying stem expansion to phrase terms. Recoll normally does
not apply stem expansion to terms inside phrase searches. Setting this
parameter will change the default behaviour to expanding terms inside
phrases. If set, you can use a 'l' modifier to disable expansion for a
specific instance.
- autoSpellRarityThreshold
= int
- Inverse of the ratio of term occurrence to total db terms over which we
look for spell neighbours for automatic query expansion. When a term is
very uncommon, we may (depending on user choice) look for spelling
variations which would be more common and possibly add them to the
query.
- autoSpellSelectionThreshold
= int
- Ratio of spell neighbour frequency over user input term frequency beyond
which we include the neighbour in the query. When a term has been selected
for spelling expansion because of its rarity, we only include spelling
neighbours which are more common by this ratio.
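The interplay of the two thresholds can be sketched as follows. This is one reading of the descriptions above, not Recoll's implementation: a term counts as rare when total terms divided by its occurrence count exceeds autoSpellRarityThreshold, and a neighbour is kept when it is more frequent than the term by more than autoSpellSelectionThreshold. Function names and sample numbers are hypothetical, and treating a term with no occurrences as rare is an assumption.

```python
from typing import Dict, List


def is_rare(term_count: int, total_terms: int, rarity_threshold: int) -> bool:
    """True if the inverse occurrence ratio exceeds the rarity threshold."""
    if term_count <= 0:
        return True  # assumption: an absent term is treated as rare
    return total_terms / term_count > rarity_threshold


def select_neighbours(term_count: int, neighbours: Dict[str, int],
                      selection_threshold: int) -> List[str]:
    """Keep neighbours whose frequency ratio to the term exceeds the threshold."""
    base = max(term_count, 1)  # avoid dividing by zero for absent terms
    return [word for word, count in neighbours.items()
            if count / base > selection_threshold]
```

For example, with 1,000,000 total terms and a rarity threshold of 200,000, a term seen twice is rare (ratio 500,000) while one seen 100 times is not (ratio 10,000).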
- kioshowsubdocs
= bool
- Show embedded document results in KDE dolphin/kio and krunner. Embedded
documents may clutter the results and are not always easily usable from
the kio or krunner environment. Setting this variable will restrict the
results to standalone documents.
- pdfocr =
bool
- Attempt OCR of PDF files with no text content. This can be defined in
subdirectories. The default is off because OCR is so very slow.
- pdfoutline =
bool
- Extract outlines and bookmarks from PDF documents (needs pdftohtml). This
is not enabled by default because it is rarely needed, and the extra
command takes a little time.
- pdfattach =
bool
- Enable PDF attachment extraction by executing pdftk (if available). This
is normally disabled, because it does slow down PDF indexing a bit even if
not one attachment is ever found.
- Extract text from selected XMP metadata tags. This is a space-separated
list of qualified XMP tag names. Each element can also include a
translation to a Recoll field name, separated by a '|' character. If the
second element is absent, the tag name is used as the Recoll field name.
You will also need to add specifications to the "fields" file to
direct processing of the extracted data.
- Define name of XMP field editing script. This defines the name of a script
to be loaded for editing XMP field values. The script should define a
'MetaFixer' class with a metafix() method which will be called with the
qualified tag name and value of each selected field, for editing or
erasing. A new instance is created for each document, so that the object
can keep state for, e.g. eliminating duplicate values.
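A minimal editing script in the spirit of the description above might look like the following sketch. The class and method names come from the text; the argument names, the return convention (returning the edited value, with an empty string meaning "erase") and the duplicate-elimination policy are assumptions made for illustration.

```python
class MetaFixer:
    """One instance per document, so per-document state can be kept."""

    def __init__(self):
        # (tag name, value) pairs already seen in the current document.
        self.seen = set()

    def metafix(self, tagname, value):
        """Return the cleaned value, or '' to drop a duplicate (assumed convention)."""
        value = value.strip()
        key = (tagname, value)
        if key in self.seen:
            return ""
        self.seen.add(key)
        return value
```

Because a new instance is created for each document, the set of seen values starts empty for every file.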
- ocrprogs =
string
- OCR modules to try. The top OCR script will try to load the corresponding
modules in order and use the first which reports being capable of
performing OCR on the input file. Modules for tesseract (tesseract) and
ABBYY FineReader (abbyy) are present in the standard distribution. For
compatibility with the previous version, if this is not defined at all,
the default value is "tesseract". Use an explicit empty value if
needed. A value of "abbyy tesseract" will try everything.
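The difference between an undefined and an explicitly empty ocrprogs value, and the "first capable module wins" rule, can be sketched as below. The configuration is modelled as a plain dict and the capability check as a caller-supplied function; both are hypothetical simplifications, not the OCR script's real interfaces.

```python
from typing import Callable, List, Mapping, Optional


def ocr_module_names(config: Mapping[str, str]) -> List[str]:
    """Return the module names to try, in order."""
    if "ocrprogs" not in config:
        # Not defined at all: compatibility default.
        return ["tesseract"]
    # Defined, possibly empty: an empty value yields no modules.
    return config["ocrprogs"].split()


def pick_module(names: List[str],
                can_handle: Callable[[str], bool]) -> Optional[str]:
    """Return the first module reporting it can OCR the input, else None."""
    for name in names:
        if can_handle(name):
            return name
    return None
```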
- ocrcachedir =
dfn
- Location for caching OCR data. The default if this is empty or undefined
is to store the cached OCR data under $RECOLL_CONFDIR/ocrcache.
- tesseractlang
= string
- Language to assume for tesseract OCR. Important for improving the OCR
accuracy. This can also be set through the contents of a file in the
currently processed directory. See the rclocrtesseract.py script. Example
values: eng, fra... See the tesseract documentation.
- tesseractcmd
= fn
- Path for the tesseract command. Do not quote. This is mostly useful on
Windows, or for specifying a non-default tesseract command. E.g. on
Windows:
tesseractcmd = C:/ProgramFiles(x86)/Tesseract-OCR/tesseract.exe
- abbyylang =
string
- Language to assume for abbyy OCR. Important for improving the OCR
accuracy. This can also be set through the contents of a file in the
currently processed directory. See the rclocrabbyy.py script. Typical
values: English, French... See the ABBYY documentation.
- abbyyocrcmd =
fn
- Path for the abbyy command. The ABBYY directory is usually not in the path,
so you should set this.
- speechtotext
= string
- Activate speech to text conversion. The only possible value at the moment
is "whisper" for using the OpenAI whisper program.
- sttmodel =
string
- Name of the whisper model.
- sttdevice =
string
- Name of the device to be used by whisper.
- orgmodesubdocs
= bool
- Index org-mode level 1 sections as separate sub-documents. This is the
default. If set to false, org-mode files will be indexed as plain
text.
- mhmboxquirks
= string
- Enable thunderbird/mozilla-seamonkey mbox format quirks. Set this for the
directory where the email mbox files are stored.