cmcalibrate - fit exponential tails for covariance model E-value
determination
cmcalibrate [options] cmfile
cmcalibrate determines exponential tail parameters for
E-value determination by generating random sequences, searching them with
the CM and collecting the scores of the resulting hits. A histogram of the
bit scores of the hits is fit to an exponential tail, and the parameters of
the fitted tail are saved to the CM file. The exponential tail parameters
are then used to estimate the statistical significance of hits found in
cmsearch and cmscan.
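The tail-fitting idea can be illustrated with a minimal sketch. This is not Infernal's implementation: it assumes that scores above a threshold mu follow P(S >= x) = exp(-lambda * (x - mu)), estimates lambda by maximum likelihood from the highest-scoring hits, and uses simulated scores in place of real search results. The function names and the simulated data are illustrative assumptions.

```python
import math
import random


def fit_exponential_tail(scores, tail_n):
    """Fit an exponential tail to the tail_n highest scores.

    Returns (mu, lam): mu is the threshold score where the tail starts and
    lam is the maximum-likelihood rate of the excess over mu.
    """
    ranked = sorted(scores, reverse=True)
    if not 0 < tail_n < len(ranked):
        raise ValueError("tail_n must be between 1 and len(scores) - 1")
    mu = ranked[tail_n]                        # first score below the tail
    excess = [s - mu for s in ranked[:tail_n]]
    lam = tail_n / sum(excess)                 # MLE: 1 / mean excess
    return mu, lam


def tail_survival(score, mu, lam):
    """Fraction of tail hits expected to score at least `score` (score >= mu)."""
    return math.exp(-lam * (score - mu))


# Simulated bit scores standing in for hits to random sequence (assumption).
rng = random.Random(181)
simulated = [rng.expovariate(math.log(2)) for _ in range(5000)]
mu, lam = fit_exponential_tail(simulated, tail_n=1000)
print(f"mu={mu:.2f} lambda={lam:.3f}")
```

In this sketch a fitted lambda near the natural log of 2 means that each additional bit of score roughly halves the fraction of random hits scoring that high.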
A CM file must be calibrated with cmcalibrate before it can
be used in cmsearch or cmscan, with a single exception: it is
not necessary to calibrate CM files that include only models with zero
basepairs before running cmsearch.
cmcalibrate is very slow. It takes a couple of hours to
calibrate a single average-sized CM on a single CPU. cmcalibrate will
run in parallel on four cores if Infernal was built on a system that
supports POSIX threading (see the Installation section of the user guide for
more information) and that system has at least 4 cores. Using
<n> cores will result in roughly <n>-fold
acceleration versus a single CPU. You can specify the number of cores
to use with the --cpu <n> option. MPI
(Message Passing Interface) can also be used for parallelization with the
--mpi option if Infernal was built with MPI enabled, but using more
than 161 processors is not recommended because increasing past 161 won't
accelerate the calibration. See the Installation section of the user guide
for more information.
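As a rough worked example of the scaling described above, the following sketch assumes ideal <n>-fold acceleration and treats 161 as the point past which additional MPI processors give no benefit. The two-hour single-CPU figure and the helper name estimated_hours are illustrative assumptions, not measurements.

```python
MPI_USEFUL_LIMIT = 161  # beyond this, more MPI processors give no speedup


def estimated_hours(single_cpu_hours, workers, mpi=False):
    """Rough wall-clock estimate assuming ideal n-fold acceleration."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    effective = min(workers, MPI_USEFUL_LIMIT) if mpi else workers
    return single_cpu_hours / effective


print(estimated_hours(2.0, 4))               # four worker threads
print(estimated_hours(2.0, 200, mpi=True))   # capped at 161 processors
```

For a real estimate on a given machine, --forecast is the better tool.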
The --forecast option can be used to estimate how long the
program will take to run for a given cmfile on the current machine.
To predict the running time on <n> processors with MPI,
additionally use the --nforecast <n> option.
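A forecast could be scripted along these lines. The sketch only assembles an argument list from options documented on this page (--forecast and --nforecast); the helper name forecast_command and the file name my.cm are hypothetical, and the command is run only if cmcalibrate is on the PATH and the file exists.

```python
import os
import shutil
import subprocess


def forecast_command(cmfile, nprocs=None):
    """Build the argument list for a running-time forecast of `cmfile`."""
    cmd = ["cmcalibrate", "--forecast"]
    if nprocs is not None:
        # --nforecast is used together with --forecast.
        cmd += ["--nforecast", str(nprocs)]
    cmd.append(cmfile)
    return cmd


cmd = forecast_command("my.cm", nprocs=16)   # "my.cm" is a placeholder name
print(" ".join(cmd))
if shutil.which("cmcalibrate") and os.path.exists("my.cm"):
    subprocess.run(cmd, check=False)         # prints the forecast and exits
```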
Some large models require a lot of memory to calibrate. You can
determine how much memory is required with the --memreq option. For
these models, you may be limited by the available RAM on your system.
Another strategy for parallelization that can be useful when a lot of memory
is required per core is to split the calibration into <n>
separate computations or partitions, each of which can be performed
separately, potentially in parallel if you have access to a computer
cluster. The results from each computation can then be merged together for
the final calibration. To do this, first run cmcalibrate with the
--split, --ptot <n> and --cfile
<f> options, which will save the <n> separate
partition commands into the file <f>. After all of these
commands have been executed, you can then combine the results and create a
calibrated model file by calling cmcalibrate again with the --merge and
--ptot <n> options. See the "Parallelizing
calibration of large models by splitting into partitions" subsection of
the tutorial in the user's guide for more information.
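The three stages of a partitioned calibration can be sketched as follows. The helper names and the file names my.cm and my.cmds are hypothetical; only options documented on this page are used. The sketch assumes the command file holds one shell command per line, which should be checked against the file cmcalibrate actually writes.

```python
def split_command(cmfile, ptot, cfile):
    """Stage 1: write the per-partition commands to `cfile`."""
    return ["cmcalibrate", "--split", "--ptot", str(ptot),
            "--cfile", cfile, cmfile]


def partition_commands(cfile_text):
    """Stage 2: commands to execute, assuming one command per line."""
    return [ln.strip() for ln in cfile_text.splitlines() if ln.strip()]


def merge_command(cmfile, ptot):
    """Stage 3: combine partition results into a calibrated CM file."""
    return ["cmcalibrate", "--merge", "--ptot", str(ptot), cmfile]


# "my.cm" and "my.cmds" are placeholder names.
print(" ".join(split_command("my.cm", 8, "my.cmds")))
print(" ".join(merge_command("my.cm", 8)))
```

In practice, prefer the --merge command that the --split run prints to standard output over a reconstructed one, since it reflects the options actually used.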
The random sequences searched in cmcalibrate are generated
by an HMM that was trained on real genomic sequences with various GC
contents. The goal is to have the GC distributions in the random sequences
be similar to those in actual genomic sequences.
Four rounds of searches and subsequent exponential tail fits are
performed, one each for the four different CM algorithms that can be used in
cmsearch and cmscan: glocal CYK, glocal Inside, local CYK and
local Inside.
The E-value parameters determined by cmcalibrate are only
used by the cmsearch and cmscan programs. If you are not going
to use these programs then do not waste time calibrating your models.
- -h
- Help; print a brief reminder of command line usage and available options.
- -L <x>
- Set the total length of random sequences to search to <x>
megabases (Mb). By default, <x> is 1.6 Mb. Increasing
<x> will make the exponential tail fits more precise and
E-values more accurate, but will take longer (doubling <x>
will roughly double the running time). Decreasing <x> is not
recommended as it will make the fits less precise and the E-values less
accurate.
- --forecast
- Predict the running time of the calibration of cmfile (with
provided options) on the current machine and exit. The calibration is not
performed. The predictions should be considered rough estimates. If
multithreading is enabled (see Installation section of user guide), the
timing will take into account the number of available cores.
- --nforecast <n>
- With --forecast, specify that <n> processors will be
used for the calibration. This might be useful for predicting the running
time of an MPI run with <n> processors.
- --memreq
- Predict the amount of required memory for calibrating cmfile (with
provided options) on the current machine and exit. The calibration is not
performed.
- --gtailn <x>
- Fit the exponential tail for glocal Inside and glocal CYK to the
<n> highest scores in the histogram tail, where
<n> is <x> times the number of Mb searched. The
default value of <x> is 250. The value 250 was chosen because
it works well empirically relative to other values.
- --ltailn <x>
- Fit the exponential tail for local Inside and local CYK to the
<n> highest scores in the histogram tail, where
<n> is <x> times the number of Mb searched. The
default value of <x> is 750. The value 750 was chosen because
it works well empirically relative to other values.
- --tailp <x>
- Ignore the --gtailn and --ltailn prefixed options and fit
the <x> fraction tail of the histogram to an exponential
tail, for all search modes.
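As a worked example of the --gtailn and --ltailn arithmetic, the sketch below computes the number of top-scoring hits used for each fit from the defaults stated on this page. Rounding to the nearest integer and the helper name tail_hits are assumptions; cmcalibrate may round differently.

```python
DEFAULT_MB = 1.6        # default -L, in megabases
DEFAULT_GTAILN = 250    # default --gtailn, hits per Mb
DEFAULT_LTAILN = 750    # default --ltailn, hits per Mb


def tail_hits(per_mb, megabases=DEFAULT_MB):
    """Number of highest scores fit: <x> hits per Mb searched (rounded)."""
    return round(per_mb * megabases)


print("glocal fits:", tail_hits(DEFAULT_GTAILN))
print("local fits:", tail_hits(DEFAULT_LTAILN))
print("glocal fits with -L 3.2:", tail_hits(DEFAULT_GTAILN, megabases=3.2))
```

Because the count scales with the number of Mb searched, raising -L enlarges the fitted tail proportionally.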
- --hfile <f>
- Save the histograms fit to file <f>. The format of this file
is two space delimited columns per line. The first column is the x-axis
values of bit scores of each bin. The second column is the y-axis values
of number of hits per bin. Each series is delimited by a line with a
single character "&". The file will contain one series for
each of the four exponential tail fits in the following order: glocal CYK,
glocal Inside, local CYK, and local Inside.
- --sfile <f>
- Save survival plot information to file <f>. The format of
this file is two space delimited columns per line. The first column is the
x-axis values of bit scores of each bin. The second column is the y-axis
values of fraction of hits that meet or exceed the score for each bin.
Each series is delimited by a line with a single character
"&". The file will contain three series of data for each of
the four CM search modes in the following order: glocal CYK, glocal
Inside, local CYK, and local Inside. The first series is the empirical
survival plot from the histogram of hits to the random sequence. The
second series is the exponential tail fit to the empirical distribution.
The third series is the exponential tail fit if lambda were fixed and set
as the natural log of 2 (0.69314718).
- --qqfile <f>
- Save quantile-quantile plot information to file <f>. The
format of this file is two space delimited columns per line. The first
column is the x-axis values, and the second column is the y-axis values.
The distance of the points from the identity line (y=x) is a measure of
how good the exponential tail fit is: the closer the points are to the
identity line, the better the fit. Each series is delimited by a line
with a single character "&". The file will contain one
series of empirical data for each of the four exponential tail fits in the
following order: glocal CYK, glocal Inside, local CYK and local Inside.
- --ffile <f>
- Save space delimited statistics of different exponential tail fits to file
<f>. The file will contain the lambda and mu values for
exponential tails fit to histogram tails of different sizes. The fields in
the file are labelled informatively.
- --xfile <f>
- Save a list of the scores in each fit histogram tail to file
<f>. Each line of this file holds one score, indicating that
one hit with that score exists in the tail. Each series is
delimited by a line with a single character "&". The file
will contain one series for each of the four exponential tail fits in the
following order: glocal CYK, glocal Inside, local CYK, and local Inside.
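The "&"-delimited layout shared by the --hfile, --sfile, --qqfile and --xfile outputs can be read with a short parser such as the sketch below. The function name parse_series is hypothetical and the sample numbers are made up for illustration; whether the last series is followed by a final "&" line is not stated here, so the parser accepts both.

```python
def parse_series(text):
    """Split '&'-delimited text into a list of series.

    Each series is a list of tuples of floats, one tuple per data line
    (two values for --hfile, --sfile and --qqfile; one for --xfile).
    """
    series, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "&":
            series.append(current)
            current = []
        else:
            current.append(tuple(float(tok) for tok in line.split()))
    if current:                  # tolerate a missing final "&"
        series.append(current)
    return series


MODES = ["glocal CYK", "glocal Inside", "local CYK", "local Inside"]

# Made-up numbers for illustration only, in the --hfile layout.
example = "10.0 5\n11.0 2\n&\n9.5 7\n&\n8.0 3\n&\n7.5 1\n&\n"
for mode, points in zip(MODES, parse_series(example)):
    print(mode, points)
```

For --sfile output, which holds three series per search mode, the parsed list would be consumed in groups of three in the same mode order.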
- --split
- Prepare a partitioned calibration. This option only works in combination
with the --ptot <n> and --cfile
<f> options, and will prepare a calibration split into
<n> separate partitions. The commands to run all of the
partitions will be in the file <f>.
- --cfile <f>
- With --split, save the commands for all partitions to file
<f>.
- --proot <s>
- With --split, specify that the per-partition scores files be named
<s>.<n> where <n> is the partition index.
By default they will be named <s>.calib.<n> where
<s> is the name of the CM file to be calibrated (including
path).
- --part <n>
- Specify that this is partition <n> out of <n2>
from --ptot <n2>. Must be used in combination with
--ptot and --pfile.
- --ptot <n>
- With --split, --part or --merge, specify that there are
<n> total partitions.
- --pfile <f>
- With --part, specify that scores for this partition be saved to
file <f>.
- --merge
- Merge scores from multiple previously executed partitions and calibrate
CMs. If you used the option --proot <s> with
cmcalibrate when you ran it with --split to set up the
partitions, use --proot <s> again with --merge.
The full cmcalibrate --merge command to use will have been output
to standard output when the initial cmcalibrate --split command was
executed.
- --seed <n>
- Seed the random number generator with <n>, an integer >=
0. If <n> is nonzero, stochastic simulations will be
reproducible; the same command will give the same results. If
<n> is 0, the random number generator is seeded arbitrarily,
and stochastic simulations will vary from run to run of the same command.
The default seed is 181.
- --beta <x>
- By default query-dependent banding (QDB) is used to accelerate the CM
search algorithms with a beta tail loss probability of 1E-15. This beta
value can be changed to <x> with --beta
<x>. The beta parameter is the amount of probability mass
excluded during band calculation; higher values of beta give greater
speedups but sacrifice more accuracy than lower values. (For more
information on QDB see Nawrocki and Eddy, PLoS
Computational Biology 3(3): e56.)
- --nonbanded
- Turn off QDB during E-value calibration. This will slow down calibration.
- --nonull3
- Turn off the null3 post hoc additional null model. This is not recommended
unless you plan on using the same option to cmsearch and/or
cmscan.
- --random
- Use the background null model of the CM to generate the random sequences,
instead of the more realistic HMM. Unless the CM was built using the
--null option to cmbuild, the background null model will be
25% each A, C, G and U.
- --gc <f>
- Generate the random sequences using the nucleotide distribution from the
sequence file <f>.
- --cpu <n>
- Set the number of parallel worker threads to <n>. On
multicore machines, the default is 4. You can also control this number by
setting an environment variable, INFERNAL_NCPU. There is also a
master thread, so the actual number of threads that Infernal spawns is
<n>+1. This option is not available if Infernal was compiled
with POSIX threads support turned off.
- --mpi
- Run as an MPI parallel program. This option will only be available if
Infernal has been configured and built with the "--enable-mpi"
flag (see the Installation section of the user guide for more
information).
See infernal(1) for a master man page with a list of all
the individual man pages for programs in the Infernal package.
For complete documentation, see the user guide that came with your
Infernal distribution (Userguide.pdf); or see the Infernal web page
(http://eddylab.org/infernal/).
Copyright (C) 2023 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.
For additional information on copyright and licensing, see the
file called COPYRIGHT in your Infernal source distribution, or see the
Infernal web page (http://eddylab.org/infernal/).