Raptor-layout - A fast and space-efficient pre-filter for querying
very large collections of nucleotide sequences.
Computes an HIBF layout that tries to minimize the disk space
consumption of the resulting index. The space is estimated using a k-mer
count per user bin, which represents the potential density in a technical
bin in an interleaved Bloom filter. You can pass the resulting layout to
raptor (https://github.com/seqan/raptor) to build the index and conduct
queries.
- --input-file
(std::filesystem::path)
- The input must be a file containing the paths to the sequence files you
wish to include in the layout; one file path per line. If the file also
contains auxiliary information (e.g. species IDs), it must be tab-separated.
- Example file:
  ```
  /absolute/path/to/file1.fasta
  /absolute/path/to/file2.fa.gz
  ```
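As a rough illustration of the input format described above, the following Python sketch writes such a list file with one path per line and an optional tab-separated extra column. The helper name `write_input_list`, the output file name, and the species labels are hypothetical and only serve the example; the two paths are the placeholders from the example file.

```python
import tempfile
from pathlib import Path

def write_input_list(entries, out_path):
    """Write one sequence file path per line.

    Each entry is a tuple: the path first, then any auxiliary fields.
    Fields are joined with tabs, as required when extra columns are present.
    """
    with open(out_path, "w") as handle:
        for fields in entries:
            handle.write("\t".join(fields) + "\n")

# Placeholder paths from the example file; the species IDs are made up.
entries = [
    ("/absolute/path/to/file1.fasta", "species_A"),
    ("/absolute/path/to/file2.fa.gz", "species_B"),
]

list_file = Path(tempfile.mkdtemp()) / "input_files.tsv"
write_input_list(entries, list_file)
print(list_file.read_text(), end="")
```

The resulting file is what would be passed to --input-file.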
- --kmer-size
(unsigned 8 bit integer)
- The k-mer size influences the size estimates of the input. Choosing a
k-mer size that is too small for your data will make files appear more
similar than they really are. Conversely, a k-mer size that is too large
might miss certain similarities. For DNA sequences, a k-mer size in the
range [16,32] has proven to work well. Default: 19.
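The effect of the k-mer size on apparent similarity can be illustrated with a small Python sketch that compares the k-mer sets of two sequences using the Jaccard index. This is only a toy illustration, not how raptor-layout computes its estimates; the two sequences and the function names are made up for the example.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(seq_a, seq_b, k):
    """Jaccard index of the two k-mer sets: |intersection| / |union|."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Two made-up, unrelated toy sequences.
seq_a = "ACGTACGTTAGCCGATAGCTAGGCTAACGT"
seq_b = "TTGACCGTAGGATCCAGTACGGATTCAGCA"

for k in (1, 2, 4, 8):
    print(f"k={k}: Jaccard = {jaccard(seq_a, seq_b, k):.2f}")
```

With k=1 both sequences contain all four nucleotides and therefore look identical, whereas larger k values separate them. Real data needs a much larger k than these toy values, as noted in the option description.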
- --num-hash-functions
(unsigned 64 bit integer)
- The number of hash functions to use when building the HIBF from the
resulting layout. This parameter is needed to correctly estimate the index
size when computing the layout. Default: 2.
- --false-positive-rate
(double)
- The false positive rate you aim for when building the HIBF from the
resulting layout. This parameter is needed to correctly estimate the index
size when computing the layout. Default: 0.05.
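To see why the number of hash functions and the false positive rate are needed for the size estimate, consider the standard Bloom filter sizing formula. The Python sketch below assumes this textbook formula, m = -h * n / ln(1 - p^(1/h)), for n k-mers, h hash functions, and false positive rate p; it is not taken from the raptor-layout source, and the function name and the k-mer count are hypothetical.

```python
import math

def bloom_filter_bits(num_kmers, fpr=0.05, num_hash=2):
    """Smallest number of bits m such that a Bloom filter holding num_kmers
    elements with num_hash hash functions has a false positive rate of at
    most fpr, using the approximation fpr = (1 - exp(-h * n / m)) ** h.
    """
    return math.ceil(-num_hash * num_kmers / math.log(1.0 - fpr ** (1.0 / num_hash)))

n = 1_000_000  # hypothetical k-mer count of one bin
for fpr in (0.05, 0.01):
    for h in (2, 3):
        bits = bloom_filter_bits(n, fpr, h)
        print(f"fpr={fpr}, hashes={h}: {bits / 8 / 1024:.0f} KiB")
```

Lowering the false positive rate increases the required size, which is why the values used for the layout should match the ones used when building the index.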
- --output-filename
(std::filesystem::path)
- A file name for the resulting layout. Default:
"binning.out".
- --threads
(unsigned 64 bit integer)
- The number of threads to use. Currently, only merging of sketches is
parallelized, so if the flag --disable-rearrangement is set, --threads
will have no effect. Default: 1. Value must be in range
[1,18446744073709551615].
To improve the layout, you can estimate the sequence similarities
using HyperLogLog sketches [1].
- --disable-estimate-union
- The sketches are used to estimate the sequence similarity among a set of
user bins. This improves the layout computation because merging user bins
that do not increase technical bin sizes is preferred. Union estimation may
use more RAM and can be disabled in RAM-critical environments. Attention:
Setting this flag also disables rearrangement, which depends on the union
estimates.
- --disable-rearrangement
- As a preprocessing step, rearranging the order of the given user bins
based on their sequence similarity may lead to favourably small unions and
thus a smaller index. Depending on the number of input samples (user
bins), this may be time-consuming and can thus be disabled if a suboptimal
layout is sufficient.
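The idea behind union estimation with sketches can be sketched in a few lines of Python. The following is a minimal, simplified HyperLogLog written for illustration only: it is not the implementation used by raptor-layout, and the class name, the hash function (BLAKE2b), the precision, and the synthetic k-mer names are all assumptions. Merging two sketches by taking the register-wise maximum yields a sketch of the union, so the union size can be estimated without holding both k-mer sets in memory.

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HyperLogLog with 2**precision registers."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item.
        h = int.from_bytes(hashlib.blake2b(item.encode(), digest_size=8).digest(), "big")
        idx = h >> (64 - self.p)                       # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Return a sketch of the union: register-wise maximum."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias correction, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # linear counting for small sets
        return raw

# Two synthetic "user bins" of 20,000 items each that share 10,000 items.
bin_a, bin_b = HyperLogLog(), HyperLogLog()
for i in range(20_000):
    bin_a.add(f"kmer{i}")
for i in range(10_000, 30_000):
    bin_b.add(f"kmer{i}")

union_estimate = bin_a.merge(bin_b).estimate()
print(f"estimated union size: {union_estimate:.0f} (true size: 30000)")
```

A union estimate that is clearly smaller than the sum of the two individual estimates indicates overlapping content, which is the kind of information that makes merging similar user bins attractive.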
[1] Philippe Flajolet, Éric Fusy, Olivier Gandouet,
Frédéric Meunier. HyperLogLog: the analysis of a near-optimal
cardinality estimation algorithm. AofA: Analysis of Algorithms, Jun 2007,
Juan-les-Pins, France. pp. 137-156. hal-00406166v2,
https://doi.org/10.46298/dmtcs.3545
Last update: Unavailable
Raptor-layout version: 3.0.1 (commit unavailable)
Sharg version: 1.1.1
SeqAn version: 3.3.0-rc.2
https://github.com/seqan/raptor
Raptor-layout Copyright: BSD 3-Clause License
Author: Svenja Mehringer
Contact: svenja.mehringer@fu-berlin.de
SeqAn Copyright: 2006-2023 Knut Reinert, FU-Berlin; released under the
3-clause BSDL.
In your academic works please cite: Raptor: A fast and space-efficient
pre-filter for querying very large collections of nucleotide sequences;
Enrico Seiler, Svenja Mehringer, Mitra Darvish, Etienne Turc, and Knut
Reinert; iScience 2021 24 (7): 102782. doi:
https://doi.org/10.1016/j.isci.2021.102782
For full copyright and/or warranty information see --copyright.