NAME

Raptor-layout - A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences.

DESCRIPTION

Computes an HIBF (hierarchical interleaved Bloom filter) layout that tries to minimize the disk space consumption of the resulting index. The space is estimated using a k-mer count per user bin, which represents the potential density of a technical bin in an interleaved Bloom filter. You can pass the resulting layout to raptor (https://github.com/seqan/raptor) to build the index and conduct queries.
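
As a rough illustration of the "k-mer count per user bin" mentioned above, the following Python sketch counts the distinct k-mers of each user bin exactly. It is a toy under stated assumptions: this page does not describe how Raptor-layout counts k-mers internally, the function name `count_distinct_kmers` and the example bins are hypothetical, and a small k of 4 is used only to keep the example readable.

```python
def count_distinct_kmers(sequences, k=19):
    """Return the number of distinct k-mers over all sequences of one user bin.

    Hypothetical helper for illustration; exact counting with a Python set,
    not the method used by the tool.
    """
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of length k over the sequence; sequences shorter
        # than k contribute nothing because the range is empty.
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return len(kmers)


# Hypothetical user bins: each maps a name to its list of sequences.
user_bins = {
    "bin_a": ["ACGTACGTAC"],
    "bin_b": ["ACGTACGTAC", "TTTTTTTTTT"],
}

counts = {name: count_distinct_kmers(seqs, k=4) for name, seqs in user_bins.items()}
```

A user bin with more distinct k-mers needs more space in its technical bin, which is why these counts drive the layout.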

OPTIONS

Main options:

--input-file (std::filesystem::path): The input must be a file containing the paths to the sequence data you wish to estimate, one file path per line. If the file contains auxiliary information (e.g. species IDs), it must be tab-separated.
Example file:
```
/absolute/path/to/file1.fasta
/absolute/path/to/file2.fa.gz
```
--kmer-size (unsigned 8 bit integer): The k-mer size influences the size estimates of the input. Choosing a k-mer size that is too small for your data will make files appear more similar than they really are. Likewise, a k-mer size that is too large might miss certain similarities. For DNA sequences, a k-mer size in the range [16,32] has proven to work well. Default: 19.
--num-hash-functions (unsigned 64 bit integer): The number of hash functions to use when building the HIBF from the resulting layout. This parameter is needed to correctly estimate the index size when computing the layout. Default: 2.
--false-positive-rate (double): The false positive rate you aim for when building the HIBF from the resulting layout. This parameter is needed to correctly estimate the index size when computing the layout. Default: 0.05.
--output-filename (std::filesystem::path): A file name for the resulting layout. Default: "binning.out".
--threads (unsigned 64 bit integer): The number of threads to use. Currently, only merging of sketches is parallelized, so if the flag --disable-rearrangement is set, --threads will have no effect. Default: 1. Value must be in range [1,18446744073709551615].
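
To see why --num-hash-functions and --false-positive-rate enter the size estimate, the following Python sketch applies the textbook Bloom filter sizing formula m = -h*n / ln(1 - p^(1/h)) for n k-mers, h hash functions, and false positive rate p. It is a sketch under assumptions, not the tool's actual estimator: the function names are hypothetical, and it assumes that every technical bin of a single interleaved Bloom filter has the size required by its largest bin.

```python
import math


def bloom_filter_bits(num_kmers, num_hash_functions=2, false_positive_rate=0.05):
    """Bits needed for one Bloom filter (textbook formula, hypothetical helper)."""
    if num_kmers == 0:
        return 0
    # Solve p = (1 - exp(-h*n/m))^h for m.
    denominator = math.log(1.0 - false_positive_rate ** (1.0 / num_hash_functions))
    return math.ceil(-num_hash_functions * num_kmers / denominator)


def ibf_bits(kmer_counts_per_technical_bin, num_hash_functions=2, false_positive_rate=0.05):
    """Bits for one interleaved Bloom filter.

    Assumption: all technical bins share the size of the largest one.
    """
    largest = max(kmer_counts_per_technical_bin)
    bin_bits = bloom_filter_bits(largest, num_hash_functions, false_positive_rate)
    return len(kmer_counts_per_technical_bin) * bin_bits


# One large bin forces the small bins to the same size.
unbalanced = ibf_bits([1000, 10, 10])
balanced = ibf_bits([340, 340, 340])
```

Under this assumption, a layout with evenly filled technical bins needs less space than an unbalanced one holding the same total number of k-mers, which is what the layout computation tries to exploit.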

HyperLogLog Sketches:

To improve the layout, you can estimate the sequence similarities using HyperLogLog sketches.

--disable-estimate-union: The sketches are used to estimate the sequence similarity among a set of user bins. This improves the layout computation because merging user bins that do not increase technical bin sizes is preferred. Estimating unions may use more RAM and can be disabled in RAM-critical environments. Attention: This also disables rearrangement, which depends on union estimations.
--disable-rearrangement: As a preprocessing step, rearranging the order of the given user bins based on their sequence similarity may lead to favourably small unions and thus a smaller index. Depending on the number of input samples (user bins), this may be time-consuming and can thus be disabled if a suboptimal layout is sufficient.
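
The idea behind union estimation and rearrangement can be sketched in Python with exact sets. This is only an illustration under assumptions: the tool uses HyperLogLog sketches rather than exact sets, its actual rearrangement algorithm is not described on this page, and `union_size`, `jaccard`, and the greedy `similarity_order` heuristic are hypothetical stand-ins.

```python
def union_size(kmer_sets):
    """Exact number of distinct k-mers when the given user bins are merged."""
    return len(set().union(*kmer_sets))


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_order(bins):
    """Greedy ordering: always append the bin most similar to the last one.

    Hypothetical heuristic for illustration, not the tool's algorithm.
    """
    remaining = sorted(bins)  # deterministic starting point
    order = [remaining.pop(0)]
    while remaining:
        last = bins[order[-1]]
        best = max(remaining, key=lambda name: jaccard(last, bins[name]))
        remaining.remove(best)
        order.append(best)
    return order


# Hypothetical user bins, given as sets of 4-mers.
bins = {
    "a": {"ACGT", "CGTA", "GTAC", "TACG"},
    "b": {"GGGG", "GGGC", "GGCC", "GCCC"},
    "c": {"ACGT", "CGTA", "GTAC", "TTTT"},
}

order = similarity_order(bins)
similar_union = union_size([bins["a"], bins["c"]])
dissimilar_union = union_size([bins["a"], bins["b"]])
```

Merging the similar bins "a" and "c" yields a smaller union than merging "a" and "b", so placing similar user bins next to each other gives the layout more opportunities for small merged bins.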

Parameter Tweaking:

Special options

REFERENCES

[1] Philippe Flajolet, Éric Fusy, Olivier Gandouet, Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. AofA: Analysis of Algorithms, Jun 2007, Juan les Pins, France. pp.137-156. hal-00406166v2, https://doi.org/10.46298/dmtcs.3545

Common options

-h, --help: Prints the help page.
-hh, --advanced-help: Prints the help page including advanced options.
--version: Prints the version information.
--copyright: Prints the copyright/license information.
--export-help (std::string): Export the help page information. Value must be one of [html, man, ctd, cwl].

VERSION

Last update: Unavailable
Raptor-layout version: 3.0.1 (commit unavailable)
Sharg version: 1.1.1
SeqAn version: 3.3.0-rc.2

URL

https://github.com/seqan/raptor

LEGAL

Raptor-layout Copyright: BSD 3-Clause License
Author: Svenja Mehringer
Contact: svenja.mehringer@fu-berlin.de
SeqAn Copyright: 2006-2023 Knut Reinert, FU-Berlin; released under the 3-clause BSDL.
In your academic works please cite: Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences; Enrico Seiler, Svenja Mehringer, Mitra Darvish, Etienne Turc, and Knut Reinert; iScience 2021 24 (7): 102782. doi: https://doi.org/10.1016/j.isci.2021.102782
For full copyright and/or warranty information see --copyright.