gsnap - Genomic Short-read Nucleotide Alignment Program
gsnap [OPTIONS...] <FASTA file>, or cat
<FASTA file> | gsnap [OPTIONS...]
- -D,
--dir=directory
- Genome directory. Default (as specified by --with-gmapdb to the
configure program) is /var/cache/gmap
- -d,
--db=STRING
- Genome database
- --two-pass
- Two-pass mode, in which the sequences are processed first to identify
splice sites and introns, and then aligned using this splicing
information
- --use-localdb=INT
- Whether to use the local suffix arrays, which help with finding extensions
to the ends of alignments in the presence of splicing or indels (0=no,
1=yes if available (default))
Transcriptome-guided options (optional)
- -C,
--transcriptdir=directory
- Transcriptome directory. Default is the value for --dir above
- -c,
--transcriptdb=STRING
- Transcriptome database
- --transcriptome-mode=STRING
- Options: assist, only, annotate (default). The option assist means to try
transcriptome alignment first, but then use genomic alignment if nothing
is found. The option only means to try transcriptome alignment only. The
option annotate means to try only genomic alignment, to use the
transcriptome only for annotation; this is the fastest option. In the
other two options, annotation is also performed
Computation options
- -k,
--kmer=INT
- kmer size to use in genome database (allowed values: 16 or less). If not
specified, the program will find the highest available kmer size in the
genome database
- --sampling=INT
- Sampling to use in genome database. If not specified, the program will
find the smallest available sampling value in the genome database within
selected k-mer size
- -q,
--part=INT/INT
- Process only the i-th out of every n sequences, e.g., 0/100 or 99/100
(useful for distributing jobs to a computer farm).
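The i/n selection can be illustrated with a short Python sketch. This is not GSNAP's code: the helper names `parse_part` and `select_part` are hypothetical, and the rule that a sequence is kept when its 0-based position modulo n equals i is an assumption inferred from the 0/100 and 99/100 examples.

```python
from typing import Iterable, Iterator, Tuple


def parse_part(spec: str) -> Tuple[int, int]:
    """Parse an 'i/n' string such as '0/100' into the pair (i, n)."""
    i_text, n_text = spec.split("/")
    i, n = int(i_text), int(n_text)
    if n <= 0 or not 0 <= i < n:
        raise ValueError(f"invalid part specification: {spec!r}")
    return i, n


def select_part(sequences: Iterable[str], spec: str) -> Iterator[str]:
    """Yield sequences whose 0-based position modulo n equals i.

    The modulo rule is an assumption made for illustration only.
    """
    i, n = parse_part(spec)
    for index, seq in enumerate(sequences):
        if index % n == i:
            yield seq
```

Under this assumption, running one job for each of 0/100 through 99/100 covers every input sequence exactly once.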
- --input-buffer-size=INT
- Size of input buffer (program reads this many sequences at a time for
efficiency) (default 10000)
- --barcode-length=INT
- Amount of barcode to remove from start of every read before alignment
(default 0)
- --endtrim-length=INT
- Amount of trim to remove from the end of every read before alignment
(default 0)
- --orientation=STRING
- Orientation of paired-end reads. Allowed values: FR (fwd-rev, or typical
Illumina; default), RF (rev-fwd, for circularized inserts), FF
(fwd-fwd, same strand), or 10X (single-cell where read 1 has barcode
information; read 2 is rev)
- --10x-whitelist=FILE
- Whitelist of 10X Genomics GEM bead barcodes, needed to perform correction
of cellular barcodes. This file can be obtained at
cellranger-x.y.z/lib/python/cellranger/barcodes (for Cell Ranger version
>= 4) or at
cellranger-x.y.z/lib/cellranger-cs/x.y.z/lib/python/cellranger/barcodes
(for Cell Ranger version <= 3)
- --10x-well-position=INT
- Position of well information in the accession, when separated by colons. If
set to 0, then no well information will be printed in the CB field
(default: 4)
- --fastq-id-start=INT
- Starting position of identifier in FASTQ header, space-delimited (>=
1)
- --fastq-id-end=INT
- Ending position of identifier in FASTQ header, space-delimited (>=
1)
- Examples:
- @HWUSI-EAS100R:6:73:941:1973#0/1
- start=1, end=1 (default) => identifier is
HWUSI-EAS100R:6:73:941:1973#0
- @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
- start=1, end=1 => identifier is SRR001666.1
- start=2, end=2 => identifier is 071112_SLXA-EAS1_s_7:5:1:817:345
- start=1, end=2 => identifier is SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
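The start/end examples above can be reproduced with a small Python sketch. The function name `fastq_identifier` is hypothetical and this is not GSNAP's parser. Dropping a trailing /1 or /2 is an assumption based on the first example, where the reported identifier omits the /1.

```python
def fastq_identifier(header: str, start: int = 1, end: int = 1) -> str:
    """Return the space-delimited fields start..end (1-based) of a FASTQ header."""
    if start < 1 or end < start:
        raise ValueError("require 1 <= start <= end")
    if header.startswith("@"):
        header = header[1:]
    fields = header.split()
    identifier = " ".join(fields[start - 1:end])
    # Assumption: a trailing /1 or /2 read-end suffix is not part of the identifier.
    if identifier.endswith(("/1", "/2")):
        identifier = identifier[:-2]
    return identifier
```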
- --force-single-end
- When multiple FASTQ files are provided on the command line, GSNAP assumes
they are matching paired-end files. This flag treats each file as
single-end.
- --filter-chastity=STRING
- Skips reads marked by the Illumina chastity program. Expecting a string
after the accession having a 'Y' after the first colon, like this:
- @accession 1:Y:0:CTTGTA
- where the 'Y' signifies filtering by chastity. Values: off (default),
either, both. For 'either', a 'Y' on either end of a paired-end read will
be filtered. For 'both', a 'Y' is required on both ends of a paired-end
read (or on the only end of a single-end read).
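As a rough Python sketch of the off/either/both behavior: the function names `has_chastity_flag` and `should_skip` are hypothetical, and the header parsing assumes exactly the layout shown above (accession, a space, then colon-separated fields with the flag in the second field).

```python
from typing import Sequence


def has_chastity_flag(header: str) -> bool:
    """True if the string after the accession has 'Y' after its first colon."""
    parts = header.split(" ", 1)
    if len(parts) < 2:
        return False
    fields = parts[1].split(":")
    return len(fields) > 1 and fields[1] == "Y"


def should_skip(mode: str, headers: Sequence[str]) -> bool:
    """Decide whether to skip a read given the headers of its one or two ends."""
    flags = [has_chastity_flag(h) for h in headers]
    if mode == "off":
        return False
    if mode == "either":
        return any(flags)
    if mode == "both":
        return all(flags)
    raise ValueError(f"unknown mode: {mode!r}")
```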
- --allow-pe-name-mismatch
- Allows accession names of reads to mismatch in paired-end files
- --interleaved
- Input is in interleaved format (one read per line, tab-delimited)
- --gunzip
- Uncompress gzipped input files
- --bunzip2
- Uncompress bzip2-compressed input files
Computation options
- -B,
--batch=INT
- Batch mode (default = 2)
  Mode         Hash offsets  Hash positions  Genome          Local hash offsets  Local hash positions
  0            allocate      mmap            mmap            allocate            mmap
  1            allocate      mmap & preload  mmap            allocate            mmap & preload
  2            allocate      mmap & preload  mmap & preload  allocate            mmap & preload
  3            allocate      allocate        mmap & preload  allocate            allocate
  4 (default)  allocate      allocate        allocate        allocate            allocate
- Note: For a single sequence, all data structures use mmap
- A batch level of 5 means the same as 4, and is kept only for backward
compatibility
- --use-shared-memory=INT
- If 1, then allocated memory is shared among all processes on this node. If
0 (default), then each process has private allocated memory
- --preload-shared-memory
- Load files indicated by --batch mode into shared memory for use by
other GMAP/GSNAP processes on this node, and then exit. Ignore any input
files.
- --unload-shared-memory
- Unload files indicated by --batch mode from shared memory, or allow
them to be unloaded when existing GMAP/GSNAP processes on this node are
finished with them. Ignore any input files.
- -m,
--max-mismatches=FLOAT
- Maximum number of mismatches allowed (if not specified, then GSNAP tries
to find the best possible match in the genome). If specified between 0.0
and 1.0, then treated as a fraction of each read length. Otherwise,
treated as an integral number of mismatches (including indel and splicing
penalties). Default is 0.3
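The fraction-versus-count rule can be sketched in Python as follows. The function name `mismatch_limit` is hypothetical and this is not GSNAP's implementation. Two details are assumptions, since the text does not state them: a value of exactly 1.0 is treated here as a count, and fractional limits are truncated to an integer.

```python
def mismatch_limit(value: float, read_length: int) -> int:
    """Turn a --max-mismatches value into an integer limit for one read."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 1.0:
        # Assumption: fractions apply to the read length and are truncated.
        return int(value * read_length)
    # Assumption: 1.0 and above is an integral number of mismatches.
    return int(value)
```

For example, under these assumptions a value of 0.25 on a 76-base read allows 19 mismatches, while a value of 5 allows 5 regardless of read length.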
- --query-unk-mismatch=INT
- Whether to count unknown (N) characters in the query as a mismatch (0=no
(default), 1=yes)
- --genome-unk-mismatch=INT
- Whether to count unknown (N) characters in the genome as a mismatch (0=no,
1=yes). If --use-mask is specified, default is no, otherwise
yes.
- --maxsearch=INT
- Maximum number of alignments to find (default 1000). Should be larger than
--npaths, which is the number to report. Keeping this number large
will allow for random selection among multiple alignments. Reducing this
number can speed up the program.
- --indel-endlength=INT
- Minimum length at end required for indel alignments (default 4)
- -Y,
--max-insertions=INT
- Maximum number of insertions allowed (default 6)
- -Z,
--max-deletions=INT
- Maximum number of deletions allowed (default 9)
- -M,
--suboptimal-levels=INT
- Report suboptimal hits beyond best hit (default 0). All hits with best
score plus suboptimal-levels are reported (Note: Not currently
implemented)
- -a,
--adapter-strip=STRING
- Method for removing adapters from reads. Currently allowed values: off,
paired. Default is "off". To turn on, specify
"paired", which removes adapters from paired-end reads if they
appear to be present.
- -e,
--use-mask=STRING
- Use genome containing masks (e.g. for non-exons) for scoring
preference
- -V,
--snpsdir=STRING
- Directory for SNPs index files (created using snpindex) (default is
location of genome index files specified using -D and
-d)
- -v,
--use-snps=STRING
- Use database containing known SNPs (in <STRING>.iit, built
previously using snpindex) for tolerance to SNPs
- --cmetdir=STRING
- Directory for methylcytosine index files (created using cmetindex)
(default is location of genome index files specified using -D,
-V, and -d)
- --atoidir=STRING
- Directory for A-to-I RNA editing index files (created using atoiindex)
(default is location of genome index files specified using -D,
-V, and -d)
- --mode=STRING
- Alignment mode: standard (default), cmet-stranded, cmet-nonstranded,
atoi-stranded, atoi-nonstranded, ttoc-stranded, or ttoc-nonstranded.
Non-standard modes require you to have previously run the cmetindex or
atoiindex programs (which also cover the ttoc modes) on the genome
- -t,
--nthreads=INT
- Number of worker threads
Splicing options for DNA-Seq
- --find-dna-chimeras=INT
- Look for distant splicing involving poor splice sites (0=no, 1=yes). If not
specified, then default is to be on unless only known splicing is desired
(--use-splicing is specified and --novelsplicing is
off)
Splicing options for RNA-Seq
- -N,
--novelsplicing=INT
- Look for novel splicing (0=no (default), 1=yes)
- --splicingdir=STRING
- Directory for splicing involving known sites or known introns, as
specified by the -s or --use-splicing flag (default is
directory computed from -D and -d flags). Note: can just
give full pathname to the -s flag instead.
- -s,
--use-splicing=STRING
- Look for splicing involving known sites or known introns (in
<STRING>.iit), at short or long distances. See README instructions
for the distinction between known sites and known introns
- -w,
--localsplicedist=INT
- Definition of local novel splicing event (default 200000)
- --merge-distant-samechr
- Report distant splices on the same chromosome as a single splice, if
possible. Will produce a single SAM line instead of two SAM lines, which
is also done for translocations, inversions, and scramble events
Options for paired-end reads
- --pairmax-dna=INT
- Max total genomic length for DNA-Seq paired reads, or other reads without
splicing (default 2000). Used if neither -N nor -s is specified.
This value is also used for circular chromosomes when splicing in linear
chromosomes is allowed
- --pairmax-rna=INT
- Max total genomic length for RNA-Seq paired reads, or other reads that
could have a splice (default 200000). Used if -N or -s is
specified. Should probably match the value for -w,
--localsplicedist.
- --resolve-inner=INT
- Whether to resolve soft-clipping on the insides of paired-end reads
(default 1)
- --pairexpect=INT
- Expected paired-end length, used for resolving soft-clipping on the
insides of paired-end reads, and for pairing DNA-seq reads (default
1000)
- --pass1-min-support=INT
- Threshold read support for learning an intron during pass 1 of
--two-pass mode (default 20)
Options for quality scores
- --quality-protocol=STRING
- Protocol for input quality scores. Allowed values:
- illumina (ASCII 64-126) (equivalent to -J 64 -j -31)
- sanger (ASCII 33-126) (equivalent to -J 33 -j 0)
- Default is sanger (no quality print shift)
- SAM output files should have quality scores in sanger protocol
- Or you can customize this behavior with these flags:
- -J,
--quality-zero-score=INT
- FASTQ quality scores are zero at this ASCII value (default is 33 for
sanger protocol; for Illumina, select 64)
- -j,
--quality-print-shift=INT
- Shift FASTQ quality scores by this amount in output (default is 0 for
sanger protocol; to change Illumina input to Sanger output, select
-31)
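The arithmetic behind -J and -j can be shown with a short Python sketch. The function names `phred_scores` and `shift_quality` are hypothetical and are not part of GSNAP. The sketch only illustrates that a quality character's score is its ASCII value minus the zero score, and that a print shift of -31 maps Illumina (zero at ASCII 64) onto Sanger (zero at ASCII 33).

```python
from typing import List


def phred_scores(quality: str, zero_score: int = 33) -> List[int]:
    """Decode a FASTQ quality string, given the ASCII value that means score 0."""
    return [ord(ch) - zero_score for ch in quality]


def shift_quality(quality: str, print_shift: int = 0) -> str:
    """Shift every quality character by print_shift ASCII positions."""
    return "".join(chr(ord(ch) + print_shift) for ch in quality)


# Illumina-encoded input: '@' is ASCII 64 (score 0), 'h' is ASCII 104 (score 40).
illumina = "@h"
sanger = shift_quality(illumina, -31)  # 64 - 31 = 33, the Sanger zero score
```

The decoded scores are the same before and after the shift, which is the point of pairing -J 64 with -j -31.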
Output options
- -n,
--npaths=INT
- Maximum number of paths to print (default 100).
- -Q,
--quiet-if-excessive
- If more than maximum number of paths are found, then nothing is
printed.
- -O, --ordered
- Print output in same order as input (relevant only if there is more than
one worker thread)
- --show-refdiff
- For GSNAP output in SNP-tolerant alignment, shows all differences relative
to the reference genome as lower case (otherwise, it shows all differences
relative to both the reference and alternate genome)
- --clip-overlap
- For paired-end reads whose alignments overlap, clip the overlapping
region.
- --merge-overlap
- For paired-end reads whose alignments overlap, merge the two ends into a
single end (beta implementation)
- --print-snps
- Print detailed information about SNPs in reads (works only if -v
also selected) (not fully implemented yet)
- --failsonly
- Print only failed alignments, those with no results
- --nofails
- Exclude printing of failed alignments
- --only-concordant
- Print only concordant alignments (concordant_uniq, concordant_mult,
concordant_circular)
- --omit-concordant-uniq
- Do not print any concordant_uniq alignments
- --omit-concordant-mult
- Do not print any concordant_mult alignments
- --omit-softclipped
- Do not allow any alignments with soft clips
- --only-tr-consistent
- Print only alignments with consistent transcripts (XX field present,
identical if paired-end)
- -A,
--format=STRING
- Another format type, other than default. Currently implemented: sam, m8
(BLAST tabular format)
- --split-output=STRING
- Basename for multiple-file output, separately for nomapping,
halfmapping_uniq, halfmapping_mult, unpaired_uniq, unpaired_mult,
paired_uniq, paired_mult, concordant_uniq, and concordant_mult
results
- -o,
--output-file=STRING
- File name for a single stream of output results.
- --failed-input=STRING
- Print completely failed alignments as input FASTA or FASTQ format, to the
given file, appending .1 or .2, for paired-end data. If the
--split-output flag is also given, this file is generated in
addition to the output in the .nomapping file.
- --append-output
- When --split-output or --failed-input is given, this flag
will append output to the existing files. Otherwise, the default is to
create new files.
- --order-among-best=STRING
- Among alignments tied with the best score, order those alignments in this
order. Allowed values: genomic, random (default)
- --output-buffer-size=INT
- Buffer size, in queries, for output thread (default 1000). When the number
of results to be printed exceeds this size, worker threads wait until the
backlog is cleared
Options for SAM output
- --no-sam-headers
- Do not print headers beginning with '@'
- --add-paired-nomappers
- Add nomapper lines as needed to make all paired-end results alternate
between first end and second end
- --paired-flag-means-concordant=INT
- Whether the paired bit in the SAM flags means concordant only (1) or
paired plus concordant (0, default)
- --sam-headers-batch=INT
- Print headers only for this batch, as specified by -q
- --sam-hardclip-use-S
- Use S instead of H for hardclips
- --sam-use-0M=INT
- If 1 (default), then insert 0M in CIGAR between adjacent indels and
introns. If 0, do not allow 0M. Picard disallows 0M, but other tools may
require it
- --sam-extended-cigar
- Use extended CIGAR format (using = and X symbols instead of M, to indicate
matches and mismatches, respectively)
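To make the difference concrete, here is a minimal Python sketch that builds an extended CIGAR for an ungapped alignment, where = marks a sequence match and X a mismatch as defined in the SAM specification. The function name `extended_cigar` is hypothetical. The sketch assumes the read and reference segments have equal length and ignores indels, introns, and clipping.

```python
from itertools import groupby


def extended_cigar(read: str, ref: str) -> str:
    """Build an =/X CIGAR string for two equal-length, ungapped sequences."""
    if len(read) != len(ref):
        raise ValueError("sketch handles ungapped alignments only")
    ops = ["=" if a.upper() == b.upper() else "X" for a, b in zip(read, ref)]
    # Collapse runs of the same operation into <length><op> pairs.
    return "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))
```

In standard CIGAR format the same ungapped alignment is a single M operation covering the whole read.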
- --sam-multiple-primaries
- Allows multiple alignments to be marked as primary if they have equally
good mapping scores
- --sam-sparse-secondaries
- For secondary alignments (in multiple mappings), uses '*' for SEQ and QUAL
fields, to give smaller file sizes. However, the output will cause Picard
to give warnings and may not work with downstream tools
- --force-xs-dir
- For RNA-Seq alignments, disallows XS:A:? when the sense direction is
unclear, and replaces this value arbitrarily with XS:A:+. May be useful
for some programs, such as Cufflinks, that cannot handle XS:A:?. However,
if you use this flag, the reported value of XS:A:+ in these cases will not
be meaningful.
- --md-report-snps
- In MD string, when known SNPs are given by the -v flag, prints
difference nucleotides when they differ from reference but match a known
alternate allele
- --no-soft-clips
- Does not allow soft clips at ends. Mismatches will be counted over the
entire query
- --extend-soft-clips
- Extends alignments through soft clipped regions. CIGAR string and
coordinates will be revised, but mismatches and the MD string will reflect
the clipped CIGAR
- --action-if-cigar-error
- Action to take if there is a disagreement between CIGAR length and
sequence length. Allowed values: ignore, warning (default), noprint, abort.
Note that the noprint option does not print the CIGAR string at all if
there is an error, so it may break a SAM parser
- --read-group-id=STRING
- Value to put into read-group id (RG-ID) field
- --read-group-name=STRING
- Value to put into read-group name (RG-SM) field
- --read-group-library=STRING
- Value to put into read-group library (RG-LB) field
- --read-group-platform=STRING
- Value to put into read-group platform (RG-PL) field
Help options
- --check
- Check compiler assumptions
- --version
- Show version
- --help
- Show this help message
- Other tools of the GMAP suite are located in /usr/lib/gmap