IGDISCOVER(1) | IgDiscover | IGDISCOVER(1) |
igdiscover - IgDiscover Documentation
IgDiscover analyzes antibody repertoires and discovers new V genes from high-throughput sequencing reads. Heavy chains, kappa and lambda light chains are supported (to discover VH, VK and VL genes).
IgDiscover is the result of a collaboration between the Gunilla Karlsson Hedestam group at the Department of Microbiology, Tumor and Cell Biology at Karolinska Institutet, Sweden and the Bioinformatics Long-Term Support facility at Science for Life Laboratory (SciLifeLab), Sweden.
If you use IgDiscover, please cite:
Corcoran, Martin M. and Phad, Ganesh E. and Bernat, Néstor Vázquez and Stahl-Hennig, Christiane and Sumida, Noriyuki and Persson, Mats A.A. and Martin, Marcel and Karlsson Hedestam, Gunilla B. Production of individualized V gene databases reveals high levels of immunoglobulin genetic diversity. Nature Communications 7:13642 (2016) https://dx.doi.org/10.1038/ncomms13642
IgDiscover is written in Python 3 and is developed on Linux. The tool also runs on macOS, but is not as well tested on that platform.
For installation on either system, we recommend that you follow the instructions below, which will first explain how to install the Conda package manager. IgDiscover is available as a Conda-package from the bioconda channel. Using Conda will make the installation easy because all dependencies are also available as Conda packages and can thus be installed automatically along with IgDiscover.
There are also non-Conda installation instructions if you cannot use Conda.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh bash Miniconda3-latest-Linux-x86_64.sh
When the installer asks you about modifying the PATH in your .bashrc file, answer yes.
conda --version
If you see the conda version number, it worked.
conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge
conda create -n igdiscover igdiscover
This will create a new so-called “environment” for IgDiscover (retry if it fails). Whenever you want to run IgDiscover, you will need to activate the environment with this command:
source activate igdiscover
igdiscover --version
If you see the version number of IgDiscover, it worked! If an error message appears that says "The 'networkx' distribution was not found and is required by snakemake", install networkx manually with:
pip install networkx==2.1
Then retry to check the igdiscover version.
If you use zsh instead of bash (applies to Bio-Linux, for example), the $PATH environment variable will not be setup correctly by the Conda installer. The miniconda installer adds a line export PATH=... to the to the end of your /home/your-user-name/.bashrc file. Copy that line from the file and add it to the end of the file /home/your-user-name/.zshrc instead.
Alternatively, change your default shell to bash by running chsh -s /bin/bash.
If you use conda and see an error that includes something like this:
ImportError: .../.local/lib/python3.5/site-packages/sqt/_helpers.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf
Or you see any error that mentions a .local/ directory, then a previous installation of IgDiscover is interfering with the conda installation.
The easiest way to solve this problem is to delete the directory .local/ in your home directory, see also how to remove IgDiscover from a Linux system.
If you get the error
ValueError: unknown locale: UTF-8
Then follow these instructions.
To install IgDiscover directly from the most recent source code, read the developer installation instructions.
IgDiscover requires quite a few other software tools that are not included in most Linux distributions (or mac OS) and which are also not available from the Python packaging index (PyPI) because they are not Python tools. If you do not use the recommended simple installation instructions via Conda, you need to install those non-Python dependencies manually. Regular Python dependencies are automatically pulled in when IgDiscover itself is installed in the last step with the pip install command. The instructions below are written for Linux and require modifications if you want to try this on OS X.
NOTE:
The dependencies are: MUSCLE, IgBLAST, PEAR, and -- optionally -- flash.
sudo apt-get install python3
mkdir -p ~/.local/bin
Add this line to the end of your ~/.bashrc file:
export PATH=$HOME/.local/bin:$PATH
Then either start a new shell or run source ~/.bashrc to get the changes.
sudo apt-get install muscle
If your distribution does not have a 'muscle' package or if you are not allowed to run sudo:
wget -O - http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz | tar xz mv muscle3.8.31_i86linux64 ~/.local/bin/
wget http://sco.h-its.org/exelixis/web/software/pear/files/pear-0.9.6-bin-64.tar.gz tar xvf pear-0.9.6-bin-64.tar.gz mv pear-0.9.6-bin-64/pear-0.9.6-bin-64 ~/.local/bin/pear
wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/1.4.0/ncbi-igblast-1.4.0-x64-linux.tar.gz tar xvf ncbi-igblast-1.4.0-x64-linux.tar.gz mv ncbi-igblast-1.4.0/bin/igblast? ~/.local/bin/
IgBLAST requires some data files that must be downloaded separately. The following commands put the files into ~/.local/igdata:
mkdir ~/.local/igdata cd ~/.local/igdata wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/ wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file/
Also, you must set the $IGDATA environment variable to point to the directory with data files. Add this line to your ~/.bashrc:
export IGDATA=$HOME/.local/igdata
Then run source ~/.bashrc to get the changes.
wget -O FLASH-1.2.11.tar.gz http://sourceforge.net/projects/flashpage/files/FLASH-1.2.11.tar.gz/download tar xf FLASH-1.2.11.tar.gz cd FLASH-1.2.11 make mv flash ~/.local/bin/
Install IgDiscover with the Python package manager pip, which will download and install IgDiscover and its dependencies:
pip3 install --user igdiscover
Both commands also install all remaining dependencies. The --user option instructs both commands to install everything into $HOME/.local.
Finally, check the installation with
igdiscover --version
and you should see the version number of IgDiscover.
You should now run IgDiscover on the test data set.
After installing IgDiscover, you should run it once on a small test data that we provide, both to test your installation and to familiarize yourself with running the program.
wget https://bitbucket.org/igdiscover/testdata/downloads/igdiscover-testdata-0.5.tar.gz tar xvf igdiscover-testdata-0.5.tar.gz
igdiscover init --db igdiscover-testdata/database/ --reads igdiscover-testdata/reads.1.fastq.gz discovertest
The name discovertest is the name of the pipeline directory that will be created. Note that only the path to the first reads file needs to be given. The second file is found automatically. There may be a couple of messages “Skipping 'x' because it contains the same sequence as 'y'”, which you can ignore.
The command will have printed a message telling you that the pipeline directory has been initialized, that you should edit the configuration file, and how to actually run IgDiscover after that.
cd discovertest && igdiscover run
On this small dataset, running the pipeline should take not more than about 5 minutes.
See the explanation of final result files.
ENA project PRJEB15295 contains the data for our Nature Communications paper from 2016, in particular ERR1760498, which is the data for the human “H1” sample (multiplex PCR, IgM heavy chain).
Data used for testing TCR detection (human, RACE): SRR2905677 and SRR2905710.
IgDiscover works on a single library at a time. It works within an “analysis directory” for the library, which contains all intermediate and result files.
To start an analysis, you need:
If you do not have a V/D/J database, yet, you may want to read the section about how to obtain V/D/J sequences.
To run an analysis, proceed as follows.
NOTE:
First, pick a name for your analysis. We will use myexperiment in the following. Then run igdiscover init:
igdiscover init myexperiment
A dialog will appear and ask for the file with the first (forward) reads. Find your compressed FASTQ file that contains them and select it. Typical file names may be Library1_S1_L001_R1_001.fastq.gz or mylibrary.1.fastq.gz. You do not need to choose the second read file! It is found automatically.
Next, choose the directory with your database. The directory must contain the three files V.fasta, D.fasta, J.fasta. These files contain the V, D, J gene sequences, respectively. Even if have have only light chains in your data, a D.fasta file needs to be provided, just use one with the heavy chain D gene sequences.
If you do not want a graphical user interface, use the two command-line parameters --db and --reads1 to provide this information instead:
igdiscover init --db path/to/my/database/ --reads1 mylibrary.1.fastq.gz myexperiment
Again, the second reads file will be found automatically. Use --single-reads instead of --reads1 if you have single-end reads or a dataset with already merged reads. For --single-reads, a FASTA file (not only FASTQ) is also allowed. In any case, an analysis directory named myexperiment will have been created.
The previous step created a configuration file named myexperiment/igdiscover.yaml, which you may need to adjust. In particular, the number of discovery rounds is set to 3 by default, which takes a long time. Reducing this to 2 or even 1 often works just as well.
Change into the newly created analysis directory and run the analysis:
igdiscover run
Depending on the size of your library, your computer, and the number of iterations, this will now take from a few hours to a day. See the running IgDiscover section for more fine-grained control over what to run and how to resume the process if something failed.
We use the term “database” to refer to three FASTA files that contain the sequences for the V, D and J genes. IMGT provides sequences for download. For discovering new VH genes, for example, you need to get the IGHV, IGHD and IGHJ files of your species. As IgDiscover uses this only as a starting point, using a similar species will also work.
When using an IMGT database, it is very important to change the long IMGT sequence headers to short headers as IgBLAST does not accept the long headers. We recommend using the program edit_imgt_file.pl. If you installed IgDiscover from Conda, the script is already installed and you can run it by typing the name. It is also available on the IgBlast FTP site.
Run it for all three downloaded files, and then rename files appropritely to make sure that they named V.fasta, D.fasta and J.fasta.
You always need a file with D genes even if you analyze light chains.
In case you have used IgBLAST previously, note that there is no need to run the makeblastdb tool as IgDiscover will do that for you.
IgDiscover can process input data of three different types:
IgDiscover was tested mainly on paired-end Illumina MiSeq reads (2x300bp), but it can also handle 454 and Ion Torrent data.
Depending on the input file type, use a variant of one of the following commands to initialize the analysis directory:
igdiscover init --single-reads=file.fasta.gz --database=my-database-dir/ myexperiment igdiscover init --reads1=file.1.fasta.gz --database=my-database-dir/ myexperiment igdiscover init --reads1=file.1.fastq.gz --database=my-database-dir/ myexperiment
Paired-end reads are first merged and then processed in the same way as single-end reads. Reads that could not be merged are discarded. Single-end reads and merged paired-end reads are expected to follow this structure (from 5' to 3'):
We use IgBLAST to detect the location of the V, D, J genes through the igdiscover igblast subcommand. The G nucleotides after the barcode are split off if the configuration specifies race_g: true. The leader sequence is detected by looking for a start codon near 60 bp upstream of the start of the V gene match.
The igdiscover init command creates a configuration file igdiscover.yaml in the analysis directory. To configure your analysis, change that file with a text editor before running the analysis with igdiscover run.
The syntax should be mostly self-explanatory. The file is in YAML format, but you will not need to learn that. Just follow the examples given in the file. A few rules that may be good to know are the following ones:
forward_primers: - ACGTACGTACGT - AACCGGTTAACC
Even if you have only one primer sequence, you still need to use this syntax.
To find out what the configuration options achieve, see the explanations in the configuration file itself.
The main parameters parameters that may require adjusting are the following.
The iterations option sets the number of rounds of V gene discovery that will be performed. By default, three iterations are run. Even with a very restricted starting V database (for example with only a single V gene sequence), this is usually sufficient to identify most novel germline sequences.
When the starting database is more complete, for example, when analyzing a human IgM library with the current IMGT heavy chain database, a single iteration may be sufficient to produce an individualized database.
If you do not want to discover any new genes and only want to produce an expression profile, for example, then use iterations: 0.
The ignore_j option should be set to true when producing a V gene database for a species where J sequences are unknown:
ignore_j: true
Setting the parameters stranded, forward_primers and reverse_primers to the correct values can be used to remove 5' and 3' primers from the sequences. Doing this is not strictly necessary for IgDiscover. It is simplest if you do not specify any primer sequences.
This provides IgDiscover with stringency requirements for V gene discovery that enable the program to filter out false positives. Usually the ”pregermline filter” can be used in the default mode since all these sequences will be subsequently passed to the higher stringency ”germline filter” where the criteria are set to maximize stringency. Here is how it looks in the configuration file:
pre_germline_filter: unique_cdr3s: 2 # Minimum number of unique CDR3s (within exact matches) unique_js: 2 # Minimum number of unique J genes (within exact matches) check_motifs: false # Check whether 5' end starts with known motif whitelist: true # Add database sequences to the whitelist cluster_size: 0 # Minimum number of sequences assigned to cluster differences: 0 # Merge sequences if they have at most this number of differences allow_stop: true # Whether to allow non-productive sequences containing stop codons cross_mapping_ratio: 0.02 # Threshold for removal of cross-mapping artifacts (set to 0 to disable) allele_ratio: 0.1 # Required minimum ratio between alleles of a single gene # Filtering criteria applied to candidate sequences in the last iteration. # These should be more strict than the pre_germline_filter criteria. # germline_filter: unique_cdr3s: 5 # Minimum number of unique CDR3s (within exact matches) unique_js: 3 # Minimum number of unique J genes (within exact matches) check_motifs: false # Check whether 5' end starts with known motif whitelist: true # Add database sequences to the whitelist cluster_size: 100 # Minimum number of sequences assigned to cluster differences: 0 # Merge sequences if they have at most this number of differences allow_stop: false # Whether to allow non-productive sequences containing stop codons cross_mapping_ratio: 0.02 # Threshold for removal of cross-mapping artifacts (set to 0 to disable) allele_ratio: 0.1 # Required minimum ratio between alleles of a single gene
Factors that affect germline discovery include library source (IgM vs IgK, IgL or IgG) library size, sequence error rate and individual genomic factors (for example the number of J segments present in an individual).
In general, setting a higher cutoff of unique_cdr3s and unique_js will minimize the number of false positives in the output. Example:
unique_cdr3s: 10 # Minimum number of unique CDR3s (within exact matches) unique_js: 4 # Minimum number of unique J genes (within exact matches)
If the differences parameter is set to a value higher than 0, the germline filter inspects clusters of sequences that are closely related (when the edit distance between them is at most differences) and retains only the most common sequence of each cluster. Previously, we believed this would removes some false positives due to accumulated random sequence errors of highly expressed alleles that otherwise would pass the cutoff criteria. However, we found out that we miss true positives, in particular if there are two alleles in the sample that differ in only a single nucleotide. We have now implemented other measures to avoid false positives and recommend against setting the differences to something other than 0.
Read also about the cross mapping, for which germline filtering corrects, and about the germline filters.
Changed in version The: default for the differences setting was changed from 1 to 0.
The command igdiscover run, which is used to start the pipeline, can also be used to resume execution if there was an interruption (a transient failure). Reasons for interruptions might be:
To resume execution after you have fixed the problem, go to the analysis directory and run igdiscover run again. It will skip the steps that have already finished successfully. This capability comes from the workflow management system snakemake, on which igdiscover run is based. Snakemake will determine automatically which steps need to be re-run in order to get to a full result and then run only those.
Alterations to the configuration file after an interruption are possible, but affect only steps that have not already finished successfully. For example, assume you interrupted a run with Ctrl+C after it is already past the step in which barcodes are removed. Then, even if you change the barcode length in the configuration, the barcode removal step will not be re-run when you resume the pipeline and the previous barcode length is in effect. See also the next section.
When you experiment with parameters in the igdiscover.yaml file, such as germline filtering criteria, you do not need to re-run the entire pipeline from the beginning, but can re-use the results that already exist. This can save a lot of processing time, in particular when you avoid re-running IgBLAST in this way.
As described in the previous section, igdiscover run automatically figures out which files need to be re-created if a run was interrupted. Unfortunately, this mechanism is currently not smart enough to also look for changes in the igdiscover.yaml file. Thus, if the full pipeline has finished successfully, then re-running igdiscover run will just print the message Nothing to be done. even after you have changed the configuration file.
You will therefore need to know yourself which file you want to regenerate. Then follow the following steps. Note that these will remove parts of the existing results, and if you need to keep them, make a copy of your analysis directory first.
For example, assume you want to modify some germline filtering setting and then re-run the pipeline. Change the setting in your igdiscover.yaml, then run these commands:
rm iteration-01/new_V_germline.tab igdiscover run iteration-01/new_V_germline.tab
The above will have regenerated the iteration-01/new_V_germline.tab file and also the iteration-01/new_V_germline.fasta file since they are generated by the same script. If you want to update any other files, then also run
igdiscover run
IgDiscover writes all intermediate files, the final V gene database, statistics and plots into the analysis directory that was created with igdiscover init. Inside that directory, there is a final/ subdirectory that contains the analysis results.
These are the files and subdirectories that can be found in the analysis directory. Subdirectories are described in detail below.
Final results are found in the final/ subdirectory of the analysis directory.
The .tab file contains the counts as a table, while the PDF file contains a plot of the same values.
These tables also exist in the iteration-specific directories (iteration-xx). For those, note that the numbers do not include the genes that were discovered in that iteration. For example, iteration-01/expressed_V.tab shows only expression counts of the V genes in the starting database.
If you are interested in the results of each iteration, you can inspect the iteration-xx/ directories. They are structured in the same way as the final/ subdirectory, except that the results are based on the intermediate databases of that iteration. They also contain the following additional files.
For completeness, here is a description of the files in the reads/ and stats/ directories. They are created during pre-processing and are not iteration specific.
This file is a gzip-compressed table with tab-separated values. It is created by the igdiscover igblast subcommand and is the result of parsing raw output from IgBLAST. It contains a few additional columns that do not come directly from IgBLAST. In particular, the CDR3 sequence is detected, the sequence before the V gene match is split into UTR and leader, and the RACE-specific run of G nucleotides is also detected. The first row is a header row with column names. Each subsequent row describes the IgBLAST results for a single pre-processed input sequence.
Note: This file is typically quite large. LibreOffice can open the file directly (even though it is compressed), but make sure you have enough RAM.
Columns:
productive
The UTR, leader, barcode, race_G and genomic_sequence columns are filled in the following way.
This table is the same as the assigned.tab.gz table, except that rows containing low-quality matches have been filtered out. Rows fulfilling any of the following criteria are filtered:
This table contains the candidates for novel V genes found by the discover subcommand. As the other files, it is a text file in tab-separated values format, with the first row containing the column headings. It can be opened directly in LibreOffice, for example.
Candidates are found by inspecting all the sequences assigned to a database gene, and clustering them in multiple ways. The candidate sequences are found by computing a consensus from each found cluster.
Each row describes a single candidate, but possibly multiple clusters. If there are multiple clusters from a single gene that lead to the same consensus sequence, then they get only one row. The cluster column lists the source clusters for the given sequence. Duplicate sequences can still occur when two different genes lead to identical consensus sequences. (These duplicated sequences are merged by the germline filters.)
Below, we use the term cluster set to refer to all the sequences that are in any of the listed clusters.
Some clusters lead to ambiguous consensus sequences (those that include N bases). These have already been filtered out.
A cluster name such as cl3 describes a cluster generated through linkage cluster analysis. The clusters are simply named cl1, cl2, cl3 etc. If any cluster number seems to be missing (such as when cl1 and cl3 occur, but not cl2), then this means that the cluster led to an ambiguous consensus sequence that has been filtered out. Since the cl clusters are created from a random subsample of the data (in order to keep computation time down), they are never larger than the size of the subsample (currently 1000).
The name db represents a cluster that is identical to the database sequence. If no actual cluster corresponding to the database sequence is found, but the database sequence is expressed, a db cluster is inserted artificially in order to make sure that the sequence is not lost. The cluster name all represents the set of all sequences assigned to the source gene. This means that an unambiguous consensus could be computed from all the sequences. Typically, this happens during later iterations when there are no more novel sequences among the sequences assigned to the database gene.
Consensus sequences are computed only from V gene sequences, but each V gene sequence is part of a full V/D/J sequence. We therefore know for each V sequence which J gene it was found with. This number says how many different J genes were found for all sequences that the consensus in this row was computed from.
To clarify: While the consensus sequence is computed only from a subset of sequences assigned to a source gene, all sequences assigned to the source gene are searched for exact occurrences of that consensus sequence.
When comparing sequences, they are first truncated at the 3' end by removing those (typically 8) bases that correspond to the CDR3 region.
The igdiscover discover command can also be run by hand with other parameters, in which case additional columns may appear.
The two files annotated_V_germline.tab and annotated_V_pregermline.tab are copies of the candidates.tab file with two extra columns that show why a candidate was filtered in the germline and pre-germline filtering steps. The two columns are:
The following values can occur in the why_filtered column:
Each gene discovered by IgDiscover gets a unique name such as “VH4.11_S1234”. The “VH4.11” is the name of the database gene to which the novel V gene was initially assigned. The number 1234 is derived from the nucleotide sequence of the novel gene. That is, if you discover the same sequence in two different runs of the IgDiscover, or just in different iterations, the number will be the same. This may help when manually inspecting results.
Be aware that you still need to check the sequence itself since even different sequences can sometimes lead to the same number (a “hash collision”).
The _S1234 suffixes do not accumulate. Before IgDiscover adds the suffix in an iteration, it removes the suffix if it already exists.
The igdiscover program has multiple subcommands. You should already be familiar with the two commands init and run. Each subcommand comes with its own help page that shows how to use that subcommand. Run the command with the --help option to see the help. For example,
igdiscover run --help
shows the help for the run subcommand.
The following additional subcommands may be useful for further analysis.
The following subcommands are used internally, and listed here for completeness.
V gene sequences found by the clustering step of the program (the discover subcommand) are stored in the candidates.tab file. The entries are “candidates” because many of these will be PCR or other artifacts and therefore do not represent true novel V genes. The germline and pre-germline filters take care of removing artifacts. They germline filter is the “real” filter and used only in the last iteration in order to obtain the final gene database. The pre-germline filter is less strict and used in all the earlier iterations.
The germline filters are implemented in the igdiscover germlinefilter subcommand. It performs the following filtering and processing steps:
If a whitelist of sequences is provided (by default, this is the input V gene database), then the candidates that appear on it
Whitelisting allows IgDiscover to identify known germline sequences that are expressed at low levels in a library. If enabled with whitelist: true (the default) in the pregermline and germline filter sections of the configuration file, the sequences present in the starting database are treated as validated germline sequences and will not be discarded if due to too small cluster size as long as they fulfill the remaining criteria (unique_cdr3s, unique_js etc.).
You can see why a candidate was filtered by inspecting the annotated_V_*.tab files
If two very similar sequences appear in the database used by IgBLAST, then sequencing errors may lead to one sequence incorrectly being assigned to the other. This is particularly problematic if one of the sequences is highly expressed while the other is not expressed at all. The not expressed sequence is even included in the list of V gene candidates because it is in the input database and therefore whitelisted. We call this a “cross-mapping artifact”.
The germline filtering step of IgDiscover therefore aims to eliminate cross-mapping artifacts by checking all pairs of sequences for the following:
If all that is the case, then the sequence with the lower expression is discarded.
When multiple alleles of the same gene appear in the list of V gene candidates, such as IGHV1-2*02 and IGHV1-2*04, the germline filter computes the ratio of the values in the exact and the clonotypes columns between them. If the ratio is under the configured threshold, the candidate with the lower count is discarded. See the exact_ratio and clonotype_ratio settings in the germline_filter and pregermline_filter sections of the configuration file.
New in version 0.7.0.
To work with datasets from the Sequence Read Archive, you may want to use the tool fastq-dump, which can download the reads in the format required by IgDiscover. You just need to know the accession number, such as “SRR2905710” and then run this command to download the files to the current directory:
fastq-dump --split-files --gzip SRR2905710
The --split-files option ensures that the paired-end reads are stored in two separate files, one for the forward and one for the reverse read, respectively. (If you do not provide it, you will get an interleaved FASTQ file that currently cannot be read by IgDiscover). The --gzip option creates compressed output. The command creates two files in the current directory. In the above example, they would be named SRR2905710_1.fastq.gz and SRR2905710_2.fastq.gz.
The program fastq-dump is part of the SRA toolkit. On Debian-derived Linux distributions, you can typically install it with sudo apt-get install sra-toolkit. On Conda, install it with conda install -c bioconda sra-tools.
Random subsampling indeed influences somewhat which sequences are found by the cluster analysis, particularly in the beginning. However, the probability is large that all highly expressed sequences are represented in the random sample. Also, due to the database growing with subsequent iterations, the set of sequences assigned to a single database gene becomes smaller and more homogeneous. This makes it increasingly likely that also sequences expressed at lower levels result in a cluster since they now make up a larger fraction of each subsample.
Also, many of the clusters which are captured in one subsample but not in the other are artifacts that are then filtered out anyway by the pre-germline or germline filter.
On human data with a nearly complete starting database, the subsampling seems to have no influence at all, as we determined experimentally. We repeated a run of the program four times on the same human dataset, using identical parameters each time except that the subsampling was done in a different way. Although intermediate results differed, all four personalized databases that the program produced were exactly identical.
Concordance is lower, though, when the input database is not as complete as the human one.
The way in which random subsampling is done is modified by the seed configuration setting, which is set to 1 by default. If its value is the same for two different runs of the program with otherwise identical settings, the numbers chosen by the random number generator will be the same and therefore also subsampling will be done in an identical way. This makes runs of the program reproducible. In order to test how results differ when subsampling is done in a different way, change the seed to a different value.
When you report a bug or unusual behavior to us, we might ask you to send us the output of igdiscover run. You can send its output to a file by running the program like this:
igdiscover run >& logfile.txt
And here is how to send the logging output to a file and also see the output in your terminal at the same time (but you lose the colors):
igdiscover run |& tee logfile.txt
Sometimes you may want to re-analyze a dataset multiple times with different filter settings. To speed this up, IgDiscover can cache the results of two of the most time-consuming steps, read-merging with PEAR and running IgBLAST.
The cache is disabled by default as it uses a lot of disk space. To enable the cache, create a file named ~/.config/igdiscover.conf with the following contents:
use_cache: true
If you do so, a directory named ~/.cache/igdiscover/ is created the next time you run IgDiscover and all IgBLAST results as well as merged reads from PEAR are stored there. On subsequent runs, the existing result is used directly without calling the respective program, which speeds up the pipeline considerably.
The cache is only used when we are certain that the results will indeed be the same. For example, if the IgBLAST program version or th V/D/J database changes, the cached result is not used.
The files in the cache are compressed, but the cache may still get large over time. You can delete the cache with rm -r ~/.cache/igdiscover to free the space.
You should also delete the cache when updating to a newer IgBLAST version as the old results will not be used anymore.
Library sizes of several hundred thousand sequences are required for V gene discovery, with even higher numbers necessary for full database production. For example, IgM library sizes of 750,000 to 1,000,000 sequences for heavy chain databases and 1.5 to 2 million sequences for light chain databases.
IgDiscover has been developed to identify germline databases from libraries that contain substantial fractions of unswitched antibody sequences. We recommend IgM libraries for heavy chain V gene identification and IgKappa and IgLambda libraries for light chain identification. IgDiscover can identify a proportion of gemline sequences in IgG libraries but the process is much more efficient with IgM libraries, enabling the full set of germline sequences to be discovered.
Yes, IgDiscover accepts both unpaired FASTQ files and paired FASTA files but the program should be made aware which is being used, see input requirements.
Yes. For accurate V gene discovery, all primer sequences must be external to the V gene sequences. For example, forward multiplex amplification primers should be present in the leader sequence or 5' UTR, and reverse amplification primers should be located in the constant region, preferably close to the 5' border of the constant region. Primers that are present in framework 1 region or J segments are not recommended for library production.
Both 5'-RACE and multiplex PCR have their own advantages.
5'-RACE will enable library production from species where the upstream V gene sequence is unknown. The output of the upstream subcommand in IgDiscovery enables the identification of consensus leader and 5'-UTR sequences for each of the identified germline V genes, that can subsequenctly be used for primer design for either multiplex PCR or for monoclonal antibody amplification sets.
Multiplex PCR is recommended for species where the upstream sequences are well characterized. Multiplex amplification products are shorter than 5'-RACE products and therefore will be easier to pair and will have less length associated sequence errors.
The starting database refers to the folder that contains the three FASTA files necessary for the process of iterative V gene discovery to begin. IgDiscover uses the standalone IgBLAST program for comparative assignment of sequences to the starting database. Because IgBlast requires three files (for example V.fasta, D.fasta, J.fasta), three FASTA files should be included in the database folder for each analysis to proceed.
In the case of light chains (that do not contain D segments), a dummy D segment file should be included as IgBLAST will not proceed if it does not see three files in the database folder. It is sufficient to save the following sequence as a fasta file and rename it D.fasta, for example, for it to function as the dummy D.fasta file for human light chain analysis:
>D_ummy GGGGGGGGGG
Since we do not have permission to distribute IMGT database files with IgDiscover, you need to download them directly from IMGT. See the section about obtaining a V/D/J database.
By editing the configuration file.
The final germline database in FASTA format is in your analysis directory in the subdirectory final/database/. The V.fasta file contains the new list of V genes. The D.fasta and J.fasta files are unchanged from the starting database.
A phylogenetic tree of the V sequences can be found in final/dendrogram_V.pdf.
For more details of how that database was created, you need to inspect the files created in the last iteration of the discovery process, located in iteration-xx, where xx is the number of iterations configured in the igdiscover.yaml configuration file. For example, if three iterations were used, look into iteration-03/.
Most interesting in that folder are likely
The new_V_germline.fasta file is identical to the one in final/database/V.fasta
Please see the Section on gene names.
IgDiscover itself does not (yet) come with all imaginable analysis facilities built into it. However, it creates many files (mostly with tables) that can be used for custom analysis. For example, all .tab files (in particular assigned.tab.gz and candidates.tab) can be opened and inspected in a spreadsheet application such as LibreOffice. From there, you can do basic tasks such as sorting from the menu of that application.
Often, these facilities are not enough, however, and some basic understanding of the command-line is helpful. Clearly, this is not as convenient as working in a graphical user interface (GUI), but we do not currently have the resources to provide one for IgDiscover. To alleviate this somewhat, we provide here instructions for a few things that you may want to do with the IgDiscover result files.
The candidates.tab file tells you for each discovered sequence how often an exact match of that sequence was found in your input reads. A high number of exact matches is a good indication that the candidate is actually a new gene or allele. In order to find the original reads that correspond to those matches, you can look at the filtered.tab.gz file and extract all rows where the V_errors column is zero.
First, run this on the filtered.tab.gz file:
zcat filtered.tab.gz | head -n 1 | tr '\t' '\n' | nl
This will enumerate the columns in the file. Take a note of the index that the V_errors column has. In newer pipeline versions, the index is 21. Then extract all rows of the file where that field is equal to zero:
If the column wasn’t 21, then replace the $21 appropriately. The part where it says NR == 1 ensures that the column headings are also printed.
Some configuration settings are not documented in the default igdiscover.yaml file since they rarely need to be changed.
# Leave empty or choose a species name supported by IgBLAST: # human, mouse, rabbit, rat, rhesus_monkey # This setting is not used anywhere except that it is passed # to IgBLAST using the -organism option. Since we provide IgBLAST # with our own gene databases, it seems this has no effect. species:
# Which program to use for computing multiple alignments. This is used for # computing consens sequences. # Choose 'mafft', 'clustalo', 'muscle' or 'muscle-fast'. # 'muscle-fast' runs muscle with parameters "-maxiters 1 -diags". # #multialign_program: muscle-fast
To use the most recent IgDiscover version from Git, follow these steps.
git clone https://github.com/NBISweden/IgDiscover.git
(Use the git@ URL instead if you are a developer.)
cd IgDiscover conda env create -n igdiscover -f environment.yml
You can choose a different environment name by changing the name after the -n parameter. This may be necessary, when you already have a regular (non-developer) IgDiscover installation in an igdiscover environment that you don’t want to overwrite.
source activate igdiscover
(Or use whichever name you chose above.)
python3 -m pip install -e .
Whenever you want to update the software:
cd IgDiscover git pull
It may also be necessary to repeat the python3 -m pip install -e . step.
For development, in particular when running tests repeatedly, you should enable the IgBLAST result cache. The cache stores IgBLAST output. If the same dataset with the same dataset is run a second time, the result is retrieved from the cache and IgBLAST is not re-run. This saves a lot of time when re-running datasets, but may also fill up the cache directory ~/.cache/igdiscover/. Also, in production, datasets are usually not re-run with the same settings, which is why caching is disabled by default.
To enable the cache, create a file ~/.config/igdiscover.conf with the following content:
use_cache: true
The file is in YAML format, but at the moment, no other settings are supported.
Go to the doc/ directory in the repository, then run:
make
to build the documentation locally. Open _build/html/index.html in a browser. The layout is different from the version shown on Read the Docs, but allows you to preview any changes you may have made.
We use versioneer to manage version numbers. It extracts the version number from the most recent tag in Git. Thus, to increment the version number, create a Git tag:
git tag v0.5
The v prefix is mandatory.
Then:
If you have been playing around with different installation methods (pip, conda, git, python3 setup.py install etc.) you may have multiple copies of IgDiscover on your system and you will likely run into problems on updates. Here is a list you can follow in order to get rid of the installations as preparation for a clean re-install. Do not add sudo to the commands below if you get permission problems, unless explicitly told to do so! If one of the steps does not work, that is fine, just continue.
Finally, you can follow the normal installation instructions and then the developer installation instructions.
TACTGTGCGAGAGA (seq 1) TACTGTGCGAGAGA (seq 2) TACTGTGCGAGAG- (seq 3) TACTGTGCGAG--- (seq 4) TACTGTGCGAG--- (seq 5) TACTGTGCGAGAG (previous consensus) TACTGTGCGAGAGA (new consensus)
Marcel Martin
2015-2023, Marcel Martin
March 10, 2023 | 0.11 |