Synteny and Rearrangement Identifier
SyRI requires assemblies to be at chromosome-level for accurate identification of SRs. If chromosome-level assemblies are not available, one can create pseudo-chromosome level assemblies using chroder utility.
SyRI uses whole-genome alignments as input. These can be generated using whole-genome aligner of user’s choice (check installation guide. Here, we would use MUMmer3 package.
Firstly, the genomes (in multi-fasta format) are aligned using the NUCmer utility.
nucmer --maxmatch -c 500 -b 500 -l 100 refgenome qrygenome;
Here, -c
,-b
, and -l
are parameters used to control the alignment resolution and need to be adjusted based on the genome size and complexity. More details are available here.
NUCmer would generate an out.delta
file as output. The identified alignments are filtered using delta-filter
and then converted into a tab-separated format using show-coords
.
delta-filter -m -i 90 -l 100 out.delta > out_m_i90_l100.delta;
show-coords -THrd out_m_i90_l100.delta > out_m_i90_l100.coords;
Users can change values for -i
, and -l
input to suit their genomes and specifc scientific problem. More information is available here.
For identificaiton of structural rearrangements (which include duplications), overlapping alignments are not filtered out. In the example above, --maxmatch
(for nucmer) results in identificaiton of all alignments. The -m
(for delta-filter) parameter removes redundant alignments, though it is not necessary but is used as it helps in significantly reducing number of alignments which in turn reduces time and memory required by SyRI. Finally, -THrd
(for show-coords) converts the alignments form .delta
format to .tsv
format consisting of alignment coordinates required by SyRI.
For alignments generated using MUMmer3, CIGAR strings are not required. For other aligners, the whole genome alignments can be parsed either in SAM/BAM format or in a .tsv
with CIGAR string for each alignment for the identification of SNPs and short indels (however, structural rearrangements and structural variations can be identified without CIGAR strings).
syri
SyRI takes genome alignments coordinates as input. Additionally, fasta files for the two genomes will also be required if structure variations are also needed. Further, for short variation identification, when CIGAR strings are not available, .delta
file (as generated from NUCmer) will also be requried.
The usage and parameters are:
usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-F {T,S,B}] [-k]
[--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--dir DIR]
[--prefix PREFIX] [--seed SEED] [--nc NCORES] [--novcf] [-f]
[--nosr] [--tdgaplen TDGL] [--tdmaxolp TDOLP] [-b BRUTERUNTIME]
[--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT] [--inc INCREASEBY]
[--no-chrmatch] [--nosv] [--nosnp] [--all] [--allow-offset OFFSET]
[--cigar] [-s SSPATH]
Input Files:
-c INFILE File containing alignment coordinates (default: None)
-r REF Genome A (which is considered as reference for the
alignments). Required for local variation (large
indels, CNVs) identification. (default: None)
-q QRY Genome B (which is considered as query for the
alignments). Required for local variation (large
indels, CNVs) identification. (default: None)
-d DELTA .delta file from mummer. Required for short variation
(SNPs/indels) identification when CIGAR string is not
available (default: None)
optional arguments:
-h, --help show this help message and exit
-F {T,S,B} Input file type. T: Table, S: SAM, B: BAM (default: T)
-k Keep intermediate output files (default: False)
--log {DEBUG,INFO,WARN}
log level (default: INFO)
--lf LOG_FIN Name of log file (default: syri.log)
--dir DIR path to working directory (if not current directory).
All files must be in this directory. (default: None)
--prefix PREFIX Prefix to add before the output file Names (default: )
--seed SEED seed for generating random numbers (default: 1)
--nc NCORES number of cores to use in parallel (max is number of
chromosomes) (default: 1)
--novcf Do not combine all files into one output file
(default: False)
-f Filter out low quality alignments (default: True)
SR identification:
--nosr Set to skip structural rearrangement identification
(default: False)
--tdgaplen TDGL Maximum allowed gap-length between two alignments of a
multi-alignment translocation or duplication (TD).
Larger values increases TD identification sensitivity
but also runtime. (default: 500000)
--tdmaxolp TDOLP Maximum allowed overlap between two translocations.
Value should be in range (0,1]. (default: 0.8)
-b BRUTERUNTIME Cutoff to restrict brute force methods to take too
much time (in seconds). Smaller values would make
algorithm faster, but could have marginal effects on
accuracy. In general case, would not be required.
(default: 60)
--unic TRANSUNICOUNT Number of uniques bps for selecting translocation.
Smaller values would select smaller TLs better, but
may increase time and decrease accuracy. (default:
1000)
--unip TRANSUNIPERCENT
Percent of unique region requried to select
translocation. Value should be in range (0,1]. Smaller
values would allow selection of TDs which are more
overlapped with other regions. (default: 0.5)
--inc INCREASEBY Minimum score increase required to add another
alignment to translocation cluster solution (default:
1000)
--no-chrmatch Do not allow SyRI to automatically match chromosome
ids between the two genomes if they are not equal
(default: False)
ShV identification:
--nosv Set to skip structural variation identification
(default: False)
--nosnp Set to skip SNP/Indel (within alignment)
identification (default: False)
--all Use duplications too for variant identification
(default: False)
--allow-offset OFFSET
BPs allowed to overlap (default: 5)
--cigar Find SNPs/indels using CIGAR string. Necessary for
alignment generated using aligners other than nucmers
(default: False)
-s SSPATH path to show-snps from mummer (default: show-snps)
In case the chromosome IDs for the two assemblies are not identical, SyRI would try to find homologous chromosomes and then map their IDs to be identical. This behaviour can be turned off using the --no-chrmatch
parameter.
Other parameters in this section regulate how translocation and duplications (TDs) are identified. For small networks of overlapping candidate TDs, SyRI uses a brute-force method to find the optimal set of TDs. The time allowed to this method can be restricted using the -b
parameter. If for a network, brute-force method take more than the assigned time, then it will automatically switch to a randomized-greedy method. The --unic
and --unip
parameters state how unique a candidate TD need to be. Candidates which overlap highly with syntenic path and inversions and thus do not pass these thresholds will be filtered out. From a network of candidate TDs, it is possible to select different set of candidates. The --inc
threshold is used decide whether a new set of candidates is better then the current candidate and thus can be selected as the solution or not.
The --allow-offset
parameter is used to define a threshold to decide whether two consecutive alignments within an annotated blocks are overlapping or not. Alignments, for which number of overlapping bases are more than --allow-offset
will result CNVs (copyloss/copygain).
Short variations (SNPs, small indels) are identified by either using CIGAR strings for the alignments or using the show-snps
utility in MUMmer package (requires .delta
file). User can set --cigar
when using CIGAR. If the show-snps
is not in enironment path, then -s
can be used to provide path to it. By default, SyRI do not report short variations within duplicated regions because they lack one-to-one mapping between regions, which in turn renders short variations ambiguous. However, user can set --all
which will return all short variations within all annotated alignments.