Synteny and Rearrangement Identifier
SyRI uses alignments between two chromosome-level assemblies and identifies syntenic blocks, sequence variations and structural rearrangements. Since the release of SyRI in 2019, new whole-genome assembly aligners have been developed while the old aligners have been updated. Here, we analyse the performance of SyRI when using alignments generated using these the currently avaialble state-of-the-art aligners. Analysing genomes of different species with multiple parameter settings for the aligners, these benchmarks provide a quick overview of the expected syri output while also serving as guide for setting up comparative genomics pipeline for other species.
Overview of the workflow
We selected ten chromosome-level assemblies for six species with different genomic complexities.
Full list of genomes used: Genome list
For each species, twenty pair-wise genome combinations were then used for alignment.
For each genome, FASTA file was filtered to select autosomal chromosome sequences.
In addition, all homologous chromosome sequence orientations were corrected in respect to a selected genome from each species using fixchr package.
We tried using four whole-genome alignment tools to align the genome combinations.
For each alignment tool, 5 different parameter settings were selected to test variation in alignments and its effect in structural variation identification.
List of parameter settings used for each alignment tool: Parameter settings
In addition to the 3 presets corresponding for the use of sequence divergence of 5/10/20%, the changes in the parameter for chaining/alignment bandwidth and long-join bandwidth were tested.
Winnowmap requires pre-computation of high frequency k-mers in the selected reference genomes. Meryl was used for k-mer counting. The top 0.02% most frequent k-mers were filtered out.
# Count high frequency k-mers
meryl count k=19 output merylDB asm1.fa
# Filter for the top 0.02% most frequent k-mers
meryl print greater-than distinct=0.9998 merylDB > repetitive_k19.txt
As Winnowmap was developed on top of Minimap2 codebase, parameter set tested for Minimap2 were also tested for Winnowmap.
AnchorWave generates whole-genome alignment using collinear blocks. These blocks are identified via conserved anchors corresponding to full-length CDS and exon.
As AnchorWave utilizes CDS GFF3 to identify the anchors, we standardized the GFF files for the conventional reference genomes from each species into a comprehensive GFF3 format using AGAT.
agat_convert_sp_gxf2gxf.pl -g infile.gff
Prior to the alignment, the identification of anchors requires 2 steps.
Extraction of full-length CDS from the reference genome sequence and gene annotation in GFF3 format as input.
Lift over of the start and end positions of reference full-length CDS to the query genome.
# Extract full-length CDS
AnchorWave gff2seq -r ref.fa -i ref.gff -o cds.fa
# Lift over to reference and query genome
Minimap2 -x splice -t 10 -k 12 -a -p 0.4 -N 20 (ref.fa/query.fa) cds.fa > (ref.sam/query.sam)
Since, AnchorWave requires high-quality GFFs, for each species nine pair-wise genome combinations were tested where the nine non-reference species were compared against the reference genome.
Alignments were performed using the proali function, designed for genome alignments involving relocation variations, chromosome fusion, or whole-genome duplication. Alongside the default settings, parameter adjustments for sequence alignment window width and inter-anchor length threshold (for finding new anchors) were tested.
To identify structural rearrangements, the fragmented MAF output was converted to SAM/BAM format. The maf-convert script was used for this conversion, with the option to include dictionary of sequence lengths, in order to comply with the format required in the downstream SyRI usage.
python maf-convert.py sam AnchorWave.maf -d > AnchorWave.sam
Alignments were filtered to retain only those with at least one matched base in the CIGAR string and appropriate start and end positions. Furthermore, to account for the rare cases in which the inverse alignments’ start and stop positions are identical, the BAM files were also filtered to remove these alignments.
NUCmer is an executable within MUMmer, an alignment tool for DNA and protein sequences. NUCmer is used for the all-vs-all comparison of nucleotide sequences, best used for highly similar sequence that may have large rearrangements.
In addition to the default settings and those from the original SyRI paper, parameter adjustments for minimum cluster length, alignment extension distance, and minimum maximal exact match length were tested.
Two major issues with NUCmer prevented its results from being included in this report:
--maxmatch
with assemblies containing centromeric sequence.Due to these limitations, results from NUCmer could not be included.
By default, SyRI filters out low quality and small alignments. Here, we tested syri with the default filtering and with the full list of alignments (without any filtering).
# SyRI with filter option (default)
syri -c input.bam -r ref.fa -q query.fa -F B
# SyRI without filter option
syri -c input.bam -r ref.fa -q query.fa -F B -f
Overall, AnchorWave produces significantly more alignments than the other two aligners. However, the fraction of the reference/query genomes mapped differs to a lesser degree, suggesting that Minimap2 and Winnowmap generate fewer alignments while covering a larger portion of the genomes.
Higher sequence divergence parameters (parameters 2 & 3) for Minimap2 and Winnowmap had the greatest impact on plant genomes, resulting in fewer alignments and a higher fraction of the reference/query genomes mapped. The parameter set 4 of AnchorWave lowers the inter-anchor length threshold resulting in fewer alignments compared to the other parameters, particularly in mammalian genomes, where the number of alignments becomes similar to that of Minimap2/Winnowmap.
Number of alignments and bases mapped
Number of alignments
Fraction of the reference genome mapped
Fraction of the query genome mapped
Syntenic regions
Syri reports syntenic blocks defined as colliear syntenic alignments uninterrupted by rearrangements. That means that the number of reported unique syntenic blocks is dependent on the number of identified strucutral rearrangements.
For highly similar genomes (such as from the same ecotype, isolate, breed, or cultivar), syntenic blocks had fewer breaks. This illustrates that similar genomes have fewer rearrangements resulting in few longer syntenic blocks.
In general, syri run with alignments generated by Winnowmap reported more fragmented syntenic blocks compared to Minimap2, suggesting higher propensity to find structural rearrangements with Winnowmap alignments. The number and proportion of syntenic regions were also affected by the default filtering of syri. While winnowmap and minimap2 produced similar results, the syntenic regions from alignments generated with Anchorwave were quite different. Possibly, because Anchorwave identifed collinear alignments which limited the detection power of structural rearrangements.
Syri’s output is also dependent on the quality of genome assembly. In our dataset, one human genome assembly was of lower quality (had 10-times more scaffolds) compared to the others. Such assemblies can result in fragmentation of identifiable synteny leading to a larger number of syntenic regions (twice for one example with winnowmap) compared to the overall population.
Number of syntenic regions
Fraction of reference genome corresponding to the syntenic regions
Fraction of query genome corresponding to the syntenic regions
Structural rearrangements
Higher sequence divergence parameters (parameters 2 & 3) in Minimap2 and Winnowmap identified more structural rearrangements (cumulative count of inversions, duplications, and translocations identified by SyRI). This effect is more pronounced in Winnowmap, where most structural rearrangements correspond to duplications.
Number of structural rearrangements
Fraction of reference genome corresponding to the structural rearrangements
Fraction of query genome corresponding to the structural rearrangements
Not aligned regions
Diverging genomic region keep accumulating mutations. Over time, this divergence could become so large that the whole-genome aligners cannot align these originally same sequences with high confidence, thus resulting in not aligning regions. However, it is also possible that a sequence get replaced by another sequence resulting in co-located but not aligning regions in the genomes. Distinguishing between the two is not trivial.
By default, SyRI filters out small and low quality alignments. This can lead to a higher proportion of not aligned regions compared to when no alignments get filtered out. This effect is stronger in species where different strains are more diverged to each other (eg: S. tuberosum).
In some species, differences in the normalized length of reference/query unaligned regions are observed for AnchorWave. These differences may result from the presence of regions in one genome that are absent in the other genome (e.g., centromeric regions) or from variations in genome assembly quality.
Fraction of reference genome corresponding to the not aligned regions
Fraction of query genome corresponding to the not aligned regions
In general, Minimap2 and Winnowmap showed similar alignment statistics to AnchorWave. However, Minimap2 and Winnowmap generated fewer alignments that covered more reference/query genome sequences. Additionally, the alignments from these two aligners exhibited more variation across different parameters when aligning plant genomes.
The use of higher sequence divergence parameters in Minimap2 and Winnowmap, resulted in more alignments with lower identities. Consequently, syri identified more structural rearrangements identified, specifically duplications.
AnchorWave produced MAF output with a large number of alignments, many of which were filtered out when SyRI’s filtering parameter was applied. This suggests that many of the alignments were small and of low quality.
A lower inter-anchor length threshold in AnchorWave, below which new anchor search is stopped (parameter 4), reduces the number of alignments while increasing alignment size and identity in mammalian genomes. This observation runs in parallel with the similar number of syntenic and highly diverged regions between the non-filtered and filtered sets.
All genomic FASTA and GFF files were retrieved from the NCBI database.
Species | Assembly | Accession number |
---|---|---|
S. cerevisiae | R64 | GCF_000146045.2 |
ASM4093860v1 | GCA_040938605.1 | |
ASM3086668v1 | GCA_030866685.1 | |
ASM3029217v1 | GCA_030292175.1 | |
ScYPH499_1.0 | GCA_026000965.1 | |
ScPE_H3_f | GCA_905220325.1 | |
Sc_YJM1078_v1 | GCA_000975645.3 | |
ASM2350882v1 | GCA_023508825.1 | |
ASM29281v1 | GCA_000292815.1 | |
Sc_YJM1250_v1 | GCA_000976935.2 | |
D. melanogaster | Release 6 plus ISO1 MT | GCF_000001215.4 |
ASM2977509v1 | GCA_029775095.1 | |
ASM231075v1 | GCA_002310755.1 | |
ASM4260644v1 | GCA_042606445.1 | |
RU_dmel_BG3_1.0 | GCA_034768405.1 | |
DGRP379 | GCA_004798055.1 | |
ASM1583244v1 | GCA_015832445.1 | |
ASM2014167v1 | GCA_020141675.1 | |
ASM1585258v1 | GCA_015852585.1 | |
UCI_ORw1118_1.0 | GCA_024500395.1 | |
A. thaliana | TAIR10.1 | GCF_000001735.4 |
Col-CC | GCA_028009825.2 | |
Ler-0.7213 | GCA_946406525.1 | |
Cvi-0.6911 | GCA_946414125.1 | |
Tanz-1.10024 | GCA_946409825.1 | |
Rabacal-1.22005 | GCA_946406895.1 | |
IP-Mos-9 | GCA_946411805.1 | |
IP-Hom-0 | GCA_946411425.1 | |
MERE-A-13 | GCA_946411655.1 | |
IP-Sln-22 | GCA_946413935.1 | |
S. tuberosum | solTubStieglitzHap1 | GCA_020169555.1 |
ASM1507626v1 | GCA_015076265.1 | |
ASM1418947v1 | GCA_014189475.1 | |
ASM982717v1 | GCA_009827175.1 | |
ASM982715v1 | GCA_009827155.1 | |
solTubHeraHap2 | GCA_020169575.1 | |
solTubHeraHap1 | GCA_020169585.1 | |
ASM1418930v1 | GCA_014189305.1 | |
ASM1418299v2 | GCA_014182995.2 | |
ASM1418298v2 | GCA_014182985.2 | |
B. taurus | ARS-UCD2.0 | GCF_002263795.3 |
UOA_Wagyu_1 | GCA_040286185.1 | |
UOA_Tuli_1 | GCA_040285425.1 | |
NRF | GCA_963921495.1 | |
SNU_Hanwoo_2.0 | GCA_028973685.2 | |
ARS-LIC_NZ_Jersey | CA_021234555.1 | |
ASM3988117v1 | GCA_039881175.1 | |
ASM4388211v1 | GCA_043882115.1 | |
ROSLIN_BTT_NDA1 | GCA_905123515.1 | |
ARS-LIC_NZ_Holstein-Friesian_1 | GCA_021347905.1 | |
H. sapiens | T2T-CHM13v2.0 | GCF_009914755.1 |
GRCh38.p14 | GCF_000001405.40 | |
hg002v1.1.mat | GCA_018852615.3 | |
hg01243.v3.0 | GCA_018873775.2 | |
Han1 | GCA_024586135.1 | |
hg002v1.1.pat | GCA_018852605.3 | |
HS1011_v1.1 | GCA_001292825.2 | |
KOREF1.0 | GCA_001712695.1 | |
ASM1490585v1 | GCA_014905855.1 | |
PGP1v1 | GCA_020497115.1 |
| Aligner | Parameter | Description |
| ——- | ——— | ———– |
| Minimap2 | minimap2 -ax asm5 –eqx
Alignment sizes and identities
Alignment sizes
Alignment identities
Forward alignments
Fraction of the forward alignments
Inversions
Number of inversions
Fraction of reference genome corresponding to the inversions
Fraction of query genome corresponding to the inversions
SNPs
Number of SNPs
Insertions and Deletions
Fraction of query genome corresponding to the insertions
Fraction of reference genome corresponding to the deletions
Copy gains and Copy losses
Number of copy gains
Fraction of query genome corresponding to the copy gains
Number of copy losses
Fraction of reference genome corresponding to the copy losses
Highly diverged
Number of highly diverged regions
Fraction of reference genome corresponding to the highly diverged regions
Fraction of query genome corresponding to the highly diverged regions
Tandem repeats
Number of the tandem repeats
Fraction of reference genome corresponding to the tandem repeats
Fraction of query genome corresponding to the tandem repeats