syri

Synteny and Rearrangement Identifier

File formats

Input file format

The input whole-genome alignment can be provided in SAM/BAM format or as a TSV file. The columns of the TSV file are:

Column Number Value Type
1 reference start position (1-based, includes start position) int
2 reference end position (1-based, includes end position) int
3 query start position (1-based, includes start position) int
4 query end position (1-based, includes end position) int
5 alignment length in reference int
6 alignment length in query int
7 alignment identity (in percent, 0-100) float
8 alignment direction in reference (always 1) int
9 alignment direction in query (1 for directed alignments, -1 for inverted alignments) int
10 chromosome ID in reference string
11 chromosome ID in query string
12 CIGAR string corresponding to the alignment (Optional; ‘=’ for match, ‘X’ for mismatch, ‘D’ for deletion, ‘I’ for insertion) string

Genomes must be provided in multi-FASTA format. Alternatively, a nucmer-generated .delta file can be provided in place of the CIGAR string for SNP identification.
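As an illustration of the TSV column layout listed above, the following Python sketch reads one alignment line into a typed record. It is not part of SyRI: the names `Alignment` and `parse_alignment_line` are hypothetical, the example line is made up, and the sketch assumes tab-delimited fields with the CIGAR column either present or absent.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    """One row of the TSV alignment input (hypothetical helper type)."""
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int
    ref_len: int
    qry_len: int
    identity: float
    ref_dir: int
    qry_dir: int
    ref_chrom: str
    qry_chrom: str
    cigar: Optional[str]


def parse_alignment_line(line: str) -> Alignment:
    """Parse a tab-delimited line with 11 columns, or 12 when a CIGAR is given."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (11, 12):
        raise ValueError(f"expected 11 or 12 columns, got {len(fields)}")

    # Columns 1-6: positions and alignment lengths (int).
    ref_start, ref_end, qry_start, qry_end, ref_len, qry_len = map(int, fields[:6])
    # Column 7: identity in percent.
    identity = float(fields[6])
    if not 0.0 <= identity <= 100.0:
        raise ValueError(f"identity out of range: {identity}")

    # Columns 8-9: direction in reference (always 1) and in query (1 or -1).
    ref_dir, qry_dir = int(fields[7]), int(fields[8])
    if ref_dir != 1:
        raise ValueError("reference direction must be 1")
    if qry_dir not in (1, -1):
        raise ValueError("query direction must be 1 or -1")

    # Column 12 is optional.
    cigar = fields[11] if len(fields) == 12 and fields[11] else None
    return Alignment(ref_start, ref_end, qry_start, qry_end, ref_len, qry_len,
                     identity, ref_dir, qry_dir, fields[9], fields[10], cigar)


# Made-up example line without a CIGAR string.
example = "1\t100\t5\t104\t100\t100\t98.5\t1\t1\tChr1\tChr1"
aln = parse_alignment_line(example)
print(aln.ref_chrom, aln.ref_start, aln.ref_end, aln.cigar)
```

The range checks only mirror what the column descriptions state (identity between 0 and 100, reference direction always 1); they are a convenience for catching malformed rows, not a statement of what SyRI itself validates.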

Output file format

SyRI outputs results in TSV and VCF formats.

TSV format specifications

Column Number Value Type
1 chromosome ID in reference string
2 reference start position (1-based, includes start position) int
3 reference end position (1-based, includes end position) int
4 sequence in reference (Only for SNPs and indels) string
5 sequence in query (Only for SNPs and indels) string
6 chromosome ID in query string
7 query start position (1-based, includes start position) int
8 query end position (1-based, includes end position) int
9 unique ID (annotation type + number) string
10 parent ID (annotation type + number) string
11 Annotation type string
12 Copy status (for duplications) string

The annotation type can take the following values:

Annotation Meaning   Annotation Meaning
SYN Syntenic region   SYNAL Alignment in syntenic region
INV Inverted region   INVAL Alignment in inverted region
TRANS Translocated region   TRANSAL Alignment in translocated region
INVTR Inverted translocated region   INVTRAL Alignment in inverted translocated region
DUP Duplicated region   DUPAL Alignment in duplicated region
INVDP Inverted duplicated region   INVDPAL Alignment in inverted duplicated region
NOTAL Un-aligned region   SNP Single nucleotide polymorphism
CPG Copy gain in query   CPL Copy loss in query
HDR Highly diverged regions   TDM Tandem repeat
INS Insertion in query   DEL Deletion in query
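When post-processing the output, it can help to hold the annotation table above as a lookup. The Python sketch below transcribes the table into a dictionary and pairs each alignment-level type with its region-level type as laid out in the table. The names `ANNOTATION_MEANING`, `ALIGNMENT_TO_BLOCK`, and `is_alignment` are hypothetical and not part of SyRI.

```python
# Annotation codes and their meanings, transcribed from the table above.
ANNOTATION_MEANING = {
    "SYN": "Syntenic region",
    "SYNAL": "Alignment in syntenic region",
    "INV": "Inverted region",
    "INVAL": "Alignment in inverted region",
    "TRANS": "Translocated region",
    "TRANSAL": "Alignment in translocated region",
    "INVTR": "Inverted translocated region",
    "INVTRAL": "Alignment in inverted translocated region",
    "DUP": "Duplicated region",
    "DUPAL": "Alignment in duplicated region",
    "INVDP": "Inverted duplicated region",
    "INVDPAL": "Alignment in inverted duplicated region",
    "NOTAL": "Un-aligned region",
    "SNP": "Single nucleotide polymorphism",
    "CPG": "Copy gain in query",
    "CPL": "Copy loss in query",
    "HDR": "Highly diverged regions",
    "TDM": "Tandem repeat",
    "INS": "Insertion in query",
    "DEL": "Deletion in query",
}

# Alignment-level types paired with the region type they belong to.
# An explicit mapping is used because stripping an "AL" suffix would
# wrongly treat NOTAL (un-aligned region) as an alignment type.
ALIGNMENT_TO_BLOCK = {
    "SYNAL": "SYN",
    "INVAL": "INV",
    "TRANSAL": "TRANS",
    "INVTRAL": "INVTR",
    "DUPAL": "DUP",
    "INVDPAL": "INVDP",
}


def is_alignment(annotation: str) -> bool:
    """Return True if the annotation code denotes an alignment within a region."""
    return annotation in ALIGNMENT_TO_BLOCK


print(ANNOTATION_MEANING["INVTR"], "|", is_alignment("NOTAL"))
```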

Copy status describes whether the duplicated region is in query (copygain, i.e. query has the extra copy) or in reference (copyloss, i.e. reference has the extra copy).

Parent ID corresponds to the unique ID of the annotated block (syntenic region or structural rearrangement) in which the alignment or local variation lies. For example, if there is an A->T SNP (with unique ID SNP1) in a translocated region (unique ID TRANS1) at position Chr1:10 on the reference and Chr2:542 on the query, then the corresponding entry would be:

Chr1  10  10  A  T  Chr2  542  542  SNP1  TRANS1  SNP  -
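The entry above can be read back into named fields with a few lines of Python. This is a minimal sketch, not SyRI code: `SyriRecord` and `parse_record` are hypothetical names, fields are assumed to be tab-delimited, and a lone `-` is assumed to mark an empty field, as the copy status column does in the example entry.

```python
from typing import NamedTuple, Optional


class SyriRecord(NamedTuple):
    """One row of the TSV output (hypothetical helper type)."""
    ref_chrom: Optional[str]
    ref_start: Optional[int]
    ref_end: Optional[int]
    ref_seq: Optional[str]
    qry_seq: Optional[str]
    qry_chrom: Optional[str]
    qry_start: Optional[int]
    qry_end: Optional[int]
    unique_id: str
    parent_id: Optional[str]
    annotation: str
    copy_status: Optional[str]


def _opt(value: str) -> Optional[str]:
    # Assumption: "-" denotes an empty field.
    return None if value == "-" else value


def _opt_int(value: str) -> Optional[int]:
    return None if value == "-" else int(value)


def parse_record(line: str) -> SyriRecord:
    """Parse one tab-delimited output line with exactly 12 columns."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    return SyriRecord(
        _opt(f[0]), _opt_int(f[1]), _opt_int(f[2]),
        _opt(f[3]), _opt(f[4]),
        _opt(f[5]), _opt_int(f[6]), _opt_int(f[7]),
        f[8], _opt(f[9]), f[10], _opt(f[11]),
    )


# The SNP entry from the example above, joined with tabs.
row = "\t".join(["Chr1", "10", "10", "A", "T", "Chr2", "542", "542",
                 "SNP1", "TRANS1", "SNP", "-"])
rec = parse_record(row)
print(rec.unique_id, "lies in", rec.parent_id)
```

Grouping records by `parent_id` then collects every alignment and local variation that belongs to a given syntenic region or structural rearrangement.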

VCF format

The above information is also written in VCF (v4.3) format. However, because VCF records are anchored to positions in the reference genome, un-aligned regions in the query genome are not written to the VCF file: they cannot be expressed in terms of reference positions.