Vivek's Blog: Genomics - Normalization of variants

Variant Normalization

Variant representation

Variants are nucleotide changes that are represented as differences from the human reference genome sequence (hg19/GRCh37 or hg38/GRCh38).
Representing variants in a standard format is critical for correctly annotating and reporting them for research and diagnostic purposes.

Non-standard or incorrect representation of a variant can harm patient care, for example through misdiagnosis or clinical trial ineligibility.

HGVS (Human Genome Variation Society) nomenclature

HGVS nomenclature provides recommendations for standardized descriptions of variants and is the de facto standard for clinical reporting of sequence variants.

ACMG (American College of Medical Genetics and Genomics) recommends using HGVS nomenclature for variants in genetic diagnostic reports.

HGVS nomenclature is currently used as an international standard to report and exchange information about variants. The standard provides recommendations for variants found in DNA, RNA, and protein sequences.

g. is used for genomic variants
c. is used for coding sequence variants in genes
p. is used for protein sequence variants
r. is used for RNA sequence variants

HGVS nomenclature is available for different types of variants

Substitutions
Deletions
Insertions
Duplications
Insertions/Deletions (Indels)
CNVs

The reference genome identifiers must be specified as part of the variant along with the changes in the sequence

Example of a genomic variant description

chr19:g.42474436C>T

C to T substitution variant in chromosome 19 at position 42474436

chr19 is the reference identifier
g. denotes a genomic or chromosomal variant
42474436 is the location of the variant on the reference contig
C is the nucleotide in the reference
T is the nucleotide in the sample
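
The breakdown above can be sketched in code. This is a minimal illustration that only handles the simple `contig:g.positionREF>ALT` form shown in the example. The function name `parse_substitution` and the field names are hypothetical, and it is not a full HGVS parser.

```python
import re

# Matches only simple genomic substitutions such as "chr19:g.42474436C>T".
# Every other HGVS variant type (del, ins, dup, ...) is deliberately rejected.
_SUBSTITUTION = re.compile(
    r"^(?P<reference>[^:]+):(?P<prefix>g)\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(description):
    """Split a simple genomic substitution into its named parts."""
    match = _SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"not a simple genomic substitution: {description!r}")
    parts = match.groupdict()
    parts["position"] = int(parts["position"])  # position as a number
    return parts


print(parse_substitution("chr19:g.42474436C>T"))
```

For real data a dedicated HGVS library or one of the tools listed at the end of this post is a safer choice, since the full nomenclature covers many more cases than this pattern.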

HGVS recommendations for variant normalization

The same genomic variant can be written in several equivalent ways, so variants must be converted to a single normalized form. Variant normalization is the process of representing variants in a unique and compact form. It makes variants from different sources comparable and thus improves the accuracy of variant analysis.

Parsimony rule

Variants should be represented with the fewest nucleotides possible: superfluous nucleotides on the right or left side that exactly match the reference should be clipped.

chr19 42474436

Reference : 5' -- TCGGCAGCCTCGTACGCCAG -- 3'

Sample : 5' -- TCGGCAGCCTTGTACGCCAG -- 3'

chr19:g.42474435TC>TT [incorrect - leftmost T is the same in sample and reference]
chr19:g.42474435TCG>TTG [incorrect - leftmost T and rightmost G are the same in sample and reference]
chr19:g.42474436CG>TG [incorrect - rightmost G is the same in sample and reference]
chr19:g.42474436C>T [correct - minimal representation of the change in nucleotide]
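
The clipping can be sketched as a small function. This is a minimal illustration of the parsimony rule for substitutions like the one above, assuming the variant is given as a position plus reference and sample alleles. The name `trim_alleles` is hypothetical.

```python
def trim_alleles(position, ref, alt):
    """Clip bases shared by ref and alt, keeping at least one base in each."""
    # Clip matching bases from the right end; the position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Clip matching bases from the left end; each clip moves the position by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        position += 1
    return position, ref, alt


candidates = [
    (42474435, "TC", "TT"),
    (42474435, "TCG", "TTG"),
    (42474436, "CG", "TG"),
    (42474436, "C", "T"),
]
for position, ref, alt in candidates:
    print(trim_alleles(position, ref, alt))
```

All four candidates reduce to position 42474436 with reference C and sample T, which is the minimal form marked as correct above.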

3' rule

For a duplication or deletion variant in a repeat region, the most C-terminal position (3' end) of the reference sequence should be used as the variant position. Hence this rule is also called the "C-terminal rule".
This resolves duplication or deletion variants that could otherwise be described at several different reference genome positions.
The variants obtained from a typical genomic variant detection pipeline are "left-aligned", since gene and strand details are only available after the annotation process.

The IGV (Integrative Genomics Viewer) view of an alignment file does not follow the 3' rule
VCF (Variant Call Format) files from the GATK pipeline do not follow the 3' rule

Since the 3' rule is based on the transcript strand, this rule can only be applied after gene annotations are done for the variant

Gene annotation is the process of assigning gene and transcript details to a specific variant using Ensembl, NCBI, or GENCODE gene models.

Let us take the following example: the deletion of one T in a stretch of three Ts on chromosome 1 of the reference genome.

                         pos:100346672
                         |
Ref       : 5'--- ACAATTGTTTCTAT---3'
A. sample : 5'--- ACAATTG-TTCTAT---3'   chr1:g.100346672delT (X)
B. sample : 5'--- ACAATTGT-TCTAT---3'   chr1:g.100346673delT (X)
C. sample : 5'--- ACAATTGTT-CTAT---3'   chr1:g.100346674delT (✔️)

It is not known which of the three Ts (the first, second, or third) was deleted, so the variant is ambiguous.
Representations A, B, and C all denote the same deletion event. The 3' rule resolves this ambiguity.
The variant lies in AGL, a positive-strand gene, so the 3' end (C-terminal position) is the right-most position on the reference genome. Representation C is therefore the correct description of the deletion under the HGVS 3' rule: NM_000642.2:c.1942delT.
The right-most position should also be used when describing the coding sequence change.
If the gene is on the negative strand, the left-most position is the C-terminal position, and representation A is the correct form.
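
The shifting in this example can be sketched as follows. This is a minimal illustration that only looks at the short reference window shown above. The names `shift_deletion`, `apply_three_prime_rule`, and `FIRST_BASE` are hypothetical, and `FIRST_BASE` is derived from the example by assuming that position 100346672 is the first T of the TTT stretch.

```python
REFERENCE = "ACAATTGTTTCTAT"  # reference window copied from the example
FIRST_BASE = 100346665        # assumed genomic coordinate of REFERENCE[0]


def shift_deletion(sequence, start, length, direction):
    """Slide a deletion of `length` bases at 0-based `start` as far as
    possible in `direction` without changing the resulting sequence."""
    if direction == "right":
        # Moving right by one is equivalent when the base leaving the
        # deletion equals the base entering it.
        while (start + length < len(sequence)
               and sequence[start] == sequence[start + length]):
            start += 1
    elif direction == "left":
        while start > 0 and sequence[start - 1] == sequence[start + length - 1]:
            start -= 1
    else:
        raise ValueError("direction must be 'right' or 'left'")
    return start


def apply_three_prime_rule(position, length, strand):
    """Return the genomic start position preferred by the 3' rule."""
    # Positive-strand genes: 3' is to the right; negative-strand: to the left.
    direction = "right" if strand == "+" else "left"
    start = shift_deletion(REFERENCE, position - FIRST_BASE, length, direction)
    return FIRST_BASE + start


for position in (100346672, 100346673, 100346674):
    print(position, "->", apply_three_prime_rule(position, 1, "+"))
```

For the positive strand all three inputs end at 100346674 (representation C), and with strand "-" they end at 100346672 (representation A). A real implementation would have to fetch more reference sequence whenever the shift reaches the edge of the window.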

Challenges of variant representations

Deletions and duplications in repeat regions are represented differently by different tools and standards

Popular variant callers (GATK, Sentieon, Pindel) use the left-aligned convention (left-most position on the reference sequence) when representing variants in VCF format
Population frequency databases also use left-aligned representation for variants
HGVS conventions recommend the 3' rule (right-aligned for positive-strand genes), and this convention is now the standard for reporting variants in clinical reports of patients

All variants have to be normalized to one convention before they are compared, analyzed, or annotated
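
To illustrate why this matters, here is a minimal sketch that compares the same one-base deletion as reported by a left-aligned source and by a right-aligned source. It reuses the chromosome 1 reference window from the 3' rule example. The name `left_aligned_key` and the coordinate `FIRST_BASE` are assumptions made for this illustration.

```python
REFERENCE = "ACAATTGTTTCTAT"  # reference window from the chromosome 1 example
FIRST_BASE = 100346665        # assumed genomic coordinate of REFERENCE[0]


def left_aligned_key(chrom, position, deleted):
    """Return a comparison key with the deletion moved to its left-most
    equivalent position inside REFERENCE."""
    start = position - FIRST_BASE
    length = len(deleted)
    if REFERENCE[start:start + length] != deleted:
        raise ValueError("deleted bases do not match the reference")
    # Shift left while the resulting sequence stays the same.
    while start > 0 and REFERENCE[start - 1] == REFERENCE[start + length - 1]:
        start -= 1
    return chrom, FIRST_BASE + start, REFERENCE[start:start + length]


caller_variant = ("chr1", 100346672, "T")  # left-aligned, as in a VCF
report_variant = ("chr1", 100346674, "T")  # right-aligned, as in an HGVS report

# Comparing the raw tuples misses the match; comparing normalized keys finds it.
print(caller_variant == report_variant)
print(left_aligned_key(*caller_variant) == left_aligned_key(*report_variant))
```

Left alignment is used as the common convention here only because variant callers and population databases already use it. Any single convention works as long as every source is converted to it first.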

Popular tools for variant normalization

VEP - Ensembl's Variant Effect Predictor
Mutalyzer - LUMC Mutalyzer tools
SnpEff - Genetic variant annotation and functional effect prediction toolbox