Genomics - Normalization of variants



Variant Normalization

Variant representation

  • Variants are nucleotide changes that are represented as differences from the human reference genome sequence (hg19/GRCh37 or hg38/GRCh38)
  • Representing variants in a standard format is critical for annotating and reporting them correctly for research and diagnostic purposes.
    • Non-standard or incorrect representation of a variant can negatively affect patient care, for example through misdiagnosis or clinical trial ineligibility

HGVS (Human Genome Variation Society) nomenclature

  • HGVS nomenclature provides recommendations for standardized descriptions of variants and is the de facto standard for clinical reporting of sequence variants
    • ACMG (American College of Medical Genetics and Genomics) recommends using HGVS nomenclature for variants in genetic diagnostic reports
  • HGVS nomenclature is currently used as an international standard to report and exchange information about variants. The standard provides recommendations for variants found in DNA, RNA, and protein sequences.
    • g. is used for genomic variants
    • c. is used for coding sequence variants in genes
    • p. is used for protein sequence variants
    • r. is used for RNA sequence variants
  • HGVS nomenclature is available for different types of variants
    • Substitutions
    • Deletions
    • Insertions
    • Duplications
    • Insertion/deletions (indels)
    • CNVs
  • The reference sequence identifier must be specified as part of the variant description, along with the change in the sequence
Example of a genomic variant description

chr19:g.42474436C>T

C to T substitution variant in chromosome 19 at position 42474436
  • chr19 is the reference identifier
  • g. denotes a genomic or chromosomal variant
  • 42474436 is the location of the variant on the reference contig
  • C is the nucleotide in the reference
  • T is the nucleotide in the sample 
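
The parts listed above can be pulled out of the description programmatically. The following is a minimal sketch, assuming only simple g. substitutions of a single nucleotide written in the style used here; the names GenomicSubstitution and parse_genomic_substitution are hypothetical and not part of any HGVS library.

```python
import re
from typing import NamedTuple


class GenomicSubstitution(NamedTuple):
    """Components of a simple genomic substitution description."""
    reference: str  # reference identifier, e.g. "chr19"
    position: int   # 1-based location on the reference contig
    ref_base: str   # nucleotide in the reference
    alt_base: str   # nucleotide in the sample


# Illustrative pattern: covers only "<reference>:g.<position><ref>><alt>".
_G_SUBSTITUTION = re.compile(
    r"^(?P<reference>[^:]+):g\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_genomic_substitution(description: str) -> GenomicSubstitution:
    """Split a description such as 'chr19:g.42474436C>T' into its parts."""
    match = _G_SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"not a simple g. substitution: {description!r}")
    return GenomicSubstitution(
        reference=match["reference"],
        position=int(match["position"]),
        ref_base=match["ref"],
        alt_base=match["alt"],
    )


variant = parse_genomic_substitution("chr19:g.42474436C>T")
print(variant.reference, variant.position, variant.ref_base, variant.alt_base)
```

Other variant types (deletions, duplications, c. and p. descriptions) need a fuller grammar than this single pattern.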

HGVS recommendations for variant normalization

The same genomic variant can be written in several equivalent ways, so variants must be converted to a single normalized form. Variant normalization is the process of representing a variant in a unique and compact form. It makes variants from different sources comparable and thus improves the accuracy of variant analysis.

Parsimony rule
  • Variants should be represented with the fewest possible nucleotides; superfluous nucleotides on the left or right side that exactly match the reference should be clipped
chr19                       42474436          
Reference : 5' -- TCGGCAGCCTCGTACGCCAG -- 3'
Sample    : 5' -- TCGGCAGCCTTGTACGCCAG -- 3'
    • chr19:g.42474435TC>TT [incorrect - leftmost T is the same in sample and reference]
    • chr19:g.42474435TCG>TTG [incorrect - leftmost T and rightmost G are the same in sample and reference]
    • chr19:g.42474436CG>TG [incorrect - rightmost G is the same in sample and reference]
    • chr19:g.42474436C>T [correct - minimal representation of the change in nucleotide]
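
The clipping described by the parsimony rule can be sketched as below. This is an illustration under stated assumptions, not a reference implementation: trim_variant is a hypothetical helper, and it keeps at least one nucleotide on each side so that a substitution stays a substitution.

```python
from typing import Tuple


def trim_variant(position: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Clip nucleotides shared by ref and alt on the right, then on the left.

    position is the 1-based coordinate of the first nucleotide of ref.
    """
    # Clip matching nucleotides from the right side; position is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Clip matching nucleotides from the left side; each clip moves the
    # start position one base to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        position += 1
    return position, ref, alt


# The three non-minimal descriptions above reduce to the same variant.
for pos, ref, alt in [(42474435, "TC", "TT"),
                      (42474435, "TCG", "TTG"),
                      (42474436, "CG", "TG")]:
    print(trim_variant(pos, ref, alt))  # (42474436, 'C', 'T') each time
```
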
3' rule
  • For a duplication or deletion variant in a repeat region, the 3'-most position on the reference sequence (the C-terminal end at the protein level) should be used as the variant position. Hence this rule is also called the "C-terminal rule"
  • This helps to resolve duplication or deletion variants that are represented using different reference genome positions
  • The variants obtained from a typical genomics variant detection pipeline are represented as "left-aligned", since gene and strand details are only obtained after the annotation process
    • The IGV (Integrative Genomics Viewer) view of an alignment file does not follow the 3' rule
    • A VCF (Variant Call Format) file from the GATK pipeline does not follow the 3' rule
  • Since the 3' rule is based on the transcript strand, it can only be applied after gene annotation has been done for the variant
    • Gene annotation is the process of assigning gene and transcript details to a specific variant using Ensembl, NCBI, or GENCODE gene models.
Consider the following example: the deletion of one T from a stretch of three Ts on chromosome 1 of the reference genome.

                   pos:100346672
                         | 
      Ref : 5'--- ACAATTGTTTCTAT---3'
A. sample : 5'--- ACAATTG-TTCTAT---3' chr1:g.100346672delT (X)
B. sample : 5'--- ACAATTGT-TCTAT---3' chr1:g.100346673delT (X)
C. sample : 5'--- ACAATTGTT-CTAT---3' chr1:g.100346674delT (✔️)


  • It is not known which of the three Ts is deleted, so the variant is ambiguous.
  • Representations A, B, and C all describe the same deletion event. The 3' rule resolves the ambiguity.
  • The variant lies in AGL, a positive-strand gene, so the 3' end (C-terminal position) is the right-most position on the reference genome. Representation C is therefore the correct representation of the deletion event under the HGVS 3' rule, and the corresponding coding description is NM_000642.2:c.1942delT.
  • The right-most change should also be used when representing the coding sequence change
  • If the gene is on the negative strand, the left-most position on the reference genome is the 3' end (C-terminal position), and representation A is the correct form

Challenges of variant representations

  • Deletions and duplications in repeat regions are represented inconsistently across tools and databases
    • Popular variant callers (GATK, Sentieon, Pindel) use the left-aligned convention (the left-most position on the reference sequence) when representing variants in VCF format
    • Population frequency databases also use left-aligned representation for variants
    • HGVS conventions recommend right-aligned representation (the 3' rule), and this convention is now the standard for reporting variants in patients' clinical reports
  • All variants must be normalized to a single convention before comparison, analysis, or annotation

Popular tools for variant normalization

  • VEP - Ensembl's Variant Effect Predictor tool
  • Mutalyzer - LUMC Mutalyzer tools
  • SnpEff - Genetic variant annotation and functional effect prediction toolbox