Variant Normalization
Variant representation
- Variants are nucleotide changes represented as differences from the human reference genome sequence (hg19/GRCh37 or hg38/GRCh38)
- Representing variants in a standard format is critical for annotating and reporting them correctly for research and diagnostic purposes.
- Non-standard or incorrect representation of a variant can harm patient care, for example through misdiagnosis or clinical trial ineligibility
HGVS (Human Genome Variation Society) nomenclature
- HGVS nomenclature provides recommendations for the standardized description of variants and is the de facto standard for clinical reporting of sequence variants
- ACMG (American College of Medical Genetics and Genomics) recommends using HGVS nomenclature for variants in genetic diagnostic reports
- HGVS nomenclature is currently used as an international standard to report and exchange information about variants. The standard provides recommendations for variants found in DNA, RNA, and protein sequences.
- g. is used for genomic variants
- c. is used for coding sequence variants in genes
- p. is used for protein sequence variants
- r. is used for RNA sequence variants
- HGVS nomenclature is available for different types of variants
- Substitutions
- Deletions
- Insertions
- Duplications
- Deletion-insertions (indels)
- CNVs
- The reference sequence identifier must be specified as part of the variant description, along with the change in the sequence
Example of a genomic variant description
chr19:g.42474436C>T
C to T substitution variant in chromosome 19 at position 42474436
- chr19 is the reference identifier
- g. denotes a genomic or chromosomal variant
- 42474436 is the location of the variant on the reference contig
- C is the nucleotide in the reference
- T is the nucleotide in the sample
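The parts listed above can be pulled out of the description programmatically. The following is a minimal sketch, assuming only simple genomic substitutions of the form shown in the example; the names `GenomicSubstitution` and `parse_genomic_substitution` are hypothetical and not part of any HGVS library.

```python
import re
from typing import NamedTuple


class GenomicSubstitution(NamedTuple):
    """Hypothetical container for the parts of a g. substitution."""
    reference: str   # reference identifier, e.g. "chr19"
    position: int    # location on the reference contig
    ref_base: str    # nucleotide in the reference
    alt_base: str    # nucleotide in the sample


# Assumption: only single-nucleotide substitutions such as "chr19:g.42474436C>T".
_SUBSTITUTION = re.compile(
    r"^(?P<reference>[^:]+):g\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_genomic_substitution(description: str) -> GenomicSubstitution:
    """Split a simple genomic substitution description into its parts."""
    match = _SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"not a simple genomic substitution: {description!r}")
    return GenomicSubstitution(
        reference=match["reference"],
        position=int(match["position"]),
        ref_base=match["ref"],
        alt_base=match["alt"],
    )


variant = parse_genomic_substitution("chr19:g.42474436C>T")
print(variant.reference, variant.position, variant.ref_base, variant.alt_base)
```

Deletions, insertions, duplications, and the c., p., and r. levels need their own patterns; a production pipeline would normally rely on an established HGVS parser rather than a hand-written regular expression.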
HGVS recommendations for variant normalization
The same genomic variant can be written in several equivalent ways, so each variant must be reduced to a single normalized form. Variant normalization is the process of representing variants in a unique and compact form. It makes variants from different sources comparable and thus improves the accuracy of variant analysis.
Parsimony rule
- Variants should be represented with the fewest possible nucleotides; superfluous nucleotides on the left or right side that exactly match the reference should be clipped
chr19 42474436
Reference : 5' -- TCGGCAGCCTCGTACGCCAG -- 3'
Sample : 5' -- TCGGCAGCCTTGTACGCCAG -- 3'
- chr19:g.42474435TC>TT [incorrect - leftmost T is the same in sample and reference]
- chr19:g.42474435TCG>TTG [incorrect - leftmost T and rightmost G are the same in sample and reference]
- chr19:g.42474436CG>TG [incorrect - rightmost G is the same in sample and reference]
- chr19:g.42474436C>T [correct - minimal representation of the change in nucleotide]
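The clipping in the parsimony rule can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function name `trim_variant` is hypothetical, and it assumes substitution-style alleles, keeping at least one base on each side.

```python
def trim_variant(position: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Clip bases shared by the reference and sample alleles.

    Assumption: both alleles keep at least one base, which suits the
    substitution example; indels need additional handling.
    """
    # Clip matching bases from the right; the position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Clip matching bases from the left; each clipped base moves the position by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        position += 1
    return position, ref, alt


# The three non-minimal descriptions above all reduce to the same variant.
for pos, ref, alt in [(42474435, "TC", "TT"),
                      (42474435, "TCG", "TTG"),
                      (42474436, "CG", "TG")]:
    new_pos, new_ref, new_alt = trim_variant(pos, ref, alt)
    print(f"chr19:g.{new_pos}{new_ref}>{new_alt}")
```

Each of the three inputs prints `chr19:g.42474436C>T`, the minimal representation.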
3' rule
- For a duplication or deletion in a repeat region, the most 3' (C-terminal) position of the reference sequence should be used as the variant position. This rule is therefore also called the "C-terminal rule"
- This resolves duplication or deletion variants that could otherwise be assigned to several different reference positions
- Variants from a typical variant detection pipeline are "left-aligned", since gene and strand details only become available after annotation
- The IGV (Integrative Genomics Viewer) view of an alignment file does not follow the 3' rule
- VCF (Variant Call Format) files from a GATK pipeline do not follow the 3' rule
- Since the 3' rule depends on the transcript strand, it can only be applied after the variant has been annotated with a gene
- Gene annotation is the process of assigning gene and transcript details to a specific variant using Ensembl, NCBI, or GENCODE gene models.
Let us take the following example: the deletion of one T in a stretch of three Ts on chromosome 1 of the reference genome.
                   pos:100346672
                         |
Ref       : 5'--- ACAATTGTTTCTAT---3'
A. sample : 5'--- ACAATTG-TTCTAT---3'   chr1:g.100346672delT (X)
B. sample : 5'--- ACAATTGT-TCTAT---3'   chr1:g.100346673delT (X)
C. sample : 5'--- ACAATTGTT-CTAT---3'   chr1:g.100346674delT (✔️)
- It cannot be determined whether the first, second, or third T was deleted, so the variant is ambiguous.
- Representations A, B, and C all denote the same deletion event. The 3' rule resolves the ambiguity.
- The variant lies in AGL, a positive-strand gene, so the 3' end (C-terminal position) is the right-most position on the reference genome. Representation C is therefore the correct representation of the deletion under the HGVS 3' rule: NM_000642.2:c.1942delT.
- The right-most position should also be used when describing the coding sequence change
- If the gene is on the negative strand, the left-most position is the 3' end (C-terminal position), so representation A is the correct form
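The strand-dependent shifting described above can be sketched in a few lines. This is a simplified illustration that works on the short reference snippet from the example only; `shift_deletion` and `describe_deletion` are hypothetical names, and `REF_START` assumes that position 100346672 is the first T of the TTT stretch.

```python
REF = "ACAATTGTTTCTAT"    # reference snippet from the example
REF_START = 100346665     # assumed genomic position of REF[0]


def shift_deletion(ref_seq: str, start: int, length: int, strand: str) -> int:
    """Return the 0-based start of a deletion moved to the 3' end of a repeat.

    Moving the deleted window by one base leaves the resulting sequence
    unchanged whenever the base entering the window equals the base leaving it.
    """
    if strand == "+":
        # 3' end is to the right on the reference.
        while start + length < len(ref_seq) and ref_seq[start] == ref_seq[start + length]:
            start += 1
    elif strand == "-":
        # 3' end is to the left on the reference.
        while start > 0 and ref_seq[start - 1] == ref_seq[start + length - 1]:
            start -= 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return start


def describe_deletion(position: int, strand: str) -> str:
    """Describe a single-base deletion after shifting it for the given strand."""
    index = shift_deletion(REF, position - REF_START, 1, strand)
    return f"chr1:g.{REF_START + index}del{REF[index]}"


print(describe_deletion(100346673, "+"))  # representation C
print(describe_deletion(100346673, "-"))  # representation A
```

Starting from the middle T (representation B), the positive-strand call moves to position 100346674 and the negative-strand call to 100346672, matching representations C and A.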
Challenges of variant representations
- Deletions and duplications in repeat regions are represented inconsistently across tools and databases
- Popular variant callers (GATK, Sentieon, Pindel) use the left-aligned convention (left-most position on the reference sequence) when writing variants in VCF format
- Population frequency databases also use the left-aligned representation for variants
- HGVS recommends the right-aligned convention (3' rule), which is now the standard for reporting variants in patients' clinical reports
- All variants must be normalized to a single convention before comparison, analysis, or annotation
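To see why normalization must come before comparison, consider the chromosome 1 deletion example from the 3' rule section. The sketch below, using hypothetical helper names and assuming that position 100346672 is the first T of the TTT stretch, shows that a left-aligned and a right-aligned description differ as strings even though they produce the same sample sequence.

```python
REF = "ACAATTGTTTCTAT"    # reference snippet from the deletion example
REF_START = 100346665     # assumed genomic position of REF[0]


def apply_deletion(ref_seq: str, start: int, length: int = 1) -> str:
    """Return the sequence left after deleting `length` bases at 0-based `start`."""
    return ref_seq[:start] + ref_seq[start + length:]


def same_event(pos_a: int, pos_b: int, length: int = 1) -> bool:
    """Check whether two deletion positions yield the same sample sequence."""
    seq_a = apply_deletion(REF, pos_a - REF_START, length)
    seq_b = apply_deletion(REF, pos_b - REF_START, length)
    return seq_a == seq_b


left_aligned = "chr1:g.100346672delT"    # convention of variant callers
right_aligned = "chr1:g.100346674delT"   # convention of the HGVS 3' rule

print(left_aligned == right_aligned)         # naive string comparison
print(same_event(100346672, 100346674))      # sequence-level comparison
```

The string comparison reports a mismatch while the sequence-level comparison reports the same event, which is why a database lookup keyed on the unnormalized description can miss a known variant.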
Popular tools for variant normalization
- VEP - Ensembl's Variant Effect Predictor
- Mutalyzer - LUMC Mutalyzer tools
- SnpEff - Genetic variant annotation and functional effect prediction toolbox