Novel bioinformatics pipeline for fast and scalable analysis of large viral phylogenies

A team of researchers recently developed a bioinformatics approach to analyze viral phylogenetic clusters and posted their findings to the bioRxiv* preprint server.

Study: ClusTRace, a bioinformatic pipeline for analyzing clusters in virus phylogenies. Image Credit: M. PATTHAWEE/Shutterstock


Coronavirus disease 2019 (COVID-19) has become a global public health concern, and the emergence of several new severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants is alarming. The variants reported so far have been categorized as either variants of interest (VOIs) or variants of concern (VOCs). The VOCs present increased health risks due to their higher transmissibility, immune-escape properties, and lower response to existing vaccines. So far, five VOCs have been detected: Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), and Omicron (B.1.1.529).

Healthcare agencies and scientists are under growing pressure to address these health concerns. This calls for methods for early detection and in-depth analysis of emerging variants, which could provide early warnings and inform better COVID-19 management policies.

About the study

In the present study, researchers developed a novel bioinformatics pipeline named ClusTRace for fast and scalable analysis of sequence clusters or clades in large viral phylogenies. ClusTRace can perform several high-level functions such as outlier filtering, alignment, phylogenetic tree reconstruction, cluster or clade extraction, variant calling, visualization, and reporting.

It was developed to trace COVID-19 transmission, emphasizing fast and unsupervised screening of phylogenies for markers of super-spreading events, high rates of cluster growth, and the accumulation of novel mutations. ClusTRace can complement existing toolkits like Nextstrain, Pangolin, Nextclade, and Lazypipe for unsupervised clade/cluster analysis with intuitive visualizations and reporting. The team analyzed SARS-CoV-2 genomic sequence data from COVID-19 patients in Finland between January 2021 and May 2021. The SARS-CoV-2 Alpha and Beta variants were dominant in this dataset, with 5,379 and 1,051 sequences, respectively.


The researchers found that the SARS-CoV-2 Alpha variant had many high-frequency amino acid mutations that followed the GISAID reference. In contrast, only five amino acid mutations were specific to the Finnish data with 10% or higher frequency. As many as half of the mutations for the Beta variant with a frequency of 10% or higher were not covered by the GISAID reference. The team also reported non-GISAID mutations, but only the Beta variant showed non-GISAID mutations in the Spike protein, likely with the potential to affect receptor binding.
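The filtering described above (mutations at 10% or higher frequency that are absent from a reference list) can be illustrated with a minimal Python sketch. This is not ClusTRace's implementation, which the article does not describe. The function name, the input layout (one set of mutation labels per sequence), and the toy labels `toyA` and `toyB` are assumptions made for illustration only.

```python
from collections import Counter


def frequent_non_reference_mutations(sequence_mutations, reference_mutations, min_freq=0.10):
    """Return {mutation: frequency} for mutations that reach min_freq
    across the sequences and are absent from the reference set.

    sequence_mutations: list of sets, one set of mutation labels per sequence
    (hypothetical input layout, not a ClusTRace format).
    """
    n = len(sequence_mutations)
    if n == 0:
        return {}
    # Count each mutation at most once per sequence.
    counts = Counter(m for muts in sequence_mutations for m in set(muts))
    return {
        m: c / n
        for m, c in counts.items()
        if c / n >= min_freq and m not in reference_mutations
    }


# Toy data: 20 sequences that all carry one reference mutation.
toy_sequences = [{"S:N501Y"} for _ in range(20)]
for i in range(4):
    toy_sequences[i].add("toyA")   # made-up label, present in 4/20 sequences
toy_sequences[5].add("toyB")       # made-up label, present in 1/20 sequences

toy_reference = {"S:N501Y"}        # stand-in for a reference mutation list
result = frequent_non_reference_mutations(toy_sequences, toy_reference)
print(result)
```

In this toy input, only `toyA` passes both conditions: `toyB` falls below the 10% threshold, and `S:N501Y` is excluded because it is in the stand-in reference set.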

Cluster analysis yielded 110 clusters for the Alpha variant and 19 clusters for the Beta variant. For each variant, the researchers analyzed the 10 clusters with the highest monthly growth rate peaks in the study period. For the Alpha variant, these clusters covered around 58.5% of all sequences.

For the Beta variant, the ten largest clusters covered 94.5% of sequences. The number of non-GISAID mutations in these clusters ranged from one to six for the Alpha variant and from three to eight for the Beta variant. The maximal absolute growth rate, defined as the number of sequences added to a cluster per month, was between 74 and 310 for the Alpha variant, with peaks in February and March, and between 11 and 148 for the Beta variant, with peaks in February, March, and April. Cluster sizes ranged from 100 to 479 for the Alpha variant and from 14 to 259 for the Beta variant.
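As a rough illustration of the growth figures above, the following Python sketch counts sequences added to each cluster per calendar month, ranks clusters by their largest single-month count, and computes the share of sequences the selected clusters cover. It assumes the absolute growth rate is simply the number of new sequences per month, as the article describes it. The function names, the `(cluster_id, "YYYY-MM")` input format, and the toy data are hypothetical and are not taken from ClusTRace.

```python
from collections import Counter, defaultdict


def monthly_growth(cluster_samples):
    """Count sequences added to each cluster per month.

    cluster_samples: iterable of (cluster_id, "YYYY-MM") pairs, one per sequence.
    Returns {cluster_id: Counter({month: number_of_sequences})}.
    """
    growth = defaultdict(Counter)
    for cluster_id, month in cluster_samples:
        growth[cluster_id][month] += 1
    return growth


def top_clusters_by_peak(growth, n=10):
    """Return the n clusters with the largest single-month count, as (id, peak) pairs."""
    peaks = {cid: max(months.values()) for cid, months in growth.items()}
    return sorted(peaks.items(), key=lambda kv: kv[1], reverse=True)[:n]


def coverage(growth, cluster_ids):
    """Fraction of all sequences that belong to the given clusters."""
    total = sum(sum(months.values()) for months in growth.values())
    covered = sum(sum(growth[cid].values()) for cid in cluster_ids)
    return covered / total if total else 0.0


# Toy data: three made-up clusters with seven sequences in total.
samples = (
    [("c1", "2021-02")] * 3
    + [("c1", "2021-03")]
    + [("c2", "2021-03")] * 2
    + [("c3", "2021-04")]
)
growth = monthly_growth(samples)
top = top_clusters_by_peak(growth, n=2)
share = coverage(growth, [cid for cid, _ in top])
print(top, round(share, 3))
```

Ranking by the peak month rather than by total size mirrors the article's focus on clusters with the highest monthly growth rate peaks; a fast-growing small cluster can outrank a larger but slower one.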


The team demonstrated the use of ClusTRace for lineage assignment, the generation of multi-fasta collections, outlier filtering, alignment, and phylogenetic tree construction. They reported that ClusTRace could perform automated clustering coupled with cluster growth rate analysis and variant calling to scan through a phylogeny, which amounts to unsupervised phylogeny-based cluster analysis. Clusters with high growth rates and non-reference mutations in genomic regions could be easily highlighted for further downstream analysis. ClusTRace could provide different visualizations like Excel summaries and g3viz plots for growth-rate or mutation-rate clades.

In conclusion, ClusTRace could act as a bridge between the massive inflow of sequence data and the proper organization of these sequences (into lineages, alignments, etc.) to better understand the evolutionary nature of the pandemic. SARS-CoV-2 is likely to mutate and evolve into new variants in the future. The global response also requires timely interventions with newer and advanced strategies to deal with the pandemic. The increased capacity of genome sequencing across the globe could be further bolstered by developing novel bioinformatics tools for efficient and scalable genomic surveillance of viruses.

*Important notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Journal reference:
  • Plyusnin, I. et al. (2021) "ClusTRace, a bioinformatic pipeline for analyzing clusters in virus phylogenies". bioRxiv. doi: 10.1101/2021.12.09.471941.

Written by

Susha Cheriyedath

Susha has a Bachelor of Science (B.Sc.) degree in Chemistry and Master of Science (M.Sc) degree in Biochemistry from the University of Calicut, India. She always had a keen interest in medical and health science. As part of her masters degree, she specialized in Biochemistry, with an emphasis on Microbiology, Physiology, Biotechnology, and Nutrition. In her spare time, she loves to cook up a storm in the kitchen with her super-messy baking experiments.
