Fast Whole-Genome Phylogeny by Compression: the COVID-19 case

posted on 07.06.2021, 18:54 by Paul Vitanyi, Rudi Cilibrasi
We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, which causes the COVID-19 disease, using compression in the form of the alignment-free NCD (Normalized Compression Distance) method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6,500 viruses. The results show that the SARS-CoV-2 virus is closest in that database to the RaTG13 virus and rather close to the bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6,500 viruses with larger NCDs are identified (given by their registration codes). The NCDs are compared with the NCDs between the mtDNAs of familiar species. We also address the question of whether pangolins are involved in the SARS-CoV-2 virus. The NCD method, or the compression method for short, is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here we use it for the complex case of determining the similarity between the COVID-19 virus SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely match earlier efforts by alignment-based methods and a machine-learning method, providing the most compelling evidence to date that the compression method can achieve equivalent results both simply and fast.
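The idea behind the NCD can be illustrated with a short sketch. For two sequences x and y and a compressor C that returns a compressed length, the distance is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where xy is the concatenation of x and y. The Python below is a minimal illustration only: the choice of `lzma` as the compressor and the helper names `compressed_size`, `ncd`, and `rank_by_ncd` are assumptions made for this example, not details taken from the abstract.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Length in bytes of the compressed data (lzma is an assumed compressor)."""
    return len(lzma.compress(data, preset=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte sequences."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_by_ncd(query: bytes, database: dict[str, bytes]) -> list[tuple[str, float]]:
    """Return (name, NCD) pairs sorted from most to least similar to the query."""
    scored = [(name, ncd(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

With genomes loaded as byte strings, something like `rank_by_ncd(query_genome, {"virus_a": genome_a, "virus_b": genome_b})` would list the database entries from smallest to largest NCD; the dictionary keys here are placeholders. No alignment step is involved: similarity is inferred only from how much better the pair compresses together than apart.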




Submitting Author's Institution

CWI & University of Amsterdam
