Fast Whole-Genome Phylogeny by Compression: the COVID-19 case
  • Paul Vitanyi,
  • Rudi Cilibrasi
Paul Vitanyi
CWI & University of Amsterdam

Corresponding Author: [email protected]

Rudi Cilibrasi


We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, which causes the COVID-19 disease, using compression in the form of the alignment-free NCD (Normalized Compression Distance) method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6,500 viruses. The results show that, within that database, the SARS-CoV-2 virus is closest to the RaTG13 virus and rather close to the bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6,500 viruses with larger NCDs are identified by their registration codes. The NCDs are compared with the NCDs between the mtDNAs of familiar species. We address the question of whether pangolins are involved in the SARS-CoV-2 virus. The NCD method, or in short the compression method, is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here we use it for the complex case of determining the similarity between the COVID-19 virus SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely match earlier results from alignment-based methods and a machine-learning method, providing the most compelling evidence to date that the compression method can achieve equivalent results both simply and fast.
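
To make the NCD idea concrete, here is a minimal sketch of the standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes. It is an illustration under stated assumptions, not the authors' pipeline: the compressor (Python's zlib) is chosen only for convenience, the sequences are synthetic rather than real genomes, and the helper names `compressed_size`, `ncd`, `random_sequence`, and `mutate` are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_sequence(length: int, seed: int) -> bytes:
    """Synthetic nucleotide string; a stand-in for a real genome."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") for _ in range(length))


def mutate(seq: bytes, rate: float, seed: int) -> bytes:
    """Redraw each base uniformly from ACGT with probability `rate`."""
    rng = random.Random(seed)
    return bytes(
        rng.choice(b"ACGT") if rng.random() < rate else base for base in seq
    )


if __name__ == "__main__":
    reference = random_sequence(2000, seed=1)
    relative = mutate(reference, rate=0.02, seed=2)  # closely related
    stranger = random_sequence(2000, seed=3)         # unrelated
    print("NCD(reference, relative):", round(ncd(reference, relative), 3))
    print("NCD(reference, stranger):", round(ncd(reference, stranger), 3))
```

A smaller NCD indicates greater similarity, so the lightly mutated copy should score well below the unrelated sequence. Note that zlib only finds repeats within a 32 KB window, so for sequences the size of whole viral genomes a compressor with a longer range (for example `lzma` or `bz2` from the standard library) would likely be a better fit.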