Abstract
This paper broadly addresses the impact that unprecedented levels of scientific discovery can have on the emergent global patterns that we observe in nature. An essentially ubiquitous pattern associated with large, complex discrete systems is attributable to the Conservation of Hartley-Shannon Information (CoHSI). One manifestation of CoHSI in the realm of protein structure is a distinctive equilibrium distribution of protein lengths that is dominated by a power-law. Here we examine how the accelerated pace of novel protein discovery during the Covid-19 pandemic affected this distribution, showing that, despite an initial disruption, the equilibrium state was reestablished.
Copyright © 2024 Warr Gregory, et al.
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Competing interests: The authors declare no competing interests.
Introduction
This paper uses a novel approach to study change in the rapidly evolving and globally accessible TrEMBL protein databases available at

By incorporating information theory into this methodology,

Proteins are an exemplary discrete system, in that we can consider them as strings of amino acids, the length of a protein being measured by the total number of amino acids. The TrEMBL database

Thus it was predicted (and borne out experimentally) that the length distributions of proteins would show the scale-free distributions implied by CoHSI

shown on the x-axis. On the left-hand side, it is flat corresponding to the sharp rise to the peak of

In essence the above development establishes
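To make the notion of a protein length distribution concrete, the following is a minimal sketch in Python. It treats each protein as a string of one-letter amino acid codes, measures its length as the number of letters, and estimates the exponent of a power-law tail. The function names, the threshold `x_min`, and the use of the standard continuous-approximation maximum-likelihood estimator are assumptions made for illustration only. This is not the fitting procedure used in the paper, and CoHSI predicts a distribution whose tail is a power-law rather than a pure power-law over all lengths.

```python
import math
from collections import Counter


def protein_lengths(sequences):
    """Length of each protein, i.e. the number of amino acid letters in its string."""
    return [len(seq) for seq in sequences]


def length_histogram(lengths):
    """Map each observed length to the number of proteins having that length."""
    return Counter(lengths)


def tail_exponent_mle(lengths, x_min):
    """Estimate alpha for a tail p(x) ~ x**(-alpha), using only values x >= x_min.

    Uses the continuous-approximation maximum-likelihood estimator
    alpha = 1 + n / sum(ln(x_i / x_min)), a generic technique assumed here
    for illustration. x_min is a hypothetical cut-off chosen by the user.
    """
    tail = [x for x in lengths if x >= x_min]
    if not tail:
        raise ValueError("no lengths at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0.0:
        raise ValueError("all tail values equal x_min; exponent is undefined")
    return 1.0 + len(tail) / log_sum
```

In use, `protein_lengths` would be applied to the sequences of one database release, and `tail_exponent_mle` to the resulting lengths, so that the tail exponents of successive releases could be compared. The choice of `x_min` affects the estimate and would need to be justified separately.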
Results
Having established that there is an equilibrium distribution in protein lengths, we can study different versions of TrEMBL as the database grew rapidly in the last few years. What happened between TrEMBL releases 21-03 and 22-02 to explain, first, why the CoHSI equilibrium was perturbed and, second, how it was re-established? Although the only constraints in CoHSI theory

of the database, this curation was relaxed early in the Covid-19 pandemic, when a special portal

Tens of millions of SARS-CoV-2 protein sequences have been uploaded to the protein databases, and we note that the ORF1ab polyprotein of SARS-CoV-2 contains 7096 amino acids. We suggest that the massive uploading of presumptively redundant SARS-CoV-2 sequences resulted in the perturbation of the equilibrium seen in TrEMBL release 21-03. The resumption of normal curation of the database would have eliminated the redundancies created by this large volume of identical submissions of SARS-CoV-2 proteins, reestablishing the equilibrium as seen in TrEMBL release 22-02. Thus, while the CoHSI equilibrium as exemplified globally in protein lengths is remarkably stable, it is also sensitive to the consistency of categorization, as revealed by the unprecedented number of presumably redundant SARS-CoV-2 sequences that were submitted to TrEMBL early in the Covid-19 pandemic.

While the Covid-19 pandemic perturbed the equilibrium of protein length distributions, as described above, this resulted from the unprecedented burst of research into the SARS-CoV-2 virus. However, many other aspects of the Covid-19 pandemic also show power-law behaviour, as would be expected from any large, complex discrete system and as predicted by CoHSI theory.
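The redundancy argument made above for the SARS-CoV-2 submissions can be illustrated with a toy example in Python. The sketch assumes that redundancy means exact sequence identity and that curation amounts to keeping one copy of each distinct sequence. Both are simplifying assumptions, and the actual TrEMBL curation procedure is not described here. The sequences are placeholders: the 7096-letter string merely has the same length as the ORF1ab polyprotein and is not the real sequence.

```python
from collections import Counter


def length_counts(sequences):
    """Count how many sequences occur at each length."""
    return Counter(len(seq) for seq in sequences)


def deduplicate(sequences):
    """Keep the first occurrence of each distinct sequence, preserving order.

    Exact string identity is an assumed stand-in for redundancy removal.
    """
    return list(dict.fromkeys(sequences))


# Hypothetical background proteins (placeholder strings, not real sequences).
background = ["MKT", "MKTA", "MKTAY", "MA", "MAG"]

# 1000 identical submissions of a placeholder sequence of length 7096.
redundant = ["M" + "A" * 7095] * 1000

uncurated = background + redundant
curated = deduplicate(uncurated)

before = length_counts(uncurated)  # large spike at length 7096
after = length_counts(curated)     # spike collapses to a single entry
```

In this toy setting, the count at length 7096 falls from 1000 in `before` to 1 in `after`, while the counts at all other lengths are unchanged. This is the kind of localized distortion, and recovery, that the text attributes to the relaxation and resumption of curation.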
Conclusion
It is reasonable to ask why power-laws, as described here in the impacts of the Covid-19 pandemic, are essentially ubiquitous in the natural world. As reviewed in detail in