Prediction of positivity for SARS-CoV-2 from complete blood counts on a million-scale using machine learning
Kunumi

Can artificial intelligence detect COVID-19 in a blood count? Yes, with more than one million exams
The COVID-19 pandemic exposed not only the limits of our health infrastructure but also the weaknesses of diagnostic methods. RT-PCR, despite being considered the gold standard for detecting SARS-CoV-2, showed its limitations in terms of cost, time, and sensitivity. Serology tests, in turn, present good accuracy, but only in advanced stages of the infection. Given this scenario, Brazilian researchers proposed something bold: what if it were possible to diagnose COVID-19 just from a routine and cheap blood test, like the complete blood count?
This is exactly the question that drives the study led by Gianlucca Zuin, in collaboration with researchers from UFMG, Kunumi, Huna and Grupo Fleury. The proposal is ambitious: to use machine learning to detect SARS-CoV-2 infections based on subtle patterns present in the blood count. What sets the work apart is the volume of data (more than one million tests) and the sophistication of the models, which go beyond COVID-19 and include other respiratory infections such as Influenza and H1N1 to avoid biases. Let's understand why this can change the way we diagnose infectious diseases.
Automated and inclusive diagnosis
The central idea of the work is relatively simple: to use blood counts – cheap, fast and widely available tests – as a data source to diagnose COVID-19 with high accuracy. But the execution of this is anything but trivial.
The great advantage lies in the size and diversity of the dataset: more than 1 million blood counts and 1 million RT-PCR tests were analyzed, covering more than 900 thousand unique individuals between 2016 and 2021. Among these, about 21% tested positive for SARS-CoV-2. Furthermore, the study incorporated data from patients with other respiratory infections, such as Influenza A, B, and H1N1, totaling an additional 120 thousand blood counts. Why is this important? Because many respiratory diseases produce similar changes in the blood. Ignoring this factor would be like training a sniffer dog with only one scent — and then expecting it to identify different fragrances.
Although the diagnosis is treated as a binary classification problem, the proposed model avoids oversimplifying it. It does not just try to distinguish positive from negative, but models complex relationships between multiple hematological indicators and accounts for possible interference from other viruses. The key insight: using a stacking model, which combines several classifiers specialized in different viruses, to reach a robust final decision.
A hematological puzzle solved by machines
Hemograms as biological sensors
The complete blood count measures characteristics of blood cells: white cells, red cells, platelets and their proportions. In isolation, however, these variables are difficult to interpret: lymphopenia, for example, is common in several viral infections. By applying machine learning, the researchers built a computational lens capable of detecting non-linear relationships and hidden patterns among the different blood components.
The data were collected with methodological rigor from the laboratories of Grupo Fleury, with 72 analysis machines distributed throughout Brazil. All tests were normalized by sex and age group, avoiding biases related to distinct biotypes. Furthermore, the authors created derived variables — such as neutrophil/lymphocyte and platelet/monocyte ratios — recognized as relevant inflammatory markers.
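To make the preprocessing concrete, here is a minimal sketch of the two steps described above: computing derived ratios and normalizing a feature within each sex and age group. The article does not say which normalization was used, so the z-score below is an assumption, and the field names (`neutrophils`, `age_group`, and so on) are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def add_ratios(record):
    """Return a copy of the record with two derived inflammatory ratios."""
    out = dict(record)
    out["nlr"] = record["neutrophils"] / record["lymphocytes"]  # neutrophil/lymphocyte
    out["pmr"] = record["platelets"] / record["monocytes"]      # platelet/monocyte
    return out

def normalize_by_group(records, feature):
    """Z-score a feature within each (sex, age_group) stratum.

    Assumption: the study's exact normalization is not described in the
    article; a per-group z-score is used here only for illustration.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["sex"], r["age_group"])].append(r[feature])
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    result = []
    for r in records:
        mu, sigma = stats[(r["sex"], r["age_group"])]
        # A group with no spread gets 0.0 to avoid dividing by zero.
        z = (r[feature] - mu) / sigma if sigma > 0 else 0.0
        result.append({**r, feature + "_z": z})
    return result
```

Normalizing inside each stratum means a value is judged against people of the same sex and age group rather than against the whole population.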
The model that learns from all viruses
To avoid the trap of over-specialization in SARS-CoV-2, the team proposed a system of specialized models. Each infection (COVID-19, Influenza A, Influenza B, H1N1, etc.) had its own classifier, trained only with data from that disease. The outputs of these models were then combined by a final stacked classifier, trained with balanced data. The tool used was LightGBM, a gradient boosting algorithm based on decision trees, known for its efficiency and performance.
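The stacking idea can be sketched in a few lines. This is not the authors' implementation: to stay dependency-free, a tiny logistic regression stands in for LightGBM, and all function names are hypothetical. What it illustrates is only the structure: one specialist per disease, then a meta-classifier trained on the specialists' outputs.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_stack(datasets, X_meta, y_meta):
    """Train one specialist per disease, then a meta-model on their outputs.

    datasets: {disease_name: (X, y)}, each used only for its own specialist.
    X_meta, y_meta: a separate (ideally balanced) set for the final classifier,
    where y_meta marks the target disease (here, COVID-19).
    """
    specialists = {name: fit_logistic(X, y) for name, (X, y) in datasets.items()}
    names = sorted(specialists)
    # Each meta-feature is one specialist's predicted probability.
    Z = [[predict_proba(specialists[n], x) for n in names] for x in X_meta]
    meta = fit_logistic(Z, y_meta)
    return specialists, names, meta

def stack_predict(stack, x):
    specialists, names, meta = stack
    z = [predict_proba(specialists[n], x) for n in names]
    return predict_proba(meta, z)
```

Because the final classifier sees the influenza specialists' opinions alongside the COVID-19 one, it can learn to discount a "COVID-like" blood count when another specialist explains it better.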
The most relevant attributes were selected through a directed acyclic graph that represents the space of variable combinations. The search in this space used the A* algorithm with AUROC as a heuristic, so that only the best-scoring subsets were kept. In the end, the most important combinations included ratios such as RBC/RDW, WBC/RBC and hemoglobin/MCHC.
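A search of this kind can be illustrated with a simplified best-first traversal over feature subsets. The article does not detail the cost and heuristic functions of the A* formulation, so the sketch below is a stand-in, not the study's algorithm: it always expands the subset with the highest score, where `score` is a caller-supplied function (for instance, validation AUROC). All names are hypothetical.

```python
import heapq

def best_first_subsets(features, score, max_size=3, max_expansions=50):
    """Explore feature subsets best-first; returns (best_subset, best_score).

    Nodes are frozensets of features and each edge adds one feature, so the
    search space is a directed acyclic graph rooted at the empty set.
    """
    start = frozenset()
    # heapq is a min-heap, so scores are negated to pop the best node first.
    frontier = [(0.0, 0, start)]
    seen = {start}
    best_subset, best_score = start, float("-inf")
    counter = 1  # tie-breaker so that frozensets are never compared
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, node = heapq.heappop(frontier)
        expansions += 1
        if len(node) >= max_size:
            continue  # do not grow subsets beyond the size limit
        for f in features:
            child = node | {f}
            if child in seen:
                continue  # already scored (or f was already in the node)
            seen.add(child)
            s = score(child)
            if s > best_score:
                best_subset, best_score = child, s
            heapq.heappush(frontier, (-s, counter, child))
            counter += 1
    return best_subset, best_score
```

In practice `score` would train and validate a model on the given subset, which is expensive; the `seen` set ensures that each subset is evaluated only once even though it can be reached through several paths in the graph.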
Hemograms as a portrait of the invisible
An invisible precision
The models were validated with standard metrics. When trained only with COVID-19 data from the first wave of the pandemic in Brazil, the model reached an AUROC of 0.922, with sensitivity of 82.4% and specificity of 91.8%. In other words, it flagged roughly four out of five positive cases and correctly cleared more than nine out of ten negative ones, using only the blood count and before any confirmation by RT-PCR.
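For readers less familiar with these metrics, the sketch below shows how they are computed from true labels and predicted scores. It is a plain illustration of the standard definitions, not the study's evaluation code; the 0.5 threshold is an arbitrary default.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, y_score):
        predicted_positive = s >= threshold
        if y == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)

def auroc(y_true, y_score):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. This pairwise form is quadratic in the number
    of samples, which is fine for illustration but slow on large datasets.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity depend on the chosen threshold, while AUROC summarizes ranking quality across all thresholds, which is why the article reports both.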
But the beauty of the study lies in how it deals with the passage of time. By training two distinct models, one with data from the first wave and another with data from the second, the researchers showed how virus mutations (such as the P.1 variant) affect the performance of the algorithms. While the first-wave model loses accuracy as the pandemic progresses, the second-wave model maintains an AUROC above 0.95.
Separating COVID from other viruses
Training with COVID-19 data alone led to false positives when the model was tested on patients with Influenza from 2019. Introducing specialized models for each virus, however, drastically reduced these errors. On the 2019 data, about 80% of the stacked model's predictions indicated less than a 30% chance of COVID-19, even in patients with similar symptoms. The result: a specific, sensitive and, above all, reliable model.
Simulating an endemic future
Another essential experiment simulated scenarios of varied COVID-19 prevalence, from 1% to 90%, combining data from the second wave with data from before the pandemic. The model kept its AUROC above 0.89 throughout, with good sensitivity and specificity even in contexts where COVID-19 was rare. This suggests that the approach is viable in a post-pandemic world, with the virus circulating endemically.
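One simple way to build such scenarios is to resample positives and negatives until the mix matches a target prevalence. The sketch below assumes sampling with replacement from two pools (for example, second-wave positives and pre-pandemic negatives); the article does not describe the exact procedure, so the function and its parameters are hypothetical.

```python
import random

def resample_to_prevalence(positives, negatives, prevalence, n, seed=0):
    """Draw n labelled samples so that round(n * prevalence) are positive.

    Sampling is with replacement (an assumption), so small pools can still
    produce large evaluation sets. Returns a shuffled list of (item, label).
    """
    rng = random.Random(seed)  # fixed seed keeps each scenario reproducible
    n_pos = round(n * prevalence)
    sample = [(x, 1) for x in rng.choices(positives, k=n_pos)]
    sample += [(x, 0) for x in rng.choices(negatives, k=n - n_pos)]
    rng.shuffle(sample)
    return sample
```

Evaluating the same trained model on sets built for each prevalence level shows whether its performance holds when the disease becomes rare.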
Analogies with real life
Think of a coffee shop that knows its loyal customers: it notices subtle patterns — like who likes stronger coffee, who prefers cold drinks on hot days. Now imagine that, suddenly, customers with different habits appear (tourists, for example). If the coffee shop had only learned from the local customers, it would have difficulty pleasing the new ones. The same happens with COVID-19 models: training only with the “customers” from the first wave prevents recognizing the nuances of the new viral “visits”.
The study not only acknowledges this, but proposes a way to continuously learn from different viral patterns. A model trained with the variety of respiratory viruses acts like that experienced barista, who knows how to adapt the service according to the audience — maintaining quality and avoiding mistakes.
Bias and reproducibility: the blind spots of medical AI
Compared with smaller studies, such as those by Avila, Silveira, Banerjee and Cabitza, the Brazilian study stands out for its volume of data, viral diversity and methodological robustness. Models built on a few hundred patients tend to overestimate results. Even studies like Soltan's, which included 115 thousand people, do not come close to the volume or the sophistication presented here.
Moreover, the authors draw attention to the selection bias present in pandemic studies: as other viruses temporarily disappeared due to the use of masks and isolation, many models were trained in “sterile” environments free of viral confounding. When we return to the real world — with other respiratory infections in circulation — these models fail.
The proposed approach solves this problem elegantly: by explicitly considering other diseases as part of the modeling, it prevents false diagnoses and increases the clinical reliability of the system.
What to do with all this?
The study is not just a technical demonstration. It is an invitation to rethink how we use common exams in everyday medical practice. If a blood count can predict COVID-19 with an AUROC above 0.9, what else can it reveal? And what if we expanded this to other infectious, inflammatory, or autoimmune diseases?
The practical application is underway. The model is already being integrated into Brazilian hospitals through APIs connected to data systems. The next steps involve evaluating the practical impact on hospital routine: do professionals trust the tool? Does it change clinical decisions? Does it reduce triage time?
But more than that, the study offers a starting point for future research in various directions: co-infections, influence of ethnicity on hematological patterns, impact of vaccination on results and generalization to other countries.
The most important thing is that all of this starts from a simple, cheap and already available test — the blood count. By turning a routine test into an intelligent sensor, we open space for a new era of accessible, inclusive and, above all, human diagnosis.
If you want to understand in depth how this model works, the challenges faced and the possibilities of clinical application, it is worth reading the full article. It is one of the few studies that manages to balance scale, technical sophistication and real impact.