Complete blood count and AI: an alliance for the early detection of breast cancer. / Kunumi Institute

Kunumi

Despite advances in diagnostic medicine, detecting breast cancer early and accessibly remains a challenge, especially in regions with limited resources. Risk models like Tyrer-Cuzick, although widely used, depend on specific and not always available data, such as family history and breast density. Others, like MIRAI — based on mammograms — face similar limitations, especially in countries where access to mammography is scarce.

In this context, a multidisciplinary group of Brazilian researchers from institutions such as UFMG, Grupo Fleury, INCA and Huna Ltd* proposed a bold approach: what if we could predict the risk of breast cancer from a simple blood test? What if this analysis were enhanced by machine learning algorithms? These questions guided the study led by Daniella Castro Araújo, in collaboration with researchers such as Karina Braga Gomes, Adriano Veloso and Ismael Dale Cotrim Guerreiro da Silva.

Starting from the idea that the complete blood count — a cheap and widely available test — could contain indirect signs of breast cancer, the group proposed using artificial intelligence to identify patterns that would go unnoticed by conventional analyses.

When blood speaks more than we imagine

The proposal of the paper is simple, but bold: to transform the complete blood count into a personalized screening tool for breast cancer. The exam itself is well known, routine, cheap and usually done by anyone who goes through a check-up. But what few imagine is that it can contain subtle clues about the presence of tumors.

The innovation here is not in the exam, but in the way we interpret its data. With the help of machine learning algorithms, the authors sought to capture non-linear relationships between blood markers and the presence or risk of breast cancer. Unlike traditional models, which rely on fixed rules, the proposal is to identify complex patterns that are difficult to detect with classical statistics alone.

This approach is particularly relevant for low-resource health systems. Risk-based screening, using data already available in medical records, can prioritize those who most need mammography — without requiring new exams or expensive technologies.

A mathematical look at the signs of the body

The method used combines classical statistics with state-of-the-art artificial intelligence. The team worked with an impressive data set: 396,848 hemograms from women between 40 and 70 years old, collected between 2004 and 2022, in 309 units of the Fleury Group spread across eight Brazilian states.

First, breast cancer cases were identified based on biopsies or images classified as BI-RADS 5. Controls came from women with normal imaging exams (BI-RADS 1 or 2), including only those who remained cancer-free throughout a follow-up of 4.5 to 6.5 years, a requirement that increased the robustness of the sample.

The set was divided into two: one for modeling (training and validation) and another for testing, ensuring that the model evaluation was done blindly, without data leakage.
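A minimal sketch of such a split, assuming a stratified random hold-out. The article does not say how the split was drawn, so the function name `stratified_split`, the 80/20 proportion and the seed are illustrative choices, not the authors' settings.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (modeling_idx, test_idx).

    Cases and controls are shuffled and split separately, so both parts
    keep roughly the same prevalence and no record lands in both parts.
    """
    rng = random.Random(seed)
    modeling_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        modeling_idx.extend(idx[n_test:])
    return sorted(modeling_idx), sorted(test_idx)

# Toy labels (assumption): 10 cases (1) among 1,000 women.
labels = [1] * 10 + [0] * 990
modeling_idx, test_idx = stratified_split(labels)
```

The test indices are set aside once and only used for the final evaluation; every modeling decision (variable selection, tuning) would use `modeling_idx` alone.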

To build the predictive models, two techniques were tested: ridge regression, known for its ability to handle multicollinearity, and LightGBM, a tree-based boosting algorithm, fast and efficient for large volumes of data. The goal was simple: to predict, based on the blood count, whether a woman would have a breast cancer diagnosis in the next six months.
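To make the first technique concrete, here is a small sketch that assumes "ridge regression" means a logistic model with an L2 (ridge) penalty, fitted here by plain gradient descent. The function `fit_ridge_logistic`, the toy data and all hyperparameters are hypothetical and not taken from the paper.

```python
import math

def fit_ridge_logistic(X, y, l2=1.0, lr=0.1, epochs=2000):
    """Fit logistic regression with an L2 (ridge) penalty by gradient descent.

    X: list of feature rows (assumed already standardized), y: 0/1 labels.
    Returns (intercept, weights). The intercept is not penalized.
    """
    n, d = len(X), len(X[0])
    b, w = 0.0, [0.0] * d
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * d
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
            err = p - target
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        b -= lr * grad_b / n
        # Data gradient plus the ridge term, which pulls weights toward zero.
        w = [wj - lr * (gw / n + l2 * wj / n) for wj, gw in zip(w, grad_w)]
    return b, w

# Toy data (assumption): two nearly collinear features, overlapping classes.
X = [[-1.5, -1.4], [-1.0, -1.1], [-0.5, -0.4], [0.0, 0.1],
     [0.5, 0.4], [1.0, 1.1], [1.5, 1.4], [0.2, 0.1]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

_, w_plain = fit_ridge_logistic(X, y, l2=0.0)
_, w_ridge = fit_ridge_logistic(X, y, l2=5.0)
```

With the penalty switched on, the weights of the two correlated features shrink, which is the behavior that makes ridge models stable when predictors overlap in the information they carry.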

The variables used included 13 basic blood biomarkers and seven derived indices, such as the neutrophil-lymphocyte ratio (NLR), the lymphocyte-monocyte ratio (LMR), the systemic inflammation index (SII) and others. The choice of the best predictors was made by an algorithm guided by a directed acyclic graph, which prioritized the incremental gain in AUC with each new variable added.
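The idea of adding one variable at a time and keeping it only if the AUC improves can be sketched as a greedy forward selection. This is a simplification: the graph-guided search described in the paper is not reproduced here, and `forward_select`, `toy_auc`, the feature names and the stopping threshold are all hypothetical.

```python
def forward_select(candidates, auc_of, min_gain=0.005):
    """Greedy forward selection by incremental AUC gain.

    candidates: feature names; auc_of: callable that returns the validation
    AUC of a model built on a given list of features.
    Stops when the best remaining feature adds less than min_gain.
    """
    selected, best_auc = [], 0.5   # 0.5 is the AUC of an uninformative model
    remaining = list(candidates)
    while remaining:
        gains = {f: auc_of(selected + [f]) - best_auc for f in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            break
        selected.append(best)
        best_auc += gains[best]
        remaining.remove(best)
    return selected, best_auc

# Stand-in evaluator with made-up gains; in practice this would fit a model
# on the training data and score it on the validation data.
TOY_GAINS = {"f1": 0.10, "f2": 0.05, "f3": 0.02, "f4": 0.001}

def toy_auc(features):
    return 0.5 + sum(TOY_GAINS[f] for f in features)

selected, final_auc = forward_select(["f4", "f2", "f1", "f3"], toy_auc)
```

In this toy run the fourth feature is dropped because its gain falls below the threshold, which mirrors how a long list of candidate markers can end up as a very short final model.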

In the end, the three best predictors were: age, NLR, and red blood cell count (RBC). With them, the ridge model achieved an AUC of 0.64 on the test set — modest, but promising, especially considering the low cost and wide availability of the blood count.
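To read a number such as 0.64, it helps to remember that the AUC is the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below computes it directly from that definition; the scores and labels are toy values invented for illustration.

```python
def auc(scores, labels):
    """AUC as the share of (case, control) pairs where the case scores higher.

    Ties count as half a win. labels: 1 = case, 0 = control.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))

# Toy example (assumption): 2 cases and 4 controls.
scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2]
labels = [1, 1, 0, 0, 0, 0]
toy_auc_value = auc(scores, labels)   # 6 of the 8 pairs are ordered correctly
```

A value of 0.5 means the model ranks no better than chance and 1.0 means every case is ranked above every control.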

When numbers tell stories: inside the experiment

Screening like a supermarket queue

Imagine a supermarket line where not all people have the same degree of urgency. Some have small children, others are just buying a chocolate bar. What if there was a way to prioritize those who need to be served first?

This was the idea behind the risk stratification proposed by the authors. Based on the estimated probability of breast cancer, each woman was classified into one of four groups: high, moderate, average, and low risk. The relative risks for these groups were 1.99, 1.32, 1.02, and 0.42, respectively. That is, women in the high-risk group were almost twice as likely to have cancer as the population average.
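The arithmetic behind these figures can be sketched as follows, assuming that relative risk here means the cancer rate within a group divided by the rate in the whole population. The cutoffs, the toy cohort and the function names are hypothetical; the article does not state how the group boundaries were chosen.

```python
import bisect

def stratify(prob, cutoffs):
    """Map a predicted probability to a tier: 1 = lowest risk, 4 = highest.

    cutoffs: three ascending thresholds (hypothetical values below).
    """
    return bisect.bisect_right(cutoffs, prob) + 1

def relative_risks(tiers, labels):
    """Cancer rate inside each tier divided by the overall rate."""
    overall = sum(labels) / len(labels)
    result = {}
    for tier in sorted(set(tiers)):
        outcomes = [y for t, y in zip(tiers, labels) if t == tier]
        result[tier] = (sum(outcomes) / len(outcomes)) / overall
    return result

# Toy cohort (assumption): 4 tiers of 100 women with 1, 2, 3 and 4 cases.
cutoffs = [0.005, 0.01, 0.02]
probs = [0.002] * 100 + [0.007] * 100 + [0.015] * 100 + [0.03] * 100
labels = ([1] * 1 + [0] * 99) + ([1] * 2 + [0] * 98) \
       + ([1] * 3 + [0] * 97) + ([1] * 4 + [0] * 96)

tiers = [stratify(p, cutoffs) for p in probs]
rr = relative_risks(tiers, labels)
```

In this toy cohort the overall rate is 2.5%, so the top tier (4%) has a relative risk of 1.6 and the bottom tier (1%) a relative risk of 0.4.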

This logic can be used to reorganize the queue for mammograms in the SUS (Sistema Único de Saúde, Brazil's public health system). Instead of calling women only by age or order of arrival, it would be possible to start with those who, according to the model, have a higher risk, making better use of scarce resources.

The paradox of the simple exam

The blood count has always been there. But, with the right eyes — or the right algorithms — it reveals patterns previously invisible. The data showed that women with cancer had, on average, higher levels of hematocrit, hemoglobin, mean corpuscular volume (MCV), and inflammatory indices such as AISI, dNLR, NLR, PLR, SII, and SIRI. In contrast, lymphocytes and the lymphocyte-monocyte ratio (LMR) were lower.
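For readers unfamiliar with these indices, the sketch below computes them using the definitions commonly found in the hematology literature. The article does not spell out the formulas, so the paper's exact definitions may differ; the function name and the example values are illustrative.

```python
def derived_indices(neutrophils, lymphocytes, monocytes, platelets, leukocytes):
    """Inflammatory indices derived from a complete blood count.

    All inputs are absolute counts in the same unit (e.g. 10^3 cells/uL).
    Formulas follow common usage and are an assumption about the paper.
    """
    return {
        "NLR": neutrophils / lymphocytes,                 # neutrophil-lymphocyte ratio
        "dNLR": neutrophils / (leukocytes - neutrophils), # derived NLR
        "LMR": lymphocytes / monocytes,                   # lymphocyte-monocyte ratio
        "PLR": platelets / lymphocytes,                   # platelet-lymphocyte ratio
        "SII": platelets * neutrophils / lymphocytes,     # systemic immune-inflammation index
        "SIRI": neutrophils * monocytes / lymphocytes,    # systemic inflammation response index
        "AISI": neutrophils * monocytes * platelets / lymphocytes,
    }

# Illustrative values only, not patient data.
example = derived_indices(
    neutrophils=4.0, lymphocytes=2.0, monocytes=0.5,
    platelets=250.0, leukocytes=7.0,
)
```

Because lymphocytes sit in the denominator of most of these ratios, lower lymphocyte counts push NLR, PLR, SII, SIRI and AISI up and LMR down, which matches the direction of the differences described above.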

It is as if the body, even before the tumor is detected by imaging, already gives subtle signs that something is wrong. A more activated immune system, silent inflammation, small changes in red blood cells — all of this can be translated into numbers.

A statistical lens on a population scenario

The prevalence of cancer in the sample was only 0.72%, which creates a typical imbalance challenge. To deal with this, the authors adjusted the case weights in the training of the models, so as not to let the thousands of controls dilute the few positive cases.
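One common way to set such weights is inverse class frequency, so that cases and controls carry the same total weight during training. The article does not say which scheme the authors used, so the sketch below is an assumption, and the cohort of 10,000 is a toy figure chosen only to match the 0.72% prevalence.

```python
def balanced_weights(labels):
    """Inverse-frequency weights: n / (n_classes * class_count)."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}

# Toy cohort (assumption): 72 cases among 10,000 women, i.e. 0.72%.
labels = [1] * 72 + [0] * 9928
weights = balanced_weights(labels)

# Total weight carried by each class after reweighting.
case_total = weights[1] * labels.count(1)
control_total = weights[0] * labels.count(0)
```

Here each case weighs about 69 times more than it would unweighted, while each control weighs about half, so an error on a case costs the model far more than an error on a control.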

Additionally, to ensure statistical robustness, the performance of the models was evaluated with 1,000 bootstrap resamples. The technique repeatedly draws samples from the data with replacement, which allows confidence intervals to be estimated for the performance metrics.
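A minimal sketch of a bootstrap confidence interval for the AUC, assuming the simple percentile method (the article does not state which variant was used). The synthetic scores, the seeds and the function names are illustrative.

```python
import random

def auc(scores, labels):
    """Share of (case, control) pairs ranked correctly; ties count half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:       # a resample needs both cases and controls
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[round(n_boot * alpha / 2)]
    upper = stats[round(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Synthetic data (assumption): cases score slightly higher on average.
gen = random.Random(1)
labels = [1] * 10 + [0] * 30
scores = [gen.gauss(0.5, 1.0) for _ in range(10)] \
       + [gen.gauss(0.0, 1.0) for _ in range(30)]

ci_low, ci_high = bootstrap_auc_ci(scores, labels)
```

With only 40 synthetic records the interval is wide; with hundreds of thousands of blood counts, as in the study, the same procedure would be expected to give a much narrower one.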

The final model, with NLR, RBC and age, achieved an AUC of 0.64 with ridge regression, slightly ahead of LightGBM (AUC of 0.63). This may seem modest at first glance, but in low-resource health systems even a small gain in prediction can mean earlier diagnoses and lives saved.

Interpretability above all

A strong point was the choice of simple and interpretable models. Ridge regression, for example, allows you to directly see the weight of each variable. In LightGBM, the authors used the SHAP method to interpret the importance of each attribute in the model's decision. In both cases, age was the main predictor, followed by the NLR and red blood cell count.
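To illustrate what "seeing the weight of each variable" means in a linear model, the sketch below breaks one prediction into per-variable contributions: weight times the distance of the value from the population mean. For a linear model with independent features this is also what SHAP values reduce to. The weights, means and patient values are entirely hypothetical and are not the coefficients from the paper.

```python
def linear_score(weights, intercept, row):
    """Linear predictor (log-odds) of a linear model."""
    return intercept + sum(weights[name] * row[name] for name in weights)

def linear_contributions(weights, means, row):
    """Contribution of each variable relative to an average individual."""
    return {name: weights[name] * (row[name] - means[name]) for name in weights}

# Hypothetical numbers for illustration only.
weights = {"age": 0.04, "NLR": 0.30, "RBC": 0.50}
means = {"age": 55.0, "NLR": 2.0, "RBC": 4.5}
intercept = -8.0

patient = {"age": 65.0, "NLR": 3.0, "RBC": 4.5}
contrib = linear_contributions(weights, means, patient)

# The contributions add up to the gap between this patient and the average.
gap = linear_score(weights, intercept, patient) - linear_score(weights, intercept, means)
```

In this made-up case the patient's score is above average mostly because of age and NLR, while the red blood cell count, being exactly average, contributes nothing.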

Tests, adjustments and the absence of the obvious

When analyzing the data, the authors did not find striking differences in several classic biomarkers, such as eosinophils, leukocytes, or platelets. This reinforces the idea that few markers are informative in isolation, but combined they tell a different story. The secret lies in the correlations, the non-linearities, and the hidden patterns that machine learning is well placed to detect.

Future scenarios and limitations

The study recognizes important limitations. There is no data on comorbidities, medication use or ethnicity — which may introduce biases. There was also no external validation in different populations. Even so, the authors make it clear that this is just the beginning. With the inclusion of new markers, such as cytokines, sex hormones and thyroid hormones, the model may gain accuracy.

Moreover, the model was designed for women who are already in the recommended age range for mammography but have not yet had the exam. In this context, the risk of “over-mammographing” is zero — all should already be screened. The model only rearranges the queue, putting those who are more likely to have a detectable cancer at the front.

Blood, algorithms and access: where do we go now?

This study is a brave provocation. What if an exam that is already routinely done could be used for much more? What if artificial intelligence helped us read the signals of our own body better, with tools we already have?

It is not about replacing mammography, but about enhancing it. About making better use of the data we already have. About creating a new logic of screening, fairer, more precise, more accessible.

The invitation to reflection remains: how many other tools do we already use without having explored their full potential? How many other everyday exams may contain hidden signs that artificial intelligence can reveal?

This work opens space for a new line of research that connects laboratory medicine, data science, and public health. An area where collaboration between clinical, laboratory, and computational teams is essential. A frontier where health care begins with a more attentive look — and with algorithms that learn from the details.

For those interested in the idea, it is well worth diving into the full paper. It brings technical details, tables, metrics, and statistical analyses that show the seriousness and robustness of this approach. It is an excellent example of how data science can directly serve people's lives.

*Huna is a Brazilian deep-tech company that uses artificial intelligence for the early detection of different types of cancer, including breast cancer, based on routine data and exams.

The research presented is one of several led by Daniella Castro Araújo, PhD.