Complete blood count as a biomarker for the diagnosis of severe preeclampsia: a machine learning approach / Kunumi Institute

Kunumi

Artificial intelligence turns the routine blood count into an early warning for severe preeclampsia

Even with advances in sophisticated exams, state-of-the-art ultrasound, and promising biomarkers, preeclampsia with severe signs (sPE) still goes undetected in many clinical situations. The condition affects between 5% and 8% of pregnancies and is responsible for more than 70,000 maternal deaths and 500,000 fetal deaths per year worldwide, and it remains one of the main causes of perinatal morbidity and mortality. Most cases occur in low-income countries, but the problem is universal: we detect it too late.

Despite being widely studied, preeclampsia lacks laboratory markers with a good cost-benefit ratio. Biomarkers such as the PlGF/sFlt-1 index and uterine artery resistance have been explored, but they have not delivered adequate sensitivity or large-scale applicability. Furthermore, sPE symptoms can appear suddenly, without prior clinical changes. The need remains unmet: diagnostic methods that are cheap, accessible, reliable, and interpretable.

It is in this scenario that the provocative proposal of Daniella Castro Araújo and her research team at the Federal University of Minas Gerais emerges: to transform the complete blood count, a trivial, cheap, and universally used test, into a powerful biomarker for the detection of sPE, and to do so with artificial intelligence. Inspired by the work of Jhee, Eberhard, and Lu & Hsu, who explored machine learning models for predicting preeclampsia, the authors went further: they created a predictive model based exclusively on three parameters of the blood count.

When the trivial becomes extraordinary

The proposal of the study is bold: to use only the complete blood count (CBC) and artificial intelligence to detect cases of sPE in the third trimester of pregnancy. Behind this simplicity lies careful reasoning. The blood count is cheap, widely available, and already part of routine prenatal exams. So why not enhance its use?

The authors developed a machine learning model based on boosting (LightGBM), trained on synthetic data generated with a new methodology called DAS (Data Augmentation and Smoothing). The original sample consisted of 132 pregnant women: 65 cases of sPE and 67 normotensive controls. With the DAS technique, 3,552 synthetic examples were generated, balancing diversity and representativeness.

The final model relied on only three variables: neutrophil count, mean corpuscular hemoglobin (MCH), and the aggregate systemic inflammation index (AISI), a composite metric based on the number of neutrophils, monocytes, platelets, and lymphocytes. These parameters are routinely generated in blood counts, but rarely interpreted together or through the lens of systemic inflammation in pregnancy.

The innovation here is not only in the use of AI. It is in the courage to revisit the obvious, in the scientific gaze that transforms the common into the strategic. It is like discovering that the smoke alarm was already installed on the wall — we just didn’t know how to interpret it.

The algorithm that reads between the cells

The methodological core of the article revolves around DAS, a synthetic data generation technique based on random weighted averages of real samples. The inspiration came from time series generation methods like the one proposed by Forestier, adapted to the biomedical tabular context. The idea is simple, yet powerful: to generate an arbitrarily large number of synthetic samples that preserve the statistical properties of the real database, without repeating patterns or introducing redundancy.
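
The core idea, random weighted averages of real samples, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' DAS implementation: the function names, the number of rows blended per sample (`k`), the weight distribution, and the toy values are all assumptions, and the smoothing step is not shown.

```python
import random

def blend_rows(rows, k, rng):
    """Return one synthetic row: a random convex combination of k real rows."""
    chosen = rng.sample(rows, k)
    raw = [rng.uniform(0.1, 1.0) for _ in range(k)]  # assumed weight distribution
    total = sum(raw)
    weights = [w / total for w in raw]  # normalized so the weights sum to 1
    return [sum(w * row[j] for w, row in zip(weights, chosen))
            for j in range(len(rows[0]))]

def augment(rows, n_synthetic, k=3, seed=0):
    """Generate n_synthetic rows from real rows that share one class label."""
    rng = random.Random(seed)
    return [blend_rows(rows, k, rng) for _ in range(n_synthetic)]

# Made-up rows for illustration only: [neutrophils per mm3, MCH in pg]
real_spe_rows = [[8100.0, 28.5], [8400.0, 29.1], [7900.0, 28.9], [8600.0, 28.2]]
synthetic = augment(real_spe_rows, n_synthetic=5)
```

Because every synthetic row is a convex combination, each of its features stays within the range spanned by the real rows it was blended from.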

The modeling was done with LightGBM, a gradient boosting algorithm that builds decision trees sequentially, each new tree correcting the errors of the ones before it. The authors also applied 10-fold cross-validation in such a way that the test data of each fold was never used to generate synthetic data, which avoids data leakage.
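
The leakage safeguard comes down to ordering: split first, then augment only the training part of each fold. The sketch below shows that ordering under stated assumptions. `make_folds`, `cross_validate`, and the `augment`, `fit`, and `predict` callables are hypothetical placeholders (the study used DAS and LightGBM in those roles), and the folds here are not stratified.

```python
import random

def make_folds(n_rows, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def cross_validate(rows, labels, augment, fit, predict, n_folds=10, seed=0):
    """Return per-fold accuracy, augmenting the training rows only."""
    accuracies = []
    for test_idx in make_folds(len(rows), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Synthetic data is generated from training rows only, so nothing
        # derived from the held-out rows can reach the model.
        aug_rows, aug_labels = augment(train_rows, train_labels)
        model = fit(aug_rows, aug_labels)
        correct = sum(predict(model, rows[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies
```

Augmenting before the split would let synthetic rows that are blends of held-out patients end up in the training set, which inflates the measured performance.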

To make the results easier to interpret, the authors also fitted a ridge regression on the same three parameters. Healthcare professionals can read this model directly as a formula, at the cost of some accuracy (AUROC 0.79 versus 0.90 for the boosting model).

Attribute selection was guided by a directed acyclic graph in which each node represents a subset of variables and each edge connects two subsets that differ by exactly one additional attribute. This avoids information overlap and ensures that each new variable added to the model brings information gain.
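
A graph of this shape can be built and walked as follows. This is a sketch under assumptions: the article does not describe the search strategy or the scoring function, so the greedy `climb` and the integer `TOY_GAIN` scores are hypothetical and are not results from the study.

```python
from itertools import combinations

def build_subset_dag(features):
    """Nodes are feature subsets; edges add exactly one feature."""
    nodes = [frozenset(c)
             for r in range(len(features) + 1)
             for c in combinations(features, r)]
    edges = {node: [node | {f} for f in features if f not in node]
             for node in nodes}
    return nodes, edges

def climb(edges, score, start=frozenset()):
    """Follow edges while the best child strictly improves the score."""
    current, best = start, score(start)
    while edges[current]:
        top = max(edges[current], key=score)
        if score(top) <= best:
            break  # no single added feature brings a gain
        current, best = top, score(top)
    return current

# Invented per-feature gains, for illustration only.
TOY_GAIN = {"neutrophils": 30, "aisi": 20, "mch": 10, "monocytes": 0}

def toy_score(subset):
    return sum(TOY_GAIN[f] for f in subset)

nodes, edges = build_subset_dag(list(TOY_GAIN))
selected = climb(edges, toy_score)
```

With these invented gains the walk stops at the subset containing neutrophils, AISI, and MCH, because adding the fourth feature does not raise the score.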

Furthermore, the model was interpreted with SHAP (Shapley Additive Explanations), which quantifies the contribution of each variable to each individual prediction. The result is not just an accuracy number; it is an interpretative lens on the inflammatory physiology of pregnancy.
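
To make "contribution of each variable" concrete, the sketch below computes exact Shapley values for a tiny model by averaging each feature's marginal effect over every ordering of the features. It illustrates the definition only: the function names, the toy model, and the choice of replacing absent features with a fixed baseline are assumptions, and this brute-force approach is practical only for a handful of features.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)  # start with every feature "absent"
        previous = model(current)
        for i in order:
            current[i] = x[i]  # switch feature i to its real value
            value = model(current)
            phi[i] += value - previous  # marginal effect in this ordering
            previous = value
    return [p / len(orders) for p in phi]

def toy_model(v):
    """Invented model with one interaction term, for illustration only."""
    return 2 * v[0] - v[1] + v[0] * v[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The values always add up to the difference between the prediction for `x` and the prediction for the baseline, which is what allows each individual prediction to be broken down feature by feature.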

Neutrophils as sentinels, monocytes as silence

When white blood cells become warning signs

Neutrophils, cells frequently associated with defense against infections, emerged in this study as protagonists in sPE. Pregnant women with sPE presented significantly higher neutrophil counts (8,225/mm³ versus 6,664/mm³ in controls), reinforcing the hypothesis of a chronic pro-inflammatory state that is not resolved throughout pregnancy. The Pearson correlation between neutrophils and SHAP values was 0.91, considered very strong, indicating that this variable had a direct and positive impact on risk prediction.
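
For reference, a Pearson coefficient between a feature and its SHAP values can be computed as in the sketch below. The function name and the sample numbers are invented for illustration and are not data from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented values: neutrophil counts and the SHAP value of that feature.
neutrophils = [6000.0, 7000.0, 8000.0, 9000.0]
shap_values = [-0.4, -0.1, 0.2, 0.5]
r = pearson(neutrophils, shap_values)
```

A coefficient close to +1 means the feature's contribution to predicted risk grows almost linearly with its value; a coefficient close to -1 means the opposite.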

It is as if the neutrophils were "shouting" inflammation even before clinical symptoms appear. And the ML model can hear that shout, when traditional methods cannot.

The silence of monocytes

In contrast, monocytes — another class of white blood cells — were significantly decreased in pregnant women with sPE, with an average of 384/mm³ compared to 620/mm³ in normotensive women. A value below 490/mm³ showed a sensitivity of 71% and specificity of 72%. The finding is unprecedented and draws attention: instead of an inflammatory excess, what we have here is an absence — perhaps a failure in the regulatory immune response, a kind of "immune blackout".
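
Sensitivity and specificity for a single cutoff like this one are simple counts. The sketch below evaluates the rule "monocytes below 490/mm³ means predicted sPE" on a handful of invented patients; the function name and the data are assumptions and do not reproduce the study's figures.

```python
def sensitivity_specificity(values, has_spe, cutoff=490):
    """Evaluate the rule 'value < cutoff predicts sPE'."""
    tp = sum(1 for v, d in zip(values, has_spe) if d and v < cutoff)
    fn = sum(1 for v, d in zip(values, has_spe) if d and v >= cutoff)
    tn = sum(1 for v, d in zip(values, has_spe) if not d and v >= cutoff)
    fp = sum(1 for v, d in zip(values, has_spe) if not d and v < cutoff)
    sensitivity = tp / (tp + fn)  # share of sPE cases flagged
    specificity = tn / (tn + fp)  # share of controls correctly cleared
    return sensitivity, specificity

# Invented monocyte counts (per mm3) and diagnoses, for illustration only.
monocytes = [300, 450, 520, 380, 600, 700, 470, 650]
has_spe = [True, True, True, True, False, False, False, False]
result = sensitivity_specificity(monocytes, has_spe)
```

In this toy data three of the four sPE cases fall below the cutoff and three of the four controls fall at or above it.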

To use an everyday metaphor, monocytes are the firefighters of inflammation. In cases of sPE, they do not show up when the alarm sounds.

AISI: a broken balance metric

AISI is calculated as (neutrophils × monocytes × platelets) ÷ lymphocytes. The average value was 345 in sPE versus 523 in the control group, with p<0.01. The reduction of AISI in sPE reflects a collapse of the inflammatory-resolutive system: more inflammation (neutrophils), less regulation (monocytes) and lower coagulation capacity (platelets).
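
The formula translates directly into code. One assumption needs flagging: the article does not state the units, and values in the reported range (a few hundred) arise when all four counts are expressed in 10^9 cells per litre, so that is what this sketch assumes. The function name and the example inputs are made up.

```python
def aisi(neutrophils, monocytes, platelets, lymphocytes):
    """Aggregate systemic inflammation index.

    All counts are assumed to be in 10^9 cells per litre.
    """
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return neutrophils * monocytes * platelets / lymphocytes

# Invented example counts, not taken from the study.
example = aisi(neutrophils=6.0, monocytes=0.5, platelets=200.0, lymphocytes=2.0)
```

Since monocytes and platelets sit in the numerator, a fall in either one lowers the index even when neutrophils rise.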

The negative correlation with SHAP values (-0.90) reinforces the role of this index as a brake on inflammation. When the AISI drops, the risk rises. It is as if the pregnancy were a car going downhill without brakes.

MCH: the weight of anemia

The mean corpuscular hemoglobin (MCH) also proved to be an important indicator, with an average of 28.8 pg in sPE versus 30.4 pg in the control group. This decrease is consistent with mild anemia and may be linked to the microangiopathic hemolysis characteristic of sPE. More than that, it reveals how systemic inflammation affects iron metabolism, decreasing hemoglobin synthesis.

It is the portrait of an organism in internal conflict, trying to compensate for multiple systemic damages.

Three models, three ways to predict

The boosting model (LightGBM) with neutrophils, AISI, and MCH obtained an AUROC of 0.90, sensitivity of 95%, specificity of 79%, accuracy of 87%, and F1 of 0.88.
Ridge regression with the same three parameters had an AUROC of 0.79 while being entirely interpretable. Its formula is Y = 1.71 − 2.84×MCH − 7.93×AISI + 4.78×Neutrophils; if Y > 0, there is a high risk of sPE.
The simple cutoff point for monocytes (<490/mm³) alone had an AUROC of 0.70. Although less accurate, it can be useful in low-resource settings.
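
The ridge rule is simple enough to apply by hand or in a few lines of code. The sketch below uses the coefficients quoted above. The article does not say how the three inputs are scaled before entering the formula, so the function assumes they have already been scaled the same way as in the study; with raw laboratory values the result would not be meaningful. The function names are hypothetical.

```python
def ridge_score(mch, aisi, neutrophils):
    """Y from the ridge formula; inputs are assumed to be pre-scaled."""
    return 1.71 - 2.84 * mch - 7.93 * aisi + 4.78 * neutrophils

def high_risk(mch, aisi, neutrophils):
    """Decision rule quoted in the article: Y > 0 means high risk of sPE."""
    return ridge_score(mch, aisi, neutrophils) > 0
```

The signs match the findings described earlier: lower MCH and lower AISI push Y up, and so do higher neutrophil counts.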

Comparatively, DAS outperformed established synthetic data generation methods, such as SMOTE (AUROC 0.88) and ADASYN (AUROC 0.86), although the difference was not statistically significant.

What we still do not see: provoking the future of digital obstetrics

The study opens many doors, and it also raises many questions. Are we ready to use AI in routine exams? Are we training health professionals to interpret explainable models like ridge regression? Or do we still trust the stethoscope more than predictive metrics with an AUROC of 0.90?

More than a scientific result, the article proposes a change of mindset: adopting a data-centered view, where technology does not replace but amplifies human clinical capacity. Furthermore, the use of the blood count as a diagnostic tool in sPE can be expanded to other contexts: infections, COVID-19, autoimmune conditions, among others.

What if we applied the DAS methodology to generate synthetic data in other rare conditions? What if we integrated these models into public health systems, generating automated alerts? How would this change maternal and neonatal outcomes?

The limitations are real: a small sample, no external validation, and the exclusion of cases with infections or concurrent inflammatory conditions. But the potential impact is greater than the uncertainties.

We invite the scientific community to test, criticize, replicate and expand these findings. The study is not an end point — it is an invitation to the next chapter of preventive medicine.

For those who wish to understand, apply and explore the findings more deeply, reading the full paper is highly recommended. It is a vivid example of how data science can transform care in reproductive health.
