1 Introduction

This work aims to identify feature sets related to low birth weight (LBW) among low-income families in Brazil, both nationally and within selected vulnerable groups and individual cities. The World Health Organization (WHO) defines LBW as a birth weight below 2,500 g [28]. LBW is a global public health issue that significantly affects maternal and infant health, potentially leading to severe short- and long-term consequences [7, 9, 12]. Additionally, LBW is the most important isolated risk factor for infant mortality [7]. In Brazil, the LBW rate is around 8% [7].

Several studies have been conducted to understand the features related to LBW in Brazil. Prenatal care quality and coverage, for example, have been identified as crucial factors in reducing the occurrence of LBW [12, 15, 38]. The mother’s age is also a significant risk factor for LBW, particularly for adolescent [4, 26] and older mothers [35]. The influence of alcohol consumption during pregnancy has also been studied [29]. Even maternal oral health problems can affect LBW [31]. Finally, the impact of government programs on LBW, such as the Bolsa Família Program (PBF), has also been investigated [20].

There are also studies based on big data that aim to identify factors related to LBW in Brazil. A study using data from 2001 to 2015 on more than 100 million Brazilians, for example, investigated factors related to low birth weight in poor or extremely poor families in Brazil [10]. The results showed that the following features are most strongly related to LBW: low maternal education, black mothers, single mothers, few prenatal visits, maternal age between 35 and 49 years, first pregnancy, and female newborns.

Other studies aim to identify factors related to LBW in local areas or in groups that require specific attention, such as Indigenous people [3, 14] and Quilombolas [27]. Despite the large body of work dedicated to identifying factors related to LBW in Brazil and in specific areas or groups, few works combine national and group-specific analyses in the same paper. Moreover, some data mining techniques, such as subgroup discovery, have not yet been applied to identify feature sets related to LBW in Brazil.

Subgroup discovery is a data mining area that aims to identify feature sets associated with an unusually high presence of a specific target relative to the others [1, 13, 16]. Thus, it looks for patterns that are significant only in certain portions of the dataset (e.g., newborns with LBW, fraudulent transactions). Subgroup discovery has been used in various applications, such as bioinformatics [19], medicine [5], marketing [6], learning [34] and depression [8].

In this context, this work makes the following contributions: (1) the use of subgroup discovery as a tool to identify feature sets related to LBW; (2) an analysis of feature sets in the national scenario of low-income families, compared with six more vulnerable groups: small towns, rural areas, illiterate mothers, adolescent mothers, Indigenous people and Quilombolas; and (3) the development of a dashboard to publicize feature sets related to LBW in more than 4,000 Brazilian cities.

The remainder of this article is structured as follows. In Sect. 2 we present the main concepts of subgroup discovery. In Sect. 3 we describe the methodology, highlighting the data processing, the use of subgroup discovery, and the steps for creating the dashboard. In Sect. 4 we present the results and discussions, followed by Sect. 5, where we present the conclusions, future work, and limitations.

2 Subgroup Discovery Problem

The subgroup discovery problem can be formalized as follows. Let D be a labeled dataset with a set A of categorical/discrete attributes. It can be partitioned into \(D^+ = \{e_{1}^+, e_{2}^+, ..., e_{|D^+|}^+\}\), containing the examples that carry the target value (positives), and \(D^- = \{e_{1}^-, e_{2}^-, ..., e_{|D^-|}^-\}\), containing the other examples (negatives). Let \(dom(A_i)\) be the domain of possible values for attribute \(A_i \in A\). A feature \(f_i\) is defined as a pair (attribute, value), such that \(f_i \in F\) and \(F = \bigcup_{i} \{A_i\} \times dom(A_i)\). A subgroup s can be represented as a feature set \(F' \subset F\) in which the presence of positive examples is exaggerated in relation to the negatives.
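The definitions above can be illustrated with a minimal Python sketch. It assumes the usual conjunctive reading of a subgroup (an example belongs to the subgroup when it matches every feature in it); the attribute names and domains are hypothetical and chosen only for illustration.

```python
# Hypothetical attribute domains: each attribute A_i maps to dom(A_i).
domains = {
    "mother_age": ["<20", "[20,40]", ">40"],
    "first_pregnancy": ["yes", "no"],
}

# F is the union over attributes of {A_i} x dom(A_i):
# every (attribute, value) pair is one feature.
F = [(attr, value) for attr, values in domains.items() for value in values]


def covers(subgroup, example):
    """Return True when the example matches every feature of the subgroup.

    Assumption: a subgroup is read as a conjunction of its features.
    """
    return all(example.get(attr) == value for attr, value in subgroup)


# A subgroup is a subset F' of F.
s = {("mother_age", "<20"), ("first_pregnancy", "yes")}
example = {"mother_age": "<20", "first_pregnancy": "yes", "LBW": "yes"}
is_covered = covers(s, example)
```

Representing features as plain tuples keeps the subgroup a simple set, so set operations (union, difference) can be reused when comparing subgroups.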

Let Table 1 be an example of a dataset with 10 examples of newborns with the attributes: mother’s age (\(<20\), [20, 40], \(>40\)), first pregnancy (yes or no), and LBW (yes or no). Let \(LBW = yes\) be the target in Table 1. Thus, \(D^+ = \{e_{7}, e_{8}\}\), \(D^- = \{e_{1}, e_{2}, e_{3}, e_{4}, e_{5}, e_{6}, e_{9}, e_{10}\}\) and \(A = \{mother\ age, first\ pregnancy\}\). The feature set F is given in Table 2.

Table 1. Toy dataset representing newborns with and without low birth weight (LBW).
Table 2. Features universe of Table 1.

In this context, \(s = \{f_1, f_4\}\) (Table 2) is an interesting subgroup because it concentrates two cases of LBW (\(e_7\), \(e_8\)) and only one case of non-LBW (\(e_2\)). The proportion of \(LBW=yes\) is 20% in the dataset (2 of 10 examples) and 66.67% in the subgroup s (2 of 3 examples). The concept of "interesting" in subgroup discovery can vary depending on the context, but it generally involves identifying some significant deviation from the expected behavior of the dataset. This deviation is measured using evaluation metrics.

One of the most commonly used metrics in subgroup discovery is Weighted Relative Accuracy (WRAcc) [8, 22], presented in Eq. 1, where \(TP\) is the number of true positives (positive examples covered by the subgroup) and \(FP\) is the number of false positives (negative examples covered by the subgroup). \(|D|\) is the total number of examples in the dataset and \(|D^+|\) is the total number of positive examples. This metric multiplies the coverage of the subgroup (its size relative to the dataset) by the difference between the precision of the subgroup and the prevalence of positive examples in the dataset.

$$\begin{aligned} WRAcc(s) = \frac{TP + FP}{|D|} \left( \frac{TP}{TP + FP} - \frac{|D^+|}{|D|} \right) \end{aligned}$$
(1)
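As a worked illustration of Eq. 1, the short Python sketch below computes WRAcc from the four counts. The function name and the convention of returning 0 for an empty subgroup are assumptions made here, not part of the original formulation. The example call uses a subgroup covering two positives and one negative in a dataset of 10 examples with two positives, as in the partition given for the toy dataset.

```python
def wracc(tp, fp, n_total, n_pos):
    """Weighted Relative Accuracy following Eq. 1.

    tp, fp  : positives and negatives covered by the subgroup
    n_total : |D|, number of examples in the dataset
    n_pos   : |D+|, number of positive examples in the dataset
    """
    covered = tp + fp
    if covered == 0:
        return 0.0  # assumed convention for a subgroup that covers nothing
    coverage = covered / n_total    # (TP + FP) / |D|
    precision = tp / covered        # TP / (TP + FP)
    prevalence = n_pos / n_total    # |D+| / |D|
    return coverage * (precision - prevalence)


# Subgroup with TP = 2 and FP = 1, |D| = 10, |D+| = 2:
# 0.3 * (2/3 - 0.2), approximately 0.14.
score = wracc(tp=2, fp=1, n_total=10, n_pos=2)
```

A subgroup whose precision equals the dataset prevalence scores zero, which is why WRAcc rewards only subgroups that deviate from the overall rate while still covering a meaningful share of the data.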


\(SSDP+\) [21] is an evolutionary algorithm designed for subgroup discovery that is particularly effective in high-dimensional databases. It automatically adjusts mutation and crossover rates based on the performance of the top-k subgroups in each generation. The algorithm iterates until a stopping criterion is met, defined as three consecutive generations without improvement in the top-k subgroups. Additionally, \(SSDP+\) incorporates a diversity operator to avoid redundancy among the top-k subgroups, ensuring greater variety and relevance in the final results.
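The stopping criterion described above can be sketched in Python as follows. This is not the SSDP+ implementation: `next_generation` is a hypothetical callable standing in for the whole evolutionary step, and "improvement" is assumed here to mean a higher total score over the top-k subgroups.

```python
def run_until_stagnation(next_generation, initial_top_k, patience=3):
    """Iterate until `patience` consecutive generations bring no improvement.

    next_generation : hypothetical callable that receives the current top-k
                      list of (score, subgroup) pairs and returns a new one
    initial_top_k   : starting list of (score, subgroup) pairs
    """
    top_k = list(initial_top_k)
    stagnant = 0
    generations = 0
    while stagnant < patience:
        candidate = next_generation(top_k)
        generations += 1
        # Assumption: improvement = larger sum of scores over the top-k.
        if sum(sc for sc, _ in candidate) > sum(sc for sc, _ in top_k):
            top_k = candidate
            stagnant = 0  # reset the counter after any improvement
        else:
            stagnant += 1
    return top_k, generations
```

With `patience=3`, a run that improves in its first two generations and then stalls stops after five generations in total.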

3 Methodology

The methodology is divided as follows. Subsection 3.1 presents the datasets and the preprocessing steps. Subsection 3.2 shows how the subgroup discovery process was applied to the datasets of Subsect. 3.1. Finally, Subsect. 3.3 details the development process of the dashboard.

3.1 Dataset

The dataset used in this work is a subset of 100M SINASC-SIM [2, 30], a linkage of large Brazilian datasets that includes "Cadastro Unico" (CADU), the Newborn Information System (SINASC) and the "Bolsa Familia" ("Family Stipend") Program (PBF), described below:

  • CADU: it registers low-income families in Brazil. It contains information on family composition, education, housing and other data used to analyze families’ socioeconomic vulnerability. The information contained in CADU is used to select and prioritize families for government assistance.

  • PBF: it is the largest income transfer program in Brazil. Around 56 million Brazilians (21.3 million families) benefited from the PBF in 2023. The PBF dataset includes the benefits received by each family.

  • SINASC: it registers information about births in Brazil, providing information about the newborn, the mother and the childbirth. SINASC covers 90% of Brazil’s births and is a fundamental tool for monitoring health indicators.

Thus, for a given newborn in the 100M SINASC-SIM dataset, we have information about the birth (from SINASC), the residence and family (from CADU), and the benefits received from the PBF. The 100M SINASC-SIM cohort includes 114,001,661 individuals, about 50% of the Brazilian population [2, 30].

The dataset used in this study is a subset of 100M SINASC-SIM with 31,331,817 individuals, including mothers and newborns. We reshaped this dataset to match each newborn with the mother’s information, resulting in 16,448,931 newborn-mother pairs. After that, we retained only the newborns born between 2011 and 2015, resulting in 5,246,874 records.

We removed all redundant attributes, identifiers, and attributes with more than 50% missing values. We also excluded multiple pregnancies (twins or triplets), which commonly result in low birth weight. The final dataset has 5,242,215 newborns from low-income families, described by 42 attributes. Each record also has a label that indicates one of two possible values for LBW: true or false. We will refer to this dataset as LBW DS (Low Birth Weight Dataset).
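The preprocessing steps described above (restricting the period, dropping sparse attributes, and excluding multiple pregnancies) can be sketched in plain Python. The record keys `birth_year` and `pregnancy_type`, the value `"single"`, and the use of `None` for missing values are hypothetical; the actual attribute names in the linked dataset are not given in the text.

```python
def preprocess(records, max_missing=0.5, first_year=2011, last_year=2015):
    """Apply the three filters to a list of dict records.

    Hypothetical keys: "birth_year" (int) and "pregnancy_type" ("single",
    "twin", ...). None marks a missing value.
    """
    # 1. Keep only births inside the study period.
    in_period = [
        r for r in records
        if r.get("birth_year") is not None
        and first_year <= r["birth_year"] <= last_year
    ]
    if not in_period:
        return []

    # 2. Keep attributes with at most `max_missing` missing values.
    attributes = set().union(*(r.keys() for r in in_period))
    dense = {
        a for a in attributes
        if sum(r.get(a) is None for r in in_period) / len(in_period) <= max_missing
    }

    # 3. Exclude multiple pregnancies and project onto the dense attributes.
    return [
        {a: r.get(a) for a in dense}
        for r in in_period
        if r.get("pregnancy_type") == "single"
    ]
```

At the scale reported here (millions of records), a columnar tool would be a more realistic choice; the list-of-dicts form is used only to keep the sketch self-contained.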

Next, we created several subsets of LBW DS using the following filters: (1) born in rural areas (LBW Rural), (2) born in cities with fewer than 10,000 inhabitants (LBW 10k), (3) illiterate mothers (LBW Illiterate), (4) mothers under 15 years old (LBW less15years), (5) born into Indigenous families (LBW Indigenous), and (6) born into Quilombola families (LBW Quilombolas). Furthermore, we created a subset for each Brazilian city with more than 10 LBW records in LBW DS, resulting in 4,400 city-level subsets.
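The city-level split can be sketched as a group-by followed by a threshold on the number of LBW records. The keys `city` and `lbw` are hypothetical names, and the function is only an illustration of the rule "more than 10 LBW records".

```python
from collections import defaultdict


def city_subsets(records, min_lbw=10):
    """Group records by city and keep cities with more than `min_lbw` LBW cases.

    Hypothetical keys: "city" (str) and "lbw" (bool).
    """
    by_city = defaultdict(list)
    for record in records:
        by_city[record["city"]].append(record)
    return {
        city: rows
        for city, rows in by_city.items()
        if sum(1 for r in rows if r["lbw"]) > min_lbw
    }
```

The comparison is strict, so a city with exactly 10 LBW records is not retained.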

3.2 Subgroup Discovery Application

We applied the SSDP+ algorithm to the subgroup discovery task with the following parameters: \(k=10\), \(sim = 0.5\), \(k_s = 2\) and WRAcc as the evaluation metric. The parameters sim and \(k_s\) control the subgroup diversity and were chosen based on preliminary tests. Additionally, SSDP+ was configured to ignore features associated with missing values (e.g., \(mother\ age = NaN\)).

For each of the 4,407 generated datasets (Subsect. 3.1), we ran SSDP+ twice. In the second run, features that appeared in the first run were excluded. The final top-10 subgroups were formed by combining the 5 most relevant subgroups from each of the two runs. This approach aimed to ensure diversity in subgroup descriptions, since features present in the first five subgroups did not repeat in the subsequent five. The results were saved in CSV files, one for each dataset.
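The two-run procedure can be sketched as below. `run_ssdp` is a hypothetical wrapper around the algorithm, assumed to return subgroups (each a frozenset of features) sorted from best to worst. The sketch also assumes that the features excluded in the second run are those appearing in the five subgroups kept from the first run, which is one possible reading of the description above.

```python
def top10_from_two_runs(run_ssdp, features, k=10):
    """Combine the 5 best subgroups of two runs with disjoint features.

    run_ssdp : hypothetical callable (features, k) -> list of frozensets,
               ordered from the most to the least relevant subgroup
    features : list of (attribute, value) pairs available to the first run
    """
    first = run_ssdp(features, k)[:5]
    # Features used by the subgroups kept from the first run.
    used = set().union(*first) if first else set()
    # Second run only sees features not used so far.
    remaining = [f for f in features if f not in used]
    second = run_ssdp(remaining, k)[:5]
    return first + second
```

Because the second run never sees the features in `used`, the last five subgroups cannot repeat any feature of the first five.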

3.3 Dashboard

The dashboard development was divided into 6 steps:

  1. Data collection: we collected the 4,407 CSV files containing the top-10 subgroups, one for each city (4,400) and one for each of the datasets LBW DS, LBW Rural, LBW 10k, LBW Illiterate, LBW less15years, LBW Indigenous, and LBW Quilombolas.

  2. Data consolidation and correction: we consolidated all 4,407 CSV files into a single file.

  3. Data preparation: we used regular expressions to transform the results in the consolidated CSV file into insertion commands for a relational database (PostgreSQL [32]) and deployed the database using a cloud service (Render [33]).

  4. Dashboard prototyping: we developed initial prototypes of the dashboard using Figma [11], based on concepts from Storytelling with Data [18].

  5. Development: we created a REST API using Java and Spring Boot [37] and deployed the API on the Render service. Finally, we used HTML [23], CSS [24], and JavaScript [25] to build the dashboard, which consumes data from our API, and deployed it on the Vercel platform [39].

  6. Deployment of the dashboard: the dashboard was made available through a public link.
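The data preparation step (step 3) can be illustrated with a small Python sketch that turns one line of a consolidated CSV into an SQL insertion command using a regular expression. The semicolon-separated layout, the column names and the table name `subgroup` are all hypothetical, since the actual file format and database schema are not described in the text.

```python
import re

# Hypothetical row layout: dataset;description;positives;negatives;lift
ROW = re.compile(
    r"^(?P<dataset>[^;]+);(?P<description>[^;]+);"
    r"(?P<pos>\d+);(?P<neg>\d+);(?P<lift>\d+(?:\.\d+)?)$"
)


def to_insert(line, table="subgroup"):
    """Build one INSERT command from one CSV line (hypothetical schema)."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected row format: {line!r}")
    row = match.groupdict()
    # Double single quotes so the text is a valid SQL string literal.
    dataset = row["dataset"].replace("'", "''")
    description = row["description"].replace("'", "''")
    return (
        f"INSERT INTO {table} (dataset, description, positives, negatives, lift) "
        f"VALUES ('{dataset}', '{description}', {row['pos']}, {row['neg']}, {row['lift']});"
    )
```

Generating SQL text is convenient for a one-off bulk load; when inserting from application code, parameterized queries are generally the safer choice.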

4 Results and Discussions

Table 3 summarizes the datasets generated in this study: LBW DS, born in rural areas (LBW Rural), born in cities with fewer than 10,000 inhabitants (LBW 10k), illiterate mothers (LBW Illiterate), mothers under 15 years old (LBW less15years), born into Indigenous families (LBW Indigenous) and born into Quilombola families (LBW Quilombolas).

The LBW rate in low-income families (LBW DS) was 7.25% (Table 3). In the LBW Illiterate and LBW less15years datasets, the LBW rates were 7.90% and 11.60%, respectively, both higher than in LBW DS. This is expected, since low education and young maternal age are risk factors for LBW [7, 10].

In the other datasets (LBW 10k, LBW Rural, LBW Indigenous, LBW Quilombolas), the LBW rates were lower than in LBW DS, although these groups have, in general, worse medical and housing structure. This can be explained by the "low birth weight paradox", according to which the better the medical and housing structure, the higher the LBW rate [17, 36]. This occurs because the probability that a newborn with low birth weight survives increases as the mother has more ways to deal with the problem. Table 3 also indicates that the LBW Rural dataset covers more than 20% of LBW DS, while the others cover less than 10%.

Table 3. Profile of the databases generated in the study for the application and analysis of the returned subgroups.

Figure 1 presents the most relevant subgroups according to the methodology used (Sect. 3.2). Each subgroup in Fig. 1 is represented by a \( w_i \), where \( w_1 \) to \( w_{10} \) are subgroups associated with increased risk of LBW (in purple), and \( w_{11} \) to \( w_{20} \) are associated with reduced risk of LBW (in green). The percentage represents the increase or decrease of the LBW rate in each subgroup. The size of each vertex is proportional to the number of newborns belonging to the subgroup, and the color intensity is proportional to the increase or decrease of the LBW rate. The smaller gray circles represent features shared between subgroups.

Fig. 1. Description of the subgroups considered most relevant by the methodology applied. (Color figure online)

Figure 1 also shows that the number of prenatal visits is an important feature related to both the increase and the reduction of LBW risk. Five subgroups with reduced LBW risk (\(w_{11}\) to \(w_{15}\)) are associated with more than six prenatal visits. On the other hand, four subgroups with increased LBW risk (\(w_{1}\), \(w_{3}\), \(w_{4}\), and \(w_{5}\)) are associated with fewer than six prenatal visits. In subgroup \(w_3\), where mothers had between 1 and 3 visits, the LBW risk increases by 66%.

Still in Fig. 1, single and young mothers were associated with increased LBW, as shown in subgroups \(w_9\) and \(w_{10}\), respectively. The same occurred with first pregnancies, cesarean delivery, and female newborns (subgroups \(w_6\), \(w_7\), and \(w_5\)). Finally, the absence of the PBF was related to a modest increase in the LBW rate (subgroup \(w_8\)). Features related to better health and housing structure, such as piped water, paved streets, garbage collection and hospital birth (subgroups \(w_2\), \(w_4\), \(w_5\), \(w_7\), \(w_8\), and \(w_9\)), were associated with increased LBW. This can again be explained by the "low birth weight paradox" [17, 36].

Table 4 presents the description of the top-10 subgroups according to the methodology applied in the datasets LBW DS, LBW 10k, and LBW Rural. For each subgroup, we present an identifier, description, number of positive examples (\(LBW=yes\)), number of negative examples (\(LBW=no\)) and lift. In this context, lift is calculated as the LBW rate in the subgroup divided by the LBW rate in the reference dataset. Therefore, a lift of 1.25 for a subgroup s means that its LBW rate is 25% higher than in the reference dataset.
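The lift definition above amounts to a ratio of two rates, as in the following Python sketch. The function name is an assumption, and the counts in the example call are illustrative only; they are not taken from the tables.

```python
def lift(pos, neg, base_rate):
    """LBW rate in the subgroup divided by the LBW rate in the reference dataset.

    pos, neg  : numbers of LBW=yes and LBW=no examples in the subgroup
    base_rate : LBW rate of the reference dataset (between 0 and 1)
    """
    subgroup_rate = pos / (pos + neg)
    return subgroup_rate / base_rate


# Illustrative counts: 125 LBW cases among 1,000 newborns in the subgroup,
# against a reference LBW rate of 10%, i.e. a lift of about 1.25.
value = lift(pos=125, neg=875, base_rate=0.10)
```

A lift above 1 indicates a subgroup with a higher LBW rate than the reference dataset, and a lift below 1 indicates a lower rate.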

Table 4 shows that the top-10 subgroups of the datasets LBW DS, LBW 10k, and LBW Rural are described by similar kinds of features, such as prenatal visits, first pregnancies, single mothers, young mothers, female gender, good housing structure, and absence of the PBF. However, the impact of these features differs in each group.

For example, having between four and six prenatal visits increased the LBW rate by 24% in LBW DS (subgroup 1), 20% in rural areas (subgroup 11) and 32% in small cities (subgroup 21). Having between one and three prenatal visits increased the LBW rate by 66% in LBW DS (subgroup 3), 48% in rural areas (subgroup 15) and 70% in small cities (subgroup 26). Thus, the relationship between a low number of prenatal visits and the LBW rate is considerably stronger in small cities and weaker in rural areas.

Another example is first pregnancy, which increases the LBW rate by 19% in the LBW DS and LBW 10k datasets (subgroups 6 and 24) and by 26% in rural areas (subgroup 12). Therefore, in rural areas (LBW Rural), first pregnancies are more strongly associated with LBW than in the general context (LBW DS) and in small cities (LBW 10k).

Table 4. Description of the top-10 most relevant subgroups according to the methodology applied in LBW DS, LBW Rural, and LBW 10k. For each subgroup, we present an identifier, description, number of LBW (yes), non-LBW (no) and lift.

Table 5 shows the description of the 10 most relevant subgroups, according to the adopted methodology, for illiterate mothers (LBW Illiterate), adolescent mothers (LBW less15years), Indigenous people (LBW Indigenous), and Quilombolas (LBW Quilombolas). The types of features are, in general, similar to those found in the other datasets. However, they again show different levels of relevance for the increase in the LBW rate.

Table 5. Top-10 subgroups returned for the LBW Illiterate, LBW less15years, LBW Indigenous, and LBW Quilombolas datasets.

Mothers with low education (LBW Illiterate) are even more susceptible to having newborns with LBW when they are young (subgroup 36) or in their first pregnancy (subgroup 32) (Table 5). Among adolescent mothers (LBW less15years) who have few prenatal consultations, the LBW rate rises by 87% (subgroup 41). Additionally, being a single mother in a first pregnancy increases the LBW rate by 126% (subgroup 44). Another noteworthy feature is that, among adolescent mothers, normal delivery was associated with an increased LBW rate, whereas for the other groups analyzed, normal delivery reduced the LBW rate (Fig. 1).

The increase in the LBW rate among Indigenous mothers in their first pregnancy was 40%, significantly higher than in any other analyzed group (Table 5). Furthermore, single Indigenous mothers with female newborns have a 26% higher LBW rate (subgroup 55), an impact similar to that of a low number of consultations in the same group (subgroup 54). The impacts of these highlighted features are atypical when compared to the other groups presented.

The increase in the LBW rate among Quilombolas in their first pregnancy was 34% (subgroup 61). Quilombolas with few prenatal consultations and in their first pregnancy have a 65% higher LBW rate (subgroup 63). Finally, among newborns who are registered as grandchildren in the family, the LBW rate increased by 53% (subgroup 67).

Considering the most relevant subgroups according to the methodology across the different vulnerable groups, it is clear that the impact of features varies significantly between them. In some specific cases, a feature that increased the LBW rate in one group reduced it in another, such as the type of delivery. These kinds of differences also tend to occur between Brazilian cities.

Figure 2 presents the dashboard screen developed to show the 10 most relevant subgroups, according to the methodology applied, for all Brazilian cities that have more than 10 cases of LBW in the LBW DS dataset. On the left side, there is a sidebar that allows navigation between datasets and searching by city. This sidebar is the main tool for accessing the different sections and functionalities of the dashboard.

The center of the screen (Fig. 2) shows a detailed description of the top-10 subgroups, organized in a table. Additionally, two graphs are presented. The first graph summarizes the births in three columns: total births, low birth weight births, and normal birth weight births. The second graph displays data related to the selected dataset or city, comparing them with the general data. This pattern of description and graphs is repeated for all other options available on the dashboard, providing a consistent and comparative visualization.

The last option on the sidebar is the city search, where data for a specific city can be looked up. Figure 2 shows where to perform this search: click on the "Buscar por Cidade" (Search by City) field and type the city name followed by the state abbreviation in parentheses. For example, to search for information about Caruaru, type "Caruaru (PE)" and click the "Buscar" (Search) button. The data for the searched city will be displayed following the pattern explained above.

Fig. 2. Developed dashboard screen.

For those accessing the dashboard for the first time, it is important to note that the data may take up to 60 s to load. This initial waiting time only occurs on the first access; after that, all information is presented faster. These instructions aim to facilitate navigation and usage of the dashboard, allowing for a clear and comparative visualization of the presented data. The dashboard is available at the link: https://dashbpn.vercel.app/.

5 Conclusion and Future Work

This study applied subgroup discovery to identify the main feature sets related to the occurrence of LBW (Low Birth Weight) in newborns from low-income families. We further extended the analysis to specific low-income groups: small towns, rural areas, illiterate mothers, teenage mothers, Indigenous people and Quilombolas.

The results showed that the types of features related to LBW are considerably similar among the different groups, such as low number of prenatal visits, first pregnancy, single mother, low education, female gender and good housing and medical structure. However, these features have different impacts in each group analyzed. Quilombolas and Indigenous people, for instance, showed a considerably higher increase in LBW risk during the first pregnancy compared to the other groups analyzed.

These and other examples suggest that public policies to reduce LBW may be more effective if adapted to different groups. To contribute in this direction, we also developed a dashboard that presents, in a user-friendly manner, the groups of features related to LBW in all Brazilian cities that had more than 10 cases of LBW between 2011 and 2015, resulting in information for 4,400 Brazilian cities.

This work is part of a larger study that will include the same type of analysis for feature sets related to prematurity, congenital malformation, and Apgar score below seven. These three criteria are also used in Brazil to assess risk at birth. Therefore, investigating and providing these new results is our main future work. Another future task is to make the dashboard more user-friendly by improving the terms used in subgroup descriptions and incorporating new information. Lastly, the same methodology could also be applied to more recent data.