An Auxiliary Approach to Prediction of Binary Outcome with Bayesian Network Model: Exploration with Data for Recurrence of Breast Cancer
Correspondence Address :
N Sreekumaran Nair,
Professor and Head, Department of Medicine, Jawaharlal Institute of Post Graduate Medical Education and Research, Puducherry, India.
E-mail: nsknairmanipal@gmail.com
Introduction: Logistic regression is the classical statistical model used to predict a binary outcome variable. The model assumes that the predictor variables are independent and that their association with the outcome is linear on the log-odds scale. Alternative models developed in the machine learning context, such as the Naïve Bayes model, which makes similar assumptions, and the Bayesian Network (BN) model, can also be used for binary prediction.
Aim: To compare the predictive performance of logistic regression, Naïve Bayes and BN models in predicting the recurrence of breast cancer.
Materials and Methods: The dataset on recurrence of breast cancer was procured from the UCI Machine Learning Repository. The study was done on retrospective data from December 2021 to July 2022. The sample size was increased by bootstrapping with a logistic regression model. The dataset was split into training (70%) and testing (30%) datasets for internal validation. The effects of the potential prognostic variables were estimated using a multiple logistic regression model. Naïve Bayes and BN models were also learnt from the training dataset. The indices of predictive accuracy were estimated for the models in both the training and testing datasets.
Results: Degree of malignancy and side of the affected breast were found to be significant predictors of recurrence of breast cancer. The BN model had the least misclassification rate and the best sensitivity in comparison to the other models, in spite of imbalance in the outcome variable.
Conclusion: The BN model performed best in comparison to the logistic regression model when the assumptions of logistic regression were violated and the proportion of the outcome was imbalanced.
Keywords: Binary prediction, Naïve Bayes model, Predictive accuracy
Statistical models in health care have been extensively developed to help in medical decision-making (1). They assist in making important decisions to achieve specific clinical outcomes and in managing the resources to be allocated. Prognostic modelling has had immense application in the field of medicine (2). Prognostic models estimate the probability of an outcome of a condition and also explore the relationship of the factors affecting this outcome. Unlike other models, which incorporate a single explanatory variable and consider other variables as confounders, prognostic models focus on the combined effect of variables to predict the outcome. They are particularly important in selecting the right treatment and managing resources (2).
When the outcome variable is binary, the logistic regression model is preferred for the prognosis of disease outcome (3). The binary logistic regression model captures the effect of predictor variables on the dependent binary variable by linearising the relationship using a logit link function. Although the performance of logistic regression as a prognostic model has been good, in practice several of its assumptions are violated (4). One of the most important assumptions of logistic regression is that the predictor variables are independent of one another. This assumption is almost never true in medical research, especially in prognostic models (5). Regression models developed in the frequentist context assume normality of the error term and homoscedasticity at each level of the independent variable in the model. In spite of these assumptions being violated, logistic regression is widely used. Alternative predictive models that relax these assumptions have been suggested in the literature (6). BN models are graphical representations consisting of Directed Acyclic Graphs (DAG), with nodes and edges, which can be used to query a binary outcome variable (7). Naïve Bayes models are simple classifiers, a subset of BN models, which assume conditional independence among the independent variables when predicting the outcome variable (8). These are some of the alternative models that can be explored for the prediction of a binary outcome variable.
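For reference, the binary logistic regression model described above can be written in its standard form. The symbols used here (p for the probability of the outcome, the coefficients β and the predictors X_i) are notation assumed for illustration, not taken from the study:

```latex
\log\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{i=1}^{n} \beta_i X_i,
\qquad p = P(Y = 1 \mid X_1, \ldots, X_n)
```

Each coefficient β_i is the change in the log-odds of the outcome per unit change in X_i, so exp(β_i) is the corresponding odds ratio.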
Breast cancer is one of the most prominent cancers affecting women around the world (9). Although recent advances have improved survival outcomes such as mortality, recurrence of breast cancer still persists at around 8-11% after different treatment modalities in India (10). It has been established in the literature that some of the most common prognostic factors associated with recurrence of breast cancer include age, menopausal status, pathological N stage, pathological T stage, treatment modality, HER2, eGFR, and oestrogen and progesterone receptors (11).
The prognosis of a medical condition such as cancer depends on multiple factors which are correlated with one another. Clinical factors, sociodemographic factors and the treatment modalities given play a crucial role in the progression of breast cancer. Several statistical and machine learning models have been implemented for predicting the recurrence of breast cancer and have proven to be excellent in their predictive ability (12),(13). Although they have proven to be good, it is imperative to consider incorporating expert opinion into these models, which can bring better insight into their practical use (14). This is the gap between clinical and model experts that needs to be bridged. BN models are an alternative approach which can incorporate the dependency between the factors through supervised learning from data and expert opinion. Studies have also shown that hybrid BN models have good predictive accuracy and intuitive explanation ability (15). In this study, our objective was to assess the predictive ability of the Naïve Bayes model and the BN model compared to the logistic regression model in predicting the recurrence of breast cancer.
The present exploratory study, based on retrospective secondary data of breast cancer cases, was conducted from December 2021 to July 2022 at Jawaharlal Institute of Post Graduate Medical Education and Research, Puducherry.
Models: The Naïve Bayes model: Naïve Bayes classifiers are probabilistic classifiers based on Bayes' theorem which use the properties of conditional independence to compactly represent a high-dimensional probability distribution (16). The variables are not completely marginally independent in the case of this classifier model. The Naïve Bayes classifier model can be constructed for an outcome variable Y with possible distinct classes {c1, c2, …, ck} which are mutually exclusive and exhaustive. The Naïve Bayes model, though, makes a very strong assumption about the independent variables. In the presence of n independent variables X1, X2, …, Xn which are potential factors affecting the outcome variable Y, the Naïve Bayes assumption states that the Xi are conditionally independent of each other given the outcome of the individual. Formally, it is represented as:
$(X_i \perp X_{-i} \mid Y)$ for all $i$
The Naïve Bayes model can be represented as a BN model, although its independence assumptions are strong and generally not true in practice. The joint probability distribution of the Naïve Bayes model under this assumption is given by:
$P(Y, X_1, X_2, \ldots, X_n) = P(Y) \prod_{i=1}^{n} P(X_i \mid Y)$
Bayesian Network (BN) model: BN models are graphical representations of the interdependencies between variables, represented by a DAG and conditional probabilities. Let ‘G’ be a DAG; it consists of a set of variables, ‘X’, represented by nodes, and a set of directed edges, ‘E’, connecting these variables (17). In BN models, a node without a parent node is parametrised by the assumed prior distribution, whereas those with parent nodes are parametrised by the conditional probability P(X|parent(X)). The joint probability of all the variables in the BN model is given by:
$P(x_1, x_2, \ldots, x_p) = \prod_{i=1}^{p} P(x_i \mid \mathrm{Parent}(x_i))$
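The factorisation above can be illustrated with a small sketch. The three-node network, its variable names and every probability in it are hypothetical values chosen for illustration, not estimates from this study.

```python
# Hypothetical three-node network: grade -> recurrence <- irradiation.
# All probabilities are made-up illustrative values, not study estimates.

# Each node maps to (list of parents, conditional probability table).
# CPT keys are tuples of parent values; CPT values map node value -> probability.
network = {
    "grade": ([], {(): {"low": 0.6, "high": 0.4}}),
    "irradiation": ([], {(): {"no": 0.7, "yes": 0.3}}),
    "recurrence": (
        ["grade", "irradiation"],
        {
            ("low", "no"): {"no": 0.85, "yes": 0.15},
            ("low", "yes"): {"no": 0.80, "yes": 0.20},
            ("high", "no"): {"no": 0.60, "yes": 0.40},
            ("high", "yes"): {"no": 0.50, "yes": 0.50},
        },
    ),
}


def joint_probability(network, assignment):
    """Multiply P(x_i | Parent(x_i)) over all nodes for one full assignment."""
    prob = 1.0
    for node, (parents, cpt) in network.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= cpt[parent_values][assignment[node]]
    return prob


# P(grade=high) * P(irradiation=no) * P(recurrence=yes | high, no)
p = joint_probability(
    network, {"grade": "high", "irradiation": "no", "recurrence": "yes"}
)
print(round(p, 4))
```

Nodes without parents contribute their prior probability and nodes with parents contribute one entry of their conditional probability table, so the joint probabilities over all assignments sum to one.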
Building a BN model includes steps of variable selection, structure learning and parameter learning, which can be undertaken by supervised learning from the data including expert opinion.
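Structure learning can be carried out in several ways. As one illustration, the sketch below follows the usual description of the Tree Augmented Network (TAN) approach named in the statistical analysis of this study: the conditional mutual information between each pair of predictors given the outcome is used as an edge weight, a maximum-weight spanning tree is built over the predictors, and the outcome node is then added as a parent of every predictor. The data, variable names and function names are hypothetical, and the implementation in Netica may differ in detail.

```python
import math
from collections import Counter
from itertools import combinations


def conditional_mutual_information(xs, ys, cs):
    """Estimate I(X; Y | C) from empirical frequencies."""
    n = len(cs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    total = 0.0
    for (x, y, c), count in n_xyc.items():
        # p(x, y, c) * log[ p(x, y | c) / (p(x | c) * p(y | c)) ]
        ratio = count * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        total += (count / n) * math.log(ratio)
    return total


def tan_feature_tree(columns, cs):
    """Maximum-weight spanning tree over the predictors (Prim's algorithm)."""
    names = list(columns)
    weight = {
        frozenset((a, b)): conditional_mutual_information(columns[a], columns[b], cs)
        for a, b in combinations(names, 2)
    }
    in_tree = [names[0]]  # the first predictor is taken as the root
    edges = []            # directed (parent, child) pairs
    while len(in_tree) < len(names):
        parent, child = max(
            ((a, b) for a in in_tree for b in names if b not in in_tree),
            key=lambda e: weight[frozenset(e)],
        )
        edges.append((parent, child))
        in_tree.append(child)
    return edges


# Hypothetical toy data: node_caps mirrors grade, side is unrelated to both.
columns = {
    "grade": [1, 1, 3, 3, 1, 3, 3, 1],
    "node_caps": ["no", "no", "yes", "yes", "no", "yes", "yes", "no"],
    "side": ["L", "R", "L", "R", "L", "R", "L", "R"],
}
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
edges = tan_feature_tree(columns, outcome)
print(edges)  # in a TAN, the outcome is additionally a parent of every predictor
```

In this toy example the strongest conditional dependence is between grade and node_caps, so that edge enters the tree first.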
Dataset: The dataset for building the Naïve Bayes model was procured from an online database, the UCI Machine Learning Repository (18). The data came from a breast cancer study to predict the recurrence of the disease based on certain attributes. The total sample size in the dataset was 286. There were nine variables in the dataset: age, menopause status, tumour size, number of nodes involved, presence of node caps, degree of malignancy, breast, breast quadrant and status of irradiation. The dataset was sourced from the Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia, provided by M. Zwitter and M. Soklic in 1988, and is available from: https://archive.ics.uci.edu/ml/datasets/breast+cancer. The dataset was inflated to a sample size of 1000 using a logistic regression equation with all the variables in the existing dataset as predictors and recurrence as the outcome. The total effective sample size used in the current manuscript was therefore 1000.
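The inflation procedure is not described in further detail. One possible reading, sketched below under that assumption, is a model-based bootstrap: observed predictor profiles are resampled with replacement and the outcome is redrawn from the recurrence probability given by a fitted logistic model. The coefficients, the 0/1 variable coding and the function names are hypothetical and are not the study's estimates.

```python
import math
import random


def recurrence_probability(row, intercept, coefs):
    """Logistic model: inverse-logit of the linear predictor."""
    linear = intercept + sum(coefs[name] * value for name, value in row.items())
    return 1.0 / (1.0 + math.exp(-linear))


def inflate(rows, intercept, coefs, target_size, seed=0):
    """Resample predictor rows with replacement and simulate a new outcome."""
    rng = random.Random(seed)
    inflated = []
    for _ in range(target_size):
        row = dict(rng.choice(rows))            # copy one observed profile
        p = recurrence_probability(row, intercept, coefs)
        outcome = 1 if rng.random() < p else 0  # Bernoulli draw
        inflated.append((row, outcome))
    return inflated


# Hypothetical 0/1-coded predictors and coefficients (assumed for illustration).
observed = [
    {"high_grade": 1, "left_breast": 1},
    {"high_grade": 0, "left_breast": 1},
    {"high_grade": 1, "left_breast": 0},
    {"high_grade": 0, "left_breast": 0},
]
coefs = {"high_grade": 1.2, "left_breast": 0.4}
sample = inflate(observed, intercept=-1.5, coefs=coefs, target_size=1000)
print(len(sample))
```

Because the simulated outcomes are generated by a logistic model, data inflated in this way will tend to favour logistic regression in later comparisons, which is worth bearing in mind when reading the results.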
Variables in the model: The dataset depicted the multivariable classification of the patients for the prognosis of breast cancer. The event of interest was the recurrence of the disease. The dataset contained the information for all the samples. The variables in the model were defined and categorised based on the criteria from the 8th edition of the AJCC Cancer Staging Form Supplement (19). The definitions and the recategorisation of the variables are given below:
1. Age of the patients at the time of diagnosis:
a. 10-39 years
b. 40-49 years
c. 50-59 years and
d. ≥60 years.
2. Menopausal status of the patient at the time of the diagnosis:
a. menopause at <40 years
b. menopause at ≥40 years and
c. premenopausal
3. The greatest diameter of the excised tumour. Based on the tumour size chart, they were categorised as
a. T1 (0-2 cm),
b. T2 (2-5 cm) and
c. T3 (>5 cm).
4. The number of axillary lymph nodes that contain metastatic breast cancer visible on histological examination:
a. 0-2,
b. 3-9 and
c. >10
5. The presence of node caps, that is, tumour involving the capsule of the lymph node; over time, with more aggressive disease, the tumour may replace the lymph node.
6. The histological grade of the tumour.
• 1,
• 2 and
• 3 where Grade 1 predominantly consists of cells that retain their usual characteristics and Grade 3 predominantly consists of cells that are highly abnormal.
7. The side of the affected breast.
8. The breast was also divided into five quadrants using nipple as a central point; categorised as
• left-up
• left-down
• right-up
• right-low and
• centre
9. Whether radiation therapy was given or not.
Statistical Analysis
The dataset was split into two parts, a training and a testing dataset. Approximately 70% of the data was used for training the models and the remaining 30% was used for testing their classification accuracy. The distribution of the prognostic variables across the binary outcome of recurrence was assessed in the training, testing and entire datasets. Univariate logistic regression was performed initially and, with a p-value <0.15 as the cut-off, the potential factors were used to build the multiple logistic regression model. A p-value <0.05 was considered statistically significant in the final model.
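A 70/30 split of this kind can be sketched as follows. The seed, the function name and the stand-in records are assumptions made for illustration; the original analysis was carried out in R.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it at the training fraction."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


records = list(range(1000))  # stand-in for the 1000 inflated patient records
train, test = train_test_split(records)
print(len(train), len(test))  # 700 300
```

Fixing the seed makes the partition reproducible, so that every model is trained and tested on exactly the same records.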
All the models were trained using the training dataset and then tested using both the training and testing datasets. The logistic regression model was built with all the potential prognostic variables, and the predicted probabilities were estimated from the model. The Naïve Bayes model was developed with Laplace smoothing. The BN model was built in two steps. Structure learning was carried out using the Tree Augmented Network (TAN) method (20). The conditional probabilities associated with each node were estimated using the Expectation-Maximisation (EM) method (21). Misclassification rate, sensitivity, specificity, Positive Predictive Value (PPV) and Negative Predictive Value (NPV) were estimated in both the training and testing datasets. All statistical analyses were performed in RStudio Version 1.2.1335, with Netica 6.09 used for the Bayes nets. The Naïve Bayes model was built using the naivebayes package.
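To make the role of Laplace smoothing concrete, a minimal sketch of a categorical Naïve Bayes classifier is given below. It is a plain illustration of the technique and not the implementation in the naivebayes package; the function names, the smoothing constant alpha = 1 and the toy records are assumptions.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels, alpha=1.0):
    """Store the counts needed for priors and smoothed P(value | class)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, class) -> value counts
    levels = defaultdict(set)            # feature -> observed levels
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(feature, label)][value] += 1
            levels[feature].add(value)
    return {"classes": class_counts, "counts": value_counts,
            "levels": levels, "alpha": alpha, "n": len(labels)}


def predict_proba(model, row):
    """Posterior P(class | row) under the conditional-independence assumption."""
    log_scores = {}
    for cls, n_cls in model["classes"].items():
        score = math.log(n_cls / model["n"])  # log prior
        for feature, value in row.items():
            count = model["counts"][(feature, cls)][value]
            k = len(model["levels"][feature])
            # Laplace smoothing: add alpha to every cell so no probability is zero
            score += math.log((count + model["alpha"]) / (n_cls + model["alpha"] * k))
        log_scores[cls] = score
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical toy records, not taken from the study dataset.
rows = [
    {"grade": "3", "irradiation": "yes"},
    {"grade": "3", "irradiation": "no"},
    {"grade": "1", "irradiation": "no"},
    {"grade": "2", "irradiation": "no"},
]
labels = ["recurrence", "recurrence", "no recurrence", "no recurrence"]
model = fit_naive_bayes(rows, labels)
posterior = predict_proba(model, {"grade": "3", "irradiation": "yes"})
print(posterior)
```

Without smoothing, the "no recurrence" class would receive a posterior of exactly zero here, because grade 3 never occurs with that class in the toy data.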
The distribution of all the factors in the model across both outcome categories in the training and testing datasets is given in (Table/Fig 1). The effect estimates from the univariate and multiple logistic regression models are shown in (Table/Fig 2). In the multiple logistic regression model, degree of malignancy and the side of the affected breast were the two variables that contributed significantly to the prediction of recurrence of breast cancer. The BN model developed using the TAN method for structure learning and the EM method for parameter learning is given in (Table/Fig 3). The probability distribution associated with each variable is given in the network model.
In the training dataset, the misclassification rate was 33.52% for logistic regression (LR), 31.09% for the BN model and 33.38% for the Naïve Bayes (NB) classifier, as given in (Table/Fig 4). When the same models were used to classify the recurrence status in the testing dataset, the misclassification rate was 35.1% for logistic regression, 36.42% for the BN model and 34.77% for the Naïve Bayes classifier. The sensitivity was poor for all the models. Specificity was excellent for all the models: 96.96% for the LR model, 91.52% for the BN model and 97.83% for the NB model in the training dataset, and 91.83% for the LR model, 87.02% for the BN model and 92.31% for the NB model in the testing dataset. PPV was estimated to be 56.25% for the LR model, 60% for the NB model and 60.6% for the BN model in the training dataset, and 22.73% for the LR model, 23.81% for the NB model and 30.56% for the BN model in the testing dataset. NPV was estimated to be 66.97% for the LR model, 66.86% for the NB model and 70.28% for the BN model in the training dataset, and 68.21% for the LR model, 68.33% for the NB model and 65.56% for the BN model in the testing dataset.
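The indices reported above all derive from a 2×2 cross-classification of predicted against observed recurrence. A minimal sketch follows; the function name and the cell counts are hypothetical and are not the study's data. The counts were chosen only to show how high specificity and low sensitivity can coexist when the outcome is imbalanced.

```python
def predictive_indices(tp, fp, tn, fn):
    """Accuracy indices from confusion-matrix counts (recurrence = positive)."""
    total = tp + fp + tn + fn
    return {
        "misclassification_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # recurrences correctly predicted
        "specificity": tn / (tn + fp),  # non-recurrences correctly predicted
        "ppv": tp / (tp + fp),          # predicted recurrences that are true
        "npv": tn / (tn + fn),          # predicted non-recurrences that are true
    }


# Hypothetical counts for 300 test records, 100 of them with recurrence.
indices = predictive_indices(tp=20, fp=10, tn=190, fn=80)
for name, value in indices.items():
    print(f"{name}: {value:.2%}")
```

With these counts the specificity is 95% while the sensitivity is only 20%, since a model that rarely predicts recurrence makes few false-positive errors but misses most true recurrences.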
In the present study, the prognostic factors associated with recurrence of breast cancer were determined. Degree of malignancy and side of the affected breast were found to have an impact on the outcome. A previous study has shown tumour size, grade of the cancer, nodal status and hormonal factors, along with smoking status, to have a significant association with recurrence of breast cancer (22). Another study has pointed out that receiving neoadjuvant chemotherapy reduced the risk of recurrence of breast cancer (23). The current dataset had variables related to disease status and not to lifestyle characteristics. The primary objective of this study was to compare the predictive ability of the BN, Naïve Bayes and logistic regression models. Even with imbalance in the proportion of the outcome variable, the BN model outperformed the other models overall. The misclassification rate in the training dataset was least for the BN model, and it provided a better ability to predict the recurrence of breast cancer, with better sensitivity, which is key in these models.
Naïve Bayes and logistic regression models have already been applied for predicting the recurrence of breast cancer and have performed considerably well (24). The Naïve Bayes classifier offers a novel approach for categorising patients and offers good performance with low algorithmic cost and high speed of computation. Another study has shown that the Naïve Bayes model performs as well as other equivalent machine learning techniques (25). With just seven prognostic factors, a nomogram based on the Naïve Bayes model gave 80% accuracy, suggesting that the model can be translated to practical use. Bayesian classifiers have gained importance in classification problems in health care studies and have performed better than the classical approach to prognostic modelling (26). Even amongst the Bayesian classifiers, the Naïve Bayes model with tree augmented structure and gradient boosting has been shown to perform well in predictive accuracy (27). A study by Choi J et al., showed that hybrid BN models have excellent predictive ability in comparison to other machine learning algorithms in predicting breast cancer prognosis (15). Hybrid BN models had an AUC of 0.935, compared to 0.930 and 0.813 for the artificial neural network and the classical BN model, respectively. BN models have also been applied in the prediction of the risk of triple negative breast cancer with epidemiological factors and have been shown to perform well (28). Studies have compared the BN model with other machine learning algorithms, like support vector machine and artificial neural network, for a binary outcome, and have found that it is better or comparable in handling missing data and in predictive accuracy (29),(30). The BN model has further been shown to incorporate complex interactions of prognostic factors and to support individualised patient care in oncology (31). This suggests that machine learning algorithms such as the BN model should be translated into a more viable option for clinicians to use.
Witteveen A et al., on the other hand, reported that conventional logistic regression models outperformed the BN model in predictive accuracy related to breast cancer (32). Although the BN model performed better in the development cohort, on validation the LR model had a C-statistic of 0.71 whereas it was 0.67 for the BN model. The difference observed in the overall predictive ability between the models is not high. Generally, the difference in the AUC or C-statistic is less than 0.05 in studies (33),(34). A study by Holm CE et al., has also shown that proper internal and external validation is not accounted for in BN models (35).
Limitation(s)
Our study was limited to the factors that were part of the secondary data source, which did not include some important established prognostic factors for recurrence of breast cancer. Variables such as HER2, oestrogen receptors, progesterone receptors and eGFR values could have improved the predictive ability of the models. The outcome was imbalanced and therefore the Synthetic Minority Oversampling Technique (SMOTE) for imbalanced classification could further strengthen the predictive accuracy of the models. External validation with an independent dataset was not performed to assess the generalisability of the model. Other estimates of predictive accuracy, such as AUC, Gini coefficient and C-index, which indicate the overall discriminatory ability of the model, could also have been estimated, but this study was intended to suggest alternative techniques for predicting a binary outcome.
The BN model can be used as an alternative model for predicting a binary outcome such as the recurrence of breast cancer. The predictive ability of the BN model was found to be better and it can handle imbalanced classification better. BN models also provide a visually intuitive model with fewer assumptions. With further improvement, they can provide a better predictive model for clinicians to use at the bedside.
The authors thank Dr. P. Venkatesan for his contribution in helping to understand the models that were applied in this study.
DOI: 10.7860/JCDR/2023/59472.17598
Date of Submission: Aug 05, 2022
Date of Peer Review: Oct 13, 2022
Date of Acceptance: Nov 11, 2022
Date of Publishing: Mar 01, 2023
AUTHOR DECLARATION:
• Financial or Other Competing Interests: None
• Was Ethics Committee Approval obtained for this study? No
• Was informed consent obtained from the subjects involved in the study? Yes
• For any images presented appropriate consent has been obtained from the subjects. NA
PLAGIARISM CHECKING METHODS:
• Plagiarism X-checker: Aug 06, 2022
• Manual Googling: Nov 01, 2022
• iThenticate Software: Nov 10, 2022 (7%)