Application of Principal Component Analysis in Dealing with Multicollinearity in Modelling Clinical Data
Correspondence Address :
Dr. N Sreekumaran Nair,
Admin Block, 4th Floor, Department of Biostatistics, Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, Tamil Nadu, India.
E-mail: nsknairmanipal@gmail.com
Introduction: One of the stringent assumptions about covariates in the Cox hazard and Logistic regression modelling is that they should be independent. Incorporating correlated covariates as such into the model might distort the precision of the estimates due to multicollinearity. One way to deal with multicollinearity is by using Principal Component Analysis (PCA) technique.
Aim: To demonstrate the application of PCA in dealing with correlated covariates while modelling time to event and case-control study data.
Materials and Methods: This study was conducted at Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India, from February 2021 to January 2022. Two datasets were used for the demonstration: one with a time to event outcome and one from a case-control study with a binary outcome. Lipids were the correlated covariates in both. Three sets of Cox regression models were used to demonstrate the change in hazard ratios with 95% Confidence Intervals (CI) when evaluating the effect of the intervention at different times of lipid measurement. Model I evaluated the treatment/Body Mass Index (BMI) effect on the outcome ignoring the effect of the lipid parameters. Model II evaluated the treatment/BMI effect on the outcome by incorporating the lipid variables but ignoring multicollinearity. Model III evaluated the treatment/BMI effect on the outcome by incorporating the lipid variables through principal component analysis, thus adjusting for multicollinearity. Similarly, the same three sets of models were fitted by logistic regression to evaluate the effect of the exposure (BMI). The comparability of lipids between the two groups in both datasets was tested using Hotelling’s T-squared statistic.
Results: The lipids measured at the 12th, 24th and 36th months differed significantly between the two groups in the first dataset, as did the lipids between cases and controls in the second dataset. In the first dataset, at baseline, the Hazard Ratios (HRs) were statistically similar irrespective of the model used; for lipids measured at the 12th, 24th and 36th months, the HRs decreased successively, with narrowing 95% CIs, moving from model I to model III. Further, at the 24th and 36th months, the HR in model III was found to be significant. In the second dataset, the Odds Ratio (OR) was significant in all three models; it was almost the same in models I and II but elevated in model III.
Conclusion: The multicollinearity issue should be properly addressed before including correlated covariates in the Cox regression hazard and Logistic regression model. The PCA technique would be a favourable method.
Analysis of correlated outcomes, Case-control study, Cox regression hazard model, Logistic regression model
Introduction
Any model that establishes the effect of potential covariates on the outcome variable should comply with the nature of the outcome or dependent variable. For a longitudinal study with a time to event outcome variable, the commonly used statistical approach is the Cox hazard regression model (1),(2). Similarly, for a case-control study with a binary outcome variable, the suitable approach is the logistic regression model (1). One of the assumptions of such regression models is that the predictor variables should not be correlated with each other. However, in biomedical research the predictors under consideration may not be truly independent but rather correlated. Such dependency between the covariates in regression modelling leads to a condition known in statistics as multicollinearity, which means that a covariate can be predicted from the remaining covariates (3),(4).
The main issue with multicollinearity is that the estimate of the regression coefficient of one of the correlated predictors depends on the presence of the other predictors in the model. Also, due to multicollinearity, the estimated standard errors of the regression coefficients may be inflated, which could lead to spurious results. Variables in clinical research studies are usually found to be correlated (5),(6),(7),(8),(9). This implies that a change in one variable is associated with a change in another variable. There are studies that have established the association of lipid parameters with outcomes of interest such as Cardiovascular Disease (CVD) and Sudden Sensorineural Hearing Loss (SSNHL) (10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24).
The researcher evaluating the effect of an intervention/exposure on the outcome has to be careful in dealing with the effect of such multiple correlated predictors. There are studies in which the multicollinearity of lipids was not addressed while modelling the outcome variable with Cox hazard or logistic regression models (19),(20),(21),(22),(23),(24). In these cited studies, the effect of the intervention/exposure was evaluated by introducing the lipid parameters into the model as covariates as such. Due to multicollinearity, the resulting point estimates of the Hazard Ratio (HR) or Odds Ratio (OR) of the intervention/exposure are likely to be unreliable. Moreover, incorporating the correlated covariates into the model as such may weaken the statistical power of such regression models. In such conditions, the researcher will draw conclusions from an effect estimate of compromised precision. To address the multicollinearity issue, methods like Partial Least Squares (PLS), Ridge Regression (RR) and Principal Component Analysis (PCA) have been suggested. The PLS and RR methods are used for a continuous outcome variable. Since the outcomes considered in this article are binary events, the PCA technique was used (25).
The objective of this study was to demonstrate the application of the PCA method in dealing with multicollinearity in Cox and logistic regression models. The demonstration used two datasets. The first dataset was from the ACCORD BP (Action to Control Cardiovascular Risk in Diabetes Blood Pressure) trial, in which the outcome was time to event. The second dataset was from a case-control study on Sudden Sensorineural Hearing Loss (SSNHL). Lipids were the correlated covariates in both datasets.
Materials and Methods
This study was conducted at Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India, from February 2021 to January 2022.
Brief Description of Dataset
For the demonstration the following two datasets were used.
ACCORD BP trial dataset (26): The ACCORD trial dataset was available from Biologic Specimen and Data Repository Information Coordinating Centre (https://biolincc.nhlbi.nih.gov/home/) of National Heart, Lung, and Blood Institute, upon institutional request. It was an open-label multicentric randomised trial of 84 months follow-up. A total of 4733 high-risk type 2 diabetes mellitus eligible participants were randomised into two study groups:
• Intensive BP control group (n=2362): Treatment strategy was to lower Systolic Blood Pressure (SBP) below 120 mmHg.
• Standard BP control group (n=2371): The strategy was to lower SBP below 140 mmHg.
The treatment strategy followed in the respective BP control groups was for the comparison in reducing CVD events. The primary outcome variable considered was a composite of non fatal Myocardial Infarction (MI), non fatal stroke and CVD death whichever occurred first.
The five lipid parameters were measured at baseline and thereafter on yearly basis:
• Total Cholesterol (TC)
• Triglyceride (TG)
• Very Low Density Lipoprotein (VLDL)
• Low Density Lipoprotein (LDL)
• High Density Lipoprotein (HDL)
The participants who were not measured for their lipid parameters at different follow-ups were excluded from the analysis.
SSNHL case-control study dataset (10): The SSNHL case-control study dataset was made publicly available by its authors through Dryad (http://dx.doi.org/10.5061/dryad.r2b1n). A total of 324 hospitalised cases of SSNHL and 972 controls with normal hearing were taken. Underweight subjects as per World Health Organisation (WHO) criteria (BMI ≤18.5 kg/m²) among the cases and controls were excluded from the analysis (27). Data on BMI and the lipid parameters TC, TG, LDL and HDL were available for the cases and controls.
Models
Cox hazard regression model: The Cox proportional hazard model was used when the covariates considered in the model satisfied the proportionality assumption. For a random binary outcome variable Y with a vector of covariates $X = [X_1\ X_2\ \ldots\ X_p]$ and the corresponding vector of coefficients $\beta' = [\beta_1\ \beta_2\ \ldots\ \beta_p]$, the Cox proportional hazard model with hazard rate $h(t \mid X)$ at any time t is expressed as:
$$h(t \mid X) = h_0(t)\, e^{X\beta}$$
Where, $h_0(t)$ is an unspecified non negative function of time called the baseline hazard at time t. Thus, the HR for an individual of the jth group with a 1×p vector of covariates X against an individual of the kth group with a vector of the same covariates can be obtained as:
$$HR = \frac{h(t \mid X_j)}{h(t \mid X_k)} = e^{(X_j - X_k)\beta}$$
Where, $X_j$ and $X_k$ are the vectors of the same covariates X for the jth and kth groups respectively (1),(2),(3).
Cox time-dependent hazard model: The Cox time-dependent model was used when the covariates considered did not satisfy the proportionality assumption. The time-dependent Cox hazard model with hazard rate $h[t \mid X(t)]$ at time t is expressed as:
$$h[t \mid X(t)] = h_0(t)\, e^{X(t)\beta}$$
Where, $h_0(t)$ is an unspecified non negative function of time called the baseline hazard at time t. Thus, the HR can be obtained as:
$$HR(t) = \frac{h[t \mid X_j(t)]}{h[t \mid X_k(t)]} = e^{[X_j(t) - X_k(t)]\beta}$$
Where, $X_j(t)$ and $X_k(t)$ are the vectors of the same covariates X(t) at time t for the jth and kth groups respectively. The estimate of the HR associated with the ith covariate and the corresponding (1-α) Confidence Interval (CI) were obtained from the estimate $b_i$ of $\beta_i$ and its standard error as $e^{b_i}$ and $e^{b_i \pm Z_{\alpha/2}\, SE(b_i)}$, respectively.
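As a minimal sketch of how a ratio estimate and its interval follow from a fitted coefficient, the snippet below computes exp(b) with the Wald interval exp(b ± z·SE). The function name `ratio_with_ci` and the numeric inputs are hypothetical and chosen only for illustration; they are not estimates from either dataset.

```python
import math
from statistics import NormalDist


def ratio_with_ci(b, se, alpha=0.05):
    """Return (ratio, lower, upper) for exp(b) with a Wald (1 - alpha) interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile Z_{alpha/2}
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


# Hypothetical coefficient and standard error, for illustration only.
hr, lower, upper = ratio_with_ci(b=-0.20, se=0.10)
print(f"HR = {hr:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

The same function applies to an odds ratio from a logistic regression coefficient, since only the interpretation of exp(b) changes.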
Logistic regression model: The logistic regression model was used for the case control study dataset. For a random outcome variable Y with a vector of covariates $X = [X_1\ X_2\ \ldots\ X_p]$ and the corresponding vector of coefficients $\beta' = [\beta_0\ \beta_1\ \beta_2\ \ldots\ \beta_p]$, the estimate of the Odds Ratio (OR) was obtained by using the logistic model:
$$\ln\left[\frac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)}\right] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$
Where, $\beta_0$ is the constant, called the intercept of the regression equation.
Thus, the OR for an individual of the jth group, with $p_j$ being the probability of occurrence of the event given the vector of covariates X, against an individual of the kth group, with $p_k$ being the probability of occurrence of the event given the vector of the same covariates, can be obtained as:
$$OR = \frac{p_j/(1-p_j)}{p_k/(1-p_k)} = \exp\left\{\sum_{i=1}^{p} \beta_i\,(X_{ji} - X_{ki})\right\}$$
The estimate of the OR associated with the ith covariate and the corresponding (1-α) CI were obtained from the estimate $b_i$ of $\beta_i$ and its standard error as $e^{b_i}$ and $e^{b_i \pm Z_{\alpha/2}\, SE(b_i)}$, respectively.
Principal Component Analysis (PCA): It is a data dimension reduction technique. It creates a new set of uncorrelated variables, known as Principal Components (PCs), as linear combinations of all the correlated variables. Generally, the first few PCs explain most of the total variability of all the correlated variables (28).
The general PCA equation to create the independent variables is given by:
$$PC_i = e_i^T X = e_{i1}X_1 + e_{i2}X_2 + \cdots + e_{ip}X_p, \quad i = 1, 2, \ldots, p$$
which maximises the variance of $e_i^T X$ subject to the conditions $e_i^T e_i = 1$ and $\mathrm{Cov}(e_i^T X, e_k^T X) = 0$ for $i \neq k$, where $X' = [X_1\ X_2\ X_3\ \ldots\ X_p]$ is a random vector of p correlated variables with covariance matrix S having eigen values $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$, and $e_i^T$ is the transpose of the eigen vector corresponding to the ith eigen value ($\lambda_i$). All the PCs are uncorrelated and have variances equal to the eigen values of S, i.e., $\mathrm{Var}(PC_i) = \lambda_i$. Thus, the first PC explains the maximum variation in the data, followed by the second component and so on. For both datasets, the new independent variables were created using the measured values of the lipid parameters.
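The construction of PCs from the eigen values and eigen vectors of the covariance matrix can be sketched as follows. This is an illustrative sketch using NumPy on simulated data, not the code used for the analysis; the function name `principal_components` and the simulated matrix standing in for lipid values are assumptions.

```python
import numpy as np


def principal_components(X):
    """PCA via eigen-decomposition of the sample covariance matrix.

    X is an (n, p) array: one row per subject, one column per variable.
    Returns eigen values (descending), eigen vectors (columns) and PC scores.
    """
    Xc = X - X.mean(axis=0)               # centring shifts the scores, not their variances
    S = np.cov(Xc, rowvar=False)          # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh is meant for symmetric matrices
    order = np.argsort(eigvals)[::-1]     # reorder so lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                 # column i holds PC_i = e_i^T X
    return eigvals, eigvecs, scores


# Simulated correlated variables (hypothetical stand-in for lipid measurements).
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))
X = shared + 0.3 * rng.normal(size=(500, 4))

eigvals, eigvecs, scores = principal_components(X)
pc_cov = np.cov(scores, rowvar=False)
print(np.round(pc_cov, 6))                   # off-diagonal entries are numerically zero
print(np.round(eigvals / eigvals.sum(), 3))  # share of total variance per PC
```

The covariance matrix of the scores is diagonal with the eigen values on its diagonal, which is the property Var(PC_i) = λ_i together with zero covariance between distinct PCs.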
Statistical Analysis
The following three sets of models were used in analysis for both the data sets:
Model I: The treatment/BMI effect on the outcome was evaluated ignoring the effect of the lipid parameters.
Model II: The treatment/BMI effect on the outcome was evaluated by incorporating the lipid variables but ignoring multicollinearity.
Model III: The treatment/BMI effect on the outcome was evaluated by incorporating the lipid variables through principal component analysis, thus adjusting for multicollinearity.
The methodological component and the analysis are explained below for each dataset separately.
ACCORD BP trial dataset: The Pearson correlation coefficients between lipids parameters were computed with log-transformed values of lipids due to their skewed distribution (Table/Fig 1).
Three Cox proportional hazard regression models were fitted. The treatment group was taken as the main predictor variable. The Cox proportional hazard regression model was used if proportionality assumptions were satisfied; the Cox time-dependent regression hazard model otherwise. The proportionality assumption of each covariate was tested by Schoenfeld’s global test (29).
The HRs with 95% CIs were estimated across all three models for lipid measurements at baseline and at the 12th, 24th and 36th month follow-ups. The lipid parameters were introduced into the models after a significant difference in lipids between the two treatment groups had been observed. This was tested by the multivariate Hotelling’s T-squared statistic, as the lipids were correlated (28),(29),(30). For both datasets, the test was performed on log-transformed lipid values to meet its assumptions, as the distributions were skewed. The lipid parameters were found to differ significantly between the two groups at each time point except baseline (Table/Fig 2).
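A two-sample Hotelling’s T-squared statistic can be sketched as below. This is an illustrative version that assumes a pooled covariance matrix (equal covariance in both groups) and multivariate normality; the function name `hotelling_t2` and the simulated groups are hypothetical and do not reproduce the study data.

```python
import numpy as np


def hotelling_t2(A, B):
    """Two-sample Hotelling's T-squared with a pooled covariance matrix.

    A and B are (n1, p) and (n2, p) arrays. Returns (T2, F, df1, df2); under
    the null hypothesis of equal mean vectors, F follows an F(df1, df2) law.
    """
    n1, p = A.shape
    n2 = B.shape[0]
    diff = A.mean(axis=0) - B.mean(axis=0)
    S_pooled = ((n1 - 1) * np.cov(A, rowvar=False)
                + (n2 - 1) * np.cov(B, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * (diff @ np.linalg.solve(S_pooled, diff))
    df1, df2 = p, n1 + n2 - p - 1
    f_stat = t2 * df2 / (p * (n1 + n2 - 2))
    return t2, f_stat, df1, df2


# Simulated groups with a small shift in the mean vector (hypothetical data).
rng = np.random.default_rng(2)
group_a = rng.normal(size=(120, 4))
group_b = rng.normal(size=(150, 4)) + np.array([0.3, 0.0, -0.2, 0.1])

t2, f_stat, df1, df2 = hotelling_t2(group_a, group_b)
print(f"T2 = {t2:.2f}, F({df1}, {df2}) = {f_stat:.2f}")
```

A p-value can then be read from the upper tail of the F distribution with (df1, df2) degrees of freedom, for example with `scipy.stats.f.sf(f_stat, df1, df2)` if SciPy is available.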
The eigen values (λi) and the corresponding transpose of eigen vectors (eiT) were obtained for intensive and standard BP control groups separately. Further, PCA was performed in each group to create new independent variables for the random vector X'=[TC,TG,VLDL,LDL,HDL] with the covariance matrix as:
Using PC equations, data was generated for the first three independent PC’s at baseline, 12th, 24th and 36th months, respectively. These first three PC’s were able to explain more than 99% of the total variation in lipids at each considered time point. The effect of intervention in model III was evaluated by adjusting for the effect of newly formed independent PC’s in the Cox hazard model. The significance of HR’s was judged by their 95% CI’s.
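The choice of how many PCs to retain can be checked from the eigen values alone, since each PC's share of the total variation is its eigen value divided by the sum of all eigen values. The sketch below uses hypothetical eigen values (not those obtained from the ACCORD BP data) and a hypothetical helper named `components_needed`.

```python
def components_needed(eigenvalues, threshold=0.99):
    """Smallest k such that the first k PCs explain at least `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigen values for five correlated variables, for illustration only.
example = [3.10, 1.20, 0.66, 0.03, 0.01]
k = components_needed(example)
print(k)
```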
SSNHL case control study dataset: Similarly, the correlation coefficients between the lipid parameters were computed with log-transformed lipid values (Table/Fig 1). BMI was considered the primary exposure for the SSNHL data. The same three sets of models were fitted by logistic regression for the SSNHL dataset. The OR with 95% CI was estimated for BMI, which was categorised as normal (BMI 18.5 to 24.99 kg/m²) and overweight or obese (BMI ≥25 kg/m²) (29). Again, the difference in the lipid parameters between cases and controls was tested using the multivariate Hotelling’s T-squared statistic (Table/Fig 2). The lipids were introduced into the model as they differed significantly between the groups.
Similarly, the eigen values (λi) and the corresponding transpose of eigen vectors (eiT ) were obtained for cases and control separately and corresponding PCA was performed for each group for a random vector X'=[TC,TG,LDL,HDL] with covariance matrix as:
The first three independent PCs were generated, which were able to explain 99% of the total variation of all lipids. The effect of BMI in model III was evaluated by adjusting for the effect of the newly formed independent PCs in the logistic regression model. The significance of the ORs was judged by their 95% CIs. The analysis was carried out using R Studio version 3.6.1 (31), Statistical Package for Social Sciences (SPSS) version 19.0 (32) and Stata version 13 (StataCorp) (33).
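The mechanics of fitting model II and model III can be sketched on simulated data. This is not the authors' code and does not reproduce their results: the data are simulated, the logistic fit is a hand-written Newton-Raphson routine (`fit_logistic`, a hypothetical helper), and, as a simplifying assumption, the PCA is run on the pooled sample rather than separately within cases and controls.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson; returns (coefficients, standard errors).

    X must already include a leading column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        info = X.T @ (X * (p * (1.0 - p))[:, None])  # Fisher information matrix
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, np.sqrt(np.diag(np.linalg.inv(info)))


# Simulated data: a binary exposure and four highly correlated covariates.
rng = np.random.default_rng(1)
n = 2000
exposure = rng.integers(0, 2, size=n).astype(float)
lipids = rng.normal(size=(n, 1)) + 0.2 * rng.normal(size=(n, 4))
true_logit = -1.0 + 0.4 * exposure + 0.5 * lipids[:, 0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)
ones = np.ones((n, 1))

# Model II: exposure plus the raw, correlated covariates.
X2 = np.hstack([ones, exposure[:, None], lipids])
b2, se2 = fit_logistic(X2, y)

# Model III: exposure plus the first three PCs of the covariates.
centred = lipids - lipids.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
top = eigvecs[:, np.argsort(eigvals)[::-1][:3]]  # eigen vectors of the 3 largest eigen values
pcs = centred @ top
X3 = np.hstack([ones, exposure[:, None], pcs])
b3, se3 = fit_logistic(X3, y)

for name, b, se in [("Model II", b2[1], se2[1]), ("Model III", b3[1], se3[1])]:
    print(name, "OR =", round(float(np.exp(b)), 3), "95% CI =",
          round(float(np.exp(b - 1.96 * se)), 3), "-",
          round(float(np.exp(b + 1.96 * se)), 3))
```

In model III the adjustment variables are mutually uncorrelated by construction, so the coefficient of the exposure is estimated alongside covariates that no longer predict one another.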
Results
The correlation coefficients between the lipid parameters and their significance are shown in (Table/Fig 1) for both datasets. For the ACCORD BP trial data, the intensive and standard BP control groups were statistically similar for the lipid parameters at baseline but differed significantly at the 12th, 24th and 36th months (Table/Fig 2). Similarly, the lipid profiles of the case and control groups were significantly different in the SSNHL data (Table/Fig 2).
For the ACCORD BP trial data, (Table/Fig 3) compares the effect of the intervention (HR and 95% CI) assessed by the three models at the different times of lipid measurement. At baseline, the HRs were statistically similar, with slight variation, irrespective of the model used. At the 12th, 24th and 36th months, however, the HRs differed across the three models: they decreased successively, with narrowing 95% CIs, moving from model I to model III. For the lipid measurement at the 12th month, the HRs and corresponding 95% CIs in models I, II and III were 0.885 (0.713-1.099), 0.860 (0.692-1.068) and 0.835 (0.672-1.038). At the 24th month they were 0.835 (0.681-1.025), 0.818 (0.667-1.004) and 0.806 (0.657-0.990), and at the 36th month they were 0.835 (0.674-1.035), 0.820 (0.656-1.025) and 0.695 (0.546-0.883). Moreover, at the 24th and 36th months, the HRs from models I and II were not statistically significant, whereas the HR from model III was significant.
For the SSNHL data, the effect of BMI (OR and 95% CI) was compared between the three models (Table/Fig 4). The OR with 95% CI was 1.465 (1.119-1.917) for model I and 1.467 (1.106-1.945) for model II, indicating that the point estimates were similar and that their precision did not differ much. Model III, however, showed a different picture: the OR with 95% CI was 1.988 (1.425-2.773), which is elevated relative to models I and II, and the 95% CI was wider too.
Discussion
In Cox proportional hazard and logistic regression models, the assumption that the covariates are free of multicollinearity is often overlooked in medical research. This leads to compromised precision of the estimates. Covariates need to be independent when considered in the model; otherwise, the presence of multicollinearity may distort the true estimates and thus lead to biased findings (1),(3),(4). Multivariate statistical approaches, which deal with multiple correlated outcomes, can be applied to such problems. PCA is one of them, and it has the potential to derive independent PCs. Moreover, it reduces the dimension of the correlated data, and the first few components can explain almost all of the variation in the data. Thus, instead of using the correlated covariates as such in the model, a few PCs can be included with little loss of information. This PCA approach addresses the issue of multicollinearity with a smaller number of predictors.
A few of the cited studies evaluated the effect of interventions/exposures on the outcome. In these studies, the lipid parameters associated with the outcome of interest were incorporated into the model as such, and conclusions were thus drawn while ignoring the effect of multicollinearity (19),(20),(21),(22),(23),(24). Pedersen TR et al., used the Cox hazard model to compare the event rate of the primary outcome of major coronary events in patients treated with high-dose atorvastatin against usual-dose. The HR was estimated for the primary endpoint adjusting for other variables, including TC and HDL as simultaneous covariates. The conclusion favoured high-dose atorvastatin in reducing the primary outcome (19). However, the precision of the estimated HR would have been more reliable if the multicollinearity among the lipids had been addressed using PCA. Ting ZWR et al., examined the effects of the use of statins and fibrates on the onset of CVD in Chinese diabetic patients using the Cox model. The HRs were estimated for the lipids LDL, HDL and TG by adjusting for the effects of several identified covariates. These correlated lipids were considered as separate covariates in the model. The reliability of the estimates may be questionable, as the multicollinearity among them was ignored (20). Hou Q et al., used a logistic regression model to identify the relevant predictors of the presence of carotid plaque in general Chinese adults. They identified age, gender, DBP and TC as the independent predictors of carotid plaque. Since age, DBP and TC are correlated predictors, the estimates of their ORs, as well as that of gender, may not be precise, as multicollinearity was not accounted for. At least by using PCA, a more precise estimate for gender could have been obtained (22). The present study demonstrated the application of the PCA technique in dealing with multiple correlated covariates.
This could help medical researchers/clinicians obtain more valid and precise estimates of the effect of an intervention/exposure. The findings from the ACCORD BP trial dataset and the SSNHL case-control study dataset across the three comparative models suggest the importance of PCA in enhancing the reliability of the estimates with improved precision (Table/Fig 3). Although this study demonstrated the application of PCA to address multicollinearity for continuous correlated covariates, the concept could also be employed for correlated categorical covariates. This could be an interesting area of future research.
Limitation(s)
This study demonstrated the application of PCA to address multicollinearity for continuous correlated covariates and not for categorical correlated covariates.
Conclusion
The study demonstrates that multicollinearity among the covariates should be addressed before they are included in a Cox regression or logistic regression model. The PCA technique could be one of the ways to address this issue and obtain reliable and precise estimates for the covariates of interest.
Acknowledgement
The authors are sincerely grateful to the National Heart, Lung, and Blood Institute (NHLBI) Biologic Specimen and Data Repository Information Coordinating Centre for providing access to the ACCORD Research Materials through a Research Materials Distribution Agreement (RMDA), and are also thankful to the ACCORD trial group. The authors also acknowledge the authors of the SSNHL study for making their data publicly available, and express their sincere gratitude to the members of the Doctoral advisory committee for their valuable suggestions.
DOI: 10.7860/JCDR/2022/55379.16629
Date of Submission: Feb 02, 2022
Date of Peer Review: Mar 30, 2022
Date of Acceptance: Apr 27, 2022
Date of Publishing: Jul 01, 2022
AUTHOR DECLARATION:
• Financial or Other Competing Interests: None
• Was Ethics Committee Approval obtained for this study? The IEC granted waiver of consent for the study.
• Was informed consent obtained from the subjects involved in the study? NA
• For any images presented appropriate consent has been obtained from the subjects. NA
PLAGIARISM CHECKING METHODS:
• Plagiarism X-checker: Feb 05, 2022
• Manual Googling: Apr 26, 2022
• iThenticate Software: Jun 02, 2022 (8%)