MSDS 490: Healthcare Analytics and Decision Making
Project 3
Due Date: 11/18/2024 (Monday Midnight)
Submission Instructions: Zip all your files (R code, data, Word document, figures, etc.) into a single archive, following the file naming convention Last Name First Name Project#.zip. Use the online submission tools in Canvas to submit this project.
Total Score: 100.
Liver cirrhosis is a late stage of scarring (fibrosis) of the liver caused by many forms of liver disease, such as hepatitis, chronic alcoholism, and metabolic disease. It is a severe condition that can lead to liver failure, and it is associated with significant morbidity and mortality. The progression of liver cirrhosis is often staged, and various clinical factors can influence patient survival. This project uses a dataset of cirrhosis patients to estimate their survival using survival analysis.
Dataset Information. The dataset includes clinical and demographic variables, along with survival outcomes. It provides information on patient age, gender, clinical conditions (Patient Disease Stage, Frailty), lab results (Albumin, Triglycerides, Platelets, Total Cholesterol), and derived scores (MELD). The data also record the day on which each MELD score became available. Two types of outcomes are possible for a patient before the study end time or before the patient is lost to follow-up: (1) receiving a liver transplant, or (2) death. The number of days between the cirrhosis diagnosis and the earliest of these events (N Days) and the type of event (Status) are both provided. In the ‘Status’ column, ‘C’ indicates censored, ‘D’ indicates death, and ‘CL’ indicates liver transplant.
| Variable Name | Type | Description | Missing Value |
|---|---|---|---|
| ID | Integer | Unique identifier | No |
| N Days | Integer | number of days between diagnosis and the earlier of death, transplantation, or study analysis time | No |
| Status | Categorical | status of the patient: C (censored), CL (censored due to liver transplant), or D (death) | No |
| Age | Integer | age (days) | No |
| Gender | Categorical | Male (M) or Female (F) | No |
| Albumin | Continuous | albumin | No |
| Triglycerides | Continuous | triglycerides | Yes |
| Platelets | Integer | platelets per cubic (ml/1000) | Yes |
| Stage | Categorical | initial stage of the disease | No |
| Cholesterol | Integer | serum cholesterol (mg/dl) | Yes |
| Frailty | Categorical | low (L), intermediate (I), and severe (S) | No |
| MELD TimeStamp | String | days at which the MELD score is available | No |
| MELD | Integer | MELD score | No |
Note: Treat ‘CL’ as ‘C’ for Parts 1-7. Consider the columns MELD and MELD TimeStamp only for Part 9.
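The two event codings implied by the note above can be sketched as follows. This is only an illustrative sketch in Python (the project itself calls for R); the function name `event_indicator` and the `competing` flag are hypothetical and not part of the dataset or any library. The numeric codes (0 = censored, 1 = death, 2 = transplant) are an assumed convention.

```python
def event_indicator(status: str, competing: bool = False) -> int:
    """Map a Status code to a numeric event indicator.

    competing=False (Parts 1-7): death is the only event; 'C' and 'CL' are censored.
    competing=True  (Part 8):    0 = censored, 1 = death, 2 = liver transplant.
    The numeric codes are an assumed convention, not something the dataset defines.
    """
    code = status.strip().upper()
    if code not in {"C", "CL", "D"}:
        raise ValueError(f"unexpected Status value: {status!r}")
    if code == "D":
        return 1
    if code == "CL" and competing:
        return 2
    return 0


statuses = ["C", "CL", "D", "D", "CL"]
print([event_indicator(s) for s in statuses])                  # [0, 0, 1, 1, 0]
print([event_indicator(s, competing=True) for s in statuses])  # [0, 2, 1, 1, 2]
```

Keeping the recoding in one place makes it easy to switch from the single-event setup of Parts 1-7 to the competing-event setup of Part 8 without touching the raw Status column.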
1. Use the Multivariate Imputation by Chained Equations (MICE) method to impute missing values in the dataset, generating 10 imputed datasets with 10 iterations each. After completing the imputations, pool the imputed datasets and analyze the results. Your analysis should include a comparison of the distributions (mean, median, etc.) of features before and after imputation.
2. Plot Kaplan-Meier survival curves stratified by different stages of liver cirrhosis. How does the survival probability differ across the stages of cirrhosis?
3. Ignore the MELD score covariate and its associated timestamp information. Calculate the hazard ratios for all attributes in the dataset using Cox proportional hazards regression. Which features significantly affect survival?
4. Check the proportional hazards assumption and also the linearity assumption of continuous covariates (Albumin, SGOT) in the Cox model.
5. Assume that the frailty variable does not meet the proportional hazards assumption. Stratify the analysis by frailty. How do the hazard ratios change when stratifying by frailty, and what is the impact on the model?
6. Adjust the model to include interaction terms between frailty and gender. With the interaction terms included, interpret the hazard ratios comparing patients with frailty status ‘L’ to those with status ‘I’ and ‘S’.
7. Plot the adjusted survival curves stratified by different stages. How do these curves differ from the Kaplan-Meier curves?
8. Treating ‘CL’ as a competing event, perform a competing risks analysis. What are the subdistribution hazard ratios for death and transplant? How do the results of the competing risks analysis compare to the traditional Cox regression model? Plot the cumulative incidence graph.
9. Consider the MELD score as a time-varying covariate. Rerun the Cox regression analysis including this covariate. How do the results change when incorporating the MELD score as a time-varying covariate?
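For Part 9, a time-varying covariate is usually handled by splitting each patient's follow-up into (start, stop] intervals, with one covariate value per interval and the event flagged only on the last interval. The sketch below illustrates that restructuring in Python (the project itself calls for R). It rests on several assumptions: the MELD TimeStamp string has already been parsed into (day, score) pairs, whose actual format is not specified here; follow-up before the first MELD measurement is dropped because the score is unknown then; and the function name `to_counting_process` is hypothetical.

```python
def to_counting_process(n_days, event, meld_measurements):
    """Split one patient's follow-up into (start, stop, meld, event) rows.

    n_days            -- total follow-up in days (the N Days column)
    event             -- 1 if the patient died at n_days, 0 if censored
    meld_measurements -- (day, score) pairs; ASSUMED to be already parsed
                         from the MELD TimeStamp and MELD columns
    """
    # Keep only measurements taken before the end of follow-up, in time order.
    obs = sorted((d, s) for d, s in meld_measurements if d < n_days)
    rows = []
    for i, (start, score) in enumerate(obs):
        is_last = i + 1 == len(obs)
        stop = n_days if is_last else obs[i + 1][0]
        if stop <= start:
            continue  # two measurements on the same day: keep the later one
        # The event can only occur at the end of the final interval.
        rows.append((start, stop, score, event if is_last else 0))
    return rows


# Hypothetical patient: died on day 400, MELD recorded on days 0, 90, and 250.
for row in to_counting_process(400, 1, [(0, 12), (90, 15), (250, 21)]):
    print(row)
# (0, 90, 12, 0)
# (90, 250, 15, 0)
# (250, 400, 21, 1)
```

In R, the `survival` package's `tmerge` function builds this kind of interval data, and `coxph` accepts it through `Surv(start, stop, event)`.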