Chris Mainey - BSOL ICB
Generated by Chat-GPT and Dall-E 3.
Any thoughts on the prediction error here?
This talk draws heavily on Professor Galit Shmueli's 2010 paper of the same name:
Shmueli, G. (2010), To Explain or To Predict?, Statistical Science, vol 25 no 3, pp. 289-310.
This talk draws heavily on Professor Galit Shmueli's 2010 paper of the same name:
Shmueli, G. (2010), To Explain or To Predict?, Statistical Science, vol 25 no 3, pp. 289-310.
"Since all models are wrong, the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad" Box (1976)
This talk draws heavily on Professor Galit Shmueli's 2010 paper of the same name:
Shmueli, G. (2010), To Explain or To Predict?, Statistical Science, vol 25 no 3, pp. 289-310.
"Since all models are wrong, the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad" Box (1976)
Through out this, keep asking yourself: what is your question?
You can use many of the same models to fit in either context, but how you do it is different!
Prof. Shmueli's paper laments that statisticians had almost exclusively on 'explanatory' models.
With the increasing accessibility of Data Science and Machine Learning, the focus of many modern practitioners has swung the other way.
Some of you may always be approaching a model as a prediction question.
What I'm presenting here today is fairly agnostic to your approach, be it bayesian / frequentist / whatever.
Y=F(X)
Y=F(X)
E(Y)=f(X)
Y=F(X)
E(Y)=f(X)
Shmueli, G. (2010), http://www.jstor.org/stable/41058949
...don't be scared, it's not that bad...
We are trying to model how X causes something, without being constrained by what data we have. This can be concepts such as Y = depression, and F(x) could be things like: anxiety, past trauma, physical health, stress... etc. We can't measure them directly, so
Resued from: Arnold et al. Int J Epidemiol, Volume 49, Issue 6, December 2020, Pages 2074–2082, https://doi.org/10.1093/ije/dyaa049
What do I mean by 'causes?' It's not the same as 'associated with'. There is an 'exposure' to 'outcome' effect, and a temporal element: i.e. exposure before outcome. This DAG is hypothesising the causal relationship between chemotherapy and venous thromoembolism (VTE)
The arrows indicator the direction of causal relationships. Age, sex, tumour site and tumour size are confounding this relationship and should be adjusted for in a model, but platelet count is a mediator and should not.
We will use an example I sourced from Kaggle, related to the paper:
Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5
https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data/data
We will use an example I sourced from Kaggle, related to the paper:
Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5
https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data/data
Despite talking about regression models a lot, I know they are not the most straight-forward thing, and you may never have encountered them.
They are a foundation of many mode complex models, but so I'll try and do regression 2 minutes
y=α+βx+ϵ
y=α+βixi+ϵ
We can 'zoom out' to a more the 'Generalized Linear Model (GLM):
For distributions in the exponential family, GLM allows the linear model to relate to response through a function:
g(μ)=α+βixi
We can 'zoom out' to a more the 'Generalized Linear Model (GLM):
For distributions in the exponential family, GLM allows the linear model to relate to response through a function:
g(μ)=α+βixi
As the name suggest, a more general form allows us to use different distributions, but understand them as a linear model on a given scale. E.g. for count data, μ would be the expected count, and g would be the natural logarithm, and the model is using the Poisson distribution.
In our case, for binary, we are modelling the 'odds' of the outcome (death) on the log scale, with a binomial distribution. This is the log-odd, or 'logit' link fiction: hence logistic regression.
Importantly, I can use it to examine the explanatory relationships, or to predict new data
r_model_exp <- glm(DEATH_EVENT ~ serum_creatinine + ejection_fraction , data=heart_failure_dt , family = "binomial")summary(r_model_exp)
## ## Call:## glm(formula = DEATH_EVENT ~ serum_creatinine + ejection_fraction, ## family = "binomial", data = heart_failure_dt)## ## Coefficients:## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.37769 0.54343 0.695 0.487 ## serum_creatinine 0.74987 0.17932 4.182 2.89e-05 ***## ejection_fraction -0.05986 0.01350 -4.435 9.19e-06 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## (Dispersion parameter for binomial family taken to be 1)## ## Null deviance: 375.35 on 298 degrees of freedom## Residual deviance: 324.32 on 296 degrees of freedom## AIC: 330.32## ## Number of Fisher Scoring iterations: 4
import statsmodels.formula.api as smfpy_model1_exp = smf.logit("DEATH_EVENT ~ serum_creatinine + ejection_fraction", data=heart_failure_pd).fit()
print(py_model1_exp_summary)
## Logit Regression Results ## ==============================================================================## Dep. Variable: DEATH_EVENT No. Observations: 299## Model: Logit Df Residuals: 296## Method: MLE Df Model: 2## Date: Wed, 20 Nov 2024 Pseudo R-squ.: 0.1359## Time: 13:39:05 Log-Likelihood: -162.16## converged: True LL-Null: -187.67## Covariance Type: nonrobust LLR p-value: 8.308e-12## =====================================================================================## coef std err z P>|z| [0.025 0.975]## -------------------------------------------------------------------------------------## Intercept 0.3777 0.543 0.695 0.487 -0.687 1.443## serum_creatinine 0.7499 0.179 4.181 0.000 0.398 1.101## ejection_fraction -0.0599 0.013 -4.435 0.000 -0.086 -0.033## =====================================================================================
library(ModelMetrics)auc(r_model_exp)
## [1] 0.7614173
from sklearn import metricspy_auc = metrics.roc_auc_score(heart_failure_pd['DEATH_EVENT'], py_model1_exp.fittedvalues)print(py_auc)
## 0.7614172824302136
library(ModelMetrics)auc(r_model_exp)
## [1] 0.7614173
from sklearn import metricspy_auc = metrics.roc_auc_score(heart_failure_pd['DEATH_EVENT'], py_model1_exp.fittedvalues)print(py_auc)
## 0.7614172824302136
No, not if your goal is explanatory
We are interested in whether our model makes a good job of estimating F(x). Do the measured variables (and their coefficients) have a relationship with Y? Might use the AUC / ROC as a measure of variance explained by the model. BIC is considered a good measure of fit.
In a GLM like this, it's worth considering the error around the intercept, as there's no separate error term. Is there still a lot of unobserved variance?
Two things here: Scikit-learn is set up to tune predictive models. It, by default, uses the ridge penalty to reduce overfitting/improve predictive accuracy.
Once you apply a penalty, you can't directly interpret the coefficient, or the error as the degrees-of-freedom are effected. E.g. if you penalise 20% of the coefficient's value, how to you understand 80% of two predictors on the degrees of freedom?
You can force scikit learn to do it without a penalty (shown next), but why not use something geared to the purpose? For those who learnt R/Python modelling using Caret or Scikit learn, you need to appreciate that you are building a predictive model, not an explanatory model.
heart_failure_dt$sc_serum_creatinine <- scale(heart_failure_dt$serum_creatinine)heart_failure_dt$sc_ejection_fraction <- scale(heart_failure_dt$ejection_fraction)trainIndex <- caret::createDataPartition(heart_failure_dt$DEATH_EVENT , p = .8 , list = FALSE , times = 1)Train <- heart_failure_dt[ trainIndex,]Test <- heart_failure_dt[-trainIndex,]r_model_pred <- glm(DEATH_EVENT ~ sc_serum_creatinine + sc_ejection_fraction , data=Train , family = "binomial")predictions <- predict(r_model_pred, newdata=Test, type="response")# Model performance metricsauc(Test$DEATH_EVENT, predictions)
## [1] 0.8339599
from sklearn.model_selection import train_test_splitfrom sklearn.preprocessing import StandardScalerfrom sklearn.linear_model import LogisticRegressionsc= StandardScaler()X = sc.fit_transform(heart_failure_pd[['serum_creatinine', 'ejection_fraction']])y = heart_failure_pd[['DEATH_EVENT']]X_train, X_test, y_train, y_test = train_test_split(X,y , test_size = 0.2)log_reg = LogisticRegression(penalty = 'None').fit(X_train,y_train)
y_pred = log_reg.predict_proba(X_test)py_auc = metrics.roc_auc_score(y_test, y_pred[:,1])print(py_auc)
## 0.7957351290684623
Plausible relationship: might want to draw a DAG.
So you might leave multiple 'non-significant' predictors in an explanatory model, as they are rational and all effects conditional on each other.
You might be happy with a 'wrong' model for in prediction, if it gives better predictions.
Forecasting attendances at an Emergency Department
Forecasting attendances at an Emergency Department
What drives people to attend an Emergency Department?
What drives people to attend an Emergency Department?
What is a person's risk of attending an Emergency Department?
What is a person's risk of attending an Emergency Department?
Did our new UTC pathway decrease how often people attend the Emergency Department?
Did our new UTC pathway decrease how often people attend the Emergency Department?
Building a Large Language Model to answer questions as a chatbot
Building a Large Language Model to answer questions as a chatbot
Modelling long-term population health-state changes
Modelling long-term population health-state changes
What is your question?
Is it predictive or explanatory?
Are you doing anything that is incompatible with that framework?
We used logistic regression in both an explanatory and predictive fashion
Arnold, K.F. et al. (2020) ‘Reflection on modern methods: generalized linear models for prognosis and intervention—theory, practice and implications for machine learning’, International Journal of Epidemiology, 49(6), pp. 2074–2082. Available at: https://doi.org/10.1093/ije/dyaa049.
Box, G.E.P. (1976) “Science and Statistics.” Journal of the American Statistical Association 71, no. 356: 791–99. https://doi.org/10.2307/2286841.
Chicco, D. and Jurman, G. (2020) ‘Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone’, BMC Medical Informatics and Decision Making, 20(1), p. 16. Available at: https://doi.org/10.1186/s12911-020-1023-5.
Hernán, M. A., Hsu, J. and Healy, B. (2019) ‘A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks’, CHANCE, 32(1), pp. 42–49. doi: 10.1080/09332480.2019.1579578.
Shmueli, G. (2010) 'To Explain or to Predict?' Statistical Science 25, no. 3 : 289–310. http://www.jstor.org/stable/41058949.
heart_failure_dt$sc_ejection_fraction <- scale(heart_failure_dt$ejection_fraction)trainIndex <- caret::createDataPartition(heart_failure_dt$DEATH_EVENT , p = .8 , list = FALSE , times = 1)Train <- heart_failure_dt[ trainIndex,]Test <- heart_failure_dt[-trainIndex,]x <- model.matrix(DEATH_EVENT~log(serum_creatinine)+sc_ejection_fraction, Train)[,-1]y <- Train$DEATH_EVENTlibrary(glmnet)# Cross validate to get best lambda (shrinkage)cv <- cv.glmnet(x, y, alpha = 0, family="binomial")ridge1<-glmnet(x,y, alpha=0, lamda=cv$lambda.1se, family="binomial")# Make predictions on the test datax.test <- model.matrix(DEATH_EVENT~log(serum_creatinine)+sc_ejection_fraction, Test)[,-1]predictions <- predict(ridge1, newx=x.test, type="response") |> as.vector()ModelMetrics::auc(Test$DEATH_EVENT, predictions)
## [1] 71.20213
This talk draws heavily on Professor Galit Shmueli's 2010 paper of the same name:
Shmueli, G. (2010), To Explain or To Predict?, Statistical Science, vol 25 no 3, pp. 289-310.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |