


To explain or predict:

How different modelling objectives change how you use the same tools



Chris Mainey - BSOL ICB

c.mainey1@nhs.net

chrismainey


0000-0002-3018-6171

A cartoon image generated by Chat-GPT and Dall-E 3 about the differences between explanatory and predictive modelling. Many of the words are nonsensical and illustrate the error inherent in the modelling
Generated by Chat-GPT and Dall-E 3.
Any thoughts on the prediction error here?


Overview

This talk draws heavily on Professor Galit Shmueli's 2010 paper of the same name:

Shmueli, G. (2010), To Explain or To Predict?, Statistical Science, vol 25 no 3, pp. 289-310.


"Since all models are wrong, the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad" Box (1976)


What is your question?


Throughout this talk, keep asking yourself: what is your question?

The two broad classes of DS/modelling question:

Explain

  • How do explanatory variable(s) relate to an outcome?
  • We might consider two main groups:
    • Description
    • Causal inference / counterfactual analysis

Predict

  • How can predictor variable(s) predict an outcome?
  • At future time points or in different context?


You can use many of the same models to fit in either context, but how you do it is different!


Prof. Shmueli's paper laments that statisticians had focused almost exclusively on 'explanatory' models.

With the increasing accessibility of Data Science and Machine Learning, the focus of many modern practitioners has swung the other way.

Some of you may always be approaching a model as a prediction question.

What I'm presenting here today is fairly agnostic to your approach, be it Bayesian, frequentist or whatever.

Grounding

Y=F(X)

  • Imagine that X causes Y through some function: F
  • F is a theoretical model, set of statements, path model etc.
  • To model with data, we need to use measurable variables

E(Y)=f(X)

Our modelling goals are

  • Explanatory: using various f to estimate F, using X,Y
  • Predictive: estimate new values of Y, using f(X)

...don't be scared, it's not that bad...

We are trying to model how X causes Y, without being constrained by what data we have. Y can be a concept such as depression, and X could be things like anxiety, past trauma, physical health, stress, etc. We can't measure them directly, so we need to use measurable variables.

Example of a Causal Explanatory model

  • Directed Acyclic graph (DAG)
An example of a Directed Acyclic Graph (DAG), from the article referenced below.

Reused from: Arnold et al. Int J Epidemiol, Volume 49, Issue 6, December 2020, Pages 2074–2082, https://doi.org/10.1093/ije/dyaa049


What do I mean by 'causes'? It's not the same as 'associated with'. There is an 'exposure' to 'outcome' effect, and a temporal element: i.e. exposure comes before outcome. This DAG hypothesises the causal relationship between chemotherapy and venous thromboembolism (VTE).

The arrows indicate the direction of causal relationships. Age, sex, tumour site and tumour size confound this relationship and should be adjusted for in a model, but platelet count is a mediator and should not be.
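To make the confounder / mediator distinction concrete, here is a minimal simulation sketch. It does not use the chemotherapy and VTE data: the linear data-generating process, the variable names and all effect sizes are made-up assumptions, chosen so that the true total effect of the exposure on the outcome is 0.6.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Hypothetical linear data-generating process (all effect sizes are invented)
confounder = rng.normal(size=n)                    # e.g. age
exposure = 0.8 * confounder + rng.normal(size=n)   # confounder -> exposure
mediator = 0.5 * exposure + rng.normal(size=n)     # exposure -> mediator
outcome = (0.3 * exposure       # direct effect of exposure
           + 0.6 * mediator     # indirect effect, via the mediator
           + 0.7 * confounder   # confounder -> outcome
           + rng.normal(size=n))


def exposure_coefficient(y, *columns):
    """Least-squares coefficient of the first column, with an intercept."""
    design = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1]


true_total_effect = 0.3 + 0.5 * 0.6  # direct + indirect = 0.6

unadjusted = exposure_coefficient(outcome, exposure)
adjusted_confounder = exposure_coefficient(outcome, exposure, confounder)
adjusted_mediator_too = exposure_coefficient(outcome, exposure, confounder, mediator)

print(f"True total effect:             {true_total_effect:.2f}")
print(f"No adjustment:                 {unadjusted:.2f}")
print(f"Adjusted for confounder:       {adjusted_confounder:.2f}")
print(f"Adjusted for mediator as well: {adjusted_mediator_too:.2f}")
```

In this sketch, leaving out the confounder overstates the effect, adjusting for it recovers the total effect, and additionally adjusting for the mediator leaves only the direct part.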

Simple Example:

We will use an example I sourced from Kaggle, related to the paper:

Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5

https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data/data

  • Data related to heart failure
  • Paper tests various prediction methods to see if serum creatinine and ejection fraction alone can predict death.
  • Any real modelling approach would be more extensive than the example here.




I will be using logistic regression in both contexts.


Despite talking about regression models a lot, I know they are not the most straightforward thing, and you may never have encountered them.

They are the foundation of many more complex models, so I'll try to do regression in 2 minutes.

Regression in 2 minutes...


Regression models (1)

y=α+βx+ϵ


Regression models (2)


Regression equation


$$y = \alpha + \beta_i x_i + \epsilon$$

  • y - is our 'outcome', or 'dependent' variable
  • α - is the 'intercept', the point where our line crosses y-axis
  • β - is a coefficient (weight) applied to x
  • x - is our 'predictor', or 'independent' variable
  • i - is our index, we can have i predictor variables, each with a coefficient
  • ϵ - is the remaining ('residual') error
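As a small illustration of these terms, the sketch below simulates data from a known line and recovers the intercept and coefficient by least squares. The "true" values (α = 2, β = 0.5), the noise level and the variable names are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a single predictor and an outcome from a known line plus noise
x = rng.uniform(0, 10, size=200)
true_alpha, true_beta = 2.0, 0.5
y = true_alpha + true_beta * x + rng.normal(scale=1.0, size=200)

# np.polyfit returns coefficients from the highest degree down: slope, then intercept
beta_hat, alpha_hat = np.polyfit(x, y, deg=1)

# Residuals are what is left over once the fitted line is subtracted
residuals = y - (alpha_hat + beta_hat * x)

print(f"alpha (intercept): {alpha_hat:.2f}")
print(f"beta (coefficient): {beta_hat:.2f}")
print(f"residual standard deviation: {residuals.std(ddof=2):.2f}")
```

The estimates should land close to, but not exactly on, the values used to simulate the data; the gap is the sampling error that the model summary's standard errors describe.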

How can we use regression on other distributions / data types?

We can 'zoom out' to the more general 'Generalized Linear Model' (GLM):

For distributions in the exponential family, the GLM allows the linear model to relate to the response through a link function:

$$g(\mu) = \alpha + \beta_i x_i$$

  • Where μ is the expectation of Y,
  • g is the link function - related to each

As the name suggests, a more general form allows us to use different distributions, but understand them as a linear model on a given scale. E.g. for count data, μ would be the expected count, g would be the natural logarithm, and the model would use the Poisson distribution.

In our case, for a binary outcome, we are modelling the 'odds' of the outcome (death) on the log scale, with a binomial distribution. This is the log-odds, or 'logit', link function: hence logistic regression.

Importantly, I can use it to examine the explanatory relationships, or to predict new data.
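For anyone who has not met the logit link before, the following sketch shows the round trip between probability, odds and log-odds. The helper names `logit` and `inverse_logit` are just illustrative; nothing here depends on the heart failure data.

```python
import math


def logit(p):
    """Link function: probability -> log-odds."""
    return math.log(p / (1 - p))


def inverse_logit(log_odds):
    """Inverse link: log-odds -> probability."""
    return 1 / (1 + math.exp(-log_odds))


for p in (0.1, 0.5, 0.9):
    odds = p / (1 - p)
    log_odds = logit(p)
    print(f"probability {p:.1f} | odds {odds:.3f} | "
          f"log-odds {log_odds:+.3f} | back to probability {inverse_logit(log_odds):.1f}")
```

The linear part of a logistic regression lives on the log-odds scale, which is unbounded; the inverse link squeezes it back into the 0 to 1 range of a probability.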

Explanatory model - R

r_model_exp <- glm(DEATH_EVENT ~ serum_creatinine + ejection_fraction
, data=heart_failure_dt
, family = "binomial")
summary(r_model_exp)
##
## Call:
## glm(formula = DEATH_EVENT ~ serum_creatinine + ejection_fraction,
## family = "binomial", data = heart_failure_dt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.37769 0.54343 0.695 0.487
## serum_creatinine 0.74987 0.17932 4.182 2.89e-05 ***
## ejection_fraction -0.05986 0.01350 -4.435 9.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 375.35 on 298 degrees of freedom
## Residual deviance: 324.32 on 296 degrees of freedom
## AIC: 330.32
##
## Number of Fisher Scoring iterations: 4

Explanatory model - Python

import statsmodels.formula.api as smf
py_model1_exp = smf.logit("DEATH_EVENT ~ serum_creatinine + ejection_fraction", data=heart_failure_pd).fit()
print(py_model1_exp.summary())
## Logit Regression Results
## ==============================================================================
## Dep. Variable: DEATH_EVENT No. Observations: 299
## Model: Logit Df Residuals: 296
## Method: MLE Df Model: 2
## Date: Wed, 20 Nov 2024 Pseudo R-squ.: 0.1359
## Time: 13:39:05 Log-Likelihood: -162.16
## converged: True LL-Null: -187.67
## Covariance Type: nonrobust LLR p-value: 8.308e-12
## =====================================================================================
## coef std err z P>|z| [0.025 0.975]
## -------------------------------------------------------------------------------------
## Intercept 0.3777 0.543 0.695 0.487 -0.687 1.443
## serum_creatinine 0.7499 0.179 4.181 0.000 0.398 1.101
## ejection_fraction -0.0599 0.013 -4.435 0.000 -0.086 -0.033
## =====================================================================================
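One way to read these coefficients in an explanatory sense is to exponentiate them into odds ratios. The sketch below copies the rounded coefficients from the summary above; the example patient (serum creatinine of 1.0, ejection fraction of 38) is a hypothetical chosen purely for illustration.

```python
import math

# Rounded coefficients copied from the model summary (log-odds scale)
coefs = {
    "Intercept": 0.3777,
    "serum_creatinine": 0.7499,
    "ejection_fraction": -0.0599,
}

# Exponentiating a coefficient gives the odds ratio for a one-unit increase
odds_ratios = {name: math.exp(b) for name, b in coefs.items() if name != "Intercept"}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")

# Hypothetical patient, used only to show the link function in action
patient = {"serum_creatinine": 1.0, "ejection_fraction": 38}
log_odds = coefs["Intercept"] + sum(coefs[name] * value for name, value in patient.items())
probability = 1 / (1 + math.exp(-log_odds))
print(f"Predicted probability of death for this patient: {probability:.2f}")
```

Holding ejection fraction fixed, each one-unit increase in serum creatinine roughly doubles the odds of death in this model, while each one-point increase in ejection fraction reduces the odds by about 6%.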

Testing Fit

  • Significance of coefficients in our model summaries
  • Assumptions of regression being met - a topic for another day
library(ModelMetrics)
auc(r_model_exp)
## [1] 0.7614173
from sklearn import metrics
py_auc = metrics.roc_auc_score(heart_failure_pd['DEATH_EVENT'], py_model1_exp.fittedvalues)
print(py_auc)
## 0.7614172824302136

Is over-fitting an issue?

No, not if your goal is explanatory


We are interested in whether our model does a good job of estimating F(X). Do the measured variables (and their coefficients) have a relationship with Y? We might use the AUC / ROC as a rough stand-in for variance explained by the model (strictly, it measures how well the model discriminates between outcomes). BIC is considered a good measure of fit.

In a GLM like this, it's worth considering the error around the intercept, as there's no separate error term. Is there still a lot of unobserved variance?

Two things here: scikit-learn is set up to tune predictive models. By default, it uses the ridge penalty to reduce overfitting and improve predictive accuracy.

Once you apply a penalty, you can't directly interpret the coefficient or the error, as the degrees of freedom are affected. E.g. if you penalise 20% of the coefficient's value, how do you understand 80% of two predictors in terms of the degrees of freedom?

You can force scikit-learn to fit without a penalty (shown next), but why not use something geared to the purpose? For those who learnt R/Python modelling using caret or scikit-learn, you need to appreciate that you are building a predictive model, not an explanatory model.
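The effect of the default penalty can be seen in a small sketch on simulated data. The data-generating coefficients, sample size and variable names are assumptions for illustration. A very large `C` is used here to make the penalty negligible, which is a stand-in for switching it off entirely.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated binary outcome from two predictors (coefficients are invented)
n = 100
X = rng.normal(size=(n, 2))
true_log_odds = -0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-true_log_odds)))

# Default settings: ridge (L2) penalty with C=1.0
default_fit = LogisticRegression().fit(X, y)

# Very large C makes the penalty negligible (assumption: close to an unpenalised fit)
weak_penalty_fit = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

print("Default (ridge) coefficients:   ", default_fit.coef_.round(3))
print("Negligible-penalty coefficients:", weak_penalty_fit.coef_.round(3))
```

The default coefficients are pulled towards zero relative to the near-unpenalised fit, which is helpful for prediction but means they no longer match what an explanatory GLM would report.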

Predictive Model - R

heart_failure_dt$sc_serum_creatinine <- scale(heart_failure_dt$serum_creatinine)
heart_failure_dt$sc_ejection_fraction <- scale(heart_failure_dt$ejection_fraction)
trainIndex <- caret::createDataPartition(heart_failure_dt$DEATH_EVENT
, p = .8
, list = FALSE
, times = 1)
Train <- heart_failure_dt[ trainIndex,]
Test <- heart_failure_dt[-trainIndex,]
r_model_pred <- glm(DEATH_EVENT ~ sc_serum_creatinine + sc_ejection_fraction
, data=Train
, family = "binomial")
predictions <- predict(r_model_pred, newdata=Test, type="response")
# Model performance metrics
auc(Test$DEATH_EVENT, predictions)
## [1] 0.8339599

Predictive model - Python

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
sc = StandardScaler()
X = sc.fit_transform(heart_failure_pd[['serum_creatinine', 'ejection_fraction']])
# Use a 1-D Series for the outcome to avoid a shape warning when fitting
y = heart_failure_pd['DEATH_EVENT']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
# penalty=None switches off the default ridge (L2) penalty (scikit-learn >= 1.2)
log_reg = LogisticRegression(penalty = None).fit(X_train, y_train)
y_pred = log_reg.predict_proba(X_test)
py_auc = metrics.roc_auc_score(y_test, y_pred[:,1])
print(py_auc)
## 0.7957351290684623

What is important in assessing the fit:

Explanation

  • Plausible relationship
    • Estimating F(X) through modelling f(X)
    • Interpretability
  • Minimising bias
  • Interested in the coefficients and error terms in regression
  • Performance on the whole dataset
  • Feature engineering should be logically consistent with relationship
  • Scale/centre predictors for interpretation reasons

Prediction

  • Prediction error on new data (new sample, hold-out/test set, cross-validation)
  • Bias / Variance trade-off: happy with some bias to reduce variance
  • Less concerned with interpreting coefficients / predictors
  • Multicollinearity therefore less of a problem
  • Feature engineering can be extensive and esoteric
  • Scale/centre predictors as good practice for computational reasons
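As a sketch of the 'prediction error on new data' point, the example below contrasts an in-sample AUC with a 5-fold cross-validated AUC. The data are simulated: the predictor distributions and coefficients are assumptions loosely shaped like the heart failure example, not the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Simulated predictors and outcome (all values are invented for illustration)
n = 300
X = rng.normal(loc=[1.4, 38.0], scale=[1.0, 12.0], size=(n, 2))
log_odds = 0.4 + 0.75 * X[:, 0] - 0.06 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Scaling sits inside the pipeline, so it is re-fitted on each training fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# Explanatory-style check: performance on the data used to fit the model
in_sample_auc = roc_auc_score(y, pipeline.fit(X, y).predict_proba(X)[:, 1])

# Predictive-style check: performance on held-out folds
cv_auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")

print(f"In-sample AUC:       {in_sample_auc:.3f}")
print(f"Cross-validated AUC: {cv_auc.mean():.3f} (+/- {cv_auc.std():.3f})")
```

The spread across folds is a reminder that a single train/test split gives only one noisy estimate of predictive performance.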

Plausible relationship: might want to draw a DAG.

So you might leave multiple 'non-significant' predictors in an explanatory model, as they are rational and all effects are conditional on each other.

You might be happy with a 'wrong' model for prediction, if it gives better predictions.

Explain or predict Bingo (1):




Forecasting attendances at an Emergency Department



Predict!


Explain or predict Bingo (2):




What drives people to attend an Emergency Department?



Explain!


Explain or predict Bingo (3):




What is a person's risk of attending an Emergency Department?



It depends: is it about the person's individual risk based on explanatory factors, the best prediction you can make, or is it for risk-adjustment?


Explain or predict Bingo (4):




Did our new UTC pathway decrease how often people attend the Emergency Department?



Explain!


Explain or predict Bingo (5):




Building a Large Language Model to answer questions as a chatbot



Predict!


Explain or predict Bingo (6):




Modelling long-term population health-state changes



It depends: are you testing what causes it, or predicting future states of the population?


Summary

  • What is your question?

  • Is it predictive or explanatory?

  • Are you using the right modelling framework?
  • Are you doing anything that is incompatible with that framework?

  • We used logistic regression in both an explanatory and predictive fashion

  • Both approaches would involve more work than shown here

References

Arnold, K.F. et al. (2020) ‘Reflection on modern methods: generalized linear models for prognosis and intervention—theory, practice and implications for machine learning’, International Journal of Epidemiology, 49(6), pp. 2074–2082. Available at: https://doi.org/10.1093/ije/dyaa049.

Box, G.E.P. (1976) ‘Science and statistics’, Journal of the American Statistical Association, 71(356), pp. 791–799. Available at: https://doi.org/10.2307/2286841.

Chicco, D. and Jurman, G. (2020) ‘Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone’, BMC Medical Informatics and Decision Making, 20(1), p. 16. Available at: https://doi.org/10.1186/s12911-020-1023-5.

Hernán, M. A., Hsu, J. and Healy, B. (2019) ‘A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks’, CHANCE, 32(1), pp. 42–49. doi: 10.1080/09332480.2019.1579578.

Shmueli, G. (2010) ‘To explain or to predict?’, Statistical Science, 25(3), pp. 289–310. Available at: http://www.jstor.org/stable/41058949.


Predictive Model - R bonus (like scikit-learn assumes you want...)

heart_failure_dt$sc_ejection_fraction <- scale(heart_failure_dt$ejection_fraction)
trainIndex <- caret::createDataPartition(heart_failure_dt$DEATH_EVENT
, p = .8
, list = FALSE
, times = 1)
Train <- heart_failure_dt[ trainIndex,]
Test <- heart_failure_dt[-trainIndex,]
x <- model.matrix(DEATH_EVENT~log(serum_creatinine)+sc_ejection_fraction, Train)[,-1]
y <- Train$DEATH_EVENT
library(glmnet)
# Cross validate to get best lambda (shrinkage)
cv <- cv.glmnet(x, y, alpha = 0, family="binomial")
# Refit at the cross-validated lambda (1 standard error rule); note the argument is 'lambda'
ridge1 <- glmnet(x, y, alpha = 0, lambda = cv$lambda.1se, family = "binomial")
# Make predictions on the test data
x.test <- model.matrix(DEATH_EVENT~log(serum_creatinine)+sc_ejection_fraction, Test)[,-1]
predictions <- predict(ridge1, newx=x.test, type="response") |> as.vector()
# AUC lies between 0 and 1; the exact value depends on the random split
ModelMetrics::auc(Test$DEATH_EVENT, predictions)