.pull-left[ <br><br> # To explain or predict: ### How different modelling objectives change how you use the same tools <br><br> __Chris Mainey - BSOL ICB__ <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M464 64H48C21.49 64 0 85.49 0 112v288c0 26.51 21.49 48 48 48h416c26.51 0 48-21.49 48-48V112c0-26.51-21.49-48-48-48zm0 48v40.805c-22.422 18.259-58.168 46.651-134.587 106.49-16.841 13.247-50.201 45.072-73.413 44.701-23.208.375-56.579-31.459-73.413-44.701C106.18 199.465 70.425 171.067 48 152.805V112h416zM48 400V214.398c22.914 18.251 55.409 43.862 104.938 82.646 21.857 17.205 60.134 55.186 103.062 54.955 42.717.231 80.509-37.199 103.053-54.947 49.528-38.783 82.032-64.401 104.947-82.653V400H48z"></path></svg> [c.mainey1@nhs.net](mailto:c.mainey1@nhs.net) <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> [chrismainey](https://github.com/chrismainey) <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#005EB8;" xmlns="http://www.w3.org/2000/svg"> <path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"></path></svg> [chrismainey](https://www.linkedin.com/in/chrismainey/) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#A6CE39;" xmlns="http://www.w3.org/2000/svg"> <path d="M294.75 188.19h-45.92V342h47.47c67.62 0 83.12-51.34 83.12-76.91 0-41.64-26.54-76.9-84.67-76.9zM256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8zm-80.79 360.76h-29.84v-207.5h29.84zm-14.92-231.14a19.57 19.57 0 1 1 19.57-19.57 19.64 19.64 0 0 1-19.57 19.57zM300 369h-81V161.26h80.6c76.73 0 110.44 54.83 110.44 103.85C410 318.39 368.38 369 300 369z"></path></svg> 
[0000-0002-3018-6171](https://orcid.org/0000-0002-3018-6171)
]

.pull-right[
<p style="text-align:center;font-weight:bold;"><img src="man/figures/DALLE2024-11-08.webp" style="height:550px;" alt="A cartoon image generated by Chat-GPT and Dall-E 3 about the differences between explanatory and predictive modelling. Many of the words are nonsensical and illustrate the error inherent in the modelling">
<br> Generated by Chat-GPT and Dall-E 3.
<br> Any thoughts on the prediction error here? </p>
]

---

# Overview

This talk draws heavily on Professor Galit Shmueli's 2010 paper of the same name:

__Shmueli, G. (2010), To Explain or To Predict?, Statistical Science, vol 25 no 3, pp. 289-310.__

<br>

--

>"Since all models are wrong, the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad" ___Box (1976)___

--

<br><br><br>
<center><h2> What is your question? </h2></center>

???
Throughout this talk, keep asking yourself: what is your question?

---

# The two broad classes of DS/modelling question:

## Explain

* How do explanatory variables relate to an outcome?
* We might consider two main groups:
  * Description
  * Causal inference / counterfactual analysis

--

## Predict

* How well can predictor variables predict an outcome?
* At future time points, or in a different context?

<br>

__You can fit many of the same models in either context, but how you use them is different!__

???
Prof. Shmueli's paper laments that statisticians had focused almost exclusively on 'explanatory' models.
With the increasing accessibility of Data Science and Machine Learning, the focus of many modern practitioners has swung the other way. Some of you may always be approaching a model as a prediction question.
What I'm presenting here today is fairly agnostic to your approach, be it Bayesian / frequentist / whatever.

---

# Grounding

`$$\mathcal{Y} = \mathcal{F}(\mathcal{X})$$`

+ Imagine that `\(\mathcal{X}\)` ___causes___ `\(\mathcal{Y}\)` through some function: `\(\mathcal{F}\)`
+ `\(\mathcal{F}\)` is a theoretical model: a set of statements, a path model, etc.

--

+ To model with data, we need to use measurable variables:

`$$E(Y) = f(X)$$`

--

## Our modelling goals are

+ __Explanatory__: using various `\(f\)` to estimate `\(\mathcal{F}\)`, using `\(X, Y\)`
+ __Predictive__: estimate new values of `\(Y\)`, using `\(f(X)\)`

.footnote[
Shmueli, G. (2010), http://www.jstor.org/stable/41058949
]

???
...don't be scared, it's not that bad...
We are trying to model how X causes something, without being constrained by what data we have.
These can be concepts such as Y = depression, where F(X) could involve things like anxiety, past trauma, physical health, stress, etc. We can't measure them directly, so we work with measurable proxies: X and Y.

---

# Example of a Causal Explanatory model

+ Directed Acyclic Graph (DAG)

<center>
<img src="./man/figures/dag_example.jpg" style="width:700px;height:400px;" alt="An example of a Directed Acyclic Graph (DAG) from an article referenced in the link below." class="center">
</center>

.footnote[Reused from: Arnold et al. Int J Epidemiol, Volume 49, Issue 6, December 2020, Pages 2074–2082, https://doi.org/10.1093/ije/dyaa049]

???
What do I mean by 'causes'? It's not the same as 'associated with'. There is an 'exposure' to 'outcome' effect, and a temporal element: i.e. exposure before outcome.
This DAG hypothesises the causal relationship between chemotherapy and venous thromboembolism (VTE). The arrows indicate the direction of the causal relationships.
Age, sex, tumour site and tumour size are confounding this relationship and should be adjusted for in a model, but platelet count is a mediator and should not.
The next slide shows how you can encode and query this kind of DAG in code.
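---

# Encoding the DAG in code

A DAG like this can be written down and queried programmatically, which helps check which variables to adjust for. Below is a minimal sketch using the `dagitty` R package; the variable names are illustrative, not the published model:

``` r
# A sketch of the hypothesised DAG; names are illustrative
library(dagitty)

dag <- dagitty("dag {
  chemotherapy -> VTE
  age -> chemotherapy ; age -> VTE
  sex -> chemotherapy ; sex -> VTE
  tumourSite -> chemotherapy ; tumourSite -> VTE
  tumourSize -> chemotherapy ; tumourSize -> VTE
  chemotherapy -> plateletCount -> VTE
}")

# Minimal adjustment set(s) for the exposure -> outcome effect;
# note the mediator (plateletCount) is correctly left out
adjustmentSets(dag, exposure = "chemotherapy", outcome = "VTE")
```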
---

# Simple Example:

We will use an example I sourced from Kaggle, related to the paper:

>Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. _BMC Med Inform Decis Mak_ __20__, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5

>https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data/data

+ Data related to heart failure
+ The paper tests various prediction methods to see if serum creatinine and ejection fraction alone can predict death.
+ Any real modelling approach would be more extensive than the example here.

<br><br>

--

### I will be using logistic regression in both contexts

???
Despite talking about regression models a lot, I know they are not the most straightforward thing, and you may never have encountered them.
They are a foundation of many more complex models, so I'll try to do regression in 2 minutes.

---

## Regression in 2 minutes...

<img src="to_explain_or_predict_files/figure-html/lm1-1.png" width="1680" style="display: block; margin: auto;" />

---

## Regression models (1)

<img src="to_explain_or_predict_files/figure-html/lm3-1.png" width="1680" style="display: block; margin: auto;" />

`$$y= \alpha + \beta x + \epsilon$$`

---

## Regression models (2)

<img src="to_explain_or_predict_files/figure-html/lm35-1.png" width="1680" style="display: block; margin: auto;" />

---

# Regression equation

<br>

`$$\large{y= \alpha + \beta_{i} x_{i} + \epsilon}$$`

<br>

.pull-left[
+ `\(y\)` - is our 'outcome', or 'dependent' variable
+ `\(\alpha\)` - is the 'intercept', the point where our line crosses the y-axis
+ `\(\beta\)` - is a coefficient (weight) applied to `\(x\)`
]

.pull-right[
+ `\(x\)` - is our 'predictor', or 'independent' variable
+ `\(i\)` - is our index: we can have `\(i\)` predictor variables, each with a coefficient
+ `\(\epsilon\)` - is the remaining ('residual') error
]

---

## How can we use regression on other distributions / data types?

We can 'zoom out' to the more general 'Generalized Linear Model' (GLM):

_For distributions in the exponential family, the GLM allows the linear model to relate to the response through a link function:_

`$$\large{g(\mu)= \alpha + \beta_{i} x_{i}}$$`

<br>

+ Where `\(\mu\)` is the _expectation_ of `\(Y\)`,
+ `\(g\)` is the link function - related to each distribution

--

<br>

### So, for logistic regression, the link function is the 'logit', or the log-odds of the event.

???
As the name suggests, a more general form allows us to use different distributions, but understand them as a linear model on a given scale.
E.g. for count data, `\(\mu\)` would be the expected count, `\(g\)` would be the natural logarithm, and the model would use the Poisson distribution.
In our case, for binary data, we are modelling the 'odds' of the outcome (death) on the log scale, with a binomial distribution. This is the log-odds, or 'logit', link function: hence logistic regression.
__Importantly, I can use it to examine the explanatory relationships, or to predict new data__

---

# Explanatory model - R

``` r
r_model_exp <- glm(DEATH_EVENT ~ serum_creatinine + ejection_fraction
                   , data=heart_failure_dt
                   , family = "binomial")

summary(r_model_exp)
```

```
## 
## Call:
## glm(formula = DEATH_EVENT ~ serum_creatinine + ejection_fraction, 
##     family = "binomial", data = heart_failure_dt)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.37769    0.54343   0.695    0.487    
## serum_creatinine    0.74987    0.17932   4.182 2.89e-05 ***
## ejection_fraction  -0.05986    0.01350  -4.435 9.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 375.35  on 298  degrees of freedom
## Residual deviance: 324.32  on 296  degrees of freedom
## AIC: 330.32
## 
## Number of Fisher Scoring iterations: 4
```
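---

# Explanatory model - R: odds ratios

Because the goal here is explanation, the log-odds coefficients are usually reported on a more interpretable scale. A brief sketch, using the `r_model_exp` fit from the previous slide:

``` r
# Exponentiate to move from log-odds to odds ratios
exp(coef(r_model_exp))

# Profile-likelihood 95% confidence intervals, also as odds ratios
exp(confint(r_model_exp))
```

For example, each unit increase in serum creatinine multiplies the odds of death by roughly `exp(0.75) ≈ 2.1`, holding ejection fraction constant.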
---

# Explanatory model - Python

``` python
import statsmodels.formula.api as smf

py_model1_exp = smf.logit("DEATH_EVENT ~ serum_creatinine + ejection_fraction",
                          data=heart_failure_pd).fit()
```

``` python
print(py_model1_exp.summary())
```

```
##                            Logit Regression Results                           
## ==============================================================================
## Dep. Variable:            DEATH_EVENT   No. Observations:                  299
## Model:                          Logit   Df Residuals:                      296
## Method:                           MLE   Df Model:                            2
## Date:                Wed, 20 Nov 2024   Pseudo R-squ.:                  0.1359
## Time:                        13:39:05   Log-Likelihood:                -162.16
## converged:                       True   LL-Null:                       -187.67
## Covariance Type:            nonrobust   LLR p-value:                 8.308e-12
## =====================================================================================
##                         coef    std err          z      P>|z|      [0.025      0.975]
## -------------------------------------------------------------------------------------
## Intercept             0.3777      0.543      0.695      0.487      -0.687       1.443
## serum_creatinine      0.7499      0.179      4.181      0.000       0.398       1.101
## ejection_fraction    -0.0599      0.013     -4.435      0.000      -0.086      -0.033
## =====================================================================================
```

---

# Testing Fit

* Significance of coefficients in our model summaries
* Assumptions of regression being met - _a topic for another day_

``` r
library(ModelMetrics)
auc(r_model_exp)
```

```
## [1] 0.7614173
```

``` python
from sklearn import metrics

py_auc = metrics.roc_auc_score(heart_failure_pd['DEATH_EVENT'], py_model1_exp.fittedvalues)
print(py_auc)
```

```
## 0.7614172824302136
```

--

### Is over-fitting an issue? _No, not if your goal is explanatory_

???
We are interested in whether our model makes a good job of estimating F(X). Do the measured variables (and their coefficients) have a relationship with Y?
We might use the AUC / ROC as a measure of variance explained by the model. BIC is considered a good measure of fit.
In a GLM like this, it's worth considering the error around the intercept, as there's no separate error term. Is there still a lot of unobserved variance?

---

# Why `statsmodels`?

<center><img src="./man/figures/scikit_warning.png" style="width:900px;height:450px;" alt="A screenshot from the scikit-learn documentation explaining that logistic regression applies 'regularization' by default in this package. This means the coefficients, the degrees of freedom, and therefore the error are penalized too, and you can't simply read them the same way as an unpenalized model." class="center"></center>

.footnote[https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression]

???
Two things here:
Scikit-learn is set up to tune predictive models. By default, it uses the ridge penalty to reduce overfitting and improve predictive accuracy.
Once you apply a penalty, you can't directly interpret the coefficients, or the error, as the degrees of freedom are affected. E.g. if you penalise 20% of a coefficient's value, how do you count 80% of two predictors towards the degrees of freedom?
You can force scikit-learn to fit without a penalty (shown on the predictive modelling slide), but why not use something geared to the purpose?
For those who learnt R/Python modelling using caret or scikit-learn, you need to appreciate that you are building a predictive model, not an explanatory model.
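---

# Why `statsmodels`? - the same distortion in R

You can see the same issue in R terms. A minimal sketch, assuming the `heart_failure_dt` data from earlier and an arbitrary, purely illustrative ridge penalty (`lambda = 0.1`):

``` r
library(glmnet)

# Unpenalised logistic regression: directly interpretable coefficients
coef(glm(DEATH_EVENT ~ serum_creatinine + ejection_fraction
         , data = heart_failure_dt
         , family = "binomial"))

# Ridge-penalised fit of the same model: the coefficients are shrunk
# towards zero and can no longer be read in the same way
x <- model.matrix(DEATH_EVENT ~ serum_creatinine + ejection_fraction,
                  heart_failure_dt)[, -1]
coef(glmnet(x, heart_failure_dt$DEATH_EVENT
            , alpha = 0
            , lambda = 0.1
            , family = "binomial"))
```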
---

# Predictive Model - R

``` r
heart_failure_dt$sc_serum_creatinine <- scale(heart_failure_dt$serum_creatinine)
heart_failure_dt$sc_ejection_fraction <- scale(heart_failure_dt$ejection_fraction)

trainIndex <- caret::createDataPartition(heart_failure_dt$DEATH_EVENT
                                         , p = .8
                                         , list = FALSE
                                         , times = 1)

Train <- heart_failure_dt[ trainIndex,]
Test <- heart_failure_dt[-trainIndex,]

r_model_pred <- glm(DEATH_EVENT ~ sc_serum_creatinine + sc_ejection_fraction
                    , data=Train
                    , family = "binomial")

predictions <- predict(r_model_pred, newdata=Test, type="response")

# Model performance metrics
auc(Test$DEATH_EVENT, predictions)
```

```
## [1] 0.8339599
```

---

# Predictive model - Python

``` python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

sc = StandardScaler()
X = sc.fit_transform(heart_failure_pd[['serum_creatinine', 'ejection_fraction']])
y = heart_failure_pd['DEATH_EVENT']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# penalty=None requests an unpenalised fit (scikit-learn >= 1.2)
log_reg = LogisticRegression(penalty = None).fit(X_train, y_train)
```

``` python
y_pred = log_reg.predict_proba(X_test)
py_auc = metrics.roc_auc_score(y_test, y_pred[:,1])
print(py_auc)
```

```
## 0.7957351290684623
```

---

# What is important in assessing the fit:

.pull-left[
### Explanation
* Plausible relationship
* Estimating `\(\mathcal{F}(\mathcal{X})\)` through modelling `\(f(X)\)`
* Interpretability
* Minimising bias
* Interested in the coefficients and error terms in regression
* Performance on the whole dataset
* Feature engineering should be logically consistent with the relationship
* Scale/centre predictors for interpretation reasons
]

.pull-right[
### Prediction
* Prediction error on new data (new sample, hold-out/test set, cross-validation - sketched on the next slide)
* Bias / variance trade-off: happy with some bias to reduce variance
* Less concerned with interpreting coefficients / predictors
* Multicollinearity therefore less of a problem
* Feature engineering can be extensive and esoteric
* Scale/centre predictors as good practice, for computational reasons
]

???
Plausible relationship: might want to draw a DAG.
So you might leave multiple 'non-significant' predictors in an explanatory model, as they are rational, and all effects are conditional on each other.
You might be happy with a 'wrong' model in prediction, if it gives better predictions.
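---

# Predictive Model - R: cross-validation aside

The hold-out split earlier is a single random partition; prediction error is often estimated more stably by k-fold cross-validation. A minimal sketch with `caret`, assuming the `heart_failure_dt` data from earlier (and the `pROC` package installed, which caret's AUC summary uses):

``` r
library(caret)
set.seed(42)

# caret's two-class summary needs a factor outcome with valid level names
heart_failure_dt$death_f <- factor(heart_failure_dt$DEATH_EVENT,
                                   labels = c("alive", "died"))

cv_fit <- train(death_f ~ serum_creatinine + ejection_fraction
                , data = heart_failure_dt
                , method = "glm"
                , family = "binomial"
                , metric = "ROC"
                , trControl = trainControl(method = "cv", number = 5,
                                           classProbs = TRUE,
                                           summaryFunction = twoClassSummary))

cv_fit$results$ROC  # mean AUC across the five held-out folds
```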
---

# Explain or predict Bingo (1):

<br><br><br>
.big[Forecasting attendances at an Emergency Department]
<br><br>

--

## Predict!

---

# Explain or predict Bingo (2):

<br><br><br>
.big[What drives people to attend an Emergency Department?]
<br><br>

--

## Explain!

---

# Explain or predict Bingo (3):

<br><br><br>
.big[What is a person's risk of attending an Emergency Department?]
<br><br>

--

## It depends... is it about the person's individual risk based on explanatory factors, the best prediction you can make, or is it for risk-adjustment?

---

# Explain or predict Bingo (4):

<br><br><br>
.big[Did our new UTC pathway decrease how often people attend the Emergency Department?]
<br><br>

--

## Explain!

---

# Explain or predict Bingo (5):

<br><br><br>
.big[Building a Large Language Model to answer questions as a chatbot]
<br><br>

--

## Predict!

---

# Explain or predict Bingo (6):

<br><br><br>
.big[Modelling long-term population health-state changes]
<br><br>

--

## It depends... are you testing what causes it, or predicting future states of the population?

---

# Summary

.pull-left[
* What is your question?
* Is it predictive or explanatory?
* Are you using the right modelling framework?
* Are you doing anything that is incompatible with that framework?
* We used logistic regression in both an explanatory and a predictive fashion
* Both approaches would involve more work than shown here
]

.pull-right[
<br><br>
<iframe src="https://giphy.com/embed/MCZ39lz83o5lC" width="480" height="259" style="" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/MCZ39lz83o5lC">via GIPHY</a></p>
]

---

# References

Arnold, K.F. et al. (2020) 'Reflection on modern methods: generalized linear models for prognosis and intervention—theory, practice and implications for machine learning', _International Journal of Epidemiology_, __49(6)__, pp. 2074–2082. https://doi.org/10.1093/ije/dyaa049.

Box, G.E.P. (1976) 'Science and Statistics', _Journal of the American Statistical Association_, __71(356)__, pp. 791–799. https://doi.org/10.2307/2286841.

Chicco, D. and Jurman, G. (2020) 'Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone', _BMC Medical Informatics and Decision Making_, __20(1)__, p. 16. https://doi.org/10.1186/s12911-020-1023-5.

Hernán, M.A., Hsu, J. and Healy, B. (2019) 'A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks', _CHANCE_, __32(1)__, pp. 42–49. https://doi.org/10.1080/09332480.2019.1579578.

Shmueli, G. (2010) 'To Explain or to Predict?', _Statistical Science_, __25(3)__, pp. 289–310. http://www.jstor.org/stable/41058949.

---

## Predictive Model - R bonus (like scikit-learn assumes you want...)

``` r
heart_failure_dt$sc_ejection_fraction <- scale(heart_failure_dt$ejection_fraction)

trainIndex <- caret::createDataPartition(heart_failure_dt$DEATH_EVENT
                                         , p = .8
                                         , list = FALSE
                                         , times = 1)

Train <- heart_failure_dt[ trainIndex,]
Test <- heart_failure_dt[-trainIndex,]

x <- model.matrix(DEATH_EVENT ~ log(serum_creatinine) + sc_ejection_fraction, Train)[,-1]
y <- Train$DEATH_EVENT

library(glmnet)

# Cross-validate to get the best lambda (shrinkage)
cv <- cv.glmnet(x, y, alpha = 0, family = "binomial")
ridge1 <- glmnet(x, y, alpha = 0, lambda = cv$lambda.1se, family = "binomial")

# Make predictions on the test data
x.test <- model.matrix(DEATH_EVENT ~ log(serum_creatinine) + sc_ejection_fraction, Test)[,-1]
predictions <- predict(ridge1, newx = x.test, type = "response") |> as.vector()

ModelMetrics::auc(Test$DEATH_EVENT, predictions)
```