This is a surface-level overview of fitting random-intercept models in R and Python and trying to use them for the prediction task described in our chat on 19/01/2022. Hopefully it’s a useful starter!
For prediction, I will demonstrate two methods for a random-intercept:
“Conditional” - ‘conditioned’ on the random-effect i.e. using the random effect. This gives ‘cluster-specific’ predictions and in SHMI this would be trust-specific predictions. You can’t use these for a funnel plot, as there is no residual variation and all points line up at 1 on the y-axis (you are summing residual variance in the same clusters you are calculating it at).
“Marginal” - using the global average prediction i.e. without the random effect. This gives a global prediction and in SHMI this would be prediction at the national average risk for a patient with a set of predictors (not trust-specific). You can use these for a funnel plot, you’ve just got a better case mix model.
Although I advocate the marginal prediction, another approach entirely would be to estimate the random-intercept (how much the trust differs from national average), bootstrap a confidence interval and present as a caterpillar plot. That’s another argument though
I’m Hilbe’s COUNT
package and the medpar
dataset which is a cut from 1991 Medicare files for the state of Arizona.
library(COUNT)
## Loading required package: msme
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: sandwich
library(lme4)
## Loading required package: Matrix
library(ModelMetrics)
##
## Attaching package: 'ModelMetrics'
## The following object is masked from 'package:base':
##
## kappa
library(ggplot2)
library(FunnelPlotR)
data("medpar")
This is using the lme4 library which is a frequentest take on multi-level modelling, but it can generally be interpreted in a Bayesian fashion as well, and many mixed-effects model packages are explicitly Bayesian.
mod1 <- glm(died ~ age80 + los + factor(type), data=medpar, family="binomial")
summary(mod1)
##
## Call:
## glm(formula = died ~ age80 + los + factor(type), family = "binomial",
## data = medpar)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5312 -0.8830 -0.8032 1.2938 2.2568
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.590949 0.097351 -6.070 1.28e-09 ***
## age80 0.656493 0.129180 5.082 3.73e-07 ***
## los -0.037483 0.007871 -4.762 1.92e-06 ***
## factor(type)2 0.418704 0.144611 2.895 0.00379 **
## factor(type)3 0.961028 0.230489 4.170 3.05e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1922.9 on 1494 degrees of freedom
## Residual deviance: 1857.8 on 1490 degrees of freedom
## AIC: 1867.8
##
## Number of Fisher Scoring iterations: 4
auc(mod1)
## [1] 0.6372224
mod2 <- glmer(died ~ (1|provnum) + age80 + los + factor(type), data=medpar, family="binomial")
summary(mod2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: died ~ (1 | provnum) + age80 + los + factor(type)
## Data: medpar
##
## AIC BIC logLik deviance df.resid
## 1866.1 1898.0 -927.1 1854.1 1489
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.5980 -0.7014 -0.5924 1.0959 3.7898
##
## Random effects:
## Groups Name Variance Std.Dev.
## provnum (Intercept) 0.06488 0.2547
## Number of obs: 1495, groups: provnum, 54
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.59035 0.10826 -5.453 4.96e-08 ***
## age80 0.64556 0.13092 4.931 8.18e-07 ***
## los -0.03986 0.00810 -4.921 8.61e-07 ***
## factor(type)2 0.45907 0.15343 2.992 0.002771 **
## factor(type)3 0.91725 0.25056 3.661 0.000251 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) age80 los fct()2
## age80 -0.306
## los -0.596 -0.005
## factr(typ)2 -0.235 0.027 -0.133
## factr(typ)3 -0.093 0.041 -0.159 0.130
auc(mod2)
## [1] 0.660252
Slightly lower AIC (and asymptotically a reduction of >=4 is a 95% significance, so it’s slightly better bad model, slightly improved C-statistic (but it’s on the training set). C-statistics is still rubbish here though.
Remember we are predicting back on to the training set here, so it’s better to describe them as ‘fitted’ I suppose, but it’s still the predict function.
# Conditional (cluster-specific) - the default
medpar$cond_preds <- predict(mod2, newdata= medpar, type="response")
# Marginal
medpar$marg_preds <- predict(mod2, newdata= medpar, type="response", re.form = ~0)
head(medpar[c("cond_preds", "marg_preds")])
## cond_preds marg_preds
## 1 0.2973475 0.3208677
## 2 0.2574509 0.2790687
## 3 0.4564744 0.4839130
## 4 0.2574509 0.2790687
## 5 0.4763117 0.5038373
## 6 0.2973475 0.3208677
ggplot(medpar, aes(y=cond_preds, x=marg_preds, col = as.factor(died)))+
geom_point()+
geom_abline(intercept=0, slope=1, col="blue")+
scale_color_brewer("Died",palette = "Set2")+
labs(title = "Conditional vs Marginal predictions example in R",
subtitle = "Blue line: x=y",
x = "Marginal Prediction (no random-intercept)",
y = "Conditional Prediciton (with random-intercept)")+
theme_minimal()+
theme(legend.position = "bottom")
The equivalent conditional and marginal SMRs would be:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
SMR_set <-
medpar %>%
group_by(provnum) %>%
summarise(num = sum(died),
conditional = sum(cond_preds),
marginal = sum(marg_preds),
SMR_conditional = sum(died) / sum(cond_preds),
SMR_marginal = sum(died) / sum(marg_preds)
)
fp1 <- funnel_plot(SMR_set$num, SMR_set$conditional, SMR_set$provnum,limit = 95, draw_adjusted = TRUE)
## No overdispersion detected, or draw_adjusted to FALSE, plotting using unadjusted limits
plot(fp1)
phi(fp1)
## [1] 0.3882395
fp2 <- funnel_plot(SMR_set$num, SMR_set$marginal, SMR_set$provnum,limit = 95, draw_adjusted = TRUE)
## No overdispersion detected, or draw_adjusted to FALSE, plotting using unadjusted limits
plot(fp2)
phi(fp2)
## [1] 0.7442356