Introduction

This is a surface-level overview of fitting random-intercept models in R and Python and trying to use them for the prediction task described in our chat on 19/01/2022. Hopefully it’s a useful starter!

For prediction, I will demonstrate two methods for a random-intercept:

Although I advocate the marginal prediction, another approach entirely would be to estimate the random-intercept (how much the trust differs from national average), bootstrap a confidence interval and present as a caterpillar plot. That’s another argument though

Data

I’m Hilbe’s COUNT package and the medpar dataset which is a cut from 1991 Medicare files for the state of Arizona.

library(COUNT)
## Loading required package: msme
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: sandwich
library(lme4)
## Loading required package: Matrix
library(ModelMetrics)
## 
## Attaching package: 'ModelMetrics'
## The following object is masked from 'package:base':
## 
##     kappa
library(ggplot2)
library(FunnelPlotR)

data("medpar")

In R

This is using the lme4 library which is a frequentest take on multi-level modelling, but it can generally be interpreted in a Bayesian fashion as well, and many mixed-effects model packages are explicitly Bayesian.

Single-level glm:

mod1 <- glm(died ~ age80 + los + factor(type), data=medpar, family="binomial")

summary(mod1)
## 
## Call:
## glm(formula = died ~ age80 + los + factor(type), family = "binomial", 
##     data = medpar)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5312  -0.8830  -0.8032   1.2938   2.2568  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.590949   0.097351  -6.070 1.28e-09 ***
## age80          0.656493   0.129180   5.082 3.73e-07 ***
## los           -0.037483   0.007871  -4.762 1.92e-06 ***
## factor(type)2  0.418704   0.144611   2.895  0.00379 ** 
## factor(type)3  0.961028   0.230489   4.170 3.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1922.9  on 1494  degrees of freedom
## Residual deviance: 1857.8  on 1490  degrees of freedom
## AIC: 1867.8
## 
## Number of Fisher Scoring iterations: 4
auc(mod1)
## [1] 0.6372224

Random-intercept:

mod2 <- glmer(died ~ (1|provnum) + age80 + los + factor(type), data=medpar, family="binomial")

summary(mod2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: died ~ (1 | provnum) + age80 + los + factor(type)
##    Data: medpar
## 
##      AIC      BIC   logLik deviance df.resid 
##   1866.1   1898.0   -927.1   1854.1     1489 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.5980 -0.7014 -0.5924  1.0959  3.7898 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  provnum (Intercept) 0.06488  0.2547  
## Number of obs: 1495, groups:  provnum, 54
## 
## Fixed effects:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.59035    0.10826  -5.453 4.96e-08 ***
## age80          0.64556    0.13092   4.931 8.18e-07 ***
## los           -0.03986    0.00810  -4.921 8.61e-07 ***
## factor(type)2  0.45907    0.15343   2.992 0.002771 ** 
## factor(type)3  0.91725    0.25056   3.661 0.000251 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) age80  los    fct()2
## age80       -0.306                     
## los         -0.596 -0.005              
## factr(typ)2 -0.235  0.027 -0.133       
## factr(typ)3 -0.093  0.041 -0.159  0.130
auc(mod2)
## [1] 0.660252

Slightly lower AIC (and asymptotically a reduction of >=4 is a 95% significance, so it’s slightly better bad model, slightly improved C-statistic (but it’s on the training set). C-statistics is still rubbish here though.

Prediction

Remember we are predicting back on to the training set here, so it’s better to describe them as ‘fitted’ I suppose, but it’s still the predict function.

# Conditional (cluster-specific)  -  the default
medpar$cond_preds <- predict(mod2, newdata= medpar, type="response")

# Marginal 
medpar$marg_preds <- predict(mod2, newdata= medpar, type="response", re.form = ~0)
head(medpar[c("cond_preds", "marg_preds")])
##   cond_preds marg_preds
## 1  0.2973475  0.3208677
## 2  0.2574509  0.2790687
## 3  0.4564744  0.4839130
## 4  0.2574509  0.2790687
## 5  0.4763117  0.5038373
## 6  0.2973475  0.3208677
ggplot(medpar, aes(y=cond_preds, x=marg_preds, col = as.factor(died)))+
  geom_point()+
  geom_abline(intercept=0, slope=1, col="blue")+
  scale_color_brewer("Died",palette = "Set2")+
  labs(title = "Conditional vs Marginal predictions example in R",
       subtitle = "Blue line: x=y",
       x = "Marginal Prediction (no random-intercept)",
       y = "Conditional Prediciton (with random-intercept)")+
  theme_minimal()+
  theme(legend.position = "bottom")

SHMI equivalent

The equivalent conditional and marginal SMRs would be:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
SMR_set <-
  medpar %>% 
  group_by(provnum) %>% 
  summarise(num = sum(died),
            conditional = sum(cond_preds),
            marginal = sum(marg_preds),
            SMR_conditional = sum(died) / sum(cond_preds),
            SMR_marginal = sum(died) / sum(marg_preds)
          )

Now Funnel Plot output

Conditional

fp1 <- funnel_plot(SMR_set$num, SMR_set$conditional, SMR_set$provnum,limit = 95, draw_adjusted = TRUE)
## No overdispersion detected, or draw_adjusted to FALSE, plotting using unadjusted limits
plot(fp1)

phi(fp1)
## [1] 0.3882395

Marginal

fp2 <- funnel_plot(SMR_set$num, SMR_set$marginal, SMR_set$provnum,limit = 95, draw_adjusted = TRUE)
## No overdispersion detected, or draw_adjusted to FALSE, plotting using unadjusted limits
plot(fp2)

phi(fp2)
## [1] 0.7442356