class: title-slide

<br><br>

# Building predictive models with HES data

## Readmissions in HED

https://github.com/chrismainey/Readmissions_in_HED_NHSR2020

<br><br>

.pull-left[

__Chris Mainey__

<p style="font-size:26;"><b>Senior Data Scientist<br>
University Hospitals Birmingham NHS FT</b></p>

<br>

<span style="font-size:26;">[chris.mainey@uhb.nhs.uk](mailto:chris.mainey@uhb.nhs.uk)</span><br>
<a href="https://twitter.com/chrismainey?s=09" style="line-height:2;">@chrismainey</a>

]

.pull-right[

<img src= "https://chrismainey.github.io/Readmissions_in_HED_NHSR2020/assets/logo.png" width=50% height=50%>
<br>
<img src= "https://chrismainey.github.io/Readmissions_in_HED_NHSR2020/assets/HI.png" width=62% height=62%>
<br>
<img src= "https://www.uhb.nhs.uk/Images/uhb-logo-2017.png" width=58% height=58% style="border:10px solid #FFFFFF;">

]

---

# Healthcare Evaluation Data (HED)

<a href="https://www.hed.nhs.uk">www.hed.nhs.uk </a>

.pull-left[

- Online hospital benchmarking system
- Statistical models and analysis tools
- Activity, Mortality, Re-admissions, Length-of-Stay, Market-share etc.
- Built by the Informatics team at University Hospitals Birmingham NHS Foundation Trust
- Used by ~60 NHS and other organisations
- Training and support, including `R`

<br><br>

- __Using national NHS data, including HES, ONS mortality, central returns, NRLS and others__

]

.pull-right[

<br>
<img src="https://chrismainey.github.io/EARL_2019_presentation/assets/HED_system.png" width=110%>

]

---

# Casemix-adjusted indicators

> _How can we compare indicators across different centres/units?_

<br>

--

<p><a href="https://commons.wikimedia.org/wiki/File:English_wine_cask_units.jpg#/media/File:English_wine_cask_units.jpg"><img src="https://upload.wikimedia.org/wikipedia/commons/a/a4/English_wine_cask_units.jpg" alt="English wine cask units.jpg" height="233" width="640" class="center"></a>
<span style="font-size:8px;">By Grolltech; Own work; <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=22228613">Link</a></span>
</p>

--

.pull-left[

+ Patients are aggregated into units of different sizes
+ Each patient is different
+ Consider biases in your approach

]

.pull-right[

+ Important variables may be:
    + Age profiles
    + Elective / Non-elective balance
    + Seasonality

]

---

# Indirectly-standardised ratio

+ Adjust every unit to the expected average risk
+ Commonly, a regression model is used to estimate the effects of predictors.
+ The model is then used to predict the risk of the event for each patient.

<br><br>

+ We can compare our predicted risks to observed events
+ __Relative risk ratio__: $$\frac{\sum \text{events}}{\sum \text{risk}}$$
+ Compare the relative risk ratio to the standard (usually 1, or a multiple such as 100)

---

## Case-study: Relative-Risk Readmission

>_Readmission to any acute provider within 30 days of discharge from another. Indexed to discharge from the first organisation._

--

.pull-left[

<br>

+ Major variables relate to age, sex, admission method, diagnosis and comorbid conditions.

{{content}}

]

--

<br>

+ How we parametrise these variables affects the quality of the model.
    + E.g. age as continuous? A linear term assumes the effect of age is constant across its range.
    + What if it's not? Binning or transformations?

--

.pull-right[

<br><br>

+ Regression assumes all points are independent, __this is not true here:__
    + Patients at hospital X are more like other 'hospital X' patients than the 'average patient'
    + Clustering

]

---

class: middle

# Non-linear data:

What if the relationship between X and Y varies across the range?

---

# What about nonlinear data? (1)

<img src="Modelling_readmissions_files/figure-html/sig-1.png" width="80%" style="display: block; margin: auto;" />

---

# What about nonlinear data? (2)

<img src="Modelling_readmissions_files/figure-html/cats-1.png" width="80%" style="display: block; margin: auto;" />

---

# What about nonlinear data? (3)

<img src="Modelling_readmissions_files/figure-html/cats3-1.png" width="80%" style="display: block; margin: auto;" />

---

# What about nonlinear data? (4)

<img src="Modelling_readmissions_files/figure-html/cats4-1.png" width="80%" style="display: block; margin: auto;" />

---

# GAMs + Splines

.pull-left[

+ Smooth, piece-wise polynomials, like a flexible strip for drawing curves.
<br><br>

+ Sections are joined at 'knot points'

<br><br>

+ Used in a regression, this gives a Generalised Additive Model (GAM)

<br><br>

+ Essentially: a regression on the sum of smoothers

<br><br>

`$$y= \alpha + f(x) + \epsilon$$`

]

.pull-right[

<br><br><br>

<img src="Modelling_readmissions_files/figure-html/gam1-1.png" width="500px" />

]

---

# GAMs in R

Prof. Simon Wood's `mgcv` package is the de facto standard

```r
library(mgcv)

# Fit Y as a smooth function of X, using a cubic regression spline basis
my_gam <- gam(Y ~ s(X, bs = "cr"), data = dt)
```

<br>

+ `s()` specifies a smooth term

<br><br>

+ `bs="cr"` tells it to use a cubic regression spline as the 'basis'

<br><br>

+ The number of knots (or equivalent) is set by the `k` argument, e.g. `k=10`

---

# Model Output:

```r
summary(my_gam)
```

```
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Y ~ s(X, bs = "cr")
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  43.9659     0.8305   52.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##        edf Ref.df     F p-value
## s(X) 6.087  7.143 296.3  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.876   Deviance explained = 87.9%
## GCV = 211.94  Scale est. = 206.93    n = 300
```

---

class: middle

# Clustering

Data are collected in units/centres, not at random from the population

---

# 'Random effects'

Let's imagine we have a big cloud of data points that looks like this:

<img src="Modelling_readmissions_files/figure-html/rint1-1.png" width="80%" style="display: block; margin: auto;" />

---

# Random effects (2)

If we assume all points are independent, the previous model was fine, but...

<img src="Modelling_readmissions_files/figure-html/rint2-1.png" width="80%" style="display: block; margin: auto;" />

---

# Random effects (3)

If we assume all points are independent, the previous model was fine, but...
<img src="Modelling_readmissions_files/figure-html/rint3-1.png" width="80%" style="display: block; margin: auto;" />

---

# Random effects (4)

So we end up with a 'random-intercept' model:

```r
library(lme4)

# Fixed slope for x, with a separate (random) intercept for each cluster
my_ri_model <- lmer(y ~ x + (1 | clust), data = dfc)
summary(my_ri_model)
```

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ x + (1 | clust)
##    Data: dfc
##
## REML criterion at convergence: 3955.8
##
## Scaled residuals:
##      Min       1Q   Median       3Q      Max
## -2.73388 -0.79825  0.01282  0.83659  2.80617
##
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  clust    (Intercept) 2651.3   51.49
##  Residual              151.1   12.29
## Number of obs: 500, groups:  clust, 5
##
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  43.4078    23.0788   1.881
## x            51.3188     0.2839 180.796
##
## Correlation of Fixed Effects:
##   (Intr)
## x -0.062
```

---

# How do we use it?

Web-based, interactive 'modules' that users can interrogate:

<img src= "https://github.com/chrismainey/Readmissions_in_HED_NHSR2020/raw/master/assets/Module.png" width=90% height=90% class="center">

---

# ...but HES is pretty big, right?

YES! Yes it is, so it requires special handling:

+ Memory efficiency and speed - `data.table` package

--

+ Only load the section required for each model:
    + Use the database (SQL Server) for what it's designed for!
    + Stratified by each HRG4 sub-chapter
    + Sparse model matrix

--

+ Parallelisation - `doParallel` - better on Linux, speaking of which:

--

+ Linux! - Set up a VM on the server, running RStudio Server.

--

+ Optimised functions, like `bam()` in `mgcv` and `bigglm()` in `biglm`

---

# Journey in HED

+ HED used SAS for many years to build regression models.

--

+ CM had a PhD project funded by UHB that allowed space to learn `R`

--

+ CM was useless for the first 18 months!

--

+ Then started translating 'broken' SAS models to `R`

--

+ Initially used CM's (annotated) scripts

--

+ Not sustainable: couldn't pass to other analysts, not fault tolerant, no metadata

--

+ Built an `R` package - MB primarily translated the scripts

--

+ `R` package building encouraged the use of Git source control

--

+ Model management database, powered by functions in the `R` package

---

# Summary

+ R is a powerful tool for building case-mix adjustment models
+ Important to understand your data generation mechanism before modelling
+ Regression approaches, common in indirect standardisation, have assumptions
+ When modelling hospital readmissions for HED:
    + Specific modelling of non-linear relationships improved fit, using `gam()`
    + If clustering affects your data, random intercepts may be helpful
+ When building methods, remember it is a marathon, not a sprint
+ Use `R` in its right place in the pipeline
+ Efficient handling is essential
+ Building `R` packages and using source control have been a great help
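
---

# Sketch: relative risk ratio in `R`

A minimal sketch of how the pieces above could fit together; it is not the HED production code. The data are simulated, and the variable names (`age`, `hospital`, `readmit`) and effect sizes are assumptions made purely for illustration. It fits a logistic GAM with a spline for age and a random intercept per hospital, then computes $\sum{events} / \sum{risk}$ for each hospital. Excluding the hospital term when predicting is one possible way of getting the 'expected' risk, and is also an assumption here.

```r
library(mgcv)  # recommended package, ships with R

set.seed(2020)

# Simulated (hypothetical) admissions: 5 hospitals, non-linear age effect
n   <- 2000
dat <- data.frame(
  age      = runif(n, 18, 95),
  hospital = factor(sample(paste0("H", 1:5), n, replace = TRUE))
)
hosp_shift <- c(H1 = -0.4, H2 = -0.2, H3 = 0, H4 = 0.2, H5 = 0.4)
lp <- -2 + 0.0006 * (dat$age - 50)^2 +
  hosp_shift[as.character(dat$hospital)]
dat$readmit <- rbinom(n, 1, plogis(lp))

# Logistic GAM: cubic regression spline for age,
# random intercept for hospital (bs = "re")
fit <- gam(readmit ~ s(age, bs = "cr") + s(hospital, bs = "re"),
           family = binomial, data = dat, method = "REML")

# Predicted risk per patient with the hospital term set to zero,
# i.e. the risk expected at an 'average' hospital
dat$risk <- predict(fit, newdata = dat, type = "response",
                    exclude = "s(hospital)")

# Relative risk ratio per hospital: sum(events) / sum(risk)
observed <- tapply(dat$readmit, dat$hospital, sum)
expected <- tapply(dat$risk, dat$hospital, sum)
rr <- observed / expected
round(rr, 2)
```

Values above 1 suggest more readmissions than the case-mix predicts, and values below 1 suggest fewer. For data at HES scale, `bam()` could be swapped in for `gam()`.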