.pull-left[ <br><br><br> # Modelling non-linear data with Generalized Additive Models (GAMs) ### Using the `mgcv` package <br><br> <br> <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M464 64H48C21.49 64 0 85.49 0 112v288c0 26.51 21.49 48 48 48h416c26.51 0 48-21.49 48-48V112c0-26.51-21.49-48-48-48zm0 48v40.805c-22.422 18.259-58.168 46.651-134.587 106.49-16.841 13.247-50.201 45.072-73.413 44.701-23.208.375-56.579-31.459-73.413-44.701C106.18 199.465 70.425 171.067 48 152.805V112h416zM48 400V214.398c22.914 18.251 55.409 43.862 104.938 82.646 21.857 17.205 60.134 55.186 103.062 54.955 42.717.231 80.509-37.199 103.053-54.947 49.528-38.783 82.032-64.401 104.947-82.653V400H48z"></path></svg> [c.mainey1@nhs.net](mailto:c.mainey1@nhs.net) <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> [chrismainey](https://github.com/chrismainey) <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#005EB8;" xmlns="http://www.w3.org/2000/svg"> <path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"></path></svg> [chrismainey](https://www.linkedin.com/in/chrismainey/) <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#A6CE39;" xmlns="http://www.w3.org/2000/svg"> <path d="M294.75 188.19h-45.92V342h47.47c67.62 0 83.12-51.34 83.12-76.91 0-41.64-26.54-76.9-84.67-76.9zM256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8zm-80.79 360.76h-29.84v-207.5h29.84zm-14.92-231.14a19.57 19.57 0 1 1 19.57-19.57 19.64 19.64 0 0 1-19.57 19.57zM300 369h-81V161.26h80.6c76.73 0 110.44 54.83 110.44 103.85C410 318.39 368.38 369 300 369z"></path></svg> [0000-0002-3018-6171](https://orcid.org/0000-0002-3018-6171) ] .pull-right[ <p style="text-align:center;font-weight:bold;"><img src="man/figures/download.png" style="height:550px;" alt="Picture of Michael Jackson's dance routine for the song 'Smooth Criminal', with the bad "dad-joke": 'You've been hit by a smoothed residual'"> <br> Don't think about it too hard...😉 </p> ] --- # Regression models on non-linear data + Regression is a method for predicting a variable, `Y`, using another, `X` <img src="GAMworkshop_files/figure-html/regression1-1.png" alt="Two-dimensional scatterplot with a range of data points that show the two dimensions are correlated" width="864" style="display: block; margin: auto;" /> --- # Equation of a straight line (1) `$$y= \alpha + \beta x + \epsilon$$` <img src="GAMworkshop_files/figure-html/regression2-1.png" alt="The same scatter plot from the last slide now has a line of best fit drawn, highlighting where it crosses the vertical axis called the intercept or alpha. For each unit of 1 on the horizontal axis, the vertical axis raises by and amount called a coefficient, or beta." width="864" style="display: block; margin: auto;" /> --- # Equation of a straight line (2) `$$y= 2 + 1.5 x + \epsilon$$` <img src="GAMworkshop_files/figure-html/regression3-1.png" alt="The same scatter plot from the last slide now has a line of best fit drawn, highlighting where it crosses the vertical axis called the intercept or alpha. For each unit of 1 on the horizontal axis, the vertical axis raises by and amount called a coefficient, or beta." width="864" style="display: block; margin: auto;" /> --- # What about nonlinear data? (1) <img src="GAMworkshop_files/figure-html/sig-1.png" alt="The same scatter plot from the first slide with no line of best fit" width="864" style="display: block; margin: auto;" /> --- # What about nonlinear data? (2) <img src="GAMworkshop_files/figure-html/cats-1.png" alt="The same scatter plot from the first slide with a straight line of best fit." width="864" style="display: block; margin: auto;" /> --- # What about nonlinear data? (3) <img src="GAMworkshop_files/figure-html/cats3-1.png" alt="The same scatter plot from the first slide with boxes for 3 categories added." width="864" style="display: block; margin: auto;" /> --- # What about nonlinear data? (4) <img src="GAMworkshop_files/figure-html/cats4-1.png" alt="The same scatter plot from the first slide with with a smooth sigmoidal shape fitted to the points" width="864" style="display: block; margin: auto;" /> --- # What about nonlinear data? (5) <img src="GAMworkshop_files/figure-html/cats5-1.png" alt="The same scatter plot from the first slide with a straight line, categorical and the a smooth sigmoidal shapes fitted to the points." width="864" style="display: block; margin: auto;" /> --- .pull-left[ # Poly - what? + More complicated mathsy definitions than I can explain, first lets consider powers / orders: + Squared ( `\(x^2\)` or `\(x * x\)`) + Cubed ( `\(x^3\)` or `\(x * x * x\)`) <br><br> If we use these in regression, we can get something like: `$$y = \alpha + \beta_1x + \beta_2x^2 + \beta_3x^3... + \beta_nx_n^z$$` ] -- .pull-right[ ## Problems with polynomials + Dodgy fit with increased complexity + Can oscillate wildly, particularly at edges: + [Runge's phenomenon](https://en.wikipedia.org/wiki/Runge%27s_phenomenon#:~:text=In%20the%20mathematical%20field%20of,set%20of%20equispaced%20interpolation%20points.) <p style="text-align:center;"><a title="Nicoguaro, CC BY 4.0 <https://creativecommons.org/licenses/by/4.0>, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Runge_phenomenon.svg"><img width="400" alt="Runge phenomenon" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Runge_phenomenon.svg/512px-Runge_phenomenon.svg.png"; alt = "Chart of the shape of polynomial function, oscillation at the edges as the order of the function increases, demonstrating Runges Phenomenon"></a></p> ] --- # Degrees of freedom (df) Within a model, how many 'parts' are free to vary? E.g. If we have 3 numbers and we know the average is 5: If we have 2 and 7, our final number is not free to vary. It must be 6: $$ \frac{2 + 7 + 6}{3} = 5$$ This means our 'model' is constrained to `\(n-1\)` degrees of freedom -- ## In regression context: + The number of data points in our model ( `\(N\)` ) limits the df + Usually the number of predictors ( `\(k\)` ) in our model is considered the df (one of these is the intercept) + "Residual df" are points left to vary in the model, after considering the df: `$$N-k-1$$` Helpful post on [CrossValidated](https://stats.stackexchange.com/questions/340007/confused-about-residual-degree-of-freedom) --- # Overfitting > When our model fits both the underlying relationship and the 'noise' percuiliar to our sample data + You want to fit the relationship, whilst minimising the noise. + This helps 'generalizability': meaning it will predict well on new data. If we allowed total freedom in our model, e.g. a knot at every data point. What would happen? --- class: middle # Exercise 1: Load and fit non-linear relationship Here we will visualise the relationship, view it as a linear regression, and attempt a polynomial fit. --- # What if we could do something more suitable? If we could define a set of functions we are happy with: + We could use coefficients to 'support' the fit + Could use penalties to restrict how much they flex <p style="text-align:center;font-weight:bold;"><img src="man/figures/basis_functions1.png" style="height:350px;" alt="Example of basis functions for fitting data"></p> .smaller[ Figure taken from Noam Ross' GAMs in R course, CC-BY, https://github.com/noamross/gams-in-r-course ] --- # Splines + How do you (manually) draw a smooth curve? -- .pull-left[ ### [Draftsman's spline](https://www.core77.com/posts/55368/When-Splines-Were-Physical-Objects) / ['flat spline'](https://en.wikipedia.org/wiki/Flat_spline) + Thin flexible strip that bends to curves + Held in place by weights ("ducks"), nails etc. + The tension element is important: spline flexes minimally ] .pull-right[ <p style="text-align:center;"><a title="Pearson Scott Foresman, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Spline_(PSF).png"><img width="400" alt="Spline (PSF)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Spline_%28PSF%29.png/512px-Spline_%28PSF%29.png"; alt = "Drawing of a draftsman bending a flexible piece of wood and using it to draw a smooth curve."></a> </p> ] --- # Mathematical splines + Smooth, piece-wise polynomials, like a flexible strip for drawing curves. + 'Knot points' between each section <img src="GAMworkshop_files/figure-html/gam1-1.png" alt="Scatter plot from earlier with a wiggly sigmoidal best fit line" width="864" style="display: block; margin: auto;" /> --- # How smooth? Can be controlled by number of knots `\((k)\)`, or by a penalty `\(\gamma\)`. .pull-left[ <img src="GAMworkshop_files/figure-html/knots2-1.png" alt="Scatter plot from earlier with a wiggly sigmoidal best fit lines for 2, 20 and 50 points" width="432" style="display: block; margin: auto;" /> ] .pull-right[ <img src="GAMworkshop_files/figure-html/penalty-1.png" alt="Scatter plot from earlier with a wiggly sigmoidal best fit lines for 20 knots but differing penalties 0.001, 1 and 10" width="432" style="display: block; margin: auto;" /> ] --- # Generalized Additive Model + Regression models where we fit smoothers (like splines) from our data. + Strictly additive, but smoothers can describe complex relationships. + In our case: `$$y= \alpha + f(x) + \epsilon$$` -- <br> .smaller[ Or more formally, an example GAM might be (Wood, 2017): ] `$$g(\mu_i) = A_i \theta + f_1(x_1) + f_2(x_{2i}) + f3(x_{3i}, x_{4i}) + ...$$` <br> Where: .smaller[ + `\(\mu_i \equiv \mathbb{E}(Y _i)\)`, the expectation of Y] .smaller[ + `\(Yi \sim EF(\mu _i, \phi _i)\)`, `\(Yi\)` is a response variable, distributed according to exponential family distribution with mean `\(\mu _i\)` and shape parameter `\(\phi\)`.] .smaller[ + `\(A_i\)` is a row of the model matrix for any strictly parametric model components with `\(\theta\)` the corresponding parameter vector.] .smaller[ + `\(f_i\)` are smooth functions of the covariates, `\(xk\)`, where `\(k\)` is each function basis.] --- # What does that mean for me? + Can build regression models with smoothers, particularly suited to non-linear, or noisy data + _Hastie (1985)_ used knot every point, _Wood (2017)_ uses reduced-rank version -- ## Issues + We need to chose the right _dimension_ (degrees of freedom / knots) for our smoothers + We need to chose the right penalty ( `\(\lambda\)` ) for our smoothers ### Consequence + If you penalise a smooth of `\(k\)` dimensions, it no longer has `\(k-1\)` degrees of freedom as they are reduced + 'Effective degrees of freedom' - the penalized df of the predictors in the model. __Note:__ <br> `\(df(\lambda) = k\)`, when `\(\lambda = 0\)` <br> `\(df(\lambda) \rightarrow 0\)`, when `\(\lambda \rightarrow \infty\)` --- # mgcv: mixed gam computation vehicle + Prof. Simon Wood's package, pretty much the standard + Included in standard `R` distribution, used in `ggplot2` `geom_smooth` etc. + Has sensible defaults for dimensions + Estimates the ideal penalty for smooths by various methods, with REML recommended. -- ```r library(mgcv) my_gam <- gam(Y ~ s(X, bs="cr"), data=dt) ``` + `s()` controls smoothers (and other options, `t`, `ti`) + `bs="cr"` telling it to use cubic regression spline ('basis') + Default determined from data, but you can alter this e.g. (`k=10`) + Penalty (smoothing parameter) estimation method is set to (`REML`) --- # Model Output: ```r summary(my_gam) ``` ``` ## ## Family: gaussian ## Link function: identity ## ## Formula: ## Y ~ s(X, bs = "cr") ## ## Parametric coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 43.9659 0.8305 52.94 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df F p-value ## s(X) 6.087 7.143 296.3 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.876 Deviance explained = 87.9% ## GCV = 211.94 Scale est. = 206.93 n = 300 ``` --- # Check your model: ```r gam.check(my_gam) ``` <img src="GAMworkshop_files/figure-html/gam3a-1.png" width="864" style="display: block; margin: auto;" /> ``` ## ## Method: GCV Optimizer: magic ## Smoothing parameter selection converged after 4 iterations. ## The RMS GCV score gradient at convergence was 1.107369e-05 . ## The Hessian was positive definite. ## Model rank = 10 / 10 ## ## Basis dimension (k) checking results. Low p-value (k-index<1) may ## indicate that k is too low, especially if edf is close to k'. ## ## k' edf k-index p-value ## s(X) 9.00 6.09 1.1 0.97 ``` --- # Check your model: ```r gam.check(my_gam) ``` ``` ## ## Method: GCV Optimizer: magic ## Smoothing parameter selection converged after 4 iterations. ## The RMS GCV score gradient at convergence was 1.107369e-05 . ## The Hessian was positive definite. ## Model rank = 10 / 10 ## ## Basis dimension (k) checking results. Low p-value (k-index<1) may ## indicate that k is too low, especially if edf is close to k'. ## ## k' edf k-index p-value ## s(X) 9.00 6.09 1.1 0.93 ``` --- # Is it any better than linear model? ```r my_lm <- lm(Y ~ X, data=dt) anova(my_lm, my_gam) ``` ``` ## Analysis of Variance Table ## ## Model 1: Y ~ X ## Model 2: Y ~ s(X, bs = "cr") ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 298.00 88154 ## 2 292.91 60613 5.0873 27540 26.161 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ```r AIC(my_lm, my_gam) ``` ``` ## df AIC ## my_lm 3.000000 2562.280 ## my_gam 8.087281 2460.085 ``` -- ## Yes, yes it is! --- class: middle # Exercise 2: Simple GAM fit We will now fit the same relationship with a GAM using the `mgcv` package. --- class: middle # Exercise 3: GAM fitting options We will now look at varying the fit using things like the degrees of freedom and penalty. We will also visualise these changes and effects on models. --- class: middle # Exercise 4: Multivariable GAMs We will now generalise to include more than one predictor / smooth function, how they might be combined, and effects on models. We will also progress on to a generalised linear model, using a distribution family --- class: middle # Break --- class: middle # Exercise 5: Put it all together! We will now apply what we've learnt to the Framingham cohort study, predicting the binary variable: 'TenYearCHD' using other columns. See how you can use smoothers on the continuous variables to get the best fit possible. ___Hint:___ you will need to use AIC or auc to compare models, not R2. --- # Summary + Regression models are concerned with explaining one variable: `y`, with another: `x` + This relationship is assumed to be linear + If your data are not linear, or noisy, a smoother might be appropriate -- + Splines are ideal smoothers, and are polynomials joined at 'knot' points -- + GAMs are a framework for regressions using smoothers -- + `mgcv` is a great package for GAMs with various smoothers available + `mgcv` estimates the required smoothing penalty for you + `gratia` or `mgcViz` packages are good visualization tool for GAMs --- # References and Further reading: #### GitHub code: https://github.com/chrismainey/GAMworkshop #### Simon Wood's comprehensive book: + WOOD, S. N. 2017. Generalized Additive Models: An Introduction with R, Second Edition, Florida, USA, CRC Press. #### Noam Ross free online GAM course: https://noamross.github.io/gams-in-r-course/ <br> + HARRELL, F. E., JR. 2001. Regression Modeling Strategies, New York, Springer-Verlag New York. + HASTIE, T. & TIBSHIRANI, R. 1986. Generalized Additive Models. Statistical Science, 1, 297-310. 291 + HASTIE, T., TIBSHIRANI, R. & FRIEDMAN, J. 2009. The Elements of Statistical Learning : Data Mining, Inference, and Prediction, New York, NETHERLANDS, Springer.