Fitting Wiggly Data

Why mgcv is awesome





c.mainey1@nhs.net

mainard.co.uk

chrismainey

chrismainey

Picture of Michael Jackson's dance routine for the song 'Smooth Criminal'
Don't think about it too hard...😉

QR code https://chrismainey.github.io/fitting_wiggly_data/fitting_wiggly_data.html

1

Regression models on non-linear data

  • Regression is a method for predicting a variable, Y, using another, X

Two-dimensional scatterplot with a range of data points that show the two dimensions are correlated

2

Equation of a straight line (1)

y=α+βx+ϵ

The same scatter plot from the last slide now has a line of best fit drawn, highlighting where it crosses the vertical axis, called the intercept or alpha. For each unit of 1 on the horizontal axis, the vertical axis rises by an amount called the coefficient, or beta.

3

Equation of a straight line (2)

y=2+1.5x+ϵ

The same scatter plot from the last slide now has a line of best fit drawn, highlighting where it crosses the vertical axis, called the intercept or alpha. For each unit of 1 on the horizontal axis, the vertical axis rises by an amount called the coefficient, or beta.
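
A minimal R sketch of this model (simulated toy data; the object names are illustrative, not from the slides):

set.seed(42)
x <- runif(100, 0, 10)                  # predictor
y <- 2 + 1.5 * x + rnorm(100, sd = 1)   # intercept 2, slope 1.5, plus noise (epsilon)
coef(lm(y ~ x))                         # estimates should come out close to 2 and 1.5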

4

What about nonlinear data? (1)

The same scatter plot from the first slide with no line of best fit

5

What about nonlinear data? (2)

The same scatter plot from the first slide with a straight line of best fit.

6

What about nonlinear data? (3)

The same scatter plot from the first slide with boxes for 3 categories added.

7

What about nonlinear data? (4)

The same scatter plot from the first slide with a smooth sigmoidal shape fitted to the points

8

What about nonlinear data? (5)

The same scatter plot from the first slide with the straight-line, categorical, and smooth sigmoidal fits all shown on the points.

9

Poly - what?

  • More complicated mathsy definitions than I can explain here; first, let's consider powers / orders:

    • Squared (x² or x·x)

    • Cubed (x³ or x·x·x)



If we use these in regression, we can get something like:

y=a₀+a₁x+a₂x²+...+aₙxⁿ
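
In R, this kind of polynomial fit can be sketched with poly() (toy data and names, not from the slides):

set.seed(1)
x <- runif(100, -2, 2)
y <- x^3 - x + rnorm(100, sd = 0.3)   # a curved toy relationship
poly_fit <- lm(y ~ poly(x, 3))        # fits a0 + a1*x + a2*x^2 + a3*x^3 (orthogonalised terms)
coef(poly_fit)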

Problems with polynomials

  • Dodgy fit with increased complexity

  • Can oscillate wildly, particularly at edges:

Runge phenomenon

10

What if we could do something more suitable?

If we could define a set of functions we are happy with:

  • We could use coefficients to 'support' the fit
  • We could use penalties to restrict how much they flex (see the sketch below)

Example of basis functions for fitting data

Figure taken from Noam Ross' GAMs in R course, CC-BY, https://github.com/noamross/gams-in-r-course
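
A rough illustration of the first idea in R, using the base splines package rather than the figure's basis (unpenalised; illustrative names only):

library(splines)
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)    # wiggly toy data
basis_fit <- lm(y ~ bs(x, df = 6))    # bs() builds the basis; lm() estimates one coefficient per basis function
coef(basis_fit)                       # no penalty yet - that comes with GAMs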

11

Splines

  • How do you (manually) draw a smooth curve?

Draftsman's spline / 'flat spline'

  • Thin flexible strip that bends to curves

  • Held in place by weights ("ducks"), nails etc.

  • The tension element is important: spline flexes minimally

Spline (PSF)

12

Mathematical splines

  • Smooth, piece-wise polynomials, like a flexible strip for drawing curves.
  • 'Knot points' between each section

Scatter plot from earlier with a wiggly sigmoidal best fit line

13

How smooth?

Can be controlled by the number of knots (k), or by a penalty (γ); a quick mgcv sketch follows below.

Scatter plot from earlier with a wiggly sigmoidal best fit lines for 2, 20 and 50 points

Scatter plot from earlier with a wiggly sigmoidal best fit lines for 20 knots but differing penalties 0.001, 1 and 10
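
In mgcv both controls are available when fitting (a sketch on simulated toy data; dt_toy is a made-up name, not the talk's dataset):

library(mgcv)
set.seed(1)
dt_toy <- data.frame(X = seq(0, 10, length.out = 300))
dt_toy$Y <- sin(dt_toy$X) * 10 + rnorm(300, sd = 2)
fit_k     <- gam(Y ~ s(X, bs = "cr", k = 20), data = dt_toy)            # k sets the basis dimension (maximum wiggliness)
fit_stiff <- gam(Y ~ s(X, bs = "cr", k = 20), sp = 10, data = dt_toy)   # sp fixes the smoothing penalty (a stiffer fit)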

14

Generalized Additive Model

  • Regression models where we fit smoothers (like splines) from our data.
  • Strictly additive, but smoothers can describe complex relationships.
  • In our case:

y=α+f(x)+ϵ

Or more formally, an example GAM might be (Wood, 2017):

g(μᵢ)=Aᵢθ+f₁(x₁ᵢ)+f₂(x₂ᵢ)+f₃(x₃ᵢ,x₄ᵢ)+...
Where:

  • μᵢ ≡ E(Yᵢ), the expectation of Y
  • Yᵢ ∼ EF(μᵢ, ϕ): Yᵢ is a response variable, distributed according to an exponential family distribution with mean μᵢ and scale parameter ϕ.
  • Aᵢ is a row of the model matrix for any strictly parametric model components, with θ the corresponding parameter vector.
  • fⱼ are smooth functions of the covariates, xₖ, each represented using a basis.
15

What does that mean for me?

  • Can build regression models with smoothers, particularly suited to non-linear, or noisy data

  • Hastie (1985) used a knot at every data point; Wood (2017) uses a reduced-rank version


mgcv: mixed gam computation vehicle

  • Prof. Simon Wood's package, pretty much the standard
  • Included in the standard R distribution; used by ggplot2's geom_smooth, etc.
library(mgcv)
my_gam <- gam(Y ~ s(X, bs="cr"), data=dt)
  • s() specifies a smoother (related constructors include te() and ti() for tensor-product smooths)
  • bs="cr" tells it to use a cubic regression spline basis
  • Sensible defaults are chosen for you, but you can alter them, e.g. s(X, bs="cr", k=10)
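
Once fitted, the model behaves like other R model objects; for example, predictions (a sketch, with made-up X values):

predict(my_gam, newdata = data.frame(X = c(1, 5, 10)), se.fit = TRUE)  # replace with values in the range of your data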
16

Model Output:

summary(my_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Y ~ s(X, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.9659 0.8305 52.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(X) 6.087 7.143 296.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.876 Deviance explained = 87.9%
## GCV = 211.94 Scale est. = 206.93 n = 300
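
To see the shape of the estimated smooth rather than just the table, mgcv's built-in plot method can be used (a sketch):

plot(my_gam, shade = TRUE, residuals = TRUE)  # fitted smooth with confidence band and partial residuals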
17

Check your model:

gam.check(my_gam)

##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 1.107369e-05 .
## The Hessian was positive definite.
## Model rank = 10 / 10
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(X) 9.00 6.09 1.1 0.97
18

Is it any better than a linear model?

my_lm <- lm(Y ~ X, data=dt)
anova(my_lm, my_gam)
## Analysis of Variance Table
##
## Model 1: Y ~ X
## Model 2: Y ~ s(X, bs = "cr")
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 298.00 88154
## 2 292.91 60613 5.0873 27540 26.161 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(my_lm, my_gam)
## df AIC
## my_lm 3.000000 2562.280
## my_gam 8.087281 2460.085

Yes, yes it is!

20

Summary

  • Regression models are concerned with explaining one variable: y, with another: x

  • This relationship is assumed to be linear

  • If your data are not linear, or noisy, a smoother might be appropriate

  • Splines are ideal smoothers, and are polynomials joined at 'knot' points

  • GAMs are a framework for regressions using smoothers

  • mgcv is a great package for GAMs with various smoothers available

  • mgcv estimates the required smoothing penalty for you

  • The gratia or mgcViz packages are good visualization tools for GAMs (see the sketch below)
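
For example, with gratia (a sketch; assumes the package is installed and my_gam is the model fitted earlier):

library(gratia)
draw(my_gam)      # ggplot2 plot of the estimated smooth(s)
appraise(my_gam)  # ggplot2 versions of the gam.check() diagnostics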

21

References and Further reading:

GitHub code:

https://github.com/chrismainey/fitting_wiggly_data

Simon Wood's comprehensive book:

  • WOOD, S. N. 2017. Generalized Additive Models: An Introduction with R, Second Edition, Florida, USA, CRC Press.

Noam Ross free online GAM course:

https://noamross.github.io/gams-in-r-course/


  • HARRELL, F. E., JR. 2001. Regression Modeling Strategies, New York, Springer-Verlag New York.

  • HASTIE, T. & TIBSHIRANI, R. 1986. Generalized Additive Models. Statistical Science, 1, 297-310.

  • HASTIE, T., TIBSHIRANI, R. & FRIEDMAN, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York, Springer.

22
