## 6.3 Regression models

Regression models are models of one variable, called the response, conditional on one or more other variables, called the predictors. Typically we use models that assume conditional independence of the responses given the predictors, and a particularly convenient class of regression models is based on exponential families.

If we let $$y_i \in \mathbb{R}$$ denote the $$i$$th response and $$x_i \in \mathbb{R}^p$$ the $$i$$th predictor we can consider the exponential family of models with joint density $f(\mathbf{y} \mid \mathbf{X}, \theta) = \prod_{i=1}^n e^{\theta^T t_i(y_i \mid x_i) - \log \varphi_i(\theta)}$ for suitably chosen base measures. Here $\mathbf{X} = \left( \begin{array}{c} x_1^T \\ x_2^T \\ \vdots \\ x_{n-1}^T \\ x_n^T \end{array} \right)$ is called the model matrix and is an $$n \times p$$ matrix.

Example 6.3 If $$y_i \in \mathbb{N}_0$$ are counts we often use a Poisson regression model with point probabilities (density w.r.t. counting measure) $p_i(y_i \mid x_i) = e^{-\mu(x_i)} \frac{\mu(x_i)^{y_i}}{y_i!}.$ If the mean depends on the predictors in a log-linear way, $$\log(\mu(x_i)) = x_i^T \beta$$, then $p_i(y_i \mid x_i) = e^{\beta^T x_i y_i - \exp( x_i^T \beta)} \frac{1}{y_i!}.$ The factor $$1/y_i!$$ can be absorbed into the base measure, and we recognize this Poisson regression model as an exponential family with sufficient statistic $t_i(y_i) = x_i y_i$ and $\log \varphi_i(\beta) = \exp( x_i^T \beta).$
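The identity above is easy to check numerically: the exponential-family form of the point probabilities must agree with the standard Poisson density. A small sketch (the predictor `x`, coefficient vector `beta` and count `y` are made up for illustration):

```r
x <- c(1, 0.5)            # one predictor vector (p = 2, including an intercept)
beta <- c(0.3, -0.2)      # hypothetical coefficients
y <- 3                    # an observed count
mu <- exp(sum(x * beta))  # log-linear mean, log(mu) = x^T beta

# Exponential-family form: exp(beta^T x * y - exp(x^T beta)) / y!
p_expfam <- exp(sum(beta * x) * y - mu) / factorial(y)
stopifnot(all.equal(p_expfam, dpois(y, mu)))
```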

To implement numerical optimization algorithms for computing the maximum-likelihood estimate we note that $t(\mathbf{y}) = \sum_{i=1}^n x_i y_i = \mathbf{X}^T \mathbf{y} \quad \text{and} \quad \kappa(\beta) = \sum_{i=1}^n e^{x_i^T \beta},$ that $\nabla \kappa(\beta) = \sum_{i=1}^n x_i e^{x_i^T \beta},$ and that $D^2 \kappa(\beta) = \sum_{i=1}^n x_i x_i^T e^{x_i^T \beta} = \mathbf{X}^T \mathbf{W}(\beta) \mathbf{X}$ where $$\mathbf{W}(\beta)$$ is a diagonal matrix with $$\mathbf{W}(\beta)_{ii} = \exp(x_i^T \beta).$$ All these formulas follow directly from the general formulas for exponential families.

To illustrate the use of a Poisson regression model we consider the following data set from a major Swedish supermarket chain. It contains the number of bags of frozen vegetables sold in a given week in a given store under a marketing campaign. A predicted number of items sold in a normal week (without a campaign) based on historic sales is included. It is of interest to recalibrate this number to give a good prediction of the number of items actually sold.

```r
vegetables <- read_csv("data/vegetables.csv", col_types = cols(store = "c"))
summary(vegetables)
##       sale          normalSale        store
##  Min.   :  1.00   Min.   :  0.20   Length:1066
##  1st Qu.: 12.00   1st Qu.:  4.20   Class :character
##  Median : 21.00   Median :  7.25   Mode  :character
##  Mean   : 40.29   Mean   : 11.72
##  3rd Qu.: 40.00   3rd Qu.: 12.25
##  Max.   :571.00   Max.   :102.00
```

It is natural to model the number of sold items using a Poisson regression model, and we will consider the following simple log-linear model:

$\log(E(\text{sale} \mid \text{normalSale})) = \beta_0 + \beta_1 \log(\text{normalSale}).$

This is an example of an exponential family regression model as above with a Poisson response distribution. It is a standard regression model that can be fitted to data using the glm function and using the formula interface to specify how the response should depend on the normal sale. The model is fitted by computing the maximum-likelihood estimate of the two parameters $$\beta_0$$ and $$\beta_1$$.

```r
# A Poisson regression model
pois_model_null <- glm(
  sale ~ log(normalSale),
  data = vegetables,
  family = poisson
)
summary(pois_model_null)
##
## Call:
## glm(formula = sale ~ log(normalSale), family = poisson, data = vegetables)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -16.218   -2.677   -0.864    1.324   21.730
##
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)        1.461      0.015    97.2   <2e-16 ***
## log(normalSale)    0.922      0.005   184.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 51177  on 1065  degrees of freedom
## Residual deviance: 17502  on 1064  degrees of freedom
## AIC: 22823
##
## Number of Fisher Scoring iterations: 5
```

The fitted model of the expected sale as a function of the normal sale, $$x$$, is
$x \mapsto e^{1.46 + 0.92 \times \log(x)} = (4.31 \times x^{-0.08}) \times x.$ This model roughly predicts a four-fold increase of the sale during a campaign, though the effect decays with increasing $$x$$. If the normal sale is 10 items the factor is $$4.31 \times 10^{-0.08} = 3.58$$, and if the normal sale is 100 items the factor reduces to $$4.31 \times 100^{-0.08} = 2.98$$.
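Using the rounded coefficients from the summary above, the campaign factor is easy to compute directly (a small sketch; the exact numbers depend on the unrounded estimates):

```r
# Campaign factor as a function of the normal sale x,
# using the rounded estimates 4.31 = exp(1.46) and -0.08 = 0.92 - 1
campaign_factor <- function(x) 4.31 * x^(-0.08)
round(campaign_factor(c(10, 100)), 2)  # approximately 3.58 and 2.98
```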

We are not entirely satisfied with this model. It is fitted across all stores in the data set, and it is not obvious that the same effect should apply across all stores. Thus we would like to fit a model of the form

$\log(E(\text{sale})) = \beta_{\text{store}} + \beta_1 \log(\text{normalSale})$

instead. Fortunately, this is also straightforward using glm.

```r
## Note, variable store is a factor! The 'intercept' is removed
## in the formula to obtain a parametrization as above.
pois_model <- glm(
  sale ~ store + log(normalSale) - 1,
  data = vegetables,
  family = poisson
)
```

We will not print all the individual store effects as there are 352 individual stores.

```r
summary(pois_model)$coefficients[c(1, 2, 3, 4, 353), ]
##                 Estimate Std. Error z value   Pr(>|z|)
## store1            2.7182    0.12756  21.309 9.416e-101
## store10           4.2954    0.12043  35.668 1.254e-278
## store100          3.4866    0.12426  28.060 3.055e-173
## store101          3.3007    0.11791  27.993 1.989e-172
## log(normalSale)   0.2025    0.03127   6.474  9.536e-11
```

We should note that the coefficient of $$\log(\text{normalSale})$$ has changed considerably, so for the individual stores the model is now somewhat different.

From a computational viewpoint the most important thing that has changed is that the number of parameters has increased a lot. In the first and simple model there are two parameters. In the second model there are 353 parameters. Computing the maximum-likelihood estimate is a considerably more difficult problem in the second case.
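The difference in problem size is visible directly in the model matrices: with $$s$$ stores the Hessian $\mathbf{X}^T \mathbf{W}(\beta) \mathbf{X}$ of the second model is $$(s + 1) \times (s + 1)$$ rather than $$2 \times 2$$. A small simulated illustration (store labels and group sizes are made up; note also that the large model matrix is mostly zeros, which sparse-matrix methods can exploit):

```r
set.seed(1)
n_stores <- 352
dat <- data.frame(
  store = factor(rep(seq_len(n_stores), each = 3)),
  logNormalSale = rnorm(3 * n_stores)
)

# Model matrices for the two parametrizations
X_null <- model.matrix(~ logNormalSale, dat)        # 1056 x 2
X_full <- model.matrix(~ store + logNormalSale - 1, dat)  # 1056 x 353
c(ncol(X_null), ncol(X_full))
```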