## 6.1 Exponential families

This section introduces exponential families in a concise way. The crucial observation is that the log-likelihood is concave, and that we can derive general formulas for derivatives. This will be important for the optimization algorithms developed later for computing maximum-likelihood estimates and for answering standard asymptotic inference questions.

The exponential families are extremely well behaved from a mathematical as well as a computational viewpoint, but they may be inadequate for modeling data in some cases. A typical practical problem is that there is heterogeneous variation in data beyond what can be captured by any single exponential family. A fairly common technique is then to build an exponential family model of the observed variables as well as some latent variables. The latent variables then serve the purpose of modeling the heterogeneity. The resulting model of the observed variables is consequently the marginalization of an exponential family, which is generally not an exponential family and in many ways less well behaved. It is nevertheless possible to exploit the exponential family structure underlying the marginalized model for many computations of statistical importance. The EM-algorithm as treated in Chapter 8 is one particularly good example, but Bayesian computations can in similar ways exploit the structure.

### 6.1.1 Full exponential families

In this section we consider statistical models on an abstract product sample space $\mathcal{Y} = \mathcal{Y}_1 \times \ldots \times \mathcal{Y}_m.$ We will be interested in models of observations $$y_1 \in \mathcal{Y}_1, \ldots, y_m \in \mathcal{Y}_m$$ that are independent but not necessarily identically distributed.

An exponential family is defined in terms of two ingredients:

• maps $$t_j : \mathcal{Y}_j \to \mathbb{R}^p$$ for $$j = 1, \ldots, m$$,
• and non-trivial $$\sigma$$-finite measures $$\nu_j$$ on $$\mathcal{Y}_j$$ for $$j = 1, \ldots, m$$.

The maps $$t_j$$ are called sufficient statistics, and in terms of these and the base measures $$\nu_j$$ we define $\varphi_j(\theta) = \int e^{\theta^T t_j(u)} \nu_j(\mathrm{d}u).$ These functions are well defined as functions $\varphi_j : \mathbb{R}^p \to (0,\infty].$ We define $\Theta_j = \mathrm{int}(\{ \theta \in \mathbb{R}^p \mid \varphi_j(\theta) < \infty \}),$ which by definition is an open set as. It can be shown that $$\Theta_j$$ is convex and that $$\varphi_j$$ is a log-convex function. Defining $\Theta = \bigcap_{j=1}^m \Theta_j,$ then $$\Theta$$ is likewise open and convex, and we define the exponential family as the distributions parametrized by $$\theta \in \Theta$$ that have densities

$\begin{equation} f(\mathbf{y} \mid \theta) = \prod_{j=1}^m \frac{1}{\varphi_j(\theta)} e^{\theta^T t_j(y_j)} = e^{\theta^T \sum_{j=1}^m t_j(y_j) - \sum_{j=1}^m \log \varphi_j(\theta)}, \quad \mathbf{y} \in \mathcal{Y}, \tag{6.1} \end{equation}$

w.r.t. $$\otimes_{j=1}^m \nu_j$$. The case where $$\Theta = \emptyset$$ is of no interest, and we will thus assume that the parameter set $$\Theta$$ is non-empty. The parameter $$\theta$$ is called the canonical parameter and $$\Theta$$ is the canonical parameter space. We may also say that the exponential family is canonically parametrized by $$\theta$$. It is important to realize that an exponential family may come with a non-canonical parametrization that doesn’t reveal right away that it is an exponential family. Thus a bit of work is then needed to show that the parametrized family of distributions can, indeed, be reparametrized as an exponential family. In the non-canonical parametrization, the family is then an example of a curved exponential family as defined below.

Example 6.1 The von Mises distributions on $$\mathcal{Y} = (-\pi, \pi]$$ form an exponential family with $$m = 1$$. The sufficient statistic $$t_1 : (-\pi, \pi] \mapsto \mathbb{R}^2$$ is $t_1(y) = \left(\begin{array}{c} \cos(y) \\ \sin(y) \end{array}\right),$ and $\varphi(\theta) = \int_{-\pi}^{\pi} e^{\theta_1 \cos(u) + \theta_2 \sin(u)} \mathrm{d}u < \infty$ for all $$\theta = (\theta_1, \theta_2)^T \in \mathbb{R}^2$$. Thus the canonical parameter space is $$\Theta = \mathbb{R}^2$$.

As mentioned in Section 1.2.1, the function $$\varphi(\theta)$$ can be expressed in terms of a modified Bessel function, but it doesn’t have an expression in terms of elementary functions. Likewise in Section 1.2.1, an alternative parametrization (polar coordinates) was given; $(\kappa, \mu) \mapsto \theta = \kappa \left(\begin{array}{c} \cos(\mu) \\ \sin(\mu) \end{array}\right)$ that maps $$[0,\infty) \times (-\pi, \pi]$$ onto $$\Theta$$. The von Mises distributions form a curved exponential family in the $$(\kappa, \mu)$$-parametrization, but this parametrization has several problems. First, the $$\mu$$ parameter is not identifiable if $$\kappa = 0$$, which is reflected by the fact that the reparametrization is not a one-to-one map. Second, the parameter space is not open, which can be quite a nuisance for e.g. maxmimum-likelihood estimation. We could circumvent these problems by restricting attention to $$(\kappa, \mu) \in (0,\infty) \times (-\pi, \pi)$$, but we would then miss some of the distributions in the exponential family – notably the uniform distribution corresponding to $$\theta = 0$$. In conclusion, the canonical parametrization of the family of distributions as an exponential family is preferable for mathematical and computational reasons.

Example 6.2 The family of Gaussian distributions on $$\mathbb{R}$$ is an example of an exponential family as defined above with $$m = 1$$ and $$\mathcal{Y} = \mathbb{R}$$. The density of the $$\mathcal{N}(\mu, \sigma^2)$$ distribution is $\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{\pi}} \exp\left(\frac{\mu}{\sigma^2} y - \frac{1}{2\sigma^2} y^2 - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\log (2\sigma^2) \right).$ Letting the base measure $$\nu_1$$ be Lebesgue measure scaled by $$1/\sqrt{\pi}$$, and $$t_1 : \mathbb{R} \mapsto \mathbb{R}^2$$ be $t_1(y) = \left(\begin{array}{c} y \\ - y^2 \end{array}\right)$ we identify this family of distributions as an exponential family with canonical parameter $\theta = \left(\begin{array}{c} \frac{\mu}{\sigma^2} \\ \frac{1}{2 \sigma^2} \end{array}\right).$

We can express the mean and variance in terms of $$\theta$$ as $\sigma^2 = \frac{1}{2\theta_2} \quad \text{and} \quad \mu = \frac{\theta_1}{2\theta_2},$ and we find that $\log \varphi_1(\theta) = \frac{\mu^2}{2\sigma^2} + \frac{1}{2} \log(2 \sigma^2) = \frac{\theta_1^2}{4\theta_2} - \frac{1}{2} \log \theta_2.$

We note that the reparametrization $$(\mu, \sigma^2) \mapsto \theta$$ maps $$\mathbb{R} \times (0,\infty)$$ bijectively onto the open set $$\mathbb{R} \times (0,\infty)$$, and that the formula above for $$\log \varphi_1(\theta)$$ only holds on this set. It is natural to ask if the canonical parameter space is actually larger for this exponential family. That is, is $$\varphi_1(\theta) < \infty$$ for $$\theta_2 \leq 0$$? To this end observe that if $$\theta_2 \leq 0$$ $\varphi_1(\theta) = \frac{1}{\sqrt{2\pi}} \int e^{\theta_1 u - \theta_2 \frac{u^2}{2}} \mathrm{d}u \geq \frac{1}{\sqrt{2\pi}} \int e^{\theta_1 u} \mathrm{d} u = \infty,$ and we conclude that $\Theta = \mathbb{R} \times (0,\infty).$

The family of Gaussian distributions is an example of a family of distributions whose commonly used parametrization in terms of mean and variance differs from the canonical parametrization as an exponential family. The mean and variance are easy to interpret, but in terms of mean and variance, the Gaussian distributions form a curved exponential family. For mathematical and computational purposes the canonical parametrization is preferable.

For general exponential families it may seem restrictive that all the sufficient statistics, $$t_j$$, take values in the same $$p$$-dimensional space, and that all marginal distributions share a common parameter vector $$\theta$$. This is, however, not a restriction. Say we have two distributions with sufficient statistics $$\tilde{t}_1 : \mathcal{Y}_1 \to \mathbb{R}^{p_1}$$ and $$\tilde{t}_2 : \mathcal{Y}_2 \to \mathbb{R}^{p_2}$$ and corresponding parameters $$\theta_1$$ and $$\theta_2$$, then we construct $t_1(y_1) = \left(\begin{array}{c} \tilde{t}_1(y_1) \\ 0 \end{array}\right) \quad \text{and} \quad t_2(y_2) = \left(\begin{array}{c} 0 \\ \tilde{t}_2(y_2) \end{array}\right).$ Now $$t_1$$ and $$t_2$$ both map into $$\mathbb{R}^{p}$$ with $$p = p_1 + p_2$$, and we can bundle the parameters together into the vector $\theta = \left(\begin{array}{c} \theta_1 \\ \theta_2 \end{array}\right) \in \mathbb{R}^p.$ Clearly, this construction can be generalized to any number of distributions and allows us to always assume a common parameter space. The sufficient statistics then take care of selecting which of the coordinates in the parameter vector that is used for any particular marginal distribution.

### 6.1.2 Exponential family Bayesian networks

An exponential family is defined above as a parametrized family of distributions on $$\mathcal{Y}$$ of independent variables. That is, the joint density in (6.1) factorizes w.r.t. a product measure. Without really changing the notation this assumption can be relaxed considerably.

If the sufficient statistic $$t_j$$, instead of being a fixed map, is allowed to depend on $$y_1, \ldots, y_{j-1}$$, $$f(\mathbf{y} \mid \theta)$$ as defined in (6.1) is still a joint density. The only difference is that the factor $f_j(y_j \mid y_1, \ldots, y_{j-1}, \theta) = e^{\theta^T t_j(y_j) - \log \varphi_j(\theta)}$ is now the conditional density of $$y_j$$ given $$y_1, \ldots, y_{j-1}$$. In the notation we let the dependence of $$t_j$$ on $$y_1, \ldots, y_{j-1}$$ be implicit as this will not affect the abstract theory. In any concrete example though, it will be clear how $$t_j$$ actually depends upon all, some or none of these variables. Note that $$\varphi_j$$ may now also depend upon the data through $$y_1, \ldots, y_{j-1}$$.

The observation above is powerful. It allows us to develop a unified approach based on exponential families for a majority of all statistical models that are applied in practice. The models we consider make two essential assumptions

1. the variables that we model can be ordered such that $$y_j$$ only depends on $$y_1, \ldots, y_{j-1}$$ for $$j = 1, \ldots, m$$,
2. all the conditional distributions form exponential families themselves, with the conditioning variables entering through the $$t_j$$-maps.

The first of these assumptions is equivalent to the joint distribution being a Bayesian network, that is, a distribution whose density factorizes according to a directed acyclic graph. This includes time series models and hierarchical models. The second assumption is more restrictive, but is a common practice in applied work. Moreover, as many standard statistical models of univariate discrete and continuous variables are, in fact, exponential families, building up a joint distribution as a Bayesian network via conditional binomial, Poisson, beta, Gamma and Gaussian distributions among others is a rather flexible framework, and yet it will result in a model with a density that factorizes as in (6.1).

Bayesian networks is a large and interesting subject in itself, and it is unfair to gloss over all the details. One of the main points is that for many computations it is possible to develop efficient algorithms by exploiting the graph structure. The seminal paper by Lauritzen and Spiegelhalter (1988) demonstrated this for the computation of conditional distributions. The mere fact that the variables can be ordered in a way that aligns with how the variables depend on each other is useful, but there can be many ways to do this, and just specifying an ordering ignores important details of the graph. It is, however, beyond the scope of this book to get into the graph algorithms required for a thorough general treatment of Bayesian networks.

### 6.1.3 Likelihood computations

To simplify the notation we introduce $t(\mathbf{y}) = \sum_{j=1}^m t_j(y_j),$ which we refer to as the sufficient statistic, and $\kappa(\theta) = \sum_{j=1}^m \log \varphi_j(\theta),$ which is a convex $$C^{\infty}$$-function on $$\Theta$$.

Having observed $$\mathbf{y} \in \mathcal{Y}$$ it is evident that the log-likelihood for the exponential family specified by (6.1) is $\ell(\theta) = \log f(\mathbf{y} \mid \theta) = \theta^T t(\mathbf{y}) - \kappa(\theta).$ From this it follows that the gradient of the log-likelihood is $\nabla \ell(\theta) = t(\mathbf{y}) - \nabla \kappa(\theta)$ and the Hessian is $D^2 \ell(\theta) = - D^2 \kappa(\theta),$ which is always negative semidefinite. The maximum-likelihood estimator exists if and only if there is a solution to the score equation $t(\mathbf{y}) = \nabla \kappa(\theta),$ and it is unique if there is such a solution, $$\hat{\theta}$$, for which $$I(\hat{\theta}) = D^2 \kappa(\hat{\theta})$$ is positive definite. We call $$I(\hat{\theta})$$ the observed Fisher information.

Note that under the independence assumption, $\nabla \log \varphi_j(\theta) = \frac{1}{\varphi_j(\theta)} \int t_j(u) e^{\theta^T t_j(u)} \nu_i(\mathrm{d} u) = E_{\theta}(t_j(Y)),$ which means that the score equation can be expressed as $t(\mathbf{y}) = \sum_{j=1}^m E_{\theta}(t_j(Y)).$ In the Bayesian network setup $$\nabla \log \varphi_j(\theta) = E_{\theta}(t_j(Y) \mid Y_1, \ldots, Y_{j-1}),$$ and the score equation is $t(\mathbf{y}) = \sum_{j=1}^m E_{\theta}(t_j(Y) \mid y_1, \ldots, y_{j-1}),$ which is a bit more complicated as the right-hand-side depends on the observations.

The definition of an exponential family in Section 6.1 encompasses the situation where $$y_1, \ldots, y_m$$ are i.i.d. by taking $$t_j$$ to be independent of $$j$$. In that case, $$\kappa(\theta) = m \kappa_1(\theta)$$, the score equation can be rewritten as $\frac{1}{m} \sum_{j=1}^m t_1(y_j) = \kappa_1(\theta),$ and the Fisher information becomes $I(\hat{\theta}) = m D^2 \kappa_1(\hat{\theta}).$

However, the point of the general formulation is that it includes regression models, and, via the Bayesian networks extension above, models with dependence structures. We could, of course, then have i.i.d. replications $$\mathbf{y}_1, \ldots, \mathbf{y}_n$$ of the entire $$m$$-dimensional vector $$\mathbf{y}$$, and we would get the score equation $\frac{1}{n} \sum_{i=1}^n t(\mathbf{y}_i) = \kappa(\theta),$ and corresponding Fisher information $I(\hat{\theta}) = n D^2 \kappa(\hat{\theta}).$

### 6.1.4 Curved exponential families

A curved exponential family consists of an exponential family together with a $$C^2$$-map $$\theta : \Psi \to \Theta$$ from a set $$\Psi \subseteq \mathbb{R}^q$$. The set $$\Psi$$ provides a parametrization of the subset $$\theta(\Psi) \subseteq \Theta$$ of the full exponential family, and the log-likelihood in the $$\psi$$-parameter is $\ell(\psi) = \theta(\psi)^T t(\mathbf{y}) - \kappa(\theta(\psi)).$ It has gradient $\nabla \ell(\psi) = D \theta(\psi)^T t(\mathbf{y}) - \nabla \kappa(\theta(\psi)) D \theta(\psi)$ and Hessian $D^2 \ell(\psi) = \sum_{k=1}^p D^2\theta_k(\psi) (t(\mathbf{y})_k - \partial_k \kappa(\theta(\psi))) - D \theta(\psi)^T D^2 \kappa(\theta(\psi)) D \theta(\psi).$

### References

Lauritzen, S. L., and D. J. Spiegelhalter. 1988. “Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems.” Journal of the Royal Statistical Society. Series B (Methodological) 50 (2): 157–224.