Due June 9 @ 11:59PM (extended from June 7)
Consider again the MLE of the Dirichlet-multinomial model. In HW5, we worked out Newton's method. In this homework, we explore the MM and EM approaches.
Show that, given iid observations $\mathbf{x}_1,\ldots,\mathbf{x}_n$, the log-likelihood can be written as $$ L(\alpha) = \sum_{i=1}^n \ln \binom{|\mathbf{x}_i|}{\mathbf{x}_i} +\sum_{i=1}^n \sum_{j=1}^d \sum_{k=0}^{x_{ij}-1} \ln(\alpha_j+k) - \sum_{i=1}^n \sum_{k=0}^{|\mathbf{x}_i|-1} \ln(|\alpha|+k). $$ Hint: $\Gamma(a + k) / \Gamma(a) = a (a + 1) \cdots (a+k-1)$.
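Though not required, this identity is easy to check numerically for a single observation by comparing the gamma-function form of the log-density with the expanded sum; the sketch below uses `loggamma` from SpecialFunctions.jl and toy values of our own choosing.

```julia
using SpecialFunctions  # loggamma

x, α = [3, 0, 2], [0.5, 1.0, 2.0]    # toy observation and parameter
m, A = sum(x), sum(α)
# gamma-function form: ln binom + ln[Γ(|α|)/Γ(|α|+m)] + Σ_j ln[Γ(α_j+x_j)/Γ(α_j)]
lhs = loggamma(m + 1) - sum(loggamma, x .+ 1) +
    loggamma(A) - loggamma(A + m) +
    sum(loggamma.(α .+ x) .- loggamma.(α))
# expanded form, using Γ(a+k)/Γ(a) = a(a+1)⋯(a+k-1)
rhs = loggamma(m + 1) - sum(loggamma, x .+ 1) +
    sum(log(α[j] + k) for j in eachindex(x) for k in 0:x[j]-1) -
    sum(log(A + k) for k in 0:m-1)
@show lhs ≈ rhs    # true
```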
Suppose $(P_1,\ldots,P_d) \in \Delta_d = \{\mathbf{p}: p_i \ge 0, \sum_i p_i = 1\}$ follows a Dirichlet distribution with parameter $\alpha = (\alpha_1,\ldots,\alpha_d)$. Show that $$ \mathbf{E}(\ln P_j) = \Psi(\alpha_j) - \Psi(|\alpha|), $$ where $\Psi(z) = \Gamma'(z)/\Gamma(z)$ is the digamma function and $|\alpha| = \sum_{j=1}^d \alpha_j$. Hint: Differentiate the identity $$ 1 = \int_{\Delta_d} \frac{\Gamma(|\alpha|)}{\prod_{j=1}^d \Gamma(\alpha_j)} \prod_{j=1}^d p_j^{\alpha_j-1} \, d\mathbf{p}. $$
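Again optional, a Monte Carlo sanity check of this identity, assuming Distributions.jl for Dirichlet sampling and SpecialFunctions.jl for `digamma`:

```julia
using Distributions, SpecialFunctions, Statistics

α = [0.5, 1.0, 2.0]
P = rand(Dirichlet(α), 10^6)              # each column is one draw from Δ_3
mc    = vec(mean(log.(P), dims = 2))      # Monte Carlo estimate of E(ln P_j)
exact = digamma.(α) .- digamma(sum(α))
@show maximum(abs.(mc .- exact))          # small (Monte Carlo error)
```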
The admixture representation of the Dirichlet-multinomial distribution suggests that we can treat the unobserved multinomial parameters $\mathbf{p}_1,\ldots,\mathbf{p}_n$ as missing data and derive an EM algorithm. Show that the Q function is $$ Q(\alpha|\alpha^{(t)}) = \sum_{j=1}^d \sum_{i=1}^n \alpha_j \left[\Psi(x_{ij}+\alpha_j^{(t)}) - \Psi(|\mathbf{x}_i|+|\alpha^{(t)}|)\right] - n \sum_{j=1}^d \ln \Gamma(\alpha_j) + n \ln \Gamma(|\alpha|) + c^{(t)}, $$ where $c^{(t)}$ is a constant irrelevant to optimization. Explain why it is not easy to maximize the Q function.
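For experimentation, here is one direct (unoptimized) way to evaluate $Q$ up to the constant $c^{(t)}$ in code; the function name and layout are our own, and SpecialFunctions.jl supplies `digamma` and `loggamma`.

```julia
using SpecialFunctions  # digamma, loggamma

# Q(α | αt) up to the constant c^{(t)}; X is d-by-n, one observation per column
function dirmult_Q(X::AbstractMatrix, α::Vector, αt::Vector)
    d, n = size(X)
    # posterior expectations E[ln p_ij] under the Dirichlet(αt + x_i) posterior
    w = [digamma(X[j, i] + αt[j]) - digamma(sum(X[:, i]) + sum(αt))
         for j in 1:d, i in 1:n]
    return sum(α[j] * w[j, i] for j in 1:d, i in 1:n) -
        n * sum(loggamma, α) + n * loggamma(sum(α))
end
```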
We derive an MM algorithm for maximizing $L$. Consider the formulation of the log-likelihood that contains the terms $\ln (\alpha_j + k)$ and $- \ln (|\alpha|+k)$. By applying Jensen's inequality to the concave terms $\ln (\alpha_j + k)$ and the supporting hyperplane inequality to the convex terms $- \ln (|\alpha|+k)$, show that a minorizing function of $L(\alpha)$ is $$ g(\alpha|\alpha^{(t)}) = - \sum_{k=0}^{\max_i |\mathbf{x}_i|-1} \frac{r_k}{|\alpha^{(t)}|+k} |\alpha| + \sum_{j=1}^d \sum_{k=0}^{\max_i x_{ij}-1} \frac{s_{jk} \alpha_j^{(t)}}{\alpha_j^{(t)}+k} \ln \alpha_j + c^{(t)}, $$ where $s_{jk} = \sum_{i=1}^n 1_{\{x_{ij} > k\}}$, $r_k = \sum_{i=1}^n 1_{\{|\mathbf{x}_i| > k\}}$, and $c^{(t)}$ is a constant irrelevant to optimization. Maximizing the surrogate function $g(\alpha|\alpha^{(t)})$ is trivial since the $\alpha_j$ are separated. Show that the MM updates are $$ \alpha_j^{(t+1)} = \frac{\sum_{k=0}^{\max_i x_{ij}-1} \frac{s_{jk}}{\alpha_j^{(t)}+k}}{\sum_{k=0}^{\max_i |\mathbf{x}_i|-1} \frac{r_k}{|\alpha^{(t)}|+k}} \alpha_j^{(t)}, \quad j=1,\ldots,d. $$ The quantities $s_{jk}$, $r_k$, $\max_i x_{ij}$, and $\max_i |\mathbf{x}_i|$ depend only on the data and can be pre-computed. Comment on whether the MM updates respect the parameter constraint $\alpha_j>0$.
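A sketch of one MM update on toy data, with variable names of our own choosing; note that summing over $k = 0, \ldots, \max_i |\mathbf{x}_i| - 1$ in the numerator as well is harmless, since $s_{jk} = 0$ once $k \ge \max_i x_{ij}$.

```julia
X = [3 0 2; 1 4 0; 2 2 5]               # toy d-by-n counts, one column per point
d, n  = size(X)
batch = vec(sum(X, dims = 1))           # |x_i| for each observation
kmax  = maximum(batch)                  # max_i |x_i|
# pre-computable data summaries
s = [count(>(k), X[j, :]) for j in 1:d, k in 0:kmax-1]    # s_{jk}
r = [count(>(k), batch) for k in 0:kmax-1]                # r_k
# one MM update from the current iterate αt
αt  = ones(d)
den = sum(r[k+1] / (sum(αt) + k) for k in 0:kmax-1)
αt1 = [αt[j] * sum(s[j, k+1] / (αt[j] + k) for k in 0:kmax-1) / den for j in 1:d]
```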
Write a function that finds the MLE of the Dirichlet-multinomial distribution given iid observations $\mathbf{x}_1,\ldots,\mathbf{x}_n$, using the MM algorithm.
"""
dirmult_mm(X)
Find the MLE of the Dirichlet-multinomial distribution using the MM algorithm.

# Argument
* `X`: a `d`-by-`n` matrix of counts; each column is one data point.

# Optional arguments
* `α0`: starting point (defaults to the method-of-moments estimate `dirmult_mom(X)`).
* `maxiters`: the maximum number of MM iterations (default 100).
* `tolfun`: the tolerance for relative change in objective values (default 1e-6).

# Output
* `logl`: the log-likelihood at the MLE.
* `niter`: the number of iterations performed.
* `α`: the MLE.
* `∇`: the gradient at the MLE.
* `obsinfo`: the observed information matrix at the MLE.
"""
function dirmult_mm(
    X::AbstractMatrix;
    α0::Vector = dirmult_mom(X),
    maxiters::Int = 100,
    tolfun = 1e-6
    )
    # your code here
end
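Once the body is filled in, a call might look like the sketch below, which simulates Dirichlet-multinomial data with Distributions.jl and assumes the output order listed in the docstring:

```julia
using Distributions, Random

Random.seed!(123)
αtrue = [1.0, 2.0, 3.0]
# 100 observations with batch size 20, one per column
X = hcat([rand(Multinomial(20, rand(Dirichlet(αtrue)))) for _ in 1:100]...)
logl, niter, αhat, ∇, obsinfo = dirmult_mm(X)
```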
Finally, let us reconsider the EM algorithm. The difficulty with the M step of the EM algorithm can be remedied. Discuss how we can further minorize the $\ln \Gamma(|\alpha|)$ term in the $Q$ function to produce a minorizing function in which all $\alpha_j$ are separated. For this homework, you do not need to implement this EM-MM hybrid algorithm. Hint: $z \mapsto \ln \Gamma(z)$ is a convex function.
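To expand the hint slightly: the supporting hyperplane inequality applied to the convex function $z \mapsto \ln \Gamma(z)$ yields $$ \ln \Gamma(|\alpha|) \ge \ln \Gamma(|\alpha^{(t)}|) + \Psi(|\alpha^{(t)}|) \left( |\alpha| - |\alpha^{(t)}| \right), $$ whose right-hand side is linear in $|\alpha| = \sum_{j=1}^d \alpha_j$.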