• Q1: What’s the notation $\binom{|\boldsymbol{x}|}{\boldsymbol{x}}$?
    This is the multinomial coefficient
    \(\binom{|\boldsymbol{x}|}{\boldsymbol{x}} = \binom{|\boldsymbol{x}|}{x_1 \, \cdots \, x_d} = \frac{(\sum_{i=1}^d x_i)!}{x_1! \cdots x_d!}.\)

  • Q2: Concavity of the log-likelihood function?
    For the concavity of the log-likelihood, we observe that it is some log terms (concave) minus some other log terms (concave). We would suspect it is not concave in general. To show that something is not concave, it suffices to give one counter example. Q7 gives some hints when the observed information matrix is not pd. Later when you do Q10, if you purposely start your algorithm from a bad starting point, often you’ll find that the observed information matrix is not pd at the first few iterates. That also suffices as a proof the log-likelihood is not concave.

  • Q7: What structure does the observed information matrix have?
    Some results you showed in HW2, e.g., the Sherman-Morrison formula Q3(a) and determinant formula Q3(d) are particularly helpful here.

  • Q8: The algebra for method of moment estimator (MoM) is terrible :(
    It may be easier to work on the marginal moments of the Dirichlet distribution. Keep in mind we are not deriving the exact MoM; we only need a reasonably good starting point. Some heuristics may work well.

  • Q10: Some columns of the training data are all zeros, causing trouble in Newton’s algorithm.
    If the $j$-th column of data is all 0, then we can show (how?) the corresponding “MLE” has $\alpha_j = 0$. It’s one example the parameter domain is not compact and the MLE is not attained within parameter domain. We may just remove those zero columns before running the Newton-Raphson algorithm.

  • Q10: How to make sure Newton’s iterates almost remain in the parameter domain (positive orthant)?
    Before the line search loop, I will try to make sure the initial step length won’t make any coordinate negative.