Due May 5 @ 11:59PM
Consider the numerical task of estimating an $n \times n$ kinship matrix $\Phi$ from an $n \times m$ genotype matrix $\mathbf{X}$. Here $n$ is the number of individuals and $m$ is the number of genetic markers. Lange et al. derived a method of moments estimator of the form $$ \widehat \Phi_{ij} = \frac{e_{ij} - \sum_{k=1}^m [p_k^2 + (1 - p_k)^2]}{m - \sum_{k=1}^m [p_k^2 + (1 - p_k)^2]}, \quad 1 \le i, j \le n, $$ where $$ \begin{eqnarray*} e_{ij} &=& \frac{1}{4} \sum_{k=1}^m [x_{ik} x_{jk} + (2 - x_{ik})(2 - x_{jk})] \\ p_k &=& \frac {1}{2n} \sum_{i=1}^n x_{ik}. \end{eqnarray*} $$
Write a function that takes a genotype matrix X as input and outputs the method of moments estimator:
function kinship(X::Matrix{Float64})
# your code here
return Φ
end
Make your function as efficient as possible.
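A minimal sketch of one possible implementation (not the only route): expanding the product inside $e_{ij}$ gives $e_{ij} = \frac{1}{2} (\mathbf{X}\mathbf{X}^T)_{ij} - \frac{1}{2}(s_i + s_j) + m$, where $s_i = \sum_k x_{ik}$ is the $i$-th row sum. This reduces the whole computation to a single symmetric matrix product plus $O(n^2)$ cleanup:

```julia
function kinship(X::Matrix{Float64})
    n, m = size(X)
    p = vec(sum(X, dims = 1)) ./ (2n)       # allele frequencies p_k
    c = sum(p .^ 2 .+ (1 .- p) .^ 2)        # Σ_k [p_k² + (1 − p_k)²]
    s = vec(sum(X, dims = 2))               # row sums s_i = Σ_k x_ik
    Φ = X * transpose(X)                    # single O(n²m) BLAS product
    @inbounds for j in 1:n, i in 1:n
        e = 0.5 * Φ[i, j] - 0.5 * (s[i] + s[j]) + m
        Φ[i, j] = (e - c) / (m - c)
    end
    return Φ
end
```

The $\mathbf{X}\mathbf{X}^T$ product dominates at $O(n^2 m)$ flops; everything else is $O(nm + n^2)$, and the estimator overwrites the one $n \times n$ matrix allocated by the product.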
In a typical genetic data set, $n$ is on the order of $10^3 \sim 10^5$ and $m$ is on the order of $10^5 \sim 10^6$. Benchmark your function using a smaller data set:
using Random
Random.seed!(280) # seed
X = rand(0.0:2.0, 1000, 10000)
Efficiency (both speed and memory) will be the most important criterion when grading this question.
Show the Sherman-Morrison formula $$ (\mathbf{A} + \mathbf{u} \mathbf{u}^T)^{-1} = \mathbf{A}^{-1} - \frac{1}{1 + \mathbf{u}^T \mathbf{A}^{-1} \mathbf{u}} \mathbf{A}^{-1} \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1}, $$ where $\mathbf{A} \in \mathbb{R}^{n \times n}$ is nonsingular and $\mathbf{u} \in \mathbb{R}^n$. This formula supplies the inverse of a symmetric rank-one perturbation of $\mathbf{A}$.
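One standard verification (a sketch, not the only route) multiplies the claimed inverse by $\mathbf{A} + \mathbf{u} \mathbf{u}^T$ and simplifies, writing $c = \mathbf{u}^T \mathbf{A}^{-1} \mathbf{u}$ for the scalar: $$ (\mathbf{A} + \mathbf{u} \mathbf{u}^T) \left[ \mathbf{A}^{-1} - \frac{1}{1 + c} \mathbf{A}^{-1} \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} \right] = \mathbf{I}_n + \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} - \frac{1}{1 + c} \left( \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} + \mathbf{u} \, c \, \mathbf{u}^T \mathbf{A}^{-1} \right) = \mathbf{I}_n + \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} - \frac{1 + c}{1 + c} \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} = \mathbf{I}_n. $$ The product in the reverse order collapses the same way, so the right-hand side is indeed the inverse.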
Show the Woodbury formula $$ (\mathbf{A} + \mathbf{U} \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{I}_m + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1}, $$ where $\mathbf{A} \in \mathbb{R}^{n \times n}$ is nonsingular, $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times m}$, and $\mathbf{I}_m$ is the $m \times m$ identity matrix. In many applications $m$ is much smaller than $n$. The Woodbury formula generalizes Sherman-Morrison and is valuable because the smaller $m \times m$ matrix $\mathbf{I}_m + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U}$ is cheaper to invert than the larger $n \times n$ matrix $\mathbf{A} + \mathbf{U} \mathbf{V}^T$.
Show the binomial inversion formula $$ (\mathbf{A} + \mathbf{U} \mathbf{B} \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1}, $$ where $\mathbf{A} \in \mathbb{R}^{n \times n}$ and $\mathbf{B} \in \mathbb{R}^{m \times m}$ are nonsingular and $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times m}$.
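The same multiplication check works here (a sketch); the key step is that $\mathbf{U} + \mathbf{U} \mathbf{B} \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U} = \mathbf{U} \mathbf{B} (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})$, so the middle factors cancel: $$ (\mathbf{A} + \mathbf{U} \mathbf{B} \mathbf{V}^T) \left[ \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1} \right] = \mathbf{I}_n + \mathbf{U} \mathbf{B} \mathbf{V}^T \mathbf{A}^{-1} - \mathbf{U} \mathbf{B} (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U}) (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1} = \mathbf{I}_n. $$ Setting $\mathbf{B} = \mathbf{I}_m$ recovers Woodbury, and $m = 1$ recovers Sherman-Morrison.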
Show the identity $$ \text{det}(\mathbf{A} + \mathbf{U} \mathbf{V}^T) = \text{det}(\mathbf{A}) \text{det}(\mathbf{I}_m + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U}). $$ This formula is useful for evaluating the density of a multivariate normal with covariance matrix $\mathbf{A} + \mathbf{U} \mathbf{V}^T$.
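One derivation (a sketch) evaluates a block determinant by its two Schur complements: $$ \det \begin{pmatrix} \mathbf{A} & -\mathbf{U} \\ \mathbf{V}^T & \mathbf{I}_m \end{pmatrix} = \det(\mathbf{A}) \det(\mathbf{I}_m + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U}) = \det(\mathbf{I}_m) \det(\mathbf{A} + \mathbf{U} \mathbf{V}^T), $$ where the first equality takes the Schur complement with respect to $\mathbf{A}$ and the second with respect to $\mathbf{I}_m$; equating the two products gives the identity.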
Consider a mixed effects model $$ y_i = \mathbf{x}_i^T \beta + \mathbf{z}_i^T \gamma + \epsilon_i, \quad i=1,\ldots,n, $$ where $\epsilon_i$ are independent normal errors $N(0,\sigma_0^2)$, $\beta \in \mathbb{R}^p$ are fixed effects, and $\gamma \in \mathbb{R}^q$ are random effects assumed to be $N(\mathbf{0}_q, \sigma_1^2 \mathbf{I}_q)$ independent of $\epsilon_i$.
Show that $$ \mathbf{y} \sim N \left( \mathbf{X} \beta, \sigma_0^2 \mathbf{I}_n + \sigma_1^2 \mathbf{Z} \mathbf{Z}^T \right), $$ where $\mathbf{y} \in \mathbb{R}^n$, $\mathbf{X} \in \mathbb{R}^{n \times p}$, and $\mathbf{Z} \in \mathbb{R}^{n \times q}$.
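One line of reasoning (a sketch): stacking the $n$ equations gives $\mathbf{y} = \mathbf{X} \beta + \mathbf{Z} \gamma + \epsilon$, an affine function of the jointly Gaussian vector $(\gamma, \epsilon)$ and hence Gaussian, with $$ \mathbf{E} \mathbf{y} = \mathbf{X} \beta, \quad \text{Cov}(\mathbf{y}) = \mathbf{Z} (\sigma_1^2 \mathbf{I}_q) \mathbf{Z}^T + \sigma_0^2 \mathbf{I}_n = \sigma_0^2 \mathbf{I}_n + \sigma_1^2 \mathbf{Z} \mathbf{Z}^T, $$ using the independence of $\gamma$ and $\epsilon$.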
Write a function, with interface
logpdf_mvn(y, Z, σ0, σ1),
that evaluates the log-density of a multivariate normal with mean $\mathbf{0}$ and covariance $\sigma_0^2 \mathbf{I} + \sigma_1^2 \mathbf{Z} \mathbf{Z}^T$ at $\mathbf{y}$. Make your code efficient in the $n \gg q$ case.
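A hedged sketch of one possible implementation, exploiting the Woodbury and determinant identities above: with $c = \sigma_1^2/\sigma_0^2$ and $\mathbf{M} = \mathbf{I}_q + c \, \mathbf{Z}^T \mathbf{Z}$, we get $\mathbf{y}^T \Sigma^{-1} \mathbf{y} = \left[ \mathbf{y}^T \mathbf{y} - c \, (\mathbf{Z}^T \mathbf{y})^T \mathbf{M}^{-1} (\mathbf{Z}^T \mathbf{y}) \right] / \sigma_0^2$ and $\log \det \Sigma = n \log \sigma_0^2 + \log \det \mathbf{M}$, so only a $q \times q$ Cholesky factorization is needed:

```julia
using LinearAlgebra

function logpdf_mvn(y::Vector{Float64}, Z::Matrix{Float64}, σ0::Real, σ1::Real)
    n, q = size(Z)
    c = σ1^2 / σ0^2
    M = cholesky(Symmetric(c * (transpose(Z) * Z) + I))  # M = I_q + c ZᵀZ
    Zty = transpose(Z) * y
    # Woodbury: yᵀ Σ⁻¹ y = (yᵀy − c (Zᵀy)ᵀ M⁻¹ (Zᵀy)) / σ0²
    quad = (dot(y, y) - c * dot(Zty, M \ Zty)) / σ0^2
    # determinant identity: log det Σ = n log σ0² + log det M
    ldet = n * log(σ0^2) + logdet(M)
    return -0.5 * (n * log(2π) + ldet + quad)
end
```

The cost is $O(nq^2)$ instead of the $O(n^3)$ of factorizing the full $n \times n$ covariance, a large saving when $n \gg q$.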
Compare your result (both accuracy and timing) to the Distributions.jl package using the following data.
using BenchmarkTools, Distributions, LinearAlgebra, Random
Random.seed!(280)
n, q = 2000, 10
Z = randn(n, q)
σ0, σ1 = 0.5, 2.0
Σ = σ1^2 * Z * transpose(Z) + σ0^2 * I
mvn = MvNormal(Σ) # MVN(0, Σ)
y = rand(mvn) # generate one instance from MVN(0, Σ)
# check your answer matches that from Distributions.jl
@show logpdf_mvn(y, Z, σ0, σ1)
@show logpdf(mvn, y)
# benchmark
@benchmark logpdf_mvn($y, $Z, $σ0, $σ1)
@benchmark logpdf($mvn, $y)