Due May 5 @ 11:59PM
Consider the numerical task of estimating an $n \times n$ kinship matrix $\Phi$ from an $n \times m$ genotype matrix $\mathbf{X}$. Here $n$ is the number of individuals and $m$ is the number of genetic markers. Lange et al. derived a method of moments estimator of the form $$ \widehat \Phi_{ij} = \frac{e_{ij} - \sum_{k=1}^m [p_k^2 + (1 - p_k)^2]}{m - \sum_{k=1}^m [p_k^2 + (1 - p_k)^2]}, \quad 1 \le i, j \le n, $$ where $$ \begin{eqnarray*} e_{ij} &=& \frac{1}{4} \sum_{k=1}^m [x_{ik} x_{jk} + (2 - x_{ik})(2 - x_{jk})] \\ p_k &=& \frac {1}{2n} \sum_{i=1}^n x_{ik}. \end{eqnarray*} $$
Write a function that takes a genotype matrix X as input and outputs the method of moments estimator:
function kinship(X::Matrix{Float64})
# your code here
return Φ
end
Make your function as efficient as possible.
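A minimal sketch of one possible implementation (not the only route): expanding the product inside $e_{ij}$ gives $e_{ij} = \frac{1}{2} (\mathbf{X}\mathbf{X}^T)_{ij} - \frac{1}{2}(s_i + s_j) + m$, where $s_i = \sum_k x_{ik}$ is the $i$-th row sum. This reduces the whole computation to a single symmetric matrix product plus $O(n^2)$ cleanup:

```julia
function kinship(X::Matrix{Float64})
    n, m = size(X)
    p = vec(sum(X, dims = 1)) ./ (2n)       # allele frequencies p_k
    c = sum(p .^ 2 .+ (1 .- p) .^ 2)        # Σ_k [p_k² + (1 − p_k)²]
    s = vec(sum(X, dims = 2))               # row sums s_i = Σ_k x_ik
    Φ = X * transpose(X)                    # single O(n²m) BLAS product
    @inbounds for j in 1:n, i in 1:n
        e = 0.5 * Φ[i, j] - 0.5 * (s[i] + s[j]) + m
        Φ[i, j] = (e - c) / (m - c)
    end
    return Φ
end
```

The $\mathbf{X}\mathbf{X}^T$ product dominates at $O(n^2 m)$ flops; everything else is $O(nm + n^2)$, and the estimator overwrites the one $n \times n$ matrix allocated by the product.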
In a typical genetic data set, $n$ is on the order of $10^3 \sim 10^5$ and $m$ is on the order of $10^5 \sim 10^6$. Benchmark your function using a smaller data set:
using Random
Random.seed!(280) # seed
X = rand(0.0:2.0, 1000, 10000)
Efficiency (both speed and memory) will be the most important criterion when grading this question.
Show the Sherman-Morrison formula $$ (\mathbf{A} + \mathbf{u} \mathbf{u}^T)^{-1} = \mathbf{A}^{-1} - \frac{1}{1 + \mathbf{u}^T \mathbf{A}^{-1} \mathbf{u}} \mathbf{A}^{-1} \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1}, $$ where $\mathbf{A} \in \mathbb{R}^{n \times n}$ is nonsingular and $\mathbf{u} \in \mathbb{R}^n$. This formula supplies the inverse of a symmetric rank-one perturbation of $\mathbf{A}$.
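One standard verification (a sketch, not the only route) multiplies the claimed inverse by $\mathbf{A} + \mathbf{u} \mathbf{u}^T$ and simplifies, writing $c = \mathbf{u}^T \mathbf{A}^{-1} \mathbf{u}$ for the scalar: $$ (\mathbf{A} + \mathbf{u} \mathbf{u}^T) \left[ \mathbf{A}^{-1} - \frac{1}{1 + c} \mathbf{A}^{-1} \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} \right] = \mathbf{I}_n + \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} - \frac{1}{1 + c} \left( \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} + \mathbf{u} \, c \, \mathbf{u}^T \mathbf{A}^{-1} \right) = \mathbf{I}_n + \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} - \frac{1 + c}{1 + c} \mathbf{u} \mathbf{u}^T \mathbf{A}^{-1} = \mathbf{I}_n. $$ The product in the reverse order collapses the same way, so the right-hand side is indeed the inverse.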
Show the Woodbury formula $$ (\mathbf{A} + \mathbf{U} \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{I}_m + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1}, $$ where $\mathbf{A} \in \mathbb{R}^{n \times n}$ is nonsingular, $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times m}$, and $\mathbf{I}_m$ is the $m \times m$ identity matrix. In many applications $m$ is much smaller than $n$. The Woodbury formula generalizes Sherman-Morrison and is valuable because the smaller $m \times m$ matrix $\mathbf{I}_m + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U}$ is cheaper to invert than the larger $n \times n$ matrix $\mathbf{A} + \mathbf{U} \mathbf{V}^T$.
Show the binomial inversion formula $$ (\mathbf{A} + \mathbf{U} \mathbf{B} \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1}, $$ where $\mathbf{A} \in \mathbb{R}^{n \times n}$ and $\mathbf{B} \in \mathbb{R}^{m \times m}$ are nonsingular and $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times m}$.
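The same multiplication check works here (a sketch); the key step is that $\mathbf{U} + \mathbf{U} \mathbf{B} \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U} = \mathbf{U} \mathbf{B} (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})$, so the middle factors cancel: $$ (\mathbf{A} + \mathbf{U} \mathbf{B} \mathbf{V}^T) \left[ \mathbf{A}^{-1} - \mathbf{A}^{-1} \mathbf{U} (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1} \right] = \mathbf{I}_n + \mathbf{U} \mathbf{B} \mathbf{V}^T \mathbf{A}^{-1} - \mathbf{U} \mathbf{B} (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U}) (\mathbf{B}^{-1} + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U})^{-1} \mathbf{V}^T \mathbf{A}^{-1} = \mathbf{I}_n. $$ Setting $\mathbf{B} = \mathbf{I}_m$ recovers Woodbury, and $m = 1$ recovers Sherman-Morrison.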
Show the identity $$ \text{det}(\mathbf{A} + \mathbf{U} \mathbf{V}^T) = \text{det}(\mathbf{A}) \text{det}(\mathbf{I}_m + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U}). $$ This formula is useful for evaluating the density of a multivariate normal with covariance matrix $\mathbf{A} + \mathbf{U} \mathbf{V}^T$.
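One derivation (a sketch) evaluates a block determinant by its two Schur complements: $$ \det \begin{pmatrix} \mathbf{A} & -\mathbf{U} \\ \mathbf{V}^T & \mathbf{I}_m \end{pmatrix} = \det(\mathbf{A}) \det(\mathbf{I}_m + \mathbf{V}^T \mathbf{A}^{-1} \mathbf{U}) = \det(\mathbf{I}_m) \det(\mathbf{A} + \mathbf{U} \mathbf{V}^T), $$ where the first equality takes the Schur complement with respect to $\mathbf{A}$ and the second with respect to $\mathbf{I}_m$; equating the two products gives the identity.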
Consider a mixed effects model $$ y_i = \mathbf{x}_i^T \beta + \mathbf{z}_i^T \gamma + \epsilon_i, \quad i=1,\ldots,n, $$ where $\epsilon_i$ are independent normal errors $N(0,\sigma_0^2)$, $\beta \in \mathbb{R}^p$ are fixed effects, and $\gamma \in \mathbb{R}^q$ are random effects assumed to be $N(\mathbf{0}_q, \sigma_1^2 \mathbf{I}_q)$ independent of $\epsilon_i$.
Show that $$ \mathbf{y} \sim N \left( \mathbf{X} \beta, \sigma_0^2 \mathbf{I}_n + \sigma_1^2 \mathbf{Z} \mathbf{Z}^T \right), $$ where $\mathbf{y} \in \mathbb{R}^n$, $\mathbf{X} \in \mathbb{R}^{n \times p}$, and $\mathbf{Z} \in \mathbb{R}^{n \times q}$.
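One line of reasoning (a sketch): stacking the $n$ equations gives $\mathbf{y} = \mathbf{X} \beta + \mathbf{Z} \gamma + \epsilon$, an affine function of the jointly Gaussian vector $(\gamma, \epsilon)$ and hence Gaussian, with $$ \mathbf{E} \mathbf{y} = \mathbf{X} \beta, \quad \text{Cov}(\mathbf{y}) = \mathbf{Z} (\sigma_1^2 \mathbf{I}_q) \mathbf{Z}^T + \sigma_0^2 \mathbf{I}_n = \sigma_0^2 \mathbf{I}_n + \sigma_1^2 \mathbf{Z} \mathbf{Z}^T, $$ using the independence of $\gamma$ and $\epsilon$.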
Write a function, with interface
logpdf_mvn(y, Z, σ0, σ1),
that evaluates the log-density of a multivariate normal with mean $\mathbf{0}$ and covariance $\sigma_0^2 \mathbf{I} + \sigma_1^2 \mathbf{Z} \mathbf{Z}^T$ at $\mathbf{y}$. Make your code efficient in the $n \gg q$ case.
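A hedged sketch of one possible implementation, exploiting the Woodbury and determinant identities above: with $c = \sigma_1^2/\sigma_0^2$ and $\mathbf{M} = \mathbf{I}_q + c \, \mathbf{Z}^T \mathbf{Z}$, we get $\mathbf{y}^T \Sigma^{-1} \mathbf{y} = \left[ \mathbf{y}^T \mathbf{y} - c \, (\mathbf{Z}^T \mathbf{y})^T \mathbf{M}^{-1} (\mathbf{Z}^T \mathbf{y}) \right] / \sigma_0^2$ and $\log \det \Sigma = n \log \sigma_0^2 + \log \det \mathbf{M}$, so only a $q \times q$ Cholesky factorization is needed:

```julia
using LinearAlgebra

function logpdf_mvn(y::Vector{Float64}, Z::Matrix{Float64}, σ0::Real, σ1::Real)
    n, q = size(Z)
    c = σ1^2 / σ0^2
    M = cholesky(Symmetric(c * (transpose(Z) * Z) + I))  # M = I_q + c ZᵀZ
    Zty = transpose(Z) * y
    # Woodbury: yᵀ Σ⁻¹ y = (yᵀy − c (Zᵀy)ᵀ M⁻¹ (Zᵀy)) / σ0²
    quad = (dot(y, y) - c * dot(Zty, M \ Zty)) / σ0^2
    # determinant identity: log det Σ = n log σ0² + log det M
    ldet = n * log(σ0^2) + logdet(M)
    return -0.5 * (n * log(2π) + ldet + quad)
end
```

The cost is $O(nq^2)$ instead of the $O(n^3)$ of factorizing the full $n \times n$ covariance, a large saving when $n \gg q$.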
Compare your result (both accuracy and timing) to the Distributions.jl package using the following data.
using BenchmarkTools, Distributions, LinearAlgebra, Random
Random.seed!(280)
n, q = 2000, 10
Z = randn(n, q)
σ0, σ1 = 0.5, 2.0
Σ = σ1^2 * Z * transpose(Z) + σ0^2 * I
mvn = MvNormal(Σ) # MVN(0, Σ)
y = rand(mvn) # generate one instance from MVN(0, Σ)
# check your answer matches that from Distributions.jl
@show logpdf_mvn(y, Z, σ0, σ1)
@show logpdf(mvn, y)
# benchmark
@benchmark logpdf_mvn($y, $Z, $σ0, $σ1)
@benchmark logpdf($mvn, $y)