BIOSTAT M280: Statistical Computing

  • Mon 12pm-1:50pm @ CHS 43-105A
    Wed 1pm-1:50pm @ CHS 43-105A
  • Instructor: Dr. Hua Zhou, huazhou@ucla.edu
  • Multi-listed as BIOSTAT M280, BIOMATH M280, and STAT M230
  • For Biostatistics doctoral students, this course satisfies the requirement for BIOSTAT 257.

What is statistics?

  • Statistics, the science of data analysis, is the applied mathematics of the 21st century.

  • People (scientists, government, health professionals, companies) collect data in order to answer certain questions. A statistician's job is to help them extract knowledge and insights from data.

  • Must-read for (bio)statistics students:

  • If existing software tools readily solve the problem, all the better.

  • Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming).

  • This entails at least two essential skills: programming and fundamental knowledge of algorithms.

  • Two examples: How Gauss became famous and Marc Coram deciphering a jail message.

How Gauss became famous


1801

  • Dr. Carl Friedrich Gauss, 24; proved the Fundamental Theorem of Algebra; wrote the book Disquisitiones Arithmeticae, which is still being studied today.

  • Jan 1-Feb 11 (41 days), astronomer Piazzi observed Ceres (a dwarf planet), which was then lost behind the Sun.

  • Aug-Sep, futile search by top astronomers; Laplace claimed the problem was unsolvable.

  • Oct-Nov, Gauss did calculations by the method of least squares.

  • Dec 31, astronomer von Zach re-located Ceres according to Gauss' calculation.

Aftermath

For more history

Lessons

  • Motivated by real data and a real problem (data science!).

  • Heuristic solution first: method of least squares.

  • Algorithm development: linear algebra, Gaussian elimination, FFT (fast Fourier transform).

  • Solution readily verifiable: Ceres was re-discovered.

  • Theoretical justification (Gaussian distribution, Gauss-Markov theorem) comes much later.

Marc Coram and a jail message

  • A consulting project by Marc Coram (then a graduate student in statistics at Stanford); the client was a professor in political science who wanted to understand a cryptic message circulated in a state prison.
  • Marc modeled the letter sequence by a Markov chain ($26 \times 26$ transition matrix) and estimated transition probabilities from War and Peace.
  • Each candidate mapping $\sigma$ from message symbols to letters then yields a likelihood $f(\sigma)$ of the decoded symbol sequence.
  • Find the $\sigma$ that maximizes $f$. The search space contains at least $26! = 4.0329 \times 10^{26}$ mappings. Combinatorial optimization -- hard!
  • Metropolis algorithm. At each iteration:
    • generate a new $\sigma'$ by random transposition of two letters
    • accept $\sigma'$ with probability $\min \left\{\frac{f(\sigma')}{f(\sigma)}, 1\right\}$
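
The two Metropolis steps above can be sketched in Julia as follows. This is a minimal toy, not Marc Coram's actual code. It assumes the message uses exactly 26 symbols, so that $\sigma$ is a permutation of 1..26, and it writes the likelihood as $f(\sigma) = \prod_t M(\sigma(s_{t-1}), \sigma(s_t))$, where $M$ is the estimated transition matrix and $s_t$ is the symbol sequence. The function names (`letter_indices`, `log_transition_matrix`, `loglik`, `metropolis`) are hypothetical, the short reference string merely stands in for War and Peace, and everything is computed on the log scale to avoid underflow.

```julia
using Random

const ALPHABET = 'a':'z'

# Map text to letter indices 1..26; all other characters are dropped (toy assumption).
letter_indices(text) = [c - 'a' + 1 for c in lowercase(text) if c in ALPHABET]

# Log transition matrix of a first-order Markov chain, with add-one smoothing.
function log_transition_matrix(reference)
    counts = ones(26, 26)
    idx = letter_indices(reference)
    for t in 2:length(idx)
        counts[idx[t-1], idx[t]] += 1
    end
    return log.(counts ./ sum(counts, dims=2))
end

# log f(σ): log-likelihood of the symbol sequence after decoding with σ.
function loglik(σ, symbols, logP)
    ll = 0.0
    for t in 2:length(symbols)
        ll += logP[σ[symbols[t-1]], σ[symbols[t]]]
    end
    return ll
end

function metropolis(symbols, logP; iters=10_000, rng=Random.default_rng())
    σ = collect(1:26)                       # start from the identity mapping
    ll = loglik(σ, symbols, logP)
    best, best_ll = copy(σ), ll
    for _ in 1:iters
        i, j = rand(rng, 1:26), rand(rng, 1:26)
        σ[i], σ[j] = σ[j], σ[i]             # propose σ′ by a random transposition
        ll_new = loglik(σ, symbols, logP)
        if log(rand(rng)) < ll_new - ll     # accept with probability min{f(σ′)/f(σ), 1}
            ll = ll_new
            if ll > best_ll
                best, best_ll = copy(σ), ll
            end
        else
            σ[i], σ[j] = σ[j], σ[i]         # reject: undo the swap
        end
    end
    return best, best_ll
end

# Toy usage: scramble a short message with a random permutation and run the sampler.
rng = MersenneTwister(280)
reference = "statisticians help people extract knowledge and insights from data " *
            "and often they need to implement their own methods and test new algorithms"
logP = log_transition_matrix(reference)
secret = randperm(rng, 26)                  # hypothetical cipher: letter k becomes symbol secret[k]
cipher = secret[letter_indices("statistical computing is fun")]
best, best_ll = metropolis(cipher, logP; iters=5_000, rng=rng)
```

With so little reference text the toy should not be expected to recover the message; the point is only the propose/accept structure. The sampler also tracks the best mapping visited, since the goal here is maximizing $f$ rather than sampling from it.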

Lessons

  • Motivated by a real problem (data science!).

  • Solution readily verifiable: we can read it!

  • Algorithm development: the Metropolis sampler is one of the top 10 algorithms of the 20th century.

  • See the article The Markov chain Monte Carlo revolution by Persi Diaconis (2009) for more details.

What is this course about?

  • Not a course on statistical packages. It does not answer questions such as How to fit a linear mixed model in R, Julia, SAS, SPSS, or Stata?

  • Not a pure programming course, although programming is important and we do homework in Julia.
    The new BIOSTAT 203A (Data Management) in fall quarter focuses on programming in R and SAS.

  • Not a course on data science. The new BIOSTAT 203B (Introduction to Data Science) in winter quarter focuses on software tools for data scientists.

  • This course focuses on algorithms, or numerical methods, in statistics.

  • To quote James Gentle:

    The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

  • For a common numerical task in statistics, say solving the least squares problem $$ \widehat \beta = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y}, $$ we need to know which methods/algorithms are out there and what their advantages and disadvantages are. You will fail this course if you use

    inv(X'X) * X' * y
    

    Using X \ y (or lm(y ~ X) in R) is correct but not the purpose of this course. We want to understand what the computer is doing when we call X \ y.

  • For biostat students, this course satisfies the requirement of BIOSTAT 257 in the new curriculum. Ask Ms. Roxy Naranjo (rlnaranjo@ph.ucla.edu) for the paperwork.
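
Returning to the least squares example above, the following Julia sketch puts a few of the candidate methods side by side on simulated data. The data, the seed, and all variable names are made up for illustration. The comment about what `X \ y` does reflects the documented behavior of Julia's backslash for dense rectangular matrices, as far as we understand it.

```julia
using LinearAlgebra, Random

# Simulated data (illustrative assumption): n observations, p predictors.
rng = MersenneTwister(280)
n, p = 100, 5
X = randn(rng, n, p)
β_true = collect(1.0:p)
y = X * β_true + 0.1 * randn(rng, n)

# 1. Textbook formula with an explicit inverse (the approach discouraged above).
β_inv = inv(X'X) * X' * y

# 2. Normal equations solved via the Cholesky factorization of X'X.
β_chol = cholesky(Symmetric(X'X)) \ (X'y)

# 3. QR factorization of X; avoids forming X'X altogether.
β_qr = qr(X) \ y

# 4. Backslash; for a dense rectangular X this dispatches to a (pivoted) QR solve.
β_bs = X \ y

# Forming X'X squares the condition number, which is why methods 1 and 2
# can lose accuracy on ill-conditioned problems.
κ_X, κ_XtX = cond(X), cond(X'X)
```

On a well-conditioned problem like this one all four estimates agree to many digits; the differences in speed and accuracy only show up for large or ill-conditioned X, which is the kind of question this course is about.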

Course logistics