BIOSTAT M280: Statistical Computing

  • Mon 12pm-1:50pm @ CHS 61-262
    Wed 1pm-1:50pm @ CHS 71-257
  • Instructor: Dr. Hua Zhou, huazhou@ucla.edu
  • Multi-listed as BIOSTAT M280, BIOMATH M280, and STAT M230

What is statistics?

  • Statistics, the science of data analysis, is the applied mathematics in the 21st century.

  • People (scientists, government, health professionals, companies) collect data in order to answer certain questions. The statistician's job is to help them extract knowledge and insights from data.

  • Two must-reads for (bio)statistics students:

  • If existing software tools readily solve the problem, all the better.

  • Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming).

  • This entails at least two essential skills: programming and fundamental knowledge of algorithms.

  • Two examples: How Gauss became famous and Marc Coram deciphering a jail message.

How Gauss became famous


1801

  • Dr. Carl Friedrich Gauss, 24; proved the Fundamental Theorem of Algebra; wrote the book Disquisitiones Arithmeticae, which is still being studied today.

  • Jan 1-Feb 11 (41 days), astronomer Piazzi observed Ceres (a dwarf planet), which was then lost behind the Sun.

  • Aug-Sep, futile search by top astronomers; Laplace claimed the problem was unsolvable.

  • Oct-Nov, Gauss did calculations by the method of least squares.

  • Dec 31, astronomer von Zach re-located Ceres according to Gauss' calculation.

Aftermath

For more history

Lessons

  • Motivated by real data and a real problem (data science!).

  • Heuristic solution first: method of least squares.

  • Algorithm development: linear algebra, Gaussian elimination, FFT (fast Fourier transform).

  • Solution readily verifiable: Ceres was re-discovered.

  • Theoretical justification (Gaussian distribution, Gauss-Markov theorem) comes much later.

Marc Coram and a jail message

  • A consulting project by Marc Coram (then a graduate student in statistics at Stanford); the client was a professor in political science who wanted to understand a cryptic message circulated in a state prison.
  • Marc modeled the letter sequence by a Markov chain ($26 \times 26$ transition matrix) and estimated transition probabilities from War and Peace.
  • Each mapping $\sigma$ from the coded symbols to letters then yields a likelihood $f(\sigma)$ of the decoded symbol sequence under the Markov chain.
  • Find the $\sigma$ that maximizes $f$. The search space has at least $26! \approx 4.0329 \times 10^{26}$ mappings. This is combinatorial optimization, which is hard!
  • Metropolis algorithm: At each iteration:
    • generate a new $\sigma'$ by random transposition of two letters
    • accept $\sigma'$ with probability $\min \left\{\frac{f(\sigma')}{f(\sigma)}, 1\right\}$
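
The steps above can be sketched in Julia as follows. This is a minimal illustration, not Marc Coram's actual code: the function names (`transition_logprobs`, `loglik`, `metropolis`), the add-one smoothing of the transition counts, the integer coding of letters, and the 5-letter toy data standing in for War and Peace are all assumptions made here. The acceptance ratio is computed on the log scale to avoid underflow in $f$.

```julia
using Random

# Estimate log transition probabilities from a reference sequence of
# integer-coded letters (1..K). Add-one smoothing is an assumption made here
# so that unseen transitions do not get probability zero.
function transition_logprobs(ref::Vector{Int}, K::Int)
    counts = ones(K, K)
    for t in 2:length(ref)
        counts[ref[t-1], ref[t]] += 1
    end
    return log.(counts ./ sum(counts, dims = 2))
end

# log f(σ): log-likelihood of the message when symbol s is read as letter σ[s].
function loglik(msg::Vector{Int}, σ::Vector{Int}, logP::Matrix{Float64})
    ll = 0.0
    for t in 2:length(msg)
        ll += logP[σ[msg[t-1]], σ[msg[t]]]
    end
    return ll
end

# Metropolis search over mappings σ using random transpositions.
# Returns the best mapping visited and its log-likelihood.
function metropolis(msg::Vector{Int}, logP::Matrix{Float64}, niter::Int;
                    rng::AbstractRNG = Random.default_rng())
    K = size(logP, 1)
    σ = randperm(rng, K)                     # random starting mapping
    ll = loglik(msg, σ, logP)
    best, bestll = copy(σ), ll
    for _ in 1:niter
        i, j = rand(rng, 1:K), rand(rng, 1:K)
        σ[i], σ[j] = σ[j], σ[i]              # propose σ' by swapping two letters
        llnew = loglik(msg, σ, logP)
        # accept with probability min{f(σ')/f(σ), 1}, on the log scale
        if log(rand(rng)) < llnew - ll
            ll = llnew
            if ll > bestll
                best, bestll = copy(σ), ll
            end
        else
            σ[i], σ[j] = σ[j], σ[i]          # reject: undo the swap
        end
    end
    return best, bestll
end

# Toy usage with a 5-letter alphabet (hypothetical data, not a real text).
rng = MersenneTwister(280)
K = 5
ref = rand(rng, 1:K, 2_000)                  # stand-in for the reference text
logP = transition_logprobs(ref, K)
msg = rand(rng, 1:K, 200)                    # stand-in for the coded message
best, bestll = metropolis(msg, logP, 5_000; rng = rng)
```

Because the random transposition proposal is symmetric, the acceptance probability needs only the ratio $f(\sigma')/f(\sigma)$. The sketch also tracks the best mapping visited, since the goal here is to maximize $f$ rather than to sample from it.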

Lessons

  • Motivated by a real problem (data science!).

  • Solution readily verifiable: we can read it!

  • Algorithm development: the Metropolis sampler is one of the top 10 algorithms of the 20th century.

  • See the article The Markov chain Monte Carlo revolution by Persi Diaconis (2009) for more details.

What is this course about?

  • Not a course on packages for data analysis. It does not answer questions such as "How do I fit a linear mixed model in R, Julia, SAS, SPSS, or Stata?"

  • Not a programming course, although programming is extremely important and we do homework in Julia.
    The new BIOSTAT 203A (Data Management) in fall quarter and 203B (Introduction to Data Science) in spring quarter will focus more on programming.

  • This course focuses on algorithms, that is, numerical methods in statistics.

  • To quote James Gentle

    The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

  • For a common numerical task in statistics, say solving the least squares problem $$ \widehat \beta = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y}, $$ we need to know which methods and algorithms are available and what their advantages and disadvantages are. You will fail this course if you use

    inv(X'X) * X'y
    

    Using X \ y (or lm(y ~ X) in R) is correct but not the purpose of this course. We want to understand what the computer is doing when we call X \ y.
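
As a rough illustration of the alternatives, the sketch below solves one small least squares problem four ways. The simulated data, the sizes `n` and `p`, and the variable names are assumptions made for this example. The response is noiseless so that the exact solution is known and the errors can be compared directly.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(280)
n, p = 100, 5
X = randn(rng, n, p)                       # simulated, well-conditioned design
β = collect(1.0:p)                         # true coefficients (assumed for the demo)
y = X * β                                  # noiseless, so the LS solution is β

β_inv  = inv(X'X) * X'y                    # explicit inverse: the approach to avoid
β_chol = cholesky(Symmetric(X'X)) \ (X'y)  # normal equations via Cholesky
β_qr   = qr(X) \ y                         # QR factorization of X itself
β_bs   = X \ y                             # backslash: pivoted QR for rectangular dense X

# Forming X'X squares the condition number: cond(X'X) = cond(X)^2.
κ, κ2 = cond(X), cond(X'X)
println("cond(X) = ", κ, ", cond(X'X) = ", κ2)

for (name, b) in [("inv", β_inv), ("cholesky", β_chol), ("qr", β_qr), ("backslash", β_bs)]
    println(rpad(name, 10), " error = ", norm(b - β))
end
```

On a well-conditioned design like this one, all four answers agree to near machine precision. The differences show up when X is ill-conditioned: methods that form X'X work with the squared condition number, and the explicit inverse adds extra work and rounding on top of that.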

Course logistics
