BIOSTAT M280: Statistical Computing

Mon 12pm-1:50pm @ CHS 43-105A
Wed 1pm-1:50pm @ CHS 43-105A
Instructor: Dr. Hua Zhou, huazhou@ucla.edu
Multi-listed as BIOSTAT M280, BIOMATH M280, and STAT M230
For Biostatistics doctoral students, this course satisfies the requirement for BIOSTAT 257.

What is statistics?

Statistics, the science of data analysis, is the applied mathematics of the 21st century.
People (scientists, government, health professionals, companies) collect data in order to answer certain questions. A statistician's job is to help them extract knowledge and insights from data.
Must-read for (bio)statistics students:
- 50 Years of Data Science, by David Donoho.
If existing software tools readily solve the problem, all the better.
Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming).
This entails at least two essential skills: programming and fundamental knowledge of algorithms.
Two examples: How Gauss became famous and Marc Coram deciphering a jail message.

Dr. Carl Friedrich Gauss, 24; proved the Fundamental Theorem of Algebra; wrote the book Disquisitiones Arithmeticae, which is still being studied today.
Jan 1-Feb 11, 1801 (41 days), astronomer Piazzi observed Ceres (a dwarf planet), which was then lost behind the Sun.
Aug-Sep, futile search by top astronomers; Laplace claimed it unsolvable.
Oct-Nov, Gauss did calculations by the method of least squares.
Dec 31, astronomer von Zach re-located Ceres according to Gauss' calculation.

1802, Summarische Übersicht der zur Bestimmung der Bahnen der beiden neuen Hauptplaneten angewandten Methoden (Summary survey of the methods used for the determination of the orbits of the two new main planets), considered the origin of linear algebra.
1807, Professor of Astronomy and (the first) Director of Göttingen Observatory for the remainder of his life.
1809, Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium (Theory of motion of the celestial bodies moving in conic sections around the Sun); birth of the Gaussian (normal) distribution, as an attempt to rationalize the method of least squares.
1810, Laplace consolidated the importance of the Gaussian distribution by proving the central limit theorem.
1829, Gauss-Markov Theorem.

Webpage: The Discovery of Ceres
Article: The Discovery of Ceres: How Gauss Became Famous by Teets and Whitehead (1999).
Stephen Stigler gives a more comprehensive account of the origin of the method of least squares in his book The History of Statistics.

Motivated by real data and a real problem (data science!).
Heuristic solution first: method of least squares.
Algorithm development: linear algebra, Gaussian elimination, FFT (fast Fourier transform).
Solution readily verifiable: Ceres was re-discovered.
Theoretical justification (Gaussian distribution, Gauss-Markov theorem) comes much later.

A consulting project by Marc Coram (then a graduate student in statistics at Stanford); the client was a professor in political science who wanted to understand a cryptic message circulated in a state prison.
Marc modeled the letter sequence by a Markov chain ($26 \times 26$ transition matrix) and estimated transition probabilities from War and Peace.
Now each mapping $\sigma$ from message symbols to letters yields a likelihood $f(\sigma)$ of the symbol sequence.
Find the $\sigma$ that maximizes $f$. Sample space is at least $26! = 4.0329 \times 10^{26}$. Combinatorial optimization -- hard!
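As a quick check of the count above, here is a minimal sketch in Julia. It assumes only Base Julia; `big` promotes the argument to arbitrary precision because $26!$ is far beyond the range of a 64-bit integer.

```julia
# 26! exceeds typemax(Int64) (about 9.22e18), so compute it as a BigInt.
n = factorial(big(26))

println(n)            # exact integer value of 26!
println(Float64(n))   # floating-point approximation, about 4.0329e26
```

Exhaustive search over a space of this size is out of the question, which is why a randomized search is used instead.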
Metropolis algorithm: At each iteration:
- generate a new $\sigma'$ by random transposition of two letters
- accept $\sigma'$ with probability $\min \left\{\frac{f(\sigma')}{f(\sigma)}, 1\right\}$
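The two steps above can be sketched in Julia as follows. This is an illustration under stated assumptions, not Marc Coram's actual code: the function names (`encode`, `log_transitions`, `loglik`, `metropolis`), the 27-symbol alphabet (a-z plus one catch-all symbol), the add-one smoothing, and the toy reference sentence are all hypothetical choices made here. The sketch works with $\log f(\sigma)$ rather than $f(\sigma)$, since a product of many small transition probabilities would underflow.

```julia
using Random

# Map text to integer codes: 1-26 for a-z, 27 for every other character.
encode(text) = [('a' <= c <= 'z') ? (c - 'a') + 1 : 27 for c in lowercase(text)]

# Estimate log transition probabilities from a reference sequence.
# Add-one smoothing (an assumption) keeps every probability positive.
function log_transitions(ref, K)
    counts = ones(K, K)
    for t in 1:length(ref)-1
        counts[ref[t], ref[t+1]] += 1
    end
    return log.(counts ./ sum(counts, dims=2))   # each row sums to 1
end

# log f(σ): log-likelihood of the cipher text decoded through σ.
function loglik(σ, cipher, logP)
    ll = 0.0
    for t in 1:length(cipher)-1
        ll += logP[σ[cipher[t]], σ[cipher[t+1]]]
    end
    return ll
end

function metropolis(cipher, logP; iters=10_000, rng=Random.default_rng())
    K = size(logP, 1)
    σ = collect(1:K)                      # start from the identity mapping
    ll = loglik(σ, cipher, logP)
    best, bestll = copy(σ), ll
    for _ in 1:iters
        i, j = rand(rng, 1:K), rand(rng, 1:K)
        σ[i], σ[j] = σ[j], σ[i]           # propose σ′ by a random transposition
        llnew = loglik(σ, cipher, logP)
        if log(rand(rng)) < llnew - ll    # accept with prob. min(f(σ′)/f(σ), 1)
            ll = llnew
            if ll > bestll
                best, bestll = copy(σ), ll
            end
        else
            σ[i], σ[j] = σ[j], σ[i]       # reject: undo the swap
        end
    end
    return best, bestll
end

# Toy usage. The reference text is a stand-in; the source used War and Peace.
rng = MersenneTwister(280)
reference = "the quick brown fox jumps over the lazy dog and then the dog chases the fox"
logP = log_transitions(encode(reference), 27)
plain = encode("the dog chases the fox")
secret = randperm(rng, 27)                # hypothetical secret key
cipher = secret[plain]                    # encrypted message
startll = loglik(collect(1:27), cipher, logP)
best, bestll = metropolis(cipher, logP; iters=5_000, rng=rng)
```

With so little reference text the decoded message will generally not be readable; the point is only the structure of the sampler. The true decoder here is `invperm(secret)`, so its log-likelihood can be compared with `bestll` to judge how well the chain did.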

Motivated by a real problem (data science!).
Solution readily verifiable: we can read it!
Algorithm development: the Metropolis sampler is one of the top 10 algorithms of the 20th century.
See the article The Markov chain Monte Carlo revolution by Persi Diaconis (2009) for more details.

Not a course on statistical packages. It does not answer questions such as "How do I fit a linear mixed model in R, Julia, SAS, SPSS, or Stata?"
Not a pure programming course, although programming is important and we do homework in Julia.
The new BIOSTAT 203A (Data Management) in fall quarter focuses on programming in R and SAS.
Not a course on data science. The new BIOSTAT 203B (Introduction to Data Science) in winter quarter focuses on software tools for data scientists.
This course focuses on algorithms, that is, numerical methods in statistics.
To quote James Gentle:

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.
For a common numerical task in statistics, say solving the least squares problem $$ \widehat \beta = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y}, $$ we need to know which methods/algorithms are out there and what their advantages and disadvantages are. You will fail this course if you use
```
inv(X'X) * X' * y
```
Using X \ y (or lm(y ~ X) in R) is correct but not the purpose of this course. We want to understand what the computer is doing when we call X \ y.
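To make the contrast concrete, the sketch below solves one simulated least squares problem four ways in Julia. The data, the seed, and the variable names are assumptions for illustration. The explicit-inverse line appears only for comparison; it is the approach ruled out above.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(280)
n, p = 100, 5
X = randn(rng, n, p)                      # simulated design matrix
y = X * ones(p) + 0.1 * randn(rng, n)     # true coefficients are all 1

# 1. The formula taken literally: explicit inverse of X'X (discouraged).
β_inv = inv(X'X) * X' * y

# 2. Cholesky factorization of the normal equations X'X β = X'y.
β_chol = cholesky(Symmetric(X'X)) \ (X' * y)

# 3. QR factorization of X itself; X'X is never formed.
β_qr = qr(X) \ y

# 4. Backslash, which for a rectangular dense X uses a pivoted QR.
β_bs = X \ y

# Forming X'X squares the (2-norm) condition number of the problem.
κX, κG = cond(X), cond(X'X)
println((κX^2, κG))
```

On a well-conditioned problem like this one all four answers agree to many digits. The differences show up in cost and in accuracy when X is ill-conditioned, since methods based on X'X work with the squared condition number.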
For biostat students, this course satisfies the requirement of BIOSTAT 257 in the new curriculum. Ask Ms Roxy Naranjo (rlnaranjo@ph.ucla.edu) for the paperwork.