Statistics, the science of data analysis, is the applied mathematics in the 21st century.
People (scientists, government, health professionals, companies) collect data in order to answer certain questions. A statistician's job is to help them extract knowledge and insights from data.
Two must-reads for (bio)statistics students:
If existing software tools readily solve the problem, all the better.
Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming).
This entails at least two essential skills: programming and fundamental knowledge of algorithms.
Two examples: How Gauss became famous and Marc Coram deciphering a jail message.
Dr. Carl Friedrich Gauss, 24; proved the Fundamental Theorem of Algebra; wrote the book Disquisitiones Arithmeticae, which is still being studied today.
Jan 1-Feb 11, 1801 (41 days), astronomer Piazzi observed Ceres (a dwarf planet), which was then lost behind the Sun.
Aug-Sep, futile search by top astronomers; Laplace claimed it unsolvable.
Oct-Nov, Gauss did calculations by the method of least squares.
Dec 31, astronomer von Zach re-located Ceres according to Gauss' calculation.
1802, Summarische Übersicht der zur Bestimmung der Bahnen der beiden neuen Hauptplaneten angewandten Methoden (Summary survey of the methods used for the determination of the orbits of the two new main planets), considered the origin of linear algebra.
1807, Professor of Astronomy and (the first) Director of Göttingen Observatory for the remainder of his life.
1809, Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium (Theory of motion of the celestial bodies moving in conic sections around the Sun); birth of the Gaussian (normal) distribution, as an attempt to rationalize the method of least squares.
1810, Laplace consolidated the importance of the Gaussian distribution by proving the central limit theorem.
1829, Gauss-Markov Theorem.
Webpage: The Discovery of Ceres
Article: The Discovery of Ceres: How Gauss Became Famous by Teets and Whitehead (1999).
Stephen Stigler gives a more comprehensive account of the origin of the method of least squares in his book The History of Statistics.
Motivated by real data and a real problem (data science!).
Heuristic solution first: method of least squares.
Algorithm development: linear algebra, Gaussian elimination, FFT (fast Fourier transform).
Solution readily verifiable: Ceres was re-discovered.
Theoretical justification (Gaussian distribution, Gauss-Markov theorem) comes much later.
Motivated by a real problem (data science!).
Solution readily verifiable: we can read it!
Algorithm development: the Metropolis sampler is one of the top 10 algorithms of the 20th century.
See the article The Markov chain Monte Carlo revolution by Persi Diaconis (2009) for more details.
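For readers who have not seen the Metropolis sampler, here is a minimal sketch in Julia. It is not the cipher-decoding chain from the jail message example; it is a random-walk Metropolis sampler on a toy target (a standard normal known only up to a constant). The function name `metropolis` and the arguments `logtarget`, `x0`, `nsteps`, `stepsize`, and `rng` are hypothetical names chosen for this illustration.

```julia
using Random, Statistics

# Random-walk Metropolis sampler for a one-dimensional target density
# that is known only up to a normalizing constant.
# All names here are illustrative assumptions, not course code.
function metropolis(logtarget, x0::Float64, nsteps::Int;
                    stepsize::Float64 = 1.0,
                    rng = MersenneTwister(280))
    samples = Vector{Float64}(undef, nsteps)
    x = x0
    logp = logtarget(x)
    for t in 1:nsteps
        # Symmetric proposal: current state plus Gaussian noise.
        proposal = x + stepsize * randn(rng)
        logp_new = logtarget(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if log(rand(rng)) < logp_new - logp
            x, logp = proposal, logp_new
        end
        samples[t] = x   # a rejected proposal repeats the current state
    end
    return samples
end

# Toy target: unnormalized log density of a standard normal.
samples = metropolis(x -> -x^2 / 2, 0.0, 50_000)
println("sample mean = ", mean(samples), ", sample variance = ", var(samples))
```

Only ratios of the target density enter the accept/reject step, so the normalizing constant is never needed. That property is what makes the sampler usable on very large discrete spaces such as the set of possible substitution ciphers.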
Not a course on packages for data analysis. It does not answer questions such as "How to fit a linear mixed model in R, Julia, SAS, SPSS, or Stata?"
Not a programming course, although programming is extremely important and we do homework in Julia.
The new BIOSTAT 203A (Data Management) in fall quarter and 203B (Introduction to Data Science) in spring quarter will focus more on programming.
This course focuses on algorithms, that is, numerical methods in statistics.
To quote James Gentle:
The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.
For a common numerical task in statistics, say solving the least squares problem $$ \widehat \beta = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y}, $$ we need to know which methods/algorithms are available and what their advantages and disadvantages are. You will fail this course if you use
inv(X'X) * X'y
Using X \ y (or lm(y ~ X) in R) is correct but not the purpose of this course. We want to understand what the computer is doing when we call X \ y.
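As a rough illustration of why the form of the formula and the way it is evaluated can differ, the sketch below solves one simulated least squares problem four ways in Julia. The data, dimensions, seed, and variable names are assumptions made for this example only.

```julia
using LinearAlgebra, Random

# Simulated data; sizes, seed, and true coefficients are illustrative assumptions.
rng = MersenneTwister(2017)
n, p = 100, 5
X = randn(rng, n, p)
beta_true = collect(1.0:p)
y = X * beta_true + 0.1 * randn(rng, n)

# 1. Textbook formula with an explicit inverse (the approach to avoid).
beta_inv = inv(X' * X) * (X' * y)

# 2. Normal equations solved through a Cholesky factorization of X'X.
beta_chol = cholesky(Symmetric(X' * X)) \ (X' * y)

# 3. QR factorization of X itself; X'X is never formed.
beta_qr = qr(X) \ y

# 4. Backslash; for a rectangular dense matrix Julia uses a pivoted QR.
beta_bs = X \ y

# Forming X'X squares the condition number of the problem.
println("cond(X)   = ", cond(X))
println("cond(X'X) = ", cond(X' * X))
println("max difference, inv vs backslash = ", maximum(abs.(beta_inv - beta_bs)))
```

On well-conditioned data like this, all four answers agree to many digits. The differences show up in cost and in accuracy when X is ill-conditioned, because methods based on X'X work with a condition number that is the square of that of X.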
Course webpage: http://hua-zhou.github.io/teaching/biostatm280-2017spring/
Check the Schedule and Announcements pages frequently.
Questions following the posts will be taken.
Slides will be posted before each lecture.