Jan 9, 2018

What is this course about?

Statistics and data science

  • This course (Biostat M280) is used as a placeholder for Biostat 203B: Introduction to Data Science, which is pending approval.

  • Statistics, the science of data analysis, is the applied mathematics of the 21st century.

  • Data is increasing in volume, velocity, and variety.

Classification of data sets by Huber (1994, 1996)

| Data Size | Bytes | Storage Mode |
|-----------|-------|--------------|
| tiny | \(10^2\) | piece of paper |
| small | \(10^4\) | a few pieces of paper |
| medium | \(10^6\) (MB) | a floppy disk |
| large | \(10^8\) | hard disk |
| huge | \(10^9\) (GB) | hard disk(s) |
| massive | \(10^{12}\) (TB) | hard disk(s); RAID storage |
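
To make the table concrete, here is a minimal Python sketch that maps a size in bytes to one of the categories above. The function name `huber_class` is hypothetical. The table lists only nominal sizes, so the rule used here (take the largest category whose nominal size does not exceed the input) is an assumption made for illustration, not part of Huber's classification.

```python
import bisect

# Nominal sizes (in bytes) copied from the table above, smallest to largest.
HUBER_CLASSES = [
    (10**2, "tiny"),
    (10**4, "small"),
    (10**6, "medium"),
    (10**8, "large"),
    (10**9, "huge"),
    (10**12, "massive"),
]


def huber_class(n_bytes: int) -> str:
    """Return the largest category whose nominal size does not exceed n_bytes.

    Assumption: each nominal size is treated as a lower bound. The table
    itself does not define boundaries between categories.
    """
    if n_bytes < 0:
        raise ValueError("n_bytes must be non-negative")
    sizes = [size for size, _ in HUBER_CLASSES]
    # bisect_right counts how many nominal sizes are <= n_bytes.
    idx = bisect.bisect_right(sizes, n_bytes)
    # Anything below 10^2 bytes is still reported as "tiny".
    return HUBER_CLASSES[max(idx - 1, 0)][1]


if __name__ == "__main__":
    for n in (50, 5 * 10**6, 10**9, 2 * 10**12):
        print(n, huber_class(n))
```

Under this rule a 5 MB file falls in the "medium" class and a 2 TB data set in the "massive" class. A different boundary convention (for example, rounding to the nearest power of ten) would shift some sizes into the neighboring category.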

Four V's of big data

Course description

  • This course introduces some computing skills and software tools for handling potentially big public health data.

  • Read the syllabus for a tentative list of topics and course logistics.

References

Huber, P. J. (1994). Huge data sets. In COMPSTAT 1994 (Vienna) (pp. 3–13). Heidelberg: Physica.

Huber, P. J. (1996). Massive data sets workshop: The morning after. In Massive data sets: Proceedings of a workshop (pp. 169–184). Washington: National Academy Press.