An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
-- Buckheit and Donoho (1995)
Duke Potti Scandal
Potti et al. (2006) Genomic signatures to guide the use of chemotherapeutics, Nature Medicine, 12(11):1294--1300.
Baggerly and Coombes (2009) Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology, Ann. Appl. Stat., 3(4):1309--1334.
More information:
Nature Genetics (2015 Impact Factor: 31.616). 20 articles about microarray profiling published in Nature Genetics between Jan 2005 and Dec 2006.
Bible code.
Witztum, Rips, and Rosenberg (1994) Equidistant letter sequences in the Book of Genesis, Statist. Sci., 9(3):429-438.
McKay, Bar-Natan, Bar-Hillel, and Kalai (1999) Solving the Bible code puzzle, Statist. Sci., 14(2):150-173.
Replicability has been a foundation of science. It helps accumulate scientific knowledge.
Greater research impact.
Better work habit boosts quality of research.
Better teamwork. For you (graduate students), it means better communication with your advisor.
while true
Stud: "that idea you told me to try - it doesn't work!"
Prof: "ok. how about trying this instead."
end
Unless you reproduce the computing environment (algorithms, dataset, tuning parameters), there's no way your professor can help you.
When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures.
-- Buckheit and Donoho (1995)
A good example: http://stanford.edu/~boyd/papers/admm_distr_stats.html
We are going to practice reproducible research now: you will make your homework reproducible using Git/GitHub and IJulia.
If it's not in source control, it doesn't exist.
Statisticians, as opposed to closet mathematicians, rarely do things in a vacuum.
We use Git in this course.
I'm an egotistical bastard, and I name all my projects after myself. First 'Linux', now 'git'.
-- Linus Torvalds
A Git server enabling multi-person collaboration through a centralized repository.
Git client on your own machine.
Install with yum install git on CentOS, port install git (MacPorts) on a Mac, or another package manager.
Don't rely totally on a GUI or IDE. Learn to use Git on the command line, which is needed for cluster and cloud computing.
Basic workflow: git pull (= git fetch + git merge), then git add, git commit, and git push.
Register for an account on a Git server, e.g., github.com.
Upload your SSH public key to the server.
Identify yourself on your local machine, e.g.,
git config --global user.name "Hua Zhou"
git config --global user.email "huazhou@ucla.edu"
Name and email appear in each commit you make.
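To see how the identity ends up in a commit, the following is a minimal sketch in a throwaway repository. The temporary directory and the file code.jl are assumptions for illustration; the identity is set for that repository only (without --global) so the demo does not touch your real configuration.

```shell
# Scratch repository in a temporary directory (hypothetical, safe to delete)
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q

# Without --global, the setting applies to this repository only
git config user.name "Hua Zhou"
git config user.email "huazhou@ucla.edu"

# Make one commit so there is something to inspect
echo "x = 1" > code.jl
git add code.jl
git commit -q -m "Add code.jl defining x"

# Print the author name and email recorded in the last commit
author=$(git log -1 --format='%an <%ae>')
echo "$author"
```

A repository-level setting takes precedence over the --global one, which is handy if you commit to some projects under a different email address.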
Initialize a project: create a repository, e.g., biostat-m280-2018-spring, on the server, then clone it to your local machine:
git clone git@github.com:Hua-Zhou/biostat-m280-2018-spring.git
Working with your local copy.
git pull: update the local Git repository with the remote repository (fetch + merge)
git status: display the current status of the working directory
git log filename: display the commit history of filename
git diff: show differences (by default, changes in the working directory that are not yet staged)
git add file1 file2 ...: add file(s) to the staging area
git commit: commit changes in the staging area to the Git directory
git push: publish commits in the local Git repository to the remote repository
git reset --soft HEAD~1: undo the last commit, keeping its changes staged
git checkout filename: restore filename to its last staged or committed version, discarding unstaged changes to it
git rm filename: delete the file and remove it from Git control (git rm --cached filename keeps the file on disk)
Branching in Git.
For this course, you need to have two branches:
develop: for your own development
master: for releases (homework submission)
Note that master is the default branch when you initialize the project; create and switch to the develop branch immediately after project initialization.
Commonly used commands:
git branch branchname: create a branch
git branch: show all project branches
git checkout branchname: switch to a branch
git tag: show tags (major landmarks)
git tag tagname: create a tag
Clone the project and create a develop branch, where you write your solution for HW1.
# clone the project
git clone git@github.com:UCLA-BIOSTAT-M280-2017-Spring/biostat-m280-2017-HuaZhou.git
# enter project folder
cd biostat-m280-2017-HuaZhou
# what branches are there?
git branch
# create develop branch
git branch develop
# switch to the develop branch
git checkout develop
# create folder for HW1
mkdir hw1
cd hw1
# let's write some code
echo "x = 1" > code.jl
echo "some bug" >> code.jl
# commit the code
git add code.jl
git commit -m "famous x = 1 function"
# push to remote repo; -u sets the upstream, which a newly created branch does not have yet
git push -u origin develop
Submit and tag the HW1 solution on the master branch.
# which branch are we in
git branch
# change to the master branch
git checkout master
# merge the develop branch (as pushed to origin) into the master branch
git pull origin develop
# push to the remote master branch
git push
# tag version hw1
git tag hw1
git push --tags
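The whole develop-to-master release cycle can be rehearsed offline before touching the course repository. The sketch below assumes nothing beyond a Git installation: a bare repository in a temporary directory stands in for the GitHub remote, and the paths, the identity "Demo Student", and the file contents are hypothetical.

```shell
# A bare repository plays the role of the GitHub remote (hypothetical paths)
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/local" 2>/dev/null  # hides the empty-repository warning
cd "$work/local"
# Name the initial branch master regardless of the local Git default
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo Student"
git config user.email "student@example.com"

# First commit on master, published to the remote
echo "# homework" > README.md
git add README.md
git commit -q -m "Add README"
git push -q -u origin master

# Develop branch carrying the HW1 solution
git branch develop
git checkout -q develop
mkdir hw1
echo "x = 1" > hw1/code.jl
git add hw1/code.jl
git commit -q -m "Add HW1 solution defining x"
git push -q -u origin develop

# Release: bring develop into master, publish, then tag
git checkout -q master
git pull -q origin develop
git push -q
git tag hw1
git push -q --tags

# List the branches and tags the remote now holds
git ls-remote --heads --tags origin
```

Because master has no commits of its own beyond those already in develop, the pull is a fast-forward and no merge commit is created.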
Be judicious about what to put in the repository.
Use a .gitignore file to keep files that should not be tracked out of the repository.
Strictly speaking, a version control system is for source files only. E.g., only xxx.tex, xxx.bib, and figure files are necessary to produce a PDF file. The PDF file doesn't need to be version controlled or, if version controlled, doesn't need to be frequently committed.
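As a rough illustration of the LaTeX case, the sketch below writes a .gitignore in a scratch repository and checks which files Git still offers to track. The ignore patterns and the report.* file names are assumptions for the example, not a recommended list.

```shell
# Scratch repository (hypothetical, safe to delete)
repo=$(mktemp -d)
cd "$repo"
git init -q

# Ignore LaTeX build products; these patterns are illustrative only
printf '%s\n' '*.pdf' '*.aux' '*.log' > .gitignore

# Source files plus build products
touch report.tex report.bib report.pdf report.aux

# Files Git still reports as candidates for tracking
visible=$(git status --porcelain --untracked-files=all | cut -c4- | sort | tr '\n' ' ')
echo "$visible"

# Ask Git directly which of these paths are ignored
git check-ignore report.pdf report.aux report.tex
```

git check-ignore prints only the paths matched by an ignore rule, so report.tex is absent from its output while report.pdf and report.aux are listed.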
Commit early, commit often and don't spare the horses.
Adding an informative message when you commit is not optional. Spending one minute on a commit message saves hours later for your collaborators and yourself. Read the following sentence to yourself 3 times:
Write every commit message like the next person who reads it is an axe-wielding maniac who knows where you live.
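If a commit slips through with an uninformative message and has not been pushed yet, git reset --soft HEAD~1 from the command list earlier lets you redo it. The following is a small sketch in a scratch repository; the identity, file name, and messages are made up for the example.

```shell
# Scratch repository (hypothetical, safe to delete)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo Student"
git config user.email "student@example.com"

echo "x = 1" > code.jl
git add code.jl
git commit -q -m "Add code.jl defining x"

# A second commit with a useless message
echo "y = 2" >> code.jl
git add code.jl
git commit -q -m "stuff"

# Undo the last commit; its changes stay in the staging area
git reset -q --soft HEAD~1
staged=$(git diff --cached --name-only)
echo "still staged: $staged"

# Commit again, this time saying what changed
git commit -q -m "Define y = 2 in code.jl"
last_message=$(git log -1 --format=%s)
commit_count=$(git rev-list --count HEAD)
echo "$commit_count commits, last message: $last_message"
```

Rewriting a commit that has already been pushed changes shared history, so this is best reserved for local commits.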
The IPython notebook is a powerful tool for authoring dynamic documents, which combine code, formatted text, math, and multimedia in a single document.
Jupyter is the current development that encompasses multiple languages including Julia, Python, and R.
Julia uses Jupyter notebook through the IJulia.jl package.
In this course, you are required to write your homework reports using IJulia.
For each homework, you need to submit your IJulia notebook (e.g., hw1.ipynb) and its HTML export (e.g., hw1.html), along with all code and data that are necessary to reproduce the results.
You can start with the Jupyter notebook for the lectures.
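Before tagging a submission, a quick check that both required files are present can save a resubmission. The helper below is a hypothetical sketch, not part of the course tooling: check_submission and its arguments are invented names, and it only checks that the notebook and the HTML file exist.

```shell
# check_submission DIR NAME
# Succeeds only if DIR contains both NAME.ipynb and NAME.html (hypothetical helper)
check_submission() {
    dir=$1
    name=$2
    status=0
    for ext in ipynb html; do
        if [ ! -f "$dir/$name.$ext" ]; then
            echo "missing: $dir/$name.$ext"
            status=1
        fi
    done
    return $status
}

# Demo in a scratch folder that has the notebook but not yet the HTML export
demo=$(mktemp -d)
touch "$demo/hw1.ipynb"
check_submission "$demo" hw1 || echo "hw1 is not ready to submit"

# The HTML file is typically produced from the notebook with
#   jupyter nbconvert --to html hw1.ipynb
# (assumes Jupyter is installed; not run here)
```

The function returns a nonzero status when anything is missing, so it can also gate a script, e.g., check_submission hw1 hw1 && git tag hw1.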