No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.
Apply for the Student Developer Pack at GitHub using your UCLA email.
Create a private repository `biostat-m280-2019-winter` and add `Hua-Zhou` and `juhkim111` as your collaborators with write permission.
The top-level directories of the repository should be `hw1`, `hw2`, … Maintain two branches, `master` and `develop`. The `develop` branch will be your main playground, the place where you develop solutions (code) to homework problems and write up your report. The `master` branch will be your presentation area. Submit your homework files (the R Markdown file `Rmd`, the `html` file converted from R Markdown, and all code and data sets needed to reproduce the results) in the `master` branch.
After each homework due date, the teaching assistant and instructor will check out your `master` branch for grading. Tag each of your homework submissions with tag names `hw1`, `hw2`, … The tagging time will be used as your submission time. That means if you tag your `hw1` submission after the deadline, penalty points will be deducted for late submission.
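The branch-and-tag routine described above can be sketched in a throwaway local repository. This is only an illustration under stated assumptions: the scratch directory, the placeholder identity, the file contents, and the commit messages are all hypothetical. An annotated tag (`git tag -a`) is used because it records its own creation date; the assignment does not say which kind of tag is required.

```bash
#!/usr/bin/env bash
# Sketch of the develop -> master -> tag workflow in a scratch repository.
set -euo pipefail

demo_dir=$(mktemp -d)                      # hypothetical scratch location, not your real repo
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Demo Student"        # placeholder identity for this demo only
git config user.email "student@example.com"

# Initial commit on master so that both branches share history.
echo "# biostat-m280-2019-winter" > README.md
git add README.md
git commit -q -m "Initial commit"

# Develop the homework solution on the develop branch.
git checkout -q -b develop
mkdir hw1
echo "draft report" > hw1/hw1.Rmd          # stand-in for the real R Markdown file
git add hw1/hw1.Rmd
git commit -q -m "hw1: start report"

# Bring the finished work to master and tag the submission.
git checkout -q master
git merge -q --no-ff -m "Merge hw1 solution from develop" develop
git tag -a hw1 -m "Homework 1 submission"

git tag --list                             # lists the tag just created
git log --oneline --graph --all            # shows how the two branches relate

# With a real GitHub remote you would also publish the branches and tags, e.g.
# git push origin master develop --tags
```

Merging with `--no-ff` keeps a visible merge commit on `master` for each submission; a fast-forward merge would work as well.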
The `/home/m280data/NYCHVS` folder on the teaching server contains a data set from the New York City Housing and Vacancy Survey. See 2019 ASA Data Challenge Expo for further details.
```bash
ls -l /home/m280data/NYCHVS
```
```
## total 74960
## -r--r--r--. 1 root root 5524433 Jan 17 03:51 NYCHVS_1991.csv
## -r--r--r--. 1 root root 5854176 Jan 17 03:52 NYCHVS_1993.csv
## -r--r--r--. 1 root root 6530552 Jan 17 03:52 NYCHVS_1996.csv
## -r--r--r--. 1 root root 6445160 Jan 17 03:52 NYCHVS_1999.csv
## -r--r--r--. 1 root root 7219516 Jan 17 03:52 NYCHVS_2002.csv
## -r--r--r--. 1 root root 7253152 Jan 17 03:52 NYCHVS_2005.csv
## -r--r--r--. 1 root root 8521473 Jan 17 03:52 NYCHVS_2008.csv
## -r--r--r--. 1 root root 7696926 Jan 17 03:52 NYCHVS_2011.csv
## -r--r--r--. 1 root root 6872681 Jan 17 03:52 NYCHVS_2014.csv
## -r--r--r--. 1 root root 14821469 Jan 17 03:52 NYCHVS_2017.csv
```
Please do not put these data files into Git; they are big. Also, do not copy them into your directory. Just read from the data folder `/home/m280data/NYCHVS` directly. Use Bash commands to answer the following questions.
Display the first few lines of `NYCHVS_1991.csv`.
Display the number of lines in each `csv` file.
Display the 3 files that have the fewest lines.
What’s the output of the following Bash script?
for datafile in /home/m280data/NYCHVS/*.csv
do
ls $datafile
done
What unique values does the second variable `borough` take in `NYCHVS_1991.csv`? Tabulate how many times each value appears.
You and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from https://www.gutenberg.org/files/1342/1342.txt and save it to your local folder.
curl https://www.gutenberg.org/files/1342/1342.txt > pride_and_prejudice.txt
Do not put this text file `pride_and_prejudice.txt` in Git. Using a `for` loop, how would you tabulate the number of times each of the four characters is mentioned?
What’s the difference between the following two commands?
echo 'hello, world' > test1.txt
and
echo 'hello, world' >> test2.txt
Using your favorite text editor (e.g., `vi`), type the following and save the file as `middle.sh`:
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
Using `chmod`, make the file executable by the owner, and run
./middle.sh pride_and_prejudice.txt 20 5
Explain the output. Explain the meaning of `"$1"`, `"$2"`, and `"$3"` in this shell script. Why do we need the first line of the shell script?
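Because `pride_and_prejudice.txt` must stay out of Git, one possible safeguard is a `.gitignore` entry. The sketch below runs in a throwaway repository with a placeholder file standing in for the downloaded novel; the scratch directory and the placeholder contents are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch: keep a downloaded text file out of version control with .gitignore.
set -euo pipefail

demo_dir=$(mktemp -d)        # throwaway repository for illustration only
cd "$demo_dir"
git init -q

# Stand-in for the downloaded novel (no network access needed for the demo).
echo "placeholder text" > pride_and_prejudice.txt

# Tell Git to ignore the file by name.
echo "pride_and_prejudice.txt" >> .gitignore

# git check-ignore exits with status 0 when the path is ignored.
if git check-ignore -q pride_and_prejudice.txt; then
  echo "ignored"
fi

# Only .gitignore is reported as untracked; the text file is not listed.
git status --porcelain
```

The `.gitignore` file itself is meant to be committed, so the rule travels with the repository.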
In class we discussed using R to organize simulation studies.
Expand the runSim.R
script to include arguments seed
(random seed), n
(sample size), dist
(distribution) and rep
(number of simulation replicates). When dist="gaussian"
, generate data from standard normal; when dist="t1"
, generate data from t-distribution with degree of freedom 1 (same as Cauchy distribution); when dist="t5"
, generate data from t-distribution with degree of freedom 5. Calling runSim.R
will (1) set random seed according to argument seed
, (2) generate data according to argument dist
, (3) compute the primed-indexed average estimator and the classical sample average estimator for each simulation replicate, (4) report the average mean squared error (MSE) \[
\frac{\sum_{r=1}^{\text{rep}} (\widehat \mu_r - \mu_{\text{true}})^2}{\text{rep}}
\] for both methods.
Modify the `autoSim.R` script to run simulations with combinations of sample sizes `nVals = seq(100, 500, by=100)` and distributions `distTypes = c("gaussian", "t1", "t5")` and write output to appropriately named files. Use `rep = 50` and `seed = 280`.
Write an R script to collect simulation results from the output files and print the average MSEs in a table of the format

| \(n\) | Method | Gaussian | \(t_5\) | \(t_1\) |
|---|---|---|---|---|
| 100 | PrimeAvg | | | |
| 100 | SampAvg | | | |
| 200 | PrimeAvg | | | |
| 200 | SampAvg | | | |
| 300 | PrimeAvg | | | |
| 300 | SampAvg | | | |
| 400 | PrimeAvg | | | |
| 400 | SampAvg | | | |
| 500 | PrimeAvg | | | |
| 500 | SampAvg | | | |
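As a quick arithmetic check of the average MSE formula used in this question, the sketch below evaluates it for three made-up estimates. The estimate values and the true value of 0 are hypothetical, chosen only so the sum is easy to verify by hand: (0.01 + 0.04 + 0.09) / 3 ≈ 0.0467.

```bash
#!/usr/bin/env bash
# Sketch: average MSE over replicates for hypothetical estimates.
set -euo pipefail

mu_true=0                     # hypothetical true value
estimates=(0.1 -0.2 0.3)      # hypothetical estimates, one per replicate

# Sum the squared errors line by line, then divide by the number of replicates (NR).
mse=$(printf '%s\n' "${estimates[@]}" |
  awk -v mu="$mu_true" '{ ss += ($1 - mu) * ($1 - mu) } END { printf "%.4f", ss / NR }')

echo "average MSE: $mse"      # prints: average MSE: 0.0467
```

The homework itself asks for this computation in R; the shell version is only meant to make the formula concrete.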