Jan 11, 2018

Why Linux

Linux is the most common platform for scientific computing.

  • Open source and community support.

  • Things break; when they break on Linux, they are usually easy to diagnose and fix.

  • Scalability: portable devices (Android), laptops, servers, clusters, and supercomputers.
    • E.g. UCLA Hoffman2 cluster runs on Linux.
  • Cost: it's free!

Distributions of Linux

  • Debian/Ubuntu is a popular choice for personal computers.

  • RHEL/CentOS is popular on servers.

  • The teaching server for this class runs CentOS 7.

  • macOS is derived from BSD Unix (its open-source core is Darwin), not from Linux. It is POSIX compliant, so most shell commands we review here apply to the macOS terminal as well. Windows/DOS, unfortunately, is a totally different breed.

  • Show distribution/version on Linux:

    cat /etc/*-release
    ## CentOS Linux release 7.4.1708 (Core) 
    ## NAME="CentOS Linux"
    ## VERSION="7 (Core)"
    ## ID="centos"
    ## ID_LIKE="rhel fedora"
    ## VERSION_ID="7"
    ## PRETTY_NAME="CentOS Linux 7 (Core)"
    ## ANSI_COLOR="0;31"
    ## CPE_NAME="cpe:/o:centos:centos:7"
    ## HOME_URL="https://www.centos.org/"
    ## BUG_REPORT_URL="https://bugs.centos.org/"
    ## 
    ## CENTOS_MANTISBT_PROJECT="CentOS-7"
    ## CENTOS_MANTISBT_PROJECT_VERSION="7"
    ## REDHAT_SUPPORT_PRODUCT="centos"
    ## REDHAT_SUPPORT_PRODUCT_VERSION="7"
    ## 
    ## CentOS Linux release 7.4.1708 (Core) 
    ## CentOS Linux release 7.4.1708 (Core)

  • Show distribution/version on Mac:

    sw_vers -productVersion

    or

    system_profiler SPSoftwareDataType
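The two checks above can be folded into one snippet that works on both systems. This is a minimal sketch, assuming that /etc/os-release exists on the Linux machine (it does on CentOS 7 and most current distributions); the variable name os_name is only for illustration.

```shell
# Report the operating system name on Linux or macOS.
if [ -r /etc/os-release ]; then
  # os-release is a list of shell-style KEY="value" assignments, so it can be
  # sourced; a subshell keeps those variables out of the current shell.
  os_name="$( . /etc/os-release; printf '%s' "${PRETTY_NAME:-${NAME:-Linux}}" )"
elif command -v sw_vers >/dev/null 2>&1; then
  # macOS: sw_vers prints the product version.
  os_name="macOS $(sw_vers -productVersion)"
else
  # Fallback: kernel name only.
  os_name="$(uname -s)"
fi
echo "Running on: $os_name"
```

On the teaching server this should report the same PRETTY_NAME value that appears in the listing above.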

Linux shells


  • A shell translates commands to OS instructions.

  • Most commonly used shells include bash, csh, tcsh, zsh, etc.

  • Sometimes a script or a command does not run simply because it's written for another shell.

  • We mostly use bash shell commands in this class.

  • Determine your login shell ($SHELL holds the login shell, which may differ from the shell you are currently running):

    echo $SHELL
    ## /bin/bash

  • List available shells:

    cat /etc/shells
    ## /bin/sh
    ## /bin/bash
    ## /sbin/nologin
    ## /usr/bin/sh
    ## /usr/bin/bash
    ## /usr/sbin/nologin

  • Change to another shell:

    exec bash -l

    The -l option indicates it should be a login shell.

  • Change your login shell permanently:

    chsh -s /bin/bash userid

    Then log out and log in.
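Before changing the login shell it is worth checking that the target shell is listed in /etc/shells, since chsh typically refuses shells that are not listed there. The sketch below is one way to do that check; the helper name shell_is_listed and the throwaway demo file are assumptions made for illustration.

```shell
# Succeed if a shell path appears as a whole line in a shells file.
shell_is_listed() {
  # $1: absolute path of the shell; $2: shells file (defaults to /etc/shells)
  grep -qxF -- "$1" "${2:-/etc/shells}" 2>/dev/null
}

# Demonstrate with a throwaway shells file so the sketch runs anywhere.
demo_shells="$(mktemp)"
printf '%s\n' /bin/sh /bin/bash /sbin/nologin > "$demo_shells"

if shell_is_listed /bin/bash "$demo_shells"; then bash_listed=yes; else bash_listed=no; fi
if shell_is_listed /bin/zsh "$demo_shells"; then zsh_listed=yes; else zsh_listed=no; fi
echo "bash listed: $bash_listed, zsh listed: $zsh_listed"

rm -f "$demo_shells"
```

On a real machine, call shell_is_listed /bin/bash without the second argument to consult /etc/shells itself.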

Bash completion

Bash provides the following standard completions by default. They save typing time and prevent many typos!

  • Pathname completion.

  • Filename completion.

  • Variable name completion: echo $[TAB][TAB].

  • Username completion: cd ~[TAB][TAB].

  • Hostname completion: ssh huazhou@[TAB][TAB].

  • Completion can also be customized to cover other things, such as a command's options and arguments. Search for "bash completion" for more information.
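As a small illustration of customized completion, bash's complete builtin can attach a fixed word list to a command. This is only a sketch: the command name runsim and its argument words are hypothetical, and the guard is there because complete exists in bash but not in every other shell.

```shell
# Programmable completion is a bash feature, so register it only under bash.
if [ -n "${BASH_VERSION:-}" ]; then
  # Offer these words when completing arguments of the hypothetical command "runsim".
  complete -W "n=100 n=1000 n=10000" runsim
  # Print the completion spec that was just registered.
  complete -p runsim
fi
```

In an interactive bash session, typing runsim followed by [TAB][TAB] would then list the three candidate arguments. Put the complete line in ~/.bashrc to keep it across sessions.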

Navigate file system

Linux directory structure

  • Upon login, the user starts in their home directory.

Move around the file system

  • pwd prints the absolute path of the current working directory:

    pwd
    ## /home/huazhou/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux

  • ls lists contents of a directory:

    ls
    ## autoSim.R
    ## Emacs_Reference_Card.pdf
    ## key_authentication_1.png
    ## key_authentication_2.png
    ## linux_directory_structure.png
    ## linux_filepermission_oct.png
    ## linux_filepermission.png
    ## linux.html
    ## linux.Rmd
    ## meanEst.R
    ## Richard_Stallman_2013.png
    ## runSim.R
    ## screenshot_top.png
    ## Vi_Cheat_Sheet.pdf

  • ls -l lists detailed contents of a directory:

    ls -l
    ## total 3932
    ## -rw-r--r--. 1 huazhou huazhou     263 Jan 11 15:21 autoSim.R
    ## -rw-r--r--. 1 huazhou huazhou  110345 Jan  9 04:48 Emacs_Reference_Card.pdf
    ## -rw-r--r--. 1 huazhou huazhou  321281 Jan 10 23:29 key_authentication_1.png
    ## -rw-r--r--. 1 huazhou huazhou   96119 Jan 10 23:29 key_authentication_2.png
    ## -rw-rw-r--. 1 huazhou huazhou   11662 Jan  8 23:36 linux_directory_structure.png
    ## -rw-r--r--. 1 huazhou huazhou   42472 Jan  9 00:26 linux_filepermission_oct.png
    ## -rw-r--r--. 1 huazhou huazhou  102188 Jan  9 00:29 linux_filepermission.png
    ## -rw-rw-r--. 1 huazhou huazhou 2486252 Jan 16 16:06 linux.html
    ## -rw-r--r--. 1 huazhou huazhou   20159 Jan 16 16:15 linux.Rmd
    ## -rw-r--r--. 1 huazhou huazhou     381 Jan 11 01:02 meanEst.R
    ## -rw-r--r--. 1 huazhou huazhou  141962 Jan  9 04:37 Richard_Stallman_2013.png
    ## -rw-r--r--. 1 huazhou huazhou     498 Jan 16 16:12 runSim.R
    ## -rw-r--r--. 1 huazhou huazhou  469335 Jan  9 06:11 screenshot_top.png
    ## -rw-r--r--. 1 huazhou huazhou  199492 Jan  9 04:56 Vi_Cheat_Sheet.pdf

  • ls -al lists all contents of a directory, including entries whose names start with . (hidden files and folders):

    ls -al
    ## total 3940
    ## drwxrwxr-x. 2 huazhou huazhou    4096 Jan 16 16:10 .
    ## drwxrwxr-x. 6 huazhou huazhou      87 Jan 16 03:23 ..
    ## -rw-r--r--. 1 huazhou huazhou     263 Jan 11 15:21 autoSim.R
    ## -rw-r--r--. 1 huazhou huazhou  110345 Jan  9 04:48 Emacs_Reference_Card.pdf
    ## -rw-r--r--. 1 huazhou huazhou  321281 Jan 10 23:29 key_authentication_1.png
    ## -rw-r--r--. 1 huazhou huazhou   96119 Jan 10 23:29 key_authentication_2.png
    ## -rw-rw-r--. 1 huazhou huazhou   11662 Jan  8 23:36 linux_directory_structure.png
    ## -rw-r--r--. 1 huazhou huazhou   42472 Jan  9 00:26 linux_filepermission_oct.png
    ## -rw-r--r--. 1 huazhou huazhou  102188 Jan  9 00:29 linux_filepermission.png
    ## -rw-rw-r--. 1 huazhou huazhou 2486252 Jan 16 16:06 linux.html
    ## -rw-r--r--. 1 huazhou huazhou   20159 Jan 16 16:15 linux.Rmd
    ## -rw-r--r--. 1 huazhou huazhou     381 Jan 11 01:02 meanEst.R
    ## -rw-rw-r--. 1 huazhou huazhou    4077 Jan 16 16:10 .RData
    ## -rw-r--r--. 1 huazhou huazhou  141962 Jan  9 04:37 Richard_Stallman_2013.png
    ## -rw-r--r--. 1 huazhou huazhou     498 Jan 16 16:12 runSim.R
    ## -rw-r--r--. 1 huazhou huazhou  469335 Jan  9 06:11 screenshot_top.png
    ## -rw-r--r--. 1 huazhou huazhou  199492 Jan  9 04:56 Vi_Cheat_Sheet.pdf

  • .. denotes the parent of the current working directory.

  • . denotes the current working directory.

  • ~ denotes the user's home directory.

  • / denotes the root directory.

  • cd .. changes to the parent directory.

  • cd or cd ~ changes to the home directory.

  • cd / changes to the root directory.
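The shortcuts above can be tried safely inside a scratch directory. The following is a minimal sketch; the directory names project and data are made up for the demonstration.

```shell
start_dir="$(pwd)"                 # remember where we started
nav_root="$(mktemp -d)"            # scratch area for the demo
mkdir -p "$nav_root/project/data"

cd "$nav_root/project/data"
cd ..                              # up one level, into project
parent_name="$(basename "$(pwd)")"
cd ./data                          # . is the current directory, so this enters project/data
child_name="$(basename "$(pwd)")"
cd /                               # jump to the root directory
root_path="$(pwd)"

echo "parent: $parent_name, child: $child_name, root: $root_path"

cd "$start_dir"                    # return to where we started
rm -rf "$nav_root"                 # clean up the scratch area
```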

File permissions

  • chmod g+x file makes a file executable by group members.

  • chmod 751 file sets the permissions of a file to rwxr-x--x.

  • groups userid shows which group(s) a user belongs to:

    groups huazhou
    ## huazhou : huazhou wheel m280
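To see how the octal and symbolic forms of chmod map to the permission string, one can apply both to a temporary file and read the mode back from ls -l. This is a sketch only; the variable names are illustrative.

```shell
perm_file="$(mktemp)"

# Octal form: 751 = rwx (4+2+1) for owner, r-x (4+1) for group, --x (1) for others.
chmod 751 "$perm_file"
mode_octal="$(ls -l "$perm_file" | cut -c1-10)"

# Symbolic form: start from rw-r--r-- (644), then add execute for the group only.
chmod 644 "$perm_file"
chmod g+x "$perm_file"
mode_symbolic="$(ls -l "$perm_file" | cut -c1-10)"

echo "after chmod 751: $mode_octal"
echo "after chmod 644 then g+x: $mode_symbolic"

rm -f "$perm_file"
```

The first ten characters of each ls -l line are the file type followed by the owner, group, and other permission triplets.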

Manipulate files and directories

  • cp copies a file to a new location.

  • mv moves (or renames) a file.

  • touch creates an empty file; if the file already exists, its contents are left unchanged and only its timestamps are updated.

  • rm deletes a file.

  • mkdir creates a new directory.

  • rmdir deletes an empty directory.

  • rm -rf deletes a directory and all contents in that directory (be cautious using the -f option …).
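The commands above can be rehearsed in a scratch directory where nothing valuable can be lost. A minimal sketch, with made-up file names:

```shell
sandbox="$(mktemp -d)"                              # scratch directory

touch "$sandbox/notes.txt"                          # create an empty file
cp "$sandbox/notes.txt" "$sandbox/copy.txt"         # copy it
mkdir "$sandbox/archive"                            # new directory
mv "$sandbox/copy.txt" "$sandbox/archive/"          # move the copy into it

# rmdir only removes empty directories, so this attempt is refused.
if rmdir "$sandbox/archive" 2>/dev/null; then rmdir_result=removed; else rmdir_result=refused; fi

rm "$sandbox/archive/copy.txt"                      # empty the directory first
rmdir "$sandbox/archive"                            # now rmdir succeeds

rm -r "$sandbox"                                    # remove the scratch directory and the rest of its contents
if [ -e "$sandbox" ]; then sandbox_left=yes; else sandbox_left=no; fi
echo "rmdir on non-empty directory: $rmdir_result; sandbox still exists: $sandbox_left"
```

Note that rm -r without -f is enough here; -f additionally suppresses prompts and errors, which is why it deserves caution.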

Find files

  • locate locates a file by name:

    locate linux.Rmd
    ## /home/huazhou/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/linux.Rmd

  • which locates a program:

    which R
    ## /usr/bin/R

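  • find is similar to locate but more capable: it walks the directory tree at run time rather than consulting a prebuilt database, it can select files by age, size, permissions, and more, and it is available on virtually every Unix-like system.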

    find linux.Rmd
    ## linux.Rmd
    find /home/huazhou -name linux.Rmd
    ## /home/huazhou/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/linux.Rmd
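Selecting files by name pattern or by age might look like the sketch below. It builds its own small directory tree, so the file names are illustrative rather than taken from the teaching server.

```shell
search_dir="$(mktemp -d)"
mkdir -p "$search_dir/slides"
touch "$search_dir/slides/linux.Rmd" "$search_dir/notes.txt"

# By name: quote the pattern so that find, not the shell, expands the wildcard.
rmd_files="$(find "$search_dir" -type f -name '*.Rmd')"

# By age: regular files modified less than one day ago.
recent_count="$(find "$search_dir" -type f -mtime -1 | wc -l | tr -d ' ')"

echo "Rmd files: $rmd_files"
echo "recently modified files: $recent_count"

rm -r "$search_dir"
```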

Wildcard characters

Wildcard    Matches
?           any single character
*           any string of zero or more characters
[set]       any single character in set
[!set]      any single character not in set
[a-z]       any lowercase letter
[0-9]       any digit (same as [0123456789])

The following are regular expression operators rather than shell wildcards:

Operator    Matches
+           one or more repetitions of the preceding pattern
^           beginning of the line
$           end of the line

  • All png files in the current folder:

    ls -l *.png
    ## -rw-r--r--. 1 huazhou huazhou 321281 Jan 10 23:29 key_authentication_1.png
    ## -rw-r--r--. 1 huazhou huazhou  96119 Jan 10 23:29 key_authentication_2.png
    ## -rw-rw-r--. 1 huazhou huazhou  11662 Jan  8 23:36 linux_directory_structure.png
    ## -rw-r--r--. 1 huazhou huazhou  42472 Jan  9 00:26 linux_filepermission_oct.png
    ## -rw-r--r--. 1 huazhou huazhou 102188 Jan  9 00:29 linux_filepermission.png
    ## -rw-r--r--. 1 huazhou huazhou 141962 Jan  9 04:37 Richard_Stallman_2013.png
    ## -rw-r--r--. 1 huazhou huazhou 469335 Jan  9 06:11 screenshot_top.png
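The other wildcards can be explored the same way. The sketch below creates a few empty files (the names are made up) and lets echo show what each pattern expands to.

```shell
start_dir="$(pwd)"
glob_dir="$(mktemp -d)"
cd "$glob_dir"
touch a1.png a2.png b1.png ab.txt

any_png="$(echo *.png)"          # * matches any string: all three png files
one_char="$(echo a?.png)"        # ? matches exactly one character after "a"
in_set="$(echo [ab]1.png)"       # [ab] matches either a or b
not_in_set="$(echo [!a]*.png)"   # [!a] matches any first character except a

echo "*.png      -> $any_png"
echo "a?.png     -> $one_char"
echo "[ab]1.png  -> $in_set"
echo "[!a]*.png  -> $not_in_set"

cd "$start_dir"
rm -r "$glob_dir"
```

The shell expands each pattern into the matching file names before echo runs, which is why echo is a convenient way to preview what a command such as rm or cp would receive.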

Regular expression

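  • Wildcards are related to, but not the same as, regular expressions: the shell expands wildcards into file names, while tools such as grep, sed, and awk match regular expressions against text.

  • Regular expressions are a powerful tool for efficiently sifting through large amounts of text: record linking, data cleaning, scraping data from websites or other data feeds.

  • Search for a regular expression tutorial to learn more.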
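A few regular expressions applied with grep give a feel for the syntax. This is a minimal sketch; the sample text reuses file names from the listings above, and the variable names are illustrative.

```shell
# Sample text: one file name per line.
file_list="$(printf '%s\n' linux.Rmd linux.html screenshot_top.png key_authentication_1.png)"

# $ anchors the match at the end of the line; \. is a literal dot.
png_count="$(printf '%s\n' "$file_list" | grep -c '\.png$')"

# ^ anchors the match at the beginning of the line.
linux_count="$(printf '%s\n' "$file_list" | grep -c '^linux')"

# + (extended syntax, hence -E) means one or more of the preceding pattern.
numbered_png="$(printf '%s\n' "$file_list" | grep -E '_[0-9]+\.png$')"

echo "lines ending in .png: $png_count"
echo "lines starting with linux: $linux_count"
echo "png files ending in _<digits>: $numbered_png"
```

Single quotes around each pattern keep the shell from treating characters such as * or $ as its own syntax before grep sees them.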

Work with text files

View/peek text files

  • cat prints the contents of a file:

    cat linux.Rmd
    ## ---
    ## title: "Linux Basics"
    ## author: "Dr. Hua Zhou"
    ## date: "Jan 11, 2018"
    ## output: ioslides_presentation
    ## subtitle: Biostat M280
    ## bibliography: ../bib-HZ.bib
    ## ---
    ## 
    ## ## Why Linux
    ## 
    ## Linux is _the_ most common platform for scientific computing.
    ## 
    ## - Open source and community support.
    ## 
    ## - Things break; when they break using Linux, it's easy to fix.
    ## 
    ## - Scalability: portable devices (Android, iOS), laptops, servers, clusters, and super computers.
    ##     - E.g. UCLA Hoffmann2 cluster runs on Linux.
    ## 
    ## - Cost: it's free!
    ## 
    ## ## [Distributions of Linux](http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg)
    ## 
    ## - Debian/Ubuntu is a popular choice for personal computers.
    ## 
    ## - RHEL/CentOS is popular on servers.
    ## 
    ## - The teaching server for this class runs CentOS 7.
    ## 
    ## - Mac OS was originally derived from Unix/Linux (Darwin kernel). It is POSIX compliant. Most shell commands we review here apply to Mac OS terminal as well. Windows/DOS, unfortunately, is a totally different breed.
    ## 
    ## ## {.smaller}
    ## 
    ## - Show distribution/version on Linux:
    ##     ```{bash}
    ##     cat /etc/*-release
    ##     ```
    ## 
    ## ----
    ## 
    ## - Show distribution/version on Mac:
    ##     ```{bash, eval=FALSE}
    ##     sw_vers -productVersion
    ##     ```
    ## or
    ##     ```{bash, eval=FALSE}
    ##     system_profiler SPSoftwareDataType
    ##     ```
    ## 
    ## # Linux shells
    ## 
    ## ## Linux shells
    ## 
    ## - A shell translates commands to OS instructions.
    ## 
    ## - Most commonly used shells include `bash`, `csh`, `tcsh`, `zsh`, etc.
    ## 
    ## - Sometimes a script or a command does not run simply because it's written for another shell.
    ## 
    ## - We mostly use `bash` shell commands in this class.
    ## 
    ## ----
    ## 
    ## - Determine the current shell:
    ##     ```{bash}
    ##     echo $SHELL
    ##     ```
    ## 
    ## - List available shells:
    ##     ```{bash}
    ##     cat /etc/shells
    ##     ```
    ## 
    ## ----
    ## 
    ## - Change to another shell:
    ##     ```{bash, eval=FALSE}
    ##     exec bash -l
    ##     ```
    ## The `-l` option indicates it should be a login shell.
    ## 
    ## - Change your login shell permanently:
    ##     ```{bash, eval=FALSE}
    ##     chsh -s /bin/bash userid
    ##     ```
    ## Then log out and log in.
    ## 
    ## ## Bash completion
    ## 
    ## Bash provides the following standard completion for the Linux users by default. Much less typing errors and time!  
    ## 
    ## - Pathname completion.  
    ## 
    ## - Filename completion.  
    ## 
    ## - Variablename completion: `echo $[TAB][TAB]`.  
    ## 
    ## - Username completion: `cd ~[TAB][TAB]`.
    ## 
    ## - Hostname completion `ssh huazhou@[TAB][TAB]`.
    ## 
    ## - It can also be customized to auto-complete other stuff such as options and command's arguments. Google `bash completion` for more information.
    ## 
    ## # Navigate file system
    ## 
    ## ## Linux directory structure
    ## 
    ## <p align="center">
    ##   <img src="./linux_directory_structure.png" height="300" width="450">
    ## </p>
    ## - Upon log in, user is at his/her home directory.
    ## 
    ## ## Move around the file system {.smaller}
    ## 
    ## - `pwd` prints absolute path to the current working directory:
    ##     ```{bash}
    ##     pwd
    ##     ```
    ## 
    ## - `ls` lists contents of a directory:  
    ##     ```{bash}
    ##     ls
    ##     ```
    ## 
    ## ## {.smaller}
    ## 
    ## - `ls -l` lists detailed contents of a directory:  
    ##     ```{bash}
    ##     ls -l
    ##     ```
    ## 
    ## ## {.smaller}
    ## 
    ## - `ls -al` lists all contents of a directory, including those start with `.` (hidden folders):
    ##     ```{bash, small=TRUE}
    ##     ls -al
    ##     ```
    ## 
    ## ----
    ## 
    ## - `..` denotes the parent of current working directory.
    ## 
    ## - `.` denotes the current working directory.
    ## 
    ## - `~` denotes user's home directory.
    ## 
    ## - `/` denotes the root directory.
    ## 
    ## - `cd ..` changes to parent directory.
    ## 
    ## - `cd` or `cd ~` changes to home directory.
    ## 
    ## - `cd /` changes to root directory.
    ## 
    ## <!-- 
    ## - `pushd` changes the working directory but pushes the current directory into a stack.
    ## 
    ## - `popd` changes the working directory to the last directory added to the stack.
    ## -->
    ## 
    ## ## File permissions
    ## 
    ## <p align="center">
    ##   <img src="./linux_filepermission.png" height="125">
    ## </p>
    ## 
    ## <p align="center">
    ##   <img src="./linux_filepermission_oct.png" height="250">
    ## </p>
    ## 
    ## ----
    ## 
    ## - `chmod g+x file` makes a file executable to group members.
    ## 
    ## - `chmod 751 file` sets permission `rwxr-x--x` to a file.
    ## 
    ## - `groups userid` shows which group(s) a user belongs to:
    ##     ```{bash}
    ##     groups huazhou
    ##     ```
    ## 
    ## ## Manipulate files and directories
    ## 
    ## - `cp` copies file to a new location.
    ## 
    ## - `mv` moves file to a new location.
    ## 
    ## - `touch` creates a text file; if file already exists, it's left unchanged.
    ## 
    ## - `rm` deletes a file.
    ## 
    ## - `mkdir` creates a new directory.
    ## 
    ## - `rmdir` deletes an _empty_ directory.
    ## 
    ## - `rm -rf` deletes a directory and all contents in that directory (be cautious using the `-f` option ...).
    ## 
    ## ## Find files {.smaller}
    ## 
    ## - `locate` locates a file by name:
    ##     ```{bash}
    ##     locate linux.Rmd
    ##     ```
    ## 
    ## - `which` locates a program:
    ##     ```{bash}
    ##     which R
    ##     ```
    ## 
    ## ----
    ## 
    ## - `find` is similar to `locate` but has more functionalities, e.g., select files by age, size, permissions, .... , and is ubiquitous.
    ##     ```{bash}
    ##     find linux.Rmd
    ##     ```
    ##     ```{bash}
    ##     find /home/huazhou -name linux.Rmd
    ##     ```
    ## 
    ## ## Wildcard characters {.smaller}
    ## 
    ## | Wildcard   | Matches                             |
    ## |------------|-------------------------------------|
    ## | `?`        | any single character                |
    ## | `*`        | any character 0 or more times       |
    ## | `+`        | one or more preceding pattern       |
    ## | `^`        | beginning of the line               |
    ## | `$`        | end of the line                     |
    ## | `[set]`    | any character in set                |
    ## | `[!set]`   | any character not in set            |
    ## | `[a-z]`    | any lowercase letter                |
    ## | `[0-9]`    | any number (same as `[0123456789]`) |
    ## 
    ## ## {.smaller}
    ## 
    ## -
    ##     ```{bash}
    ##     # all png files in current folder
    ##     ls -l *.png
    ##     ```
    ## 
    ## ## Regular expression
    ## 
    ## - Wildcards are examples of _regular expressions_. 
    ## 
    ## - Regular expressions are a powerful tool to efficiently sift through large amounts of text: record linking, data cleaning, scraping data from website or other data-feed. 
    ## 
    ## - Google `regular expressions` to learn.
    ## 
    ## # Work with text files
    ## 
    ## ## View/peek text files
    ## 
    ## - `cat` prints the contents of a file:
    ##     ```{bash, size='smallsize'}
    ##     cat linux.Rmd
    ##     ```
    ## 
    ## ----
    ## 
    ## - `head -l` prints the first $l$ lines of a file:
    ##     ```{bash}
    ##     head linux.Rmd
    ##     ```
    ## 
    ## ----
    ## 
    ## - `tail -l` prints the last $l$ lines of a file:
    ##     ```{bash}
    ##     tail linux.Rmd
    ##     ```
    ## 
    ## ## `less` is more; `more` is less
    ## 
    ## - `more` browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the `q` key.
    ## 
    ## - `less` is also a pager, but has more functionalities, e.g., scroll upwards and downwards through the input.
    ## 
    ## - `less` doesn't need to read the whole file, i.e., it loads files faster than `more`.
    ## 
    ## ## `grep`
    ## 
    ## `grep` prints lines that match an expression:
    ## 
    ## - Show lines that contain string `CentOS`:
    ##     ```{bash}
    ##     # quotes not necessary if not a regular expression
    ##     grep 'CentOS' linux.Rmd
    ##     ```
    ## 
    ## ----
    ## 
    ## - Search multiple text files:
    ##     ```{bash}
    ##     grep 'CentOS' *.Rmd
    ##     ```
    ## 
    ## ----
    ## 
    ## - Show matching line numbers:
    ##     ```{bash}
    ##     grep -n 'CentOS' linux.Rmd
    ##     ```
    ## 
    ## ----
    ## 
    ## - Find all files in current directory with `.png` extension:
    ##     ```{bash}
    ##     ls | grep '\.png$'
    ##     ```
    ## 
    ## - Find all directories in the current directory:
    ##     ```{bash}
    ##     ls -al | grep '^d'
    ##     ```
    ## 
    ## ## `sed`
    ## 
    ## - `sed` is a stream editor.
    ## 
    ## - Replace `CentOS` by `RHEL` in a text file:
    ##     ```{bash}
    ##     sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
    ##     ```
    ## 
    ## ## `awk`
    ## 
    ## - `awk` is a filter and report writer.
    ## 
    ## - Print sorted list of login names:
    ##     ```{bash}
    ##     awk -F: '{ print $1 }' /etc/passwd | sort | head -5
    ##     ```
    ## 
    ## ----
    ##     
    ## - Print number of lines in a file, as `NR` stands for Number of Rows:
    ##     ```{bash}
    ##     awk 'END { print NR }' /etc/passwd
    ##     ```
    ## or
    ##     ```{bash}
    ##     wc -l /etc/passwd
    ##     ```
    ## or
    ##     ```{bash}
    ##     wc -l < /etc/passwd
    ##     ```
    ## 
    ## ----
    ## 
    ## - Print login names with UID in range `1000-1035`:
    ##     ```{bash}
    ##     awk -F: '{if ($3 >= 1000 && $3 <= 1035) print}' /etc/passwd
    ##     ```
    ##     
    ## - Print login names and log-in shells in comma-seperated format:
    ##     ```{bash}
    ##     awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd
    ##     ```
    ## 
    ## ----
    ## 
    ## - Print login names and indicate those with UID>1000 as `vip`:
    ##     ```{bash}
    ##     awk -F: -v status="" '{OFS = ","} 
    ##     {if ($3 >= 1000) status="vip"; else status="regular"} 
    ##     {print $1, status}' /etc/passwd
    ##     ```
    ## 
    ## ## Piping and redirection
    ## 
    ## - `|` sends output from one command as input of another command.
    ## 
    ## - `>` directs output from one command to a file.
    ## 
    ## - `>>` appends output from one command to a file.
    ## 
    ## - `<` reads input from a file.
    ## 
    ## - Combinations of shell commands (`grep`, `sed`, `awk`, ...), piping and redirection, and regular expressions allow us pre-process and reformat huge text files efficiently. 
    ## 
    ## - See HW1.
    ## 
    ## ## Text editors
    ## 
    ## <p align="center">
    ##   <img src="./Richard_Stallman_2013.png" height="400">
    ## </p>
    ## Source: [Editor War](http://en.wikipedia.org/wiki/Editor_war) on Wikipedia.
    ## 
    ## ## Emacs
    ## 
    ## - `Emacs` is a powerful text editor with extensive support for many languages including `R`, $\LaTeX$, `python`, and `C/C++`; however it's _not_ installed by default on many Linux distributions. 
    ## 
    ## - Basic survival commands:
    ##     - `emacs filename` to open a file with emacs.  
    ##     - `CTRL-x CTRL-f` to open an existing or new file.  
    ##     - `CTRL-x CTRX-s` to save.  
    ##     - `CTRL-x CTRL-w` to save as.  
    ##     - `CTRL-x CTRL-c` to quit.
    ## 
    ## ----
    ## 
    ## - Google `emacs cheatsheet`
    ## 
    ## <p align="center">
    ##   <img src="./Emacs_Reference_Card.pdf" height="400">
    ## </p>
    ## 
    ## `C-<key>` means hold the `control` key, and press `<key>`.  
    ## `M-<key>` means press the `Esc` key once, and press `<key>`.
    ## 
    ## ## Vi
    ## 
    ## - `Vi` is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can edit nothing on some clusters. 
    ## 
    ## - Basic survival commands:
    ##     - `vi filename` to start editing a file.
    ##     - `vi` is a _modal_ editor: _insert_ mode and _normal_ mode. Pressing `i` switches from the normal mode to insert mode. Pressing `ESC` switches from the insert mode to normal mode.  
    ##     - `:x<Return>` quits `vi` and saves changes.  
    ##     - `:q!<Return>` quits vi without saving latest changes.  
    ##     - `:w<Return>` saves changes.
    ##     - `:wq<Return>` quits `vi` and saves changes.      
    ## 
    ## ----
    ## 
    ## - Google `vi cheatsheet`
    ## 
    ## <p align="center">
    ##   <img src="./Vi_Cheat_Sheet.pdf" height="500">
    ## </p>
    ## 
    ## ## IDE (Integrated Development Environment)
    ## 
    ## - Statisticians write a lot of code. Critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within editor, debugging, profiling, version control, etc.
    ## 
    ## - R Studio, Eclipse, Emacs, Matlab, Visual Studio, etc.
    ## 
    ## # Processes
    ## 
    ## ## Processes
    ## 
    ## - OS runs processes on behalf of user.
    ## 
    ## - Each process has Process ID (PID), Username (UID), Parent process ID (PPID), Time and data process started (STIME), time running (TIME), etc.
    ## 
    ##     ```{bash}
    ##     ps
    ##     ```
    ## 
    ## ----
    ## 
    ## - All current running processes:
    ##     ```{bash}
    ##     ps -eaf
    ##     ```
    ## 
    ## ----
    ## 
    ## - All Python processes:
    ##     ```{bash}
    ##     ps -eaf | grep python
    ##     ```
    ## 
    ## ----
    ## 
    ## - Process with PID=1:
    ##     ```{bash}
    ##     ps -fp 1
    ##     ```
    ## 
    ## ----
    ## 
    ## - All processes owned by a user:
    ##     ```{bash}
    ##     ps -fu huazhou
    ##     ```
    ## 
    ## ## Kill processes
    ## 
    ## - Kill process with PID=1001:
    ##     ```{bash, eval=FALSE}
    ##     kill 1001
    ##     ```
    ## 
    ## - Kill all R processes.
    ##     ```{bash, eval=FALSE}
    ##     killall -r R
    ##     ```
    ## 
    ## ## `top`
    ## 
    ## - `top` prints realtime process information (very useful).
    ##     ```{bash, eval=FALSE}
    ##     top
    ##     ```
    ##     
    ## <p align="center">
    ##   <img src="./screenshot_top.png" height="400">
    ## </p>
    ## 
    ## # Secure shell (SSH)
    ## 
    ## ## SSH
    ## 
    ## SSH (secure shell) is the dominant cryptographic network protocol for secure network connection via an insecure network.
    ## 
    ## - On Linux or Mac, access the teaching server by
    ##     ```{bash, eval=FALSE}
    ##     ssh username@35.227.165.60
    ##     ```
    ## 
    ## - Windows machines need the [PuTTY](http://www.putty.org) program (free).
    ## 
    ## ## Use keys over password
    ## 
    ## - Key authentication is more secure than password. Most passwords are weak.
    ## 
    ## - Script or a program may need to systematically SSH into other machines.
    ## 
    ## - Log into multiple machines using the same key.
    ## 
    ## - Seamless use of many services: Git, svn, Amazon EC2 cloud service, parallel computing on multiple hosts, etc.
    ## 
    ## - Many servers only allow key authentication and do not accept password authentication.
    ## 
    ## ## Key authentication
    ## 
    ## <p align="center">
    ##   <img src="./key_authentication_1.png" height="200">
    ## </p>
    ## 
    ## <p align="center">
    ##   <img src="./key_authentication_2.png" height="250">
    ## </p>
    ## 
    ## ----
    ## 
    ## - _Public key_. Put on the machine(s) you want to log in.
    ## 
    ## - _Private key_. Put on your own computer. Consider this as the actual key in your pocket; never give to others.
    ## 
    ## - Messages from server to your computer is encrypted with your public key. It can only be decrypted using your private key.
    ## 
    ## - Messages from your computer to server is signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).
    ## 
    ## ## Steps for generate keys {.smaller}
    ## 
    ## - On Linux or Mac, to generate a key pair:
    ##     ```{bash, eval=FALSE}
    ##     ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
    ##     ```   
    ##     - `[KEY_FILENAME]` is the name that you want to use for your SSH key files. For example, a filename of `my-ssh-key` generates a private key file named `my-ssh-key` and a public key file named `my-ssh-key.pub`.  
    ##     
    ##     - `[USERNAME]` is the user for whom you will apply this SSH key.   
    ##     
    ##     - Use a (optional) paraphrase different form password.  
    ##     
    ## - Set correct permissions on the `.ssh` folder and key files
    ##     ```{bash, eval=FALSE}
    ##     chmod 400 ~/.ssh/[KEY_FILENAME]
    ##     ```
    ## 
    ## ----
    ## 
    ## - Append the public key to the `~/.ssh/authorized_keys` file of any Linux machine we want to SSH to, e.g.,
    ##     ```{bash, eval=FALSE}
    ##     ssh-copy-id -i ~/.ssh/[KEY_FILENAME] [USERNAME]@35.227.165.60
    ##     ```
    ## 
    ## - Test your new key.
    ##     ```{bash, eval=FALSE}
    ##     ssh -i ~/.ssh/[KEY_FILENAME] [USERNAME]@35.227.165.60
    ##     ```
    ## 
    ## - Now you don't need password each time you connect from your machine to the teaching server.
    ## 
    ## ----
    ## 
    ## - If you set paraphrase when generating keys, you'll be prompted for the paraphrase each time the private key is used. Avoid repeatedly entering the paraphrase by using `ssh-agent` on Linux/Mac or Pagent on Windows.
    ## 
    ## - Same key pair can be used between any two machines. We don't need to regenerate keys for each new connection.
    ## 
    ## - For Windows users, the private key generated by `ssh-keygen` cannot be directly used by PuTTY; use PuTTYgen for conversion. Then let PuTTYgen use the converted private key. Read [tutorial](https://www.digitalocean.com/community/tutorials/how-to-create-ssh-keys-with-putty-to-connect-to-a-vps).
    ## 
    ## ## Transfer files between machines
    ## 
    ## - `scp` securely transfers files between machines using SSH.
    ##     ```{bash, eval=FALSE}
    ##     ## copy file from local to remote
    ##     scp localfile username@35.227.165.60:/pathtofolder
    ##     ```
    ##     ```{bash, eval=FALSE}
    ##     ## copy file from remote to local
    ##     scp username@35.227.165.60:/pathtofile pathtolocalfolder
    ##     ```
    ## 
    ## - `sftp` is FTP via SSH.
    ## 
    ## - GUIs for Windows (WinSCP) or Mac (Cyberduck).
    ## 
    ## - (My preferred way) Use a **version control system** to sync project files between different machines and systems.
    ## 
    ## ## Line breaks in text files
    ## 
    ## - Windows uses a pair of `CR` and `LF` for line breaks. 
    ## 
    ## - Linux/Unix uses an `LF` character only. 
    ## 
    ## - MacOS X also uses a single `LF` character. But old Mac OS used a single `CR` character for line breaks. 
    ## 
    ## - If transferred in binary mode (bit by bit) between OSs, a text file could look a mess. 
    ## 
    ## - Most transfer programs automatically switch to text mode when transferring text files and perform conversion of line breaks between different OSs; but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly a text file is being transferred.
    ## 
    ## # Run R in Linux
    ## 
    ## ## Interactive mode
    ## 
    ## - Start R in the interactive mode by typing `R` in shell.
    ## 
    ## - Then run R script by
    ##     ```{r, eval=FALSE}
    ##     source("script.R")
    ##     ```
    ## 
    ## ## Batch mode {.smaller}
    ## 
    ## - Demo script [`meanEst.R`](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/meanEst.R) implements an (terrible) estimator of mean
    ## $$
    ##   {\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{x_i \text{ is prime}}}{\sum_{i=1}^n 1_{x_i \text{ is prime}}}.
    ## $$
    ##     ```{bash, echo=FALSE}
    ##     cat meanEst.R
    ##     ```
    ## 
    ## ----
    ## 
    ## - To run your R code non-interactively aka in batch mode, we have at least two options:
    ##     ```{bash, eval=FALSE}
    ##     # default output to meanEst.Rout
    ##     R CMD BATCH meanEst.R
    ##     ```
    ## or
    ##     ```{bash, eval=FALSE}
    ##     # output to stdout
    ##     Rscript meanEst.R
    ##     ```
    ## 
    ## - Typically automate batch calls using a scripting language, e.g., Python, perl, and shell script.
    ## 
    ## ## Pass arguments to R scripts
    ## 
    ## - Specify arguments in `R CMD BATCH`:
    ##     ```{bash, eval=FALSE}
    ##     R CMD BATCH '--args mu=1 sig=2 kap=3' script.R
    ##     ```
    ## 
    ## - Specify arguments in `Rscript`:
    ##     ```{bash, eval=FALSE}
    ##     Rscript script.R mu=1 sig=2 kap=3
    ##     ```
    ## 
    ## - Parse command line arguments using magic formula
    ##     ```{r, eval=FALSE}
    ##     for (arg in commandArgs(T)) {
    ##       eval(parse(text=arg))
    ##     }
    ##     ```
    ## in R script. After calling the above code, all command line arguments will be available in the global namespace.
    ## 
    ## ---- 
    ## 
    ## - To understand the magic formula `commandArgs`, run R by:
    ##     ```{bash, eval=FALSE}
    ##     R '--args mu=1 sig=2 kap=3'
    ##     ```
    ## and then issue commands in R
    ##     ```{r, eval=FALSE}
    ##     commandArgs()
    ##     commandArgs(TRUE)
    ##     ```
    ## 
    ## ----
    ## 
    ## - Understand the magic formula `parse` and `eval`:
    ##     ```{r, error=TRUE}
    ##     rm(list=ls())
    ##     print(x)
    ##     parse(text="x=3")
    ##     eval(parse(text="x=3"))
    ##     print(x)
    ##     ```
    ## 
    ## ## {.smaller}
    ## 
    ## - [`runSim.R`](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/runSim.R) has components: (1) method implementation, (2) data generator with unspecified parameter `n`, (3) estimation based on generated data, and (4) **command argument parser**.
    ## ```{bash, echo=FALSE}
    ## cat runSim.R
    ## ```
    ## 
    ## ----
    ## 
    ## - Call `runSim.R` with sample size `n=100`:
    ##     ```{bash}
    ##     R CMD BATCH '--args n=100' runSim.R
    ##     ```
    ## or
    ##     ```{bash}
    ##     Rscript runSim.R n=100
    ##     ```
    ## 
    ## ## Run long jobs
    ## 
    ## - Many statistical computing tasks take long: simulation, MCMC, etc.
    ## 
    ## - `nohup` command in Linux runs program(s) immune to hangups and writes output to `nohup.out` by default. Logging out will _not_ kill the process; we can log in later to check status and results.
    ## 
    ## - `nohup` is POSIX standard thus available on Linux and MacOS.
    ## 
    ## - Run `runSim.R` in background and writes output to `nohup.out`:
    ##     ```{bash}
    ##     nohup Rscript runSim.R n=100 &
    ##     ```
    ## 
    ## ## screen
    ## 
    ## - `screen` is another popular utility, but not installed by default. 
    ## 
    ## - Typical workflow using `screen`.
    ## 
    ##     0. Access remote server using `ssh`.
    ## 
    ##     0. Start jobs in batch mode.
    ## 
    ##     0. Detach jobs.
    ## 
    ##     0. Exit from server, wait for jobs to finish.
    ## 
    ##     0. Access remote server using `ssh`.
    ## 
    ##     0. Re-attach jobs, check on progress, get results, etc.
    ## 
    ## ## Use R to call R
    ## 
    ## R in conjuction with `nohup` or `screen` can be used to orchestrate a large simulation study.
    ## 
    ## - It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation.
    ## 
    ## - We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc.
    ## 
    ## - Python in many ways makes a better _glue_; we may discuss this later in the course.
    ## 
    ## ----
    ## 
    ## - Suppose we have 
    ##     - [`runSim.R`](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/runSim.R) which runs a simulation based on command line argument `n`.  
    ##     - A large collection of `n` values that we want to use in our simulation study.  
    ##     - Access to a server with 128 cores.  
    ##     
    ## - Option 1: manually call `runSim.R` for each setting.
    ## 
    ## - Option 2: automate calls using R and `nohup`. [autoSim.R](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/autoSim.R)
    ## 
    ## ----
    ## 
    ## -
    ##     ```{bash}
    ##     cat autoSim.R
    ##     ```
    ## 
    ## ----
    ## 
    ## -
    ##     ```{bash}
    ##     Rscript autoSim.R
    ##     ```
    ## 
    ##     ```{bash, echo=FALSE, eval=TRUE}
    ##     rm n*.txt *.Rout
    ##     ```
    ##     
    ## - Now we just need write a script to collect results from the output files.

  • head prints the first 10 lines of a file by default; head -n l prints the first \(l\) lines:

    head linux.Rmd
    ## ---
    ## title: "Linux Basics"
    ## author: "Dr. Hua Zhou"
    ## date: "Jan 11, 2018"
    ## output: ioslides_presentation
    ## subtitle: Biostat M280
    ## bibliography: ../bib-HZ.bib
    ## ---
    ## 
    ## ## Why Linux

  • tail prints the last 10 lines of a file by default; tail -n l prints the last \(l\) lines:

    tail linux.Rmd
    ## -
    ##     ```{bash}
    ##     Rscript autoSim.R
    ##     ```
    ## 
    ##     ```{bash, echo=FALSE, eval=TRUE}
    ##     rm n*.txt *.Rout
    ##     ```
    ##     
    ## - Now we just need write a script to collect results from the output files.
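
To see the effect of the line-count option without depending on linux.Rmd, here is a minimal sketch on a scratch file. The file name and its contents (the numbers 1 to 20) are assumptions made only for illustration.

```shell
# Build a 20-line scratch file so the example does not depend on linux.Rmd.
tmpfile=$(mktemp)
seq 1 20 > "$tmpfile"

head "$tmpfile" | wc -l          # no option: head prints the first 10 lines
first3=$(head -n 3 "$tmpfile")   # first 3 lines
last2=$(tail -n 2 "$tmpfile")    # last 2 lines
echo "$first3"
echo "$last2"
rm -f "$tmpfile"
```

Piping the two together, e.g. head -n 15 file | tail -n 5, is a common way to pull out a range of lines from the middle of a file.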

less is more; more is less

  • more browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the q key.

  • less is also a pager, but has more functionality, e.g., scrolling both upwards and downwards through the input.

  • less doesn't need to read the whole file before displaying it, so it starts quickly even on large files.

grep

grep prints lines that match an expression:

  • Show lines that contain string CentOS:

    # quotes not necessary if not a regular expression
    grep 'CentOS' linux.Rmd
    ## - RHEL/CentOS is popular on servers.
    ## - The teaching server for this class runs CentOS 7.
    ## - Show lines that contain string `CentOS`:
    ##     grep 'CentOS' linux.Rmd
    ##     grep 'CentOS' *.Rmd
    ##     grep -n 'CentOS' linux.Rmd
    ## - Replace `CentOS` by `RHEL` in a text file:
    ##     sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL

  • Search multiple text files:

    grep 'CentOS' *.Rmd
    ## - RHEL/CentOS is popular on servers.
    ## - The teaching server for this class runs CentOS 7.
    ## - Show lines that contain string `CentOS`:
    ##     grep 'CentOS' linux.Rmd
    ##     grep 'CentOS' *.Rmd
    ##     grep -n 'CentOS' linux.Rmd
    ## - Replace `CentOS` by `RHEL` in a text file:
    ##     sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL

  • Show matching line numbers:

    grep -n 'CentOS' linux.Rmd
    ## 27:- RHEL/CentOS is popular on servers.
    ## 29:- The teaching server for this class runs CentOS 7.
    ## 286:- Show lines that contain string `CentOS`:
    ## 289:    grep 'CentOS' linux.Rmd
    ## 296:    grep 'CentOS' *.Rmd
    ## 303:    grep -n 'CentOS' linux.Rmd
    ## 322:- Replace `CentOS` by `RHEL` in a text file:
    ## 324:    sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL

  • Find all files in current directory with .png extension:

    ls | grep '\.png$'
    ## key_authentication_1.png
    ## key_authentication_2.png
    ## linux_directory_structure.png
    ## linux_filepermission_oct.png
    ## linux_filepermission.png
    ## Richard_Stallman_2013.png
    ## screenshot_top.png
  • Find all directories in the current directory:

    ls -al | grep '^d'
    ## drwxrwxr-x. 2 huazhou huazhou    4096 Jan 16 16:10 .
    ## drwxrwxr-x. 6 huazhou huazhou      87 Jan 16 03:23 ..
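
A few more grep options are worth knowing. The sketch below runs on a three-line scratch file; the first two lines are taken from these slides and the third is a made-up line added only to show case-insensitive matching.

```shell
tmpfile=$(mktemp)
printf '%s\n' \
  'RHEL/CentOS is popular on servers.' \
  'Debian/Ubuntu is a popular choice for personal computers.' \
  'centos in lower case (hypothetical line)' > "$tmpfile"

grep -n 'CentOS' "$tmpfile"                # matching lines with line numbers
n_exact=$(grep -c 'CentOS' "$tmpfile")     # -c counts matching lines (case-sensitive)
n_nocase=$(grep -ci 'centos' "$tmpfile")   # -i ignores case
n_start=$(grep -c '^Debian' "$tmpfile")    # ^ anchors the match at the start of a line
echo "$n_exact $n_nocase $n_start"
rm -f "$tmpfile"
```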

sed

  • sed is a stream editor.

  • Replace the first CentOS on each line by RHEL (sed writes the result to stdout; the file itself is unchanged):

    sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
    ## - RHEL/RHEL is popular on servers.
    ## - The teaching server for this class runs RHEL 7.
    ## - Show lines that contain string `RHEL`:
    ##     grep 'RHEL' linux.Rmd
    ##     grep 'RHEL' *.Rmd
    ##     grep -n 'RHEL' linux.Rmd
    ## - Replace `RHEL` by `RHEL` in a text file:
    ##     sed 's/RHEL/RHEL/' linux.Rmd | grep RHEL
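
By default the s command replaces only the first match on each line; adding the g flag replaces every match. A minimal sketch on a single made-up line (the sentence is an assumption for illustration):

```shell
line='RHEL/CentOS: CentOS 7 runs on the teaching server'

first_only=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/')    # first match only
all=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/g')          # g flag: every match
echo "$first_only"
echo "$all"
```

To keep the result, redirect the output to a new file rather than to the file being read.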

awk

  • awk is a filter and report writer.

  • Print sorted list of login names:

    awk -F: '{ print $1 }' /etc/passwd | sort | head -5
    ## adm
    ## andrewngu
    ## anorthrup
    ## bin
    ## bryanlin24

  • Print number of lines in a file; NR stands for Number of Records, and by default each line is one record:

    awk 'END { print NR }' /etc/passwd
    ## 59

    or

    wc -l /etc/passwd
    ## 59 /etc/passwd

    or

    wc -l < /etc/passwd
    ## 59

  • Print passwd entries with UID in range 1000-1035:

    awk -F: '{if ($3 >= 1000 && $3 <= 1035) print}' /etc/passwd
    ## huazhou:x:1000:1001::/home/huazhou:/bin/bash
    ## juhkim111:x:1001:1002::/home/juhkim111:/bin/bash
    ## jboland521:x:1033:1034::/home/jboland521:/bin/bash
    ## tdbufford:x:1034:1035::/home/tdbufford:/bin/bash
    ## emjcampos:x:1035:1036::/home/emjcampos:/bin/bash
  • Print login names and login shells in comma-separated format:

    awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd
    ## root,/bin/bash
    ## bin,/sbin/nologin
    ## daemon,/sbin/nologin
    ## adm,/sbin/nologin
    ## lp,/sbin/nologin
    ## sync,/bin/sync
    ## shutdown,/sbin/shutdown
    ## halt,/sbin/halt
    ## mail,/sbin/nologin
    ## operator,/sbin/nologin
    ## games,/sbin/nologin
    ## ftp,/sbin/nologin
    ## nobody,/sbin/nologin
    ## systemd-network,/sbin/nologin
    ## dbus,/sbin/nologin
    ## polkitd,/sbin/nologin
    ## ntp,/sbin/nologin
    ## postfix,/sbin/nologin
    ## sshd,/sbin/nologin
    ## chrony,/sbin/nologin
    ## huazhou,/bin/bash
    ## juhkim111,/bin/bash
    ## rstudio-server,/bin/bash
    ## jboland521,/bin/bash
    ## tdbufford,/bin/bash
    ## emjcampos,/bin/bash
    ## nanchen322,/bin/bash
    ## gaoshuang,/bin/bash
    ## sgoitom,/bin/bash
    ## tahlia.hodes,/bin/bash
    ## huiyuhu,/bin/bash
    ## luminghuang,/bin/bash
    ## sarahh.jii,/bin/bash
    ## makadlac,/bin/bash
    ## david.levy,/bin/bash
    ## zexuan55,/bin/bash
    ## lishanpeng0913,/bin/bash
    ## bryanlin24,/bin/bash
    ## xliu352,/bin/bash
    ## l4luo,/bin/bash
    ## mondals,/bin/bash
    ## andrewngu,/bin/bash
    ## anorthrup,/bin/bash
    ## chad.e.pickering,/bin/bash
    ## mdponzini,/bin/bash
    ## c.shaw,/bin/bash
    ## shendarrick821,/bin/bash
    ## ziyansong08,/bin/bash
    ## ericasu,/bin/bash
    ## nabbongoug,/bin/bash
    ## wangdy0536,/bin/bash
    ## katywang,/bin/bash
    ## zy.zhang,/bin/bash
    ## zhaokezk,/bin/bash
    ## tss,/sbin/nologin
    ## stjia,/bin/bash
    ## jayxu33,/bin/bash
    ## keydemo,/bin/bash
    ## jiayunli,/bin/bash

  • Print login names and indicate those with UID >= 1000 as vip:

    awk -F: -v status="" '{OFS = ","} 
    {if ($3 >= 1000) status="vip"; else status="regular"} 
    {print $1, status}' /etc/passwd
    ## root,regular
    ## bin,regular
    ## daemon,regular
    ## adm,regular
    ## lp,regular
    ## sync,regular
    ## shutdown,regular
    ## halt,regular
    ## mail,regular
    ## operator,regular
    ## games,regular
    ## ftp,regular
    ## nobody,regular
    ## systemd-network,regular
    ## dbus,regular
    ## polkitd,regular
    ## ntp,regular
    ## postfix,regular
    ## sshd,regular
    ## chrony,regular
    ## huazhou,vip
    ## juhkim111,vip
    ## rstudio-server,regular
    ## jboland521,vip
    ## tdbufford,vip
    ## emjcampos,vip
    ## nanchen322,vip
    ## gaoshuang,vip
    ## sgoitom,vip
    ## tahlia.hodes,vip
    ## huiyuhu,vip
    ## luminghuang,vip
    ## sarahh.jii,vip
    ## makadlac,vip
    ## david.levy,vip
    ## zexuan55,vip
    ## lishanpeng0913,vip
    ## bryanlin24,vip
    ## xliu352,vip
    ## l4luo,vip
    ## mondals,vip
    ## andrewngu,vip
    ## anorthrup,vip
    ## chad.e.pickering,vip
    ## mdponzini,vip
    ## c.shaw,vip
    ## shendarrick821,vip
    ## ziyansong08,vip
    ## ericasu,vip
    ## nabbongoug,vip
    ## wangdy0536,vip
    ## katywang,vip
    ## zy.zhang,vip
    ## zhaokezk,vip
    ## tss,regular
    ## stjia,vip
    ## jayxu33,vip
    ## keydemo,vip
    ## jiayunli,vip
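
The awk calls above can also be written in pattern-action form, where the action runs only on records that satisfy the pattern. The sketch below uses a three-line file in /etc/passwd format; the two user entries appear in the output above, while the bin entry is a hypothetical system account written out for illustration.

```shell
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
bin:x:1:1:bin:/bin:/sbin/nologin
huazhou:x:1000:1001::/home/huazhou:/bin/bash
juhkim111:x:1001:1002::/home/juhkim111:/bin/bash
EOF

# BEGIN runs once before any input; the pattern $3 >= 1000 selects records.
users=$(awk 'BEGIN { FS = ":"; OFS = "," } $3 >= 1000 { print $1, $7 }' "$tmpfile")
echo "$users"

# Count records per login shell with an associative array; END runs after all input.
counts=$(awk -F: '{ n[$7]++ } END { for (s in n) print s, n[s] }' "$tmpfile" | sort)
echo "$counts"
rm -f "$tmpfile"
```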

Piping and redirection

  • | sends output from one command as input of another command.

  • > directs output from one command to a file.

  • >> appends output from one command to a file.

  • < reads input from a file.

  • Combinations of shell commands (grep, sed, awk, …), piping and redirection, and regular expressions allow us to pre-process and reformat huge text files efficiently.

  • See HW1.
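
The four operators can be tried together in a few lines. This is a small sketch in a scratch directory; the file name out.txt and its contents are assumptions for illustration.

```shell
workdir=$(mktemp -d)
out="$workdir/out.txt"

printf 'b\na\n' > "$out"                 # > creates or overwrites the file
printf 'c\n'   >> "$out"                 # >> appends to it
n_lines=$(wc -l < "$out")                # < feeds the file to wc on stdin
sorted=$(sort < "$out" | tr '\n' ' ')    # | sends the output of sort into tr
echo "$n_lines"
echo "$sorted"
rm -rf "$workdir"
```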

Text editors

Emacs

  • Emacs is a powerful text editor with extensive support for many languages including R, \(\LaTeX\), python, and C/C++; however it's not installed by default on many Linux distributions.

  • Basic survival commands:
    • emacs filename to open a file with emacs.
    • CTRL-x CTRL-f to open an existing or new file.
    • CTRL-x CTRL-s to save.
    • CTRL-x CTRL-w to save as.
    • CTRL-x CTRL-c to quit.

  • Google emacs cheatsheet

C-<key> means hold the control key, and press <key>.
M-<key> means press the Esc key once, and press <key>.

Vi

  • Vi is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can edit nothing on some clusters.

  • Basic survival commands:
    • vi filename to start editing a file.
    • vi is a modal editor: insert mode and normal mode. Pressing i switches from the normal mode to insert mode. Pressing ESC switches from the insert mode to normal mode.
    • :x<Return> quits vi and saves changes.
    • :q!<Return> quits vi without saving latest changes.
    • :w<Return> saves changes.
    • :wq<Return> quits vi and saves changes.

  • Google vi cheatsheet

IDE (Integrated Development Environment)

  • Statisticians write a lot of code. Critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within editor, debugging, profiling, version control, etc.

  • RStudio, Eclipse, Emacs, Matlab, Visual Studio, etc.

Processes

Processes

  • OS runs processes on behalf of user.

  • Each process has a process ID (PID), owner (UID), parent process ID (PPID), start time or date (STIME), accumulated CPU time (TIME), etc.

    ps
    ##   PID TTY          TIME CMD
    ## 25251 ?        00:00:06 rsession
    ## 26572 ?        00:00:00 sshd
    ## 26715 ?        00:00:01 R
    ## 26798 ?        00:00:00 sh
    ## 26799 ?        00:00:00 ps

  • All currently running processes:

    ps -eaf
    ## UID        PID  PPID  C STIME TTY          TIME CMD
    ## root         1     0  0 Jan08 ?        00:00:15 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
    ## root         2     0  0 Jan08 ?        00:00:00 [kthreadd]
    ## root         3     2  0 Jan08 ?        00:00:00 [ksoftirqd/0]
    ## root         5     2  0 Jan08 ?        00:00:00 [kworker/0:0H]
    ## root         7     2  0 Jan08 ?        00:00:00 [migration/0]
    ## root         8     2  0 Jan08 ?        00:00:00 [rcu_bh]
    ## root         9     2  0 Jan08 ?        00:00:32 [rcu_sched]
    ## root        10     2  0 Jan08 ?        00:00:02 [watchdog/0]
    ## root        11     2  0 Jan08 ?        00:00:02 [watchdog/1]
    ## root        12     2  0 Jan08 ?        00:00:00 [migration/1]
    ## root        13     2  0 Jan08 ?        00:00:00 [ksoftirqd/1]
    ## root        15     2  0 Jan08 ?        00:00:00 [kworker/1:0H]
    ## root        16     2  0 Jan08 ?        00:00:02 [watchdog/2]
    ## root        17     2  0 Jan08 ?        00:00:01 [migration/2]
    ## root        18     2  0 Jan08 ?        00:00:00 [ksoftirqd/2]
    ## root        20     2  0 Jan08 ?        00:00:00 [kworker/2:0H]
    ## root        21     2  0 Jan08 ?        00:00:01 [watchdog/3]
    ## root        22     2  0 Jan08 ?        00:00:01 [migration/3]
    ## root        23     2  0 Jan08 ?        00:00:00 [ksoftirqd/3]
    ## root        25     2  0 Jan08 ?        00:00:00 [kworker/3:0H]
    ## root        27     2  0 Jan08 ?        00:00:00 [kdevtmpfs]
    ## root        28     2  0 Jan08 ?        00:00:00 [netns]
    ## root        29     2  0 Jan08 ?        00:00:00 [khungtaskd]
    ## root        30     2  0 Jan08 ?        00:00:00 [writeback]
    ## root        31     2  0 Jan08 ?        00:00:00 [kintegrityd]
    ## root        32     2  0 Jan08 ?        00:00:00 [bioset]
    ## root        33     2  0 Jan08 ?        00:00:00 [kblockd]
    ## root        34     2  0 Jan08 ?        00:00:00 [md]
    ## root        41     2  0 Jan08 ?        00:00:00 [kswapd0]
    ## root        42     2  0 Jan08 ?        00:00:00 [ksmd]
    ## root        43     2  0 Jan08 ?        00:00:03 [khugepaged]
    ## root        44     2  0 Jan08 ?        00:00:00 [crypto]
    ## root        52     2  0 Jan08 ?        00:00:00 [kthrotld]
    ## root        53     2  0 Jan08 ?        00:00:00 [kworker/u8:1]
    ## root        54     2  0 Jan08 ?        00:00:00 [kmpath_rdacd]
    ## root        55     2  0 Jan08 ?        00:00:00 [kpsmoused]
    ## root        57     2  0 Jan08 ?        00:00:00 [ipv6_addrconf]
    ## root        76     2  0 Jan08 ?        00:00:00 [deferwq]
    ## root       112     2  0 Jan08 ?        00:00:08 [kauditd]
    ## root       157     2  0 Jan08 ?        00:00:00 [virtscsi-scan]
    ## root       158     2  0 Jan08 ?        00:00:00 [scsi_eh_0]
    ## root       159     2  0 Jan08 ?        00:00:00 [scsi_tmf_0]
    ## root       160     2  0 Jan08 ?        00:00:04 [kworker/u8:2]
    ## root       184     2  0 Jan08 ?        00:00:00 [bioset]
    ## root       185     2  0 Jan08 ?        00:00:00 [xfsalloc]
    ## root       186     2  0 Jan08 ?        00:00:00 [xfs_mru_cache]
    ## root       187     2  0 Jan08 ?        00:00:00 [xfs-buf/sda1]
    ## root       188     2  0 Jan08 ?        00:00:00 [xfs-data/sda1]
    ## root       189     2  0 Jan08 ?        00:00:00 [xfs-conv/sda1]
    ## root       190     2  0 Jan08 ?        00:00:00 [xfs-cil/sda1]
    ## root       191     2  0 Jan08 ?        00:00:00 [xfs-reclaim/sda]
    ## root       192     2  0 Jan08 ?        00:00:00 [xfs-log/sda1]
    ## root       193     2  0 Jan08 ?        00:00:00 [xfs-eofblocks/s]
    ## root       194     2  0 Jan08 ?        00:02:30 [xfsaild/sda1]
    ## root       232     2  0 Jan08 ?        00:00:00 [kworker/1:1H]
    ## root       247     1  0 Jan08 ?        00:01:19 /usr/lib/systemd/systemd-journald
    ## root       272     1  0 Jan08 ?        00:00:00 /usr/lib/systemd/systemd-udevd
    ## root       301     1  0 Jan08 ?        00:00:15 /sbin/auditd
    ## root       356     2  0 Jan08 ?        00:00:00 [edac-poller]
    ## dbus       368     1  0 Jan08 ?        00:00:03 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
    ## root       377     2  0 Jan08 ?        00:00:02 [kworker/0:1H]
    ## root       381     1  0 Jan08 ?        00:00:03 /usr/lib/systemd/systemd-logind
    ## polkitd    382     1  0 Jan08 ?        00:00:00 /usr/lib/polkit-1/polkitd --no-debug
    ## root       384     1  0 Jan08 ?        00:00:00 /usr/sbin/acpid
    ## root       385     1  0 Jan08 ?        00:01:05 /usr/sbin/rsyslogd -n
    ## root       388     1  0 Jan08 ?        00:00:01 /usr/sbin/crond -n
    ## chrony     396     1  0 Jan08 ?        00:00:00 /usr/sbin/chronyd
    ## root       399     1  0 Jan08 tty1     00:00:00 /sbin/agetty --noclear tty1 linux
    ## root       400     1  0 Jan08 ttyS0    00:00:00 /sbin/agetty --keep-baud 115200 38400 9600 ttyS0 vt220
    ## root       465     1  0 Jan08 ?        00:00:01 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
    ## root       466     2  0 Jan08 ?        00:00:00 [kworker/3:1H]
    ## root       468     1  0 Jan08 ?        00:00:10 /usr/sbin/NetworkManager --no-daemon
    ## root       589   468  0 Jan08 ?        00:00:00 /sbin/dhclient -d -q -sf /usr/libexec/nm-dhcp-helper -pf /var/run/dhclient-eth0.pid -lf /var/lib/NetworkManager/dhclient-f233d74e-1487-4c65-9470-2ae07fbc1e62-eth0.lease -cf /var/lib/NetworkManager/dhclient-eth0.conf eth0
    ## root       824     1  0 Jan08 ?        00:01:09 /usr/bin/python -Es /usr/sbin/tuned -l -P
    ## root       926     1  0 Jan08 ?        00:00:14 /usr/bin/python /usr/bin/google_clock_skew_daemon
    ## root       927     1  0 Jan08 ?        00:00:23 /usr/bin/python /usr/bin/google_ip_forwarding_daemon
    ## root       928     1  0 Jan08 ?        00:00:45 /usr/bin/python /usr/bin/google_accounts_daemon
    ## root       988     1  0 Jan08 ?        00:00:04 /usr/libexec/postfix/master -w
    ## postfix    994   988  0 Jan08 ?        00:00:00 qmgr -l -t unix -u
    ## root      1002     2  0 Jan08 ?        00:00:00 [kworker/2:1H]
    ## root      2405     1  0 Jan08 ?        00:00:09 /usr/sbin/sshd -D
    ## rstudio+  6693     1  0 Jan08 ?        00:03:13 /usr/lib/rstudio-server/bin/rserver
    ## root     11759  6693  0 Jan12 ?        00:00:00 /usr/lib/rstudio-server/bin/rserver
    ## root     14041     2  0 Jan13 ?        00:00:02 [kworker/3:1]
    ## root     21425     2  0 11:15 ?        00:00:00 [kworker/3:2]
    ## postfix  25204   988  0 15:33 ?        00:00:00 pickup -l -t unix -u
    ## huazhou  25251  6693  0 15:39 ?        00:00:06 /usr/lib/rstudio-server/bin/rsession -u huazhou --launcher-token A858A787
    ## root     25411  2405  0 15:49 ?        00:00:00 sshd: juhkim111 [priv]
    ## juhkim1+ 25419 25411  0 15:49 ?        00:00:00 sshd: juhkim111@pts/0
    ## juhkim1+ 25420 25419  0 15:49 pts/0    00:00:00 -bash
    ## root     25566     2  0 15:55 ?        00:00:00 [kworker/2:2]
    ## root     25711     2  0 16:00 ?        00:00:00 [kworker/1:2]
    ## root     25751     2  0 16:00 ?        00:00:00 [kworker/0:1]
    ## root     25818     2  0 Jan15 ?        00:00:01 [kworker/0:0]
    ## root     25998     2  0 Jan11 ?        00:00:03 [kworker/1:1]
    ## root     26349     2  0 16:06 ?        00:00:00 [kworker/2:1]
    ## root     26561  2405  0 16:07 ?        00:00:00 sshd: huazhou [priv]
    ## huazhou  26572 26561  0 16:07 ?        00:00:00 sshd: huazhou@pts/1
    ## huazhou  26573 26572  0 16:07 pts/1    00:00:00 -bash
    ## root     26664     2  0 16:11 ?        00:00:00 [kworker/2:0]
    ## huazhou  26715 25251 73 16:15 ?        00:00:01 /usr/lib64/R/bin/exec/R --slave --no-save --no-restore -e rmarkdown::render('/home/huazhou/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/linux.Rmd',~+~~+~encoding~+~=~+~'UTF-8');
    ## huazhou  26800 26715  0 16:15 ?        00:00:00 sh -c 'bash'  -c 'ps -eaf' 2>&1
    ## huazhou  26801 26800  0 16:15 ?        00:00:00 ps -eaf

  • All Python processes:

    ps -eaf | grep python
    ## root       465     1  0 Jan08 ?        00:00:01 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
    ## root       824     1  0 Jan08 ?        00:01:09 /usr/bin/python -Es /usr/sbin/tuned -l -P
    ## root       926     1  0 Jan08 ?        00:00:14 /usr/bin/python /usr/bin/google_clock_skew_daemon
    ## root       927     1  0 Jan08 ?        00:00:23 /usr/bin/python /usr/bin/google_ip_forwarding_daemon
    ## root       928     1  0 Jan08 ?        00:00:45 /usr/bin/python /usr/bin/google_accounts_daemon
    ## huazhou  26802 26715  0 16:15 ?        00:00:00 sh -c 'bash'  -c 'ps -eaf | grep python' 2>&1
    ## huazhou  26803 26802  0 16:15 ?        00:00:00 bash -c ps -eaf | grep python
    ## huazhou  26805 26803  0 16:15 ?        00:00:00 grep python

  • Process with PID=1:

    ps -fp 1
    ## UID        PID  PPID  C STIME TTY          TIME CMD
    ## root         1     0  0 Jan08 ?        00:00:15 /usr/lib/systemd/systemd --switched-root --system --deserialize 21

  • All processes owned by a user:

    ps -fu huazhou
    ## UID        PID  PPID  C STIME TTY          TIME CMD
    ## huazhou  25251  6693  0 15:39 ?        00:00:06 /usr/lib/rstudio-server/bin/rsession -u huazhou --launcher-token A858A787
    ## huazhou  26572 26561  0 16:07 ?        00:00:00 sshd: huazhou@pts/1
    ## huazhou  26573 26572  0 16:07 pts/1    00:00:00 -bash
    ## huazhou  26715 25251 49 16:15 ?        00:00:01 /usr/lib64/R/bin/exec/R --slave --no-save --no-restore -e rmarkdown::render('/home/huazhou/github.com/Hua-Zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/linux.Rmd',~+~~+~encoding~+~=~+~'UTF-8');
    ## huazhou  26808 26715  0 16:16 ?        00:00:00 sh -c 'bash'  -c 'ps -fu huazhou' 2>&1
    ## huazhou  26809 26808  0 16:16 ?        00:00:00 ps -fu huazhou

Kill processes

  • Kill process with PID=1001:

    kill 1001
  • Kill all R processes.

    killall -r R
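
To practice safely, start a throwaway background job and kill that instead of a real process. In the sketch below, sleep 300 is only a stand-in for a long-running job; on a real system you would look up the PID with ps first.

```shell
sleep 300 &            # stand-in for a long-running job
pid=$!                 # $! holds the PID of the most recent background job
kill -0 "$pid" && echo "job $pid is running"   # signal 0 only tests that the PID exists

kill "$pid"            # send SIGTERM, the default signal
status=0
wait "$pid" 2>/dev/null || status=$?   # reap the job and record its exit status
echo "exit status: $status"            # 128 + 15 (SIGTERM)
# kill -9 "$pid" would send SIGKILL, which a process cannot catch; keep it as a last resort.
```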

top

  • top prints realtime process information (very useful).

    top

Secure shell (SSH)

SSH

SSH (secure shell) is the dominant cryptographic network protocol for secure network connection via an insecure network.

  • On Linux or Mac, access the teaching server by

    ssh username@35.227.165.60
  • Windows machines need the PuTTY program (free).

Use keys over password

  • Key authentication is more secure than password. Most passwords are weak.

  • Script or a program may need to systematically SSH into other machines.

  • Log into multiple machines using the same key.

  • Seamless use of many services: Git, svn, Amazon EC2 cloud service, parallel computing on multiple hosts, etc.

  • Many servers only allow key authentication and do not accept password authentication.

Key authentication

  • Public key. Put on the machine(s) you want to log in to.

  • Private key. Put on your own computer. Consider this as the actual key in your pocket; never give it to others.

  • Messages from the server to your computer are encrypted with your public key. They can only be decrypted using your private key.

  • Messages from your computer to the server are signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).

Steps to generate keys

  • On Linux or Mac, to generate a key pair:

    ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
    • [KEY_FILENAME] is the name that you want to use for your SSH key files. For example, a filename of my-ssh-key generates a private key file named my-ssh-key and a public key file named my-ssh-key.pub.

    • [USERNAME] is the user for whom you will apply this SSH key.

    • Use an (optional) passphrase that is different from your password.

  • Set correct permissions on the .ssh folder and key files:

    chmod 700 ~/.ssh
    chmod 400 ~/.ssh/[KEY_FILENAME]

  • Append the public key to the ~/.ssh/authorized_keys file of any Linux machine we want to SSH to, e.g.,

    ssh-copy-id -i ~/.ssh/[KEY_FILENAME] [USERNAME]@35.227.165.60
  • Test your new key.

    ssh -i ~/.ssh/[KEY_FILENAME] [USERNAME]@35.227.165.60
  • Now you don't need a password each time you connect from your machine to the teaching server.

  • If you set a passphrase when generating keys, you'll be prompted for the passphrase each time the private key is used. Avoid repeatedly entering the passphrase by using ssh-agent on Linux/Mac or Pageant on Windows.

  • Same key pair can be used between any two machines. We don't need to regenerate keys for each new connection.

  • For Windows users, the private key generated by ssh-keygen cannot be directly used by PuTTY; use PuTTYgen for conversion. Then let PuTTY use the converted private key. Read tutorial.
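
The permission step is easy to get wrong, and ssh will typically refuse a private key that other users can read. The sketch below rehearses the chmod calls in a scratch directory that stands in for your home directory; my-ssh-key is an empty placeholder file, not a real key.

```shell
demo_home=$(mktemp -d)                   # stands in for your home directory
mkdir "$demo_home/.ssh"
touch "$demo_home/.ssh/my-ssh-key"       # empty placeholder, not a real key

chmod 700 "$demo_home/.ssh"              # only the owner may enter the folder
chmod 400 "$demo_home/.ssh/my-ssh-key"   # only the owner may read the private key

dir_mode=$(ls -ld "$demo_home/.ssh" | cut -c1-10)
key_mode=$(ls -l "$demo_home/.ssh/my-ssh-key" | cut -c1-10)
echo "$dir_mode $key_mode"
rm -rf "$demo_home"
```

The first ten characters of the ls -l output are the file type and the permission bits for owner, group, and others.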

Transfer files between machines

  • scp securely transfers files between machines using SSH.

    ## copy file from local to remote
    scp localfile username@35.227.165.60:/pathtofolder
    ## copy file from remote to local
    scp username@35.227.165.60:/pathtofile pathtolocalfolder
  • sftp is FTP via SSH.

  • GUIs for Windows (WinSCP) or Mac (Cyberduck).

  • (My preferred way) Use a version control system to sync project files between different machines and systems.

Line breaks in text files

  • Windows uses a pair of CR and LF for line breaks.

  • Linux/Unix uses an LF character only.

  • MacOS X also uses a single LF character. But old Mac OS used a single CR character for line breaks.

  • If transferred in binary mode (bit by bit) between OSs, a text file can look like a mess because its line breaks are left unconverted.

  • Most transfer programs automatically switch to text mode when transferring text files and perform conversion of line breaks between different OSs; but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly a text file is being transferred.
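
When a file does arrive with Windows line breaks, they can be inspected and removed from the shell. A minimal sketch, assuming a two-line scratch file written with CR LF endings (the file names are made up for illustration):

```shell
workdir=$(mktemp -d)
printf 'line one\r\nline two\r\n' > "$workdir/dos.txt"   # Windows-style CR LF

od -c "$workdir/dos.txt"                                 # the dump shows each \r \n pair
tr -d '\r' < "$workdir/dos.txt" > "$workdir/unix.txt"    # delete every CR character

dos_bytes=$(wc -c < "$workdir/dos.txt")
unix_bytes=$(wc -c < "$workdir/unix.txt")
echo "$dos_bytes $unix_bytes"    # the converted file is one byte shorter per line
rm -rf "$workdir"
```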

Run R in Linux

Interactive mode

  • Start R in the interactive mode by typing R in shell.

  • Then run R script by

    source("script.R")

Batch mode

  • Demo script meanEst.R implements a (terrible) estimator of the mean \[ {\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{i \text{ is prime}}}{\sum_{i=1}^n 1_{i \text{ is prime}}}. \]

    ## ## check if a given integer is prime
    ## isPrime = function(n) {
    ##   if (n <= 3) {
    ##     return (TRUE)
    ##   }
    ##   if (any((n %% 2:floor(sqrt(n))) == 0)) {
    ##     return (FALSE)
    ##   }
    ##   return (TRUE)
    ## }
    ## 
    ## ## estimate mean only using observation with prime indices
    ## estMeanPrimes = function (x) {
    ##   n = length(x)
    ##   ind = sapply(1:n, isPrime)
    ##   return (mean(x[ind]))
    ## }
    ## 
    ## print(estMeanPrimes(rnorm(100000)))

  • To run your R code non-interactively aka in batch mode, we have at least two options:

    # default output to meanEst.Rout
    R CMD BATCH meanEst.R

    or

    # output to stdout
    Rscript meanEst.R
  • Typically we automate batch calls using a scripting language, e.g., Python, Perl, or shell script.

Pass arguments to R scripts

  • Specify arguments in R CMD BATCH:

    R CMD BATCH '--args mu=1 sig=2 kap=3' script.R
  • Specify arguments in Rscript:

    Rscript script.R mu=1 sig=2 kap=3
  • Parse command line arguments using the magic formula

    for (arg in commandArgs(TRUE)) {
      eval(parse(text=arg))
    }

    in the R script. After running the above code, all command line arguments of the form name=value will be available as variables in the global environment.

  • To understand the magic formula commandArgs, run R by:

    R '--args mu=1 sig=2 kap=3'

    and then issue commands in R

    commandArgs()
    commandArgs(TRUE)

  • Understand the magic formula parse and eval:

    rm(list=ls())
    print(x)
    ## Error in print(x): object 'x' not found
    parse(text="x=3")
    ## expression(x = 3)
    eval(parse(text="x=3"))
    print(x)
    ## [1] 3

  • runSim.R has components: (1) method implementation, (2) data generator with unspecified parameter n, (3) estimation based on generated data, and (4) command argument parser.
    ## ## parsing command arguments
    ## for (arg in commandArgs(TRUE)) {
    ##   eval(parse(text=arg))
    ## }
    ## 
    ## ## check if a given integer is prime
    ## isPrime = function(n) {
    ##   if (n <= 3) {
    ##     return (TRUE)
    ##   }
    ##   if (any((n %% 2:floor(sqrt(n))) == 0)) {
    ##     return (FALSE)
    ##   }
    ##   return (TRUE)
    ## }
    ## 
    ## ## estimate mean only using observation with prime indices
    ## estMeanPrimes = function (x) {
    ##   n = length(x)
    ##   ind = sapply(1:n, isPrime)
    ##   return (mean(x[ind]))
    ## }
    ## 
    ## # simulate data
    ## x = rnorm(n)
    ## 
    ## # estimate mean
    ## estMeanPrimes(x)

  • Call runSim.R with sample size n=100:

    R CMD BATCH '--args n=100' runSim.R

    or

    Rscript runSim.R n=100
    ## [1] -0.03840343

Run long jobs

  • Many statistical computing tasks take long: simulation, MCMC, etc.

  • nohup command in Linux runs program(s) immune to hangups and writes output to nohup.out by default. Logging out will not kill the process; we can log in later to check status and results.

  • nohup is POSIX standard thus available on Linux and MacOS.

  • Run runSim.R in the background and write output to nohup.out:

    nohup Rscript runSim.R n=100 &
    ## [1] 0.4355978
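
It is often convenient to send the output to a named log file and to remember the job's PID so it can be checked or killed later. The sketch below assumes nothing about R being installed: the sh -c command is only a stand-in for a long job such as Rscript runSim.R n=100, and job.log is a made-up file name.

```shell
workdir=$(mktemp -d)
log="$workdir/job.log"

# Redirect stdout and stderr to a log file instead of the default nohup.out.
nohup sh -c 'sleep 1; echo "job finished"' > "$log" 2>&1 &
job_pid=$!                   # PID of the background job, usable with ps or kill
echo "started job $job_pid"

wait "$job_pid"              # only for this demo; normally you would log out here
n_done=$(grep -c '^job finished$' "$log")
cat "$log"
rm -rf "$workdir"
```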

screen

  • screen is another popular utility, but not installed by default.

  • Typical workflow using screen.

    1. Access remote server using ssh.

    2. Start jobs in batch mode.

    3. Detach jobs.

    4. Exit from server, wait for jobs to finish.

    5. Access remote server using ssh.

    6. Re-attach jobs, check on progress, get results, etc.

Use R to call R

R in conjunction with nohup or screen can be used to orchestrate a large simulation study.

  • It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation.

  • We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc.

  • Python in many ways makes a better glue; we may discuss this later in the course.

  • Suppose we have
    • runSim.R which runs a simulation based on command line argument n.
    • A large collection of n values that we want to use in our simulation study.
    • Access to a server with 128 cores.
  • Option 1: manually call runSim.R for each setting.

  • Option 2: automate calls using R and nohup. autoSim.R

  • cat autoSim.R
    ## # autoSim.R
    ## 
    ## nVals = seq(100, 500, by=100)
    ## for (n in nVals) {
    ##   oFile = paste("n", n, ".txt", sep="")
    ##   arg = paste("n=", n, sep="")
    ##   sysCall = paste("nohup Rscript runSim.R ", arg, " > ", oFile)
    ##   system(sysCall)
    ##   print(paste("sysCall=", sysCall, sep=""))
    ## }

  • Rscript autoSim.R
    ## [1] "sysCall=nohup Rscript runSim.R  n=100  >  n100.txt"
    ## [1] "sysCall=nohup Rscript runSim.R  n=200  >  n200.txt"
    ## [1] "sysCall=nohup Rscript runSim.R  n=300  >  n300.txt"
    ## [1] "sysCall=nohup Rscript runSim.R  n=400  >  n400.txt"
    ## [1] "sysCall=nohup Rscript runSim.R  n=500  >  n500.txt"
  • Now we just need to write a script to collect results from the output files.
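
One possible shape for such a collection script is sketched below. It assumes each output file is named n<value>.txt and holds a single line in the "[1] value" format that Rscript prints. The two files created here contain placeholder numbers, not real simulation results, and summary.csv is a made-up name.

```shell
workdir=$(mktemp -d)
# Placeholder values, NOT real simulation output; they only mimic the format.
echo '[1] 0.11' > "$workdir/n100.txt"
echo '[1] 0.22' > "$workdir/n200.txt"

summary="$workdir/summary.csv"
echo 'n,estimate' > "$summary"
for f in "$workdir"/n*.txt; do
  n=$(basename "$f" .txt)                       # e.g. n100
  n=${n#n}                                      # strip the leading n: 100
  est=$(awk '$1 == "[1]" { print $2 }' "$f")    # the value after the [1] index
  echo "$n,$est" >> "$summary"
done
collected=$(cat "$summary")
echo "$collected"
rm -rf "$workdir"
```

The resulting CSV file can then be read back into R for plotting or tabulation.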