Algorithms

  • An algorithm is loosely defined as a set of instructions for transforming input into output: Input $\to$ Output.

  • Knuth: (1) finiteness, (2) definiteness, (3) input, (4) output, (5) effectiveness.

Measure of efficiency

  • A basic unit for measuring algorithmic efficiency is the flop.

    A flop (floating point operation) consists of a floating point addition, subtraction, multiplication, division, or comparison, and the usually accompanying fetch and store.

Some books count a multiplication followed by an addition (fused multiply-add, FMA) as one flop. This results in a factor of up to 2 difference in flop counts.

  • How to measure the efficiency of an algorithm? Big O notation. If $n$ is the size of a problem, an algorithm has order $O(f(n))$ if the leading term in its flop count is $c \cdot f(n)$ for some constant $c$. For example,

    • matrix-vector multiplication A * b, where A is $m \times n$ and b is $n \times 1$, takes $2mn$ or $O(mn)$ flops
    • matrix-matrix multiplication A * B, where A is $m \times n$ and B is $n \times p$, takes $2mnp$ or $O(mnp)$ flops
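    The $2mn$ count for a matrix-vector product can be seen directly from a naive implementation: each of the $mn$ inner-loop iterations does one multiplication and one addition. A minimal sketch (not the optimized BLAS routine actually called by `A * b`; the function name `matvec_flops` is made up for illustration):

    ```julia
    # Naive matrix-vector product that tallies flops as it goes:
    # one multiply + one add per (i, j) pair, hence 2mn in total.
    function matvec_flops(A, b)
        m, n = size(A)
        y = zeros(m)
        flops = 0
        for i in 1:m
            for j in 1:n
                y[i] += A[i, j] * b[j]   # 1 multiplication + 1 addition
                flops += 2
            end
        end
        return y, flops
    end

    A = [1.0 2.0; 3.0 4.0; 5.0 6.0]   # 3 × 2, so m = 3, n = 2
    b = [1.0, 1.0]
    y, flops = matvec_flops(A, b)
    # flops == 2 * 3 * 2 == 12
    ```

    The same inner loop nested over the $p$ columns of B gives the $2mnp$ count for the matrix-matrix product.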
  • A hierarchy of computational complexity:
    Let $n$ be the problem size.

    • Exponential order: $O(b^n)$ (NP-hard="horrible")
    • Polynomial order: $O(n^q)$ (doable)
    • $O(n \log n )$ (fast)
    • Linear order $O(n)$ (fast)
    • Log order $O(\log n)$ (super fast)
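    To get a feel for how quickly these classes separate, here is a rough sketch (base 2 for the log and the exponential is an arbitrary choice for illustration):

    ```julia
    # Growth of each complexity class for a few small n.
    # Even at n = 30 the exponential term already exceeds 10^9.
    for n in (10, 20, 30)
        println("n=$n  log2(n)=", log2(n),
                "  n*log2(n)=", n * log2(n),
                "  n^2=", n^2,
                "  2^n=", 2^n)
    end
    ```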
  • Classification of data sets by Huber.

    Data Size   Bytes                  Storage Mode
    ---------   -----                  ------------
    Tiny        $10^2$                 Piece of paper
    Small       $10^4$                 A few pieces of paper
    Medium      $10^6$ (megabytes)     A floppy disk
    Large       $10^8$                 Hard disk
    Huge        $10^9$ (gigabytes)     Hard disk(s)
    Massive     $10^{12}$ (terabytes)  RAID storage
  • The difference between $O(n^2)$ and $O(n\log n)$ on massive data is dramatic. Suppose we have a teraflop supercomputer capable of doing $10^{12}$ flops per second. For a problem of size $n=10^{12}$, an $O(n \log n)$ algorithm takes about $$10^{12} \log (10^{12}) / 10^{12} \approx 27 \text{ seconds},$$ while an $O(n^2)$ algorithm takes about $10^{12}$ seconds, which is approximately 31710 years!
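    These back-of-envelope numbers can be reproduced in a couple of lines (using the natural log, consistent with the 27-second figure, and 365-day years):

    ```julia
    # Teraflop machine (10^12 flops/sec) on a problem of size n = 10^12.
    flops_per_sec = 1e12
    n = 1e12
    t_nlogn = n * log(n) / flops_per_sec   # ≈ 27.6 seconds
    t_n2    = n^2 / flops_per_sec          # 10^12 seconds
    years   = t_n2 / (3600 * 24 * 365)     # ≈ 31710 years
    ```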

  • QuickSort and FFT (invented by Tukey!) are celebrated algorithms that turn $O(n^2)$ operations into $O(n \log n)$. Another example is Strassen's method, which turns $O(n^3)$ matrix multiplication into $O(n^{\log_2 7}) \approx O(n^{2.807})$.
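    The core of Strassen's trick is multiplying a $2 \times 2$ block matrix with 7 block multiplications instead of 8, at the cost of extra additions; applying this recursively to the blocks gives the $O(n^{\log_2 7})$ bound. A sketch of the one-level scalar version (the function name is made up; a real implementation would recurse on matrix blocks):

    ```julia
    # Strassen's 7-multiplication formulas for a 2 × 2 product C = A * B.
    # Only 7 multiplies (m1..m7) appear; standard multiplication needs 8.
    function strassen2x2(a11, a12, a21, a22, b11, b12, b21, b22)
        m1 = (a11 + a22) * (b11 + b22)
        m2 = (a21 + a22) * b11
        m3 = a11 * (b12 - b22)
        m4 = a22 * (b21 - b11)
        m5 = (a11 + a12) * b22
        m6 = (a21 - a11) * (b11 + b12)
        m7 = (a12 - a22) * (b21 + b22)
        c11 = m1 + m4 - m5 + m7
        c12 = m3 + m5
        c21 = m2 + m4
        c22 = m1 - m2 + m3 + m6
        return c11, c12, c21, c22
    end

    # [1 2; 3 4] * [5 6; 7 8] == [19 22; 43 50]
    strassen2x2(1, 2, 3, 4, 5, 6, 7, 8)
    ```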

  • One goal of this course is to get familiar with the flop counts for some common numerical tasks in statistics.

    The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

  • For example, compare flops of the two mathematically equivalent expressions: A * B * x and A * (B * x) where A and B are matrices and x is a vector.

In [1]:
using BenchmarkTools

srand(123) # seed RNG (Julia 0.6 syntax; Random.seed!(123) in Julia ≥ 0.7)
n = 1000
A = randn(n, n)
B = randn(n, n)
x = randn(n)

@benchmark A * B * x
Out[1]:
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  3
  --------------
  minimum time:     14.363 ms (0.00% GC)
  median time:      15.265 ms (0.00% GC)
  mean time:        15.372 ms (3.20% GC)
  maximum time:     21.254 ms (0.00% GC)
  --------------
  samples:          325
  evals/sample:     1
In [2]:
@benchmark A * (B * x)
Out[2]:
BenchmarkTools.Trial: 
  memory estimate:  15.88 KiB
  allocs estimate:  2
  --------------
  minimum time:     449.260 μs (0.00% GC)
  median time:      537.231 μs (0.00% GC)
  mean time:        548.740 μs (0.00% GC)
  maximum time:     1.505 ms (0.00% GC)
  --------------
  samples:          8862
  evals/sample:     1
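Flop counts explain the gap: `A * B * x` evaluates left to right, so the $O(n^3)$ matrix-matrix product `A * B` is formed first, while `A * (B * x)` needs only two $O(n^2)$ matrix-vector products. A quick tally for $n = 1000$:

```julia
# Flop counts for the two evaluation orders, n = 1000.
# (A * B) * x : one matrix-matrix product (2n^3) + one matrix-vector (2n^2)
# A * (B * x) : two matrix-vector products (2n^2 each)
n = 1000
flops_left  = 2 * n^3 + 2 * n^2   # ≈ 2 × 10^9
flops_right = 2 * n^2 + 2 * n^2   # 4 × 10^6
ratio = flops_left / flops_right  # ≈ 500
```

The observed time ratio (~30×) is smaller than the ~500× flop ratio because BLAS matrix-matrix multiplication sustains far more flops per second than matrix-vector multiplication, thanks to better cache reuse.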

Performance of computer systems

  • FLOPS (floating point operations per second) is a measure of computer performance.

  • For example, my laptop has the Intel i7-6920HQ (Skylake) CPU with 4 cores running at 2.90 GHz (cycles per second).

In [3]:
versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

Intel Skylake CPUs can do 16 double-precision (DP) flops per cycle and 32 single-precision (SP) flops per cycle. So the theoretical peak performance of my laptop is $$ 4 \times 2.9 \times 10^9 \times 16 = 185.6 \text{ GFLOPS DP} $$ in double precision and $$ 4 \times 2.9 \times 10^9 \times 32 = 371.2 \text{ GFLOPS SP} $$ in single precision.
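The peak-performance arithmetic above is just cores × clock × flops-per-cycle; a one-line check (the 16/32 flops-per-cycle figures assume the Skylake AVX2+FMA numbers quoted above):

```julia
# Theoretical peak = cores × clock (Hz) × flops per cycle, in GFLOPS.
cores = 4
clock = 2.9e9                          # 2.90 GHz
peak_dp = cores * clock * 16 / 1e9     # 185.6 GFLOPS double precision
peak_sp = cores * clock * 32 / 1e9     # 371.2 GFLOPS single precision
```

Actual achieved performance is typically well below this peak; dense matrix-matrix multiplication (BLAS `gemm`) is one of the few workloads that comes close.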