In [1]:
versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code

Algorithms

  • An algorithm is loosely defined as a set of instructions for transforming input into output.

  • Knuth: (1) finiteness, (2) definiteness, (3) input, (4) output, (5) effectiveness.

Measure of efficiency

  • A basic unit for measuring algorithmic efficiency is the flop.

    A flop (floating point operation) consists of a floating point addition, subtraction, multiplication, division, or comparison, and the usually accompanying fetch and store.

Some books count a multiplication followed by an addition (fused multiply-add, FMA) as one flop. This results in a factor of up to 2 difference in flop counts.
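Julia exposes the fused operation directly as fma, which performs the multiply and the add with a single rounding (whether it compiles to a hardware FMA instruction depends on the CPU). A small sketch showing that one rounding versus two can change the result:

```julia
# a*b = (1 + ε)(1 - ε) = 1 - ε² with ε = 2^-52;
# 1 - 2^-104 is not representable in Float64 and rounds to 1.0.
a, b, c = 1.0 + 2.0^-52, 1.0 - 2.0^-52, -1.0

fma(a, b, c)   # one rounding: recovers the exact answer -2^-104
a * b + c      # two roundings: a*b rounds to 1.0, so the sum is 0.0
```

The difference matters in numerically sensitive kernels (dot products, polynomial evaluation), which is one reason flop-counting conventions differ.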

  • How to measure efficiency of an algorithm? Big O notation. If n is the size of a problem, an algorithm has order O(f(n)) if the leading term in its flop count is c·f(n) for some constant c. For example,

    • matrix-vector multiplication A * b, where A is m×n and b is n×1, takes 2mn or O(mn) flops
    • matrix-matrix multiplication A * B, where A is m×n and B is n×p, takes 2mnp or O(mnp) flops
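As a sanity check on the 2mn count for matrix-vector multiplication, a naive implementation can tally one multiply and one add per entry of A. The matvec_flops helper below is a sketch added for illustration, not part of the original notes:

```julia
# Count flops in a naive m×n matrix-vector product c = A*b.
# Each entry A[i,j] contributes one multiply and one add, hence 2mn in total.
function matvec_flops(m, n)
    A, b, c = randn(m, n), randn(n), zeros(m)
    flops = 0
    for i in 1:m, j in 1:n
        c[i] += A[i, j] * b[j]  # one multiply + one add
        flops += 2
    end
    return flops
end

matvec_flops(3, 5)  # 2 * 3 * 5 = 30
```

The same per-entry argument applied to each of the p columns of B gives the 2mnp count for matrix-matrix multiplication.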
  • A hierarchy of computational complexity:
    Let n be the problem size.

    • Exponential order: O(b^n) (NP-hard = "horrible")
    • Polynomial order: O(n^q) (doable)
    • O(n log n) (fast)
    • Linear order O(n) (fast)
    • Log order O(log n) (super fast)
  • Classification of data sets by Huber.

Data Size | Bytes              | Storage Mode
Tiny      | 10^2               | Piece of paper
Small     | 10^4               | A few pieces of paper
Medium    | 10^6 (megabytes)   | A floppy disk
Large     | 10^8               | Hard disk
Huge      | 10^9 (gigabytes)   | Hard disk(s)
Massive   | 10^12 (terabytes)  | RAID storage
  • Difference of O(n^2) and O(n log n) on massive data. Suppose we have a teraflop supercomputer capable of doing 10^12 flops per second. For a problem of size n = 10^12, an O(n log n) algorithm takes about 10^12 log(10^12) / 10^12 ≈ 27 seconds, while an O(n^2) algorithm takes about (10^12)^2 / 10^12 = 10^12 seconds, which is approximately 31710 years!
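These back-of-the-envelope numbers can be reproduced directly (a sketch, assuming a machine doing 10^12 flops per second and using the natural log):

```julia
n    = 1e12                 # problem size
rate = 1e12                 # teraflop machine: 10^12 flops per second

t_nlogn = n * log(n) / rate          # ≈ 27.6 seconds
t_n2    = n^2 / rate                 # 10^12 seconds
years   = t_n2 / (3600 * 24 * 365)   # ≈ 31710 years
```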

  • QuickSort and FFT (invented by Tukey!) are celebrated algorithms that turn O(n^2) operations into O(n log n). Another example is Strassen's method, which turns O(n^3) matrix multiplication into O(n^(log2 7)) ≈ O(n^2.807).
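The exponent log2(7) comes from Strassen's recursion, which multiplies two n×n matrices using 7 (rather than 8) multiplications of half-size blocks. A toy recurrence (not the full algorithm) shows where the count comes from:

```julia
# Scalar multiplications in Strassen's recursion for n a power of 2:
# T(n) = 7 T(n/2), T(1) = 1, so T(n) = 7^(log2 n) = n^(log2 7).
strassen_mults(n) = n == 1 ? 1 : 7 * strassen_mults(n ÷ 2)

strassen_mults(64)  # 7^6 = 117649, versus 64^3 = 262144 for the naive method
log2(7)             # ≈ 2.807, the Strassen exponent
```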

  • One goal of this course is to get familiar with the flop counts for some common numerical tasks in statistics.

    The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

  • For example, compare the flop counts of two mathematically equivalent expressions: A * B * x (which Julia evaluates left to right as (A * B) * x) and A * (B * x), where A and B are matrices and x is a vector.

In [2]:
using BenchmarkTools, Random

Random.seed!(123) # seed
n = 1000
A = randn(n, n)
B = randn(n, n)
x = randn(n)

# flops: 2n^3 for A * B, then 2n^2 for the matrix-vector product: O(n^3)
@benchmark $A * $B * $x
Out[2]:
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  3
  --------------
  minimum time:     13.637 ms (0.00% GC)
  median time:      13.927 ms (0.00% GC)
  mean time:        14.571 ms (3.61% GC)
  maximum time:     49.666 ms (70.42% GC)
  --------------
  samples:          343
  evals/sample:     1
In [3]:
# flops: 2n^2 for B * x, then 2n^2 for A * (B * x): O(n^2)
@benchmark $A * ($B * $x)
Out[3]:
BenchmarkTools.Trial: 
  memory estimate:  15.88 KiB
  allocs estimate:  2
  --------------
  minimum time:     431.931 μs (0.00% GC)
  median time:      533.876 μs (0.00% GC)
  mean time:        555.505 μs (0.00% GC)
  maximum time:     1.894 ms (0.00% GC)
  --------------
  samples:          8707
  evals/sample:     1

Performance of computer systems

  • FLOPS (floating point operations per second) is a measure of computer performance.

  • For example, my laptop has the Intel i7-6920HQ (Skylake) CPU with 4 cores running at 2.90 GHz (cycles per second).

In [4]:
versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code

Intel Skylake CPUs can do 16 DP (double-precision) flops per cycle and 32 SP (single-precision) flops per cycle. Then the theoretical peak of my laptop is 4 × 2.9 × 10^9 × 16 = 185.6 GFLOPS in double precision and 4 × 2.9 × 10^9 × 32 = 371.2 GFLOPS in single precision.
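The theoretical peak is just cores × clock × flops-per-cycle; a quick check with the numbers for this particular i7-6920HQ:

```julia
cores        = 4
clock        = 2.9e9    # Hz
dp_per_cycle = 16       # Skylake: two 256-bit FMA units => 16 DP flops/cycle
sp_per_cycle = 32       # twice as many single-precision lanes

peak_dp = cores * clock * dp_per_cycle / 1e9  # 185.6 GFLOPS
peak_sp = cores * clock * sp_per_cycle / 1e9  # 371.2 GFLOPS
```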

  • In Julia, we can measure the actual peak flop rate of the machine with LinearAlgebra.peakflops, which benchmarks a double-precision gemm:
In [5]:
using LinearAlgebra

LinearAlgebra.peakflops(2^14) # matrix size 2^14
Out[5]:
1.5873294017052066e11
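Comparing the measured rate to the theoretical peak gives a rough efficiency figure (a sketch; peakflops benchmarks multithreaded BLAS gemm, so the attainable fraction depends on the number of BLAS threads and the matrix size):

```julia
measured    = 1.5873294017052066e11   # gemm rate from peakflops above, flops/sec
theoretical = 4 * 2.9e9 * 16          # 185.6 GFLOPS double-precision peak

measured / theoretical                # ≈ 0.86, i.e. ~86% of theoretical peak
```

Well-tuned BLAS gemm typically achieves a large fraction of peak, which is why it is used as the benchmark kernel here.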