Topics in numerical algebra:
Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).
Our major goals (or learning objectives) are to
All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra.
BLAS stands for Basic Linear Algebra Subprograms.
See netlib for a complete list of standardized BLAS functions.
There are many implementations of BLAS.
There are 3 levels of BLAS functions.
Level | Example Operation | Name | Dimension | Flops |
---|---|---|---|---|
1 | $\alpha \gets \mathbf{x}^T \mathbf{y}$ | dot product | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
1 | $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | axpy | $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
2 | $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | gaxpy | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
2 | $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ | rank-one update | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
3 | $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | matrix multiplication | $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$ | $2mnp$ |
Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
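For orientation, here is a minimal sketch (not part of the original notes) that calls routines from each of the three levels through the `Base.LinAlg.BLAS` wrappers, assuming Julia 0.6; the underlying Fortran names combine the precision prefix with the operation, e.g. `ddot`, `daxpy`, `dgemv`, `dger`, and `dgemm` in double precision.

const BLAS = Base.LinAlg.BLAS   # assuming Julia 0.6

n = 5
x, y = rand(n), rand(n)
A, B = rand(n, n), rand(n, n)
C = zeros(n, n)

dot(x, y)                                   # level 1: x'y (ddot), 2n flops
BLAS.axpy!(2.0, x, y)                       # level 1: y <- y + 2x (daxpy)
BLAS.gemv!('N', 1.0, A, x, 1.0, y)          # level 2: y <- y + Ax (dgemv)
BLAS.ger!(1.0, y, x, A)                     # level 2: A <- A + yx' (dger)
BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)     # level 3: C <- C + AB (dgemm)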
Some operations appear to be level-3 but are actually level-2. For example, multiplying a matrix $\mathbf{A}$ by a diagonal matrix $\mathbf{D}$ amounts to scaling the columns of $\mathbf{A}$, which costs $O(n^2)$ flops rather than the $O(n^3)$ flops of a general matrix multiplication.
using BenchmarkTools
srand(123) # seed
n = 2000
A = rand(n, n)
d = rand(n) # d vector
D = diagm(d) # diagonal matrix with d as diagonal
# this calls the BLAS routine for general matrix multiplication: O(n^3) flops
@benchmark A * D
# Diagonal(d) stores only the diagonal entries, enabling an O(n^2) multiplication method
Diagonal(d)
# columnwise scaling via the Diagonal type: O(n^2) flops
@benchmark A * Diagonal(d)
@which A * Diagonal(d)
# in-place columnwise scaling: avoids allocating space for the result
@benchmark scale!(A, d)
The key to high performance is effective use of the memory hierarchy, and this is true on all architectures.
Operation | Time |
---|---|
L1 cache reference | 0.5 ns |
L2 cache reference | 7 ns |
Main memory reference | 100 ns |
Read 1 MB sequentially from memory | 250,000 ns |
Read 1 MB sequentially from SSD | 1,000,000 ns |
Read 1 MB sequentially from disk | 20,000,000 ns |
Source: https://gist.github.com/jboner/2841832
For example, the Xeon X5650 CPU has a theoretical peak throughput of 128 GFLOPS in double precision but a maximum memory bandwidth of only 32 GB/s.
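A back-of-the-envelope calculation (assuming 8-byte double-precision numbers and no cache reuse) shows the mismatch:
$$ \frac{32 \times 10^9 \text{ bytes/s}}{8 \text{ bytes/double}} = 4 \times 10^9 \text{ doubles/s}. $$
Main memory can feed the cores only about 4 billion numbers per second, while the cores can execute 128 billion flops per second; each number fetched from memory must therefore be reused in roughly $128 / 4 = 32$ flops to keep the cores busy.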
Can we deliver matrix data to the CPU cores, and ship results back to memory, fast enough to avoid a backlog?
Answer: use high-level BLAS as much as possible.
BLAS | Dimension | Memory References | Flops | Ratio (Refs : Flops) |
---|---|---|---|---|
Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |
Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ | $n^2$ | $2n^2$ | 1:2 |
Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | $\mathbf{A}, \mathbf{B}, \mathbf{C} \in\mathbb{R}^{n \times n}$ | $4n^2$ | $2n^3$ | 2:n |
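To see the benefit in practice, here is a minimal sketch (not part of the original notes) that computes the same product $\mathbf{A} \mathbf{B}$ with $n$ level-2 `gemv!` calls versus a single level-3 `gemm!` call; `mul_by_gemv!` is a hypothetical helper, and the exact timings depend on the BLAS library and hardware.

using BenchmarkTools

n = 1000
A, B = rand(n, n), rand(n, n)
C = zeros(n, n)

# compute C = A * B one column at a time: n level-2 (gemv) calls
function mul_by_gemv!(C, A, B)
    for j = 1:size(B, 2)
        # C[:, j] <- 1.0 * A * B[:, j] + 0.0 * C[:, j]
        Base.LinAlg.BLAS.gemv!('N', 1.0, A, view(B, :, j), 0.0, view(C, :, j))
    end
    C
end

@benchmark mul_by_gemv!(C, A, B)                               # n level-2 calls
@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, C) # one level-3 call

Both versions perform the same $2n^3$ flops; the level-3 call is typically much faster because it reuses cached blocks of $\mathbf{A}$ and $\mathbf{B}$.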
Data layout in memory affects efficiency too: it is much faster to move contiguous chunks of data than to read or write scattered locations.
Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).
A cache line is the smallest unit of data that can be transferred between the cache and main memory.
Accessing a column-major matrix by rows therefore incurs many cache misses.
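The effect is easy to demonstrate. The following sketch (not part of the original notes; the function names are made up for illustration) sums the entries of a matrix with the inner loop running down columns (unit stride) versus across rows (stride $m$); the column-wise traversal is typically several times faster.

using BenchmarkTools

function sum_by_col(A)
    s = zero(eltype(A))
    for j = 1:size(A, 2), i = 1:size(A, 1)   # inner loop walks down a column: unit stride
        s += A[i, j]
    end
    s
end

function sum_by_row(A)
    s = zero(eltype(A))
    for i = 1:size(A, 1), j = 1:size(A, 2)   # inner loop jumps between columns: stride m
        s += A[i, j]
    end
    s
end

A = rand(5000, 5000)
@benchmark sum_by_col(A)
@benchmark sum_by_row(A)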
Take matrix multiplication as an example: $$ \mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}. $$ Assume the storage is column-major, as in Julia. There are six variants of the triple-loop algorithm, depending on the nesting order of the loops; the innermost loop of each variant is shown below.
$jki$ or $kji$ looping:

# innermost loop
for i = 1:m
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end

$ikj$ or $kij$ looping:

# innermost loop
for j = 1:n
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end

$ijk$ or $jik$ looping:

# innermost loop
for k = 1:p
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end
Variant | A Stride | B Stride | C Stride |
---|---|---|---|
$jki$ or $kji$ | Unit | 0 | Unit |
$ikj$ or $kij$ | 0 | Non-Unit | Non-Unit |
$ijk$ or $jik$ | Non-Unit | Unit | 0 |
Clearly the variants $jki$ and $kji$, which access all three matrices with unit (or zero) stride, are preferred.
"""
matmul_by_loop!(A, B, C, order)
Overwrite `C` by `A * B`. `order` indicates the looping order of the triple loop.
"""
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    fill!(C, 0)

    if order == "jki"
        for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "ikj"
        for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "ijk"
        for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "jik"
        for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
end
srand(123)
m, n, p = 2000, 100, 2000
A = rand(m, n)
B = rand(n, p)
C = zeros(m, p);
using BenchmarkTools
@benchmark matmul_by_loop!(A, B, C, "jki")
@benchmark matmul_by_loop!(A, B, C, "kji")
@benchmark matmul_by_loop!(A, B, C, "ikj")
@benchmark matmul_by_loop!(A, B, C, "kij")
@benchmark matmul_by_loop!(A, B, C, "ijk")
@benchmark matmul_by_loop!(A, B, C, "ijk")
# Julia's built-in A_mul_B! dispatches to BLAS
@which A_mul_B!(C, A, B)
@benchmark A_mul_B!(C, A, B)
@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)
To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (which evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is:
Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.
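As a quick illustration (a sketch, not part of the original notes, assuming Julia 0.6), Julia's own high-level functions already dispatch to LAPACK, so following this advice usually just means calling them.

n = 500
A = randn(n, n)
b = randn(n)

@which lufact(A)       # LU factorization of a dense matrix; backed by LAPACK's getrf
x1 = lufact(A) \ b     # solve A x = b via the factorization
x2 = A \ b             # the backslash operator; also ends up in LAPACK
norm(x1 - x2)          # the two solutions agree up to rounding error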
Transposing a matrix is an expensive memory operation. In R, the code
t(A) %*% x
first transposes `A` and then performs the matrix-vector multiplication, causing unnecessary memory allocation. Julia avoids the explicit transpose when it can.
srand(123)
n = 1000
A = rand(n, n)
x = rand(n)
# dispatch to At_mul_B (and then to BLAS)
# does *not* actually transpose the matrix
@benchmark A' * x
@which A' * x
# dispatch to BLAS
@benchmark At_mul_B(A, x)
# let's force transpose
@benchmark transpose(A) * x
# pre-allocate result
out = zeros(size(A, 2))
@benchmark At_mul_B!(out, A, x)
Broadcasting with the dot syntax fuses elementwise operations into a single loop and avoids allocating intermediate arrays.
srand(123)
X, Y = rand(1000,1000), rand(1000,1000)
# two temporary arrays are created
@benchmark max(abs(X), abs(Y))
# no temporary arrays created
@benchmark max.(abs.(X), abs.(Y))
# no memory allocation at all!
Z = zeros(X)
@benchmark Z .= max.(abs.(X), abs.(Y))
Slicing an array, as in `A[1:2:500, 1:2:500]`, allocates a copy of the sub-matrix; a view avoids the copy.
srand(123)
A = randn(1000, 1000)
# sum entries in a sub-matrix
@benchmark sum(A[1:2:500, 1:2:500])
# view avoids creating a separate sub-matrix
@benchmark sum(@view A[1:2:500, 1:2:500])
Julia 0.6 adds the `@views` macro, which turns every array slice in an expression into a view.
@benchmark @views sum(A[1:2:500, 1:2:500])
versioninfo()