GPU Computing in Julia

This session introduces GPU computing in Julia.

In [1]:
versioninfo()
Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

GPGPU

GPUs are ubiquitous in modern computers. The following table lists GPUs found in typical computer systems today.

NVIDIA GPU            Tesla K80             GTX 1080             GT 650M
Computers             servers, clusters     desktop              laptop
Main usage            scientific computing  daily work, gaming   daily work
Memory                24 GB                 8 GB                 1 GB
Memory bandwidth      480 GB/sec            320 GB/sec           80 GB/sec
Number of cores       4992                  2560                 384
Processor clock       562 MHz               1.6 GHz              0.9 GHz
Peak DP performance   2.91 TFLOPS           257 GFLOPS           n/a
Peak SP performance   8.73 TFLOPS           8228 GFLOPS          691 GFLOPS

GPU architecture vs CPU architecture.

  • GPUs contain 100s of processing cores on a single card; several cards can fit in a desktop PC
  • Each core carries out the same operations in parallel on different input data -- single program, multiple data (SPMD) paradigm
  • Extremely high arithmetic intensity if one can transfer the data onto and results off of the processors quickly
[Figures: Intel i7 CPU die vs. Nvidia Fermi GPU die; "Einstein vs. Rain Man" analogy for a few powerful CPU cores vs. many simple GPU cores]
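In Julia, the SPMD pattern shows up most naturally as broadcasting: one scalar function applied, element by element, across an array. Below is a minimal CPU-side sketch of the idea (the function name saxpy is just for illustration; GPU versions of the same pattern appear later in this notebook):

# the same scalar kernel is applied to every element;
# on a GPU, each core would process a different element in parallel
saxpy(a, x, y) = a * x + y
x, y = rand(Float32, 10^6), rand(Float32, 10^6)
z = saxpy.(2f0, x, y)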

GPGPU in Julia

GPU support in Julia is under active development. Check JuliaGPU for currently available packages.

There are at least three paradigms for programming GPUs in Julia.

  • CUDA is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: CuBLAS, CuRAND, CuSparse, CuSolver, CuDNN, ...

    The CuArrays.jl package allows defining arrays on Nvidia GPUs and overloads many common operations (see the sketch after this list). CuArrays.jl supports Julia v1.0+.

  • OpenCL is a standard supported by multiple manufacturers (Nvidia, AMD, Intel, Apple, ...), but it lacks some libraries essential for statistical computing.

    The CLArrays.jl package allows defining arrays on OpenCL devices and overloads many common operations. Currently CLArrays.jl only supports Julia v0.6.

  • ArrayFire is a high-performance library that works on both the CUDA and OpenCL frameworks.

    The ArrayFire.jl package wraps the library for Julia.

  • Warning: The most recent Apple operating system, macOS 10.14 (Mojave), does not support CUDA yet.
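For contrast with the OpenCL workflow used below, here is a minimal sketch of the analogous CuArrays.jl usage on an Nvidia GPU. It is not runnable on this machine (no CUDA device), so treat it as an illustration, not a tested example:

using CuArrays                      # Nvidia GPUs only

xd = CuArray(rand(Float32, 5, 3))   # allocate/transfer to GPU
zd = log.(xd .+ 1f0)                # fused elementwise kernel on the GPU
x  = collect(zd)                    # transfer result back to main memory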

Because my laptop has an AMD Radeon GPU, I'll illustrate using OpenCL on Julia v0.6.4.
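Assuming the system's OpenCL drivers are in place, CLArrays.jl installs through the standard package manager on Julia v0.6:

Pkg.add("CLArrays")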

Query GPU devices in the system

In [2]:
using CLArrays

# check available devices on this machine
CLArrays.devices()
Out[2]:
3-element Array{OpenCL.cl.Device,1}:
 OpenCL.Device(Intel(R) HD Graphics 530 on Apple @0x0000000001024500)                 
 OpenCL.Device(AMD Radeon Pro 460 Compute Engine on Apple @0x0000000001021c00)        
 OpenCL.Device(Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz on Apple @0x00000000ffffffff)
In [3]:
# use the AMD Radeon Pro 460 GPU
dev = CLArrays.devices()[2]
CLArrays.init(dev)
Out[3]:
OpenCL context with:
CL version: OpenCL 1.2 
Device: CL AMD Radeon Pro 460 Compute Engine
            threads: 256
             blocks: (256, 256, 256)
      global_memory: 4294.967296 mb
 free_global_memory: NaN mb
       local_memory: 0.032768 mb

Generate arrays on GPU devices

In [4]:
# generate GPU arrays
xd = rand(CLArray{Float32}, 5, 3)
Out[4]:
GPU: 5×3 Array{Float32,2}:
 0.935426   0.656262   0.145963 
 0.255108   0.0626027  0.565825 
 0.0110094  0.679819   0.767408 
 0.640347   0.114715   0.0996476
 0.607266   0.332567   0.492061 
In [5]:
yd = ones(CLArray{Float32}, 5, 3)
Out[5]:
GPU: 5×3 Array{Float32,2}:
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0
 1.0  1.0  1.0

Transfer data between main memory and GPU

In [6]:
# transfer data from main memory to GPU
x = randn(5, 3)
xd = CLArray(x)
Out[6]:
GPU: 5×3 Array{Float64,2}:
  0.0621414  -0.905539    0.795695
 -0.440687    0.427526   -0.235945
  0.304403    0.0362422  -0.260491
  1.58347    -0.492863   -0.076589
  0.193085    0.351561   -0.857394
In [7]:
# transfer data from GPU back to main memory
x = collect(xd)
Out[7]:
5×3 Array{Float64,2}:
  0.0621414  -0.905539    0.795695
 -0.440687    0.427526   -0.235945
  0.304403    0.0362422  -0.260491
  1.58347    -0.492863   -0.076589
  0.193085    0.351561   -0.857394
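As noted earlier, high arithmetic intensity pays off only if data moves on and off the device quickly; the transfer itself is often the bottleneck. A sketch for timing the copies, using the same CLArray and collect calls as above (array size chosen arbitrarily for illustration):

using BenchmarkTools

x2 = rand(Float32, 512, 512)
@benchmark CLArray($x2)      # host-to-device copy
xd2 = CLArray(x2)
@benchmark collect($xd2)     # device-to-host copy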

Elementwise operations

In [8]:
zd = log.(yd .+ sin.(xd))
Out[8]:
GPU: 5×3 Array{Float64,2}:
  0.0602494  -1.54533     0.539034 
 -0.556103    0.346862   -0.266262 
  0.262152    0.0355933  -0.297806 
  0.693107   -0.640839   -0.0795998
  0.175538    0.295921   -1.41116  
In [9]:
# getting back x
asin.(exp.(zd) .- yd)
Out[9]:
GPU: 5×3 Array{Float64,2}:
  0.0621414  -0.905539    0.795695
 -0.440687    0.427526   -0.235945
  0.304403    0.0362422  -0.260491
  1.55813    -0.492863   -0.076589
  0.193085    0.351561   -0.857394
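Broadcasting is not limited to library functions: CLArrays also compiles user-defined scalar functions into GPU kernels. A minimal sketch, with a hypothetical function g that mirrors the inverse transform above:

# the fused broadcast g.(zd, yd) compiles to a single GPU kernel
g(z, y) = asin(exp(z) - y)
g.(zd, yd)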

Linear algebra

In [10]:
zd = zeros(CLArray{Float32}, 3, 3)
At_mul_B!(zd, xd, yd)
Out[10]:
GPU: 3×3 Array{Float32,2}:
  1.70241    1.70241    1.70241 
 -0.583072  -0.583072  -0.583072
 -0.634723  -0.634723  -0.634723
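A quick way to sanity-check the transposed multiplication is to bring all three arrays back to main memory and compare against the CPU result:

# expected to hold up to floating-point rounding
collect(zd) ≈ collect(xd)' * collect(yd)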
In [11]:
using BenchmarkTools

n = 512
xd = rand(CLArray{Float32}, n, n)
yd = rand(CLArray{Float32}, n, n)
zd = zeros(CLArray{Float32}, n, n)

# SP matrix multiplication on GPU
@benchmark A_mul_B!($zd, $xd, $yd)
ERROR (unhandled task failure): OpenCL Error: OpenCL.Context error
Stacktrace:
 [1] raise_context_error(::String, ::String) at /Users/huazhou/.julia/v0.6/OpenCL/src/context.jl:109
 [2] macro expansion at /Users/huazhou/.julia/v0.6/OpenCL/src/context.jl:148 [inlined]
 [3] (::OpenCL.cl.##43#44)() at ./task.jl:335
Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  2.86 KiB
  allocs estimate:  96
  --------------
  minimum time:     18.158 μs (0.00% GC)
  median time:      23.084 μs (0.00% GC)
  mean time:        26.206 μs (3.11% GC)
  maximum time:     17.951 ms (45.33% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [12]:
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)

# SP matrix multiplication on CPU
@benchmark A_mul_B!($z, $x, $y)
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     890.801 μs (0.00% GC)
  median time:      1.207 ms (0.00% GC)
  mean time:        1.237 ms (0.00% GC)
  maximum time:     2.917 ms (0.00% GC)
  --------------
  samples:          4028
  evals/sample:     1

We see a roughly 50-fold speedup in this single-precision matrix multiplication example.
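One caveat on the GPU timing: kernel launches can be asynchronous, so the benchmark above may partly measure kernel launch rather than full execution. A way to include kernel completion (plus the device-to-host transfer) in the measurement is to copy the result back inside the benchmarked expression:

# collect forces the multiplication to finish before copying back
@benchmark collect(A_mul_B!($zd, $xd, $yd))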