- Compiled languages: C/C++, Fortran, ...
- Interpreted languages: R, Matlab, Python, SAS IML, JavaScript, ...
- Mixed (dynamic) languages: Matlab (JIT), R (compiler package), Julia, Cython, Java, ...
- Scripting languages: Linux shell scripts, Perl, ...
- Database languages: SQL, Hadoop.
To be versatile in the big data era, master at least one language in each category.
To improve the efficiency of interpreted languages such as R or Matlab, the conventional wisdom is to avoid loops as much as possible, i.e., to vectorize code.
The only loop you are allowed is the outer loop of an iterative algorithm.
When looping is unavoidable, the hot code has to be written in C, C++, or Fortran.
Success stories: the popular `glmnet` package in R is coded in Fortran.
Modern languages such as Julia try to solve this two-language problem, that is, to achieve efficiency without vectorizing code.
Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments
Started in 2009. First public release in 2012.
Aims to solve the notorious two-language problem:
Walks like Python. Runs like C.
Write high-level, abstract code that closely resembles mathematical formulas
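For instance, a sum-of-squares function can be written essentially as the formula $\sum_i x_i^2$ (a minimal sketch; the function name is ours):

# ∑ᵢ xᵢ² written as plainly as the math
sumsq(x) = sum(xᵢ^2 for xᵢ in x)
sumsq([1, 2, 3]) # 14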
Julia is more than just "Fast R" or "Fast Matlab"
R, for example, is not meant for high-performance computing:
- Deficiencies in the core language (many addressed by extension packages such as `devtools`, `roxygen2`, and `Matrix`).
- Only 6 active developers left (out of 20 R-Core members).
- Doug Bates (member of R-Core, author of the `Matrix` and `lme4` packages) now develops the MixedModels.jl package in Julia: https://github.com/dmbates/MixedModels.jl
> As some of you may know, I have had a (rather late) mid-life crisis and run off with another language called Julia.
>
> -- Doug Bates (on the `knitr` Google Group)
Features | R | Matlab | Julia |
---|---|---|---|
Open source | 👍 | 👎 | 👍 |
IDE | RStudio 👍 👍 👍 | 👍 👍 👍 | Atom+Juno 👎 |
Dynamic document | RMarkdown 👍 👍 👍 | 👍 👍 | Jupyter 👍 👍 |
Multi-threading | `parallel` 👎 | 👍 | 👍 👍 see docs |
JIT | `compiler` 👎 | 👍 👍 | 👍 👍 👍 |
Call C/Fortran | wrapper, Rcpp | wrapper | no glue code needed |
Call shared library | wrapper | wrapper | no glue code needed |
Type system | 👎 | 👍 👍 | 👍 👍 👍 |
Pass by reference | 👎 | 👎 | 👍 👍 👍 |
Linear algebra | 👎 | MKL, Arpack | OpenBLAS, eigpack, or MKL |
Distributed computing | 👎 | 👍 | 👍 👍 👍 |
Sparse linear algebra | Matrix package 👎 | 👍 👍 👍 | 👍 👍 👍 |
Documentation | 👍 | 👍 👍 👍 | 👍 |
Profiler | 👍 👍 | 👍 👍 👍 | 👍 |
The benchmark code `R-benchmark-25.R` from http://r.research.att.com/benchmarks/R-benchmark-25.R covers many numerical operations commonly used in statistics. We ported it to Matlab and Julia and report the run times (in seconds, averaged over 5 runs) below.
Test | R 3.4.3 | Matlab R2017a | Julia 0.6.2 |
---|---|---|---|
Matrix creation, trans., deform. ($2500 \times 2500$) | 0.65 | 0.13 | 0.21 |
Power of matrix ($2400 \times 2400$, `A.^1000`) | 0.18 | 0.10 | 0.18 |
Quick sort ($n = 7 \times 10^6$) | 0.75 | 0.30 | 0.65 |
Cross product ($2800 \times 2800$, $A^T A$) | 15.02 | 0.18 | 0.23 |
LS solution ($n = p = 2000$) | 7.00 | 0.07 | 0.06 |
FFT ($n = 2{,}400{,}000$) | 0.32 | 0.03 | 0.04 |
Eigenvalues ($600 \times 600$) | 0.75 | 0.22 | 0.26 |
Determinant ($2500 \times 2500$) | 3.77 | 0.19 | 0.14 |
Cholesky ($3000 \times 3000$) | 5.54 | 0.08 | 0.17 |
Matrix inverse ($1600 \times 1600$) | 4.13 | 0.11 | 0.14 |
Fibonacci (vector calculation) | 0.23 | 0.16 | 0.27 |
Hilbert (matrix calculation) | 0.27 | 0.07 | 0.06 |
GCD (recursion) | 0.42 | 0.09 | 0.16 |
Toeplitz matrix (loops) | 0.32 | 0.0012 | 0.0007 |
Escoufiers (mixed) | 0.30 | 0.15 | 0.14 |
Machine specs: Intel i7 @ 2.9 GHz (4 physical cores, 8 threads), 16 GB RAM, macOS 10.13.3.
An example from Dr. Doug Bates's slides Julia for R Programmers.
The task is to create a Gibbs sampler for the density
$$
f(x, y) = k x^2 \exp(- x y^2 - y^2 + 2y - 4x), \quad x > 0,
$$
using the conditional distributions
$$
\begin{aligned}
X \mid Y &\sim \Gamma \left( 3, \frac{1}{y^2 + 4} \right) \\
Y \mid X &\sim N \left( \frac{1}{1+x}, \frac{1}{2(1+x)} \right).
\end{aligned}
$$
This is a Julia function for the simple Gibbs sampler:
using Distributions

function jgibbs(N, thin)
    mat = zeros(N, 2)
    x = y = 0.0
    for i in 1:N
        for j in 1:thin
            x = rand(Gamma(3.0, 1.0 / (y * y + 4.0)))
            y = rand(Normal(1.0 / (x + 1.0), 1.0 / sqrt(2.0(x + 1.0))))
        end
        mat[i, 1] = x
        mat[i, 2] = y
    end
    mat
end
Generate a bivariate sample of size 10,000 with a thinning of 500. How long does it take?
jgibbs(100, 5); # warm-up
@elapsed jgibbs(10000, 500)
The `RCall.jl` package allows us to execute R code without leaving the Julia environment. We first define an R function `Rgibbs()`:

using RCall
R"""
library(Matrix)
Rgibbs <- function(N, thin) {
mat <- matrix(0, nrow=N, ncol=2)
x <- y <- 0
for (i in 1:N) {
for (j in 1:thin) {
x <- rgamma(1, 3, y * y + 4) # 3rd arg is rate
y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
}
mat[i,] <- c(x, y)
}
mat
}
"""
and then generate the same number of samples:
# benchmark
@elapsed R"""
system.time(Rgibbs(10000, 500))
"""
We see a 40-80 fold speedup of Julia over R on this example, without any extra coding effort!
See the Julia documentation for details on the REPL. The Julia REPL, or Julia shell, has four main modes:

1. The default mode is the Julian prompt `julia>`. Type backspace in any other mode to return to the default mode.
2. Help mode `help?>`. Type `?` to enter help mode. `?search_term` does a fuzzy search for `search_term`.
3. Shell mode `shell>`. Type `;` to enter shell mode.
4. Search mode `(reverse-i-search)`. Press `Ctrl+R` to enter search mode.

With the `RCall.jl` package installed, we can enter the R mode by typing `$` (shift+4) at the Julia REPL.
Some survival commands in the Julia REPL:
- `quit()` or `Ctrl+D`: exit Julia.
- `Ctrl+C`: interrupt execution.
- `Ctrl+L`: clear screen.
- `whos()`: list all variables in the current workspace.
- `workspace()`: clear all variables and reset the session.
- Append `;` (semicolon) to a command to suppress displaying its output.
- `include("filename.jl")`: source a Julia code file.
Ways to get help:
- Online help from the REPL: `?function_name`.
- Google (of course).
- Julia documentation: https://docs.julialang.org/en/stable/.
- Look up source code: `@edit func(x)`.
- Friends.
I highly recommend the editor Atom with the packages `julia-client`, `language-julia`, and `latex-completions` installed.
If you want an RStudio- or Matlab-like IDE, install the `uber-juno` package in Atom, following the instructions at https://github.com/JunoLab/uber-juno/blob/master/setup.md.
JuliaPro bundles Julia, Atom, Juno, and many commonly used packages.
For homework, I recommend the Jupyter Notebook. Also worth trying is JupyterLab, which is intended to replace the Jupyter Notebook once it reaches v1.0.
Each Julia package is a Git repository, and each Julia package name ends with `.jl`. E.g., the `Distributions.jl` package lives at https://github.com/JuliaStats/Distributions.jl.
A Google search for `PackageName.jl` usually leads to the package repository on github.com.
The package ecosystem is rapidly maturing; a complete list of registered packages (which are required to have a certain level of testing and documentation) is at http://pkg.julialang.org/.
For example, the `Distributions.jl` package is added with
Pkg.add("Distributions") # no .jl
and "removed" (although not completely deleted) with
Pkg.rm("Distributions")
The package manager provides a dependency solver that determines which packages actually need to be installed.
Non-registered packages are added by cloning the relevant Git repository. E.g.,
Pkg.clone("git@github.com:OpenMendel/SnpArrays.jl.git")
A package needs to be added only once, at which point it is downloaded into the `.julia/vx.x` directory in your home directory.
;ls -l /Users/huazhou/.julia/v0.6 "|" head -20
The directory of a specific package can be queried by `Pkg.dir()`:

Pkg.dir("Distributions")
If you start having problems with packages that seem to be unsolvable, you can try just deleting your .julia directory and reinstalling all your packages.
For JuliaPro, the packages are downloaded to the directory `/Applications/JuliaPro-0.6.2.2.app/Contents/Resources/pkgs-0.6.2.2/v0.6/`.
Periodically, one should run `Pkg.update()`, which checks for, downloads, and installs updated versions of all the packages you currently have installed.
`Pkg.status()` lists the status of all installed packages.
Using functions in a package:
using Distributions
This pulls all of the exported functions in the module into your local namespace, as you can check using the whos()
command. An alternative is
import Distributions
Now, the functions from the Distributions package are available only using
Distributions.<FUNNAME>
All functions, not only exported functions, are always available like this.
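For instance, with `import`, functions must be qualified by the module name (a minimal sketch):

import Distributions
d = Distributions.Normal(0.0, 1.0) # qualified name needed with import
Distributions.pdf(d, 0.0)          # density of N(0, 1) at 0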
# Pkg.add("RCall")
using RCall
x = randn(1000)
R"""
hist($x, main="I'm plotting a Julia vector")
"""
R"""
library(ggplot2)
qplot($x)
"""
x = R"""
rnorm(10)
"""
y = collect(x) # collect values into a Julia array
Access Julia variables in R REPL mode:
julia> x = rand(5) # Julia variable
R> y <- $x
Pass Julia expression in R REPL mode:
R> y <- $(rand(5))
Put Julia variable into R environment:
julia> @rput x
R> x
Get R variable into Julia environment:
R> r <- 2
julia> @rget r
If you want to call Julia within R, check out the `XRJulia` package by John Chambers.
y = 1
typeof(y) # Int64; similar to integer in R
y = 1.0
typeof(y) # Float64; similar to double in R
# Greek letters: `\pi<tab>`
π
# Greek letters: `\theta<tab>`
θ = y + π
# emoji! `\:kissing_cat:<tab>`
😽 = 5.0
α̂ = π
# vector of Float64 0s
x = zeros(5)
# vector of Int64 0s
x = zeros(Int, 5)
# matrix of Float64 0s
x = zeros(5, 3)
# matrix of Float64 1s
x = ones(5, 3)
# define array without initialization
x = Array{Float64}(5, 3)
# fill a matrix by 0s
fill!(x, 0)
# create a 5-by-3 matrix filled with π
fill(π, (5, 3))
a = 3//5 # rational number
typeof(a)
b = 3//7
a + b
# uniform [0, 1) random numbers
x = rand(5, 3)
# uniform random numbers (in half precision)
x = rand(Float16, 5, 3)
# random numbers from {1,...,5}
x = rand(1:5, 5, 3)
# standard normal random numbers
x = randn(5, 3)
# range
1:10
typeof(1:10)
1:2:10
typeof(1:2:10)
# integers 1-10
x = collect(1:10)
# Float64 numbers 1-10
x = collect(1.0:10)
# convert to a specific type
convert(Vector{Float64}, 1:10)
The `@time`, `@elapsed`, and `@allocated` macros:
srand(123) # seed
x = randn(10000)
@time sum(x) # first run includes compilation time
@time sum(x) # no compilation time after first run
@elapsed sum(x)
@allocated sum(x)
Use `BenchmarkTools.jl` for more robust benchmarking. It is the analog of the `microbenchmark` package in R.
using BenchmarkTools
@benchmark sum(x)
R"""
library(microbenchmark)
microbenchmark(sum($x))
"""
x = randn(5, 3)
size(x)
size(x, 1) # nrow() in R
size(x, 2)
# total number of elements
length(x)
# 5 × 5 matrix of random Normal(0, 1)
x = randn(5, 5)
# first column
x[:, 1]
# first row
x[1, :]
# sub-array
x[1:2, 2:3]
# getting a subset of a matrix creates a copy, but you can also create "views"
z = view(x, 1:2, 2:3)
z[2, 2] = 0.0
x
# y points to same data as x
y = x
# x and y point to same data
pointer(x), pointer(y)
# changing y also changes x
y[:, 1] = 0
x
# create a new copy of data
z = copy(x)
pointer(x), pointer(z)
x, y, z = randn(5, 3), randn(5, 2), randn(3, 5)
M = [x y]    # horizontal concatenation: 5 × 5
M = [x y; z] # horizontal and vertical concatenation: 8 × 5
Dot operations in Julia are elementwise operations.
x = randn(5, 3)
y = ones(5, 3)
x .* y # same as x * y in R
x .^ (-2)
sin.(x)
x = randn(5)
# vector L2 norm
vecnorm(x)
y = randn(5) # another vector
# dot product
dot(x, y) # x' * y
x'y
x, y = randn(5, 3), randn(3, 2)
# matrix multiplication, same as %*% in R
x * y
x = randn(3, 3)
# conjugate transpose
x'
b = rand(3)
x'b # same as x' * b
trace(x)
det(x)
rank(x)
# 10-by-10 sparse matrix with sparsity 0.1
X = sprandn(10, 10, .1)
# convert to dense matrix; be cautious when dealing with big data
Xfull = full(X)
# convert a dense matrix to sparse matrix
sparse(Xfull)
# syntax for sparse linear algebra is same as dense linear algebra
β = ones(10)
X * β
# many functions apply to sparse matrices as well
sum(X, 2)
`if-elseif-else` construct:

if condition1
    # do something
elseif condition2
    # do something
else
    # do something
end
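A concrete instance (the values are arbitrary):

x = -1.5
if x > 0
    println("positive")
elseif x < 0
    println("negative")
else
    println("zero")
end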
`for` loop:

for i in 1:10
    println(i)
end
Nested `for` loops:

for i in 1:10
    for j in 1:5
        println(i * j)
    end
end
Same as
for i in 1:10, j in 1:5
    println(i * j)
end
Exit a loop early with `break`:

for i in 1:10
    # do something
    if condition1
        break # exit the loop, skipping the remaining iterations
    end
end
Skip to the next iteration with `continue`:

for i in 1:10
    # do something
    if condition1
        continue # skip to the next iteration
    end
    # do something
end
In Julia, all arguments to functions are passed by reference, in contrast to R and Matlab.
Function names ending with `!` indicate that the function mutates at least one of its arguments, typically the first:
sort!(x) # vs sort(x)
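A quick demonstration of the pass-by-reference semantics (the function `setfirst!` is our own, named with `!` because it mutates its argument):

function setfirst!(v, val)
    v[1] = val # mutates the caller's array; no copy is made
    return v
end
x = zeros(3)
setfirst!(x, 7.0)
x # [7.0, 0.0, 0.0]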
Function definition:

function func(req1, req2; key1=dflt1, key2=dflt2)
    # do stuff
    return out1, out2, out3
end
- Required arguments are separated with commas and use positional notation.
- Keyword arguments follow the semicolon and need a default value in the signature.
- The semicolon is not required in a function call.
- The `return` statement is optional.
- Multiple outputs can be returned as a tuple, e.g., `return out1, out2, out3` (see the sketch after the next example).
- Anonymous functions, e.g., `x -> x^2`, are commonly used in collection functions and list comprehensions.
map(x -> x^2, y) # square each element in y
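And a sketch of returning multiple outputs as a tuple (the function `minmax2` is our own):

function minmax2(v)
    return minimum(v), maximum(v) # returned as a tuple
end
lo, hi = minmax2([3, 1, 4, 1, 5]) # destructure: lo = 1, hi = 5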
Functions can be nested:
function outerfunction()
    # do some outer stuff
    function innerfunction()
        # do inner stuff
        # can access prior outer definitions
    end
    # do more outer stuff
end
Functions can be vectorized using the dot syntax:

function myfunc(x)
    return sin(x^2)
end
x = randn(5, 3)
myfunc.(x)
Collection functions (think of these as analogs of the `apply` family of functions in R).
Apply a function to each element of a collection:
map(f, coll) # or
map(coll) do elem
    # do stuff with elem
    # the last expression is the returned value
end
map(x -> sin(x^2), x)
map(x) do elem
    elem = elem^2
    return sin(elem)
end
# Mapreduce
mapreduce(x -> sin(x^2), +, x)
# same as
sum(x -> sin(x^2), x)
[sin(2i + j) for i in 1:5, j in 1:3] # similar to Python
When thinking about types, think about sets.
- Everything is a subtype of the abstract type `Any`.
- An abstract type defines a set of types, e.g., `Number`.
- Useful functions for exploring types: `typeof()`, `supertype()`, and `subtypes()`.

typeof(1.0), typeof(1)
supertype(Float64)
subtypes(AbstractFloat)
# Is Float64 a subtype of AbstractFloat?
Float64 <: AbstractFloat
# On a 64-bit machine, Int == Int64
Int == Int64
convert(Float64, 1)
randn(Float32, 5)
convert(Vector{Float64}, randn(Float32, 5))
convert(Int, 1.0)
convert(Int, 1.5) # throws InexactError; use round instead
round(Int, 1.5)
Multiple dispatch lies at the core of Julia's design. It allows built-in and user-defined functions to be overloaded for different combinations of argument types.
Let's consider a simple "doubling" function:
g(x) = x + x
g(1.5)
This definition is too broad, since some things can't be added:

g("hello world")

This definition is correct but too restrictive, since any `Number` can be added:

g(x::Float64) = x + x

This definition covers the whole `Number` type tree:

g(x::Number) = x + x
This is a lot nicer than
function g(x)
    if isa(x, Number)
        return x + x
    else
        throw(ArgumentError("x should be a number"))
    end
end
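Dispatch also works on combinations of argument types, not just a single argument (a minimal sketch; the function `h` is hypothetical):

h(x::Number, y::Number) = x * y        # numeric method
h(x::AbstractString, y::Integer) = x^y # string repetition: "ab"^3 == "ababab"
h(2.0, 3)  # dispatches to the Number method: 6.0
h("ab", 3) # dispatches to the String method: "ababab"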
The `methods(func)` function displays all methods defined for `func`:

methods(g)

The `@which func(x)` macro tells which method is being used for argument signature `x`:

x = 1
typeof(x)
g(x)
@which g(x)
x = randn(5)
@which g(x)
g(x)
The following figures and some examples are taken from Arch D. Robinson's slides Introduction to Writing High Performance Julia.
Julia's efficiency results from its capability to infer the types of all variables within a function and then call LLVM to generate optimized machine code at run time.

workspace() # clear previous definition of g
g(x::Number) = x + x
This function will work on any type which has a method for `+`.
@show g(2)
@show g(2.0);
The first step is lowering the code to an abstract syntax tree (AST):
@code_lowered g(2)
Type inference:
@code_warntype g(2)
@code_warntype g(2.0)
Peek at the compiled LLVM bitcode with `@code_llvm`:
@code_llvm g(2)
@code_llvm g(2.0)
We didn't provide a type annotation, but different LLVM code is generated depending on the argument type!
- In R or Python, `g(2)` and `g(2.0)` would use the same code.
- In Julia, `g(2)` and `g(2.0)` dispatch to optimized code for `Int64` and `Float64`, respectively.
- For integer input `x`, the LLVM compiler is smart enough to know that `x + x` is simply shifting `x` left by 1 bit, which is faster than addition.
The lowest level is the assembly code, which is machine dependent:
@code_native g(2)
@code_native g(2.0)
Julia has several built-in tools for profiling. The `@time` macro outputs run time and heap allocation:
function tally(x)
    s = 0
    for v in x
        s += v
    end
    s
end
a = rand(10000)
@time tally(a) # first run: include compile time
@time tally(a)
For more robust benchmarking, the BenchmarkTools.jl package is highly recommended.
using BenchmarkTools
@benchmark tally(a)
We see that the memory allocation (468.75 KiB, 10.73% GC on average) is suspiciously high.
The `Profile` module gives line-by-line profile results:
Profile.clear()
@profile tally(a)
Profile.print(format=:flat)
One can use the `ProfileView` package for better visualization of the profile data:
using ProfileView
ProfileView.view()
@code_warntype tally(a)
;cat bar.jl
Next, in the terminal, we run the script with the `--track-allocation=user` option:
;julia --track-allocation=user bar.jl
The profiler outputs a file `bar.jl.mem`:
;cat bar.jl.mem
We see that line 4 is allocating a suspicious amount of heap memory.
The key to writing performant Julia code is to be type stable, so that Julia is able to infer the types of all variables and of the function output from the types of the input arguments.
Is the `tally` function type stable? How do we diagnose and fix it?
@code_warntype tally(rand(100))
In this case, Julia fails to infer the type of the reduction variable `s`, which therefore has to be boxed in heap memory at run time.
This is the generated LLVM bitcode, which is unusually long and contains lots of boxing:
@code_llvm tally(rand(100))
What's the fix? Initialize `s` with the element type of `x`:
function tally2(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    s
end
@benchmark tally2(a)
Much shorter LLVM bitcode:
@code_llvm tally2(a)
The three most popular options (as far as I know) in Julia are
- `Gadfly.jl`: the Julia analog of `ggplot2` in R
- `PyPlot.jl`: a Julia wrapper of Python's matplotlib
- `Plots.jl`: a unified interface to multiple plotting backends

We demonstrate Plots.jl below:
# Pkg.add("Plots")
using Plots
x = cumsum(randn(50, 2), 1);
# Pkg.add("PyPlot")
pyplot() # set the backend to PyPlot
plot(x, title="Random walk", xlab="time")
# Pkg.add("PlotlyJS")
plotlyjs() # change backend to Plotly
plot(x, title="Random walk", xlab="time")
gr() # change backend to GR
plot(x, title="Random walk", xlab="time")
gr()
@gif for i in 1:20
    plot(x -> sin(x) / (.2i), 0, i, xlim=(0, 20), ylim=(-.75, .75))
    scatter!(x -> cos(x) * .01 * i, 0, i, m=1)
end
versioninfo()