Computer arithmetics

Units of computer storage

  • bit = binary + digit (coined by statistician John Tukey).
  • byte = 8 bits.
  • KB = kilobyte = $10^3$ bytes.
  • MB = megabytes = $10^6$ bytes.
  • GB = gigabytes = $10^9$ bytes. Typical RAM size.
  • TB = terabytes = $10^{12}$ bytes. Typical hard drive size. Size of NYSE each trading session.
  • PB = petabytes = $10^{15}$ bytes.
  • EB = exabytes = $10^{18}$ bytes. Size of all healthcare data in 2011 is ~150 EB.
  • ZB = zetabytes = $10^{21}$ bytes.

Julia function Base.summarysize shows the amount of memory (in bytes) used by an object.

In [1]:
x = rand(100, 100)
Base.summarysize(x)
Out[1]:
80000

whos() function prints all variables in workspace and their sizes.

In [2]:
whos()
                          Base               Module
                        Compat  19628 KB     Module
                          Core               Module
                        IJulia  19702 KB     Module
                          JSON  19539 KB     Module
                          Main               Module
                       MbedTLS  19564 KB     Module
                     Nullables   1120 bytes  Module
                           ZMQ  19510 KB     Module
                             x     78 KB     100×100 Array{Float64,2}

Storage of Characters

  • Plain text files are stored in the form of characters: .jl, .r, .c, .cpp, .ipynb, .html, .tex, ...
  • ASCII (American Code for Information Interchange): 7 bits, only $2^7=128$ characters.
In [3]:
# integers 0, 1, ..., 127 and corresponding ascii character
# show(STDOUT, "text/plain", [0:127 Char.(0:127)]) 
[0:127 Char.(0:127)]
Out[3]:
128×2 Array{Any,2}:
   0  '\0'  
   1  '\x01'
   2  '\x02'
   3  '\x03'
   4  '\x04'
   5  '\x05'
   6  '\x06'
   7  '\a'  
   8  '\b'  
   9  '\t'  
  10  '\n'  
  11  '\v'  
  12  '\f'  
   ⋮        
 116  't'   
 117  'u'   
 118  'v'   
 119  'w'   
 120  'x'   
 121  'y'   
 122  'z'   
 123  '{'   
 124  '|'   
 125  '}'   
 126  '~'   
 127  '\x7f'
  • Extended ASCII: 8 bits, $2^8=256$ characters.
In [4]:
# integers 128, 129, ..., 255 and corresponding extended ascii character
# show(STDOUT, "text/plain", [128:255 Char.(128:255)])
[128:255 Char.(128:255)]
Out[4]:
128×2 Array{Any,2}:
 128  '\u80'
 129  '\u81'
 130  '\u82'
 131  '\u83'
 132  '\u84'
 133  '\u85'
 134  '\u86'
 135  '\u87'
 136  '\u88'
 137  '\u89'
 138  '\u8a'
 139  '\u8b'
 140  '\u8c'
   ⋮        
 244  'ô'   
 245  'õ'   
 246  'ö'   
 247  '÷'   
 248  'ø'   
 249  'ù'   
 250  'ú'   
 251  'û'   
 252  'ü'   
 253  'ý'   
 254  'þ'   
 255  'ÿ'   
  • Unicode: UTF-8, UTF-16 and UTF-32 support many more characters including foreign characters; last 7 digits conform to ASCII.

  • UTF-8 is the current dominant character encoding on internet.

  • Julia supports the full range of UTF-8 characters. You can type many Unicode math symbols by typing the backslashed LaTeX symbol name followed by tab.
In [5]:
# \beta-<tab>
β = 0.0
# \beta-<tab>-\hat-<tab>
β̂ = 0.0
Out[5]:
0.0

Fixed-point number system

  • Fixed-point number system is a computer model for integers $\mathbb{Z}$.

  • The number of bits and method of representing negative numbers vary from system to system.

    • The integer type in R has $M=32$ or 64 bits.
    • Matlab has (u)int8, (u)int16, (u)int32, (u)int64.
    • Julia has even more integer types. Using Tom Breloff's Plots.jl and PlotRecipes.jl packages, we can visualize the type tree under Integer
    • Storage for a Signed or Unsigned integer can be $M = 8, 16, 32, 64$ or 128 bits.
In [6]:
# make a list of a type T and it's supertypes
T = Integer
sups = [T]
sup = T
while sup != Any
    sup = supertype(sup)
    unshift!(sups, sup)
end

# recursively build a graph of subtypes of T
n = length(sups)
nodes, source, destiny = copy(sups), collect(1:n-1), collect(2:n)
function add_subs!(T, supidx)
    for sub in subtypes(T)
        push!(nodes, sub)
        subidx = length(nodes)
        push!(source, supidx)
        push!(destiny, subidx)
        add_subs!(sub, subidx)
    end
end
add_subs!(T, n)
names = map(string, nodes)

using PlotRecipes
#pyplot(alpha=0.5, size=(800, 500))
gr(alpha=0.5, size=(800, 500))
graphplot(source, destiny, names=names, method=:tree)
WARNING: Install LightGraphs for the best layout calculations.
Out[6]:
Any Number Real Integer BigInt Bool Signed Int128 Int16 Int32 Int64 Int8 Unsigned UInt128 UInt16 UInt32 UInt64 UInt8

Signed integers

  • First bit indicates sign.

    • 0 for nonnegative numbers
    • 1 for negative numbers
  • Two's complement representation for negative numbers.

    • Sign bit is set to 1
    • remaining bits are set to opposite values
    • 1 is added to the result
    • Two's complement representation of a negative integer x is same as the unsigned integer 2^m + x.
In [7]:
@show typeof(18)
@show bits(18)
@show bits(-18)
@show bits(UInt64(Int128(2)^64 - 18)) == bits(-18)
@show bits(2 * 18) # shift bits of 18
@show bits(2 * -18); # shift bits of -18
typeof(18) = Int64
bits(18) = "0000000000000000000000000000000000000000000000000000000000010010"
bits(-18) = "1111111111111111111111111111111111111111111111111111111111101110"
bits(UInt64(Int128(2) ^ 64 - 18)) == bits(-18) = true
bits(2 * 18) = "0000000000000000000000000000000000000000000000000000000000100100"
bits(2 * -18) = "1111111111111111111111111111111111111111111111111111111111011100"
  • Two's complement representation respects modular arithmetic nicely.
    Addition of any two signed integers are just bitwise addition, possibly modulo $2^M$

  • Range of representable integers by $M$-bit signed integer is $[-2^{M-1},2^{M-1}-1]$.
    • Julia functions typemin(T) and typemax(T) give the lowest and highest representable number of a type T respectively
In [8]:
for T in [Int8, Int16, Int32, Int64, Int128]
    println(T, '\t', typemin(T), '\t', typemax(T))
end
Int8	-128	127
Int16	-32768	32767
Int32	-2147483648	2147483647
Int64	-9223372036854775808	9223372036854775807
Int128	-170141183460469231731687303715884105728	170141183460469231731687303715884105727

Unsigned integers

  • For unsigned integers, the range is $[0,2^M-1]$.
In [9]:
for t in [UInt8, UInt16, UInt32, UInt64, UInt128]
    println(t, '\t', typemin(t), '\t', typemax(t))
end
UInt8	0	255
UInt16	0	65535
UInt32	0	4294967295
UInt64	0	18446744073709551615
UInt128	0	340282366920938463463374607431768211455

BigInt

Julia BigInt type is arbitrary precision.

In [10]:
@show typemax(Int128)
@show typemax(Int128) + 1 # modular arithmetic!
@show BigInt(typemax(Int128)) + 1;
typemax(Int128) = 170141183460469231731687303715884105727
typemax(Int128) + 1 = -170141183460469231731687303715884105728
BigInt(typemax(Int128)) + 1 = 170141183460469231731687303715884105728

Overflow and underflow for integer arithmetic

R reports NA for integer overflow and underflow.
Julia outputs the result according to modular arithmetic.

In [11]:
@show typemax(Int32)
@show typemax(Int32) + Int32(1); # modular arithmetics!
typemax(Int32) = 2147483647
typemax(Int32) + Int32(1) = -2147483648
In [12]:
using RCall

R"""
.Machine$integer.max
"""
Out[12]:
RCall.RObject{RCall.IntSxp}
[1] 2147483647
In [13]:
R"""
M <- 32
big <- 2^(M-1) - 1
as.integer(big)
"""
Out[13]:
RCall.RObject{RCall.IntSxp}
[1] 2147483647
In [14]:
R"""
as.integer(big+1)
"""
WARNING: RCall.jl: Warning: NAs introduced by coercion to integer range
Out[14]:
RCall.RObject{RCall.IntSxp}
[1] NA

Floating-number system

Floating-point number system is a computer model for real numbers.

  • Most computer systems adopt the IEEE 754 standard, established in 1985, for floating-point arithmetics.
    For the history, see an interview with William Kahan.

  • In the scientific notation, a real number is represented as $$\pm d_0.d_1d_2 \cdots d_p \times b^e.$$ In computer, the base is $b=2$ and the digits $d_i$ are 0 or 1.

  • Normalized vs denormalized numbers. For example, decimal number 18 is $$ +1.0010 \times 2^4 \quad (\text{normalized})$$ or, equivalently, $$ +0.1001 \times 2^5 \quad (\text{denormalized}).$$

  • In the floating-number system, computer stores

    • sign bit
    • the fraction (or mantissa, or significand) of the normalized representation
    • the actual exponent plus a bias
  • Double precision (64 bit = 8 bytes)
    • In Julia, Float64 is the type for double precision numbers
    • First bit is sign bit
    • $p=52$ significant bits
    • 11 exponent bits: $e_{\max}=1023$, $e_{\min}=-1022$, bias=1023
    • $e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers
    • range of magnitude: $10^{\pm 308}$ in decimal because $\log_{10} (2^{1023}) \approx 308$
    • precision to the $\log_{10}(2^{-52}) \approx 16$ decimal point
  • Single precision (32 bit = 4 bytes)

    • In Julia, Float32 is the type for single precision numbers
    • First bit is sign bit
    • $p=23$ significant bits
    • 8 exponent bits: $e_{\max}=127$, $e_{\min}=-126$, bias=127
    • $e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers
    • range of magnitude: $10^{\pm 38}$ in decimal because $\log_{10} (2^{127}) \approx 38$
    • precision: $\log_{10}(2^{23}) \approx 7$ decimal point
  • Half precision (16 bit = 2 bytes)

    • In Julia, Float16 is the type for half precision numbers
    • First bit is sign bit
    • $p=10$ significant bits
    • 5 exponent bits: $e_{\max}=15$, $e_{\min}=-14$, bias=15
    • $e_{\text{min}}-1$ and $e_{\text{max}}+1$ are reserved for special numbers
    • range of magnitude: $10^{\pm 4}$ in decimal because $\log_{10} (2^{15}) \approx 4$
    • precision: $\log_{10}(2^{10}) \approx 3$ decimal point
In [15]:
println("Half precision:")
@show bits(Float16(18)) # 18 in half precision
@show bits(Float16(-18)) # -18 in half precision
println("Single precision:")
@show bits(Float32(18)) # 18 in single precision
@show bits(Float32(-18)) # -18 in single precision
println("Double precision:")
@show bits(Float64(18)) # 18 in double precision
@show bits(Float64(-18)) # -18 in double precision
@show Float32(π) # SP number displays 7 decimal digits
@show Float64(π) # DP number displays 15 decimal digits
Half precision:
bits(Float16(18)) = "0100110010000000"
bits(Float16(-18)) = "1100110010000000"
Single precision:
bits(Float32(18)) = "01000001100100000000000000000000"
bits(Float32(-18)) = "11000001100100000000000000000000"
Double precision:
bits(Float64(18)) = "0100000000110010000000000000000000000000000000000000000000000000"
bits(Float64(-18)) = "1100000000110010000000000000000000000000000000000000000000000000"
Float32(π) = 3.1415927f0
Float64(π) = 3.141592653589793
Out[15]:
3.141592653589793
  • Special floating-point numbers.
    • Exponent $e_{\max}+1$ plus a zero mantissa means $\pm \infty$
    • Exponent $e_{\max}+1$ plus a nonzero mantissa means NaN. NaN could be produced from 0 / 0, 0 * Inf, ...
      In general NaN ≠ NaN bitwise
    • Exponent $e_{\min}-1$ with a zero mantissa represents the real number 0
    • Exponent $e_{\min}-1$ with a nonzero mantissa are for numbers less than $b^{e_{\min}}$
      Numbers are denormalized in the range $(0,b^{e_{\min}})$ -- graceful underflow
In [16]:
@show bits(Inf) # Inf in double precision
@show bits(-Inf) # -Inf in double precision
@show bits(0 / 0) # NaN
@show bits(0 * Inf) # NaN
@show bits(0.0) # 0 in double precision 
@show nextfloat(0.0) # next representable number 
@show bits(nextfloat(0.0)); # denormalized
bits(Inf) = "0111111111110000000000000000000000000000000000000000000000000000"
bits(-Inf) = "1111111111110000000000000000000000000000000000000000000000000000"
bits(0 / 0) = "1111111111111000000000000000000000000000000000000000000000000000"
bits(0Inf) = "1111111111111000000000000000000000000000000000000000000000000000"
bits(0.0) = "0000000000000000000000000000000000000000000000000000000000000000"
nextfloat(0.0) = 5.0e-324
bits(nextfloat(0.0)) = "0000000000000000000000000000000000000000000000000000000000000001"
  • Rounding is necessary whenever a number has more than $p$ significand bits. Most computer systems use the default IEEE 754 round to nearest mode (also called ties to even mode). Julia offers several rounding modes, the default being RoundNearest. For example, the number 0.1 in decimal system cannot be represented accurately as a floating point number: $$ 0.1 = 1.10011001... \times 2^{-4} $$
In [17]:
@show bits(0.1f0) # single precision Float32
@show bits(0.1);  # double precision Float64
bits(0.1f0) = "00111101110011001100110011001101"
bits(0.1) = "0011111110111001100110011001100110011001100110011001100110011010"
  • In summary
    • Single precision: range $\pm 10^{\pm 38}$ with precision up to 7 decimal digits
    • Double precision: range $\pm 10^{\pm 308}$ with precision up to 16 decimal digits
    • The floating-point numbers do not occur uniformly over the real number line Each magnitude has same number of representible numbers
    • Machine epsilons are the spacings of numbers around 1: $$\epsilon_{\min}=b^{-p}, \quad \epsilon_{\max} = b^{1-p}.$$
In [18]:
@show eps(Float32)  # machine epsilon for a floating point type
@show eps(Float64)  # same as eps()
# eps(x) is the spacing after x
@show eps(100.0)
@show eps(0.0)
# nextfloat(x) and prevfloat(x) give the neighbors of x
@show x = 1.25f0
@show prevfloat(x), x, nextfloat(x)
@show bits(prevfloat(x)), bits(x), bits(nextfloat(x));
eps(Float32) = 1.1920929f-7
eps(Float64) = 2.220446049250313e-16
eps(100.0) = 1.4210854715202004e-14
eps(0.0) = 5.0e-324
x = 1.25f0 = 1.25f0
(prevfloat(x), x, nextfloat(x)) = (1.2499999f0, 1.25f0, 1.2500001f0)
(bits(prevfloat(x)), bits(x), bits(nextfloat(x))) = ("00111111100111111111111111111111", "00111111101000000000000000000000", "00111111101000000000000000000001")
  • In R, the variable .Machine contains numerical characteristics of the machine.
In [19]:
R"""
.Machine
"""
Out[19]:
RCall.RObject{RCall.VecSxp}
$double.eps
[1] 2.220446e-16

$double.neg.eps
[1] 1.110223e-16

$double.xmin
[1] 2.225074e-308

$double.xmax
[1] 1.797693e+308

$double.base
[1] 2

$double.digits
[1] 53

$double.rounding
[1] 5

$double.guard
[1] 0

$double.ulp.digits
[1] -52

$double.neg.ulp.digits
[1] -53

$double.exponent
[1] 11

$double.min.exp
[1] -1022

$double.max.exp
[1] 1024

$integer.max
[1] 2147483647

$sizeof.long
[1] 8

$sizeof.longlong
[1] 8

$sizeof.longdouble
[1] 16

$sizeof.pointer
[1] 8

  • Julia provides Float16 (half precision), Float32 (single precision), Float64 (double precision), and BigFloat (arbitrary precision).
In [20]:
# make a list of a type T and it's supertypes
T = AbstractFloat
sups = [T]
sup = T
while sup != Any
    sup = supertype(sup)
    unshift!(sups,sup)
end
n = length(sups)
nodes, source, destiny = copy(sups), collect(1:n-1), collect(2:n)
add_subs!(T, n)
names = map(string, nodes)

graphplot(source, destiny, names=names, method=:tree)
WARNING: Install LightGraphs for the best layout calculations.
Out[20]:
Any Number Real AbstractFloat BigFloat Float16 Float32 Float64

Overflow and underflow of floating-point number

  • For double precision, the range is $\pm 10^{\pm 308}$. In most situations, underflow is preferred over overflow. Overflow causes crashes. Underflow yields zeros or denormalized numbers.

  • E.g., the logit link function is $$p = \frac{\exp (x^T \beta)}{1 + \exp (x^T \beta)} = \frac{1}{1+\exp(- x^T \beta)}.$$ The former expression can easily lead to Inf / Inf = NaN, while the latter expression leads to graceful underflow.

In [21]:
for T in [Float16, Float32, Float64,]
    println(T, '\t', typemin(T), '\t', typemax(T), '\t', realmin(T), 
        '\t', realmax(T), '\t', eps(T))
end
Float16	-Inf	Inf	6.104e-5	6.55e4	0.000977
Float32	-Inf	Inf	1.1754944e-38	3.4028235e38	1.1920929e-7
Float64	-Inf	Inf	2.2250738585072014e-308	1.7976931348623157e308	2.220446049250313e-16
  • BigFloat in Julia offers arbitrary precision.
In [22]:
@show precision(BigFloat), realmin(BigFloat), realmax(BigFloat);
(precision(BigFloat), realmin(BigFloat), realmax(BigFloat)) = (256, 8.50969131174083613912978790962048280567755996982969624908264897850135431080301e-1388255822130839284, 5.875653789111587590936911998878442589938516392745498308333779606469323584389875e+1388255822130839282)
In [23]:
@show BigFloat(π); # default precision for BigFloat is 256 bits
# set precision to 1024 bits
setprecision(BigFloat, 1024) do 
    @show BigFloat(π)
end;
BigFloat(π) = 3.141592653589793238462643383279502884197169399375105820974944592307816406286198
BigFloat(π) = 3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273724586997

Catastrophic cancellation

  • Scenario 1: Addition or subtraction of two numbers of widely different magnitudes: $a+b$ or $a-b$ where $a \gg b$ or $a \ll b$. We loose the precision in the number of smaller magnitude. Consider $$\begin{eqnarray*} a &=& x.xxx ... \times 2^{30} \\ b &=& y.yyy... \times 2^{-30} \end{eqnarray*}$$ What happens when computer calculates $a+b$? We get $a+b=a$!
In [24]:
a = 1.0 * 2.0^30
b = 1.0 * 2.0^-30
a + b == a
Out[24]:
true
  • Scenario 2: Subtraction of two nearly equal numbers eliminates significant digits. $a-b$ where $a \approx b$. Consider $$\begin{eqnarray*} a &=& x.xxxxxxxxxx1ssss \\ b &=& x.xxxxxxxxxx0tttt \end{eqnarray*}$$ The result is $1.vvvvu...u$ where $u$ are unassigned digits.
In [25]:
a = 1.2345678f0 # rounding
@show bits(a) # rounding
b = 1.2345677f0
@show bits(b)
@show a - b # correct result should be 1e-7
@show bits(a - b);
bits(a) = "00111111100111100000011001010001"
bits(b) = "00111111100111100000011001010000"
a - b = 1.1920929f-7
bits(a - b) = "00110100000000000000000000000000"
  • Implications for numerical computation
    • Rule 1: add small numbers together before adding larger ones
    • Rule 2: add numbers of like magnitude together (paring). When all numbers are of same sign and similar magnitude, add in pairs so each stage the summands are of similar magnitude
    • Rule 3: avoid substraction of two numbers that are nearly equal

Algebraic laws

Floating-point numbers may violate many algebraic laws we are familiar with, such associative and distributive laws. See Homework 1 problems.

Further reading