Prostate Data

This demonstration solves a regular, unconstrained lasso problem using the constrained lasso solution path algorithm (lsq_classopath).
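
With no affine constraints in place, the path being traced is simply that of the standard lasso. Up to the exact scaling convention used by lsq_classopath, the objective is

$$\underset{\beta \in \mathbb{R}^p}{\text{minimize}} \quad \frac{1}{2}\lVert y - X\beta \rVert_2^2 + \rho \lVert \beta \rVert_1,$$

where ρ ≥ 0 is the penalty parameter and the algorithm returns the piecewise-linear solution path β̂(ρ).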

The prostate data come from a study that examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy (Stamey et al., 1989).

Let's load and organize the prostate data. Since we are interested in the following variables as predictors, we extract them and create a design matrix Xz:

  • lcavol : log(cancer volume)
  • lweight: log(prostate weight)
  • age : age
  • lbph : log(benign prostatic hyperplasia amount)
  • svi : seminal vesicle invasion
  • lcp : log(capsular penetration)
  • gleason: Gleason score
  • pgg45 : percentage of Gleason scores 4 or 5

The response variable is lpsa, which is log(prostate specific antigen).

using ConstrainedLasso
# read in the prostate data; with header=true, readcsv returns (data, header)
prostate = readcsv(joinpath(Pkg.dir("ConstrainedLasso"), "docs/src/demo/misc/prostate.csv"), header=true)
# look up the column index of each predictor in the header row
tmp = Int[]
labels = ["lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason" "pgg45"]
for i in labels
    push!(tmp, find(x -> x == i, prostate[2])[1])
end
# assemble the design matrix from the selected columns
Xz = Array{Float64}(prostate[1][:, tmp])
97×8 Array{Float64,2}:
 -0.579818  2.76946  50.0  -1.38629   0.0  -1.38629   6.0   0.0
 -0.994252  3.31963  58.0  -1.38629   0.0  -1.38629   6.0   0.0
 -0.510826  2.69124  74.0  -1.38629   0.0  -1.38629   7.0  20.0
 -1.20397   3.28279  58.0  -1.38629   0.0  -1.38629   6.0   0.0
  0.751416  3.43237  62.0  -1.38629   0.0  -1.38629   6.0   0.0
 -1.04982   3.22883  50.0  -1.38629   0.0  -1.38629   6.0   0.0
  0.737164  3.47352  64.0   0.615186  0.0  -1.38629   6.0   0.0
  0.693147  3.53951  58.0   1.53687   0.0  -1.38629   6.0   0.0
 -0.776529  3.53951  47.0  -1.38629   0.0  -1.38629   6.0   0.0
  0.223144  3.24454  63.0  -1.38629   0.0  -1.38629   6.0   0.0
  0.254642  3.60414  65.0  -1.38629   0.0  -1.38629   6.0   0.0
 -1.34707   3.59868  63.0   1.26695   0.0  -1.38629   6.0   0.0
  1.61343   3.02286  63.0  -1.38629   0.0  -0.597837  7.0  30.0
  ⋮                                         ⋮                  
  3.30285   3.51898  64.0  -1.38629   1.0   2.32728   7.0  60.0
  2.02419   3.7317   58.0   1.639     0.0  -1.38629   6.0   0.0
  1.73166   3.36902  62.0  -1.38629   1.0   0.300105  7.0  30.0
  2.80759   4.71805  65.0  -1.38629   1.0   2.46385   7.0  60.0
  1.56235   3.69511  76.0   0.936093  1.0   0.81093   7.0  75.0
  3.24649   4.10182  68.0  -1.38629   0.0  -1.38629   6.0   0.0
  2.5329    3.67757  61.0   1.34807   1.0  -1.38629   7.0  15.0
  2.83027   3.8764   68.0  -1.38629   1.0   1.32176   7.0  60.0
  3.821     3.89691  44.0  -1.38629   1.0   2.16905   7.0  40.0
  2.90745   3.39619  52.0  -1.38629   1.0   2.46385   7.0  10.0
  2.88256   3.77391  68.0   1.55814   1.0   1.55814   7.0  80.0
  3.47197   3.975    68.0   0.438255  1.0   2.90417   7.0  20.0
# the response lpsa is the second-to-last column of the data matrix
y = Array{Float64}(prostate[1][:, end-1])
97-element Array{Float64,1}:
 -0.430783
 -0.162519
 -0.162519
 -0.162519
  0.371564
  0.765468
  0.765468
  0.854415
  1.04732 
  1.04732 
  1.26695 
  1.26695 
  1.26695 
  ⋮       
  3.63099 
  3.68009 
  3.71235 
  3.98434 
  3.9936  
  4.02981 
  4.12955 
  4.38515 
  4.68444 
  5.14312 
  5.47751 
  5.58293

First, we standardize each predictor by subtracting its column mean and dividing by its column standard deviation.

n, p = size(Xz)
# center and scale each column of the design matrix
for i in 1:p
    Xz[:, i] -= mean(Xz[:, i])
    Xz[:, i] /= std(Xz[:, i])
end
Xz
97×8 Array{Float64,2}:
 -1.63736   -2.00621    -1.86243    …  -0.863171  -1.04216   -0.864467
 -1.98898   -0.722009   -0.787896      -0.863171  -1.04216   -0.864467
 -1.57882   -2.18878     1.36116       -0.863171   0.342627  -0.155348
 -2.16692   -0.807994   -0.787896      -0.863171  -1.04216   -0.864467
 -0.507874  -0.458834   -0.250631      -0.863171  -1.04216   -0.864467
 -2.03613   -0.933955   -1.86243    …  -0.863171  -1.04216   -0.864467
 -0.519967  -0.362793    0.0180011     -0.863171  -1.04216   -0.864467
 -0.557313  -0.208757   -0.787896      -0.863171  -1.04216   -0.864467
 -1.80425   -0.208757   -2.26537       -0.863171  -1.04216   -0.864467
 -0.956085  -0.897266   -0.116315      -0.863171  -1.04216   -0.864467
 -0.92936   -0.0578992   0.152317   …  -0.863171  -1.04216   -0.864467
 -2.28833   -0.0706369  -0.116315      -0.863171  -1.04216   -0.864467
  0.223498  -1.41472    -0.116315      -0.299282   0.342627   0.199211
  ⋮                                 ⋱   ⋮                             
  1.65688   -0.256675    0.0180011  …   1.7927     0.342627   1.26289 
  0.572009   0.239854   -0.787896      -0.863171  -1.04216   -0.864467
  0.323806  -0.606718   -0.250631       0.342907   0.342627   0.199211
  1.23668    2.54221     0.152317       1.89038    0.342627   1.26289 
  0.180156   0.154448    1.6298         0.70824    0.342627   1.79473 
  1.60906    1.10379     0.555266   …  -0.863171  -1.04216   -0.864467
  1.00362    0.113497   -0.384948      -0.863171   0.342627  -0.332628
  1.25592    0.577607    0.555266       1.07357    0.342627   1.26289 
  2.09651    0.625489   -2.66832        1.67954    0.342627   0.55377 
  1.3214    -0.543304   -1.59379        1.89038    0.342627  -0.509907
  1.30029    0.338384    0.555266   …   1.24263    0.342627   1.97201 
  1.80037    0.807764    0.555266       2.20528    0.342627  -0.155348

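As a quick, optional sanity check, each standardized column should now have mean approximately 0 and standard deviation approximately 1:

mean(Xz, 1)   # 1×8 row vector of values near zero
std(Xz, 1)    # 1×8 row vector of values near one
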
Now we solve the problem using the solution path algorithm.

# compute the full lasso solution path; any additional outputs are discarded
βpath, ρpath, = lsq_classopath(Xz, y);
βpath
8×9 Array{Float64,2}:
 0.000197119  0.421559  0.461915   …   0.597645    0.603245    0.665561 
 0.0          0.0       0.0            0.232715    0.246191    0.266408 
 0.0          0.0       0.0           -0.0601318  -0.0936838  -0.158234 
 0.0          0.0       0.0            0.0882392   0.108105    0.14034  
 0.0          0.0       0.0403562      0.243534    0.252539    0.315269 
 0.0          0.0       0.0        …   0.0         0.0        -0.148508 
 0.0          0.0       0.0            0.0         0.0121929   0.0354652
 0.0          0.0       0.0            0.0646193   0.0699873   0.125787

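Each column of βpath corresponds to one value in ρpath, so an estimate at any particular penalty level can be read off (or interpolated) from the path. A minimal sketch, using a hypothetical target value ρ₀:

ρ₀ = 10.0                         # hypothetical target penalty level
k  = indmin(abs.(ρpath .- ρ₀))    # index of the path knot closest to ρ₀
β₀ = βpath[:, k]                  # coefficient estimate at that knot
labels[find(abs.(β₀) .> 1e-8)]    # predictors with nonzero coefficients there
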
We plot the solution path below.

using Plots; pyplot(); 
colors = [:green :orange :black :purple :red :grey :brown :blue] 
plot(ρpath, βpath', xaxis = ("ρ", (minimum(ρpath),
      maximum(ρpath))), yaxis = ("β̂(ρ)"), label=labels, color=colors)
title!("Prostate Data: Solution Path via Constrained Lasso")

Below, we solve the same problem using the GLMNet.jl package.

using GLMNet;  
path = glmnet(Xz, y, intercept=false);
path.betas
plot(path.lambda, path.betas', color=colors, label=labels, 
        xaxis=("λ"), yaxis= ("β̂(λ)"))
title!("Prostate Data: Solution Path via GLMNet.jl")

Follow the link to access the .ipynb file of this page.