Prostate Data
This demonstration solves a regular, unconstrained lasso problem using the constrained lasso solution path (lsq_classopath.jl).
The prostate data come from a study that examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. (Stamey et al. (1989))
Let's load and organize the prostate data. Since we are interested in the following variables as predictors, we extract them and create a design matrix Xz:
lcavol: log(cancer volume)lweight: log(prostate weight)age: agelbph: log(benign prostatic hyperplasia amount)svi: seminal vesicle invasionlcp: log(capsular penetration)gleason: Gleason scorepgg45: percentage Gleason scores 4 or 5
The response variable is lpsa, which is log(prostate specific antigen).
using ConstrainedLasso
prostate = readcsv(joinpath(Pkg.dir("ConstrainedLasso"), "docs/src/demo/misc/prostate.csv"), header=true) tmp = Int[] labels = ["lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason" "pgg45"] for i in labels push!(tmp, find(x -> x == i, prostate[2])[1]) end Xz = Array{Float64}(prostate[1][:, tmp])
97×8 Array{Float64,2}:
-0.579818 2.76946 50.0 -1.38629 0.0 -1.38629 6.0 0.0
-0.994252 3.31963 58.0 -1.38629 0.0 -1.38629 6.0 0.0
-0.510826 2.69124 74.0 -1.38629 0.0 -1.38629 7.0 20.0
-1.20397 3.28279 58.0 -1.38629 0.0 -1.38629 6.0 0.0
0.751416 3.43237 62.0 -1.38629 0.0 -1.38629 6.0 0.0
-1.04982 3.22883 50.0 -1.38629 0.0 -1.38629 6.0 0.0
0.737164 3.47352 64.0 0.615186 0.0 -1.38629 6.0 0.0
0.693147 3.53951 58.0 1.53687 0.0 -1.38629 6.0 0.0
-0.776529 3.53951 47.0 -1.38629 0.0 -1.38629 6.0 0.0
0.223144 3.24454 63.0 -1.38629 0.0 -1.38629 6.0 0.0
0.254642 3.60414 65.0 -1.38629 0.0 -1.38629 6.0 0.0
-1.34707 3.59868 63.0 1.26695 0.0 -1.38629 6.0 0.0
1.61343 3.02286 63.0 -1.38629 0.0 -0.597837 7.0 30.0
⋮ ⋮
3.30285 3.51898 64.0 -1.38629 1.0 2.32728 7.0 60.0
2.02419 3.7317 58.0 1.639 0.0 -1.38629 6.0 0.0
1.73166 3.36902 62.0 -1.38629 1.0 0.300105 7.0 30.0
2.80759 4.71805 65.0 -1.38629 1.0 2.46385 7.0 60.0
1.56235 3.69511 76.0 0.936093 1.0 0.81093 7.0 75.0
3.24649 4.10182 68.0 -1.38629 0.0 -1.38629 6.0 0.0
2.5329 3.67757 61.0 1.34807 1.0 -1.38629 7.0 15.0
2.83027 3.8764 68.0 -1.38629 1.0 1.32176 7.0 60.0
3.821 3.89691 44.0 -1.38629 1.0 2.16905 7.0 40.0
2.90745 3.39619 52.0 -1.38629 1.0 2.46385 7.0 10.0
2.88256 3.77391 68.0 1.55814 1.0 1.55814 7.0 80.0
3.47197 3.975 68.0 0.438255 1.0 2.90417 7.0 20.0
y = Array{Float64}(prostate[1][:, end-1])
97-element Array{Float64,1}:
-0.430783
-0.162519
-0.162519
-0.162519
0.371564
0.765468
0.765468
0.854415
1.04732
1.04732
1.26695
1.26695
1.26695
⋮
3.63099
3.68009
3.71235
3.98434
3.9936
4.02981
4.12955
4.38515
4.68444
5.14312
5.47751
5.58293
First we standardize the data by subtracting its mean and dividing by its standard deviation.
n, p = size(Xz) for i in 1:size(Xz,2) Xz[:, i] -= mean(Xz[:, i]) Xz[:, i] /= std(Xz[:, i]) end Xz
97×8 Array{Float64,2}:
-1.63736 -2.00621 -1.86243 … -0.863171 -1.04216 -0.864467
-1.98898 -0.722009 -0.787896 -0.863171 -1.04216 -0.864467
-1.57882 -2.18878 1.36116 -0.863171 0.342627 -0.155348
-2.16692 -0.807994 -0.787896 -0.863171 -1.04216 -0.864467
-0.507874 -0.458834 -0.250631 -0.863171 -1.04216 -0.864467
-2.03613 -0.933955 -1.86243 … -0.863171 -1.04216 -0.864467
-0.519967 -0.362793 0.0180011 -0.863171 -1.04216 -0.864467
-0.557313 -0.208757 -0.787896 -0.863171 -1.04216 -0.864467
-1.80425 -0.208757 -2.26537 -0.863171 -1.04216 -0.864467
-0.956085 -0.897266 -0.116315 -0.863171 -1.04216 -0.864467
-0.92936 -0.0578992 0.152317 … -0.863171 -1.04216 -0.864467
-2.28833 -0.0706369 -0.116315 -0.863171 -1.04216 -0.864467
0.223498 -1.41472 -0.116315 -0.299282 0.342627 0.199211
⋮ ⋱ ⋮
1.65688 -0.256675 0.0180011 … 1.7927 0.342627 1.26289
0.572009 0.239854 -0.787896 -0.863171 -1.04216 -0.864467
0.323806 -0.606718 -0.250631 0.342907 0.342627 0.199211
1.23668 2.54221 0.152317 1.89038 0.342627 1.26289
0.180156 0.154448 1.6298 0.70824 0.342627 1.79473
1.60906 1.10379 0.555266 … -0.863171 -1.04216 -0.864467
1.00362 0.113497 -0.384948 -0.863171 0.342627 -0.332628
1.25592 0.577607 0.555266 1.07357 0.342627 1.26289
2.09651 0.625489 -2.66832 1.67954 0.342627 0.55377
1.3214 -0.543304 -1.59379 1.89038 0.342627 -0.509907
1.30029 0.338384 0.555266 … 1.24263 0.342627 1.97201
1.80037 0.807764 0.555266 2.20528 0.342627 -0.155348
Now we solve the problem using solution path algorithm.
βpath, ρpath, = lsq_classopath(Xz, y);
βpath
8×9 Array{Float64,2}:
0.000197119 0.421559 0.461915 … 0.597645 0.603245 0.665561
0.0 0.0 0.0 0.232715 0.246191 0.266408
0.0 0.0 0.0 -0.0601318 -0.0936838 -0.158234
0.0 0.0 0.0 0.0882392 0.108105 0.14034
0.0 0.0 0.0403562 0.243534 0.252539 0.315269
0.0 0.0 0.0 … 0.0 0.0 -0.148508
0.0 0.0 0.0 0.0 0.0121929 0.0354652
0.0 0.0 0.0 0.0646193 0.0699873 0.125787
We plot the solution path below.
using Plots; pyplot(); colors = [:green :orange :black :purple :red :grey :brown :blue] plot(ρpath, βpath', xaxis = ("ρ", (minimum(ρpath), maximum(ρpath))), yaxis = ("β̂(ρ)"), label=labels, color=colors) title!("Prostate Data: Solution Path via Constrained Lasso")
Below, we solve the same problem using GLMNet.jl package.
using GLMNet; path = glmnet(Xz, y, intercept=false); path.betas
plot(path.lambda, path.betas', color=colors, label=labels, xaxis=("λ"), yaxis= ("β̂(λ)")) title!("Prostate Data: Solution Path via GLMNet.jl")
Follow the link to access the .ipynb file of this page.