BSPwSVMs

The goal of BSPwSVMs is to provide an efficient implementation of Binary Sparse Probability estimation with Weighted Support Vector Machines (wSVMs) with automated feature selection. This package implements the methodology developed in: Zeng, L. and Zhang, H. H. (2023). Sparse Learning and Class Probability Estimation with Weighted Support Vector Machines. arXiv:2312.10618

Overview

While traditional $L_2$-norm regulated weighted SVMs, such as those described in Wang, Zhang, and Wu (2019) and Zeng and Zhang (2022), provide a strong foundation for multiclass probability estimation, they face significant challenges in high-dimensional settings. When the number of features $p$ is much larger than the number of observations $n$, noise features can lead to overfitting and degrade the accuracy of the probability estimation.

BSPwSVMs (Zeng and Zhang, 2023) addresses these limitations by integrating sparsity with weighted SVM learning:

Automated Variable Selection: By incorporating sparsity-inducing penalties (such as the $L_1$ norm or Elastic Net), the model automatically identifies and retains only the most informative features. This feature selection is critical for model interpretability and performance when $p \gg n$.
Calibrated Probability Estimation: By combining automated feature selection with a weighted SVM framework, the package produces posterior probability estimates that are more robust to noise and better calibrated than standard wSVM implementations.
Variable Grouping and Importance: The framework allows for the assessment of feature importance and supports variable grouping, providing deeper insights into the underlying data structure.

Installation

You can install the development version of BSPwSVMs from GitHub with:

# install.packages("devtools")
devtools::install_github("zly555/BSPwSVMs")

Example

This example demonstrates the sparse learning and automated feature selection capabilities of the BSPwSVMs package using the simulation framework from Example 3 in Zeng and Zhang (2023). This scenario specifically evaluates model performance in the presence of unbalanced classes and highly correlated features where $p \gg n$.

We compare four distinct algorithms to showcase the advantages of our proposed methods:

$L_2$ - wSVMs (LTWSVM): The standard weighted SVM with no feature selection capability as baseline method.
Two-Stage wSVMs (LOTWSVM): Incorporates Lasso-style penalties for variable selection.
Elastic Net - wSVMs (ENPWSVM): Combines $L_1$ and $L_2$ penalties to handle correlated predictors.
Elastic Net - $L_2$ - wSVMs (ENTPWSVM): A hybrid scheme for optimized probability estimation and grouping.

The example tracks four key performance metrics:

Running Time: Computational efficiency in high-dimensional spaces.
EGKL: Expected Generalized Kullback-Leibler divergence, used as a rigorous measure of class probability estimation.
Test Error: Overall classification accuracy on out-of-sample data.
Feature Selection: The model's ability to identify the true informative predictors from the noise.

Install the Required R Library

library(BSPwSVMs)
library(lpSolveAPI)
library(quadprog)
library(osqp)
library(Matrix)
library(MASS)
library(ggplot2)
library(tictoc)

Generate the Simulated Data

#' Geneate the simulation data for example 3
#'
#' @param n sample size
#' @param p dimensionality
#' @param mu multivariate normal mean
#' @param sig_feature number of significant features
#' @param rho rho covariance matrix
#' @param unbalance.ratio class unbalance ratio
#' @param seed random seed
#'
#' @return Simulated data

generator_ex3 <- function(n, 
                          p, 
                          mu, 
                          sig_feature = 5, 
                          rho = 0, 
                          unbalance.ratio = 0.5, 
                          seed = 42){

  mu1 <-  c(rep(mu,sig_feature),rep(0,p-sig_feature))
  mu2 <-  c(rep(-mu,sig_feature),rep(0,p-sig_feature))
  
  Sig_star <- matrix(0, nrow =sig_feature, ncol=sig_feature)
  v <- c(1, rho, rho^2, rho^3, rho^4)
  Sig_star[1,] <- v
  Sig_star[sig_feature,] <- rev(v)
  Sig_star[2,] <- c(v[2],v[1],v[2],v[3],v[4])
  Sig_star[3,] <- c(v[3],v[2],v[1],v[2],v[3])
  Sig_star[sig_feature-1,] <- c(v[4],v[3],v[2],v[1],v[2])
  Sigma1 <-  diag(p)
  Sigma1[1:sig_feature, 1:sig_feature] <- Sig_star
  Sigma2 <- Sigma1

  set.seed(seed)
  X <- rbind(MASS::mvrnorm(floor(n*unbalance.ratio),mu1,Sigma1), mvrnorm(floor(n*(1-unbalance.ratio)),mu2,Sigma2))
  y <- c(rep(1,floor(n*unbalance.ratio)),rep(-1,floor(n*(1-unbalance.ratio))))

  return(list(X= as.matrix(X), y = y, n=n, p=p, sig_feature=sig_feature))
}

Running Setup

############################
###### Sample Setting ######
############################
n <- 200 # tuning plus training size
train.tune.ratio <- 0.5
p_dim <- c(100, 200, 400, 1000)
n_train <- floor(n*train.tune.ratio)

nsim <- 1 # number of simulatiom

# number of PIs 
m = floor(sqrt(n_train))-1

########################
###### Algorithms ######
########################
# We illustrate 4 algorithms, the L2 baseline, and 3 proposed sparse learning algorithms
# L2 wSVM use quadprog solver (Dense)
# L1+L2 wSVM use osqp solver (Sparse)
# ElasticNet_Primal use osqp solver (Sparse)
# ElasticNet_L2_Primal use osqp solver (Sparse)
baseline_algo <- c("L2_wSVM")

# algorithm with feature selection
algs <- c("L1_L2_wSVM", "ElasticNet_wSVM", "ElasticNet_L2_wSVM")

n_alg <- length(algs)
algo_index <- 1:n_alg

# set thresholds
max_freq <- floor(sqrt(n_train))-1
threshold <- c(0.2, 0.3, 0.5)
threshold_freq <- max_freq*threshold

###################################################
###### Change Seed for each simulation Files ######
###################################################

set.seed (42)
# rand_seed <- sample.int(10000, 3*nsim) # if use MCMC simulations
rand_seed <- c(11224, 33713, 64410) # just used for 1 simulation as example

##################################
######Prepare result Matrix ######
##################################

# the matrix to hold all the results with all the siumulation results
Result_Matrix <- as.data.frame(matrix(0, nrow = length(p_dim)*nsim*(n_alg+2), ncol = 20+2*length(threshold)))
col_names <- c('SimNum', 'Algorithm','Optim', 'Solver', 'N_train','Dims','ElapsedTime(mins)', 
  'LOG_ElapsedTime', 'TrueF','lambda1','lambda2','beta_prec', 'EGKL', 'TestErr', 'p_signal_union','p_noisy_union',
  'p_signal_intersec','p_noisy_intersec', 'p_signal_byrow','p_noisy_byrow')

for (fq in threshold_freq){
    col_names <- c(col_names, paste("p_signal_thres_", fq ,sep = ""))
    col_names <- c(col_names, paste("p_noisy_thres_", fq ,sep = ""))
}

colnames(Result_Matrix) <- col_names

# create a list of feature matrix based on differnt dimensions, and different algorithms
# last five row are the aggregation for different algorithms
F_matrix <- list()

for (p in p_dim){
  F_matrix[[paste("F_Matrix_p", p, sep = "")]] <- matrix(c(rep(algo_index, each = m), algo_index, rep(0,(m+1)*n_alg*p)), nrow = (m+1)*n_alg, ncol = p+1, byrow=FALSE)
}

# set inital count
n_records <- 1

MCMC Simulations

tictoc::tic() # time measurement

# simulated with different dimensions
for (p in p_dim){
  seed.cnt <- 1
    # loop for number of simulation (here use 1 simulation as illustration)
    for (sim in 1:nsim){
      # generate training, tuning and testing data
      train.data <- generator_ex3(n = n_train, 
                                  p = p, 
                                  mu = 1, 
                                  sig_feature = 5, 
                                  rho = 0.8, 
                                  unbalance.ratio = 0.6, 
                                  seed = rand_seed[seed.cnt])
      
      tune.data <- generator_ex3(n = n-n_train, 
                                 p = p, 
                                 mu = 1, 
                                 sig_feature = 5, 
                                 rho = 0.8, 
                                 unbalance.ratio = 0.6, 
                                 seed = rand_seed[seed.cnt+1])
      
      test.data <- generator_ex3(n = 25*n, 
                                 p = p, 
                                 mu = 1, 
                                 sig_feature = 5, 
                                 rho = 0.8, 
                                 unbalance.ratio = 0.6, 
                                 seed = rand_seed[seed.cnt+2])

      x.train <- train.data$X
      y.train <- train.data$y
      x.tune <- tune.data$X
      y.tune <- tune.data$y
      x.test <- test.data$X
      y.test <- test.data$y
      
      # significant feature list
      n_sig_features <- train.data$sig_feature
      sig_features <- 1:n_sig_features

      seed.cnt <- seed.cnt + 3

      #############
      ## L2 wSVM ##
      #############
      tictoc::tic()
      result <- L2wSVM(x.train, 
                       y.train, 
                       x.tune, 
                       y.tune, 
                       x.test, 
                       y.test, 
                       small.tune = FALSE)
      
      T.diff <- toc()
      elapsed_time <- round(as.numeric(T.diff$toc - T.diff$tic)/60, 3)
      log_elapsed_time <- round(log(elapsed_time),3)
      
      Result_Matrix[n_records, 1] <- sim
      Result_Matrix[n_records, 2] <- baseline_algo[1]
      Result_Matrix[n_records, 3] <- "Dual"
      Result_Matrix[n_records, 4] <- "quadprog"
      Result_Matrix[n_records, 5] <- n_train 
      Result_Matrix[n_records, 6] <- p
      Result_Matrix[n_records, 7] <- elapsed_time
      Result_Matrix[n_records, 8] <- log_elapsed_time
      Result_Matrix[n_records, 9] <- "x1-x5"
      Result_Matrix[n_records, 10] <- result$best.lambda
      Result_Matrix[n_records, 11] <- NA
      Result_Matrix[n_records, 12] <- NA
      Result_Matrix[n_records, 13] <- result$egkl
      Result_Matrix[n_records, 14] <- result$TestErr
      Result_Matrix[n_records, 15:length(col_names)] <- NA
      
      n_records <- n_records + 1

      ################
      ## L1+L2 wSVM ##
      ################
        
      algo_cnt <- 1
        
      tictoc::tic()
      result <- L1wSVM(x.train, 
                       y.train, 
                       x.tune, 
                       y.tune, 
                       x.test, 
                       y.test, 
                       osqp.option = TRUE, 
                       small.tune = FALSE, 
                       double.tune = FALSE, 
                       beta_precision = 2)
      
      T.diff <- toc()
      elapsed_time <- round(as.numeric(T.diff$toc - T.diff$tic)/60, 3)
      log_elapsed_time <- round(log(elapsed_time),3)
      
      Result_Matrix[n_records, 1] <- sim
      Result_Matrix[n_records, 2] <- algs[1]
      Result_Matrix[n_records, 3] <- "Dual"
      Result_Matrix[n_records, 4] <- "OSQP"
      Result_Matrix[n_records, 5] <- n_train 
      Result_Matrix[n_records, 6] <- p
      Result_Matrix[n_records, 7] <- elapsed_time
      Result_Matrix[n_records, 8] <- log_elapsed_time
      Result_Matrix[n_records, 9] <- "x1-x5"
      Result_Matrix[n_records, 10] <- result$best.lambda
      Result_Matrix[n_records, 11] <- NA
      Result_Matrix[n_records, 12] <- 10^((-1)*result$beta_precision)
      Result_Matrix[n_records, 13] <- result$egkl
      Result_Matrix[n_records, 14] <- result$TestErr
      
      # Union features
      aggr_f_sum <- colSums(result$f_matrix)
      Result_Matrix[n_records, 15] <- sum(aggr_f_sum[1:n_sig_features]!=0)
      Result_Matrix[n_records, 16] <- sum(aggr_f_sum[-(1:n_sig_features)]!=0)
     
      # Intersection features
      aggr_f_prod <- apply(result$f_matrix, 2, prod)
      Result_Matrix[n_records, 17] <- sum(aggr_f_prod[1:n_sig_features]!=0)
      Result_Matrix[n_records, 18] <- sum(aggr_f_prod[-(1:n_sig_features)]!=0)
      
      # sumemrize features in f_matrix in result list for each row
      Result_Matrix[n_records, 19] <- round(mean(apply(result$f_matrix[,1:n_sig_features], 1, sum)),2)
      Result_Matrix[n_records, 20] <- round(mean(apply(result$f_matrix[,-(1:n_sig_features)], 1, sum)),2)
      
      # Select feature by threshold
      cnt <- 1
      for (f in 1:length(threshold_freq)){
        Result_Matrix[n_records, 20+cnt] <- sum(aggr_f_sum[1:n_sig_features]>threshold_freq[f])
        Result_Matrix[n_records, 20+cnt+1] <- sum(aggr_f_sum[-(1:n_sig_features)]>threshold_freq[f])
        cnt <- cnt+2
      }

      F_matrix[[paste("F_Matrix_p", p, sep = "")]][((algo_index[algo_cnt]-1)*m+1):(algo_index[algo_cnt]*m),2:(p+1)] <- F_matrix[[paste("F_Matrix_p", p, sep = "")]][((algo_index[algo_cnt]-1)*m+1):(algo_index[algo_cnt]*m),2:(p+1)] + result$f_matrix
      
      n_records <- n_records + 1
      algo_cnt <- algo_cnt + 1
      
      #####################
      ## ElasticNet wSVM ##
      #####################
      
      tictoc::tic()
        
      result <- ElasticNetwSVM(x.train, 
                               y.train, 
                               x.tune, 
                               y.tune, 
                               x.test, 
                               y.test, 
                               l2.option = FALSE, 
                               beta_precision = 1e-2)

      T.diff <- tictoc::toc()
      elapsed_time <- round(as.numeric(T.diff$toc - T.diff$tic)/60, 3)
      log_elapsed_time <- round(log(elapsed_time),3)
      
      Result_Matrix[n_records, 1] <- sim
      Result_Matrix[n_records, 2] <- algs[2]
      Result_Matrix[n_records, 3] <- "Primal"
      Result_Matrix[n_records, 4] <- "OSQP"
      Result_Matrix[n_records, 5] <- n_train 
      Result_Matrix[n_records, 6] <- p
      Result_Matrix[n_records, 7] <- elapsed_time
      Result_Matrix[n_records, 8] <- log_elapsed_time
      Result_Matrix[n_records, 9] <- "x1-x5"
      Result_Matrix[n_records, 10] <- result$best.lambda1
      Result_Matrix[n_records, 11] <- result$best.lambda2
      Result_Matrix[n_records, 12] <- result$beta_precision
      Result_Matrix[n_records, 13] <- result$egkl
      Result_Matrix[n_records, 14] <- result$TestErr
      
      # Union features
      aggr_f_sum <- colSums(result$f_matrix)
      Result_Matrix[n_records, 15] <- sum(aggr_f_sum[1:n_sig_features]!=0)
      Result_Matrix[n_records, 16] <- sum(aggr_f_sum[-(1:n_sig_features)]!=0)
     
      # Intersection features
      aggr_f_prod <- apply(result$f_matrix, 2, prod)
      Result_Matrix[n_records, 17] <- sum(aggr_f_prod[1:n_sig_features]!=0)
      Result_Matrix[n_records, 18] <- sum(aggr_f_prod[-(1:n_sig_features)]!=0)
      
      # sumemrize features in f_matrix in result list for each row
      Result_Matrix[n_records, 19] <- round(mean(apply(result$f_matrix[,1:n_sig_features], 1, sum)),2)
      Result_Matrix[n_records, 20] <- round(mean(apply(result$f_matrix[,-(1:n_sig_features)], 1, sum)),2)
      
      # Select feature by threshold
      cnt <- 1
      for (f in 1:length(threshold_freq)){
        Result_Matrix[n_records, 20+cnt] <- sum(aggr_f_sum[1:n_sig_features]>threshold_freq[f])
        Result_Matrix[n_records, 20+cnt+1] <- sum(aggr_f_sum[-(1:n_sig_features)]>threshold_freq[f])
        cnt <- cnt+2
      }

      F_matrix[[paste("F_Matrix_p", p, sep = "")]][((algo_index[algo_cnt]-1)*m+1):(algo_index[algo_cnt]*m),2:(p+1)] <- F_matrix[[paste("F_Matrix_p", p, sep = "")]][((algo_index[algo_cnt]-1)*m+1):(algo_index[algo_cnt]*m),2:(p+1)] + result$f_matrix
      
      n_records <- n_records + 1
      algo_cnt <- algo_cnt + 1

      ##########################
      ## ElasticNet + L2 wSVM ##
      ##########################
     
      tictoc::tic()
      
      result <- ElasticNetwSVM(x.train, 
                               y.train, 
                               x.tune, 
                               y.tune, 
                               x.test, 
                               y.test, 
                               l2.option = TRUE, 
                               beta_precision = 1e-3)

      T.diff <- tictoc::toc()
      elapsed_time <- round(as.numeric(T.diff$toc - T.diff$tic)/60, 3)
      log_elapsed_time <- round(log(elapsed_time),3)
      
      Result_Matrix[n_records, 1] <- sim
      Result_Matrix[n_records, 2] <- algs[3]
      Result_Matrix[n_records, 3] <- "Primal"
      Result_Matrix[n_records, 4] <- "OSQP"
      Result_Matrix[n_records, 5] <- n_train 
      Result_Matrix[n_records, 6] <- p
      Result_Matrix[n_records, 7] <- elapsed_time
      Result_Matrix[n_records, 8] <- log_elapsed_time
      Result_Matrix[n_records, 9] <- "x1-x5"
      Result_Matrix[n_records, 10] <- result$best.lambda1
      Result_Matrix[n_records, 11] <- result$best.lambda2
      Result_Matrix[n_records, 12] <- result$beta_precision
      Result_Matrix[n_records, 13] <- result$egkl
      Result_Matrix[n_records, 14] <- result$TestErr
       
      # Union features
      aggr_f_sum <- colSums(result$f_matrix)
      Result_Matrix[n_records, 15] <- sum(aggr_f_sum[1:n_sig_features]!=0)
      Result_Matrix[n_records, 16] <- sum(aggr_f_sum[-(1:n_sig_features)]!=0)
     
      # Intersection features
      aggr_f_prod <- apply(result$f_matrix, 2, prod)
      Result_Matrix[n_records, 17] <- sum(aggr_f_prod[1:n_sig_features]!=0)
      Result_Matrix[n_records, 18] <- sum(aggr_f_prod[-(1:n_sig_features)]!=0)
      
      # sumemrize features in f_matrix in result list for each row
      Result_Matrix[n_records, 19] <- round(mean(apply(result$f_matrix[,1:n_sig_features], 1, sum)),2)
      Result_Matrix[n_records, 20] <- round(mean(apply(result$f_matrix[,-(1:n_sig_features)], 1, sum)),2)
      
      # Select feature by threshold
      cnt <- 1
      for (f in 1:length(threshold_freq)){
        Result_Matrix[n_records, 20+cnt] <- sum(aggr_f_sum[1:n_sig_features]>threshold_freq[f])
        Result_Matrix[n_records, 20+cnt+1] <- sum(aggr_f_sum[-(1:n_sig_features)]>threshold_freq[f])
        cnt <- cnt+2
      }

      F_matrix[[paste("F_Matrix_p", p, sep = "")]][((algo_index[algo_cnt]-1)*m+1):(algo_index[algo_cnt]*m),2:(p+1)] <- F_matrix[[paste("F_Matrix_p", p, sep = "")]][((algo_index[algo_cnt]-1)*m+1):(algo_index[algo_cnt]*m),2:(p+1)] + result$f_matrix
      
      n_records <- n_records + 1

  }

}

T.diff <- tictoc::toc()
time_elasped <- round(as.numeric(T.diff$toc - T.diff$tic)/60, 3)

# total running time for simulation
cat("Total Time for simulations:\n", time_elasped, "\n")
# Total Time for simulations:
# 63.276 

# write the simulation result into CSV file
write.csv(Result_Matrix, file = 'example3.csv')

Simulation Results (Example 3)

Algorithm	N	Dims	Time (min)	$λ_1$	$λ_2$	EGKL	TE	$p_{signal}$	$p_{noise}$
LTWSVM	100	100	0.795	0.280	NA	0.378	0.165	NA	NA
LOTWSVM	100	100	1.240	0.064	NA	0.391	0.145	4	13
ENPWSVM	100	100	1.269	0.055	0.0055	0.382	0.132	4	17
ENTPWSVM	100	100	1.337	0.055	55	0.370	0.125	5	4
LTWSVM	100	200	0.728	0.640	NA	0.397	0.162	NA	NA
LOTWSVM	100	200	1.569	0.064	NA	0.389	0.139	5	23
ENPWSVM	100	200	2.354	0.055	0.055	0.387	0.132	5	26
ENTPWSVM	100	200	2.442	0.550	0.055	0.389	0.137	5	157
LTWSVM	100	400	0.791	1.900	NA	0.457	0.160	NA	NA
LOTWSVM	100	400	2.958	0.100	NA	0.369	0.113	4	2
ENPWSVM	100	400	5.168	0.055	0.0055	0.319	0.126	4	31
ENTPWSVM	100	400	5.254	0.055	55	0.373	0.128	5	9
LTWSVM	100	1000	0.985	1.900	NA	0.465	0.204	NA	NA
LOTWSVM	100	1000	8.202	0.037	NA	0.388	0.159	4	103
ENPWSVM	100	1000	14.335	0.055	0.055	0.362	0.143	5	81
ENTPWSVM	100	1000	14.353	0.055	0.55	0.372	0.158	5	180

Conclusion

The BSPwSVMs package provides a comprehensive toolkit for high-dimensional binary classification where both predictive accuracy and model interpretability are paramount. By integrating weighted Support Vector Machines (wSVMs) with sparsity-inducing penalties, we address a critical gap in standard machine learning workflows: the need for reliable class probability estimation in the presence of noise and high-dimensional features.

Our key contributions include:

Accurate Probability Estimation: We leverage the weighted SVM framework to ensure posterior probabilities are well-calibrated. This is evidenced by the low EGKL (Expected Generalized Kullback-Leibler) scores across our high-dimensional simulations, indicating superior estimation of class membership likelihoods compared to standard SVM implementations.
Automatic Feature Selection: By utilizing $L_1$ and Elastic Net regularizers, the package spontaneously identifies informative predictors. This significantly reduces the impact of noise features—a critical capability for maintaining model performance in settings where the number of dimensions far exceeds the number of observations ($p \gg n$).
Robustness to Correlation: As demonstrated in our correlated feature simulations ($\rho=0.8$), BSPwSVMs maintains high signal capture ($p_{signal}$) where traditional $L_2$ methods fail to provide sparsity. Furthermore, the Elastic Net component effectively handles the grouping effect of highly correlated features, ensuring that relevant variable clusters are selected together rather than arbitrarily excluded.

For a deeper dive into the mathematical proofs, complexity analysis, and extensive Monte Carlo performance metrics, please consult:

Zeng, L. and Zhang, H. H. (2023). Sparse Learning and Class Probability Estimation with Weighted Support Vector Machines. arXiv:2312.10618

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
R		R
inst		inst
man		man
.Rbuildignore		.Rbuildignore
.Rhistory		.Rhistory
.gitignore		.gitignore
BSPwSVMs.Rproj		BSPwSVMs.Rproj
DESCRIPTION		DESCRIPTION
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BSPwSVMs

Overview

Installation

Example

Install the Required R Library

Generate the Simulated Data

Running Setup

MCMC Simulations

Simulation Results (Example 3)

Conclusion

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BSPwSVMs

Overview

Installation

Example

Install the Required R Library

Generate the Simulated Data

Running Setup

MCMC Simulations

Simulation Results (Example 3)

Conclusion

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages