GPU-Accelerated Primal Learning for Extremely Fast Large-Scale Classification


John Halloran and David Rocke


Department of Public Health Sciences, UC Davis

One of the most efficient methods to solve L2-regularized primal problems, such as logistic regression and linear support vector machine (SVM) classification, is the widely used trust region Newton algorithm, TRON. While TRON has recently been shown to enjoy substantial speedups on shared-memory multi-core systems, exploiting graphics processing units (GPUs) to speed up the method is significantly more difficult, owing to the highly complex and heavily sequential nature of the algorithm. In this work, we show that using judicious GPU-optimization principles, TRON training time for different losses and feature representations may be drastically reduced. For sparse feature sets, we show that using GPUs to train logistic regression classifiers in LIBLINEAR is up to an order of magnitude faster than solely using multithreading. For dense feature sets (which impose far more stringent memory constraints), we show that GPUs substantially reduce the lengthy SVM learning times required for state-of-the-art proteomics analysis, leading to dramatic improvements over recently proposed speedups. Furthermore, we show how GPU speedups may be mixed with multithreading to enable such speedups when the dataset is too large for GPU memory requirements; on a massive dense proteomics dataset of nearly a quarter-billion data instances, these mixed-architecture speedups reduce SVM analysis time from over half a week to less than a single day while using limited GPU memory.



Code for Experiments

All source code from the paper is available in the tarball tronGpu_neurips2020.tgz. To run the experiments described below, decompress this tarball and navigate to the resulting directory.

Primal GPU Optimizations for Logistic Regression in LIBLINEAR

Directories and corresponding methods from the paper:
tronLrMix: TRON-LR-MIX in the paper
tronLrGpu0: TRON-LR-GPU0 in the paper
tronLrGpu: TRON-LR-GPU in the paper
liblinear: standard single-threaded version of LIBLINEAR used to measure speedups in the paper
liblinear-multicore-2.30: TRON-LR-CPU in the paper, the multithread-optimized version of LIBLINEAR

To run the experiments in the paper, set the variable NR to the maximum number of compute threads supported on your Linux machine and run the following command:
./runLr.sh

The above script will build all the solvers, download the benchmark datasets, and test the various methods (see runLr.sh for further details). CUDA 10.0 (or greater), as well as g++/gcc and CMake, must be installed and in your path. Running the above command will write the output of each timing test to a corresponding file in the run directory. An example reported runtime in an output file will look like:
"Training took = 368.448910 seconds wall clock time."

Primal GPU Optimizations for SVM Learning in Percolator (Proteomics Analysis Software)

Directories and corresponding methods from the paper:
tronSvmGpu: TRON-SVM-GPU in the paper
tronSvmMix: TRON-SVM-MIX in the paper
tronSvmCPU: TRON-SVM-CPU in the paper (also includes the multithread-optimized version of L2-SVM-MFN)
percolator: implementation of Percolator's current solver (L2-SVM-MFN) in wide-spread use

The multithread-optimized implementations of TRON and L2-SVM-MFN are taken directly from the recent Percolator speedup paper:
Halloran, John T., and David M. Rocke.
"A matter of time: faster Percolator analysis via efficient SVM learning for large-scale proteomics."
Journal of Proteome Research 17.5 (2018): 1978-1982.

To test the implementations in the paper, set the variable NR to the maximum number of compute threads supported on your Linux machine and run the following command:
./runSvm.sh

The above script will build all the solvers, download the benchmark datasets, and test the various methods (see runSvm.sh for further details). CUDA 10.0 (or greater), as well as g++/gcc and CMake, must be installed and in your path. Running the above command will write the output of each timing test to a corresponding file in the run directory. An example reported runtime in an output file will look like:
"SVM training took 8069 seconds wall clock time."

Note that the above experiments may require substantial runtimes for the datasets used in the paper.

PyTorch GPU/CPU Logistic Regression Solvers

The Python script pyTorchLogisticRegression_rcv1.py times PyTorch gradient descent and L-BFGS logistic regression solvers (on both GPU and CPU) against TRON (as implemented in scikit-learn). PyTorch and scikit-learn must be installed for these tests (Python 3 is also recommended).
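For reference, the kind of comparison this script times looks roughly like the sketch below. This is a minimal illustration on synthetic dense data, not the script itself; the actual script uses the RCV1 dataset, and the data sizes, regularization strength, and iteration counts shown here are arbitrary assumptions.

# Minimal sketch (not pyTorchLogisticRegression_rcv1.py itself): time a PyTorch
# L-BFGS logistic regression fit on GPU and CPU against scikit-learn's
# LIBLINEAR/TRON-backed solver. All sizes and settings are illustrative only.
import time
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 200)).astype(np.float32)
y = (X @ rng.standard_normal(200).astype(np.float32) > 0).astype(np.float32)

def time_pytorch_lbfgs(device):
    Xt, yt = torch.from_numpy(X).to(device), torch.from_numpy(y).to(device)
    w = torch.zeros(X.shape[1], device=device, requires_grad=True)
    opt = torch.optim.LBFGS([w], max_iter=100)
    def closure():
        opt.zero_grad()
        # L2-regularized logistic loss; the regularization strength is an arbitrary choice
        loss = torch.nn.functional.binary_cross_entropy_with_logits(Xt @ w, yt) + 1e-4 * w.dot(w)
        loss.backward()
        return loss
    start = time.time()
    opt.step(closure)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.time() - start

for device in (["cuda"] if torch.cuda.is_available() else []) + ["cpu"]:
    print("PyTorch L-BFGS (%s): %.2f seconds" % (device, time_pytorch_lbfgs(device)))

start = time.time()
# solver="liblinear" wraps LIBLINEAR, which uses TRON for the L2-regularized primal (dual=False)
LogisticRegression(solver="liblinear").fit(X, y)
print("scikit-learn (LIBLINEAR/TRON): %.2f seconds" % (time.time() - start))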