The post “SVM Classification in F#” shows how quickly SVM classification and the Sequential Minimal Optimization (SMO) method can be implemented in F#, but it does not show how fast the resulting F# implementation is. Unfortunately, the performance of that code is too low for practical applications. There are intrinsic limitations of the .NET execution model and the Mono virtual machine that prevent reaching native code performance, and this overhead needs to be estimated. However, the main factor is the computational complexity of the implementation, and this problem can be solved.

Setup

LIBSVM provides a collection of datasets a1a, …, a9a built from the original Adult dataset used by John C. Platt. The different train/test splits allow estimating the performance of different parts of the SVM/SMO implementation in a precise manner.

The accuracy of the model is not important for this exercise, so the parameters were chosen arbitrarily: the maximal number of iterations is 1,000,000; the kernel is RBF with \(\gamma\) = 0.1; \(C\) = 1.0; \(\epsilon\) = 0.001.
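
For reference, these settings could be grouped into a parameter record like the following sketch (the record and field names are illustrative, not the actual SMO.fs API):

/// Illustrative parameter record; names are hypothetical, values are from the text above
type SmoParameters =
    { MaxIterations : int       // hard cap on the number of SMO iterations
      Gamma         : float     // RBF kernel width
      C             : float     // regularization parameter
      Epsilon       : float }   // tolerance of the stopping criterion

let parameters =
    { MaxIterations = 1000000; Gamma = 0.1; C = 1.0; Epsilon = 0.001 }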

All execution times are measured in seconds and were obtained on the following hardware and software configuration.

  • Computer: MacBook Pro (Retina, 15-inch, Mid 2014)
  • Processor: 2.2 GHz Intel Core i7
  • Memory: 16 GB 1600 MHz DDR3
  • Operating System: OS X El Capitan (10.11.6)
  • .NET Framework: Mono 4.4.2

Baseline

The first experiment with the F# implementation on the a1a dataset showed that LIBSVM is about 30 times faster on the training task and about 5 times faster on the testing task. However, LIBSVM has been developed for more than 10 years, and its authors optimized even the kernel function evaluation. Replacing the generic RBF kernel \(\exp(-\gamma (x_1-x_2)^2)=\exp(-\gamma (x_1 \cdot x_1 + x_2 \cdot x_2 - 2 x_1 \cdot x_2))\) with a custom implementation that, like LIBSVM, caches \(x \cdot x\) for the support vectors changes the ratios to 16 and 4 times respectively.
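
The gist of this optimization, as a minimal sketch for dense float[] vectors (the actual SMO.fs code and LIBSVM both work with sparse vectors):

/// Generic RBF kernel: recomputes the squared distance from scratch on every call
let rbfGeneric (gamma : float) (x1 : float[]) (x2 : float[]) =
    let mutable d2 = 0.0
    for k in 0 .. x1.Length - 1 do
        let d = x1.[k] - x2.[k]
        d2 <- d2 + d*d
    exp (-gamma * d2)

/// Custom RBF kernel: the squared norms x.x are precomputed once per dataset,
/// so each evaluation costs a single dot product
let makeRbfCached (gamma : float) (X : float[][]) =
    let dot (a : float[]) (b : float[]) =
        let mutable s = 0.0
        for k in 0 .. a.Length - 1 do s <- s + a.[k]*b.[k]
        s
    let norms = X |> Array.map (fun x -> dot x x)   // cached x.x values
    fun i j -> exp (-gamma * (norms.[i] + norms.[j] - 2.0 * dot X.[i] X.[j]))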

Dataset a1a (1605/30956)    LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                   0.109               3.469              1.806
test time                    1.727               9.409              7.060
# iterations                   974                 983                983
# support vectors              729                 731                731
accuracy                    84.20%              84.22%             84.22%

However, for the bigger training set a9a the results are different: with the generic RBF kernel, training is 17 times slower and testing is 7 times slower; with the custom RBF kernel, training is 13 times slower and testing is 5 times slower.

Dataset a9a (13844/16281)   LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                  56.949             974.535            762.467
test time                   13.584              97.925             67.811
# iterations                 20354               20889              20889
# support vectors            11901               11852              11852
accuracy                    85.03%              85.04%             85.04%

The testing times show that a factor of 5-6 is the cost of the .NET execution model and the Mono virtual machine, and this cost is only slightly reduced by F# code optimizations. Nonetheless, there are computational tricks that can seriously improve the training time.

See the baseline implementation: SMO.fs #1.

Caching

Computing the kernel function is the central operation and the biggest performance bottleneck of the SVM method. An ideal solution would be to compute the kernel matrix for the entire training set and use the pre-computed values during the optimization, but the matrix is often too large to be stored in memory. To optimize kernel evaluations, LIBSVM uses a circular buffer of recently used columns for caching kernel computations. The circular buffer structure is quite easy to implement in F#:

/// Sentinel index returned when a column is not found in the cache
let private not_found = -1

/// LRU list of computed columns
type private LRU(capacity : int, N : int, Q : int -> int -> float) =
    let indices = Array.zeroCreate<int> capacity
    let columns = Array.zeroCreate<float[]> capacity
    let mutable first = 0
    let mutable last = 0

    /// Returns cache capacity
    member lru.Capacity = capacity

    /// Returns j-th column of Q matrix
    member lru.Item (j : int) =
        let index = lru.tryFindIndex j
        if index <> not_found then
            columns.[index]
        else
            let column = Array.init N (fun i -> Q i j)
            lru.insert j column
            column

    /// Try to find computed column values
    member private lru.tryFindIndex probe =
        let mutable i = first
        while (i <> last) && (indices.[i] <> probe) do
            i <- (i + 1) % capacity

        if i <> last then i else not_found

    /// Insert new computed column values
    member private lru.insert index column =
        indices.[last] <- index
        columns.[last] <- column
        last <- (last + 1) % capacity

        if first = last then first <- (first + 1) % capacity

    /// Returns the required capacity (in columns) for the specified cache size in bytes
    static member capacity (cacheSize : int) (length : int) =
        // per-column footprint: the column data + its index + the array reference
        let columnSize = sizeof<float>*length + sizeof<int> + sizeof<float[]>
        max 2 (cacheSize / columnSize)
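
A hypothetical usage sketch (the training set X, the labels Y, and the kernel function K are assumptions, not part of the class):

// Hypothetical setup: X : float[][], Y : float[], K : float[] -> float[] -> float
let N = X.Length
let Q i j = Y.[i] * Y.[j] * K X.[i] X.[j]           // element of the Q matrix
let lru = LRU(LRU.capacity (10*1024*1024) N, N, Q)  // ~10 MB worth of columns
let Q_0 = lru.[0]                                   // computed once, then served from cache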

There are a few places that can benefit from column caching:

  • gradient update function (uses the cached i-th and j-th columns of the Q matrix)
    /// Update gradient
    let inline updateG i j a_i a_j =
      let Q_i = Q_lru.[i]
      let Q_j = Q_lru.[j]
      for t in 0..N-1 do
          G.[t] <- G.[t] + (Q_i.[t])*(a_i - A.[i]) + (Q_j.[t])*(a_j - A.[j])
    
  • optimization sub-problem solver (uses the cached i-th column and populates the j-th column of the Q matrix)
    /// Solve an optimization sub-problem
    let inline solve i j =
      let Q_i = Q_lru.[i]
      let Q_j = Q_lru.[j]
      let a_ij = (Q_i.[i]) + (Q_j.[j]) - 2.0*(Q_i.[j])*Y.[i]*Y.[j]
      ...
    
  • working set selection (populates the i-th column of the Q matrix)
    let inline minLowTo s =
      let ts = I_low () |> Seq.filter (fun t -> _y_gf t < _y_gf s)
      let Q_s = Q_lru.[s]
      let objective t =
          let a_ts = (Q t t) + (Q_s.[s]) - 2.0*(Q_s.[t])*Y.[t]*Y.[s]
          let b_ts = _y_gf t - _y_gf s
          -b_ts*b_ts/(if a_ts > 0.0 then a_ts else tau)
      if Seq.isEmpty ts then not_found else Seq.minBy objective ts
    

Caching columns of the kernel matrix effectively reduces the computation time of the Sequential Minimal Optimization (SMO) method: a 10 MB cache makes the F# SMO only 10 times slower than LIBSVM on a1a, and a 200 MB cache reduces the training time by 33% on a9a.
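
To make the cache size concrete: by the `capacity` formula above (assuming 8-byte array references on 64-bit Mono), one cached a1a column occupies \(8 \times 1605 + 4 + 8 = 12852\) bytes, so a 10 MB cache holds \(\lfloor 10 \cdot 2^{20} / 12852 \rfloor = 815\) columns, roughly half of the full \(1605 \times 1605\) kernel matrix.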

Dataset a1a (1605/30956)    LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                   0.109               1.089              1.170
test time                    1.727               9.424              7.060
# iterations                   974                 983                983
# support vectors              729                 731                731
accuracy                    84.20%              84.22%             84.22%

Amazingly, caching works better for the generic RBF function than for the custom one, which shows only a minor improvement with caching.

Dataset a9a (13844/16281)   LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                  56.949             650.800            760.476
test time                   13.584              89.397             67.968
# iterations                 20354               20889              20889
# support vectors            11901               11852              11852
accuracy                    85.03%              85.04%             85.04%

Probably, “warming up” the Mono JIT takes more time than the execution of the corresponding code.

See the implementation with caching: SMO.fs #2.

Shrinking

Shrinking is another trick that reduces the number of kernel evaluations. The main idea of shrinking is that, instead of optimizing over all \(\mathbf{\alpha}\) variables, which include unbounded (\(0 < \alpha < C\)) and bounded (\(\alpha = 0 \vee \alpha = C\)) ones, every 1000 iterations the active set of optimization variables is shrunk and a smaller optimization problem over the unbounded variables \(\mathbf{\alpha}_A\) is solved:

\[\begin{array}{ll} \mathop{\min}_{\mathbf{\alpha}_A} & \quad f(\mathbf{\alpha}_A) = \frac{1}{2}\mathbf{\alpha}_A^T Q_{AA} \mathbf{\alpha}_A + (\mathbf{p}_A + Q_{AN}\mathbf{\alpha}_N^k)^T \mathbf{\alpha}_A \\ \text{subject to} & \quad 0 \leq \alpha_i \leq C,\ i \in A \\ & \quad \mathbf{y}_A^T \mathbf{\alpha}_A = \Delta - \mathbf{y}_N^T \mathbf{\alpha}_N^k \end{array}\]
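
A simplified sketch of the shrinking step itself, following the condition from the LIBSVM paper (the arrays `A`, `Y`, `G` and the constant `C` are assumed to be the same module state as in the post's SMO code; `m` and `M` are the maximal and minimal values of \(-y_t \nabla f(\mathbf{\alpha})_t\) over the up/low index sets; LIBSVM's gradient reconstruction before the final convergence check is omitted):

/// Filter out bounded variables whose gradient already satisfies the
/// optimality bounds; the remaining indices form the new active set
let inline shrink (active : int[]) (m : float) (M : float) =
    active
    |> Array.filter (fun t ->
        let y_gf = -Y.[t]*G.[t]
        let shrinkable =
            (y_gf > m && ((A.[t] = C && Y.[t] > 0.0) || (A.[t] = 0.0 && Y.[t] < 0.0)))
         || (y_gf < M && ((A.[t] = 0.0 && Y.[t] > 0.0) || (A.[t] = C && Y.[t] < 0.0)))
        not shrinkable)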

Together, caching and shrinking bring the training time on both a1a and a9a to within roughly a factor of 10 of LIBSVM.

Dataset a1a (1605/30956)    LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                   0.109               1.288              1.156
test time                    1.727               9.982              7.570
# iterations                   974                1007               1007
# support vectors              729                 731                731
accuracy                    84.20%              84.22%             84.22%

Dataset a9a (13844/16281)   LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                  56.949             539.381            451.254
test time                   13.584              97.499             79.363
# iterations                 20354               27142              27142
# support vectors            11901               11955              11955
accuracy                    85.03%              85.03%             85.03%

See the implementation with shrinking: SMO.fs #3.

Seq to For

Optimizing the code is often the first step of performance tuning, but even super-optimized code cannot fix a bad algorithm. Caching and shrinking are techniques that optimize the algorithm itself, and they gave the 10-times performance boost.

This implementation uses F# sequences to model the sets of active/inactive variable indices. This abstraction is very convenient during the prototyping and algorithm optimization stages, but it has a high cost, because F# sequences are enumerations and require memory allocation like any other object instance. Replacing sequences and sequence transformations with plain loops gives roughly an additional 25% of performance.
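
As an illustration of this kind of rewrite, here is a sketch reusing `_y_gf`, `I_up` and `not_found` from the snippets above; `is_up` is a hypothetical predicate equivalent to membership in `I_up`:

// Sequence-based selection allocates an enumerator on every call:
let maxUp () =
    let ts = I_up ()
    if Seq.isEmpty ts then not_found else Seq.maxBy (fun t -> _y_gf t) ts

// The loop-based equivalent visits the same elements without allocations:
let maxUp' () =
    let mutable best = not_found
    let mutable bestValue = System.Double.NegativeInfinity
    for t in 0 .. N-1 do
        if is_up t then
            let v = _y_gf t
            if v > bestValue then
                best <- t
                bestValue <- v
    best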

Dataset a1a (1605/30956)    LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                   0.109               0.782              0.759
test time                    1.727               9.498              7.733
# iterations                   974                1007               1006
# support vectors              729                 731                731
accuracy                    84.20%              84.22%             84.22%

Dataset a9a (13844/16281)   LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                  56.949             387.812            317.044
test time                   13.584              99.118             78.330
# iterations                 20354               21427              21427
# support vectors            11901               11917              11917
accuracy                    85.03%              85.05%             85.05%

See the implementation with seq replaced by for: SMO.fs #4.

Conclusions

F# language features like type inference and pattern matching are excellent for prototyping, and F# is not a purely functional language, so parts of the functional code can be iteratively refactored into procedural code for the sake of performance. However, it is not possible to reach the performance of native code, and the remaining 5-6 times slowdown is probably the price of the faster development process.