The post “SVM Classification in F#” shows how quickly SVM classification and the Sequential Minimal Optimization (SMO) method can be implemented in F#, but it does not show how fast the resulting F# implementation is. Unfortunately, the performance of that code is too low for practical applications. There are intrinsic limitations of the .NET execution model and the Mono virtual machine that prevent reaching native code performance, and this overhead needs to be estimated. However, the main factor is the computational complexity of the implementation, and this problem can be solved.

Setup

LIBSVM provides a collection of datasets a1a, …, a9a built from the original Adult dataset used by John C. Platt. The different train/test splits allow estimating the performance of different parts of the SVM/SMO implementation in a precise manner.

The accuracy of the model is not important for this exercise, so the parameters were chosen arbitrarily: the maximal number of iterations is 1,000,000; the kernel is RBF with \(\gamma\) = 0.1; \(C\) = 1.0; \(\epsilon\) = 0.001.
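
For reference, these settings could be grouped into a parameter record like the following sketch (the record and field names are illustrative, not the actual SMO.fs API):

/// Illustrative parameter record; names are hypothetical, values are from the text above
type SmoParameters =
    { MaxIterations : int       // hard cap on the number of SMO iterations
      Gamma         : float     // RBF kernel width
      C             : float     // regularization parameter
      Epsilon       : float }   // tolerance of the stopping criterion

let parameters =
    { MaxIterations = 1000000; Gamma = 0.1; C = 1.0; Epsilon = 0.001 }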

All execution times are measured in seconds and were obtained on the following hardware and software configuration.

  • Computer: MacBook Pro (Retina, 15-inch, Mid 2014)
  • Processor: 2.2 GHz Intel Core i7
  • Memory: 16 GB 1600 MHz DDR3
  • Operating System: OS X El Capitan (10.11.6)
  • .NET Framework: Mono 4.4.2

Baseline

The first experiment with the F# implementation on the a1a dataset showed that LIBSVM is about 30 times faster on the training task and about 5 times faster on the testing task. However, LIBSVM has been developed for more than 10 years, and its authors optimized even the kernel function evaluation. Replacing the generic RBF kernel \(\exp(-\gamma (x_1-x_2)^2)=\exp(-\gamma (x_1 \cdot x_1 + x_2 \cdot x_2 - 2 x_1 \cdot x_2))\) with a custom implementation that, like LIBSVM, caches \(x \cdot x\) for the support vectors changes the ratios to 16 and 4 times respectively.
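
The gist of this optimization, as a minimal sketch for dense float[] vectors (the actual SMO.fs code and LIBSVM both work with sparse vectors):

/// Generic RBF kernel: recomputes the squared distance from scratch on every call
let rbfGeneric (gamma : float) (x1 : float[]) (x2 : float[]) =
    let mutable d2 = 0.0
    for k in 0 .. x1.Length - 1 do
        let d = x1.[k] - x2.[k]
        d2 <- d2 + d*d
    exp (-gamma * d2)

/// Custom RBF kernel: the squared norms x.x are precomputed once per dataset,
/// so each evaluation costs a single dot product
let makeRbfCached (gamma : float) (X : float[][]) =
    let dot (a : float[]) (b : float[]) =
        let mutable s = 0.0
        for k in 0 .. a.Length - 1 do s <- s + a.[k]*b.[k]
        s
    let norms = X |> Array.map (fun x -> dot x x)   // cached x.x values
    fun i j -> exp (-gamma * (norms.[i] + norms.[j] - 2.0 * dot X.[i] X.[j]))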

Dataset a1a (1605/30956)    LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                   0.109               3.469              1.806
test time                    1.727               9.409              7.060
# iterations                   974                 983                983
# support vectors              729                 731                731
accuracy                    84.20%              84.22%             84.22%

However, for the bigger training set a9a the results are different: with the generic RBF kernel, training is 17 times slower and testing is 7 times slower; with the custom RBF kernel, training is 13 times slower and testing is 5 times slower.

Dataset a9a (13844/16281)   LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                  56.949             974.535            762.467
test time                   13.584              97.925             67.811
# iterations                 20354               20889              20889
# support vectors            11901               11852              11852
accuracy                    85.03%              85.04%             85.04%

The testing times show that a factor of 5-6 is the cost of the .NET execution model and the Mono virtual machine, and this cost is only slightly reduced by F# code optimizations. Nonetheless, there are computational tricks that can seriously improve the training time.

See the baseline implementation: SMO.fs #1.

Caching

Computing the kernel function is the central operation and the biggest performance bottleneck of the SVM method. An ideal solution would be to compute the kernel matrix for the entire training set and use the pre-computed values during the optimization, but the matrix is often too large to be stored in memory. To optimize kernel evaluations, LIBSVM uses a circular buffer of recently used columns for caching kernel computations. The circular buffer structure is quite easy to implement in F#:

/// Sentinel index returned when a column is not found in the cache
let private not_found = -1

/// LRU list of computed columns
type private LRU(capacity : int, N : int, Q : int -> int -> float) =
    let indices = Array.zeroCreate<int> capacity
    let columns = Array.zeroCreate<float[]> capacity
    let mutable first = 0
    let mutable last = 0

    /// Returns cache capacity
    member lru.Capacity = capacity

    /// Returns j-th column of Q matrix
    member lru.Item (j : int) =
        let index = lru.tryFindIndex j
        if index <> not_found then
            columns.[index]
        else
            let column = Array.init N (fun i -> Q i j)
            lru.insert j column
            column

    /// Try to find computed column values
    member private lru.tryFindIndex probe =
        let mutable i = first
        while (i <> last) && (indices.[i] <> probe) do
            i <- (i + 1) % capacity

        if i <> last then i else not_found

    /// Insert new computed column values
    member private lru.insert index column =
        indices.[last] <- index
        columns.[last] <- column
        last <- (last + 1) % capacity

        if first = last then first <- (first + 1) % capacity

    /// Returns the required capacity (in columns) for the specified cache size in bytes
    static member capacity (cacheSize : int) (length : int) =
        // per-column footprint: the column data + its index + the array reference
        let columnSize = sizeof<float>*length + sizeof<int> + sizeof<float[]>
        max 2 (cacheSize / columnSize)
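
A hypothetical usage sketch (the training set X, the labels Y, and the kernel function K are assumptions, not part of the class):

// Hypothetical setup: X : float[][], Y : float[], K : float[] -> float[] -> float
let N = X.Length
let Q i j = Y.[i] * Y.[j] * K X.[i] X.[j]           // element of the Q matrix
let lru = LRU(LRU.capacity (10*1024*1024) N, N, Q)  // ~10 MB worth of columns
let Q_0 = lru.[0]                                   // computed once, then served from cache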

There are a few places that can benefit from column caching:

  • gradient update function (uses the cached i-th and j-th columns of the Q matrix)
    /// Update gradient
    let inline updateG i j a_i a_j =
      let Q_i = Q_lru.[i]
      let Q_j = Q_lru.[j]
      for t in 0..N-1 do
          G.[t] <- G.[t] + (Q_i.[t])*(a_i - A.[i]) + (Q_j.[t])*(a_j - A.[j])
    
  • optimization sub-problem solver (uses the cached i-th column and populates the j-th column of the Q matrix)
    /// Solve an optimization sub-problem
    let inline solve i j =
      let Q_i = Q_lru.[i]
      let Q_j = Q_lru.[j]
      let a_ij = (Q_i.[i]) + (Q_j.[j]) - 2.0*(Q_i.[j])*Y.[i]*Y.[j]
      ...
    
  • working set selection (populates the i-th column of the Q matrix)
    let inline minLowTo s =
      let ts = I_low () |> Seq.filter (fun t -> _y_gf t < _y_gf s)
      let Q_s = Q_lru.[s]
      let objective t =
          let a_ts = (Q t t) + (Q_s.[s]) - 2.0*(Q_s.[t])*Y.[t]*Y.[s]
          let b_ts = _y_gf t - _y_gf s
          -b_ts*b_ts/(if a_ts > 0.0 then a_ts else tau)
      if Seq.isEmpty ts then not_found else Seq.minBy objective ts
    

Caching columns of the kernel matrix effectively reduces the computation time of the Sequential Minimal Optimization (SMO) method: a 10 MB cache makes the F# SMO only 10 times slower than LIBSVM on a1a, and a 200 MB cache reduces the training time by 33% on a9a.
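
To make the cache size concrete: by the `capacity` formula above (assuming 8-byte array references on 64-bit Mono), one cached a1a column occupies \(8 \times 1605 + 4 + 8 = 12852\) bytes, so a 10 MB cache holds \(\lfloor 10 \cdot 2^{20} / 12852 \rfloor = 815\) columns, roughly half of the full \(1605 \times 1605\) kernel matrix.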

Dataset a1a (1605/30956)    LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                   0.109               1.089              1.170
test time                    1.727               9.424              7.060
# iterations                   974                 983                983
# support vectors              729                 731                731
accuracy                    84.20%              84.22%             84.22%

Amazingly, caching works better for the generic RBF function than for the custom one, which shows only a minor improvement with caching.

Dataset a9a (13844/16281)   LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                  56.949             650.800            760.476
test time                   13.584              89.397             67.968
# iterations                 20354               20889              20889
# support vectors            11901               11852              11852
accuracy                    85.03%              85.04%             85.04%

Probably, “warming up” the Mono JIT takes more time than the execution of the corresponding code.

See the implementation with caching: SMO.fs #2.

Shrinking

Shrinking is another trick that reduces the number of kernel evaluations. The main idea of shrinking is that, instead of optimizing over all \(\mathbf{\alpha}\) variables, which include unbounded (\(0 < \alpha < C\)) and bounded (\(\alpha = 0 \vee \alpha = C\)) ones, every 1000 iterations the active set of optimization variables is shrunk and a smaller optimization problem over the unbounded variables \(\mathbf{\alpha}_A\) is solved:

\[\begin{array}{ll} \mathop{\min}_{\mathbf{\alpha}_A} & \quad f(\mathbf{\alpha}_A) = \frac{1}{2}\mathbf{\alpha}_A^T Q_{AA} \mathbf{\alpha}_A + (\mathbf{p}_A + Q_{AN}\mathbf{\alpha}_N^k)^T \mathbf{\alpha}_A \\ \text{subject to} & \quad 0 \leq \alpha_i \leq C,\ i \in A \\ & \quad \mathbf{y}_A^T \mathbf{\alpha}_A = \Delta - \mathbf{y}_N^T \mathbf{\alpha}_N^k \end{array}\]
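
A simplified sketch of the shrinking step itself, following the condition from the LIBSVM paper (the arrays `A`, `Y`, `G` and the constant `C` are assumed to be the same module state as in the post's SMO code; `m` and `M` are the maximal and minimal values of \(-y_t \nabla f(\mathbf{\alpha})_t\) over the up/low index sets; LIBSVM's gradient reconstruction before the final convergence check is omitted):

/// Filter out bounded variables whose gradient already satisfies the
/// optimality bounds; the remaining indices form the new active set
let inline shrink (active : int[]) (m : float) (M : float) =
    active
    |> Array.filter (fun t ->
        let y_gf = -Y.[t]*G.[t]
        let shrinkable =
            (y_gf > m && ((A.[t] = C && Y.[t] > 0.0) || (A.[t] = 0.0 && Y.[t] < 0.0)))
         || (y_gf < M && ((A.[t] = 0.0 && Y.[t] > 0.0) || (A.[t] = C && Y.[t] < 0.0)))
        not shrinkable)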

Together, caching and shrinking bring the training time on both a1a and a9a to within roughly a factor of 10 of LIBSVM.

Dataset a1a (1605/30956)    LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                   0.109               1.288              1.156
test time                    1.727               9.982              7.570
# iterations                   974                1007               1007
# support vectors              729                 731                731
accuracy                    84.20%              84.22%             84.22%

Dataset a9a (13844/16281)   LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                  56.949             539.381            451.254
test time                   13.584              97.499             79.363
# iterations                 20354               27142              27142
# support vectors            11901               11955              11955
accuracy                    85.03%              85.03%             85.03%

See the implementation with shrinking: SMO.fs #3.

Seq to For

Optimizing the code is often the first step of performance tuning, but even super-optimized code cannot fix a bad algorithm. Caching and shrinking are techniques that optimize the algorithm itself, and they gave the 10-times performance boost.

This implementation uses F# sequences to model the sets of active/inactive variable indices. This abstraction is very convenient during the prototyping and algorithm optimization stages, but it has a high cost, because F# sequences are enumerations and require memory allocation like any other object instance. Replacing sequences and sequence transformations with plain loops gives roughly an additional 25% of performance.
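
As an illustration of this kind of rewrite, here is a sketch reusing `_y_gf`, `I_up` and `not_found` from the snippets above; `is_up` is a hypothetical predicate equivalent to membership in `I_up`:

// Sequence-based selection allocates an enumerator on every call:
let maxUp () =
    let ts = I_up ()
    if Seq.isEmpty ts then not_found else Seq.maxBy (fun t -> _y_gf t) ts

// The loop-based equivalent visits the same elements without allocations:
let maxUp' () =
    let mutable best = not_found
    let mutable bestValue = System.Double.NegativeInfinity
    for t in 0 .. N-1 do
        if is_up t then
            let v = _y_gf t
            if v > bestValue then
                best <- t
                bestValue <- v
    best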

Dataset a1a (1605/30956)    LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                   0.109               0.782              0.759
test time                    1.727               9.498              7.733
# iterations                   974                1007               1006
# support vectors              729                 731                731
accuracy                    84.20%              84.22%             84.22%

Dataset a9a (13844/16281)   LIBSVM    F# (generic RBF)    F# (custom RBF)
train time                  56.949             387.812            317.044
test time                   13.584              99.118             78.330
# iterations                 20354               21427              21427
# support vectors            11901               11917              11917
accuracy                    85.03%              85.05%             85.05%

See the implementation with seq replaced by for: SMO.fs #4.

Conclusions

F# language features like type inference and pattern matching are excellent for prototyping, and F# is not a purely functional language, so parts of the functional code can be iteratively refactored into procedural code for the sake of performance. However, it is not possible to reach the performance of native code, and the remaining 5-6 times slowdown is probably the price of the faster development process.