The post “SVM Classification in F#” shows how fast is to implement SVM classification and Sequential Minimal Optimization (SMO) method in F#, but it doesn’t show how fast is the F# implementation. Unfortunately, the performance of that code is too small for practical applications. Apparently, there are intrinsic limitations of the .Net execution model and Mono virtual machine, which prevent to achieve a native code performance, and this overhead needs to be estimated. However, the main factor is the computational complexity of the implementation, and this problem can be solved.

## Setup

LIBSVM provides the collection of datasets a1a, …, a9a build from the original Adult dataset of John C. Platt. Different train/test splits allow estimating the performance of different parts of the SVM/SMO implementation in a precise manner.

The accuracy of the model is not important for the code. Thus, the parameters were chosen arbitrarily: maximal number of iterations - 1,000,000; kernel - RBF with \(\gamma\)=0.1; \(C\)=1.0; \(\epsilon\) = 0.001

All execution times are measured in seconds and obtained on the following hardware and software configuration.

**Computer**: MacBook Pro (Retina, 15-inch, Mid 2014)**Processor**: 2.2 GHz Intel Core i7**Memory**: 16 GB 1600 MHz DDR3**Operating System**: OS X El Capitan (10.11.6)**.NET Framework**: Mono 4.4.2

## Baseline

The first experiment with the F# implementation for `a1a`

dataset had shown that LIBSVM is about 30 times faster on training task
and about 5 times faster on testing task. However, LIBSVM is being developed more than 10 years and its authors optimized even
kernel function evaluation. Replacing generic RBF kernel \(exp(-\gamma (x_1-x_2)^2)=exp(-\gamma (x_1 \cdot x_1 + x_2 \cdot x_2 - 2 x_1 \cdot x_2)\)
with the custom implementation changes results to 16 and 4 times respectively because LIBSVM saves a lot of time by caching \(x \cdot x\) for
support vectors.

Dataset `a1a` (1605/30956) |
LIBSVM | F# (generic RBF) | F# (custom RBF) |
---|---|---|---|

train time |
0.109 | 3.469 | 1.806 |

test time |
1.727 | 9.409 | 7.060 |

# iterations |
974 | 983 | 983 |

# support vectors |
729 | 731 | 731 |

accuracy |
84.20% | 84.22% | 84.22% |

However, for the bigger training set `a9a`

results are different. Generic RBF kernel training is 17 times slower and
testing is 7 times slower. The custom RBF training is 13 times slower and testing is 5 times slower.

Dataset `a9a` (13844/16281) |
LIBSVM | F# (generic RBF) | F# (custom RBF) |
---|---|---|---|

train time |
56.949 | 974.535 | 762.467 |

test time |
13.584 | 97.925 | 67.811 |

# iterations |
20354 | 20889 | 20889 |

# support vectors |
11901 | 1185 | 11852 |

accuracy |
85.03% | 85.04% | 85.04% |

Testing times show that factor 5-6 is the cost of .Net code model and Mono virtual machine, and it is only slightly improved by F# code optimizations. Nonetheless, there are computational tricks that can seriously improve the training time.

See the baseline implementation: SMO.fs #1.

## Caching

Computing of the kernel function is the central element and the biggest performance bottleneck of SVM method. An ideal solution would be to compute the kernel matrix for the entire training set and use pre-computed values for the optimization, but, often, the matrix is too large to be stored in memory. For optimization of kernel evaluations, LIBSVM is using a circular buffer of recently used columns for caching kernel computations. Implementation of the circular buffer structure is quite easy in F#:

```
/// LRU list of computed columns
type private LRU(capacity : int, N : int, Q : int -> int -> float) =
let indices = Array.zeroCreate<int> capacity
let columns = Array.zeroCreate<float[]> capacity
let mutable first = 0
let mutable last = 0
/// Returns cache capacity
member lru.Capacity = capacity
/// Returns j-th column of Q matrix
member lru.Item (j : int) =
let index = lru.tryFindIndex j
if index <> not_found then
columns.[index]
else
let column = Array.init N (fun i -> Q i j)
lru.insert j column
column
/// Try to find computed column values
member private lru.tryFindIndex probe =
let mutable i = first
while (i <> last) && (indices.[i] <> probe) do
i <- (i + 1) % capacity
if i <> last then i else not_found
/// Insert new computed column values
member private lru.insert index column =
indices.[last] <- index
columns.[last] <- column
last <- (last + 1) % capacity
if first = last then first <- (first + 1) % capacity
/// Returns required capacity for the specified cache size and column length
static member capacity (cacheSize : int) (length : int) =
let columnSize = sizeof<float>*length + sizeof<int> + sizeof<float[]>
max 2 (cacheSize / columnSize)
```

There are few places, which can benefit from columns caching:

- gradient update function (uses cached i-th and j-th columns of the Q matrix)
`/// Update gradient let inline updateG i j a_i a_j = let Q_i = Q_lru.[i] let Q_j = Q_lru.[j] for t in 0..N-1 do G.[t] <- G.[t] + (Q_i.[t])*(a_i - A.[i]) + (Q_j.[t])*(a_j - A.[j])`

- optimization sub-problem solver (used cached i-th column and populates j-th column of the Q matrix)
`/// Solve an optimization sub-problem let inline solve i j = let Q_i = Q_lru.[i] let Q_j = Q_lru.[j] let a_ij = (Q_i.[i]) + (Q_j.[j]) - 2.0*(Q_i.[j])*Y.[i]*Y.[j] ...`

- working set selection (populates i-th column of the Q matrix)
`let inline minLowTo s = let ts = I_low () |> Seq.filter (fun t -> _y_gf t < _y_gf s) let Q_s = Q_lru.[s] let objective t = let a_ts = (Q t t) + (Q_s.[s]) - 2.0*(Q_s.[t])*Y.[t]*Y.[s] let b_ts = _y_gf t - _y_gf s -b_ts*b_ts/(if a_ts > 0.0 then a_ts else tau) if Seq.isEmpty ts then not_found else Seq.minBy objective ts`

Caching columns of the kernel matrix effectively reduces the computation time of Sequential Minimal Optimization
(SMO) method. 10MB cache makes F# SMO only 10 times slower than LIBSVM on `a1a`

and 200MB cache reduces the time
by 33% on `a9a`

.

Dataset `a1a` (1605/30956) |
LIBSVM | F# (generic RBF) | F# (custom RBF) |
---|---|---|---|

train time |
0.109 | 1.089 | 1.170 |

test time |
1.727 | 9.424 | 7.060 |

# iterations |
974 | 983 | 983 |

# support vectors |
729 | 731 | 731 |

accuracy |
84.20% | 84.22% | 84.22% |

Amazingly caching for the generic RBF function works better than for the custom RBF function, which shows only minor improvement with caching.

Dataset `a9a` (13844/16281) |
LIBSVM | F# (generic RBF) | F# (custom RBF) |
---|---|---|---|

train time |
56.949 | 650.800 | 760.476 |

test time |
13.584 | 89.397 | 67.968 |

# iterations |
20354 | 20889 | 20889 |

# support vectors |
11901 | 11852 | 11852 |

accuracy |
85.03% | 85.04% | 85.04% |

*Probably, “warming-up” of Mono JIT takes more time than the execution of the corresponding code.*

See the implementation with caching: SMO.fs #2.

## Shrinking

Shrinking is another trick allowing to reduce the number of kernel evaluations. The main idea of shrinking is that instead of optimization of all \(\mathbf{\alpha}\) variables, which include unbounded (\(0 < \alpha < C\)) and bounded \((\alpha = 0 \wedge \alpha = C)\) variables, every 1000 iterations the active set of optimization variables is shrinked and a smaller optimization problem for unbounded \(\mathbf{\alpha}_A\) is solved.

Caching and shrinking are reducing the training time for `a1a`

and `a9a`

by factor 10.

Dataset `a1a` (1605/30956) |
LIBSVM | F# (generic RBF) | F# (custom RBF) |
---|---|---|---|

train time |
0.109 | 1.288 | 1.156 |

test time |
1.727 | 9.982 | 7.570 |

# iterations |
974 | 1007 | 1007 |

# support vectors |
729 | 731 | 731 |

accuracy |
84.20% | 84.22% | 84.22% |

Dataset `a9a` (13844/16281) |
LIBSVM | F# (generic RBF) | F# (custom RBF) |
---|---|---|---|

train time |
56.949 | 539.381 | 451.254 |

test time |
13.584 | 97.499 | 79.363 |

# iterations |
20354 | 27142 | 27142 |

# support vectors |
11901 | 11955 | 11955 |

accuracy |
85.03% | 85.03% | 85.03% |

See the implementaion with shrinking: SMO #3.

## Seq to For

Often, the optimization of the code is the first step of the performance optimization, but the super optimized code cannot solve problems of the bad algorithm. Caching and shrinking are techniques, which optimized the algorithm and gave the 10 times performance boost.

This implementation is using F# sequences for modelling sets of active/inactive variables indices. This abstraction is very convenient for the prototyping and the algorithm optimization stages, but it has high costs because sequences in F# are enumerations and require memory allocation like any other object instance. Replacing sequences and sequence transformations by loops gives additional 25% of the performance.

Dataset `a1a` (1605/30956) |
LIBSVM | F# (generic RBF) | F# (custom RBF) |
---|---|---|---|

train time |
0.109 | 0.782 | 0.759 |

test time |
1.727 | 9.498 | 7.733 |

# iterations |
974 | 1007 | 1006 |

# support vectors |
729 | 731 | 731 |

accuracy |
84.20% | 84.22% | 84.22% |

Dataset `a9a` (13844/16281) |
LIBSVM | F# (generic RBF) | F# (custom RBF) |
---|---|---|---|

train time |
56.949 | 387.812 | 317.044 |

test time |
13.584 | 99.118 | 78.330 |

# iterations |
20354 | 21427 | 21427 |

# support vectors |
11901 | 11917 | 11917 |

accuracy |
85.03% | 85.05% | 85.05% |

See the implementation with `seq`

replaced by `for`

: SMO #4.

## Conclusions

F# language features like type inference and pattern matching are excellent for prototyping, but F# is not a strict functional language. Some parts of the functional code can be iteratively refactored into the procedural code for the sake of performance optimization. However, it is not possible to achieve the performance of the native code, and, probably, 5-6 times degradation of the performance is the cost of the faster development process.