Summary statistics are commonly used to build a simple quantitative description of a set of observations. Simple descriptions include mean, variance, skewness and kurtosis, which are quantitative measures of location, spread and shape of the data distribution. However, straightforward implementations of these measures in F# do not scale to large amounts of data. There are more sophisticated methods, but imperative implementations of those methods use mutable variables. Nonetheless, mathematical definitions of those methods allow to build effective functional implementation using higher-order functions in F#.

Simple Mean and Variance

The F# implementation of mean \(\bar{x}=\sum_{i=1}^N x_i\) for a sequence of any length is simple and needs only one pass for processing of the entire sequence:

let mean s = s |> Seq.average

However, the straightforward implementation in F# of variance $s^2=\frac{\sum_{i=1}^N (x_i-\bar{x})^2}{N-1}$ requires first to compute mean and then compute variance, i.e. the sequence will be iterated two times:

let variance s =
  let M = mean s
  let N = Seq.length s
  (s |> Seq.sumBy (fun x -> (x - N) * (x - N))) / (n - 1)

There is another way to compute mean \(\bar{x}=\sum_{i=1}^N x_i\) and variance \(s^2=\frac{\sum_{i=1}^N x_i^2 - (\sum_{i=1}^N x_i)^2/N}{N-1}\) in one pass, but in practice the precision of the computation will be much less than the inherent precision of floating-point operations in general. Even above mentioned naive implementation is numerically stable only if \(N\) is small.

Advanced Implementation

There is a numerically stable algorithm for the sample variance and mean that is based on Welford’s recurrent equation 1:

\[M_{2,k}=M_{2,k-1} + (x_k - M_{k-1})^2\]
\[M_k=M_{k-1} + \frac{(x_k - M_{k-1})}{k}\]
\[\bar{x}=M_N, s_N^2=\frac{M_{2,N}}{N-1}\]

Through the Python implementation or any other imperative implementation is using mutable variabes for accumulating \(M_k\) and \(M_{2,k}\)

def online_variance(data):
    k = 0
    M_k = 0.0
    M_2k = 0.0

    for x in data:
        k = k + 1
        delta = x - M_k
        M_k = M_k + delta/k
        M_2k = M_2k + delta*(x - M_k)

    if n < 2:
        return float('nan');
    else:
        return M_2k / (k - 1)

It is possible to reproduce this computation in F#, but it would be both “ugly” and “twisting” the mathematical definition. Moreover, the F# impementation of \(M_k\) and \(M_{2,k}\) recurrent equation is quite simple:

let M (k_1, M_k_1, M_2k_1) x =
  let k = k_1 + 1
  let delta = x - M_k_1
  let M_k = M_k_1 + delta / n
  let M_2k = M_2k_1 + delta * (x - M_k)
  (k, M_k, M_2k)

The function M is intentionally using tuple as the first argument, which structure is similar to the returned result. Using this definition it is easy to compute mean and variance with Seq.fold.

let meanAndVariance s =
  let N, M_N, M_2N = s |> Seq.fold M (0, 0.0, 0.0)
  M_N, M_2N / (N - 1)

Correctness of Implementation

The implementation of the online mean and variance computation is relatively simple. Though, it is also easy to make a mistake and use M_k_1 instead of M_k in M_2k_1 + delta * (x - M_k). The first thing that comes in mind in order to avoid bugs is to write a unit test 2:

open FsUnit
...
let m, v = meanAndVariance (seq { 1.0..1.0..10.0 })
m |> should be ((equalWithin 0.001) 5.5)
v |> should be ((equalWithin 0.001) 9.166)

Nonetheless, many questions arise:

How many unit tests are needed to prove that the implementation is correct? 1, 2, 10, 100, 1000, … There is no answer to this question and, fortunately, it is not needed.

There is another question that have the definite answer:

What properties need to be checked in order to be sure that the implementation is correct? The answer is simple enough - mean and variance computed by the online algoritm should be equal to mean and variance computed by naive implementation of mean and variance for any valid input 3:

let meanAndVarianceAreCorrect s =
    (Seq.length s > 2) ==> (lazy
        let m1, v1 = meanAndVariance s
        let m2, v2 = mean s, variance s
        abs(m1 - m2) < 0.00001 && abs(v1 - v2) < 0.00001)

Valid inputs are sequences of normal float numbers with a limited variance:

let validSequences = Arb.generate<NormalFloat> |> Gen.map float
        |> Gen.suchThat (fun f -> -1e120 < f && f < 1e120)
        |> Gen.nonEmptyListOf |> Gen.map List.toSeq
        |> Arb.fromGen

and a quick check shows that the required property holds for 100 randomly generated valid inputs:

Check.Quick (Prop.forAll validSequences meanAndVarianceAreCorrect)

Advanced Applications

A brief analysis of the function M shows that is has bigger value than the simple computation of mean and variance for the sequence, e.g.:

  • First, it allows to compute both the variance \(\sigma^2\) of the population of a total size \(N\) and the unbiased estimate \(s^2\) for a finite number \(n\) of samples:
    let N, M_N, M_2N = s |> Seq.fold M (0, 0.0, 0.0)
    let sigma_2 = M_2N/N
    let s_2 = M_2N/(N-1)
    
  • Second, it can be applied to any “foldable” type:
    let N, M_N, M_2N = s |> List.fold M (0, 0.0, 0.0)
    
    let N, M_N, M_2N = s |> Array.fold M (0, 0.0, 0.0)
    
  • Third, it can be easily applied to multiple input values:
    let M0 = 0, 0.0, 0.0
    let M2 (M1, M2) (x1, x2) = M M1 x1, M M2 x2
    let (N1, M1_N, M1_2N), (N2, M2_N, M2_2N) = s2 |> Seq.fold M2 (M0, M0)
    

Conclusions

The F# language has advanced language features and additional tools that greatly simplify implementation and sanity checks for the summary statistics algorithms. Moreover, the resulting code is extremely flexible and can be easily reused for solving other kinds of problems.