3.3 Sparse linear algebra

library(Matrix)
bandSparse(15, 15, seq(-2, 2))
## 15 x 15 sparse Matrix of class "ngCMatrix"
##                                    
##  [1,] | | | . . . . . . . . . . . .
##  [2,] | | | | . . . . . . . . . . .
##  [3,] | | | | | . . . . . . . . . .
##  [4,] . | | | | | . . . . . . . . .
##  [5,] . . | | | | | . . . . . . . .
##  [6,] . . . | | | | | . . . . . . .
##  [7,] . . . . | | | | | . . . . . .
##  [8,] . . . . . | | | | | . . . . .
##  [9,] . . . . . . | | | | | . . . .
## [10,] . . . . . . . | | | | | . . .
## [11,] . . . . . . . . | | | | | . .
## [12,] . . . . . . . . . | | | | | .
## [13,] . . . . . . . . . . | | | | |
## [14,] . . . . . . . . . . . | | | |
## [15,] . . . . . . . . . . . . | | |
K <- bandSparse(n, n, seq(-2, 2))
weights <- c(1/3, 1/4, rep(1/5, n - 4), 1/4, 1/3)
weights <- c(NA, NA, rep(1/5, n - 4), NA, NA)
p_Nuuk <- ggplot(Nuuk_year, aes(Year, Temperature)) + geom_point()
p_Nuuk + 
  geom_line(aes(y = as.numeric(K %*% Nuuk_year$Temperature) * weights), 
            color = "red")

When the smoother matrix is sparse, matrix multiplication can be much faster.

We will present some benchmark comparisons below. First we compare the run time for the matrix multiplication as.numeric(K %*% Nuuk_year$Temperature) * weights using a sparse matrix (as above) with the run time using a dense matrix. The dense matrix is given as Kdense = as.matrix(K). These run times are compared to using filter(). In all computations, \(k = 5\).

The difference in slopes between dense and sparse matrix multiplication should be noted. This is the difference between \(O(n^2)\) and \(O(n)\) run time. The run time for the dense matrix multiplication will not change with \(k\). For the other two it will increase (linearly) with increasing \(k\).

For smoothing only once with a given smoother matrix the time to construct the matrix should also be taken into account for fair comparison with filter(). It turns out that the function bandSparse is not optimized for the specific running mean banded matrix, and a faster C++ function for this job is given below.

#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List fastBand(int n, int k) {
  int N = (2 * k + 1) * (n - 2 * k) + 3 * k * k + k;
  int iter = 0;
  IntegerVector i(N), p(n + 1);
  for(int col = 0; col < n; ++col) {
    p[col] = iter;
    for(int r = std::max(col - k, 0); r < std::min(col + k + 1, n); ++r) {
      i[iter] = r;
      ++iter;
    }
  }
  p[n] = N;
  return List::create(_["i"] = i, _["p"] = p);
}

And then R function.

bandSparseFast <- function(n, k) {
  n <- as.integer(n)
  k <- as.integer(k)
  tmp <- fastBand(n, k)
  new("ngCMatrix", 
      i = tmp$i, 
      p = tmp$p, 
      Dim = c(n, n))
}

The construction of the sparse matrix turns out to take up much more time than the matrix-vector multiplication. The run time is still \(O(n)\), but the constant is of the order of a factor 16 larger than for filter(). With the faster construction of the sparse matrix, the constant is reduced to being of the order 5 larger than for filter(). For small \(n\) there is some overhead from the constructor of the sparse matrix object even for the faster algorithm.

If you implement an algorithm (like a smoother) using linear algebra (e.g. a matrix-vector product) then sparse matrix numerical methods can be useful compared to dense matrix numerical methods. The Matrix package for R implements sparse matrices, and you should always attempt to use methods for constructing the sparse matrix that avoid dense intermediates. But even with a special purpose constructor of a sparse band matrix, sparse linear algebra cannot compete with optimized special purpose algorithms like filter() or a C++ implementation of run_mean(). The filter() function even works more generally for kernels (weights) with equidistant data.

We conclude this section by verifying that filter() actually computes the running mean up to numerical errors.

qplot(1:n, 
      as.numeric(K %*% Nuuk_year$Temperature) * weights - 
        c(stats::filter(Nuuk_year$Temperature, rep(1/5, 5)))) +
  scale_y_continuous("Difference")

all(as.numeric(K %*% Nuuk_year$Temperature) * weights == 
      c(stats::filter(Nuuk_year$Temperature, rep(1/5, 5))))
## [1] FALSE
all.equal(as.numeric(K %*% Nuuk_year$Temperature) * weights, 
          c(stats::filter(Nuuk_year$Temperature, rep(1/5, 5))))
## [1] TRUE
identical(as.numeric(K %*% Nuuk_year$Temperature) * weights, 
          c(stats::filter(Nuuk_year$Temperature, rep(1/5, 5))))
## [1] FALSE