Defining Custom Loss Functions

In our examples so far, we have used the logistic loss on data that is either defined manually or read from a dataset. However, we can also define our own loss functions and pass them as the first argument to the solve member function of the algorithms. To demonstrate this, we focus on the following simple loss function:

\[\operatorname{f}(x) = \sum_{n=1}^{N} \frac{1}{n} {\left(x^{(n)} - n\right)}^{2}\]

for some \(N \geq 1\). As can be observed, the loss is a convex quadratic function of the \(N\)-dimensional decision vector, and the optimizer of Problem (1) with this loss is

\[\begin{split}x^{\star} = \begin{bmatrix} 1 \\ 2 \\ \vdots \\ N \end{bmatrix} \quad \text{with} \quad \operatorname{f}\left(x^{\star}\right) = 0 \,.\end{split}\]

Furthermore, it can easily be verified that the gradient and the Hessian of the loss are

\[\begin{split}\nabla\operatorname{f}(x) = \begin{bmatrix} \frac{2}{1} \left(x^{(1)} - 1\right) \\ \frac{2}{2} \left(x^{(2)} - 2\right) \\ \vdots \\ \frac{2}{N} \left(x^{(N)} - N\right) \\ \end{bmatrix} \quad \text{and} \quad \nabla^{2}\operatorname{f}(x) = \begin{bmatrix} \frac{2}{1} & & & \\ & \frac{2}{2} & & \\ & & \ddots & \\ & & & \frac{2}{N} \end{bmatrix} \,.\end{split}\]
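
Because the Hessian is diagonal and independent of \(x\), its eigenvalues can be read off directly from the diagonal:

\[\lambda_{n}\left(\nabla^{2}\operatorname{f}(x)\right) = \frac{2}{n} \,, \quad n = 1, \ldots, N \,,\]

an observation we will use below when choosing the step size.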

In polo, a smooth loss is any object that, when called with two input arguments, x and g, returns the loss value at x and writes the gradient into g. Because loss objects are not allowed to modify the decision vector, and for compatibility with other (high-level) languages, the types of x and g are const value_t * and value_t *, respectively, where value_t is the type (e.g., double) of the values stored in the decision vector and the gradient. Hence, one way to define the loss above is to create a new struct as follows:

struct simple_loss {
  simple_loss(const int N) : N{N} {}

  double operator()(const double *x, double *g) const {
    double loss{0};
    for (int n = 1; n <= N; n++) {
      const double residual{x[n - 1] - n};
      loss += residual * residual / n;
      g[n - 1] = 2 * residual / n;
    }
    return loss;
  }

private:
  const int N;
};

where we keep the loss's data (N) private and define the operator() member function accordingly. Note that, because C++ uses zero-based indexing, we use n-1 when indexing x and g.
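
As a quick sanity check, we can evaluate the loss directly at the known optimizer and confirm that both the loss value and the gradient are exactly zero. The short program below is only an illustration (it is not part of Listing 7) and assumes that the simple_loss struct above is visible in the same translation unit:

/* sanity check: evaluate simple_loss at the optimizer x* for N = 3 */
#include <iostream>
#include <vector>
using namespace std;

int main() {
  simple_loss loss(3);
  const vector<double> x{1.0, 2.0, 3.0}; /* x = x* */
  vector<double> g(3);                   /* gradient buffer */
  cout << "loss: " << loss(x.data(), g.data()) << '\n'; /* prints 0 */
  cout << "gradient: [" << g[0] << ',' << g[1] << ',' << g[2] << "]\n";
  return 0;
}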

To use a custom loss object of type simple_loss, we first construct one with, e.g., N = 10. Thanks to the closed-form expression for the Hessian, we can easily verify that the loss is \(\mu\)-strongly convex with \(\mu = \lambda_{\text{min}} = 2/N\) and has an \(L\)-Lipschitz continuous gradient with \(L = \lambda_{\text{max}} = 2\), where \(\lambda_{\text{min}}\) and \(\lambda_{\text{max}}\) are the minimum and maximum eigenvalues of the Hessian, respectively. We know from [2004-Nesterov] that vanilla gradient descent with the constant step size \(\gamma = \frac{2}{\mu + L}\) converges linearly to the optimum. We therefore set the step size of the algorithm accordingly, and define a value terminator with absolute and relative tolerances of \(10^{-8}\) and \(10^{-13}\), respectively. The resulting code is given in Listing 7.
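
For \(N = 10\), the step size used in the listing evaluates to

\[\gamma = \frac{2}{\mu + L} = \frac{2}{\frac{2}{10} + 2} = \frac{10}{11} \approx 0.909 \,,\]

which is exactly the value that alg.step_parameters receives below.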

Listing 7 getting-started/loss.cpp
/* include system libraries */
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

/* include polo */
#include <polo/polo.hpp>
using namespace polo;

struct simple_loss {
  simple_loss(const int N) : N{N} {}

  double operator()(const double *x, double *g) const {
    double loss{0};
    for (int n = 1; n <= N; n++) {
      const double residual{x[n - 1] - n};
      loss += residual * residual / n;
      g[n - 1] = 2 * residual / n;
    }
    return loss;
  }

private:
  const int N;
};

int main(int argc, char *argv[]) {
  /* define the smooth loss */
  const int N{10};
  simple_loss loss(N);
  const double mu{2. / N};
  const double L{2.};

  /* select and configure the desired solver */
  algorithm::gd<double, int> alg;
  alg.step_parameters(2 / (mu + L));

  /* pick a state logger */
  utility::logger::value<double, int> logger;

  /* pick a terminator */
  terminator::value<double, int> terminator(1E-8, 1E-13);

  /* provide an initial vector to the solver, and solve the problem */
  const vector<double> x0(N);
  alg.initialize(x0);
  alg.solve(loss, logger, terminator);

  /* open a csv file for writing */
  ofstream file("loss.csv");
  if (file) { /* if successfully opened for writing */
    file << "k,t,f\n";
    for (const auto &log : logger)
      file << fixed << log.getk() << ',' << log.gett() << ',' << log.getf()
           << '\n';
  }

  /* print the result */
  cout << "Optimum: " << fixed << alg.getf() << '\n';
  cout << "Optimizer: [";
  for (const auto val : alg.getx())
    cout << val << ',';
  cout << "].\n";

  return 0;
}

We append the following lines to CMakeLists.txt

add_executable(loss loss.cpp)
target_link_libraries(loss polo::polo)

and build the project. Running the executable should give the output:

Optimum: 0.000000
Optimizer: [1.000036,2.000000,3.000000,4.000000,5.000000,6.000000,6.999998,7.999984,8.999910,9.999641,].

Here, we observe that the algorithm attains the optimum value (up to the fixed, 6-digit precision), even though it does not reach the optimizer to the same accuracy. This is because the value terminator only checks the absolute and relative changes in the loss value. If we instead want a termination condition based on changes in the decision vector, we can replace

terminator::value<double, int> terminator(1E-8, 1E-13);

with

terminator::decision<double, int> terminator(1E-8);

This decision terminator with the tolerance \(10^{-8}\) will stop the algorithm when the following condition holds:

\[{\left\lVert x_{k-1} - x_{k} \right\rVert}_{2} < \epsilon_{\text{abs}} = 10^{-8} \,.\]
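
For comparison, the value terminator used earlier monitors successive loss values rather than the iterates themselves. A common form for such a test is

\[\left| \operatorname{f}(x_{k-1}) - \operatorname{f}(x_{k}) \right| < \epsilon_{\text{abs}} + \epsilon_{\text{rel}} \left| \operatorname{f}(x_{k-1}) \right| \,,\]

though this is only a sketch of the idea; the exact expression polo evaluates may differ. Either way, the loss can look converged while some coordinates of the decision vector are still moving, which is what we observed above.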

Rebuilding the project and rerunning the executable should give the following:

Optimum: 0.000000
Optimizer: [1.000000,2.000000,3.000000,4.000000,5.000000,6.000000,7.000000,8.000000,9.000000,10.000000,].