Logistic Regression

In the previous chapter, we used a vanilla gradient descent (gd) algorithm to solve a logistic loss minimization problem. gd is the most basic member of the proximal gradient family.
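For reference, a vanilla gd iteration simply takes a step along the negative gradient,

\[ x_{k+1} = x_{k} - \epsilon g_{k} \,, \]

where \(\epsilon\) is a constant step size and \(g_{k} = \nabla \operatorname{f}(x_{k})\) is the gradient at the current iterate \(x_{k}\); the algorithms considered below all build on this basic recursion.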

In this section, we first revisit the logistic loss minimization problem, solve it with different (and more advanced) members of the family, namely Heavyball [1964-Polyak], Nesterov [1983-Nesterov], AdaGrad [2011-Duchi] and Adam [2014-Kingma], and compare their performance. To this end, we only need to change the algorithm-related parts of Listing 5 to obtain Listing 8.

Listing 8 serial/logistic.cpp
/* include system libraries */
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

/* include polo */
#include <polo/polo.hpp>
using namespace polo;

int main(int argc, char *argv[]) {
  /* define the problem data */
  auto data =
      utility::reader<double, int>::svm({"../data/australian_scale"}, 690, 14);

  /* define the smooth loss */
  loss::logistic<double, int> loss(data);

  /* estimate smoothness of the loss */
  double rowmax{0};
  for (int row = 0; row < data.nsamples(); row++) {
    double rowsquare{0};
    for (const auto val : data.matrix()->getrow(row))
      rowsquare += val * val;
    if (rowsquare > rowmax)
      rowmax = rowsquare;
  }
  const double L = 0.25 * data.nsamples() * rowmax;

  /* select and configure the desired solver */
#ifdef MOMENTUM
  algorithm::momentum<double, int> alg;
  alg.boosting_parameters(0.9, 1 / L);
  const string filename{"logistic-momentum.csv"};
#elif defined NESTEROV
  algorithm::nesterov<double, int> alg;
  alg.boosting_parameters(0.9, 1 / L);
  const string filename{"logistic-nesterov.csv"};
#elif defined ADAGRAD
  algorithm::adagrad<double, int> alg;
  const string filename{"logistic-adagrad.csv"};
#elif defined ADAM
  algorithm::adam<double, int> alg;
  alg.boosting_parameters(0.9, 0.1);
  alg.smoothing_parameters(0.999, 1E-8);
  alg.step_parameters(0.001);
  const string filename{"logistic-adam.csv"};
#endif

  /* pick a state logger */
  utility::logger::value<double, int> logger;

  /* pick a terminator */
  terminator::value<double, int> terminator(5E-2);

  /* provide an initial vector to the solver, and solve the problem */
  const vector<double> x0(data.nfeatures());
  alg.initialize(x0);
  alg.solve(loss, logger, terminator);

  /* open a csv file for writing */
  ofstream file(filename);
  if (file) { /* if successfully opened for writing */
    file << "k,t,f\n";
    for (const auto &log : logger)
      file << fixed << log.getk() << ',' << log.gett() << ',' << log.getf()
           << '\n';
  }

  /* print the result */
  cout << "Optimum: " << fixed << alg.getf() << '\n';
  cout << "Optimizer: [";
  for (const auto val : alg.getx())
    cout << val << ',';
  cout << "].\n";

  return 0;
}
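As a side note, the smoothness estimate computed in Listing 8 can be justified as follows, assuming that the logistic loss is the plain (unscaled) sum of per-sample losses over the rows \(a_{i}\) of the data matrix. The Hessian of each per-sample logistic loss is bounded above by \(0.25\, a_{i} a_{i}^{\mathsf{T}}\), and hence the Lipschitz constant of the gradient satisfies

\[ L \leq 0.25 \sum_{i=1}^{n} \lVert a_{i} \rVert_{2}^{2} \leq 0.25\, n \max_{1 \leq i \leq n} \lVert a_{i} \rVert_{2}^{2} \,, \]

which is exactly the (conservative) bound 0.25 * data.nsamples() * rowmax used in the listing.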

In this example, we use preprocessor directives to conditionally compile parts of the file. For instance, if the preprocessor directive MOMENTUM is defined, alg will be an instance of momentum (i.e., Heavyball), which updates, at each iteration \(k\), the decision variable \(x_{k}\) as follows:

\[
\begin{aligned}
\nu_{k} &= \mu \nu_{k-1} + \epsilon g_{k} \,, \\
x_{k+1} &= x_{k} - \nu_{k} \,,
\end{aligned}
\]

where \(\mu\) is the momentum term, usually set to 0.9 [2016-Ruder], \(g_{k}\) is the gradient at \(x_{k}\) (i.e., \(g_{k} = \nabla \operatorname{f}(x_{k})\) in this example), and \(\nu_{k}\) is the variable that accumulates an exponentially weighted average of the past gradients up to iteration \(k\). We set \(\mu = 0.9\) and \(\epsilon = 1/L\) by using the boosting_parameters [1] member function of alg, and we set filename to logistic-momentum.csv, which is used for saving the logged states of the algorithm. We select and configure [1] the other algorithms similarly, and guard each of them with its preprocessor directive; a standalone sketch of the momentum update is given after the CMake configuration below. To compile Listing 8 for the different algorithms, we add the following lines to CMakeLists.txt:

add_executable(logistic-momentum logistic.cpp)
target_compile_definitions(logistic-momentum PUBLIC MOMENTUM)
target_link_libraries(logistic-momentum polo::polo)

add_executable(logistic-nesterov logistic.cpp)
target_compile_definitions(logistic-nesterov PUBLIC NESTEROV)
target_link_libraries(logistic-nesterov polo::polo)

add_executable(logistic-adagrad logistic.cpp)
target_compile_definitions(logistic-adagrad PUBLIC ADAGRAD)
target_link_libraries(logistic-adagrad polo::polo)

add_executable(logistic-adam logistic.cpp)
target_compile_definitions(logistic-adam PUBLIC ADAM)
target_link_libraries(logistic-adam polo::polo)

CMake’s target_compile_definitions command defines the appropriate preprocessor directive for each target; in this way, we create four different executables from the same source file. Building the project, running the executables, and using the plotting script given in Listing 9 result in a figure similar to Fig. 3.
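As promised above, the following self-contained program sketches the Heavyball (momentum) update in plain C++. It is not part of Listing 8 and does not use polo; the quadratic toy objective, the step size and the iteration count are arbitrary choices made only to illustrate the recursion.

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  /* toy objective: f(x) = 0.5 * sum_i (x_i - 1)^2, so that grad f(x) = x - 1 and L = 1 */
  const std::size_t d = 5;        /* problem dimension (arbitrary) */
  const double mu = 0.9;          /* momentum term */
  const double epsilon = 1.0;     /* step size, 1/L for this toy objective */

  std::vector<double> x(d, 0.0);  /* decision variable x_k, started at the origin */
  std::vector<double> nu(d, 0.0); /* momentum variable nu_k, started at zero */

  for (int k = 0; k < 100; k++) {
    for (std::size_t i = 0; i < d; i++) {
      const double g = x[i] - 1.0;      /* g_k = grad f(x_k), coordinate i */
      nu[i] = mu * nu[i] + epsilon * g; /* nu_k = mu * nu_{k-1} + epsilon * g_k */
      x[i] -= nu[i];                    /* x_{k+1} = x_k - nu_k */
    }
  }

  /* every coordinate should now be close to the minimizer, 1 */
  for (const double val : x)
    std::cout << val << ' ';
  std::cout << '\n';

  return 0;
}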

Listing 9 serial/logistic.py
import csv  # for reading a csv file
from matplotlib import pyplot as plt  # for plotting

h, w = plt.figaspect(1)
fig, axes = plt.subplots(2, 2, sharey=True, figsize=(h, w))

algorithms = ["momentum", "nesterov", "adagrad", "adam"]

for (algorithm, axis) in zip(algorithms, axes.reshape(-1)):
    k = []
    f = []

    with open(f"logistic-{algorithm}.csv") as csvfile:
        csvReader = csv.reader(csvfile, delimiter=",")
        next(csvReader)  # skip the header
        for row in csvReader:
            k.append(int(row[0]))
            f.append(float(row[2]))

    axis.plot(k, f)
    axis.set_title(algorithm)
    axis.set_xlabel(r"$k$")
    axis.set_ylabel(r"$f(\cdot)$")
    axis.grid()

plt.tight_layout()
plt.savefig("logistic.svg")
plt.savefig("logistic.pdf")

Fig. 3 Loss values generated by the algorithms in Listing 8.

In Fig. 3, we observe that all four algorithms converge to roughly the same loss value when a value terminator with an absolute value tolerance of 5E-2 is used. Both momentum and nesterov require knowledge of the Lipschitz parameter (\(L\)) of the logistic loss, and they have similar convergence profiles. adagrad and adam, on the other hand, are configured without using knowledge of \(L\) (cf. Listing 8), and they can adapt their learning rates to the loss function at hand. For instance, adagrad initially shows an increase in the loss value, caused by a too large initial step size, which the algorithm compensates for as it proceeds. It is also worth noting that, in polo, the default initial step size for adagrad is 1, and that the configuration of adam in Listing 8 is the one suggested in [2014-Kingma]. Both algorithms can be tuned to perform better than they do in this example.
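For completeness, the coordinate-wise update rules behind this adaptive behavior, in the form described in [2011-Duchi], [2014-Kingma] and [2016-Ruder], are summarized below; note that polo's implementations may differ in minor details, and the symbols \(\eta\) (step size), \(\delta\) (small positive constant), \(\beta_{1}\) and \(\beta_{2}\) (decay rates) are ours. AdaGrad accumulates the squared gradients and scales each coordinate accordingly,

\[
\begin{aligned}
G_{k} &= G_{k-1} + g_{k} \odot g_{k} \,, \\
x_{k+1} &= x_{k} - \frac{\eta}{\sqrt{G_{k}} + \delta} \odot g_{k} \,,
\end{aligned}
\]

whereas Adam keeps exponentially decaying averages of the gradients and their squares, corrects them for initialization bias, and then takes a similarly scaled step,

\[
\begin{aligned}
m_{k} &= \beta_{1} m_{k-1} + (1 - \beta_{1})\, g_{k} \,, \\
v_{k} &= \beta_{2} v_{k-1} + (1 - \beta_{2})\, g_{k} \odot g_{k} \,, \\
\hat{m}_{k} &= m_{k} / (1 - \beta_{1}^{k}) \,, \qquad \hat{v}_{k} = v_{k} / (1 - \beta_{2}^{k}) \,, \\
x_{k+1} &= x_{k} - \frac{\eta}{\sqrt{\hat{v}_{k}} + \delta} \odot \hat{m}_{k} \,.
\end{aligned}
\]

With the values in Listing 8, this corresponds to \(\beta_{1} = 0.9\), \(\beta_{2} = 0.999\), \(\delta = 10^{-8}\) and \(\eta = 0.001\), i.e., the defaults suggested in [2014-Kingma].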

Now, we add different regularizers to the problem, and repeat the same procedure to solve the corresponding regularized problems.
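For reference, with \(\lambda_{1}, \lambda_{2} \geq 0\) denoting the regularization weights (these symbols are ours; the exact parameterization used in the subsections below may differ), the two regularized problems take the form

\[ \operatorname{minimize}_{x} \; \operatorname{f}(x) + \lambda_{1} \lVert x \rVert_{1} \qquad \text{and} \qquad \operatorname{minimize}_{x} \; \operatorname{f}(x) + \lambda_{1} \lVert x \rVert_{1} + \frac{\lambda_{2}}{2} \lVert x \rVert_{2}^{2} \,, \]

where \(\operatorname{f}\) is the logistic loss defined above.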

\(\ell_{1}\) Regularization

Elastic Net Regularization

2016-Ruder

Sebastian Ruder. “An overview of gradient descent optimization algorithms.” (Sep. 2016). arXiv: 1609.04747.

Footnotes

[1] The configuration of the algorithms and the terminology will become more evident in Proximal Gradient Methods.