Logistic Regression¶
In the previous chapter, we used a vanilla gradient descent (gd) algorithm to solve a logistic loss minimization problem. gd is the most basic member of the proximal gradient family. In this section, we revisit the logistic loss minimization problem, solve it with more advanced members of the family, namely Heavyball [1964-Polyak], Nesterov [1983-Nesterov], AdaGrad [2011-Duchi] and Adam [2014-Kingma], and compare their performance. To this end, we change the algorithm-related parts of Listing 5 to obtain Listing 8.
/* include system libraries */
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

/* include polo */
#include <polo/polo.hpp>
using namespace polo;

int main(int argc, char *argv[]) {
  /* define the problem data */
  auto data =
      utility::reader<double, int>::svm({"../data/australian_scale"}, 690, 14);

  /* define the smooth loss */
  loss::logistic<double, int> loss(data);

  /* estimate smoothness of the loss */
  double rowmax{0};
  for (int row = 0; row < data.nsamples(); row++) {
    double rowsquare{0};
    for (const auto val : data.matrix()->getrow(row))
      rowsquare += val * val;
    if (rowsquare > rowmax)
      rowmax = rowsquare;
  }
  const double L = 0.25 * data.nsamples() * rowmax;

  /* select and configure the desired solver */
#ifdef MOMENTUM
  algorithm::momentum<double, int> alg;
  alg.boosting_parameters(0.9, 1 / L);
  const string filename{"logistic-momentum.csv"};
#elif defined NESTEROV
  algorithm::nesterov<double, int> alg;
  alg.boosting_parameters(0.9, 1 / L);
  const string filename{"logistic-nesterov.csv"};
#elif defined ADAGRAD
  algorithm::adagrad<double, int> alg;
  const string filename{"logistic-adagrad.csv"};
#elif defined ADAM
  algorithm::adam<double, int> alg;
  alg.boosting_parameters(0.9, 0.1);
  alg.smoothing_parameters(0.999, 1E-8);
  alg.step_parameters(0.001);
  const string filename{"logistic-adam.csv"};
#endif

  /* pick a state logger */
  utility::logger::value<double, int> logger;

  /* pick a terminator */
  terminator::value<double, int> terminator(5E-2);

  /* provide an initial vector to the solver, and solve the problem */
  const vector<double> x0(data.nfeatures());
  alg.initialize(x0);
  alg.solve(loss, logger, terminator);

  /* open a csv file for writing */
  ofstream file(filename);
  if (file) { /* if successfully opened for writing */
    file << "k,t,f\n";
    for (const auto &log : logger)
      file << fixed << log.getk() << ',' << log.gett() << ',' << log.getf()
           << '\n';
  }

  /* print the result */
  cout << "Optimum: " << fixed << alg.getf() << '\n';
  cout << "Optimizer: [";
  for (const auto val : alg.getx())
    cout << val << ',';
  cout << "].\n";

  return 0;
}
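Before moving on, it is worth unpacking the smoothness estimate in the listing. Assuming loss::logistic implements the plain (unnormalized) sum of per-sample logistic losses over samples \(a_{i}\) with labels \(b_{i} \in \{-1, +1\}\), the Hessian of the loss obeys the standard bound

\[\nabla^{2} \operatorname{f}(x) = \sum_{i=1}^{N} \sigma\left(b_{i} a_{i}^{\top} x\right) \left(1 - \sigma\left(b_{i} a_{i}^{\top} x\right)\right) a_{i} a_{i}^{\top} \preceq \frac{1}{4} \sum_{i=1}^{N} a_{i} a_{i}^{\top} \preceq \frac{N}{4} \max_{i} \lVert a_{i} \rVert^{2} I,\]

where \(\sigma(\cdot)\) is the sigmoid function and \(\sigma(z)(1 - \sigma(z)) \leq 1/4\). Hence, L = 0.25 * data.nsamples() * rowmax in the listing is a conservative, easy-to-compute upper bound on the Lipschitz constant of the gradient.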
In this example, we use preprocessor directives to conditionally compile parts of the file. For instance, if the preprocessor directive MOMENTUM is defined, alg will be an instance of momentum (i.e., Heavyball), which updates the decision variable \(x_{k}\) at each iteration \(k\) as follows:

\[\begin{aligned} \nu_{k} &= \mu \nu_{k-1} + \epsilon g_{k}, \\ x_{k+1} &= x_{k} - \nu_{k}, \end{aligned}\]
where \(\mu\) is the momentum term, usually set to 0.9 [2016-Ruder], \(g_{k}\) is the gradient at \(x_{k}\) (i.e., \(g_{k} = \nabla \operatorname{f}(x_{k})\)) in the example, and \(\nu_{k}\) is the variable that holds the exponential moving average of the gradient at iteration \(k\).
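For concreteness, the following is a minimal, self-contained sketch of one such update on raw vectors; heavyball_step is a hypothetical helper written for illustration and is not part of polo's API:

#include <cstddef>
#include <iostream>
#include <vector>

/* one Heavyball iteration: nu <- mu * nu + eps * g, then x <- x - nu;
   this helper is illustrative only and not a polo function */
void heavyball_step(std::vector<double> &x, std::vector<double> &nu,
                    const std::vector<double> &g, const double mu,
                    const double eps) {
  for (std::size_t i = 0; i < x.size(); i++) {
    nu[i] = mu * nu[i] + eps * g[i];
    x[i] -= nu[i];
  }
}

int main() {
  std::vector<double> x{1.0, -2.0}, nu(2, 0.0);
  const std::vector<double> g{0.5, -1.0};   /* a made-up gradient */
  heavyball_step(x, nu, g, 0.9, 0.1);       /* mu = 0.9, eps = 0.1 */
  std::cout << x[0] << ' ' << x[1] << '\n'; /* prints 0.95 -1.9 */
  return 0;
}

Note that setting \(\mu = 0\) recovers vanilla gd with step size \(\epsilon\).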
We set \(\mu = 0.9\) and \(\epsilon = 1/L\) by using the boosting_parameters [1] member function of alg, and we set filename to logistic-momentum.csv, the file used for saving the logged states of the algorithm. We select and configure the other algorithms [1] similarly, and guard them with their corresponding preprocessor directives. To compile Listing 8 for the different algorithms, we add the following lines to CMakeLists.txt:
add_executable(logistic-momentum logistic.cpp)
target_compile_definitions(logistic-momentum PUBLIC MOMENTUM)
target_link_libraries(logistic-momentum polo::polo)

add_executable(logistic-nesterov logistic.cpp)
target_compile_definitions(logistic-nesterov PUBLIC NESTEROV)
target_link_libraries(logistic-nesterov polo::polo)

add_executable(logistic-adagrad logistic.cpp)
target_compile_definitions(logistic-adagrad PUBLIC ADAGRAD)
target_link_libraries(logistic-adagrad polo::polo)

add_executable(logistic-adam logistic.cpp)
target_compile_definitions(logistic-adam PUBLIC ADAM)
target_link_libraries(logistic-adam polo::polo)
CMake’s target_compile_definitions command defines the appropriate preprocessor directive for each target; with it, we create four different executables from the same source file. Building the project, running the executables, and using the plotting script given in Listing 9 result in a figure similar to Fig. 3.
import csv  # for reading a csv file

from matplotlib import pyplot as plt  # for plotting

h, w = plt.figaspect(1)
fig, axes = plt.subplots(2, 2, sharey=True, figsize=(h, w))

algorithms = ["momentum", "nesterov", "adagrad", "adam"]

for (algorithm, axis) in zip(algorithms, axes.reshape(-1)):
    k = []
    f = []
    with open(f"logistic-{algorithm}.csv") as csvfile:
        csvReader = csv.reader(csvfile, delimiter=",")
        next(csvReader)  # skip the header
        for row in csvReader:
            k.append(int(row[0]))
            f.append(float(row[2]))
    axis.plot(k, f)
    axis.set_title(algorithm)
    axis.set_xlabel(r"$k$")
    axis.set_ylabel(r"$f(\cdot)$")
    axis.grid()

plt.tight_layout()
plt.savefig("logistic.svg")
plt.savefig("logistic.pdf")
In Fig. 3, we observe that all four algorithms converge roughly to the same loss value when a value terminator with an absolute value tolerance of 5E-2 is used. Both momentum and nesterov require knowledge of the Lipschitz constant (\(L\)) of the logistic loss, and they have similar convergence profiles. adagrad and adam, on the other hand, are configured without using the knowledge of \(L\) (cf. Listing 8), and they adapt their learning rates to the loss function at hand. For instance, with adagrad, the loss value initially increases because the initial step size is too large; the algorithm compensates for this as it accumulates gradient information. It is also worth noting that, in polo, the default initial step size for adagrad is 1, and the configuration in Listing 8 for adam is the one suggested in [2014-Kingma]. Both algorithms can be tuned to perform better than in this example.
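For instance, the initial increase in adagrad's loss value could likely be avoided by starting from a smaller step size. A minimal sketch of the corresponding change to Listing 8, assuming adagrad exposes the same step_parameters member that adam uses there:

#elif defined ADAGRAD
  algorithm::adagrad<double, int> alg;
  /* assumed interface: lower the initial step size from its default of 1 */
  alg.step_parameters(0.1);
  const string filename{"logistic-adagrad.csv"};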
Now, we add different regularizers to the problem, and repeat the same procedure to solve the corresponding regularized problems.
\(\ell_{1}\) Regularization¶
Elastic Net Regularization¶
- 2016-Ruder
Sebastian Ruder. “An overview of gradient descent optimization algorithms.” (Sep. 2016). arXiv: 1609.04747.
Footnotes
- [1] The configuration of the algorithms and the terminology will become more evident in Proximal Gradient Methods.