   ## A  Duality Gaps with Fenchel Duality

This section is taken from Appendix D of Julien Mairal's PhD thesis. We will make intensive use of Fenchel duality. Let us consider the problem

$$\min_{w \in \mathbb{R}^p} \; \Big[\, g(w) \triangleq f(w) + \lambda \psi(w) \,\Big]. \qquad (47)$$

We first notice that for all the formulations we have been interested in, g(w) can be rewritten

$$g(w) = f(X^\top w) + \lambda \psi(w), \qquad (48)$$

where $X = [x_1, \ldots, x_n]$ is the matrix of training vectors, $f$ is an appropriate smooth real-valued function on $\mathbb{R}^n$, and $\psi$ is one of the regularization functions we have introduced.

Given a primal variable $w$ in $\mathbb{R}^p$ and a dual variable $\kappa$ in $\mathbb{R}^n$, classical Fenchel duality rules show that the following quantity is a duality gap for problem (47):

$$\delta(w, \kappa) \triangleq g(w) + f^\ast(\kappa) + \lambda \psi^\ast(-X\kappa/\lambda),$$

where $f^\ast$ and $\psi^\ast$ are the Fenchel conjugates of $f$ and $\psi$, respectively. Denoting by $w^\star$ the solution of Eq. (47), the duality gap is interesting in the sense that it upper-bounds the difference with the optimal value of the objective:

$$\delta(w, \kappa) \geq g(w) - g(w^\star) \geq 0.$$

Similarly, we will consider pairs of primal-dual variables (W,K) when dealing with matrices.

During the optimization, sequences of primal variables $w$ are available, and one wishes to exploit duality gaps for estimating the difference $g(w) - g(w^\star)$. This requires the following components:

• being able to compute $f^\ast$ and $\psi^\ast$ efficiently;
• being able to obtain a "good" dual variable $\kappa$ given a primal variable $w$, such that $\delta(w, \kappa)$ is close to $g(w) - g(w^\star)$.

We suppose that the first point is satisfied (we will detail these computations for every loss and regularization function in the sequel), and explain how to choose $\kappa$ in general (details will also be given in the sequel).

Let us first consider the choice that associates with a primal variable $w$ the dual variable

$$\kappa(w) \triangleq \nabla f(X^\top w), \qquad (49)$$

and let us compute $\delta(w, \kappa(w))$. First, easy computations show that for all vectors $z$ in $\mathbb{R}^n$, $f^\ast(\nabla f(z)) = z^\top \nabla f(z) - f(z)$, which gives

$$\begin{aligned} \delta(w, \kappa(w)) &= f(X^\top w) + \lambda\psi(w) + f^\ast\big(\nabla f(X^\top w)\big) + \lambda\psi^\ast\big(-X\nabla f(X^\top w)/\lambda\big) \qquad &(50)\\ &= \lambda\psi(w) + w^\top X \nabla f(X^\top w) + \lambda\psi^\ast\big(-X\nabla f(X^\top w)/\lambda\big). \qquad &(51)\end{aligned}$$

We now use the classical Fenchel–Young inequality (see Proposition 3.3.4 of ) on the function $\psi$, which gives

$$\delta(w, \kappa(w)) \;\geq\; w^\top X \nabla f(X^\top w) - w^\top X \nabla f(X^\top w) = 0,$$

with equality if and only if $-X\nabla f(X^\top w)/\lambda$ belongs to $\partial\psi(w)$. Interestingly, we know that the first-order optimality conditions for Eq. (48) give that $-X\nabla f(X^\top w^\star)/\lambda \in \partial\psi(w^\star)$. We therefore have in hand a non-negative function $w \mapsto \delta(w, \kappa(w))$ that upper-bounds $g(w) - g(w^\star)$ and satisfies $\delta(w^\star, \kappa(w^\star)) = 0$.
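
As a concrete sanity check, the following sketch (a hypothetical NumPy illustration, not part of any toolbox; all names are made up) instantiates the gap for the square loss with Tikhonov regularization, where $g$, $f^\ast$, and $\psi^\ast$ all have closed forms: the gap vanishes at the closed-form minimizer $w^\star$ and upper-bounds $g(w) - g(w^\star)$ at an arbitrary point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((p, n))   # columns of X are the training vectors x_i
y = rng.standard_normal(n)
lam = 0.1

# g(w) = f(X^T w) + lambda * psi(w) with the square loss and Tikhonov psi.
def g(w):
    return np.sum((y - X.T @ w) ** 2) / (2 * n) + lam * np.sum(w ** 2) / 2

def f_conj(kappa):                # f*(kappa) = (n/2) ||kappa||^2 + kappa^T y
    return n / 2 * np.sum(kappa ** 2) + kappa @ y

def psi_conj(gamma):              # the Tikhonov regularization is self-conjugate
    return np.sum(gamma ** 2) / 2

def delta(w, kappa):              # duality gap delta(w, kappa)
    return g(w) + f_conj(kappa) + lam * psi_conj(-X @ kappa / lam)

def kappa_of(w):                  # kappa(w) = grad f(X^T w) = (X^T w - y) / n
    return (X.T @ w - y) / n

# Closed-form minimizer of g, and an arbitrary (suboptimal) point.
w_star = np.linalg.solve(X @ X.T / n + lam * np.eye(p), X @ y / n)
w = rng.standard_normal(p)

print(delta(w, kappa_of(w)) >= g(w) - g(w_star) >= 0.0)  # upper bound holds
print(abs(delta(w_star, kappa_of(w_star))))               # vanishes at w*
```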

This is, however, not a sufficient property to make it a good measure of the quality of the optimization, and further work is required that depends on $f$ and $\psi$. We have indeed proven that $\delta(w^\star, \kappa(w^\star))$ is $0$. However, for $w$ different from $w^\star$, $\delta(w, \kappa(w))$ can be infinite, making it a non-informative duality gap. The reasons for this can be the following:

• The term $\psi^\ast(-X\nabla f(X^\top w)/\lambda)$ might have an infinite value.
• Intercepts make the problem more complicated. One can write the formulation with an intercept by adding a row to $X$ filled with the value $1$, adding one dimension to the vector $w$, and considering a regularization function $\psi$ that does not regularize the last entry of $w$. This further complicates the computation of the Fenchel conjugate of $\psi$ and its definition, as shown in the next section.

Let us now detail how we proceed to solve these problems, but first without considering the intercept. The analysis is similar when working with matrices W instead of vectors w.

#### A.0.1  Duality Gaps without Intercepts

Let us show how to compute the Fenchel conjugates of the functions we have introduced, starting with the loss functions $f$.

• The square loss:
$$f(z) = \frac{1}{2n}\|y - z\|_2^2, \qquad f^\ast(\kappa) = \frac{n}{2}\|\kappa\|_2^2 + \kappa^\top y.$$
• The logistic loss:
$$f(z) = \frac{1}{n}\sum_{i=1}^n \log\big(1 + e^{-y_i z_i}\big),$$
$$f^\ast(\kappa) = \begin{cases} +\infty & \text{if } \exists\, i \in [1;n]\ \text{s.t.}\ y_i\kappa_i \notin [-1, 0], \\ \displaystyle\sum_{i=1}^n (1 + y_i\kappa_i)\log(1 + y_i\kappa_i) - y_i\kappa_i\log(-y_i\kappa_i) & \text{otherwise.} \end{cases}$$
• The multiclass logistic loss (or softmax). The primal variable is now a matrix $Z$ in $\mathbb{R}^{n \times r}$, which represents the product $X^\top W$. We denote by $K$ the dual variable in $\mathbb{R}^{n \times r}$.
$$f(Z) = \frac{1}{n}\sum_{i=1}^n \log\Big(\sum_{j=1}^r e^{Z_{ij} - Z_{i y_i}}\Big),$$
$$f^\ast(K) = \begin{cases} +\infty & \text{if } \exists\, i \in [1;n]\ \text{s.t.}\ \{K_{ij} < 0 \text{ and } j \neq y_i\}\ \text{or}\ K_{i y_i} < -1, \\ \displaystyle\sum_{i=1}^n \Big[\sum_{j \neq y_i} K_{ij}\log(K_{ij}) + (1 + K_{i y_i})\log(1 + K_{i y_i})\Big] & \text{otherwise.} \end{cases}$$
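
These conjugate pairs can be sanity-checked numerically through the Fenchel–Young equality $f(z) + f^\ast(\nabla f(z)) = z^\top \nabla f(z)$. Below is a hypothetical illustrative script (not from the original source); the logistic case is taken for a single sample so that the normalization by $n$ plays no role.

```python
import numpy as np

rng = np.random.default_rng(1)

# Square loss pair: f(z) = ||y - z||^2 / (2n), f*(k) = (n/2)||k||^2 + k^T y.
n = 30
y = rng.standard_normal(n)
z = rng.standard_normal(n)

f = np.sum((y - z) ** 2) / (2 * n)
grad = (z - y) / n                       # gradient of f at z
f_conj = n / 2 * np.sum(grad ** 2) + grad @ y
print(abs(f + f_conj - grad @ z))        # Fenchel-Young equality: ~0

# Logistic pair for a single sample: f(z) = log(1 + e^{-y z}),
# f*(k) = (1 + y k) log(1 + y k) - y k log(-y k) on y k in [-1, 0].
y1, z1 = 1.0, 0.7
f1 = np.log1p(np.exp(-y1 * z1))
k1 = -y1 / (1 + np.exp(y1 * z1))         # gradient: -y * sigmoid(-y z)
s = y1 * k1                              # lies in (-1, 0)
f1_conj = (1 + s) * np.log(1 + s) - s * np.log(-s)
print(abs(f1 + f1_conj - k1 * z1))       # ~0
```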

Our first remark is that the choice of Eq. (49) ensures that $f^\ast(\kappa(w))$ is not infinite: the gradient $\nabla f(z)$ always belongs to the domain of $f^\ast$.

As for the regularization functions, except for the Tikhonov regularization, which is self-conjugate (it is equal to its Fenchel conjugate), we have considered functions that are norms. There exists therefore a norm $\|\cdot\|$ such that $\psi(w) = \|w\|$, and we denote by $\|\cdot\|_\ast$ its dual norm. In that case, the Fenchel conjugate of $\psi$ at a vector $\gamma$ in $\mathbb{R}^p$ takes the form

$$\psi^\ast(\gamma) = \begin{cases} 0 & \text{if } \|\gamma\|_\ast \leq 1, \\ +\infty & \text{otherwise.} \end{cases}$$

It turns out that for almost all the norms we have presented, either (i) there exists a closed form for the dual norm, or (ii) there exists an efficient algorithm for evaluating it. The only exception is the tree-structured sum of $\ell_2$-norms, whose dual norm we do not know how to evaluate efficiently.

One can now slightly modify the definition of $\kappa$ to ensure that $\psi^\ast(-X\kappa/\lambda) \neq +\infty$. A natural choice is

$$\kappa(w) \triangleq \min\Big(1, \frac{\lambda}{\|X\nabla f(X^\top w)\|_\ast}\Big)\, \nabla f(X^\top w),$$

which is the one we have implemented. With this new choice, it is easy to see that for all vectors $w$ in $\mathbb{R}^p$ we still have $f^\ast(\kappa(w)) \neq +\infty$, and finally, we also have $\delta(w, \kappa(w)) < +\infty$ and $\delta(w^\star, \kappa(w^\star)) = 0$, making it potentially a good duality gap.
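
For instance, for the $\ell_1$-norm (whose dual norm is the $\ell_\infty$-norm), this rescaled dual variable yields a computable, finite gap at every iterate of any solver. The sketch below is a hypothetical NumPy illustration that uses plain ISTA iterations as a stand-in solver (not the actual implementation); all names are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 15
X = rng.standard_normal((p, n))
y = rng.standard_normal(n)
lam = 0.3

def g(w):                          # square loss + l1 regularization (Lasso)
    return np.sum((y - X.T @ w) ** 2) / (2 * n) + lam * np.sum(np.abs(w))

def kappa_of(w):
    grad = (X.T @ w - y) / n       # grad f(X^T w) for the square loss
    # Rescale so that ||X kappa||_inf <= lam, i.e. psi*(-X kappa / lam) = 0.
    return min(1.0, lam / np.max(np.abs(X @ grad))) * grad

def delta(w):                      # the psi* term vanishes by construction
    k = kappa_of(w)
    return g(w) + n / 2 * np.sum(k ** 2) + k @ y

# Plain ISTA iterations; delta certifies their progress at any iterate.
L = np.linalg.norm(X @ X.T, 2) / n  # Lipschitz constant of w -> grad f(X^T w)
w = np.zeros(p)
for _ in range(2000):
    u = w - (X @ (X.T @ w - y) / n) / L
    w = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)

print(delta(w))                    # non-negative, and small near the optimum
```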

#### A.0.2  Duality Gaps with Intercepts

Even though adding an intercept seems a simple modification to the original problem, it induces difficulties in finding good dual variables.

We recall that having an intercept is equivalent to having a problem of type (48), obtained by adding a row to $X$ filled with the value $1$, adding one dimension to the vector $w$ (or one row to the matrix $W$), and using a regularization function that does not depend on the last entry of $w$ (or the last row of $W$).

Suppose that we are considering a problem of type (48) of dimension $p+1$, where the regularization function $\tilde\psi: \mathbb{R}^{p+1} \to \mathbb{R}$ satisfies $\tilde\psi(w) = \psi(w_{[1;p]})$ for every vector $w$ in $\mathbb{R}^{p+1}$, and $\psi: \mathbb{R}^p \to \mathbb{R}$ is one of the regularization functions we have introduced. Then, considering a primal variable $w$, a dual variable $\kappa$, and writing $\gamma = -X\kappa/\lambda$, we are interested in computing

$$\tilde\psi^\ast(\gamma) = \begin{cases} +\infty & \text{if } \gamma_{p+1} \neq 0, \\ \psi^\ast(\gamma_{[1;p]}) & \text{otherwise,} \end{cases}$$

which means that, for the duality gap not to be infinite, one needs in addition to ensure that $\gamma_{p+1}$ is zero. Since the last row of $X$ is filled with ones, this amounts to the constraint $\sum_{i=1}^{n} \kappa_i = 0$. For the formulation with matrices $W$ and $K$, the constraint is similar but holds for every column of $K$.

Let us now detail how we proceed for every loss function to find a "good" dual variable $\kappa$ satisfying this additional constraint. Given a primal variable $w$ in $\mathbb{R}^{p+1}$, we first define the auxiliary function

$$\kappa'(w) \triangleq \nabla f(X^\top w)$$

(which becomes $K'(W) = \nabla f(X^\top W)$ for matrices), and then define another auxiliary function $\kappa''(w)$ as follows, to take into account the additional constraint $\sum_{i=1}^{n} \kappa_i = 0$.

• For the square loss, we define
$$\kappa''(w) \triangleq \kappa'(w) - \frac{1}{n}\big(\mathbf{1}_n^\top \kappa'(w)\big)\mathbf{1}_n,$$
where $\mathbf{1}_n$ is the vector of size $n$ filled with ones. This step subtracts the mean of $\kappa'(w)$ and therefore ensures that $\sum_{i=1}^{n} \kappa''(w)_i = 0$.
• For the logistic loss, the situation is slightly more complicated, since additional constraints are involved in the definition of $f^\ast$:
$$\kappa''(w) \triangleq \operatorname*{arg\,min}_{\kappa \in \mathbb{R}^n} \|\kappa - \kappa'(w)\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^n \kappa_i = 0 \ \text{ and } \ \forall\, i \in [1;n],\ y_i\kappa_i \in [-1, 0].$$
This problem can be solved in linear time using an algorithm similar to the one for projecting onto the $\ell_1$-ball, since it is an instance of a quadratic knapsack problem.
• For the multiclass logistic loss, we proceed in a similar way for every column $K^j$ of $K$, $j \in [1;r]$:
$$K''^{j}(W) \triangleq \operatorname*{arg\,min}_{\kappa \in \mathbb{R}^n} \|\kappa - K'^{j}(W)\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^n \kappa_i = 0 \ \text{ and } \ \forall\, i \in [1;n],\ \{\kappa_i \geq 0 \ \text{if} \ j \neq y_i\} \ \text{and} \ \{\kappa_i \geq -1 \ \text{if} \ y_i = j\}.$$
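
These projections can also be computed, less efficiently than the linear-time algorithm mentioned above but very simply, by bisection on the Lagrange multiplier of the equality constraint. The sketch below is a hypothetical illustration (all names are made up) with generic elementwise bounds `lo` and `hi`, so it covers both the logistic and multiclass cases.

```python
import numpy as np

def project_knapsack(kp, lo, hi, tol=1e-12):
    """min ||k - kp||^2  s.t.  sum(k) = 0,  lo <= k <= hi (elementwise).

    KKT conditions give k_i = clip(kp_i - mu, lo_i, hi_i) for a scalar mu;
    sum(k) is continuous and non-increasing in mu, so bisection applies.
    """
    assert np.sum(lo) <= 0.0 <= np.sum(hi), "feasible set must be non-empty"
    a, b = np.min(kp - hi), np.max(kp - lo)   # bracket for mu
    for _ in range(200):
        mu = (a + b) / 2
        if np.sum(np.clip(kp - mu, lo, hi)) > 0:
            a = mu
        else:
            b = mu
        if b - a < tol:
            break
    return np.clip(kp - (a + b) / 2, lo, hi)

# Example with logistic-loss-style box constraints y_i * k_i in [-1, 0]:
rng = np.random.default_rng(3)
n = 10
yv = rng.choice([-1.0, 1.0], size=n)
kp = rng.standard_normal(n) * 0.5
lo = np.where(yv > 0, -1.0, 0.0)          # y_i k_i >= -1
hi = np.where(yv > 0, 0.0, 1.0)           # y_i k_i <= 0
k = project_knapsack(kp, lo, hi)
print(abs(np.sum(k)), np.all(k >= lo - 1e-9), np.all(k <= hi + 1e-9))
```

Since `k` is (up to the bisection tolerance) the Euclidean projection of `kp` onto the feasible set, any other feasible point lies at least as far from `kp`.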

When the function $\psi$ is the Tikhonov regularization function, we end the process by setting $\kappa(w) = \kappa''(w)$. When it is a norm, we choose, as before, in order to take into account the constraint $\|X\kappa\|_\ast \leq \lambda$,

$$\kappa(w) \triangleq \min\Big(1, \frac{\lambda}{\|X\kappa''(w)\|_\ast}\Big)\, \kappa''(w),$$

with a similar formulation for matrices W and K.
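
Putting the pieces together for the square loss, where $\kappa''$ reduces to mean-centering, the following hypothetical sketch (all names are made up, and plain ISTA leaving the intercept unpenalized serves as a stand-in solver) computes a finite duality gap for an $\ell_1$-regularized problem with an intercept.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 10
X = np.vstack([rng.standard_normal((p, n)), np.ones((1, n))])  # last row: intercept
y = rng.standard_normal(n)
lam = 0.2

def g(w):                            # l1 penalty on the first p entries only
    return np.sum((y - X.T @ w) ** 2) / (2 * n) + lam * np.sum(np.abs(w[:p]))

def kappa_of(w):
    kp = (X.T @ w - y) / n           # kappa'(w) = grad f(X^T w)
    kpp = kp - np.mean(kp)           # kappa'': enforce sum(kappa) = 0
    # Rescale against the dual norm over the regularized coordinates only.
    return min(1.0, lam / np.max(np.abs(X[:p] @ kpp))) * kpp

def delta(w):                        # both conjugate indicator terms vanish
    k = kappa_of(w)
    return g(w) + n / 2 * np.sum(k ** 2) + k @ y

# ISTA that leaves the intercept coordinate unpenalized.
L = np.linalg.norm(X @ X.T, 2) / n
w = np.zeros(p + 1)
for _ in range(3000):
    u = w - (X @ (X.T @ w - y) / n) / L
    w[:p] = np.sign(u[:p]) * np.maximum(np.abs(u[:p]) - lam / L, 0.0)
    w[p] = u[p]

print(delta(w))                      # finite despite the intercept, and small
```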

Even though finding dual variables that take the intercept into account requires quite a lot of engineering, notably implementing a quadratic knapsack solver, it can be done efficiently.