Theoretical grounds of lambda optimization

#	User	Rating
1	ecnerwala	3649
2	Benq	3581
3	orzdevinwang	3570
4	Geothermal	3569
4	cnnfls_csy	3569
6	tourist	3565
7	maroonrk	3531
8	Radewoosh	3521
9	Um_nik	3482
10	jiangly	3468

#	User	Contrib.
1	maomao90	174
2	awoo	164
3	adamant	162
4	TheScrasse	159
5	nor	158
6	maroonrk	156
7	-is-this-fft-	151
8	SecondThread	147
9	orz	146
10	pajenegod	145

Hi everyone!

This time I'd like to write about what's widely known as "Aliens trick" (as it got popularized after 2016 IOI problem called Aliens). There are already some articles about it here and there, and I'd like to summarize them, while also adding insights into the connection between this trick and generic Lagrange multipliers and Lagrangian duality which often occurs in e.g. linear programming problems.

Familiarity with a previous blog about ternary search or, at the very least, definitions and propositions from it is expected.

Great thanks to mango_lassi and 300iq for useful discussions and some key insights on this.

Note that although explanation here might be quite verbose and hard to comprehend at first, the algorithm itself is stunningly simple.

Another point that I'd like to highlight for those already familiar with "Aliens trick" is that typical solutions using it require binary search on lambdas to reach specified constraint by minimizing its value for specific $$$\lambda$$$. However, this part is actually unnecessary and you don't even have to calculate the value of constraint function at all within your search.

It further simplifies the algorithm and extends applicability of aliens trick to the cases when it is hard to minimize constraint function while simultaneously minimizing target function for the given $$$\lambda$$$.

Tldr.

Problem. Let $$$f : X \to \mathbb R$$$ and $$$g : X \to \mathbb R^c$$$. You need to solve the constrained optimization problem

$$$\begin{gather}f(x) \to \min,\\ g(x) = 0.\end{gather}$$$

Auxiliary function. Let $$$t(\lambda) = \inf_x [f(x) - \lambda \cdot g(x)]$$$. Finding $$$t(\lambda)$$$ is unconstrained problem and is usually much simpler.

Equivalently, $$$t(\lambda) = \inf_y [h(y) - \lambda \cdot y]$$$ where $$$h(y)$$$ is the minimum possible $$$f(x)$$$ subject to $$$g(x)=y$$$.

As a point-wise minimum of linear functions, $$$t(\lambda)$$$ is concave, therefore its maximum can be found with ternary search.

Key observation. By definition, $$$t(\lambda) \leq h(0)$$$ for any $$$\lambda$$$, thus $$$\max_\lambda t(\lambda)$$$ provides a lower bound for $$$h(0)$$$. When $$$h(y)$$$ is convex, inequality turns into equation, that is $$$\max_\lambda t(\lambda) = h(0) = f(x^*)$$$ where $$$x^*$$$ is the solution to the minimization problem.

Solution. Assume that $$$t(\lambda)$$$ is computable for any $$$\lambda$$$ and $$$h(y)$$$ is convex. Then find $$$\max_\lambda t(\lambda)$$$ with the ternary search on $$$t(\lambda)$$$ over possible values of $$$\lambda$$$. This maximum is equal to the minimum $$$f(x)$$$ subject to $$$g(x)=0$$$.

If $$$g(x)$$$ and $$$f(x)$$$ are integer functions, $$$\lambda_i$$$ corresponds to $$$h(y_i) - h(y_i-1)$$$ and can be found among integers.

Boring and somewhat rigorous explanation is below, problem examples are belower.

Lagrange duality

Let $$$f : X \to \mathbb R$$$ be the objective function and $$$g : X \to \mathbb R^c$$$ be the constraint function. The constrained optimization problem

$$$\begin{gather}f(x) \to \min,\\ g(x) = 0\end{gather}$$$

in some cases can be reduced to finding stationary points of the Lagrange function

$$$L(x, \lambda) = f(x) - \lambda \cdot g(x).$$$

Here $$$\lambda \cdot g(x)$$$ is a dot product of $$$g(x)$$$ and a variable vector $$$\lambda \in \mathbb R^c$$$.

Def. 1. A function $$$L(x, \lambda)$$$ defined above is called the Lagrange function, or the Lagrangian.

Def. 2. A vector $$$\lambda \in \mathbb R^c$$$ defined above is called a Lagrange multiplier. Its components are collectively called Lagrange multipliers.

Mathematical optimization focuses on finding stationary points of $$$L(x,\lambda)$$$. However, we're more interested in its infimal projection

$$$t(\lambda) = \inf\limits_{x \in X} L(x,\lambda).$$$

Def. 3. A function $$$t(\lambda)$$$ defined above is called the Lagrange dual function.

Example. You're given an array $$$a_1,\dots,a_n$$$. You need to find a sequence $$$1 \leq x_1 < x_2 < \dots < x_k \leq n$$$ of exactly $$$k$$$ indices such that $$$a_{x_1} + a_{x_2} + \dots + a_{x_k}$$$ is minimum possible. This problem is solvable, for example, by sorting $$$a_1,\dots,a_n$$$ and picking $$$k$$$ smallest elements. However, let's see how it would be formulated in terms above to better understand this notion.

For this problem, $$$X=2^{\{1,2,\dots, n\}}$$$ is a set of all such sequences. Let $$$x=\{x_1,x_2,\dots, x_m\} \in X$$$, then

$$$\begin{align} f(x)&=a_{x_1} + a_{x_2} + \dots + a_{x_m},\\ g(x)&=|x|-k=m-k,\\ t(\lambda)&=\inf\limits_{x} [f(x) - \lambda (|x|-k)] \\&= \inf\limits_{x} [(a_{x_1} - \lambda) + (a_{x_2} - \lambda) + \dots + (a_{x_m} - \lambda)]+\lambda k.\end{align}$$$

Note that finding $$$t(\lambda)$$$ is unconstrained optimization problem and to compute its value it is enough to simply take in the set $$$x$$$ every index $$$i$$$ such that $$$a_i - \lambda \leq 0$$$. We do not have to know $$$|x|$$$ beforehand, instead we equally distribute $$$\lambda |x|$$$ between all elements of $$$x$$$.

This is the most common example of how aliens trick is applied on practice: $$$g(x)$$$ is typically a size of some set (number of picked elements in a subsequence, number of selected subsegments, etc) and we treat subtracted $$$\lambda \cdot g(x)$$$ as if each element of this set bears additional penalty $$$\lambda$$$ to the target function $$$f(x)$$$.

Another important observation to make here is that increasing $$$\lambda$$$ would increase the size of optimal $$$x$$$ for $$$t(\lambda)$$$ and decreasing it will make more elements positive, thus leading to a smaller size of $$$x$$$. It means that it is possible to find $$$\lambda$$$ that will give us the correct size of $$$x$$$ by the binary search.

Of course, there might be several $$$a_i$$$ such that $$$a_i = \lambda$$$ and it wouldn't matter if we include them in $$$x$$$ for $$$t(\lambda)$$$, so there might be several optimal $$$x$$$ with different sizes for a given $$$\lambda$$$. To mitigate this, it is possible to compute a set of all possible $$$|x|$$$ for each $$$\lambda$$$ (it will be a contiguous sequence of integers) and make the binary search based on the smallest element of such set.

Note that $$$t(\lambda)$$$, as a point-wise infimum of concave (linear in $$$\lambda$$$) functions, is always concave, no matter what $$$X$$$ and $$$f$$$ are.

If $$$x^*$$$ is the solution to the original problem, then $$$t(\lambda) \leq L(x^*,\lambda)=f(x^*)$$$ for any $$$\lambda \in \mathbb R^c$$$. This allows us to introduce

Def. 4. The unconstrained optimization problem $$$t(\lambda) \to \max$$$ is called the Lagrangian dual problem.

Def. 5. If $$$\lambda^*$$$ is the solution to the dual problem, the value $$$f(x^*) - t(\lambda^*)$$$ is called the duality gap.

Def. 6. A condition when the duality gap is zero is called a strong duality.

Typical example here is Slater's condition, which says that strong duality holds if $$$f(x)$$$ is convex, $$$g(x)$$$ is affine and there exists $$$x$$$ such that $$$g(x)=0$$$.

Change of domain

In competitive programming, the set $$$X$$$ in definitions above is often weird and very difficult to analyze directly, so Slater's condition is not applicable. As a typical example, $$$X$$$ could be a set of all possible partitions of $$$\{1,2,\dots, n\}$$$ into non-intersecting segments. Besides, instead of specific equation $$$g(x)=0$$$, you are often asked to minimize $$$f(x)$$$ subject to $$$g(x)=y$$$ where $$$y$$$ is a part of problem input.

To mitigate this, we define $$$h(y)$$$ as the minimum value of $$$f(x)$$$ subject to $$$g(x)=y$$$. In this notion, the dual function is written as

$$$t(\lambda) = \inf\limits_{y \in Y} [h(y) - \lambda \cdot y],$$$

where $$$Y=\{ g(x) : x \in X\}$$$. The set $$$Y$$$ is usually much more regular than $$$X$$$, as just by definition it is already a subset of $$$\mathbb R^c$$$. The strong duality condition is also very clear in this terms: it holds if and only if $$$0 \in Y$$$ and there is a $$$\lambda$$$ for which $$$y=0$$$ delivers infimum.

Def. 7. The epigraph of a function $$$f : Y \to \mathbb R$$$ is a set $$$\text{epi }f = \{(y, z) : z \geq f(y)\} \subset Y \times \mathbb R$$$.

Def. 8. A supporting hyperplane of a set $$$X \subset \mathbb R^d$$$ with a normal vector $$$\lambda \in \mathbb R^d$$$ is a surface defined as $$$\lambda \cdot x = c$$$, where $$$c$$$ is the largest possible number such that $$$\lambda \cdot x \geq c$$$ for all $$$x \in X$$$. Equivalently, $$$c$$$ is the infimum of $$$\lambda \cdot x$$$ among all $$$x \in X$$$.

Geometrically, $$$t(\lambda)$$$ defines a level at which the epigraph of $$$h(y)$$$ has a supporting hyperplane with the normal vector $$$(-\lambda, 1)$$$. Indeed, the half-space bounded by such hyperplane on the level $$$c$$$ is defined as $$$\{(y, z) : z-\lambda \cdot y \geq c\}$$$.

All the points $$$(y, h(y))$$$ at which the hyperplane touches the epigraph minimize the $$$t(\lambda)$$$. Please, refer to the picture below. Lower $$$c$$$ would move the hyperplane lower, while higher $$$c$$$ would move it higher. With $$$c=t(\lambda)$$$, the hyperplane is lowest possible while still intersecting the epigraph of the function in the point $$$(y^*, h(y^*))$$$ where $$$y^*$$$ delivers the minimum of $$$h(y) - \lambda \cdot y$$$.

For strong duality to hold for all inputs, all $$$y \in Y$$$ should have a supporting hyperplane that touches the epigraph at $$$(y, h(y))$$$. This condition is essentially equivalent to $$$h(y)$$$ being convex-extensible, that is, there should exist a convex function on $$$\mathbb R^c$$$ such that its restriction to $$$Y$$$ yields $$$h(y)$$$.

Returning to our example above, $$$h(y)$$$ equals to the minimum $$$f(x)$$$ subject to $$$g(x)=|x|-k=y$$$, which means that $$$h(y)$$$ corresponds to the same problem, but with number of elements in $$$x$$$ changed from $$$k$$$ to $$$k+y$$$.

As we already noted, optimal way to solve this problem is to sort all elements and pick $$$k+y$$$ smallest ones. Since elements are increasing, sum of first $$$k+y$$$ elements is convex as a function of $$$y$$$.

This means that for the example problem strong duality holds and instead of doing binary search on $$$\lambda$$$ to match the needed $$$|x|$$$ we could do a ternary search to find the largest $$$t(\lambda)$$$, as the maximum value of $$$t(\lambda)$$$ would correspond to the minimum $$$f(x)$$$ subject to $$$g(x)=0$$$.

$$$1$$$-dimensional case

For now, let's assume that the problem has a single constraint, thus only one dimension to deal with.

Due to general properties of convex functions, larger $$$y$$$ would require larger $$$\lambda$$$ for the supporting line in the point $$$y$$$ and vice versa — larger $$$\lambda$$$ would be able to define a supporting line on larger $$$y$$$ only. This monotonicity means that we can find optimal $$$\lambda$$$.

Algorithm for finding $$$\lambda$$$ that delivers optimal $$$t(\lambda)$$$:

Do the binary search on $$$\lambda$$$. Assume that you work in $$$[\lambda_l, \lambda_r)$$$ and test $$$\lambda_m$$$ with $$$m \approx \frac{l+r}{2}$$$.
Compute optimal $$$x_\lambda$$$ for $$$f(x)-\lambda_m g(x)$$$ function and $$$y_\lambda=g(x_\lambda)$$$ corresponding to it.
When there are several possible $$$x_\lambda$$$, choose the one that delivers minimum $$$y_\lambda$$$.
If $$$y_\lambda > 0$$$, you should reduce working range to $$$[\lambda_l, \lambda_m)$$$, otherwise reduce it to $$$[\lambda_m,\lambda_r)$$$.

Third step is essential, as $$$\lambda_m$$$ can correspond to several consequent $$$y$$$ such that the points $$$(y, h(y))$$$ lie on the same line and, therefore, have a common supporting line. However, if we look on the smallest $$$y$$$ for every $$$\lambda_m$$$, we will guarantee that the values we find are non-decreasing as a function of $$$\lambda_m$$$. If finding minimal $$$y_\lambda$$$ is cumbersome, one might use ternary search instead to solve the dual problem directly.

Note: This approach doesn't guarantee that we find $$$x_\lambda$$$ such that $$$g(x_\lambda)=0$$$, however we will find $$$\lambda$$$ that corresponds to the optimum, which would mean that $$$(y_\lambda, f(x_\lambda))$$$ and $$$(y^*, f(x^*))$$$ lie on the same line with a slope coefficient $$$\lambda$$$, thus you can get the answer as

$$$f(x^*) = f(x_\lambda) + (y^*-y_\lambda) \cdot \lambda = f(x_\lambda) - y_\lambda \cdot \lambda = t(\lambda).$$$

Integer search

What are the possible $$$\lambda$$$ for specific $$$y$$$? When $$$h(y)$$$ is continuously differentiable, it essentially means that $$$\lambda$$$ corresponds to $$$y$$$ such that $$$\lambda = h'(y)$$$. On the other hand, when $$$Y$$$ is the set of integers, $$$y$$$ optimizes $$$t(\lambda)$$$ for all $$$\lambda$$$ such that

$$$h(y) - h(y-1) \leq \lambda \leq h(y + 1) - h(y).$$$

So, if $$$h(y)$$$ is an integer function, we may do the integer search of $$$\lambda$$$ on possible values of $$$h(k)-h(k-1)$$$ only.

In our example with the smallest-sum subset of $$$k$$$ elements, only those $$$\lambda$$$ that are equal to some of $$$a_i$$$ are the points at which possible optimal $$$x$$$ for $$$t(\lambda)$$$ change. It corresponds to the transition from $$$h(y)$$$ to $$$h(y+1)$$$, as the difference between them is equal to the next $$$a_i$$$ in a sorted list.

In a very generic case, when the function is not continuously differentiable and $$$y_i$$$ are not necessarily integer, the set of possible $$$\lambda_i$$$ for a given $$$y_i$$$ is defined as $$$\partial h(y_i)$$$, the so-called sub-differential of $$$h$$$ in $$$y_i$$$, formally defined as $$$[a,b]$$$ where

$$$\begin{gather}a = \sup_{y'_i < y_i} \frac{f(y_i) - f(y'_i)}{y_i - y'_i},~~b = \inf_{y_i < y'_i} \frac{f(y'_i) - f(y_i)}{y'_i - y_i}\end{gather}.$$$

The concept of sub-derivatives and sub-differentials can be generalized to multi-dimensional case as well with sub-gradients.

$$$2$$$-dimensional case

When $$$c>1$$$, the original problem might be reduced to a sequence of nested single-dimensional problems. Let's start with $$$c=2$$$.

$$$\inf\limits_{\substack{g_1(x)=0\\g_2(x)=0}} f(x) = \max\limits_{\lambda_1}\inf\limits_{g_2(x)=0} [f(x)-\lambda_1 g_1(x)] = \max\limits_{\lambda_1} t(\lambda_1),$$$

The function $$$t(\lambda_1)$$$ can also be rewritten as a maximum of the Lagrangian dual:

$$$\begin{gather}t(\lambda_1) = \max\limits_{\lambda_2} \inf\limits_x [f_{\lambda_1}(x)-\lambda_2 g_2(x)]=\max\limits_{\lambda_2}t_{\lambda_1}(\lambda_2),\\ f_{\lambda_1}(x) = f(x) - \lambda_1 g_1(x). \end{gather}$$$

Thus, rather than solving joint problem on $$$\lambda_1,\lambda_2$$$, we need to solve two Lagrangian dual problems:

$$$\begin{align} t(\lambda_1) &= \inf\limits_{g_2(x)=0} [f(x)-\lambda_1 g_1(x)] \to \max,\\ t_{\lambda_1}(\lambda_2) &= \inf\limits_x [f_{\lambda_1}(x)-\lambda_2 g_2(x)] \to \max. \end{align}$$$

Rewriting them in $$$h$$$-terms, we obtain

$$$\begin{align} t(\lambda_1) &= \inf\limits_{y_1} [h(y_1)-\lambda_1 y_1],\\ t_{\lambda_1}(\lambda_2) &= \inf\limits_{y_2} [h_{\lambda_1}(y_2)-\lambda_2 y_2], \end{align}$$$

where $$$h$$$-functions are defined in the following way:

$$$\begin{align} h(y_1) &= h(y_1, 0),\\ h_{\lambda_1}(y_2) &= \inf\limits_{y_1} [h(y_1, y_2) - \lambda_1 y_1]. \end{align}$$$

For both problems to have strong duality, both $$$h(y_1) = h(y_1, 0)$$$ and $$$h_{\lambda_1}(y_1)$$$ must be convex.

Multidimensional case

Let's for the sake of clarity write down the full sequence of nested $$$1$$$-dimensional optimization problems in $$$c$$$-dimensional case.

We need to optimize $$$f(x)$$$ w.r.t $$$g_1(x) = g_2(x) = \dots = g_n(x)=0$$$. We introduce $$$\lambda_1, \dots, \lambda_n$$$. and solve the optimization problem

$$$\max\limits_{\lambda_1 \dots \lambda_n} t(\lambda_1, \dots, \lambda_n) = \max\limits_{\lambda_1} \dots \max\limits_{\lambda_n} \inf\limits_x [f(x) - \lambda_1 \cdot g_1(x) - \dots - \lambda_n \cdot g_n(x)].$$$

We decompose it in a sequence of nested problems in the following way:

$$$\max\limits_\lambda t(\lambda) = \max\limits_{\lambda_1} t(\lambda_1) = \max\limits_{\lambda_1} \max\limits_{\lambda_2} t_{\lambda_1}(\lambda_2) = \max\limits_{\lambda_1} \max\limits_{\lambda_2} \max\limits_{\lambda_3} t_{\lambda_1 \lambda_2}(\lambda_3) = \dots,$$$

where

$$$\begin{gather} t_{\lambda_1 \dots \lambda_{k-1}}(\lambda_k) = \inf\limits_{\substack{g_{k+1}(x)=0 \\ \dots \\ g_n(x) = 0}} [f_{\lambda_1 \dots \lambda_{k-1}}(x) - \lambda_k \cdot g_k(x)],\\ f_{\lambda_1 \dots \lambda_{k-1}}(x) = f(x) - \lambda_1 g_1(x) - \dots - \lambda_{k-1} g_{k-1}(x).\end{gather}$$$

With the help of $$$h(y_1, \dots, y_n)$$$, the same could be rewritten as

$$$\begin{gather} t_{\lambda_1 \dots \lambda_{k-1}}(\lambda_k) = \inf\limits_{y_k} [h_{\lambda_1 \dots \lambda_{k-1}}(y_k) - \lambda_k \cdot y_k],\\ h_{\lambda_1 \dots \lambda_{k-1}}(y_k) = \inf\limits_{y_1 \dots y_{k-1}} [h(y_1,\dots,y_{k-1},y_k,0,\dots,0) - \lambda_1 y_1 - \dots - \lambda_{k-1} y_{k-1}].\end{gather}$$$

For nested binary search on $$$\lambda$$$ components to work, every function $$$h_{\lambda_1 \dots \lambda_{k-1}}(y_k)$$$ must be convex, including $$$h(y_1) = h(y_1, 0, \dots, 0)$$$.

The whole procedure could roughly look as follows:

void adjust(double *lmb, int i, int c) {
    if(i == c) {
        return;
    }
    double l = -inf, r = +inf; // some numbers that are large enough
    while(r - l > eps) { // Might be unsafe on large numbers, consider fixed number of iterations
        double m = (l + r) / 2;
        lmb[i] = m;
        adjust(lmb, i + 1, c);
        auto [xl, yl] = solve(lmb, i); // returns (x_lambda, y_lambda) with the minimum y_lambda[i]
        if(yl[k] > 0) {
            r = m;
        } else {
            l = m;
        }
    }
    lmb[i] = (l + r) / 2;
}

Alternatively, one might consider nested ternary search to solve the joint dual problem directly.

Testing convexity

Differential criteria. When $$$Y$$$ is continuous set, convexity may be tested by local criteria. For one-dimensional set, $$$h(y)$$$ is convex if and only if $$$h'(y)$$$ is non-decreasing. If it has a second derivative, we might also say that the function is convex if and only if $$$h' '(y) \geq 0$$$ for all $$$y$$$.

In multidimensional case, the local criterion is that the Hessian matrix $$$\frac{\partial h}{\partial y_i \partial y_j}$$$ is a positive semi-definite.

Finite differences. In discrete case, derivatives can be substituted with finite differences. A function $$$h(y) : \mathbb Z \to \mathbb R$$$ is convex if and only if

$$$\Delta [h] (y) = h(y+1)-h(y)$$$

is non-decreasing or, which is equivalent,

$$$\Delta^2 [h] (y) = h(y+2)-2h(y+1)+h(y) \geq 0.$$$

Reduction to mincost flow. Another possible way to prove that $$$1$$$-dimensional $$$h(y)$$$ is convex in relevant problems is to reduce it to finding min-cost flow of size $$$y$$$ in some network. The common algorithm to find such flow pushes a flow of size $$$1$$$ through the shortest path in the residual network. The shortest path becomes larger with each push, guaranteeing that mincost flow is convex in $$$y$$$.

Similarly, reduction to maxcost $$$k$$$-flow would prove the concavity of $$$h(k)$$$. This method was revealed to me by 300iq.

Testing convexity in $$$1$$$-dimensional reductions. In multidimensional discrete case testing convexity might be challenging. Instead of this, one may try to directly prove that every single-dimensional $$$h$$$-function defined above is convex. Specifically, for $$$2$$$-dimensional case one would need to prove the convexity of

$$$\begin{align} h(y_1) &= h(y_1, 0),\\ h_{\lambda_1}(y_2) &= \inf\limits_{y_1} [h(y_1, y_2) - \lambda_1 y_1]. \end{align}$$$

Problem examples

Sometimes the problem is stated with $$$\max$$$ rather than $$$\min$$$. In this case, $$$t(\lambda)$$$ and $$$h(y)$$$ are defined as supremum and maximum rather than infimum and minimum. Correspondingly, $$$h(y)$$$ needs to be concave rather than convex to allow the usage of the trick.

Gosha is hunting. You're given $$$a$$$, $$$b$$$, $$$p_1, \dots, p_n$$$ and $$$q_1,\dots, q_n$$$. You need to pick two subsets $$$A$$$ and $$$B$$$ of $$$\{1,2,\dots,n\}$$$ of size $$$a$$$ and $$$b$$$ correspondingly in such a way that the following sum is maximized:

$$$f(A, B) = \sum\limits_{i \in A} p_i + \sum\limits_{j \in B} q_j - \sum\limits_{k \in A \cap B} p_k q_k.$$$

Formulating it as a constrained optimization problem, we obtain

$$$\begin{gather}f(A, B) \to \max, \\ |A| - a = 0, \\ |B| - b = 0.\end{gather}$$$

Lagrangian dual here is

$$$t(\lambda_1, \lambda_2) = \max\limits_{A,B} [f(A, B) - \lambda_1|A| - \lambda_2|B|] + \lambda_1 a + \lambda_2 b.$$$

Let $$$h(\alpha,\beta) = \max f(A, B)$$$ subject to $$$|A| = \alpha$$$ and $$$|B| = \beta$$$, then

$$$t(\lambda_1, \lambda_2) = \max\limits_{\alpha,\beta}[h(\alpha,\beta)-\lambda_1 (\alpha-a) - \lambda_2 (\beta-b)].$$$

As it is $$$2$$$-dimensional problem, we would need to prove the concavity of the following functions:

$$$\begin{align} h(\alpha) &= h(\alpha, 0),\\ h_{\lambda_1}(\beta) &= \max\limits_\alpha[h(\alpha, \beta) - \lambda_1\alpha] + \lambda_1 a. \end{align}$$$

To prove the concavity of $$$h_{\lambda_1}$$$, we can consider the alternative formulation of this problem

$$$h_{\lambda_1}(\beta) = \max\limits_{|B|=\beta} [f(A, B)-\lambda_1|A|] + \lambda_1 a,$$$

which is the primal problem for $$$h_{\lambda_1}$$$. We can interpret it as if we're penalized by $$$\lambda_1$$$ for every element from the set $$$A$$$, but we have no constraint on its size. It means, that we might by default take into the answer all elements of $$$A$$$ having $$$p_i \geq \lambda_1$$$.

Then we might sort all potential elements of $$$B$$$ by their potential contribution to $$$f(A, B)$$$ if we pick this element and pick top $$$\beta$$$ elements in the sorted list. Since we sorted the elements by their contribution before picking them, increasing $$$\beta$$$ will lead to smaller increases in $$$f(A, B)$$$ proving that $$$h_{\lambda_1}$$$ is concave. Here's an example solution with nested ternary search on $$$t(\lambda_1,\lambda_2)$$$:

Code

void solve() {
    int n, a, b;
    cin >> n >> a >> b;
    double p[n], u[n];
    const double maxc = 4000;
    for(int i = 0; i < n; i++) {
        cin >> p[i];
    }
    for(int i = 0; i < n; i++) {
        cin >> u[i];
    }
    auto f = [&](double x, double y) {
        double res = 0;
        for(int i = 0; i < n; i++) {
            res += max({
                0.,
                p[i] - x,
                u[i] - y,
                p[i] + u[i] - p[i] * u[i] - x - y
            });
        }
        return res + a * x + b * y;
    };
    auto in = [&](double x) {
        double l = 0, r = maxc;
        for(int z = 0; z < 50; z++) {
            double eps = (r - l) / 1e3;
            double y1 = (l + r) / 2 - eps;
            double y2 = (l + r) / 2 + eps;
            if(f(x, y1) > f(x, y2)) {
                l = y1;
            } else {
                r = y2;
            }
        }
        return f(x, (l + r) / 2);
    };
    auto out = [&]() {
        double l = 0, r = maxc;
        for(int z = 0; z < 50; z++) {
            double eps = (r - l) / 1e3;
            double x1 = (l + r) / 2 - eps;
            double x2 = (l + r) / 2 + eps;
            if(in(x1) > in(x2)) {
                l = x1;
            } else {
                r = x2;
            }
        }
        return in((l + r) / 2);
    };
    cout << fixed << setprecision(6) << out() << endl;
}

Note that we do not have to minimize $$$|A|$$$ or $$$|B|$$$ when we use ternary search on $$$t(\lambda_1,\lambda_2)$$$ instead of binary search on $$$\lambda$$$ to match it with $$$|A|=a$$$ and $$$|B|=b$$$. In this way, we do not need to care about $$$|A|$$$ and $$$|B|$$$ and might as well not store it at all throughout the code.

Minkowski addition on epigraphs. Let $$$f_1 : [0,N] \to \mathbb R$$$ and $$$f_2 : \mathbb [0,M] \to \mathbb R$$$ be convex functions computable in $$$O(1)$$$. You need to quickly compute $$$f(x)$$$ for any given point $$$x$$$ where $$$f : \mathbb [0,N+M] \to \mathbb R$$$ is a convex function defined as:

$$$\text{epi } f = \text{epi }f_1 + \text{epi }f_2.$$$

From this definition follows an alternative formula for $$$f(x)$$$:

$$$f(x) = \min\limits_{x_1+x_2=x}[f_1(x_1)+f_2(x_2)],$$$

thus we can say that $$$f$$$ is a convolution of $$$f_1$$$ and $$$f_2$$$ with $$$\min$$$-operation.

To solve this problem, we introduce

$$$t(\lambda) = \inf\limits_{x_1,x_2}[f_1(x_1)+f_2(x_2) - \lambda x_1 - \lambda x_2] + \lambda x.$$$

Formally, constraint function here is $$$g(x_1,x_2) = x_1+x_2-x$$$ and corresponding $$$h$$$-function is defined as

$$$h(y) = \min\limits_{x_1+x_2-x=y} [f_1(x_1) + f_2(x_2)] = f(x+y),$$$

therefore $$$h(y)$$$ is convex as well and strong duality holds. To compute $$$t(\lambda)$$$, we can separate variables as

$$$t(\lambda) = \inf\limits_{x_1} [f_1(x_1) - \lambda x_1] + \inf\limits_{x_2} [f_2(x_2) - \lambda x_2] + \lambda x.$$$

Since $$$f_1$$$ and $$$f_2$$$ are convex, corresponding infimums are computable in $$$O(\log C)$$$ with the ternary search, therefore $$$f(x)$$$ as a whole is computable in $$$O(\log^2 C)$$$.

Another noteworthy fact here is that optimal $$$(x_1, x_2)$$$ always define points such that both $$$f_1$$$ and $$$f_2$$$ have a supporting line with the slope $$$\lambda$$$ in $$$x_1$$$ and $$$x_2$$$ correspondingly.

It means that if $$$f_1$$$ and $$$f_2$$$ are defined in $$$[0,N] \cap \mathbb Z$$$ and $$$[0,M] \cap \mathbb Z$$$ instead of $$$[0,N]$$$ and $$$[0,M]$$$, it is possible to compute values of $$$f$$$ in $$$[0,N+M] \cap \mathbb Z$$$ in $$$O(N+M)$$$ by merging two arrays as in merge-sort, but comparing $$$a_{i+1}-a_i$$$ and $$$b_{i+1}-b_i$$$ instead of $$$a_i$$$ and $$$b_i$$$:

// a and b are convex, compute c[i+j] = min(a[i] + b[j])
vector<int> min_conv(vector<int> a, vector<int> b) {
    adjacent_difference(begin(a), end(a), begin(a));
    adjacent_difference(begin(b), end(b), begin(b));
    vector<int> c = {a[0] + b[0]};
    merge(begin(a) + 1, end(a), begin(b) + 1, end(b), back_inserter(c));
    partial_sum(begin(c), end(c), begin(c));
    return c;
}

Honorable Mention. You're given an array $$$a_1,\dots, a_n$$$ and $$$q$$$ queries. Each query is a triple $$$l,r,k$$$ and you need to compute the maximum sum on $$$k$$$ non-empty non-intersecting subsegments of $$$[l,r]$$$.

This is an example of a problem where concavity may be proven via the reduction to maxcost $$$k$$$-flow on the following network:

Any $$$k$$$-flow on this network corresponds to a set of $$$k$$$ non-empty non-intersecting subsegments, thus $$$h(k)$$$ equals to the cost of the maximum $$$k$$$-flow in the network and thus $$$h(k)$$$ is concave.

To solve the problem efficiently, we should be able to pick a set of subsegments of $$$[l, r]$$$ such that each segment is penalized by $$$\lambda$$$ and their sum is maximized. As is often the case with problems that deal with segments, we're going to use a segment tree.

Let $$$d_{v}(k, p, s)$$$ be the the largest number we can obtain on a segment $$$[l, r)$$$ corresponding to the vertex $$$v$$$ of the segment tree, such that exactly $$$k$$$ subsegments are used. Here $$$p, s \in \{0,1\}$$$ are prefix and suffix indicators. If $$$p=1$$$ it means that one of $$$k$$$ subsegments must be a prefix of $$$[l, r)$$$ and if $$$s=1$$$, it means that one of $$$k$$$ subsegments must be a suffix of $$$[l, r)$$$. In this notion,

$$$d_v(k, p, s) = \max\left\{ \begin{matrix} \max\limits_{i+j=k}[d_{l(v)}(i,p,0)+d_{r(v)}(j,0,s)],\\ \max\limits_{i+j=k+1}[d_{l(v)}(i,p,1)+d_{r(v)}(j,1,s)] \end{matrix} \right\}.$$$

Where $$$l(v)$$$ and $$$r(v)$$$ are left and right children of $$$v$$$ in the segment tree correspondingly. These expressions are $$$(\max, +)$$$ convolutions of concave functions, thus if all values of $$$d_v$$$ are stored in each vertex for every valid triple $$$(k,p,s)$$$, it is possible to calculate $$$d_v$$$ from $$$d_{l(v)}$$$ and $$$d_{r(v)}$$$ with the help of the previous example problem.

Now that each segment hold all possible values of $$$d_v$$$, let's deal with the triple $$$l, r, k$$$. Let's say that $$$[l, r]$$$ is decomposed into vertices $$$v_1,\dots, v_k$$$. Instead of solving for specific $$$k$$$, we will penalize penalize each segment by $$$\lambda$$$ and do ternary search on $$$t(\lambda)$$$.

Let $$$r_i(s)$$$ be the optimal answer on $$$v_1, \dots, v_i$$$ where $$$s$$$ is a suffix-indicator, then

$$$r_{i}(s) = \max\left\{ \begin{matrix} \max\limits_{k}[r_{i-1}(0) + d_{v_i}(k,0,s) - \lambda k],\\ \max\limits_{k}[r_{i-1}(1) + d_{v_i}(k,1,s) - \lambda (k-1)] \end{matrix} \right\}.$$$

Since $$$d_{v_i}(k, \dots)$$$ is concave, so is $$$d_{v_i}(k, \dots) - \lambda k$$$, thus the optimal value of this function might be found with another ternary search.

Or binary search on $$$d_{v_i}(k+1, \dots) - d_{v_i}(k, \dots)$$$, as the optimal value corresponds to the smallest $$$k$$$ such that this difference is not greater than $$$\lambda$$$. One of possible implementations for the solution is below:

code

#include <bits/stdc++.h>

using namespace std;

#define int int64_t
#define all(x) begin(x), end(x)

const int inf = 1e9;
const int maxn = 35001;
const int maxs = 4 * maxn;
int a[maxn];
vector<int> ans[maxs][2][2];
vector<int> diff[maxs][2][2];

vector<int> max_conv(vector<int> a, vector<int> b) {
    vector<int> c = {a[0] + b[0]};
    merge(1 + all(a), 1 + all(b), back_inserter(c), greater<int>());
    partial_sum(all(c), begin(c));
    return c;
}

void build(int v = 1, int l = 0, int r = maxn) {
    if(r - l == 1) {
        ans[v][0][0] = {0, a[l]};
        ans[v][0][1] = ans[v][1][0] = ans[v][1][1] = {-inf, a[l]};
        diff[v][0][0] = {0, a[l]};
        diff[v][0][1] = diff[v][1][0] = diff[v][1][1] = {-inf, a[l] + inf};
    } else {
        int m = (l + r) / 2;
        build(2 * v, l, m);
        build(2 * v + 1, m, r);
        for(int pl = 0; pl <= 1; pl++) {
            for(int sr = 0; sr <= 1; sr++) {
                ans[v][pl][sr] = max_conv(diff[2 * v][pl][0],
                                          diff[2 * v + 1][0][sr]);
                auto cand = max_conv(diff[2 * v][pl][1],
                                     diff[2 * v + 1][1][sr]);
                for(size_t i = 1; i < cand.size(); i++) {
                    ans[v][pl][sr][i - 1] = max(ans[v][pl][sr][i - 1],
                                                cand[i]);
                }
                adjacent_difference(all(ans[v][pl][sr]), back_inserter(diff[v][pl][sr]));
            }
        }
    }
}

void get(int a, int b, vector<int> &st, int v = 1, int l = 0, int r = maxn) {
    if(a <= l && r <= b) {
        st.push_back(v);
    } else if(a < r && l < b) {
        int m = (l + r) / 2;
        get(a, b, st, 2 * v, l, m);
        get(a, b, st, 2 * v + 1, m, r);
    }
}

signed main() {
    ios::sync_with_stdio(0);
    cin.tie(0);
    int n, q;
    cin >> n >> q;
    for(int i = 0; i < n; i++) {
        cin >> a[i];
    }
    build();
    while(q--) {
        int l, r, k;
        cin >> l >> r >> k;
        l--;
        vector<int> ver;
        get(l, r, ver);
        auto opt = [&](int lmb) {
            int bst[2];
            bst[0] = 0;
            bst[1] = -inf;
            for(auto it: ver) {
                int nbst[2];
                for(int sr = 0; sr <= 1; sr++) {
                    auto lmb_max = [&](int v, int p, int s, int lmb) {
                        auto it = lower_bound(
                            1 + all(diff[v][p][s]), lmb, greater<int>()
                        ) - begin(diff[v][p][s]) - 1;
                        return ans[v][p][s][it] - lmb * it;
                    };
                    nbst[sr] = max(
                        bst[0] + lmb_max(it, 0, sr, lmb),
                        bst[1] + lmb_max(it, 1, sr, lmb) + lmb
                    );
                }
                memcpy(bst, nbst, sizeof nbst);
            }
            return bst[0] + lmb * k;
        };
        l--;
        int L = -maxn, R = inf;
        while(R - L > 1) {
            int M = (L + R) / 2;
            if(opt(M + 1) - opt(M) >= 0) {
                R = M;
            } else {
                L = M;
            }
        }
        cout << opt(R) << "\n";
    }
}

The solution here has the complexity of $$$O(n \log^3 n)$$$, however it is possible to reduce it to $$$O(n \log^2 n)$$$ with parallel binary search.

References

Duality (optimization) — English Wikipedia
The Trick From Aliens — Serbanology
My Take on Aliens' Trick — Mamnoon Siam's Blog
Incredibly beautiful DP optimization from N^3 to N log^2 N
Comment on Codeforces by _h_
Part of the article was once revealed to me in a dream

Comments (9)

Write comment?

dario2994

2 years ago, # |

← Rev. 3 →

+45

I would like to give a minimal presentation of the topic. Of course, it is equivalent to yours but, since it is more concise and uses slightly different mathematical concepts, it may be useful to someone. I will not enter into the "reconstruction issue".

Problem statement. We are given $$$f:X\to\mathbb R$$$ and $$$g:X\to\mathbb R^d$$$ ($$$d=1$$$ if there is only one constraint). The goal is to compute

$$$ \min\limits_{x:\,g(x)=0} f(x) .$$$

Assumptions. Assume that $$$h(y):=\min_{g(x)=y}f(x)$$$ is convex and that, for any $$$\lambda\in\mathbb R^d$$$, we can solve the minimization problem

$$$ \min\limits_x f(x)- \lambda g(x) . $$$

Solution. Let us consider the convex conjugate of $$$h$$$:

$$$ h^*(\lambda) := \max\limits_y \lambda y - h(y) \quad \Big(=-\min\limits_x f(x)-\lambda g(x)\Big) .$$$

The function $$$h^*$$$ is convex (as it is the maximum of linear functions), we are able to compute it pointwise (this is one of the assumptions), and it satisfies (thanks to this)

$$$ \min\limits_{x:\,g(x)=0} f(x) = h(0) = h^{**}(0) = -\min h^* .$$$

Since one can find the minimum of a convex function with a ternary search (also in higher dimension) this last identity produces a solution to the problem.

→ Reply

adamant

2 years ago, # ^ |

+14

Thanks, I was not familiar with Fenchel–Moreau theorem and convex conjugates before! This indeed provides another interesting perspective on what's going on here. Do I understand correctly, that the epigraph of convex biconjugate is a convex hull of original function's epigraph?

+26

Yes, you understand correctly.

TimonKnigge

+18

Regarding the multidimensional case, why pick nested ternary search specifically? Even in cases where first order methods aren't applicable (note automatic differentiation is quite versatile) I would expect something like coordinate descent+line search to do better.

+10

I haven't yet seen a problem with $$$c>2$$$ and don't have much experience with other methods to describe them confidently. And for $$$c=2$$$, I believe, it wouldn't be better anyway?

I tried Newton method and gradient descent instead of ternary search in Gosha is hunting, but Hessian of $$$h$$$ is degenerate and gradient is non-continuous, thus I was not able to make it work. It would be great if you could show any other methods using that problem as an example.

Can you please elaborate on coordinate descent and line search? Do you know any competitive programming problems (not necessarily with aliens trick) where these methods are applicable?

Ok, I made quite a few edits after this text was published.

Most importantly, I added two more problem examples, particularly 102331H - Honorable Mention which I consider especially hard (problem is from 2019 Летние Петрозаводские сборы, день 2: 300iq Contest 2 (XX Открытый кубок, Гран-при Казани) where it was one of two problems that weren't solved by any team during the live contest in Petrozavodsk).

I also tried my best to clarify any part that didn't seem clear enough and added a toy problem example to refer to it alongside the story, so that it's not as overly abstract as it was in the beginning.

With this in mind, article should be mostly complete now.

I hope you will enjoy the reading and apologize for any moments that are still unclear or overly verbose or obscure.

P. S. And I just can't stress this part enough, as it seems to be a common misconception: You do not need to base binary search of $$$\lambda$$$ on matching the constraint from the input. If you properly incorporate the constraint in your dual function, you will get the optimal $$$f(x)$$$ when $$$t(\lambda)$$$ is maximized, which is doable with either ternary search or binary search on $$$t(\lambda + 1) - t(\lambda)$$$ in integer case to find the lowest $$$\lambda$$$ such that this difference is non-negative.

This way you avoid all the unnecessary hardship of having to store both $$$f(x)$$$ and $$$g(x)$$$ while calculating $$$t(\lambda)$$$ or having to minimize constraint value to guarantee the monotonicity between $$$\lambda$$$ and constraint. This is especially important in multidimensional case.

P.P.S. Is there anything I haven't covered about this concept that might require further elaboration? Most problems with aliens trick do not require the recovery of the answer. It seems to be possible in some cases, but maybe not always. Does anyone know any problems about aliens trick where one needs to actually recover $$$x$$$?

Graphter

Thank you for such a valuable resource. I was recently learning Alien's Trick and this blog helped me a lot to generalize the ideas (although it may be hard to understand at first, as you said). I have some observations/questions about some parts of the blog:

Another important observation to make here is that increasing $$$\lambda$$$ would decrease the size of optimal $$$x$$$ for $$$t(\lambda)$$$ and decreasing it will make more elements positive, thus leading to a larger size of $$$x$$$.

I believe there might be a small mistake here. If I've understood correctly, it's actually the other way around as increasing $$$\lambda$$$ will make more elements satisfy $$$a_i - \lambda \leq 0$$$ so there would be more elements picked in $$$x$$$.

Typical example here is Slater's condition, which says that strong duality holds if $$$f(x)$$$ is convex and there exists $$$x$$$ such that $$$g(x) = 0$$$

I might be wrong here as I was not really familiar with Slater's condition, but after reading some resources, I think that Slater's condition implies that $$$g$$$ needs to be affine because is not of the form $$$f_i(x) \leq 0$$$, or am I missing something ?

Hi! Sorry, I somehow missed your comment. It seems that you're right about Slater's condition. And for the example, it also seems true that increasing $$$\lambda$$$ will in fact increase the size of the input. I will update the blog accordingly, thanks!

YeahPotato

13 months ago, # |

+38

Thanks for sharing! I have a question about details in the example, Gosha is Hunting. I'm not sure if I understood this correctly.

Should it be $$$h(\alpha,\beta)=\max f(A,B)\quad\text{ s.t. }|A|-a=\alpha,|B|-b=\beta$$$ instead of $$$|A|=\alpha,|B|=\beta$$$ according to the discussion above? If it's true, proving the concavity of $$$h(\alpha,0)$$$ (i.e. $$$|A|=a+\alpha,|B|=b$$$) will be non-trivial. I've seen some proofs that fix $$$A$$$ then sort contribution in $$$B$$$, but it doesn't make sense actually. Reducing to flow is also not quite easy.

adamant's blog

Tldr.

Lagrange duality

Change of domain

$$$1$$$-dimensional case

Integer search

$$$2$$$-dimensional case

Multidimensional case

Testing convexity

Problem examples

References