### Anai's blog

By Anai, history, 18 months ago,

Talking with my colleagues a few months ago, I noticed that except for the editorial of Aliens (IOI 2016), there isn't any comprehensive resource in learning the DP optimization (well, maybe except for this one, but... yeah), so I decided to write a more extensive tutorial on it. Cheers! ^^

• +158

 » 18 months ago, # |   +27 There is this article actually.P.S. I wonder if there's going to be any formal work on the connection of this trick with Lagrange multipliers...
•  » » 18 months ago, # ^ | ← Rev. 3 →   +17 Well... yes, I know the blog post, but I think there is room for quite a few more details. As for the Lagrange multipliers, now that you say it, that might just be the inspiration for the trick :))
•  » » 18 months ago, # ^ |   +13 This trick relies on a simple fact sometimes known as the "Lagrange sufficiency theorem":Let $X$ be a set, $f : X \to \mathbb R$ a 'value function', $h : X \to \mathbb R$ a 'constraint function', and $b \in \mathbb R$ our specific constraint. Suppose $\lambda \in \mathbb R$ and $x \in X$ are such that $x$ maximises $L(x,\lambda) = f(x) - \lambda(h(x)-b)$. Moreover, suppose $h(x)=b$. Then $x$ maximises $f(x)$ subject to the constraint $h(x)=b$.(The proof is trivial: just observe that $L = f$ when $h(x)=b$.)In the dp problems, $X$ is something like "all partitions of the sequence 1..n into contiguous subsequences", $f$ is what it is, $h$ is "number of parts of the partition", and $b$ is k.The general situation where the Lagrange sufficiency theorem is useful is where $X$ is 'nice', so that we can optimise $L(-, \lambda)$ on $X$ (using a one-dimensional dp maybe), but the set $\langle x \in X \mid h(x)=b \rangle$ is less nice, so that we cannot optimise on it directly (or maybe only in quadratic time). In general you will find that, letting $x(\lambda)$ maximise $L(-,\lambda)$, $h(x(\lambda))$ is monotonically decreasing in $\lambda$ (let's not worry about ambiguity in choice of $x(\lambda)$). So you should be able to find a $\lambda$ with $h(x(\lambda)) = b$ using binary search; binary search is fast so this method can be good. It might be that no $\lambda$ with $h(x(\lambda))=b$ exists, in which case this method isn't helpful. Convexity comes into play to ensure that $\lambda$ exists, and this is where things become a bit technical.The Wikipedia article only seems to deal with the case where $X$ is a manifold (i.e. a 'continuous' space), and $f, h$ are differentiable. I think the reason for this is that instead of having to worry about global extremisation and existence of $\lambda$, you just have to look for (constrained) stationary points, which is simpler (it's local). But if you want a sufficient condition for when your solution is the constrained maximum, then you need the "sufficiency theorem". (Of course, derivatives and stationarity aren't going to help you in discrete dp problems...)Not sure if this is what you were asking for, but I hope it helps.
•  » » » 18 months ago, # ^ |   0 Oh. It seems to be nice explanation. I think. (I will hopefully be able to actually read and comprehend it at some moment of time)
 » 18 months ago, # |   +49 My boy memento also wrote an article on this recently.
•  » » 18 months ago, # ^ |   +10 I like this article, is good.
•  » » 5 months ago, # ^ |   0 Hi memento, I really want to know which tool or platform you used to build a github.io blog like that. I also wonder how you insert math formula and graph, the font of text, I see it so beautiful. Many thanks.
•  » » » 5 weeks ago, # ^ |   +8 For blog, HTML :((. For math formula, mathjax. For graphs, geogebra (I later found out that you can do that more easily and beautifully using python matplotlib).
•  » » 5 weeks ago, # ^ |   +89 I know I'm kinda late, but I decided to read your article and just couldn't skip this random fact about Antarctica Due to the fact that $K$ is quite big, simulating $K−N$ iterations (maybe even with a priority queue) would take tremendous amount of time — roughly $10^9+7$ tons of ice will melt in Anterctica in that time. I did my research and the latest data I could find is from 2009-2017 from here. It is said that over the specified period the total mass loss of Antarctic Ice Sheet was $\approx 252 \pm 26\, Gt/y$. I will take the average $252\cdot10^9$ per year.Now, the code. It is pretty simple (it passes small tests on usaco, so it is probably right). Since you didn't specify the specs of a computer to run a program, let's use the popular and public ideone. So, here is my code.Obviously, the code is too slow, so all I could do is estimate running time. And as you can see, it said that $\approx 1.138 \cdot 10^9$ tons of ice will melt while this program is executing. It means that you was wrong only by $\approx 14\%$, which is breathtaking!Now I'm actually very curious: did you say $10^9 + 7$ randomly or at least googled some numbers?
•  » » » 5 weeks ago, # ^ |   +13 The effects of climate changing always irritated me, so I wanted to put something related to that, whatever the figure might turn out to be. And so I googled. Believe it or not, it was breathtaking for me too!Btw, thanks for checking out the article :D
 » 18 months ago, # |   +6 I think there are two typos. The summation in the definition of the second dp should be over $v_k$ and it should probably be $max(x_t - y_{t + 1} + 1, 0)^2$ in the last dp.
•  » » 18 months ago, # ^ |   +3 You're right, fixed :)
 » 5 months ago, # | ← Rev. 2 →   +14 There's some part I don't understand. After the binary search the answer is supposed to be $dp[n] - cnt[n] * \lambda$ if I want to use at most $K$ resources.However in this problem 1279F - New Year and Handle Change the idea is to perform at most $k$ operations and I received wrong answer (90721141) with that way but received accepted (90488910) with $dp[n] - k * \lambda$. I know it's always better to perform k operations instead of some number smaller than k, but anyways the idea is that I was supposed to get the optimal value with any number of operations equal or less than k at the end right?Is there something I am missing?
•  » » 5 months ago, # ^ |   +21 dp[n] — cnt[n] * lamba might be tied for many different possible cnt[n] (in fact, if aliens trick works for the problem the set of good values of cnt[n] is a contiguous range of possible cnt[n], try to prove this), in this case you should use k as cnt[n] by yourself explicitely.
•  » » » 4 months ago, # ^ |   0 How we can prove ,if dp is convex/concave ?. I mean is there any special property for cost function so that we can use binary search.
•  » » » » 4 months ago, # ^ | ← Rev. 2 →   +18 There's a concrete property that is the sufficient and necessary condition for aliens to work:For maximum cost: Let f(i) be the optimal answer taking i things/elements/operations. The increase you receive from i-1 to i must be at least as good as from i to i+1. In other words:f(i) — f(i-1) >= f(i+1) — f(i)orf(i) >= (f(i+1) + f(i-1)) / 2for minimum, >= turns into <=Now, as for how to prove, every problem is a different problem. You really need to approach every problem differently probably. But that second inequality is useful because a certain way to prove it is assume optimal answers for i-1 and i+1, then somehow exchange things/elements/operations in a way that i-1 and i+1 both end up with valid i things/elements/operations. If you can do that then you've found a valid set of i things/elements/operations that's at least as good as the mean of i-1 and i+1. To see how this works better look for that recent educational F that was aliens trick, someone explained in the editorial's comment section a proof for that problem. SpoilerIf you asked about how this implies we can use binary search, you can view it as points (i, f(i)). Every point in there should be in the convex hull. Now, solving the dp with a certain lambda is just maximum dot product queries with points (-lambda, 1).
•  » » » » » 4 months ago, # ^ |   0 I was solving a problem in which we have to split array into k segments , where cost of an subarray is square of sum of the elements of that subarray. I assume formation of an segment as an cut, initially there is not cut (k=1), now i observe that contibution of an individual cut decreases if we increases k(that is addition of more cuts). Now lets say answer for i-1 cuts is x. for i cuts , select any i-1 cuts there contibution to answer should be greater than or equal to x , for remaining one lets say it decreases the answer by delta1. ans(i-1)=x ans(i) =x-delta1. if i apply same thing to i+1 cuts, ans(i+1) = x-delta2-delt3. now to prove if our function is concave/convex we just have to proof 2*delta1>=delta2+delta3, and i think this is very much intuitive. please correct me if i am wrong.
•  » » » » » » 4 months ago, # ^ |   +13 It's something like that but you're oversimplifying it imo because it's not guaranteed that the other i-1 cuts will remain in the same position.
•  » » » » » » » 4 months ago, # ^ |   0 yeah i was also thinking the same.
•  » » » » 4 months ago, # ^ |   +23 I think there is a general theme: you get concavity from bounded integer linear programs whose relaxations do not introduce new basic solutions -- I'll explain what this means. This category of problems includes flow, and things which reduce to flow, and also partitions into contiguous subsequences (maybe this also reduces to flow, but I don't need to know this to prove concavity!).Consider a general integer linear program, i.e. something of this form: find a vector of $n$ integers $x = (x_1, \ldots, x_n)$ which maximizes some linear functional $f(x) = c_1x_1 + \ldots +c_nx_n$ subject to some linear cost constraint $Ax \le b$ (note that $Ax$ and $b$ are vectors -- this is a list of inequalities). Say $g(b)$ is the maximum $f$, as a function of $b$.Now consider the same problem, except we drop constraint that $x_i$ be integral. This is the "relaxation". Say $g_r(b)$ is the answer in this case. I claim that $g_r$ is obviously concave. Indeed, given $b_1, b_2$, and $0 \le t \le 1$, take $x^1, x^2$ such that $Ax^j \le b_j$, $f(x^j) = g(b_j)$, and let $x = tx^1 + (1-t) x^2$. Then $Ax \le tb_1 + (1-t)b_2$ and so $f(x) = t f(x^1) + (1-t) f(x^2) = t g(b_1) + (1-t) g(b_2) \le g( t b_1 + (1-t) b_2)$, as needed.This doesn't mean that $g$ is concave, as in general we could have $g < g_r$. But sometimes $g(b) = g_r(b)$. In particular this happens when any $x$ such that $Ax \le b$ is an affine combination of some integer solutions $x^1, \ldots, x^m$, $Ax^j \le b$, because in this case $f(x) \le \max_j f(x^j) \le g(r)$.For any bounded linear program (bounded means that $x$ such that $Ax = b$ cannot be arbitrarily large), all solutions are affine combinations of basic solutions (which are vertices of the polytope). You get basic solutions by deciding that some of the inequalities in $Ax$ should be equalities, and solving the corresponding system of equations.So here's the general recipe to prove convexity: phrase your problem as an integer linear program, check that it is bounded, and check that all basic solutions are integral.Let me describe how I believe this works for your problem of partitioning a sequence of length $n$ into at most $k$ parts, such that sum-of-squares-of-sums is minimized. (I guess the array should be nonnegative, since you don't get convexity for $1, -1, 1, -1$.) Say we have $n + {n \choose 2}$ unknowns $x_1, \ldots, x_n$, $d_{ij}$ ($1 \le i < j \le n$), where $x_i$ represents "which part is the $i$th term in", and $d_{ij}$ "should" be 1 if $i$, $j$ are in the same part, and $0$ otherwise. So our constraints are $1 \le x_1 \le x_2 \le \ldots \le x_n \le k$, $0 \le d_{ij} \le 1$, $1 - (x_j - x_i) \le d_{ij}$. Note that sum-of-squares-of-sums is a linear function in $d_{ij}$. (This formulation lets us take $d_{ij} = 1$ even if $x_i < x_j$, but that's never optimal, since we assume all elements of array have same sign.) So the integer version of this problem gives the answer you want. You can check that (for integer $k$) all basic solutions are integral. So the minimum sum-of-squares-of-sums is convex as a function of $k$, by the general theory above.
•  » » » » » 4 months ago, # ^ |   +10 Wow that is awesome, I think now i understands this topic little much better.
 » 4 months ago, # |   +9 In China, it is named " WQS Binary Search ". Because this trick was invented by Wang QinShi (王钦石, WQS). Here's the original paper.
 » 2 months ago, # |   +39 Anai, your website is down. Do you have a copy of the article? I remember enjoying it!
•  » » 6 weeks ago, # ^ |   +8 I am really sorry for this, I'm having some issues at univ and I haven't paid the server because I really wanna code the website again a better format and post some other mathsy things there. I'm 99% sure it will be done by the end of December.
•  » » » 6 weeks ago, # ^ |   +5 No problem, thanks for writing the article to begin with!If it's easy to upload a PDF version or post it as a CF blog, that would be great, but no worries if you can't.
 » 13 days ago, # |   +5 You can find a copy of the website here, the old URL will take a while to point to this IP.