nor's blog

By nor, 7 weeks ago, In English

TL;DR

Use the following template (C++20) for efficient and near-optimal binary search (in terms of number of queries) on floating point numbers.

Template

It is also possible to extend this to breaking early when a custom closeness predicate is true (for example, min(absolute error, relative error) < 1e-9 and so on), but for the sake of simplicity, this template does not do so.

Introduction

It is well known that writing the condition for continuation of the binary search as anything like while (r - l > 1) or while (l < r) is bug-prone on floating point numbers, so people tend to write a binary search by fixing the number of iterations. However, this is usually not the best method — you need $$$O(\log L)$$$ iterations where $$$L$$$ is the length of the range, and the range of floating point numbers is very large.

So, I came up with a way to binary search on (IEEE754, i.e., practically all implementations of) floating point numbers that takes bit_width(floating_type) + O(1) calls to the predicate for binary search. I am pretty sure that this method has been explored before (feel free to drop references in the comments if you find them). Regardless of whether this is a novel algorithm or not, I wanted to share this since it teaches you a lot about floating points, and it also has the following advantages:

  • No need to hard-code the number of iterations.
  • It avoids issues that you get in other implementations (for example, using sqrt, you end up with a ton of cases with 0, negatives and so on).
  • It is efficient (sqrt is expensive, and so is dealing with a lot of cases).

Arriving at the algorithm

I started out by thinking about this: floating point numbers have a very large range. Can we do better than $$$O(\log L)$$$ where $$$L$$$ is the length of the range? Recall that floating point numbers also have a fixed width that is much smaller than the log of the range they represent, and between two consecutive representable floating point numbers, it is their ratio that is nicely bounded (ignoring the boundary of 0 and inf and nans) instead of their difference. And it is well-known that to get to a certain relative error, it is optimal to use $$$\sqrt{lr}$$$ instead of $$$\frac{l + r}{2}$$$ as the midpoint in binary search.

So the first idea that comes to mind is to use sqrt instead of midpoint in the usual binary search. However, it has a lot of issues — for example, if $$$l = 0$$$, then you never progress in the binary search (this is fixable by doing a midpoint search on 0 to 1 and sqrt search on 1 to $$$r$$$). One more issue is how to deal with negatives (this is again doable by splitting the input range into multiple ranges where sqrt works) and with overflows/underflows. And the main issue with this approach is that sqrt is expensive — if you are doing a problem where the predicate is pretty fast and you need to do a lot of binary searches, most of your computation time would be due to sqrt.

Inspired by these, we decide to try to approximate sqrt in a way that preserves monotonicity. Here comes the main point — note that the IEEE754 implementation of floating point numbers separates the mantissa from the exponent, and the exponent part of $$$\sqrt{lr}$$$ is roughly the mean of that of $$$l$$$ and $$$r$$$. Since these are the top few bits (excluding the sign bit), we can just do the following (for positive floating point numbers at least): read the floating point representation as an integer (this can be done in a type-safe manner by using std::bit_cast), take the midpoint of these integers, and convert it back to a floating point number. This clearly preserves monotonicity — this can be verified easily by hand. Note that this same thing works for when both range endpoints are negative numbers too — for this we simply invert the sign bit. The case where both are of opposite signs is also simple (if you disregard the fact that +0 and -0 are equal but have distinct representations and reciprocals) — check the predicate on $$$0$$$ (both the zeroes if it matters for your predicate).

Now if we want to do some more handling (infinities, NaNs, denormals), we can do it as we wish. In the implementation in the TL;DR above, we decide to replace NaNs with the infinity in the correct direction, and since infinities are not useful as answers, we try to bring the range endpoint down to the largest representable finite floating point number instead.

Analysis

In all, there are at most $$$w + O(1)$$$ calls to the predicate, where $$$w$$$ is the length of the complement of the longest common prefix of the floating point representations of endpoints.

This implementation is also robust to predicates which are noisy near the boundary (i.e., near the boundary, there is a small range where the true values of the predicate can come after the false values) — in this case the algorithm returns something in this range near the boundary.

Note that we did not need to hardcode the number of iterations, nor did we require some carefully-written predicate for loop termination.

Also, since std::bit_cast is practically free compared to std::sqrt, it is much faster in cases where you need to do a lot of binary searches.

As a usage example, this submission uses the usual binary search with fixed number of iterations, and this submission uses the template above.

Some problems

The following are some problems that use binary search on floating points, and should be solvable using this template. If you encounter any bugs in the template while solving these problems, do let me know in the comments below. Thanks to jeroenodb and PurpleCrayon for problem suggestions.

Full text and comments »

  • Vote: I like it
  • +46
  • Vote: I do not like it

By nor, 7 weeks ago, In English

MikeMirzayanov added a new compiler in response to the bug mentioned here. However, it does not come without a catch.

Namely, any pragma of the form #pragma GCC target(...) will NOT work with this new compiler. The issue is well known to people who use up-to-date compilers, but there has not been much progress towards a fix.

One of the ways to fix it is to add the target pragma AFTER the STL includes. This should make your code compile, but a lot of code would simply not be optimized, so use it at your own risk.

The other way is to use g++-12, but then you risk a TLE due to the issue in the linked blog (it's easy to write hacks for most problems). Fortunately, there is a fix for this hack but, echoing the words of the blog author, use it at your own risk.

For completeness, the first two of these don't compile on g++-13 but the last one does.

#pragma GCC optimize("O3")
#pragma GCC target("avx")
#include <vector>
int main() { std::vector<int> a; }

#pragma GCC target("avx")
#include <vector>
int main() { std::vector<int> a; }

#pragma GCC optimize("O3")
#include <vector>
#pragma GCC target("avx")
int main() { std::vector<int> a; }


By nor, history, 2 months ago, In English

When submitting to 1923E - Count Paths, I noticed that 248189683 takes 124ms and 248191037 takes 420ms.

Both of these are the same code (this can be confirmed via diff), and there is no indeterminate behaviour like randomness involved in the submissions. This points to some issue with the judging machines (maybe one judging server is just slower than the other, or the server load was high).

Pinging MikeMirzayanov for increasing chances of this getting fixed. Thanks!


By nor, 3 months ago, In English

Someone asked me about my template that used a "cache wrapper" for lambdas, so I decided to write a blog explaining how it works. For reference, here is a submission of mine from 2021 that uses that template.

Here's what you will find implementations for in this blog:

  • Generalized hashing (for tuple types, sequence types and basic types)
  • Convenient aliases for policy based data structures
  • Wrappers for recursive (or otherwise) lambdas that automatically do caching (memoization) for you.

Jump to the Usage section if you only care about the template, though I would strongly recommend reading the rest of the blog too since it has a lot of cool/useful things in my opinion.

If you know how to implement recursive lambdas


By nor, 4 months ago, In English

Introduction

A lot of people shy away from solving (mathematical) recurrences just because the theory is not very clear/approachable due to not having an advanced background in math.

As a consequence, the usual ways of solving recurrences tend to be:

  • Find the first few terms on OEIS.
  • Guess terms from the rate of growth of the recurrence (exponential rate of growth means you can sometimes estimate the exponential terms going from largest to smallest — though this fails in cases where there is a double-root of the characteristic equation)
  • Use some theorem whose validity you can't prove (the so-called characteristic equation method)
  • Overkill using generating functions

But this doesn't have to be the case, because there is a nice method you can apply to solve recurrences reliably. I independently came up with this method back when I was in middle school, and surprisingly a lot of people have no idea that you can solve recurrences like this.

The method is more principled and reliable than the first two guesswork methods, the third method might fail to apply in some cases, and the fourth method requires knowing what an infinite series is. An added bonus is that it prevents you from forgetting special cases that often plague recurrences! Meanwhile, as we show in the last section, having intuition about the method also allows you to apply the ideas developed to things that are only vaguely related to recurrences.

The best thing about the method is that it requires nothing more than school-level algebra and potentially some intuition/common sense.

I'm not saying that you should only use this method, just that keeping it in mind is pretty helpful whenever you're faced with a recurrence (or even more fundamentally, linear combinations of numbers).

A simple example

Note: I recommend working through this section (with pen and paper) with the examples $$$x_k = 2x_{k-1} + 3$$$ and $$$x_k = x_{k-1} + 5$$$. Using symbols can be intimidating for some and thus examples should be used while presenting new ideas, but it is also important to understand the intuition behind the underlying ideas in the form of algebra.

Let's first consider a recurrence $$$x_k = ax_{k-1} + b$$$. We employ wishful thinking: what if there were no $$$b$$$ in the recurrence? In that case, we would have had $$$x_k = x_0 \cdot a^k$$$, and that would be the end of the story. So can we reduce the general case to one where $$$b = 0$$$?

If you define $$$y_k = x_k + c$$$ (i.e., vertically shift the graph of $$$(i, x_i)$$$ by some offset), then we have $$$y_k = a(y_{k-1} - c) + b + c$$$. Now we have one degree of freedom, and we can choose $$$c$$$ such that the constant term $$$-ac + b + c$$$ is $$$0$$$, i.e., $$$c = \frac{b}{a-1}$$$. Note that we immediately run into a special case here: $$$a = 1$$$ — in this case, such a $$$c$$$ is impossible to choose. So we make two cases to finish off the problem:

  • $$$a = 1$$$: In this case, we have $$$x_k = x_{k-1} + b$$$. This means that $$$x$$$ is an arithmetic progression and not a geometric progression, and $$$x_k = x_0 + bk$$$.
  • $$$a \ne 1$$$: In this case, there is a $$$c$$$ such that $$$y_k = ay_{k-1}$$$, so $$$y_k = a^k y_0$$$. Translating it back into $$$x$$$-space (expressing $$$y$$$ in terms of $$$x$$$), we have $$$x_k + c = a^k (x_0 + c)$$$ which gives us $$$x_k = a^k x_0 + \frac{a^k - 1}{a - 1}b$$$.
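To make this concrete, let's work the first recommended example, $$$x_k = 2x_{k-1} + 3$$$, through the method: here $$$a = 2 \ne 1$$$, so $$$c = \frac{b}{a-1} = \frac{3}{1} = 3$$$, and $$$y_k = x_k + 3$$$ satisfies $$$y_k = 2y_{k-1}$$$. Hence $$$x_k + 3 = 2^k(x_0 + 3)$$$, i.e., $$$x_k = 2^k x_0 + 3(2^k - 1)$$$, which matches the general formula. The second example $$$x_k = x_{k-1} + 5$$$ falls into the $$$a = 1$$$ case, giving $$$x_k = x_0 + 5k$$$.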

Exercise: Try to solve the recurrence $$$x_k = 2 x_{k-1}^2$$$ using this method.

A slightly harder example

Note: I recommend working through this section with two separate examples in mind: $$$x_k = 3x_{k-1} + 10x_{k-2}$$$ and $$$x_k = 4x_{k-1} - 4x_{k-2}$$$.

Let's now consider the recurrence $$$x_k = a x_{k-1} + b x_{k-2} + c$$$. For the sake of simplicity of presentation, we note that we can do something like the previous example to get rid of the $$$c$$$, so let's assume $$$c = 0$$$. So we want to solve $$$x_k = a x_{k-1} + b x_{k-2}$$$.

Now how do we deal with the two lagging $$$x_{k-1}$$$ and $$$x_{k-2}$$$ terms instead of just one? The trick is to try and find $$$d$$$ such that $$$y_k = x_k - d \cdot x_{k-1}$$$ becomes a geometric progression (we can also use a $$$+d$$$ instead of $$$-d$$$ like we did in the previous solution, but that does not matter).

We can also look at it in this way: we want to subtract a suitable multiple of $$$x_{k-1}$$$ from both sides, so that $$$x_k - d x_{k-1}$$$ (for some $$$d$$$) becomes a multiple of its lagged sequence.

For that to happen, notice that we must have $$$y_k = e y_{k-1}$$$. This translates to $$$x_k - d x_{k-1} = e (x_{k-1} - d x_{k-2})$$$. Comparing it with the original recurrence, we must have $$$d + e = a$$$ and $$$de = - b$$$. So we have $$$|d - e| = \sqrt{(d + e)^2 - 4de} = \sqrt{a^2 + 4b}$$$ (note how this relates to solving a quadratic equation: $$$x^2 - ax - b$$$ needs to be factorized into $$$(x - d)(x - e)$$$ — this will be helpful later on). Using this, we can solve for $$$d$$$ and $$$e$$$.

Now that we have $$$d$$$ and $$$e$$$, we have $$$y_k = e y_{k-1}$$$, so $$$y_k = e^{k-1} y_1$$$. Let's now, as earlier, go back to $$$x$$$-space. We have $$$x_k - d x_{k-1} = e^{k-1} y_1$$$: note that $$$y_1 = (x_1 - d x_0)$$$ is a constant we already know.

We now have two methods of solving this: one is to simply add these equalities for different $$$k$$$ in a telescoping manner, after dividing the whole equation by $$$d^k$$$. Just for the sake of showing the usage of this method a bit more, I will go with the other one.

Looking at the new recurrence we got for $$$x$$$, we again want to define some $$$z_k$$$ as a function of $$$x_k$$$, such that $$$z_k = d z_{k-1}$$$. Since there is a term proportional to $$$e^{k-1}$$$ on the right, we will choose $$$z_k = x_k + f \cdot e^{k-1}$$$.

Going back to the recurrence in $$$x$$$, we get $$$z_k - f \cdot e^{k-1} - d \cdot z_{k-1} + d \cdot f \cdot e^{k-2} = e^{k-1} \cdot y_1$$$.

To get rid of the non-$$$z$$$ terms, we must have $$$f \cdot (d - e) = ey_1$$$. We are again faced with two cases, as in the previous example.

The case with $$$d \ne e$$$ (the "distinct-root" case) is easy: in this case, we can just solve for the problem analogously to what we did to solve the simpler example.

The only remaining case is $$$d = e$$$ (the "repeated-root" case): let's look closely at why we failed here. The two terms coming from $$$x_k$$$ and from $$$x_{k-1}$$$ cancel out completely. What can we do to be able to have different contributions of $$$e^{k-1}$$$ from both of these terms? Let's choose $$$z_k = x_k + g(k) \cdot e^{k-1}$$$ instead. Then we must have $$$g(k - 1) - g(k) = y_1$$$. This means $$$g(k)$$$ is in fact a linear function of $$$k$$$. That is, we have $$$g(k) = hk + i$$$ for some $$$h, i$$$: here $$$h = -y_1$$$ and $$$i$$$ can be set to anything arbitrary (for now let's say $$$i = 0$$$). So we choose $$$g(k) = -y_1 k$$$, i.e., $$$z_k = x_k - y_1 \cdot k \cdot e^{k-1}$$$.

Since we know $$$z_k = dz_{k-1}$$$, we know that $$$z_k = d^k z_0$$$. Going into $$$x$$$-space (expressing $$$z$$$ in terms of $$$x$$$), we have a solution to the equation.
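For reference, here is what the method gives on the two recommended examples. For $$$x_k = 3x_{k-1} + 10x_{k-2}$$$, we need $$$d + e = 3$$$ and $$$de = -10$$$, so $$$\{d, e\} = \{5, -2\}$$$ (the roots of $$$t^2 - 3t - 10$$$), and the distinct-root case gives $$$x_k = c_1 \cdot 5^k + c_2 \cdot (-2)^k$$$ with $$$c_1, c_2$$$ determined by $$$x_0$$$ and $$$x_1$$$. For $$$x_k = 4x_{k-1} - 4x_{k-2}$$$, we need $$$d + e = 4$$$ and $$$de = 4$$$, so $$$d = e = 2$$$, and the repeated-root case gives $$$x_k = (c_1 + c_2 k) \cdot 2^k$$$.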

A short sketch of how to deal with higher order recurrences

In this section, just by generalizing the approach in the previous section, we will show how you can in fact derive the method of characteristic equations while solving recurrences.

Remember the note about solving for $$$d$$$ and $$$e$$$. It looks like finding a factorization of $$$x^2 - ax - b$$$ in the form $$$(x - d)(x - e)$$$.

When you try to use the method above for third order recurrences, you will quickly realize that we are trying to find a factorization of $$$x^3 - ax^2 - bx - c$$$ into $$$(x - d)(x^2 - ex + f)$$$ for the first step, and as you go along solving the lower order recurrence at every step, you will realize that you also need to factorize $$$x^2 - ex + f$$$ into two linear polynomials.

If the roots $$$r_i$$$ ($$$i = 1$$$ to $$$3$$$) of $$$x^3 - ax^2 - bx - c$$$ are all distinct, you will get $$$x_k$$$ in the nice form $$$x_k = \sum_{i=1}^3 c_i r_i^k$$$. But if the roots have multiplicity more than 1 (i.e., $$$(x - r)^s$$$ divides the polynomial for some $$$s > 1$$$), then the $$$c_i$$$ becomes a polynomial in $$$k$$$ with degree $$$s - 1$$$.

Using induction, you can show that a result of this form is indeed true for higher order recurrences as well.

An example slightly unrelated to recurrences

There's a funny story related to this method. Back when I was testing Pinely Round 3 (Div. 1 + Div. 2), problem D was not in the problemset. When it was added (with $$$k = 1$$$ as the easier version, the problem was supposed to be problem B in the set), I solved the problem instantly because I had used this method for solving recurrences before, and thought that its difficulty should be no more than that of an A. A lot of people disagreed with me on this, and it was eventually agreed upon to use general $$$k$$$ as a harder problem — I still disagreed with this opinion for a while, but then I realized that not a lot of people know this method of approaching recurrences/linear combinations.

This was partly the motivation behind writing this blog (the other part was the existence of blogs like this).

Here's how you solve that problem using this method.

In this problem, you are given a reverse recurrence of the form $$$y + z = x + k$$$. Inspired by the method developed above, we will just subtract $$$k$$$ from both sides, and let $$$x' = x - k, y' = y - k, z' = z - k$$$, to get $$$y' + z' = x'$$$.

This means that if we shift all the array elements vertically by $$$-k$$$, we are just splitting elements into other elements $$$\ge 1 - k$$$.

(Arguably) the hardest part of the problem is to realize that you can shift numbers and deal with them as usual, after the transformation. But to a person who knows this method well, it might seem like this problem was made specifically to be solved by it.

Conclusion

In this blog, we saw that we can solve recurrences of the form $$$a_k = f(a_{k-1}, \dots, a_{k - l})$$$ by trying to impose easy-to-analyze structure on expressions of the form $$$b_k = g(a_{k}, a_{k-1}, \dots, a_{k - l + 1})$$$. In each example, the structure was of the form $$$b_k = r b_{k-1}$$$, which gives us a way to compute $$$b_k$$$ directly, thereby reducing the "order" of the problem from $$$l + 1$$$ to $$$l$$$. If you tried the exercise, the structure was supposed to be of the form $$$b_k = b_{k-1}^2$$$.

In a way, this is an application of the divide and conquer way of thinking (in the general sense). This kind of divide and conquer, but in a bottom-up way rather than a top-down way, is also used when you derive things like Hensel's lifting lemma.


By nor, 4 months ago, In English

Motivation

On a computer science discord server, someone recently asked the following question:

Is the master theorem applicable for the following recurrence? $$$T(n) = 7T(\lfloor n / 20 \rfloor) + 2T(\lfloor n / 8 \rfloor) + n$$$

There was some discussion related to how $$$T$$$ is monotonically increasing (which is hard to prove), and then someone claimed that there is a solution using induction for a better bound. However, these ad hoc solutions often require some guessing and the Master theorem is not directly applicable.

Similarly, in the linear time median finding algorithm (or the linear time order statistics algorithm in general), the recurrence is $$$T(n) = T(\lfloor 7n/10 \rfloor + \varepsilon) + T(\lfloor n/5 \rfloor) + n$$$ where $$$\varepsilon \in \Theta(1)$$$. Also, in merge sort and creating a segment tree recursively, the time complexities are $$$T(n) = T(\lfloor n/2 \rfloor) + T(\lceil n/2 \rceil) + f(n)$$$ where $$$f(n)$$$ is $$$\Theta(n)$$$ for merge sort and $$$\Theta(1)$$$ for segment tree construction.

Even the general form of the recurrence that the master theorem handles is just $$$T(n) = aT(n/b) + f(n)$$$, and it doesn't handle cases of the form $$$T(n) = aT(\lfloor n/b \rfloor) + f(n)$$$ and $$$T(n) = aT(\lceil n/b \rceil) + f(n)$$$, which is what we find in a lot of algorithms.

To cleanly handle all these cases at once, there is a generalization of the Master theorem for recurrences, called the Akra-Bazzi theorem. It is not really well-known, and since it is not trivial to prove (the proof is pretty nice, I recommend looking at it), it is often skipped over in CS courses on algorithms (sometimes even the instructors are not aware of this theorem). This is why I wanted to write this blog — to provide a better way of reasoning about your divide and conquer algorithms.


Statement

Theorem: Consider $$$T(x)$$$ given by a sufficient number of base cases (for instance, bounded for $$$x < x_0$$$ for $$$x_0$$$ defined later), and defined recursively by $$$T(x) = g(x) + \sum_{i=1}^k a_i T(b_i x + h_i(x))$$$ for large enough $$$x$$$ (i.e., for all $$$x \ge x_0$$$ for some $$$x_0$$$).

Then, subject to the following conditions:

  • $$$a_i > 0$$$ and $$$0 < b_i < 1$$$ are constants for each $$$i$$$
  • $$$g$$$ is differentiable and $$$|g'(x)|$$$ has at most polynomial growth (i.e., $$$|g'(x)| \in O(x^c)$$$ for some constant $$$c$$$)
  • $$$|h_i(x)| \in O(x/(\log x)^2)$$$ for all $$$i$$$

We have, if $$$p$$$ is the unique solution to $$$\sum_{i=1}^k a_i b_i^p = 1$$$, then

$$$

T(x) \in \Theta \left(x^p \left(1 + \int_{1}^{x} \frac{g(u)}{u^{p+1}} \mathrm{d}u \right)\right)

$$$

There are some more pedantic conditions on $$$x_0$$$ and $$$g$$$ for this to hold, but for all practical purposes, those conditions (like some definite integrals with finite upper and lower limits are finite) are true. For more, you can refer to this paper.


Advantages and applications

This has the following advantages:

  • It is much more general than the master theorem, and in the different cases for $$$p$$$, it readily reduces to the master theorem.
  • Being applicable to more back-of-the-envelope calculations and more CS-like recurrences, where you don't want to think about floors/ceils/constant offsets — the "error term" $$$h_i$$$ takes care of that for you.
  • Reduces your dependence on ad hoc methods like guess-and-induct.

Let's apply this theorem to the few examples mentioned above (although there won't be any non-trivial results among the above examples, an important special case will be illustrated in the following examples).

  • The original recurrence: $$$T(n) = 7T(\lfloor n / 20 \rfloor) + 2T(\lfloor n / 8 \rfloor) + n$$$ — note that in this case, we get $$$0 < p < 1$$$. The theorem claims that $$$T(n) \in \Theta(n^p (1 + \frac{n^{1-p} - 1}{1-p})) = \Theta(n)$$$.
  • The same analysis as above holds for the linear-time median finding algorithm.
  • For the merge sort recurrence and the segment tree building recurrence, we have $$$p = 1$$$. The theorem allows us to use the Master theorem directly without caring about the floors.


By nor, 5 months ago, In English

Disclaimer: This blog (and all of my other blogs, unless specified otherwise) is 100% ChatGPT-free — there has been no use of any AI/ML-based application while coming up with the content of this blog.

The reason and an appeal

The contents are as follows:

  • Introduction
  • Why you should use lambdas
  • Some context
  • Hand-rolling our own lambdas
  • C++ lambda syntax explained
  • Using lambdas
  • Using lambdas with the STL
  • Some useful non-trivial patterns
  • Some other interesting patterns
  • Examples of competitive programming code using C++ lambdas

Prerequisites: knowing a bit about structs/classes in C++, member functions, knowing that the STL exists.

If you feel something is not very clear, I recommend waiting for a bit, because I tried to ensure that all important ideas are explained at some point or another, and if you don't understand and it doesn't pop up later, it is probably not that important (and should not harm the experience of reading this blog). Nevertheless, if I missed out on explaining something that looks important, please ask me in the comments — I'd be happy to answer your questions!

I would like to thank (in alphabetical order) AlperenT, bestial-42-centroids and kostia244 for discussions that helped improve the presentation and teachability of this blog.


Introduction

A few years back, I co-authored a blog on pragmas because there were very few people who understood what pragmas do. I noticed that the same is true, albeit to a lesser extent, for lambdas (C++ and otherwise) — some using them as a minor convenience but nothing more; others having an irrational (and ignorant) disdain for them, mainly arising either out of some experiences that come from not using them properly or a love of all things "minimalistic".

Hence, part of the motivation behind this blog is to show and convince people that lambdas can be very useful, both for your code AND your thought process, and that they will make your life easier if you do any amount of programming in a language that has decent support for lambdas (yes, not only C++). Thus, a significant part of the blog would also talk about some topics related to models of computation and some history. And, of course, I will make a few philosophical arguments along the way, which you can feel free to skip, but they are helpful as a first step to realizing why such a change in perspective is quite important for any learning.

The other reason I am writing this blog is to explain C++ lambdas on a much deeper level than people usually go into, without being as terse and unreadable for a beginner as the cppreference page — this will involve rolling our own lambdas! This activity will also help us understand why certain behaviors are intuitive for C++ lambdas. And in the end, as everyone expects a tutorial on CF to be, we will also show some tricks that are especially useful in the competitive programming context.

Examples of lambda usage in code don't come until much later in this blog, because we first develop an understanding of what role lambdas actually play in a language, instead of just introducing them as nifty tricks that can help you shorten your code. If you want to only look at applications, please skip to the last few sections.

Hopefully, this blog will also get people interested in programming language theory — especially functional programming and parts that deal with language design.

Finally, no matter how experienced you are with lambdas/C++/programming languages, there is a high chance that you will find some piece of code/idea in this blog that will look very new. I recommend checking out the two sections on patterns (in one of them we implement OOP using lambdas, for example), and the discussion on untyped v/s typed lambda calculus and implementing recursion.


Why you should use lambdas

I recommend reading this section again after reading the rest of the blog since it'll make more sense after you understand what lambdas are. But since the reader is probably growing restless about why they should be investing their time in lambdas, here's a list of reasons why you should learn about lambdas:

  • Philosophical reasons: Imperfect as it may be, the Blub paradox implies that a programmer's way of problem-solving using code is upper-bounded by the very language they use: they can correctly see how limited more primitive languages are, but fail to realize why any feature a higher-level language gives them should help them as programmers, and dismiss higher-level languages as just their preferred language plus some weird feature they don't care about. Meanwhile, as people who know math, we probably already realize that thinking on a lower level than necessary is harmful to our development because we don't see the bigger picture. Think of it like having induction as a tool in our toolbox versus having to repeat an argument 1000 times to show that $$$P(1000)$$$ is true for some predicate $$$P$$$ — this is what it feels like to code in a language that does not have for-loops, which a person who exclusively copypastes code to manually implement what a loop could have done would be very proud of. The moral of the story? Think on a higher level, code in whichever language you want to.
  • More practical reasons — You can define a lambda anywhere instead of functions that must be defined at namespace (global or otherwise) scope. So lambdas require you to scroll much less, prevent bugs, keep things local, and avoid polluting the namespace — all of which are supposed to be good practices. They also reduce context switching that makes code bug-prone.
  • You can store functions in data and pass functions more uniformly. This can be achieved by function pointers, too, but the other features make lambdas much more convenient for this purpose. It bypasses the need to make "tagged" versions of functions and implement your own vtable kind of things.
  • You can store data inside functions! Instead of passing a value/reference as a parameter each time a function is called (for example, using the same dfs function for a local graph requires you to pass the local graph by reference to the dfs, which is inconvenient), you can just store (in C++ lambda language, "capture") the reference/value of a variable inside the lambda and be done with it.
  • The standard library has a ton of support for lambdas, such as in sorting, binary searching, filtering, aggregating, etc. The usage is as simple as passing a single line lambda into a function provided by the STL.

Some context

The unusual name lambda comes from lambda calculus, which is a mathematical system for expressing computation, much like what a computer program is supposed to be written in. In this computation system, we have only three types of expressions, which are supposed to be the representation of each computation:

  • terms of the form $$$x$$$: these correspond to variables.
  • terms of the form $$$\lambda x. E$$$: this corresponds to a function that takes input $$$x$$$ and returns the result of evaluating $$$E$$$, which is an expression that may or may not involve $$$x$$$. In C++, this would roughly correspond to a function with parameter $$$x$$$ and function body return $$$E$$$.
  • terms of the form $$$(E_1 \, E_2)$$$ where $$$E_1$$$ and $$$E_2$$$ are two expressions: this corresponds to applying the function represented by the expression $$$E_1$$$ to the argument $$$E_2$$$. In C++, this would roughly correspond to a function $$$E_1$$$ being called with $$$E_2$$$ as a parameter, i.e., $$$E_1(E_2)$$$.

And that's it. Note that there are no numbers and no restriction on what $$$E_2$$$ evaluates to. In particular, $$$E_2$$$ can be a function too! The same holds for $$$E$$$ and $$$E_1$$$ as well, so there are many kinds of expressions that you can write that will make sense in this "language". Evaluation of these expressions is done using certain rules called reductions (which resemble simplification of mathematical expressions), though at this stage, it is not really necessary to understand what they do (it is sufficient to take some amount of intuition from any language of your choice to develop an intuition for it, though care should be taken to avoid taking the intuition too far, because there are certain core concepts in languages that are simply not present in this lambda calculus — for instance, types).

It turns out that this system of computation is Turing complete, i.e., any program that can be run on a Turing machine can be simulated using this method of computation. But why is this relevant to the discussion?

The second type of term here is called a lambda term. This is where the lambda terminology comes from — because lambdas play the role of functions in this system. Except in lambda calculus, you're allowed to do a large variety of things by virtue of the lack of constraints in the definition. For example, you can pass around functions to other functions (remember that $$$E_2$$$ is not constrained to be of a particular "type").

Nitpick

But note that we are not allowed to manipulate functions in C++ as freely as the above definition dictates. For example, we are not allowed to return functions that depend on the input because function pointers in C/C++ are supposed to be pointers to things known at compile time.

To illustrate this, let's temporarily add numbers and the $$$+$$$ operator to the lambda calculus (by the way, they can be semantically simulated using just lambda expressions using what are called Church numerals, but that is a completely different story for another day). The following is a valid lambda expression:

$$$\lambda x. \lambda y.(x + y)$$$

This function takes a value $$$x$$$ and returns a lambda that takes an input $$$y$$$ and returns $$$x + y$$$. Note that the output of the inner lambda depends on the value of $$$x$$$. For example, $$$(\lambda x. \lambda y.(x + y)) (42)$$$ is just $$$\lambda y.(42 + y)$$$. So we have something analogous to a function that returns a function, such that the returned function is dependent on the input to the outer function. (By the way, this is what I meant by lambdas reducing the dependence on "tagged" functions, because you have a great tagging mechanism for lambdas — just store the data inside it).

The corresponding in C++ style syntax should be something like (but this isn't legal C++ code, so I took the liberty of using non-standard syntax here):

int(int) lambda1(int x) {
    int lambda2(int y) {
        return x + y;
    }
    return lambda2;
}

So, there is a clear difference between C++ functions and lambda terms from the lambda calculus despite all the other semantic similarities. This discussion already hints to us what a lambda function (as we will call it from now on) should look like in C++.

Let's digress for the rest of this section into the discussion of functional programming languages.

Note that the imperative programming model (let's think about C for the sake of concreteness) is largely based on the Turing machine model of computation — you have tapes (memory), pointers, and states. The main focus is on evolving the state of the program and the memory in order to get to an end goal.

The functional programming model, on the other hand, is largely based on lambda calculus. You rarely have any state in a functional programming language like Haskell or OCaml (especially in a purely functional language like Haskell). Instead, you express your code in the form of function applications and some basic inbuilt types (like integral types) and functions (like numerical operators).

Both of them are equally powerful. However, functional programming languages are often at a much higher level than imperative languages because of the power of abstraction and resemblance to mathematics, which rarely talks about state in any of its fundamental parts.

Also, a fun fact — the sharp-eyed reader will have noticed that the lambda function we wrote is semantically equivalent to a function that takes two numbers $$$x$$$ and $$$y$$$ and returns their sum, except that we get partial application for free, by "binding" $$$x$$$ to the lambda after the first application and returning a function that adds $$$x$$$ to anything. This style of writing lambda terms is called currying and can be used to implement multi-argument functions in a language where functions only take one argument.

This is just scraping the surface of what is otherwise a crazy set of things that can be done with lambda calculus. For example, one interesting thing you can do with it is to show how weird computation systems like the SKI combinator calculus are in fact Turing complete. Also, since the lambda calculus is Turing complete, there is a way (a quite elegant one at that) to represent recursion called a fixed point combinator, which can be implemented in multiple ways, one of which is the Y-combinator.


Hand-rolling our own lambdas

Since C++ is Turing complete, should we expect there to be a way to implement (a typed variant of) lambda calculus in C++? Definitely, though we probably shouldn't expect it to be very elegant. Or can we? Read on to find out.

Let's think about what a lambda function (shortened to lambda for convenience) should behave like in C++.

It should be what is called a "first-class citizen" of the language in programming language design — i.e., all operations that are supported for other entities should also be supported for lambdas. In other words, it should have the following properties at the very least:

  • Should be able to be passed as a function argument
  • Should be able to be returned from a function
  • Should be able to be assigned to a variable

The most general kind of thing that C++ supports that adheres to these rules is an object. Let's try to implement a lambda as an object in that case.

So for now, a lambda looks like this for us:

struct Lambda {
    // ???
};
Lambda lambda{};

Note that since structs can be defined in a local scope, we will be able to do whatever follows inside a local scope as well as a namespace scope.

Now according to its definition, we should be able to call a lambda as well. The most obvious way of associating a function with a struct is to implement a member function, let's say call. However, one cool functionality that C++ provides is the ability to overload operator() for a struct — this is the function call operator, and it will help us reduce the syntactic boundaries between a function and an object. So rather than using the name call, we will overload operator(), which does the same thing but with a better syntax.

So, something like the following prints 2 in a new line:

struct Lambda {
    int operator()(int x) const {
        return x + 1;
    }
};
Lambda lambda;
std::cout << lambda(1) << '\n';

At this point, we should note that such a struct which overloads operator() is called a function object in the C++ standard (and is frequently informally referred to as the horribly named "functor" which has a very different meaning in category theory, so it will irk a lot of people if you call a function object a functor in front of them).

Now that we can wrap functions into objects, we can do whatever an object can — for example, we can now return something that behaves like a function, so it might seem that the story ends here. However, note that we didn't implement anything that allows us to "bind" data to functions, as we wanted to do in our addition example. It is necessary to implement this, in order to make $$$E$$$ aware of the value of $$$x$$$ somehow, when applying $$$\lambda x. E$$$ to some $$$x$$$.

Since we want to emulate a mathematical language, for now we don't really care about C++ semantics, and we will choose value semantics for $$$x$$$ (which will make a lot of copies, but will make reasoning easier). This basically means that whenever we store $$$x$$$, it will be, just like in mathematics, completely independent of wherever its value was taken from.

Let's try our hand again at the $$$\lambda x. \lambda y.(x + y)$$$ example. Let's write it in a more computer-program-style syntax with parentheses at all the right places (which some people might recognize as being one infix-to-prefix transformation away from being Lisp code):

(lambda x.
    (lambda y.
        (x + y)))

Since there are two $$$\lambda$$$-s, we would need two structs this time — one for the outer lambda and one for the inner lambda. The inner lambda needs to "capture" $$$x$$$, so we store a copy of $$$x$$$ in it whenever it is constructed.

struct LambdaInner {
    int x_; // our binding
    LambdaInner(int x) : x_{x} {}
    int operator()(int y) const {
        return x_ + y;
    }
};
struct LambdaOuter {
    LambdaInner operator()(int x) const {
        return LambdaInner(x);
    }
} lambda_outer;
LambdaInner lambda_inner = lambda_outer(1);
std::cout << lambda_inner(2) << '\n'; // or lambda_outer(1)(2)

But this starts looking less and less like actual lambda functions now. The trick is to use the fact that we are able to declare structs in local scope.

struct LambdaOuter {
    auto operator()(int x) const {
        struct LambdaInner {
            int x_; // our binding
            LambdaInner(int x) : x_{x} {}
            int operator()(int y) const {
                return x_ + y;
            }
        };
        return LambdaInner(x);
    }
} lambda_outer;
std::cout << lambda_outer(1)(2) << '\n';

Okay, this now resembles lambda expressions a bit more than before. But there is still a large amount of noise in the code — for example, having to declare x_ and naming the structs for the lambdas. Turns out that we have exactly the tools for the job with C++ lambdas, which provide syntactic sugar for this kind of a thing:

auto lambda_outer =
    [](int x) {
        return [x](int y) {
            return x + y;
        };
    };
std::cout << lambda_outer(1)(2) << '\n';

Here, the lambda being returned has "captured" or "bound" the variable x by storing a copy of it inside itself.

Don't worry if you're a bit baffled by the jump in syntax from that monstrosity of object-oriented code to something this simple, for now is the time to look at the syntax of a C++ lambda.


C++ lambda syntax explained

A C++ lambda expression (or lambda function), in its simplest form, has the syntax

[capture_list](parameter_list) specifiers -> trailing_return_type {
    // function body
};

where the trailing return type and specifiers are both optional (we will get to them in a moment, they're not super critical for now).

Given that lambda functions aim to replicate a lot of C++ function functionality, there is a lot of noise you could introduce here too, with syntax that I have not mentioned in the above — you can make it something as complex as the following as of C++20 (even ignoring the possible *this and variadic captures):

int x = 1;
int y = 0;
auto lambda = [&, x = std::move(x)]
                <std::integral T>
                requires(sizeof(T) <= 8)
                (T a, auto b, int c)
                mutable constexpr noexcept [[deprecated]]
                -> std::uint64_t {
    y = x;
    return static_cast<std::uint64_t>(c) + a + sizeof(b) + x + y;
};

However, we won't be really concerned with a lot of this, and will only deal with a few special cases (besides, if you understand the rough correspondence between function objects and lambdas, it becomes more intuitive to guess how this syntax works).

In the syntax mentioned earlier, the lambda roughly corresponds to this struct (don't worry if you don't understand this yet, we will explain the terms used here in detail):

struct UnnamedLambda {
    // 1. variables corresponding to the capture list go here
    // 2. constructor for storing captured variables
    // 3. call operator - with some specifiers if specified:
    auto operator()(parameter_list) const -> /* trailing_return_type, if any, goes here */ {
        // lambda function body
    }
};

Note that the member function is marked const by default — this means that it is not allowed to do anything that mutates (or is allowed to mutate but doesn't) the member variables. We can get past this restriction, as we will see later.

Warning: this is only a representative correspondence and works well for almost all purposes (for example, the copy assignment operator of lambdas is deleted, but we didn't show that for the sake of simplicity). It is possible for the C++ lambda behaviour to diverge in some edge cases from the behaviour of this struct, but I haven't seen any such divergences.

For example, a lambda like the following:

int y = 1, x = 0;
auto lambda = [y, &x](auto a, int b) constexpr mutable -> int {
    return sizeof(a) * y + b * x;
};

corresponds to

struct UnnamedLambda {
    // captured variables
    int y_;
    int& x_;
    // constructor - constexpr by default if possible
    constexpr UnnamedLambda(int y, int& x) : y_{y}, x_{x} {}
    // call operator - the mutable specifier removes the const, and the
    // constexpr specifier marks this function as constexpr (a bit redundant,
    // since it is constexpr whenever possible anyway)
    constexpr auto operator()(auto a, int b) -> int {
        return sizeof(a) * y_ + b * x_;
    }
};

It is possible to template a lambda starting from C++20, but it is not something that is very useful for competitive programming, so I will omit it for the sake of simplicity, since the syntax has quite a lot of parts already.

Now let's look at every part in a bit of detail.

The capture list

Firstly, global/static variables and global constants don't need to be captured (and can't be) — they are accessible without any captures and will always be usable inside a lambda like a normal variable. It is similar to being captured by reference by default for those variables.

Let's say we want to capture a variable y by value and x by reference (so in the struct, we store a copy of y and a reference to x). In that case, the capture list becomes y, &x or &x, y. We can do this for any number of variables in any order — &x, y, z, x, &y, &z, w etc.

Now let's say you don't want to manually capture all variables that you're going to be using.

Let's say you want to capture all variables by value. The capture list becomes =.

And if you want to capture all variables by reference, the capture list becomes &.

But you'd say, I don't want to be stuck using the same semantics for all my captured variables. C++ has a way of allowing you to specify variables for which the default semantics are not followed.

If you want to capture all variables by value, but g by reference since it is a large graph, your capture list becomes =, &g (the default should be at the beginning of the capture list).

If you want to do the opposite and capture all variables by reference, but capture g by value, your capture list becomes &, g.

Now let's say that rather than capturing a variable directly, you want to store a value computed by a function, and all other variables should be captured by reference. Then your capture list becomes &, var = func_call(/* as usual */). This can be done for multiple variables. The semantics are the same as if you had done auto var = func_call().

Similarly, if func_call returns a reference, you can also use &var = func_call(). The semantics are the same as if you had done auto& var = func_call().

Note: when capturing a variable by reference and using lambdas after the variable has been destroyed, you will end up with dangling references where accessing that variable in the lambda is undefined behaviour, just like with any other usage of dangling references — this is one of the reasons why I chose to capture by value instead of by reference when showing the 2 lambda example. Keep this in mind while using lambdas.

The parameter list

This is pretty much what you would use for any other function. In particular, there is also support for default parameters.

Also, it is noteworthy that the support for auto parameters was added to lambdas back in C++14, but was added for usual functions only in C++20. This shows how much simpler lambdas are for the language itself, when compared to functions. In the corresponding struct, this corresponds to a templated operator(), one for each instantiation that is needed.

So your lambdas can look like this (while compiling in C++14 as well):


bool f(int a, int b) { return a < b; }

int main() {
    int x = 1;
    int z = 0;
    auto lambda = [&x, y = f(1, z)](int a, auto b, int c = 1) {
        if (y) {
            x = c;
            return a + b;
        } else {
            return x - b + c;
        }
    };
}

The specifiers

For this section, it would be helpful to keep in mind our model of structs corresponding to lambdas, in order to understand how the specifier mutable helps, which we will explain below.

You might have noticed that we have not changed the captured value of any variable that we captured by value (the ones we capture by reference can mutate the variables they refer to, so a = b is possible for a being a reference captured variable). The reason is that with the correspondence, the call operator is const by default.

So, how would we mutate these variables? The answer is the usage of the mutable specifier — the same keyword that helps you mutate a variable in a class when you are calling a function that is marked const.

We're now able to do something like this:

auto alternate_even_odd = [i = 0]() mutable {
    i ^= 1;
    return i;
};

When mutable is specified in a lambda, the const appended to the operator() is removed. This in turn allows us to mutate all captured variables (in this context it means calling functions/operators/member functions on them which require a non-const reference to them).

There are other specifiers — constexpr (since C++17), consteval (since C++20) and static (since C++23), but they would not be very useful in a competitive programming context, so we omit them.

The trailing return type

Notice that the default syntax for a lambda doesn't have a return type, but the syntax for C++ functions has a return type.

In fact, you can have auto return type for C++ functions starting from C++14, and the type is deduced using some type deduction rules. By default, these rules apply for lambdas as well.

However, it is possible to have certain ambiguities/underspecification, like the following:

auto f = [](int64_t x) {
    if (x < 0) return 0;
    return x;
};

or

auto f = [](auto g, int x) {
    return g(g, x);
};

In the first, the compiler is unable to deduce whether the lambda's return type is supposed to be int or int64_t.

In the latter, since the type of g is unknown, it is also unknown what g returns.

In such cases, we must either cast the return values to a certain type, or specify the return type on our own, like the following:

auto f = [](int64_t x) -> int64_t {
    if (x < 0) return 0;
    return x;
};
auto f = [](auto g, int x) -> int {
    return g(g, x);
};

Using lambdas

Okay, now that we have finally shown what the lambda syntax can do in C++, we are yet to use them in any application.

In order to use them, we need to know how to do the following, since the type of the corresponding structs of two different lambdas can be different (and the types of two separately defined lambdas are indeed different in C++):

  • Defining lambdas
  • Passing lambdas
  • Storing lambdas

Defining lambdas

This is very easy, as we have already seen before:

auto f = [](){};

And a cool trick that I didn't mention earlier is that if your parameter list is empty, you can simply drop the ().

So the shortest lambda definition looks like this:

auto f = []{};

Passing lambdas

In C++20, you can do this:

void print_result(int x, auto f) {
    std::cout << f(x) << '\n';
}
void print_result_1(int x, auto& f) {
    std::cout << f(x) << '\n';
}
void print_result_2(int x, auto&& f) {
    std::cout << f(x) << '\n';
}

A small nitpick: the last one uses a forwarding reference, which binds by reference both to lvalues (informally, things that have an address) and to rvalues (like temporaries). For example, print_result_2(2, [](int x) { return 2 * x; }); and print_result_2(2, some_lambda); both work, but the first of these doesn't work with print_result_1, since a temporary can't bind to a non-const lvalue reference.

When passing a lambda to a lambda, the above syntax works even for something as old as C++14.

In older standards of C++, like C++11/14/17, you can do this (the & and && variants still work, but I omitted them for the sake of not having too much code):

template <typename Func>
void print_result(int x, Func f) {
    std::cout << f(x) << '\n';
} 

Storing lambdas

This is something that competitive programmers rarely do, but when they do, a lot of them add unnecessary overheads mostly due to lack of understanding of lambdas/templates.

Firstly I will show how people end up storing lambdas:

struct Storage {
    template <typename Func>
    Storage(Func f_) : f{f_} {}
    // or Storage(auto f_) : f{f_} {}
    // or Storage(std::function<int(int)> f_) : f{f_} {}
    std::function<int(int)> f;
};
Storage storage([](int x) { return x * 2; });

Other people use this very limited version (we'll get to why it is limited):

struct Storage {
    using F = int(*)(int);
    Storage(F f_) : f{f_} {}
    F f;
};
Storage storage([](int x) { return 2 * x; });
// Storage storage(+[](int x) { return 2 * x; }); also works, relies on decaying from a lambda to function pointer

The most natural way, however, is to do either this:

template <typename F>
struct Storage {
    Storage(F f_) : f{f_} {}
    F f;
};
Storage f([](int x) { return 2 * x; });

or

auto make_struct = [](auto f_) {
    struct Storage {
        std::remove_reference_t<decltype(f_)> f;
    };
    return Storage{f_};
};
auto s = make_struct([](int x) { return 2 * x; });

For the sake of convenience, let's name these v1, v2, v3 and v4.

v1 seems to be the most generic way, but has unnecessary overhead if you don't need something like a vector of lambdas. The reason it works is that std::function "erases" the type of the lambda (with a technique called type erasure), but to do that it generally needs heap allocation and an indirect call, and it is often unable to optimize the code in the lambda that interacts with the outside (for instance, recursion). Meanwhile, lambdas themselves are optimized as well as, if not better than, usual functions, since structs are zero-cost abstractions. I have seen spectacular slowdowns with v1 (50x sometimes) due to missed optimizations, and I'm of the opinion that unless necessary, type erasure should never be used in a context where performance matters. Of course, no kidding, type erasure is an engineering marvel that deserves its own blog, and so, by extension, does std::function.

v2 falls back to the concept of a function pointer, which exists since the old days when C was all the rage. It works decently fine in practice, but has a couple of issues:

  • It is impossible to cast a lambda that has a capture to a usual function. This should be clear when we think about how a lambda can be represented as a struct. The very fact that a lambda can be converted to a function pointer is a surprise if you know how lambdas are implemented.
  • Function pointers suffer from having to use one more indirection than necessary, due to the function call not being known "by the struct". Still better than std::function in terms of speed, though.

v3 solves all the above problems, in case you don't need to be able to store the container in a vector. The only drawback is that since it is a templated struct, it cannot be defined in local scope, much as we would like to. The reason is that in many implementations, symbols in a template must have external linkage, and this is simply impossible for something in a local scope. There is another minor drawback, which is an inconvenience at best — you must rely on CTAD (class template argument deduction) to make your code shorter. Otherwise you would have to do something like

auto f = [](int x) { return 2 * x; };
Storage<decltype(f)> s(f); // or std::remove_reference_t<decltype(f)> to remove the reference that comes attached with decltype on an lvalue

Coming to v4 — it does all the things that the other solutions do, and since it is a lambda, it doesn't suffer from the issues that templates do.

As it turns out, v4 is also the most "functional" implementation out of all of them, because you are literally just using functions and nothing object oriented is required.

Is the fact that v4 is the most convenient out of the above 4 just coincidental? I think not.

Anyway, partly since all my library code had been written many years ago, I use v3 for my competitive programming library code — for instance, my implementation that modifies and generalizes the AtCoder library code for lazy segment tree can be found here, starting at line 393. As an aside, it is one of the fastest implementations for that problem, and it used to be the fastest for a very long time until a couple of months ago, and is off from the fastest solution by a few ms. The current fastest solution also uses the v3 pattern for making things generic.

But for anything new, I like using v4 — it is functional and flexible.

The verdict:

  • Use v1 whenever you need type erasure but can't do without captures
  • Use v2 whenever you need type erasure but can do without captures
  • Use v3 whenever you don't need type erasure, but want a classical struct definition somewhere, and want all the performance
  • Use v4 whenever you don't want type erasure and want your code to be flexible and not depend on CTAD.

Using lambdas with the STL

As promised, here are some applications of lambdas with some of the STL algorithms/data structures.

Since STL algorithms are implemented mainly as functions or function-like objects, the main use of lambdas is as callbacks to the functions. We are able to pass lambdas as one of the parameters of the STL functions in question, and the lambda is called inside the body, which is why it is a callback.

It is also possible to use them in STL containers during construction.

For the precise functionality of the functions I mention below, I would recommend reading about them on cppreference.

Lambdas are used in multiple forms:

Comparators

A comparator is something that returns the result of comparison of its inputs. It is used in the following:

  • sorting: std::sort(l, r, comparator); where the comparator tells whether $$$a < b$$$ or not for inputs $$$a$$$ and $$$b$$$.
  • binary_search/lower_bound/upper_bound/heap operations/set intersection etc.: similar to sorting.
  • next_permutation: finds the next permutation according to an order provided by a predicate.
  • containers (map/set, unordered_map/unordered_set): for the constructor for ordered versions, we can pass a lambda that tells when $$$a < b$$$ for inputs $$$a$$$ and $$$b$$$, and for the unordered versions, we can pass a lambda that tells when $$$a = b$$$ for inputs $$$a$$$ and $$$b$$$.

Predicates

A predicate is something that decides whether some condition is true or not, based on its inputs. It is used in the following:

  • partition_point — returns the point $$$p$$$ in a range $$$[l, r)$$$ such that $$$f$$$ is true on $$$[l, p)$$$ and false on $$$[p, r)$$$.
  • filtering and filtering-based algorithms:
    • find_if/find_if_not — returns the first position where the predicate is true/false (also see find_last_if/find_last_if_not)
    • default_searcher, boyer_moore_searcher, boyer_moore_horspool_searcher — used in std::search to search for a pattern based on the predicate
    • copy_if/remove_copy_if — copies all elements on which the predicate evaluates to true/false
    • remove_if — similar, but shifts out the elements on which the predicate is true instead of copying
    • and others, which are too numerous to list in a blog post

Generators/transformers

  • std::generate expects a function object — usually a lambda — and assigns the result of successive calls to every element in the input range. To implement this, a solution using stateful lambdas is the following:
std::generate(l, r, [i = 0]() mutable { return i++; });
// the same as std::iota(l, r, 0);

There is a more idiomatic solution using coroutines, but coroutines in C++ allocate on the heap.

  • std::transform expects a function object and for each $$$x$$$ in a range, updates the corresponding position in another range with $$$f(x)$$$.

Hash functions

  • unordered maps/sets, boyer_moore_searcher, boyer_moore_horspool_searcher all expect a hashing function to be implemented (if not, they fall back to std::hash), and this should be passed during construction.

n-ary operators

  • std::accumulate and std::reduce expect you to have a binary operator (by default, std::plus) to be able to aggregate over a range
  • std::partial_sum does the same, but stores values for every prefix sum
  • the multitude of inclusive/exclusive_scans and their transform variants also do something similar.
  • std::inner_product allows you to pass two different binary operators (one being an analog for +, the other for *).
  • std::adjacent_difference does the opposite of std::partial_sum, but in a different order, and expects a function object to be called as the difference operator.

Some useful non-trivial patterns

Immediately invoked lambdas

Consider this situation: you have to initialize some data (let's say some result of precomputation), but you want to

  • keep it local, because global scope usually leads to potentially buggy code
  • avoid polluting your scope with unnecessary variables, in order to prevent bugs with the same variable

The concept of an immediately invoked lambda comes in here.

In a nutshell, the idea is that we would like to make a new nested scope for it, but a scope is not an expression, while a lambda call is.

So your solution would earlier look like this:

std::vector<int> a;
{
    // process and modify a
}

and it becomes this:

auto a = [&] {
    std::vector<int> a;
    // process and modify a
    return a;
}(); // notice these parentheses

There are a couple more benefits to this, arising out of the fact that we have now managed to convert a scope into an expression.

So we can do the following easily:

  • Complicated initialization can easily be done in the constructor like this (more useful for real life code than competitive programming):
struct A {
    A(size_t n, int val) :
        a{[&] {
            std::vector<int> v(n);
            std::fill(v.begin(), v.end(), val);
            return v;
        }()} {}
    std::vector<int> a;
};
  • Make variables with potentially complicated initialization const if mutation is not needed — this is somewhat of an extension of the previous point, and it often leads to optimized code, even if we ignore the number of errors we would save at compile time.

  • Precompute and assign at compile time — if we have an array to precompute, we can do it at compile time globally like this:

auto precomputed_array = [] {
    std::array<int, 10000> a{};
    // preprocess
    return a;
}();

Since lambdas are constexpr by default, if everything inside the lambda is constexpr-friendly, the compiler will try to perform this lambda call at compile time, which will in general be faster than computing stuff at runtime, and the compiler can also reason about the data that it has generated at compile time, so it even makes the rest of the program faster.

One small drawback (for Codeforces) is that if there are too many operations in the lambda, you will risk getting a compile time error (telling you to increase your compiler's constexpr operations limit — for which I have not yet been able to find a way from within the source code).

The compiler is doing the best it can (and the assembly for constexpr-ed code will be much faster than the assembly for non-constexpr-ed code, simply because what is left to run is the bare minimum possible), and for the compiler, there is no limit on computational resources other than the ones imposed by the machine. Unfortunately, this doesn't always work out for practical purposes like contests, where we run into compile-time limits and system limitations (including limits built into the compiler, which can be changed, but not from within the source code).

One hacky workaround is:

auto something = [] {
    int a; // note that it is not initialized, only declared, so it is not constexpr friendly. so the compiler will not constexpr it. it is definitely a code smell, but what can we do?
    // rest of the function
}();

This is a workaround that I have been using for a long time now, as a way to prevent things from getting constexpr-ed at compile time. Don't do this in real-life code though.

Recursive lambdas

We can implement recursion in a Turing machine, but can we do so with lambdas? Recall that the Y-combinator is a way to do precisely this in the untyped lambda calculus.

We can write a somewhat similar analog for it in C++ lambdas too, using the C++14 feature that allows us to have auto arguments.

auto fib = [](auto self, int n) -> int {
    if (n <= 1) return n;
    return self(self, n - 1) + self(self, n - 2);
};
std::cout << fib(fib, 10) << '\n';

(If you want to be more efficient in the general case, consider using auto&& instead of auto).

Let's look at how we came up with this implementation.

We don't have a way to refer to the unnamed lambda inside itself with C++ syntax (nor with lambda calculus syntax).

So what is the next best thing we can do? Yes, just make a dummy parameter that will (whenever it should be used) be itself, and remember to call the defined lambda with itself whenever it needs to be used. This is precisely what the above code does.

As a fun exercise, let's try writing a function that takes a lambda, which has 2 arguments — one referring to itself, and the other being what it should call (and will eventually become itself). Our target would be to do something like the following:

auto lambda = [](auto self, int n) {
    // something
    // the recursive call is of the form self(n - 1)
};
auto rec_lambda = add_recursion(lambda);
std::cout << rec_lambda(10) << '\n';

We will ignore all considerations of performance for now, and make copies of everything, for ease of presentation.

Let's make a function object that provides an overload for operator(), and abstracts out the self(self, ...) calls into just self2(...) calls — then we can use self2 as the self parameter in the original lambda's definition, and the double recursion (which should ideally be optimized away) will allow us to do what we want.

template <typename F>
struct add_recursion_t {
    F f;
    template <typename X>
    auto operator()(X x) {
        return f(*this, x);
    }
};
template <typename F>
auto add_recursion(F f) { // could have as well been implemented as a lambda
    return add_recursion_t{f};
}

This is also what the std::y_combinator proposal does, but with better optimizations that sadly render the code less than readable. That proposal has been made redundant by the C++23 feature called "Deducing this" (which also solves a lot of other problems, not limited to lambdas).

Deducing this lets us refer to the lambda itself inside its own body, so we no longer need to write self so many times, and can write

auto fib = [](this auto&& self, int n) { // notice that the -> int is gone
    if (n <= 1) return n;
    return self(n - 1) + self(n - 2);
};
std::cout << fib(10) << '\n';

You can test this code out on the latest Clang release.

But can we do this in just lambdas? The answer is

Yes and no

Stateful lambdas

Have you ever wanted a Python generator in your code? Turns out that with mutable state, you can implement stateful lambdas that work in a similar manner.

It should be abundantly clear with the std::generate example we used a while back that stateful lambdas are nothing but objects with a single member function accessible via the call operator, and some mutable state represented by the data captured by them. It is helpful to remember the struct-to-lambda correspondence here.

For instance, let's say you want to print something that looks like a tuple, and std::apply works on it.

How would you print its space separated members?

std::apply([i = 0ULL] (const auto&... x) mutable -> void {
    // both branches of ?: must have the same type, so the "no space" branch is just std::cout
    (((i++ ? std::cout << ' ' : std::cout) << x), ...);
}, tuple_object);

Here we have made abundant use of fold expressions, but the main point is that after expansion, this mutates the state of i inside the body each time there is an i++, and a space is printed before every element except the first element.

Similarly, you can write a random number generator as follows (taking the example from here):

auto next_random = [
    splitmix64 = [](uint64_t x) {
        x += 0x9e3779b97f4a7c15;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
        x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
        return x ^ (x >> 31);
    },
    FIXED_RANDOM = chrono::steady_clock::now().time_since_epoch().count(),
    random = 0
] () mutable {
    return random = splitmix64(random + FIXED_RANDOM);
};

while (n--) std::cout << next_random() << '\n';

If you want to, it is also possible to write all of your code in a single immediately-invoked lambda, but that is something I discourage you from writing, unless you want to obfuscate code or just want to challenge your brain.

For example, this code for the latest Div2A works, but I definitely do NOT recommend that you write code this way. However, it is quite instructive to understand why this code works and why certain variations do not, so I definitely DO recommend reading and understanding this — it'll help cement your understanding of how captures work, what is called when, and so on.

#include "bits/stdc++.h"

int main() {
    [rd = [] {
         int x; std::cin >> x;
         return x;
     },
     wt = [](auto x) { return (std::cout << x, 0); },
     sorted = [](auto x) {
         std::sort(x.begin(), x.end());
         return x;
     },
     forn = [](auto f, int n) {
         while (n--) f();
         return 0;
     }] {
        [&, _ = forn(
                [&]() {
                    auto n = rd();
                    auto k = rd();
                    [&, a = std::vector<int>(n)]() mutable {
                        [&, _ = std::generate_n(a.begin(), a.size(), rd)]() {
                            [&, ans = (k == 1 && a != sorted(a) ? "NO\n" : "YES\n")]() {
                                [_ = wt(ans)]() {
                                };
                            }();
                        }();
                    }();
                },
                rd())]() {
        }();
    }();
}

Some other interesting patterns

Using lambdas for implementing classes

Note that we claimed that lambda calculus is Turing complete. But can it implement classes? The answer is of course yes.

Can C++ lambdas implement classes? Also yes.

Let's let go of the implementation of lambdas in C++ for now, and think of implementing objects in terms of functions.

For implementing classes, we need data and member functions. For data, we can just store the data as a capture in a lambda. But what does the lambda do? We should be able to call something using the captured data, so the lambda should take as a parameter whatever the implementation of the member function should be.

So, once we have an object, we can call the data with the function, instead of calling the function on the data.

// constructor
auto range = [](size_t l, size_t r) {
    return [l, r](auto f) {
        return f(l, r);
    };
};
// (free) member functions
auto len   = [](size_t x, size_t y) { return y - x; };
auto right = [](size_t x, size_t y) { return y; };
auto left  = [](size_t x, size_t y) { return x; };
// usage
auto a = range(1, 5); // this is an object
auto len_a = a(len);  // compare with a.len() in usual C++ syntax

And if we want to support recursive functions, we can change the definitions a tiny bit.

auto range = [](size_t l, size_t r) {
    return [l, r](auto f) {
        return f(f, l, r);
    };
};
auto len   = [](auto self, size_t x, size_t y) { return y - x; };
auto right = [](auto self, size_t x, size_t y) { return y; };
auto left  = [](auto self, size_t x, size_t y) { return x; };
auto depth = [](auto self, size_t x, size_t y) {
    if (y - x <= 1) return 0;
    return 1 + self(self, x, x + (y - x) / 2);
};

auto a = range(1, 5);
auto len_a = a(len);

Mutual recursion can also be implemented similarly, but would require much more effort. Do let me know if you find some way to implement mutual recursion in a clean manner!

Inheriting from lambdas

template <typename... T>
struct S : T... {
    template <typename... L>
    S(L&&... l) : T(std::forward<L>(l))... {}
    using T::operator()...;
};
template <typename... L>
S(L&&...) -> S<std::decay_t<L>...>; // deduction guide, needed since the constructor alone cannot deduce T...

This can be used to implement a somewhat generic lambda by providing different implementations like this:

S generic_lambda = {
    [](int x) { return 1; },
    [](double d) { return 0.0; }
};
std::cout << generic_lambda(0) << '\n';
std::cout << generic_lambda(1.0) << '\n';

But we already have generic lambdas in C++20, and they're pretty interesting in themselves.


Examples of competitive programming code using C++ lambdas

Finally, let's come to some useful code patterns involving lambdas that I use (and used to use) while solving problems.

As a local function

This seems to be the most popular use case for lambdas, and for good reason — you can avoid code duplication by literally making a code block executable (the & capture does wonders here). It ends up being particularly useful in the following scenarios:

  • When you have to implement a function for a certain query, but it will be used multiple times, perhaps with slightly different behaviours.
  • When you are implementing a data structure inline — i.e., when you don't have library code at hand, and want to avoid having to declare a struct for, say, a segment tree or a Fenwick tree or a DSU.
  • When you want to avoid rewriting annoying bits of code like when you are writing code for an interactive problem, where the input/output format is complex (even for a very basic format, you need to print, flush, read and return, which deserves a function of its own).

For example, although the intent is not very clear in this example (unless we try solving the problem itself), it is clear that there would have been a lot of code duplication without the use of a lambda — just count the number of calls to the lambda.

DFS and other applications of recursive lambdas

Most people implement DFS as a function, and to do that, they make their graphs global too.

What happens when we want to do a multi-test problem? Clear graphs each time? Of course that has the potential to be buggy.

So what some other people do is to make graphs local, but pass them as a parameter. Similarly, if they want to update any arrays/vectors/integer, like visited/color/component/n_components, they also pass them as a parameter.

Why is this bad? If you accidentally swap two same-typed arguments at a call site, you get a bug that is not at all obvious. And it increases your mental load, because now you have to be careful about every function call. So if you are implementing a particularly involved DFS, you will end up more prone to bugs than you bargained for.

Some smart people realized this and started using local lambdas for DFS, but they used the following trick (which is still not the best trick, I'll explain why in a moment):

std::function<int(int, int)> dfs = [&](int u, int p) {
    int sum = 1;
    for (auto v : g[u])
        if (v != p) sum += dfs(v, u);
    return sum;
};
int sz = dfs(0, -1);

Why it is not the best approach:

  • Typing the types twice is too much work
  • Type erasure (yet again) makes recursive functions quite slow. It might not be visible for DFS, but once you start using it for something like divide and conquer or recursive DP that needs of the order of $$$10^7-10^8$$$ recursive calls, the performance hit would almost invariably make your solution too slow to get AC.
  • No support for default parameters — some might see this as a good thing, but the point is that it is abstracting away too much detail from the lambda.

Here's what I use:

auto dfs = [&](auto self, int u, int p = -1) -> int { // can also be auto&& if you want to be careful
    int sum = 1;
    for (auto v : g[u])
        if (v != p) sum += self(self, v, u);
    return sum;
};
int sz = dfs(dfs, 0);

Some very high rated programmers use this pattern too, for example, this submission, and for good reason — it pretty much always compiles down to the most optimal code (according to the compiler). The only case I know of where this was slower than a usual implementation was because the compiler was vectorizing the DFS(!), which led to instruction cache misses. That was a very peculiar case, and I don't expect to encounter it again in the wild, especially not in a problem with a tight TL.

Others use the idea outlined in the now-redundant std::y_combinator proposal, and do something like this:

// template code
template <class Fun>
class y_combinator_result {
    Fun fun_;
  public:
    template<class T>
    explicit y_combinator_result(T &&fun): fun_(std::forward<T>(fun)) {}    
    template<class ...Args>
    decltype(auto) operator()(Args &&...args) {
        return fun_(std::ref(*this), std::forward<Args>(args)...);
    }
};
template<class Fun>
decltype(auto) y_combinator(Fun &&fun) {
    return y_combinator_result<std::decay_t<Fun>>(std::forward<Fun>(fun));
}
// actual code
auto dfs = y_combinator([&](auto self, int u, int p = -1) -> int {
    int sum = 1;
    for (auto v : g[u])
        if (v != p) sum += self(v, u);
    return sum;
});
int sz = dfs(0);

For instance, this submission.

In fact, this is such a popular pattern that some high-rated people put it on top of their bare minimum templates as well, for instance, here.

The one drawback for this template is that sometimes the compiler is not able to reason through all the abstraction, and the code generated might be slightly slower. This is fortunately not the case for most cases, but there is a visible overhead difference in a few cases I have seen.

Interfacing library code

Library code should ideally be hidden away in a header file, but since most online judges don't have header file support, we are forced to write library code before any of the actual code begins.

It is also important to ensure that after copy-pasting library code into your submission, there is absolutely no change you need to make in the library code, unless it is a fundamental difference from the logic of the data structure/algorithm and you just happen to have all the code at hand. Otherwise, you would be editing a large piece of code whose details/quirks you might have forgotten, and it is easy to end up with bugs.

To do this, it is quite important to implement your library in a way that is as generic as possible. And there are few things more generic than allowing your data structure to execute arbitrary code.

Hence, it makes for a really good choice to use a lambda as a customization point. In fact, the C++ standard calls certain types of function objects "customization points" which goes on to show that interfacing into library code with functions/function objects is a well-understood/well-adopted design practice that is also very convenient for library usage.

As an example of such a thing, my implementation that modifies and generalizes the AtCoder library code for lazy segment tree can be found here, and it uses lambdas as possible customization points. It not only implements lazy segment trees (with some opt-in optimizations), but it also has functionality for normal segment tree, with barely any overhead compared to a usual segment tree implementation.

Algorithms with custom predicate/aggregation function

This is in fact one of the most common applications. There are countless lines of code among codeforces submissions that essentially sorts an array according to a given predicate: for instance, this code.

There are also a very large number of binary search submissions that follow the template "define a predicate, do binary search on it" — for instance, this code. Note that the manually written binary search could have been avoided by using std::partition_point as well, but it is not very popular for some reason.

Prefix and suffix sums (usual addition or multiplication or xor) can be implemented using std::partial_sum using lambdas as well.


Please let me know if there is anything I missed out, and constructive comments on the content in general are welcome!

Full text and comments »

  • Vote: I like it
  • +407
  • Vote: I do not like it

By nor, 5 months ago, In English

This blog is not about the binary GCD algorithm, but rather about exploring theoretical guarantees for an optimization on top of the standard Euclidean algorithm. The STL version is faster than both of these algorithms because it uses binary GCD, but the two algorithms here are of interest from a theoretical standpoint.

In particular, my conjecture is that the second algorithm takes at most as many iterations as the first one, and if true, it'd be a pretty surprising claim, given how hard it is to bound Euclidean algorithm runtimes. So, it would be really cool if someone could prove this property.

UPD: thanks VArtem for affirming that this is the case and pointing to some relevant references.

The usual gcd algorithm looks something like this:

auto gcd = []<std::integral T>(T a, T b) -> T {
    while (b) {
        a = std::exchange(b, a % b);
    }
    return std::abs(a);
};

Consider the following implementation, that makes the second argument greedily smaller.

auto gcd2 = []<std::integral T>(T a, T b) -> T {
    a = std::abs(a);
    b = std::abs(b);
    while (b) {
        T r = a % b;
        if (r > (b >> 1)) r = b - r;
        a = std::exchange(b, r); // a = std::exchange(b, std::min(a % b, b - a % b)); also works fine
    }
    return a;
};

On random data, gcd2 seems to be around 30% faster than gcd on my machine, for int32_t and int64_t arguments.

And it doesn't seem immediately obvious, but on all inputs I brute-forced, gcd2 always takes at most as many iterations as gcd. A proof is outlined in the comments.

Here's the benchmarking code I used:

Benchmarking code

And the timing results I get are:

190237720
gcd_32: 14054 ms
190237720
gcd2_32: 11064 ms
13318666
gcd_64: 47618 ms
13318666
gcd2_64: 36981 ms

When I compare this new algorithm with std::gcd on similar data, there seems to be barely any slowdown with 32 bit ints, but the % operator makes it much slower on 64 bit ints.

Do let me know what you think about it in the comments, and a proof/counterexample would be appreciated!

Full text and comments »

  • Vote: I like it
  • +88
  • Vote: I do not like it

By nor, 5 months ago, In English

Note: for those who don't like using the post-increment operator (i++), hopefully this blog convinces you that it is not just a convention that C programmers coaxed the world into following out of tradition. Also, all of the following discusses the increment operators in C++, and not C, where the semantics are slightly different.

Disclaimer: I use ++i much more often in code. But i++ has its own place, and I use it — and the generalization I mention — quite frequently wherever it is a sane choice.

Is ++i better than i++?

A lot of people are taught that since i++ also returns the old value of i before the increment, it needs a temporary, and hence it must be slower. Good news: almost all compilers today optimize this part away whenever the old value is not needed, so for built-in types, the difference between i++ and ++i doesn't matter.

So what's the difference? In C++, turns out that apart from this old/new value difference, there is another difference.

Any expression in C++ that looks like a op= b where op is a binary operator in fact "returns" a reference to the variable a after completing the operation. As a result of this, you can do things like this:

// replaces x by a * x % p
void mod_mul(int64_t& x, int a, int p) {
  (x *= a) %= p;
}

Turns out that since you only need the value after the increment for the pre-increment operator, it is possible to have similar behaviour there too, and that is precisely what C++ chose to do:

// replaces x by (x + 1) % p
void mod_inc(int64_t& x, int p) {
  ++x %= p;
}

However, the post-increment doesn't enjoy any of these properties. Perhaps this is why the Google C++ Style Guide (which I don't really like all that much, but will just quote here) recommends

Use the prefix form (++i) of the increment and decrement operators unless you need postfix semantics.

Another part of the language where you can see this distinction in practice is when you overload pre-increment and post-increment operators for a class:

struct Int {
  int value{};
  Int& operator++() {
    std::cout << "pre-incrementing\n";
    ++value; return *this;
  }
  Int operator++(int) {
    std::cout << "post-incrementing\n";
    Int temp = *this;
    ++value;
    return temp;
  }
};

If you look closely, the return type of the pre-increment overload is a reference, which is not the case for the post-increment overload. Post-increment also makes a copy, so if you're using post-increment on a struct/class object that defines both, there is likely a potential performance improvement lying right there.

So, is there any legitimate use-case for post-increment?

Indeed, there is. Notice how in the post-increment code in the previous example, we had to store a temporary. This gives us a hint as to where exactly post-increment is useful: when you need to avoid temporary variables! And one of the places where temporary variables are not needed is when an operation (in this case, increment) must be done after any use of a value.

An example to illustrate this point comes when you want to implement copying of a C-style, null-terminated string (without copying the null terminator so that you're able to, let's say, concatenate strings into a buffer). Let's say we didn't have post-increment in the language at all. The algorithm looks like this:

void copy_c_string(const char* input, char* output) {
  while (*input) {
    *output = *input;
    ++input;
    ++output;
  }
}

The first thing one would say is that there are too many lines of code — you need to check input, need to separately handle input and output pointers, and if someone is refactoring this code and accidentally deletes one of the increments, they will introduce a bug that is hard to find in a large codebase.

Contrast that with the following:

void copy_c_string(const char* input, char* output) {
  while (*input) {
    *output++ = *input++;
  }
}

The intent of moving the pointer forward immediately after every use (and using it exactly once, too) is perfectly conveyed, we don't have too many lines of code, and all is well. Of course, it takes some time getting used to writing/reading this kind of code, but thinking of it in terms of "immediate operation after use" shows that it is to separate increment statements what RAII is to manual destructor calls.

In fact, the usage of post-increment and pre-decrement (not pre-increment) was (and is) so popular (partly due to the widespread use of half-open ranges) that certain architectures in the past had special functionality for them (for example, this). Thus, in some rare cases, it is imperative to use post-increment instead of pre-increment for performance.

Do we have an op= style (compound assignment) generalization for post-increment?

Yes, we do. It is called std::exchange.

For the C++ experts

What it does is this — std::exchange(a, b) returns the old value of a after setting it to b. So i++ is the same as std::exchange(i, i + 1) in terms of semantics.

Similarly, a = std::exchange(b, a) swaps the values of a and b, which shows that it is more general than std::swap, though maybe at a small performance hit for types more complicated than a simple integer. Adding a bunch of std::moves should fix that to some extent, though.

Why the strange name? How do I remember it?

I consider this to be a naming disaster, and believe that if there were a better succinct name that does NOT imply bidirectional flow of data, the C++ committee would probably have used it.

The way I remember it is a = std::exchange(b, f(a, b)) is (morally) equivalent to std::tie(a, b) = std::make_pair(b, f(a, b)) — that is, the assignments are done at the same time. So somewhat like storing temporaries for everything that is to be assigned, and then performing the assignments at the same time.

A more general example is a = g(a, std::exchange(b, f(a, b))) which does std::tie(a, b) = std::make_pair(g(a, b), f(a, b)) — in a mathematical sense, it allows you to do two parallel assignments in a linear chain-like syntax. I believe this is what brings about the exchange naming (if not the CMPXCHG instruction).

A more intuitive way to read A = std::exchange(B, C) is that A becomes B and B becomes C at the same time.

Some competitive programming relevant examples

Fibonacci/linear recurrences

Writing a = std::exchange(b, a + b) is much easier and less error-prone than auto tmp = a; a = b; b = tmp + b; (in fact, someone pointed out that the initial version of this code was wrong). The same goes for any other linear recurrence in 2 variables.

Iterative GCD and extended GCD

Since the Euclidean GCD algorithm is also an affine transformation (especially one where you swap after update), the implementation goes from the less readable/intuitive

std::array<T, 3> extgcd(T a, T b) {
    std::array x{T{1}, T{0}};
    while (b) {
        std::swap(x[0] -= a / b * x[1], x[1]);   // x[0] = x[1], x[1] = x[0] - a/b * x[1]
        std::swap(a %= b, b);                    // a = b, b = a % b
    }
    return {x[0], x[1], a};
}

to

std::array<T, 3> extgcd(T a, T b) {
    std::array x{T{1}, T{0}};
    while (b) {
        x[0] = std::exchange(x[1], x[0] - a / b * x[1]);
        a = std::exchange(b, a % b);
    }
    return {x[0], x[1], a};
}
Lazy updates in buffers (use-and-clear)

It is usually error-prone to clear a buffer of, let's say, updates that you want to process, in the very end of the processing function, as follows:

void flush() {
  for (auto x : updates) {
    // do something
  }
  updates.clear(); // easy to forget
}

Sometimes this might lead to a MLE as well, if this flush function is in fact a DFS that is called in a loop after processing the updates, where the graph is built during the function itself, and the memory limit is low or the structure you store inside the buffer has quite a lot of information.

To be able to deal with the first of these issues, people used something of the following sort:

void flush() {
  std::vector<Update> updates_tmp;
  using std::swap; swap(updates_tmp, updates);
  for (auto x : updates_tmp) {
    // do something
  }
}

This faces multiple issues

  • You need to create and assign a temporary, and also need to remember the type of the updates variable (can you see where we are going with this?)
  • The temporary exists till the end of the function, so this doesn't fix the second issue.
  • Initializing/declaring and then modifying is a very non-functional (anti-)pattern, and making it a habit in code will almost invariably lead to state-mutation-related bugs.

Consider the following solution using std::exchange:

void flush() {
  for (auto x : std::exchange(updates, {})) {
    // do something
  }
}

Which of these two implementations mirrors the intent of the original function? And which one would you choose to write yourself?

Some software engineering relevant examples

Implementing move constructors and assignment operators

While defining move semantics for your class, it is usually a good idea to keep the moved-from object in defined state. Sometimes this might even be necessary, for instance, when you want to implement something with semantics like std::unique_ptr.

Avoiding redundancy

It is possible to avoid having to implement post-increment when pre-increment is already available, by just doing return std::exchange(*this, ++Type{*this});

Acrobatics with locks

The "immediately do after use" idea also carries over to when you're using locks — it is clearly optimal to use locks for as short a time as possible, when comparing across the same kinds of software architecture. One way that std::exchange can be used to this effect is to use a scoped lock inside a lambda, that returns std::exchange(resource, resetted_resource).

In fact, there is a common primitive in lock-free programming that uses the exchange idiom (compare-and-swap, test-and-set, std::atomic_exchange).

Generic programming with iterators

Sometimes the only way to increment an iterator is std::next. Using std::exchange, we can implement our own version of post-increment to be able to write generic code using iterators. Since iterators are very general, you might find yourself using this trick in very unexpected places too.


Please let me know what you think about this C++ feature and if there are any mistakes in the above blog. Comments are welcome!

Full text and comments »

  • Vote: I like it
  • +235
  • Vote: I do not like it

By nor, 6 months ago, In English

I believe that adding the Boost library on Codeforces would be a great addition for C++ competitive programmers. AtCoder already supports it (alongside GMP and Eigen, but Boost has replacements/wrappers for those: Boost.Multiprecision and Boost.uBLAS). CodeChef supports it too (or at least used to support it at some point, not sure now). I've seen at least 3 posts by people trying to use Boost on Codeforces and failing, and on thinking about it, I couldn't really come up with a good reason (in my opinion) that Boost should not be supported on Codeforces.

Some might argue against adding a library on Codeforces, but for them, I recommend viewing this blog as an educational resource on some of the library's features that I find cool.

There are a lot of things in the STL in the newer C++ standards, whose implementations were originally inspired from Boost, so you could be able to use some features that might not be available until the next few C++ standards.

As of now, the latest version of Boost is 1.83.0, and its documentation can be found here.

Here are a few features from Boost that are pretty useful for competitive programming:

  1. Algorithms from Boost.Algorithms — this includes algorithms like KMP and binary exponentiation (power for arbitrary associative operations) as well. It also makes for much cleaner code and with significantly less effort. It also includes some "mini" algorithms that are annoying to write, that one might use in some technical contexts (for example, gather is an algorithm like this, and gets used a lot in divide and conquer problems). There are some algorithms that ended up in the STL, like partition_point for doing binary search.

  2. Containers from Boost.Container — this includes a multitude of containers that are not in the C++ standard yet. My personal favourites among these containers are the flat_map and flat_set containers, which are much better implementations for the corresponding associative data structures than the STL ones, and small_vector, which does similar optimizations as the basic_string STL container, but has less stringent conditions on the type of things it can contain.

  3. Better bitsets using the dynamic_bitset library — there was a recent blog about dynamic bitsets in Python, but I mentioned a way of getting there using a hack with C++ STL bitsets. However, dynamic bitsets are supported in Boost. In fact, the Boost implementation also exposes functions like find_first, find_next, range set, range flip and so on.

  4. Graph algorithms from the Boost Graph Library — it contains practically every useful graph algorithm that we encounter in competitive programming.

  5. Different kinds of heaps with Boost.Heap — this contains multiple types of heaps with a lot of useful specialized functions. Ever wanted to try out a Fibonacci heap? Or just any heap with decrease-min or deletion operations? This library will cater to all your needs.

  6. Interval handling using Boost.Icl — often we want to handle ranges of integers in competitive programming problems. This library gives us ready-made data structures to solve such problems, without having to think about corner cases that arise with interval sets. This solution is a prime example that would be much easier to implement with this library.

  7. Intrusive containers using Boost.Intrusive — there are times when we want to implement a list manually, with some additional information inside the node. For example, there is an implementation of adjacency lists that is quite popular in China, which makes an array of nodes, and inside those nodes, it stores pointers/indices to the next/previous edges. Doing that using an std::list would be quite painful, and there is a reasonable compromise using intrusive lists. In general, intrusive containers tend to be faster, so if you like constant optimization, this library is pretty nice to know about.

  8. Boost.Math — this contains a ton of useful math features. Unfortunately, the polynomial library they provide doesn't have fast operations (polynomial multiplication is quadratic for example), but the rest of the functionality is pretty good. It has things like continued fractions, root finding and quaternions, all of which I have seen in competitive programming contests.

  9. BigInt using Boost.Multiprecision — self-explanatory.

  10. Linear algebra using Boost uBLAS — this is a linear algebra library that has pretty much all the functionality one would ever need in a competitive programming problem. However, this library is quite old, and it is usually recommended to use other libraries just for that reason (though I don't find this to be off-putting). The one thing you need to do to squeeze out performance is to define the macro BOOST_UBLAS_NDEBUG.

Some other libraries that are not directly applicable, but I really like

Hopefully this was enough to convince some people that Boost is quite a handy library, that might be worth putting effort into learning, and that it convinces MikeMirzayanov to add it on Codeforces.


By nor, 7 months ago, In English

Disclaimer: Please verify that whichever of the following solutions you choose to use indeed increases the stack limit for the execution of the solution, since different OSes/versions can behave differently.

On Linux (and probably MacOS), running ulimit -s unlimited in every instance of the shell you use to execute your program will work just fine (assuming you don't have a hard limit). On some compilers, passing -Wl,--stack=268435456 to the compiler options should give you a 256MB stack.

But a much easier way from my comment a couple of years ago, originally from here, is to do this in your Meta Hacker Cup template:

#include <bits/stdc++.h>
using namespace std;
void main_() {
    // implement your solution here
}
static void run_with_stack_size(void (*func)(void), size_t stsize) {
    char *stack, *send;
    stack = (char *)malloc(stsize);
    send = stack + stsize - 16;
    send = (char *)((uintptr_t)send / 16 * 16); // align the new stack top to 16 bytes
    asm volatile(
        "mov %%rsp, (%0)\n" // save the old stack pointer at the top of the new stack
        "mov %0, %%rsp\n"   // switch the stack pointer to the new stack
        :
        : "r"(send));
    func();
    asm volatile("mov (%0), %%rsp\n" : : "r"(send)); // restore the old stack pointer
    free(stack);
}
int main() {
    run_with_stack_size(main_, 1024 * 1024 * 1024); // run with a 1 GiB stack
    return 0;
}

It apparently works on Windows as well, as a commenter confirmed, and I believe it should also work on MacOS.

UPD: To use this on 32 bit systems/compilers, replacing %rsp with %esp works. I haven't tried changing this for ARM, so any solutions which do the same thing for ARM would be appreciated.


By nor, history, 9 months ago, In English

I am a logic gate. Ask me anything.


By nor, 10 months ago, In English

TL;DR

  • The Floyd-Warshall algorithm is a special case of the more general problem of aggregation over paths in a graph — in this blog, we look at a few examples and point to some references for further study.
  • The algebraic path problem constitutes a family of problems that can be solved using similar techniques.
  • This and this paper develop the theory of semirings in this context, and this treats some computational aspects of it.
  • Kleene algebras are what end up being used in most of the examples given here, but semirings are more general structures.
  • While not discussed in the blog post, the idea of a valuation algebra is a further generalization that shows connections to more problems of various types.


By nor, 14 months ago, In English

It's been about 5 years since the round took place, and since there was no editorial all this time, I decided to write one myself.

Thanks to tfg for discussing problem F with me and coming up with the permutation interpretation that makes it much easier to reason about the problem. The idea in F was also inspired by other submissions.

1060A - Phone Numbers

Hints/intuition
Solution

Submission: 196688903

1060B - Maximum Sum of Digits

Hints/intuition
Solution

Submission: 196689846

1060C - Maximum Subrectangle

Hints/intuition
Solution

Submission: 196755317

1060D - Social Circles

Hints/intuition
Solution

Submission: 196760424

1060E - Sergey and Subway

Hints/intuition
Solution

Submission: 196767377

1060F - Shrinking Tree

Hints/intuition
Solution

Submission: 196942805

1060G - Balls and Pockets

Hints/intuition
Solution

Submission: 196921763

1060H - Sophisticated Device

Hints/intuition
Solution

Submission: 196822812


By nor, 14 months ago, In English

There have been a few instances on CF which suggest that quite a few people aren't very comfortable with floor and ceiling functions (and inequalities in general) — for instance, here and here. This inspired me to write a (hopefully short) blog on how to work with these topics systematically.

Note that I will not define what a real number is, or do anything that seems too non-elementary. Also, for the foundational stuff, I will rely on some correct intuition rather than rigour to build basic ideas, and the systematic logically correct way of reasoning will be used later. The idea behind this blog is to just help you understand how to reason about the topics we'll cover.

Recap: basic inequalities

Before we jump ahead to floors and ceilings, we need to talk about inequalities. I will assume that you know something about the number line. Let's recap some informal definitions:

  • We say that $$$a < b$$$ if and only if the number $$$a$$$ comes before the number $$$b$$$ on the number line. Here number refers to real numbers.
  • We say that $$$a = b$$$ if and only if $$$a$$$ and $$$b$$$ coincide on the number line.
  • We say that $$$a > b$$$ if and only if $$$b < a$$$, i.e., $$$a$$$ comes after $$$b$$$ on the number line.
  • We say that $$$a \le b$$$ if and only if one of $$$a < b$$$ or $$$a = b$$$ holds.
  • We say that $$$a \ge b$$$ if and only if one of $$$a > b$$$ or $$$a = b$$$ holds.

For the sake of brevity, we will abbreviate "if and only if" to iff. We will often use $$$P \iff Q$$$ in math notation to indicate that $$$P$$$ is true iff $$$Q$$$ is true. For a unidirectional implication of the form "if $$$P$$$ is true, then $$$Q$$$ is also true", we will write $$$P \implies Q$$$, and read it as $$$P$$$ implies $$$Q$$$.

Here's a fact that people often use intuitively, called the Law of Trichotomy:

  • Exactly one of the following relations is true: $$$a < b$$$, $$$a = b$$$, $$$a > b$$$.

This is useful a lot of times when you try to break things into cases, and try to look for contradictions.

A corollary of this is the following definition of the relations in terms of $$$<$$$ and negation (you can use any relation other than $$$=$$$ to derive all the other relations):

  • $$$a = b$$$ iff $$$a \not < b$$$ and $$$b \not < a$$$.
  • $$$a > b$$$ iff $$$b < a$$$.
  • $$$a \le b$$$ iff $$$b \not < a$$$.
  • $$$a \ge b$$$ iff $$$a \not < b$$$.

Now note that if we have some results about $$$<$$$, using the above relations, we can come up with relations corresponding to these operations too.

We can also write these relations in terms of $$$\le$$$ — I leave this as an exercise for you to confirm that you've understood the logic behind these.

I recommend understanding why these make sense, so that you can reason about directions and inequalities more easily.

Chaining these relations

Let's say $$$a R b$$$ and $$$b R c$$$ are true, where $$$R$$$ is any one of $$$<, >, =, \le, \ge$$$. Then it is easy to show that $$$a R c$$$ is also true. This allows us to chain inequalities and write things like $$$a < b < c$$$, remembering that as we read the expression from left to right, the terms only increase.

This property is called transitivity, and is well-studied in other contexts too. But for our purposes, it just makes our life simpler.

At this point, it is very important to note two more things:

  • If $$$a < b$$$ and $$$b > c$$$, we can't say anything about the order of $$$a$$$ and $$$c$$$. This is a rookie mistake that happens a lot.
  • $$$<$$$ is a stronger statement than $$$\le$$$. In particular, if we have something like $$$a < b$$$ and $$$b \le c$$$, then it is impossible that $$$a = c$$$. So we can in fact say that $$$a < c$$$, rather than the weaker statement $$$a \le c$$$. Mathematically, you can show this by reasoning as follows: $$$b \le c$$$ means that either $$$b < c$$$ or $$$b = c$$$. In the first case, the chaining is trivially true. In the second case, since $$$b = c$$$, the positions of $$$b$$$ and $$$c$$$ on the number line coincide, so the result is again true. (For the pedantic reader: the second part of this argument is referring to the fact that $$$<$$$ is a well-defined relation — in terms of respecting equality).

How arithmetic operations affect inequalities

This is what many people don't internalize very well, or are simply not exposed to. At the risk of being too verbose, I will explain some interactions of arithmetic operations and inequalities (we will focus on $$$<$$$, but such relations can be developed for other relations as well — like $$$>, \le$$$ and their combinations — and that is left as an exercise):

  • Addition/subtraction: let's say you have an inequality $$$a < b$$$. If you shift both things by the same amount on the number line, their relative positions won't get flipped. In other words, if we add a constant $$$c$$$ to both $$$a$$$ and $$$b$$$, the relative ordering of $$$a+c$$$ and $$$b+c$$$ is the same as that of $$$a$$$ and $$$b$$$. That is, $$$a < b$$$ implies $$$a + c < b + c$$$. The same holds for subtraction. So this implies some sort of a cancellation law:
$$$a < b \iff a + c < b + c$$$
  • Multiplication by $$$0$$$: Both sides of any inequality become $$$0$$$, so upon multiplication by $$$0$$$, inequalities become equalities.

  • Multiplication by a positive number: let's say you have an inequality $$$a < b$$$. This means $$$0 < b - a$$$, or in other words, $$$b - a$$$ is a positive number. If we multiply it by a positive number $$$c$$$, it will remain positive. That is $$$0 < c(b - a)$$$. Adding $$$ac$$$ to both sides gives $$$ac < bc$$$. So:

$$$c > 0, a < b \implies ac < bc$$$
  • Multiplication by a negative number: let's say you have an inequality $$$a < b$$$ and a negative number $$$c < 0$$$. Clearly, we have $$$a(-c) < b(-c)$$$ from the previous result. Adding $$$ac + bc$$$ to both sides gives $$$bc < ac$$$, or $$$ac > bc$$$. That is, the sign of the inequality gets flipped. So:
$$$c < 0, a < b \implies ac > bc$$$
  • Adding two inequalities that go the same way: let's say you have two inequalities $$$a < b$$$ and $$$c < d$$$. Is it true that $$$a + c < b + d$$$? Let's try to reason about it using the previous result. Let's subtract $$$c$$$ from both sides of the inequality (we can do this by the previous result). Then $$$a + c < b + d \iff a < b + d - c$$$. Note that since $$$c < d$$$, doing the same thing gives us $$$0 < d - c$$$. This essentially means that $$$d - c$$$ is a positive number. If we shift $$$b$$$ by a positive number to the right, we are moving $$$b$$$ farther from $$$a$$$, and hence the shifted $$$b$$$ should still be $$$> a$$$. This means that $$$a < b + d - c$$$. Using the previous equivalence, $$$a + c < b + d$$$ is indeed true. All in all, we have shown that
$$$a < b, c < d \implies a + c < b + d$$$
  • Subtracting two inequalities that go the opposite way: let's say you have two inequalities $$$a < b$$$ and $$$c > d$$$. Is it true that $$$a - c < b - d$$$? Note that multiplying $$$c > d$$$ by $$$-1$$$ gives $$$-c < -d$$$. So we have $$$a - c < b - d$$$ by our previous result. We have shown that
$$$a < b, c > d \implies a - c < b - d$$$

NOTE: Subtracting inequalities of the same type and adding inequalities of the opposite type does not work, and I leave it as an exercise for you to see which steps in the above proofs fail.

Multiplying and dividing inequalities is significantly more complicated, but they can be treated with the same ideas. More specifically, for multiplying two inequalities, you have to make cases on the signs of the 4 terms. For division, the idea is the same, but additionally you have to take care that some denominator is not 0.

Floors and ceilings

Let's start out by the definition of the floor function and the ceiling functions:

  • The floor function is a function $$$\lfloor \cdot \rfloor : \mathbb{R} \to \mathbb{Z}$$$ such that for all real numbers $$$x$$$, $$$\lfloor x \rfloor$$$ is the integer $$$n$$$ such that $$$n \le x < n + 1$$$.

  • The ceiling function is a function $$$\lceil \cdot \rceil : \mathbb{R} \to \mathbb{Z}$$$ such that for all real numbers $$$x$$$, $$$\lceil x \rceil$$$ is the integer $$$n$$$ such that $$$n - 1 < x \le n$$$.

Clearly, if such $$$n$$$ exist, they are unique.

Why do these definitions make sense? Note that intervals of the form $$$[n, n + 1)$$$ for integer $$$n$$$ cover the number line disjointly (this is not a rigorous proof, but I don't want to deal with real numbers), so $$$x$$$ will always be in one of these intervals. The same property holds for intervals of the form $$$(n - 1, n]$$$.

As a special case, when $$$x$$$ is an integer, both the ceiling and the floor functions return $$$x$$$ itself.

NOTE: Remember that $$$\lfloor 1.5 \rfloor = 1$$$, but $$$\lfloor -1.5 \rfloor = -2$$$, and not $$$-1$$$.

In fact, we can notice the following:

  • $$$\lfloor x \rfloor$$$ is the largest integer $$$n$$$ such that $$$n \le x$$$.
  • $$$\lceil x \rceil$$$ is the smallest integer $$$n$$$ such that $$$x \le n$$$.

The proof follows from the definition (we will show this only for the floor, the ceiling case is analogous): let's say that this is not true, and there is some integer $$$a > n$$$ such that $$$a \le x$$$. $$$a > n$$$ means $$$a \ge n + 1$$$. But we know that $$$x < n + 1 \le a$$$, so $$$x < a$$$. This is a contradiction.

Note: As a consequence of the above fact, we have $$$\lfloor x \rfloor \le \lceil x \rceil$$$, and the difference between them is $$$1$$$ iff $$$x$$$ is not an integer; when $$$x$$$ is an integer, they are equal to $$$x$$$.

We can rephrase these statements in the following ways that can be useful while solving problems.

  • For an integer $$$n$$$ and a real number $$$x$$$, the following holds: $$$n \le x \iff n \le \lfloor x \rfloor$$$.
  • For an integer $$$n$$$ and a real number $$$x$$$, the following holds: $$$n \ge x \iff n \ge \lceil x \rceil$$$.

This is especially important because these kinds of iff statements help us reduce the large number of cases that can arise while framing and solving inequalities. For an example, consider this comment.

Let's try to rephrase the original definition in another way too. $$$n \le x < n + 1$$$ is equivalent to the two inequalities $$$n \le x$$$ and $$$x < n + 1$$$. The latter is equivalent to $$$x - 1 < n$$$. So combining these again shows that $$$x - 1 < n \le x$$$. These steps are reversible, so this can also serve as an alternate definition of the floor function. Doing the same for the ceiling function, we have the following properties (that are also alternate definitions):

  • $$$x - 1 < \lfloor x \rfloor \le x$$$
  • $$$x \le \lceil x \rceil < x + 1$$$

It is sometimes useful to think in terms of the fractional part of a number (i.e., its distance from its floor), since it is guaranteed to be a non-negative real number less than $$$1$$$. Similarly, the distance to the ceiling is helpful too.

Some more properties of floors and ceilings

Let's see what happens to the floor of a real number when we add (or equivalently, subtract) an integer.

Let $$$n' = \lfloor x + a \rfloor$$$ where $$$a$$$ is an integer. This is equivalent to saying that $$$n' \le x + a < n' + 1$$$, which is equivalent to $$$n' - a \le x < n' - a + 1$$$. Since $$$n'$$$ is an integer, $$$n' - a$$$ is also an integer. Hence this is equivalent to saying that $$$n' - a = \lfloor x \rfloor$$$. In other words, we have shown that

$$$a \in \mathbb{Z} \implies \lfloor x + a \rfloor = \lfloor x \rfloor + a$$$

Similarly, we can show that

$$$a \in \mathbb{Z} \implies \lceil x + a \rceil = \lceil x \rceil + a$$$

However, this kind of an identity doesn't hold for multiplication by an integer in general. For example, $$$\lfloor 1.5 \rfloor = 1$$$, and $$$\lfloor 2 \cdot 1.5 \rfloor = 3 \ne 2 \cdot \lfloor 1.5 \rfloor$$$.

Note that by the same example, we can show that $$$\lfloor x + y \rfloor$$$ is NOT equal to $$$\lfloor x \rfloor + \lfloor y \rfloor$$$ in general (and something similar for ceiling). However, it can be shown that the following are true:

$$$\lfloor x \rfloor + \lfloor y \rfloor \le \lfloor x + y \rfloor \le \lfloor x \rfloor + \lfloor y \rfloor + 1$$$
$$$\lceil x \rceil + \lceil y \rceil - 1 \le \lceil x + y \rceil \le \lceil x \rceil + \lceil y \rceil$$$

Let's now look at what happens when we multiply the argument of the functions by $$$-1$$$.

From $$$n \le x < n + 1 \iff -n - 1 < -x \le -n$$$, we can note that $$$-\lfloor x \rfloor = \lceil -x \rceil$$$. By replacing $$$x$$$ by $$$-x$$$ in this relation, we have $$$\lfloor -x \rfloor = - \lceil x \rceil$$$.

Since the floor and ceiling functions both return integers, once one of them has been applied, applying floor or ceiling any number of further times does not change the value.

One non-trivial property that sometimes helps is the following:

For an arbitrary positive integer $$$n$$$ and real number $$$x$$$:

  • $$$\lfloor \frac{\lfloor x \rfloor}{n} \rfloor = \lfloor \frac{x}{n} \rfloor$$$
  • $$$\lceil \frac{\lceil x \rceil}{n} \rceil = \lceil \frac{x}{n} \rceil$$$

Proving this is simple and is left as an exercise for the reader. Alternatively, it can be shown via the proof of the property below.

Another more general property is the following (reference: Concrete Mathematics, p71):

Let $$$f(x)$$$ be any continuous, monotonically increasing function with the property that $$$f(x)$$$ being an integer implies $$$x$$$ being an integer. Then we have $$$\lfloor f(x) \rfloor = \lfloor f(\lfloor x \rfloor) \rfloor$$$, and a similar thing holds for the ceiling function.

The proof is as follows. There is nothing to prove for $$$x$$$ being an integer. Consider $$$x$$$ being a non-integer. Since $$$\lfloor \cdot \rfloor$$$ and $$$f$$$ are both non-decreasing functions, we have $$$\lfloor f(\lfloor x \rfloor) \rfloor \le \lfloor f(x) \rfloor$$$. Suppose that the inequality is strict — in that case, since $$$f$$$ is monotonic and continuous, there is a $$$\lfloor x \rfloor < y \le x$$$ such that $$$f(y) = \lfloor f(x) \rfloor$$$. Then $$$y$$$ is an integer because of the property that $$$f$$$ has. But there can not be another integer between floor of a number and itself, which is a contradiction.

For more properties and applications, I would refer you to the Wikipedia article on these functions.

Some natural applications of floors and ceilings

  • Quotient and remainder: The quotient on dividing a positive integer $$$a$$$ by another positive integer $$$b$$$ is by definition $$$\lfloor a / b \rfloor$$$. The remainder is $$$a - b \lfloor a / b \rfloor$$$.
  • To count the number of multiples of $$$a$$$ that are $$$\le n$$$, we note that if there are $$$k$$$ multiples, then the largest multiple of $$$a$$$ is $$$ka$$$, and it should be $$$\le n$$$. Hence $$$k \le n/a$$$. This is also a sufficient condition for there being at least $$$k$$$ multiples of $$$a$$$ at most $$$n$$$. So $$$k = \lfloor n / a \rfloor$$$ by maximality of $$$k$$$.

Examples of reasoning with the developed tools

Now that we have some results in our toolbox, we'll show how to actually solve problems cleanly without getting stuck thinking about multiple cases at once. We've already shown some ways of using these while deriving other results, so we'll keep this short and I'll just link to some problems instead.

Some identities and bounds

I'll list out some non-trivial identities and bounds, and proving them is left as an exercise.

  • Floors and ceilings for fractions: For a positive integer $$$b$$$ and an arbitrary integer $$$a$$$, we have $$$\lceil a/b \rceil = \lfloor (a + b - 1)/b \rfloor$$$.
  • Gauss' property: For positive integers $$$m, n$$$, we have $$$\sum_{i = 0}^{n - 1} \lfloor im/n \rfloor = ((m - 1)(n - 1) + \gcd(m, n) - 1) / 2$$$. To prove this, try looking at the number of points below the line $$$y = mx/n$$$ inside the rectangle with corners $$$(0, 0)$$$ and $$$(n, m)$$$.
  • Hermite's identities: For a positive integer $$$n$$$ and a real number $$$x$$$, the following are true: $$$\lfloor nx \rfloor = \sum_{i = 0}^{n - 1} \lfloor x + i/n \rfloor$$$ and $$$\lceil nx \rceil = \sum_{i = 0}^{n - 1} \lceil x - i/n \rceil$$$. This can be proved by making cases on whether the fractional part (respectively, distance to ceiling) is between $$$i/n$$$ and $$$(i + 1)/n$$$.
  • Harmonic bounds: $$$\sum_{i = 1}^n \lfloor x / i \rfloor \le x \sum_{i = 1}^n 1/i = O(x \log n)$$$. As a consequence, a sieve that considers all divisors from $$$1$$$ to $$$n$$$ for a total of $$$x$$$ consecutive integers takes $$$O(x \log n)$$$ time. This is why both the normal sieve of Eratosthenes and the segmented variant are fast enough. This also shows that the sum of divisors of a positive integer $$$n$$$ is at most $$$O(n \log n)$$$.

Programming language specific details

For the sake of sanity, it is recommended that whenever you perform integer division in any programming language, you first convert it into a form where the denominator is positive. In most languages, however, (a / b) * b + (a % b) always evaluates to a.

There are typically two behaviors when performing integer division with a positive divisor:

  • Computing the floor in all cases: this is the case for Python, for example. Hence a // b will give $$$\lfloor a / b \rfloor$$$, and a % b will give the non-negative remainder when $$$a$$$ is divided by $$$b$$$.
  • Rounding the result towards $$$0$$$: this is the case in C++, for example. One reason behind this is to prevent overflow when multiplying the result by the denominator. So, for positive $$$a$$$, the behaviour is the same as in Python, but for negative $$$a$$$, it returns $$$\lceil a / b \rceil$$$, and a % b gives the negative distance to the corresponding multiple of $$$b$$$.

Note that to compute the ceiling of some $$$a / b$$$ for positive integer $$$b$$$ and integer $$$a$$$, we can use the result mentioned earlier: $$$\lceil a / b \rceil = \lfloor (a + b - 1) / b \rfloor$$$. Implementing code for correctly computing floors and ceilings in both C++ and Python is an instructive exercise.

Keep in mind that std::floor and std::ceil are operations on floating point numbers, and their results are also floating point numbers, so they might not be exact when converted to integers. Hence it is not recommended to use them for most practical purposes.


By nor, history, 14 months ago, In English

As was noted in this blog, Google decided to discontinue its programming contests (Code Jam, Kick Start and Hash Code).

It seems they will take down all relevant servers (and hence the problems and the submission system) for all these contests. While there are some initiatives to archive past Code Jam and Kick Start problems (for example, here — also have a look at this comment), there seems to be nothing similar for Hash Code.

As a past participant of Hash Code, I believe that these types of contests are quite important too, and may have directly or indirectly inspired other such contests (like the Reply Challenge and AtCoder Heuristic Contests). Also there are some techniques like simulated annealing and using SAT/ILP solvers that might not show up in standard competitive programming websites but do show up in such optimization contests. The tutorials for them will be lacking without examples too.

So here are my questions to anyone who might have reliable information about the contests, since everything seems to be in a mess right now:

  1. Are there copyright issues that prevent the problems from being archived and open for submissions elsewhere (for a non-profit scenario)? This also holds for the BOJ hosting, so it might be possible that they have discussed the relevant issues already (though I am not aware of any such details).
  2. Does anyone have any ideas on how to archive these problems? For instance, what kinds of data need to be preserved (statements? scorers? output submissions? code submissions? ranklists?).
  3. If all goes well, is it possible to set this up with other existing Code Jam and Kick Start archives?


By nor, 15 months ago, In English

I've been involved with testing a lot of Codeforces contests and coordinating contests for my uni, and have seen various kinds of issues crop up. However, I was able to come up with a stable system for my uni contests that helped make the experience as educational and error-free as possible, and I have found the lack of some of those things to be the main reason behind issues faced by Codeforces contests.

Given that it takes months to get a round ready, it is better to be more careful than strictly necessary than to risk getting your round unrated or receiving bad feedback in general.

I'd like to thank Ari, null_awe and prabowo for reviewing the blog and discussing more about problem-setting.

The current Codeforces contest guidelines are here for problem-setters and here for testers, and they have some great sanity checks that will help you make a good contest. But I still think it is missing some things that leads to bad contests at times.

Issues

Let's list out some issues that plague contests.

  1. Wrong intended solution or tests.
  2. Weak tests.
  3. TL or ML too tight or too loose.
  4. Some hacks are not accepted.
  5. Wrong IO format.
  6. Checker issues — too stringent, too loose, or simply wrong.
  7. Guessable problems.
  8. Problems that have appeared before.
  9. Unbalanced contests — difficulty-wise.
  10. Unbalanced contests — topic-wise.
  11. Long queue, system unavailability.

Out of these, the most glaring issue is having a wrong intended solution, which immediately leads to the contest being made unrated.

Discussion

I think the fundamental problem behind most of these issues is two-fold:

  • Lack of rigor while setting a problem.
  • Not adopting a security-based approach to testing (pentests, for example).

Whenever I coordinated contests at my uni, I sent out a couple of guideline documents for setters and testers, that built upon these two ideas. The guiding principle for the first of these reasons (that every author and tester should understand, in my opinion) is consistency (and proofs of it). Not just one type of consistency, but consistency of the intent, constraints, statement, checker, validator and tests — all at the same time.

NOTE: Before we go forward, let's clarify what we mean by a formal proof. In this comment, I clarified that we are not talking about scary super-technical proofs (such as ones you might encounter in advanced set theory). A proof that is logically sound, where steps follow from basic logical reasoning and don't have gaps, is good enough. I call it a "formal proof" just because people on Codeforces call anything that looks reasonable a "proof" (which is obviously wrong and misleading) — and that is not what I am talking about. If you're not very familiar with the idea of a mathematical proof, don't worry. It is not meant to be too hard. I would recommend reading the chapter on proofs in this book, to understand what kind of logical statements I am talking about, and the idea behind proofs generally.

Let's understand this by asking ourselves the following questions:

  • What's the intent of the problem (not in terms of ideas, but from a contest perspective)? Usually it is "solve this problem in xyz time complexity and/or abc memory complexity" or "solve this problem in these many queries". Constraints on the input and output, time limit and memory limit arise from this naturally.
  • What do we want the contestants to solve? This corresponds to a precise and unambiguous statement of the problem, including the IO format, guarantees on special kinds of tests (if any) and so on. Unfortunately, there is no way to formally check that your Polygon materials are consistent with this problem statement, which is why it becomes necessary to ensure that the statements are such that your solution can be derived formally from the problem statement.
  • How do the contestants prove that they have solved the problem? In other types of reasoning/math contests, you have to provide a proof of correctness to show that you have solved the problem. This is also the approach taken by academia. However, unless you force participants to submit a solution in proof languages like Coq or Lean, it is impossible (or very hard) to enforce that they have a proof of correctness for their arguments and time complexity of their algorithms. This is where tests come in. "Strong" tests mean that any code that comes from a solution that doesn't satisfy the intent of the problem fails to get accepted on those tests. Since the constraints are bounded, the set of all possible tests is finite, but still huge enough that adding them all is infeasible. So to combat that, a good practice is to add all possible tests for small constraints, some random max tests, and tests that break all* possible heuristic solutions.
  • How to verify that our testing procedure is correct? Of course, if you feed in garbage data (i.e., data that doesn't satisfy the preconditions of a program), the output might be garbage. If there are inconsistencies in the tests, it is impossible to verify if the submitted solutions are correct (as in they solve the problem as stated). The same goes for tests that fall outside the constraints.

Solutions to the issues — the guidelines

We will now list the guidelines that help make contests error-free. Most of these will have both an author and a tester part. Keep in mind that these guidelines will be along the lines of a security perspective.

Note that there are some things that we can completely verify in a mathematical sense by proofs. But there are also things that we can't be sure of. One of the reasons behind it is that the halting problem is undecidable. So, for those kinds of things, we enforce some kinds of limits on the runtime or the memory requirements, and since we also want to check the behavior of submitted solutions on large inputs, we miss out on a lot of tests due to the combinatorial explosion of the number of tests that happens for most problems.

Provable things

Firstly, let's see what we can prove the correctness of. Keep in mind that a proof of consistency is always an "if and only if" argument.

Statement and solution

This is the part that comes even before setting the problem on Polygon or any other contest preparation system. Ensure that you have the following things:

  • Formal statement
  • Formal algorithm
  • Formal proof of correctness of algorithm

None of this should have actual numbers in it (keep everything in terms of what is defined in the input, and don't think about the constraints for now). Don't implement it in a programming language yet; treat language constructs and standard algorithms whose correctness is well-known as black boxes.

Your proof and algorithm should depend only on the statement and logic. Don't implicitly add in any other assumptions or make wrong assumptions in the proof (if you write the proof, it will feel off anyway, so it's quite hard to make a mistake like this if you follow this guideline).

Some notes:

  • The algorithm and the proof of correctness are two different things. Saying that "it can be easily seen that this greedy works" or "finding max flow in this graph gives the answer" is not a proof.
  • Write out everything mathematically, even if it is obvious to you. This is the foundation of the whole problem, and if there are issues in this, is there even a point to preparing the problem, or even thinking about it?
  • I recommend reading this article and the links it refers to, to understand what I mean by a proof here. If you don't know how to structure things in the form of a proof, perhaps you are not yet ready to write a problem for a contest. However, it is never too late to learn, and your idea behind the problem might well be correct. A formal proof gives you a way to communicate with others unambiguously, which is important since natural language is imprecise. Writing formal proofs is not hard, and apart from being a communication tool, it cements your own understanding as well.
  • It might be possible that there is a solution faster/better than your intended solution. Don't be afraid to use it as the intended solution (in fact, try to find as many solutions as possible).

Doing this step properly eliminates the possibility of having wrong statements or wrong intended solutions. It also avoids situations like this where the intended solution was "fast enough" on the tests but the problem in general didn't seem to be solvable fast enough. Incidentally, that blog is a good example of what can go wrong in a real-life setting.

Constraints and test validation

Setting constraints is the part which requires getting your hands dirty. Let's ignore that for now, and let's say we have some constraints.

How do you ensure that your tests are correct? Think of it as a competitive programming problem again, though a very simple one. This is where a validator comes in. With a validator, you validate the tests based on what the problem states. The reasoning behind proving the correctness of the validator should go like this:

We show that the validator accepts a test if and only if the test is valid according to the problem statement. For the if-direction, [do something]. Now for the only-if-direction, [do something].

I recommend keeping the numbers in this proof generic enough, as is done in most math proofs. We will plug in the numbers later, when we discuss the constraints. Having this proof handy is quite important. Sometimes, when preparing a problem, you end up changing the constraints later on, but forget to update the validator. If you keep this proof as a comment in the validator itself, and inspect all the contest materials before committing changes, you will immediately catch inconsistencies in the statement and the validator.

Also, validators are important when hacks are enabled. If your validator is too loose, you risk some AC solutions getting hacked. And if your validator is too tight, some hacks might be invalid, and there are some hidden conditions in the problem that the hacker might be able to guess, which makes the contest unfair. For example, this contest had such an issue.

For choosing constraints, you need to write some code to verify that the TL and ML are fine and so on. However, this is not something you can prove even in a handwavy sense (unless you know the assembly your code generates and the time every instruction takes when executed in that order, on the worst possible test), so we will keep it for later.

Checker

This corresponds to checking that the contestant's solution indeed outputs something that is valid, according to the problem statement.

Again, as in the validator, using a formal proof, ensure that the output satisfies all the conditions for it to be accepted as a valid output by the statement, and don't add any extra checks. This is important because there have been CF contests where the checker checked something much stronger about the structure of the solution. A very simple example is "print a graph that is connected". This doesn't mention that the graph can't have multiple edges or loops, but many people will just write a checker that will add these conditions anyway. This is another reason why adding links to precise definitions in the problem is a good thing.

Sometimes writing the checker is harder than solving the original problem itself, and there have been instances in certain CF contests where writing the checker was the second subtask of a problem! Keep this in mind while setting problems, since a problem is useless if you can't write a checker for it. Sometimes it might make sense to ask for only a partial certificate from a contestant, like a quantity that can be computed only with the intended technique, or something weaker than the full output that is still reasonable to ask for (this changes the problem itself, though, and hence it doesn't show up as an issue on the contestant-facing side of problem-setting).

Things we have to manage heuristically

As we mentioned earlier, almost everything about setting constraints and creating tests is something that we have to manage heuristically. Both of these things go hand-in-hand, which makes the whole process quite iterative. Some obvious guidelines that help in these cases are given in the next section.

Condensed list of guidelines for testers and authors

Common guidelines for setters and testers

  1. Read this blog and the links in this post that point to CF guidelines. Also read the links to issues in actual CF contests and think about what went wrong.
  2. Keep the contest preparation process interactive and discuss as many solutions as possible, as well as potential changes to problems.

Guidelines for setters

  1. Ensure that you know what testers are meant to help you with, and read guidelines meant for them as a starting point.
  2. Have a formal write-up describing the problem, algorithm and the proof, as described in the blog. The formal editorial for a problem should be ready before you prepare the problem.
  3. Keep the problem statements short, clean and unambiguous. Use lots of space in unavoidable situations. Avoid using examples in statements that unnecessarily obstruct the flow of the reader, and write the statement in a way that doesn't need you to rely on examples in the first place.
  4. Ensure that your model solution is readable and is provably correct (and it doesn't have any undefined behavior and so on). Preferably write it in a "safe" language, and turn on all debug flags while testing the main solution (this is something I have wanted on Polygon, but there's no support for that unfortunately).
  5. Write strict validators and checkers, and prove their correctness too (both in terms of semantics as well as formatting). When writing checkers, do NOT assume that the jury solution is correct. Rather, check the correctness of both the jury solution as well as the participant's solution.
  6. Have exhaustive test cases — ideally, use a bruteforce to generate all possible tests for some small sub-constraints to ensure correctness, corner cases, tricky large cases for TLE and random large cases. Ensure that you cover all possible special cases, and look for specific solutions that can give AC but do not work on the general case (like faulty greedy solutions and so on). Keep in mind unordered_map and unordered_set like peculiarities in the languages.
  7. Ensure that the time limits are not too tight, and they are at least 2-3 times the slowest AC solution you have. Try to keep constraints in a way that the most optimized unintended solution either doesn't pass (by a good margin — like twice the TL), or barely passes. In the latter case, you should be reconsidering whether you want this problem to appear on the contest at all.
  8. Before relying on the testers, try to test your own problems according to the tester guidelines.
  9. Maintain something like a spreadsheet for tracking status and feedback of problems, and current progress.

Guidelines for testers

  1. Ensure that you know what setters are meant to do, and read guidelines meant for them as a starting point.
  2. Ensure that the problem statements are water-tight and self-contained (not a lot of background knowledge necessary, and wherever necessary, links should be present) and provided examples clarify unavoidable ambiguities (which should ideally be non-existent).
  3. After testing, confirm that the authors' proofs are all correct. This is not limited to proofs of correctness of the intended solution: the checker, validator and so on need proofs too. Even if you couldn't solve a problem during virtual-contest testing, you should still vet its materials.
  4. Try to hack the problem: submit heuristic solutions, both in the sense of potential WA as well as potential TLE. If they pass on all cases, try to come up with cases they can’t pass on, or prove that this heuristic is correct/fast enough. Such a situation is quite instructive whenever it arises (from experience, this happens usually when there is a greedy solution involved which doesn’t really need the full strength of the greedy decision).
  5. Try to write the slowest possible reasonable solutions that a contestant can come up with, with the same complexity as the author’s solution. Also try to come up with highly optimized brute-force solutions (using caching, bitsets, precomputation of answers and so on) and see if they can pass the test cases. Look for test cases that these can’t pass, but the official solution passes. In case the official solution and the optimized brute-force both have similar speeds, discuss what to do next with the coordinator and the author.
  6. If you come up with an alternative approach with a better or slightly worse complexity, notify the problem setter, and depending on what the intended difficulty of the problem is, discuss whether to make the problem harder/easier.
  7. Give detailed feedback on the quality of problem statements, problem content, editorials, solutions, test cases, perceived hardness of the contest, and anything else that will help improve the perceived quality of the contest.

Handling other non-obvious issues

At this point, there are the following issues we haven't addressed.

  1. Guessable problems.
  2. Problems that have appeared before.
  3. Unbalanced contests — difficulty-wise.
  4. Unbalanced contests — topic-wise.
  5. Long queue, system unavailability.

The first four issues are solved only by testing and being consciously aware of the fact that they need to be fixed.

If problems are guessable, it makes sense to ask contestants to construct something that shows that their guess is correct. For example, it might be easy to guess that the size of the answer for a problem is $$$n/2$$$. This is very guessable, and in this case, it is better to ask for the construction of the answer as well.

For avoiding problems that have appeared before, since there is currently no repository of all competitive programming (or math contest) problems that is searchable by context, the only way is to invite a lot of experienced testers with different backgrounds.

For avoiding unbalanced contests (both in terms of difficulty and topics), it is important to have a large variety of testers — people with various amounts of experience as well as various types of topics that they are strong at.

The last type of issue is unique to online contests run at a large scale — for example, CF contests. This is the main reason why pretests exist in the first place. To avoid long queues, there are usually at most 2-3 pretests for problems like A and B. To make strong pretests that are also minimal, there are two ways of choosing tests (which can be combined too):

  1. Choose an exhaustive test (on smaller constraints), and a random max-test.
  2. Pick all possible tests that fail solutions that you encountered in testing.

This is one of the reasons why the first couple of problems are multi-test in nature.

Conclusion

If there's anything I missed, please let me know, since it's been quite some time since I last coordinated a contest. Feel free to discuss the various aspects of problem-setting in this context, and I would request coordinators to also share their to-do lists from when they start coordinating a contest, right from the very beginning of the review stage.

Full text and comments »


By nor, 15 months ago, In English

Disclaimer: I am not an expert in the field of the psychology of learning and problem-solving, so take the following with a grain of salt. There is not much "scientific" evidence for this blog, and the following is validated by personal experience and the experiences of people I know (who fall everywhere on the "success" spectrum — from greys to reds in competitive programming, from beginners in math to IMO gold medalists, and from people with zero research experience to people with monumental publications for their own fields). This blog only covers one possible way of thinking about knowledge organization and retrieval that I have been using for the past decade or so, but there definitely will be other models that might work even better. Even though I just have a few courses worth of formal psychology experience, I still think the content of this blog should be relevant to people involved with problem-solving.

I hope this helps both students (to make their learning efficient) as well as educators (to make their teaching methodology more concrete and participate actively in their students' learning process).

Note that the techniques mentioned in this blog (or any resource that tries to explain learning and reasoning) will not work for you if your basics behind reasoning are flawed. If you're a beginner at reasoning in terms of proofs, I recommend this book to understand what mathematical reasoning is like. Intuition is just a guide — it is developed and applied only through multiple instances of rigorous reasoning.

Note: This blog is quite long, so reading it bit-by-bit should be better than skimming it in one go. I have also added a few links to some really good resources within the text, where I felt that it would be better to link to pre-existing stuff to avoid repetition and give a more varied perspective.

Here's what we'll discuss in the blog:

  • Why even bother understanding learning, and how do I use that knowledge?
  • A mental model of how knowledge is stored — the data structure
  • Learning methods — constructing the data structure
    • Reading up on theory
    • Learning via problems
    • Applying creativity to create new ideas and contexts
  • Reasoning — querying the data structure
    • Reasoning techniques
      • "Wishful thinking"
      • Intuition
      • Rigor
    • On developing general reasoning techniques
  • Some limitations of human problem-solving and overcoming them
    • Working memory
    • Motivating creativity
    • Overfitting on your knowledge
  • Some real-life examples
    • How learning can look like for you
    • A few useful high-level ideas
    • How solving problems can look like
    • How you can take feedback from this knowledge graph
  • Further reading

Full text and comments »


By nor, 15 months ago, In English

Firstly, the CF Catalog is a great feature that makes it much easier to learn topics by having a lot of educational content in the same place, so thanks MikeMirzayanov!

I was originally going to post this as a comment under the catalog announcement blog, but it became too large for a single comment, and it will probably need some more discussion.

I want to suggest the following improvements for the catalog (and initiate a discussion on how best to tackle these issues with the current version):

  1. Multiple topics per blog
  2. Blog submission for review

The reasons and detailed suggestions are as follows:

Full text and comments »


By nor, 15 months ago, In English

People often say C++ is much more cumbersome to write than any sane language should be, but I think that is mainly due to lack of knowledge about how to write modern C++ code. Here's how you can write code more smartly and succinctly in C++(20):

Negative indexing with end

Have you ever found yourself using a vector like a stack just because you have to also access the last few positions, or writing a sliding window code using a deque (where the size might be variable, but you want to access an element at some offset from the end)?

Chances are that you have written code like this:

while (stk.size() >= 3 && stk[stk.size() - 2] + stk[stk.size() - 1] > stk[stk.size() - 3]) {
  stk.pop_back();
}

A cleaner way of writing the same thing is this:

while (stk.size() >= 3 && end(stk)[-2] + end(stk)[-1] > end(stk)[-3]) {
  stk.pop_back();
}

This trick works not just for end, but for any iterator, provided that (roughly) there is no undefined behaviour like out of bounds accesses.

This can be done only with iterators which provide random access. For example, iterators for vectors but not for sets.

Free functions like len and str

Python's len and str come in handy a lot of times, and they're convenient to use because they're free functions (that is, not member functions).

Fortunately, C++ has its own set of free functions that are quite helpful:

  1. size(v): It returns the unsigned size of $$$v$$$, and does the same thing as v.size(). That is, its numerical value is the size of $$$v$$$, and its type is some unspecified unsigned type (in practice, size_t). This also works for arrays (but not raw pointers).
  2. ssize(v): It returns the signed size of $$$v$$$. This is quite useful, because it prevents bugs like a.size() - 1 on empty a. It doesn't have a member function analogue, though. This also works for arrays (but not raw pointers). If you use sz macros, you can safely remove them from your code!
  3. begin(v), end(v), rbegin(v), rend(v): They do the same thing as what their corresponding member functions do. These also work for arrays (but not raw pointers).
  4. empty(v): This checks if $$$v$$$ is empty or not. Analogous to v.empty()
  5. data(v): This returns a pointer to the block of data where it works.
  6. to_string(x): This returns the string representation of a datatype. They're only defined for things like int and double though.

Variable declaration in loops/conditions

If you have used range-based for loops, you might have noticed that there is no support for indexing by default. So people usually either switch to an index-based loop, or do the following:

int i = -1;
for (auto x : a) {
  i++;
  // do something
}
// or
{
  int i = -1;
  for (auto x : a) {
    i++;
    // do something
  }
}

However, it either pollutes the parent scope with the variable i and makes things buggy, or just looks ugly. C++20 has the following feature that makes it cleaner:

for (int i = -1; auto x : a) {
  i++;
  // do something
}

Similarly, if there is an object such that you only have to inspect its properties just once and it becomes useless later on, you can construct it in the init part of the if condition, like this:

if (auto x = construct_object(); x.good() && !x.bad()) {
  // do something
}

Unpacking structs

A lot of people write code that looks like this:

std::vector<std::tuple<int, char, std::string>> a;
// populate a
for (int i = 0; i < (int)a.size(); ++i) {
  int x = std::get<0>(a[i]);
  char y = std::get<1>(a[i]);
  std::string z = std::get<2>(a[i]);
}
// or
for (auto& A : a) {
  int x = std::get<0>(A);
  char y = std::get<1>(A);
  std::string z = std::get<2>(A);
}
// or
for (auto& A : a) {
  int x; char y; std::string z;
  std::tie(x, y, z) = A;
}

A cleaner and more consistent way of writing this can be:

for (auto [x, y, z] : a) {
}
// or
for (auto& A : a) {
  auto [x, y, z] = A;
}

If you want references instead of copies stored in x, y, z, then you can do this:

for (auto& [x, y, z] : a) {
}
// or
for (auto& A : a) {
  auto& [x, y, z] = A;
}

Note that this is not limited to for loops, or even to just tuples, and you can use this sort of unpacking (structured binding) for most structs (in particular, custom structs you make, or std::pair). In fact, if you have std::array<int, 3>, then this unpacking will give you its contents in three different variables.

Some more utilities

There are some lesser known functionalities that C++20 introduced, like starts_with and ends_with on strings. Some of the more useful ones can be found here.

Notably, there is also support for binary searching like we do in competitive programming, and the relevant information can be found in the language specific section of this blog.

You can also print a number in binary, analogously to what you can do in Python. This can be done by constructing a bitset out of the integer and converting it to a string, or just by printing std::format("{:b}", a) (which is unfortunately not on CF yet).

emplace_back, emplace_front and emplace

Seriously, stop using a.push_back(make_pair(1, 2)) and a.push_back(make_tuple(1, 2, 3)) and even a.push_back({1, 2, 3}), and start using a.emplace_back(1, 2) instead.

The only reasonably common place this doesn't work is a container that stores std::array, whose elements can only be brace-initialized; in that case a.push_back({1, 2, 3}) is fine.

The other variants: emplace_front and emplace do the same thing for their other counterparts (for example, in a deque, queue or priority_queue).

Lambdas

Lambdas in C++ are in fact cleaner and more intuitive (with more features) than in Python — compared to both Python lambdas (C++ lambdas can contain more than one statement) and local functions (C++ lambdas handle mutable state and capture variables from an ancestor scope more explicitly). You can capture things explicitly, keep mutable state, pass them around as function objects, and even implement recursion using generic lambdas, though C++23 will have a feature called deducing this which will make a lot of things much simpler.

They can be passed around to almost all algorithms in the STL algorithms library, which is part of the reason why the STL is so powerful and generic.

They can also be used to make certain functions local, and avoid polluting the global namespace with function definitions. One more thing about them is that they are constexpr by default, which is not true for traditional functions.

Vector-like data structure without iterator invalidation

You probably know what we're coming to at this point: std::deque. It is essentially std::vector (without $$$O(1)$$$ move guarantees, but whatever), except with no reference/pointer invalidation when adding elements at either end — unlike a vector, the existing elements never move.

That is why it is ideal in a very important setting:

  • Making buffers of elements (so that we can use pointers to elements without worrying about what happens when the buffer is resized).

No more Python laughing in our face about having to worry about iterator invalidation!

Python style ranges and even more flexible range manipulation

Using temporary container in for loop

Suppose you want to do something for multiple arguments, but they aren't indices. Most people would either duplicate code, or do something like this:

vector<char> a{'L', 'R', 'D', 'U'};
for (auto x : a) { /* do something */ }

But this has a disadvantage: it needs more characters and also makes it more error-prone to use vectors. Instead of this, you can do this:

for (auto x : {'L', 'R', 'D', 'U'}) { /* do something */ }

or even

for (auto x : "LRDU"s) { /* do something */ }

Note that the s suffix makes it a std::string literal, which is why this works; a plain const char* literal doesn't, because iterating over it would also visit the terminating \0. A std::string can still allocate on the heap, though not in this case — only long strings allocate, due to SSO.

In case you explicitly want an allocation with a vector for some reason, you can use

for (auto x : std::vector{'L', 'R', 'D', 'U'})

or

for (auto x : std::vector<char>{'L', 'R', 'D', 'U'})
Ranges

C++20 has a feature called ranges, that allows you to do manipulations on, well, ranges.

Let's think about how to implement this: For integers between 1 to 5, multiply them by 2, and print the last 3 elements in reversed order.

#include <bits/stdc++.h>

int main() {
  std::vector<int> a{1, 2, 3, 4, 5};
  for (auto& x : a) x *= 2;
  for (int i = 0; i < 3; ++i) std::cout << a[(int)a.size() - i - 1] << '\n';
}

However, this is prone to having bugs (what about out of bounds errors in some other scenario?), and doesn't look intuitive at all.

Now consider this:

#include <bits/stdc++.h>

namespace R = std::ranges;
namespace V = std::ranges::views;

int main() {
    for (auto x : V::iota(1, 6) | V::transform([](int x) { return x * 2; }) | V::reverse | V::take(3))
        std::cout << x << '\n';
}

Here the intent is pretty clear. You take integers in $$$[1, 6)$$$, multiply them by 2, reverse them, and take the first 3. Now in this resulting range of numbers, you print each of them.

So let's see what happens here. The | operator pipes a range into a range adaptor, producing another range. For example, V::transform(range, lambda) gives a range where every element is transformed, and range | V::transform(lambda) does the same thing. In fact, for a lot of cases, the result isn't stored in memory at all (the operations are performed on the fly while iterating), unless you have things like reversing a range.

One thing that you can finally do is split a string with delimiters in a nice way (which used to be a major point of ridicule directed towards C++ from a lot of Python users). This is as simple as using a split view in the ranges library. For more information, see this. Here's an example taken verbatim from the cppreference page:

#include <iomanip>
#include <iostream>
#include <ranges>
#include <string_view>
 
int main() {
    constexpr std::string_view words{"Hello^_^C++^_^20^_^!"};
    constexpr std::string_view delim{"^_^"};
    for (const auto word : std::views::split(words, delim))
        std::cout << std::quoted(std::string_view{word.begin(), word.end()}) << ' ';
}

A more common example can be like this:

#include <bits/stdc++.h>

namespace R = std::ranges;
namespace V = std::ranges::views;

int main() {
    std::cin.tie(nullptr)->sync_with_stdio(false);
    std::string s(1000000, '0');
    for (int i = 0; i < 1000000 / 2; ++i) s[i * 2] = '1';
    for (const auto word : V::split(s, '1'))
        std::cout << std::string_view{begin(word), end(word)} << ' ';
}

Some ranges functionality is as follows:

  • converting to and from a range: conversion to a range is implicit for most STL containers. If r is a range, then for most STL containers, like std::vector, you can do something like std::vector<int> a(r.begin(), r.end()). C++23 has a feature R::to<ContainerType>(range) that will construct a container out of a range, and you could even do range | R::to<ContainerType>().
  • iota: This works similarly to Python's range() but without the stride.
  • reverse: This works similarly to Python's reversed()
  • take: This takes the first $$$n$$$ elements of a range (unless you run out of elements by then)
  • drop: This takes everything but the first $$$n$$$ elements of a range (unless you run out of elements by then)
  • counted: This takes $$$n$$$ elements of a range starting from an iterator.
  • keys, values: keys takes a range whose elements are pair-like and returns a range of their first elements (so if we have (key, value) pairs, it will return keys), similar to .keys() in Python. Similarly, values returns a range of the values.
  • elements: if we have a range of tuple-like elements, this takes the $$$N$$$-th member of each of these tuples (where $$$N$$$ is a compile-time constant).
  • all: takes a container and returns a view of all its elements; conversions from containers to ranges are usually implicit, but this makes them explicit.
  • Some upcoming C++23 features that will make it even more convenient to use ranges: to, zip, slide, stride, repeat, adjacent, pairwise, cartesian_product.

Note that this is not the only thing you can do with ranges. In fact, with almost all possible algorithms that you can find in the algorithms library that use a begin and an end iterator (like std::sort or std::partial_sum or std::all_of), you can find a ranges version here, where you can just pass the range/container name instead of the begin and end iterators you usually pass.

By the way, if you don't want to wait for C++23, you can use some of the functionality from this code, which I used to use a couple of years ago when C++20 was not a thing.

Spoiler

Finally, I'd like to thank ToxicPie9, meooow and magnus.hegdahl for discussing certain contents of this blog, and other members of the AC server for giving their inputs too.


By nor, 15 months ago, In English

This blog is intended to be an introduction to permutations for beginners, as there seem to be no resources other than cryptic (for beginners) editorials that talk about some cycles/composition, and people unfamiliar with permutations are left wondering what those things are.

We'll break the blog into three major parts based on how we most commonly look at permutations. Of course, even though we will treat these approaches separately (though not so much for the last two approaches), in many problems, you might need to use multiple approaches at the same time and maybe even something completely new. Usually, you'd use these ideas in conjunction with something like greedy/DP after you realize the underlying setup.

An ordering of the sections in decreasing order of importance according to the current permutations meta would be: cycle decomposition, permutation composition, ordering.

However, the topics are introduced in a way so that things make more sense, because there are some dependencies in definitions.

Table of contents

  1. Definition, basic properties, and notation
    • Definition
    • Two-line notation
    • Permutation invariants
  2. The ordering perspective
    • Fixed points
    • Derangements
    • Inversions
    • Longest increasing subsequences, and Erdos Szekeres theorem
    • Next permutation
    • Random permutations
  3. The cycle decomposition perspective
    • Cycles
    • Cycle notation
    • Finding $$$k$$$-th next, functional graphs
    • Swaps/transpositions
    • Swaps on cycles
  4. The permutation composition perspective
    • Definition
    • Examples — cycles and swaps
    • Parity of permutations
    • Parity under composition
    • Inverse of permutations
    • Involutions
    • $$$k$$$-th power of permutations
    • Order of permutations
    • Square-root of permutations
    • Conjugation and conjugacy classes
  5. Miscellaneous topics (brief discussion)
    • Group theory
    • Counting: Burnside, Polya enumeration, Stirling numbers
    • Coordinate compression
    • Connected component DP
    • Young tableaux
    • Schreier Sims algorithm
    • Permutation tree
  6. Problems

Definition, basic properties, and notation

To start with, here's the definition of a permutation that you might find in many problems on Codeforces:

Definition: A permutation of size $$$n$$$ is an array of size $$$n$$$ where each integer from $$$1$$$ to $$$n$$$ appears exactly once.

But why do we care about these arrays/sequences? The reason behind this is simple.

Consider any sequence $$$a$$$ of $$$n$$$ integers. We can index them with integers from $$$1$$$ to $$$n$$$. So the $$$i$$$-th integer in the sequence is $$$a_i$$$. To understand the following, it helps to consider the example where $$$a_i = i$$$ for all $$$i$$$.

Let's now shuffle this sequence (elements are allowed to stay at their original places). We will try to construct a permutation from this shuffling in two ways:

  1. What is the new index of the element that was initially at position $$$i$$$? (In the case when $$$a_i = i$$$, this just means finding the position of $$$i$$$ in the new array). Let's say this index was $$$f(i)$$$. Then $$$f$$$ is a function from $$$\{1, 2, \dots, n\}$$$ to itself. Now notice that when we write out the array $$$[f(1), f(2), \dots, f(n)]$$$, this satisfies the definition of a permutation! And we can easily construct a shuffling operation corresponding to every permutation just by reversing this process.
  2. What was the old index of the element that is now at position $$$i$$$? (In the case when $$$a_i = i$$$, this just means looking at position $$$i$$$ in the new array and reporting the number there). Let's say this index is $$$g(i)$$$. Then $$$g$$$ is again a function from $$$\{1, 2, \dots, n\}$$$ to itself. When we write out the array $$$[g(1), g(2), \dots, g(n)]$$$, this is again a permutation! Note that $$$f$$$ and $$$g$$$ are not defined the same way, and the generated permutations are, in fact, distinct. In fact, it should not be hard to see that $$$f(g(x)) = g(f(x)) = x$$$ for any $$$x$$$ in $$$\{1, 2, \dots, n\}$$$.

This means shuffling around elements (or permuting them, in standard terminology) is another way to think about permutations.

This shows us how permutations are related to functions from $$$\{1, 2, \dots, n\}$$$ to itself such that all images are distinct (and all elements have a unique pre-image). In other words, a permutation can be thought of simply as a bijection on a set, which is the most natural definition to use for quite a few applications. The second construction is the one that corresponds to this definition, and the permutation from the first construction is what we sometimes call the position array of the second one (this name is not standard). We will later see that it is precisely the inverse of the permutation.

Note that the above explanation is quite important for having an intuition on what a permutation is. If you don't understand some parts, I recommend re-reading this and trying a few simple examples until you're comfortable with what the permutation and functions $$$f$$$ and $$$g$$$ have to do with each other.

Two-line notation

Note that a permutation $$$[a_1, a_2, \dots, a_n]$$$ of $$$[1, 2, \dots, n]$$$ corresponds to a function $$$f$$$ on $$$\{1, 2, \dots, n\}$$$ defined by $$$f(i) = a_i$$$. Implicitly speaking, the set of pairs $$$(i, a_i)$$$ uniquely determines the permutation (it is just the set of mappings from position to element) and vice versa. To understand compositions and inverses (which will be introduced later on) better, the following notation can come in handy:

\begin{equation*} \begin{pmatrix} 1 & 2 & \cdots & n \\ a_1 & a_2 & \cdots & a_n \end{pmatrix} \end{equation*}

Here we have just arranged the pairs vertically, from left to right. Note that these pairs can be shuffled around (like dominoes), and this can lead to some nice properties:

  1. What happens when we invert the pairs (that is, replace $$$(i, a_i)$$$ by $$$(a_i, i)$$$ for all $$$i$$$, which corresponds to swapping the two rows)? If the shown permutation corresponds to the function $$$g$$$, the permutation after inverting the pairs corresponds to the function $$$f$$$! This should be clear by looking at how $$$f$$$ and $$$g$$$ were defined earlier. So this sort of operation leads to a function inverse.
  2. What happens if we try to combine two permutations by trying to match the lower half of the first permutation $$$p_1$$$ with the upper half of the second permutation $$$p_2$$$? If the $$$g$$$ for $$$p_1$$$ is $$$g_1$$$, and that for $$$p_2$$$ is $$$g_2$$$, then it can be verified that the new permutation's $$$g$$$ is just the function formed by applying $$$g_1$$$, then $$$g_2$$$. So this sort of operation leads to function composition.

Permutation invariants

Note that when a permutation is sorted, it leads to $$$[1, 2, \dots, n]$$$. So any accumulation operation on the permutation array (like sum, product, xor, sum of squares, number of odd integers, etc.) that doesn't rely on the order of elements in an array will give the same value for all permutations of the same length.

The ordering perspective

Fixed points

A fixed point of a permutation $$$a$$$ is an index $$$i$$$ such that $$$a_i = i$$$. These are essentially the values that are not affected by the permutation at all, so if we disregard what happens to these values, we won't be losing out on any information for that permutation $$$a$$$. This is also why these $$$i$$$-s are sometimes skipped in the two-line notation (and also in the cycle notation, which will be introduced in the next section).

Note: If you are familiar with cycles or are on your second reading, you might note that this idea is probably more appropriate for cycles and compositions, but I kept it in this section since it shares some ideas with other subsections in this section. The same goes for the next section.

Derangements

A derangement is a permutation with no fixed points. That is, for every $$$i$$$, we have $$$a_i \ne i$$$. One useful thing to know is how many of the $$$n!$$$ permutations of size $$$n$$$ are derangements. Let's say this number is $$$D_n$$$. There are at least three distinct ways to count them:

Solving linear equations

In this approach, we group all possible permutations by the number of their fixed points. If there are $$$i$$$ fixed points, then upon relabeling the remaining $$$n - i$$$ elements from $$$1$$$ to $$$n - i$$$, the resulting permutation is a derangement. So, the number of permutations on $$$n$$$ elements with $$$i$$$ fixed points is exactly $$$\binom{n}{i} \cdot D_{n - i}$$$.

Summing this over all $$$i$$$ gives us the identity $$$n! = \sum_{i = 0}^n \binom{n}{i} \cdot D_{n - i}$$$.

Along with the fact that $$$D_0 = 1$$$ and $$$D_1 = 0$$$, we can use induction to show that $$$D_n = n! \cdot \sum_{i = 0}^n \frac{(-1)^i}{i!}$$$. We can also use the following identity to come up with the same result:

Special case of Mobius inversion
Recursion

Let's count the number of derangements using a recursion. Suppose $$$a_n = k \ne n$$$. We look at $$$a_k$$$. If $$$a_k$$$ is $$$n$$$, then there are $$$D_{n - 2}$$$ ways of assigning the remaining numbers to the permutation. If $$$a_k$$$ is not $$$n$$$, then we note that this reduces to the number of derangements on $$$n - 1$$$ numbers, by renaming $$$n$$$ (to be assigned in the remaining $$$n - 2$$$ slots) to $$$k$$$.

So the recurrence becomes $$$D_n = (n - 1) \cdot (D_{n - 1} + D_{n - 2})$$$, and this can be solved using induction too.
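As a quick sanity check, here is a minimal sketch of this recurrence (the function name and the returned table are my conventions; the values grow factorially and overflow 64 bits past $$$n \approx 20$$$, so for larger $$$n$$$ you would work modulo a prime instead):

```cpp
#include <cstdint>
#include <vector>

// Derangement numbers D_0..D_n via D_n = (n - 1) * (D_{n-1} + D_{n-2}).
std::vector<std::uint64_t> derangements(int n) {
    std::vector<std::uint64_t> d(n + 1);
    d[0] = 1;               // the empty permutation has no fixed points
    if (n >= 1) d[1] = 0;   // [1] is always a fixed point
    for (int i = 2; i <= n; i++)
        d[i] = (std::uint64_t)(i - 1) * (d[i - 1] + d[i - 2]);
    return d;
}
```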

Using the principle of inclusion and exclusion

We can use an inclusion-exclusion argument to solve this problem too. Let $$$S_i$$$ be the set of permutations that leave the $$$i$$$-th element fixed (the remaining are free to be permuted).

Then we need $$$n! - |\cup_i S_i|$$$. Using the inclusion-exclusion principle, since the intersection of any $$$k$$$ different $$$S_i$$$'s has size $$$(n - k)!$$$, we again get the same expression for $$$D_n$$$.

Inversions

An inversion in an array (not necessarily a permutation) is a pair of indices $$$(i, j)$$$ such that $$$i < j$$$ and $$$a_i > a_j$$$. The minimum possible number of inversions is $$$0$$$, achieved by any sorted array (for a permutation, $$$a_i = i$$$). The maximum is $$$n(n - 1)/2$$$, achieved by the reverse-sorted array (for a permutation, $$$a_i = n - i + 1$$$), in which every unordered pair of distinct indices is an inversion.

Let's consider the following task: For a given permutation (or an array with distinct elements), you are allowed to do the following operation any number of times:

  • Pick any $$$1 \le i < n$$$ and swap $$$a_i$$$ and $$$a_{i + 1}$$$.
  1. Is it possible to sort this array using these operations only?
  2. If yes, what is the minimum number of operations if you want to sort this array using these operations only?

Note that if it is possible for a permutation, it is also possible for any array with distinct elements (we can just replace the elements in the array by their ranks when sorting in increasing order).

The answer to the first part is yes. The main idea is to pick the smallest element and keep swapping it with the previous element until it reaches the first position. Then do the same thing recursively for $$$[2, \dots, n]$$$.

Let's think about the number of operations here. How many operations do we do in the first phase? Note that everything coming before $$$1$$$ in the original permutation contributed to an inversion where the second element was $$$1$$$. Now after $$$1$$$ is in its intended place, there will be no more inversions that are related to $$$1$$$ in any way. For the rest of the permutation, we can do this argument recursively. But for that, we need to see what happens to inversions where $$$1$$$ was not involved. Note that the relative order of no other pair changes, so the other types of inversions remain the same! So, the total number of operations is clearly the number of inversions in the permutation.

Is this the minimum number of operations you need to sort the array? The answer turns out to be yes. Consider any operation that is done. Since it flips the order of two adjacent things, there is at most one inversion it can reduce. So the decrease in the number of inversions is at most $$$1$$$. In a sorted array, there are no inversions. So the number of operations must be at least the number of inversions in the original permutation, and thus we are done.

Now you might think — how do we actually count the number of inversions in a permutation? The answer is you can do that using merge sort or by some point-update-range-sum data structure like Fenwick tree or segment tree. The merge sort solution can be found here, and for the data structure solution, consider the following:

  • Initially, you have an array $$$A$$$ of size $$$n$$$ filled with $$$0$$$ (on which the data structure will be constructed).
  • For each $$$i$$$ from $$$1$$$ to $$$n$$$, add to the answer the number of indices $$$j$$$ with $$$a_i < j \le n$$$ and $$$A_j = 1$$$ (a suffix sum, i.e., the count of elements greater than $$$a_i$$$ that appeared earlier), and then set $$$A_{a_i}$$$ to $$$1$$$.
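One way to realize the data structure solution with a Fenwick tree (a sketch; the function name is mine): process elements left to right and, for each one, count how many previously inserted values are greater than it.

```cpp
#include <vector>

// Count inversions of a permutation of 1..n with a Fenwick tree
// (point update, prefix sum), O(n log n) total.
long long count_inversions(const std::vector<int>& a) {
    int n = (int)a.size();
    std::vector<int> bit(n + 1, 0);  // 1-indexed Fenwick tree over values
    long long inversions = 0;
    for (int i = 0; i < n; i++) {
        // i elements were inserted before a[i]; subtract those <= a[i]
        long long leq = 0;
        for (int j = a[i]; j > 0; j -= j & (-j)) leq += bit[j];
        inversions += i - leq;  // earlier elements greater than a[i]
        for (int j = a[i]; j <= n; j += j & (-j)) bit[j]++;
    }
    return inversions;
}
```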

Longest increasing subsequences and Erdos Szekeres theorem

An increasing subsequence of an array $$$a$$$ (not necessarily a permutation) is a sequence of indices $$$i_1 < i_2 < \dots < i_k$$$ such that $$$a_{i_1} < a_{i_2} < \dots < a_{i_k}$$$. Decreasing subsequences are defined similarly.

An algorithm to find a longest increasing (or, with a few modifications, non-decreasing) subsequence of an array can be found in this nice video tutorial. But that is not what we are concerned with at the moment.

We are concerned with bounds on the length of a longest increasing subsequence (LIS from here on). Note that an LIS can be very short: for a decreasing permutation, it has length just $$$1$$$. The Erdos Szekeres theorem tells us that in such cases, the longest decreasing subsequence must be long.

More formally, the theorem states that in any permutation (or array with distinct elements) of size at least $$$xy + 1$$$, there is either an increasing subsequence of length $$$x + 1$$$ or a decreasing subsequence of length $$$y + 1$$$.

The easiest way to prove this theorem is via an application of the Pigeonhole principle.

Suppose for the sake of contradiction that the theorem is false for some permutation $$$a$$$. For every $$$i$$$, consider the length of a longest increasing subsequence ending at index $$$i$$$ and the length of a longest decreasing subsequence ending at index $$$i$$$. Let's call these numbers $$$x_i$$$ and $$$y_i$$$. Note that all $$$x_i$$$-s are integers between $$$1$$$ and $$$x$$$, and all $$$y_i$$$-s are integers between $$$1$$$ and $$$y$$$. So there can be at most $$$xy$$$ distinct pairs $$$(x_i, y_i)$$$. By the Pigeonhole principle, there are $$$i < j$$$ such that $$$(x_i, y_i) = (x_j, y_j)$$$. Since all elements are distinct, $$$a_i < a_j$$$ or $$$a_i > a_j$$$. In the first case, it is impossible that $$$x_i = x_j$$$, and in the latter, it is impossible that $$$y_i = y_j$$$. This is a contradiction, and we are done.

A more sophisticated and deeper proof of this theorem can be done using Dilworth's theorem. This blog uses it to prove a special case of this theorem, though the proof can be modified easily to work for the complete proof too.

Next permutation

Note that permutations are just sequences of integers, so it is possible to sort the set of all possible sequences of size $$$n$$$ lexicographically (i.e., in the order you would find words in a dictionary). This defines a natural indexing of each permutation. How do we find the next permutation from a given permutation?

The easiest way (in C++) is to use std::next_permutation, but we'll briefly describe how it works.

Let's find the first index $$$i$$$ from the right such that $$$a_i > a_{i - 1}$$$. Since $$$i$$$ was the first such index, the elements at indices $$$i$$$ through $$$n$$$ form a decreasing sequence. The smallest number in this suffix that is larger than $$$a_{i - 1}$$$ becomes the new element at position $$$i - 1$$$, and the rest of the suffix (along with the old $$$a_{i - 1}$$$) is sorted in increasing order after it. All of this can be implemented in time $$$O(n - i + 1)$$$.
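The procedure can be sketched as follows (a hand-rolled version of what std::next_permutation does; the function name is mine):

```cpp
#include <algorithm>
#include <vector>

// Advance a to its lexicographic successor; returns false if a was
// already the last (i.e., decreasing) permutation.
bool my_next_permutation(std::vector<int>& a) {
    int n = (int)a.size();
    int i = n - 1;
    while (i > 0 && a[i - 1] >= a[i]) i--;  // a[i..n-1] is non-increasing
    if (i == 0) return false;               // whole array is decreasing
    int j = n - 1;
    while (a[j] <= a[i - 1]) j--;  // smallest suffix element > a[i-1]
    std::swap(a[i - 1], a[j]);
    std::reverse(a.begin() + i, a.end());  // suffix was decreasing: sort it
    return true;
}
```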

Note that starting from the lexicographically smallest permutation $$$[1, 2, \dots, n]$$$, the number of permutations between it and any permutation whose first position $$$i$$$ with $$$a_i \ne i$$$ equals $$$k$$$ (both inclusive) is at least $$$(n - k)! + 1$$$. This means that even if you apply next_permutation a large number of times, only a small suffix of the permutation will ever change (unless you start from a specially chosen permutation, and even then, apart from a single change that can affect all indices, the same conclusion holds).

So if for a permutation of length $$$n$$$, you do the next_permutation operation (as implemented above) $$$O(n^k)$$$ times, the time taken will be (much faster than) $$$O(k n^k \log n)$$$. You can even bound the number of operations to perform next_permutation $$$r$$$ times by $$$O(r + n)$$$. The analysis is similar to the complexity analysis when you increment the binary representation of an $$$n$$$-bit number $$$r$$$ times.

Random permutations

There are a total of $$$n!$$$ permutations of size $$$n$$$. How do we construct a random permutation if all we have is a random number generator that can give us integers in a range we can choose?

For thinking about how we can incrementally construct such a permutation, let's try to remove all elements $$$> k$$$ for some $$$k$$$. Note that the position of $$$k$$$ in this array is equally likely to be any integer from $$$1$$$ to $$$k$$$. However, this won't lead to a very efficient algorithm right away.

We would want to rather construct a random permutation from left to right. Suppose we have a prefix chosen already. Note that the permutations of all elements on the right are equally likely. Also, all elements on the right are equally likely to be the first element of the suffix (i.e., the last element of the just-larger prefix). This observation leads to Durstenfeld's variation of the Fisher-Yates shuffle.

Now let's think about some statistics of random permutations.

Firstly, what is the expected length of the increasing sequence picked greedily by the following algorithm:

  • If there is nothing in the sequence or the new element is greater than the last picked element, pick this element.

Note that an element is picked if and only if it is more than everything before it. That is, we can do the following using linearity of expectation:

$$$E[l] = E[\sum_i(a_i \text{ chosen})] = \sum_i P[a_i \text{ chosen}] = \sum_i P[a_i > \max(a_1, \dots, a_{i - 1})] = \sum_i \frac{1}{i}$$$, which is the harmonic sum, and is approximately $$$\ln n$$$.

However, it can be shown that the expected length of the LIS is $$$\Theta(\sqrt{n})$$$, so a greedy approach is much worse than computing the LIS systematically.

For more random permutation statistics that also have some information on statistics derived from the next few sections, I would refer you to this.

The cycle decomposition perspective

We will now switch gears and note a very important part of the theory of permutations — the cycle decomposition. You will find yourself referring to this quite a lot while solving problems on permutations.

Cycles

Suppose we have a permutation $$$a$$$. Let's fix some $$$i$$$. Let's try to look at the sequence $$$i$$$, $$$a_i$$$, $$$a_{a_i}$$$ and so on. Eventually, it must repeat because there are at most $$$n$$$ values that this sequence can take. For the sake of convenience, let's call the $$$j$$$-th element of this sequence $$$b_j$$$.

Then for some $$$k < l$$$, we have $$$b_k = b_l$$$. Let's take $$$k$$$ to be the smallest such index, and let $$$l$$$ be the smallest such index for that corresponding $$$k$$$. If $$$k$$$ is not $$$1$$$, then we must have $$$b_{k - 1} = b_{l - 1}$$$, because $$$a$$$ is a bijection. However, this contradicts the minimality of $$$k$$$! This means that the first element that repeats in the sequence is, in fact, $$$i$$$.

Now suppose something in the sequence between $$$1$$$ and $$$l$$$ repeats (let's say at positions $$$1 < m < o < l$$$). Then by repeating the bijection argument $$$m - 1$$$ times, we must have $$$b_{o - m + 1} = b_1 = i$$$, so $$$o - m + 1 \ge l$$$. But this is a contradiction. This means that the sequence repeats starting from $$$l$$$ and forms a cycle.

We call this a cycle because if we made the graph where there was a directed edge from $$$i$$$ to $$$a_i$$$, then this $$$i$$$ would be in a cycle of length $$$l - 1$$$.

Now, this was done for a single $$$i$$$. Doing this for all $$$i$$$ means that all elements are in a cycle in this graph. Clearly, since the indegree and outdegree of each $$$i$$$ are $$$1$$$, this means that these cycles are all disjoint.

Let's take an example at this point. Consider the permutation $$$[2, 3, 1, 5, 4]$$$. The cycles for each element are as follows:

  • $$$1 \to 2 \to 3 \to 1$$$
  • $$$2 \to 3 \to 1 \to 2$$$
  • $$$3 \to 1 \to 2 \to 3$$$
  • $$$4 \to 5 \to 4$$$
  • $$$5 \to 4 \to 5$$$

These correspond to the cycles $$$1 \to 2 \to 3 \to 1$$$ and $$$4 \to 5 \to 4$$$ in the graph.

Cycle notation

Note that these cycles are independent. That is, we can just say that for these cycles, any element outside the cycle is irrelevant. So in this sense, for any permutation, we can just list out its cycles and we will be able to determine the permutation uniquely.

So we just represent a permutation as a list of its cycles. For example, the permutation $$$[2, 3, 1, 5, 4]$$$ can be represented as $$$(1 2 3)(4 5)$$$.
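Extracting the disjoint cycles takes $$$O(n)$$$ time by walking from each unvisited index (a sketch; the 1-indexed storage with a[0] unused is my convention):

```cpp
#include <vector>

// Disjoint cycle decomposition of a permutation stored in a[1..n]
// (a[0] unused). Each cycle is listed starting from its smallest element.
std::vector<std::vector<int>> cycles(const std::vector<int>& a) {
    int n = (int)a.size() - 1;
    std::vector<bool> seen(n + 1, false);
    std::vector<std::vector<int>> result;
    for (int i = 1; i <= n; i++) {
        if (seen[i]) continue;
        std::vector<int> cyc;
        for (int j = i; !seen[j]; j = a[j]) {  // follow i -> a_i -> a_{a_i} ...
            seen[j] = true;
            cyc.push_back(j);
        }
        result.push_back(cyc);
    }
    return result;
}
```

For example, the permutation $$$[2, 3, 1, 5, 4]$$$ from above yields the cycles $$$(1 2 3)$$$ and $$$(4 5)$$$.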

Note: there is a deeper meaning behind this notation that is related to composition and how disjoint cycles commute for composition.

Finding $$$k$$$-th next, functional graphs

Consider the following task: for a permutation, compute the value of $$$a_{a_{\ddots_{i}}}$$$ for each $$$i$$$ where there are $$$k$$$ $$$a$$$-s, $$$k \le 10^{18}$$$.

Note that if you can find cycles, you can do a cyclic shift for each cycle (appropriately computed modulo the cycle length) and give the answer.

What if it was not guaranteed that $$$a$$$ is a permutation, but all elements in $$$a$$$ are in $$$[1, n]$$$ nevertheless? The cycle argument breaks down here, but you can show the following fact:

  • Each weakly connected component consists of two parts: a directed cycle and directed trees rooted at distinct vertices in the cycle, directed towards the root (also called reverse-oriented arborescences).

The same problem can be solved in this setting, too, by using binary lifting for each vertex till it reaches the "main" cycle of its component.

In the language of graph theory, such graphs are called functional graphs (graphs in which every vertex has outdegree exactly $$$1$$$).
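A sketch of the binary lifting idea over a functional graph (the struct and member names are mine; $$$60$$$ levels suffice for $$$k \le 10^{18}$$$, and for a permutation this answers the $$$k$$$-th-next queries directly):

```cpp
#include <cstdint>
#include <vector>

// Binary lifting over a functional graph f : {0, ..., n-1} -> {0, ..., n-1}.
// up[j][v] is the vertex reached from v after 2^j steps; building takes
// O(n * levels) time and memory, each query O(levels).
struct KthNext {
    std::vector<std::vector<int>> up;
    explicit KthNext(const std::vector<int>& f, int levels = 60) {
        up.assign(levels, f);  // up[0] = f, i.e., one application
        for (int j = 1; j < levels; j++)
            for (int v = 0; v < (int)f.size(); v++)
                up[j][v] = up[j - 1][up[j - 1][v]];
    }
    int query(int v, std::uint64_t k) const {  // valid for k < 2^levels
        for (int j = 0; k > 0; j++, k >>= 1)
            if (k & 1) v = up[j][v];
        return v;
    }
};
```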

Some problems to practice on are Planet Queries (I and II) on CSES.

Swaps/transpositions

Let's start from the array $$$[1, 2, \dots, n]$$$ and try to apply swaps on this array till we reach some desired permutation $$$a = [a_1, a_2, \dots, a_n]$$$. Note that we can always apply such swaps, by trying to build the permutation from left to right. These swaps are also called transpositions.

Swaps on cycles

Let's now think about what happens when we apply a swap to a permutation. Suppose the swap exchanges the elements at positions $$$i$$$ and $$$j$$$. In the graph associated with this permutation, this is equivalent to breaking the two edges $$$i \to a_i$$$ and $$$j \to a_j$$$ and making the two edges $$$i \to a_j$$$ and $$$j \to a_i$$$ (in a crossing manner).

We make two cases:

  1. $$$i$$$ and $$$j$$$ are in different cycles: this joins the two cycles together.
  2. $$$i$$$ and $$$j$$$ are in the same cycle: this breaks the common cycle into two.

A relevant problem at this point would be 1768D - Lucky Permutation.

Let's consider another permutation sorting problem, but here rather than just adjacent elements being swapped, we allow swapping any elements. What is the minimum number of swaps to sort a permutation this way?

Note that in the final result, there are exactly $$$n$$$ cycles (singletons). Let's say we currently have $$$c$$$ cycles. Since the number of cycles increases by at most 1 each time, there must be at least $$$n - c$$$ swaps to sort this permutation.

Is this achievable? Yes. We can keep swapping elements in the same cycle while there is a cycle of length at least $$$2$$$, and this will sort the permutation.
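The bound above can be computed directly by counting cycles (a sketch; the function name and 1-indexed storage are mine):

```cpp
#include <vector>

// Minimum number of arbitrary swaps needed to sort the permutation
// stored in a[1..n] (a[0] unused): n minus the number of cycles.
int min_swaps_to_sort(const std::vector<int>& a) {
    int n = (int)a.size() - 1;
    std::vector<bool> seen(n + 1, false);
    int cycles = 0;
    for (int i = 1; i <= n; i++) {
        if (seen[i]) continue;
        cycles++;
        for (int j = i; !seen[j]; j = a[j]) seen[j] = true;
    }
    return n - cycles;
}
```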

The permutation composition perspective

Definition

Suppose we have two permutations $$$a$$$ and $$$b$$$ of the same size $$$n$$$. Now we want to define a composition of them as follows: $$$ab$$$ is defined as the array $$$c$$$ where $$$c_i = a_{b_{i}}$$$.

This essentially corresponds to finding the composition of the $$$g$$$-functions of $$$a$$$ and $$$b$$$, with $$$g$$$ for $$$b$$$ being applied first.

Let's get an intuitive sense of what it does, by getting some information out of the definition. Let's first deal with the case when $$$b$$$ is the "identity" permutation: $$$b_i = i$$$. Then $$$c = a$$$ according to this definition. Similarly, if $$$a$$$ is the identity permutation, then $$$c = b$$$. So, composing with the identity permutation in any order gives the original permutation back.
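In code, the definition $$$c_i = a_{b_i}$$$ is a one-liner per element (a sketch; 1-indexed with a[0] unused, and the function name is mine):

```cpp
#include <vector>

// Composition (ab)_i = a_{b_i} for permutations stored in a[1..n], b[1..n].
std::vector<int> compose(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> c(a.size());  // c[0] stays 0
    for (int i = 1; i < (int)a.size(); i++) c[i] = a[b[i]];
    return c;
}
```

You can check that composing with the identity on either side returns the original permutation, as discussed above.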

Now let's understand this by using the two-line notation. The idea is to take the two-line notation for both permutations and do some sort of pattern matching to get to their composition.

Consider the following permutations $$$a$$$ and $$$b$$$.

\begin{equation*} \begin{pmatrix} 1 & 2 & \cdots & n \\ a_1 & a_2 & \cdots & a_n \end{pmatrix} \end{equation*}

and

\begin{equation*} \begin{pmatrix} 1 & 2 & \cdots & n \\ b_1 & b_2 & \cdots & b_n \end{pmatrix} \end{equation*}

Let's reorder the columns of the two-line notation for $$$a$$$ so that the lower row of $$$b$$$ and the upper row of $$$a$$$ match. This corresponds to shuffling the columns of $$$a$$$ according to $$$b$$$:

\begin{equation*} \begin{pmatrix} b_1 & b_2 & \cdots & b_n \\ a_{b_1} & a_{b_2} & \cdots & a_{b_n} \end{pmatrix} \end{equation*}

Note that this is still a valid two-line notation for $$$a$$$. Now, if we "merge" this with the two-line notation for $$$b$$$, this gives us the two-line notation for the composition of $$$a$$$ and $$$b$$$.

Note how this idea of composition arises naturally from the interpretation of a permutation as a bijection on a set.

We start with the identity permutation, apply the first permutation to it, and then apply the second permutation to it. Applying means replacing $$$x \mapsto a_x$$$. The two-line notation for a permutation $$$a$$$ has the following property:

  • If the permutation in the first row is $$$p$$$, then the permutation in the second row is $$$ap$$$.

Examples — cycles and swaps

Note that we can associate a cycle itself with a permutation, where the non-cycle elements are fixed points, and the remaining are mapped according to the cycle. In this sense, cycles are permutations.

Also note that a swap itself is a permutation in the same manner. In fact, a swap is just a cycle of length 2.

It is helpful to play around with a few examples of swaps and cycles and write them in two-line notation to understand their mappings and try composing them.

At this point, you should also try to recall the following two things:

  1. When we introduced cycles, we wrote them in a certain format. In fact, you can compose disjoint cycles without worrying about the order in which they are composed. This should be obvious from the fact that they are independent in a way (an element that gets moved in one permutation is a fixed point of the other). You can also see that the notation $$$(1 2 3)(4 5)$$$ also conveniently captures that this permutation is a composition of the cycles $$$(1 2 3)$$$ and $$$(4 5)$$$.

  2. When we discussed swaps, we noted that you could "apply" swaps. This corresponds to left-multiplying the permutation with the corresponding swap permutation. It is instructive to come up with a few examples of swaps and how they compose at this point.

It is important to note that (by using the fact that function composition is associative) permutation composition is associative.

Let's say we arrive at a permutation by applying the following three permutations in this order: $$$(12)$$$, $$$(312)(43)$$$, and $$$(53241)$$$. We would write it as $$$(53241)(312)(43)(12)$$$, with the first permutation applied being the rightmost. Note that here $$$(12)$$$ and $$$(43)$$$ are cycles of length $$$2$$$, i.e., they are swaps.

Now how do we reduce this back to a normal permutation? One way would be to write out the permutation for each cycle and then keep composing it. A cleaner way of doing the same thing would be to do this: for every element, start from the right, and replace it with what the cycle maps it to. For instance, if we want to find the image of $$$1$$$, the first cycle maps $$$1$$$ to $$$2$$$, the second cycle maps $$$2$$$ to $$$2$$$, the third cycle maps $$$2$$$ to $$$3$$$, and the fourth cycle maps $$$3$$$ to $$$2$$$.

Parity of permutations

Recall that we mentioned earlier that an adjacent swap reduces the number of inversions by at most $$$1$$$. Well, to be more precise, it changes the number of inversions by $$$\pm 1$$$, and hence flips the parity of the number of inversions of the permutation.

We define the parity of the permutation to be the parity of the number of inversions of the permutation.

The reason why this is important is that it gives a handle on the parity of swaps (not just adjacent swaps) as well as some information about the cycle parities, as follows.

Let's consider a swap that swaps elements at positions $$$i < j$$$. Then a way to get to this would be to apply $$$j - i$$$ adjacent swaps to bring $$$a_j$$$ to position $$$i$$$, and $$$j - i - 1$$$ adjacent swaps to bring $$$a_i$$$ (from its new position) to position $$$j$$$. This takes an odd number of total adjacent swaps, so the final number of inversions is also flipped.

As a corollary, applying any swap changes the parity of the permutation, and in particular, all swap permutations have odd parity (parities add up modulo $$$2$$$ over composition, as can be seen easily).

Now note that for a cycle, we can construct it by doing a few swaps. More specifically, if the cycle is $$$(i_1, i_2, \dots, i_k)$$$, then we can do $$$k - 1$$$ swaps to apply the same transformation as the cycle permutation.

So all even-sized cycles have odd parity, and all odd-sized cycles have even parity.

Let's consider any decomposition of a permutation into (possibly non-disjoint) cycles. As a result of the above discussion, the parity of the permutation is odd iff the number of even-sized cycles is odd.

In particular, for a decomposition of a permutation into swaps, the parity of the permutation is the same as the parity of the number of swaps in the decomposition. Note that this also implies that we can't have two decompositions with an odd difference in the number of swaps.

Rephrasing the above result, we have the fact that if it takes an odd number of swaps to get to a permutation, the permutation has odd parity and vice versa.
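Since a cycle of length $$$k$$$ decomposes into $$$k - 1$$$ swaps, the parity can be computed in $$$O(n)$$$ from the cycle count instead of counting inversions (a sketch; the function name and 1-indexed storage are mine):

```cpp
#include <vector>

// Parity of the permutation stored in a[1..n] (a[0] unused):
// a k-cycle is k - 1 swaps, so the total swap count is n minus the
// number of cycles. Returns 0 for even, 1 for odd.
int parity(const std::vector<int>& a) {
    int n = (int)a.size() - 1;
    std::vector<bool> seen(n + 1, false);
    int cycles = 0;
    for (int i = 1; i <= n; i++) {
        if (seen[i]) continue;
        cycles++;
        for (int j = i; !seen[j]; j = a[j]) seen[j] = true;
    }
    return (n - cycles) % 2;
}
```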

Parity under composition

Clearly, from the above discussion, it can be seen that parities add modulo $$$2$$$ over composition. This, however, is important enough to warrant its own section because it summarizes all of the discussion above.

Inverse of permutations

Firstly, let's think about permutation in an application sense. Is there a way to get back the original permutation after some permutation has been applied to it? From our remarks on the two-line notation, it suffices to do this for the initial permutation being the identity permutation. Let's say the applied permutation was $$$a$$$:

\begin{equation*} \begin{pmatrix} 1 & 2 & \cdots & n \\ a_1 & a_2 & \cdots & a_n \end{pmatrix} \end{equation*}

By our remarks on composition using the two-line notation, we need something that swaps these two rows as follows:

\begin{equation*} \begin{pmatrix} a_1 & a_2 & \cdots & a_n\\ 1 & 2 & \cdots & n \end{pmatrix} \end{equation*}

Note that this is easy: consider the position array $$$p$$$ of $$$a$$$ as we defined earlier; i.e., $$$p_i$$$ is the position where $$$a$$$ equals $$$i$$$, i.e., $$$a_{p_i} = i$$$. Note that $$$a$$$ is $$$a_i$$$ at $$$i$$$, so $$$p_{a_i}$$$ is the position where $$$a$$$ is $$$a_i$$$, i.e., $$$p_{a_i} = i$$$. This shows that $$$pa = ap$$$, and this common value is the identity permutation (which we shall denote by $$$\text{id}$$$ in what follows).

In other words, $$$p$$$ and $$$a$$$ are inverses of each other under composition. It is trivial to note that $$$a$$$ is the position array of $$$p$$$ as well. We denote $$$p$$$ by $$$a^{-1}$$$.
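Computing the inverse (position array) is a single pass (a sketch; 1-indexed with a[0] unused, and the function name is mine):

```cpp
#include <vector>

// Inverse of the permutation stored in a[1..n]: p[a[i]] = i.
std::vector<int> inverse(const std::vector<int>& a) {
    std::vector<int> p(a.size());  // p[0] stays 0
    for (int i = 1; i < (int)a.size(); i++) p[a[i]] = i;
    return p;
}
```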

Now that we have some intuition about the inverse in terms of the two-line notation, we think about it in terms of cycles, swaps, and compositions:

  1. The inverse of a cycle $$$(i_1, i_2, \dots, i_k)$$$ is simply $$$(i_k, \dots, i_2, i_1)$$$. This is done by reversing all the edges in the related directed graph.
  2. The inverse of a swap is itself. This also follows from the fact that a swap is a 2-cycle.
  3. The inverse of a composition $$$ab$$$ is $$$b^{-1}a^{-1}$$$. This corresponds to undoing the permutation application on a stack and also makes algebraic sense: $$$abb^{-1}a^{-1} = aa^{-1} = \text{id} = b^{-1}b = b^{-1}a^{-1}ab$$$.

In particular, for a decomposition of a permutation into (not necessarily disjoint) cycles, we can just take a mirror image of the decomposition, and it will correspond to the inverse of the permutation!

Also, the parity of the inverse is the same as the parity of the original permutation.

Involutions

An involution is a permutation whose inverse is itself. Consider the disjoint cycle decomposition of the permutation $$$a = c_1 \cdots c_k$$$. Then $$$a = a^{-1}$$$ means $$$\text{id} = a^2 = \prod_i c_i^2$$$ (we can write this because disjoint cycles commute, and so do identical cycles). Each $$$c_i^2$$$ must be the identity map on the elements of $$$c_i$$$, since otherwise the product could not be $$$\text{id}$$$. This means that each $$$c_i$$$ has size $$$1$$$ or $$$2$$$. Conversely, it is easy to check that every such permutation is an involution.

$$$k$$$-th power of permutations

Since permutation composition is associative, we can use binary exponentiation to compute the $$$k$$$-th power of a permutation in $$$O(n \log k)$$$ time. A more mathematical way to think about it is to use the same trick as in the previous subsection. With definitions the same as in the previous section, for any $$$a$$$, $$$a^k = \prod_i c_i^k$$$, so the cycle structure of $$$a^k$$$ is determined by the powers of the individual cycles.

Finding the $$$k$$$-th power of a cycle just corresponds to mapping each element to its $$$k$$$-th next neighbour in the cycle. If the length of the cycle is $$$c$$$, and $$$g = \gcd(c, k)$$$, then the disjoint cycle decomposition of its $$$k$$$-th power consists of $$$g$$$ cycles of length $$$c / g$$$ each, with each of them corresponding to a stride of $$$k$$$ along the original cycle.

Order of permutations

The order of a permutation $$$a$$$ is defined as the least $$$k$$$ such that $$$a^k$$$ is the identity permutation. Let's again look at the cycle decomposition as in the previous subsection. For a cycle $$$c$$$ of length $$$l$$$ to be "dissolved" into singletons, it is necessary and sufficient to have the power of the permutation divisible by $$$l$$$. Hence, the order of the permutation is just the LCM of all cycle lengths.

Square-root of permutations

For this section, we will consider the following problem: 612E - Square Root of Permutation. The solution is left as an exercise to the reader (it shouldn't be hard, considering what we have discussed in the last few sections).

Conjugation and conjugacy classes

Suppose $$$a, b$$$ are two permutations. We think about the following permutation: $$$aba^{-1}$$$. If we think about how the two-line notation for these permutations combines, we can note that this is just the permutation whose cycle decomposition is almost the same as $$$b$$$, but $$$a$$$ is applied to each element in the cycles. In other words, we "reinterpret" the cycles by mapping to and from a separate "ground set" via permutation $$$a$$$.

For a formal statement, let the cycle decomposition of $$$b$$$ be $$$c_1 \cdots c_k$$$, where $$$c_i = (x_{i1}, \dots, x_{il_i})$$$. We claim that $$$aba^{-1}$$$ is $$$c'_1 \cdots c'_k$$$ where $$$c'_i = (a_{x_{i1}}, \dots, a_{x_{il_i}})$$$.

The proof formalizes the earlier argument. Consider an element $$$i$$$, and we look at what happens to $$$i$$$ as each permutation is applied to it:

  1. $$$aba^{-1}$$$: There is a $$$j$$$ such that $$$a_j = i$$$ since $$$a$$$ is a permutation. So $$$(aba^{-1})_i = a_{b_{a^{-1}_i}} = a_{b_j}$$$.
  2. $$$c'_1 \cdots c'_k$$$: Without loss of generality, let's say $$$i$$$ is the first element of $$$c'_1$$$, i.e., $$$i = a_{x_{11}}$$$. Then $$$j = x_{11}$$$. Now $$$i$$$ is mapped to $$$a_{x_{12}} = a_{b_{x_{11}}} = a_{b_j}$$$.

Since both of them give the same result, we are done.

This says that the operation $$$aba^{-1}$$$ is nothing but a translation of the labels of $$$b$$$ according to $$$a$$$.

Note that by this result, since we are just applying the permutation to everything inside the cycles, we can map any product of disjoint cycles to any other product of disjoint cycles, provided the multiset of cycle sizes doesn't change.

Let's call $$$b$$$ and $$$c$$$ conjugate if there is an $$$a$$$ such that $$$c = aba^{-1}$$$, and call this operation conjugation.

Thus this leads to a partition of all permutations into equivalence classes (after verifying a couple of axioms), where everything inside a class is conjugate to everything else in the class. Note that any equivalence class is uniquely determined by the multiset of cycle sizes. These equivalence classes are called conjugacy classes.

Miscellaneous topics

This section will mostly be about some very brief references and is intended to tell people about some topics they might not have seen or heard about before, which are relevant to this discussion. There might be blogs in the future regarding these topics, but it's still a good idea to read up on these yourself.

Group theory

The set of permutations of size $$$n$$$ forms what is called a group and is denoted by $$$S_n$$$, and it turns out that all finite groups of size $$$n$$$ are isomorphic to some subgroup of $$$S_n$$$ (this is Cayley's theorem). This is why there has been a lot of work on permutations from a group theoretic perspective. The section on permutation composition is best viewed under a group theoretic lens.

The analysis of a lot of games can be done using group theory, for instance, the 15 puzzle, the Rubik's cube, Latin squares, and so on.

There is a subgroup of $$$S_n$$$ consisting of all even permutations, called the alternating group. This group is known to be simple for $$$n \ge 5$$$ and can be used to show results about the 15 puzzle, for example.

Counting: Burnside, Polya enumeration, Stirling numbers

Using group theoretic ideas, the Burnside lemma and the Polya enumeration theorem come into the picture, and they help to count classes of objects (equivalent under some relation) rather than objects themselves.

Stirling numbers are quite useful when counting things related to permutations, and they deserve a whole blog for themselves.

Coordinate compression

Note that in the above discussions, especially in the section on ordering, we used arrays of distinct elements and permutations interchangeably. Whenever something only depends on the $$$\le, <, >, \ge, =$$$ relations between the elements and not the elements themselves, it makes sense to replace elements with their rank. In the case of distinct elements, this "compressed version" becomes a permutation.

Connected component DP

The failed idea from the section on generating permutations can also be used for DP. More details can be found in this blog.

Young tableaux

Closely related to the LIS and other ordering concepts is the idea of Young tableaux. They have a very interesting theory, and this blog is a good reference for some of the results.

Schreier-Sims algorithm

Suppose you have a subgroup of $$$S_n$$$ generated by some permutations, and you want to find its size. The Schreier-Sims algorithm computes this size efficiently, and a good blog can be found here.

Permutation tree

There is a cool and interesting data structure that can do different kinds of queries on permutations and subarrays. This blog is a good introduction to the data structure and what it can do.

Problems

I have added some standard problems wherever suitable, but to practice permutations well, it makes sense to work through a larger set of problems. Since there are very many problems on permutations, adding just the few that I know of would not do justice to the variety of problems that can be made on them. A couple of possible ways of practicing are the following:

  1. Search on the CF problemset for problems with the name permutation in them.
  2. Look out for permutation problems on sources for standard problems like CSES.

Full text and comments »

  • Vote: I like it
  • +414
  • Vote: I do not like it

By nor, 16 months ago, In English

Disclaimer: This is not an introduction to greedy algorithms. Rather, it is only a way to formalize and internalize certain patterns that crop up while solving problems that can be solved using a greedy algorithm.

Note for beginners: If you're uncomfortable with proving the correctness of greedy algorithms, I would refer you to this tutorial that describes two common ways of proving correctness — "greedy stays ahead" and "exchange arguments". For examples of such arguments, I would recommend trying to prove the correctness of standard greedy algorithms (such as choosing events to maximize number of non-overlapping events, Kruskal's algorithm, binary representation of an integer) using these methods, and using your favourite search engine to look up more examples.

Have you ever wondered why greedy algorithms sometimes magically seem to work? Or find them unintuitive in a mathematical sense? This post is meant to be an introduction to a mathematical structure called a greedoid (developed in the 1980s, but with certain errors that were corrected as recently as 2021), that captures some intuition behind greedy algorithms and provides a fresh perspective on them.

The idea behind this blog is to abstract out commonly occurring patterns into a single structure, which can be focused upon and studied extensively. This helps by making it easy to recognize such arguments in many situations, especially when you're not explicitly looking for them. Note that this blog is in no way a comprehensive collection of all results about greedoids; for a more comprehensive treatment, I would refer you to the references as well as your favourite search engine. If there is something unclear in this blog, please let me know in the comments.

For readers familiar with matroids

To get an idea of how wide a variety of concepts greedoids are connected to: they come up in BFS, Dijkstra's algorithm, Prim's algorithm, Kruskal's algorithm, the blossom algorithm, ear decompositions of graphs, posets, machine scheduling, convex hulls, linear algebra, and so on, some of which we shall discuss in this blog.

Note that we will mainly focus on intuition and results, and not on proofs of these properties, because they tend to become quite involved and there is probably not enough space to discuss them here. However, we will give sketches of proofs of properties that are quite important in developing an intuition for greedoids.

Table of contents

  1. Motivation and definition
    • Unordered version of conditions on optimality
    • Weaker conditions on optimality of the greedy algorithm
  2. Some examples
    • Interval greedoids
    • Matroids
    • Antimatroids
    • Other greedoids
  3. Constructing more greedoids
    • From matroids
    • From antimatroids
    • From greedoids
  4. Some random facts about greedoids
  5. References and further reading

Motivation and definition

We will take an approach in an order that is different from most treatments of greedoids (which throw a definition at you), and try to build up to the structure that a greedoid provides us.

What does a typical greedy algorithm look like? Initially, we have made no decisions. Then we perform a sequence of steps. After each step, we have a string of decisions that we have taken (for instance, picking some element and adding it to a set). We also want a set of final choices (so that we can't extend beyond those choices).

To make this concept precise, we define the following:

  1. A ground set is any set. We usually interpret this as a set of incremental decisions that can be made.
  2. A language is any set of finite sequences (or words) of elements coming from the ground set. We interpret a language as a set of possible sequences of decisions we can take in the greedy algorithm.
  3. A simple language is any language where the sequences don't have repeated elements. This definition is motivated from the fact that we don't take a decision more than once.
  4. A hereditary language is any language $$$L$$$ on a ground set $$$S$$$ satisfying the following two conditions:
    • $$$\emptyset \in L$$$
    • For any $$$s \in L$$$, every prefix of $$$s$$$ is also in $$$L$$$. This definition is motivated from the fact that we want to capture a sense of time/causality in our sequence of decisions. That is, if we have a sequence of decisions, we must have arrived at it from the just-smaller prefix of decisions.
  5. A basic word in $$$L$$$ is any $$$s \in L$$$ such that there is no element $$$e$$$ in $$$S$$$ such that $$$s$$$ extended by that element (denoted by $$$s \cdot e$$$ or $$$se$$$) is in $$$L$$$. This is motivated from the fact that we want to have some ending states at the end of the greedy algorithm.

We can now rephrase the optimization problems that a large set of greedy algorithms try to solve in the following terms:

  • Consider a simple hereditary language $$$L$$$ on the ground set $$$S$$$, and a function $$$w : L \to \mathbb{R}$$$. Now maximize $$$w(s)$$$ where $$$s$$$ is a basic word of $$$L$$$.

A (certain quite general type of) greedy algorithm looks something like this:

  • Initially there are no decisions taken.
  • At step $$$i$$$, when the decisions $$$x_1, \dots, x_{i - 1}$$$ have already been taken, pick an $$$x_i \in S$$$ such that $$$x_1 x_2 \cdots x_i \in L$$$, and this choice of $$$x_i$$$ maximizes $$$w(x_1 x_2 \cdots x_i)$$$ over all valid choices of $$$x_i$$$. If there is no such $$$x_i$$$, terminate the algorithm.

Of course, taking arbitrary $$$w$$$ doesn't make any sense, so we limit $$$w$$$ to the following kinds of functions (we shall relax these conditions later, so that we can reason about a larger variety of problems):

  1. If at a certain point, a decision $$$x$$$ is optimal, then it will also be optimal at a later stage. That is, if $$$x$$$ is chosen when the prefix was $$$s$$$, then for a prefix $$$st$$$, we have $$$w(stxu) \ge w(styu)$$$ for all valid $$$y, u$$$ ($$$u$$$ can be empty, $$$y$$$ is a decision).
  2. It is optimal to pick $$$x$$$ sooner than later. That is, if $$$x$$$ is chosen when the prefix was $$$s$$$, then we must have $$$w(sxtyu) \ge w(sytxu)$$$ for all valid $$$y, t, u$$$ ($$$t, u$$$ can be empty, $$$y$$$ is a decision).

To this end, we define greedoids as follows:

A greedoid is a simple hereditary language $$$L$$$ on a ground set $$$S$$$ that satisfies an "exchange" property of the following form:

  • If $$$s, t \in L$$$ satisfy $$$|s| > |t|$$$, then there is a decision in $$$s$$$ that we can append to $$$t$$$, so that the resulting word is also in $$$L$$$.

This extra condition is crucial for us to be able to prove the optimality of greedy solutions using an exchange argument (the usual exchange argument that people apply when proving the correctness of greedy algorithms).

One notable result is that by this extra condition, all basic words in $$$L$$$ have the same length. This might seem like a major downgrade from the kind of generality we were going for, but it still handles a lot of cases and gives us nice properties to work with.

For people familiar with matroids

It turns out that greedoids have the following equivalent definitions which will be useful later on (we omit the proofs, since they are quite easy):

  1. Let a set variant of the greedoid be defined as follows: a pair $$$(S, F)$$$ of a ground set $$$S$$$ and a family of its subsets $$$F$$$, such that $$$\emptyset \in F$$$ and for any $$$A, B \in F$$$ with $$$|A| > |B|$$$, there exists an $$$s \in A \setminus B$$$ such that $$$B \cup \{s\} \in F$$$. Then the original definition of a greedoid and this definition of a set variant are equivalent, in the sense that for any such $$$F$$$, there is exactly one simple hereditary language $$$L$$$ such that the family of sets of decisions underlying the words of $$$L$$$ is precisely $$$F$$$. (That is, when we replace each sequence in $$$L$$$ with its unordered version, the resulting family of sets is $$$F$$$.)
  2. We can replace the second condition in the definition of the set variant by the condition that all maximal sets have the same cardinality. Here, a maximal set is a set in $$$F$$$ such that adding any element to it gives a set not in $$$F$$$.

There is also a set analogue for simple hereditary languages: $$$\emptyset \in F$$$ and for each $$$X \in F$$$, we have an element $$$x \in X$$$ such that $$$X \setminus \{x\} \in F$$$. The intuition remains the same. Note that again, we don't need this to hold for all $$$x \in X$$$, but at least one $$$x \in X$$$.

But why do we care about greedoids at all? The answer lies in the following informally-stated two-part theorem:

Theorem 1:

  1. If $$$L$$$ is a greedoid on $$$S$$$, and $$$w$$$ satisfies the conditions mentioned earlier, then the greedy algorithm gives the correct answer.
  2. If the greedy algorithm gives the correct answer for all $$$w$$$ satisfying the conditions mentioned earlier, and if all basic words of $$$L$$$ have the same length, then $$$L$$$ is a greedoid on $$$S$$$.
Brief proof sketch

Note that the definition of greedoid doesn't depend on $$$w$$$ in any way, so it can be studied as a combinatorial structure in its own right — this leads to quite non-trivial results at times. However, when we think of them in the greedy sense, we almost always have an associated $$$w$$$.

In a lot of cases, $$$w$$$ has a much simpler (and restrictive) structure, for instance, having a positive and additive weight function (which is defined on each element). In that case, the following algorithm works for matroids (special cases of greedoids): sort elements in descending order of their weight, and pick an element iff adding it to the set of current choices is still in the matroid.

Unordered version of conditions on optimality

Usually, $$$w$$$ is not defined on a sequence of steps, but on the set of choices you have made. However, in the general case, the "natural" relaxation from the ordered version to the unordered version of greedoids fails (both in being equivalent and in giving the optimal answer). In fact, this error was present in the original publications (since 1981) and was corrected fairly recently in 2021. For a more detailed discussion (and the proofs of the following few theorems), please refer to $$$[1]$$$, which we shall closely follow.

We shall start out with some special cases:

Theorem 2: Consider a greedoid $$$F$$$ (unordered) on a ground set $$$S$$$. Suppose $$$c$$$ is a utility function from $$$S$$$ to $$$\mathbb{R}$$$, so that the weight of a set of decisions is additive over its elements. Then the greedy algorithm gives an optimal solution iff the following is true:

  • If $$$A, A \cup \{x\} \in F$$$ and $$$B$$$ is an unordered basic word (i.e., maximal set) in $$$F$$$ such that $$$A \subseteq B$$$ and $$$x \in S \setminus B$$$, then there is a $$$y \in B \setminus A$$$ such that $$$A \cup \{y\} \in F$$$ and $$$B \cup \{x\} \setminus \{y\}$$$ is also a maximal set.

The following theorem generalizes the sufficiency:

Theorem 3: Consider a greedoid $$$F$$$ (unordered) on a ground set $$$S$$$ and $$$w : 2^S \to \mathbb{R}$$$. The greedy algorithm gives an optimal solution if (and not iff) the following is true:

  • If for some $$$A, A \cup \{x\} \in F$$$ and $$$B$$$ being an unordered basic word (i.e., maximal set) in $$$F$$$ such that $$$A \subseteq B$$$ and $$$x \in S \setminus B$$$, it holds that $$$w(A \cup \{x\}) \ge w(A \cup \{u\})$$$ for all valid $$$u$$$, then there is a $$$y \in B \setminus A$$$ such that $$$A \cup \{y\} \in F$$$ and $$$B \cup \{x\} \setminus \{y\}$$$ is also a maximal set and $$$w(B \cup \{x\} \setminus \{y\}) \ge w(B)$$$.

This theorem can be used to show optimality of Prim's algorithm on its corresponding greedoid.

This theorem can also be derived from theorem 6 that we shall mention later on.

Remember that we mentioned that the original claim in the 1981 paper about greedoids was false? It turns out that the claim is true about a certain type of greedoids, called local forest greedoids. Since it is not so relevant to our discussion, we're going to spoiler it to avoid impairing the readability of what follows.

Local forest greedoids and some optimality theorems
Weaker conditions on optimality of the greedy algorithm

As we had mentioned earlier, the constraints on $$$w$$$ while motivating the definition of a greedoid (ordered) are quite stringent, and they might not hold in a lot of cases. Here, we work upon reducing the set of constraints (which will also have implications on theorem 3 earlier).

Theorem 6: Let $$$L$$$ be a greedoid on a ground set $$$S$$$, and $$$w$$$ be an objective function that satisfies the following condition:

  • If $$$ax \in L$$$ is such that $$$w(ax) \ge w(ay)$$$ for every valid $$$y$$$ and $$$c = azb$$$ is a basic word, then there exists a basic word $$$d = axe$$$ such that $$$w(d) \ge w(c)$$$.

Then the greedy algorithm gives an optimal solution.

The proof is quite easy, and goes by way of contradiction. There is a special case of the above theorem (that needs some care to prove), which is applicable in a lot of cases. It turns out that it corresponds to the last part of the proof sketch of Theorem 1, so maybe it's not such a bad idea to read it again.

Some examples

The main point of adding examples at this point is to showcase the kinds of greedoids that exist in the wild and some of their properties, so that it becomes easy to recognize these structures. I have added the examples in spoiler tags, just to make navigation easier. Some of these examples will make more sense after reading the section on how to construct greedoids.

While going through these examples, I encourage you to find some examples of cost functions that correspond to greedy optimality, so that you can recognize them in the future. Discussing them in the comments can be a great idea too. For a slightly larger set of examples, please refer to Wikipedia and the references.

When appropriate, we will also point out greedy algorithms that are associated with these structures.

Interval greedoids

These are greedoids that satisfy the local union property (as introduced in the section of local forest greedoids). Equivalently, they are also defined as greedoids where for every $$$A \subseteq B \subseteq C$$$ with $$$A, B, C \in F$$$, if $$$A \cup \{x\}, C \cup \{x\} \in F$$$ for some $$$x \not \in C$$$, then $$$B \cup \{x\} \in F$$$.

Some examples are:

  1. All local forest greedoids.
  2. Matroids (to be introduced later).
  3. Anti-matroids (to be introduced later).
Directed branching greedoid
Undirected branching greedoid
Clique elimination greedoid

Matroids

Matroids are greedoids that satisfy a stronger property than the interval property for interval greedoids, by removing the lower bounds. More precisely, a matroid is a greedoid that satisfies the following property:

  • For every $$$B \subseteq C$$$ with $$$B, C \in F$$$, if $$$C \cup \{x\} \in F$$$ for some $$$x \not \in C$$$, then $$$B \cup \{x\} \in F$$$.

Equivalently, they are greedoids where in the unordered version, for each $$$X \in F$$$, and for all (not at least one) $$$x \in X$$$ we have $$$X \setminus \{x\} \in F$$$. In other words, they are downwards closed. In such a context, elements of $$$F$$$ are also called independent sets.

Intuitively, they try to capture some concept of independence. In fact, matroids came up from linear algebra and graph theory, and greedoids were a generalization of them when people realized that matroids are too restrictive, and downwards-closedness is not really important in the context of greedy algorithms. The examples will make this notion clearer.

Note that for matroids, it is sufficient to define a basis (the set of maximal sets in $$$F$$$), and downwards closedness handles the rest for us.

Matroids have a much vaster theory compared to greedoids, due to being studied for quite a long time and being more popular in the research community. Since this is a blog only on greedy algorithms, we will refrain from diving into matroid theory in too much detail. Interested readers can go through the references for more.

Some examples of matroids are as follows.

Free matroid
Uniform matroid
Graphic matroid
Transversal matroid
Gammoid
Algebraic matroid
Vector matroid
Column matroid

Antimatroids

Antimatroids are greedoids that satisfy a stronger property than the interval property for interval greedoids, by removing the upper bounds. More precisely, an antimatroid is a greedoid that satisfies the following property:

  • For every $$$A \subseteq B$$$ with $$$A, B \in F$$$, if $$$A \cup \{x\} \in F$$$ for some $$$x \not \in B$$$, then $$$B \cup \{x\} \in F$$$.

Unlike matroids, this does not necessarily imply upward closure, but it does imply closure under unions (which is another way to define antimatroids).

Another definition of antimatroids that is in terms of languages (and gives some intuition about their structure) calls a simple hereditary language over a ground set an antimatroid iff the following two conditions hold:

  1. Every element of the ground set must appear in a word in the language.
  2. The exchange property holds for any two words $$$a, b$$$ such that $$$a$$$ has an element not in $$$b$$$, rather than only restricting it to $$$|a| > |b|$$$.

Using this, we note that the basic words are all permutations of the ground set.

Another piece of intuition (derived from the above definition) that helps with antimatroids is the fact that when we try constructing sets in an antimatroid, we keep adding elements to a set, and once we can add an element, we can add it any time after that.

It also makes sense to talk about antimatroids in the context of convex geometries. To understand the intuition behind this, we take the example of a special case of what is called a shelling antimatroid. Let's consider a finite set of $$$n$$$ points in the 2D plane. At each step, remove a point from the convex hull of the remaining points. The sequences of removed points clearly form an antimatroid by the above intuition. In fact, if instead of 2D, we embed the ground set in a space of high enough dimension, we can get an antimatroid isomorphic to any given antimatroid!

What is so special about convexity here? Turns out we can use some sort of "anti-exchange" rule to define convex geometries in the following manner. It will roughly correspond to this fact: if a point $$$z$$$ is in the convex hull of a set $$$X$$$ and a point $$$y$$$, then $$$y$$$ is outside the convex hull of $$$X$$$ and $$$z$$$.

Let's consider a set system closed under intersection, which has both the ground set and empty set in it. Any subset of the ground set (doesn't matter whether it is in the system or not) has a smallest superset in such a set system (and it is the smallest set from the system that contains the subset). Let's call this mapping from the subset to its corresponding smallest superset in the system $$$\tau$$$. Then we have the following properties:

  1. $$$\tau(\emptyset) = \emptyset$$$
  2. $$$A \subseteq \tau(A)$$$
  3. $$$A \subseteq B \implies \tau(A) \subseteq \tau(B)$$$
  4. $$$\tau(\tau(A)) = \tau(A)$$$

Now for such a set system, it is called a convex geometry if the following "anti-exchange" property holds:

  • If some distinct $$$y, z \not \in \tau(X)$$$, but $$$z \in \tau(X \cup \{y\})$$$, then $$$y \not \in \tau(X \cup \{z\})$$$.

Intuitively, consider $$$\tau$$$ as a convex hull operator. Then the anti-exchange property simply means that if $$$z$$$ is in the convex hull of $$$X$$$ and $$$y$$$, then $$$y$$$ is outside the convex hull of $$$X$$$ and $$$z$$$.

Now it turns out that the complements of all sets in an antimatroid form a convex geometry! This is not too hard to prove.

We shall now give some examples of antimatroids.

Chain antimatroid
Poset antimatroid
Shelling antimatroid
Perfect elimination antimatroid
Point-line search antimatroid
Cover antimatroid
Point search antimatroid
Line search antimatroid
Undirected point/line search antimatroid
Capacitated point search antimatroid
Transitivity antimatroid and shelling of an acyclic matroid
Chip-firing antimatroid
Universal antimatroid

Other greedoids

Now that we got a few special types of greedoids out of the way, we shall look at some more greedoids.

Perfect elimination greedoids on bipartite graphs
Retract, monoid retract and dismantling greedoids
Bipartite matching greedoid
Linking greedoid
Gaussian elimination greedoid
Twisted and dual-twisted matroids (which are not matroids)
Ear decomposition greedoid (undirected and directed)
Blossom greedoid

Constructing more greedoids

Note that some constructions from polymatroids/matroids etc. were already mentioned in the section on examples, so we will focus on the ones that were not.

From matroids
Truncation
Duality
Restriction
Contraction
Sums and unions
From antimatroids
Join of antimatroids on the same ground set
Intersection of matroid and antimatroid
From greedoids
Truncation
Restriction
Contraction

Some random facts about greedoids

This is a collection of interesting facts about greedoids that just don't seem to fit in the above sections, but are important enough not to be ignored. Of course, every important construction or idea brings with it a combinatorial explosion of derivable facts, so what we discussed above is far too incomplete to be a comprehensive introduction. This section will hopefully evolve over time, so I'm open to suggestions regarding it.

  • You can construct Tutte polynomials for greedoids, just like you can for matroids.
  • We can characterize greedoids into matroids, antimatroids, and none of these in terms of having certain kinds of greedoid minors.
  • Antimatroids are very closely related to lattices, and in general we can reason about greedoids by their associated poset flats too.

References and further reading

  1. Dávid Szeszlér (2021), Sufficient conditions for the optimality of the greedy algorithm in greedoids.
  2. Bernhard Korte, Laszlo Lovasz, Rainer Schrader (1991), Greedoids
  3. Matroid intersection in simple words
  4. Goecke, Korte, Lovasz (1986), Examples and algorithmic properties of greedoids
  5. Bernhard Korte, Laszlo Lovasz (1988), Intersection of matroids and antimatroids
  6. Brenda L Dietrich (1987), Matroids and antimatroids — a survey

Lastly, since this is a huge blog, some mistakes might have crept in. If you find any, please let me know!

Full text and comments »

  • Vote: I like it
  • +349
  • Vote: I do not like it

By nor, 16 months ago, In English

Recently someone asked me to explain how to solve a couple of problems which went like this: "Find the expected time before XYZ happens". Note that here the time to completion is a random variable, and in probability theory, such random variables are called "stopping times" for obvious reasons. It turned out that these problems were solvable using something called martingales which are random processes with nice invariants and a ton of properties.

I think a lot of people don't get the intuition behind these topics and why they're so useful. So I hope the following intuitive introduction helps people develop a deeper understanding. Note that in the spirit of clarity, I will only be dealing with finite sets and finite time processes (unless absolutely necessary, and in such cases, the interpretation should be obvious). I will do this because people tend to get lost in the measure-theoretic aspects of the rigor that is needed and skip on the intuition that they should be developing instead. Also, note that sometimes the explanations will have more material than is strictly necessary, but the idea is to have a more informed intuition rather than presenting a terse-yet-complete exposition that leaves people without any idea about how to play around with the setup.

People already familiar with the initial concepts can skip to the interesting sections, but I would still recommend reading the explanations as a whole in case your understanding is a bit rusty. The concept of "minimal" sets is used for a lot of the mental modelling in the blog, so maybe you'd still want to read the relevant parts of the blog where it is introduced.

I would like to thank Everule and rivalq for suggesting me to write this blog, and them and meme and adamant for proofreading and discussing the content to ensure completeness and clarity.

BONUS: Here's an interesting paper that uses martingales to solve stuff.

Table of contents

  1. Probability, sigma algebras, and random variables
  2. Expected value of a random variable and conditional probability
  3. Conditional expectation of a random variable with respect to a sigma algebra
  4. Martingales
  5. Stopping times
  6. Some math problems
  7. Some competitive programming problems

Probability, sigma algebras, and random variables

Let's look at the simplified definition of a probability space first. We say simplified here because there are some technicalities that need to be sorted out for infinite sets, which we're not considering here anyway.

A probability space is a triple $$$(\Omega, \mathcal{F}, P)$$$ consisting of

  1. $$$\Omega$$$: the (non-empty) ground set, also known as the sample space. It is helpful to think of it as the set of all possible outcomes of an experiment.
  2. $$$\mathcal{F}$$$: the $$$\sigma$$$-algebra. This is a collection of subsets of $$$\Omega$$$. Note that this is not always the power set of $$$\Omega$$$. It is helpful to think of its elements (the chosen subsets) as "events". For example, when rolling a die, one possible event can be "getting an even number". For this experiment, we have $$$\Omega = \{1, 2, 3, 4, 5, 6\}$$$ and this event is $$$\{2, 4, 6\}$$$.
  3. $$$P$$$: the probability function, a function from $$$\mathcal{F}$$$ to $$$[0, 1]$$$. This is just a weight that we assign to every event that we have chosen.

For this function to have some nice properties and for it to make sense as some measure of "chance", we add some more constraints to these functions:

  1. $$$\Omega$$$ is an event (i.e., $$$\Omega \in \mathcal{F}$$$).
  2. If $$$A \in \mathcal{F}$$$, then $$$\Omega \setminus A \in \mathcal{F}$$$. That is, if something happening is an event, then it not happening is also an event.
  3. If $$$A \in \mathcal{F}$$$ and $$$B \in \mathcal{F}$$$, then $$$A \cup B \in \mathcal{F}$$$. That is, if $$$X$$$ happening is an event and $$$Y$$$ happening is an event, then at least one of $$$X$$$ and $$$Y$$$ happening is also an event. Note that this combined with the previous constraint allows us to say that $$$A \cap B$$$ is an event and so on due to De Morgan's law.
  4. $$$P(\Omega) = 1$$$, that is, the probability of the whole sample space is $$$1$$$.
  5. For disjoint sets $$$A$$$ and $$$B$$$, $$$P(A \cup B) = P(A) + P(B)$$$. This, along with the previous constraint, is sufficient to derive identities like $$$P(\emptyset) = 0$$$, $$$P(A) + P(\Omega \setminus A) = 1$$$ and so on.

Let's now build some intuition about this definition (the following, especially the part about minimal sets, applies only to finite sets, but there are a lot of similarities in the conclusions when we try to generalize to other sets using measure-theoretic concepts).

Note that since $$$\mathcal{F}$$$ is closed under intersections, if we repeatedly take two sets and replace them with their intersection, we will end up with some "minimal" non-empty sets that we can't break down further (this terminology is not standard at all, but we will use it a lot in what follows). Consider the set of all such possible minimal sets. These sets form a partition of $$$\Omega$$$: they are disjoint (else we get a contradiction to minimality), and every element is in some minimal set (using the fact that $$$\mathcal{F}$$$ is closed under complements and constructing a minimal set if needed, or by just intersecting all sets containing that element — there is at least one such set, namely $$$\Omega$$$). Now since $$$\mathcal{F}$$$ is also closed under unions, it equals the set of all possible unions of these minimal sets. In other words, we have a partition of the elements of $$$\Omega$$$ into some sets, and $$$\mathcal{F}$$$ is the set formed by all possible unions of sets in that partition. We can now think of $$$\sigma$$$-algebras in terms of partitions! Also, note that the probabilities of these minimal events of the $$$\sigma$$$-algebra add up when taking unions.
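For finite sets, this "intersect everything containing a point" recipe is easy to play with in code. Here is a small Python sketch (the die example and all names are my own, purely for illustration) that recovers the minimal sets of a $$$\sigma$$$-algebra and shows that they partition $$$\Omega$$$:

```python
# A toy sigma-algebra on a die roll: Omega = {1, ..., 6} and F is
# generated by the event "the roll is even".
Omega = frozenset(range(1, 7))
F = {frozenset(), frozenset({2, 4, 6}), frozenset({1, 3, 5}), Omega}

def minimal_sets(Omega, F):
    """For each outcome, intersect all events containing it; the distinct
    results are exactly the minimal sets, and they partition Omega."""
    cells = set()
    for w in Omega:
        cell = Omega
        for A in F:
            if w in A:
                cell &= A
        cells.add(cell)
    return cells

print(sorted(sorted(c) for c in minimal_sets(Omega, F)))  # [[1, 3, 5], [2, 4, 6]]
```

Running the same function with $$$\mathcal{F}$$$ equal to the power set would return all six singletons, matching the remark below that the power set is the "finest" $$$\sigma$$$-algebra.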

In some sense, we can now construct an equivalent probability space $$$(\Omega', \mathcal{F}', P')$$$ where $$$\Omega'$$$ is the set of minimal events of $$$\mathcal{F}$$$, $$$\mathcal{F}'$$$ is the power set of $$$\Omega'$$$, and $$$P'$$$ is defined naturally from $$$P$$$. This definition is not standard but will be useful later on. Note that if $$$\mathcal{F}$$$ was the power set of $$$\Omega$$$, then this equivalent probability space would have been the same as the original probability space. In this sense, all other $$$\sigma$$$-algebras are "coarser" than the power set, whose minimal sets are all singletons.

At this point, we should realize that $$$\mathcal{F}$$$ is a measure of how much "information" we have about $$$\Omega$$$ (we can't distinguish between two elements in the same minimal set). This idea, along with the partition intuition, is a good way to deal with how you gain information over time in random processes, but we'll come to that a bit later.

Let's now turn to the formal definition of a random variable.

A random variable with respect to a probability space $$$(\Omega, \mathcal{F}, P)$$$ is a function $$$X$$$ from $$$\Omega$$$ to $$$\mathbb{R}$$$ such that $$$X^{-1}((-\infty, x]) \in \mathcal{F}$$$ for all $$$x \in \mathbb{R}$$$. That is, the set of pre-images of real numbers $$$\le x$$$ should be an element of $$$\mathcal{F}$$$.

Note that since we are dealing with finite sets, we will use an alternative (but equivalent for finite sets) definition where we replace the interval $$$(-\infty, x]$$$ by $$$\{x\}$$$.

When written like this, it might be a bit intimidating, but the underlying idea is quite simple. We have a sample space $$$\Omega$$$, and we have another set $$$\mathbb{R}$$$. Now we want to map elements of $$$\Omega$$$ to $$$\mathbb{R}$$$ in some way so that we can do some computations over elements of $$$\Omega$$$. However, since in a probability space, $$$\mathcal{F}$$$ tells us how much information we have (or are allowed to look at), we shouldn't be able to distinguish between elements of $$$\Omega$$$ that are inside the same minimal event of $$$\mathcal{F}$$$. Since $$$\Omega$$$ is finite and elements inside the same minimal event must be mapped to the same real number, $$$X$$$ takes only finitely many values, and the condition in the definition is both necessary and sufficient for this (due to the closure of $$$\mathcal{F}$$$ under intersections and unions).

As a side-note, note that we can also define functions of a random variable at this point by simply composing the function $$$X$$$ with a function from $$$\mathbb{R}$$$ to $$$\mathbb{R}$$$. Similarly, if we have two random variables on the same probability space, we can also define functions of them quite naturally. After all, we just need to specify their values on the minimal sets and we're done.

Expectation of a random variable and conditional probability

Let's try to investigate random variables a bit further. Let's say we have a probability space $$$(\Omega, \mathcal{F}, P)$$$ and a random variable $$$X$$$ on this probability space. How do we capture some sort of a mean value of $$$X$$$ over the entire probability space? One possible answer is the following: for each minimal event, we consider the common value of the random variable at each element of that event, and add it to a weighted sum with weight equal to the probability of that minimal event. Note that minimal events form a partition so the sum of these weights is $$$1$$$. In some sense, this is an average of the random variable over all minimal events, weighted by their importance.

But how do we write this up mathematically without referring to minimal events? Let's take advantage of the condition in the definition of a random variable. Instead of summing over minimal events, we can sum over distinct values $$$x$$$ of the random variable, and weight it by the probability of that variable being equal to $$$x$$$. This probability is well-defined due to the condition in the definition of a random variable (because the event that the random variable equals $$$x$$$ is in fact a disjoint union of minimal events and their probabilities add up). This definition is also consistent with our earlier intuition.

So we can define the expected value of a random variable $$$X$$$ to be $$$E[X] := \sum_{x \in \text{range}(X)} x \cdot P(X = x)$$$.
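The definition sums over distinct values of $$$X$$$, but by the discussion above it must agree with the naive outcome-by-outcome weighted sum over minimal events. A tiny Python check (the die and the choice of $$$X$$$ are my own toy example):

```python
from fractions import Fraction

# Fair die; X maps each face to (face mod 3), so X is constant on each
# minimal event of the partition {{3, 6}, {1, 4}, {2, 5}} it induces.
Omega = range(1, 7)
P = {w: Fraction(1, 6) for w in Omega}
X = {w: w % 3 for w in Omega}

# Definition from the text: sum over distinct values x of x * P(X = x).
values = set(X.values())
E_by_values = sum(x * sum(P[w] for w in Omega if X[w] == x) for x in values)

# Outcome-by-outcome weighted sum; the two must agree.
E_by_outcomes = sum(X[w] * P[w] for w in Omega)

print(E_by_values, E_by_outcomes)  # both equal 1
```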

Note that by our earlier definition related to minimal events, we can also clearly see that for any random variables $$$X, Y$$$ on the same probability space, we have $$$E[X + Y] = E[X] + E[Y]$$$.

Let's now change gears and think a bit about how we can isolate an event. Suppose we have an event $$$E$$$. Let's say that we want to ignore everything else, and only look at the elements of $$$E$$$ (and consequently the minimal events of $$$\mathcal{F}$$$ that form a partition of $$$E$$$). Can we make a probability space out of it? One easy way would be to just inherit the structure of $$$\Omega$$$ and $$$\mathcal{F}$$$ and set the probability of every minimal event outside $$$E$$$ to $$$0$$$, but if we simply inherit the probability function, we run into one problem: the sum of probabilities over the minimal events inside $$$E$$$ is not $$$1$$$. One possible fix is to normalize by $$$P(E)$$$, which is possible iff $$$P(E)$$$ is non-zero. It can easily be checked that this indeed gives a valid probability space.

This is how we define the conditional probability space wrt an event $$$E$$$ with $$$P(E) > 0$$$. The sample space and the $$$\sigma$$$-algebra remain the same, but the probability function becomes $$$P(A | E) := P'(A) = P(A \cap E) / P(E)$$$. This can be verified to be consistent with our earlier reasoning by considering minimal elements.

A cleaner way of looking at it is that we're really only concerned with some subset of $$$\Omega$$$ and $$$\mathcal{F}$$$ that correspond to elements in $$$E$$$, and we define a probability space with $$$E$$$ as the ground set. But just to make sure that we can interoperate with the other events in $$$\mathcal{F}$$$, we take the event, trim down the parts that are not in $$$E$$$, and then use the probability (or relative weight) of that set wrt the probability (or weight) of $$$E$$$ instead.

One corollary of this is the law of total probability: $$$\sum_{E_i} P(A | E_i) P(E_i) = P(A)$$$ for a partition of $$$\Omega$$$ into events $$$E_i$$$ (each with positive probability). This can be seen to be true by considering the minimal sets inside $$$A$$$. This identity is used while solving a lot of recurrences, and it is also what makes Bayes' theorem work.

This also has an expected value variant, and if we naturally restrict a random variable to each conditional probability space, by summing over minimal sets, we get that if $$$E[X | E_i]$$$ is the expected value of $$$X$$$ wrt the conditional probability space with respect to $$$E_i$$$, then $$$\sum_{E_i} E[X | E_i] P(E_i) = E[X]$$$. This variant is used in recurrences too, like finding the expected time before we toss two heads in a row with a fair coin (though the sample space is infinite there, these identities still hold).
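The expectation variant $$$\sum_{E_i} E[X | E_i] P(E_i) = E[X]$$$ is easy to check numerically on a finite example (a fair die split by parity; all specifics here are my own toy choices):

```python
from fractions import Fraction

# Fair die, X(w) = w, partitioned by parity: E_1 = odds, E_2 = evens.
Omega = list(range(1, 7))
P = {w: Fraction(1, 6) for w in Omega}
X = {w: w for w in Omega}
partition = [[1, 3, 5], [2, 4, 6]]

def prob(A):
    return sum(P[w] for w in A)

def cond_exp(A):
    # E[X | A] under the conditional measure P(. | A) = P(. and A) / P(A)
    return sum(X[w] * P[w] for w in A) / prob(A)

total = sum(cond_exp(E_i) * prob(E_i) for E_i in partition)
E_X = sum(X[w] * P[w] for w in Omega)
print(total, E_X)  # both are 7/2
```

Here $$$E[X | \text{odd}] = 3$$$ and $$$E[X | \text{even}] = 4$$$, and weighting each by $$$1/2$$$ recovers $$$E[X] = 7/2$$$.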

Conditional expectation with respect to a sigma algebra

Let's continue with the partition and information analogy even further. Recall how we had defined an equivalent probability space. It corresponds to collapsing equivalence classes (determined by the partition) into a single element, and making a probability space on that as the sample space. What if we do that again, after defining another sparser/coarser probability space on the resulting probability space?

Of course, this leads to a coarser partition of the sample space in some sense, and if $$$\mathcal{F}_1$$$ is the original $$$\sigma$$$-algebra, and $$$\mathcal{F}_2$$$ is the new $$$\sigma$$$-algebra (both of them corresponding to the same $$$\Omega$$$, after "unwrapping/unrolling" the second coarsening), then $$$\mathcal{F}_2 \subseteq \mathcal{F}_1$$$.

Now suppose we had a random variable $$$X$$$ on the original probability space. How do we reduce/adapt the random variable to this coarser probability space? This is where expected value comes into the picture. We will choose a definition for conditional expectation, and see that it satisfies a lot of good properties and is consistent with our conditional probability identities that we figured out in the previous section.

Note that $$$X$$$ is constant on all minimal sets of $$$\mathcal{F}_1$$$. Let's define the conditional expectation of $$$X$$$ with respect to the $$$\sigma$$$-algebra $$$\mathcal{F}_2$$$ as the random variable $$$E[X | \mathcal{F_2}] := X'$$$ such that on every element of a minimal set $$$E_i$$$ of $$$\mathcal{F}_2$$$, $$$X'$$$ assumes the value $$$E[X | E_i]$$$. This essentially means that we are trying to take a "conditional average" of the information about the values of $$$X$$$ on the set $$$E_i$$$. It is easy to verify that $$$E[X'] = E[X]$$$, where the first expectation is taken over the coarser probability space, and the second is taken on the finer probability space. In a case of unfortunate naming, $$$X'$$$ is called the conditional expectation of $$$X$$$ wrt the $$$\sigma$$$-algebra $$$\mathcal{F}_2$$$.

To make this a bit clearer, let's take an example.

Example

We note a couple of points here:

  1. In a way, we are trying to replace the random variable with its "best approximation" in some sense (for instance, in a variance-minimization sense: $$$E[X | \mathcal{F}_2]$$$ is the $$$\mathcal{F}_2$$$-measurable random variable closest to $$$X$$$ in mean squared error, though we won't need this here). Also note that the expectation of the random variable does not change when we apply this coarsening to $$$X$$$.
  2. Note that for the trivial $$$\sigma$$$-algebra $$$\mathcal{F}_2 = \{\emptyset, \Omega\}$$$, the random variable $$$E[X | \mathcal{F_2}]$$$ is identically the constant function that equals $$$E[X]$$$.

We can also chain a couple of coarsenings. Suppose $$$\mathcal{F}_3 \subseteq \mathcal{F}_2 \subseteq \mathcal{F}_1$$$ and $$$X$$$ is a random variable wrt the probability space $$$(\Omega, \mathcal{F}_1, P)$$$. Then we have $$$E[E[X | \mathcal{F}_2] | \mathcal{F}_3] = E[X | \mathcal{F_3}]$$$ (the tower property), as can be verified easily by chaining conditional expectations. In fact, that the expectation remains the same after coarsening is a direct consequence of this and the previous point.

A more commonly used term is conditional expectation with respect to another random variable. It is usually defined as this:

Suppose $$$X$$$ and $$$Y$$$ are random variables on the same probability space. For any element $$$\omega$$$ of the sample space where $$$Y(\omega) = y$$$, we define $$$E[X | Y]\left(\omega\right) = E[X | E]$$$ where $$$E$$$ is the event that $$$Y = y$$$. That is, for every element where $$$Y$$$ assumes a value of $$$y$$$, we take the conditional expectation wrt the event that $$$Y = y$$$ and set the value of the answer to that conditional expectation.

However, this is the same as saying this: let $$$\mathcal{F}_Y$$$ be the $$$\sigma$$$-algebra where the minimal sets correspond to distinct values of $$$Y$$$. Then $$$E[X | Y]$$$ is identically equal to the random variable $$$E[X | \mathcal{F}_Y]$$$. As a side-note, this constructed $$$\sigma$$$-algebra is called the $$$\sigma$$$-algebra generated by the random variable $$$Y$$$.

In this sense, conditional expectation with respect to a $$$\sigma$$$-algebra is a cleaner and more sophisticated concept than conditional expectation with respect to a random variable, though both concepts are equivalent (you can always induce a $$$\sigma$$$-algebra by choosing where to put equal values for the random variable you're conditioning on).

Another way of thinking about this concept is to think of it as projecting down from a set of minimal events to a coarser set of minimal events. For each chunk of minimal events we're collapsing, we replace it with a single minimal event by the "center of mass" of those events, where the "position" is the value of the random variable, and the "mass" equals the probability of those minimal events. This works out mathematically too, and is precise for this setup.
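This "center of mass" picture is easy to verify concretely. Below is a small Python sketch (the two-coin example is mine) that builds $$$E[X | Y]$$$ as a random variable, constant on each fiber $$$\{Y = y\}$$$, and checks that coarsening preserves the expectation:

```python
from fractions import Fraction

# Two fair coin flips: X = number of heads, Y = result of the first flip.
# E[X | Y] averages X over each fiber of Y, i.e., over each minimal set
# of the sigma-algebra generated by Y.
Omega = ["HH", "HT", "TH", "TT"]
P = {w: Fraction(1, 4) for w in Omega}
X = {w: w.count("H") for w in Omega}
Y = {w: w[0] for w in Omega}

def cond_exp_given(X, Y):
    """E[X | Y] as a random variable on Omega."""
    out = {}
    for y in set(Y.values()):
        fiber = [w for w in Omega if Y[w] == y]
        pe = sum(P[w] for w in fiber)
        val = sum(X[w] * P[w] for w in fiber) / pe  # center of mass on the fiber
        for w in fiber:
            out[w] = val
    return out

Z = cond_exp_given(X, Y)  # Z is 3/2 on {HH, HT} and 1/2 on {TH, TT}
tower = sum(Z[w] * P[w] for w in Omega)
print(tower == sum(X[w] * P[w] for w in Omega))  # expectations agree
```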

Just for the sake of completeness, since the above definition captures what conditional expectation with respect to a $$$\sigma$$$-algebra does, but breaks down when you try to generalize it to infinite sample spaces, here's the "official definition" that people use (which is quite counter-intuitive and takes some time to digest):

A conditional expectation of $$$X$$$ given $$$\mathcal{H}$$$, denoted as $$$E[X | \mathcal{H}]$$$, is any random variable on $$$(\Omega, \mathcal{H}, P)$$$ which satisfies $$$\int_H E[X | \mathcal{H}] d\mathbb{P} = \int_H X d\mathbb{P}$$$ for each $$$H \in \mathcal{H}$$$.

This essentially says the same thing when you think about the possible $$$H$$$, but in a more non-constructive way. It turns out that this random variable is not unique in the strictest sense, but any two candidates $$$Y$$$ and $$$Z$$$ for it satisfy $$$P(Y \ne Z) = 0$$$. Also, this definition makes it clear that $$$X - E[X | \mathcal{H}]$$$ is orthogonal to the indicator function of every $$$H \in \mathcal{H}$$$ (so we get a more precise sense in which conditional expectation is a projection).

Martingales

Let's say there is a "process" of some sort, where at every step, we get more information. Let's say our sample space for the whole experiment is $$$\Omega$$$. Here, $$$\Omega$$$ is almost always infinite, but the intuition we built carries forward here, even though the concepts like minimal sets might not. Let's take a very simple example: we toss a coin at each step. The sample space for the whole experiment is the set of all infinite binary strings. Let's think about the kind of information that we have available at various steps in the process. After step 0, all the information we have (for instance, the complete information about a random variable that has information about what has happened until now) corresponds to the trivial $$$\sigma$$$-algebra $$$\{\emptyset, \Omega\}$$$. After step $$$i$$$, we have information for the first $$$i$$$ coin tosses, so the information corresponds to the $$$\sigma$$$-algebra with "minimal" sets being the sets of binary strings which are the same in the first $$$i$$$ positions.

To understand why $$$\sigma$$$-algebras are necessary to capture information, perhaps the following analogy helps. Think in terms of minimal sets. Each time we get some more information, we might be able to distinguish between elements in minimal sets. To be able to do operations like finding the expectation of some random variable in this scenario, we need to be able to distinguish between these elements too. In some sense, you're inspecting multiple parallel universes at the same time, and in each universe, you have a certain prefix of things that happened, for each of which you want to be able to analyze what happens later.

Note that we always get more information at every stage of such a stochastic process. So we define a filtration as follows (without any reference to a process, for the sake of generality): a sequence of $$$\sigma$$$-algebras $$$\mathcal{F}_i$$$ such that $$$\mathcal{F}_i \subseteq \mathcal{F}_j$$$ whenever $$$i \le j$$$.

Now let's get to what a process really is. At every step of a process, a new random variable shows up. But which $$$\sigma$$$-algebra are all these random variables defined on? To handle them uniformly, we take a single $$$\sigma$$$-algebra $$$\mathcal{F}$$$ that captures all possible information. So we define a stochastic process as follows:

For a given probability space $$$(\Omega, \mathcal{F}, P)$$$, and an index set $$$I$$$ with a total order $$$\le$$$, a stochastic process $$$X$$$ is a collection of random variables $$$\{X(i) | i \in I\}$$$.

For our purposes, $$$I$$$ will mostly be the non-negative integers, though $$$I = \mathbb{R}$$$ is used extensively in a lot of contexts. $$$I$$$ having a total order means that there is some concept of time associated with $$$I$$$.

How do we now construct a "natural filtration" that captures the information that the first $$$i$$$ steps of the process (i.e., $$$\{X(j) \mid j \le i\}$$$) give us? Remember how we said we can construct a $$$\sigma$$$-algebra from a random variable? We can use the same construction for a finite set of random variables, by taking the common refinement of the induced partitions (this corresponds to the join of the partitions in the refinement lattice, if you're familiar with posets and lattices). So the filtration can be set up this way too, by refining the current $$$\sigma$$$-algebra with a new random variable each time. More formally, paraphrasing from Wikipedia,

Let $$${\displaystyle (X_{n})_{n\in \mathbb {N} }}$$$ be a stochastic process on the probability space $$${\displaystyle (\Omega ,{\mathcal {F}},P)}$$$. Then

$$${\displaystyle {\mathcal {F}}_{n}:=\sigma (X_{k}\mid k\leq n)}$$$

is a $$$\sigma$$$-algebra and $$${\displaystyle \mathbb {F} =({\mathcal {F}}_{n})_{n\in \mathbb {N} }}$$$ is a filtration. Here $$${\displaystyle \sigma (X_{k}\mid k\leq n)}$$$ denotes the $$$\sigma$$$-algebra generated by the random variables $$${\displaystyle X_{1},X_{2},\dots ,X_{n}}$$$. $$${\mathbb F}$$$ is a filtration, since by definition all $$${\displaystyle {\mathcal {F}}_{n}}$$$ are $$$\sigma$$$-algebras and $$${\displaystyle \sigma (X_{k}\mid k\leq n)\subseteq \sigma (X_{k}\mid k\leq n+1).}$$$

This is known as the natural filtration of $$${\displaystyle {\mathcal {F}}}$$$ with respect to $$${\displaystyle X}$$$.

Finally, we are able to define what a martingale is. Note that if $$$\sigma(X_i) \subseteq \mathcal{H}$$$ for some $$$\sigma$$$-algebra $$$\mathcal{H}$$$, then $$$E[X_i | \mathcal{H}] = X_i$$$ (think about the minimal sets and the definition). A martingale essentially says that conditional on the information we have until time $$$i$$$, a random variable $$$X_j$$$ for $$$j \ge i$$$ looks like $$$X_i$$$. More formally,

A martingale is a stochastic process $$$X_1, X_2, \dots$$$ such that $$$E[X_{n + 1} | \sigma(X_1, X_2, \dots, X_n)] = X_n$$$, and $$$E[|X_n|] < \infty$$$.

Equivalently, the first condition can be written as $$$E[X_{n + 1} - X_n | \sigma(X_1, X_2, \dots, X_n)] = 0$$$. Note that we talk about "almost sure" equality of random variables when dealing with more complicated probability spaces, but for our purposes, we will not concern ourselves with this detail.

Note that the one-step condition is both necessary (trivially) and sufficient for the more general condition $$$E[X_j \mid \sigma(X_1, \dots, X_i)] = X_i$$$ for all $$$j \ge i$$$: writing $$$X_j - X_i = (X_j - X_{j - 1}) + (X_{j - 1} - X_{j - 2}) + \dots + (X_{i+1} - X_i)$$$ and applying the nested coarsening property to each term gives $$$E[X_j - X_i \mid \sigma(X_1, \dots, X_i)] = 0$$$.

Why is a martingale so useful? Firstly, it gives us an invariant, and as is well-known, invariants make life easier when we try to prove results in math. Secondly, it originated in the theory of gambling, and for fair gambling games, the expected value of the profit at a later point in time should be equal to the profit at that point in time, which is precisely the martingale condition!

Let's look at some constructions of martingales.

  1. Consider a random walk, where the initial position is $$$X_0 = 0$$$, and $$$X_i = X_{i-1} \pm 1$$$ with equal probability $$$p \le 1/2$$$, and $$$X_i = X_{i-1}$$$ with probability $$$1 - 2p$$$. Then this is a martingale (from the definition).
  2. Consider a random walk denoting the total assets of a gambler who keeps reinvesting all his capital into the game. The game gives a payout of $$$r$$$ percent if he wins (with probability $$$p$$$), a loss of $$$r$$$ percent if he loses (with probability $$$p$$$), and no payout in case it's a draw (with probability $$$1 - 2p$$$). Then his wealth is a martingale.

There are many more examples of martingales, but I won't go too much in depth about that, and would leave you to refer to some other resource for more examples.
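Still, here is one way to convince yourself that example 1 above really is a martingale: enumerate all short paths exactly and check that the expected position stays at $$$X_0 = 0$$$ (a minimal sketch; the choice $$$p = 1/3$$$ and the path length $$$3$$$ are arbitrary choices of mine):

```python
from fractions import Fraction
from itertools import product

# Example 1: X_0 = 0, each increment is +1 or -1 with probability p each,
# or 0 with probability 1 - 2p.  Enumerate all 3-step paths exactly.
p = Fraction(1, 3)
steps = [(+1, p), (-1, p), (0, 1 - 2 * p)]

paths = []  # (final position, exact probability) for every 3-step path
for moves in product(steps, repeat=3):
    pos = sum(m for m, _ in moves)
    prob = Fraction(1)
    for _, q in moves:
        prob *= q
    paths.append((pos, prob))

total_prob = sum(q for _, q in paths)
E_X3 = sum(pos * q for pos, q in paths)
print(total_prob, E_X3)  # probabilities sum to 1, and E[X_3] = X_0 = 0
```

The same enumeration conditioned on any fixed prefix of moves gives expected next increment $$$0$$$, which is exactly the defining equation.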

Some more intuition about martingales: let's see what happens to the conditional expectation in the definition of a martingale when we apply a convex function to a stochastic process (i.e., to all random variables in the stochastic process). Since probabilities are all positive, whenever we're taking the conditional expectation, we are taking a convex combination of the values of the convex function of a random variable. By Jensen's inequality, we get that rather than equality, $$$\ge$$$ holds in the defining equation for martingales. Similarly, if we had a concave function, $$$\le$$$ would hold in the defining equation. These types of processes (where $$$\ge$$$ and $$$\le$$$ hold instead of $$$=$$$) are called sub-martingales and super-martingales respectively. Note that not all stochastic processes are one of sub-/super-martingales. Note that $$$\ge 0$$$ and $$$\le 0$$$ means that the random variable on the LHS is almost surely non-negative and non-positive, respectively.

An example of a sub-martingale

I won't go into a lot of detail about martingales (since they deserve a field of study of their own), but will just point out some nice results, that help to get a feel of how martingales behave:

  1. Azuma's inequality: this provides a bound on how "close" we will be to the initial value of the random variable given bounds on the movement of the martingale (or how much in the wrong direction we can go for a sub-/super-martingale). Let's say $$$\{X_k\}$$$ is a super-martingale (which a martingale also is), and let's suppose $$$|X_k - X_{k - 1}| \le c_k$$$ almost surely. Then we have $$$P(X_N - X_0 \ge \varepsilon) \le \exp(-\varepsilon^2/(2 \sum_{k = 1}^N c_k^2))$$$. Since the negative of a super-martingale is a sub-martingale, we get for a sub-martingale $$$X$$$, $$$P(X_N - X_0 \le -\varepsilon) \le \exp(-\varepsilon^2/(2 \sum_{k = 1}^N c_k^2))$$$. Since a martingale is both a sub-martingale and a super-martingale, we have a probabilistic bound on how close $$$X_N$$$ will be to $$$X_0$$$, that is, $$$P(|X_N - X_0| \ge \varepsilon) \le 2 \exp(-\varepsilon^2/(2 \sum_{k = 1}^N c_k^2))$$$. There is a stronger version of this inequality, but I think this is quite sufficient to see how well-behaved martingales are.

  2. Doob's martingale inequality: this is a Markov-style inequality on the probability that a sub-martingale will exceed a value $$$x$$$. More formally, for a sub-martingale $$$X$$$, we have $$$P(\max_{1 \le i \le n} X_i \ge x) \le E[\max(X_n, 0)] / x$$$. This can also be used to derive Markov's inequality.

  3. Doob's martingale convergence theorem: this is a result that says that a super-martingale that is bounded from below will converge (almost surely) to a random variable with finite expectation. Note that if we bar the case when the variable diverges to infinity in some sense, the other possible way for it to not converge is to oscillate. This inequality roughly says that bounded martingales don't oscillate.
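To get a feel for how Azuma's inequality behaves in practice, here is a small Monte Carlo sketch for a fair $$$\pm 1$$$ walk, where every $$$c_k = 1$$$ (all parameters are arbitrary choices of mine; the empirical frequency should land comfortably below the bound, which is not tight here):

```python
import math
import random

# Azuma's inequality for a fair +-1 random walk: X_0 = 0 and
# |X_k - X_{k-1}| = 1, so c_k = 1 for all k and the two-sided bound reads
#   P(|X_N - X_0| >= eps) <= 2 * exp(-eps^2 / (2 * N)).
random.seed(0)
N, eps, trials = 100, 25, 20000

hits = 0
for _ in range(trials):
    pos = sum(random.choice((-1, 1)) for _ in range(N))
    if abs(pos) >= eps:
        hits += 1

empirical = hits / trials
bound = 2 * math.exp(-eps ** 2 / (2 * N))
print(empirical, "<=", bound)
```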

Stopping times

Stopping times are quite interesting mathematical objects, and when combined with martingales, they give a lot of interesting results and are a very powerful tool.

Suppose we are in the gambler's shoes, and we want to write an algorithm to finish the game, depending on certain outcomes in the process. Stopping times are a way to formalize the analysis of when such exit conditions are reached. The best we can do right now, given the current theory we have, is to define it as a random variable (that may or may not depend on the stochastic process). We also have the constraint that we can't see the future, so the decision about when to stop must be taken with all the information at the current time.

To that end, let's consider a random variable $$$\tau$$$ (that takes values in the index set, that is the set of non-negative integers here) on a probability space $$$(\Omega, \mathcal{F}, P)$$$, and let's say we have a stochastic process $$$X$$$ on the same probability space with an associated filtration $$$\{\mathcal{F}_i\}$$$. We say that $$$\tau$$$ is a stopping time if the event that $$$\{\tau = t\}$$$ is an event in $$$\mathcal{F}_t$$$. That is, the decision to stop at time $$$t$$$ must be taken with information no more than what we have at time $$$t$$$. For also keeping this consistent with real number indexing, we modify the condition to $$$\{\tau \le t\}$$$ being an event in $$$\mathcal{F}_t$$$.

To understand why this ensures no forward-looking, consider the following. We bunch together elements of $$$\Omega$$$ that have stopping time $$$\le t$$$, and the condition is that we shouldn't be able to distinguish them without information from after time $$$t$$$. Similarly, it must not be the case that two elements of $$$\Omega$$$ in the same minimal set of $$$\mathcal{F}_t$$$ have different stopping times, unless both are $$$> t$$$. In our coin flip example, the latter makes sense if you think about why you can't have a stopping time of $$$4$$$ for 000010110... and of $$$6$$$ for 000011000... (these strings agree on their first four tosses, so at time $$$4$$$ they lie in the same minimal set of $$$\mathcal{F}_4$$$).

Another way of formalizing the same thing is to say that the stochastic process (denoting the stopping of a process) defined by $$$Y_t = 0$$$ if $$$\tau < t$$$ and $$$1$$$ otherwise is an adapted process wrt the filtration. That is, $$$\sigma(Y_t) \subseteq \mathcal{F}_t$$$ for all $$$t$$$.

Since the stopping time is an index, the most natural thing at this point is to think of how to index the stochastic process with some function of stopping time. If we go back to the definition of a random variable, it should be clear that intuitively we need to assign a value to it at every element in $$$\Omega$$$. In simple cases, this is trivial, since we can just compute the value of $$$\tau$$$ at the minimal set which has this element (let's say this value is $$$t$$$), and compute the value of $$$X_t$$$ at the minimal set again. From here, it should be obvious that $$$X_\tau$$$ is indeed a random variable on the original probability space, and it denotes the value of the stochastic process when the process is exited. Also note that by the same reasoning, $$$X_{\min(\tau, t)}$$$ is a random variable whose induced $$$\sigma$$$-algebra is a subset of $$$\mathcal{F}_t$$$. $$$X_{\min(\tau, t)}$$$ corresponds to a stopped process, which means that we just arbitrarily stopped a process after a time limit. These are very important types of processes, and have been extensively studied.

The cool part about martingales and stopping times together is that this setup has been studied a lot, and there are quite a few important results that I'll list without proof, which can help us solve a lot of problems as well as build an understanding around stopping times and martingales:

  1. Martingale central limit theorem: This is a generalization of the usual central limit theorem to martingales. The associated process is the usual martingale, but with differences between consecutive values being bounded by a fixed bound almost surely. We choose a stopping time as follows: at each step $$$i + 1$$$, compute the conditional variance of $$$X_{i + 1}$$$ wrt $$$\mathcal{F}_i$$$, and keep adding it to a counter. Stop once this counter is more than a certain value $$$\sigma^2$$$. (In a sense, we are waiting till we have collected a large enough variance/move). If the stopping time here is $$$\tau_\sigma$$$, then the random variable $$$X_{\tau_\sigma} / \sigma$$$ converges to a normally distributed variable with mean $$$0$$$ and variance $$$1$$$, as $$$\sigma$$$ tends to $$$\infty$$$. This also assumes that the sum of variances diverges.

  2. Doob's optional stopping theorem: This is a very important theorem, and will be quite useful to compute the expected values of stopping times later on. It talks about the fairness of a martingale when we stop after a random stopping time, in the sense that we developed above. The metric for fairness is that $$$E[X_\tau] = E[X_0]$$$, and it turns out that this is true if any of the following three conditions holds:

    • The stopping time $$$\tau$$$ is bounded above.
    • The martingale is bounded and the stopping time $$$\tau$$$ is finite almost surely (i.e., $$$P(\tau = \infty) = 0$$$).
    • The expected value of $$$\tau$$$ is finite, and the difference between $$$X_{i + 1}$$$ and $$$X_i$$$ is bounded. Note that this can be used to compute the stopping time if we modify the martingale a bit, as we shall see later.
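As a concrete sanity check of the optional stopping theorem (via the third condition), take a fair $$$\pm 1$$$ walk started at $$$0$$$ and stopped when it first hits $$$-a$$$ or $$$+b$$$: the increments are bounded and $$$E[\tau]$$$ is finite, so $$$E[X_\tau] = 0$$$, which forces $$$P(\text{hit } +b) = a/(a+b)$$$. A quick simulation (parameters mine) agrees:

```python
import random

# Optional stopping on gambler's ruin: a fair +-1 walk from 0, stopped at
# -a or +b.  E[X_tau] = P(hit +b) * b - P(hit -a) * a = 0 gives
# P(hit +b) = a / (a + b).
random.seed(1)
a, b, trials = 3, 5, 40000

wins = 0
for _ in range(trials):
    pos = 0
    while -a < pos < b:
        pos += random.choice((-1, 1))
    wins += (pos == b)

empirical = wins / trials
print(empirical)  # should be close to a / (a + b) = 3/8
```

The "modify the martingale a bit" remark above is exactly how one then finds $$$E[\tau]$$$: apply the same theorem to the martingale $$$X_t^2 - t$$$ to get $$$E[\tau] = ab$$$.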

Some math problems

We'll start out with some math problems that I have seen over the years (apologies since I don't remember the sources for some of them).

Problem 1: Suppose there is a gambler who has $$$n$$$ dollars and gambles infinitely many times. The payoff of the $$$t^{th}$$$ round is $$$X_t$$$, which is an integer bounded above and below by some fixed constants. We know that $$$E[X_t|X_{t-1},\cdots,X_1]\geq c$$$ where $$$c > 0$$$. Suppose the probability of the gambler going bankrupt is $$$p(n)$$$. Do we have $$$\lim_{n\rightarrow \infty}p(n)=0$$$?

Solution sketch

Problem 2 (folklore): Suppose there is an infinite grid with a non-negative real number in each cell. Suppose that for every cell, the number in it is the mean of the numbers in its 4 neighbouring cells. Show that all the numbers must be equal.

Solution sketch

Problem 3 (USA IMO 2018 Winter TST 1, P2): Find all functions $$$f\colon \mathbb{Z}^2 \to [0, 1]$$$ such that for any integers $$$x$$$ and $$$y$$$,

$$$f(x, y) = \frac{f(x - 1, y) + f(x, y - 1)}{2}$$$
Solution sketch

Problem 4 (2022 Putnam A4): Suppose that $$$X_1, X_2, \ldots$$$ are real numbers between 0 and 1 that are chosen independently and uniformly at random. Let $$$S=\sum_{i=1}^kX_i/2^i,$$$ where $$$k$$$ is the least positive integer such that $$$X_k<X_{k+1},$$$ or $$$k=\infty$$$ if there is no such integer. Find the expected value of $$$S$$$.

Solution sketch (due to ABCDE)

Problem 5 (AGMC 2021 P1): At a dance party, there are initially $$$20$$$ girls and $$$22$$$ boys in the pool, and infinitely many more girls and boys waiting outside. In each round, a participant is picked from the pool uniformly at random; if a girl is picked, she invites a boy from the pool to dance, and both of them leave the party after the dance; if a boy is picked, he invites a girl and a boy from the waiting line to dance together, and all three of them stay after the dance. The party is over when only (two) boys are left in the pool.

(a) What is the probability that the party never ends?

(b) Now the organizer of this party decides to reverse the rule: if a girl is picked, she invites a boy and a girl from the waiting line to dance, and the three of them stay after the dance; if a boy is picked, he invites a girl from the pool to dance, and both leave after the dance. As before, the party is over when only (two) boys are left in the pool. What is the expected number of rounds until the party ends?

Solution sketch

Problem 6 (folklore): Let a monkey type randomly on a typewriter with alphabet $$$\Sigma$$$. For a given string $$$w\in\Sigma^*$$$, let $$$w_1,\dots,w_n$$$ be the nonempty strings that are simultaneously a prefix and a suffix of $$$w$$$. Show that the expected number of keystrokes the monkey needs before $$$w$$$ first appears is $$$\sum_{i=1}^n|\Sigma|^{|w_i|}$$$.

Solution sketch

A different approach for the above problem is discussed here.

Problem 7 (Alon, Spencer): Let $$$G=(V,E)$$$ be a graph with chromatic number $$$\chi(G)=1000$$$. Let $$$U \subseteq V$$$ be a random subset of $$$V$$$ chosen uniformly from among all $$$2^{|V|}$$$ subsets of $$$V$$$. Let $$$H=G[U]$$$ be the induced subgraph of $$$G$$$ on $$$U$$$. Prove that the probability that $$$H$$$ has a chromatic number of $$$\le 400$$$ is less than $$$0.01$$$.

Solution sketch

Some competitive programming problems

Almost all the problems that I know about from a competitive programming perspective are discussed in this Chinese blog and in this editorial to 1479E, which explains the main idea in a slightly different manner. I won't repeat their discussion (the equations should be understandable if you understand the main approach in the math problems above), but I will list the problems they link, along with some problems that appeared later (in reverse chronological order).

  1. 1575F - Finding Expected Value
  2. 1479E - School Clubs
  3. 1392H - ZS Shuffles Cards
  4. 1349D - Slime and Biscuits
  5. 1025G - Company Acquisitions
  6. 850F - Rainbow Balls
  7. USACO 2018 — Balance Beam

By nor, 16 months ago, In English

Since someone recently asked me about my competitive programming setup, and I like tinkering with my setup to make it as useful and minimal as possible, I thought I should share the setup I've used for the past few years, along with a modified version that I've used at onsite ICPC contests. I've also talked to a few people who went to IOI with a similar setup, so I'm fairly confident that at least some people will be able to use it successfully without worrying too much about the details, like I did. This is definitely NOT the only way to set up a basic environment, but it is the one that worked for me for quite a long time.

This blog is also meant as a companion blog to this blog on using the command line.

Before we start with the tools, it's worth mentioning this resource as another useful reference for installing vim and g++.

About the choice of tools

The choice of tools here is motivated by two things — availability and minimalism.

About availability: most onsite contests provide a Linux system (almost always Ubuntu), and the standard procedure for installing the most common compiler used for competitive programming (g++) on Ubuntu pulls in the meta-package build-essential, which also installs make. Most onsite contests provide gdb too. As far as vim is concerned, it is available at almost all onsite contests as well (and if it is not present, you can use vi as a good enough substitute).

Needless to say, this setup assumes that you have a Linux machine or a Linux VM. If you're on Windows, using WSL should suffice. Most of these things should also work on macOS (speaking from my limited experience of testing an ICPC contest while stuck with a macOS machine).

Being minimalist, these tools are also fast and easy to set up (sometimes they need no setup at all) and won't cost you a ton of time navigating complicated menus or waiting for startup. Note that vim might be a bit much if you haven't used it before and aren't willing to devote a few days to getting comfortable with it, so it's a matter of personal preference to use some other text editor like Sublime Text or VS Code. However, I don't know of any alternatives to make (other than writing a shell script, which is a bit inconvenient unless you're a sh/bash expert) and gdb that are as widely available. I'd like to re-emphasize that this is just my setup; your mileage may vary.

Now about the tools and how to install them:

  1. vim is a minimalist and customizable text editor that allows you to do a lot of stuff in a no-nonsense way. Its keybindings can be a bit counterintuitive at first, but once you get used to it, it makes writing and editing code very convenient. Installation instructions can be found at the link I shared earlier.
  2. make is a minimal build system that reads something called a Makefile and figures out how to do your compilation for you. Installation instructions can be found at the link I shared earlier, under the heading "An example of a workflow in Vim".
  3. gdb is a command-line debugger that works for a lot of languages and is pretty convenient and intuitive to use (and also quite scalable). For installation instructions, a search on your favourite search engine should be sufficient.

Setup for onsite contests

This is a more concise setup that I've used for onsite contests, and people might want to build upon it further.

I prefer using the following template, but for ICPC, I added some things from the KACTL template to keep things uniform in the codebook.

Template

Vim

For setting up vim, you just need to create a file ~/.vimrc which stores all your configurations.

A concise vimrc
A bit more beginner-friendly vimrc
Explanation of what everything does

Makefile

Note that you can run make without any setup at all. For example, if the path (relative or absolute) to your cpp file is <some_path>/code.cpp, you can run make <some_path>/code, and it will generate an executable at <some_path>/code using the default compilation flags. make also prints the command it used to compile the file.

One feature of make is that once it has compiled a file, it won't compile it again unless the source file changes. To force a rebuild, either use make -B instead of make, or use the Makefile below and run make clean, which removes all executables from the current directory up to depth 1.
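
To illustrate the idea (this is a minimal sketch of my own, not the actual template from the spoiler below), a Makefile that overrides the default flags and adds a `clean` target along these lines could look like:

```make
# Minimal sketch (not the actual template below): override the flags used by
# the built-in %: %.cpp rule, so `make foo` builds ./foo from foo.cpp.
CXX := g++
CXXFLAGS := -std=c++20 -O2 -Wall -Wextra -fsanitize=address,undefined -g

.PHONY: clean
clean:
	# Careful: this also removes any executable scripts at depth 1.
	find . -maxdepth 1 -type f -executable -delete
```

The `clean` target here relies on compiled binaries having the executable bit set, which is why it can find them without knowing their names.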

For using more compilation flags with make, create a file named Makefile and store the following Makefile in it (you can modify this too).

Makefile
Explanation

GDB

GDB doesn't really need setup. With the above flags, the most I have ever needed to do was run the executable in gdb and look at the backtrace to pinpoint the line with the error. To do that, I do the following (you can also do this in vim using the termdebug package, which is enabled in the vimrc above and ships with vim out of the box):

  1. Run gdb <executable_name>
  2. If you have an input file called in.txt, type run < in.txt and hit enter. If you don't have an input file, you can just give it input like in a terminal. Output redirection works as well.
  3. If the program has errors, it will stop at some point with an error. To look at the backtrace, type bt and hit enter.
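
These three steps can also be scripted as a single non-interactive command (a sketch; code and in.txt are placeholder names for your executable and input file):

```sh
# Run the (hypothetical) executable ./code under gdb with input redirected
# from in.txt, and print a backtrace automatically if it crashes.
gdb -batch -ex 'run < in.txt' -ex bt ./code
```

This is handy when you want the backtrace without an interactive session, e.g. inside a script or a stress-testing loop.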

Setup for online contests

My setup for online contests is a bit more involved, and I'll keep this section a bit more concise, since you can just set this up once and never have to worry about it again, in contrast with onsite contests, where every character counts.

I use a bigger template for online contests, which is as follows:

Template

Note that there's an include for debugging, and it's pretty easy to use (I've mentioned it here).

Vim

Note that the following vimrc has some non-competitive-programming stuff as well, since I use vim as my primary editor. For more information on each plugin (each such line starts with Plug or Plugin), you can search for it on GitHub. For instance, for the line Plug 'sbdchd/neoformat', the relevant GitHub repo is this. I use the snippet manager UltiSnips and clang-format and clangd for code formatting and static analysis purposes. I used to use some of the commented plugins earlier, so those can be useful too.

vimrc

Makefile

On my personal system, since compilation time becomes a major overhead, I also use precompiled headers. Also, for contests like Hash Code, I use OpenMP at times, so I have some flags for those in my Makefile.

Makefile

Let me know if I missed out on something, or if I should add something here!

By nor, 16 months ago, In English

Recently, I retired from competitive programming completely, and I found myself wondering what I'd do with all the competitive-programming-specific knowledge I have gained over the years. For quite some time, I've been thinking of giving back to the community by writing about some obscure topics that not a lot of people know about but that are super interesting. However, every time I ended up with ideas that are either too obscure to feasibly appear in a competitive programming problem, too technical to ever be seen in a contest, or already covered by plenty of good blogs/articles.

So here's my question: do you think there are topics/ideas that you find interesting and beautiful, but that are not appreciated enough to have their own blogs? If I am familiar with those ideas, I can consider writing something on them (and irrespective of that, I would encourage you to write a blog on such ideas if you're familiar with them). If I don't know about them, it'll be a good way for me and everyone else visiting this blog to learn about them (and if I like the topic, who knows, I might write something on it too). Also, the ideas don't necessarily have to be related to algorithms and data structures; some meta stuff like this nice blog works too.

For an idea about the kinds of blogs I'm talking about, you could refer to my previous educational blogs. I was mainly inspired by adamant's and errorgorn's blogs, since they tend to be really interesting (I would recommend checking them out!).
