### nor's blog

By nor, 3 months ago, It's been about 5 years since the round took place, and since there was no editorial all this time, I decided to write one myself.

Thanks to tfg for discussing problem F with me and coming up with the permutation interpretation that makes it much easier to reason about the problem. The idea in F was also inspired by other submissions.

1060A - Phone Numbers

Hints/intuition
Solution

Submission: 196688903

1060B - Maximum Sum of Digits

Hints/intuition
Solution

Submission: 196689846

1060C - Maximum Subrectangle

Hints/intuition
Solution

Submission: 196755317

1060D - Social Circles

Hints/intuition
Solution

Submission: 196760424

1060E - Sergey and Subway

Hints/intuition
Solution

Submission: 196767377

1060F - Shrinking Tree

Hints/intuition
Solution

Submission: 196942805

1060G - Balls and Pockets

Hints/intuition
Solution

Submission: 196921763

1060H - Sophisticated Device

Hints/intuition
Solution

Submission: 196822812

By nor, 3 months ago, There were a few instances on CF according to which it seems that quite a few people aren't very comfortable with floor and ceiling functions (and inequalities in general) — for instance, here and here. This inspired me to write a hopefully short blog on how to systematically work with these topics.

Note that I will not define what a real number is, or do anything that seems too non-elementary. Also, for the foundational stuff, I will rely on some correct intuition rather than rigour to build basic ideas, and the systematic logically correct way of reasoning will be used later. The idea behind this blog is to just help you understand how to reason about the topics we'll cover.

## Recap: basic inequalities

Before we jump ahead to floors and ceilings, we need to talk about inequalities. I will assume that you know something about the number line. Let's recap some informal definitions:

• We say that $a < b$ if and only if the number $a$ comes before the number $b$ on the number line. Here number refers to real numbers.
• We say that $a = b$ if and only if $a$ and $b$ coincide on the number line.
• We say that $a > b$ if and only if $b < a$, i.e., $a$ comes after $b$ on the number line.
• We say that $a \le b$ if and only if one of $a < b$ or $a = b$ holds.
• We say that $a \ge b$ if and only if one of $a > b$ or $a = b$ holds.

For the sake of brevity, we will abbreviate "if and only if" to iff. We will often use $P \iff Q$ in math notation to indicate that $P$ is true iff $Q$ is true. For a unidirectional implication of the form "if $P$ is true, then $Q$ is also true", we will write $P \implies Q$, and read it as $P$ implies $Q$.

Here's a fact that people often use intuitively, called the Law of Trichotomy:

• Exactly one of the following relations is true: $a < b$, $a = b$, $a > b$.

This is useful a lot of times when you try to break things into cases, and try to look for contradictions.

A corollary of this is the following definition of the relations in terms of $<$ and negation (you can use any relation other than $=$ to get all other operations):

• $a = b$ iff $a \not < b$ and $b \not < a$.
• $a > b$ iff $b < a$.
• $a \le b$ iff $b \not < a$.
• $a \ge b$ iff $a \not < b$.

Now note that if we have some results about $<$, using the above relations, we can come up with relations corresponding to these operations too.

We can also write these relations in terms of $\le$ — I leave this as an exercise for you to confirm that you've understood the logic behind these.

I recommend understanding why these make sense, so that you can reason about directions and inequalities more easily.

## Chaining these relations

Let's say $a R b$ and $b R c$ are true, where $R$ is any one of $<, >, =, \le, \ge$. Then it is easy to show that $a R c$ is also true. This allows us to chain inequalities, and write things like $a < b < c$ and remembering that as we read the expression from left to right, the terms only increase.

This property is called transitivity, and is well-studied in other contexts too. But for our purposes, it just makes our life simpler.

At this point, it is very important to note two more things:

• If $a < b$ and $b > c$, we can't say anything about the order of $a$ and $c$. This is a rookie mistake that happens a lot.
• $<$ is a stronger statement than $\le$. In particular, if we have something like $a < b$ and $b \le c$, then it is impossible that $a = c$. So we can in fact say that $a < c$, rather than the weaker statement $a \le c$. Mathematically, you can show this by reasoning as follows: $b \le c$ means that either $b < c$ or $b = c$. In the first case, the chaining is trivially true. In the second case, since $b = c$, the positions of $b$ and $c$ on the number line coincide, so the result is again true. (For the pedantic reader: the second part of this argument is referring to the fact that $<$ is a well-defined relation — in terms of respecting equality).

## How arithmetic operations affect inequalities

This is what many people don't internalize very well, or are simply not exposed to. At the risk of being too verbose, I will explain some interactions of arithmetic operations and inequalities (we will focus on $<$, but such relations can be developed for other relations as well — like $>, \le$ and their combinations — and that is left as an exercise):

• Addition/subtraction: let's say you have an inequality $a < b$. If you shift both things by the same amount on the number line, their relative positions won't get flipped. In other words, if we add a constant $c$ to both $a$ and $b$, the relative ordering of $a+c$ and $b+c$ is the same as that of $a$ and $b$. That is, $a < b$ implies $a + c < b + c$. The same holds for subtraction. So this implies some sort of a cancellation law:
$a < b \iff a + c < b + c$
• Multiplication by $0$: Both sides of any inequality become $0$, so upon multiplication by $0$, inequalities become equalities.

• Multiplication by a positive number: let's say you have an inequality $a < b$. This means $0 < b - a$, or in other words, $b - a$ is a positive number. If we multiply it by a positive number $c$, it will remain positive. That is $0 < c(b - a)$. Adding $ac$ to both sides gives $ac < bc$. So:

$c > 0, a < b \implies ac < bc$
• Multiplication by a negative number: let's say you have an inequality $a < b$ and a negative number $c < 0$. Clearly, we have $a(-c) < b(-c)$ from the previous result. Adding $ac + bc$ to both sides gives $bc < ac$, or $ac > bc$. That is, the sign of the inequality gets flipped. So:
$c < 0, a < b \implies ac > bc$
• Adding two inequalities that go the same way: let's say you have two inequalities $a < b$ and $c < d$. Is it true that $a + c < b + d$? Let's try to reason about it using the previous result. Let's subtract $c$ from both sides of the inequality (we can do this by the previous result). Then $a + c < b + d \iff a < b + d - c$. Note that since $c < d$, doing the same thing gives us $0 < d - c$. This essentially means that $d - c$ is a positive number. If we shift $b$ by a positive number to the right, we are moving $b$ farther from $a$, and hence the shifted $b$ should still be $> a$. This means that $a < b + d - c$. Using the previous equivalence, $a + c < b + d$ is indeed true. All in all, we have shown that
$a < b, c < d \implies a + c < b + d$
• Subtracting two inequalities that go the opposite way: let's say you have two inequalities $a < b$ and $c > d$. Is it true that $a - c < b - d$? Note that multiplying $c > d$ by $-1$ gives $-c < -d$. So we have $a - c < b - d$ by our previous result. We have shown that
$a < b, c > d \implies a - c < b - d$

NOTE: Subtracting inequalities of the same type and adding inequalities of the opposite type does not work, and I leave it as an exercise for you to see which steps in the above proofs fail.

Multiplying and dividing inequalities is significantly more complicated, but they can be treated with the same ideas. More specifically, for multiplying two inequalities, you have to make cases on the signs of the 4 terms. For division, the idea is the same, but additionally you have to take care that some denominator is not 0.

## Floors and ceilings

Let's start out by the definition of the floor function and the ceiling functions:

• The floor function is a function $\lfloor \cdot \rfloor : \mathbb{R} \to \mathbb{Z}$ such that for all real numbers $x$, $\lfloor x \rfloor$ is the integer $n$ such that $n \le x < n + 1$.

• The ceiling function is a function $\lceil \cdot \rceil : \mathbb{R} \to \mathbb{Z}$ such that for all real numbers $x$, $\lceil x \rceil$ is the integer $n$ such that $n - 1 < x \le n$.

Clearly, if such $n$ exist, they are unique.

Why do these definitions make sense? Note that intervals of the form $[n, n + 1)$ for integer $n$ cover the number line disjointly (this is not a rigorous proof, but I don't want to deal with real numbers), so $x$ will always be in one of these intervals. The same property holds for intervals of the form $(n - 1, n]$.

As a special case, when $x$ is an integer, both the ceiling and the floor functions return $n$.

NOTE: Remember that $\lfloor 1.5 \rfloor = 1$, but $\lfloor -1.5 \rfloor = -2$, and not $-1$.

In fact, we can notice the following:

• $\lfloor x \rfloor$ is the largest integer $n$ such that $n \le x$.
• $\lceil x \rceil$ is the smallest integer $n$ such that $x \le n$.

The proof follows from the definition (we will show this only for floor, the ceiling case is analogous): let's say that this is not true, and there is some $a > n$ such that $a \le x$. $a > n$ means $a \ge n + 1$. But we know that $x < n + 1 \le a$, so $x < a$. This is a contradiction.

Note: As a consequence of the above fact, we have $\lfloor x \rfloor \le \lceil x \rceil$, and the difference between them is $1$ iff $x$ is not an integer; when $x$ is an integer, they are equal to $x$.

We can rephrase these statements in the following ways that can be useful while solving problems.

• For an integer $n$ and a real number $x$, the following holds: $n \le x \iff n \le \lfloor x \rfloor$.
• For an integer $n$ and a real number $x$, the following holds: $n \ge x \iff n \ge \lceil x \rceil$.

This is especially important because these kinds of iff statements help us reduce the large number of cases that can arise while framing and solving inequalities. For an example, consider this comment.

Let's try to rephrase the original definition into another way too. $n \le x < n + 1$ is equivalent to the two inequalities $n \le x$ and $x < n + 1$. The latter is equivalent to $x - 1 < n$. So combining these again shows that $x - 1 < n \le x$. These steps are reversible, so this means this can also function as an alternate definition of the floor function. Doing this similar for the ceiling function, we have the following properties (that are also alternate definitions):

• $x - 1 < \lfloor x \rfloor \le x$
• $x \le \lceil x \rceil < x + 1$

It is sometimes useful to think in terms of the fractional part of a number (i.e., distance from floor), since it is guaranteed to be a positive real number less than 1. Similarly, distance to ceiling is helpful too.

## Some more properties of floors and ceilings

Let's see what happens to the floor of a real number when we add (or equivalently, subtract) an integer.

Let $n' = \lfloor x + a \rfloor$ where $a$ is an integer. This is equivalent to saying that $n' \le x + a < n' + 1$, which is equivalent to $n' - a \le x < n' - a + 1$. Since $n'$ is an integer, $n' - a$ is also an integer. Hence this is equivalent to saying that $n' - a = \lfloor x \rfloor$. In other words, we have shown that

$a \in \mathbb{Z} \implies \lfloor x + a \rfloor = \lfloor x \rfloor + a$

Similarly, we can show that

$a \in \mathbb{Z} \implies \lceil x + a \rceil = \lceil x \rceil + a$

However, this kind of an identity doesn't hold for multiplication by an integer in general. For example, $\lfloor 1.5 \rfloor = 1$, and $\lfloor 2 \cdot 1.5 \rfloor = 3 \ne 2 \cdot \lfloor 1.5 \rfloor$.

Note that by the same example, we can show that $\lfloor x + y \rfloor$ is NOT equal to $\lfloor x \rfloor + \lfloor y \rfloor$ in general (and something similar for ceiling). However, it can be shown that the following are true:

$\lfloor x \rfloor + \lfloor y \rfloor \le \lfloor x + y \rfloor \le \lfloor x \rfloor + \lfloor y \rfloor + 1$
$\lceil x \rceil + \lceil y \rceil - 1 \le \lceil x + y \rceil \le \lceil x \rceil + \lceil y \rceil$

Let's now look at what happens when we multiply the argument of the functions by $-1$.

From $n \le x < n + 1 \iff -n - 1 < -x \le -n$, we can note that $-\lfloor x \rfloor = \lceil -x \rceil$. By replacing $x$ by $-x$ in this relation, we have $\lfloor -x \rfloor = - \lceil x \rceil$.

Since the floor and ceiling functions both return integers, applying any amount of floor or ceiling functions to the result will not change their value after one of them is already applied once.

One non-trivial property that sometimes helps is the following:

For an arbitrary positive integer $n$ and real number \$x:

• $\lfloor \frac{\lfloor x \rfloor}{n} \rfloor = \lfloor \frac{x}{n} \rfloor$
• $\lceil \frac{\lceil x \rceil}{n} \rceil = \lceil \frac{x}{n} \rceil$

Proving this requires some knowledge of Euclid's division lemma (remainder on dividing an integer by a positive integer is a non-negative integer smaller than the positive integer), and is left as an exercise for the reader.

For more properties and applications, I would refer you to the Wikipedia article on these functions.

## Some natural applications of floors and ceilings

• Quotient and remainder: The quotient on dividing a positive integer $a$ by another positive integer $b$ is by definition $\lfloor a / b \rfloor$. The remainder is $a - b \lfloor a / b \rfloor$.
• To count the number of multiples of $a$ that are $\le n$, we note that if there are $k$ multiples, then the largest multiple of $a$ is $ka$, and it should be $\le n$. Hence $k \le n/a$. This is also a sufficient condition for there being at least $k$ multiples of $a$ at most $n$. So $k = \lfloor n / a \rfloor$ by maximality of $k$.

## Examples of reasoning with the developed tools

Now that we have some results in our toolbox, we'll show how to actually solve problems cleanly without getting stuck thinking about multiple cases at once. We've already shown some ways of using these while deriving other results, so we'll keep this short and I'll just link to some problems instead.

## Some identities and bounds

I'll list out some non-trivial identities and bounds, and proving them is left as an exercise.

• Floors and ceilings for fractions: For a positive integer $b$ and an arbitrary integer $a$, we have $\lceil a/b \rceil = \lfloor (a + b - 1)/b \rfloor$.
• Gauss' property: For positive integers $m, n$, we have $\sum_{i = 0}^{n - 1} \lfloor im/n \rfloor = ((m - 1)(n - 1) + \gcd(m, n) - 1) / 2$. To prove this, try looking at the number of points below the line $y = mx/n$ inside the rectangle with corners $(0, 0)$ and $(m, n)$.
• Hermite's identities: For a positive integer $n$ and a real number $x$, the following are true: $\lfloor nx \rfloor = \sum_{i = 0}^{n - 1} \lfloor x + i/n \rfloor$ and $\lceil nx \rceil = \sum_{i = 0}^{n - 1} \lceil x - i/n \rceil$. This can be proved by making cases on whether the fractional part (respectively, distance to ceiling) is between $i/n$ and $(i + 1)/n$.
• Harmonic bounds: $\sum_{i = 1}^n \lfloor x / i \rfloor \le x \sum_{i = 1}^n 1/i = O(x \log n)$. As a consequence, a sieve that considers all divisors from $1$ to $n$ for a total of $x$ consecutive integers takes $O(x \log n)$ time. This is why both the normal sieve of Eratosthenes and the segmented variant are fast enough. This also shows that the sum of divisors of a positive integer $n$ is at most $O(n \log n)$.

## Programming language specific details

For the sake of sanity, it is recommended to ensure that whenever you are trying to perform integer division in any programming language, try to convert it into a form where the denominator is positive. However, for most languages, it is always true that (a / b) * b + (a % b) evaluates to a.

There are typically two behaviors when performing integer division with a positive divisor:

• Computing the floor in all cases: this is the case for Python, for example. Hence a // b will give $\lfloor a / b \rfloor$, and a % b will give the non-negative remainder when $a$ is divided by $b$.
• Rounding the result towards $0$: this is the case in C++, for example. One reason behind this is to prevent overflow when multiplying the result by the denominator. So, for positive $a$, the behaviour is the same as in Python, but for negative $a$, it returns $\lceil a / b \rceil$, and a % b gives the negative distance to the corresponding multiple of $b$.

Note that to compute the ceiling of some $a / b$ for positive integer $b$ and integer $a$, we can use the result mentioned earlier: $\lceil a / b \rceil = \lfloor (a + b - 1) / b \rfloor$. Implementing code for correctly computing floors and ceilings in both C++ and Python is an instructive exercise.

Keep in mind that std::floor and std::ceil are operations on floating point numbers, and their results are also floating point numbers. Hence they might not be exact when converted to integers, and hence it is not recommended to use them for most practical purposes.

By nor, history, 3 months ago, As was noted in this blog, Google decided to discontinue its programming contests (Code Jam, Kick Start and Hash Code).

It seems they will take down all relevant servers (and hence the problems and the submission system) for all these contests. While there are some initiatives to archive past Code Jam and Kick Start problems (for example, here — also have a look at this comment), there seems to be nothing similar for Hash Code.

As a past participant of Hash Code, I believe that these types of contests are quite important too, and may have directly or indirectly inspired other such contests (like the Reply Challenge and AtCoder Heuristic Contests). Also there are some techniques like simulated annealing and using SAT/ILP solvers that might not show up in standard competitive programming websites but do show up in such optimization contests. The tutorials for them will be lacking without examples too.

So here are my questions to anyone who might have reliable information about the contests, since everything seems to be in a mess right now:

1. Are there copyright issues that prevent the problems from being archived and open for submissions elsewhere (for a non-profit scenario)? This also holds for the BOJ hosting, so it might be possible that they have discussed the relevant issues already (though I am not aware of any such details).
2. Does anyone have any ideas on how to archive these problems? For instance, what kinds of data need to be preserved (statements? scorers? output submissions? code submissions? ranklists?).
3. If all goes well, is it possible to set this up with other existing Code Jam and Kick Start archives?
By nor, 4 months ago, I've been involved with testing a lot of Codeforces contests and coordinating contests for my uni, and have seen various kinds of issues crop up. However, I was able to come up with a stable system for my uni contests that helped make the experience as educational and error-free as possible, and I have found the lack of some of those things to be the main reason behind issues faced by Codeforces contests.

Given that it takes months to get a round ready, it is better to be more careful than necessary compared to getting your round unrated or having bad feedback in general.

I'd like to thank Ari, null_awe and prabowo for reviewing the blog and discussing more about problem-setting.

The current Codeforces contest guidelines are here for problem-setters and here for testers, and they have some great sanity checks that will help you make a good contest. But I still think it is missing some things that leads to bad contests at times.

## Issues

Let's list out some issues that plague contests.

1. Wrong intended solution or tests.
2. Weak tests.
3. TL or ML too tight or too loose.
4. Some hacks are not accepted.
5. Wrong IO format.
6. Checker issues — too stringent, too loose, or simply wrong.
7. Guessable problems.
8. Problems that have appeared before.
9. Unbalanced contests — difficulty-wise.
10. Unbalanced contests — topic-wise.
11. Long queue, system unavailability.

Out of these, the most glaring issue is having a wrong intended solution itself, and leads to making the contest unrated immediately.

## Discussion

I think the fundamental problem behind most of these issues is two-fold:

• Lack of rigor while setting a problem.
• Not adopting a security-based approach to testing (pentests, for example).

Whenever I coordinated contests at my uni, I sent out a couple of guideline documents for setters and testers, that built upon these two ideas. The guiding principle for the first of these reasons (that every author and tester should understand, in my opinion) is consistency (and proofs of it). Not just one type of consistency, but consistency of the intent, constraints, statement, checker, validator and tests — all at the same time.

NOTE: Before we go forward, let's clarify what we mean by a formal proof. In this comment, I clarified that we are not talking about scary super-technical proofs (such as ones you might encounter in advanced set theory). A proof that is logically sound, where steps follow from basic logical reasoning and don't have gaps, is good enough. I call it a "formal proof" just because people on Codeforces call anything that looks reasonable a "proof" (which is obviously wrong and misleading) — and that is not what I am talking about. If you're not very familiar with the idea of a mathematical proof, don't worry. It is not meant to be too hard. I would recommend reading the chapter on proofs in this book, to understand what kind of logical statements I am talking about, and the idea behind proofs generally.

Let's understand this by asking ourselves the following questions:

• What's the intent of the problem (not in terms of ideas, but from a contest perspective)? Usually it is "solve this problem in xyz time complexity and/or abc memory complexity" or "solve this problem in these many queries". Constraints on the input and output, time limit and memory limit arise from this naturally.
• What do we want the contestants to solve? This corresponds to a precise and unambiguous statement of the problem, including the IO format, guarantees on special kinds of tests (if any) and so on. Unfortunately, there is no way to formally check that your Polygon materials are consistent with this problem statement, which is why it becomes necessary to ensure that the statements are such that your solution can be derived formally from the problem statement.
• How do the contestants prove that they have solved the problem? In other types of reasoning/math contests, you have to provide a proof of correctness to show that you have solved the problem. This is also the approach taken by academia. However, unless you force participants to submit a solution in proof languages like Coq or Lean, it is impossible (or very hard) to enforce that they have a proof of correctness for their arguments and time complexity of their algorithms. This is where tests come in. "Strong" tests mean that any code that comes from a solution that doesn't satisfy the intent of the problem fails to get accepted on those tests. Since the constraints are bounded, the set of all possible tests is finite, but still huge enough that adding them all is infeasible. So to combat that, a good practice is to add all possible tests for small constraints, some random max tests, and tests that break all* possible heuristic solutions.
• How to verify that our testing procedure is correct? Of course, if you feed in garbage data (i.e., data that doesn't satisfy the preconditions of a program), the output might be garbage. If there are inconsistencies in the tests, it is impossible to verify if the submitted solutions are correct (as in they solve the problem as stated). The same goes for tests that fall outside the constraints.

## Solutions to the issues — the guidelines

We will now list the guidelines that help make contests error-free. Most of these will have both an author and a tester part. Keep in mind that these guidelines will be along the lines of a security perspective.

Note that there are some things that we can completely verify in a mathematical sense by proofs. But there are also things that we can't be sure of. One of the reasons behind it is that the halting problem is undecidable. So, for those kinds of things, we enforce some kinds of limits on the runtime or the memory requirements, and since we also want to check the behavior of submitted solutions on large inputs, we miss out on a lot of tests due to the combinatorial explosion of the number of tests that happens for most problems.

### Provable things

Firstly, let's see what we can prove the correctness of. Keep in mind that a proof of consistency is always an "if and only if" argument.

#### Statement and solution

This is the part that comes even before setting the problem on Polygon or any other contest preparation system. Ensure that you have the following things:

• Formal statement
• Formal algorithm
• Formal proof of correctness of algorithm

None of this should have actual numbers in it (keep everything in terms of what is defined in the input, and don't think about the constraints for now). Don't even implement it in a programming language; use constructs in the programming language as well as standard algorithms whose proof of correctness is well-known as a black-box.

Your proof and algorithm should depend only on the statement and logic. Don't implicitly add in any other assumptions or make wrong assumptions in the proof (if you write the proof, it will feel off anyway, so it's quite hard to make a mistake like this if you follow this guideline).

Some notes:

• The algorithm and the proof of correctness are two different things. Saying that "it can be easily seen that this greedy works" or "finding max flow in this graph gives the answer" is not a proof.
• Write out everything mathematically, even if it is obvious to you. This is the foundation of the whole problem, and if there are issues in this, is there even a point to preparing the problem, or even thinking about it?
• I recommend reading this article and the links it refers to, to understand what I mean by a proof here. If you don't know how to structure things in the form of a proof, perhaps you are not yet ready for writing a problem for a contest yet. However, it is never too late to learn, and your idea behind the problem might be correct. A formal proof gives you a way to communicate things with others unambiguously, which is important since natural language is imprecise. Writing formal proofs is not hard, and apart from being a tool to communicate, it cements your understanding as well.
• It might be possible that there is a solution faster/better than your intended solution. Don't be afraid to use it as the intended solution (in fact, try to find as many solutions as possible).

Doing this step properly eliminates the possibility of having wrong statements or wrong intended solutions. It also avoids situations like this where the intended solution was "fast enough" on the tests but the problem in general didn't seem to be solvable fast enough. Incidentally, that blog is a good example of what can go wrong in a real-life setting.

#### Constraints and test validation

Setting constraints is the part which requires getting your hands dirty. Let's ignore that for now, and let's say we have some constraints.

How do you ensure that your tests are correct? Think of it as a competitive programming problem again, though a very simple one. This is where a validator comes in. With a validator, you validate the tests based on what the problem states. The reasoning behind proving the correctness of the validator should go like this:

We show that the validator accepts a test if and only if the test is valid according to the problem statement. For the if-direction, [do something]. Now for the only-if-direction, [do something].

I recommend keeping the numbers in this proof generic enough, as is done in most math proofs. We will plug in the numbers later, when we discuss the constraints. Having this proof handy is quite important. Sometimes, when preparing a problem, you end up changing the constraints later on, but forget to update the validator. If you keep this proof as a comment in the validator itself, and inspect all the contest materials before committing changes, you will immediately catch inconsistencies in the statement and the validator.

Also, validators are important when hacks are enabled. If your validator is too loose, you risk some AC solutions getting hacked. And if your validator is too tight, some hacks might be invalid, and there are some hidden conditions in the problem that the hacker might be able to guess, which makes the contest unfair. For example, this contest had such an issue.

For choosing constraints, you need to write some code to verify that the TL and ML are fine and so on. However, this is not something you can prove even in a handwavy sense (unless you know the assembly that your code generates and the time every instruction takes in being executed in that order, on the worst possible test), so we will keep it for later.

#### Checker

This corresponds to checking that the contestant's solution indeed outputs something that is valid, according to the problem statement.

Again, as in the validator, using a formal proof, ensure that the output satisfies all the conditions for it to be accepted as a valid output by the statement, and don't add any extra checks. This is important because there have been CF contests where the checker checked something much stronger about the structure of the solution. A very simple example is "print a graph that is connected". This doesn't mention that the graph can't have multiple edges or loops, but many people will just write a checker that will add these conditions anyway. This is another reason why adding links to precise definitions in the problem is a good thing.

Sometimes writing the checker is harder than solving the original problem itself, and there have been instances in certain CF contests where writing the checker was the second subtask of a problem! Keep this in mind while setting problems, since it is useless if you can't write checkers. Sometimes, it might make sense to ask for only a partial proof of solution from a contestant — like a certain aspect that you can compute only if you use some technique, or something that is weaker than asking for the output, but is a reasonable thing to ask for (this changes the problem itself, though, and hence it doesn't show up as an issue on the contestant-facing side of problem-setting).

### Things we have to manage heuristically

As we mentioned earlier, almost everything about setting constraints and creating tests is something that we have to manage heuristically. Both of these things go hand-in-hand, which makes the whole process quite iterative. Some obvious guidelines that help in these cases are given in the next section.

### Condensed list of guidelines for testers and authors

#### Common guidelines for setters and testers

1. Read this blog and the links in this post that point to CF guidelines. Also read the links to issues in actual CF contests and think about what went wrong.
2. Keep the contest preparation process interactive and discuss as many solutions as possible, as well as potential changes to problems.

#### Guidelines for setters

1. Ensure that you know what testers are meant to help you with, and read guidelines meant for them as a starting point.
2. Have a formal write-up describing the problem, algorithm and the proof, as described in the blog. The formal editorial for a problem should be ready before you prepare the problem.
3. Keep the problem statements short, clean and unambiguous. Use lots of space in unavoidable situations. Avoid using examples in statements that unnecessarily obstruct the flow of the reader, and write the statement in a way that doesn't need you to rely on examples in the first place.
4. Ensure that your model solution is readable and is provably correct (and it doesn't have any undefined behavior and so on). Preferably write it in a "safe" language, and turn on all debug flags while testing the main solution (this is something I have wanted on Polygon, but there's no support for that unfortunately).
5. Write strict validators and checkers, and prove their correctness too (both in terms of semantics as well as formatting). When writing checkers, do NOT assume that the jury solution is correct. Rather, check the correctness of both the jury solution as well as the participant's solution.
6. Have exhaustive test cases — ideally, use a bruteforce to generate all possible tests for some small sub-constraints to ensure correctness, corner cases, tricky large cases for TLE and random large cases. Ensure that you cover all possible special cases, and look for specific solutions that can give AC but do not work on the general case (like faulty greedy solutions and so on). Keep in mind unordered_map and unordered_set like peculiarities in the languages.
7. Ensure that the time limits are not too tight, and they are at least 2-3 times the slowest AC solution you have. Try to keep constraints in a way that the most optimized unintended solution either doesn't pass (by a good margin — like twice the TL), or barely passes. In the latter case, you should be reconsidering whether you want this problem to appear on the contest at all.
8. Before relying on the testers, try to test your own problems according to the tester guidelines.
9. Maintain something like a spreadsheet for tracking status and feedback of problems, and current progress.

#### Guidelines for testers

1. Ensure that you know what setters are meant to do, and read guidelines meant for them as a starting point.
2. Ensure that the problem statements are water-tight and self-contained (not a lot of background knowledge necessary, and wherever necessary, links should be present) and provided examples clarify unavoidable ambiguities (which should ideally be non-existent).
3. After testing, confirm that the authors' proofs are all correct. This is not limited to just proofs of correctness of the intended solution, but also for the checker, validator and so on. Just because you couldn't solve a problem in the virtual contest testing doesn't mean you mustn't care about the other problems.
4. Try to hack the problem: submit heuristic solutions, both in the sense of potential WA as well as potential TLE. If they pass on all cases, try to come up with cases they can’t pass on, or prove that this heuristic is correct/fast enough. Such a situation is quite instructive whenever it arises (from experience, this happens usually when there is a greedy solution involved which doesn’t really need the full strength of the greedy decision).
5. Try to write the slowest possible reasonable solutions that a contestant can come up with, with the same complexity as the author’s solution. Also try to come up with highly optimized brute-force solutions (using caching, bitsets, precomputation of answers and so on) and see if they can pass the test cases. Look for test cases that these can’t pass, but the official solution passes. In case the official solution and the optimized brute-force both have similar speeds, discuss what to do next with the coordinator and the author.
6. If you come up with an alternative approach with a better or slightly worse complexity, notify the problem setter, and depending on what the intended difficulty of the problem is, discuss whether to make the problem harder/easier.
7. Give detailed feedback on the quality of problem statements, problem content, editorials, solutions, test cases, perceived hardness of the contest, and anything else that will help improve the perceived quality of the contest.

### Handling other non-obvious issues.

At this point, there are the following issues we haven't addressed.

1. Guessable problems.
2. Problems that have appeared before.
3. Unbalanced contests — difficulty-wise.
4. Unbalanced contests — topic-wise.
5. Long queue, system unavailability.

The first four issues are solved only by testing and being consciously aware of the fact that they need to be fixed.

If problems are guessable, it makes sense to ask contestants to construct something that shows that their guess is correct. For example, it might be easy to guess that the size of the answer for a problem is $n/2$. This is very guessable, and in this case, it is better to ask for the construction of the answer as well.

For avoiding problems that have appeared before, since there is currently no repository of all competitive programming (or math contest) problems that is searchable by context, the only way is to invite a lot of experienced testers with different backgrounds.

For avoiding unbalanced contests (both in terms of difficulty and topics), it is important to have a large variety of testers — people with various amounts of experience as well as various types of topics that they are strong at.

The last type of issue is something unique to online contests that are on a large scale — for example, CF contests. This is the main reason why pretests exist in the first case. To avoid long queues, usually there are at most 2-3 pretests for problems like A and B. To make strong pretests that are also minimal, there are two ways of choosing tests (which can be combined too):

1. Choose an exhaustive test (on smaller constraints), and a random max-test.
2. Pick all possible tests that fail solutions that you encountered in testing.

This is one of the reasons why the first couple of problems are multi-test in nature.

## Conclusion

If there's anything I missed out, please let me know, since it's been quite some time since I last coordinated a contest. Feel free to discuss on various aspects of problem-setting in this context, and I would request coordinators to also share their to-do list when they start coordinating a contest, right from the very beginning of the review stage.

By nor, 5 months ago, Disclaimer: I am not an expert in the field of the psychology of learning and problem-solving, so take the following with a grain of salt. There is not much "scientific" evidence for this blog, and the following is validated by personal experience and the experiences of people I know (who fall everywhere on the "success" spectrum — from greys to reds in competitive programming, from beginners in math to IMO gold medalists, and from people with zero research experience to people with monumental publications for their own fields). This blog only covers one possible way of thinking about knowledge organization and retrieval that I have been using for the past decade or so, but there definitely will be other models that might work even better. Even though I just have a few courses worth of formal psychology experience, I still think the content of this blog should be relevant to people involved with problem-solving.

I hope this helps both students (to make their learning efficient) as well as educators (to make their teaching methodology more concrete and participate actively in their students' learning process).

Note that the techniques mentioned in this blog (or any resource that tries to explain learning and reasoning) will not work for you if your basics behind reasoning are flawed. If you're a beginner at reasoning in terms of proofs, I recommend this book to understand what mathematical reasoning is like. Intuition is just a guide — it is developed and applied only through multiple instances of rigorous reasoning.

Note: This blog is quite long, so reading it bit-by-bit should be better than skimming it in one go. I have also added a few links to some really good resources within the text, where I felt that it would be better to link to pre-existing stuff to avoid repetition and give a more varied perspective.

Here's what we'll discuss in the blog:

• Why even bother understanding learning, and how do I use that knowledge?
• A mental model of how knowledge is stored — the data structure
• Learning methods — constructing the data structure
• Learning via problems
• Applying creativity to create new ideas and contexts
• Reasoning — querying the data structure
• Reasoning techniques
• "Wishful thinking"
• Intuition
• Rigor
• On developing general reasoning techniques
• Some limitations of human problem-solving and overcoming them
• Working memory
• Motivating creativity
• Some real-life examples
• How learning can look like for you
• A few useful high-level ideas
• How solving problems can look like
• How you can take feedback from this knowledge graph
By nor, 5 months ago, Firstly, the CF Catalog is a great feature that makes it much easier to learn topics by having a lot of educational content in the same place, so thanks MikeMirzayanov!

I was originally going to post this as a comment under the catalog announcement blog, but it became too large for a single comment, and it will probably need some more discussion.

I want to suggest the following improvements for the catalog (and initiate a discussion on how best to tackle these issues with the current version):

1. Multiple topics per blog
2. Blog submission for review

The reasons and detailed suggestions are as follows:

By nor, 5 months ago, People often say C++ is much more cumbersome to write than any sane language should be, but I think that is mainly due to lack of knowledge about how to write modern C++ code. Here's how you can write code more smartly and succinctly in C++(20):

### Negative indexing with end

Have you ever found yourself using a vector like a stack just because you have to also access the last few positions, or writing a sliding window code using a deque (where the size might be variable, but you want to access an element at some offset from the end)?

Chances are that you have written code like this:

while (stk.size() >= 3 && stk[stk.size() - 2] + stk[stk.size() - 1] > stk[stk.size() - 3]) {
stk.pop_back();
}


A cleaner way of writing the same thing is this:

while (stk.size() >= 3 && end(stk)[-2] + end(stk)[-1] > end(stk)[-3]) {
stk.pop_back();
}


This trick works not just for end, but for any iterator, provided that (roughly) there is no undefined behaviour like out of bounds accesses.

This can be done only with iterators which provide random access. For example, iterators for vectors but not for sets.

### Free functions like len and str

Python's len and str come in handy a lot of times, and they're convenient to use because they're free functions (that is, not member functions).

Fortunately, C++ has its own set of free functions that are quite helpful:

1. size(v): It returns the unsigned size of $v$, and does the same thing as v.size() That is, its numerical value is the size of $v$, and type is some unspecified unsigned type (aliased to size_t). This also works for arrays (but not raw pointers).
2. ssize(v): It returns the signed size of $v$. This is quite useful, because it prevents bugs like a.size() - 1 on empty a. It doesn't have a member function analogue, though. This also works for arrays (but not raw pointers). If you use sz macros, you can safely remove them from your code!
3. begin(v), end(v), rbegin(v), rend(v): They do the same thing as what their corresponding member functions do. These also work for arrays (but not raw pointers).
4. empty(v): This checks if $v$ is empty or not. Analogous to v.empty()
5. data(v): This returns a pointer to the block of data where it works.
6. to_string(x): This returns the string representation of a datatype. They're only defined for things like int and double though.

### Variable declaration in loops/conditions

If you have used range-based for loops, you might have noticed that there is no support for indexing by default. So people usually either just switch to index based loop, or do the following:

int i = -1;
for (auto x : a) {
i++;
// do something
}
// or
{
int i = -1;
for (auto x : a) {
i++;
// do something
}
}


However, it either pollutes the parent scope with the variable i and makes things buggy, or just looks ugly. C++20 has the following feature that makes it cleaner:

for (int i = -1; auto x : a) {
i++;
// do something
}


Similarly, if there is an object such that you only have to inspect its properties just once and it becomes useless later on, you can construct it in the init part of the if condition, like this:

if (auto x = construct_object(); x.good() && !x.bad()) {
// do something
}


### Unpacking structs

A lot of people write code that looks like this:

std::vector<std::tuple<int, char, std::string>> a;
// populate a
for (int i = 0; i < (int)a.size(); ++i) {
int x = std::get<0>(a[i]);
char y = std::get<1>(a[i]);
std::string z = std::get<2>(a[i]);
}
// or
for (auto& A : a) {
int x = std::get<0>(A);
char y = std::get<1>(A);
std::string z = std::get<2>(A);
}
// or
for (auto& A : a) {
int x; char y; std::string z;
std::tie(x, y, z) = A;
}


A cleaner and more consistent way of writing this can be:

for (auto [x, y, z] : a) {
}
// or
for (auto& A : a) {
auto [x, y, z] = A;
}


If you want references instead of copies stored in x, y, z, then you can do this:

for (auto& [x, y, z] : a) {
}
// or
for (auto& A : a) {
auto& [x, y, z] = A;
}


Note that this is not limited to for loops, or even to just tuples, and you can use this sort of unpacking (structured binding) for most structs (in particular, custom structs you make, or std::pair). In fact, if you have std::array<int, 3>, then this unpacking will give you its contents in three different variables.

### Some more utilities

There are some lesser known functionalities that C++20 introduced, like starts_with and ends_with on strings. Some of the more useful ones can be found here.

Notably, there is also support for binary searching like we do in competitive programming, and the relevant information can be found in the language specific section of this blog.

You can also print a number in binary, analogously to what you can do in Python. This can be done by constructing a bitset out of the integer and converting it to a string, or just by printing std::format("{:b}", a) (which is unfortunately not on CF yet).

### emplace_back, emplace_front and emplace

Seriously, stop using a.push_back(make_pair(1, 2)) and a.push_back(make_tuple(1, 2, 3)) and even a.push_back({1, 2, 3}), and start using a.emplace_back(1, 2) instead.

The only reasonably common place this doesn't work is when you have a container that stores std::array, so emplace_back doesn't work; in that case a.push_back({1, 2, 3}) is fine.

The other variants: emplace_front and emplace do the same thing for their other counterparts (for example, in a deque, queue or priority_queue).

### Lambdas

Lambdas in C++ are in fact cleaner and more intuitive (with more features) than in Python. Compared to both Python lambdas (being able to have more than 1 line of code) and local functions (having mutable states more intuitively and capturing parameters from an ancestor scope), in fact. This is because you can capture stuff more explicitly, have mutable state, can pass them as function objects, and even implement recursion using generic lambdas, though C++23 will have a feature called deducing this which will make a lot of things much simpler.

They can be passed around to almost all algorithms in the STL algorithms library, which is part of the reason why the STL is so powerful and generic.

They can also be used to make certain functions local, and avoid polluting the global namespace with function definitions. One more thing about them is that they are constexpr by default, which is not true for traditional functions.

### Vector-like data structure without iterator invalidation

You probably know what we're coming to at this point: std::deque. It is essentially std::vector (without $O(1)$ move guarantees but whatever), but with no iterator invalidation on resizing/adding elements.

That is why it is ideal in a very important setting:

• Making buffers of elements (so that we can use pointers to elements without worrying about what happens when the buffer is resized).

No more Python laughing in our face about having to worry about iterator invalidation!

### Python style ranges and even more flexible range manipulation

##### Using temporary container in for loop

Suppose you want to do something for multiple arguments, but they aren't indices. Most people would either duplicate code, or do something like this:

vector<char> a{'L', 'R', 'D', 'U'};
for (auto x : a) { /* do something */ }


But this has a disadvantage: it needs more characters and also makes it more error-prone to use vectors. Instead of this, you can do this:

for (auto x : {'L', 'R', 'D', 'U'}) { /* do something */ }


or even

for (auto x : "LRDU"s) { /* do something */ }


Note that the s at the end makes it a string literal which is why it works, and const char* doesn't work because there is a \0 at the end of the literal. This can still allocate on the heap (not in this case, only for big strings due to SSO).

In case you explicitly want an allocation with a vector for some reason, you can use

for (auto x : std::vector{'L', 'R', 'D', 'U'})


or

for (auto x : std::vector<char>{'L', 'R', 'D', 'U'})

##### Ranges

C++20 has a feature called ranges, that allows you to do manipulations on, well, ranges.

Let's think about how to implement this: For integers between 1 to 5, multiply them by 2, and print the last 3 elements in reversed order.

#include <bits/stdc++.h>

int main() {
std::vector<int> a{1, 2, 3, 4, 5};
for (auto& x : a) x *= 2;
for (int i = 0; i < 3; ++i) std::cout << a[(int)a.size() - i - 1] << '\n';
}


However, this is prone to having bugs (what about out of bounds errors in some other scenario?), and doesn't look intuitive at all.

Now consider this:

#include <bits/stdc++.h>

namespace R = std::ranges;
namespace V = std::ranges::views;

int main() {
for (auto x : V::iota(1, 6) | V::transform([](int x) { return x * 2; }) | V::reverse | V::take(3))
std::cout << x << '\n';
}


Here the intent is pretty clear. You take integers in $[1, 6)$, multiply them by 2, reverse them, and take the first 3. Now in this resulting range of numbers, you print each of them.

So let's see what happens here. The | operator is defined for ranges and an operation on ranges that returns a range. For example, V::transform(range, lambda) gives a range where every element is transformed, and doing range | V::transform(lambda) does the same thing. In fact, for a lot of cases, this isn't stored in memory at all (and operations are done while iterating on the fly), unless you have things like reversing a range.

One thing that you can finally do is split a string with delimiters in a nice way (which used to be a major point of ridicule directed towards C++ from a lot of Python users). This is as simple as using a split view in the ranges library. For more information, see this. Here's an example taken verbatim from the cppreference page:

#include <iomanip>
#include <iostream>
#include <ranges>
#include <string_view>

int main() {
constexpr std::string_view words{"Hello^_^C++^_^20^_^!"};
constexpr std::string_view delim{"^_^"};
for (const auto word : std::views::split(words, delim))
std::cout << std::quoted(std::string_view{word.begin(), word.end()}) << ' ';
}


A more common example can be like this:

#include <bits/stdc++.h>

namespace R = std::ranges;
namespace V = std::ranges::views;

int main() {
std::cin.tie(nullptr)->sync_with_stdio(false);
std::string s(1000000, '0');
for (int i = 0; i < 1000000 / 2; ++i) s[i * 2] = '1';
for (const auto word : V::split(s, '1'))
std::cout << std::string_view{begin(word), end(word)} << ' ';
}


Some ranges functionality is as follows:

• converting to and from a range: conversion to a range is implicit for most STL containers. If r is a range, then for most STL containers, like std::vector, you can do something like std::vector<int> a(r.begin(), r.end()). C++23 has a feature R::to<ContainerType>(range) that will construct a container out of a range, and you could even do range | R::to<ContainerType>().
• iota: This works similarly to Python's range() but without the stride.
• reverse: This works similarly to Python's reversed()
• take: This takes the first $n$ elements of a range (unless you run out of elements by then)
• drop: This takes everything but the first $n$ elements of a range (unless you run out of elements by then)
• counted: This takes $n$ elements of a range starting from an iterator.
• keys, values: keys takes a range whose elements are pair-like and returns a range of their first elements (so if we have (key, value) pairs, it will return keys), similar to .keys() in Python. Similarly, values returns a range of the values.
• elements: if we have a range of tuple-like elements, this takes the $n$-th members of each of these tuples (where $N$ is a compile-time constant).
• all: though type conversions from containers to ranges are usually implicit, it takes a container and takes everything in that container.
• Some upcoming C++23 features that will make it even more convenient to use ranges: to, zip, slide, stride, repeat, adjacent, pairwise, cartesian_product.

Note that this is not the only thing you can do with ranges. In fact, with almost all possible algorithms that you can find in the algorithms library that use a begin and an end iterator (like std::sort or std::partial_sum or std::all_of), you can find a ranges version here, where you can just pass the range/container name instead of the begin and end iterators you usually pass.

By the way, if you don't want to wait for C++23, you can use some of the functionality from this code, which I used to use a couple of years ago when C++20 was not a thing.

Spoiler

Finally, I'd like to thank ToxicPie9, meooow and magnus.hegdahl for discussing certain contents of this blog, and other members of the AC server for giving their inputs too.

By nor, 5 months ago, This blog is intended to be an introduction to permutations for beginners, as there seem to be no resources other than cryptic (for beginners) editorials that talk about some cycles/composition, and people unfamiliar with permutations are left wondering what those things are.

We'll break the blog into three major parts based on how we most commonly look at permutations. Of course, even though we will treat these approaches separately (though not so much for the last two approaches), in many problems, you might need to use multiple approaches at the same time and maybe even something completely new. Usually, you'd use these ideas in conjunction with something like greedy/DP after you realize the underlying setup.

An ordering of the sections in decreasing order of importance according to the current permutations meta would be: cycle decomposition, permutation composition, ordering.

However, the topics are introduced in a way so that things make more sense, because there are some dependencies in definitions.

1. Definition, basic properties, and notation
• Definition
• Two-line notation
• Permutation invariants
2. The ordering perspective
• Fixed points
• Derangements
• Inversions
• Longest increasing subsequences, and Erdos Szekeres theorem
• Next permutation
• Random permutations
3. The cycle decomposition perspective
• Cycles
• Cycle notation
• Finding $k$-th next, functional graphs
• Swaps/transpositions
• Swaps on cycles
4. The permutation composition perspective
• Definition
• Examples — cycles and swaps
• Parity of permutations
• Parity under composition
• Inverse of permutations
• Involutions
• $k$-th power of permutations
• Order of permutations
• Square-root of permutations
• Conjugation and conjugacy classes
5. Miscellaneous topics (brief discussion)
• Group theory
• Counting: Burnside, Polya enumeration, Stirling numbers
• Coordinate compression
• Connected component DP
• Young tableaus
• Schreier Sims algorithm
• Permutation tree
6. Problems

## Definition, basic properties, and notation

To start with, here's the definition of a permutation that you might find in many problems on Codeforces:

Definition: A permutation of size $n$ is an array of size $n$ where each integer from $1$ to $n$ appears exactly once.

But why do we care about these arrays/sequences? The reason behind this is simple.

Consider any sequence $a$ of $n$ integers. We can index them with integers from $1$ to $n$. So the $i$-th integer in the sequence is $a_i$. To understand the following, it helps to consider the example where $a_i = i$ for all $i$.

Let's now shuffle this sequence (elements are allowed to stay at their original places). We will try to construct a permutation from this shuffling in two ways:

1. What is the new index of the element that was initially at position $i$? (In the case when $a_i = i$, this just means finding the position of $i$ in the new array). Let's say this index was $f(i)$. Then $f$ is a function from $\{1, 2, \dots, n\}$ to itself. Now notice that when we write out the array $[f(1), f(2), \dots, f(n)]$, this satisfies the definition of a permutation! And we can easily construct a shuffling operation corresponding to every permutation just by reversing this process.
2. What was the old index of the element that is now at position $i$? (In the case when $a_i = i$, this just means looking at position $i$ in the new array and reporting the number there). Let's say this index is $g(i)$. Then $g$ is again a function from $\{1, 2, \dots, n\}$ to itself. When we write out the array $[g(1), g(2), \dots, g(n)]$, this is again a permutation! Note that $f$ and $g$ are not defined the same way, and the generated permutations are, in fact, distinct. In fact, it should not be hard to see that $f(g(x)) = g(f(x)) = x$ for any $x$ in $\{1, 2, \dots, n\}$.

This means shuffling around elements (or permuting them, in standard terminology) is another way to think about permutations.

This shows us how permutations are related to functions from $\{1, 2, \dots, n\}$ to itself such that all images are distinct (and all elements have a unique pre-image). In other words, a permutation can be thought of simply as a bijection on a set, which is the most natural definition to use for quite a few applications. The second interpretation is what corresponds to this definition, and the thing in the first definition is what we sometimes call the position array of the permutation in the second definition (this name is not standard). We will later see that it is what is called the inverse of the permutation.

Note that the above explanation is quite important for having an intuition on what a permutation is. If you don't understand some parts, I recommend re-reading this and trying a few simple examples until you're comfortable with what the permutation and functions $f$ and $g$ have to do with each other.

#### Two-line notation

Note that a permutation $[a_1, a_2, \dots, a_n]$ of $[1, 2, \dots, n]$ corresponds to a function $f$ on $\{1, 2, \dots, n\}$ defined by $f(i) = a_i$. Implicitly speaking, the set of pairs $(i, a_i)$ uniquely determines the permutation (it is just the set of mappings from position to element) and vice versa. To understand compositions and inverses (which will be introduced later on) better, the following notation can come in handy:

\begin{equation*} \begin{pmatrix} 1 & 2 & \cdots & n \\ a_1 & a_2 & \cdots & a_n \end{pmatrix} \end{equation*}

Here we have just arranged the pairs vertically, from left to right. Note that these pairs can be shuffled around (like dominoes), and this can lead to some nice properties:

1. What happens when we invert the pairs (that is, replace $(i, a_i)$ by $(a_i, i)$ for all $i$, which corresponds to swapping the two rows)? If the shown permutation corresponds to the function $g$, the permutation after inverting the pairs corresponds to the function $f$! This should be clear by looking at how $f$ and $g$ were defined earlier. So this sort of operation leads to a function inverse.
2. What happens if we try to combine two permutations by trying to match the lower half of the first permutation $p_1$ with the upper half of the second permutation $p_2$? If the $g$ for $p_1$ is $g_1$, and that for $p_2$ is $g_2$, then it can be verified that the new permutation's $g$ is just the function formed by applying $g_1$, then $g_2$. So this sort of operation leads to function composition.

#### Permutation invariants

Note that when a permutation is sorted, it leads to $[1, 2, \dots, n]$. So any accumulation operation on the permutation array (like sum, product, xor, sum of squares, number of odd integers, etc.) that doesn't rely on the order of elements in an array will give the same value for all permutations of the same length.

## The ordering perspective

#### Fixed points

A fixed point of a permutation $a$ is an index $i$ such that $a_i = i$. These are essentially the values that are not affected by the permutation at all, so if we disregard what happens to these values, we won't be losing out on any information for that permutation $a$. This is also why these $i$-s are sometimes skipped in the two-line notation (and also in the cycle notation, which will be introduced in the next section).

Note: If you are familiar with cycles or are on your second reading, you might note that this idea is probably more appropriate for cycles and compositions, but I kept it in this section since it shares some ideas with other subsections in this section. The same goes for the next section.

#### Derangements

A derangement is a permutation with no fixed points. That is, for every $i$, we have $a_i \ne i$. One useful thing to know is how many of the $n!$ permutations of size $n$ are derangements. Let's say this number is $D_n$. There are at least three distinct ways to count them:

###### Solving linear equations

In this approach, we group all possible permutations by the number of their fixed points. If there are $i$ fixed points, then upon relabeling the remaining $n - i$ elements from $1$ to $n - i$, the resulting permutation is a derangement. So, the number of permutations on $n$ elements with $i$ fixed points is exactly $\binom{n}{i} \cdot D_{n - i}$.

Summing this over all $i$ gives us the identity $n! = \sum_{i = 0}^n \binom{n}{i} \cdot D_{n - i}$.

Along with the fact that $D_0 = 1$ and $D_1 = 0$, we can use induction to show that $D_n = n! \cdot \sum_{i = 0}^n \frac{(-1)^i}{i!}$. We can also use the following identity to come up with the same result:

Special case of Mobius inversion
###### Recursion

Let's count the number of derangements using a recursion. Suppose $a_n = k \ne n$. We look at $a_k$. If $a_k$ is $n$, then there are $D_{n - 2}$ ways of assigning the remaining numbers to the permutation. If $a_k$ is not $n$, then we note that this reduces to the number of derangements on $n - 1$ numbers, by renaming $n$ (to be assigned in the remaining $n - 2$ slots) to $k$.

So the recurrence becomes $D_n = (n - 1) \cdot (D_{n - 1} + D_{n - 2})$, and this can be solved using induction too.

###### Using the principle of inclusion and exclusion

We can use an inclusion-exclusion argument to solve this problem too. Let $S_i$ be the set of permutations that leave the $i$-th element fixed (the remaining are free to be permuted).

Then we need $n! - |\cup_i S_i|$. Using the inclusion-exclusion principle, since the intersection of any $k$ different $S_i$'s has size $(n - k)!$, we again get the same expression for $D_n$.

#### Inversions

An inversion in an array (not necessarily a permutation) is any pair of indices $(i, j)$ such that $i < j$ and $a_i > a_j$. What is the minimum number of inversions? It is 0, because you can just enforce $a_i = i$. What is the maximum number of inversions? It is $n(n - 1)/2$ because you can just enforce $a_i = n - i + 1$, and then every unordered pair of distinct indices corresponds to an inversion.

Let's consider the following task: For a given permutation (or an array with distinct elements), you are allowed to do the following operation any number of times:

• Pick any $1 \le i < n$ and swap $a_i$ and $a_{i + 1}$.
1. Is it possible to sort this array using these operations only?
2. If yes, what is the minimum number of operations if you want to sort this array using these operations only?

Note that if it is possible for a permutation, it is also possible for any array with distinct elements (we can just replace the elements in the array by their ranks when sorting in increasing order).

The answer to the first part is yes. The main idea is to pick the smallest element and keep swapping it with the previous element until it reaches the first position. Then do the same thing recursively for $[2, \dots, n]$.

Let's think about the number of operations here. How many operations do we do in the first phase? Note that everything coming before $1$ in the original permutation contributed to an inversion where the second element was $1$. Now after $1$ is in its intended place, there will be no more inversions that are related to $1$ in any way. For the rest of the permutation, we can do this argument recursively. But for that, we need to see what happens to inversions where $1$ was not involved. Note that the relative order of no other pair changes, so the other types of inversions remain the same! So, the total number of operations is clearly the number of inversions in the permutation.

Is this the minimum number of operations you need to sort the array? The answer turns out to be yes. Consider any operation that is done. Since it flips the order of two adjacent things, there is at most one inversion it can reduce. So the decrease in the number of inversions is at most $1$. In a sorted array, there are no inversions. So the number of operations must be at least the number of inversions in the original permutation, and thus we are done.

Now you might think — how do we actually count the number of inversions in a permutation? The answer is you can do that using merge sort or by some point-update-range-sum data structure like Fenwick tree or segment tree. The merge sort solution can be found here, and for the data structure solution, consider the following:

• Initially, you have an array $A$ of size $n$ filled with $0$ (on which the data structure will be constructed).
• For each $i$ starting from $1$ to $n$, find the sum of the prefix of $A$ of size $a_i - 1$ (that is, sum of $A_j$ for $0 \le j < a_i$), and add it to the answer. Increment $A_i$ by $1$.

#### Longest increasing subsequences and Erdos Szekeres theorem

An increasing subsequence of an array $a$ (not necessarily a permutation) is a sequence of indices $i_1 < i_2 < \dots < i_k$ such that $a_{i_1} < a_{i_2} < \dots < a_{i_k}$. Decreasing subsequences are defined similarly.

An algorithm to find a longest increasing (or, with a few modifications, non-decreasing) subsequence of an array can be found in this nice video tutorial. But that is not what we are concerned with at the moment.

We are concerned with bounds on the length of any longest increasing subsequence (LIS from here on). However, for decreasing subsequences, an LIS has length $1$. The Erdos Szekeres theorem tells us that in such cases, the length of the longest decreasing subsequence will be large.

More formally, the theorem states that in any permutation (or array with distinct elements) of size at least $xy + 1$, there is either an increasing subsequence of length $x + 1$ or a decreasing subsequence of length $y + 1$.

The easiest way to prove this theorem is via an application of the Pigeonhole principle.

Suppose for the sake of contradiction that the theorem is false for some permutation $a$. For every $i$, consider the length of a longest increasing subsequence ending at index $i$ and the length of a longest decreasing subsequence ending at index $i$. Let's call these numbers $x_i$ and $y_i$. Note that all $x_i$-s are integers between $1$ and $x$, and all $y_i$-s are integers between $1$ and $y$. So there can be at most $xy$ distinct pairs $(x_i, y_i)$. By the Pigeonhole principle, there are $i < j$ such that $(x_i, y_i) = (x_j, y_j)$. Since all elements are distinct, $a_i < a_j$ or $a_i > a_j$. In the first case, it is impossible that $x_i = x_j$, and in the latter, it is impossible that $y_i = y_j$. This is a contradiction, and we are done.

A more sophisticated and deeper proof of this theorem can be done using Dilworth's theorem. This blog uses it to prove a special case of this theorem, though the proof can be modified easily to work for the complete proof too.

#### Next permutation

Note that permutations are just sequences of integers, so it is possible to sort the set of all possible sequences of size $n$ lexicographically (i.e., in the order you would find words in a dictionary). This defines a natural indexing of each permutation. How do we find the next permutation from a given permutation?

The easiest way (in C++) is to use std::next_permutation, but we'll briefly describe how it works.

Let's find the first index $i$ from the right so that $a_i > a_{i - 1}$. Since $i$ was the first index satisfying this condition, all indices $i$ to $n$ must form a decreasing sequence. Note that the smallest number in this sequence that is larger than $a_i$ will be the new element at position $i$, and the rest of them (along with the current $a_i$) will be sorted in increasing order after it. All of this can be implemented in time $O(n - i + 1)$.

Note that starting from the lexicographically smallest permutation (which is $[1, 2, \dots, n]$), the number of permutations between (both inclusive) this permutation and a permutation whose first position $i$ such that $a_i \ne i$ is $k$, is at least $(n - k)! + 1$. This means that if you apply next_permutation even a large number of times, the number of elements in the permutation that will ever change will not be large (unless you start from a specific permutation, and even in that case, apart from a single change for potentially all indices, the same conclusion holds).

So if for a permutation of length $n$, you do the next_permutation operation (as implemented above) $O(n^k)$ times, the time taken will be (much faster than) $O(k n^k \log n)$. You can even bound the number of operations to perform next_permutation $r$ times by $O(r + n)$. The analysis is similar to the complexity analysis when you increment the binary representation of an $n$-bit number $r$ times.

#### Random permutations

There are a total of $n!$ permutations of size $n$. How do we construct a random permutation if all we have is a random number generator that can give us integers in a range we can choose?

For thinking about how we can incrementally construct such a permutation, let's try to remove all elements $> k$ for some $k$. Note that the position of $k$ in this array is equally likely to be any integer from $1$ to $k$. However, this won't lead to a very efficient algorithm right away.

We would want to rather construct a random permutation from left to right. Suppose we have a prefix chosen already. Note that the permutations of all elements on the right are equally likely. Also, all elements on the right are equally likely to be the first element of the suffix (i.e., the last element of the just-larger prefix). This observation leads to Durstenfeld's variation of the Fisher-Yates shuffle.

Now let's think about some statistics of random permutations.

Firstly, what is the expected length of the greedily picked increasing sequence by this algorithm:

• If there is nothing in the sequence or the new element is greater than the last picked element, pick this element.

Note that an element is picked if and only if it is more than everything before it. That is, we can do the following using linearity of expectation:

$E[l] = E[\sum_i(a_i \text{ chosen})] = \sum_i P[a_i \text{ chosen}] = \sum_i P[a_i > \max(a_1, \dots, a_{i - 1})] = \sum_i \frac{1}{i}$, which is the harmonic sum, and is approximately $\ln n$.

However, it can be shown that the expected length of the LIS is $\Theta(\sqrt{n})$, so a greedy approach is much worse than computing the LIS systematically.

For more random permutation statistics that also have some information on statistics derived from the next few sections, I would refer you to this.

## The cycle decomposition perspective

We will now switch gears and note a very important part of the theory of permutations — the cycle decomposition. You will find yourself referring to this quite a lot while solving problems on permutations.

#### Cycles

Suppose we have a permutation $a$. Let's fix some $i$. Let's try to look at the sequence $i$, $a_i$, $a_{a_i}$ and so on. Eventually, it must repeat because there are at most $n$ values that this sequence can take. For the sake of convenience, let's call the $j$-th element of this sequence $b_j$.

Then for some $k < l$, we have $b_k = b_l$. Let's take $k$ to be the smallest such index, and let $l$ be the smallest such index for that corresponding $k$. If $k$ is not $1$, then we must have $b_{k - 1} = b_{l - 1}$, because $a$ is a bijection. However, this contradicts the minimality of $k$! This means that the first element that repeats in the sequence is, in fact, $i$.

Now suppose something in the sequence between $1$ and $l$ repeats (let's say at positions $1 < m < o < l$). Then by repeating the bijection argument $m - 1$ times, we must have $b_{o - m + 1} = b_1 = i$, so $o - m + 1 \ge l$. But this is a contradiction. This means that the sequence repeats starting from $l$ and forms a cycle.

We call this a cycle because if we made the graph where there was a directed edge from $i$ to $a_i$, then this $i$ would be in a cycle of length $l - 1$.

Now, this was done for a single $i$. Doing this for all $i$ means that all elements are in a cycle in this graph. Clearly, since the indegree and outdegree of each $i$ are $1$, this means that these cycles are all disjoint.

Let's take an example at this point. Consider the permutation $[2, 3, 1, 5, 4]$. The cycles for each element are as follows:

• $1 \to 2 \to 3 \to 1$
• $2 \to 3 \to 1 \to 2$
• $3 \to 1 \to 2 \to 3$
• $4 \to 5 \to 4$
• $5 \to 4 \to 5$

These correspond to the cycles $1 \to 2 \to 3 \to 1$ and $4 \to 5 \to 4$ in the graph.

#### Cycle notation

Note that these cycles are independent. That is, we can just say that for these cycles, any element outside the cycle is irrelevant. So in this sense, for any permutation, we can just list out its cycles and we will be able to determine the permutation uniquely.

So we just represent a permutation as a list of its cycles. For example, the permutation $[2, 3, 1, 5, 4]$ can be represented as $(1 2 3)(4 5)$.

Note: there is a deeper meaning behind this notation that is related to composition and how disjoint cycles commute for composition.

#### Finding $k$-th next, functional graphs

Consider the following task: for a permutation, compute the value of $a_{a_{\ddots_{i}}}$ for each $i$ where there are $k$ $a$-s, $k \le 10^{18}$.

Note that if you can find cycles, you can do a cyclic shift for each cycle (appropriately computed modulo the cycle length) and give the answer.

What if it was not guaranteed that $a$ is a permutation, but all elements in $a$ are in $[1, n]$ nevertheless? The cycle argument breaks down here, but you can show the following fact:

• Each weakly connected component consists of two parts: a directed cycle and directed trees rooted at distinct vertices in the cycle, directed towards the root (also called reverse-oriented arborescences).

The same problem can be solved in this setting, too, by using binary lifting for each vertex till it reaches the "main" cycle of its component.

In the language of graph theory, such graphs are called functional graphs (which have outdegree of each vertex = 1).

Some problems to practice on are Planet Queries (I and II) on CSES.

#### Swaps/transpositions

Let's start from the array $[1, 2, \dots, n]$ and try to apply swaps on this array till we reach some desired permutation $a = [a_1, a_2, \dots, a_n]$. Note that we can always apply such swaps, by trying to build the permutation from left to right. These swaps are also called transpositions.

#### Swaps on cycles

Let's now think about what happens when we apply a swap to a permutation. Suppose the swap swaps elements at positions $i$ and $j$. Let's look at the graph associated with this permutation. In this graph, swapping two elements at positions $x$ and $z$ is equivalent to breaking two edges $x \to y$ and $z \to w$ and making two edges $x \to w$ and $z \to y$ (in something like a crossing manner).

We make two cases:

1. $i$ and $j$ are in different cycles: this joins the two cycles together.
2. $i$ and $j$ are in the same cycle: this breaks the common cycle into two.

A relevant problem at this point would be 1768D - Lucky Permutation.

Let's consider another permutation sorting problem, but here rather than just adjacent elements being swapped, we allow swapping any elements. What is the minimum number of swaps to sort a permutation this way?

Note that in the final result, there are exactly $n$ cycles (singletons). Let's say we currently have $c$ cycles. Since the number of cycles increases by at most 1 each time, there must be at least $n - c$ swaps to sort this permutation.

Is this achievable? Yes. We can keep swapping elements in the same cycle while there is a cycle of length at least $2$, and this will sort the permutation.

## The permutation composition perspective

#### Definition

Suppose we have two permutations $a$ and $b$ of the same size $n$. Now we want to define a composition of them as follows: $ab$ is defined as the array $c$ where $c_i = a_{b_{i}}$.

This essentially corresponds to finding the composition of the $g$-functions of $a$ and $b$, with $g$ for $b$ being applied first.

Let's get an intuitive sense of what it does, by getting some information out of the definition. Let's first deal with the case when $b$ is the "identity" permutation: $b_i = i$. Then $c = a$ according to this definition. Similarly, if $a$ is the identity permutation, then $c = b$. So, composing with the identity permutation in any order gives the original permutation back.

Now let's understand this by using the two-line notation. The idea is to take the two-line notation for both permutations and do some sort of pattern matching to get to their composition.

Consider the following permutations $a$ and $b$.

\begin{equation*} \begin{pmatrix} 1 & 2 & \cdots & n \\ a_1 & a_2 & \cdots & a_n \end{pmatrix} \end{equation*}

and

\begin{equation*} \begin{pmatrix} 1 & 2 & \cdots & n \\ b_1 & b_2 & \cdots & b_n \end{pmatrix} \end{equation*}

Let's reorder the columns of the two-line notation for $a$ so that the lower row of $b$ and the upper row of $a$ match. However, this corresponds to shuffling the columns of $a$ according to $b$:

\begin{equation*} \begin{pmatrix} b_1 & b_2 & \cdots & b_n \\ a_{b_1} & a_{b_2} & \cdots & a_{b_n} \end{pmatrix} \end{equation*}

Note that this is still a valid two-line notation for $a$. Now, if we "merge" this with the two-line notation for $b$, this gives us the two-line notation for the composition of $a$ and $b$.

Note how this idea of composition arises naturally from the interpretation of a permutation as a bijection on a set.

We start with the identity permutation, apply the first permutation to it, and then apply the second permutation to it. Applying means replacing $x \mapsto a_x$. The two-line notation for a permutation $a$ has the following property:

• If the permutation in the first row is $p$, then the permutation in the second row is $ap$.

#### Examples — cycles and swaps

Note that we can associate a cycle itself with a permutation, where the non-cycle elements are fixed points, and the remaining are mapped according to the cycle. In this sense, cycles are permutations.

Also note that a swap itself is a permutation in the same manner. In fact, a swap is just a cycle of length 2.

It is helpful to play around with a few examples of swaps and cycles and write them in two-line notation to understand their mappings and try composing them.

At this point, you should also try to recall the following two things:

1. When we introduced cycles, we wrote them in a certain format. In fact, you can compose disjoint cycles without worrying about the order in which they are composed. This should be obvious from the fact that they are independent in a way (an element that gets moved in one permutation is a fixed point of the other). You can also see that the notation $(1 2 3)(4 5)$ also conveniently captures that this permutation is a composition of the cycles $(1 2 3)$ and $(4 5)$.

2. When we discussed swaps, we noted that you could "apply" swaps. This corresponds to left-multiplying the permutation with the corresponding swap permutation. It is instructive to come up with a few examples of swaps and how they compose at this point.

It is important to note that (by using the fact that function composition is associative) permutation composition is associative.

Let's say we have a permutation where we came to by applying the following three permutations: $(12)$, $(312)(43)$ and $(53241)$ in this order, we would write it as $(53241)(312)(43)(12)$. Note that here $(12)$ and $(43)$ are cycles of length $2$, i.e., they are swaps.

Now how do we reduce this back to a normal permutation? One way would be to write out the permutation for each cycle and then keep composing it. A cleaner way of doing the same thing would be to do this: for every element, start from the right, and replace it with what the cycle maps it to. For instance, if we want to find the image of $1$, the first cycle maps $1$ to $2$, the second cycle maps $2$ to $2$, the third cycle maps $2$ to $3$, and the fourth cycle maps $3$ to $2$.

#### Parity of permutations

Recall that we mentioned earlier that an adjacent swap reduces the number of inversions by at most $1$. Well, to be more precise, it changes the number of inversions by $\pm 1$, and hence flips the parity of the number of inversions of the permutation.

We define the parity of the permutation to be the parity of the number of inversions of the permutation.

The reason why this is important is that it gives a handle on the parity of swaps (not just adjacent swaps) as well as some information about the cycle parities, as follows.

Let's consider a swap that swaps elements at positions $i < j$. Then a way to get to this would be to apply $j - i$ adjacent swaps to bring $a_j$ to position $i$, and $j - i - 1$ adjacent swaps to bring $a_i$ (from its new position) to position $j$. This takes an odd number of total adjacent swaps, so the final number of inversions is also flipped.

As a corollary, applying any swap changes the parity of the permutation, and in particular, all swap permutations have odd parity (parities add up modulo $2$ over composition, as can be seen easily).

Now note that for a cycle, we can construct it by doing a few swaps. More specifically, if the cycle is $(i_1, i_2, \dots, i_k)$, then we can do $k - 1$ swaps to apply the same transformation as the cycle permutation.

So all even-sized cycles have odd parity, and all odd-sized cycles have even parity.

Let's consider any decomposition of a permutation into (possibly non-disjoint) cycles. As a result of the above discussion, the parity of the permutation is odd iff the number of even-sized cycles is odd.

In particular, for a decomposition of a permutation into swaps, the parity of the permutation is the same as the parity of the number of swaps in the decomposition. Note that this also implies that we can't have two decompositions with an odd difference in the number of swaps.

Rephrasing the above result, we have the fact that if it takes an odd number of swaps to get to a permutation, the permutation has odd parity and vice versa.

#### Parity under composition

Clearly, from the above discussion, it can be seen that parities add modulo $2$ over composition. This, however, is important enough to warrant its own section because it summarizes all of the discussion above.

#### Inverse of permutations

Firstly, let's think about permutation in an application sense. Is there a way to get back the original permutation after some permutation has been applied to it? From our remarks on the two-line notation, it suffices to do this for the initial permutation being the identity permutation. Let's say the applied permutation was $a$:

\begin{equation*} \begin{pmatrix} 1 & 2 & \cdots & n \\ a_1 & a_2 & \cdots & a_n \end{pmatrix} \end{equation*}

By our remarks on composition using the two-line notation, we need something that swaps these two rows as follows:

\begin{equation*} \begin{pmatrix} a_1 & a_2 & \cdots & a_n\\ 1 & 2 & \cdots & n \end{pmatrix} \end{equation*}

Note that this is easy: consider the position array $p$ of $a$ as we defined earlier; i.e., $p_i$ is the position where $a$ equals $i$, i.e., $a_{p_i} = i$. Note that $a$ is $a_i$ at $i$, so $p_{a_i}$ is the position where $a$ is $a_i$, i.e., $p_{a_i} = i$. This shows that $pa = ap$, and this common value is the identity permutation (which we shall denote by $\text{id}$ in what follows).

In other words, $p$ and $a$ are inverses of each other under composition. It is trivial to note that $a$ is the position array of $p$ as well. We denote $p$ by $a^{-1}$.

Now that we have some intuition about the inverse in terms of the two-line notation, we think about it in terms of cycles, swaps, and compositions:

1. The inverse of a cycle $(i_1, i_2, \dots, i_k)$ is simply $(i_k, \dots, i_2, i_1)$. This is done by reversing all the edges in the related directed graph.
2. The inverse of a swap is itself. This also follows from the fact that a swap is a 2-cycle.
3. The inverse of a composition $ab$ is $b^{-1}a^{-1}$. This corresponds to undoing the permutation application on a stack and also makes algebraic sense: $abb^{-1}a^{-1} = aa^{-1} = \text{id} = b^{-1}b = b^{-1}a^{-1}ab$.

In particular, for a decomposition of a permutation into (not necessarily disjoint) cycles, we can just take a mirror image of the decomposition, and it will correspond to the inverse of the permutation!

Also, the parity of the inverse is the same as the parity of the original permutation.

#### Involutions

An involution is a permutation whose inverse is itself. Consider the disjoint cycle decomposition of the permutation $a = c_1 \dots c_k$. $a = a^{-1}$ means $\text{id} = a^2 = \prod_i c_i^2$. Note that we are able to write this because disjoint cycles commute, and so do identical cycles. $c_i^2$ should be an identity map on the elements of $c_i$ because else the resulting permutation won't be $\text{id}$. This means that $c_i$ is either of size $1$ or $2$. Conversely, it can be checked that these permutations are involutions.

#### $k$-th power of permutations

Since permutation composition is associative, we can use binary exponentiation to compute the permutation. A more mathematical way to think about it is to use the same trick as in the previous subsection. With definitions the same as in the previous section, for any $a$, $a^k = \prod_i c_i^k$, so the cycle structure is determined by the powers of the cycles.

Finding the $k$-th power of a cycle just corresponds to mapping it to its $k$-th next neighbor in the cycle. If the length of the cycle is $c$, and $g = \gcd(c, k)$, then the disjoint cycle decomposition of $c^k$ consists of $g$ cycles, with each of them corresponding to a stride of $k$ along the original cycle.

#### Order of permutations

The order of a permutation $a$ is defined as the least $k$ such that $a^k$ is the identity permutation. Let's again look at the cycle decomposition as in the previous subsection. For a cycle $c$ of length $l$ to be "dissolved" into singletons, it is necessary and sufficient to have the power of the permutation divisible by $l$. Hence, the order of the permutation is just the LCM of all cycle lengths.

#### Square-root of permutations

For this section, we will consider the following problem: 612E - Square Root of Permutation. The solution is left as an exercise to the reader (it shouldn't be hard, considering what we have discussed in the last few sections).

#### Conjugation and conjugacy classes

Suppose $a, b$ are two permutations. We think about the following permutation: $aba^{-1}$. If we think about how the two-line notation for these permutations combines, we can note that this is just the permutation whose cycle decomposition is almost the same as $b$, but $a$ is applied to each element in the cycles. In other words, we "reinterpret" the cycles by mapping to and from a separate "ground set" via permutation $a$.

For a formal statement, let the cycle decomposition of $b$ be $c_1 \cdots c_k$, where $c_i = (x_{i1}, \dots, x_{il_i})$. We claim that $aba^{-1}$ is $c'_1 \cdots c'_k$ where $c'_i = (a_{x_{i1}}, \dots, a_{x_{il_i}})$.

The proof formalizes the earlier argument. Consider an element $i$, and we look at what happens to $i$ as each permutation is applied to it:

1. $aba^{-1}$: There is a $j$ such that $a_j = i$ since $a$ is a permutation. So $(aba^{-1})_i = a_{b_{a^{-1}_i}} = a_{b_j}$.
2. $c'_1 \cdots c'_k$: Without loss of generality, let's say $i$ is the first element of $c'_1$, i.e., $i = a_{x_{i1}}$. Then $j = x_{i1}$. Now it is mapped to $a_{x_{i2}} = a_{b_{x_{i1}}} = a_{b_j}$.

Since both of them give the same result, we are done.

This says that the operation $aba^{-1}$ is nothing but a translation of the labels of $b$ according to $a$.

Note that by this result, since we are just applying the permutation to everything inside the cycles, we can map any product of disjoint cycles to any other product of disjoint cycles, provided the multiset of cycle sizes doesn't change.

Let's call $b$ and $c$ to be conjugate if there is an $a$ such that $c = aba^{-1}$, and call this operation conjugation.

Thus this leads to a partition of all permutations into equivalence classes (after verifying a couple of axioms), where everything inside a class is conjugate to everything else in the class. Note that any equivalence class is uniquely determined by the multiset of cycle sizes. These equivalence classes are called conjugacy classes.

## Miscellaneous topics

This section will mostly be about some very brief references and is intended to tell people about some topics they might not have seen or heard about before, which are relevant to this discussion. There might be blogs in the future regarding these topics, but it's still a good idea to read up on these yourself.

#### Group theory

The set of permutations of size $n$ forms what is called a group and is denoted by $S_n$, and it turns out that all finite groups of size $n$ are isomorphic to some subgroup of $S_n$. This is why there has been a lot of work on permutations from a group theoretic perspective. The section on permutation composition is best viewed under a group theoretic lens.

The analysis of a lot of games can be done using group theory, for instance, 15-game, Rubik's cube, Latin squares, and so on.

There is a subgroup of permutations that corresponds to all even permutations. This group is also known to be simple and can be used to show results about the 15-game, for example.

#### Counting: Burnside, Polya enumeration, Stirling numbers

Using group theoretic ideas, the Burnside lemma and the Polya enumeration theorem come into the picture, and they help to count classes of objects (equivalent under some relation) rather than objects themselves.

Stirling numbers are quite useful when counting things related to permutations, and they deserve a whole blog for themselves.

#### Coordinate compression

Note that in the above discussions, especially in the section on ordering, we used arrays of distinct elements and permutations interchangeably. Whenever something only depends on the $\le, <, >, \ge, =$ relations between the elements and not the elements themselves, it makes sense to replace elements with their rank. In the case of distinct elements, this "compressed version" becomes a permutation.

#### Connected component DP

Just like the failed idea in the generation of permutations, that idea can also be used for dp. More details can be found in this blog.

#### Young tableaus

Closely related to the LIS and other ordering concepts is the idea of Young tableaus. This has a very interesting theory, and this blog is a good reference for some of the results.

#### Schreier Sims algorithm

Suppose you have a subgroup generated by some permutations, and you want to find the size of the subgroup. The Schreier Sims algorithm helps in finding this size, and a good blog can be found here.

#### Permutation tree

There is a cool and interesting data structure that can do different kinds of queries on permutations and subarrays. This blog is a good introduction to the data structure and what it can do.

## Problems

I have added some standard problems whenever suitable, but for practicing permutations well, it makes sense to practice on a larger set of problems. Since there is a very large number of problems on permutations, maybe just adding a few of them that I know would not do justice to the variety of problems that can be made on permutations. A couple of possible methods of practicing are the following:

1. Search on the CF problemset for problems with the name permutation in them.
2. Look out for permutation problems on sources for standard problems like CSES.
By nor, 5 months ago, Disclaimer: This is not an introduction to greedy algorithms. Rather, it is only a way to formalize and internalize certain patterns that crop up while solving problems that can be solved using a greedy algorithm.

Note for beginners: If you're uncomfortable with proving the correctness of greedy algorithms, I would refer you to this tutorial that describes two common ways of proving correctness — "greedy stays ahead" and "exchange arguments". For examples of such arguments, I would recommend trying to prove the correctness of standard greedy algorithms (such as choosing events to maximize number of non-overlapping events, Kruskal's algorithm, binary representation of an integer) using these methods, and using your favourite search engine to look up more examples.

Have you ever wondered why greedy algorithms sometimes magically seem to work? Or find them unintuitive in a mathematical sense? This post is meant to be an introduction to a mathematical structure called a greedoid (developed in the 1980s, but with certain errors that were corrected as recently as 2021), that captures some intuition behind greedy algorithms and provides a fresh perspective on them.

The idea behind this blog is to abstract out commonly occurring patterns into a single structure, which can be focused upon and studied extensively. This helps by making it easy to recognize such arguments in many situations, especially when you're not explicitly looking for them. Note that this blog is in no way a comprehensive collection of all results about greedoids; for a more comprehensive treatment, I would refer you to the references as well as your favourite search engine. If there is something unclear in this blog, please let me know in the comments.

To get an idea as to how wide a variety of concepts it is connected to is, greedoids come up in BFS, Dijkstra's algorithm, Prim's algorithm, Kruskal's algorithm, Blossom's algorithm, ear decomposition of graphs, posets, machine scheduling, convex hulls, linear algebra and so on, some of which we shall discuss in this blog.

Note that we will mainly focus on intuition and results, and not focus on proofs of these properties, because they tend to become quite involved, and there is probably not enough space to discuss these otherwise. However, we will give sketches of proofs of properties that are quite important in developing an intuition for greedoids.

1. Motivation and definition
• Unordered version of conditions on optimality
• Weaker conditions on optimality of the greedy algorithm
2. Some examples
• Interval greedoids
• Matroids
• Antimatroids
• Other greedoids
3. Constructing more greedoids
• From matroids
• From antimatroids
• From greedoids
4. Some random facts about greedoids

### Motivation and definition

We will take an approach in an order that is different from most treatments of greedoids (which throw a definition at you), and try to build up to the structure that a greedoid provides us.

How does a usual greedy algorithm look like? Initially, we have made no decisions. Then we perform a sequence of steps. After each step, we have a string of decisions that we have taken (for instance, picking some element and adding it to a set). We also want a set of final choices (so we can't extend beyond those choices).

To make this concept precise, we define the following:

1. A ground set is any set. We usually interpret this as a set of incremental decisions that can be made.
2. A language is any set of finite sequences (or words) of elements coming from the ground set. We interpret a language as a set of possible sequences of decisions we can take in the greedy algorithm.
3. A simple language is any language where the sequences don't have repeated elements. This definition is motivated from the fact that we don't take a decision more than once.
4. A hereditary language is any language $L$ on a ground set $S$ satisfying the following two conditions:
• $\emptyset \in L$
• For any $s \in L$, every prefix of $s$ is also in $L$. This definition is motivated from the fact that we want to capture a sense of time/causality in our sequence of decisions. That is, if we have a sequence of decisions, we must have arrived at it from the just-smaller prefix of decisions.
5. A basic word in $L$ is any $s \in L$ such that there is no element $e$ in $S$ such that $s$ extended by that element (denoted by $s \cdot e$ or $se$) is in $L$. This is motivated from the fact that we want to have some ending states at the end of the greedy algorithm.

We can now rephrase the optimization problems that a large set of greedy algorithms try to solve in the following terms:

• Consider a simple hereditary language $L$ on the ground set $S$, and a function $w : L \to \mathbb{R}$. Now maximize $w(s)$ where $s$ is a basic word of $L$.

A (certain quite general type of) greedy algorithm looks something like this:

• Initially there are no decisions taken.
• At step $i$, when the decisions $x_1, \dots, x_{i - 1}$ have already been taken, pick an $x_i \in S$ such that $x_1 x_2 \cdots x_i \in L$, and this choice of $x_i$ maximizes $w(x_1 x_2 \cdots x_i)$ over all valid choices of $x_i$. If there is no such $x_i$, terminate the algorithm.

Of course, taking arbitrary $w$ doesn't make any sense, so we limit $w$ to the following kinds of functions (we shall relax these conditions later, so that we can reason about a larger variety of problems):

1. If at a certain point, a decision $x$ is optimal, then it will also be optimal at a later stage. That is, if $x$ is chosen when the prefix was $s$, then for a prefix $st$, we have $w(stxu) \ge w(styu)$ for all valid $y, u$ ($u$ can be empty, $y$ is a decision).
2. It is optimal to pick $x$ sooner than later. That is, if $x$ is chosen when the prefix was $s$, then we must have $w(sxtyu) \ge w(sytxu)$ for all valid $y, t, u$ ($t, u$ can be empty, $y$ is a decision).

To this end, we define greedoids as follows:

A greedoid is a simple hereditary language $L$ on a ground set $S$ that satisfies an "exchange" property of the following form:

• If $s, t \in L$ satisfy $|s| > |t|$, then there is a decision in $s$ that we can append to $t$, so that the resulting word is also in $L$.

This extra condition is crucial for us to be able to prove the optimality of greedy solutions using an exchange argument (the usual exchange argument that people apply when proving the correctness of greedy algorithms).

One notable result is that by this extra condition, all basic words in $L$ have the same length. This might seem like a major downgrade from the kind of generality we were going for, but it still handles a lot of cases and gives us nice properties to work with.

For people familiar with matroids

It turns out that greedoids have the following equivalent definitions which will be useful later on (we omit the proofs, since they are quite easy):

1. Let a set variant of the greedoid be defined as follows: a pair $(S, F)$ of a ground set $S$ and a family of its subsets $F$, such that $\emptyset \in F$ and for any $A, B \in F$ with $|A| > |B|$, there exists an $s \in A \setminus B$ such that $B \cup \{s\} \in F$. Then the original definition of a greedoid and this definition of a set variant are equivalent, in the sense that for any such $F$, there is exactly one simple hereditary language $L$ such that the set of set of decisions in a word for each word in $L$ is precisely $F$. (That is, when we replace sequences in $L$ with an unordered version, the resulting set of unordered sets is $F$).
2. We can replace the second condition in the definition of the set variant by the condition that all maximal sets have the same cardinality. Maximal sets are defined as sets such that the set formed after adding another element to them is not in $F$.

There is also a set analogue for simple hereditary languages: $\emptyset \in F$ and for each $X \in F$, we have an element $x \in X$ such that $X \setminus \{x\} \in F$. The intuition remains the same. Note that again, we don't need this to hold for all $x \in X$, but at least one $x \in X$.

But why do we care about greedoids at all? The answer lies in the following informally-stated two-part theorem:

Theorem 1:

1. If $L$ is a greedoid on $S$, and $w$ satisfies the conditions mentioned earlier, then the greedy algorithm gives the correct answer.
2. If the greedy algorithm gives the correct answer for all $w$ satisfying the conditions mentioned earlier, and if $L$ has the same length for all its basic words, then $L$ is a greedoid on $S$.
Brief proof sketch

Note that the definition of greedoid doesn't depend on $w$ in any way, so it can be studied as a combinatorial structure in its own right — this leads to quite non-trivial results at times. However, when we think of them in the greedy sense, we almost always have an associated $w$.

In a lot of cases, $w$ has a much simpler (and restrictive) structure, for instance, having a positive and additive weight function (which is defined on each element). In that case, the following algorithm works for matroids (special cases of greedoids): sort elements in descending order of their weight, and pick an element iff adding it to the set of current choices is still in the matroid.

##### Unordered version of conditions on optimality

Usually, $w$ is not defined on a sequence of steps, but on the set of choices you have made. However, in the general case, the "natural" relaxation from the ordered version to the unordered version of greedoids fails (both in being equivalent and in giving the optimal answer). In fact, this error was present in the original publications (since 1981) and was corrected fairly recently in 2021. For a more detailed discussion (and the proofs of the following few theorems), please refer to $$, which we shall closely follow.

We shall start out with some special cases:

Theorem 2: Consider a greedoid $F$ (unordered) on a ground set $S$. Suppose $c$ is a utility function from $S$ to $\mathbb{R}$, so that the weight of a set of decisions is additive over its elements. Then the greedy algorithm gives an optimal solution iff the following is true:

• If $A, A \cup \{x\} \in F$ and $B$ is an unordered basic word (i.e., maximal set) in $F$ such that $A \subseteq B$ and $x \in S \setminus B$, then there is a $y \in B \setminus A$ such that $A \cup \{y\} \in F$ and $B \cup \{x\} \setminus \{y\}$ is also a maximal set.

The following theorem generalizes the sufficiency:

Theorem 3: Consider a greedoid $F$ (unordered) on a ground set $S$ and $w : S \to \mathbb{R}$. The greedy algorithm gives an optimal solution if (and not iff) the following is true:

• If for some $A, A \cup \{x\} \in F$ and $B$ being an unordered basic word (i.e., maximal set) in $F$ such that $A \subseteq B$ and $x \in S \setminus B$, it holds that $w(A \cup \{x\}) \ge w(A \cup \{u\})$ for all valid $u$, then there is a $y \in B \setminus A$ such that $A \cup \{y\} \in F$ and $B \cup \{x\} \setminus \{y\}$ is also a maximal set and $w(B \cup \{x\} \setminus \{y\}) \ge w(B)$.

This theorem can be used to show optimality of Prim's algorithm on its corresponding greedoid.

This theorem can also be derived from theorem 6 that we shall mention later on.

Remember that we mentioned that the original claim in the 1981 paper about greedoids was false? It turns out that the claim is true about a certain type of greedoids, called local forest greedoids. Since it is not so relevant to our discussion, we're going to spoiler it to avoid impairing the readability of what follows.

Local forest greedoids and some optimality theorems
##### Weaker conditions on optimality of the greedy algorithm

As we had mentioned earlier, the constraints on $w$ while motivating the definition of a greedoid (ordered) are quite stringent, and they might not hold in a lot of cases. Here, we work upon reducing the set of constraints (which will also have implications on theorem 3 earlier).

Theorem 6: Let $L$ be a greedoid on a ground set $S$, and $w$ be an objective function that satisfies the following condition:

• If $ax \in L$ such that $w(ax) \ge w(ay)$ for every valid $y$ and $c = azb$ is a basic word, then there exists a basic word $d = axe$ such that $w(d) \ge w(c)$.

Then the greedy algorithm gives an optimal solution.

The proof is quite easy, and goes by way of contradiction. There is a special case of the above theorem (that needs some care to prove), which is applicable in a lot of cases. It turns out that it corresponds to the last part of the proof sketch of Theorem 1, so maybe it's not such a bad idea to read it again.

### Some examples

The main point of adding examples at this point is to showcase the kinds of greedoids that exist in the wild and some of their properties, so that it becomes easy to recognize these structures. I have added the examples in spoiler tags, just to make navigation easier. Some of these examples will make more sense after reading the section on how to construct greedoids.

While going through these examples, I encourage you to find some examples of cost functions that correspond to greedy optimality, so that you can recognize them in the future. Discussing them in the comments can be a great idea too. For a slightly larger set of examples, please refer to Wikipedia and the references.

When appropriate, we will also point out greedy algorithms that are associated with these structures.

#### Interval greedoids

These are greedoids that satisfy the local union property (as introduced in the section of local forest greedoids). Equivalently, they are also defined as greedoids where for every $A \subseteq B \subseteq C$ with $A, B, C \in F$, if $A \cup \{x\}, C \cup \{x\} \in F$ for some $x \not \in C$, then $B \cup \{x\} \in F$.

Some examples are:

1. All local forest greedoids.
2. Matroids (to be introduced later).
3. Anti-matroids (to be introduced later).
Directed branching greedoid
Undirected branching greedoid
Clique elimination greedoid

#### Matroids

Matroids are greedoids that satisfy a stronger property than the interval property for interval greedoids, by removing the lower bounds. More precisely, a matroid is a greedoid that satisfies the following property:

• For every $B \subseteq C$ with $B, C \in F$, if $C \cup \{x\} \in F$ for some $x \not \in C$, then $B \cup \{x\} \in F$.

Equivalently, they are greedoids where in the unordered version, for each $X \in F$, and for all (not at least one) $x \in X$ we have $X \setminus \{x\} \in F$. In other words, they are downwards closed. In such a context, elements of $F$ are also called independent sets.

Intuitively, they try to capture some concept of independence. In fact, matroids came up from linear algebra and graph theory, and greedoids were a generalization of them when people realized that matroids are too restrictive, and downwards-closedness is not really important in the context of greedy algorithms. The examples will make this notion clearer.

Note that for matroids, it is sufficient to define a basis (the set of maximal sets in $F$), and downwards closedness handles the rest for us.

Matroids have a much vaster theory compared to greedoids, due to being studied for quite a long time and being more popular in the research community. Since this is a blog only on greedy algorithms, we will refrain from diving into matroid theory in too much detail. Interested readers can go through the references for more.

Some examples of matroids are as follows.

Free matroid
Uniform matroid
Graphic matroid
Transversal matroid
Gammoid
Algebraic matroid
Vector matroid
Column matroid

#### Antimatroids

Antimatroids are greedoids that satisfy a stronger property than the interval property for interval greedoids, by removing the upper bounds. More precisely, an antimatroid is a greedoid that satisfies the following property:

• For every $A \subseteq B$ with $A, B \in F$, if $A \cup \{x\} \in F$ for some $x \not \in B$, then $B \cup \{x\} \in F$.

Unlike matroids, this does not necessarily imply upward closure, but it does imply closure under unions (which is another way to define antimatroids).

Another definition of antimatroids that is in terms of languages (and gives some intuition about their structure) calls a simple hereditary language over a ground set an antimatroid iff the following two conditions hold:

1. Every element of the ground set must appear in a word in the language.
2. The exchange property holds for any two words $a, b$ such that $a$ has an element not in $b$, rather than only restricting it to $|a| > |b|$.

Using this, we note that the basic words are all permutations of the ground set.

Another piece of intuition (derived from the above definition) that helps with antimatroids is the fact that when we try constructing sets in an antimatroid, we keep adding elements to a set, and once we can add an element, we can add it any time after that.

It also makes sense to talk about antimatroids in the context of convex geometries. To understand the intuition behind this, we take the example of a special case of what is called a shelling antimatroid. Let's consider a finite set of $n$ points in the 2D plane. At each step, remove a point from the convex hull of the remaining points. The set of points at any point in this process clearly forms an antimatroid by the above intuition. In fact, if instead of 2D, we embed the ground set in a space with a high enough dimension, we can get an antimatroid isomorphic to the given matroid!

What is so special about convexity here? Turns out we can use some sort of "anti-exchange" rule to define convex geometries in the following manner. It will roughly correspond to this fact: if a point $z$ is in the convex hull of a set $X$ and a point $y$, then $y$ is outside the convex hull of $X$ and $z$.

Let's consider a set system closed under intersection, which has both the ground set and empty set in it. Any subset of the ground set (doesn't matter whether it is in the system or not) has a smallest superset in such a set system (and it is the smallest set from the system that contains the subset). Let's call this mapping from the subset to its corresponding smallest superset in the system $\tau$. Then we have the following properties:

1. $\tau(\emptyset) = \emptyset$
2. $A \subseteq \tau(A)$
3. $A \subseteq B \implies \tau(A) \subseteq \tau(B)$
4. $\tau(\tau(A)) = \tau(A)$

Now for such a set system, it is called a convex geometry if the following "anti-exchange" property holds:

• If some distinct $y, z \not \in \tau(X)$, but $z \in \tau(X \cup \{y\})$, then $y \not \in \tau(X \cup \{z\})$.

Intuitively, consider $\tau$ as a convex hull operator. Then the anti-exchange property simply means that if $z$ is in the convex hull of $X$ and $y$, then $y$ is outside the convex hull of $X$ and $z$.

Now it turns out that the complements of all sets in an antimatroid form a convex geometry! This is not too hard to prove.

We shall now give some examples of antimatroids.

Chain antimatroid
Poset antimatroid
Shelling antimatroid
Perfect elimination antimatroid
Point-line search antimatroid
Cover antimatroid
Point search antimatroid
Line search antimatroid
Undirected point/line search antimatroid
Capacitated point search antimatroid
Transitivity antimatroid and shelling of an acyclic matroid
Chip-firing antimatroid
Universal antimatroid

#### Other greedoids

Now that we got a few special types of greedoids out of the way, we shall look at some more greedoids.

Perfect elimination greedoids on bipartite graphs
Retract, monoid retract and dismantling greedoids
Bipartite matching greedoid
Gaussian elimination greedoid
Twisted and dual-twisted matroids (which are not matroids)
Ear decomposition greedoid (undirected and directed)
Blossom greedoid

### Constructing more greedoids

Note that some constructions from polymatroids/matroids etc. were already mentioned in the section on examples, so we will focus on the ones that were not.

Truncation
Duality
Restriction
Contraction
Sums and unions
##### From antimatroids
Join of antimatroids on the same ground set
Intersection of matroid and antimatroid
Truncation
Restriction
Contraction

### Some random facts about greedoids

This is a collection of interesting facts about greedoids that just don't seem to fit in the above sections, but are important enough not to be ignored. Of course, with every important construction or idea, there are a lot of facts that lead to a combinatorial explosion of the kinds of facts we can derive about anything, so what we discussed above is way too incomplete to be a comprehensive introduction. It'll hopefully evolve over time too, so I'm open to suggestions regarding this section.

• You can construct Tutte polynomials for greedoids, just like you can for matroids.
• We can characterize greedoids into matroids, antimatroids, and none of these in terms of having certain kinds of greedoid minors.
• Antimatroids are very closely related to lattices, and in general we can reason about greedoids by their associated poset flats too.

1. Dávid Szeszlér (2021), Sufficient conditions for the optimality of the greedy algorithm in greedoids.
2. Bernhard Korte, Laszlo Lovasz, Rainer Schrader (1991), Greedoids
3. Matroid intersection in simple words
4. Goecke, Korte, Lovasz (1986), Examples and algorithmic properties of greedoids
5. Bernhard Korte, Laszlo Lovasz (1988), Intersection of matroids and antimatroids
6. Brenda L Dietrich (1987), Matroids and antimatroids — a survey

Lastly, since this is a huge blog, some mistakes might have crept in. If you find any, please let me know!

By nor, 5 months ago, Recently someone asked me to explain how to solve a couple of problems which went like this: "Find the expected time before XYZ happens". Note that here the time to completion is a random variable, and in probability theory, such random variables are called "stopping times" for obvious reasons. It turned out that these problems were solvable using something called martingales which are random processes with nice invariants and a ton of properties.

I think a lot of people don't get the intuition behind these topics and why they're so useful. So I hope the following intuitive introduction helps people develop a deeper understanding. Note that in the spirit of clarity, I will only be dealing with finite sets and finite time processes (unless absolutely necessary, and in such cases, the interpretation should be obvious). I will do this because people tend to get lost in the measure-theoretic aspects of the rigor that is needed and skip on the intuition that they should be developing instead. Also, note that sometimes the explanations will have more material than is strictly necessary, but the idea is to have a more informed intuition rather than presenting a terse-yet-complete exposition that leaves people without any idea about how to play around with the setup.

People already familiar with the initial concepts can skip to the interesting sections, but I would still recommend reading the explanations as a whole in case your understanding is a bit rusty. The concept of "minimal" sets is used for a lot of the mental modelling in the blog, so maybe you'd still want to read the relevant parts of the blog where it is introduced.

I would like to thank Everule and rivalq for suggesting me to write this blog, and them and meme and adamant for proofreading and discussing the content to ensure completeness and clarity.

BONUS: Here's an interesting paper that uses martingales to solve stuff.

1. Probability, sigma algebras, and random variables
2. Expected value of a random variable and conditional probability
3. Conditional expectation of a random variable with respect to a sigma algebra
4. Martingales
5. Stopping times
6. Some math problems
7. Some competitive programming problems

### Probability, sigma algebras, and random variables

Let's look at the simplified definition of a probability space first. We say simplified here because there are some technicalities that need to be sorted out for infinite sets, which we're not considering here anyway.

A probability space is a triple $(\Omega, \mathcal{F}, P)$ consisting of

1. $\Omega$: the (non-empty) ground set, also known as the sample space. It is helpful to think of it as the set of all possible outcomes of an experiment.
2. $\mathcal{F}$: the $\sigma$-algebra. This is a collection of subsets of $\Omega$. Note that this is not always the power set of $\Omega$. It is helpful to think of its elements (the chosen subsets) as "events". For example, when rolling a die, one possible event can be "getting an even number". For this experiment, we have $\Omega = \{1, 2, 3, 4, 5, 6\}$ and this event is $\{2, 4, 6\}$.
3. $P$: the probability function, a function from $\mathcal{F}$ to $[0, 1]$. This is just a weight that we assign to every event that we have chosen.

For this function to have some nice properties and for it to make sense as some measure of "chance", we add some more constraints to these functions:

1. $\Omega$ is an event (i.e., $\Omega \in \mathcal{F}$).
2. If $A \in \mathcal{F}$, then $\Omega \setminus A \in \mathcal{F}$. That is, if something happening is an event, then it not happening is also an event.
3. If $A \in \mathcal{F}$ and $B \in \mathcal{F}$, then $A \cup B \in \mathcal{F}$. That is, if $X$ happening is an event and $Y$ happening is an event, then at least one of $X$ and $Y$ happening is also an event. Note that this combined with the previous constraint allows us to say that $A \cap B$ is an event and so on due to De Morgan's law.
4. $P(\Omega) = 1$, that is, the probability of the whole sample space is $1$.
5. For disjoint sets $A$ and $B$, $P(A \cup B) = P(A) + P(B)$. This, along with the previous constraint, is sufficient to derive identities like $P(\emptyset) = 0$, $P(A) + P(\Omega \setminus A) = 1$ and so on.

Let's now build some intuition about this definition (the following, especially the part about minimal sets, applies only to finite sets, but there are a lot of similarities in the conclusions when we try to generalize to other sets using measure-theoretic concepts).

Note that since $\mathcal{F}$ is closed under intersections, if we do the following procedure multiple times: take two sets and replace them with their intersection, we will end up with some "minimal" non-empty sets that we can't break further (this notation is not standard at all, but we will use this a lot in what follows). Consider the set of all such possible minimal sets. Clearly, these sets form a partition of $\Omega$ since they are disjoint (else contradiction to minimality) and since every element is in a minimal set (using the fact that $\mathcal{F}$ is closed under complements and constructing a minimal set if needed, or by just intersecting all sets containing that element — there is at least one such set, which is $\Omega$). Now since $\mathcal{F}$ is also closed under unions, it also equals the set of all possible unions of these minimal subsets. In other words, we have a partition of the elements of $\Omega$ into some sets, and $\mathcal{F}$ is the set formed by all possible unions of sets in that partition. We can now think of $\sigma$-algebras in terms of partitions! Also, note that the probabilities for each of these minimal events of the $\sigma$-algebra add up while taking the unions.

In some sense, we can now construct an equivalent probability space $(\Omega', \mathcal{F}', P')$ where $\Omega'$ is the set of minimal events of $\mathcal{F}$, $\mathcal{F}'$ is the power set of $\Omega'$, and $P'$ is defined naturally from $P$. This definition is not standard but will be useful later on. Note that if $\mathcal{F}$ was the power set of $\Omega$, then this equivalent probability space would have been the same as the original probability space. In this sense, all other $\sigma$-algebras are "coarser" than the power set, whose minimal sets are all singletons.

At this point, we should realize that $\mathcal{F}$ is a measure of how much "information" we have about $\Omega$ (we can't distinguish between two elements in the same minimal set). This idea, along with the partition intuition, is a good way to deal with how you gain information over time in random processes, but we'll come to that a bit later.

Let's now turn to the formal definition of a random variable.

A random variable with respect to a probability space $(\Omega, \mathcal{F}, P)$ is a function $X$ from $\Omega$ to $\mathbb{R}$ such that $X^{-1}((-\infty, x]) \in \mathcal{F}$ for all $x \in \mathbb{R}$. That is, the set of pre-images of real numbers $\le x$ should be an element of $\mathcal{F}$.

Note that since we are dealing with finite sets, we will use an alternative (but equivalent for finite sets) definition where we replace the interval $(-\infty, x]$ by $\{x\}$.

When written like this, it might be a bit intimidating, but the underlying idea is quite simple. We have a sample space $\Omega$, and we have another set $\mathbb{R}$. Now we want to map elements of $\Omega$ to $\mathbb{R}$ in some way so that we can do some computations over elements of $\Omega$. However, since in a probability space, $\mathcal{F}$ tells us how much information we have (or are allowed to look at), we shouldn't be able to distinguish between elements of $\Omega$ that are inside the same minimal event of $\mathcal{F}$. Since there are finitely many elements in $\Omega$, and elements inside the same minimal event should be mapped to the same real number, there can be only finitely many intervals that $X$ partitions $\mathbb{R}$ into, and the condition in the definition is necessary and sufficient (due to closedness of $\mathcal{F}$ under intersection and unions).

As a side-note, note that we can also define functions of a random variable at this point by simply composing the function $X$ with a function from $\mathbb{R}$ to $\mathbb{R}$. Similarly, if we have two random variables on the same probability space, we can also define functions of them quite naturally. After all, we just need to specify their values on the minimal sets and we're done.

### Expectation of a random variable and conditional probability

Let's try to investigate random variables a bit further. Let's say we have a probability space $(\Omega, \mathcal{F}, P)$ and a random variable $X$ on this probability space. How do we capture some sort of a mean value of $X$ over the entire probability space? One possible answer is the following: for each minimal event, we consider the common value of the random variable at each element of that event, and add it to a weighted sum with weight equal to the probability of that minimal event. Note that minimal events form a partition so the sum of these weights is $1$. In some sense, this is an average of the random variable over all minimal events, weighted by their importance.

But how do we write this up mathematically without referring to minimal events? Let's take advantage of the condition in the definition of a random variable. Instead of summing over minimal events, we can sum over distinct values $x$ of the random variable, and weight it by the probability of that variable being equal to $x$. This probability is well-defined due to the condition in the definition of a random variable (because the event that the random variable equals $x$ is in fact a disjoint union of minimal events and their probabilities add up). This definition is also consistent with our earlier intuition.

So we can define the expected value of a random variable $X$ to be $E[X] := \sum_{x \in \text{range}(X)} x \cdot P(X = x)$.

Note that by our earlier definition related to minimal events, we can also clearly see that for any random variables $X, Y$ on the same probability space, we have $E[X + Y] = E[X] + E[Y]$.

Let's now change gears and think a bit about how we can isolate an event. Suppose we have an event $E$. Let's say that we want to ignore everything else, and only look at the elements of $\Omega$ (and consequently the minimal events of $\mathcal{F}$ that form a partition of $E$). Can we make a probability space out of it? One easy way would be to just inherit the structure of $\Omega$ and $\mathcal{F}$ and set the probabilities of everything that uses elements other than the minimal elements constituting $E$ to be $0$, but we run into one trouble if we simply decide to inherit the probability function. The sum of probabilities over all such minimal events is not $1$. One possible idea is to normalize it by $P(E)$, which is possible iff it is non-zero. It can easily be checked that this probability space is indeed a valid probability space.

This is how we define the conditional probability space wrt an event $E$ with $P(E) > 0$. The sample space and the $\sigma$-algebra remain the same, but the probability function becomes $P(A | E) := P'(A) = P(A \cap E) / P(E)$. This can be verified to be consistent with our earlier reasoning by considering minimal elements.

A cleaner way of looking at it is that we're really only concerned with some subset of $\Omega$ and $\mathcal{F}$ that correspond to elements in $E$, and we define a probability space with $E$ as the ground set. But just to make sure that we can interoperate with the other events in $\mathcal{F}$, we take the event, trim down the parts that are not in $E$, and then use the probability (or relative weight) of that set wrt the probability (or weight) of $E$ instead.

One corollary of this is the following: $\sum_{E_i} P(A | E_i) P(E_i) = P(A)$ for a partition of $\Omega$ into events $E_i$. This can be seen to be true if we consider the minimal sets in $A$. In fact, this is used while solving a lot of recurrences, like in Bayes' theorem.

This also has an expected value variant, and if we naturally restrict a random variable to each conditional probability space, by summing over minimal sets, we get that if $E[X | E_i]$ is the expected value of $X$ wrt the conditional probability space with respect to $E_i$, then $\sum_{E_i} E[X | E_i] P(E_i) = E[X]$. This variant is used in recurrences too, like finding the expected time before we toss two heads in a row with a fair coin (though the sample space is infinite there, these identities still hold).

### Conditional expectation with respect to a sigma algebra

Let's continue with the partition and information analogy even further. Recall how we had defined an equivalent probability space. It corresponds to collapsing equivalence classes (determined by the partition) into a single element, and making a probability space on that as the sample space. What if we do that again, after defining another sparser/coarser probability space on the resulting probability space?

Of course, this leads to a coarser partition of the sample space in some sense, and if $\mathcal{F}_1$ is the original $\sigma$-algebra, and $\mathcal{F}_2$ is the new $\sigma$-algebra (both of them corresponding to the same $\Omega$, after "unwrapping/unrolling" the second coarsening), then $\mathcal{F}_2 \subseteq \mathcal{F}_1$.

Now suppose we had a random variable $X$ on the original probability space. How do we reduce/adapt the random variable to this coarser probability space? This is where expected value comes into the picture. We will choose a definition for conditional expectation, and see that it satisfies a lot of good properties and is consistent with our conditional probability identities that we figured out in the previous section.

Note that $X$ is constant on all minimal sets of $\mathcal{F}_1$. Let's define the conditional expectation of $X$ with respect to the $\sigma$-algebra $\mathcal{F}_2$ as the random variable $E[X | \mathcal{F_2}] := X'$ such that on every element of a minimal set $E_i$ of $\mathcal{F}_2$, $X'$ assumes the value $E[X | E_i]$. This essentially means that we are trying to take a "conditional average" of the information about the values of $X$ on the set $E_i$. It is easy to verify that $E[X'] = E[X]$, where the first expectation is taken over the coarser probability space, and the second is taken on the finer probability space. In a case of unfortunate naming, $X'$ is called the conditional expectation of $X$ wrt the $\sigma$-algebra $\mathcal{F}_2$.

To make this a bit clearer, let's take an example.

Example

We note a couple of points here:

1. In a way, we are trying to replace the random variable with its "best approximation" in some sense (for instance, a variance minimization sense, but that is totally unrelated). Also note that the expectation of the random variable does not change when we apply this coarsening to $X$.
2. Note that for the trivial $\sigma$-algebra $\mathcal{F}_2 = \{\emptyset, \Omega\}$, the random variable $E[X | \mathcal{F_2}]$ is identically the constant function that equals $E[X]$.

We can also chain a couple of coarsenings. Suppose $\mathcal{F}_3 \subseteq \mathcal{F}_2 \subseteq \mathcal{F}_1$ and $X$ is a random variable wrt the probability space $(\Omega, \mathcal{F}_1, P)$. Then we have $E[E[X | \mathcal{F}_2] | \mathcal{F}_3] = E[X | \mathcal{F_3}]$, as can be verified easily by chaining conditional expectations. In fact, the fact that expectation remains the same even after coarsening is a direct consequence of this and the last point.

A more commonly used term is conditional expectation with respect to another random variable. It is usually defined as this:

Suppose $X$ and $Y$ are random variables on the same probability space. For any element $\omega$ of the sample space where $Y(\omega) = y$, we define $E[X | Y]\left(\omega\right) = E[X | E]$ where $E$ is the event that $Y = y$. That is, for every element where $Y$ assumes a value of $y$, we take the conditional expectation wrt the event that $Y = y$ and set the value of the answer to that conditional expectation.

However, this is the same as saying this: let $\mathcal{F}_Y$ be the $\sigma$-algebra where the minimal sets correspond to distinct values of $Y$. Then $E[X | Y]$ is identically equal to the random variable $E[X | \mathcal{F}_Y]$. As a side-note, this constructed $\sigma$-algebra is called the $\sigma$-algebra generated by the random variable $Y$.

In this sense, conditional expectation with respect to a $\sigma$-algebra is a cleaner and more sophisticated concept than conditional expectation with respect to a random variable, though both concepts are equivalent (you can always induce a $\sigma$-algebra by choosing where to put equal values for the random variable you're conditioning on).

Another way of thinking about this concept is to think of it as projecting down from a set of minimal events to a coarser set of minimal events. For each chunk of minimal events we're collapsing, we replace it with a single minimal event by the "center of mass" of those events, where the "position" is the value of the random variable, and the "mass" equals the probability of those minimal events. This works out mathematically too, and is precise for this setup.

Just for the sake of completeness, since the above definition captures what conditional expectation with respect to a $\sigma$-algebra does, but breaks down when you try to generalize it to infinite sample spaces, here's the "official definition" that people use (which is quite counter-intuitive and takes some time to digest):

A conditional expectation of $X$ given $\mathcal{H}$, denoted as $E[X | \mathcal{H}]$, is any random variable on $(\Omega, \mathcal{H}, P)$ which satisfies $\int_H E[X | \mathcal{H}] d\mathbb{P} = \int_H X d\mathbb{P}$ for each $H \in \mathcal{H}$.

This essentially says the same thing when you think about the possible $H$, but in a more non-constructive way. Turns out that this random variable is not unique in the strictest sense, but $P(Y \ne Z) = 0$ holds, where $Y$ and $Z$ are two candidates for this random variable. Also, this definition makes it clear that $X - E[X | \mathcal{H}]$ is orthogonal to the indicator function (so we get a more precise definition of projecting down).

### Martingales

Let's say there is a "process" of some sort, where at every step, we get more information. Let's say our sample space for the whole experiment is $\Omega$. Here, $\Omega$ is almost always infinite, but the intuition we built carries forward here, even though the concepts like minimal sets might not. Let's take a very simple example: we toss a coin at each step. The sample space for the whole experiment is the set of all infinite binary strings. Let's think about the kind of information that we have available at various steps in the process. After step 0, all the information we have (for instance, the complete information about a random variable that has information about what has happened until now) corresponds to the trivial $\sigma$-algebra $\{\emptyset, \Omega\}$. After step $i$, we have information for the first $i$ coin tosses, so the information corresponds to the $\sigma$-algebra with "minimal" sets being the sets of binary strings which are the same in the first $i$ positions.

To understand why $\sigma$-algebras are necessary to capture information, perhaps the following analogy helps. Think in terms of minimal sets. Each time we get some more information, we might be able to distinguish between elements in minimal sets. To be able to do operations like finding the expectation of some random variable in this scenario, we need to be able to distinguish between these elements too. In some sense, you're inspecting multiple parallel universes at the same time, and in each universe, you have a certain prefix of things that happened, for each of which you want to be able to analyze what happens later.

Note that we always get more information at every stage of such a stochastic process. So we define a filtration as follows (without any reference to a process for the sake of generality): a sequence of $\sigma$-algebras $\mathcal{F}_i$ such that $\mathcal{F}_i \subseteq \mathcal{F}_j$ iff $i \le j$.

Now let's get to what a process really is. At every step of a process, we have a random variable that shows up. But what is the $\sigma$-algebra for that random variable? To remedy this, we take a $\sigma$-algebra $\mathcal{F}$ that captures all possible information. So we define a stochastic process as follows:

For a given probability space $(\Omega, \mathcal{F}, P)$, and an index set $I$ with a total order $\le$, a stochastic process $X$ is a collection of random variables $\{X(i) | i \in I\}$.

For our purposes, $I$ will mostly be the non-negative integers, though $I = \mathbb{R}$ is used extensively in a lot of contexts. $I$ having a total order means that there is some concept of time associated with $I$.

How do we now construct a "natural filtration" that captures the information that the first $i$ steps of the process (i.e., $\{X(j) | j \le i\}$) give us? Remember how we said we can construct a $\sigma$-algebra from a random variable? We can also use the same construction to construct it from a finite set of random variables, by refining two partitions together while it is possible (this corresponds to the join of the two partitions in the refinement lattice, if you're familiar with posets and lattices). So the filtration can be set up this way too, by refining the current $\sigma$-algebra with a new random variable each time. More formally, paraphrasing from Wikipedia,

Let $(X_{n})_{n\in \mathbb {N} }$ be a stochastic process on the probability space $(\Omega ,{\mathcal {F}},P)$. Then

${\mathcal {F}}_{n}:=\sigma (X_{k}\mid k\leq n)$

is a $\sigma$-algebra and $\mathbb {F} =({\mathcal {F}}_{n})_{n\in \mathbb {N} }$ is a filtration. Here $\sigma (X_{k}\mid k\leq n)$ denotes the $\sigma$-algebra generated by the random variables $X_{1},X_{2},\dots ,X_{n}$. ${\mathbb F}$ is a filtration, since by definition all ${\mathcal {F}}_{n}$ are $\sigma$-algebras and $\sigma (X_{k}\mid k\leq n)\subseteq \sigma (X_{k}\mid k\leq n+1).$

This is known as the natural filtration of ${\mathcal {F}}$ with respect to $X$.

Finally, we are able to define what a martingale is. Note that if $\sigma(X_i) \subseteq \mathcal{H}$ for some $\sigma$-algebra $\mathcal{H}$, then $E[X_i | \mathcal{H}] = X_i$ (think about the minimal sets and the definition). A martingale essentially says that conditional on the information we have until time $i$, a random variable $X_j$ for $j \ge i$ looks like $X_i$. More formally,

A martingale is a stochastic process $X_1, X_2, \dots$ such that $E[X_{n + 1} | \sigma(X_1, X_2, \dots, X_n)] = X_n$, and $E[|X_n|] < \infty$.

Equivalently, the first condition can be written as $E[X_{n + 1} - X_n | \sigma(X_1, X_2, \dots, X_n)] = 0$. Note that we talk about "almost sure" equality of random variables when dealing with more complicated probability spaces, but for our purposes, we will not concern ourselves with this detail.

Note that this is necessary (trivial) and sufficient, because $X_j - X_i = (X_j - X_{j - 1}) + (X_{j - 1} - X_{j - 2}) + \dots + (X_{i+1} - X_i)$, and by applying the nested coarsening property, it reduces to $E[X_j - X_i | \sigma(X_1, \dots, X_i)] = 0$.

Why is a martingale so useful? Firstly, it gives us an invariant, and as is well-known, invariants make life easier when we try to prove results in math. Secondly, it originated in the theory of gambling, and for fair gambling games, the expected value of the profit at a later point in time should be equal to the profit at that point in time, which is precisely the martingale condition!

Let's look at some constructions of martingales.

1. Consider a random walk, where the initial position is $X_0 = 0$, and $X_i = X_{i-1} \pm 1$ with equal probability $p \le 1/2$, and $X_i = X_{i-1}$ with probability $1 - 2p$. Then this is a martingale (from the definition).
2. Consider a random walk denoting the total assets of a gambler who keeps reinvesting all his capital into the game. The game gives a payout of $r$ percent if he wins (with probability $p$), a loss of $r$ percent if he loses (with probability $p$), and no payout in case its a draw (with probability $1 - 2p$). Then his wealth is a martingale.

There are many more examples of martingales, but I won't go too much in depth about that, and would leave you to refer to some other resource for more examples.

Some more intuition about martingales: let's see what happens to the conditional expectation in the definition of a martingale when we apply a convex function to a stochastic process (i.e., to all random variables in the stochastic process). Since probabilities are all positive, whenever we're taking the conditional expectation, we are taking a convex combination of the values of the convex function of a random variable. By Jensen's inequality, we get that rather than equality, $\ge$ holds in the defining equation for martingales. Similarly, if we had a concave function, $\le$ would hold in the defining equation. These types of processes (where $\ge$ and $\le$ hold instead of $=$) are called sub-martingales and super-martingales respectively. Note that not all stochastic processes are one of sub-/super-martingales. Note that $\ge 0$ and $\le 0$ means that the random variable on the LHS is almost surely non-negative and non-positive, respectively.

An example of a sub-martingale

I won't go into a lot of detail about martingales (since they deserve a field of study of their own), but will just point out some nice results, that help to get a feel of how martingales behave:

1. Azuma's inequality: this provides a bound on how "close" we will be to the initial value of the random variable given bounds on the movement of the martingale (or how much in the wrong direction we can go for a sub-/super-martingale). Let's say $\{X_k\}$ is a super-martingale (which a martingale also is), and let's suppose $|X_k - X_{k - 1}| \le c_k$ almost surely. Then we have $P(X_N - X_0 \ge \varepsilon) \le \exp(-\varepsilon^2/(2 \sum_{k = 1}^N c_k^2))$. Since the negative of a super-martingale is a sub-martingale, we get for a sub-martingale $X$, $P(X_N - X_0 \le -\varepsilon) \le \exp(-\varepsilon^2/(2 \sum_{k = 1}^N c_k^2))$. Since a martingale is both a sub-martingale and a super-martingale, we have a probabilistic bound on how close $X_N$ will be to $X_0$, that is, $P(|X_N - X_0| \ge \varepsilon) \le 2 \exp(-\varepsilon^2/(2 \sum_{k = 1}^N c_k^2))$. There is a stronger version of this inequality, but I think this is quite sufficient to see how well-behaved martingales are.

2. Doob's martingale inequality: this is a Markov-style inequality on the probability that a sub-martingale will exceed a value $x$. More formally, for a sub-martingale $X$, we have $P(\max_{1 \le i \le n} X_i \ge x) \le E[\max(X_n, 0)] / x$. This can be used to also show Markov's inequality.

3. Doob's martingale convergence theorem: this is a result that says that a super-martingale that is bounded from below will converge (almost surely) to a random variable with finite expectation. Note that if we bar the case when the variable diverges to infinity in some sense, the other possible way for it to not converge is to oscillate. This inequality roughly says that bounded martingales don't oscillate.

### Stopping times

Stopping times are quite interesting mathematical objects, and when combined with martingales, they give a lot of interesting results and are a very powerful tool.

Suppose we are in the gambler's shoes, and we want to write an algorithm to finish the game, depending on certain outcomes in the process. Stopping times are a way to formalize the analysis of when such exit conditions are reached. The best we can do right now, given the current theory we have, is to define it as a random variable (that may or may not depend on the stochastic process). We also have the constraint that we can't see the future, so the decision about when to stop must be taken with all the information at the current time.

To that end, let's consider a random variable $\tau$ (that takes values in the index set, that is the set of non-negative integers here) on a probability space $(\Omega, \mathcal{F}, P)$, and let's say we have a stochastic process $X$ on the same probability space with an associated filtration $\{\mathcal{F}_i\}$. We say that $\tau$ is a stopping time if the event that $\{\tau = t\}$ is an event in $\mathcal{F}_t$. That is, the decision to stop at time $t$ must be taken with information no more than what we have at time $t$. For also keeping this consistent with real number indexing, we modify the condition to $\{\tau \le t\}$ being an event in $\mathcal{F}_t$.

To understand why this ensures no forward-looking, consider the following. We bunch together elements of $\Omega$ that have stopping time $\le t$, and the condition is that we shouldn't be able to distinguish them if without information from time after $t$. Similarly, it must not be the case that two elements of $\Omega$ in the same minimal set of $\mathcal{F}_t$ have different stopping times, when not both are $> t$. In our coin flip example, the latter makes sense if you think about why you can't have stopping time for 000010110... = 4 and that for 000011000... = 6.

Another way of formalizing the same thing is to say that the stochastic process (denoting the stopping of a process) defined by $Y_t = 0$ if $\tau < t$ and $1$ otherwise is an adapted process wrt the filtration. That is, $\sigma(Y_t) \subseteq \mathcal{F}_t$ for all $t$.

Since the stopping time is an index, the most natural thing at this point is to think of how to index the stochastic process with some function of stopping time. If we go back to the definition of a random variable, it should be clear that intuitively we need to assign a value to it at every element in $\Omega$. In simple cases, this is trivial, since we can just compute the value of $\tau$ at the minimal set which has this element (let's say this value is $t$), and compute the value of $X_t$ at the minimal set again. From here, it should be obvious that $X_\tau$ is indeed a random variable on the original probability space, and it denotes the value of the stochastic process when the process is exited. Also note that by the same reasoning, $X_{\min(\tau, t)}$ is a random variable whose induced $\sigma$-algebra is a subset of $\mathcal{F}_t$. $X_{\min(\tau, t)}$ corresponds to a stopped process, which means that we just arbitrarily stopped a process after a time limit. These are very important types of processes, and have been extensively studied.

The cool part about martingales and stopping times together is that this setup has been studied a lot, and there are quite a few important results that I'll list without proof, which can help us solve a lot of problems as well as build an understanding around stopping times and martingales:

1. Martingale central limit theorem: This is a generalization of the usual central limit theorem to martingales. The associated process is the usual martingale, but with differences between consecutive values being bounded by a fixed bound almost surely. We choose a stopping time as follows: at each step $i + 1$, compute the conditional variance of $X_{i + 1}$ wrt $\mathcal{F}_i$, and keep adding it to a counter. Stop once this counter is more than a certain value $\sigma^2$. (In a sense, we are waiting till we have collected a large enough variance/move). If the stopping time here is $\tau_\sigma$, then the random variable $X_{\tau_\sigma} / \sigma$ converges to a normally distributed variable with mean $0$ and variance $1$, as $\sigma$ tends to $\infty$. This also assumes that the sum of variances diverges.

2. Doob's optional stopping theorem: This is a very important theorem, and will be quite useful to compute the expected values of stopping times later on. It talks about the fairness of a martingale when we stop after a random stopping time, in the sense that we developed above. The metric for fairness is that $E[X_\tau] = E[X_0]$, and it turns out that this is true if any of the following three conditions holds:

• The stopping time $\tau$ is bounded above.
• The martingale is bounded and the stopping time $\tau$ is finite almost surely (i.e., $P(\tau = \infty) = 0$).
• The expected value of $\tau$ is finite, and the difference between $X_{i + 1}$ and $X_i$ is bounded. Note that this can be used to compute the stopping time if we modify the martingale a bit, as we shall see later.

### Some math problems

We'll start out with some math problems that I have seen over the years (apologies since I don't remember the sources for some of them).

Problem 1: Suppose there is a gambler who has $n$ dollars and gambles infinitely many times. The payoff of the $t^{th}$ round is $X_t$, which is an integer bounded above and below by some fixed constants. We know that $E[X_t|X_{t-1},\cdots,X_1]\geq c$ where $c > 0$. Suppose the probability of the gambler going bankrupt is $p(n)$. Do we have $\lim_{n\rightarrow \infty}p(n)=0$?

Solution sketch

Problem 2 (folklore): Suppose there is an infinite grid, with a non-negative real number in each. Suppose that for every cell, the number in it is the mean of the numbers on its 4 neighbouring squares. Show that all the numbers must be equal.

Solution sketch

Problem 3 (USA IMO 2018 Winter TST 1, P2): Find all functions $f\colon \mathbb{Z}^2 \to [0, 1]$ such that for any integers $x$ and $y$,

$f(x, y) = \frac{f(x - 1, y) + f(x, y - 1)}{2}$
Solution sketch

Problem 4 (2022 Putnam A4): Suppose that $X_1, X_2, \ldots$ are real numbers between 0 and 1 that are chosen independently and uniformly at random. Let $S=\sum_{i=1}^kX_i/2^i,$ where $k$ is the least positive integer such that $X_k<X_{k+1},$ or $k=\infty$ if there is no such integer. Find the expected value of $S$.

Solution sketch (due to ABCDE)

Problem 5 (AGMC 2021 P1): In a dance party initially there are $20$ girls and $22$ boys in the pool and infinitely many more girls and boys waiting outside. In each round, a participant from the pool is picked uniformly at random; if a girl is picked, then she invites a boy from the pool to dance and then both of them leave the party after the dance; while if a boy is picked, then he invites a girl and a boy from the waiting line and dance together. The three of them all stay after the dance. The party is over when there are only (two) boys left in the pool.

(a) What is the probability that the party never ends?

(b) Now the organizer of this party decides to reverse the rule, namely that if a girl is picked, then she invites a boy and a girl from the waiting line to dance and the three stay after the dance; while if a boy is picked, he invites a girl from the pool to dance and both leave after the dance. Still the party is over when there are only (two) boys left in the pool. What is the expected number of rounds until the party ends?

Solution sketch

Problem 6 (folklore): Let a monkey type randomly on a typewriter with alphabet $\Sigma$. For a given string $w\in\Sigma^*$, let $w_1,\dots,w_n$ be the nonempty strings that are simultaneously a prefix and suffix of $w$. Then show that the expected number of keystrokes the monkey needs to make to type $w$ is $\sum_{i=1}^n|\Sigma|^{|w_i|}$.

Solution sketch

A different approach for the above problem is discussed here.

Problem 7 (Alon, Spencer): Let $G=(V,E)$ be a graph with chromatic number $\chi(G)=1000$. Let $U \subseteq V$ be a random subset of $V$ chosen uniformly from among all $2^{|V|}$ subsets of $V$. Let $H=G[U]$ be the induced subgraph of $G$ on $U$. Prove that the probability that $H$ has a chromatic number of $\le 400$ is less than $0.01$.

Solution sketch

### Some competitive programming problems

Almost all the problems that I know about in a competitive programming perspective are discussed in this Chinese blog and this editorial to 1479E that explains the main idea in a slightly different manner. I won't repeat their discussion (the equations should be understandable if you understand what the main approach with the math problems above is), but I would just list out the problems that they link as well as some problems that appeared later (in reverse chronological order).

By nor, 5 months ago, Since someone recently asked me about my competitive programming setup, and I like tinkering with my setup to make it as useful and minimal as possible, I thought I should share my setup that I've used for the past few years and a modified version that I've used at onsite ICPC contests. I've also talked to a few people who went to IOI and had a similar setup, and I'm fairly confident that at least some people will be able to successfully use this without having to worry too much about the details, like I did. This is definitely NOT the only way to set up a basic environment, but it was the way that worked for me for quite a long time.

This blog is also meant as a companion blog to this blog on using the command line.

Before we start with the tools, it's a good point to mention this resource as another useful reference for installing vim and g++.

### About the choice of tools

The choice of tools here is motivated by two things — availability and minimalism.

About availability: most onsite contests have a Linux system (which is almost always Ubuntu), and while installing the most common compiler used for competitive programming (g++) on Ubuntu, the meta-package build-essentials is required by the standard installation procedure. This package also installs make. Most onsite contests provide gdb too. As far as vim is concerned, it is provided on almost all onsite contests as well (and if it is not present, you can use vi as a good enough substitute).

Needless to say, this setup assumes that you have a Linux machine or Linux VM. If you're on Windows, using WSL should suffice. Most of these things should also work on MacOS (speaking from my limited experience of testing an ICPC contest while stuck with a machine with MacOS on it).

The fact that they're minimalist also correlates with the fact that they're fast and easy to set up (sometimes they don't need anything to be set up either) and won't take a ton of your time looking at complicated menus and waiting for startup. Note that using vim might be a bit too much if you haven't used it earlier and you're not willing to devote a few days to getting comfortable with it, so it's a matter of personal preference to use some other text editor like Sublime or VS Code. However, I do not know of any alternative to make (other than writing a shell script, which is a bit inconvenient unless you're a sh/bash expert) and gdb that are as widely available. I'd like to re-emphasize the fact that this is just my setup; your mileage might vary.

Now about the tools and how to install them:

1. vim is a minimalist and customizable text editor that allows you to do a lot of stuff in a no-nonsense way. Its keybindings can be a bit counterintuitive at first, but once you get used to it, it makes writing and editing code very convenient. Installation instructions can be found at the link I shared earlier.
2. make is a minimal build system that reads something called a Makefile and figures out how to do your compilation for you. Installation instructions can be found at the link I shared earlier, under the heading "An example of a workflow in Vim".
3. gdb is a command-line debugger that works for a lot of languages and is pretty convenient and intuitive to use (and also quite scalable). For installation instructions, a search on your favourite search engine should be sufficient.

### Setup for onsite contests

This is a more concise setup that I've used for onsite contests, and people might want to build upon it further.

I prefer using the following template, but for ICPC, I added some things from the KACTL template to keep things uniform in the codebook.

Template

#### Vim

For setting up vim, you just need to create a file ~/.vimrc which stores all your configurations.

A concise vimrc
A bit more beginner-friendly vimrc
Explanation of what everything does

#### Makefile

Note that you can run make without any setup at all. For example, if the path (relative or absolute) to your cpp file is <some_path>/code.cpp, you can run make <some_path>/code and it will generate an executable at the path <some_path>/code, using the default compilation flags. make also prints the command that it used to compile the file.

One feature of make is that once you compile a file, unless there are changes in the source file, it won't compile it again. To counter that, either use make -B instead of make, or use the Makefile below and run make clean which removes all executables from the current directory upto max depth 1.

For using more compilation flags with make, create a file named Makefile and store the following Makefile in it (you can modify this too).

Makefile
Explanation

#### GDB

GDB doesn't really need setup. With the above flags, the most I have ever needed was to run the executable with debug flags in gdb, and look at the backtrace to pinpoint the line with the error. To do that, I do the following (you can also do this in vim using the termdebug package that is added in the vimrc above and is provided by vim out of the box):

1. Run gdb <executable_name>
2. If you have an input file called in.txt, type run < in.txt and hit enter. If you don't have an input file, you can just give it input like in a terminal. Output redirection works as well.
3. If the program has errors, it will stop at some point with an error. To look at the backtrace, type bt and hit enter.

### Setup for online contests

My setup for online contests is a bit more involved, and I'll keep this section a bit more concise, since you can just set this up once and never have to worry about it again, in contrast with onsite contests, where every character counts.

I use a bigger template for online contests, which is as follows:

Template

Note that there's an include for debugging, and it's pretty easy to use (I've mentioned it here).

#### Vim

Note that the following vimrc has some non-competitive-programming stuff as well, since I use vim as my primary editor. For more information on each plugin (each such line starts with Plug or Plugin), you can search for it on GitHub. For instance, for the line Plug 'sbdchd/neoformat', the relevant GitHub repo is this. I use the snippet manager UltiSnips and clang-format and clangd for code formatting and static analysis purposes. I used to use some of the commented plugins earlier, so those can be useful too.

vimrc

#### Makefile

On my personal system, since compilation time becomes a major overhead, I also use precompiled headers. Also, for contests like Hash Code, I use OpenMP at times, so I have some flags for those in my Makefile.

Makefile

Let me know if I missed out on something, or if I should add something here!

By nor, 5 months ago, Recently, I retired from competitive programming completely, and I found myself wondering about what I'll do with all the competitive-programming-specific knowledge I have gained over the years. For quite some time, I've been thinking of giving back to the community by writing about some obscure topics that not a lot of people know about but are super interesting. However, every time I ended up with ideas that are either too obscure to be feasibly used in a competitive programming problem, too technical to ever be seen in a contest, or having too many good blogs/articles about them anyway.

So here's my question: do you think there are topics/ideas that you find are interesting and beautiful, but are not appreciated enough to have their own blogs? If I am familiar with those ideas, I can consider writing something on them (and irrespective of that, I would encourage you to write a blog on such ideas if you're familiar with them). If I don't know about them, it'll be a good way for me and everyone else visiting this blog to learn about it (and if I like the topic, who knows I might as well write something on it too). Also, it's not like the ideas have to necessarily be related to algorithms and data structures, some meta stuff like this nice blog works too.

For an idea about the kinds of blogs I'm talking about, you could refer to my previous educational blogs. I was mainly inspired by adamant's and errorgorn's blogs, since they tend to be really interesting (I would recommend checking them out!).

By nor, 7 months ago, The 2022 ICPC Seoul Regional Mirror Contest has ended (link), and since no one made a post for discussion, let's discuss below. How to solve A, B, G and H?

By nor, 10 months ago, This blog was initially meant to request updates to the C compiler on Codeforces (given the large number of blogs complaining about "mysterious" issues in their submissions in C). While writing it, I realized that the chances of the said updates would be higher if I mentioned a few reasons why people would like to code in C instead of C++ (some of the reasons are not completely serious, as should be evident from the context).

Before people come after me in the comments saying that you should use Rust for a few reasons I will be listing below, please keep in mind that I agree that Rust is a great language with reasonably good design choices (most of them better than C++, at least), and using C is a personal choice.

### Commonly encountered issues in C on Codeforces

1. The C compiler on Codeforces is currently 32-bit, and hence it suffers from the same disadvantages as C++ faces compared to the 64-bit version. For instance, 64-bit arithmetic is slower than it could be (and leads to a ~2x slowdown in many cases).
2. There is an issue with slow IO using stdio, as I mentioned in this comment. Paraphrasing the issue here, this blog mentioned a bug that was making the earlier MinGW implementation of IO very slow, and it was patched to fix this issue. This is also why the newer C++ compilers on Codeforces have a decently fast implementation of scanf/printf compared to the older benchmarks in the blog. However, this is not the case for C. My hypothesis is that this issue wasn't patched for the C compiler. There is, however, a quick fix for this issue: add this line to the top of your submission, and it will revert to the faster implementation (at least as of now): #define __USE_MINGW_ANSI_STDIO 0. Thanks to ffao for telling me about this hack.
3. It seems that there is an issue with the ancient GCC version used for C11. For instance, when using the timespec struct to get high precision time, there is a compilation error (and it seems it is because of limited C11 support in GCC 5.1). Update: a similar issue arises (timespec_get not found) when I submit with C++20 (64) as well, so I think this is an issue with the C library rather than the compiler.

A possible fix to the above issues that works most of the time is to submit your code in C++20 (64 bit) instead of C11. You might face problems due to different behavior of specific keywords or ODR violations due to naming issues, but it should be obvious how to fix those errors manually.

### Some update requests

Coming to the original point of the blog, I have a couple of update requests (if any other updates can make C easier to use, feel free to discuss them in the comments). Tagging MikeMirzayanov so these updates can be considered in the immediate future.

• Adding C17 (64 bit) alongside the old C11 option: This should be doable using msys2 as done for C++. I am somewhat confident that not using MinGW for this purpose would make the slow IO issue disappear. Also, the reason why I recommend C17 instead of C11 is that C17 is the latest version that addresses defects in C11. Note that C17 and C18 are two names for the same thing. However, I still believe that the current C11 option should be available in the future as well, for the sake of completeness/more options/other reasons mentioned in the comments below.
• Fixing slow IO: If the above doesn't fix the slow IO issue, perhaps there could be a way to fix it in MinGW itself. I am not too familiar with this, though.
• Fixing the timespec_get issue (and other C11/C17 support issues): I am not exactly sure on how to go about fixing this, but since this seems to be a C library issue, it could be a good idea to try installing a C library that has this functionality (and is C11 compliant). The issue probably boils down to Windows not having libraries that support everything in the C standard, since C11/C17 works perfectly on my Linux system.



By nor, 13 months ago, Thank you for participating in this round! We hope you found the problems interesting and/or funny.

Special thanks to dorijanlendvaj for helping with Polygon related issues. Preparing good tests for certain problems was in fact a good problem in itself (in fact, sometimes they were hard enough for usual CF rounds), and we will mention these problems for the interested reader.

If the editorials are not visible, try registering for the contest or something similar.

Problem A: idea: ToxicPie9, preparation: nor

Solution
Code (Python) for solution 1
Code (Python) for solution 2
Code (Python) for solution 3
Preparation trivia

Problem B: idea: dorijanlendvaj, preparation: dorijanlendvaj

Solution
Code (Python)
Preparation trivia

Problem C: idea: nor, preparation: nor

Solution
Code (Python)

Problem D: idea: Ari, preparation: Ari

Solution
Code (Python)
Preparation trivia

Problem E: idea: errorgorn, preparation: nor

Solution
Code (Python)

Problem F: idea: nor, preparation: nor

Solution
Code (Python)

Problem G: idea: nor, preparation: nor

Solution
Code (Python) for solution 1
Code (Python) for solution 2
Preparation trivia

Problem H1: idea: nor, preparation: nor

Solution
Code (Python)
Preparation trivia

Problem H2: idea: ToxicPie9, preparation: nor

Solution
Code (Python) for solution 1
Code (Python) for solution 2
Code (Python) for solution 3
By nor, 13 months ago, Hi Codeforces!

We are pleased to invite you to [contest:383377], which will take place on [contest_time:383377]. You will be given 9 problems and 2 hours to solve them. Please follow this link in order to register for the contest. Note that this contest is not an official Codeforces round.

The round will be unrated for all participants. It will be held according to modified ICPC rules.

All problems in this round were written and prepared by dorijanlendvaj, errorgorn, Ari, nor and ToxicPie9. We have worked on this round for quite a short time and hope you find every problem so easy that even your five year old sibling could solve it interesting.

This round is sponsored by errorgorn and Ari with the following prizes:

• 1st place: 10 lakh VEF
• 2nd place: 20 lakh VEF
• 3rd place: 30 lakh VEF
• 8th place: A steam copy of "Sakuya Izayoi Gives You Advice And Dabs"

We would like to thank:

• No one for marvellous coordination of the round.
• errorgorn for being a VIP tester, who liked the problems enough to sponsor the round prizes of 60 lakh VEF.
• Ari for sponsoring the 8th place prize.
• MikeMirzayanov for the amazing Codeforces and Polygon platforms!
• You for participating and upvoting.

Good luck and have fun! See you in the standings. Make sure to read the statements and conversion rate very carefully.

Reminder: The registration has just started!

Reminder 2: The contest starts in 15 minutes! Good luck, have fun!

UPD: Congratulations to the winners:

We would also like to congratulate the top $104$ participants for solving all the problems in-contest!

The editorial along with some more interesting stuff can be found here.

By nor, 17 months ago, This blog will be more about developing ideas about certain structures rather than actually coding up stuff. The ideas we'll be developing here have many applications, some of which involve:

• Number Theoretic Möbius Inversion
• Principle of Inclusion-Exclusion
• Directed Acylic Graphs
• Subset Sum Convolution
• Finite Difference Calculus (and prefix sums)

Note that some of the terminology used below might be non-standard. Also, some parts of the blog make a reference to abstract algebra without much explanation; such parts are just for people who know some algebra, and can safely be skipped by those who don't know those concepts at all.

Prerequisites:

• Matrix multiplication
• Basic combinatorics

## Definition of a poset and some intuition

All of these have one thing in common: they have an underlying "poset". Let's understand what a poset is.

Rigorously speaking, a poset is a partially ordered set, i.e., a set $S$ equipped with a relation $\le$ that is reflexive, antisymmetric, and transitive.

More precisely,

• $\le$ is reflexive means that $a \le a$ for all $a$ in $S$.
• $\le$ is antisymmetric means that if $a \le b$ and $b \le a$, then $a = b$
• $\le$ is transitive means that if $a \le b$ and $b \le c$, then $a \le c$

To get some intuition about posets, it's helpful to think of them in terms of directed acyclic graphs (DAGs), equipped with the relation of reachability; i.e., vertices $a$ is related to $b$ iff there is a path from $a$ to $b$. In fact, there is always at least one DAG corresponding to every poset that is equivalent to it in terms of semantics.

To make this notion a bit clearer, we construct a "DAG" (we allow self-loops, so it is not precisely a DAG, but we can get rid of these self-loops) as follows: construct a vertex for each element of the set, and there is an edge from $u$ to $v$ iff $u \le v$. For the lack of a better word, we shall call this the "relational DAG" of the poset. Note that this notation is not standard.

Remark: By the properties of a poset, this "DAG" has the property that if there is a path from $u$ to $v$, there is an edge between $u$ and $v$ (i.e., reachability and adjacency are equivalent in this graph). In other words, this graph is its own reflexive transitive closure. Also, a topological sort of this "DAG" (ignoring the self-loops) is called a linear extension of the poset (flattening out the "DAG" induces a natural linear order on the vertices, which are just the elements of the poset).

It is usually easier to look at this graph after removing all redundancies (i.e., if $u \le v$ and $v \le w$, then remove the edge $(u, w)$ from the graph if it exists). The resulting graph is called the Hasse diagram of the poset, and is super useful for reasoning about posets.

Note that it is NOT necessary that for any two elements $(u, v)$, one of $u \le v$ and $v \le u$ must hold. For instance, consider the poset given by the set $S = \{a_1, a_2, a_3, a_4\}$ and the relation $\le$ given by $\{(a, a) \mid a \in S\} \cup \{(a_1, a_2), (a_1, a_3), (a_1, a_4), (a_2, a_4), (a_3, a_4)\}$. This satisfies the constraints that $\le$ must be reflexive, antisymmetric and transitive; however, neither of $a_2 \le a_3$ and $a_3 \le a_2$ hold.

For the remainder of this blog, it will help to think of $\le$ in terms of the Hasse diagram of the poset.

Note: when we say $a < b$, we mean that $a \le b$ and $a \ne b$.

## Some examples of posets

• Either of the sets $\mathbb{R}, \mathbb{N}, \mathbb{Z}, \mathbb{Q}$ equipped with the usual $\le$ is a poset.
• For any set $A$, the set of its subsets, equipped with the relation $\subseteq$ (or $\supseteq$) is a poset.
• The set $\mathbb{N}$ equipped with the relation $\mid$ (divisibility) is a poset.

## The incidence algebra of a poset

Note for the pedantic reader: In what follows, we only look at locally finite posets (i.e., posets in which the set of $z$ such that $x \le z \le y$ is finite for all $x, y$). In particular, we can use the following ideas with infinite DAGs too, though it might be possible that infinitely large (but "locally sparse") matrices may not make sense.

In particular, the following doesn't apply to the posets $(\mathbb{R}, \le), (\mathbb{Q}, \le)$ and so on.

Let $(S, \le)$ be some poset. Consider the adjacency matrix $M$ of its relational DAG. It has the property that if $x \not\leq y$, then the entry at $(x, y)$ is $0$, and otherwise it is $1$. We can also interpret this matrix $M$ as a function from $S \times S \to \{0, 1\}$.

We generalize this notion a bit. Rather than having $M(x, y) = 1$ whenever $x \le y$, we assign arbitrary complex numbers to $M(x, y)$ (which is equivalent to assigning arbitrary edge weights (possibly $0$) to edges in the relational DAG).

This leads to a set of functions $\alpha : S \times S \to \mathbb{C}$, with the property that $x \not\leq y \implies \alpha(x, y) = 0$. We call this the "incidence algebra" of the poset (for those who know what an algebra is, we'll see why this is an algebra). We won't distinguish between these functions and their corresponding matrices in what follows.

The main idea behind the whole approach is to look at these functions as matrices and extract useful information using matrix operations.

To define the "product" on this set of functions, we can simply use matrix multiplication to define its semantics. That is, for functions $\alpha, \beta$ in the incidence algebra, we define the product as $(\alpha\beta)(x, y) = \sum_{z \in S} \alpha(x, z) \beta(z, y)$, which is some kind of convolution.

We can see that the incidence algebra is closed under product (intuitively, think of it as aggregating some combination of edge weights over paths of length 2; if there is no path from $x$ to $y$, the set of such paths is empty anyway, so the answer is still 0, the identity of the aggregation function which is $+$).

Similarly, we can define scalar multiplication and addition in terms of matrix operations, and see that the incidence algebra is closed under these operations too.

Important remark: Note that if $f$ is a function from $S$ to $\mathbb{C}$, then a vector with $f(x)$ at the position corresponding to $x$ also satisfies the usual matrix multiplication identities with the elements of the incidence algebra, with the semantics of multiplication with a column vector being analogous to aggregating over paths that end at a given vertex (and for row vector left-multiplication, paths that end at a given vertex).

Optional remark: For those who know what an algebra is, matrix multiplication is a bilinear operator so product is the bilinear operator in the incidence algebra. A further generalization would be to define a different underlying field than $\mathbb{C}$ and another bilinear operator, but we won't look into this further.

## Some special members of the incidence algebra

Okay, now that we have defined this set and some operations on it, let's see some examples of members of this set, what information we can get out of these operations, and what kind of operations are "nice" in this context.

Firstly, the most obvious choice of a matrix is $O$, the zero matrix (which is clearly a member of the incidence algebra). This doesn't really give us any information though. It still has a role as the additive identity for addition in this algebra.

The second most obvious choice is $I$, the identity matrix. Note that this is a member of the incidence algebra due to reflexivity of $\le$. This is a multiplicative identity, and that's something we will need for Möbius inversion.

Thirdly, let's consider the adjacency matrix $M$ of the relational DAG. Let the corresponding function be $\zeta$. Clearly, this is also in the incidence algebra. We have the property that $\zeta(x, y) = 1$ if $x \le y$, and $0$ otherwise.

Spoiler alert: the inverse of this matrix is the Möbius function for the poset and corresponds to things like the number theoretic Möbius function in the divisibility poset and so on.

## The Möbius function

Finally, we define the Möbius function $\mu$ as follows (It might not feel intuitive at first, but bear with me for a while):

For $x \not\leq y$, $\mu(x, y) = 0$, $\mu(x, x) = 1 \, \forall x \in S$, and inductively define

$\mu(x, y) = -\sum_{x \le z < y} \mu(x, z)$

Note that $\mu(x, y)$ is always an integer. Also, note that $\sum_{x \le z \le y} \mu(x, z) = 1$ if $x = y$ and $0$ otherwise (in the case $x < y$, it follows from the identity, in the case when $x, y$ are not comparable, the sum is empty, and in the case when $x = y$, it consists of only one term equal to $1$).

Hence we have for all $x, y \in S$,

$\sum_{z \in S} \mu(x, z) \zeta(z, y) = \sum_{x \le z \le y} \mu(x, z) = \begin{cases} 1 & \text{if } x = y\\ 0 & \text{otherwise} \end{cases} = I(x, y)$

However, the expression on the left is just $(\mu \zeta)(x, y)$. Since this holds for all $x, y$, we have $\mu\zeta = I$.

As we mentioned earlier, this means that $\mu$ is the inverse of $\zeta$, so we must also have $\zeta\mu = I$ (and expanding this gives $\sum_{x \le z \le y} \mu(z, y)$ is $1$ if $x = y$ and $0$ otherwise).

## The Möbius inversion formula

The Möbius inversion formula in this setting just reduces to a simple identity corresponding to the multiplication of a matrix and a vector.

Formally, the Möbius inversion formula states that if $f, g$ are functions from $S$ to $\mathbb{C}$ such that $g(x) = \sum_{y \le x} f(y)$, then we have $f(x) = \sum_{y \le x} \mu(y, x) g(y)$. The converse also holds true.

For a proof, consider a vector $F$ with the element at position corresponding to $x$ being $f(x)$. Define $G$ similarly.

Now the condition merely means that $G = \zeta^\intercal F$. Left-multiplying this by $\mu^\intercal$, we get $\mu^\intercal G = \mu^\intercal \zeta^\intercal F = IF = F$. Expanding this gives the Möbius inversion formula. The converse can be proved by reversing all these steps, since $\mu^\intercal$ is invertible.

Note that this formula is just a trivial consequence of some matrix multiplication properties; so there's a lot more to the theory we have developed above than just this formula. However, we shall limit ourselves to only the applications of this formula to get to some interesting results.

## Some applications

### The generalized principle of inclusion-exclusion

Let's first talk about the contexts in which we use generalized inclusion-exclusion. We usually want to compute some function $f$ of a finite set $A$, but it is much easier to compute the sum of $f$ over all subsets of $A$.

That is, it is easy to compute $g(A) = \sum_{B \subseteq A} f(B)$, but hard to compute $f(A)$ itself.

The generalized principle of inclusion-exclusion states that $f(A) = \sum_{B \subseteq A} (-1)^{|A| - |B|} g(B)$.

For a proof, let's look at the poset corresponding to the power set of a finite set $F$ such that $A \subseteq F$, equipped with the $\subseteq$ relation. Clearly, using the Möbius inversion formula, we have $f(A) = \sum_{B \subseteq A} \mu(B, A) g(B)$.

We claim that $\mu(B, A) = (-1)^{|A| - |B|}$ for $B \subseteq A$. We'll do this by fixing $B$ and doing an induction on $|A|$.

For $|A| = |B|$, this is clear. Now suppose this is true for all $|A| < k$. Consider the case when $|A| = k$. We have (from the definition of the Möbius function) that

$\mu(B, A) = -\sum_{B \subseteq C \subsetneq A} (-1)^{|C| - |B|} = -\sum_{\emptyset \subseteq C \setminus B \subsetneq A \setminus B} (-1)^{|C \setminus B|} = (-1)^{|A| - |B|}$

where the last equality follows from the binomial theorem, and the proof is complete.

#### The usual principle of inclusion-exclusion as a special case

Now let's see how the usual inclusion-exclusion principle is just a special case of this generalized formula.

The inclusion-exclusion principle states that for sets $A_1, A_2, \ldots, A_n$, we have

$|A_1 \cup \cdots \cup A_n| = \sum_{1 \le i \le n} |A_i| - \sum_{1 \le i < j \le n} |A_i \cap A_j| + \cdots + (-1)^{n-1}|A_1 \cap \cdots \cap A_n|$

For a proof, consider the poset corresponding to the power set of $[n] := \{1, \ldots, n\}$, equipped with $\supseteq$.

Also define $U = A_1 \cup \ldots \cup A_n$.

For a set $S \subseteq [n]$, define

$f(S) =$ number of elements in all $A_i$, where $i \in S$, but not in any other $A_j$-s.

$g(S) =$ number of elements in all $A_i$, where $i \in S$, and possibly in other $A_j$-s.

Then we have $g(S) = |\cap_{i \in S} A_i|$. Also, we have $g(S) = \sum_{T \supseteq S} f(T)$. Using the Möbius inversion formula, we get

$0 = f(\emptyset) = \sum_{A \supseteq \emptyset} \mu(\emptyset, A) g(A) = \sum_{A \in 2^{[n]}} (-1)^{|A|} |\cap_{i \in A} A_i|$

which completes the proof.

### Subset sum convolution

I won't be explaining the ideas behind subset sum convolution in too much detail, but would refer you to this blog that covers the necessary material.

The poset in consideration is the power set of $[n]$, and its elements are in bijection with $n$ bit integers where the $i^\mathrm{th}$ bit is $1$ iff $i$ is in the set.

The transforms $z$, $\mu$ are the $\zeta$ and $\mu$ functions here (note that a two-variable function in the incidence algebra is just a map that transforms a one-variable function to another one-variable function, which can be thought of as the multiplication of the matrix corresponding to the two-variable function with the vector corresponding to the one-variable function), and $\sigma$ corresponds to a diagonal matrix with the element on the diagonal being 1 if the element of the poset has an even number of elements, and -1 otherwise.

The rest of that blog can conveniently be studied in terms of the incidence algebra for this poset.

### The number-theoretic Möbius function

For this, we will need to cover a bit more ground, since the divisibility poset has nicer properties compared to other posets (it is what is called a lattice).

#### What is a lattice

A lattice is a special kind of a poset, where for each pair of elements $(x, y)$, there exist elements $x \land y$ (called the meet of $x, y$) and $x \lor y$ (called the join of $x, y$) such that

• $z \le x$ and $z \le y$ $\implies$ $z \le x \land y$
• $x \land y \le x$ and $x \land y \le y$
• $x \le z$ and $y \le z$ $\implies$ $x \lor y \le z$
• $x \le x \lor y$ and $y \le x \lor y$

Intuitively, this basically means that each pair of elements in the Hasse diagram has a least common ancestor and highest common descendent. For instance, the power set poset and the divisibility poset are both lattices (exercise: what are the meet and the join functions for these lattices?).

Note that inductively, every finite subset of a lattice has a meet and a join. Also, due to antisymmetry of $\le$, the meet and the join are unique for each pair $x, y$. Hence for finite lattices $L$, there is a unique "maximum" (say $\top_L$) and a unique "minimum" (say $\bot_L$).

#### A theorem for lattices

Theorem: For a lattice $(L, \le)$, and $\bot_L < a \in L$, $\sum_{x \lor a = \top_L} \mu(\bot_L, x) = 0$.

Essentially, this theorem states that the sum of the Möbius function applied on all $\bot_L, x$ where the "LCA" of $a$ and $x$ is $\top_L$ is 0.

For a proof, we first note that because of the second Möbius identity (using the product of $\mu$ and $\zeta$), we have:

$\sum_{x \lor a = \top_L} \mu(\bot_L, x) = \sum_{x} \mu(\bot_L, x) \sum_{x \lor a \le y} \mu(y, \top_L)$

Using the definition of $\lor$ and rearranging the sums reduces the sum to

$\sum_{a \le y} \mu(y, \top_L) \sum_{x \le y} \mu(\bot_L, x)$

Using the first Möbius identity and noting that $y$ is not $\bot_L$, reduces this expression to $0$, as needed.

Now let's try to compute $\mu(a, b)$ for $a, b$ in the divisibility poset. Note that if $a$ doesn't divide $b$ or if $a > b$, this is just 0. So it suffices to consider the case where $b = ax$ for some natural number $x$.

Let's look at all positive integers $r$ such that $1 \mid r \mid x$. This forms a lattice $L$. Similarly, the set of positive integers $r'$ such that $a \mid r' \mid ax$ also forms a lattice, which is isomorphic to $L$. Also, since by the inductive definition, $\mu(a, ar)$ depends only on the elements $s$ such that $a \mid s \mid ar$, so we must have $\mu(a, ar) = \mu(1, r)$. So it suffices to compute $\mu(1, x)$ for all $x$.

We claim that for $x$ being square-free, $\mu(1, x)$ is $(-1)^k$ if $x$ is a product of $k$ distinct primes, and it is $0$ otherwise. To show this, we induct on $x$.

Note that $1 = \bot_L$ and $x = \top_L$. The claim is trivial for $x = 1$. Now suppose $p$ is a prime that divides $x$. Using the previously proved theorem, we get $\mu(1, x) = -\sum_{y \ne x, y \lor p = x} \mu(1, y)$.

If $p^2$ divides $b$, then the sum on the right is empty (as $\lor$ is just the LCM). Otherwise, the sum contains only one $y$, which equals $x / p$, so using the inductive hypothesis, we are done.

As a side-note, for "reasonable" functions that only depend on the ratio between their parameters, we can "compress" them into a function that takes a single parameter. The "product" in that case becomes Dirichlet convolution, which has tons of interesting properties. A relevant blog for that can be found here.

Also note that if we restrict ourselves to the divisibility posets of square-free numbers $n$, they are isomorphic to the power set poset of the set of prime divisors of $n$, so all power set posets are just special cases of divisibility posets!

### Finite Difference Calculus

In this case, the poset is particularly simple, and it equals $(\mathbb{N} \cup \{0\}, \le)$. The function $\mu$ takes a value of $1$ at $(n, n)$, $-1$ at $(n, n + 1)$, and $0$ everywhere else. "Convolution" with $\mu$ is equivalent to the difference operator, and using Möbius inversion, the prefix sum operator is that with $\zeta$.

## Final notes and remarks

Firstly, the content of this blog might not be super-useful in competitive programming, but the intent behind this was to introduce a very standard technique in combinatorics that people don't realize is super-flexible.

Secondly, it might not be easy to understand in a first reading, so it's encouraged to work out the math through the blog, and understand why and how these things are tied together. It will also help to make some Hasse diagrams of posets, and look at concrete examples of operations we described here.

Finally, there might be errors and typos in the blog, as well as better ways to approach the content of the blog. If you find any, please let me know in the comments! math,
By nor, 19 months ago, As a certain legendary grandmaster once said (paraphrased), if you know obscure techniques and are not red yet, go and learn binary search.

The main aim of this tutorial is to collect and explain some ideas that are related to binary search, and collect some great tutorials/blogs/comments that explain related things. I haven't found a comprehensive tutorial for binary search that also explains some intuition about how to get your own implementation right with invariants, as well as some language features for binary search, so I felt like I should write something that covers most of what I know about binary search.

This is mainly geared towards beginners, but some of the topics that I mention in the other resources, optimizing binary search and language-specifics sections might not be well-known to people in the low div-1 range. I will also include some references to pretty high-quality articles/resources with related content, for the interested.

## Contents

1. The main idea behind binary search
1. A small classical example
2. A more general idea
3. The generalization
2. Binary searching on the answer
3. Two other common ways of implementing binary search
4. Binary lifting for binary search
5. Language specific ways for binary searching
1. C++
2. Python
3. Java
6. Optimizing binary search
7. Other resources

## The main idea behind binary search

The main idea behind binary search is linear reduction of search space. We'll elaborate on this below.

### A small classical example

Let's start with the classical example of binary search. Suppose you have an array $[1, 7, 9, 12, 19]$, and you're asked to check if $7$ is a member of this array or not (and also return its 0-indexed position in this array). A naive approach would be to iterate over all elements, and check if any element is $7$ or not. This would take 5 iterations.

Now note that this array is sorted. So if I check some element at position $p$ in this array, we can have one of three possible outcomes:

• The element is equal to $7$: this means that we are done, and we can terminate our search!

• The element is less than $7$: since the array is sorted, all elements with position $< p$ are less than $7$ as well. So it is useless to look at any element with position $\le p$ now.

• The element is more than $7$: in a similar manner to the previous case, it is useless to look at any element with position $\ge p$ now.

So we could do something like this: initially our search space is the range of indices $[0, 4]$. Let's try looking at the middle element of our search space. If we do that, we end up in one of two scenarios:

• We terminate if the element at that position is $7$

• We reduce the search space to at most half of the original search space each time.

For a concrete example, let's look at the middle element of the search space, which is $9$. Since this is more than $7$, it is useless to look at indices in the range $[2, 4]$. So our search space reduces to $[0, 1]$. There are two midpoints of this array, let's just use the left one for the sake of uniformity. The element we are now looking at is $1$, which is less than $7$. So it is useless to look at indices in the range $[0, 0]$. So our search space becomes $[1, 1]$. The middle position is just $1$, and since the element there is $7$, we can return the position $1$ as the answer.

What if there was a $4$ instead of $7$ in the original array? In the last step of the above dry-run of the algorithm, we would note that it is useless to look at indices in the range $[0, 1]$, so the search space becomes empty! When the search space becomes empty, that means we haven't found it yet. So we should return some special value indicating that the search function resulted in a failure. We'll use the size of the array as the return value for now (we will see why we took this choice later on).

An implementation might look like the following:

int find_position(const vector<int>& a, int x) {
int l = 0;
int r = (int)a.size() - 1;   // [l, r] is our search space
while (l <= r) {             // search space is non-empty
int m = l + (r - l) / 2; // "middle" position in the range
if (a[m] == x) return m; // found!
else if (a[m] < x) {
l = m + 1;           // remove all indices <= m from the search space
} else {
r = m - 1;           // remove all indices >= m from the search space
}
}
return n;                    // failure
}


Is this more efficient than a linear search? Note that at each step, we reduce the search space by at least half, so in $1 + \log_2 n$ iterations (where $n$ is the size of $a$), the search space size reduces to $< 1$, and since the size is an integer, it becomes $0$. It's easy to see how it terminates.

### A more general idea

More often than not, binary search is not used in such a simple use-case in competitive programming (or in real life).

Let's have a closer look at what we were doing in the above algorithm. We were using some kind of ordering (the ordering of integers, in our example) to discard a large part of the search space (in fact, we discarded about half the search space each time).

How do we generalize this further? More importantly, what will be the statement of the generalized problem?

Firstly, we will do something that will make our original problem easier to generalize. Rather than checking if an element is present in the array, we can find the position of the first element that is greater than or equal to the element (if no such element exists, we'll return the size of the array).

Clearly, this is even stronger than our original problem. If the answer to this problem is the index $i$, we can check if this is the right answer by checking if $i < n$ and $a[i] = x$. If this condition isn't true, we don't have any occurence of $x$ in $a$.

Let's try to do away with the ordering, and see how far we go.

Let's mentally (not in code) construct an array $b$, where the value of $b[i]$ is true if and only if $a[i] < x$.

How does $b$ look like? Clearly, some (possibly empty) prefix of it is filled with "true", and the remaining (possibly empty) suffix is filled with "false".

Then our problem reduces to finding the position of the first false ($n$ if not found) in this kind of an array $b$. For now, forget all about $a$ and $x$, and focus only on $b$.

It turns out, that for any such array $b$, we can easily find the first false using the same idea of discarding parts of the search space.

Let's start with some notation. $[l_0, r_0]$ will be the interval we need to search (for our problem it is $[0, n - 1]$). $l, r$ will be indices we shall maintain at each step such that the range of indices $[l_0, l]$ in $b$ consists only of "true" values, and the range of indices $[r, r_0]$ in $b$ consists only of "false" values.

Initially, we have zero information about any element of $b$, so it is better to force these ranges to be empty ranges. We can do so by setting $l = l_0 - 1$ and $r = r_0 + 1$ initially. We will try to increase $l$ and decrease $r$ so that these two ranges cover the whole search space. So, in the end, $l$ will correspond to the location of the last true, and $r$ will correspond to the location of the first false, and we can simply return $r$!

Let's suppose that at some point, $r - l > 1$. This means that there is at least one element between the indices $l$ and $r$ that we haven't put in one of these intervals yet. Let's take an index roughly in the middle of this range $(l, r)$, say $m = \lfloor(l + r) / 2\rfloor$.

• If $b[m]$ is true, then we know that everything to the left of $b[m]$ is true as well, so we can increase the range $[l_0, l]$ to $[l_0, m]$ (which is equivalent to replacing $l$ by $m$).

• Otherwise, it is false, so everything to the right of $b[m]$ is false as well. So we can increase the range $[r, r_0]$ to $[m, r_0]$.

Each time, we reduce the size of the unexplored range from $r - l - 1$ to $r - m - 1$ or $m - l - 1$, which corresponds to a reduction by at least half. So the number of iterations is at most $\log_2(r_0 - l_0 + 1) + 1$.

An implementation of this would look something like this:

int find_first_false(const vector<bool>& b) {
int l = -1;
int r = (int)b.size();
while (r - l > 1) {
int m = l + (r - l) / 2;
if (b[m]) {
l = m;
} else {
r = m;
}
}
return r;
}


Note that this terminates by our previous observation.

But, we said earlier that we won't really construct $b$ in our code, right? How do we use this same algorithm for solving our own problem?

Note that, by what we said earlier, $b[i]$ is just defined as the truth value of $a[i] < x$, so computing it on the fly is no big deal. So, if we replace $b[m]$ with $a[m] < x$, that would solve our problem.

That was quite a handful, so let's see a brief summary of what we did:

• Rather than explicitly reason using the order < and the value x, we constructed an array $b$ of some very specific form (the first few things in it being true and the rest being false), and found the location of the first false in b.

This very clearly points to our desired generalization:

### The generalization

Suppose we are given the following:

• $[l, r]$: a range of integers (in our example, $[0, n - 1]$)

• $f$: a function from integers to booleans, which satisfies the following property: there exists some integer $t$ such that for all $l \le x < t$, $f(x)$ is true, and for all $t \le x \le r$, $f(x)$ is false.

Then if the time taken to compute $f$ is upper bounded by $T(l, r)$, we can find the value of $t$ (i.e., the first false index) in $O(T(l, r) \log_2 (r - l + 1)) + O(1)$ time.

Such an $f$ is called a predicate. In the problem we discussed above, $f$ is called a predicate.

An implementation of this function will be something like the following (ignoring overflow errors):

template <class Integer, class F>
Integer find_first_false(Integer l, Integer r, F&& f) {
--l;
++r;
while (r - l > 1) {
Integer m = l + (r - l) / 2; // prefer std::midpoint in C++20
if (f(m)) {
l = m;
} else {
r = m;
}
}
return r;
}


Note that this implementation also has the nice property that $l$ is the position of the last true in the corresponding array $b$, so you can define a function similar to this one that returns $l$ instead.

To use this function to implement our original binary search, we can do something like the following:

int find_position(const vector<int>& a, int x) {
auto f = [&](int i) {
return a[i] < x;
};
int n = (int)a.size();
int pos = find_first_false(0, n - 1, f);
if (pos == n || a[pos] != x) return n;
return pos;
}


Note that this abstraction also gives us the following result: we don't really need $a$ to be sorted at all. The only thing we need is that everything less than $x$ in $a$ should be in a prefix, and everything not less than $x$ should be in the remaining suffix and if $x$ is in the array, the beginning of that suffix should have $x$ at it. This definition also handles possible duplicates of $x$ in the array.

As we'll see later on, this kind of an abstraction allows us to solve a whole variety of problems.

Exercise: What would you do if in $b$, the first few elements were false, and the rest were true?

## Binary searching on the answer

Sometimes, it is much easier to deal with bounds on the answer rather than the exact answer itself.

In other words, sometimes it is much easier to construct a function $f$ that returns true iff the input is $\ge$ the answer to the problem, by running some algorithm that returns a boolean value not by computing the answer, but by considering some properties of the problem at hand and the input.

To make this more concrete, let's consider this problem.

In short, you're given an $n \times m$ multiplication table, and you want to find the $k^\mathrm{th}$ smallest entry in this table (i.e., if you were to sort all the entries, find the entry at position $k$).

It is not clear how to directly find this value. But given a "guess" $x$, we can see if this is at least the answer or not:

Let's find the number of integers smaller than $x$ in each row, and sum them up. This can be done by a simple division for each row.

If the total numbers less than $x$ are $< k$, we will return true, otherwise, we will return false. This predicate works since the number of entries smaller than $x$ is a non-decreasing function of $x$ (more commonly, we compare at $f(x), f(x + 1)$ and try to argue that if $f(x)$ is false, then $f(x + 1)$ is also false), and hence the last element that makes the function return true is in fact the $k^\mathrm{th}$ smallest entry.

This can be found using binary search as we have discussed above.

## Two other common ways of implementing binary search

We'll discuss two other ways of implementing usual binary search, and why they seem intuitive to people who use them.

We'll refer to the previous implementation as the $(l, r)$ way, since the invariant in that was equivalent to saying that the remaining search space will always be (l, r).

This section is just to help out people in reading others' binary search implementations, as well as choose between implementations to find the way you find best. I prefer using the implementation used above, so I'll explain the other two implementations in the context of that, with some intuition as to why people choose to implement things that way.

### The $[l, r]$ way

In this type of an implementation, we maintain $l + 1$ and $r - 1$ rather than $l$ and $r$. The reason behind this is something like the following (independent of the above reasoning):

$[l, r]$ consists of the currently explored search space. Let's maintain an extra integer ($ans$), which stores the "best" answer found so far, i.e., the leftmost false found so far.

This interval is non-empty if $r \ge l$. The midpoint is the same as before, i.e., $m = \lfloor(l + r) / 2\rfloor$. If $f(m)$ evaluates to false, then the best possible answer so found has to be $m$, and we reduce the search space from $[l, r]$ to $[l, m - 1]$. Otherwise, we reduce it to $[m + 1, r]$, and do not update the answer.

Note that here we would also need to maintain a separate variable $ans$, and initialize it appropriately, depending on whether you want the first false (as done above), or the last true (which can be done by moving the update of the $ans$ variable across the if-branches), which is also why I haven't used it for a very long time. However, people who prefer closed intervals and this explanation seem to gravitate towards this implementation.

The invariant here remains that if $l', r'$ are the $l, r$ in the old implementation, and $l, r$ are the $l, r$ in this implementation, then $l' = l - 1$ and $ans = r' = r + 1$. In the case where you need to find the last true, the invariant becomes $ans = l' = l - 1$ and $r' = r + 1$.

An implementation is as follows:

template <class Integer, class F>
Integer find_first_false(Integer l, Integer r, F&& f) {
Integer ans = n;
while (l <= r) {
Integer m = l + (r - l) / 2; // prefer std::midpoint in C++20
if (f(m)) {
l = m + 1;
} else {
ans = m;
r = m - 1;
}
}
return ans;
}


### The $[l, r)$ way

In this type of an implementation, we maintain $l + 1$ and $r$ instead of $l$ and $r$. The interpretation is that the search space is in $[l, r)$, and people who like using half-open intervals tend to prefer this implementation. The rationale behind this is that when the search space becomes empty, it corresponds to the range $[ans, ans)$ (if you're trying to find the first false, of course). The structure of the intervals doesn't always correspond in this implementation with the other implementations, since the value of $m$ used is usually slightly different (and usually equal to $\lfloor(l + r) / 2\rfloor$ where $[l, r)$ is the search space at that point, and in the case that there are two range midpoints, this is the one to the right and not to the left).

It's much more illuminating to think of it in this way: Suppose you have a range $[l, r)$, and you need to insert a false just after the last true in the range. What will be the position of the new false?

Another way of looking at it is: everything in the range $[l, ans)$ corresponds to true, and everything in the range $[ans, r)$ corresponds to false. So $ans$ is some sort of a "partition point" for the array. We will see later on how this exact interpretation is used in an implementation of this function in C++'s STL.

Let's check the midpoint of the range $[l, r)$. If $f(m)$ is true, we will never put it at a location $\le m$, so $l$ needs to be updated to $m + 1$. Otherwise, we have to put it at a position $\le m$, so the $r$ needs to be updated to $m$.

template <class Integer, class F>
Integer find_first_false(Integer l, Integer r, F&& f) {
++r;
while (l < r) {
Integer m = l + (r - l) / 2; // prefer std::midpoint in C++20
if (f(m)) {
l = m + 1;
} else {
r = m;
}
}
return r;
}


## Binary lifting for binary search

This and the next sections will be much more terse than the previous section, since we will be doing more of an overview of methods instead of explaining something from scratch.

Consider the following implementation:

template <class Integer, class F>
Integer find_first_false(Integer l, Integer r, F&& f) {
if (l > r) return r + 1;
++r;
Integer w = Integer(1) << __lg(r - l);
--l;
if (f(l + w)) l = r - w;
for (w >>= 1; w >= Integer(1); w >>= 1)
if (f(l + w)) l += w;
return l + 1;
}


Here, we try to ensure that the interval sizes are always powers of 2. Why we do so will be explained in the section on optimizing binary search.

After we ensure that the interval sizes are always powers of 2, we can just try to reconstruct the binary representation of $ans - 1 - l$, where $ans$ is what we need to return. For that, we can try to go from the highest bit to the lowest bit, and greedy works here for the same reason that binary representations are unique; adding a higher power leads to the predicate being false.

## Language specific ways for binary searching

Some of the most used languages in competitive programming fortunately have some inbuilt functions (albeit limited) to do binary search for you.

### C++

• binary_search: This function returns a bool that tells you if there is an element equal to the queried element between two iterators (with an optional comparator).

• lower_bound, upper_bound: If the range occupied by elements comparing equal to $x$ is defined by iterators $[it_l, it_r)$ in a given range of iterators $[input_it_l, input_it_r)$, then lower_bound and upper_bound return $it_l$ and $it_r$. In other words, lower_bound(it1, it2, x, comp) returns the iterator to the first element in the range of iterators $[it1, it2)$ which is $\ge x$ according to $comp$ (which is optional and defaults to the usual comparison), and upper_bound does the same for $> x$ instead.

• equal_range: This returns both lower_bound and upper_bound.

• partition_point: This works exactly like the binary search function we wrote in the $[l, r)$ way. In C++20, with ranges::views::iota, we can use it to do binary search on the answer as well.

### Python

• The bisect module has useful functions like bisect, bisect_left, bisect_right that do similar things to lower_bound and upper_bound.

### Java

• Collections.binarySearch and Arrays.binarySearch do something similar to finding an index/element (not necessarily the first equal or the last equal) using binary search.

## Optimizing binary search

In our first implementation, we were returning early when we found the element. Sometimes, it's faster to stop the binary search early, and in those cases, this is sometimes a constant factor optimization.

The binary lifting implemented for binary search above is also a constant factor optimization on some architectures if unrolled manually (which is possible due to the simple nature of the increments, and the possibility of hardcoding the constants), according to John Bentley's Programming Pearls. It is an improvement in some cases, where the locations you binary search on are few in number and repeated in successive binary search applications, leading to better cache usage. An example of doing that optimization to the extreme would be to implement multiple functions (each with a different power of 2), and store the function pointers in an array, and use computed jumps (for instance, by computing the power of 2 needed) to get to the correct function at runtime, which still leads to some speedup on certain architectures.

For instance, for certain types of queries (where either $l$ or $r$ are more or less the same), the speedup is pretty significant. However, for other types of queries, the simpler version is more efficient. To show the kind of speedups you can get for certain types of queries, I did some benchmarks on queries where the left bound is always the same. For the benchmarking code below, the results are as follows:

Benchmarking code

Results:

-----------------------------
Simple binary search
744175
Time: 2184 ms
-----------------------------
Binary lifting
744175
Time: 1504 ms
-----------------------------
Binary lifting with unrolling
744175
Time: 1407 ms
-----------------------------


https://algorithmica.org/en/eytzinger explains quite a nice method of exploiting the cache and speeding up binary search in a certain setup by a pretty impressive factor.

https://codeforces.com/blog/entry/76182 explains a variation of binary search which divides the ranges in an uneven order, which can lead to a change at the complexity level as well.

## Other resources

As promised, here's a collection of resources that I felt are pretty high-quality, that pertain to topics related to binary search.

A great resource is the Codeforces EDU tutorial (and problems) on binary search.

A great video that also explains binary search is linked in the following blog: https://codeforces.com/blog/entry/67509.

There is also an extension to vanilla binary search, called the parallel binary search, and https://codeforces.com/blog/entry/89694 links to a great video tutorial for this extension.

Binary lifting is a pretty general idea. A tutorial on binary lifting on fenwick trees can be found at https://codeforces.com/blog/entry/61364, a great one on binary lifting on trees can be found at https://usaco.guide/plat/binary-jump?lang=cpp (which also has links to other resources).

Binary search on segment trees is a quite useful technique too, and a working implementation (with some intuition on how it works) for recursive segment tree can be found at https://codeforces.com/blog/entry/83883. An implementation for the iterative segment tree can be found at https://github.com/atcoder/ac-library/blob/master/atcoder/lazysegtree.hpp.

A rough idea of how it works (based off a conversation where I was explaining this code to someone) is as follows. Some parts of it might not make sense, so please let me know if there's something unclear.

Spoiler

Please let me know if you find any bugs or typos in this blog!

UPD: Added benchmarking code for binary lifting (with and without unrolling) and binary search in a few special cases to show speedups. By nor, 19 months ago, ## Introduction

A while ago ToxicPie9 and I made and posted this meme: To my surprise, many people are quite confused about it. In fact, I realized there are a lot of misconceptions about pragmas. Most C++ users on Codeforces put a few lines of pragmas at the start of every submission. However, I believe many of them don't fully understand what they're doing or how pragmas work; the only thing people seem to think is that "add pragmas -> code go brr".

Pragmas are a form of the dark arts that are feared and used, both correctly and incorrectly, by lots of competitive programmers. They are widely believed to make your code much faster, but sometimes may lead to slowdowns and even runtime errors on certain platforms. In this blog, I will explain the effects of #pragma GCC optimize and #pragma GCC target, how they work, and how you should and shouldn't use them.

Feel free to skip to the end of the blog for a TL;DR.

## Warning

All of this discussion is for GCC, and the code you will write might not be portable across different compilers or platforms (in terms of turning on the same optimizations). However, you may assume that most of them work on Codeforces and other x86 platforms that are not too old.

## Some Non-examples

The following does nothing.

#pragma GCC optimize(" unroll-loops")
#pragma gcc optimize("Ofast")
#pragma GCC optimization("Ofast")
#pragma optimize(Ofast)


Yes, these are real-life examples we have seen being used by many competitive programmers, including quite a few LGMs. Perhaps the third one among these came from this blog which seems to have popularized the idea of using pragmas, but with a small mistake. Many people are misled to believe that some of the above work, when they actually don't -- stop using them.

If ever in doubt about whether your pragmas are correct, turn on most compiler warnings with the command-line option -Wall (or the more specific -Wunknown-pragmas). For example, if you compile the code above with -Wall, you'll get the following output, which tells you that the pragmas are invalid and useless:

foo.cpp:2: warning: ignoring ‘#pragma gcc optimize’ [-Wunknown-pragmas]
2 | #pragma gcc optimize("Ofast")
|
foo.cpp:3: warning: ignoring ‘#pragma GCC optimization’ [-Wunknown-pragmas]
3 | #pragma GCC optimization("Ofast")
|
foo.cpp:4: warning: ignoring ‘#pragma optimize ’ [-Wunknown-pragmas]
4 | #pragma optimize(Ofast)
|
foo.cpp:1:37: warning: bad option ‘-f unroll-loops’ to pragma ‘optimize’ [-Wpragmas]
1 | #pragma GCC optimize(" unroll-loops")
|                                     ^


Try to check if your submissions have similar problems. If yes, then you have probably been using pragmas wrong the entire time -- it's time to change your default code.

## What Is "#pragma GCC optimize"?

It turns on certain optimization flags for GCC. The syntax is

#pragma GCC optimize (option, ...)


From the official source on GCC pragmas, this pragma allows you to set global optimization flags (specified by option) for functions that come after it. Let's have a look at a few of them:

• O0, O1: These are pretty useless for competitive programming purposes, so we won't discuss these here.
• O2: This is the default optimization option on Codeforces, so using this might not give any tangible benefit.
• O3: This is the first non-trivial optimization option. It can make your code slower sometimes (due to the large size of generated code), but it is not very frequent in competitive programming. Some of the things it does are:
• Auto-vectorize the code if the mentioned architectures allow it. This can make your code much faster by using SIMD (single instruction, multiple data) which kinda parallelizes your code on an instruction level. More info below.
• Function inlining — inlines functions aggressively if possible (and no, marking functions as inline doesn't inline functions, nor does it give hints to the compiler)
• Unrolls loops more aggressively than O2 (this might lead to instruction cache misses if generated code size is too large)
• Ofast: This is one of the more controversial flags. It turns on all optimizations that O3 offers, along with some other optimizations, some of which might not be standards compliant. For instance, it turns on the fast-math optimization, which assumes floating-point arithmetic is associative (among other things), and under this assumption, it is not unexpected to see your floating-point error analysis go to waste. Ofast may or may not make your code faster; only use this if you're sure it does the right things.

You can also use some other options, like:

• unroll-loops -- Enables aggressive loop unrolling, which reduces the number of branches and optimizes parallel computation, but might increase code size too much and lead to instruction cache misses.
• unroll-all-loops -- Usually makes the program slower.
• strict-overflow -- Enables some optimizations that take advantage of the fact that signed integer overflow is undefined behavior.
• trapv -- This one is quite special, as it is not usually considered an "optimization": enabling this will make your code run much slower, but causes signed integer overflows to generate runtime errors. Useful for debugging. (Editor's note: if your system supports, it's recommended to use a sanitizer instead. You can find tutorials on the internet, so we won't discuss it here.)

A full list of supported flags in GCC can be found here.

An example: let's say you want to use O3 and unroll-loops. Then you could do something like

#pragma GCC optimize("O3")
#pragma GCC optimize("unroll-loops")


or

#pragma GCC optimize("O3,unroll-loops")


or

#pragma GCC optimize("O3","unroll-loops")


Note that the options are strings inside quotation marks, and there's no space after or before the comma in the second way of writing. However, you can have a space before or after the comma in the last way of writing without any issues.

## What Is "#pragma GCC target"?

Many modern computers and almost all competitive programming judges run on the x86 architecture. It has received a lot of additions over the years -- particularly, extra features (instructions) that enable different calculations or make existing ones much faster.

Compilers allow you to take advantage of these instruction sets extensions. For example, CPUs that support the popcnt instruction can theoretically compile __builtin_popcount into one instruction, which is much faster than usual implementations of this function. Similarly, if you want to auto-vectorize code, you'd need some instruction sets that actually support the vector instructions. Some of these instruction sets are sse4.2, avx, avx2. Here's a list of instruction sets that are usually supported on Codeforces:

• avx and avx2: These are instruction sets that provide 8, 16 and 32 byte vector instructions (i.e., you can do some kinds of operations on pairs of 8 aligned integers at the same time). Prefer using avx2 since it's newer.
• sse, sse2, sse3, sse4, sse4.1, sse4.2: These are instruction sets that are also for vectorization, but they're older and not as good as avx and avx2. These are useful for competitions on websites such as Yandex, where avx2 is not supported and gives a runtime error due to unrecognized instruction (it corresponds to a SIGILL signal — ill-formed instruction).
• popcnt, lzcnt — These optimize the popcount (__builtin_popcount family) and count leading zeros (__builtin_clz family) operations respectively.
• abm, bmi, bmi2: These are bit manipulation instruction sets (note that bmi is not a subset of bmi2). They provide even more bitwise operations like ctz, blsi, and pdep.
• fma: This is not so widely used, since avx and sse make up for most of it already.
• mmx: This is even older than the sse* family of instruction sets, hence is generally useless.

You should be quite familiar with them if you have written SIMD code before. If you're interested in a certain instruction set, there are plenty of resources online.

An example: if you want to use, say, avx2, bmi, bmi2, popcnt and lzcnt, you can write

#pragma GCC target("avx2,bmi,bmi2,popcnt,lzcnt")


Again, note that we don't have spaces in the string.

There are two main ways to use #pragma GCC target.

• You can use it with the optimization pragmas. It allows the compiler to automatically generate efficient SIMD instructions from parts of your code (based on the optimization flags), often boosting their performance by roughly 2, 4 or even 8 times.
• It enables you to use the intrinsics of supported instruction sets. For example,
#include <immintrin.h>
// returns the odd bits of x: 0b01010101 -> 0b1111
uint32_t odd_bits(uint32_t x) {
return _pext_u32(x, 0x55555555u);
}


This code works very efficiently, but won't compile unless you specify a valid target. Normally this is done by adding the -mbmi2 compilation flag, but this is impossible on Codeforces, so you can only enable it by other ways like #pragma GCC target("bmi2").

• An extreme example of this is this super-fast convolution code that makes heavy use of SIMD intrinsics to squeeze the running time of an $\mathcal{O}(N \log N)$ algorithm with a traditionally bad constant factor with $N \approx 5 \times 10^5$ to only 49ms.

Note that using vectorization targets won't work unless you have O3 or Ofast on (more specifically, tree-vectorize, with at leastO2 -- note that O2 is turned on by default on Codeforces).

## Function Attributes

Just in case you don't want to apply these optimizations globally, you can use function attributes. For instance,

__attribute__((target("avx2"), optimize("O3", "unroll-loops"))) void work() {
// do something
}


This enables these attributes for this function only, and the rest of the code will be compiled with the global options.

## Bonus

#pragma GCC ivdep: This pragma forces the compiler to vectorize a loop if the loop was not vectorized due to possible aliasing. This means that the compiler wasn't able to prove that vectorizing is safe due to data dependency, but you can, due to knowledge of the problem at hand, show that this case will never happen. This results in undefined behaviour if you're not careful enough about there not being a dependency across loop iterations. Think of this having a similar idea behind it as pointer restrict in C99.

#pragma GCC push_options, #pragma GCC pop_options: These maintain a stack of target and optimization pragmas, enabling you to temporarily switch to another set of options. #pragma GCC reset_options resets all the options.

#pragma GCC optimize "Ofast" and #pragma GCC optimize "-Ofast" also surprisingly work. The same holds for stuff like #pragma GCC optimize "-funroll-loops" and #pragma GCC optimize "unroll-loops". However, #pragma GCC target "avx2" works but #pragma GCC target "-mavx2" doesn't.

## Some Caveats

As we have pointed out already, there might be some caveats associated with using the aforementioned pragmas.

• Some of these optimize/target pragmas can potentially make your code slower too, due to code size and other associated stuff.
• You might get runtime errors if you're careless about your choice of instruction sets. In general, on any platform, you should try to use each instruction set (and also write code that benefits from that instruction set!) and see if you're getting weird behaviour or not. There are also some ways to check if the machine your code is going to be run on has certain instruction sets at compile time, but inconsistencies with the actual set of instruction sets are possible in subtle ways, so you shouldn't in general depend on them. A good way, however, is to run custom tests with stuff like assert(__builtin_cpu_supports("avx2")).
• In general, constant-factor optimization at such a low level is somewhat unpredictable, and you should rely on measuring the performance rather than guessing from the instruction counts/latencies. This also means taking the contents of this blog with a grain of salt and not blindly believing that using these pragmas will somehow help you cheese every $\mathcal{O}(n^2)$ solution into the time limit for $n = 10^5$. A good way to look at generated assembly is by using godbolt.org.

## Some Examples

There are multiple incidents in codeforces folklore that point to how fast pragmas can make your code. Some instances are:

Some more helpful discussion can be seen in the thread for this comment: https://codeforces.com/blog/entry/94609?#comment-836718

## Conclusion/TL;DR

If you skipped the entire blog to read this part, it's recommended that you also read the Some Non-examples section and check if you're using an incorrect pragma.

No pragma is absolutely safe to use. Don't blindly believe that one line of pragma can suddenly, magically boost the speed of all your submissions. Often, you will need some testing or researching to decide how to improve your constant factor.

When restricted to competitive programming on Codeforces or a fixed platform, however, some results might be more predictable. Specifically,

• The default O2 already has a lot of optimizations compared to O0. Increasing it to O3 or even Ofast might increase or decrease the performance, depending on your code and input. Most online judges provide a "custom test" function which helps you figure out which one is more appropriate. You can also test other flags if you know what they do.
• Usually, the avx and avx2 targets are supported by newer online judges. When they are supported, these can usually make your code faster. They often work best when your code is easily vectorizable (e.g., tight loops) -- which requires some experience to write. Again, custom tests are always useful (also, make sure they don't cause errors due to architecture issues). You can try the sse family if they don't work.
• Other targets like popcnt and bmi don't usually matter a lot when the compiler is optimizing, however operations like __lg, popcount, etc. require them to be fast. This is important when you use bitsets or do heavy bit manipulations.

Some pragmas I personally use on Codeforces are:

#pragma GCC optimize("O3,unroll-loops")
#pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")


If on a platform with no avx2 support, I switch out the avx2 with either avx or one of the sse's. Some (ancient) judges might not have support for the other targets either, so it's usually a decision you need to make as mentioned in the blog.

## Acknowledgments

Thanks to ToxicPie9 for helping me write this blog and suggesting both stylistic changes as well as things to include! Please give him upvotes. By nor, 22 months ago, In this blog post, I will focus on the following problem: 920E - Connected Components?. More generally, we will look at how to do efficiently traverse the complement graph of a given graph in both dfs and bfs orders. In particular, the bfs order also allows us to find a shortest-path tree of the complement graph. This is my first blog, so please let me know of any errors/typos that might have crept in.

Note 1: I will present the algorithms in the order I came up with them, so it might not be the most clean presentation, but I hope that it might give some insight into how to come up with new ideas in graphs. Note that I don't claim to be the first person to come up with these ideas (especially the naive algorithm, which is identical to the solution in the editorial), but I haven't seen any of these implementation tricks being used in any implementation before. Resources pointing to similar ideas/implementations are welcome!

Note 2: All adjacency relations will be in terms of the original graph (which is given to us as an input, and is sparse).

(Not-so-important) note 3: We won't be using lists (or an implementation of lists using arrays, which was made popular by the usual implementation of the adjacency list which is used by a lot of Chinese competitive programmers), but in the final implementation, we will be using an application of DSU which can be used to implement deletions + traversal in a sorted list efficiently.

#### TL;DR

If you're looking for the implementations, here they are:

1. Naive DFS in $O((n + m) \log n)$: 125679159
2. BFS without sorting adjacency lists in $O(n + m)$: 125679188
3. BFS with sorting adjacency lists (using radix sort) in $O(n + m)$: 125679224
4. DFS with sorting adjacency lists (using radix sort) in (conjectured) $O(n + m)$ (guaranteed $O((n + m) \log n)$): 125679248

#### Solving the problem using DFS in $O((n + m) \log n)$ time

Note that in usual dfs, we do something of the following sort while running a DFS on vertex $u$:

mark u as visited
if v is not already visited:
dfs(v)


Note that the order of the loops doesn't really matter, so we can do something like the following too:

mark u as visited
for v in the set of unvisited vertices:
if v is adjacent to u:
dfs(v)


Now if we want to do a DFS on the complement graph, we can merely change the condition to if v is not adjacent to u and we will be done.

How will we go about implementing this algorithm efficiently? Suppose we maintain a set of unvisited vertices, and the adjacency lists are sets instead of vectors/lists. Then we can go through the set of unvisited vertices, and if an unvisited vertex is not in the adjacency set of the current vertex, we can recursively do a DFS on that vertex.

Why is this efficient? Note that in any iteration, we either skip a vertex or we don't. In the case we skip a vertex, the vertex has to be in the set of vertices that are adjacent to the current vertex, so we don't do more than $O(m)$ skips. In the case we don't skip, we remove a vertex from the unvisited set, so we don't do more than $O(n)$ iterations of this type either. So the total number of iterations is $O(m + n)$, and the cost of every iteration is upper bounded by $O(\log n)$, which gives us the bound.

Implementation: 125679159

#### Solving the problem using BFS in $O(n + m)$ time

Let's come to our first linear implementation. I switched over to BFS, since you can still find components using BFS, and it is easier to reason about an iterative algorithm than a recursive algorithm with global variables.

The first change is that we don't use a set, instead we use a vector to store unvisited vertices. Initially all the vertices are unvisited.

Let's see what happens when we process a vertex $u$ in this algorithm.

Firstly, we note down all the unvisited vertices that are adjacent to the vertex $u$. Let's call them blocked vertices (for $u$), and mark them in a global array (we will unmark them when we finish the iteration corresponding to $u$).

Then we iterate over all the unvisited vertices, and we can push an unvisited vertex into the queue if and only if it is not $u$ and it is not blocked. Then the only remaining unvisited vertices are the ones that are blocked, so we replace the set of unvisited vertices by the set of blocked vertices. This can be seen to be identical to a usual BFS, but with minor modifications.

Why is this efficient? Suppose the order in which vertices are popped from the queue is $u_1, u_2, \ldots, u_k$. Then when we finish processing $u_i$, the unvisited set is $V \cap N(u_1) \cap \cdots \cap N(u_i) \subseteq N(u_i)$, so the total size of the vectors we process at any point is at most $O(m)$. The number of iterations is also hence upper bounded by $O(n + m)$, and we are done. Note that this argument doesn't depend on the connectedness of the graph.

Implementation: 125679188

#### Solving the problem using BFS in $O(n + m)$ time, using yet another algorithm

Remember how we implemented DFS? We can do a similar thing for BFS, and we will modify the algorithm in the previous section a bit to somehow match what we did in the DFS case. We can try sorting the adjacency lists. However, that can be worst case $O(m \log n)$ or something similar. How do we do better than that? Instead of sorting adjacency lists separately, we can sort all edges using radix sort in $O(n + m)$, and then create a graph from those edges which will automatically have sorted adjacency lists.

Now our invariant will be that the set of unvisited vertices is sorted. So when we process a vertex, instead of marking the blocked vertices in advance, we can use two-pointers to check if an unvisited vertex is in adjacent to $u$ or not. The rest of the algorithm is pretty much identical to the remaining part of the previous algorithm.

Implementation: 125679224

#### Solving the problem using DFS in conjectured $O(n + m)$ time.

This is (in my opinion) the most interesting algorithm among all of the ones I have presented so far. However, I don't have a correct proof of the time complexity. We will use the idea from the previous implementation here for sorting the adjacency list, and all we need to do now is to make the following operations in the naive DFS implementation fast enough:

1. Erasing an element from a set
2. Iterating over a set
3. Finding an element in the adjacency list of the current vertex

Note that the two-pointers idea still works here while iterating over the global unvisited set (which is not implemented using a std::set since erasing would be too slow), if we are guaranteed that the unvisited set is sorted during iteration.

Consider the state of the unvisited set. At any point, it consists of some vertices $-1 = a_0 \le a_1 \le a_2 \le \cdots \le a_k \le n = a_{k+1}$ (the first and the last elements are just for the sake of convenience in the following definition). Define the operation $root$ as follows: for any $a_i < x \le a_{i+1}$, we have $root(x) = x$. Note that the computation of $root$ will be done lazily whenever we need to compute the function $root$. Then if we have such an operation, we can do the needed operations as follows:

1. Erasing an element from a set: set $root[x] = x + 1$ for now (when the actual value of $root(x)$ is needed, we will do it then and there, and update $root[x]$ and all its dependencies for speeding up further computations). This links together the two ranges: the one ending at $x$ and the one starting at $x + 1$.
2. Iterating over the set: We can do this using a single for loop as follows (by definition of the function root): for (int it = root(0); it < n; it = root(it + 1))
3. Finding an element in the adjacency list of the current vertex can be done using two pointers, since we are guaranteed by the definition of $root$ that we iterate over the unvisited vertices in increasing order.

Now how do we lazily compute $root$ when needed? We shall do so in a DSU-like manner with path compression as follows:

int get_root(int u) {
if (root[u] == u) return u;
return root[u] = get_root(root[u]);
}


Why is this efficient? DSU guarantees that the time taken is $O((m + n) \log n)$. However, running some stress tests suggests that this is in fact much better than that, and that the number of compressions done is in $O(n + m)$. I tried proving it but found a fatal error in a step, so if anyone can come up with a proof (or a counter-example) for this, please let me know!

Implementation: 125679248 