Rating changes for last rounds are temporarily rolled back. They will be returned soon. ×

nor's blog

By nor, 3 days ago,

This blog was initially meant to request updates to the C compiler on Codeforces (given the large number of blogs complaining about "mysterious" issues in their submissions in C). While writing it, I realized that the chances of the said updates would be higher if I mentioned a few reasons why people would like to code in C instead of C++ (some of the reasons are not completely serious, as should be evident from the context).

Before people come after me in the comments saying that you should use Rust for a few reasons I will be listing below, please keep in mind that I agree that Rust is a great language with reasonably good design choices (most of them better than C++, at least), and using C is a personal choice.

Commonly encountered issues in C on Codeforces

1. The C compiler on Codeforces is currently 32-bit, and hence it suffers from the same disadvantages as C++ faces compared to the 64-bit version. For instance, 64-bit arithmetic is slower than it could be (and leads to a ~2x slowdown in many cases).
2. There is an issue with slow IO using stdio, as I mentioned in this comment. Paraphrasing the issue here, this blog mentioned a bug that was making the earlier MinGW implementation of IO very slow, and it was patched to fix this issue. This is also why the newer C++ compilers on Codeforces have a decently fast implementation of scanf/printf compared to the older benchmarks in the blog. However, this is not the case for C. My hypothesis is that this issue wasn't patched for the C compiler. There is, however, a quick fix for this issue: add this line to the top of your submission, and it will revert to the faster implementation (at least as of now): #define __USE_MINGW_ANSI_STDIO 0. Thanks to ffao for telling me about this hack.
3. It seems that there is an issue with the ancient GCC version used for C11. For instance, when using the timespec struct to get high precision time, there is a compilation error (and it seems it is because of limited C11 support in GCC 5.1). Update: a similar issue arises (timespec_get not found) when I submit with C++20 (64) as well, so I think this is an issue with the C library rather than the compiler.

A possible fix to the above issues that works most of the time is to submit your code in C++20 (64 bit) instead of C11. You might face problems due to different behavior of specific keywords or ODR violations due to naming issues, but it should be obvious how to fix those errors manually.

Some update requests

Coming to the original point of the blog, I have a couple of update requests (if any other updates can make C easier to use, feel free to discuss them in the comments). Tagging MikeMirzayanov so these updates can be considered in the immediate future.

• Adding C17 (64 bit) alongside the old C11 option: This should be doable using msys2 as done for C++. I am somewhat confident that not using MinGW for this purpose would make the slow IO issue disappear. Also, the reason why I recommend C17 instead of C11 is that C17 is the latest version that addresses defects in C11. Note that C17 and C18 are two names for the same thing. However, I still believe that the current C11 option should be available in the future as well, for the sake of completeness/more options/other reasons mentioned in the comments below.
• Fixing slow IO: If the above doesn't fix the slow IO issue, perhaps there could be a way to fix it in MinGW itself. I am not too familiar with this, though.
• Fixing the timespec_get issue (and other C11/C17 support issues): I am not exactly sure on how to go about fixing this, but since this seems to be a C library issue, it could be a good idea to try installing a C library that has this functionality (and is C11 compliant). The issue probably boils down to Windows not having libraries that support everything in the C standard, since C11/C17 works perfectly on my Linux system.



• +128

By nor, 3 months ago,

Thank you for participating in this round! We hope you found the problems interesting and/or funny.

Special thanks to dorijanlendvaj for helping with Polygon related issues. Preparing good tests for certain problems was in fact a good problem in itself (in fact, sometimes they were hard enough for usual CF rounds), and we will mention these problems for the interested reader.

If the editorials are not visible, try registering for the contest or something similar.

Problem A: idea: ToxicPie9, preparation: nor

Solution
Code (Python) for solution 1
Code (Python) for solution 2
Code (Python) for solution 3
Preparation trivia

Problem B: idea: dorijanlendvaj, preparation: dorijanlendvaj

Solution
Code (Python)
Preparation trivia

Problem C: idea: nor, preparation: nor

Solution
Code (Python)

Problem D: idea: Sexpert, preparation: Sexpert

Solution
Code (Python)
Preparation trivia

Problem E: idea: errorgorn, preparation: nor

Solution
Code (Python)

Problem F: idea: nor, preparation: nor

Solution
Code (Python)

Problem G: idea: nor, preparation: nor

Solution
Code (Python) for solution 1
Code (Python) for solution 2
Preparation trivia

Problem H1: idea: nor, preparation: nor

Solution
Code (Python)
Preparation trivia

Problem H2: idea: ToxicPie9, preparation: nor

Solution
Code (Python) for solution 1
Code (Python) for solution 2
Code (Python) for solution 3

• +64

By nor, 3 months ago,

Hi Codeforces!

We are pleased to invite you to InterviewForces Contest #1 (Div. 7), which will take place on May/26/2022 17:35 (Moscow time). You will be given 9 problems and 2 hours to solve them. Please follow this link in order to register for the contest. Note that this contest is not an official Codeforces round.

The round will be unrated for all participants. It will be held according to modified ICPC rules.

All problems in this round were written and prepared by dorijanlendvaj, errorgorn, Sexpert, nor and ToxicPie9. We have worked on this round for quite a short time and hope you find every problem so easy that even your five year old sibling could solve it interesting.

This round is sponsored by errorgorn and Sexpert with the following prizes:

• 1st place: 10 lakh VEF
• 2nd place: 20 lakh VEF
• 3rd place: 30 lakh VEF
• 8th place: A steam copy of "Sakuya Izayoi Gives You Advice And Dabs"

We would like to thank:

• No one for marvellous coordination of the round.
• errorgorn for being a VIP tester, who liked the problems enough to sponsor the round prizes of 60 lakh VEF.
• Sexpert for sponsoring the 8th place prize.
• MikeMirzayanov for the amazing Codeforces and Polygon platforms!
• You for participating and upvoting.

Good luck and have fun! See you in the standings. Make sure to read the statements and conversion rate very carefully.

Reminder: The registration has just started!

Reminder 2: The contest starts in 15 minutes! Good luck, have fun!

UPD: Congratulations to the winners:

We would also like to congratulate the top $104$ participants for solving all the problems in-contest!

The editorial along with some more interesting stuff can be found here.

• +369

By nor, 8 months ago,

This blog will be more about developing ideas about certain structures rather than actually coding up stuff. The ideas we'll be developing here have many applications, some of which involve:

• Number Theoretic Möbius Inversion
• Principle of Inclusion-Exclusion
• Directed Acylic Graphs
• Subset Sum Convolution
• Finite Difference Calculus (and prefix sums)

Note that some of the terminology used below might be non-standard. Also, some parts of the blog make a reference to abstract algebra without much explanation; such parts are just for people who know some algebra, and can safely be skipped by those who don't know those concepts at all.

Prerequisites:

• Matrix multiplication
• Basic combinatorics

Definition of a poset and some intuition

All of these have one thing in common: they have an underlying "poset". Let's understand what a poset is.

Rigorously speaking, a poset is a partially ordered set, i.e., a set $S$ equipped with a relation $\le$ that is reflexive, antisymmetric, and transitive.

More precisely,

• $\le$ is reflexive means that $a \le a$ for all $a$ in $S$.
• $\le$ is antisymmetric means that if $a \le b$ and $b \le a$, then $a = b$
• $\le$ is transitive means that if $a \le b$ and $b \le c$, then $a \le c$

To get some intuition about posets, it's helpful to think of them in terms of directed acyclic graphs (DAGs), equipped with the relation of reachability; i.e., vertices $a$ is related to $b$ iff there is a path from $a$ to $b$. In fact, there is always at least one DAG corresponding to every poset that is equivalent to it in terms of semantics.

To make this notion a bit clearer, we construct a "DAG" (we allow self-loops, so it is not precisely a DAG, but we can get rid of these self-loops) as follows: construct a vertex for each element of the set, and there is an edge from $u$ to $v$ iff $u \le v$. For the lack of a better word, we shall call this the "relational DAG" of the poset. Note that this notation is not standard.

Remark: By the properties of a poset, this "DAG" has the property that if there is a path from $u$ to $v$, there is an edge between $u$ and $v$ (i.e., reachability and adjacency are equivalent in this graph). In other words, this graph is its own reflexive transitive closure. Also, a topological sort of this "DAG" (ignoring the self-loops) is called a linear extension of the poset (flattening out the "DAG" induces a natural linear order on the vertices, which are just the elements of the poset).

It is usually easier to look at this graph after removing all redundancies (i.e., if $u \le v$ and $v \le w$, then remove the edge $(u, w)$ from the graph if it exists). The resulting graph is called the Hasse diagram of the poset, and is super useful for reasoning about posets.

Note that it is NOT necessary that for any two elements $(u, v)$, one of $u \le v$ and $v \le u$ must hold. For instance, consider the poset given by the set $S = \{a_1, a_2, a_3, a_4\}$ and the relation $\le$ given by $\{(a, a) \mid a \in S\} \cup \{(a_1, a_2), (a_1, a_3), (a_1, a_4), (a_2, a_4), (a_3, a_4)\}$. This satisfies the constraints that $\le$ must be reflexive, antisymmetric and transitive; however, neither of $a_2 \le a_3$ and $a_3 \le a_2$ hold.

For the remainder of this blog, it will help to think of $\le$ in terms of the Hasse diagram of the poset.

Note: when we say $a < b$, we mean that $a \le b$ and $a \ne b$.

Some examples of posets

• Either of the sets $\mathbb{R}, \mathbb{N}, \mathbb{Z}, \mathbb{Q}$ equipped with the usual $\le$ is a poset.
• For any set $A$, the set of its subsets, equipped with the relation $\subseteq$ (or $\supseteq$) is a poset.
• The set $\mathbb{N}$ equipped with the relation $\mid$ (divisibility) is a poset.

The incidence algebra of a poset

Note for the pedantic reader: In what follows, we only look at locally finite posets (i.e., posets in which the set of $z$ such that $x \le z \le y$ is finite for all $x, y$). In particular, we can use the following ideas with infinite DAGs too, though it might be possible that infinitely large (but "locally sparse") matrices may not make sense.

In particular, the following doesn't apply to the posets $(\mathbb{R}, \le), (\mathbb{Q}, \le)$ and so on.

Let $(S, \le)$ be some poset. Consider the adjacency matrix $M$ of its relational DAG. It has the property that if $x \not\leq y$, then the entry at $(x, y)$ is $0$, and otherwise it is $1$. We can also interpret this matrix $M$ as a function from $S \times S \to \{0, 1\}$.

We generalize this notion a bit. Rather than having $M(x, y) = 1$ whenever $x \le y$, we assign arbitrary complex numbers to $M(x, y)$ (which is equivalent to assigning arbitrary edge weights (possibly $0$) to edges in the relational DAG).

This leads to a set of functions $\alpha : S \times S \to \mathbb{C}$, with the property that $x \not\leq y \implies \alpha(x, y) = 0$. We call this the "incidence algebra" of the poset (for those who know what an algebra is, we'll see why this is an algebra). We won't distinguish between these functions and their corresponding matrices in what follows.

The main idea behind the whole approach is to look at these functions as matrices and extract useful information using matrix operations.

To define the "product" on this set of functions, we can simply use matrix multiplication to define its semantics. That is, for functions $\alpha, \beta$ in the incidence algebra, we define the product as $(\alpha\beta)(x, y) = \sum_{z \in S} \alpha(x, z) \beta(z, y)$, which is some kind of convolution.

We can see that the incidence algebra is closed under product (intuitively, think of it as aggregating some combination of edge weights over paths of length 2; if there is no path from $x$ to $y$, the set of such paths is empty anyway, so the answer is still 0, the identity of the aggregation function which is $+$).

Similarly, we can define scalar multiplication and addition in terms of matrix operations, and see that the incidence algebra is closed under these operations too.

Important remark: Note that if $f$ is a function from $S$ to $\mathbb{C}$, then a vector with $f(x)$ at the position corresponding to $x$ also satisfies the usual matrix multiplication identities with the elements of the incidence algebra, with the semantics of multiplication with a column vector being analogous to aggregating over paths that end at a given vertex (and for row vector left-multiplication, paths that end at a given vertex).

Optional remark: For those who know what an algebra is, matrix multiplication is a bilinear operator so product is the bilinear operator in the incidence algebra. A further generalization would be to define a different underlying field than $\mathbb{C}$ and another bilinear operator, but we won't look into this further.

Some special members of the incidence algebra

Okay, now that we have defined this set and some operations on it, let's see some examples of members of this set, what information we can get out of these operations, and what kind of operations are "nice" in this context.

Firstly, the most obvious choice of a matrix is $O$, the zero matrix (which is clearly a member of the incidence algebra). This doesn't really give us any information though. It still has a role as the additive identity for addition in this algebra.

The second most obvious choice is $I$, the identity matrix. Note that this is a member of the incidence algebra due to reflexivity of $\le$. This is a multiplicative identity, and that's something we will need for Möbius inversion.

Thirdly, let's consider the adjacency matrix $M$ of the relational DAG. Let the corresponding function be $\zeta$. Clearly, this is also in the incidence algebra. We have the property that $\zeta(x, y) = 1$ if $x \le y$, and $0$ otherwise.

Spoiler alert: the inverse of this matrix is the Möbius function for the poset and corresponds to things like the number theoretic Möbius function in the divisibility poset and so on.

The Möbius function

Finally, we define the Möbius function $\mu$ as follows (It might not feel intuitive at first, but bear with me for a while):

For $x \not\leq y$, $\mu(x, y) = 0$, $\mu(x, x) = 1 \, \forall x \in S$, and inductively define

$\mu(x, y) = -\sum_{x \le z < y} \mu(x, z)$

Note that $\mu(x, y)$ is always an integer. Also, note that $\sum_{x \le z \le y} \mu(x, z) = 1$ if $x = y$ and $0$ otherwise (in the case $x < y$, it follows from the identity, in the case when $x, y$ are not comparable, the sum is empty, and in the case when $x = y$, it consists of only one term equal to $1$).

Hence we have for all $x, y \in S$,

$\sum_{z \in S} \mu(x, z) \zeta(z, y) = \sum_{x \le z \le y} \mu(x, z) = \begin{cases} 1 & \text{if } x = y\\ 0 & \text{otherwise} \end{cases} = I(x, y)$

However, the expression on the left is just $(\mu \zeta)(x, y)$. Since this holds for all $x, y$, we have $\mu\zeta = I$.

As we mentioned earlier, this means that $\mu$ is the inverse of $\zeta$, so we must also have $\zeta\mu = I$ (and expanding this gives $\sum_{x \le z \le y} \mu(z, y)$ is $1$ if $x = y$ and $0$ otherwise).

The Möbius inversion formula

The Möbius inversion formula in this setting just reduces to a simple identity corresponding to the multiplication of a matrix and a vector.

Formally, the Möbius inversion formula states that if $f, g$ are functions from $S$ to $\mathbb{C}$ such that $g(x) = \sum_{y \le x} f(y)$, then we have $f(x) = \sum_{y \le x} \mu(y, x) g(y)$. The converse also holds true.

For a proof, consider a vector $F$ with the element at position corresponding to $x$ being $f(x)$. Define $G$ similarly.

Now the condition merely means that $G = \zeta^\intercal F$. Left-multiplying this by $\mu^\intercal$, we get $\mu^\intercal G = \mu^\intercal \zeta^\intercal F = IF = F$. Expanding this gives the Möbius inversion formula. The converse can be proved by reversing all these steps, since $\mu^\intercal$ is invertible.

Note that this formula is just a trivial consequence of some matrix multiplication properties; so there's a lot more to the theory we have developed above than just this formula. However, we shall limit ourselves to only the applications of this formula to get to some interesting results.

Some applications

The generalized principle of inclusion-exclusion

Let's first talk about the contexts in which we use generalized inclusion-exclusion. We usually want to compute some function $f$ of a finite set $A$, but it is much easier to compute the sum of $f$ over all subsets of $A$.

That is, it is easy to compute $g(A) = \sum_{B \subseteq A} f(B)$, but hard to compute $f(A)$ itself.

The generalized principle of inclusion-exclusion states that $f(A) = \sum_{B \subseteq A} (-1)^{|A| - |B|} g(B)$.

For a proof, let's look at the poset corresponding to the power set of a finite set $F$ such that $A \subseteq F$, equipped with the $\subseteq$ relation. Clearly, using the Möbius inversion formula, we have $f(A) = \sum_{B \subseteq A} \mu(B, A) g(B)$.

We claim that $\mu(B, A) = (-1)^{|A| - |B|}$ for $B \subseteq A$. We'll do this by fixing $B$ and doing an induction on $|A|$.

For $|A| = |B|$, this is clear. Now suppose this is true for all $|A| < k$. Consider the case when $|A| = k$. We have (from the definition of the Möbius function) that

$\mu(B, A) = -\sum_{B \subseteq C \subsetneq A} (-1)^{|C| - |B|} = -\sum_{\emptyset \subseteq C \setminus B \subsetneq A \setminus B} (-1)^{|C \setminus B|} = (-1)^{|A| - |B|}$

where the last equality follows from the binomial theorem, and the proof is complete.

The usual principle of inclusion-exclusion as a special case

Now let's see how the usual inclusion-exclusion principle is just a special case of this generalized formula.

The inclusion-exclusion principle states that for sets $A_1, A_2, \ldots, A_n$, we have

$|A_1 \cup \cdots \cup A_n| = \sum_{1 \le i \le n} |A_i| - \sum_{1 \le i < j \le n} |A_i \cap A_j| + \cdots + (-1)^{n-1}|A_1 \cap \cdots \cap A_n|$

For a proof, consider the poset corresponding to the power set of $[n] := \{1, \ldots, n\}$, equipped with $\supseteq$.

Also define $U = A_1 \cup \ldots \cup A_n$.

For a set $S \subseteq [n]$, define

$f(S) =$ number of elements in all $A_i$, where $i \in S$, but not in any other $A_j$-s.

$g(S) =$ number of elements in all $A_i$, where $i \in S$, and possibly in other $A_j$-s.

Then we have $g(S) = |\cap_{i \in S} A_i|$. Also, we have $g(S) = \sum_{T \supseteq S} f(T)$. Using the Möbius inversion formula, we get

$0 = f(\emptyset) = \sum_{A \supseteq \emptyset} \mu(\emptyset, A) g(A) = \sum_{A \in 2^{[n]}} (-1)^{|A|} |\cap_{i \in A} A_i|$

which completes the proof.

Subset sum convolution

I won't be explaining the ideas behind subset sum convolution in too much detail, but would refer you to this blog that covers the necessary material.

The poset in consideration is the power set of $[n]$, and its elements are in bijection with $n$ bit integers where the $i^\mathrm{th}$ bit is $1$ iff $i$ is in the set.

The transforms $z$, $\mu$ are the $\zeta$ and $\mu$ functions here (note that a two-variable function in the incidence algebra is just a map that transforms a one-variable function to another one-variable function, which can be thought of as the multiplication of the matrix corresponding to the two-variable function with the vector corresponding to the one-variable function), and $\sigma$ corresponds to a diagonal matrix with the element on the diagonal being 1 if the element of the poset has an even number of elements, and -1 otherwise.

The rest of that blog can conveniently be studied in terms of the incidence algebra for this poset.

The number-theoretic Möbius function

For this, we will need to cover a bit more ground, since the divisibility poset has nicer properties compared to other posets (it is what is called a lattice).

What is a lattice

A lattice is a special kind of a poset, where for each pair of elements $(x, y)$, there exist elements $x \land y$ (called the meet of $x, y$) and $x \lor y$ (called the join of $x, y$) such that

• $z \le x$ and $z \le y$ $\implies$ $z \le x \land y$
• $x \land y \le x$ and $x \land y \le y$
• $x \le z$ and $y \le z$ $\implies$ $x \lor y \le z$
• $x \le x \lor y$ and $y \le x \lor y$

Intuitively, this basically means that each pair of elements in the Hasse diagram has a least common ancestor and highest common descendent. For instance, the power set poset and the divisibility poset are both lattices (exercise: what are the meet and the join functions for these lattices?).

Note that inductively, every finite subset of a lattice has a meet and a join. Also, due to antisymmetry of $\le$, the meet and the join are unique for each pair $x, y$. Hence for finite lattices $L$, there is a unique "maximum" (say $\top_L$) and a unique "minimum" (say $\bot_L$).

A theorem for lattices

Theorem: For a lattice $(L, \le)$, and $\bot_L < a \in L$, $\sum_{x \lor a = \top_L} \mu(\bot_L, x) = 0$.

Essentially, this theorem states that the sum of the Möbius function applied on all $\bot_L, x$ where the "LCA" of $a$ and $x$ is $\top_L$ is 0.

For a proof, we first note that because of the second Möbius identity (using the product of $\mu$ and $\zeta$), we have:

$\sum_{x \lor a = \top_L} \mu(\bot_L, x) = \sum_{x} \mu(\bot_L, x) \sum_{x \lor a \le y} \mu(y, \top_L)$

Using the definition of $\lor$ and rearranging the sums reduces the sum to

$\sum_{a \le y} \mu(y, \top_L) \sum_{x \le y} \mu(\bot_L, x)$

Using the first Möbius identity and noting that $y$ is not $\bot_L$, reduces this expression to $0$, as needed.

Now let's try to compute $\mu(a, b)$ for $a, b$ in the divisibility poset. Note that if $a$ doesn't divide $b$ or if $a > b$, this is just 0. So it suffices to consider the case where $b = ax$ for some natural number $x$.

Let's look at all positive integers $r$ such that $1 \mid r \mid x$. This forms a lattice $L$. Similarly, the set of positive integers $r'$ such that $a \mid r' \mid ax$ also forms a lattice, which is isomorphic to $L$. Also, since by the inductive definition, $\mu(a, ar)$ depends only on the elements $s$ such that $a \mid s \mid ar$, so we must have $\mu(a, ar) = \mu(1, r)$. So it suffices to compute $\mu(1, x)$ for all $x$.

We claim that for $x$ being square-free, $\mu(1, x)$ is $(-1)^k$ if $x$ is a product of $k$ distinct primes, and it is $0$ otherwise. To show this, we induct on $x$.

Note that $1 = \bot_L$ and $x = \top_L$. The claim is trivial for $x = 1$. Now suppose $p$ is a prime that divides $x$. Using the previously proved theorem, we get $\mu(1, x) = -\sum_{y \ne x, y \lor p = x} \mu(1, y)$.

If $p^2$ divides $b$, then the sum on the right is empty (as $\lor$ is just the LCM). Otherwise, the sum contains only one $y$, which equals $x / p$, so using the inductive hypothesis, we are done.

As a side-note, for "reasonable" functions that only depend on the ratio between their parameters, we can "compress" them into a function that takes a single parameter. The "product" in that case becomes Dirichlet convolution, which has tons of interesting properties. A relevant blog for that can be found here.

Also note that if we restrict ourselves to the divisibility posets of square-free numbers $n$, they are isomorphic to the power set poset of the set of prime divisors of $n$, so all power set posets are just special cases of divisibility posets!

Finite Difference Calculus

In this case, the poset is particularly simple, and it equals $(\mathbb{N} \cup \{0\}, \le)$. The function $\mu$ takes a value of $1$ at $(n, n)$, $-1$ at $(n, n + 1)$, and $0$ everywhere else. "Convolution" with $\mu$ is equivalent to the difference operator, and using Möbius inversion, the prefix sum operator is that with $\zeta$.

Final notes and remarks

Firstly, the content of this blog might not be super-useful in competitive programming, but the intent behind this was to introduce a very standard technique in combinatorics that people don't realize is super-flexible.

Secondly, it might not be easy to understand in a first reading, so it's encouraged to work out the math through the blog, and understand why and how these things are tied together. It will also help to make some Hasse diagrams of posets, and look at concrete examples of operations we described here.

Finally, there might be errors and typos in the blog, as well as better ways to approach the content of the blog. If you find any, please let me know in the comments!

• +364

By nor, 9 months ago,

As a certain legendary grandmaster once said (paraphrased), if you know obscure techniques and are not red yet, go and learn binary search.

The main aim of this tutorial is to collect and explain some ideas that are related to binary search, and collect some great tutorials/blogs/comments that explain related things. I haven't found a comprehensive tutorial for binary search that also explains some intuition about how to get your own implementation right with invariants, as well as some language features for binary search, so I felt like I should write something that covers most of what I know about binary search.

This is mainly geared towards beginners, but some of the topics that I mention in the other resources, optimizing binary search and language-specifics sections might not be well-known to people in the low div-1 range. I will also include some references to pretty high-quality articles/resources with related content, for the interested.

Contents

1. The main idea behind binary search
1. A small classical example
2. A more general idea
3. The generalization
2. Binary searching on the answer
3. Two other common ways of implementing binary search
4. Binary lifting for binary search
5. Language specific ways for binary searching
1. C++
2. Python
3. Java
6. Optimizing binary search
7. Other resources

The main idea behind binary search

The main idea behind binary search is linear reduction of search space. We'll elaborate on this below.

A small classical example

Let's start with the classical example of binary search. Suppose you have an array $[1, 7, 9, 12, 19]$, and you're asked to check if $7$ is a member of this array or not (and also return its 0-indexed position in this array). A naive approach would be to iterate over all elements, and check if any element is $7$ or not. This would take 5 iterations.

Now note that this array is sorted. So if I check some element at position $p$ in this array, we can have one of three possible outcomes:

• The element is equal to $7$: this means that we are done, and we can terminate our search!

• The element is less than $7$: since the array is sorted, all elements with position $< p$ are less than $7$ as well. So it is useless to look at any element with position $\le p$ now.

• The element is more than $7$: in a similar manner to the previous case, it is useless to look at any element with position $\ge p$ now.

So we could do something like this: initially our search space is the range of indices $[0, 4]$. Let's try looking at the middle element of our search space. If we do that, we end up in one of two scenarios:

• We terminate if the element at that position is $7$

• We reduce the search space to at most half of the original search space each time.

For a concrete example, let's look at the middle element of the search space, which is $9$. Since this is more than $7$, it is useless to look at indices in the range $[2, 4]$. So our search space reduces to $[0, 1]$. There are two midpoints of this array, let's just use the left one for the sake of uniformity. The element we are now looking at is $1$, which is less than $7$. So it is useless to look at indices in the range $[0, 0]$. So our search space becomes $[1, 1]$. The middle position is just $1$, and since the element there is $7$, we can return the position $1$ as the answer.

What if there was a $4$ instead of $7$ in the original array? In the last step of the above dry-run of the algorithm, we would note that it is useless to look at indices in the range $[0, 1]$, so the search space becomes empty! When the search space becomes empty, that means we haven't found it yet. So we should return some special value indicating that the search function resulted in a failure. We'll use the size of the array as the return value for now (we will see why we took this choice later on).

An implementation might look like the following:

int find_position(const vector<int>& a, int x) {
int l = 0;
int r = (int)a.size() - 1;   // [l, r] is our search space
while (l <= r) {             // search space is non-empty
int m = l + (r - l) / 2; // "middle" position in the range
if (a[m] == x) return m; // found!
else if (a[m] < x) {
l = m + 1;           // remove all indices <= m from the search space
} else {
r = m - 1;           // remove all indices >= m from the search space
}
}
return n;                    // failure
}


Is this more efficient than a linear search? Note that at each step, we reduce the search space by at least half, so in $1 + \log_2 n$ iterations (where $n$ is the size of $a$), the search space size reduces to $< 1$, and since the size is an integer, it becomes $0$. It's easy to see how it terminates.

A more general idea

More often than not, binary search is not used in such a simple use-case in competitive programming (or in real life).

Let's have a closer look at what we were doing in the above algorithm. We were using some kind of ordering (the ordering of integers, in our example) to discard a large part of the search space (in fact, we discarded about half the search space each time).

How do we generalize this further? More importantly, what will be the statement of the generalized problem?

Firstly, we will do something that will make our original problem easier to generalize. Rather than checking if an element is present in the array, we can find the position of the first element that is greater than or equal to the element (if no such element exists, we'll return the size of the array).

Clearly, this is even stronger than our original problem. If the answer to this problem is the index $i$, we can check if this is the right answer by checking if $i < n$ and $a[i] = x$. If this condition isn't true, we don't have any occurence of $x$ in $a$.

Let's try to do away with the ordering, and see how far we go.

Let's mentally (not in code) construct an array $b$, where the value of $b[i]$ is true if and only if $a[i] < x$.

How does $b$ look like? Clearly, some (possibly empty) prefix of it is filled with "true", and the remaining (possibly empty) suffix is filled with "false".

Then our problem reduces to finding the position of the first false ($n$ if not found) in this kind of an array $b$. For now, forget all about $a$ and $x$, and focus only on $b$.

It turns out, that for any such array $b$, we can easily find the first false using the same idea of discarding parts of the search space.

Let's start with some notation. $[l_0, r_0]$ will be the interval we need to search (for our problem it is $[0, n - 1]$). $l, r$ will be indices we shall maintain at each step such that the range of indices $[l_0, l]$ in $b$ consists only of "true" values, and the range of indices $[r, r_0]$ in $b$ consists only of "false" values.

Initially, we have zero information about any element of $b$, so it is better to force these ranges to be empty ranges. We can do so by setting $l = l_0 - 1$ and $r = r_0 + 1$ initially. We will try to increase $l$ and decrease $r$ so that these two ranges cover the whole search space. So, in the end, $l$ will correspond to the location of the last true, and $r$ will correspond to the location of the first false, and we can simply return $r$!

Let's suppose that at some point, $r - l > 1$. This means that there is at least one element between the indices $l$ and $r$ that we haven't put in one of these intervals yet. Let's take an index roughly in the middle of this range $(l, r)$, say $m = \lfloor(l + r) / 2\rfloor$.

• If $b[m]$ is true, then we know that everything to the left of $b[m]$ is true as well, so we can increase the range $[l_0, l]$ to $[l_0, m]$ (which is equivalent to replacing $l$ by $m$).

• Otherwise, it is false, so everything to the right of $b[m]$ is false as well. So we can increase the range $[r, r_0]$ to $[m, r_0]$.

Each time, we reduce the size of the unexplored range from $r - l - 1$ to $r - m - 1$ or $m - l - 1$, which corresponds to a reduction by at least half. So the number of iterations is at most $\log_2(r_0 - l_0 + 1) + 1$.

An implementation of this would look something like this:

int find_first_false(const vector<bool>& b) {
int l = -1;
int r = (int)b.size();
while (r - l > 1) {
int m = l + (r - l) / 2;
if (b[m]) {
l = m;
} else {
r = m;
}
}
return r;
}


Note that this terminates by our previous observation.

But, we said earlier that we won't really construct $b$ in our code, right? How do we use this same algorithm for solving our own problem?

Note that, by what we said earlier, $b[i]$ is just defined as the truth value of $a[i] < x$, so computing it on the fly is no big deal. So, if we replace $b[m]$ with $a[m] < x$, that would solve our problem.

That was quite a handful, so let's see a brief summary of what we did:

• Rather than explicitly reason using the order < and the value x, we constructed an array $b$ of some very specific form (the first few things in it being true and the rest being false), and found the location of the first false in b.

This very clearly points to our desired generalization:

The generalization

Suppose we are given the following:

• $[l, r]$: a range of integers (in our example, $[0, n - 1]$)

• $f$: a function from integers to booleans, which satisfies the following property: there exists some integer $t$ such that for all $l \le x < t$, $f(x)$ is true, and for all $t \le x \le r$, $f(x)$ is false.

Then if the time taken to compute $f$ is upper bounded by $T(l, r)$, we can find the value of $t$ (i.e., the first false index) in $O(T(l, r) \log_2 (r - l + 1)) + O(1)$ time.

Such an $f$ is called a predicate. In the problem we discussed above, $f$ is called a predicate.

An implementation of this function will be something like the following (ignoring overflow errors):

template <class Integer, class F>
Integer find_first_false(Integer l, Integer r, F&& f) {
--l;
++r;
while (r - l > 1) {
Integer m = l + (r - l) / 2; // prefer std::midpoint in C++20
if (f(m)) {
l = m;
} else {
r = m;
}
}
return r;
}


Note that this implementation also has the nice property that $l$ is the position of the last true in the corresponding array $b$, so you can define a function similar to this one that returns $l$ instead.

To use this function to implement our original binary search, we can do something like the following:

int find_position(const vector<int>& a, int x) {
auto f = [&](int i) {
return a[i] < x;
};
int n = (int)a.size();
int pos = find_first_false(0, n - 1, f);
if (pos == n || a[pos] != x) return n;
return pos;
}


Note that this abstraction also gives us the following result: we don't really need $a$ to be sorted at all. The only thing we need is that everything less than $x$ in $a$ should be in a prefix, and everything not less than $x$ should be in the remaining suffix and if $x$ is in the array, the beginning of that suffix should have $x$ at it. This definition also handles possible duplicates of $x$ in the array.

As we'll see later on, this kind of an abstraction allows us to solve a whole variety of problems.

Exercise: What would you do if in $b$, the first few elements were false, and the rest were true?

Sometimes, it is much easier to deal with bounds on the answer rather than the exact answer itself.

In other words, sometimes it is much easier to construct a function $f$ that returns true iff the input is $\ge$ the answer to the problem, by running some algorithm that returns a boolean value not by computing the answer, but by considering some properties of the problem at hand and the input.

To make this more concrete, let's consider this problem.

In short, you're given an $n \times m$ multiplication table, and you want to find the $k^\mathrm{th}$ smallest entry in this table (i.e., if you were to sort all the entries, find the entry at position $k$).

It is not clear how to directly find this value. But given a "guess" $x$, we can see if this is at least the answer or not:

Let's find the number of integers smaller than $x$ in each row, and sum them up. This can be done by a simple division for each row.

If the total numbers less than $x$ are $< k$, we will return true, otherwise, we will return false. This predicate works since the number of entries smaller than $x$ is a non-decreasing function of $x$ (more commonly, we compare at $f(x), f(x + 1)$ and try to argue that if $f(x)$ is false, then $f(x + 1)$ is also false), and hence the last element that makes the function return true is in fact the $k^\mathrm{th}$ smallest entry.

This can be found using binary search as we have discussed above.

Two other common ways of implementing binary search

We'll discuss two other ways of implementing usual binary search, and why they seem intuitive to people who use them.

We'll refer to the previous implementation as the $(l, r)$ way, since the invariant in that was equivalent to saying that the remaining search space will always be (l, r).

This section is just to help out people in reading others' binary search implementations, as well as choose between implementations to find the way you find best. I prefer using the implementation used above, so I'll explain the other two implementations in the context of that, with some intuition as to why people choose to implement things that way.

The $[l, r]$ way

In this type of an implementation, we maintain $l + 1$ and $r - 1$ rather than $l$ and $r$. The reason behind this is something like the following (independent of the above reasoning):

$[l, r]$ consists of the currently explored search space. Let's maintain an extra integer ($ans$), which stores the "best" answer found so far, i.e., the leftmost false found so far.

This interval is non-empty if $r \ge l$. The midpoint is the same as before, i.e., $m = \lfloor(l + r) / 2\rfloor$. If $f(m)$ evaluates to false, then the best possible answer so found has to be $m$, and we reduce the search space from $[l, r]$ to $[l, m - 1]$. Otherwise, we reduce it to $[m + 1, r]$, and do not update the answer.

Note that here we would also need to maintain a separate variable $ans$, and initialize it appropriately, depending on whether you want the first false (as done above), or the last true (which can be done by moving the update of the $ans$ variable across the if-branches), which is also why I haven't used it for a very long time. However, people who prefer closed intervals and this explanation seem to gravitate towards this implementation.

The invariant here remains that if $l', r'$ are the $l, r$ in the old implementation, and $l, r$ are the $l, r$ in this implementation, then $l' = l - 1$ and $ans = r' = r + 1$. In the case where you need to find the last true, the invariant becomes $ans = l' = l - 1$ and $r' = r + 1$.

An implementation is as follows:

template <class Integer, class F>
Integer find_first_false(Integer l, Integer r, F&& f) {
Integer ans = n;
while (l <= r) {
Integer m = l + (r - l) / 2; // prefer std::midpoint in C++20
if (f(m)) {
l = m + 1;
} else {
ans = m;
r = m - 1;
}
}
return ans;
}


The $[l, r)$ way

In this type of an implementation, we maintain $l + 1$ and $r$ instead of $l$ and $r$. The interpretation is that the search space is in $[l, r)$, and people who like using half-open intervals tend to prefer this implementation. The rationale behind this is that when the search space becomes empty, it corresponds to the range $[ans, ans)$ (if you're trying to find the first false, of course). The structure of the intervals doesn't always correspond in this implementation with the other implementations, since the value of $m$ used is usually slightly different (and usually equal to $\lfloor(l + r) / 2\rfloor$ where $[l, r)$ is the search space at that point, and in the case that there are two range midpoints, this is the one to the right and not to the left).

It's much more illuminating to think of it in this way: Suppose you have a range $[l, r)$, and you need to insert a false just after the last true in the range. What will be the position of the new false?

Another way of looking at it is: everything in the range $[l, ans)$ corresponds to true, and everything in the range $[ans, r)$ corresponds to true. So $ans$ is some sort of a "partition point" for the array. We will see later on how this exact interpretation is used in an implementation of this function in C++'s STL.

Let's check the midpoint of the range $[l, r)$. If $f(m)$ is true, we will never put it at a location $\le m$, so $l$ needs to be updated to $m + 1$. Otherwise, we have to put it at a position $\le m$, so the $r$ needs to be updated to $m$.

template <class Integer, class F>
Integer find_first_false(Integer l, Integer r, F&& f) {
++r;
while (l < r) {
Integer m = l + (r - l) / 2; // prefer std::midpoint in C++20
if (f(m)) {
l = m + 1;
} else {
r = m;
}
}
return r;
}


Binary lifting for binary search

This and the next sections will be much more terse than the previous section, since we will be doing more of an overview of methods instead of explaining something from scratch.

Consider the following implementation:

template <class Integer, class F>
Integer find_first_false(Integer l, Integer r, F&& f) {
if (l > r) return r + 1;
++r;
Integer w = Integer(1) << __lg(r - l);
--l;
if (f(l + w)) l = r - w;
for (w >>= 1; w >= Integer(1); w >>= 1)
if (f(l + w)) l += w;
return l + 1;
}


Here, we try to ensure that the interval sizes are always powers of 2. Why we do so will be explained in the section on optimizing binary search.

After we ensure that the interval sizes are always powers of 2, we can just try to reconstruct the binary representation of $ans - 1 - l$, where $ans$ is what we need to return. For that, we can try to go from the highest bit to the lowest bit, and greedy works here for the same reason that binary representations are unique; adding a higher power leads to the predicate being false.

Language specific ways for binary searching

Some of the most used languages in competitive programming fortunately have some inbuilt functions (albeit limited) to do binary search for you.

C++

• binary_search: This function returns a bool that tells you if there is an element equal to the queried element between two iterators (with an optional comparator).

• lower_bound, upper_bound: If the range occupied by elements comparing equal to $x$ is defined by iterators $[it_l, it_r)$ in a given range of iterators $[input_it_l, input_it_r)$, then lower_bound and upper_bound return $it_l$ and $it_r$. In other words, lower_bound(it1, it2, x, comp) returns the iterator to the first element in the range of iterators $[it1, it2)$ which is $\ge x$ according to $comp$ (which is optional and defaults to the usual comparison), and upper_bound does the same for $> x$ instead.

• equal_range: This returns both lower_bound and upper_bound.

• partition_point: This works exactly like the binary search function we wrote in the $[l, r)$ way. In C++20, with ranges::views::iota, we can use it to do binary search on the answer as well.

Python

• The bisect module has useful functions like bisect, bisect_left, bisect_right that do similar things to lower_bound and upper_bound.

Java

• Collections.binarySearch and Arrays.binarySearch do something similar to finding an index/element (not necessarily the first equal or the last equal) using binary search.

Optimizing binary search

In our first implementation, we were returning early when we found the element. Sometimes, it's faster to stop the binary search early, and in those cases, this is sometimes a constant factor optimization.

The binary lifting implemented for binary search above is also a constant factor optimization on some architectures if unrolled manually (which is possible due to the simple nature of the increments, and the possibility of hardcoding the constants), according to John Bentley's Programming Pearls. It is an improvement in some cases, where the locations you binary search on are few in number and repeated in successive binary search applications, leading to better cache usage. An example of doing that optimization to the extreme would be to implement multiple functions (each with a different power of 2), and store the function pointers in an array, and use computed jumps (for instance, by computing the power of 2 needed) to get to the correct function at runtime, which still leads to some speedup on certain architectures.

For instance, for certain types of queries (where either $l$ or $r$ are more or less the same), the speedup is pretty significant. However, for other types of queries, the simpler version is more efficient. To show the kind of speedups you can get for certain types of queries, I did some benchmarks on queries where the left bound is always the same. For the benchmarking code below, the results are as follows:

Benchmarking code

Results:

-----------------------------
Simple binary search
744175
Time: 2184 ms
-----------------------------
Binary lifting
744175
Time: 1504 ms
-----------------------------
Binary lifting with unrolling
744175
Time: 1407 ms
-----------------------------


https://algorithmica.org/en/eytzinger explains quite a nice method of exploiting the cache and speeding up binary search in a certain setup by a pretty impressive factor.

https://codeforces.com/blog/entry/76182 explains a variation of binary search which divides the ranges in an uneven order, which can lead to a change at the complexity level as well.

Other resources

As promised, here's a collection of resources that I felt are pretty high-quality, that pertain to topics related to binary search.

A great resource is the Codeforces EDU tutorial (and problems) on binary search.

A great video that also explains binary search is linked in the following blog: https://codeforces.com/blog/entry/67509.

There is also an extension to vanilla binary search, called the parallel binary search, and https://codeforces.com/blog/entry/89694 links to a great video tutorial for this extension.

Binary lifting is a pretty general idea. A tutorial on binary lifting on fenwick trees can be found at https://codeforces.com/blog/entry/61364, a great one on binary lifting on trees can be found at https://usaco.guide/plat/binary-jump?lang=cpp (which also has links to other resources).

Binary search on segment trees is a quite useful technique too, and a working implementation (with some intuition on how it works) for recursive segment tree can be found at https://codeforces.com/blog/entry/83883. An implementation for the iterative segment tree can be found at https://github.com/atcoder/ac-library/blob/master/atcoder/lazysegtree.hpp.

A rough idea of how it works (based off a conversation where I was explaining this code to someone) is as follows. Some parts of it might not make sense, so please let me know if there's something unclear.

Spoiler

Please let me know if you find any bugs or typos in this blog!

UPD: Added benchmarking code for binary lifting (with and without unrolling) and binary search in a few special cases to show speedups.

• +246

By nor, 10 months ago,

Introduction

A while ago ToxicPie9 and I made and posted this meme:

To my surprise, many people are quite confused about it. In fact, I realized there are a lot of misconceptions about pragmas. Most C++ users on Codeforces put a few lines of pragmas at the start of every submission. However, I believe many of them don't fully understand what they're doing or how pragmas work; the only thing people seem to think is that "add pragmas -> code go brr".

Pragmas are a form of the dark arts that are feared and used, both correctly and incorrectly, by lots of competitive programmers. They are widely believed to make your code much faster, but sometimes may lead to slowdowns and even runtime errors on certain platforms. In this blog, I will explain the effects of #pragma GCC optimize and #pragma GCC target, how they work, and how you should and shouldn't use them.

Feel free to skip to the end of the blog for a TL;DR.

Warning

All of this discussion is for GCC, and the code you will write might not be portable across different compilers or platforms (in terms of turning on the same optimizations). However, you may assume that most of them work on Codeforces and other x86 platforms that are not too old.

Some Non-examples

The following does nothing.

#pragma GCC optimize(" unroll-loops")
#pragma gcc optimize("Ofast")
#pragma GCC optimization("Ofast")
#pragma optimize(Ofast)


Yes, these are real-life examples we have seen being used by many competitive programmers, including quite a few LGMs. Perhaps the third one among these came from this blog which seems to have popularized the idea of using pragmas, but with a small mistake. Many people are misled to believe that some of the above work, when they actually don't -- stop using them.

If ever in doubt about whether your pragmas are correct, turn on most compiler warnings with the command-line option -Wall (or the more specific -Wunknown-pragmas). For example, if you compile the code above with -Wall, you'll get the following output, which tells you that the pragmas are invalid and useless:

foo.cpp:2: warning: ignoring ‘#pragma gcc optimize’ [-Wunknown-pragmas]
2 | #pragma gcc optimize("Ofast")
|
foo.cpp:3: warning: ignoring ‘#pragma GCC optimization’ [-Wunknown-pragmas]
3 | #pragma GCC optimization("Ofast")
|
foo.cpp:4: warning: ignoring ‘#pragma optimize ’ [-Wunknown-pragmas]
4 | #pragma optimize(Ofast)
|
foo.cpp:1:37: warning: bad option ‘-f unroll-loops’ to pragma ‘optimize’ [-Wpragmas]
1 | #pragma GCC optimize(" unroll-loops")
|                                     ^


Try to check if your submissions have similar problems. If yes, then you have probably been using pragmas wrong the entire time -- it's time to change your default code.

What Is "#pragma GCC optimize"?

It turns on certain optimization flags for GCC. The syntax is

#pragma GCC optimize (option, ...)


From the official source on GCC pragmas, this pragma allows you to set global optimization flags (specified by option) for functions that come after it. Let's have a look at a few of them:

• O0, O1: These are pretty useless for competitive programming purposes, so we won't discuss these here.
• O2: This is the default optimization option on Codeforces, so using this might not give any tangible benefit.
• O3: This is the first non-trivial optimization option. It can make your code slower sometimes (due to the large size of generated code), but it is not very frequent in competitive programming. Some of the things it does are:
• Auto-vectorize the code if the mentioned architectures allow it. This can make your code much faster by using SIMD (single instruction, multiple data) which kinda parallelizes your code on an instruction level. More info below.
• Function inlining — inlines functions aggressively if possible (and no, marking functions as inline doesn't inline functions, nor does it give hints to the compiler)
• Unrolls loops more aggressively than O2 (this might lead to instruction cache misses if generated code size is too large)
• Ofast: This is one of the more controversial flags. It turns on all optimizations that O3 offers, along with some other optimizations, some of which might not be standards compliant. For instance, it turns on the fast-math optimization, which assumes floating-point arithmetic is associative (among other things), and under this assumption, it is not unexpected to see your floating-point error analysis go to waste. Ofast may or may not make your code faster; only use this if you're sure it does the right things.

You can also use some other options, like:

• unroll-loops -- Enables aggressive loop unrolling, which reduces the number of branches and optimizes parallel computation, but might increase code size too much and lead to instruction cache misses.
• unroll-all-loops -- Usually makes the program slower.
• strict-overflow -- Enables some optimizations that take advantage of the fact that signed integer overflow is undefined behavior.
• trapv -- This one is quite special, as it is not usually considered an "optimization": enabling this will make your code run much slower, but causes signed integer overflows to generate runtime errors. Useful for debugging. (Editor's note: if your system supports, it's recommended to use a sanitizer instead. You can find tutorials on the internet, so we won't discuss it here.)

A full list of supported flags in GCC can be found here.

An example: let's say you want to use O3 and unroll-loops. Then you could do something like

#pragma GCC optimize("O3")
#pragma GCC optimize("unroll-loops")


or

#pragma GCC optimize("O3,unroll-loops")


or

#pragma GCC optimize("O3","unroll-loops")


Note that the options are strings inside quotation marks, and there's no space after or before the comma in the second way of writing. However, you can have a space before or after the comma in the last way of writing without any issues.

What Is "#pragma GCC target"?

Many modern computers and almost all competitive programming judges run on the x86 architecture. It has received a lot of additions over the years -- particularly, extra features (instructions) that enable different calculations or make existing ones much faster.

Compilers allow you to take advantage of these instruction sets extensions. For example, CPUs that support the popcnt instruction can theoretically compile __builtin_popcount into one instruction, which is much faster than usual implementations of this function. Similarly, if you want to auto-vectorize code, you'd need some instruction sets that actually support the vector instructions. Some of these instruction sets are sse4.2, avx, avx2. Here's a list of instruction sets that are usually supported on Codeforces:

• avx and avx2: These are instruction sets that provide 8, 16 and 32 byte vector instructions (i.e., you can do some kinds of operations on pairs of 8 aligned integers at the same time). Prefer using avx2 since it's newer.
• sse, sse2, sse3, sse4, sse4.1, sse4.2: These are instruction sets that are also for vectorization, but they're older and not as good as avx and avx2. These are useful for competitions on websites such as Yandex, where avx2 is not supported and gives a runtime error due to unrecognized instruction (it corresponds to a SIGILL signal — ill-formed instruction).
• popcnt, lzcnt — These optimize the popcount (__builtin_popcount family) and count leading zeros (__builtin_clz family) operations respectively.
• abm, bmi, bmi2: These are bit manipulation instruction sets (note that bmi is not a subset of bmi2). They provide even more bitwise operations like ctz, blsi, and pdep.
• fma: This is not so widely used, since avx and sse make up for most of it already.
• mmx: This is even older than the sse* family of instruction sets, hence is generally useless.

You should be quite familiar with them if you have written SIMD code before. If you're interested in a certain instruction set, there are plenty of resources online.

An example: if you want to use, say, avx2, bmi, bmi2, popcnt and lzcnt, you can write

#pragma GCC target("avx2,bmi,bmi2,popcnt,lzcnt")


Again, note that we don't have spaces in the string.

There are two main ways to use #pragma GCC target.

• You can use it with the optimization pragmas. It allows the compiler to automatically generate efficient SIMD instructions from parts of your code (based on the optimization flags), often boosting their performance by roughly 2, 4 or even 8 times.
• It enables you to use the intrinsics of supported instruction sets. For example,
#include <immintrin.h>
// returns the odd bits of x: 0b01010101 -> 0b1111
uint32_t odd_bits(uint32_t x) {
return _pext_u32(x, 0x55555555u);
}


This code works very efficiently, but won't compile unless you specify a valid target. Normally this is done by adding the -mbmi2 compilation flag, but this is impossible on Codeforces, so you can only enable it by other ways like #pragma GCC target("bmi2").

• An extreme example of this is this super-fast convolution code that makes heavy use of SIMD intrinsics to squeeze the running time of an $\mathcal{O}(N \log N)$ algorithm with a traditionally bad constant factor with $N \approx 5 \times 10^5$ to only 49ms.

Note that using vectorization targets won't work unless you have O3 or Ofast on (more specifically, tree-vectorize, with at leastO2 -- note that O2 is turned on by default on Codeforces).

Function Attributes

Just in case you don't want to apply these optimizations globally, you can use function attributes. For instance,

__attribute__((target("avx2"), optimize("O3", "unroll-loops"))) void work() {
// do something
}


This enables these attributes for this function only, and the rest of the code will be compiled with the global options.

Bonus

#pragma GCC ivdep: This pragma forces the compiler to vectorize a loop if the loop was not vectorized due to possible aliasing. This means that the compiler wasn't able to prove that vectorizing is safe due to data dependency, but you can, due to knowledge of the problem at hand, show that this case will never happen. This results in undefined behaviour if you're not careful enough about there not being a dependency across loop iterations. Think of this having a similar idea behind it as pointer restrict in C99.

#pragma GCC push_options, #pragma GCC pop_options: These maintain a stack of target and optimization pragmas, enabling you to temporarily switch to another set of options. #pragma GCC reset_options resets all the options.

#pragma GCC optimize "Ofast" and #pragma GCC optimize "-Ofast" also surprisingly work. The same holds for stuff like #pragma GCC optimize "-funroll-loops" and #pragma GCC optimize "unroll-loops". However, #pragma GCC target "avx2" works but #pragma GCC target "-mavx2" doesn't.

Some Caveats

As we have pointed out already, there might be some caveats associated with using the aforementioned pragmas.

• Some of these optimize/target pragmas can potentially make your code slower too, due to code size and other associated stuff.
• You might get runtime errors if you're careless about your choice of instruction sets. In general, on any platform, you should try to use each instruction set (and also write code that benefits from that instruction set!) and see if you're getting weird behaviour or not. There are also some ways to check if the machine your code is going to be run on has certain instruction sets at compile time, but inconsistencies with the actual set of instruction sets are possible in subtle ways, so you shouldn't in general depend on them. A good way, however, is to run custom tests with stuff like assert(__builtin_cpu_supports("avx2")).
• In general, constant-factor optimization at such a low level is somewhat unpredictable, and you should rely on measuring the performance rather than guessing from the instruction counts/latencies. This also means taking the contents of this blog with a grain of salt and not blindly believing that using these pragmas will somehow help you cheese every $\mathcal{O}(n^2)$ solution into the time limit for $n = 10^5$. A good way to look at generated assembly is by using godbolt.org.

Some Examples

There are multiple incidents in codeforces folklore that point to how fast pragmas can make your code. Some instances are:

Some more helpful discussion can be seen in the thread for this comment: https://codeforces.com/blog/entry/94609?#comment-836718

Conclusion/TL;DR

If you skipped the entire blog to read this part, it's recommended that you also read the Some Non-examples section and check if you're using an incorrect pragma.

No pragma is absolutely safe to use. Don't blindly believe that one line of pragma can suddenly, magically boost the speed of all your submissions. Often, you will need some testing or researching to decide how to improve your constant factor.

When restricted to competitive programming on Codeforces or a fixed platform, however, some results might be more predictable. Specifically,

• The default O2 already has a lot of optimizations compared to O0. Increasing it to O3 or even Ofast might increase or decrease the performance, depending on your code and input. Most online judges provide a "custom test" function which helps you figure out which one is more appropriate. You can also test other flags if you know what they do.
• Usually, the avx and avx2 targets are supported by newer online judges. When they are supported, these can usually make your code faster. They often work best when your code is easily vectorizable (e.g., tight loops) -- which requires some experience to write. Again, custom tests are always useful (also, make sure they don't cause errors due to architecture issues). You can try the sse family if they don't work.
• Other targets like popcnt and bmi don't usually matter a lot when the compiler is optimizing, however operations like __lg, popcount, etc. require them to be fast. This is important when you use bitsets or do heavy bit manipulations.

Some pragmas I personally use on Codeforces are:

#pragma GCC optimize("O3,unroll-loops")
#pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")


If on a platform with no avx2 support, I switch out the avx2 with either avx or one of the sse's. Some (ancient) judges might not have support for the other targets either, so it's usually a decision you need to make as mentioned in the blog.

Acknowledgments

Thanks to ToxicPie9 for helping me write this blog and suggesting both stylistic changes as well as things to include! Please give him upvotes.

• +854

By nor, 12 months ago,

In this blog post, I will focus on the following problem: 920E - Связные компоненты?. More generally, we will look at how to do efficiently traverse the complement graph of a given graph in both dfs and bfs orders. In particular, the bfs order also allows us to find a shortest-path tree of the complement graph. This is my first blog, so please let me know of any errors/typos that might have crept in.

Note 1: I will present the algorithms in the order I came up with them, so it might not be the most clean presentation, but I hope that it might give some insight into how to come up with new ideas in graphs. Note that I don't claim to be the first person to come up with these ideas (especially the naive algorithm, which is identical to the solution in the editorial), but I haven't seen any of these implementation tricks being used in any implementation before. Resources pointing to similar ideas/implementations are welcome!

Note 2: All adjacency relations will be in terms of the original graph (which is given to us as an input, and is sparse).

(Not-so-important) note 3: We won't be using lists (or an implementation of lists using arrays, which was made popular by the usual implementation of the adjacency list which is used by a lot of Chinese competitive programmers), but in the final implementation, we will be using an application of DSU which can be used to implement deletions + traversal in a sorted list efficiently.

TL;DR

If you're looking for the implementations, here they are:

1. Naive DFS in $O((n + m) \log n)$: 125679159
2. BFS without sorting adjacency lists in $O(n + m)$: 125679188
3. BFS with sorting adjacency lists (using radix sort) in $O(n + m)$: 125679224
4. DFS with sorting adjacency lists (using radix sort) in (conjectured) $O(n + m)$ (guaranteed $O((n + m) \log n)$): 125679248

Solving the problem using DFS in $O((n + m) \log n)$ time

Note that in usual dfs, we do something of the following sort while running a DFS on vertex $u$:

mark u as visited
if v is not already visited:
dfs(v)


Note that the order of the loops doesn't really matter, so we can do something like the following too:

mark u as visited
for v in the set of unvisited vertices:
if v is adjacent to u:
dfs(v)


Now if we want to do a DFS on the complement graph, we can merely change the condition to if v is not adjacent to u and we will be done.

How will we go about implementing this algorithm efficiently? Suppose we maintain a set of unvisited vertices, and the adjacency lists are sets instead of vectors/lists. Then we can go through the set of unvisited vertices, and if an unvisited vertex is not in the adjacency set of the current vertex, we can recursively do a DFS on that vertex.

Why is this efficient? Note that in any iteration, we either skip a vertex or we don't. In the case we skip a vertex, the vertex has to be in the set of vertices that are adjacent to the current vertex, so we don't do more than $O(m)$ skips. In the case we don't skip, we remove a vertex from the unvisited set, so we don't do more than $O(n)$ iterations of this type either. So the total number of iterations is $O(m + n)$, and the cost of every iteration is upper bounded by $O(\log n)$, which gives us the bound.

Implementation: 125679159

Solving the problem using BFS in $O(n + m)$ time

Let's come to our first linear implementation. I switched over to BFS, since you can still find components using BFS, and it is easier to reason about an iterative algorithm than a recursive algorithm with global variables.

The first change is that we don't use a set, instead we use a vector to store unvisited vertices. Initially all the vertices are unvisited.

Let's see what happens when we process a vertex $u$ in this algorithm.

Firstly, we note down all the unvisited vertices that are adjacent to the vertex $u$. Let's call them blocked vertices (for $u$), and mark them in a global array (we will unmark them when we finish the iteration corresponding to $u$).

Then we iterate over all the unvisited vertices, and we can push an unvisited vertex into the queue if and only if it is not $u$ and it is not blocked. Then the only remaining unvisited vertices are the ones that are blocked, so we replace the set of unvisited vertices by the set of blocked vertices. This can be seen to be identical to a usual BFS, but with minor modifications.

Why is this efficient? Suppose the order in which vertices are popped from the queue is $u_1, u_2, \ldots, u_k$. Then when we finish processing $u_i$, the unvisited set is $V \cap N(u_1) \cap \cdots \cap N(u_i) \subseteq N(u_i)$, so the total size of the vectors we process at any point is at most $O(m)$. The number of iterations is also hence upper bounded by $O(n + m)$, and we are done. Note that this argument doesn't depend on the connectedness of the graph.

Implementation: 125679188

Solving the problem using BFS in $O(n + m)$ time, using yet another algorithm

Remember how we implemented DFS? We can do a similar thing for BFS, and we will modify the algorithm in the previous section a bit to somehow match what we did in the DFS case. We can try sorting the adjacency lists. However, that can be worst case $O(m \log n)$ or something similar. How do we do better than that? Instead of sorting adjacency lists separately, we can sort all edges using radix sort in $O(n + m)$, and then create a graph from those edges which will automatically have sorted adjacency lists.

Now our invariant will be that the set of unvisited vertices is sorted. So when we process a vertex, instead of marking the blocked vertices in advance, we can use two-pointers to check if an unvisited vertex is in adjacent to $u$ or not. The rest of the algorithm is pretty much identical to the remaining part of the previous algorithm.

Implementation: 125679224

Solving the problem using DFS in conjectured $O(n + m)$ time.

This is (in my opinion) the most interesting algorithm among all of the ones I have presented so far. However, I don't have a correct proof of the time complexity. We will use the idea from the previous implementation here for sorting the adjacency list, and all we need to do now is to make the following operations in the naive DFS implementation fast enough:

1. Erasing an element from a set
2. Iterating over a set
3. Finding an element in the adjacency list of the current vertex

Note that the two-pointers idea still works here while iterating over the global unvisited set (which is not implemented using a std::set since erasing would be too slow), if we are guaranteed that the unvisited set is sorted during iteration.

Consider the state of the unvisited set. At any point, it consists of some vertices $-1 = a_0 \le a_1 \le a_2 \le \cdots \le a_k \le n = a_{k+1}$ (the first and the last elements are just for the sake of convenience in the following definition). Define the operation $root$ as follows: for any $a_i < x \le a_{i+1}$, we have $root(x) = x$. Note that the computation of $root$ will be done lazily whenever we need to compute the function $root$. Then if we have such an operation, we can do the needed operations as follows:

1. Erasing an element from a set: set $root[x] = x + 1$ for now (when the actual value of $root(x)$ is needed, we will do it then and there, and update $root[x]$ and all its dependencies for speeding up further computations). This links together the two ranges: the one ending at $x$ and the one starting at $x + 1$.
2. Iterating over the set: We can do this using a single for loop as follows (by definition of the function root): for (int it = root(0); it < n; it = root(it + 1))
3. Finding an element in the adjacency list of the current vertex can be done using two pointers, since we are guaranteed by the definition of $root$ that we iterate over the unvisited vertices in increasing order.

Now how do we lazily compute $root$ when needed? We shall do so in a DSU-like manner with path compression as follows:

int get_root(int u) {
if (root[u] == u) return u;
return root[u] = get_root(root[u]);
}


Why is this efficient? DSU guarantees that the time taken is $O((m + n) \log n)$. However, running some stress tests suggests that this is in fact much better than that, and that the number of compressions done is in $O(n + m)$. I tried proving it but found a fatal error in a step, so if anyone can come up with a proof (or a counter-example) for this, please let me know!

Implementation: 125679248