### Noble_Mushtak's blog

By Noble_Mushtak, history, 22 months ago,

Hello. As some of you may know, the Mount Allison Programming Showdown 2020 took place yesterday, and I had a question about the Divide and Conquer problem.

You can read the problem yourself, but basically, the problem gives you two integers $b$ and $d$, where $1 < b, d < 2^{63}$ and $d$ is prime, and then asks you if there exists an $m$ such that $b^m \equiv -1 \pmod{d}$. You need to output yes if such an $m$ exists and output no otherwise. (This is an oversimplification of the problem because the actual problem statement is quite long, so it is possible I misread what it is asking, but I am just giving this short summary of the problem so that I don't have to repeat the whole problem statement in this post.)

Here is how my solution works:

• If $d=2$, then $m$ exists if and only if $b$ is odd.
• If $d$ is an odd prime, then $m$ exists if and only if $b$ has even order in the group of units $\pmod{d}$.
• PROOF: If $m$ exists, then we can let $m'$ is the smallest positive integer such that $b^{m'}\equiv -1\pmod{d}$. Thus, the order of $b$ must be greater than $m'$. Moreover, we have $b^{2m'}\equiv (-1)^2\equiv 1\pmod{d}$, so the order of $b$ must be at most $2m'$. Now, for any integer $m < n < 2m'$, it holds that $0 < n-m' < m'$, so $b^{n-m'}\not\equiv -1\pmod{d}$ since $m'$ is the smallest positive integer such that $b^{m'}\equiv -1\pmod{d}$ and $n-m' < m'$. Thus, $b^n\equiv b^mb^{n-m}\equiv -b^{n-m}\not\equiv 1\pmod{d}$, so the order of $b$ is not $n$ for any $m' < n < 2m'$. Therefore, the order of $b$ must be $2m'$, so the order of $b$ is even.
• PROOF CONTINUED: If $b$ has even order, then let $2m$ be the order of $b$. Then, $b^{2m}\equiv (b^m)^2\equiv 1\pmod{d}$. Thus, either $b^m\equiv -1\pmod{d}$ or $b^m\equiv 1\pmod{d}$. However, since $2m$ is the order of $b$, $b^m\not\equiv 1\pmod{d}$, so it must be that $b^m\equiv -1\pmod{d}$.
• Now, to see if the order of $b$ is even, calculate $b^n\pmod{d}$, where $n$ is the biggest odd factor of $d-1$. If $b^n\equiv 1\pmod{d}$, then the order of $b$ divides $n$, so $b$ has odd order, so output no.
• Otherwise, if $b^n\not\equiv 1\pmod{d}$, then the order of $b$ does not divide $n$. Since the order of $b$ must divide $d-1$, this means that the order of $b$ must be even, so output yes.

However, when I submit a solution off the above methods, I get Wrong Answer. Can anyone see what I am doing wrong? I have included my C++ code below. Thanks for the help.

#pragma GCC optimize("Ofast")
#pragma GCC optimize("unroll-loops")
#pragma GCC target("sse,sse2,sse3,ssse3,sse4,popcnt,abm,mmx,avx,avx2")
#include <bits/stdc++.h>

typedef int64_t num;

static inline num multModulo(num num1, num num2, num MOD) {
return ( static_cast<__int128>(num1) * num2) % MOD;
}

num calcModuloExp(num base, num exp, num MOD) {
num result = 1, cur = base;
while (exp) {
if (exp & 1) result = multModulo(result, cur, MOD);
cur = multModulo(cur, cur, MOD), exp >>= 1;
}
return result;
}

int main() {
num b, d;
scanf("%" PRId64 " %" PRId64, &b, &d);
if (d == 2) {
puts((b & 1) ? "yes" : "no");
} else {
num biggestOddFactor = d-1;
while (!(biggestOddFactor & 1)) biggestOddFactor >>= 1;
num answer = calcModuloExp(b, biggestOddFactor, d);
puts((answer == 1) ? "no" : "yes");
}

return 0;
}


• +3

By Noble_Mushtak, history, 3 years ago,

Usually, I hear people saying that std::cin and std::cout is a relatively slow form of input while scanf and printf is a faster form of input which should be sufficient for all problems. However, sometimes I wonder if there are any problems where scanf and printf would be too slow.

For example, look at the problem 1181D — Irrigation. This problem asks us to read in 500,000 numbers, each of which could be as large as 500,000, and then asks us to read in 500,000 more numbers, each of which could be as large as $10^{18}$. Clearly, this is a lot of input/output, so users should use scanf and printf over std::cin and std::cout. However, even when I used scanf and printf, my solution ran in 2355 ms, barely under the time limit of 2.5 seconds. On the other hand, when I wrote my code using an I/O library I wrote myself, which uses fread and fwrite to read/write 32768 bytes at a time, my improved solution ran in 592 ms, almost 4 times faster than the original solution.

Since my code barely ran under time using scanf and printf, I wonder if there are any CodeForces problems where using scanf and printf will inevitably lead to a Time Limit Exceeded error and users must use some other I/O method like I did. Has anyone else encountered a problem where scanf and printf just weren't quite fast enough? Moreover, has anyone else built a custom I/O library for competitive programming like I did? If anyone has knows how to do input/output faster than fread and fwrite, I would love to hear about it.

• +5

By Noble_Mushtak, history, 3 years ago,

Recently, I have become interested in x86 assembly, and I wanted to try out my skills on CodeForces using __asm__ extension in GCC. For example, I was very pleased when I solved problem 495B using a very naive algorithm implemented in assembly which would probably have TLEd if it was written in regular C (basically, looped all the way up to $\frac{n}{4}$ to find all divisors of $n$ for $n \leq 10^9$ rather than using $O(\sqrt{n})$ algorithm).

However, this algorithm was implemented by using only 32-bit numbers and registers, so I was curious as to how the performance would be affected by using 64-bit numbers and registers. Therefore, I tried to use the %rax, %rbx, etc. registers instead of the %eax, %ebx, etc. registers, but for some reason, this gave me a "bad register" error. I then tried running my code using Clang++17 Diagnostics and this gave me a much clearer error:

p71.cpp:24:13: error: register %rax is only available in 64-bit mode
__asm__("push %%rax \n\t"

Of course, this implies that CodeForces is compiling our programs using 32-bit mode, so using 64-bit integers is even slower than it would normally be, because GCC/G++/Clang++ has to manipulate multiple registers in order to implement 64-bit operations in 32-bit x86. The fact that CodeForces uses 32-bit mode surprised me very much, since 64-bit has been the standard for computers and even smartphones for years. In fact, Ubuntu recently decided to drop its support for 32-bit computers in its upcoming versions, showing how old 32-bit really is.

Given all this, why does CodeForces continue to use GCC/G++/Clang++ in 32-bit version? Is this to penalize programmers from using 64-bit integers when it's not necessary, or do they use 32-bit mode for a more technical reason? I am not sure using 32-bit mode makes a huge difference since I find most problems can be solved using only 32-bit numbers anyway, but I am very curious as to why CodeForces made the decision to choose 32-bit over 64-bit.

• +17

By Noble_Mushtak, history, 3 years ago,

On SPOJ, there is a problem named POLYMUL, which is pretty straightforward: Multiply two polynomials, which have integer coefficients in the interval $[0, 1000]$ and can each have a degree of up to $10000$.

Currently, I am trying to solve this problem with NTT and my solution can be found here. As the solution shows, I am doing NTT with the primes of both $998244353$ and $2013265921$, and then using Chinese Remainder Theorem in order to find the true coefficients without any modulo (I am assuming the coefficients are less than $1000^2\cdot 10000=10^{10}$, which is much less than the product of $998244353$ and $2013265921$). On my machine, it takes about 3 seconds for this solution to solve a test case with $T=10$ (i.e. $10$ pairs of polynomial) where each polynomial in each pair is of degree $10000$ and the coefficients are integers randomly picked from the interval $[0, 1000]$.

However, the time limit is 1 second, and I do not know what to do to optimize my solution more. Does any one have any suggestions on how to optimize my NTT algorithm or how to optimize my C code in general? Any solutions would be very welcome. Thanks!

• +11

By Noble_Mushtak, history, 5 years ago,

All solutions below are thoroughly commented and based on the official editorial. I might start doing this for one or two editorials every month starting now because writing alternate solutions for 701C and 701D and writing solutions for 701E and 701F (could not figure out on my own) really helped me understand the problems better. Also, I'm sorry if the later solutions are more clunky; I'm not used to solving problems of that difficulty. Thanks!

### 701A — Cards

Their explanation is pretty straight-forward, so here's the solution.

### 701B — Cells Not Under Attack

Here, I deviate from the editorial in two ways:

• Because I am using C and do not have set, I instead mark rows and columns as attacked in an array of bools and count the number of attacked rows and columns as I go along.
• Their formula is n·n - cnt where cnt = szY·n + szX·n - szX·szY. This gives us n·n - szY·n - szX·n + szX·szY = (n - szX)(n - szY). I use the latter in my solution.

Here is my solution for this problem.

### 701C — They Are Everywhere

The confusing part of their answer is that they're not really clear about when they're talking about looping through the different types of letters in s or all of the letters in order. Here is my implementation of their solution.

### 701D — As Fast As Possible

Here is my binary search solution. The problem with their explanation here is it's out of order, they never actually show their equation for what I called x, their variable posM convoluted their formula for x, and they never actually solve for what I called y but instead only briefly mention that solutions must account for the bus going back.

### 701E — Connecting Universities

I think the way they explained 701E pretty well other than the fact that lv is unnecessary since lv  =  1 for all v. However, it is missing lots of detail if you're not so good at working with trees. Here's my implementation of the editorial solution.

### 701F — Break Up

I could not understand from the editorial's explanation how to find the second edge in the case of two edges until I looked at the dfs2() function from this solution to 701F from Vosatorp, so thanks to them for this solution!

Basically, in our depth first search, if an edge from va to vb in path from s to t is unnecessary, then there will contain a cycle with vb and a vertex before vb in the path. This is because since the edge is unnecessary, there exists another path from s to t without that edge. Let vl be the vertex farthest along the original path that is before vb on this new path. and vr be the vertex closest to vb that is after it on this new path. Then, using the old path, we get to vb, then vr and using the new path, we get to vl. Thus, by this process, we have created a cycle with both vb and vl, proving our original point. Ergo, by contrapositives, edges where vb is not in any cycle with a previous vertex in the path are necessary.

This process seems to be a variation on Tarjan's Bridge-Finding algorithm which I did not know before.

Other than that proof of correctness, everything else is pretty much explained in the comments of my re-writing of Vosatorp's solution in C, without vector.

### 700D — Huffman Coding on Segment

Coming when I have enough time/experience to understand the solution!

### 700E — Cool Slogans

Coming when I have enough time/experience to understand the solution!

• +12

By Noble_Mushtak, history, 6 years ago,

There's this really good article on how to find the maximal empty subgrid of a grid so one might wonder, "How do we find the number of empty subgrids of a grid?" I first saw the answer to the second question in this solution to CHSGMNTS on CodeChef by lohit_97 (found on CodeChef and on CodeForces), so thanks to them for the following algorithm!

First, here is a more formalized way of saying the problem:

• The first line of input contains two integers width and height (where 1 ≤ width, height ≤ 1000).
• The next height lines of input contain width integers that fit into long in C/C++.
• Let matrix[i][j] be the integer number i on row number j.
• Find the number of tuples (a, b, c, d) such that matrix[i][j] is 0 for all a ≤ i ≤ c and b ≤ j ≤ d.

The way we can solve this problem is by keeping an array emptyToLeft[i][j] which is the maximum number such that i - emptyToLeft[i][j] < x ≤ i implies matrix[x][j] is 0. (If matrix[i][j] is not 0, then we just set emptyToLeft[i][j] to 0 to make the inequality always false.)

Once we have that, we loop through each column i and in each column, we make a stack and a tempAnswer = 0. In the bottom of a stack, we start with  - 1 and as we loop through each row j, each row is popped on and some rows are popped off in order to preserve the property that between two consecutive stack elements a and b where b is directly above a, we have a < y ≤ b implies emptyToLeft[i][y] ≥ emptyToLeft[i][b]. By the definition of emptyToLeft, this means i - emptyToLeft[i][b] < x ≤ i and a < y ≤ b implies matrix[x][y] is 0. Thus, all such (x, y) as (a, b) and (i, j) as (c, d) in the tuple (a, b, c, d) is an answer. Therefore, two consecutive stack elements a and b give us (b - a) * emptyToLeft[i][b] answers, so we add that to tempAnswer when we add b to the stack and subtract that from tempAnswer when we pop b off. Finally, the answer is simply the sum of all tempAnswers across all i, j.

Here is the C code for this algorithm:

#include <stdio.h>
#include <stdlib.h>
#define REPEAT(token, num) for (token = 0; token < num; token++)

typedef long num_cells;
typedef long cell;
typedef long long num_matrices;

cell matrix[1000][1000];
num_cells width, height, emptyToLeft[1000][1000], stack[1000], stackLength;

int main() {
num_cells i, j, prevToLeft;
//Get the input for the matrix:
scanf("%li %li", &width, &height);
REPEAT(j, height) REPEAT(i, width) scanf("%li", matrix[i]+j);

//Create emptyToLeft by looping through the rows:
REPEAT(j, height) {
//This is the previous emptyToLeft, which starts out at 0 because at the beginning of a row, there are no elements to the left.
prevToLeft = 0;
//Loop through each column:
REPEAT(i, width) {
//If matrix[i][j] != 0, then emptyToLeft[i][j] = 0.
if (matrix[i][j]) emptyToLeft[i][j] = 0;
else emptyToLeft[i][j] = prevToLeft+1;
//Also, update prevToLeft:
prevToLeft = emptyToLeft[i][j];
}
}

REPEAT(i, width) {
//For each column, create a stack with -1 at the bottom.
stack[0] = -1, stackLength = 1;
//Loop through each row:
REPEAT(j, height) {
//If emptyToLeft[i][stack[stackLength-1]] >= emptyToLeft[i][j], then putting j onto the stack will violate the stack property, so we need to pop the stack:
while (stack[stackLength-1] != -1 && emptyToLeft[i][stack[stackLength-1]] >= emptyToLeft[i][j]) {
//Update tempAnswer now that we are going to pop the stack:
//Note that here, stack[stackLength-1] is described as b above while stack[stackLength-2] is a.
//Pop the stack:
stackLength--;
}
stack[stackLength++] = j;
}
}

exit(0);
}


Notice that we have a while loop inside of the j loop, giving us a third inner loop and thus possibly O(height2width) complexity. However, each run of the while loop pops an element off the stack and we only add an element to the stack for each j, meaning overall, for one i, we can do at most height pops. Thus, the j loop is O(height) for each i and the while loop is O(height) for each i, which gives us O(height·width) complexity for this problem.

• -20

By Noble_Mushtak, history, 6 years ago,

After the contest, I read other people's solution to Vacations and people kept using big dynamic programming arrays on Div. 2 Problem C, which confused me, so I decided to show all of you that this was unnecessary by solving this problem using only one uint_32 variable in C. Using bit fields in a struct, we can get this down to only three bytes.

The intuition behind this solution is that you only need to keep track of what is possible from the day before and if nothing is possible, then you have to rest this day and everything will be possible the next day. I think it's kind of a greedy algorithm because unlike the DP algorithms, which seemed to take into account solutions where we rest early in order to do something the next day, this algorithm does something on every day when it can and rests only when it finds that it needs to. I do this because whether or not we rest early or the day after when we have to is irrelevant: We still end up resting one day, so the answer is the same. Thus, no dynamic programming is necessary and we only need to keep track of one answer at every point.

• +10

By Noble_Mushtak, history, 6 years ago,

Here is my analysis of the solution to Contest 362 Div. 1 Problem A:

• There are edges per query.
• Each edge takes to process. There are added every query and O(q) queries, so we have edges, meaning each edge is to process.
• Finally, there are O(q) queries.

This gives us overall. However, the editorial does not have the term and simply has . How did they get rid of the term in the analysis? Did I do something wrong here?