Bitwise operations 2 — popcount & bitsets

#	User	Rating
1	tourist	3880
2	jiangly	3669
3	ecnerwala	3654
4	Benq	3627
5	orzdevinwang	3612
6	Geothermal	3569
6	cnnfls_csy	3569
8	jqdai0815	3532
9	Radewoosh	3522
10	gyh20	3447

#	User	Contrib.
1	awoo	161
1	maomao90	161
3	adamant	156
4	maroonrk	153
5	-is-this-fft-	148
5	atcoder_official	148
5	SecondThread	148
8	Petr	147
9	nor	144
10	TheScrasse	142

Part 1 (link) introduces basic bitwise operations. This is part 2 and it's mainly about (in)famous bitsets and example problems. Also, see links to very useful advanced stuff at the bottom. EDIT: here's video version of this blog (on my Youtube channel).

Built-in functions

In C++, __builtin_popcount(x) returns popcount of a number — the number of ones in the binary representation of $$$x$$$. Use __builtin_popcountll(x) for long longs.

There are also __builtin_clz and __builtin_ctz (and their long long versions) for counting the number of leading or trailing zeros in a positive number. Read more here. Now, try to solve these two simple tasks in $$$O(1)$$$, then open the spoiler to check the solution:

Compute the biggest power of 2 that is a divisor of x. Example: f(12) = 4

Compute the smallest power of 2 that is not smaller than x. Example: f(12) = 16

While popcount is often needed, I rarely use the other two functions. During a contest, I would solve the two tasks above in $$$O(\log x)$$$ with simple while-loops, because it's easier and more intuitive for me. Just be aware that these can be done in $$$O(1)$$$, and use clz or ctz if you need to speed up your solution.

Motivation behind bitsets

Consider this problem: There are $$$N \leq 5000$$$ workers. Each worker is available during some days of this month (which has 30 days). For each worker, you are given a set of numbers, each from interval $$$[1, 30]$$$, representing his/her availability. You need to assign an important project to two workers but they will be able to work on the project only when they are both available. Find two workers that are best for the job — maximize the number of days when both these workers are available.

You can compute the intersection of two workers (two sets) in $$$O(30)$$$ by using e.g. two pointers for two sorted sequences. Doing that for every pair of workers is $$$O(N^2 \cdot 30)$$$, slightly too slow.

We can think about the availability of a worker as a binary string of length $$$30$$$, which can be stored in a single int. With this representation, we can count the intersection size in $$$O(1)$$$ by using __builtin_popcount(x[i] & x[j]). The complexity becomes $$$O(N^2)$$$, fast enough.

What if we are given the availability for the whole year or in general for $$$D$$$ days? We can handle $$$D \leq 64$$$ in a single unsigned long long, what about bigger $$$D$$$?

We can split $$$D$$$ days into convenient parts of size $$$64$$$ and store the availability of a single worker in an array of $$$\frac{D}{64}$$$ unsigned long longs. Then, the intersection can be computed in $$$O(\frac{D}{64})$$$ and the whole complexity is $$$O(N^2 \cdot \frac{D}{64})$$$.

code

const int K = MAX_D / 64 + 1;
unsigned long long x[MAX_N][K];
int intersection(int i, int j) {
	int total = 0;
	for(int z = 0; z < K; z++) {
		total += __builtin_popcountll(x[i][z] & x[j][z]);
	}
	return total;
}

So, we can simulate a long binary number with multiple unsigned long longs. The implementation isn't that bad but doing binary shifts would be quite ugly. Turns out all of this can be done with bitsets easily.

Bitsets

bitset<365> is a binary number with $$$365$$$ bits available, and it supports most of binary operations. The code above changes into simple:

code

bitset<MAX_D> x[MAX_N];
int intersection(int i, int j) {
	return (x[i] & x[j]).count();
}

Some functions differ, e.g. x.count() instead of __builtin_popcount(x) but it's only more convenient. You can read and print binary numbers, construct a bitset from int or string bitset<100> a(17); bitset<100> b("1010");. You can even access particular bits with b[i]. Read more in C++ reference https://en.cppreference.com/w/cpp/utility/bitset.

Note that the size of the bitset must be a constant number. You can't read $$$n$$$ and then declare bitset<n> john;. If $$$n$$$ is up to $$$100$$$, just create bitset<100>.

The complexity of bitwise operations is $$$O(\frac{size}{32})$$$ or $$$O(\frac{size}{64})$$$, it depends on the architecture of your computer.

Problems

P1. Different numbers — You are given a sequence of $$$N \leq 10^7$$$ numbers, each from interval $$$[0, 10^9]$$$. How many different values appear in the sequence? Don't use set or unordered_set because they quite slow.

solution

P2. Knapsack — You are given $$$N \leq 1000$$$ items, each with some weight $$$w_i$$$. Is there a subset of items with total weight exactly $$$W \leq 10^6$$$?

solution

Standard knapsack with boolean array would be $$$O(N \cdot W)$$$, too slow.

bool can[MAX_W];
int main() {
	int n, W;
	cin >> n >> W;
	can[0] = true;
	for(int id = 0; id < n; id++) {
		int x;
		cin >> x;
		for(int i = W; i >= x; i--) {
			if(can[i-x]) can[i] = true;
		}
	}
	puts(can[W] ? "YES" : "NO");
}

Instead of using a boolean array, let's use bitsets and binary shifting to get $$$O(\frac{N \cdot W}{32})$$$. You don't need to know bitsets to see how this would work for $$$W \leq 32$$$ and a bitmask unsigned long long can;, and here we just do it for a longer bitmask.

bitset<MAX_W> can;
int main() {
	int n, W;
	cin >> n >> W;
	can[0] = true;
	for(int id = 0; id < n; id++) {
		int x;
		cin >> x;
		can = can | (can << x); // or just: can |= (can << x);
	}
	puts(can[W] ? "YES" : "NO");

P3. Triangles in a graph — Given a graph with $$$n \leq 2000$$$ vertices and $$$m \leq n \cdot (n - 1) / 2$$$ edges, count triples of vertices $$$a, b, c$$$ such that there are edges $$$a-b$$$, $$$a-c$$$ and $$$b-c$$$.

hint

P4. Chef and Queries — https://www.codechef.com/problems/CHEFQUE (easy)

P5. Odd Topic — https://www.codechef.com/AABH2020/problems/ODTPIC (medium), thanks to Not-Afraid for the suggestion

P6. Funny Gnomes — https://www.codechef.com/problems/SHAIKHGN (hard)

Bonuses

1) m & (m-1) turns off the lowest bit that was set to $$$1$$$ in a positive number $$$m$$$. For example, we get $$$24$$$ for $$$m = 26$$$, as $$$11010$$$ changes into $$$11000$$$. Explanation on quora
2) A quite similar trick allows us to iterate efficiently over all submasks of a mask, article on cp-algorithms / e-maxx. This article also explains why masks-submasks iteration is $$$O(3^n)$$$.
3) DP on broken profile (grid dp) — https://cp-algorithms.com/dynamic_programming/profile-dynamics.html
4) SOS dp (sum over subset) — https://codeforces.com/blog/entry/45223 & https://www.youtube.com/watch?v=Lpvsd8WpbWc&t=5m4s
5) _Find_next function and complexity notation for bitsets — https://codeforces.com/blog/entry/43718

~~I will add links to some problems in online judges, feel free to suggest some in the comments.~~ I think that bonuses 3 and 4 lack some explanation with drawings, maybe I will make some soon.

#include <bits/stdc++.h> #pragma GCC target("popcnt") using namespace std; #define int long long int mod = 1e9 + 7; signed main(void) { int n; cin >> n; vector<vector<int>> grid(n, vector<int>(n, 0)); for (int i = 0; i < n; i++) { string s; cin >> s; for (int j = 0; j < n; j++) { grid[i][j] = s[j] - '0'; } } vector<bitset<3000>> bit(n); for (int j = 0; j < n; j++) { for (int i = 0; i < n; i++) { bit[j][i] = grid[i][j]; } } int ans = 0; for (int i = 0; i < n; i++) { for (int j = i + 1; j < n; j++) { int cnt = (bit[i] & bit[j]).count(); ans += cnt * (cnt - 1) / 2; } } cout << ans << endl; return 0; }

Comments (49)

Show archived | Write comment?

Not-Afraid

4 years ago, # |

+12

This problem from Codechef is a perfect example where bitsets comes handy.

→ Reply

VirusARzk

9 months ago, # ^ |

Yep, it was a great question.

cfmaster

Would you solve a couple of dp_bitmask problems in yournext post?

disabled_Account

amazing work

lemelisk

← Rev. 2 →

+27

It's worth noting that after adding #pragma GCC target("popcnt") __builtin_popcount() is replaced to corresponding machine instruction (look at the difference). In my test this maked x2 speed up.

bitset::count() use __builtin_popcount() call in implementation, so it's also affected by this.

Qualified

4 years ago, # ^ |

I would like to see your benchmarks. When I made my benchmarks, I noticed about a 6.2% speedup. Not as drastic as you said.

2147483648

10 months ago, # ^ |

I would like too, can you show us your bench?

rambojack002

https://cses.fi/problemset/task/2137/

The problem get submitted after adding #pragma GCC target("popcnt")

navneet.h

this problem on codechef and see comments in editorial some solved using bitset to speed up the brute force solution

abhi2402

Another problem where bitsets come handy.

zxcv890

What is the difference between $$$O(1)$$$ and $$$O(30)$$$? How is using bitsets 'true' constant time?

Errichto

Good question. You can think about the number of bits in architecture (usually $$$32$$$ or $$$64$$$) as a variable and then we're talking about $$$O(n)$$$ vs. $$$O(\frac{n}{b})$$$. You come up with an algorithm and it will have speed dependent on the machine it will be run on.

CodingKnight

← Rev. 3 →

Thanks for the informative tutorial. The following is a C++14 demo program on using the 32/64-bit popcount/clz/ctz bit-counting built-in functions in C++. The templates included in this demo may help beginners to use these functions without worrying about memorizing their slightly inconvenient names.

demo

gupta_samarth

Hi Errichto, I had asked you during a recent stream whether it is possible to do an iterative version of going through all bitmasks of length $$$N$$$ in $$$O(2^N)$$$ time, instead of $$$O(N*2^N)$$$. We can do it in recursion, by passing what we need to keep track of as an argument in the recursive call. You had suggested that it is possible via some DP and bit tricks. So I thought a little bit, and the trick used in Fenwick Tree came to mind. I have written a simple code, that for each mask will store all the bits that are on in it. This is the link.

I had a question regarding this, we can get the value of the last set bit using &, but we need to find the position of the bit, i.e. if last set bit value is $$$100$$$ then position is $$$2$$$. To find position, we will still have to iterate over the length of the bitmask right? Does the __builtin_ctz function take 1 operation or length of bits number of operations? What about the bitwise & and $$$|$$$ operators? Ofcourse maximum operations ( that we ever do in CP, mostly ) will only have 64 bit numbers, so you could say O(64) = O(1), but I want to know about the constants exactly.

tfg

You can use some magic like https://www.chessprogramming.org/BitScan#With_isolated_LS1B to get the position instead of the value. Don't ask me how it works, idk about that but it works. You can also do it in O(logBITS) with binary search (as mentioned in the link I gave too under divide and conquer).

Also maybe more important to know, most often optimizing these O(BITS) that you loop over bits is overkill and not necessary because you usually do some really light operations also not involving cache misses so it's actually faster than you'd expect.

Wow, the divide and conquer method was mind blowing. Another question I have always had in my mind was, when you do $$$a$$$&$$$b$$$, you must make $$$log(max(a,b))$$$ operations right? So does that make the first part of finding the last bit value itself useless? Also, does this mean Fenwick Tree has $$$log^2(N)$$$ update instead of $$$log(N)$$$?

No, a&b should be done in few cycles and should actually be cheaper than a+b because the ALU in cpus for sure have a bitwise and operation and it doesn't need to carry over the carry bit so it's fully parallel for the bits.

Oh yeah, makes sense. ( Le me: Recalls Computer Architecture course, wonders what use is university )

Ritwin

3 years ago, # ^ |

That makes sense to me, because computers probably have built-in circuits in the chips for bitwise operations, where you plug in 2 numbers from one end, some magic happens in constant time, and you get the output.

Update :)

I made two Youtube videos (part 1 and part2) which cover the same content as my blogs.

And added bonus (5) with a function to find the next bit 1 in a bitset, _Find_next(i).

SPyofgame

I have a question

Can I implement find_next_bit() manually under O(log n) ?

Shift and then https://www.geeksforgeeks.org/position-of-rightmost-set-bit/

← Rev. 4 →

IGMaster ! I have 2 questions

Is it danger if I use signed variable ?
But if I want to know next set bit of (n) starting from (p). I still need to know the variable (n) which is from position (p) — 1 -> 0 right ?

I don't know. Google: c++ signed variable bitwise operations
yeah, you need to shift first.

Thanks IGMaster

IGMaster ! Can I find next unset bit without reversing it ?

I edited: Master -> IGMaster

Stop calling me master. I would negate a number first (flip every bit) but for sure something equivalent can be done. I think you're focusing too much on this. When I had your rating, I didn't know any of this.

Thanks and sorry for calling like that.

I just curious to know if there exists a simpler & effeciency bitwise implementation

Swistakk

+28

Errichto: Stop calling me master.

SPyofgame: I edited: Master -> IGMaster

XDDDDDDDDDDDDDDDDDDDDDDDDDD

YoshiYoshiYoshi

Edit: This should be deleted, I was just toasted and put 1LL << 31 instead of 1LL << 32.

... ... ...

Hey people, I need some help.

For chef and queries, I have the current si updating like this after each iteration:

s = ((a*s+b)% MAX_W);

where MAX_W is 1LL << 31;

the type of a, s and b is long long.

I don't get why this messes up the algorithm. If I change their types to unsigned int and remove the % operation in the update, the algorithm works fine. However, I think that the first approach should be working fine, as it is how the problem says we need to calculate si.

Can someone please guide me through the reasoning behind the issue?

Thanks!

adil198

What was size of your bitset??
I am taking bitset<1000000001> and it shows stack-overflow
while taking bitset<100000> on smaller inputs works fine

purplesyringa

bitset<1000000001> takes about 125 MB of memory. You are (generally) not expected to allocate so much on the stack. Use the heap or a global variable instead.

can you please elaboarte more on a code?
P4 of the above post.
As in the bitset video by Errichto, he used a bitset of that size.

totally_banned

Please check these 2 codes:

Problem:- https://codeforces.com/problemset/problem/550/B WA:- https://ideone.com/0eCeTX AC:- https://codeforces.com/contest/550/submission/11417471

The only difference is, the AC code has 0 based array, and mine has 1 based array. Is it wrong to make 1 based array for bitmasking?

Then you need a bitmask with bits on positions 1 through N, so an even number up to 2^(n+1). I don't think you understand what your code does.

Robur_131

Errichto What is the complexity of __builtin_popcount(N)? (Assuming $$$N$$$ can be as arbitrarily large as possible)

Mcayitel

It will take O(32)

So in other words it's just $$$\mathcal{O}(\text{number of bits})$$$ ? I saw this comment claiming that's it's $$$\mathcal{O}(\log{(\text{number of bits})})$$$

If it is so, I don't know why it is and what exactly is working at architecture or compiler level. If anyone replies at your comment plz let me know as I also what to know how it can do in log(bits) time.

mnaeraxr

Here you have how to do it in $$$O(\log bits)$$$.

Technically speaking it is only considered $$$O(\log bits)$$$ because it assumes adding and shifting numbers can be done in constant time, so it does $$$O(\log bits)$$$ such operations. This is a reasonable assumption in this context since we are handling only with numbers of 32 and 64 bits.

Treat int popcount as an $$$O(1)$$$ operation but it's just slightly slower than simpler operations like xor. The complexity becomes $$$O(L/32)$$$ for bitsets of length $$$L$$$.

parag776

16 months ago, # |

https://www.codechef.com/problems/FUNARR

this problem can be done for bitset practice.

d_sqaure

12 months ago, # |

bitset<1000000001> bs; it is showing segmentation fault when i am declaring inside function.But it is fine if i declare it globally.can anyone explaint why is this happening and other restrictions in bitset declarations ?

Wind_Eagle

12 months ago, # ^ |

Local variables are placed in a stack. This bitset is equivalent to ~125MB, this is a lot bigger than you local stack size (don't remember exactly what is default stack size). If you will try submitting it to Codeforces, you won't get any errors.

Increase stack size using compiler flags.

Thanks!.And if i declare vector< bool > v(10000000001); of same size locally.It is working fine, what is the reason here ?. Also, if you can share the way to customise stack size using compilers flag will be helpful.

meanup-

10 months ago, # |

← Rev. 6 →

I was struggling to figure out the first problem which was posted in this post. After watching the video for endless times I was missing something, and could not code this problem. Hence, I would like to get some suggestion whether the below code can be improved? Please assist.

There are $$$N \leq 5000$$$ workers. Each worker is available during some days of this month (which has $$$30$$$ days). For each worker, you are given a set of numbers, each from interval $$$[1,30]$$$ , representing his/her availability. You need to assign an important project to two workers but they will be able to work on the project only when they are both available. Find two workers that are best for the job — maximize the number of days when both these workers are available.

Test Case:

Input: $$$\text{workers} = [[2,3,5,6,8],[2,4,5,8],[1,2,10,12,14,16]];$$$

output: 3, pairs: 0 and 1

Unable to parse markup [type=CF_MATHJAX]

My Code

#include <bits/stdc++.h>
using namespace std;

const int MAX_N = 5000;
const int MAX_D = 30;
const int K = MAX_D / 64 + 1;
unsigned long long x[MAX_N][K];

int intersection(int i, int j) {
	int total = 0;
	for(int z = 0; z < K; z++) {
		total += __builtin_popcountll(x[i][z] & x[j][z]);
	}
	return total;
}

int main() {
	vector<vector<int>> workers = {
		{2,3,5,6,8},
		{2,4,5,8},
		{1,2,10,12,14,16}
	};

	for(int i = 0; i < workers.size(); i++) {
		for(int j : workers[i]) {
			x[i][j / 64] |= (1ULL << (j % 64));
		}
	}

	int max_intersection = 0;
	pair<int,int> best_pair;

	for(int i = 0; i < workers.size(); i++) {
		for(int j = i + 1; j < workers.size(); j++) {
			int inter = intersection(i, j);
			if(inter > max_intersection) {
				max_intersection = inter;
				best_pair = {i, j};
			}
		}
	}

	cout << "Pairs: " << best_pair.first << " and " << best_pair.second << "\n";
	cout << "Maximum common days: " << max_intersection << "\n";

	return 0;
}

Using bitset

#include <bits/stdc++.h>
using namespace std;

const int MAX_N = 5000;
const int MAX_D = 30;
bitset<MAX_D> x[MAX_N];

int intersection(int i, int j) {
	return (x[i] & x[j]).count();
}

int main() {
	vector<vector<int>> workers = {
		{2,3,5,6,8},
		{2,4,5,8},
		{1,2,10,12,14,16}
	};

	for(int i = 0; i < workers.size(); i++) {
		for(int j : workers[i]) {
			x[i].set(j - 1); 
		}
	}

	int max_intersection = 0;
	pair<int,int> best_pair;

	for(int i = 0; i < workers.size(); i++) {
		for(int j = i + 1; j < workers.size(); j++) {
			int inter = intersection(i, j);
			if(inter > max_intersection) {
				max_intersection = inter;
				best_pair = {i, j};
			}
		}
	}

	cout << "Pairs: " << best_pair.first << " and " << best_pair.second << "\n";
	cout << "Maximum common days: " << max_intersection << "\n";

	return 0;
}

SuperMo

8 months ago, # |

What is the complexity of the third problem (counting triangle) I think that finding all the pair would take (1/2)N^2 which is O(N^2)and then we Need to iterate over all n again to find the third intersection if it does exist .so its O(n^3/32) am I right ?

dr.owl

4 months ago, # |

I guess best motivation to read this blog came when Errichto write too slow to some method.

AMBUJ_2003

2 months ago, # |

How can we copy a bitset to other.

for P5 can we use a prefix bitset and see changes and solve that way