Nisiyama_Suzune's blog

By Nisiyama_Suzune, history, 9 months ago, In English,


The idea of this article originated from a contest (Petrozavodsk Summer-2016. Petr Mitrichev Contest 14), which I believe is attributed to Petr. In this contest, an interesting problem is proposed:

"Cosider this process: pick a random number ni uniformly at random between 10 and 100. Generate ni random points with integer coordinates, picking each coordinate independently and uniformly at random from all integers between 0 and 109, inclusive. Find the convex hull of those points.

Now you are given 10000 polygons generated by this program. For each polygon, you need to guess the value ni that was used for generating it.

Your answer will be accepted if the average (over all 10000 hulls) absolute difference between the natural logarithm of your guess and the natural logarithm of the true ni is below 0.2."

Unfortunately, I didn't really manage to work this one out during our 5-hour training session. After the training is over, however, I have tried to read the solution program written by Petr, which looks like the following:

public class h {
	static int[] splitBy = new int[] {/* 1000 seemingly-random elements */};
	static double[] splitVal = new double[] {/* another 1000 seemingly-arbitrarily-chosen elements */};
	static double[] adjYes = new double[] {/* Another 1000 seemingly-stochastically-generated elements */};
	static double[] adjNo = new double[] {/* ANOTHER 1000 seemingly-... elements, I'm really at my wit's end */};
	public static void main(String[] args) {
		/* Process the convex hull, so that[0] is the average length of the convex hull to four sides of the square border
				(i.e. (0, 0) - (1E9, 1E9));[1] is the area of the hull;[2] is the number of points on the hull.
		double res = 0;
		for (int ti = 0; ti < splitBy.length; ++ti) {
			if ([splitBy[ti]] >= splitVal[ti]) {
				res += adjYes[ti];
			} else {
				res += adjNo[ti];
		int guess = (int) Math.round (res);
		if (guess < 10) guess = 10;
		if (guess > 100) guess = 100;
		pw.println (guess);

While I was struggling to understand where all the "magic numbers" come from, I do realize that the whole program is somewhat akin to a "features to output" black box, which is extensively studied in machine learning. So, I have made my own attempt at building a learner that can solve the above problem.

A lightweight learner

Apparently, most online judge simply do not support scikit-learn or tensorflow, which are common machine learning libraries in Python. (Or an 100MB model file, the imagination of 500 users with an 100MB file each makes my head ache. And yes, there are even multiple submissions.) Therefore, some handcraft code is necessary to implement a learner that is easy to use.

As a university student, I, unfortunately, do not know much about machine learning, especially in regression, where even fewer methods are adaptable. However, I somehow got particularly attracted by the idea of the neural network after some googling, both because its wide application and its simplicity, especially due to the fact that its core code can be written with about 60 lines. I will introduce some of its basic mechanism below.

Understand neural network in one section

A neural network, naturally, is made up of neurons. A neuron in a neural network, by definition, is something that maps an input vector x to an output y. Specifically, a certain neuron consists of a weight vector w with the same length as x, a bias constant b, and some mapping function f, and we compute its output with the following simple formula:

In general, we tend to use the sigmoid function as f for reasons we will see below. Other values are "tweakable" parts of a neuron that will be adjusted according to the data, as is the process of learning.

Turning back to the topic of a neural network, it will contain several layers of neurons, where each layer reads the input from the previous layer (the first layer, of course, from the input) and outputs the result as the inputs of the next layer (or the final answer, if it is the last layer). As such, it will not be very difficult to implement a neuron network if all parameters are given, since copying the formula above will suffice.

However, how do we know what these parameters are, anyway? Well, one common way is called "gradient descent". With this method, you imagine a hyperspace with each parameter of a neuron network as an axis. Then, each point in this hyperspace actually represents a neuron network. If we can give every point (neuron network) an error value that indicates how far it is away from the ideal model, then we can simply pick a random point and begin walking (descending) towards a direction where the error value is decreasing. In the end we will reach a point where the error value is very small, which represents a good neural network that we want. As you can guess, in practice we generate a random data (a random convex hull, regarding the problem above) and dictate the error value to be the square of the difference between the output of the network and the real answer, and walk one "step" towards the smaller error value. If sufficient data is generated, then this "stochastic gradient descent" method should approximately be equal to "gradient descent".

Now I can claim due to the fact that the sigmoid function is differentiable at every point, with some difficult-to-understand maths we can actually figure out the best direction in which the error value decreases the fastest without actually trying to explore around. This extensively studied field is named Backpropagation, and simply copying the result from wikipedia is sufficient for us to build a neural network ourselves.

I will show my code regarding the aforementioned problem below:

#include <bits/stdc++.h>

//ft inputs, n layers, m neurons per layer.
template <int ft = 3, int n = 2, int m = 3, int MAXDATA = 100000>
struct network {
	double wp[n][m][ft/* or m, if larger */], bp[n][m], w[m], b, val[n][m], del[n][m], avg[ft + 1], sig[ft + 1];
	network () {
		std::mt19937_64 mt (time (0));
		std::uniform_real_distribution <double> urdn (0, 2 * sqrt (m));
		for (int i = 0; i < n; ++i) for (int j = 0; j < m; ++j) for (int k = 0; k < (i ? m : ft); ++k)
			wp[i][j][k] = urdn (mt);
		for (int i = 0; i < n; ++i) for (int j = 0; j < m; ++j) bp[i][j] = urdn (mt);
		for (int i = 0; i < m; ++i) w[i] = urdn (mt); b = urdn (mt);
		for (int i = 0; i < ft + 1; ++i) avg[i] = sig[i] = 0;
	double compute (double *x) {
		for (int j = 0; j < m; ++j) {
			val[0][j] = bp[0][j]; for (int k = 0; k < ft; ++k) val[0][j] += wp[0][j][k] * x[k];
			val[0][j] = 1 / (1 + exp (-val[0][j]));
		for (int i = 1; i < n; ++i) for (int j = 0; j < m; ++j) {
			val[i][j] = bp[i][j]; for (int k = 0; k < m; ++k) val[i][j] += wp[i][j][k] * val[i - 1][k];
			val[i][j] = 1 / (1 + exp (-val[i][j]));
		double res = b; for (int i = 0; i < m; ++i) res += val[n - 1][i] * w[i];
//		return 1 / (1 + exp (-res));
		return res;
	void desc (double *x, double t, double eta) {
		double o = compute (x), delo = (o - t); // * o * (1 - o)
		for (int j = 0; j < m; ++j) del[n - 1][j] = w[j] * delo * val[n - 1][j] * (1 - val[n - 1][j]);
		for (int i = n - 2; i >= 0; --i) for (int j = 0; j < m; ++j) {
			del[i][j] = 0; for (int k = 0; k < m; ++k)
				del[i][j] += wp[i + 1][k][j] * del[i + 1][k] * val[i][j] * (1 - val[i][j]);
		for (int j = 0; j < m; ++j) bp[0][j] -= eta * del[0][j];
		for (int j = 0; j < m; ++j) for (int k = 0; k < ft; ++k) wp[0][j][k] -= eta * del[0][j] * x[k];
		for (int i = 1; i < n; ++i) for (int j = 0; j < m; ++j) bp[i][j] -= eta * del[i][j];
		for (int i = 1; i < n; ++i) for (int j = 0; j < m; ++j) for (int k = 0; k < m; ++k)
			wp[i][j][k] -= eta * del[i][j] * val[i - 1][k];
		b -= eta * delo;
//		for (int i = 0; i < m; ++i) w[i] -= eta * delo * o * (1 - o) * val[i];
		for (int i = 0; i < m; ++i) w[i] -= eta * delo * val[n - 1][i];
	void train (double data[MAXDATA][ft + 1], int dn, int epoch, double eta) {
		for (int i = 0; i < ft + 1; ++i) for (int j = 0; j < dn; ++j) avg[i] += data[j][i];
		for (int i = 0; i < ft + 1; ++i) avg[i] /= dn; 
		for (int i = 0; i < ft + 1; ++i) for (int j = 0; j < dn; ++j)
			sig[i] += (data[j][i] - avg[i]) * (data[j][i] - avg[i]);
		for (int i = 0; i < ft + 1; ++i) sig[i] = sqrt (sig[i] / dn); 
		for (int i = 0; i < ft + 1; ++i) for (int j = 0; j < dn; ++j)
			data[j][i] = (data[j][i] - avg[i]) / sig[i];
		for (int cnt = 0; cnt < epoch; ++cnt) for (int test = 0; test < dn; ++test)
			desc (data[test], data[test][ft], eta);
	double predict (double *x) {
		for (int i = 0; i < ft; ++i) x[i] = (x[i] - avg[i]) / sig[i];
		return compute (x) * sig[ft] + avg[ft];
	std::string to_string () {
		std::ostringstream os; os << std::fixed << std::setprecision (16);
		for (int i = 0; i < n; ++i) for (int j = 0; j < m; ++j) for (int k = 0; k < (i ? m : ft); ++k)
			os << wp[i][j][k] << " ";
		for (int i = 0; i < n; ++i) for (int j = 0; j < m; ++j) os << bp[i][j] << " ";
		for (int i = 0; i < m; ++i) os << w[i] << " "; os << b << " ";
		for (int i = 0; i < ft + 1; ++i) os << avg[i] << " ";
		for (int i = 0; i < ft + 1; ++i) os << sig[i] << " ";
		return os.str ();
	void read (const std::string &str) {
		std::istringstream is (str);
		for (int i = 0; i < n; ++i) for (int j = 0; j < m; ++j) for (int k = 0; k < (i ? m : ft); ++k)
			is >> wp[i][j][k];
		for (int i = 0; i < n; ++i) for (int j = 0; j < m; ++j) is >> bp[i][j];
		for (int i = 0; i < m; ++i) is >> w[i]; is >> b;
		for (int i = 0; i < ft + 1; ++i) is >> avg[i];
		for (int i = 0; i < ft + 1; ++i) is >> sig[i];

#define cd const double &
const double EPS = 1E-8, PI = acos (-1);
int sgn (cd x) { return x < -EPS ? -1 : x > EPS; }
int cmp (cd x, cd y) { return sgn (x - y); }
double sqr (cd x) { return x * x; }

#define cp const point &
struct point {
	double x, y;
	explicit point (cd x = 0, cd y = 0) : x (x), y (y) {}
	int dim () const { return sgn (y) == 0 ? sgn (x) < 0 : sgn (y) < 0; }
	point unit () const { double l = sqrt (x * x + y * y); return point (x / l, y / l); }
	point rot90 () const { return point (-y, x); }
	point _rot90 () const { return point (y, -x); }
	point rot (cd t) const {
		double c = cos (t), s = sin (t);
		return point (x * c - y * s, x * s + y * c); } };
bool operator == (cp a, cp b) { return cmp (a.x, b.x) == 0 && cmp (a.y, b.y) == 0; }
bool operator != (cp a, cp b) { return cmp (a.x, b.x) != 0 || cmp (a.y, b.y) != 0; }
bool operator < (cp a, cp b) { return (cmp (a.x, b.x) == 0) ? cmp (a.y, b.y) < 0 : cmp (a.x, b.x) < 0; }
point operator - (cp a) { return point (-a.x, -a.y); }
point operator + (cp a, cp b) { return point (a.x + b.x, a.y + b.y); }
point operator - (cp a, cp b) { return point (a.x - b.x, a.y - b.y); }
point operator * (cp a, cd b) { return point (a.x * b, a.y * b); }
point operator / (cp a, cd b) { return point (a.x / b, a.y / b); }
double dot (cp a, cp b) { return a.x * b.x + a.y * b.y; }
double det (cp a, cp b) { return a.x * b.y - a.y * b.x; }
double dis2 (cp a, cp b = point ()) { return sqr (a.x - b.x) + sqr (a.y - b.y); }
double dis (cp a, cp b = point ()) { return sqrt (dis2 (a, b)); }

bool turn_left (cp a, cp b, cp c) { return sgn (det (b - a, c - a)) >= 0; }
std::vector <point> convex_hull (std::vector <point> a) {
	int cnt = 0; std::sort (a.begin (), a.end ());
	std::vector <point> ret (a.size () << 1, point ());
	for (int i = 0; i < (int) a.size (); ++i) {
		while (cnt > 1 && turn_left (ret[cnt - 2], a[i], ret[cnt - 1])) --cnt; 
		ret[cnt++] = a[i]; }
	int fixed = cnt;
	for (int i = (int) a.size () - 1; i >= 0; --i) {
		while (cnt > fixed && turn_left (ret[cnt - 2], a[i], ret[cnt - 1])) --cnt;
		ret[cnt++] = a[i]; }
	return std::vector <point> (ret.begin (), ret.begin () + cnt - 1); }

const int FT = 10, N = 3, M = 4, DATA = 5000000;
network <FT, N, M, DATA> nt;

void process (double *data, const std::vector <point> &cv) {
	data[0] = 0; data[1] = cv.size ();
	data[2] = 1E9; data[3] = 1E9;
	data[4] = 1E9; data[5] = 1E9;
	data[6] = 0; data[7] = 0; data[8] = 0; data[9] = 1E18;
	for (int i = 0; i < cv.size (); ++i) {
		data[0] += det (cv[i], cv[(i + 1) % cv.size ()]);
		data[2] = std::min (data[2], cv[i].x);
		data[3] = std::min (data[3], cv[i].y);
		data[4] = std::min (data[4], 1E9 - cv[i].x);
		data[5] = std::min (data[5], 1E9 - cv[i].y);
		data[6] += dis (cv[i], cv[(i + 1) % cv.size ()]);
		data[7] += dis2 (cv[i], cv[(i + 1) % cv.size ()]);
		data[8] = std::max (data[8], dis (cv[i], cv[(i + 1) % cv.size ()]));
		data[9] = std::min (data[9], dis (cv[i], cv[(i + 1) % cv.size ()]));

#ifdef LOCAL
double data[DATA][FT + 1];

void gen_data () {
	std::mt19937_64 mt (time (0));
	std::uniform_int_distribution <int> un (10, 100), uid (0, 1000000000);
	for (int cnt = 0; cnt < DATA; ++cnt) {
		int n = un (mt);
		std::vector <point> pp, cv; pp.resize (n);
		for (int i = 0; i < n; ++i) pp[i] = point (uid (mt), uid (mt));
		cv = convex_hull (pp); process (data[cnt], cv); data[cnt][FT] = log (n);

void train () {
	gen_data ();
	std::cout << "Generated." << std::endl;
	nt.train (data, DATA, 10, 0.01);
	std::cout << "Trained." << std::endl;
	double err = 0;
	gen_data ();
	for (int cnt = 0; cnt < DATA; ++cnt) {
		int ans = std::max (std::min ((int) round (exp (nt.predict (data[cnt]))), 100), 10), real = round (exp (data[cnt][FT]));
		err += std::abs (log (ans) - data[cnt][FT]);
	std::cout << "Error: " << std::fixed << std::setprecision (10) << err / DATA << "\n";
	std::cout << nt.to_string () << "\n";

//Error: 0.1931354718
const std::string str = "-1.3935330713026979 -1.6855399067812094 0.0080830627163314 -0.0456935204722265 -0.0134075027893312 -0.0225523456635180 -0.0063385628290917 -0.0539803586714664 0.0285615302448615 -0.0494911914462167 4.8156449644606729 1.4141506946578808 0.0609360500524733 -0.0003939198727380 0.0974573799765315 -0.0237353370691478 1.8823428209722266 -0.0253047473923868 -0.4320535560737181 -0.0465107808811315 3.9087941887190509 1.1605608005526673 -0.0055668920418296 0.0435006757439702 0.0168235313881814 0.0201162039464427 0.0005391585736898 -0.0281508733844095 0.0333519586644446 0.0138573643228493 -2.2584700896151499 -1.4482223377667529 -0.0087279280693970 -0.0067154696151271 0.0931324993850075 -0.0246202499367100 0.1202642385880566 0.0742084099697175 -0.0576728791494071 0.0048743269227391 0.7245500807425441 1.1451918581258713 1.5362660419598355 2.2247337510533445 -1.3361750107632056 -0.6202098232180651 2.6126102753644820 -2.6224614441649430 -1.6239623067583704 0.0103356949589122 2.3160496100308006 0.7983423380182993 2.7402717506451184 -0.4696450730651245 -5.9647497198434687 2.5913155713930167 -0.7492492475439734 -2.6902420797953686 -1.8925204583122668 4.1461581213116920 -1.3283582467170971 1.6545505567705860 -1.8337202201037788 5.9739259421367326 -0.0514667439768578 5.4697553418504725 0.4195708475958704 -8.2367354144820037 -2.7322241686093887 1.5109443918819332 3.4841054954004966 0.4931178741138817 -5.7924156962977023 1.3728086432563174 -3.0935727146312852 -2.1233242848314902 2.7215008891274142 -2.1566137088986088 -2.4630818715655334 -1.2871925059203080 -3.6940220894585978 -4.0749736367183296 0.4396854435207839 -0.5672193348210693 -1.8759334759019322 -1.8112273299081239 1.5777769313392334 0.6556212919805925 -0.6391849809511954 1549414420922584064.0000000000000000 10.0076592000000009 24937256.4471242018043995 24942299.7631161995232105 24933933.8955707997083664 24907107.2519301995635033 3354856368.0634231567382812 1588693346300253696.0000000000000000 716306964.7579888105392456 91336248.0137662440538406 3.8563280309383297 243867607280379392.0000000000000000 2.2982303054101396 34087367.3454539999365807 34090994.8023483082652092 34100829.3740548193454742 33993881.5109920650720596 263817168.3726958632469177 361286974354604864.0000000000000000 129441292.3767285346984863 67804188.7760041356086731 0.5986632816451091";

void run () { (str);
	std::ios::sync_with_stdio (0);
	int N; std::cin >> N;
	for (int i = 0; i < N; ++i) {
		int cnt; std::cin >> cnt;
		std::vector <point> cv (cnt, point ());
		for (int j = 0; j < cnt; ++j) std::cin >> cv[j].x >> cv[j].y;
		static double test[FT]; process (test, cv);
		std::cout << std::max (std::min ((int) round (exp (nt.predict (test))), 100), 10) << "\n";

int main () {
#ifdef LOCAL
	train ();
	run ();

As we can see, the core code of a neuron network takes only about 60 lines to finish (which I believe is the reason why I can never learn how to code the infamous link-cut tree — the implementation of my brain just got beaten by sheer line numbers), and its output as parameters is significantly less than most models. These are the reasons that I think the neuron network may be one popular ML methods for competitive programming.

Extension to other problems

I believe I have seen a few similar problems before, such as give you 10 numbers generated uniformly from [1, n] and let you guess what value n is. However, as of now I cannot find any more examples. Some help is appreciated if you happen to know some similar problems.


Now that hopefully you have known something about machine learning, do hurry to solve all the problems...Well, at least, before all the "nefarious" problem makers know the tricks and even find better features than you can.

P.S. What approach would you use to solve the "convex hull guessing" problem? Do you know any similar problems? Do let me know with comments, your help is most appreciated!

Update: Since some of you challenged that Petr's solution is specifically made for the judge's test cases, I compiled the code and tested it against my random data. It would appear that his solution achieves ~0.195 error, which is below the threshold and thus satisfies the requirement. For those interested I have pasted the original code at here. Be warned though, it is very long.

Read more »

  • Vote: I like it  
  • +132
  • Vote: I do not like it  

By Nisiyama_Suzune, 20 months ago, In English,

Originating from Project Euler, Dirichlet convolution saw its use in optimizing the problem to compute the partial sum of some specific multiplicative function. This article is aimed to introduce the exact technique applied.


This tutorial can be viewed as an extension of the previous tutorial, so I recommend to take a look at that one first.

Dirichlet convolution

Dirichlet convolution is a way to generate a new function from two functions. Specifically, the Dirichlet convolution of two functions f(x) and g(x) is:

We already know that one property of such convolution is that if f(x) and g(x) are all multiplicative, f * g(x) is multiplicative as well. Based on the property of the Möbius inversion , we can also claim that .

Optimization technique

Let's say that I wish to find the n-th partial sum of some multiplicative function f(x), i.e. I want to know the value of for a given n. Now let's pretend that I (miraculously) find a "magical function" g(x) such that both the partial sum of g(x) (denoted as sg(x)) and the partial sum of (denoted as ) are (miraculously) very easy to obtain in O(1) time complexity. I can then perform the following trick to compute sf(x).

Let's begin with the definition of the Dirichlet convolution:

We can assume that i = pd, and sum things via p and d.

Now we can split the part where d = 1.

We can see that sf(n) has surfaced. Let's move all the rest of the equation to the other hand.

According to the harmonic lemma introduced in the last article, only at most different elements exist for . We can thus recursively calculate the value of those sf(x), dividing the product into at most different segments, and sum them up in complexity.

What is the overall complexity of the algorithm? If we implement a hash table to store every sf(x) we computed, then with the integer division lemma (see last tutorial), it is not difficult to observe that only these values of sf(x) are computed. Since computing an sf(x) costs , the overall cost should be .

Upon now we have not used the interesting fact that f(x) is multiplicative. Come to think of it, we can actually use the linear sieve to pre-compute a few first elements of sf(x), omitting the need to process them recursively. Let's say that we pre-compute the first elements of sf(x). We can calculate the complexity

Apparently, when

Hence, the algorithm is optimized to .

The following code gives a brief idea on the implementation of the technique.

/*	Prefix sum of multiplicative functions :
		p_f : the prefix sum of f (x) (1 <= x <= th).
		p_g : the prefix sum of g (x) (0 <= x <= N).
		p_c : the prefix sum of f * g (x) (0 <= x <= N).
		th : the thereshold, generally should be n ^ (2 / 3).

struct prefix_mul {

	typedef long long (*func) (long long);

	func p_f, p_g, p_c;
	long long n, th;
	std::unordered_map <long long, long long> mem;

	prefix_mul (func p_f, func p_g, func p_c) : p_f (p_f), p_g (p_g), p_c (p_c) {}

	long long calc (long long x) {
		if (x <= th) return p_f (x);
		auto d = mem.find (x);
		if (d != mem.end ()) return d -> second;
		long long ans = 0;
		for (long long i = 2, la; i <= x; i = la + 1) {
			la = x / (x / i);
			ans = ans + (p_g (la) - p_g (i - 1) + mod) * calc (x / i);
		ans = p_c (x) - ans; ans = ans / inv;
		return mem[x] = ans;

	long long solve (long long n, long long th) {
		if (n <= 0) return 0;
		prefix_mul::n = n; prefix_mul::th = th;
		inv = p_g (1);
		return calc (n); 


Decoding the magic

Though we have finished discussing the algorithm, I have not yet shown how to find the "magic function". Unfortunately, there is no guarantee that such function exists in the first place. However, we can think about a question: how many functions are there that is very easy to compute the partial sum of it?


Certainly we know that , I(x) and Idk(x) are all trivial functions that have formulas for their partial sums, but there are hardly any other ones that come handy in the form of the Dirichlet convolution. The fact suggests that we can apply a trial-and-error method to test these trivial functions until one of them satisfies the condition that is also trivial.

Still, if we know some property of the function f(x), we might have some other ways to work around it.

For instance, consider the Möbius function μ(x), we have already noticed that , which tells us to take I(x) as the "magic function". Another example would be the Euler's totient function , we can prove that (with the "pk" method, i.e. show that is multiplicative first, and then compute ), so .

Example problems

Example 1. Find out the number of co-prime pairs of integer (x, y) in range [1, n].

Solution 1. We already know that

Now since we can calculate the partial sum of μ(d) faster, we are able to apply the harmonic lemma on the formula and optimize the loop to . Note that while we have to compute multiple partial sums of μ(d), the overall complexity is still if we do not clear the hash table, due to the fact that all values we plugged in is in the sequence , which are already computed while processing μ(n) anyway.

Example 2. Find out the sum of gcd(x, y) for every pair of integer (x, y) in range [1, n] (gcd(x, y) means the greatest common divisor of x and y).

Solution 2. Since

We just apply the optimization technique exactly the same as above.

Example 3. Find out the sum of lcm(x, y) for every pair of integer (x, y) in range [1, n] (lcm(x, y) means the least common multiple of x and y).

Solution 3. We know that

Now we just have to find a "magic function" for g(l). With a little luck, we can notice that Id(l) = g * Id2(l), and we just apply the optimization technique exactly the same as above.

(Should you be curious about the reason why Id(l) = g * Id2(l) holds true, the answer is still just the "pk" method: it is vital to show that g * Id2(l) is multiplicative, and simply calculating g * Id2(pk) will yield the result.)

Practice Problems



Counting Divisors (square)



Under some circumstances an implemented hash table might not be available. Fortunately, that does not mean you have to hand-code it for the optimization. With a little observation we can notice that all keys appearing in the hash table are in the form . Therefore, if we use i, i.e. as the hashing function, we can prove that and f(x) never collides.

Read more »

  • Vote: I like it  
  • +131
  • Vote: I do not like it  

By Nisiyama_Suzune, 20 months ago, In English,

This short article is mainly focused on the linear sieve and its various applications in competitive programming, including a brief introduction on how to pick out primes and a way to calculate multiple values of multiplicative functions.

Sieve of Eratosthenes

While this name may sound scary, the sieve of Eratosthenes is probably the simplest way to pick out all the primes in a given range from 1 to n. As we already know, one of the properties that all primes have is that they do not have any factors except 1 and themselves. Therefore, if we cross out all composites, which have at least one factor, we can obtain all primes. The following code demonstrates a simple implementation of the said algorithm:

std::vector <int> prime;
bool is_composite[MAXN];

void sieve (int n) {
	std::fill (is_composite, is_composite + n, false);
	for (int i = 2; i < n; ++i) {
		if (!is_composite[i]) prime.push_back (i);
		for (int j = 2; i * j < n; ++j)
			is_composite[i * j] = true;

As we can see, the statement is_composite[i * j] = true; crosses out all numbers that do have a factor, as they are all composites. All remaining numbers, therefore, should be prime. We then check for those primes and put them into a container named prime.

Linear sieve

It can be analyzed that the method above runs in complexity (with the Euler–Mascheroni constant, i.e. ). Let us take a minute to consider the bottleneck of such sieve. While we do need to cross out each composite once, in practice we run the inner loop for a composite multiple times due to the fact that it has multiple factors. Thus, if we can establish a unique representation for each composite and pick them out only once, our algorithm will be somewhat better. Actually it is possible to do so. Note that every composite q must have at least one prime factor, so we can pick the smallest prime factor p, and let the rest of the part be i, i.e. q = ip. Since p is the smallest prime factor, we have i ≥ p, and no prime less than p can divide i. Now let us take a look at the code we have a moment ago. When we loop for every i, all primes not exceeding i is already recorded in the container prime. Therefore, if we only loop for all elements in prime in the inner loop, breaking out when the element divides i, we can pick out each composite exactly once.

std::vector <int> prime;
bool is_composite[MAXN];

void sieve (int n) {
	std::fill (is_composite, is_composite + n, false);
	for (int i = 2; i < n; ++i) {
		if (!is_composite[i]) prime.push_back (i);
		for (int j = 0; j < prime.size () && i * prime[j] < n; ++j) {
			is_composite[i * prime[j]] = true;
			if (i % prime[j] == 0) break;

As is shown in the code, the statement if (i % prime[j] == 0) break; terminates the loop when p divides i. From the analysis above, we can see that the inner loop is executed only once for each composite. Hence, the code above performs in O(n) complexity, resulting in its name — the 'linear' sieve.

Multiplicative function

There is one specific kind of function that shows importance in the study of number theory — the multiplicative function. By definition, A function f(x) defined on all positive integers is multiplicative if it satisfies the following condition:

For every co-prime pair of integers p and q, f(pq) = f(p)f(q).

Applying the definition, it can be shown that f(n) = f(n)f(1). Thus, unless for every integer n we have f(n) = 0, f(1) must be 1. Moreover, two multiplicative functions f(n) and g(n) are identical if and only if for every prime p and non-negative integer k, f(pk) = g(pk) holds true. It can then be implied that for a multiplicative function f(n), it will suffice to know about its representation in f(pk).

The following functions are more or less commonly used multiplicative functions, according to Wikipedia:

  1. The constant function I(pk) = 1.
  2. The power function Ida(pk) = pak, where a is constant.
  3. The unit function ε(pk) = [pk = 1] ([P] is 1 when P is true, and 0 otherwise).
  4. The divisor function , denoting the sum of the a-th powers of all the positive divisors of the number.
  5. The Möbius function μ(pk) = [k = 0] - [k = 1].
  6. The Euler's totient function φ(pk) = pk - pk - 1.

It is interesting that the linear sieve can also be used to find out all the values of a multiplicative function f(x) in a given range [1, n]. To do so, we have take a closer look in the code of the linear sieve. As we can see, every integer x will be picked out only once, and we must know one of the following property:

  1. x is prime. In this case, we can determine the value of f(x) directly.
  2. x = ip and p does not divide i. In this case, we know that f(x) = f(i)f(p). Since both f(i) and f(p) are already known before, we can simply multiply them together.
  3. x = ip and p divides i. This is a more complicated case where we have to figure out a relationship between f(i) and f(ip). Fortunately, in most situations, a simple relationship between them exists. For example, in the Euler's totient function, we can easily infer that φ(ip) = pφ(i).

Since we can compute the function value of x satisfying any of the above property, we can simply modify the linear sieve to find out all values of the multiplicative function from 1 to n. The following code implements an example on the Euler's totient function.

std::vector <int> prime;
bool is_composite[MAXN];
int phi[MAXN];

void sieve (int n) {
	std::fill (is_composite, is_composite + n, false);
	phi[1] = 1;
	for (int i = 2; i < n; ++i) {
		if (!is_composite[i]) {
			prime.push_back (i);
			phi[i] = i - 1;					//i is prime
		for (int j = 0; j < prime.size () && i * prime[j] < n; ++j) {
			is_composite[i * prime[j]] = true;
			if (i % prime[j] == 0) {
				phi[i * prime[j]] = phi[i] * prime[j];	//prime[j] divides i
			} else {
				phi[i * prime[j]] = phi[i] * phi[prime[j]];	//prime[j] does not divide i

Extension on the linear sieve

Well, it might not always be possible to find out a formula when p divides i. For instance, I can write some random multiplicative function f(pk) = k which is difficult to infer a formula with. However, there is still a way to apply the linear sieve on such function. We can maintain a counter array cnt[i] denoting the power of the smallest prime factor of i, and find a formula using the array since . The following code gives an example on the function I've just written.

std::vector <int> prime;
bool is_composite[MAXN];
int func[MAXN], cnt[MAXN];

void sieve (int n) {
	std::fill (is_composite, is_composite + n, false);
	func[1] = 1;
	for (int i = 2; i < n; ++i) {
		if (!is_composite[i]) {
			prime.push_back (i);
			func[i] = 1; cnt[i] = 1;
		for (int j = 0; j < prime.size () && i * prime[j] < n; ++j) {
			is_composite[i * prime[j]] = true;
			if (i % prime[j] == 0) {
				func[i * prime[j]] = func[i] / cnt[i] * (cnt[i] + 1);
				cnt[i * prime[j]] = cnt[i] + 1;
			} else {
				func[i * prime[j]] = func[i] * func[prime[j]];
				cnt[i * prime[j]] = 1;

Example problems

The Embarrassed Cryptographer


Solution: Simply applying the linear sieve will be enough.

Farey Sequence


Solution: Notice that Δ F(n) = F(n) - F(n - 1) is multiplicative and Δ F(pk) = pk - pk - 1. Then the linear sieve will come in handy to solve the problem.

Read more »

  • Vote: I like it  
  • +226
  • Vote: I do not like it  

By Nisiyama_Suzune, 20 months ago, In English,

If you've ever taken some lessons on competitive programming, chances are that you have already heard about one of the most famous formula: the Möbius inversion. This article is aimed to provide some basic insight on what is the Möbius inversion, as well as how to apply it in various programming tasks.


If you are not familiar with the linear sieve and multiplicative functions, it is recommended that you read about them first here.

I will introduce some frequently used notations and lemmas first.


  1. [P] refers to the boolean expression, i.e. [P] = 1 when P is true, and 0 otherwise.
  2. x refers to rounding x down to the nearest integer. Thus refers to the integer division.
  3. d|n means that d can divide n (without a remainder).

The following functions are all multiplicative functions, where p is a prime number and k is a positive integer.

  1. The constant function I(pk) = 1.
  2. The identity function Id(pk) = pk.
  3. The power function Ida(pk) = pak, where a is constant.
  4. The unit function .
  5. The divisor function , denoting the sum of the a-th powers of all the positive divisors of the number.
  6. The Möbius function μ(pk) = [k = 0] - [k = 1].
  7. The Euler's totient function φ(pk) = pk - pk - 1.


I have some my unofficial names for these frequently used conclusions. If you happen to know any more commonly used name for them, you are more than welcome to tell me.

  • ( The integer division lemma ) For every positive integer p, q and r, .

Proof: We know that . Hence . Since the fraction part of cannot exceed , we achieve the conclusion.

  • ( The harmonic lemma ) Consider the harmonic sequence on integer division .
  1. The sequence is non-increasing, and there are at most different elements.
  2. The sum of the sequence is approximate to .

Proof: Denote . For every i not exceeding , there will be no more than values in the domain of d(i), so there can be at most different values for d(i). For the rest part of i greater than , we can infer that and is a positive integer, so there can be at most different values for d(i). Therefore we know that there are at most different elements in the sequence.

Since the Euler–Mascheroni constant states that , we know that , so the sum of the original sequence is approximate to .

Moreover, it is actually possible to exploit the property that the sequence has at most different elements, and write a loop that runs in complexity to iterate through every possible value of d(i), using the fact that the greatest integer x satisfying d(x) = d(i) is . The piece of code below demonstrates one way to program it.

for (int i = 1, la; i <= n; i = la + 1) {
	la = n / (n / i);
	//n / x yields the same value for i <= x <= la.
  • ( The transitive property of multiplicative functions )
  1. If both f(x) and g(x) are multiplicative, then h(x) = f(x)g(x) is also multiplicative.
  2. If both f(x) and g(x) are multiplicative, then their Dirichlet convolution is also multiplicative. Specifically, if f(x) is multiplicative, is also multiplicative.

The proof is more mathematical, which I will skip here.

What is the Möbius inversion?

According to Wikipedia, the Möbius inversion states that:

If for every positive integer n, then , where μ(x) is the Möbius function, which is multiplicative and satisfies f(1) = 1, f(p) =  - 1, f(pk) = 0 for any prime number p and any integer k ≥ 2. ( It is worth noting that you can pre-compute all values of the Möbius function from 1 to n using the linear sieve. )

However, the definition alone does not mean much (unless the problem statement explicitly gives you something like . In that case, well...). There is one important property that is probably more useful than the definition:

In order to prove this property, we have to use the transitive property of multiplicative functions to show that is multiplicative. After that, we can see f(1) = 1 and f(pk) = 0, which means f(x) = ε(x). Now you may ask: why is this property important? I think it is best to show the reasons in some basic examples below.

Example 1. Find out the number of co-prime pairs of integer (x, y) in range [1, n].

Solution 1. Notice that the question is the same as asking you the value of

Now apply the Möbius inversion on [gcd(i, j)] = 1, we have

which is the same as

Notice that [d|gcd(i, j)] = [d|i][d|j]. Therefore

We can change the order of summing things up, so

We know that . Thus

Then we can simply loop through 1 to n to compute the formula. While we can optimize this loop to using the harmonic lemma, we still have to use the linear sieve to pre-compute the values of μ(d). Therefore, the overall complexity of the algorithm is O(n).

Example 2. Find out the sum of gcd(x, y) for every pair of integer (x, y) in range [1, n] (gcd(x, y) means the greatest common divisor of x and y).

Solution 2. Similar as above, notice that the question is the same as asking you the value of

Let k = gcd(i, j). We can then loop for k first.

Now assume i = ak, j = bk, and then

It can be observed that the last part of the formula is the same as the one in example 1. Applying the same procedure, we have

According to the harmonic lemma, . Therefore, if we compute the sum above with brute force, the overall complexity will be .

This is, however, not the best complexity we can achieve with the Möbius inversion. Now let l = kd, and we can loop for l first.

Applying the transitive property of multiplicative functions, we can see that is multiplicative. Moreover, we have g(pk) = pk - pk - 1. If you have somewhat studied about the number theory, you may figure out that g(l) is actually the Euler's totient function. Regardless of whether you know about the coincidence or not, g(l) can be pre-computed with the linear sieve, and , which can be simply computed in O(n) complexity.

Example 3. Find out the sum of lcm(x, y) for every pair of integer (x, y) in range [1, n] (lcm(x, y) means the least common multiple of x and y).

Solution 3. Well, we should be pretty familiar with the techniques by now. First of all, the answer should be

Since , let k = gcd(i, j). We can then loop for k first.

Now assume i = ak, j = bk, and then

Now let's proceed to assume a = dp, b = dq, then

Now notice , and we have

Now following the example above, let l = kd, and then

Notice that is also multiplicative, and g(pk) = pk - pk + 1. We can pre-compute the values of g(l) via the linear sieve, and . The overall complexity will be O(n).

Practice Problems

Simple Sum

(Thanks _threat_ for showing me this one!)




Sky Code



It is known that all examples above can be further optimized to . The exact technique will (probably) be introduced in the upcoming article.

Update: The article introducing the optimization is ready here.

Finally, thanks Tommyr7 for his effort in the article!

Read more »

  • Vote: I like it  
  • +202
  • Vote: I do not like it