neal's blog

By neal, 2 months ago, In English,

Don't use rand(). Why? Let's jump right into some code. What value will the following code print, approximately?

#include <cstdlib>
#include <iostream>
using namespace std;

const int ITERATIONS = 1e7;

int main() {
    double sum = 0;

    for (int i = 0; i < ITERATIONS; i++)
        sum += rand() % 1000000;

    cout << "Average value: " << sum / ITERATIONS << '\n';
}

Should be about 500,000, right? Turns out it depends on the compiler, and on Codeforces it prints 16382, which isn't even close. Try it out yourself.

What's happening here?

If you look up C++ documentation on rand(), you'll see that it returns "a pseudo-random integral value between 0 and RAND_MAX." Click again on RAND_MAX and you'll see that "This value is implementation dependent. It's guaranteed that this value is at least 32767." On the Codeforces machines, it turns out RAND_MAX is exactly 32767. That's so small!

It doesn't stop there though; random_shuffle() also uses rand(). Recall that in order to perform a random shuffle, we need to generate random indices up to n, the size of the array. But if rand() only goes up to 32767, what happens if we call random_shuffle() on an array with significantly more elements than that? Time for some more code. What would you expect the following code to print?

#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

const int N = 3000000;

double average_distance(const vector<int> &permutation) {
    double distance_sum = 0;

    for (int i = 0; i < N; i++)
        distance_sum += abs(permutation[i] - i);

    return distance_sum / N;
}

int main() {
    vector<int> permutation(N);

    for (int i = 0; i < N; i++)
        permutation[i] = i;

    random_shuffle(permutation.begin(), permutation.end());
    cout << average_distance(permutation) << '\n';
}

This computes the average distance that each value moves in the random shuffle. If you work out a bit of math, you'll find that the answer on a perfectly random shuffle should be = 1,000,000. Even if you don't want to do the math, you can observe that the answer is between = 1,500,000, the average distance for index 0, and = 750,000, the average distance for index .

Well, once again the code above disappoints; it prints out 64463. Try it yourself. In other words, random_shuffle() moved each element a distance of 2% of the length of the array on average. Based on my testing, the implementation of random_shuffle() on Codeforces matches the following exactly:

    for (int i = 1; i < N; i++)
        swap(permutation[i], permutation[rand() % (i + 1)]);

So naturally if RAND_MAX is much less than N, this shuffle will be problematic.

rand() itself has more quality problems than just RAND_MAX being small though; it is typically implemented as a relatively simple linear congruential generator. On the Codeforces compiler, it looks like this:

static long holdrand = 1L;

void srand(unsigned int seed) {
  holdrand = (long) seed;
}

int rand() {
  return (((holdrand = holdrand * 214013L + 2531011L) >> 16) & 0x7fff);
}

In particular, linear congruential generators (LCGs) suffer from extreme predictability in the lower bits. The k-th bit (starting from k = 0, the lowest bit) has a period of 2k + 1 (i.e., how long until the sequence takes to repeat). So the lowest bit has a period of just 2, the second lowest a period of 4, etc. This is why the function above discards the lowest 16 bits, and the resulting output is at most 32767.

What's the solution?

Don't worry, as of C++11 there are much better random number generators available in C++. The only thing you need to remember is to use mt19937, included in the <random> header. This is a Mersenne Twister based on the prime 219937 - 1, which also happens to be its period. It's a much higher-quality RNG than rand(), in addition to being much faster (389 ms to generate and add 108 numbers from mt19937 in Custom Invocation, vs. 1170 ms for rand()). It also produces full 32-bit unsigned outputs between 0 and 232 - 1 = 4294967295, rather than maxing out at a measly 32767.

To replace random_shuffle(), you can now call shuffle() and pass in your mt19937 as the third argument; the shuffle algorithm will use your provided generator for shuffling.

C++11 also gives you some nifty distributions. uniform_int_distribution gives you perfectly uniform numbers, without the bias of mod -- i.e., rand() % 10000 is more likely to give you a number between 0 and 999 than a number between 9000 and 9999, since 32767 is not a perfect multiple of 10000. There are many other fun distributions as well including normal_distribution and exponential_distribution.

To give you a more concrete idea, here's some code using several of the tools mentioned above. Note that the code seeds the random number generator using a high-precision clock. This is important for avoiding hacks specifically tailored to your code, since using a fixed seed means that anyone can determine what your RNG will output. One last thing: if you want 64-bit random numbers, just use mt19937_64 instead.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
using namespace std;

const int N = 3000000;

double average_distance(const vector<int> &permutation) {
    double distance_sum = 0;

    for (int i = 0; i < N; i++)
        distance_sum += abs(permutation[i] - i);

    return distance_sum / N;
}

int main() {
    mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
    vector<int> permutation(N);

    for (int i = 0; i < N; i++)
        permutation[i] = i;

    shuffle(permutation.begin(), permutation.end(), rng);
    cout << average_distance(permutation) << '\n';

    for (int i = 0; i < N; i++)
        permutation[i] = i;

    for (int i = 1; i < N; i++)
        swap(permutation[i], permutation[uniform_int_distribution<int>(0, i)(rng)]);

    cout << average_distance(permutation) << '\n';
}

Both shuffles result in almost exactly 106 average distance, like we originally expected.

Additional References

This post was inspired in part by Stephan T. Lavavej's talk "rand() Considered Harmful": https://channel9.msdn.com/Events/GoingNative/2013/rand-Considered-Harmful

If you want even faster, higher-quality random number generators, take a look at this site by Sebastiano Vigna.

 
 
 
 
  • Vote: I like it  
  • +891
  • Vote: I do not like it  

»
2 months ago, # |
  Vote: I like it +27 Vote: I do not like it

Thanks for an interesting and well-written post. Since we are on Codeforces, it would be nice to also see a real competitive programming scenario where the good old rand()%N doesn't work but mt19937 works. While your post shows strong evidence that rand()%N is bad, does this really matter in competitive programming?

  • »
    »
    2 months ago, # ^ |
      Vote: I like it +32 Vote: I do not like it

    An example would be any problem which requires random numbers of the order 100 000 or greater. Here's one such problem (from local training in Russian, so no public contest link, sorry!):

    The jury has a string of 2n letters (1 ≤ n ≤ 100 000), fixed in advance in each test. Of these, exactly n are as, and exactly n are bs. The contestant known only the value of n, and has to find any letter a in the string. For that, the contestant can interactively ask up to 100 questions of the form “what is the letter at position p?”, and the jury gives back the letter. As soon as it is an a, the contestant's solution passes the test.

    Sounds simple, right? But the actual number of incorrect attempts is usually considerable, and a good portion of them is the result of using rand() (the judge is using Windows and MinGW / Visual Studio, so the upper bound is 32 767, as explained in the post).

  • »
    »
    2 months ago, # ^ |
      Vote: I like it +26 Vote: I do not like it

    I recall several problems requiring "pick random element" idea, where you can easily get WA if you aren't really picking random element.

    Example: you have N lines, no two of them are parallel, you know that there is a point P such that at least N/2 lines pass through that point, your task is to find this point.

    You'd want to take random pair of lines and check if their intersection is a solution, right? Sounds like 1/4 chance of guessing it for each try. Or zero chance, if the test has been build in such a way that all "good" lines have IDs larger than 32767 and you don't do any random_shuffle on the input (because why would you?..)

  • »
    »
    2 months ago, # ^ |
      Vote: I like it +18 Vote: I do not like it

    Sure, fair question. While the mediocre results of random_shuffle will often still be good enough for you to pass a problem (case in point: 39639107), one big situation where you really need good randomization in your code is when doing hashing.

    When hashing you don't want to use fixed values, as someone can find a hack for your code fairly easily using a collision attack.

    If you choose your multiplication base using rand(), you'll be limiting yourself to 32767 possible values which isn't ideal. Worse, if you choose your hashing mod using rand(), you'll be stuck with a very small mod value and extremely prone to collisions on big test cases.

    Even if you look for a mod by generating, e.g., 1000000000 + rand() and repeating until prime, this only leaves you with about 1600 primes that you could end up with. It's actually very doable to perform a separate collision attack on each of those primes and concatenate all of the results into a single monster string that breaks this code. Especially so in a contest with an open hacking phase.

    Another common practice when using rand() is to do srand(time(NULL)). This is another vulnerability because time(NULL) only changes once per second, making it quite easy to hack again: just run collision attacks for each second in a 60-second range, and then submit the hack in those 60 seconds.

    So in summary, you may be able to get away with using rand(). But don't do it. Just use mt19937.

    • »
      »
      »
      2 months ago, # ^ |
        Vote: I like it +18 Vote: I do not like it

      Here's an example of a virtually unhackable solution of mine using hashing with mt19937_64 and a high-precision clock: 42357469.

  • »
    »
    2 months ago, # ^ |
      Vote: I like it +26 Vote: I do not like it

    AIM Tech Round 4 Div. 1 B. A lot of solutions with rand()%N and random_shuffle failed on this problem.

  • »
    »
    4 weeks ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    https://codeforces.com/contest/1039/problem/B This is the perfect example for a problem where rand() doesn't work but mt19937 rng(chrono::steady_clock::now().time_since_epoch().count()) does.

»
2 months ago, # |
  Vote: I like it +44 Vote: I do not like it

Another advantage of C++ random generators over rand() is that they are independent objects instead of single static function. This comes in handy in the following case: suppose that you write a solution that uses random somehow and a random test generator in the same file (extremely useful in interactive problems, but in my opinion it is better than usual bash/bat scripts for classical problems also). Then you can create two separate instance of RNG to be used by your solution and by test generator respectively, this way changes in the solution and test generation don't influence each other, which makes testing and debugging easier.

»
2 months ago, # |
  Vote: I like it +29 Vote: I do not like it

OMG yesterday my team got many WAs in Problem F NWERC 2014 because of this issue. fixed the rand and got AC. Thank you so much.

»
2 months ago, # |
  Vote: I like it +12 Vote: I do not like it

You should at least mention the most common workaround for the original problem: instead of rand() % 1000000 use rand()*rand() % 1000000 or (rand() * RAND_MAX + rand()) % 1000000

This is enough in 99% of those few cases when just rand() isn't enough

  • »
    »
    2 months ago, # ^ |
      Vote: I like it +47 Vote: I do not like it

    Second version is working almost perfectly (though to fully avoid the same effect that makes the first version bad it is better to change it to (rand() * (RAND_MAX + 1) + rand()) % (int)1e6), however the first one is a complete disaster unless you are fine with some results having probability more than 250 times higher than some other results: Simulation here, it times out on ideone, but you can run it locally.

    Output

    And of course, you need to run code like this only on Windows testing systems (if RAND_MAX  = 231 - 1, both versions are technically undefined behaviour and sometimes yield negative numbers anyway), so you need to also write a platform-dependent ifdef or something like this, increasing the amount of code even more.

    • »
      »
      »
      2 months ago, # ^ |
      Rev. 2   Vote: I like it +5 Vote: I do not like it

      A solution to the overflow problem is casting rand() to an unsigned type because in unsigned types overflow is not an undefined behaviour, even though this is a bit akward.

      Also another option to generate a random numbe would be:

      ((rand()<<15) + rand())%MAXN
      

      This is good because as you probably know << is faster than * (although this does not matter much as % will take the most time I think). What I said is correct because RAND_MAX in codeforces is 2^15-1 and in fact if MAXN and RAND_MAX are both powers of 2 (which is the case of codeforces) I think the results of this adaptation are equiprobable. I also suspect that RAND_MAX is a power of 2 most of the time (as you said it is the case in Windows).

»
2 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Thank you so much!

I was really looking into generating random numbers and random shuffles while creating my first contest. This helps a lot.

  • »
    »
    2 months ago, # ^ |
      Vote: I like it +4 Vote: I do not like it

    Use functions from testlib.h if you create a contest. https://github.com/MikeMirzayanov/testlib

    • »
      »
      »
      2 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Thank you. It's a good idea though to know what is happening behind the scenes of testlib.h

      • »
        »
        »
        »
        2 months ago, # ^ |
          Vote: I like it +3 Vote: I do not like it

        Yes, it is. Looking through it and reading important functions (given in the very first comment) will take you around 10 minutes.

»
2 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Can anyone tell, is it good idea to use such random functions?

Spoiler
  • »
    »
    2 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Tagging neal, I_love_Tanya_Romanova, Errichto and Gassa with all respect, hoping to get answer :)

  • »
    »
    2 months ago, # ^ |
      Vote: I like it +23 Vote: I do not like it

    Bad idea

  • »
    »
    2 months ago, # ^ |
    Rev. 2   Vote: I like it +5 Vote: I do not like it

    The only advantage is that you can generate long long's now (some places like CF being exception, where rand() returns a number up to 216). Disadvantages: smaller cycle (you will quicker get the same values), possibly losing some pseudo-random properties, slower program, you're doing something you don't understand. Read the blogs by neal.