## arc4random_uniform and avoiding modulo bias when using a random number generator 2016-07-30

Using the `arc4random_uniform` function is recommend over using `arc4random`: Former is advertised *as not having “modulo bias”* issues, see `man arc4random`.

(So I got curious, searched the web a bit, found this blog post, this answer and this link to the source code of one version used by Apple’s. It took me a bit to figure things out but when I got it I was surprised how simple the solution is and the explanation could be. So here is my take at a better explanation.)

Say we have a random number generator that produces numbers from 0 to 7 (i.e. 8 different values). Now in our application, we need 3 different values (i.e. 0 to 2, for simplicity). So the maximum source value `SRC_MAX` is 7, the maximum destination value `DST_MAX` is 2.

If we calculate random values like

r = source_generator() % (DST_MAX + 1);

we end up with 0s and 1s a lot more than 2s — some people call that *modulo bias*, because there is bias for/against certain values.

In the semi-visual world, the scenario looks like this (“XXX” marking troublemakers, i.e. values causing bias):

/---------\ /---------\ /---------\ | | | | | | |XXX|XXX| 0 1 DST_ SRC_ MAX MAX

Now to solve the modulo bias one could pick *some way* so that troublemakers are skipped, not taken into account. With troublemakers removed, we return to uniform distribution. To do that, when we hit a troublemaker we just *roll the dice again* until we no longer have a troublemaker. If our source generator is producing uniformly distributed output, that’s guaranteed to terminate, quickly. In code, it could be a loop like this:

for (;;) { src = source_generator(); // [0 .. SRC_MAX] if (... no skip needed ...) break; } return src % (DST_MAX + 1); // [0 .. DST_MAX]

How many (and which values need to be skipped?

The number of troublemakers calculates as

trouble_count = (SRC_MAX + 1) % (DST_MAX + 1);

e.g. 2 for our case because (7+1) % (2+1) = 8 % 3 = 2 because 2*3 + 2 = 8.

Our input is a range ([0 .. `SRC_MAX`]) and our output is a (smaller) range ([0 .. `DST_MAX`]).

There are two easy places where to skip values: *(a) at the beginning* or *(b) at the end*.

a) Beginning > for (;;) { /---------\ /---------\ > src = source_generator(); |XXX|XXX| | | | | | | > if (src >= trouble_count) 0 1 DST_ SRC_ > break; MAX MAX > } > return src % (DST_MAX + 1); b) End > for (;;) { /---------\ /---------\ > src = source_generator(); | | | | | | |XXX|XXX| > if (src <= SRC_MAX - trouble_count) 0 1 DST_ SRC_ > break; MAX MAX > } > return src % (DST_MAX + 1);

To me (b) seems more intuitive, the `arc4random_uniform` sources I looked at went for (a).

Now the only tricky part left is how to calculate

trouble_count = (SRC_MAX + 1) % (DST_MAX + 1);

for `SRC_MAX` = `0xffffffff` = 2³²-1 without `(SRC_MAX + 1)` wrapping around to 0. I would go with

trouble_count = (SRC_MAX - (DST_MAX + 1) + 1) % (DST_MAX + 1);

The idea is that `(k - n) % n` equals `k % n` as we are dealing with modulus.

That’s all.

## Leave a Reply

You must be logged in to post a comment.