arc4random_uniform and avoiding modulo bias when using a random number generator 2016-07-30

Using the arc4random_uniform function is recommend over using arc4random: Former is advertised as not having “modulo bias” issues, see man arc4random.
(So I got curious, searched the web a bit, found this blog post, this answer and this link to the source code of one version used by Apple’s. It took me a bit to figure things out but when I got it I was surprised how simple the solution is and the explanation could be. So here is my take at a better explanation.)

Say we have a random number generator that produces numbers from 0 to 7 (i.e. 8 different values). Now in our application, we need 3 different values (i.e. 0 to 2, for simplicity). So the maximum source value SRC_MAX is 7, the maximum destination value DST_MAX is 2.

If we calculate random values like

r = source_generator() % (DST_MAX + 1);

we end up with 0s and 1s a lot more than 2s — some people call that modulo bias, because there is bias for/against certain values.
In the semi-visual world, the scenario looks like this (“XXX” marking troublemakers, i.e. values causing bias):

 

 /---------\ /---------\ /---------\
|   |   |   |   |   |   |XXX|XXX|
 0   1   DST_                SRC_
         MAX                 MAX

 

Now to solve the modulo bias one could pick some way so that troublemakers are skipped, not taken into account. With troublemakers removed, we return to uniform distribution. To do that, when we hit a troublemaker we just roll the dice again until we no longer have a troublemaker. If our source generator is producing uniformly distributed output, that’s guaranteed to terminate, quickly. In code, it could be a loop like this:

for (;;) {
  src = source_generator();  // [0 .. SRC_MAX]
  if (... no skip needed ...)
    break;
}
return src % (DST_MAX + 1);  // [0 .. DST_MAX]

 

How many (and which  values need to be skipped?

The number of troublemakers calculates as

trouble_count = (SRC_MAX + 1) % (DST_MAX + 1);

e.g. 2 for our case because (7+1) % (2+1) = 8 % 3 = 2 because 2*3 + 2 = 8.
Our input is a range ([0 .. SRC_MAX]) and our output is a (smaller) range ([0 .. DST_MAX]).
There are two easy places where to skip values: (a) at the beginning or (b) at the end.

 

a) Beginning
                                    >   for (;;) {
         /---------\ /---------\    >     src = source_generator();
|XXX|XXX|   |   |   |   |   |   |   >     if (src >= trouble_count)
 0   1   DST_                SRC_   >       break;
         MAX                 MAX    >   }
                                    >   return src % (DST_MAX + 1);

b) End
                                    >   for (;;) {
 /---------\ /---------\            >     src = source_generator();
|   |   |   |   |   |   |XXX|XXX|   >     if (src <= SRC_MAX - trouble_count)
 0   1   DST_                SRC_   >       break;
         MAX                 MAX    >   }
                                    >   return src % (DST_MAX + 1);

 

To me (b) seems more intuitive, the arc4random_uniform sources I looked at went for (a).

Now the only tricky part left is how to calculate

trouble_count = (SRC_MAX + 1) % (DST_MAX + 1);

for SRC_MAX = 0xffffffff = 2³²-1 without (SRC_MAX + 1) wrapping around to 0. I would go with

trouble_count = (SRC_MAX - (DST_MAX + 1) + 1) % (DST_MAX + 1);

The idea is that (k - n) % n equals k % n as we are dealing with modulus.

That’s all.

Leave a Reply

You must be logged in to post a comment.