arc4random_uniform and avoiding modulo bias when using a random number generator
Using the arc4random_uniform
function is recommend over using arc4random
:
Former is advertised as not having "modulo bias" issues, see man
arc4random
. (So I got curious, searched the web a bit, found
this blog post,
this answer and
this link to the source code of
one version used by Apple's.
It took me a bit to figure things out but when I got it I was surprised how
simple the solution is and the explanation could be. So here is my take at a
better explanation.) Say we have a random number generator that produces
numbers from 0 to 7 (i.e. 8 different values). Now in our application, we need
3 different values (i.e. 0 to 2, for simplicity). So the maximum source value
SRC_MAX
is 7, the maximum destination value DST_MAX
is 2. If we calculate
random values like
r = source_generator() % (DST_MAX + 1);
we end up with 0s and 1s a lot more than 2s — some people call that modulo bias , because there is bias for/against certain values. In the semi-visual world, the scenario looks like this ("XXX" marking troublemakers, i.e. values causing bias):
/---------\ /---------\ /---------\ | | | | | | |XXX|XXX| 0 1 DST_ SRC_ MAX MAX
Now to solve the modulo bias one could pick some way so that troublemakers are skipped, not taken into account. With troublemakers removed, we return to uniform distribution. To do that, when we hit a troublemaker we just roll the dice again until we no longer have a troublemaker. If our source generator is producing uniformly distributed output, that's guaranteed to terminate, quickly. In code, it could be a loop like this:
for (;;) { src = source_generator(); // [0 .. SRC_MAX] if (... no skip needed ...) break; } return src % (DST_MAX + 1); // [0 .. DST_MAX]
How many (and which values) need to be skipped? The number of troublemakers calculates as
trouble_count = (SRC_MAX + 1) % (DST_MAX + 1);
e.g. 2 for our case because (7+1) % (2+1) = 8 % 3 = 2 because 2*3 + 2 = 8. Our
input is a range ([0 .. SRC_MAX
]) and our output is a (smaller) range ([0
.. DST_MAX
]). There are two easy places where to skip values:
(a) at the beginning or (b) at the end.
a) Beginning > for (;;) { /---------\ /---------\ > src = source_generator(); |XXX|XXX| | | | | | | > if (src >= trouble_count) 0 1 DST_ SRC_ > break; MAX MAX > } > return src % (DST_MAX + 1); b) End > for (;;) { /---------\ /---------\ > src = source_generator(); | | | | | | |XXX|XXX| > if (src <= SRC_MAX - trouble_count) 0 1 DST_ SRC_ > break; MAX MAX > } > return src % (DST_MAX + 1);
To me (b) seems more intuitive, the
arc4random_uniform
sources I looked at
went for (a). Now the only tricky part left is how to calculate
trouble_count = (SRC_MAX + 1) % (DST_MAX + 1);
for SRC_MAX
= 0xffffffff
= 2³²-1 without (SRC_MAX + 1)
wrapping around
to 0. I would go with
trouble_count = (SRC_MAX - (DST_MAX + 1) + 1) % (DST_MAX + 1);
The idea is that (k - n) % n
equals k % n
as we are dealing with modulus.
That's all.