Randomness

Rolling the dice is good and bad.

Apr 27, 2023

Randomness is a double edge sword. It can be used to simulate interesting behaviors in simulated environments, generate synthetic data for testing an algorithm, or provide the quintessential structure for developing artificial intelligence. However, randomness is...random. As such, any numbers that are generated randomly are typically nonrecoverable during reproduction. As such, let's talk about how a random number is generated, how to make the numbers deterministic, and what to do when you cannot have a deterministic algorithm.

Pseudorandom Number Generation (PRNG)

First, when somebody mentions using a random number generator within their code, they are not talking about independently generated numbers. Instead, they are typically talking about a number generator which produces a sequence of numbers which seems random, but is deterministically chosen by some algorithm. This is known as a pseudorandom number generator, or PRNG for short. Let's take a look at one such pseudorandom algorithm.

const int A = 4;
const int C = 1;
const int M = 9;
int state = 0;
int prng(int* seed) {
    if (seed) {
        state = *seed;
    }
    state = (A * state + C) % M;
    return state;
}

This is a PRNG known as a linear congruential generator (LCG). First, it takes in some sort of initial value, known as the seed. Then, it updates the internal state using the previous state. Finally, it returns the new state. In the case above, `A`, `C`, and `M` can be tweaked to be any value, provided `M > 0`, `0 < A < M`, `0 <= C < M`, and `0 <= seed < M`. This will give a random sequence of numbers that will eventually cycle back to the start of the sequence (e.g., 0, 1, 5, 3, 4, 8, 6, 7, 2, 0, 1, ...). The best PRNG algorithms should make the cycle as long as possible.

Of course, most languages already have some form of PRNG implemented. For example, Python has the `random` library:

import random

# Generate random integer [1, 100]
print(random.randint(1, 100))

R has methods like `runif` or `sample`:

# Generate a random integer [1, 100]
sample(1:100, 1, replace=TRUE)

C has `rand` from the standard library:

#include <stdlib.h>
#include <stdio.h>

int main() {
    // Generate a random integer [1, 100]
    printf("%d", (rand() % 100) + 1);
}

Seeding the Random Values

Now, how do we make the values deterministic for every run? Well, remember that seed parameter in the random method? That indicates the starting state which is used to generate all subsequent random numbers. Therefore, if we set the seed to a constant value, we can expect that the random number sequence generated will always be the same.

Once again languages and libraries have their own method of doing so. Python has a `seed` method:

import random

random.seed(0)

# Generate random integer [1, 100] (Will always generate 50)
print(random.randint(1, 100))

R has `set.seed`:

set.seed(0)

# Generate a random integer [1, 100] (Will always generate 14)
sample(1:100, 1, replace=TRUE)

C has `srand`:

#include <stdlib.h>
#include <stdio.h>

int main() {
    srand(0);

    // Generate a random integer [1, 100] (Will always generate 84)
    printf("%d", (rand() % 100) + 1);
}

When reporting on results of a study or providing a demonstration, you should always seed your PRNGs such that the results can be reproduced. In more production based applications for average users, you should have a way to set the seed when using a PRNG.

Thread-Safe PRNG

What if you wanted to use a random value in concurrent operations? Well, let's answer a simpler question first: how do you make a PRNG thread-safe?

Well, the first thing would be to make the PRNG itself thread-safe. The issue with this is that we cannot guarantee that two threads accessing the PRNG will be in the same order. We could guarantee that the internal method is seeded to a static value, but then the randomness is not very random.

Well, what if we put a lock on the call to the random method itself? That also falls apart due to a similar issue which we cannot guarantee that two threads accessing the PRNG will be in the same order.

So, the best case scenario is to create a new random for each thread that it can use to generate its own random numbers. We can seed the values using the original seed multiplied with the thread number. This is known as a thread local random and is typically implemented in some capacity in languages that can handle concurrency.

Python can create a new instance of `random.Random` for this:

from random import Random
from multiprocessing import Pool

# Generate random integer [1, 100]
def do_math(rand):
    return rand.randint(1, 100)

initial_seed = 23
processes = 4
randoms = []

# Add randoms, offsetting the seed
for i in range(4):
    randoms.append(Random(initial_seed * (i + 1)))

# Give each process its own seeded random
with Pool(processes) as pool:
    print(pool.map(do_math, randoms))

Java, meanwhile, can hold a `Random` in a `ThreadLocal` value:

public static ThreadLocal<Random> rand(final long seed) {
    return ThreadLocal.withInitial(() -> new Random(seed));
}

Seeds are not Hyperparameters

Although seeds are great for reproducing random numbers, they are also a double edged sword. For example, let's say I wanted simulate a random 20-sided die roll 10 times and find the highest possible sum. The chance of getting the highest value 2000 is `9.8 x 10^-14`. You're more likely to win the lottery while being struck by lighting at the same time. However, with a seeded random number generator, you could theoretically get ridiculously close or exactly the highest possible value.

As such, picking the initial seed value be completely random without any tuning from the user. Additionally, the reported precision of the result should not be affected by the variance in the seeded value.

True Random Number Generator (TRNG)

One way to generate a seed value is by using a true random number generator, or TRNG. TRNGs are random numbers generated using a physical device instead of programmatically. In theory, by using noise signals for some sort of microscopic phenomena, the values generated should be completely unpredictable. As such, you can use the value generated from the TRNG as a seed for the PRNG.

While this approach mitigates some issues, it does not solve all of them. Since you are only performing one run, you could be lucky or unlucky with your measure value being higher or lower than average.

Variance < Precision

Another thing to account for is the variance in your runs. This can typically be done by running the algorithm multiple times using different, random seeds. Using that sampled distribution, you can then take an average and measure the confidence interval. This is actually a form of cross-validation, although in this case we are testing the seed and not using subsets of the data. I would recommend running the algorithm at least 60 times for validation.

My general opinion is that the variance of the value between seeds should be less than the reported precision. The better your algorithm is, the less variance there should be between reproductions without seeding the value.

Hopefully, now you have a better understanding how of randomness works along with reproducibility efforts and pitfalls to avoid.

Some Additional Thoughts

Libraries Have Different Random Parameters

Currently, I only mentioned random numbers generated by the standard library of a language. However, other libraries may have their own implementation of randomness or non-deterministic behavior. For example, Python's NumPy has a different random which you can set the seed for:

import numpy as np

# Set the seed
np.random.seed(0)

# Generate random number
print(np.random.random())

Always be sure to check the documentation of whatever library you are using such that you are not blindsided in the future.

I'm thinking about documenting libraries and their random implementations for reproducibility in the future, so I'll probably follow up here with a repository at some point.

Let's Discover Open Science and Reproducibility