Chapter 33 Setting the Seed of the Random Number Generator

Welcome back to Quantitative Reasoning! In our previous tutorial, we learned how to generate random permutations and random samples. Because the outcome is random, we have no guarantee that the next permutation or sample is exactly the same as the previous one. This feature of randomness can make the life of a programmer difficult. If computer code doesn’t lead to reproducible results, how can we find out whether two different versions of the code are functionally equivalent? For example, last time I claimed that these two lines of code are synonymous. They only differ in the last argument: prob = c(0.75, 0.25) versus prob = c(3, 1).

sample(c("H", "T"), size = 20, replace = TRUE, prob = c(0.75, 0.25))
sample(c("H", "T"), size = 20, replace = TRUE, prob = c(3, 1))

When we run the commands without further interventions, the sequences are random, so we can’t count on the second command to generate exactly the same sequence as the first command. How can we confirm that the second command, if we had run it first, really would have produced the same result as the first command?

Because reproducible computer code is essential for detecting and eliminating bugs, computers don’t produce unpredictable sequences of random numbers. We won’t go into the details of random number generation, which is a complex problem that requires detailed knowledge about number theory and the internal operations of a computer. The default random number generator in R is called the Mersenne Twister. If you’re curious about the inner workings of the Mersenne Twister, you can find the research paper that introduced the algorithm at the URL below this video (http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/mt.pdf), but be warned that the authors assume that readers have taken specialised upper-level maths and computer science courses. Instead of going through the details of the algorithm, let’s develop a high-level mental model. Although I won’t paint a fully accurate picture of the actual implementation, the model can help us to understand some of the main features of computer-generated random numbers.

When we start a new R session, R generates a long sequence of random bits (i.e. numbers that can only be either 0 or 1) behind the scenes. The sequence is determined by a combination of the date and time when you start the R session and by the session’s process ID, which is an arbitrary number assigned by your operating system. When we ask R to produce the first random object (e.g. by calling rnorm() or sample()), R reads the first few bits in the sequence and algorithmically transforms them into the output we requested (e.g. a number or a character). When we ask for the second random object, R generates output based on the next few bits and so on. Don’t worry that we’ll run out of random bits. The sequence is more than \(10^{6000}\) bits long. For comparison, there are approximately \(10^{80}\) atoms in the universe, so we really have lots of random bits in the sequence and will probably never get to the end of it.

For reproducible code, we would need to go back to the beginning of the random bit sequence and rerun the same functions as before. With a lot of detective work, it’s in principle possible to find out which bit sequence R used when we started the R session. However, this effort isn’t really worth our time. Instead, we’re going to replace the current bit sequence with another one that allows us to easily jump back to the beginning. We tell R to begin a new bit sequence with the function set.seed(). We usually pass one argument, called the “seed” of the random number generator, to set.seed(). Seeds must be integers between \(-2^{31}+1\) and \(2^{31}-1\), so we can choose from more than 4 billion options.

set.seed(-1663809373)

We can think of the seed as the call number of a book in a library. The library has more than 4 billion books in its collection, and each book contains a random sequence with more than \(10^{6000}\) bits. When we run set.seed(-1663809373), R picks up the book with call number -1663809373 from the shelf and starts reading the bit sequence in this book. When we run set.seed() with the same seed immediately before each sample() function in our script, we indeed get exactly the same sequence twice.

set.seed(-1663809373)
sample(c("H", "T"), size = 20, replace = TRUE, prob = c(0.75, 0.25))

##  [1] "H" "H" "H" "H" "H" "H" "H" "H" "H" "T" "H" "H" "H" "T" "H" "H" "T" "H" "H"
## [20] "H"

set.seed(-1663809373)
sample(c("H", "T"), size = 20, replace = TRUE, prob = c(3, 1))

##  [1] "H" "H" "H" "H" "H" "H" "H" "H" "H" "T" "H" "H" "H" "T" "H" "H" "T" "H" "H"
## [20] "H"

In production code, we usually don’t include set.seed() because we want the output to resemble true randomness, which implies that we shouldn’t get the same result each time we run the code. Still, set.seed() is a useful tool for debugging because it allows us to easily reproduce bugs and analyse problems step by step. With set.seed(), we can also compare two different versions of an R script.

Here is a summary of this tutorial.

We learned that set.seed() generates reproducible sequences of random objects.
We identify random sequences by the so-called “seed”, which is an integer argument passed to set.seed().
Whenever we run set.seed() with the same seed, R produces random objects from the same bit sequence.
If we run the same sequence of functions using the same bit sequence, we get exactly the same random objects.
Seeding random numbers is great for debugging, but we normally shouldn’t include set.seed() in production code.

In the past two tutorials, we learned how to generate random vectors, where it’s unlikely that elements repeat. Sometimes we deliberately want to create vectors with repeating elements. Next time we learn how to replicate elements in a vector with the rep() function.

See you soon.