Chapter 32 Random Permutations and Random Samples

Welcome back to Quantitative Reasoning! In tutorial 26, we learned that we can generate normally distributed random numbers with the R function rnorm(). Mathematically speaking, normally distributed random numbers can have infinitely many different values between minus infinity and plus infinity. If we want to generate random numbers from a finite set of objects, we need a different R function: sample(). In this tutorial, we explore different ways of sampling. We’ll learn how to create a random permutation of elements in a vector. We’ll find out how we sample elements in a vector with and without replacement. We’ll also learn how to sample elements with unequal probabilities.

The sample() function needs at least one argument: a vector. If the argument contains two or more elements, then sample() returns a vector that contains a random permutation of the input elements. Here is an example where the input is a vector with the numbers 5, 6, 1, 6 and 3.

sample(c(5, 6, 1, 6, 3))
## [1] 3 5 1 6 6

Because the permutation is random, your result is likely to differ from mine. We get yet another permutation when we run the command a second time.

sample(c(5, 6, 1, 6, 3))
## [1] 5 1 6 6 3
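
A quick way to convince ourselves that sample() only reorders the input is to sort the result and compare it with the sorted input. The following is a small sketch; the variable names x and shuffled are only for illustration.

```r
# Input vector from the example above (note that 6 appears twice).
x <- c(5, 6, 1, 6, 3)

# Random permutation; the order changes from run to run.
shuffled <- sample(x)

# Sorting removes the randomness, so this comparison is TRUE on every run.
identical(sort(shuffled), sort(x))
```

Whatever order sample() happens to return, the permutation contains exactly the same elements as the input, including the repeated 6.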

The sample() function can also accept non-numeric input. Here is an example where the input is a character vector.

sample(c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha"))
## [1] "Noah"    "Tatyana" "Quentin" "Aisha"   "Rachel"

If the input is a single non-numeric element or a single number less than 1, sample() returns this element.

sample("Rachel")
## [1] "Rachel"

If the input is a single number x greater than or equal to 1, we might expect that the return value is only the number x. Alas, this case is an exception: instead of returning only x, sample() returns a permutation of the integers from 1 to x, with x rounded down if it is not an integer.

sample(4.7)
## [1] 3 1 4 2

This exception can be a source of bugs. There’s a warning in the documentation:

?sample

Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls such as sample(x).

In this course, we won’t have input of varying length, but it’s still important to be aware of the sample() function’s quirky behaviour to avoid bad surprises.
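
One common way to sidestep this quirk is to sample positions instead of values. Here is a minimal sketch. The helper name safe_sample is my own choice, not a function that ships with R; it relies on sample.int(), which is part of base R.

```r
# Hypothetical helper: permute the elements of x, even when x has length 1.
safe_sample <- function(x) {
  # sample.int(n) permutes the positions 1, ..., n;
  # indexing x with these positions permutes the elements themselves.
  x[sample.int(length(x))]
}

safe_sample(4.7)         # returns 4.7, not a permutation of 1 to 4
safe_sample(c(5, 6, 1))  # a random permutation of 5, 6 and 1
```

Because the helper never passes the values of x to sample(), a vector of length one is treated like any other vector.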

Random permutations have applications in permutation tests, which we cover later in this course. More frequently, we don’t need a complete permutation of all elements, but only a few random elements from a vector. For example, we may need to select two random students to work together on a project. We generate random samples of size two by passing a second argument to the sample() function: size = 2.

sample(c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha"), size = 2)
## [1] "Aisha" "Noah"

When we use this command, it’s guaranteed that the two returned names are different. This type of sampling is called sampling without replacement. We can think of it as the equivalent of the following real-world experiment. We write the students’ names on name cards, put the cards into a hat, and we randomly draw two cards from the hat. After we draw the first card, we keep it outside the hat so that the second card cannot be a repeat of the first card.
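
We can check this guarantee empirically. The sketch below draws many teams of two and looks for repeated names; the variable names students and has_repeat are assumptions made for this illustration.

```r
students <- c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha")

# Draw 1000 teams of two. anyDuplicated() returns 0 when all elements differ,
# so the comparison is TRUE only for a team with a repeated name.
has_repeat <- replicate(1000, anyDuplicated(sample(students, size = 2)) > 0)

# FALSE: none of the 1000 teams contains the same student twice.
any(has_repeat)
```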

If we keep drawing more and more cards without replacement, eventually there will be no cards left in the hat. If the number of draws exceeds the number of elements we can draw, R throws an error. Here is what happens if we try to create a random team of 6 students, but there are only 5 students to sample from. R complains that it “cannot take a sample larger than the population when replace = FALSE”.

sample(c("Rachel", "Tatyana", "Noah", "Quentin", "Aisha"), size = 6)
## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

Sampling without replacement is the proper procedure if we don’t want repeating elements, but there are real-world scenarios in which elements can repeat. For example, when we roll two dice, it’s possible that both numbers are equal. An equivalent random experiment is to put cards with the numbers from 1 to 6 in a hat, draw one card randomly, put the card back in the hat and make a second random draw from all six cards. This kind of sampling is called sampling with replacement. We can instruct R to perform sampling with replacement by passing the argument replace = TRUE to the sample() function. Here is how we simulate rolling two dice.

sample(6, size = 2, replace = TRUE)
## [1] 4 3

If we repeat this command several times, we’ll sooner or later obtain a return value with two equal numbers.

## [1] 4 1
## [1] 6 5
## [1] 4 5
## [1] 1 5
## [1] 6 3
## [1] 1 6
## [1] 1 3
## [1] 2 5
## [1] 4 1
## [1] 3 3
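
Instead of rerunning the command by hand, we can let R repeat it for us and estimate how often both dice show the same number. This is only a sketch: the variable names rolls and prop_doubles and the choice of 10,000 repetitions are mine.

```r
# Simulate 10,000 rolls of two dice.
# replicate() returns a matrix with 2 rows; each column is one roll of the pair.
rolls <- replicate(10000, sample(6, size = 2, replace = TRUE))

# Proportion of rolls in which both dice show the same number.
prop_doubles <- mean(rolls[1, ] == rolls[2, ])
prop_doubles
```

The result varies from run to run, but it should be close to 1/6, the theoretical probability of rolling doubles with two fair dice.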

So far, we assumed that all elements in the vector are sampled with equal probability. In many real-world scenarios, outcomes occur with unequal probabilities. For example, if we flip a bent coin, heads and tails have different probabilities. We can instruct R to sample with unequal probabilities by passing the additional argument prob to the sample() function. Here is how we simulate 20 flips of a coin that shows heads with probability 0.75 and tails with probability 0.25.

sample(c("H", "T"), size = 20, replace = TRUE, prob = c(0.75, 0.25))
##  [1] "H" "H" "H" "H" "H" "H" "H" "H" "H" "T" "T" "H" "H" "T" "H" "H" "H" "T" "T"
## [20] "H"

Because we sample with replacement, we can set the size equal to a number that is greater than the number of elements in the first argument.
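
With only 20 flips, the observed share of heads can be quite far from 0.75. As a rough check, we can simulate many more flips and tabulate the relative frequencies. The variable name flips and the sample size of 10,000 are assumptions made for this sketch.

```r
# Simulate 10,000 flips of the bent coin.
flips <- sample(c("H", "T"), size = 10000, replace = TRUE, prob = c(0.75, 0.25))

# Relative frequencies of heads and tails.
table(flips) / length(flips)
```

The frequencies change slightly from run to run, but they should be near 0.75 for heads and 0.25 for tails.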

The elements in the prob vector must not be negative, and at least one of them must be positive, but they need not sum to 1. If the odds of getting heads from a coin flip are 3:1, we can put these odds directly into the prob vector.

sample(c("H", "T"), size = 20, replace = TRUE, prob = c(3, 1))
##  [1] "H" "H" "H" "T" "T" "H" "H" "T" "H" "H" "H" "H" "H" "H" "T" "T" "H" "H" "H"
## [20] "T"

This command is equivalent to our earlier version with prob = c(0.75, 0.25). If the values in prob don’t sum to 1, R normalises them behind the scenes (i.e. R divides each value by the sum of all prob elements to generate a proper probability distribution). This feature is handy when we know relative weights, such as odds, rather than probabilities.
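
The normalisation itself is simple arithmetic that we can reproduce by hand. The sketch below uses the variable names odds and probs, which are my own choice for illustration.

```r
# Odds of heads to tails.
odds <- c(3, 1)

# Divide each value by the sum of all values to obtain probabilities.
probs <- odds / sum(odds)
probs  # 0.75 0.25
```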

Perhaps you are still sceptical whether R really would have produced the same coin flips if I had used prob = c(0.75, 0.25) instead of prob = c(3, 1). Our first sequence of coin flips isn’t identical to the second sequence. One might simply shrug and blame this fact on the nature of randomness: it’s unlikely that a sequence of 20 coin flips would be repeated exactly in two independent random experiments. However, there is a way to retrieve past sequences of computer-generated random numbers without having to save them in memory. Next time, I’ll use this feature to demonstrate that prob = c(0.75, 0.25) and prob = c(3, 1) really produce identical sequences.

Let’s recap what we learned about the sample() function.

  • sample(x) creates a random permutation of the elements in the vector x.
  • There’s one exception to this rule: if x is a single number greater than or equal to 1, sample(x) returns a permutation of the integers from 1 to x.
  • We sample k elements from x without replacement by using the command sample(x, size = k).
  • If we want to sample k elements with replacement, we need to add the argument replace = TRUE.
  • By default, all elements in x are sampled with equal probability. If we want to sample them with different probabilities, we pass the probabilities as the additional argument prob to the sample() function.

Next time, we’ll learn how we can reproduce the same random numbers when we run an R script a second time.

See you soon.