Generating random data in Python
By John Lekberg on April 24, 2020.
This week's post is about Python's random module. You will learn:
- How to shuffle lists of data.
- How to sample discrete probability distributions.
- How to sample continuous probability distributions.
- How to use a Monte Carlo method to estimate π.
Shuffling data
Shuffling data is like shuffling a deck of cards: it permutes the data.
Shuffling is useful when you want random data that:
- Has a fixed number of objects of a given type, e.g. "3 reds and 4 blues".
- Presents the objects in a random order.
The shuffle method shuffles a list in-place:
import random
x = ["red"] * 3 + ["blue"] * 4
x
['red', 'red', 'red', 'blue', 'blue', 'blue', 'blue']
random.shuffle(x)
None
x
['red', 'blue', 'blue', 'red', 'blue', 'red', 'blue']
shuffle only works on mutable sequences (e.g. list).
It will not work on immutable sequences like tuples:
x = ("red",) * 3 + ("blue",) * 4
x
('red', 'red', 'red', 'blue', 'blue', 'blue', 'blue')
random.shuffle(x)
TypeError: 'tuple' object does not support item assignment
If you want to shuffle an immutable sequence, use sample:
x = ("red",) * 3 + ("blue",) * 4
random.sample(x, k=len(x))
['blue', 'red', 'red', 'blue', 'blue', 'blue', 'red']
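Here is a minimal sketch of the two properties described above: a shuffle keeps the counts fixed and changes only the order. The helper name `shuffled` and its `seed` parameter are hypothetical, added here so the run is repeatable; they are not part of the random module.

```python
import random
from collections import Counter

def shuffled(items, seed=None):
    """Return a new shuffled list, leaving `items` untouched (hypothetical helper)."""
    rng = random.Random(seed)  # private generator; seed=None uses system entropy
    return rng.sample(list(items), k=len(items))

deck = ("red",) * 3 + ("blue",) * 4
result = shuffled(deck, seed=42)

# The counts are preserved: still 3 reds and 4 blues, only the order changes.
print(Counter(result) == Counter(deck))
# The same seed reproduces the same order.
print(shuffled(deck, seed=42) == result)
```

Using a separate `random.Random` instance keeps the seed from affecting other code that relies on the module-level functions.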
Sampling discrete probability distributions
A discrete probability distribution is used in scenarios where the set of possible outcomes is discrete (e.g. rolling dice).
You can sample integer ranges using randrange and randint:
random.randrange(10, 20)
19
random.randint(10, 20)
12
The difference between randrange and randint is that randrange(10, 20) never returns 20, but randint(10, 20) may return 20.
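One way to check the endpoint behaviour is to draw many samples and look at which values appear. This is a rough sketch; the seed and the sample count of 10,000 are arbitrary choices made for repeatability.

```python
import random

rng = random.Random(0)  # seeded only so the run is repeatable

randrange_values = {rng.randrange(10, 20) for _ in range(10_000)}
randint_values = {rng.randint(10, 20) for _ in range(10_000)}

# randrange excludes the stop value, like range(10, 20).
print(20 in randrange_values)
# randint includes both endpoints.
print(max(randint_values))
```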
You can sample sequences using choice, choices, and sample:
x = ["Snake", "Ocelot", "Otacon", "Silverburgh", "Wolf", "Raven"]
random.choice(x)
'Otacon'
random.choices(x, k=3)
['Ocelot', 'Snake', 'Wolf']
random.sample(x, k=3)
['Raven', 'Otacon', 'Ocelot']
The difference between choices and sample is:
- choices samples with replacement. The same element can be sampled twice.
- sample samples without replacement. The same element is never sampled twice.
Because choices samples with replacement, it can sample as many times as I want:
random.choices(x, k=10)
['Wolf',
'Snake',
'Ocelot',
'Otacon',
'Snake',
'Ocelot',
'Wolf',
'Raven',
'Raven',
'Ocelot']
But sample cannot sample more than the size of the list:
random.sample(x, k=10)
ValueError: Sample larger than population or is negative
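The contrast between sampling with and without replacement can be checked directly. The sketch below reuses the list of names from the example above; the variable names are only illustrative.

```python
import random

names = ["Snake", "Ocelot", "Otacon", "Silverburgh", "Wolf", "Raven"]

with_replacement = random.choices(names, k=10)
without_replacement = random.sample(names, k=len(names))

# 10 draws from 6 names must repeat at least one name (pigeonhole principle).
print(len(set(with_replacement)) < len(with_replacement))
# sample never picks the same position twice, so each name appears exactly once.
print(sorted(without_replacement) == sorted(names))

try:
    random.sample(names, k=10)
except ValueError as error:
    print("sample failed:", error)
```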
choices can be used to sample dice rolls:
rolls = [1, 2, 3, 4, 5, 6]
random.choices(rolls, k=10)
[3, 4, 2, 2, 2, 1, 2, 3, 2, 2]
choices can also take weighted samples, which can simulate loaded dice rolls:
weights = [0, 0, 0, 1, 0, 10]
random.choices(rolls, weights, k=10)
[6, 6, 4, 6, 4, 6, 6, 6, 6, 6]
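To see what the weights do over many rolls, the results can be tallied with collections.Counter. This is a small sketch using the same rolls and weights as above; the sample size of 11,000 is an arbitrary choice.

```python
import random
from collections import Counter

rolls = [1, 2, 3, 4, 5, 6]
weights = [0, 0, 0, 1, 0, 10]

counts = Counter(random.choices(rolls, weights, k=11_000))

# Faces with weight 0 are never drawn, so only 4 and 6 can show up.
print(sorted(counts))
# The weights are relative: with 1 and 10, about 1 roll in 11 should be a 4.
print(counts[4] / 11_000, counts[6] / 11_000)
```

The weights do not need to sum to 1; choices normalizes them internally.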
Sampling iterables that are not sequences (e.g. generators) will not work:
x = ( i**2 for i in range(10) )
x
<generator object <genexpr> at 0x10b7f4f50>
random.choice(x)
TypeError: object of type 'generator' has no len()
You have to convert them into sequences first:
y = list(x)
y
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
random.choice(y)
16
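The same restriction applies to sets, which have a length but cannot be indexed. The sketch below is one possible workaround; the set of colors is made up for illustration.

```python
import random

colors = {"red", "green", "blue"}  # a set has a length but no indexing

try:
    random.choice(colors)
except TypeError as error:
    print("choice failed:", error)

# Converting to a sequence first makes the set sampleable.
# sorted() also gives a stable order, which matters if you seed the generator.
picked = random.choice(sorted(colors))
print(picked in colors)
```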
Sampling continuous probability distributions
A continuous probability distribution is used in scenarios where the set of possible outcomes is continuous (e.g. the temperature).
The function random samples a float between 0 and 1 (including 0 but excluding 1):
random.random()
0.2464472059192525
This is the basic way to sample a continuous range. But Python also supports several common distributions:
Distribution | Function |
---|---|
Uniform | uniform(a, b) |
Triangular | triangular(low, high, mode) |
Beta | betavariate(alpha, beta) |
Exponential | expovariate(lambd) |
Gamma | gammavariate(alpha, beta) |
Normal | normalvariate(mu, sigma) |
Log-normal | lognormvariate(mu, sigma) |
von Mises | vonmisesvariate(mu, kappa) |
Pareto | paretovariate(alpha) |
Weibull | weibullvariate(alpha, beta) |
For more information about these distributions, read the hyperlinked documents in the table above.
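As a quick sanity check on two of the functions in the table, you can draw many samples and compare the sample statistics with the parameters you passed in. This is a rough sketch; the parameter values (mean 100, standard deviation 15, rate 0.5) are arbitrary.

```python
import random
import statistics

normal_samples = [random.normalvariate(100, 15) for _ in range(50_000)]
# expovariate takes the rate lambd; the mean of the distribution is 1 / lambd.
expo_samples = [random.expovariate(0.5) for _ in range(50_000)]

print(statistics.mean(normal_samples))   # should be close to 100
print(statistics.stdev(normal_samples))  # should be close to 15
print(statistics.mean(expo_samples))     # should be close to 1 / 0.5 = 2
```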
Generating multiple samples
An easy way to generate multiple samples is to use a list comprehension or a generator expression with the range function.
[ random.uniform(0, 10) for _ in range(5) ]
[8.363684210196281,
6.513950372798824,
9.386900604728854,
2.916991931995473,
2.9009410170872263]
samples = ( random.randint(1, 10) for _ in range(10000) )
sum(samples)
55039
An easy way to generate an infinite stream of samples is to use itertools.count in a generator expression:
import itertools
stream = ( random.normalvariate(0, 1) for _ in itertools.count() )
total = 0
while abs(total) < 2:
    total += next(stream)
total
2.654750795526832
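If you want a fixed number of values from an infinite stream like the one above, itertools.islice is one option. A minimal sketch:

```python
import itertools
import random

stream = (random.normalvariate(0, 1) for _ in itertools.count())

# Take the first five samples without exhausting the (infinite) stream.
first_five = list(itertools.islice(stream, 5))
print(len(first_five))  # 5

# The stream is not restarted: the next islice continues where the last stopped.
next_three = list(itertools.islice(stream, 3))
print(len(next_three))  # 3
```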
Using Monte Carlo methods to estimate π (Pi)
Monte Carlo methods are algorithms that rely on random sampling to estimate results. Wikipedia gives a good overview of the algorithmic structure:
- Define a domain of possible inputs.
- Generate inputs randomly from a probability distribution over the domain.
- Perform a deterministic computation on the inputs.
- Aggregate the results.
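Before moving on to π, here is a small sketch of how those four steps might map onto code. It estimates the probability that two fair dice sum to 7, a problem chosen only because the exact answer (6/36) is easy to check. The function name and the `seed` parameter are hypothetical.

```python
import random

def estimate_probability_of_seven(n_trials, seed=None):
    """Estimate P(two fair dice sum to 7) by simulation (illustrative helper)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Steps 1 and 2: the domain is pairs of die faces, sampled uniformly.
        first, second = rng.randint(1, 6), rng.randint(1, 6)
        # Step 3: a deterministic computation on the inputs.
        if first + second == 7:
            hits += 1
    # Step 4: aggregate the results into a single estimate.
    return hits / n_trials

estimate = estimate_probability_of_seven(100_000, seed=1)
print(estimate, "vs exact", 6 / 36)
```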
A classic example of using Monte Carlo methods is estimating the value of the mathematical constant π (Pi). Here's how that works:
- I have a circle with radius 1.
- I have a square that contains the circle. (The side length is 2.)
- The ratio of (the circle's area) to (the square's area) is π/4.
- This means that if I uniformly cover the square in points, π/4 of the points will also be inside the circle.
As a result, I can estimate π by:
- Generating points that are uniformly distributed throughout the square.
- Counting how many points are also in the circle. (How do I determine if a point is in the circle? I calculate the Euclidean distance from the point to the center of the circle. Then I check if that distance is at most the circle's radius.)
- The ratio of (the number of points in the circle) to (the total number of points) is approximately π/4. So I multiply that ratio by 4 to get my estimate for π.
Here's Python code that implements this:
def estimate_pi(N):
    """Estimate the mathematical constant Pi using N points."""
    X_dist = ( random.uniform(-1, 1) for _ in range(N) )
    Y_dist = ( random.uniform(-1, 1) for _ in range(N) )
    N_in_circle = 0
    for x, y in zip(X_dist, Y_dist):
        in_circle = x**2 + y**2 <= 1
        if in_circle:
            N_in_circle += 1
    pi = 4 * (N_in_circle / N)
    return pi

estimate_pi(10)
3.6
estimate_pi(1000)
3.084
estimate_pi(100_000)
3.14092
estimate_pi(1_000_000)
3.141632
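The sample outputs above get closer to π as N grows. To measure that, you can compare each estimate with math.pi. The sketch below is a compact variant of estimate_pi, not the original function: the name `estimate_pi_seeded` and its `seed` parameter are assumptions added so the run is repeatable.

```python
import math
import random

def estimate_pi_seeded(n, seed=0):
    """Same idea as estimate_pi above, with a private seeded generator."""
    rng = random.Random(seed)
    # Count the points (x, y) in the square that fall inside the unit circle.
    n_in_circle = sum(
        rng.uniform(-1, 1) ** 2 + rng.uniform(-1, 1) ** 2 <= 1
        for _ in range(n)
    )
    return 4 * n_in_circle / n

for n in (100, 10_000, 500_000):
    estimate = estimate_pi_seeded(n)
    print(n, estimate, abs(estimate - math.pi))
```

The error of this kind of estimate tends to shrink roughly in proportion to 1/sqrt(N), so each extra digit of accuracy costs about 100 times more points.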
In conclusion...
In this week's post, you learned how to generate random data in Python. You can shuffle data, sample sequences, and sample common continuous distributions (e.g. the Normal distribution). Monte Carlo methods use random number generation to estimate solutions to complex problems.
For more information about Monte Carlo methods, read these documents:
- "How the Coast Guard Uses Analytics to Search for Those Lost at Sea" by Nick Kolakowski
- "Monte Carlo method" via Wikipedia
- "The Beginning of the Monte Carlo Method" by N. Metropolis
My challenge to you:
Create a function that samples the Bernoulli distribution. Use that to create a function that samples the Binomial distribution.
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)