Generating random data in Python

By John Lekberg on April 24, 2020.

This week's post is about Python's random module. You will learn:

How to shuffle lists of data.
How to sample discrete probability distributions.
How to sample continuous probability distributions.
How to use a Monte Carlo method to estimate π.

Shuffling data

Shuffling data is like shuffling a deck of cards: it permutes the data.

Shuffling is useful when you want random data that

Has a fixed number of objects of a given type. E.g. "3 reds and 4 blues".
Presents the objects in a random order.

The shuffle method shuffles a list in-place:

import random

x = ["red"] * 3 + ["blue"] * 4
x

['red', 'red', 'red', 'blue', 'blue', 'blue', 'blue']

random.shuffle(x)

None

x

['red', 'blue', 'blue', 'red', 'blue', 'red', 'blue']

shuffle only works on mutable sequences (e.g. list). It will not work on immutable sequences like tuples:

x = ("red",) * 3 + ("blue",) * 4
x

('red', 'red', 'red', 'blue', 'blue', 'blue', 'blue')

random.shuffle(x)

TypeError: 'tuple' object does not support item assignment

If you want to shuffle an immutable sequence, use sample:

x = ("red",) * 3 + ("blue",) * 4
random.sample(x, k=len(x))

['blue', 'red', 'red', 'blue', 'blue', 'blue', 'red']
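If this pattern comes up often, it can be wrapped in a small helper. This is only a sketch: the name shuffled is hypothetical (it is not part of the random module). It returns a shuffled copy as a list and leaves the original untouched. Counter is used to check that shuffling changes only the order, not how many of each item there are.

```python
import random
from collections import Counter


def shuffled(seq):
    """Return a shuffled copy of seq as a list (hypothetical helper, not in the random module)."""
    # random.sample with k=len(seq) draws every position exactly once, in random order.
    return random.sample(seq, k=len(seq))


deck = ("red",) * 3 + ("blue",) * 4
result = shuffled(deck)

print(deck)    # the original tuple is unchanged
print(result)  # a list with the same items in a random order
print(Counter(result) == Counter(deck))  # True
```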

Sampling discrete probability distributions

A discrete probability distribution is used in scenarios where the set of possible outcomes is discrete (e.g. rolling dice).

You can sample integer ranges using randrange and randint:

random.randrange(10, 20)

random.randint(10, 20)

The difference between randrange and randint is that randrange(10, 20) never returns 20, but randint(10, 20) may return 20.
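One way to see the difference is to draw many samples and look at which values show up. The sample count N below is an arbitrary choice of mine; the sketch only checks the bounds, since which values actually appear in a given run is random.

```python
import random

# Draw many samples from each function and record the distinct values seen.
N = 10_000
from_randrange = {random.randrange(10, 20) for _ in range(N)}
from_randint = {random.randint(10, 20) for _ in range(N)}

# randrange(10, 20) follows range(10, 20): the stop value 20 is excluded.
print(from_randrange <= set(range(10, 20)))  # True

# randint(10, 20) includes both endpoints, so its values lie in 10..20.
print(from_randint <= set(range(10, 21)))    # True

print(sorted(from_randrange))
print(sorted(from_randint))
```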

You can sample sequences using choice, choices, and sample:

x = ["Snake", "Ocelot", "Otacon", "Silverburgh", "Wolf", "Raven"]
random.choice(x)

'Otacon'

random.choices(x, k=3)

['Ocelot', 'Snake', 'Wolf']

random.sample(x, k=3)

['Raven', 'Otacon', 'Ocelot']

The difference between choices and sample is

choices samples with replacement. The same element can be sampled twice.
sample samples without replacement. The same element is never sampled twice.
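A quick way to check this distinction is to compare the number of distinct results with the number of draws. This is a minimal sketch that reuses the list of names from above (all of them distinct):

```python
import random

names = ["Snake", "Ocelot", "Otacon", "Silverburgh", "Wolf", "Raven"]

# Without replacement: each position in the list is used at most once,
# so drawing the whole list gives back every name exactly once.
without = random.sample(names, k=len(names))
print(len(set(without)) == len(without))  # True

# With replacement: asking for more draws than there are names
# forces at least one repeat (pigeonhole principle).
with_repl = random.choices(names, k=len(names) + 1)
print(len(set(with_repl)) < len(with_repl))  # True
```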

Because choices samples with replacement, it can sample as many times as I want:

random.choices(x, k=10)

['Wolf',
 'Snake',
 'Ocelot',
 'Otacon',
 'Snake',
 'Ocelot',
 'Wolf',
 'Raven',
 'Raven',
 'Ocelot']

But sample cannot draw more elements than the list contains:

random.sample(x, k=10)

ValueError: Sample larger than population or is negative

choices can be used to sample dice rolls:

rolls = [1, 2, 3, 4, 5, 6]
random.choices(rolls, k = 10)

[3, 4, 2, 2, 2, 1, 2, 3, 2, 2]

choices can also take weighted samples, which can simulate loaded dice rolls:

weights = [0, 0, 0, 1, 0, 10]
random.choices(rolls, weights, k = 10)

[6, 6, 4, 6, 4, 6, 6, 6, 6, 6]
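The weights are relative, not probabilities: they do not need to sum to 1. Here is a rough sketch that tallies a larger number of loaded rolls with Counter (the sample count N is an arbitrary choice). With these weights, I would expect 6 about 10/11 of the time, 4 about 1/11 of the time, and the zero-weight faces never.

```python
import random
from collections import Counter

rolls = [1, 2, 3, 4, 5, 6]
weights = [0, 0, 0, 1, 0, 10]

# Roll the loaded die N times and count how often each face comes up.
N = 10_000
counts = Counter(random.choices(rolls, weights, k=N))

# Print the observed frequency of each face (faces never rolled count as 0).
for face in rolls:
    print(face, counts[face] / N)
```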

Sampling iterables that are not sequences (e.g. generators) will not work:

x = ( i**2 for i in range(10) )
x

<generator object <genexpr> at 0x10b7f4f50>

random.choice(x)

TypeError: object of type 'generator' has no len()

You have to convert other iterables into sequences first:

y = list(x)
y

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

random.choice(y)

Sampling continuous probability distributions

A continuous probability distribution is used in scenarios where the set of possible outcomes is continuous (e.g. the temperature).

The function random samples a float uniformly between 0 and 1 (including 0, excluding 1).

random.random()

0.2464472059192525

This is the basic way to sample a continuous range. But Python also supports several common distributions:

Distribution	Function
Uniform	uniform(`a`, `b`)
Triangular	triangular(`low`, `high`, `mode`)
Beta	betavariate(`alpha`, `beta`)
Exponential	expovariate(`lambd`)
Gamma	gammavariate(`alpha`, `beta`)
Normal	normalvariate(`mu`, `sigma`)
Log-normal	lognormvariate(`mu`, `sigma`)
von Mises	vonmisesvariate(`mu`, `kappa`)
Pareto	paretovariate(`alpha`)
Weibull	weibullvariate(`alpha`, `beta`)

For more information about these distributions, read the documentation for the random module.
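As a sanity check, the sample mean of a large batch should land near the distribution's theoretical mean. This is a minimal sketch using three of the functions from the table; the parameter values and the sample count N are arbitrary choices of mine.

```python
import random
from statistics import mean

N = 20_000  # number of samples per distribution (arbitrary choice)

# Uniform between 0 and 10: the theoretical mean is (0 + 10) / 2 = 5.
uniform_mean = mean(random.uniform(0, 10) for _ in range(N))

# Normal with mu=5 and sigma=2: the theoretical mean is mu = 5.
normal_mean = mean(random.normalvariate(5, 2) for _ in range(N))

# Exponential with rate lambd=2: the theoretical mean is 1 / lambd = 0.5.
expo_mean = mean(random.expovariate(2) for _ in range(N))

print(uniform_mean, normal_mean, expo_mean)
```

The printed values change from run to run, but they should stay close to 5, 5, and 0.5.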

Generating multiple samples

An easy way to generate multiple samples is to use a list comprehension or a generator expression with the range function.

[ random.uniform(0, 10) for _ in range(5) ]

[8.363684210196281,
 6.513950372798824,
 9.386900604728854,
 2.916991931995473,
 2.9009410170872263]

samples = ( random.randint(1, 10) for _ in range(10000) )
sum(samples)

An easy way to generate an infinite stream of samples is to use itertools.count in a generator expression:

import itertools

stream = ( random.normalvariate(0, 1) for _ in itertools.count() )

total = 0
while abs(total) < 2:
    total += next(stream)

total

2.654750795526832
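If I only want a fixed number of samples from an infinite stream, itertools.islice can cut it off. A small sketch (the count of 5 is arbitrary):

```python
import itertools
import random

stream = (random.normalvariate(0, 1) for _ in itertools.count())

# islice stops after 5 items, so the infinite stream is never run to the end.
first_five = list(itertools.islice(stream, 5))
print(first_five)

# The stream keeps its position: the next call yields a sixth, new sample.
sixth = next(stream)
print(sixth)
```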

Using Monte Carlo methods to estimate π (Pi)

Monte Carlo methods are algorithms that rely on random sampling to estimate results. Wikipedia gives a good overview of the algorithmic structure:

Define a domain of possible inputs.
Generate inputs randomly from a probability distribution over the domain.
Perform a deterministic computation on the inputs.
Aggregate the results.
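To make the four steps concrete before moving on, here is a small sketch on a problem with a known answer: the probability that two fair dice sum to 7, which is exactly 1/6. The dice problem and the function name estimate_seven are my own illustration, not part of Wikipedia's overview.

```python
import random


def estimate_seven(N):
    """Estimate the probability that two fair dice sum to 7, using N trials."""
    hits = 0
    for _ in range(N):
        # First and second steps: the domain is pairs of die faces,
        # and one pair is sampled uniformly from it.
        a = random.randint(1, 6)
        b = random.randint(1, 6)
        # Third step: a deterministic computation on the sampled input.
        if a + b == 7:
            hits += 1
    # Fourth step: aggregate the results into a single estimate.
    return hits / N


estimate = estimate_seven(100_000)
print(estimate)  # should be close to 1/6 ≈ 0.1667
```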

A classic example of using Monte Carlo methods is estimating the value of the mathematical constant π (Pi). Here's how that works:

I have a circle with radius 1.
I have a square that contains the circle. (The side length is 2.)
The ratio of the circle's area to the square's area is π/4.
This means that if I uniformly cover the square in points, π/4 of the points will also be inside the circle.
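The π/4 ratio follows directly from the two area formulas, with r = 1 for the circle's radius and 2r = 2 for the square's side length:

```latex
\frac{\text{circle's area}}{\text{square's area}}
  = \frac{\pi r^2}{(2r)^2}
  = \frac{\pi \cdot 1^2}{2^2}
  = \frac{\pi}{4}
```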

As a result, I can estimate π by

Generating points that are uniformly distributed throughout the square.
Counting how many points are also in the circle. (How do I determine if a point is in the circle? I calculate the Euclidean distance from the point to the center of the circle. Then I check if that distance is less than the circle's radius.)
The ratio of the number of points in the circle to the total number of points is approximately π/4. So I multiply that ratio by 4 to get my estimate for π.

Here's Python code that implements this:

def estimate_pi(N):
    """Estimate the mathematical constant Pi using N points."""
    X_dist = ( random.uniform(-1, 1) for _ in range(N) )
    Y_dist = ( random.uniform(-1, 1) for _ in range(N) )
    
    N_in_circle = 0
    for x, y in zip(X_dist, Y_dist):
        in_circle = x**2 + y**2 <= 1
        if in_circle:
            N_in_circle += 1
    
    pi = 4 * (N_in_circle / N)
    return pi

estimate_pi(10)

3.6

estimate_pi(1000)

3.084

estimate_pi(100_000)

3.14092

estimate_pi(1_000_000)

3.141632
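The estimates get better as N grows, but slowly. Each point lands in the circle with probability p = π/4, so the count of points in the circle is binomial and the estimate has a standard error of about 4·sqrt(p(1 − p)/N): 100 times more points buys roughly one more decimal digit. Here is a rough sketch of that. It restates the estimator in compact form so the block runs on its own, and the helper standard_error is a name I made up for this illustration.

```python
import math
import random


def estimate_pi(N):
    """Compact restatement of the estimator: 4 * (fraction of points in the circle)."""
    in_circle = sum(
        random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1
        for _ in range(N)
    )
    return 4 * in_circle / N


def standard_error(N):
    """Approximate standard error of the estimate for N points (hypothetical helper)."""
    p = math.pi / 4  # probability that a point in the square lands in the circle
    return 4 * math.sqrt(p * (1 - p) / N)


# Compare the actual error of one run against the predicted standard error.
for N in (10, 1000, 100_000):
    error = abs(estimate_pi(N) - math.pi)
    print(N, error, standard_error(N))
```

The actual error of any single run is random, but it should usually be of the same order as the standard error printed next to it.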

In conclusion...

In this week's post, you learned how to generate random data in Python. You can shuffle data, sample sequences, and sample common continuous distributions (e.g. the Normal distribution). Monte Carlo methods use random number generation to estimate solutions to complex problems.

For more information about Monte Carlo methods, read these documents:

My challenge to you:

Create a function that samples the Bernoulli distribution. Use that to create a function that samples the Binomial distribution.

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!

(If you spot any errors or typos on this post, contact me via my contact page.)