Generating random data in Python

By John Lekberg on April 24, 2020.


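This week's post is about Python's random module. You will learn how to shuffle data, sample discrete and continuous probability distributions, generate multiple samples, and use Monte Carlo methods to estimate π.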

Shuffling data

Shuffling data is like shuffling a deck of cards: it permutes the data.

Shuffling is useful when you want to randomize the order of existing data.

The shuffle method shuffles a list in-place:

import random

x = ["red"] * 3 + ["blue"] * 4
x
['red', 'red', 'red', 'blue', 'blue', 'blue', 'blue']
random.shuffle(x)
None
x
['red', 'blue', 'blue', 'red', 'blue', 'red', 'blue']

shuffle only works on mutable sequences (e.g. list). It will not work on immutable sequences like tuples:

x = ("red",) * 3 + ("blue",) * 4
x
('red', 'red', 'red', 'blue', 'blue', 'blue', 'blue')
random.shuffle(x)
TypeError: 'tuple' object does not support item assignment

If you want to shuffle an immutable sequence, use sample:

x = ("red",) * 3 + ("blue",) * 4
random.sample(x, k=len(x))
['blue', 'red', 'red', 'blue', 'blue', 'blue', 'red']
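
The same idea can be wrapped in a small helper that returns a shuffled copy and leaves the input alone. This is only a sketch: the name shuffled is a hypothetical helper, not part of the random module.

```python
import random

def shuffled(seq):
    """Return a new shuffled list; the input sequence is left untouched.

    Hypothetical helper, not part of the random module.
    """
    # random.sample draws without replacement, so taking len(seq) items
    # yields every element exactly once, in random order.
    return random.sample(seq, k=len(seq))

deck = ("red",) * 3 + ("blue",) * 4
result = shuffled(deck)
print(result)
print(deck)  # the original tuple is unchanged
```

Because the result is always a list, wrap it in tuple(...) if you need a tuple back.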

Sampling discrete probability distributions

A discrete probability distribution is used in scenarios where the set of possible outcomes is discrete (e.g. rolling dice).

You can sample integer ranges using randrange and randint:

random.randrange(10, 20)
19
random.randint(10, 20)
12

The difference between randrange and randint is that randrange(10, 20) never returns 20, but randint(10, 20) may return 20.
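
One way to see the difference is to draw many samples and collect the distinct values that show up. This is a rough illustration; the variable names are mine.

```python
import random

# Draw many samples and record which values each function produced.
randrange_values = {random.randrange(10, 20) for _ in range(10_000)}
randint_values = {random.randint(10, 20) for _ in range(10_000)}

print(sorted(randrange_values))  # values come from 10..19
print(sorted(randint_values))    # values come from 10..20
```

In other words, randint(a, b) behaves like randrange(a, b + 1).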

You can sample sequences using choice, choices, and sample:

x = ["Snake", "Ocelot", "Otacon", "Silverburgh", "Wolf", "Raven"]
random.choice(x)
'Otacon'
random.choices(x, k=3)
['Ocelot', 'Snake', 'Wolf']
random.sample(x, k=3)
['Raven', 'Otacon', 'Ocelot']

The difference between choices and sample is that choices samples with replacement, while sample samples without replacement.

Because choices samples with replacement, it can sample as many times as I want:

random.choices(x, k=10)
['Wolf',
 'Snake',
 'Ocelot',
 'Otacon',
 'Snake',
 'Ocelot',
 'Wolf',
 'Raven',
 'Raven',
 'Ocelot']

But sample cannot sample more than the size of the list:

random.sample(x, k=10)
ValueError: Sample larger than population or is negative
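
Here is a small sketch of what "with replacement" means in practice: counting the distinct items in each result. The variable names are only for illustration.

```python
import random

names = ["Snake", "Ocelot", "Otacon", "Silverburgh", "Wolf", "Raven"]

without_replacement = random.sample(names, k=len(names))
with_replacement = random.choices(names, k=len(names))

# sample never picks the same position twice, so a population of
# unique items produces a result of unique items.
print(len(set(without_replacement)))  # 6

# choices may pick the same item more than once, so the number of
# distinct items can be anywhere from 1 to 6.
print(len(set(with_replacement)))
```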

choices can be used to sample dice rolls:

rolls = [1, 2, 3, 4, 5, 6]
random.choices(rolls, k = 10)
[3, 4, 2, 2, 2, 1, 2, 3, 2, 2]

choices can also take weighted samples, which can simulate loaded dice rolls:

weights = [0, 0, 0, 1, 0, 10]
random.choices(rolls, weights, k = 10)
[6, 6, 4, 6, 4, 6, 6, 6, 6, 6]
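
To check that the weights behave as expected, you can tally a larger number of loaded rolls with collections.Counter. This is a sketch; the sample size of 11,000 is an arbitrary choice.

```python
import random
from collections import Counter

rolls = [1, 2, 3, 4, 5, 6]
weights = [0, 0, 0, 1, 0, 10]

counts = Counter(random.choices(rolls, weights, k=11_000))

# Faces with weight 0 never appear. With these weights, 6 is expected
# about 10 times as often as 4 (roughly 10,000 vs 1,000 here).
print(counts)
```

The weights are relative, so [0, 0, 0, 1, 0, 10] means a 1 in 11 chance of rolling a 4 and a 10 in 11 chance of rolling a 6.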

Sampling iterables that are not sequences (e.g. generators) will not work:

x = ( i**2 for i in range(10) )
x
<generator object <genexpr> at 0x10b7f4f50>
random.choice(x)
TypeError: object of type 'generator' has no len()

You have to convert other iterables into sequences first:

y = list(x)
y
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
random.choice(y)
16
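
If you do this often, the conversion can live in a small helper. This is a sketch: choice_from_iterable is a hypothetical name, and it assumes the iterable is finite and fits in memory.

```python
import random

def choice_from_iterable(iterable):
    """Pick one item from any finite iterable (hypothetical helper)."""
    # Materialize the items so they have a length and can be indexed.
    items = list(iterable)
    return random.choice(items)

print(choice_from_iterable(i**2 for i in range(10)))
print(choice_from_iterable({"red", "green", "blue"}))
```

Keep in mind that a generator is consumed by the conversion, so it cannot be sampled a second time.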

Sampling continuous probability distributions

A continuous probability distribution is used in scenarios where the set of possible outcomes is continuous (e.g. the temperature).

The function random samples a float uniformly from the range 0 to 1 (including 0 but excluding 1).

random.random()
0.2464472059192525
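
A value from random can be rescaled to cover any other range. Here is a minimal sketch; scaled_random is a hypothetical helper that shows the arithmetic, and in practice random.uniform(a, b) does this for you.

```python
import random

def scaled_random(a, b):
    """Sample between a and b by rescaling random.random().

    Hypothetical helper for illustration only.
    """
    # random() is in [0, 1), so multiplying by the width (b - a) and
    # shifting by a maps it onto the range starting at a.
    return a + (b - a) * random.random()

samples = [scaled_random(-5, 5) for _ in range(1000)]
print(min(samples), max(samples))
```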

This is the basic way to sample a continuous range. But Python also supports several common distributions:

Distribution    Function
Uniform         uniform(a, b)
Triangular      triangular(low, high, mode)
Beta            betavariate(alpha, beta)
Exponential     expovariate(lambd)
Gamma           gammavariate(alpha, beta)
Normal          normalvariate(mu, sigma)
Log-normal      lognormvariate(mu, sigma)
von Mises       vonmisesvariate(mu, kappa)
Pareto          paretovariate(alpha)
Weibull         weibullvariate(alpha, beta)

For more information about these distributions, see the documentation for each function in the random module.
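
As a quick sanity check, you can sample two of these distributions and compare the sample means with the values you expect. This is a sketch: the seed, the parameters, and the sample size are arbitrary choices of mine, and the seeded random.Random instance is only there to make the run repeatable.

```python
import random
import statistics

rng = random.Random(0)  # seeded generator so the run is repeatable

normal = [rng.normalvariate(5, 2) for _ in range(10_000)]
expo = [rng.expovariate(0.5) for _ in range(10_000)]

# The sample mean of Normal(mu=5, sigma=2) should be near 5.
print(statistics.mean(normal))

# The mean of an Exponential distribution with rate lambd is 1 / lambd,
# so this should be near 2.
print(statistics.mean(expo))
```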

Generating multiple samples

An easy way to generate multiple samples is to use a list comprehension or a generator expression with the range function.

[ random.uniform(0, 10) for _ in range(5) ]
[8.363684210196281,
 6.513950372798824,
 9.386900604728854,
 2.916991931995473,
 2.9009410170872263]
samples = ( random.randint(1, 10) for _ in range(10000) )
sum(samples)
55039

An easy way to generate an infinite stream of samples is to use itertools.count in a generator expression:

import itertools

stream = ( random.normalvariate(0, 1) for _ in itertools.count() )

total = 0
while abs(total) < 2:
    total += next(stream)

total
2.654750795526832
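
If you only want the first few items of an infinite stream, itertools.islice can cut it off for you. A small sketch (the name first_five is mine):

```python
import itertools
import random

stream = (random.normalvariate(0, 1) for _ in itertools.count())

# islice stops after 5 items, so the code never tries to consume
# the whole (infinite) stream.
first_five = list(itertools.islice(stream, 5))
print(first_five)
```

Calling list(stream) directly would never finish, so always bound an infinite stream before collecting it.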

Using Monte Carlo methods to estimate π (Pi)

Monte Carlo methods are algorithms that rely on random sampling to estimate results. Wikipedia gives a good overview of the algorithmic structure:

  1. Define a domain of possible inputs.
  2. Generate inputs randomly from a probability distribution over the domain.
  3. Perform a deterministic computation on the inputs.
  4. Aggregate the results.
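
To make the four steps concrete, here is a sketch that uses them on a small problem of my own choosing: estimating the probability that two dice sum to 7. The function name and the number of throws are assumptions for illustration.

```python
import random

def estimate_seven_probability(n):
    """Estimate P(two dice sum to 7) from n simulated throws."""
    hits = 0
    for _ in range(n):
        # Steps 1 and 2: the domain is all pairs of die faces,
        # and each pair is sampled uniformly.
        a, b = random.randint(1, 6), random.randint(1, 6)
        # Step 3: a deterministic check on the sampled input.
        if a + b == 7:
            hits += 1
    # Step 4: aggregate the results into a single estimate.
    return hits / n

estimate = estimate_seven_probability(100_000)
print(estimate)  # the exact probability is 6/36, about 0.1667
```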

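A classic example of using Monte Carlo methods is estimating the value of the mathematical constant π (Pi). Here's how that works:

  1. Sample N points uniformly from the square where x and y both range from -1 to 1. This square has area 4.
  2. Count how many of those points fall inside the unit circle (x² + y² ≤ 1). This circle has area π.
  3. The fraction of points inside the circle approximates the ratio of the two areas, π / 4.

As a result, I can estimate π by multiplying the fraction of points inside the circle by 4.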

Here's Python code that implements this:

def estimate_pi(N):
    """Estimate the mathematical constant Pi using N points."""
    X_dist = ( random.uniform(-1, 1) for _ in range(N) )
    Y_dist = ( random.uniform(-1, 1) for _ in range(N) )
    
    N_in_circle = 0
    for x, y in zip(X_dist, Y_dist):
        in_circle = x**2 + y**2 <= 1
        if in_circle:
            N_in_circle += 1
    
    pi = 4 * (N_in_circle / N)
    return pi

estimate_pi(10)
3.6
estimate_pi(1000)
3.084
estimate_pi(100_000)
3.14092
estimate_pi(1_000_000)
3.141632
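
The results above suggest that the estimate gets closer to π as N grows. Here is a sketch that measures this by comparing against math.pi. It uses a compact rewrite of estimate_pi so that the snippet runs on its own, and the values of N are arbitrary choices of mine.

```python
import math
import random

def estimate_pi(n):
    """Compact version of the estimator above: 4 * (fraction of points in the circle)."""
    inside = sum(
        random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1
        for _ in range(n)
    )
    return 4 * inside / n

# Print N, the estimate, and the absolute error for growing N.
for n in (100, 10_000, 1_000_000):
    estimate = estimate_pi(n)
    print(n, estimate, abs(estimate - math.pi))
```

The error of a Monte Carlo estimate like this one typically shrinks in proportion to 1 / sqrt(N), so each extra digit of accuracy costs roughly 100 times more samples.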

In conclusion...

In this week's post, you learned how to generate random data in Python. You can shuffle data, sample sequences, and sample common continuous distributions (e.g. the Normal distribution). Monte Carlo methods use random number generation to estimate solutions to complex problems.

For more information about Monte Carlo methods, read these documents:

My challenge to you:

Create a function that samples the Bernoulli distribution. Use that to create a function that samples the Binomial distribution.

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!


(If you spot any errors or typos on this post, contact me via my contact page.)