# Generating random data in Python

By John Lekberg on April 24, 2020.

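This week's post is about Python's `random` module. You will learn how to shuffle data, sample discrete and continuous probability distributions, generate multiple samples, and use Monte Carlo methods to estimate π.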

# Shuffling data

Shuffling data is like shuffling a deck of cards: it permutes the data.

Shuffling is useful when you want random data that

• Has a fixed number of objects of a given type, e.g. "3 reds and 4 blues".
• Presents the objects in a random order.

The shuffle method shuffles a list in-place:

``````import random

x = ["red"] * 3 + ["blue"] * 4
x
``````
``````['red', 'red', 'red', 'blue', 'blue', 'blue', 'blue']
``````
``````random.shuffle(x)
``````
``````None
``````
``````x
``````
``````['red', 'blue', 'blue', 'red', 'blue', 'red', 'blue']
``````

`shuffle` only works on mutable sequences (e.g. list). It will not work on immutable sequences like tuples:

``````x = ("red",) * 3 + ("blue",) * 4
x
``````
``````('red', 'red', 'red', 'blue', 'blue', 'blue', 'blue')
``````
``````random.shuffle(x)
``````
``````TypeError: 'tuple' object does not support item assignment
``````

If you want to shuffle an immutable sequence, use sample:

``````x = ("red",) * 3 + ("blue",) * 4
random.sample(x, k=len(x))
``````
``````['blue', 'red', 'red', 'blue', 'blue', 'blue', 'red']
``````
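
If this pattern comes up often, it can be wrapped in a small helper. The sketch below uses a hypothetical function name, `shuffled`, which is not part of the `random` module. It returns a new shuffled list and leaves the input untouched, and `collections.Counter` is used to check that the number of reds and blues is preserved.

```python
import random
from collections import Counter


def shuffled(seq):
    """Return a new list with the items of seq in random order.

    `shuffled` is a hypothetical helper name, not a standard library
    function. It works for any sequence, mutable or immutable.
    """
    return random.sample(seq, k=len(seq))


deck = ("red",) * 3 + ("blue",) * 4
result = shuffled(deck)

# The input tuple is unchanged and the colour counts are preserved.
print(deck)
print(result)
print(Counter(result) == Counter(deck))
```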

# Sampling discrete probability distributions

A discrete probability distribution is used in scenarios where the set of possible outcomes is discrete (e.g. rolling dice).

You can sample integer ranges using randrange and randint:

``````random.randrange(10, 20)
``````
``````19
``````
``````random.randint(10, 20)
``````
``````12
``````

The difference between `randrange` and `randint` is that `randrange(10, 20)` never returns `20`, but `randint(10, 20)` may return `20`.
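
One rough way to check this difference is to draw many samples and look at which values show up. This is only an empirical sketch: it uses a seeded `random.Random` instance (my choice, so the run is repeatable) and an arbitrary sample count of 10,000.

```python
import random

rng = random.Random(0)  # seeded instance so the run is repeatable

randrange_values = {rng.randrange(10, 20) for _ in range(10_000)}
randint_values = {rng.randint(10, 20) for _ in range(10_000)}

# randrange(10, 20) draws from 10..19, like range(10, 20).
print(sorted(randrange_values))
# randint(10, 20) draws from 10..20, so 20 can appear.
print(sorted(randint_values))
```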

You can sample sequences using choice, choices, and sample:

``````x = ["Snake", "Ocelot", "Otacon", "Silverburgh", "Wolf", "Raven"]
random.choice(x)
``````
``````'Otacon'
``````
``````random.choices(x, k=3)
``````
``````['Ocelot', 'Snake', 'Wolf']
``````
``````random.sample(x, k=3)
``````
``````['Raven', 'Otacon', 'Ocelot']
``````

The difference between `choices` and `sample` is

• `choices` samples with replacement. The same element can be sampled twice.
• `sample` samples without replacement. The same element is never sampled twice.

Because `choices` samples with replacement, it can sample as many times as I want:

``````random.choices(x, k=10)
``````
``````['Wolf',
'Snake',
'Ocelot',
'Otacon',
'Snake',
'Ocelot',
'Wolf',
'Raven',
'Raven',
'Ocelot']
``````

But `sample` cannot sample more than the size of the list:

``````random.sample(x, k=10)
``````
``````ValueError: Sample larger than population or is negative
``````

`choices` can be used to sample dice rolls:

``````rolls = [1, 2, 3, 4, 5, 6]
random.choices(rolls, k = 10)
``````
``````[3, 4, 2, 2, 2, 1, 2, 3, 2, 2]
``````

`choices` can also take weighted samples, which can simulate loaded dice rolls:

``````weights = [0, 0, 0, 1, 0, 10]
random.choices(rolls, weights, k = 10)
``````
``````[6, 6, 4, 6, 4, 6, 6, 6, 6, 6]
``````
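
The weights are relative, not probabilities: with weights of 1 and 10, face 4 should come up about 1/11 of the time and face 6 about 10/11 of the time. A quick way to see this is to tally a large number of rolls with `collections.Counter`. The seed and the sample count below are arbitrary choices for this sketch.

```python
import random
from collections import Counter

rolls = [1, 2, 3, 4, 5, 6]
weights = [0, 0, 0, 1, 0, 10]

rng = random.Random(0)  # seeded instance so the run is repeatable
counts = Counter(rng.choices(rolls, weights, k=11_000))

# Faces with weight 0 never appear; 6 should show up roughly
# ten times as often as 4.
print(counts)
```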

Sampling iterables that are not sequences (e.g. generators) will not work:

``````x = ( i**2 for i in range(10) )
x
``````
``````<generator object <genexpr> at 0x10b7f4f50>
``````
``````random.choice(x)
``````
``````TypeError: object of type 'generator' has no len()
``````

You have to convert non-sequence iterables into sequences (e.g. with `list`) first:

``````y = list(x)
y
``````
``````[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
``````
``````random.choice(y)
``````
``````16
``````

# Sampling continuous probability distributions

A continuous probability distribution is used in scenarios where the set of possible outcomes is continuous (e.g. the temperature).

The function `random` samples a float between 0 and 1 (including 0, excluding 1).

``````random.random()
``````
``````0.2464472059192525
``````

This is the basic way to sample a continuous range. But Python also supports several common distributions:

| Distribution | Function |
| --- | --- |
| Uniform | `uniform(a, b)` |
| Triangular | `triangular(low, high, mode)` |
| Beta | `betavariate(alpha, beta)` |
| Exponential | `expovariate(lambd)` |
| Gamma | `gammavariate(alpha, beta)` |
| Normal | `normalvariate(mu, sigma)` |
| Log-normal | `lognormvariate(mu, sigma)` |
| von Mises | `vonmisesvariate(mu, kappa)` |
| Pareto | `paretovariate(alpha)` |
| Weibull | `weibullvariate(alpha, beta)` |
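
As a rough sanity check, the sketch below draws samples from three of these distributions and compares the sample mean with the theoretical mean (0 for a standard Normal, 1/λ for an Exponential with rate λ, and (a + b)/2 for a Uniform). The seed, the sample size, and the parameter values are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # seeded instance so the run is repeatable
N = 10_000

normal = [rng.normalvariate(0, 1) for _ in range(N)]
expo = [rng.expovariate(2) for _ in range(N)]
uniform = [rng.uniform(0, 10) for _ in range(N)]

print(statistics.mean(normal))   # should be near 0
print(statistics.mean(expo))     # should be near 1 / 2
print(statistics.mean(uniform))  # should be near 5
```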

# Generating multiple samples

An easy way to generate multiple samples is to use a list comprehension or a generator expression with the range function.

``````[ random.uniform(0, 10) for _ in range(5) ]
``````
``````[8.363684210196281,
6.513950372798824,
9.386900604728854,
2.916991931995473,
2.9009410170872263]
``````
``````samples = ( random.randint(1, 10) for _ in range(10000) )
sum(samples)
``````
``````55039
``````

An easy way to generate an infinite stream of samples is to use itertools.count in a generator expression:

``````import itertools

stream = ( random.normalvariate(0, 1) for _ in itertools.count() )

total = 0
while abs(total) < 2:
    total += next(stream)

total
``````
``````2.654750795526832
``````
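
If only a fixed number of values is needed from such a stream, `itertools.islice` can take them without consuming the rest. This is a minimal sketch; the count of 5 is arbitrary.

```python
import itertools
import random

stream = (random.normalvariate(0, 1) for _ in itertools.count())

# Take the first 5 values from the infinite stream.
first_five = list(itertools.islice(stream, 5))
print(first_five)

# The stream is not exhausted, so further values can still be drawn.
print(next(stream))
```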

# Using Monte Carlo methods to estimate π (Pi)

Monte Carlo methods are algorithms that rely on random sampling to estimate results. Wikipedia gives a good overview of the algorithmic structure:

1. Define a domain of possible inputs.
2. Generate inputs randomly from a probability distribution over the domain.
3. Perform a deterministic computation on the inputs.
4. Aggregate the results.
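
To make the four steps above concrete before the π example, here is a small sketch on a different, simpler problem of my own choosing: estimating the probability that two dice sum to 7. The function name `estimate_probability_of_seven` and the optional `rng` parameter are illustrative, not taken from any library.

```python
import random


def estimate_probability_of_seven(n, rng=random):
    """Estimate P(two dice sum to 7) from n simulated throws."""
    hits = 0
    for _ in range(n):
        # Steps 1-2: the domain is pairs of dice faces, sampled uniformly.
        a = rng.randint(1, 6)
        b = rng.randint(1, 6)
        # Step 3: a deterministic check on the sampled input.
        if a + b == 7:
            hits += 1
    # Step 4: aggregate the results into a single estimate.
    return hits / n


estimate = estimate_probability_of_seven(100_000, random.Random(0))
print(estimate)  # the exact value is 6/36, about 0.1667
```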

A classic example of using Monte Carlo methods is estimating the value of the mathematical constant π (Pi). Here's how that works:

• I have a circle with radius 1.

• I have a square that contains the circle. (The side length is 2.)

• The ratio of the circle's area (π · 1² = π) to the square's area (2² = 4) is π/4.

• This means that if I uniformly cover the square in points, π/4 of the points will also be inside the circle.

As a result, I can estimate π by

• Generating points that are uniformly distributed throughout the square.

• Counting how many points are also in the circle. (How do I determine if a point is in the circle? I calculate the Euclidean distance from the point to the center of the circle. Then I check if that distance is less than or equal to the circle's radius.)

• The ratio of (the number of points in the circle) to (the total number of points) is approximately π/4. So I multiply that ratio by 4 to get my estimate for π.

Here's Python code that implements this:

``````def estimate_pi(N):
    """Estimate the mathematical constant Pi using N points."""
    # Coordinates of N points, uniformly distributed over the square.
    X_dist = ( random.uniform(-1, 1) for _ in range(N) )
    Y_dist = ( random.uniform(-1, 1) for _ in range(N) )

    N_in_circle = 0
    for x, y in zip(X_dist, Y_dist):
        # Comparing squared distance to 1 avoids taking a square root.
        in_circle = x**2 + y**2 <= 1
        if in_circle:
            N_in_circle += 1

    pi = 4 * (N_in_circle / N)
    return pi

estimate_pi(10)
``````
``````3.6
``````
``````estimate_pi(1000)
``````
``````3.084
``````
``````estimate_pi(100_000)
``````
``````3.14092
``````
``````estimate_pi(1_000_000)
``````
``````3.141632
``````
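
The estimates get closer to π as N grows, but slowly: for Monte Carlo estimates like this one, the typical error shrinks roughly in proportion to 1/√N. The sketch below compares a few estimates against `math.pi`. It re-defines `estimate_pi` in a compact form so that it is self-contained, and the optional `rng` parameter and the seed are my additions to make the run repeatable.

```python
import math
import random


def estimate_pi(N, rng=random):
    """Estimate Pi from N uniform points in the square [-1, 1] x [-1, 1]."""
    n_in_circle = sum(
        rng.uniform(-1, 1) ** 2 + rng.uniform(-1, 1) ** 2 <= 1
        for _ in range(N)
    )
    return 4 * n_in_circle / N


rng = random.Random(0)  # seeded instance so the run is repeatable
for N in (100, 10_000, 1_000_000):
    estimate = estimate_pi(N, rng)
    # Print the sample size, the estimate, and its absolute error.
    print(N, estimate, abs(estimate - math.pi))
```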

# In conclusion...

In this week's post, you learned how to generate random data in Python. You can shuffle data, sample sequences, and sample common continuous distributions (e.g. the Normal distribution). Monte Carlo methods use random number generation to estimate solutions to complex problems.