Using Python's bisect module

By John Lekberg on November 21, 2020.


Hacker News discussion.


This week's post is about Python's bisect module. You will learn how to use bisect.bisect for statistical data binning and how to use bisect.insort to add new data to a sorted list.

The goal of the bisect module is to allow you to efficiently search and update sorted lists. To this end, it provides bisect.bisect, which finds the position where a value belongs in a sorted list, and bisect.insort, which inserts a value into a sorted list while keeping it sorted.
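
As a quick illustration (on a made-up list), bisect.bisect reports where a value would go and bisect.insort actually puts it there:

import bisect

sorted_list = [1, 3, 5, 7]
bisect.bisect(sorted_list, 4)
2
bisect.insort(sorted_list, 4)
sorted_list
[1, 3, 4, 5, 7]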

Statistical data binning (bisect.bisect)

In statistical data binning, you have data that you want to group into "bins", e.g. grouping fruits by family or turning test scores into letter grades.

To me, a straightforward approach is to write a function that computes the bin. E.g., binning fruits:

def bin_fruit(fruit):
    if fruit in ["lemon", "orange"]:
        return "citrus"
    elif fruit in ["apple", "pear"]:
        return "malinae"


import random

data = random.choices(
    ["lemon", "orange", "apple", "pear"], k=10
)
data
['lemon',
 'lemon',
 'apple',
 'apple',
 'apple',
 'lemon',
 'apple',
 'apple',
 'lemon',
 'pear']
[bin_fruit(x) for x in data]
['citrus',
 'citrus',
 'malinae',
 'malinae',
 'malinae',
 'citrus',
 'malinae',
 'malinae',
 'citrus',
 'malinae']

E.g., turning test scores into letter grades:

def bin_score(score):
    if 90 <= score <= 100:
        return "A"
    elif 80 <= score < 90:
        return "B"
    elif 70 <= score < 80:
        return "C"
    elif 60 <= score < 70:
        return "D"
    elif score < 60:
        return "F"


import random

data = [random.randint(50, 100) for _ in range(10)]
data
[78, 87, 93, 78, 53, 92, 64, 70, 75, 60]
[bin_score(x) for x in data]
['C', 'B', 'A', 'C', 'F', 'A', 'D', 'C', 'C', 'D']

For a dataset of n records and m bins, the time complexity of this approach tends to be

O(mn)

because binning one record takes O(m) time -- O(m) if-statements are checked -- and n records need to be binned.

If the number of bins m is constant, then the overall time complexity is simply O(n).

But what if the number of bins m is large enough that the straightforward approach is too slow?
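
One way to speed this up is to describe the bins with a sorted list of breakpoints and let bisect.bisect find the right bin with a binary search. Here is a minimal sketch that redoes the letter-grade example this way (the breakpoints mirror the cutoffs in bin_score above, and the scores are the ones from the earlier session):

import bisect

# Bin boundaries in sorted order: scores below 60 map to "F",
# 60-69 to "D", 70-79 to "C", 80-89 to "B", and 90 and up to "A".
breakpoints = [60, 70, 80, 90]
grades = "FDCBA"


def bin_score(score):
    # bisect.bisect returns the index where score would be inserted
    # in breakpoints, which is also the index of its bin.
    return grades[bisect.bisect(breakpoints, score)]


[bin_score(x) for x in [78, 87, 93, 78, 53, 92, 64, 70, 75, 60]]
['C', 'B', 'A', 'C', 'F', 'A', 'D', 'C', 'C', 'D']

Binning one record is now an O(log m) binary search instead of O(m) if-statements, so binning n records tends to cost O(n log m) overall.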

Adding new data to a sorted list (bisect.insort)

In this scenario, you have a sorted list of data and you want to both add new data to the list and compute statistics on it (e.g. the median).

For example, you can imagine having a temperature sensor that regularly reports the temperature. You want to regularly report the median temperature, as well as be able to incorporate new temperature data as it arrives.

To me, a straightforward approach is to use list.append and statistics.median. E.g., monitoring temperatures:

def add_new_measurement(data, temperature):
    data.append(temperature)


def median(data):
    # statistics.median sorts the data, so this takes O(n log n) time.
    return statistics.median(data)


import random
import statistics

data = [random.randint(93, 100) for _ in range(10)]
data
[98, 94, 97, 94, 94, 98, 98, 94, 94, 95]
median(data)
94.5
add_new_measurement(data, 97)
data
[98, 94, 97, 94, 94, 98, 98, 94, 94, 95, 97]
median(data)
95

For a dataset of n records, the time complexity of this approach tends to be

O(1) for adding new data and O(n log n) for computing the median

because list.append runs in constant time, while statistics.median sorts the data every time it is called.

So if the workload requires frequently adding new data and infrequently computing the median, this approach works well.

But what if the workload is inverted? For example, temperature measurements might arrive every couple of minutes, but the median might need to be computed every couple of seconds.

In that scenario, I think we would be better suited by speeding up computing the median, even at the expense of slowing down adding new data.

Computing order statistics can be done in O(1) time if the data is already sorted. So a straightforward way to handle this workload would be to sort the data after appending and use the fact that the data is sorted to efficiently compute the median:

def add_new_measurement(data, temperature):
    data.append(temperature)
    data.sort()  # re-sort the whole list after every append


def median(data):
    # The data is kept sorted, so the median can be read directly
    # from the middle of the list in O(1) time.
    n = len(data)
    if n % 2 == 0:
        # Even number of records: average the two middle values.
        return (data[(n // 2) - 1] + data[n // 2]) / 2
    else:
        # Odd number of records: take the middle value.
        return data[n // 2]


import random

data = sorted(random.randint(93, 100) for _ in range(10))
data
[93, 94, 94, 96, 96, 97, 98, 99, 100, 100]
median(data)
96.5
add_new_measurement(data, 97)
data
[93, 94, 94, 96, 96, 97, 97, 98, 99, 100, 100]
median(data)
97

For a dataset of n records, the time complexity of this approach tends to be

O(n log n) for adding new data and O(1) for computing the median

because every append is followed by a full re-sort, while the median is read directly from the middle of the sorted list.

But I think that the time complexity for adding new data can be further improved by using bisect.insort. It finds the insertion point with an O(log n) binary search, and the list insertion itself still shifts the later elements, so it runs in

O(n)

Here's Python code that does that:

import bisect


def add_new_measurement(data, temperature):
    # Insert the new temperature at its sorted position.
    bisect.insort(data, temperature)

Doing this changes the time complexity of adding new data from

O(n log n)

to

O(n)
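
As a quick check, here is the earlier temperature session repeated with the bisect.insort version of add_new_measurement and the median function from above; the starting values are the ones from the previous session, and the results match:

data = [93, 94, 94, 96, 96, 97, 98, 99, 100, 100]
add_new_measurement(data, 97)
data
[93, 94, 94, 96, 96, 97, 97, 98, 99, 100, 100]
median(data)
97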

In conclusion...

In this week's post you learned how to use the bisect module for statistical data binning and adding new data to sorted lists.

For further reading, see the official documentation for Python's bisect module.

My challenge to you:

The bisect module provides "left"- and "right"-variants of bisect.insort and bisect.bisect: bisect.insort_left, bisect.insort_right, bisect.bisect_left, bisect.bisect_right.

Read the documentation and figure out what the difference between the "left"- and "right"-variants is.

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!


(If you spot any errors or typos on this post, contact me via my contact page.)