# Using Python's bisect module

By John Lekberg on November 21, 2020.


This week's post is about Python's bisect module. You will learn:

• How to do statistical data binning with bisect.bisect.
• How to add new data to a sorted list with bisect.insort.

The goal of the bisect module is to allow you to efficiently search and update sorted lists. To this end, it provides:

• bisect.bisect (and the variants bisect.bisect_left and bisect.bisect_right), which use binary search to find the index where a value belongs in a sorted list.
• bisect.insort (and the variants bisect.insort_left and bisect.insort_right), which insert a value into a sorted list while keeping it sorted.
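A minimal sketch of the module's two workhorse functions:

```python
import bisect

data = [10, 20, 30]

# bisect.bisect returns the index where a value would be
# inserted to keep the list sorted.
print(bisect.bisect(data, 25))  # 2

# bisect.insort inserts the value at that index.
bisect.insort(data, 25)
print(data)  # [10, 20, 25, 30]
```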

# Statistical data binning (bisect.bisect)

In statistical data binning, you have data that you want to group into "bins". E.g.

• You have data on different fruits and want to group "lemon" and "orange" into "citrus".
• You have data on students' grades and want to group scores between 80 and 90 into "B", scores between 90 and 100 into "A", etc.

To me, a straightforward approach is to write a function that computes the bin. E.g., binning fruits:

``````
def bin_fruit(fruit):
    if fruit in ["lemon", "orange"]:
        return "citrus"
    elif fruit in ["apple", "pear"]:
        return "malinae"

import random

data = random.choices(
    ["lemon", "orange", "apple", "pear"], k=10
)
data
``````
``````['lemon',
'lemon',
'apple',
'apple',
'apple',
'lemon',
'apple',
'apple',
'lemon',
'pear']
``````
``````[bin_fruit(x) for x in data]
``````
``````['citrus',
'citrus',
'malinae',
'malinae',
'malinae',
'citrus',
'malinae',
'malinae',
'citrus',
'malinae']
``````

E.g., turning test scores into letter grades:

``````
def bin_score(score):
    if 90 <= score <= 100:
        return "A"
    elif 80 <= score < 90:
        return "B"
    elif 70 <= score < 80:
        return "C"
    elif 60 <= score < 70:
        return "D"
    elif score < 60:
        return "F"

import random

data = [random.randint(50, 100) for _ in range(10)]
data
``````
``````[78, 87, 93, 78, 53, 92, 64, 70, 75, 60]
``````
``````[bin_score(x) for x in data]
``````
``````['C', 'B', 'A', 'C', 'F', 'A', 'D', 'C', 'C', 'D']
``````

For a dataset of n records and m bins, the time complexity of this approach tends to be

O(mn)

Because binning one record takes O(m) time -- O(m) if-statements are checked -- and n records need to be binned.

If the number of bins m is constant, then the overall time complexity is simply O(n).

But what if the number of bins m is large enough that the straightforward approach is too slow?

• For discrete data, you can use dictionaries. E.g., binning fruits:

``````
bins = {
    "lemon": "citrus",
    "orange": "citrus",
    "apple": "malinae",
    "pear": "malinae",
}

def bin_fruit(fruit):
    return bins[fruit]

import random

data = random.choices(
    ["lemon", "orange", "apple", "pear"], k=10
)
data
``````
``````['lemon',
'pear',
'lemon',
'lemon',
'apple',
'orange',
'apple',
'orange',
'pear',
'apple']
``````
``````[bin_fruit(x) for x in data]
``````
``````['citrus',
'malinae',
'citrus',
'citrus',
'malinae',
'citrus',
'malinae',
'citrus',
'malinae',
'malinae']
``````

Binning one record this way takes O(1) time -- regardless of the number of bins m -- because dictionaries are implemented using hash tables. And creating the dictionary takes O(m) time -- one entry for each bin. Thus, the overall time complexity, for n records and m bins, is

O(m + n).

• And, for continuous data, you can use bisect.bisect. E.g., turning test scores into letter grades:

``````
score_bins = [60, 70, 80, 90]
score_letters = ["F", "D", "C", "B", "A"]

import bisect

def bin_score(score):
    i = bisect.bisect(score_bins, score)
    return score_letters[i]

import random

data = [random.randint(50, 100) for _ in range(10)]
data
``````
``````[60, 100, 88, 98, 55, 52, 52, 94, 71, 85]
``````
``````[bin_score(x) for x in data]
``````
``````['D', 'A', 'B', 'A', 'F', 'F', 'F', 'A', 'C', 'B']
``````

Binning one record this way takes O(log m) time, because bisect.bisect uses binary search to navigate `score_bins`. And creating `score_bins` takes O(m) time. Thus, the overall time complexity, for n records and m bins, is

O(m + n log m)
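One detail worth noting: bisect.bisect is an alias for bisect.bisect_right, so a score that falls exactly on a bin boundary lands in the upper bin -- which is what we want here, since a score of exactly 90 should earn an "A". A short sketch of the boundary behavior:

```python
import bisect

score_bins = [60, 70, 80, 90]
score_letters = ["F", "D", "C", "B", "A"]

# bisect (alias for bisect_right) puts a boundary value in the upper bin:
print(score_letters[bisect.bisect(score_bins, 90)])       # A

# bisect_left would put it in the lower bin instead:
print(score_letters[bisect.bisect_left(score_bins, 90)])  # B
```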

NOTE: If you are able to use 3rd-party libraries, such as NumPy or Pandas, then look at these functions: numpy.digitize, numpy.searchsorted, pandas.Series.searchsorted, pandas.qcut, pandas.cut.
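For instance, here is a sketch of the same grade binning with numpy.digitize (assuming NumPy is installed). With increasing bins, its default boundary convention matches bisect.bisect:

```python
import numpy as np

score_bins = [60, 70, 80, 90]
score_letters = np.array(["F", "D", "C", "B", "A"])

scores = np.array([60, 100, 88, 55])
# numpy.digitize bins a whole array at once; boundary values
# (like 60) go to the upper bin, same as bisect.bisect.
print(score_letters[np.digitize(scores, score_bins)])  # ['D' 'A' 'B' 'F']
```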

# Adding new data to a sorted list (bisect.insort)

In this scenario, you have a sorted list of data and you want to:

• Add new data to the list, keeping it sorted.
• Compute statistics (such as the median) on the list.

For example, you can imagine having a temperature sensor that regularly reports the temperature. You want to regularly report the median temperature, as well as be able to incorporate new temperature data as it arrives.

To me, a straightforward approach is to use list.append and statistics.median. E.g., monitoring temperatures:

``````
def add_new_measurement(data, temperature):
    data.append(temperature)

def median(data):
    return statistics.median(data)

import random
import statistics

data = [random.randint(93, 100) for _ in range(10)]
data
``````
``````[98, 94, 97, 94, 94, 98, 98, 94, 94, 95]
``````
``````median(data)
``````
``````94.5
``````
``````add_new_measurement(data, 97)
data
``````
``````[98, 94, 97, 94, 94, 98, 98, 94, 94, 95, 97]
``````
``````median(data)
``````
``````95
``````

For a dataset of n records, the time complexity of this approach tends to be

• Adding new data: O(1) -- list.append runs in amortized constant time.
• Computing median: O(n) or O(n log n) -- depending on how the median is computed.

So if the workload requires frequently adding new data and infrequently computing the median, this approach works well.

But what if the workload is inverted? For example, if temperature measurements are received every couple of minutes, but the median must be computed every couple of seconds.

In that scenario, I think we would be better suited by speeding up computing the median, even at the expense of slowing down adding new data.

Computing order statistics can be done in O(1) time if the data is already sorted. So a straightforward way to handle this workload would be to sort the data after appending and use the fact that the data is sorted to efficiently compute the median:

``````
def add_new_measurement(data, temperature):
    data.append(temperature)
    data.sort()

def median(data):
    n = len(data)
    if n % 2 == 0:
        return (data[(n // 2) - 1] + data[n // 2]) / 2
    else:
        return data[n // 2]

import random

data = sorted(random.randint(93, 100) for _ in range(10))
data
``````
``````[93, 94, 94, 96, 96, 97, 98, 99, 100, 100]
``````
``````median(data)
``````
``````96.5
``````
``````add_new_measurement(data, 97)
data
``````
``````[93, 94, 94, 96, 96, 97, 97, 98, 99, 100, 100]
``````
``````median(data)
``````
``````97
``````

For a dataset of n records, the time complexity of this approach tends to be

• Computing median: O(1).
• Adding new data: O(n log n).

But I think that the time complexity for adding new data can further be improved by using bisect.insort, which runs in

O(n)

because it uses an O(log n) binary search to find the insertion point, followed by an O(n) list insertion.

Here's Python code that does that:

``````
import bisect

def add_new_measurement(data, temperature):
    bisect.insort(data, temperature)
``````

Doing this changes the time complexity of adding new data from

O(n log n)

to

O(n)
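As a quick end-to-end sketch: inserting with bisect.insort keeps the list sorted, so the O(1) index-based median from above still works.

```python
import bisect

data = [93, 94, 96, 98]
bisect.insort(data, 97)
print(data)  # [93, 94, 96, 97, 98]

# Median in O(1): the list is still sorted after the insert.
n = len(data)
mid = (data[n // 2 - 1] + data[n // 2]) / 2 if n % 2 == 0 else data[n // 2]
print(mid)  # 96
```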

# In conclusion...

In this week's post you learned how to use the bisect module for statistical data binning and adding new data to sorted lists.