Using Python's bisect module
By John Lekberg on November 21, 2020.
This week's post is about Python's bisect module. You will learn:
- Statistical data binning with bisect.bisect.
- Adding new data to a sorted list with bisect.insort.
The goal of the bisect module is to allow you to efficiently search and update sorted lists. To this end, it provides:
- A binary search implementation, bisect.bisect.
- An "insert" -- as in insertion sort -- implementation, bisect.insort.
Statistical data binning (bisect.bisect)
In statistical data binning, you have data that you want to group into "bins". E.g.
- You have data on different fruits and want to group "lemon" and "orange" into "citrus".
- You have data on students' grades and want to group scores between 80 and 90 into "B", scores between 90 and 100 into "A", etc.
To me, a straightforward approach is to write a function that computes the bin. E.g., binning fruits:
def bin_fruit(fruit):
    if fruit in ["lemon", "orange"]:
        return "citrus"
    elif fruit in ["apple", "pear"]:
        return "malinae"

import random
data = random.choices(
    ["lemon", "orange", "apple", "pear"], k=10
)
data
['lemon',
'lemon',
'apple',
'apple',
'apple',
'lemon',
'apple',
'apple',
'lemon',
'pear']
[bin_fruit(x) for x in data]
['citrus',
'citrus',
'malinae',
'malinae',
'malinae',
'citrus',
'malinae',
'malinae',
'citrus',
'malinae']
E.g., turning test scores into letter grades:
def bin_score(score):
    if 90 <= score <= 100:
        return "A"
    elif 80 <= score < 90:
        return "B"
    elif 70 <= score < 80:
        return "C"
    elif 60 <= score < 70:
        return "D"
    elif score < 60:
        return "F"

import random
data = [random.randint(50, 100) for _ in range(10)]
data
[78, 87, 93, 78, 53, 92, 64, 70, 75, 60]
[bin_score(x) for x in data]
['C', 'B', 'A', 'C', 'F', 'A', 'D', 'C', 'C', 'D']
For a dataset of n records and m bins, the time complexity of this approach tends to be
O(mn),
because binning one record takes O(m) time -- O(m) if-statements are checked -- and n records need to be binned.
If the number of bins m is constant, then the overall time complexity is simply O(n).
But what if the number of bins m is large enough that the straightforward approach is too slow?
For discrete data, you can use dictionaries. E.g., binning fruits:
bins = {
    "lemon": "citrus",
    "orange": "citrus",
    "apple": "malinae",
    "pear": "malinae",
}

def bin_fruit(fruit):
    return bins[fruit]

import random
data = random.choices(
    ["lemon", "orange", "apple", "pear"], k=10
)
data
['lemon', 'pear', 'lemon', 'lemon', 'apple', 'orange', 'apple', 'orange', 'pear', 'apple']
[bin_fruit(x) for x in data]
['citrus', 'malinae', 'citrus', 'citrus', 'malinae', 'citrus', 'malinae', 'citrus', 'malinae', 'malinae']
Binning one record this way takes O(1) time -- regardless of the number of bins m -- because dictionaries are implemented using hash tables. And creating the dictionary takes O(m) time -- one entry for each bin. Thus, the overall time complexity, for n records and m bins, is
O(m + n).
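One wrinkle worth noting (my addition, not from the original example): indexing with bins[fruit] raises a KeyError for a fruit that has no bin. If your data may contain unseen values, dict.get provides a fallback:

def bin_fruit(fruit):
    # Fall back to "unknown" for fruits that have no bin.
    return bins.get(fruit, "unknown")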
And, for continuous data, you can use bisect.bisect. E.g., turning test scores into letter grades:
score_bins = [60, 70, 80, 90]
score_letters = ["F", "D", "C", "B", "A"]

import bisect
def bin_score(score):
    i = bisect.bisect(score_bins, score)
    return score_letters[i]

import random
data = [random.randint(50, 100) for _ in range(10)]
data
[60, 100, 88, 98, 55, 52, 52, 94, 71, 85]
[bin_score(x) for x in data]
['D', 'A', 'B', 'A', 'F', 'F', 'F', 'A', 'C', 'B']
Binning one record this way takes O(log m) time, because bisect.bisect uses binary search to navigate score_bins. And creating score_bins takes O(m) time. Thus, the overall time complexity, for n records and m bins, is
O(m + n log m).
NOTE: If you are able to use 3rd-party libraries, such as NumPy or Pandas, then look at these functions: numpy.digitize, numpy.searchsorted, pandas.Series.searchsorted, pandas.qcut, pandas.cut.
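For example, here is a minimal sketch of the same grade binning using numpy.digitize (assuming NumPy is installed; with the default right=False, a score equal to a cut point goes into the upper bin, matching bisect.bisect above):

import numpy as np

score_bins = [60, 70, 80, 90]
score_letters = np.array(["F", "D", "C", "B", "A"])

scores = np.array([78, 87, 93, 53])
score_letters[np.digitize(scores, score_bins)]
# array(['C', 'B', 'A', 'F'], dtype='<U1')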
Adding new data to a sorted list (bisect.insort)
In this scenario, you have a sorted list of data and you want to:
- Compute order statistics (e.g. compute the median).
- Add new data.
For example, you can imagine having a temperature sensor that regularly reports the temperature. You want to regularly report the median temperature, as well as be able to incorporate new temperature data as it arrives.
To me, a straightforward approach is to use list.append and statistics.median. E.g., monitoring temperatures:
def add_new_measurement(data, temperature):
    data.append(temperature)

def median(data):
    return statistics.median(data)

import random
import statistics
data = [random.randint(93, 100) for _ in range(10)]
data
[98, 94, 97, 94, 94, 98, 98, 94, 94, 95]
median(data)
94.5
add_new_measurement(data, 97)
data
[98, 94, 97, 94, 94, 98, 98, 94, 94, 95, 97]
median(data)
95
For a dataset of n records, the time complexity of this approach tends to be
- Computing median: O(n) or O(n log n) -- depending on how the median is computed.
- Adding new data: O(1).
So if the workload requires frequently adding new data and infrequently computing the median, this approach works well.
But what if the workload is inverted? For example, temperature measurements might arrive every couple of minutes, while the median must be computed every couple of seconds.
In that scenario, I think we would be better suited by speeding up computing the median, even at the expense of slowing down adding new data.
Computing order statistics can be done in O(1) time if the data is already sorted. So a straightforward way to handle this workload would be to sort the data after appending and use the fact that the data is sorted to efficiently compute the median:
def add_new_measurement(data, temperature):
    data.append(temperature)
    data.sort()

def median(data):
    n = len(data)
    if n % 2 == 0:
        return (data[(n // 2) - 1] + data[n // 2]) / 2
    else:
        return data[n // 2]

import random
data = sorted(random.randint(93, 100) for _ in range(10))
data
[93, 94, 94, 96, 96, 97, 98, 99, 100, 100]
median(data)
96.5
add_new_measurement(data, 97)
data
[93, 94, 94, 96, 96, 97, 97, 98, 99, 100, 100]
median(data)
97
For a dataset of n records, the time complexity of this approach tends to be
- Computing median: O(1).
- Adding new data: O(n log n).
But I think that the time complexity for adding new data can further be improved by using bisect.insort, which runs in
O(n).
Here's Python code that does that:
import bisect

def add_new_measurement(data, temperature):
    # Binary search for the insertion point, then insert:
    # the list stays sorted without a full re-sort.
    bisect.insort(data, temperature)
Doing this changes the time complexity of adding new data from
O(n log n)
to
O(n).
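As a quick check -- a usage example I've added -- the list stays sorted after each insert, so the O(1) median from above keeps working:

import random

data = sorted(random.randint(93, 100) for _ in range(10))
add_new_measurement(data, 97)
data == sorted(data)  # True -- still sorted, with no explicit sort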
In conclusion...
In this week's post you learned how to use the bisect module for statistical data binning and adding new data to sorted lists.
For further reading:
- Algorithms that process input piece-by-piece -- like the temperature sensor scenario -- are called online algorithms.
- bisect.insort can be used to implement an Insertion sort. Insertion sort is an online algorithm and -- along with Python's built-in timsort -- is an adaptive sorting algorithm because it takes advantage of data that is partially sorted.
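To illustrate that last point, here's a minimal insertion sort built on bisect.insort (my own sketch, not from the post's references):

import bisect

def insertion_sort(iterable):
    result = []
    for item in iterable:
        bisect.insort(result, item)  # keep result sorted as items arrive
    return result

insertion_sort([3, 1, 4, 1, 5])  # [1, 1, 3, 4, 5]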
My challenge to you:
The bisect module provides "left"- and "right"-variants of bisect.insort and bisect.bisect: bisect.insort_left, bisect.insort_right, bisect.bisect_left, bisect.bisect_right.
Read the documentation and figure out what the difference between the "left"- and "right"-variants is.
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)