Python's heapq module
By John Lekberg on November 01, 2020.
This week's Python blog post is about Python's heapq module. You will learn:
- Why you should care about heaps and
heapq
. - How to efficiently obtain the smallest (and largest) records from a dataset.
- How to merge multiple sorted datasets into a single sorted dataset.
- How to create and manipulate heaps.
Why you should care about heaps and heapq
You should care about heaps because they are a data structure that allow you to quickly access the smallest (or largest) elements of a dataset, without needing to sort the entire dataset.
E.g. I want to get the 10 smallest numbers from a dataset of 10,000,000 numbers. Sorting-and-slicing takes 3.8 s, but using a heap takes 0.3 s!
import cProfile import heapq import random data = [random.uniform(0, 1) for _ in range(10_000_000)] cProfile.run("sorted(data)[:10]")
4 function calls in 3.816 seconds
cProfile.run('heapq.nsmallest(10, data)')
143 function calls in 0.289 seconds
Python's heapq module implements binary min-heaps using lists. It provides an API to directly create and manipulate heaps, as well as a higher-level set of utility functions: heapq.nsmallest, heapq.nlargest, and heapq.merge.
Obtaining the smallest (and largest) records from a dataset
If you have dataset, you can obtain the k smallest or largest records by sorting and slicing it. But, it can be more efficient to use heapq.nsmallest and heapq.nlargest.
E.g. I want the 10 smallest records from a dataset with 10,000,000 records:
import random data = [random.uniform(0, 1) for _ in range(10_000_000)] sorted(data)[:10]
[0.22276864861022627,
0.34177789369722766,
0.37737542995066387,
0.45103237357931525,
0.47123333399847134,
0.47861174271801377,
0.4913566066457804,
0.49333286645473995,
0.584151135096475,
0.7235123209516772]
And, I could also use heapq.nsmallest to do this:
from heapq import nsmallest nsmallest(10, data)
[2.4336311188477566e-08,
2.4653716934608383e-07,
2.484063295060679e-07,
3.0462511246831525e-07,
3.9886920166765094e-07,
4.797817356738676e-07,
5.917942023092593e-07,
8.555260716525126e-07,
1.207491232779745e-06,
1.2146108545607603e-06]
nsmallest(10, data) == sorted(data)[:10]
True
But heapq.nsmallest runs an order of magnitude faster. E.g. 0.29 seconds versus 3.91 seconds:
import cProfile cProfile.run("nsmallest(10, data)")
162 function calls in 0.290 seconds
cProfile.run("sorted(data)[:10]")
4 function calls in 3.910 seconds
This is because the time complexity of grabbing the k smallest records from a dataset of n elements is...
- O(n log n) with sorting and slicing.
- O(n + k log n) with heapq.nsmallest. (If k is constant, then this is O(n).)
So, should you ever use sorted or list.sort? And what about min and max?
-
If you only need the single largest or smallest record (k=1), then use min or max.
-
If k is "small" -- compared to the size of the dataset -- then use heapq.nsmallest and heapq.nlargest.
The reason that I say "if k is 'small'" is because -- in theory, even though the time complexity of heapq.nsmallest will always be at least as good as that of sorting, O(n log n) -- in practice, Python's timsort may be quicker when k is close to n.
The best way to figure out if you should use heapq.nsmallest or sorted is to try both and measure the results.
Merging sorted datasets into a single sorted dataset
If you have several sorted datasets and you want to merge them into a single sorted dataset, you can concatenate them and sort the result. (This is called a "k-way merge".) But, you could consume less memory by using heapq.merge.
E.g. I have 7 sorted datasets -- in files on disk -- with 1,000,000 records each:
import random
for n in range(7):
with open(f"dataset-{n}", "w") as file:
data = sorted(
random.uniform(0, 1) for _ in range(1_000_000)
)
for record in data:
print(record, file=file)
I want to merge these datasets into a single sorted dataset -- on disk. I could do this by concatenating the datasets with itertools.chain.from_iterable and then using the built-in function sorted:
from contextlib import ExitStack
from itertools import chain
def merge_sorted():
with ExitStack() as stack:
file_out = stack.enter_context(
open("dataset-out", "w")
)
file_ins = [
stack.enter_context(open(f"dataset-{n}"))
for n in range(7)
]
data_out = sorted(
chain.from_iterable(file_ins), key=float
)
for record in data_out:
print(record, file=file_out)
Or, I could also merge these datasets with heapq.merge:
from heapq import merge
def merge_heapq():
with ExitStack() as stack:
file_out = stack.enter_context(
open("dataset-out", "w")
)
file_ins = [
stack.enter_context(open(f"dataset-{n}"))
for n in range(7)
]
data_out = merge(*file_ins, key=float)
for record in data_out:
print(record, file=file_out)
Even though heapq.merge runs a bit slower -- e.g., 10.9 s vs. 6.7 s -- it uses much less memory -- e.g., 144 kB vs. 811,036 kB:
import cProfile cProfile.run("merge_heapq()")
21033074 function calls in 10.891 seconds
cProfile.run("merge_sorted()")
7033062 function calls in 6.743 seconds
import tracemalloc tracemalloc.start() merge_heapq() _, peak = tracemalloc.get_traced_memory() tracemalloc.stop() print(f"Peak memory: {peak:,} B")
Peak memory: 144,659 B
tracemalloc.start() merge_sorted() _, peak = tracemalloc.get_traced_memory() tracemalloc.stop() print(f"Peak memory: {peak:,} B")
Peak memory: 811,036,242 B
The caveat here is that the input datasets must already be sorted.
Creating and manipulating heaps
heapq provides an API for directly creating and manipulating a binary min-heap. heapq represents a binary min-heap as a list, so an empty heap is just an empty list:
[]
[]
A list can be turned into a heap in-place using heapq.heapify:
from heapq import heapify x = [1, 5, 4, 3, 7, 2] heapify(x) x
[1, 3, 2, 5, 7, 4]
The minimum element is the first element of the list:
x[0]
1
x[0] == min(x)
True
You can push elements onto the heap with heapq.heappush, and you can pop elements off of the heap with heapq.heappop:
from heapq import heapify, heappush, heappop x = [1, 5, 4, 3, 7, 2] heapify(x) x
[1, 3, 2, 5, 7, 4]
heappop(x)
1
heappush(x, 8) heappush(x, -1) heappop(x)
-1
x
[2, 3, 4, 5, 7, 8]
heapq also provides the shortcut methods heapq.heappushpop and heapq.heapreplace:
-
heapq.heappushpop has the same effect as calling heapq.heappush then heapq.heappop. But, the implementation of
heappushpop
is slightly more efficient than callingheappush
andheappop
individually. -
heapq.heapreplace has the same effect as calling heapq.heappop then heapq.heappush. But,
heapreplace
is more efficient than callingheappop
andheappush
individually. Andheapreplace
does not change the size of the heap.
If you want to create a fixed-size heap:
- Keep track of the heap size using len.
- When the heap size reaches the "maximum", then switch from using heapq.heappush to using heapq.heapreplace or heapq.heappushpop.
If you want to create a max-heap (instead of a min-heap):
-
If possible, recast the scenario into the terms of a min-heap. If the records are numeric, negate the numbers (e.g. 5 becomes -5). In general, wrap the records in an object that reverses the ordering. E.g.
from functools import total_ordering @total_ordering class ReverseValue: def __init__(self, value): self.value = value def __lt__(self, other): return self.value > other.value def __eq__(self, other): return self.value == other.value def __repr__(self): return f"ReverseValue({self.value!r})" ReverseValue(4) < ReverseValue(3)
True
data = list(map(ReverseValue, range(4))) data
[ReverseValue(0), ReverseValue(1), ReverseValue(2), ReverseValue(3)]
from heapq import heapify heapify(data) data
[ReverseValue(3), ReverseValue(1), ReverseValue(2), ReverseValue(0)]
-
Otherwise, implement a max-heap. See "What do I use for a max-heap implementation in Python?" via StackOverflow.com for more information.
In conclusion...
In this week's post, you learned about Python's heapq module. You learned how to directly create and manipulate Binary min-heaps, as well as how to use high-level utility functions to obtain the k smallest (or largest records) from a dataset, and to merge multiple sorted datasets into a single sorted dataset.
My challenge to you:
Using the API provided by heapq, recreate the utility functions heapq.nsmallest, heapq.nlargest and heapq.merge. E.g. here is a reimplementation of heapq.nsmallest:
from heapq import heapify, heappop def my_nsmallest(n, iterable, *, key=None): """My reimplementation of heapq.nsmallest.""" if key is None: data = list(iterable) else: data = [ (key(record), record) for record in iterable ] heapify(data) result = [] for _ in range(n): if key is None: item = heappop(data) else: _, item = heappop(data) result.append(item) return result
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)