Python's heapq module

By John Lekberg on November 01, 2020.

This week's Python blog post is about Python's heapq module. You will learn:

Why you should care about heaps and heapq.
How to efficiently obtain the smallest (and largest) records from a dataset.
How to merge multiple sorted datasets into a single sorted dataset.
How to create and manipulate heaps.

Why you should care about heaps and `heapq`

You should care about heaps because they are a data structure that allow you to quickly access the smallest (or largest) elements of a dataset, without needing to sort the entire dataset.

E.g. I want to get the 10 smallest numbers from a dataset of 10,000,000 numbers. Sorting-and-slicing takes 3.8 s, but using a heap takes 0.3 s!

import cProfile
import heapq
import random

data = [random.uniform(0, 1) for _ in range(10_000_000)]

cProfile.run("sorted(data)[:10]")

4 function calls in 3.816 seconds

cProfile.run('heapq.nsmallest(10, data)')

143 function calls in 0.289 seconds

Python's heapq module implements binary min-heaps using lists. It provides an API to directly create and manipulate heaps, as well as a higher-level set of utility functions: heapq.nsmallest, heapq.nlargest, and heapq.merge.

Obtaining the smallest (and largest) records from a dataset

If you have dataset, you can obtain the k smallest or largest records by sorting and slicing it. But, it can be more efficient to use heapq.nsmallest and heapq.nlargest.

E.g. I want the 10 smallest records from a dataset with 10,000,000 records:

import random

data = [random.uniform(0, 1) for _ in range(10_000_000)]

sorted(data)[:10]

[0.22276864861022627,
 0.34177789369722766,
 0.37737542995066387,
 0.45103237357931525,
 0.47123333399847134,
 0.47861174271801377,
 0.4913566066457804,
 0.49333286645473995,
 0.584151135096475,
 0.7235123209516772]

And, I could also use heapq.nsmallest to do this:

from heapq import nsmallest

nsmallest(10, data)

[2.4336311188477566e-08,
 2.4653716934608383e-07,
 2.484063295060679e-07,
 3.0462511246831525e-07,
 3.9886920166765094e-07,
 4.797817356738676e-07,
 5.917942023092593e-07,
 8.555260716525126e-07,
 1.207491232779745e-06,
 1.2146108545607603e-06]

nsmallest(10, data) == sorted(data)[:10]

True

But heapq.nsmallest runs an order of magnitude faster. E.g. 0.29 seconds versus 3.91 seconds:

import cProfile

cProfile.run("nsmallest(10, data)")

162 function calls in 0.290 seconds

cProfile.run("sorted(data)[:10]")

4 function calls in 3.910 seconds

This is because the time complexity of grabbing the k smallest records from a dataset of n elements is...

O(n log n) with sorting and slicing.
O(n + k log n) with heapq.nsmallest. (If k is constant, then this is O(n).)

So, should you ever use sorted or list.sort? And what about min and max?

If you only need the single largest or smallest record (k=1), then use min or max.
If k is "small" -- compared to the size of the dataset -- then use heapq.nsmallest and heapq.nlargest.
Otherwise, use sorted or list.sort.

The reason that I say "if k is 'small'" is because -- in theory, even though the time complexity of heapq.nsmallest will always be at least as good as that of sorting, O(n log n) -- in practice, Python's timsort may be quicker when k is close to n.

The best way to figure out if you should use heapq.nsmallest or sorted is to try both and measure the results.

Merging sorted datasets into a single sorted dataset

If you have several sorted datasets and you want to merge them into a single sorted dataset, you can concatenate them and sort the result. (This is called a "k-way merge".) But, you could consume less memory by using heapq.merge.

E.g. I have 7 sorted datasets -- in files on disk -- with 1,000,000 records each:

import random

for n in range(7):
    with open(f"dataset-{n}", "w") as file:
        data = sorted(
            random.uniform(0, 1) for _ in range(1_000_000)
        )
        for record in data:
            print(record, file=file)

I want to merge these datasets into a single sorted dataset -- on disk. I could do this by concatenating the datasets with itertools.chain.from_iterable and then using the built-in function sorted:

from contextlib import ExitStack
from itertools import chain


def merge_sorted():
    with ExitStack() as stack:
        file_out = stack.enter_context(
            open("dataset-out", "w")
        )
        file_ins = [
            stack.enter_context(open(f"dataset-{n}"))
            for n in range(7)
        ]
        data_out = sorted(
            chain.from_iterable(file_ins), key=float
        )
        for record in data_out:
            print(record, file=file_out)

Or, I could also merge these datasets with heapq.merge:

from heapq import merge


def merge_heapq():
    with ExitStack() as stack:
        file_out = stack.enter_context(
            open("dataset-out", "w")
        )
        file_ins = [
            stack.enter_context(open(f"dataset-{n}"))
            for n in range(7)
        ]
        data_out = merge(*file_ins, key=float)
        for record in data_out:
            print(record, file=file_out)

Even though heapq.merge runs a bit slower -- e.g., 10.9 s vs. 6.7 s -- it uses much less memory -- e.g., 144 kB vs. 811,036 kB:

import cProfile

cProfile.run("merge_heapq()")

21033074 function calls in 10.891 seconds

cProfile.run("merge_sorted()")

7033062 function calls in 6.743 seconds

import tracemalloc

tracemalloc.start()
merge_heapq()
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Peak memory: {peak:,} B")

Peak memory: 144,659 B

tracemalloc.start()
merge_sorted()
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Peak memory: {peak:,} B")

Peak memory: 811,036,242 B

The caveat here is that the input datasets must already be sorted.

Creating and manipulating heaps

heapq provides an API for directly creating and manipulating a binary min-heap. heapq represents a binary min-heap as a list, so an empty heap is just an empty list:

[]

[]

A list can be turned into a heap in-place using heapq.heapify:

from heapq import heapify

x = [1, 5, 4, 3, 7, 2]
heapify(x)
x

[1, 3, 2, 5, 7, 4]

The minimum element is the first element of the list:

x[0]

x[0] == min(x)

True

You can push elements onto the heap with heapq.heappush, and you can pop elements off of the heap with heapq.heappop:

from heapq import heapify, heappush, heappop

x = [1, 5, 4, 3, 7, 2]
heapify(x)
x

[1, 3, 2, 5, 7, 4]

heappop(x)

heappush(x, 8)
heappush(x, -1)
heappop(x)

-1

[2, 3, 4, 5, 7, 8]

heapq also provides the shortcut methods heapq.heappushpop and heapq.heapreplace:

heapq.heappushpop has the same effect as calling heapq.heappush then heapq.heappop. But, the implementation of heappushpop is slightly more efficient than calling heappush and heappop individually.
heapq.heapreplace has the same effect as calling heapq.heappop then heapq.heappush. But, heapreplace is more efficient than calling heappop and heappush individually. And heapreplace does not change the size of the heap.

If you want to create a fixed-size heap:

Keep track of the heap size using len.
When the heap size reaches the "maximum", then switch from using heapq.heappush to using heapq.heapreplace or heapq.heappushpop.

If you want to create a max-heap (instead of a min-heap):

If possible, recast the scenario into the terms of a min-heap. If the records are numeric, negate the numbers (e.g. 5 becomes -5). In general, wrap the records in an object that reverses the ordering. E.g.

from functools import total_ordering


@total_ordering
class ReverseValue:
    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return self.value > other.value

    def __eq__(self, other):
        return self.value == other.value

    def __repr__(self):
        return f"ReverseValue({self.value!r})"


ReverseValue(4) < ReverseValue(3)

True

data = list(map(ReverseValue, range(4)))
data

[ReverseValue(0), ReverseValue(1), ReverseValue(2), ReverseValue(3)]

from heapq import heapify

heapify(data)
data

[ReverseValue(3), ReverseValue(1), ReverseValue(2), ReverseValue(0)]

Otherwise, implement a max-heap. See "What do I use for a max-heap implementation in Python?" via StackOverflow.com for more information.

In conclusion...

In this week's post, you learned about Python's heapq module. You learned how to directly create and manipulate Binary min-heaps, as well as how to use high-level utility functions to obtain the k smallest (or largest records) from a dataset, and to merge multiple sorted datasets into a single sorted dataset.

My challenge to you:

Using the API provided by heapq, recreate the utility functions heapq.nsmallest, heapq.nlargest and heapq.merge. E.g. here is a reimplementation of heapq.nsmallest:

from heapq import heapify, heappop


def my_nsmallest(n, iterable, *, key=None):
    """My reimplementation of heapq.nsmallest."""
    if key is None:
        data = list(iterable)
    else:
        data = [
            (key(record), record) for record in iterable
        ]

    heapify(data)

    result = []
    for _ in range(n):
        if key is None:
            item = heappop(data)
        else:
            _, item = heappop(data)
        result.append(item)

    return result

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!