Return to Blog

Python's heapq module

By John Lekberg on November 01, 2020.


This week's Python blog post is about Python's heapq module. You will learn:

Why you should care about heaps and heapq

You should care about heaps because they are a data structure that allow you to quickly access the smallest (or largest) elements of a dataset, without needing to sort the entire dataset.

E.g. I want to get the 10 smallest numbers from a dataset of 10,000,000 numbers. Sorting-and-slicing takes 3.8 s, but using a heap takes 0.3 s!

import cProfile
import heapq
import random

data = [random.uniform(0, 1) for _ in range(10_000_000)]

cProfile.run("sorted(data)[:10]")
4 function calls in 3.816 seconds
cProfile.run('heapq.nsmallest(10, data)')
143 function calls in 0.289 seconds

Python's heapq module implements binary min-heaps using lists. It provides an API to directly create and manipulate heaps, as well as a higher-level set of utility functions: heapq.nsmallest, heapq.nlargest, and heapq.merge.

Obtaining the smallest (and largest) records from a dataset

If you have dataset, you can obtain the k smallest or largest records by sorting and slicing it. But, it can be more efficient to use heapq.nsmallest and heapq.nlargest.

E.g. I want the 10 smallest records from a dataset with 10,000,000 records:

import random

data = [random.uniform(0, 1) for _ in range(10_000_000)]

sorted(data)[:10]
[0.22276864861022627,
 0.34177789369722766,
 0.37737542995066387,
 0.45103237357931525,
 0.47123333399847134,
 0.47861174271801377,
 0.4913566066457804,
 0.49333286645473995,
 0.584151135096475,
 0.7235123209516772]

And, I could also use heapq.nsmallest to do this:

from heapq import nsmallest

nsmallest(10, data)
[2.4336311188477566e-08,
 2.4653716934608383e-07,
 2.484063295060679e-07,
 3.0462511246831525e-07,
 3.9886920166765094e-07,
 4.797817356738676e-07,
 5.917942023092593e-07,
 8.555260716525126e-07,
 1.207491232779745e-06,
 1.2146108545607603e-06]
nsmallest(10, data) == sorted(data)[:10]
True

But heapq.nsmallest runs an order of magnitude faster. E.g. 0.29 seconds versus 3.91 seconds:

import cProfile

cProfile.run("nsmallest(10, data)")
162 function calls in 0.290 seconds
cProfile.run("sorted(data)[:10]")
4 function calls in 3.910 seconds

This is because the time complexity of grabbing the k smallest records from a dataset of n elements is...

So, should you ever use sorted or list.sort? And what about min and max?

The reason that I say "if k is 'small'" is because -- in theory, even though the time complexity of heapq.nsmallest will always be at least as good as that of sorting, O(n log n) -- in practice, Python's timsort may be quicker when k is close to n.

The best way to figure out if you should use heapq.nsmallest or sorted is to try both and measure the results.

Merging sorted datasets into a single sorted dataset

If you have several sorted datasets and you want to merge them into a single sorted dataset, you can concatenate them and sort the result. (This is called a "k-way merge".) But, you could consume less memory by using heapq.merge.

E.g. I have 7 sorted datasets -- in files on disk -- with 1,000,000 records each:

import random

for n in range(7):
    with open(f"dataset-{n}", "w") as file:
        data = sorted(
            random.uniform(0, 1) for _ in range(1_000_000)
        )
        for record in data:
            print(record, file=file)

I want to merge these datasets into a single sorted dataset -- on disk. I could do this by concatenating the datasets with itertools.chain.from_iterable and then using the built-in function sorted:

from contextlib import ExitStack
from itertools import chain


def merge_sorted():
    with ExitStack() as stack:
        file_out = stack.enter_context(
            open("dataset-out", "w")
        )
        file_ins = [
            stack.enter_context(open(f"dataset-{n}"))
            for n in range(7)
        ]
        data_out = sorted(
            chain.from_iterable(file_ins), key=float
        )
        for record in data_out:
            print(record, file=file_out)

Or, I could also merge these datasets with heapq.merge:

from heapq import merge


def merge_heapq():
    with ExitStack() as stack:
        file_out = stack.enter_context(
            open("dataset-out", "w")
        )
        file_ins = [
            stack.enter_context(open(f"dataset-{n}"))
            for n in range(7)
        ]
        data_out = merge(*file_ins, key=float)
        for record in data_out:
            print(record, file=file_out)

Even though heapq.merge runs a bit slower -- e.g., 10.9 s vs. 6.7 s -- it uses much less memory -- e.g., 144 kB vs. 811,036 kB:

import cProfile

cProfile.run("merge_heapq()")
21033074 function calls in 10.891 seconds
cProfile.run("merge_sorted()")
7033062 function calls in 6.743 seconds
import tracemalloc

tracemalloc.start()
merge_heapq()
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Peak memory: {peak:,} B")
Peak memory: 144,659 B
tracemalloc.start()
merge_sorted()
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Peak memory: {peak:,} B")
Peak memory: 811,036,242 B

The caveat here is that the input datasets must already be sorted.

Creating and manipulating heaps

heapq provides an API for directly creating and manipulating a binary min-heap. heapq represents a binary min-heap as a list, so an empty heap is just an empty list:

[]
[]

A list can be turned into a heap in-place using heapq.heapify:

from heapq import heapify

x = [1, 5, 4, 3, 7, 2]
heapify(x)
x
[1, 3, 2, 5, 7, 4]

The minimum element is the first element of the list:

x[0]
1
x[0] == min(x)
True

You can push elements onto the heap with heapq.heappush, and you can pop elements off of the heap with heapq.heappop:

from heapq import heapify, heappush, heappop

x = [1, 5, 4, 3, 7, 2]
heapify(x)
x
[1, 3, 2, 5, 7, 4]
heappop(x)
1
heappush(x, 8)
heappush(x, -1)
heappop(x)
-1
x
[2, 3, 4, 5, 7, 8]

heapq also provides the shortcut methods heapq.heappushpop and heapq.heapreplace:

If you want to create a fixed-size heap:

If you want to create a max-heap (instead of a min-heap):

In conclusion...

In this week's post, you learned about Python's heapq module. You learned how to directly create and manipulate Binary min-heaps, as well as how to use high-level utility functions to obtain the k smallest (or largest records) from a dataset, and to merge multiple sorted datasets into a single sorted dataset.

My challenge to you:

Using the API provided by heapq, recreate the utility functions heapq.nsmallest, heapq.nlargest and heapq.merge. E.g. here is a reimplementation of heapq.nsmallest:

from heapq import heapify, heappop


def my_nsmallest(n, iterable, *, key=None):
    """My reimplementation of heapq.nsmallest."""
    if key is None:
        data = list(iterable)
    else:
        data = [
            (key(record), record) for record in iterable
        ]

    heapify(data)

    result = []
    for _ in range(n):
        if key is None:
            item = heappop(data)
        else:
            _, item = heappop(data)
        result.append(item)

    return result

If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!


(If you spot any errors or typos on this post, contact me via my contact page.)