Sets and Frozen Sets in Python
By John Lekberg on August 08, 2020.
This week's post is about using sets and frozen sets in Python. You will learn:
- What are sets and frozen sets?
- How to use sets for fast membership checking.
- How to use sets to remove duplicate data.
- How to use set operations to simplify your code.
Why should you care about sets?
Sets are useful because membership checks are a lot faster with sets than with lists:
import cProfile data_list = list(range(10_000_000)) cProfile.run('9_999_999 in data_list')
3 function calls in 0.136 seconds
data_set = set(range(10_000_000)) cProfile.run('9_999_999 in data_set')
3 function calls in 0.000 seconds
However, creating a set is a bit slower than creating a list:
import cProfile cProfile.run('list(range(10_000_000))')
3 function calls in 0.520 seconds
cProfile.run('set(range(10_000_000))')
3 function calls in 0.961 seconds
What are sets?
In Python, a set is collection of elements, based on a mathematical set. There are three ways to create a set:
-
Use the built-in function set:
set([1, 2, 3])
{1, 2, 3}
-
List the elements between
{
and}
, comma separated:{1, 2, 3}
{1, 2, 3}
-
Use a set comprehension:
{ n ** 2 for n in range(1, 5) }
{1, 4, 9, 16}
Create an empty set with the expression set()
:
set()
set()
Do not use {}
to create an empty set. Doing this creates a dictionary:
type(set())
set
type({})
dict
Here's how sets are different than Sequences like list and tuple:
-
Sets only contain hashable elements.
{"hello", "there"}
{'hello', 'there'}
{"hello", ["there"]}
TypeError: unhashable type: 'list'
-
Sets can't be subscripted:
data_list = [1, 2, 3] data_list[0]
1
data_set = {1, 2, 3} data_set[0]
TypeError: 'set' object is not subscriptable
Instead you iterate over a set using, e.g., a for-loop:
for x in data_set: print(x)
1 2 3
-
Sets never contain duplicates:
{1, 1, 1, 2, 2, 2, 3}
{1, 2, 3}
-
Sets don't care about what order the elements are in:
[1, 2, 3] == [3, 2, 1]
False
{1, 2, 3} == {3, 2, 1}
True
For the full API, read "Set Types - set, frozenset" via Python.org.
What are frozen sets?
Frozen sets are immutable, hashable sets. Create them with the built-in function frozenset:
frozenset([1, 2, 3])
frozenset({1, 2, 3})
Because frozen sets are hashable, they can be contained in other sets:
{{1}, {2, 3}}
TypeError: unhashable type: 'set'
{frozenset([1]), frozenset([2, 3])}
{frozenset({1}), frozenset({2, 3})}
Because frozen sets are immutable, elements cannot be added or removed:
data_set = {1, 2, 3} data_set.add(5) data_set
{1, 2, 3, 5}
data_fset = frozenset([1, 2, 3]) data_fset.add(5)
AttributeError: 'frozenset' object has no attribute 'add'
Unlike lists and tuples, sets and frozen sets can be compared with each other:
[1, 2, 3] == (1, 2, 3)
False
{1, 2, 3} == frozenset({1, 2, 3})
True
And unlike lists and tuples, sets and frozen sets can be combined with set operations:
[1, 2, 3] + (1, 4, 5)
TypeError: can only concatenate list (not "tuple") to list
{1, 2, 3} | frozenset([1, 4, 5])
{1, 2, 3, 4, 5}
When sets and frozen sets are combined, the type of the result is the type of the left operand:
type(set() | frozenset())
set
type(frozenset() | set())
frozenset
For the full API, read "Set Types - set, frozenset" via Python.org.
Using sets for fast membership checks
Sets use hash tables to store their elements. This means that the time complexity of membership checks is O(1). With a list, membership checks are a linear search and take O(n) time. E.g.
-
Compare 16 seconds:
import cProfile domain = range(50_000) def example_list(): data_list = list(domain) for i in domain: assert i in data_list cProfile.run('example_list()')
4 function calls in 16.138 seconds
-
... to 0.007 seconds:
def example_set(): data_set = set(domain) for i in domain: assert i in data_set cProfile.run('example_set()')
4 function calls in 0.007 seconds
Keep this in mind when writing code that does repeated membership checks.
Using sets to remove duplicates
Here's a simple way to remove duplicates from a list (or any other iterable):
data_list = [1, 1, 1, 2, 2, 2, 1, 1, 3, 2, 1, 1, 3, 1] list(set(data_list))
[1, 2, 3]
If I'm dealing with generators and potentially infinite data, this technique doesn't work:
import random stream = iter(lambda: random.randint(0, 100), None) list(set(stream))
(this code never finishes)
In this case, I usually create a generator function that I
call unique
:
def unique(it): """Remove duplicates from an iterable using a set.""" past = set() for x in it: if x not in past: yield x past.add(x) stream = iter(lambda: random.randint(0, 100), None) unique(stream)
<generator object unique at 0x1111f1c10>
from itertools import islice list(islice(unique(stream), 10))
[96, 75, 25, 58, 49, 9, 64, 77, 95, 20]
Using set operations to simplify your code
Set operations like union, intersection, difference can be used to simplify your code.
Here are some examples:
-
Here's code that combines a set of teachers and teacher assistants to create a set of employees:
teacher = {"T.Karen", "T.Robert"} teacher_assistant = {"S.Susan", "S.John"} employee = set() for person in teacher: employee.add(person) for person in teacher_assistant: employee.add(person) employee
{'S.John', 'S.Susan', 'T.Karen', 'T.Robert'}
Simplify this code by using set.union (the
|
operator):employee = teacher | teacher_assistant employee
{'S.John', 'S.Susan', 'T.Karen', 'T.Robert'}
-
Here's code that combines a set of students and a set of employees to create a set of student-employees:
student = {"S.Bobby", "S.Susan", "S.John"} employee = {"S.John", "S.Susan", "T.Karen", "T.Robert"} student_employee = set() for person in student: if person in employee: student_employee.add(person) student_employee
{'S.John', 'S.Susan'}
Simplify this code by using set.intersection (the
&
operator):student_employee = student & employee student_employee
{'S.John', 'S.Susan'}
-
Here's code that combines a set of students and a set of employees to create a set of students that are not employees:
student = {"S.Bobby", "S.Susan", "S.John"} employee = {"S.John", "S.Susan", "T.Karen", "T.Robert"} student_only = set() for person in student: if person not in employee: student_only.add(person) student_only
{'S.Bobby'}
Simplify this code by using set.difference (the
-
operator):student_only = student - employee student_only
{'S.Bobby'}
For the full API, read "Set Types - set, frozenset" via Python.org.
In conclusion...
In this week's post, you learned about sets and frozen sets. You learned how to use sets for fast membership checking, and how to use sets to remove duplicate data. You also learned how to use set operations to simplify your code.
For more information about sets in Python, see these resources:
- "Set Types — set, frozenset" via Python.org
- "Set Practice: learning from Python's set types" by Luciano Ramalho
setobject.c
by Raymond D. Hettinger
There is relationship between set theory and logic that you may have
noticed.
E.g. x in (A | B)
is equivalent to (x in A) or (x in B)
.
Here's a table of this relationship:
Set Operation | Logical Operation |
---|---|
union | disjunction (OR) |
intersection | conjunction (AND) |
difference | negation (NOT) |
symmetric difference | exclusive or (XOR) |
My challenge to you:
Read the documentation for these operations:
Then write your own versions of these operations. E.g. here's a solution for
set.union
:def union(*iterables): """Return a new set with elements from the set and all others. """ return { x for it in iterables for x in it } union("hello", "there", {1, 2, 3})
{1, 2, 3, 'e', 'h', 'l', 'o', 'r', 't'}
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)