Solving the Sequence Alignment problem in Python
By John Lekberg on October 25, 2020.
This week's post is about solving the "Sequence Alignment" problem. You will learn:
- How to create a brute force solution.
- How to create a more efficient solution using the Needleman-Wunsch algorithm and dynamic programming .
Problem statement
As input, you are given two sequences. E.g.
"CAT"
"CT"
(The sequences can be strings or other arrays of data.)
As output, your goal is to produce an alignment, which pairs up elements of the sequence. E.g.
C - C
A - T
T -
An alignment can have gaps. E.g.
C - C
A -
T - T
While an alignment can have gaps, it cannot change the relative order of the sequence elements. E.g. "CT" cannot be changed into "TC".
Specifically, your goal is to produce an alignment with maximal score. Here's how to calculate the score:
- Score = 0
- Look at each pair of elements:
- If there is a gap, then score -= 1
- Otherwise, if the elements are the same, then score += 1.
- Otherwise, if the elements are different, then score -= 1.
E.g. this alignment has a score of -1:
C - C (same, +1)
A - T (different, -1)
T - (gap, -1)
But this alignment has a score of +1:
C - C (same, +1)
A - (gap, -1)
T - T (same, +1)
So, your goal is to take two sequences and find an alignment with maximal score.
How I represent the problem's data
For the input, I represent the sequences as strings or lists. E.g. I can represent the sequence "CAT" as
"CAT"
or as
["C", "A", "T"]
Really, anything that implements collections.abc.Sequence (not just strings and lists) will work.
For the output, I represent an alignment as a list of tuples of indices (or None, if there is a gap.) E.g. this alignment:
C - C
A - T
T -
would be represented as
[(0, 0), (1, 1), (2, None)]
And this alignment:
C - C
A -
T - T
would be represented as
[(0, 0), (1, None), (2, 1)]
Creating a brute force solution
I like starting with brute force solutions when I work on problems. Brute force solutions tend to be simpler to implement and, fairly often, the brute force solution is "good enough" for the actual inputs that will be encountered.
I start by creating a function that takes two index ranges and iterates over all possible alignments:
from collections import deque
def all_alignments(x, y):
"""Return an iterable of all alignments of two
sequences.
x, y -- Sequences.
"""
def F(x, y):
"""A helper function that recursively builds the
alignments.
x, y -- Sequence indices for the original x and y.
"""
if len(x) == 0 and len(y) == 0:
yield deque()
scenarios = []
if len(x) > 0 and len(y) > 0:
scenarios.append((x[0], x[1:], y[0], y[1:]))
if len(x) > 0:
scenarios.append((x[0], x[1:], None, y))
if len(y) > 0:
scenarios.append((None, x, y[0], y[1:]))
# NOTE: "xh" and "xt" stand for "x-head" and "x-tail",
# with "head" being the front of the sequence, and
# "tail" being the rest of the sequence. Similarly for
# "yh" and "yt".
for xh, xt, yh, yt in scenarios:
for alignment in F(xt, yt):
alignment.appendleft((xh, yh))
yield alignment
alignments = F(range(len(x)), range(len(y)))
return map(list, alignments)
(This code uses: collections.deque, generator function, range, len, map.)
E.g. here are all possible alignments of "CAT" and "CT":
list(all_alignments("CAT", "CT"))
[[(0, 0), (1, 1), (2, None)],
[(0, 0), (1, None), (2, 1)],
[(0, 0), (1, None), (2, None), (None, 1)],
[(0, 0), (1, None), (None, 1), (2, None)],
[(0, 0), (None, 1), (1, None), (2, None)],
[(0, None), (1, 0), (2, 1)],
[(0, None), (1, 0), (2, None), (None, 1)],
[(0, None), (1, 0), (None, 1), (2, None)],
[(0, None), (1, None), (2, 0), (None, 1)],
[(0, None), (1, None), (2, None), (None, 0), (None, 1)],
[(0, None), (1, None), (None, 0), (2, 1)],
[(0, None), (1, None), (None, 0), (2, None), (None, 1)],
[(0, None), (1, None), (None, 0), (None, 1), (2, None)],
[(0, None), (None, 0), (1, 1), (2, None)],
[(0, None), (None, 0), (1, None), (2, 1)],
[(0, None), (None, 0), (1, None), (2, None), (None, 1)],
[(0, None), (None, 0), (1, None), (None, 1), (2, None)],
[(0, None), (None, 0), (None, 1), (1, None), (2, None)],
[(None, 0), (0, 1), (1, None), (2, None)],
[(None, 0), (0, None), (1, 1), (2, None)],
[(None, 0), (0, None), (1, None), (2, 1)],
[(None, 0), (0, None), (1, None), (2, None), (None, 1)],
[(None, 0), (0, None), (1, None), (None, 1), (2, None)],
[(None, 0), (0, None), (None, 1), (1, None), (2, None)],
[(None, 0), (None, 1), (0, None), (1, None), (2, None)]]
Here's a more readable form of the output: ("-" indicates a gap)
def print_alignment(x, y, alignment): print("".join( "-" if i is None else x[i] for i, _ in alignment )) print("".join( "-" if j is None else y[j] for _, j in alignment )) x = "CAT" y = "CT" for alignment in all_alignments(x, y): print_alignment(x, y, alignment) print()
CAT
CT-
CAT
C-T
CAT-
C--T
CA-T
C-T-
C-AT
CT--
CAT
-CT
CAT-
-C-T
CA-T
-CT-
CAT-
--CT
CAT--
---CT
CA-T
--CT
CA-T-
--C-T
CA--T
--CT-
C-AT
-CT-
C-AT
-C-T
C-AT-
-C--T
C-A-T
-C-T-
C--AT
-CT--
-CAT
CT--
-CAT
C-T-
-CAT
C--T
-CAT-
C---T
-CA-T
C--T-
-C-AT
C-T--
--CAT
CT---
Next, I create a function that takes two sequences and an alignment to produce a score:
def alignment_score(x, y, alignment):
"""Score an alignment.
x, y -- sequences.
alignment -- an alignment of x and y.
"""
score_gap = -1
score_same = +1
score_different = -1
score = 0
for i, j in alignment:
if (i is None) or (j is None):
score += score_gap
elif x[i] == y[j]:
score += score_same
elif x[i] != y[j]:
score += score_different
return score
Consider this alignment:
C - C
A -
T - T
Here's an example of computing the score:
x = "CAT" y = "CT" alignment = [(0, 0), (1, None), (2, 1)] alignment_score(x, y, alignment)
1
With these two functions -- all_alignments
and alignment_score
-- the
brute force solution will search all alignments to find one with
a maximal score:
from functools import partial
def align_bf(x, y):
"""Align two sequences, maximizing the
alignment score, using brute force.
x, y -- sequences.
"""
return max(
all_alignments(x, y),
key=partial(alignment_score, x, y),
)
(This code uses: functools.partial, max.)
align_bf("CAT", "CT")
[(0, 0), (1, None), (2, 1)]
print_alignment("CAT", "CT", align_bf("CAT", "CT"))
CAT
C-T
What's the time complexity of this solution? For two sequences of n and m elements:
-
The number of possible alignments is given by the Delannoy numbers. The number of alignments D is given by the recurrence relation
D(n, 0) = 1
D(0, m) = 1
D(n, m) = D(n - 1, m) + D(n, m - 1) + D(n - 1, m - 1)
To give you a sense of how quickly this number grows:
from functools import lru_cache @lru_cache(maxsize=None) def D(n, m): if n == 0 or m == 0: return 1 else: return D(n - 1, m) + D(n, m - 1) + D(n - 1, m - 1)
-
There are 3 possible alignments of two 1-element sequences:
D(1, 1)
3
-
There are 8,097,453 possible alignments of two 10-character sequences:
D(10, 10)
8097453
-
There are 2.05e+75 possible alignments of two 100-character sequences:
D(100, 100)
2053716830872415770228778006271971120334843128349550587141047275840274143041
D(100,100) is getting close the the Eddington number, 10e+80 -- the estimated number of hydrogen atoms in the observable universe.
(See OEIS A001850 for more information on Delannoy numbers of the form D(n,n).)
-
-
Computing the alignment score takes time linear in the sizes of both sequences: O(n + m).
As a result, the overall time complexity of the brute force solution is:
O( D(n, m) × (n + m) )
Creating a more efficient solution
The brute force solution is simple, but it doesn't scale well. In practice, sequence alignment is used to analyze sequences of biological data (e.g. nucleic acid sequences). Given that the size of these sequences can be hundreds or thousands of elements long, there's no way that the brute force solution would work for data of that size.
In 1970, Saul B. Needleman and Christian D. Wunsch created a faster algorithm to solve this problem: the Needleman-Wunsch algorithm. (See "A general method applicable to the search for similarities in the amino acid sequence of two proteins", https://doi.org/10.1016/0022-2836(70)90057-4.) The algorithm uses dynamic programming to solve the sequence alignment problem in O(mn) time.
Here's a Python implementation of the Needleman-Wunsch algorithm, based on section 3 of "Parallel Needleman-Wunsch Algorithm for Grid":
from itertools import product
from collections import deque
def needleman_wunsch(x, y):
"""Run the Needleman-Wunsch algorithm on two sequences.
x, y -- sequences.
Code based on pseudocode in Section 3 of:
Naveed, Tahir; Siddiqui, Imitaz Saeed; Ahmed, Shaftab.
"Parallel Needleman-Wunsch Algorithm for Grid." n.d.
https://upload.wikimedia.org/wikipedia/en/c/c4/ParallelNeedlemanAlgorithm.pdf
"""
N, M = len(x), len(y)
s = lambda a, b: int(a == b)
DIAG = -1, -1
LEFT = -1, 0
UP = 0, -1
# Create tables F and Ptr
F = {}
Ptr = {}
F[-1, -1] = 0
for i in range(N):
F[i, -1] = -i
for j in range(M):
F[-1, j] = -j
option_Ptr = DIAG, LEFT, UP
for i, j in product(range(N), range(M)):
option_F = (
F[i - 1, j - 1] + s(x[i], y[j]),
F[i - 1, j] - 1,
F[i, j - 1] - 1,
)
F[i, j], Ptr[i, j] = max(zip(option_F, option_Ptr))
# Work backwards from (N - 1, M - 1) to (0, 0)
# to find the best alignment.
alignment = deque()
i, j = N - 1, M - 1
while i >= 0 and j >= 0:
direction = Ptr[i, j]
if direction == DIAG:
element = i, j
elif direction == LEFT:
element = i, None
elif direction == UP:
element = None, j
alignment.appendleft(element)
di, dj = direction
i, j = i + di, j + dj
while i >= 0:
alignment.appendleft((i, None))
i -= 1
while j >= 0:
alignment.appendleft((None, j))
j -= 1
return list(alignment)
(This code uses: collections.deque, itertools.product, zip, list.)
needleman_wunsch("CAT", "CT")
[(0, 0), (1, None), (2, 1)]
And so, the faster algorithm will simply call needleman_wunsch
:
def align_fast(x, y): """Align two sequences, maximizing the alignment score, using the Needleman-Wunsch algorithm. x, y -- sequences. """ return needleman_wunsch(x, y) align_fast("CAT", "CT")
[(0, 0), (1, None), (2, 1)]
print_alignment("CAT", "CT", align_fast("CAT", "CT"))
CAT
C-T
What's the time complexity of this solution? For two sequences of n and m elements:
- Creating tables
F
andPtr
takes O(mn) time. - Creating the alignment by navigating from
Ptr[N-1, M-1]
toPtr[0, 0]
takes O(n+m) time.
As a result, the overall time complexity of the algorithm is
O(mn)
In conclusion...
In this week's post, you learned how to solve the "Sequence Alignment" problem. You learned how to create a brute force solution that generates every possible alignment. Then you learned that brute force is infeasible for larger sequences: two 10-element sequences have over 8,000,000 different alignments! Finally, you learned how to reimplement the Needleman-Wunsch algorithm in Python.
My challenge to you:
Modify
needleman_wunsch
to take parameters for the scoring:
score_gap
- how to score a gap. (Default: -1)score_same
- how to score equal elements. (Default: +1)score_different
- how to score non-equal elements. (Default: -1)E.g. you can call the new function like this:
needleman_wunsch( "CAT", "CT", score_gap=-10, score_same=+3, score_different=-1, )
If you enjoyed this week's post, share it with your friends and stay tuned for next week's post. See you then!
(If you spot any errors or typos on this post, contact me via my contact page.)