Aligner

This module implements a generic Needleman-Wunsch+Gotoh sequence aligner.

class Data(score: float)

Private data class for the Needleman-Wunsch+Gotoh sequence aligner.

__init__(score: float)
score: float

The current score.

p: float

\(P_{m,n}\) in [Gotoh1982].

q: float

\(Q_{m,n}\) in [Gotoh1982].

pSize: int

The size of the p gap. \(k\) in [Gotoh1982].

qSize: int

The size of the q gap. \(k\) in [Gotoh1982].

class Aligner(start_score: float = -1.0, open_score: float = -1.0, extend_score: float = -0.5)

A generic Needleman-Wunsch+Gotoh sequence aligner.

This implementation uses Gotoh’s improvements to get \(\mathcal{O}(mn)\) running time and reduce memory requirements to essentially the backtracking matrix only. In Gotoh’s technique the gap weight formula must be of the special form \(w_k = uk + v\) (affine gap). \(k\) is the gap size, \(v\) is the gap opening score and \(u\) the gap extension score.
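The affine form is what lets the dynamic program extend a gap in constant time: opening a gap costs \(w_1 = u + v\), and each further position adds only \(u\). The following is a minimal sketch of that arithmetic, not code from this module. The function names are hypothetical, and the defaults are taken from the Aligner signature.

```python
def gap_weight(k: int, open_score: float = -1.0, extend_score: float = -0.5) -> float:
    """Affine gap weight w_k = u*k + v, with v = open_score and u = extend_score."""
    return extend_score * k + open_score


def gap_weight_incremental(k: int, open_score: float = -1.0, extend_score: float = -0.5) -> float:
    """The same weight for k >= 1, built one gap position at a time."""
    total = open_score + extend_score  # the first gap position costs w_1 = u + v
    for _ in range(k - 1):
        total += extend_score  # every further position only adds u
    return total


# With the default scores a gap of size 1 weighs -1.5 and a gap of size 4 weighs -3.0.
print(gap_weight(1), gap_weight(4))
```

Because both functions agree for every gap size, an aligner only needs the previous cell's value to decide between opening and extending a gap.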

The aligner is type-agnostic. Whenever it needs to compare two objects, it calls the similarity() callable passed to align() with both objects as arguments. The callable should return a score for matching the two objects. The score should increase with the desirability of the match, but otherwise there are no fixed rules.

The score must harmonize with the penalties for inserting gaps. If the score for opening a gap is -1.0 (the default) then a satisfactory match should return a score > 1.0.

The similarity() function may consult a PAM or BLOSUM matrix, or compute a Hamming distance between the arguments. It may also use auxiliary data such as part-of-speech (POS) tags. In that case, the items being aligned could be dicts containing the word and its POS tag.
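As an illustration, two similarity callables might look like the sketch below. Everything here is an assumption made for the example: the function names, the dict keys "word" and "pos", and the score values, which are chosen only so that a good match scores above 1.0 when the default gap scores are used.

```python
from typing import Dict


def hamming_similarity(a: str, b: str) -> float:
    """Score two strings by their Hamming distance, scaled to [-2.0, 2.0]."""
    if len(a) != len(b):
        return -2.0  # Hamming distance is only defined for equal lengths
    if not a:
        return 2.0  # two empty strings are identical
    distance = sum(ca != cb for ca, cb in zip(a, b))
    # 2.0 for identical strings, -2.0 when every position differs
    return 2.0 * (1.0 - 2.0 * distance / len(a))


def pos_similarity(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Score two tokens given as dicts with hypothetical keys 'word' and 'pos'."""
    if a["word"] == b["word"]:
        return 2.0  # same word: a clearly desirable match
    if a["pos"] == b["pos"]:
        return 0.5  # same part of speech only: a weak match
    return -1.0  # nothing in common: discourage the match
```

Either function has the shape Callable[[object, object], float] that align() expects for its similarity argument.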

__init__(start_score: float = -1.0, open_score: float = -1.0, extend_score: float = -0.5)
start_score: float

The gap opening score at the start of the string. Set this to 0 to find local alignments.

open_score: float

The gap opening score \(v\).

extend_score: float

The gap extension score \(u\).

align(seq_a: Sequence[object], seq_b: Sequence[object], similarity: Callable[[object, object], float], gap_a: Optional[Callable[[], object]] = None, gap_b: Optional[Callable[[], object]] = None) → Tuple[Sequence[object], Sequence[object], float]

Align two sequences.

Parameters
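  • seq_a – the first sequence to align

  • seq_b – the second sequence to align

  • similarity – a callable that returns the similarity score of two objects

  • gap_a – a callable; gap_a() is inserted for each gap in sequence a. If gap_a is None, None is inserted.

  • gap_b – a callable; gap_b() is inserted for each gap in sequence b. If gap_b is None, gap_a() is inserted.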

Returns

the aligned sequences and the score
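To make the idea concrete, here is a self-contained sketch of a global aligner with affine gap weights. It is not this module's implementation. The name align_affine is hypothetical, it keeps three full matrices instead of only a backtracking matrix, it ignores start_score, and it always inserts None for a gap.

```python
from typing import Callable, List, Sequence, Tuple

NEG_INF = float("-inf")


def align_affine(
    seq_a: Sequence[object],
    seq_b: Sequence[object],
    similarity: Callable[[object, object], float],
    open_score: float = -1.0,
    extend_score: float = -0.5,
) -> Tuple[List[object], List[object], float]:
    """Globally align two sequences with affine gap weights w_k = u*k + v."""
    m, n = len(seq_a), len(seq_b)
    first = open_score + extend_score  # w_1: weight of the first gap position
    # d: best score for the two prefixes; p: best score ending with an item
    # of seq_a opposite a gap; q: best score ending with an item of seq_b
    # opposite a gap.
    d = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    p = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    q = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            diag = NEG_INF
            if i > 0:  # open a new gap or extend the one above
                p[i][j] = max(d[i - 1][j] + first, p[i - 1][j] + extend_score)
            if j > 0:  # open a new gap or extend the one to the left
                q[i][j] = max(d[i][j - 1] + first, q[i][j - 1] + extend_score)
            if i > 0 and j > 0:
                diag = d[i - 1][j - 1] + similarity(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = max(diag, p[i][j], q[i][j])

    # Backtrack from the bottom-right corner, tracking the current matrix.
    out_a: List[object] = []
    out_b: List[object] = []
    i, j, state = m, n, "d"
    while i > 0 or j > 0:
        if state == "d":
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + similarity(
                seq_a[i - 1], seq_b[j - 1]
            ):
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
            elif d[i][j] == p[i][j]:
                state = "p"
            else:
                state = "q"
        elif state == "p":
            out_a.append(seq_a[i - 1])
            out_b.append(None)
            stay = p[i][j] == p[i - 1][j] + extend_score  # was the gap extended?
            i, state = i - 1, ("p" if stay else "d")
        else:
            out_a.append(None)
            out_b.append(seq_b[j - 1])
            stay = q[i][j] == q[i][j - 1] + extend_score  # was the gap extended?
            j, state = j - 1, ("q" if stay else "d")
    return out_a[::-1], out_b[::-1], d[m][n]


def word_similarity(a: object, b: object) -> float:
    """Hypothetical scoring: 2.0 for equal words, -1.0 otherwise."""
    return 2.0 if a == b else -1.0


aligned_a, aligned_b, score = align_affine(
    "the quick brown fox".split(), "the brown fox".split(), word_similarity
)
print(aligned_a)  # ['the', 'quick', 'brown', 'fox']
print(aligned_b)  # ['the', None, 'brown', 'fox']
print(score)      # 4.5
```

With the documented API, the corresponding call would presumably be Aligner().align(seq_a, seq_b, similarity), which according to the signature above returns the two aligned sequences and the score.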

build_debug_matrix(matrix: List[List[super_collator.aligner.Data]], len_matrix: List[List[int]], ts_a: Sequence[object], ts_b: Sequence[object]) → str

Build a human-readable debug matrix.

Parameters
  • matrix – the full scoring matrix

  • len_matrix – the backtracking matrix

  • ts_a – the first aligned sequence

  • ts_b – the second aligned sequence

Returns

the debug matrix as a human-readable string