Multiple sequence comparison of arbitrary string with oriented characters
The core problem is:
I am looking for an algorithm to calculate a maximum parsimonious distance between a set of strings. With distance I mean something similar to the Damerau–Levenshtein distance i.e. the minimal number of deletions, insertions, substitution and transposition of characters or adjacent characters blocks. But instead of regular strings I want to investigate strings with oriented characters.
Thus, a string could look like:
- (A,1) (B,1) (C,1) (D,1)
and possible derivatives could be:
- (A,1) (C,0) (B,0) (D,1)
- (A,1) (C,1) (B,1) (D,1)
- (A,1) (B,0) (C,0) (D,1)
Where A,B,C,D are character identities and 1 = forward and 0 = reverse.
Here, derivative 1. would have the distance 2, since you could cut out block BC and re paste it inverted (1 cut, 1 paste). Derivative 2. would have 2 as well, since you may cut out C and re paste it in front of B (1 cut, 1 Paste) whereas number 3. would require 4 operations (2 cuts, 2 pastes) to be transformed. Analogous, deletion or insertion of blocks would yield a distance 1.
If you would define (X,0) and (X,1) as two different non oriented characters (X0, X1) for all possible X, example 3. would result in a distance of 2 since you could then cut out the block B1C1 and insert the block B0C0 in two steps.
A real world example:
The genes in a bacterial genome could be considered the oriented character (A,0), (B,0)... By determining the sequence distance, the genomic orientation of homologue genes in two related bacteria could be used as a evolutionary marker trace. The fact that bacterial genomes are circular strings introduces the additional border condition ABC is equal to BCA.
Real genomes do have unique genes with no equivalent in a partner thus giving rise to a place holder character @. Those place holders reduce the information content of the comparison to a lower bound, since e.g. (A,1)(B,1)@(C,1) can be transformed to (A,1)@@@(B,1)@(C,1) by insertion of the block @@@. However the orientation restores partially the information content since you may find (A,1)@@@(B,0)@(C,1) indicating a minimum distance of 3. Even better would be an algorithm to compare multiple related sequences (genomes) simultaneously, since you could then find intermediates in the evolutionary history, which increases resolution.
I realize, there are several questions already posted on text string comparison. But they fail to be easily expandable to include the orientation. In addition, there exists a wealth of methods to deal with biological sequences, in particular for multiple sequence analysis. However, those are limited to macromolecule sequences which do not exist in alternate orientations and usually invoke specific weights for any particular character match.
If there exists already a python library which would allow for the necessary customization to solve that problem, that would be fantastic. But any suitable orientation aware algorithm would be very helpful.
I believe the following code can help you out:
from itertools import permutations from random import randint from pprint import pprint def generate_genes(): """ Generates a boilerplate list of genes @rtype : list """ tuple_list =  for i in range(16): binary_var = bin(i)[2:] if len(binary_var) != 4: binary_var = "0" * (4 - len(binary_var)) + binary_var tuple_list.append([('A', (1 if binary_var == '1' else 0)), ('B', (1 if binary_var == '1' else 0)), ('C', (1 if binary_var == '1' else 0)), ('D', (1 if binary_var == '1' else 0))]) return tuple_list def all_possible_genes(): """ Generates all possible combinations of ABCD genes @return: returns a list of combinations @rtype: tuple """ gene_list = generate_genes() all_possible_permutations =  for gene in gene_list: all_possible_permutations.append([var for var in permutations(gene)]) return all_possible_permutations def gene_stringify(gene_tuple): """ @type gene_tuple : tuple @param gene_tuple: The gene tuple generated """ return "".join(str(var) for var in gene_tuple if var) def dameraulevenshtein(seq1, seq2): """Calculate the Damerau-Levenshtein distance between sequences. This distance is the number of additions, deletions, substitutions, and transpositions needed to transform the first sequence into the second. Although generally used with strings, any sequences of comparable objects will work. Transpositions are exchanges of *consecutive* characters; all other operations are self-explanatory. This implementation is O(N*M) time and O(M) space, for N and M the lengths of the two sequences. >>> dameraulevenshtein('ba', 'abc') 2 >>> dameraulevenshtein('fee', 'deed') 2 It works with arbitrary sequences too: >>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e']) 2 """ # codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F # Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix. # However, only the current and two previous rows are needed at once, # so we only store those. oneago = None thisrow = range(1, len(seq2) + 1) +  for x in xrange(len(seq1)): # Python lists wrap around for negative indices, so put the # leftmost column at the *end* of the list. This matches with # the zero-indexed strings and saves extra calculation. twoago, oneago, thisrow = oneago, thisrow,  * len(seq2) + [x + 1] for y in xrange(len(seq2)): delcost = oneago[y] + 1 addcost = thisrow[y - 1] + 1 subcost = oneago[y - 1] + (seq1[x] != seq2[y]) thisrow[y] = min(delcost, addcost, subcost) # This block deals with transpositions if (x > 0 and y > 0 and seq1[x] == seq2[y - 1] and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]): thisrow[y] = min(thisrow[y], twoago[y - 2] + 1) return thisrow[len(seq2) - 1] if __name__ == '__main__': genes = all_possible_genes() list1 = genes[randint(0, 15)][randint(0, 23)] list2 = genes[randint(0, 15)][randint(0, 23)] print gene_stringify(list1) pprint(list1) print gene_stringify(list2) pprint(list2) print dameraulevenshtein(gene_stringify(list1), gene_stringify(list2))