双序列比对

成对序列比对是通过优化两个序列之间的相似性评分来将两个序列相互比对的过程。的 Bio.Align 模块包含 PairwiseAligner 使用Needleman-Wunsch、Smith-Waterman、Gotoh(三状态)和Waterman-Smith-Beyer全局和局部成对对齐算法以及快速最优全局对齐算法(FOGMAA)进行全局和局部对齐,并具有多种更改对齐参数的选项。我们指的是德宾 et al. [Durbin1998] 了解有关序列比对算法的深入信息。

基本用法

要生成成对比对,首先创建 PairwiseAligner 对象:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()

PairwiseAligner 对象 aligner (see部分  成对对齐器对象 )存储用于成对比对的比对参数。这些属性可以在对象的构造函数中设置:

>>> aligner = Align.PairwiseAligner(match_score=1.0)

或在制作对象后:

>>> aligner.match_score = 1.0

使用 aligner.score 计算两个序列之间比对评分的方法:

>>> target = "GAACT"
>>> query = "GAT"
>>> score = aligner.score(target, query)
>>> score
3.0

aligner.align 方法返回一个 PairwiseAlignments 对象,它是找到的对齐的迭代器。 PairwiseAlignments 对象会告诉您找到了多少个路线,以及它们的分数是多少:

>>> alignments = aligner.align(target, query)
>>> alignments
<PairwiseAlignments object (2 alignments; score=3) at 0x...>

两个序列之间的每次比对都存储在 Alignment object. Alignment 对象可以通过迭代对齐或索引来获得:

>>> alignment = alignments[0]
>>> alignment
<Alignment object (2 rows x 5 columns) at 0x...>

遍历 Alignment 对象并打印它们以查看对齐方式:

>>> for alignment in alignments:
...     print(alignment)
...
target            0 GAACT 5
                  0 ||--| 5
query             0 GA--T 3

target            0 GAACT 5
                  0 |-|-| 5
query             0 G-A-T 3

使用索引获取对齐的序列(请参阅 对路线进行切片和索引 ):

>>> alignment[0]
'GAACT'
>>> alignment[1]
'G-A-T'

每个对齐都存储对齐分数:

>>> alignment.score
3.0

以及指向对齐序列的指针:

>>> alignment.target
'GAACT'
>>> alignment.query
'GAT'

在内部,对齐按照序列坐标存储:

>>> alignment = alignments[0]
>>> alignment.coordinates
array([[0, 2, 4, 5],
       [0, 2, 2, 3]])

这里,这两行指的是目标和查询序列。这些坐标表明路线由以下三个区块组成:

  • target[0:2] 对准以 query[0:2] ;

  • target[2:4] 与一个缺口对齐,因为 query[2:2] 是空字符串(即,删除);

  • target[4:5] 对准以 query[2:3] .

对于成对比对,比对序列的数量始终为2:

>>> len(alignment)
2

对齐长度定义为打印对齐中的列数。这等于目标和查询中的匹配数、不匹配数和间隔总长度之和:

>>> alignment.length
5

aligned 属性返回对齐序列的开始和结束索引,并返回第一次对齐的两个长度为2的二元组:

>>> alignment.aligned
array([[[0, 2],
        [4, 5]],

       [[0, 2],
        [2, 3]]])

而对于替代对齐,返回两个长度为3的二元组:

>>> alignment = alignments[1]
>>> print(alignment)
target            0 GAACT 5
                  0 |-|-| 5
query             0 G-A-T 3

>>> alignment.aligned
array([[[0, 1],
        [2, 3],
        [4, 5]],

       [[0, 1],
        [1, 2],
        [2, 3]]])

请注意,不同的排列可能具有彼此排列的相同子序列。特别是,如果比对仅在间隙位置方面彼此不同,则可能会发生这种情况:

>>> aligner.mode = "global"
>>> aligner.mismatch_score = -10
>>> alignments = aligner.align("AAACAAA", "AAAGAAA")
>>> len(alignments)
2
>>> print(alignments[0])
target            0 AAAC-AAA 7
                  0 |||--||| 8
query             0 AAA-GAAA 7

>>> alignments[0].aligned
array([[[0, 3],
        [4, 7]],

       [[0, 3],
        [4, 7]]])
>>> print(alignments[1])
target            0 AAA-CAAA 7
                  0 |||--||| 8
query             0 AAAG-AAA 7

>>> alignments[1].aligned
array([[[0, 3],
        [4, 7]],

       [[0, 3],
        [4, 7]]])

map 方法可以应用于成对对齐 alignment1 查找查询的成对对齐 alignment2 目标数是 alignment1 ,目标是 alignment2 以及询问 alignment1 是相同的。一个典型的例子是 alignment1 是染色体和转录物之间的成对排列, alignment2 是转录物和序列之间的成对比对(例如,RN-seq读取),我们想要找到序列与染色体的比对:

>>> aligner.mode = "local"
>>> aligner.open_gap_score = -1
>>> aligner.extend_gap_score = 0
>>> chromosome = "AAAAAAAACCCCCCCAAAAAAAAAAAGGGGGGAAAAAAAA"
>>> transcript = "CCCCCCCGGGGGG"
>>> alignments1 = aligner.align(chromosome, transcript)
>>> len(alignments1)
1
>>> alignment1 = alignments1[0]
>>> print(alignment1)
target            8 CCCCCCCAAAAAAAAAAAGGGGGG 32
                  0 |||||||-----------|||||| 24
query             0 CCCCCCC-----------GGGGGG 13

>>> sequence = "CCCCGGGG"
>>> alignments2 = aligner.align(transcript, sequence)
>>> len(alignments2)
1
>>> alignment2 = alignments2[0]
>>> print(alignment2)
target            3 CCCCGGGG 11
                  0 ||||||||  8
query             0 CCCCGGGG  8

>>> mapped_alignment = alignment1.map(alignment2)
>>> print(mapped_alignment)
target           11 CCCCAAAAAAAAAAAGGGG 30
                  0 ||||-----------|||| 19
query             0 CCCC-----------GGGG  8

>>> format(mapped_alignment, "psl")
'8\t0\t0\t0\t0\t0\t1\t11\t+\tquery\t8\t0\t8\ttarget\t40\t11\t30\t2\t4,4,\t0,4,\t11,26,\n'

映射比对并不取决于序列内容。如果我们删除序列内容,会在PSL格式中发现相同的比对(尽管我们显然失去了打印序列比对的能力):

>>> from Bio.Seq import Seq
>>> alignment1.target = Seq(None, len(alignment1.target))
>>> alignment1.query = Seq(None, len(alignment1.query))
>>> alignment2.target = Seq(None, len(alignment2.target))
>>> alignment2.query = Seq(None, len(alignment2.query))
>>> mapped_alignment = alignment1.map(alignment2)
>>> format(mapped_alignment, "psl")
'8\t0\t0\t0\t0\t0\t1\t11\t+\tquery\t8\t0\t8\ttarget\t40\t11\t30\t2\t4,4,\t0,4,\t11,26,\n'

默认情况下,会执行全局成对对齐,从而找到整个长度上的最佳对齐 targetquery .相反,局部对齐将找到的子序列 targetquery 具有最高的对齐分数。可以通过设置生成局部对齐 aligner.mode"local" :

>>> aligner.mode = "local"
>>> target = "AGAACTC"
>>> query = "GAACT"
>>> score = aligner.score(target, query)
>>> score
5.0
>>> alignments = aligner.align(target, query)
>>> for alignment in alignments:
...     print(alignment)
...
target            1 GAACT 6
                  0 ||||| 5
query             0 GAACT 5

请注意,如果可以将评分为0的片段添加到对齐中,则最佳局部对齐的定义存在一些模糊性。我们遵循Waterman & Eggert的建议 [Waterman1987] 并禁止此类扩展。

如果 aligner.mode 设置为 "fogsaa" ,然后是快速最优全局对齐算法 [Chakraborty2013] 进行了一些修改。该模式计算全局对齐,但与常规方式不同 "global" 模式 它最适合相似序列之间的长比对。FOGMAA没有像其他算法那样计算所有可能的对齐,而是使用启发式来检测对齐中无法导致最佳对齐的步骤。这可以加快对齐速度,但是,启发式会对您的匹配、不匹配和差距分数做出假设。如果匹配分数小于不匹配分数或任何差距分数,或者如果任何差距分数大于不匹配分数,则会发出警告并且算法可能返回不正确的结果。与可能返回多个对齐方式的其他模式不同,FOGMAA始终仅返回一个对齐方式。

>>> aligner.mode = "fogsaa"
>>> aligner.mismatch_score = -10
>>> alignments = aligner.align("AAACAAA", "AAAGAAA")
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAC-AAA 7
                  0 |||--||| 8
query             0 AAA-GAAA 7

成对对齐器对象

PairwiseAligner 对象存储用于成对对齐的所有对齐参数。要查看所有参数值的概述,请使用

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner(match_score=1.0, mode="local")
>>> print(aligner)
Pairwise sequence aligner with parameters
  wildcard: None
  match_score: 1.000000
  mismatch_score: 0.000000
  open_internal_insertion_score: 0.000000
  extend_internal_insertion_score: 0.000000
  open_left_insertion_score: 0.000000
  extend_left_insertion_score: 0.000000
  open_right_insertion_score: 0.000000
  extend_right_insertion_score: 0.000000
  open_internal_deletion_score: 0.000000
  extend_internal_deletion_score: 0.000000
  open_left_deletion_score: 0.000000
  extend_left_deletion_score: 0.000000
  open_right_deletion_score: 0.000000
  extend_right_deletion_score: 0.000000
  mode: local

参见章节  替代评分 , 仿射间隙得分 ,而且 一般差距分数 以下是这些参数的定义。属性 mode (见上文第  基本用法 )可以设置等于 "global""local" 分别指定全局或局部成对对齐。

取决于差距评分参数(请参阅部分  仿射间隙得分一般差距分数 )和模式,a PairwiseAligner 对象自动选择适当的算法用于成对序列比对。要验证所选算法,请使用

>>> aligner.algorithm
'Smith-Waterman'

此属性是只读的。

A PairwiseAligner 对象还存储精度 \(\epsilon\) 用于对齐期间。的价值 \(\epsilon\) 存储在属性中 aligner.epsilon ,默认情况下等于 \(10^{-6}\) :

>>> aligner.epsilon
1e-06

如果两个分数之间的绝对差异小于,则出于对齐的目的,将被视为彼此相等 \(\epsilon\) .

替代评分

替代分数定义了当两个字母(核苷或氨基酸)相互对齐时要添加到总分中的值。将使用的替补分数 PairwiseAligner 可以通过两种方式指定:

  • 通过为相同字母指定匹配分数,并为不匹配的字母指定不匹配分数。核苷序列比对通常基于匹配和错配评分。例如,默认情况下,BST [Altschul1990] 使用的匹配分数为 \(+1\) 而不匹配得分为 \(-2\) 对于核苷酸比对, megablast ,差距罚分为2.5(请参阅部分 仿射间隙得分 有关差距分数的更多信息)。可以通过设置 matchmismatch 的属性 PairwiseAligner 对象:

    >>> from Bio import Align
    >>> aligner = Align.PairwiseAligner()
    >>> aligner.match_score
    1.0
    >>> aligner.mismatch_score
    0.0
    >>> score = aligner.score("ACGT", "ACAT")
    >>> print(score)
    3.0
    >>> aligner.match_score = 1.0
    >>> aligner.mismatch_score = -2.0
    >>> aligner.gap_score = -2.5
    >>> score = aligner.score("ACGT", "ACAT")
    >>> print(score)
    1.0
    

    使用匹配和不匹配分数时,您可以指定预设字符 (None 默认情况下)对于未知字母。无论匹配或不匹配得分的值如何,这些在对齐中都会得到零分:

    >>> aligner.wildcard = "?"
    >>> score = aligner.score("ACGT", "AC?T")
    >>> print(score)
    3.0
    
  • 也可以使用 substitution_matrix 属性 PairwiseAligner 对象指定替换矩阵。这使您可以对不同的匹配和不匹配的字母对应用不同的分数。这通常用于氨基酸序列比对。例如,默认情况下,BST [Altschul1990] 通过以下方式使用BLOSUM 62取代矩阵进行蛋白质比对 blastp .此替代矩阵可从Biopython获得:

    >>> from Bio.Align import substitution_matrices
    >>> substitution_matrices.load()
    ['BENNER22', 'BENNER6', 'BENNER74', 'BLASTN', 'BLASTP', 'BLOSUM45', 'BLOSUM50', 'BLOSUM62', ..., 'TRANS']
    >>> matrix = substitution_matrices.load("BLOSUM62")
    >>> print(matrix)
    #  Matrix made by matblas from blosum62.iij
    ...
         A    R    N    D    C    Q ...
    A  4.0 -1.0 -2.0 -2.0  0.0 -1.0 ...
    R -1.0  5.0  0.0 -2.0 -3.0  1.0 ...
    N -2.0  0.0  6.0  1.0 -3.0  0.0 ...
    D -2.0 -2.0  1.0  6.0 -3.0  0.0 ...
    C  0.0 -3.0 -3.0 -3.0  9.0 -3.0 ...
    Q -1.0  1.0  0.0  0.0 -3.0  5.0 ...
    ...
    >>> aligner.substitution_matrix = matrix
    >>> score = aligner.score("ACDQ", "ACDQ")
    >>> score
    24.0
    >>> score = aligner.score("ACDQ", "ACNQ")
    >>> score
    19.0
    

    当使用替代矩阵时, Xnot 被解释为未知角色。相反,将使用替代矩阵提供的分数:

    >>> matrix["D", "X"]
    -1.0
    >>> score = aligner.score("ACDQ", "ACXQ")
    >>> score
    17.0
    

默认情况下, aligner.substitution_matrixNone .的属性 aligner.match_scorealigner.mismatch_score 如果 aligner.substitution_matrixNone .设置 aligner.match_scorealigner.mismatch_score 有效值将重置 aligner.substitution_matrixNone .

仿射间隙得分

仿射间隙分数由打开间隙的分数和扩大现有间隙的分数定义:

\(\textrm{gap score} = \textrm{open gap score} + (n-1) \times \textrm{extend gap score}\),

哪里 \(n\) 是间隙的长度。Biopython的成对序列对齐器通过指定以下十二个属性,允许对缺口评分方案进行细粒度控制 PairwiseAligner 对象:

Opening scores

Extending scores

open_left_deletion_score

extend_left_deletion_score

open_internal_deletion_score

extend_internal_deletion_score

open_right_deletion_score

extend_right_deletion_score

open_left_insertion_score

extend_left_insertion_score

open_internal_insertion_score

extend_internal_insertion_score

open_right_insertion_score

extend_right_insertion_score

这些属性允许内部缺口和序列两端的不同缺口评分,如本例所示:

target

query

score

A

开放左删除分数

C

延长左删除分数

C

延长左删除分数

G

G

匹配分数

G

T

失配得分

G

开放内部删除分数

A

扩展内部删除分数

A

扩展内部删除分数

T

T

匹配分数

A

A

匹配分数

G

开放内部删除分数

C

C

匹配分数

C

开放内插入评分

C

扩大内部插入评分

C

C

匹配分数

T

G

失配得分

C

C

匹配分数

C

开放内插入评分

A

A

匹配分数

T

开放右插入评分

A

延长右插入评分

A

延长右插入评分

为了方便起见, PairwiseAligner 对象具有其他属性,这些属性共同引用这些值中的许多值,如表中所示(按层次结构)  成对对齐器对象的元属性。 .

表 1 成对对齐器对象的元属性。

元属性

它映射到的属性

gap_score

insertion_score , deletion_score

open_gap_score

open_insertion_score , open_deletion_score

extend_gap_score

extend_insertion_score , extend_deletion_score

internal_gap_score

internal_insertion_score , internal_deletion_score

open_internal_gap_score

open_internal_insertion_score , open_internal_deletion_score

extend_internal_gap_score

extend_internal_insertion_score , extend_internal_deletion_score

end_gap_score

end_insertion_score , end_deletion_score

open_end_gap_score

open_end_insertion_score , open_end_deletion_score

extend_end_gap_score

extend_end_insertion_score , extend_end_deletion_score

left_gap_score

left_insertion_score , left_deletion_score

right_gap_score

right_insertion_score , right_deletion_score

open_left_gap_score

open_left_insertion_score , open_left_deletion_score

extend_left_gap_score

extend_left_insertion_score , extend_left_deletion_score

open_right_gap_score

open_right_insertion_score , open_right_deletion_score

extend_right_gap_score

extend_right_insertion_score , extend_right_deletion_score

open_insertion_score

open_internal_insertion_score , open_left_insertion_score , open_right_insertion_score

extend_insertion_score

extend_internal_insertion_score , extend_left_insertion_score , extend_right_insertion_score

insertion_score

open_insertion_score , extend_insertion_score

open_deletion_score

open_internal_deletion_score , open_left_deletion_score , open_right_deletion_score

extend_deletion_score

extend_internal_deletion_score , extend_left_deletion_score , extend_right_deletion_score

deletion_score

open_deletion_score , extend_deletion_score

internal_insertion_score

open_internal_insertion_score , extend_internal_insertion_score

end_insertion_score

open_end_insertion_score , extend_end_insertion_score

open_end_insertion_score

open_left_insertion_score , open_right_insertion_score

extend_end_insertion_score

extend_left_insertion_score , extend_right_insertion_score

left_insertion_score

open_left_insertion_score , extend_left_insertion_score

right_insertion_score

open_right_insertion_score , extend_right_insertion_score

end_deletion_score

open_end_deletion_score , extend_end_deletion_score

open_end_deletion_score

open_left_deletion_score , open_right_deletionp_score

extend_end_deletion_score

extend_left_deletion_score , extend_right_deletion_score

internal_deletion_score

open_internal_deletion_score , extend_internal_deletion_score

left_deletion_score

open_left_deletion_score , extend_left_deletion_score

right_deletion_score

open_right_deletion_score , extend_right_deletion_score

一般差距分数

为了对差距评分进行更细粒度的控制,您可以指定差距评分函数。例如,下面的缺口评分功能不允许查询序列中两个碱基后的删除:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> def my_gap_score_function(start, length):
...     if start == 2:
...         return -1000
...     else:
...         return -1 * length
...
>>> aligner.deletion_score = my_gap_score_function
>>> alignments = aligner.align("AACTT", "AATT")
>>> for alignment in alignments:
...     print(alignment)
...
target            0 AACTT 5
                  0 -|.|| 5
query             0 -AATT 4

target            0 AACTT 5
                  0 |-.|| 5
query             0 A-ATT 4

target            0 AACTT 5
                  0 ||.-| 5
query             0 AAT-T 4

target            0 AACTT 5
                  0 ||.|- 5
query             0 AATT- 4

使用预定义的替代矩阵和差距分数

默认情况下 PairwiseAligner 对象初始化时匹配分数为+1.0,不匹配分数为0.0,所有差距分数均等于0.0。虽然这具有作为简单评分方案的好处,但一般来说,它不会提供最佳性能。相反,您可以使用参数 scoring 在初始化时选择预定义的评分方案 PairwiseAligner object.目前,提供的评分方案为 blastnmegablast ,适合于核苷酸比对,以及 blastp ,适合蛋白质比对。选择这些评分方案将初始化 PairwiseAligner 反对BLASYS、MegaBST和BLASTP分别使用的默认评分参数。

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner(scoring="blastn")
>>> print(aligner)
Pairwise sequence aligner with parameters
  substitution_matrix: <Array object at ...>
  open_internal_insertion_score: -7.000000
  extend_internal_insertion_score: -2.000000
  open_left_insertion_score: -7.000000
  extend_left_insertion_score: -2.000000
  open_right_insertion_score: -7.000000
  extend_right_insertion_score: -2.000000
  open_internal_deletion_score: -7.000000
  extend_internal_deletion_score: -2.000000
  open_left_deletion_score: -7.000000
  extend_left_deletion_score: -2.000000
  open_right_deletion_score: -7.000000
  extend_right_deletion_score: -2.000000
  mode: global

>>> print(aligner.substitution_matrix[:, :])
     A    T    G    C    S    W    R    Y    K    M    B    V    H    D    N
A  2.0 -3.0 -3.0 -3.0 -3.0 -1.0 -1.0 -3.0 -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -2.0
T -3.0  2.0 -3.0 -3.0 -3.0 -1.0 -3.0 -1.0 -1.0 -3.0 -1.0 -3.0 -1.0 -1.0 -2.0
G -3.0 -3.0  2.0 -3.0 -1.0 -3.0 -1.0 -3.0 -1.0 -3.0 -1.0 -1.0 -3.0 -1.0 -2.0
C -3.0 -3.0 -3.0  2.0 -1.0 -3.0 -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -3.0 -2.0
S -3.0 -3.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
W -1.0 -1.0 -3.0 -3.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
R -1.0 -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
Y -3.0 -1.0 -3.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
K -3.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -2.0
M -1.0 -3.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
B -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
V -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
H -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
D -1.0 -1.0 -1.0 -3.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -2.0
N -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0

迭代对齐

alignments 返回 aligner.align 是一种不可变的可迭代对象(类似于 range ).虽然它们看起来类似于 tuplelistAlignment 对象,它们在某种意义上是不同的, Alignment 当需要时动态创建对象。选择这种方法是因为对齐的数量可能非常大,特别是对于对齐不良的情况(请参阅第节  示例 例如)。

您可以对以下操作 alignments :

  • len(alignments) 返回存储的对齐数。即使对齐数量很大,该函数也会快速返回。如果对齐数量非常大(通常大于9,223,372,036,854,775,807,这是可以存储为 long int 在64位机器上), len(alignments) 将引发 OverflowError .大量对齐表明对齐质量较低。

    >>> from Bio import Align
    >>> aligner = Align.PairwiseAligner()
    >>> alignments = aligner.align("AAA", "AA")
    >>> len(alignments)
    3
    
  • 您可以通过索引提取特定的对齐方式:

    >>> from Bio import Align
    >>> aligner = Align.PairwiseAligner()
    >>> alignments = aligner.align("AAA", "AA")
    >>> print(alignments[2])
    target            0 AAA 3
                      0 -|| 3
    query             0 -AA 2
    
    >>> print(alignments[0])
    target            0 AAA 3
                      0 ||- 3
    query             0 AA- 2
    
  • 您可以迭代对齐,例如在

    >>> for alignment in alignments:
    ...     print(alignment)
    ...
    

    alignments 迭代器可以转换成 listtuple :

    >>> alignments = list(alignments)
    

    明智的做法是通过调用检查对齐数量 len(alignments) 在尝试打电话之前 list(alignments) 将所有路线保存为列表。

  • 对齐分数(对于中的每个对齐具有相同的值) alignments )存储为属性。这使您可以在继续提取单个对齐之前检查对齐分数:

    >>> print(alignments.score)
    2.0
    

与反向链对齐

默认情况下,成对对齐器将查询的正向链与目标的正向链对齐。计算对齐分数 query 到的反向链 target ,使用 strand="-" :

>>> from Bio import Align
>>> from Bio.Seq import reverse_complement
>>> target = "AAAACCC"
>>> query = "AACC"
>>> aligner = Align.PairwiseAligner()
>>> aligner.mismatch_score = -1
>>> aligner.internal_gap_score = -1
>>> aligner.score(target, query)  # strand is "+" by default
4.0
>>> aligner.score(target, reverse_complement(query), strand="-")
4.0
>>> aligner.score(target, query, strand="-")
0.0
>>> aligner.score(target, reverse_complement(query))
0.0

与反向链的比对可以通过指定 strand="-" 打电话时 aligner.align :

>>> alignments = aligner.align(target, query)
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAACCC 7
                  0 --||||- 7
query             0 --AACC- 4

>>> print(alignments[0].format("bed"))
target   2   6   query   4   +   2   6   0   1   4,   0,

>>> alignments = aligner.align(target, reverse_complement(query), strand="-")
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAACCC 7
                  0 --||||- 7
query             4 --AACC- 0

>>> print(alignments[0].format("bed"))
target   2   6   query   4   -   2   6   0   1   4,   0,

>>> alignments = aligner.align(target, query, strand="-")
>>> len(alignments)
2
>>> print(alignments[0])
target            0 AAAACCC----  7
                  0 ----------- 11
query             4 -------GGTT  0

>>> print(alignments[1])
target            0 ----AAAACCC  7
                  0 ----------- 11
query             4 GGTT-------  0

请注意,对齐的分数 query 到的反向链 target 可能与对齐的反向补集的得分不同 query 到前链 target 如果左右差距得分不同:

>>> aligner.left_gap_score = -0.5
>>> aligner.right_gap_score = -0.2
>>> aligner.score(target, query)
2.8
>>> alignments = aligner.align(target, query)
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAACCC 7
                  0 --||||- 7
query             0 --AACC- 4

>>> aligner.score(target, reverse_complement(query), strand="-")
3.1
>>> alignments = aligner.align(target, reverse_complement(query), strand="-")
>>> len(alignments)
1
>>> print(alignments[0])
target            0 AAAACCC 7
                  0 --||||- 7
query             4 --AACC- 0

取代矩阵

取代矩阵 [Durbin1998] 提供评分项,用于对两个不同残基相互替代的可能性进行分类。这对于进行序列比较至关重要。Biopython提供了大量常见的替代矩阵,包括著名的AM和BLONUM系列矩阵,还提供了创建您自己的替代矩阵的功能。

阵列对象

您可以将取代矩阵视为2D阵列,其中索引是字母(核苷或氨基酸)而不是整除。的 Array 阶级 Bio.Align.substitution_matrices 是numpy数组的一个子集,支持按integer和特定字符串进行索引。一个 Array 实例可以是一维数组或方形二维数组。一维 Array 例如,对象可以用于存储DNA序列的核苷频率,而二维 Array 对象可用于表示序列比对的评分矩阵。

要创建一维 Array ,只需指定允许的字母的字母表即可:

>>> from Bio.Align.substitution_matrices import Array
>>> counts = Array("ACGT")
>>> print(counts)
A 0.0
C 0.0
G 0.0
T 0.0

允许的字母存储在 alphabet 属性:

>>> counts.alphabet
'ACGT'

此属性是只读的;修改基础 _alphabet 属性可能会导致意想不到的结果。元素可以通过字母和integer索引访问:

>>> counts["C"] = -3
>>> counts[2] = 7
>>> print(counts)
A  0.0
C -3.0
G  7.0
T  0.0

>>> counts[1]
-3.0

使用不在字母表中的字母或越界的索引将导致 IndexError :

>>> counts["U"]
Traceback (most recent call last):
    ...
IndexError: 'U'
>>> counts["X"] = 6
Traceback (most recent call last):
    ...
IndexError: 'X'
>>> counts[7]
Traceback (most recent call last):
    ...
IndexError: index 7 is out of bounds for axis 0 with size 4

二维 Array 可以通过指定来创建 dims=2 :

>>> from Bio.Align.substitution_matrices import Array
>>> counts = Array("ACGT", dims=2)
>>> print(counts)
    A   C   G   T
A 0.0 0.0 0.0 0.0
C 0.0 0.0 0.0 0.0
G 0.0 0.0 0.0 0.0
T 0.0 0.0 0.0 0.0

同样,字母和整数都可以用于索引,指定不在字母表中的字母将导致 IndexError :

>>> counts["A", "C"] = 12.0
>>> counts[2, 1] = 5.0
>>> counts[3, "T"] = -2
>>> print(counts)
    A    C   G    T
A 0.0 12.0 0.0  0.0
C 0.0  0.0 0.0  0.0
G 0.0  5.0 0.0  0.0
T 0.0  0.0 0.0 -2.0

>>> counts["X", 1]
Traceback (most recent call last):
    ...
IndexError: 'X'
>>> counts["A", 5]
Traceback (most recent call last):
    ...
IndexError: index 5 is out of bounds for axis 1 with size 4

从二维数组中选择行或列将返回一维 Array :

>>> counts = Array("ACGT", dims=2)
>>> counts["A", "C"] = 12.0
>>> counts[2, 1] = 5.0
>>> counts[3, "T"] = -2
>>> counts["G"]
Array([0., 5., 0., 0.],
      alphabet='ACGT')
>>> counts[:, "C"]
Array([12.,  0.,  5.,  0.],
      alphabet='ACGT')

Array 因此,对象可以用作数组和字典。它们可以转换为普通的numpy数组或普通的字典对象:

>>> import numpy as np
>>> x = Array("ACGT")
>>> x["C"] = 5
>>> x
Array([0., 5., 0., 0.],
      alphabet='ACGT')
>>> a = np.array(x)  # create a plain numpy array
>>> a
array([0., 5., 0., 0.])
>>> d = dict(x)  # create a plain dictionary
>>> d
{'A': 0.0, 'C': 5.0, 'G': 0.0, 'T': 0.0}

虽然字母表的一个 Array 通常是一个字符串,您还可以使用(不可变)对象的数组。这例如用于密码子取代矩阵(如在 substitution_matrices.load("SCHNEIDER") 稍后显示的例子),其中关键不是单个的核苷酸或氨基酸,而是三个核苷酸密码子。

alphabet 财产的 Array 是不可变的,您可以创建新的 Array 通过从字母表中选择您感兴趣的字母来对象。例如,

>>> a = Array("ABCD", dims=2, data=np.arange(16).reshape(4, 4))
>>> print(a)
     A    B    C    D
A  0.0  1.0  2.0  3.0
B  4.0  5.0  6.0  7.0
C  8.0  9.0 10.0 11.0
D 12.0 13.0 14.0 15.0

>>> b = a.select("CAD")
>>> print(b)
     C    A    D
C 10.0  8.0 11.0
A  2.0  0.0  3.0
D 14.0 12.0 15.0

请注意,这还允许您重新排序字母表。

字母表中找不到的字母的数据设置为零:

>>> c = a.select("DEC")
>>> print(c)
     D   E    C
D 15.0 0.0 14.0
E  0.0 0.0  0.0
C 11.0 0.0 10.0

Array 类是numpy数组的一个子集,可以这样使用。一 ValueError 如果 Array 例如,数学运算中出现的对象具有不同的字母表

>>> from Bio.Align.substitution_matrices import Array
>>> d = Array("ACGT")
>>> r = Array("ACGU")
>>> d + r
Traceback (most recent call last):
    ...
ValueError: alphabets are inconsistent

根据成对序列比对计算替代矩阵

作为 Array 是numpy数组的一个子集,您可以对 Array 以大致相同的方式物体。在这里,我们通过根据16 S核糖体RNA基因序列的排列计算评分矩阵来说明这一点 Escherichia coliBacillus subtilis .首先,我们创建一个 PairwiseAligner 对象(请参阅第章  双序列比对 )并使用使用的默认分数初始化它 blastn :

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner(scoring="blastn")
>>> aligner.mode = "local"

接下来,我们读取16 S核糖体RNA基因序列 Escherichia coliBacillus subtilis (提供于 Tests/Align/ecoli.faTests/Align/bsubtilis.fa ),并将它们彼此对齐:

>>> from Bio import SeqIO
>>> sequence1 = SeqIO.read("ecoli.fa", "fasta")
>>> sequence2 = SeqIO.read("bsubtilis.fa", "fasta")
>>> alignments = aligner.align(sequence1, sequence2)

生成的比对数量非常大:

>>> len(alignments)
1990656

然而,由于它们彼此之间只有微小的差异,因此我们任意选择第一个对齐,并计算每个替换的数量:

>>> alignment = alignments[0]
>>> substitutions = alignment.substitutions
>>> print(substitutions)
      A     C     G     T
A 307.0  19.0  34.0  19.0
C  15.0 280.0  25.0  29.0
G  34.0  24.0 401.0  20.0
T  24.0  36.0  20.0 228.0

我们对总数进行标准化,以找出每次替换的概率,并创建观察频率的对称矩阵:

>>> observed_frequencies = substitutions / substitutions.sum()
>>> observed_frequencies = (observed_frequencies + observed_frequencies.transpose()) / 2.0
>>> print(format(observed_frequencies, "%.4f"))
       A      C      G      T
A 0.2026 0.0112 0.0224 0.0142
C 0.0112 0.1848 0.0162 0.0215
G 0.0224 0.0162 0.2647 0.0132
T 0.0142 0.0215 0.0132 0.1505

背景概率是在每个序列中分别发现A、C、G或T核苷的概率。这可以计算为每一行或每一列的总和:

>>> background = observed_frequencies.sum(0)
>>> print(format(background, "%.4f"))
A 0.2505
C 0.2337
G 0.3165
T 0.1993

随机预期的替换数量只是背景分布本身的产物:

>>> expected_frequencies = background[:, None].dot(background[None, :])
>>> print(format(expected_frequencies, "%.4f"))
       A      C      G      T
A 0.0627 0.0585 0.0793 0.0499
C 0.0585 0.0546 0.0740 0.0466
G 0.0793 0.0740 0.1002 0.0631
T 0.0499 0.0466 0.0631 0.0397

然后,评分矩阵可以计算为观察到的概率和预期的概率比的对数:

>>> oddsratios = observed_frequencies / expected_frequencies
>>> import numpy as np
>>> scoring_matrix = np.log2(oddsratios)
>>> print(scoring_matrix)
     A    C    G    T
A  1.7 -2.4 -1.8 -1.8
C -2.4  1.8 -2.2 -1.1
G -1.8 -2.2  1.4 -2.3
T -1.8 -1.1 -2.3  1.9

该矩阵可用于设置成对对齐器的替换矩阵(请参阅第章  双序列比对 ):

>>> aligner.substitution_matrix = scoring_matrix

根据多序列比对计算替代矩阵

在这个例子中,我们首先从Custalw文件中读取蛋白质序列比对 protein.aln (also在线提供 here )

>>> from Bio import Align
>>> filename = "protein.aln"
>>> alignment = Align.read(filename, "clustal")

部分  ClustalW 包含有关此操作的更多信息。

substitutions 对齐的属性存储不同残基相互替代的次数:

>>> substitutions = alignment.substitutions

为了使示例更具可读性,我们将仅选择具有带电荷侧链的氨基酸:

>>> substitutions = substitutions.select("DEHKR")
>>> print(substitutions)
       D      E      H      K      R
D 2360.0  270.0   15.0    1.0   48.0
E  241.0 3305.0   15.0   45.0    2.0
H    0.0   18.0 1235.0    8.0    0.0
K    0.0    9.0   24.0 3218.0  130.0
R    2.0    2.0   17.0  103.0 2079.0

从矩阵中去除其他氨基酸的柱和柱。

接下来,我们对矩阵进行规格化并使其对称。

>>> observed_frequencies = substitutions / substitutions.sum()
>>> observed_frequencies = (observed_frequencies + observed_frequencies.transpose()) / 2.0
>>> print(format(observed_frequencies, "%.4f"))
       D      E      H      K      R
D 0.1795 0.0194 0.0006 0.0000 0.0019
E 0.0194 0.2514 0.0013 0.0021 0.0002
H 0.0006 0.0013 0.0939 0.0012 0.0006
K 0.0000 0.0021 0.0012 0.2448 0.0089
R 0.0019 0.0002 0.0006 0.0089 0.1581

行或列的总和给出了每个残基的相对出现频率:

>>> background = observed_frequencies.sum(0)
>>> print(format(background, "%.4f"))
D 0.2015
E 0.2743
H 0.0976
K 0.2569
R 0.1697

>>> sum(background) == 1.0
True

那么,残基对的预期频率是

>>> expected_frequencies = background[:, None].dot(background[None, :])
>>> print(format(expected_frequencies, "%.4f"))
       D      E      H      K      R
D 0.0406 0.0553 0.0197 0.0518 0.0342
E 0.0553 0.0752 0.0268 0.0705 0.0465
H 0.0197 0.0268 0.0095 0.0251 0.0166
K 0.0518 0.0705 0.0251 0.0660 0.0436
R 0.0342 0.0465 0.0166 0.0436 0.0288

在这里, background[:, None] 创建由单个列组成的2D数组,值为 expected_frequencies ,而且 expected_frequencies[None, :] 将这些值作为一行的2D数组。取其点积(内部积)创建预期频率矩阵,其中每个条目由两个组成 expected_frequencies 价值观相互叠加。例如, expected_frequencies['D', 'E'] 等于 residue_frequencies['D'] * residue_frequencies['E'] .

我们现在可以通过将观察到的频率除以预期频率并取对数来计算对比矩阵:

>>> import numpy as np
>>> scoring_matrix = np.log2(observed_frequencies / expected_frequencies)
>>> print(scoring_matrix)
      D    E    H     K    R
D   2.1 -1.5 -5.1 -10.4 -4.2
E  -1.5  1.7 -4.4  -5.1 -8.3
H  -5.1 -4.4  3.3  -4.4 -4.7
K -10.4 -5.1 -4.4   1.9 -2.3
R  -4.2 -8.3 -4.7  -2.3  2.5

当执行比对时,该矩阵可以用作替代矩阵。例如,

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.substitution_matrix = scoring_matrix
>>> aligner.gap_score = -3.0
>>> alignments = aligner.align("DEHEK", "DHHKK")
>>> print(alignments[0])
target            0 DEHEK 5
                  0 |.|.| 5
query             0 DHHKK 5

>>> print("%.2f" % alignments.score)
-2.18
>>> score = (
...     scoring_matrix["D", "D"]
...     + scoring_matrix["E", "H"]
...     + scoring_matrix["H", "H"]
...     + scoring_matrix["E", "K"]
...     + scoring_matrix["K", "K"]
... )
>>> print("%.2f" % score)
-2.18

(see章  双序列比对 详情)。

阅读 Array 文件中的对象

Bio.Align.substitution_matrices 包括一个解析器来读取一维和二维 Array 文件中的对象。一维数组由简单的两列格式表示,第一列包含键,第二列包含相应的值。例如文件 hg38.chrom.sizes (从UCSC获取),可在 Tests/Align Biopython分布的字节包含人类基因组组装hg 38中每条染色体的核苷大小:

chr1    248956422
chr2    242193529
chr3    198295559
chr4    190214555
...
chrUn_KI270385v1    990
chrUn_KI270423v1    981
chrUn_KI270392v1    971
chrUn_KI270394v1    970

要解析此文件,请使用

>>> from Bio.Align import substitution_matrices
>>> with open("hg38.chrom.sizes") as handle:
...     table = substitution_matrices.read(handle)
...
>>> print(table)
chr1 248956422.0
chr2 242193529.0
chr3 198295559.0
chr4 190214555.0
...
chrUn_KI270423v1       981.0
chrUn_KI270392v1       971.0
chrUn_KI270394v1       970.0

使用 dtype=int 要将值读取为integer:

>>> with open("hg38.chrom.sizes") as handle:
...     table = substitution_matrices.read(handle, int)
...
>>> print(table)
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
...
chrUn_KI270423v1       981
chrUn_KI270392v1       971
chrUn_KI270394v1       970

对于二维数组,我们遵循NCBI提供的替换矩阵的文件格式。例如,BLOSUM 62矩阵,这是NCBI蛋白质-蛋白质RST的默认替代矩阵 [Altschul1990] 程序 blastp ,存储如下:

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
...

该文件包含在Biopython发行版中, Bio/Align/substitution_matrices/data .要解析此文件,请使用

>>> from Bio.Align import substitution_matrices
>>> with open("BLOSUM62") as handle:
...     matrix = substitution_matrices.read(handle)
...
>>> print(matrix.alphabet)
ARNDCQEGHILKMFPSTWYVBZX*
>>> print(matrix["A", "D"])
-2.0

标题行以开头 # 存储在属性中 header :

>>> matrix.header[0]
'Matrix made by matblas from blosum62.iij'

我们现在可以使用这个矩阵作为对齐器对象上的替代矩阵:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.substitution_matrix = matrix

要保存数组对象,请首先创建一个字符串:

>>> text = str(matrix)
>>> print(text)
#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
     A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S ...
A  4.0 -1.0 -2.0 -2.0  0.0 -1.0 -1.0  0.0 -2.0 -1.0 -1.0 -1.0 -1.0 -2.0 -1.0  1.0 ...
R -1.0  5.0  0.0 -2.0 -3.0  1.0  0.0 -2.0  0.0 -3.0 -2.0  2.0 -1.0 -3.0 -2.0 -1.0 ...
N -2.0  0.0  6.0  1.0 -3.0  0.0  0.0  0.0  1.0 -3.0 -3.0  0.0 -2.0 -3.0 -2.0  1.0 ...
D -2.0 -2.0  1.0  6.0 -3.0  0.0  2.0 -1.0 -1.0 -3.0 -4.0 -1.0 -3.0 -3.0 -1.0  0.0 ...
C  0.0 -3.0 -3.0 -3.0  9.0 -3.0 -4.0 -3.0 -3.0 -1.0 -1.0 -3.0 -1.0 -2.0 -3.0 -1.0 ...
...

和写入 text 到文件。

加载预定义的替代矩阵

Biopython包含文献中定义的大量替代矩阵,包括BLONUM(区块替代矩阵) [Henikoff1992] 和AM(点接受突变)矩阵 [Dayhoff1978]. 这些矩阵在 Bio/Align/substitution_matrices/data 目录,并且可以使用 load 功能 substitution_matrices 子模块。例如,BLOSUM62矩阵可以通过运行

>>> from Bio.Align import substitution_matrices
>>> m = substitution_matrices.load("BLOSUM62")

该取代矩阵具有一个字母表,由遗传密码中使用的20个氨基酸、三个模糊的氨基酸B(芦笋或芦笋)、Z(谷氨醇或谷氨酸)和X(代表任何氨基酸)以及由星号代表的终止密码子组成:

>>> m.alphabet
'ARNDCQEGHILKMFPSTWYVBZX*'

要获取可用替代矩阵的完整列表,请使用 load 没有争论:

>>> substitution_matrices.load()
['BENNER22', 'BENNER6', 'BENNER74', 'BLASTN', 'BLASTP', 'BLOSUM45', 'BLOSUM50', ..., 'TRANS']

请注意,施耐德提供的替代矩阵 et al. [Schneider2005] 使用由三个核苷酸密码子组成的字母表:

>>> m = substitution_matrices.load("SCHNEIDER")
>>> m.alphabet
('AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', ..., 'TTG', 'TTT')

示例

假设您想在上面相同的两个血红蛋白序列之间进行全局成对比对 (HBA_HUMAN , HBB_HUMAN )存储在 alpha.faabeta.faa :

>>> from Bio import Align
>>> from Bio import SeqIO
>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> aligner = Align.PairwiseAligner()
>>> score = aligner.score(seq1.seq, seq2.seq)
>>> print(score)
72.0

显示对齐评分为72.0。要查看各个对齐,请执行

>>> alignments = aligner.align(seq1.seq, seq2.seq)

在这个例子中,最佳对齐的总数是巨大的(多于 \(4 \times 10^{37}\) ),并打电话 len(alignments) 将引发 OverflowError :

>>> len(alignments)
Traceback (most recent call last):
...
OverflowError: number of optimal alignments is larger than 9223372036854775807

我们来看看第一个对齐:

>>> alignment = alignments[0]

对齐对象存储对齐分数以及对齐本身:

>>> print(alignment.score)
72.0
>>> print(alignment)
target            0 MV-LS-PAD--KTN--VK-AA-WGKV-----GAHAGEYGAEALE-RMFLSF----P-TTK
                  0 ||-|--|----|----|--|--||||-----|---||--|--|--|--|------|-|--
query             0 MVHL-TP--EEK--SAV-TA-LWGKVNVDEVG---GE--A--L-GR--L--LVVYPWT--

target           41 TY--FPHF----DLSHGS---AQVK-G------HGKKV--A--DA-LTNAVAHV-DDMPN
                 60 ----|--|----|||------|-|--|------|||||--|--|--|--|--|--|---|
query            39 --QRF--FESFGDLS---TPDA-V-MGNPKVKAHGKKVLGAFSD-GL--A--H-LD---N

target           79 ALS----A-LSD-LHAH--KLR-VDPV-NFK-LLSHC---LLVT--LAAHLPA----EFT
                120 -|-----|-||--||----||--|||--||--||------|-|---||-|-------|||
query            81 -L-KGTFATLS-ELH--CDKL-HVDP-ENF-RLL---GNVL-V-CVLA-H---HFGKEFT

target          119 PA-VH-ASLDKFLAS---VSTV------LTS--KYR- 142
                180 |--|--|------|----|--|------|----||-- 217
query           124 P-PV-QA------A-YQKV--VAGVANAL--AHKY-H 147

更好的对齐通常是通过惩罚差距来实现的:打开差距的成本更高,扩大现有差距的成本更低。对于氨基酸序列,匹配分数通常编码在如下矩阵中 PAMBLOSUM .因此,通过使用BLOSUM62矩阵以及空位开放罚分为10和空位延伸罚分为0.5,可以获得对我们的例子更有意义的比对:

>>> from Bio import Align
>>> from Bio import SeqIO
>>> from Bio.Align import substitution_matrices
>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> aligner = Align.PairwiseAligner()
>>> aligner.open_gap_score = -10
>>> aligner.extend_gap_score = -0.5
>>> aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
>>> score = aligner.score(seq1.seq, seq2.seq)
>>> print(score)
292.5
>>> alignments = aligner.align(seq1.seq, seq2.seq)
>>> len(alignments)
2
>>> print(alignments[0].score)
292.5
>>> print(alignments[0])
target            0 MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGS
                  0 ||-|.|..|..|.|.||||--...|.|.|||.|.....|.|...|..|-|||-----.|.
query             0 MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN

target           53 AQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
                 60 ..||.|||||..|.....||.|........||.||..||.|||.||.||...|...||.|
query            58 PKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

target          113 LPAEFTPAVHASLDKFLASVSTVLTSKYR 142
                120 ...||||.|.|...|..|.|...|..||. 149
query           118 FGKEFTPPVQAAYQKVVAGVANALAHKYH 147

这种比对具有与我们之前使用相同序列和相同参数使用CLARSS针获得的相同分数。

要执行局部对齐,请设置 aligner.mode'local' :

>>> aligner.mode = "local"
>>> aligner.open_gap_score = -10
>>> aligner.extend_gap_score = -1
>>> alignments = aligner.align("LSPADKTNVKAA", "PEEKSAV")
>>> print(len(alignments))
1
>>> alignment = alignments[0]
>>> print(alignment)
target            2 PADKTNV 9
                  0 |..|..| 7
query             0 PEEKSAV 7

>>> print(alignment.score)
16.0

广义成对排列

在大多数情况下, PairwiseAligner 用于执行序列(字符串或 Seq 物体)由单字母核苷或氨基酸组成。更一般地说, PairwiseAligner 还可以应用于任意对象的列表或字节组。本节将描述此类广义成对比对的一些示例。

使用替代矩阵和字母表的广义成对比对

施耐德 et al. [Schneider2005] 创建了一个取代矩阵来对齐三个核苷酸密码子(参见 below 节中  取代矩阵 了解更多信息)。这个替代矩阵与由所有三字母密码子组成的字母表相关:

>>> from Bio.Align import substitution_matrices
>>> m = substitution_matrices.load("SCHNEIDER")
>>> m.alphabet
('AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', ..., 'TTG', 'TTT')

我们可以使用这个矩阵来相互对齐密码子序列:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> aligner.substitution_matrix = m
>>> aligner.gap_score = -1.0
>>> s1 = ("AAT", "CTG", "TTT", "TTT")
>>> s2 = ("AAT", "TTA", "TTT")
>>> alignments = aligner.align(s1, s2)
>>> len(alignments)
2
>>> print(alignments[0])
AAT CTG TTT TTT
||| ... ||| ---
AAT TTA TTT ---

>>> print(alignments[1])
AAT CTG TTT TTT
||| ... --- |||
AAT TTA --- TTT

请注意,对齐 TTTTTA ,如本例所示:

AAT CTG TTT TTT
||| --- ... |||
AAT --- TTA TTT

会得到低得多的分数:

>>> print(m["CTG", "TTA"])
7.6
>>> print(m["TTT", "TTA"])
-0.3

大概是因为 CTGTTA 两者都编码白藜芦醇,而 TTT 苯色素的代码。三字母密码子取代矩阵还揭示了代表同一氨基酸的密码子之间的偏好。例如, TTA 有偏好 CTG 相比之下, CTC ,尽管这三个代码都是Leuine:

>>> s1 = ("AAT", "CTG", "CTC", "TTT")
>>> s2 = ("AAT", "TTA", "TTT")
>>> alignments = aligner.align(s1, s2)
>>> len(alignments)
1
>>> print(alignments[0])
AAT CTG CTC TTT
||| ... --- |||
AAT TTA --- TTT

>>> print(m["CTC", "TTA"])
6.5

使用匹配/错配分数和字母表的广义成对比对

使用三字母的氨基酸符号,上面的序列翻译为

>>> s1 = ("Asn", "Leu", "Leu", "Phe")
>>> s2 = ("Asn", "Leu", "Phe")

我们可以通过使用三字母的氨基酸字母表将这些序列直接相互对齐:

>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> aligner.alphabet = ['Ala', 'Arg', 'Asn', 'Asp', 'Cys',
...                     'Gln', 'Glu', 'Gly', 'His', 'Ile',
...                     'Leu', 'Lys', 'Met', 'Phe', 'Pro',
...                     'Ser', 'Thr', 'Trp', 'Tyr', 'Val']  # fmt: skip
...

我们使用+6/-1匹配和错配分数作为BLOSUM 62矩阵的逼近,并将这些序列相互对齐:

>>> aligner.match = +6
>>> aligner.mismatch = -1
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
Asn Leu Leu Phe
||| ||| --- |||
Asn Leu --- Phe

>>> print(alignments[1])
Asn Leu Leu Phe
||| --- ||| |||
Asn --- Leu Phe

>>> print(alignments.score)
18.0

使用匹配/错配评分和整序列的广义成对比对

在内部,执行对齐时的第一步是用由与对齐器相关的字母表中每个序列中每个字母的索引组成的整组数组替换两个序列。通过直接传递integer数组可以绕过此步骤:

>>> import numpy as np
>>> from Bio import Align
>>> aligner = Align.PairwiseAligner()
>>> s1 = np.array([2, 10, 10, 13], np.int32)
>>> s2 = np.array([2, 10, 13], np.int32)
>>> aligner.match = +6
>>> aligner.mismatch = -1
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
2 10 10 13
| || -- ||
2 10 -- 13

>>> print(alignments[1])
2 10 10 13
| -- || ||
2 -- 10 13

>>> print(alignments.score)
18.0

请注意,索引应该由32位整数组成,如本示例中指定的 numpy.int32 .

通过定义大写字符并使用相应的Unicode代码点号作为索引,可以再次包含未知字母:

>>> aligner.wildcard = "?"
>>> ord(aligner.wildcard)
63
>>> s2 = np.array([2, 63, 13], np.int32)
>>> aligner.gap_score = -3
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
2 10 10 13
| .. -- ||
2 63 -- 13

>>> print(alignments[1])
2 10 10 13
| -- .. ||
2 -- 63 13

>>> print(alignments.score)
9.0

使用替换矩阵和整序列的广义成对比对

别名序列也可以使用替换矩阵来对齐,在这种情况下,它是一个没有与之相关的字母表的numpy Square。在这种情况下,所有索引值必须是非负的,并且小于替换矩阵的大小:

>>> from Bio import Align
>>> import numpy as np
>>> aligner = Align.PairwiseAligner()
>>> m = np.eye(5)
>>> m[0, 1:] = m[1:, 0] = -2
>>> m[2, 2] = 3
>>> print(m)
[[ 1. -2. -2. -2. -2.]
 [-2.  1.  0.  0.  0.]
 [-2.  0.  3.  0.  0.]
 [-2.  0.  0.  1.  0.]
 [-2.  0.  0.  0.  1.]]
>>> aligner.substitution_matrix = m
>>> aligner.gap_score = -1
>>> s1 = np.array([0, 2, 3, 4], np.int32)
>>> s2 = np.array([0, 3, 2, 1], np.int32)
>>> alignments = aligner.align(s1, s2)
>>> print(len(alignments))
2
>>> print(alignments[0])
0 - 2 3 4
| - | . -
0 3 2 1 -

>>> print(alignments[1])
0 - 2 3 4
| - | - .
0 3 2 - 1

>>> print(alignments.score)
2.0

密码子排列

CodonAligner 阶级 Bio.Align 模块实现专门的对准器,用于将核酸序列与其编码的氨基酸序列进行对准。如果在翻译过程中发生移码,这种对齐就很重要了。

将碱基序列与氨基酸序列进行比对

要将核苷序列与氨基酸序列进行比对,首先创建 CodonAligner 对象:

>>> from Bio import Align
>>> aligner = Align.CodonAligner()

CodonAligner 对象 aligner 存储要用于对齐的对齐参数:

>>> print(aligner)
Codon aligner with parameters
  wildcard: 'X'
  match_score: 1.0
  mismatch_score: 0.0
  frameshift_minus_two_score: -3.0
  frameshift_minus_one_score: -3.0
  frameshift_plus_one_score: -3.0
  frameshift_plus_two_score: -3.0

wildcard , match_score ,而且 mismatch_score 参数的定义与 PairwiseAligner 上述类(请参阅第节  成对对齐器对象 ).指定的值 frameshift_minus_two_score , frameshift_minus_one_score , frameshift_plus_one_score ,而且 frameshift_plus_two_score 每当对齐中分别发生-2、-1、+1或+2帧位移时,参数就会添加到对齐分数中。默认情况下,帧移动得分设置为-3.0。类似于 PairwiseAligner 类(表  成对对齐器对象的元属性。 ), CodonAligner 类定义了集体引用其中许多值的附加属性,如表中所示  CodonAligner对象的元属性。 .

表 2 CodonAligner对象的元属性。

元属性

它映射到的属性

frameshift_minus_score

frameshift_minus_two_score , frameshift_minus_one_score

frameshift_plus_score

frameshift_plus_two_score , frameshift_plus_one_score

frameshift_two_score

frameshift_minus_two_score , frameshift_plus_two_score

frameshift_one_score

frameshift_minus_one_score , frameshift_plus_one_score

frameshift_score

frameshift_minus_two_score , frameshift_minus_one_score , frameshift_plus_one_score , frameshift_plus_two_score

现在让我们考虑两个碱基序列及其编码的氨基酸序列:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> nuc1 = Seq("TCAGGGACTGCGAGAACCAAGCTACTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG")
>>> rna1 = SeqRecord(nuc1, id="rna1")
>>> nuc2 = Seq("TCAGGGACTTCGAGAACCAAGCGCTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG")
>>> rna2 = SeqRecord(nuc2, id="rna2")
>>> aa1 = Seq("SGTARTKLLLLLAALCAAGGALE")
>>> aa2 = Seq("SGTSRTKRLLLLAALGAAGGALE")
>>> pro1 = SeqRecord(aa1, id="pro1")
>>> pro2 = SeqRecord(aa2, id="pro2")

虽然这两个蛋白质序列都由23个氨基酸组成,但第一个碱基序列由 \(3 \times 23 = 69\) 核苷酸,而第二个核苷酸序列仅由68个核苷酸组成:

>>> len(pro1)
23
>>> len(pro2)
23
>>> len(rna1)
69
>>> len(rna2)
68

这是由于第二个核苷酸序列翻译期间发生的-1移码事件造成的。使用 CodonAligner.align 对齐 rna1pro1 ,而且 rna2pro2 ,返回的迭代器 Alignment 对象:

>>> alignments1 = aligner.align(pro1, rna1)
>>> len(alignments1)
1
>>> alignment1 = next(alignments1)
>>> print(alignment1)
pro1              0 S  G  T  A  R  T  K  L  L  L  L  L  A  A  L  C  A  A  G  G
rna1              0 TCAGGGACTGCGAGAACCAAGCTACTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGG

pro1             20 A  L  E   23
rna1             60 GCGCTGGAG 69

>>> alignment1.coordinates
array([[ 0, 23],
       [ 0, 69]])
>>> alignment1[0]
'SGTARTKLLLLLAALCAAGGALE'
>>> alignment1[1]
'TCAGGGACTGCGAGAACCAAGCTACTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG'
>>> alignments2 = aligner.align(pro2, rna2)
>>> len(alignments2)
1
>>> alignment2 = next(alignments2)
>>> print(alignment2)
pro2              0 S  G  T  S  R  T  K  R   8
rna2              0 TCAGGGACTTCGAGAACCAAGCGC 24

pro2              8 L  L  L  L  A  A  L  G  A  A  G  G  A  L  E   23
rna2             23 CTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG 68

>>> alignment2[0]
'SGTSRTKRLLLLAALGAAGGALE'
>>> alignment2[1]
'TCAGGGACTTCGAGAACCAAGCGCCTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG'
>>> alignment2.coordinates
array([[ 0,  8,  8, 23],
       [ 0, 24, 23, 68]])

alignment1 是69个核苷酸与23个氨基酸的连续排列, alignment2 我们发现24个核苷酸后发生-1移码。作为 alignment2[1] 包含应用-1码移后的核苷序列,比 nuc2 并且可以直接翻译,产生氨基酸序列 aa2 :

>>> from Bio.Seq import translate
>>> len(nuc2)
68
>>> len(alignment2[1])
69
>>> translate(alignment2[1])
'SGTSRTKRLLLLAALGAAGGALE'
>>> _ == aa2
True

对齐分数作为属性存储在 alignments1alignments2 迭代器,以及各个对齐 alignment1alignment2 :

>>> alignments1.score
23.0
>>> alignment1.score
23.0
>>> alignments2.score
20.0
>>> alignment2.score
20.0

其中的分数 rna1 - ' pro 1 '排列等于排列的氨基酸的数量,以及排列的分数 rna2 - “pro2”由于帧移动的惩罚,对齐减少了3。要在不计算对齐本身的情况下计算对齐分数, score 可以使用的方法:

>>> score = aligner.score(pro1, rna1)
>>> print(score)
23.0
>>> score = aligner.score(pro2, rna2)
>>> print(score)
20.0

生成密码子序列的多序列比对

假设我们有第三个相关的氨基酸序列及其相关的核苷酸序列:

>>> aa3 = Seq("MGTALLLLLAALCAAGGALE")
>>> pro3 = SeqRecord(aa3, id="pro3")
>>> nuc3 = Seq("ATGGGAACCGCGCTGCTTTTGCTACTGGCCGCGCTCTGCGCCGCAGGTGGGGCCCTGGAG")
>>> rna3 = SeqRecord(nuc3, id="rna3")
>>> nuc3.translate() == aa3
True

如上所述,我们使用 CodonAligner 为了将核苷酸序列与氨基酸序列进行比对:

>>> alignments3 = aligner.align(pro3, rna3)
>>> len(alignments3)
1
>>> alignment3 = next(alignments3)
>>> print(alignment3)
pro3              0 M  G  T  A  L  L  L  L  L  A  A  L  C  A  A  G  G  A  L  E
rna3              0 ATGGGAACCGCGCTGCTTTTGCTACTGGCCGCGCTCTGCGCCGCAGGTGGGGCCCTGGAG

pro3             20
rna3             60

这三个氨基酸序列可以相互比对,例如使用CustalW。在这里,我们手工创建对齐:

>>> import numpy as np
>>> from Bio.Align import Alignment
>>> sequences = [pro1, pro2, pro3]
>>> protein_alignment = Alignment(
...     sequences, coordinates=np.array([[0, 4, 7, 23], [0, 4, 7, 23], [0, 4, 4, 20]])
... )
>>> print(protein_alignment)
pro1              0 SGTARTKLLLLLAALCAAGGALE 23
pro2              0 SGTSRTKRLLLLAALGAAGGALE 23
pro3              0 MGTA---LLLLLAALCAAGGALE 20

现在我们可以使用 mapall 蛋白质比对的方法,以核酸与蛋白质成对比对为参数,获得相应的密码子比对:

>>> codon_alignment = protein_alignment.mapall([alignment1, alignment2, alignment3])
>>> print(codon_alignment)
rna1              0 TCAGGGACTGCGAGAACCAAGCTA 24
rna2              0 TCAGGGACTTCGAGAACCAAGCGC 24
rna3              0 ATGGGAACCGCG---------CTG 15

rna1             24 CTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG 69
rna2             23 CTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG 68
rna3             15 CTTTTGCTACTGGCCGCGCTCTGCGCCGCAGGTGGGGCCCTGGAG 60

分析密码子排列

计算每个站点的非同义和同义替换数量

密码子比对最重要的应用是估计每个位点的非同义取代数(dN)和每个位点的同义取代数(dS)。这些可以通过 calculate_dn_ds 功能 Bio.Align.analysis .该函数接受成对密码子对齐和输入,以及可选参数 method 指定计算方法, codon_table (默认标准代码),比例 k 转换率和颠换率, cfreq 以确定平衡密码子频率。Biopython目前支持三种基于计数的方法 (NG86 , LWL85 , YN00 )以及最大似然法 (ML )来估计dN和dS:

  • NG86 :Nei和Gojobori(1986) [Nei1986] (默认)。使用此方法,您还可以通过参数指定转变率和颠倒率的比率 k ,默认为 1.0 .

  • LWL85 :李 et al. (1985) [Li1985].

  • YN00 :杨和尼尔森(2000) [Yang2000].

  • ML :高盛和杨(1994) [Goldman1994]. 使用此方法,您还可以通过指定平衡密码子频率 cfreq 参数,具有以下选项:

    • F1x4 :计算提供的密码子序列中的碱基频率,并使用它来计算背景密码子频率;

    • F3x4 :(默认)分别计算所提供密码子中第一、第二和第三位置的碱基频率,并用它计算背景密码子频率;

    • F61 :计算所提供密码子序列中密码子的频率,伪计数为0.1。

calculate_dN_dS 该方法可以应用于成对密码子比对。一般来说,不同的计算方法会导致dN和dS的估计略有不同:

>>> from Bio.Align import analysis
>>> pairwise_codon_alignment = codon_alignment[:2]
>>> print(pairwise_codon_alignment)
rna1              0 TCAGGGACTGCGAGAACCAAGCTA 24
                  0 |||||||||.||||||||||||..
rna2              0 TCAGGGACTTCGAGAACCAAGCGC 24

rna1             24 CTGCTGCTGCTGGCTGCGCTCTGCGCCGCAGGTGGGGCGCTGGAG 69
                 24 ||.||||||||||||||||||.|||||||||||||.||.|||||| 69
rna2             23 CTCCTGCTGCTGGCTGCGCTCGGCGCCGCAGGTGGAGCACTGGAG 68

>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="NG86")
>>> print(dN, dS)
0.067715... 0.201197...
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="LWL85")
>>> print(dN, dS)
0.068728... 0.207551...
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="YN00")
>>> print(dN, dS)
0.081468... 0.127706...
>>> dN, dS = analysis.calculate_dn_ds(pairwise_codon_alignment, method="ML")
>>> print(dN, dS)
0.069475... 0.205754...

对于密码子序列的多重比对,您可以计算dN和dS值的矩阵:

>>> dN, dS = analysis.calculate_dn_ds_matrix(codon_alignment, method="NG86")
>>> print(dN)
rna1    0.000000
rna2    0.067715    0.000000
rna3    0.060204    0.145469    0.000000
    rna1    rna2    rna3
>>> print(dS)
rna1    0.000000
rna2    0.201198    0.000000
rna3    0.664268    0.798957    0.000000
    rna1    rna2    rna3

的对象 dNdS 返回 calculate_dn_ds_matrixDistanceMatrix 阶级 Bio.Phylo.TreeConstruction .此函数只接受 codon_table 作为可选论点。

从这两个序列中,您可以使用创建dN树和dS树 Bio.Phylo.TreeConstruction :

>>> from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
>>> dn_constructor = DistanceTreeConstructor()
>>> ds_constructor = DistanceTreeConstructor()
>>> dn_tree = dn_constructor.upgma(dN)
>>> ds_tree = ds_constructor.upgma(dS)
>>> print(type(dn_tree))
<class 'Bio.Phylo.BaseTree.Tree'>
>>> print(dn_tree)
Tree(rooted=True)
    Clade(branch_length=0, name='Inner2')
        Clade(branch_length=0.053296..., name='rna2')
        Clade(branch_length=0.023194..., name='Inner1')
            Clade(branch_length=0.0301021..., name='rna3')
            Clade(branch_length=0.0301021..., name='rna1')
>>> print(ds_tree)
Tree(rooted=True)
    Clade(branch_length=0, name='Inner2')
        Clade(branch_length=0.365806..., name='rna3')
        Clade(branch_length=0.265207..., name='Inner1')
            Clade(branch_length=0.100598..., name='rna2')
            Clade(branch_length=0.100598..., name='rna1')

进行麦克唐纳-克赖特曼测试

麦克唐纳-克赖特曼测试通过比较物种内同义取代和非同义取代与物种之间同义取代和非同义取代来评估适应性进化的量,看看它们是否来自同一进化过程。该测试需要从同一物种的不同个体中采样基因序列。在下面的例子中,我们将使用来自果蝇的Adh基因。该数据包括来自 Drosophila melanogaster 、4人来自 Drosophila simulans ,以及12名来自 Drosophila yakuba .蛋白质比对数据和核苷酸序列可在 Tests/codonalign 目录作为文件 adh.alndrosophila.fasta ,分别在Biopython分布中。功能 mktestBio.Align.analysis 实施Mcdonald-Kreitman测试。

>>> from Bio import SeqIO
>>> from Bio import Align
>>> from Bio.Align import CodonAligner
>>> from Bio.Align.analysis import mktest
>>> aligner = CodonAligner()
>>> nucleotide_records = SeqIO.index("drosophila.fasta", "fasta")
>>> for nucleotide_record in nucleotide_records.values():
...     print(nucleotide_record.description)
...
gi|9097|emb|X57361.1| Drosophila simulans (individual c) ...
gi|9099|emb|X57362.1| Drosophila simulans (individual d) ...
gi|9101|emb|X57363.1| Drosophila simulans (individual e) ...
gi|9103|emb|X57364.1| Drosophila simulans (individual f) ...
gi|9217|emb|X57365.1| Drosophila yakuba (individual a) ...
gi|9219|emb|X57366.1| Drosophila yakuba (individual b) ...
gi|9221|emb|X57367.1| Drosophila yakuba (individual c) ...
gi|9223|emb|X57368.1| Drosophila yakuba (individual d) ...
gi|9225|emb|X57369.1| Drosophila yakuba (individual e) ...
gi|9227|emb|X57370.1| Drosophila yakuba (individual f) ...
gi|9229|emb|X57371.1| Drosophila yakuba (individual g) ...
gi|9231|emb|X57372.1| Drosophila yakuba (individual h) ...
gi|9233|emb|X57373.1| Drosophila yakuba (individual i) ...
gi|9235|emb|X57374.1| Drosophila yakuba (individual j) ...
gi|9237|emb|X57375.1| Drosophila yakuba (individual k) ...
gi|9239|emb|X57376.1| Drosophila yakuba (individual l) ...
gi|156879|gb|M17837.1|DROADHCK D.melanogaster (strain Ja-F) ...
gi|156863|gb|M19547.1|DROADHCC D.melanogaster (strain Af-S) ...
gi|156877|gb|M17836.1|DROADHCJ D.melanogaster (strain Af-F) ...
gi|156875|gb|M17835.1|DROADHCI D.melanogaster (strain Wa-F) ...
gi|156873|gb|M17834.1|DROADHCH D.melanogaster (strain Fr-F) ...
gi|156871|gb|M17833.1|DROADHCG D.melanogaster (strain Fl-F) ...
gi|156869|gb|M17832.1|DROADHCF D.melanogaster (strain Ja-S) ...
gi|156867|gb|M17831.1|DROADHCE D.melanogaster (strain Fl-2S) ...
gi|156865|gb|M17830.1|DROADHCD D.melanogaster (strain Fr-S) ...
gi|156861|gb|M17828.1|DROADHCB D.melanogaster (strain Fl-1S) ...
gi|156859|gb|M17827.1|DROADHCA D.melanogaster (strain Wa-S) ...
>>> protein_alignment = Align.read("adh.aln", "clustal")
>>> len(protein_alignment)
27
>>> print(protein_alignment)
gi|9217|e         0 MAFTLTNKNVVFVAGLGGIGLDTSKELVKRDLKNLVILDRIENPAAIAELKAINPKVTVT
gi|9219|e         0 MAFTLTNKNVVFVAGLGGIGLDTSKELVKRDLKNLVILDRIENPAAIAELKAINPKVTVT
gi|9221|e         0 MAFTLTNKNVVFVAGLGGIGLDTSKELVKRDLKNLVILDRIENPAAIAELKAINPKVTVT
...
gi|156859         0 MSFTLTNKNVIFVAGLGGIGLDTSKELLKRDLKNLVILDRIENPAAIAELKAINPKVTVT

...

gi|9217|e       240 GTLEAIQWSKHWDSGI 256
gi|9219|e       240 GTLEAIQWSKHWDSGI 256
gi|9221|e       240 GTLEAIQWSKHWDSGI 256
...
gi|156859       240 GTLEAIQWTKHWDSGI 256

>>> codon_alignments = []
>>> for protein_record in protein_alignment.sequences:
...     nucleotide_record = nucleotide_records[protein_record.id]
...     alignments = aligner.align(protein_record, nucleotide_record)
...     assert len(alignments) == 1
...     codon_alignment = next(alignments)
...     codon_alignments.append(codon_alignment)
...
>>> print(codon_alignment)
gi|156859         0 M  S  F  T  L  T  N  K  N  V  I  F  V  A  G  L  G  G  I  G
gi|156859         0 ATGTCGTTTACTTTGACCAACAAGAACGTGATTTTCGTTGCCGGTCTGGGAGGCATTGGT

gi|156859        20 L  D  T  S  K  E  L  L  K  R  D  L  K  N  L  V  I  L  D  R
gi|156859        60 CTGGACACCAGCAAGGAGCTGCTCAAGCGCGATCTGAAGAACCTGGTGATCCTCGACCGC

...

gi|156859       240 G  T  L  E  A  I  Q  W  T  K  H  W  D  S  G  I   256
gi|156859       720 GGCACCCTGGAGGCCATCCAGTGGACCAAGCACTGGGACTCCGGCATC 768

>>> nucleotide_records.close()  # Close indexed FASTA file
>>> alignment = protein_alignment.mapall(codon_alignments)
>>> print(alignment)
gi|9217|e         0 ATGGCGTTTACCTTGACCAACAAGAACGTGGTTTTCGTGGCCGGTCTGGGAGGCATTGGT
gi|9219|e         0 ATGGCGTTTACCTTGACCAACAAGAACGTGGTTTTCGTGGCCGGTCTGGGAGGCATTGGT
gi|9221|e         0 ATGGCGTTTACCTTGACCAACAAGAACGTGGTTTTCGTGGCCGGTCTGGGAGGCATTGGT
...
gi|156859         0 ATGTCGTTTACTTTGACCAACAAGAACGTGATTTTCGTTGCCGGTCTGGGAGGCATTGGT

...

gi|9217|e       720 GGCACCCTGGAGGCCATCCAGTGGTCCAAGCACTGGGACTCCGGCATC 768
gi|9219|e       720 GGCACCCTGGAGGCCATCCAGTGGTCCAAGCACTGGGACTCCGGCATC 768
gi|9221|e       720 GGTACCCTGGAGGCCATCCAGTGGTCCAAGCACTGGGACTCCGGCATC 768
...
gi|156859       720 GGCACCCTGGAGGCCATCCAGTGGACCAAGCACTGGGACTCCGGCATC 768

>>> unique_species = ["Drosophila simulans", "Drosophila yakuba", "D.melanogaster"]
>>> species = []
>>> for record in alignment.sequences:
...     description = record.description
...     for s in unique_species:
...         if s in description:
...             break
...     else:
...         raise Exception(f"Failed to find species for {description}")
...     species.append(s)
...
>>> print(species)
['Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila yakuba', 'Drosophila simulans', 'Drosophila simulans', 'Drosophila simulans', 'Drosophila simulans', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster', 'D.melanogaster']
>>> pvalue = mktest(alignment, species)
>>> print(pvalue)
0.00206457...

除了多密码子对齐外,功能 mktest 将比对中每个序列所属的物种作为输入。密码子表可以作为可选参数提供 codon_table .