Sequences (skbio.sequence)
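This module provides classes for storing and working with sequences, including generic/nonbiological sequences that have no alphabet restrictions (Sequence) and sequences based on IUPAC-defined alphabets (DNA, RNA, Protein). Common operations are defined as methods, for example computing the reverse complement of a DNA sequence, or searching for N-glycosylation motifs in protein sequences. Class attributes provide the valid character sets, the complement maps for the different sequence types, and the degenerate character definitions. Additionally, this module defines the GeneticCode class, an immutable object that translates DNA or RNA sequences into protein sequences.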
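The primary information stored for each type of sequence object is the underlying sequence data itself, which is stored as an immutable array. Additionally, each type of sequence may include optional metadata and positional metadata. Note that metadata and positional metadata are mutable.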
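Classes
Sequence | Store generic sequence data and optional associated metadata. |
GrammaredSequence | Store sequence data conforming to a character set. |
DNA | Store DNA sequence data and optional associated metadata. |
RNA | Store RNA sequence data and optional associated metadata. |
Protein | Store protein sequence data and optional associated metadata. |
GeneticCode | Genetic code for translating codons to amino acids. |
Subpackages
distance | Sequence distance metrics (skbio.sequence.distance) |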
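Examples
New sequences are created with optional metadata and positional metadata. Metadata is stored as a Python dict, while positional metadata is stored as a pandas DataFrame.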
>>> from skbio import DNA, RNA
>>> d = DNA('ACCGGGTA', metadata={'id':"my-sequence", 'description':"GFP"},
...          positional_metadata={'quality':[22, 25, 22, 18, 23, 25, 25, 25]})
>>> d
DNA
-----------------------------
Metadata:
    'description': 'GFP'
    'id': 'my-sequence'
Positional metadata:
    'quality': <dtype: int64>
Stats:
    length: 8
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 62.50%
-----------------------------
0 ACCGGGTA
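The GC-content figure in the summary above can be checked by hand. The following is a minimal pure-Python sketch, not scikit-bio's implementation; the helper name `gc_content` is hypothetical here and, as a simplification, it counts only the definite characters G and C and ignores degenerate codes.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of characters in `seq` that are G or C.

    Simplified illustration: degenerate codes are not counted.
    """
    if not seq:
        return 0.0  # convention chosen for this sketch
    return sum(base in "GC" for base in seq) / len(seq)


# 5 of the 8 bases in ACCGGGTA are G or C.
print(f"{gc_content('ACCGGGTA'):.2%}")
```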
New sequences can also be created from existing sequences, for example as their reverse complement or degapped (i.e., unaligned) version.
>>> d1 = DNA('.ACC--GGG-TA...', metadata={'id':'my-sequence'})
>>> d2 = d1.degap()
>>> d2
DNA
--------------------------
Metadata:
    'id': 'my-sequence'
Stats:
    length: 8
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 62.50%
--------------------------
0 ACCGGGTA
>>> d3 = d2.reverse_complement()
>>> d3
DNA
--------------------------
Metadata:
    'id': 'my-sequence'
Stats:
    length: 8
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 62.50%
--------------------------
0 TACCCGGT
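To make the two operations above concrete, here is a minimal pure-Python sketch on plain strings. It is an illustration only and not how scikit-bio implements `degap` or `reverse_complement`; the function names and the `GAP_CHARS` and `COMPLEMENT` constants are assumptions of this sketch, and only the four definite DNA bases are handled.

```python
# Complement table for the four definite DNA bases only (assumption:
# degenerate codes are out of scope for this sketch).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
GAP_CHARS = "-."


def degap(seq: str) -> str:
    """Drop gap characters, leaving the unaligned sequence."""
    return "".join(ch for ch in seq if ch not in GAP_CHARS)


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


unaligned = degap(".ACC--GGG-TA...")
print(unaligned)
print(reverse_complement(unaligned))
```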
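It is also straightforward to compute the distance between sequences for use in sequence clustering, phylogenetic reconstruction, and so on. A user-defined distance metric may optionally be supplied; the default is Hamming distance, which requires the sequences being compared to have the same length.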
>>> r1 = RNA('GACCCGCUUU')
>>> r2 = RNA('GCCCCCCUUU')
>>> r1.distance(r2)
0.2
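The value 0.2 is the proportion of positions at which the two sequences differ. As a rough sketch of that calculation (plain Python on strings, with a hypothetical function name; not scikit-bio's own code):

```python
def hamming(a: str, b: str) -> float:
    """Proportion of positions at which `a` and `b` differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    if not a:
        return 0.0  # convention chosen for this sketch
    return sum(x != y for x, y in zip(a, b)) / len(a)


# The sequences differ at 2 of 10 positions.
print(hamming("GACCCGCUUU", "GCCCCCCUUU"))
```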
Similarly, you can calculate the percent (dis)similarity between a pair of aligned sequences.
>>> r3 = RNA('ACCGUUAGUC')
>>> r4 = RNA('ACGGGU--UC')
>>> r3.match_frequency(r4, relative=True)
0.6
>>> r3.mismatch_frequency(r4, relative=True)
0.4
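The two frequencies above are position-wise counts over the aligned pair. A minimal sketch on plain strings follows; the function names mirror the methods above but the bodies are an assumption of this sketch rather than scikit-bio's implementation. Note that a base aligned against a gap counts as a mismatch here.

```python
def match_frequency(a: str, b: str, relative: bool = False):
    """Count positions where `a` and `b` agree; as a fraction if `relative`.

    Assumes non-empty input when `relative` is True.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a) if relative else matches


def mismatch_frequency(a: str, b: str, relative: bool = False):
    """Count positions where `a` and `b` differ; as a fraction if `relative`."""
    mismatches = len(a) - match_frequency(a, b)
    return mismatches / len(a) if relative else mismatches


print(match_frequency("ACCGUUAGUC", "ACGGGU--UC", relative=True))
print(mismatch_frequency("ACCGUUAGUC", "ACGGGU--UC", relative=True))
```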
Sequences can be searched for known motif types. This returns the slices that describe the matches.
>>> r5 = RNA('AGG-GGACUGAA')
>>> for motif in r5.find_motifs('purine-run', min_length=2):
...     motif
slice(0, 3, None)
slice(4, 7, None)
slice(9, 12, None)
These slices can be used to extract the relevant subsequences.
>>> for motif in r5.find_motifs('purine-run', min_length=2):
...     r5[motif]
...     print('')
RNA
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 66.67%
--------------------------
0 AGG
RNA
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 66.67%
--------------------------
0 GGA
RNA
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 33.33%
--------------------------
0 GAA
Gaps or other features may be ignored while searching, as these may disrupt otherwise meaningful motifs.
>>> for motif in r5.find_motifs('purine-run', min_length=2, ignore=r5.gaps()):
...     r5[motif]
...     print('')
RNA
--------------------------
Stats:
    length: 7
    has gaps: True
    has degenerates: False
    has definites: True
    GC-content: 66.67%
--------------------------
0 AGG-GGA
RNA
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 33.33%
--------------------------
0 GAA
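In the above example, removing gaps from the resulting motif matches is easy, because the sliced matches are themselves sequences of the same type as the input.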
>>> for motif in r5.find_motifs('purine-run', min_length=2, ignore=r5.gaps()):
...     r5[motif].degap()
...     print('')
RNA
--------------------------
Stats:
    length: 6
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 66.67%
--------------------------
0 AGGGGA
RNA
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 33.33%
--------------------------
0 GAA
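One way to see why ignoring gaps merges the first two matches is to run the search on the gap-free characters and then map the match boundaries back to positions in the original sequence. The sketch below does this on a plain string. It is an illustration under stated assumptions, not scikit-bio's implementation: `find_purine_runs` and its `ignore_gaps` flag are hypothetical, and purines are taken to be only A and G.

```python
import re

GAP_CHARS = "-."


def find_purine_runs(seq: str, min_length: int = 1, ignore_gaps: bool = False):
    """Yield slices into `seq` covering runs of purines (A or G)."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # Original positions of the characters that take part in the search.
    kept = [i for i, ch in enumerate(seq)
            if not (ignore_gaps and ch in GAP_CHARS)]
    searchable = "".join(seq[i] for i in kept)
    for m in re.finditer("[AG]{%d,}" % min_length, searchable):
        # Map the match boundaries back to positions in `seq`.
        yield slice(kept[m.start()], kept[m.end() - 1] + 1)


seq = "AGG-GGACUGAA"
print([seq[s] for s in find_purine_runs(seq, min_length=2)])
print([seq[s] for s in find_purine_runs(seq, min_length=2, ignore_gaps=True)])
```

With gaps ignored, the first slice spans the gap, so the extracted subsequence still contains it and can be degapped afterwards.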
Sequences can similarly be searched for arbitrary patterns using regular expressions.
>>> for match in r5.find_with_regex('(G+AC[UT])'):
...     match
slice(4, 9, None)
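The same result can be reproduced on a plain string with Python's standard `re` module, which may help when reasoning about what a pattern will match. This is a sketch only: the standalone function below is hypothetical and, as a simplification, yields one slice per whole match rather than following scikit-bio's exact handling of capture groups.

```python
import re


def find_with_regex(seq: str, pattern: str):
    """Yield a slice into `seq` for every non-overlapping regex match."""
    for m in re.finditer(pattern, seq):
        yield slice(m.start(), m.end())


seq = "AGG-GGACUGAA"
for match in find_with_regex(seq, "(G+AC[UT])"):
    print(match, seq[match])
```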
DNA can be transcribed to RNA:
>>> dna = DNA('ATGTGTATTTGA')
>>> rna = dna.transcribe()
>>> rna
RNA
--------------------------
Stats:
    length: 12
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 25.00%
--------------------------
0 AUGUGUAUUU GA
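Both DNA and RNA can be translated into a protein sequence. For example, let's translate our DNA and RNA sequences using NCBI's standard genetic code (table ID 1, the default genetic code in scikit-bio):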
>>> protein_from_dna = dna.translate()
>>> protein_from_dna
Protein
--------------------------
Stats:
    length: 4
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
--------------------------
0 MCI*
>>> protein_from_rna = rna.translate()
>>> protein_from_rna
Protein
--------------------------
Stats:
    length: 4
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: True
--------------------------
0 MCI*
The two translations are equivalent:
>>> protein_from_dna == protein_from_rna
True
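The equivalence follows from how transcription and translation work: transcription replaces T with U, and translation reads the sequence three characters at a time and looks each codon up in the genetic code. A minimal pure-Python sketch of both steps is shown here. It is not scikit-bio's GeneticCode implementation; the function and constant names are assumptions of this sketch, it builds the NCBI standard table (ID 1) from the usual TCAG-ordered amino acid string, and it handles only definite, gap-free sequences in reading frame 1.

```python
from itertools import product

# NCBI standard genetic code (table 1), with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Replace every T with U."""
    return dna.replace("T", "U")


def translate(seq: str) -> str:
    """Translate a DNA or RNA string codon by codon ('*' marks a stop)."""
    dna = seq.replace("U", "T")  # look codons up in the DNA alphabet
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


dna = "ATGTGTATTTGA"
rna = transcribe(dna)
print(rna)
print(translate(dna), translate(rna))
```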
Class-level attributes contain information about the molecule types.
>>> sorted(DNA.degenerate_map['B'])
['C', 'G', 'T']
>>> sorted(RNA.degenerate_map['B'])
['C', 'G', 'U']
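A degenerate map of this kind is enough to enumerate every definite sequence that a degenerate sequence could stand for. The sketch below illustrates the idea with a hand-written table of the IUPAC DNA degenerate codes; the `DNA_DEGENERATE_MAP` dictionary and the `expand_degenerates` helper are written out for this example and are not taken from scikit-bio's source.

```python
from itertools import product

# IUPAC degenerate DNA codes and the definite bases each one represents.
DNA_DEGENERATE_MAP = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerates(seq: str) -> list:
    """List every definite sequence represented by `seq`."""
    # A definite character maps to itself; a degenerate one to its options.
    choices = [DNA_DEGENERATE_MAP.get(ch, ch) for ch in seq]
    return ["".join(chars) for chars in product(*choices)]


print(sorted(DNA_DEGENERATE_MAP["B"]))
print(expand_degenerates("ACB"))
```

The number of expansions is the product of the option counts at each position, so sequences with many degenerate characters expand quickly.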