Bio.PopGen: Population Genetics

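Bio.PopGen is a Biopython module that supports population genetics, available from Biopython 1.44 onwards. The goal of the module is to support widely used data formats, applications and databases.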

GenePop

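GenePop (http://genepop.curtin.edu.au/) is a popular population genetics software package supporting Hardy-Weinberg tests, linkage disequilibrium, population differentiation, basic statistics, \(F_{st}\) and migration estimates, among others. GenePop does not provide sequence-based statistics, because it does not handle sequence data. The GenePop file format is supported by a wide range of other population genetics applications, which makes it a relevant format in the field.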

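Bio.PopGen provides a parser and a generator for the GenePop file format. Utilities to manipulate the content of a record are also provided. Here is an example of how to read a GenePop file (you can find example GenePop data files in the Test/PopGen directory of Biopython):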

from Bio.PopGen import GenePop

with open("example.gen") as handle:
    rec = GenePop.read(handle)

This reads a file called example.gen and parses it. If you print rec, the record is output again in GenePop format.

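The most important information in rec is the list of locus names and the population information (there is more; use help(GenePop.Record) to check the API documentation). Locus names are found in rec.loci_list. Population information is found in rec.populations, which is a list with one element per population. Each element is itself a list of individuals, and each individual is a pair made of the individual's name and a list of alleles (2 per marker). Here is an example of rec.populations: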

[
    [
        ("Ind1", [(1, 2), (3, 3), (200, 201)]),
        ("Ind2", [(2, None), (3, 3), (None, None)]),
    ],
    [
        ("Other1", [(1, 1), (4, 3), (200, 200)]),
    ],
]

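So we have two populations, the first with two individuals and the second with only one. The first individual of the first population is called Ind1, and its name is followed by the allelic information for each of the 3 loci. Note that information may be missing for any locus (see Ind2 above, where missing alleles appear as None).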
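The nested structure described above can be walked with ordinary list operations. The sketch below uses a plain Python literal copied from the example rather than a parsed record, so it runs without Biopython; the helper name count_missing is made up for this illustration and is not part of Bio.PopGen.

```python
# Plain-Python stand-in for rec.populations, copied from the example above.
populations = [
    [
        ("Ind1", [(1, 2), (3, 3), (200, 201)]),
        ("Ind2", [(2, None), (3, 3), (None, None)]),
    ],
    [
        ("Other1", [(1, 1), (4, 3), (200, 200)]),
    ],
]


def count_missing(pops):
    """Return, for each population, how many alleles are missing (None).

    Hypothetical helper for illustration only.
    """
    counts = []
    for pop in pops:
        missing = 0
        for _name, genotypes in pop:
            for alleles in genotypes:
                missing += sum(1 for allele in alleles if allele is None)
        counts.append(missing)
    return counts


# List the individuals of each population by position.
for index, pop in enumerate(populations):
    names = [name for name, _genotypes in pop]
    print(f"population {index}: individuals {names}")

print(count_missing(populations))
```

With a real record, the same loops should apply to rec.populations, assuming it has the shape shown in the example.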

A few utility functions for manipulating GenePop records are provided. Here is an example:

from Bio.PopGen import GenePop

# Imagine that you have loaded rec, as per the code snippet above...

rec.remove_population(pos)
# Removes a population from a record, pos is the population position in
# rec.populations, remember that it starts on position 0.
# rec is altered.

rec.remove_locus_by_position(pos)
# Removes a locus by its position, pos is the locus position in
# rec.loci_list, remember that it starts on position 0.
# rec is altered.

rec.remove_locus_by_name(name)
# Removes a locus by its name, name is the locus name as in
# rec.loci_list. If the name doesn't exist the function fails
# silently.
# rec is altered.

rec_loci = rec.split_in_loci()
# Splits a record by locus, that is, for each locus, it creates a new
# record with that single locus and all populations.
# The result is returned in a dictionary whose keys are the locus names.
# Each value is a GenePop record.
# rec is not altered.

rec_pops = rec.split_in_pops(pop_names)
# Splits a record by population, that is, for each population, it creates
# a new record with that single population and all loci.
# The result is returned in a dictionary whose keys are
# the population names. As population names are not available in GenePop,
# they are passed in as a list (pop_names), one name per population.
# Each value in the dictionary is a GenePop record.
# rec is not altered.
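
To make the role of pop_names concrete, here is a plain-Python sketch of the pairing that the comments above describe: the i-th name labels the i-th population, and the input is left untouched. It works on a bare list instead of a GenePop record, and the function name label_populations and the sample names "North" and "South" are hypothetical. It is not Biopython's implementation.

```python
from copy import deepcopy

# Plain-Python stand-in for rec.populations (same shape as the earlier example).
populations = [
    [
        ("Ind1", [(1, 2), (3, 3), (200, 201)]),
        ("Ind2", [(2, None), (3, 3), (None, None)]),
    ],
    [
        ("Other1", [(1, 1), (4, 3), (200, 200)]),
    ],
]


def label_populations(pops, pop_names):
    """Map each caller-supplied name to a copy of the matching population.

    Hypothetical helper: names are matched to populations by position.
    """
    if len(pops) != len(pop_names):
        raise ValueError("need exactly one name per population")
    # deepcopy keeps the original list unaltered if a copy is edited later.
    return {name: deepcopy(pop) for name, pop in zip(pop_names, pops)}


labelled = label_populations(populations, ["North", "South"])
print(sorted(labelled))
print([name for name, _genotypes in labelled["North"]])
```

Because the names live outside the record, the caller has to keep the list of names in the same order as the populations.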

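GenePop does not support population names, a limitation that can be cumbersome at times. Functionality to enable population names is currently being planned for Biopython. These extensions will not break compatibility with the standard format in any way. In the medium term, we would also like to support the GenePop web service.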