Table Indexing#

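Once a Table has been created, it is possible to create indices on one or more columns of the table. An index internally sorts the rows of a table based on the index column(s), which allows elements to be retrieved by column value and improves the performance of certain table operations.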
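To make the idea concrete, the following is a minimal pure-Python sketch of a sorted index. It is not astropy's implementation: the helper names `build_index` and `rows_where` are hypothetical, and the sketch only assumes that an index keeps (value, row position) pairs in sorted order so that a lookup can use binary search.

```python
from bisect import bisect_left, bisect_right


def build_index(column):
    """Return (sorted_values, row_positions) for one column.

    Hypothetical helper: pairs each value with its original row number
    and sorts the pairs by value (ties are broken by row number).
    """
    pairs = sorted((value, row) for row, value in enumerate(column))
    sorted_values = [value for value, _ in pairs]
    row_positions = [row for _, row in pairs]
    return sorted_values, row_positions


def rows_where(index, value):
    """Return the original row numbers whose column value equals `value`."""
    sorted_values, row_positions = index
    lo = bisect_left(sorted_values, value)
    hi = bisect_right(sorted_values, value)
    return row_positions[lo:hi]


column_a = [2, 3, 2, 1]
index_a = build_index(column_a)
print(index_a)                 # sorted values and their original rows
print(rows_where(index_a, 2))  # rows where the value equals 2
```

The table rows themselves are never reordered in this sketch; only the index holds the sorted view.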

Creating an Index#

To create an index on a table, use the add_index() method:

>>> from astropy.table import Table
>>> t = Table([(2, 3, 2, 1), (8, 7, 6, 5)], names=('a', 'b'))
>>> t.add_index('a')

The optional argument unique may be specified to create an index whose elements must have unique values.
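As an illustration of what a uniqueness constraint means for an index, here is a small pure-Python sketch. It is not astropy's implementation: the class `UniqueIndex` and its `add` method are hypothetical, and rejecting duplicates with a ValueError is an assumption made for this example.

```python
from bisect import bisect_left


class UniqueIndex:
    """Toy sorted index that refuses duplicate keys (hypothetical)."""

    def __init__(self):
        self.keys = []  # sorted key values
        self.rows = []  # row number for each key, in the same order

    def add(self, key, row):
        pos = bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            # Assumption for this sketch: a duplicate key is an error.
            raise ValueError(f"duplicate key {key!r} in unique index")
        self.keys.insert(pos, key)
        self.rows.insert(pos, row)


index = UniqueIndex()
for row, key in enumerate([3, 1, 2]):
    index.add(key, row)

try:
    index.add(2, 3)  # key 2 is already present
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```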

To create a composite index on multiple columns, pass a list of columns instead:

>>> t.add_index(['a', 'b'])

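In particular, the first index created using the add_index() method is considered the default index or the "primary key". To retrieve an index from a table, use the indices property: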

>>> t.indices['a']
<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
 a  rows
--- ----
  1    3
  2    0
  2    2
  3    1>>
>>> t.indices['a', 'b']
<SlicedIndex original=True index=<Index columns=('a', 'b') data=<SortedArray length=4>
 a   b  rows
--- --- ----
  1   5    3
  2   6    2
  2   8    0
  3   7    1>>
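The ordering shown for the composite index can be reproduced with plain Python tuples, which compare lexicographically. This is only a sketch of the sorting idea, under the assumption that a composite key behaves like a tuple of the column values; it is not how astropy stores the index.

```python
a = [2, 3, 2, 1]
b = [8, 7, 6, 5]

# Pair each composite key (a, b) with its original row number, then sort.
# Tuples compare element by element, so 'a' is the primary sort key and
# 'b' breaks ties between equal values of 'a'.
entries = sorted(((a_val, b_val), row)
                 for row, (a_val, b_val) in enumerate(zip(a, b)))

for key, row in entries:
    print(key, row)
```

The resulting row order matches the rows column of the composite index printed above.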

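Row Retrieval using Indices#

Row retrieval can be accomplished using two table properties: loc and iloc. The loc property can be indexed by a column value, a range of column values (including the bounds), or a list or numpy.ndarray of column values: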

>>> t = Table([(1, 2, 3, 4), (10, 1, 9, 9)], names=('a', 'b'), dtype=['i8', 'i8'])
>>> t.add_index('a')
>>> t.loc[2]  # the row(s) where a == 2
<Row index=1>
  a     b
int64 int64
----- -----
    2     1
>>> t.loc[[1, 4]]  # the row(s) where a in [1, 4]
<Table length=2>
  a     b
int64 int64
----- -----
    1    10
    4     9
>>> t.loc[1:3]  # the row(s) where a in [1, 2, 3]
<Table length=3>
  a     b
int64 int64
----- -----
    1    10
    2     1
    3     9
>>> t.loc[:]
<Table length=4>
  a     b
int64 int64
----- -----
    1    10
    2     1
    3     9
    4     9

Note that by default, loc uses the primary index, which here is column 'a'. To use a different index, pass the indexed column name before the retrieval data:

>>> t.add_index('b')
>>> t.loc['b', 8:10]
<Table length=3>
  a     b
int64 int64
----- -----
    3     9
    4     9
    1    10

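The property iloc works similarly, except that the retrieval information must be either an integer or a slice, and it relates to the sorted order of the index rather than to column values. For example: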

>>> t.iloc[0] # smallest row by value 'a'
<Row index=0>
  a     b
int64 int64
----- -----
    1    10
>>> t.iloc['b', 1:] # all but smallest value of 'b'
<Table length=3>
  a     b
int64 int64
----- -----
    3     9
    4     9
    1    10
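The difference between value-based and position-based retrieval can be sketched in pure Python. The functions `loc_range` and `iloc_slice` below are hypothetical stand-ins, not astropy APIs; they assume that the index is a list of (value, row) pairs sorted by value.

```python
from bisect import bisect_left, bisect_right

b = [10, 1, 9, 9]

# Sorted (value, row) pairs play the role of the index on column 'b'.
index_b = sorted((value, row) for row, value in enumerate(b))
values = [value for value, _ in index_b]


def loc_range(low, high):
    """Rows whose value lies in [low, high], both bounds included."""
    start = bisect_left(values, low)
    stop = bisect_right(values, high)
    return [row for _, row in index_b[start:stop]]


def iloc_slice(positions):
    """Rows selected by position in the sorted order of the index."""
    return [row for _, row in index_b[positions]]


print(loc_range(8, 10))             # value-based: rows with 8 <= b <= 10
print(iloc_slice(slice(1, None)))   # position-based: skip the smallest b
```

Both calls select rows 2, 3, and 0 here, which correspond to the three rows returned by the t.loc['b', 8:10] and t.iloc['b', 1:] examples above.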

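Effects on Performance#

Table operations change somewhat when indices are present, and there are a number of factors to consider when deciding whether the use of indices will improve performance. In general, indexing offers the following advantages:

Table grouping and sorting based on the indexed column(s) both become faster.
Retrieving values by index is faster than a custom search.

There are certain caveats, however: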

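Creating an index requires time and memory.
Table modifications become slower due to automatic index updates.
Slicing a table becomes slower due to index relabeling.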
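To see why an index lookup can beat a custom search, the sketch below counts the steps taken by a linear scan and by a binary search over sorted values. It illustrates the general trade-off only, uses hypothetical helper names, and says nothing about astropy's actual timings.

```python
def linear_find(values, target):
    """Scan from the start; return (position, number of elements checked)."""
    steps = 0
    for pos, value in enumerate(values):
        steps += 1
        if value == target:
            return pos, steps
    return None, steps


def binary_find(sorted_values, target):
    """Lower-bound binary search; return (position, number of halvings)."""
    lo, hi, steps = 0, len(sorted_values), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_values) and sorted_values[lo] == target
    return (lo if found else None), steps


# 1024 sorted values; a real column would usually be unsorted, with the
# sorted order provided by the index instead.
values = list(range(0, 2048, 2))
linear_pos, linear_steps = linear_find(values, 2000)
binary_pos, binary_steps = binary_find(values, 2000)
print(linear_pos, linear_steps)
print(binary_pos, binary_steps)
```

The binary search needs the values to be kept sorted, which is where the cost of creating and updating an index comes from.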

See here for an IPython notebook that profiles various aspects of table indexing.

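Index Modes#

The index_mode() method allows for some flexibility in the behavior of table indexing by letting the user enter a specific indexing mode via a context manager. There are currently three indexing modes: 'freeze', 'copy_on_getitem', and 'discard_on_copy'.

The 'freeze' mode prevents automatic index updates whenever a column of the index is modified, and all indices refresh themselves after the context ends: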

>>> with t.index_mode('freeze'):
...    t['a'][0] = 0
...    print(t.indices['a']) # unmodified
<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
 a  rows
--- ----
  1    0
  2    1
  3    2
  4    3>>
>>> print(t.indices['a']) # modified
<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
 a  rows
--- ----
  0    0
  2    1
  3    2
  4    3>>
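The deferred-update behavior of a freeze-style mode can be sketched with a context manager. The class `ToyIndexedColumn` and its `freeze` method are hypothetical and much simpler than astropy's machinery; the sketch assumes that the index is a sorted list of (value, row) pairs that is rebuilt whenever it needs refreshing.

```python
from contextlib import contextmanager


class ToyIndexedColumn:
    """A column plus a sorted (value, row) index (hypothetical sketch)."""

    def __init__(self, values):
        self.values = list(values)
        self.frozen = False
        self._rebuild()

    def _rebuild(self):
        self.index = sorted((v, row) for row, v in enumerate(self.values))

    def __setitem__(self, row, value):
        self.values[row] = value
        if not self.frozen:
            self._rebuild()  # automatic update on every modification

    @contextmanager
    def freeze(self):
        self.frozen = True
        try:
            yield self
        finally:
            self.frozen = False
            self._rebuild()  # refresh once when the context ends


col = ToyIndexedColumn([1, 2, 3, 4])
with col.freeze():
    col[0] = 0
    inside = list(col.index)  # stale: still reflects the old value
after = list(col.index)       # refreshed on exit
```

Deferring the refresh this way trades a temporarily stale index for a single rebuild instead of one update per modification.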

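The 'copy_on_getitem' mode forces columns to copy and relabel their indices upon slicing. In the absence of this mode, table slices will preserve indices while column slices will not: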

>>> ca = t['a'][[1, 3]]
>>> ca.info.indices
[]
>>> with t.index_mode('copy_on_getitem'):
...     ca = t['a'][[1, 3]]
...     print(ca.info.indices)
[<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=2>
 a  rows
--- ----
  2    0
  4    1>>]

The 'discard_on_copy' mode prevents indices from being copied whenever a column or table is copied:

>>> t2 = Table(t)
>>> t2.indices['a']
<SlicedIndex original=True index=<Index columns=('a',) data=<SortedArray length=4>
 a  rows
--- ----
  0    0
  2    1
  3    2
  4    3>>
>>> with t.index_mode('discard_on_copy'):
...    t2 = Table(t)
...    print(t2.indices)
[]

Updating Rows using Indices#

Row updates can be accomplished by assigning a complete row or a list of rows to the table property loc:

>>> t = Table([('w', 'x', 'y', 'z'), (10, 1, 9, 9)], names=('a', 'b'), dtype=['str', 'i8'])
>>> t.add_index('a')
>>> t.loc['x']
<Row index=1>
 a     b
str1 int64
---- -----
   x     1
>>> t.loc['x'] = ['a', 12]
>>> t
<Table length=4>
 a     b
str1 int64
---- -----
   w    10
   a    12
   y     9
   z     9
>>> t.loc[['w', 'y']]
<Table length=2>
 a     b
str1 int64
---- -----
   w    10
   y     9
>>> t.loc[['w', 'z']] = [['b', 23], ['c', 56]]
>>> t
<Table length=4>
 a     b
str1 int64
---- -----
   b    23
   a    12
   y     9
   c    56

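Retrieving the Location of Rows using Indices#

Retrieval of the location of rows can be accomplished using a table property: loc_indices. The loc_indices property can be indexed by a column value, a range of column values (including the bounds), or a list or numpy.ndarray of column values: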

>>> t = Table([('w', 'x', 'y', 'z'), (10, 1, 9, 9)], names=('a', 'b'), dtype=['str', 'i8'])
>>> t.add_index('a')
>>> t.loc_indices['x']
np.int64(1)

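Engines#

When creating an index via add_index(), the keyword argument engine may be specified to use a particular indexing engine. The available engines are:

SortedArray, a sorted array engine using an underlying sorted Table.
SCEngine, a sorted list engine using the Sorted Containers package.
BST, a Python-based binary search tree engine (not recommended).

The SCEngine depends on the sortedcontainers dependency. The most important takeaway is that SortedArray (the default engine) is usually best, although SCEngine may be more appropriate for an index created on an empty column since adding new values is quicker.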

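The BST engine demonstrates a simple pure Python implementation of a search tree engine, but its performance is poor for larger tables. It is available in the code largely as an implementation reference.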
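For readers curious what a pure Python search tree engine might look like, here is a deliberately small sketch. It is not astropy's BST class: the names `Node`, `ToyBST`, `add`, `find`, and `sorted_rows` are chosen for illustration, the tree is unbalanced, and removal is omitted.

```python
class Node:
    def __init__(self, key, row):
        self.key = key
        self.rows = [row]  # all rows sharing this key
        self.left = None
        self.right = None


class ToyBST:
    """Unbalanced binary search tree mapping keys to row numbers."""

    def __init__(self):
        self.root = None

    def add(self, key, row):
        if self.root is None:
            self.root = Node(key, row)
            return
        node = self.root
        while True:
            if key == node.key:
                node.rows.append(row)
                return
            branch = "left" if key < node.key else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, Node(key, row))
                return
            node = child

    def find(self, key):
        """Return the rows stored under `key` (empty list if absent)."""
        node = self.root
        while node is not None:
            if key == node.key:
                return list(node.rows)
            node = node.left if key < node.key else node.right
        return []

    def sorted_rows(self):
        """In-order traversal: row numbers ordered by key."""
        out, stack, node = [], [], self.root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            out.extend(node.rows)
            node = node.right
        return out


tree = ToyBST()
for row, key in enumerate([2, 3, 2, 1]):
    tree.add(key, row)
```

Because this toy tree is never rebalanced, inserting keys that are already sorted degrades it into a linked list, which hints at why a simple tree like this one may scale poorly.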