RFC 63: Sparse datasets improvements

Author: Even Rouault

Contact: even.rouault at spatialys.com

Status: Adopted, implemented

Version: 2.2

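Summary

This RFC covers an improvement to the management of sparse datasets, that is to say datasets that contain substantial empty regions.

Approach

Some use cases require reading or generating a dataset that covers a large spatial extent, but in which significant parts are not covered by data. The GDAL API offers no way to quickly know which areas are covered by data, so all pixels have to be processed, which is rather inefficient. Yet some formats, such as GeoTIFF, VRT or GeoPackage, can potentially provide this information without processing pixels.

It is thus proposed to add a new method GetDataCoverageStatus() to the GDALRasterBand class. It takes a window of interest as input and returns whether that window is made of data, empty blocks or a mix of both.

This method will be used by the GDALDatasetCopyWholeRaster() method (used by CreateCopy() / gdal_translate) to avoid processing sparse regions when the output driver instructs it to do so.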

C++ API

In the GDALRasterBand class, a new virtual method is added:

 virtual int IGetDataCoverageStatus( int nXOff, int nYOff,
                                     int nXSize, int nYSize,
                                     int nMaskFlagStop,
                                     double* pdfDataPct);


/**
 * \brief Get the coverage status of a sub-window of the raster.
 *
 * Returns whether a sub-window of the raster contains only data, only empty
 * blocks or a mix of both. This function can be used to determine quickly
 * if it is worth issuing RasterIO / ReadBlock requests in datasets that may
 * be sparse.
 *
 * Empty blocks are blocks that contain only pixels whose value is the nodata
 * value when it is set, or whose value is 0 when the nodata value is not set.
 *
 * The query is done in an efficient way without reading the actual pixel
 * values. If not possible, or not implemented at all by the driver,
 * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED | GDAL_DATA_COVERAGE_STATUS_DATA will
 * be returned.
 *
 * The values that can be returned by the function are the following,
 * potentially combined with the binary or operator :
 * <ul>
 * <li>GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED : the driver does not implement
 * GetDataCoverageStatus(). This flag should be returned together with
 * GDAL_DATA_COVERAGE_STATUS_DATA.</li>
 * <li>GDAL_DATA_COVERAGE_STATUS_DATA: There is (potentially) data in the queried
 * window.</li>
 * <li>GDAL_DATA_COVERAGE_STATUS_EMPTY: There is nodata in the queried window.
 * This is typically identified by the concept of missing block in formats that
 * support it.
 * </li>
 * </ul>
 *
 * Note that GDAL_DATA_COVERAGE_STATUS_DATA might have false positives and
 * should be interpreted more as a hint of potential presence of data. For example
 * if a GeoTIFF file is created with blocks filled with zeroes (or set to the
 * nodata value), instead of using the missing block mechanism,
 * GDAL_DATA_COVERAGE_STATUS_DATA will be returned. On the contrary,
 * GDAL_DATA_COVERAGE_STATUS_EMPTY should have no false positives.
 *
 * The nMaskFlagStop should be generally set to 0. It can be set to a
 * binary-or'ed mask of the above mentioned values to enable a quick exiting of
 * the function as soon as the computed mask matches the nMaskFlagStop. For
 * example, you can issue a request on the whole raster with nMaskFlagStop =
 * GDAL_DATA_COVERAGE_STATUS_EMPTY. As soon as one missing block is encountered,
 * the function will exit, so that you can potentially refine the requested area
 * to find which particular region(s) have missing blocks.
 *
 * @see GDALGetDataCoverageStatus()
 *
 * @param nXOff The pixel offset to the top left corner of the region
 * of the band to be queried. This would be zero to start from the left side.
 *
 * @param nYOff The line offset to the top left corner of the region
 * of the band to be queried. This would be zero to start from the top.
 *
 * @param nXSize The width of the region of the band to be queried in pixels.
 *
 * @param nYSize The height of the region of the band to be queried in lines.
 *
 * @param nMaskFlagStop 0, or a binary-or'ed mask of possible values
 * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED,
 * GDAL_DATA_COVERAGE_STATUS_DATA and GDAL_DATA_COVERAGE_STATUS_EMPTY. As soon
 * as the computation of the coverage matches the mask, the computation will be
 * stopped. *pdfDataPct will not be valid in that case.
 *
 * @param pdfDataPct Optional output parameter whose pointed value will be set
 * to the (approximate) percentage in [0,100] of pixels in the queried
 * sub-window that have valid values. The implementation might not always be
 * able to compute it, in which case it will be set to a negative value.
 *
 * @return a binary-or'ed combination of possible values
 * GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED,
 * GDAL_DATA_COVERAGE_STATUS_DATA and GDAL_DATA_COVERAGE_STATUS_EMPTY
 *
 * @note Added in GDAL 2.2
 */

This method has a trivial default implementation that returns GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED | GDAL_DATA_COVERAGE_STATUS_DATA.
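To make the documented semantics concrete, here is a minimal, self-contained sketch of how a block-based driver might answer such a query. It does not use GDAL: `SparseBand`, `SketchDataCoverageStatus` and the `COVERAGE_*` constants are hypothetical stand-ins (the constants reuse the flag values listed in this RFC), and "matches nMaskFlagStop" is interpreted here as "shares at least one bit with it", which is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Local stand-ins reusing the flag values listed in this RFC.
constexpr int COVERAGE_DATA = 0x02;
constexpr int COVERAGE_EMPTY = 0x04;

// Hypothetical tiled band: one flag per block, true when the block is
// physically allocated (potentially holds data), false when it is missing.
struct SparseBand {
    int nRasterXSize, nRasterYSize;
    int nBlockXSize, nBlockYSize;
    std::vector<bool> abAllocated;  // row-major, one entry per block

    int BlocksPerRow() const {
        return (nRasterXSize + nBlockXSize - 1) / nBlockXSize;
    }
};

// Sketch of a coverage query over a window that is assumed to be already
// validated (non-empty and fully inside the raster).
int SketchDataCoverageStatus(const SparseBand& band,
                             int nXOff, int nYOff, int nXSize, int nYSize,
                             int nMaskFlagStop, double* pdfDataPct)
{
    assert(nXSize > 0 && nYSize > 0);
    const int iXFirst = nXOff / band.nBlockXSize;
    const int iYFirst = nYOff / band.nBlockYSize;
    const int iXLast = (nXOff + nXSize - 1) / band.nBlockXSize;
    const int iYLast = (nYOff + nYSize - 1) / band.nBlockYSize;

    int nStatus = 0;
    long long nPixelsData = 0;
    for (int iY = iYFirst; iY <= iYLast; ++iY) {
        for (int iX = iXFirst; iX <= iXLast; ++iX) {
            const bool bData =
                band.abAllocated[iY * band.BlocksPerRow() + iX];
            nStatus |= bData ? COVERAGE_DATA : COVERAGE_EMPTY;

            // Early exit: the percentage is not valid in that case.
            if (nMaskFlagStop != 0 && (nStatus & nMaskFlagStop) != 0) {
                if (pdfDataPct) *pdfDataPct = -1.0;
                return nStatus;
            }
            if (bData) {
                // Count only the part of the block inside the window.
                const int nX0 = std::max(nXOff, iX * band.nBlockXSize);
                const int nY0 = std::max(nYOff, iY * band.nBlockYSize);
                const int nX1 = std::min(nXOff + nXSize,
                                         (iX + 1) * band.nBlockXSize);
                const int nY1 = std::min(nYOff + nYSize,
                                         (iY + 1) * band.nBlockYSize);
                nPixelsData +=
                    static_cast<long long>(nX1 - nX0) * (nY1 - nY0);
            }
        }
    }
    if (pdfDataPct)
        *pdfDataPct = 100.0 * nPixelsData /
                      (static_cast<double>(nXSize) * nYSize);
    return nStatus;
}
```

The sketch only looks at per-block allocation flags and never at pixel values, which is why a "data" answer can be a false positive while an "empty" answer cannot.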

The public API is made of:

/** Flag returned by GDALGetDataCoverageStatus() when the driver does not
 * implement GetDataCoverageStatus(). This flag should be returned together
 * with GDAL_DATA_COVERAGE_STATUS_DATA */
#define GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED 0x01

/** Flag returned by GDALGetDataCoverageStatus() when there is (potentially)
 * data in the queried window. Can be combined with the binary or operator
 * with GDAL_DATA_COVERAGE_STATUS_UNIMPLEMENTED or
 * GDAL_DATA_COVERAGE_STATUS_EMPTY */
#define GDAL_DATA_COVERAGE_STATUS_DATA          0x02

/** Flag returned by GDALGetDataCoverageStatus() when there is nodata in the
 * queried window. This is typically identified by the concept of missing block
 * in formats that support it.
 * Can be combined with the binary or operator with
 * GDAL_DATA_COVERAGE_STATUS_DATA */
#define GDAL_DATA_COVERAGE_STATUS_EMPTY         0x04
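
As an illustration of how a caller might read the returned mask, the following sketch maps a flag combination to a short label. `DescribeCoverage` and the `COVERAGE_*` constants are hypothetical names introduced here; only the numeric values come from the definitions above.

```cpp
#include <cassert>
#include <string>

// Local stand-ins reusing the flag values defined above.
constexpr int COVERAGE_UNIMPLEMENTED = 0x01;
constexpr int COVERAGE_DATA = 0x02;
constexpr int COVERAGE_EMPTY = 0x04;

// Hypothetical helper: turn a returned flag mask into a short label.
std::string DescribeCoverage(int nFlags)
{
    // Only the three documented bits are expected.
    assert((nFlags & ~(COVERAGE_UNIMPLEMENTED | COVERAGE_DATA |
                       COVERAGE_EMPTY)) == 0);

    // The driver could not answer: treat the window as potentially data.
    if (nFlags & COVERAGE_UNIMPLEMENTED)
        return "unknown";

    const bool bData = (nFlags & COVERAGE_DATA) != 0;
    const bool bEmpty = (nFlags & COVERAGE_EMPTY) != 0;
    if (bData && bEmpty) return "mixed";
    if (bEmpty) return "empty";
    if (bData) return "data";
    return "none";
}
```

Testing the UNIMPLEMENTED bit first keeps the fallback conservative: a caller that cannot get an answer still processes the window.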


C++ :

int  GDALRasterBand::GetDataCoverageStatus( int nXOff,
                                            int nYOff,
                                            int nXSize,
                                            int nYSize,
                                            int nMaskFlagStop,
                                            double* pdfDataPct);

C :

int GDALGetDataCoverageStatus( GDALRasterBandH hBand,
                               int nXOff, int nYOff,
                               int nXSize,
                               int nYSize,
                               int nMaskFlagStop,
                               double* pdfDataPct);

GDALRasterBand::GetDataCoverageStatus() does basic checks on the validity of the window before calling IGetDataCoverageStatus().
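
The exact checks are not spelled out in this RFC. As a rough sketch, assuming "valid" means a non-empty window that lies entirely inside the raster, such a check could look like the following (`IsWindowValid` is a hypothetical helper, not a GDAL function):

```cpp
#include <cassert>

// Hypothetical helper: true when the window is non-empty and lies fully
// inside a raster of nRasterXSize x nRasterYSize pixels.
bool IsWindowValid(int nXOff, int nYOff, int nXSize, int nYSize,
                   int nRasterXSize, int nRasterYSize)
{
    assert(nRasterXSize > 0 && nRasterYSize > 0);
    if (nXOff < 0 || nYOff < 0 || nXSize <= 0 || nYSize <= 0)
        return false;
    // Written as subtractions so that nXOff + nXSize cannot overflow.
    if (nXSize > nRasterXSize - nXOff)
        return false;
    if (nYSize > nRasterYSize - nYOff)
        return false;
    return true;
}
```

What the real method returns for an invalid window is not stated here, so the sketch only reports validity.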

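Changes

GDALDatasetCopyWholeRaster() and GDALRasterBandCopyWholeRaster() accept a SKIP_HOLES option that the output driver can set to YES, so that GetDataCoverageStatus() is called on each chunk of the source dataset to determine whether it contains only holes.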
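
The chunk loop described above can be sketched as follows. This is a simplified model, not the GDAL implementation: `Chunk`, `ChunksToCopy` and the status callback are hypothetical, and the rule "skip a chunk when the DATA bit is absent" is an assumption about how the status is used.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Local stand-in reusing the DATA flag value listed in this RFC.
constexpr int COVERAGE_DATA = 0x02;

struct Chunk { int nXOff, nYOff, nXSize, nYSize; };

// Hypothetical helper: list the chunks that still have to be read and
// written. getStatus plays the role of GetDataCoverageStatus() on the source.
std::vector<Chunk> ChunksToCopy(
    int nRasterXSize, int nRasterYSize,
    int nChunkXSize, int nChunkYSize,
    bool bSkipHoles,
    const std::function<int(const Chunk&)>& getStatus)
{
    assert(nChunkXSize > 0 && nChunkYSize > 0);
    std::vector<Chunk> aoChunks;
    for (int nY = 0; nY < nRasterYSize; nY += nChunkYSize) {
        for (int nX = 0; nX < nRasterXSize; nX += nChunkXSize) {
            // Chunks on the right and bottom edges may be smaller.
            const Chunk oChunk{nX, nY,
                               std::min(nChunkXSize, nRasterXSize - nX),
                               std::min(nChunkYSize, nRasterYSize - nY)};
            // A chunk without the DATA bit contains only holes: skip it.
            if (bSkipHoles && (getStatus(oChunk) & COVERAGE_DATA) == 0)
                continue;
            aoChunks.push_back(oChunk);
        }
    }
    return aoChunks;
}
```

Because the default implementation reports UNIMPLEMENTED | DATA, a source driver that cannot answer the query never causes a chunk to be skipped under this rule.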

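Drivers

This RFC upgrades the GeoTIFF and VRT drivers to implement the IGetDataCoverageStatus() method.

The GeoTIFF driver has also received a number of prior enhancements related to this topic, for example accepting the SPARSE_OK=YES creation option in CreateCopy() mode (or the SPARSE_OK open option in update mode).

Extract of the documentation of the driver: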

GDAL makes a special interpretation of a TIFF tile or strip whose offset
and byte count are set to 0, that is to say a tile or strip that has no corresponding
allocated physical storage. On reading, such tiles or strips are considered to
be implicitly set to 0 or to the nodata value when it is defined. On writing, it
is possible to enable generating such files through the Create() interface by setting
the SPARSE_OK creation option to YES. Then, blocks that are never written
through the IWriteBlock()/IRasterIO() interfaces will have their offset and
byte count set to 0. This is particularly useful to save disk space and time when
the file must be initialized empty before being passed to a further processing
stage that will fill it.
To avoid ambiguities with another sparse mechanism discussed in the next paragraphs,
we will call such files with implicit tiles/strips "TIFF sparse files". They will
be likely *not* interoperable with TIFF readers that are not GDAL based and
would consider such files with implicit tiles/strips as defective.

Starting with GDAL 2.2, this mechanism is extended to the CreateCopy() and
Open() interfaces (for update mode) as well. If the SPARSE_OK creation option
(or the SPARSE_OK open option for Open()) is set to YES, even an attempt to
write an all-0/nodata block will be detected so that the tile/strip is not
allocated (if it was already allocated, then its content will be replaced by
the 0/nodata content).

Starting with GDAL 2.2, in the case where SPARSE_OK is *not* defined (or set
to its default value FALSE), for uncompressed files whose nodata value is not
set, or set to 0, in Create() and CreateCopy() mode, the driver will delay the
allocation of 0-blocks until file closing, so as to be able to write them at
the very end of the file, and in a way compatible with the filesystem sparse file
mechanisms (to be distinguished from the TIFF sparse file extension discussed
earlier). That is, all the empty blocks will be seen as properly allocated
from the TIFF point of view (corresponding strips/tiles will have valid offsets
and byte counts), but will have no corresponding physical storage, provided that
the filesystem supports such sparse files, which is the case for most popular
Linux filesystems (ext2/3/4, xfs, btrfs, ...) and for NTFS on Windows. If the file
system does not support sparse files, physical storage will be
allocated and filled with zeros.
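
The detection of an "all 0/nodata block" mentioned in this extract can be illustrated with a small sketch for 8-bit data. `IsBlockEmpty` is a hypothetical helper and says nothing about how the GeoTIFF driver actually performs the test; it only applies the rule that a block is empty when every pixel equals the nodata value if one is set, or 0 otherwise.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper for 8-bit blocks: true when every pixel equals the
// nodata value (if one is set) or 0 (if no nodata value is set).
bool IsBlockEmpty(const std::vector<std::uint8_t>& abyBlock,
                  bool bHasNoData, std::uint8_t nNoData)
{
    assert(!abyBlock.empty());
    const std::uint8_t nFill = bHasNoData ? nNoData : 0;
    return std::all_of(abyBlock.begin(), abyBlock.end(),
                       [nFill](std::uint8_t nVal) { return nVal == nFill; });
}
```

A writer following the SPARSE_OK behaviour described above would leave the tile/strip unallocated when such a check succeeds.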

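Bindings

The Python bindings have a mapping of GDALGetDataCoverageStatus(). Other bindings could be updated (this requires figuring out how to return both the status flags and the percentage).

Utilities

No direct changes in utilities.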

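Results

With this new capability, a VRT of size 200 000 x 200 000 pixels that contains 2 regions of 20x20 pixels each can be converted by gdal_translate into a sparse tiled GeoTIFF in 2 seconds. The resulting GeoTIFF can itself be translated into another sparse tiled GeoTIFF in the same time.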

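Future work

Future work using the new capability could be done in overview building or warping. Other drivers could also benefit from this new capability: GeoPackage, ERDAS Imagine, ...

Documentation

The new method is documented.

Test Suite

Tests of the VRT and GeoTIFF drivers are enhanced to test their IGetDataCoverageStatus() implementation.

Compatibility Issues

C++ ABI change. No functional incompatibility foreseen.

Implementation

The implementation will be done by Even Rouault.

The proposed implementation is in https://github.com/rouault/gdal2/tree/sparse_datasets

Changes can be seen with https://github.com/OSGeo/gdal/compare/trunk...rouault:sparse_datasets?expand=1

Voting history

+1 from EvenR and DanielM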