Optimizing Python GDAL ReadAsArray

I am using the GDAL ReadAsArray method to work with raster data using numpy (specifically reclassification). As my rasters are large, I process the arrays in blocks, iterating through each block and processing it with a method similar to the GeoExamples example.

I am now looking at how best to set the size of these blocks to optimize the time taken to process the whole raster. Being aware of the limitations with numpy array sizes, and of using GDAL's GetBlockSize to get the "natural" block size of a raster, I have tested a few different block sizes, made up of multiples of the "natural" size, with the example code below:

```python
import timeit
try:
    import gdal
except ImportError:
    from osgeo import gdal

# Function to read the raster as arrays for the chosen block size.
def read_raster(x_block_size, y_block_size):
    raster = "path to large raster"
    ds = gdal.Open(raster)
    band = ds.GetRasterBand(1)
    xsize = band.XSize
    ysize = band.YSize
    blocks = 0
    for y in xrange(0, ysize, y_block_size):
        if y + y_block_size < ysize:
            rows = y_block_size
        else:
            rows = ysize - y
        for x in xrange(0, xsize, x_block_size):
            if x + x_block_size < xsize:
                cols = x_block_size
            else:
                cols = xsize - x
            array = band.ReadAsArray(x, y, cols, rows)
            del array
            blocks += 1
    band = None
    ds = None
    print "{0} blocks size {1} x {2}:".format(blocks, x_block_size, y_block_size)

# Function to run the test and print the time taken to complete.
def timer(x_block_size, y_block_size):
    t = timeit.Timer("read_raster({0}, {1})".format(x_block_size, y_block_size),
                     setup="from __main__ import read_raster")
    print "\t{:.2f}s".format(t.timeit(1))

raster = "path to large raster"
ds = gdal.Open(raster)
band = ds.GetRasterBand(1)

# Get "natural" block size, and total raster XY size.
block_sizes = band.GetBlockSize()
x_block_size = block_sizes[0]
y_block_size = block_sizes[1]
xsize = band.XSize
ysize = band.YSize
band = None
ds = None

# Tests with different block sizes.
timer(x_block_size, y_block_size)
timer(x_block_size*10, y_block_size*10)
timer(x_block_size*100, y_block_size*100)
timer(x_block_size*10, y_block_size)
timer(x_block_size*100, y_block_size)
timer(x_block_size, y_block_size*10)
timer(x_block_size, y_block_size*100)
timer(xsize, y_block_size)
timer(x_block_size, ysize)
timer(xsize, 1)
timer(1, ysize)
```

Which produces the following sort of output:

```
474452 blocks size 256 x 16:    9.12s
4930 blocks size 2560 x 160:    5.32s
58 blocks size 25600 x 1600:    5.72s
49181 blocks size 2560 x 16:    4.22s
5786 blocks size 25600 x 16:    5.67s
47560 blocks size 256 x 160:    4.21s
4756 blocks size 256 x 1600:    5.62s
2893 blocks size 41740 x 16:    5.85s
164 blocks size 256 x 46280:    5.97s
46280 blocks size 41740 x 1:    5.00s
41740 blocks size 1 x 46280:    800.24s
```

I have tried running this for a few different rasters, with different sizes and pixel types, and I see similar trends: a tenfold increase in the x or y dimension (in some cases, both) roughly halves the processing time. Although that is not very significant in the example above, it can amount to several minutes for my largest rasters.

So my question is, why is this behavior occurring?

I did expect using fewer blocks to improve processing time, but the tests using the fewest blocks are not the quickest. Also, why does the final test take so much longer than the one preceding it? Do rasters have some preference for being read by row or by column, or in the shape or total size of the block being read? What I'm hoping to get from this is the information needed to put together a basic algorithm that can set the block size of a raster to an optimal value, depending on the size of the input.

Note that my input is an ESRI ArcINFO grid raster, which has a "natural" block size of 256 x 16, and the total size of my raster in this example was 41740 x 46280.
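For what it's worth, the rough heuristic I've started from is to read in windows that are whole multiples of the natural block size, clamped to the raster extent. A minimal sketch, where `pick_block_size` is a hypothetical helper and the factor of 10 is an assumption drawn from the timings above rather than a measured optimum:

```python
# Hypothetical helper: scale the raster's "natural" block size by some
# factor, clamped to the raster extent. The default scale of 10 is an
# assumption based on the timings above, not a tuned value.
def pick_block_size(natural_x, natural_y, xsize, ysize, scale=10):
    return (min(natural_x * scale, xsize),
            min(natural_y * scale, ysize))

# Example with the sizes from this question: natural blocks of 256 x 16
# in a 41740 x 46280 raster.
print(pick_block_size(256, 16, 41740, 46280))  # (2560, 160)
```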

Have you tried using an equal block size? I deal with raster data on the order of 200k x 200k pixels, and quite sparse. A lot of benchmarking has yielded 256 x 256 pixel blocks as the most efficient for our processes. This all comes down to how many disk seeks are required to retrieve a block. If the block is too large, it is harder to write it to disk contiguously, meaning more seeks. Likewise, if it is too small, you will need to do many reads to process the whole raster. It also helps to ensure the total size is a power of two. Incidentally, 256 x 256 is the default GeoTIFF block size in GDAL, so perhaps they drew the same conclusion.
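As a sketch of that suggestion, the input could be copied to a tiled GeoTIFF with an explicit 256 x 256 block size using the GTiff driver's TILED, BLOCKXSIZE and BLOCKYSIZE creation options (the file paths here are placeholders):

```python
from osgeo import gdal

src = gdal.Open("path to large raster")
# Copy to a tiled GeoTIFF whose internal blocks are 256 x 256 pixels.
driver = gdal.GetDriverByName("GTiff")
dst = driver.CreateCopy("path to tiled copy.tif", src,
                        options=["TILED=YES",
                                 "BLOCKXSIZE=256",
                                 "BLOCKYSIZE=256"])
dst = None  # close and flush to disk
src = None
```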

My suspicion is you're really bumping up against GDAL's block cache, and that's a knob that's going to have a significant impact on your processing speed curve.

See SettingConfigOptions, specifically GDAL_CACHEMAX, for more detail on this, and investigate how changing that value to something significantly larger interacts with your tests.
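For example, the cache size can be raised from the Python bindings before the first raster read; the value of 512 here is an arbitrary illustration, not a recommendation:

```python
from osgeo import gdal

# Raise GDAL's raster block cache (value in megabytes); this must be
# set before the first read for it to take effect on that dataset.
gdal.SetConfigOption("GDAL_CACHEMAX", "512")
```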
