The data set is a 2-dimensional grid. Updates to the grid from a realtime source arrive at an extremely high frequency, but processing this data takes a long time.
A timer samples the grid at fixed intervals for cells that are marked dirty and need processing.
The processing function, call it P(), has a very high bootstrap overhead. P() can take a 1-dimensional array, such as a horizontal or vertical scanline.
The question is: how do you design an efficient algorithm that "chunks" an arbitrary set of dirty bits on the 2D grid into scanlines so as to minimize the number of times P() is invoked?
You could look into a variation of the prefix sum algorithm to scan the bitmap quickly in parallel, isolating the dirty blocks and packing pointers to them into a new array, or "scanline", that can be passed to your P() function.
For instance, using parallel threads, perform a prefix sum over the 2D array, treating dirty blocks as 1 and clean blocks as 0. After the prefix sum, the dirty blocks are numbered 1...N. Then, again using parallel threads, copy (or create pointers to) the numbered dirty blocks into a new "scanline" array; each dirty block's prefix-sum value acts as a basic hash value giving the slot in the scanline array from which that block should be referenced.
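The two steps above (prefix sum over the dirty flags, then scatter by rank) can be sketched sequentially like this; `compact_dirty` is an illustrative name, and in a real implementation both the prefix sum and the scatter loop would be split across threads:

```python
from itertools import accumulate

def compact_dirty(grid):
    """Flatten the 2D dirty-bit grid, prefix-sum the flags, then
    scatter each dirty cell's coordinates into a dense "scanline"
    array, using its prefix-sum rank as the slot index.
    (Sequential sketch; each step is independently parallelizable.)"""
    rows, cols = len(grid), len(grid[0])
    flags = [grid[r][c] for r in range(rows) for c in range(cols)]
    # Inclusive prefix sum: dirty cells end up numbered 1..N.
    ranks = list(accumulate(flags))
    total = ranks[-1] if ranks else 0
    scanline = [None] * total
    for i, flag in enumerate(flags):
        if flag:
            # ranks[i] - 1 is this dirty cell's slot in the output array.
            scanline[ranks[i] - 1] = divmod(i, cols)  # (row, col)
    return scanline

grid = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
print(compact_dirty(grid))  # [(0, 1), (1, 0), (1, 1), (2, 2)]
```

Because every dirty cell's output slot is known from the prefix sum alone, the scatter step has no write conflicts, which is what makes the parallel version straightforward.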