What’s the point? 
March 9, 2013
One of the main drawbacks of the adapted-sorted-brute-force search algorithm we developed in the previous post is that it only sorts the data along a single coordinate axis. For two dimensional data that means we’re ignoring half the information. For three dimensional data we’re ignoring twice as much information as we’re using to speed up the search. The more dimensions, the less efficient that searcher becomes.
The key to optimizing a Closest Point Search is to reject as many points as possible because measuring the distance between points takes time. It might not take a lot of time to compute the distance between two points, but if you want to find the nearest point in a collection of 10 million points for 1 million other points, you end up computing 107 * 106 = 1013 distances if you take the brute-force approach. Even if a single distance test only takes 5 nanoseconds it will still take 50,000 seconds, which is almost 14 hours. The adapted-sorted-brute-force (which wasn’t all that brute-force by the time we were done with it) managed to reject points because their distance measured along the sorting direction exceeded the best result we’ve found so far.
What we’d like to accomplish today is roughly the same, except we want to be able to reject points in all directions, not just one. Instead of sorting all the points, what we’ll do instead is put them in spatial buckets, and these buckets in turn will be defined in such a fashion as to allow very easy navigation to neighbouring buckets. In other words: a grid.
In the image above you can see the entire point collection represented by the black dots. Point h is represented by the red cross and we’re looking for the black dot closest to h. This search algorithm also requires a pre-process step, but instead of sorting all the points we’ll instead assign them to whatever grid square contains them. So instead of a single large collection of points (C), we end up with a grid structure where each grid cell contains a subset of C. This is a very cheap process as it only takes a single subtraction, multiplication and rounding operation per dimension per point*. It’s also very easy to multithread this pre-process as it belongs to a class of problems known as embarrassingly parallel. The algorithm —which we shall henceforth refer to as the square-grid-search algorithm— would work as follows:
- Create a grid of square cells that encompasses all of C.
- Assign all points in C to their respective grid cells.
- Find which cell r contains h (or which is closest to h if h is not within the grid bounds).
- Find the nearest point p in r to h using brute-force search.
- If r is empty, extend to search to all eight neighbouring cells and so on until at least one point is encountered.
- Find the collection of grid cells R which intersect with the search boundary (represented by the green circle).
- Iterate over all points in R using brute-force to see if any of them are closer than the early result.
The algorithm is fairly straightforward and easy to implement, but —like adapted-sorted-brute-force— is has serious drawbacks. It will perform very well on balanced point collections, but the performance heavily depends on the size of the grid cells. Too small cells and many of them will be empty, causing us to spend a lot of time iterating over pointless** areas. Too big and they’ll contain a lot of points which will slow down the algorithm as the intra-cell search is brute-force.
If the collection is unbalanced then we’ll end up with many empty cells and some cells which contain many points. As long as the intra-cell search remains brute-force, square-grid-search will not be suitable for such cases.
There is a new drawback as well. It’s quite simple and very fast to insert points into a sorted collection without breaking the sort order. It is also very easy to insert points into a spatial bucketing grid, provided the points are within the grid boundaries. If you wish to add points outside of the existing grid boundary, the grid will have to be grown, which can be a computationally expensive operation***.
Up next we’ll examine ways to make spatial bucketing more effective for unbalanced —indeed, wildly unbalanced— point collections.
* Assuming the Grid only has rectangular cells. Square cells make the algorithm much easier to implement though, not to mention hexagonal or free-form cells.
** Bow down before the Master of Puns!
*** It is possible to use a data structure known as a sparse array to store the grid cells rather than a standard multi-dimensional array. This effectively solves the insertion problem, but it also makes navigation between neighbouring cells slightly slower.