
What’s the point? [2]

March 8, 2013

In the previous post we discussed some generic optimization strategies that might be applied to software to speed things up. Now we’ll look instead at a very well-known problem to see how being a smart programmer is still better than having a faster computer. The problem we’ll tackle is known as Closest Point Search, and it is interesting both because it is very simple to understand and because it is very simple to write a naive algorithm that solves it.

Imagine a large collection of two-dimensional points (C) strewn across a flat surface. For the time being we’ll assume that the furthest distance between any two points in the collection is finite and that the average distance from any point to its neighbours is roughly the same. Such a point collection is called balanced. By contrast, an unbalanced collection would have very pronounced clusters where points group together. The further apart these clusters are, the more unbalanced the collection is.

Balanced vs. Unbalanced

The question we want to answer is: “Which point (p) in this collection is closest to a location (h)?“. A naive algorithm that answers this question is the brute-force implementation (sketched in code after the list):

  1. For each point p in C, compute the distance to h.
  2. If this distance is less than the previous smallest distance, remember this new distance as the smallest distance and remember p as the best result so far.
  3. When we’re done iterating over all points, we’ll have remembered the point nearest to h.
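
As a rough sketch, this is what the brute-force search might look like in Python; the function name and the plain tuple/list representation of points are mine, chosen for illustration only:

```python
import math

def closest_point_brute_force(points, h):
    """Return the point in 'points' (a list of (x, y) tuples) nearest to location h."""
    best_point = None
    best_distance = float("inf")
    for p in points:
        # 1. Compute the distance from p to h.
        d = math.hypot(p[0] - h[0], p[1] - h[1])
        # 2. Remember p if it beats the smallest distance found so far.
        if d < best_distance:
            best_distance = d
            best_point = p
    # 3. After iterating over all points, best_point is the point nearest to h.
    return best_point
```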

The problem with this algorithm is that it has to iterate over all points in C, which means that when there are twice as many points it takes twice as long; ten times as many points, ten times as long. It would be nice if we could somehow find a way to omit points from the search. If we can reject, say, half the points out of hand, our algorithm will be twice as fast.

First however some bad news. If we only ask the question “which is the point closest to h?” once, then there’s absolutely nothing we can do*. We can only start to optimize the search if we’re certain that multiple closest point searches will be performed on the same —or at least a very similar— collection.

So let’s consider an optimization to our brute-force search algorithm. We’ll add a pre-process step and also an escape clause (again with a code sketch after the list):

  1. Sort the points in C by ascending x-coordinate (i.e. from left to right).
  2. For each point p in C, compute the distance to h.
  3. If this distance is less than the previous smallest distance, remember this new distance as the smallest distance and remember p as the best result so far.
  4. If the x-coordinate of p is higher than the x-coordinate of h plus the current smallest distance, then stop iterating.
  5. When we’re done iterating, we’ll have remembered the point nearest to h.
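
A sketch of the sorted variant, again in Python and again with names of my own choosing; the sort is done once as a pre-process and the resulting list can be reused for every subsequent query:

```python
import math

def sort_by_x(points):
    """Pre-process: sort the collection by ascending x-coordinate (left to right)."""
    return sorted(points, key=lambda p: p[0])

def closest_point_sorted(sorted_points, h):
    """Search a collection already sorted by x, stopping as soon as the
    rejection boundary is crossed."""
    best_point = None
    best_distance = float("inf")
    for p in sorted_points:
        # Escape clause: every remaining point has an x-coordinate at least as
        # large as p's, so once p.x exceeds h.x + best_distance no remaining
        # point can possibly be closer and we can stop iterating.
        if p[0] > h[0] + best_distance:
            break
        d = math.hypot(p[0] - h[0], p[1] - h[1])
        if d < best_distance:
            best_distance = d
            best_point = p
    return best_point
```

Note that the escape test only looks at x-coordinates, which is why it can be checked before computing the full distance to each point.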

The trick here is the sorting. Since we know we’re now iterating from left to right, we also know that every new point we test will have an x-coordinate that is equal to or higher than the point we just tested. So once we find that the distance between this x-coordinate and the x-coordinate of h is larger than the smallest distance we’ve found so far**, we know for a fact that all remaining points cannot possibly yield a better answer than the one we already have. We can thus reject those from the search.

Brute-force plus sorting

The image above provides a snapshot of the search algorithm just prior to step 4. All the black points have already been tested, and almost every new point turned out to be closer than the previous one. We’ve just tested the point that was nearest to ×, and the green circle represents the remaining area in which even closer points may be found. The right-most edge of the circle defines the location of the rejection boundary. The next point to be tested is just beyond this boundary, and thus it and any subsequent points still left in the collection cannot possibly yield a better result than the one we already have.

This approach allows us to omit certain points and will thus be faster than the brute-force search. There is however an initial penalty, since we have to sort the points first. And although this algorithm performs better, it still has many weaknesses. If the point collection has little variation along the x-direction, for example, then the benefit of sorting along the x-direction is minimal. Also, when h is far outside the bounding box of C, the shortest distance will always be quite large and the rejection boundary might never actually exclude any points, nullifying the optimization. Finally, if h is somewhere near the right edge of C, then the rejection boundary will be well outside of C for most of the time. One could of course choose to sort C along the y-direction, or to iterate backwards instead of forwards, depending on the relative location of h with respect to C; this would definitely alleviate some of the drawbacks at the expense of more complicated code. There is however another trick that can be used to significantly improve the performance of the sorted brute-force search. But for that we need to understand how binary searching works…

* Short of multithreading the brute-force search. We could split the collection into N sub-collections (one for each available core), find the nearest point in each sub-collection and then find the best result amongst the N partial results.
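
A rough illustration of that idea, assuming the closest_point_brute_force sketch from earlier and Python’s standard concurrent.futures module; the chunking scheme and worker count here are just placeholders:

```python
import math
from concurrent.futures import ProcessPoolExecutor

def closest_point_parallel(points, h, workers=4):
    """Split the collection into chunks, search each chunk in parallel,
    then pick the best among the partial results."""
    chunk_size = max(1, len(points) // workers)
    chunks = [points[i:i + chunk_size] for i in range(0, len(points), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(closest_point_brute_force, chunks, [h] * len(chunks)))
    return min(partial, key=lambda p: math.hypot(p[0] - h[0], p[1] - h[1]))
```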

** We are ignoring the contribution of the y-coordinate, which can only make the distance larger, never smaller. It is thus safe to ignore.
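Written out: the true distance √((px − hx)² + (py − hy)²) can never be smaller than |px − hx|, so once the x-difference alone exceeds the smallest distance found so far, the full distance must exceed it too.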