
What’s the point? [3]

March 9, 2013

Remember the sorted brute-force search algorithm we came up with last time? It attempted to outperform regular brute-force searching by sorting the points from left to right and then skipping as many points as possible once it could decide that the best possible answer had already been found. We also mentioned that sometimes it makes more sense to iterate from right to left, top to bottom or bottom to top, depending on the relative location of the search locus (h) with respect to the point collection (C) as a whole.

But in many cases we have to work our way from one extreme of C to the point nearest to h before we can seriously start to omit points. If only we could start somewhere in the vicinity of h, the chances of finding the best possible answer quickly would be much higher, which in turn means we can skip more points, which in turn means we’ll be done sooner. So an improved sorted brute-force search algorithm might look like this:

  1. Sort the points in C by ascending x-coordinate (i.e. from left to right).
  2. Starting from a point p in C somewhere in the vicinity of h, iterate towards the right and compute the distance to h.
  3. If this distance is less than the previous smallest distance, remember this new distance as the smallest distance and remember p as the best result so far.
  4. If the x-coordinate of p is higher than the x-coordinate of h + the current smallest distance then stop iterating.
  5. Starting from the point p in C that was in the vicinity of h, iterate towards the left and compute the distance to h.
  6. If this distance is less than the previous smallest distance, remember this new distance as the smallest distance and remember p as the best result so far.
  7. If the x-coordinate of p is lower than the x-coordinate of h – the current smallest distance then stop iterating.
  8. When we’re done iterating, we’ll have remembered the point nearest to h.
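Here is a minimal Python sketch of steps 2 through 8, assuming the points are plain (x, y) tuples already sorted by ascending x-coordinate (step 1 then amounts to `points_sorted = sorted(points)`, since tuples compare by their first element), and assuming a `start_index` somewhere near h is already known. The function name and the tuple representation are just placeholders for illustration; how to find that starting index is what the rest of this post is about.

```python
import math

def adapted_sorted_search(points_sorted, h, start_index):
    """Closest-point search over points_sorted (ascending x), starting
    from start_index, a point assumed to lie in the vicinity of h."""
    hx, hy = h
    best_point = None
    best_dist = math.inf

    # Steps 2-4: iterate towards the right from the starting point.
    for px, py in points_sorted[start_index:]:
        # Step 4: once the x-gap alone exceeds the best distance, stop.
        if px > hx + best_dist:
            break
        d = math.hypot(px - hx, py - hy)
        if d < best_dist:                       # step 3
            best_dist, best_point = d, (px, py)

    # Steps 5-7: iterate towards the left from the starting point.
    for px, py in reversed(points_sorted[:start_index]):
        # Step 7: stop once the x-gap alone exceeds the best distance.
        if px < hx - best_dist:
            break
        d = math.hypot(px - hx, py - hy)
        if d < best_dist:                       # step 6
            best_dist, best_point = d, (px, py)

    return best_point, best_dist                # step 8
```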

The only difference between this adapted search algorithm and the original is that instead of iterating from one end of C towards the other, we start somewhere in the middle, at a point that is already near h. But how do we quickly find a point near h? Wasn’t that what we were trying to accomplish in the first place?

Well, not quite. We’re looking for the point closest to h. But for this adapted algorithm to work, all we need is a point somewhere in the vicinity of h. As long as it’s not way off, we should be able to get some benefit out of our smarter iteration logic. So the trick now is to find this magical point in the vicinity of h (let’s call it pbase) and of course we need to do so quickly because we’re trying to save time, not spend it.

There is a more rigorous definition of pbase than “in the vicinity of” and it may give a clue as to how we’re supposed to find it quickly. pbase can be defined as that point p in C whose x-coordinate is most similar to the x-coordinate of h. In other words, for this initial guess we’re happy looking only at the distribution of C in the x-direction, ignoring the fact that C is a two-dimensional point collection and thus also has a distribution in the y-direction. There is a fantastically clever and fast and easy way to find a value in a sorted collection that is close to some other value. It is called Binary Searching.
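Written as code, that definition is a one-liner. This is a deliberately naive sketch (a linear scan over every point, which is exactly the work binary searching will let us avoid), using the same hypothetical (x, y) tuple representation as before:

```python
C = [(0.4, 2.1), (1.3, 0.7), (2.8, 1.9), (4.0, 0.2)]   # example point collection
h = (2.5, 1.0)                                          # search locus

# pbase by definition: the point in C whose x-coordinate is most
# similar to the x-coordinate of h. This scan is O(N); it only states
# the definition, it doesn't find pbase quickly.
p_base = min(C, key=lambda p: abs(p[0] - h[0]))
print(p_base)   # (2.8, 1.9)
```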

A binary search progresses by taking a series of ever-decreasing steps towards the right answer. It only requires that the data it is stepping over is sorted. A major benefit of binary searching is that its runtime complexity is O(log N), which is a very expensive way of saying that as the collection gets larger, it doesn’t take correspondingly more time. Imagine for example that we use binary search to find an item in a collection containing 100 items, and that it takes —on average— N milliseconds to complete. If we now apply the same algorithm to a collection ten times the size, i.e. 1000 values, it will only take about 1.5×N milliseconds rather than 10×N. If we make the collection bigger still, say 10000 values, it will take only about 2×N milliseconds rather than 100×N.

Put even simpler, binary search —and other algorithms with a logarithmic runtime complexity— take proportionally less time on big problems than on small ones.
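To put rough numbers on that: the number of halving steps a binary search needs grows with log2 of the collection size. A quick back-of-the-envelope check (a throwaway snippet, any calculator would do just as well):

```python
import math

# Number of halving steps a binary search needs for various collection sizes.
for n in (100, 1000, 10000):
    steps = math.log2(n)
    ratio = steps / math.log2(100)
    print(f"{n:>6} items: ~{steps:4.1f} steps ({ratio:.1f}x the steps needed for 100 items)")

#    100 items: ~ 6.6 steps (1.0x the steps needed for 100 items)
#   1000 items: ~10.0 steps (1.5x the steps needed for 100 items)
#  10000 items: ~13.3 steps (2.0x the steps needed for 100 items)
```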

Let’s have a look at how a binary searcher might find pbase in our point collection. The image below shows the points in C as grey circles, sorted from left to right. Note that the y-coordinate of each point —while drawn— is irrelevant. Below the points are drawn the four steps that a binary searcher performs in order to arrive at the conclusion that point #3 is closest to h (×) if you only take the x-distribution into account. By a happy coincidence, point #3 is also the closest point in general, but that’s not necessarily the case.

[Image: BinarySearch]

A binary searcher always operates on the same basic principle, which is then recursively applied over and over until a solution is found. We know that h is somewhere in between the extremes of the entire collection. We can ascertain this by making sure that h is to the right of the first point (0) and to the left of the last point (19). So we take the point halfway between 0 and 19 and test whether it is to the left or to the right of h. That point happens to be #10 and it happens to be to the right of h. So now we know that the real answer must be somewhere in the range 0~10.

Once again we test the point halfway between 0 and 10, which is point #5. Same deal as before, 5 is to the right of h and we’ve now shrunk our search range to 0~5. If we look halfway between 0 and 5, we find point #2 which happens to be to the left of h, so instead of adjusting the upper limit of the search range we now adjust the lower and end up with a range of 2~5. Repeat until you can move neither left nor right without making things worse.
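A minimal sketch of that loop, under the same assumptions as the earlier snippets (plain (x, y) tuples sorted by ascending x, and hypothetical function names of my own choosing); it returns the index of pbase rather than the point itself, because the adapted search above wants a starting index:

```python
def find_base_index(points_sorted, h):
    """Binary search for the index of the point whose x-coordinate is
    closest to the x-coordinate of h. points_sorted is a non-empty list
    of (x, y) tuples sorted by ascending x."""
    hx = h[0]
    lo, hi = 0, len(points_sorted) - 1

    # Keep halving the range [lo, hi] by testing its midpoint against h.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if points_sorted[mid][0] > hx:
            hi = mid   # the midpoint lies to the right of h
        else:
            lo = mid   # the midpoint lies to the left of (or on) h

    # Two candidates remain; return whichever x-coordinate is nearer to h.
    if abs(points_sorted[hi][0] - hx) < abs(points_sorted[lo][0] - hx):
        return hi
    return lo
```

The two sketches then combine into a complete search: `adapted_sorted_search(points_sorted, h, find_base_index(points_sorted, h))`.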

As you can see, the search range is effectively cut in half with every step. So when you double the number of items, it only takes one extra step to complete the search. Quadruple the number of items and it only takes two additional steps. This is why binary search is a logarithmic algorithm.

We’ve now developed a closest-point search algorithm that actually runs pretty fast, especially on balanced point collections. We’ve gone from brute-force, to sorted brute-force, to adapted sorted brute-force, which is hardly brute-force at all any more.

However, there are still deep problems with the core algorithm, and if we want to address them we’ll have to drastically rethink our approach. The main issue is that the sorting only happens along a single axis. With well-balanced, two-dimensional points that isn’t too big a deal, but if you start dealing with three or more dimensions it becomes increasingly less efficient.

Up next, spatial buckets and how they can be used to generate a notion of locality.
